Markdown and GFM escaper for converting plaintext into escaped Markdown
> ...the only escaper passing backtranslation tests.
GfmEscape is an enterprise-grade library for transforming untagged plain text
to CommonMark and
GitHub Flavored Markdown (GFM).
There are neat and configurable markup converters like Turndown,
which even allow transforming any markup that can be converted to HTML first.
While conversion of inline and block constructs is well covered, little attention is
paid to transforming text content itself. And this is tricky especially with
non-delimited "extended" autolinks, which make escaping heavily context-dependent.
In short:
* No escaping breaks your output.
* Naive or aggressive escaping breaks your output.
* Overescaping would also break John Gruber's _overriding design goal for
Markdown’s formatting syntax_,
i.e. _to make it as readable as possible_ and _publishable as-is, as plain text,
without looking like it’s been marked up with tags or formatting instructions_.
GfmEscape addresses these issues without significant performance penalty, as it
is based on UnionReplacer.
See below for more details.
In browsers:
```html
```
Using npm:
```bash
npm install gfm-escape
```
In Node.js:
```js
const GfmEscape = require('gfm-escape');
```
```js
escaper = new GfmEscape(escapingOptions[, syntax])
newStr = escaper.escape(input[, gfmContext[, metadata]])
```
A created GfmEscape instance is intended to be reused and shared in your code.
`escapingOptions`: option object defining how to perform escaping. Its keys
correspond to individual replaces. When a replace option is set to any truthy
value, suboption defaults are applied and can be overridden by passed suboptions.
A single option object can be reused for instantiating escapers for
different syntaxes; some options would just become irrelevant.
The current full options are:
```js
{
  strikethrough: { // default false
    optimizeForDoubleTilde: false,
  },
  extAutolink: { // default false
    breakUrl: false,
    breakWww: false,
    breaker: '<!-- -->',
    allowedTransformations: [ 'entities', 'commonmark' ],
    allowAddHttpScheme: false,
    inImage: false,
  },
  table: true, // default false
  emphasisNonDelimiters: { // default true
    maxIntrawordUnderscoreRun: undefined,
  },
  linkTitle: { // default true
    delimiters: [ '"', '\'', '()' ],
    alwaysEscapeDelimiters: [],
  },
}
```
See below for more details.

`syntax`: suggests the syntax the escaper is built for.
The predefined syntaxes are available as members of `GfmEscape.Syntax`:
- `text`: normal text, the default.
- `linkDestination`: text rendered `[sometext](here)`.
- `cmAutolink`: text rendered `<here>`. Please note that a valid CommonMark autolink must
contain a URI scheme, which cannot be addressed by the escaper. When deciding if
a CommonMark autolink is an appropriate construct to use, we suggest using
the `isEncodable(input)` and `wouldBeUnaltered(input)` methods on the `Syntax.cmAutolink`
object.
- `codeSpan`: text rendered `` `here` ``.
- `linkTitle`: text rendered `[text](destination "here")`, `[text](destination 'here')`
or `[text](destination (here))`.
`input`: the string to escape. Please note that correct escaping is currently
only guaranteed when the input is trimmed and normalized in terms of whitespace.
The library does not perform whitespace normalizing on its own, as it is often
ensured by the source's origin, e.g. `textContent` of a normalized HTML DOM.
Manual normalizing can be done with `input.trim().replace(/[ \t\n\r]+/g, ' ')`.
If it is intended to keep the source somewhat organized in lines, the minimum
treatment to make escaping safe would be `input.replace(/^[ \t\r\n]+|[ \t]+$/gm, '')`.
In such a case, the caller is responsible for placing the output correctly in
the generated document, i.e. for indenting all the lines when the context requires
indenting.
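
The two normalizing expressions above can be wrapped in small helpers. This is a minimal sketch; the names `normalizeInline` and `normalizeLines` are illustrative and are not part of GfmEscape.

```js
// Illustrative helpers, not part of the GfmEscape API.

// Collapse every whitespace run into a single space and trim both ends.
function normalizeInline(input) {
  return input.trim().replace(/[ \t\n\r]+/g, ' ');
}

// Keep the line structure, but strip whitespace at the start and end of lines.
function normalizeLines(input) {
  return input.replace(/^[ \t\r\n]+|[ \t]+$/gm, '');
}

console.log(JSON.stringify(normalizeInline('  foo \n\t bar  '))); // "foo bar"
console.log(JSON.stringify(normalizeLines('  foo  \n   bar \n'))); // "foo\nbar\n"
```

The first helper suits inline content; the second keeps line breaks, so the caller still has to indent the output where the context requires it.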
`gfmContext`: extra contextual information to be considered. The contexts have
no defaults, i.e. they are falsy by default. The following contexts are available:
```js
{
  inLink: true, // indicates suppressing nested links
  inImage: true, // similar to inLink, for ![this image text](url)
  inTable: true, // indicates extra escaping of table contents
}
```
`metadata`: extra input-output parameter that collects
metadata about the actual escaping. Currently, metadata is used for
the `codeSpan` syntax and the `linkTitle` syntax.
```js
const escaper = new GfmEscape({ table: true }, GfmEscape.Syntax.codeSpan);
const x = {}; // collects extraBacktickString and extraSpace
const context = { inTable: true };
const escaped = escaper.escape('`array|string`', context, x);
console.log(`\`${x.extraBacktickString}${x.extraSpace}${escaped}${x.extraSpace}${x.extraBacktickString}\``);
// `` `array\|string` ``
console.log(`${x.extraBacktickString.length} backticks and ${x.extraSpace.length} spaces added.`);
// 1 backticks and 1 spaces added.

const linkTitleEscaper = new GfmEscape({}, GfmEscape.Syntax.linkTitle);
const y = {}; // needed as we let GfmEscape decide the surrounding delimiter
let title = linkTitleEscaper.escape('cool "link \'title\'"', context, y);
console.log(`${y.startDelimiter}${title}${y.endDelimiter}`);
// (cool "link 'title'")
title = linkTitleEscaper.escape('average link title', context, y);
console.log(`${y.startDelimiter}${title}${y.endDelimiter}`);
// "average link title"

const rigidLinkTitleEscaper = new GfmEscape({
  linkTitle: {
    delimiters: '"',
  },
}, GfmEscape.Syntax.linkTitle);
// metadata not necessary, as the surrounding delimiter will always be '"'
title = rigidLinkTitleEscaper.escape('cool "link \'title\'"');
console.log(`"${title}"`);
// "cool \"link 'title'\""
```
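
For the `codeSpan` syntax, the caller assembles the final code span from the escaped text and the collected metadata. The helper below is a minimal sketch of that assembly step. `wrapCodeSpan` is a hypothetical name, and the metadata object is written by hand to mirror the values reported in the example above, not produced by the library.

```js
// Hypothetical helper, not part of GfmEscape: builds a code span from the
// escaped text and the extraBacktickString / extraSpace metadata fields.
function wrapCodeSpan(escaped, metadata) {
  const ticks = '`' + (metadata.extraBacktickString || '');
  const space = metadata.extraSpace || '';
  return ticks + space + escaped + space + ticks;
}

// Hand-written stand-ins for the values the example above reports.
const sampleMetadata = { extraBacktickString: '`', extraSpace: ' ' };
const sampleEscaped = '`array\\|string`';

console.log(wrapCodeSpan(sampleEscaped, sampleMetadata)); // `` `array\|string` ``
```

Keeping the assembly in one place avoids forgetting the extra space or the extra backticks when the escaped text itself contains backticks.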
#### Escaping options: strikethrough
Defaults to false, i.e. '~' is not special and it is not escaped.
Suboptions:
- `optimizeForDoubleTilde`: only eventual sequences of two tildes are escaped.
Default `false`.
#### Escaping options: extAutolink
Defaults to `false`, i.e. autolinks are not detected and do not form a special
case for escaping.
Suboptions:
- `breakUrl`: if a string capable of forming an extended url autolink is encountered,
it is broken to prevent that. E.g. `https://orchi.tech` is output with the
`breaker` sequence inserted into it. Default `false`.
- `breakWww`: if a string capable of forming an extended www autolink is encountered,
it is broken to prevent that. E.g. `www.orchi.tech` is output with the
`breaker` sequence inserted into it. Default `false`.
- `breaker`: a sequence used to break extended autolinks, used both for breaking
and terminating. Default `<!-- -->`. Please note that some Markdown renderers
like Redcarpet do not support HTML comments; tag sequences
or artificial tags can be used instead.
- `allowedTransformations`: array of transformations that are allowed if an
extended autolink-like string needs to be transformed to retain the expected
target and text. The order indicates priority. Defaults to
`['entities', 'commonmark']`. Available transformations are:
  - `'keep'`: always the most preferred, no reason to set it explicitly.
  - `'entities'`: entity name references are used to escape trailing
  characters. E.g. `*http://orchi.tech,*` becomes `\*http://orchi.tech,&ast;`.
  - `'commonmark'`: a CommonMark autolink is used to delimit the actual link
  part. E.g. `*http://orchi.tech,*` becomes `\*<http://orchi.tech>,\*`.
  - `'breakup'`: the autolink-like string is broken, so that it is not interpreted
  as an autolink. E.g. `*https://orchi.tech,*` is output with the `breaker`
  inserted inside the URL.
  - `'breakafter'`: the autolink-like string is terminated after the actual link part.
  E.g. `*https://orchi.tech,*` becomes `\*https://orchi.tech<!-- -->,\*`.
  This transformation is the default fallback, no reason to set it explicitly.
- `allowAddHttpScheme`: add the `http://` scheme when a transformation needs it to
work. E.g. `*www.orchi.tech,*` would become `\*<http://www.orchi.tech>,\*`
with the `commonmark` transformation.
- `inImage`: suggests if extended autolink treatment should be applied within
image text. Although the CommonMark spec says links are interpreted and just
the stripped plain text part renders to the `alt` attribute, cmark-gfm actually
does not do it for extended autolinks, so the default is `false`.
_How to choose the options_:
1. Consider rendering details of the target Markdown flavor. Backtranslation
test should pass on text. And if a link is produced, it should match the
input.
2. Consider user expectations. The users probably don't expect HTML comments
all over their documents. They probably don't expect HTML entity references
too, but see also the next point.
3. Consider declared semantics. Transforming to CommonMark autolinks looks quite
good, but CommonMark autolinks form explicit link demarcation when the input
was not explicitly link-demarcated. `'breakafter'` might be a better option in
some situations.
4. Last, but not least, consider the origin of your input. If you transform
HTML rendered from another markup language that supports autolinking too,
you may expect that an autolink-suppression mechanism was used if an
autolink-like string is encountered in plain text. Then it might be better
to break it too.\
And if the original renderer supports url autolinks, but not www autolinks, it
might be better to set only `breakUrl`, as users may still expect www links
to be autolinked in the plain text.
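
The last consideration can be folded into a small function that derives the breaking options from what is known about the origin of the input. This is only a sketch: `chooseExtAutolinkOptions` and its two flags are hypothetical, while `breakUrl` and `breakWww` are the suboptions documented above.

```js
// Hypothetical helper, not part of GfmEscape.
// Assumption: if the origin renderer autolinks a form itself, any such string
// still present in plain text was deliberately suppressed, so we break it too.
function chooseExtAutolinkOptions({ originAutolinksUrl = false, originAutolinksWww = false } = {}) {
  return {
    breakUrl: originAutolinksUrl,
    breakWww: originAutolinksWww,
  };
}

// Origin supports url autolinks, but not www autolinks: break only urls.
const escapingOptions = {
  extAutolink: chooseExtAutolinkOptions({ originAutolinksUrl: true }),
};
console.log(escapingOptions);
// { extAutolink: { breakUrl: true, breakWww: false } }
```

The remaining suboptions are left at their defaults here; whether that suits a given renderer still depends on the first three considerations.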
#### Escaping options: emphasisNonDelimiters
Defaults to `true`, i.e. intraword emphasis delimiters are not escaped if it is safe
not to escape them. E.g. in `My account is joe_average.`, the underscore stays
unescaped as `joe_average`, not ~~`joe\_average`~~.
Suboptions:
- `maxIntrawordUnderscoreRun`: if defined as a number, it sets the maximum length of
intraword underscore runs to be kept as is. E.g. for `1` and input
`joe_average or joe__average`, the output would be `joe_average or joe\_\_average`.
This is helpful for some renderers like Redcarpet. Both `undefined` and `false`
mean no limit on unescaped intraword underscore run length.
Defaults to `undefined`.
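
The effect of `maxIntrawordUnderscoreRun` can be imitated with a single regular expression. The sketch below is illustrative only and is not how GfmEscape implements the option: `escapeLongUnderscoreRuns` is a hypothetical name, and treating only ASCII letters and digits as word characters is a simplifying assumption.

```js
// Illustrative only: GfmEscape's real rules are more involved.
// Assumption: "intraword" means the run sits between ASCII letters or digits.
function escapeLongUnderscoreRuns(text, maxRun) {
  return text.replace(/(?<=[A-Za-z0-9])_+(?=[A-Za-z0-9])/g, (run) => {
    const limited = typeof maxRun === 'number';
    // Runs longer than the limit get every underscore backslash-escaped.
    return limited && run.length > maxRun ? run.replace(/_/g, '\\_') : run;
  });
}

console.log(escapeLongUnderscoreRuns('joe_average or joe__average', 1));
// joe_average or joe\_\_average
console.log(escapeLongUnderscoreRuns('joe_average or joe__average', undefined));
// joe_average or joe__average
```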
#### Escaping options: table
Defaults to false, i.e. table pipes are not escaped. If enabled, rendering of table
delimiter rows is suppressed by escaping its pipes and all pipes are escaped when in
table context.
#### Escaping options: linkTitle
Suboptions:
- `delimiters`: array of allowed delimiters to be chosen from, or a single delimiter.
Delimiters are `"`, `'` and `()`. When more delimiters are allowed, GfmEscape picks
the least interfering one. The picked delimiter is returned in metadata, as shown
in the example above.
- `alwaysEscapeDelimiters`: array of delimiters that are always escaped.
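
One way to read "least interfering" is "the delimiter whose characters occur least often in the title". The sketch below implements that reading as an assumption; it is not GfmEscape's actual selection logic, and `pickTitleDelimiter` is a hypothetical name. The returned field names follow the metadata fields used in the example above.

```js
// Illustrative heuristic, not GfmEscape's actual selection logic:
// pick the allowed delimiter whose characters occur least often in the title.
function pickTitleDelimiter(title, allowed = ['"', '\'', '()']) {
  let best = null;
  let bestCount = Infinity;
  for (const delimiter of allowed) {
    // Count every character of the delimiter, e.g. both ( and ) for '()'.
    const count = [...title].filter((ch) => delimiter.includes(ch)).length;
    if (count < bestCount) { // strict comparison keeps the earlier delimiter on ties
      best = delimiter;
      bestCount = count;
    }
  }
  return { startDelimiter: best[0], endDelimiter: best[best.length - 1] };
}

console.log(pickTitleDelimiter('cool "link \'title\'"'));
// { startDelimiter: '(', endDelimiter: ')' }
console.log(pickTitleDelimiter('average link title'));
// { startDelimiter: '"', endDelimiter: '"' }
```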
Terminology:
- cmAutolink - CommonMark autolink
- cmUriAutolink - CommonMark URI autolink
- cmEmailAutolink - CommonMark email autolink
- extAutolink - GFM extended autolink
- extWebAutolink - GFM extended url or www autolink
- extUrlAutolink - GFM extended url autolink
- extWwwAutolink - GFM extended www autolink
- extEmailAutolink - GFM extended email autolink
Specs:
- CommonMark
- GFM
Reference implementations examined:
- cmark-gfm
While cmark-gfm is somewhat a reference
implementation of GFM Spec, we have found a few interesting details...
- `cmark_gfm-001`: Contrary to the GFM spec stating _All such recognized autolinks
can only come at the beginning of a line, after whitespace, or any of the
delimiting characters \*, \_, \~, and (_, it seems this applies just to extended
www autolinks in cmark-gfm. E.g. `.https://orchi.tech` is recognized as an
autolink by this library. We follow this.
- `cmark_gfm-002`: Contrary to the GFM spec, extended autolinks in cmark-gfm do
not treat `[\v\f]` as space, while CM autolinks do. We follow this.
- `cmark_gfm-003`: cmark-gfm considers `<` as valid for autolink detection and
trims the resulting link afterwards. So `https://or_chi.tech.<` leads to
autolinking of `https://or_chi.tech`, although this wouldn't form an autolink
without the trailing `<`. We follow this, but non-explicit extended autolink
transformations would break the autolink detection, which is probably good.
E.g. with the default settings, `https://or_chi.tech.<` leads to
`https://or_chi.tech.&lt;` (wouldn't be detected as extended autolink by
cmark-gfm), while `https://or_chi.tech.<~` leads to
a forced CM autolink.
- `cmark_gfm-004`: GFM spec says _If an autolink ends in a semicolon (;), we
check to see if it appears to resemble an entity reference; if the preceding
text is & followed by one or more alphanumeric characters. If so, it is
excluded from the autolink..._ Alphabetic references cmark-gfm
- `cmark_gfm-005`: Backslash escape in a link destination
does not prevent an entity reference
from being interpreted in rendered HTML. We use entity encoding instead.
The same applies to link titles.
- Minification for browsers.
- Complete and polish implementation remarks.