Fast HTML to markdown cross-compiler, compatible with both node and the browser
npm install node-html-markdown


NHM is a _fast_ HTML to markdown converter, compatible with both node and the browser.
It was built with the following two goals in mind:

1. We needed to convert gigabytes of HTML daily, very quickly, and every library we found was too slow under node.
   We considered using a low-level language, but decided to try squeezing every bit of performance we could out of
   the JIT. The end result was fast enough to make the cut!
2. The other libraries we tested produced output that broke under numerous conditions and contained many repeated
   linefeeds. Generally speaking, outside of a markdown viewer, the result was not easy to read.
   We took the approach of producing a _clean, concise_ result with consistent spacing rules.
```sh
-----------------------------------------------------------------------------

Estimated processing times (fastest to slowest):

  [node-html-markdown (reused instance)]
    100 kB: 17ms
    1 MB: 176ms
    50 MB: 8.80sec
    1 GB: 3min, 0sec
    50 GB: 2hr, 30min, 14sec

  [turndown (reused instance)]
    100 kB: 27ms
    1 MB: 280ms
    50 MB: 13.98sec
    1 GB: 4min, 46sec
    50 GB: 3hr, 58min, 35sec

-----------------------------------------------------------------------------

Speed comparison - node-html-markdown (reused instance) is:

  1.02 times as fast as node-html-markdown
  1.57 times as fast as turndown
  1.59 times as fast as turndown (reused instance)

-----------------------------------------------------------------------------
```
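The figures above can be cross-checked with simple arithmetic. The sketch below is only an illustration: `speedRatio` and `extrapolateMs` are hypothetical helpers, and it assumes the "estimated" times scale linearly with input size, which the numbers suggest but the benchmark output does not state.

```ts
// Hypothetical helpers for reading the benchmark output; not part of the library.

// How many times faster the quicker run is than the slower one.
function speedRatio(slowerMs: number, fasterMs: number): number {
  return slowerMs / fasterMs;
}

// Assumption: processing time grows linearly with input size.
function extrapolateMs(measuredMs: number, measuredSize: number, targetSize: number): number {
  return measuredMs * (targetSize / measuredSize);
}

// 100 kB row: turndown (reused) 27ms vs node-html-markdown (reused) 17ms
console.log(speedRatio(27, 17).toFixed(2)); // 1.59

// 1 MB took 176ms, so 50 MB is estimated at 8800ms (the 8.80sec shown above)
console.log(extrapolateMs(176, 1, 50)); // 8800
```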
## Usage
```ts
import { NodeHtmlMarkdown, NodeHtmlMarkdownOptions } from 'node-html-markdown'

/*
 * Single use
 * If using it once, you can use the static method
 */

// Single file
NodeHtmlMarkdown.translate(
  /* html */ `<b>hello</b>`,
  /* options (optional) */ {},
  /* customTranslators (optional) */ undefined,
  /* customCodeBlockTranslators (optional) */ undefined
);

// Multiple files
NodeHtmlMarkdown.translate(
  /* FileCollection */ {
    'file1.html': `<b>hello</b>`,
    'file2.html': `<b>goodbye</b>`
  },
  /* options (optional) */ {},
  /* customTranslators (optional) */ undefined,
  /* customCodeBlockTranslators (optional) */ undefined
);

/*
 * Re-use
 * If using it several times, creating an instance saves time
 */
const nhm = new NodeHtmlMarkdown(
  /* options (optional) */ {},
  /* customTranslators (optional) */ undefined,
  /* customCodeBlockTranslators (optional) */ undefined
);

// Single file
nhm.translate(/* html */ `<b>hello</b>`);

// Multiple files
nhm.translate(
  /* FileCollection */ {
    'file1.html': `<b>hello</b>`,
    'file2.html': `<b>goodbye</b>`
  },
);
```
## Options
```ts
export interface NodeHtmlMarkdownOptions {
  /**
   * Use native window DOMParser when available
   * @default false
   */
  preferNativeParser: boolean,

  /**
   * Code block fence
   * @default ```
   */
  codeFence: string,

  /**
   * Bullet marker
   * @default *
   */
  bulletMarker: string,

  /**
   * Style for code block
   * @default fenced
   */
  codeBlockStyle: 'indented' | 'fenced',

  /**
   * Emphasis delimiter
   * @default _
   */
  emDelimiter: string,

  /**
   * Strong delimiter
   * @default **
   */
  strongDelimiter: string,

  /**
   * Strike delimiter
   * @default ~~
   */
  strikeDelimiter: string,

  /**
   * Supplied elements will be ignored (ignores inner text; does not parse children)
   */
  ignore?: string[],

  /**
   * Supplied elements will be treated as blocks (surrounded with blank lines)
   */
  blockElements?: string[],

  /**
   * Max consecutive new lines allowed
   * @default 3
   */
  maxConsecutiveNewlines: number,

  /**
   * Line Start Escape pattern
   * (Note: Setting this will override the default escape settings, you might want to use textReplace option instead)
   */
  lineStartEscape: [ pattern: RegExp, replacement: string ],

  /**
   * Global escape pattern
   * (Note: Setting this will override the default escape settings, you might want to use textReplace option instead)
   */
  globalEscape: [ pattern: RegExp, replacement: string ],

  /**
   * User-defined text replacement pattern (Replaces matching text retrieved from nodes)
   */
  textReplace?: [ pattern: RegExp, replacement: string ][],

  /**
   * Keep images with data: URI (Note: These can be up to 1MB each)
   * @default false
   */
  keepDataImages?: boolean,

  /**
   * Place URLs at the bottom and format links using link reference definitions
   *
   * @example
   * Click <a href="/url">here</a>. Or <a href="/url2">here</a>. Or <a href="/url">this link</a>.
   *
   * Becomes:
   * Click [here][1]. Or [here][2]. Or [this link][1].
   *
   * [1]: /url
   * [2]: /url2
   */
  useLinkReferenceDefinitions?: boolean,

  /**
   * Wrap URL text in < > instead of []() syntax.
   *
   * @example
   * The input <a href="https://google.com">https://google.com</a>
   * becomes <https://google.com>
   * instead of [https://google.com](https://google.com)
   *
   * @default true
   */
  useInlineLinks?: boolean
}
```
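To make the output shape of `useLinkReferenceDefinitions` concrete, here is a minimal sketch that rewrites inline Markdown links into numbered references and reuses a number when a URL repeats. `toReferenceLinks` is a hypothetical helper written for this illustration; it works on already-converted Markdown text and is not how the library itself (which walks the parsed HTML) implements the option.

```ts
// Hypothetical helper for illustration only; not part of node-html-markdown.
function toReferenceLinks(markdown: string): string {
  const ids = new Map<string, number>(); // url -> reference number

  // Replace each inline link [text](url) with [text][n]
  const body = markdown.replace(
    /\[([^\]]+)\]\(([^)\s]+)\)/g,
    (_match: string, text: string, url: string) => {
      let id = ids.get(url);
      if (id === undefined) {
        id = ids.size + 1; // first time this URL is seen: assign the next number
        ids.set(url, id);
      }
      return `[${text}][${id}]`;
    }
  );

  if (ids.size === 0) return body; // no links, nothing to append

  // Emit the definitions in the order the URLs were first seen
  const definitions: string[] = [];
  ids.forEach((id, url) => definitions.push(`[${id}]: ${url}`));

  return `${body}\n\n${definitions.join('\n')}`;
}

console.log(toReferenceLinks('Click [here](/url). Or [here](/url2). Or [this link](/url).'));
// Click [here][1]. Or [here][2]. Or [this link][1].
//
// [1]: /url
// [2]: /url2
```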
## Newline Handling

### Blank lines between block elements

In standard Markdown, paragraphs are separated by blank lines. This library follows that convention, so block-level HTML elements such as `<p>` are surrounded by blank lines in the output.

Example:
```ts
const html = `<p>Hello</p><p>World</p><p>!</p>`;
const markdown = NodeHtmlMarkdown.translate(html);
console.log(markdown);
// Output:
// Hello
//
// World
//
// !
```
This is the expected behavior and produces valid, readable Markdown. If you need tighter spacing, consider using line breaks instead of paragraphs.

### Paragraphs vs. line breaks

- Paragraphs (`<p>`) create blank lines between content (standard Markdown behavior)
- Line breaks (`<br>`) create single line breaks with two trailing spaces (Markdown line break syntax)

Example:
```ts
// Using line breaks
const html = `Line 1<br>Line 2<br>Line 3`;
const markdown = NodeHtmlMarkdown.translate(html);
console.log(markdown);
// Output:
// Line 1
// Line 2
// Line 3
```
### Limiting consecutive newlines

The `maxConsecutiveNewlines` option (default: 3) limits how many consecutive newlines appear in the output. This helps keep the Markdown clean and prevents excessive whitespace.

Example with multiple `<br>` tags:
```ts
// Default behavior - limits to 3 consecutive newlines
const html = `a${'<br>'.repeat(10)}b`;
const markdown = NodeHtmlMarkdown.translate(html);
// Result has maximum 3 consecutive line breaks

// Allow more consecutive newlines
const markdown2 = NodeHtmlMarkdown.translate(html, {
  maxConsecutiveNewlines: 10
});
// Result preserves all 10 line breaks
```
Example with inline elements:
```ts
const html = `text${'<br>'.repeat(10)}<em>something</em>`;

// Default (max 3 newlines)
NodeHtmlMarkdown.translate(html);
// Output: text \n \n \n_something_

// Custom (max 10 newlines)
NodeHtmlMarkdown.translate(html, { maxConsecutiveNewlines: 10 });
// Output: text \n \n \n \n \n \n \n \n \n \n_something_
```
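As a rough mental model of what the cap does, the sketch below trims any run of newlines longer than a given maximum. `limitNewlines` is a hypothetical helper, not the library's implementation, and it is deliberately simplified: it only looks at bare `\n` characters and ignores the trailing spaces that accompany Markdown line breaks in the real output.

```ts
// Hypothetical helper for illustration only; not the library's implementation.
function limitNewlines(text: string, max: number): string {
  if (!Number.isInteger(max) || max < 1) {
    throw new RangeError('max must be a positive integer');
  }
  // Match any run of more than `max` newlines and cut it down to exactly `max`
  const tooMany = new RegExp(`\\n{${max + 1},}`, 'g');
  return text.replace(tooMany, '\n'.repeat(max));
}

console.log(JSON.stringify(limitNewlines('a\n\n\n\n\nb', 3))); // "a\n\n\nb"
console.log(JSON.stringify(limitNewlines('a\n\nb', 3)));       // "a\n\nb" (already within the limit)
```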
When to adjust this setting:

- Decrease (e.g., `maxConsecutiveNewlines: 1`) for more compact output
- Increase (e.g., `maxConsecutiveNewlines: 10`) when you need to preserve spacing from the source HTML
- Keep default (`3`) for balanced, readable Markdown output

## Custom Translators
Custom translators are an advanced option that lets you handle certain elements in a specific way.

These can be modified via the `NodeHtmlMarkdown#translators` property, or added during creation.

__For detail on how to use them see__:

- translator.ts - Documentation for `TranslatorConfig`
- config.ts - Translators in `defaultTranslators`

The `NodeHtmlMarkdown#codeBlockTranslators` property is a collection of translators which handles elements within a `<pre><code>` block.

## Further improvements
Being a performance-centric library, we're always interested in further improvements.
There are several probable routes by which we could gain substantial performance increases over the current model.
Such methods include:
- Writing a custom parser
- Integrating an async worker-thread based model for multi-threading
- Fully replacing any remaining regex
These would be fun to implement; however, for the time being, the present library is fast enough for my purposes. That
said, I welcome discussion and any PR toward the effort of further improving performance, and I may ultimately do more
work in that capacity in the future!
## Help Wanted!
Looking to contribute? Check out our [help wanted] list for a good place to start!
[help wanted]: https://github.com/crosstype/node-html-markdown/issues?q=is%3Aissue+is%3Aopen+label%3A%22help+wanted%22