TagSoup is the fastest pure JS SAX/DOM XML/HTML parser and serializer.

- Extremely low memory consumption.
- Tolerant of malformed tag nesting, missing end tags, etc.
- Recognizes CDATA sections, processing instructions, and DOCTYPE declarations.
- Supports both strict XML and forgiving HTML parsing modes.
- 20 kB gzipped, including dependencies.
- Check out TagSoup dependencies: Speedy Entities and Flyweight DOM.

```sh
npm install --save-prod tag-soup
```

- API docs
- DOM parsing
- SAX parsing
- Tokenization
- Serialization
- Performance
- Limitations

DOM parsing

TagSoup exports a preconfigured HTMLDOMParser, which parses HTML markup as a DOM node. This parser never throws errors during parsing and forgives malformed markup:

```ts
import { HTMLDOMParser, toHTML } from 'tag-soup';

const fragment = HTMLDOMParser.parseFragment('<p>hello<p>cool</br>');
// ⮕ DocumentFragment

toHTML(fragment);
// ⮕ '<p>hello</p><p>cool<br></p>'
```

HTMLDOMParser decodes both HTML entities and numeric character references with decodeHTML.

XMLDOMParser
parses XML markup as a DOM node. It throws ParserError if the markup doesn't satisfy the XML spec:

```ts
import { XMLDOMParser, toXML } from 'tag-soup';

XMLDOMParser.parseFragment('<p>hello</br>');
// ❌ ParserError: Unexpected end tag.

const fragment = XMLDOMParser.parseFragment('<p>hello<br/></p>');
// ⮕ DocumentFragment

toXML(fragment);
// ⮕ '<p>hello<br/></p>'
```

XMLDOMParser decodes both XML entities and numeric character references with decodeXML.

TagSoup uses Flyweight DOM nodes,
which provide many standard DOM manipulation features:

```ts
import { HTMLDOMParser } from 'tag-soup';

const document = HTMLDOMParser.parseDocument('<!DOCTYPE html><html>hello</html>');

document.doctype.name;
// ⮕ 'html'

document.textContent;
// ⮕ 'hello'
```

For example, you can use
TreeWalker to traverse DOM nodes:

```ts
import { XMLDOMParser } from 'tag-soup';
import { TreeWalker, NodeFilter } from 'flyweight-dom';

const fragment = XMLDOMParser.parseFragment('<p>hello<br/></p>');

const treeWalker = new TreeWalker(fragment, NodeFilter.SHOW_TEXT);

treeWalker.nextNode();
// ⮕ Text { 'hello' }
```

Create a custom DOM parser using createDOMParser:

```ts
import { createDOMParser } from 'tag-soup';

const myParser = createDOMParser({
  voidTags: ['br'],
});

myParser.parseFragment('<p><br></p>');
// ⮕ DocumentFragment
```
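
Since the XML parser throws on malformed input and the HTML parser never does, one way to combine them is to try the strict parser first and fall back to the forgiving one. The sketch below is not part of TagSoup: `parseWithFallback` is a hypothetical helper that accepts any two parse functions, and the stand-in parsers only mimic the behavior described above so that the example runs without the library installed.

```typescript
// Hypothetical helper, not part of TagSoup: tries a strict parser first and
// falls back to a forgiving one if the strict parser throws.
type ParseFn<T> = (markup: string) => T;

interface ParseResult<T> {
  mode: 'strict' | 'forgiving';
  value: T;
}

function parseWithFallback<T>(
  markup: string,
  strict: ParseFn<T>,
  forgiving: ParseFn<T>
): ParseResult<T> {
  try {
    return { mode: 'strict', value: strict(markup) };
  } catch (_error) {
    // The strict parser rejected the markup, so retry in forgiving mode.
    return { mode: 'forgiving', value: forgiving(markup) };
  }
}

// Stand-in parsers used only to exercise the helper.
const fakeStrict: ParseFn<string> = markup => {
  if (markup.includes('</br>')) {
    throw new Error('Unexpected end tag.');
  }
  return 'strict:' + markup;
};

const fakeForgiving: ParseFn<string> = markup => 'forgiving:' + markup;

const wellFormed = parseWithFallback('<p>hello<br/></p>', fakeStrict, fakeForgiving);
const malformed = parseWithFallback('<p>hello</br>', fakeStrict, fakeForgiving);

console.log(wellFormed.mode); // strict
console.log(malformed.mode); // forgiving
```

With TagSoup installed, the two stand-ins would presumably be replaced by `m => XMLDOMParser.parseFragment(m)` and `m => HTMLDOMParser.parseFragment(m)`.
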

SAX parsing

TagSoup exports a preconfigured HTMLSAXParser, which parses HTML markup and calls handler methods when a token is read. This parser never throws errors during parsing and forgives malformed markup:

```ts
import { HTMLSAXParser } from 'tag-soup';

HTMLSAXParser.parseFragment('<p>hello<p>cool</br>', {
  onStartTagOpening(tagName) {
    // Called with 'p', 'p', and 'br'
  },
  onText(text) {
    // Called with 'hello' and 'cool'
  },
});
```

XMLSAXParser
parses XML markup and calls handler methods when a token is read. It throws ParserError if the markup doesn't satisfy the XML spec:

```ts
import { XMLSAXParser } from 'tag-soup';

XMLSAXParser.parseFragment('<p>hello</br>', {});
// ❌ ParserError: Unexpected end tag.

XMLSAXParser.parseFragment('<p>hello<br/></p>', {
  onEndTag(tagName) {
    // Called with 'br' and 'p'
  },
});
```

Create a custom SAX parser using createSAXParser:

```ts
import { createSAXParser } from 'tag-soup';

const myParser = createSAXParser({
  voidTags: ['br'],
});

myParser.parseFragment('<p><br></p>', {
  onStartTagOpening(tagName) {
    // Called with 'p' and 'br'
  },
});
```
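
Because a SAX handler is a plain object, it can be built and exercised in isolation. The sketch below creates a handler that records a flat outline of tag names and text. Only the three callback names come from the examples above; `OutlineHandler`, `createOutlineHandler`, and the manual calls that stand in for the parser are hypothetical.

```typescript
// Structural type covering only the callbacks shown in the examples above.
// TagSoup's real handler interface may declare more callbacks and arguments.
interface OutlineHandler {
  onStartTagOpening(tagName: string): void;
  onText(text: string): void;
  onEndTag(tagName: string): void;
}

function createOutlineHandler(outline: string[]): OutlineHandler {
  return {
    onStartTagOpening(tagName) {
      outline.push('<' + tagName);
    },
    onText(text) {
      // Skip whitespace-only text so the outline stays compact.
      if (text.trim() !== '') {
        outline.push(JSON.stringify(text));
      }
    },
    onEndTag(tagName) {
      outline.push('/' + tagName);
    },
  };
}

const outline: string[] = [];
const handler = createOutlineHandler(outline);

// Simulate the calls a parser might make, so the sketch runs without TagSoup.
handler.onStartTagOpening('p');
handler.onText('hello');
handler.onText('  ');
handler.onEndTag('p');

console.log(outline.join(' ')); // <p "hello" /p
```

With the library installed, the handler would be passed as the second argument, as in `HTMLSAXParser.parseFragment(markup, handler)`.
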

Tokenization

TagSoup exports a preconfigured HTMLTokenizer, which parses HTML markup and invokes a callback when a token is read. This tokenizer never throws errors during tokenization and forgives malformed markup:

```ts
import { HTMLTokenizer } from 'tag-soup';

HTMLTokenizer.tokenizeFragment('<p>hello<p>cool</br>', (token, startIndex, endIndex) => {
  // Handle token
});
```

XMLTokenizer
parses XML markup and invokes a callback when a token is read. It throws ParserError if the markup doesn't satisfy the XML spec:

```ts
import { XMLTokenizer } from 'tag-soup';

XMLTokenizer.tokenizeFragment('<p>hello</br>', (token, startIndex, endIndex) => {});
// ❌ ParserError: Unexpected end tag.

XMLTokenizer.tokenizeFragment('<p>hello<br/></p>', (token, startIndex, endIndex) => {
  // Handle token
});
```

Create a custom tokenizer using createTokenizer:

```ts
import { createTokenizer } from 'tag-soup';

const myTokenizer = createTokenizer({
  voidTags: ['br'],
});

myTokenizer.tokenizeFragment('<p><br></p>', (token, startIndex, endIndex) => {
  // Handle token
});
```
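
The tokenizer callback receives each token together with its start and end positions, which is enough to recover the raw text of a token from the source string. The sketch below assumes that `endIndex` is exclusive and treats the token value as opaque, since neither detail is shown above; `createSliceCollector` and the token labels in the simulated calls are hypothetical.

```typescript
interface TokenSlice {
  token: unknown;
  text: string;
}

// Returns a callback with the (token, startIndex, endIndex) shape used above.
// Assumption: endIndex is exclusive, matching String.prototype.slice.
function createSliceCollector(markup: string, slices: TokenSlice[]) {
  return (token: unknown, startIndex: number, endIndex: number): void => {
    slices.push({ token, text: markup.slice(startIndex, endIndex) });
  };
}

const sampleMarkup = '<p>hello</p>';
const slices: TokenSlice[] = [];
const onToken = createSliceCollector(sampleMarkup, slices);

// Simulated tokenizer calls with made-up token labels, so the sketch runs
// without TagSoup.
onToken('hypothetical-tag-name', 1, 2);
onToken('hypothetical-text', 3, 8);

console.log(slices.map(slice => slice.text)); // [ 'p', 'hello' ]
```

With the library installed, the callback would be passed as the second argument, as in `HTMLTokenizer.tokenizeFragment(sampleMarkup, onToken)`.
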

Serialization

TagSoup exports preconfigured serializers, toHTML and toXML:

```ts
import { HTMLDOMParser, toHTML } from 'tag-soup';

const fragment = HTMLDOMParser.parseFragment('<p>hello<p>cool</br>');
// ⮕ DocumentFragment

toHTML(fragment);
// ⮕ '<p>hello</p><p>cool<br></p>'
```

Create a custom serializer using createSerializer:

```ts
import { HTMLDOMParser, createSerializer } from 'tag-soup';

const mySerializer = createSerializer({
  voidTags: ['br'],
});

const fragment = HTMLDOMParser.parseFragment('<p>hello</br>');
// ⮕ DocumentFragment

mySerializer(fragment);
// ⮕ '<p>hello<br></p>'
```

Performance

Execution performance is measured in operations per second (± 5%); higher is better. Memory consumption (RAM) is measured in bytes; lower is better.

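| Library | Library size | DOM parsing, ops/sec | DOM parsing, RAM | SAX parsing, ops/sec | SAX parsing, RAM |
| --- | --- | --- | --- | --- | --- |
| tag-soup@3.0.0 | 20 kB | 26 Hz | 22 MB | 58 Hz | 22 kB |
| htmlparser2@10.0.0 | 58 kB | 19 Hz | 23 MB | 31 Hz | 10 MB |
| parse5@8.0.0 | 45 kB | 7 Hz | 105 MB | 12 Hz | 10 MB |
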
Performance was measured while parsing a 3.8 MB HTML file.
Tests were conducted using TooFast on Apple M1 with Node.js v23.11.1.
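
To put these figures in relative terms, they can be reduced to speedup ratios and approximate throughput. The sketch below copies the ops/sec numbers and the 3.8 MB input size from this section; the helper names are made up for illustration, and the derived values inherit the ± 5% measurement error.

```typescript
// Ops/sec figures copied from the benchmark results in this section.
const results = [
  { library: 'tag-soup', domHz: 26, saxHz: 58 },
  { library: 'htmlparser2', domHz: 19, saxHz: 31 },
  { library: 'parse5', domHz: 7, saxHz: 12 },
];

// Size of the benchmark input in megabytes, as stated in this section.
const INPUT_SIZE_MB = 3.8;

// Throughput in MB/s: documents parsed per second times document size.
function throughputMBps(opsPerSecond: number): number {
  return opsPerSecond * INPUT_SIZE_MB;
}

// How many times faster `a` is than `b`, rounded to two decimals.
function speedup(a: number, b: number): number {
  return Math.round((a / b) * 100) / 100;
}

const [tagSoup, ...others] = results;

for (const other of others) {
  console.log(
    `${other.library}: DOM ×${speedup(tagSoup.domHz, other.domHz)}, ` +
      `SAX ×${speedup(tagSoup.saxHz, other.saxHz)}`
  );
}
// htmlparser2: DOM ×1.37, SAX ×1.87
// parse5: DOM ×3.71, SAX ×4.83

console.log(`tag-soup DOM throughput: ${throughputMBps(tagSoup.domHz).toFixed(1)} MB/s`);
// tag-soup DOM throughput: 98.8 MB/s
```
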
To reproduce the performance test suite results, clone this repo and run:

```shell
npm ci
npm run build
npm run perf
```

Limitations

TagSoup doesn't resolve some quirky element structures that malformed HTML may cause.

Assume the following markup:

```html
<p><strong>okay
<p>nope
```

With DOMParser, this markup would be transformed to:

```html
<p><strong>okay</strong></p>
<p><strong>nope</strong></p>
```

TagSoup doesn't insert the second strong tag:

```html
<p><strong>okay</strong></p>
<p>nope</p>
```
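
The difference comes down to whether formatting elements that were implicitly closed get reopened inside the next element. The toy model below is not TagSoup's implementation and handles only one rule: a new `p` start tag closes an open `p` and everything nested in it. The `Token` type, the `FORMATTING_TAGS` set, and the `serialize` function are all hypothetical, and the input assumes a paragraph containing an unclosed `strong` followed by a second paragraph.

```typescript
type Token =
  | { type: 'open'; name: string }
  | { type: 'text'; value: string };

// Hypothetical set of formatting tags that a browser-like parser would reopen.
const FORMATTING_TAGS = new Set(['strong', 'em', 'b', 'i']);

function serialize(tokens: Token[], reopenFormatting: boolean): string {
  const stack: string[] = [];
  let out = '';

  for (const token of tokens) {
    if (token.type === 'text') {
      out += token.value;
      continue;
    }

    // Formatting tags closed implicitly by this start tag, outermost first.
    const reopened: string[] = [];

    // A new <p> implicitly ends an open <p> and everything nested inside it.
    if (token.name === 'p' && stack.includes('p')) {
      let closed: string | undefined;
      do {
        closed = stack.pop();
        out += `</${closed}>`;
        if (closed !== undefined && FORMATTING_TAGS.has(closed)) {
          reopened.unshift(closed);
        }
      } while (closed !== 'p');
    }

    stack.push(token.name);
    out += `<${token.name}>`;

    // Browser-like behavior: reopen the formatting tags inside the new element.
    if (reopenFormatting) {
      for (const name of reopened) {
        stack.push(name);
        out += `<${name}>`;
      }
    }
  }

  // Close whatever is still open at the end of the input.
  while (stack.length > 0) {
    out += `</${stack.pop()}>`;
  }
  return out;
}

const tokens: Token[] = [
  { type: 'open', name: 'p' },
  { type: 'open', name: 'strong' },
  { type: 'text', value: 'okay' },
  { type: 'open', name: 'p' },
  { type: 'text', value: 'nope' },
];

console.log(serialize(tokens, false)); // <p><strong>okay</strong></p><p>nope</p>
console.log(serialize(tokens, true)); // <p><strong>okay</strong></p><p><strong>nope</strong></p>
```

Passing `false` mirrors the result described for TagSoup, and `true` mirrors the browser result.
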