Straightforward HTML parser for Node.js and browser
npm install hyntax
Straightforward HTML parser for JavaScript. Live Demo.
- Simple. API is straightforward, output is clear.
- Forgiving. Just like a browser, normally parses invalid HTML.
- Supports streaming. Can process HTML while it's still being loaded.
- No dependencies.
- Usage
- TypeScript Typings
- Streaming
- Tokens
- AST Format
- API Reference
- Types Reference
``bash`
npm install hyntax
`javascript
const { tokenize, constructTree } = require('hyntax')
const util = require('util')
const inputHTML =
const { tokens } = tokenize(inputHTML)
const { ast } = constructTree(tokens)
console.log(JSON.stringify(tokens, null, 2))
console.log(util.inspect(ast, { showHidden: false, depth: null }))
`TypeScript Typings
Hyntax is written in JavaScript but has integrated TypeScript typings to help you navigate around its data structures. There is also Types Reference which covers most common types.
Streaming
Use
StreamTokenizer and StreamTreeConstructor classes to parse HTML chunk by chunk while it's still being loaded from the network or read from the disk.`javascript
const { StreamTokenizer, StreamTreeConstructor } = require('hyntax')
const http = require('http')
const util = require('util')http.get('http://info.cern.ch', (res) => {
const streamTokenizer = new StreamTokenizer()
const streamTreeConstructor = new StreamTreeConstructor()
let resultTokens = []
let resultAst
res.pipe(streamTokenizer).pipe(streamTreeConstructor)
streamTokenizer
.on('data', (tokens) => {
resultTokens = resultTokens.concat(tokens)
})
.on('end', () => {
console.log(JSON.stringify(resultTokens, null, 2))
})
streamTreeConstructor
.on('data', (ast) => {
resultAst = ast
})
.on('end', () => {
console.log(util.inspect(resultAst, { showHidden: false, depth: null }))
})
}).on('error', (err) => {
throw err;
})
`Tokens
Here are all kinds of tokens which Hyntax will extract out of HTML string.
!Overview of all possible tokens
Each token conforms to Tokenizer.Token interface.
AST Format
Resulting syntax tree will have at least one top-level Document Node with optional children nodes nested within.
`javascript
{
nodeType: TreeConstructor.NodeTypes.Document,
content: {
children: [
{
nodeType: TreeConstructor.NodeTypes.AnyNodeType,
content: {…}
},
{
nodeType: TreeConstructor.NodeTypes.AnyNodeType,
content: {…}
}
]
}
}
`Content of each node is specific to node's type, all of them are described in AST Node Types reference.
API Reference
$3
Hyntax has its tokenizer as a separate module. You can use generated tokens on their own or pass them further to a tree constructor to build an AST.
#### Interface
`typescript
tokenize(html: String): Tokenizer.Result
`#### Arguments
-
html
HTML string to process
Required.
Type: string.#### Returns Tokenizer.Result
$3
After you've got an array of tokens, you can pass them into tree constructor to build an AST.
#### Interface
`typescript
constructTree(tokens: Tokenizer.AnyToken[]): TreeConstructor.Result
`#### Arguments
-
tokens
Array of tokens received from the tokenizer.
Required.
Type: [Tokenizer.AnyToken[]](#tokenizeranytoken)#### Returns TreeConstructor.Result
Types Reference
#### Tokenizer.Result
`typescript
interface Result {
state: Tokenizer.State
tokens: Tokenizer.AnyToken[]
}
`-
state
The current state of tokenizer. It can be persisted and passed to the next tokenizer call if the input is coming in chunks.
- tokens
Array of resulting tokens.
Type: [Tokenizer.AnyToken[]](#tokenizeranytoken)#### TreeConstructor.Result
`typescript
interface Result {
state: State
ast: AST
}
`-
state
The current state of the tree constructor. Can be persisted and passed to the next tree constructor call in case when tokens are coming in chunks.
- ast
Resulting AST.
Type: TreeConstructor.AST #### Tokenizer.Token
Generic Token, other interfaces use it to create a specific Token type.
`typescript
interface Token {
type: T
content: string
startPosition: number
endPosition: number
}
`-
type
One of the Token types.
- content
Piece of original HTML string which was recognized as a token.
- startPosition
Index of a character in the input HTML string where the token starts.
- endPosition
Index of a character in the input HTML string where the token ends.#### Tokenizer.TokenTypes.AnyTokenType
Shortcut type of all possible tokens.
`typescript
type AnyTokenType =
| Text
| OpenTagStart
| AttributeKey
| AttributeAssigment
| AttributeValueWrapperStart
| AttributeValue
| AttributeValueWrapperEnd
| OpenTagEnd
| CloseTag
| OpenTagStartScript
| ScriptTagContent
| OpenTagEndScript
| CloseTagScript
| OpenTagStartStyle
| StyleTagContent
| OpenTagEndStyle
| CloseTagStyle
| DoctypeStart
| DoctypeEnd
| DoctypeAttributeWrapperStart
| DoctypeAttribute
| DoctypeAttributeWrapperEnd
| CommentStart
| CommentContent
| CommentEnd
`#### Tokenizer.AnyToken
Shortcut to reference any possible token.
`typescript
type AnyToken = Token
`#### TreeConstructor.AST
Just an alias to DocumentNode. AST always has one top-level DocumentNode. See AST Node Types
`typescript
type AST = TreeConstructor.DocumentNode
`$3
There are 7 possible types of Node. Each type has a specific content.
`typescript
type DocumentNode = Node
``typescript
type DoctypeNode = Node
``typescript
type TextNode = Node
``typescript
type TagNode = Node
``typescript
type CommentNode = Node
``typescript
type ScriptNode = Node
``typescript
type StyleNode = Node
`Interfaces for each content type:
- Document
- Doctype
- Text
- Tag
- Comment
- Script
- Style
#### TreeConstructor.Node
Generic Node, other interfaces use it to create specific Nodes by providing type of Node and type of the content inside the Node.
`typescript
interface Node {
nodeType: T
content: C
}
`#### TreeConstructor.NodeTypes.AnyNodeType
Shortcut type of all possible Node types.
`typescript
type AnyNodeType =
| Document
| Doctype
| Tag
| Text
| Comment
| Script
| Style
`$3
#### TreeConstructor.NodeTypes.AnyNodeContent
Shortcut type of all possible types of content inside a Node.
`typescript
type AnyNodeContent =
| Document
| Doctype
| Text
| Tag
| Comment
| Script
| Style
`#### TreeConstructor.NodeContents.Document
`typescript
interface Document {
children: AnyNode[]
}
`#### TreeConstructor.NodeContents.Doctype
`typescript
interface Doctype {
start: Tokenizer.Token
attributes?: DoctypeAttribute[]
end: Tokenizer.Token
}
`#### TreeConstructor.NodeContents.Text
`typescript
interface Text {
value: Tokenizer.Token
}
`#### TreeConstructor.NodeContents.Tag
`typescript
interface Tag {
name: string
selfClosing: boolean
openStart: Tokenizer.Token
attributes?: TagAttribute[]
openEnd: Tokenizer.Token
children?: AnyNode[]
close?: Tokenizer.Token
}
`#### TreeConstructor.NodeContents.Comment
`typescript
interface Comment {
start: Tokenizer.Token
value: Tokenizer.Token
end: Tokenizer.Token
}
`#### TreeConstructor.NodeContents.Script
`typescript
interface Script {
openStart: Tokenizer.Token
attributes?: TagAttribute[]
openEnd: Tokenizer.Token
value: Tokenizer.Token
close: Tokenizer.Token
}
`#### TreeConstructor.NodeContents.Style
`typescript
interface Style {
openStart: Tokenizer.Token,
attributes?: TagAttribute[],
openEnd: Tokenizer.Token,
value: Tokenizer.Token,
close: Tokenizer.Token
}
`#### TreeConstructor.DoctypeAttribute
`typescript
interface DoctypeAttribute {
startWrapper?: Tokenizer.Token,
value: Tokenizer.Token,
endWrapper?: Tokenizer.Token
}
`#### TreeConstructor.TagAttribute
`typescript
interface TagAttribute {
key?: Tokenizer.Token,
startWrapper?: Tokenizer.Token,
value?: Tokenizer.Token,
endWrapper?: Tokenizer.Token
}
``