HTML to JSON extractor
npm install @qretaio/html2jsonA Rust port of
cheerio-json-mapper.
---
- Input: HTML source + Extractor spec (JSON)
- Output: JSON matching the structure defined in the spec
- Available as: Rust crate, CLI tool, and WebAssembly npm package
``bash`
npm install @qretaio/html2json
`bash`
cargo install html2json --features cli
`bash`
cargo install --path . --features clior from a git repository
cargo install --git https://github.com/qretaio/html2json --features cli
`bash`
just install
`javascript
import { extract } from "@qretaio/html2json";
const html =
My Article
;
const spec = {
title: "h2",
author: ".author",
tags: [
{
$: ".tags span",
name: "$",
},
],
};
const result = await extract(html, spec);
console.log(result);
// {
// "title": "My Article",
// "author": "John Doe",
// "tags": [{"name": "rust"}, {"name": "wasm"}]
// }
`
`bashExtract from file
html2json examples/hn.html --spec examples/hn.json
$3
-
--spec, -s - Path to JSON extractor spec file (required)
- --check, -c - Compare output against expected JSON file. Exits with 0 if match, 1 if differ (with colored diff).Spec Format
The spec is a JSON object where each key defines an output field and each value defines a CSS selector to extract that field.
$3
`json
{
"title": "h1",
"description": "p.description"
}
`$3
`json
{
"link": "a.main | attr:href",
"image": "img.hero | attr:src"
}
`$3
`json
{
"title": "h1 | trim",
"slug": "h1 | lower | regex:\\s+-",
"price": ".price | regex:\\$(\\d+\\.\\d+) | parseAs:int"
}
`Available pipes:
-
trim - Trim whitespace
- lower - Convert to lowercase
- upper - Convert to uppercase
- substr:start:end - Extract substring
- regex:pattern - Regex capture (first group)
- parseAs:int - Parse as integer
- parseAs:float - Parse as float
- attr:name - Get attribute value
- void - Extract from void elements, useful for extracting xml$3
`json
{
"items": [
{
"$": ".item",
"title": "h2",
"description": "p"
}
]
}
`$3
`json
{
"$": "article",
"title": "h1",
"paragraphs": ["p"]
}
`$3
`json
{
"title": "h1.main || h1.fallback || h1"
}
`$3
`json
{
"title": "h1",
"description?": "p.description"
}
`Optional fields that return
null` are removed from the output.MIT