Extract data from HTML using JSON mapping with custom pipe functionality.

Extract HTML markup to JSON using Cheerio.

---

```sh
npm i -S cheerio-json-mapper
```

## Usage
```js
import { cheerioJsonMapper } from 'cheerio-json-mapper';

// Sample markup (reconstructed from the selectors and output below)
const html = `
  <article>
    <h1>My headline</h1>
    <div class="content"><p>My article text.</p></div>
    <div class="author">
      <a href="mailto:john.doe@example.com">John Doe</a>
    </div>
  </article>
`;

const template = {
  headline: 'article > h1',
  articleText: 'article > .content',
  author: {
    $: 'article > .author',
    name: '> a',
    email: '> a | attr:href | substr:7',
  },
};

const result = await cheerioJsonMapper(html, template);
console.log(result);
// output:
// {
//   headline: "My headline",
//   articleText: "My article text.",
//   author: {
//     name: "John Doe",
//     email: "john.doe@example.com"
//   }
// }
```

More examples are found in the repo's `tests/cases` folder.
## Core concepts

- End-Result Structure First
- Scoping
- Pipes

### End-Result Structure First
The main approach is to start from what we need to retrieve: define the end structure and tell each property which _selector_ to use to get its value.
#### Hard-coded values (literals)
We can hard-code values in the structure by wrapping strings in double or single quotes. Numbers and booleans are automatically detected as literals:
```json
{
  "headline": "article > h1",
  "public": true,
  "copyright": "'© Copyright Us Inc. 2023'",
  "version": 1.23
}
```

### Scoping
Large documents with nested parts tend to require big and ugly selectors. To simplify things, we can _scope_ an object to only care for a certain selected part.
Add a `$` property with a selector to narrow down what the rest of the object should use as its base.

Example:

```html
<!-- Markup reconstructed from the surviving text and the selectors below -->
<article>
  <h1>My headline</h1>
  <div class="content">My article text.</div>
</article>
<div class="content">This won't be selected due to scoping</div>
```

```js
const template = {
  $: 'article',
  headline: '> h1',
  articleText: '> .content',
  author: {
    $: '> .author',
    name: 'span.name',
    telephone: 'span.tel',
    email: 'a[href^=mailto:] | attr:href | substr:7',
  },
};
```

#### Self-selector
In some cases we want to reuse the object selector (`$`) as a property selector. This is especially handy when targeting lists, as in this case:

```js
// Sample list markup (reconstructed from the selector and output below)
const html = `
  <ul>
    <li>One</li>
    <li>Two</li>
    <li>Three</li>
  </ul>
`;
const template = [
  {
    $: 'ul > li',
    value: '$', // uses ul > li as property selector
  },
];
const result = await cheerioJsonMapper(html, template);
console.log(result);
// Output:
// [
//   { value: 'One' },
//   { value: 'Two' },
//   { value: 'Three' }
// ]
```

> Note: Don't like the `$` name for the scope selector? Change it through options: `cheerioJsonMapper(html, template, { scopeProp: '__scope' })`

### Pipes
Sometimes the text content of a selected node is not what we need, or not enough. _Pipes_ to the rescue!

Pipes are functions that can be applied to a value, whether it comes from a property selector or from an object. Use pipes to handle any custom needs.

Multiple pipes are supported (separated by the `|` character) and run in sequence. The value returned from one pipe is passed to the next, which lets us chain functionality (much like \*nix terminal pipes, which inspired this syntax).

Pipes can take basic arguments: add a colon (`:`) followed by semicolon-separated (`;`) values.

Pipes can be asynchronous.
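To make the chaining rules concrete, here is a minimal sketch of how a pipe string could be split into steps and run in sequence. It is not the library's implementation: `parsePipeString`, `runPipes`, `demoPipeFns`, and `fakeNode` are hypothetical names, the pipe functions are simplified stand-ins that work on a plain object rather than a DOM node, and the `{ value, args }` argument shape is borrowed from the custom pipes example in this README.

```javascript
// Illustrative stand-in, not the library's parser: split a property string
// into its selector and pipe steps. Simplification: assumes the selector
// itself contains no '|' character.
function parsePipeString(input) {
  const [selector, ...steps] = input.split('|').map((part) => part.trim());
  const pipes = steps.map((step) => {
    const colonIndex = step.indexOf(':');
    if (colonIndex === -1) return { name: step, args: [] };
    return {
      name: step.slice(0, colonIndex),
      args: step.slice(colonIndex + 1).split(';'), // ';' separates arguments
    };
  });
  return { selector, pipes };
}

// Run the steps in order; awaiting each result lets sync and async pipes mix.
async function runPipes(initialValue, pipes, pipeFns) {
  let value = initialValue;
  for (const { name, args } of pipes) {
    const pipeFn = pipeFns[name];
    if (!pipeFn) throw new Error(`Unknown pipe: ${name}`);
    value = await pipeFn({ value, args });
  }
  return value;
}

// Hypothetical pipe functions operating on a plain object instead of a DOM node.
const demoPipeFns = {
  attr: ({ value, args }) => value.attrs[args[0]],
  substr: ({ value, args }) => String(value).substring(Number(args[0])),
  lower: async ({ value }) => String(value).toLowerCase(), // async on purpose
};

const { selector, pipes } = parsePipeString(
  'a[href^=mailto:] | attr:href | substr:7 | lower'
);
const fakeNode = { attrs: { href: 'mailto:John.Doe@example.com' } };

runPipes(fakeNode, pipes, demoPipeFns).then((email) => {
  console.log(selector); // a[href^=mailto:]
  console.log(email); // john.doe@example.com
});
```

Each step only sees the value returned by the previous one, which is why the order `attr:href | substr:7` matters: `substr` needs the attribute string, not the node.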
#### Use pipes in selector props:
```js
{
  email: 'a[href^=mailto:] | attr:href | substr:7',
}
```

#### Use pipes in objects:
```js
{
  name: 'span.name',
  email: 'a[href^=mailto:] | attr:href | substr:7',
  telephone: 'span.tel',
  '|': 'requiredProps:name;email'
}
```

> Note: Don't like the `|` name for the pipe property? Change it through options: `cheerioJsonMapper(html, template, { pipeProp: '__pipes' })`

#### Default pipes included:
- `text` - grab `.textContent` from the selected node (used by default if no other pipes are specified)
- `trim` - trim grabbed text
- `lower` - turn grabbed text to lower case
- `upper` - turn grabbed text to upper case
- `substr` - get substring of grabbed text
- `default` - if value is nullish/empty, use specified fallback value
- `parseAs` - parse a string to different types:
  - `parseAs:number` - number
  - `parseAs:int` - integer
  - `parseAs:float` - float
  - `parseAs:bool` - boolean
  - `parseAs:date` - date
  - `parseAs:json` - JSON
- `log` - will `console.log` the current value (use for debugging)
- `attr` - get attribute value from the selected node

#### Custom pipes
Create your own pipes to handle any customization needed.
```js
const customPipes = {
  /** Replace any http:// link into https:// */
  onlyHttps: ({ value }) => value?.toString().replace(/^http:/, 'https:'),

  /** Check if all required props exist - and if not, set object to undefined */
  requiredProps: ({ value, args }) => {
    const obj = value; // as this should be run as object pipe, value should be an object
    const requiredProps = args; // string array
    const hasMissingProps = requiredProps.some((prop) => obj[prop] == null);
    return hasMissingProps ? undefined : obj;
  },
};

const template = [
  {
    name: 'span.name',
    telephone: 'span.tel',
    email: 'a[href^=mailto:] | attr:href | substr:7',
    website: 'a[href^=http] | attr:href | onlyHttps',
    '|': 'requiredProps:name;email',
  },
];

// `html` is the markup string to parse
const contacts = await cheerioJsonMapper(html, template, { pipeFns: customPipes });
```

More examples are found in the repo's `tests/cases` folder.
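Because custom pipes are plain functions, they can be checked in isolation before being wired into a template. The sketch below calls the two pipes from the example above directly, assuming they receive a `{ value, args }` object with `args` as a string array (as the comments in that example indicate). The sample objects `complete` and `incomplete` are made up for illustration, and the library may pass additional properties that are not shown here.

```javascript
// Same pipe functions as in the README example, called directly without any HTML.
const customPipes = {
  onlyHttps: ({ value }) => value?.toString().replace(/^http:/, 'https:'),
  requiredProps: ({ value, args }) => {
    const hasMissingProps = args.some((prop) => value[prop] == null);
    return hasMissingProps ? undefined : value;
  },
};

// Hypothetical mapped objects, shaped like the template's output.
const complete = { name: 'John Doe', email: 'john.doe@example.com', telephone: undefined };
const incomplete = { name: 'Jane Doe', email: undefined };

// All required props are present, so the object is passed through unchanged.
const kept = customPipes.requiredProps({ value: complete, args: ['name', 'email'] });
console.log(kept === complete); // true

// `email` is missing, so the whole object is replaced by undefined.
const dropped = customPipes.requiredProps({ value: incomplete, args: ['name', 'email'] });
console.log(dropped); // undefined

// Only a leading "http:" is rewritten; https links and empty values are left as they are.
console.log(customPipes.onlyHttps({ value: 'http://example.com' })); // https://example.com
console.log(customPipes.onlyHttps({ value: 'https://example.com' })); // https://example.com
console.log(customPipes.onlyHttps({ value: undefined })); // undefined
```

Note that `telephone` is not listed in `requiredProps`, so a missing telephone number does not cause the object to be dropped.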
## Changelog

- Update Cheerio to v1.0.0
- Add test/example for table formatting
- Fixed bug with `getScope()` method.
- Fixed bug when scoped object selectors don't match anything.
- Support self-selector.
- Align how default pipes should behave.
- Updated README
- First release with initial functionality:
  - End-result structure first approach
  - Scoping
  - Pipes with default setup of pipe funcs