HTML Miner
==========

A powerful miner that will scrape html pages for you.

```sh
# using npm
npm i --save html-miner
```

## Example

I collected common use cases in a dedicated EXAMPLE.md. Feel free to start from the Usage section or jump straight to the Example page.

If you want to experiment, an online playground is also available.

:green_book: Enjoy your reading

## Usage

### Arguments

`html-miner` accepts two arguments: `html` and `selector`.

```js
const htmlMiner = require('html-miner');

// htmlMiner(html, selector);
```

#### HTML

_html_ is a string that contains HTML code.

```js
let html = '<div class="title">Hello <span>Marco</span>!</div>';
```

#### SELECTOR

_selector_ could be:

STRING

```js
htmlMiner(html, '.title');
//=> Hello Marco!
```

If the selector matches more than one element, the result is an array:

```js
let htmlWithDivs = '<div>Element 1</div><div>Element 2</div>';
htmlMiner(htmlWithDivs, 'div');
//=> ['Element 1', 'Element 2']
```

FUNCTION

See the "Function in detail" paragraph.

```js
htmlMiner(html, () => 'Hello everyone!');
//=> Hello everyone!

htmlMiner(html, function () {
    return 'Hello everyone!';
});
//=> Hello everyone!
```

ARRAY

```js
htmlMiner(html, ['.title', 'span']);
//=> ['Hello Marco!', 'Marco']
```

OBJECT

```js
htmlMiner(html, {
    title: '.title',
    who: 'span'
});
//=> {
//     title: 'Hello Marco!',
//     who: 'Marco'
//   }
```

You can combine arrays and objects with each other, or with strings and functions.

```js
htmlMiner(html, {
    title: '.title',
    who: '.title span',
    upper: (arg) => { return arg.scopeData.who.toUpperCase(); }
});
//=> {
//     title: 'Hello Marco!',
//     who: 'Marco',
//     upper: 'MARCO'
//   }
```
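
To make the four selector shapes easier to picture, here is a toy resolver written in plain JavaScript. It is only a sketch of the behaviour described above, not html-miner's implementation: `lookup` and `resolve` are hypothetical names, and `lookup` stands in for a real CSS query by returning canned values.

```js
// Toy model only: NOT html-miner's implementation.
// `lookup` is a hypothetical stand-in for querying the HTML with a CSS selector.
const lookup = (selector) => ({ '.title': 'Hello Marco!', 'span': 'Marco' }[selector]);

function resolve(selector, scopeData = {}) {
    if (typeof selector === 'string') return lookup(selector);
    if (typeof selector === 'function') return selector({ scopeData });
    if (Array.isArray(selector)) return selector.map((item) => resolve(item));
    // Plain object: resolve keys in declaration order, so that later keys
    // can read earlier results through scopeData.
    const result = {};
    for (const [key, value] of Object.entries(selector)) {
        result[key] = resolve(value, result);
    }
    return result;
}

const toyResult = resolve({
    title: '.title',
    names: ['span', () => 'everyone'],
    upper: (arg) => arg.scopeData.title.toUpperCase(),
});
console.log(toyResult);
//=> { title: 'Hello Marco!', names: [ 'Marco', 'everyone' ], upper: 'HELLO MARCO!' }
```

The point of the sketch is the dispatch on the selector's type: strings are looked up, functions are called, and arrays and objects recurse into their members.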

### Function in detail

A function accepts a single argument, an object containing:

- `$`: a jQuery-like function bound to the document (the `html` argument). Use it to query and fetch elements from the HTML.

  ```js
  htmlMiner(html, arg => arg.$('.title').text());
  //=> Hello Marco!
  ```

- `$scope`: useful when combined with `_each_` or `_container_` (see the "Special keys" paragraph).

  ```js
  htmlMiner(html, {
      title: '.title',
      spanList: {
          _each_: 'span',
          value: (arg) => {
              // "arg.$scope.find('.title')" matches nothing: ".title" is not inside the span.
              return arg.$scope.text();
          }
      }
  });
  //=> {
  //     title: 'Hello Marco!',
  //     spanList: [{
  //         value: 'Marco'
  //     }]
  //   }
  ```

- `globalData`: an object that contains all previously fetched data.

  ```js
  htmlMiner(html, {
      title: '.title',
      spanList: {
          _each_: '.title span',
          pageTitle: function(arg) {
              // "arg.globalData.who" is undefined because "who" is defined later.
              return arg.globalData.title;
          }
      },
      who: '.title span'
  });
  //=> {
  //     title: 'Hello Marco!',
  //     spanList: [{
  //         pageTitle: 'Hello Marco!'
  //     }],
  //     who: 'Marco'
  //   }
  ```

- `scopeData`: similar to `globalData`, but it only contains the data of the current scope. Useful when combined with `_each_` (see the "Special keys" paragraph).

  ```js
  htmlMiner(html, {
      title: '.title',
      upper: (arg) => { return arg.scopeData.title.toUpperCase(); },
      sublist: {
          who: '.title span',
          upper: (arg) => {
              // "arg.scopeData.title" is undefined because "title" is out of scope.
              return arg.scopeData.who.toUpperCase();
          },
      }
  });
  //=> {
  //     title: 'Hello Marco!',
  //     upper: 'HELLO MARCO!',
  //     sublist: {
  //         who: 'Marco',
  //         upper: 'MARCO'
  //     }
  //   }
  ```
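
The visibility rules for `globalData` and `scopeData` can be mimicked with a few lines of plain JavaScript. This is a toy model under stated assumptions, not html-miner's implementation: `resolveObject` is a hypothetical name, and plain strings stand in for values that would normally be extracted from the HTML.

```js
// Toy model only: NOT html-miner's implementation.
// Strings stand for already-extracted values, so no HTML parsing happens here.
function resolveObject(selector, globalData) {
    const scopeData = {};
    // The outermost call uses its own result object as globalData.
    const root = globalData || scopeData;
    for (const [key, value] of Object.entries(selector)) {
        if (typeof value === 'function') {
            scopeData[key] = value({ globalData: root, scopeData });
        } else if (typeof value === 'object' && value !== null) {
            scopeData[key] = resolveObject(value, root);
        } else {
            scopeData[key] = value;
        }
    }
    return scopeData;
}

const visibility = resolveObject({
    title: 'Hello Marco!',
    sublist: {
        seesTitleGlobally: (arg) => arg.globalData.title !== undefined,
        seesTitleInScope: (arg) => arg.scopeData.title !== undefined,
        seesWhoGlobally: (arg) => arg.globalData.who !== undefined,
    },
    who: 'Marco',
});
console.log(visibility.sublist);
//=> { seesTitleGlobally: true, seesTitleInScope: false, seesWhoGlobally: false }
```

In this sketch `title` is visible through `globalData` because it was resolved earlier, it is absent from the nested `scopeData` because it belongs to the outer scope, and `who` is not visible at all because it is resolved after `sublist`.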

### Special keys

When selector is an object, you can use _special keys_:

- `_each_`: creates a list of items. HTML Miner iterates over the elements matched by its value and parses the sibling keys for each one.

  ```js
  {
      articles: {
          _each_: '.articles .article',
          title: 'h2',
          content: 'p',
      }
  }
  ```

- `_eachId_`: useful when combined with `_each_`. Instead of creating an array, it creates an object whose keys are the values returned by the `_eachId_` function.

  ```js
  {
      articles: {
          _each_: '.articles .article',
          _eachId_: function(arg) {
              return arg.$scope.data('id');
          },
          title: 'h2',
          content: 'p',
      }
  }
  ```

- `_container_`: uses the parsed value as a container. HTML Miner parses the sibling keys, searching for them inside the `_container_`.

  ```js
  {
      footer: {
          _container_: 'footer',
          copyright: (arg) => { return arg.$scope.text().trim(); },
          company: 'span' // find only 'span' inside 'footer'.
      }
  }
  ```

For more details see the following example.
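
As a quick aside before the full walkthrough, the difference between `_each_` alone and `_each_` combined with `_eachId_` comes down to the shape of the collection. The sketch below illustrates that idea in plain JavaScript; it is not html-miner's implementation, and `articles`, `collect` and `parseTitle` are hypothetical names, with plain records standing in for matched elements.

```js
// Toy model only: NOT html-miner's implementation.
// `articles` is a hypothetical stand-in for the elements matched by `_each_`.
const articles = [
    { id: 'a001', title: 'Heading 1' },
    { id: 'a002', title: 'Heading 2' },
];

// Without an id function the items are collected into an array;
// with one, they are stored in an object under the returned key.
function collect(items, parseItem, eachId) {
    if (!eachId) return items.map((item) => parseItem(item));
    const byId = {};
    for (const item of items) {
        byId[eachId(item)] = parseItem(item);
    }
    return byId;
}

const parseTitle = (article) => ({ title: article.title });
const asArray = collect(articles, parseTitle);
const asObject = collect(articles, parseTitle, (article) => article.id);

console.log(asArray);
//=> [ { title: 'Heading 1' }, { title: 'Heading 2' } ]
console.log(asObject);
//=> { a001: { title: 'Heading 1' }, a002: { title: 'Heading 2' } }
```

A keyed object is convenient when items are later looked up by id; an array keeps the document order explicit.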

#### Let's try this out

Consider the following HTML snippet: we will try to fetch some information from it.

```html
<h1>Hello, <span>world</span>!</h1>
<div class="articles">
    <div class="article" data-id="a001">
        <h2>Heading 1</h2><p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
    </div>
    <div class="article" data-id="a002">
        <h2>Heading 2</h2><p>Donec maximus ipsum quis est tempor, sit amet laoreet libero bibendum.</p>
    </div>
    <div class="article" data-id="a003">
        <h2>Heading 3</h2><p>Suspendisse viverra convallis risus, vitae molestie est tincidunt eget.</p>
    </div>
</div>
<footer>© <span>Company</span> 2017</footer>
```

```js
const htmlMiner = require('html-miner');

let json = htmlMiner(html, {
    title: 'h1',
    who: 'h1 span',
    h2: 'h2',
    articlesArray: {
        _each_: '.articles .article',
        title: 'h2',
        content: 'p',
    },
    articlesObject: {
        _each_: '.articles .article',
        _eachId_: function(arg) {
            return arg.$scope.data('id');
        },
        title: 'h2',
        content: 'p',
    },
    footer: {
        _container_: 'footer',
        copyright: (arg) => { return arg.$scope.text().trim(); },
        company: 'span',
        year: (arg) => { return arg.scopeData.copyright.match(/[0-9]+/)[0]; },
    },
    greet: () => { return 'Hi!'; }
});

console.log( json );
//=> {
//     title: 'Hello, world!',
//     who: 'world',
//     h2: ['Heading 1', 'Heading 2', 'Heading 3'],
//     articlesArray: [
//         {
//             title: 'Heading 1',
//             content: 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.',
//         },
//         {
//             title: 'Heading 2',
//             content: 'Donec maximus ipsum quis est tempor, sit amet laoreet libero bibendum.',
//         },
//         {
//             title: 'Heading 3',
//             content: 'Suspendisse viverra convallis risus, vitae molestie est tincidunt eget.',
//         }
//     ],
//     articlesObject: {
//         'a001': {
//             title: 'Heading 1',
//             content: 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.',
//         },
//         'a002': {
//             title: 'Heading 2',
//             content: 'Donec maximus ipsum quis est tempor, sit amet laoreet libero bibendum.',
//         },
//         'a003': {
//             title: 'Heading 3',
//             content: 'Suspendisse viverra convallis risus, vitae molestie est tincidunt eget.',
//         }
//     },
//     footer: {
//         copyright: '© Company 2017',
//         company: 'Company',
//         year: '2017'
//     },
//     greet: 'Hi!'
//   }
```

You can find other examples under the `/examples` folder.

```sh
# you can test examples with nodejs
node examples/demo.js
node examples/site.js
```
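
One detail in the walkthrough above is worth a note: the `year` function calls `match(/[0-9]+/)[0]`, and `String.prototype.match` returns `null` when nothing matches, so indexing the result throws on a footer without digits. A defensive variant could look like the sketch below; `firstNumber` is a hypothetical helper, not part of html-miner.

```js
// Hypothetical helper, not part of html-miner: returns the first run of
// digits in `text`, or null when there is none.
function firstNumber(text) {
    const match = String(text).match(/[0-9]+/);
    return match ? match[0] : null;
}

console.log(firstNumber('© Company 2017')); //=> '2017'
console.log(firstNumber('© Company'));      //=> null
```

Inside a selector it would be used as `year: (arg) => firstNumber(arg.scopeData.copyright)`.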

## Development

```sh
npm install
npm test

# start the playground locally
npm start
```