# A web scraper for Node.js
nodejs-web-scraper is a simple tool for scraping/crawling server-side rendered pages.
It supports features like recursive scraping (pages that "open" other pages), file download and handling, automatic retries of failed requests, concurrency limitation, pagination, request delay, etc. Tested on Node 10 - 16 (Windows 7, Linux Mint).
The API uses Cheerio selectors. See https://cheerio.js.org/ for reference.
For any questions or suggestions, please open a GitHub issue.
```sh
$ npm install nodejs-web-scraper
```
## Table of Contents
- Basic examples
* Collect articles from a news site
* Get data of every page as a dictionary
* Download images
* Use multiple selectors
- Advanced
* Pagination
* Get an entire HTML file
* Downloading a file that is not an image
* getElementContent and getPageResponse hooks
* Add additional conditions
* Scraping an auth protected site
- API
- Pagination explained
- Error Handling
- Automatic Logs
- Concurrency
- License
- Disclaimer
## Basic examples
#### Collect articles from a news site
Let's say we want to get every article (from every category) from a news site. We want each item to contain the title, story and image link (or links).
```javascript
const { Scraper, Root, DownloadContent, OpenLinks, CollectContent } = require('nodejs-web-scraper');
const fs = require('fs');
(async () => {
const config = {
baseSiteUrl: 'https://www.some-news-site.com/',
startUrl: 'https://www.some-news-site.com/',
filePath: './images/',
concurrency: 10,//Maximum concurrent jobs. More than 10 is not recommended. Default is 3.
maxRetries: 3,//The scraper will retry a failed request a few times (excluding 404). Default is 5.
logPath: './logs/'//Highly recommended: Creates a friendly JSON for each operation object, with all the relevant data.
}
const scraper = new Scraper(config);//Create a new Scraper instance, and pass config to it.
//Now we create the "operations" we need:
const root = new Root();//The root object fetches the startUrl, and starts the process.
//Any valid cheerio selector can be passed. For further reference: https://cheerio.js.org/
const category = new OpenLinks('.category',{name:'category'});//Opens each category page.
const article = new OpenLinks('article a', {name:'article' });//Opens each article page.
const image = new DownloadContent('img', { name: 'image' });//Downloads images.
const title = new CollectContent('h1', { name: 'title' });//"Collects" the text from each H1 element.
const story = new CollectContent('section.content', { name: 'story' });//"Collects" the article body.
root.addOperation(category);//Then we create a scraping "tree":
category.addOperation(article);
article.addOperation(image);
article.addOperation(title);
article.addOperation(story);
await scraper.scrape(root);
const articles = article.getData()//Will return an array of all article objects(from all categories), each
//containing its "children"(titles,stories and the downloaded image urls)
//If you just want to get the stories, do the same with the "story" variable:
const stories = story.getData();
fs.writeFile('./articles.json', JSON.stringify(articles), () => { })//Will produce a formatted JSON containing all article pages and their selected data.
fs.writeFile('./stories.json', JSON.stringify(stories), () => { })
})();
```
This basically means: "go to https://www.some-news-site.com; Open every category; Then open every article in each category page; Then collect the title, story and image href, and download all images on that page".
#### Get data of every page as a dictionary
An alternative, perhaps more friendly way to collect the data from a page, would be to use the "getPageObject" hook.
```javascript
const { Scraper, Root, OpenLinks, CollectContent } = require('nodejs-web-scraper');
const fs = require('fs');
(async () => {
const pages = [];//All ad pages.
//pageObject will be formatted as {title,phone}, because these are the names we chose for the scraping operations below.
//Note that each key is an array, because there might be multiple elements fitting the querySelector.
//This hook is called after every page finishes scraping.
//It will also get an address argument.
const getPageObject = (pageObject,address) => {
pages.push(pageObject)
}
const config = {
baseSiteUrl: 'https://www.profesia.sk',
startUrl: 'https://www.profesia.sk/praca/',
filePath: './images/',
logPath: './logs/'
}
const scraper = new Scraper(config);
const root = new Root();//The root fetches the startUrl and starts the process.
const jobAds = new OpenLinks('.list-row h2 a', { name: 'Ad page', getPageObject });//Opens every job ad, and calls the getPageObject, passing the formatted dictionary.
const phones = new CollectContent('.details-desc a.tel', { name: 'phone' })//Important to choose a name, for the getPageObject to produce the expected results.
const titles = new CollectContent('h1', { name: 'title' });
root.addOperation(jobAds);
jobAds.addOperation(titles);
jobAds.addOperation(phones);
await scraper.scrape(root);
fs.writeFile('./pages.json', JSON.stringify(pages), () => { });//Produces a formatted JSON with all job ads.
})()
```
Let's describe in words what's going on here: "Go to https://www.profesia.sk/praca/; Then open every job ad and call the getPageObject hook; Then collect the title and phone of each ad."
#### Download all images from a page
A simple task: download all images on a page (including base64).
```javascript
const { Scraper, Root, DownloadContent } = require('nodejs-web-scraper');
(async () => {
const config = {
baseSiteUrl: 'https://spectator.sme.sk',//Important to provide the base url, which is the same as the starting url, in this example.
startUrl: 'https://spectator.sme.sk/',
filePath: './images/',
cloneFiles: true,//Will create a new image file with an appended name, if the name already exists. Default is false.
}
const scraper = new Scraper(config);
const root = new Root();//Root corresponds to the config.startUrl. This object starts the entire process
const images = new DownloadContent('img')//Create an operation that downloads all image tags in a given page(any Cheerio selector can be passed).
root.addOperation(images);//We want to download the images from the root page, so we pass the "images" operation to the root.
await scraper.scrape(root);//Pass the Root to the Scraper.scrape() and you're done.
})();
```
When done, you will have an "images" folder with all downloaded files.
#### Use multiple selectors
If you need to select elements from different possible classes ("or" operator), just pass comma-separated classes.
This is part of the jQuery specification (which Cheerio implements), and has nothing to do with the scraper.
```javascript
const { Scraper, Root, CollectContent } = require('nodejs-web-scraper');
(async () => {
const config = {
baseSiteUrl: 'https://spectator.sme.sk',
startUrl: 'https://spectator.sme.sk/',
}
function getElementContent(element){
// Do something...
}
const scraper = new Scraper(config);
const root = new Root();
const title = new CollectContent('.first_class, .second_class',{getElementContent});//Any of these will fit.
root.addOperation(title);
await scraper.scrape(root);
})();
```
## Advanced Examples
#### Pagination
Get every job ad from a job-offering site. Each job object will contain a title, a phone and image hrefs. Since the site is paginated, we use the pagination feature.
```javascript
const { Scraper, Root, OpenLinks, CollectContent, DownloadContent } = require('nodejs-web-scraper');
const fs = require('fs');
(async () => {
const pages = [];//All ad pages.
//pageObject will be formatted as {title,phone,images}, because these are the names we chose for the scraping operations below.
const getPageObject = (pageObject,address) => {
pages.push(pageObject)
}
const config = {
baseSiteUrl: 'https://www.profesia.sk',
startUrl: 'https://www.profesia.sk/praca/',
filePath: './images/',
logPath: './logs/'
}
const scraper = new Scraper(config);
const root = new Root({ pagination: { queryString: 'page_num', begin: 1, end: 10 } });//Open pages 1-10.
// YOU NEED TO SUPPLY THE QUERYSTRING that the site uses(more details in the API docs). "page_num" is just the string used on this example site.
const jobAds = new OpenLinks('.list-row h2 a', { name: 'Ad page', getPageObject });//Opens every job ad, and calls the getPageObject, passing the formatted object.
const phones = new CollectContent('.details-desc a.tel', { name: 'phone' })//Important to choose a name, for the getPageObject to produce the expected results.
const images = new DownloadContent('img', { name: 'images' })
const titles = new CollectContent('h1', { name: 'title' });
root.addOperation(jobAds);
jobAds.addOperation(titles);
jobAds.addOperation(phones);
jobAds.addOperation(images);
await scraper.scrape(root);
fs.writeFile('./pages.json', JSON.stringify(pages), () => { });//Produces a formatted JSON with all job ads.
})()
```
Let's describe again in words, what's going on here: "Go to https://www.profesia.sk/praca/; Then paginate the root page, from 1 to 10; Then, on each pagination page, open every job ad; Then, collect the title, phone and images of each ad."
#### Get an entire HTML file
```javascript
const sanitize = require('sanitize-filename');//Using this npm module to sanitize file names.
const fs = require('fs');
const { Scraper, Root, OpenLinks } = require('nodejs-web-scraper');
(async () => {
const config = {
baseSiteUrl: 'https://www.profesia.sk',
startUrl: 'https://www.profesia.sk/praca/',
removeStyleAndScriptTags: false//Tells the scraper NOT to remove style and script tags, because I want them in my HTML files, for this example.
}
let directoryExists;
const getPageHtml = (html, pageAddress) => {//Saving the HTML file, using the page address as a name.
if(!directoryExists){
fs.mkdirSync('./html');
directoryExists = true;
}
const name = sanitize(pageAddress)
fs.writeFile(`./html/${name}.html`, html, () => { })
}
const scraper = new Scraper(config);
const root = new Root({ pagination: { queryString: 'page_num', begin: 1, end: 100 } });
const jobAds = new OpenLinks('.list-row h2 a', { getPageHtml });//Opens every job ad, and calls a hook after every page is done.
root.addOperation(jobAds);
await scraper.scrape(root);
})()
```
Description: "Go to https://www.profesia.sk/praca/; Paginate 100 pages from the root; Open every job ad; Save every job ad page as an html file;
#### Downloading a file that is not an image
```javascript
const { Scraper, Root, DownloadContent, CollectContent } = require('nodejs-web-scraper');
const config = {
baseSiteUrl: 'https://www.some-content-site.com',
startUrl: 'https://www.some-content-site.com/videos',
filePath: './videos/',
logPath: './logs/'
}
const scraper = new Scraper(config);
const root = new Root();
const video = new DownloadContent('a.video',{ contentType: 'file' });//The "contentType" makes it clear for the scraper that this is NOT an image (therefore the "href" is used instead of "src").
const description = new CollectContent('h1');
root.addOperation(video);
root.addOperation(description);
await scraper.scrape(root);
console.log(description.getData())//You can call the "getData" method on every operation object, giving you the aggregated data collected by it.
```
Description: "Go to https://www.some-content-site.com; Download every video; Collect each h1; At the end, get the entire data from the "description" object;
#### getElementContent and getPageResponse hooks
```javascript
const { Scraper, Root, OpenLinks, CollectContent } = require('nodejs-web-scraper');
const getPageResponse = async (response) => {
//Do something with response.data(the HTML content). No need to return anything.
}
const myDivs=[];
const getElementContent = (content, pageAddress) => {
myDivs.push(`myDiv content from page ${pageAddress} is ${content}...`)
}
const config = {
baseSiteUrl: 'https://www.nice-site',
startUrl: 'https://www.nice-site/some-section',
}
const scraper = new Scraper(config);
const root = new Root();
const articles = new OpenLinks('article a');
const posts = new OpenLinks('.post a', {getPageResponse});//Is called after the HTML of a link was fetched, but before the children have been scraped. Is passed the response object of the page.
const myDiv = new CollectContent('.myDiv',{getElementContent});//Will be called after every "myDiv" element is collected.
root.addOperation(articles);
articles.addOperation(myDiv);
root.addOperation(posts);
posts.addOperation(myDiv)
await scraper.scrape(root);
```
Description: "Go to https://www.nice-site/some-section; Open every article link; Collect each .myDiv; Call getElementContent()".
"Also, from https://www.nice-site/some-section, open every post; Before scraping the children(myDiv object), call getPageResponse(); CollCollect each .myDiv".
#### Add additional conditions
In some cases, using the cheerio selectors isn't enough to properly filter the DOM nodes. This is where the "condition" hook comes in. Both OpenLinks and DownloadContent can register a function with this hook, allowing you to decide if this DOM node should be scraped, by returning true or false.
```javascript
const { Scraper, Root, OpenLinks } = require('nodejs-web-scraper');
const config = {
baseSiteUrl: 'https://www.nice-site',
startUrl: 'https://www.nice-site/some-section',
}
/**
* Will be called for each node collected by cheerio, in the given operation(OpenLinks or DownloadContent)
*/
const condition = (cheerioNode) => {
//Note that cheerioNode contains other useful methods, like html(), hasClass(), parent(), attr() and more.
const text = cheerioNode.text().trim();//Get the innerText of the tag.
if(text === 'some text i am looking for'){//Even though many links might fit the querySelector, Only those that have this innerText,
// will be "opened".
return true
}
}
const scraper = new Scraper(config);
const root = new Root();
//Let's assume this page has many links with the same CSS class, but not all are what we need.
const linksToOpen = new OpenLinks('some-css-class-that-is-just-not-enough',{condition});
root.addOperation(linksToOpen);
await scraper.scrape(root);
```
#### Scraping an auth protected site
Please refer to this guide: https://nodejs-web-scraper.ibrod83.com/blog/2020/05/23/crawling-subscription-sites/
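If you just need to reuse an existing login session, a common approach is to copy the session cookie from a logged-in browser and send it with every request. The sketch below is a minimal, hypothetical example: it assumes the scraper config accepts a `headers` object that is forwarded with each request (check the options of your installed version), and the site URL, cookie name and selector are placeholders.

```javascript
const { Scraper, Root, CollectContent } = require('nodejs-web-scraper');

(async () => {
    const config = {
        baseSiteUrl: 'https://www.some-members-site.com',//Hypothetical site, for illustration only.
        startUrl: 'https://www.some-members-site.com/members-area/',
        //Assumption: the config accepts a "headers" object that is sent with every request.
        //Paste the session cookie from a logged-in browser session.
        headers: { Cookie: 'sessionid=YOUR_SESSION_COOKIE' },
        logPath: './logs/'
    }
    const scraper = new Scraper(config);
    const root = new Root();
    const titles = new CollectContent('h1', { name: 'title' });//Collect something simple, just to verify the session is accepted.
    root.addOperation(titles);
    await scraper.scrape(root);
    console.log(titles.getData());
})();
```

For logging in programmatically and other cases, see the guide linked above.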
## API
#### class Scraper(config)
The main nodejs-web-scraper object. Starts the entire scraping process via Scraper.scrape(Root). Holds the configuration and global state.
These are the available options for the scraper, with their default values:
```javascript
const config ={
baseSiteUrl: '',//Mandatory. If your site sits in a subfolder, provide the path WITHOUT it.
startUrl: '',//Mandatory. The page from which the process begins.
logPath: null,//Highly recommended. Will create a log for each scraping operation (object).
cloneFiles: true,//If an image with the same name exists, a new file with a number appended to it is created. Otherwise, it's overwritten.
removeStyleAndScriptTags: true,// Removes any <style> and <script> tags found on the page, so Cheerio is served a lighter string.