Super configurable async web spider
```bash
npm install huntsman --save
```

```javascript
// Crawl wikipedia and use jquery syntax to extract information from the page
var huntsman = require('huntsman');
var spider = huntsman.spider();

spider.extensions = [
  huntsman.extension( 'recurse' ), // load recurse extension & follow anchor links
  huntsman.extension( 'cheerio' )  // load cheerio extension
];

// follow pages which match this uri regex
spider.on( /http:\/\/en\.wikipedia\.org\/wiki\/\w+:\w+$/, function ( err, res ){

  // use jquery-style selectors & functions
  var $ = res.extension.cheerio;
  if( !$ ) return; // content is not html

  // extract information from page body
  var wikipedia = {
    uri: res.uri,
    heading: $('h1.firstHeading').text().trim(),
    body: $('div#mw-content-text p').text().trim()
  };

  console.log( wikipedia );

});

spider.queue.add( 'http://en.wikipedia.org/wiki/Huntsman_spider' );
spider.start();
```
```bash
peter@edgy:/tmp$ node examples/html.js
{
  "uri": "http://en.wikipedia.org/wiki/Wikipedia:Recent_additions",
  "heading": "Wikipedia:Recent additions",
  "body": "This is a selection of recently created new articles and greatly expanded former stub articles on Wikipedia that were featured on the Main Page as part of Did you know? You can submit new pages for consideration. (Archives are grouped by month of Main page appearance.)Tip: To find which archive contains the fact that appeared on Did You Know?, return to the article and click \"What links here\" to the left of the article. Then, in the dropdown menu provided for namespace, choose Wikipedia and click \"Go\". When you find \"Wikipedia:Recent additions\" and a number, click it and search for the article name.\n\nCurrent archive"
}
... etc
```
More examples are available in the `/examples` directory.
---
Huntsman takes one or more 'seed' urls with the `spider.queue.add()` method.

Once the process is kicked off with `spider.start()`, it takes care of extracting links from each page and following only the pages we want.

To define which pages are crawled, use the `spider.on()` function with a string or regular expression.

Each page will only be crawled once. If multiple patterns match a uri, each matching callback will be called.

Page urls which do not match an `on` condition are never crawled.
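For example, string and regex patterns can be mixed on the same spider. The snippet below is only an illustrative sketch (not one of the bundled examples); it assumes a string pattern matches any uri containing it, as the Amazon example further down suggests, and that a page matched by several patterns triggers each callback while still being fetched only once.

```javascript
// illustrative sketch: mixing a string pattern and a regex pattern
var huntsman = require('huntsman');
var spider = huntsman.spider();

spider.extensions = [
  huntsman.extension( 'recurse' ) // follow anchor links
];

// string pattern: assumed to match any uri containing '/wiki/'
spider.on( '/wiki/', function ( err, res ){
  console.log( 'wiki page:', res.uri );
});

// regex pattern: only 'Category:' pages; these also match the string pattern above,
// so both callbacks fire for them, but each page is crawled only once
spider.on( /\/wiki\/Category:\w+$/, function ( err, res ){
  console.log( 'category page:', res.uri );
});

spider.queue.add( 'http://en.wikipedia.org/wiki/Huntsman_spider' );
spider.start();
```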
---
The spider has default settings; you can override them by passing a settings object when you create a spider.
```javascript
// use default settings
var huntsman = require('huntsman');
var spider = huntsman.spider();
```

```javascript
// override default settings
var huntsman = require('huntsman');
var spider = huntsman.spider({
  throttle: 10, // maximum requests per second
  timeout: 5000 // maximum gap of inactivity before exiting (in milliseconds)
});
```
---
How you configure your spider will vary from site to site; generally you will only be looking for pages with a specific url format.
In this example we can see that Amazon product uris all seem to share the format '/gp/product/'.
After queueing the seed uri http://www.amazon.co.uk/, huntsman will recursively follow all the product pages it finds.
```javascript
// Example of scraping products from the amazon website
var huntsman = require('huntsman');
var spider = huntsman.spider();

spider.extensions = [
  huntsman.extension( 'recurse' ), // load recurse extension & follow anchor links
  huntsman.extension( 'cheerio' )  // load cheerio extension
];

// target only product uris
spider.on( '/gp/product/', function ( err, res ){

  if( !res.extension.cheerio ) return; // content is not html
  var $ = res.extension.cheerio;

  // extract product information
  var product = {
    uri: res.uri,
    heading: $('h1.parseasinTitle').text().trim(),
    image: $('img#main-image').attr('src'),
    description: $('#productDescription').text().trim().substr( 0, 50 )
  };

  console.log( product );

});

spider.queue.add( 'http://www.amazon.co.uk/' );
spider.start();
```
More complex crawls may require you to specify hub pages to follow before you can get to the content you really want. You can add an `on` event without a callback and huntsman will still follow and extract links from the pages it matches.
```javascript
// Example of scraping information about pets for sale on craigslist in London
var huntsman = require('huntsman');
var spider = huntsman.spider({
  throttle: 2
});

spider.extensions = [
  huntsman.extension( 'recurse' ), // load recurse extension & follow anchor links
  huntsman.extension( 'cheerio' ), // load cheerio extension
  huntsman.extension( 'stats' )    // load stats extension
];

// target only pet uris
spider.on( /\/pet\/(\w+)\.html$/, function ( err, res ){

  if( !res.extension.cheerio ) return; // content is not html
  var $ = res.extension.cheerio;

  // extract listing information
  var listing = {
    heading: $('h2.postingtitle').text().trim(),
    uri: res.uri,
    image: $('img#iwi').attr('src'),
    description: $('#postingbody').text().trim().substr( 0, 50 )
  };

  console.log( listing );

});

// hub pages: followed for links, but no callback is run
spider.on( /http:\/\/london\.craigslist\.co\.uk$/ );
spider.on( /\/pet$/ );

spider.queue.add( 'http://www.craigslist.org/about/sites' );
spider.start();
```
---
Extensions have default settings; you can override them by passing an optional second argument when the extension is loaded.
```javascript
// loading an extension
spider.extensions = [
  huntsman.extension( 'extension_name', options )
];
```
The recurse extension extracts links from html pages and then adds them to the queue.
The default patterns only target anchor tags which use the http protocol; you can change any of the default patterns by declaring them when the extension is loaded.
```javascript
// default patterns
huntsman.extension( 'recurse', {
  pattern: {
    search: /a([^>]+)href\s?=\s?['"]([^'"#]+)/gi,
    refine: /['"]([^'"#]+)$/,
    filter: /^https?:\/\//
  }
})
```
- `search` must be a global regexp and is used to target the links we want to extract.
- `refine` is a regexp used to extract the bits we want from the search regex matches.
- `filter` is a regexp that must match or links are discarded.
```javascript
// extract both anchor tags and script tags
huntsman.extension( 'recurse', {
  pattern: {
    search: /(a([^>]+)href|script([^>]+)src)\s?=\s?['"]([^'"#]+)/gi, // or