Very straightforward, event driven web crawler. Features a flexible queue interface and a basic cache mechanism with extensible backend.
```sh
npm install simplecrawler
```





simplecrawler is designed to provide a basic, flexible and robust API for crawling websites. It was written to archive, analyse, and search some very large websites and has happily chewed through hundreds of thousands of pages and written tens of gigabytes to disk without issue.
* Provides a very simple event driven API using EventEmitter
* Extremely configurable base for writing your own crawler
* Provides some simple logic for auto-detecting linked resources - which you can replace or augment
* Automatically respects any robots.txt rules
* Has a flexible queue system which can be frozen to disk and defrosted
* Provides basic statistics on network performance
* Uses buffers for fetching and managing data, preserving binary data (except when discovering links)
- Installation
- Getting started
- Events
- A note about HTTP error conditions
- Waiting for asynchronous event listeners
- Configuration
- Fetch conditions
- Download conditions
- The queue
- Manually adding to the queue
- Queue items
- Queue statistics and reporting
- Saving and reloading the queue (freeze/defrost)
- Cookies
- Cookie events
- Link Discovery
- FAQ/Troubleshooting
- Node Support Policy
- Current Maintainers
- Contributing
- Contributors
- License
## Installation

```sh
npm install --save simplecrawler
```
## Getting started

Initializing simplecrawler is a simple process. First, you require the module and instantiate it with a single argument. You then configure the properties you like (e.g. the request interval), register a few event listeners, and call the start method. Let's walk through the process!
After requiring the crawler, we create a new instance of it. We supply the constructor with a URL that indicates which domain to crawl and which resource to fetch first.
```js
var Crawler = require("simplecrawler");

var crawler = new Crawler("http://www.example.com/");
```
You can initialize the crawler with or without the new operator. Being able to skip it comes in handy when you want to chain API calls.
```js
var crawler = Crawler("http://www.example.com/")
    .on("fetchcomplete", function () {
        console.log("Fetched a resource!");
    });
```
By default, the crawler will only fetch resources on the same domain as that in the URL passed to the constructor. But this can be changed through the crawler.domainWhitelist property.
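For example, a quick sketch of allowing one extra domain (the hostname here is just a placeholder):

```js
// Allow an additional hostname besides the one in the initial URL.
// "assets.example.com" is a placeholder domain for illustration.
crawler.domainWhitelist = ["assets.example.com"];
```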
Now, let's configure a few more things before we start crawling. You probably want to avoid taking down your web server, so decrease the concurrency from five simultaneous requests and increase the request interval from the default 250 ms like this:
```js
crawler.interval = 10000; // Ten seconds
crawler.maxConcurrency = 3;
```
You can also define a max depth for links to fetch:
```js
crawler.maxDepth = 1; // Only first page is fetched (with linked CSS & images)
// Or:
crawler.maxDepth = 2; // First page and discovered links from it are fetched
// Or:
crawler.maxDepth = 3; // Etc.
```
For a full list of configurable properties, see the configuration section.
You'll also need to set up event listeners for the events you want to listen to. The fetchcomplete and complete events are good places to start.
```js
crawler.on("fetchcomplete", function(queueItem, responseBuffer, response) {
    console.log("I just received %s (%d bytes)", queueItem.url, responseBuffer.length);
    console.log("It was a resource of type %s", response.headers['content-type']);
});
```
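You'll probably also want to know when the crawl has finished. A minimal listener for the complete event (documented below) could look like this:

```js
// Fired once every resource in the queue has been dealt with.
crawler.on("complete", function() {
    console.log("Finished crawling!");
});
```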
Then, when you're satisfied and ready to go, start the crawler! It'll run through its queue finding linked resources on the domain to download, until it can't find any more.
```js
crawler.start();
```
## Events

simplecrawler's API is event driven, and there are plenty of events emitted during the different stages of the crawl.
#### "crawlstart"
Fired when the crawl starts. This event gives you the opportunity to
adjust the crawler's configuration, since the crawl won't actually start
until the next process tick.
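For example, a small sketch of tweaking configuration from a crawlstart listener:

```js
crawler.on("crawlstart", function() {
    // The crawl won't actually begin until the next tick,
    // so configuration changes made here still take effect.
    crawler.maxDepth = 2;
    console.log("Crawl starting...");
});
```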
#### "discoverycomplete" (queueItem, resources)
Fired when the discovery of linked resources has completed
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item that represents the document for the discovered resources |
| resources | Array | An array of discovered and cleaned URLs |
#### "invaliddomain" (queueItem)
Fired when a resource wasn't queued because of an invalid domain name
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item representing the disallowed URL |
#### "fetchdisallowed" (queueItem)
Fired when a resource wasn't queued because it was disallowed by the
site's robots.txt rules
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item representing the disallowed URL |
#### "fetchconditionerror" (queueItem, error)
Fired when a fetch condition returns an error
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item that was processed when the error was encountered |
| error | \* | The error returned by the fetch condition |
#### "fetchprevented" (queueItem, fetchCondition)
Fired when a fetch condition prevented the queueing of a URL
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item that didn't pass the fetch conditions |
| fetchCondition | function | The first fetch condition that returned false |
#### "queueduplicate" (queueItem)
Fired when a new queue item was rejected because another
queue item with the same URL was already in the queue
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item that was rejected |
#### "queueerror" (error, queueItem)
Fired when an error was encountered while updating a queue item
| Param | Type | Description |
| --- | --- | --- |
| error | Error | The error that was returned by the queue |
| queueItem | QueueItem | The queue item that the crawler tried to update when it encountered the error |
#### "queueadd" (queueItem, referrer)
Fired when an item was added to the crawler's queue
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item that was added to the queue |
| referrer | QueueItem | The queue item representing the resource where the new queue item was found |
#### "fetchtimeout" (queueItem, timeout)
Fired when a request times out
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item for which the request timed out |
| timeout | Number | The delay in milliseconds after which the request timed out |
#### "fetchclienterror" (queueItem, error)
Fired when a request encounters an unknown error
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item for which the request has errored |
| error | Object | The error supplied to the error event on the request |
#### "fetchstart" (queueItem, requestOptions)
Fired just after a request has been initiated
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item for which the request has been initiated |
| requestOptions | Object | The options generated for the HTTP request |
#### "cookieerror" (queueItem, error, cookie)
Fired when an error was encountered while trying to add a
cookie to the cookie jar
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item representing the resource that returned the cookie |
| error | Error | The error that was encountered |
| cookie | String | The Set-Cookie header value that was returned from the request |
#### "fetchheaders" (queueItem, response)
Fired when the headers for a request have been received
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item for which the headers have been received |
| response | http.IncomingMessage | The http.IncomingMessage for the request's response |
#### "downloadconditionerror" (queueItem, error)
Fired when a download condition returns an error
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item that was processed when the error was encountered |
| error | \* | The error returned by the download condition |
#### "downloadprevented" (queueItem, response)
Fired when the downloading of a resource was prevented
by a download condition
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item representing the resource that was halfway fetched |
| response | http.IncomingMessage | The http.IncomingMessage for the request's response |
#### "notmodified" (queueItem, response, cacheObject)
Fired when the crawler's cache was enabled and the server responded with a 304 Not Modified status for the request
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item for which the request returned a 304 status |
| response | http.IncomingMessage | The http.IncomingMessage for the request's response |
| cacheObject | CacheObject | The CacheObject returned from the cache backend |
#### "fetchredirect" (queueItem, redirectQueueItem, response)
Fired when the server returned a redirect HTTP status for the request
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item for which the request was redirected |
| redirectQueueItem | QueueItem | The queue item for the redirect target resource |
| response | http.IncomingMessage | The http.IncomingMessage for the request's response |
#### "fetch404" (queueItem, response)
Fired when the server returned a 404 Not Found status for the request
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item for which the request returned a 404 status |
| response | http.IncomingMessage | The http.IncomingMessage for the request's response |
#### "fetch410" (queueItem, response)
Fired when the server returned a 410 Gone status for the request
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item for which the request returned a 410 status |
| response | http.IncomingMessage | The http.IncomingMessage for the request's response |
#### "fetcherror" (queueItem, response)
Fired when the server returned a status code above 400 that isn't 404 or 410
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item for which the request failed |
| response | http.IncomingMessage | The http.IncomingMessage for the request's response |
#### "fetchcomplete" (queueItem, responseBody, response)
Fired when the request has completed
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item for which the request has completed |
| responseBody | String \| Buffer | If decodeResponses is true, this will be the decoded HTTP response. Otherwise it will be the raw response buffer. |
| response | http.IncomingMessage | The http.IncomingMessage for the request's response |
#### "gziperror" (queueItem, responseBody, response)
Fired when an error was encountered while unzipping the response data
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item for which the unzipping failed |
| responseBody | String \| Buffer | If decodeResponses is true, this will be the decoded HTTP response. Otherwise it will be the raw response buffer. |
| response | http.IncomingMessage | The http.IncomingMessage for the request's response |
#### "fetchdataerror" (queueItem, response)
Fired when a resource couldn't be downloaded because it exceeded the maximum allowed size
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item for which the request failed |
| response | http.IncomingMessage | The http.IncomingMessage for the request's response |
#### "robotstxterror" (error)
Fired when an error was encountered while retrieving a robots.txt file
| Param | Type | Description |
| --- | --- | --- |
| error | Error | The error returned from getRobotsTxt |
#### "complete"
Fired when the crawl has completed - all resources in the queue have been dealt with
## A note about HTTP error conditions

By default, simplecrawler does not download the response body when it encounters an HTTP error status in the response. If you need this information, you can listen to simplecrawler's error events, and through node's native data event (response.on("data", function(chunk) {...})) you can save the information yourself.
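As a sketch of that approach, a listener on the fetcherror event (described above) could collect the body manually; exactly what you do with it is up to you:

```js
crawler.on("fetcherror", function(queueItem, response) {
    var chunks = [];

    // The response is a plain http.IncomingMessage, so its body can be
    // collected with the standard "data"/"end" stream events.
    response.on("data", function(chunk) {
        chunks.push(chunk);
    });

    response.on("end", function() {
        var body = Buffer.concat(chunks);
        console.log("Error body for %s (%d bytes)", queueItem.url, body.length);
    });
});
```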
## Waiting for asynchronous event listeners

Sometimes, you might want simplecrawler to wait for you while you perform some asynchronous tasks in an event listener, instead of having it race off and fire the complete event, halting your crawl. For example, you may be doing your own link discovery using an asynchronous library method.

simplecrawler provides a wait method you can call at any time. It is available via this from inside listeners, and on the crawler object itself. It returns a callback function.

Once you've called this method, simplecrawler will not fire the complete event until either you execute the callback it returns, or a timeout is reached (configured in crawler.listenerTTL, 10000 ms by default).
#### Example asynchronous event listener
```js
crawler.on("fetchcomplete", function(queueItem, data, res) {
    // "continue" is a reserved word in JavaScript, so store the
    // callback returned by wait() under a different name.
    var resume = this.wait();

    doSomeDiscovery(data, function(foundURLs) {
        foundURLs.forEach(function(url) {
            crawler.queueURL(url, queueItem);
        });

        resume();
    });
});
```
## Configuration

simplecrawler is highly configurable and there's a long list of settings you can change to adapt it to your specific needs.
#### crawler.initialURL : String
Controls which URL to request first
#### crawler.host : String
Determines what hostname the crawler should limit requests to (so long as
filterByDomain is true)
#### crawler.interval : Number
Determines the interval at which new requests are spawned by the crawler,
as long as the number of open requests is under the
maxConcurrency cap.
#### crawler.maxConcurrency : Number
Maximum request concurrency. If necessary, simplecrawler will increase
node's http agent maxSockets value to match this setting.
#### crawler.timeout : Number
Maximum time we'll wait for headers
#### crawler.listenerTTL : Number
Maximum time we'll wait for async listeners
#### crawler.userAgent : String
Crawler's user agent string
Default: "Node/simplecrawler <version> (https://github.com/simplecrawler/simplecrawler)"
#### crawler.queue : FetchQueue
Queue for requests. The crawler can use any implementation so long as it
uses the same interface. The default queue is simply backed by an array.
#### crawler.respectRobotsTxt : Boolean
Controls whether the crawler respects the robots.txt rules of any domain.
This is done both with regards to the robots.txt file, and `<meta>` tags
that specify a `nofollow` value for `robots`. The latter only applies if
the default discoverResources method is used, though.
#### crawler.allowInitialDomainChange : Boolean
Controls whether the crawler is allowed to change the
host setting if the first response is a redirect to
another domain.
#### crawler.decompressResponses : Boolean
Controls whether HTTP responses are automatically decompressed based on
their Content-Encoding header. If true, it will also assign the
appropriate Accept-Encoding header to requests.
#### crawler.decodeResponses : Boolean
Controls whether HTTP responses are automatically character converted to
standard JavaScript strings using the iconv-lite
module before emitted in the fetchcomplete event.
The character encoding is interpreted from the Content-Type header
firstly, and secondly from any `<meta charset>` tags.
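For example, if you'd rather receive decoded strings than raw buffers in fetchcomplete, you could enable both settings:

```js
// Decompress responses based on Content-Encoding, and decode them
// to JavaScript strings before fetchcomplete is emitted.
crawler.decompressResponses = true;
crawler.decodeResponses = true;
```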
#### crawler.filterByDomain : Boolean
Controls whether the crawler fetches only URLs where the hostname
matches host. Unless you want to be crawling the entire
internet, I would recommend leaving this on!
#### crawler.scanSubdomains : Boolean
Controls whether URLs that point to a subdomain of host
should also be fetched.
#### crawler.ignoreWWWDomain : Boolean
Controls whether to treat the www subdomain as the same domain as
host. So if http://example.com/example has
already been fetched, http://www.example.com/example won't be
fetched also.
#### crawler.stripWWWDomain : Boolean
Controls whether to strip the www subdomain entirely from URLs at queue
item construction time.
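Taken together, the domain-related settings above could be combined like this sketch, which widens the crawl to subdomains while treating www as the same host:

```js
crawler.filterByDomain = true;   // only fetch URLs whose hostname matches crawler.host
crawler.scanSubdomains = true;   // also fetch subdomains of crawler.host
crawler.ignoreWWWDomain = true;  // treat www.example.com and example.com as the same domain
crawler.stripWWWDomain = false;  // keep any www prefix on queued URLs
```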
#### crawler.cache : SimpleCache
Internal cache store. Must implement SimpleCache interface. You can
save the site to disk using the built in file system cache like this:
```js
crawler.cache = new Crawler.cache('pathToCacheDirectory');
```
#### crawler.useProxy : Boolean
Controls whether an HTTP proxy should be used for requests
#### crawler.proxyHostname : String
If useProxy is true, this setting controls what hostname
to use for the proxy
#### crawler.proxyPort : Number
If useProxy is true, this setting controls what port to
use for the proxy
#### crawler.proxyUser : String
If useProxy is true, this setting controls what username
to use for the proxy
#### crawler.proxyPass : String
If useProxy is true, this setting controls what password
to use for the proxy
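As a sketch, routing requests through a proxy (the hostname and port are placeholders) would look like:

```js
crawler.useProxy = true;
crawler.proxyHostname = "proxy.example.com"; // placeholder hostname
crawler.proxyPort = 8080;                    // placeholder port
// proxyUser and proxyPass can also be set if the proxy requires authentication.
```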
#### crawler.needsAuth : Boolean
Controls whether to use HTTP Basic Auth
#### crawler.authUser : String
If needsAuth is true, this setting controls what username
to send with HTTP Basic Auth
#### crawler.authPass : String
If needsAuth is true, this setting controls what password
to send with HTTP Basic Auth
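Similarly, a minimal sketch for HTTP Basic Auth (the credentials are placeholders):

```js
crawler.needsAuth = true;
crawler.authUser = "username"; // placeholder credentials
crawler.authPass = "password";
```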
#### crawler.acceptCookies : Boolean
Controls whether to save and send cookies or not
#### crawler.cookies : CookieJar
The module used to store cookies
#### crawler.customHeaders : Object
Controls what headers (besides the default ones) to include with every
request.
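For instance, a sketch of sending an extra header with every request (the header name and value are placeholders):

```js
crawler.customHeaders = {
    "X-Example-Header": "example-value"
};
```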
#### crawler.domainWhitelist : Array
Controls what domains the crawler is allowed to fetch from, regardless of
host or filterByDomain settings.
#### crawler.allowedProtocols : Array.<RegExp>
Controls what protocols the crawler is allowed to fetch from
#### crawler.maxResourceSize : Number
Controls the maximum allowed size in bytes of resources to be fetched
#### crawler.supportedMimeTypes : Array.<(RegExp\|string)>
Controls what mimetypes the crawler will scan for new resources. If
downloadUnsupported is false, this setting will also
restrict what resources are downloaded.
#### crawler.downloadUnsupported : Boolean
Controls whether to download resources with unsupported mimetypes (as
specified by supportedMimeTypes)
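As a sketch, restricting scanning (and, with downloadUnsupported set to false, downloading) to HTML and CSS might look like:

```js
crawler.supportedMimeTypes = [
    /^text\/html/i,
    /^text\/css/i
];
crawler.downloadUnsupported = false;
```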
#### crawler.urlEncoding : String
Controls what URL encoding to use. Can be either "unicode" or "iso8859"
#### crawler.stripQuerystring : Boolean
Controls whether to strip query string parameters from URLs at queue
item construction time.
#### crawler.sortQueryParameters : Boolean
Controls whether to sort query string parameters in URLs at queue
item construction time.
#### crawler.discoverRegex : Array.<(RegExp\|function())>
Collection of regular expressions and functions that are applied in the
default discoverResources method.
#### crawler.parseHTMLComments : Boolean
Controls whether the default discoverResources should
scan for new resources inside of HTML comments.
#### crawler.parseScriptTags : Boolean
Controls whether the default discoverResources should
scan for new resources inside of `<script>` tags.