Very straightforward, event driven web crawler. Features a flexible queue interface and a basic cache mechanism with extensible backend.
```sh
npm install simplecrawler
```





simplecrawler is designed to provide a basic, flexible and robust API for crawling websites. It was written to archive, analyse, and search some very large websites and has happily chewed through hundreds of thousands of pages and written tens of gigabytes to disk without issue.
* Provides a very simple event driven API using EventEmitter
* Extremely configurable base for writing your own crawler
* Provides some simple logic for auto-detecting linked resources - which you can replace or augment
* Automatically respects any robots.txt rules
* Has a flexible queue system which can be frozen to disk and defrosted
* Provides basic statistics on network performance
* Uses buffers for fetching and managing data, preserving binary data (except when discovering links)
- Installation
- Getting started
- Events
- A note about HTTP error conditions
- Waiting for asynchronous event listeners
- Configuration
- Fetch conditions
- Download conditions
- The queue
- Manually adding to the queue
- Queue items
- Queue statistics and reporting
- Saving and reloading the queue (freeze/defrost)
- Cookies
- Cookie events
- Link Discovery
- FAQ/Troubleshooting
- Node Support Policy
- Current Maintainers
- Contributing
- Contributors
- License
## Installation

```sh
npm install --save simplecrawler
```
## Getting started

Initializing simplecrawler is a simple process. First, you require the module and instantiate it with a single argument. You then configure the properties you like (e.g. the request interval), register a few event listeners, and call the start method. Let's walk through the process!
After requiring the crawler, we create a new instance of it. We supply the constructor with a URL that indicates which domain to crawl and which resource to fetch first.
```js
var Crawler = require("simplecrawler");

var crawler = new Crawler("http://www.example.com/");
```
You can initialize the crawler with or without the new operator. Being able to skip it comes in handy when you want to chain API calls.
```js
var crawler = Crawler("http://www.example.com/")
    .on("fetchcomplete", function () {
        console.log("Fetched a resource!");
    });
```
By default, the crawler will only fetch resources on the same domain as that in the URL passed to the constructor. But this can be changed through the crawler.domainWhitelist property.
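For example, a quick sketch of allowing one extra domain (the hostname here is just a placeholder):

```js
// Allow an additional hostname besides the one in the initial URL.
// "assets.example.com" is a placeholder domain for illustration.
crawler.domainWhitelist = ["assets.example.com"];
```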
Now, let's configure a few more things before we start crawling. You probably want to avoid taking down your web server, so decrease the concurrency from five simultaneous requests and increase the request interval from the default 250 ms like this:
```js
crawler.interval = 10000; // Ten seconds
crawler.maxConcurrency = 3;
```
You can also define a max depth for links to fetch:
```js
crawler.maxDepth = 1; // Only first page is fetched (with linked CSS & images)
// Or:
crawler.maxDepth = 2; // First page and discovered links from it are fetched
// Or:
crawler.maxDepth = 3; // Etc.
```
For a full list of configurable properties, see the configuration section.
You'll also need to set up event listeners for the events you want to listen to. The fetchcomplete and complete events are good places to start.
```js
crawler.on("fetchcomplete", function(queueItem, responseBuffer, response) {
    console.log("I just received %s (%d bytes)", queueItem.url, responseBuffer.length);
    console.log("It was a resource of type %s", response.headers['content-type']);
});
```
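You'll probably also want to know when the crawl has finished. A minimal listener for the complete event (documented below) could look like this:

```js
// Fired once every resource in the queue has been dealt with.
crawler.on("complete", function() {
    console.log("Finished crawling!");
});
```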
Then, when you're satisfied and ready to go, start the crawler! It'll run through its queue finding linked resources on the domain to download, until it can't find any more.
```js
crawler.start();
```
## Events

simplecrawler's API is event driven, and there are plenty of events emitted during the different stages of the crawl.
#### "crawlstart"
Fired when the crawl starts. This event gives you the opportunity to
adjust the crawler's configuration, since the crawl won't actually start
until the next process tick.
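For example, a small sketch of tweaking configuration from a crawlstart listener:

```js
crawler.on("crawlstart", function() {
    // The crawl won't actually begin until the next tick,
    // so configuration changes made here still take effect.
    crawler.maxDepth = 2;
    console.log("Crawl starting...");
});
```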
#### "discoverycomplete" (queueItem, resources)
Fired when the discovery of linked resources has completed
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item that represents the document for the discovered resources |
| resources | Array | An array of discovered and cleaned URLs |
#### "invaliddomain" (queueItem)
Fired when a resource wasn't queued because of an invalid domain name
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item representing the disallowed URL |
#### "fetchdisallowed" (queueItem)
Fired when a resource wasn't queued because it was disallowed by the
site's robots.txt rules
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item representing the disallowed URL |
#### "fetchconditionerror" (queueItem, error)
Fired when a fetch condition returns an error
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item that was processed when the error was encountered |
| error | \* | The error returned by the fetch condition |
#### "fetchprevented" (queueItem, fetchCondition)
Fired when a fetch condition prevented the queueing of a URL
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item that didn't pass the fetch conditions |
| fetchCondition | function | The first fetch condition that returned false |
#### "queueduplicate" (queueItem)
Fired when a new queue item was rejected because another
queue item with the same URL was already in the queue
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item that was rejected |
#### "queueerror" (error, queueItem)
Fired when an error was encountered while updating a queue item
| Param | Type | Description |
| --- | --- | --- |
| error | Error | The error that was returned by the queue |
| queueItem | QueueItem | The queue item that the crawler tried to update when it encountered the error |
#### "queueadd" (queueItem, referrer)
Fired when an item was added to the crawler's queue
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item that was added to the queue |
| referrer | QueueItem | The queue item representing the resource where the new queue item was found |
#### "fetchtimeout" (queueItem, timeout)
Fired when a request times out
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item for which the request timed out |
| timeout | Number | The delay in milliseconds after which the request timed out |
#### "fetchclienterror" (queueItem, error)
Fired when a request encounters an unknown error
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item for which the request has errored |
| error | Object | The error supplied to the error event on the request |
#### "fetchstart" (queueItem, requestOptions)
Fired just after a request has been initiated
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item for which the request has been initiated |
| requestOptions | Object | The options generated for the HTTP request |
#### "cookieerror" (queueItem, error, cookie)
Fired when an error was encountered while trying to add a
cookie to the cookie jar
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item representing the resource that returned the cookie |
| error | Error | The error that was encountered |
| cookie | String | The Set-Cookie header value that was returned from the request |
#### "fetchheaders" (queueItem, response)
Fired when the headers for a request have been received
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item for which the headers have been received |
| response | http.IncomingMessage | The http.IncomingMessage for the request's response |
#### "downloadconditionerror" (queueItem, error)
Fired when a download condition returns an error
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item that was processed when the error was encountered |
| error | \* | The error returned by the download condition |
#### "downloadprevented" (queueItem, response)
Fired when the downloading of a resource was prevented
by a download condition
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item representing the resource that was halfway fetched |
| response | http.IncomingMessage | The http.IncomingMessage for the request's response |
#### "notmodified" (queueItem, response, cacheObject)
Fired when the crawler's cache was enabled and the server responded with a 304 Not Modified status for the request
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item for which the request returned a 304 status |
| response | http.IncomingMessage | The http.IncomingMessage for the request's response |
| cacheObject | CacheObject | The CacheObject returned from the cache backend |
#### "fetchredirect" (queueItem, redirectQueueItem, response)
Fired when the server returned a redirect HTTP status for the request
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item for which the request was redirected |
| redirectQueueItem | QueueItem | The queue item for the redirect target resource |
| response | http.IncomingMessage | The http.IncomingMessage for the request's response |
#### "fetch404" (queueItem, response)
Fired when the server returned a 404 Not Found status for the request
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item for which the request returned a 404 status |
| response | http.IncomingMessage | The http.IncomingMessage for the request's response |
#### "fetch410" (queueItem, response)
Fired when the server returned a 410 Gone status for the request
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item for which the request returned a 410 status |
| response | http.IncomingMessage | The http.IncomingMessage for the request's response |
#### "fetcherror" (queueItem, response)
Fired when the server returned a status code above 400 that isn't 404 or 410
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item for which the request failed |
| response | http.IncomingMessage | The http.IncomingMessage for the request's response |
#### "fetchcomplete" (queueItem, responseBody, response)
Fired when the request has completed
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item for which the request has completed |
| responseBody | String \| Buffer | If decodeResponses is true, this will be the decoded HTTP response. Otherwise it will be the raw response buffer. |
| response | http.IncomingMessage | The http.IncomingMessage for the request's response |
#### "gziperror" (queueItem, responseBody, response)
Fired when an error was encountered while unzipping the response data
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item for which the unzipping failed |
| responseBody | String \| Buffer | If decodeResponses is true, this will be the decoded HTTP response. Otherwise it will be the raw response buffer. |
| response | http.IncomingMessage | The http.IncomingMessage for the request's response |
#### "fetchdataerror" (queueItem, response)
Fired when a resource couldn't be downloaded because it exceeded the maximum allowed size
| Param | Type | Description |
| --- | --- | --- |
| queueItem | QueueItem | The queue item for which the request failed |
| response | http.IncomingMessage | The http.IncomingMessage for the request's response |
#### "robotstxterror" (error)
Fired when an error was encountered while retrieving a robots.txt file
| Param | Type | Description |
| --- | --- | --- |
| error | Error | The error returned from getRobotsTxt |
#### "complete"
Fired when the crawl has completed - all resources in the queue have been dealt with
## A note about HTTP error conditions

By default, simplecrawler does not download the response body when it encounters an HTTP error status in the response. If you need this information, you can listen to simplecrawler's error events, and through node's native data event (response.on("data", function(chunk) {...})) you can save the information yourself.
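As a sketch of that approach, a listener on the fetcherror event (described above) could collect the body manually; exactly what you do with it is up to you:

```js
crawler.on("fetcherror", function(queueItem, response) {
    var chunks = [];

    // The response is a plain http.IncomingMessage, so its body can be
    // collected with the standard "data"/"end" stream events.
    response.on("data", function(chunk) {
        chunks.push(chunk);
    });

    response.on("end", function() {
        var body = Buffer.concat(chunks);
        console.log("Error body for %s (%d bytes)", queueItem.url, body.length);
    });
});
```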
## Waiting for asynchronous event listeners

Sometimes, you might want simplecrawler to wait for you while you perform some asynchronous tasks in an event listener, instead of having it race off and fire the complete event, halting your crawl. For example, you may be doing your own link discovery using an asynchronous library method.

simplecrawler provides a wait method you can call at any time. It is available via this from inside listeners, and on the crawler object itself. It returns a callback function.

Once you've called this method, simplecrawler will not fire the complete event until either you execute the callback it returns, or a timeout is reached (configured in crawler.listenerTTL, 10000 ms by default).
#### Example asynchronous event listener
```js
crawler.on("fetchcomplete", function(queueItem, data, res) {
    // "continue" is a reserved word in JavaScript, so store the
    // callback returned by wait() under a different name.
    var resume = this.wait();

    doSomeDiscovery(data, function(foundURLs) {
        foundURLs.forEach(function(url) {
            crawler.queueURL(url, queueItem);
        });

        resume();
    });
});
```
## Configuration

simplecrawler is highly configurable and there's a long list of settings you can change to adapt it to your specific needs.
#### crawler.initialURL : String
Controls which URL to request first
#### crawler.host : String
Determines what hostname the crawler should limit requests to (so long as
filterByDomain is true)
#### crawler.interval : Number
Determines the interval at which new requests are spawned by the crawler,
as long as the number of open requests is under the
maxConcurrency cap.
#### crawler.maxConcurrency : Number
Maximum request concurrency. If necessary, simplecrawler will increase
node's http agent maxSockets value to match this setting.
#### crawler.timeout : Number
Maximum time we'll wait for headers
#### crawler.listenerTTL : Number
Maximum time we'll wait for async listeners
#### crawler.userAgent : String
Crawler's user agent string
Default: "Node/simplecrawler <version> (https://github.com/simplecrawler/simplecrawler)"
#### crawler.queue : FetchQueue
Queue for requests. The crawler can use any implementation so long as it
uses the same interface. The default queue is simply backed by an array.
#### crawler.respectRobotsTxt : Boolean
Controls whether the crawler respects the robots.txt rules of any domain.
This is done both with regards to the robots.txt file, and `<meta>` tags
that specify a `nofollow` value for `robots`. The latter only applies if
the default discoverResources method is used, though.
#### crawler.allowInitialDomainChange : Boolean
Controls whether the crawler is allowed to change the
host setting if the first response is a redirect to
another domain.
#### crawler.decompressResponses : Boolean
Controls whether HTTP responses are automatically decompressed based on
their Content-Encoding header. If true, it will also assign the
appropriate Accept-Encoding header to requests.
#### crawler.decodeResponses : Boolean
Controls whether HTTP responses are automatically character converted to
standard JavaScript strings using the iconv-lite
module before emitted in the fetchcomplete event.
The character encoding is interpreted from the Content-Type header
firstly, and secondly from any `<meta charset>` tags.
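For example, if you'd rather receive decoded strings than raw buffers in fetchcomplete, you could enable both settings:

```js
// Decompress responses based on Content-Encoding, and decode them
// to JavaScript strings before fetchcomplete is emitted.
crawler.decompressResponses = true;
crawler.decodeResponses = true;
```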
#### crawler.filterByDomain : Boolean
Controls whether the crawler fetches only URLs where the hostname
matches host. Unless you want to be crawling the entire
internet, I would recommend leaving this on!
#### crawler.scanSubdomains : Boolean
Controls whether URLs that point to a subdomain of host
should also be fetched.
#### crawler.ignoreWWWDomain : Boolean
Controls whether to treat the www subdomain as the same domain as
host. So if http://example.com/example has
already been fetched, http://www.example.com/example won't be
fetched also.
#### crawler.stripWWWDomain : Boolean
Controls whether to strip the www subdomain entirely from URLs at queue
item construction time.
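Taken together, the domain-related settings above could be combined like this sketch, which widens the crawl to subdomains while treating www as the same host:

```js
crawler.filterByDomain = true;   // only fetch URLs whose hostname matches crawler.host
crawler.scanSubdomains = true;   // also fetch subdomains of crawler.host
crawler.ignoreWWWDomain = true;  // treat www.example.com and example.com as the same domain
crawler.stripWWWDomain = false;  // keep any www prefix on queued URLs
```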
#### crawler.cache : SimpleCache
Internal cache store. Must implement SimpleCache interface. You can
save the site to disk using the built in file system cache like this:
```js
crawler.cache = new Crawler.cache('pathToCacheDirectory');
```
#### crawler.useProxy : Boolean
Controls whether an HTTP proxy should be used for requests
#### crawler.proxyHostname : String
If useProxy is true, this setting controls what hostname
to use for the proxy
#### crawler.proxyPort : Number
If useProxy is true, this setting controls what port to
use for the proxy
#### crawler.proxyUser : String
If useProxy is true, this setting controls what username
to use for the proxy
#### crawler.proxyPass : String
If useProxy is true, this setting controls what password
to use for the proxy
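As a sketch, routing requests through a proxy (the hostname and port are placeholders) would look like:

```js
crawler.useProxy = true;
crawler.proxyHostname = "proxy.example.com"; // placeholder hostname
crawler.proxyPort = 8080;                    // placeholder port
// proxyUser and proxyPass can also be set if the proxy requires authentication.
```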
#### crawler.needsAuth : Boolean
Controls whether to use HTTP Basic Auth
#### crawler.authUser : String
If needsAuth is true, this setting controls what username
to send with HTTP Basic Auth
#### crawler.authPass : String
If needsAuth is true, this setting controls what password
to send with HTTP Basic Auth
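Similarly, a minimal sketch for HTTP Basic Auth (the credentials are placeholders):

```js
crawler.needsAuth = true;
crawler.authUser = "username"; // placeholder credentials
crawler.authPass = "password";
```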
#### crawler.acceptCookies : Boolean
Controls whether to save and send cookies or not
#### crawler.cookies : CookieJar
The module used to store cookies
#### crawler.customHeaders : Object
Controls what headers (besides the default ones) to include with every
request.
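For instance, a sketch of sending an extra header with every request (the header name and value are placeholders):

```js
crawler.customHeaders = {
    "X-Example-Header": "example-value"
};
```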
#### crawler.domainWhitelist : Array
Controls what domains the crawler is allowed to fetch from, regardless of
host or filterByDomain settings.
#### crawler.allowedProtocols : Array.<RegExp>
Controls what protocols the crawler is allowed to fetch from
#### crawler.maxResourceSize : Number
Controls the maximum allowed size in bytes of resources to be fetched
#### crawler.supportedMimeTypes : Array.<(RegExp\|string)>
Controls what mimetypes the crawler will scan for new resources. If
downloadUnsupported is false, this setting will also
restrict what resources are downloaded.
#### crawler.downloadUnsupported : Boolean
Controls whether to download resources with unsupported mimetypes (as
specified by supportedMimeTypes)
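As a sketch, restricting scanning (and, with downloadUnsupported set to false, downloading) to HTML and CSS might look like:

```js
crawler.supportedMimeTypes = [
    /^text\/html/i,
    /^text\/css/i
];
crawler.downloadUnsupported = false;
```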
#### crawler.urlEncoding : String
Controls what URL encoding to use. Can be either "unicode" or "iso8859"
#### crawler.stripQuerystring : Boolean
Controls whether to strip query string parameters from URLs at queue
item construction time.
#### crawler.sortQueryParameters : Boolean
Controls whether to sort query string parameters in URLs at queue
item construction time.
#### crawler.discoverRegex : Array.<(RegExp\|function())>
Collection of regular expressions and functions that are applied in the
default discoverResources method.
#### crawler.parseHTMLComments : Boolean
Controls whether the default discoverResources should
scan for new resources inside of HTML comments.
#### crawler.parseScriptTags : Boolean
Controls whether the default discoverResources should
scan for new resources inside of `<script>` tags.