Basic page crawler written in Node.js.

Installation: `npm install sauron-crawler`
Settings file (`sauron.settings.js`) example:

```js
module.exports = {
    comments: {
        customDirectory: 'Directory to store custom parsing functions',
        outputDirectory: 'Directory to store output files',
        saveDirectory: 'Directory to store save files',
    },
    customDirectory: './crawler/custom/',
    outputDirectory: './crawler/output/',
    saveDirectory: './crawler/save/',
    pluginsDirectory: './crawler/plugins/',
};
```
* all "custom" (custom.customFile) js files must be placed in the ___customDirectory___ specified above.
In the config file provide a path relative to that folder.
* Crawl based on a config file:

```
npx sauron .\configs\sample.config.js
```

* Same as above, but start with a list of URLs:

```
npx sauron .\configs\sample.config.js .\configs\list.input.json
```
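The expected format of the URL list file is not documented above; purely as a hedged illustration, assuming `list.input.json` is a plain JSON array of absolute URLs:

```json
[
    "http://example.com/",
    "http://example.com/about",
    "https://test.example.com/contact"
]
```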
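For the `custom.customFile` setting shown in the config example below (`"customFile": "custom.blank"`), the file is resolved relative to the `customDirectory` from the settings file. The sketch below is purely hypothetical: the file name, location, and export shape are assumptions, not Sauron's documented interface:

```js
// ./crawler/custom/custom.blank.js -- hypothetical file name and location
// Assumed interface: export a function the crawler can call for each page.
module.exports = function customParse(page) {
    // "page" and its properties are assumptions used only for illustration.
    return { title: page && page.title };
};
```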
Config file example:

```js
module.exports = {
    "id": "projectId",
    "startURL": "http://example.com",
    "sitemapURL": "http://example.com/sitemap.xml",
    "output": "json",
    "storeDefaultData": true,
    "custom": {
        "useCustom": true,
        "customFile": "custom.blank"
    },
    // !important: When using a regex in allowedDomains, do not use the "g" flag.
    // Using the "g" flag with a reused regex in JavaScript can make you lose
    // matches because it makes the regex stateful via lastIndex, so each new
    // call starts matching from where the previous one left off instead of
    // from the beginning of the string.
    "allowedDomains": [
        "example.com",
        "test.example.com"
    ],
    "allowedProtocols": [
        "http:",
        "https:"
    ],
    "dedupeProtocol": true,
    "allowLinksFrom": {
        "pattern": "^.*",
        "pathnameAllow": [],
        "pathnameDeny": []
    },
    "crawlLinks": {
        "pattern": "^.*",
        "pathnameAllow": [],
        "pathnameDeny": []
    },
    "saveCrawlData": {
        "pattern": "^.*",
        "pathnameAllow": [],
        "pathnameDeny": []
    },
    "httpAuth": {
        "enable": false,
        "user": "login",
        "pass": "pass"
    },
    "cookies": [
        {
            "key": "testCookie",
            "value": "test-value"
        }
    ],
    "customHeaders": {
        "User-Agent": "Mozilla/5.0 (SAURON NT 10.0; Win64; x64; rv:131.0) Gecko/20100101 Firefox/131.0"
    },
    "requireValidSSLCert": false,
    "saveStatusEach": 1000,
    "verbose": false,
    "requestCount": 4,
    "maxPages": -1,
    "stripGET": false,
    "timeout": 5000,
    "linksToLowercase": false
}
```
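The comment about the `g` flag in the example above describes standard JavaScript regex behavior: a `g`-flagged regex keeps state in its `lastIndex` property, so reusing the same regex object (for example as an `allowedDomains` entry checked against many URLs) can silently miss matches. A minimal demonstration, independent of Sauron itself:

```js
// A regex with the "g" flag remembers where the last match ended (lastIndex)
// and resumes searching from there on the next call, even on a new string.
const re = /example\.com/g;

console.log(re.test('https://example.com/a')); // true  (lastIndex is now 19)
console.log(re.test('https://example.com/b')); // false (search resumed at index 19)
console.log(re.test('https://example.com/b')); // true  (lastIndex was reset to 0 after the miss)
```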
Config file options:
| Option | Value | Description |
| ------------------- |:-------------:| -------------------------------------------------------------------- |
| `id`                | `string`      | Crawl ID - used in the output file name, etc. |
| `startURL`          | `string`      | URL to start the crawl from |
| `sitemapURL`        | `string`      | When provided, the crawler will use sitemap.xml to get the list of pages to crawl |
| `output`            | `string`      | Crawl output method. Allowed values: `console`, `csv`, `json`, `blank` |
| `storeDefaultData`  | `boolean`     | Store default output data (links, statusCodes, etc.) - can be disabled when `output` is set to `blank` |
| `custom` | `object` | Custom parsing actions settings |
| `allowedDomains`    | `array`       | Only domains from this array will be crawled. An empty array disables this check. Entries can be strings or regexes |
| `allowedProtocols`  | `array`       | Page protocols to crawl. Allowed values: `http:`, `https:`. An empty array disables this check. |
| `dedupeProtocol` | `boolean` | De-duplicate links based on protocol. |
| `allowLinksFrom`    | `object`      | Only links found on URLs that match the given requirements will be crawled. |
| `crawlLinks` | `object` | Only links that match given requirements will be crawled. Example pattern to exclude "/files/" path and PDF files `^(.(?!.\\/files\\/|.\\.pdf$))*` |
| `saveCrawlData` | `object` | Only links that match given requirements will be saved to output. |
| `httpAuth` | `object` | Settings for basic authentication |
| `cookies` | `array` | Array of cookies represented by an object with keys: key and value |
| `customHeaders` | `object` | Object containing custom headers to be sent with each request |
| `requireValidSSLCert` | `boolean` | Check if SSL certificates are valid |
| `saveStatusEach`    | `number`      | Save crawl status every N crawled pages, so the crawl can be aborted and continued later |
| `verbose` | `boolean` | Print more output to console |
| `requestCount` | `number` | Number of requests to be run in one batch |
| `maxPages`          | `number`      | Max pages to crawl. Set `-1` for no limit |
| `stripGET` | `boolean` | Strip GET parameters from links |
| `timeout` | `number` | Single request timeout in ms |
| `linksToLowercase`  | `boolean`     | Convert all links to lowercase before crawling |
Plugins

Sauron supports an event-based plugin system that allows you to extend and modify the crawler's behavior without changing the core codebase. Plugins are loaded automatically from the `pluginsDirectory` specified in `sauron.settings.js`.
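The plugin API is not documented in this section, so the following is only an illustration of the event-based idea; the file name, event names, handler signatures, and export shape are all assumptions rather than Sauron's actual interface:

```js
// ./crawler/plugins/log-status.plugin.js -- hypothetical file name
// Assumed shape: the plugin module exports handlers keyed by event name.
module.exports = {
    // Hypothetical event emitted after a page has been fetched.
    pageCrawled({ url, statusCode }) {
        console.log(`[log-status] ${statusCode} ${url}`);
    },
    // Hypothetical event emitted when the crawl finishes.
    crawlFinished({ pagesCrawled }) {
        console.log(`[log-status] done - ${pagesCrawled} pages crawled`);
    },
};
```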