Crawl data from webpages and apply content extraction.
Install:

npm install crawly-mccrawlface
// Require the package (assuming it exports the Crawler class directly).
const Crawler = require('crawly-mccrawlface');

// Create a crawler and supply the seed as a string or an array of strings.
const crawler = new Crawler('https://budick.eu');
// Or, if you want multiple domains:
// const crawler = new Crawler(['https://budick.eu', 'https://hackerberryfinn.com']);

// Start crawling.
crawler.start();

crawler.on('finished', () => {
    // The crawler has loaded all sites reachable from the seed within the seed's domain.
    // At this point you can get the content of a site by its URL.
});
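The comment above ends where the interesting part starts: reading extracted content back per URL. The accessor below is an assumption about the package API (check the docs for the exact method name); a minimal sketch:

crawler.on('finished', () => {
    // Hypothetical accessor: assumes the crawler exposes getContent(url).
    const content = crawler.getContent('https://budick.eu');
    console.log(content);
});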
Windows:
setx GOOGLE_NLP_API 1234APIKEY
or open the environment variables dialog by typing 'environment' into the Start menu search box.

Unix:
export GOOGLE_NLP_API=1234APIKEY

Or create a .env file with the content:
GOOGLE_NLP_API=1234APIKEY
To accomplish that, it uses the google-nlp-api package.
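If the package does not pick up the .env file on its own, a loader such as dotenv can read it into process.env before the crawler starts (dotenv is an assumption here, not a documented dependency of crawly-mccrawlface):

// Load .env into process.env; anything reading process.env.GOOGLE_NLP_API will see the key.
require('dotenv').config();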
Some examples of setting up a cache:
const Redis = require('ioredis');
const redis = new Redis({
    port: 6379,             // Redis port
    host: 'localhost',      // Redis host
    family: 4,              // 4 (IPv4) or 6 (IPv6)
    password: 'superSecurePassword',
    db: 0
});
// The cache can be any object that exposes get(key) and set(key, value, expire).
const cache = {
    get: function (key) {
        return redis.get(key);
    },
    set: function (key, value, expire) {
        redis.set(key, value, 'EX', expire);
    }
};
crawler.setCache(cache);
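Redis is only one option; any object with matching get and set methods will do. A throwaway in-memory variant for local testing (a sketch, not part of the package):

const memory = new Map();
crawler.setCache({
    get: (key) => Promise.resolve(memory.get(key)),
    set: (key, value, expire) => { memory.set(key, value); } // expire is ignored in this sketch
});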
const options = {
    readyIn: 50,                            // sites to load before the ready event fires
    goHaywire: false,                       // stay on the seed's domains
    userAgent: 'CrawlyMcCrawlface',         // user agent sent with requests
    expireDefault: 7 * 24 * 60 * 60 * 1000  // one week, used as the cache expire value
};
const crawler = new Crawler([...some urls...], options);
readyIn (Number):
Number of sites that have to be loaded before the ready event is fired.
goHaywire (Boolean):
By default the crawler only gets content from the domains that were in the seed.
In haywire mode the crawler never stops and goes crazy on the web. You should not use this mode for now.
Or use it at your own risk, I'm not your boss.
userAgent (String): User agent string sent with requests.
expireDefault (Number): Default expire value used when entries are written to the cache.
crawler.addFilter('details.html');
// if (url.match('details.html')) { /* url is crawled */ }
crawler.addFilter(new RegExp('[0-9]{5}'));
// if (url.match('details.html') || url.match(/[0-9]{5}/)) { /* url is crawled */ }
ready is fired when enough sites have been loaded to do a content extraction (default: 50, or the value set with options.readyIn) or when all sites of the seed's domain were crawled. This is the first point where content extraction can be applied.
siteAdded is fired when a new site was added. It contains the new site as an object.
sitesChanged is fired when a new site was added. It contains the count of all sites.
finished is fired when the queue is empty. In default usage, this is the point when everything is ready.
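Put together, listening to these events looks roughly like this (the handler argument names are assumptions based on the descriptions above):

crawler.on('ready', () => {
    // Enough sites are loaded; content extraction can start.
});
crawler.on('siteAdded', (site) => {
    // site is the newly added site object.
});
crawler.on('sitesChanged', (count) => {
    // count is the number of all sites loaded so far.
});
crawler.on('finished', () => {
    // The queue is empty; everything is ready.
});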
Test with:
npm test