Parser for XML Sitemaps to be used with Robots.txt and web crawlers
Parse through a sitemap's XML to get all the URLs for your crawler.
```bash
npm install sitemapper --save
```
A simple example:

```javascript
const Sitemapper = require('sitemapper');

const sitemap = new Sitemapper();

sitemap.fetch('https://wp.seantburke.com/sitemap.xml').then(function (sites) {
  console.log(sites);
});
```
An example in ES6 using async/await:

```javascript
import Sitemapper from 'sitemapper';

(async () => {
  const Google = new Sitemapper({
    url: 'https://www.google.com/work/sitemap.xml',
    timeout: 15000, // 15 seconds
  });

  try {
    const { sites } = await Google.fetch();
    console.log(sites);
  } catch (error) {
    console.log(error);
  }
})();

// or

const sitemapper = new Sitemapper();
sitemapper.timeout = 5000;

sitemapper
  .fetch('https://wp.seantburke.com/sitemap.xml')
  .then(({ url, sites }) => console.log(`url:${url}`, 'sites:', sites))
  .catch((error) => console.log(error));
```
You can add options to the Sitemapper object when instantiating it:

- `requestHeaders`: (Object) - Additional request headers (e.g. `User-Agent`)
- `timeout`: (Number) - Maximum timeout in ms for a single URL. Default: 15000 (15 seconds)
- `url`: (String) - Sitemap URL to crawl
- `debug`: (Boolean) - Enables/disables debug console logging. Default: False
- `concurrency`: (Number) - Sets the maximum number of concurrent sitemap crawling threads. Default: 10
- `retries`: (Number) - Sets the maximum number of retries to attempt in case of an error response (e.g. 404 or Timeout). Default: 0
- `rejectUnauthorized`: (Boolean) - If true, it will throw on invalid certificates, such as expired or self-signed ones. Default: True
- `lastmod`: (Number) - Timestamp of the minimum lastmod value allowed for returned URLs
- `proxyAgent`: (HttpProxyAgent|HttpsProxyAgent) - Instance of npm "hpagent" HttpProxyAgent or HttpsProxyAgent to be passed to npm "got"
- `exclusions`: (Array<RegExp>) - Array of regular expressions; URLs matching any of them are excluded from the results
- `fields`: (Object) - An object of fields to be returned from the sitemap. Available fields:
  - `loc`: (Boolean) - The URL location of the page
  - `sitemap`: (Boolean) - The URL of the sitemap containing the URL, useful when a sitemap index points to nested sitemaps
  - `lastmod`: (Boolean) - The date of last modification of the page
  - `changefreq`: (Boolean) - How frequently the page is likely to change
  - `priority`: (Boolean) - The priority of this URL relative to other URLs on your site
  - `image:loc`: (Boolean) - The URL location of the image (for image sitemaps)
  - `image:title`: (Boolean) - The title of the image (for image sitemaps)
  - `image:caption`: (Boolean) - The caption of the image (for image sitemaps)
  - `video:title`: (Boolean) - The title of the video (for video sitemaps)
  - `video:description`: (Boolean) - The description of the video (for video sitemaps)
  - `video:thumbnail_loc`: (Boolean) - The thumbnail URL of the video (for video sitemaps)
For example:

```javascript
fields: {
  loc: true,
  lastmod: true,
  changefreq: true,
  priority: true,
}
```
Leaving a field out has the same effect as setting it to `false`. If not specified, sitemapper defaults to returning the 'classic' array of URLs.
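
When `fields` is set, each entry in `sites` is an object carrying the requested fields rather than a plain URL string. Here is a minimal sketch of reading them back, reusing the sitemap URL from the examples above; the object shape is inferred from the field descriptions, and actual values depend on the sitemap:

```javascript
import Sitemapper from 'sitemapper';

const sitemapper = new Sitemapper({
  url: 'https://wp.seantburke.com/sitemap.xml', // sitemap URL reused from the examples above
  fields: { loc: true, lastmod: true },
});

(async () => {
  const { sites } = await sitemapper.fetch();
  // Each entry should look roughly like { loc: 'https://...', lastmod: '...' }
  // (shape inferred from the field descriptions above).
  for (const site of sites) {
    console.log(site.loc, site.lastmod);
  }
})();
```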
An example using all available options:
```javascript
import Sitemapper from 'sitemapper';
import { HttpsProxyAgent } from 'hpagent';

const sitemapper = new Sitemapper({
  requestHeaders: {
    'User-Agent':
      'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0',
  },
  timeout: 15000,
  url: 'https://art-works.community/sitemap.xml',
  debug: true,
  concurrency: 2,
  retries: 1,
  lastmod: 1600000000000,
  proxyAgent: new HttpsProxyAgent({
    proxy: 'http://localhost:8080',
  }),
  exclusions: [/\/v1\//, /scary/],
  rejectUnauthorized: false,
  fields: {
    loc: true,
    lastmod: true,
    priority: true,
    changefreq: true,
    sitemap: true,
  },
});
```
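
As a smaller, focused sketch, `exclusions` and `lastmod` can be combined to narrow the returned URLs; the patterns and cutoff date below are illustrative, with the sitemap URL reused from the examples above:

```javascript
import Sitemapper from 'sitemapper';

const recentPagesOnly = new Sitemapper({
  url: 'https://wp.seantburke.com/sitemap.xml',
  // Drop any URL matching one of these patterns (illustrative patterns).
  exclusions: [/\.pdf$/, /\/archive\//],
  // Only keep URLs whose lastmod is at or after this timestamp in ms (illustrative cutoff).
  lastmod: new Date('2023-01-01').getTime(),
});

recentPagesOnly
  .fetch()
  .then(({ sites }) => console.log(`${sites.length} urls after filtering`))
  .catch((error) => console.error(error));
```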