Parses the wget spider output into an object
Install with `npm install wget-parser`.

Table of Contents
=================
* Spider parser
  * Usage
  * wget-parser
  * wget-spider
  * Output
  * Developer
    * Test
    * Cover
    * Lint
    * Clean
    * Readme

Spider parser
=============
Parses the spider output from [wget][] into an object structure of links.

The object can then be processed further to build a tree of a website's hierarchy, for example to generate a sitemap.

Tested using wget v1.15 on Linux.

```javascript
var parser = require('wget-parser')
  , buf = Buffer.alloc(0); // buffer should contain the spider output
console.dir(parser(buf));
```

* `parser.Parser`: The parser class.
* `parser.Link`: The class that represents a link.
* `parser.ParseStream`: Parse stream class.

Stream support is available; see the test spec for example usage.

`wget-parser` is a program that reads from stdin and prints the result of the parse as JSON. It exits with error code 1 if any broken links are found.

```
cat test/fixtures/mock.txt | wget-parser
cat test/fixtures/broken.txt | wget-parser; echo $?;
```

`wget-spider` is a program that performs a spider with wget and pipes the output to `wget-parser`:

```
wget-spider http://google.com
```

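
For readers who want to drive the same kind of pipeline from Node, here is a minimal sketch of collecting wget's spider log as a buffer. The flags used by `wget-spider` itself are not documented in this readme; `--spider` and `-r` below are standard wget options chosen as an assumption, and `spiderArgs` and `spider` are hypothetical helper names.

```javascript
var execFile = require('child_process').execFile;

// Assumed flags: --spider checks links without downloading, -r recurses.
// These are standard wget options, not necessarily what wget-spider uses.
function spiderArgs(url) {
  return ['--spider', '-r', url];
}

// Run wget and pass its log output to the callback as a Buffer.
// wget writes its log to stderr, so stderr is what gets handed on.
function spider(url, cb) {
  var opts = {encoding: 'buffer', maxBuffer: 64 * 1024 * 1024};
  execFile('wget', spiderArgs(url), opts, function(err, stdout, stderr) {
    // A numeric err.code is wget's exit status; a non-zero status
    // (for example when a link is broken) still leaves a usable log.
    // Anything else (wget missing, output too large) is reported.
    if(err && typeof err.code !== 'number') {
      return cb(err);
    }
    cb(null, stderr);
  });
}
```

The buffer passed to the callback could then be handed to the parser, for example `spider(url, function(err, buf) { console.dir(parser(buf)); })`.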
Example output from the parser:

```json
{
  "links": [
    {
      "url": {
        "protocol": "http:",
        "slashes": true,
        "auth": null,
        "host": "google.com",
        "port": null,
        "hostname": "google.com",
        "hash": null,
        "search": null,
        "query": null,
        "pathname": "/",
        "path": "/",
        "href": "http://google.com/"
      },
      "link": "http://google.com/",
      "line": "--2016-02-10 16:11:57-- http://google.com/"
    },
    {
      "url": {
        "protocol": "http:",
        "slashes": true,
        "auth": null,
        "host": "www.google.co.id",
        "port": null,
        "hostname": "www.google.co.id",
        "hash": null,
        "search": "?gws_rd=cr&ei=zfC6Vv6KKYexuATc3pu4DQ",
        "query": "gws_rd=cr&ei=zfC6Vv6KKYexuATc3pu4DQ",
        "pathname": "/",
        "path": "/?gws_rd=cr&ei=zfC6Vv6KKYexuATc3pu4DQ",
        "href": "http://www.google.co.id/?gws_rd=cr&ei=zfC6Vv6KKYexuATc3pu4DQ"
      },
      "link": "http://www.google.co.id/?gws_rd=cr&ei=zfC6Vv6KKYexuATc3pu4DQ",
      "line": "--2016-02-10 16:11:57-- http://www.google.co.id/?gws_rd=cr&ei=zfC6Vv6KKYexuATc3pu4DQ"
    }
  ],
  "broken": []
}
```

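
As a rough illustration of the further processing mentioned earlier, the sketch below groups the parsed links into a tree keyed by hostname and then by path segment. It assumes only the `links[].url.hostname` and `links[].url.pathname` fields shown in the example output; `buildTree` is a hypothetical helper, not part of this module, and the `/about/team` entry is made-up sample data.

```javascript
// Return the child node for `key`, creating it if necessary.
// Nodes have no prototype so segments like "constructor" are safe keys.
function child(node, key) {
  if(!node[key]) {
    node[key] = Object.create(null);
  }
  return node[key];
}

// Build a nested tree: hostname first, then one level per path segment.
function buildTree(result) {
  var root = Object.create(null);
  result.links.forEach(function(item) {
    var node = child(root, item.url.hostname);
    item.url.pathname.split('/').filter(Boolean).forEach(function(seg) {
      node = child(node, seg);
    });
  });
  return root;
}

// Sample input in the shape of the parser output (trimmed to the
// fields used here).
var tree = buildTree({
  links: [
    {url: {hostname: 'google.com', pathname: '/'}},
    {url: {hostname: 'google.com', pathname: '/about/team'}},
    {url: {hostname: 'www.google.co.id', pathname: '/'}}
  ],
  broken: []
});
console.log(JSON.stringify(tree, null, 2));
```

A sitemap generator could walk this tree depth first; query strings are ignored here, which may or may not suit a given site.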
To run the test suite:

```
npm test
```

To generate code coverage run:

```
npm run cover
```

Run the source tree through [jshint][] and [jscs][]:

```
npm run lint
```

Remove generated files:

```
npm run clean
```

To build the readme file from the partial definitions:

```
npm run readme
```

Generated by mdp(1).
[wget]: https://www.gnu.org/software/wget
[jshint]: http://jshint.com
[jscs]: http://jscs.info