JavaScript wrapper around Mozilla Readability for ArchiveBox to call as a oneshot CLI to extract article text.

`npm install readability-extractor`

This is a tiny JS wrapper library around Mozilla's article-text extraction tool, https://github.com/mozilla/readability.
It's designed to be used as an ArchiveBox archive method.

```bash
npm install -g 'git+https://github.com/pirate/readability-extractor'
```

Usage

```bash
readability-extractor <path_to_html> <original_url> <charset> > <output.json>
readability-extractor some_article.html 'https://example.com/original/url/some/article.html' 'UTF-8' > some_article.json
```

```json
{
    "title": "Title autodetected from article html",
    "byline": "Autodetected author...",
    "excerpt": "Autodetected short description",
    "dir": "ltr",
    "length": 1337,
    "lang": null,
    "charset": "UTF-8",
    "content": "abc some article body text...",
    "textContent": "abc some article body text..."
}
```

ArchiveBox Integration

```bash
# You don't usually have to run these commands.
# Readability is on by default, and ArchiveBox will automatically
# find any installed version in your $PATH.

# However, if you explicitly want to turn readability on
# and/or specify a manual path to the binary, you can do this:
archivebox config --set SAVE_READABILITY=True
archivebox config --set READABILITY_BINARY="$(which readability-extractor)"

# test archiving oneshot using only singlefile+readability
archivebox add --extract=singlefile,readability 'https://example.com'
```
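
If you want to call the extractor from your own Node.js script rather than through ArchiveBox, the following is a minimal sketch. It assumes `readability-extractor` is installed and on your `$PATH`, that it takes the same three positional arguments shown in the Usage example (HTML file, original URL, charset), and that it writes the JSON shown above to stdout. The function names `buildExtractorArgs`, `extractArticle`, and `summarizeArticle` are hypothetical helpers for illustration, not part of this package.

```javascript
const { execFileSync } = require("node:child_process");

// Build the positional arguments in the order used in the Usage example:
// HTML file path, original URL, charset.
function buildExtractorArgs(htmlPath, originalUrl, charset = "UTF-8") {
  return [htmlPath, originalUrl, charset];
}

// Run the CLI and parse the JSON it prints to stdout.
// Assumption: the `readability-extractor` binary is available on $PATH.
function extractArticle(htmlPath, originalUrl, charset = "UTF-8") {
  const stdout = execFileSync(
    "readability-extractor",
    buildExtractorArgs(htmlPath, originalUrl, charset),
    { encoding: "utf8", maxBuffer: 64 * 1024 * 1024 }
  );
  return JSON.parse(stdout);
}

// Hypothetical helper: reduce the parsed output to a few summary fields.
function summarizeArticle(article) {
  const text = article.textContent || "";
  const words = text.split(/\s+/).filter(Boolean);
  return {
    title: article.title || null,
    byline: article.byline || null,
    wordCount: words.length,
  };
}
```

For example, `summarizeArticle(extractArticle("some_article.html", "https://example.com/original/url/some/article.html"))` would return the title, byline, and a rough word count for the article. Using `execFileSync` with an argument array, instead of building a shell string, avoids quoting problems with URLs that contain special characters.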