Web Page Inspection Tool. Sentiment Analysis, Keyword Extraction, Named Entity Recognition & Spell Check
`bash
npm install horseman-article-parser --save
`
### Usage
#### parseArticle(options, socket) ⇒ Object
| Param | Type | Description |
| ------- | ------------------- | ------------------- |
| options | Object | the options object |
| socket | Object | the optional socket |
Returns: Object - article parser results object
#### Async/Await Example
`js
import { parseArticle } from "horseman-article-parser";
const options = {
url: "https://www.theguardian.com/politics/2018/sep/24/theresa-may-calls-for-immigration-based-on-skills-and-wealth",
enabled: [
"lighthouse",
"screenshot",
"links",
"sentiment",
"entities",
"spelling",
"keywords",
"summary",
"readability",
],
};
(async () => {
try {
const article = await parseArticle(options);
const response = {
title: article.title.text,
excerpt: article.excerpt,
metadescription: article.meta.description.text,
url: article.url,
sentiment: {
score: article.sentiment.score,
comparative: article.sentiment.comparative,
},
keyphrases: article.processed.keyphrases,
keywords: article.processed.keywords,
people: article.people,
orgs: article.orgs,
places: article.places,
language: article.language,
readability: {
readingTime: article.readability.readingTime,
characters: article.readability.characters,
words: article.readability.words,
sentences: article.readability.sentences,
paragraphs: article.readability.paragraphs,
},
text: {
raw: article.processed.text.raw,
formatted: article.processed.text.formatted,
html: article.processed.text.html,
summary: article.processed.text.summary,
sentences: article.processed.text.sentences,
},
spelling: article.spelling,
meta: article.meta,
links: article.links,
lighthouse: article.lighthouse,
};
console.log(response);
} catch (error) {
console.log(error.message);
console.log(error.stack);
}
})();
`
parseArticle(options, socket) accepts an optional socket for piping the response object, status messages and errors to a front-end UI.
See horseman-article-parser-ui as an example.
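If you have a front end listening over socket.io (assumed here; the event names "parse", "article" and "parse_error" are purely illustrative), a minimal sketch looks like this:
`js
/* Minimal sketch: forward parser progress to a browser over socket.io (assumed dependency) */
import { Server } from "socket.io";
import { parseArticle } from "horseman-article-parser";

const io = new Server(3000);

io.on("connection", (socket) => {
  socket.on("parse", async (url) => {
    try {
      // status messages and errors are piped to the socket while the article is parsed
      const article = await parseArticle({ url }, socket);
      socket.emit("article", { title: article.title.text, excerpt: article.excerpt });
    } catch (error) {
      socket.emit("parse_error", error.message);
    }
  });
});
`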
### Options
The options below are set by default
`js
var options = {
// Imposes a hard limit on how long the parser will run. When the limit is reached, the browser instance is closed and a timeout error is thrown.
// This prevents the parser from hanging indefinitely and ensures long‑running parses are cut off after the specified duration.
timeoutMs: 40000,
// puppeteer options (https://github.com/GoogleChrome/puppeteer)
puppeteer: {
// puppeteer launch options (https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#puppeteerlaunchoptions)
launch: {
headless: true,
defaultViewport: null,
},
// Optional user agent and headers (some sites require a realistic UA)
// userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36',
// extraHTTPHeaders: { 'Accept-Language': 'en-US,en;q=0.9' },
// puppeteer goto options (https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#pagegotourl-options)
goto: {
waitUntil: "domcontentloaded",
},
// Ignore content security policy
setBypassCSP: true,
},
// clean-html options (https://ghub.io/clean-html)
cleanhtml: {
"add-remove-tags": ["blockquote", "span"],
"remove-empty-tags": ["span"],
"replace-nbsp": true,
},
// html-to-text options (https://ghub.io/html-to-text)
htmltotext: {
wordwrap: 100,
noLinkBrackets: true,
ignoreHref: true,
tables: true,
uppercaseHeadings: true,
},
// retext-keywords options (https://ghub.io/retext-keywords)
retextkeywords: { maximum: 10 },
// content detection defaults (detector is always enabled)
contentDetection: {
// minimum characters required for a candidate
minLength: 400,
// maximum link density allowed for a candidate
maxLinkDensity: 0.5,
// optional: promote selection to a parent container when
// article paragraphs are split across sibling blocks
fragment: {
// require at least this many sibling parts containing paragraphs
minParts: 2,
// minimum text length per part
minChildChars: 150,
// minimum combined text across parts (set higher to be stricter)
minCombinedChars: 400,
// override parent link-density threshold (default uses max(maxLinkDensity, 0.65))
// maxLinkDensity: 0.65
},
// reranker is disabled by default; enable after training weights
// Note: scripts/single-sample-run.js auto-loads weights.json (if present) and enables the reranker
reranker: { enabled: false },
// optional: dump top-N candidates per page for labeling
// debugDump: { path: 'candidates_with_url.csv', topN: 5, addUrl: true }
},
// retext-spell defaults and output tweaks
retextspell: {
tweaks: {
// filter URL/domain-like tokens and long slugs by default
ignoreUrlLike: true,
// positions: only start by default
includeEndPosition: false,
// offsets excluded by default
includeOffsets: false,
},
},
};
`
At a minimum you should pass a url
`js
var options = {
url: "https://www.theguardian.com/politics/2018/sep/24/theresa-may-calls-for-immigration-based-on-skills-and-wealth",
};
`
If you want to enable the advanced features you should pass the following
`js
var options = {
url: "https://www.theguardian.com/politics/2018/sep/24/theresa-may-calls-for-immigration-based-on-skills-and-wealth",
enabled: [
"lighthouse",
"screenshot",
"links",
"sentiment",
"entities",
"spelling",
"keywords",
"summary",
"readability",
],
};
`
Add "summary" to options.enabled to generate a short summary of the article text. The result
includes text.summary and a text.sentences array containing the first five sentences.
Add "readability" to options.enabled to evaluate readability, estimate reading time, and gather basic text statistics. The result is available as article.readability with readingTime (seconds), characters, words, sentences, and paragraphs.
You may pass rules for returning an article's title & contents. This is useful in cases
where the parser is unable to return the desired title or content, e.g.
`js
rules: [
{
host: "www.bbc.co.uk",
content: () => {
var j = window.$;
j("article section, article figure, article header").remove();
return j("article").html();
},
},
{
host: "www.youtube.com",
title: () => {
return window.ytInitialData.contents.twoColumnWatchNextResults.results
.results.contents[0].videoPrimaryInfoRenderer.title.runs[0].text;
},
content: () => {
return window.ytInitialData.contents.twoColumnWatchNextResults.results
.results.contents[1].videoSecondaryInfoRenderer.description.runs[0]
.text;
},
},
];
`
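The rules array is passed as part of the options object alongside the url. A sketch using the BBC rule from above (the URL is a placeholder):
`js
var options = {
  url: "https://www.bbc.co.uk/news/example-article", // placeholder URL
  rules: [
    {
      host: "www.bbc.co.uk",
      content: () => {
        // remove unwanted sections before returning the article markup
        var j = window.$;
        j("article section, article figure, article header").remove();
        return j("article").html();
      },
    },
  ],
};
`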
If you want to pass cookies to puppeteer use the following
`js
var options = {
puppeteer: {
cookies: [
{ name: "cookie1", value: "val1", domain: ".domain1" },
{ name: "cookie2", value: "val2", domain: ".domain2" },
],
},
};
`
To strip tags before processing use the following
`js
var options = {
striptags: [".something", "#somethingelse"],
};
`
If you need to dismiss any popups e.g. a privacy popup use the following
`js
var options = {
clickelements: ["#button1", "#button2"],
};
`
There are some additional "complex" options available
`js
var options = {
// array of html elements to strip before analysis
striptags: [],
// array of resource types to block e.g. ['image' ]
blockedResourceTypes: [],
// array of resource source names (all resources from
// these sources are skipped) e.g. [ 'google', 'facebook' ]
skippedResources: [],
// retext spell options (https://ghub.io/retext-spell)
retextspell: {
// dictionary defaults to en-GB; you can override
// dictionary,
tweaks: {
// Filter URL/domain-like tokens and long slugs (default: true)
ignoreUrlLike: true,
// Include end position (endLine/endColumn) in each item (default: false)
includeEndPosition: false,
// Include offsets (offsetStart/offsetEnd) in each item (default: false)
includeOffsets: false
}
},
// compromise nlp options
nlp: { plugins: [ myPlugin, anotherPlugin ] }
}
`
### Named Entity Recognition
Compromise is the natural language processor that allows horseman-article-parser to return
topics e.g. people, places & organisations. You can now pass custom plugins to compromise to modify or add to the word lists like so:
`js
/* add some names */
const testPlugin = (Doc, world) => {
world.addWords({
rishi: "FirstName",
sunak: "LastName",
});
};
const options = {
url: "https://www.theguardian.com/commentisfree/2020/jul/08/the-guardian-view-on-rishi-sunak-right-words-right-focus-wrong-policies",
enabled: [
"lighthouse",
"screenshot",
"links",
"sentiment",
"entities",
"spelling",
"keywords",
"summary",
"readability",
],
// Optional: tweak spelling output/filters
retextspell: {
tweaks: {
ignoreUrlLike: true,
includeEndPosition: true,
includeOffsets: true,
},
},
nlp: {
plugins: [testPlugin],
},
};
`
By tagging new words as FirstName and LastName, the parser records fallback hints and can still detect the full name even if Compromise doesn't tag it directly. This allows us to match names which are not in the base Compromise word lists.
Check out the compromise plugin docs for more info.
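Once the article has been parsed with the plugin in place, the detected entities are available on the result object (see the response example above); a quick check might look like:
`js
// Sketch: inspecting the named entities returned by the parser
const article = await parseArticle(options); // options with the nlp plugin from the example above
console.log(article.people); // detected people, which should now include names added via the plugin
console.log(article.orgs);   // detected organisations
console.log(article.places); // detected places
`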
#### Extended name hints and secondary NER sources
loadNlpPlugins also accepts additional hint buckets for middle names and suffixes. You can provide them directly via options.nlp.hints or through a Compromise plugin using MiddleName/Suffix tags. These extra hints help prevent false splits in names that include common middle initials or honorifics.
`js
const options = {
nlp: {
hints: {
first: ['José', 'Ana'],
middle: ['Luis', 'María'],
last: ['Rodríguez', 'López'],
suffix: ['Jr']
},
secondary: {
endpoint: 'https://ner.yourservice.example/people',
method: 'POST',
timeoutMs: 1500,
minConfidence: 0.65
}
}
}
`
When secondary is configured the parser will send the article text to that endpoint (default payload { text: "…" }) and merge any PERSON entities it returns with the Compromise results. Responses that include a simple people array or spaCy-style ents collections are supported. If the service is unreachable or errors, the parser automatically falls back to Compromise-only detection.
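A secondary NER service therefore only needs to accept the documented payload and reply in one of the supported shapes. A hypothetical endpoint sketch using Node's built-in http module (the port and the stubbed people list are illustrative):
`js
// Hypothetical secondary NER endpoint: accepts { text } and responds with { people: [...] }
import { createServer } from "http";

createServer((req, res) => {
  let body = "";
  req.on("data", (chunk) => (body += chunk));
  req.on("end", () => {
    const { text = "" } = JSON.parse(body || "{}");
    // Stub: run a real NER model over `text` here; the sketch just returns an empty list
    const people = [];
    res.setHeader("Content-Type", "application/json");
    res.end(JSON.stringify({ people }));
  });
}).listen(8080);
`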
### Content Detection
The detector is always enabled and uses a structured-data-first strategy, falling back to heuristic scoring:
- Structured data: Extracts JSON-LD Article/NewsArticle (headline, articleBody).
- Heuristics: Gathers DOM candidates (e.g., article, main, [role=main], content-like containers) and scores them by text length, punctuation, link density, paragraph count, semantic tags, and boilerplate penalties.
- Fragment promotion: When content is split across sibling blocks, a fragmentation heuristic merges them into a single higher-level candidate.
- ML reranker (optional): If weights are supplied, a lightweight reranker can refine the heuristic ranking.
- Title detection: Chooses from the structured headline, og:title/twitter:title, the first h1, or document.title, with normalization.
- Debug dump (optional): Write top-N candidates to CSV for dataset labeling.
You can tune thresholds and fragmentation frequency under options.contentDetection:
`js
contentDetection: {
minLength: 400,
maxLinkDensity: 0.5,
fragment: {
// require at least this many sibling parts
minParts: 2,
// minimum text length per part
minChildChars: 150,
// minimum combined text across parts
minCombinedChars: 400
},
// enable after training weights
reranker: { enabled: false }
}
`
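If you also want candidate features written out for labeling (used by the training workflow below), the debugDump option shown in the defaults can be enabled; a sketch with a placeholder URL:
`js
var options = {
  url: "https://example.com/article", // placeholder URL
  contentDetection: {
    // write the top-N candidates for each parsed page to CSV for labeling
    debugDump: { path: "candidates_with_url.csv", topN: 5, addUrl: true },
  },
};
`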
### Language Detection
Horseman automatically detects the article language and exposes ISO codes via article.language in the result. Downstream steps such as keyword extraction or spelling use these codes to select language-specific resources when available. Dictionaries for English, French, and Spanish are bundled; other languages fall back to English if a matching dictionary or NLP plugin is not found.
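For example, the detected code can be read from the result after a parse (a minimal sketch with a placeholder URL):
`js
// Sketch: reading the detected language from the result
const article = await parseArticle({ url: "https://example.com/article" }); // placeholder URL
console.log(article.language); // ISO language code(s) detected for the article
`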
### Development
Please feel free to fork the repo or open pull requests to the development branch. I've used eslint for linting.
Build the dependencies with:
`
npm install
`
Lint the project files with:
`
npm run lint
`
Quick single-run (sanity check):
`
npm run sample:single -- --url "https://example.com/article"
`
### Quick Start (CLI)
Run quick tests and batches from this repo without writing code.
#### Available scripts
- merge:csv: Merge CSVs (utility for dataset building).
- npm run merge:csv -- scripts/data/merged.csv scripts/data/candidates_with_url.csv
- sample:prepare: Fetch curated URLs from feeds/sitemaps into scripts/data/urls.txt.
- npm run sample:prepare -- --count 200 --progress-only
- sample:single: Run a single URL parse and write JSON to scripts/results/single-sample-run-result.json.
- npm run sample:single -- --url "https://example.com/article"
- sample:batch: Run the multi-URL sample with progress bar and summaries.
- npm run sample:batch -- --count 100 --concurrency 5 --urls-file scripts/data/urls.txt --timeout 20000 --unique-hosts --progress-only
- batch:crawl: Crawl URLs and dump content-candidate features to CSV.
- npm run batch:crawl -- --urls-file scripts/data/urls.txt --out-file scripts/data/candidates_with_url.csv --start 0 --limit 200 --concurrency 1 --unique-hosts --progress-only
- train:ranker: Train reranker weights from a candidates CSV.
- npm run train:ranker --
#### Common flags
- --bar-width: progress bar width for scripts with progress bars.
- --feed-concurrency / --feed-timeout: tuning for curated feed collection.
#### Single sample run
Writes a detailed JSON to scripts/results/single-sample-run-result.json.
`bash
npm run sample:single -- --url "https://www.cnn.com/business/live-news/fox-news-dominion-trial-04-18-23/index.html" --timeout 40000
# or run directly
node scripts/single-sample-run.js --url "https://www.cnn.com/business/live-news/fox-news-dominion-trial-04-18-23/index.html" --timeout 40000
`
Parameters
- --timeout: maximum time (ms) for the parse. If omitted, the script uses its default (40000 ms).
- --url: the article page to parse.
#### Batch sample run
1. Fetch a fresh set of URLs:
`bash
npm run sample:prepare -- --count 200 --feed-concurrency 8 --feed-timeout 15000 --bar-width 20 --progress-only
# or run directly
node scripts/fetch-curated-urls.js --count 200 --feed-concurrency 8 --feed-timeout 15000 --bar-width 20 --progress-only
`
Parameters
- --count: target number of URLs to collect into scripts/data/urls.txt.
- --feed-concurrency: number of feeds to fetch in parallel (optional).
- --feed-timeout: per-feed timeout in ms (optional).
- --bar-width: progress bar width (optional).
- --progress-only: print only progress updates (optional).
2. Run a batch against unique hosts with a simple progress-only view. Progress and a final summary print to the console; JSON/CSV reports are saved under scripts/results/.
`bash
npm run sample:batch -- --count 100 --concurrency 5 --urls-file scripts/data/urls.txt --timeout 20000 --unique-hosts --bar-width 20 --progress-only
# or run directly
node scripts/batch-sample-run.js --count 100 --concurrency 5 --urls-file scripts/data/urls.txt --timeout 20000 --unique-hosts --bar-width 20 --progress-only
`
Parameters
- --count: number of URLs to process.
- --concurrency: number of concurrent parses.
- --urls-file: file containing URLs to parse.
- --timeout: maximum time (ms) allowed for each parse.
- --unique-hosts: ensure each sampled URL has a unique host (optional).
- --progress-only: print only progress updates (optional).
- --bar-width: progress bar width (optional).
#### Training the reranker
You can train a simple logistic-regression reranker to improve candidate selection.
1. Generate candidate features
- Single URL (appends candidates):
- npm run sample:single -- --url
- Batch (recommended):
- npm run batch:crawl -- --urls-file scripts/data/urls.txt --out-file scripts/data/candidates_with_url.csv --start 0 --limit 200 --concurrency 1 --unique-hosts --progress-only
- Adjust --start and --limit to process in slices (e.g., --start 200 --limit 200, --start 400 --limit 200, ...).
Parameters
- --urls-file: input list of URLs to crawl
- --out-file: output CSV file for candidate features
- --start: start offset (row index) in the URLs file
- --limit: number of URLs to process in this run
- --concurrency: number of parallel crawlers
- --unique-hosts: ensure each URL has a unique host (optional)
- --progress-only: show only progress updates (optional)
- The project dumps candidate features with URL by default (see scripts/single-sample-run.js):
- Header: url,xpath,css_selector,text_length,punctuation_count,link_density,paragraph_count,has_semantic_container,boilerplate_penalty,direct_paragraph_count,direct_block_count,paragraph_to_block_ratio,average_paragraph_length,dom_depth,heading_children_count,aria_role_main,aria_role_negative,aria_hidden,image_alt_ratio,image_count,training_label,default_selected
- Up to topN unique-XPath rows per page (default 5)
2. Label the dataset
- Open scripts/data/candidates_with_url.csv in a spreadsheet/editor.
- For each URL group, set training_label = 1 for the correct article body candidate (leave others as 0).
- Column meanings (subset):
- url: source page
- xpath: Chrome console snippet to select the container (e.g., $x('...')[0])
- css_selector: Chrome console snippet to select via CSS (e.g., document.querySelector('...'))
- text_length: raw character length
- punctuation_count: count of punctuation (.,!?,;:)
- link_density: ratio of link text length to total text (0..1)
- paragraph_count: count of paragraph nodes under the container
- has_semantic_container: 1 if within article/main/role=main/itemtype*=Article, else 0
- boilerplate_penalty: number of boilerplate containers detected (nav/aside/comments/social/newsletter/consent), capped
- direct_paragraph_count, direct_block_count, paragraph_to_block_ratio, average_paragraph_length, dom_depth, heading_children_count: direct-children structure features used by heuristics
- aria_role_main, aria_role_negative, aria_hidden: accessibility signals
- image_alt_ratio, image_count: image accessibility metrics
- training_label: 1 for the true article candidate; 0 otherwise
- default_selected: 1 if this candidate would be chosen by the default heuristic (no custom weights)
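For intuition, link_density is the share of a candidate's text that sits inside links; an illustrative browser-console sketch (not the parser's internal code, and it assumes the page has an article element):
`js
// Illustrative only: approximate link_density for a candidate container in the browser console
const el = document.querySelector("article"); // candidate container (assumed to exist)
const totalText = el.innerText.length || 1;
const linkText = Array.from(el.querySelectorAll("a"))
  .reduce((sum, a) => sum + a.innerText.length, 0);
console.log(linkText / totalText); // value between 0 and 1
`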
3. Train weights and export JSON
- Via npm (use --silent and the -- arg separator):
  - npm run --silent train:ranker -- scripts/data/candidates_with_url.csv > weights.json
- Or run directly (avoids npm banner output):
  - node scripts/train-reranker.js scripts/data/candidates_with_url.csv weights.json
Parameters
- scripts/data/candidates_with_url.csv: labeled candidates CSV (input)
- weights.json: output weights file (JSON)
Tips
- -- passes subsequent args to the underlying script
- > weights.json redirects stdout to a file
4. Use the weights
- scripts/single-sample-run.js auto-loads weights.json (if present) and enables the reranker:
  options.contentDetection.reranker = { enabled: true, weights }
Notes
- If no reranker is configured, the detector uses heuristic scoring only.
- You can merge CSVs from multiple runs: npm run merge:csv -- scripts/data/merged.csv scripts/data/candidates_with_url.csv
- Tip: placing a weights.json in the project root will make scripts/single-sample-run.js auto-enable the reranker on the next run.
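If you prefer to enable the reranker explicitly in your own code rather than relying on the auto-load behaviour, the trained weights can be passed into options.contentDetection as documented above (a minimal sketch with a placeholder URL):
`js
// Sketch: load trained weights and enable the reranker explicitly
import { readFileSync } from "fs";
import { parseArticle } from "horseman-article-parser";

const weights = JSON.parse(readFileSync("./weights.json", "utf8"));

(async () => {
  const article = await parseArticle({
    url: "https://example.com/article", // placeholder URL
    contentDetection: {
      reranker: { enabled: true, weights },
    },
  });
  console.log(article.title.text);
})();
`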
Update API docs with:
`
npm run docs
`