Web Page Inspection Tool. Sentiment Analysis, Keyword Extraction, Named Entity Recognition & Spell Check
`bash
npm install horseman-article-parser --save
`
### Usage
#### parseArticle(options, socket) ⇒ Object
| Param | Type | Description |
| ------- | ------------------- | ------------------- |
| options | Object | the options object |
| socket | Object | the optional socket |
Returns: Object - article parser results object
#### Async/Await Example
`js
import { parseArticle } from "horseman-article-parser";
const options = {
url: "https://www.theguardian.com/politics/2018/sep/24/theresa-may-calls-for-immigration-based-on-skills-and-wealth",
enabled: [
"lighthouse",
"screenshot",
"links",
"sentiment",
"entities",
"spelling",
"keywords",
"summary",
"readability",
],
};
(async () => {
try {
const article = await parseArticle(options);
const response = {
title: article.title.text,
excerpt: article.excerpt,
metadescription: article.meta.description.text,
url: article.url,
sentiment: {
score: article.sentiment.score,
comparative: article.sentiment.comparative,
},
keyphrases: article.processed.keyphrases,
keywords: article.processed.keywords,
people: article.people,
orgs: article.orgs,
places: article.places,
language: article.language,
readability: {
readingTime: article.readability.readingTime,
characters: article.readability.characters,
words: article.readability.words,
sentences: article.readability.sentences,
paragraphs: article.readability.paragraphs,
},
text: {
raw: article.processed.text.raw,
formatted: article.processed.text.formatted,
html: article.processed.text.html,
summary: article.processed.text.summary,
sentences: article.processed.text.sentences,
},
spelling: article.spelling,
meta: article.meta,
links: article.links,
lighthouse: article.lighthouse,
};
console.log(response);
} catch (error) {
console.log(error.message);
console.log(error.stack);
}
})();
`
parseArticle(options, socket) accepts an optional socket for piping the response object, status messages and errors to a front-end UI.
See horseman-article-parser-ui as an example.
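If you have a front end listening over socket.io (assumed here; the event names "parse", "article" and "parse_error" are purely illustrative), a minimal sketch looks like this:
`js
/* Minimal sketch: forward parser progress to a browser over socket.io (assumed dependency) */
import { Server } from "socket.io";
import { parseArticle } from "horseman-article-parser";

const io = new Server(3000);

io.on("connection", (socket) => {
  socket.on("parse", async (url) => {
    try {
      // status messages and errors are piped to the socket while the article is parsed
      const article = await parseArticle({ url }, socket);
      socket.emit("article", { title: article.title.text, excerpt: article.excerpt });
    } catch (error) {
      socket.emit("parse_error", error.message);
    }
  });
});
`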
### Options
The options below are set by default
`js
var options = {
// Imposes a hard limit on how long the parser will run. When the limit is reached, the browser instance is closed and a timeout error is thrown.
// This prevents the parser from hanging indefinitely and ensures long‑running parses are cut off after the specified duration.
timeoutMs: 40000,
// puppeteer options (https://github.com/GoogleChrome/puppeteer)
puppeteer: {
// puppeteer launch options (https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#puppeteerlaunchoptions)
launch: {
headless: true,
defaultViewport: null,
},
// Optional user agent and headers (some sites require a realistic UA)
// userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36',
// extraHTTPHeaders: { 'Accept-Language': 'en-US,en;q=0.9' },
// puppeteer goto options (https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#pagegotourl-options)
goto: {
waitUntil: "domcontentloaded",
},
// Ignore content security policy
setBypassCSP: true,
},
// clean-html options (https://ghub.io/clean-html)
cleanhtml: {
"add-remove-tags": ["blockquote", "span"],
"remove-empty-tags": ["span"],
"replace-nbsp": true,
},
// html-to-text options (https://ghub.io/html-to-text)
htmltotext: {
wordwrap: 100,
noLinkBrackets: true,
ignoreHref: true,
tables: true,
uppercaseHeadings: true,
},
// retext-keywords options (https://ghub.io/retext-keywords)
retextkeywords: { maximum: 10 },
// content detection defaults (detector is always enabled)
contentDetection: {
// minimum characters required for a candidate
minLength: 400,
// maximum link density allowed for a candidate
maxLinkDensity: 0.5,
// optional: promote selection to a parent container when
// article paragraphs are split across sibling blocks
fragment: {
// require at least this many sibling parts containing paragraphs
minParts: 2,
// minimum text length per part
minChildChars: 150,
// minimum combined text across parts (set higher to be stricter)
minCombinedChars: 400,
// override parent link-density threshold (default uses max(maxLinkDensity, 0.65))
// maxLinkDensity: 0.65
},
// reranker is disabled by default; enable after training weights
// Note: scripts/single-sample-run.js auto-loads weights.json (if present) and enables the reranker
reranker: { enabled: false },
// optional: dump top-N candidates per page for labeling
// debugDump: { path: 'candidates_with_url.csv', topN: 5, addUrl: true }
},
// retext-spell defaults and output tweaks
retextspell: {
tweaks: {
// filter URL/domain-like tokens and long slugs by default
ignoreUrlLike: true,
// positions: only start by default
includeEndPosition: false,
// offsets excluded by default
includeOffsets: false,
},
},
};
`
At a minimum you should pass a url
`js
var options = {
url: "https://www.theguardian.com/politics/2018/sep/24/theresa-may-calls-for-immigration-based-on-skills-and-wealth",
};
`
If you want to enable the advanced features you should pass the following
`js
var options = {
url: "https://www.theguardian.com/politics/2018/sep/24/theresa-may-calls-for-immigration-based-on-skills-and-wealth",
enabled: [
"lighthouse",
"screenshot",
"links",
"sentiment",
"entities",
"spelling",
"keywords",
"summary",
"readability",
],
};
`
Add "summary" to options.enabled to generate a short summary of the article text. The result
includes text.summary and a text.sentences array containing the first five sentences.
Add "readability" to options.enabled to evaluate readability, estimate reading time, and gather basic text statistics. The result is available as article.readability with readingTime (seconds), characters, words, sentences, and paragraphs.
You may pass rules for returning an article's title & contents. This is useful in cases
where the parser is unable to return the desired title or content, e.g.
`js
rules: [
{
host: "www.bbc.co.uk",
content: () => {
var j = window.$;
j("article section, article figure, article header").remove();
return j("article").html();
},
},
{
host: "www.youtube.com",
title: () => {
return window.ytInitialData.contents.twoColumnWatchNextResults.results
.results.contents[0].videoPrimaryInfoRenderer.title.runs[0].text;
},
content: () => {
return window.ytInitialData.contents.twoColumnWatchNextResults.results
.results.contents[1].videoSecondaryInfoRenderer.description.runs[0]
.text;
},
},
];
`
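The rules array is passed as part of the options object alongside the url. A sketch using the BBC rule from above (the URL is a placeholder):
`js
var options = {
  url: "https://www.bbc.co.uk/news/example-article", // placeholder URL
  rules: [
    {
      host: "www.bbc.co.uk",
      content: () => {
        // remove unwanted sections before returning the article markup
        var j = window.$;
        j("article section, article figure, article header").remove();
        return j("article").html();
      },
    },
  ],
};
`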
If you want to pass cookies to puppeteer use the following
`js
var options = {
puppeteer: {
cookies: [
{ name: "cookie1", value: "val1", domain: ".domain1" },
{ name: "cookie2", value: "val2", domain: ".domain2" },
],
},
};
`
To strip tags before processing use the following
`js
var options = {
striptags: [".something", "#somethingelse"],
};
`
If you need to dismiss any popups e.g. a privacy popup use the following
`js
var options = {
clickelements: ["#button1", "#button2"],
};
`
There are some additional "complex" options available
`js
var options = {
// array of html elements to strip before analysis
striptags: [],
// array of resource types to block e.g. ['image' ]
blockedResourceTypes: [],
// array of resource source names (all resources from
// these sources are skipped) e.g. [ 'google', 'facebook' ]
skippedResources: [],
// retext spell options (https://ghub.io/retext-spell)
retextspell: {
// dictionary defaults to en-GB; you can override
// dictionary,
tweaks: {
// Filter URL/domain-like tokens and long slugs (default: true)
ignoreUrlLike: true,
// Include end position (endLine/endColumn) in each item (default: false)
includeEndPosition: false,
// Include offsets (offsetStart/offsetEnd) in each item (default: false)
includeOffsets: false
}
},
// compromise nlp options
nlp: { plugins: [ myPlugin, anotherPlugin ] }
}
`
### Named Entity Recognition
Compromise is the natural language processor that allows horseman-article-parser to return
topics e.g. people, places & organisations. You can now pass custom plugins to compromise to modify or add to the word lists like so:
`js
/* add some names */
const testPlugin = (Doc, world) => {
world.addWords({
rishi: "FirstName",
sunak: "LastName",
});
};
const options = {
url: "https://www.theguardian.com/commentisfree/2020/jul/08/the-guardian-view-on-rishi-sunak-right-words-right-focus-wrong-policies",
enabled: [
"lighthouse",
"screenshot",
"links",
"sentiment",
"entities",
"spelling",
"keywords",
"summary",
"readability",
],
// Optional: tweak spelling output/filters
retextspell: {
tweaks: {
ignoreUrlLike: true,
includeEndPosition: true,
includeOffsets: true,
},
},
nlp: {
plugins: [testPlugin],
},
};
`
By tagging new words as FirstName and LastName, the parser records fallback hints and can still detect the full name even if Compromise doesn't tag it directly. This allows us to match names which are not in the base Compromise word lists.
Check out the compromise plugin docs for more info.
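Once the article has been parsed with the plugin in place, the detected entities are available on the result object (see the response example above); a quick check might look like:
`js
// Sketch: inspecting the named entities returned by the parser
const article = await parseArticle(options); // options with the nlp plugin from the example above
console.log(article.people); // detected people, which should now include names added via the plugin
console.log(article.orgs);   // detected organisations
console.log(article.places); // detected places
`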
#### Extended name hints and secondary NER sources
loadNlpPlugins also accepts additional hint buckets for middle names and suffixes. You can provide them directly via options.nlp.hints or through a Compromise plugin using MiddleName/Suffix tags. These extra hints help prevent false splits in names that include common middle initials or honorifics.
`js
const options = {
nlp: {
hints: {
first: ['José', 'Ana'],
middle: ['Luis', 'María'],
last: ['Rodríguez', 'López'],
suffix: ['Jr']
},
secondary: {
endpoint: 'https://ner.yourservice.example/people',
method: 'POST',
timeoutMs: 1500,
minConfidence: 0.65
}
}
}
`
When secondary is configured the parser will send the article text to that endpoint (default payload { text: "…" }) and merge any PERSON entities it returns with the Compromise results. Responses that include a simple people array or spaCy-style ents collections are supported. If the service is unreachable or errors, the parser automatically falls back to Compromise-only detection.
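A secondary NER service therefore only needs to accept the documented payload and reply in one of the supported shapes. A hypothetical endpoint sketch using Node's built-in http module (the port and the stubbed people list are illustrative):
`js
// Hypothetical secondary NER endpoint: accepts { text } and responds with { people: [...] }
import { createServer } from "http";

createServer((req, res) => {
  let body = "";
  req.on("data", (chunk) => (body += chunk));
  req.on("end", () => {
    const { text = "" } = JSON.parse(body || "{}");
    // Stub: run a real NER model over `text` here; the sketch just returns an empty list
    const people = [];
    res.setHeader("Content-Type", "application/json");
    res.end(JSON.stringify({ people }));
  });
}).listen(8080);
`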
### Content Detection
The detector is always enabled and uses a structured-data-first strategy, falling back to heuristic scoring:
- Structured data: Extracts JSON-LD Article/NewsArticle (headline, articleBody).
- Heuristics: Gathers DOM candidates (e.g., article, main, [role=main], content-like containers) and scores them by text length, punctuation, link density, paragraph count, semantic tags, and boilerplate penalties.
- Fragment promotion: When content is split across sibling blocks, a fragmentation heuristic merges them into a single higher-level candidate.
- ML reranker (optional): If weights are supplied, a lightweight reranker can refine the heuristic ranking.
- Title detection: Chooses from the structured headline, og:title/twitter:title, the first h1, or document.title, with normalization.
- Debug dump (optional): Write top-N candidates to CSV for dataset labeling.
You can tune thresholds and fragmentation frequency under options.contentDetection:
`js
contentDetection: {
minLength: 400,
maxLinkDensity: 0.5,
fragment: {
// require at least this many sibling parts
minParts: 2,
// minimum text length per part
minChildChars: 150,
// minimum combined text across parts
minCombinedChars: 400
},
// enable after training weights
reranker: { enabled: false }
}
`
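If you also want candidate features written out for labeling (used by the training workflow below), the debugDump option shown in the defaults can be enabled; a sketch with a placeholder URL:
`js
var options = {
  url: "https://example.com/article", // placeholder URL
  contentDetection: {
    // write the top-N candidates for each parsed page to CSV for labeling
    debugDump: { path: "candidates_with_url.csv", topN: 5, addUrl: true },
  },
};
`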
### Language Detection
Horseman automatically detects the article language and exposes ISO codes via article.language in the result. Downstream steps such as keyword extraction or spelling use these codes to select language-specific resources when available. Dictionaries for English, French, and Spanish are bundled; other languages fall back to English if a matching dictionary or NLP plugin is not found.
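For example, the detected code can be read from the result after a parse (a minimal sketch with a placeholder URL):
`js
// Sketch: reading the detected language from the result
const article = await parseArticle({ url: "https://example.com/article" }); // placeholder URL
console.log(article.language); // ISO language code(s) detected for the article
`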
### Development
Please feel free to fork the repo or open pull requests to the development branch. I've used eslint for linting.
Build the dependencies with:
`
npm install
`
Lint the project files with:
`
npm run lint
`
Quick single-run (sanity check):
`
npm run sample:single -- --url "https://example.com/article"
`
### Quick Start (CLI)
Run quick tests and batches from this repo without writing code.
#### Available scripts
- merge:csv: Merge CSVs (utility for dataset building).
- npm run merge:csv -- scripts/data/merged.csv scripts/data/candidates_with_url.csv
- sample:prepare: Fetch curated URLs from feeds/sitemaps into scripts/data/urls.txt.
- npm run sample:prepare -- --count 200 --progress-only
- sample:single: Run a single URL parse and write JSON to scripts/results/single-sample-run-result.json.
- npm run sample:single -- --url "https://example.com/article"
- sample:batch: Run the multi-URL sample with progress bar and summaries.
- npm run sample:batch -- --count 100 --concurrency 5 --urls-file scripts/data/urls.txt --timeout 20000 --unique-hosts --progress-only
- batch:crawl: Crawl URLs and dump content-candidate features to CSV.
- npm run batch:crawl -- --urls-file scripts/data/urls.txt --out-file scripts/data/candidates_with_url.csv --start 0 --limit 200 --concurrency 1 --unique-hosts --progress-only
- train:ranker: Train reranker weights from a candidates CSV.
- npm run train:ranker --
#### Common flags
- --bar-width: progress bar width for scripts with progress bars.
- --feed-concurrency / --feed-timeout: tuning for curated feed collection.
#### Single sample run
Writes a detailed JSON to scripts/results/single-sample-run-result.json.
`bash
npm run sample:single -- --url "https://www.cnn.com/business/live-news/fox-news-dominion-trial-04-18-23/index.html" --timeout 40000
# or run directly
node scripts/single-sample-run.js --url "https://www.cnn.com/business/live-news/fox-news-dominion-trial-04-18-23/index.html" --timeout 40000
`
Parameters
- --timeout: maximum time (ms) for the parse. If omitted, the script uses its default (40000 ms).
- --url: the article page to parse.
#### Batch sample run
1. Fetch a fresh set of URLs:
`bash
npm run sample:prepare -- --count 200 --feed-concurrency 8 --feed-timeout 15000 --bar-width 20 --progress-only
# or run directly
node scripts/fetch-curated-urls.js --count 200 --feed-concurrency 8 --feed-timeout 15000 --bar-width 20 --progress-only
`
Parameters
- --count: target number of URLs to collect into scripts/data/urls.txt.
- --feed-concurrency: number of feeds to fetch in parallel (optional).
- --feed-timeout: per-feed timeout in ms (optional).
- --bar-width: progress bar width (optional).
- --progress-only: print only progress updates (optional).
2. Run a batch against unique hosts with a simple progress-only view. Progress and a final summary print to the console; JSON/CSV reports are saved under scripts/results/.
`bash
npm run sample:batch -- --count 100 --concurrency 5 --urls-file scripts/data/urls.txt --timeout 20000 --unique-hosts --bar-width 20 --progress-only
# or run directly
node scripts/batch-sample-run.js --count 100 --concurrency 5 --urls-file scripts/data/urls.txt --timeout 20000 --unique-hosts --bar-width 20 --progress-only
`
Parameters
- --count: number of URLs to process.
- --concurrency: number of concurrent parses.
- --urls-file: file containing URLs to parse.
- --timeout: maximum time (ms) allowed for each parse.
- --unique-hosts: ensure each sampled URL has a unique host (optional).
- --progress-only: print only progress updates (optional).
- --bar-width: progress bar width (optional).
#### Training the reranker
You can train a simple logistic-regression reranker to improve candidate selection.
1. Generate candidate features
- Single URL (appends candidates):
- npm run sample:single -- --url
- Batch (recommended):
- npm run batch:crawl -- --urls-file scripts/data/urls.txt --out-file scripts/data/candidates_with_url.csv --start 0 --limit 200 --concurrency 1 --unique-hosts --progress-only
- Adjust --start and --limit to process in slices (e.g., --start 200 --limit 200, --start 400 --limit 200, ...).
Parameters
- --urls-file: input list of URLs to crawl
- --out-file: output CSV file for candidate features
- --start: start offset (row index) in the URLs file
- --limit: number of URLs to process in this run
- --concurrency: number of parallel crawlers
- --unique-hosts: ensure each URL has a unique host (optional)
- --progress-only: show only progress updates (optional)
- The project dumps candidate features with URL by default (see scripts/single-sample-run.js):
- Header: url,xpath,css_selector,text_length,punctuation_count,link_density,paragraph_count,has_semantic_container,boilerplate_penalty,direct_paragraph_count,direct_block_count,paragraph_to_block_ratio,average_paragraph_length,dom_depth,heading_children_count,aria_role_main,aria_role_negative,aria_hidden,image_alt_ratio,image_count,training_label,default_selected
- Up to topN unique-XPath rows per page (default 5)
2. Label the dataset
- Open scripts/data/candidates_with_url.csv in a spreadsheet/editor.
- For each URL group, set training_label = 1 for the correct article body candidate (leave others as 0).
- Column meanings (subset):
- url: source page
- xpath: Chrome console snippet to select the container (e.g., $x('...')[0])
- css_selector: Chrome console snippet to select via CSS (e.g., document.querySelector('...'))
- text_length: raw character length
- punctuation_count: count of punctuation (.,!?,;:)
- link_density: ratio of link text length to total text (0..1)
- paragraph_count: count of paragraph nodes under the container
- has_semantic_container: 1 if within article/main/role=main/itemtype*=Article, else 0
- boilerplate_penalty: number of boilerplate containers detected (nav/aside/comments/social/newsletter/consent), capped
- direct_paragraph_count, direct_block_count, paragraph_to_block_ratio, average_paragraph_length, dom_depth, heading_children_count: direct-children structure features used by heuristics
- aria_role_main, aria_role_negative, aria_hidden: accessibility signals
- image_alt_ratio, image_count: image accessibility metrics
- training_label: 1 for the true article candidate; 0 otherwise
- default_selected: 1 if this candidate would be chosen by the default heuristic (no custom weights)
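For intuition, link_density is the share of a candidate's text that sits inside links; an illustrative browser-console sketch (not the parser's internal code, and it assumes the page has an article element):
`js
// Illustrative only: approximate link_density for a candidate container in the browser console
const el = document.querySelector("article"); // candidate container (assumed to exist)
const totalText = el.innerText.length || 1;
const linkText = Array.from(el.querySelectorAll("a"))
  .reduce((sum, a) => sum + a.innerText.length, 0);
console.log(linkText / totalText); // value between 0 and 1
`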
3. Train weights and export JSON
- Via npm (use --silent and the -- arg separator):
  - npm run --silent train:ranker -- scripts/data/candidates_with_url.csv > weights.json
- Or run directly (avoids npm banner output):
  - node scripts/train-reranker.js scripts/data/candidates_with_url.csv weights.json
Parameters
- scripts/data/candidates_with_url.csv: labeled candidates CSV (input)
- weights.json: output weights file (JSON)
Tips
- -- passes subsequent args to the underlying script
- > weights.json redirects stdout to a file
4. Use the weights
- scripts/single-sample-run.js auto-loads weights.json (if present) and enables the reranker:
  options.contentDetection.reranker = { enabled: true, weights }
Notes
- If no reranker is configured, the detector uses heuristic scoring only.
- You can merge CSVs from multiple runs: npm run merge:csv -- scripts/data/merged.csv scripts/data/candidates_with_url.csv
- Tip: placing a weights.json in the project root will make scripts/single-sample-run.js auto-enable the reranker on the next run.
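If you prefer to enable the reranker explicitly in your own code rather than relying on the auto-load behaviour, the trained weights can be passed into options.contentDetection as documented above (a minimal sketch with a placeholder URL):
`js
// Sketch: load trained weights and enable the reranker explicitly
import { readFileSync } from "fs";
import { parseArticle } from "horseman-article-parser";

const weights = JSON.parse(readFileSync("./weights.json", "utf8"));

(async () => {
  const article = await parseArticle({
    url: "https://example.com/article", // placeholder URL
    contentDetection: {
      reranker: { enabled: true, weights },
    },
  });
  console.log(article.title.text);
})();
`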
Update API docs with:
`
npm run docs
`