Modern TypeScript library for scraping web content with isomorphic support
```bash
npm install magpie-html
```

Modern web scraping for when you need the good parts, not the markup soup. Extracts clean article content, parses feeds (RSS, Atom, JSON, Sitemaps), and gathers metadata from any page. Handles broken encodings, malformed feeds, and the chaos of real-world HTML. TypeScript-native, works everywhere. Named after the bird known for collecting valuable things... you get the idea.

Production-ready · Powers CrispRead, a trilingual news aggregator processing thousands of articles daily.
- 🎯 Isomorphic - Works in Node.js and browsers
- 📦 Modern ESM/CJS - Dual format support
- 🔒 Type-safe - Full TypeScript support
- 🧪 Well-tested - Built with Node.js native test runner
- 🚀 Minimal dependencies - Lightweight and fast
- 🔄 Multi-Format Feed Parser - Parse RSS 2.0, Atom 1.0, JSON Feed, and XML Sitemaps
- 🔗 Smart URL Resolution - Automatic normalization to absolute URLs
- 🛡️ Error Resilient - Graceful handling of malformed data
- 🦅 High-Level Convenience - One-line functions for common tasks
```bash
npm install magpie-html
```

```typescript
import { gatherWebsite, gatherArticle, gatherFeed } from "magpie-html";
// Gather complete website metadata
const site = await gatherWebsite("https://example.com");
console.log(site.title); // Page title
console.log(site.description); // Meta description
console.log(site.image); // Featured image
console.log(site.feeds); // Discovered feeds
console.log(site.internalLinks); // Internal links
// Gather article content + metadata
const article = await gatherArticle("https://example.com/article");
console.log(article.title); // Article title
console.log(article.content); // Clean article text
console.log(article.wordCount); // Word count
console.log(article.readingTime); // Reading time in minutes
// Gather feed data
const feed = await gatherFeed("https://example.com/feed.xml");
console.log(feed.title); // Feed title
console.log(feed.items); // Feed items
```
Extract comprehensive metadata from any webpage:
```typescript
import { gatherWebsite } from "magpie-html";
const site = await gatherWebsite("https://example.com");
// Basic metadata
console.log(site.url); // Final URL (after redirects)
console.log(site.title); // Best title (cleaned)
console.log(site.description); // Meta description
console.log(site.image); // Featured image URL
console.log(site.icon); // Site favicon/icon
// Language & region
console.log(site.language); // ISO 639-1 code (e.g., 'en')
console.log(site.region); // ISO 3166-1 alpha-2 (e.g., 'US')
// Discovered content
console.log(site.feeds); // Array of feed URLs
console.log(site.internalLinks); // Internal links (same domain)
console.log(site.externalLinks); // External links (other domains)
// Raw content
console.log(site.html); // Raw HTML
console.log(site.text); // Plain text (full page)
```
What it does:
- Fetches the page with automatic redirect handling
- Extracts metadata from multiple sources (OpenGraph, Schema.org, Twitter Card, etc.)
- Picks the "best" value for each field (longest, highest priority, cleaned)
- Discovers RSS/Atom/JSON feeds linked on the page
- Categorizes internal vs external links
- Returns normalized, absolute URLs (see the sketch below)
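The last two steps boil down to resolving each href against the page URL and comparing hostnames. A minimal sketch of the idea, using only the standard URL API (illustrative only, not magpie-html's actual implementation):

```typescript
// Illustrative sketch: categorize links and normalize them to absolute URLs.
function categorizeLinks(hrefs: string[], pageUrl: string) {
  const base = new URL(pageUrl);
  const internal: string[] = [];
  const external: string[] = [];
  for (const href of hrefs) {
    let resolved: URL;
    try {
      resolved = new URL(href, base); // relative hrefs resolve against the page URL
    } catch {
      continue; // skip hrefs that are not valid URLs at all
    }
    if (!resolved.protocol.startsWith("http")) continue; // ignore mailto:, javascript:, etc.
    (resolved.hostname === base.hostname ? internal : external).push(resolved.href);
  }
  return { internal, external };
}
```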
Extract clean article content with metadata:
```typescript
import { gatherArticle } from "magpie-html";
const article = await gatherArticle("https://example.com/article");
// Core content
console.log(article.url); // Final URL
console.log(article.title); // Article title (Readability or metadata)
console.log(article.content); // Clean article text (formatted)
console.log(article.description); // Excerpt/summary
// Metrics
console.log(article.wordCount); // Word count
console.log(article.readingTime); // Est. reading time (minutes)
// Media & language
console.log(article.image); // Article image
console.log(article.language); // Language code
console.log(article.region); // Region code
// Links & raw content
console.log(article.internalLinks); // Internal links
console.log(article.externalLinks); // External links (citations)
console.log(article.html); // Raw HTML
console.log(article.text); // Plain text (full page)
```
What it does:
- Uses Mozilla Readability to extract clean article content
- Falls back to metadata extraction if Readability fails
- Converts cleaned HTML to well-formatted plain text
- Calculates reading metrics (word count, reading time; see the sketch after this list)
- Provides both cleaned content and raw HTML
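The reading metrics are simple to reason about. A minimal sketch, assuming the common ~200 words-per-minute heuristic (the library's exact tokenization and constant may differ):

```typescript
// Illustrative sketch of the reading metrics; not the library's actual code.
const WORDS_PER_MINUTE = 200; // assumed average reading speed

function readingMetrics(text: string) {
  const wordCount = text.split(/\s+/).filter(Boolean).length;
  const readingTime = Math.ceil(wordCount / WORDS_PER_MINUTE); // minutes, rounded up
  return { wordCount, readingTime };
}

readingMetrics("The quick brown fox jumps over the lazy dog.");
// => { wordCount: 9, readingTime: 1 }
```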
Parse any feed format with one function:
```typescript
import { gatherFeed } from "magpie-html";
const feed = await gatherFeed("https://example.com/feed.xml");
// Feed metadata
console.log(feed.title); // Feed title
console.log(feed.description); // Feed description
console.log(feed.url); // Feed URL
console.log(feed.siteUrl); // Website URL
// Feed items
for (const item of feed.items) {
  console.log(item.title); // Item title
  console.log(item.url); // Item URL (absolute)
  console.log(item.description); // Item description
  console.log(item.publishedAt); // Publication date
  console.log(item.author); // Author
}
// Format detection
console.log(feed.format); // 'rss', 'atom', or 'json-feed'
```
What it does:
- Auto-detects feed format (RSS 2.0, Atom 1.0, JSON Feed), as sketched after this list
- Normalizes all formats to a unified interface
- Resolves relative URLs to absolute
- Handles malformed data gracefully
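Format auto-detection can be thought of as a cheap fingerprinting pass before parsing. A hedged sketch of the general approach (not the library's actual detection code):

```typescript
// Illustrative sketch: guess a feed's format from its raw content.
type FeedFormat = "rss" | "atom" | "json-feed" | "unknown";

function detectFeedFormat(content: string): FeedFormat {
  const trimmed = content.trim();
  if (trimmed.startsWith("{")) {
    try {
      const json = JSON.parse(trimmed);
      // JSON Feed declares its version as a jsonfeed.org URL
      if (typeof json.version === "string" && json.version.includes("jsonfeed.org")) {
        return "json-feed";
      }
    } catch {
      // not valid JSON; fall through to the XML checks
    }
  }
  if (/<rss[\s>]/i.test(trimmed)) return "rss";
  if (/<feed[\s>]/i.test(trimmed)) return "atom";
  return "unknown";
}
```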
For more control, use the lower-level modules directly:
```typescript
import { pluck, parseFeed } from "magpie-html";
// Fetch feed content
const response = await pluck("https://example.com/feed.xml");
const feedContent = await response.textUtf8();
// Parse with base URL for relative links
const result = parseFeed(feedContent, response.finalUrl);
console.log(result.feed.title);
console.log(result.feed.items[0].title);
console.log(result.feed.format); // 'rss', 'atom', or 'json-feed'
```
When standard feeds aren't available, XML sitemaps can be a useful fallback for discovering URLs. Supports standard sitemaps, sitemap indexes, and Google News/Image/Video extensions:
```typescript
import { pluck, parseSitemap, isSitemap } from "magpie-html";
const response = await pluck("https://example.com/sitemap.xml");
const content = await response.textUtf8();
if (isSitemap(content)) {
  const result = parseSitemap(content, response.finalUrl);
  for (const url of result.sitemap.urls) {
    console.log(url.loc); // URL
    console.log(url.lastmod); // Last modified date
    console.log(url.news?.title); // Google News title (if present)
    console.log(url.news?.publicationDate); // Publication date
  }
  // For sitemap indexes, check result.sitemap.sitemaps[]
}
```
```typescript
import { parseHTML, extractContent, htmlToText } from "magpie-html";

// Parse HTML once
const doc = parseHTML(html);

// Extract article with Readability
const result = extractContent(doc, {
  baseUrl: "https://example.com/article",
  cleanConditionally: true,
  keepClasses: false,
});

if (result.success) {
  console.log(result.title); // Article title
  console.log(result.excerpt); // Article excerpt
  console.log(result.content); // Clean HTML
  console.log(result.textContent); // Plain text
  console.log(result.wordCount); // Word count
  console.log(result.readingTime); // Reading time
}

// Or convert any HTML to text
const plainText = htmlToText(html, {
  preserveWhitespace: false,
  includeLinks: true,
  wrapColumn: 80,
});
```
```typescript
import {
  parseHTML,
  extractOpenGraph,
  extractSchemaOrg,
  extractSEO,
} from "magpie-html";

const doc = parseHTML(html);

// Extract OpenGraph metadata
const og = extractOpenGraph(doc);
console.log(og.title);
console.log(og.description);
console.log(og.image);

// Extract Schema.org data
const schema = extractSchemaOrg(doc);
console.log(schema.articles); // NewsArticle, etc.

// Extract SEO metadata
const seo = extractSEO(doc);
console.log(seo.title);
console.log(seo.description);
console.log(seo.keywords);
```
Available extractors:
- `extractSEO` - SEO meta tags
- `extractOpenGraph` - OpenGraph metadata
- `extractTwitterCard` - Twitter Card metadata
- `extractSchemaOrg` - Schema.org / JSON-LD
- `extractCanonical` - Canonical URLs
- `extractLanguage` - Language detection
- `extractIcons` - Favicon and icons
- `extractAssets` - All linked assets (images, scripts, fonts, etc.)
- `extractLinks` - Navigation links (with internal/external split)
- `extractFeedDiscovery` - Discover RSS/Atom/JSON feeds
- ...and more
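All extractors follow the same pattern as the examples above: pass the parsed document and read the result. For instance (the exact shape of these return values is assumed here for illustration; see the API docs for the authoritative types):

```typescript
import { parseHTML, extractCanonical, extractFeedDiscovery } from "magpie-html";

const doc = parseHTML(html);

// Reuse the same parsed document across extractors.
const canonical = extractCanonical(doc); // canonical URL info (shape per API docs)
const feeds = extractFeedDiscovery(doc); // discovered feed links (shape per API docs)
console.log(canonical);
console.log(feeds);
```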
Use pluck() for robust fetching with automatic encoding and redirect handling:
```typescript
import { pluck } from "magpie-html";

const response = await pluck("https://example.com", {
  timeout: 30000, // 30 second timeout
  maxRedirects: 10, // Follow up to 10 redirects
  maxSize: 10485760, // 10MB limit
  userAgent: "MyBot/1.0",
  throwOnHttpError: true,
  strictContentType: false,
});

// Enhanced response properties
console.log(response.finalUrl); // URL after redirects
console.log(response.redirectChain); // All redirect URLs
console.log(response.detectedEncoding); // Detected charset
console.log(response.timing); // Request timing

// Get UTF-8 decoded content
const text = await response.textUtf8();
```
Why pluck()?
- Handles broken sites with wrong/missing encoding declarations
- Follows redirect chains and tracks them
- Enforces timeouts and size limits
- Compatible with the standard fetch() API
- Named `pluck()` to avoid confusion (magpies pluck things! 🦅)
## swoop() (client-side DOM rendering without a browser engine)

> ⚠️ SECURITY WARNING — Remote Code Execution (RCE)
>
> swoop() executes remote, third-party JavaScript inside your current Node.js process (via node:vm + browser shims).
> This is fundamentally insecure. Only use on fully trusted targets and treat inputs as hostile by default.
> For any professional/untrusted scraping, run this in a real sandbox (container/VM/locked-down OS user + seccomp/apparmor/firejail, etc.).
> Note: magpie-html does not use swoop() internally. It’s provided as an optional standalone utility for the few cases where you really need DOM-only client-side rendering.
swoop() is an explicitly experimental helper that tries to execute client-side scripts against a DOM-only environment and then returns a best-effort HTML snapshot.
Sometimes curl / fetch / pluck() isn’t enough because the page is a SPA and only renders content after client-side JavaScript runs. swoop() exists to quickly turn “CSR-only” pages into HTML so the rest of magpie-html can work with the result.
If it works, it can be comparatively light and fast, because it avoids a full browser engine in favor of a custom node:vm-based execution environment with browser shims.
For very complicated targets (heavy JS, complex navigation, strong anti-bot, layout-dependent rendering), you should use a real browser engine instead.
swoop() is best seen as a building block—you still need to provide the real sandboxing around it.
- A pragmatic “SPA snapshotter” for cases where a page renders content via client-side JavaScript.
- No browser engine: no layout/paint/CSS correctness.
- Not a headless browser replacement (no navigation lifecycle, no reliable layout APIs).
```typescript
import { swoop } from "magpie-html";

const result = await swoop("https://example.com/spa", {
  waitStrategy: "networkidle",
  timeout: 3000,
});

console.log(result.html);
console.log(result.errors);
```
Best Practice: Parse HTML once and reuse the document:
```typescript
import {
  parseHTML,
  extractSEO,
  extractOpenGraph,
  extractContent,
} from "magpie-html";

const doc = parseHTML(html);

// Reuse the same document for multiple extractions
const seo = extractSEO(doc); // Fast: <5ms
const og = extractOpenGraph(doc); // Fast: <5ms
const content = extractContent(doc); // ~100-500ms
// Total: One parse + all extractions
```
```bash
npm install
```

```bash
npm test
```

The test suite includes both unit tests (`*.test.ts`) and integration tests using real-world HTML/feed files from `cache/`.

```bash
npm run test:watch
```

```bash
npm run build
```
```bash
# Check for issues
npm run lint
```

```bash
npm run typecheck
```
Generate API documentation:
```bash
npm run docs
npm run docs:serve
```

## Integration Testing
The `cache/` directory contains real-world HTML and feed samples for integration testing. This enables testing against actual production data without network calls.
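A typical integration test along these lines might look as follows, using the Node.js native test runner (the file and fixture names here are illustrative, not actual paths from this repo):

```typescript
// article.test.ts: illustrative sketch of an integration test.
import { test } from "node:test";
import assert from "node:assert/strict";
import { readFile } from "node:fs/promises";
import { parseHTML, extractContent } from "magpie-html";

test("extracts article content from a cached page", async () => {
  // Read a cached real-world page instead of hitting the network.
  const html = await readFile("cache/example-article.html", "utf8"); // hypothetical fixture
  const doc = parseHTML(html);
  const result = extractContent(doc, { baseUrl: "https://example.com/article" });
  assert.ok(result.success);
  assert.ok(result.wordCount > 0);
});
```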
## Publishing

```bash
npm publish
```

The `prepublishOnly` script automatically builds the package before publishing.

---
If this package helps your project, consider sponsoring its maintenance.

---
Anonyfox • API Docs • MIT License