# @xmer/rss-reader

TypeScript RSS feed reader with Puppeteer-based scraping for JavaScript-rendered content.

A robust TypeScript RSS feed reader designed to handle challenging feeds that require full browser execution, with built-in browser pool management, link extraction, and configurable content parsing.
## Features

- JavaScript-Rendered Feed Support: Uses Puppeteer to handle feeds that require browser execution
- Browser Pool Management: Maintains warm browser instances for <5s feed fetch latency
- Comprehensive Link Extraction: Extracts all links from HTML content (HTTP, HTTPS, magnet, etc.)
- Configurable Title Cleaning: Remove site-specific suffixes/prefixes with regex patterns
- Retry Logic: Automatic retry with exponential backoff for transient failures
- TypeScript-First: Full type safety with detailed interface definitions
- Error Handling: Comprehensive error hierarchy for precise error handling
- Memory Efficient: Configurable browser pool size to balance performance and resource usage
## Table of Contents

- Installation
- Quick Start
- FitGirl Repacks Example
- Configuration Options
- API Reference
- Error Handling
- Security Considerations
- Performance
- Docker/CI Deployment
- Examples
- Contributing
- License
## Installation

```bash
npm install @xmer/rss-reader
```
### Requirements

- Node.js 18.0.0 or higher
- Sufficient system resources for Puppeteer (default: 2 browser instances)

### Dependencies

All dependencies are installed automatically with the package:

- puppeteer ^24.34.0
- cheerio ^1.0.0-rc.12
- xml2js ^0.6.2
## Quick Start

```typescript
import { RssReader } from '@xmer/rss-reader';

// Create reader instance
const reader = new RssReader();

// Initialize browser pool (REQUIRED)
await reader.initialize();

try {
  // Fetch and parse feed
  const feed = await reader.fetchAndParse('https://example.com/feed.xml');

  console.log(`Feed: ${feed.title}`);
  console.log(`Items: ${feed.items.length}`);

  // Access parsed items
  for (const item of feed.items) {
    console.log(`${item.title} - ${item.links.length} links`);
    console.log(`Published: ${item.publishedAt.toISOString()}`);
  }
} finally {
  // CRITICAL: Always close to prevent zombie browsers
  await reader.close();
}
```
## FitGirl Repacks Example

Primary use case: parsing the FitGirl Repacks feed with title cleaning and magnet link extraction.
```typescript
import { RssReader, FeedFetchError, ParseError } from '@xmer/rss-reader';

const reader = new RssReader({
  // Remove site branding from titles
  titleSuffixPattern: /[–-]\s*FitGirl Repacks/,
  // Extract WordPress-style item IDs
  itemIdPattern: /[?&]p=(\d+)/,
  // Increase pool size for better throughput
  browserPoolSize: 3,
  // More retries for potentially flaky connections
  retryAttempts: 5,
  // Longer timeout for slow-loading feeds
  timeout: 60000
});

await reader.initialize();

try {
  const feed = await reader.fetchAndParse('https://fitgirl-repacks.site/feed/');

  for (const item of feed.items) {
    // Title is cleaned: "Game Name – FitGirl Repacks" -> "Game Name"
    console.log(`Game: ${item.title}`);
    console.log(`Post ID: ${item.itemId}`);

    // Extract magnet links
    const magnetLinks = item.links.filter(link => link.startsWith('magnet:'));
    console.log(`Magnet links: ${magnetLinks.length}`);

    if (magnetLinks.length > 0) {
      console.log(`Download: ${magnetLinks[0]}`);
    }
  }
} catch (error) {
  if (error instanceof FeedFetchError) {
    console.error(`Failed to fetch feed: ${error.message}`);
  } else if (error instanceof ParseError) {
    console.error(`Failed to parse item: ${error.itemTitle}`);
  }
} finally {
  await reader.close();
}
```
## Configuration Options

Configure RssReader behavior by passing options to the constructor:
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| titleSuffixPattern | RegExp | undefined | Pattern to remove from end of titles (e.g., `/[–-]\s*Site Name/`) |
| titlePrefixPattern | RegExp | undefined | Pattern to remove from start of titles |
| itemIdPattern | RegExp | `/[?&]p=(\d+)/` | Pattern to extract item ID from URLs (WordPress-style by default) |
| browserPoolSize | number | 2 | Number of browser instances in pool (balance performance vs. resources) |
| retryAttempts | number | 3 | Maximum retry attempts for transient failures |
| timeout | number | 30000 | Timeout for feed fetching in milliseconds |
| linkValidationPattern | RegExp | undefined | Optional regex pattern to validate extracted links |
| disableSandbox | boolean | false | DANGER: disables the Chrome sandbox (only for Docker/CI) |
Basic configuration:

```typescript
const reader = new RssReader({
  browserPoolSize: 3,
  timeout: 45000
});
```

Site-specific extraction:

```typescript
const reader = new RssReader({
  titleSuffixPattern: /\s-\sBlog Name$/,
  titlePrefixPattern: /^\[News\]\s*/,
  itemIdPattern: /\/(\d+)\/$/ // Extract from URL path
});
```

High-performance setup:

```typescript
const reader = new RssReader({
  browserPoolSize: 5,
  retryAttempts: 5,
  timeout: 60000
});
```
## API Reference

### RssReader

Main entry point for RSS feed reading.
#### Constructor
```typescript
new RssReader(config?: RssReaderConfig)
```
Creates a new RssReader instance with optional configuration.
Parameters:
- config (RssReaderConfig): Optional configuration object
Example:
```typescript
const reader = new RssReader({
  browserPoolSize: 2,
  retryAttempts: 3
});
```
#### initialize()
```typescript
async initialize(): Promise<void>
```
Initializes the browser pool. MUST be called before fetchAndParse().
Throws:
- BrowserError: If browser pool initialization fails
Example:
```typescript
await reader.initialize();
```
#### fetchAndParse()
```typescript
async fetchAndParse(feedUrl: string): Promise<RssFeed>
```
Fetches and parses an RSS feed.
Parameters:
- feedUrl (string): URL of the RSS feed to fetch
Returns:
- `Promise<RssFeed>`: Parsed RSS feed with all items
Throws:
- FeedFetchError: If fetching fails after all retries
- InvalidFeedError: If feed structure is invalid
- ParseError: If item parsing fails
- BrowserError: If browser pool is not initialized
Example:
```typescript
const feed = await reader.fetchAndParse('https://example.com/feed.xml');
console.log(`Found ${feed.items.length} items`);
```
#### close()
```typescript
async close(): Promise<void>
```
Closes the browser pool and cleans up resources. MUST be called when done to prevent zombie browser processes.
Example:
```typescript
try {
  await reader.fetchAndParse(url);
} finally {
  await reader.close();
}
```
#### getPoolStats()
```typescript
getPoolStats(): BrowserPoolStats
```
Returns current browser pool statistics.
Returns:
- BrowserPoolStats: Object containing total, available, and inUse counts
Example:
```typescript
const stats = reader.getPoolStats();
console.log(`Pool: ${stats.inUse}/${stats.total} in use`);
```
#### isInitialized()
```typescript
isInitialized(): boolean
```
Checks if the reader is initialized.
Returns:
- boolean: true if initialized
#### Static Utilities
Utility methods for standalone use without creating a reader instance:
extractLinks()
```typescript
static extractLinks(html: string): string[]
```
Extracts all links from HTML content.
Parameters:
- html (string): HTML content to parse
Returns:
- string[]: Array of extracted links
Example:
```typescript
const links = RssReader.extractLinks('<a href="https://example.com">Link</a>');
```
extractItemId()
```typescript
static extractItemId(url: string, pattern?: RegExp): string | undefined
```
Extracts item ID from URL using regex pattern.
Parameters:
- url (string): URL to parse
- pattern (RegExp): Optional regex pattern (default: WordPress-style)
Returns:
- string | undefined: Extracted ID or undefined if not found
Example:
```typescript
const id = RssReader.extractItemId('https://example.com/?p=12345');
// Returns: "12345"
```
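A custom pattern can be checked standalone with a plain RegExp match before wiring it into itemIdPattern. The URL and pattern below are illustrative, and the snippet mirrors what a capture-group extraction does rather than calling the package:

```typescript
// Hypothetical custom pattern: extract a numeric ID from the end of the URL path
const pattern = /\/(\d+)\/$/;

// Capture group 1 holds the ID when the pattern matches
const match = 'https://example.com/posts/123/'.match(pattern);
const id = match?.[1];

console.log(id); // '123'
```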
cleanTitle()
```typescript
static cleanTitle(title: string, options?: TitleCleanOptions): string
```
Cleans title by removing patterns and normalizing whitespace.
Parameters:
- title (string): Title to clean
- options (TitleCleanOptions): Optional cleaning options
Returns:
- string: Cleaned title
Example:
```typescript
const cleaned = RssReader.cleanTitle('Game – Site Name', {
  suffixPattern: /[–-]\s*Site Name/
});
// Returns: "Game"
```
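The documented behavior — strip optional patterns, then normalize whitespace — can be sketched in plain string operations. This is an illustration of what cleanTitle is described to do, not the package's actual implementation:

```typescript
// Sketch: strip optional prefix/suffix patterns, then collapse/trim whitespace
function cleanTitleSketch(
  title: string,
  options?: { prefixPattern?: RegExp; suffixPattern?: RegExp }
): string {
  let out = title;
  if (options?.prefixPattern) out = out.replace(options.prefixPattern, '');
  if (options?.suffixPattern) out = out.replace(options.suffixPattern, '');
  return out.replace(/\s+/g, ' ').trim();
}

console.log(cleanTitleSketch('Game – Site Name', { suffixPattern: /[–-]\s*Site Name/ })); // 'Game'
```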
validateLink()
```typescript
static validateLink(link: string, pattern?: RegExp): boolean
```
Validates link against optional pattern.
Parameters:
- link (string): Link to validate
- pattern (RegExp): Optional validation pattern
Returns:
- boolean: true if valid
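The contract described above — presumably valid when no pattern is supplied, otherwise valid when the pattern matches — can be sketched as follows. This is an illustrative reimplementation for testing intuition, not the package source:

```typescript
// Sketch of validateLink's documented behavior
const isValidLink = (link: string, pattern?: RegExp): boolean =>
  pattern ? pattern.test(link) : true;

console.log(isValidLink('https://example.com', /^https:\/\//)); // true
console.log(isValidLink('ftp://example.com', /^https:\/\//));   // false
```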
### Types

#### RssFeed
```typescript
interface RssFeed {
  title: string;        // Feed title
  feedUrl: string;      // Feed URL
  description?: string; // Feed description
  items: RssItem[];     // Parsed items
  fetchedAt: Date;      // Fetch timestamp
}
```
#### RssItem
```typescript
interface RssItem {
  title: string;        // Item title (cleaned)
  link: string;         // Item link/URL
  publishedAt: Date;    // Publication date
  links: string[];      // All extracted links from content
  itemId?: string;      // Optional item ID extracted from URL
  rawContent?: string;  // Optional raw HTML content (UNSANITIZED)
  metadata?: Record<string, unknown>; // Additional metadata
}
```
SECURITY WARNING: The rawContent field contains unsanitized HTML. See Security Considerations.
#### RssReaderConfig
See Configuration Options table above.
#### BrowserPoolStats
```typescript
interface BrowserPoolStats {
  total: number;     // Total browsers in pool
  available: number; // Available browsers
  inUse: number;     // Browsers currently in use
}
```
## Error Handling

All errors extend the base RssReaderError class for easy identification.

#### RssReaderError
Base error class for all package errors.
#### FeedFetchError
Thrown when feed fetching fails (network issues, timeout, Puppeteer errors).
```typescript
try {
  await reader.fetchAndParse(url);
} catch (error) {
  if (error instanceof FeedFetchError) {
    console.error(`Failed to fetch ${error.feedUrl}: ${error.message}`);
    // Retry with exponential backoff or alert
  }
}
```
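The "retry with exponential backoff" comment can be sketched as a small wrapper. withBackoff is a hypothetical helper, not part of the package:

```typescript
// Retries an async operation, doubling the delay after each failure
// (e.g. 500ms, 1s, 2s). Rethrows the last error when attempts run out.
async function withBackoff<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 500
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        await new Promise(res => setTimeout(res, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastError;
}

// Usage sketch (reader assumed initialized):
// const feed = await withBackoff(() => reader.fetchAndParse(url), 5);
```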
#### InvalidFeedError
Thrown when feed structure is invalid or cannot be parsed as RSS/Atom.
```typescript
catch (error) {
  if (error instanceof InvalidFeedError) {
    console.error(`Invalid feed at ${error.feedUrl}`);
    // Skip this feed or alert administrator
  }
}
```
#### ParseError
Thrown when individual item parsing fails.
```typescript
catch (error) {
  if (error instanceof ParseError) {
    console.error(`Failed to parse item: ${error.itemTitle}`);
    // Log and continue with other items
  }
}
```
#### BrowserError
Thrown when browser operations fail (launch, crash, pool exhaustion).
```typescript
catch (error) {
  if (error instanceof BrowserError) {
    console.error(`Browser error during ${error.operation}`);
    // Reinitialize browser pool
    await reader.close();
    await reader.initialize();
  }
}
```
#### LinkNotFoundError
Thrown when no links are found in item content (may be expected for some feeds).
```typescript
catch (error) {
  if (error instanceof LinkNotFoundError) {
    console.warn(`No links found in: ${error.itemTitle}`);
    // Expected for some items - log and continue
  }
}
```
A complete error-handling pattern:

```typescript
import {
  RssReader,
  FeedFetchError,
  BrowserError,
  ParseError
} from '@xmer/rss-reader';

const reader = new RssReader();
await reader.initialize();

try {
  const feed = await reader.fetchAndParse(url);

  // Process items
  for (const item of feed.items) {
    try {
      await processItem(item);
    } catch (itemError) {
      // Log individual item failures but continue
      console.error(`Failed to process ${item.title}:`, itemError);
    }
  }
} catch (error) {
  if (error instanceof FeedFetchError) {
    // Network/Puppeteer failure - retry or alert
    console.error('Feed fetch failed:', error.message);
  } else if (error instanceof BrowserError) {
    // Browser crashed - reinitialize pool
    console.error('Browser error:', error.message);
    await reader.close();
    await reader.initialize();
  } else if (error instanceof ParseError) {
    // Parse failure - log and skip
    console.error('Parse error:', error.message);
  } else {
    // Unexpected error
    console.error('Unexpected error:', error);
  }
} finally {
  // ALWAYS cleanup
  await reader.close();
}
```
## Security Considerations

### Chrome Sandbox (disableSandbox)

The disableSandbox option removes Chrome's process isolation security layer.
NEVER set disableSandbox: true in production environments unless:
- Running inside a Docker container with proper isolation
- Running in a CI/CD pipeline with ephemeral environments
- You fully understand the security implications
Why this is dangerous:
- Compromised websites can escape the browser and access your system
- Malicious content in RSS feeds can execute arbitrary code
- No process isolation between browser tabs
Safe usage in Docker:
```typescript
const reader = new RssReader({
  // Only safe inside Docker containers
  disableSandbox: process.env.RUNNING_IN_DOCKER === 'true'
});
```
References:
- Chromium Sandboxing
- Puppeteer Docker Best Practices
### Unsanitized HTML Content

The rawContent field in RssItem contains unsanitized HTML from the RSS feed.
NEVER render this content directly:
```typescript
// DANGEROUS - XSS vulnerability
element.innerHTML = item.rawContent;
```
Always sanitize before rendering:
```typescript
import DOMPurify from 'dompurify';

// SAFE - sanitized HTML
const clean = DOMPurify.sanitize(item.rawContent);
element.innerHTML = clean;
```
Recommended sanitization libraries:
- DOMPurify - Client-side HTML sanitization
- sanitize-html - Server-side HTML sanitization
References:
- OWASP XSS Prevention Cheat Sheet
### Best Practices

1. Run in isolated containers: Use Docker with proper resource limits
2. Sanitize all output: Never trust content from RSS feeds
3. Validate links: Use linkValidationPattern to restrict allowed link formats
4. Monitor resource usage: Browser instances consume significant memory
5. Set reasonable timeouts: Prevent indefinite hangs on malicious feeds
6. Log security events: Track failed fetches and suspicious content
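Point 3 (linkValidationPattern) in practice: a config fragment restricting extracted links to HTTPS and magnet URIs. The pattern shown is illustrative — adapt it to the schemes you actually want to allow:

```typescript
import { RssReader } from '@xmer/rss-reader';

// Only accept HTTPS and magnet links; anything else is rejected
const reader = new RssReader({
  linkValidationPattern: /^(https:\/\/|magnet:)/
});
```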
## Performance

Typical performance characteristics:

- Feed fetch: <5s (with warm browser pool)
- Parse per item: <100ms
- Memory: ~300MB with 2 browser pool
- Browser startup: ~2s cold start
- Max concurrent: Limited by browserPoolSize
Browser Pool Sizing:
```typescript
// Low concurrency (1-2 feeds simultaneously)
browserPoolSize: 2 // Default, ~300MB memory
// Medium concurrency (3-5 feeds simultaneously)
browserPoolSize: 5 // ~750MB memory
// High concurrency (10+ feeds simultaneously)
browserPoolSize: 10 // ~1.5GB memory
```
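The sizing figures above imply roughly 150 MB per pooled browser. A back-of-the-envelope helper (an estimate derived from those numbers, not a measured guarantee):

```typescript
// ~150 MB per pooled browser, per the figures above
const estimateMemoryMB = (browserPoolSize: number): number =>
  browserPoolSize * 150;

console.log(estimateMemoryMB(2));  // 300
console.log(estimateMemoryMB(10)); // 1500
```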
Timeout Configuration:
```typescript
// Fast, reliable feeds
timeout: 15000 // 15s
// Slow or unreliable feeds
timeout: 60000 // 60s
// Very slow feeds (use sparingly)
timeout: 120000 // 2 minutes
```
Retry Strategy:
```typescript
// Reliable network
retryAttempts: 2
// Unreliable network or flaky feeds
retryAttempts: 5
// Critical feeds that must succeed
retryAttempts: 10
```
Monitor browser pool health:

```typescript
const stats = reader.getPoolStats();
if (stats.inUse === stats.total) {
  console.warn('Browser pool exhausted - consider increasing browserPoolSize');
}
```
```typescript
// Track fetch performance
const start = Date.now();
const feed = await reader.fetchAndParse(url);
const duration = Date.now() - start;

console.log(`Fetched ${feed.items.length} items in ${duration}ms`);
console.log(`Avg per item: ${duration / feed.items.length}ms`);

// Monitor pool utilization
const stats = reader.getPoolStats();
console.log(`Pool utilization: ${(stats.inUse / stats.total * 100).toFixed(1)}%`);
```
## Docker/CI Deployment

### Dockerfile

```dockerfile
FROM node:18-alpine

WORKDIR /app

# Install dependencies, then copy the built application
# (COPY/RUN steps shown for completeness - adjust to your build layout)
COPY package*.json ./
RUN npm ci --omit=dev
COPY dist/ ./dist/

CMD ["node", "dist/index.js"]
```
### docker-compose

```yaml
version: '3.8'

services:
  rss-reader:
    build: .
    environment:
      - NODE_ENV=production
      - RUNNING_IN_DOCKER=true
    deploy:
      resources:
        limits:
          memory: 1G
          cpus: '1.0'
        reservations:
          memory: 512M
    security_opt:
      - no-new-privileges:true
    cap_drop:
      - ALL
```
```typescript
import { RssReader } from '@xmer/rss-reader';
const reader = new RssReader({
// Safe in Docker container
disableSandbox: process.env.RUNNING_IN_DOCKER === 'true',
browserPoolSize: 3,
timeout: 60000
});
await reader.initialize();
try {
const feed = await reader.fetchAndParse(url);
// Process feed
} finally {
await reader.close();
}
```
### GitHub Actions

```yaml
name: RSS Reader CI

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '18'
      - name: Install dependencies
        run: npm ci
      - name: Install Chromium dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y \
            chromium-browser \
            libx11-xcb1 \
            libxcomposite1 \
            libxcursor1 \
            libxdamage1 \
            libxi6 \
            libxtst6 \
            libnss3 \
            libcups2 \
            libxss1 \
            libxrandr2 \
            libasound2 \
            libatk1.0-0 \
            libatk-bridge2.0-0 \
            libpangocairo-1.0-0 \
            libgtk-3-0
      - name: Run tests
        run: npm test
        env:
          PUPPETEER_EXECUTABLE_PATH: /usr/bin/chromium-browser
```
## Examples

### Basic Feed Reading

```typescript
import { RssReader } from '@xmer/rss-reader';
const reader = new RssReader();
await reader.initialize();
try {
const feed = await reader.fetchAndParse('https://blog.example.com/feed.xml');
for (const item of feed.items) {
console.log(item.title);
console.log(item.publishedAt.toLocaleDateString());
console.log(item.links.join(', '));
console.log('---');
}
} finally {
await reader.close();
}
```
### Custom Title Cleaning

```typescript
const reader = new RssReader({
titleSuffixPattern: /\s-\sTechBlog$/,
titlePrefixPattern: /^\[Article\]\s*/
});
await reader.initialize();
try {
const feed = await reader.fetchAndParse('https://techblog.example.com/feed');
// Titles are cleaned:
// "[Article] TypeScript Tips - TechBlog" -> "TypeScript Tips"
for (const item of feed.items) {
console.log(item.title);
}
} finally {
await reader.close();
}
```
### Link Filtering

```typescript
const reader = new RssReader();
await reader.initialize();

try {
  const feed = await reader.fetchAndParse('https://downloads.example.com/feed');

  for (const item of feed.items) {
    // Extract only magnet links
    const magnetLinks = item.links.filter(link => link.startsWith('magnet:'));

    // Extract only HTTP(S) links
    const httpLinks = item.links.filter(link =>
      link.startsWith('http://') || link.startsWith('https://')
    );

    console.log(`${item.title}:`);
    console.log(`  Magnet: ${magnetLinks.length}`);
    console.log(`  HTTP: ${httpLinks.length}`);
  }
} finally {
  await reader.close();
}
```
### Processing Multiple Feeds

```typescript
const reader = new RssReader({
  browserPoolSize: 5, // Support concurrent processing
  retryAttempts: 3
});

await reader.initialize();

try {
  const feedUrls = [
    'https://feed1.example.com/rss',
    'https://feed2.example.com/rss',
    'https://feed3.example.com/rss'
  ];

  // Process feeds concurrently
  const results = await Promise.allSettled(
    feedUrls.map(url => reader.fetchAndParse(url))
  );

  for (const result of results) {
    if (result.status === 'fulfilled') {
      const feed = result.value;
      console.log(`${feed.title}: ${feed.items.length} items`);
    } else {
      console.error(`Failed to fetch feed: ${result.reason}`);
    }
  }
} finally {
  await reader.close();
}
```
### Full Error Handling

```typescript
import {
  RssReader,
  FeedFetchError,
  BrowserError,
  ParseError,
  InvalidFeedError,
  LinkNotFoundError
} from '@xmer/rss-reader';

const reader = new RssReader();
await reader.initialize();

try {
  const feed = await reader.fetchAndParse(url);

  for (const item of feed.items) {
    console.log(`${item.title}: ${item.links.length} links`);
  }
} catch (error) {
  if (error instanceof FeedFetchError) {
    console.error(`Network error for ${error.feedUrl}:`, error.message);
    // Implement exponential backoff retry
  } else if (error instanceof InvalidFeedError) {
    console.error(`Invalid feed structure: ${error.feedUrl}`);
    // Alert administrator - feed may have changed format
  } else if (error instanceof BrowserError) {
    console.error(`Browser crashed during ${error.operation}`);
    // Reinitialize browser pool
    await reader.close();
    await reader.initialize();
  } else if (error instanceof ParseError) {
    console.error(`Parse error for item ${error.itemTitle}:`, error.message);
    // Log and continue - some items may still be valid
  } else if (error instanceof LinkNotFoundError) {
    console.warn(`No links in ${error.itemTitle}`);
    // Expected for some feeds - log warning
  } else {
    console.error('Unexpected error:', error);
    throw error;
  }
} finally {
  await reader.close();
}
```
### Static Utilities

```typescript
import { RssReader } from '@xmer/rss-reader';

// Extract links from HTML without creating a reader instance
const html = '<a href="https://example.com">Link 1</a><a href="magnet:?xt=...">Magnet</a>';
const links = RssReader.extractLinks(html);
console.log(links); // ['https://example.com', 'magnet:?xt=...']

// Extract item ID from URL
const id = RssReader.extractItemId('https://blog.com/?p=12345');
console.log(id); // '12345'

// Clean title
const cleaned = RssReader.cleanTitle('Game Name – Site Branding', {
  suffixPattern: /[–-]\s*Site Branding/
});
console.log(cleaned); // 'Game Name'

// Validate link
const isValid = RssReader.validateLink('https://example.com', /^https:\/\//);
console.log(isValid); // true
```
## Contributing

Contributions are welcome! Please see the repository for contribution guidelines.
## License

MIT License - see LICENSE file for details.
---
Made with TypeScript, Puppeteer, Cheerio, and xml2js.