Fetch URLs, extract clean article content, and cache as markdown. Supports automatic JavaScript rendering via Playwright.
npm install arcfetch

Zero-config URL fetching that converts web pages to clean markdown with automatic JavaScript rendering fallback.
Perfect for AI workflows, research, and documentation. Fetches URLs, extracts article content using Mozilla Readability, and caches as markdown with 90-95% token reduction.
| Problem | Solution |
|---------|----------|
| JS-heavy sites return blank | Auto-detects and retries with Playwright |
| Login walls / error pages scored as content | Boilerplate detection (22 patterns) catches them |
| Too much HTML clutter | Mozilla Readability extracts just the article |
| High token costs for LLMs | 90-95% token reduction vs raw HTML |
| No good caching story | Temp → Docs workflow for easy curation |
| Hard to integrate | Works as CLI or MCP server with zero setup |
- Smart Fetching: Simple HTTP first, automatic Playwright fallback for JS-heavy sites
- Quality Gates: Scoring (0-100) with boilerplate, login wall, paywall, and error page detection
- Content-to-Source Ratio: Catches JS-rendered or gated content by comparing extracted text to source HTML size
- Anti-Bot Detection: Stealth plugin, viewport/timezone/locale rotation, realistic headers, navigator overrides
- Clean Markdown: Mozilla Readability + Turndown for 90-95% token reduction
- Temp → Docs Workflow: Cache to temp folder, promote to docs when ready
- Link Extraction: Extract and batch-fetch all links from a cached reference
- CLI & MCP: Available as command-line tool and MCP server (6 tools)
- Multiple Output Formats: Plain text, JSON, filepath, or summary
``bashFetch and display markdown
npx arcfetch fetch https://example.com/article
$3
`bash
npm install -g arcfetchThen use directly
arcfetch fetch https://example.com/article
`$3
`bash
git clone https://github.com/briansunter/arcfetch.git
cd arcfetch
bun install
bun run cli.ts fetch https://example.com
`CLI Usage
$3
`bash
Fetch a URL and save to temp folder
arcfetch fetch https://example.com/articleList all cached references
arcfetch listExtract links from a cached reference
arcfetch links my-articleFetch all links from a cached reference (parallel)
arcfetch fetch-links my-articlePromote from temp to permanent docs
arcfetch promote my-articleDelete a cached reference
arcfetch delete my-articleShow current configuration
arcfetch configStart MCP server
arcfetch mcp
`$3
`bash
Plain text (LLM-friendly, default)
arcfetch fetch https://example.com -o textJust the filepath (for scripts)
arcfetch fetch https://example.com -o pathSummary: slug|filepath
arcfetch fetch https://example.com -o summaryStructured JSON
arcfetch fetch https://example.com -o jsonHuman-friendly with emojis
arcfetch fetch https://example.com --pretty
`$3
`bash
Add search query as metadata
arcfetch fetch https://example.com -q "machine learning"Set minimum quality threshold (default: 60)
arcfetch fetch https://example.com --min-quality 80Force Playwright (skip simple fetch)
arcfetch fetch https://example.com --force-playwrightUse faster wait strategy for simple sites
arcfetch fetch https://example.com --wait-strategy loadRe-fetch even if URL already cached
arcfetch fetch https://example.com --refetchCustom directories
arcfetch fetch https://example.com --temp-dir .cache --docs-dir contentVerbose output for debugging
arcfetch fetch https://example.com -v
`MCP Server
$3
Add to your Claude Code MCP configuration:
`json
{
"mcpServers": {
"arcfetch": {
"command": "npx",
"args": ["arcfetch"]
}
}
}
`Or using bunx (faster):
`json
{
"mcpServers": {
"arcfetch": {
"command": "bunx",
"args": ["arcfetch"]
}
}
}
`$3
`json
{
"mcpServers": {
"arcfetch": {
"command": "bun",
"args": ["run", "/path/to/arcfetch/index.ts"],
"cwd": "/path/to/arcfetch"
}
}
}
`$3
| Tool | Parameters | Description |
|------|------------|-------------|
|
fetch_url | url, query?, minQuality?, refetch?, outputFormat? | Fetch URL with auto JS fallback |
| list_cached | tempDir? | List all cached references |
| promote_reference | refId, docsDir? | Move from temp to docs folder |
| delete_cached | refId | Delete a cached reference |
| extract_links | refId, outputFormat? | Extract links from a cached reference |
| fetch_links | refId, refetch?, outputFormat? | Fetch all links from a cached reference |Configuration
$3
Create
arcfetch.config.json in your project root:`json
{
"quality": {
"minScore": 60,
"jsRetryThreshold": 85
},
"paths": {
"tempDir": ".tmp/arcfetch",
"docsDir": "docs/ai/references"
},
"playwright": {
"timeout": 30000,
"waitStrategy": "networkidle"
}
}
`Config files checked (in order):
arcfetch.config.json, .arcfetchrc, .arcfetchrc.json$3
`bash
SOFETCH_MIN_SCORE=60
SOFETCH_JS_RETRY_THRESHOLD=85
SOFETCH_TEMP_DIR=.tmp/arcfetch
SOFETCH_DOCS_DIR=docs/ai/references
`$3
CLI arguments > Environment variables > Config file > Built-in defaults
Quality Pipeline
`
URL → Simple Fetch → Extract → Quality Score (0-100)
│
┌──────────────────┼──────────────────┐
▼ ▼ ▼
Score >= 85 60 - 84 < 60
│ │ │
▼ ▼ ▼
Save Try Playwright Try Playwright
pick best score (required)
│
▼
Score >= 60?
Yes → Save
No → Error
`$3
Score starts at 100, deductions apply:
| Check | Deduction |
|-------|-----------|
| Blank content | Score = 0 |
| Content < 50 chars | -50 |
| Content < 300 chars | -15 |
| HTML tags > 100 | -40 |
| HTML tags > 50 | -20 |
| HTML ratio > 30% | -25 |
| Extraction ratio < 0.5% (large page) | -35 |
| Extraction ratio < 2% (large page) | -20 |
| Boilerplate detected | -40 |
| Script/style tags | -10 to -15 |
$3
On short content (< 2000 chars), 22 patterns are checked:
- Error pages: "something went wrong", "an error occurred"
- 404 pages: "page not found"
- Login walls: "log in to continue", "please log in", "sign in to continue"
- Paywalls: "subscribe to continue reading"
- Bot detection: "are you a robot", "complete the captcha"
- Access denied, JS-required, unsupported browser
Long articles (>= 2000 chars) are not checked for boilerplate to avoid false positives.
$3
| Strategy | Speed | Reliability | Best For |
|----------|-------|-------------|----------|
|
networkidle | Slowest | Highest | JS-heavy apps, dynamic content |
| domcontentloaded | Medium | Medium | Most SPAs, modern sites |
| load | Fastest | Basic | Static sites, simple pages |File Format
Cached files use markdown with YAML frontmatter:
`markdown
---
title: "Article Title"
source_url: https://example.com/article
fetched_date: 2026-02-06
type: web
status: temporary
query: "optional search query"
---Article Title
Extracted markdown content...
`- Ref IDs are slugified titles (e.g.,
how-to-build-react-apps)
- Temp storage: .tmp/arcfetch/ (status: temporary)
- Permanent storage: docs/ai/references/ (status: permanent, after promote)
- Duplicate detection: re-fetching same URL returns existing ref unless --refetchReal-World Examples
$3
`bash
Fetch multiple articles for research
arcfetch fetch https://arxiv.org/abs/2301.00001 -q "LLM research"
arcfetch fetch https://openai.com/research/gpt-4 -q "GPT-4"Review all cached references
arcfetch list --prettyPromote the good ones to docs
arcfetch promote llm-research-paper
arcfetch promote gpt-4-technical-report
`$3
`bash
Fetch a page with lots of links
arcfetch fetch https://example.com/resources --prettySee what links it contains
arcfetch links resources --prettyFetch all of them in parallel
arcfetch fetch-links resources --pretty
`$3
`bash
#!/bin/bash
Fetch and get filepath
filepath=$(arcfetch fetch https://example.com -o path)Process with other tools
cat "$filepath" | other-toolOr get JSON for structured processing
arcfetch fetch https://example.com -o json | jq '.quality'
`$3
`bash
Modern React/Vue/Angular apps
arcfetch fetch https://spa-example.com --force-playwrightSimple blogs (use faster strategy)
arcfetch fetch https://blog.example.com --wait-strategy loadUnknown site (let arcfetch decide automatically)
arcfetch fetch https://unknown-site.com -v
`Troubleshooting
$3
`bash
Lower the threshold temporarily
arcfetch fetch https://example.com --min-quality 40Or force Playwright (often produces better results)
arcfetch fetch https://example.com --force-playwrightCheck what's happening with verbose mode
arcfetch fetch https://example.com -v
`$3
`bash
Use faster wait strategy
arcfetch fetch https://example.com --wait-strategy loadCombine with force-playwright for JS sites
arcfetch fetch https://example.com --force-playwright --wait-strategy domcontentloaded
`Architecture
`
┌──────────────────────────────────────────────────────┐
│ CLI / MCP Interface │
└──────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ Core Pipeline │
│ 1. Simple HTTP Fetch (browser-like UA) │
│ 2. Extract with Readability + Turndown │
│ 3. Quality Score + Boilerplate Detection │
│ 4. Conditional Playwright Retry (with stealth) │
│ 5. Cache with YAML Frontmatter │
└──────────────────────────────────────────────────────┘
│
┌───────────────┼───────────────┐
▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌───────────┐
│ Cache │ │ Playwright│ │ Quality │
│ Manager │ │ Manager │ │ Validator │
└───────────┘ └───────────┘ └───────────┘
`Contributing
`bash
git clone https://github.com/briansunter/arcfetch.git
cd arcfetch
bun install
bun test # Run tests (199 tests)
bun run typecheck # Type checking
bun run check # Lint + format check
``MIT License - see LICENSE file for details