Fast, token-efficient web content extraction - fetch web pages and convert to clean Markdown. Perfect for LLMs to read data from web pages quickly and efficiently.

## Features
- 🚀 Fast HTML to Markdown conversion using Mozilla's Readability
- 📄 Clean, readable markdown output optimized for LLMs
- 🤖 Respects robots.txt (configurable)
- 💾 Built-in SHA-256 based caching with graceful fallbacks
- 🔄 Multi-page crawling with intelligent link extraction
- ⚡ Concurrent fetching with configurable limits (default: 3)
- 🔗 Automatic relative to absolute URL conversion in markdown
- 🛡️ Robust error handling and timeout management
## Installation

```bash
npm install @just-every/crawl
```

## CLI Usage

```bash
# Fetch a single page
npx web-crawl https://example.com
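# Crawl multiple pages from the start URL and print JSON instead of markdown
# (illustrative invocation; flags are documented under CLI Options below)
npx web-crawl https://example.com --pages 5 --output json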
```

## Library Usage
```javascript
import { fetch, fetchMarkdown } from '@just-every/crawl';

// Fetch and convert a single URL to markdown
const markdown = await fetchMarkdown('https://example.com');
console.log(markdown);
// Fetch multiple pages with options
const results = await fetch('https://example.com', {
  pages: 3,             // Maximum pages to crawl (default: 1)
  maxConcurrency: 5,    // Max concurrent requests (default: 3)
  respectRobots: true,  // Respect robots.txt (default: true)
  sameOriginOnly: true, // Only crawl same origin (default: true)
  userAgent: 'MyBot/1.0',
  cacheDir: '.cache',
  timeout: 30000        // Request timeout in ms (default: 30000)
});
// Process results
results.forEach(result => {
  if (result.error) {
    console.error(`Error for ${result.url}: ${result.error}`);
  } else {
    console.log(`# ${result.title}`);
    console.log(result.markdown);
  }
});
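// Links extracted from each page's markdown are exposed on `result.links`;
// an illustrative way to gather them all for a follow-up step:
const discoveredLinks = results.flatMap(result => result.links ?? []);
console.log(`Discovered ${discoveredLinks.length} links`);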
```

## CLI Options

```
Options:
  -p, --pages        Maximum pages to crawl (default: 1)
  -c, --concurrency  Max concurrent requests (default: 3)
  --no-robots        Ignore robots.txt
  --all-origins      Allow cross-origin crawling
  -u, --user-agent   Custom user agent
  --cache-dir        Cache directory (default: ".cache")
  -t, --timeout      Request timeout in milliseconds (default: 30000)
  -o, --output       Output format: json, markdown, or both (default: "markdown")
  -h, --help         Display help
```

## API Reference

### fetch(url, options?)
Fetches a URL and returns an array of crawl results.
Parameters:
- url (string): The URL to fetch
- options (CrawlOptions): Optional crawling configuration

Returns:
`Promise<CrawlResult[]>` - An array of crawl results

### fetchMarkdown(url, options?)
Fetches a single URL and returns only the markdown content.
Parameters:
- url (string): The URL to fetch
- options (CrawlOptions): Optional crawling configuration

Returns:
`Promise<string>` - The markdown content

### Types
```typescript
interface CrawlOptions {
  pages?: number;           // Maximum pages to crawl (default: 1)
  maxConcurrency?: number;  // Max concurrent requests (default: 3)
  respectRobots?: boolean;  // Respect robots.txt (default: true)
  sameOriginOnly?: boolean; // Only crawl same origin (default: true)
  userAgent?: string;       // Custom user agent
  cacheDir?: string;        // Cache directory (default: ".cache")
  timeout?: number;         // Request timeout in ms (default: 30000)
}

interface CrawlResult {
  url: string;       // The URL that was crawled
  markdown: string;  // Converted markdown content
  title?: string;    // Page title
  links?: string[];  // Links extracted from markdown content
  error?: string;    // Error message if failed
}
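// Illustrative sketch only (not part of the package): a helper built on these
// types that keeps successful results, assuming `fetch` is imported from
// '@just-every/crawl' as in the usage example above.
async function crawlSuccessful(url: string, options?: CrawlOptions): Promise<CrawlResult[]> {
  const results = await fetch(url, options);
  return results.filter(result => !result.error);
}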
```

## License

MIT