High-performance, lightweight TypeScript web scraping library for LLM Agents.
AgentCrawl is built to be the "eyes" of your AI Agents. It fetches web content, strips away the noise (ads, scripts, styles), and returns clean, token-optimized Markdown ready for your LLM context window.
It features a Hybrid Engine that starts with extremely fast static scraping and automatically falls back to a headless browser (Playwright) only when necessary for dynamic content or authentication.
- 🚀 Hybrid Engine: Instant static fetch by default, auto-switch to Headless Browser for dynamic sites.
- ⚡ Token Optimized: Returns clean Markdown, stripping 80-90% of tokens (ads, navs, footers).
- 🧠 Agent-First: Detects Main Content, removes boilerplate, and extracts semantic structure.
- 🔌 Plug-and-Play: Native adapters for Vercel AI SDK and OpenAI Agents SDK.
- 🛡️ Production Ready: Built-in caching, retry logic, user-agent rotation, and resource blocking.
- 🇹 Type-Safe: 100% TypeScript with Zod validation.
```bash
npm install agent-crawl
# or
bun add agent-crawl
```
```typescript
import { AgentCrawl } from 'agent-crawl';
// Simplest usage - returns clean properties
const page = await AgentCrawl.scrape("https://example.com");
console.log(page.title); // "Example Domain"
console.log(page.content); // "Example Domain\n\nThis domain is for use..."
console.log(page.links); // Array of same-origin links found on the page
```
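Because page.links only contains same-origin URLs, a one-off manual follow-up fetch is straightforward. A tiny sketch using the fields above (the crawl API below is usually the better fit for anything deeper):

```typescript
// Follow the first same-origin link found on the page, if there is one.
if (page.links && page.links.length > 0) {
  const next = await AgentCrawl.scrape(page.links[0]);
  console.log(next.title);
}
```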
Scraping with options:

```typescript
const page = await AgentCrawl.scrape("https://news.ycombinator.com", {
  mode: "hybrid",           // "static" | "browser" | "hybrid" (default)
  extractMainContent: true, // Extract only the article body
  optimizeTokens: true,     // Compress excessive whitespace (default: true)
  waitFor: ".main-content", // CSS selector to wait for (browser mode)
});
```
Crawl an entire website with configurable depth, page limits, and concurrency:
```typescript
const result = await AgentCrawl.crawl("https://docs.example.com", {
  maxDepth: 2,    // How many link-hops from the start URL (default: 1)
  maxPages: 20,   // Stop after N pages (default: 10)
  concurrency: 4, // Parallel requests (default: 2)
  extractMainContent: true,
});

console.log(`Crawled ${result.totalPages} pages`);
console.log(`Max depth reached: ${result.maxDepthReached}`);

result.pages.forEach(page => {
  console.log(`- ${page.title}: ${page.url}`);
});
```
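A common next step is flattening the crawl result into a single Markdown context for an LLM prompt. A minimal sketch using only the fields shown above (pages, title, url, content):

```typescript
import { AgentCrawl } from 'agent-crawl';

const result = await AgentCrawl.crawl("https://docs.example.com", {
  maxDepth: 2,
  maxPages: 20,
  extractMainContent: true,
});

// Join every page into one Markdown document, keeping the source URL as a
// fallback heading so the model can reference where each section came from.
const context = result.pages
  .map(page => `## ${page.title ?? page.url}\n\n${page.content}`)
  .join('\n\n---\n\n');
```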
One-line integration to give your Vercel AI agent browsing and crawling capabilities.
```typescript
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';
import { AgentCrawl } from 'agent-crawl';

const result = await generateText({
  model: openai('gpt-4o'),
  tools: {
    browser: AgentCrawl.asVercelTool(),
    crawler: AgentCrawl.asVercelCrawlTool()
  },
  prompt: "Crawl the documentation site and summarize all pages."
});
```
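Note that generateText performs a single step by default, so a run can end right after the tool call without the model writing its summary. If you are on AI SDK 4.x, allowing multiple steps looks roughly like this sketch (maxSteps and the model name are SDK-level assumptions, not part of agent-crawl):

```typescript
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';
import { AgentCrawl } from 'agent-crawl';

const result = await generateText({
  model: openai('gpt-4o'),
  tools: {
    browser: AgentCrawl.asVercelTool(),
    crawler: AgentCrawl.asVercelCrawlTool()
  },
  // Assumption: AI SDK 4.x, where maxSteps lets the model call a tool and
  // then continue with the tool result (AI SDK 5 uses stopWhen instead).
  maxSteps: 5,
  prompt: "Crawl the documentation site and summarize all pages."
});

console.log(result.text);
```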
Generates the JSON Schema tool definitions that OpenAI function calling expects.
```typescript
import OpenAI from 'openai';
import { AgentCrawl } from 'agent-crawl';

const client = new OpenAI();

const response = await client.chat.completions.create({
  model: 'gpt-4',
  messages: [{ role: 'user', content: 'Crawl this documentation site...' }],
  tools: [
    AgentCrawl.asOpenAITool(),     // Single page scraping
    AgentCrawl.asOpenAICrawlTool() // Multi-page crawling
  ],
});
```
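The snippet above registers the tools but does not execute the tool calls the model returns. One possible loop is sketched below, assuming the generated definitions follow the standard OpenAI `{ type: 'function', function: { name, parameters } }` shape and that their arguments expose the target `url` alongside the scrape/crawl options:

```typescript
import OpenAI from 'openai';
import { AgentCrawl } from 'agent-crawl';

const client = new OpenAI();
const scrapeTool = AgentCrawl.asOpenAITool();
const crawlTool = AgentCrawl.asOpenAICrawlTool();

const messages: OpenAI.ChatCompletionMessageParam[] = [
  { role: 'user', content: 'Crawl this documentation site...' },
];

const first = await client.chat.completions.create({
  model: 'gpt-4',
  messages,
  tools: [scrapeTool, crawlTool],
});

const reply = first.choices[0].message;
messages.push(reply);

for (const call of reply.tool_calls ?? []) {
  // Assumption: the generated schema exposes the target URL as `url` and the
  // remaining properties mirror the scrape/crawl options.
  const args = JSON.parse(call.function.arguments);
  const result =
    call.function.name === crawlTool.function.name
      ? await AgentCrawl.crawl(args.url, args)
      : await AgentCrawl.scrape(args.url, args);

  messages.push({
    role: 'tool',
    tool_call_id: call.id,
    content: JSON.stringify(result),
  });
}

// A second request lets the model answer using the scraped content.
const final = await client.chat.completions.create({
  model: 'gpt-4',
  messages,
  tools: [scrapeTool, crawlTool],
});
console.log(final.choices[0].message.content);
```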
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| mode | 'hybrid' \| 'static' \| 'browser' | 'hybrid' | Strategy to use. hybrid tries static first, then browser. |
| extractMainContent | boolean | false | Extract only the main article body using a Readability-like algorithm. |
| optimizeTokens | boolean | true | Remove extra whitespace and empty links for token efficiency. |
| waitFor | string | undefined | CSS selector to wait for (browser mode only). |
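For example, pairing browser mode with waitFor from the table above forces the headless engine and waits for a client-rendered element before extracting. A short sketch (the URL and selector are placeholders):

```typescript
import { AgentCrawl } from 'agent-crawl';

const page = await AgentCrawl.scrape("https://app.example.com/dashboard", {
  mode: "browser",      // skip the static attempt entirely
  waitFor: "#app-root", // placeholder selector for the client-rendered content
  extractMainContent: true,
});
```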
Crawl options include all scrape options plus:
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| maxDepth | number | 1 | Maximum link depth to crawl from the start URL. |
| maxPages | number | 10 | Maximum number of pages to crawl. |
| concurrency | number | 2 | Number of pages to fetch in parallel. |
AgentCrawl.scrape resolves to a ScrapedPage:

```typescript
{
  url: string;       // The final URL (after redirects)
  content: string;   // Clean markdown content
  title?: string;    // Page title
  links?: string[];  // Same-origin links found on the page
  metadata?: {
    status: number;  // HTTP status code
    contentLength: number;
    // ... other headers
  }
}
```
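Since metadata is optional and carries the HTTP status, one way to guard against error pages before handing content to a model is a check like the following minimal sketch over the fields above:

```typescript
import { AgentCrawl } from 'agent-crawl';

const page = await AgentCrawl.scrape("https://example.com");

// Skip pages that resolved to an HTTP error so the model never sees a 404 body.
if (page.metadata && page.metadata.status >= 400) {
  throw new Error(`Scrape returned HTTP ${page.metadata.status} for ${page.url}`);
}
```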
AgentCrawl.crawl resolves to:

```typescript
{
  pages: ScrapedPage[];    // Array of all scraped pages
  totalPages: number;      // Total number of pages crawled
  maxDepthReached: number; // Deepest level reached
  errors: Array<{          // Any errors encountered
    url: string;
    error: string;
  }>;
}
```
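Per the shape above, failed URLs are reported in errors alongside the pages that did succeed, so it is worth surfacing them after a run. A small sketch:

```typescript
import { AgentCrawl } from 'agent-crawl';

const result = await AgentCrawl.crawl("https://docs.example.com", { maxPages: 20 });

for (const { url, error } of result.errors) {
  // Report pages that could not be fetched without failing the whole crawl.
  console.warn(`Failed to crawl ${url}: ${error}`);
}
```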
MIT © silupanda