# mcp-crawl4ai-ts

TypeScript MCP server for Crawl4AI - web crawling and content extraction.

Install: `npm install mcp-crawl4ai-ts`

> Note: Tested with Crawl4AI version 0.7.4




TypeScript implementation of an MCP server for Crawl4AI. Provides tools for web crawling, content extraction, and browser automation.
- Prerequisites
- Quick Start
- Configuration
- Client-Specific Instructions
- Available Tools
  - 1. get_markdown
  - 2. capture_screenshot
  - 3. generate_pdf
  - 4. execute_js
  - 5. batch_crawl
  - 6. smart_crawl
  - 7. get_html
  - 8. extract_links
  - 9. crawl_recursive
  - 10. parse_sitemap
  - 11. crawl
  - 12. manage_session
  - 13. extract_with_llm
- Advanced Configuration
- Development
- License
## Prerequisites

- Node.js 18+ and npm
- A running Crawl4AI server:

```bash
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.7.4
```
## Quick Start

This MCP server works with any MCP-compatible client (Claude Desktop, Claude Code, Cursor, LMStudio, etc.).
#### Using npx (Recommended)
```json
{
  "mcpServers": {
    "crawl4ai": {
      "command": "npx",
      "args": ["mcp-crawl4ai-ts"],
      "env": {
        "CRAWL4AI_BASE_URL": "http://localhost:11235"
      }
    }
  }
}
```
#### Using local installation
```json
{
  "mcpServers": {
    "crawl4ai": {
      "command": "node",
      "args": ["/path/to/mcp-crawl4ai-ts/dist/index.js"],
      "env": {
        "CRAWL4AI_BASE_URL": "http://localhost:11235"
      }
    }
  }
}
```
#### With all optional variables
```json
{
  "mcpServers": {
    "crawl4ai": {
      "command": "npx",
      "args": ["mcp-crawl4ai-ts"],
      "env": {
        "CRAWL4AI_BASE_URL": "http://localhost:11235",
        "CRAWL4AI_API_KEY": "your-api-key",
        "SERVER_NAME": "custom-name",
        "SERVER_VERSION": "1.0.0"
      }
    }
  }
}
```
## Configuration

```env
# Required
CRAWL4AI_BASE_URL=http://localhost:11235
```
## Client-Specific Instructions
### Claude Desktop
Add the configuration to `~/Library/Application Support/Claude/claude_desktop_config.json`.

### Claude Code
```bash
claude mcp add crawl4ai -e CRAWL4AI_BASE_URL=http://localhost:11235 -- npx mcp-crawl4ai-ts
```

### Other MCP Clients
Consult your client's documentation for MCP server configuration. The key details:
- Command: `npx mcp-crawl4ai-ts` or `node /path/to/dist/index.js`
- Required env: `CRAWL4AI_BASE_URL`
- Optional env: `CRAWL4AI_API_KEY`, `SERVER_NAME`, `SERVER_VERSION`

## Available Tools
### 1. get_markdown

```typescript
{
  url: string,                        // Required: URL to extract markdown from
  filter?: 'raw'|'fit'|'bm25'|'llm',  // Filter type (default: 'fit')
  query?: string,                     // Query for bm25/llm filters
  cache?: string                      // Cache-bust parameter (default: '0')
}
```

Extracts content as markdown with various filtering options. Use the 'bm25' or 'llm' filter with a query for specific content extraction.
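For example, the default call returns the whole page, while a query-driven filter narrows the output (URL and query are illustrative):

```typescript
// Whole page with the default 'fit' filter
{ url: 'https://example.com/docs' }

// Only content matching a query, ranked with BM25
{ url: 'https://example.com/docs', filter: 'bm25', query: 'installation requirements' }
```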
### 2. capture_screenshot

```typescript
{
  url: string,                  // Required: URL to capture
  screenshot_wait_for?: number  // Seconds to wait before screenshot (default: 2)
}
```

Returns a base64-encoded PNG. Note: this tool is stateless; for screenshots after JS execution, use `crawl` with `screenshot: true`.
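Example call, giving a slow page extra time to render (URL and wait value are illustrative):

```typescript
{ url: 'https://example.com/dashboard', screenshot_wait_for: 5 }
```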
### 3. generate_pdf

```typescript
{
  url: string  // Required: URL to convert to PDF
}
```

Returns a base64-encoded PDF. This tool is stateless; for PDFs after JS execution, use `crawl` with `pdf: true`.
### 4. execute_js

```typescript
{
  url: string,                // Required: URL to load
  scripts: string | string[]  // Required: JavaScript to execute
}
```

Executes JavaScript and returns the results. Each script can use 'return' to send values back. This tool is stateless; for persistent JS execution, use `crawl` with `js_code`.
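Example: run two scripts against a page and return their values (URL and scripts are illustrative):

```typescript
{
  url: 'https://example.com',
  scripts: [
    'return document.title',
    'return document.querySelectorAll("a").length'
  ]
}
```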
### 5. batch_crawl

```typescript
{
  urls: string[],           // Required: List of URLs to crawl
  max_concurrent?: number,  // Parallel request limit (default: 5)
  remove_images?: boolean,  // Remove images from output (default: false)
  bypass_cache?: boolean,   // Bypass cache for all URLs (default: false)
  configs?: Array<{         // Optional: Per-URL configurations (v3.0.0+)
    url: string,
    [key: string]: any      // Any crawl parameters for this specific URL
  }>
}
```

Efficiently crawls multiple URLs in parallel. Each URL gets a fresh browser instance. With the `configs` array, you can specify different parameters for each URL.
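Example: crawl two URLs in parallel and give one of them its own settings via `configs` (URLs and the per-URL parameter are illustrative):

```typescript
{
  urls: ['https://example.com/a', 'https://example.com/b'],
  max_concurrent: 2,
  configs: [
    { url: 'https://example.com/b', screenshot: true }
  ]
}
```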
### 6. smart_crawl

```typescript
{
  url: string,             // Required: URL to crawl
  max_depth?: number,      // Maximum depth for recursive crawling (default: 2)
  follow_links?: boolean,  // Follow links in content (default: true)
  bypass_cache?: boolean   // Bypass cache (default: false)
}
```

Intelligently detects content type (HTML/sitemap/RSS) and processes accordingly.
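Example: hand it a sitemap URL and let it choose how to process it (URL is illustrative):

```typescript
{ url: 'https://example.com/sitemap.xml', max_depth: 1 }
```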
### 7. get_html

```typescript
{
  url: string  // Required: URL to extract HTML from
}
```

Returns preprocessed HTML optimized for structure analysis. Use for building schemas or analyzing patterns.
### 8. extract_links

```typescript
{
  url: string,          // Required: URL to extract links from
  categorize?: boolean  // Group by type (default: true)
}
```

Extracts all links and groups them by type: internal, external, social media, documents, images.
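Example (URL is illustrative):

```typescript
// Returns links grouped into internal, external, social media, documents, images
{ url: 'https://example.com', categorize: true }
```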
### 9. crawl_recursive

```typescript
{
  url: string,               // Required: Starting URL
  max_depth?: number,        // Maximum depth to crawl (default: 3)
  max_pages?: number,        // Maximum pages to crawl (default: 50)
  include_pattern?: string,  // Regex pattern for URLs to include
  exclude_pattern?: string   // Regex pattern for URLs to exclude
}
```

Crawls a website following internal links up to the specified depth. Returns content from all discovered pages.
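Example: crawl a docs subtree while skipping auth pages (URL and regex patterns are illustrative):

```typescript
{
  url: 'https://example.com/docs',
  max_depth: 2,
  max_pages: 20,
  include_pattern: '.*/docs/.*',
  exclude_pattern: '.*(login|signup).*'
}
```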
### 10. parse_sitemap

```typescript
{
  url: string,             // Required: Sitemap URL (e.g., /sitemap.xml)
  filter_pattern?: string  // Optional: Regex pattern to filter URLs
}
```

Extracts all URLs from XML sitemaps. Supports regex filtering for specific URL patterns.
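Example: list only blog URLs from a sitemap (URL and pattern are illustrative):

```typescript
{
  url: 'https://example.com/sitemap.xml',
  filter_pattern: '.*/blog/.*'
}
```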
### 11. crawl
```typescript
{
  url: string,                       // URL to crawl
  // Browser Configuration
  browser_type?: 'chromium'|'firefox'|'webkit'|'undetected',  // Browser engine (undetected = stealth mode)
  viewport_width?: number,           // Browser width (default: 1080)
  viewport_height?: number,          // Browser height (default: 600)
  user_agent?: string,               // Custom user agent
  proxy_server?: string | {          // Proxy URL (string or object format)
    server: string,
    username?: string,
    password?: string
  },
  proxy_username?: string,           // Proxy auth (if using string format)
  proxy_password?: string,           // Proxy password (if using string format)
  cookies?: Array<{name, value, domain}>,  // Pre-set cookies
  headers?: Record<string, string>,  // Custom headers
  // Crawler Configuration
  word_count_threshold?: number,     // Min words per block (default: 200)
  excluded_tags?: string[],          // HTML tags to exclude
  remove_overlay_elements?: boolean, // Remove popups/modals
  js_code?: string | string[],       // JavaScript to execute
  wait_for?: string,                 // Wait condition (selector or JS)
  wait_for_timeout?: number,         // Wait timeout (default: 30000)
  delay_before_scroll?: number,      // Pre-scroll delay
  scroll_delay?: number,             // Between-scroll delay
  process_iframes?: boolean,         // Include iframe content
  exclude_external_links?: boolean,  // Remove external links
  screenshot?: boolean,              // Capture screenshot
  pdf?: boolean,                     // Generate PDF
  session_id?: string,               // Reuse browser session (only works with crawl tool)
  cache_mode?: 'ENABLED'|'BYPASS'|'DISABLED',  // Cache control
  // New in v3.0.0 (Crawl4AI 0.7.3/0.7.4)
  css_selector?: string,             // CSS selector to filter content
  delay_before_return_html?: number, // Delay in seconds before returning HTML
  include_links?: boolean,           // Include extracted links in response
  resolve_absolute_urls?: boolean,   // Convert relative URLs to absolute
  // LLM Extraction (REST API only supports 'llm' type)
  extraction_type?: 'llm',           // Only 'llm' extraction is supported via REST API
  extraction_schema?: object,        // Schema for structured extraction
  extraction_instruction?: string,   // Natural language extraction prompt
  extraction_strategy?: {            // Advanced extraction configuration
    provider?: string,
    api_key?: string,
    model?: string,
    [key: string]: any
  },
  table_extraction_strategy?: {      // Table extraction configuration
    enable_chunking?: boolean,
    thresholds?: object,
    [key: string]: any
  },
  markdown_generator_options?: {     // Markdown generation options
    include_links?: boolean,
    preserve_formatting?: boolean,
    [key: string]: any
  },
  timeout?: number,                  // Overall timeout (default: 60000)
  verbose?: boolean                  // Detailed logging
}
```
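For instance, a session-based call that clicks a button via JS, waits for the result, and captures a screenshot in the same browser state (URL, selectors, and session name are illustrative):

```typescript
{
  url: 'https://example.com/products',
  session_id: 'my-session',   // reuse a session created with manage_session
  js_code: 'document.querySelector("#load-more")?.click()',
  wait_for: '#product-grid',
  screenshot: true,
  cache_mode: 'BYPASS'
}
```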
### 12. manage_session

```typescript
{
  action: 'create' | 'clear' | 'list',  // Required: Action to perform
  session_id?: string,                  // For 'create' and 'clear' actions
  initial_url?: string,                 // For 'create' action: URL to load
  browser_type?: 'chromium' | 'firefox' | 'webkit' | 'undetected'  // For 'create' action
}
```

Unified tool for managing browser sessions. Supports three actions:
- create: Start a persistent browser session
- clear: Remove a session from local tracking
- list: Show all active sessions
Examples:
```typescript
// Create a new session
{ action: 'create', session_id: 'my-session', initial_url: 'https://example.com' }

// Clear a session
{ action: 'clear', session_id: 'my-session' }

// List all sessions
{ action: 'list' }
```

### 13. extract_with_llm
```typescript
{
  url: string,   // URL to extract data from
  query: string  // Natural language extraction instructions
}
```

Uses AI to extract structured data from webpages. Returns results immediately without any polling or job management. This is the recommended way to extract specific information, since CSS/XPath extraction is not supported via the REST API.
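Example (URL and query are illustrative):

```typescript
{
  url: 'https://example.com/pricing',
  query: 'List each plan name with its monthly price'
}
```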
## Advanced Configuration
For detailed information about all available configuration options, extraction strategies, and advanced features, please refer to the official Crawl4AI documentation:
- Crawl4AI Documentation
- Crawl4AI GitHub Repository
## Changelog
See CHANGELOG.md for detailed version history and recent updates.
## Development
### Local Development
```bash
# 1. Start the Crawl4AI server
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:latest

# 2. Install the MCP server
git clone https://github.com/omgwtfwow/mcp-crawl4ai-ts.git
cd mcp-crawl4ai-ts
npm install
cp .env.example .env

# 3. Development commands
npm run dev    # Development mode
npm test       # Run tests
npm run lint   # Check code quality
npm run build  # Production build

# 4. Add to your MCP client (see "Using local installation")
```

### Integration Tests
Integration tests require a running Crawl4AI server. Configure your environment:
```bash
# Required for integration tests
export CRAWL4AI_BASE_URL=http://localhost:11235
export CRAWL4AI_API_KEY=your-api-key  # If authentication is required

# Optional: For LLM extraction tests
export LLM_PROVIDER=openai/gpt-4o-mini
export LLM_API_TOKEN=your-llm-api-key
export LLM_BASE_URL=https://api.openai.com/v1  # If using custom endpoint

# Run integration tests (ALWAYS use the npm script; don't call jest directly)
npm run test:integration

# Run a single integration test file
npm run test:integration -- src/__tests__/integration/extract-links.integration.test.ts
```

> IMPORTANT: Do NOT run `npx jest` directly for integration tests. The npm script injects `NODE_OPTIONS=--experimental-vm-modules`, which is required for ESM + ts-jest. Running Jest directly will produce `SyntaxError: Cannot use import statement outside a module` and hang.

Integration tests cover:
- Dynamic content and JavaScript execution
- Session management and cookies
- Content extraction (LLM-based only)
- Media handling (screenshots, PDFs)
- Performance and caching
- Content filtering
- Bot detection avoidance
- Error handling
### Pre-flight Checklist
1. Docker container healthy:

   ```bash
   docker ps --filter name=crawl4ai --format '{{.Names}} {{.Status}}'
   curl -sf http://localhost:11235/health || echo "Health check failed"
   ```
2. Env vars loaded (either exported or in `.env`): `CRAWL4AI_BASE_URL` (required); optional: `CRAWL4AI_API_KEY`, `LLM_PROVIDER`, `LLM_API_TOKEN`, `LLM_BASE_URL`.
3. Use `npm run test:integration` (never raw `jest`).
4. To target one file, add it after `--` (see the example above).
5. Expect a total runtime of ~2–3 minutes; a longer run or an immediate hang usually means a missing `NODE_OPTIONS` flag or the wrong Jest version.

### Troubleshooting
| Symptom | Likely Cause | Fix |
|---------|--------------|-----|
| `SyntaxError: Cannot use import statement outside a module` | Ran Jest directly without the script flags | Re-run with `npm run test:integration` |
| Hangs on first test (`RUNS ...`) | Missing experimental VM modules flag | Use the npm script / ensure `NODE_OPTIONS=--experimental-vm-modules` |
| Network timeouts | Crawl4AI container not healthy / DNS blocked | Restart the container: `docker restart crawl4ai` |
| LLM tests skipped | Missing `LLM_PROVIDER` or `LLM_API_TOKEN` | Export the required LLM vars |
| New Jest major upgrade breaks tests | Version mismatch with ts-jest | Keep Jest 29.x unless ts-jest is upgraded accordingly |
### Test Dependencies

Current stack: `jest@29.x` + `ts-jest@29.x` + ESM (`"type": "module"`). Updating Jest to 30+ requires upgrading ts-jest and revisiting `jest.config.cjs`. Keep versions aligned to avoid parse errors.

## License

MIT