MCP server for web scraping and converting web pages to Markdown with Playwright support
npm install webscraper-mcp-serverA Model Context Protocol (MCP) server that provides advanced web scraping and HTML to Markdown conversion using Microsoft Playwright. This version automatically detects and handles JavaScript-rendered pages.
- š Microsoft Playwright - Superior JavaScript rendering with automatic fallback
- šÆ Smart Detection - Automatically switches to JS rendering when needed
- šø Screenshots - Capture page screenshots as base64
- ā±ļø Custom Waits - Wait for specific selectors or time periods
- š Dual Mode - Static scraping for speed, JS rendering for dynamic content
- š Performance Metrics - Track load times and render methods
- š Intelligent Web Scraping: Automatic detection of static vs dynamic pages
- š HTML to Markdown: Clean, well-formatted Markdown conversion
- š JavaScript Rendering: Full Playwright support for SPA and dynamic content
- š Link Extraction: Extract all hyperlinks with filtering options
- š¼ļø Image Extraction: Extract images including lazy-loaded ones
- š¦ Batch Processing: Scrape up to 10 URLs simultaneously
- šÆ Metadata Extraction: Title, description, author, keywords, and more
- āļø Flexible Options: Control timeouts, redirects, content inclusion
- š Multiple Formats: Output in Markdown or JSON
- šø Screenshot Capture: Get base64 screenshots of pages
1. Static Mode (Default, Fast)
- Uses Axios + Cheerio
- Suitable for traditional HTML pages
- Fastest performance
2. JavaScript Mode (Auto-detected or Forced)
- Uses Playwright with Chromium
- Executes JavaScript
- Handles SPAs, lazy loading, dynamic content
- Auto-activates when static mode returns < 50 words
``bashClone or navigate to the project
cd webscraper-mcp-server-v2
Usage
$3
`bash
npm start
`$3
`bash
TRANSPORT=http PORT=3000 npm start
`Available Tools
$3
Automatically detects and handles both static and dynamic pages.
New Parameters:
-
use_javascript (boolean): Force JavaScript rendering
- wait_for_selector (string): CSS selector to wait for
- wait_time (number): Additional wait time in milliseconds
- take_screenshot (boolean): Capture page screenshotExample - Force JavaScript Rendering:
`json
{
"url": "https://docs.uazapi.com/endpoint/post/instance~init",
"use_javascript": true,
"wait_for_selector": ".content",
"wait_time": 3000,
"take_screenshot": true
}
`Example - Auto-Detection:
`json
{
"url": "https://example.com/spa-app"
}
`
Automatically switches to JavaScript if static content is insufficient$3
New Parameter:
-
use_javascript (boolean): Use JavaScript rendering for dynamic linksExample:
`json
{
"url": "https://example.com",
"use_javascript": true,
"filter_external": true
}
`$3
New Parameter:
-
use_javascript (boolean): Extract lazy-loaded imagesExample:
`json
{
"url": "https://example.com/gallery",
"use_javascript": true,
"limit": 50
}
`$3
New Parameter:
-
use_javascript (boolean): Use JavaScript for all URLsExample:
`json
{
"urls": ["https://page1.com", "https://page2.com"],
"use_javascript": true,
"timeout": 60000
}
`Configuration
$3
-
TRANSPORT: Transport type ('stdio' or 'http', default: 'stdio')
- PORT: HTTP server port (default: 3000, only for HTTP transport)$3
`json
{
"mcpServers": {
"webscraper": {
"command": "node",
"args": ["/path/to/webscraper-mcp-server-v2/dist/index.js"]
}
}
}
`Output Formats
$3
`markdown
Page Title
URL: https://example.com
Render Method: javascript
Description: Page description
Author: Author Name
Word Count: 1500 | Status: 200 | Load Time: 2340ms
---
[Page content in Markdown...]
`$3
`json
{
"url": "https://example.com",
"title": "Page Title",
"content": "Markdown content...",
"renderMethod": "javascript",
"metadata": {
"description": "Page description",
"wordCount": 1500,
"loadTime": 2340,
"screenshot": "base64..." // if requested
}
}
`Performance Comparison
| Feature | Static Mode | JavaScript Mode |
|---------|-------------|-----------------|
| Speed | ~1-3s | ~3-8s |
| JavaScript | ā | ā
|
| SPA Support | ā | ā
|
| Lazy Loading | ā | ā
|
| Resource Usage | Low | Medium |
| Best For | Traditional HTML | Modern Web Apps |
Use Cases
$3
`javascript
// Site with React/Vue/Angular
{
"url": "https://spa-site.com",
"use_javascript": true,
"wait_for_selector": "#root > div",
"wait_time": 2000
}
`$3
`javascript
// Get screenshot along with content
{
"url": "https://example.com/dashboard",
"use_javascript": true,
"take_screenshot": true
}
`$3
`javascript
// Like your UAZ API docs example
{
"url": "https://docs.uazapi.com/endpoint/post/instance~init",
"use_javascript": true,
"wait_for_selector": ".api-content",
"response_format": "json"
}
`$3
`javascript
// Lazy-loaded images and dynamic prices
{
"url": "https://shop.example.com/product/123",
"use_javascript": true,
"wait_time": 3000
}
`Troubleshooting
$3
`bash
Reinstall browsers
npm run install:browsersCheck Playwright installation
npx playwright --version
`$3
Problem: Getting < 50 words from a JavaScript site?
Solution:
- Set
use_javascript: true explicitly
- Use wait_for_selector for specific elements
- Increase wait_time if content loads slowly$3
Problem: Browser consuming too much memory?
Solution:
- The browser instance is reused and shared
- Contexts are closed after each operation
- Consider increasing system resources for heavy usage
Advantages over Puppeteer
ā
Better Performance: Playwright is generally faster
ā
More Reliable: Better handling of modern web apps
ā
Auto-waiting: Smarter element waiting
ā
Multiple Browsers: Can use Chromium, Firefox, or WebKit
ā
Modern APIs: Cleaner, more intuitive API
ā
Active Development: Microsoft-backed, frequent updates
Development
$3
`
webscraper-mcp-server-v2/
āāā src/
ā āāā index.ts # Main entry point
ā āāā types.ts # TypeScript definitions (enhanced)
ā āāā constants.ts # Configuration constants
ā āāā schemas/ # Zod validation (updated)
ā āāā services/ # Playwright-based scraping
ā āāā tools/ # MCP tool implementations
āāā dist/ # Compiled JavaScript
āāā package.json # Dependencies (with Playwright)
āāā README.md
`$3
`bash
npm run build
`$3
`bash
With MCP Inspector
npx @modelcontextprotocol/inspector node dist/index.js
``- Maximum 10 URLs for batch scraping
- Content truncated at 100,000 characters
- Request timeout: 1-120 seconds
- Chromium browser required (~170MB download)
- Supports only HTTP/HTTPS protocols
- Requires publicly accessible URLs
1. Use Static Mode When Possible: 3-5x faster for traditional sites
2. Batch Related URLs: More efficient than individual calls
3. Set Appropriate Timeouts: Longer for slow sites, shorter for fast ones
4. Use Selectors Wisely: Wait for specific elements instead of fixed times
5. Limit Screenshot Usage: Screenshots increase response size significantly
| Feature | v1.0 (Cheerio Only) | v2.0 (Playwright) |
|---------|---------------------|-------------------|
| Static HTML | ā
Fast | ā
Fast |
| JavaScript | ā | ā
Full Support |
| Auto-Detection | ā | ā
Smart Fallback |
| Screenshots | ā | ā
Base64 Output |
| Lazy Loading | ā | ā
Supported |
| SPAs | ā Limited | ā
Full Support |
MIT
Contributions welcome! Areas for improvement:
- [ ] Support for other Playwright browsers (Firefox, WebKit)
- [ ] PDF generation from pages
- [ ] Advanced selector strategies
- [ ] Request interception for blocking ads
- [ ] Cookie management
- [ ] Proxy support
For issues or questions, please open an issue on the GitHub repository.
---
Made with ā¤ļø using Microsoft Playwright and Model Context Protocol