WebScraper MCP Server v2.0 (Playwright Edition)

A Model Context Protocol (MCP) server that provides advanced web scraping and HTML to Markdown conversion using Microsoft Playwright. This version automatically detects and handles JavaScript-rendered pages.

🆕 What's New in v2.0

- 🚀 Microsoft Playwright - Superior JavaScript rendering with automatic fallback
- 🎯 Smart Detection - Automatically switches to JS rendering when needed
- 📸 Screenshots - Capture page screenshots as base64
- ⏱️ Custom Waits - Wait for specific selectors or time periods
- 🔄 Dual Mode - Static scraping for speed, JS rendering for dynamic content
- 📊 Performance Metrics - Track load times and render methods

Features

$3

- 🌐 Intelligent Web Scraping: Automatic detection of static vs dynamic pages
- 📝 HTML to Markdown: Clean, well-formatted Markdown conversion
- 🎭 JavaScript Rendering: Full Playwright support for SPA and dynamic content
- 🔗 Link Extraction: Extract all hyperlinks with filtering options
- 🖼️ Image Extraction: Extract images including lazy-loaded ones
- 📦 Batch Processing: Scrape up to 10 URLs simultaneously
- 🎯 Metadata Extraction: Title, description, author, keywords, and more
- ⚙️ Flexible Options: Control timeouts, redirects, content inclusion
- 📊 Multiple Formats: Output in Markdown or JSON
- 📸 Screenshot Capture: Get base64 screenshots of pages

$3

1. Static Mode (Default, Fast)
- Uses Axios + Cheerio
- Suitable for traditional HTML pages
- Fastest performance

2. JavaScript Mode (Auto-detected or Forced)
- Uses Playwright with Chromium
- Executes JavaScript
- Handles SPAs, lazy loading, dynamic content
- Auto-activates when static mode returns < 50 words

Installation

``bash

`Clone or navigate to the project`


cd webscraper-mcp-server-v2
Install dependencies

npm install
Install Playwright browsers

npm run install:browsers
Build the project

npm run build


Usage
$3

`bash npm start`

`$3`

`bash TRANSPORT=http PORT=3000 npm start`

`Available Tools`

`$3`

Automatically detects and handles both static and dynamic pages.

New Parameters: -use_javascript(boolean): Force JavaScript rendering -wait_for_selector(string): CSS selector to wait for -wait_time(number): Additional wait time in milliseconds -take_screenshot (boolean): Capture page screenshot

Example - Force JavaScript Rendering:`json { "url": "https://docs.uazapi.com/endpoint/post/instance~init", "use_javascript": true, "wait_for_selector": ".content", "wait_time": 3000, "take_screenshot": true }`

Example - Auto-Detection:`json { "url": "https://example.com/spa-app" }`Automatically switches to JavaScript if static content is insufficient

`$3`

New Parameter: -use_javascript (boolean): Use JavaScript rendering for dynamic links

Example:`json { "url": "https://example.com", "use_javascript": true, "filter_external": true }`

`$3`

New Parameter: -use_javascript (boolean): Extract lazy-loaded images

Example:`json { "url": "https://example.com/gallery", "use_javascript": true, "limit": 50 }`

`$3`

New Parameter: -use_javascript (boolean): Use JavaScript for all URLs

Example:`json { "urls": ["https://page1.com", "https://page2.com"], "use_javascript": true, "timeout": 60000 }`

`Configuration`

`$3`

- TRANSPORT: Transport type ('stdio' or 'http', default: 'stdio') -PORT: HTTP server port (default: 3000, only for HTTP transport)

`$3`

`json { "mcpServers": { "webscraper": { "command": "node", "args": ["/path/to/webscraper-mcp-server-v2/dist/index.js"] } } }`

`Output Formats`

`$3`

`markdown

`Page Title`

URL: https://example.com Render Method: javascript

Description: Page description Author: Author Name Word Count: 1500 | Status: 200 | Load Time: 2340ms

---

[Page content in Markdown...]`

`$3`

`json { "url": "https://example.com", "title": "Page Title", "content": "Markdown content...", "renderMethod": "javascript", "metadata": { "description": "Page description", "wordCount": 1500, "loadTime": 2340, "screenshot": "base64..." // if requested } }`

`Performance Comparison`

| Feature | Static Mode | JavaScript Mode | |---------|-------------|-----------------| | Speed | ~1-3s | ~3-8s | | JavaScript | ❌ | ✅ | | SPA Support | ❌ | ✅ | | Lazy Loading | ❌ | ✅ | | Resource Usage | Low | Medium | | Best For | Traditional HTML | Modern Web Apps |

`Use Cases`

`$3`

`javascript // Site with React/Vue/Angular { "url": "https://spa-site.com", "use_javascript": true, "wait_for_selector": "#root > div", "wait_time": 2000 }`

`$3`

`javascript // Get screenshot along with content { "url": "https://example.com/dashboard", "use_javascript": true, "take_screenshot": true }`

`$3`

`javascript // Like your UAZ API docs example { "url": "https://docs.uazapi.com/endpoint/post/instance~init", "use_javascript": true, "wait_for_selector": ".api-content", "response_format": "json" }`

`$3`

`javascript // Lazy-loaded images and dynamic prices { "url": "https://shop.example.com/product/123", "use_javascript": true, "wait_time": 3000 }`

`Troubleshooting`

`$3`

`bash

`Reinstall browsers`


npm run install:browsers
Check Playwright installation

npx playwright --version


$3
Problem: Getting < 50 words from a JavaScript site?

Solution: - Setuse_javascript: trueexplicitly - Usewait_for_selectorfor specific elements - Increasewait_time if content loads slowly

`$3`

Problem: Browser consuming too much memory?

Solution: - The browser instance is reused and shared - Contexts are closed after each operation - Consider increasing system resources for heavy usage

`Advantages over Puppeteer`

✅ Better Performance: Playwright is generally faster ✅ More Reliable: Better handling of modern web apps ✅ Auto-waiting: Smarter element waiting ✅ Multiple Browsers: Can use Chromium, Firefox, or WebKit ✅ Modern APIs: Cleaner, more intuitive API ✅ Active Development: Microsoft-backed, frequent updates

`Development`

`$3`

`webscraper-mcp-server-v2/ ├── src/ │ ├── index.ts # Main entry point │ ├── types.ts # TypeScript definitions (enhanced) │ ├── constants.ts # Configuration constants │ ├── schemas/ # Zod validation (updated) │ ├── services/ # Playwright-based scraping │ └── tools/ # MCP tool implementations ├── dist/ # Compiled JavaScript ├── package.json # Dependencies (with Playwright) └── README.md`

`$3`

`bash npm run build`

`$3`

`bash

`With MCP Inspector`


npx @modelcontextprotocol/inspector node dist/index.js

Limitations

- Maximum 10 URLs for batch scraping
- Content truncated at 100,000 characters
- Request timeout: 1-120 seconds
- Chromium browser required (~170MB download)
- Supports only HTTP/HTTPS protocols
- Requires publicly accessible URLs

Performance Tips

1. Use Static Mode When Possible: 3-5x faster for traditional sites
2. Batch Related URLs: More efficient than individual calls
3. Set Appropriate Timeouts: Longer for slow sites, shorter for fast ones
4. Use Selectors Wisely: Wait for specific elements instead of fixed times
5. Limit Screenshot Usage: Screenshots increase response size significantly

Comparison with v1.0

| Feature | v1.0 (Cheerio Only) | v2.0 (Playwright) |
|---------|---------------------|-------------------|
| Static HTML | ✅ Fast | ✅ Fast |
| JavaScript | ❌ | ✅ Full Support |
| Auto-Detection | ❌ | ✅ Smart Fallback |
| Screenshots | ❌ | ✅ Base64 Output |
| Lazy Loading | ❌ | ✅ Supported |
| SPAs | ❌ Limited | ✅ Full Support |

License

MIT

Contributing

Contributions welcome! Areas for improvement:
- [ ] Support for other Playwright browsers (Firefox, WebKit)
- [ ] PDF generation from pages
- [ ] Advanced selector strategies
- [ ] Request interception for blocking ads
- [ ] Cookie management
- [ ] Proxy support

Support

For issues or questions, please open an issue on the GitHub repository.

---

Made with ❤️ using Microsoft Playwright and Model Context Protocol

WebScraper MCP Server v2.0 (Playwright Edition)

🆕 What's New in v2.0

Features

$3

1. Static Mode (Default, Fast)
- Uses Axios + Cheerio
- Suitable for traditional HTML pages
- Fastest performance

Installation

``bash

`Clone or navigate to the project`


cd webscraper-mcp-server-v2
Install dependencies

npm install
Install Playwright browsers

npm run install:browsers
Build the project

npm run build


Usage
$3

`bash npm start`

`$3`

`bash TRANSPORT=http PORT=3000 npm start`

`Available Tools`

`$3`

Automatically detects and handles both static and dynamic pages.

Example - Auto-Detection:`json { "url": "https://example.com/spa-app" }`Automatically switches to JavaScript if static content is insufficient

`$3`

New Parameter: -use_javascript (boolean): Use JavaScript rendering for dynamic links

Example:`json { "url": "https://example.com", "use_javascript": true, "filter_external": true }`

`$3`

New Parameter: -use_javascript (boolean): Extract lazy-loaded images

Example:`json { "url": "https://example.com/gallery", "use_javascript": true, "limit": 50 }`

`$3`

New Parameter: -use_javascript (boolean): Use JavaScript for all URLs

Example:`json { "urls": ["https://page1.com", "https://page2.com"], "use_javascript": true, "timeout": 60000 }`

`Configuration`

`$3`

- TRANSPORT: Transport type ('stdio' or 'http', default: 'stdio') -PORT: HTTP server port (default: 3000, only for HTTP transport)

`$3`

`json { "mcpServers": { "webscraper": { "command": "node", "args": ["/path/to/webscraper-mcp-server-v2/dist/index.js"] } } }`

`Output Formats`

`$3`

`markdown

`Page Title`

URL: https://example.com Render Method: javascript

Description: Page description Author: Author Name Word Count: 1500 | Status: 200 | Load Time: 2340ms

---

[Page content in Markdown...]`

`$3`

`Performance Comparison`

`Use Cases`

`$3`

`javascript // Site with React/Vue/Angular { "url": "https://spa-site.com", "use_javascript": true, "wait_for_selector": "#root > div", "wait_time": 2000 }`

`$3`

`javascript // Get screenshot along with content { "url": "https://example.com/dashboard", "use_javascript": true, "take_screenshot": true }`

`$3`

`javascript // Lazy-loaded images and dynamic prices { "url": "https://shop.example.com/product/123", "use_javascript": true, "wait_time": 3000 }`

`Troubleshooting`

`$3`

`bash

`Reinstall browsers`


npm run install:browsers
Check Playwright installation

npx playwright --version


$3
Problem: Getting < 50 words from a JavaScript site?

Solution: - Setuse_javascript: trueexplicitly - Usewait_for_selectorfor specific elements - Increasewait_time if content loads slowly

`$3`

Problem: Browser consuming too much memory?

Solution: - The browser instance is reused and shared - Contexts are closed after each operation - Consider increasing system resources for heavy usage

`Advantages over Puppeteer`

`Development`

`$3`

`bash npm run build`

`$3`

`bash

`With MCP Inspector`


npx @modelcontextprotocol/inspector node dist/index.js