# web-scrapy

The command line web scraper.

## Installation

```bash
npm install web-scrapy        # as a project dependency
npm install -g web-scrapy     # or globally, for CLI use
```

Or use directly with npx:

```bash
npx web-scrapy --help
```
## Quick Start

### Simple Selector Extraction

```bash
# Extract page title
curl https://example.com | web-scrapy -s "title" -t

# Extract all links
echo '<a>Link 1</a><a>Link 2</a>' | web-scrapy -s "a" --pretty

# Extract specific content
cat article.html | web-scrapy -s ".content p" -t --format text
```
### Schema-Based Extraction

```bash
# Inline schema for article extraction
curl https://news.site.com | web-scrapy --schema '{
  "fields": {
    "title": {"selector": "h1", "type": "text"},
    "author": {"selector": ".author", "type": "text"},
    "date": {"selector": "time", "type": "attribute", "attribute": "datetime"}
  }
}' --pretty

# Use a schema file for complex extraction
curl https://shop.com/product/123 | web-scrapy -f product-schema.json -o results.json
```
## Usage

```
web-scrapy - Advanced command line web scraper with schema support

Usage:
  echo "..." | web-scrapy [options]
  cat file.html | web-scrapy [options]
  curl https://example.com | web-scrapy [options]

Input Options (choose one):
  -s, --selector      Simple CSS selector extraction
  --schema            Inline JSON schema for complex extraction
  -f, --schema-file   JSON schema file for complex extraction

Output Options:
  -o, --output        Save output to file (default: stdout)
  --format            Output format: json, pretty, text (default: json)
  -p, --pretty        Pretty-print JSON output

Extraction Options:
  -m, --mode          Extraction mode: single, multiple (default: single)
  -c, --container     Container selector for multiple mode
  -t, --text          Extract text content only (selector mode)
  -l, --limit         Limit number of results (multiple mode)
  --ignore-errors     Continue extraction despite errors

Utility Options:
  -h, --help          Show this help message
  -e, --examples      Show example schemas and usage
  -v, --version       Show version information
```
## Schema Format

Schemas are JSON objects that define how to extract structured data from HTML:

```json
{
  "name": "Schema name (optional)",
  "description": "Schema description (optional)",
  "fields": {
    "fieldName": {
      "selector": "CSS selector",
      "type": "text|attribute|html|number|boolean|array",
      "required": true|false,
      "default": "default value"
    }
  },
  "config": {
    "ignoreErrors": true|false,
    "limit": number
  }
}
```
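A minimal concrete instance of this shape (a hypothetical schema for grabbing a page heading; the field name is ours):

```json
{
  "name": "Page heading",
  "fields": {
    "heading": {
      "selector": "h1",
      "type": "text",
      "required": true
    }
  },
  "config": {
    "ignoreErrors": false
  }
}
```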
### Field Types

- `text` - Extract text content (supports `trim` option)
- `attribute` - Extract an attribute value (requires the `attribute` property)
- `html` - Extract HTML content (supports `inner` option for innerHTML)
- `number` - Parse as a number (supports `integer` option)
- `boolean` - Convert to true/false (supports `trueValue` option)
- `array` - Extract multiple items (requires the `itemSchema` property)
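The scalar types behave like a coercion step applied to the raw selected value. A minimal sketch of that logic (this is our illustration, not the tool's actual source; option names follow the schema reference above):

```javascript
// Coerce a raw extracted string according to a field spec.
// Handles the scalar types: text, number, boolean (plus default/required).
function coerceField(raw, spec) {
  if (raw == null) {
    if (spec.required) throw new Error('missing required field');
    return spec.default ?? null;
  }
  switch (spec.type) {
    case 'text':
      return spec.trim ? raw.trim() : raw;
    case 'number': {
      // Strip currency symbols and other noise before parsing.
      const n = parseFloat(raw.replace(/[^0-9.+-]/g, ''));
      return spec.integer ? Math.trunc(n) : n;
    }
    case 'boolean':
      return spec.trueValue ? raw.trim() === spec.trueValue : Boolean(raw);
    default:
      return raw;
  }
}

console.log(coerceField('$29.99', { type: 'number' }));                            // 29.99
console.log(coerceField('In Stock', { type: 'boolean', trueValue: 'In Stock' })); // true
console.log(coerceField(null, { type: 'number', default: 0 }));                   // 0
```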
## Examples

### Article Extraction

Schema file (`article-schema.json`):
```json
{
  "name": "News Article",
  "fields": {
    "title": {
      "selector": "h1, .headline, .title",
      "type": "text",
      "trim": true
    },
    "author": {
      "selector": ".author, [rel='author']",
      "type": "text",
      "required": false
    },
    "publishDate": {
      "selector": "time",
      "type": "attribute",
      "attribute": "datetime"
    },
    "content": {
      "selector": ".content p",
      "type": "array",
      "itemSchema": {
        "selector": "",
        "type": "text"
      }
    },
    "tags": {
      "selector": ".tag",
      "type": "array",
      "itemSchema": {
        "selector": "",
        "type": "text"
      }
    }
  }
}
```
Usage:

```bash
curl https://news.com/article | web-scrapy -f article-schema.json --pretty
```
### Product Extraction

Inline schema:

```bash
echo '<h1>Awesome Product</h1>
<span class="price">$29.99</span>
<span class="original-price">$39.99</span>
<span class="in-stock">In Stock</span>' | web-scrapy --schema '{
  "fields": {
    "name": {"selector": "h1", "type": "text"},
    "price": {"selector": ".price", "type": "number"},
    "originalPrice": {"selector": ".original-price", "type": "number"},
    "inStock": {"selector": ".in-stock", "type": "boolean", "trueValue": "In Stock"}
  }
}' --pretty
```
Output:

```json
{
  "data": {
    "name": "Awesome Product",
    "price": 29.99,
    "originalPrice": 39.99,
    "inStock": true
  },
  "errors": [],
  "extractedAt": "2024-01-15T10:30:00.000Z",
  "schema": "Unnamed schema"
}
```
### Multiple Items

Extract multiple products from a catalog page:

```bash
curl https://shop.com/catalog | web-scrapy --schema '{
  "fields": {
    "name": {"selector": "h3", "type": "text"},
    "price": {"selector": ".price", "type": "number"},
    "rating": {"selector": ".rating", "type": "number"}
  }
}' -m multiple -c ".product-item" -l 10 --pretty
```
### Social Media Posts

```bash
cat social-feed.html | web-scrapy --schema '{
  "fields": {
    "username": {"selector": ".username", "type": "text"},
    "content": {"selector": ".post-text", "type": "text"},
    "likes": {"selector": ".likes", "type": "number", "default": 0},
    "hashtags": {
      "selector": ".hashtag",
      "type": "array",
      "itemSchema": {"selector": "", "type": "text"}
    }
  }
}' -m multiple -c ".post" --ignore-errors -o posts.json
```
## Advanced Features

### Error Handling

The scraper provides detailed error reporting:

```bash
# Ignore errors and continue extraction
web-scrapy -s ".missing-selector" --ignore-errors

# Detailed error information is included in the JSON output
web-scrapy -f schema.json --pretty
```
### Default Values

```json
{
  "fields": {
    "price": {
      "selector": ".price",
      "type": "number",
      "default": 0,
      "required": false
    },
    "description": {
      "selector": ".desc",
      "type": "text",
      "default": "No description available"
    }
  }
}
```
### Nested Objects

```json
{
  "fields": {
    "specifications": {
      "selector": ".spec-row",
      "type": "array",
      "itemSchema": {
        "selector": "",
        "type": "object",
        "fields": {
          "name": {"selector": ".spec-name", "type": "text"},
          "value": {"selector": ".spec-value", "type": "text"}
        }
      }
    }
  }
}
```
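For markup such as `<div class="spec-row"><span class="spec-name">Weight</span><span class="spec-value">1.2 kg</span></div>`, this schema would produce output along these lines (a sketch; the wrapper shape follows the product example above, and the row values are invented):

```json
{
  "data": {
    "specifications": [
      { "name": "Weight", "value": "1.2 kg" }
    ]
  },
  "errors": []
}
```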
## Output Formats

### JSON (Default)

```bash
web-scrapy -s "title" --format json
{"content": "Page Title"}
```

### Pretty

```bash
web-scrapy -s "title" --format pretty
{
  "content": "Page Title"
}
```

### Text

```bash
web-scrapy -s "title" --format text
Page Title
```
## Integration Examples

### jq Integration

```bash
# Extract and filter data
curl https://api.example.com | web-scrapy -f schema.json | jq '.data.title'

# Count extracted items
curl https://news.com | web-scrapy -f news.json -m multiple -c "article" | jq '.data | length'
```
### Shell Scripts

```bash
#!/usr/bin/env bash
# Monitor product prices
curl -s "https://shop.com/product/123" | \
  web-scrapy --schema '{"fields":{"price":{"selector":".price","type":"number"}}}' | \
  jq -r '.data.price' > current-price.txt
```
### Node.js Integration

```javascript
import { spawn } from 'child_process';
import { readFileSync } from 'fs';

const html = readFileSync('page.html', 'utf8');
const schema = JSON.stringify({
  fields: {
    title: { selector: 'h1', type: 'text' },
    price: { selector: '.price', type: 'number' }
  }
});

const scraper = spawn('web-scrapy', ['--schema', schema, '--format', 'json']);
scraper.stdin.write(html);
scraper.stdin.end();

// Accumulate stdout: 'data' may fire more than once for large results,
// so only parse once the process has closed.
let output = '';
scraper.stdout.on('data', (chunk) => { output += chunk; });
scraper.on('close', (code) => {
  if (code !== 0) throw new Error(`web-scrapy exited with code ${code}`);
  const result = JSON.parse(output);
  console.log('Extracted:', result.data);
});
```
## Error Codes

- `0` - Success
- `1` - Argument parsing error
- `2` - Input/output error
- `3` - Schema validation error
- `4` - Extraction error
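These exit codes make web-scrapy easy to drive from scripts. A small sketch (the `describe_exit` helper and its message wording are ours; only the codes come from the list above):

```shell
#!/usr/bin/env bash
# Map web-scrapy exit codes to human-readable messages.
describe_exit() {
  case "$1" in
    0) echo "success" ;;
    1) echo "bad arguments" ;;
    2) echo "input/output error" ;;
    3) echo "invalid schema" ;;
    4) echo "extraction failed" ;;
    *) echo "unknown error ($1)" ;;
  esac
}

# Usage in a scraping script (assuming web-scrapy is installed):
#   curl -s https://example.com | web-scrapy -f schema.json -o out.json
#   status=$?
#   [ "$status" -ne 0 ] && echo "web-scrapy: $(describe_exit "$status")" >&2
```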
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

MIT License - see LICENSE file for details.

## Related Projects

- node-html-parser - Fast HTML parser
- cheerio - jQuery-like server-side HTML manipulation
- puppeteer - Headless browser automation
---

For more examples and detailed documentation, run:

```bash
web-scrapy --examples
```