# web-scrapy

The command line web scraper.

## Installation

```bash
npm install web-scrapy        # as a project dependency
npm install -g web-scrapy     # or globally, for CLI use
```

Or use directly with npx:

```bash
npx web-scrapy --help
```
## Quick Start

### Simple Selector Extraction

```bash
# Extract page title
curl https://example.com | web-scrapy -s "title" -t

# Extract all links
echo '<a>Link 1</a><a>Link 2</a>' | web-scrapy -s "a" --pretty

# Extract specific content
cat article.html | web-scrapy -s ".content p" -t --format text
```
### Schema-Based Extraction

```bash
# Inline schema for article extraction
curl https://news.site.com | web-scrapy --schema '{
  "fields": {
    "title": {"selector": "h1", "type": "text"},
    "author": {"selector": ".author", "type": "text"},
    "date": {"selector": "time", "type": "attribute", "attribute": "datetime"}
  }
}' --pretty

# Use a schema file for complex extraction
curl https://shop.com/product/123 | web-scrapy -f product-schema.json -o results.json
```
## Usage

```
web-scrapy - Advanced command line web scraper with schema support

Usage:
  echo "..." | web-scrapy [options]
  cat file.html | web-scrapy [options]
  curl https://example.com | web-scrapy [options]

Input Options (choose one):
  -s, --selector      Simple CSS selector extraction
  --schema            Inline JSON schema for complex extraction
  -f, --schema-file   JSON schema file for complex extraction

Output Options:
  -o, --output        Save output to file (default: stdout)
  --format            Output format: json, pretty, text (default: json)
  -p, --pretty        Pretty-print JSON output

Extraction Options:
  -m, --mode          Extraction mode: single, multiple (default: single)
  -c, --container     Container selector for multiple mode
  -t, --text          Extract text content only (selector mode)
  -l, --limit         Limit number of results (multiple mode)
  --ignore-errors     Continue extraction despite errors

Utility Options:
  -h, --help          Show this help message
  -e, --examples      Show example schemas and usage
  -v, --version       Show version information
```
## Schema Format

Schemas are JSON objects that define how to extract structured data from HTML:

```json
{
  "name": "Schema name (optional)",
  "description": "Schema description (optional)",
  "fields": {
    "fieldName": {
      "selector": "CSS selector",
      "type": "text|attribute|html|number|boolean|array",
      "required": true|false,
      "default": "default value"
    }
  },
  "config": {
    "ignoreErrors": true|false,
    "limit": number
  }
}
```
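A minimal concrete instance of this shape (a hypothetical schema for grabbing a page heading; the field name is ours):

```json
{
  "name": "Page heading",
  "fields": {
    "heading": {
      "selector": "h1",
      "type": "text",
      "required": true
    }
  },
  "config": {
    "ignoreErrors": false
  }
}
```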
### Field Types

- `text` - Extract text content (supports `trim` option)
- `attribute` - Extract an attribute value (requires the `attribute` property)
- `html` - Extract HTML content (supports `inner` option for innerHTML)
- `number` - Parse as a number (supports `integer` option)
- `boolean` - Convert to true/false (supports `trueValue` option)
- `array` - Extract multiple items (requires the `itemSchema` property)
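The scalar types behave like a coercion step applied to the raw selected value. A minimal sketch of that logic (this is our illustration, not the tool's actual source; option names follow the schema reference above):

```javascript
// Coerce a raw extracted string according to a field spec.
// Handles the scalar types: text, number, boolean (plus default/required).
function coerceField(raw, spec) {
  if (raw == null) {
    if (spec.required) throw new Error('missing required field');
    return spec.default ?? null;
  }
  switch (spec.type) {
    case 'text':
      return spec.trim ? raw.trim() : raw;
    case 'number': {
      // Strip currency symbols and other noise before parsing.
      const n = parseFloat(raw.replace(/[^0-9.+-]/g, ''));
      return spec.integer ? Math.trunc(n) : n;
    }
    case 'boolean':
      return spec.trueValue ? raw.trim() === spec.trueValue : Boolean(raw);
    default:
      return raw;
  }
}

console.log(coerceField('$29.99', { type: 'number' }));                            // 29.99
console.log(coerceField('In Stock', { type: 'boolean', trueValue: 'In Stock' })); // true
console.log(coerceField(null, { type: 'number', default: 0 }));                   // 0
```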
## Examples

### Article Extraction

Schema file (`article-schema.json`):
```json
{
  "name": "News Article",
  "fields": {
    "title": {
      "selector": "h1, .headline, .title",
      "type": "text",
      "trim": true
    },
    "author": {
      "selector": ".author, [rel='author']",
      "type": "text",
      "required": false
    },
    "publishDate": {
      "selector": "time",
      "type": "attribute",
      "attribute": "datetime"
    },
    "content": {
      "selector": ".content p",
      "type": "array",
      "itemSchema": {
        "selector": "",
        "type": "text"
      }
    },
    "tags": {
      "selector": ".tag",
      "type": "array",
      "itemSchema": {
        "selector": "",
        "type": "text"
      }
    }
  }
}
```
Usage:

```bash
curl https://news.com/article | web-scrapy -f article-schema.json --pretty
```
### Product Extraction

Inline schema:

```bash
echo '<h1>Awesome Product</h1>
<span class="price">$29.99</span>
<span class="original-price">$39.99</span>
<span class="in-stock">In Stock</span>' | web-scrapy --schema '{
  "fields": {
    "name": {"selector": "h1", "type": "text"},
    "price": {"selector": ".price", "type": "number"},
    "originalPrice": {"selector": ".original-price", "type": "number"},
    "inStock": {"selector": ".in-stock", "type": "boolean", "trueValue": "In Stock"}
  }
}' --pretty
```
Output:

```json
{
  "data": {
    "name": "Awesome Product",
    "price": 29.99,
    "originalPrice": 39.99,
    "inStock": true
  },
  "errors": [],
  "extractedAt": "2024-01-15T10:30:00.000Z",
  "schema": "Unnamed schema"
}
```
### Multiple Items

Extract multiple products from a catalog page:

```bash
curl https://shop.com/catalog | web-scrapy --schema '{
  "fields": {
    "name": {"selector": "h3", "type": "text"},
    "price": {"selector": ".price", "type": "number"},
    "rating": {"selector": ".rating", "type": "number"}
  }
}' -m multiple -c ".product-item" -l 10 --pretty
```
### Social Media Posts

```bash
cat social-feed.html | web-scrapy --schema '{
  "fields": {
    "username": {"selector": ".username", "type": "text"},
    "content": {"selector": ".post-text", "type": "text"},
    "likes": {"selector": ".likes", "type": "number", "default": 0},
    "hashtags": {
      "selector": ".hashtag",
      "type": "array",
      "itemSchema": {"selector": "", "type": "text"}
    }
  }
}' -m multiple -c ".post" --ignore-errors -o posts.json
```
## Advanced Features

### Error Handling

The scraper provides detailed error reporting:

```bash
# Ignore errors and continue extraction
web-scrapy -s ".missing-selector" --ignore-errors

# Detailed error information is included in the JSON output
web-scrapy -f schema.json --pretty
```
### Default Values

```json
{
  "fields": {
    "price": {
      "selector": ".price",
      "type": "number",
      "default": 0,
      "required": false
    },
    "description": {
      "selector": ".desc",
      "type": "text",
      "default": "No description available"
    }
  }
}
```
### Nested Objects

```json
{
  "fields": {
    "specifications": {
      "selector": ".spec-row",
      "type": "array",
      "itemSchema": {
        "selector": "",
        "type": "object",
        "fields": {
          "name": {"selector": ".spec-name", "type": "text"},
          "value": {"selector": ".spec-value", "type": "text"}
        }
      }
    }
  }
}
```
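For markup such as `<div class="spec-row"><span class="spec-name">Weight</span><span class="spec-value">1.2 kg</span></div>`, this schema would produce output along these lines (a sketch; the wrapper shape follows the product example above, and the row values are invented):

```json
{
  "data": {
    "specifications": [
      { "name": "Weight", "value": "1.2 kg" }
    ]
  },
  "errors": []
}
```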
## Output Formats

### JSON (Default)

```bash
web-scrapy -s "title" --format json
{"content": "Page Title"}
```

### Pretty

```bash
web-scrapy -s "title" --format pretty
{
  "content": "Page Title"
}
```

### Text

```bash
web-scrapy -s "title" --format text
Page Title
```
## Integration Examples

### jq Integration

```bash
# Extract and filter data
curl https://api.example.com | web-scrapy -f schema.json | jq '.data.title'

# Count extracted items
curl https://news.com | web-scrapy -f news.json -m multiple -c "article" | jq '.data | length'
```
### Shell Scripts

```bash
#!/usr/bin/env bash
# Monitor product prices
curl -s "https://shop.com/product/123" | \
  web-scrapy --schema '{"fields":{"price":{"selector":".price","type":"number"}}}' | \
  jq -r '.data.price' > current-price.txt
```
### Node.js Integration

```javascript
import { spawn } from 'child_process';
import { readFileSync } from 'fs';

const html = readFileSync('page.html', 'utf8');
const schema = JSON.stringify({
  fields: {
    title: { selector: 'h1', type: 'text' },
    price: { selector: '.price', type: 'number' }
  }
});

const scraper = spawn('web-scrapy', ['--schema', schema, '--format', 'json']);
scraper.stdin.write(html);
scraper.stdin.end();

// Accumulate stdout: 'data' may fire more than once for large results,
// so only parse once the process has closed.
let output = '';
scraper.stdout.on('data', (chunk) => { output += chunk; });
scraper.on('close', (code) => {
  if (code !== 0) throw new Error(`web-scrapy exited with code ${code}`);
  const result = JSON.parse(output);
  console.log('Extracted:', result.data);
});
```
## Error Codes

- `0` - Success
- `1` - Argument parsing error
- `2` - Input/output error
- `3` - Schema validation error
- `4` - Extraction error
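These exit codes make web-scrapy easy to drive from scripts. A small sketch (the `describe_exit` helper and its message wording are ours; only the codes come from the list above):

```shell
#!/usr/bin/env bash
# Map web-scrapy exit codes to human-readable messages.
describe_exit() {
  case "$1" in
    0) echo "success" ;;
    1) echo "bad arguments" ;;
    2) echo "input/output error" ;;
    3) echo "invalid schema" ;;
    4) echo "extraction failed" ;;
    *) echo "unknown error ($1)" ;;
  esac
}

# Usage in a scraping script (assuming web-scrapy is installed):
#   curl -s https://example.com | web-scrapy -f schema.json -o out.json
#   status=$?
#   [ "$status" -ne 0 ] && echo "web-scrapy: $(describe_exit "$status")" >&2
```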
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

MIT License - see LICENSE file for details.

## Related Projects

- node-html-parser - Fast HTML parser
- cheerio - jQuery-like server-side HTML manipulation
- puppeteer - Headless browser automation
---

For more examples and detailed documentation, run:

```bash
web-scrapy --examples
```