A powerful web scraping library built with Playwright
npm install stepwrightA powerful web scraping library built with Playwright that provides a declarative, step-by-step approach to web automation and data extraction.
- 🚀 Declarative Scraping: Define scraping workflows using JSON templates
- 🔄 Pagination Support: Built-in support for next button and scroll-based pagination
- 📊 Data Collection: Extract text, HTML, values, and files from web pages
- 🔗 Multi-tab Support: Handle multiple tabs and complex navigation flows
- 📄 PDF Generation: Save pages as PDFs or trigger print-to-PDF actions
- 📥 File Downloads: Download files with automatic directory creation
- 🔁 Looping & Iteration: ForEach loops for processing multiple elements
- 📡 Streaming Results: Real-time result processing with callbacks
- 🎯 Error Handling: Graceful error handling with configurable termination
- 🔧 Flexible Selectors: Support for ID, class, tag, and XPath selectors
``bashUsing pnpm (recommended)
pnpm add stepwright
Quick Start
$3
`typescript
import { runScraper } from 'stepwright';const templates = [
{
tab: 'example',
steps: [
{
id: 'navigate',
action: 'navigate',
value: 'https://example.com'
},
{
id: 'get_title',
action: 'data',
object_type: 'tag',
object: 'h1',
key: 'title',
data_type: 'text'
}
]
}
];
const results = await runScraper(templates);
console.log(results);
`Examples
$3
The repository includes basic examples demonstrating core functionality:
- Basic Usage (TypeScript):
examples/basic-usage.ts - Simple navigation and data extraction
- Basic Usage (JavaScript): examples/basic-usage.js - Same example in JavaScriptRun the examples:
`bash
Run all examples
./examples/run-examples.shOr run individual examples
node examples/basic-usage.js
npx tsx examples/basic-usage.ts
`$3
For more complex scenarios, check out:
- Advanced Usage (TypeScript):
examples/advanced-usage.ts - Pagination, file downloads, and multi-tab handling
- Advanced Usage (JavaScript): examples/advanced-usage.js - Same advanced features in JavaScriptAPI Reference
$3
####
runScraper(templates, options?)Main function to execute scraping templates.
Parameters:
-
templates: Array of TabTemplate objects
- options: Optional RunOptions objectReturns: Promise[]>
####
runScraperWithCallback(templates, onResult, options?)Execute scraping with streaming results via callback.
Parameters:
-
templates: Array of TabTemplate objects
- onResult: Callback function for each result
- options: Optional RunOptions object$3
####
TabTemplate`typescript
interface TabTemplate {
tab: string;
initSteps?: BaseStep[]; // Steps executed once before pagination
perPageSteps?: BaseStep[]; // Steps executed for each page
steps?: BaseStep[]; // Legacy single steps array
pagination?: PaginationConfig;
}
`####
BaseStep`typescript
interface BaseStep {
id: string;
description?: string;
object_type?: SelectorType; // 'id' | 'class' | 'tag' | 'xpath'
object?: string;
action: 'navigate' | 'input' | 'click' | 'data' | 'scroll' | 'download' | 'foreach' | 'open' | 'savePDF' | 'printToPDF';
value?: string;
key?: string;
data_type?: DataType; // 'text' | 'html' | 'value' | 'default'
wait?: number;
terminateonerror?: boolean;
subSteps?: BaseStep[];
}
`####
RunOptions`typescript
interface RunOptions {
browser?: LaunchOptions;
onResult?: (result: Record, index: number) => void | Promise;
}
`Step Actions
$3
Navigate to a URL.`typescript
{
id: 'go_to_page',
action: 'navigate',
value: 'https://example.com'
}
`$3
Fill form fields.`typescript
{
id: 'search',
action: 'input',
object_type: 'id',
object: 'search-box',
value: 'search term'
}
`$3
Click on elements.`typescript
{
id: 'submit',
action: 'click',
object_type: 'class',
object: 'submit-button'
}
`$3
Extract data from elements.`typescript
{
id: 'get_title',
action: 'data',
object_type: 'tag',
object: 'h1',
key: 'title',
data_type: 'text'
}
`$3
Process multiple elements.`typescript
{
id: 'process_items',
action: 'foreach',
object_type: 'class',
object: 'item',
subSteps: [
// Steps to execute for each item
]
}
`$3
#### Download
`typescript
{
id: 'download_file',
action: 'download',
object_type: 'class',
object: 'download-link',
value: './downloads/file.pdf',
key: 'downloaded_file'
}
`#### Save PDF
`typescript
{
id: 'save_pdf',
action: 'savePDF',
value: './output/page.pdf',
key: 'pdf_file'
}
`#### Print to PDF
`typescript
{
id: 'print_pdf',
action: 'printToPDF',
object_type: 'id',
object: 'print-button',
value: './output/printed.pdf',
key: 'printed_file'
}
`Pagination
$3
`typescript
pagination: {
strategy: 'next',
nextButton: {
object_type: 'class',
object: 'next-page',
wait: 2000
},
maxPages: 10
}
`$3
`typescript
pagination: {
strategy: 'scroll',
scroll: {
offset: 800,
delay: 1500
},
maxPages: 5
}
`Advanced Features
$3
`typescript
const results = await runScraper(templates, {
browser: {
proxy: {
server: 'http://proxy-server:8080',
username: 'user',
password: 'pass'
}
}
});
`$3
`typescript
const results = await runScraper(templates, {
browser: {
headless: false,
slowMo: 1000,
args: ['--no-sandbox', '--disable-setuid-sandbox']
}
});
`$3
`typescript
await runScraperWithCallback(templates, async (result, index) => {
console.log(Result ${index}:, result);
// Process result immediately
}, {
browser: { headless: true }
});
`$3
Use collected data in subsequent steps:`typescript
{
id: 'save_with_title',
action: 'savePDF',
value: './output/{{meeting_title}}.pdf',
key: 'meeting_pdf'
}
`Error Handling
Steps can be configured to terminate on error:
`typescript
{
id: 'critical_step',
action: 'click',
object_type: 'id',
object: 'important-button',
terminateonerror: true
}
`Development
$3
`bash
Install dependencies
pnpm installBuild the project
pnpm buildRun tests
pnpm testRun tests in watch mode
pnpm test:watchLint code
pnpm lintFormat code
pnpm format
`$3
`bash
Run all tests
pnpm testRun tests with coverage
pnpm test:coverageRun specific test file
pnpm test scraper.test.ts
``1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests for new functionality
5. Run the test suite
6. Submit a pull request
MIT License - see LICENSE file for details.