StepWright

A powerful web scraping library built with Playwright that provides a declarative, step-by-step approach to web automation and data extraction.

Features

- 🚀 Declarative Scraping: Define scraping workflows using JSON templates
- 🔄 Pagination Support: Built-in support for next button and scroll-based pagination
- 📊 Data Collection: Extract text, HTML, values, and files from web pages
- 🔗 Multi-tab Support: Handle multiple tabs and complex navigation flows
- 📄 PDF Generation: Save pages as PDFs or trigger print-to-PDF actions
- 📥 File Downloads: Download files with automatic directory creation
- 🔁 Looping & Iteration: ForEach loops for processing multiple elements
- 📡 Streaming Results: Real-time result processing with callbacks
- 🎯 Error Handling: Graceful error handling with configurable termination
- 🔧 Flexible Selectors: Support for ID, class, tag, and XPath selectors

Installation

``bash

`Using pnpm (recommended)`


pnpm add stepwright
Using npm

npm install stepwright
Using yarn

yarn add stepwright


Quick Start
$3

`typescript import { runScraper } from 'stepwright';

const templates = [ { tab: 'example', steps: [ { id: 'navigate', action: 'navigate', value: 'https://example.com' }, { id: 'get_title', action: 'data', object_type: 'tag', object: 'h1', key: 'title', data_type: 'text' } ] } ];

const results = await runScraper(templates); console.log(results);`

`Examples`

`$3`

The repository includes basic examples demonstrating core functionality:

- Basic Usage (TypeScript): examples/basic-usage.ts- Simple navigation and data extraction - Basic Usage (JavaScript):examples/basic-usage.js - Same example in JavaScript

Run the examples:`bash

`Run all examples`


./examples/run-examples.sh
Or run individual examples

node examples/basic-usage.js
npx tsx examples/basic-usage.ts


$3
For more complex scenarios, check out:

- Advanced Usage (TypeScript): examples/advanced-usage.ts- Pagination, file downloads, and multi-tab handling - Advanced Usage (JavaScript):examples/advanced-usage.js - Same advanced features in JavaScript

`API Reference`

`$3`

#### runScraper(templates, options?)

Main function to execute scraping templates.

Parameters: -templates: Array of TabTemplateobjects -options: Optional RunOptions object

Returns: Promise[]>

#### runScraperWithCallback(templates, onResult, options?)

Execute scraping with streaming results via callback.

Parameters: -templates: Array of TabTemplateobjects -onResult: Callback function for each result -options: Optional RunOptions object

`$3`

#### TabTemplate

`typescript interface TabTemplate { tab: string; initSteps?: BaseStep[]; // Steps executed once before pagination perPageSteps?: BaseStep[]; // Steps executed for each page steps?: BaseStep[]; // Legacy single steps array pagination?: PaginationConfig; }`

#### BaseStep

`typescript interface BaseStep { id: string; description?: string; object_type?: SelectorType; // 'id' | 'class' | 'tag' | 'xpath' object?: string; action: 'navigate' | 'input' | 'click' | 'data' | 'scroll' | 'download' | 'foreach' | 'open' | 'savePDF' | 'printToPDF'; value?: string; key?: string; data_type?: DataType; // 'text' | 'html' | 'value' | 'default' wait?: number; terminateonerror?: boolean; subSteps?: BaseStep[]; }`

#### RunOptions

`typescript interface RunOptions { browser?: LaunchOptions; onResult?: (result: Record, index: number) => void | Promise; }`

`Step Actions`

`$3`


Navigate to a URL.

`typescript { id: 'go_to_page', action: 'navigate', value: 'https://example.com' }`

`$3`


Fill form fields.

`typescript { id: 'search', action: 'input', object_type: 'id', object: 'search-box', value: 'search term' }`

`$3`


Click on elements.

`typescript { id: 'submit', action: 'click', object_type: 'class', object: 'submit-button' }`

`$3`


Extract data from elements.

`typescript { id: 'get_title', action: 'data', object_type: 'tag', object: 'h1', key: 'title', data_type: 'text' }`

`$3`


Process multiple elements.

`typescript { id: 'process_items', action: 'foreach', object_type: 'class', object: 'item', subSteps: [ // Steps to execute for each item ] }`

`$3`

#### Download`typescript { id: 'download_file', action: 'download', object_type: 'class', object: 'download-link', value: './downloads/file.pdf', key: 'downloaded_file' }`

#### Save PDF`typescript { id: 'save_pdf', action: 'savePDF', value: './output/page.pdf', key: 'pdf_file' }`

#### Print to PDF`typescript { id: 'print_pdf', action: 'printToPDF', object_type: 'id', object: 'print-button', value: './output/printed.pdf', key: 'printed_file' }`

`Pagination`

`$3`

typescript
pagination: {
  strategy: 'next',
  nextButton: {
    object_type: 'class',
    object: 'next-page',
    wait: 2000
  },
  maxPages: 10
}

$3

typescript
pagination: {
  strategy: 'scroll',
  scroll: {
    offset: 800,
    delay: 1500
  },
  maxPages: 5
}


Advanced Features
$3

typescript
const results = await runScraper(templates, {
  browser: {
    proxy: {
      server: 'http://proxy-server:8080',
      username: 'user',
      password: 'pass'
    }
  }
});

$3

typescript
const results = await runScraper(templates, {
  browser: {
    headless: false,
    slowMo: 1000,
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  }
});

$3

typescript
await runScraperWithCallback(templates, async (result, index) => {
  console.log(

Result ${index}:

, result);
  // Process result immediately
}, {
  browser: { headless: true }
});


$3

Use collected data in subsequent steps:

`typescript { id: 'save_with_title', action: 'savePDF', value: './output/{{meeting_title}}.pdf', key: 'meeting_pdf' }`

`Error Handling`

Steps can be configured to terminate on error:

`typescript { id: 'critical_step', action: 'click', object_type: 'id', object: 'important-button', terminateonerror: true }`

`Development`

`$3`

bash
Install dependencies

pnpm install
Build the project

pnpm build
Run tests

pnpm test
Run tests in watch mode

pnpm test:watch
Lint code

pnpm lint
Format code

pnpm format

$3

bash
Run all tests

pnpm test
Run tests with coverage

pnpm test:coverage
Run specific test file

pnpm test scraper.test.ts

Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests for new functionality
5. Run the test suite
6. Submit a pull request

License

MIT License - see LICENSE file for details.

Support

- 🐛 Issues: GitHub Issues

StepWright

A powerful web scraping library built with Playwright that provides a declarative, step-by-step approach to web automation and data extraction.

Features

Installation

``bash

`Using pnpm (recommended)`


pnpm add stepwright
Using npm

npm install stepwright
Using yarn

yarn add stepwright


Quick Start
$3

`typescript import { runScraper } from 'stepwright';

const results = await runScraper(templates); console.log(results);`

`Examples`

`$3`

The repository includes basic examples demonstrating core functionality:

- Basic Usage (TypeScript): examples/basic-usage.ts- Simple navigation and data extraction - Basic Usage (JavaScript):examples/basic-usage.js - Same example in JavaScript

Run the examples:`bash

`Run all examples`


./examples/run-examples.sh
Or run individual examples

node examples/basic-usage.js
npx tsx examples/basic-usage.ts


$3
For more complex scenarios, check out:

`API Reference`

`$3`

#### runScraper(templates, options?)

Main function to execute scraping templates.

Parameters: -templates: Array of TabTemplateobjects -options: Optional RunOptions object

Returns: Promise[]>

#### runScraperWithCallback(templates, onResult, options?)

Execute scraping with streaming results via callback.

Parameters: -templates: Array of TabTemplateobjects -onResult: Callback function for each result -options: Optional RunOptions object

`$3`

#### TabTemplate

#### BaseStep

#### RunOptions

`typescript interface RunOptions { browser?: LaunchOptions; onResult?: (result: Record, index: number) => void | Promise; }`

`Step Actions`

`$3`


Navigate to a URL.

`typescript { id: 'go_to_page', action: 'navigate', value: 'https://example.com' }`

`$3`


Fill form fields.

`typescript { id: 'search', action: 'input', object_type: 'id', object: 'search-box', value: 'search term' }`

`$3`


Click on elements.

`typescript { id: 'submit', action: 'click', object_type: 'class', object: 'submit-button' }`

`$3`


Extract data from elements.

`typescript { id: 'get_title', action: 'data', object_type: 'tag', object: 'h1', key: 'title', data_type: 'text' }`

`$3`


Process multiple elements.

`typescript { id: 'process_items', action: 'foreach', object_type: 'class', object: 'item', subSteps: [ // Steps to execute for each item ] }`

`$3`

#### Download`typescript { id: 'download_file', action: 'download', object_type: 'class', object: 'download-link', value: './downloads/file.pdf', key: 'downloaded_file' }`

#### Save PDF`typescript { id: 'save_pdf', action: 'savePDF', value: './output/page.pdf', key: 'pdf_file' }`

#### Print to PDF`typescript { id: 'print_pdf', action: 'printToPDF', object_type: 'id', object: 'print-button', value: './output/printed.pdf', key: 'printed_file' }`

`Pagination`

`$3`

typescript
pagination: {
  strategy: 'next',
  nextButton: {
    object_type: 'class',
    object: 'next-page',
    wait: 2000
  },
  maxPages: 10
}

$3

typescript
pagination: {
  strategy: 'scroll',
  scroll: {
    offset: 800,
    delay: 1500
  },
  maxPages: 5
}


Advanced Features
$3

typescript
const results = await runScraper(templates, {
  browser: {
    proxy: {
      server: 'http://proxy-server:8080',
      username: 'user',
      password: 'pass'
    }
  }
});

$3

typescript
const results = await runScraper(templates, {
  browser: {
    headless: false,
    slowMo: 1000,
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  }
});

$3

typescript
await runScraperWithCallback(templates, async (result, index) => {
  console.log(

Result ${index}:

, result);
  // Process result immediately
}, {
  browser: { headless: true }
});


$3

Use collected data in subsequent steps:

`typescript { id: 'save_with_title', action: 'savePDF', value: './output/{{meeting_title}}.pdf', key: 'meeting_pdf' }`

`Error Handling`

Steps can be configured to terminate on error:

`typescript { id: 'critical_step', action: 'click', object_type: 'id', object: 'important-button', terminateonerror: true }`

`Development`

`$3`

bash
Install dependencies

pnpm install
Build the project

pnpm build
Run tests

pnpm test
Run tests in watch mode

pnpm test:watch
Lint code

pnpm lint
Format code

pnpm format

$3

bash
Run all tests

pnpm test
Run tests with coverage

pnpm test:coverage
Run specific test file

pnpm test scraper.test.ts

Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests for new functionality
5. Run the test suite
6. Submit a pull request

License

MIT License - see LICENSE file for details.

Support

- 🐛 Issues: GitHub Issues