A powerful and flexible web scraping library with concurrent processing and DOM hierarchy awareness
A powerful and flexible web scraping library built with TypeScript and Puppeteer. It supports concurrent scraping, recursive crawling, and intelligent content extraction with DOM hierarchy awareness.
- Concurrent Processing: Parallel processing of multiple selectors and pages
- DOM Hierarchy Aware: Smart content extraction that respects DOM structure
- Recursive Crawling: Ability to crawl through child pages with depth control
- Flexible Selectors: Support for both single and multiple CSS selectors
- Retry Mechanism: Built-in retry with exponential backoff for reliability
- Deduplication: Automatic deduplication of content and URLs
- Structured Output: Clean, structured JSON output with metadata
```bash
npm install web-structure
```

```typescript
import { scraping } from 'web-structure';
// Basic usage
const result = await scraping('https://example.com');
// Advanced usage with options
const advancedResult = await scraping('https://example.com', {
maxDepth: 2,
selectors: {
headings: ['h1', 'h2', 'h3'],
content: '.article-content',
links: 'a.important-link'
},
excludeChildPage: (url) => url.includes('login'),
withConsole: true
});
```
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| maxDepth | number | 0 | Maximum depth for recursive crawling |
| excludeChildPage | (url: string) => boolean | () => false | Function to determine if a URL should be skipped |
| selectors | { [key: string]: string \| string[] } | See below | Selectors to extract content |
| withConsole | boolean | true | Whether to show console information |
| breakWhenFailed | boolean | false | Whether to break when a page fails |
| retryCount | number | 3 | Number of retries when scraping fails |
| waitForSelectorTimeout | number | 12000 | Timeout for waiting for a selector (ms) |
| waitForPageLoadTimeout | number | 12000 | Timeout for waiting for page load (ms) |
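For example, the reliability-related options from the table can be tuned together; this is just an illustrative combination (the URL is a placeholder):

```typescript
import { scraping } from 'web-structure';

// Favor reliability: more retries, longer timeouts, stop on the first hard failure.
const robustResult = await scraping('https://example.com/docs', {
  retryCount: 5,                  // retry each failed operation up to 5 times
  waitForSelectorTimeout: 20000,  // wait up to 20s for selectors to appear
  waitForPageLoadTimeout: 20000,  // wait up to 20s for each page to load
  breakWhenFailed: true,          // abort the run if a page still fails after retries
  withConsole: false              // disable console logging
});
```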
If no selectors option is provided, the following defaults are used:

```typescript
{
headings: ['h1', 'h2', 'h3', 'h4', 'h5'],
paragraphs: 'p',
articles: 'article',
spans: 'span',
orderLists: 'ol',
lists: 'ul'
}
```

The scraping call resolves to a ScrapingResult with the following shape:

```typescript
interface ScrapingResult {
url: string; // URL of the scraped page
title: string; // Page title
data: { // Extracted content
[key: string]: string | string[];
};
timestamp: string; // ISO timestamp of when the page was scraped
childPages?: ScrapingResult[]; // Results from child pages (if maxDepth > 0)
}
```
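Because childPages nests further ScrapingResult objects, a crawl with maxDepth > 0 produces a tree. A small helper like the one below (a sketch, assuming ScrapingResult is exported as a type) can flatten it for reporting:

```typescript
import { scraping, type ScrapingResult } from 'web-structure';

// Collect every scraped URL and title from the result tree, depth-first.
function collectPages(result: ScrapingResult): Array<{ url: string; title: string }> {
  const pages = [{ url: result.url, title: result.title }];
  for (const child of result.childPages ?? []) {
    pages.push(...collectPages(child));
  }
  return pages;
}

const tree = await scraping('https://example.com', { maxDepth: 2 });
console.table(collectPages(tree));
```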
The library intelligently handles nested elements to prevent duplicate content. If a parent element is selected, its child elements won't be included separately in the results.
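For example, selecting both article and p will not repeat paragraphs that already sit inside a selected article. Conceptually, the filtering can be expressed with Node.contains; the snippet below is only an illustration of the idea, not the library's actual implementation:

```typescript
// Illustrative sketch: drop any element whose content is already covered
// by another selected element higher up in the DOM tree.
function dedupeByHierarchy(selected: Element[]): Element[] {
  return selected.filter(
    (el) => !selected.some((other) => other !== el && other.contains(el))
  );
}
```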
- Multiple selectors are processed concurrently
- Array selectors (e.g., ['h1', 'h2', 'h3']) are processed in parallel
- Child pages are processed sequentially to prevent overwhelming the target server
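The pattern behind these points is roughly the one below: selector extraction fans out with Promise.all while child pages are visited one at a time. This is a simplified sketch with hypothetical helpers (extractSelector, scrapePage), not the library's internal code:

```typescript
// Hypothetical helpers standing in for the library's internals.
declare function extractSelector(selector: string): Promise<string[]>;
declare function scrapePage(url: string): Promise<unknown>;

// All selectors for a single page are extracted in parallel.
async function processPage(selectors: Record<string, string | string[]>) {
  const entries = await Promise.all(
    Object.entries(selectors).map(async ([key, selector]) => {
      const value = Array.isArray(selector)
        ? (await Promise.all(selector.map((s) => extractSelector(s)))).flat() // array selectors in parallel
        : await extractSelector(selector);
      return [key, value] as const;
    })
  );
  return Object.fromEntries(entries);
}

// Child pages are crawled one after another to avoid overwhelming the server.
async function processChildPages(urls: string[]) {
  const results: unknown[] = [];
  for (const url of urls) {
    results.push(await scrapePage(url));
  }
  return results;
}
```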
Built-in retry mechanism with exponential backoff:
- Retries failed operations with increasing delays
- Configurable retry count
- Includes random jitter to prevent thundering herd problems
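A generic version of that retry strategy looks roughly like this (the delays and jitter here are illustrative; the library's exact values may differ):

```typescript
// Retry an async operation with exponential backoff and random jitter.
async function withRetry<T>(
  operation: () => Promise<T>,
  retries = 3,
  baseDelayMs = 1000
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      if (attempt >= retries) throw error;
      // Double the delay on every attempt and add jitter to avoid thundering herd.
      const delay = baseDelayMs * 2 ** attempt + Math.random() * baseDelayMs;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```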
The library provides robust error handling:
- Failed selector extractions don't stop the entire process
- Each selector and page has independent error handling
- Detailed error logging when withConsole is enabled
- Option to break on failures with `breakWhenFailed`
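If you opt into breakWhenFailed, it is worth wrapping the call; this usage sketch assumes a page that still fails after its retries surfaces as a rejected promise:

```typescript
import { scraping } from 'web-structure';

try {
  const strict = await scraping('https://example.com', { breakWhenFailed: true });
  console.log(strict.title);
} catch (error) {
  // With breakWhenFailed enabled, a persistent page failure aborts the run
  // instead of being skipped.
  console.error('Scraping aborted:', error);
}
```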
- Maximum crawling depth is limited to 10 levels
- Maximum of 5 child links per page are processed
- Respects robots.txt and rate limiting by default
- Requires JavaScript to be enabled on target pages
MIT