A powerful web scraping tool built with Playwright that crawls websites using BFS or DFS algorithms, captures screenshots, and extracts content.
## Installation

### Global Installation

```bash
npm install -g @harshvz/crawler
```
Note: Chromium browser will be automatically downloaded during installation (approximately 300MB). This is required for web scraping functionality.
### Local Installation
```bash
npm install @harshvz/crawler
```
Note: The postinstall script will automatically download the Chromium browser.
### Manual Browser Installation
If the automatic installation fails, you can manually install browsers:
```bash
npx playwright install chromium
```
### Install from Source
```bash
git clone https://github.com/harshvz/crawler.git
cd crawler
npm install
npm run build
npm install -g .
```
## Usage
### Basic Usage
Simply run the command and follow the prompts:
```bash
# Primary command (recommended)
crawler

# Alternative (for backward compatibility)
scraper
```
You'll be prompted to enter:
1. URL: The website URL to scrape (e.g., https://example.com)
2. Algorithm: Choose between bfs or dfs (default: bfs)
3. Output Directory: Custom save location (default: ~/knowledgeBase)
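These three prompts correspond directly to the programmatic API documented below. A rough sketch of the equivalent code (the URL and output path here are placeholders; depth is not prompted for by the CLI and defaults to 0, i.e. unlimited):

```typescript
import ScrapperServices from '@harshvz/crawler';

// URL prompt -> first constructor argument; Output Directory prompt -> third argument.
const scraper = new ScrapperServices('https://example.com', 0, '/tmp/knowledgeBase');

// Algorithm prompt -> choose bfsScrape (bfs) or dfsScrape (dfs).
await scraper.bfsScrape('/');
```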
### Command-Line Options
```bash
# Show version
crawler --version
crawler -v

# Show help
crawler --help
crawler -h
```
> Note: Both the `crawler` and `scraper` commands work identically. We recommend using `crawler` for new projects.
### Programmatic Usage
```typescript
import ScrapperServices from '@harshvz/crawler';

const scraper = new ScrapperServices('https://example.com', 2); // depth limit of 2

// Using BFS
await scraper.bfsScrape('/');

// Using DFS
await scraper.dfsScrape('/');
```
## CLI Commands
### Available Scripts
```bash
# Run in development mode with auto-reload
npm run dev

# Build the project
npm run build

# Start the built version (uses crawler command)
npm start
```
## API Documentation
### ScrapperServices
Main class for web scraping operations.
#### Constructor
```typescript
new ScrapperServices(website: string, depth?: number, customPath?: string)
```
Parameters:
- website (string): The base URL of the website to scrape
- depth (number, optional): Maximum depth to crawl (0 = unlimited, default: 0)
- customPath (string, optional): Custom output directory path (default: ~/knowledgeBase)
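For illustration, here is how the three parameters might be combined (the URLs and paths are placeholders):

```typescript
import ScrapperServices from '@harshvz/crawler';

// Defaults: unlimited depth, output under ~/knowledgeBase
const basic = new ScrapperServices('https://example.com');

// Crawl at most 3 levels deep and write output to a custom directory
const custom = new ScrapperServices('https://example.com', 3, '/data/crawls/example');
```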
#### Methods
##### bfsScrape(endpoint?: string, results?: string[], visited?: Record<string, boolean>)
Crawls the website using Breadth-First Search algorithm.
Parameters:
- endpoint (string): Starting path (default: "/")
- results (string[]): Array to collect visited endpoints
- visited (Record<string, boolean>): Object to track visited URLs
##### dfsScrape(endpoint?: string, results?: string[], visited?: Record<string, boolean>)
Crawls the website using Depth-First Search algorithm.
Parameters:
- endpoint (string): Starting path (default: "/")
- results (string[]): Array to collect visited endpoints
- visited (Record<string, boolean>): Object to track visited URLs
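Both methods accept the same optional results and visited collections, so they can be shared across calls. A sketch, assuming endpoints already recorded in `visited` are skipped on subsequent runs (behavior inferred from the parameter descriptions above; the paths are placeholders):

```typescript
import ScrapperServices from '@harshvz/crawler';

const scraper = new ScrapperServices('https://example.com', 2);

const results: string[] = [];
const visited: Record<string, boolean> = {};

// First pass over the docs section
await scraper.bfsScrape('/docs', results, visited);

// Second pass starting from the blog; endpoints already present in `visited`
// should not be fetched again.
await scraper.dfsScrape('/blog', results, visited);

console.log(`Visited ${results.length} unique endpoints`);
```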
##### buildFilePath(endpoint: string): string
Generates a file path for storing screenshots.
##### buildContentPath(endpoint: string): string
Generates a file path for storing extracted content.
##### getLinks(page: Page): Promise<string[]>
Extracts all internal links from the current page.
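The Output Structure section below shows how endpoints end up on disk (`/` becomes `home.*`, `/about` becomes `_about.*`, with one folder per hostname). A hypothetical helper mirroring that naming scheme, for illustration only (not the library's actual implementation):

```typescript
import * as os from 'os';
import * as path from 'path';

// Hypothetical sketch of the endpoint-to-filename mapping suggested by the
// Output Structure section; the real buildFilePath/buildContentPath may differ.
function sketchOutputPath(website: string, endpoint: string, ext: 'png' | 'md'): string {
  const host = new URL(website).hostname.replace(/\./g, ''); // "example.com" -> "examplecom"
  const base = endpoint === '/' ? 'home' : endpoint.replace(/\//g, '_'); // "/about" -> "_about"
  return path.join(os.homedir(), 'knowledgeBase', host, `${base}.${ext}`);
}

// sketchOutputPath('https://example.com', '/about', 'png')
//   -> "<home>/knowledgeBase/examplecom/_about.png"
```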
## Configuration
### Timeout
The default timeout for page navigation is 60 seconds. You can change it by setting the timeout property on a ScrapperServices instance:
```typescript
const scraper = new ScrapperServices('https://example.com');
scraper.timeout = 30000; // 30 seconds
```
### Output Location
By default, all scraped data is stored in:
```
~/knowledgeBase/
```
Each website gets its own folder based on its hostname.
## Output Structure
```
~/knowledgeBase/
└── examplecom/
    ├── home.png        # Screenshot of homepage
    ├── home.md         # Extracted content from homepage
    ├── _about.png      # Screenshot of /about page
    ├── _about.md       # Extracted content from /about
    ├── _contact.png    # Screenshot of /contact page
    └── _contact.md     # Extracted content from /contact
```
### Content Format
Each .md file contains:
1. JSON metadata (first line):
- Page title
- Meta description
- Robots directives
- Open Graph tags
- Twitter Card tags
2. Extracted text content (subsequent lines):
- All text from h1-h6, p, and span elements
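Because the metadata and the text live in the same file, a consumer can split them on the first newline. A small sketch (the file path and the exact metadata keys are assumptions; the real keys depend on the scraped page):

```typescript
import { readFileSync } from 'fs';

// Shape is illustrative only; actual JSON keys depend on the page's meta tags.
interface PageMetadata {
  title?: string;
  description?: string;
  robots?: string;
  [key: string]: unknown;
}

// First line: JSON metadata. Remaining lines: extracted text content.
const raw = readFileSync('/path/to/knowledgeBase/examplecom/home.md', 'utf8');
const newlineIndex = raw.indexOf('\n');

const metadata: PageMetadata = JSON.parse(raw.slice(0, newlineIndex));
const text = raw.slice(newlineIndex + 1);

console.log(metadata.title, `(${text.length} characters of text)`);
```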
## Examples
### Basic Scrape
```typescript
import ScrapperServices from '@harshvz/crawler';

const scraper = new ScrapperServices('https://docs.example.com');
await scraper.bfsScrape('/');
```
### Depth-Limited Crawl
```typescript
const scraper = new ScrapperServices('https://blog.example.com', 2);
await scraper.dfsScrape('/');
// Only crawls 2 levels deep from the starting page
```
### Tracking Results
```typescript
const scraper = new ScrapperServices('https://example.com');
const results: string[] = [];
const visited: Record<string, boolean> = {};

await scraper.bfsScrape('/docs', results, visited);
console.log(`Scraped ${results.length} pages`);
```
### Custom Output Directory
```typescript
const scraper = new ScrapperServices(
  'https://example.com',
  0,                     // No depth limit
  '/custom/output/path'  // Custom save location
);
await scraper.bfsScrape('/');
// Files will be saved to /custom/output/path instead of ~/knowledgeBase
```
## Development
### Prerequisites
- Node.js >= 16.x
- npm >= 7.x
### Setup
```bash
# Clone the repository
git clone https://github.com/harshvz/crawler.git

# Navigate to the project directory
cd crawler

# Install dependencies
npm install

# Run in development mode
npm run dev
```
### Project Structure
```
crawler/
├── src/
│   ├── index.ts                # CLI entry point
│   └── Services/
│       └── ScrapperServices.ts # Main scraping logic
├── dist/                       # Compiled JavaScript
├── package.json
├── tsconfig.json
└── README.md
```
### Building
```bash
npm run build
```
This compiles TypeScript files to JavaScript in the dist/ directory.
## Contributing
Contributions are welcome! Please follow these steps:
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)