A powerful web scraping tool built with Playwright that crawls websites using BFS or DFS algorithms, captures screenshots, and extracts content.
## Installation

### Global Installation

```bash
npm install -g @harshvz/crawler
```
Note: Chromium browser will be automatically downloaded during installation (approximately 300MB). This is required for web scraping functionality.
### Local Installation
```bash
npm install @harshvz/crawler
```
Note: The postinstall script will automatically download the Chromium browser.
### Manual Browser Installation
If the automatic installation fails, you can manually install browsers:
```bash
npx playwright install chromium
```
### Install from Source
```bash
git clone https://github.com/harshvz/crawler.git
cd crawler
npm install
npm run build
npm install -g .
```
## Usage
### Basic Usage
Simply run the command and follow the prompts:
```bash
# Primary command (recommended)
crawler

# Alternative (for backward compatibility)
scraper
```
You'll be prompted to enter:
1. URL: The website URL to scrape (e.g., https://example.com)
2. Algorithm: Choose between bfs or dfs (default: bfs)
3. Output Directory: Custom save location (default: ~/knowledgeBase)
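These three prompts correspond directly to the programmatic API documented below. A rough sketch of the equivalent code (the URL and output path here are placeholders; depth is not prompted for by the CLI and defaults to 0, i.e. unlimited):

```typescript
import ScrapperServices from '@harshvz/crawler';

// URL prompt -> first constructor argument; Output Directory prompt -> third argument.
const scraper = new ScrapperServices('https://example.com', 0, '/tmp/knowledgeBase');

// Algorithm prompt -> choose bfsScrape (bfs) or dfsScrape (dfs).
await scraper.bfsScrape('/');
```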
### Command-Line Options
```bash
# Show version
crawler --version
crawler -v

# Show help
crawler --help
crawler -h
```
> Note: Both the `crawler` and `scraper` commands work identically. We recommend using `crawler` for new projects.
### Programmatic Usage
```typescript
import ScrapperServices from '@harshvz/crawler';

const scraper = new ScrapperServices('https://example.com', 2); // depth limit of 2

// Using BFS
await scraper.bfsScrape('/');

// Using DFS
await scraper.dfsScrape('/');
```
## CLI Commands
### Available Scripts
```bash
# Run in development mode with auto-reload
npm run dev

# Build the project
npm run build

# Start the built version (uses crawler command)
npm start
```
## API Documentation
### ScrapperServices
Main class for web scraping operations.
#### Constructor
```typescript
new ScrapperServices(website: string, depth?: number, customPath?: string)
```
Parameters:
- website (string): The base URL of the website to scrape
- depth (number, optional): Maximum depth to crawl (0 = unlimited, default: 0)
- customPath (string, optional): Custom output directory path (default: ~/knowledgeBase)
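For illustration, here is how the three parameters might be combined (the URLs and paths are placeholders):

```typescript
import ScrapperServices from '@harshvz/crawler';

// Defaults: unlimited depth, output under ~/knowledgeBase
const basic = new ScrapperServices('https://example.com');

// Crawl at most 3 levels deep and write output to a custom directory
const custom = new ScrapperServices('https://example.com', 3, '/data/crawls/example');
```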
#### Methods
##### bfsScrape(endpoint?: string, results?: string[], visited?: Record<string, boolean>)
Crawls the website using Breadth-First Search algorithm.
Parameters:
- endpoint (string): Starting path (default: "/")
- results (string[]): Array to collect visited endpoints
- visited (Record<string, boolean>): Object to track visited URLs
##### dfsScrape(endpoint?: string, results?: string[], visited?: Record<string, boolean>)
Crawls the website using Depth-First Search algorithm.
Parameters:
- endpoint (string): Starting path (default: "/")
- results (string[]): Array to collect visited endpoints
- visited (Record<string, boolean>): Object to track visited URLs
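Both methods accept the same optional results and visited collections, so they can be shared across calls. A sketch, assuming endpoints already recorded in `visited` are skipped on subsequent runs (behavior inferred from the parameter descriptions above; the paths are placeholders):

```typescript
import ScrapperServices from '@harshvz/crawler';

const scraper = new ScrapperServices('https://example.com', 2);

const results: string[] = [];
const visited: Record<string, boolean> = {};

// First pass over the docs section
await scraper.bfsScrape('/docs', results, visited);

// Second pass starting from the blog; endpoints already present in `visited`
// should not be fetched again.
await scraper.dfsScrape('/blog', results, visited);

console.log(`Visited ${results.length} unique endpoints`);
```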
##### buildFilePath(endpoint: string): string
Generates a file path for storing screenshots.
##### buildContentPath(endpoint: string): string
Generates a file path for storing extracted content.
##### getLinks(page: Page): Promise<string[]>
Extracts all internal links from the current page.
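The Output Structure section below shows how endpoints end up on disk (`/` becomes `home.*`, `/about` becomes `_about.*`, with one folder per hostname). A hypothetical helper mirroring that naming scheme, for illustration only (not the library's actual implementation):

```typescript
import * as os from 'os';
import * as path from 'path';

// Hypothetical sketch of the endpoint-to-filename mapping suggested by the
// Output Structure section; the real buildFilePath/buildContentPath may differ.
function sketchOutputPath(website: string, endpoint: string, ext: 'png' | 'md'): string {
  const host = new URL(website).hostname.replace(/\./g, ''); // "example.com" -> "examplecom"
  const base = endpoint === '/' ? 'home' : endpoint.replace(/\//g, '_'); // "/about" -> "_about"
  return path.join(os.homedir(), 'knowledgeBase', host, `${base}.${ext}`);
}

// sketchOutputPath('https://example.com', '/about', 'png')
//   -> "<home>/knowledgeBase/examplecom/_about.png"
```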
## Configuration
### Timeout
The default timeout for page navigation is 60 seconds. You can change it by setting the timeout property on a ScrapperServices instance:
```typescript
const scraper = new ScrapperServices('https://example.com');
scraper.timeout = 30000; // 30 seconds
```
### Output Location
By default, all scraped data is stored in:
```
~/knowledgeBase/
```
Each website gets its own folder based on its hostname.
## Output Structure
```
~/knowledgeBase/
└── examplecom/
    ├── home.png        # Screenshot of homepage
    ├── home.md         # Extracted content from homepage
    ├── _about.png      # Screenshot of /about page
    ├── _about.md       # Extracted content from /about
    ├── _contact.png    # Screenshot of /contact page
    └── _contact.md     # Extracted content from /contact
```
### Content Format
Each .md file contains:
1. JSON metadata (first line):
- Page title
- Meta description
- Robots directives
- Open Graph tags
- Twitter Card tags
2. Extracted text content (subsequent lines):
- All text from h1-h6, p, and span elements
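Because the metadata and the text live in the same file, a consumer can split them on the first newline. A small sketch (the file path and the exact metadata keys are assumptions; the real keys depend on the scraped page):

```typescript
import { readFileSync } from 'fs';

// Shape is illustrative only; actual JSON keys depend on the page's meta tags.
interface PageMetadata {
  title?: string;
  description?: string;
  robots?: string;
  [key: string]: unknown;
}

// First line: JSON metadata. Remaining lines: extracted text content.
const raw = readFileSync('/path/to/knowledgeBase/examplecom/home.md', 'utf8');
const newlineIndex = raw.indexOf('\n');

const metadata: PageMetadata = JSON.parse(raw.slice(0, newlineIndex));
const text = raw.slice(newlineIndex + 1);

console.log(metadata.title, `(${text.length} characters of text)`);
```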
## Examples
### Basic Scrape
```typescript
import ScrapperServices from '@harshvz/crawler';

const scraper = new ScrapperServices('https://docs.example.com');
await scraper.bfsScrape('/');
```
### Depth-Limited Crawl
```typescript
const scraper = new ScrapperServices('https://blog.example.com', 2);
await scraper.dfsScrape('/');
// Only crawls 2 levels deep from the starting page
```
### Tracking Results
```typescript
const scraper = new ScrapperServices('https://example.com');
const results: string[] = [];
const visited: Record<string, boolean> = {};

await scraper.bfsScrape('/docs', results, visited);
console.log(`Scraped ${results.length} pages`);
```
### Custom Output Directory
```typescript
const scraper = new ScrapperServices(
  'https://example.com',
  0,                     // No depth limit
  '/custom/output/path'  // Custom save location
);
await scraper.bfsScrape('/');
// Files will be saved to /custom/output/path instead of ~/knowledgeBase
```
## Development
### Prerequisites
- Node.js >= 16.x
- npm >= 7.x
### Setup
```bash
# Clone the repository
git clone https://github.com/harshvz/crawler.git

# Navigate to the project directory
cd crawler

# Install dependencies
npm install

# Run in development mode
npm run dev
```
### Project Structure
```
crawler/
├── src/
│   ├── index.ts                # CLI entry point
│   └── Services/
│       └── ScrapperServices.ts # Main scraping logic
├── dist/                       # Compiled JavaScript
├── package.json
├── tsconfig.json
└── README.md
```
### Building
```bash
npm run build
```
This compiles TypeScript files to JavaScript in the dist/ directory.
## Contributing
Contributions are welcome! Please follow these steps:
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)