n8n-nodes-scraper-web

This is an n8n community node for web scraping using Cheerio and Crawlee.

n8n is a fair-code licensed workflow automation platform.

Features

- Scrape Single Page: Extract data from a single web page using CSS selectors
- Crawl Website: Crawl multiple pages following internal links
- Flexible Extraction: Extract text, HTML, or specific attributes
- Multiple Selectors: Define multiple CSS selectors to extract different data points
- Crawl Control: Control max pages, depth, and link patterns
- Same Domain Filtering: Option to stay within the same domain while crawling

Installation

Follow the installation guide in the n8n community nodes documentation.

npm install n8n-nodes-scraper-web





In n8n, go to Settings > Community Nodes and install:



n8n-nodes-scraper-web





Operations



Scrape Single Page



Extract data from a single web page.



Parameters:

- URL: The URL to scrape

- Extraction Mode: Choose between CSS selectors, full HTML, or text content

- Selectors: Define CSS selectors to extract specific data



Example:



URL: https://example.com

Selector: .title -> Extract text

Result: { title: "Example Title", url: "https://example.com" }





Crawl Website



Crawl multiple pages following internal links.



Parameters:

- Start URLs: Starting URLs for crawling (one per line)

- Max Pages: Maximum number of pages to crawl

- Max Depth: Maximum depth of crawling

- Link Selector: CSS selector for links to follow (default: a[href])

- Pagination Selector: CSS selector specifically for pagination links (e.g., .pagination a, a[aria-label*="next"]). Leave empty to use Link Selector for all links

- Same Domain Only: Only crawl pages on the same domain (default: true)
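
To illustrate what Same Domain Only implies, here is a minimal sketch of a hostname filter. The helper name `filterSameDomain` is hypothetical, and comparing exact hostnames is an assumption; the node may treat subdomains differently.

```typescript
// Hypothetical helper: keep only links whose hostname matches the start URL.
// Assumption: "same domain" means an exact hostname match.
function filterSameDomain(startUrl: string, links: string[]): string[] {
  const startHost = new URL(startUrl).hostname;
  const kept: string[] = [];
  for (const link of links) {
    try {
      // Resolve relative links against the start URL before comparing hosts.
      const resolved = new URL(link, startUrl);
      if (resolved.hostname === startHost) {
        kept.push(resolved.href);
      }
    } catch {
      // Skip anything that cannot be parsed as a URL.
    }
  }
  return kept;
}

const foundLinks = ["/about", "https://example.com/blog", "https://other.org/page"];
console.log(filterSameDomain("https://example.com", foundLinks));
// → ["https://example.com/about", "https://example.com/blog"]
```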



Example:



Start URLs: https://example.com

Max Pages: 50

Max Depth: 2

Link Selector: a[href]

Pagination Selector: .pagination a
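
The interaction between Max Pages and Max Depth can be sketched as a breadth-first traversal over an in-memory link graph. This is an illustration only, not the node's implementation: `crawlOrder`, `LinkGraph`, and the sample site are hypothetical, and counting the start URLs as depth 0 is an assumption.

```typescript
// Hypothetical in-memory site: each URL maps to the links found on that page.
type LinkGraph = Record<string, string[]>;

function crawlOrder(
  graph: LinkGraph,
  startUrls: string[],
  maxPages: number,
  maxDepth: number,
): string[] {
  const visited = new Set<string>();
  const order: string[] = [];
  // Assumption: start URLs sit at depth 0.
  const queue: Array<{ url: string; depth: number }> = startUrls.map((url) => ({ url, depth: 0 }));

  while (queue.length > 0 && order.length < maxPages) {
    const next = queue.shift();
    if (next === undefined || visited.has(next.url)) continue;
    visited.add(next.url);
    order.push(next.url);
    // Pages at the depth limit are visited, but their links are not followed.
    if (next.depth >= maxDepth) continue;
    for (const link of graph[next.url] ?? []) {
      if (!visited.has(link)) queue.push({ url: link, depth: next.depth + 1 });
    }
  }
  return order;
}

const sampleSite: LinkGraph = {
  "/": ["/a", "/b"],
  "/a": ["/a/1"],
  "/b": ["/"],
  "/a/1": ["/a/1/x"],
};
console.log(crawlOrder(sampleSite, ["/"], 50, 2)); // → ["/", "/a", "/b", "/a/1"]
console.log(crawlOrder(sampleSite, ["/"], 2, 2)); // → ["/", "/a"]
```

In this sketch, "/a/1/x" is never visited because it sits at depth 3, and the second call stops early because the page limit is reached first.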





CSS Selectors



The node supports standard CSS selectors:



- Element: div, p, a

- Class: .classname

- ID: #idname

- Attribute: [href], [data-id]

- Combined: div.content > p.text





Extraction Options



For each selector, you can extract:



- Text: The text content of the element

- HTML: The HTML content of the element

- Attribute: A specific attribute value (e.g., href, src)



You can also choose to extract:

- Single: Only the first matching element

- Multiple: All matching elements (returns an array)
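
As a rough illustration of how these options shape the output, the sketch below applies an extraction type and the Single/Multiple choice to a list of already matched elements. `MatchedElement`, `extractValue`, and `extractField` are hypothetical names, and returning null when nothing matches is an assumption rather than documented behavior.

```typescript
type ExtractType = "text" | "html" | "attribute";

// Hypothetical stand-in for a matched element; a real node would get these from an HTML parser.
interface MatchedElement {
  text: string;
  html: string;
  attributes: Record<string, string>;
}

function extractValue(el: MatchedElement, type: ExtractType, attribute?: string): string | null {
  if (type === "text") return el.text.trim();
  if (type === "html") return el.html;
  // Attribute mode: a missing attribute name or value yields null (assumption).
  if (attribute === undefined) return null;
  return el.attributes[attribute] ?? null;
}

function extractField(
  matches: MatchedElement[],
  type: ExtractType,
  multiple: boolean,
  attribute?: string,
): string | null | Array<string | null> {
  // Multiple: one value per matching element, returned as an array.
  if (multiple) return matches.map((el) => extractValue(el, type, attribute));
  // Single: only the first matching element.
  const first = matches[0];
  return first === undefined ? null : extractValue(first, type, attribute);
}

const sampleMatches: MatchedElement[] = [
  { text: " First ", html: "<b>First</b>", attributes: { href: "/one" } },
  { text: "Second", html: "Second", attributes: { href: "/two" } },
];
console.log(extractField(sampleMatches, "text", false)); // → "First"
console.log(extractField(sampleMatches, "attribute", true, "href")); // → ["/one", "/two"]
```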



Advanced Options



- User Agent: Custom user agent string

- Timeout: Request timeout in milliseconds

- Max Retries: Maximum number of retries for failed requests

- Wait For: Wait time before scraping (useful for dynamic content)
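
The Timeout and Max Retries options can be pictured with the following sketch. It is not taken from the node's source: `RequestOptions`, `Fetcher`, and `fetchWithRetry` are hypothetical names, and treating Max Retries as the number of attempts after the first one is an assumption. A fake fetcher is injected so the example runs without network access.

```typescript
// Hypothetical option names; the node's actual parameter keys may differ.
interface RequestOptions {
  timeoutMs: number;
  maxRetries: number;
}

// A fetcher receives the URL and an abort signal it should honor on timeout.
type Fetcher = (url: string, signal: AbortSignal) => Promise<string>;

async function fetchWithRetry(url: string, fetcher: Fetcher, options: RequestOptions): Promise<string> {
  let lastError: unknown = new Error("no attempts made");
  // One initial attempt plus up to maxRetries retries (assumption).
  for (let attempt = 0; attempt <= options.maxRetries; attempt++) {
    const controller = new AbortController();
    // Abort the request if it exceeds the timeout.
    const timer = setTimeout(() => controller.abort(), options.timeoutMs);
    try {
      return await fetcher(url, controller.signal);
    } catch (error) {
      lastError = error;
    } finally {
      clearTimeout(timer);
    }
  }
  throw lastError;
}

// Fake fetcher for illustration: fails twice, then succeeds.
let attempts = 0;
const flakyFetcher: Fetcher = async () => {
  attempts += 1;
  if (attempts < 3) throw new Error("temporary failure");
  return "<html>ok</html>";
};

fetchWithRetry("https://example.com", flakyFetcher, { timeoutMs: 5000, maxRetries: 2 })
  .then((body) => console.log(body, "after", attempts, "attempts"));
```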



Examples






Operation: Scrape Single Page

URL: https://news.example.com

Selectors:

  - Field: titles, Selector: .article-title, Extract: text, Multiple: true

  - Field: links, Selector: .article-link, Extract: attribute (href), Multiple: true




Operation: Crawl Website

Start URLs: https://blog.example.com

Max Pages: 20

Max Depth: 2

Link Selector: a.post-link

Selectors:

  - Field: title, Selector: h1.post-title, Extract: text

  - Field: content, Selector: .post-content, Extract: text

  - Field: author, Selector: .author-name, Extract: text




Operation: Crawl Website

Start URLs: https://blog.example.com

Max Pages: 50

Max Depth: 2

Link Selector: a[href]

Same Domain Only: Yes



Selectors:

  - Field: title, Selector: h1, Extract: text

  - Field: content, Selector: .post-content, Extract: text

  - Field: author, Selector: .author-name, Extract: text




Operation: Crawl Website

Start URLs: https://www.vivareal.com.br/venda/rj/niteroi/bairros/centro/apartamento_residencial/

Max Pages: 100

Max Depth: 1

Pagination Selector: .olx-core-pagination a, a[aria-label*="página"]

Same Domain Only: Yes



Selectors:

  - Field: title, Selector: h2.property-card__title, Extract: text, Multiple: Yes

  - Field: price, Selector: .property-card__price, Extract: text, Multiple: Yes

  - Field: link, Selector: a.property-card__content-link, Extract: attribute (href), Multiple: Yes


Tip for Pagination:
- Use Pagination Selector to target only pagination links (next page, page numbers)
- Set Max Depth: 1 to avoid following links inside individual listings
- Set Max Pages to the number of result pages you want to scrape
- The crawler will automatically follow pagination links and extract data from each page

Dependencies

- Cheerio - Fast, flexible HTML parsing
- Crawlee - Web scraping and browser automation library

Compatibility

- Requires n8n version 1.0.0 or later
- Node.js 18.10 or later

Resources

- n8n community nodes documentation
- Cheerio documentation
- Crawlee documentation

License

MIT
