n8n node for web scraping using Cheerio and Crawlee
```bash
npm install n8n-nodes-scraper-web
```
Community Nodes
In n8n, go to Settings > Community Nodes and install:
```
n8n-nodes-scraper-web
```
Operations
Scrape Single Page
Extract data from a single web page.
Parameters:
- URL: The URL to scrape
- Extraction Mode: Choose between CSS selectors, full HTML, or text content
- Selectors: Define CSS selectors to extract specific data
Example:
```
URL: https://example.com
Selector: .title -> Extract text
Result: { title: "Example Title", url: "https://example.com" }
```
Crawl Website
Crawl multiple pages by following internal links.
Parameters:
- Start URLs: Starting URLs for crawling (one per line)
- Max Pages: Maximum number of pages to crawl
- Max Depth: Maximum number of link hops to follow from a start URL
- Link Selector: CSS selector for links to follow (default: a[href])
- Pagination Selector: CSS selector specifically for pagination links (e.g., .pagination a, a[aria-label*="next"]). Leave empty to use Link Selector for all links
- Same Domain Only: Only crawl pages on the same domain (default: true)
Example:
```
Start URLs: https://example.com
Max Pages: 50
Max Depth: 2
Link Selector: a[href]
Pagination Selector: .pagination a
```
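To make the interaction between Max Pages, Max Depth, and Same Domain Only more concrete, here is a minimal sketch of a breadth-first crawl plan over an in-memory link graph. It is not the node's actual implementation (which relies on Crawlee); `planCrawl`, `LinkGraph`, and `CrawlLimits` are hypothetical names, and the sketch assumes that start URLs count as depth 0 and that "same domain" means an exact hostname match.

```typescript
// Hypothetical in-memory "site": each page URL maps to the links found on it.
type LinkGraph = Record<string, string[]>;

interface CrawlLimits {
  maxPages: number;        // mirrors "Max Pages"
  maxDepth: number;        // mirrors "Max Depth" (assumption: start URLs are depth 0)
  sameDomainOnly: boolean; // mirrors "Same Domain Only" (assumption: exact hostname match)
}

function planCrawl(startUrls: string[], graph: LinkGraph, limits: CrawlLimits): string[] {
  const normalizedStarts = startUrls.map((u) => new URL(u).href);
  const allowedHosts = new Set(normalizedStarts.map((u) => new URL(u).hostname));
  const queue = normalizedStarts.map((url) => ({ url, depth: 0 }));
  const visited = new Set<string>();
  const order: string[] = [];

  while (queue.length > 0 && order.length < limits.maxPages) {
    const { url, depth } = queue.shift()!;
    if (visited.has(url)) continue;
    visited.add(url);
    order.push(url);

    // Pages at the depth limit are still visited, but their links are not followed.
    if (depth >= limits.maxDepth) continue;

    for (const link of graph[url] ?? []) {
      const absolute = new URL(link, url).href; // resolve relative links against the page URL
      const host = new URL(absolute).hostname;
      if (limits.sameDomainOnly && !allowedHosts.has(host)) continue;
      if (!visited.has(absolute)) queue.push({ url: absolute, depth: depth + 1 });
    }
  }
  return order;
}

// Usage with a made-up site structure.
const site: LinkGraph = {
  "https://example.com/": ["/a", "/b", "https://other.example.org/x"],
  "https://example.com/a": ["/c"],
  "https://example.com/b": [],
  "https://example.com/c": ["/d"],
};

const pages = planCrawl(["https://example.com"], site, {
  maxPages: 3,
  maxDepth: 2,
  sameDomainOnly: true,
});
console.log(pages);
// → ["https://example.com/", "https://example.com/a", "https://example.com/b"]
```

In this sketch Max Pages stops the crawl as soon as the page budget is used up, even if deeper links are still queued, while Max Depth only prevents new links from being queued.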
CSS Selectors
The node supports standard CSS selectors:
- Element: div, p, a
- Class: .classname
- ID: #idname
- Attribute: [href], [data-id]
- Combined: div.content > p.text
Extraction Options
For each selector, you can extract:
- Text: The text content of the element
- HTML: The HTML content of the element
- Attribute: A specific attribute value (e.g., href, src)
You can also choose to extract:
- Single: Only the first matching element
- Multiple: All matching elements (returns an array)
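The extraction options above determine the shape of each output field. The following sketch models that, assuming the matched elements have already been found by a selector engine. `MatchedElement`, `FieldRule`, and `shapeField` are hypothetical names, and returning `null` when nothing matches in Single mode is an assumption, not documented behavior of the node.

```typescript
// Hypothetical stand-in for an element already matched by a CSS selector.
interface MatchedElement {
  text: string;
  html: string;
  attributes: Record<string, string>;
}

// Hypothetical per-field rule mirroring the options described above.
interface FieldRule {
  extract: "text" | "html" | "attribute";
  attribute?: string; // only used when extract is "attribute"
  multiple: boolean;  // false = Single, true = Multiple
}

function extractValue(el: MatchedElement, rule: FieldRule): string | null {
  if (rule.extract === "text") return el.text;
  if (rule.extract === "html") return el.html;
  if (rule.attribute === undefined) return null;
  return el.attributes[rule.attribute] ?? null;
}

function shapeField(
  matches: MatchedElement[],
  rule: FieldRule,
): string | null | Array<string | null> {
  const values = matches.map((el) => extractValue(el, rule));
  // Multiple returns every match as an array; Single returns only the first match.
  // Assumption: Single with no match yields null.
  return rule.multiple ? values : (values[0] ?? null);
}

// Usage with two made-up matches.
const matches: MatchedElement[] = [
  { text: "First", html: "<b>First</b>", attributes: { href: "/a" } },
  { text: "Second", html: "<b>Second</b>", attributes: { href: "/b" } },
];

console.log(shapeField(matches, { extract: "text", multiple: false }));
// → "First"
console.log(shapeField(matches, { extract: "attribute", attribute: "href", multiple: true }));
// → ["/a", "/b"]
```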
Advanced Options
- User Agent: Custom user agent string
- Timeout: Request timeout in milliseconds
- Max Retries: Maximum number of retries for failed requests
- Wait For: Wait time before scraping (useful for dynamic content)
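As a rough illustration of how Timeout and Max Retries typically combine, the sketch below wraps an arbitrary asynchronous task. It is not taken from the node's source: `withTimeout`, `withRetries`, and `RetryOptions` are hypothetical names, and the sketch assumes that Max Retries counts additional attempts after the first one.

```typescript
interface RetryOptions {
  timeoutMs: number;  // mirrors "Timeout" (milliseconds)
  maxRetries: number; // mirrors "Max Retries" (assumption: retries after the first attempt)
}

// Reject if the promise does not settle within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  // Always clear the timer so it does not keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Run the task up to maxRetries + 1 times, rethrowing the last error if all attempts fail.
async function withRetries<T>(task: () => Promise<T>, options: RetryOptions): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= options.maxRetries; attempt++) {
    try {
      return await withTimeout(task(), options.timeoutMs);
    } catch (error) {
      lastError = error;
    }
  }
  throw lastError;
}

// Usage: a made-up task that fails twice before succeeding.
let attempts = 0;
const flakyTask = async (): Promise<string> => {
  attempts++;
  if (attempts < 3) throw new Error("temporary failure");
  return "<html>ok</html>";
};

withRetries(flakyTask, { timeoutMs: 1000, maxRetries: 3 }).then((body) => {
  console.log(body, attempts);
});
```

A real request would go inside the task function; the sketch leaves the HTTP call out so that it runs without network access.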
Examples
```
Operation: Scrape Single Page
URL: https://news.example.com
Selectors:
- Field: titles, Selector: .article-title, Extract: text, Multiple: true
- Field: links, Selector: .article-link, Extract: attribute (href), Multiple: true
```
```
Operation: Crawl Website
Start URLs: https://blog.example.com
Max Pages: 20
Max Depth: 2
Link Selector: a.post-link
Selectors:
- Field: title, Selector: h1.post-title, Extract: text
- Field: content, Selector: .post-content, Extract: text
- Field: author, Selector: .author-name, Extract: text
```
```
Operation: Crawl Website
Start URLs: https://blog.example.com
Max Pages: 50
Max Depth: 2
Link Selector: a[href]
Same Domain Only: Yes
Selectors:
- Field: title, Selector: h1, Extract: text
- Field: content, Selector: .post-content, Extract: text
- Field: author, Selector: .author-name, Extract: text
```
```
Operation: Crawl Website
Start URLs: https://www.vivareal.com.br/venda/rj/niteroi/bairros/centro/apartamento_residencial/
Max Pages: 100
Max Depth: 1
Pagination Selector: .olx-core-pagination a, a[aria-label*="página"]
Same Domain Only: Yes
Selectors:
- Field: title, Selector: h2.property-card__title, Extract: text, Multiple: Yes
- Field: price, Selector: .property-card__price, Extract: text, Multiple: Yes
- Field: link, Selector: a.property-card__content-link, Extract: attribute (href), Multiple: Yes
```