# Crawler Node
The Crawler node enables comprehensive web crawling and content extraction using Firecrawl. It can crawl entire websites, extract structured data, take screenshots, and convert web content to various formats for further processing.
## Configuration

### Input Schema

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | Yes | - | Target URL to crawl |
| includePaths | array | No | [] | Path patterns to include in crawling |
| excludePaths | array | No | [] | Path patterns to exclude from crawling |
| limit | number | No | 100 | Maximum number of pages to crawl |
| maxDepth | number | No | 3 | Maximum crawling depth |
| outputFormats | object | No | - | Output format configuration |
| additionalCrawlParams | JsonValue | No | - | Additional Firecrawl parameters as JSON |
### Output Formats Schema

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| formats | array | No | ["markdown"] | Output formats: markdown, html, json, content, links, screenshots |
| maxAge | number | No | 86400000 | Cache max age in milliseconds (24 hours) |
### Configuration Schema

| Field | Type | Required | Description |
|---|---|---|---|
| apiKey | string | No | Custom Firecrawl API key |
### Output Schema

| Field | Type | Description |
|---|---|---|
| output | JsonValue | Extracted content and data from crawl |
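The exact shape of `output` depends on the requested formats. As a purely illustrative sketch (the field names below follow Firecrawl's typical crawl response and are an assumption, not part of this schema):

```yaml
# Illustrative only; actual fields vary with the requested formats
output:
  data:
    - markdown: "# Page Title\n\nPage content as markdown..."
      metadata:
        title: "Page Title"
        sourceURL: "https://example.com/page"
```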
## Basic Usage

### Simple Page Crawl

```yaml
- id: crawl_website
  type: crawler
  input:
    url: "https://example.com"
    limit: 50
    maxDepth: 2
```
### Targeted Content Extraction

```yaml
- id: extract_blog_posts
  type: crawler
  input:
    url: "https://blog.example.com"
    includePaths: ["/posts/", "/articles/"]
    excludePaths: ["/admin/", "/private/"]
    limit: 25
    outputFormats:
      formats: ["markdown", "json"]
      maxAge: 3600000 # 1 hour cache
```
## Advanced Examples

### Documentation Site Crawling

```yaml
- id: crawl_docs
  type: crawler
  input:
    url: "https://docs.example.com"
    includePaths: ["/docs/", "/guides/", "/api/"]
    excludePaths: ["/legacy/", "/deprecated/"]
    limit: 100
    maxDepth: 4
    outputFormats:
      formats: ["markdown", "content"]
      maxAge: 7200000 # 2 hours cache
  configuration:
    apiKey: "{{ secrets.firecrawl_api_key }}"

- id: process_documentation
  type: ai
  input:
    prompt: |
      Process this documentation content:
      {{ crawl_docs.output }}

      Create a structured knowledge base with:
      - Topic categorization
      - Key concepts extraction
      - Cross-references
```
Perfect for building knowledge bases, API documentation analysis, and technical content organization.
### E-commerce Product Extraction

```yaml
- id: crawl_products
  type: crawler
  input:
    url: "https://shop.example.com"
    includePaths: ["/products/", "/categories/"]
    excludePaths: ["/cart/", "/checkout/", "/account/"]
    limit: 200
    outputFormats:
      formats: ["json", "content"]
    additionalCrawlParams: |
      {
        "extractorOptions": {
          "mode": "llm-extraction",
          "extractionPrompt": "Extract product details including name, price, description, and specifications"
        }
      }

- id: analyze_products
  type: ai
  input:
    prompt: |
      Analyze this product data:
      {{ crawl_products.output }}

      Provide insights on:
      - Price ranges and trends
      - Product categories
      - Popular features
```
Ideal for competitive pricing analysis, product catalog management, and market research automation.
### News Monitoring

```yaml
- id: crawl_news_site
  type: crawler
  input:
    url: "https://news.example.com"
    includePaths: ["/articles/", "/news/"]
    excludePaths: ["/ads/", "/sponsored/"]
    limit: 50
    maxDepth: 2
    outputFormats:
      formats: ["markdown", "content", "links"]

- id: extract_key_articles
  type: ai
  input:
    prompt: |
      From this news content:
      {{ crawl_news_site.output }}

      Extract the 10 most important articles with:
      - Headlines
      - Summaries
      - Key topics
      - Publication dates
```
Essential for media monitoring, content curation, and real-time news analysis workflows.
### Website Change Detection

```yaml
- id: monitor_website
  type: crawler
  input:
    url: "{{ target_website }}"
    limit: 20
    outputFormats:
      formats: ["content", "screenshots"]
      maxAge: 3600000 # Check for changes hourly

- id: detect_changes
  type: ai
  input:
    prompt: |
      Compare the current content with the previous version:
      Current: {{ monitor_website.output }}
      Previous: {{ previous_crawl.output }}

      Identify and summarize any significant changes.

- id: notify_changes
  type: condition
  input:
    if: "{{ detect_changes.output.hasChanges }}"
    then: ["send_notification"]
    else: ["log_no_changes"]
```
Critical for website change detection, compliance monitoring, and automated content tracking.
## Output Formats

### Markdown Format
- Clean, structured text representation
- Preserves headers, links, and formatting
- Ideal for AI processing and analysis

### HTML Format
- Raw HTML content
- Preserves original structure and styling
- Useful for visual analysis

### JSON Format
- Structured data extraction
- Programmatically accessible
- Best for data processing

### Content Format
- Plain text content
- Minimal formatting
- Fast processing

### Links Format
- Extracted hyperlinks
- Link relationship mapping
- Navigation structure analysis

### Screenshots Format
- Visual page captures
- Layout verification
- Visual change detection
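Several formats can be requested in a single crawl and consumed downstream. A minimal sketch (the node ids and downstream prompt are illustrative, not prescribed):

```yaml
- id: crawl_with_links
  type: crawler
  input:
    url: "https://example.com"
    limit: 10
    outputFormats:
      formats: ["markdown", "links"] # page text plus the extracted link graph

- id: map_navigation
  type: ai
  input:
    prompt: |
      Using the pages and links below, outline the site's navigation structure:
      {{ crawl_with_links.output }}
```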
## Advanced Configuration

### Custom Extraction Parameters

```yaml
- id: custom_extraction
  type: crawler
  input:
    url: "{{ target_url }}"
    additionalCrawlParams: |
      {
        "pageOptions": {
          "waitFor": 2000,
          "screenshot": true
        },
        "extractorOptions": {
          "mode": "llm-extraction",
          "extractionSchema": {
            "type": "object",
            "properties": {
              "title": {"type": "string"},
              "price": {"type": "number"},
              "description": {"type": "string"}
            }
          }
        }
      }
```
### Rate-Limited Crawling

```yaml
- id: gentle_crawl
  type: crawler
  input:
    url: "{{ website_url }}"
    limit: 10
    maxDepth: 1
  configuration:
    timeoutMs: 60000 # 1 minute timeout

- id: pause_between_crawls
  type: suspend
  input:
    type: "duration"
    duration: "PT5S" # 5 second pause

- id: continue_crawl
  type: crawler
  input:
    url: "{{ next_section_url }}"
```
## Error Handling

```yaml
- id: robust_crawl
  type: crawler
  input:
    url: "{{ target_url }}"
    limit: 100
  configuration:
    errorHandling: "continue"
    retryPolicy:
      maxRetries: 3

- id: validate_crawl_results
  type: condition
  input:
    if: "{{ robust_crawl.output && robust_crawl.output.data }}"
    then: ["process_results"]
    else: ["handle_crawl_failure"]

- id: handle_crawl_failure
  type: ai
  input:
    prompt: "Analyze why crawling failed and suggest alternatives for URL: {{ target_url }}"
```