Crawler Node

Powerful web crawling and scraping using Firecrawl for content extraction and website analysis

The Crawler node enables comprehensive web crawling and content extraction using Firecrawl. It can crawl entire websites, extract structured data, take screenshots, and convert web content to various formats for further processing.

Configuration

Input Schema

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| url | string | Yes | - | Target URL to crawl |
| includePaths | array | No | [] | Path patterns to include in crawling |
| excludePaths | array | No | [] | Path patterns to exclude from crawling |
| limit | number | No | 100 | Maximum number of pages to crawl |
| maxDepth | number | No | 3 | Maximum crawling depth |
| outputFormats | object | No | - | Output format configuration |
| additionalCrawlParams | JsonValue | No | - | Additional Firecrawl parameters as JSON |
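
As a quick orientation, the sketch below exercises every input field in one node. The URL, path patterns, and limit values are placeholders; the examples later on this page show these options in more realistic combinations.

- id: full_input_example
  type: crawler
  input:
    url: "https://example.com"          # placeholder target
    includePaths: ["/docs/"]            # only crawl documentation paths
    excludePaths: ["/blog/"]            # skip blog pages
    limit: 40                           # stop after 40 pages
    maxDepth: 3                         # follow links up to 3 levels deep
    outputFormats:
      formats: ["markdown"]
    additionalCrawlParams: |
      { "pageOptions": { "waitFor": 1000 } }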

Output Formats Schema

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| formats | array | No | ["markdown"] | Output formats: markdown, html, json, content, links, screenshots |
| maxAge | number | No | 86400000 | Cache max age in milliseconds (24 hours) |
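
For example, to request both text and extracted hyperlinks while reusing cached results for up to 30 minutes, an outputFormats block might look like this sketch (values are illustrative):

- id: formats_example
  type: crawler
  input:
    url: "https://example.com"
    outputFormats:
      formats: ["markdown", "links"]   # text plus extracted hyperlinks
      maxAge: 1800000                  # reuse cached results up to 30 minutes old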

Configuration Schema

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| apiKey | string | No | Custom Firecrawl API key |
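
When a custom key is used, it is typically pulled from workflow secrets rather than hard-coded, as in this sketch (the secret name is an example):

- id: crawl_with_custom_key
  type: crawler
  input:
    url: "https://example.com"
  configuration:
    apiKey: "{{ secrets.firecrawl_api_key }}"  # example secret reference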

Output Schema

| Field | Type | Description |
| --- | --- | --- |
| output | JsonValue | Extracted content and data from crawl |
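
Downstream nodes read the crawl results through the node's output field; the exact JSON structure depends on the requested formats and on Firecrawl's response. A minimal sketch:

- id: crawl_site
  type: crawler
  input:
    url: "https://example.com"

- id: summarize_site
  type: ai
  input:
    prompt: |
      Summarize the following crawled content:
      {{ crawl_site.output }}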

Basic Usage

Simple Page Crawl

- id: crawl_website
  type: crawler
  input:
    url: "https://example.com"
    limit: 50
    maxDepth: 2

Targeted Content Extraction

- id: extract_blog_posts
  type: crawler
  input:
    url: "https://blog.example.com"
    includePaths: ["/posts/", "/articles/"]
    excludePaths: ["/admin/", "/private/"]
    limit: 25
    outputFormats:
      formats: ["markdown", "json"]
      maxAge: 3600000  # 1 hour cache

Advanced Examples

- id: crawl_docs
  type: crawler
  input:
    url: "https://docs.example.com"
    includePaths: ["/docs/", "/guides/", "/api/"]
    excludePaths: ["/legacy/", "/deprecated/"]
    limit: 100
    maxDepth: 4
    outputFormats:
      formats: ["markdown", "content"]
      maxAge: 7200000  # 2 hours cache
  configuration:
    apiKey: "{{ secrets.firecrawl_api_key }}"

- id: process_documentation
  type: ai
  input:
    prompt: |
      Process this documentation content:
      {{ crawl_docs.output }}
      
      Create a structured knowledge base with:
      - Topic categorization
      - Key concepts extraction
      - Cross-references

Perfect for building knowledge bases, analyzing API documentation, and organizing technical content.

- id: crawl_products
  type: crawler
  input:
    url: "https://shop.example.com"
    includePaths: ["/products/", "/categories/"]
    excludePaths: ["/cart/", "/checkout/", "/account/"]
    limit: 200
    outputFormats:
      formats: ["json", "content"]
    additionalCrawlParams: |
      {
        "extractorOptions": {
          "mode": "llm-extraction",
          "extractionPrompt": "Extract product details including name, price, description, and specifications"
        }
      }

- id: analyze_products
  type: ai
  input:
    prompt: |
      Analyze this product data:
      {{ crawl_products.output }}
      
      Provide insights on:
      - Price ranges and trends
      - Product categories
      - Popular features

Ideal for competitive pricing analysis, product catalog management, and market research automation.

- id: crawl_news_site
  type: crawler
  input:
    url: "https://news.example.com"
    includePaths: ["/articles/", "/news/"]
    excludePaths: ["/ads/", "/sponsored/"]
    limit: 50
    maxDepth: 2
    outputFormats:
      formats: ["markdown", "content", "links"]

- id: extract_key_articles
  type: ai
  input:
    prompt: |
      From this news content:
      {{ crawl_news_site.output }}
      
      Extract the 10 most important articles with:
      - Headlines
      - Summaries
      - Key topics
      - Publication dates

Essential for media monitoring, content curation, and real-time news analysis workflows.

- id: monitor_website
  type: crawler
  input:
    url: "{{ target_website }}"
    limit: 20
    outputFormats:
      formats: ["content", "screenshots"]
      maxAge: 3600000  # Check for changes hourly

- id: detect_changes
  type: ai
  input:
    prompt: |
      Compare the current content with the previous version:
      Current: {{ monitor_website.output }}
      Previous: {{ previous_crawl.output }}
      
      Identify and summarize any significant changes.

- id: notify_changes
  type: condition
  input:
    if: "{{ detect_changes.output.hasChanges }}"
    then: ["send_notification"]
    else: ["log_no_changes"]

Critical for website change detection, compliance monitoring, and automated content tracking.

Output Formats

Markdown Format

  • Clean, structured text representation
  • Preserves headers, links, and formatting
  • Ideal for AI processing and analysis

HTML Format

  • Raw HTML content
  • Preserves original structure and styling
  • Useful for visual analysis

JSON Format

  • Structured data extraction
  • Programmatically accessible
  • Best for data processing

Content Format

  • Plain text content
  • Minimal formatting
  • Fast processing

Links Format

  • Extracted hyperlinks
  • Link relationship mapping
  • Navigation structure analysis

Screenshots Format

  • Visual page captures
  • Layout verification
  • Visual change detection
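
These formats can be combined in a single crawl. For example, the sketch below requests markdown for AI processing, links for navigation analysis, and screenshots for visual verification (the URL and limit are placeholders):

- id: multi_format_crawl
  type: crawler
  input:
    url: "https://example.com"
    limit: 10
    outputFormats:
      formats: ["markdown", "links", "screenshots"]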

Advanced Configuration

Custom Extraction Parameters

- id: custom_extraction
  type: crawler
  input:
    url: "{{ target_url }}"
    additionalCrawlParams: |
      {
        "pageOptions": {
          "waitFor": 2000,
          "screenshot": true
        },
        "extractorOptions": {
          "mode": "llm-extraction",
          "extractionSchema": {
            "type": "object",
            "properties": {
              "title": {"type": "string"},
              "price": {"type": "number"},
              "description": {"type": "string"}
            }
          }
        }
      }

Rate-Limited Crawling

- id: gentle_crawl
  type: crawler
  input:
    url: "{{ website_url }}"
    limit: 10
    maxDepth: 1
  configuration:
    timeoutMs: 60000  # 1 minute timeout

- id: pause_between_crawls
  type: suspend
  input:
    type: "duration"
    duration: "PT5S"  # 5 second pause

- id: continue_crawl
  type: crawler
  input:
    url: "{{ next_section_url }}"

Error Handling

- id: robust_crawl
  type: crawler
  input:
    url: "{{ target_url }}"
    limit: 100
  configuration:
    errorHandling: "continue"
    retryPolicy:
      maxRetries: 3

- id: validate_crawl_results
  type: condition
  input:
    if: "{{ robust_crawl.output && robust_crawl.output.data }}"
    then: ["process_results"]
    else: ["handle_crawl_failure"]

- id: handle_crawl_failure
  type: ai
  input:
    prompt: "Analyze why crawling failed and suggest alternatives for URL: {{ target_url }}"
