Crawler Node

Powerful web crawling and scraping using Firecrawl for content extraction and website analysis

The Crawler node enables comprehensive web crawling and content extraction using Firecrawl. It can crawl entire websites, extract structured data, take screenshots, and convert web content to various formats for further processing.

Configuration

Input Schema

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| url | string | Yes | - | Target URL to crawl |
| includePaths | array | No | [] | Path patterns to include in crawling |
| excludePaths | array | No | [] | Path patterns to exclude from crawling |
| limit | number | No | 100 | Maximum number of pages to crawl |
| maxDepth | number | No | 3 | Maximum crawling depth |
| outputFormats | object | No | - | Output format configuration |
| additionalCrawlParams | JsonValue | No | - | Additional Firecrawl parameters as JSON |
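
As a quick orientation, the sketch below exercises every input field in one node. The URL, path patterns, and limit values are placeholders; the examples later on this page show these options in more realistic combinations.

- id: full_input_example
  type: crawler
  input:
    url: "https://example.com"          # placeholder target
    includePaths: ["/docs/"]            # only crawl documentation paths
    excludePaths: ["/blog/"]            # skip blog pages
    limit: 40                           # stop after 40 pages
    maxDepth: 3                         # follow links up to 3 levels deep
    outputFormats:
      formats: ["markdown"]
    additionalCrawlParams: |
      { "pageOptions": { "waitFor": 1000 } }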

Output Formats Schema

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| formats | array | No | ["markdown"] | Output formats: markdown, html, json, content, links, screenshots |
| maxAge | number | No | 86400000 | Cache max age in milliseconds (24 hours) |
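
For example, to request both text and extracted hyperlinks while reusing cached results for up to 30 minutes, an outputFormats block might look like this sketch (values are illustrative):

- id: formats_example
  type: crawler
  input:
    url: "https://example.com"
    outputFormats:
      formats: ["markdown", "links"]   # text plus extracted hyperlinks
      maxAge: 1800000                  # reuse cached results up to 30 minutes old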

Configuration Schema

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| apiKey | string | No | Custom Firecrawl API key |
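
When a custom key is used, it is typically pulled from workflow secrets rather than hard-coded, as in this sketch (the secret name is an example):

- id: crawl_with_custom_key
  type: crawler
  input:
    url: "https://example.com"
  configuration:
    apiKey: "{{ secrets.firecrawl_api_key }}"  # example secret reference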

Output Schema

| Field | Type | Description |
| --- | --- | --- |
| output | JsonValue | Extracted content and data from crawl |
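
Downstream nodes read the crawl results through the node's output field; the exact JSON structure depends on the requested formats and on Firecrawl's response. A minimal sketch:

- id: crawl_site
  type: crawler
  input:
    url: "https://example.com"

- id: summarize_site
  type: ai
  input:
    prompt: |
      Summarize the following crawled content:
      {{ crawl_site.output }}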

Basic Usage

Simple Page Crawl

- id: crawl_website
  type: crawler
  input:
    url: "https://example.com"
    limit: 50
    maxDepth: 2

Targeted Content Extraction

- id: extract_blog_posts
  type: crawler
  input:
    url: "https://blog.example.com"
    includePaths: ["/posts/", "/articles/"]
    excludePaths: ["/admin/", "/private/"]
    limit: 25
    outputFormats:
      formats: ["markdown", "json"]
      maxAge: 3600000  # 1 hour cache

Advanced Examples

- id: crawl_docs
  type: crawler
  input:
    url: "https://docs.example.com"
    includePaths: ["/docs/", "/guides/", "/api/"]
    excludePaths: ["/legacy/", "/deprecated/"]
    limit: 100
    maxDepth: 4
    outputFormats:
      formats: ["markdown", "content"]
      maxAge: 7200000  # 2 hours cache
  configuration:
    apiKey: "{{ secrets.firecrawl_api_key }}"

- id: process_documentation
  type: ai
  input:
    prompt: |
      Process this documentation content:
      {{ crawl_docs.output }}
      
      Create a structured knowledge base with:
      - Topic categorization
      - Key concepts extraction
      - Cross-references

Perfect for building knowledge bases, analyzing API documentation, and organizing technical content.

- id: crawl_products
  type: crawler
  input:
    url: "https://shop.example.com"
    includePaths: ["/products/", "/categories/"]
    excludePaths: ["/cart/", "/checkout/", "/account/"]
    limit: 200
    outputFormats:
      formats: ["json", "content"]
    additionalCrawlParams: |
      {
        "extractorOptions": {
          "mode": "llm-extraction",
          "extractionPrompt": "Extract product details including name, price, description, and specifications"
        }
      }

- id: analyze_products
  type: ai
  input:
    prompt: |
      Analyze this product data:
      {{ crawl_products.output }}
      
      Provide insights on:
      - Price ranges and trends
      - Product categories
      - Popular features

Ideal for competitive pricing analysis, product catalog management, and market research automation.

- id: crawl_news_site
  type: crawler
  input:
    url: "https://news.example.com"
    includePaths: ["/articles/", "/news/"]
    excludePaths: ["/ads/", "/sponsored/"]
    limit: 50
    maxDepth: 2
    outputFormats:
      formats: ["markdown", "content", "links"]

- id: extract_key_articles
  type: ai
  input:
    prompt: |
      From this news content:
      {{ crawl_news_site.output }}
      
      Extract the 10 most important articles with:
      - Headlines
      - Summaries
      - Key topics
      - Publication dates

Essential for media monitoring, content curation, and real-time news analysis workflows.

- id: monitor_website
  type: crawler
  input:
    url: "{{ target_website }}"
    limit: 20
    outputFormats:
      formats: ["content", "screenshots"]
      maxAge: 3600000  # Check for changes hourly

- id: detect_changes
  type: ai
  input:
    prompt: |
      Compare the current content with the previous version:
      Current: {{ monitor_website.output }}
      Previous: {{ previous_crawl.output }}
      
      Identify and summarize any significant changes.

- id: notify_changes
  type: condition
  input:
    if: "{{ detect_changes.output.hasChanges }}"
    then: ["send_notification"]
    else: ["log_no_changes"]

Critical for website change detection, compliance monitoring, and automated content tracking.

Output Formats

Markdown Format

  • Clean, structured text representation
  • Preserves headers, links, and formatting
  • Ideal for AI processing and analysis

HTML Format

  • Raw HTML content
  • Preserves original structure and styling
  • Useful for visual analysis

JSON Format

  • Structured data extraction
  • Programmatically accessible
  • Best for data processing

Content Format

  • Plain text content
  • Minimal formatting
  • Fast processing

Links Format

  • Extracted hyperlinks
  • Link relationship mapping
  • Navigation structure analysis

Screenshots Format

  • Visual page captures
  • Layout verification
  • Visual change detection
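
These formats can be combined in a single crawl. For example, the sketch below requests markdown for AI processing, links for navigation analysis, and screenshots for visual verification (the URL and limit are placeholders):

- id: multi_format_crawl
  type: crawler
  input:
    url: "https://example.com"
    limit: 10
    outputFormats:
      formats: ["markdown", "links", "screenshots"]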

Advanced Configuration

Custom Extraction Parameters

- id: custom_extraction
  type: crawler
  input:
    url: "{{ target_url }}"
    additionalCrawlParams: |
      {
        "pageOptions": {
          "waitFor": 2000,
          "screenshot": true
        },
        "extractorOptions": {
          "mode": "llm-extraction",
          "extractionSchema": {
            "type": "object",
            "properties": {
              "title": {"type": "string"},
              "price": {"type": "number"},
              "description": {"type": "string"}
            }
          }
        }
      }

Rate-Limited Crawling

- id: gentle_crawl
  type: crawler
  input:
    url: "{{ website_url }}"
    limit: 10
    maxDepth: 1
  configuration:
    timeoutMs: 60000  # 1 minute timeout

- id: pause_between_crawls
  type: suspend
  input:
    type: "duration"
    duration: "PT5S"  # 5 second pause

- id: continue_crawl
  type: crawler
  input:
    url: "{{ next_section_url }}"

Error Handling

- id: robust_crawl
  type: crawler
  input:
    url: "{{ target_url }}"
    limit: 100
  configuration:
    errorHandling: "continue"
    retryPolicy:
      maxRetries: 3

- id: validate_crawl_results
  type: condition
  input:
    if: "{{ robust_crawl.output && robust_crawl.output.data }}"
    then: ["process_results"]
    else: ["handle_crawl_failure"]

- id: handle_crawl_failure
  type: ai
  input:
    prompt: "Analyze why crawling failed and suggest alternatives for URL: {{ target_url }}"
