

    MCP Server for Crawl4AI

    Note: Tested with Crawl4AI version 0.7.4


    TypeScript implementation of an MCP server for Crawl4AI. Provides tools for web crawling, content extraction, and browser automation.

    Table of Contents

    • Prerequisites
    • Quick Start
    • Configuration
    • Client-Specific Instructions
    • Available Tools
    • 1. get_markdown
    • 2. capture_screenshot
    • 3. generate_pdf
    • 4. execute_js
    • 5. batch_crawl
    • 6. smart_crawl
    • 7. get_html
    • 8. extract_links
    • 9. crawl_recursive
    • 10. parse_sitemap
    • 11. crawl
    • 12. manage_session
    • 13. extract_with_llm
• Advanced Configuration
• Changelog
• Development
• License

    Prerequisites

    • Node.js 18+ and npm
    • A running Crawl4AI server

    Quick Start

1. Start the Crawl4AI server (for example, via a local Docker container)

    bash
    docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.7.4

    2. Add to your MCP client

    This MCP server works with any MCP-compatible client (Claude Desktop, Claude Code, Cursor, LMStudio, etc.).

    Using npx (Recommended)

    json
    {
      "mcpServers": {
        "crawl4ai": {
          "command": "npx",
          "args": ["mcp-crawl4ai-ts"],
          "env": {
            "CRAWL4AI_BASE_URL": "http://localhost:11235"
          }
        }
      }
    }

    Using local installation

    json
    {
      "mcpServers": {
        "crawl4ai": {
          "command": "node",
          "args": ["/path/to/mcp-crawl4ai-ts/dist/index.js"],
          "env": {
            "CRAWL4AI_BASE_URL": "http://localhost:11235"
          }
        }
      }
    }

    With all optional variables

    json
    {
      "mcpServers": {
        "crawl4ai": {
          "command": "npx",
          "args": ["mcp-crawl4ai-ts"],
          "env": {
            "CRAWL4AI_BASE_URL": "http://localhost:11235",
            "CRAWL4AI_API_KEY": "your-api-key",
            "SERVER_NAME": "custom-name",
            "SERVER_VERSION": "1.0.0"
          }
        }
      }
    }

    Configuration

    Environment Variables

    env
    # Required
    CRAWL4AI_BASE_URL=http://localhost:11235
    
    # Optional - Server Configuration
    CRAWL4AI_API_KEY=          # If your server requires auth
    SERVER_NAME=crawl4ai-mcp   # Custom name for the MCP server
    SERVER_VERSION=1.0.0       # Custom version

    Client-Specific Instructions

    Claude Desktop

Add the JSON configuration above to ~/Library/Application Support/Claude/claude_desktop_config.json (macOS).

    Claude Code

    bash
    claude mcp add crawl4ai -e CRAWL4AI_BASE_URL=http://localhost:11235 -- npx mcp-crawl4ai-ts

    Other MCP Clients

    Consult your client's documentation for MCP server configuration. The key details:

    • Command: npx mcp-crawl4ai-ts or node /path/to/dist/index.js
    • Required env: CRAWL4AI_BASE_URL
    • Optional env: CRAWL4AI_API_KEY, SERVER_NAME, SERVER_VERSION

    Available Tools

    1. get_markdown - Extract content as markdown with filtering

    typescript
    { 
      url: string,                              // Required: URL to extract markdown from
      filter?: 'raw'|'fit'|'bm25'|'llm',       // Filter type (default: 'fit')
      query?: string,                           // Query for bm25/llm filters
      cache?: string                            // Cache-bust parameter (default: '0')
    }

    Extracts content as markdown with various filtering options. Use 'bm25' or 'llm' filters with a query for specific content extraction.
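
A minimal call sketch (URL and query are illustrative):

typescript
// Pull only content matching the query, ranked with BM25
{ url: 'https://example.com/pricing', filter: 'bm25', query: 'subscription pricing tiers' }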

    2. capture_screenshot - Capture webpage screenshot

    typescript
    { 
      url: string,                   // Required: URL to capture
      screenshot_wait_for?: number   // Seconds to wait before screenshot (default: 2)
    }

    Returns base64-encoded PNG. Note: This is stateless - for screenshots after JS execution, use crawl with screenshot: true.
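
A minimal call sketch (URL and wait time are illustrative):

typescript
// Give a slow-rendering page extra time before the capture
{ url: 'https://example.com/gallery', screenshot_wait_for: 5 }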

    3. generate_pdf - Convert webpage to PDF

    typescript
    { 
      url: string  // Required: URL to convert to PDF
    }

    Returns base64-encoded PDF. Stateless tool - for PDFs after JS execution, use crawl with pdf: true.

    4. execute_js - Execute JavaScript and get return values

    typescript
    { 
      url: string,                    // Required: URL to load
      scripts: string | string[]      // Required: JavaScript to execute
    }

    Executes JavaScript and returns results. Each script can use 'return' to get values back. Stateless - for persistent JS execution use crawl with js_code.
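
A minimal call sketch (URL and scripts are illustrative):

typescript
// Run two scripts in one page load; each 'return' value comes back in order
{
  url: 'https://example.com',
  scripts: [
    'return document.title',
    'return document.querySelectorAll("a").length'
  ]
}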

    5. batch_crawl - Crawl multiple URLs concurrently

    typescript
    { 
      urls: string[],           // Required: List of URLs to crawl
      max_concurrent?: number,  // Parallel request limit (default: 5)
      remove_images?: boolean,  // Remove images from output (default: false)
      bypass_cache?: boolean,   // Bypass cache for all URLs (default: false)
  configs?: Array<object>   // Optional: per-URL parameter overrides
    }

    Efficiently crawls multiple URLs in parallel. Each URL gets a fresh browser instance. With configs array, you can specify different parameters for each URL.
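
A minimal call sketch (URLs are illustrative):

typescript
// Crawl three pages, at most two at a time, skipping the cache
{
  urls: ['https://example.com/a', 'https://example.com/b', 'https://example.com/c'],
  max_concurrent: 2,
  bypass_cache: true
}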

    6. smart_crawl - Auto-detect and handle different content types

    typescript
    { 
      url: string,            // Required: URL to crawl
      max_depth?: number,     // Maximum depth for recursive crawling (default: 2)
      follow_links?: boolean, // Follow links in content (default: true)
      bypass_cache?: boolean  // Bypass cache (default: false)
    }

    Intelligently detects content type (HTML/sitemap/RSS) and processes accordingly.
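
A minimal call sketch (URL is illustrative):

typescript
// Pointed at a sitemap, smart_crawl detects the content type
// and crawls the listed pages instead of treating it as HTML
{ url: 'https://example.com/sitemap.xml', max_depth: 1, follow_links: true }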

    7. get_html - Get sanitized HTML for analysis

    typescript
    { 
      url: string  // Required: URL to extract HTML from
    }

    Returns preprocessed HTML optimized for structure analysis. Use for building schemas or analyzing patterns.

    8. extract_links - Extract and categorize page links

    typescript
    { 
      url: string,          // Required: URL to extract links from
      categorize?: boolean  // Group by type (default: true)
    }

    Extracts all links and groups them by type: internal, external, social media, documents, images.

    9. crawl_recursive - Deep crawl website following links

    typescript
    { 
      url: string,              // Required: Starting URL
      max_depth?: number,       // Maximum depth to crawl (default: 3)
      max_pages?: number,       // Maximum pages to crawl (default: 50)
      include_pattern?: string, // Regex pattern for URLs to include
      exclude_pattern?: string  // Regex pattern for URLs to exclude
    }

Crawls a website following internal links up to the specified depth. Returns content from all discovered pages.
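
A minimal call sketch (URL and regex patterns are illustrative):

typescript
// Stay within the docs section and skip archived pages
{
  url: 'https://example.com/docs',
  max_depth: 2,
  max_pages: 20,
  include_pattern: '.*/docs/.*',
  exclude_pattern: '.*/archive/.*'
}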

    10. parse_sitemap - Extract URLs from XML sitemaps

    typescript
    { 
      url: string,              // Required: Sitemap URL (e.g., /sitemap.xml)
      filter_pattern?: string   // Optional: Regex pattern to filter URLs
    }

    Extracts all URLs from XML sitemaps. Supports regex filtering for specific URL patterns.
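
A minimal call sketch (URL and pattern are illustrative):

typescript
// Keep only blog URLs from the sitemap
{ url: 'https://example.com/sitemap.xml', filter_pattern: '.*/blog/.*' }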

    11. crawl - Advanced web crawling with full configuration

    typescript
    {
      url: string,                              // URL to crawl
      // Browser Configuration
      browser_type?: 'chromium'|'firefox'|'webkit'|'undetected',  // Browser engine (undetected = stealth mode)
      viewport_width?: number,                  // Browser width (default: 1080)
      viewport_height?: number,                 // Browser height (default: 600)
      user_agent?: string,                      // Custom user agent
      proxy_server?: string | {                 // Proxy URL (string or object format)
        server: string,
        username?: string,
        password?: string
      },
      proxy_username?: string,                  // Proxy auth (if using string format)
      proxy_password?: string,                  // Proxy password (if using string format)
  cookies?: Array<object>,                  // Pre-set cookies
  headers?: Record<string, string>,         // Custom headers
      
      // Crawler Configuration
      word_count_threshold?: number,            // Min words per block (default: 200)
      excluded_tags?: string[],                 // HTML tags to exclude
      remove_overlay_elements?: boolean,        // Remove popups/modals
      js_code?: string | string[],              // JavaScript to execute
      wait_for?: string,                        // Wait condition (selector or JS)
      wait_for_timeout?: number,                // Wait timeout (default: 30000)
      delay_before_scroll?: number,             // Pre-scroll delay
      scroll_delay?: number,                    // Between-scroll delay
      process_iframes?: boolean,                // Include iframe content
      exclude_external_links?: boolean,         // Remove external links
      screenshot?: boolean,                     // Capture screenshot
      pdf?: boolean,                           // Generate PDF
      session_id?: string,                      // Reuse browser session (only works with crawl tool)
      cache_mode?: 'ENABLED'|'BYPASS'|'DISABLED',  // Cache control
      
      // New in v3.0.0 (Crawl4AI 0.7.3/0.7.4)
      css_selector?: string,                    // CSS selector to filter content
      delay_before_return_html?: number,        // Delay in seconds before returning HTML
      include_links?: boolean,                  // Include extracted links in response
      resolve_absolute_urls?: boolean,          // Convert relative URLs to absolute
      
      // LLM Extraction (REST API only supports 'llm' type)
      extraction_type?: 'llm',                  // Only 'llm' extraction is supported via REST API
      extraction_schema?: object,               // Schema for structured extraction
      extraction_instruction?: string,          // Natural language extraction prompt
      extraction_strategy?: {                   // Advanced extraction configuration
        provider?: string,
        api_key?: string,
        model?: string,
        [key: string]: any
      },
      table_extraction_strategy?: {             // Table extraction configuration
        enable_chunking?: boolean,
        thresholds?: object,
        [key: string]: any
      },
      markdown_generator_options?: {            // Markdown generation options
        include_links?: boolean,
        preserve_formatting?: boolean,
        [key: string]: any
      },
      
      timeout?: number,                         // Overall timeout (default: 60000)
      verbose?: boolean                         // Detailed logging
    }
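
A sketch combining session reuse, JS execution, and media capture (URL, selectors, and session name are illustrative):

typescript
// Reuse a named session, click a button, wait for content, then screenshot
{
  url: 'https://example.com/dashboard',
  session_id: 'my-session',
  js_code: 'document.querySelector("#load-more")?.click()',
  wait_for: '#content-loaded',
  screenshot: true,
  cache_mode: 'BYPASS'
}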

    12. manage_session - Unified session management

    typescript
    { 
      action: 'create' | 'clear' | 'list',    // Required: Action to perform
      session_id?: string,                    // For 'create' and 'clear' actions
      initial_url?: string,                   // For 'create' action: URL to load
      browser_type?: 'chromium' | 'firefox' | 'webkit' | 'undetected'  // For 'create' action
    }

    Unified tool for managing browser sessions. Supports three actions:

    • create: Start a persistent browser session
    • clear: Remove a session from local tracking
    • list: Show all active sessions

    Examples:

    typescript
    // Create a new session
    { action: 'create', session_id: 'my-session', initial_url: 'https://example.com' }
    
    // Clear a session
    { action: 'clear', session_id: 'my-session' }
    
    // List all sessions
    { action: 'list' }

    13. extract_with_llm - Extract structured data using AI

    typescript
    { 
      url: string,          // URL to extract data from
      query: string         // Natural language extraction instructions
    }

    Uses AI to extract structured data from webpages. Returns results immediately without any polling or job management. This is the recommended way to extract specific information since CSS/XPath extraction is not supported via the REST API.
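
A minimal call sketch (URL and query are illustrative):

typescript
// Describe the fields you want in plain language
{
  url: 'https://example.com/products/widget',
  query: 'Extract the product name, price, and availability as JSON'
}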

    Advanced Configuration

    For detailed information about all available configuration options, extraction strategies, and advanced features, please refer to the official Crawl4AI documentation:

    • Crawl4AI Documentation
    • Crawl4AI GitHub Repository

    Changelog

    See CHANGELOG.md for detailed version history and recent updates.

    Development

    Setup

    bash
    # 1. Start the Crawl4AI server
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.7.4  # pin to the tested version
    
    # 2. Install MCP server
    git clone https://github.com/omgwtfwow/mcp-crawl4ai-ts.git
    cd mcp-crawl4ai-ts
    npm install
    cp .env.example .env
    
    # 3. Development commands
    npm run dev    # Development mode
    npm test       # Run tests
    npm run lint   # Check code quality
    npm run build  # Production build
    
    # 4. Add to your MCP client (See "Using local installation")

    Running Integration Tests

    Integration tests require a running Crawl4AI server. Configure your environment:

    bash
    # Required for integration tests
    export CRAWL4AI_BASE_URL=http://localhost:11235
    export CRAWL4AI_API_KEY=your-api-key  # If authentication is required
    
    # Optional: For LLM extraction tests
    export LLM_PROVIDER=openai/gpt-4o-mini
    export LLM_API_TOKEN=your-llm-api-key
    export LLM_BASE_URL=https://api.openai.com/v1  # If using custom endpoint
    
    # Run integration tests (ALWAYS use the npm script; don't call `jest` directly)
    npm run test:integration
    
    # Run a single integration test file
    npm run test:integration -- src/__tests__/integration/extract-links.integration.test.ts
    
    > IMPORTANT: Do NOT run `npx jest` directly for integration tests. The npm script injects `NODE_OPTIONS=--experimental-vm-modules` which is required for ESM + ts-jest. Running Jest directly will produce `SyntaxError: Cannot use import statement outside a module` and hang.

    Integration tests cover:

    • Dynamic content and JavaScript execution
    • Session management and cookies
    • Content extraction (LLM-based only)
    • Media handling (screenshots, PDFs)
    • Performance and caching
    • Content filtering
    • Bot detection avoidance
    • Error handling

    Integration Test Checklist

    1. Docker container healthy:

    bash
docker ps --filter name=crawl4ai --format '{{.Names}} {{.Status}}'
curl -sf http://localhost:11235/health || echo "Health check failed"

    2. Env vars loaded (either exported or in .env): CRAWL4AI_BASE_URL (required), optional: CRAWL4AI_API_KEY, LLM_PROVIDER, LLM_API_TOKEN, LLM_BASE_URL.

    3. Use npm run test:integration (never raw jest).

    4. To target one file add it after -- (see example above).

5. Expect a total runtime of ~2–3 minutes; a much longer run or an immediate hang usually means NODE_OPTIONS is missing or the Jest version is wrong.

    Troubleshooting

| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| SyntaxError: Cannot use import statement outside a module | Ran jest directly without script flags | Re-run with npm run test:integration |
| Hangs on first test (RUNS ...) | Missing experimental VM modules flag | Use the npm script / ensure NODE_OPTIONS=--experimental-vm-modules |
| Network timeouts | Crawl4AI container not healthy / DNS blocked | Restart container: docker restart |
| LLM tests skipped | Missing LLM_PROVIDER or LLM_API_TOKEN | Export the required LLM vars |
| New Jest major upgrade breaks tests | Version mismatch with ts-jest | Keep Jest 29.x unless ts-jest is upgraded accordingly |

    Version Compatibility Note

    Current stack: jest@29.x + ts-jest@29.x + ESM ("type": "module"). Updating Jest to 30+ requires upgrading ts-jest and revisiting jest.config.cjs. Keep versions aligned to avoid parse errors.

    License

    MIT
