MCP Server for Crawl4AI

Note: Tested with Crawl4AI version 0.7.4

TypeScript implementation of an MCP server for Crawl4AI. Provides tools for web crawling, content extraction, and browser automation.

Prerequisites

Node.js 18+ and npm
A running Crawl4AI server

Quick Start

1. Start the Crawl4AI server (for example, local docker)

bash

docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.7.4

2. Add to your MCP client

This MCP server works with any MCP-compatible client (Claude Desktop, Claude Code, Cursor, LMStudio, etc.).

Using npx (Recommended)

json

{
  "mcpServers": {
    "crawl4ai": {
      "command": "npx",
      "args": ["mcp-crawl4ai-ts"],
      "env": {
        "CRAWL4AI_BASE_URL": "http://localhost:11235"
      }
    }
  }
}

Using local installation

json

{
  "mcpServers": {
    "crawl4ai": {
      "command": "node",
      "args": ["/path/to/mcp-crawl4ai-ts/dist/index.js"],
      "env": {
        "CRAWL4AI_BASE_URL": "http://localhost:11235"
      }
    }
  }
}

With all optional variables

json

{
  "mcpServers": {
    "crawl4ai": {
      "command": "npx",
      "args": ["mcp-crawl4ai-ts"],
      "env": {
        "CRAWL4AI_BASE_URL": "http://localhost:11235",
        "CRAWL4AI_API_KEY": "your-api-key",
        "SERVER_NAME": "custom-name",
        "SERVER_VERSION": "1.0.0"
      }
    }
  }
}

Configuration

Environment Variables

env

# Required
CRAWL4AI_BASE_URL=http://localhost:11235

# Optional - Server Configuration
CRAWL4AI_API_KEY=          # If your server requires auth
SERVER_NAME=crawl4ai-mcp   # Custom name for the MCP server
SERVER_VERSION=1.0.0       # Custom version
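
To make the required/optional split concrete, here is a minimal sketch of reading these variables. The variable names and defaults come from the listing above; `loadConfig` and `ServerConfig` are hypothetical names used for illustration and are not taken from the server's source.

```typescript
// Sketch only: loadConfig is a hypothetical helper, not the server's actual code.
interface ServerConfig {
  baseUrl: string;
  apiKey: string | undefined;
  serverName: string;
  serverVersion: string;
}

function loadConfig(env: Record<string, string | undefined>): ServerConfig {
  const baseUrl = env.CRAWL4AI_BASE_URL?.trim();
  if (!baseUrl) {
    // The only required variable: fail early with a clear message.
    throw new Error("CRAWL4AI_BASE_URL is required");
  }
  return {
    baseUrl: baseUrl.replace(/\/+$/, ""),        // drop trailing slashes so paths join cleanly
    apiKey: env.CRAWL4AI_API_KEY || undefined,   // an empty value is treated as "no auth"
    serverName: env.SERVER_NAME || "crawl4ai-mcp",
    serverVersion: env.SERVER_VERSION || "1.0.0",
  };
}

// In Node.js this would typically be called as loadConfig(process.env).
const cfg = loadConfig({ CRAWL4AI_BASE_URL: "http://localhost:11235/" });
```

Passing the environment in as a plain object keeps the function easy to test without touching the real process environment.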

Client-Specific Instructions

Claude Desktop

Add the server entry to ~/Library/Application Support/Claude/claude_desktop_config.json (the macOS location).

Claude Code

bash

claude mcp add crawl4ai -e CRAWL4AI_BASE_URL=http://localhost:11235 -- npx mcp-crawl4ai-ts

Other MCP Clients

Consult your client's documentation for MCP server configuration. The key details:

Command: npx mcp-crawl4ai-ts or node /path/to/dist/index.js
Required env: CRAWL4AI_BASE_URL
Optional env: CRAWL4AI_API_KEY, SERVER_NAME, SERVER_VERSION

Available Tools

1. `get_markdown` - Extract content as markdown with filtering

typescript

{ 
  url: string,                              // Required: URL to extract markdown from
  filter?: 'raw'|'fit'|'bm25'|'llm',       // Filter type (default: 'fit')
  query?: string,                           // Query for bm25/llm filters
  cache?: string                            // Cache-bust parameter (default: '0')
}

Extracts content as markdown with various filtering options. Use 'bm25' or 'llm' filters with a query for specific content extraction.
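
As an illustration of pairing a filter with a query, the sketch below checks arguments on the client side before calling the tool. The parameter names match the schema above; `checkGetMarkdownArgs` is a hypothetical helper, and whether the server itself rejects a missing query is an assumption, not something this README states.

```typescript
type MarkdownFilter = "raw" | "fit" | "bm25" | "llm";

interface GetMarkdownArgs {
  url: string;
  filter?: MarkdownFilter;
  query?: string;
  cache?: string;
}

// Hypothetical client-side check: bm25 and llm filters are meant to be used with a query.
function checkGetMarkdownArgs(args: GetMarkdownArgs): string[] {
  const problems: string[] = [];
  const filter = args.filter ?? "fit"; // documented default
  if ((filter === "bm25" || filter === "llm") && !args.query?.trim()) {
    problems.push(`filter '${filter}' should be used with a query`);
  }
  return problems;
}

const withQuery = checkGetMarkdownArgs({
  url: "https://example.com/docs",
  filter: "bm25",
  query: "installation steps",
});
const withoutQuery = checkGetMarkdownArgs({ url: "https://example.com/docs", filter: "llm" });
```

Here `withQuery` is an empty list and `withoutQuery` contains one message.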

2. `capture_screenshot` - Capture webpage screenshot

typescript

{ 
  url: string,                   // Required: URL to capture
  screenshot_wait_for?: number   // Seconds to wait before screenshot (default: 2)
}

Returns base64-encoded PNG. Note: This is stateless - for screenshots after JS execution, use crawl with screenshot: true.

3. `generate_pdf` - Convert webpage to PDF

typescript

{ 
  url: string  // Required: URL to convert to PDF
}

Returns base64-encoded PDF. Stateless tool - for PDFs after JS execution, use crawl with pdf: true.
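
Since both `capture_screenshot` and `generate_pdf` return base64 strings, a client may want a quick sanity check before writing the payload to disk. The sketch below relies only on the standard file signatures; `sniffBase64` is a hypothetical helper, not part of this server.

```typescript
// Identify a base64 payload by its leading characters.
// "iVBORw0KGgo" is how the 8-byte PNG signature begins in base64;
// "JVBERi" is how the "%PDF-" header begins in base64.
function sniffBase64(data: string): "png" | "pdf" | "unknown" {
  if (data.indexOf("iVBORw0KGgo") === 0) return "png";
  if (data.indexOf("JVBERi") === 0) return "pdf";
  return "unknown";
}
```

In Node.js, a payload that passes this check can then be decoded with `Buffer.from(data, "base64")` and written to a file.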

4. `execute_js` - Execute JavaScript and get return values

typescript

{ 
  url: string,                    // Required: URL to load
  scripts: string | string[]      // Required: JavaScript to execute
}

Executes JavaScript and returns results. Each script can use 'return' to get values back. Stateless - for persistent JS execution use crawl with js_code.
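
Because `scripts` accepts either a single string or a list, client code often normalises it first. A minimal sketch follows; `toScriptList` is a hypothetical helper and the example scripts are illustrative only.

```typescript
// scripts accepts one string or a list; normalising to a list keeps later code uniform.
function toScriptList(scripts: string | string[]): string[] {
  return (typeof scripts === "string" ? [scripts] : scripts)
    .map((s) => s.trim())
    .filter((s) => s.length > 0); // drop empty entries
}

// Each script uses 'return' so that its value comes back in the result.
const executeJsArgs = {
  url: "https://example.com",
  scripts: toScriptList([
    "return document.title",
    "return document.querySelectorAll('a').length",
  ]),
};
```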

5. `batch_crawl` - Crawl multiple URLs concurrently

typescript

{ 
  urls: string[],           // Required: List of URLs to crawl
  max_concurrent?: number,  // Parallel request limit (default: 5)
  remove_images?: boolean,  // Remove images from output (default: false)
  bypass_cache?: boolean,   // Bypass cache for all URLs (default: false)
  configs?: Array<object>   // Optional: per-URL parameters
}

Efficiently crawls multiple URLs in parallel. Each URL gets a fresh browser instance. With the configs array, you can specify different parameters for each URL.
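
Since each URL gets its own browser instance, it can pay to remove duplicates before submitting a batch. The sketch below builds the arguments with the documented default for `max_concurrent`; `buildBatchArgs` is a hypothetical helper, not part of the server.

```typescript
interface BatchCrawlArgs {
  urls: string[];
  max_concurrent?: number;
  remove_images?: boolean;
  bypass_cache?: boolean;
}

// Hypothetical helper: removes duplicate URLs (keeping first-seen order) before a batch call.
function buildBatchArgs(urls: string[], maxConcurrent = 5): BatchCrawlArgs {
  const seen = new Set<string>();
  const unique = urls.filter((u) => {
    if (seen.has(u)) return false;
    seen.add(u);
    return true;
  });
  return { urls: unique, max_concurrent: maxConcurrent };
}
```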

6. `smart_crawl` - Auto-detect and handle different content types

typescript

{ 
  url: string,            // Required: URL to crawl
  max_depth?: number,     // Maximum depth for recursive crawling (default: 2)
  follow_links?: boolean, // Follow links in content (default: true)
  bypass_cache?: boolean  // Bypass cache (default: false)
}

Intelligently detects content type (HTML/sitemap/RSS) and processes accordingly.
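
To give a sense of what content-type detection can look like, here is a heuristic sketch. It is not the server's actual detection logic: `guessContentKind` and its rules are assumptions for illustration, and Atom feeds are grouped under "rss" for simplicity.

```typescript
type ContentKind = "sitemap" | "rss" | "html";

// Heuristic sketch: inspect the start of the response body, then fall back to the URL.
function guessContentKind(url: string, bodyStart: string): ContentKind {
  const body = bodyStart.toLowerCase();
  if (body.includes("<urlset") || body.includes("<sitemapindex")) return "sitemap";
  if (body.includes("<rss") || body.includes("<feed")) return "rss";
  if (/sitemap[^/]*\.xml$/i.test(url)) return "sitemap";
  return "html";
}
```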

7. `get_html` - Get sanitized HTML for analysis

typescript

{ 
  url: string  // Required: URL to extract HTML from
}

Returns preprocessed HTML optimized for structure analysis. Use for building schemas or analyzing patterns.

8. `extract_links` - Extract and categorize page links

typescript

{ 
  url: string,          // Required: URL to extract links from
  categorize?: boolean  // Group by type (default: true)
}

Extracts all links and groups them by type: internal, external, social media, documents, images.
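
The grouping described above can be approximated as follows. This is a sketch under stated assumptions: the social-host list, the file-extension lists, and the `classifyLink` helper are all illustrative, and the server's own rules may differ.

```typescript
type LinkKind = "internal" | "external" | "social" | "document" | "image";

// Assumed lists for illustration only.
const SOCIAL_HOSTS = ["twitter.com", "x.com", "facebook.com", "linkedin.com", "instagram.com"];
const DOC_EXT = /\.(pdf|docx?|xlsx?|pptx?)$/i;
const IMG_EXT = /\.(png|jpe?g|gif|webp|svg)$/i;

// Returns the lowercase host without a leading "www.", or "" for relative links.
function hostOf(url: string): string {
  const m = /^https?:\/\/([^/?#]+)/i.exec(url);
  const host = m?.[1] ?? "";
  return host.toLowerCase().replace(/^www\./, "");
}

function classifyLink(link: string, pageUrl: string): LinkKind {
  const path = link.replace(/[?#].*$/, ""); // ignore query string and fragment
  if (IMG_EXT.test(path)) return "image";
  if (DOC_EXT.test(path)) return "document";
  const host = hostOf(link);
  if (host === "" || host === hostOf(pageUrl)) return "internal"; // relative or same host
  if (SOCIAL_HOSTS.some((s) => host === s || host.endsWith("." + s))) return "social";
  return "external";
}
```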

9. `crawl_recursive` - Deep crawl website following links

typescript

{ 
  url: string,              // Required: Starting URL
  max_depth?: number,       // Maximum depth to crawl (default: 3)
  max_pages?: number,       // Maximum pages to crawl (default: 50)
  include_pattern?: string, // Regex pattern for URLs to include
  exclude_pattern?: string  // Regex pattern for URLs to exclude
}

Crawls a website following internal links up to specified depth. Returns content from all discovered pages.
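
The interaction between `max_depth`, `max_pages`, and the include/exclude patterns can be sketched as a breadth-first traversal. The example below walks an in-memory link graph instead of fetching real pages; `planCrawl` is a hypothetical helper, and treating the starting URL as depth 0 is an assumption about how depth is counted.

```typescript
interface CrawlLimits {
  max_depth: number;
  max_pages: number;
  include_pattern?: string;
  exclude_pattern?: string;
}

// Breadth-first traversal over an in-memory link graph (a stand-in for real page fetches).
function planCrawl(start: string, links: Map<string, string[]>, limits: CrawlLimits): string[] {
  const include = limits.include_pattern ? new RegExp(limits.include_pattern) : null;
  const exclude = limits.exclude_pattern ? new RegExp(limits.exclude_pattern) : null;
  const allowed = (u: string) => (!include || include.test(u)) && !(exclude && exclude.test(u));

  const visited = new Set<string>([start]);
  const order: string[] = [];
  let frontier = [start];

  for (let depth = 0; depth <= limits.max_depth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      if (order.length >= limits.max_pages) return order; // page budget exhausted
      order.push(url);
      for (const child of links.get(url) ?? []) {
        if (!visited.has(child) && allowed(child)) {
          visited.add(child);
          next.push(child);
        }
      }
    }
    frontier = next; // descend one level
  }
  return order;
}
```

The `visited` set prevents revisiting pages that are linked from several places, which is what keeps the traversal finite on sites with cyclic navigation.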

10. `parse_sitemap` - Extract URLs from XML sitemaps

typescript

{ 
  url: string,              // Required: Sitemap URL (e.g., /sitemap.xml)
  filter_pattern?: string   // Optional: Regex pattern to filter URLs
}

Extracts all URLs from XML sitemaps. Supports regex filtering for specific URL patterns.
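
As a rough picture of what sitemap extraction with a regex filter involves, the sketch below pulls `<loc>` entries out of sitemap XML. It is regex-based for brevity (a production parser would use a real XML library), and `sitemapUrls` is a hypothetical helper rather than the server's implementation.

```typescript
// Regex-based sketch: collects <loc> values and applies an optional filter pattern.
function sitemapUrls(xml: string, filterPattern?: string): string[] {
  const filter = filterPattern ? new RegExp(filterPattern) : null;
  const urls: string[] = [];
  const locTag = /<loc>\s*([^<]+?)\s*<\/loc>/g;
  let match: RegExpExecArray | null;
  while ((match = locTag.exec(xml)) !== null) {
    const url = (match[1] ?? "").replace(/&amp;/g, "&"); // unescape the most common XML entity
    if (url && (!filter || filter.test(url))) urls.push(url);
  }
  return urls;
}
```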

11. `crawl` - Advanced web crawling with full configuration

typescript

{
  url: string,                              // URL to crawl
  // Browser Configuration
  browser_type?: 'chromium'|'firefox'|'webkit'|'undetected',  // Browser engine (undetected = stealth mode)
  viewport_width?: number,                  // Browser width (default: 1080)
  viewport_height?: number,                 // Browser height (default: 600)
  user_agent?: string,                      // Custom user agent
  proxy_server?: string | {                 // Proxy URL (string or object format)
    server: string,
    username?: string,
    password?: string
  },
  proxy_username?: string,                  // Proxy auth (if using string format)
  proxy_password?: string,                  // Proxy password (if using string format)
  cookies?: Array<object>,                  // Pre-set cookies
  headers?: Record<string, string>,         // Custom headers
  
  // Crawler Configuration
  word_count_threshold?: number,            // Min words per block (default: 200)
  excluded_tags?: string[],                 // HTML tags to exclude
  remove_overlay_elements?: boolean,        // Remove popups/modals
  js_code?: string | string[],              // JavaScript to execute
  wait_for?: string,                        // Wait condition (selector or JS)
  wait_for_timeout?: number,                // Wait timeout (default: 30000)
  delay_before_scroll?: number,             // Pre-scroll delay
  scroll_delay?: number,                    // Between-scroll delay
  process_iframes?: boolean,                // Include iframe content
  exclude_external_links?: boolean,         // Remove external links
  screenshot?: boolean,                     // Capture screenshot
  pdf?: boolean,                           // Generate PDF
  session_id?: string,                      // Reuse browser session (only works with crawl tool)
  cache_mode?: 'ENABLED'|'BYPASS'|'DISABLED',  // Cache control
  
  // New in v3.0.0 (Crawl4AI 0.7.3/0.7.4)
  css_selector?: string,                    // CSS selector to filter content
  delay_before_return_html?: number,        // Delay in seconds before returning HTML
  include_links?: boolean,                  // Include extracted links in response
  resolve_absolute_urls?: boolean,          // Convert relative URLs to absolute
  
  // LLM Extraction (REST API only supports 'llm' type)
  extraction_type?: 'llm',                  // Only 'llm' extraction is supported via REST API
  extraction_schema?: object,               // Schema for structured extraction
  extraction_instruction?: string,          // Natural language extraction prompt
  extraction_strategy?: {                   // Advanced extraction configuration
    provider?: string,
    api_key?: string,
    model?: string,
    [key: string]: any
  },
  table_extraction_strategy?: {             // Table extraction configuration
    enable_chunking?: boolean,
    thresholds?: object,
    [key: string]: any
  },
  markdown_generator_options?: {            // Markdown generation options
    include_links?: boolean,
    preserve_formatting?: boolean,
    [key: string]: any
  },
  
  timeout?: number,                         // Overall timeout (default: 60000)
  verbose?: boolean                         // Detailed logging
}
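
The schema above allows a proxy to be given either as an object or as a URL string with separate `proxy_username` and `proxy_password` fields. The sketch below shows one way the two forms relate; `normalizeProxy` is a hypothetical helper, and how the server merges these fields internally is an assumption.

```typescript
interface ProxyObject {
  server: string;
  username?: string;
  password?: string;
}

interface ProxyArgs {
  proxy_server?: string | ProxyObject;
  proxy_username?: string;
  proxy_password?: string;
}

// Hypothetical helper: folds the string form plus proxy_username/proxy_password into the object form.
function normalizeProxy(args: ProxyArgs): ProxyObject | undefined {
  const p = args.proxy_server;
  if (p === undefined) return undefined;   // no proxy configured
  if (typeof p !== "string") return p;     // already in object form
  const proxy: ProxyObject = { server: p };
  if (args.proxy_username) proxy.username = args.proxy_username;
  if (args.proxy_password) proxy.password = args.proxy_password;
  return proxy;
}
```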

12. `manage_session` - Unified session management

typescript

{ 
  action: 'create' | 'clear' | 'list',    // Required: Action to perform
  session_id?: string,                    // For 'create' and 'clear' actions
  initial_url?: string,                   // For 'create' action: URL to load
  browser_type?: 'chromium' | 'firefox' | 'webkit' | 'undetected'  // For 'create' action
}

Unified tool for managing browser sessions. Supports three actions:

create: Start a persistent browser session
clear: Remove a session from local tracking
list: Show all active sessions

Examples:

typescript

// Create a new session
{ action: 'create', session_id: 'my-session', initial_url: 'https://example.com' }

// Clear a session
{ action: 'clear', session_id: 'my-session' }

// List all sessions
{ action: 'list' }
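
Sessions are mainly useful together with `crawl`, which is the only tool that accepts `session_id`. The sketch below lists the calls a client might issue for a stateful crawl; `sessionWorkflow` and the `ToolCall` shape are hypothetical and exist only to show the order of operations.

```typescript
interface ToolCall {
  tool: string;
  arguments: Record<string, unknown>;
}

// Hypothetical helper: create a session, crawl within it, then clear it.
function sessionWorkflow(sessionId: string, url: string): ToolCall[] {
  return [
    { tool: "manage_session", arguments: { action: "create", session_id: sessionId, initial_url: url } },
    {
      tool: "crawl",
      arguments: {
        url,
        session_id: sessionId, // reuse the browser session created above
        js_code: "window.scrollTo(0, document.body.scrollHeight)",
        screenshot: true,      // capture the page after the script has run
      },
    },
    { tool: "manage_session", arguments: { action: "clear", session_id: sessionId } },
  ];
}
```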

13. `extract_with_llm` - Extract structured data using AI

typescript

{ 
  url: string,          // URL to extract data from
  query: string         // Natural language extraction instructions
}

Uses AI to extract structured data from webpages. Returns results immediately without any polling or job management. This is the recommended way to extract specific information since CSS/XPath extraction is not supported via the REST API.
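
MCP clients normally construct tool calls for you, but it can help to see the request they send. Under the MCP specification a tool is invoked with a JSON-RPC `tools/call` request; the sketch below builds one for `extract_with_llm`. The `toolCall` helper, the URL, and the query text are illustrative assumptions.

```typescript
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Hypothetical helper: wraps a tool name and its arguments in a JSON-RPC request.
function toolCall(id: number, tool: string, args: Record<string, unknown>): ToolCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name: tool, arguments: args } };
}

const llmRequest = toolCall(1, "extract_with_llm", {
  url: "https://example.com/pricing",
  query: "List each plan name with its monthly price",
});
```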

Advanced Configuration

For details on all available configuration options, extraction strategies, and advanced features, refer to the official Crawl4AI documentation.

Changelog

See CHANGELOG.md for detailed version history and recent updates.

Development

Setup

bash

# 1. Start the Crawl4AI server
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:latest

# 2. Install MCP server
git clone https://github.com/omgwtfwow/mcp-crawl4ai-ts.git
cd mcp-crawl4ai-ts
npm install
cp .env.example .env

# 3. Development commands
npm run dev    # Development mode
npm test       # Run tests
npm run lint   # Check code quality
npm run build  # Production build

# 4. Add to your MCP client (See "Using local installation")

Running Integration Tests

Integration tests require a running Crawl4AI server. Configure your environment:

bash

# Required for integration tests
export CRAWL4AI_BASE_URL=http://localhost:11235
export CRAWL4AI_API_KEY=your-api-key  # If authentication is required

# Optional: For LLM extraction tests
export LLM_PROVIDER=openai/gpt-4o-mini
export LLM_API_TOKEN=your-llm-api-key
export LLM_BASE_URL=https://api.openai.com/v1  # If using custom endpoint

# Run integration tests (ALWAYS use the npm script; don't call `jest` directly)
npm run test:integration

# Run a single integration test file
npm run test:integration -- src/__tests__/integration/extract-links.integration.test.ts

> IMPORTANT: Do NOT run `npx jest` directly for integration tests. The npm script injects `NODE_OPTIONS=--experimental-vm-modules` which is required for ESM + ts-jest. Running Jest directly will produce `SyntaxError: Cannot use import statement outside a module` and hang.

Integration tests cover:

Dynamic content and JavaScript execution
Session management and cookies
Content extraction (LLM-based only)
Media handling (screenshots, PDFs)
Performance and caching
Content filtering
Bot detection avoidance
Error handling

Integration Test Checklist

1. Docker container healthy:

bash

docker ps --filter name=crawl4ai --format '{{.Names}} {{.Status}}'
curl -sf http://localhost:11235/health || echo "Health check failed"

2. Env vars loaded (either exported or in .env): CRAWL4AI_BASE_URL (required), optional: CRAWL4AI_API_KEY, LLM_PROVIDER, LLM_API_TOKEN, LLM_BASE_URL.

3. Use npm run test:integration (never raw jest).

4. To target one file add it after -- (see example above).

5. Expect a total runtime of about 2–3 minutes; a much longer run or an immediate hang usually means NODE_OPTIONS is missing or the Jest version is wrong.

Troubleshooting

| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| `SyntaxError: Cannot use import statement outside a module` | Ran `jest` directly without script flags | Re-run with `npm run test:integration` |
| Hangs on first test (RUNS ...) | Missing experimental VM modules flag | Use the npm script, or ensure `NODE_OPTIONS=--experimental-vm-modules` is set |
| Network timeouts | Crawl4AI container not healthy, or DNS blocked | Restart the container: `docker restart crawl4ai` |
| LLM tests skipped | Missing `LLM_PROVIDER` or `LLM_API_TOKEN` | Export the required LLM variables |
| New Jest major upgrade breaks tests | Version mismatch with `ts-jest` | Keep Jest 29.x unless `ts-jest` is upgraded accordingly |

Version Compatibility Note

Current stack: jest@29.x + ts-jest@29.x + ESM ("type": "module"). Updating Jest to 30+ requires upgrading ts-jest and revisiting jest.config.cjs. Keep versions aligned to avoid parse errors.
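
A quick way to catch a mismatch is to compare the major versions declared in package.json. The sketch below is a hypothetical check, not a script shipped with this repository; it assumes both packages appear under `devDependencies` with ordinary semver ranges.

```typescript
// Extracts the major version from a semver range such as "^29.7.0" or "~29.1".
function majorOf(range: string): number | null {
  const m = /(\d+)/.exec(range);
  return m ? Number(m[1]) : null;
}

// True when jest and ts-jest are both present and share a major version.
function jestStackAligned(devDependencies: Record<string, string>): boolean {
  const jestMajor = majorOf(devDependencies["jest"] ?? "");
  const tsJestMajor = majorOf(devDependencies["ts-jest"] ?? "");
  return jestMajor !== null && jestMajor === tsJestMajor;
}
```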

License

MIT
