Architecture
Crate Structure
crw is a Rust workspace with 6 crates, each with a focused responsibility:
crw-server (Axum HTTP API — main binary)
├── crw-crawl (BFS crawler, single-page scraper)
│   ├── crw-extract (HTML cleaning, format conversion)
│   │   └── crw-core (types, config, errors)
│   └── crw-renderer (HTTP + CDP fetcher)
│       └── crw-core
└── crw-core
crw-mcp (Stdio MCP proxy — standalone binary)
Request Flow
Scrape request
Client → POST /v1/scrape
  → Auth middleware (Bearer token check)
  → scrape handler
    → FallbackRenderer.fetch(url)
      → HTTP request (reqwest)
      → SPA detection heuristics
      → CDP rendering if needed (LightPanda/Playwright/Chrome)
    → HTML cleaning (lol_html)
      → Remove script/style/iframe/svg
      → Remove nav/footer/header if onlyMainContent
      → Apply includeTags/excludeTags CSS selectors
    → Readability extraction
      → Try: article → main → [role=main] → .post-content → body
    → Format conversion
      → Markdown (fast_html2md with fallback chain)
      → HTML / RawHTML / PlainText / Links
      → JSON (LLM extraction via Anthropic/OpenAI)
    → Metadata extraction
      → title, description, og:*, canonical, lang
  → JSON response
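The fetch step at the top of this pipeline (plain HTTP first, CDP only when the page looks client-rendered) can be sketched with the standard library alone. This is a minimal illustration, not the crw-renderer implementation: the `Fetcher` trait, the `StaticFetcher` stub, the `Fetched` struct, and the 200-byte threshold in `looks_like_spa` are all assumptions made for the example, and the actual SPA heuristics may be quite different.

```rust
/// Result of a fetch: the HTML and whether the CDP path produced it.
#[derive(Debug, PartialEq)]
pub struct Fetched {
    pub html: String,
    pub rendered_with_cdp: bool,
}

/// Hypothetical fetcher abstraction; the real crw-renderer types may differ.
pub trait Fetcher {
    fn fetch(&self, url: &str) -> Result<String, String>;
}

/// Stub fetcher that always returns the same HTML (stands in for reqwest or CDP).
pub struct StaticFetcher(pub String);

impl Fetcher for StaticFetcher {
    fn fetch(&self, _url: &str) -> Result<String, String> {
        Ok(self.0.clone())
    }
}

/// Illustrative heuristic (an assumption): a page that loads scripts but has
/// fewer than 200 bytes of markup outside <script> blocks is treated as an SPA shell.
pub fn looks_like_spa(html: &str) -> bool {
    let lower = html.to_ascii_lowercase();
    if !lower.contains("<script") {
        return false;
    }
    let mut rest = lower.as_str();
    let mut outside_scripts = 0usize;
    while let Some(start) = rest.find("<script") {
        outside_scripts += start;
        match rest[start..].find("</script>") {
            Some(end) => rest = &rest[start + end + "</script>".len()..],
            None => {
                // Unterminated script: nothing after it counts as markup.
                rest = "";
                break;
            }
        }
    }
    outside_scripts += rest.len();
    outside_scripts < 200
}

/// HTTP first; fall back to CDP only when the heuristic fires and CDP is available.
pub struct FallbackRenderer<H: Fetcher, C: Fetcher> {
    pub http: H,
    pub cdp: Option<C>,
}

impl<H: Fetcher, C: Fetcher> FallbackRenderer<H, C> {
    pub fn fetch(&self, url: &str) -> Result<Fetched, String> {
        let html = self.http.fetch(url)?;
        if looks_like_spa(&html) {
            if let Some(cdp) = &self.cdp {
                if let Ok(rendered) = cdp.fetch(url) {
                    return Ok(Fetched { html: rendered, rendered_with_cdp: true });
                }
            }
        }
        // Static page, no CDP backend, or CDP failed: keep the HTTP response.
        Ok(Fetched { html, rendered_with_cdp: false })
    }
}
```

In this sketch a failed CDP render falls back to the plain HTTP body instead of failing the request, which is one plausible reading of "fallback"; the real renderer may surface the error instead.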
Crawl request
Client → POST /v1/crawl
  → Create crawl job (UUID)
  → Spawn async BFS task
    → Fetch robots.txt
    → BFS loop with VecDeque:
      → Dequeue URL
      → Check robots.txt allowance
      → Acquire semaphore (max_concurrency)
      → Rate limit (requests_per_second)
      → Scrape page (same pipeline as single scrape)
      → Extract links from page
      → Filter: same origin, not visited, under max_depth
      → Enqueue new URLs
      → Update crawl state via watch::Sender
    → Until: maxPages reached, maxDepth exhausted, or queue empty
  → Return job ID immediately
Client → GET /v1/crawl/{id}
  → Read crawl state via watch::Receiver
  → Return progress + results
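The core of the BFS loop above can be illustrated with a synchronous, standard-library-only sketch. It deliberately leaves out robots.txt checks, the semaphore, rate limiting, and the `watch` channel, and it is not the crw-crawl code: `CrawlLimits`, `origin`, and `crawl_bfs` are hypothetical names, and the `scrape_links` closure stands in for "scrape page, then extract links".

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical limits mirroring the maxDepth / maxPages options.
pub struct CrawlLimits {
    pub max_depth: usize,
    pub max_pages: usize,
}

/// Simplified origin extraction: "scheme://host[:port]" of an absolute URL.
pub fn origin(url: &str) -> Option<&str> {
    let scheme_end = url.find("://")? + 3;
    let host_end = url[scheme_end..]
        .find('/')
        .map(|i| scheme_end + i)
        .unwrap_or(url.len());
    Some(&url[..host_end])
}

/// Breadth-first crawl from `start`. `scrape_links` returns the absolute
/// links found on a page. Returns the URLs scraped, in visit order.
pub fn crawl_bfs<F>(start: &str, limits: &CrawlLimits, mut scrape_links: F) -> Vec<String>
where
    F: FnMut(&str) -> Vec<String>,
{
    let start_origin = origin(start).map(String::from);
    let mut visited: HashSet<String> = HashSet::new();
    let mut queue: VecDeque<(String, usize)> = VecDeque::new();
    let mut scraped = Vec::new();

    visited.insert(start.to_string());
    queue.push_back((start.to_string(), 0));

    while let Some((url, depth)) = queue.pop_front() {
        // Stop once the page budget is spent.
        if scraped.len() >= limits.max_pages {
            break;
        }
        let links = scrape_links(&url);
        scraped.push(url);

        // Pages at the depth limit are scraped but not expanded further.
        if depth >= limits.max_depth {
            continue;
        }
        for link in links {
            let same_origin =
                start_origin.is_some() && origin(&link) == start_origin.as_deref();
            // `insert` returns false for URLs that were already seen.
            if same_origin && visited.insert(link.clone()) {
                queue.push_back((link, depth + 1));
            }
        }
    }
    scraped
}
```

Marking a URL as visited when it is enqueued, not when it is dequeued, keeps the same page from being queued twice if several pages link to it.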
Middleware Stack
The Axum server applies middleware in this order:
- CORS: `CorsLayer::permissive()`
- Trace: HTTP request logging via `tracing`
- Body limit: max 1 MB request body
- Timeout: configurable request timeout
- Auth: Bearer token validation (if `auth.api_keys` is set)
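The auth step at the end of this list is conditional: with no API keys configured, every request passes. A minimal standard-library sketch of that decision follows. `AuthOutcome` and `check_bearer` are hypothetical names, and the real middleware is an Axum layer whose exact behavior (status codes, header parsing details) is not shown here.

```rust
use std::collections::HashSet;

/// Hypothetical result of the auth check.
#[derive(Debug, PartialEq)]
pub enum AuthOutcome {
    Allowed,
    MissingToken,
    InvalidToken,
}

/// Validate an `Authorization` header value against the configured keys.
/// An empty key set means auth is disabled (assumed to match an unset `auth.api_keys`).
pub fn check_bearer(authorization: Option<&str>, api_keys: &HashSet<String>) -> AuthOutcome {
    if api_keys.is_empty() {
        return AuthOutcome::Allowed;
    }
    // Expect the form "Bearer <token>"; anything else counts as missing.
    let token = match authorization.and_then(|h| h.strip_prefix("Bearer ")) {
        Some(t) => t.trim(),
        None => return AuthOutcome::MissingToken,
    };
    if api_keys.contains(token) {
        AuthOutcome::Allowed
    } else {
        AuthOutcome::InvalidToken
    }
}
```

A handler built on this would typically map `MissingToken` and `InvalidToken` to an HTTP 401 response; that mapping is an assumption here, not something the list above specifies.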
Feature Flags
| Flag | Crate | Effect |
|---|---|---|
| cdp | crw-renderer | Enables CDP rendering via tokio-tungstenite |
| cdp | crw-server | Passes through to crw-renderer/cdp |
| test-utils | crw-server | Exposes internal functions for testing |
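Cargo features like these are usually consumed in code through `#[cfg(feature = "...")]`. The sketch below shows the general pattern for an optional `cdp` backend; `compiled_backends` and `cdp_enabled` are hypothetical helpers for illustration, not functions from the crw crates.

```rust
/// True when this build was compiled with the `cdp` feature.
pub fn cdp_enabled() -> bool {
    cfg!(feature = "cdp")
}

/// Names of the fetch backends compiled into this build.
pub fn compiled_backends() -> Vec<&'static str> {
    // `mut` is only needed when the feature-gated push below is compiled in.
    #[allow(unused_mut)]
    let mut backends = vec!["http"];

    // This statement is removed entirely when the feature is off.
    #[cfg(feature = "cdp")]
    backends.push("cdp");

    backends
}
```

With this pattern, a build without the feature never compiles the gated code, so an optional dependency such as tokio-tungstenite does not need to be present at all.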
Key Dependencies
| Dependency | Purpose |
|---|---|
| axum 0.8 | HTTP API framework |
| tokio | Async runtime |
| reqwest | HTTP client (rustls) |
| tokio-tungstenite | CDP WebSocket (with cdp feature) |
| lol_html | Streaming HTML rewriting |
| scraper | CSS selector-based HTML parsing |
| fast_html2md | HTML → Markdown conversion |
| jsonschema | LLM output validation |
| config | Layered TOML configuration |
| tracing | Structured logging |