# Changelog
This page is generated from the root CHANGELOG.md, which is maintained by release-please during releases.
:::note
The source of truth is the repository root changelog. Do not edit this docs page manually.
:::
All notable changes to CRW are documented here.
## 0.3.4 (2026-04-09)

### Bug Fixes

## 0.3.3 (2026-04-09)

### Features
- add APT/Debian package distribution (c34b8e9)
- renderer: spawn all available browsers for multi-renderer fallback (f546437)
## 0.3.2 (2026-04-08)

### Bug Fixes
- cli: auto-prepend https:// when no scheme provided (1050606)
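The scheme auto-prepend behavior can be sketched as follows; this is an illustrative Python sketch of the rule described above, not the actual Rust CLI code, and the function name is an assumption:

```python
def normalize_url(raw: str) -> str:
    """Prepend https:// when the input URL has no scheme.

    Illustrative sketch of the CLI behavior; not crw's implementation.
    """
    raw = raw.strip()
    # Treat the input as schemeless only when no "<scheme>://" prefix exists.
    if "://" not in raw:
        return "https://" + raw
    return raw
```

So `crw example.com` and `crw https://example.com` would resolve to the same target.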
## 0.3.1 (2026-04-08)

### Features
- add llms.txt, SKILL.md, MCP init command, and docs UI improvements (1b22d19)
- add one-line install script with auto platform detection (6354f79)
- docs: add dark mode logo support and improve docs UI (047df7b)
- docs: align design with SaaS site and update branding (631d07c)
- docs: unify docs into docs.fastcrw.com with Mintlify-style design (4994998)
- docs: update URLs, dark mode, syntax highlighting, and benchmarks (0678cdf)
- release all 3 binaries, CLI auto-browser, README overhaul (aa2950d)
- update README banner with new logo (bcba1ad)
### Bug Fixes
- crawl HTTP polling bug + SDK test suite + docs (#16) (b6d8983)
- remove internal implementation detail from roadmap (a5013f0)
## 0.3.0 (2026-04-02)

### Features
- add search() method to Python SDK and docs (591e3fe)
## 0.2.2 (2026-04-02)

### Bug Fixes
- renderer: escalate to JS renderer on HTTP 401/403 responses (f515caa)
- use GitHub latest release instead of pinned version for binary download (4afcb1a)
## 0.2.1 (2026-03-28)

### Bug Fixes
- make crw-mcp npm wrapper executable (576a9eb)
- use latest tag in server.json OCI identifier (7ec3b82)
## 0.2.0 (2026-03-28)

### Features
- add MCP Registry support for official server discovery (154b9f5)
## 0.1.2 (2026-03-27)

### Bug Fixes
- vendor pdf-inspector as crw-pdf for crates.io publishability (3f7681d)
## 0.1.1 (2026-03-26)

### Bug Fixes
- skip already-published crates without masking real errors (010649c)
## 0.1.0 (2026-03-26)

### Features
- add PDF extraction support via pdf-inspector (06dd5bf)
## 0.0.14 (2026-03-25)

### Features
- mcp: auto-download LightPanda binary for zero-config JS rendering (41f443b)
- mcp: auto-spawn headless Chrome for JS rendering in embedded mode (9a6b0ae)
### Bug Fixes
- ci: move crw-mcp to Tier 4 in release workflow and add workflow_dispatch (d7584a8)
## 0.0.13 (2026-03-24)

### Features
- mcp: add embedded mode — self-contained MCP server, no crw-server needed (75e5450)
### Bug Fixes
- ci: switch release-please to simple type for Rust workspace support (51cd420)
## v0.0.12

- Readability drill-down — when `<main>` or `<article>` wraps >90% of the body, the extractor now searches inside for narrower content elements (`.main-page-content`, `.article-content`, `.entry-content`, etc.) instead of discarding. Fixes MDN pages returning 35 chars and StackOverflow returning only the question
- Base64 image stripping — `data:` URI images are removed in both HTML cleaning (lol_html) and markdown post-processing (regex safety net). Eliminates massive base64 blobs from Reddit and similar sites
- Select/dropdown removal — `<select>` elements removed in `onlyMainContent` mode; dropdown/city-selector/location-selector noise patterns added. Fixes Hürriyet city dropdown leaking into content
- Extended scored selectors — added `.main-page-content`, `.js-post-body`, `.s-prose`, `#question`, `.page-content`, `#page-content`, `[role="article"]` for better MDN, StackOverflow, and generic site coverage
- Smarter fallback chain — when primary extraction produces too-short markdown, both fallbacks (cleaned HTML and basic clean) are tried and the longer result is picked, instead of short-circuiting on non-empty but insufficient content
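The regex safety net for base64 image stripping might look like the sketch below; the pattern and function name are assumptions based on the entry above, written in Python rather than the project's Rust for brevity:

```python
import re

# Matches markdown images whose target is a data: URI, e.g.
# ![alt](data:image/png;base64,iVBOR...). Illustrative, not crw's exact regex.
DATA_URI_IMAGE = re.compile(r"!\[[^\]]*\]\(data:image/[^)]*\)")

def strip_base64_images(markdown: str) -> str:
    """Remove inline base64 image blobs from converted markdown."""
    return DATA_URI_IMAGE.sub("", markdown)
```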
## v0.0.11

- Stealth anti-bot bypass — automatic stealth JS injection via `Page.addScriptToEvaluateOnNewDocument` before every CDP navigation. Spoofs `navigator.webdriver`, the Chrome runtime object, plugins array, languages, permissions API, iframe `contentWindow`, and a `toString()` proxy to bypass Cloudflare, PerimeterX, and other bot detection systems
- Cloudflare challenge auto-retry — detects Cloudflare JS challenge pages ("Just a moment", `cf-browser-verification`, `challenge-platform`) after page load and polls up to 3 times at 3-second intervals for non-interactive challenges to auto-resolve
- HTTP → CDP auto-escalation — `FallbackRenderer::fetch()` in auto mode now checks HTTP responses for anti-bot challenge signatures and automatically escalates to JS rendering when detected, instead of returning the challenge HTML
- Chrome failover in Docker — full automatic failover chain: HTTP → LightPanda → Chrome. Added `chromedp/headless-shell` as a Docker Compose sidecar service with 2 GB shared memory. If LightPanda crashes on complex SPAs (React, Angular), Chrome handles the render
- Chrome WS URL auto-discovery — the CDP renderer resolves the Chrome DevTools WebSocket URL via the `/json/version` HTTP endpoint with a `Host: localhost` header (required for chromedp/headless-shell's socat proxy). Uses `OnceCell` for lazy one-time resolution
- Proxy configuration docs — expanded proxy config comments with examples for HTTP, SOCKS5, and residential proxy providers (IPRoyal, Oxylabs, Smartproxy)
- Raw string delimiter fix — fixed a `markdown.rs` test that used `r#"..."#` with a string containing `"#`; changed to `r##"..."##`
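The challenge-page detection behind the auto-retry and auto-escalation entries reduces to a signature scan; this Python sketch uses the signatures named above, but the function name and its use of a plain substring check are assumptions:

```python
# Signatures listed in the changelog entry; the detection approach is an
# illustrative sketch, not crw's actual Rust implementation.
CHALLENGE_SIGNATURES = (
    "Just a moment",
    "cf-browser-verification",
    "challenge-platform",
)

def looks_like_challenge(html: str) -> bool:
    """Return True when the fetched HTML resembles a Cloudflare JS challenge."""
    return any(sig in html for sig in CHALLENGE_SIGNATURES)
```

A renderer in auto mode could call this on each HTTP response and escalate to CDP rendering when it returns True.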
## v0.0.10 / v0.0.9

- Crawl cancel endpoint — `DELETE /v1/crawl/{id}` cancels a running crawl job via `AbortHandle` and returns `{ success: true }`
- API rate limiting — token-bucket rate limiter (configurable `rate_limit_rps`, default 10). Returns 429 with `error_code: "rated_limited"` replaced by `error_code: "rate_limited"` when exceeded
- Machine-readable error codes — all error responses now include an `error_code` field (e.g. `"invalid_url"`, `"http_error"`, `"rate_limited"`, `"not_found"`)
- Map response envelope — `/v1/map` now returns `{ success, data: { links } }` instead of `{ success, links }` for consistency with other endpoints
- Fenced code blocks — indented (4-space) code blocks are post-processed into fenced (triple-backtick) blocks for better LLM/RAG compatibility
- Sphinx footer cleanup — `"footer"` added to exact-token noise patterns, catching `<div class="footer">` in Sphinx/documentation sites
- `renderedWith: "http"` — HTTP-only fetches now report `rendered_with: "http"` in metadata instead of `null`
- 405 JSON responses — all routes now have a `.fallback(method_not_allowed)` returning structured JSON with `error_code: "method_not_allowed"` instead of empty bodies
- Anchor link cleanup — empty anchor links (`[](#id)`, `[¶](#id)`) and pilcrow/section signs stripped from Markdown output
- `role="contentinfo"` cleanup — elements with ARIA roles `contentinfo`, `navigation`, `banner`, `complementary` removed during cleaning
- Tiny chunk merging — topic chunking merges heading-only chunks (<50 chars) with the next chunk to improve RAG embedding quality
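The tiny-chunk merging rule can be sketched as below; the function name, signature, and newline joining are assumptions, and the 50-character threshold comes from the entry above:

```python
def merge_tiny_chunks(chunks: list[str], min_len: int = 50) -> list[str]:
    """Merge heading-only chunks (< min_len chars) into the next chunk.

    Sketch of the behavior described in the changelog, not crw's exact code.
    """
    merged: list[str] = []
    pending = ""  # accumulated tiny chunks awaiting a full-size successor
    for chunk in chunks:
        if len(chunk) < min_len:
            pending += chunk + "\n"
        else:
            merged.append(pending + chunk)
            pending = ""
    if pending:  # trailing tiny chunks have no successor; keep them as-is
        merged.append(pending.rstrip("\n"))
    return merged
```

This keeps a bare `## Heading` attached to the paragraph it introduces, so the embedding for that chunk carries real content.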
## v0.0.8

- Wikipedia / MediaWiki `onlyMainContent` fix — `onlyMainContent: true` now correctly extracts article text from Wikipedia pages (~49% size reduction). Previously the `<html>` element's `class="vector-toc-available"` matched the `"toc"` noise pattern via substring, removing the entire page
- 3-tier noise pattern matching — noise class/id matching now uses substring (long patterns), exact-token (short/ambiguous: `toc`, `share`, `social`, `comment`, `related`), and prefix (`ad-`, `ads-`) matching to avoid false positives
- Structural element guard — the noise handler never removes `<html>`, `<head>`, `<body>`, or `<main>` elements
- Re-clean after readability — readability output is re-cleaned to strip residual noise (infobox, navbox, catlinks) that survives inside broad containers
- Wikipedia-aware readability — added `.mw-parser-output`, `#mw-content-text`, `#bodyContent` to scored selectors; priority/scored selectors that wrap >90% of the body are skipped
- BYOK LLM extraction — per-request `llmApiKey`, `llmProvider`, `llmModel` fields for bring-your-own-key structured extraction without server config
- JSON format validation — `formats: ["json"]` without a `jsonSchema` now returns a 400 error instead of a warning
- Block detection skip — pages >50 KB skip interstitial/block detection (no more false "blocked by anti-bot" on Wikipedia)
- Null byte URL rejection — URLs with `%00` or null bytes rejected at validation
- Request timeout — default timeout bumped from 60s to 120s
- Dockerfile fix — corrected `cargo build` flags, added `config.docker.toml`
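The 3-tier matching logic can be illustrated with a small Python sketch; the substring pattern list here is a made-up subset, and the function name is an assumption, but the exact-token and prefix sets come from the entry above:

```python
# Illustrative pattern sets; SUBSTRING is a hypothetical subset, the other
# two tiers use the patterns named in the changelog. Not crw's actual code.
SUBSTRING = ("advertisement", "cookie-banner")          # long, unambiguous
EXACT_TOKEN = ("toc", "share", "social", "comment", "related")  # ambiguous
PREFIX = ("ad-", "ads-")

def is_noise_class(class_attr: str) -> bool:
    """Classify a class attribute as noise using the 3-tier rules."""
    lowered = class_attr.lower()
    tokens = lowered.split()
    if any(pat in lowered for pat in SUBSTRING):        # tier 1: substring
        return True
    if any(tok in EXACT_TOKEN for tok in tokens):       # tier 2: exact token
        return True
    return any(tok.startswith(pre) for tok in tokens for pre in PREFIX)
```

Note how `vector-toc-available` no longer matches: `toc` is checked as an exact token, not a substring, which is precisely the Wikipedia regression the release fixes.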
## v0.0.7

- `success: false` on 4xx targets — scraping a 403/404/429 target with a minimal body now correctly returns `success: false` with error details, instead of `success: true` with a warning. Targets with real content (custom error pages) still return `success: true` with a warning
- JS renderer fallback warning — when `renderJs: true` is requested but no CDP renderer is available, the response now includes `rendered_with: "http_only_fallback"` and a warning instead of silently falling back
- CDP health check — `is_available()` now runs a real `Browser.getVersion` command instead of just testing the WebSocket connection
- Specific error messages — unknown formats now return descriptive errors (e.g. `"Unknown format 'extract'. Valid formats: ..."`) instead of a generic 422
- `"extract"` format alias — `formats: ["extract"]` and `formats: ["llm-extract"]` are now accepted as aliases for `"json"` (Firecrawl compatibility)
- Chunk dedup by default — deduplication is now enabled by default for all chunking strategies; separator-only chunks (`---`, `***`) are filtered out
- Chunk relevance scores — chunks now return `{ content, score, index }` objects instead of plain strings when a query is provided
- Map timeout — `/v1/map` accepts a `timeout` parameter (default 120s, max 300s) to prevent 502s on large sites
- Stealth + JS rendering fix — `stealth: true` with `renderJs: true` no longer bypasses CDP; the shared renderer is used with stealth headers injected
- BM25 NaN guard — prevents `NaN` scores when all chunks are empty
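One way such a NaN guard can work is shown below; the guard's placement in the IDF term is an assumption based on the entry above, and the function is an illustrative Python sketch, not crw's Rust scorer:

```python
import math

def bm25_idf(n_docs: int, doc_freq: int) -> float:
    """BM25 IDF term with a guard for degenerate corpora.

    Hypothetical sketch: when every chunk is empty, n_docs and doc_freq can
    both be 0 and a naive division yields NaN; returning 0.0 avoids that.
    """
    if n_docs == 0 or doc_freq == 0:
        return 0.0
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
```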
## v0.0.6
- Crate READMEs on crates.io — all 7 crates now have detailed README documentation visible on their crates.io pages, with usage examples, API docs, and installation instructions
## v0.0.5

- `crw-cli` now on crates.io — install the standalone CLI with `cargo install crw-cli` and scrape URLs without running a server
- Parallelized release workflow — crate publishing uses tiered parallelism, cutting release time by ~2.25 minutes
- CLI and MCP install docs — README now includes `cargo install` instructions for both `crw-cli` and `crw-mcp`
## v0.0.4
- Hardened rendering and warning semantics — improved reliability of the rendering pipeline and warning detection logic
- XPath output escaping — XPath extraction results are now properly escaped to prevent injection
- Broadened status warnings — expanded HTTP status code range that triggers warning metadata
- Capped interstitial scan — bounded interstitial page detection to avoid excessive scanning
- Clippy cleanup — simplified status code checks for cleaner, idiomatic Rust
## v0.0.3

- Warning-aware target handling — 4xx and anti-bot targets now return `success: true` with `warning` and `metadata.statusCode`
- More reliable JS rendering — CDP navigation now waits for real page lifecycle completion before applying `waitFor`
- Stealth decompression fix — gzip and brotli responses decode cleanly instead of leaking garbled binary payloads
- Crawl compatibility — `limit`, `maxPages`, and `max_pages` now normalize to the same crawl cap
- XPath and chunking fixes — XPath returns all matches, chunk overlap/dedupe is supported, and scorer rank order is preserved
## v0.0.2

- CSS selector & XPath — target specific DOM elements before Markdown conversion (`cssSelector`, `xpath`)
- Chunking strategies — split content into topic, sentence, or regex-delimited chunks for RAG pipelines (`chunkStrategy`)
- BM25 & cosine filtering — rank chunks by relevance to a query and return the top-K results (`filterMode`, `topK`)
- Better Markdown — switched to `htmd` (a Turndown.js port): tables, code block languages, and nested lists all render correctly
- Stealth mode — rotate the User-Agent from a built-in Chrome/Firefox/Safari pool and inject 12 browser-like headers (`stealth: true`)
- Per-request proxy — override the global proxy on a per-request basis (`proxy: "http://..."`)
- Rate limit jitter — randomized delay between requests to avoid uniform traffic fingerprinting
- `crw-server setup` — one-command JS rendering setup: downloads LightPanda, creates `config.local.toml`
## v0.0.1

- Firecrawl-compatible REST API — `/v1/scrape`, `/v1/crawl`, `/v1/map` with identical request/response format
- 7 output formats — markdown, HTML, cleaned HTML, raw HTML, plain text, links, structured JSON
- LLM structured extraction — JSON schema in, validated structured data out (Anthropic tool_use + OpenAI function calling)
- JS rendering — auto-detect SPAs via heuristics, render via LightPanda, Playwright, or Chrome (CDP)
- BFS crawler — async crawl with rate limiting, robots.txt, sitemap support, and concurrent jobs
- MCP server — built-in stdio + HTTP transport for Claude Code and Claude Desktop
- SSRF protection — private IPs, cloud metadata, IPv6, and dangerous URI filtering
- Docker ready — multi-stage build with a LightPanda sidecar