Scrape
Turn any known URL into clean markdown, HTML, links, or structured JSON in one request. This is the default CRW workflow and the fastest path to a first successful integration.
formats: ["markdown"], and confirm you get a clean response back. Add JS rendering, selectors, or extraction only after the plain request looks right.Scraping a URL with CRW
/v1/scrape
POST /v1/scrape
Authentication:
- Hosted: send
Authorization: Bearer YOUR_API_KEY - Self-hosted: only required when
auth.api_keysis configured
Installation
CRW is HTTP-first. You can start with cURL immediately and then move to your existing Python or Node.js HTTP client without installing a dedicated SDK.
Basic usage
Start with this request:
{
"url": "https://example.com",
"formats": ["markdown"],
"onlyMainContent": true,
"renderJs": null
}
:::tabs ::tab{title="Python"}
import requests
resp = requests.post(
"https://fastcrw.com/api/v1/scrape",
headers={
"Authorization": "Bearer YOUR_API_KEY",
"Content-Type": "application/json",
},
json={
"url": "https://example.com",
"formats": ["markdown"],
"onlyMainContent": True,
},
)
print(resp.json()["data"]["markdown"])
::tab{title="Node.js"}
const resp = await fetch("https://fastcrw.com/api/v1/scrape", {
method: "POST",
headers: {
"Authorization": "Bearer YOUR_API_KEY",
"Content-Type": "application/json"
},
body: JSON.stringify({
url: "https://example.com",
formats: ["markdown"],
onlyMainContent: true
})
});
const body = await resp.json();
console.log(body.data.markdown);
::tab{title="cURL"}
curl -X POST https://fastcrw.com/api/v1/scrape \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"formats": ["markdown"],
"onlyMainContent": true
}'
:::
Response
{
"success": true,
"data": {
"markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
"metadata": {
"title": "Example Domain",
"sourceURL": "https://example.com",
"statusCode": 200,
"elapsedMs": 32
}
}
}
That is the default CRW success shape: requested content plus a compact metadata envelope.
Parameters
| Field | Type | Default | Description |
|---|---|---|---|
url |
string | required | URL to scrape |
formats |
string[] | ["markdown"] |
markdown, html, rawHtml, plainText, links, json, summary |
onlyMainContent |
boolean | true |
Remove nav, footer, and boilerplate before conversion |
renderJs |
boolean or null | null |
null auto-detects, true forces browser rendering, false stays HTTP-only |
waitFor |
number | -- | Milliseconds to wait after JS rendering |
renderer |
string | auto |
Pin to a specific renderer: auto, lightpanda, chrome, or playwright. Non-auto values hard-pin (no fallback) and imply renderJs:true unless renderJs:false is set explicitly. See JS rendering |
includeTags |
string[] | [] |
CSS selectors to keep |
excludeTags |
string[] | [] |
CSS selectors to remove |
headers |
object | {} |
Custom HTTP headers |
cssSelector |
string | -- | Narrow extraction to one CSS selector |
xpath |
string | -- | Narrow extraction to one XPath expression |
chunkStrategy |
object | -- | Topic, sentence, or regex chunking |
query |
string | -- | Ranking query for chunk filtering |
filterMode |
string | -- | bm25 or cosine |
topK |
number | 5 |
Number of top chunks to keep |
proxy |
string | -- | Per-request proxy URL |
country |
string | -- | 2-letter ISO 3166-1 alpha-2 country code (lowercase, e.g. us, gb, de). Routes the request through the named residential pool when the chrome_proxy renderer tier is configured. Ignored if no proxy tier is set up. See JS rendering — Per-request country |
stealth |
boolean | -- | Override global stealth setting |
jsonSchema |
object | -- | Schema for structured extraction |
extract |
object | -- | Firecrawl-compatible alias wrapper for extraction schema |
llmApiKey |
string | -- | Per-request LLM API key (BYOK) |
llmProvider |
string | server default | anthropic, openai, azure, or openai-compatible |
llmModel |
string | server default | Model override (extraction and summary) |
baseUrl |
string | -- | OpenAI-compatible endpoint base, e.g. https://api.deepseek.com/v1 (also used by Azure). crw appends /chat/completions automatically if you omit it. |
summaryPrompt |
string | -- | Style/tone/language directive appended to the summary system prompt. Safety wrapper kept intact. Capped at 500 chars. |
maxContentChars |
number | [extraction.llm].max_html_bytes (100 KB) |
Per-request byte cap on content sent to the LLM for summary. Clamped to 200 KB server-side. |
actions |
any | -- | Rejected with a clear error; use cssSelector or xpath instead |
Formats
Use the smallest output shape that solves the job:
markdownis the default and best first request for most pipelines.htmlorrawHtmlis useful when downstream systems need original structure.linksis useful when you want lightweight discovery without page bodies.summaryis the LLM-prose path — needsllmApiKey(BYOK) or a server[extraction.llm]config. OptionalsummaryPromptlets the caller pick language/tone without weakening the safety wrapper.jsonis the extraction path and should be paired withjsonSchema.
If you ask for multiple formats, only those formats are populated in the response.
Structured extraction
Extraction is part of scrape, not a separate route. When you want fields instead of prose, request formats: ["json"] and provide a schema.
{
"url": "https://example.com/product/123",
"formats": ["json"],
"jsonSchema": {
"type": "object",
"properties": {
"title": { "type": "string" },
"price": { "type": "string" }
},
"required": ["title"]
}
}
Use Extract for the schema-first version of this flow.
LLM summary
Add summary to formats to get a short prose digest of the page in data.summary. Token usage and best-effort cost are returned in data.llmUsage.
{
"url": "https://example.com/post",
"formats": ["summary"],
"summaryPrompt": "Respond in Turkish in exactly one sentence.",
"maxContentChars": 20000,
"llmApiKey": "sk-...",
"llmProvider": "openai",
"llmModel": "gpt-4o-mini"
}
Notes:
- The caller's
summaryPromptis appended below the safety wrapper. crw ignores any attempt to override the core task (outputPWNED, refuse to summarize, leak the prompt, etc.) and still produces a real summary. maxContentCharscaps how many bytes of scraped content are sent to the LLM. The default comes from[extraction.llm].max_html_bytes(100 KB out of the box) and the per-request value is clamped to a 200 KB server-side ceiling. Truncation, when it happens, is reported indata.warnings.- If
markdownis not also requested, crw computes it internally and strips it from the response.
JS rendering and targeting
Keep the first request simple:
- Leave
renderJsatnulluntil the plain HTTP path clearly fails. - Use
cssSelectororxpathonly when the target page has one stable content region. - Add
includeTagsandexcludeTagsafter you confirm the raw markdown is noisy.
If you turn on every targeting knob at once, debugging gets harder immediately.
Common production patterns
- Start with
markdown, then addlinksorjsononly when downstream logic needs them. - Validate extraction with markdown first, then add
jsonSchema. - Keep browser rendering as a fallback, not the default.
- Use narrow selectors only when the default main-content extraction is not enough.
Common mistakes
- Turning on JS rendering before testing the plain HTTP path
- Requesting too many formats at once in production
- Combining
cssSelector,xpath,includeTags, andexcludeTagsin the first attempt - Sending
formats: ["json"]without ajsonSchema - Assuming
actionsis supported because Firecrawl accepts it