Output Formats
crw supports 6 output formats. Request multiple formats in a single scrape call.
Formats
| Format | Key | Description |
|---|---|---|
| Markdown | `markdown` | HTML converted to Markdown via fast_html2md |
| HTML | `html` | Cleaned HTML with main content extraction |
| Raw HTML | `rawHtml` | Original unmodified HTML |
| Plain Text | `plainText` | Stripped to plain text, no formatting |
| Links | `links` | All `<a href>` links extracted (excludes fragment-only `#` links and `javascript:` URLs) |
| JSON | `json` | LLM structured extraction with JSON schema validation |
| Extract | `extract` | Alias for `json`, accepted for Firecrawl compatibility |
Which Format Should You Choose?
The practical rule is simple:
- choose `markdown` when the output is headed into search, RAG, summarization, or LLM prompts,
- choose `html` when you still want cleaned structure,
- choose `rawHtml` only when you truly need the original source,
- choose `links` when discovery matters as much as page content,
- and choose `json` when the end result needs to be schema-shaped.
For most product and retrieval workflows, markdown is the best default because it is compact, readable, and easier to inspect than raw markup.
Common Format Combinations
| Combination | Good for |
|---|---|
| `["markdown"]` | Default page extraction |
| `["markdown", "links"]` | Content plus local link discovery |
| `["html", "rawHtml"]` | Debugging the extraction pipeline |
| `["json"]` | Structured extraction only |
| `["markdown", "json"]` | Human-readable content plus structured fields |
:::tip
In production, request only the formats you will actually store or process. Requesting more formats is convenient for debugging but adds unnecessary overhead.
:::
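As a rough illustration of the combinations above, the sketch below assembles a scrape request body in Python. The `url`, `formats`, and `jsonSchema` keys come from the request example on this page; the helper name `build_scrape_request` and the client-side format check are assumptions made for illustration, not part of crw.

```python
import json

# Format keys from the table above, plus the documented aliases for "json".
KNOWN_FORMATS = {
    "markdown", "html", "rawHtml", "plainText", "links",
    "json", "extract", "llm-extract",
}


def build_scrape_request(url, formats=("markdown",), json_schema=None):
    """Build a scrape request body, rejecting unknown format keys early.

    Hypothetical helper: the early rejection is a client-side convenience only.
    """
    unknown = [f for f in formats if f not in KNOWN_FORMATS]
    if unknown:
        raise ValueError(f"unknown formats: {unknown}")
    body = {"url": url, "formats": list(formats)}
    if json_schema is not None:
        body["jsonSchema"] = json_schema
    return body


# Content plus local link discovery, as in the second row of the table.
body = build_scrape_request("https://example.com/docs", ["markdown", "links"])
print(json.dumps(body, indent=2))
```

Keeping the format list in one place makes it easier to request only what you store, in line with the tip above.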
Markdown
The default and most commonly used format. crw uses fast_html2md for conversion with a multi-step fallback chain:
- Convert main-content HTML to Markdown
- If output is too short, try full cleaned HTML
- If still short, try without `onlyMainContent`
- If still short, try raw HTML
- Last resort: fall back to plain text
This chain is designed to return meaningful output even from unusual page structures.
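The fallback chain can be pictured as "try each conversion in order and stop at the first result that is long enough". The Python sketch below shows only that control flow. The threshold `MIN_LENGTH`, the helper `first_adequate`, and the stand-in converters are all hypothetical; this page does not say what crw counts as "too short", and the real conversion is done by fast_html2md inside crw.

```python
from typing import Callable, Sequence

MIN_LENGTH = 100  # hypothetical threshold; the docs do not define "too short"


def first_adequate(attempts: Sequence[Callable[[], str]],
                   min_length: int = MIN_LENGTH) -> str:
    """Run each attempt in order and return the first result that is long enough.

    If every attempt is too short, the result of the last attempt is returned.
    """
    result = ""
    for attempt in attempts:
        result = attempt().strip()
        if len(result) >= min_length:
            break
    return result


# Stand-in converters, ordered like the list above.
attempts = [
    lambda: "",                                  # main-content HTML to Markdown
    lambda: "# Title",                           # full cleaned HTML
    lambda: "# Title\n\n" + "Body text. " * 20,  # without onlyMainContent
    lambda: "raw HTML conversion",               # raw HTML (not reached here)
    lambda: "plain text",                        # last resort (not reached here)
]
markdown = first_adequate(attempts)
```

Because each attempt is a callable, later and more expensive steps only run when the earlier ones come back short.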
HTML
Returns cleaned HTML after the extraction pipeline:
- Scripts, styles, iframes removed
- Navigation, footer, sidebar removed (if `onlyMainContent`)
- CSS selector filters applied (`includeTags`/`excludeTags`)
Links
Extracts all anchor hrefs from the page. Useful for site mapping and link analysis.
Excluded: # fragment-only links and javascript: URLs.
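To make the exclusion rule concrete, here is a small Python sketch that collects anchor hrefs and drops fragment-only and `javascript:` links. It uses the standard library's `html.parser` and is not crw's implementation. The names `keep_link` and `LinkCollector` are made up for this example, and whether crw resolves relative URLs or deduplicates links is not stated here, so the sketch does neither.

```python
from html.parser import HTMLParser


def keep_link(href: str) -> bool:
    """Return False for fragment-only links and javascript: URLs."""
    href = href.strip()
    if not href or href.startswith("#"):  # skipping empty hrefs is an assumption
        return False
    return not href.lower().startswith("javascript:")


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags that pass keep_link."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and keep_link(href):
            self.links.append(href)


collector = LinkCollector()
collector.feed(
    '<a href="/docs">Docs</a>'
    '<a href="#top">Top</a>'
    '<a href="javascript:void(0)">Menu</a>'
    '<a href="https://example.com/a#b">A</a>'
)
links = collector.links  # "/docs" and "https://example.com/a#b" remain
```

Note that a URL with a fragment (`https://example.com/a#b`) is kept; only links that consist of a fragment alone are dropped.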
JSON (LLM Extraction)
Sends the page content to an LLM (Anthropic or OpenAI) and extracts structured data matching a JSON schema.
{
  "url": "https://example.com/product",
  "formats": ["json"],
  "jsonSchema": {
    "type": "object",
    "properties": {
      "title": { "type": "string" },
      "price": { "type": "number" }
    }
  }
}
The LLM response is validated against the provided schema using the jsonschema crate. If the model wraps its JSON in a fenced code block, crw strips the fence automatically before validation.
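The fence-stripping step is easy to reproduce if you post-process model output yourself. The Python sketch below is an illustration of the idea only, not crw's code (crw does this internally in Rust). The helper name `parse_model_json` and the regular expression are assumptions, and schema validation is left out.

```python
import json
import re

FENCE = "`" * 3  # built at runtime so the marker never appears literally here

# One optional surrounding fence, with an optional language tag such as "json".
_FENCED = re.compile(
    r"^\s*" + FENCE + r"[A-Za-z0-9_-]*[ \t]*\n(.*?)\n?[ \t]*" + FENCE + r"\s*$",
    re.DOTALL,
)


def parse_model_json(text: str):
    """Strip one surrounding code fence, if present, then parse the JSON."""
    match = _FENCED.match(text)
    if match:
        text = match.group(1)
    return json.loads(text)


fenced = FENCE + 'json\n{"title": "Widget", "price": 9.5}\n' + FENCE
record = parse_model_json(fenced)
```

Unfenced input passes straight through to `json.loads`, so the same helper works whether or not the model added a fence.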
Supported LLM providers
| Provider | Tool mechanism |
|---|---|
| Anthropic | tool_use with input_schema |
| OpenAI | Function calling with parameters |
Configure in config.toml:
[extraction.llm]
provider = "anthropic" # or "openai"
api_key = "sk-..."
model = "claude-sonnet-4-20250514"
max_tokens = 4096
# base_url = "https://..." # for OpenAI-compatible endpoints
Response Shape
Each format populates a corresponding field in the response data object:
| Format | Response field | Type |
|---|---|---|
| `markdown` | `markdown` | string |
| `html` | `html` | string |
| `rawHtml` | `rawHtml` | string |
| `plainText` | `plainText` | string |
| `links` | `links` | string[] |
| `json` / `extract` | `json` | object |
Full Response Schema
Every API response follows this envelope:
{
  "success": true,
  "data": { ... },
  "error": "...",
  "warning": "..."
}
The exact shape of data depends on what you requested. Do not assume every field is always present.
data object (scrape)
| Field | Type | Present when |
|---|---|---|
| `markdown` | string / null | `formats` includes `markdown` or `json` |
| `html` | string / null | `formats` includes `html` |
| `rawHtml` | string / null | `formats` includes `rawHtml` |
| `plainText` | string / null | `formats` includes `plainText` |
| `links` | string[] / null | `formats` includes `links` |
| `json` | object / null | `formats` includes `json` AND `jsonSchema` provided AND LLM configured |
| `chunks` | ChunkResult[] / null | `chunkStrategy` provided |
| `warning` | string / null | Target returned error status, anti-bot detected, etc. |
| `metadata` | object | Always |
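Since most fields may be absent or null, client code is safer when it reads them defensively. The Python sketch below shows one way to do that against the envelope described above. The helper name `read_scrape_response` and the sample payload are invented for illustration; only the field names are taken from this page.

```python
def read_scrape_response(payload: dict) -> dict:
    """Return the data object, raising if the envelope reports failure."""
    if not payload.get("success"):
        raise RuntimeError(payload.get("error") or "scrape failed")
    return payload.get("data") or {}


# Made-up sample response for a request with formats = ["markdown"].
payload = {
    "success": True,
    "data": {
        "markdown": "# Example",
        "metadata": {
            "sourceURL": "https://example.com/",
            "statusCode": 200,
            "elapsedMs": 120,
        },
    },
}

data = read_scrape_response(payload)
markdown = data.get("markdown") or ""     # absent or null becomes ""
links = data.get("links") or []           # not requested here, so empty
structured = data.get("json")             # None unless json extraction ran
status = data["metadata"]["statusCode"]   # metadata is documented as always present
warning = data.get("warning") or payload.get("warning")
```

Using `.get(...)` with a fallback treats a missing key and an explicit null the same way, which matches the "string / null" typing in the table.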
metadata object
| Field | Type | Description |
|---|---|---|
| `title` | string / null | Page `<title>` |
| `description` | string / null | Meta description |
| `ogTitle` | string / null | Open Graph title |
| `ogDescription` | string / null | Open Graph description |
| `ogImage` | string / null | Open Graph image URL |
| `canonicalUrl` | string / null | Canonical link |
| `sourceURL` | string | Final URL after redirects |
| `language` | string / null | `<html lang>` value |
| `statusCode` | number | Target HTTP status code |
| `renderedWith` | string / null | Usually "http", "lightpanda", "playwright", "chrome", "pdf", or "http_only_fallback" |
| `elapsedMs` | number | Total processing time in ms |
ChunkResult object
| Field | Type | Description |
|---|---|---|
| `content` | string | Chunk text |
| `score` | number / null | Relevance score (present when `query` + `filterMode` set) |
| `index` | number | Original chunk position |
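Because `score` can be null, sorting chunks needs a little care. The Python sketch below ranks scored chunks first and keeps unscored ones in document order at the end. It assumes that a higher score means more relevant, which this page does not state explicitly, and `rank_chunks` is a hypothetical helper name.

```python
def rank_chunks(chunks):
    """Order chunks by score, highest first; unscored chunks go last by index."""
    return sorted(
        chunks,
        key=lambda c: (
            c.get("score") is None,      # scored chunks sort before unscored ones
            -(c.get("score") or 0.0),    # higher score first (assumption)
            c["index"],                  # ties fall back to original position
        ),
    )


# Made-up ChunkResult objects for illustration.
chunks = [
    {"content": "intro", "score": 0.2, "index": 0},
    {"content": "footer", "score": None, "index": 1},
    {"content": "pricing", "score": 0.9, "index": 2},
]
ranked = [c["content"] for c in rank_chunks(chunks)]
```

To restore reading order after filtering, sort the selected chunks by `index` again.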
Format Aliases
"extract" and "llm-extract" are accepted as aliases for "json". The canonical name is json. All three behave identically — they require jsonSchema for structured extraction.
Implementation Guidance
Three habits keep format usage sane in production:
- request only the formats you really consume,
- keep `metadata` with the stored output so later debugging is easier,
- and validate `data.json` in your own application before trusting it as final truth.
If you are debugging extraction quality, request both markdown and json for a while. That makes it easy to compare the page text against the structured output.
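As a sketch of the third habit, the Python snippet below runs an application-side check on `data.json` for the `title`/`price` schema used earlier on this page. The helper `field_errors` and the `EXPECTED` mapping are assumptions for illustration. The check is deliberately stricter than the example schema, which declares no required fields, and it is a lightweight type check rather than full JSON Schema validation.

```python
# Expected fields and Python types for the example product schema.
EXPECTED = {"title": str, "price": (int, float)}


def field_errors(record, expected=EXPECTED):
    """Return a list of human-readable problems; an empty list means OK."""
    if not isinstance(record, dict):
        return ["data.json is not an object"]
    errors = []
    for name, types in expected.items():
        value = record.get(name)
        if value is None:
            errors.append(f"{name}: missing")
        elif isinstance(value, bool) or not isinstance(value, types):
            # bool is a subclass of int in Python, so reject it explicitly.
            errors.append(f"{name}: unexpected type {type(value).__name__}")
    return errors


ok = field_errors({"title": "Widget", "price": 9.5})
bad = field_errors({"title": "Widget", "price": "9.5"})
```

Returning a list of problems instead of raising makes it straightforward to log every issue alongside the stored `metadata`.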
Not Supported in This Release
- `screenshot`: not implemented. Requesting it will return a 422 error.
- `actions`: click/scroll/wait actions are not yet supported. Sending `actions` will return a 400 error with a message suggesting `cssSelector` or `xpath` as alternatives.