# Crawl
Recursively crawl a site when one page is not enough. CRW crawl is asynchronous by design: start a job, poll it, and widen scope only after the first batch looks correct.
Start with `maxPages: 5` and `maxDepth: 1`. If the returned batch is wrong, a larger crawl only makes the mistake more expensive.

## Crawling a site with CRW
The crawl API lives at `/v1/crawl`:

```
POST /v1/crawl
GET /v1/crawl/{id}
DELETE /v1/crawl/{id}
```
Authentication:

- Hosted: send `Authorization: Bearer YOUR_API_KEY`
- Self-hosted: only required when `auth.api_keys` is configured
## Installation

There is nothing to install: CRW crawl is plain HTTP. You start the job with one request and check its status with another.
## Basic usage
Start with this request:
```json
{
  "url": "https://docs.example.com",
  "maxDepth": 1,
  "maxPages": 5,
  "formats": ["markdown"],
  "onlyMainContent": true
}
```
:::tabs
::tab{title="Python"}

```python
import requests
import time

start = requests.post(
    "https://fastcrw.com/api/v1/crawl",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "url": "https://docs.example.com",
        "maxDepth": 1,
        "maxPages": 5,
        "formats": ["markdown"],
    },
)
crawl_id = start.json()["id"]

time.sleep(2)
status = requests.get(
    f"https://fastcrw.com/api/v1/crawl/{crawl_id}",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)
print(status.json()["status"])
```
::tab{title="Node.js"}

```javascript
const start = await fetch("https://fastcrw.com/api/v1/crawl", {
  method: "POST",
  headers: {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    url: "https://docs.example.com",
    maxDepth: 1,
    maxPages: 5,
    formats: ["markdown"]
  })
});
const { id } = await start.json();

await new Promise((resolve) => setTimeout(resolve, 2000));
const status = await fetch(`https://fastcrw.com/api/v1/crawl/${id}`, {
  headers: { "Authorization": "Bearer YOUR_API_KEY" }
});
console.log((await status.json()).status);
```
::tab{title="cURL"}

```bash
curl -X POST https://fastcrw.com/api/v1/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "maxDepth": 1,
    "maxPages": 5,
    "formats": ["markdown"]
  }'
```
:::
## Response
Start response:
```json
{
  "success": true,
  "id": "550e8400-e29b-41d4-a716-446655440000"
}
```
Poll response:
```json
{
  "success": true,
  "status": "scraping",
  "total": 5,
  "completed": 2,
  "data": []
}
```
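For progress logging, the `total` and `completed` fields in the poll response can be reduced to a fraction. A minimal sketch (the helper name is ours, not part of the API; `total` is guarded because it may be absent or zero while link discovery is still running):

```python
def crawl_progress(poll: dict) -> float:
    """Return crawl completion as a fraction between 0.0 and 1.0.

    `total` can be missing or zero while the crawler is still
    discovering links, so guard against division by zero.
    """
    total = poll.get("total") or 0
    return poll.get("completed", 0) / total if total else 0.0
```

With the poll response above, `crawl_progress(...)` yields `0.4`.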
## Parameters
| Field | Type | Default | Description |
|---|---|---|---|
| `url` | string | required | Starting URL |
| `maxDepth` | number | `2` | Maximum depth from the start URL |
| `maxPages` | number | `100` | Maximum number of pages to crawl |
| `limit` | number | alias | Firecrawl-compatible alias for `maxPages` |
| `max_pages` | number | alias | Snake_case alias for `maxPages` |
| `formats` | string[] | `["markdown"]` | Output formats for each page |
| `onlyMainContent` | boolean | `true` | Remove boilerplate content before conversion |
| `jsonSchema` | object | -- | Optional schema for structured extraction per page |
## Scrape options and extraction
Crawl inherits the same content-format logic as scrape:
- Start with `formats: ["markdown"]`
- Add extraction only after the first crawl batch looks correct
- Keep `onlyMainContent: true` unless you explicitly need full-page noise
If you need to debug one problematic page, go back to Scrape and validate that page in isolation first.
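When you do add extraction, `jsonSchema` rides alongside the other crawl options in the same request body. The schema below is illustrative only: the field names are ours, and the exact schema dialect `jsonSchema` accepts should be confirmed against the Scrape documentation:

```json
{
  "url": "https://docs.example.com",
  "maxDepth": 1,
  "maxPages": 5,
  "formats": ["markdown"],
  "onlyMainContent": true,
  "jsonSchema": {
    "type": "object",
    "properties": {
      "title": { "type": "string" }
    }
  }
}
```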
## Checking job status
Poll the crawl ID until the status is `completed` or `failed`:
```bash
curl -H "Authorization: Bearer YOUR_API_KEY" \
  https://fastcrw.com/api/v1/crawl/CRAWL_ID
```
Status response shape:
```json
{
  "success": true,
  "status": "scraping | completed | failed",
  "total": 12,
  "completed": 12,
  "data": [
    {
      "markdown": "# Page content",
      "metadata": {
        "sourceURL": "https://example.com/page"
      }
    }
  ],
  "error": "optional error"
}
```
## Cancellation and limits
Cancel a running job with:

```
DELETE /v1/crawl/{id}
```
CRW crawl stays within the same origin and should be treated as a bounded, respectful site job, not an open-ended spider.
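In Python, cancellation is a plain DELETE; a standard-library sketch that builds the request without sending it (the function name is ours):

```python
import urllib.request


def cancel_crawl_request(crawl_id: str, api_key: str) -> urllib.request.Request:
    """Build the DELETE request that cancels a running crawl job."""
    return urllib.request.Request(
        f"https://fastcrw.com/api/v1/crawl/{crawl_id}",
        method="DELETE",
        headers={"Authorization": f"Bearer {api_key}"},
    )


# To actually send it:
# urllib.request.urlopen(cancel_crawl_request(crawl_id, api_key))
```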
## Common production patterns
- Run Map first when you are unsure about the reachable section.
- Keep `maxPages` very low on first contact with a new site.
- Poll with backoff instead of hammering the same crawl ID.
- Use extraction only after the markdown output of the first crawl batch looks correct.
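The backoff advice can be sketched as a small polling loop. Here `fetch_status` is any callable returning the poll JSON (for example, a wrapper around `requests.get(...).json()`); the delay schedule and cap are our assumptions, not service guidance:

```python
import time


def backoff_delays(base=2.0, factor=2.0, cap=30.0, tries=6):
    """Yield sleep intervals: 2s, 4s, 8s, ... capped at `cap` seconds."""
    delay = base
    for _ in range(tries):
        yield min(delay, cap)
        delay *= factor


def poll_until_done(fetch_status, delays=None):
    """Poll until the crawl reaches a terminal state, sleeping between tries."""
    for delay in (delays if delays is not None else backoff_delays()):
        status = fetch_status()
        if status["status"] in ("completed", "failed"):
            return status
        time.sleep(delay)
    raise TimeoutError("crawl did not finish within the polling budget")
```

Capping the delay keeps long jobs responsive while still easing pressure on the status endpoint.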
## Common mistakes
- Starting with `maxPages: 500` before validating the target
- Treating `crawl` like a synchronous route
- Assuming crawl crosses origins
- Ignoring `robots.txt` and target-side rate behavior