Recipe: Batch-Scrape a URL List

Scrape dozens (or hundreds) of unrelated URLs in one async job — no crawling, no link discovery, just your exact list processed in parallel.


When to Use Batch Instead of Crawl

Situation Use
You already know every URL you want Batch
You need to discover links from a seed domain Crawl
URLs span multiple domains Batch
You want maxDepth / maxPages control over discovery Crawl
You want all pages under one site section Crawl
Processing a CSV export, sitemap, or search results list Batch

Batch (POST /v2/batch/scrape) and crawl (POST /v2/crawl) share the same async job machinery and identical status/response envelopes. The difference is the input: batch takes an explicit urls array, crawl takes a single seed URL and discovers the rest itself.


How It Works

POST /v2/batch/scrape        →  { id, url, invalidURLs }
GET  /v2/batch/scrape/{id}   →  { status, total, completed, data[], next }
GET  /v2/batch/scrape/{id}?skip=100   (paginate large results)
DELETE /v2/batch/scrape/{id}  (cancel)
GET  /v2/batch/scrape/{id}/errors

Status values: scrapingcompleted | failed

The response is paginated (100 documents per page, max ~10 MB per page). While status is scraping, the next cursor is set even if the current page is empty — keep polling forward until next is null and status is completed.


Examples

Target URLs

These three unrelated pages are used throughout the examples below:

https://news.ycombinator.com/
https://github.com/trending
https://lobste.rs/

cURL

Step 1 — Start the job

curl -s -X POST https://api.fastcrw.com/v2/batch/scrape \
  -H "Authorization: Bearer $CRW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://news.ycombinator.com/",
      "https://github.com/trending",
      "https://lobste.rs/"
    ],
    "formats": ["markdown", "links"]
  }'

Expected response:

{
  "success": true,
  "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "url": "https://api.fastcrw.com/v2/batch/scrape/a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "invalidURLs": []
}

Save the id value.


Step 2 — Poll for status

JOB_ID="a1b2c3d4-e5f6-7890-abcd-ef1234567890"

curl -s "https://api.fastcrw.com/v2/batch/scrape/$JOB_ID" \
  -H "Authorization: Bearer $CRW_API_KEY"

Repeat every 2–3 seconds until "status": "completed". While still running:

{
  "success": true,
  "status": "scraping",
  "total": 3,
  "completed": 1,
  "creditsUsed": 1,
  "expiresAt": "2026-06-16T10:00:00.000Z",
  "next": "https://api.fastcrw.com/v2/batch/scrape/a1b2c3d4-e5f6-7890-abcd-ef1234567890?skip=0",
  "data": [
    {
      "markdown": "# Hacker News\n...",
      "links": ["https://news.ycombinator.com/item?id=..."],
      "metadata": {
        "title": "Hacker News",
        "sourceURL": "https://news.ycombinator.com/",
        "url": "https://news.ycombinator.com/",
        "statusCode": 200,
        "proxyUsed": "basic",
        "cacheState": "miss",
        "concurrencyLimited": false,
        "creditsUsed": 1,
        "scrapeId": "f1a2b3c4-..."
      }
    }
  ]
}

When complete:

{
  "success": true,
  "status": "completed",
  "total": 3,
  "completed": 3,
  "creditsUsed": 3,
  "expiresAt": "2026-06-16T10:00:00.000Z",
  "next": null,
  "data": [ /* all 3 documents */ ]
}

Step 3 — Paginate large results (optional)

For jobs with many URLs, use ?skip=N to page through results. next always contains the ready-to-use URL:

# Page 2 (skip the first 100 docs)
curl -s "https://api.fastcrw.com/v2/batch/scrape/$JOB_ID?skip=100" \
  -H "Authorization: Bearer $CRW_API_KEY"

Follow next until it is null.


Python

The crw SDK ships batch_scrape() which handles starting, polling, and paginating internally. It returns the flat list of page-result dicts once the job completes.

import os
from crw import CrwClient

client = CrwClient(api_key=os.environ["CRW_API_KEY"])

urls = [
    "https://news.ycombinator.com/",
    "https://github.com/trending",
    "https://lobste.rs/",
]

# Start the job, poll until done, collect all results.
# poll_interval: seconds between status checks (default 2.0)
# timeout: max total wait in seconds (default 300.0)
pages = client.batch_scrape(
    urls,
    formats=["markdown", "links"],
    poll_interval=2.0,
    timeout=120.0,
)

for page in pages:
    meta = page.get("metadata", {})
    print(f"URL: {meta.get('sourceURL')}")
    print(f"Status: {meta.get('statusCode')}")
    md = page.get("markdown", "")
    print(f"Content ({len(md)} chars): {md[:200]}")
    print("---")

Expected output:

URL: https://news.ycombinator.com/
Status: 200
Content (4821 chars): # Hacker News

Ask HN: ... | 312 points | 143 comments ...
---
URL: https://github.com/trending
Status: 200
Content (5102 chars): # Trending repositories on GitHub today ...
---
URL: https://lobste.rs/
Status: 200
Content (3874 chars): # Lobsters

[Show HN] ... submitted 2 hours ago ...
---

Raw HTTP (no SDK) — Python

Use this when you want full control or need to integrate batch scraping into an existing HTTP session:

import os
import time
import urllib.request
import json

API_KEY = os.environ["CRW_API_KEY"]
BASE = "https://api.fastcrw.com"

def _call(method: str, path: str, body: dict | None = None) -> dict:
    url = f"{BASE}{path}"
    data = json.dumps(body).encode() if body else None
    req = urllib.request.Request(
        url, data=data,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method=method,
    )
    with urllib.request.urlopen(req, timeout=30) as r:
        return json.loads(r.read())

# 1. Start batch job
start = _call("POST", "/v2/batch/scrape", {
    "urls": [
        "https://news.ycombinator.com/",
        "https://github.com/trending",
        "https://lobste.rs/",
    ],
    "formats": ["markdown", "links"],
})
job_id = start["id"]
print(f"Job started: {job_id}")

# 2. Poll until completed
while True:
    status = _call("GET", f"/v2/batch/scrape/{job_id}")
    print(f"  {status['completed']}/{status['total']} completed ({status['status']})")
    if status["status"] == "completed":
        break
    if status["status"] == "failed":
        raise RuntimeError(f"Batch failed: {status.get('error')}")
    time.sleep(2)

# 3. Collect all pages (follow `next` cursor for large jobs)
all_docs = []
skip = 0
while True:
    page = _call("GET", f"/v2/batch/scrape/{job_id}?skip={skip}")
    all_docs.extend(page["data"])
    next_url = page.get("next")
    if not next_url or page["status"] != "completed":
        break
    # parse skip from the next URL
    skip = int(next_url.split("skip=")[-1])

print(f"\nCollected {len(all_docs)} documents")
for doc in all_docs:
    src = doc["metadata"]["sourceURL"]
    chars = len(doc.get("markdown") or "")
    print(f"  {src}: {chars} chars of markdown")

Expected output:

Job started: a1b2c3d4-e5f6-7890-abcd-ef1234567890
  1/3 completed (scraping)
  2/3 completed (scraping)
  3/3 completed (completed)

Collected 3 documents
  https://news.ycombinator.com/: 4821 chars of markdown
  https://github.com/trending: 5102 chars of markdown
  https://lobste.rs/: 3874 chars of markdown

Key Request Fields

All fields except urls are optional.

Field Type Default Notes
urls string[] required At least one valid URL
formats string[] ["markdown"] Any of: markdown, html, rawHtml, plainText, links, json, summary, changeTracking
onlyMainContent bool true Strip nav/footer boilerplate
waitFor number MS to wait for JS after load
includeTags string[] HTML tags to keep
excludeTags string[] HTML tags to remove
ignoreInvalidURLs bool true Skip unparseable URLs; false = reject the whole request
proxy string "auto" "basic" or "stealth" (residential)
location.country string 2-letter country code for proxy egress
timeout number Per-URL timeout in ms

Invalid URLs (SSRF-blocked, unparseable) are returned in invalidURLs on the start response and skipped from the job unless ignoreInvalidURLs is false.


Key Response Fields

Start response (POST /v2/batch/scrape):

id           — UUID for polling/cancellation
url          — ready-to-use status URL
invalidURLs  — URLs that were skipped

Status response (GET /v2/batch/scrape/{id}):

status        — "scraping" | "completed" | "failed"
total         — total URLs in the job
completed     — URLs finished so far
creditsUsed   — credits consumed so far
expiresAt     — RFC3339 UTC expiry of this job in server memory
next          — pagination cursor URL (null when done)
data[]        — Document objects for this page
  .markdown   — page content as Markdown
  .links      — outbound link URLs (if requested)
  .metadata
    .sourceURL      — original URL
    .statusCode     — HTTP status of the page
    .proxyUsed      — "basic" or "stealth"
    .creditsUsed    — credits for this document
    .scrapeId       — per-document UUID

Cancelling a Job

curl -s -X DELETE "https://api.fastcrw.com/v2/batch/scrape/$JOB_ID" \
  -H "Authorization: Bearer $CRW_API_KEY"

Returns { "success": true, "status": "cancelled", "message": "..." }.


Checking Errors

URLs that fail mid-job are recorded but don't fail the entire batch. Retrieve them after the job completes:

curl -s "https://api.fastcrw.com/v2/batch/scrape/$JOB_ID/errors" \
  -H "Authorization: Bearer $CRW_API_KEY"

Returns { "success": true, "errors": [...], "robotsBlocked": [] }.