PDF Parsing

Upload a PDF and get clean markdown, plain text, or structured JSON back from a single multipart request. Text comes from the PDF's embedded text layer, so no external OCR pipeline is involved.

Best for: research papers, reports, contracts
Route: POST /v2/parse
Max upload: 50 MiB

/v2/parse

POST /v2/parse
Content-Type: multipart/form-data

Authentication:

  • Hosted: send Authorization: Bearer YOUR_API_KEY
  • Self-hosted: only required when auth.api_keys is configured

The route accepts a multipart/form-data body with two parts:

| Part | Required | Description |
| --- | --- | --- |
| file | yes | The PDF file bytes. Must begin with a %PDF- header. |
| options | no | A JSON string with output options (see Options below). |

The file must be a valid PDF. Binary files with other content types are rejected with 400 Bad Request. A corrupt or encrypted PDF returns 422 Unprocessable Entity. The maximum upload size is 50 MiB (52,428,800 bytes); requests above this limit receive a 413 Content Too Large before the body is fully read.
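
The two cheapest checks above (the %PDF- header and the 50 MiB cap) can be mirrored client-side before uploading. The following is a minimal sketch, not part of the CRW SDK; the function name `preflight_pdf` and the constants are illustrative assumptions. It checks the file bytes only, so a file just under the cap could still exceed the limit once multipart framing is added.

```python
MAX_UPLOAD_BYTES = 50 * 1024 * 1024  # 52,428,800 bytes, the documented cap
PDF_MAGIC = b"%PDF-"                 # documented required header


def preflight_pdf(data: bytes) -> None:
    """Raise ValueError if the bytes would fail the documented upload checks.

    Hypothetical helper: it cannot detect corrupt or encrypted PDFs,
    which the server reports as 422.
    """
    if not data.startswith(PDF_MAGIC):
        raise ValueError("missing %PDF- header (server would answer 400)")
    if len(data) > MAX_UPLOAD_BYTES:
        raise ValueError(
            f"{len(data)} bytes exceeds the 50 MiB cap (server would answer 413)"
        )
```

Calling `preflight_pdf(open(path, "rb").read())` before the request avoids spending an upload on a file the server is documented to reject.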

Quick start {#quick-start}

curl:

curl -X POST https://api.fastcrw.com/v2/parse \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@/path/to/document.pdf;type=application/pdf"

Python SDK:

from crw import CrwClient

client = CrwClient()  # reads CRW_API_KEY from env
result = client.parse_file(path="/path/to/document.pdf")
print(result["markdown"])

TypeScript SDK:

import { CrwClient } from "crw-sdk";
import { readFileSync } from "fs";

const client = new CrwClient();
const bytes = new Uint8Array(readFileSync("/path/to/document.pdf"));
const result = await client.parseFile(bytes, { filename: "document.pdf" });
console.log(result.markdown);

Python (requests):

import requests

with open("/path/to/document.pdf", "rb") as f:
    resp = requests.post(
        "https://api.fastcrw.com/v2/parse",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        files={"file": ("document.pdf", f, "application/pdf")},
    )

print(resp.json()["data"]["markdown"])

Response

{
  "success": true,
  "data": {
    "markdown": "# Annual Report 2025\n\nRevenue grew 18%...",
    "metadata": {
      "sourceURL": "upload://document.pdf",
      "statusCode": 200,
      "numPages": 12,
      "sourceFilename": "document.pdf",
      "proxyUsed": "basic",
      "cacheState": "miss",
      "concurrencyLimited": false,
      "creditsUsed": 1,
      "scrapeId": "a1b2c3d4-..."
    }
  }
}

Key metadata fields for PDF responses:

| Field | Description |
| --- | --- |
| numPages | Total number of pages in the document |
| sourceFilename | Original filename passed in the upload |
| sourceURL | Always upload://<filename> for parse requests |

Options

Pass options as a JSON string in the options multipart field. All fields are optional; defaults match /v2/scrape.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| formats | string[] | ["markdown"] | Output formats. See Formats. |
| parsers | ParserSpec[] | [{"type":"pdf"}] | Parser directives. See parsers[]. |
| jsonSchema | object |  | JSON Schema for LLM extraction. Requires formats: ["json"]. |
| summaryPrompt | string |  | Custom prompt for formats: ["summary"]. |
| maxContentChars | number |  | Truncate each content field to this many characters. |

Example with options

curl -X POST https://api.fastcrw.com/v2/parse \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@/path/to/document.pdf;type=application/pdf" \
  -F 'options={"formats":["markdown","plainText"],"parsers":[{"type":"pdf","maxPages":5}]}'
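
Because options travels as a JSON string inside a form field, it is easy to get the quoting wrong by hand. Below is a small sketch of building that string in Python; `build_options` is a hypothetical helper, not an SDK function, and it only uses field names from the table above.

```python
import json


def build_options(formats=None, max_pages=None, mode=None, **extra) -> str:
    """Serialize parse options into the JSON string for the `options` part.

    Hypothetical helper. Keys left as None are omitted so the server
    falls back to its documented defaults.
    """
    options = dict(extra)
    if formats is not None:
        options["formats"] = list(formats)

    # Only send a parsers entry when something beyond the default is set.
    parser = {"type": "pdf"}
    if mode is not None:
        parser["mode"] = mode
    if max_pages is not None:
        parser["maxPages"] = max_pages
    if len(parser) > 1:
        options["parsers"] = [parser]

    return json.dumps(options)
```

With requests, the resulting string can be sent as `data={"options": build_options(...)}` next to `files={"file": ...}`; requests encodes both into the same multipart body.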

Formats {#formats}

/v2/parse supports a subset of the scrape formats. Formats that require a renderer (html, rawHtml, changeTracking) are not applicable to uploaded documents and return a warning if requested.

| Format | Description |
| --- | --- |
| markdown | Extracted text as Markdown (default) |
| plainText | Extracted text without Markdown syntax |
| links | Array of URLs found in the document |
| json | Structured extraction via LLM + jsonSchema |
| summary | LLM-generated prose summary |

json and summary require an LLM provider configured on the engine (or a per-request llmApiKey). Without one, the server returns a 400 with a clear message.
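
The format rules above can be checked before sending a request. This is an illustrative sketch only: `check_formats` and its messages are assumptions, and the sets simply restate the lists in this section.

```python
SUPPORTED = {"markdown", "plainText", "links", "json", "summary"}
RENDERER_ONLY = {"html", "rawHtml", "changeTracking"}  # warning if requested
LLM_FORMATS = {"json", "summary"}                      # need an LLM provider


def check_formats(requested, llm_configured: bool) -> list[str]:
    """Return human-readable problems for a requested formats list."""
    problems = []
    for fmt in requested:
        if fmt in RENDERER_ONLY:
            problems.append(f"{fmt}: requires a renderer, expect a warning")
        elif fmt not in SUPPORTED:
            problems.append(f"{fmt}: not a documented /v2/parse format")
        elif fmt in LLM_FORMATS and not llm_configured:
            problems.append(f"{fmt}: needs an LLM provider, expect a 400")
    return problems
```

An empty list means the request uses only formats this page documents for /v2/parse.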

parsers[] {#parsers}

The parsers field controls how the document is processed. Currently only "pdf" is supported.

Accepted forms (the server accepts both):

// Short form (bare string):
"parsers": ["pdf"]

// Object form (full control):
"parsers": [{ "type": "pdf", "mode": "auto", "maxPages": 10 }]

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| type | string |  | Parser type. Only "pdf" is supported today. |
| mode | string | "auto" | "auto" or "fast" use text extraction. "ocr" is accepted for wire-compatibility, but fastCRW has no OCR pipeline; scanned pages return empty text with a warning. |
| maxPages | number | no limit | Cap the number of pages converted. Useful for very large documents where you only need the first N pages. |

fastCRW performs **text-layer extraction only**. Image-only (scanned) PDFs that have no embedded text layer return empty or near-empty markdown. A warning is not guaranteed for each scanned page, so compare `numPages` against the length of the returned content if you expect text.
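
One way to act on that advice is a rough characters-per-page check on the response. The sketch below assumes the result shape used by the SDK examples on this page (`markdown` plus `metadata.numPages`); the function name and the 50-character threshold are arbitrary assumptions to tune for your documents.

```python
def looks_scanned(result: dict, min_chars_per_page: int = 50) -> bool:
    """Heuristic: True when the text is implausibly short for the page count.

    The threshold is a guess, not a documented value. Sparse but genuine
    PDFs (for example, slide decks) can trigger false positives.
    """
    pages = result.get("metadata", {}).get("numPages") or 0
    text = (result.get("markdown") or "").strip()
    if pages == 0:
        return False  # nothing to compare against
    return len(text) < pages * min_chars_per_page
```

A True result is a signal to route the file to an OCR tool of your choice, since fastCRW will not extract text from image-only pages.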

SDK usage

Python SDK — client.parse_file()

The Python SDK parse_file() works in both HTTP mode (cloud or self-hosted server) and local subprocess mode.

from crw import CrwClient

client = CrwClient()  # CRW_API_KEY from env

# From a file on disk:
result = client.parse_file(path="/path/to/report.pdf")

# From bytes already in memory:
with open("/path/to/report.pdf", "rb") as f:
    pdf_bytes = f.read()
result = client.parse_file(content=pdf_bytes, filename="report.pdf")

# Multiple formats:
result = client.parse_file(
    path="/path/to/report.pdf",
    formats=["markdown", "plainText"],
)

# Page cap:
result = client.parse_file(
    path="/path/to/large-report.pdf",
    parsers=[{"type": "pdf", "maxPages": 20}],
)

print(result["markdown"])
print("Pages:", result["metadata"]["numPages"])

Signature:

def parse_file(
    path: str | None = None,
    *,
    content: bytes | None = None,
    filename: str | None = None,
    formats: list[str] | None = None,
    json_schema: dict | None = None,
    parsers: list[str] | None = None,
    **kwargs,
) -> dict

Provide either path (file on disk) or content (raw bytes). filename defaults to the basename of path, or "document.pdf" when using content= without a name.
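
To make the path/content rule and the filename fallback concrete, here is a sketch of how such argument resolution could look. It is not the SDK's actual implementation; `resolve_upload` is a hypothetical name, and rejecting calls that pass both or neither argument is an assumption.

```python
import os


def resolve_upload(path=None, *, content=None, filename=None):
    """Return (bytes, filename) following the rules described above.

    Illustrative only: the real SDK may validate differently.
    """
    if (path is None) == (content is None):
        raise ValueError("provide exactly one of path or content")
    if path is not None:
        with open(path, "rb") as f:
            content = f.read()
        # filename defaults to the basename of path
        filename = filename or os.path.basename(path)
    # content= without a name falls back to "document.pdf"
    return content, filename or "document.pdf"
```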

TypeScript SDK — client.parseFile()

The TypeScript SDK takes the file as a Uint8Array (not a path). Read the file before calling.

import { CrwClient } from "crw-sdk";
import { readFileSync } from "fs";

const client = new CrwClient(); // CRW_API_KEY from env

// Basic:
const bytes = new Uint8Array(readFileSync("report.pdf"));
const result = await client.parseFile(bytes, { filename: "report.pdf" });
console.log(result.markdown);

// With options:
const result2 = await client.parseFile(bytes, {
  filename: "report.pdf",
  formats: ["markdown", "plainText"],
  parsers: [{ type: "pdf", maxPages: 20 }],
});
console.log(result2.metadata.numPages);

Signature:

parseFile(
  content: Uint8Array,
  opts?: ParseFileOptions,
): Promise<ParseResult>

interface ParseFileOptions {
  filename?: string;       // default: "document.pdf"
  formats?: string[];
  jsonSchema?: object;
  parsers?: string[];
  [key: string]: unknown;  // any other engine option passed through
}

Python vs TypeScript asymmetry: The Python SDK accepts either a path string or raw content bytes (your choice). The TypeScript SDK accepts only raw bytes (Uint8Array) — you must read the file before calling parseFile. This is intentional: TypeScript environments (Deno, edge runtimes, browsers) cannot always read from a filesystem path.

MCP tool — crw_parse_file

When running CRW via MCP (e.g. in Claude Desktop or Cursor), the crw_parse_file tool is available. It accepts Base64-encoded PDF bytes — the MCP transport cannot carry raw binary.

Tool definition (excerpt):

{
  "name": "crw_parse_file",
  "description": "Parse a local PDF (base64 in contentBase64) to markdown. No OCR: scanned PDFs return empty markdown with a warning.",
  "inputSchema": {
    "type": "object",
    "properties": {
      "contentBase64": { "type": "string", "description": "Base64-encoded PDF bytes" },
      "filename":      { "type": "string", "description": "Original filename (optional)" },
      "formats": {
        "type": "array",
        "items": { "type": "string", "enum": ["markdown", "plainText", "links", "json", "summary"] }
      },
      "jsonSchema":    { "type": "object", "description": "JSON schema for LLM extraction" },
      "parsers": {
        "type": "array",
        "items": { "type": "string", "enum": ["pdf"] }
      },
      "maxLength": { "type": "integer", "description": "Max chars per content field; 0 = unbounded" }
    },
    "required": ["contentBase64"]
  }
}

Example call from Python with the MCP transport:

from crw import CrwClient

# local mode: CRW_LOCAL=1, no HTTP server needed
client = CrwClient()  # subprocess mode

with open("report.pdf", "rb") as f:
    pdf_bytes = f.read()

# The SDK handles base64 encoding automatically in local mode:
result = client.parse_file(content=pdf_bytes, filename="report.pdf")
print(result["markdown"])

The Python and TypeScript SDKs automatically base64-encode the bytes when running in local/MCP mode. You do not call crw_parse_file directly — call parse_file() / parseFile() and the SDK chooses the transport.
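
For clients that speak MCP without the SDK, the encoding step looks roughly like the sketch below. It builds only the arguments object for crw_parse_file from the tool schema above; the helper name is hypothetical, and how the call is dispatched depends on your MCP client.

```python
import base64


def crw_parse_file_arguments(pdf_bytes: bytes, filename=None, formats=None) -> dict:
    """Build the `arguments` payload for the crw_parse_file tool.

    Hypothetical helper. Only `contentBase64` is required by the schema.
    """
    args = {"contentBase64": base64.b64encode(pdf_bytes).decode("ascii")}
    if filename:
        args["filename"] = filename
    if formats:
        args["formats"] = list(formats)
    return args
```

Base64 inflates the payload by about one third, which is worth keeping in mind for large PDFs.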

Crawl → Parse workflow {#crawl-parse-workflow}

A common pattern: crawl a documentation site, identify PDF links in the crawl output, then parse each PDF for full-text content.

import time

import requests
from crw import CrwClient

API_KEY = "YOUR_API_KEY"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
BASE = "https://api.fastcrw.com"

# 1. Start a crawl, collecting links only.
resp = requests.post(f"{BASE}/v1/crawl", headers=HEADERS, json={
    "url": "https://example.com/reports",
    "maxPages": 50,
    "formats": ["links"],
}, timeout=30)
resp.raise_for_status()
job_id = resp.json()["id"]

# 2. Poll until the job finishes.
while True:
    status = requests.get(f"{BASE}/v1/crawl/{job_id}", headers=HEADERS, timeout=30).json()
    if status["status"] in ("completed", "failed"):
        break
    time.sleep(2)
if status["status"] == "failed":
    raise RuntimeError(f"Crawl {job_id} failed")

# 3. Collect unique PDF URLs from crawl results, preserving order.
pdf_urls = []
for page in status.get("data", []):
    for link in page.get("links", []):
        if link.lower().endswith(".pdf") and link not in pdf_urls:
            pdf_urls.append(link)

print(f"Found {len(pdf_urls)} PDFs")

# 4. Download and parse each PDF.
client = CrwClient()  # reads CRW_API_KEY from env
for url in pdf_urls[:5]:  # start small
    pdf_resp = requests.get(url, timeout=60)
    pdf_resp.raise_for_status()

    result = client.parse_file(
        content=pdf_resp.content,
        filename=url.split("/")[-1],
        formats=["markdown"],
    )
    print(f"\n--- {url} ({result['metadata']['numPages']} pages) ---")
    print(result["markdown"][:500])
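
Step 3 matches on the raw link string, so a link such as report.pdf?download=1 is skipped and the derived filename can pick up query-string noise. A slightly more careful variant parses each URL first. This is a sketch that assumes each crawl page carries a `links` list of absolute URL strings, as in the example above; `pdf_links` is a hypothetical helper.

```python
from posixpath import basename
from urllib.parse import unquote, urlsplit


def pdf_links(pages) -> dict[str, str]:
    """Map each unique PDF URL to a clean filename, in first-seen order."""
    found = {}
    for page in pages:
        for link in page.get("links", []):
            path = urlsplit(link).path  # drops ?query and #fragment
            if path.lower().endswith(".pdf") and link not in found:
                found[link] = unquote(basename(path))
    return found
```

Iterating over `pdf_links(status.get("data", [])).items()` then yields `(url, filename)` pairs that can be passed to `parse_file(content=..., filename=...)`.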

Structured extraction from PDFs

Combine /v2/parse with formats: ["json"] and a jsonSchema to extract structured data from a PDF in one step. Requires an LLM configured on the engine.

curl -X POST https://api.fastcrw.com/v2/parse \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@/path/to/contract.pdf;type=application/pdf" \
  -F 'options={
    "formats": ["json"],
    "jsonSchema": {
      "type": "object",
      "properties": {
        "parties":       { "type": "array", "items": { "type": "string" } },
        "effectiveDate": { "type": "string" },
        "value":         { "type": "string" }
      }
    }
  }'

Response:

{
  "success": true,
  "data": {
    "json": {
      "parties": ["Acme Corp", "Beta LLC"],
      "effectiveDate": "2026-01-01",
      "value": "$120,000"
    },
    "metadata": { "numPages": 8, "sourceFilename": "contract.pdf" }
  }
}
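
The schema in this example declares no `required` list, so under JSON Schema rules every property is optional and a response may omit some of them. A small check like the following can flag gaps before the data is used; `missing_fields` is a hypothetical helper, not part of the API.

```python
def missing_fields(extracted: dict, schema: dict) -> list[str]:
    """List top-level schema properties absent from the extracted object."""
    return [name for name in schema.get("properties", {}) if name not in extracted]


schema = {
    "type": "object",
    "properties": {
        "parties": {"type": "array", "items": {"type": "string"}},
        "effectiveDate": {"type": "string"},
        "value": {"type": "string"},
    },
}
```

For the response shown above, `missing_fields(resp.json()["data"]["json"], schema)` would be an empty list because all three properties are present.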

Error reference

| Status | Cause | Fix |
| --- | --- | --- |
| 400 | Missing file part, or file does not begin with %PDF- | Ensure the form includes a file field containing a valid PDF |
| 400 | options is not valid JSON | Validate the options string before sending |
| 400 | formats: ["json"] without a configured LLM | Set [extraction.llm] in config.toml or pass llmApiKey |
| 413 | Body exceeds 50 MiB | Split the PDF or trim pages before upload |
| 422 | Corrupt, encrypted, or unreadable PDF | Verify the PDF opens locally and is not password-protected |
| 503 | Document parsing is disabled on this server | Set [document] enabled = true in config.toml |
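
When calling the route over raw HTTP, the table above can be folded into a small lookup so failures produce an actionable message. This is a sketch with a hypothetical function name; the hints paraphrase the table and nothing here is returned by the server itself.

```python
PARSE_ERROR_HINTS = {
    400: "Bad request: check the file part, the %PDF- header, the options JSON, "
         "and that an LLM is configured for json/summary formats.",
    413: "Upload exceeds 50 MiB: split the PDF or trim pages.",
    422: "PDF is corrupt, encrypted, or unreadable: verify it opens locally.",
    503: "Document parsing is disabled: set [document] enabled = true.",
}


def describe_parse_error(status_code: int) -> str:
    """Return a short hint for a failed /v2/parse response."""
    return PARSE_ERROR_HINTS.get(status_code, f"Unexpected status {status_code}.")
```

A typical use is `if not resp.ok: raise RuntimeError(describe_parse_error(resp.status_code))` after a requests call.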

Self-hosted configuration

Document parsing is enabled by default. To tune it, add a [document] section to config.toml:

[document]
enabled              = true
max_pages            = 0          # 0 = no limit
max_upload_bytes     = 52428800   # 50 MiB (hard cap)
upload_concurrency   = 4          # simultaneous uploads buffered in memory
max_concurrent_parses = 8         # across all surfaces (URL, crawl, upload)
parse_timeout_ms     = 30000      # ms; 0 = no timeout
max_decompressed_bytes = 104857600  # 100 MiB decompression-bomb guard
sandbox              = false      # isolate each parse in a child process

sandbox = true is recommended for hosts that accept untrusted PDF uploads. It runs each parse in a child process with hard OS memory and CPU limits. Cost: ~1–3 ms spawn overhead per parse.
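
As a rough sizing aid, the two upload settings multiply into an upper bound on memory held by buffered uploads. The arithmetic below is an inference from the config comments, not a documented guarantee, and it ignores the memory used by the parses themselves.

```python
MIB = 1024 * 1024


def upload_buffer_bound(max_upload_bytes: int = 52_428_800,
                        upload_concurrency: int = 4) -> int:
    """Worst-case bytes buffered if every upload slot holds a maximum-size file."""
    return max_upload_bytes * upload_concurrency


# With the defaults shown above: 4 uploads x 50 MiB each.
default_bound_mib = upload_buffer_bound() // MIB
```

With the default values this comes to 200 MiB, which is a useful floor when choosing a memory limit for the host or container.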

When to use something else

  • Use Scrape when the document is a web page, not a binary file
  • Use Extract when you already have a URL to a PDF (scrape fetches and parses automatically when the response is application/pdf)
  • Use Crawl when you need to discover PDFs across an entire site before parsing them