pennypdf

Developer API

PDF extraction API

Extraction is the glue operation for any PDF-driven data pipeline: invoice parsing, resume processing, financial statement analysis, legal discovery. The output format is what separates usable from unusable — raw text from `pdftotext` tells you nothing about layout; Adobe Extract's JSON is rich but $0.15-$0.30 per doc; AWS Textract's form-aware mode is $50/1000 pages.

PennyPDF's /v1/extract returns structured JSON at 1 coin per doc (~$0.04). Per-page blocks with positional data (x,y,width,height), table detection with cell geometry, identified image regions with bounding boxes, and the raw concatenated text as a fallback. Output roughly matches the Tika/PDFBox output model so migrations are straightforward.
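A quick sketch of walking that per-page structure. The field names below (`pages`, `blocks`, `bbox`, `tables`, `images`) are illustrative assumptions pieced together from the description above, not the canonical schema — check a real response before hard-coding paths:

```python
# Illustrative sample of an assumed /v1/extract JSON response.
sample = {
    "text": "ACME Corp\nInvoice #1042",   # raw concatenated fallback text
    "pages": [
        {
            "number": 1,
            "blocks": [
                {"text": "ACME Corp",
                 "bbox": {"x": 72, "y": 54, "width": 120, "height": 14}},
            ],
            "tables": [],
            "images": [
                {"bbox": {"x": 400, "y": 40, "width": 90, "height": 90}},
            ],
        }
    ],
}

# Positional blocks let you keep layout context that pdftotext discards.
for page in sample["pages"]:
    for block in page["blocks"]:
        b = block["bbox"]
        print(f"p{page['number']} @({b['x']},{b['y']}): {block['text']}")
```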

For scanned PDFs without a text layer, chain /v1/ocr first (3 coins) then /v1/extract (1 coin). 4 coins (~$0.16) for a full OCR + structured extraction pipeline per doc. No third-party data sharing — both ops run on our infrastructure with PDFs deleted within 2 hours.
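The OCR-then-extract chain can be sketched as one helper. This assumes `/v1/ocr` returns the text-layered PDF as the response body, which `/v1/extract` then accepts as a normal upload — verify the OCR endpoint's actual response format before relying on that:

```python
import os

API = "https://api.pennypdf.com"

def ocr_then_extract(path, post=None):
    """Chain /v1/ocr (3 coins) then /v1/extract (1 coin) for scanned PDFs.

    Sketch under assumptions: the OCR response body is the PDF with a
    text layer added, and /v1/extract accepts it as a file upload.
    """
    if post is None:  # allow injecting a fake transport for testing
        import requests
        post = requests.post
    headers = {"Authorization": f"Bearer {os.environ['PENNYPDF_API_KEY']}"}
    with open(path, "rb") as f:
        ocr = post(f"{API}/v1/ocr", headers=headers, files={"file": f})
    ocr.raise_for_status()
    extract = post(
        f"{API}/v1/extract",
        headers=headers,
        files={"file": ("ocr.pdf", ocr.content)},  # assumed: body is the PDF
        data={"format": "json", "include": "text,tables"},
    )
    extract.raise_for_status()
    return extract.json()
```

Total cost stays 4 coins (~$0.16) per document regardless of how you wire the chain.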

Copy, paste, ship

Same bearer-token auth across every endpoint. Set PENNYPDF_API_KEY in your environment first.

curl: POST /v1/extract
curl -X POST https://api.pennypdf.com/v1/extract \
  -H "Authorization: Bearer $PENNYPDF_API_KEY" \
  -F "file=@invoice.pdf" \
  -F "format=json" \
  -F "include=text,tables,images"
Python: extract + structure tables
import os, requests

with open("statement.pdf", "rb") as f:
    r = requests.post(
        "https://api.pennypdf.com/v1/extract",
        headers={"Authorization": f"Bearer {os.environ['PENNYPDF_API_KEY']}"},
        files={"file": f},
        data={"format": "json", "include": "text,tables"},
    )
r.raise_for_status()

data = r.json()
for page in data["pages"]:
    print(f"Page {page['number']}: {len(page['tables'])} tables")
    for tbl in page["tables"]:
        # Each table has rows[] of cells[] with .text, .bbox
        print([cell["text"] for row in tbl["rows"] for cell in row])

How it works

  1. POST the PDF to /v1/extract with the fields you want (text / tables / images).
  2. Parse the returned JSON. Each page is an object with positional blocks and structured elements.
  3. For scanned PDFs: chain /v1/ocr first to add a text layer, then extract normally.

Frequently asked

How accurate is table detection?

Good on ruled tables (lines between cells) — ~95% cell-boundary accuracy. Mixed on unruled tables (column-aligned text without visible separators) — ~70-80%. For mission-critical table extraction (financial filings, scientific papers), evaluate on your actual docs before committing.

Can I extract just specific page ranges?

Yes — pass `pages=1-5,10,12-15` in the form data. Coins are still 1 per call regardless of page count.

Latency?

p50 = 1.4 s for a 20-page invoice, p90 = 3.5 s. Scales roughly linearly with page count. For 500+ page docs, use the async endpoint.

Does it preserve the reading order of multi-column pages?

We attempt to reconstruct reading order using text block clustering + horizontal/vertical separator detection. On clean two-column academic PDFs, order is correct 95%+ of the time. On complex newspaper-style layouts, you may need to reconstruct order from the positional data yourself.
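For those complex layouts, a do-it-yourself ordering pass over the positional blocks is often enough. The sketch below assumes each block carries a `bbox` dict (`x`, `y`, `width`, `height`, with `y` increasing down the page) and uses a naive page-midline split — a simplification, not the API's own clustering:

```python
# Recover two-column reading order from positional blocks:
# assign each block to a column by its horizontal midpoint,
# then read each column top-to-bottom, left column first.

def reading_order(blocks, page_width):
    mid = page_width / 2
    left = [b for b in blocks if b["bbox"]["x"] + b["bbox"]["width"] / 2 < mid]
    right = [b for b in blocks if b["bbox"]["x"] + b["bbox"]["width"] / 2 >= mid]
    by_y = lambda b: b["bbox"]["y"]
    return sorted(left, key=by_y) + sorted(right, key=by_y)

blocks = [
    {"text": "col2 para1", "bbox": {"x": 320, "y": 80, "width": 200, "height": 40}},
    {"text": "col1 para1", "bbox": {"x": 60, "y": 80, "width": 200, "height": 40}},
    {"text": "col1 para2", "bbox": {"x": 60, "y": 140, "width": 200, "height": 40}},
]
print([b["text"] for b in reading_order(blocks, page_width=612)])
# -> ['col1 para1', 'col1 para2', 'col2 para1']
```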

Rate limits?

150 requests per minute per API key. Async endpoint bypasses the per-minute cap — use it for bulk extraction runs (1000+ docs/hour).

Can I get raw layout-preserved text (like pdftotext -layout)?

Add `format=text-raw` — you get the plain layout-preserved text without the JSON wrapper. Useful for LLM ingestion where structured JSON bloats the context.

Why PennyPDF

  • No subscription. Ever.
  • Coins never expire — use them in 5 years.
  • Client-side processing for 14 of 22 tools.
  • No watermarks at any tier.
  • Per-operation pricing, shown before you click.
  • Same coins for web + public API.