Developer API
PDF extraction API
Extraction is the glue operation for any PDF-driven data pipeline: invoice parsing, resume processing, financial statement analysis, legal discovery. The output format is what separates usable from unusable — raw text from `pdftotext` tells you nothing about layout; Adobe Extract's JSON is rich but runs $0.15-$0.30 per doc; AWS Textract's form-aware mode costs $50 per 1,000 pages.
PennyPDF's /v1/extract returns structured JSON at 1 coin per doc (~$0.04): per-page blocks with positional data (x, y, width, height), table detection with cell geometry, identified image regions with bounding boxes, and the raw concatenated text as a fallback. The output roughly matches the Tika/PDFBox output model, so migrations are straightforward.
For scanned PDFs without a text layer, chain /v1/ocr first (3 coins), then /v1/extract (1 coin) — 4 coins (~$0.16) per doc for a full OCR + structured extraction pipeline. No third-party data sharing: both ops run on our infrastructure, and PDFs are deleted within 2 hours.
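A minimal sketch of that chain in Python. It assumes /v1/ocr returns the searchable PDF as the raw response body — check the actual response shape before relying on this:

```python
import os
import requests

API = "https://api.pennypdf.com/v1"
HEADERS = {"Authorization": f"Bearer {os.environ.get('PENNYPDF_API_KEY', '')}"}

def ocr_then_extract(path):
    """OCR a scanned PDF (3 coins), then run structured extraction (1 coin)."""
    with open(path, "rb") as f:
        ocr = requests.post(f"{API}/ocr", headers=HEADERS, files={"file": f})
    ocr.raise_for_status()
    # Assumption: /v1/ocr returns the OCR'd, searchable PDF as the body.
    searchable_pdf = ocr.content
    ext = requests.post(
        f"{API}/extract",
        headers=HEADERS,
        files={"file": ("searchable.pdf", searchable_pdf)},
        data={"format": "json", "include": "text,tables"},
    )
    ext.raise_for_status()
    return ext.json()

if __name__ == "__main__":
    result = ocr_then_extract("scan.pdf")
    print(len(result["pages"]), "pages extracted")
```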
Copy, paste, ship
Same bearer-token auth across every endpoint. Set PENNYPDF_API_KEY in your environment first.
```shell
curl -X POST https://api.pennypdf.com/v1/extract \
  -H "Authorization: Bearer $PENNYPDF_API_KEY" \
  -F "file=@invoice.pdf" \
  -F "format=json" \
  -F "include=text,tables,images"
```

```python
import os, requests

r = requests.post(
    "https://api.pennypdf.com/v1/extract",
    headers={"Authorization": f"Bearer {os.environ['PENNYPDF_API_KEY']}"},
    files={"file": open("statement.pdf", "rb")},
    data={"format": "json", "include": "text,tables"},
)
data = r.json()
for page in data["pages"]:
    print(f"Page {page['number']}: {len(page['tables'])} tables")
    for tbl in page["tables"]:
        # Each table has rows[] of cells[] with .text, .bbox
        print([cell["text"] for row in tbl["rows"] for cell in row])
```

How it works
1. POST the PDF to /v1/extract with the fields you want (text / tables / images).
2. Parse the returned JSON. Each page is an object with positional blocks and structured elements.
3. For scanned PDFs: chain /v1/ocr first to add a text layer, then extract normally.
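Step 3 implies deciding whether a PDF already has a text layer. A rough heuristic (our assumption, not part of the API) is to look for font resources in the raw bytes, since image-only scans usually declare none:

```python
def has_text_layer(pdf_bytes: bytes) -> bool:
    """Rough heuristic: PDFs with extractable text declare /Font resources;
    scanned image-only PDFs typically do not. Not bulletproof -- verify on
    your own corpus before using it to route docs to /v1/ocr."""
    return b"/Font" in pdf_bytes

def needs_ocr(path: str) -> bool:
    """Only spend the 3 OCR coins when no text layer is found."""
    with open(path, "rb") as f:
        return not has_text_layer(f.read())
```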
Frequently asked
How accurate is table detection?
Good on ruled tables (lines between cells) — ~95% cell-boundary accuracy. Mixed on unruled tables (column-aligned text without visible separators) — ~70-80%. For mission-critical table extraction (financial filings, scientific papers), evaluate on your actual docs before committing.
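One way to run that evaluation, assuming you hand-label ground-truth cell texts for a few sample docs (the `rows`/`cells` shape follows the response model shown in the example above):

```python
def cell_accuracy(predicted_rows, truth_rows):
    """Fraction of ground-truth cells matched exactly, position by position.
    Deliberately strict -- relax (e.g. collapse whitespace) to taste."""
    flatten = lambda rows: [c["text"].strip() for row in rows for c in row]
    pred, truth = flatten(predicted_rows), flatten(truth_rows)
    if not truth:
        return 1.0
    hits = sum(p == t for p, t in zip(pred, truth))
    return hits / len(truth)
```

Run it over a handful of representative documents and compare against the ~95% / ~70-80% figures above before committing.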
Can I extract just specific page ranges?
Yes — pass `pages=1-5,10,12-15` in the form data. Coins are still 1 per call regardless of page count.
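If you build the `pages` string programmatically, a small helper keeps it well-formed (the `1-5,10,12-15` syntax is from the answer above; the helper itself is illustrative):

```python
def parse_page_ranges(spec: str) -> list[int]:
    """Expand a pages spec like '1-5,10,12-15' into an explicit page list."""
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            lo, hi = part.split("-")
            pages.extend(range(int(lo), int(hi) + 1))
        else:
            pages.append(int(part))
    return pages
```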
Latency?
p50 = 1.4 s for a 20-page invoice, p90 = 3.5 s. Scales roughly linearly with page count. For 500+ page docs, use the async endpoint.
Does it preserve the reading order of multi-column pages?
We attempt to reconstruct reading order using text block clustering + horizontal/vertical separator detection. On clean two-column academic PDFs, order is correct 95%+ of the time. On complex newspaper-style layouts, you may need to reconstruct order from the positional data yourself.
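If you do reconstruct order yourself, a simple baseline (assuming each block carries the `x`, `y`, `width` fields from the positional data) is to bucket blocks into columns by x-midpoint, then read each column top to bottom:

```python
def reading_order(blocks, page_width, n_cols=2):
    """Naive multi-column ordering: assign each block to one of n_cols
    columns by the x-midpoint of its bbox, then sort within each column
    by y. Fine for clean column layouts; newspaper-style pages need real
    separator detection."""
    col_w = page_width / n_cols

    def key(b):
        mid_x = b["x"] + b["width"] / 2
        col = min(int(mid_x // col_w), n_cols - 1)
        return (col, b["y"])

    return sorted(blocks, key=key)
```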
Rate limits?
150 requests per minute per API key. Async endpoint bypasses the per-minute cap — use it for bulk extraction runs (1000+ docs/hour).
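At 150 req/min, bursty synchronous clients will eventually see 429s. A minimal retry sketch — honoring a `Retry-After` header is our assumption about the response, with exponential backoff as the fallback:

```python
import time
import requests

def backoff_delay(attempt: int) -> float:
    """Exponential backoff: 1, 2, 4, ... seconds, capped at 60."""
    return min(2.0 ** attempt, 60.0)

def post_with_retry(url, *, max_retries=5, **kwargs):
    """Retry POSTs on HTTP 429, sleeping between attempts."""
    for attempt in range(max_retries):
        resp = requests.post(url, **kwargs)
        if resp.status_code != 429:
            return resp
        # Assumption: the API may send Retry-After; else fall back to 2^n.
        time.sleep(float(resp.headers.get("Retry-After", backoff_delay(attempt))))
    return resp
```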
Can I get raw PDF objects (like pdftotext -layout)?
Add `format=text-raw` — you get the plain layout-preserved text without the JSON wrapper. Useful for LLM ingestion where structured JSON bloats the context.
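For LLM ingestion you usually also want to cap how much of that raw text enters the prompt. A sketch, where `format=text-raw` comes from the answer above and the truncation helper is purely illustrative:

```python
import os
import requests

def fetch_raw_text(path: str) -> str:
    """Fetch layout-preserved plain text (no JSON wrapper)."""
    r = requests.post(
        "https://api.pennypdf.com/v1/extract",
        headers={"Authorization": f"Bearer {os.environ['PENNYPDF_API_KEY']}"},
        files={"file": open(path, "rb")},
        data={"format": "text-raw"},
    )
    r.raise_for_status()
    return r.text

def truncate_for_context(text: str, max_chars: int) -> str:
    """Cut at the last paragraph break within budget so the prompt
    does not end mid-sentence."""
    if len(text) <= max_chars:
        return text
    cut = text.rfind("\n\n", 0, max_chars)
    return text[:cut] if cut > 0 else text[:max_chars]
```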
Why PennyPDF
- No subscription. Ever.
- Coins never expire — use them in 5 years.
- Client-side processing for 14 of 22 tools.
- No watermarks at any tier.
- Per-operation pricing, shown before you click.
- Same coins for web + public API.