pennypdf

Developer API

PDF extraction API

Extraction is the glue operation for any PDF-driven data pipeline: invoice parsing, resume processing, financial statement analysis, legal discovery. The output format is what separates usable from unusable — raw text from `pdftotext` tells you nothing about layout; Adobe Extract's JSON is rich but $0.15-$0.30 per doc; AWS Textract's form-aware mode is $50/1000 pages.

PennyPDF's /v1/extract returns structured JSON at 1 coin per doc (~$0.04). Per-page blocks with positional data (x,y,width,height), table detection with cell geometry, identified image regions with bounding boxes, and the raw concatenated text as a fallback. Output roughly matches the Tika/PDFBox output model so migrations are straightforward.
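A quick sketch of walking that per-page structure. The field names below (`pages`, `blocks`, `bbox`, `tables`, `images`) are illustrative assumptions pieced together from the description above, not the canonical schema — check a real response before hard-coding paths:

```python
# Illustrative sample of an assumed /v1/extract JSON response.
sample = {
    "text": "ACME Corp\nInvoice #1042",   # raw concatenated fallback text
    "pages": [
        {
            "number": 1,
            "blocks": [
                {"text": "ACME Corp",
                 "bbox": {"x": 72, "y": 54, "width": 120, "height": 14}},
            ],
            "tables": [],
            "images": [
                {"bbox": {"x": 400, "y": 40, "width": 90, "height": 90}},
            ],
        }
    ],
}

# Positional blocks let you keep layout context that pdftotext discards.
for page in sample["pages"]:
    for block in page["blocks"]:
        b = block["bbox"]
        print(f"p{page['number']} @({b['x']},{b['y']}): {block['text']}")
```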

For scanned PDFs without a text layer, chain /v1/ocr first (3 coins) then /v1/extract (1 coin). 4 coins (~$0.16) for a full OCR + structured extraction pipeline per doc. No third-party data sharing — both ops run on our infrastructure with PDFs deleted within 2 hours.
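The OCR-then-extract chain can be sketched as one helper. This assumes `/v1/ocr` returns the text-layered PDF as the response body, which `/v1/extract` then accepts as a normal upload — verify the OCR endpoint's actual response format before relying on that:

```python
import os

API = "https://api.pennypdf.com"

def ocr_then_extract(path, post=None):
    """Chain /v1/ocr (3 coins) then /v1/extract (1 coin) for scanned PDFs.

    Sketch under assumptions: the OCR response body is the PDF with a
    text layer added, and /v1/extract accepts it as a file upload.
    """
    if post is None:  # allow injecting a fake transport for testing
        import requests
        post = requests.post
    headers = {"Authorization": f"Bearer {os.environ['PENNYPDF_API_KEY']}"}
    with open(path, "rb") as f:
        ocr = post(f"{API}/v1/ocr", headers=headers, files={"file": f})
    ocr.raise_for_status()
    extract = post(
        f"{API}/v1/extract",
        headers=headers,
        files={"file": ("ocr.pdf", ocr.content)},  # assumed: body is the PDF
        data={"format": "json", "include": "text,tables"},
    )
    extract.raise_for_status()
    return extract.json()
```

Total cost stays 4 coins (~$0.16) per document regardless of how you wire the chain.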

Copy, paste, ship

Same bearer-token auth across every endpoint. Set PENNYPDF_API_KEY in your environment first.

curl: POST /v1/extract
curl -X POST https://api.pennypdf.com/v1/extract \
  -H "Authorization: Bearer $PENNYPDF_API_KEY" \
  -F "file=@invoice.pdf" \
  -F "format=json" \
  -F "include=text,tables,images"
Python: extract + structure tables
import os, requests

with open("statement.pdf", "rb") as f:
    r = requests.post(
        "https://api.pennypdf.com/v1/extract",
        headers={"Authorization": f"Bearer {os.environ['PENNYPDF_API_KEY']}"},
        files={"file": f},
        data={"format": "json", "include": "text,tables"},
    )
r.raise_for_status()

data = r.json()
for page in data["pages"]:
    print(f"Page {page['number']}: {len(page['tables'])} tables")
    for tbl in page["tables"]:
        # Each table has rows[] of cells[] with .text, .bbox
        print([cell["text"] for row in tbl["rows"] for cell in row])

How it works

  1. POST the PDF to /v1/extract with the fields you want (text / tables / images).
  2. Parse the returned JSON. Each page is an object with positional blocks and structured elements.
  3. For scanned PDFs: chain /v1/ocr first to add a text layer, then extract normally.

Frequently asked

How accurate is table detection?

Good on ruled tables (lines between cells) — ~95% cell-boundary accuracy. Mixed on unruled tables (column-aligned text without visible separators) — ~70-80%. For mission-critical table extraction (financial filings, scientific papers), evaluate on your actual docs before committing.

Can I extract just specific page ranges?

Yes — pass `pages=1-5,10,12-15` in the form data. Coins are still 1 per call regardless of page count.

Latency?

p50 = 1.4 s for a 20-page invoice, p90 = 3.5 s. Scales roughly linearly with page count. For 500+ page docs, use the async endpoint.

Does it preserve the reading order of multi-column pages?

We attempt to reconstruct reading order using text block clustering + horizontal/vertical separator detection. On clean two-column academic PDFs, order is correct 95%+ of the time. On complex newspaper-style layouts, you may need to reconstruct order from the positional data yourself.
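For those complex layouts, a do-it-yourself ordering pass over the positional blocks is often enough. The sketch below assumes each block carries a `bbox` dict (`x`, `y`, `width`, `height`, with `y` increasing down the page) and uses a naive page-midline split — a simplification, not the API's own clustering:

```python
# Recover two-column reading order from positional blocks:
# assign each block to a column by its horizontal midpoint,
# then read each column top-to-bottom, left column first.

def reading_order(blocks, page_width):
    mid = page_width / 2
    left = [b for b in blocks if b["bbox"]["x"] + b["bbox"]["width"] / 2 < mid]
    right = [b for b in blocks if b["bbox"]["x"] + b["bbox"]["width"] / 2 >= mid]
    by_y = lambda b: b["bbox"]["y"]
    return sorted(left, key=by_y) + sorted(right, key=by_y)

blocks = [
    {"text": "col2 para1", "bbox": {"x": 320, "y": 80, "width": 200, "height": 40}},
    {"text": "col1 para1", "bbox": {"x": 60, "y": 80, "width": 200, "height": 40}},
    {"text": "col1 para2", "bbox": {"x": 60, "y": 140, "width": 200, "height": 40}},
]
print([b["text"] for b in reading_order(blocks, page_width=612)])
# -> ['col1 para1', 'col1 para2', 'col2 para1']
```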

Rate limits?

150 requests per minute per API key. Async endpoint bypasses the per-minute cap — use it for bulk extraction runs (1000+ docs/hour).

Can I get raw layout-preserved text (like pdftotext -layout)?

Add `format=text-raw` — you get the plain layout-preserved text without the JSON wrapper. Useful for LLM ingestion where structured JSON bloats the context.

Why PennyPDF

  • No subscription. Ever.
  • Coins never expire — use them in 5 years.
  • Client-side processing for 14 of 22 tools.
  • No watermarks at any tier.
  • Per-operation pricing, shown before you click.
  • Same coins for web + public API.