pennypdf

Developer API

PDF OCR API

Production OCR pipelines split along a predictable line. Some teams ship with Google Vision or AWS Textract because accuracy on messy scans is marginally better. Others stay away for two reasons: per-page pricing that starts at $1.50 per 1,000 pages, and the privacy implications of sending receipts and medical forms to a cloud OCR service.

PennyPDF's /v1/ocr runs Tesseract, the open-source OCR engine that Google sponsored and maintained for over a decade. We support 12 languages, including Japanese, Arabic, and Hindi. The output is the original PDF with a searchable text layer added in place, so the downstream /v1/pdf-to-word and /v1/extract endpoints can read the text without re-processing.

3 coins per document (~$0.12 at the Saver pack, ~$0.09 at the Pro pack). Compare: Google Vision at $1.50 per 1,000 pages (about $0.0015 per page, so $0.03 for a 20-page document); AWS Textract at $1 per 1,000 pages for basic text plus $50 per 1,000 for tables/forms; Adobe Extract API at $0.03 per document with a $2,500 annual floor.
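The comparison comes down to a flat per-document price versus a per-page price. A small sketch of that arithmetic, assuming the 3-coin price is flat per document, a coin costs about $0.04 (Saver) or $0.03 (Pro) as the quoted per-document prices imply, and documents average 20 pages. The function names are illustrative and not part of any PennyPDF SDK.

```python
# Illustrative arithmetic only; these names are hypothetical.
COINS_PER_OCR_DOC = 3            # quoted OCR price
SAVER_COIN_USD = 0.04            # implied by ~$0.12 per 3-coin document
PRO_COIN_USD = 0.03              # implied by ~$0.09 per 3-coin document
VISION_USD_PER_1K_PAGES = 1.50   # Google Vision text detection, as quoted


def flat_cost_per_doc(coin_usd: float) -> float:
    """Flat per-document price (assumed independent of page count)."""
    return COINS_PER_OCR_DOC * coin_usd


def flat_cost_per_1k_pages(coin_usd: float, pages_per_doc: int) -> float:
    """Effective price per 1,000 pages at a given average document length."""
    docs = 1000 / pages_per_doc
    return docs * flat_cost_per_doc(coin_usd)


def per_page_cost_per_doc(usd_per_1k_pages: float, pages: int) -> float:
    """Cost of one document under per-page pricing."""
    return usd_per_1k_pages / 1000 * pages


print(round(flat_cost_per_doc(SAVER_COIN_USD), 2))                    # 0.12
print(round(flat_cost_per_1k_pages(SAVER_COIN_USD, 20), 2))           # 6.0
print(round(per_page_cost_per_doc(VISION_USD_PER_1K_PAGES, 20), 2))   # 0.03
```

Under these assumptions the flat price gets cheaper per page as documents get longer: at the Saver rate it matches the quoted Vision per-page price at 80 pages per document.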

Copy, paste, ship

Same bearer-token auth across every endpoint. Set PENNYPDF_API_KEY in your environment first.

curl: POST /v1/ocr
curl -X POST https://api.pennypdf.com/v1/ocr \
  -H "Authorization: Bearer $PENNYPDF_API_KEY" \
  -F "file=@scanned-invoice.pdf" \
  -F "languages=eng,spa" \
  -o searchable.pdf
Python: OCR + extract pipeline
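import os

import requests

auth = {"Authorization": f"Bearer {os.environ['PENNYPDF_API_KEY']}"}

# 1. OCR the scan (3 coins). The context manager closes the file handle.
with open("scan.pdf", "rb") as scan:
    r = requests.post(
        "https://api.pennypdf.com/v1/ocr",
        headers=auth,
        files={"file": scan},
        data={"languages": "eng"},
        timeout=120,  # arbitrary client-side cap; tune to your document sizes
    )
r.raise_for_status()  # stop here rather than pass an error body downstream
ocr_pdf = r.content   # bytes of the searchable PDF

# 2. Extract structured text (0 coins: the text layer is already there now)
r = requests.post(
    "https://api.pennypdf.com/v1/extract",
    headers=auth,
    files={"file": ("ocr.pdf", ocr_pdf)},
    data={"format": "json"},
    timeout=120,
)
r.raise_for_status()
print(r.json()["text"][:500])  # first 500 chars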

PennyPDF vs Google Cloud Vision

| | PennyPDF | Google Cloud Vision |
| --- | --- | --- |
| Price per 1k pages | ~$6 (50 docs × $0.12, at 20 pages avg) | $1.50 (text detection) |
| Price per doc (avg 20pp) | $0.12 | $0.03 |
| Monthly minimum | $0 | $0 (then quota) |
| Data sent to third party | No (self-hosted Tesseract) | Yes (Google) |
| Languages | 12 built-in | 50+ |
| Output format | Searchable PDF | JSON (DIY PDF assembly) |

How it works

  1. POST the scanned PDF as multipart form-data to /v1/ocr with optional language hints.
  2. Receive the same PDF with an invisible text layer: copy-paste and search work, visuals unchanged.
  3. Optionally chain /v1/extract afterwards at 0 coins to pull structured text out.
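Because the language hints in step 1 are optional, it can help to build the form fields in one place. A minimal sketch, assuming hints travel in a single comma-separated `languages` field as in the curl example (`languages=eng,spa`); the helper name `ocr_form_fields` is hypothetical.

```python
from typing import Optional, Sequence


def ocr_form_fields(languages: Optional[Sequence[str]] = None) -> dict:
    """Build the non-file form fields for a /v1/ocr request.

    Assumption: language hints go in one comma-separated `languages`
    field, matching the curl example.
    """
    if not languages:
        return {}  # hints are optional, so send no field at all
    # Normalize codes and drop empty entries before joining.
    cleaned = [code.strip().lower() for code in languages if code.strip()]
    return {"languages": ",".join(cleaned)} if cleaned else {}


print(ocr_form_fields(["eng", "spa"]))  # {'languages': 'eng,spa'}
print(ocr_form_fields())                # {}
```

The returned dict can be passed as the `data=` argument of `requests.post`, alongside `files=` for the PDF itself.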

Frequently asked

How accurate is Tesseract compared to Google Vision?

On clean office scans (300dpi+, black text on white), both hit 99%+. On receipts, handwriting, or low-contrast scans, Google Vision and AWS Textract have a real edge — 2-5 percentage points better character accuracy. If accuracy is life-critical (medical records, legal discovery), use the cloud providers. For invoice/form digitization, Tesseract is good enough.

What does 'text layer added in place' mean?

The output PDF looks visually identical to the input (raster pages unchanged) but carries an invisible text overlay, with each recognized character positioned over its place in the page image. Screen readers, text search, copy-paste, and downstream text extraction all work. No re-layout, no visual regression.

Latency?

Tesseract is CPU-bound. p50 = 8 s for a 5-page scan, p90 = 22 s for 20 pages. Use the async /v1/jobs/ocr endpoint for anything over 10 pages to avoid tying up the connection.
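The 10-page guideline can be applied before uploading. A minimal sketch that only picks the endpoint path from a known page count; the request and response shape of /v1/jobs/ocr is not described here, so the sketch stops at routing. The constant and function names are illustrative.

```python
SYNC_OCR_PATH = "/v1/ocr"
ASYNC_OCR_PATH = "/v1/jobs/ocr"
ASYNC_PAGE_THRESHOLD = 10  # "anything over 10 pages" goes async


def choose_ocr_path(page_count: int) -> str:
    """Return the endpoint path suggested by the page-count guideline."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    if page_count > ASYNC_PAGE_THRESHOLD:
        return ASYNC_OCR_PATH
    return SYNC_OCR_PATH


print(choose_ocr_path(5))   # /v1/ocr
print(choose_ocr_path(20))  # /v1/jobs/ocr
```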

What's the max resolution supported?

600dpi scans are the sweet spot. Anything higher gets downsampled to 400dpi internally (Tesseract's accuracy actually drops at very high resolutions because of per-pixel noise). If your scan is below 200dpi, accuracy will be poor; bump the scanner's DPI before calling us.
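A pre-flight check can flag scans outside these thresholds before any coins are spent. A minimal sketch, assuming you know the page image's pixel width and the physical page width; the function names and return labels are illustrative.

```python
LOW_DPI = 200           # below this, accuracy will be poor
DOWNSAMPLE_ABOVE = 600  # above this, the service downsamples internally


def effective_dpi(pixel_width: int, page_width_inches: float) -> float:
    """Horizontal resolution of a scanned page image."""
    if pixel_width <= 0 or page_width_inches <= 0:
        raise ValueError("dimensions must be positive")
    return pixel_width / page_width_inches


def classify_scan(dpi: float) -> str:
    """Label a scan by the DPI thresholds described above."""
    if dpi < LOW_DPI:
        return "rescan"        # bump the scanner's DPI first
    if dpi > DOWNSAMPLE_ABOVE:
        return "downsampled"   # accepted, but the extra pixels are discarded
    return "ok"


# A US Letter page (8.5 in wide) scanned 2550 px wide is 300 dpi.
print(classify_scan(effective_dpi(2550, 8.5)))  # ok
print(classify_scan(effective_dpi(1275, 8.5)))  # rescan
```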

Does it handle rotated pages?

Yes — we auto-detect rotation per page and straighten before OCR'ing. If the source PDF has pages rotated 90°/180°/270°, the output will have those pages in their correctly-oriented form with the text layer matching.

Rate limits?

30 synchronous OCRs per minute per API key. Async: 100 job creations per minute. OCR is our most CPU-expensive operation; bulk workloads (1000+/day) should go through the async endpoint.
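Staying under the synchronous limit is easiest with a client-side throttle. A minimal sliding-window sketch, assuming 30 calls per 60 seconds; how the server actually counts its window (fixed or sliding) is not stated, so treat this as a conservative guess. The class name is hypothetical.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` within any `window_s`-second span."""

    def __init__(
        self,
        max_calls: int = 30,
        window_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_calls = max_calls
        self._window_s = window_s
        self._clock = clock
        self._calls: Deque[float] = deque()  # timestamps of recent calls

    def wait_time(self) -> float:
        """Seconds until the next call is allowed (0.0 if allowed now)."""
        now = self._clock()
        # Forget calls that have aged out of the window.
        while self._calls and now - self._calls[0] >= self._window_s:
            self._calls.popleft()
        if len(self._calls) < self._max_calls:
            return 0.0
        return self._window_s - (now - self._calls[0])

    def record(self) -> None:
        """Note that a call is being made now."""
        self._calls.append(self._clock())

    def acquire(self) -> None:
        """Block until a call is allowed, then record it."""
        delay = self.wait_time()
        if delay > 0:
            time.sleep(delay)
        self.record()
```

Call `limiter.acquire()` immediately before each synchronous `requests.post` to /v1/ocr; a separate instance with `max_calls=100` would cover async job creation.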

Why PennyPDF

  • No subscription. Ever.
  • Coins never expire — use them in 5 years.
  • Client-side processing for 14 of 22 tools.
  • No watermarks at any tier.
  • Per-operation pricing, shown before you click.
  • Same coins for web + public API.