Developer API
PDF OCR API
Production OCR pipelines split along a predictable line. Some teams ship with Google Vision or AWS Textract because accuracy on messy scans is marginally better; others stay away because of per-1,000-page pricing that starts at $1.50 and the privacy implications of sending receipts and medical forms to a cloud OCR service.
PennyPDF's /v1/ocr runs Tesseract, the open-source OCR engine originally developed at HP and later sponsored by Google. We support 12 languages including Japanese, Arabic, and Hindi. Output is the original PDF with a searchable text layer grafted on, so the downstream /v1/pdf-to-word and /v1/extract endpoints can find the text without re-processing.
3 coins per document (~$0.12 at the Saver pack, $0.09 at the Pro pack). For comparison: Google Vision charges $1.50 per 1,000 pages (about $0.0015 per page, so a 20-page doc costs $0.03); AWS Textract charges $1 per 1,000 pages for basic text plus $50 per 1,000 for tables/forms; Adobe Extract API charges $0.03 per doc with a $2,500 annual floor.
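Flat per-document pricing and per-page pricing are easier to compare once both are converted to the same unit. The sketch below does that conversion with the prices quoted above; the helper names are hypothetical and the 20-page document length is the same average this page uses elsewhere.

```python
# Hypothetical helpers for comparing per-document and per-page OCR pricing.
# Prices are the ones quoted on this page; they are not fetched from any API.

def per_doc_to_per_1k_pages(price_per_doc: float, pages_per_doc: int) -> float:
    """Effective price per 1,000 pages when billing is a flat fee per document."""
    return price_per_doc * 1000 / pages_per_doc


def per_1k_pages_to_per_doc(price_per_1k: float, pages_per_doc: int) -> float:
    """Cost of a single document when billing is per 1,000 pages."""
    return price_per_1k * pages_per_doc / 1000


pages = 20  # assumed average document length
pennypdf_per_1k = per_doc_to_per_1k_pages(0.12, pages)  # 3 coins at the Saver pack
vision_per_doc = per_1k_pages_to_per_doc(1.50, pages)

print(f"PennyPDF: ${pennypdf_per_1k:.2f} per 1k pages, $0.12 per doc")
print(f"Google Vision: $1.50 per 1k pages, ${vision_per_doc:.2f} per doc")
```

Because the PennyPDF fee is per document, its per-page cost falls as documents get longer. With these numbers the Saver-pack price matches Google Vision at 80 pages per document.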
Copy, paste, ship
Same bearer-token auth across every endpoint. Set PENNYPDF_API_KEY in your environment first.
curl -X POST https://api.pennypdf.com/v1/ocr \
  -H "Authorization: Bearer $PENNYPDF_API_KEY" \
  -F "file=@scanned-invoice.pdf" \
  -F "languages=eng,spa" \
  -o searchable.pdf

import os
import requests

auth = {"Authorization": f"Bearer {os.environ['PENNYPDF_API_KEY']}"}

# 1. OCR the scan (3 coins)
with open("scan.pdf", "rb") as f:
    r = requests.post(
        "https://api.pennypdf.com/v1/ocr",
        headers=auth,
        files={"file": f},
        data={"languages": "eng"},
    )
r.raise_for_status()  # stop here if the OCR request failed
ocr_pdf = r.content  # bytes of the searchable PDF

# 2. Extract structured text (0 coins, since the text layer is already there)
r = requests.post(
    "https://api.pennypdf.com/v1/extract",
    headers=auth,
    files={"file": ("ocr.pdf", ocr_pdf)},
    data={"format": "json"},
)
r.raise_for_status()
print(r.json()["text"][:500])  # first 500 chars

PennyPDF vs Google Cloud Vision
| | PennyPDF | Google Cloud Vision |
|---|---|---|
| Price per 1k pages | ~$6 (50 docs × $0.12 at 20 pages avg) | $1.50 (text detection) |
| Price per doc (avg 20pp) | $0.12 | $0.03 |
| Monthly minimum | $0 | $0 (then quota) |
| Data sent to third party | No — self-hosted Tesseract | Yes — Google |
| Languages | 12 built-in | 50+ |
| Output format | Searchable PDF | JSON (DIY PDF assembly) |
How it works
1. POST the scanned PDF as multipart form-data to /v1/ocr with optional language hints.
2. Receive the same PDF with an invisible text layer: copy-paste and search work, visuals unchanged.
3. Optionally chain /v1/extract afterwards at 0 coins to pull structured text out.
Frequently asked
How accurate is Tesseract compared to Google Vision?
On clean office scans (300dpi+, black text on white), both hit 99%+. On receipts, handwriting, or low-contrast scans, Google Vision and AWS Textract have a real edge — 2-5 percentage points better character accuracy. If accuracy is life-critical (medical records, legal discovery), use the cloud providers. For invoice/form digitization, Tesseract is good enough.
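To check whether those accuracy figures hold for your own documents, you can score OCR output against a hand-typed reference. The sketch below uses one common definition, character accuracy as 1 minus the edit distance divided by the reference length; this page does not say which metric its figures are based on, so treat the definition and the function names as assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def char_accuracy(reference: str, ocr_output: str) -> float:
    """Character accuracy in [0, 1], measured against a ground-truth string."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


# One wrong character out of seven
print(round(char_accuracy("invoice", "invo1ce"), 3))
```

Running this over a few dozen representative pages from each engine gives a like-for-like number for your own scan quality.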
What does 'text layer added in place' mean?
The output PDF looks visually identical to the input (raster pages unchanged) but has an invisible text overlay positioned over each character. Screen readers, text search, copy-paste, and downstream text extraction all work. No re-layout, no visual regression.
Latency?
Tesseract is CPU-bound. p50 = 8 s for a 5-page scan, p90 = 22 s for 20 pages. Use the async /v1/jobs/ocr endpoint for anything over 10 pages to avoid tying up the connection.
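The 10-page rule can be applied in client code before the upload starts. This is a minimal sketch: it assumes the async endpoint lives on the same api.pennypdf.com host as the other examples, and the function name and the way the caller obtains the page count are hypothetical. The request and response shape of /v1/jobs/ocr is not described here, so the sketch only picks the URL.

```python
SYNC_URL = "https://api.pennypdf.com/v1/ocr"
ASYNC_URL = "https://api.pennypdf.com/v1/jobs/ocr"  # host is an assumption
ASYNC_PAGE_THRESHOLD = 10  # "anything over 10 pages" goes async


def pick_ocr_endpoint(page_count: int) -> str:
    """Return the OCR endpoint URL suited to a document of this length."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    if page_count > ASYNC_PAGE_THRESHOLD:
        return ASYNC_URL
    return SYNC_URL


print(pick_ocr_endpoint(5))
print(pick_ocr_endpoint(20))
```

For synchronous calls it is also worth setting a client timeout comfortably above the quoted latencies, so a slow scan fails cleanly instead of hanging.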
What's the max resolution supported?
600dpi scans are the sweet spot. Anything higher is downsampled to 400dpi internally, because Tesseract's accuracy drops at very high resolutions as per-pixel noise increases. If your scan is below 200dpi, accuracy will be poor; raise the scanner's DPI before calling the API.
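A quick pre-flight check can flag scans that fall outside those bounds. The sketch below derives DPI from a page image's pixel width and its physical width in inches; both function names and the returned labels are hypothetical, and the thresholds are the 200dpi and 600dpi figures from the answer above.

```python
def effective_dpi(pixel_width: int, page_width_inches: float) -> float:
    """Scan resolution implied by an image's pixel width and the paper width."""
    if page_width_inches <= 0:
        raise ValueError("page_width_inches must be positive")
    return pixel_width / page_width_inches


def classify_scan(dpi: float) -> str:
    """Label a scan by how the resolution guidance above applies to it."""
    if dpi < 200:
        return "too_low"      # accuracy will be poor; rescan at a higher DPI
    if dpi > 600:
        return "downsampled"  # will be reduced to 400dpi before OCR
    return "ok"


# A US Letter page (8.5 in wide) scanned to a 2550-pixel-wide image
dpi = effective_dpi(2550, 8.5)
print(dpi, classify_scan(dpi))
```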
Does it handle rotated pages?
Yes — we auto-detect rotation per page and straighten before OCR'ing. If the source PDF has pages rotated 90°/180°/270°, the output will have those pages in their correctly-oriented form with the text layer matching.
Rate limits?
30 synchronous OCRs per minute per API key. Async: 100 job creations per minute. OCR is our most CPU-expensive operation; bulk workloads (1000+/day) should go through the async endpoint.
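Staying under those limits is simplest with a client-side throttle. The sketch below is a sliding-window limiter, one reasonable approach rather than anything the API requires; the class name is hypothetical, and how the server reports a breached limit is not documented here. The clock and sleep functions are injectable so the limiter can be tested without waiting.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_calls within any rolling window of period seconds."""

    def __init__(self, max_calls, period=60.0, clock=time.monotonic, sleep=time.sleep):
        self._max_calls = max_calls
        self._period = period
        self._clock = clock
        self._sleep = sleep
        self._calls = deque()  # timestamps of recent calls, oldest first

    def acquire(self):
        """Block until another call is allowed, then record it."""
        while True:
            now = self._clock()
            # Forget calls that have aged out of the window.
            while self._calls and now - self._calls[0] >= self._period:
                self._calls.popleft()
            if len(self._calls) < self._max_calls:
                self._calls.append(now)
                return
            # Wait until the oldest recorded call leaves the window.
            self._sleep(self._period - (now - self._calls[0]))


sync_limiter = SlidingWindowLimiter(max_calls=30)    # synchronous /v1/ocr calls
async_limiter = SlidingWindowLimiter(max_calls=100)  # async job creations
```

Call sync_limiter.acquire() immediately before each synchronous request; the same pattern applies to async_limiter for job creation.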
Why PennyPDF
- No subscription. Ever.
- Coins never expire — use them in 5 years.
- Client-side processing for 14 of 22 tools.
- No watermarks at any tier.
- Per-operation pricing, shown before you click.
- Same coins for web + public API.