
PDF OCR — make scanned PDFs searchable

Run OCR on a scanned PDF so the text becomes searchable and copyable. Pick one or more Tesseract language packs.

OCR runs server-side via tesseract + ocrmypdf. Files are deleted after download.
  1. Upload scanned PDF. Photos of pages, fax scans, old contracts (up to 50 MB).
  2. Pick languages & mode. Up to 3 languages; use 'Force' on image-only PDFs.
  3. Run OCR & download. Text becomes searchable and copyable without changing the look.

How to use this tool

  1. Upload a PDF. OCR is designed for scanned documents — photos of pages, receipts, faxes, or PDFs created from camera images. Digital-born PDFs (exported from Word, InDesign, etc.) already have a text layer, so OCR has nothing to do on them by default.
  2. Pick up to three languages. Match your document's language(s). For a Turkish lease with a few English clauses, pick tur + eng. More languages slow down OCR slightly, so don't add ones you don't need.
  3. Pick a mode (see explanation below). The default 'Skip' is right for most mixed PDFs; use 'Force' if you're sure your PDF is image-only.
  4. Click 'Run OCR & Download'. The visible pages won't look different — that's expected. Open the downloaded PDF and try to select text; it should now be copyable.

Why you might see 'OCR skipped on page(s) …'

The default 'Skip' mode asks ocrmypdf to leave pages alone when they already contain text. If your PDF is digital-born, every page gets skipped and the download is effectively unchanged. Switch to 'Re-OCR pages that have text' to replace the existing layer, or 'Force OCR every page' to run OCR unconditionally. Note: Force mode on a big digital PDF will rasterize it, making files bigger and losing sharp vector text.

Skip (default)

Pages that already have text are left as-is. Only image-only pages get OCR. Best for mixed PDFs (e.g., a scan bundled with digital cover pages).

Re-OCR

Removes any existing OCR layer and runs fresh recognition. Use when a previous OCR pass was bad (wrong language, old engine) and you want to redo it.

Force

Rasterizes every page and runs OCR on the image. Use when you're certain the PDF is an image-only scan. Caveat: this can make digital PDFs larger and fuzzier.
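The three modes line up with ocrmypdf's documented command-line switches `--skip-text`, `--redo-ocr`, and `--force-ocr`. The sketch below shows how a request could be turned into an ocrmypdf command line. It is an illustration only: the function name, the mode keys, and the idea that this tool builds its command this way are assumptions, not a description of the actual server code.

```python
from typing import List, Optional, Sequence

# Assumed mapping from the page's mode names to ocrmypdf switches.
MODE_FLAGS = {
    "skip": "--skip-text",   # leave pages that already contain text alone
    "redo": "--redo-ocr",    # drop an existing OCR layer and recognise again
    "force": "--force-ocr",  # rasterize every page and OCR the image
}
MAX_LANGUAGES = 3  # the limit stated on this page


def build_ocrmypdf_args(
    input_pdf: str,
    output_pdf: str,
    languages: Sequence[str] = ("eng",),
    mode: str = "skip",
    sidecar_txt: Optional[str] = None,
) -> List[str]:
    """Return an argument list for the ocrmypdf CLI (hypothetical helper)."""
    if not 1 <= len(languages) <= MAX_LANGUAGES:
        raise ValueError("pick between one and three languages")
    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown mode: {mode!r}")

    # Multiple Tesseract languages are joined with '+', e.g. 'tur+eng'.
    args = ["ocrmypdf", "-l", "+".join(languages), MODE_FLAGS[mode]]
    if sidecar_txt:
        # --sidecar writes the recognised text to a plain-text file as well.
        args += ["--sidecar", sidecar_txt]
    return args + [input_pdf, output_pdf]
```

For the Turkish lease with English clauses, `build_ocrmypdf_args("lease.pdf", "lease-ocr.pdf", ["tur", "eng"])` yields `ocrmypdf -l tur+eng --skip-text lease.pdf lease-ocr.pdf`, which could be passed to `subprocess.run` on a machine where ocrmypdf is installed.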

Tips

  • To verify OCR worked, open the downloaded PDF and try to highlight a word on a previously 'image-only' page. If you can select it, OCR ran successfully.
  • Enable the .txt transcript option to also get a plain-text extract alongside the PDF (delivered as a zip).
  • OCR quality depends on scan quality: 300 DPI clean scans are great, 72 DPI phone photos of crumpled receipts are not.
  • Large PDFs can take several minutes. Processing is done on the server and the file is deleted after you download it.

Processed server-side

This tool uses a server-side service for processing; uploaded files or requests are not kept for long-term storage.

About

OCR — Optical Character Recognition — is the bridge between PDFs that contain real text and PDFs that just contain pictures of text, and the difference between the two only becomes obvious the moment you try to do anything with the file. A real-text PDF lets you search for a word, copy a paragraph, paste it into another document, run an analysis tool over it, hand it to an indexing system, or feed it to a language model that needs to read the contents. A picture-of-text PDF lets you do exactly one thing: look at it. Most scans, photographs of paper, and exports from older systems fall into the second category, and the entire downstream world of digital workflows assumes the first. OCR converts one into the other without changing how the page looks.

The technical model is worth understanding because it shapes what works well and what doesn't. The OCR engine processes each page image, identifies regions of text, recognises the characters in each region, and writes the recognised text as a hidden layer beneath the original image. The visible page stays exactly the same — same fonts, same layout, same artefacts from the original scan — but a search in the PDF reader now finds matches because there's a real-text layer underneath. Copying text grabs from that hidden layer rather than trying to interpret pixels. The page rendering and the searchable content are decoupled: visual fidelity stays at 100%, and the search/copy/index functionality gets added on top.

Language matters more than people initially realise. Tesseract — the OCR engine used here — supports over a hundred languages, but the recognition quality depends heavily on whether the right language model is selected for the input. English text recognised with the English model produces excellent results; the same text recognised with a Cyrillic model produces nonsense. For Turkish documents, picking the Turkish model gets the diacritics (ç, ğ, ı, ö, ş, ü) right, which English mode would fail to handle. For mixed-language pages — common in academic literature where English and another language interleave, or in international business documents — the engine can handle multi-language input with the right configuration, though recognition accuracy on each language gets slightly weaker as more languages are stacked.

Source quality is the strongest predictor of OCR success. A clean 300 DPI scan of crisp, high-contrast typed text on white paper produces nearly perfect recognition — error rates under 1% are routine. A 150 DPI scan of the same source produces noticeably worse results because the engine has fewer pixels per character to work with. A photograph of a paper page, taken with a phone, with uneven lighting and slight skew, produces worse results still. Heavily compressed JPEGs of scanned text produce some of the worst results because JPEG compression artefacts in particular look superficially like character strokes to the recognition engine. The practical takeaway: if quality matters, scan at 300 DPI as TIFF or PNG before converting to PDF; if you have to work from a phone photo, expect more errors and budget time for proofreading.
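The "fewer pixels per character" point can be made concrete with a little arithmetic. One typographic point is 1/72 of an inch, so the pixel height of type at a given scan resolution is size / 72 × DPI. The 10 pt body text used below is an assumed example, not a figure from this page.

```python
POINTS_PER_INCH = 72  # definition of the typographic point


def pixels_for_points(size_pt: float, dpi: int) -> float:
    """Height in pixels of type of the given point size at a scan resolution."""
    return size_pt / POINTS_PER_INCH * dpi


for dpi in (300, 150, 72):
    height = pixels_for_points(10, dpi)
    print(f"{dpi:>3} DPI: 10 pt type is about {height:.1f} px tall")
```

At 300 DPI the engine gets roughly 42 pixels of height for 10 pt type, at 150 DPI about 21, and at 72 DPI only 10, which leaves very little detail for telling similar letterforms apart.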

Multi-column magazines and complex layouts are where OCR engines have historically struggled, and Tesseract is no exception. The challenge is figuring out the reading order — should the recognised text follow the left column down to the bottom and then jump to the top of the right column, or should it interleave the columns line by line? The default behaviour now handles two-column layouts well, but three-column layouts, sidebars, callout boxes, and irregular layouts (academic papers with figures, magazine pages with embedded ads) sometimes produce text that's correctly recognised but in a confusing order. The visible page is fine; the searchable layer is just slightly scrambled. For most search use cases that's still useful — you can find a phrase even if the words around it are out of order — but for full text extraction it's worth being aware of the limitation.
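The reading-order problem is easy to reproduce with a toy model. The sketch below is not Tesseract's layout analysis; it assumes each recognised line is reduced to its text plus the position of its top-left corner, and that the gutter between two columns sits at a known x coordinate. All names are hypothetical.

```python
from typing import List, NamedTuple


class Line(NamedTuple):
    text: str
    x: float  # left edge of the line
    y: float  # top edge of the line, growing downward


def interleaved_order(lines: List[Line]) -> List[str]:
    """Naive order: strictly top to bottom, then left to right."""
    return [ln.text for ln in sorted(lines, key=lambda ln: (ln.y, ln.x))]


def column_order(lines: List[Line], gutter_x: float) -> List[str]:
    """Column-aware order: all of the left column first, then the right."""
    return [ln.text for ln in sorted(lines, key=lambda ln: (ln.x >= gutter_x, ln.y))]


page = [
    Line("left 1", 50, 100), Line("right 1", 350, 100),
    Line("left 2", 50, 120), Line("right 2", 350, 120),
]
print(interleaved_order(page))        # columns mixed line by line
print(column_order(page, gutter_x=300))  # left column, then right column
```

Both orders contain the same correctly recognised words, which is why keyword search still works on a scrambled layer while copied passages can read out of sequence.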

Mathematical notation, chemical formulas, and other specialised symbol systems are beyond what general-purpose OCR engines can handle. Tesseract recognises text characters; an integral sign, a chemical structure diagram, or a complex equation gets either skipped, mis-recognised as similar-looking letters, or rendered as garbage. For documents heavy in this kind of content — physics papers, organic chemistry textbooks, mathematical reference works — specialised OCR tools like InftyReader exist, but they're a separate workflow. For everyday business documents, contracts, articles, and reports that contain occasional non-text elements, the imperfection in recognising those elements rarely matters because the searchable text around them carries the meaning.

Use cases that benefit most from OCR are predictable. Researchers digitising old papers and books for searchable archives. Lawyers indexing case files for litigation discovery. Medical practices converting paper records to electronic ones. Genealogists working with scanned historical records. Journalists searching through document leaks. Authors converting old scanned manuscripts. Compliance officers building searchable evidence vaults. In all of these, the value isn't visible — the document looks the same — but the time saved over the document's lifespan, every time someone needs to find a specific passage, accumulates dramatically. A single OCR pass on a 500-page archive turns it from a stack of pictures into a searchable corpus that pays for itself the first time someone searches for a name.

Deskewing and noise removal are pre-processing steps that significantly improve OCR accuracy, and the engine here applies them automatically. A page that was scanned at a 5-degree angle gets straightened before recognition; bleed-through from the back of the page gets removed; coffee stains and dust speckles get cleaned up. None of this preprocessing affects the visible page in the output PDF (the original scan is preserved as the visible layer), but the recognised text layer underneath is more accurate because the engine worked from a cleaner intermediate. For very poor source scans, manual preprocessing with image-editing tools before running OCR can help further, but the automatic pipeline handles most everyday cases adequately.

There's a privacy angle worth considering for sensitive documents. OCR processing requires the document content to be readable by the OCR engine, and engines that run in the cloud necessarily see that content during processing. The implementation here processes documents on temporary servers, deletes them after the conversion window, and doesn't retain copies for training or indexing. For documents that are too sensitive to send to any external service — leaked materials being investigated by a journalist, internal HR documents containing personnel data, medical records subject to HIPAA — the right answer is running OCR locally with offline tools (Tesseract directly, or commercial alternatives like ABBYY FineReader). The trade is convenience for control; for most everyday documents the convenience wins, but the option matters for the cases where it doesn't.
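For the local route, a minimal sketch of calling the Tesseract command-line program from Python follows. It assumes Tesseract and the needed language packs are installed on your own machine, and it works on a single page image, since Tesseract reads images rather than PDFs. The helper names and file names are hypothetical.

```python
import shutil
import subprocess
from typing import List, Sequence


def tesseract_pdf_args(image_path: str, output_base: str,
                       languages: Sequence[str] = ("eng",)) -> List[str]:
    """Build a Tesseract command that writes <output_base>.pdf with a text layer."""
    # The trailing 'pdf' selects Tesseract's searchable-PDF output.
    return ["tesseract", image_path, output_base, "-l", "+".join(languages), "pdf"]


def run_local_ocr(image_path: str, output_base: str,
                  languages: Sequence[str] = ("eng",)) -> str:
    """Run OCR entirely on this machine; nothing is uploaded anywhere."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    subprocess.run(tesseract_pdf_args(image_path, output_base, languages), check=True)
    return output_base + ".pdf"
```

Calling `run_local_ocr("page-001.png", "page-001", ["tur", "eng"])` would produce `page-001.pdf` next to the scan. For whole PDFs, ocrmypdf can be installed and run locally in the same spirit.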

Operationally the tool is a single upload-and-download flow. Upload the PDF, choose the language (or languages) for recognition, pick an OCR mode (the default skips pages that already have searchable text), and download the result. Files are processed in temporary storage, links expire quickly, no signup is required, no watermark is added, and no per-day quota counts down in the background. Multiple PDFs can be run one after another, which is useful when digitising a folder of scanned archives rather than a single document. Most pages process in 1-2 seconds at 300 DPI; large multi-hundred-page archives take proportionally longer but still complete in a single pass without manual intervention between pages.

Use cases

  • Digitise old contracts, receipts, or lecture notes for search.
  • Extract text from scanned academic papers.
  • Search a pile of scanned invoices by keyword.
  • Make scanned PDFs screen-reader accessible.

How it works

  1. Upload a PDF (typically a scan).
  2. Pick up to three Tesseract languages and an OCR mode.
  3. Download the searchable PDF (and optional .txt transcript zip).

FAQ

Which languages can I combine?

Most Tesseract packs — eng, tur, ita, deu, fra, spa, ara, and many more. Combine up to three codes for mixed-language documents.

Will the layout change?

OCR adds an invisible text layer; the visible scan stays the same, so layout is preserved.

Why does my output look the same?

That's expected — the text becomes selectable without changing the visible pages.

Are files stored?

No. Files are processed transiently and deleted right after download.