Optical Character Recognition

What is a Scanned (Image) PDF?

There are several ways to create a PDF:

  1. using a PDF authoring suite such as Adobe Acrobat, or other graphics software, to create custom PDFs,
  2. using other programs such as Microsoft Office to print their native documents to PDF,
  3. using file converters for PDF creation, or
  4. using a scanner to scan paper documents and turn them into digital files.

All of these methods except the last yield native PDFs, which contain electronically encoded characters and are easily searchable. PDFs produced by scanning are scanned (image) PDF files, which contain only an image of the document.

Scanning paper documents is an excellent way to move extensive paperwork into a space-saving digital archive. However, scanned documents have shortcomings that make them difficult to work with. They are not searchable, so finding a specific section of a book or an annual financial report in a scanned document is as difficult as finding it in the original paper document. They are also not editable. To do anything more than view them, the standard practice is to convert scanned PDFs into editable formats. Software used for this purpose is called Optical Character Recognition (OCR) software.

What is OCR?

Optical Character Recognition software optically recognizes each character in a scanned document and represents it electronically. In other words, it translates the image of each character into an electronically encoded character.

The character recognition process is very complex: the OCR program must match each letter in the image to the corresponding electronic character. To recreate the document, the program also has to recognize the font that is used. In many cases the scanned copy of a document is of low quality, with blurred or unrecognizable characters, especially if the original paper copy was in poor condition (crumpled, faded, and so on). In these cases it is difficult for the OCR software to perform accurately, and that is when errors occur.
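To make the matching step concrete, here is a minimal sketch in Python. It is a toy illustration only: the 3x5 glyph shapes, the TEMPLATES table, and the function names are all invented for this example, and real OCR engines are considerably more sophisticated. Each scanned letter is compared pixel by pixel against a small set of known shapes, and the closest one wins.

```python
# Toy 3x5 bitmap templates; '#' is an inked pixel, '.' is blank paper.
# These shapes are made up for illustration and come from no real font.
TEMPLATES = {
    "I": ["###",
          ".#.",
          ".#.",
          ".#.",
          "###"],
    "L": ["#..",
          "#..",
          "#..",
          "#..",
          "###"],
    "T": ["###",
          ".#.",
          ".#.",
          ".#.",
          ".#."],
    "O": ["###",
          "#.#",
          "#.#",
          "#.#",
          "###"],
}


def pixel_distance(a, b):
    """Count the pixels at which two equally sized bitmaps differ."""
    return sum(
        pa != pb
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )


def recognize(glyph):
    """Return (character, distance) for the template closest to the glyph."""
    best = min(TEMPLATES, key=lambda ch: pixel_distance(glyph, TEMPLATES[ch]))
    return best, pixel_distance(glyph, TEMPLATES[best])


def recognize_line(glyphs):
    """Recognize a sequence of glyph bitmaps and join the results."""
    return "".join(recognize(g)[0] for g in glyphs)


# A slightly faded "T" (one pixel missing) is still closest to "T".
faded_t = ["##.",
           ".#.",
           ".#.",
           ".#.",
           ".#."]

# A smudged "T" (two stray pixels on the bottom row) looks exactly like "I".
smudged_t = ["###",
             ".#.",
             ".#.",
             ".#.",
             "###"]

print(recognize(faded_t))
print(recognize(smudged_t))
```

In this sketch the faded letter is still read correctly, while the smudged one is silently read as a different character. That is the same kind of error a low-quality scan causes in practice, only on a much smaller scale.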

No completely error-free OCR software has been invented so far. However, advancements are continually being made in this direction. Today there are many professional OCR tools on the market that can convert scanned documents surprisingly well. One of them is the professional version of Able2Extract, which includes advanced OCR capabilities and gives its users a way to quickly overcome the issues that come with image PDFs.