Are all PDF documents the same?

No, they are not. PDF documents can be created in a variety of ways. PDFs that are generated from an electronic source, such as an MS Word document, a computer-generated report, or spreadsheet data, have an internal structure that can be read and interpreted. These “generated” PDF documents already contain characters that have an electronic character designation. As such, conversion from such a PDF can rely on these electronic character designations and provide reliable output.

PDF documents can also be created by scanning a paper document into an electronic format. A scanned document is really just a “picture” of the words it contains. To convert a scanned document into an editable format, OCR software must analyze the “image” of each character and match it to an electronic character. Because of this, it is much more difficult to ensure that the character “recognized” by the OCR software is the character on the scanned document. Several factors can affect the quality of the OCR output, such as poor image quality in the scan, a mixture of fonts, and italicized or underlined text, all of which can blur the shape of individual characters.
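
To make the matching step concrete, here is a minimal Python sketch of one simple way a character image could be compared against known shapes. It is an illustration only, not how Able2Extract or any particular OCR engine works: the tiny 3x5 bitmaps in TEMPLATES and the function names are assumptions made up for this example. It shows how a single smudged pixel can leave two characters equally plausible.

```python
# Hypothetical 3x5 bitmaps ("#" = ink, "." = blank); real OCR engines use
# far richer models than this pixel-by-pixel comparison.
TEMPLATES = {
    "I": ("###", ".#.", ".#.", ".#.", "###"),
    "T": ("###", ".#.", ".#.", ".#.", ".#."),
    "L": ("#..", "#..", "#..", "#..", "###"),
}


def mismatches(glyph, template):
    """Count the pixels where the scanned glyph differs from a template."""
    return sum(
        a != b
        for glyph_row, template_row in zip(glyph, template)
        for a, b in zip(glyph_row, template_row)
    )


def rank_candidates(glyph):
    """Return (character, mismatch count) pairs, best match first."""
    scores = [(ch, mismatches(glyph, t)) for ch, t in TEMPLATES.items()]
    return sorted(scores, key=lambda item: item[1])


clean_t = ("###", ".#.", ".#.", ".#.", ".#.")
smudged_t = ("###", ".#.", ".#.", ".#.", ".##")  # one stray pixel of "ink"

print(rank_candidates(clean_t))    # [('T', 0), ('I', 2), ('L', 10)]
print(rank_candidates(smudged_t))  # [('I', 1), ('T', 1), ('L', 9)]
```

With the clean image, "T" is the only perfect match. With the smudged image, "I" and "T" tie at one mismatch each, so the software has to guess, which is the kind of uncertainty poor scan quality introduces.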



What is an image PDF?

As noted above, there is more than one way to create a PDF document. One of these methods is to use a scanner, or a similar machine, that takes an image of a document and then stores this image as an electronic PDF file. A scanner, or a photocopier with scanning capabilities, does not recreate each character of every word when it creates this scanned image; rather, it simply takes a “snapshot” of the page. This snapshot is then turned into a PDF document by software that integrates with the scanner or photocopier, and the result is a “scanned” PDF document. A variety of scan-to-PDF software products on the market today can assist with this.

The alternative to a scanned PDF document is a created PDF document: one that begins as an electronic document, say a Word document, and is then converted into PDF using PDF creation software. In most cases, the PDF creation software will take information from the structure of the Word document, such as character information and word placement information, and retain it in the created PDF. As such, a created PDF has much more of an internal structure than a scanned PDF, and a program like Able2Extract v.5.0 uses that structure to extract the information.

In order to edit a scanned PDF document, Optical Character Recognition software is required to electronically identify each character on a page and then convert it into a usable format. Essentially, it extracts text from an image. This functionality has been added to the Professional versions of Able2Extract and Able2Doc.
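
As a rough illustration of the difference in internal structure, the Python sketch below guesses whether a PDF carries text or only images by looking for two markers in the raw file bytes: font resources ("/Font"), which text drawn on a page relies on, and image objects ("/Subtype /Image"). The function name and return labels are assumptions made up for this example, and the check is a heuristic only: files that store their objects in compressed form hide these markers, and a scanned PDF that has already been through OCR will also contain fonts. A real tool would use a full PDF parser.

```python
import re


def classify_pdf_bytes(data: bytes) -> str:
    """Crude guess at how a PDF was made, based on raw byte markers.

    Returns "text-bearing", "image-only", or "unknown". Hypothetical helper
    for illustration; it does not parse the PDF structure.
    """
    has_fonts = b"/Font" in data
    has_images = re.search(rb"/Subtype\s*/Image", data) is not None
    if has_fonts:
        return "text-bearing"   # likely created from an electronic source
    if has_images:
        return "image-only"     # likely a scan with no text layer
    return "unknown"            # markers absent or hidden by compression


# Usage with a file on disk (the path is a placeholder):
# with open("document.pdf", "rb") as handle:
#     print(classify_pdf_bytes(handle.read()))
```

An "image-only" result suggests that OCR would be needed before the text can be edited.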

What is OCR (Optical Character Recognition)?

Optical Character Recognition (OCR) is a visual recognition process that turns printed or written text into an electronic character-based file. A document that is scanned and converted into a PDF gives character recognition software the basis to interpret each character image in the PDF and assign it an electronic character, which can then be placed into an editable format such as a Text or Word document.
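
The overall flow, from a scanned line of text to editable characters, can be sketched in a few lines of Python. This is a toy under stated assumptions, not the method used by any specific product: the 3x3 bitmaps in TEMPLATES, the rule of cutting characters apart at blank columns, and all function names are invented for the example, and every character is assumed to be exactly three pixels wide.

```python
# Hypothetical 3x3 bitmaps ("#" = ink, "." = blank) for three letters.
TEMPLATES = {
    "L": ("#..", "#..", "###"),
    "T": ("###", ".#.", ".#."),
    "O": ("###", "#.#", "###"),
}


def split_glyphs(rows):
    """Cut a line bitmap into character images at fully blank columns."""
    width = len(rows[0])
    glyphs, start = [], None
    for x in range(width + 1):
        blank = x == width or all(row[x] == "." for row in rows)
        if not blank and start is None:
            start = x  # first inked column of a new character
        elif blank and start is not None:
            glyphs.append(tuple(row[start:x] for row in rows))
            start = None
    return glyphs


def recognize(glyph):
    """Pick the template character with the fewest differing pixels."""
    def mismatches(template):
        return sum(
            a != b
            for glyph_row, template_row in zip(glyph, template)
            for a, b in zip(glyph_row, template_row)
        )
    return min(TEMPLATES, key=lambda ch: mismatches(TEMPLATES[ch]))


def ocr_line(rows):
    """Turn a line bitmap into a plain string of recognized characters."""
    return "".join(recognize(glyph) for glyph in split_glyphs(rows))


line = (
    "#...###.###",
    "#...#.#..#.",
    "###.###..#.",
)
print(ocr_line(line))  # LOT
```

The string that comes out is ordinary character data, which is what allows the result to be saved into an editable Text or Word document.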

Given the proliferation of scan-to-PDF technology available today, Investintech’s OCR solutions aim to provide a way to convert PDF documents that were created by scanning. The quality of the OCR conversion will largely depend on the quality of the scanned image and the clarity of the characters in that image.