Does the type of PDF matter? Yes, it does. When it comes to converting PDFs, the way a file was created determines how reliably it can be converted. Here’s a behind-the-scenes look at the two types of PDFs that exist: native and scanned.
As noted, native PDFs are generated from an electronic source, such as a Word document, a computer-generated report, or spreadsheet data. They have an internal structure that software can read and interpret.
These "generated" PDF documents therefore already contain characters stored as electronic character data. In most cases, the PDF creation software takes information from the structure of the source document, such as character information and word placement, and retains it in the created PDF. This is why you can search for words in a text-based PDF. Conversion from such a PDF can rely on this stored character data and provide reliable output. A variety of PDF converters on the market will take the data from a native PDF to Word, Excel and other formats. Able2Extract 8 by Investintech is one example of a PDF converter that can handle native PDFs.
Sometimes the documents that need to be transmitted are not digital yet, so the physical paper document must first be converted into electronic form. This is where the scanned PDF comes into play.
It would be inefficient to re-type documents manually into electronic form and then convert them into PDFs. The solution is to scan them with an electronic scanning device. Like the PDF writer, a scanner "digitally captures" the physical document in electronic form. Unlike the PDF writer, however, a scanner doesn't reconstruct the characters of each word when it creates the scanned image. Instead, it takes a "snapshot" of the document. This snapshot is then turned into a PDF by software integrated with the scanner. The result is a scanned PDF document.
However, even though the image may show a document that contains words, the computer recognizes those words only as “images” that it displays without any information structure behind them. If you try to search the document for text, the PDF search function won’t yield any results.
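The difference between the two types can be illustrated with a small check. The sketch below is only a heuristic, not how any particular converter works: it assumes the text of each page has already been extracted by some PDF library, and the function name `classify_pdf` and the `min_chars` threshold are hypothetical choices made for this example. A page that yields almost no text is treated as an image-only page.

```python
from typing import List


def classify_pdf(page_texts: List[str], min_chars: int = 20) -> str:
    """Guess whether a PDF is native, scanned, or a mix of both.

    page_texts holds the text extracted from each page (assumed to come
    from a PDF library of your choice). min_chars is an arbitrary
    threshold: pages with fewer non-blank characters count as image-only.
    """
    if not page_texts:
        raise ValueError("the document has no pages")

    # Count the pages that carry a usable amount of extractable text.
    text_pages = sum(1 for text in page_texts if len(text.strip()) >= min_chars)

    if text_pages == len(page_texts):
        return "native"   # every page has searchable character data
    if text_pages == 0:
        return "scanned"  # no page has character data, OCR is needed
    return "mixed"        # some pages are images, some are text
```

A document classified as "scanned" or "mixed" is one where a text search would come back empty for at least some pages, so it would need OCR before conversion.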
To convert scanned PDF documents into an editable format, OCR (Optical Character Recognition) software is required to analyze the “image” of each character and match it to an electronic character. Because this match is inferred from the shape of the image rather than read from stored character data, it is much more difficult to be sure that the character "recognized" by the OCR software is, indeed, the character on the scanned document.
Note that the quality of the OCR output is affected by factors such as poor image quality in the scanned document, a mixture of fonts, and italicized or underlined text, all of which can obscure the shape of individual characters.
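To make the idea of matching a character image concrete, here is a toy sketch of template matching. It is a deliberate simplification and not the method used by any real OCR product: the 3x5 bitmaps in `TEMPLATES` and the functions `similarity` and `recognize` are invented for this illustration. Each scanned glyph is compared pixel by pixel with known shapes, and the closest shape wins.

```python
# Hypothetical 3x5 reference shapes: '#' is ink, '.' is blank paper.
TEMPLATES = {
    "I": ("###", ".#.", ".#.", ".#.", "###"),
    "L": ("#..", "#..", "#..", "#..", "###"),
    "T": ("###", ".#.", ".#.", ".#.", ".#."),
}


def similarity(glyph, template):
    """Return the fraction of pixels that agree between two bitmaps."""
    total = sum(len(row) for row in template)
    same = sum(
        pixel_a == pixel_b
        for row_a, row_b in zip(glyph, template)
        for pixel_a, pixel_b in zip(row_a, row_b)
    )
    return same / total


def recognize(glyph, templates=TEMPLATES):
    """Return the best-matching character and its similarity score."""
    scores = {char: similarity(glyph, tpl) for char, tpl in templates.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# A clean scan of "L" matches its template exactly.
clean_char, clean_score = recognize(TEMPLATES["L"])

# A speck of dirt adds one stray pixel, so the match is no longer perfect.
noisy_l = ("#..", "#..", "#.#", "#..", "###")
noisy_char, noisy_score = recognize(noisy_l)
```

In this sketch the noisy glyph is still recognized as "L", but with a score below 1.0. With more noise, or with shapes that resemble each other closely (such as "I" and "T" here), two templates can score equally well, which is the kind of uncertainty that poor scans and mixed fonts introduce.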
Finding a PDF converter that handles image (scanned) PDFs is more difficult. The professional version of Able2Extract can handle image PDFs as well as native PDFs.