Date: 2006-05-30
Types of PDFs: Native vs. Scanned PDFs
Does the type of PDF matter? Yes. When it comes to converting PDFs, the way the file was created determines how reliably its contents can be extracted. Here’s a behind-the-scenes look at the two types of PDFs.
Native PDFs
Native PDFs are generated from an electronic source, such as a Word document, a computer-generated report, or spreadsheet data. These have an internal structure that can be read and interpreted.
These “generated” PDF documents already contain characters that have an electronic character designation. In most cases, the PDF creation software takes information from the structure of the source document (character information, word placement information, and so on) and retains it in the created PDF, which is why you can word search a text-based document. Conversion from such a PDF can rely on these electronic character designations and provide reliable output. A variety of PDF converters on the market will take the data from native PDFs and move it into MS Word, Excel and other formats. Able2Extract 3.0 by Investintech is one example of a PDF converter that can handle native PDFs.
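To make the idea of stored character designations concrete, here is a minimal Python sketch. It assumes a toy, uncompressed page content stream in which text is drawn with the PDF `Tj` (show text) operator. The sample stream and the helper names `extract_strings` and `page_contains` are illustrative assumptions, not part of any converter. Real PDFs usually compress their streams and map character codes through font encodings, so this is not a general-purpose extractor.

```python
import re

# Toy, uncompressed page content stream (an assumption for illustration).
# BT/ET bracket a text object, Tf selects a font, Td moves the text
# position, and Tj shows the string that precedes it.
NATIVE_PAGE = b"""BT
/F1 12 Tf
72 720 Td
(Quarterly report) Tj
0 -16 Td
(Revenue rose in May) Tj
ET"""

# A literal string in parentheses followed by the Tj operator.
SHOW_TEXT = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")


def extract_strings(content: bytes) -> list[str]:
    """Return the literal strings shown with Tj, in stream order."""
    found = []
    for match in SHOW_TEXT.finditer(content):
        raw = match.group(1)
        # Resolve only the escapes for parentheses and backslash.
        raw = re.sub(rb"\\([()\\])", rb"\1", raw)
        found.append(raw.decode("latin-1"))
    return found


def page_contains(content: bytes, word: str) -> bool:
    """Case-insensitive word search over the stored characters."""
    return any(word.lower() in s.lower() for s in extract_strings(content))


print(extract_strings(NATIVE_PAGE))
print(page_contains(NATIVE_PAGE, "revenue"))
```

Because the characters are already stored in the file, both the search and the extraction are simple lookups rather than guesses.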
Scanned PDFs
Because not all documents that need to be transmitted exist in electronic form yet, physical paper documents still have to be converted into electronic form. This is where the scanned PDF comes into play.
It would be inefficient to re-type documents manually and then convert them into PDFs. The solution is to scan them using an electronic scanning device. Like the PDF writer, a scanner “digitally captures” the physical document in electronic form. A scanner doesn’t reconstruct the characters of every word when it creates this scanned image, however; it takes a “snap-shot” of the document. This snap-shot is then turned into a PDF by software integrated with the scanner. The result is a scanned PDF document.
However, even though the image may be of a document that contains words, the computer recognizes those words only as “images” that it displays without any information structure behind them. If you try to text search the document, the PDF search engine won’t yield any results.
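The difference can be sketched in a few lines of Python. The example assumes toy, uncompressed page content streams: a scanned page that only paints one image with the `Do` operator, and a native page that shows text with `Tj`. The resource name `Im0`, the sample streams, and the function `classify_page` are hypothetical. The check is a rough heuristic only, since `Do` can also paint other kinds of objects and real streams are usually compressed.

```python
import re

# Toy scanned page: scale the coordinate system to the page size, then
# paint a single image. "Im0" is a hypothetical resource name.
SCANNED_PAGE = b"q\n612 0 0 792 0 0 cm\n/Im0 Do\nQ"

# Toy native page, for comparison.
NATIVE_PAGE = b"BT\n/F1 12 Tf\n72 720 Td\n(Annual summary) Tj\nET"

# Text-showing operators: a string followed by Tj, or an array followed by TJ.
TEXT_OP = re.compile(rb"\)\s*Tj|\]\s*TJ")
# A named external object (such as an image) painted with Do.
PAINT_OP = re.compile(rb"/\w+\s+Do\b")


def classify_page(content: bytes) -> str:
    """Guess whether a page stores characters or only a painted image."""
    if TEXT_OP.search(content):
        return "native"
    if PAINT_OP.search(content):
        return "scanned"
    return "unknown"


print(classify_page(SCANNED_PAGE))
print(classify_page(NATIVE_PAGE))
```

A page made up only of a painted image gives a search engine no characters to compare against, which is why the search comes back empty.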
To convert a scanned PDF into an editable format, OCR (Optical Character Recognition) software is required to analyze the “image” of each character and match it to an electronic character designation. Because the software has to infer each character from its shape, it is much more difficult to be sure that the character “recognized” by the OCR software is, indeed, the character on the scanned document.
Note that the quality of OCR output is affected by factors such as poor image quality in the scanned document, a mixture of fonts, and italicized or underlined text, all of which can blur the shape of individual characters.
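A toy Python sketch illustrates why degraded images lead to recognition errors. It assumes tiny 3x5 bitmaps and a naive nearest-template matcher. The glyph shapes and the functions `similarity` and `recognize` are made up for illustration. Commercial OCR engines use far more sophisticated techniques, and this is not a description of how any particular product works.

```python
# Hypothetical 3x5 reference glyphs: '#' is ink, '.' is blank paper.
TEMPLATES = {
    "O": ("###", "#.#", "#.#", "#.#", "###"),
    "C": ("###", "#..", "#..", "#..", "###"),
    "I": ("###", ".#.", ".#.", ".#.", "###"),
    "L": ("#..", "#..", "#..", "#..", "###"),
}


def similarity(glyph, template):
    """Fraction of pixels on which the scanned glyph and the template agree."""
    pairs = [
        (g, t)
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    ]
    return sum(g == t for g, t in pairs) / len(pairs)


def recognize(glyph):
    """Return the best-matching character and its similarity score."""
    best = max(TEMPLATES, key=lambda ch: similarity(glyph, TEMPLATES[ch]))
    return best, similarity(glyph, TEMPLATES[best])


# A clean scan of "O", and the same letter with two pixels faded away.
clean_o = ("###", "#.#", "#.#", "#.#", "###")
faded_o = ("###", "#..", "#.#", "#..", "###")

print(recognize(clean_o))
print(recognize(faded_o))
```

With only two pixels lost, the faded "O" sits closer to the "C" template than to its own, so the matcher reports the wrong character with a high score. The same effect, at a larger scale, is what poor scans and unusual fonts cause.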
Finding a PDF converter that handles image PDFs is more difficult. The professional version of Able2Extract can handle image PDFs as well as native PDFs.