Converting Scanned AutoCAD PDFs With OCR

As 2008 rolls on, so does the work and, no doubt, the PDF conversions as well. Don’t worry, we’re at it too. And every now and again, amidst troubleshooting and developing, we get an email from a client having difficulties with AutoCAD PDFs:

“I downloaded and installed your Pro version as a trial.  When I tried to convert a PDF file which was an AutoCAD drawing scanned and saved as such, it seems as if it was working but it opens Excel and nothing is converted in?”

If you’re experiencing or have experienced the same problem without any luck, don’t give up yet. Here’s a conversion tip: try resizing the image-based/scanned PDF.

This is because AutoCAD files are usually created with huge page dimensions that can measure up to 30″ by 40″. In addition, it is difficult for the OCR engine to determine the size (in points) of any letter on a scanned page. So the OCR engine is often unable to extract legible text from AutoCAD documents because the text is so small relative to the page (hence the empty Excel output).

The only way it can determine the size of the text is by comparing it to the stated size of the PDF page, and that page size has to be one the OCR engine supports. The OCR engine in Able2Extract Professional only supports page dimensions of up to 22″ by 22″.
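To get a feel for how much a drawing has to shrink, here is a minimal Python sketch of the arithmetic. It assumes the 22″ by 22″ limit mentioned above and the usual PDF convention of 72 points per inch; the function names `fit_scale` and `scaled_page` are made up for this illustration and are not part of any product.

```python
POINTS_PER_INCH = 72  # default PDF unit: 1 point = 1/72 inch


def fit_scale(width_in, height_in, max_in=22.0):
    """Uniform scale factor (never above 1.0) that fits the page
    inside a max_in x max_in inch square."""
    longest_side = max(width_in, height_in)
    return min(1.0, max_in / longest_side)


def scaled_page(width_in, height_in, max_in=22.0):
    """Page dimensions in inches after applying the fit_scale factor."""
    scale = fit_scale(width_in, height_in, max_in)
    return (width_in * scale, height_in * scale)


if __name__ == "__main__":
    # A large drawing sheet like the 30 x 40 inch example in the post.
    scale = fit_scale(30, 40)
    new_w, new_h = scaled_page(30, 40)
    print(f"scale factor: {scale:.2f}")
    print(f"new size: {new_w:.1f} x {new_h:.1f} inches")
    print(f"in points: {new_w * POINTS_PER_INCH:.0f} x {new_h * POINTS_PER_INCH:.0f}")
```

For a 30″ by 40″ page, the longest side has to drop from 40″ to 22″, so the factor works out to 0.55 and the resized page comes to 16.5″ by 22″. A page that already fits, such as letter size, is left at its original size.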

To resize the PDF:

1) Open the PDF in either Adobe Reader or Acrobat

2) Select File > Print

3) Change the Printer Name to ‘Adobe PDF’ in the drop box

4) Under the Page Scaling section ensure that ‘Choose Paper Source by PDF page size’ is deselected

[Image: AutoCAD Print]

5) Click OK to print a new PDF

You can also resize the PDF with our trial version of Sonic PDF Creator 2.0.  After installing Sonic, select ‘Sonic PDF’ as a printer (as opposed to Adobe PDF in step 3).

After you’ve resized the PDF, try the conversion again.

Hope this tip helps!

Why Performing OCR On Handwriting Doesn’t Work

Unsurprisingly, OCR is consistently a hot topic among PDF users. In paper-intensive work environments, PDF conversion and OCR engines have proven to be a successful workaround for transferring paper files into word processing applications. Thus, with the help of scanners and the PDF format, any and all types of paperwork can be done electronically and efficiently. Or can they?

As we try to move every non-digital working habit into an electronic equivalent, there are still some things that just can’t be done with ease using the same everyday tools. For instance, what about converting hand-printed or handwritten documents?

Three Flavors Of OCR

Many of you have probably wondered why such a thing can’t be done with the OCR technology in PDF conversion products. Well, this is because standard OCR is only capable of recognizing machine-printed characters and fonts. And since most of the documents being scanned in are typewritten, OCR is what gets employed in almost all cases.

In other cases, there are documents that contain handwritten sections and/or fields used for collecting data (something that is slowly being superseded by the fillable PDF form). You can create a digital copy of such a document simply by scanning it in, right? Yes. However, recognizing the text on it requires a different recognition technology altogether. Using OCR, you can perhaps get one letter to “OCR” into ASCII, if it’s printed clearly and written in ink that’s thick enough to be read. But that’s about it. This is where another flavor of OCR comes in: Intelligent Character Recognition (ICR).

ICR is a more advanced form of OCR that translates hand-printed letters into digital ASCII equivalents. This version of OCR is primarily used for processing applications and forms on which you “print clearly” and place individual letters in boxes. This structured method of reading a hand-printed document is one of the major limitations of the technology, but it controls and reduces the number of human errors that cause misinterpretations.

In addition, there are documents that contain handwriting, aka cursive writing. Can recognition on such documents be performed? The answer: Yes. The third flavor of OCR is IR (Intelligent Recognition), the latest generation of OCR technology to date. It is used to read unconstrained writing (text not contained in boxes) and, like ICR, translates the characters into ASCII text. From my online searching, there are a good number of companies that provide full-fledged OCR/ICR/IR solutions, which can be integrated with digital workflows.

Thus, if you’re looking to OCR handwritten PDFs, you’ll be sorely disappointed. The ability to do everything and anything with technology is perhaps the ultimate goal for developers and users. Practicing it, on the other hand, is perhaps the ideal goal for every worker bee out there. It’s sad to say, but there are some cases in which you can only do so much.

The ABCs of the PDF: M to O

A lot has happened with the PDF format in the last year: submission for standardization, release of a new specification, software upgrades, and improvements with graphic and dynamic PDFs. In this installment of the series, you get a look at the PDF’s recent format competition and past legal issues, as well as other uses of PDF-related technology. Here it is.

Macromedia

Adobe Systems, Inc. completed its acquisition of Macromedia Inc. in December 2005 and has, since then, injected Macromedia technology into its software. However, Adobe and Macromedia had come into close legal contact even before the acquisition, over patent disputes.

According to articles from 2000 to 2002, the dispute was over a patent, awarded to Adobe, on a tabbed palette interface element. The issue dated back to 1996 and ran right up until 2000, during which time Adobe had confronted Macromedia about the palette’s inclusion in several of Macromedia’s products.

Yet Macromedia’s argument against the suit, filed in August of 2000, was that the patent was invalid. This escalated to a point where Macromedia countersued Adobe in September 2000 for infringing on three of Macromedia’s own patents. After two years of back-and-forth legal battles, Adobe won the lawsuit and was awarded $2.8 million.

And five years later, Macromedia is now one of Adobe’s acquisitions. . . .

Native PDFs

As you know, native PDFs are ones that are generated from electronically created documents. Yet, while these native PDFs are beneficial when it comes to conversion, they can also produce just as much legal hubbub as patent disputes can. Moving the ability to create PDF files, or PDF-like formats, directly into the authoring application was definitely a complex issue that became a major headliner in PDF news this year.

Back in February, I wrote three postings on factors that made creating digital documents and native PDFs a more significant matter than ever before. There were the legal issues between Adobe and Microsoft; the PDF specification submission to ISO; and then, there was OpenOffice.org, Microsoft’s word processing app rival whose applications sport ODF creation, a format that became a statewide standard in Massachusetts.

Creating native PDFs and PDF-like formats now involves more politics at the authoring application level. Microsoft has the convenience of a widely used platform, Adobe has ubiquity as the de facto standard, and OpenOffice has the state of Massachusetts. Creating a native PDF, or PDF-like format, is now, in one sense, a matter of “moral” choice: are you an Acrobat advocate, a loyal MS Office user, or an open source supporter?

OCR

You know it by its three-letter acronym, and you know what it does when it comes to converting scanned PDF files. Yet, as software that literally recognizes and translates digitally imaged characters into character codes (ASCII or Unicode), OCR isn’t just for converting scanned PDFs.
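To make “character codes” a little more concrete: once an OCR engine has decided which characters appear in an image, what it hands back is just a sequence of numbers. The short Python illustration below does not perform any recognition; the strings simply stand in for whatever an engine might output.

```python
# Stand-in for text an OCR engine has already recognized.
recognized = "PDF"

# Each character corresponds to a numeric code point.
codes = [ord(ch) for ch in recognized]
print(codes)                        # [80, 68, 70]

# Plain English letters fit in ASCII, one byte per character.
print(recognized.encode("ascii"))   # b'PDF'

# Characters outside ASCII still have a Unicode code point,
# but need more than one byte when stored as UTF-8.
accented = "é"
print(ord(accented))                # 233
print(accented.encode("utf-8"))     # b'\xc3\xa9'
```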

OCR has been used for a wide range of data processing systems. It’s been used by the Standard Oil Company of California for credit card imprints for billing purposes. At the Ohio Bell Telephone Company, OCR was used for reading bill stubs. Even the United States Air Force used OCR for reading and transmitting typewritten messages.

Another big use for OCR technology is postal work. The first use of OCR in Europe was by the British General Post Office for automating the mail sorting process. OCR scanners read the address on each envelope, and a routing barcode based on the corresponding postal code is then marked upon it, resulting in faster organization and shipment times. In 1965, the United States Postal Service adopted the method, followed by Canada Post in 1971.

Today, OCR is being further enhanced as a data input method ranging from simple text to digital scanning processes to sophisticated ICR (Intelligent Character Recognition), a more advanced version of OCR that recognizes hand printed documents.

Whether it’s buzzing with long-standing issues from the past or just slowly unfolding with new developments, the PDF world can be an interesting place indeed.