13 Ways The PDF Is Vulnerable

What makes the PDF so enticing to malicious users? There are more reasons than you think.

With the recent headlines about Adobe PDF vulnerabilities being taken advantage of, just about anyone who used a PC was on the alert. PDF files have the potential to do some serious damage to systems and data when infected.

Because the PDF is not without its weaknesses, anticipating the ways in which attackers can use the format can be the best way to defend against them.

Below is a brief look at 13 ways, both technical and simple, in which the PDF is vulnerable and can be manipulated by malicious users.

1) JavaScript

Online PDFs accept open parameters that can be injected with malicious JavaScript code. Because JavaScript is so flexible, hackers can do a broad range of things using the PDF file as their hacking tool of choice.
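To make this concrete, here is a minimal Python sketch of one defensive check: flagging links to PDFs whose URL fragment (the part after the #, where open parameters travel) carries script-like text. The function name and the two patterns it looks for are assumptions made up for illustration. This is not a complete filter and is not taken from any real security product.

```python
from urllib.parse import urlsplit, unquote


def has_suspicious_pdf_fragment(url: str) -> bool:
    """Return True if a link to a .pdf carries script-like text in its fragment.

    Hypothetical helper: the patterns checked here are illustrative only.
    """
    parts = urlsplit(url)
    # Only look at links whose path points at a PDF file.
    if not parts.path.lower().endswith(".pdf"):
        return False
    # Decode percent-escapes so "%3Cscript" is treated the same as "<script".
    fragment = unquote(parts.fragment).lower()
    return "javascript:" in fragment or "<script" in fragment


if __name__ == "__main__":
    for link in (
        "http://example.com/report.pdf#page=3",
        "http://example.com/report.pdf#x=javascript:alert(1)",
    ):
        print(link, has_suspicious_pdf_fragment(link))
```

A check like this only catches the most obvious cases, so treat it as a starting point for thinking about the problem rather than as protection.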

2) Spam

The recent spamming attacks this year demonstrated a way of exploiting the nature of the PDF as a file format. Until recently, the PDF never really got caught at the anti-spam gates. Thus, although most anti-spam products now check PDFs and other forms of image spam, PDFs containing spam made it into millions of inboxes everywhere. Although not as immediately threatening as code execution, spam is still spam and has the ability to deliver Trojans, viruses, and malware.

Converting Scanned AutoCAD PDFs With OCR

As the new 2008 year rolls on, so does the work and no doubt, the PDF conversions as well. Don’t worry, we’re at it too. And every now and again, amidst troubleshooting and developing, we get an email from clients having difficulties with AutoCAD PDFs:

“I downloaded and installed your Pro version as a trial.  When I tried to convert a PDF file which was an AutoCAD drawing scanned and saved as such, it seems as if it was working but it opens Excel and nothing is converted in?”

If you’re experiencing or have experienced the same problem without any luck, don’t give up yet. Here’s a conversion tip: try resizing the image-based/scanned PDF.

This is because AutoCAD files are usually created with huge page dimensions that measure up to 30″ by 40″. In addition, it is difficult for the OCR engine to determine the size (in points) of any letter on an OCR page.  So the OCR engine is oftentimes unable to extract legible text from AutoCAD documents due to the small text size (hence the empty Excel output).

The only way it can determine the size of the text is by comparing it relative to the size of a stated PDF page which the OCR engine can read and support. The OCR engine in Able2Extract Professional can only support AutoCAD file dimensions of up to 22″ by 22″.
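To see the numbers involved, here is a rough Python sketch of the arithmetic. It assumes the usual PDF convention of 72 points per inch and takes the 30″ by 40″ and 22″ by 22″ figures from above. The function name is made up for illustration, and the sketch says nothing about how Able2Extract works internally.

```python
POINTS_PER_INCH = 72  # assumption: default PDF unit, 1 point = 1/72 inch


def fit_scale(width_in, height_in, max_in=22.0):
    """Uniform scale factor (never above 1.0) that fits a page inside max_in x max_in inches."""
    return min(1.0, max_in / width_in, max_in / height_in)


if __name__ == "__main__":
    width_in, height_in = 30.0, 40.0  # the large AutoCAD page size mentioned above
    scale = fit_scale(width_in, height_in)
    print(f"original page: {width_in * POINTS_PER_INCH:.0f} x {height_in * POINTS_PER_INCH:.0f} points")
    print(f"scale factor:  {scale:.2f}")
    print(f"resized page:  {width_in * scale:.1f} x {height_in * scale:.1f} inches")
```

For the 30″ by 40″ example the factor works out to 0.55, which gives a 16.5″ by 22″ page that sits inside the supported dimensions.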

To resize the PDF:

1) Open the PDF in either Adobe Reader or Acrobat

2) Select File > Print

3) Change the Printer Name to ‘Adobe PDF’ in the drop box

4) Under the Page Scaling section ensure that ‘Choose Paper Source by PDF page size’ is deselected

5) Click OK to print a new PDF

You can also resize the PDF with our trial version of Sonic PDF Creator 2.0.  After installing Sonic, select ‘Sonic PDF’ as a printer (as opposed to Adobe PDF in step 3).

After you’ve resized the PDF, try the conversion again.

Hope this tip helps!

PDF, A De-Facto Standard No More

While you’re all excited about the upcoming holidays and can’t think of anything else but that gift list to get through, you can add one more thing to get excited about.

The de facto standard of information interchange, aka the PDF, just got one step closer to being adopted as a standardized format. Last week, the PDF 1.7 specification gained the approval votes it needed from ISO committee voting members as it reached the Enquiry “Close of voting” stage in the standardization process.

Before this certification happens, though, the comments included with the votes need to be addressed. Only then does the format get its official ISO standard tag, ISO 32000 (lovely name, no?). Even with those last few hurdles, the PDF’s standardization process is looking good.

Jim King, PDF architect and Senior Principal Scientist at Adobe Systems Inc., will serve as technical editor for the international working group meeting in January where the 205 submitted comments will be resolved.

On his blog he states, “If the group can address all the comments to the satisfaction of all countries, especially the ones voting negatively, it is possible to finish at that meeting and publish the revised document.”

So Is It Still An Adobe-Microsoft Showdown?

In the face of impending success, you can’t help but wonder about OOXML and where its standardization is headed.

OOXML was also submitted and fast-tracked for an official ISO standard, but was rejected in September. Alongside that rejection was the controversy over Microsoft’s active influence over committee members and their votes. The OOXML proposal then went back to the drawing board for revisions to take the negative votes and comments into account.

Now, three months later, with its Ballot Resolution Meeting (BRM) drawing near in February, OOXML’s standardization is still up in the air. Its interoperability, the OOXML hot topic of the day, will be a major factor in the decision to approve it.

Making it even harder is that OOXML is constantly held up against ODF, the poster child of open source solutions. It’ll be interesting to see how “open” and how much “interoperability” a Microsoft format can possess in general.

While that issue unfolds, the PDF will more than likely get the ISO standardization without much drama. Has Adobe won this round already without even trying?

These are exciting times for the PDF format indeed.

How Has Web 2.0 Made An Impact On The PDF Format?

If you were to come up with a good sampling of trends you see online, what would be included in that list? Facebook, YouTube, Gmail, eBay? Perhaps Yahoo!, Google Maps and Wiki sites as well. Let’s not forget the infinite number of blogs, RSS feeds, tagging, podcasting and bookmarking sites that are out there. If you’ve listed these, then you’ve listed a good sample of Web 2.0 elements.

Web 2.0 is a term that you’ve probably seen around on the Internet, and perhaps a term that is a bit obscure. Popularized by Tim O’Reilly in 2004, the term encompasses a broad definition. But in a nutshell, it refers to the general direction in which the World Wide Web is heading: more connected and more dynamic than ever before.

Broadly speaking, Web 2.0 places emphasis on the web as platform. Moreover, the user participation that enriches it, the networks that add to it, the tech innovation that motivates it, and the data that drives it, are all hallmarks of a Web 2.0 application. The result of such a combination? A web environment in which users can do more.

From just that short description alone, you can see that it’s a Web 2.0 world out there. And it’s seeping into the PDF world simply by influencing our digital habits and interests. For example, web designing tools are turning users into developers. If you’re a downloader, you’re a server as well. Desktop publishing software can make the user both publisher and reader at the same time.

These desktop applications, in turn, are then gradually shifted to web applications. Whether or not you’re an avid user of the PDF, you can see that these characteristics play a role in how we look at the file format in a different way, how it’s used, how it’s innovated and how it can be made more efficient online.

Web 2.0: Its Presence, Its Impact, And Its Influence On The PDF

Along with this Web 2.0 growth, Adobe has been taking the PDF and its authoring tools and combining them with the web tools of Macromedia. This is forming an important relationship between the PDF and online content.

And this is why the PDF is gaining a foothold as Web 2.0 further develops. Adobe’s AIR is a runtime client that can render PDF, HTML and Flash content outside the browser, as a desktop application connected to the Web, while still taking advantage of your local storage and hard drive. This is just one example of PDF technology being leveraged for the web.

There is so much PDF content on the web because the PDF is accomplishing what other formats can’t do online. For instance, if you take a look at the main use of PDFs today, three words that might come to mind are: Interactive Document Processing. This is an efficient way of connecting both business workflows and the Web to each other.

The PDF format is becoming the interface between businesses and users. Just because of the sheer growth of the Internet and the wide user-base websites have established, it’s now convenient that tax forms be downloaded in PDF, or that applications be filled out online. Document processing and dynamic security control are what software and online services like LiveCycle Design ES and Adobe Document Server are geared towards: creating and connecting custom-tailored backend systems with the client user.

Real-time collaboration is now a major feature that enhances PDF workflows. The format is no longer a static virtual page, but a dynamic virtual space. This connectedness is also accompanied by hyperlinking, a thing to which other online documents are not immune. The format actually goes beyond the identity of a closed format and connects to the web and to users online.

However, what sets the PDF apart is that the Adobe Reader is feature-rich. Reader 8.1 can now support RSS feeds in XML format, a more dynamic and heavy-duty form of linking than simple hyperlinks can provide. Reader can keep you in touch with dynamic content that’s constantly updated.

Another aspect of today’s web is the “openness” of software and technology that is now becoming the norm for Web 2.0. It generates user input and contribution that drives the innovation and integration of different technologies.

Software manufacturing is becoming a communal project. The Adobe Mars project, for instance, still in beta status, is an example of this communal production. In one sense, Web 2.0 has made PDF users participants, third party developers and innovators all in one.

The user perspective on the PDF is changing drastically in that anyone can create one for any purpose, not just those dealing with top corporate documents. This allows users to spontaneously create and share PDF files just as they would normally email media or image files. PDFs are being uploaded and downloaded publicly on webpages in the form of documents and data containers.

In one form or another, the PDF is being developed in stride with Web 2.0. The PDF has gradually shifted its position and is far from what it was three versions ago, the last version especially. If Web 2.0 is a vision of what the web could be, then the PDF format is a vision of what Web 2.0 documents should be.

 

Why Performing OCR On Handwriting Doesn’t Work

Unsurprisingly, OCR is consistently a hot topic in the PDF world and among PDF users in general. In paper-intensive work environments, PDF conversion and OCR engines have proven to be a successful work-around for transferring paper files into word processing applications. Thus, with the help of scanners and the PDF format, any and all types of paperwork can be done electronically and efficiently. Or can they?

While trying to integrate and transfer every non-digital working habit into an electronic equivalent, there are still some things that just can’t be done with ease using the same everyday tools. For instance, what about converting hand printed/written documents?

Three Flavours Of OCR

Many of you have probably wondered why such a thing can’t be done with the OCR technology in PDF conversion products. Well, this is because OCR technology and devices are only capable of recognizing machine-printed characters and fonts. And since most documents that are scanned in are typewritten, OCR is employed in almost all cases.

In other cases, there are documents that contain handwritten sections and/or fields that are used for collecting data, a practice slowly being superseded by the fillable PDF form. You can create a digital copy from such a document simply by scanning it in, right? Yes. However, it requires a different recognition technology altogether. Using OCR, you can perhaps get one letter to “OCR” into ASCII, if it’s printed clearly and written in ink that’s thick enough to be read. But that’s about it. This is where another flavor of OCR comes in: Intelligent Character Recognition.

ICR is a more advanced form of OCR that translates hand-printed letters into digital ASCII equivalents. This version of OCR is primarily used for processing applications and forms on which you “print clearly” and place individual letters in boxes. This structured method of reading a hand-printed document is one of the major limitations of the technology, but it controls and reduces the number of human errors that cause misinterpretations.

In addition, there are documents that contain handwriting, aka cursive writing. Can recognition on such documents be performed? The answer: Yes. The third flavor of OCR is IR (Intelligent Recognition), the latest generation of OCR technology to date. This is used to read unconstrained writing (text not contained in boxes) and uses the same methods to translate the characters into ASCII text. From my online searching, there are a good number of companies that provide full-fledged OCR/ICR/IR solutions, which can be integrated with digital workflows.

Thus, if you’re looking to OCR handwritten PDFs, you’ll be sorely disappointed. The ability to do everything and anything with technology is perhaps the ultimate goal for developers and users. Practicing it, on the other hand, is perhaps the ideal goal for every worker bee out there. It’s sad to say, but there are some cases in which you can only do so much.