The "Options..." item under the View Menu will display the following Options dialog:
The Options dialog enables the end user to configure Able2Extract's output to account for different possible PDF document structures. A brief description of each option is provided below:
Auto-Spacing between words – Some PDF documents are created such that their internal structure does not demarcate "spaces" between words, even though the viewable PDF page does contain spaces between words. As such, the Investintech PDF conversion engine automatically adds spaces between document patterns (i.e. words) as a default setting.
In certain cases, such as the case of expanded text with Justify alignment, the Auto-Spacing between words default can result in the insertion of extra spaces between words and poor conversion results. In these cases, conversion results may improve if the Auto-Spacing between words setting is deselected. This option is applicable only for non-scanned documents/pages.
Auto-Spacing between close numbers – This function works similarly to the Auto-Spacing between words function, except it focuses on patterns that consist of numbers.
Gap (between letters) treated as space: In cases where a gap between letters is treated as a space, this option lets users select standard spacing between letters or smaller spacing, which puts the letters closer together. The default option is standard.
Unrecognized symbols: Some PDFs use non-standard fonts, so it can be difficult to find a match among the existing Windows fonts on a particular computer. When Able2Extract cannot find a match for a particular symbol, there is the option to either (1) show the ¤ symbol, if the checkbox is marked; or (2) try to use OCR to grab the symbol that cannot be read. If nothing is checked, a space will be shown.
PowerPoint Format: This option designates the PowerPoint output options. The default is conversion into the pptx format while the PowerPoint 2007 option is conversion into the ppt format. Users also have the ability to select OpenOffice Impress (.odp), an open standard presentation format used and generated by OpenOffice, as their designated output option.
Eliminate Repeated Characters – Documents will occasionally have a line of repeated characters, which may interfere with PDF conversion results.
The Eliminate Repeated Characters (if more than 5) setting allows the user to replace commonly repeated characters, such as asterisks (more than 5, ******) with the following: " *** ". This option should be used in documents where the repeated characters are causing problems in the conversion output. This option does not change the way multiple (more than 3) dots are processed: they are kept in RTF/DOC conversion and eliminated otherwise. The setting has three choices: (1) apply to all types of conversions; (2) apply to Excel only; (3) ignore and keep all characters. The default option is Excel only.
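As a rough illustration of the replacement described above, the following Python sketch collapses any run of more than 5 identical characters into three copies padded with spaces. It is not Able2Extract's actual implementation: the function name collapse_repeats is hypothetical, and the rule that dots and whitespace are left untouched is an assumption based on the note that dots are handled separately.

```python
import re

# Assumption: a "repeated character" is any character other than a dot or
# whitespace that occurs more than 5 times in a row (6 or more copies).
_RUN = re.compile(r"([^\s.])\1{5,}")


def collapse_repeats(text: str) -> str:
    """Replace each long run with three copies of the character, space-padded."""
    return _RUN.sub(lambda m: f" {m.group(1) * 3} ", text)


if __name__ == "__main__":
    print(collapse_repeats("Total******100"))
    print(collapse_repeats("Total*****100"))  # only 5 asterisks: unchanged
```

Runs of exactly 5 characters are left alone, which matches the "more than 5" wording of the setting.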
Retain Problematic Font Names From the Source Document – In certain cases, a PDF document will contain a variety of challenging fonts within a PDF. The default is for the application to try and match the font to the closest available font in Word. In situations where the application is unable to find a suitable font for replacement, selecting this option provides the user with the ability to retain the font names from the PDF. By doing so, the user can then choose the fonts that they think will work best and can give a better conversion result.
Output Image Resolution: The drop-down offers 3 different settings (72, 150, 300) to set the quality of any image that will display in the background of the converted document. A lower number means that the resulting document size will be smaller. The default is a medium size.
Open File After Conversion: A simple check box that when checked will cause the output program to launch with the file open once the conversion is made. For example, a Word conversion will cause an instance of Word to open, etc…
Autospacing between words: A simple check box that when checked, lets the software determine the accurate spacing between words on scanned documents when converted using OCR.
Extract images: When checked, the OCR engine will work to convert not only text but also images from the scanned document.
OCR mode: A dropdown menu that lets the user override the default scanned document detection and decide for themselves whether to use OCR for the conversion. 3 options are available: (1) Default: this allows the software to determine whether a page or a document is scanned or not. (2) Perform image based conversion: tells the conversion engine to treat every document as being scanned and use OCR to convert documents regardless of whether they are scanned or not. (3) No Image based conversion: tells the program to never use OCR to convert a document, even if it is a scanned document.
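The three OCR modes amount to a small decision rule, which can be sketched in Python as follows. The names OcrMode and should_use_ocr are hypothetical and used only for illustration; how Able2Extract actually detects a scanned page is not described here, so the detection result is simply passed in as a flag.

```python
from enum import Enum


class OcrMode(Enum):
    DEFAULT = "Default"
    ALWAYS = "Perform image based conversion"
    NEVER = "No Image based conversion"


def should_use_ocr(mode: OcrMode, page_looks_scanned: bool) -> bool:
    """Decide whether OCR is applied to a page under the selected mode."""
    if mode is OcrMode.ALWAYS:
        return True   # treat every page as scanned
    if mode is OcrMode.NEVER:
        return False  # never use OCR, even on scanned pages
    return page_looks_scanned  # Default: rely on automatic detection
```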
Excel/CSV Conversion Parameters
Use Text Format for Non-numeric Data – In certain documents, dates may be automatically converted into a numerical format in Excel. This may cause confusion or problems with the conversion output. To preserve dates in the Text format, select this item on the option menu.
Trailing Minus Sign ("-") Treatment – In certain financial documents or reports, the minus sign symbol trails to the right of the number that it is associated with (e.g. "4,560–" instead of "–4,560"). Converting negative numbers where the minus sign trails to the right of the number may cause the resulting conversion to Excel to place these numbers as text items, rather than number items.
Use this option to move trailing minus signs from the end of the number to the beginning of the number, to prevent such instances from being converted to textual items in Excel.
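The effect of this option can be approximated with the short Python sketch below. The function name move_trailing_minus is hypothetical, and treating the hyphen, en dash and Unicode minus sign as equivalent is an assumption; this is not the conversion engine's own code.

```python
import re

# A number made of digits, commas and periods, followed by a trailing minus.
# Assumption: "-", en dash (U+2013) and minus sign (U+2212) all count as minus.
_TRAILING_MINUS = re.compile(r"^\s*(\d[\d.,]*)\s*[-\u2013\u2212]\s*$")


def move_trailing_minus(cell: str) -> str:
    """Return the cell with a trailing minus moved to the front, if present."""
    match = _TRAILING_MINUS.match(cell)
    return f"-{match.group(1)}" if match else cell
```

Cells that are not numbers, or that already carry a leading minus sign, are returned unchanged.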
Retain Dollar Sign ($) as Separate Symbol – Certain financial documents contain dollar signs, often at the top or bottom of a financial document. Unfortunately, the dollar sign ($) sometimes creates challenges in the way the program interprets column structure. As such, the default is to meld the dollar sign into the same column as the number next to it. In certain cases, however, the user may wish to retain the dollar sign in its separate column. To do so, select this option to retain such a structure.
European Continental Settings (1.234.567,89 = 1 234 567.89) – In North America, the decimal point is a period – which separates the integer portion of a number from the fractional portion – and the thousand separator is a comma. In certain other countries, the reverse is true: a decimal comma is used to separate the integer portion of a number from the fractional portion, and a period is used as the thousand separator.
This option allows you to convert documents that adopt the decimal comma and "period" thousand separators (referred to here as European continental settings) to Excel formatted numbers correctly.
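To make the number format concrete, here is a minimal Python sketch of how a value written with European continental settings maps to an ordinary number. The function name parse_continental is hypothetical, and the sketch assumes the input really uses a period as the thousand separator and a comma as the decimal mark.

```python
def parse_continental(number: str) -> float:
    """Parse a number such as "1.234.567,89" into a float."""
    # Drop the "." thousand separators, then turn the decimal comma into a point.
    return float(number.strip().replace(".", "").replace(",", "."))


if __name__ == "__main__":
    print(parse_continental("1.234.567,89"))
```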
Enable Table Unfolding (treat rows as columns and vice versa) – This option allows you to generate output from a PDF in column structure to Excel in row structure or vice versa. For instance, if your PDF document contains three columns of data and this option is selected, the conversion of these three columns will result in three rows. The first row will contain the data from the first column, the second row will contain data from the second column, etc.
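In spreadsheet terms, table unfolding is a transpose. The Python sketch below shows the idea on a small table; the function name unfold_table is hypothetical, and padding short rows with empty strings is an assumption made so that ragged input does not lose data.

```python
from itertools import zip_longest


def unfold_table(rows):
    """Swap rows and columns; short rows are padded with empty strings."""
    return [list(column) for column in zip_longest(*rows, fillvalue="")]


if __name__ == "__main__":
    table = [
        ["A1", "B1", "C1"],
        ["A2", "B2", "C2"],
    ]
    # Three columns of data become three rows.
    for row in unfold_table(table):
        print(row)
```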
Convert Rotated text to Excel: In some cases, a PDF document may have text in a table that has been rotated in a certain direction. Checking this box in the options menu will convert the rotated text into Excel upon conversion.
Calculate columns positions forecast for OCR pages in Custom Excel conversion – if this option is selected the Custom Excel conversion will always try to calculate a column position forecast, otherwise the calculation will be performed only for non-scanned pages.
Excel Format: – This option designates the Excel output format. The default is conversion into the default format of your Excel version while the Excel 2007 option is conversion into the XLSX format. There is also the option to select OpenOffice Calc (.ods), an open standard spreadsheet format used and generated by OpenOffice, as their designated output format.
Default Excel Format – Select from a variety of Excel conversion commands to fine tune data extractions from your PDF into Excel spreadsheets.
- Excel Single Worksheet – The default for this menu item is pre-selected. Once selected, this item ensures that all data converted from a single PDF document ends up on a single worksheet in Excel. If you decide to unselect this item, data from each page of a PDF document will appear in a separate worksheet in Excel.
- Excel Fonts – Use this command to turn on/off Excel fonts for the to-Excel conversion. When turned on (which is the default setting), this feature attempts to replicate the size, colour and type of font from the PDF in Excel. When turned off, the standard Excel font is used.
- Excel Spacing – Use this command to turn on/off automatic spacing for to-Excel conversion. You can experiment with this option in order to produce the most desirable result on each particular document.
CSV Delimiter – Able2Extract comes with support for conversion to CSV with configurable delimiters (DSV). Users can select from Comma, Tab or Other specified delimiters as needed.
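For readers who post-process the converted files, the Python sketch below shows what the same data looks like under different delimiters. It uses Python's standard csv module, not Able2Extract itself, and the function name to_dsv is hypothetical.

```python
import csv
import io


def to_dsv(rows, delimiter=","):
    """Serialize rows as delimiter-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    data = [["Item", "Price"], ["Widget", "4,560"]]
    print(to_dsv(data, ","))   # the value containing a comma is quoted
    print(to_dsv(data, "\t"))  # tab-delimited: no quoting needed
    print(to_dsv(data, ";"))   # an "Other" delimiter
```

A delimiter that does not occur in the data, such as a tab or semicolon for figures with thousand commas, avoids the need for quoting.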
Default Action for Convert Button on Side Panel – You can specify the default format that Able2Extract will extract data to when performing PDF to Excel/CSV conversions with a Custom Excel Template loaded. You can choose either Excel or CSV format. Note that this setting affects the Convert Button only when the Advanced Excel side panel is opened while loading a template. It does not apply to normal Advanced PDF to Excel conversions.
Word Conversion Parameters
Page Margin: Use the Page Margin Value setting to change the size of the printable margins for a Word document converted from a PDF document. The Calculate Margin option lets the software calculate what it thinks the correct margin is automatically. The Default Page Margin Value is 0.00 inches – this is chosen because it provides the best positional output when converting a document from PDF to Word.
Certain office printers cannot print the whole page area of a PDF – i.e. 0.00 inch margins on a page will not print. If this is the case, the Custom Page Margin Value allows you to set the printable margins appropriate for your printer.
Tips: (a) For best results, select the smallest Page Margin Value that your printer will support; (b) A Page Margin Value of 0.2 inches or 0.5 inches will generally work best on most printers.
Graphics and Images Management: The default setting for Graphics and Image Management is set to Auto Detect. It works in the following way: if a page for some reason cannot be drawn correctly in the way it is described in PDF (say the page contains some graphics settings not supported in RTF/DOC) it will be drawn as an image(s); otherwise vector graphics will be drawn as vector graphics and images — as images.
Manual Setting: The default scenario places all images in the background and places vector graphics in the background. In certain circumstances, the user may wish to try and work with the vector graphics. To do so, the following options can be used:
Place all Images in Background – This setting will place all images identified within a PDF document onto the background in the converted Word document. Otherwise images (such as JPG or BMP files) will be added as Word pictures, so that you can format each image separately or change/move their position within the document.
In certain cases dealing with "masked images", the conversion into Word may not be properly rendered based on our default setting. A "masked image" refers to the portion of a viewable image on a PDF document that is cropped or "masked" out – and placed on a different background. Because PDF documents may be multi-layered, the several backgrounds or layers contained within a PDF document may cause problems in conversion output.
In certain other cases, attributable to an inappropriate Z-order (i.e. the order in which graphics objects overlap each other) in the PDF source, images may also be incorrectly displayed – such as problems with image borders or disappearing images. In both of these cases, selecting this option (i.e. placing all images from the page to the background image) may avoid problems in the image display for the converted document.
Vector Graphics as Background Image – This option converts all PDF vector graphics objects within a page into a background image. What are vector graphics? Generally, there are two kinds of graphical objects in PDF documents – pixel-based images and vector graphic images (consisting of lines and shapes).
Pixel-based images that are viewed in Word may result in varying resolution – if the image is resized or redrawn, Word will attempt to add/remove pixels based on an algorithm. In most cases, the result is a loss in image quality – for instance, thin lines might disappear under low resolution.
Vector graphics, on the other hand, in PDF may be drawn at any resolution – which may result in better conversion results upon conversion into Word. However, in some cases, the number of vector objects comprising an image may be high, which may compromise the conversion for a particular Word document.
In cases where the rendering of vector graphics poses problems in the display of the conversion output, this option allows the user to display all vector graphics as a background image. By doing so, it ensures the integrity of the vector graphic image – although the drawback is that the vector image is not easily moved within the Word document.
Output: The Word conversion output offers the following output formats when converting a PDF to Word. The user can select any one of them:
- Default: Using this setting on the Windows platform will result in the software automatically detecting which version of Word is installed on the machine and then saving the Word output into the applicable format – .docx or .doc.
- RTF: Choosing this setting means that all conversions made from PDF to Word will be saved in the rich text format (.rtf).
- Word 2007: Using this setting means that all documents converted into Word will be saved as .docx regardless of which version of Word is installed.
- OpenOffice Writer: Selecting this option allows users to save all their PDF to Word conversions in the OpenOffice Writer format (.odt) by default.
Standard Conversion Options Only
Keep Hyphens from Original Document – The default setting is for the application to automatically keep or delete hyphens based on their position within the paragraph – in some cases, the position of the text in Word will vary from the original PDF document, so that a hyphen that was originally required to split a word is not required in the converted document. The user can opt to select the Keep Hyphens from Original Document option to ensure that no hyphens are deleted.
Use Tables: This option is automatically checked and it tells the Investintech PDF conversion engine to retain the tables from the PDF when the conversion is made into Word. An additional option called exact row height is also checked to tell the algorithm to retain the row structure contained in the PDF.
Column (Newspaper) Paragraph Minimum Width – Many PDF documents are formatted with column paragraphs, or newspaper-style paragraphs. To assist in the recognition and conversion from PDF to Word for these types of paragraphs, the user can designate the minimum width for column/newspaper paragraphs.
The Investintech PDF conversion engine contains complex algorithms for differentiating between table columns and paragraph columns – however, in some cases, it is very difficult for the engine to distinguish between these two types of paragraph. The Column (Newspaper) Paragraph Minimum Width setting allows users to improve conversion results by providing input regarding a PDF document's structure.
Example – If it is known that a given document does not have any column/newspaper paragraphs with column widths of less than 3.00 inches, changing the Column (Newspaper) Paragraph Minimum Width to 3.00 inches will prevent the conversion engine from treating certain table columns as column/newspaper paragraphs.
Note: The Column (Newspaper) Paragraph Minimum Width setting should be between 1.00 and 3.00 inches
The default for this setting is 1.35 inches
The user of the standard PDF to Word conversion has 3 options regarding the formatting of the Word output from the PDF conversion:
- Keep layout: This is the default option and when selected it tells the Investintech conversion algorithm to focus efforts on keeping the most accurate layout possible in the Word output.
- Most editability: When selected, this option focuses the conversion engine on delivering the most editable Word output rather than concentrating primarily on the layout. Both options result in good conversion but selecting this one places a little more emphasis on editability.
Insert Headers – and Insert Footers: these 2 options, in combination with the "Header value" and "Footer value" options, allow you to specify the header and footer size so that patterns located near the top and bottom of the page will be converted as page headers or footers respectively.
Max Font size for Header/Footer – this option specifies the maximum font size for text patterns to be placed as a header or footer in the converted Word (or RTF) document (the option is enabled only if the "Insert Headers" or/and "Insert Footers" option is selected).
Anchor graphics – if "to page" option is selected, all images and vector graphics elements will have absolute position anchored to the page top left corner; otherwise images or graphics elements will be grouped by Y coordinate and each group (i.e. each graphics element from the group) will be anchored to the paragraph closest to the group (except column/newspaper paragraphs).
RESTORE STANDARD (button)
The Restore Standard button restores all settings to what they were when the program was first installed on the computer.
SET AS DEFAULT (button)
The Set As Default button saves all current settings from the Options dialog into the Registry.
Save to File...
The Save to File... button saves all current settings from the Options dialog into a template file for later use.