How To Clean Up Large PDF Datasets

Analyzing Data For Investigative Reporting

For big data analysts, working with clean data is a must. The major hurdle, though, is actually cleaning that data. Right now, analysts are spending more than half of their time cleaning up unstructured datasets. And if you aren’t an advanced expert with cleaning datasets, just knowing some basic data cleaning tasks becomes even more crucial.

Datasets can represent a large variety of information. From government and healthcare data to demographic and financial numbers, datasets come from all different areas. They also come in all different forms, like the PDF format. Getting it into a form you can manipulate is your first goal– and your biggest challenge.

The PDF format isn’t easily editable. In addition, it may contain hundreds of pages, consist of tables that span the entire file, be scanned in from a hard copy document, be created from an Excel spreadsheet, or be protected against copying and pasting.

You need to be able to analyze that locked down data. But how do you get started?

Extracting Large PDF Datasets

The key to working with PDF datasets is to extract that data. A PDF to Excel conversion is usually the first step. We all know that PDF conversion results can give you some post-conversion manipulation work. However, there are tools, like Able2Extract Professional, that will get most of the legwork done before the data is extracted, giving you less work in cleaning that PDF data.

Customizing PDF to Excel Conversion

It can tailor your PDF data extraction with a custom PDF to Excel conversion feature that lets you manually adjust rows and columns, delete headers and footers, select portions of the data to convert, decide how content is treated across columns. Once everything is set up the way you want in Excel, you can convert it as usual.

Most of the unwanted data will be already eliminated from the conversion results. All you have left to do is deal with is the more refined data cleaning process in basic go-to data cleaning tools like Microsoft Excel.

Basic Tips For Cleaning PDF Datasets

Whether you’re learning more about Excel or are just learning how to clean data, here’s a look at how to accomplish some of the most basic data cleaning tasks. This includes deleting duplicates, getting rid of blank cells, deleting extra spaces, re-organizing data in columns and rows, or cleaning the text’s formatting.

As a rule of thumb, don’t forget to make a copy of your original dataset. This way, if you make a mistake spanning the entire file, you can always revert back.

Deleting Duplicates

It isn’t uncommon to have duplicates due to data entry errors. However, these must usually be weeded out before any analysis of the data can happen.

  1. Select your data.
  2. Go to Data > Remove Duplicates.
  3. In the resulting dialog, select the Column(s) from which you want to remove duplicates and click OK.

Removing Duplicate Excel Data

Getting Rid Of Blank Cells

In PDF datasets, it’s common to have tables that convert improperly due to text splitting among the columns, causing the data to shift. You can pick out and delete all blank cells.

  1. Select the entire data range.
  2. Hit F5 >Special.
  3. In the dialog that appears select Blanks. This will highlight all blank cells in the spreadsheet data.

Selecting Blank Spreadsheet Cells

  1. Right-click on one of the blank cells and select Delete from the context menu.
  2. In the dialog select how you want the other cells to be shifted once the blank cells are deleted.

Deleting Blank Spreadsheet Cells

Deleting Extra Leading and Trailing Spaces

One common issue with large PDF datasets is that it may contain extra spaces. Instead of going to each cell one by one,  you can get rid of leading and trailing spaces simultaneously.

  1. Create a blank column adjacent to the column with the data you want to clean.This will be your Helper column.
  2. In this Helper column, enter the formula =TRIM (cell reference or text) in the cell adjacent to the data you want to be cleaned.
  3. Press Enter.
  4. Continue to copy the formula into the blank cells of the Helper column to delete the extra spaces in the original cells as needed.
  5. Once done, replace the data in the original column with the Helper column containing the cleaned data.

Re-organizing Data In Columns and Rows

Oftentimes, PDF datasets you receive won’t be organized in the way you need them to be. Organizing the rows and columns can be a quick task. Below are a few basic things you can do in Excel:

  • Text to Columns (Home>Data>Text to Columns)– Split single columns of text across multiple cells.

Excel Text To Columns Options

  • Concatenate (Formulas>Text>CONCAT)– Combine several text items into one cell.
  • Transpose (Home>Paste>Transpose)– Rearrange copied data in columns so it’s rearranged in rows.

Cleaning Text Formatting

Irregularly formatted text in your dataset is to be expected. If you’re going to be using that data in other database tools, you’ll want to get rid of all formatting:

  1. Select the data set.
  2. Go to Home > Clear > Clear Formats.

These are just some of the basics you’ll learn as you get more familiar with cleaning PDF data. Eventually, you’ll move on to using Pivot Tables, Charts and functions (SUM, MAX, AVERAGE)  as you get comfortable cleaning large PDF datasets in Excel.