{"id":7395,"date":"2017-10-09T14:08:35","date_gmt":"2017-10-09T14:08:35","guid":{"rendered":"https:\/\/www.investintech.com\/resources\/blog\/?p=7395"},"modified":"2019-08-22T13:04:58","modified_gmt":"2019-08-22T13:04:58","slug":"how-to-clean-large-pdf-datasets","status":"publish","type":"post","link":"https:\/\/www.investintech.com\/resources\/blog\/archives\/7395-how-to-clean-large-pdf-datasets.html","title":{"rendered":"How To Clean Up Large PDF Datasets"},"content":{"rendered":"<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-7855\" src=\"https:\/\/www.investintech.com\/resources\/blog\/wp-content\/uploads\/2017\/10\/Spreadsheet-Data-Analysis.jpg\" alt=\"Analyzing Data For Investigative Reporting\" width=\"640\" height=\"391\" srcset=\"https:\/\/www.investintech.com\/resources\/blog\/wp-content\/uploads\/2017\/10\/Spreadsheet-Data-Analysis.jpg 640w, https:\/\/www.investintech.com\/resources\/blog\/wp-content\/uploads\/2017\/10\/Spreadsheet-Data-Analysis-300x183.jpg 300w\" sizes=\"auto, (max-width: 640px) 100vw, 640px\" \/><\/p>\n<p>For <a href=\"https:\/\/www.investintech.com\/resources\/blog\/archives\/7826-data-analysts-big-data-tools.html\">big data analysts<\/a>, working with clean data is a must. The major hurdle, though, is actually cleaning that data. Right now, analysts are spending more than half of their time cleaning up unstructured datasets. And if you aren\u2019t an advanced expert with cleaning datasets, just knowing some basic data cleaning tasks becomes even more crucial.<\/p>\n<p>Datasets can represent a large variety of information. From government and healthcare data to demographic and financial numbers, datasets come from all different areas. They also come in all different forms, like the PDF format. Getting it into a form you can manipulate is your first goal&#8211; and your biggest challenge.<\/p>\n<p>The PDF format isn\u2019t easily editable. 
In addition, it may contain hundreds of pages, consist of tables that span the entire file, be scanned in from a hard copy document, be created from an Excel spreadsheet, or be protected against copying and pasting.<\/p>\n<p>You need to be able to analyze that locked down data. But how do you get started?<\/p>\n<p><!--more--><\/p>\n<h2><b>Extracting Large PDF Datasets<\/b><\/h2>\n<p>The key to working with PDF datasets is to extract that data. A <a href=\"https:\/\/www.investintech.com\/pdf-to-excel\/\">PDF to Excel<\/a> conversion is usually the first step. We all know that PDF conversion results can give you some post-conversion manipulation work. However, there are tools, like Able2Extract Professional, that will get most of the legwork done before the data is extracted, giving you less work in cleaning that PDF data.<\/p>\n<p><span style=\"color: #ff0000;\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-6629 size-large\" src=\"https:\/\/www.investintech.com\/resources\/blog\/wp-content\/uploads\/2017\/03\/Able2Extract-Professional-Custom-Excel-1024x818.png\" alt=\"Customizing PDF to Excel Conversion\" width=\"1024\" height=\"818\" srcset=\"https:\/\/www.investintech.com\/resources\/blog\/wp-content\/uploads\/2017\/03\/Able2Extract-Professional-Custom-Excel-1024x818.png 1024w, https:\/\/www.investintech.com\/resources\/blog\/wp-content\/uploads\/2017\/03\/Able2Extract-Professional-Custom-Excel-300x240.png 300w, https:\/\/www.investintech.com\/resources\/blog\/wp-content\/uploads\/2017\/03\/Able2Extract-Professional-Custom-Excel-768x613.png 768w, https:\/\/www.investintech.com\/resources\/blog\/wp-content\/uploads\/2017\/03\/Able2Extract-Professional-Custom-Excel.png 1038w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/span><\/p>\n<p>It can tailor your PDF data extraction with a custom PDF to Excel conversion feature that lets you manually adjust rows and columns, delete headers and footers, select portions of the data to 
convert, and decide how content is treated across columns. Once everything is set up the way you want, you can convert to Excel as usual.<\/p>\n<p>Most of the unwanted data will already be eliminated from the conversion results. All you have left to do is the more refined data cleaning in a basic go-to tool like Microsoft Excel.<\/p>\n<h2><b>Basic Tips For Cleaning PDF Datasets <\/b><\/h2>\n<p>Whether you\u2019re learning more about Excel or are just learning how to clean data, here\u2019s a look at how to accomplish some of the most basic data cleaning tasks. This includes deleting duplicates, getting rid of blank cells, deleting extra spaces, re-organizing data in columns and rows, and cleaning the text\u2019s formatting.<\/p>\n<p>As a rule of thumb, don\u2019t forget to make a copy of your original dataset. This way, if you make a mistake spanning the entire file, you can always revert to the original.<\/p>\n<h3><b>Deleting Duplicates<\/b><\/h3>\n<p>It isn\u2019t uncommon to have duplicates due to data entry errors. 
However, these must usually be weeded out before any analysis of the data can happen.<\/p>\n<ol>\n<li>Select your data.<\/li>\n<li>Go to <strong>Data &gt; Remove Duplicates<\/strong>.<\/li>\n<li>In the resulting dialog, select the column(s) to check for duplicate values and click OK. Excel removes the rows in which those columns repeat.<\/li>\n<\/ol>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-7428 size-full\" src=\"https:\/\/www.investintech.com\/resources\/blog\/wp-content\/uploads\/2017\/08\/Removing-Spreadsheet-Duplicates.png\" alt=\"Removing Duplicate Excel Data\" width=\"436\" height=\"261\" srcset=\"https:\/\/www.investintech.com\/resources\/blog\/wp-content\/uploads\/2017\/08\/Removing-Spreadsheet-Duplicates.png 436w, https:\/\/www.investintech.com\/resources\/blog\/wp-content\/uploads\/2017\/08\/Removing-Spreadsheet-Duplicates-300x180.png 300w\" sizes=\"auto, (max-width: 436px) 100vw, 436px\" \/><\/p>\n<h3><b>Getting Rid Of Blank Cells <\/b><\/h3>\n<p>In PDF datasets, it\u2019s common to have tables that convert improperly due to text splitting among the columns, causing the data to shift. You can pick out and delete all blank cells.<\/p>\n<ol>\n<li>Select the entire data range.<\/li>\n<li>Press <b>F5 &gt; Special<\/b>.<\/li>\n<li>In the dialog that appears, select <b>Blanks<\/b> and click OK. 
This selects every blank cell in the data range.<\/li>\n<\/ol>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-7430 size-full\" src=\"https:\/\/www.investintech.com\/resources\/blog\/wp-content\/uploads\/2017\/08\/Go-To-Special-Excel-Dialog.png\" alt=\"Selecting Blank Spreadsheet Cells\" width=\"314\" height=\"327\" srcset=\"https:\/\/www.investintech.com\/resources\/blog\/wp-content\/uploads\/2017\/08\/Go-To-Special-Excel-Dialog.png 314w, https:\/\/www.investintech.com\/resources\/blog\/wp-content\/uploads\/2017\/08\/Go-To-Special-Excel-Dialog-288x300.png 288w\" sizes=\"auto, (max-width: 314px) 100vw, 314px\" \/><\/p>\n<ol start=\"4\">\n<li>Right-click on one of the blank cells and select <b>Delete<\/b> from the context menu.<\/li>\n<li>In the dialog, select how you want the other cells to be shifted once the blank cells are deleted.<\/li>\n<\/ol>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-7431 size-full\" src=\"https:\/\/www.investintech.com\/resources\/blog\/wp-content\/uploads\/2017\/08\/Excel-Context-Menu.png\" alt=\"Deleting Blank Spreadsheet Cells\" width=\"350\" height=\"414\" srcset=\"https:\/\/www.investintech.com\/resources\/blog\/wp-content\/uploads\/2017\/08\/Excel-Context-Menu.png 350w, https:\/\/www.investintech.com\/resources\/blog\/wp-content\/uploads\/2017\/08\/Excel-Context-Menu-254x300.png 254w\" sizes=\"auto, (max-width: 350px) 100vw, 350px\" \/><\/p>\n<h3><b>Deleting Extra Leading and Trailing Spaces<\/b><\/h3>\n<p>One common issue with large PDF datasets is that they may contain extra spaces. 
Instead of going to each cell one by one, you can get rid of leading and trailing spaces simultaneously.<\/p>\n<ol>\n<li>Create a blank column adjacent to the column with the data you want to clean. This will be your Helper column.<\/li>\n<li>In this Helper column, enter the formula =TRIM(<i>cell reference or text<\/i>) in the cell adjacent to the data you want to be cleaned.<\/li>\n<li>Press Enter.<\/li>\n<li>Copy the formula down the Helper column as needed. Each Helper cell then shows the trimmed text, while the original cells stay unchanged.<\/li>\n<li>Once done, copy the Helper column and paste it over the original column as values (<strong>Paste Special &gt; Values<\/strong>), then delete the Helper column.<\/li>\n<\/ol>\n<h3><b>Re-organizing Data In Columns and Rows<\/b><\/h3>\n<p>Oftentimes, PDF datasets you receive won\u2019t be organized in the way you need them to be. Organizing the rows and columns can be a quick task. Below are a few basic things you can do in Excel:<\/p>\n<ul>\n<li>Text to Columns (<strong>Data &gt; Text to Columns<\/strong>)&#8211; Split a single column of text across multiple columns.<\/li>\n<\/ul>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-7432 size-full\" src=\"https:\/\/www.investintech.com\/resources\/blog\/wp-content\/uploads\/2017\/08\/Text-To-Columns-Dialog.png\" alt=\"Excel Text To Columns Options\" width=\"439\" height=\"359\" srcset=\"https:\/\/www.investintech.com\/resources\/blog\/wp-content\/uploads\/2017\/08\/Text-To-Columns-Dialog.png 439w, https:\/\/www.investintech.com\/resources\/blog\/wp-content\/uploads\/2017\/08\/Text-To-Columns-Dialog-300x245.png 300w\" sizes=\"auto, (max-width: 439px) 100vw, 439px\" \/><\/p>\n<ul>\n<li>Concatenate (<strong>Formulas &gt; Text &gt; CONCAT<\/strong>)&#8211; Combine several text items into one cell.<\/li>\n<li>Transpose (<strong>Home &gt; Paste &gt; Transpose<\/strong>)&#8211; Switch copied data from columns to rows, or from rows to columns.<\/li>\n<\/ul>\n<h3><b>Cleaning Text 
Formatting<\/b><\/h3>\n<p>Irregularly formatted text in your dataset is to be expected. If you plan to use that data in other database tools, you\u2019ll want to get rid of all formatting:<\/p>\n<ol>\n<li>Select the dataset.<\/li>\n<li>Go to <strong>Home &gt; Clear &gt; Clear Formats<\/strong>.<\/li>\n<\/ol>\n<p>These are just some of the basics you\u2019ll learn as you get more familiar with cleaning PDF data. Eventually, you\u2019ll move on to using <a href=\"https:\/\/www.investintech.com\/resources\/blog\/archives\/7802-how-to-excel-pivot-table.html\">Pivot Tables<\/a>, Charts and functions (SUM, MAX, AVERAGE) as you get comfortable cleaning large PDF datasets in Excel.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>For big data analysts, working with clean data is a must. The major hurdle, though, is actually cleaning that data. Right now, analysts are spending more than half of their time cleaning up unstructured datasets. And if you aren\u2019t an advanced expert with cleaning datasets, just knowing some basic data cleaning tasks becomes even more &#8230; <a title=\"How To Clean Up Large PDF Datasets\" class=\"read-more\" href=\"https:\/\/www.investintech.com\/resources\/blog\/archives\/7395-how-to-clean-large-pdf-datasets.html\" aria-label=\"More on How To Clean Up Large PDF Datasets\">Continue reading \u2192<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[332],"tags":[335,232,321,361,45,85],"class_list":["post-7395","post","type-post","status-publish","format-standard","hentry","category-tech-tips-tutorials","tag-big-data","tag-data-analysis","tag-data-extraction","tag-excel","tag-ms-excel","tag-pdf-to-excel"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.2 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>How To Clean Up Large PDF 
Datasets<\/title>\n<meta name=\"description\" content=\"Working with large amounts of data isn\u2019t an easy task. However, this guide walks you through the basics on cleaning PDF datasets from start to finish.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.investintech.com\/resources\/blog\/archives\/7395-how-to-clean-large-pdf-datasets.html\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"How To Clean Up Large PDF Datasets\" \/>\n<meta property=\"og:description\" content=\"Working with large amounts of data isn\u2019t an easy task. However, this guide walks you through the basics on cleaning PDF datasets from start to finish.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.investintech.com\/resources\/blog\/archives\/7395-how-to-clean-large-pdf-datasets.html\" \/>\n<meta property=\"og:site_name\" content=\"PDF Blog | Investintech PDF Solutions\" \/>\n<meta property=\"article:published_time\" content=\"2017-10-09T14:08:35+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2019-08-22T13:04:58+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.investintech.com\/resources\/blog\/wp-content\/uploads\/2017\/10\/Spreadsheet-Data-Analysis.jpg\" \/>\n<meta name=\"author\" content=\"Reena\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@able2extract\" \/>\n<meta name=\"twitter:site\" content=\"@able2extract\" \/>\n<script type=\"application\/ld+json\" 
class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/www.investintech.com\/resources\/blog\/archives\/7395-how-to-clean-large-pdf-datasets.html#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.investintech.com\/resources\/blog\/archives\/7395-how-to-clean-large-pdf-datasets.html\"},\"author\":{\"name\":\"Reena\",\"@id\":\"https:\/\/www.investintech.com\/resources\/blog\/#\/schema\/person\/9d21ba7980d32dbd36069a4878f8e409\"},\"headline\":\"How To Clean Up Large PDF Datasets\",\"datePublished\":\"2017-10-09T14:08:35+00:00\",\"dateModified\":\"2019-08-22T13:04:58+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.investintech.com\/resources\/blog\/archives\/7395-how-to-clean-large-pdf-datasets.html\"},\"wordCount\":901,\"publisher\":{\"@id\":\"https:\/\/www.investintech.com\/resources\/blog\/#organization\"},\"image\":{\"@id\":\"https:\/\/www.investintech.com\/resources\/blog\/archives\/7395-how-to-clean-large-pdf-datasets.html#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.investintech.com\/resources\/blog\/wp-content\/uploads\/2017\/10\/Spreadsheet-Data-Analysis.jpg\",\"keywords\":[\"big data\",\"data analysis\",\"data extraction\",\"Excel\",\"MS Excel\",\"PDF to Excel\"],\"articleSection\":[\"Tech Tips and Tutorials\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.investintech.com\/resources\/blog\/archives\/7395-how-to-clean-large-pdf-datasets.html\",\"url\":\"https:\/\/www.investintech.com\/resources\/blog\/archives\/7395-how-to-clean-large-pdf-datasets.html\",\"name\":\"How To Clean Up Large PDF 
Datasets\",\"isPartOf\":{\"@id\":\"https:\/\/www.investintech.com\/resources\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.investintech.com\/resources\/blog\/archives\/7395-how-to-clean-large-pdf-datasets.html#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.investintech.com\/resources\/blog\/archives\/7395-how-to-clean-large-pdf-datasets.html#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.investintech.com\/resources\/blog\/wp-content\/uploads\/2017\/10\/Spreadsheet-Data-Analysis.jpg\",\"datePublished\":\"2017-10-09T14:08:35+00:00\",\"dateModified\":\"2019-08-22T13:04:58+00:00\",\"description\":\"Working with large amounts of data isn\u2019t an easy task. However, this guide walks you through the basics on cleaning PDF datasets from start to finish.\",\"breadcrumb\":{\"@id\":\"https:\/\/www.investintech.com\/resources\/blog\/archives\/7395-how-to-clean-large-pdf-datasets.html#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.investintech.com\/resources\/blog\/archives\/7395-how-to-clean-large-pdf-datasets.html\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.investintech.com\/resources\/blog\/archives\/7395-how-to-clean-large-pdf-datasets.html#primaryimage\",\"url\":\"https:\/\/www.investintech.com\/resources\/blog\/wp-content\/uploads\/2017\/10\/Spreadsheet-Data-Analysis.jpg\",\"contentUrl\":\"https:\/\/www.investintech.com\/resources\/blog\/wp-content\/uploads\/2017\/10\/Spreadsheet-Data-Analysis.jpg\",\"width\":640,\"height\":391,\"caption\":\"Analyzing Data For Investigative Reporting\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.investintech.com\/resources\/blog\/archives\/7395-how-to-clean-large-pdf-datasets.html#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.investintech.com\/resources\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"How To Clean 
Up Large PDF Datasets\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.investintech.com\/resources\/blog\/#website\",\"url\":\"https:\/\/www.investintech.com\/resources\/blog\/\",\"name\":\"PDF Blog | Investintech PDF Solutions\",\"description\":\"Everything PDF\",\"publisher\":{\"@id\":\"https:\/\/www.investintech.com\/resources\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.investintech.com\/resources\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.investintech.com\/resources\/blog\/#organization\",\"name\":\"PDF Blog | Investintech PDF Solutions\",\"url\":\"https:\/\/www.investintech.com\/resources\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.investintech.com\/resources\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.investintech.com\/resources\/blog\/wp-content\/uploads\/2024\/12\/Investintech-apryse-logo-w270.webp\",\"contentUrl\":\"https:\/\/www.investintech.com\/resources\/blog\/wp-content\/uploads\/2024\/12\/Investintech-apryse-logo-w270.webp\",\"width\":270,\"height\":40,\"caption\":\"PDF Blog | Investintech PDF 
Solutions\"},\"image\":{\"@id\":\"https:\/\/www.investintech.com\/resources\/blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/able2extract\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.investintech.com\/resources\/blog\/#\/schema\/person\/9d21ba7980d32dbd36069a4878f8e409\",\"name\":\"Reena\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.investintech.com\/resources\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/aceff76f1b124f7ffb271de50b78f12a7599655c7087ea3a656b61cf9a89c376?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/aceff76f1b124f7ffb271de50b78f12a7599655c7087ea3a656b61cf9a89c376?s=96&d=mm&r=g\",\"caption\":\"Reena\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"How To Clean Up Large PDF Datasets","description":"Working with large amounts of data isn\u2019t an easy task. However, this guide walks you through the basics on cleaning PDF datasets from start to finish.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.investintech.com\/resources\/blog\/archives\/7395-how-to-clean-large-pdf-datasets.html","og_locale":"en_US","og_type":"article","og_title":"How To Clean Up Large PDF Datasets","og_description":"Working with large amounts of data isn\u2019t an easy task. 
However, this guide walks you through the basics on cleaning PDF datasets from start to finish.","og_url":"https:\/\/www.investintech.com\/resources\/blog\/archives\/7395-how-to-clean-large-pdf-datasets.html","og_site_name":"PDF Blog | Investintech PDF Solutions","article_published_time":"2017-10-09T14:08:35+00:00","article_modified_time":"2019-08-22T13:04:58+00:00","og_image":[{"url":"https:\/\/www.investintech.com\/resources\/blog\/wp-content\/uploads\/2017\/10\/Spreadsheet-Data-Analysis.jpg","type":"","width":"","height":""}],"author":"Reena","twitter_card":"summary_large_image","twitter_creator":"@able2extract","twitter_site":"@able2extract","schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.investintech.com\/resources\/blog\/archives\/7395-how-to-clean-large-pdf-datasets.html#article","isPartOf":{"@id":"https:\/\/www.investintech.com\/resources\/blog\/archives\/7395-how-to-clean-large-pdf-datasets.html"},"author":{"name":"Reena","@id":"https:\/\/www.investintech.com\/resources\/blog\/#\/schema\/person\/9d21ba7980d32dbd36069a4878f8e409"},"headline":"How To Clean Up Large PDF Datasets","datePublished":"2017-10-09T14:08:35+00:00","dateModified":"2019-08-22T13:04:58+00:00","mainEntityOfPage":{"@id":"https:\/\/www.investintech.com\/resources\/blog\/archives\/7395-how-to-clean-large-pdf-datasets.html"},"wordCount":901,"publisher":{"@id":"https:\/\/www.investintech.com\/resources\/blog\/#organization"},"image":{"@id":"https:\/\/www.investintech.com\/resources\/blog\/archives\/7395-how-to-clean-large-pdf-datasets.html#primaryimage"},"thumbnailUrl":"https:\/\/www.investintech.com\/resources\/blog\/wp-content\/uploads\/2017\/10\/Spreadsheet-Data-Analysis.jpg","keywords":["big data","data analysis","data extraction","Excel","MS Excel","PDF to Excel"],"articleSection":["Tech Tips and 
Tutorials"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.investintech.com\/resources\/blog\/archives\/7395-how-to-clean-large-pdf-datasets.html","url":"https:\/\/www.investintech.com\/resources\/blog\/archives\/7395-how-to-clean-large-pdf-datasets.html","name":"How To Clean Up Large PDF Datasets","isPartOf":{"@id":"https:\/\/www.investintech.com\/resources\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.investintech.com\/resources\/blog\/archives\/7395-how-to-clean-large-pdf-datasets.html#primaryimage"},"image":{"@id":"https:\/\/www.investintech.com\/resources\/blog\/archives\/7395-how-to-clean-large-pdf-datasets.html#primaryimage"},"thumbnailUrl":"https:\/\/www.investintech.com\/resources\/blog\/wp-content\/uploads\/2017\/10\/Spreadsheet-Data-Analysis.jpg","datePublished":"2017-10-09T14:08:35+00:00","dateModified":"2019-08-22T13:04:58+00:00","description":"Working with large amounts of data isn\u2019t an easy task. However, this guide walks you through the basics on cleaning PDF datasets from start to finish.","breadcrumb":{"@id":"https:\/\/www.investintech.com\/resources\/blog\/archives\/7395-how-to-clean-large-pdf-datasets.html#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.investintech.com\/resources\/blog\/archives\/7395-how-to-clean-large-pdf-datasets.html"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.investintech.com\/resources\/blog\/archives\/7395-how-to-clean-large-pdf-datasets.html#primaryimage","url":"https:\/\/www.investintech.com\/resources\/blog\/wp-content\/uploads\/2017\/10\/Spreadsheet-Data-Analysis.jpg","contentUrl":"https:\/\/www.investintech.com\/resources\/blog\/wp-content\/uploads\/2017\/10\/Spreadsheet-Data-Analysis.jpg","width":640,"height":391,"caption":"Analyzing Data For Investigative 
Reporting"},{"@type":"BreadcrumbList","@id":"https:\/\/www.investintech.com\/resources\/blog\/archives\/7395-how-to-clean-large-pdf-datasets.html#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.investintech.com\/resources\/blog\/"},{"@type":"ListItem","position":2,"name":"How To Clean Up Large PDF Datasets"}]},{"@type":"WebSite","@id":"https:\/\/www.investintech.com\/resources\/blog\/#website","url":"https:\/\/www.investintech.com\/resources\/blog\/","name":"PDF Blog | Investintech PDF Solutions","description":"Everything PDF","publisher":{"@id":"https:\/\/www.investintech.com\/resources\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.investintech.com\/resources\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.investintech.com\/resources\/blog\/#organization","name":"PDF Blog | Investintech PDF Solutions","url":"https:\/\/www.investintech.com\/resources\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.investintech.com\/resources\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/www.investintech.com\/resources\/blog\/wp-content\/uploads\/2024\/12\/Investintech-apryse-logo-w270.webp","contentUrl":"https:\/\/www.investintech.com\/resources\/blog\/wp-content\/uploads\/2024\/12\/Investintech-apryse-logo-w270.webp","width":270,"height":40,"caption":"PDF Blog | Investintech PDF 
Solutions"},"image":{"@id":"https:\/\/www.investintech.com\/resources\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/able2extract"]},{"@type":"Person","@id":"https:\/\/www.investintech.com\/resources\/blog\/#\/schema\/person\/9d21ba7980d32dbd36069a4878f8e409","name":"Reena","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.investintech.com\/resources\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/aceff76f1b124f7ffb271de50b78f12a7599655c7087ea3a656b61cf9a89c376?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/aceff76f1b124f7ffb271de50b78f12a7599655c7087ea3a656b61cf9a89c376?s=96&d=mm&r=g","caption":"Reena"}}]}},"_links":{"self":[{"href":"https:\/\/www.investintech.com\/resources\/blog\/wp-json\/wp\/v2\/posts\/7395","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.investintech.com\/resources\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.investintech.com\/resources\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.investintech.com\/resources\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.investintech.com\/resources\/blog\/wp-json\/wp\/v2\/comments?post=7395"}],"version-history":[{"count":12,"href":"https:\/\/www.investintech.com\/resources\/blog\/wp-json\/wp\/v2\/posts\/7395\/revisions"}],"predecessor-version":[{"id":9366,"href":"https:\/\/www.investintech.com\/resources\/blog\/wp-json\/wp\/v2\/posts\/7395\/revisions\/9366"}],"wp:attachment":[{"href":"https:\/\/www.investintech.com\/resources\/blog\/wp-json\/wp\/v2\/media?parent=7395"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.investintech.com\/resources\/blog\/wp-json\/wp\/v2\/categories?post=7395"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.investintech.com\/resources\/blog\/wp-json\/wp\/v2\/tags?post=7395"}],"curies":[{"name":"wp","href":"https:\/\/api.w.o
rg\/{rel}","templated":true}]}}