Journalism Technology Toolbox | Clean, Analyse & Visualize Data

data has a better idea sign

As you may know, we’ve started up the first annual Investintech – CAJ Data Journalism scholarship, and we wanted to take a moment to focus on you, the student, who may be interested in applying.

The scholarship is aimed at helping students develop strong storytelling chops and the data skills to back it up, now a common expectation in the journalism industry. 

As a student, you can expect to come up against a number of challenges as you embark on your career, not only financially, but also technically.

The storytelling tools you need come in all shapes and forms. Thus, if you’re just getting your feet wet as a student journalist, we’ve put together a few tip offs on some tools you can expect to encounter in your career as a journalist, and where they may help you out the most.

Working with Microsoft Excel Spreadsheets

Microsoft Excel is commonly used by everyone. When using Excel as a journalist, however, you’ll be using this application to clean and sift through your data, or perform calculations to weed out trends.

If you learn how to create pivot tables, you can rearrange and organize your data on the spot. In addition, pivot tables can summarize and filter your datasets to quickly display meaningful conclusions. This is helpful if you want to find out answers to:

  • Comparative questions (ie. which company did the best?) 
  • Quantitative analyses (ie. how many people used product A?) 
  • Relationships between values (ie. what are some of the major patterns in this dataset?)

Use a pivot table in your research and your stories gain more context and authority.

Analyzing spreadsheet data

When used for simple data cleaning, Excel can be a life saver. King’s College Data School for journalism students offer up skillsheets on Excel including one Excel tutorial on how to sift out the month and year from dates recorded within a single cell. By fixing and weeding out those values, you can get a more accurate analysis when using years as a differentiator to investigate your data. 

Focusing on Cellular Data with Open Refine 

Previously known as Google Refine, Open Refine is a powerful open source tool for cleaning large datasets. If you haven’t heard of this application before, Open Refine focuses on providing granular control over manipulating cell values and basic data management without having to know how to code or learn formulas. 

This tutorial from Berkeley Graduate School of Journalism shows you what you can do with Open Refine. Journalists are using Open Refine for its ability to deal with spelling errors, duplicates, white spaces and other inaccuracies in large datasets. The application can also generate an audit for data tasks already performed and create instantaneous links between datasets. 

Visualizing your Data with Data Wrapper  

Visualizing data is a highly sought after skill. Yet not everyone is skilled at creating them. From pie charts to graphs, Data Wrapper allows even the least graphically experienced to create and integrate charts, maps and tables into a story.

One of the features of Data Wrapper is the ability to work with imported and formatted data from different sources. For instance, the team behind the Data Wrapper blog shows you how to get data in the right format with pivot tables. It shows you how to create a line chart in Data Wrapper using reformatted pivot table data in Google Sheets. The tutorial is pretty straightforward and can be helpful if you’re expanding on your skills with pivot table data.

Visualizing data

Coding with Data Using Programming Languages

With today’s fast-paced use of technology, it isn’t surprising that professional journalists are picking up coding skills. It allows them to become more versatile and independent with their data manipulation. Below are the two most common languages being used.

R –  R is a programming language used by serious data journalists for data mining and statistical analysis, and manipulation via command-line interface. You may be wondering why use a programming language like R instead of Excel? Mary Jo Webster, data editor for the Star Tribune, goes over the pros of R vs. Excel and a few more resources for you to learn more about it. 

With R, you can scrape web pages, do mapping, pull data from other formats, and it’s open source. The programming language is more versatile with data cleaning, statistical analysis and Pivot Tables. Work wise, it’s known for its ability to help your work with large datasets by providing you with paper trails, making it easy for someone to check your analyses or find and correct your mistakes.

Python – Another popular programming language used by journalists is Python. Journalists use this programming language for a number of data-related tasks. 

To get a better sense of the programming language in action, check out this Python data scraping tutorial for beginners from CodeActually. It walks you through the basic skills on how to scrape websites for data, which comes in handy when your investigation requires a look at multiple webpages of data posted online. 

So which should you use? Datacamp.com offers up a good discussion to help you decide on using either Python or R.

Mapping Out your Data Geographically with ArcGIS

Geographical data map example

When reporting on stats, ArcGIS from Esri can geographically map out your data. The best part is that ESRI offers up features that allow for compelling story-telling graphics that work with your data to either prove a claim or to uncover conclusions. You can use GIS for: 

  • Layering different data sets onto a map 
  • Using mapped data visualizations to uncover patterns  
  • Analyzing geographical trends via locale

By importing your Excel data, you can layout your data on maps which you can create and customize. Once your data is plotted, you can upload, share and embed your map into your article. The company also offers an ArcGIS Microsoft add-in that provides you with the ability to map data in Excel and PowerPoint.

Convert Scanned PDFs from FOIA Requests

FOIA requests often result in getting sent scanned, difficult to access data in the PDF format. It is the worst, yet most common obstacle journalists encounter before even starting the data analysis process. Able2Extract Professional gives you a more efficient way to extract PDF data into formatted Excel spreadsheets by letting you manually adjust the rows and columns. 

How to extract PDF table data into formatted Excel spreadsheets with Able2Extract

The custom PDF to Excel conversion refines the data extraction process by selecting and converting only the data you need and preventing inaccurate and messy conversions. In addition, it converts scanned PDFs which is big when you consider scanned PDFs lock down the data you need.

Conclusion

While you may not start off with all of these tools at the beginning, we’re hoping that this gives you a good idea of what’s in store for you as journalism students.

With a few of the above tips and tools, you can get a leg up on data extraction methods, visualizing your research, and improving your data analysis techniques. 

What journalism tools are you looking at for working with datasets? Shout ‘em out on Facebook, Twitter & LinkedIn!