April 17, 2018
Coding is for nerds…or is it? Which programming language to use as a beginner data journalist
Are you a journalist starting out on your data career? Are you tired of using Excel sheets and want to move on to more accurate tools? Then you might want to think about learning a programming language. Caelainn Barr of The Guardian and Karrie Kehoe of the Investigations Unit at RTÉ gave a rundown on whether to use R or Python as you start out on your journey.
Barr started off, like many other budding data journalists, on Excel and Google Sheets. “I came to journalism through the Bureau of Investigative Journalism in London where I worked with Cynthia Marku who works at the Financial Times and she was a data journalist and investigative journalist. She gave me my first taste of data journalism and through that I really kind of caught the bug,” she said.
Kehoe, on the other hand, had a different route into the field. “Back in 2011 when the Iraq war logs hit, I saw this amazing project from The Guardian and I was like ‘this is what I want to do with my life’,” she said.
A good way to get into data journalism if you have no computer science background is to start with a project you are already working on. This will help you choose the programming language that best fits your specific needs. It will also give you the drive to keep learning the language rather than give up, because you have a goal in mind.
So what are the pros and cons of each of these programming languages when it comes to data reporting?
A case for R
It’s free and open source
Many newsrooms are under financial pressure. It can be difficult to prove to your editors that it is worth learning a language you have no experience in, even if you feel it can help you grow as a journalist and produce better stories. The fact that R is free can help you convince your senior colleagues, as it costs the newsroom nothing.
It has an amazing user interface
RStudio is not a bare coding interface where you only see your input and output. Instead, it has multiple panels where you can see your files, store all of your code, review the commands you have run in the past and see your output.
It’s reproducible and efficient
Just like most coding languages, once you have a script, you can keep it, you can update it and you can re-run it. This is especially helpful when you have a project where you’re going to be doing the same tasks regularly.
It’s a Swiss Army Knife for different data types
It is one of the most useful tools when you are working with difficult data types. It is also very helpful for journalists, as you can combine data sets and find the most interesting bits that come out of them. Mostly this involves large government datasets.
It’s great for data analysis
The analysis tools in R are massively useful and they keep improving. There is a series of packages with tutorials that allow you to manipulate your data set. They are also named with English terms that make it easier to understand the tasks you are carrying out on your data set.
A case for Python
It’s easy to read
For someone who is starting out, this is great, as you can read other people’s scripts, make sense of them and replicate them for your own projects. There are also a lot of English terms embedded in the language.
It’s easy to learn
There are dozens of free online tutorials that teach you how to use Python. The Python Software Foundation has some, but there are also courses created specifically for journalists. Google has a set of classes too. In addition, there’s a great community around Python. The PyCon conferences give you opportunities to see a variety of presentations, panels and impromptu discussions, learn about significant advances in the Python development community, meet fellow developers from around the world, enroll in tutorials delivered by experts and participate in development sprints with fellow enthusiasts.
It’s powerful and fast
This is especially useful when you have deadlines and you are working with data that is way too big to be useful as a .csv file.
It’s got a statistics library
Pandas is a useful tool that you can use to manipulate data, merge and reshape it, run pivot tables and get descriptive statistics really quickly, and it handles missing data really well. It is also a fantastic tool for cleaning data, especially when you have millions of rows. It allows you to convert data types, create new columns, re-order them, merge them, remove unwanted characters like asterisks and rename column headings.
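To give a flavour of what those cleaning steps can look like, here is a minimal sketch. The two small tables, their column names and the figures in them are all made up for illustration and are not from the talk. The calls used (rename, str.replace, to_numeric, merge, describe, pivot_table) are standard pandas functions, but treat this as one possible approach rather than the speakers’ own workflow.

```python
import pandas as pd

# Hypothetical spending table; the asterisk mimics a footnote marker
# of the kind often found in government spreadsheets.
spending = pd.DataFrame({
    "Dept Name": ["Health", "Education", "Transport", "Health"],
    "Yr": [2016, 2016, 2016, 2017],
    "Amount (EUR)": ["1200*", "950", None, "1300"],
})

# Hypothetical lookup table to merge in.
staff = pd.DataFrame({
    "department": ["Health", "Education", "Transport"],
    "staff": [10, 5, 8],
})


def clean(df):
    """Rename column headings, strip asterisks and convert text to numbers."""
    out = df.rename(columns={
        "Dept Name": "department",
        "Yr": "year",
        "Amount (EUR)": "amount",
    })
    # Remove the literal asterisk, then convert to a numeric type.
    # Missing or unparseable values become NaN instead of raising an error.
    stripped = out["amount"].str.replace("*", "", regex=False)
    out["amount"] = pd.to_numeric(stripped, errors="coerce")
    return out


cleaned = clean(spending)

# Merge the two tables on the shared department column.
merged = cleaned.merge(staff, on="department", how="left")

# Create a new column from existing ones.
merged["amount_per_staff"] = merged["amount"] / merged["staff"]

# Descriptive statistics; missing values are skipped by default.
summary = merged["amount"].describe()

# Pivot table: one row per department, one column per year.
pivot = merged.pivot_table(
    index="department", columns="year", values="amount", aggfunc="sum"
)

print(merged)
print(summary)
print(pivot)
```

The same few lines work unchanged whether the table has four rows or four million, which is where this approach starts to pay off over a spreadsheet.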
It teaches you other programming tools
It helps you understand computers a little better and can give you the confidence to use other tools like GitHub.
Whichever programming language you decide to go with, stick to it, especially as you start out, and build your skills gradually as you tackle more and more complex data projects.