DataSquad: A Deep Dive Into Data Cleaning

18 May 2023
By Aishwarya Varma

Editor’s note: This article was written by Aishwarya Varma, one of the Data Squad students in 2023.

Shaun Baron-Furayama (class of 2023) was tasked with the tedious job of cleaning student-collected presidential election data that spanned over two decades. While many might think this was an easy task, the project actually took a little over a summer. I interviewed him to get a closer look at his experience and gain a deeper understanding of data cleaning.

Before we start, it’s worth noting that data cleaning is an important part of research: it gives those using the final product access to accurate data, and it maximizes how much of the collected data remains usable.

The Data

Students compiled data for Professor Barbara Allen in the Political Science department from a variety of sources, including videos from news stations reporting on statewide and presidential elections. The data was both numerical and categorical: there was quantitative data like timestamps and dates, as well as rankings of bias and partisan attitudes, which required qualitative analysis. Moreover, the data collection process spanned over two decades. Students working for Professor Allen at the beginning of the millennium used data collection and storage methods that would seem outdated today. These sources were added to, edited, and mutated over the years, so the final product of this collection was a series of datasheets containing sparse data that was often difficult to understand.

Data cleaning is not easy. As defined by Tableau, it encompasses the process of “fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset”. When combining multiple data sources, which was part of Shaun’s challenge, there are “many opportunities for data to be duplicated or mislabeled”. Because of the size of these files (one CSV file contained over 250 columns!), Shaun was only able to incorporate three or four sources into the final product, which speaks to the complexity and tedium of the task.
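To make the "combining multiple sources" problem concrete, here is a minimal sketch in plain Python (standard library only). It is not Shaun's code, which was written in R. The sheet names, the columns (station, air_date, bias_rating, tone), and the sample rows are all hypothetical, made up for illustration. The sketch stacks two sources while keeping every column from both, then flags rows that look like the same record entered twice.

```python
import csv
import io

def combine_sources(sources):
    """Stack rows from several CSV texts, keeping the union of all columns."""
    columns = []  # union of column names, in first-seen order
    rows = []
    for name, text in sources.items():
        reader = csv.DictReader(io.StringIO(text))
        for col in reader.fieldnames or []:
            if col not in columns:
                columns.append(col)
        for row in reader:
            row["source_file"] = name  # remember where each row came from
            rows.append(row)
    columns.append("source_file")
    return columns, rows

def find_duplicates(rows, key_columns):
    """Return (first_row, repeated_row) pairs that share the same key."""
    seen = {}
    duplicates = []
    for row in rows:
        # Normalize case and whitespace so "Station A" matches "station a ".
        key = tuple((row.get(col) or "").strip().lower() for col in key_columns)
        if key in seen:
            duplicates.append((seen[key], row))
        else:
            seen[key] = row
    return duplicates

# Hypothetical sample data: two sheets with overlapping but different columns.
sources = {
    "sheet_2000": "station,air_date,bias_rating\n"
                  "Station A,2000-10-01,3\n"
                  "Station B,2000-10-02,5\n",
    "sheet_2004": "station,air_date,tone\n"
                  "station a,2000-10-01,neutral\n"
                  "Station C,2004-10-05,positive\n",
}

columns, rows = combine_sources(sources)
duplicates = find_duplicates(rows, ["station", "air_date"])

for first, repeat in duplicates:
    print("possible duplicate:", first["source_file"], "and", repeat["source_file"])
```

Keeping the union of columns, instead of only the columns the sheets share, is one way to retain as much information as possible. A flagged pair is only a candidate: someone still has to decide whether the two rows should be merged or kept.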

The Challenge, Spelled Out

While Shaun was given metadata illustrating what each piece of data meant, like a description of the column and its range of values, he noted there were often discrepancies between what was described and what he actually saw. For example, if the documentation noted that a particular column’s values would only range from 1 to 7, he might see values like 10 or 0 included. As a result, he had to interpret outliers and spelling mistakes himself, which was difficult considering he was not directly involved in the data collection process. Columns could be mislabeled or mismatched, so he would often consult with Professor Allen when he didn’t understand a column’s purpose. Most of the time, he would have to use his discretion (and a bit of common sense) to discern any irregularities, which required him to draw out columns, figure out relationships between them, check for unique values within a column, and look at outliers to make sense of things. I asked him if he had done any tests to make certain that his corrections did not significantly alter the data, to which he candidly admitted that the size of the project made it difficult to do so. Ultimately, his goal was not only to combine multiple sources into one CSV file, but also to retain as much of the information as possible.
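The two checks described above, comparing a column against its documented range and listing its unique values, can be sketched in a few lines of plain Python. This is an illustration only, not Shaun's R code. The column names (bias_rating, tone) and the sample rows are hypothetical; only the documented range of 1 to 7 comes from the example above.

```python
from collections import Counter

def out_of_range(rows, column, low, high):
    """Return (row_index, raw_value) pairs outside [low, high] or not a whole number."""
    problems = []
    for i, row in enumerate(rows):
        raw = (row.get(column) or "").strip()
        if raw == "":
            continue  # blank cells are missing data, a separate question
        try:
            value = int(raw)
        except ValueError:
            problems.append((i, raw))  # e.g. a spelling mistake in a numeric column
            continue
        if not low <= value <= high:
            problems.append((i, raw))
    return problems

def unique_values(rows, column):
    """Count each distinct value in a column to expose typos and odd codes."""
    return Counter((row.get(column) or "").strip() for row in rows)

# Hypothetical rows, as a csv.DictReader might produce them.
rows = [
    {"bias_rating": "3", "tone": "neutral"},
    {"bias_rating": "10", "tone": "Neutral"},
    {"bias_rating": "0", "tone": "nuetral"},
    {"bias_rating": "7", "tone": "positive"},
    {"bias_rating": "", "tone": "positive"},
    {"bias_rating": "sevn", "tone": "neutral"},
]

flagged = out_of_range(rows, "bias_rating", 1, 7)
tone_counts = unique_values(rows, "tone")

print(flagged)      # rows whose rating contradicts the documentation
print(tone_counts)  # rare spellings stand out next to the common ones
```

The sketch only reports suspicious cells and never changes them. Deciding whether a 10 is a typo for 1 or a value from an older rating scale is exactly the kind of judgment call Shaun had to make himself or bring to Professor Allen.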

The Process

Shaun had taken Data Science, a statistics class, the previous year, where he learned how to use R to manipulate datasets. While he was entirely new to the data cleaning process, he thought using this language would allow him to exercise these skills. When I asked him what aspect of the process he might change, he reported that R might not have been the best choice. Initially, he thought R’s data visualization tools would be useful for the project, but he quickly realized the language was not ideal for this job. R, like Python, is a high-level interpreted language, and Shaun said that it was not specifically designed for data cleaning; in fact, he thought a Python library like pandas might have run much faster. As a result of this choice, he spent a huge chunk of his time waiting for the program to run on these large datasets. As a fellow computer science major, I can admit that this can be the most demoralizing aspect of coding: wanting to check if I achieved the desired result, but losing steam as I wait for the program to finish running. As inefficient as R was, however, he noted that it was a good tool for beginners, especially because he knew less about the details of the data collected.

Reflections and Final Thoughts

It’s evident that this project was a huge undertaking, given that Shaun devoted a little more than a summer’s worth of work to the process. Surely this project would’ve changed his perception of coding, seeing as this was his first huge assignment? According to him, the data was certainly overwhelming, and he didn’t quite know where to start. He began by doing tons of research, both online and within specific datasets, to establish his priorities. There were too many columns and too many things he didn’t know, so he had to figure out what areas to focus on. He first identified the areas that needed improvement (e.g. alignment and data accuracy), developed a system to rank these priorities, and then figured out the specific tasks necessary to fix them. Once he was able to make a few changes, he worked on consistency to make sure these changes were applied in all the necessary areas. He found this planning process to be extremely useful in his computer science classes, as his coding assignments would require him to sketch out a plan before delving into the thick of it. I concur that these skills are extremely important and make the coding process much easier. I recall creating a design document in Anna Rafferty’s Data Structures course to outline my group’s approach to implementing a data structure and found that it was instrumental in creating an efficient and accurate solution.

Now, would he do it again?

Maybe not in the same way he did it before, but he would certainly like to start the project over given the knowledge he has now. Obviously, the project would still be challenging, but he would likely reach a solution faster than before. In his eyes, it was a good project for his sophomore-year self. This was a paid opportunity that provided him with class credits and an excellent internship experience to put on his resume.