Following my brief data science internship at Wells Fargo this winter, I had an opportunity to visualize data from the Measuring Information Service Outcomes (MISO) survey. The survey asks faculty, staff, and students how satisfied they are with college technology services. Because I was taking over this project from another Datasquad member, it was important to first review the documentation in the unfinished R code used to clean the data.
In one portion of the code, I noticed that a new column was added to the dataset, recording which department (such as ITS, Library, etc.) provides each service:
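library(plyr)      # provides mapvalues()
library(magrittr)  # provides the %>% pipe (dplyr re-exports it as well)
# record abbreviations of services
abbreviation <- c("AORO", "ASC", "AWAC",...)
# record what services fall under what departments, and add a column of departments
department <- c("Other", "Library", "ITS",...)
# replace each service abbreviation in Type with the department at the same position
miso_all$department <- miso_all$Type %>% mapvalues(abbreviation, department)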
This was done with a function called mapvalues from the plyr package. The function looks up each element of the input vector in the abbreviation vector and replaces it with the element at the same index in the department vector. This approach is valid and easy to follow, but it has several drawbacks.
First, because there were more than 20 services in the survey, there is ample room for human error when manually assigning them to departments, and it is especially hard to confirm that all of the services were accounted for. Second, if the services change in subsequent surveys, these two vectors will have to be redefined, but it is unclear where the data relating these two fields lives in the project file. I had to spend time going through all of the datasets to figure out where this relationship was laid out.
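To make the lookup concrete, here is a minimal sketch on toy data. The three abbreviation and department pairs are the ones visible in the snippet above; the `type` vector and the code "XYZ" are made up for illustration and are not part of the MISO data.

```r
library(plyr)

# Toy stand-ins for the two vectors in the post (the real ones are longer)
abbreviation <- c("AORO", "ASC", "AWAC")
department   <- c("Other", "Library", "ITS")

# Hypothetical survey column; "XYZ" is an invented code missing from the mapping
type <- c("ASC", "AWAC", "AORO", "ASC", "XYZ")

# Replace each abbreviation with the department at the same position
mapped <- mapvalues(type, from = abbreviation, to = department)
print(mapped)

# Roughly the same lookup written out in base R:
# find each value's position in abbreviation, then index into department,
# keeping the original value where no match is found
idx <- match(type, abbreviation)
manual <- ifelse(is.na(idx), type, department[idx])
print(identical(mapped, manual))
```

Note that a value with no entry in the abbreviation vector ("XYZ" here) passes through unchanged rather than raising an error, so a missing service would not be caught by this step.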
In my code, I loaded a relationship file matching services to departments and used an inner join to attach departments to the dataset. Although this makes it somewhat harder to see at a glance what the code is doing, it creates a workflow in which the relationship file can be found and modified for subsequent surveys, so that people are not confused about where individual pieces of the data live in the project file. It is also a better way of checking that all of the services were accounted for.
Reproducibility is important, especially when the task will clearly be repeated in the future (as in a survey setting). This is an example of how changing one section of your code can make it much easier for the next people working on the same project to understand your workflow.
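The join-based workflow might look something like the sketch below. It assumes dplyr; the lookup table is built inline so the example runs on its own, whereas in the project it would be read from the relationship file (the file name in the comment is hypothetical). The `score` column and the code "XYZ" are invented for illustration.

```r
library(dplyr)

# Hypothetical relationship table; in practice this would be loaded from a file,
# e.g. read.csv("service_departments.csv") (file name is an assumption)
service_lookup <- data.frame(
  Type       = c("AORO", "ASC", "AWAC"),
  department = c("Other", "Library", "ITS"),
  stringsAsFactors = FALSE
)

# Toy survey rows; "XYZ" is an invented service missing from the lookup
miso_toy <- data.frame(
  Type  = c("ASC", "AWAC", "AORO", "ASC", "XYZ"),
  score = c(4, 3, 5, 2, 4),
  stringsAsFactors = FALSE
)

# Rows whose service has no department in the lookup table
unmatched <- anti_join(miso_toy, service_lookup, by = "Type")
print(unmatched)

# Attach departments; an inner join keeps only rows with a match
miso_joined <- inner_join(miso_toy, service_lookup, by = "Type")
print(miso_joined)
```

Because an inner join drops unmatched rows, comparing `nrow(miso_joined)` with `nrow(miso_toy)`, or inspecting `unmatched`, is one way to confirm that every service was assigned a department.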