Barbara Allen, a political science professor, approached the DataSquad for help redesigning the graphs in her book to improve their clarity and meet publisher standards. I was given a Word document containing the chapter in question, the old versions of the graphs, a raw .csv data file, and the Stata .do file that was used to create the .csv.
The limitations given by the client and publisher were hard to meet:
- Only black and white (no gray)
- 4 by 7.5 inches
- Understandable without the context of the chapter
- No titles (the publisher wanted to add the title for each graph)
The original versions of the graphs were unclear, messy, and didn’t meet the standards.
![Original Client Graph](https://cdn.carleton.edu/uploads/sites/114/2017/05/Screen-Shot-2017-05-28-at-7.58.30-PM-p138lh.png)
To make the situation more complicated, the Excel file used to create the graphs was not included. Without the file, it was impossible to salvage the original graphs, so I had to start from scratch with the raw data.
Because the source data file kept changing over the course of the project, I planned a workflow that kept the graphs independent of the data source and of the data cleaning and aggregation. To further isolate the figures from each other, I built them in an R Markdown notebook, so each graph lived in its own chunk and could be rendered independently while sharing the same theme.
Project Workflow:
- Receive raw data in .csv format (this raw data changed several times but had consistent column names)
- Read .csv into Python using the pandas library
  - Use information from .do file to re-add contextual information
  - Filter out invalid information
  - Aggregate the data and output a new processed .csv for each graph
- Read each processed .csv into R Markdown notebook
  - One chunk controlled all global variables (required libraries, line thickness, and theme settings)
  - One chunk was created for each graph
    - Load the corresponding processed .csv file from the Python script
    - Format the data for ggplot2
    - Apply custom theme
    - Graph and resize to publisher specifications
- Export each graph as a .png
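The Python half of this workflow can be sketched roughly as follows. This is a minimal illustration, not the project's script: it uses only the standard library as a stand-in for pandas, and the sample rows, the raw column names, and the `ads` count column are all hypothetical, since the real raw file is not shown here.

```python
import csv
import io
from collections import Counter

# Hypothetical raw data: one row per ad airing (the real columns are not shown in this post).
RAW_CSV = """Party,day,Issue
Democrat,3,Economy
Republican,3,Economy
Democrat,3,Economy
Other,2,Health
Democrat,,Economy
Republican,2,Health
"""


def load_rows(text):
    """Read CSV text into a list of dicts (a stand-in for pandas.read_csv)."""
    return list(csv.DictReader(io.StringIO(text)))


def is_valid(row):
    """Keep rows where every field is filled in and the day is a whole number."""
    return all(row.values()) and row["day"].isdigit()


def aggregate(rows):
    """Count airings per (Party, day, Issue) combination."""
    counts = Counter((r["Party"], int(r["day"]), r["Issue"]) for r in rows)
    return [
        {"Party": party, "day": day, "Issue": issue, "ads": n}
        for (party, day, issue), n in sorted(counts.items())
    ]


def to_csv(records):
    """Serialize the aggregated records as the processed .csv for one graph."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["Party", "day", "Issue", "ads"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


rows = [r for r in load_rows(RAW_CSV) if is_valid(r)]
processed = aggregate(rows)
processed_csv = to_csv(processed)
print(processed_csv)
```

Because each graph reads only its own processed file, a new version of the raw data means rerunning this step without touching any plotting code.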
Changes Made to the Original Graph
To start, I made a custom theme that removed all color from the graph. Without color to encode party, I used linetype so both parties could still be plotted on the same axes. For clarity, the cumulative sum of ads run was moved to a separate but parallel set of axes. That cumulative sum was later replaced with a running average of ads shown daily, which encodes more interesting and relevant information. Barbara and I both love small multiples, so almost every graph ended up using them in some form.
At the client's request, a reference line marking the date Lehman Brothers collapsed was added to each graph. To keep the graphs clear, each one was given a caption explaining the reference line and the scope/filter of the graph.
The hardest aesthetic changes centered on using every pixel of space within the publisher's dimension requirements. This involved turning off almost all margins in the graph, moving the legend, deleting axis titles, and realigning the caption.
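To illustrate why a running average can say more than a cumulative sum, here is a small sketch on made-up daily counts. The 7-day window matches the label used in the final graph, but the trailing (rather than centered) window and the handling of the first few days are assumptions, not necessarily what the project's aggregation script did.

```python
from itertools import accumulate


def cumulative_sum(daily):
    """Running total of ads aired up to and including each day."""
    return list(accumulate(daily))


def rolling_mean(daily, window=7):
    """Trailing mean over the current day and up to window - 1 days before it.

    Early days, which have fewer than `window` observations, are averaged
    over whatever is available (an assumption made for this sketch).
    """
    means = []
    for i in range(len(daily)):
        chunk = daily[max(0, i - window + 1): i + 1]
        means.append(sum(chunk) / len(chunk))
    return means


# Made-up counts: airings ramp up for a week, then stop entirely.
daily_ads = [10, 20, 30, 40, 50, 60, 70, 0, 0, 0]

print(cumulative_sum(daily_ads))
print([round(m, 1) for m in rolling_mean(daily_ads)])
```

Once the ads stop, the cumulative sum simply goes flat at its maximum, while the rolling average starts falling, so a change in campaign activity is visible in the line itself.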
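Since the x axis of these graphs counts days until the election, the reference line has to be placed at a day offset rather than a calendar date. The sketch below shows one way to derive that offset, assuming the two dates involved are September 15, 2008 (the Lehman Brothers bankruptcy filing) and November 4, 2008 (Election Day); the function and constant names are hypothetical.

```python
from datetime import date

# Assumed dates: the 2008 general election and the Lehman Brothers bankruptcy filing.
ELECTION_DAY = date(2008, 11, 4)
LEHMAN_COLLAPSE = date(2008, 9, 15)


def days_until_election(day, election=ELECTION_DAY):
    """Number of days between a given date and the election."""
    return (election - day).days


lehman_day = days_until_election(LEHMAN_COLLAPSE)
print(lehman_day)  # 50
```

This agrees with the value of 50 hard-coded for the reference line in the plotting code later in this post.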
First attempt:
```r
# Note: ip_ads, lineW, and the required libraries are assumed to be
# defined/loaded earlier in the notebook; they are not part of this snippet.

# subset the data to the last 90 days
filt2 <- subset(ip_ads[, c('Party', 'day', 'cumulativeAds', 'Issue', 'weightedIssueTruth')], day <= 90)
# subset the data to only Dems and Republicans
filt2 <- subset(filt2, Party != "Other")
# rename columns
names(filt2) <- c("Party", "day", "Cumulative Ads Mentioning Issues", "Issue", "Average Issue Truth")
# melt the data into long format (one row per Party/day/Issue/variable)
m2 <- melt(filt2, id.vars = c("Party", "day", "Issue"))
# start graph, encoding party with linetype
g2 <- ggplot(m2, aes(x = day, y = value, linetype = Party))
# make line plot and make line thicker
g2 <- g2 + geom_line(size = lineW)
# name and reverse x axis
g2 <- g2 + scale_x_continuous(name = "Days until Election", trans = "reverse")
# don't use scientific notation
g2 <- g2 + scale_y_continuous(labels = comma)
# facet by measure (rows) and issue (columns); melt() names the measure column "variable"
g2 <- g2 + facet_grid(variable ~ Issue, scales = 'free_y', switch = "y")
# remove the y axis label, an artifact of faceting
g2 <- g2 + ylab("")
# use the strip text as axis labels
g2 <- g2 + theme_bw() + theme(strip.background = element_blank(), strip.placement = "outside")
# add title
g2 <- g2 + ggtitle("Figure 2", subtitle = "Ad Airings and Accuracy on Principal Issue Topics in the 2008 Presidential Election")
# print and save graph
g2
ggsave("Figure2.pdf", device = "pdf", width = 7.29 * 1.5)
```
Final product:
```r
# libraries this chunk appears to rely on (assumed; the global chunk is not shown)
library(ggplot2)   # plotting, theme(), unit(), margin()
library(reshape2)  # melt()
library(scales)    # comma labels
library(readr)     # read_csv()

# constant for line width
lineW <- 0.5
# custom theme aesthetics
customTheme <- theme_bw() + theme(
  strip.background = element_blank(),
  strip.placement = "outside",
  text = element_text(size = 10, family = "Arial", face = "bold", color = "black"),
  legend.position = "bottom",
  strip.text = element_text(hjust = 0.5, face = "bold"),
  legend.title = element_blank(),
  plot.caption = element_text(hjust = 0.5, face = "italic", size = 8),
  plot.margin = margin(t = 0, r = 0, b = 0, l = 0, unit = "cm"),
  #panel.margin = unit(0, "cm"),
  legend.margin = margin(t = 0, r = 0, b = 0, l = 0, unit = "cm"),
  legend.key.height = unit(0, "cm"),
  strip.switch.pad.wrap = unit(0, "cm"),
  panel.grid.major = element_line(colour = "black", size = 0.1),
  panel.grid.minor = element_line(colour = "white")
)
# for removing left padding
noY <- theme(axis.title.y = element_blank())
# the day Lehman Brothers collapsed, in days until the election
SepFifteen <- 50
# import the data
ip_ads <- suppressMessages(read_csv("./ProcessedData/issueTruthPivot.csv"))
# subset the data to the last 90 days
filt2 <- subset(ip_ads[, c('Party', 'day', 'meanAds', 'Issue', 'weightedIssueTruth')], day <= 90)
# subset the data to Dems and Republicans
filt2 <- subset(filt2, Party != "Other")
# rename columns
names(filt2) <- c("Party",
                  "day",
                  "Ads Shown Daily *\n(Rolling Average, 7-Day Window)",
                  "Issue",
                  "Average Issue Truth")
# for use for ordering the data later
v2Names <- c("Average Issue Truth",
             "Ads Shown Daily *\n(Rolling Average, 7-Day Window)")
# melt the data so it can be pivoted
m2 <- melt(filt2, id.vars = c("Party", "day", "Issue"))
# order the data so it graphs correctly
m2 <- m2[order(m2$variable == v2Names[1], decreasing = TRUE), ]
m2$variable <- factor(m2$variable, levels = v2Names)
# create dummy data to standardize y axis range
axis_fix2 <- data.frame(Party = "Democrat",
                        Issue = "Economy",
                        day = range(0:90),
                        value = c(5, 5000),
                        variable = levels(m2$variable),
                        stringsAsFactors = FALSE)
axis_fix2$variable <- factor(axis_fix2$variable, levels = v2Names)
# start graph
g2 <- ggplot(m2, aes(x = day, y = value, linetype = Party))
# make line plot and make line thicker
g2 <- g2 + geom_line(size = lineW)
# name and reverse x axis
g2 <- g2 + scale_x_continuous(name = "Days until Election", trans = "reverse")
# don't use scientific notation
g2 <- g2 + scale_y_continuous(labels = comma)
# fix axis for mean daily ads
g2 <- g2 + geom_blank(data = axis_fix2, aes(x = day, y = value))
# facet and set y axis aesthetics
g2 <- g2 + facet_grid(variable ~ Issue, scales = 'free_y', switch = "y")
# set y axis range for issue truth
g2 <- g2 + expand_limits(y = 5)
# use custom theme and remove left padding
g2 <- g2 + customTheme + noY
# add caption
g2 <- g2 + labs(caption = "*Only ads mentioning the corresponding issue. Dotted vertical line on day 50 marks Lehman Brother collapse.")
# add reference line for the Lehman Brothers collapse
g2 <- g2 + geom_vline(xintercept = SepFifteen, linetype = 3, size = 0.5)
# print and save graph
g2
ggsave(
  "./OutputFigures/IssueMultiples.jpg",
  device = "jpg",
  width = 7.5,
  height = 4,
  units = "in",
  dpi = 600
)
```
![Final Product](https://cdn.carleton.edu/uploads/sites/114/2017/05/IssueMultiples-13h879n-scaled-1.jpg?resize=1024,546)
Overall, this project was a huge learning experience for me. It was my first project with major client interaction and my first project done with ggplot. Because ggplot works so differently from other graphing libraries in R, I had to learn it from scratch for these graphs. I enjoyed finding creative solutions to the problems I ran into; my personal favorite was melting the data and then using a facet grid to align the x axis.