The goal of the Election News Analysis project is to accumulate transcripts from various news sources and analyze trends and patterns in the topics discussed during presidential election periods and around key events that affected presidential elections (e.g., George Floyd’s murder, January 6th). This project was prompted by the research of Carleton College political science department faculty member Barbara Allen and her research partner Daniel Stevens (University of Exeter), along with recent and current students Efram Stewart, Bjorn Hotley, and Dylan Fox-Arnold.
Aaron Bronstone (’24) and Serafin Patino (’24) worked with Professor Barbara Allen to scrape transcripts from NDS, a news data aggregation service.
News Data Service (NDS) is a national media monitoring service that provides videos and transcripts from various media outlets via a browser API.
NDS is a valuable resource for obtaining transcripts to analyze, but rumors spread quickly at the end of 2023 regarding NDS’s decision to improve its outdated UI within the upcoming months. Given that the current NDS scraper might be rendered obsolete after the UI changes, it became a high priority to scrape as many transcripts as possible from other election periods in anticipation of potential future research.
Carleton College holds a subscription to NDS, and with their permission, our NDS scraper was initially developed by Hiromichi Ueda (’21), with further improvements made in 2022 by Helen Du (’22), Isabella Cha (’23), Cathy Duan (’25), Nina Sun (’24), Charles Nykamp (’25), and Graham Gordon (’25).
From September 2023 to February 2024, Aaron Bronstone (‘24) and Serafin Patino (‘24) worked to scrape news transcripts from local sources in Minnesota, Iowa and Wisconsin, as well as national news sources from years related to presidential elections.
The Technology Behind Election News Scraping
NDS Scraper
Scraping involves automating web browsing and user interactions such as clicking and typing, and collecting information from websites through the HTML page structure, called the Document Object Model (DOM). Every browser web page consists of HTML elements forming a DOM, which the browser renders to display the information you see when you visit a website. The legality of scraping is a significant concern: some websites prohibit it, while others allow scraping as long as the information acquired falls under Fair Use. The NDS scraper uses the popular browser automation library Selenium.
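To illustrate the kind of automation involved, here is a minimal sketch of opening a page, filling in a form, and reading results out of the DOM with Selenium. The URL, element locators, and query string are placeholders for illustration, not the actual NDS page structure.

```python
# Minimal Selenium sketch; the URL and element IDs below are placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()                      # launch an automated browser
driver.get("https://example.com/broadcast")      # placeholder URL

# Type into a form field and click a button, just as a user would.
driver.find_element(By.ID, "search-box").send_keys("ABC 07/01/2008")
driver.find_element(By.ID, "submit-button").click()

# Read text and attributes out of the rendered DOM.
for link in driver.find_elements(By.CSS_SELECTOR, ".result-link"):
    print(link.get_attribute("href"))

driver.quit()
```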
There are two main parameters we used for the NDS scraper:
- DATE RANGE: can be as long as desired. The date ranges we focused on were:
- 2004, 2008, 2012, 2016: July 1st – November 30th
- 2019: November 1st – December 31st
- 2020-2023: January 1st – December 31st
- 2024: January 1st – February 1st
- LIST OF NEWS SOURCES: can be any sources found on the Broadcast Content page of NDS. Each source is listed with a state and city market if it is a local source; otherwise it is listed as “national”. The following news sources were of interest in the project:
These two parameters are combined via a Cartesian product, giving us a list of queries containing every combination of source and day in the date range (a sketch of this query generation follows the steps below). Using this result, the scraper performs a successful NDS scrape in two steps:
- Crawling – On the “Broadcast Content” NDS page, the scraper inputs and submits each query using automated selection and clicking. This generates a list of transcript links that are added to a CSV file.
- Scraping – Going through the CSV output file, the NDS scraper visits the link for each transcript, collects the entire text, and writes it to a text file following the output file structure defined below.
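The Cartesian product itself is straightforward to express in Python. The sketch below pairs every source with every day in a date range using itertools.product and writes the resulting queries to a CSV file; the source list, date range, and CSV layout are illustrative, not the project’s actual code.

```python
# Illustrative sketch: one query per (source, day) combination.
import csv
from datetime import date, timedelta
from itertools import product

sources = ["ABC", "CBS", "NBC"]                      # example news sources
start, end = date(2008, 7, 1), date(2008, 11, 30)    # example date range
days = [start + timedelta(days=i) for i in range((end - start).days + 1)]

with open("queries.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["source", "date"])
    # Cartesian product: every source paired with every day in the range.
    for source, day in product(sources, days):
        writer.writerow([source, day.strftime("%m%d%y")])
```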
These two steps result in a collection of transcripts from each specified news source for each day in the given date range. The file structure that the NDS scraper uses to organize this collection is as follows:
Year (YYYY) → Station → Month (full name) → Day (mmddyy) → source_date_time_name.txt.
For example, the output file structure for the 2008 date range looks something like this diagram:
NDS News Scraping [F23]
└── 2008
    ├── ABC (News Station)
    │   ├── July (Month)
    │   │   ├── 070108 (Date)
    │   │   │   ├── ABC 070108 1200PM CT.txt
    │   │   │   ├── ABC 070108 0135AM CT.txt
    │   │   │   └── ...
    │   │   ├── 070208
    │   │   │   └── ...
    │   │   ├── ...
    │   │   └── 073108
    │   ├── August
    │   │   └── ...
    │   └── ...
    └── CBS
        └── ...
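To make the naming scheme concrete, the hypothetical helper below assembles an output path following the Year → Station → Month → Day → source_date_time_name.txt convention; it is an illustration of the structure above, not the scraper’s actual code.

```python
# Hypothetical helper that builds an output path following the scheme above.
from datetime import datetime
from pathlib import Path

def transcript_path(root: Path, station: str, aired: datetime, name: str) -> Path:
    day = aired.strftime("%m%d%y")
    filename = f"{station} {day} {aired.strftime('%I%M%p')} {name}.txt"
    return root / str(aired.year) / station / aired.strftime("%B") / day / filename

print(transcript_path(Path("NDS News Scraping [F23]"),
                      "ABC", datetime(2008, 7, 1, 12, 0), "CT"))
# NDS News Scraping [F23]/2008/ABC/July/070108/ABC 070108 1200PM CT.txt
```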
The NDS scraper was initially used to collect transcripts from 2016, as well as from the few months surrounding George Floyd’s murder in 2020 for specific analysis. These transcripts were stored in separate folders of a Google shared drive and were used by different research projects and classes.
NDS Scraper Improvements
Further analysis of other election periods would require scraping transcripts from more presidential election years; as of this writing, NDS has transcripts going back to 2002. Additionally, given the dramatic divide the recent elections have revealed across the nation, all of the months between November 2019 and the present are of interest for further topic and sentiment analysis to uncover each news station’s attitude towards the election.
This amounted to over 500,000 transcripts to scrape, and the original code had major limitations:
- Slow speed of Google Colab – while Google Colab is a fantastic tool for collaborating on code and working with Jupyter notebooks, its online runtime was too slow for the scale of this project.
- Untailored Selenium wait times – Some Selenium element wait times were considerably longer than needed to account for a variety of action outcomes.
- No progress saving – After collecting links, the crawling function would not save its progress. Even with optimized Selenium DOM interaction, the average time per query was 30 seconds; with 60,000 queries to process, that amounted to nearly three weeks of required crawling time. With limited computing resources, there was a high risk of the code stopping and all of our progress being lost.
Both before and during scraping, we made several improvements to the original code:
- Restructured code to run locally – to take advantage of faster local execution, we consolidated the Jupyter notebook code on Colab into a single Python file and then separated it into different files to better organize the functions and parameters. Using Drive for Desktop, we were able to interact with the Google shared drive for the project.
- Analyzed NDS loading times – we spent a considerable amount of time manually testing queries on the NDS Broadcast Content page to analyze loading times. This allowed us to optimize the Selenium element searching and waiting, cutting the time per query in half (a sketch of the resulting wait-and-resume pattern follows this list).
- Implemented progress saving – given the amount of time required to crawl every query, we added a new CSV file to each year folder to store the query progress, and changed the code to skip any queries already performed.
- Analyzed edge cases – there were a few errors NDS could throw after a query besides “links found” or “no links found”. We implemented new error handling so that some inevitable error queries were considered “successful” and were not repeated.
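As an illustration of the wait-time and progress-saving changes, the sketch below shows the general pattern: an explicit WebDriverWait tuned to a specific element rather than a fixed, worst-case delay, and a per-year progress CSV consulted before each query. The element locator, file names, and helper names are hypothetical.

```python
# Hypothetical sketch of tailored waits and progress saving; names are illustrative.
import csv
from pathlib import Path
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

PROGRESS_FILE = Path("2008/progress.csv")  # one progress CSV per year folder

def completed_queries() -> set[tuple[str, str]]:
    """Load the (source, date) pairs that have already been crawled."""
    if not PROGRESS_FILE.exists():
        return set()
    with PROGRESS_FILE.open(newline="") as f:
        return {(source, day) for source, day in csv.reader(f)}

def crawl(driver, queries):
    done = completed_queries()
    for source, day in queries:
        if (source, day) in done:
            continue  # progress saving: skip queries already performed
        # ... submit the query for (source, day) on the Broadcast Content page ...
        # Wait only until the results element actually appears (up to 15 s),
        # instead of sleeping for a fixed, worst-case amount of time.
        WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".result-link"))
        )
        # ... collect the transcript links and add them to the links CSV ...
        with PROGRESS_FILE.open("a", newline="") as f:
            csv.writer(f).writerow([source, day])  # record the finished query
```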
Data Management
The Google shared drive has been the primary storage method for these transcripts. However, over the years the drive became cluttered with many different folders containing copies of original transcripts, some of which were vaguely named and some of which contained machine-coded/annotated transcripts modified by students in class assignments. Eventually, it became difficult to distinguish the original transcript files from the annotated ones.
An analysis project of this scale required a fixed source of original transcripts, so we looked to the Inter-university Consortium for Political and Social Research (ICPSR), a social science data archive that expressed interest in archiving our data set.