Scraping ENROLL with Beautiful Soup and Pandas

29 October 2019
By Sam Terwilliger

For my first project with the Datasquad, I extracted course information to build up-to-date lists of Writing Rich and Quantitative Reasoning courses. Initially, we thought this task would be quite simple: we believed the database behind ENROLL contained all the relevant information and that the data could be extracted easily via an Excel data dump. We quickly hit a roadblock. Almost all the data we needed was in the database, but the course descriptions were stored separately, and the only simple way to access them was through the ENROLL website itself.

To work around this, we decided to first retrieve the Writing Rich (WR) and Quantitative Reasoning (QR) course listings from the database, and then merge in the descriptions scraped from the ENROLL website. This scraping task was a perfect fit for the Python library Beautiful Soup 4, which provides a variety of functions for parsing HTML and extracting data from it. Once I had retrieved course descriptions and IDs from the ENROLL website, constructing the final WR and QR course list was simply a matter of writing a Python program to map the database course listings to the scraped ENROLL information.
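
To give a rough idea of what the scrape-then-merge step can look like, here is a minimal sketch rather than the actual project code. Everything specific in it is an assumption made for illustration: the sample HTML, the "course" and "description" class names, the course IDs, the column names, and the choice of a pandas left merge to do the mapping. The real ENROLL pages and database export will look different.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a saved ENROLL page.
# The real page structure is not shown in this post.
SAMPLE_HTML = """
<div class="course" id="ENGL.100">
  <h3 class="title">ENGL 100: Writing Seminar</h3>
  <p class="description">An introduction to college writing.</p>
</div>
<div class="course" id="MATH.120">
  <h3 class="title">MATH 120: Statistics</h3>
  <p class="description">Data analysis and inference.</p>
</div>
"""


def scrape_descriptions(html):
    """Return a DataFrame with one row per course found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for block in soup.find_all("div", class_="course"):
        desc = block.find("p", class_="description")
        rows.append({
            "course_id": block["id"],
            # Tolerate a course block that has no description paragraph.
            "description": desc.get_text(strip=True) if desc else "",
        })
    return pd.DataFrame(rows, columns=["course_id", "description"])


def merge_listings(listings, descriptions):
    """Attach scraped descriptions to the database listings by course ID."""
    # A left merge keeps every database row, even if nothing was scraped for it.
    return listings.merge(descriptions, on="course_id", how="left")


# Hypothetical stand-in for the Excel export of WR and QR listings.
listings = pd.DataFrame({
    "course_id": ["ENGL.100", "MATH.120", "HIST.200"],
    "designation": ["WR", "QR", "WR"],
})

merged = merge_listings(listings, scrape_descriptions(SAMPLE_HTML))
print(merged)
```

In this sketch, a course that appears in the database but not on the scraped page keeps its row and gets a missing description, which makes gaps in the scrape easy to spot.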

While working on this project, I learned a lot about how the scope of a project often changes gradually as we encounter and overcome roadblocks in data gathering and processing. I also learned about, and got practice working with, a couple of new Python libraries.
