Below is a list of faculty who will be conducting research during Summer 2019 or Winter 2020 and are looking for research students.

  • Improving Educational Software Through Data Analytics and Machine Learning (Anna Rafferty) / Summer 2019
  • Diabetes Data Science (Dave Musicant and Sarah Meerts) / Summer 2019
  • Visual Text Analysis for the Digital Humanities (Eric Alexander) / Summer 2019
  • Graph-Theoretic Properties of Networks Formed by Distance Thresholds (David Liben-Nowell) / Winter 2020 

Descriptions of the projects are below:

Improving Educational Software Through Data Analytics and Machine Learning (Anna Rafferty)

Learning analytics and machine learning models are increasingly used in educational technologies, facilitating the creation of adaptive systems that dynamically provide students with practice in areas where they struggle. These systems can be engaging and motivating for students, and they offer opportunities to better understand human learning by logging fine-grained information about students’ problem-solving choices.

In my research, I focus on building systems that can make inferences about understanding based on the data collected in educational technologies, as well as creating self-improving systems that learn from previous students’ interactions to help future students learn more. These systems typically involve machine learning, computational models of human learning, or both.

In the research this summer, you’ll be focusing on one of two general questions:

  1. How can we further leverage reinforcement learning within an online algebra tutor in order to both improve all students’ learning experiences and more quickly personalize the experience for an individual learner?
  2. How can we automatically identify common student misunderstandings based on their actions in an open-ended learning environment, such as a virtual chemistry lab or a neuroscience circuitry building game?

This research is part of a larger project in which we’ve been developing machine learning techniques for inferring misunderstandings based on students’ choices. Part of that research has developed a machine learning algorithm that we’ve deployed in a free online algebra tutor where learners can practice equation solving. Question 1 involves extending that algorithm as well as considering additional ways adaptation could be incorporated into the tutor. Question 2 involves both developing new learning analytics tools and exploring ways to automatically figure out the space of possible misunderstandings from data, which would enable our techniques to be more easily used in new educational software.

In addition to addressing one of these research questions, you may spend some time helping out on other parts of the larger project to gain a better understanding of the overall goals. The exact focus of your work will depend on how things evolve over the course of the spring and summer and your own interests and background. This project combines cognitive science, machine learning, and statistics, so it’s a great opportunity to see the multidisciplinary nature of computer science!

CS 201 and CS 202, including especially the material on probability, are appropriate preparation for this project. Experience with statistics, electives like computational models of cognition, AI or data mining, or probability at a more advanced level than CS 202 would be a definite plus, but none of these are required. Ideally, you would do a 1-credit independent study with me in the spring to start reading relevant background material and learn about the data you’ll be working with.

Diabetes Data Science (Dave Musicant and Sarah Meerts)

This project is about using tools from data science to learn more about biological and temporal rhythms in the blood glucose and insulin needs of people with diabetes. We’ve recently been able to get access to data from a nonprofit called Tidepool, which is one of the few places that consolidates data from a person’s continuous glucose monitor, blood glucose meter, and insulin pump. Thousands of people have been donating their data to Tidepool for a couple of years, and Tidepool is willing to share the data with researchers who are interested in asking questions of the data.

The range of possible questions is vast: Does blood glucose vary systematically as a function of weekday vs. weekend? Do predictable patterns of blood glucose emerge in relation to maintaining insulin delivery (pump changes)? Do women with menstrual cycles notice distinct fluctuations in blood glucose based on menstrual cycle phase?

Some published studies describe the current state of understanding of these questions, but the huge amount of data and the number of subjects sharing data via Tidepool are unique and can be used to help identify approaches for managing diabetes. Additionally, Tidepool is at the forefront of developing technology to enable a pump to “talk” to a continuous glucose monitor, thereby automating some aspects of diabetes management.

Outcomes from this project are likely to be incorporated into Tidepool’s technology and be enormously helpful to people with diabetes.

Students working on the project are likely to do the following:

  1. Handle raw processing of the data, from whatever format it comes in, to something we can use.
  2. Think through what data science algorithms make sense to apply to the data, in order to help answer questions we have.
  3. Consult existing literature, generate new questions, and see if we can answer them.
  4. If time permits and if the data allows it, produce a visualization system to help users see patterns in the data.

Students who sign on to the project must be available to participate in an independent study reading group during the spring of 2018 to come up to speed on time series data science approaches and biomedical literature on periodicity in blood glucose levels.

Visual Text Analysis for the Digital Humanities (Eric Alexander)

Researchers have access to more digital text than ever before, from websites to newspaper articles to books. This availability offers the potential to answer sweeping questions about the evolution of literature and language at scales previously unheard of — so long as we can actually make sense of all the data we have.

Research in natural language processing has provided us with powerful statistical techniques to model the behavior of text within a large collection of documents. However, using and interpreting such models can present a challenge to those whose expertise lies outside the field of statistics. In my research, I design, develop, and evaluate visual techniques for putting statistical text analysis into the hands of researchers with a wide variety of backgrounds.

This summer I hope to hire two students to take part in ongoing research in this vein. In particular, I am looking for students to help me with the following projects:

Character sonic signatures

Characters within a work of literature are sometimes given distinct voices, not just in the types of words they use but in the sound of their speech. For instance, in Shakespeare’s Othello, the titular character is sometimes described as having slower, rounder speech than the quick, staccato dialogue of the villain Iago.

We have found that we can detect such differences between characters algorithmically, and we have built visualizations to help explore those differences. This summer I am looking for students to help me expand upon these visualizations and work towards comparing authors’ abilities to create distinguishable character voices.

Evaluation of visual text summaries

There are many ways of representing statistical summaries of document collections, from bar charts to word clouds. The efficacy of these techniques for actually conveying a summary to a reader is hotly debated. In this project, students will investigate the ability of human participants to retrieve summary information from different visualization types.

The precise trajectory of these projects is open-ended, to be steered by the particular backgrounds and interests of the students involved. Potentially useful experience includes familiarity with statistical models, machine learning, visualization, or the digital humanities (though none of these are required!). Accepted students would be required to take a 1-credit independent study during the spring to prepare for their project.

Graph-Theoretic Properties of Networks Formed by Distance Thresholds (David Liben-Nowell) Winter 2020

NOTE: you must be on leave from Carleton in Winter 2020 to participate, and you must live in Northfield (but cannot live on campus as an on-leave student).

Graphs that model interesting and important real-world phenomena can be defined in many different ways, even when they’re modeling the “same” real-world setting. (For example, a graph representing “the” Twitter social network might be based on users following, liking, or retweeting each other — and perhaps requiring the number of likes/retweets to exceed some threshold, in one direction or in both.)

Recently I have become interested in graphs that are derived according to a kind of distance-based model: imagine a set of points S, some distance function D defining the distance between any two points in S, and some threshold T. We can then define a network where an edge (u,v) exists if and only if D(u,v) < T. Such graphs have been studied for points in the plane, where they are typically called “unit disk graphs”, as a model of communication networks (perhaps T is the range of a radio transmitter/receiver).
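To make the definition concrete, here is a minimal Python sketch of how such a graph could be built. It is an illustration only, not code from the project: the function name threshold_graph, the sample points, and the choice of Euclidean distance for D are all assumptions made for the example.

```python
import math
from itertools import combinations


def threshold_graph(points, dist, threshold):
    """Return an adjacency dict in which u and v are joined
    if and only if dist(u, v) < threshold."""
    adjacency = {p: set() for p in points}
    for u, v in combinations(points, 2):
        if dist(u, v) < threshold:
            adjacency[u].add(v)
            adjacency[v].add(u)
    return adjacency


# Hypothetical points in the plane (the set S); math.dist plays the role of D
# (Euclidean distance, available in Python 3.8+), and T = 1.0.
S = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.8), (3.0, 3.0)]
graph = threshold_graph(S, math.dist, threshold=1.0)

for point, neighbors in graph.items():
    print(point, "->", sorted(neighbors))
```

In this toy example the first three points are pairwise closer than T, so they form a triangle, while (3.0, 3.0) has no neighbors. Checking every pair takes time quadratic in the number of points, which is fine for a sketch but would likely need a spatial index for large point sets.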

This research project will be devoted to studying the graph-theoretic properties of graphs defined in this way — including, for example, various centrality measures that have typically been used to analyze social networks.
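As one illustration of what a centrality measure computes, the following Python sketch implements two standard ones, degree centrality and closeness centrality, on a graph stored as an adjacency dict. The function names and the three-node toy graph are hypothetical, and nothing here is meant to suggest which measures the project will actually focus on.

```python
from collections import deque


def degree_centrality(adjacency):
    """Fraction of the other nodes that each node is adjacent to.
    Assumes the graph has at least two nodes."""
    n = len(adjacency)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adjacency.items()}


def closeness_centrality(adjacency):
    """Number of reachable nodes divided by the total hop distance to them;
    0.0 for a node that cannot reach any other node."""
    result = {}
    for source in adjacency:
        # Breadth-first search gives hop distances from the source.
        hops = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in hops:
                    hops[v] = hops[u] + 1
                    queue.append(v)
        total = sum(hops.values())
        result[source] = (len(hops) - 1) / total if total else 0.0
    return result


# Hypothetical toy graph: the path a - b - c.
toy = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(degree_centrality(toy))
print(closeness_centrality(toy))
```

On the path a - b - c, the middle node b scores highest under both measures, which matches the intuition that it sits "between" the other two.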

For context: I have become interested in these graphs through work on graph-theoretic modeling of spoken-word recognition that I have been carrying out in collaboration with Julia Strand, in the Carleton Psychology department. Julia and I have been doing some work related to the “phonological network,” a graph in which words are the nodes (all English words, say), and there are edges between nodes whose pronunciations are within a single phonological edit of each other. This network is based on a different kind of distance than the geometric distances studied in the “unit disk graphs” mentioned above, but there are some ways in which the idea may be quite similar. I want to understand more about which graph properties are simply consequences of the way that we’ve defined edges, and which are more surprising.
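A small Python sketch may help show how a phonological network fits the same distance-threshold pattern. It makes several assumptions that are not from the project itself: a "single phonological edit" is taken to mean one substitution, insertion, or deletion of a phoneme; words with identical pronunciations are not linked; and the five-word lexicon with its rough phoneme transcriptions is invented purely for illustration.

```python
from itertools import combinations


def within_one_edit(a, b):
    """True if phoneme tuples a and b differ by exactly one
    substitution, insertion, or deletion."""
    if a == b:
        return False  # simplifying choice: identical pronunciations are not linked
    if len(a) == len(b):
        # Same length: exactly one position may differ (a substitution).
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) != 1:
        return False
    shorter, longer = sorted((a, b), key=len)
    # Deleting one phoneme from the longer sequence must yield the shorter one.
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))


# Hypothetical mini-lexicon with illustrative phoneme transcriptions.
pronunciations = {
    "cat": ("K", "AE", "T"),
    "bat": ("B", "AE", "T"),
    "at": ("AE", "T"),
    "cast": ("K", "AE", "S", "T"),
    "dog": ("D", "AO", "G"),
}

edges = {
    frozenset((u, v))
    for u, v in combinations(pronunciations, 2)
    if within_one_edit(pronunciations[u], pronunciations[v])
}

for edge in sorted(tuple(sorted(e)) for e in edges):
    print(edge)
```

Here "cat" is linked to "bat", "at", and "cast", "bat" is also linked to "at", and "dog" is isolated. In the terms used above, D is an edit distance on pronunciations and the edge rule corresponds to a threshold of one edit.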