Below you will find a list of faculty who will be conducting research during the summer of 2023 and are looking for research students.

  • Algorithmic Methods for Analyzing Tumor Evolutionary Trees (Layla Oesper)
  • Evaluating the Perceptual and Cognitive Limitations of Text Visualizations (Eric Alexander)
  • Improving Educational Assessment with Machine Learning (Anna Rafferty)
  • Exploring Dynamics of Policy Change in Online Communities (Sneha Narayan)
  • Approximating Bitvectors using the Disjunction of Exemplars (David Liben-Nowell and Anna Rafferty)

Descriptions of the projects are below:


Algorithmic Methods for Analyzing Tumor Evolutionary Trees (Layla Oesper)

2-3 Students, up to 8 weeks, dates are yet to be determined, may be combined with SLAI

Cancer is a disease resulting from the accumulation of genomic alterations that occur during an individual’s lifetime and cause the uncontrolled growth of a collection of cells into a tumor. These mutations occur as part of an evolutionary process that may have begun decades before a patient’s diagnosis. A better understanding of the history of a tumor’s evolution over time may yield important insight into how and why tumors develop as well as which mutations drive their growth. While recent algorithmic progress has led to improved inference of tumor evolutionary histories, this is still a very challenging task.

This summer, students in my group will be part of an ongoing initiative to develop algorithmic methods for analyzing these tumor evolutionary histories (represented as trees) and better understanding what the space of these histories looks like. The exact details of what students will be working on will depend on their interests, their background, and how the project progresses prior to the start of summer. Aspects of the project that students are likely to work on include:

  1. Extend, modify or analyze existing distance measures designed for tumor evolutionary trees.
  2. Perform mathematical and computational analysis of the space of tumor evolutionary trees. 
  3. Create libraries containing code for common tasks when it comes to analysis of tumor evolution.

Students working on these tasks may gain experience working with large datasets, using large multi-core machines, and designing computational experiments, and will become familiar with some aspects of DNA sequencing data and analysis.

Ideally, students should be available to participate in an independent study during the spring of 2023 to read papers, familiarize themselves with related tools/concepts, and have discussions to begin planning the project. Applicants should have completed at a minimum CS 201 by the end of Spring term 2023. Students who have taken Computational Biology, Bioinformatics or Algorithms are also strongly encouraged to apply. No specific biology background is required, just an interest in applying computational techniques to important biological problems.


Improving Educational Assessment with Machine Learning (Anna Rafferty)

1-2 students, up to 8 weeks, starting shortly after the end of finals, likely cannot be combined with SLAI

Every year, a significant amount of school time for elementary, middle, and high school students is spent on assessment, typically in the form of standardized tests. Ideally, this time on assessment should be useful to students and teachers: if assessment results provided teachers with timely information about where their students were struggling, then teachers could target those areas during classroom instruction. But too often, assessment results fail to meet those goals. Results are often not very specific about where students are struggling, with each assessment giving only a little information about students’ knowledge on a wide variety of topics. Further, scoring of assessments is often constrained to use only information about whether a student is correct or incorrect, ignoring factors like which incorrect answer a student chose and limiting the types of questions that are asked. Machine learning (ML) approaches have been proposed to try to address some of these challenges. ML models can capture patterns across student responses, potentially providing finer-grained information about their knowledge compared to traditional scoring approaches. ML models also have the potential to predict student knowledge based on the strategies they take to solve problems, opening up the possibility of more open-ended and authentic assessments. Yet, there are still plenty of open problems in applying ML to challenges in educational assessment!

In this summer research project, we’ll be extending existing and recently developed ML models for working with either standardized testing data or more open-ended data from students working on virtual chemistry labs. Some possible questions for us to investigate include examining how to use the ML models to adaptively select questions to ask a student as they’re being assessed; exploring whether we can automatically identify productive and unproductive strategies in a virtual chemistry lab; and appraising how fairly the ML models make predictions about different groups of students. This project will build on research I’ve been conducting in collaboration with researchers at several education non-profits.

Through this project, you’ll learn about modern machine learning models and widely used libraries for data science and machine learning as well as how to manage and work with large-scale data. Students should have completed or be currently enrolled in CS 201 and have experience with Python, as well as some comfort with math (at the level of CS 202 or linear algebra). You should be curious about how to apply machine learning to problems in education and interested in thinking deeply about how we can responsibly work with educational data to address real needs of teachers and students, but you do not need prior coursework in education or educational assessment. Ideally, students will be available for a 1-credit independent study in the spring to read related papers and begin planning for the summer.


Exploring Dynamics of Policy Change in Online Communities (Sneha Narayan)

2 students, up to 10 weeks. Exact dates TBD. Likely cannot be combined with SLAI.

Some large platforms like Twitter and Facebook use teams of paid content moderators to flag posts, manage harassment, and block illegal content, based on guidelines provided by the companies hosting these platforms. Other communities (like subreddits, Wikipedia editions, and free-software projects) do moderation a bit differently — typically, these communities develop their own policies around what kind of content is permitted and what the norms of engagement are, and usually have moderators or administrators who emerge from the community to assist in both developing and enforcing these rules.

An interesting case study here is Reddit. Many different kinds of subreddits exist, each with their own kinds of rules and norms. Reddit provides a template for community rules as a starting point, but many subreddits go on to modify these and develop their own policies over time.

This brings up a lot of interesting questions, such as: how and why might a community change its policies from those suggested by the host platform? What kinds of events prompt policy changes? When in a community’s lifecycle do policies typically get implemented, or changed? Do policy changes relate to community size, goals, or other attributes? Do communities generally find default codes of conduct/rule systems suitable for their purposes, or do they often need to be modified for specific contexts? Understanding these dynamics could help characterize the kinds of issues that communities navigate that lead them to articulate rules, and help create better tools and shared practices for moderators and community managers.

While this should give you a broad sense of the line of inquiry I’m interested in pursuing, the specific questions pursued will likely also be shaped by the interests and background of students who work with me. An entry point into investigating these questions across a population of communities is the PushShift Reddit API, which is a massive research archive containing historical data on posts and comments made on Reddit. Students working with me can expect to get experience constructing and analyzing large datasets using a variety of methods in an interdisciplinary context, and writing up findings for an academic audience. As this project develops, I also hope to supplement any findings from this dataset with interviews of Reddit moderators, so conducting user research and qualitatively coding interview data might be a different kind of work you do down the line. 

My hope is to have any students working with me this summer also participate in a 1-2 credit independent study during Spring 2023; this is to help you get the background that you would need to start work in earnest over the summer. Applicants should have at a minimum taken CS 201. Having taken Social Computing (or planning to take it in Spring 2023) is a plus. Other relevant coursework includes Software Design, HCI, and classes outside the department in statistics and/or the social sciences. Additionally, if you’ve been very active in an online community or have performed moderation work yourself, please let me know about it in your application.


Evaluating the Perceptual and Cognitive Limitations of Text Visualizations (Eric Alexander)

I will be looking for 1-2 students to join me in this project.

It may seem silly to say “A picture is worth a thousand words” at the beginning of a research description containing exclusively text, but in many ways, it’s true. The human perceptual system allows us to make complex judgments about enormous amounts of data in mere fractions of a second, which is why so many in-depth prose arguments and sophisticated statistical analyses are skimmed over in favor of looking at the accompanying figure. As tools for communicating information to a wide audience, data visualizations are exceptionally efficient. 

However, powerful though it may be, our perceptual system is also fraught with bias and inaccuracy. We evolved to be able to spot berries amongst dense leaves, not to precisely compare the positions and sizes of red and green glyphs in a scatter plot. Aspects of our vision that help us find the berries might distort our impressions of data in a visualization, making it important to understand and quantify these sorts of perceptual quirks if we want to convey accurate information to our readers.

This summer, students in my group will be working to evaluate and properly harness our ability to accurately perceive data encoded in visualizations containing text (e.g., word clouds). Text visualizations share some of the same perceptual oddities as other visualizations, with additional cognitive challenges associated with how a word’s meaning might skew the way we see or remember it. The precise trajectory of this summer’s work is open-ended, and will depend on the skills and interests of those involved. Some of the things that students may work on include:

  1. Designing experimental conditions that allow us to isolate and measure specific aspects of how online participants perceive and interpret text visualizations.
  2. Building on the development of new tools to dynamically generate visualizations that meet these constraints.
  3. Performing the statistical analysis of participant performance.

Students working on this project are likely to gain experience in web development, designing and analyzing perceptual experiments, and visualizing complex data.

Accepted students will work for 8-10 weeks during the summer of 2023, though the precise dates may be flexible. Ideally, students should be available to participate in an independent study during the spring of 2023 to read papers, familiarize themselves with related tools/concepts, and have discussions to begin planning the project.  Applicants should have completed at a minimum CS 201 by the end of Spring term 2023.  Students who have taken Data Visualization, Software Design, or classes in perceptual psychology are strongly encouraged to apply. No specific background in visualization or text analysis is required, just an interest in how the choices we make as designers can affect the information our readers take away.


Approximating Bitvectors using the Disjunction of Exemplars (David Liben-Nowell and Anna Rafferty)

2 students, 8 weeks, beginning shortly after the end of finals, likely cannot be combined with SLAI

In the course of solving other problems, one sometimes comes across interesting theory problems, and this research project is an example of that. We had data from a bunch of students about the reasons that they took intro to computer science, and we wanted to identify what patterns of reasons to take the course were most common. For example, some of the reasons for taking the course might all be “exploratory” in nature, and there might be some students whose reasons were primarily these exploratory reasons. But we didn’t want to have to decide on the groupings of reasons by hand, and we recognized that each student might be taking the course for both exploratory reasons and some other set of reasons. After considering the problem for a bit and reading a few papers, we decided it would be interesting to consider how to solve this problem as “optimally” as possible, and thus this research project arose.

We can phrase the problem more generally as one in which we consider a set of individuals (e.g., the intro CS students above), and a set of binary characteristics that each individual might or might not have (e.g., the reasons the students are taking the course). We can represent this population as a collection of bit vectors, one per individual. For example, the bit vector
[1 0 1 1 0 0 0 0 0 0 0]
corresponds to an individual who has characteristics #1, #3, and #4, but none of the others. This research project will seek algorithms to approximate the individuals in this population as the combination (that is, the disjunction, or bitwise OR) of “exemplar” bit vectors: for example, if the exemplars are
[1 0 1 0 0 0 0 0 0 0 0] = A
[0 1 1 0 0 0 0 0 1 0 0] = B
[1 1 1 1 1 1 0 0 1 0 0] = C
then we can approximate the above individual as
[1 0 1 1 0 0 0 0 0 0 0] = the individual
[1 1 1 0 0 0 0 0 1 0 0] = A + B
which is right in 8 characteristics (8 places) and wrong in the other 3.
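
To make the arithmetic above concrete, here is a minimal Python sketch that recomputes the example. It assumes that "A + B" means the bitwise OR of the exemplars and that quality is measured by the number of positions where the approximation agrees with the individual. The function names (disjunction, agreement, best_subset) are hypothetical, and the brute-force search is only an illustration for tiny inputs, not the algorithm this project aims to develop.

```python
from itertools import combinations

def disjunction(vectors, length):
    """Bitwise OR of a collection of 0/1 lists, each of the given length."""
    result = [0] * length
    for vec in vectors:
        result = [a | b for a, b in zip(result, vec)]
    return result

def agreement(individual, approximation):
    """Number of positions where the two bit vectors hold the same value."""
    return sum(1 for a, b in zip(individual, approximation) if a == b)

def best_subset(individual, exemplars):
    """Try every subset of exemplars (exponential time; illustration only).

    Returns (names of the chosen exemplars, number of agreeing positions).
    Ties are broken in favor of the subset found first.
    """
    length = len(individual)
    names = list(exemplars)
    # Start from the empty subset, whose disjunction is the all-zero vector.
    best = ((), agreement(individual, [0] * length))
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            approx = disjunction([exemplars[n] for n in subset], length)
            score = agreement(individual, approx)
            if score > best[1]:
                best = (subset, score)
    return best

# The individual and exemplars from the example above.
individual = [1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0]
exemplars = {
    "A": [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "B": [0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0],
    "C": [1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0],
}

a_or_b = disjunction([exemplars["A"], exemplars["B"]], len(individual))
print(a_or_b)                          # [1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0]
print(agreement(individual, a_or_b))   # 8
print(best_subset(individual, exemplars))
```

Under these assumptions, A + B agrees with the individual in 8 of 11 places, matching the count above. For this particular individual, the brute-force search finds that A alone does better, agreeing in 10 places. Trying every subset takes time exponential in the number of exemplars, and it does not address how to choose the exemplars in the first place.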

This project will be fundamentally theoretical and algorithmic in nature: we will be digging through the algorithmic details in published papers about related problems, and seeking to develop new algorithms (or adapt existing algorithms) to solve this problem as best we can. If all goes well, we will mostly be writing proofs rather than writing code. While we think this problem has interesting applications (including the one described above), the focus will be on developing techniques to solve the abstract problem rather than details of the applications.

Required preparation: CS 252. It would be great if you happen to have also taken either CS 254 or CS 352, but these are not requirements.