Imagine that you're the Chief of Surgery for a major hospital. Dozens of patients are being operated on every day, using a variety of techniques. If you wanted to find out which surgery was most effective on patients with similar problems, you might start with their medical records, looking at things like post-surgery pain levels and how long each patient needed to recover. With the right analysis, you might see clear benefits to a particular approach, thus improving the lives of thousands of future patients.
There's only one problem: even though the medical records are electronic, most of the information you need doesn't come from a check box or the answer to a multiple-choice question. It's in the free text fields where your doctors have written notes that are unstandardized and idiosyncratic to say the least. You could use machine learning to extract that information, but you need labeled training data, or data that's been tagged with meaningful information to learn from. Especially in fields like medicine, crowdsourcing training data at this scale is impossible.
So you decide to call Stephen Bach (he's joining Brown CS as Assistant Professor this fall) and ask him about Snorkel, which he describes as a training data creation and management system for machine learning.
"In order to find the subtle patterns in a data set that humans don't see," he explains, "we need to label the data, but if humans are doing the labeling, it's expensive and challenging, so Snorkel uses weakly supervised machine learning to shrink the human effort needed. You can't ask someone to write a million labels, but one of our developers can write ten functions that describe the task and label the data programmatically. The results are often noisy and contradictory, but Snorkel models the labeling functions statistically, to learn which are better than others, and then applies them to train a high-quality end model."
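To make the idea of labeling data programmatically more concrete, here is a minimal sketch in plain Python. It does not use Snorkel's actual API. The task (flagging post-surgery complications in free-text notes), the function names, and the sample notes are all hypothetical, and a simple majority vote stands in for the statistical model that Snorkel uses to learn which labeling functions are more reliable.

```python
import re
from collections import Counter

# Hypothetical label values for an illustrative task.
ABSTAIN, NO_COMPLICATION, COMPLICATION = -1, 0, 1

# Each labeling function encodes one noisy heuristic and may abstain.
def lf_mentions_infection(note):
    return COMPLICATION if re.search(r"\binfect", note, re.I) else ABSTAIN

def lf_readmitted(note):
    return COMPLICATION if "readmitted" in note.lower() else ABSTAIN

def lf_uneventful(note):
    pattern = r"\b(uneventful|no complications)\b"
    return NO_COMPLICATION if re.search(pattern, note, re.I) else ABSTAIN

LABELING_FUNCTIONS = [lf_mentions_infection, lf_readmitted, lf_uneventful]

def majority_vote(note, lfs=LABELING_FUNCTIONS):
    """Combine noisy, possibly contradictory votes into one label.

    Assumption: a plain majority vote is used here for brevity; it is a
    stand-in, not the statistical modeling Snorkel performs.
    """
    votes = [v for v in (lf(note) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN  # no heuristic fired
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # tie between labels: leave unlabeled
    return ranked[0][0]

# Made-up notes, for illustration only.
notes = [
    "Recovery uneventful, discharged day 2.",
    "Wound infection noted; patient readmitted.",
    "Pt reports mild pain.",
    "No complications, but later readmitted for infection.",
]
labels = [majority_vote(n) for n in notes]
print(labels)  # [0, 1, -1, 1]
```

The last note shows the kind of contradiction Stephen describes: one heuristic votes "no complication" while two vote "complication." A majority vote treats every function as equally trustworthy, which is exactly the limitation that modeling the labeling functions statistically is meant to address.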
Only two years old, Snorkel is already being used by dozens of organizations around the world, ranging from Alibaba Group and Intel to DARPA, Stanford Medicine, and the District Attorney of New York County. With all of these uses for the software, it's tempting to ask Stephen what's in store for the project, which data sets are next in line to give up their secrets. But let's start at the beginning.
"I was always interested in computers," Stephen says, "but even starting college, I didn't understand what computer science was about, how much breadth there was and how many new, open questions." He'd planned to be an economist, but couldn't fit an economics course in his first-semester schedule at Georgetown University, so he took a CS class to fulfill one of his requirements. And got hooked.
Spurred on by an initial interest in artificial intelligence, Stephen began doing undergraduate research, including multiple summer fellowships. He recommends it to anyone: "They really gave me a taste of how different research was from coursework, the whole idea of learning how to approach open questions." One of his early goals was using automation to reduce the tedious, routine tasks that burden researchers. And as his fascination with machine learning grew, he became interested in the idea of helping people make discoveries, providing tools for use by scientists in other domains.
Working toward his doctorate at the University of Maryland, Stephen became interested in one of AI's oldest debates: logical methods such as programming and first-order logic versus statistics, which asks computers to deal with the uncertainty and ambiguity of data. "Scientists have gone back and forth because of the trade-offs between the two," he explains. "We're great at symbolic reasoning but not at estimating uncertainty, and I wanted to help people by combining the advantages of both approaches."
Hearing that, it's easy to see how Snorkel, which Stephen co-created with colleagues at Stanford University while doing postdoctoral research, was a logical next step. Will his work with the project continue here at Brown CS? "Absolutely. I want to keep going, and I want to work with people in other fields, not just computer scientists. Brown's culture is such a great place for that. And I want to invest time in validating applications, making sure they're really useful. Students at Brown are going to have their own real-world applications for weakly supervised machine learning, and they're the kind of people I really want to work with."
This fall, Stephen will be teaching a seminar (CSCI 2952-C Learning with Limited Labeled Data), and in the spring, CSCI 1420 Machine Learning, which explores the theory and practice of statistical machine learning, focusing on computational methods for supervised and unsupervised data analysis. "I'm eager to hit the ground running," he says, "with my teaching and with working on some new problems shaped by what we learned on the Snorkel project. We've used machine reading to find facts in text, but now I'm interested in new ways to extend our techniques. Code-as-supervision in areas like video and computer vision is a big, open challenge. I also want to look at managing the process of having models train other models for different tasks."
What he finds exciting about machine learning, Stephen says, is that it can have tremendous indirect effects: "Doctors making better surgical decisions is a great example of how we can all benefit from ML discoveries. Just managing the increasing complexity of our lives is going to become incredibly important, like smarter personal assistants that will manage our schedules and summarize all the information we don't have time to look at. The joy of machine learning is that we can help people live more pleasant and productive lives if we do it right."
For more information, click the link that follows to contact Brown CS Communication Outreach Specialist Jesse C. Polhemus.