My latest research is on accelerating the processes by which humans teach computers.
That includes engineering training data, with methods like programmatic weak supervision,
as well as learning to generalize from fewer examples, with methods like zero-shot and few-shot learning.
Often, our group's methods focus on exploiting high-level, symbolic or otherwise semantically
meaningful domain knowledge.
Lately I'm particularly excited by the ways these directions intersect.
Applications of our work include information extraction, image understanding,
scientific discovery, and other areas of data science.
- New preprint on learning to compose soft prompts. We show that foundation models like CLIP can be fine-tuned to be better at composing concepts into novel combinations.
- PromptSource will appear as a demo at ACL 2022!
- The T0 paper is accepted to ICLR 2022!
- Our work providing a method and theoretical analysis for learning from multiple noisy partial labelers is accepted to AISTATS 2022!
- TAGLETS is now open source and accepted to MLSys 2022!
- T0 is now available for download, as is the dataset P3.
I lead the BATS machine learning research group. In the tradition of groups like
DAGS, BATS stands for "Bach's Awesome Team of Students."
Master's and Undergrad Students
- Ross Briden
- Chace Hayhurst
- George Hu
- Top Piriyakulkij
- Gaurav Sharma
- Tom Liu

Alumni (Role, Year, Next Position)
- Jessica Dai (Undergrad, 2021, Ph.D. at UC Berkeley)
- Tiffany Ding (Undergrad + Master's, 2021, Ph.D. at UC Berkeley)
- Amy Pu (Undergrad, 2021, Google)
- Dylan Sam (Undergrad, 2021, Ph.D. at Carnegie Mellon)
- Berkan Hiziroglu (Master's, 2020, Amazon)
- Angie Kim (Undergrad, 2020, The New York Times)
- Esteban Safranchik (Undergrad, 2020, Ph.D. at U. Washington)
Snorkel is a framework for creating
training labels for machine learning from noisy sources. It uses statistical methods
to combine weak supervision sources, such as heuristic rules and task-related data sets
(i.e., distant supervision), which are far less expensive than hand labeling data.
With the resulting estimated labels, users can train many kinds of state-of-the-art
models. Snorkel is used at technology companies like Google, at research labs, and
at agencies like the FDA.
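As a toy illustration of the weak supervision idea (not the Snorkel API — the names and heuristics below are invented for this sketch), two labeling functions vote on each example or abstain, and their noisy votes are combined. Here the combination is a simple majority vote; Snorkel instead fits a statistical label model that estimates each source's accuracy.

```python
# Toy sketch of programmatic weak supervision (illustrative only; not the
# Snorkel API). Labeling functions vote SPAM (1), NOT_SPAM (0), or ABSTAIN (-1).
ABSTAIN, NOT_SPAM, SPAM = -1, 0, 1

def lf_contains_link(text):
    # Heuristic rule: messages with links are probably spam.
    return SPAM if "http://" in text else ABSTAIN

def lf_long_message(text):
    # Another weak heuristic: longer messages tend to be legitimate.
    return NOT_SPAM if len(text.split()) > 5 else ABSTAIN

def majority_vote(text, lfs):
    # Combine the non-abstaining votes; Snorkel would instead learn
    # accuracy-weighted estimates with a probabilistic label model.
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    return max(set(votes), key=votes.count) if votes else ABSTAIN

lfs = [lf_contains_link, lf_long_message]
print(majority_vote("click here http://spam.example now", lfs))  # 1 (SPAM)
```

The resulting estimated labels can then be used to train any downstream discriminative model, which often generalizes beyond the heuristics themselves.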
Probabilistic soft logic is a formalism for
building statistical models over relational data like knowledge bases and social
networks. PSL programs define hinge-loss MRFs, a type of probabilistic graphical
model that admits fast, convex optimization for MAP inference, which makes them
very scalable. Researchers around the world have used PSL for bioinformatics,
computational social science, natural language processing, information extraction,
and computer vision.
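To make the hinge-loss idea concrete, here is a minimal sketch (the rule, weights, and values are hypothetical, and this is plain Python rather than PSL syntax): a logical rule such as Friends(A,B) ∧ Smokes(A) → Smokes(B) is relaxed into a squared hinge potential over soft truth values in [0, 1], and MAP inference minimizes the weighted sum of potentials — a convex problem, solved here by projected gradient descent.

```python
# Minimal sketch of MAP inference in a hinge-loss MRF (hypothetical example,
# not PSL syntax). The rule Friends(A,B) & Smokes(A) -> Smokes(B) has
# "distance to satisfaction" max(0, friends + smokes_a - smokes_b - 1);
# squaring it yields a convex, differentiable potential.
friends, smokes_a = 1.0, 0.9   # observed soft truth values
w_rule, w_prior = 2.0, 0.5     # rule weight and a weak negative prior on smoking

def grad(s_b):
    hinge = max(0.0, friends + smokes_a - s_b - 1.0)
    # d/ds_b of  w_rule * hinge^2 + w_prior * s_b^2
    return -2.0 * w_rule * hinge + 2.0 * w_prior * s_b

s_b = 0.5                          # free variable: Smokes(B)
for _ in range(500):               # projected gradient descent
    s_b -= 0.05 * grad(s_b)
    s_b = min(1.0, max(0.0, s_b))  # project back onto [0, 1]

print(round(s_b, 2))  # 0.72: the rule pushes s_b up, the prior pulls it down
```

Because every potential is a (squared) hinge of a linear function, the full objective stays convex no matter how many ground rules are instantiated, which is what makes MAP inference scale to large relational domains.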
In spring semesters, I teach machine learning.
In fall semesters, I usually teach a seminar on
learning with limited labeled data (CSCI 2952-C).