Information for prospective Ph.D. students is available here.
	 
	
	
My latest research is on improving the processes by which humans teach and instruct computers. That includes learning to generalize from fewer examples, with methods like zero-shot and few-shot learning, and engineering training data, with methods like synthetic data generation and programmatic weak supervision. I measure that improvement as the reduction in effort required by people, especially non-computer scientists in specialized, technical domains, to get computers to do what they want. Applications of our group's work include information extraction, image understanding, scientific discovery, and other areas of data science.
	 
	
	
		News
		 
	
	
	 
	
	BATS
	
	
		I lead the BATS machine learning research group. In the tradition of groups like
		LINQS and
		DAGS, BATS stands for "Bach's Awesome Team
		of Students."
	 
	 
		
	Ph.D. Students
	
	
	
	
	Master's and Undergrad Students
	
		- Yik Siu Chan
 
		- Charlie Duong
 
		- Zhenke Liu
 
		- David Ning
 
	 
	
	Ph.D. and Postdoc Alumni (Role, Year, Next Position)
	
	Undergrad and Master's Alumni (Role, Year, Next Position)
	
		- Elise Carman (Undergrad + Master's, 2025, GEICO)
 
		- Jacob Li (Master's, 2025, Ph.D. at MIT)
 
		- Justin Long (Undergrad + Master's, 2025, Meta)
 
		- Justin Phillips (Master's, 2025, Deutsche Bank)
 
		- Sarah Liu (Undergrad + Master's, 2024, Microsoft)
 
		- Oliver Nan (Master's, 2024, Cohere for AI)
 
		- Kevin Scroggins (Master's, 2024, Ph.D. at University of Florida)
 
		- Avi Trost (Undergrad, 2024, Ph.D. at University of Wisconsin)
 
		- Andy Delworth (Undergrad, 2023, Hive AI)
 
		- Chace Hayhurst (Undergrad + Master's, 2023, MIT Lincoln Laboratory)
 
		- Andrew Yuan (Undergrad, 2023, IMC Trading)
 
		- Ross Briden (Undergrad, 2022, Affirm)
 
		- George Hu (Undergrad, 2022, Master's at Stanford)
 
		- Tom Liu (Undergrad, 2022, Scale AI)
 
		- Top Piriyakulkij (Undergrad, 2022, Ph.D. at Cornell)
 
		- Gaurav Sharma (Master's, 2022, MathWorks)
 
		- Jessica Dai (Undergrad, 2021, Ph.D. at UC Berkeley)
 
		- Tiffany Ding (Undergrad + Master's, 2021, Ph.D. at UC Berkeley)
 
		- Amy Pu (Undergrad, 2021, Google)
 
		- Dylan Sam (Undergrad, 2021, Ph.D. at Carnegie Mellon)
 
		- Berkan Hiziroglu (Master's, 2020, Amazon)
 
		- Angie Kim (Undergrad, 2020, The New York Times)
 
		- Esteban Safranchik (Undergrad, 2020, Ph.D. at U. Washington)
 
	 
	
	 
	
	Projects
	
	
	
	
		T0 is a family of large language
		models fine-tuned for zero-shot task generalization. In collaboration with many
		others in the BigScience
		Workshop, we showed that by fine-tuning T5 on many variations of prompts for
		supervised tasks, the resulting model could generalize to completely new tasks
		like natural language inference. All the models are publicly available, and
		 T0++ is probably the
		best one to use for new tasks. We also built an IDE and repository for prompt
		development called
		PromptSource
		(ACL demo paper) that contains
		over 2,000 prompted tasks.
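As a rough sketch of how to try T0++ yourself, assuming the Hugging Face transformers library (the checkpoint is bigscience/T0pp; at roughly 11B parameters it needs substantial memory):

# Query T0++ with a zero-shot prompt via Hugging Face transformers.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("bigscience/T0pp")
model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0pp")

# Pose an unseen task directly in natural language.
inputs = tokenizer.encode(
    "Is this review positive or negative? Review: the skillet is sturdy and heats evenly.",
    return_tensors="pt",
)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))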
	 
	
	
	
		ZSL-KG is a framework for 
		zero-shot learning with common sense knowledge graphs. ZSL-KG learns to 
		identify classes described as nodes in a knowledge graph. We have applied it to
both text and image tasks. ZSL-KG uses a novel graph neural network encoder called a transformer graph convolutional network (TrGCN). TrGCN increases the expressivity of traditional inductive graph neural networks by using small transformers to aggregate each node's neighborhood.
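Here is a minimal PyTorch sketch of the TrGCN aggregation idea (illustrative names and dimensions, not the ZSL-KG API): a small transformer contextualizes a node's neighbors before pooling, rather than a fixed mean or max.

import torch
import torch.nn as nn

class TransformerAggregator(nn.Module):
    """Aggregate neighbor embeddings with a small transformer (TrGCN-style sketch)."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, node, neighbors):
        # node: (batch, dim); neighbors: (batch, num_neighbors, dim)
        contextualized = self.encoder(neighbors)   # neighbors attend to each other
        pooled = contextualized.mean(dim=1)        # order-invariant summary
        return torch.relu(self.proj(torch.cat([node, pooled], dim=-1)))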
	 
	
	
	
		TAGLETS is a system for
		automatic semi-supervised learning with auxiliary data. It automatically exploits 
		all available data, including labeled, unlabeled, and auxiliary data, for a given
		task to produce a single classifier. TAGLETS extracts relevant auxiliary data for
		training using SCADs, a database of auxiliary data aligned with concepts in
		ConceptNet, and passes all relevant data to an ensemble of user-specified modules,
		which are trained and distilled into a final classifier.
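A generic sketch of that final distillation step (illustrative, not TAGLETS's actual code): the ensemble's averaged soft predictions supervise a single student classifier.

import torch
import torch.nn.functional as F

def distill_step(student, ensemble, x, optimizer, T=2.0):
    # Average the ensemble modules' softened predictions into one target.
    with torch.no_grad():
        teacher = torch.stack([F.softmax(m(x) / T, dim=-1) for m in ensemble]).mean(0)
    # Train the student to match the teacher distribution (KL divergence).
    loss = F.kl_div(F.log_softmax(student(x) / T, dim=-1), teacher,
                    reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()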
	 
	
	
	
WISER is a framework for programmatic weak supervision in sequence-tagging domains like named entity recognition. Users write tagging rules that tag sequence elements and linking rules that guide how those elements should be grouped into coherent spans. We introduced this approach to avoid the common problem of "candidate generation," in which users first have to heuristically convert their problem from sequence tagging to classification. Now users can supervise the tagging process with rules directly!
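For a feel of the two rule types, here is an illustrative sketch (not WISER's exact API): a tagging rule votes on per-token tags, and a linking rule votes on whether adjacent tokens belong to the same span.

DISEASE_TERMS = {"carcinoma", "melanoma", "lymphoma"}

def tag_disease_terms(tokens):
    # Tagging rule: vote "I" (inside a disease mention) on known terms, else abstain.
    return ["I" if t.lower() in DISEASE_TERMS else "ABS" for t in tokens]

def link_hyphenated(tokens):
    # Linking rule: votes[i] concerns the link between tokens i-1 and i.
    votes = ["ABS"]  # no link precedes the first token
    for i in range(1, len(tokens)):
        joined = tokens[i] == "-" or tokens[i - 1] == "-"
        votes.append("SAME" if joined else "ABS")
    return votes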
	 
	
	
	
Snorkel is a framework for creating noisy training labels for machine learning. It uses statistical methods to combine weak supervision sources, such as heuristic rules and task-related data sets (i.e., distant supervision), which are far less expensive to use than hand labeling data. With the resulting estimated labels, users can train many kinds of state-of-the-art models. Snorkel is used at numerous technology companies like Google, at research labs, and at agencies like the FDA.
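For example, in Snorkel's Python API a heuristic rule is a labeling function that votes on a label or abstains, and the label model combines many such noisy votes (the sentiment task and data frame below are assumptions for illustration):

from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

@labeling_function()
def lf_contains_great(x):
    # Heuristic rule: "great" suggests a positive review; otherwise abstain.
    return POSITIVE if "great" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_contains_refund(x):
    # Heuristic rule: mentions of refunds suggest a negative review.
    return NEGATIVE if "refund" in x.text.lower() else ABSTAIN

# Apply the labeling functions to a pandas DataFrame df with a "text" column,
# then estimate probabilistic training labels from the noisy votes:
# L_train = PandasLFApplier([lf_contains_great, lf_contains_refund]).apply(df)
# label_model = LabelModel(cardinality=2)
# label_model.fit(L_train)
# probs = label_model.predict_proba(L_train)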
	 
	
	
	
		Probabilistic soft logic is a formalism for
		building statistical models over relational data like knowledge bases and social
		networks. PSL programs define hinge-loss MRFs, a type of probabilistic graphical
		model that admits fast, convex optimization for MAP inference, which makes them
		very scalable. Researchers around the world have used PSL for bioinformatics,
		computational social science, natural language processing, information extraction,
		and computer vision.
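As a toy illustration of the formalism (not the PSL software itself): MAP inference minimizes a weighted sum of hinge-loss potentials over [0, 1]-valued variables, which is a convex problem. The rule weight and evidence values below are made up.

from scipy.optimize import minimize

# Rule (weight 2, squared): Friends(a, b) & Votes(a, p) -> Votes(b, p),
# relaxed to the hinge max(0, friends_ab + votes_a - votes_b - 1).
friends_ab, votes_a, w = 1.0, 0.9, 2.0

def energy(x):
    votes_b = x[0]
    hinge = max(0.0, friends_ab + votes_a - votes_b - 1.0)
    prior = 0.1 * votes_b ** 2          # weak prior toward !Votes(b, p)
    return w * hinge ** 2 + prior

# Convex minimization over [0, 1] gives the MAP value of Votes(b, p).
result = minimize(energy, x0=[0.5], bounds=[(0.0, 1.0)])
print(result.x[0])   # pulled up toward votes_a by the rule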
	 
	 
	
	 
	
	Teaching
	
	
		In spring semesters, I teach machine learning
		(CSCI 1420).
	 
	
		In fall semesters, I usually teach a seminar on 
		learning with limited labeled data (CSCI 2952-C).
	 
	 