CSCI2222

Interpretability of Language Models

Spring 2025

An outstanding problem in the field of deep learning is that neural networks are black boxes: no one understands how they work “under the hood.” This class focuses on the extent to which we can understand and control such models, and reviews evidence that neural networks learn interpretable processes in their weights without direct supervision to do so. The emphasis will be on interpretability of language models (LMs), though interpretability of vision models will also be touched upon when relevant. Topics will include causal interventions, probing, input attribution methods, and more recent topics in mechanistic interpretability. The history of and motivations for studying interpretability will also be discussed. The class will combine reading and discussing research papers with coding assignments, culminating in a final project on a topic covered in class.

Instructor's Permission Required

Instructor(s):
Meets:
TTh 10:30am-11:50am in CIT Center (Thomas Watson CIT) 241
Exam:

If an exam is scheduled for the final exam period, it will be held:
Exam Date: 10-MAY-2025  Exam Time: 2:00 PM  Exam Group: 09

Max Seats: 25 (Full)
CRN: 28206