CSCI2222
Interpretability of Language Models
Spring 2025
An outstanding problem in the field of deep learning is that neural networks are black boxes: no one understands how they work "under the hood." This class focuses on the extent to which we can understand and control such models, and reviews evidence that neural networks learn interpretable processes in their weights without direct supervision to do so. There will be an emphasis on interpretability of language models (LMs), although interpretability of vision models will also be touched upon when relevant. Topics covered will include causal interventions, probing, input attribution methods, and more recent topics in mechanistic interpretability. The history and motivations for studying interpretability will also be discussed. The class will combine reading and discussing research papers with coding assignments, culminating in a final project on a topic covered in class.
Instructor's Permission Required
Instructor(s):
Meets: TTh 10:30am-11:50am in CIT Center (Thomas Watson CIT) 241
Exam: If an exam is scheduled for the final exam period, it will be held:
Max Seats: 25 (Full)
CRN: 28206