Brown CS - CS241

Computer Science 241

Statistical Models in Natural-Language Processing

Professor: Eugene Charniak
Chief Cook and Bottle Washer: Matt Lease
Time: Monday 3:00 - 5:30
Mail Group
Room: CIT 345
Text: Foundations of Statistical Natural Language Processing by Christopher Manning and Hinrich Schütze, MIT Press, 1999

This course covers statistical methods for learning a natural language and applying the knowledge to specific tasks. Topics include: entropy and cross entropy of a language, hidden Markov models, Viterbi algorithm, forward-backward algorithm, trigram models, part-of-speech tagging, probabilistic context-free parsing, inside-outside algorithm, learning probabilistic context-free grammars, statistical models of syntactic disambiguation, statistical anaphora resolution, deriving semantic word classes from statistical properties, and word-sense disambiguation.

Grading is based primarily on the project, and secondarily on the two in-class, 40-minute exams. Class participation will also be considered. The project is done in groups of 2-4 students. All groups work on the same project. Collaboration between groups is allowed (indeed encouraged), up to, but not including, sharing of code (unless explicitly authorized in class). This semester's project looks at the problem of clustering sentences.

Class Schedule, Fall 2006

All chapter and page references are to the course text.

Week of	Reading Assignments
Sept 10	Ch 14
Sept 17	Ch 2 (minus 2.1.10, 2.2.4), Ch 9 to 9.3.1
Sept 24	Ch 9, Ch 10
Oct 1	Ch 10
Oct 8	Ch 11
Oct 15	Exam, Ch 12
Oct 22	Ch 6
Oct 29	Ch 7
Nov 5	Ch 8
Nov 12	Exam
Nov 19	No Class
Nov 26	Project Discussions

Project Assignments

Computer files for the project can be found in /pro/dpg/cs241/.

Sept 18

Read in the stripped representations. Find the number of words that occur 5 or more times. Which sentence includes the ??? occurrence of the word "stock"? Implement single-link clustering for word vectors. How well do the days of the week cluster? The stripped representation for WSJ sections 2-21 can be found in /pro/dpg/cs241/data/train.strip.
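
As a rough illustration of two mechanical pieces of this assignment, here is a minimal Python sketch of frequency counting and single-link agglomerative clustering. It is only a sketch under stated assumptions: the format of train.strip is not described here, so the code assumes one sentence per string with whitespace-separated tokens. The function names, the Euclidean distance, and the toy word vectors are all hypothetical choices made for illustration, not part of the assignment.

```python
from collections import Counter
import math


def count_frequent_words(sentences, min_count=5):
    """Count distinct words that occur at least min_count times.

    Assumes each sentence is a string of whitespace-separated tokens
    (an assumption; the real file format may differ).
    """
    counts = Counter(word for sentence in sentences for word in sentence.split())
    return sum(1 for c in counts.values() if c >= min_count)


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def single_link(vectors, num_clusters, distance=euclidean):
    """Naive single-link agglomerative clustering.

    vectors: dict mapping word -> tuple of floats (hypothetical representation).
    Repeatedly merges the two clusters whose closest pair of members is
    nearest, until num_clusters clusters remain. Returns a list of sets.
    """
    clusters = [{word} for word in vectors]
    while len(clusters) > num_clusters:
        best = None  # (distance, i, j) of the closest pair of clusters so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: cluster distance is the minimum member distance.
                d = min(
                    distance(vectors[a], vectors[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters


if __name__ == "__main__":
    # Toy vectors, invented purely for illustration.
    toy = {
        "monday": (1.0, 0.0),
        "tuesday": (0.9, 0.1),
        "stock": (0.0, 1.0),
        "bond": (0.1, 0.9),
    }
    print(single_link(toy, 2))
```

This naive version recomputes all pairwise distances on every merge, which is fine for a handful of words such as the days of the week but would be too slow for a full vocabulary.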

Eugene Charniak