## Computer Science 241

## Statistical Models in Natural-Language Processing

Professor: Eugene Charniak

Chief Cook and Bottle Washer: Matt Lease

Time: Monday 3:00 - 5:30

Room: CIT 345

Text: Foundations of Statistical Natural Language Processing by
Christopher Manning and Hinrich Schütze, MIT Press, 1999

This course covers statistical methods for learning a natural language
and applying the knowledge to specific tasks. Topics include: entropy
and cross entropy of a language, hidden Markov models, Viterbi
algorithm, forward-backward algorithm, trigram models, part-of-speech
tagging, probabilistic context-free parsing, inside-outside algorithm,
learning probabilistic context-free grammars, statistical models of
syntactic disambiguation, statistical anaphora resolution, deriving
semantic word classes from statistical properties, and word-sense
disambiguation.

Grading is based primarily on the project and secondarily on the two
40-minute in-class exams. Class participation will also be considered.
The project is done in groups of 2-4 students.
All groups work on the same project. Collaboration between groups is allowed
(indeed encouraged), up to, but not including, sharing of code (unless
explicitly authorized in class).
This semester's project looks at the problem of clustering sentences.

#### Class Schedule, Fall 2006

All chapter and page references are to the course text.

| Week of | Reading Assignments |
| --- | --- |
| Sept 10 | Ch 14 |
| Sept 17 | Ch 2 (minus 2.1.10, 2.2.4), Ch 9 to 9.3.1 |
| Sept 24 | Ch 9, Ch 10 |
| Oct 1 | Ch 10 |
| Oct 8 | Ch 11 |
| Oct 15 | Exam, Ch 12 |
| Oct 22 | Ch 6 |
| Oct 29 | Ch 7 |
| Nov 5 | Ch 8 |
| Nov 12 | Exam |
| Nov 19 | No Class |
| Nov 26 | Project Discussions |

#### Project Assignments

Computer files for the project can be found in /pro/dpg/cs241/.

#### Sept 18

Read in the stripped representations. Find the number of words that occur 5 or more times. Which sentence includes the ??? occurrence of the word "stock"? Implement single-link clustering for word vectors. How well do the days of the week cluster?
The stripped representation for WSJ sections 2-21 can be found in /pro/dpg/cs241/data/train.strip.
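
As a rough illustration of two of the steps above (counting frequent words and single-link clustering), here is a minimal Python sketch. It is not the course's reference solution. It assumes each sentence has already been read in as a list of word tokens (the actual layout of train.strip is not described here), and the function names, the cosine distance, and the toy vectors are all hypothetical choices made for the example; how the word vectors should be built is left open by the assignment.

```python
from collections import Counter
from itertools import combinations
import math


def frequent_words(sentences, min_count=5):
    """Return the set of words that occur at least min_count times.

    sentences: iterable of token lists (assumed already tokenized).
    """
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}


def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length vectors (an assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def single_link(vectors, num_clusters, distance=cosine_distance):
    """Agglomerative single-link clustering.

    vectors: dict mapping word -> list of floats.
    Starts with one cluster per word and repeatedly merges the two clusters
    whose closest members are nearest, until num_clusters remain.
    """
    clusters = [{word} for word in sorted(vectors)]
    while len(clusters) > num_clusters:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Single link: cluster distance is the minimum pairwise distance.
            d = min(distance(vectors[a], vectors[b])
                    for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        clusters[i] |= clusters[j]  # i < j, so deleting j leaves i valid
        del clusters[j]
    return clusters


# Toy, made-up vectors purely for illustration (not derived from the WSJ data).
toy_vectors = {
    "monday": [1.0, 0.1, 0.0],
    "tuesday": [0.9, 0.2, 0.0],
    "stock": [0.0, 1.0, 0.3],
    "bond": [0.1, 0.9, 0.4],
}
clusters = single_link(toy_vectors, num_clusters=2)
print([sorted(c) for c in clusters])
```

On the toy vectors the two day names end up in one cluster and the two finance words in the other. This brute-force version recomputes all pairwise distances on every merge, so it only suits small vocabularies such as the words that pass the frequency cutoff.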

