Tech Report CS-95-28

Parsing with Context-Free Grammars and Word Statistics

Eugene Charniak

September 1995


We present a language model in which the probability of a sentence is the sum of the probabilities of its individual parses. These parse probabilities are calculated using a probabilistic context-free grammar (PCFG) plus statistics on individual words and how they fit into parses. We have used the model to improve syntactic disambiguation. After training on Wall Street Journal (WSJ) text, we tested on about 200 WSJ sentences restricted to the 5400 most common words from our training data. We observed a 41% reduction in bracket-crossing errors compared to the performance of our PCFG without the word statistics.
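
As a rough illustration of the first sentence of the abstract, the sketch below computes a sentence probability as the sum of its parse probabilities under a plain PCFG, where each parse probability is the product of the probabilities of the rules it uses. The grammar, the rule probabilities, and the names RULE_PROB, PARSES, parse_probability, and sentence_probability are all hypothetical and chosen for illustration only. The word statistics that the report adds to the PCFG are not modeled here, and lexical rules are omitted for brevity.

```python
from math import prod

# Hypothetical toy rule probabilities P(rule | left-hand side).
# These numbers are invented for illustration; they are not from the report.
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("VP", "PP")): 0.3,
    ("NP", ("NP", "PP")): 0.2,
    ("NP", ("N",)): 0.8,
    ("PP", ("P", "NP")): 1.0,
}

# Two parses of the tag sequence "N V N P N", each listed as the rules it uses.
# They differ only in where the prepositional phrase attaches.
VP_ATTACH = [
    ("S", ("NP", "VP")), ("NP", ("N",)),
    ("VP", ("VP", "PP")), ("VP", ("V", "NP")), ("NP", ("N",)),
    ("PP", ("P", "NP")), ("NP", ("N",)),
]
NP_ATTACH = [
    ("S", ("NP", "VP")), ("NP", ("N",)),
    ("VP", ("V", "NP")), ("NP", ("NP", "PP")), ("NP", ("N",)),
    ("PP", ("P", "NP")), ("NP", ("N",)),
]
PARSES = {"vp_attach": VP_ATTACH, "np_attach": NP_ATTACH}


def parse_probability(rules):
    """Probability of one parse: the product of its rule probabilities."""
    return prod(RULE_PROB[rule] for rule in rules)


def sentence_probability(parses):
    """Probability of the sentence: the sum over all of its parses."""
    return sum(parse_probability(rules) for rules in parses.values())


def best_parse(parses):
    """Disambiguation: pick the name of the most probable parse."""
    return max(parses, key=lambda name: parse_probability(parses[name]))


if __name__ == "__main__":
    for name, rules in PARSES.items():
        print(name, parse_probability(rules))
    print("sentence", sentence_probability(PARSES))
    print("best", best_parse(PARSES))
```

In this toy setting, disambiguation amounts to choosing the parse with the highest probability. A model of the kind described in the abstract would additionally condition these probabilities on the individual words involved.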
