Thesis Proposal


"Improved Constituency Parsing"

Do Kook Choe

Thursday, September 29, 2016 at 11:00 A.M.

Room 506 (CIT 5th Floor)

A natural language parser recovers the latent grammatical structure of sentences. In many natural language processing (NLP) applications, sentences are parsed first, and the parses are fed along with their sentences to downstream NLP systems. For example, Google parses the entire web and applies a series of NLP programs to index it, so the quality of search results depends on the quality of the parses.

Parsing is difficult because sentences are ambiguous: a sentence can have different syntactic structures depending on its meaning. For example, the sentence "Eugene wears a bow tie with polka dots" can have very different meanings depending on what "with polka dots" modifies. It is natural for us humans to infer that "with polka dots" modifies "a bow tie" because common sense tells us that "with polka dots" rarely, if ever, describes the action "wears." Computers, however, lack common sense and must learn such relationships from large amounts of text by looking for statistical patterns.

We explore three ways of improving parsing in our work: creating training data of high-quality parses using paraphrases; a model combination technique applied to n-best parsing; and a generative reranker based on a language model. Our methods improve very strong parsing models, and our reranker achieves state-of-the-art parsing performance on the standard Penn Treebank dataset.

In recent years, long short-term memory networks (LSTMs) have shown tremendous success in sequence prediction tasks such as language modeling and machine translation. We propose a greedy LSTM generative parser that treats parsing as a sequence task, with two goals in mind: fast inference through a series of greedy decisions and accurate inference through global LSTM features.
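
To illustrate what treating parsing as a sequence task can mean, one common approach is to linearize a bracketed parse tree into a flat sequence of symbols that a sequence model could predict one at a time. The sketch below is only an illustration under stated assumptions: the tuple-based tree encoding, the bracket token format, and the function names linearize and delinearize are hypothetical and are not taken from the proposed parser.

```python
# Illustrative sketch only: the tree encoding and token format are assumptions.
# A tree is either a word (str) or a tuple (label, child1, child2, ...).
# Words are assumed not to begin with "(" or ")".

def linearize(tree):
    """Flatten a nested tree into a sequence of bracket symbols and words."""
    if isinstance(tree, str):
        return [tree]
    label, *children = tree
    tokens = ["(" + label]          # open the constituent
    for child in children:
        tokens.extend(linearize(child))
    tokens.append(")" + label)      # close the constituent
    return tokens


def delinearize(tokens):
    """Rebuild the nested tree from a well-formed linearized sequence."""
    stack = [[]]    # each entry collects the children of an open constituent
    labels = []     # labels of the constituents that are currently open
    for tok in tokens:
        if tok.startswith("("):
            labels.append(tok[1:])
            stack.append([])
        elif tok.startswith(")"):
            children = stack.pop()
            stack[-1].append((labels.pop(), *children))
        else:
            stack[-1].append(tok)
    return stack[0][0]


# The reading in which "with polka dots" modifies "a bow tie".
tree = ("S",
        ("NP", "Eugene"),
        ("VP", "wears",
         ("NP",
          ("NP", "a", "bow", "tie"),
          ("PP", "with", ("NP", "polka", "dots")))))

sequence = linearize(tree)
print(" ".join(sequence))
```

Under this encoding, a parser only has to emit the next symbol of the sequence at each step, which is what makes greedy left-to-right decoding possible; delinearize then recovers the tree from the predicted symbols.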

Host: Eugene Charniak