"A Stochastic Memoizer for Sequence Data"

Frank Wood '07, Assistant Professor, Statistics, Columbia University

Monday, November 23, 2009 at 4:00 P.M.

Room 368 (CIT 3rd Floor)

We propose an unbounded-depth, hierarchical, Bayesian nonparametric model for discrete sequence data. This model can be estimated from a single training sequence, yet shares statistical strength between subsequent symbol predictive distributions in such a way that predictive performance generalizes well. The model builds on a specific parameterization of an unbounded-depth hierarchical Pitman-Yor process. We introduce analytic marginalization steps (using coagulation operators) to reduce this model to one that can be represented in time and space linear in the length of the training sequence. We show how to perform inference in such a model without truncation approximation and introduce the fragmentation operators necessary to do predictive inference. We demonstrate the sequence memoizer by using it as a language model, achieving state-of-the-art results.

Frank Wood is an Assistant Professor of Statistics at Columbia University. Prior to joining the faculty at Columbia, he held a postdoc position at the Gatsby Computational Neuroscience Unit of University College London under Dr. Yee Whye Teh while simultaneously serving as a consultant to Stan James. Frank received his PhD from Brown University, where he was advised by Dr. Michael Black.

Host: Michael Black