Joining the LUNAR Lab

If you are interested in research in NLP, we'd love to work with you! The steps for joining the lab are:

  1. Start working on one of the "starter projects" below
  2. Start coming to our lab meetings: NLP Group meetings will be held on Mondays from 4:30-5:30 in CIT 241

The point of the starter project is to give you something to start working on right now that will familiarize you with the types of questions we tend to ask and the tools/methods we tend to use in trying to answer them. These projects are meant to be open-ended, so jump in, see which aspects pique your curiosity, and roll with them. If you notice something that seems interesting, speak up in lab meeting and let us know. If you get stuck, speak up in lab meeting and ask for help! If you come to lab meeting regularly, you will see what types of problems we are working on and how new research questions pop up routinely in the process of our work. When this happens, feel free to ditch your starter project and volunteer yourself to tackle one of these other open questions instead. Or, if you are excited about the direction you are taking your starter project, you are free to stick with it and see how it evolves.

If you have zero background in NLP, that's fine! There is no better way to learn than by doing: interest and perseverance are easily as valuable as (or more valuable than) prerequisite courses. :) Pick something that you find interesting and don't be afraid to ask for help in lab meetings. We'll be happy to help you get off the ground. If you aren't sure which starter project to pick (or if you don't even understand what the projects are asking), please just ask.

Broadly speaking, we work on problems concerning semantics and pragmatics, including questions of how to ground language to the real world--i.e. the physical and social contexts in which the language occurs. Here are several projects that can help you get started in each area. Know that questions from one area quickly bleed into the others--that is what makes the area so exciting!

Semantics: Sentence Representations and Pretraining

A big open question right now is whether we can build "general purpose" representations of sentences, akin to the word embeddings that are pervasive in NLP. Building these representations usually involves some kind of transfer learning: i.e. training a model to do one task (the "pretraining" task), and then using the representations that it learned to do a different task (the "target" task), with the hope that some aspect of what is "useful" in a representation is shared across many tasks.

There are a lot of factors which affect transfer learning in NLP. We spent a whole summer trying to understand the process in terms of two main research questions:

  1. Which pretraining tasks help most for which target tasks?
  2. Are there differences among pretraining tasks in terms of what they encode about language?

These are big open questions, not something on which you should expect to get a clean answer right away. Here are some resources to help you form and test hypotheses:

  • Our code base for pairing pretraining tasks with different target tasks.
  • Our data set for evaluating representations' understanding of specific linguistic phenomena.
  • A paper describing some initial experiments aimed at Question 1: [here]
  • A few papers describing some initial experiments aimed at Question 2: [here] [here] [here]
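
As a concrete picture of the pretrain-then-transfer setup described above, here is a minimal sketch in plain Python. It is not our code base: the tiny hand-written embedding table is only a stand-in for a pretrained encoder, and the sentiment-style target task, the sentences, and all function names are made up for illustration. The encoder stays frozen, and only a small logistic-regression "probe" is trained on the target task.

```python
import math

# Hypothetical stand-in for a pretrained encoder: a tiny hand-written
# embedding table. A real setup would load representations learned on a
# pretraining task instead.
EMBEDDINGS = {
    "great": [1.0, 0.2],
    "awful": [-1.0, 0.1],
    "movie": [0.0, 1.0],
    "film": [0.1, 0.9],
}


def encode(sentence):
    """Frozen sentence representation: the average of the known word vectors."""
    vecs = [EMBEDDINGS[w] for w in sentence.lower().split() if w in EMBEDDINGS]
    if not vecs:
        return [0.0, 0.0]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]


def train_probe(features, labels, epochs=200, lr=0.5):
    """Train a logistic-regression probe with SGD; the encoder is never updated."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - y  # derivative of the log loss with respect to z
            weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
            bias -= lr * grad
    return weights, bias


def predict(weights, bias, x):
    """Return 1 if the probe scores the example as positive, else 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


# Toy target task (made up): is the sentence positive (1) or negative (0)?
train_sentences = [
    "the movie was great",
    "the movie was awful",
    "a great film",
    "an awful film",
]
train_labels = [1, 0, 1, 0]

features = [encode(s) for s in train_sentences]
weights, bias = train_probe(features, train_labels)
predictions = [predict(weights, bias, x) for x in features]
print(predictions)
```

Swapping in a different encoder while keeping the probe fixed is one simple way to start comparing what different pretraining tasks make easy for a given target task.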

Pragmatics: Speaker-Listener Models

Understanding language requires more than just knowing the meanings of words. E.g. if I ask "Are you going to the party?" and you reply "I have to work", I can easily infer that the answer is "no" even though nothing about the literal meaning of "I have to work" entails "no". People are able to draw a ton of inferences by reasoning about the state, goals, and beliefs of the people they are communicating with.

These types of inferences are frequently called implicatures. You can read about the linguistic theories surrounding implicatures here. You can read about a computational implementation of these linguistic theories here and then play with a basic demo of the model (look at the scalar implicature demo).

If you want to do more linguistics, try to find examples of inferences or situations that break the model. E.g. what can you change about the speaker's beliefs, the listener's beliefs, or the meaning of language which will lead the speaker and listener to misunderstand each other? Alternatively, try to find examples of new inferences that might be well explained by this model. E.g. could we use this to represent inferences about metaphor? What about prototypicality? If you want to do more programming, try to reimplement the framework yourself, using Python, PyTorch, or Pyro. Try to extend the model so that it can support arbitrary utterances (not just hard-coded sentences).
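
If you want a starting point for a reimplementation, here is a minimal plain-Python sketch of a scalar implicature. It follows the general recipe of the Rational Speech Acts (RSA) framework, which is one common way to formalize speaker-listener reasoning; whether it matches the linked model in every detail is an assumption. The scenario (how many of 3 items have some property), the uniform prior, and all names are made up for illustration.

```python
# States: how many of 3 items have the property (a made-up toy scenario).
STATES = [0, 1, 2, 3]
UTTERANCES = ["none", "some", "all"]

# Literal semantics: the set of states in which each utterance is true.
MEANING = {
    "none": {0},
    "some": {1, 2, 3},  # literally, "some" is compatible with "all"
    "all": {3},
}


def normalize(weights):
    """Turn non-negative weights into a probability distribution."""
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}


def literal_listener(utterance):
    """P(state | utterance): uniform over the states where the utterance is true."""
    return normalize({s: 1.0 if s in MEANING[utterance] else 0.0 for s in STATES})


def speaker(state, alpha=1.0):
    """P(utterance | state): prefers utterances that lead the literal listener to the state."""
    return normalize(
        {u: literal_listener(u)[state] ** alpha for u in UTTERANCES}
    )


def pragmatic_listener(utterance, alpha=1.0):
    """P(state | utterance): reasons about which states make the speaker say this."""
    return normalize({s: speaker(s, alpha)[utterance] for s in STATES})


literal = literal_listener("some")
pragmatic = pragmatic_listener("some")
print("literal   P(all 3 | 'some') =", literal[3])
print("pragmatic P(all 3 | 'some') =", pragmatic[3])
```

The literal listener treats "some" as equally compatible with 1, 2, or 3, while the pragmatic listener shifts probability away from 3, because a speaker who meant "all" would more likely have said "all". Changing MEANING, the prior, or alpha is a quick way to look for cases that break the model.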

Computational Social Science: Understanding Political Language

Political language provides an especially rich testbed for understanding nuance in language. There is a world of difference between, e.g. "family reunification" and "chain migration", even though both refer to the same concept! Recognizing the style and connotation of what is said provides many layers of information beyond the literal denotation of the words. (If you like philosophy, read more about formal discussions related to connotation vs. denotation, usually called sense vs. reference, here.)

If you are interested in data science and computational social science, start by downloading one or both of the datasets below, then play around with them and look for patterns.

  1. Tweets from politicians all over the world. List of people appearing in the data.
  2. Text and vote data on all bills introduced in Congress, going way, way back. I downloaded this all a while ago and haven't gotten to do anything with it yet.

Here are a variety of data preprocessing and visualization toolkits you might want to check out too: NLTK, spaCy, scikit-learn, matplotlib, D3, seaborn.

A few questions for starting out:

  • Can you find pairs of phrases that refer to the same thing but are used drastically differently (e.g. chain migration/family reunification, estate tax/death tax)? Here is a paper I had a while back that could serve as a starting point.
  • Can you find patterns or trends in the way "short titles" of congressional bills are used? Can you train a model that generates "short titles" from long titles? (I've been wanting to work on this for years and haven't had time, I'd love someone to adopt it.)
  • Download both datasets and look for similarities/differences between language use in official documents and language use on Twitter.
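
For the first question above, one simple way to start is to compare how often each phrase is used by two groups of speakers. The sketch below uses only the Python standard library and computes a smoothed log ratio of relative phrase frequencies. The two groups, the example sentences, and the function names are made up for illustration and do not reflect the format of the actual datasets.

```python
import math
from collections import Counter

PHRASES = ["chain migration", "family reunification", "death tax", "estate tax"]

# Made-up example texts standing in for tweets from two groups of speakers.
group_a_docs = ["We must end chain migration now.", "Repeal the death tax."]
group_b_docs = [
    "Family reunification keeps families together.",
    "The estate tax funds schools.",
]


def phrase_counts(docs, phrases):
    """Count case-insensitive occurrences of each phrase across the documents."""
    counts = Counter()
    for doc in docs:
        text = doc.lower()
        for phrase in phrases:
            counts[phrase] += text.count(phrase)
    return counts


def log_ratio(counts_a, counts_b, phrases, smoothing=0.5):
    """Smoothed log ratio of relative frequencies.

    Positive scores mean the phrase is more characteristic of group A,
    negative scores mean it is more characteristic of group B.
    """
    total_a = sum(counts_a[p] for p in phrases) + smoothing * len(phrases)
    total_b = sum(counts_b[p] for p in phrases) + smoothing * len(phrases)
    scores = {}
    for p in phrases:
        freq_a = (counts_a[p] + smoothing) / total_a
        freq_b = (counts_b[p] + smoothing) / total_b
        scores[p] = math.log(freq_a / freq_b)
    return scores


scores = log_ratio(
    phrase_counts(group_a_docs, PHRASES),
    phrase_counts(group_b_docs, PHRASES),
    PHRASES,
)
for phrase, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{phrase:22s} {score:+.2f}")
```

Pairs whose members land at opposite ends of this ranking are candidates for "same referent, different framing". On real data you would want proper tokenization (e.g. with NLTK or spaCy) and far more text per group before trusting the scores.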