This assignment is due at 11:59PM on Wednesday, November 21, 2018.

Neural Sentence Classification: Assignment 3

In this assignment, you’ll be experimenting with RNN-based sequence models for sentence classification. In class and in the assigned readings, we’ve discussed various ways to model language in order to construct word representations. A key purpose of forming such representations is to improve the performance of “downstream” tasks such as question answering, entailment, named entity recognition, and sentiment analysis.

Here, we will focus on a task that relies on classifying a sentence by encoding it into a latent representation; sentiment analysis is a neat classification problem to explore in this setting.

In this homework, you will be implementing a sequence model to classify sentences according to their sentiment score. For this task, you’ll use an RNN to encode a sequence of word embeddings and then use that encoding to classify the sentence.

Model overview

At the core of your model is a recurrent neural network (RNN), a type of neural network designed to maintain state across a sequence of inputs. At its most basic, it is a feedforward network that takes in a previous state \(h_{t-1}\) and an input \(x_t\) and computes a new state \(h_t\). In practice, gated RNN variants such as the LSTM (Hochreiter & Schmidhuber, 1997) and the GRU (Cho et al., 2014) are often more effective.
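In its simplest (Elman) form, this update can be written as \(h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)\), where \(W_{xh}\), \(W_{hh}\), and \(b_h\) are learned parameters; the LSTM and GRU replace this single update with gated variants.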

[Figure: An RNN \(A\) unrolled over \(t\) timesteps. Source: Understanding LSTMs.]

Once an RNN has been applied to a sequence, the final hidden state is a vector that, in theory, has captured information from an arbitrary number of previous time-steps. We can treat that hidden state as an encoding of the entire sequence. Our sentiment analysis task is a classification problem: given a sentence, classify it as either negative/neutral or positive. Using the sentence encoding given by the RNN, a vanilla feedforward neural network can then produce a vector of class scores for sentiment classification.
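To make this encode-then-classify structure concrete, here is a minimal sketch in PyTorch. This is not the stencil’s SentimentNetwork: the class name, parameter names, and layer sizes below are placeholders, and it ignores padding (discussed under Resources).

    import torch
    import torch.nn as nn

    class SentenceClassifier(nn.Module):
        """Encode a batch of word-embedding sequences with an LSTM, then classify."""

        def __init__(self, embed_dim, hidden_size, num_classes):
            super().__init__()
            self.rnn = nn.LSTM(embed_dim, hidden_size, batch_first=True)
            self.classifier = nn.Sequential(
                nn.Linear(hidden_size, hidden_size),
                nn.ReLU(),
                nn.Linear(hidden_size, num_classes),
            )

        def forward(self, embedded):
            # embedded: (batch, seq_len, embed_dim) word embeddings
            _, (h_n, _) = self.rnn(embedded)
            # h_n[-1] is the final hidden state of the last LSTM layer: one vector per sentence
            return self.classifier(h_n[-1])  # (batch, num_classes) class scores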

Word representations

How to represent a sequence of words is an important choice when designing neural sequence models. For this assignment, you’ll be examining how different word embeddings affect your model’s performance. GloVe is a commonly-used vector representation of words. As discussed in class, a more recent approach is to learn “Deep Contextualized Word Representations” (Peters et al., 2018) using a bi-directional language model that’s pre-trained on a large corpus of text. Central to this paper is ELMo, which aims to provide a semi-supervised solution to downstream NLP tasks.

For this assignment, you will implement your sentiment classifier to operate on sequences of embeddings. You’ll examine how the choice of embedding affects model performance by evaluating your model on GloVe, ELMo, and concatenated GloVe+ELMo embeddings. In addition, you’ll want to examine the effectiveness of randomly-initialized embeddings as a baseline against which to compare. You may find that the SST-2 classification task has certain quirks, and a naive baseline can be useful for sorting those out.
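To make the GloVe+ELMo option concrete, here is a hedged sketch of concatenating the two along the feature dimension. The random tensors stand in for real vectors, and the 300/1024 sizes are just the usual dimensions of 300d GloVe and ELMo; check what the pre-processed data files actually provide.

    import torch
    import torch.nn as nn

    # Stand-ins for real data: a (vocab_size, 300) GloVe matrix loaded from disk
    # and a precomputed (batch, max_len, 1024) ELMo tensor for the same batch.
    vocab_size, glove_dim, elmo_dim = 10000, 300, 1024
    glove_matrix = torch.randn(vocab_size, glove_dim)      # placeholder for loaded GloVe
    glove = nn.Embedding.from_pretrained(glove_matrix, freeze=True)

    token_ids = torch.randint(vocab_size, (2, 7))          # (batch=2, max_len=7) word ids
    elmo_vecs = torch.randn(2, 7, elmo_dim)                # placeholder for ELMo output

    glove_vecs = glove(token_ids)                          # (2, 7, 300)
    combined = torch.cat([glove_vecs, elmo_vecs], dim=-1)  # (2, 7, 1324)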

Your task:

  1. Implement the SentimentNetwork model as well as the modules for performing embedding lookups.
  2. Evaluate performance on different embeddings: ELMo, GloVe, ELMo+GloVe, as well as randomly-initialized embedding vectors.
  3. Test different hyperparameters (such as RNN hidden size, dense layer sizes, number of epochs).
  4. If you observe overfitting, experiment with a few regularization techniques (e.g. dropout in your dense layers; see the sketch after this list).
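For the regularization point above, a minimal and entirely optional sketch: nn.Dropout between dense layers, with hypothetical sizes and dropout rate. Remember that dropout is only active in training mode, so call model.eval() before computing dev/test metrics.

    import torch.nn as nn

    hidden_size, num_classes, drop_p = 128, 2, 0.5  # hypothetical sizes and dropout rate

    # Dropout between the dense layers; nn.Dropout is a no-op in eval mode.
    classifier = nn.Sequential(
        nn.Linear(hidden_size, hidden_size),
        nn.ReLU(),
        nn.Dropout(p=drop_p),
        nn.Linear(hidden_size, num_classes),
    )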

For your writeup, include the following:

  1. Plot loss curves for both the training and dev datasets. Report final evaluation metrics (accuracy and a confusion matrix) for each embedding type; a snippet after this list shows one way to compute them.
  2. As you tune hyperparameters, pick two distinct hypotheses you’d like to test and run an experiment for each. For example, “When using ELMo embeddings, how do dropout and hidden size interact?” or “How does the effect of dropout differ in the ELMo vs. no-ELMo setting?”.
  3. If applicable, describe any overfitting observed and how you addressed it.
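For the metrics in item 1, one convenient option, assuming scikit-learn is available in your environment (it is not necessarily in requirements.txt), looks roughly like this; you can just as easily tally the counts yourself.

    import numpy as np
    from sklearn.metrics import accuracy_score, confusion_matrix

    # Placeholders: in practice y_true and y_pred come from running the trained
    # model over the dev or test split.
    y_true = np.array([1, 0, 1, 1, 0])
    y_pred = np.array([1, 0, 0, 1, 0])

    print(accuracy_score(y_true, y_pred))    # fraction of correct predictions
    print(confusion_matrix(y_true, y_pred))  # rows: true class, columns: predicted class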

Getting started:

As before, copy the project stencil from /course/cs2952d/stencil/hw3/. The data for this assignment can be found in /course/cs2952d/data/hw3/ and includes train/dev/test splits from the SST-2 corpus as well as some pre-processed data files to take some of the work out of loading the corpus into your model. You will mostly be working with sentiment.py, which serves as the starting point of your program. Run sentiment.py --help to get a feel for how to use it.

You must complete anything marked as TODO. To quickly find which sections must be implemented, run grep -n "TODO" *.

Unlike Homework 1, this assignment will most likely require GPU access in order to complete in a reasonable amount of time. If you have personal access to a GPU, feel free to use that to complete the assignment. To get started on a machine with enough disk space, create a Python 3 virtual environment and run pip install -r requirements.txt. If you do not have access to a GPU and/or a machine with enough disk space to install a GPU environment, please email the course staff as early as possible, and we can set you up with compute resources.

Resources

In this assignment you’ll be using PyTorch again, but the stencil will provide less guidance this time. For the most part, you should refer to the official PyTorch documentation. In addition, the PyTorch website provides an excellent introduction to using sequence models and LSTMs here.

Of particular note is Step 3 in SentimentNetwork.forward, which asks you to call the function torch.nn.utils.rnn.pack_padded_sequence. This is necessary since our data consists of sentences of varying lengths. With a batch size of 1 this wouldn’t be an issue, but consider how an RNN would be unrolled over an entire batch when each example has a different length. The solution is to pad each sequence with an appropriate number of zeroes so that every sequence in the batch is the same length. However, a new problem appears: we don’t actually want these zero-vectors to affect the RNN’s hidden state (and consequently, our loss function). The pack_padded_sequence utility addresses this issue. We’ve provided you with a walkthrough of padding sequences in PyTorch in packed_pytorch_demo.py. This article also gives a good overview of how to work with variable-sized batches.
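As a rough, self-contained illustration of the pad → pack → RNN → unpack round trip (random tensors stand in for real embeddings; packed_pytorch_demo.py remains the authoritative walkthrough for this assignment):

    import torch
    from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

    # Three "sentences" of different lengths, each a (seq_len, embed_dim) tensor,
    # already sorted by decreasing length as older PyTorch versions require.
    embed_dim = 4
    seqs = [torch.randn(n, embed_dim) for n in (5, 3, 2)]
    lengths = [s.size(0) for s in seqs]

    # Pad to a common length: (batch, max_len, embed_dim), zeros beyond each true length.
    padded = pad_sequence(seqs, batch_first=True)

    # Pack so the RNN skips the padding positions entirely.
    packed = pack_padded_sequence(padded, lengths, batch_first=True)

    rnn = torch.nn.LSTM(embed_dim, hidden_size=8, batch_first=True)
    packed_out, (h_n, c_n) = rnn(packed)

    # h_n[-1] holds the last *real* hidden state for every sequence in the batch.
    output, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
    print(h_n[-1].shape)  # torch.Size([3, 8])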

Handing in:

Submit your code and writeup. In addition, include your best saved checkpoint for each embedding type. Please do not submit every saved checkpoint or the data files. Run cs2952d_handin hw3 to submit every file within the current directory.