This assignment is due at 11:59PM on Tuesday, September 25, 2018.

Vector Space Models and Skip-grams: Assignment 1

In this assignment, you will be examining different ways of representing text as vectors. Vector space models (VSMs) allow us to “embed” similar documents or words as nearby points in some real-valued vector space. These vector representations can then be used for different tasks, such as similarity scoring, classification and clustering.

Part 1: Sparse word vectors and context windows

A naive way of representing words might be to simply assign each word some machine-understandable token (such as a sequence of ASCII values, or an integer id). But this provides little information about what a word means. How is the ASCII sequence 0x77 0x61 0x6c 0x6b (“walk”) related to the sequence 0x72 0x75 0x6e (“run”)? These sequences serve primarily as identifiers of words, and so a good first step is to map words to unique ids in a collection known as a vocabulary. In this toy example, we might have "run" -> 0 and "walk" -> 1. While this is a step up from raw string sequences, we still can’t relate “walk” to “run” in any useful way, since the symbols we’ve assigned them are entirely arbitrary. On some level, however, we as humans might understand that the words “walk” and “run” share similar ideas of motion because of the contexts in which we’ve seen them appear.

For this part of the assignment, you’ll be constructing vector representations of words that model a word’s context. For a given word \(w\), define a context window of radius \(N\) to capture the \(N\) words to the left and right of \(w\). We wish to represent all contexts around \(w\) as a “bag of words”. Given a corpus of sentences, we construct a fixed-size vocabulary \(V\). Assign each word in \(V\) a unique integer id \(i\). The bag-of-words representation of \(w\) is then a vector \(c \in \mathbb{R}^{|V|}\), where each \(c_i\) counts how often word \(i\) appears in a context window of \(w\) (across the entire corpus). This approach allows us to construct very large and very sparse vector representations of words in terms of their contexts.
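
To make the counting concrete, here is a toy sketch of how a single occurrence of one target word contributes to its context counts with radius 2; the real vectors accumulate these counts over every occurrence of every word in the corpus, indexed by vocabulary id (the variable names below are purely illustrative):

# Toy illustration: count the radius-2 context of one target word in one sentence.
from collections import Counter

sentence = "the quick brown fox jumps over the lazy dog".split()
radius, target = 2, "fox"

counts = Counter()
for i, word in enumerate(sentence):
    if word == target:
        counts.update(sentence[max(0, i - radius):i] + sentence[i + 1:i + 1 + radius])

print(counts)  # Counter({'quick': 1, 'brown': 1, 'jumps': 1, 'over': 1})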

Task: Transform a corpus of text into word vectors according to this context-window principle. Use a window of radius 2. For your writeup:

  1. Pick your favourite word (in the vocabulary) and show the 20 closest words to it. Are these what you expected?
  2. What does similarity mean under this model? Explain your choice of distance metric. Why is it important?
  3. Which two distinct words are most similar? Which two are least similar? (Hint: Use scipy.spatial.distance.pdist).
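
For the hint in item 3, one way to use pdist is sketched below. It assumes vectors is a dense (number-of-words x |V|) NumPy array of your word vectors (densify or subsample the sparse matrix first if memory is tight) and id_to_word maps a row index back to its word; cosine distance is used purely as an example, so justify your own choice in the writeup.

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Pairwise distances between all word vectors (condensed form -> square matrix).
dists = squareform(pdist(vectors, metric="cosine"))

np.fill_diagonal(dists, np.inf)          # ignore each word's distance to itself
i, j = np.unravel_index(np.argmin(dists), dists.shape)
print("most similar: ", id_to_word[i], id_to_word[j])

np.fill_diagonal(dists, -np.inf)
i, j = np.unravel_index(np.argmax(dists), dists.shape)
print("least similar:", id_to_word[i], id_to_word[j])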

Outline:

Copy the project stencil from /course/cs2952d/stencil/hw1/. You will be working with sparse_vecs.py and utils.py as the stencil code for this part. Use the datasets in /course/cs2952d/data/. Report your results for en-wiki-100k.txt, but use toy.txt as a toy dataset for making sure your code works.

  1. Load the data - this is done for you.
  2. Construct a vocabulary across the entire corpus. This should map each word (a string) to an integer id.
  3. Limit the vocabulary to 10,000 words. Use the provided mapping of word-counts to keep only the most-frequent words.
  4. Use the vocabulary (as a word-to-id mapping) and the corpus to construct the sparse word vectors. A rough sketch of steps 2-4 follows this list.
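
The snippet below is a rough sketch of steps 2-4, not the stencil’s actual interface: corpus (a list of tokenised sentences) and word_counts (a word-to-frequency mapping) are assumed names, and out-of-vocabulary words are simply dropped before windowing, which is only one of several reasonable conventions.

from scipy.sparse import lil_matrix

VOCAB_SIZE = 10000
RADIUS = 2

# Steps 2-3: keep the 10,000 most frequent words and assign each an integer id.
most_frequent = sorted(word_counts, key=word_counts.get, reverse=True)[:VOCAB_SIZE]
word_to_id = {word: i for i, word in enumerate(most_frequent)}

# Step 4: row w of this |V| x |V| matrix is the bag-of-words context vector of word w.
vectors = lil_matrix((VOCAB_SIZE, VOCAB_SIZE))
for sentence in corpus:
    ids = [word_to_id[w] for w in sentence if w in word_to_id]
    for i, w in enumerate(ids):
        for c in ids[max(0, i - RADIUS):i] + ids[i + 1:i + 1 + RADIUS]:
            vectors[w, c] += 1
vectors = vectors.tocsr()  # convert to a format with fast row access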

Part 2: Skip-grams and word2vec

For this part, we’ll try something different. Instead of simply capturing counts of nearby words, we’ll embed words into a vector space in which a word’s representation is trained to predict the words appearing in its context. As before, the goal is for semantically close words to be similar in embedding space. Whereas the previous model is based solely on word counts, the approach here uses a neural network trained to predict skip-grams. This is the word2vec model of Mikolov et al. (2013).

To recap, a skip-gram looks at each word in a sentence, and forms pairs between it and every other word in its context window. Here’s an example:

>>> # assume there exists some function skip_grams(sentence) with a window size of 2
>>> sentence = "the quick brown fox jumps over the lazy dog"
>>> print(skip_grams(sentence))
[('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'), ('quick', 'fox'), ..., ('lazy', 'over'),
  ('lazy', 'the'), ('lazy', 'dog'), ('dog', 'the'), ('dog', 'lazy')]
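
The skip_grams function is only assumed to exist above; one possible implementation that reproduces the output shown (given purely as an illustrative sketch) is:

def skip_grams(sentence, window=2):
    """Pair every word with each word inside its context window of the given radius."""
    words = sentence.split()
    pairs = []
    for i, target in enumerate(words):
        context = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        pairs.extend((target, c) for c in context)
    return pairs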

The goal is, given a word, to predict a word that occurs within its context. To do this, we project a one-hot vector (which encodes a word) into a D-dimensional embedding space, and learn a linear decision boundary to predict a context word. In the terminology of neural networks, we use a network with one hidden layer (the embedding space) and minimise the negative log-likelihood under a softmax over the output logits (i.e. a cross-entropy loss). During training, the projection matrix into the embedding space learns a representation for each word that minimises this loss. This matrix is known as the embedding table. Once trained, a lookup can be performed on this table to find the embedding for a given word.
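
As a minimal sketch of what such a network might look like, assuming PyTorch (the stencil may expect a different framework, and the class and variable names here are illustrative):

import torch.nn as nn

VOCAB_SIZE = 10000   # vocabulary size
EMBED_DIM = 128      # size of the hidden (embedding) layer

class SkipGramModel(nn.Module):
    def __init__(self, vocab_size=VOCAB_SIZE, embed_dim=EMBED_DIM):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)  # the embedding table
        self.output = nn.Linear(embed_dim, vocab_size)          # linear layer over the vocabulary

    def forward(self, word_ids):
        # word_ids: a batch of target-word ids -> logits over possible context words
        return self.output(self.embeddings(word_ids))

model = SkipGramModel()
criterion = nn.CrossEntropyLoss()  # softmax + negative log-likelihood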

Task: Train a neural network with a hidden layer size of 128. Use a cross-entropy-like loss as your training criterion. The network should take in a word id, and predict a word id representing a context word. For your writeup, you should:

  1. Plot the loss as a function of training iterations and report your final training loss (a sketch of a training loop that records these values follows this list).
  2. Pick your favourite word (in the vocabulary), and show the 20 closest words to it. Are these what you expected? How does the skip-gram model affect the notion of similarity?
  3. Which two distinct words are most similar? Which are least similar?
  4. Visualise the embedding manifold on a subset of the data. Use a dimensionality-reduction algorithm (such as t-SNE) to project the embeddings to 2 dimensions and plot them.
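
Below is a sketch of a training loop that also records the loss values needed for item 1. It continues the PyTorch-style model above; training_pairs is an assumed list of (target_id, context_id) skip-gram pairs, and the fixed batch size of 256 is an arbitrary choice.

import torch
import matplotlib.pyplot as plt

optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
losses = []

for epoch in range(15):
    for start in range(0, len(training_pairs), 256):
        batch = training_pairs[start:start + 256]
        targets = torch.tensor([t for t, _ in batch])
        contexts = torch.tensor([c for _, c in batch])

        optimiser.zero_grad()
        loss = criterion(model(targets), contexts)
        loss.backward()
        optimiser.step()
        losses.append(loss.item())

plt.plot(losses)
plt.xlabel("training iteration")
plt.ylabel("loss")
plt.savefig("loss_curve.png")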

Outline:

Copy the project stencil from /course/cs2952d/stencil/hw1/ and data from /course/cs2952d/data/hw1/. You will be working with word2vec.py as the stencil code for this part. As before, report your results on en-wiki-100k.txt.

  1. Load the data - this is done for you.
  2. Limit the vocabulary to 10,000 words. Use the provided mapping of word-counts to keep only the most-frequent words.
  3. Generate skip-grams from the input. As before, you will need to build a vocabulary to map words to integer ids.
  4. Define a neural network, loss criterion, and optimiser. Implement the training loop over the data. Train for 15 epochs.
  5. After training, index into the rows of the embedding matrix using a word ID. Equivalently, compute the dot product between a one-hot word vector and the embedding matrix.
  6. Feel free to use an existing implementation of t-SNE. Scikit-learn has a good implementation. Use matplotlib for plotting the 2-dimensional embeddings. Matplotlib’s scatter() and annotate() methods may be useful.
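
For steps 5 and 6, a rough sketch, again assuming the PyTorch-style model above; word_to_id / id_to_word are the assumed vocabulary mappings from step 3, “king” is an arbitrary example word, and the 200-word subset is an arbitrary choice.

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Step 5: the rows of the embedding table are the learned word vectors.
embedding_table = model.embeddings.weight.detach().numpy()   # shape (|V|, 128)
vector_for_king = embedding_table[word_to_id["king"]]        # lookup by word id
print(vector_for_king.shape)                                 # (128,)

# Step 6: project a subset of the embeddings to 2 dimensions and plot them.
subset_ids = list(range(200))                                # e.g. the 200 most frequent words
points = TSNE(n_components=2).fit_transform(embedding_table[subset_ids])

plt.figure(figsize=(12, 12))
plt.scatter(points[:, 0], points[:, 1])
for idx, (x, y) in zip(subset_ids, points):
    plt.annotate(id_to_word[idx], (x, y))
plt.savefig("tsne_embeddings.png")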

If you read the original word2vec paper, you may have noticed the authors’ implementation differs from what’s specified here. In particular, you don’t need to worry about: subsampling frequent words, phrase detection, or negative sampling/hierarchical softmax. These are all cool optimisations that improve the resulting word vectors, but are beyond the scope of this project.


Handing in:

From your project directory, run cs2952d_handin hw1 to submit your code and writeup. In addition to your analysis, be sure to include a (short) description of your code: how to run it, any known bugs, and a brief explanation of your design. Your writeup can be one file, split into sections.

Rubric:

Part 1 (5.5 points)

Part 2 (9.5 points)