In this assignment, you will be examining different ways of representing text as vectors. Vector space models (VSMs) allow us to “embed” similar documents or words as points that are nearby in some real-valued vector space. These vector representations can then be used for different tasks, such as similarity scoring, classification, and clustering.
A naive way of representing words might be to simply assign each word some machine-understandable token, such as a sequence of ASCII values or an integer id. But this provides little information about what a word means. How is the ASCII sequence 0x77 0x61 0x6c 0x6b (“walk”) related to the sequence 0x72 0x75 0x6e (“run”)? These sequences serve primarily as identifiers of words, so a good first step is to map words to unique ids in a collection known as a vocabulary. In this toy example, we might have "run" -> 0 and "walk" -> 1. While this is a step up from raw ASCII sequences, we still can’t relate “walk” to “run” in any useful way, since the symbols we’ve assigned them are entirely arbitrary. On some level, however, we as humans might understand that the words “walk” and “run” share similar ideas of motion because of the contexts in which we’ve seen them appear.
For this part of the assignment, you’ll be constructing vector representations of words that model a word’s context. For a given word \(w\), define a context window of radius \(N\) that captures the \(N\) words to the left and to the right of \(w\). We wish to represent all contexts around \(w\) as a “bag of words”. Given a corpus of sentences, we construct a fixed-size vocabulary \(V\) and assign each word in \(V\) a unique integer id \(i\). The bag-of-words representation of \(w\) is then a vector \(c \in \mathbb{R}^{|V|}\), where each entry \(c_i\) counts how often word \(i\) appears in a context window of \(w\) across the entire corpus. This approach lets us construct very large, very sparse vector representations of words in terms of their contexts.
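As a concrete illustration of this counting procedure, here is a heavily simplified sketch; it is not the stencil code, and all names below are made up for the example:

from collections import defaultdict

def build_context_counts(sentences, radius=2):
    # Sketch only: count, for each word, how often every other word id
    # appears within a window of the given radius, across all sentences.
    vocab = {}
    for sent in sentences:
        for word in sent:
            vocab.setdefault(word, len(vocab))  # assign the next unused id

    counts = defaultdict(lambda: defaultdict(int))  # counts[w][i] = co-occurrence count
    for sent in sentences:
        for pos, word in enumerate(sent):
            lo, hi = max(0, pos - radius), min(len(sent), pos + radius + 1)
            for ctx in range(lo, hi):
                if ctx != pos:
                    counts[word][vocab[sent[ctx]]] += 1
    return vocab, counts

vocab, counts = build_context_counts(
    [["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]])
print(dict(counts["fox"]))  # ids of 'quick', 'brown', 'jumps', 'over', each with count 1

Each counts[w] here is a sparse form of the vector \(c\): only non-zero entries are stored, which is what keeps the representation manageable for a large vocabulary.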
Task: Transform a corpus of text into word vectors according to this context-window principle. Use a window of radius 2. For your writeup, scipy.spatial.distance.pdist may be useful.

Copy the project stencil from /course/cs2952d/stencil/hw1/. You will be working with sparse_vecs.py and utils.py as the stencil code for this part. Use the datasets in /course/cs2952d/data/. Report your results for en-wiki-100k.txt, but use toy.txt as a toy dataset to make sure your code works.
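In case the pdist hint is unfamiliar, here is a minimal usage sketch; the vectors below are made-up placeholders, not real word vectors:

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Stack one context-count vector per word (placeholder data for illustration).
vectors = np.array([
    [0., 2., 1., 0.],
    [1., 0., 0., 3.],
    [0., 2., 1., 1.],
])

# pdist returns a condensed distance matrix; squareform expands it to |V| x |V|.
distances = squareform(pdist(vectors, metric="cosine"))
print(distances[0, 2])  # cosine distance between word 0 and word 2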
For this part, we’ll try something different. Instead of simply capturing counts of nearby words, we’ll embed words into a vector space that best predicts the words appearing in a target word’s context. As before, the goal is for semantically close words to be similar in embedding space. Whereas the previous model is based solely on word counts, the approach here uses a neural network trained to predict skip-grams. This is the word2vec model of Mikolov et al. (2013).
To recap, skip-grams are formed by taking each word in a sentence and pairing it with every other word in its context window. Here’s an example:
>>> # assume there exists some function skip_grams(sentence) with a window size of 2
>>> sentence = "the quick brown fox jumps over the lazy dog"
>>> print(skip_grams(sentence))
[('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'), ('quick', 'fox'), ..., ('lazy', 'over'),
('lazy', 'the'), ('lazy', 'dog'), ('dog', 'the'), ('dog', 'lazy')]
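The skip_grams function above is assumed rather than given; one possible sketch that reproduces the output shown (window radius of 2) is:

def skip_grams(sentence, window=2):
    # Pair each word with every other word inside its context window.
    words = sentence.split()
    pairs = []
    for pos, word in enumerate(words):
        lo, hi = max(0, pos - window), min(len(words), pos + window + 1)
        for ctx in range(lo, hi):
            if ctx != pos:
                pairs.append((word, words[ctx]))
    return pairs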
The goal is, given a word, to predict a word that occurs within its context. To do this, we project a one-hot vector (which encodes the input word) into a \(D\)-dimensional embedding space, and learn a linear decision boundary to predict a context word. In the terminology of neural networks, this is a network with one hidden layer (the embedding space), trained to minimise the negative log-likelihood of the correct context word under a softmax over the output logits. During training, the projection matrix into the embedding space learns a representation for each word that minimises the loss. This matrix is known as the embedding table. Once the network is trained, a lookup into this table gives the embedding for a given word.
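The stencil does not prescribe the code below; it is only a sketch of this idea written in PyTorch (your word2vec.py may use a different framework, and the names and sizes here are illustrative):

import torch
import torch.nn as nn

vocab_size, embedding_dim = 10000, 128  # illustrative sizes

class SkipGramModel(nn.Module):
    # One hidden layer: an embedding lookup, then a linear map to vocabulary-sized logits.
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)  # the embedding table
        self.output = nn.Linear(embedding_dim, vocab_size)         # logits over context words

    def forward(self, word_ids):
        return self.output(self.embeddings(word_ids))

model = SkipGramModel(vocab_size, embedding_dim)
loss_fn = nn.CrossEntropyLoss()  # softmax + negative log-likelihood
optimizer = torch.optim.Adam(model.parameters())

# One training step on a toy batch of (target id, context id) pairs.
targets = torch.tensor([3, 3, 7])
contexts = torch.tensor([5, 9, 2])
loss = loss_fn(model(targets), contexts)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# After training, a row of the embedding table is the vector for that word id.
vector_for_word_3 = model.embeddings.weight[3]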
Task: Train a neural network with a hidden layer of size 128. Use a cross-entropy-like loss as your training criterion. The network should take in a word id and predict a word id representing a context word.

Copy the project stencil from /course/cs2952d/stencil/hw1/ and the data from /course/cs2952d/data/hw1/. You will be working with word2vec.py as the stencil code for this part. As before, report your results on en-wiki-100k.txt.
For your writeup, matplotlib’s scatter() and annotate() methods may be useful (a short example appears below).

If you read the original word2vec paper, you may have noticed that the authors’ implementation differs from what’s specified here. In particular, you don’t need to worry about subsampling frequent words, phrase detection, or negative sampling/hierarchical softmax. These are all cool optimisations that improve the resulting word vectors, but they are beyond the scope of this project.
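As an illustration of the scatter()/annotate() hint above, here is a minimal plotting sketch; the points and labels are placeholders, and how you reduce your 128-dimensional embeddings to two dimensions is up to you:

import matplotlib.pyplot as plt

points = [(0.1, 0.3), (0.15, 0.35), (0.8, 0.7)]  # placeholder 2-D points
labels = ["walk", "run", "dog"]                  # placeholder words

fig, ax = plt.subplots()
xs, ys = zip(*points)
ax.scatter(xs, ys)
for (x, y), label in zip(points, labels):
    ax.annotate(label, (x, y))  # label each point with its word
plt.show()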
From your project directory, run cs2952d_handin hw1 to submit your code and writeup. In addition to your analysis, be sure to include a (short) description of your code: how to run it, any known bugs, and a brief explanation of your design. Your writeup can be one file, split into sections.
Part 1 (5.5 points)
Part 2 (9.5 points)