HW4: Machine Translation
Due Tuesday, 11/5/19 at 11:59pm
French philosopher Jean-Paul Sartre had many profound things to say about the universe. However, many of his greatest musings are in French!
Therefore, to get at the meaning of existence (or lack thereof), we will need to train a model that can translate from French to English!
Getting the stencil
You can find the files located here or in the "Files" column under the Assignments page. The files are compressed into a ZIP file; double-click it to unzip. Inside, you should find the files: preprocess.py, assignment.py, rnn_model.py, transformer_model.py, transformer_funcs.py, a README, and a data folder. You can find the conceptual questions located here or in the "Conceptual Questions" column on the Assignments page.
Logistics
Work on this assignment off of the stencil code provided, but do not change the stencil except where specified. Changing the stencil will make your code incompatible with the autograder and result in a low grade. For this assignment, a significant amount of code is provided. You shouldn't change any method signatures.
This assignment has 2 main parts.
You will implement two different types of sequence-to-sequence model: one based on Recurrent Neural Networks, the other on Transformers. However, since both of these models are trying to solve the same problem, they share the same preprocessing, training, and testing code. In addition, we provide you with Transformer helper functions; however, you will implement self-attention yourself.
Virtual Environment
For this assignment, you might also need the gast library. This may be necessary if you want to speed up your code by running it in graph execution. If you are working locally, activate your existing virtual environment and install gast by running the following command in your terminal/command line:
pip install gast
You are also welcome to set up the virtual environment from scratch using the updated requirements.txt file.
The Corpus
By law, the official records (Hansards) of the Canadian Parliament must be transcribed in both English and French. Therefore, they provide a fantastic set of mappings for a machine translation model. We are providing you with a modified version of the corpus that only includes sentences shorter than 12 words.
Here's what you should find in the data folder:
fls.txt - french_training_file
els.txt - english_training_file
flt.txt - french_test_file
elt.txt - english_test_file
Part 0: Preprocessing
We have provided you with several helper functions to help with preprocessing.
pad_corpus: pads the corpus to make all inputs the same length. It does this by adding *PAD* tokens to the end of the French and English sentences. It also adds a *START* token to the beginning of the English sentences.
build_vocab: returns a dictionary mapping words to IDs, along with the ID associated with padding. You should build your vocab using only the training data.
convert_to_id: converts sentences to their ID form.
read_data: loads text data from a file.
You must implement the get_data function. This function takes the training and testing files, and uses the helper functions provided to return the processed data, the vocab dictionaries, and the English padding ID.
In this function you should:
Step 1. Read the French training data and the English training data
Step 2. Read the French testing data and the English testing data
Step 3. Call pad_corpus on the training and testing data
Step 4. Build the French and English vocabularies from the training data, then use the vocabularies to convert the sentences to ID form
Step 5. Return the processed data, dictionaries and the English padding ID
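As a rough sketch only, get_data might look like the code below. The helper signatures assumed here (read_data(file_name), pad_corpus(french, english), build_vocab(sentences) returning a vocab and a padding ID, convert_to_id(vocab, sentences)) and the return order are guesses; follow the docstrings in your stencil for the exact forms.

# Hypothetical sketch of get_data; lives in preprocess.py alongside the helpers above.
import numpy as np

def get_data(french_training_file, english_training_file, french_test_file, english_test_file):
    # Steps 1 & 2: read the raw French/English training and testing sentences
    french_train = read_data(french_training_file)
    english_train = read_data(english_training_file)
    french_test = read_data(french_test_file)
    english_test = read_data(english_test_file)

    # Step 3: pad the training and testing corpora
    french_train, english_train = pad_corpus(french_train, english_train)
    french_test, english_test = pad_corpus(french_test, english_test)

    # Step 4: build vocabularies from the TRAINING data only, then convert every split to IDs
    french_vocab, french_pad_id = build_vocab(french_train)
    english_vocab, eng_padding_index = build_vocab(english_train)
    train_french = np.array(convert_to_id(french_vocab, french_train))
    test_french = np.array(convert_to_id(french_vocab, french_test))
    train_english = np.array(convert_to_id(english_vocab, english_train))
    test_english = np.array(convert_to_id(english_vocab, english_test))

    # Step 5: return the processed data, both dictionaries, and the English padding ID
    # (match the return order in your stencil's docstring)
    return train_french, test_french, train_english, test_english, french_vocab, english_vocab, eng_padding_index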
Part 1: RNN Machine Translation
Roadmap
For this homework, you will build two neural networks that encode the French sentence and then decode out the English translation. In this part, you will build an RNN-based encoder-decoder architecture.
You should use at least one RNN to encode the French embeddings into a single vector, which is then passed to the decoder. This decoder RNN is initialized with the output of the encoder.
The decoder performs similarly to the language model from the last assignment. The decoder should take the English inputs shifted over one timestep, and use the combined hidden state and shifted input to predict the next English word. This procedure is called Teacher Forcing.
In other words, we initialize the decoder with the "encoded" French sentence, then give it the previous correct English word and have it guess the next, as if we were language modeling. Teacher forcing helps stabilize training.
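To make the encoder/decoder wiring concrete, here is a minimal sketch of one possible forward pass using Keras GRU layers. The specific layer types, sizes, and attribute names are illustrative assumptions, not requirements of the stencil.

# A hedged sketch of an RNN encoder-decoder forward pass. GRU layers and the
# sizes below are assumptions; any recurrent layer setup that meets the
# mandatory hyperparameters is fine.
import tensorflow as tf

class RNN_Seq2Seq_Sketch(tf.keras.Model):
    def __init__(self, french_vocab_size, english_vocab_size):
        super().__init__()
        self.embedding_size = 64   # assumed hyperparameter
        self.french_emb = tf.keras.layers.Embedding(french_vocab_size, self.embedding_size)
        self.english_emb = tf.keras.layers.Embedding(english_vocab_size, self.embedding_size)
        self.encoder = tf.keras.layers.GRU(128, return_state=True)
        self.decoder = tf.keras.layers.GRU(128, return_sequences=True)
        self.dense = tf.keras.layers.Dense(english_vocab_size, activation='softmax')

    def call(self, encoder_input, decoder_input):
        # encode the French sentence into a single final state...
        _, final_state = self.encoder(self.french_emb(encoder_input))
        # ...then initialize the decoder with it and run over the shifted English input
        decoded = self.decoder(self.english_emb(decoder_input), initial_state=final_state)
        return self.dense(decoded)   # probabilities, shape [batch, window, english_vocab_size]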
Step 1. Create your RNN model
- Fill out the init function, and define your trainable variables and hyperparameters.
- Fill out the call function using the trainable variables you've created.
- Calculate the softmax cross-entropy loss on the probabilities compared to the labels (these should NOT be one-hot vectors). We again recommend using tf.keras.losses.sparse_categorical_crossentropy. In addition, you must now apply a mask to the loss. This is because many of the output labels will be padding, and we do not want to include padding in our loss calculation.
- Your mask should be a tensor of 1s and 0s, or booleans, with the same dimensions as your labels. There should be a 0/False value corresponding to each *PAD* token.
- You are welcome to use both reduce_mean and reduce_sum. If your stencil says "sum over batch," you can ignore that. Just make sure you calculate everything properly.
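For example, a masked loss might look like the sketch below, assuming prbs has shape [batch_size, window_size, english_vocab_size], labels has shape [batch_size, window_size], and the mask is built in assignment.py as something like labels != eng_padding_index. This is one possibility, not the required implementation.

import tensorflow as tf

# a method of your model class in this sketch
def loss_function(self, prbs, labels, mask):
    # per-token cross-entropy, shape [batch_size, window_size]
    per_token_loss = tf.keras.losses.sparse_categorical_crossentropy(labels, prbs)
    # zero out the loss at every *PAD* position before reducing
    masked_loss = per_token_loss * tf.cast(mask, tf.float32)
    return tf.reduce_sum(masked_loss)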
Step 2. Fill out training and testing in assignment.py
- In assignment.py, you will want to get your train and test data, initialize your model, and train it. We have provided you with a train and test method to fill out. The train method will take in the model and do the forward and backward pass.
- Note that you will initialize both the RNN and the Transformer in assignment.py this time.
- In your training and testing steps, you should batch your data. The French version of a sentence will serve as your encoder input.
To construct your decoder labels for each sentence, you should remove the *START* token. Similarly, you will want to remove the last padding token for your decoder input. By removing these two elements, you ensure your decoder input is the same dimension as your decoder labels, and that you are predicting the NEXT English word at each position in your window.
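For example, if english_batch is a padded batch of English IDs that begins with *START* (so it is one token longer than the prediction window), the decoder input, labels, and mask could be built as follows (variable names are illustrative):

decoder_input = english_batch[:, :-1]        # drop the final token (a *PAD*)
decoder_labels = english_batch[:, 1:]        # drop the *START* token
mask = decoder_labels != eng_padding_index   # 0/False wherever the label is padding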
Running the Model
Your assignment.py should be able to run both your RNN and Transformer model. You can run it with:
python assignment.py [MODEL_TYPE]
where MODEL_TYPE is "RNN" or "TRANSFORMER".
Mandatory Hyperparameters
You must use separate embedding matrices for your French and English inputs. In addition, you must use at least two RNNs: one for your encoder, the other for your decoder.
While not required, a learning rate of 0.01 and a batch size of 100 are recommended. Additionally, we recommend choosing embedding/hidden layer sizes between 32 and 256. We also suggest using a standard deviation of 0.01 for your embedding matrices (if you do not use Keras).
While the specifications for your architecture are flexible, your RNN seq2seq model must train in under 30 minutes on a department machine!
Your target perplexity should be <= 20, and your target per-symbol accuracy should be > 58%. As a reference point, our RNN model trains within 22 minutes and achieves a perplexity of around 10.
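For reference, perplexity here is the exponential of the average per-symbol loss, and per-symbol accuracy only counts non-padding positions. One way your test loop might accumulate these quantities is sketched below (a sketch only; variable names are illustrative):

# per batch, with prbs = model(french_batch, decoder_input):
per_token_loss = tf.keras.losses.sparse_categorical_crossentropy(decoder_labels, prbs)
mask = tf.cast(decoder_labels != eng_padding_index, tf.float32)
total_loss += tf.reduce_sum(per_token_loss * mask)
predictions = tf.cast(tf.argmax(prbs, axis=-1), decoder_labels.dtype)
total_correct += tf.reduce_sum(tf.cast(predictions == decoder_labels, tf.float32) * mask)
total_symbols += tf.reduce_sum(mask)

# after iterating over all batches:
perplexity = tf.exp(total_loss / total_symbols)
per_symbol_accuracy = total_correct / total_symbols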
Part 2: Transformers Machine Translation
RNNs are neat. However, since 2017, there has been a new cool kid on the Natural Language Processing block: Transformers. Transformer based models have produced state-of-the-art performance on a variety of NLP tasks, including language modeling and translation.
These architectures rely on stacked self-attention modules, rather than recurrence. We will be implementing a simplified version of the Attention Is All You Need architecture.
These attention modules turn a sequence of embeddings into Queries, Keys, and Values. Just like how each timestep has a word embedding, in self-attention, each timestep has a query, key, and value embedding. The queries are compared to every key to produce an attention matrix. This attention matrix is then used to create new embeddings for each timestep.
Self-attention can be fairly confusing. Thus, we encourage students to refer back to the lecture slides. Another great resource that explains the intuition/implementation of Transformers can be found here.
For this part of the assignment, we give you code for Transformer blocks, which you can use like RNN layers. This code can be found in transformer_funcs.py. However, it is not complete: you must implement the single attention head functionality. Do this by filling in the Self_Attention function and the Atten_Head class.
If you are in 2470, you should also implement multi-headed attention (with three heads) and use it in your model.
Roadmap
Step 1. Create your Transformer model
- Fill out the init function, and define your trainable variables.
Instead of RNNs, you should use at least one transformer.Transformer_Block for your encoder and at least one transformer.Transformer_Block for your decoder.
The transformer block takes the following arguments: (embedding_size, is_decoder=False/True, multi_headed=False/True). You can find this in transformer_funcs.py.
You must also define and use two transformer.Position_Encoding_Layers to add positional embeddings to your French and English embeddings.
Additionally, please note that, for this architecture, your embedding/hidden state size must be the same for your word embeddings and your transformer blocks.
- Fill out the call function using the trainable variables you've created.
- You can reuse your loss function from rnn_model.py.
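As a rough illustration of the wiring, the sketch below shows one possible init/call structure. The Transformer_Block arguments follow the signature above, but the Position_Encoding_Layer constructor arguments and the keyword used to pass the encoder output into the decoder block (context below) are assumptions, so check transformer_funcs.py for the actual signatures.

import tensorflow as tf
import transformer_funcs as transformer

class Transformer_Seq2Seq_Sketch(tf.keras.Model):
    def __init__(self, french_window_size, french_vocab_size, english_window_size, english_vocab_size):
        super().__init__()
        self.embedding_size = 64   # assumed; word embeddings and blocks must share this size
        self.french_emb = tf.keras.layers.Embedding(french_vocab_size, self.embedding_size)
        self.english_emb = tf.keras.layers.Embedding(english_vocab_size, self.embedding_size)
        # one positional-encoding layer per language (constructor args are assumed here)
        self.french_pos = transformer.Position_Encoding_Layer(french_window_size, self.embedding_size)
        self.english_pos = transformer.Position_Encoding_Layer(english_window_size, self.embedding_size)
        self.encoder = transformer.Transformer_Block(self.embedding_size, is_decoder=False)
        self.decoder = transformer.Transformer_Block(self.embedding_size, is_decoder=True)
        self.dense = tf.keras.layers.Dense(english_vocab_size, activation='softmax')

    def call(self, encoder_input, decoder_input):
        # encode the positionally-embedded French sentence
        encoder_output = self.encoder(self.french_pos(self.french_emb(encoder_input)))
        # decode the shifted English input, attending to the encoder output
        decoder_output = self.decoder(self.english_pos(self.english_emb(decoder_input)), context=encoder_output)
        return self.dense(decoder_output)   # [batch, window, english_vocab_size]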
Step 2. Fill out the Self_Attention function and the Atten_Head class in transformer_funcs.py
- Note that when you fill out Atten_Head, you will be using tf.tensordot. This is necessary because you are multiplying tensors of different orders/dimensions (e.g., a 3-D batch of embeddings by a 2-D weight matrix).
- Please refer to the lecture slides and/or the Illustrated Transformer.
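For reference, single-head attention computes softmax(QK^T / sqrt(key_size))V. Below is a hedged sketch of the two pieces; the exact signatures and how the decoder's causal mask is supplied may differ in your stencil, so treat this as a starting point rather than a drop-in solution.

import tensorflow as tf

def Self_Attention(K, V, Q, use_mask=False):
    # K, V, Q: [batch_size, window_size, head_size]
    key_size = tf.cast(tf.shape(K)[-1], tf.float32)
    # attention scores: [batch_size, window_size_queries, window_size_keys]
    scores = tf.matmul(Q, K, transpose_b=True) / tf.sqrt(key_size)
    if use_mask:
        # one way to build a causal mask (your stencil may already provide one):
        # position i may only attend to positions <= i
        causal = tf.linalg.band_part(tf.ones_like(scores[0]), -1, 0)
        scores += (1.0 - causal) * -1e9
    weights = tf.nn.softmax(scores, axis=-1)   # the attention matrix
    return tf.matmul(weights, V)               # new embeddings for each timestep

class Atten_Head(tf.keras.layers.Layer):
    def __init__(self, input_size, output_size, use_mask=False):
        super().__init__()
        self.use_mask = use_mask
        # learned projections from the input embeddings to K, V, and Q
        self.wK = self.add_weight(shape=(input_size, output_size), initializer='glorot_uniform')
        self.wV = self.add_weight(shape=(input_size, output_size), initializer='glorot_uniform')
        self.wQ = self.add_weight(shape=(input_size, output_size), initializer='glorot_uniform')

    def call(self, inputs_for_keys, inputs_for_values, inputs_for_queries):
        # tensordot: [batch, window, input_size] x [input_size, output_size]
        K = tf.tensordot(inputs_for_keys, self.wK, axes=[[2], [0]])
        V = tf.tensordot(inputs_for_values, self.wV, axes=[[2], [0]])
        Q = tf.tensordot(inputs_for_queries, self.wQ, axes=[[2], [0]])
        return Self_Attention(K, V, Q, self.use_mask)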
Mandatory Hyperparameters
You must use separate embedding matrices for your French and English inputs. Don't forget to add positional embeddings to your French and English embeddings using different Position_Encoding_Layers.
In addition, you must use at least two transformer blocks: one for your encoder, the other for your decoder. While not required, a learning rate of 0.001 and a batch size of 100 are recommended. We also recommend choosing embedding/hidden layer sizes between 32 and 256.
While the specifications for your architecture are flexible, your Transformer seq2seq model must train in under 30 minutes on a department machine!
Your target perplexity should be <= 15, and your target per-symbol accuracy should be >= 65%.
Part 3: Conceptual Questions
Fill out the conceptual questions and submit them as a PDF. A scan of written work is also fine as long as it is readable. Please copy over the questions and write well-thought-out answers.
We will not accept anything other than a PDF.
Your README should contain your perplexity and any bugs you have.
CS2470 Students
You will also have to implement the Multi_Headed attention class. This class should create 3 self-attention heads and combine their results.
NOTE: if you have Three_Headed_Attention in your stencil, please change that to Multi_Headed. Do the same if you see Multi_Headed_Attention.
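A possible structure is sketched below, assuming each head projects down to emb_sz // 3 features and a dense layer combines the concatenated head outputs back to emb_sz; this mirrors the single Atten_Head sketch above, and your stencil's intended interface may differ.

# Illustrative sketch; lives in transformer_funcs.py next to Atten_Head.
class Multi_Headed(tf.keras.layers.Layer):
    def __init__(self, emb_sz, use_mask=False):
        super().__init__()
        head_size = emb_sz // 3
        # three independent single-head attention modules
        self.head1 = Atten_Head(emb_sz, head_size, use_mask)
        self.head2 = Atten_Head(emb_sz, head_size, use_mask)
        self.head3 = Atten_Head(emb_sz, head_size, use_mask)
        # dense layer to mix the concatenated head outputs back to emb_sz
        self.combine = tf.keras.layers.Dense(emb_sz)

    def call(self, inputs_for_keys, inputs_for_values, inputs_for_queries):
        out1 = self.head1(inputs_for_keys, inputs_for_values, inputs_for_queries)
        out2 = self.head2(inputs_for_keys, inputs_for_values, inputs_for_queries)
        out3 = self.head3(inputs_for_keys, inputs_for_values, inputs_for_queries)
        # concatenate along the feature dimension, then combine the heads
        return self.combine(tf.concat([out1, out2, out3], axis=-1))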
Also please complete the CS2470-only conceptual questions in addition to the coding assignment and the CS1470 conceptual questions. Note: Questions about 2470 will only be answered on Piazza, or by TAs marked with an asterisk (*) on the calendar.
Grading
Code: You will be primarily graded on functionality. Your RNN model should have a perplexity <= 20 and a per-symbol accuracy > 58%, and your Transformer model should have a perplexity <= 15 and a per-symbol accuracy >= 65%.
Conceptual: You will be primarily graded on correctness (when applicable), thoughtfulness, and clarity.
Autograder
Your RNN model and your Transformer model must each complete training in under 30 minutes on a department machine.
Our autograder will import your model and your preprocessing functions. We will call your get_data function on a path to our data, pass the result to your train method to obtain a fully trained model, and then test it using our testing function.
Handing In
You should submit the assignment using this Google Form. You must be logged in with your Brown account. Your assignment.py, preprocess.py, rnn_model.py, transformer_funcs.py, and transformer_model.py files should be Python files, while the written-up conceptual questions should be in PDF format. The README can be in any format.