Homework 3: Tree Hugger

The goals of this assignment are:

  • To get practice implementing methods on classes
  • To get practice designing and developing programs on trees
  • To consider some of the ethical implications of autocorrect

Due date

  • HW3 is due on Friday, February 28 at 9pm.

Setup and Handin

Setup

In order to setup your hw3 directory on your department machine, open a terminal and run the command cs0112_install hw3. This will create a directory in your ~/course/cs0112/ directory for this assignment.

Handin

You may submit as many times as you want. Only your latest submission will be graded. This means that if you submit after the deadline, you will be using a late day – so do NOT submit after the deadline unless you plan on using late days.

The README template can be found here. Please follow the design and clarity guide–part of your grade will be for code style and clarity. After completing the homework, you will submit:

  • README.txt
  • bst.py
  • test_bst.py
  • trie.py
  • test_trie.py

Because all we need are the files, you do not need to submit the whole project folder. As long as the file you create has the same name as what needs to be submitted, you’re good to go!

If you are using late days, make sure to make a note of that in your README. Remember, you may only use a maximum of 3 late days per assignment. If the assignment is late (and you do NOT have anymore late days) no credit will be given.

Please don’t put your name anywhere in any of the handin files–we grade assigments anonymously!

You can follow this step-by-step guide for submitting assignments through gradescope here.

Helpful Things

Documentation

Staff Assistance

Your friendly TAs and Professor are here to help with this assignment! You can find our schedule for office hours here.

The Assignment

Part 1: Searching for answers

In class, we implemented some methods on a binary search tree. For this part of the assignment, you’ll write a few additional methods that a binary search tree implementation might provide.

The bst.py file contains our binary search tree implementation from class, along with a few unimplemented methods:

  • The height method should return the height of the binary tree: that is, the maximum distance between the root node and a leaf node. The height of an empty binary tree is 0; the height of a tree with just a root node is 1. Hint: Python’s max function returns the maximum of two numbers.
  • The number_of_nodes method should return the total number of nodes in the tree.
  • The between function should return a sorted list of the elements in the tree whose values are between its two arguments. The list shouldn’t include elements which are equal to either argument (i.e., the range should be EXCLUSIVE on both sides). Don’t build this list by getting a list of all of the elements in the tree and then filtering the list–there’s a more efficient way to find elements in a particular range! Hint: think about what you should do if the root node’s value is 8 and you’re looking for elements between 9 and 20!

You will probably to want to use helper methods for all three of these. Please do so! Your helper methods should have docstrings and type annotations, but do not need separate tests.

You should include tests for all of your new BST methods in test_bst.py. You don’t need to test existing methods (although your tests will probably want to call the add method).

Part 2: You complete me

In this assignment you’re going to implement an autosuggestion system: given some text a user has typed so far (for instance, “ha”) you’ll provide the possible words they might be trying to type (for instance, “hat”, “hand”, etc.) You’re not trying to do any spelling correction–you’re just predicting the word based on prefixes.

In order to do this, you’re going to use a special kind of tree called a trie.

Tries

A trie (usually pronounced “try”) is a tree where the nodes of the tree represent prefixes of words. Each node has a dictionary from characters to its children; each child is a longer prefix of the same word, formed by adding a character.

Let’s take a look at an example:

This trie includes the words: “hat”, “hand”, “a”, and “at”. To create a trie like this, we’d start with a root node that has no children.

When adding the word “hat”, we would check if the root node’s dictionary of children contains an entry for h. Since it doesn’t, we create a node for h and add it to the root node’s dictionary (under the key "h". Then, we create a node for ha and add it to the h node’s dictionary (under the key "a"). Finally, we create a child for hat and add it to the ha node’s dictionary (under the key "t"). We also need to mark that, unlike h and ha, hat represents a complete word. Words are marked with red circles in the diagram above.

When we add “hand” after adding “hat”, we once again check if h is as a child of the root node (i.e., if it has an entry in its dictionary of children for "h"). It does, so we don’t need to add a new node. Now we look at the h node. Does h have an ha child node (i.e., does it have an entry in its child dictionary for "a"? It does! Does ha have an han child node? Nope, not yet. Therefore, we add han as a child of ha, and then hand as a child of han. We mark the hand node as a word.

Once we add “a” and “hat”, the trie looks like the picture above. Notice that prefixes of words can themselves be words–for instance, “a” is both a word and a prefix of “at”.

Implementing autocomplete

We’ve provided a partial implementation of a trie in trie.py, including methods to insert a word and to determine if a word is present. You should read through and understand the provided code, because you’ll be writing similar code for the assignment!

A TrieNode is a node within the Trie. TrieNodes have a prefix (for instance, "" or "h" or "ha" or "hat") and a dictionary of children. They also have a word field, which will be set to True if the prefix represents a word in the trie.

The Trie class implements the actual trie data structure. Tries are initialized with a single root node containing the empty string as the prefix (the empty string is a prefix of every word). The insert method adds a word to the trie by adding nodes if necessary and then marking the word’s final node as a word. The contains method determines whether a word is present in the trie by walking down a path in the trie.

You’re going to fill in two methods to the trie in order to implement autocomplete:

  • words_below should return a list of all of the words in the trie below or at the provided node. For instance, if it’s called on the root node, it should return all of the words in the trie. If it’s called on the a node in the trie above, it should return "a" and "at". words_below returns a list of strings.
  • completions returns all of the words in the trie that start with the given prefix. This method should call words_below after walking down a path in the trie corresponding to the prefix. completions should return the empty list if there are no words in the trie that start with the prefix.

You should test your new trie methods in test_trie.py. Make sure to test all of the interesting cases!

We’ve included a program, autosuggest.py, that presents an interface to your autocompletion system. Play around with it and see how autocompletion systems work in practice!

Part 3: Autocomplete Ethics

Consider how search tree implementations can have ethical implications. Please respond with 2-3 thoughtful paragraphs for each of the following questions:

  1. Think about another instance machine learning-informed autocomplete is used — in communication platforms, web browsers, text processors, etc.

    For example, Google has undergone criticism and legal suits because of its Autocomplete feature — the company was convicted of defaming a French man by linking his name with sexual offenses, revealed the names of victims granted anonymity courts, spread unauthenticated rumors about the wife of a former German presidentss private life, and provided hateful autocomplete options for phrases relating to certain ethnic/religious/gender groups. Google Autocomplete was also originally called “Google Suggest.” Google renamed this technology in 2010, and now emphasizes that Autocomplete provides “predictions,” not “suggestions.”

    Discuss an ethics-related issue that may arise from an autocomplete feature. Some questions to think about: the parties/demographics/industries affected, the severity of this effect, who (if anyone) should be held accountable, who (if anyone) should be responsible for remediating the problem, ways to address the issue, etc. You can analyze Google Autocomplete or a different instance of autocomplete.

  2. You’re designing a product that has an autocomplete feature — how would your implementation of this feature mitigate the ethics issue you detailed in the previous question? Think about the implementation of the underlying algorithm, the data sets used to inform it, the visual components/design, manual measures to moderate the feature, etc.