DocDiff
1 Department Presentation
We have an “Introduction to the Department” for newcomers at 8pm on Thursday, September 6 in MacMillan 117. It’ll be an hour long and will introduce you to numerous aspects of the department, with q&a and a student panel. Please attend!
2 TA Hours Form
Please go to Hours! Before each meeting, please be sure to use the form available on that page to indicate what you need help with. Thanks.
3 Introduction
[list: "The", "quick", "brown", "fox", "jumps"]
In order to compute the similarity between two documents, we associate each document with a mathematical vector, which here we will represent using a list of numbers. The indices of the vector correspond to words that are found in either document. The value at each index is how many times the corresponding word occurs in the document.This is called the bag of words model. It’s a “bag” because, like a set, order doesn’t matter, but the count does.
For example, the documents [list: "a", "b", "c"] and [list: "d", "d", "d", "b"] would result in vectors of length 4, accounting for all unique words ("a", "b", "c", and "d"):
| "a" |
| "b" |
| "c" |
| "d" | |
[list: "a", "b", "c"] |
| 1 |
| 1 |
| 1 |
| 0 |
[list: "d", "d", "d", "b"] |
| 0 |
| 1 |
| 0 |
| 3 |
Therefore, the two documents will respectively have the representations [list: 1, 1, 1, 0] and [list: 0, 1, 0, 3].
We assume that two words are the same if they have the same characters in the same order, ignoring upper- and lower-case (Pyret has functions to upper- or lower-case a string, and for sorting; you can look up these functions in the string and list libraries.)
We define the overlap between two documents, represented this way, to be proportional (\(\propto\)) to the dot product of these two document vectors:
\[overlap(\vec{d_1}, \vec{d_2}) \propto \vec{d_1} \cdot \vec{d_2}\]
To obtain a formula, we normalize this dot-product. We will choose a simple method, which is to divide by the squared magnitude of the larger vector:
\[overlap(\vec{d_1}, \vec{d_2}) = \frac{\vec{d_1} \cdot \vec{d_2}}{max(\|\vec{d_1}\|^2,\|\vec{d_2}\|^2)}\]
where the magnitude of a vector \(\vec{x}\), denoted as \(\|\vec{x}\|\), is given by \(sqrt(\vec{x} \cdot \vec{x})\). Observe that this means every document will have an overlap of 1 with itself, and any two documents that have no words in common will have overlaps of 0 with each other.
4 Assignment
fun overlap(doc1 :: List<String>, doc2 :: List<String>) -> Number: ... end |
Note that we are not asking you to consider efficiency of implementation.
5 Language Use
You may not use built-in sets; everything in the list library is permitted.
6 Background
You will find this chapter useful in learning to convert from Racket to Pyret, and this one useful for learning more about lists in Pyret.
7 Extra TA Hours
In addition to the listed hours, your TAs will be holding extra hours for this assignment at the following times:
Date |
| Hours |
| Where |
| Who |
Friday, 9/7 |
| 5pm - 7pm |
| CIT 201 |
| zespirit |
Friday, 9/7 |
| 7pm - 9pm |
| CIT 201 |
| msantoma |
Friday, 9/7 |
| 7pm - 9pm |
| CIT 201 |
| mlitt2 |
Please make sure to take advantage of the available hours—
8 Template Files
9 Handing In
You will submit two separate files, named docdiff-code.arr and docdiff-tests.arr.