Doc  Diff
1 Introduction
2 Assignment
3 Language
4 Handing In

DocDiff

1 Introduction

Consider the problem of comparing two text documents. Why might you want to do this? Perhaps you want to check for plagiarism; search for articles similar to a particular one you’re studying; or have uncovered a new manuscript and want to know whether it’s a legitimate Shakespeare or a fake. All these require being able to determine the similarity between documents. One way to model this similarity is as a distance metric, analogous to how we compute the distance between points in space.

With each document, we associate a mathematical vector, which here we will represent using a list. The indices of the vector are the words that are found in either document. The value at each index is how many times that word occurs in the document. Because we are comparing documents, we will assume that two words are the same if they have the same characters in the same order, ignoring upper- and lower-case (Racket has functions like string-ci=? for case-insensitive string-equality and sort for sorting; you can look up these and other functions).A smarter version of this program would ignore case for some words but not for ones that might also be proper nouns. Our distance measure is defined to be proportional to the dot product of these two document vectors:

\[dist(\vec{d_1}, \vec{d_2}) \propto \vec{d_1} \cdot \vec{d_2}\]

To obtain a formula, we normalize this dot-product. We will choose a simple method which is to divide by the squared magnitude of the larger vector:

\[dist(\vec{d_1}, \vec{d_2}) = \frac{\vec{d_1} \cdot \vec{d_2}}{max(\|\vec{d_1}\|^2,\|\vec{d_2}\|^2)}\]

where the magnitude of a vector \(\vec{x}\), denoted as \(\|\vec{x}\|\), is given by \(sqrt(\vec{x} \cdot \vec{x})\). Observe that this means every document will have a distance of 1 from itself, and any two documents that have no words in common will have distances of 0 from each other.

2 Assignment

Define a function

(define (distance doc1 doc2) ...)

where doc1 and doc2 are lists of strings and distance returns a number. This function computes the distance between two non-empty documents. Each document is a list of strings, with each string representing a word. Here’s an example of a document:

(list "The" "quick" "brown" "fox" "jumps" "over" "the" "lazy" "dog")

Note that we are not asking you to consider efficiency of implementation.

3 Language

Make sure you’re in the Intermediate Student language. To change it, go to the Language menu and select Choose Language... .

4 Handing In

Using the file navigator, go to your cs019 directory within your home directory. Make a new directory called docdiff. Put your solution, which should be named docdiff.rkt, inside that directory. Open the terminal and paste the following two commands (on separate lines, press enter after each):

cd ~/course/cs019/docdiff

cs019_supp_handin docdiff