DocDiff🔗

    1 What is This Assignment’s Purpose?

    2 Theme Song

    3 Definition

    4 Assignment

    5 Language Use

    6 Background

    7 A Warning

    8 Starter

    9 Socially Responsible Computing

    10 General Instructions

    11 Friendly Reminders

    12 Handing In

1 What is This Assignment’s Purpose?🔗

Similarity is a central problem in computer science. We often want to know whether two things are the same and, if not, how similar they are. Good similarity measures sit underneath numerous systems, from search engines to plagiarism checkers. In this assignment, we will study a classic similarity measure between documents.

2 Theme Song🔗

Ice Under Pressure - Vanilla Queen by Fanboy Films

3 Definition🔗

In this assignment we will define a document as a List<String>, with each string representing a word. Here’s an example of a document:

[list: "The", "quick", "brown", "fox", "jumps"]

In order to compute the similarity between two documents, we associate each document with a mathematical vector, which here we will represent using a list of numbers. The indices of the vector correspond to words that are found in either document. The value at each index is how many times the corresponding word occurs in the document.This is called the bag of words model. It’s a “bag” because, like a set, order doesn’t matter, but the count does.

For example, the documents [list: "a", "b", "c"] and [list: "d", "d", "d", "b"] would result in vectors of length 4, accounting for all unique words ("a", "b", "c", and "d"):

  

"a"

  

"b"

  

"c"

  

"d"

[list: "a", "b", "c"]

  

1

  

1

  

1

  

0

[list: "d", "d", "d", "b"]

  

0

  

1

  

0

  

3

Therefore, the two documents will respectively have the representations [list: 1, 1, 1, 0] and [list: 0, 1, 0, 3].

Note: When comparing two documents, the same length vector is used for both documents because the vector accounts for all the words across both documents.

We assume that two words are the same if they have the same characters in the same order, ignoring upper- and lower-case (Pyret has functions to upper- or lower-case a string, and for sorting; you can look up these functions in the string and list libraries.)

We define the overlap between two documents, represented this way, to be proportional (\(\propto\)) to the dot product of these two document vectors:

\[overlap(\vec{d_1}, \vec{d_2}) \propto \vec{d_1} \cdot \vec{d_2}\]

To obtain a formula, we normalize this dot-product. We will choose a simple method, which is to divide by the squared magnitude of the vector with the larger magnitude:

\[overlap(\vec{d_1}, \vec{d_2}) = \frac{\vec{d_1} \cdot \vec{d_2}}{max(\|\vec{d_1}\|^2,\|\vec{d_2}\|^2)}\]

where the magnitude of a vector \(\vec{x}\), denoted as \(\|\vec{x}\|\), is given by \(sqrt(\vec{x} \cdot \vec{x})\). Observe that this means every document will have an overlap of 1 with itself, and any two documents that have no words in common will have overlaps of 0 with each other.

4 Assignment🔗

Define a function

fun overlap(doc1 :: List<String>, doc2 :: List<String>) -> Number: ... end

where doc1 and doc2 are lists of strings and overlap returns a number. This function computes the overlap of two non-empty documents, defined by the formula above. You should not round any of the calculated numbers.

Note that we are not asking you to consider efficiency of implementation.

5 Language Use🔗

Recall from Assignments that you should not use any other built-in functions or libraries unless an assignment explicitly permits you to.

6 Background🔗

You will find this chapter useful in learning to convert from Racket to Pyret, and this one useful for learning more about lists in Pyret.

7 A Warning🔗

You’ll be reusing DocDiff later in the semester! Therefore, if you falter in this assignment, you’ll want to make sure you understand what you did wrong.

8 Starter🔗

Template

9 Socially Responsible Computing🔗

How well do plagiarism detectors actually work?

Read/view

This article is written by an expert who studies them. (You will probably not be able to access the full article from that link. If so, use this link instead. You may be asked to log into your Brown account. It will then use Brown’s proxy to get you full access.)

Write

Your homework is to go to this plagiarism checker and paste a paragraph from the article you just read into it. Then, tweak your snippet to try and trick the detector without changing the meaning of your snippet. Think of this as a testing exercise for the software!The author of the above article warns that the “free” tools mainly aim to get your text and hence information about you. Therefore, use with caution!

Produce a writeup answering the following questions: What is the most interesting tweak you were able to make that resulted in an unexpected (and maybe unwarranted) change in your plagiarism/uniqueness score? Considering the concerns brought up in the article and the results of your experiments, how can we best make use of plagiarism checkers?

Your writeup should be brief and crisp. Be concrete by drawing on your experience. You can prepare your writeup in whatever editor you want; submit in text or PDF.

10 General Instructions🔗

Please keep these instructions in mind for the rest of the semester:
  • Many of your assignments will have a Socially Responsible Computing (SRC) component, as above. In each case, we will expect you to provide some proof of having read/watched the work. A separate document provides guidance on SRC assignments.

  • The starter template (as above) gives names for files. Please don’t change these! The autograder is expecting the same names, so you know how that goes.

  • All submissions will be through Gradescope. Remember to sign up anonymously.

11 Friendly Reminders🔗

Since this is your first assignment, we want to provide a few reminders here:
  • The textbook is free and online. It can be very helpful in reviewing lecture content and finding examples if you get stuck.

  • The Pyret Documentation is online.

  • You are expected to follow the practices of the Pyret style guide.

  • The course staff is here to help you! If you have any questions, make sure to come to hours [Staff] or post on EdStem.

  • Remember to test your functions thoroughly!

12 Handing In🔗

You will submit three separate files, named docdiff-code.arr, docdiff-tests.arr, and docdiff-common.arr (which may effectively be empty).