To give you a feel for software development efforts, you will do this project
in stages. At each stage you will be given slightly different and changing
requirements, and must turn in a solution that matches that stage. The link to
the next stage will be provided when you complete the previous stage. There are
three stages in all, ordered in increasing difficulty—
You should treat this assignment like an exam. Therefore, you should do this work without consulting course staff except for critical issues (broken links, possible assignment typo, etc.).
You should not try to peek ahead to the next stages through any means; doing so would be a violation of the Academic Code.
Unlike in previous assignments, you will be given only one chance to submit at each stage (like a company making a code release of a new version). So please submit with care.
We will be evaluating your code in part based on how effectively it can be reused in more general contexts.
Naruto Shippuden Opening 3 | Blue Bird by Ikimono-gakari
Imagine you have a collection of past tweets, and are given a new tweet. You want to find all the old tweets that are similar to the current one.
How do we test for “similarity”? Easy: we use DocDiff! That is, we will use overlap with some specified threshold (which will be in the range \([0, 1]\)).
data Tweet: tweet(author :: String, content :: String) end
consider only the alphanumeric characters (retaining both capital and lower-case letters) and spaces of each Tweet, and
keep the tweet intact, so when returning searched tweets, the entire text is present.
search :: Tweet, List<Tweet>, Number -> List<Tweet>
Your overlap function from DocDiff takes two lists of strings. To create the lists, split the content of a tweet into words. In the tweet, a space separates one word from the next.
The result is all past tweets that have an overlap of at least threshold (inclusive). The result should be sorted from the tweets with highest to those with lowest (up to threshold) overlap.
Here is your Template for Problem 1.
꧁༺ Stop Here! ༻꧂
Submit before continuing to read!
Previously, each old tweet was treated in isolation. In reality, many tweets are parts of threads, i.e., they have a parent tweet. Therefore, the Tweet structure now has a field, parent, of type Option<Tweet>. The none option indicates that the tweet has no parent.
You will again define search, with the same parameters. However, the way we perform the search is somewhat different, as defined below.
First, we define the notion of a tweet’s relevance. If a tweet has no parent, then its relevance is exactly its overlap. If a tweet does have a parent, then its relevance is \[(0.25 \times r_p) + (0.75 \times o_t)\] where \(r_p\) is the relevance of its parent and \(o_t\) is the overlap of the tweet itself with the new tweet.
The third argument to search is the relevance threshold (inclusive).
All tweets that satisfy the relevance threshold should be included in the result, which is again sorted in order of most to least relevant.
Finally, note that multiple tweets can have the same parent (because two tweets may have responded to the parent tweet). This should remind you of our discussion about DAGs. You need to eliminate duplicates in your response, where duplicates are defined by identical.
꧁༺ Stop Here! ༻꧂
Submit before continuing to read!
Here is your Template for Problem 2.
In the previous version, we had a somewhat awkward way of dealing with the true structure of tweets. This is because we defined tweets as ascending chains, which meant we could end up inadvertently traversing duplicates, which we had to then filter. A better structure would be to have a descending tree.
Therefore, we will change the definition of a Tweet, eliminating the parent field and instead having the field children, which is of type List<Tweet>.
The given list of old tweets is guaranteed to contain only and all the roots.
Below, we redefine what relevance means.
A tweet’s relevance is now conditionally defined depending on whether or not it has a parent.
If the tweet does not have a parent, its relevance is \[(0.75 \times o_t) + (0.25 \times s)\] where \(o_t\) is the overlap of the tweet itself with the new tweet, and \(s\) is a size ratio (defined below).
If the tweet does have a parent, its relevance is \[(0.20 \times r_p) + (0.60 \times o_t) + (0.20 \times s)\] where \(r_p\) is the relevance of its parent, \(o_t\) is the overlap of the tweet itself with the new tweet, and \(s\) is its size ratio.
What is the size ratio? It’s the ratio of the size of the current tweet’s subtree to the total size of all the trees provided. The size of a tweet’s tree is defined as 1 (for the tweet itself) plus the sum of the sizes of each of the child tweets. (The parent and other tweets do not count.)
Here is your Template for Problem 3.
Come up with a few pairs of queries that search for the same content on Youtube but, with subtle changes in wording, come up with results quite different in character. Since we can’t be sure to get the same results you do, tell us what the queries produce and why you think those outcomes are significantly different.