MapReduce
When working with partners, you should still follow the collaboration policy, but treating the pair as an individual. Also, turn in only one submission, agreeing with your partner who will do the submitting.
You will have to implement your own version of the essence of Google’s MapReduce.MapReduce was one of the most important innovations in the 2000s. Dave Patterson, one of the world’s leading computer systems researchers, called it the first “instruction” for cloud computing.
One of the things you will discover is that their operator is horribly mis-named: it isn’t the composition of map and reduce. Nevertheless, we’ll stick to their naming to respect (unconventional) convention.
The MapReduce pattern allows you to divide problems into a series of independent, equivalent tasks that can be parallelized. Specifically, all the map tasks can be run at the same time, as can all of the reduce tasks, because the result of each task does not depend on any of the other tasks. Because your implementation will run on only a single processor, you won’t see any performance gains. However, working through the central idea should help familiarize you with it so you recognize applications that can use it later.
data Tv-pair<A, B>:
| tv(tag :: A, value :: B)
end
1 Your MapReduce
fun<A, B, M, N, O> map-reduce(
input :: List<Tv-pair<A, B>>,
mapper :: (Tv-pair<A, B> -> List<Tv-pair<M, N>>),
reducer :: (Tv-pair<M, List<N>> -> Tv-pair<M, O>))
-> List<Tv-pair<M, O>>
2 An Example Scenario: Word Counting
A word count mapper:
fun wc-map(
file :: Tv-pair<String, String>)
-> List<Tv-pair<String, Number>>
wc-map consumes a file, which is a Tv-pair with the filename as its tag and the file’s contents as its value. It produces a List<Tv-pair> where the tags are word tokens and the values are the corresponding counts of the tokens. wc-map produces the count 1 for every word token, because it counts words as independent tokens as it encounters them.
A word count reducer:
fun wc-reduce(
word-counts :: Tv-pair<String, List<Number>>)
-> Tv-pair<String, Number>
It takes a Tv-pair<String, List<Number>> in which the tag represents a word and the value is a list of word counts for that word (in this case, as we know, a list of 1s, though it won’t always be so simple!). It produces a Tv-pair<String, Number> in which the tag is a word, and the value is the number of times that word appeared in all of the input.
It’s the reducer’s job to sum the word token counts into a word type count. wc-map may be run independently on each file, and wc-reduce may be run independently on each word. This is the division of the problem into a series of equivalent tasks which can be run at the same time, and from here we gain the performance advantage over the undivided problem. MapReduce would allow the words from the files to be read into the mapper, and parse that output for use to the reducer.
Take in a list of documents.
Tokenize each word.
Find all instances of each word among all the documents and combine them into one Tv-pair per word.
Pass these Tv-pairs of each word and its list of word counts to the reducer.
Produce a word count for each word among all documents.
Once you and your partner are satisfied with your map-reduce, use it to solve the following problems.
3 Anagrams
Your input consists of a List<Tv-pair<String, String>> in which the tag is the file name, and the value is the string contents of the file (like in the word count example). For example, one item that could be part of an input would be tv("words.txt", "star rats tarts"). Your output should be a List<Tv-pair<String, List<String>>> in which each tag represents a unique anagram group, and the value is a list of unique words that appear in the input that are all anagrams of each other. We will define anagrams to be a set of words that are made up of the same set of characters (case-sensitive), with the same number of each character. Your functions should be named anagram-map and anagram-reduce.
Note that you are responsible for creating your own tagging convention to uniquely label each anagram group as you see fit.
4 Nile 2.0
If you thought you were done with Nile, then you must be in de-nile!
recommend(title :: String,
book-records :: List<Tv-pair<String, String>)
-> Tv-pair<Number, List<String>>
popular-pairs(book-records :: List<Tv-pair<String, String>>)
-> Tv-pair<Number, List<String>>
Pairs should be formatted according to the original Nile spec.
5 Analysis
How would JoinLists be useful for MapReduce? (One paragraph or less.)
Give an example of another problem that you could use MapReduce to solve beyond Anagrams and Nile. Provide the contracts for your input data, mapper, and reducer.
6 Template Files
Feel free to write your tests for map-reduce using
wc-mapper and wc-reducer. Note that you should be
testing functions in an implementation-independent manner—
7 Handing In
Please have only one person from each pair submit the form.