MapReduce

← prev up next →

MapReduce

1 Theme Song

2 Background

3 Your MapReduce

4 An Example Scenario: Word Counting

When working with partners, the collaboration policy applies to a pair as a whole, rather than to individuals. Indeed, you are required to collaborate.

We strongly recommend a pair-programming strategy called driver/navigator. The driver is the person with the keyboard; the navigator is the person discussing, reading each line as it’s written, and providing feedback. Every fifteen minutes, swap roles. It’s a really good habit to get used to.

Please turn in only one submission, agreeing with your partner who will do the submitting.

You may not use a late pass for this assignment!

You will have to implement your own version of the essence of Google’s MapReduce.MapReduce was one of the most important innovations in the 2000s. Dave Patterson, one of the world’s leading computer systems researchers, called it the first “instruction” for cloud computing.

One of the things you will discover is that their operator is horribly mis-named: it isn’t the composition of map and reduce. Nevertheless, we’ll stick to their naming to respect (unconventional) convention.

1 Theme Song

Maps by the Yeah Yeah Yeahs

2 Background

The MapReduce pattern allows you to divide problems into a series of independent, equivalent tasks that can be parallelized. Specifically, all the map tasks can be run at the same time, as can all of the reduce tasks, because the result of each task does not depend on any of the other tasks. Because your implementation will run on only a single processor, you won’t see any performance gains. However, working through the central idea should help familiarize you with it so you recognize applications that can use it later.

First, you will have to write a MapReduce framework. This framework will take a mapper and a reducer, along with input. It will apply the mapper to each input, collect the results, and pass them to the reducer. The support code will give you the following data structure:

data Tv-pair<A, B>:
| tv(tag :: A, value :: B)
end

This represents a tag-value pair, the fundamental unit of data for all MapReduce programs.

3 Your MapReduce

For your framework, you’ll need to define:

fun map-reduce<A, B, M, N, O>(
  input :: List<Tv-pair<A, B>>,
  mapper :: (Tv-pair<A, B> -> List<Tv-pair<M, N>>),
  reducer :: (Tv-pair<M, List<N>> -> Tv-pair<M, O>))
-> List<Tv-pair<M, O>>

The mapper goes through each tv-pair in the list of input pairs and separates them into tagged pairs. The entire list of pairs from the list of inputs (each Tv-pair<M, N> created by a mapper) is then organized by tag, such that the reducer function can consume each Tv-pair with a unique tag and a list as the corresponding value. That is, while the mapper function produces a List<Tv-pair<M, N>>, the reducer function takes a Tv-pair<M, List<N>>. You will have to figure out how to collect all of the N’s for each M so you can pass them to a reducer task. (Think about how the Design Recipe can help you here!)

4 An Example Scenario: Word Counting

Let’s consider this contract in terms of a concrete example. To help you with testing, the support code provides the mapper and reducer for one example: word count. In this example, we make a distinction between word tokens and word types. Word tokens are individual words (for “the words the words”, the word tokens are “the”, “words”, “the”, and “words”). Word types are unique words (for “the words the words”, the word types are “the” and “words”). The support code provides:

A word count mapper:
fun wc-map(
file :: Tv-pair<String, String>)
-> List<Tv-pair<String, Number>>
wc-map consumes a file, which is a Tv-pair with the filename as its tag and the file’s contents as its value. It produces a List<Tv-pair> where the tags are word tokens and the values are the corresponding counts of the tokens. wc-map produces the count 1 for every word token, because it counts words as independent tokens as it encounters them.
A word count reducer:
fun wc-reduce(
word-counts :: Tv-pair<String, List<Number>>)
-> Tv-pair<String, Number>
It takes a Tv-pair<String, List<Number>> in which the tag represents a word and the value is a list of word counts for that word (in this case, as we know, a list of 1s, though it won’t always be so simple!). It produces a Tv-pair<String, Number> in which the tag is a word, and the value is the number of times that word appeared in all of the input.

It’s the reducer’s job to sum the word token counts into a word type count. wc-map may be run independently on each file, and wc-reduce may be run independently on each word. This is the division of the problem into a series of equivalent tasks which can be run at the same time, and from here we gain the performance advantage over the undivided problem. MapReduce would allow the words from the files to be read into the mapper, and parse that output for use to the reducer.

In review, for word count, MapReduce would:

Take in a list of documents.
Tokenize each word.
Find all instances of each word among all the documents and combine them into one Tv-pair per word.
Pass these Tv-pairs of each word and its list of word counts to the reducer.
Produce a word count for each word among all documents.

Once you and your partner are satisfied with your map-reduce, use it to solve the following problems.

5 Anagrams

Your input consists of a List<Tv-pair<String, String>> in which the tag is the file name, and the value is the string contents of the file (like in the word count example). For example, one item that could be part of an input would be tv("words.txt", "star rats tarts"). Your final output should be a List<Tv-pair<String, List<String>>> in which each tag represents a unique anagram group, and the value is a list of unique words that appear in the input that are all anagrams of each other. We will define anagrams to be a set of words that are made up of the same set of characters (case-sensitive), with the same number of each character. Your functions should be named anagram-map and anagram-reduce.

Note that you are responsible for creating your own tagging convention to uniquely label each anagram group as you see fit. This also means that, when writing public tests (as opposed to ones internal to your implementation), you can’t expect a particular tagging convention!

6 Nile 2.0

If you thought you were done with Nile, then you must be in de-nile!

recommend(title :: String,
book-records :: List<Tv-pair<String, String>)
-> Tv-pair<Number, List<String>>

takes a book title and a list of tagged files (like in Anagrams, except formatted according to the Nile spec) and produces a Tv-pair with the highest number of times the book was paired with one or more other books, and a list of those books.

popular-pairs(book-records :: List<Tv-pair<String, String>>)
-> Tv-pair<Number, List<String>>

takes a list of tagged files and produces a Tv-pair with the highest number of times any two books were paired with each other, and a list of the corresponding pairs.

Pairs should be formatted according to the original Nile spec.

7 Clarifications

The output from your MapReduce functions can be in any order, so you cannot hard code your test cases. We are providing you with the following four functions to compare the output with a hard-coded solution:

fun count<A>(target :: A, a :: List<A>, eq-checker :: (A, A -> Boolean))
  -> Number:
  doc: "counts quantity of target in a"
  fun el-checker(el, cnt):
    if eq-checker(el, target):
      cnt + 1
    else:
      cnt
    end
  end
  a.foldl(el-checker, 0)
where:
  count(3, [list: 1, 2, 3], lam(x, a): x == a end) is 1
  count(3, [list: 3, 2, 3], lam(x, a): x == a end) is 2
  count(4, [list: 1, 2, 3], lam(x, a): x == a end) is 0
end

fun lst-same-els<A>(
    a :: List<A>,
    b :: List<A>,
    eq-checker :: (A, A -> Boolean))
  -> Boolean:
  doc: "checks whether two lists have the same elements in the same quantity"
  fun same-count(el, acc):
    acc and (count(el, a, eq-checker) == count(el, b, eq-checker))
  end
  (a.length() == b.length()) and a.foldl(same-count, true)
where:
  lst-same-els([list: 1, 2, 3], [list: 1, 2, 3], lam(x, a): x == a end)
    is true
  lst-same-els([list: 1, 2, 3], [list: 1, 2, 3], lam(x, a): x == a end)
    is true
  lst-same-els([list: 1, 2, 3], [list: 2, 1, 3], lam(x, a): x == a end)
    is true
  lst-same-els([list: 1, 2, 2], [list: 2, 1], lam(x, a): x == a end)
    is false
  lst-same-els([list: 1, 2, 2], [list: 2, 1, 1], lam(x, a): x == a end)
    is false
end

fun recommend-equiv(
    t1 :: Tv-pair<Number, List<String>>,
    t2 :: Tv-pair<Number, List<String>>)
  -> Boolean:
  doc: "checks whether two recommenations are equivalent"
  (t1.tag == t2.tag) and lst-same-els(t1.value, t2.value, lam(x, y): x == y end)
where:
  recommend-equiv(tv(0, empty), tv(0, empty)) is true
  recommend-equiv(tv(1, [list: "a"]), tv(1, [list: "a"])) is true
  recommend-equiv(tv(1, [list: "a", "b"]), tv(1, [list: "a", "b"]))
    is true
  recommend-equiv(tv(1, [list: "b", "a"]), tv(1, [list: "a", "b"]))
    is true
  recommend-equiv(tv(1, [list: "b", "a"]), tv(2, [list: "a", "b"]))
    is false
end

fun popular-equiv(
    t1 :: Tv-pair<Number, List<String>>,
    t2 :: Tv-pair<Number, List<String>>)
  -> Boolean:
  doc: "checks whether two popular pairs are equivalent"
  fun equiv-parse(str :: String, delim :: String) -> List:
    if string-length(str) < string-length(delim):
      empty
    else if string-substring(str, 0, string-length(delim)) == delim:
      substr = string-substring(str, string-length(delim), string-length(str))
      parse(substr, delim)
    else:
      split = string-split-at(str, delim)
      link(split.word, parse(split.rest, delim))
    end
  end
  (t1.tag == t2.tag) and lst-same-els(t1.value, t2.value,
    lam(str1, str2):
      lst-same-els(equiv-parse(str1, "+"), equiv-parse(str2, "+"),
        lam(x, y): x == y end)
    end)
where:
  popular-equiv(tv(0, empty), tv(0, empty)) is true
  popular-equiv(tv(1, [list: "a+b"]), tv(1, [list: "a+b"])) is true
  popular-equiv(tv(1, [list: "a+b"]), tv(1, [list: "b+a"])) is true
  popular-equiv(tv(1, [list: "a+b", "b+c"]), tv(1, [list: "b+c", "a+b"]))
    is true
  popular-equiv(tv(1, [list: "a+b", "c+b"]), tv(1, [list: "b+c", "a+b"]))
    is true
  popular-equiv(tv(2, [list: "a+b", "c+b"]), tv(1, [list: "b+c", "a+b"]))
    is false
end

You are welcome to use these functions in your test suites. You should use the lst-same-els function to help you write tests for the Anagrams section of the assignment, and you should use the recommend-equiv and popular-equiv functions to help you write tests for the Nile section of the assignment.

You should not modify these functions, nor should you use them in your implementations.

8 Analysis

How would JoinLists be useful for MapReduce? (One paragraph or less.)
Give an example of another problem that you could use MapReduce to solve beyond Anagrams and Nile. Provide the contracts for your input data, mapper, and reducer.

9 Template Files

Feel free to write your tests for map-reduce using wc-mapper and wc-reducer. Note that you should be testing functions in an implementation-independent manner—be wary of anagram-map, anagram-reduce, and the Nile functions in particular! As a hint, you may not be able to test anagram-map and anagram-reduce individually; how could you observe their behavior as a whole?

Implementation

Examplar

10 Handing In

Please have only one person from each pair submit the form.

Final Submission

← prev up next →

1	Anticipated Frequent Questions
2	README
3	Learning Goals, Assessments, and Time Allocation
4	Syllabus and Course Policies
5	Late Policy
6	Laptop Policy
7	Diversity and Professionalism
8	Assignments
9	(Early) Testing for Programming
10	Textbook
11	Software
12	Staff and Contact
13	How to Ask Questions (and Report Bugs)
14	Credits

	Doc Diff
	Nile
	Sortacle
	Data Scripting
	Oracle
	Filesystem
	Updater
	Continued Fractions
	Join Lists
	Map Reduce
	Tour Guide
	MST
	Fluid Images
	Final Quiz
8.1	24

1	Theme Song
2	Background
3	Your Map Reduce
4	An Example Scenario: Word Counting
5	Anagrams
6	Nile 2.0
7	Clarifications
8	Analysis
9	Template Files
10	Handing In