Map Reduce
In simple terms, a list of <key, value> pairs is mapped into another list of <key, value> pairs, which is then grouped by key and reduced into a list of values.
For example:
[A] dataset1.txt: Hadoop was created by Doug Cutting and Mike Cafarella
[B] dataset2.txt: Cutting, who was working at Yahoo! at the time, named it after his son’s toy elephant
These two datasets [A and B] are the input, and the framework divides them into splits. Each split is read as key-value pairs: in this case the key is the offset/line number and the value is the content of the line at that offset/line number. The map function discards the line number and emits a (word, count) pair, with a count of 1, for each word in the input line (punctuation is stripped).
So the mapper output, with (line number, text) as the input to the map phase and (word, count) pairs as its output, is:
[(“Hadoop”, 1), (“was”, 1), (“created”, 1), (“by”, 1), (“Doug”, 1), (“Cutting”, 1), (“and”, 1), (“Mike”, 1), (“Cafarella”, 1), (“Cutting”, 1), (“who”, 1), (“was”, 1), (“working”, 1), (“at”, 1), (“Yahoo”, 1), (“at”, 1), (“the”, 1), (“time”, 1), (“named”, 1), (“it”, 1), (“after”, 1), (“his”, 1), (“son’s”, 1), (“toy”, 1), (“elephant”, 1)]
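The map step described above can be sketched in a few lines of Python. This is only an illustration, not Hadoop's API: `tokenize`, `map_line`, and the `records` list (with a placeholder offset of 0 for each line) are hypothetical names, and the tokenizer assumes a word is a run of letters and apostrophes.

```python
import re

def tokenize(line):
    # Assumption: a word is a run of letters and apostrophes,
    # so "Yahoo!" becomes "Yahoo" and "Cutting," becomes "Cutting".
    return re.findall(r"[A-Za-z'’]+", line)

def map_line(offset, line):
    """Ignore the offset key and emit one (word, 1) pair per word."""
    return [(word, 1) for word in tokenize(line)]

# Hypothetical input records: (offset, line text) for each dataset.
records = [
    (0, "Hadoop was created by Doug Cutting and Mike Cafarella"),
    (0, "Cutting, who was working at Yahoo! at the time, "
        "named it after his son's toy elephant"),
]

# Run the mapper over every record and flatten the results.
mapped = [pair for offset, line in records for pair in map_line(offset, line)]
print(mapped[:3])  # [('Hadoop', 1), ('was', 1), ('created', 1)]
```

Note that the mapper does no counting of its own: every occurrence of a word yields a separate pair with the value 1.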
The mapper output contains multiple key-value pairs with the same key. So before the reducer phase, the MapReduce framework consolidates all the values that share a key. The input to the reducer is therefore a set of (key, list of values) pairs. Below is the output from the shuffle phase:
{“Hadoop”: [1], “was”: [1,1], “created”: [1], “by”: [1], “Doug”: [1], “Cutting”: [1,1], “and”: [1], “Mike”: [1], “Cafarella”: [1],
“who”: [1], “working”: [1], “at”: [1,1], “Yahoo”: [1], “the”: [1], “time”: [1], “named”: [1], “it”: [1], “after”: [1], “his”: [1], “son’s”: [1], “toy”: [1], “elephant”: [1]}
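The grouping done by the shuffle phase can be sketched as follows. This is a single-process illustration with a hypothetical `shuffle` function; a real framework also partitions and sorts the keys across machines, which this sketch ignores.

```python
from collections import defaultdict

def shuffle(pairs):
    """Collect the values of all pairs that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

# A small excerpt of mapper output, used here as sample input.
pairs = [("was", 1), ("at", 1), ("was", 1), ("Yahoo", 1), ("at", 1)]
grouped = shuffle(pairs)
print(grouped)  # {'was': [1, 1], 'at': [1, 1], 'Yahoo': [1]}
```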
After the shuffle, the reducer takes the consolidated key-value input above, sums the list of intermediate values for each key, and produces the key and the sum as output:
[(“Hadoop”, 1), (“was”, 2), (“created”, 1), (“by”, 1), (“Doug”, 1), (“Cutting”, 2), (“and”, 1), (“Mike”, 1), (“Cafarella”, 1), (“who”, 1), (“working”, 1), (“at”, 2), (“Yahoo”, 1), (“the”, 1), (“time”, 1), (“named”, 1), (“it”, 1), (“after”, 1), (“his”, 1), (“son’s”, 1), (“toy”, 1), (“elephant”, 1)]
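As a rough sketch of the reduce step, the function below sums the grouped values for one key. The name `reduce_counts` and the small `grouped` sample are illustrative assumptions, not part of any Hadoop API.

```python
def reduce_counts(word, counts):
    """Sum the intermediate values for one key and emit (key, sum)."""
    return (word, sum(counts))

# A small excerpt of shuffle output, used here as sample input.
grouped = {"Hadoop": [1], "was": [1, 1], "Cutting": [1, 1], "at": [1, 1]}

# The reducer is called once per key, so each key appears once in the result.
result = [reduce_counts(word, counts) for word, counts in grouped.items()]
print(result)  # [('Hadoop', 1), ('was', 2), ('Cutting', 2), ('at', 2)]
```

Because the reducer runs once per distinct key, the final output can never contain the same word twice.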