MapReduce Programming and
Cluster Accessing Instructions
Gang Luo
Sept. 2, 2010
Dataflow
(K1, V1) → Map → (K2, V2) → Shuffle/Sort → (K2, List<V2>) → Reduce → (K3, V3)
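The dataflow can be simulated in memory to make the key/value types concrete. This is a minimal sketch in plain Python, not Hadoop code: `run_mapreduce`, `wc_map`, and `wc_reduce` are hypothetical names, and word count is used only as an illustrative job.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate the MapReduce dataflow in memory (no Hadoop involved)."""
    grouped = defaultdict(list)
    # Map phase: each (K1, V1) yields zero or more (K2, V2) pairs.
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            # Shuffle: collect values by intermediate key -> (K2, List<V2>).
            grouped[k2].append(v2)
    # Reduce phase: each (K2, List<V2>) yields zero or more (K3, V3) pairs.
    output = []
    for k2 in sorted(grouped):
        output.extend(reduce_fn(k2, grouped[k2]))
    return output

# Illustrative job: K1 = line offset, V1 = line text, K2 = K3 = word.
def wc_map(offset, line):
    for word in line.split():
        yield word, 1

def wc_reduce(word, counts):
    yield word, sum(counts)

lines = [(0, "a b a"), (6, "b c")]
print(run_mapreduce(lines, wc_map, wc_reduce))  # [('a', 2), ('b', 2), ('c', 1)]
```

On a real cluster the map and reduce calls run in parallel on different nodes and the shuffle moves data over the network; the sketch only mirrors the types at each stage.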
A Query Example
Table1
Year   Temperature   Air Quality   …
1998   87            2             …
1983   93            4             …
2008   90            3             …
2001   89            5             …
1965   97            4             …
…      …             …             …
SELECT Year, MAX(Temperature)
FROM Table1
WHERE AirQuality IN (0, 1, 4, 5, 9)
GROUP BY Year
Implementation in MapReduce
( 1998, 87, 2, … ) → Selection + Projection → ( 1998, 87 )
1998, [87, 94, 84, 87, 78] → Aggregation (MAX) → ( 1998, 94 )
Mapper
Reducer
Driver
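The three roles named above can be sketched for the example query. This is a plain-Python simulation under stated assumptions, not the Hadoop API: the comma-separated record layout `year,temperature,quality` and the sample rows are invented for illustration, and `mapper`, `reducer`, and `driver` are hypothetical names.

```python
from collections import defaultdict

GOOD_QUALITY = {0, 1, 4, 5, 9}  # quality codes accepted by the WHERE clause

def mapper(line):
    """Selection + projection: keep (year, temperature) for accepted quality codes."""
    # Assumed record layout: "year,temperature,quality"
    year, temperature, quality = (int(f) for f in line.split(",")[:3])
    if quality in GOOD_QUALITY:
        yield year, temperature

def reducer(year, temperatures):
    """Aggregation: emit the maximum temperature seen for the year."""
    yield year, max(temperatures)

def driver(lines):
    """Wire mapper and reducer together, grouping by year in between."""
    grouped = defaultdict(list)
    for line in lines:
        for year, temperature in mapper(line):
            grouped[year].append(temperature)
    results = {}
    for year in sorted(grouped):
        for key, value in reducer(year, grouped[year]):
            results[key] = value
    return results

# Invented sample rows; "1998,99,2" is dropped because quality 2 is not accepted.
sample = ["1998,87,1", "1998,94,4", "1998,99,2", "1983,93,4", "1965,97,4"]
print(driver(sample))  # {1965: 97, 1983: 93, 1998: 94}
```

In a Hadoop job the same roles are played by a mapper class, a reducer class, and a driver that configures and submits the job; the grouping step in `driver` is what the framework's shuffle does for you.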
Think more!
• What if we want to get the average temperature for a year?
• What if you are only interested in the temperature in Durham? (Assume the station ID at Durham is 212)
You may want to modify the code slightly so that it answers a different query.
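One possible direction for both questions, as a hedged sketch in plain Python rather than Hadoop code. It assumes a hypothetical record layout `station_id,year,temperature` with invented sample rows, and it leaves out the air-quality filter for brevity.

```python
from collections import defaultdict

DURHAM_STATION = 212  # station ID given in the question above

def mapper(line, station=None):
    """Emit (year, temperature); optionally keep only one station's records."""
    # Assumed record layout: "station_id,year,temperature"
    station_id, year, temperature = (int(f) for f in line.split(",")[:3])
    if station is None or station_id == station:
        yield year, temperature

def reducer(year, temperatures):
    """Average instead of MAX: sum all values, then divide by the count."""
    yield year, sum(temperatures) / len(temperatures)

def run(lines, station=None):
    """Group mapper output by year and apply the reducer to each group."""
    grouped = defaultdict(list)
    for line in lines:
        for year, temperature in mapper(line, station):
            grouped[year].append(temperature)
    return {y: avg for year in sorted(grouped)
            for y, avg in reducer(year, grouped[year])}

sample = ["212,1998,80", "212,1998,90", "300,1998,100", "212,2001,89"]
print(run(sample))                          # {1998: 90.0, 2001: 89.0}
print(run(sample, station=DURHAM_STATION))  # {1998: 85.0, 2001: 89.0}
```

Only the reducer changes for the average and only the mapper's selection changes for the station filter. Note that MAX can be applied to partial results and then combined, but an average of partial averages is generally wrong, so any partial aggregation would need to carry (sum, count) pairs.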
Hadoop Cluster
• Master node:
– hadoop21.cs.duke.edu
• Slave nodes:
– hadoop22.cs.duke.edu through hadoop36.cs.duke.edu
• Online job tracker *
– hadoop21.cs.duke.edu:50030
• Online HDFS info *
– hadoop21.cs.duke.edu:50070
* These pages cannot be accessed from outside the CS trusted network. Solution:
Now, let’s see how to compile and run a MapReduce job on the cluster.
Everything I will show you is covered by the instructions on the course website: http://www.cs.duke.edu/courses/fall10/cps216/Project/cluster_instruction