MapReduce Programming and Cluster Accessing Instructions Gang Luo



Sept. 2, 2010

Dataflow

(K1, V1) -> map -> (K2, V2) -> shuffle -> (K2, List<V2>) -> reduce -> (K3, V3)
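The dataflow above can be sketched in a few lines of plain Python. This is only an illustration of how the key/value types move through the stages, not the Hadoop API: `run_mapreduce`, `wc_map`, and `wc_reduce` are hypothetical names, and word counting is used purely as a familiar example.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate the map -> shuffle -> reduce dataflow in a single process."""
    # Map: each (K1, V1) record yields zero or more (K2, V2) pairs.
    # Shuffle: values are grouped by key, giving (K2, List<V2>).
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)

    # Reduce: each (K2, List<V2>) group yields zero or more (K3, V3) pairs.
    output = []
    for k2 in sorted(groups):
        output.extend(reduce_fn(k2, groups[k2]))
    return output


def wc_map(offset, line):
    # (K1, V1) = (position of the line, line text); emit (word, 1) per word.
    for word in line.split():
        yield word, 1


def wc_reduce(word, counts):
    # (K2, List<V2>) = (word, [1, 1, ...]); emit (word, total).
    yield word, sum(counts)


if __name__ == "__main__":
    lines = [(0, "a b a"), (6, "b a")]
    print(run_mapreduce(lines, wc_map, wc_reduce))  # [('a', 3), ('b', 2)]
```

On a real cluster the map and reduce calls run in parallel on different nodes and the shuffle moves data over the network. The sketch keeps only the shape of the computation.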

A Query Example

Table1

Year   Temperature   Air Quality
1998   87            2
1983   93            4
2008   90            3
2001   89            5
1965   97            4
..

SELECT Year, MAX(Temperature)
FROM Table1
WHERE AirQuality IN (0, 1, 4, 5, 9)
GROUP BY Year

Implementation in MapReduce

Selection + Projection: ( 1998, 87, 2, … ) -> ( 1998, 87 )

Aggregation (MAX): ( 1998, [87, 94, 84, 87, 78] ) -> ( 1998, 94 )

Mapper

Reducer

Driver
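The three roles named above can be sketched for the example query in plain Python. This is a minimal single-process simulation under stated assumptions, not Hadoop code: the function names, the tuple layout `(year, temperature, air_quality)`, and the in-memory grouping are all assumptions made for illustration. The sample rows are the five rows shown for Table1.

```python
from collections import defaultdict

# Air-quality codes accepted by the WHERE clause of the example query.
GOOD_QUALITY = {0, 1, 4, 5, 9}


def mapper(record):
    """Selection + projection: keep good-quality rows, emit (year, temperature)."""
    year, temperature, air_quality = record
    if air_quality in GOOD_QUALITY:
        yield year, temperature


def reducer(year, temperatures):
    """Aggregation: emit the maximum temperature seen for one year."""
    yield year, max(temperatures)


def driver(records):
    """Wire mapper and reducer together and run the job in-process.

    In Hadoop the driver only configures and submits the job; here it also
    plays the role of the framework by grouping mapper output by key.
    """
    groups = defaultdict(list)
    for record in records:
        for year, temperature in mapper(record):
            groups[year].append(temperature)

    results = []
    for year in sorted(groups):
        results.extend(reducer(year, groups[year]))
    return results


TABLE1 = [
    (1998, 87, 2),
    (1983, 93, 4),
    (2008, 90, 3),
    (2001, 89, 5),
    (1965, 97, 4),
]

if __name__ == "__main__":
    print(driver(TABLE1))  # [(1965, 97), (1983, 93), (2001, 89)]
```

Rows with air quality 2 and 3 are dropped by the mapper, so 1998 and 2008 do not appear in the output for these five rows.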

Think more!

• What if we want to get the average temperature for a year?

• What if you are only interested in the temperature in Durham? (Assume the station ID for Durham is 212.)

You can change the code slightly to answer a different query.
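One possible way to approach both questions, sketched in plain Python rather than Hadoop: the reducer computes an average instead of a maximum, and the mapper optionally filters on station ID. The record layout `(station_id, year, temperature, air_quality)`, the function names, and the sample rows are hypothetical. Only the station ID 212 comes from the slide.

```python
from collections import defaultdict

GOOD_QUALITY = {0, 1, 4, 5, 9}
DURHAM_STATION_ID = 212  # station ID given on the slide


def mapper(record, station_filter=None):
    """Emit (year, temperature) for good-quality rows, optionally one station only."""
    station_id, year, temperature, air_quality = record
    if station_filter is not None and station_id != station_filter:
        return
    if air_quality in GOOD_QUALITY:
        yield year, temperature


def reducer(year, temperatures):
    """Emit the average temperature for one year."""
    temps = list(temperatures)
    yield year, sum(temps) / len(temps)


def run(records, station_filter=None):
    """Group mapper output by year, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for year, temperature in mapper(record, station_filter):
            groups[year].append(temperature)
    results = []
    for year in sorted(groups):
        results.extend(reducer(year, groups[year]))
    return results


# Hypothetical sample rows: (station_id, year, temperature, air_quality).
SAMPLE = [
    (212, 1998, 80, 1),
    (212, 1998, 90, 4),
    (300, 1998, 100, 0),
    (212, 2001, 89, 5),
]

if __name__ == "__main__":
    print(run(SAMPLE))                     # [(1998, 90.0), (2001, 89.0)]
    print(run(SAMPLE, DURHAM_STATION_ID))  # [(1998, 85.0), (2001, 89.0)]
```

One design point worth noting: unlike MAX, an average of partial averages is not the overall average, so a version that pre-aggregates before the reducer would need to carry (sum, count) pairs instead of averages.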

Hadoop Cluster

• Master node:

– hadoop21.cs.duke.edu

• Slave nodes

– hadoop22.cs.duke.edu to hadoop36.cs.duke.edu

• Online job tracker *

– hadoop21.cs.duke.edu:50030

• Online HDFS info *

– hadoop21.cs.duke.edu:50070

* You cannot access these pages from outside the CS trusted network. Solution:

Now, let’s see how to compile and run a MapReduce job on the cluster.

What I will be showing you is covered by the instructions at the course website: http://www.cs.duke.edu/courses/fall10/cps216/Project/cluster_instruction
