Experiences on Processing Spatial Data with MapReduce

advertisement
Experiences on Processing Spatial
Data with MapReduce
Ariel Cary, Zhengguo Sun, Vagelis Hristidis, Naphtali Rishe
Florida International University
School of Computing and Information Sciences
11200 SW 8th St, Miami, FL 33199
{acary001,sunz,vagelis,rishen}@cis.fiu.edu
Sponsored by: NSF Cluster Exploratory (CluE)
Agenda
Introduction
2. Solving Spatial Problems in MapReduce
1.
•
•
R-Tree Index Construction
Aerial Image Processing
Experimental Results
4. Related Work
5. Conclusion
3.
2
Florida International University
Introduction
 Spatial databases mainly store:
 Raster data (satellite/aerial digital images), and
 Vector data (points, lines, polygons).
 Traditional sequential computing models may take excessive
time to process large and complex spatial repositories.
 Emerging parallel computing models, such as MapReduce,
provide a potential for scaling data processing in spatial
applications.
3
Florida International University
Introduction (cont.)
 MapReduce is an emerging massively parallel computing
model (Google) composed of two functions:
 Map: takes a key/value pair, executes some computation, and
emits a set of intermediate key/value pairs as output.
 Reduce: merges its intermediate values, executes some
computation on them, and emits the final output.
 In this work, we present our experiences in applying the
MapReduce model to:
 Bulk-construct R-Trees (vector) and
 Compute aerial image quality (raster)
4
Florida International University
Introduction (cont.)
 Apache's Hadoop
 Linux operating system
Proxy
Cloud
 XEN hypervisor
 Hadoop Distributed File
System (HDFS)
 SOCKS proxy server
 480 computers (nodes),
each half terabytes
storage
1
Internet
bin/hadoop dfs put
2
bin/hadoop jar
3
bin/hadoop dfs get
5
Florida International University
2. Solving Spatial Problems in
MapReduce
•
•
R-Tree Index Construction
Aerial Image Processing
MapReduce (MR) R-Tree Construction
 R-Tree Bulk-Construction
 Every object o in database D has two attributes:
 o.id - the object’s unique identifier.
 o.P - the object’s location in some spatial domain.
 The goal is to build an R-Tree index on D.
 MapReduce Algorithm
1.
2.
3.
7
Database partitioning function computation (MR).
A small R-Tree is created for each partition (MR).
The small R-Trees are merged into the final R-Tree.
Florida International University
Phase 1 – Partitioning Function
8
– Goal: compute f to assign objects of D into one of R possible
partitions
s.t.: inputs/outputs in computing partitioning function f.
Map and Reduce
–Function
R (ideally)Input:
equally-sized
partitions
are(Key,
generated
(Key, Value)
Output:
Value)
Map
(o.id, is
o.P)
(minimal
variance
acceptable). (C, U(o.P))
Reduce
(C, list(ui, i=1, .., L))
S´
– Objects close in the spatial domain are placed within the
Where:same partition.
• o is an solution:
spatial object in the database.
– Proposed
• C which is a constant that helps in sending Mappers' outputs to a
– Use
space-filling curve to map spatial coordinate
singleZ-order
Reducer.
(3%) intocurve,
an sorted
sequence.
• samples
U is a space-filling
e.g. Z-order
value.
S' is an array
containing
R-1that
splitting
points.the sequence in R
–• Collect
splitting
points
partition
ranges.
Florida International University
Phase 2 - R-Tree Construction in MR
 Mappers compute f() values for objects.
 Reducers compute an R-Tree for each group of objects with
identical f() value
MapReduce functions in constructing R-Trees.
Function
Map
Reduce
Input: (Key, Value)
(o.id, o.P)
(f(o.P), list(oi, i=1, .., A))
Output: (Key, Value)
(f(o.P), o)
tree.root
Where:
• o is an spatial object in the database.
• f is the partitioning function computed in phase 1.
• Tree.root is the R-Tree root node.
9
Florida International University
Phase 3 - R-Tree Consolidation
 sequential process
10
Florida International University
Image Processing in MapReduce
 Aerial Image Quality Computation
 Let d be a orthorectified aerial photography (DOQQ) file and t
be a tile inside d, d.name is d’s file name and t.q is the quality
information of tile t.
 The goal is to compute a quality bitmap for d.
 MapReduce Algorithm
 A customized InputFormatter partitions each DOQQ file d into
several splits containing multiple tiles.
 The Mappers compute the quality bitmap for each tile inside a
split.
 The Reducers merge all the bitmaps that belongs to a file d and
write them to an output file.
11
Florida International University
Image Processing in MapReduce
 MapReduce Algorithm
Input and output of map and reduce functions
Function
Map
Reduce
Input: (Key, Value)
(d.name+t.id, t)
(d.name, list(t.id,t.q))
Output: (Key, Value)
(d.name, (t.id,t.q))
Quality-bitmap of d
Where:
• d is a DOQQ file.
• t is a tile in d.
• t.q is the quality bitmap of t.
12
Florida International University
3. Experimental Results
Experimental Results: Setting
 Data Set
Table 4. Spatial data sets used in experiments*.
Problem
Data
set
Objects
Data size
(GB)
R-Tree
FLD
11.4 M
5
YPD
37 M
5.3
MiamiDade
482 files
52
Image
Quality
Description
Points of properties in the state of Florida.
Yellow pages directory of points of businesses mostly
in the United States but also in other countries.
Aerial imagery of Miami-Dade county, FL (3-inch
resolution)
* Data sets supplied by the High Performance Database Research Center at Florida International University
 Environment
 The cluster was provided by the Google and IBM Academic Cluster
Computing Initiative.
 The cluster contains around 480 computers running Hadoop - open
source MapReduce.
14
Florida International University
Experimental Results: R-Tree
 R-Tree Construction Performance Metrics
30.00
60.00
50.00
25.00
20.00
MR1
15.00
10.00
MR2
Time (min)
Time (min)
MR2
40.00
20.00
5.00
10.00
0.00
0.00
2
4
8
16
Reducers
(a) FLD data set
32
64
MR1
30.00
4
8
16
32
Reducers
(b) YPD data set
MapReduce job completion times for various number of reducers in phase-2 (MR2).
15
Florida International University
64
Experimental Results: R-Tree
 MapReduce R-Trees vs. Single Process (SP)
Objects per Reducer
Data set
R
FLD
2
5,690,419
12,183
172,776
4
4
2,845,210
6,347
172,624
4
8
16
32
64
1,422,605
711,379
355,651
177,826
2,235
2,533
2,379
1,816
173,141
162,518
173,273
173,445
4
4
3
3
SP
11,382,185
0
172,681
4
4
9,257,188
22,137
568,854
4
8
4,628,594
9,413
568,716
4
16
2,314,297
7,634
568,232
4
32
64
1,157,149
578,574
6,043
2,982
567,550
566,199
4
4
SP
37,034,126
0
587,353
5
YPD
16
Florida International University
Average
Consolidated R-Tree
Stdev
Nodes
Height
Experimental Results: Imagery
 Tile Quality Computation
25
4
Reduce
Map
3
Time (min)
Time (min)
20
3.5
15
10
Reduce
Map
2.5
2
1.5
1
5
0.5
0
0
4
8
16 32 64 128 256 512
Reducers
2
4
8
Size of data (GB)
(a) Fixed data size, variable Reducers
(b) Variable data size, fixed Reducers
Fig. 9. MapReduce job completion time for tile quality computation
17
Florida International University
16
4. Related Work
Related Work
 Previous works on R-Tree parallel construction faced intrinsic
distributed computing problems: load balancing, process
scheduling, fault tolerance, etc.
 Schnitzer and Leutenegger [16] proposed a Master-Client R-Tree,
where the data set is first partitioned using Hilbert packing sort
algorithm, then the partitions are declustered into a number of
processors, where individual trees are built. At the end, a master
process combines the individual trees into the final R-Tree.
 Papadopoulos and Manolopoulos [17] proposed a methodology for
sampling-based space partitionining, load balancing, and partition
assignment into a set of processors in parallely building R-Trees.
19
Florida International University
5. Conclusion
Conclusion
 We used the MapReduce model to solve two spatial problems
on a Google&IBM cluster:
 (a) Bulk-construction of R-Trees and
 (b) Aerial image quality computation
 MapReduce can dramatically improve task completion times.
Our experiments show close to linear scalability.
 Our experience in this work shows MapReduce has the
potential to be applicable to more complex spatial problems.
21
Florida International University
References
[1] Antonin Guttman: R-Trees: A Dynamic Index Structure for Spatial Searching. SIGMOD 1984:4757.
[2] NSF Cluster Exploratory Program: http://www.nsf.gov/pubs/2008/nsf08560/nsf08560.htm
[3] Google&IBM Academic Cluster Computing Initiative:
http://www.google.com/intl/en/press/pressrel/20071008_ibm_univ.html
[4] Apache Hadoop project: http://hadoop.apache.org
[6] Jeffrey Dean, Sanjay Ghemawat: MapReduce: Simplified data processing on large clusters. In
Proceedings of the 6th Conference on Symposium on Opearting Systems Design &
Implementation, USENIX Association, Volume 6, pp. 10-10, December 2004.
[12]
High Performance Database Research Center (HPDRC), Research Division of the Florida
International University, School of Computing and Information Sciences, University Park,
Telephone: (305) 348-1706, FIU ECS-243, Miami, FL 33199.
[16]
Schnitzer B., Leutenegger S.T.: Master-client R-trees: a new parallel R-tree architecture,
In Proceedings of the 11th International Conference on Scientific and Statistical Database
Management, pp. 68-77, August 1999.
[17]
Apostolos Papadopoulos, Yannis Manolopoulos: Parallel bulk-loading of spatial data,
Parallel Computing, Volume 29, Issue 10, pp. 1419 - 1444, October 2003.
22
Florida International University
Download