DiscFinder: A Data-Intensive Scalable Cluster Finder for Astrophysics

Bin Fu, Kai Ren, Julio López, Eugene Fink, and Garth Gibson
We have developed a distributed version of the Friends-of-Friends technique, a
standard astronomical method for identifying and analyzing clusters of galaxies. The
distributed procedure can process tens of billions of galaxies, which makes it
sufficiently powerful for modern astronomical datasets and cosmological simulations.
Friends-of-Friends Algorithm
• Two galaxies are “friends” if they are close to
each other; that is, the distance between
them is within a specific global threshold.
• The algorithm analyzes an undirected graph,
where galaxies are vertices and their
“friendships” are edges. It identifies the
connected components of the graph, which
serve as an approximation of gravitationally
bound clusters.
• The time complexity is O((n · log n)^1.5) for the
exact computation, and O(n) for an
approximate algorithm.
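The graph formulation above can be sketched in a few lines of Python. This is a minimal single-machine illustration, not the authors' implementation: the function name `friends_of_friends` is hypothetical, and the grid-hashing neighbor search is an assumption chosen to keep the example short. It makes no claim about matching the complexity bounds quoted above.

```python
from collections import defaultdict, deque
from itertools import product
from math import dist, floor


def friends_of_friends(points, threshold):
    """Return the connected components of the friendship graph.

    points    -- list of (x, y, z) tuples
    threshold -- global linking distance; two points are friends
                 if their Euclidean distance is <= threshold
    """
    # Hash every point into a grid cell whose side equals the threshold.
    # A friend of a point can only lie in the same cell or in one of the
    # 26 adjacent cells, so we never compare far-apart points.
    cells = defaultdict(list)
    for i, p in enumerate(points):
        cells[tuple(floor(c / threshold) for c in p)].append(i)

    def friends(i):
        cx, cy, cz = (floor(c / threshold) for c in points[i])
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            for j in cells.get((cx + dx, cy + dy, cz + dz), ()):
                if j != i and dist(points[i], points[j]) <= threshold:
                    yield j

    # Breadth-first search over the implicit graph, one component at a time.
    seen = [False] * len(points)
    clusters = []
    for start in range(len(points)):
        if seen[start]:
            continue
        seen[start] = True
        queue, cluster = deque([start]), []
        while queue:
            i = queue.popleft()
            cluster.append(i)
            for j in friends(i):
                if not seen[j]:
                    seen[j] = True
                    queue.append(j)
        clusters.append(sorted(cluster))
    return clusters
```

Each returned cluster is a sorted list of indices into `points`; an isolated galaxy forms a cluster of size one.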
[Figure: Galaxy clusters and space partitioning.]
Distributed Procedure
We have developed a Map-Reduce “wrapper” that
distributes the Friends-of-Friends computation
among multiple cores.
• Divide the space into cubes, where each cube
includes about the same number of galaxies, by
applying the kd-tree construction to a randomly
selected sample of galaxies.
• Apply a sequential Friends-of-Friends procedure to
find the clusters within each cube.
• Identify cross-cube “friendships” and merge the
respective clusters, using the union-find algorithm.
[Figure: Map-reduce wrapper. Diagram labels: list of galaxies; sampling of galaxies; splitting the space into cubes; partitioning the galaxies by cubes; clustering within cubes; cross-cube merging; galaxy clusters.]
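The three steps of the distributed procedure can be illustrated with a single-process Python sketch. It is only a stand-in for the Map-Reduce wrapper, written under several assumptions: all function names and parameters (`build_kd`, `locate`, `distributed_fof`, `depth`, `sample_size`) are hypothetical, the per-cube clustering uses a quadratic pairwise check for brevity, and cross-cube candidates are restricted to galaxies that lie within the threshold of a split plane.

```python
import random
from itertools import combinations
from math import dist


def build_kd(sample, depth, axis=0):
    """Split the sample at its median along cycling axes; a leaf is None."""
    if depth == 0 or len(sample) < 2:
        return None
    sample = sorted(sample, key=lambda p: p[axis])
    mid = len(sample) // 2
    nxt = (axis + 1) % 3
    return (axis, sample[mid][axis],
            build_kd(sample[:mid], depth - 1, nxt),
            build_kd(sample[mid:], depth - 1, nxt))


def locate(tree, p, t):
    """Return (cube id, flag); the flag is True if p lies within t of a
    split plane on its path, so it may have a friend in another cube."""
    cube, near = 1, False
    while tree is not None:
        axis, split, left, right = tree
        near = near or abs(p[axis] - split) <= t
        go_right = p[axis] >= split
        cube = 2 * cube + int(go_right)
        tree = right if go_right else left
    return cube, near


def find(parent, i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving
        i = parent[i]
    return i


def union(parent, i, j):
    ri, rj = find(parent, i), find(parent, j)
    if ri != rj:
        parent[max(ri, rj)] = min(ri, rj)


def distributed_fof(points, t, depth=2, sample_size=100, seed=0):
    # Step 1: build the space partition from a random sample of galaxies.
    rng = random.Random(seed)
    tree = build_kd(rng.sample(points, min(sample_size, len(points))), depth)
    cubes, boundary = {}, []
    for i, p in enumerate(points):
        cube, near = locate(tree, p, t)
        cubes.setdefault(cube, []).append(i)
        if near:
            boundary.append((i, cube))
    parent = list(range(len(points)))
    # Step 2: cluster within each cube (independent work per cube).
    for members in cubes.values():
        for i, j in combinations(members, 2):
            if dist(points[i], points[j]) <= t:
                union(parent, i, j)
    # Step 3: merge clusters linked by cross-cube friendships.
    for (i, ci), (j, cj) in combinations(boundary, 2):
        if ci != cj and dist(points[i], points[j]) <= t:
            union(parent, i, j)
    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(parent, i), []).append(i)
    return sorted(clusters.values())
```

Restricting step 3 to boundary galaxies is safe in this sketch: if two friends fall into different cubes, their paths diverge at some split plane, and both must lie within the threshold of that plane. Calling `distributed_fof` with `depth=0` puts every galaxy in one cube and reduces to a plain sequential run, which is convenient for checking the partitioned result.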
Performance
STRONG SCALABILITY
[Plot: time (minutes) versus number of cores.]
Dependency of the running time on
the number of available cores for 0.5
billion galaxies (dotted line), 1 billion
galaxies (solid line), and 15 billion
galaxies (dashed line).
WEAK SCALABILITY
[Plot: time (minutes) versus number of cores.]
Dependency of the running time on the
number of available cores, where the
input size is proportional to the number
of cores. We show results for 4 million
galaxies per core (dashed line) and 8
million galaxies per core (solid line).
HADOOP TUNING (IN PROGRESS)
We have further reduced
the running time by tuning
the related parameters of
the Hadoop framework.
[Figure: bar chart of running time in minutes before and after Hadoop tuning, for 15 billion galaxies on 256 cores.]