DiscFinder: A Data-Intensive Scalable Cluster Finder for Astrophysics

Bin Fu, Kai Ren, Julio López, Eugene Fink, and Garth Gibson

We have developed a distributed version of the Friends-of-Friends technique, which is a standard astronomical application for identifying and analyzing clusters of galaxies. The distributed procedure can process tens of billions of galaxies, which makes it sufficiently powerful for modern astronomical datasets and cosmological simulations.

Friends-of-Friends Algorithm

• Two galaxies are “friends” if they are close to each other; that is, the distance between them is within a specific global threshold.
• The algorithm analyzes an undirected graph, where galaxies are vertices and their “friendships” are edges. It identifies the connected components of the graph, which serve as an approximation of gravitationally bound clusters.
• The time complexity is O((n · log n)^1.5) for the exact computation, and O(n) for an approximate algorithm.

Figure: Galaxy clusters and space partitioning.

Distributed Procedure

We have developed a Map-Reduce “wrapper” that distributes the Friends-of-Friends computation among multiple cores.

• Divide the space into cubes, where each cube includes about the same number of galaxies, by applying the kd-tree construction to a randomly selected sample of galaxies.
• Apply a sequential Friends-of-Friends procedure to find the clusters within each cube.
• Identify cross-cube “friendships” and merge the respective clusters, using the union-find algorithm.

Figure: Map-reduce wrapper. Stages: list of galaxies, sampling of galaxies, splitting the space into cubes, partitioning the galaxies by cubes, clustering within cubes, cross-cube merging, galaxy clusters.
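The three steps of the distributed procedure can be illustrated with a small single-machine sketch. It is not the authors' Hadoop implementation: the function names, the fixed tree depth, the brute-force pair tests, and the use of one shared union-find structure for both the per-cube and cross-cube steps are all assumptions made to keep the example short. The threshold `tau` stands for the global friendship distance.

```python
import random
from collections import defaultdict
from itertools import combinations
from math import dist


class UnionFind:
    """Disjoint-set forest with path halving and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def kd_split(sample, depth, max_depth):
    """Build a kd-tree over the sample; each leaf (None) is one cube."""
    if depth == max_depth or len(sample) < 2:
        return None
    axis = depth % 3
    pts = sorted(sample, key=lambda p: p[axis])
    mid = len(pts) // 2
    return (axis, pts[mid][axis],
            kd_split(pts[:mid], depth + 1, max_depth),
            kd_split(pts[mid:], depth + 1, max_depth))


def locate(tree, p, tau):
    """Return the cube id of p and whether p lies within tau of a cut plane."""
    path, near = [], False
    while tree is not None:
        axis, cut, left, right = tree
        near = near or abs(p[axis] - cut) <= tau
        go_right = p[axis] >= cut
        path.append(go_right)
        tree = right if go_right else left
    return tuple(path), near


def disc_finder_sketch(points, tau, sample_size=64, max_depth=3, seed=0):
    """Cluster 3-D points with a partitioned Friends-of-Friends sketch."""
    # Step 1: split space using a kd-tree built on a random sample.
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    tree = kd_split(sample, 0, max_depth)

    cubes, boundary = defaultdict(list), []
    for i, p in enumerate(points):
        cube, near = locate(tree, p, tau)
        cubes[cube].append(i)
        if near:
            boundary.append((cube, i))

    # Step 2: sequential Friends-of-Friends inside each cube
    # (naive all-pairs test, a stand-in for a real sequential code).
    uf = UnionFind(len(points))
    for members in cubes.values():
        for i, j in combinations(members, 2):
            if dist(points[i], points[j]) <= tau:
                uf.union(i, j)

    # Step 3: cross-cube friendships can only involve boundary points;
    # merge the clusters they connect with union-find.
    for (ca, i), (cb, j) in combinations(boundary, 2):
        if ca != cb and dist(points[i], points[j]) <= tau:
            uf.union(i, j)

    clusters = defaultdict(list)
    for i in range(len(points)):
        clusters[uf.find(i)].append(i)
    return sorted(sorted(c) for c in clusters.values())
```

Only points within `tau` of a splitting plane can have a friend in another cube, which is why step 3 restricts its search to the `boundary` list. Setting `max_depth=0` puts every point in a single cube, which gives a convenient reference result for checking the partitioned run.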
Performance

STRONG SCALABILITY

Figure: Running time (minutes) versus number of cores. Dependency of the running time on the number of available cores for 0.5 billion galaxies (dotted line), 1 billion galaxies (solid line), and 15 billion galaxies (dashed line).

WEAK SCALABILITY

Figure: Running time (minutes) versus number of cores (8 to 256). Dependency of the running time on the number of available cores, where the input size is proportional to the number of cores. We show results for 4 million galaxies per core (dashed line) and 8 million galaxies per core (solid line).

HADOOP TUNING (IN PROGRESS)

We have further reduced the running time by tuning the related parameters of the Hadoop framework.

Figure: Running time (minutes) before and after tuning, for 15 billion galaxies on 256 cores.
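For readers comparing the two scalability experiments, the standard efficiency measures can be written as a short sketch. The function names are hypothetical, and the timings in the usage lines are made-up placeholders, not measurements from this work.

```python
def strong_scaling_efficiency(base_cores, base_time, cores, time):
    """Fixed input size: ideal running time falls in proportion to 1/cores."""
    return (base_cores * base_time) / (cores * time)


def weak_scaling_efficiency(base_time, time):
    """Input grows with the core count: ideal running time stays constant."""
    return base_time / time


# Placeholder timings in minutes (hypothetical, NOT taken from the figures).
print(strong_scaling_efficiency(base_cores=8, base_time=100.0, cores=16, time=60.0))
print(weak_scaling_efficiency(base_time=10.0, time=12.5))
```

An efficiency of 1.0 means ideal scaling in either setting; values below 1.0 indicate overhead such as data shuffling or the cross-cube merge.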