Friends-of-Friends (FoF) technique:
• Two galaxies are “friends” if the distance between them is below a chosen threshold (the linking length)
• We analyze an undirected graph, where galaxies are vertices and their “friendships” are edges
• We need to identify its connected components
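The graph view above can be sketched in a few lines of Python. This is a minimal brute-force illustration, not one of the algorithms cited on this slide: it compares all pairs of galaxies (quadratic time) and uses a small union-find to collect the connected components. The names `fof_clusters` and `linking_length` are hypothetical and chosen only for this example.

```python
from itertools import combinations
from math import dist


def find(parent, i):
    """Return the root of i, shortening the path along the way."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def fof_clusters(points, linking_length):
    """Return the connected components of the friendship graph as index lists."""
    parent = list(range(len(points)))
    # Every pair closer than the linking length is an edge; merge its endpoints.
    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= linking_length:
            ri, rj = find(parent, i), find(parent, j)
            if ri != rj:
                parent[rj] = ri
    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(parent, i), []).append(i)
    return sorted(clusters.values())


galaxies = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
print(fof_clusters(galaxies, linking_length=0.6))  # [[0, 1, 2], [3]]
```

Galaxies 0 and 2 are not direct friends, but both are friends of galaxy 1, so all three land in the same cluster.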
Sequential algorithms
• Exact: $O((n \cdot \log n)^{1.5})$
• Approximate: $O(n)$
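For intuition on how a sequential FoF can avoid comparing all pairs, here is a sketch of a common spatial-hashing trick. It is an assumption for illustration only and is not claimed to be either of the algorithms listed above: galaxies are bucketed into grid cells whose side equals the linking length, so friends can only sit in the same cell or in one of the 26 adjacent cells. The function name `friend_pairs` is hypothetical.

```python
from collections import defaultdict
from itertools import product
from math import dist, floor


def friend_pairs(points, linking_length):
    """Return all pairs (i, j), i < j, that lie within linking_length."""
    # Bucket each point into a grid cell with side equal to the linking length.
    grid = defaultdict(list)
    for idx, p in enumerate(points):
        cell = tuple(floor(c / linking_length) for c in p)
        grid[cell].append(idx)
    pairs = set()
    for cell, members in grid.items():
        # Friends can only be in the same cell or one of its 26 neighbours.
        for offset in product((-1, 0, 1), repeat=3):
            neighbour = tuple(c + o for c, o in zip(cell, offset))
            for i in members:
                for j in grid.get(neighbour, ()):
                    if i < j and dist(points[i], points[j]) <= linking_length:
                        pairs.add((i, j))
    return sorted(pairs)


galaxies = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
print(friend_pairs(galaxies, linking_length=0.6))  # [(0, 1), (1, 2)]
```

When each cell holds only a few galaxies, the work per galaxy is roughly constant; dense regions still degrade toward pairwise comparison.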
Distributed procedure
• Divide the space into “slightly overlapping” cubes, balancing the load as follows:
- Randomly select a subset of galaxies
- Apply the kd-tree construction to build a balanced partition for the subset
- Use it for the full set of galaxies
• Distributed computation: Apply a sequential FoF algorithm to find the clusters within each cube
- Use any sequential FoF
- Assign different cubes to different cores
• Identify cross-cube edges and merge the respective clusters
- Apply the union-find algorithm to the galaxies in the cube overlaps
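The load-balancing step of the procedure above can be illustrated with a small sketch. It assumes a plain median-split kd-tree built on a random sample, with the resulting split planes reused to route every galaxy to a region; the overlap between neighbouring regions is left out for brevity. The helper names `build_kd_partition` and `region_of`, the sample size, and the tree depth are hypothetical choices for this example.

```python
import random
from collections import Counter


def build_kd_partition(sample, depth, axis=0):
    """Split the sample at its median along cycling axes; None marks a leaf."""
    if depth == 0 or len(sample) < 2:
        return None
    ordered = sorted(sample, key=lambda p: p[axis])
    mid = len(ordered) // 2
    split = ordered[mid][axis]
    next_axis = (axis + 1) % 3
    return (axis, split,
            build_kd_partition(ordered[:mid], depth - 1, next_axis),
            build_kd_partition(ordered[mid:], depth - 1, next_axis))


def region_of(tree, point):
    """Route a point through the split planes; the L/R path names its region."""
    path = ""
    while tree is not None:
        axis, split, left, right = tree
        if point[axis] < split:
            path += "L"
            tree = left
        else:
            path += "R"
            tree = right
    return path


random.seed(0)
galaxies = [(random.random(), random.random(), random.random())
            for _ in range(10_000)]
sample = random.sample(galaxies, 500)        # small random subset
tree = build_kd_partition(sample, depth=3)   # 2**3 = 8 regions
counts = Counter(region_of(tree, g) for g in galaxies)
print(sorted(counts.values()))               # region sizes for the full set
```

Because the split planes are medians of a random sample, each region receives a similar share of the full set, while only the small sample ever has to be sorted.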
[Figure: Distributed procedure. Galaxy sets, divide the space into cubes, apply local sequential FoF, local clusters]
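The whole distributed procedure can be mimicked on one machine as a rough sketch. Several simplifications are assumptions of this example rather than details from the slides: space is cut into slabs along the x axis instead of cubes, each slab is widened by one linking length on both sides to create the overlap, the slabs are processed in a plain loop rather than on separate cores, and a brute-force routine stands in for the black-box sequential FoF. All function and class names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import dist, floor


class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def local_fof(ids, points, linking_length):
    """Stand-in sequential FoF: map each galaxy id to its local cluster root."""
    uf = UnionFind()
    for i, j in combinations(ids, 2):
        if dist(points[i], points[j]) <= linking_length:
            uf.union(i, j)
    return {i: uf.find(i) for i in ids}


def distributed_fof(points, linking_length, slab_width):
    # 1. Assign each galaxy to every slab whose widened extent contains it.
    slabs = defaultdict(list)
    for i, p in enumerate(points):
        lo = floor((p[0] - linking_length) / slab_width)
        hi = floor((p[0] + linking_length) / slab_width)
        for s in range(lo, hi + 1):
            slabs[s].append(i)
    # 2. Run the local FoF per slab (each call could run on its own core).
    merged = UnionFind()
    for s, ids in slabs.items():
        for galaxy, root in local_fof(ids, points, linking_length).items():
            # 3. A galaxy in an overlap carries labels from two slabs;
            #    tying it to both labels merges the two local clusters.
            merged.union(("galaxy", galaxy), ("cluster", s, root))
    clusters = defaultdict(list)
    for i in range(len(points)):
        clusters[merged.find(("galaxy", i))].append(i)
    return sorted(clusters.values())


galaxies = [(0.9, 0.0, 0.0), (1.1, 0.0, 0.0), (1.3, 0.0, 0.0), (3.0, 0.0, 0.0)]
print(distributed_fof(galaxies, linking_length=0.25, slab_width=1.0))
# [[0, 1, 2], [3]]
```

Galaxies 0 and 1 sit on opposite sides of the slab boundary at x = 1, yet both fall inside the overlap, so the union-find step joins their local clusters into one.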
Advantages
• Scalable: We can apply it to massive datasets and use all available cores
• Black-box use of a sequential FoF: We can plug in any sequential FoF algorithm
• Hadoop-friendly: We have mapped all main operations onto the Hadoop framework, which has resulted in very compact code (800 lines)
Scalability
[Figure: Running time (min) versus number of cores, with one curve each for 500 million, 1000 million, and 14,800 million galaxies]