Slide show illustrating the algorithms

advertisement

DISC-Finder:

A distributed algorithm for identifying galaxy clusters

Friends-of-Friends (FoF) technique:

Identification of galaxy clusters

• Two galaxies are “friends” if they are close to each other

• We analyze an undirected graph, where galaxies are vertices and their “friendships” are edges

• We need to identify its connected components

Sequential algorithms

• Exact: O(( n

∙ log n ) 1.5

)

• Approximate: O( n )

Distributed procedure

• Divide the space into “slightly overlapping” cubes

Load balancing:

- Randomly select a subset of galaxies

- Apply the kd -tree construction to build a balanced partition for the subset

- Use it for the full set of galaxies

• Distributed computation:

Apply a sequential FoF algorithm to find the clusters within each cube

- Use any sequential FoF

- Allocate different cores to cubes

• Identify cross-cube edges and merge the respective clusters

- Apply the union-find algorithm to the galaxies in the cube overlaps

Distributed procedure galaxy sets divide the space into cubes apply local sequential FoF local clusters

Advantages

• Scalable: We can apply it to massive datasets and use all available cores

• Black-box use of a sequential FoF:

We can utilize any FoF algorithm

• Hadoop friendly: We have mapped all main operations into the Hadoop framework, which has resulted in very compact code (800 lines)

Scalability

Time

(min)

240

60

15

500 mln galaxies

4

8

14,800 mln galaxies

1000 mln galaxies

16 32 64

Number of cores

128 256

Download