Algorithms and data structures

Overview
Programming and problem solving, with applications
• Algorithm: method for solving a problem
• Data structure: method to store information

CS2336: Computer Science II; K. Wayne & R. Sedgewick Uni. of Princeton

Agenda
• Data types: stack, queue, priority queue
• Sorting: bubble sort, merge sort, quick sort, heap sort, radix sort
• Searching: binary search tree, red-black tree, hash table
• Graphs: depth-first search, breadth-first search

Why study algorithms?
Their impact is broad and far-reaching:
• Internet: web search, packet routing, file sharing, …
• Computers: compilers, file systems, …
• Social media: news feeds, advertisements, …
To solve problems:
• Example: the network connectivity problem. Find a path from one place to another, given a large set of pairwise connected places.

Why study algorithms?
To become a proficient programmer.
To understand nature:
• Computational models are replacing mathematical models (formula-based → algorithm-based).
• More and more, people develop computational models and simulate what might be happening in nature in order to understand it.
For fun and profit:
• A better chance at interviews.

Steps to developing a usable algorithm
• Model the problem.
• Find an algorithm to solve it.
• Fast enough? Fits in memory?
• If not, figure out why.
• Address the problem.
• Iterate until satisfied.

Dynamic connectivity problem
Given a set of N objects:
• Union command: connect two objects (given two objects, provide a connection between them).
• Find/connected query: is there a path connecting the two objects?
Example with objects 0 to 9:
    union(4,3)
    union(3,8)
    union(6,5)
    union(9,4)
    union(2,1)
    connected(0,7)    false
    connected(8,9)    true (indirect connection)
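To make the union and connected semantics concrete, the example commands above can be replayed with a deliberately naive sketch. The class name `ConnectivityReplay` and its component-label array are assumptions made for illustration only; this is not one of the implementations the deck goes on to develop and analyze.

```java
// Hypothetical helper class (not from the slides): replays the example
// commands using a plain array of component labels.
class ConnectivityReplay {
    private final int[] label;   // label[i] = component label of object i

    ConnectivityReplay(int n) {
        label = new int[n];
        for (int i = 0; i < n; i++) label[i] = i;   // every object starts alone
    }

    // Connect p and q by relabeling every object in p's component.
    void union(int p, int q) {
        int from = label[p];
        int to = label[q];
        for (int i = 0; i < label.length; i++) {
            if (label[i] == from) label[i] = to;
        }
    }

    // Two objects are connected iff they carry the same label.
    boolean connected(int p, int q) {
        return label[p] == label[q];
    }

    public static void main(String[] args) {
        ConnectivityReplay c = new ConnectivityReplay(10);
        int[][] unions = {{4, 3}, {3, 8}, {6, 5}, {9, 4}, {2, 1}};
        for (int[] u : unions) c.union(u[0], u[1]);
        System.out.println("connected(0,7) = " + c.connected(0, 7));   // false
        System.out.println("connected(8,9) = " + c.connected(8, 9));   // true
    }
}
```

Objects 8 and 9 were never joined directly; they end up connected only through 3 and 4, which is the indirect connection noted on the slide.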

Modeling the objects
Objects can be, for example:
• Pixels in a digital photo
• Computers in a network
• Friends on social media
• …
It is convenient to name the objects 0 to N-1, so that the integers can be used as array indices.

Modeling assumptions
"Is connected to" is an equivalence relation:
• Reflexive: p is connected to p.
• Symmetric: if p is connected to q, then q is connected to p.
• Transitive: if p is connected to q and q is connected to r, then p is connected to r.
Connected component: a maximal set of objects that are mutually connected.
Example with objects 0 to 9: {1,2} {5,6} {3,4,8,9} (0 and 7 are each in a component of their own).

Implementation
• Find query: check whether two objects are in the same component.
• Union command: replace the components containing the two objects with their union.
Example:
    before:            {1,2} {5,6} {3,4,8,9}
    after union(6,3):  {1,2} {3,4,5,6,8,9}

Specify union-find data type
Goal: design an efficient data structure for union-find.
• The number of objects N can be huge.
• The number of operations M can be huge.
• Find queries and union commands may be intermixed.
API:
    public class UnionFind
        UnionFind(int N)
        void union(int p, int q)
        boolean connected(int p, int q)
        int find(int p)
        int count()

Testing
• Read in the number of objects N.
• Repeat: read in a pair of integers; if they are not connected yet, connect them and print out the pair.

    // StdIn and StdOut come from Sedgewick and Wayne's standard I/O library.
    public static void main(String[] args) {
        int N = StdIn.readInt();
        UnionFind uf = new UnionFind(N);
        while (!StdIn.isEmpty()) {
            int p = StdIn.readInt();
            int q = StdIn.readInt();
            if (!uf.connected(p, q)) {
                uf.union(p, q);
                StdOut.println(p + " " + q);
            }
        }
    }

Quick-find
Data structure: integer array id[] of size N.
Interpretation: p and q are connected iff they have the same id.
Initially, every object is in its own component:
    index  0 1 2 3 4 5 6 7 8 9
    id[]   0 1 2 3 4 5 6 7 8 9
After some unions:
    index  0 1 2 3 4 5 6 7 8 9
    id[]   0 1 1 8 8 0 0 1 8 8
Find: check whether p and q have the same id. Here id[6] = 0 and id[1] = 1, so 6 and 1 are not connected.
Union: to merge the components containing p and q, change all entries whose id equals id[p] to id[q].
Problem: many values can change.

QuickFind

    public class QuickFind {
        private int[] id;

        // Set the id of each object to itself (N array accesses).
        public QuickFind(int N) {
            id = new int[N];
            for (int i = 0; i < N; i++) {
                id[i] = i;
            }
        }

        // Check whether p and q are in the same component (2 array accesses).
        public boolean connected(int p, int q) {
            return id[p] == id[q];
        }

        // Change all entries equal to id[p] to id[q] (at most 2N + 2 array accesses).
        public void union(int p, int q) {
            int pid = id[p];
            int qid = id[q];
            for (int i = 0; i < id.length; i++)
                if (id[i] == pid) id[i] = qid;
        }
    }

Quick-find is too slow
Cost model: number of array accesses (for read and write).

    algorithm     initialize   union   find   cost
    Quick-find    N            N       1      expensive

(Order of growth of the number of array accesses.)
Quadratic: it takes N^2 array accesses to process N union commands on N objects.

Quadratic algorithms don't scale
Rough standard:
• 10^9 operations per second
• 10^9 words of main memory
• Touch all words in about 1 second (true since 1950!)
Example:
- 10^9 union commands
- 10^9 objects
- Quick-find takes 10^18 operations
- At 10^9 operations per second, that is 10^9 seconds: more than 30 years

Quick-union (lazy approach)
Data structure: integer array id[] of size N.
Interpretation: id[i] is the parent of i.
The root of i is id[id[...id[i]...]] (keep going until it doesn't change; the algorithm ensures there are no cycles).
    index  0 1 2 3 4 5 6 7 8 9
    id[]   0 1 9 4 9 6 6 7 8 9
Find: check whether p and q have the same root.
Union: to merge the components containing p and q, set the id of p's root to the id of q's root.
After merging the tree rooted at 9 into the tree rooted at 6:
    index  0 1 2 3 4 5 6 7 8 9
    id[]   0 1 9 4 9 6 6 7 8 6
Only one value changes.
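The root-following rule above can be traced on the slide's own array. The sketch below is illustrative: the class name `QuickUnionTrace` is hypothetical, and the arguments p = 3 and q = 5 are an assumption, since the slide shows only the arrays before and after the union and not the call that produced them.

```java
import java.util.Arrays;

// Hypothetical tracing class (not from the slides).
class QuickUnionTrace {
    // Follow parent links until reaching a node that is its own parent.
    static int findRoot(int[] id, int i) {
        while (i != id[i]) i = id[i];
        return i;
    }

    public static void main(String[] args) {
        int[] id = {0, 1, 9, 4, 9, 6, 6, 7, 8, 9};   // array from the slide
        int p = 3, q = 5;                             // assumed arguments

        int rootP = findRoot(id, p);   // 3 -> 4 -> 9
        int rootQ = findRoot(id, q);   // 5 -> 6
        id[rootP] = rootQ;             // the only entry that changes

        // prints [0, 1, 9, 4, 9, 6, 6, 7, 8, 6]
        System.out.println(Arrays.toString(id));
    }
}
```

Any p in the tree rooted at 9 and any q in the tree rooted at 6 would give the same result, because only the two roots matter.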

Quick-union demo
Contents of id[] after each operation:

    operation        id[]: 0 1 2 3 4 5 6 7 8 9    result
    (initial)              0 1 2 3 4 5 6 7 8 9
    union(4,3)             0 1 2 3 3 5 6 7 8 9
    union(3,8)             0 1 2 8 3 5 6 7 8 9
    union(6,5)             0 1 2 8 3 5 5 7 8 9
    union(9,4)             0 1 2 8 3 5 5 7 8 8
    union(2,1)             0 1 1 8 3 5 5 7 8 8
    connected(8,9)         0 1 1 8 3 5 5 7 8 8    true
    connected(5,4)         0 1 1 8 3 5 5 7 8 8    false

QuickUnion

    public class QuickUnion {
        private int[] id;

        // Set the id of each object to itself (N array accesses).
        public QuickUnion(int N) {
            id = new int[N];
            for (int i = 0; i < N; i++) {
                id[i] = i;
            }
        }

        // Chase parent pointers until reaching the root (depth of i array accesses).
        private int findRoot(int i) {
            while (i != id[i]) i = id[i];
            return i;
        }

        // Check whether p and q have the same root (depth of p and q array accesses).
        public boolean connected(int p, int q) {
            return findRoot(p) == findRoot(q);
        }

        // Change the root of p to point to the root of q (depth of p and q array accesses).
        public void union(int p, int q) {
            int i = findRoot(p);
            int j = findRoot(q);
            id[i] = j;
        }
    }

Quick-union is also too slow
Cost model: number of array accesses (for read and write), worst case.

    algorithm      initialize   union   find   cost
    Quick-find     N            N       1      expensive
    Quick-union    N            N (*)   N      expensive

(*) Includes the cost of finding roots.
• Trees can get really tall.
• Find is too expensive (up to N array accesses).

Improvements
Improvement 1: weighting
Weighted quick-union:
• Avoid tall trees.
• Keep track of the size of each tree (number of objects).
• Balance by linking the root of the smaller tree to the root of the larger tree.
In quick-union we might put the larger tree lower; in weighted quick-union we always choose the better alternative.

Weighted quick-union demo
Contents of id[] after each operation:

    operation        id[]: 0 1 2 3 4 5 6 7 8 9    result
    (initial)              0 1 2 3 4 5 6 7 8 9
    union(4,3)             0 1 2 4 4 5 6 7 8 9
    union(3,8)             0 1 2 4 4 5 6 7 4 9
    union(6,5)             0 1 2 4 4 6 6 7 4 9
    union(9,4)             0 1 2 4 4 6 6 7 4 4
    union(2,1)             0 2 2 4 4 6 6 7 4 4
    connected(8,9)         0 2 2 4 4 6 6 7 4 4    true
    connected(5,4)         0 2 2 4 4 6 6 7 4 4    false

Weighted QuickUnion
Maintain an extra array sz[] that counts the objects in the tree rooted at each root.

    public class QuickUnion {
        private int[] id;
        private int[] sz;   // sz[i] = number of objects in the tree rooted at i

        public QuickUnion(int N) {
            id = new int[N];
            sz = new int[N];
            for (int i = 0; i < N; i++) {
                id[i] = i;
                sz[i] = 1;
            }
        }

        private int findRoot(int i) {
            while (i != id[i]) i = id[i];
            return i;
        }

        public boolean connected(int p, int q) {
            return findRoot(p) == findRoot(q);
        }

        // Link the root of the smaller tree to the root of the larger tree.
        public void union(int p, int q) {
            int i = findRoot(p);
            int j = findRoot(q);
            if (i == j) return;   // already connected; keeps sz[] correct
            if (sz[i] < sz[j]) { id[i] = j; sz[j] += sz[i]; }
            else               { id[j] = i; sz[i] += sz[j]; }
        }
    }

Weighted quick-union analysis
Running time:
• Find: takes time proportional to the depth of p and q.
• Union: takes constant time, given the roots.
Proposition. The depth of any node x is at most lg N.
Proof. When does the depth of x increase? It increases by 1 when the tree T1 containing x is merged into another tree T2.
• The size of the tree containing x at least doubles, since |T2| >= |T1|.
• The size of the tree containing x can double at most lg N times. Why? It starts at 1, and after lg N doublings it would reach N, the total number of objects.

Weighted QU

    algorithm      initialize   union      find
    Quick-find     N            N          1
    Quick-union    N            N (*)      N
    Weighted QU    N            lg N (*)   lg N

(*) Includes the cost of finding roots.

Improvements
Improvement 2: path compression
Quick-union with path compression: just after computing the root of p, set the id of each examined node to point to that root. This flattens the tree.
The code below is the simpler one-pass variant: it points each examined node to its grandparent, which halves the path length instead of flattening it completely.

    private int findRoot(int i) {
        while (i != id[i]) {
            id[i] = id[id[i]];   // point i to its grandparent
            i = id[i];
        }
        return i;
    }

Analysis
Proposition. For weighted quick-union with path compression, starting from an empty data structure, any sequence of M union-find operations on N objects makes at most c (N + M lg* N) array accesses.
This is very close to linear time: in theory it is not linear, but in practice it is.

    N          lg* N
    1          0
    2          1
    4          2
    16         3
    65536      4
    2^65536    5

lg* is the iterated logarithm function.
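The lg* table above can be reproduced with a few lines of code. This is a minimal sketch, assuming the usual definition of lg* N as the number of times lg must be applied to N before the result is at most 1. The class and method names are hypothetical, and the code uses the equivalent integer formulation (the smallest k such that a tower of k twos is at least N) to avoid floating-point rounding.

```java
// Hypothetical helper class (not from the slides).
class IteratedLog {
    // lg* n for n >= 1: smallest k such that tower(k) >= n,
    // where tower(0) = 1 and tower(k) = 2^tower(k-1).
    static int lgStar(long n) {
        int count = 0;
        long tower = 1;
        while (tower < n) {
            // The next tower is 2^tower; once tower >= 63 it exceeds every long.
            tower = (tower >= 63) ? Long.MAX_VALUE : (1L << tower);
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        long[] inputs = {1, 2, 4, 16, 65536};
        for (long n : inputs) {
            // prints lg* values 0, 1, 2, 3, 4 for the inputs above
            System.out.println(n + " -> " + lgStar(n));
        }
    }
}
```

Every input that fits in a `long` has lg* of at most 5, which is why the bound behaves like a linear one in practice.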

Summary
Order of growth for M union-find operations on a set of N objects:

    algorithm                          worst-case time
    Quick-find                         M N
    Quick-union                        M N
    Weighted QU                        N + M lg N
    QU + path compression              N + M lg N
    Weighted QU + path compression     N + M lg* N

• Weighted QU + path compression reduces the time from 30 years to 6 seconds.
• A supercomputer won't help much; a good algorithm is the key to the solution.

Union-find applications
• Dynamic connectivity
• Image processing
• Graph processing
• Understanding physical phenomena
• …

Percolation
A model of many physical systems:
• N-by-N grid of sites.
• Each site is open with probability p (or blocked with probability 1-p).
• The system percolates iff the top and bottom are connected by open sites.
[Figure: two example grids, one that percolates and one that does not. Legend: blocked site, open site, open site connected to top, no open site connected to top.]

Likelihood of percolation
Depends on the site vacancy probability p:
• p low (0.4): does not percolate
• p medium (0.5): ??
• p high (0.8): percolates
So the question is how we know whether the system percolates or not.
Percolation phase transition: when N is large, there is a sharp threshold p*.
• p > p*: almost certainly percolates
• p < p*: almost certainly does not percolate
What is the value of p*?
[Plot: percolation probability from 0 to 1, with the threshold p* marked.]

Monte Carlo simulation
• Initialize all sites of an N-by-N grid to be blocked.
• Declare random sites open until the top is connected to the bottom.
• The vacancy percentage at that point estimates p*.
Every time we open a site, we check whether it makes the system percolate.

Percolation threshold
Q. What is the value of p*?
A. About 0.592746.
A fast algorithm enables an accurate answer to a scientific question.

Dynamic connectivity solutions
Q. How to model opening a new site?
A.
Connect the newly opened site to all of its adjacent open sites (up to 4 calls to union()).
[Figure: 5-by-5 grid of sites numbered 0 to 24, with one site marked "Open this site".]

Dynamic connectivity solutions
• Create an object for each site and name them 0 to N^2-1.
• Sites are in the same component if they are connected by open sites.
• The system percolates iff any site on the bottom row is connected to a site on the top row.
• Brute-force algorithm: N^2 calls to connected().

     0  1  2  3  4
     5  6  7  8  9
    10 11 12 13 14
    15 16 17 18 19
    20 21 22 23 24

Dynamic connectivity solutions
Clever trick: introduce 2 virtual sites (with connections to the top row and the bottom row).
• The system percolates iff the virtual top site is connected to the virtual bottom site.
• Only 1 call to connected().
[Figure: the 5-by-5 grid of sites 0 to 24 with a virtual site attached to the top row and another attached to the bottom row.]

Summary
• Introduced the dynamic connectivity problem.
• Modeled the problem to understand what kind of data structure and algorithm is needed to solve it.
• Introduced a few simple but inefficient algorithms to solve it.
• Learned how to improve them to get an efficient algorithm.
• Saw applications that could not be solved without these efficient algorithms.
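The percolation slides (Monte Carlo simulation, opening a site with up to 4 calls to union(), and the two virtual sites) can be pulled together in one short sketch. Everything named here is an assumption made for illustration: the class `PercolationSketch`, its row-major site numbering, the embedded weighted quick-union with the one-pass path compression shown earlier, and the trial parameters in `main`. It is a sketch under those assumptions, not the course's reference solution.

```java
import java.util.Random;

// Hypothetical sketch (names and layout are illustrative, not from the slides).
class PercolationSketch {
    private final int n;
    private final boolean[] isOpen;   // isOpen[row * n + col]
    private final int[] id, sz;       // weighted quick-union over n*n sites + 2 virtual sites
    private final int top, bottom;    // indices of the virtual sites

    PercolationSketch(int n) {
        this.n = n;
        isOpen = new boolean[n * n];
        id = new int[n * n + 2];
        sz = new int[n * n + 2];
        for (int i = 0; i < id.length; i++) { id[i] = i; sz[i] = 1; }
        top = n * n;
        bottom = n * n + 1;
    }

    private int findRoot(int i) {
        while (i != id[i]) { id[i] = id[id[i]]; i = id[i]; }   // point to grandparent
        return i;
    }

    private void union(int p, int q) {
        int i = findRoot(p), j = findRoot(q);
        if (i == j) return;
        if (sz[i] < sz[j]) { id[i] = j; sz[j] += sz[i]; }
        else               { id[j] = i; sz[i] += sz[j]; }
    }

    // Open a site and connect it to its open neighbors (up to 4 unions),
    // plus a virtual site if it lies on the top or bottom row.
    void open(int row, int col) {
        int s = row * n + col;
        if (isOpen[s]) return;
        isOpen[s] = true;
        if (row == 0) union(s, top);
        if (row == n - 1) union(s, bottom);
        if (row > 0 && isOpen[s - n]) union(s, s - n);
        if (row < n - 1 && isOpen[s + n]) union(s, s + n);
        if (col > 0 && isOpen[s - 1]) union(s, s - 1);
        if (col < n - 1 && isOpen[s + 1]) union(s, s + 1);
    }

    // Equivalent to a single connected() query between the two virtual sites.
    boolean percolates() {
        return findRoot(top) == findRoot(bottom);
    }

    // One Monte Carlo trial: fraction of open sites when the grid first percolates.
    static double trial(int n, Random rng) {
        PercolationSketch grid = new PercolationSketch(n);
        int opened = 0;
        while (!grid.percolates()) {
            int row = rng.nextInt(n), col = rng.nextInt(n);
            if (!grid.isOpen[row * n + col]) { grid.open(row, col); opened++; }
        }
        return (double) opened / (n * n);
    }

    public static void main(String[] args) {
        Random rng = new Random(42);   // arbitrary seed
        int trials = 30, n = 100;      // arbitrary illustration parameters
        double sum = 0;
        for (int t = 0; t < trials; t++) sum += trial(n, rng);
        System.out.println("estimated threshold: " + sum / trials);
    }
}
```

Averaging more trials on larger grids should bring the estimate closer to the threshold value quoted on the percolation threshold slide; the small parameters here only keep the run short.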