Data Structures and Algorithms
Dr. Amit Kumar and Dr. Amitabha Bagchi
March 13, 2008
IIT Delhi

Teaser
Given two (2,4) trees such that each key in the first tree is at most any key in the second tree, how do you merge them into a single (2,4) tree in O(log n + log m) time? Here n and m are the numbers of nodes in the two trees.

(2,4) Trees
Properties:
• Each internal node has at least 2 and at most 4 children
• All external nodes have the same depth
• The height h of a (2,4) tree is O(log n)
• Search, insert, and delete take O(log n) time
[Figure: example (2,4) tree]

Beyond (2,4) Trees
What do we know about (2,4) trees?
• Balanced
• O(log n) search time
• Different node structures
Question: Can we get the (2,4) tree advantages in a binary tree format?
Welcome to the world of Red-Black Trees!

Red-Black Tree
A red-black tree is a binary search tree with the following properties:
• edges are colored red or black
• no two consecutive red edges on any root-leaf path
• same number of black edges on any root-leaf path (the black height)
• edges connecting leaves are black

(2,4) Tree Evolution
Note how (2,4) trees relate to red-black trees: a red-black tree can be viewed as a representation of a (2,4) tree that breaks nodes with 3 or 4 children into two levels of nodes.

Red-Black Tree Properties
Notation:
• N is the number of internal nodes
• L is the number of leaves (= N + 1)
• H is the height
• B is the black height (the height if red nodes are not counted)

Insertion into Red-Black Trees
1) Perform a standard search to find the leaf where the key should be added
2) Replace the leaf with an internal node holding the new key
3) Color the incoming edge of the new node red
4) Add two new leaves, and color their incoming edges black
5) If the parent had an incoming red edge, we now have two consecutive red edges! We must reorganize the tree to remove that violation. What must be done depends on the sibling of the parent.
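
The five insertion steps can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the lecture's own code: the names `Node`, `RedBlackTree`, `black_height`, and `inorder` are hypothetical, the color of an edge is stored on the node it enters, and `None` stands in for a leaf whose incoming edge is black. The repair loop uses the two standard remedies for consecutive red edges: recoloring when the sibling of the parent is red, and a single or double rotation otherwise.

```python
RED, BLACK = "red", "black"


class Node:
    def __init__(self, key, parent=None):
        self.key = key
        self.color = RED      # color of the edge entering this node (step 3)
        self.parent = parent
        self.left = None      # None stands for a leaf; its edge counts as black (step 4)
        self.right = None


class RedBlackTree:
    def __init__(self):
        self.root = None

    def _rotate_up(self, x):
        """Rotate x above its parent; the inorder sequence is unchanged."""
        p, g = x.parent, x.parent.parent
        if x is p.left:
            p.left, x.right = x.right, p
            if p.left is not None:
                p.left.parent = p
        else:
            p.right, x.left = x.left, p
            if p.right is not None:
                p.right.parent = p
        p.parent, x.parent = x, g
        if g is None:
            self.root = x
        elif g.left is p:
            g.left = x
        else:
            g.right = x

    def insert(self, key):
        # Steps 1-2: standard search, then put a new node where the leaf was.
        parent, cur = None, self.root
        while cur is not None:
            parent, cur = cur, (cur.left if key < cur.key else cur.right)
        n = Node(key, parent)
        if parent is None:
            self.root = n
        elif key < parent.key:
            parent.left = n
        else:
            parent.right = n
        # Step 5: repair two consecutive red edges, if any.
        while n.parent is not None and n.parent.color == RED:
            p, g = n.parent, n.parent.parent
            s = g.right if p is g.left else g.left  # sibling of the parent
            if s is not None and s.color == RED:
                # Recoloring: push the red edge up and continue from g.
                p.color = s.color = BLACK
                g.color = RED
                n = g
            else:
                # Restructuring: one or two rotations, then stop.
                if (n is p.left) != (p is g.left):
                    self._rotate_up(n)  # first half of a double rotation
                    p = n
                self._rotate_up(p)
                p.color, g.color = BLACK, RED
                break
        self.root.color = BLACK


def black_height(node):
    """Black height of a subtree; raises ValueError if a red-black property fails."""
    if node is None:
        return 0
    for child in (node.left, node.right):
        if node.color == RED and child is not None and child.color == RED:
            raise ValueError("two consecutive red edges")
    left, right = black_height(node.left), black_height(node.right)
    if left != right:
        raise ValueError("unequal black heights")
    return left + (node.color == BLACK)


def inorder(node):
    return [] if node is None else inorder(node.left) + [node.key] + inorder(node.right)


tree = RedBlackTree()
for letter in "REDSOXCUBS":
    tree.insert(letter)
print(inorder(tree.root))       # keys in sorted order
print(black_height(tree.root))  # raises if the tree is not a valid red-black tree
```

Storing the color on the child node is only a convenience for the sketch; it is equivalent to coloring the edge between the node and its parent.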

Insertion - Plain and Simple
[Figure]

Restructuring
• We call this a “rotation”
• No further work necessary
• Inorder remains unchanged
• Black depth is preserved for all leaves
• No more consecutive red edges!
• Corrects a “malformed” 4-node in the associated (2,4) tree

More Rotations
[Figure: further rotation cases involving the nodes n, p, and g]

Promotion
• We call this a “recoloring”
• The black depth remains unchanged for all the descendants of g
• This process continues upward beyond g if necessary: rename g as n and repeat
• Splits a 5-node of the associated (2,4) tree

Summary of Insertion
• If two consecutive red edges are present, we do either
  - a restructuring (with a simple or double rotation) and stop, or
  - a recoloring and continue
• A restructuring takes constant time and is performed at most once. It reorganizes an unbalanced section of the tree.
• Recolorings may continue up the tree and are executed O(log N) times.
• The time complexity of an insertion is O(log N).

An Example
• Start by inserting “REDSOX” into an empty tree
• Now, let’s insert “C U B S”...

Example
[Figures: the tree after each insertion of C, U, B, and S. One step asks “What should we do?”; a violation of two consecutive red edges (“BIFF!”) is repaired by a rotation.]

Setting Up Deletion
• As with binary search trees, we can always delete a node that has at least one external child
• If the key to be deleted is stored at a node that has no external children, we move there the key of its inorder predecessor (or successor), and delete that node instead
• Example: to delete key 7, we move key 5 to node u, and delete node v
[Figure]

Deletion Algorithm
• Remove v, where w is a leaf child of v; let u be the sibling of w, which takes the place of v.
• If v was red or u is red, color u black.
• Else, color u double black.
• While a double black edge exists, perform one of the following actions...
[Figure: node v with children u and w, before and after the removal]

How to Eliminate the Double Black Edge
• The intuitive idea is to perform a “color compensation”
• Find a red edge nearby, and change the pair (red, double black) into (black, black)
• As for insertion, we have two cases:
  - restructuring, and
  - recoloring (demotion, the inverse of promotion)
• Restructuring resolves the problem locally, while recoloring may propagate it two levels up
• Slightly more complicated than insertion, since two restructurings may occur (instead of just one)

Case 1: black sibling with a red child
• If the sibling is black and one of its children is red, perform a restructuring
[Figure: restructuring involving the parent p, the node v, its sibling s, and the red child z of s]

(2,4) Tree Interpretation
[Figure: the corresponding operation in the associated (2,4) tree]

Case 2: black sibling with black children
• If the sibling and its children are black, perform a recoloring
• If the parent becomes double black, continue upward
[Figure: recoloring involving the parent p, the node v, and its sibling s]

(2,4) Tree Interpretation
[Figure: the corresponding operation in the associated (2,4) tree]

Case 3: red sibling
• If the sibling is red, perform an adjustment
• Now the sibling is black and one of the previous cases applies
• If the next case is recoloring, there is no propagation upward (the parent is now red)
[Figure: adjustment involving the parent p, the node v, and its sibling s]

How About an Example?
Remove 9
[Figure: a red-black tree with the keys 2, 4, 5, 6, 7, 8, 9, before and after removing 9]

Example
What do we know?
• Sibling is black with black children
What do we do?
• Recoloring
[Figure: the tree before and after the recoloring]

Example
Delete 8: no double black
[Figure: the tree before and after deleting 8]

Example
Delete 7: restructuring
[Figure: the tree before and after deleting 7; the remaining keys are 2, 4, 5, 6]

Example
[Figures]

Summary of Red-Black Trees
• An insertion or deletion may cause a local perturbation (two consecutive red edges, or a double-black edge)
• The perturbation is either
  - resolved locally (restructuring), or
  - propagated to a higher level in the tree by recoloring (promotion or demotion)
• O(1) time for a restructuring or recoloring
• At most one restructuring per insertion, and at most two restructurings per deletion
• O(log N) recolorings
• Total time: O(log N)

Tries
A data structure for dictionary operations on a set of strings.

Standard Tries
The standard trie for a set of strings S is an ordered tree such that:
• each node but the root is labeled with a character
• the children of a node are alphabetically ordered
• the paths from the root to the external nodes yield the strings of S
Example: standard trie for the set of strings
S = { bear, bell, bid, bull, buy, sell, stock, stop }
[Figure]

Standard Tries
• A standard trie uses O(n) space.
• Operations (find, insert, remove) take time O(dm) each, where:
  - n = total size of the strings in S
  - m = size of the string parameter of the operation
  - d = alphabet size

Applications of Tries
A standard trie supports the following operations on a preprocessed text in time O(m), where m = |X|:
  - word matching: find the first occurrence of word X in the text
  - prefix matching: find the first occurrence of the longest prefix of word X in the text
Each operation is performed by tracing a path in the trie starting at the root.

Applications of Tries
[Figure]

Compressed Tries
• A trie in which every internal node has degree at least 2
• Obtained from the standard trie by compressing chains of redundant nodes
[Figure: a standard trie and the corresponding compressed trie]

Compact Storage of Compressed Tries
A compressed trie can be stored in space O(s), where s = |S| is the number of strings, by using O(1)-space index ranges at the nodes.
[Figure]

Insertion and Deletion
[Figure]

Suffix Tries
A suffix trie is a compressed trie for all the suffixes of a text.
[Figure: an example suffix trie and its compact representation]

Properties of Suffix Tries
The suffix trie for a text X of size n from an alphabet of size d
  - stores all the n suffixes of X (of total length n(n+1)/2) in O(n) space
  - supports arbitrary pattern matching and prefix matching queries in O(dm) time, where m is the length of the pattern
  - can be constructed in O(dn) time

Tries and Web Search Engines
• The index of a search engine (the collection of all searchable words) is stored in a compressed trie
• Each leaf of the trie is associated with a word and has a list of pages (URLs) containing that word, called the occurrence list
• The trie is kept in internal memory
• The occurrence lists are kept in external memory and are ranked by relevance
• Boolean queries for sets of words (e.g., Java and coffee) correspond to set operations (e.g., intersection) on the occurrence lists
• Additional information retrieval techniques are used, such as stopword
elimination (e.g., ignore “the”, “a”, “is”), stemming (e.g., identify “add”, “adding”, “added”), and link analysis (recognize authoritative pages)

Tries and Internet Routers
• Computers on the internet (hosts) are identified by a unique 32-bit IP (internet protocol) address, usually written in “dotted-quad-decimal” notation
• E.g., 10.20.25.70
• Use nslookup on Unix to find out IP addresses
• An organization uses a subset of IP addresses with the same prefix, e.g., IITD uses 10.20.*.*
• Data is sent to a host by fragmenting it into packets. Each packet carries the IP address of its destination.
• The internet can be viewed as a graph whose nodes are routers and whose edges are communication links.
• A router forwards packets to its neighbors using IP prefix matching rules. E.g., a packet with IP prefix 10.20 should be forwarded to the IIT gateway router.
• Routers use tries on the alphabet {0,1} to do prefix matching.

Priority Queues
A priority queue (PQ) is an ADT (abstract data type) for maintaining a set S of elements, each with an associated value called its key.
A PQ supports the following operations:
• Insert(S,x): inserts element x into set S (S ← S ∪ {x})
• Maximum(S): returns the element of S with the largest key
• Extract-Max(S): returns and removes the element of S with the largest key

Priority Queues
Applications:
• job scheduling on shared computing resources (Unix)
• event simulation
• as a building block for other algorithms
A heap can be used to implement a PQ.

Heaps
• Can be viewed as a nearly complete binary tree: all levels, except possibly the lowest one, are completely filled
• The key in the root is greater than or equal to the keys of all its children, and the left and right subtrees are again binary heaps
• Height?
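
One way to answer the height question is the short derivation below. It is a sketch in notation assumed for this note rather than taken from the slides: n is the number of elements, and the height h is the number of edges on the longest root-to-leaf path. A nearly complete binary tree of height h has levels 0 to h-1 completely filled and between 1 and 2^h nodes on level h.

```latex
\[
  2^{h} \;=\; \Bigl(\sum_{i=0}^{h-1} 2^{i}\Bigr) + 1
  \;\le\; n \;\le\;
  \sum_{i=0}^{h} 2^{i} \;=\; 2^{h+1} - 1
\]
\[
  \Longrightarrow\quad h \;\le\; \lg n \;<\; h + 1
  \quad\Longrightarrow\quad h = \lfloor \lg n \rfloor = \Theta(\lg n)
\]
```

For instance, a heap with n = 10 elements has height ⌊lg 10⌋ = 3.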

Heaps
Binary heap data structure A
• Can be implemented as an array
• Two attributes:
  - length[A]
  - heap-size[A]

Heaps
Parent(i): return ⌊i/2⌋
Left(i): return 2i
Right(i): return 2i+1
Heap property: A[Parent(i)] ≥ A[i]
Example array:
  index i:  1   2   3   4   5   6   7   8   9   10
  A[i]:     16  15  10  8   7   9   3   2   4   1
[Figure: the same heap drawn as a tree, with its levels labeled 3, 2, 1, 0]

Heaps
• Notice the implicit tree links: the children of node i are 2i and 2i+1
• Why is this useful?
  - In a binary representation, a multiplication/division by two is a left/right shift
  - Adding 1 to the even number 2i can be done by setting the lowest bit

Heapify
• i is an index into the array A
• The binary trees rooted at Left(i) and Right(i) are heaps
• But A[i] might be smaller than its children, thus violating the heap property
• The method Heapify makes A a heap once more by moving A[i] down the heap until the heap property is satisfied again

Heapify Example
[Figure]

Heapify: Running Time
Running time on a node of height h: O(h)

Extract Max
Removing the maximum takes constant time on top of one call to Heapify: O(lg n), and Θ(lg n) in the worst case

Insertion
Insertion of a new element:
• enlarge the PQ and propagate the new element from the last place “up” the PQ
• the tree has height ⌊lg n⌋, so the running time is O(lg n), and Θ(lg n) in the worst case

Insertion (Example)
[Figure]

Teaser
Given a heap as an array, how do you find the k-th largest element in O(k log k) time?
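
Looking back at the heap slides before this teaser, the array layout and the three priority-queue operations can be sketched in Python as below. This is an illustrative sketch, not the lecture's code: the class name `MaxHeap` and its method names are hypothetical, and slot 0 of the list is left unused so that the indices match the 1-based Parent/Left/Right formulas on the slides. It does not address the teaser itself.

```python
class MaxHeap:
    """Binary max-heap stored in a list; slot 0 is unused so indices are 1-based."""

    def __init__(self, keys=()):
        self.a = [None]
        for key in keys:
            self.insert(key)

    def __len__(self):
        return len(self.a) - 1           # plays the role of heap-size[A]

    @staticmethod
    def parent(i):
        return i >> 1                    # floor(i / 2) as a right shift

    @staticmethod
    def left(i):
        return i << 1                    # 2i as a left shift

    @staticmethod
    def right(i):
        return (i << 1) | 1              # 2i + 1: shift, then set the lowest bit

    def heapify(self, i):
        """Move a[i] down until the subtree rooted at i is a heap again."""
        n = len(self)
        while True:
            largest = i
            for child in (self.left(i), self.right(i)):
                if child <= n and self.a[child] > self.a[largest]:
                    largest = child
            if largest == i:
                return
            self.a[i], self.a[largest] = self.a[largest], self.a[i]
            i = largest

    def maximum(self):
        if len(self) == 0:
            raise IndexError("maximum of an empty heap")
        return self.a[1]

    def extract_max(self):
        if len(self) == 0:
            raise IndexError("extract_max from an empty heap")
        top = self.a[1]
        last = self.a.pop()              # remove the last slot
        if len(self) > 0:
            self.a[1] = last             # move the last key to the root
            self.heapify(1)              # and let it sink
        return top

    def insert(self, key):
        self.a.append(key)               # enlarge the heap by one slot
        i = len(self)
        while i > 1 and self.a[self.parent(i)] < self.a[i]:
            p = self.parent(i)
            self.a[i], self.a[p] = self.a[p], self.a[i]   # propagate upward
            i = p


heap = MaxHeap([16, 15, 10, 8, 7, 9, 3, 2, 4, 1])   # the example array from the slides
print(heap.maximum())      # 16
print(heap.extract_max())  # 16
print(heap.maximum())      # 15
```

Both `heapify` and the loop in `insert` follow a single root-to-leaf path, which is where the O(lg n) bounds come from.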

Graphs: Definition
A graph G = (V,E) is composed of:
• V: set of vertices
• E ⊆ V × V: set of edges connecting the vertices
An edge e = (u,v) is a pair of vertices; (u,v) is ordered if G is a directed graph.

Applications
• Electronic circuits, pipeline networks
• Transportation and communication networks
• Modeling any sort of relationships (between components, people, processes, concepts)

(Undirected) Graph Terminology
• adjacent vertices: connected by an edge (neighbors)
• degree (of a vertex): # of adjacent vertices
  ∑_{v∈V} deg(v) = 2 · (# of edges)
  Since each edge is counted once by each of its two endpoints, it is counted twice.
• path: sequence of vertices v_1, v_2, ..., v_k such that consecutive vertices v_i and v_{i+1} are adjacent

Graph Terminology
• simple path: no repeated vertices

Graph Terminology
• cycle: simple path, except that the last vertex is the same as the first vertex
• connected graph: any two vertices are connected by some path

Graph Terminology
• subgraph: subset of vertices and edges forming a graph
• connected component: maximal connected subgraph. E.g., the graph below has 3 connected components
[Figure: a graph with 3 connected components]

Graph Terminology
• (free) tree: connected graph without cycles
• forest: collection of trees

Data Structures for Graphs
How can we represent a graph?
To start with, we can store the vertices and the edges in two containers, and store with each edge object references to its start and end vertices.

Edge List
• The edge list is easy to implement
• Finding the edges incident on a given vertex is inefficient, since it requires examining the entire edge sequence

Adjacency List
• The adjacency list of a vertex v: a sequence of the vertices adjacent to v
• Represent the graph by the adjacency lists of all its vertices
• Space = Θ(n + ∑ deg(v)) = Θ(n + m), where n is the number of vertices and m the number of edges

Adjacency Matrix
• Matrix M with entries for all pairs of vertices
• M[i,j] = true: there is an edge (i,j)
• M[i,j] = false: there is no edge (i,j)
• Space = O(n²)

Graph Searching Algorithms
• Systematic search of every edge and vertex of the graph
• Graph G = (V,E) is either directed or undirected
• Basic question: given two vertices u and v, find a path from u to v
• Applications:
  - Compilers
  - Graphics
  - Maze-solving
  - Mapping
  - Networks: routing, searching, clustering, etc.

Graph Searching Algorithms
Traverse (v) {
    visit v;
    for each neighbour u of v
        Traverse (u);
}
What is wrong? We need to remember whether a vertex has been visited or not!

Graph Searching Algorithms
Visited[] : initialized to FALSE.
Traverse (v) {
    visit v;
    visited[v] = TRUE;
    for each neighbour u of v
        if (visited[u] == FALSE)
            Traverse (u);
}
This is called DEPTH FIRST SEARCH (DFS).

Examples
[Figure]

DFS
What if the graph has many connected components?
Visited[] : initialized to FALSE.
For v = 1, …, N do
    if (visited[v] == FALSE)
        Traverse (v);

Traverse (v) {
    visit v;
    visited[v] = TRUE;
    for each neighbour u of v
        if (visited[u] == FALSE)
            Traverse (u);
}

Running Time
O(n + m)

Depth-First Search
• A depth-first search (DFS) in an undirected graph G is like wandering in a labyrinth with a string and a can of paint
• We start at vertex s, tying the end of our string to the point and painting s “visited (discovered)”. Next we label s as our current vertex, called u
• Now, we travel along an arbitrary edge (u,v)
• If edge (u,v) leads us to an already visited vertex v, we return to u
• If vertex v is unvisited, we unroll our string, move to v, paint v “visited”, set v as our current vertex, and repeat the previous steps

Depth-First Search
• Eventually, we will get to a point where all incident edges on u lead to visited vertices
• We then backtrack by rolling up our string to a previously visited vertex v. Then v becomes our current vertex and we repeat the previous steps
• Then, if all incident edges on v lead to visited vertices, we backtrack as we did before. We continue to backtrack along the path we have traveled, finding and exploring unexplored edges, and repeating the procedure

Depth-First Search
Using DFS, we can
• check if a graph is connected
• find the connected components in a graph
• check if there is a path from u to v
How do we find a path from u to v?
When we recursively call Traverse(u), remember who is responsible for calling Traverse(u).

DFS
Visited[] : initialized to FALSE.
Traverse (v) {
    visit v;
    visited[v] = TRUE;
    for each neighbour u of v
        if (visited[u] == FALSE) {
            p[u] = v;
            Traverse (u);
        }
}

Depth-First Search
To find a path from u to v, start a DFS from u, then follow the parent pointers back from v:
    x = v;
    while (x != u)
        x = p[x];
If we create a new graph in which we add an edge between each vertex x and p[x], what do we get? A DFS spanning tree!

Teaser
Let T be a DFS tree of G.
Show that if (u,v) is an edge in the graph G which is not in T, then either u is an ancestor of v or v is an ancestor of u.

Breadth-First Search
• BFS in an undirected graph G is like wandering in a labyrinth with a string
• The starting vertex s is assigned a distance of 0
• In the first round, the string is unrolled the length of one edge, and all of the vertices that are only one edge away from the anchor are visited (discovered) and assigned a distance of 1

Breadth-First Search
• In the second round, all the new vertices that can be reached by unrolling the string 2 edges are visited and assigned a distance of 2
• This continues until every vertex has been assigned a level
• The label of any vertex v corresponds to the length of the shortest path (in terms of edges) from s to v

BFS Example
[Figure]

BFS
Visited[] : initialized to FALSE.
s : starting vertex.
Q : queue. Initially contains just s.
While (Q is not empty) {
    x = Q.dequeue();
    visited[x] = TRUE;
    for each neighbor y of x
        if (visited[y] == FALSE)
            Q.insert(y);
}
What is missing?

BFS
Visited[] : initialized to FALSE.
s : starting vertex.
Q : queue. Initially contains just s.
inQueue[v] : is v in the queue?
While (Q is not empty) {
    x = Q.dequeue();
    visited[x] = TRUE;
    for each neighbor y of x
        if (visited[y] == FALSE AND inQueue[y] == FALSE) {
            Q.insert(y);
            inQueue[y] = TRUE;
            p[y] = x;
        }
}

BFS Tree Properties
No edge in the graph can cross more than 1 level. WHY?

BFS Tree Properties
The BFS tree is a shortest path tree.

BFS Properties
Given a graph G = (V,E), BFS discovers all vertices reachable from a source vertex s:
• It computes the shortest distance to all reachable vertices
• It computes a breadth-first tree that contains all such reachable vertices
• For any vertex v reachable from s, the path in the breadth-first tree from s to v corresponds to a shortest path in G

Directed Graphs
• Edges are ordered pairs.
• Can use adjacency matrix or adjacency list representation.

Directed Graphs Terminology
• Outdegree, indegree of a vertex.
• What is the sum of the indegrees of all the vertices?
• Directed paths, directed cycles.
• What about connectivity?
• A directed graph is strongly connected if, given any two vertices u and v, there is a directed path from u to v.

DFS on directed graphs
DFS (G,v)
    visited[v] ← true
    pre[v] ← clock; clock++
    for each out-neighbor u of v
        if not visited[u] then DFS(G,u)
    post[v] ← clock; clock++
Interval of vertex v: I_v := [pre[v], post[v]]
What properties do these intervals have?

Example
[Figures: DFS on a directed graph with vertices A, B, C, D; the non-tree edges are labeled as a back edge, a forward edge, and a cross edge]

Applications
• How do we find a cycle in a directed graph?
• How do we check if a graph is strongly connected?

Topological Sorting
For each edge (u,v), vertex u is visited before vertex v.
[Figure: “A typical student day”, a dag of activities numbered 1 to 11 in topological order, from “wake up” (1) to “dream of cs16” (11)]

Topological Sorting
Topological sorting may not be unique.
[Figure: a dag on the vertices A, B, C, D with two valid orderings, A B C D or A C B D]

Topological Sorting
• Labels are increasing along a directed path
• A digraph has a topological sorting if and only if it is acyclic (i.e., it is a dag)
[Figure: a dag whose vertices A, B, C, D, E are labeled 1 to 5]

Topological Sorting
Can you use DFS for topological sorting?
[Figure: the same dag with vertices A to E labeled 1 to 5]
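
One common answer to this last question is to run DFS and list the vertices in decreasing order of their post numbers, that is, in reverse order of finishing. The Python sketch below illustrates that idea under stated assumptions: the function name `topological_order`, the dictionary-of-lists graph representation, and the edge set of the example dag are hypothetical choices for this note, not taken from the slides. A vertex that is still on the current DFS path when it is reached again indicates a back edge, so the same pass also detects directed cycles.

```python
def topological_order(graph):
    """Return the vertices of a dag so that every edge (u, v) has u before v.

    graph maps each vertex to a list of its out-neighbors.
    Raises ValueError if the graph contains a directed cycle.
    """
    visited = set()
    on_path = set()      # vertices whose DFS call has started but not finished
    finished = []        # vertices in increasing order of their post number

    def dfs(v):
        visited.add(v)
        on_path.add(v)
        for u in graph.get(v, ()):
            if u in on_path:
                raise ValueError("back edge found: the graph has a directed cycle")
            if u not in visited:
                dfs(u)
        on_path.discard(v)
        finished.append(v)   # the point where post[v] would be assigned

    for v in graph:
        if v not in visited:
            dfs(v)
    return finished[::-1]    # decreasing post number


# Hypothetical dag on the vertices A..E (the edges are an assumption).
dag = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
print(topological_order(dag))   # ['A', 'C', 'B', 'D', 'E']
```

The recursion mirrors the DFS pseudocode on the slides; for very deep graphs an explicit stack would avoid Python's recursion limit.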