CSE 326: Data Structures
Introduction & Part One: Complexity
Henry Kautz
Autumn Quarter 2002

Overview of the Quarter
• Part One: Complexity
– inductive proofs of program correctness
– empirical and asymptotic complexity
– order of magnitude notation; logs & series
– analyzing recursive programs
• Part Two: List-like data structures
• Part Three: Sorting
• Part Four: Search Trees
• Part Five: Hash Tables
• Part Six: Heaps and Union/Find
• Part Seven: Graph Algorithms
• Part Eight: Advanced Topics

Material for Part One
• Weiss Chapters 1 and 2
• Additional material
– Graphical analysis
– Amortized analysis
– Stretchy arrays
• Any questions on course organization?

Program Analysis
• Correctness
– Testing
– Proofs of correctness
• Efficiency
– How to define?
– Asymptotic complexity - how running time scales as a function of the size of the input

Proving Programs Correct
• Often takes the form of an inductive proof
• Example: summing an array

int sum(int v[], int n)
{
  if (n == 0) return 0;
  else return v[n-1] + sum(v, n-1);
}

What are the parts of an inductive proof?

Inductive Proof of Correctness
Theorem: sum(v,n) correctly returns the sum of the first n elements of array v, for any n.
Basis Step: The program is correct for n = 0; it returns 0.
Inductive Hypothesis (n = k): Assume sum(v,k) returns the sum of the first k elements of v.
Inductive Step (n = k+1): sum(v,k+1) returns v[k] + sum(v,k), which is the sum of the first k+1 elements of v.

Proof by Contradiction
• Assume the negation of the goal, and show this leads to a contradiction
• Example: there is no program that solves the "halting problem"
– i.e., no program HALT that determines whether any other program runs forever or not
Alan Turing, 1937

Program NonConformist (Program P)
  If ( HALT(P) = "never halts" ) Then
    Halt
  Else
    Do While (1 > 0)
      Print "Hello!"
    End While
  End If
End Program

• Does NonConformist(NonConformist) halt?
• Yes? That means HALT(NonConformist) = "never halts"
• No?
That means HALT(NonConformist) = "halts"
Either answer contradicts what HALT reported, so no such program HALT can exist.

Defining Efficiency
• Asymptotic Complexity - how running time scales as a function of the size of the input
• Why is this a reasonable definition?
– Many kinds of small problems can be solved in practice by almost any approach
• E.g., exhaustive enumeration of possible solutions
• We want to focus efficiency concerns on larger problems
– The definition is independent of any possible advances in computer technology

Technology-Dependent Efficiency
• Drum computers: a popular technology of the early 1960's
• Transistors were too costly to use for RAM, so memory was kept on a revolving magnetic drum
• An efficient program scattered its instructions around the drum so that the next instruction to execute was under the read head just when it was needed
– This minimized the number of full revolutions of the drum during execution

The Apocalyptic Laptop
• Speed limited only by physics:
– Energy: E = mc^2, about 25 million megawatt-hours (the rest energy of 1 kg of mass)
– Quantum mechanics: switching speed = h / (2 × Energy), where h is Planck's constant
– Result: 5.4 × 10^50 operations per second
• Seth Lloyd, SCIENCE, 31 Aug 2000
[Chart: operation counts of 2^N, 1.2^N, N^5, N^3, and 5N on a log scale, compared against 1000 MIPS for 1 day, 1000 MIPS since the Big Bang, the Ultimate Laptop for 1 second, and the Ultimate Laptop for 1 year. Even the Ultimate Laptop is eventually overwhelmed by exponential growth.]

Defining Efficiency
• Asymptotic Complexity - how running time scales as a function of the size of the input
• What is "size"?
– Often: the length (in characters) of the input
– Sometimes: the value of the input (if the input is a number)
• Which inputs?
– Worst case
• Advantages / disadvantages?
– Best case
• Why?
Average Case Analysis
• A more realistic analysis, first attempt:
– Assume inputs are randomly distributed according to some "realistic" distribution
– Compute the expected running time

  E(T, n) = Σ_{x ∈ Inputs(n)} Prob(x) · RunTime(x)

– Drawbacks
• Often hard to define realistic random distributions
• Usually hard to perform the math

Amortized Analysis
• Instead of a single input, consider a sequence of inputs
• Choose the worst possible sequence
• Determine the average running time over this sequence
• Advantages
– Often less pessimistic than simple worst-case analysis
– Guaranteed results - no assumed distribution
– Usually mathematically easier than average case analysis

Comparing Runtimes
• Program A is asymptotically less efficient than program B iff the runtime of A dominates the runtime of B, as the size of the input goes to infinity:

  RunTime(A, n) / RunTime(B, n) → ∞ as n → ∞

• Note: RunTime can be "worst case", "best case", "average case", or "amortized case"

Which Function Dominates?
Race I:   n^3 + 2n^2 vs. 100n^2 + 1000
Race II:  n^0.1 vs. log n
Race III: n + 100n^0.1 vs. 2n + 10 log n
Race IV:  5n^5 vs. n!
Race V:   n^-15 · 2^(n/100) vs. 1000n^15
Race VI:  8^(2 log n) vs. 3n^7 + 7n

Order of Magnitude Notation (big O)
• Asymptotic Complexity - how running time scales as a function of the size of the input
– We usually only care about the order of magnitude of the scaling
• Why?
– As we saw, some functions overwhelm other functions
• So if the running time is a sum of terms, we can drop the dominated terms
– "True" constant factors depend on the details of the compiler and hardware
• Might as well make the constant factor 1
• Example: 16n^3 log_8(10n^2) + 100n^2 = O(n^3 log n)
• Eliminate low order terms:
  16n^3 log_8(10n^2) + 100n^2 → 16n^3 log_8(10n^2)
• Eliminate constant coefficients:
  16n^3 log_8(10n^2) → n^3 log_8(10n^2)
  n^3 log_8(10n^2) = n^3 log_8(10) + n^3 log_8(n^2) → n^3 log_8(n^2)
  n^3 log_8(n^2) = 2 n^3 log_8(n) → n^3 log_8(n)
  n^3 log_8(n) = n^3 log_8(2) log(n) → n^3 log(n)

Common Names
(from slowest to fastest growth)
constant:     O(1)
logarithmic:  O(log n)
linear:       O(n)
log-linear:   O(n log n)
quadratic:    O(n^2)
exponential:  O(c^n)   (c is a constant > 1)
Also:
superlinear:  O(n^c)   (c is a constant > 1)
polynomial:   O(n^c)   (c is a constant > 0)

Summary
• Proofs by induction and contradiction
• Asymptotic complexity
• Worst case, best case, average case, and amortized asymptotic complexity
• Dominance of functions
• Order of magnitude notation
• Next:
– Part One: Complexity, continued
– Read Chapters 1 and 2

Part One: Complexity, continued
Friday, October 4th, 2002

Determining the Complexity of an Algorithm
• Empirical measurement
• Formal analysis (i.e. proofs)
• Question: what are the likely advantages and drawbacks of each approach?
– Empirical:
• pro: discover if constant factors are significant
• con: may be running on the "wrong" inputs
– Formal:
• pro: no interference from implementation/hardware details
• con: can make a mistake in a proof!
In theory, theory is the same as practice, but not in practice.

Measuring Empirical Complexity: Linear vs.
Binary Search
• Find an item in a sorted array of length N
• Compare linear search to the binary search algorithm
• Time to find one item: ?
• Time to find N items: ?

My C code:

void lfind(int x, int a[], int n)
{
  for (i = 0; i < n; i++)
    if (a[i] == x) return;
}

void bfind(int x, int a[], int n)
{
  m = n / 2;
  if (x == a[m]) return;
  if (x < a[m]) bfind(x, a, m);
  else bfind(x, &a[m+1], n-m-1);
}

for (i=0; i<n; i++) a[i] = i;
for (i=0; i<n; i++) lfind(i,a,n);   /* or bfind */

Graphical Analysis
[Plot: seconds vs. N (0 to 120) to find N items with linear vs. binary search; both curves hug zero at this scale.]
[Plot: the same data out to N = 1,000,000; the linear-search curve rises steeply while the binary-search curve stays nearly flat.]
[Log/log plot: seconds vs. N, from 1 to 10,000,000; both curves are nearly straight lines - iterated linear search with slope 2, iterated binary search with slope 1.]

Property of Log/Log Plots
• On a linear plot, a linear function is a straight line
• On a log/log plot, any polynomial function is a straight line!
– The slope Δy/Δx is the same as the exponent
Proof: Suppose y = c x^k. Then
  log y = log(c x^k)
  log y = log c + log(x^k)
  log y = log c + k log x
With log y on the vertical axis and log x on the horizontal axis, this is a straight line with slope k.

Why does O(n log n) look like a straight line on a log/log plot?
[Log/log plot: iterated linear search is a straight line of slope 2; iterated binary search, an n log n curve, is very nearly a straight line of slope just above 1 - the log factor contributes only a slowly growing correction.]

Summary
• Empirical and formal analyses of runtime scaling are both important techniques in algorithm development
• Large data sets may be required to gain an accurate empirical picture
• Log/log plots provide a fast and simple visual tool for estimating the exponent of a polynomial function

Formal Asymptotic Analysis
• In order to prove complexity results, we must make the notion of "order of magnitude" more precise
• Asymptotic bounds on runtime
– Upper bound
– Lower bound

Definition of Order Notation
• Upper bound: T(n) = O(f(n))   ["Big-O"]
  There exist constants c and n' such that T(n) ≤ c f(n) for all n ≥ n'
• Lower bound: T(n) = Ω(g(n))   ["Omega"]
  There exist constants c and n' such that T(n) ≥ c g(n) for all n ≥ n'
• Tight bound: T(n) = θ(f(n))   ["Theta"]
  When both hold: T(n) = O(f(n)) and T(n) = Ω(f(n))

Example: Upper Bound
Claim: n^2 + 100n = O(n^2)
Proof: Must find c, n' such that for all n ≥ n',
  n^2 + 100n ≤ c n^2
Let's try setting c = 2. Then
  n^2 + 100n ≤ 2n^2
  100n ≤ n^2
  100 ≤ n
So we can set n' = 100 and reverse the steps above.

Using a Different Pair of Constants
Claim: n^2 + 100n = O(n^2)
Proof: Must find c, n' such that for all n ≥ n',
  n^2 + 100n ≤ c n^2
Let's try setting c = 101. Then
  n^2 + 100n ≤ 101n^2
  n + 100 ≤ 101n   (divide both sides by n)
  100 ≤ 100n
  1 ≤ n
So we can set n' = 1 and reverse the steps above.

Example: Lower Bound
Claim: n^2 + 100n = Ω(n^2)
Proof: Must find c, n' such that for all n ≥ n',
  n^2 + 100n ≥ c n^2
Let's try setting c = 1. Then
  n^2 + 100n ≥ n^2
  n ≥ 0
So we can set n' = 0 and reverse the steps above.
Thus we can also conclude n^2 + 100n = θ(n^2)

Conventions of Order Notation
Order notation is not symmetric: write 2n^2 + n = O(n^2), but never O(n^2) = 2n^2 + n
The expression O(f(n)) = O(g(n)) is equivalent to f(n) = O(g(n))
The expression Ω(f(n)) = Ω(g(n)) is equivalent to f(n) = Ω(g(n))
The right-hand side is a "cruder" version of the left:
  18n^2 = O(n^2) = O(n^3) = O(2^n)
  18n^2 = Ω(n^2) = Ω(n log n) = Ω(n)

Upper/Lower vs.
Worst/Best
• Worst case upper bound is f(n)
– Guarantee that the run time is no more than c f(n)
• Best case upper bound is f(n)
– If you are lucky, the run time is no more than c f(n)
• Worst case lower bound is g(n)
– If you are unlucky, the run time is at least c g(n)
• Best case lower bound is g(n)
– Guarantee that the run time is at least c g(n)

Analyzing Code
• primitive operations
• consecutive statements
• function calls
• conditionals
• loops
• recursive functions

Conditionals
• Conditional: if C then S1 else S2
• Suppose you are doing a O( ) analysis?
  Time(C) + Max(Time(S1), Time(S2)), or (more crudely) Time(C) + Time(S1) + Time(S2), or ...
• Suppose you are doing a Ω( ) analysis?
  Time(C) + Min(Time(S1), Time(S2)), or (more crudely) Time(C), or ...

Nested Loops
for i = 1 to n do
  for j = 1 to n do
    sum = sum + 1

  Σ_{i=1}^{n} Σ_{j=1}^{n} 1 = Σ_{i=1}^{n} n = n^2

Nested Dependent Loops
for i = 1 to n do
  for j = i to n do
    sum = sum + 1

  Σ_{i=1}^{n} Σ_{j=i}^{n} 1 = ?

Summary
• Formal definition of order of magnitude notation
• Proving upper and lower asymptotic bounds on a function
• Formal analysis of conditionals and simple loops
• Next:
– Analyzing complex loops
– Mathematical series
– Analyzing recursive functions

Part One: Complexity, Continued
Monday October 7, 2002

Today's Material
• Running time of nested dependent loops
• Mathematical series
• Formal analysis of linear search
• Formal analysis of binary search
• Solving recursive equations
• Stretchy arrays and the Stack ADT
• Amortized analysis

Nested Dependent Loops
for i = 1 to n do
  for j = i to n do
    sum = sum + 1

  Σ_{i=1}^{n} Σ_{j=i}^{n} 1 = Σ_{i=1}^{n} (n - i + 1) = Σ_{i=1}^{n} n - Σ_{i=1}^{n} i + Σ_{i=1}^{n} 1

Arithmetic Series
  S(N) = 1 + 2 + ... + N = Σ_{i=1}^{N} i = ?
• Note that: S(1) = 1, S(2) = 3, S(3) = 6, S(4) = 10, ...
• Hypothesis: S(N) = N(N+1)/2
• Prove by induction
– Base case: for N = 1, S(N) = 1(2)/2 = 1
– Assume true for N = k
– Suppose N = k+1. Then
  S(k+1) = S(k) + (k+1) = k(k+1)/2 + (k+1) = (k+1)(k/2 + 1) = (k+1)(k+2)/2

Other Important Series
• Sum of squares:
  Σ_{i=1}^{N} i^2 = N(N+1)(2N+1)/6 ≈ N^3/3 for large N
• Sum of exponents:
  Σ_{i=1}^{N} i^k ≈ N^(k+1) / |k+1| for large N and k ≠ -1
• Geometric series:
  Σ_{i=0}^{N} A^i = (A^(N+1) - 1)/(A - 1)
• Novel series:
– Reduce to known series, or prove inductively

Nested Dependent Loops
for i = 1 to n do
  for j = i to n do
    sum = sum + 1

  Σ_{i=1}^{n} (n - i + 1) = n^2 - n(n+1)/2 + n = n^2/2 + n/2 = θ(n^2)

Linear Search Analysis
void lfind(int x, int a[], int n)
{
  for (i = 0; i < n; i++)
    if (a[i] == x) return;
}
• Best case, tight analysis: θ(1)
• Worst case, tight analysis: θ(n)

Iterated Linear Search Analysis
for (i=0; i<n; i++) a[i] = i;
for (i=0; i<n; i++) lfind(i,a,n);
• Easy worst-case upper-bound: n · O(n) = O(n^2)
• Worst-case tight analysis:
– Just multiplying the worst case by n does not justify the answer, since each time lfind is called, i is specified:
  Σ_{i=1}^{n} Σ_{j=1}^{i} 1 = Σ_{i=1}^{n} i = n(n+1)/2 = θ(n^2)

Analyzing Recursive Programs
1. Express the running time T(n) as a recursive equation
2. Solve the recursive equation
• For an upper-bound analysis, you can optionally simplify the equation to something larger
• For a lower-bound analysis, you can optionally simplify the equation to something smaller

Binary Search
void bfind(int x, int a[], int n)
{
  m = n / 2;
  if (x == a[m]) return;
  if (x < a[m]) bfind(x, a, m);
  else bfind(x, &a[m+1], n-m-1);
}
What is the worst-case upper bound?
Trick question: as written, bfind has no upper bound at all - if x does not occur in the array, the recursion never reaches a base case and runs forever. We must add one:

void bfind(int x, int a[], int n)
{
  m = n / 2;
  if (n <= 1) return;
  if (x == a[m]) return;
  if (x < a[m]) bfind(x, a, m);
  else bfind(x, &a[m+1], n-m-1);
}

Okay, let's prove it is θ(log n)...
Introduce some constants:
  b = time needed for the base case
  c = time needed to get ready to do a recursive call
The running time is thus:
  T(1) = b
  T(n) = T(n/2) + c for n > 1

Binary Search Analysis
One sub-problem, half as large
Equation: T(1) = b
          T(n) = T(n/2) + c for n > 1
Solution:
  T(n) = T(n/2) + c            [write equation]
       = T(n/4) + c + c        [expand]
       = T(n/8) + c + c + c
       = T(n/2^k) + kc         [inductive leap]
       = T(1) + c log n        [select value k = log n]
       = b + c log n = O(log n)  [simplify]

Solving Recursive Equations by Repeated Substitution
• Somewhat "informal", but intuitively clear and straightforward
  T(n) = T(n/2) + c                 [substitute for T(n/2)]
  T(n) = (T(n/4) + c) + c           [substitute for T(n/4)]
  T(n) = ((T(n/8) + c) + c) + c
  T(n) = T(n/2^k) + kc              ["inductive leap"]
  T(n) = T(n/2^(log n)) + c log n   [choose k = log n]
  T(n) = T(n/n) + c log n
       = T(1) + c log n
       = b + c log n

Solving Recursive Equations by Telescoping
• Create a set of equations, then take their sum
  T(n)   = T(n/2) + c     [initial equation]
  T(n/2) = T(n/4) + c     [so this holds...]
  T(n/4) = T(n/8) + c     [and this...]
  T(n/8) = T(n/16) + c    [and this...]
  ...                     [and eventually...]
  T(2)   = T(1) + c
Sum the equations, cancelling the terms that appear on both sides:
  T(n) = T(1) + c log n
Look familiar?
  T(n) = θ(log n)

Solving Recursive Equations by Induction
• Repeated substitution and telescoping construct the solution
• If you know the closed form solution, you can validate it by ordinary induction
• For the induction, it may be easier to increase n by a multiple (2n) rather than by n+1

Inductive Proof
  T(1) = b + c log 1 = b        [base case]
  Assume T(n) = b + c log n     [hypothesis]
  T(2n) = T(n) + c              [definition of T(n)]
  T(2n) = (b + c log n) + c     [by induction hypothesis]
  T(2n) = b + c((log n) + 1)
  T(2n) = b + c((log n) + (log 2))
  T(2n) = b + c log(2n)         [Q.E.D.]
Thus: T(n) = θ(log n)

Example: Sum of Integer Queue
sum_queue(Q){
  if (Q.length() == 0) return 0;
  else return Q.dequeue() + sum_queue(Q);
}
– One subproblem
– Linear reduction in size (decrease by 1)
Equation: T(0) = b
          T(n) = c + T(n - 1) for n > 0
(Repeated substitution gives T(n) = b + cn = θ(n).)

Lower Bound Analysis: Recursive Fibonacci
int Fib(n){
  if (n == 0 or n == 1) return 1;
  else return Fib(n - 1) + Fib(n - 2);
}
• Lower bound analysis: Ω(?)
• Instead of =, the equations will use ≥: T(n) ≥ (some expression)
• Will simplify the math by throwing out terms on the right-hand side

Analysis by Repeated Substitution
  T(0) = T(1) = a                     [base case]
  T(n) ≥ b + T(n-1) + T(n-2)          [recursive case]
  T(n) ≥ b + 2T(n-2)                  [simplify to a smaller quantity]
  T(n) ≥ b + 2(b + 2T(n-2-2))         [substitute]
  T(n) ≥ 3b + 4T(n-4)                 [simplify]
  T(n) ≥ 3b + 4(b + 2T(n-4-2))        [substitute]
  T(n) ≥ 7b + 8T(n-6)                 [simplify]
  T(n) ≥ 7b + 8(b + 2T(n-6-2))        [substitute]
  T(n) ≥ 15b + 16T(n-8)               [simplify]
  T(n) ≥ (2^k - 1)b + 2^k T(n - 2k)   [inductive leap]
  T(n) ≥ (2^(n/2) - 1)b + 2^(n/2) T(n - 2(n/2))   [choose k = n/2]
  T(n) ≥ 2^(n/2)(b + a) - b           [simplify]
  T(n) = Ω(2^(n/2))
Note: this is not the same as Ω(2^n)!!!

Learning from Analysis
• To avoid recursive calls
– store all basis values in a table
– each time you calculate an answer, store it in the table
– before performing any calculation for a value n
• check if a valid answer for n is in the table
• if so, return it
• Memoization
– a form of dynamic programming
• How much time does the memoized version take?
Amortized Analysis
• Consider any sequence of operations applied to a data structure
• Some operations may be fast, others slow
• Goal: show that the average time per operation is still good:

  (total time for n operations) / n

Stack Abstract Data Type
• Stack operations
– push
– pop
– is_empty
• Stack property: if x is on the stack before y is pushed, then x will be popped after y is popped
[Diagram: elements A, B, C, D, E, F pushed onto the stack and popped off in reverse (LIFO) order]
What is the biggest problem with an array implementation?

Stretchy Stack Implementation
int[] data;
int maxsize;
int top;

Push(e){
  if (top == maxsize){
    temp = new int[2*maxsize];
    for (i = 0; i < maxsize; i++)
      temp[i] = data[i];
    data = temp;
    maxsize = 2*maxsize;
  }
  data[top++] = e;
}

(top counts the elements currently stored; note that e must be stored even on a push that triggers a stretch.)
Best case Push = O(1)
Worst case Push = O(n)

Stretchy Stack Amortized Analysis
• Consider a sequence of n operations: push(3); push(19); push(2); ...
• What is the max number of stretches? log n
• What is the total time?
– say a regular push takes time a, and stretching an array containing k elements takes time bk. Then the total time is

  an + b(1 + 2 + 4 + 8 + ... + n) = an + b Σ_{i=0}^{log n} 2^i

Geometric Series
  Σ_{i=0}^{N} A^i = (A^(N+1) - 1)/(A - 1)
  Σ_{i=0}^{n} 2^i = (2^(n+1) - 1)/(2 - 1) = 2^(n+1) - 1
  Σ_{i=0}^{log n} 2^i = 2^(log n + 1) - 1 = 2·2^(log n) - 1 = 2n - 1

So the total time is

  an + b(2n - 1) = θ(n)

• Amortized time = (an + b(2n - 1)) / n = θ(1)

Surprise
• In an asymptotic sense, there is no overhead in using stretchy arrays rather than regular arrays!