Introduction to Matlab & Data Analysis
Lecture 8: Simple algorithms & how to build efficient programs in Matlab
Eran Eden, Weizmann 2008 ©
Some of the animation slides are based on the lectures published by Georgia Tech College of Computing: http://www.cs.gatech.edu/~bleahy/cs1311

Today’s menu
    Reminder of previous lecture
    Simple algorithms
        What is an algorithm?
        Why do we need good algorithms?
        Strategies for designing efficient algorithms
        Binary search
        Bubble sort
        “Divide and conquer” - Merge sort
    Developing efficient programs in Matlab
        Making your program run fast
        Lint optimizer
        Profiler

Reading and writing to files
Low level I/O: 4 general steps for copying one file into another:
1) Open the source file & open a target file
    fid = fopen(<file name>, <permission>)   % opens the file <file name> with the access type given by <permission> (e.g. 'r' for read, 'w' for write)
2) Read the lines, one by one, and store each in a variable
    line = fgetl(fid)   % returns the next line of the file associated with file identifier fid as a MATLAB string and discards newline characters
3) Write the lines into the target file
    fprintf(fid, format, variables);   % writes formatted data to the file specified by fid
4) Close the source and the target files
    fclose(fid)   % closes the file associated with file identifier fid

Reading and writing to files, continued…
Matlab has high level I/O functions for reading data.
    Advantages: they are easier to use and require less programming than the low level functions.
    Limitations: they only work if the data is in a predefined format.

Reading and writing to files
Example: we have a text file called data_table.txt that looks as follows:

    Name          Age   Weight   Salary
    Badal         77    79       5909
    Badri         100   86       5290
    Badrinath     22    79       6960
    Bahubali      95    91       6233
    Bahuleya      33    77       8314
    Bajrang       48    85       5870
    Balaaditya    78    106      9144
    Balachan      78    86       7055
    Balagovind    64    120      7570
    Balaji        16    112      5021
    Balakrishna   30    76       9066
    Balamani      89    79       6253
    ….

Reading and writing to files
The importdata command

    >> file = importdata('C:\Matlab and Data Analysis\table_data.txt')
    file =
            data: [93x3 double]
        textdata: {94x4 cell}

All the numeric data is stored in a double array called data:

    >> file.data
    ans =
        25    85    9056
        38    76    8621
        83    86    5115
        99    116   7510
        27    67    5433
        92    78    9053
        42    113   6853
        2     117   8798
        38    79    6703
        99    60    9746
        22    100   7875
        ...

All the char data is stored in a cell array called textdata:

    >> file.textdata
    ans =
        'Name'           'Age'    'Weight '    'Salary '
        'Badal'          []       []           []
        'Badri'          []       []           []
        'Badrinath'      []       []           []
        'Bahubali'       []       []           []
        'Bahuleya'       []       []           []
        'Bajrang'        []       []           []
        'Balaaditya'     []       []           []
        'Balachandra'    []       []           []
        'Balagovind'     []       []           []
        'Balaji'         []       []           []
        'Balakrishna'    []       []           []
        'Balamani'       []       []           []
        ...

Reading and writing to files
There are many other functions for reading and writing to files. For example, reading the content of an Excel spreadsheet can easily be done with the function xlsread.
Some of these functions will be discussed in the tutorial…
For more details just use help.

Algorithms

What is an algorithm?
An algorithm is a sequence of instructions, often used for calculation or data processing.

    Lamp doesn’t work
        Lamp plugged in?
            no:  Plug in lamp
            yes: Bulb burned out?
                yes: Replace bulb
                no:  Replace lamp

What is an algorithm?
The same problem can be solved with the steps in a different order:

    Lamp doesn’t work
        Bulb burned out?
            yes: Replace bulb
            no:  Lamp plugged in?
                no:  Plug in lamp
                yes: Replace lamp

Why do we need good algorithms?
Without efficient algorithms many simple problems can’t be solved by the computer (the running time is too long, or there is not enough memory).

How do we measure whether an algorithm is “good”?
Time complexity
    The number of steps it takes to solve the problem as a function of input size
    Examples:
        Analogy: mowing grass has linear time complexity, because it takes double the time to mow double the area
        What about looking up a name in a dictionary? What happens if we double the dictionary size?

How do we measure whether an algorithm is “good”?
Space complexity
    The amount of memory required by the algorithm
Optimal vs. suboptimal solutions

There are numerous examples of problems that require efficient algorithms
    Finding a person’s name in a phone book
    Designing a web crawler
    The sequence alignment problem
    The traveling salesman problem (TSP)
    Teaching a computer to play Tic-Tac-Toe
    Teaching a computer to play Chess

The Binary search algorithm
You are given a person’s name and a phonebook.
Think of a naïve algorithm for finding that person’s phone number.
    How many page swaps are needed on average?
    What happens when we double the phone book size?
Think of an efficient algorithm for finding the person’s phone number.
    How many page swaps are needed on average?
    What happens when we double the phone book size?
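The doubling question can be explored numerically. The following is a minimal sketch, not part of the original slides: it assumes a sorted numeric vector as a stand-in for the phone book and counts the look-ups needed to find the last entry, first with a page-by-page scan and then by repeatedly halving the search range. The names sizes, book, linear_steps and binary_steps are illustrative.

```matlab
% Count look-ups needed to find the last entry of a sorted list,
% for list sizes that double each time.
sizes = [1000 2000 4000 8000];
linear_steps = zeros(size(sizes));
binary_steps = zeros(size(sizes));

for s = 1 : numel(sizes)
    n = sizes(s);
    book = 1 : n;      % sorted stand-in for the phone book
    target = n;        % last entry: worst case for the naive scan

    % Naive search: check the entries one by one from the start
    steps = 0;
    for k = 1 : n
        steps = steps + 1;
        if book(k) == target
            break;
        end
    end
    linear_steps(s) = steps;

    % Halving search: look at the middle entry and discard one half
    lo = 1;
    hi = n;
    steps = 0;
    while lo <= hi
        mid = floor((lo + hi) / 2);
        steps = steps + 1;
        if book(mid) == target
            break;
        elseif book(mid) < target
            lo = mid + 1;
        else
            hi = mid - 1;
        end
    end
    binary_steps(s) = steps;
end

% Rows: list size, naive look-ups, halving look-ups
disp([sizes; linear_steps; binary_steps]);
```

In this setup the naive count doubles whenever the list doubles, while the halving count grows by only one look-up per doubling.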
The Binary search algorithm
Let’s generate a phone book…

    function phonebook = initPhoneBook()
    phonebook(1).name = 'aong';    phonebook(1).number = '04-111111111';
    phonebook(2).name = 'bong';    phonebook(2).number = '04-222222222';
    phonebook(3).name = 'bongo';   phonebook(3).number = '04-222222223';
    phonebook(4).name = 'cong';    phonebook(4).number = '04-333333333';
    phonebook(5).name = 'congo';   phonebook(5).number = '04-333333334';
    phonebook(6).name = 'dong';    phonebook(6).number = '04-444444444';
    phonebook(7).name = 'eong';    phonebook(7).number = '04-555555555';
    phonebook(8).name = 'fong';    phonebook(8).number = '04-666666666';
    phonebook(9).name = 'gong';    phonebook(9).number = '04-777777777';
    phonebook(10).name = 'hong';   phonebook(10).number = '04-88888888';

The Binary search algorithm

    function number = binarySearch(phonebook, name)
    l = 1;
    r = length(phonebook);
    number = [];
    % Search the range l..r; it shrinks by at least one entry per iteration,
    % so the loop also terminates when the name is not in the phone book
    while l <= r
        mid = floor((l + r) / 2);
        name_curr = phonebook(mid).name;
        cmp = strlexcmp(name_curr, name);
        if cmp == 0
            number = phonebook(mid).number;
            disp(['I found the phone of ', name_curr, '!!!']);
            return;
        elseif cmp > 0
            % name comes before the middle entry: continue in the left half
            r = mid - 1;
        else
            % name comes after the middle entry: continue in the right half
            l = mid + 1;
        end
    end
    disp('Name not found!');

The Binary search algorithm

    % The function compares two strings according to lexicographic order.
    % The function returns: 0 if the strings are identical; 1 if str1 > str2 and -1 if str1 < str2
    function res = strlexcmp(str1, str2)
    % get string lengths
    n1 = length(str1);
    n2 = length(str2);
    n = min(n1, n2);
    % find characters that differ
    k = find(str1(1 : n) ~= str2(1 : n));
    if isempty(k)
        % if all characters are identical then compare lengths
        res = sign(n1 - n2);
    else
        % compare the first character that is different
        k = k(1);
        res = sign(str1(k) - str2(k));   % ASCII value, case sensitive
    end

The Binary search algorithm

    >> phonebook = initPhoneBook();
    >> binarySearch(phonebook, 'fong')
    I found the phone of fong!!!
    ans =
    04-666666666
    >> binarySearch(phonebook, 'wong')
    Name not found!
    ans =
         []

List intersection
Find a naïve algorithm for intersecting two lists of names.
    How many operations are performed on average?
Find an efficient algorithm for intersecting two lists of names.
    How many operations are performed on average?

Sorting
In order to solve the intersection problem efficiently we need the two lists to be sorted.
Sorting is a very basic operation which is performed routinely. Therefore, we want to find an efficient sorting algorithm.

Bubble sort

    Index:   1    2    3    4    5    6
    Value:   77   42   35   12   101  5

    First pass (compare neighbouring values and swap them if the left one is larger):
        77 > 42: swap                        ->  42  77  35  12  101  5
        77 > 35: swap                        ->  42  35  77  12  101  5
        77 > 12: swap                        ->  42  35  12  77  101  5
        77 < 101: there is no need to swap   ->  42  35  12  77  101  5
        101 > 5: swap                        ->  42  35  12  77  5    101
    Largest value correctly placed

    Array after each pass:
        42   35   12   77   5    101
        35   12   42   5    77   101
        12   35   5    42   77   101
        12   5    35   42   77   101
        5    12   35   42   77   101
    Array is sorted !!!

Bubble sort

    function array = bubbleSort(array)
    for i = (length(array) - 1) : -1 : 1
        for j = 1 : i
            if array(j) > array(j + 1)
                % swap
                temp = array(j);
                array(j) = array(j + 1);
                array(j + 1) = temp;
            end
        end
        % show the array after each pass
        disp(array)
    end

Bubble sort

    >> array = randperm(10)
    array =
         5     6     9     1     4     2    10     8     3     7
    >> bubbleSort(array)
         5     6     1     4     2     9     8     3     7    10
         5     1     4     2     6     8     3     7     9    10
         1     4     2     5     6     3     7     8     9    10
         1     2     4     5     3     6     7     8     9    10
         1     2     4     3     5     6     7     8     9    10
         1     2     3     4     5     6     7     8     9    10
         1     2     3     4     5     6     7     8     9    10
         1     2     3     4     5     6     7     8     9    10
         1     2     3     4     5     6     7     8     9    10

Bubble sort
What is the time complexity?
    Time complexity: the way in which the number of steps required by an algorithm varies with the size of the problem it is solving. Time complexity is normally expressed as an order of magnitude, e.g.
    O(N^2) means that if the size of the problem (N) doubles then the algorithm will take four times as many steps to complete.
What is the space complexity?
    Space complexity: the way in which the amount of storage space required by an algorithm varies with the size of the problem it is solving. Space complexity is normally expressed as an order of magnitude, e.g. O(N^2) means that if the size of the problem (N) doubles then four times as much working storage will be needed.
Can we do better?

Merge sort
The divide and conquer approach…
The algorithm uses the following fact: if we have two sorted lists, then merging them together into one sorted list is easy.

[Animation: merge sort applied step by step to the array 98 23 45 14 6 67 33 42. The array is split in half repeatedly until single elements remain; the pieces are then merged pairwise into 23 98, 14 45, 6 67 and 33 42, then into 14 23 45 98 and 6 33 42 67, and finally into 6 14 23 33 42 45 67 98. Array is sorted!!!]

Merge sort
What is the time complexity?
What is the space complexity?
Is merge sort better than bubble sort?

Merge sort

    % The function receives a vector x of numbers.
    % It returns y consisting of the values in x sorted from smallest to largest.
    function y = mergeSort(x)
    n = length(x);
    if n <= 1
        % an empty or single-element vector is already sorted
        y = x;
    else
        m = floor(n/2);
        % Sort the first half
        y1 = mergeSort(x(1 : m));
        % Sort the second half
        y2 = mergeSort(x(m+1 : n));
        % Merge the two halves
        y = merge(y1, y2);
    end

Merge sort

    function z = merge(x, y)
    n = length(x);
    m = length(y);
    z = zeros(1, n+m);
    ix = 1;   % The index of the next x-value to select.
    iy = 1;   % The index of the next y-value to select.
    for iz = 1 : (n+m)
        % Determine the iz-th value for the merged array...
        if ix > n
            % All done with x-values. Select the next y-value.
            z(iz) = y(iy);
            iy = iy + 1;
        elseif iy > m
            % All done with y-values. Select the next x-value.
            z(iz) = x(ix);
            ix = ix + 1;
        elseif x(ix) <= y(iy)
            % The next x-value is less than or equal to the next y-value
            z(iz) = x(ix);
            ix = ix + 1;
        else
            % The next y-value is less than the next x-value
            z(iz) = y(iy);
            iy = iy + 1;
        end
    end

Merge sort
Merge sort vs. bubble sort

    >> array = randperm(10^6);
    >> mergeSort(array)
    >> bubbleSort(array)

    Merge sort running time: ~60.8 sec
    Bubble sort running time: after 10 minutes still no result…

Using Matlab’s built in sort function
Matlab has built in sorting functions (which are based on the principles we have just seen):

    sorted_array = sort(array);

How to make efficient programs in Matlab
Matlab is slower than some other programming languages (e.g. C, C++).
However, if you write your code using “good Matlab style” you can significantly reduce this runtime overhead.

How to make efficient programs in Matlab
    Principle #1: Design efficient algorithms in terms of time complexity (minimal number of operations)
    Principle #2: Design efficient algorithms in terms of space complexity (minimal memory requirements)
    Principle #3: Avoid loops when possible!
    Principle #4: Especially avoid nested loops…
    Principle #5: “If else” statements inside nested loops are the mother of all evil! Try to avoid them when possible.
    Principle #6: Allocate memory in advance when possible

Principle #3, #4: Avoid loops and nested loops when possible
Example 1: What does the following function do?

    function mat = funcRand()
    mat = zeros(10000, 1000);
    tic   % Start the clock…
    for h = 1 : 2 : size(mat, 1)
        for w = 1 : size(mat, 2)
            mat(h, w) = rand(1, 1);
        end
    end
    toc   % Stop the clock…

    >> funcRand;
    Elapsed time is 9.446822 seconds.

Can we improve running time?
Principle #3, #4: Avoid loops and nested loops when possible

    function mat = funcRandNoLoops()
    mat = zeros(10000, 1000);
    tic
    mat_r = rand(10000, 1000);
    mat(1 : 2 : end, :) = mat_r(1 : 2 : end, :);
    toc

    >> funcRandNoLoops;
    Elapsed time is 0.648897 seconds.

Running time is improved by a factor of 14!

Principle #5: “If else” statements inside nested loops are the mother of all evil!
Example: What does the following function do?

    function overpaid_workers = getLazyWorkers()
    n_workers = 200000;
    salary = 5000 + rand(n_workers, 1) * 30000;
    working_days = round(rand(n_workers, 1) * 7);
    workers_ids = [1 : n_workers]';
    overpaid_workers = [];
    tic
    for i = 1 : n_workers
        if salary(i) > 10000
            if working_days(i) <= 3
                overpaid_workers(end + 1) = workers_ids(i);
            end
        end
    end
    toc

    >> getLazyWorkers();
    Elapsed time is 22.072058 seconds.

Can we do better?

Principle #5: Nested loops with control statements are the mother of all evil!

    function overpaid_workers = getLazyWorkers()
    n_workers = 200000;
    salary = 5000 + rand(n_workers, 1) * 30000;
    working_days = round(rand(n_workers, 1) * 7);
    workers_ids = [1 : n_workers]';
    tic
    % Logical indexing selects the ids of all workers that satisfy both conditions
    overpaid_workers = workers_ids(salary > 10000 & working_days <= 3);
    toc

    >> getLazyWorkers();
    Elapsed time is 0.043157 seconds.

Running time improved by a factor of about 500!

Principle #6: Allocate memory in advance when possible

    function funcD()
    tic
    vec_size = 100000;
    for i = 1 : vec_size
        x(i) = i;
    end
    toc

Hmm… Looks like a highly optimized program!

    >> funcD()
    Elapsed time is 40.259843 seconds.

Very bad programming: memory is allocated during the loop statement.

Principle #6: Allocate memory in advance when possible

    function funcD()
    tic
    vec_size = 100000;
    x = zeros(1, vec_size);
    for i = 1 : vec_size
        x(i) = i;
    end
    toc

Better… Memory is allocated before the loop statement.

    >> funcD
    Elapsed time is 0.003077 seconds.

Running time improved by a factor of 13,000!
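The loop version of getLazyWorkers above grows overpaid_workers one element at a time, which is the pattern Principle #6 warns about. When the final size is not known in advance, one option is to allocate for the worst case and trim afterwards. The following is a minimal sketch under that assumption; it reuses the random test data from the lecture example, while the counter n_found and the cross-check variable expected are illustrative names that do not appear in the slides.

```matlab
% Same random test data as in the getLazyWorkers example
n_workers = 200000;
salary = 5000 + rand(n_workers, 1) * 30000;
working_days = round(rand(n_workers, 1) * 7);
workers_ids = (1 : n_workers)';

% Allocate for the worst case (every worker qualifies) before the loop
overpaid_workers = zeros(n_workers, 1);
n_found = 0;   % number of slots actually used so far

for i = 1 : n_workers
    if salary(i) > 10000 && working_days(i) <= 3
        n_found = n_found + 1;
        overpaid_workers(n_found) = workers_ids(i);
    end
end

% Drop the unused tail of the preallocated vector
overpaid_workers = overpaid_workers(1 : n_found);

% Cross-check against the loop-free version based on logical indexing
expected = workers_ids(salary > 10000 & working_days <= 3);
disp(isequal(overpaid_workers, expected));
```

The loop-free version remains the shorter choice when the condition can be written as a logical mask; the preallocate-and-trim pattern is mainly useful when the loop body cannot easily be vectorized.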
Principle #6: Allocate memory in advance when possible

    function funcD()
    tic
    vec_size = 100000;
    x = 1 : vec_size;
    toc

Even better… We can get rid of the loop.

    >> funcD()
    Elapsed time is 0.002347 seconds.

Running time improved by a factor of 17,000!

Syntax checking and code optimization using M-Lint
What is M-Lint?

Optimizing your program using the Matlab Profiler
Will be taught in the tutorial
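As a preview of the profiler, here is a minimal sketch of collecting timing data from the command line. It uses the standard MATLAB commands profile on, profile off and profile('info'); the profiled workload (a preallocated loop calling sqrt) and the variable names are made up for illustration and are not from the lecture.

```matlab
% Start collecting profiling data
profile on

% Illustrative workload: fill a preallocated vector in a loop
vec_size = 100000;
x = zeros(1, vec_size);
for i = 1 : vec_size
    x(i) = sqrt(i);
end

% Stop collecting and fetch the results as a struct
profile off
stats = profile('info');

% stats.FunctionTable holds one entry per profiled function
disp(numel(stats.FunctionTable));

% In an interactive session, the graphical report can be opened with:
% profile viewer
```

The graphical report shows how much time was spent in each function and on each line, which helps locate the loops worth vectorizing or preallocating.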