Introduction to Matlab & Data Analysis
Lecture 8: Simple algorithms & how to build efficient programs in Matlab
Eran Eden, Weizmann 2008 ©
Some of the animation slides are based on the lectures published by Georgia Tech College of Computing
http://www.cs.gatech.edu/~bleahy/cs1311
Today’s menu
    Reminder of previous lecture
    Simple algorithms
        What is an algorithm?
        Why do we need good algorithms?
        Strategies for designing efficient algorithms
            Binary search
            Bubble sort
            “Divide and conquer” - Merge sort
    Developing efficient programs in Matlab
        Making your program run fast
        Lint optimizer
        Profiler
Reading and writing to files
Low level I/O

4 general steps for copying one file into another:
1) Open the source file & open a target file
fid = fopen(<file name>, <permission>)
% opens the file <file name> with the given <permission> (e.g. 'r' for read access, 'w' for write access)
2) Read the lines one by one and store each in a variable
line = fgetl(fid)
% returns the next line of the file associated with file identifier fid as a MATLAB string and discards the newline character
3) Write the lines into the target file
fprintf(fid, format, variables);
% writes formatted data to the file specified by fid
4) Close the source and the target files
fclose(fid)
% closes the file associated with file identifier fid
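The four steps above can be put together as in the following minimal sketch. The file names and the two sample lines are made up for illustration; the sketch first creates its own small source file in the temporary directory so that it can run anywhere.

```matlab
% Hypothetical file names, used only for this illustration
src = fullfile(tempdir, 'copy_demo_source.txt');
dst = fullfile(tempdir, 'copy_demo_target.txt');

% Create a small source file so the example is self-contained
fid = fopen(src, 'w');
fprintf(fid, 'first line\nsecond line\n');
fclose(fid);

% 1) Open the source file for reading and the target file for writing
fin  = fopen(src, 'r');
fout = fopen(dst, 'w');

% 2) + 3) Read the lines one by one and write each into the target file
line = fgetl(fin);
while ischar(line)               % fgetl returns -1 (not a string) at end of file
    fprintf(fout, '%s\n', line); % fgetl dropped the newline, so add it back
    line = fgetl(fin);
end

% 4) Close both files
fclose(fin);
fclose(fout);
```

Testing the result of fgetl with ischar is a common way to detect the end of the file, because fgetl returns the number -1 when no lines are left.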
Reading and Writing to files, continued…
Matlab also has high-level I/O functions for reading data.
Advantages: they are easier to use and require less programming than the low-level functions.
Limitations: they only work if the data is in a predefined format.
Example: we have a text file called table_data.txt that looks as follows:
Name           Age   Weight   Salary
Badal           77       79     5909
Badri          100       86     5290
Badrinath       22       79     6960
Bahubali        95       91     6233
Bahuleya        33       77     8314
Bajrang         48       85     5870
Balaaditya      78      106     9144
Balachandra     78       86     7055
Balagovind      64      120     7570
Balaji          16      112     5021
Balakrishna     30       76     9066
Balamani        89       79     6253
….
The importdata command
>> file = importdata('C:\Matlab and Data Analysis\table_data.txt')
file =
        data: [93x3 double]
    textdata: {94x4 cell}
All the numeric data is stored in a double array called data
>> file.data
ans =
    25     85    9056
    38     76    8621
    83     86    5115
    99    116    7510
    27     67    5433
    92     78    9053
    42    113    6853
     2    117    8798
    38     79    6703
    99     60    9746
    22    100    7875
    ...
All the char data is stored in a cell array called textdata
>> file.textdata
ans =
    'Name'           'Age'    'Weight '    'Salary '
    'Badal'          []       []           []
    'Badri'          []       []           []
    'Badrinath'      []       []           []
    'Bahubali'       []       []           []
    'Bahuleya'       []       []           []
    'Bajrang'        []       []           []
    'Balaaditya'     []       []           []
    'Balachandra'    []       []           []
    'Balagovind'     []       []           []
    'Balaji'         []       []           []
    'Balakrishna'    []       []           []
    'Balamani'       []       []           []
    ...
There are many other functions for reading and writing to files.
    For example, reading the content of an Excel spreadsheet can easily be done with the function xlsread.
    Some of these functions will be discussed in the tutorial…
    For more details just use help
Algorithms
What is an algorithm?
An algorithm is a sequence of instructions, often used for calculation or data processing.
Example (flowchart): Lamp doesn’t work
    Lamp plugged in?
        no:  Plug in lamp
        yes: Bulb burned out?
            yes: Replace bulb
            no:  Replace lamp
The same problem with the checks in a different order (flowchart): Lamp doesn’t work
    Bulb burned out?
        yes: Replace bulb
        no:  Lamp plugged in?
            no:  Plug in lamp
            yes: Replace lamp
Why do we need good algorithms?
Without efficient algorithms many simple problems can’t be solved by the computer (the running time is too long, or there is not enough memory).
How do we measure whether an algorithm is “good”?
Time complexity
    The number of steps it takes to solve the problem as a function of input size
    Examples:
        Analogy: Mowing grass has linear time complexity because it takes double the time to mow double the area
        What about looking up a name in a dictionary? What happens if we double the dictionary size?
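To make the dictionary question concrete, the following sketch counts the worst-case number of lookups for two strategies as the dictionary size doubles: scanning the entries one by one, and repeatedly halving the range that can still contain the word. The sizes are arbitrary illustration values, and the halving loop simply assumes the word is always in the right half, which is the longest possible path.

```matlab
% Arbitrary dictionary sizes, each double the previous one
sizes = [1000 2000 4000 8000];
linear_steps = zeros(size(sizes));
binary_steps = zeros(size(sizes));

for k = 1 : length(sizes)
    n = sizes(k);

    % Scanning one entry at a time: in the worst case all n entries are read
    linear_steps(k) = n;

    % Halving: count how many entries are read before the range [l, r] is empty
    l = 1; r = n; steps = 0;
    while l <= r
        steps = steps + 1;          % one entry (the middle one) is read
        mid = floor((l + r) / 2);
        l = mid + 1;                % assume the word lies in the right half
    end
    binary_steps(k) = steps;
end

% Rows: dictionary size, steps when scanning, steps when halving
disp([sizes; linear_steps; binary_steps]);
```

Under these assumptions, doubling the dictionary doubles the work of the scan but adds only one extra lookup to the halving strategy.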
Space complexity
    The amount of memory required by the algorithm
Optimal vs. suboptimal solutions
There are numerous examples of problems that require efficient algorithms
    Finding a person’s name in a phone book
    Designing a web crawler
    The sequence alignment problem
    The traveling salesman problem (TSP)
    Teaching a computer to play Tic-Tac-Toe
    Teaching a computer to play Chess
The Binary search algorithm
You are given a person’s name and a phone book.
Think of a naïve algorithm for finding that person’s phone number
    How many page turns are needed on average?
    What happens when we double the phone book size?
Think of an efficient algorithm for finding the person’s phone number
    How many page turns are needed on average?
    What happens when we double the phone book size?
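One possible naïve algorithm is to read the entries from the first one onwards until the name is found. The sketch below does this on a made-up three-entry phone book (the entries and the variable n_checked are illustrative, not part of the lecture code) and counts how many entries had to be read.

```matlab
% Hypothetical three-entry phone book, for illustration only
phonebook = struct('name',   {'aong', 'bong', 'cong'}, ...
                   'number', {'04-111111111', '04-222222222', '04-333333333'});
name = 'cong';

number = [];       % stays empty if the name is not found
n_checked = 0;     % how many entries were read
for i = 1 : length(phonebook)
    n_checked = n_checked + 1;
    if strcmp(phonebook(i).name, name)
        number = phonebook(i).number;
        break;     % stop at the first match
    end
end
```

With this approach the number of entries read grows in proportion to the size of the phone book, so doubling the book roughly doubles the work.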
Let’s generate a phone book…
function phonebook = initPhoneBook()
phonebook(1).name = 'aong';
phonebook(1).number = '04-111111111';
phonebook(2).name = 'bong';
phonebook(2).number = '04-222222222';
phonebook(3).name = 'bongo';
phonebook(3).number = '04-222222223';
phonebook(4).name = 'cong';
phonebook(4).number = '04-333333333';
phonebook(5).name = 'congo';
phonebook(5).number = '04-333333334';
phonebook(6).name = 'dong';
phonebook(6).number = '04-444444444';
phonebook(7).name = 'eong';
phonebook(7).number = '04-555555555';
phonebook(8).name = 'fong';
phonebook(8).number = '04-666666666';
phonebook(9).name = 'gong';
phonebook(9).number = '04-777777777';
phonebook(10).name = 'hong';
phonebook(10).number = '04-888888888';
function number = binarySearch(phonebook, name)
% Binary search for name in a phonebook that is sorted by name.
% Returns the matching number, or [] if the name is not in the phonebook.
l = 1; r = length(phonebook);
number = [ ];
while l <= r
    mid = floor((l + r) / 2);
    name_curr = phonebook(mid).name;
    cmp = strlexcmp(name_curr, name);
    if cmp == 0
        number = phonebook(mid).number;
        disp(['I found the phone of ', name_curr, '!!!']);
        return;
    elseif cmp > 0
        r = mid - 1;    % the name can only be in the left half
    else
        l = mid + 1;    % the name can only be in the right half
    end
end
disp('Name not found!');
% The function compares two strings according to lexicographic order
% The function returns: 0 if the strings are identical; 1 if str1 > str2 and -1 if str1 < str2
function res = strlexcmp(str1, str2)
% get string lengths
n1 = length(str1);
n2 = length(str2);
n = min(n1, n2);
% find characters that differ
k = find(str1(1 : n) ~= str2(1 : n));
if isempty(k)
    % if all characters are identical then compare lengths
    res = sign(n1 - n2);
else
    % compare first character that is different
    k = k(1);
    res = sign(str1(k) - str2(k)); % ASCII value, case sensitive
end
>> phonebook = initPhoneBook();
>> binarySearch(phonebook, 'fong')
I found the phone of fong!!!
ans =
04-666666666
>> binarySearch(phonebook, 'wong')
Name not found!
ans =
     []
List intersection
Find a naïve algorithm for intersecting two lists of names
    How many operations are performed on average?
Find an efficient algorithm for intersecting two lists of names
    How many operations are performed on average?
Sorting
In order to solve the intersection problem efficiently we need the two lists to be sorted.
Sorting is a very basic operation which is performed routinely. Therefore, we want to find an efficient sorting algorithm.
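The sketch below contrasts one naïve and one sort-based way of intersecting two lists. To keep it short, the lists hold made-up numeric IDs instead of names, each list is assumed to contain no duplicates, and Matlab's built-in sort stands in for the sorting step.

```matlab
% Two hypothetical lists of numeric IDs standing in for names
a = [7 3 9 1 5];
b = [5 8 3 2 7 6];

% Naive approach: compare every element of a with every element of b
common_naive = [];
n_compare_naive = 0;
for i = 1 : length(a)
    for j = 1 : length(b)
        n_compare_naive = n_compare_naive + 1;
        if a(i) == b(j)
            common_naive(end + 1) = a(i);
        end
    end
end

% Sort-based approach: sort both lists, then walk through them together
sa = sort(a); sb = sort(b);
common_sorted = [];
i = 1; j = 1;
while i <= length(sa) && j <= length(sb)
    if sa(i) == sb(j)
        common_sorted(end + 1) = sa(i);   % value appears in both lists
        i = i + 1; j = j + 1;
    elseif sa(i) < sb(j)
        i = i + 1;                        % sa(i) cannot appear later in sb
    else
        j = j + 1;                        % sb(j) cannot appear later in sa
    end
end
```

The naïve version always performs length(a) * length(b) comparisons, while the walk over the sorted lists advances at least one index per step and therefore takes at most length(a) + length(b) steps once the lists are sorted.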
Bubble sort
Example: sorting the array [77 42 35 12 101 5] (positions 1 to 6)
First pass: neighbouring values are compared from left to right and swapped when the left one is larger
    77   42   35   12  101    5     start
    42   77   35   12  101    5     swap 77 and 42
    42   35   77   12  101    5     swap 77 and 35
    42   35   12   77  101    5     swap 77 and 12
    42   35   12   77  101    5     77 and 101: there is no need to swap
    42   35   12   77    5  101     swap 101 and 5; largest value correctly placed
Array after each of the following passes:
    35   12   42    5   77  101
    12   35    5   42   77  101
    12    5   35   42   77  101
     5   12   35   42   77  101
Array is sorted !!!
function array = bubbleSort(array)
for i = (length(array) - 1) : -1 : 1
    for j = 1 : i
        if array(j) > array(j + 1)
            % swap
            temp         = array(j);
            array(j)     = array(j + 1);
            array(j + 1) = temp;
        end
    end
    disp(array)   % show the array after each pass
end
>> array = randperm(10)
array =
     5     6     9     1     4     2    10     8     3     7
>> bubbleSort(array);
     5     6     1     4     2     9     8     3     7    10
     5     1     4     2     6     8     3     7     9    10
     1     4     2     5     6     3     7     8     9    10
     1     2     4     5     3     6     7     8     9    10
     1     2     4     3     5     6     7     8     9    10
     1     2     3     4     5     6     7     8     9    10
     1     2     3     4     5     6     7     8     9    10
     1     2     3     4     5     6     7     8     9    10
     1     2     3     4     5     6     7     8     9    10
What is the time complexity of bubble sort?
    time complexity - The way in which the number of steps required by an algorithm varies with the size of the problem it is solving. Time complexity is normally expressed as an order of magnitude, e.g. O(N^2) means that if the size of the problem (N) doubles then the algorithm will take four times as many steps to complete.
What is the space complexity?
    space complexity - The way in which the amount of storage space required by an algorithm varies with the size of the problem it is solving. Space complexity is normally expressed as an order of magnitude, e.g. O(N^2) means that if the size of the problem (N) doubles then four times as much working storage will be needed.
Can we do better?
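One way to explore the time complexity question empirically is to count comparisons rather than seconds. The sketch below runs the same two nested loops as bubbleSort on random permutations of three arbitrary sizes and records how many comparisons each run makes; the counter and the chosen sizes are additions for illustration.

```matlab
sizes = [100 200 400];            % arbitrary sizes, each double the previous
n_compare = zeros(size(sizes));

for k = 1 : length(sizes)
    array = randperm(sizes(k));
    count = 0;
    for i = (length(array) - 1) : -1 : 1
        for j = 1 : i
            count = count + 1;    % one comparison of neighbouring values
            if array(j) > array(j + 1)
                temp         = array(j);
                array(j)     = array(j + 1);
                array(j + 1) = temp;
            end
        end
    end
    n_compare(k) = count;
end

% How much the comparison count grows each time the size doubles
ratios = n_compare(2 : end) ./ n_compare(1 : end - 1);
disp(n_compare);
disp(ratios);
```

The count is N(N-1)/2 regardless of the input order, so each doubling of N multiplies the work by roughly four, which matches the O(N^2) description above.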
Merge sort
The divide and conquer approach…
The algorithm uses the following fact:
    If we have two sorted lists then merging them together into one sorted list is easy
[Animation: merge sort applied to the array 98 23 45 14 6 67 33 42. The array is split in half repeatedly until each piece holds a single value. Neighbouring pieces are then merged back in sorted order (23 98 and 14 45 into 14 23 45 98; 6 67 and 33 42 into 6 33 42 67), and the two sorted halves are merged into 6 14 23 33 42 45 67 98. Array is sorted!!!]
What is the time complexity of merge sort?
What is the space complexity?
Is merge sort better than bubble sort?
% The function receives a vector x of numbers.
% It returns y consisting of the values in x sorted from smallest to largest.
function y = mergeSort(x)
n = length(x);
if n <= 1
    % An empty or single-element vector is already sorted
    y = x;
else
    m = floor(n/2);
    % Sort the first half
    y1 = mergeSort(x(1 : m));
    % Sort the second half
    y2 = mergeSort(x(m+1 : n));
    % Merge the two halves
    y = merge(y1, y2);
end
function z = merge(x,y)
n = length(x); m = length(y); z = zeros(1,n+m);
ix = 1; % The index of the next x-value to select.
iy = 1; % The index of the next y-value to select.
for iz = 1 : (n+m)
    % Determine the iz-th value for the merged array...
    if ix > n
        % All done with x-values. Select the next y-value.
        z(iz) = y(iy); iy = iy + 1;
    elseif iy > m
        % All done with y-values. Select the next x-value.
        z(iz) = x(ix); ix = ix + 1;
    elseif x(ix) <= y(iy)
        % The next x-value is less than or equal to the next y-value
        z(iz) = x(ix); ix = ix + 1;
    else
        % The next y-value is less than the next x-value
        z(iz) = y(iy); iy = iy + 1;
    end
end
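To get a feel for how much work the merging costs, the sketch below counts how many values are written during all the merges. Note that it is a bottom-up (iterative) variant, not the recursive mergeSort above: it merges runs of width 1, 2, 4, ... so that it fits in a single script. The size 1024 and the counter n_writes are illustration choices.

```matlab
x = randperm(1024);      % arbitrary input; the size is a power of two
n = length(x);
n_writes = 0;            % total number of values written while merging

width = 1;               % length of the sorted runs merged in this pass
while width < n
    y = zeros(1, n);
    for lo = 1 : 2 * width : n
        mid = min(lo + width - 1, n);       % last index of the left run
        hi  = min(lo + 2 * width - 1, n);   % last index of the right run
        ix = lo; iy = mid + 1;
        for iz = lo : hi
            % Take from the left run if the right run is used up, or if the
            % left run still has values and its next value is not larger
            if iy > hi || (ix <= mid && x(ix) <= x(iy))
                y(iz) = x(ix); ix = ix + 1;
            else
                y(iz) = x(iy); iy = iy + 1;
            end
            n_writes = n_writes + 1;
        end
    end
    x = y;
    width = 2 * width;
end
```

Each pass writes all N values once and the run width doubles every pass, so for N = 1024 there are log2(N) = 10 passes and N * log2(N) writes in total. Each pass also needs a second array of length N as working storage.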
Merge Sort vs. Bubble Sort
>> array = randperm(10^6);
>> mergeSort(array)
Merge sort running time: ~60.8 sec
>> bubbleSort(array)
Bubble sort running time: after 10 minutes still no result…
Using Matlab’s built-in sort function
Matlab has built-in sorting functions (which are based on the principles we have just seen)
sorted_array = sort(array);
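A short usage sketch of sort, reusing the values from the bubble sort example as sample data. Besides the sorted values, sort can return the original positions of the values, and it accepts 'descend' to reverse the order.

```matlab
array = [77 42 35 12 101 5];

% Ascending order (the default)
sorted_array = sort(array);

% Descending order; idx holds the original position of each sorted value
[sorted_desc, idx] = sort(array, 'descend');

disp(sorted_array);
disp(sorted_desc);
disp(idx);
```

The index output is handy when a second vector (for example, names that belong to these numbers) has to be reordered in the same way: names(idx).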
How to make efficient programs in Matlab
Matlab is slower than some of the other programming languages… (e.g. C, C++)
However, if you write your code using “good Matlab style” you can significantly reduce this runtime overhead.
Principle #1: Design efficient algorithms in terms of time complexity (minimal number of operations)
Principle #2: Design efficient algorithms in terms of space complexity (minimal memory requirements)
Principle #3: Avoid loops when possible!
Principle #4: Especially avoid nested loops…
Principle #5: “If else” statements inside nested loops are the mother of all evil! Try to avoid them when possible.
Principle #6: Allocate memory in advance when possible
Principle #3, #4: Avoid loops and nested loops when possible
Example 1: What does the following function do?
function mat = funcRand()
mat = zeros(10000, 1000);
tic                              % Start the clock…
for h = 1 : 2 : size(mat, 1)
    for w = 1 : size(mat, 2)
        mat(h, w) = rand(1,1);
    end
end
toc                              % Stop the clock…
>> funcRand;
Elapsed time is 9.446822 seconds.
Can we improve running time?
function mat = funcRandNoLoops()
mat = zeros(10000, 1000);
tic
mat_r = rand(10000, 1000);
mat(1 : 2 : end, :) = mat_r(1 : 2 : end, :);
toc
>> funcRandNoLoops;
Elapsed time is 0.648897 seconds.
Running time is improved by a factor of 14!
Principle #5: “If else” statements inside nested loops are the mother of all evil!
Example: What does the following function do?
function overpaid_workers = getLazyWorkers()
n_workers = 200000;
salary       = 5000 + rand(n_workers, 1) * 30000;
working_days = round(rand(n_workers, 1) * 7);
workers_ids  = [1 : n_workers]';
overpaid_workers = [];
tic
for i = 1 : n_workers
    if salary(i) > 10000
        if working_days(i) <= 3
            overpaid_workers(end + 1) = workers_ids(i);
        end
    end
end
toc
>> getLazyWorkers();
Elapsed time is 22.072058 seconds.
Can we do better?
The same function with the loop replaced by logical indexing:
function overpaid_workers = getLazyWorkers()
n_workers = 200000;
salary       = 5000 + rand(n_workers, 1) * 30000;
working_days = round(rand(n_workers, 1) * 7);
workers_ids  = [1 : n_workers]';
tic
% Select the IDs of all workers that satisfy both conditions at once
overpaid_workers = workers_ids(salary > 10000 & working_days <= 3);
toc
>> getLazyWorkers();
Elapsed time is 0.043157 seconds.
Running time improved by a factor of about 500!
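A small sketch of how the condition inside a loop translates into a logical mask. The five salaries and working-day values are made up so that the result can be checked by hand; the variable name mask is an illustration choice.

```matlab
% Made-up data for five workers
salary       = [12000 8000 30000 15000 9000];
working_days = [2 1 5 3 0];
workers_ids  = 1 : 5;

% Loop version: test each worker in turn
overpaid_loop = [];
for i = 1 : length(workers_ids)
    if salary(i) > 10000 && working_days(i) <= 3
        overpaid_loop(end + 1) = workers_ids(i);
    end
end

% Vectorized version: build a logical mask and use it as an index
mask = salary > 10000 & working_days <= 3;   % true where both conditions hold
overpaid_mask = workers_ids(mask);

disp(overpaid_loop);
disp(overpaid_mask);
```

Indexing with the mask returns the IDs themselves. Applying find to the already selected IDs would instead return their positions 1, 2, ... within that selection, which is why the mask is used directly as the index.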
Principle #6: Allocate memory in advance when possible
function funcD()
tic
vec_size = 100000;
for i = 1 : vec_size
    x(i) = i;
end
toc
Hmm… Looks like a highly optimized program!
>> funcD()
Elapsed time is 40.259843 seconds.
Very bad programming. Memory is allocated during the loop statement
function funcD()
tic
vec_size = 100000;
x = zeros(1, vec_size);
for i = 1 : vec_size
    x(i) = i;
end
toc
Better… Memory is allocated before the loop statement
>> funcD
Elapsed time is 0.003077 seconds.
Running time improved by a factor of 13,000!
function funcD()
tic
vec_size = 100000;
x = 1 : vec_size;
toc
Even better… we can get rid of the loop
>> funcD()
Elapsed time is 0.002347 seconds.
Running time improved by a factor of 17,000!
Syntax checking and code optimization using M-Lint
    What is M-Lint?
Optimizing your program using the Matlab Profiler
    Will be taught in the tutorial