Programming for Engineers in Python
Recitation 12
Plan
 Dynamic Programming
 Coin Change problem
 Longest Common Subsequence
 Application to Bioinformatics
Teaching Survey
 Please answer the teaching survey:
https://www.ims.tau.ac.il/Tal/
 This will help us to improve the course
 Deadline: 4.2.12
Coin Change Problem
 What is the smallest number of coins I can use to make exact change?
 Greedy solution: pick the largest coin first, until you reach the change needed
 In the US currency this works well:
 Give change for 30 cents if you’ve got 1, 5, 10, and 25 cent coins:
 25 + 5 → 2 coins
http://jeremykun.files.wordpress.com/2012/01/coins.jpg
The Sin of Greediness
 What if you don’t have 5 cent coins?
 You’ve got 1, 10, and 25
 Greedy solution: 25+1+1+1+1+1 → 6 coins
 But a better solution is: 10+10+10 → 3 coins!
 So the greedy approach isn’t optimal
The Seven Deadly Sins and the Four Last Things by Hieronymus Bosch
http://en.wikipedia.org/wiki/File:Boschsevendeadlysins.jpg
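The greedy rule is easy to try out in code. The following is a minimal sketch, not code from the recitation; the function name greedy_change is an assumption made for illustration. It counts the coins the greedy rule hands out for both coin sets discussed above.

```python
def greedy_change(cents_needed, coin_values):
    """Count the coins used when always taking the largest coin that fits.

    Hypothetical helper for illustration; assumes a 1 cent coin is available
    so that the remaining amount always reaches zero.
    """
    count = 0
    for coin in sorted(coin_values, reverse=True):
        # Take as many copies of this coin as fit into the remaining amount.
        used, cents_needed = divmod(cents_needed, coin)
        count += used
    return count


us_coins = greedy_change(30, (1, 5, 10, 25))    # 25 + 5
no_nickels = greedy_change(30, (1, 10, 25))     # 25 + 1 + 1 + 1 + 1 + 1
```

With 5 cent coins the greedy count is 2, which is optimal; without them it is 6, although 10+10+10 needs only 3 coins.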
Recursive Solution
 Reminder – find the minimal # of coins needed to give exact change with coins of specified values
 Assume that we can use 1 cent coins so there is always some solution
 Denote our coin list by c1, c2, …, ck (c1=1)
 k is the # of coin values we can use
 Denote the change required by n
 In the previous example: n=30, k=3, c1=1, c2=10, c3=25
Recursive Solution
 Recursion Base:
 If n=0 then we need 0 coins
 If k=1, c1=1, so we need n coins
 Recursion Step:
 If n<ck we can’t use ck → We solve for n and c1,…,ck-1
 Otherwise, we can either use ck or not use ck
 If we use ck → we solve for n-ck and c1,…,ck
 If we don’t use ck → we solve for n and c1,…,ck-1
Recursive Solution
def coins_change_rec(cents_needed, coin_values):
    if cents_needed <= 0:  # base 1
        return 0
    elif len(coin_values) == 1:  # base 2
        return cents_needed  # assume that coin_values[0]==1
    elif coin_values[-1] > cents_needed:  # step 1
        return coins_change_rec(cents_needed, coin_values[:-1])
    else:  # step 2
        s1 = coins_change_rec(cents_needed, coin_values[:-1])
        s2 = coins_change_rec(cents_needed - coin_values[-1],
                              coin_values)
        return min(s1, s2 + 1)
coins_rec.py
Repeated calls
 We count how many times we call the recursive function for each set of arguments:
calls = {}
def coins_change_rec(cents_needed, coin_values):
    # coin_values must be a tuple so it can be part of a dictionary key
    key = (cents_needed, coin_values)
    calls[key] = calls.get(key, 0) + 1
    …
>>> print 'result', coins_change_rec(30, (1,5,10,25))
result 2
>>> print 'max calls',max(calls.values())
max calls 4
Dynamic Programming - Memoization
 We want to store the values of calculation so we don’t repeat them
 We create a table called mem
 # of columns: # of cents needed + 1
 # of rows: # of coin values + 1
 The table is initialized with some illegal value – for example -1:
mem = [ [-1 for y in range(cents_needed+1)] for
        x in range(len(coin_values)+1) ]
Dynamic Programming - Memoization
 For each call of the recursive function, we check if mem already has the answer:
if mem[len(coin_values)][cents_needed] == -1:
 In case that it doesn’t (the above is True) we calculate it as before, and we store the result, for example:
    if cents_needed <= 0:
        mem[len(coin_values)][cents_needed] = 0
 Eventually we return the value
return mem[len(coin_values)][cents_needed]
coins_mem.py
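Putting the fragments above together, a memoized version might look like the sketch below. This is an assumption about how the pieces fit, not the contents of coins_mem.py; the name coins_change_mem and the optional mem parameter are chosen for illustration.

```python
def coins_change_mem(cents_needed, coin_values, mem=None):
    """Minimal number of coins for cents_needed; assumes coin_values[0] == 1.

    Illustrative sketch: mem[k][j] caches the answer for j cents using the
    first k coin values, with -1 marking entries not computed yet.
    """
    if mem is None:
        # Rows: number of coin values allowed (0..k); columns: cents (0..n).
        mem = [[-1] * (cents_needed + 1) for _ in range(len(coin_values) + 1)]
    k = len(coin_values)
    if mem[k][cents_needed] == -1:  # result not saved yet
        if cents_needed == 0:
            result = 0
        elif k == 1:
            result = cents_needed  # only 1 cent coins are left
        elif coin_values[-1] > cents_needed:
            # The largest coin does not fit, so drop it.
            result = coins_change_mem(cents_needed, coin_values[:-1], mem)
        else:
            skip = coins_change_mem(cents_needed, coin_values[:-1], mem)
            use = coins_change_mem(cents_needed - coin_values[-1],
                                   coin_values, mem)
            result = min(skip, use + 1)
        mem[k][cents_needed] = result
    return mem[k][cents_needed]


with_nickels = coins_change_mem(30, (1, 5, 10, 25))
without_nickels = coins_change_mem(30, (1, 10, 25))
```

Because coin_values is always a prefix of the original tuple, its length together with cents_needed is enough to identify a sub problem, which is why the table can be indexed by these two numbers.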
Dynamic Programming - Iteration
 Another approach is to first build the entire matrix
 This matrix holds the minimal number of coins we need to
get change for j cents using the first i coins (c1, c2, …, ci)
 The solution will be min_coins[k,n] – the last element in the
matrix
 This saves us the recursive calls, but forces us to calculate all the values a priori
 Bottom-up approach vs. the top-down approach of
memoization
Dynamic Programming approach
 The point of this approach is that we have a recursive formula
to break apart a problem to sub problems
 Then we can use different approaches to minimize the
number of calculations by storing the sub solutions in
memory
Bottom up - example matrix
 Set n=4 and k=3 (coins are 1, 2, and 3 cents)
 Base cases:
 How many coins do I need to make change for zero cents? Zero!
 So min_coins[i,0]=0
 And how many pennies do I need to make j cents? Exactly j (we assumed we can use pennies)
 So min_coins[0,j]=j
 So the base cases give us:
0 1 2 3 4
0 ? ? ? ?
0 ? ? ? ?
Next – the recursion step
Bottom up - example matrix
 For a particular choice of i,j (but not i=0 or j=0)
 To determine min_coins[i,j] – the minimum # of coins to get exact change of j using the coins c0,…,ci
 We can use the coin ci and add +1 to min_coins[i,j-ci] (only valid if j≥ci)
 We can decide not to use ci, and therefore use only c0,…,ci-1, which gives min_coins[i-1,j]
 So which way do we choose?
 The one with the least coins!
min_coins[i,j] = min(min_coins[i,j-ci] + 1, min_coins[i-1,j])
Example matrix – recursion step
 Set n=4 and k=3 (coins are 1, 2, and 3 cents)
 Filling in the remaining entries with the recursion step gives us:
M =
0 1 2 3 4
0 1 1 2 2
0 1 1 1 2
 M(1,1)=M(0,1)=1 (the 2 cent coin doesn’t fit)
 M(1,2)=min(M(1,0)+1,M(0,2))=min(1,2)=1
 M(1,3)=min(M(1,1)+1,M(0,3))=min(2,3)=2
 M(1,4)=min(M(1,2)+1, M(0,4))=min(2,4)=2
 etc…
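The base cases and the recursion step can be turned into a short table-filling loop. The sketch below is an illustration under the assumptions of this example (rows indexed from 0, coin_values[0] is the 1 cent coin); the name min_coins_table is hypothetical and this is not the code in coins_matrix.py.

```python
def min_coins_table(n, coin_values):
    """Build M where M[i][j] is the minimal number of coins needed to make
    j cents using coin_values[0..i]. Assumes coin_values[0] == 1."""
    k = len(coin_values)
    M = [[0] * (n + 1) for _ in range(k)]
    # Base case: with pennies only, j cents need exactly j coins.
    # Column 0 is already 0: zero cents need zero coins.
    for j in range(n + 1):
        M[0][j] = j
    for i in range(1, k):
        c = coin_values[i]
        for j in range(1, n + 1):
            # Option 1: do not use coin c at all.
            M[i][j] = M[i - 1][j]
            # Option 2: use coin c once more (only possible if it fits).
            if j >= c:
                M[i][j] = min(M[i][j], M[i][j - c] + 1)
    return M


M = min_coins_table(4, (1, 2, 3))
answer = M[-1][-1]  # last element of the matrix
```

For n=4 and coins 1, 2, and 3 this reproduces the three rows of the example matrix, and the last element is 2.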
coins_matrix.py
The code for the matrix solution and the idea is from
http://jeremykun.wordpress.com/2012/01/12/a-spoonful-of-python/
Longest Common Subsequence
 Given two sequences (strings/lists) we want to find the longest common subsequence
 Definition – subsequence: B is a subsequence of A if B can be derived from A by removing elements from A
 Examples
 [2,4,6] is a subsequence of [1,2,3,4,5,6]
 [6,4,2] is NOT a subsequence of [1,2,3,4,5,6]
 ‘is’ is a subsequence of ‘distance’
 ‘nice’ is NOT a subsequence of ‘distance’
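The definition can be checked mechanically. Here is a small sketch; the helper name is_subsequence is an assumption made for illustration and is not part of the recitation code. It scans A once and looks for the elements of B in order.

```python
def is_subsequence(b, a):
    """Return True if b can be derived from a by removing elements of a."""
    remaining = iter(a)
    # `item in remaining` consumes the iterator up to the first match, so
    # each element of b has to be found after the previous one.
    return all(item in remaining for item in b)


examples = [
    is_subsequence([2, 4, 6], [1, 2, 3, 4, 5, 6]),
    is_subsequence([6, 4, 2], [1, 2, 3, 4, 5, 6]),
    is_subsequence('is', 'distance'),
    is_subsequence('nice', 'distance'),
]
```

The four entries of examples correspond to the four examples listed above, in the same order.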
Longest Common Subsequence
 Given two sequences (strings or lists) we want to find the longest common subsequence:
 Example for a LCS:
 Sequence 1: HUMAN
 Sequence 2: CHIMPANZEE
 LCS: HMAN
 Applications include:
 BioInformatics (next up)
 Version Control
http://wordaligned.org/articles/longest-common-subsequence
The DNA
 Our biological blue-print
 A sequence made of four bases
– A, G, C, T
 Double strand:
 A connects to T
 G connects to C
 Every triplet encodes for an
amino-acid
 Example: GAG→Glutamate
 A chain of amino-acids is a
protein – the biological machine!
http://sips.inesc-id.pt/~nfvr/msc_theses/msc09b/
Longest common subsequence
 The DNA changes:
 Mutation: A→G, C→T, etc.
 Insertion: AGC → ATGC
 Deletion: AGC → A‒C
http://palscience.com/wp-content/uploads/2010/09/DNA_with_mutation.jpg
 Given two non-identical sequences, we want to find the parts
that are common
 So we can say how different they are
 Which DNA is more similar to ours? The cat’s or the dog’s?
Recursion
 An LCS of two sequences can be built from the LCSes of prefixes of these sequences
 Denote the sequences seq1 and seq2
 Base – check if either sequence is empty:
    if len(seq1) == 0 or len(seq2) == 0:
        return [ ]
 Step – build solution from shorter sequences:
    if seq1[-1] == seq2[-1]:
        return lcs(seq1[:-1], seq2[:-1]) + [ seq1[-1] ]
    else:
        return max(lcs(seq1[:-1], seq2), lcs(seq1, seq2[:-1]),
                   key=len)
lcs_rec.py
Wasteful Recursion
 For the inputs “MAN” and “PIG”, the calls (count, arguments) are:
(1, ('', 'PIG'))
(1, ('M', 'PIG'))
(1, ('MA', 'PIG'))
(1, ('MAN', ''))
(1, ('MAN', 'P'))
(1, ('MAN', 'PI'))
(1, ('MAN', 'PIG'))
(2, ('MA', 'PI'))
(3, ('', 'PI'))
(3, ('M', 'PI'))
(3, ('MA', ''))
(3, ('MA', 'P'))
(6, ('', 'P'))
(6, ('M', ''))
(6, ('M', 'P'))
 24 redundant calls!
http://wordaligned.org/articles/longest-common-subsequence
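The counts above can be reproduced by instrumenting the recursive function. The sketch below assumes the recursion from the previous slide; the names lcs_counted and calls are chosen for illustration.

```python
from collections import Counter

calls = Counter()  # maps (seq1, seq2) to the number of times it was requested


def lcs_counted(seq1, seq2):
    """Plain recursive LCS that records every call it receives."""
    calls[(seq1, seq2)] += 1
    if len(seq1) == 0 or len(seq2) == 0:
        return []
    if seq1[-1] == seq2[-1]:
        return lcs_counted(seq1[:-1], seq2[:-1]) + [seq1[-1]]
    return max(lcs_counted(seq1[:-1], seq2),
               lcs_counted(seq1, seq2[:-1]),
               key=len)


result = lcs_counted('MAN', 'PIG')
total_calls = sum(calls.values())
distinct_calls = len(calls)
redundant_calls = total_calls - distinct_calls
```

For “MAN” and “PIG” this gives 39 calls in total for only 15 distinct argument pairs, which is where the 24 redundant calls come from.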
Wasteful Recursion
 When comparing longer sequences over a small alphabet the problem is worse
 For example, DNA sequences are composed of A, G, T and C, and are long
 For lcs('ACCGGTCGAGTGCGCGGAAGCCGGCCGAA', 'GTCGTTCGGAATGCCGTTGCTCTGTAAA') we get an absurd number of repeated calls (arguments, count):
(('', 'GT'), 13,182,769)
(('A', 'GT'), 13,182,769)
(('A', 'G'), 24,853,152)
(('', 'G'), 24,853,152)
(('A', ''), 24,853,152)
http://blog.oncofertility.northwestern.edu/wpcontent/uploads/2010/07/DNA-sequence.jpg
DP Saves the Day
 We saw the overlapping sub problems emerge –
comparing the same sequences over and over again
 We saw how we can build the solution from solutions of sub problems – a property we called optimal substructure
 Therefore we will apply a dynamic programming
approach
 Start with top-down approach - memoization
Memoization
 We save results of function calls to avoid calculating them again
def lcs_mem(seq1, seq2, mem=None):
    if mem is None:  # "not mem" would also discard an empty shared dict
        mem = { }
    key = (len(seq1), len(seq2))  # tuples are immutable
    if key not in mem:  # result not saved yet
        if len(seq1) == 0 or len(seq2) == 0:
            mem[key] = [ ]
        else:
            if seq1[-1] == seq2[-1]:
                mem[key] = lcs_mem(seq1[:-1], seq2[:-1], mem) + [ seq1[-1] ]
            else:
                mem[key] = max(lcs_mem(seq1[:-1], seq2, mem),
                               lcs_mem(seq1, seq2[:-1], mem), key=len)
    return mem[key]
“maximum recursion depth exceeded”
 We want to use our memoized LCS algorithm on two long DNA sequences:
>>> from random import choice
>>> def base():
...     return choice('AGCT')
...
>>> seq1 = ''.join([base() for x in range(10000)])
>>> seq2 = ''.join([base() for x in range(10000)])
>>> print lcs_mem(seq1, seq2)
RuntimeError: maximum recursion depth exceeded in cmp
 We need a different algorithm…
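One way around the recursion limit is to fill the LCS table bottom-up, in the same spirit as the coin matrix. The sketch below is the standard iterative LCS table, shown as an illustration only; it is not the Needleman-Wunsch alignment algorithm, and the name lcs_iter is an assumption.

```python
def lcs_iter(seq1, seq2):
    """Bottom-up LCS: no recursion, so no recursion depth limit."""
    n, m = len(seq1), len(seq2)
    # length[i][j] holds the LCS length of seq1[:i] and seq2[:j].
    length = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if seq1[i - 1] == seq2[j - 1]:
                length[i][j] = length[i - 1][j - 1] + 1
            else:
                length[i][j] = max(length[i - 1][j], length[i][j - 1])
    # Walk back from the bottom-right corner to recover one LCS.
    result = []
    i, j = n, m
    while i > 0 and j > 0:
        if seq1[i - 1] == seq2[j - 1]:
            result.append(seq1[i - 1])
            i -= 1
            j -= 1
        elif length[i - 1][j] >= length[i][j - 1]:
            i -= 1
        else:
            j -= 1
    result.reverse()
    return result


human_chimp = lcs_iter('HUMAN', 'CHIMPANZEE')
```

The table has (len(seq1)+1) × (len(seq2)+1) entries, so time and memory both grow with the product of the sequence lengths; for two sequences of 10000 bases the full table is large, and storing only the lengths row by row would be a natural refinement.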
DNA Sequence Alignment
 Needleman-Wunsch DP Algorithm:
 Python package: http://pypi.python.org/pypi/nwalign
 On-line example:
http://alggen.lsi.upc.es/docencia/ember/frame-ember.html
 Code: needleman_wunsch_algorithm.py
 Lecture videos from TAU:
 http://video.tau.ac.il/index.php?option=com_videos&view=video&id=4168&Itemid=53