Programming for Engineers in Python Recitation 12 Plan Dynamic Programming Coin Change problem Longest Common Subsequence Application to Bioinformatics 2 Teaching Survey Please answer the teaching survey: https://www.ims.tau.ac.il/Tal/ This will help us to improve the course Deadline: 4.2.12 3 Coin Change Problem What is the smallest number of coins I can use to make exact 4 change? Greedy solution: pick the largest coin first, until you reach the change needed In the US currency this works well: Give change for 30 cents if you’ve got 1, 5, 10, and 25 cent coins: 25 + 5 → 2 coins http://jeremykun.files.wordpress.com/2012/01/coins.jpg The Sin of Greediness What if you don’t have 5 cent coins? You got 1, 10, and 25 Greedy solution: 25+1+1+1+1+1 → 6 coins But a better solution is: 10+10+10 → 3 coins! So the greedy approach isn’t optimal The Seven Deadly Sins and the Four Last Things by Hieronymus Bosch http://en.wikipedia.org/wiki/File:Boschsevendeadlysins.jpg 5 Recursive Solution Reminder – find the minimal # of coins needed to give exact 6 change with coins of specified values Assume that we can use 1 cent coins so there is always some solution Denote our coin list by c1, c2, …, ck (c1=1) k is the # of coins values we can use Denote the change required by n In the previous example: n=30, k=3, c1=1, c2=10, c3=25 Recursive Solution Recursion Base: If n=0 then we need 0 coins If k=1, c1=1, so we need n coins Recursion Step: If n<ck we can’t use ck → We solve for n and c1,…,ck-1 Otherwise, we can either use ck or not use ck If we use ck → we solve for n-ck and c1,…,ck If we don’t use ck → we solve for n and c1,…,ck-1 7 Recursion Solution def coins_change_rec( cents_needed, coin_values): if cents_needed <= 0: # base 1 return 0 elif len(coin_values) == 1: # base 2 return cents_needed # assume that coin_values[0]==1 elif coin_values[-1] > cents_needed: # step 1 return coins_change_rec( cents_needed, coin_values[:-1]) else: # step 2 s1 = coins_change_rec( cents_needed, coin_values[:-1] ) s2 = coins_change_rec( cents_needed-coin_values[-1], coin_values ) return min(s1, s2+1) 8 coins_rec.py Repeated calls We count how many times we call the recursive function for each set of arguments: calls = {} def coins_change_rec(cents_needed, coin_values): global calls calls[(cents_needed, coin_values)] = calls.get( (cents_needed, coin_values) , 0) + 1 … >>> print 'result', coins_change_rec(30, (1,5,10,25)) result 2 >>> print 'max calls',max(calls.values()) max calls 4 9 Dynamic Programing - Memoization We want to store the values of calculation so we don’t repeat them We create a table called mem # of columns: # of cents needed + 1 # of rows: # of coin values + 1 The table is initialized with some illegal value – for example 1: mem = [ [-1 for y in range(cents_needed+1)] for x in range(len(coin_values)) ] 10 Dynamic Programing - Memoization For each call of the recursive function, we check if mem already has the answer: if mem[len(coin_values)][cents_needed] == -1: In case that it doesn’t (the above is True) we calculate it as before, and we store the result, for example: if cents_needed <= 0: mem[len(coin_values)][cents_needed] = 0 Eventually we return the value return mem[len(coin_values)][cents_needed] 11 coins_mem.py Dynamic Programing - Iteration Another approach is to first build the entire matrix This matrix holds the minimal number of coins we need to get change for j cents using the first i coins (c1, c2, …, ci) The solution will be min_coins[k,n] – the last element in the matrix This will save us the recursive calls, but will enforce us to calculate all the values apriori Bottom-up approach vs. the top-down approach of memoization 12 Dynamic Programming approach The point of this approach is that we have a recursive formula to break apart a problem to sub problems Then we can use different approaches to minimize the number of calculations by storing the sub solutions in memory 13 Bottom up - example matrix Set n=4 and k=3 (coins are 1, 2, and 3 cents) Base cases: how many coins do I need to make change for zero cents? 14 Zero! So min_coins[i,0]=0 And how many pennies do I need to make j cents? Exactly j (we assumed we can use pennies) So min_coins[0,j]=j So the base cases give us: 0 1 2 3 4 0 ? ? ? ? 0 ? ? ? ? Next – the recursion step Bottom up - example matrix For particular choice of i,j (but not i=0 or j=0) To determine min_coins[i,j] – the minimum # of coins to get 15 exact change of j using the first i coins We can use the coin ci and add +1 to min_coins[i,j-ci] (only valid if j>ci) We can decide not to use ci , therefore to use only c0 ,.., ci-1, and therefore min_coins[i-1,j] . So which way do we choose? The one with the least coins! min_coins[i,j] = min(min_coins[i,j-ci] +1, min_coins[i-1,j]) Example matrix – recursion step Set n=4 and k=3 (coins are 1, 2, and 3 cents) So the base cases give us: 0 1 2 3 4 𝑀= 0 1 1 2 2 0 1 1 1 2 M(1,1)=1 M(1,2)=1 M(1,3)=min(M(1,1)+1,M(0,3))=min(2,2)=2 M(1,4)=min(M(1,2)+1, M(0,4))=min(2,4)=2 etc… 16 coins_matrix.py The code for the matrix solution and the idea is from http://jeremykun.wordpress.com/2012/01/12/a-spoonful-of-python/ Longest Common Subsequence Given two sequences (strings/lists) we want to find the 17 longest common subsequence Definition – subsequence: B is a subsequence of A if B can be derived from A by removing elements from A Examples [2,4,6] is a subsequence of [1,2,3,4,5,6] [6,4,2] is NOT a subsequence of [1,2,3,4,5,6] ‘is’ is a subsequence of ‘distance’ ‘nice’ is NOT a subsequence of ‘distance’ Longest Common Subsequence Given two subsequences (strings or lists) we want to find the longest common subsequence: Example for a LCS: Sequence 1: HUMAN Sequence 2: CHIMPANZEE Applications include: BioInformatics (next up) Version Control 18 http://wordaligned.org/articles/longest-common-subsequence The DNA Our biological blue-print A sequence made of four bases – A, G, C, T Double strand: A connects to T G connects to C Every triplet encodes for an amino-acid Example: GAG→Glutamate A chain of amino-acids is a protein – the biological machine! 19 http://sips.inesc-id.pt/~nfvr/msc_theses/msc09b/ Longest common subsequence The DNA changes: Mutation: A→G, C→T, etc. Insertion: AGC → ATGC Deletion: AGC → A‒C http://palscience.com/wp-content/uploads/2010/09/DNA_with_mutation.jpg Given two non-identical sequences, we want to find the parts that are common So we can say how different they are Which DNA is more similar to ours? The cat’s or the dog’s? 20 Recursion An LCS of two sequences can be built from the LCSes of 21 prefixes of these sequences Denote the sequences seq1 and seq2 Base – check if either sequence is empty: If len(seq1) == 0 or len(seq2) == 0: return [ ] Step – build solution from shorter sequences: If seq1[-1] == seq2[-1]: return lcs (seq1[:-1],seq2[:-1]) + [ seq1[-1] ] else: return max(lcs (seq1[:-1],seq2), lcs(seq1,seq2[:-1]), key = len) lcs_rec.py Wasteful Recursion For the inputs “MAN” and “PIG”, the calls are: 22 (1, ('', 'PIG')) (1, ('M', 'PIG')) (1, ('MA', 'PIG')) (1, ('MAN', '')) (1, ('MAN', 'P')) (1, ('MAN', 'PI')) (1, ('MAN', 'PIG')) (2, ('MA', 'PI')) (3, ('', 'PI')) (3, ('M', 'PI')) (3, ('MA', '')) (3, ('MA', 'P')) (6, ('', 'P')) (6, ('M', '')) (6, ('M', 'P')) 24 redundant calls! http://wordaligned.org/articles/longest-common-subsequence Wasteful Recursion When comparing longer sequences with a small number of letters the problem is worse For example, DNA sequences are composed of A, G, T and C, and are long For lcs('ACCGGTCGAGTGCGCGGAAGCCGGCCGAA', 'GTCGTTCGGAATGCCGTTGCTCTGTAAA') we get an absurd: (('', 'GT'), 13,182,769) (('A', 'GT'), 13,182,769) (('A', 'G'), 24,853,152) (('', 'G'), 24,853,152) (('A', ''), 24,853,152) 23 http://blog.oncofertility.northwestern.edu/wpcontent/uploads/2010/07/DNA-sequence.jpg DP Saves the Day We saw the overlapping sub problems emerge – comparing the same sequences over and over again We saw how we can find the solution from solution of sub problems – a property we called optimal substructure Therefore we will apply a dynamic programming approach Start with top-down approach - memoization 24 Memoization We save results of function calls to refrain from calculating them again def lcs_mem( seq1, seq2, mem=None ): if not mem: mem = { } key = (len(seq1), len(seq2)) # tuples are immutable if key not in mem: # result not saved yet if len(seq1) == 0 or len(seq2) == 0: mem[key] = [ ] else: if seq1[-1] == seq2[-1]: mem[key] = lcs_mem(seq1[:-1], seq2[:-1], mem) + [ seq1[-1] ] else: mem[key] = max(lcs_mem(seq1[:-1], seq2 ,mem), lcs_mem (seq1, seq2[:-1], mem), key=len ) return mem[key] 25 “maximum recursion depth exceeded” We want to use our memoized LCS algorithm on two long DNA sequences: >>> from random import choice >>> def base(): … return choice('AGCT') >>> seq1 = str([base() for x in range(10000)]) >>> seq2 = str([base() for x in range(10000)]) >>>print lcs(seq1, seq2) RuntimeError: maximum recursion depth exceeded in cmp We need a different algorithm… 26 link→ 27 DNA Sequence Alignment Needleman-Wunsch DP Algorithm: Python package: http://pypi.python.org/pypi/nwalign On-line example: http://alggen.lsi.upc.es/docencia/ember/frame-ember.html Code: needleman_wunsch_algorithm.py Lecture videos from TAU: http://video.tau.ac.il/index.php?option=com_videos&view= video&id=4168&Itemid=53 28