Take-Home Lab 5

Problem 1: DNA

DNA

Given a string of length N, determine the number of occurrences of some given substrings (each of length K) in that string.

For instance:
    String    : ACGTAC (N = 6)
    Substring : AC (K = 2)
    Answer    : There are 2 occurrences of AC in ACGTAC: [AC]GT[AC]

A substring is a consecutive part of a string. Note that AG is not a substring of ACGTAC.

Brute-force Algorithm

• For each query, iterate through the entire string.
• For each starting position in the string, compare the substring there with the query, and increment the count on a match.

DNA (70%)

    int counter = 0;
    // Only indices 0 .. N - K can start a full window of length K.
    for (int i = 0; i + K <= N; i++) {
        boolean found = true;
        for (int j = 0; j < K; j++) {
            if (text[i + j] != pattern[j]) { // character mismatch
                found = false;
                break;
            }
        }
        if (found) counter++;
    }

• For every query, we check the substring of length K starting at each index i.
• We can answer one query in O(N·K).
• Hence, with Q queries, the time complexity is O(Q·N·K).

DNA (100%)

• Use a Java Hashtable (or HashMap).
• Key: substring
• Value: number of occurrences of that substring
• Iterate through the string once to populate the hash table: O(N·K).
• Constant time for each query.

Slide a window of length K over the string:

    [AC]GTAC
    A[CG]TAC
    AC[GT]AC
    ACG[TA]C
    ACGT[AC]

Store the substrings as keys: AC, CG, GT, TA.

We will have:
    occur[AC] = 2
    occur[CG] = 1
    occur[GT] = 1
    occur[TA] = 1

    for (int i = 0; i < N - K + 1; i++) {
        occur[hash(i, K)]++; // increment the count for the substring of length K starting at index i
    }

After we have built the table, we can answer a query in O(1) expected time by searching the hash table with the query as the key.

Alternative

What if we do not have the Java hash table API?

DNA-V2

Implement our own hash table! Since K is very small, we can use a simple hash function and an array as the table.

Hash function? First, we map A to 1, C to 2, G to 3, and T to 4 (a DNA sequence contains only A, C, G, and T).

    [AC]GTAC
    A[CG]TAC
    AC[GT]AC
    ACG[TA]C
    ACGT[AC]

We only need to store the number formed by the mapped digits of each substring.
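The digit encoding described above can be sketched in Java as follows. This is only an illustration under stated assumptions: the names DnaArrayHash, digit, hash, and buildTable are hypothetical (the slides only show hash(i, K)), and the table is assumed to be a plain int array of size 10^K, which is practical only because K is very small.

```java
class DnaArrayHash {
    // Maps each base to a digit, as in the slides: A -> 1, C -> 2, G -> 3, T -> 4.
    static int digit(char c) {
        switch (c) {
            case 'A': return 1;
            case 'C': return 2;
            case 'G': return 3;
            case 'T': return 4;
            default: throw new IllegalArgumentException("not a DNA base: " + c);
        }
    }

    // Hash of the length-k substring of s starting at index start:
    // the mapped digits read as one decimal number, e.g. "AC" -> 12.
    static int hash(String s, int start, int k) {
        int h = 0;
        for (int j = 0; j < k; j++) {
            h = h * 10 + digit(s.charAt(start + j));
        }
        return h;
    }

    // Builds the occurrence table for every window of length k in text.
    // Assumption: k is small enough that an array of size 10^k fits in memory.
    static int[] buildTable(String text, int k) {
        int size = 1;
        for (int j = 0; j < k; j++) {
            size *= 10;
        }
        int[] occur = new int[size];
        for (int i = 0; i + k <= text.length(); i++) {
            occur[hash(text, i, k)]++;
        }
        return occur;
    }

    public static void main(String[] args) {
        int[] occur = buildTable("ACGTAC", 2);
        // A query costs O(K): hash the query string, then read one array cell.
        System.out.println(occur[hash("AC", 0, 2)]); // prints 2
    }
}
```

Decimal digits keep the hash values readable, at the cost of unused array cells; a base-4 or base-5 encoding would be more compact but is not what the slides show.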
AC = 12, CG = 23, GT = 34, TA = 41.

We will have:
    occur[12] = 2
    occur[23] = 1
    occur[34] = 1
    occur[41] = 1

    for (int i = 0; i < N - K + 1; i++) {
        occur[hash(i, K)]++; // increment the count for the substring of length K starting at index i
    }

DNA (100%)

• After we have built the table, we can answer a query in O(K) by calculating the hash value X of the substring in that query.
• Output the value in occur[X].

Problem 2: Find Substring

Given 2 strings and a query substring:
• Output 0 if the substring is in neither string 1 nor string 2
• Output 1 if the substring is only in string 1
• Output 2 if the substring is only in string 2
• Output 3 if the substring is in both string 1 and string 2

Find Substring (70%)

• Check the existence of the substring in both strings to determine the answer.
• You might notice that this problem is very similar to the DNA problem: a substring is in a string if its number of occurrences is greater than 0.
• It can be solved using the same technique as DNA (70%).

Find Substring (100%)

• It is possible to reuse the solution for DNA.
• If the number of occurrences of a substring in a given string is greater than 0, the substring can be found in that string.
• You need 2 tables, one for the first string and another for the second string.
• For example, given the 2 strings ACGTAC and ACTGCA, build one table for each using the same technique as in DNA.
• After we have built the tables, we can answer a query in O(1). E.g. check occur1.get("AC") and occur2.get("AC"); here both are greater than 0, so the output is 3.

Love Letter

Task

• Find an interval (contiguous section) of the word list that
  ◦ contains every distinct word at least once
  ◦ has minimal total length (the sum of the lengths of its words)

    {i, love, love, i, i, you, i, love}

Idea

• Maintain the interval using a queue.
  ◦ Step 1: Initially the queue is empty.
        {[] i, love, love, i, i, you, i, love}
  ◦ Step 2: While the queue does not contain all words, add words at the back of the queue.
        {[i, love, love, i, i, you], i, love}
  ◦ Step 3: While the front of the queue is redundant, pop it, then update the minimum total length.
        {i, love, [love, i, i, you], i, love}, min = 9
  ◦ Step 4: If the end of the list has not been reached, add the next word at the back of the queue and go to Step 3.
  ◦ Final answer: {i, love, love, i, i, [you, i, love]}, min = 8

• Time complexity: O(N).
• How do we check whether the first word in the queue is redundant?
  ◦ Use a hash table to store each word's number of occurrences in the queue. The front word is redundant if its count is greater than 1.
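
The queue-based steps above can be sketched in Java as follows. This is a minimal sketch, not the lab's reference solution: the names LoveLetter and minTotalLength are hypothetical, "total length" is assumed to mean the sum of the word lengths (consistent with min = 9 and min = 8 in the example), and an empty word list is assumed to return Integer.MAX_VALUE.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

class LoveLetter {
    // Returns the minimal total word length of a contiguous interval that
    // contains every distinct word of the list at least once.
    static int minTotalLength(String[] words) {
        int distinct = new HashSet<>(Arrays.asList(words)).size();
        Map<String, Integer> inQueue = new HashMap<>(); // word -> count inside the queue
        ArrayDeque<String> queue = new ArrayDeque<>();
        int covered = 0;              // distinct words currently in the queue
        int length = 0;               // total length of the words in the queue
        int best = Integer.MAX_VALUE; // stays MAX_VALUE for an empty list (assumption)

        for (String w : words) {
            // Steps 2 and 4: add the next word at the back of the queue.
            queue.addLast(w);
            length += w.length();
            if (inQueue.merge(w, 1, Integer::sum) == 1) {
                covered++;
            }
            if (covered < distinct) {
                continue; // the queue does not contain all words yet
            }
            // Step 3: pop the front while it is redundant (count > 1).
            while (inQueue.get(queue.peekFirst()) > 1) {
                String front = queue.pollFirst();
                inQueue.merge(front, -1, Integer::sum);
                length -= front.length();
            }
            best = Math.min(best, length);
        }
        return best;
    }

    public static void main(String[] args) {
        String[] letter = {"i", "love", "love", "i", "i", "you", "i", "love"};
        System.out.println(minTotalLength(letter)); // prints 8
    }
}
```

Each word enters and leaves the queue at most once, which is where the O(N) bound comes from (counting each hash table operation as expected constant time).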
