Take-Home Lab 5
Problem 1: DNA

DNA
Given a string of length N, determine the number of occurrences of some given substrings (each of length K) in that string.
For instance,
String    : ACGTAC (N = 6)
Substring : AC (K = 2)
Answer    : There are 2 occurrences of AC in the string ACGTAC.
[AC]GT[AC]

DNA
A substring is a consecutive part of a string.
Note that AG is not a substring of ACGTAC.

Brute-force Algorithm
For each query:
◦ Iterate through the entire string
◦ For each position in the string, check the substring starting there, and increment the count on a match

DNA (70%)
    for (int i = 0; i + K <= N; i++) { // stop at N - K so that text[i + j] stays in bounds
        boolean found = true;
        for (int j = 0; j < K; j++) {
            if (text[i + j] != pattern[j]) { // character mismatch
                found = false;
                break;
            }
        }
        if (found) counter++;
    }

DNA (70%)
Solution: For every query, we check the substring of length K starting at each index i.
We can answer one query in O(NK).
Hence with Q queries, the time complexity will be O(QNK).

DNA (100%)
Java Hashtable (or HashMap)

DNA (100%)
Key: substring
Value: number of occurrences of the substring
Iterate through the string once to populate the hash table: O(NK)
Constant time for each query (treating K as a small constant)

DNA (100%)
[AC]GTAC
A[CG]TAC
AC[GT]AC
ACG[TA]C
ACGT[AC]
Store the substrings as keys: AC, CG, GT, TA.

DNA (100%)
We will have:
occur[AC] = 2
occur[CG] = 1
occur[GT] = 1
occur[TA] = 1
    for (int i = 0; i < N - K + 1; i++) {
        occur[hash(i, K)]++; // increment the count for the substring of length K starting at index i
    }

DNA (100%)
After we have built the table, we can answer a query in O(1) by searching the hash table with the query as the key.

Alternative
What if we do not have the Java hash table API?

DNA-V2
Implement our own hash table!
Since K is very small, we can use a simple hash function and an array as the table.

DNA-V2
Hash function?
First, we map A to 1, C to 2, G to 3, T to 4 (a DNA sequence contains only A, C, G, and T).

DNA-V2
[AC]GTAC
A[CG]TAC
AC[GT]AC
ACG[TA]C
ACGT[AC]
We only need to store the number corresponding to each substring.
AC = 12, CG = 23, GT = 34, TA = 41.
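
The digit mapping above (A to 1, C to 2, G to 3, T to 4) can be sketched in Java as follows. This is only an illustration under stated assumptions, not the lab's reference solution: the class name DnaCounter and the helpers digit, buildTable, and query are hypothetical, hash takes the string as an extra argument, and the table is assumed to have 10^K slots, which is only practical for small K.

```java
class DnaCounter {
    // Maps a nucleotide to its digit: A=1, C=2, G=3, T=4.
    static int digit(char c) {
        switch (c) {
            case 'A': return 1;
            case 'C': return 2;
            case 'G': return 3;
            case 'T': return 4;
            default: throw new IllegalArgumentException("not a nucleotide: " + c);
        }
    }

    // Hash of the length-k substring of s starting at index start,
    // read as a decimal number, e.g. "AC" -> 12.
    static int hash(String s, int start, int k) {
        int h = 0;
        for (int j = 0; j < k; j++) {
            h = h * 10 + digit(s.charAt(start + j));
        }
        return h;
    }

    // Counts every length-k window of text in an array indexed by hash value.
    // Assumption: k is small enough that 10^k fits comfortably in memory.
    static int[] buildTable(String text, int k) {
        int size = 1;
        for (int j = 0; j < k; j++) {
            size *= 10;
        }
        int[] occur = new int[size];
        for (int i = 0; i + k <= text.length(); i++) {
            occur[hash(text, i, k)]++;
        }
        return occur;
    }

    // Answers one query in O(K): hash the pattern, then read the table.
    // The pattern must have the same length k that the table was built with.
    static int query(int[] occur, String pattern) {
        return occur[hash(pattern, 0, pattern.length())];
    }

    public static void main(String[] args) {
        int[] occur = buildTable("ACGTAC", 2);
        System.out.println(query(occur, "AC")); // prints 2
        System.out.println(query(occur, "AG")); // prints 0
    }
}
```

Most of the 10^K slots are never used, because only the digits 1 to 4 occur; the decimal encoding is kept here so that the hash values match the numbers shown on the slides.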

DNA-V2
We will have:
occur[12] = 2
occur[23] = 1
occur[34] = 1
occur[41] = 1
    for (int i = 0; i < N - K + 1; i++) {
        occur[hash(i, K)]++; // increment the count for the substring of length K starting at index i
    }

DNA (100%)
After we have built the table, we can answer a query in O(K) by calculating the hash value X of the substring in that query.
Output the value in occur[X].

Problem 2: Find Substring

Find Substring
Given 2 strings and a query substring,
Output 0: if the substring is in neither string 1 nor string 2
Output 1: if the substring is only in string 1
Output 2: if the substring is only in string 2
Output 3: if the substring is in both string 1 and string 2

Find Substring (70%)
Check the existence of the substring in both strings to determine the answer.
You might notice that this problem is very similar to the DNA problem, i.e. a substring is in a string if its number of occurrences is greater than 0.
It can be solved using the same technique as DNA (70%).

Find Substring (100%)
It is possible to reuse the solution for DNA.
If the number of occurrences of a substring in a given string is greater than 0, the substring can be found in that string.
You need 2 tables, one for the first string and another for the second string.

Find Substring (100%)
For example, we have 2 strings, i.e. ACGTAC and ACTGCA.
Use the same technique as the one in DNA.

Find Substring (100%)
After we have built the tables, we can answer a query in O(1).
E.g. check occur1.get("AC") and occur2.get("AC")

Love Letter

Task
Find an interval (contiguous section) of the word list that
◦ Contains all words
◦ Has minimal total length
{i, love, love, i, i, you, i, love}

Idea
{i, love, love, i, i, you, i, love}
Maintain the interval using a queue
◦ Step 1: Initially empty
  {[]i, love, love, i, i, you, i, love}
◦ Step 2: While the queue does not contain all words, add words at the back of the queue
  {[i, love, love, i, i, you], i, love}

Idea
◦ Step 3: While the front of the queue is redundant, pop it, and update the minimum total length
  {i, love, [love, i, i, you], i, love}, min = 9
◦ Step 4: If the end of the list has not been reached, add the next word at the back of the queue, and go to Step 3
◦ Final answer: {i, love, love, i, i, [you, i, love]}, min = 8

Time Complexity: O(N).
How to check whether the first word in the queue is redundant?
◦ Use hashing to store each word's number of occurrences in the queue.
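
The queue-based steps for Love Letter can be sketched in Java as below. This is a minimal sketch, not the official solution: the class name LoveLetter and method name minTotalLength are hypothetical, "total length" is assumed to mean the sum of the word lengths in the interval (which matches min = 9 and min = 8 in the example), and a HashMap is used as the hash table that counts each word's occurrences in the queue.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

class LoveLetter {
    // Returns the minimal total length of a contiguous interval of words
    // that contains every distinct word of the list at least once.
    static int minTotalLength(String[] words) {
        if (words.length == 0) {
            return 0; // assumption: an empty list needs an empty interval
        }
        int distinct = new HashSet<>(Arrays.asList(words)).size();
        Map<String, Integer> inQueue = new HashMap<>(); // word -> count in the queue
        ArrayDeque<String> queue = new ArrayDeque<>();
        int total = 0;                // sum of word lengths currently in the queue
        int best = Integer.MAX_VALUE; // minimum total length seen so far

        for (String w : words) {
            // Steps 2 and 4: add the next word at the back of the queue.
            queue.addLast(w);
            total += w.length();
            inQueue.merge(w, 1, Integer::sum);

            if (inQueue.size() < distinct) {
                continue; // the queue does not contain all words yet
            }
            // Step 3: the front is redundant if it occurs again inside the queue.
            while (inQueue.get(queue.peekFirst()) > 1) {
                String front = queue.pollFirst();
                inQueue.merge(front, -1, Integer::sum);
                total -= front.length();
            }
            best = Math.min(best, total);
        }
        return best;
    }

    public static void main(String[] args) {
        String[] words = {"i", "love", "love", "i", "i", "you", "i", "love"};
        System.out.println(minTotalLength(words)); // prints 8
    }
}
```

Each word is pushed once and popped at most once, and every hash table operation takes expected constant time, which is where the O(N) bound comes from.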