Take-Home Lab 5
Problem 1: DNA

DNA
Given a string of length N, count the occurrences of each given substring (of length K) in that string.
• For instance:
    String    : ACGTAC (N = 6)
    Substring : AC (K = 2)
    Answer    : AC occurs 2 times in ACGTAC (at indices 0 and 4).

DNA
A substring is a consecutive part of a string.
• Note that AG is not a substring of ACGTAC, because A and G are never adjacent in it.

Brute-force Algorithm
For each query:
• Iterate through the entire string.
• At each position, compare the substring of length K starting there with the query, and increment the count on a match.

DNA (70%)

int counter = 0;
// Stop at i = N - K so that text[i + j] never runs past the end of the text.
for (int i = 0; i + K <= N; i++) {
    boolean found = true;
    for (int j = 0; j < K; j++) {
        if (text[i + j] != pattern[j]) {
            // character mismatch
            found = false;
            break;
        }
    }
    if (found) counter++;
}

DNA (70%)
We can answer one query in O(N·K): for every index i, we compare the substring of length K starting at i with the query.
• Hence, with Q queries, the total time complexity is O(Q·N·K).
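
The brute-force check can be wrapped into a self-contained method for experimenting. This is only a sketch: the class name `BruteForceCount` and the method `countOccurrences` are hypothetical, and it assumes the text and the query are given as Java `String` values rather than character arrays.

```java
// Hypothetical helper class; the names are illustrative and not part of the lab.
class BruteForceCount {
    // Counts (possibly overlapping) occurrences of pattern in text in O(N*K).
    static int countOccurrences(String text, String pattern) {
        int n = text.length();
        int k = pattern.length();
        int counter = 0;
        // i + k <= n keeps text.charAt(i + j) inside the string.
        for (int i = 0; i + k <= n; i++) {
            boolean found = true;
            for (int j = 0; j < k; j++) {
                if (text.charAt(i + j) != pattern.charAt(j)) {
                    found = false; // character mismatch
                    break;
                }
            }
            if (found) counter++;
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(countOccurrences("ACGTAC", "AC")); // prints 2
        System.out.println(countOccurrences("ACGTAC", "AG")); // prints 0
    }
}
```

Calling this method once per query reproduces the O(Q·N·K) behaviour described above.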

DNA (100%)

Java Hashtable (or HashMap)

DNA (100%)
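• Key: a substring of length K
• Value: the number of occurrences of that substring
• Iterate through the string once to populate the hash table: O(N·K)
• Each query is then a single lookup: constant expected time when K is treated as a constant (hashing the key itself costs O(K))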

DNA (100%)
Slide a window of length K = 2 over ACGTAC:
• [AC]GTAC
• A[CG]TAC
• AC[GT]AC
• ACG[TA]C
• ACGT[AC]
Store the substrings as keys: AC, CG, GT, TA.

DNA (100%)
We will have:
• occur[AC] = 2
• occur[CG] = 1
• occur[GT] = 1
• occur[TA] = 1

for (int i = 0; i < N - K + 1; i++) {
    // Increment the count of the substring of length K starting at index i.
    occur[hash(i, K)]++;
}

DNA (100%)
After we have built the table, we can answer a query in O(1) expected time (treating K as a constant)
• by looking up the hash table with the query as the key.
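
One way the build and query steps might look with `java.util.HashMap` is sketched below. The class `SubstringTable` and its methods `build` and `query` are hypothetical names, and the sketch assumes the text is a `String` and every query has length K.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class; the names are illustrative and not part of the lab.
class SubstringTable {
    // Maps every length-k substring of text to its number of occurrences.
    static Map<String, Integer> build(String text, int k) {
        Map<String, Integer> occur = new HashMap<>();
        for (int i = 0; i + k <= text.length(); i++) {
            // substring(i, i + k) copies k characters, so the build costs O(N*K).
            occur.merge(text.substring(i, i + k), 1, Integer::sum);
        }
        return occur;
    }

    // Answers one query; substrings that never occur are absent from the table.
    static int query(Map<String, Integer> occur, String pattern) {
        return occur.getOrDefault(pattern, 0);
    }

    public static void main(String[] args) {
        Map<String, Integer> occur = build("ACGTAC", 2);
        System.out.println(query(occur, "AC")); // prints 2
        System.out.println(query(occur, "AG")); // prints 0
    }
}
```

`getOrDefault` is used so that a query for a substring that was never stored returns 0 instead of `null`.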

Alternative

What if we do not have the Java hash table API?

DNA-V2
Implement our own hash table!
• Since K is very small, we can use a simple hash function and an array as the table.

DNA-V2
Hash function?
• First, we map A to 1, C to 2, G to 3, and T to 4 (a DNA sequence contains only A, C, G, and T).

DNA-V2
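Slide a window of length K = 2 over ACGTAC:
• [AC]GTAC
• A[CG]TAC
• AC[GT]AC
• ACG[TA]C
• ACGT[AC]
We only need to store the number corresponding to each substring, formed by writing its digits side by side: AC = 12, CG = 23, GT = 34, TA = 41.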
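
A minimal sketch of such a hash function follows, reading the K mapped digits as a decimal number. The names `DnaHash`, `digit`, and `hash` are hypothetical, the text is passed in explicitly as a `String`, and K is assumed to be at most 9 so that the value fits in an `int`.

```java
// Hypothetical helper class; the names are illustrative and not part of the lab.
class DnaHash {
    // Maps a nucleotide to its digit: A=1, C=2, G=3, T=4.
    static int digit(char c) {
        switch (c) {
            case 'A': return 1;
            case 'C': return 2;
            case 'G': return 3;
            case 'T': return 4;
            default: throw new IllegalArgumentException("not a DNA character: " + c);
        }
    }

    // Reads the k characters starting at index i as a decimal number (assumes k <= 9).
    static int hash(String text, int i, int k) {
        int value = 0;
        for (int j = 0; j < k; j++) {
            value = value * 10 + digit(text.charAt(i + j));
        }
        return value;
    }

    public static void main(String[] args) {
        String text = "ACGTAC";
        // Prints one line per window, e.g. "AC = 12" for the first one.
        for (int i = 0; i + 2 <= text.length(); i++) {
            System.out.println(text.substring(i, i + 2) + " = " + hash(text, i, 2));
        }
    }
}
```

Because every digit is between 1 and 4, two different substrings of the same length never share a hash value.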

DNA-V2
We will have:
• occur[12] = 2
• occur[23] = 1
• occur[34] = 1
• occur[41] = 1

for (int i = 0; i < N - K + 1; i++) {
    // Increment the count of the substring of length K starting at index i.
    occur[hash(i, K)]++;
}

DNA (100%)
After we have built the table, we can answer a query in O(K) by calculating the hash value X of the substring in that query.
• Output the value in occur[X].
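
Putting the pieces together, the array-backed table might be sketched as follows. The class `DnaArrayTable` and its methods are hypothetical names. The sketch assumes the input contains only A, C, G, and T, that every query has length K, and that K is small enough for an array of 10^K counters to fit in memory.

```java
// Hypothetical helper class; the names are illustrative and not part of the lab.
class DnaArrayTable {
    // A=1, C=2, G=3, T=4; any other character maps to 0 (input is assumed to be valid DNA).
    static int digit(char c) {
        return "ACGT".indexOf(c) + 1;
    }

    // Reads the k characters of s starting at index start as a decimal number.
    static int hash(String s, int start, int k) {
        int value = 0;
        for (int j = 0; j < k; j++) {
            value = value * 10 + digit(s.charAt(start + j));
        }
        return value;
    }

    // occur[x] = number of length-k substrings of text whose hash value is x.
    static int[] build(String text, int k) {
        int size = 1;
        for (int j = 0; j < k; j++) size *= 10; // 10^k slots cover every k-digit hash
        int[] occur = new int[size];
        for (int i = 0; i + k <= text.length(); i++) {
            occur[hash(text, i, k)]++;
        }
        return occur;
    }

    // O(K) per query: hash the query, then read one array cell.
    static int query(int[] occur, String pattern) {
        return occur[hash(pattern, 0, pattern.length())];
    }

    public static void main(String[] args) {
        int[] occur = build("ACGTAC", 2);
        System.out.println(query(occur, "AC")); // prints 2
        System.out.println(query(occur, "AA")); // prints 0
    }
}
```

Most of the 10^K cells are never used, since only the digits 1 to 4 appear; that waste is acceptable here only because K is very small.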

Problem 2: Find Substring

Find Substring
Given 2 strings, for each queried substring:
• Output 0 if the substring is in neither string 1 nor string 2
• Output 1 if the substring is only in string 1
• Output 2 if the substring is only in string 2
• Output 3 if the substring is in both string 1 and string 2

Find Substring (70%)
Check the existence of a substring in both strings to determine the answer.
• You might notice that this problem is very similar to the DNA problem: a substring is in a string if its number of occurrences is greater than 0.
• It can be solved using the same technique as DNA (70%).

Find Substring (100%)
It is possible to reuse the solution for DNA.
• If the number of occurrences of a substring in a given string is greater than 0, the substring can be found in that string.
• You need 2 tables, one for the first string and another for the second string.

Find Substring (100%)
For example, take the 2 strings ACGTAC and ACTGCA.
Use the same technique as in DNA to build one table for each string.

Find Substring (100%)
After we have built the tables, we can answer a query in O(1).
• E.g., check occurOne.get("AC") and occur2.get("AC").
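
The two-table lookup could be sketched as below. The class `FindSubstring` and the methods `build` and `answer` are hypothetical; the table names follow the slide. The sketch assumes all queries have the same length K that was used to build both tables.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class; the names are illustrative and not part of the lab.
class FindSubstring {
    // Same table as in the DNA problem: length-k substring -> number of occurrences.
    static Map<String, Integer> build(String text, int k) {
        Map<String, Integer> occur = new HashMap<>();
        for (int i = 0; i + k <= text.length(); i++) {
            occur.merge(text.substring(i, i + k), 1, Integer::sum);
        }
        return occur;
    }

    // Returns 0, 1, 2, or 3 following the output rules of the problem.
    static int answer(Map<String, Integer> occurOne, Map<String, Integer> occur2, String query) {
        int result = 0;
        if (occurOne.getOrDefault(query, 0) > 0) result += 1; // found in string 1
        if (occur2.getOrDefault(query, 0) > 0) result += 2;   // found in string 2
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> occurOne = build("ACGTAC", 2);
        Map<String, Integer> occur2 = build("ACTGCA", 2);
        System.out.println(answer(occurOne, occur2, "AC")); // prints 3
        System.out.println(answer(occurOne, occur2, "CG")); // prints 1
        System.out.println(answer(occurOne, occur2, "CT")); // prints 2
        System.out.println(answer(occurOne, occur2, "AA")); // prints 0
    }
}
```

Adding 1 for string 1 and 2 for string 2 yields exactly the four required outputs without a chain of nested conditions.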

Love Letter
Task

Find an interval (contiguous section) of the word list that
◦ Contains all words (every distinct word at least once)
◦ Has minimal total length (the sum of the lengths of its words)
{i, love, love, i, i, you, i, love}

Idea
{i, love, love, i, i, you, i, love}
• Maintain the interval using a queue
◦ Step 1: The queue is initially empty: {[] i, love, love, i, i, you, i, love}
◦ Step 2: While the queue does not contain all words, add the next word at the back of the queue
    • {[i, love, love, i, i, you], i, love}

Idea
◦ Step 3: While the word at the front of the queue is redundant (it occurs again later in the queue), pop it, then update the minimum total length
    • {i, love, [love, i, i, you], i, love}, min = 9
◦ Step 4: If the end of the list has not been reached, add the next word at the back of the queue and go to Step 3
◦ Final answer: {i, love, love, i, i, [you, i, love]}, min = 8


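Time complexity: O(N), since each word is added to the queue once and popped from it at most once.

How do we check whether the first word in the queue is redundant?
◦ Use hashing to store each word's number of occurrences in the queue; the front word is redundant if its count is greater than 1.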
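
The queue-based idea might be sketched as follows. The class `LoveLetter` and the method `minTotalLength` are hypothetical names, and the total length is assumed to be the sum of the word lengths, which matches min = 9 and min = 8 in the example. For brevity the sketch folds Steps 2 and 4 into one loop and pops redundant front words after every insertion; popping a redundant word never removes the last copy of a word, so the result is the same.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

// Hypothetical helper class; the names are illustrative and not part of the lab.
class LoveLetter {
    // Minimal total word length of a contiguous interval containing every distinct word.
    // Returns Integer.MAX_VALUE for an empty word list.
    static int minTotalLength(String[] words) {
        int distinct = new HashSet<>(Arrays.asList(words)).size();
        Deque<String> queue = new ArrayDeque<>();
        Map<String, Integer> count = new HashMap<>(); // occurrences of each word in the queue
        int length = 0;                               // total length of the words in the queue
        int best = Integer.MAX_VALUE;
        for (String word : words) {
            // Add the next word at the back of the queue.
            queue.addLast(word);
            count.merge(word, 1, Integer::sum);
            length += word.length();
            // The front word is redundant if it occurs more than once in the queue.
            while (count.get(queue.peekFirst()) > 1) {
                String front = queue.pollFirst();
                count.merge(front, -1, Integer::sum);
                length -= front.length();
            }
            // Counts never drop to 0, so count.size() is the number of distinct words queued.
            if (count.size() == distinct) best = Math.min(best, length);
        }
        return best;
    }

    public static void main(String[] args) {
        String[] words = {"i", "love", "love", "i", "i", "you", "i", "love"};
        System.out.println(minTotalLength(words)); // prints 8
    }
}
```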