Implementation of motif finding algorithm SIMPLE_CONSENSUS

advertisement
Implementation of motif finding algorithm SIMPLE_CONSENSUS
This algorithm finds a position weight matrix (PWM) motif in an input set of DNA
sequences.
1. The input is :
 a set of sequences S = {S1,S2,…,SN} in “FastA” format.
 Desired motif length k.
 Number of motifs to report, T.
2. The output is :
 A set of “motifs” M = {m1, m2, … , mT}. Each motif mi is a multi-set of kmers (k-length substrings), one taken from each input sequence Si.
3. The score of a motif is the information content of the PWM formed by the k-mers in
the motif.
4. The algorithm SIMPLE_CONSENSUS works as follows:
Consider any arbitrary ordering {S1,S2,…,SN} of the sequences in set S, e.g., the
order in which the sequences were included in the input FastA file.
Iteration 1: For every pair of k-mers x1 (substring of S1) and x2 (substring of S2),
compute the information content of the corresponding PWM. Store the T highest scoring
pairs.
Iteration t : For every k-mer xt+1 (substring of St+1), and for every motif m stored
in the previous iteration, add xt+1 to m to obtain a new motif, and compute the information
content of the corresponding PWM. Store the T highest scoring new motifs from this
iteration.
The algorithm terminates when each Si  S has been handled, i.e., after iteration
(N-1).
Problem A. Implement a function that creates a PWM out of a given motif, and computes
its information content. The formula for information content I(W) of a PWM W is :
I(W )  

k  {A,C,G,T }
W k log
W k
q
where W k is the entry for base  in column k of W. The background probabilities of the
bases are denoted by q. Assume a background probability distribution of {A = T = 0.35,
C = G = 0.15}.
 If W k is 0, assume W k log W k to be 0.

Problem B. Implement the algorithm SIMPLE_CONSENSUS, as described above. The
input sequences are given in FastA format, described at the end. The inputs k (motif
 T (number of motifs

report) are provided in the command-line.
length) and
to
Problem C. Run your SIMPLE_CONSENSUS program on each of the three provided
FastA files (bicoid.fa, kruppel.fa, hunchback.fa), with k=9.
FastA format of sequences:
A FastaA file can contain one or many sequences. Each sequence begins with a “header”
line, whose first character is “>” and the string immediately following this is the name of
the sequence. The actual sequence (string over {A,C,G,T}) begins with the next line, and
may be split over multiple lines. For example:
>seq1
ACACAGTACAGCTACGATCAGCATCAG
CAGCTACGACGAGCATCAGCGACTACGACTAC
CAGCATCGATCAGC
>seq2
CAGGCGTAGCGATCAGCGAGCGACGAGCACTTCGAGCATCAG
CGATCGAGCAGCTACGACTACGATCAG
CGACTAGCATAGCAC
>seq3
GCAGCGATCAGC
Download