Implementation of motif finding algorithm SIMPLE_CONSENSUS

Implementation of motif finding algorithm SIMPLE_CONSENSUS This algorithm finds a position weight matrix (PWM) motif in an input set of DNA sequences. 1. The input is :  a set of sequences S = {S1,S2,…,SN} in “FastA” format.  Desired motif length k.  Number of motifs to report, T. 2. The output is :  A set of “motifs” M = {m1, m2, … , mT}. Each motif mi is a multi-set of kmers (k-length substrings), one taken from each input sequence Si. 3. The score of a motif is the information content of the PWM formed by the k-mers in the motif. 4. The algorithm SIMPLE_CONSENSUS works as follows: Consider any arbitrary ordering {S1,S2,…,SN} of the sequences in set S, e.g., the order in which the sequences were included in the input FastA file. Iteration 1: For every pair of k-mers x1 (substring of S1) and x2 (substring of S2), compute the information content of the corresponding PWM. Store the T highest scoring pairs. Iteration t : For every k-mer xt+1 (substring of St+1), and for every motif m stored in the previous iteration, add xt+1 to m to obtain a new motif, and compute the information content of the corresponding PWM. Store the T highest scoring new motifs from this iteration. The algorithm terminates when each Si  S has been handled, i.e., after iteration (N-1). Problem A. Implement a function that creates a PWM out of a given motif, and computes its information content. The formula for information content I(W) of a PWM W is : I(W )    k  {A,C,G,T } W k log W k q where W k is the entry for base  in column k of W. The background probabilities of the bases are denoted by q. Assume a background probability distribution of {A = T = 0.35, C = G = 0.15}.  If W k is 0, assume W k log W k to be 0.  Problem B. Implement the algorithm SIMPLE_CONSENSUS, as described above. The input sequences are given in FastA format, described at the end. The inputs k (motif  T (number of motifs  report) are provided in the command-line. length) and to Problem C. Run your SIMPLE_CONSENSUS program on each of the three provided FastA files (bicoid.fa, kruppel.fa, hunchback.fa), with k=9. FastA format of sequences: A FastaA file can contain one or many sequences. Each sequence begins with a “header” line, whose first character is “>” and the string immediately following this is the name of the sequence. The actual sequence (string over {A,C,G,T}) begins with the next line, and may be split over multiple lines. For example: >seq1 ACACAGTACAGCTACGATCAGCATCAG CAGCTACGACGAGCATCAGCGACTACGACTAC CAGCATCGATCAGC >seq2 CAGGCGTAGCGATCAGCGAGCGACGAGCACTTCGAGCATCAG CGATCGAGCAGCTACGACTACGATCAG CGACTAGCATAGCAC >seq3 GCAGCGATCAGC

Implementation of motif finding algorithm SIMPLE_CONSENSUS

Related documents

Products

Support

Implementation of motif finding algorithm SIMPLE_CONSENSUS

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib