Motif Finding Workshop Project
Chaim Linhart
January 2008
MF workshop 08 © Ron Shamir

Outline
1. Some background again…
2. The project

1. Background
Slides with Ron Shamir and Adi Akavia

Gene: from DNA to protein
DNA → (transcription) → Pre-mRNA → (splicing) → Mature mRNA → (translation) → protein

DNA
• DNA: a “string” over the alphabet of 4 bases (nucleotides): { A, C, G, T }
• Resides in chromosomes
• Complementary strands: A-T ; C-G
  Forward/sense strand: 5’ AACTTGCG 3’
  Reverse-complement/anti-sense strand: 3’ TTGAACGC 5’ (read from 5’ to 3’: CGCAAGTT)
• Directional: from 5’ to 3’:
  (upstream) 5’ end AACTTGCGATACTCCTA 3’ end (downstream)

Gene structure (eukaryotes)
[Figure: the DNA coding strand with the promoter, the transcription start site (TSS), exons and introns. Transcription (RNA polymerase) produces the pre-mRNA; splicing (spliceosome) removes the introns and produces the mature mRNA, which consists of the 5’ UTR, the coding region (from the start codon to the stop codon) and the 3’ UTR; translation (ribosome) produces the protein.]

Translation
• Codon: a triplet of bases that codes for a specific amino acid (except the stop codons); many-to-1 relation (several codons can code for the same amino acid)
• Stop codons: signal termination of the protein synthesis process
[Figure source: http://ntri.tamuk.edu/cell/ribosomes.html]

Genome sequences
• Many genomes have been sequenced, including those of viruses, microbes, plants and animals.
• Human:
  – 23 pairs of chromosomes
  – 3+ Gbps (bps = base pairs), only ~3% are genes
  – ~25,000 genes
• Yeast:
  – 16 chromosomes
  – 20 Mbps
  – 6,500 genes

Regulation of Expression
• Each cell contains an identical copy of the whole genome, but utilizes only a subset of the genes to perform diverse, unique tasks
• Most genes are highly regulated – their expression is limited to specific tissues, developmental stages, physiological conditions
• Main regulatory mechanism – transcriptional regulation

Transcriptional regulation
• Transcription is regulated primarily by transcription factors (TFs) – proteins that bind to DNA subsequences, called binding sites (BSs)
• TFBSs are located mainly (not always!) in the gene’s promoter – the DNA sequence upstream of the gene’s transcription start site (TSS)
• BSs of a particular TF share a common pattern, or motif
• Some TFs operate together – TF modules
[Figure: a promoter drawn 5’ to 3’, with two TFs bound to their BSs upstream of the TSS, followed by the gene]

TFBS motif models
• Consensus (“degenerate”) string: [AC]ACT[CG]T
  Binding-site instances: AACTGT, CACTGT, CACTCT, CACTGT, AACTGT
  [Figure: the instances marked in the promoters of gene 1 to gene 10]
• Statistical models…
• Motif logo representation

Human G2+M cell-cycle genes: The CHR – NF-Y module
CDCA3 (trigger of mitotic entry 1)
  CTCAGCCAATAGGGTCAGGGCAGGGGGCGTGGCGGGAAGTTTGAAACT -18
CDCA8 (cell division cycle associated 8)
  TTGTGATTGGATGTTGTGGGA…[25bp]…TGACTGTGGAGTTTGAATTGG +23
CDC2 (cell division control protein 2 homolog)
  CTCTGATTGGCTGCTTTGAAAGTCTACGGGCTACCCGATTGGTGAATCCGGG
  GCCCTTTAGCGCGGTGAGTTTGAAACTGCT 0
CDC42EP4 (cdc42 effector protein 4)
  GCTTTCAGTTTGAACCGAGGA…[25bp]…CGACGGCCATTGGCTGCTGC -110
CCNB1 (G2/mitotic-specific cyclin B1)
  AGCCGCCAATGGGAAGGGAG…[30bp]…AGCAGTGCGGGGTTTAAATCT +45
CCNB2 (G2/mitotic-specific cyclin B2)
  TTCAGCCAATGAGAGT…[15bp]…GTGTTGGCCAATGAGAAC…[15bp]…GGGC
  CGCCCAATGGGGCGCAAGCGACGCGGTATTTGAATCCTGGA +10
BS’s are short, non-specific, hiding in
both strands and at various locations along the promoters
TFs: NF-Y, CHR

The computational challenge
• Given a set of co-regulated genes (e.g., from gene expression chips)
• Find a motif that is over-represented (occurs unusually often) in their promoters
• This may be the TF binding site motif
• Find TF modules – over-represented motifs that tend to co-occur

The computational challenge (II)
• Motifs can also be found without a given target-set – “genome-wide”
• Find a motif that is localized – occurs more often near the TSS of genes
• Find a motif with a strand bias – occurs more often on the genes’ coding strand
• Find TF modules with biases in their order / orientation / distance

Motif finding algorithms
• >100 motif finding algorithms
• Main differences between them:
  – Type of analysis & input:
    • Target-set vs. genome-wide
    • Single vs. multi-species (conservation)
    • Single motifs vs. modules
  – Motif model
  – Score for evaluating motif
  – Motif search technique:
    • Combinatorial (enumeration) vs. statistical optimization

Example - Amadeus
Over-represented motifs in the promoters of genes expressed in the G2 and G2/M phases of the human cell cycle:
[Figure: motif logos of CHR and NF-Y]

2.
The project

General goals
• Develop software from A-Z:
  – Design
  – Implementation
  – (Optimization)
  – Execution & analysis of real data
• A taste of bioinformatics
• Have fun
• Get credit…

The computational task
• Given a set of DNA sequences
• Find “interesting” pairs of motifs:
  – Order bias
  – Other scores…
• Main challenges:
  – Performance (time, memory)
  – Output redundancy

Input
File with DNA sequences in “fasta” format:
  >sequence-name1 <space> [header1]
  ACCCGNNNNTCGGAAATGANN
  CGGAGTAAAATATGCGAGCGT
  >sequence-name2 <space> [header2]
  cggattnnnaccgcannnnnnnnaccgtga
  >sequence-name3 <space> [header3]
  agtttagactgctagctcgatcgcta
  gcggatnggctannnnnatctag

Input (II)
• Ignore the header lines
• Sequence may span multiple lines or one long line
• Sequence contains the characters A,C,G,T,N in upper or lower case
• “N” means unknown or masked base
• Sample input files will be supplied

Input (III)
• Search parameters:
  – Length of motifs (between 5 and 10)
  – Min. + Max. distance between the motifs:
    ACGGATTGATNNNTGGATGCCAT distance=9
  – Single vs. two strands search
  – Min. number of occurrences (hits) of pair (don’t count overlaps, e.g. AAAAAA):
    GCGGATTCAGTGATGCCANGNATGCCTCAGGATTGNAATGCCA [three hits marked on the slide]
  – Max. p-value
  – Additional parameters…

Output
A. A list of the string pairs with the best order-bias score (smallest p-values):
  Motif A   Motif B   A→B   B→A   p-value
  ACGTT     GGATT      97    17   4.3E-15
  ACGTT     GATTC      87    16   2.7E-13
  TTAAC     CAGCC      31   114   1.2E-12
B.
A non-redundant list of motif pairs (motif = consensus string): logos, # of hits, additional scores

Part A: String pairs with order bias
• nA = # of A→B ; nB = # of B→A
• WLOG, nA > nB
• n = nA + nB
• H0 = random order: nA ~ B(n, 0.5)
• p-value = prob for at least nA occurrences of A→B = tail of B(n, 0.5):
  $$\text{Binomial tail}(n, p, k) = \sum_{j=k}^{n} \binom{n}{j} p^{j} (1-p)^{n-j}$$
• Normal approximation (central limit thm.)
• Fix for multiple testing: ×2

Part B: Non-redundant list of motif pairs
• Collect similar strings to motif with better score (motif = consensus):
  String pair (p-value)
  ACGTT , GGATT (4.3E-15)
  ACGAT , GGATT (2.4E-11)
  AGGAT , GGTTT (1.7E-5)
  AGGTT , GGTTT (5.9E-5)
  Motif pair: [logo A] , [logo B] (8.1E-31)
• Don’t report similar motif pairs:
  – Motifs that consist of similar strings
  – Motif pairs that are small shifts of one another
  – Palindromes

Part B (cont.): Additional score
Option I: Co-occurrence rate
N = total # of sequences
sA = # of sequences that contain motif A (sB likewise for motif B)
sAB = # of sequences that contain motifs A and B
H0 = motifs occur independently and randomly
p-value = prob for at least sAB joint occurrences, given the number of hits of each single motif = tail of hypergeometric distribution:
  $$\text{HG tail}(N, s_A, s_B, s_{AB}) = \sum_{i=s_{AB}}^{\min(s_A, s_B)} \frac{\binom{s_B}{i} \binom{N-s_B}{s_A-i}}{\binom{N}{s_A}}$$

Part B (cont.): Additional score
Option II: Distance bias
Is the distance between the two motifs uniform (H0), or are there specific distances that are very common?
Option III: Gap variability
Are the sequences between the motifs conserved (H0), or are they highly variable?
Other options??
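The two tail probabilities above (the binomial tail for the order bias in Part A and the hypergeometric tail for the co-occurrence rate in Part B) can be sketched in Java roughly as follows. This is only an illustration, not the statistical package that will be supplied: the class and method names (PairScores, binomialTail, orderBiasPValue, hyperGeometricTail) are hypothetical, the sums are evaluated in log space to avoid overflow, and the “×2” correction is read here as an assumption that either order could turn out to be the more frequent one.

```java
final class PairScores {
    /** Table of ln(i!) for i = 0..n, built by summing logarithms. */
    static double[] logFactorials(int n) {
        double[] lf = new double[n + 1];
        for (int i = 1; i <= n; i++) {
            lf[i] = lf[i - 1] + Math.log(i);
        }
        return lf;
    }

    /** ln of "n choose k", for 0 <= k <= n, using a log-factorial table. */
    static double logChoose(double[] lf, int n, int k) {
        return lf[n] - lf[k] - lf[n - k];
    }

    /** P(X >= k) for X ~ B(n, p), with 0 < p < 1. */
    static double binomialTail(int n, double p, int k) {
        double[] lf = logFactorials(n);
        double sum = 0.0;
        for (int j = Math.max(k, 0); j <= n; j++) {
            sum += Math.exp(logChoose(lf, n, j)
                    + j * Math.log(p) + (n - j) * Math.log(1 - p));
        }
        return Math.min(1.0, sum);
    }

    /**
     * Order-bias p-value under H0 (random order).
     * Assumption: the x2 fix is applied because the larger of the
     * two counts is always the one that gets tested.
     */
    static double orderBiasPValue(int aThenB, int bThenA) {
        int n = aThenB + bThenA;
        int nA = Math.max(aThenB, bThenA);
        return Math.min(1.0, 2.0 * binomialTail(n, 0.5, nA));
    }

    /** Probability of at least sAB joint occurrences (hypergeometric tail). */
    static double hyperGeometricTail(int N, int sA, int sB, int sAB) {
        double[] lf = logFactorials(N);
        double logDenominator = logChoose(lf, N, sA);
        // i cannot be smaller than sA + sB - N, otherwise sA - i > N - sB.
        int first = Math.max(sAB, Math.max(0, sA + sB - N));
        double sum = 0.0;
        for (int i = first; i <= Math.min(sA, sB); i++) {
            sum += Math.exp(logChoose(lf, sB, i)
                    + logChoose(lf, N - sB, sA - i) - logDenominator);
        }
        return Math.min(1.0, sum);
    }

    public static void main(String[] args) {
        // Counts and set sizes below are made-up illustration values.
        System.out.println(orderBiasPValue(30, 10));
        System.out.println(hyperGeometricTail(100, 20, 30, 12));
    }
}
```

Exact summation is affordable for the hit counts shown on the slides; for very large n the normal approximation mentioned in Part A may be the cheaper choice.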
Implementation
• Java (Eclipse) ; Linux
• GUI: Simple graphical user interface for supplying the input parameters and reporting the results
• Packages for motif logo and statistical scores will be supplied
• Time performance will be measured only for part A
• Reasonable documentation
• Separate packages for data-structures, scores, GUI, I/O, etc.

Design document
• Due in 3 weeks (Feb 24)
• 3-5 pages (Word), Hebrew/English
• Briefly describe main goal, input and output of program
• Describe main data structures, algorithms, and scores for parts A+B
• Meet with me before submission

Fin
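Appendix sketch: as a starting point for the design document, the input rules from the Input slides (ignore header lines, join multi-line sequences, accept upper or lower case) and the A→B / B→A counts of Part A might look roughly like this in Java. All names (FastaSketch, parseFasta, countOrdered) are hypothetical. The code also assumes that “distance” means the number of bases between the end of the first string and the start of the second, which the slides do not define, and it counts every qualifying position pair without the overlap filtering the project asks for.

```java
import java.util.ArrayList;
import java.util.List;

final class FastaSketch {
    /**
     * Turns the lines of a FASTA file into upper-case sequences.
     * Header lines (starting with '>') only mark where a new sequence begins.
     */
    static List<String> parseFasta(Iterable<String> lines) {
        List<String> sequences = new ArrayList<>();
        StringBuilder current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith(">")) {
                if (current != null) {
                    sequences.add(current.toString());
                }
                current = new StringBuilder();
            } else if (current != null) {
                current.append(line.toUpperCase());
            }
        }
        if (current != null) {
            sequences.add(current.toString());
        }
        return sequences;
    }

    /**
     * Counts occurrences of string a followed by string b on one strand,
     * with minGap..maxGap bases between them (assumed distance definition).
     */
    static int countOrdered(String seq, String a, String b, int minGap, int maxGap) {
        int count = 0;
        for (int i = seq.indexOf(a); i >= 0; i = seq.indexOf(a, i + 1)) {
            int endOfA = i + a.length();
            for (int gap = minGap; gap <= maxGap; gap++) {
                // startsWith returns false when the offset is out of range.
                if (seq.startsWith(b, endOfA + gap)) {
                    count++;
                }
            }
        }
        return count;
    }
}
```

The lines can come from `java.nio.file.Files.readAllLines`; calling `countOrdered(seq, a, b, min, max)` and `countOrdered(seq, b, a, min, max)` on every sequence and summing gives the nA and nB of Part A for one string pair. Enumerating all string pairs this way would be slow, so the real design will likely need an index of string positions instead.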