CS 466
Introduction to Bioinformatics
Saurabh Sinha

What is the course about?
• Algorithmic concepts, applied to sample (toy) problems in “bioinformatics”
  – Follows the text book
• “Real” bioinformatics research
  – Follows the best journals
• Not about practical training in the use of popular bioinformatics software

Grading
• Assignments: 40%
  – About one every two weeks
• Mid Term: 30%
• Final: 30%

Expectations
• Some programming skills
  – Any programming language is fine

Administrative Details
• Instructor:
  – Saurabh Sinha
  – Room 2122, Siebel Center
  – Email: sinhas@illinois.edu
• Class hrs: Tue & Thu, 3:30pm-4:45pm, 1131SC
• Office hrs: Tue, before class (2:30 - 3:30 pm), 2122SC
• Web site: http://veda.cs.uiuc.edu/courses/fa09/cs466/
• Welcome to sit in, if not taking for credit

Text books
• Jones and Pevzner: recommended

Other course
• CS 591 BIO: weekly seminar (“journal club”) on bioinformatics research
• Thursdays at 11:00 am.

Motivating bioinformatics
Special issue of journal Science, July 1, 2005.
• What Is the Universe Made Of?
• What is the Biological Basis of Consciousness?
• Why Do Humans Have So Few Genes?
• To What Extent Are Genetic Variation and Personal Health Linked?
• Can the Laws of Physics Be Unified?
• How Much Can Human Life Span Be Extended?
• What Controls Organ Regeneration?
• How Can a Skin Cell Become a Nerve Cell?
• How Does a Single Somatic Cell Become a Whole Plant?
• How Does Earth's Interior Work?
• Are We Alone in the Universe?
• How and Where Did Life on Earth Arise?
• What Determines Species Diversity?
• What Genetic Changes Made Us Uniquely Human?
• How Are Memories Stored and Retrieved?
• How Did Cooperative Behavior Evolve?
• How Will Big Pictures Emerge from a Sea of Biological Data?
• How Far Can We Push Chemical Self-Assembly?
• What Are the Limits of Conventional Computing?
• Can We Selectively Shut Off Immune Responses?
• Do Deeper Principles Underlie Quantum Uncertainty and Nonlocality?
• Is an Effective HIV Vaccine Feasible?
• How Hot Will the Greenhouse World Be?
• What Can Replace Cheap Oil -- and When?
• Will Malthus Continue to Be Wrong?

Many of the most profound scientific questions of today are within the realm of bioinformatics research

“Why do humans have so few genes?”

A simple organism
[Figure: Environmental signal, GENE, Response (protein)]

A simple organism
[Figure: GENE1, GENE2, GENE3]

A simple organism
[Figure: GENE1 to GENE10]

A complex organism
[Figure: GENE1 to GENE10, with a complex circuit of interactions]

Regulatory networks
• This may be the reason why humans have so few genes (the circuit, not the number of switches, carries the complexity)
• Bioinformatics can unravel such networks, given the genome (DNA sequence) and gene activity information

Decoding the regulatory network
• Find patterns (“motifs”) in DNA sequence that occur more often than expected by chance
• Statistics on DNA sequences and words
• Knowing these can tell us about edges in the regulatory network

Decoding the regulatory network
• An example computational problem:
• Given a string of length 10,000 over the alphabet {A,C,G,T}
• Count the number of occurrences Nw of every 6 letter word w
• Are there specific words that occur more frequently than expected by chance?

Decoding the regulatory network
• What is expected by chance?
• What is “more frequently”?
• Interesting mathematical questions
• The Moby Dick example

Motif                                                Observed  Expected  Classes
TGTCACCT|TGTGTCA|TGTGTCAC|TTGTGTC                    35        7         4
GTCAGTAA|TCAGTAAT                                    14        3         4
GACACA|GCCACA                                        52        2         4
CGCGACGC|GACGCGA|GACGCGAA|GAGGCGA|GAGGCGAC|GCGACGCG  18        3         1, 2, 5
GCGGCTAA|GGCGGCAAA|GGCGGCTAAA                        10        1         1
CGTCACAAA|GTCACAAAA                                  17        1         4
http://www.pnas.org/content/97/18/10096/T4.expansion.html

Decoding the regulatory network
• This is called “motif finding”, and helps decode the regulatory network!

Comparing DNA
• Humans are about 99.9% identical to each other, DNA-wise.
• How do we know that?
• Compare the genome of two individuals.
• The computational problem: Are two sequences similar?

Sequence alignment
• Why is this a problem?
• The two sequences will differ by “substitutions”, “insertions” and “deletions” accumulated during evolution
• The comparison algorithm has to be robust to such possibilities.
  – A special technique called “dynamic programming” does all this, and is “efficient”

Sequence alignment
• Why should we care?
• Compare human genome with fish. You’ll see some portions that are highly similar.
• These “conserved” portions are often genes …
• … or regulatory sequences! The regulatory network again.

On counting genes
• The original question was “Why do humans have so few genes?”
• How do we know how many genes there are in the human genome? (And where they are in the genome)
• Experiments can be designed, but bioinformatics plays a major role

Gene prediction
• The task of predicting the locations of genes in a new genome (“annotation”)
• Gene prediction software
• The more sophisticated ones use “Hidden Markov models” (HMM) and multiple species comparison

HMM for Gene Prediction
What is this graph?
It captures the “architecture” of a gene.
It translates into a “probabilistic model”.
It leads directly to a gene finding algorithm
http://researchweb.watson.ibm.com/journal/rd/453/birney.html

“What controls organ regeneration?”
“How does a single somatic cell become a whole plant?”

Development and Regeneration
• Developmental biology
• The timeline from a single cell (with genetic material from mother and father) to a multicellular embryo, and to an adult
• A paradox: All cells in the adult body have the same DNA, then how come different cells are different?
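Going back to the Sequence alignment slides: the dynamic programming idea can be sketched by computing the edit distance between two DNA strings, that is, the minimum number of substitutions, insertions and deletions needed to turn one into the other. This is only an illustration under assumptions not found in the slides: the function name edit_distance is made up, and every substitution, insertion and deletion is given a cost of 1, whereas real alignment tools use richer scoring schemes.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning a into b.

    Assumption for illustration: every operation costs 1.
    """
    # prev[j] is the distance between the letters of `a` processed so far
    # (initially none) and the first j letters of `b`.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # Turning the first i letters of `a` into the empty string takes i deletions.
        curr = [i]
        for j, cb in enumerate(b, start=1):
            substitute = prev[j - 1] + (ca != cb)  # free if the letters match
            delete = prev[j] + 1                   # drop ca from `a`
            insert = curr[j - 1] + 1               # add cb to `a`
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("ACGT", "AGT"))
```

The table has one cell per pair of prefixes, so the running time grows with the product of the two sequence lengths. That is the sense in which dynamic programming is "efficient" here: it avoids enumerating every possible alignment.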
Regulatory networks again
[Figure: two copies of the GENE1 to GENE10 network, one with inputs 1, 0 (HEAD PRECURSOR CELL) and one with inputs 0, 1 (TAIL PRECURSOR CELL)]

Regulatory networks again
• Bioinformatics used to scan entire genome for regions that participate in “segmenting” the embryo
• Hidden Markov models used to detect such regions
• Multiple species comparison aids discovery

“How did cooperative behavior evolve?”

Social behavior and bioinformatics?
• Social behavior in honey bees
• Young worker bees are nurses in the hive; older ones go out to forage
• This behavioral maturation is determined by needs of colony. What is the genetic basis of this?

Social behavior and bioinformatics
• Illinois team scanned the genome to understand this (2006)
• Regulatory network of social behavior
• Statistical tools, such as Hypergeometric test
• Machine learning tools such as “support vector machine classification”

Other challenges

Protein structure prediction
http://www.denizyuret.com/students/vkurt/thesis-main_dosyalar/image006.gif

Protein structure prediction
• Can we predict the 3-D structure of a protein from its amino acid sequence?
• Why?
  – One good reason: structure gives clues about function. If we can tell the structure, we can perhaps tell the function
  – We can design amino acid sequences that will fold into proteins that do what we want them to do. Drug design!!
• Neural networks, a popular technique in computer science, applied to this problem

Metagenomics
• Most studies to date are on genomes of one species
• A sample from the soil contains hundreds of bacteria, thousands of viruses. Can we study all of these?
• The Sorcerer II expedition
• http://www.sorcerer2expedition.org/version1/HTML/main.htm

Many more challenges
• New types of data come due to technological breakthroughs in biology
• High throughput data carries unprecedented amount of information
• Too much noise
• Bioinformatics removes the noise and reveals the truth

Bioinformatics
• Is not about one problem (e.g., designing better computer chips, better compilers, better graphics, better networks, better operating systems, etc.)
• Is about a family of very different problems, all related to biology, all related to each other
• How can computers help solve any of this family of problems?

Bioinformatics and You
• You can learn the tools of bioinformatics
• These tools owe their origin to computer science, information theory, probability theory, statistics, etc.
• You can learn the language of biology, enough to understand what the problems are
• You can apply the tools to these problems and contribute to science
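A first exercise along these lines: the word-counting problem from the "Decoding the regulatory network" slides (count the occurrences Nw of every 6 letter word w in a DNA string and ask which ones occur more often than expected by chance) can be sketched as follows. The slides do not say what "chance" means, so this sketch assumes the simplest null model, in which each letter is independent and equally likely. The function names and the ratio cutoff are also assumptions made for illustration.

```python
from collections import Counter


def count_words(seq: str, k: int = 6) -> Counter:
    """Count occurrences N_w of every k-letter word w (overlaps included)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def expected_count(seq_len: int, k: int = 6, alphabet_size: int = 4) -> float:
    """Expected N_w for one fixed word.

    Assumed null model: letters are independent and uniform, so each of the
    seq_len - k + 1 positions matches w with probability 1 / alphabet_size**k.
    """
    return (seq_len - k + 1) / alphabet_size ** k


def overrepresented(seq: str, k: int = 6, ratio: float = 5.0):
    """Words seen at least `ratio` times more often than expected.

    The cutoff `ratio` is an arbitrary illustrative choice, not a statistical test.
    """
    expected = expected_count(len(seq), k)
    counts = count_words(seq, k)
    hits = [(w, n) for w, n in counts.items() if n >= ratio * expected]
    return sorted(hits, key=lambda item: -item[1])


if __name__ == "__main__":
    # For the slide's setting (length 10,000, 6-letter words) each word is
    # expected roughly 2.4 times under the assumed uniform model.
    print(expected_count(10_000, 6))
```

A fixed ratio is only a stand-in for the question "What is more frequently?". Answering it properly calls for a significance test, and for a null model that reflects the actual letter frequencies of the genome.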