CpSc 845: Bioinformatics Algorithms
Introduction

Copyright Notice
Most slides in this presentation are adapted from various sources. The copyright belongs to the original authors. Thanks!

General Information
Instructor: Dr. Feng Luo, Associate Professor, School of Computing
210 McAdams Hall
Office: (864) 656 4793
E-mail: luofeng@clemson.edu
Class Hours/Room: 2:00 PM ~ 3:15 PM, TTH, 119 McAdams Hall
Office Hours/Room: 2:00 PM ~ 3:00 PM, MW, and by appointment, 310 McAdams Hall
Course Webpage: www.cs.clemson.edu/~luofeng/course/2014fall/845/bioinfo.html

Text Book
Textbook: Neil C. Jones, Pavel A. Pevzner, “An Introduction to Bioinformatics Algorithms”; The MIT Press; ISBN: 0262101068
Reference books: Jonathan Pevsner, "Bioinformatics and Functional Genomics", First edition (October 2003); Publisher: Wiley, John & Sons; ISBN: 0471210048

Grading
Mid-term exam: 25%
Final exam: 25%
Term Project: 50%

The “old” biology
The most challenging task for a scientist is to get good data

The “new” biology
The most challenging task for a scientist is to make sense of lots of data

Old vs. New - What’s the difference?
1) Economics
Miniaturize: lower cost
Multiplex: more data
Parallelize: save time
Automate: minimize human intervention
Thus, you must be able to deal with large amounts of data and trust the process that generated it

What’s the difference?
2) Scale
From gene sequencing (~ 1 KB) to genome sequencing (many MB, even GB)
From picking several genes for expression studies to analyzing the expression patterns of all genes
From a catalog of key genes in a few key species to a catalog of all genes in many species
Analyzing your data in isolation makes less sense when you can make much more powerful statements by including data from others

What’s the difference?
3) Logic
Hypothesis-driven research to data-driven research
Expertise-driven approach versus information-driven approach
Reductionist versus integrationist
How to answer the question becomes how to question an answer
Algorithmic approaches for filtering, normalizing, analyzing and interpreting become increasingly important

Data-driven Science
Must have some hypothesis: data is not the end goal of science
Finding patterns in the data is where analysis starts, not ends
Must understand the limits of high-throughput technology (e.g. microarrays measure transcription only, one genome does not tell you about species variation, etc.)
Must understand or explore the limits of your algorithm

Data is being collected faster and in greater amounts
[Figure: number of databases (estimated) by year, 1980 to 2005]

Growth in microarray publications
[Figure: number of microarray papers per year, 1998 to 2005]

Plummeting Cost of Sequencing
[Figure: memory cost ($/Mbyte), CPU cost ($/MFLOP), and sequencing cost ($/base-pair) on a logarithmic dollar axis, original data and fits, 1980 to 2010. Source: Greenbaum et al., Am. J. Bioethics ('08)]

Growth in information & knowledge
[Figure: number of articles in MEDLINE (millions) by year, 1973 to 2005]
MEDLINE spans:
>4,800 journals
>16,000,000 records
672,000 new papers in 2005 (~1,840 per day)

The use of software & algorithms is becoming more common in biomedical research

The Biomedical Information Science and Technology Initiative (BISTI)
Prepared by the Working Group on Biomedical Computing, Advisory Committee to the Director, National Institutes of Health
http://www.nih.gov/about/director/060399.htm

What is Bioinformatics?
Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data. (BISTIC Definition Committee, July 2000)
Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems. (BISTIC Definition Committee, July 2000)

Bioinformatics is Interdisciplinary
[Diagram: Bioinformatics at the center, connected to the fields below]
Mathematics
Statistics
Computer Science
Biomedicine
Molecular Biology
Structural Biology
Biophysics
Evolution
Ethical, legal and social implications
Patrice Koehl

Bioinformatics Opportunity
“From the Principal Investigators who understand how to use computers to solve biomedical problems to the people who keep the computers running, there is a shortfall of trained, educated, competent people. The NIH needs a program of workforce development for biomedical computing that encompasses every level, from the technician to the Ph.D.
The National Programs of Excellence in Biomedical Computing would provide a structure for developing expertise among biomedical researchers in using computational tools.” (BISTI, 1999)

NATURE MEDICINE • VOLUME 5 • NUMBER 7 • JULY 1999
NATURE | VOL 400 | 8 JULY 1999 | www.nature.com
Nature 410, 293 (15 March 2001)
Nature Biotechnology 22, 933 (2004); published online 27 July 2004; doi:10.1038/nbt0804-933

Industry Opportunity
Aris Persidis, Industry Trends: Bioinformatics, NATURE BIOTECHNOLOGY VOL 17 AUGUST 1999

Top ten challenges for bioinformatics
[1] Precise models of where and when transcription will occur in a genome (initiation and termination)
[2] Precise, predictive models of alternative RNA splicing
[3] Precise models of signal transduction pathways; ability to predict cellular responses to external stimuli
[4] Determining protein:DNA, protein:RNA, protein:protein recognition codes
[5] Accurate ab initio protein structure prediction
[6] Rational design of small molecule inhibitors of proteins
[7] Mechanistic understanding of protein evolution
[8] Mechanistic understanding of speciation
[9] Development of effective gene ontologies: systematic ways to describe gene and protein function
[10] Education: development of bioinformatics curricula
Source: Ewan Birney, Chris Burge, Jim Fickett

Topics Covered
Sequence Analysis
Microarray Analysis
Protein Structure
Biological Network
More …
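
Worked check (not from the original slides)
Two of the numbers quoted earlier in this deck can be checked with simple arithmetic: the MEDLINE rate (672,000 new papers in 2005, about 1,840 per day) and the jump in scale from a single gene (~1 KB) to a whole genome. The sketch below is only an illustration. The function names are hypothetical, the gene size is read as 1,000 bases, and the 3-billion-base genome is an assumption (roughly human-sized); the slides only say "many MB, even GB".

```python
# Back-of-the-envelope checks for two figures quoted in the slides.

PAPERS_2005 = 672_000          # new MEDLINE papers in 2005 (from the slides)
DAYS_PER_YEAR = 365

GENE_BASES = 1_000             # "~1 KB" gene, read here as 1,000 bases
GENOME_BASES = 3_000_000_000   # ASSUMPTION: a ~3 Gb genome, roughly human-sized


def papers_per_day(papers_per_year: int, days: int = DAYS_PER_YEAR) -> float:
    """Average number of new papers per day over one year."""
    return papers_per_year / days


def scale_factor(small: int, large: int) -> float:
    """How many times larger the second quantity is than the first."""
    return large / small


if __name__ == "__main__":
    print(f"{papers_per_day(PAPERS_2005):.0f} papers per day")
    print(f"genome / gene = {scale_factor(GENE_BASES, GENOME_BASES):,.0f}x")
```

Under these assumptions the daily rate comes out near 1,841, consistent with the "~1,840 per day" on the MEDLINE slide, and the genome is about three million times larger than the gene.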