Applied Bioinformatics
Dr. Jens Allmer
Week 1 (Introduction)

Your Instructor
• Education
  – BSc: University of Münster 1996
  – MSc: University of Münster 2002
  – PhD: University of Münster 2006
• Worked at
  – Izmir Institute of Technology (since 2008)
  – Izmir University of Economics, Turkey (Feb 2007 – Aug 2008)
  – University of Münster, Germany (Jan 2006 – Feb 2007)
  – University of Pennsylvania, USA (Jan 2004 – Dec 2005)
  – University of Jena, Germany (Nov 2002 – Dec 2003)

Areas of Interest
• Bioinformatics
  – Sequences
  – Alignments
• Mass Spectrometry
  – De novo sequencing
  – Pattern matching
• Annotation
  – Integration
  – Automatic assessments
• General Automation and Productivity

Course Rules
• Attendance
  – Is essential and will be monitored strictly
  – if(absence > 12h) Then NA;
• Make-up
  – No make-up for homework
  – Midterm and Final require a medical report for make-up

Course Rules
• Lecture starts on time
  – if late enter QUIETLY
  – if more than 5 min late DO NOT ENTER wait for break
• Breaks are 10 min max
  – if late after break enter QUIETLY
  – if more than 5 min late DO NOT ENTER wait for next break
• Early leave
  – Announce before class and leave if permission is granted

Course Rules
• Homework
  – Published on the website and/or as slides
  – Deadline 6pm on the day before the next class (you may submit early of course)
  – No extension
  – No make-up
  – No extra homework
• Must be electronically submitted to: jensallmer.iyte@analysis.urkund.com
  – Must be named HW00_first_last.eee or will not be accepted
  – Formats include: doc, ppt, odx, txt, html, ...
  – Formats that I cannot edit (such as pdf) and formats that are not widespread are not allowed
  – Must be significantly different from your classmates' submissions
  – Otherwise everyone involved will receive zero for that assignment

Grading
• All information available on class website
• Grading individualized
  – Homework 20%
  – Quizzes 20%
  – Mind Maps 10%
  – Midterm 20%
  – Final 30%

Grading
• I am responsible for evaluating you
  – I am not responsible for passing everyone or giving great grades
• Make it easy for me
  1. Show up and participate
  2. Do homework and pre-course preparations
  3. Midterm and Final will be easy for you if you adhere to 1. and 2.

Course Structure
  – Start 9:00
  – 10 min quiz
  – 35 min lecture
  – 5 min mind mapping
  – 10 min break
  – 50 min practice
  – 10 min break
  – 40-50 min lecture
  – 10 min break
  – 30 min practice

Textbooks
• Primary audience
  – Junior bio majors
• Course home page:
  – http://www.biolnk.com/habf
• ISBN: 978-605-133-297-0
• http://www.idefix.com/kitap/biyoenformatik-1-dizi-kiyaslamalarijens-allmer/tanim.asp?sid=GUFFOI44R7FJ9CIR6STU

Textbooks
• Primary audience
  – Junior bio majors
• Course home page:
  – http://www.bio.davidson.edu/genomics
• Taught by A. Malcolm Campbell (Biology)

Textbooks
Everything you currently need to know about Applied Bioinformatics with regard to the practical problems you will encounter during everyday research.

Bioinformatics
[Diagram labels: Bioinformatics, Chemistry, Biology, Molecular biology, Mathematics, Statistics, Computer Science, Informatics, Medicine, Physics]

Bioinformatics is Multidisciplinary
[Diagram labels: BIOINFORMATICS, Genomics, Drug Design, Computer Science, Molecular Life Sciences, Phylogenetics, Structural Biology, Math, Statistics]

The Pyramid of Life (2000)
Metabolomics: 1400 Chemicals
Proteomics: 3,000 Enzymes
Genomics: 30,000 Genes

The Pyramid of Life
Protein Interactions?
100,000 Proteins
30,000 Genes
1400 Chemicals

Bioinformatics (or Computational Biology)
• Not just the study of DNA or protein sequence data
• Inclusive definition
  – concerns the storage, display, reduction, management, analysis, extraction, simulation, modeling, fitting or prediction of biological, medical or pharmaceutical data

Basis of molecular life sciences
• Hierarchy of relationships (some exceptions):
  Genome
  Gene 1      Gene 2      Gene 3      Gene X
  Protein 1   Protein 2   Protein 3   Protein X
  Function 1  Function 2  Function 3  Function X

How can one use bioinformatics to link diseases to genes?
• Disease → Map → Gene → Function

Positional cloning of genes
1. Find genetic markers associated with disease
2. Sequence DNA next to the markers
3. Compare DNA from afflicted individuals to DNA of normal individuals (database)
4. Find abnormalities
5. Predict gene function from sequence information

Bioinformatics in the old days
• Close to Molecular Biology:
  – (Statistical) analysis of protein and nucleotide structure
  – Protein folding problem
  – Protein-protein and protein-nucleotide interaction
• Many essential methods were created early on
  – Protein sequence analysis (pairwise and multiple alignment)
  – Protein structure prediction (secondary, tertiary structure)

Bioinformatics in the old days (Cont.)
• Evolution was studied and methods created
  – Phylogenetic reconstruction (clustering, e.g., Neighbor Joining (NJ) method)
  – Nowadays also part of data mining

But then the big bang….

The Human Genome - 26 June 2000
Dr. Craig Venter, Celera Genomics -- Shotgun method
Francis Collins (USA)/Sir John Sulston (UK), Human Genome Project

History of the Human Genome Project
1953 Watson, Crick DNA structure
1972 Berg, 1st recombinant DNA
1977 Maxam, Gilbert, Sanger sequence DNA
1980 Botstein, Davis, Skolnick, White propose to map human genome with RFLPs
1982 Wada proposes to build automated sequencing robots
1984 MRC publishes first large genome: Epstein-Barr virus (170 kb)
1985 Sinsheimer hosts meeting to discuss HGP at UC Santa Cruz; Kary Mullis develops PCR
1986 DOE begins genome studies with $5.3 million
1987 Gilbert announces plans to start company to sequence and copyright DNA; Burke, Olson, Carle develop YACs; Donis-Keller publish first map (403 markers)

History of the Human Genome Project (continued)
1987 (cont) Hood produces first automated sequencer; Dupont develops fluorescent dideoxynucleotides
1988 NIH supports the HGP; Watson heads the project and allocates part of the budget to study social and ethical issues
1989 Hood, Olson, Botstein, Cantor propose using STS’s to map the human genome
1990 Proposal to sequence 20 Mb in model organism by 2005; Lipman, Myers publish the BLAST algorithm
1991 Venter announces strategy to sequence ESTs. He plans to patent partial cDNAs; Uberbacher develops GRAIL, a gene finding program
1992 Simon develops BACs; US and French teams publish first physical maps of chromosomes; first genetic maps of mouse and human genome published
1993 Collins is named director of NCHGR; revise plan to complete seq of human genome by 2005
1995 Venter publishes first sequence of free-living organism: H. influenzae (1.8 Mb); Brown publishes on DNA arrays
1996 Yeast genome is sequenced (S. cerevisiae)

History of the Human Genome Project (continued)
1997 Blattner, Plunkett complete E. coli sequence; a capillary sequencing machine is introduced.
1998 SNP project is initiated; rice genome project is started; Venter creates new company called Celera and proposes to sequence HG within 3 years; C. elegans genome completed
1999 NIH proposes to sequence mouse genome in 3 years; first sequence of chromosome 22 is announced
2000 Celera and others publish Drosophila sequence (180 Mb); human chromosome 21 is completely sequenced; proposal to sequence puffer fish; Arabidopsis sequence is completed
2001 Celera publishes human sequence in Science; the HGP consortium publishes the human sequence in Nature
2003 Completed genomes: 112 Microbial, 18 Eukaryotes, 1275 Viruses

Human DNA
• There are at least 3bn (3 × 10^9) nucleotides in the nucleus of almost all of the trillions (3.2 × 10^12) of cells of a human body (an exception is, for example, red blood cells, which have no nucleus and therefore no DNA) – a total of ~10^22 nucleotides!
• Many DNA regions code for proteins, and are called genes (1 gene codes for 1 protein as a base rule, but the reality is a lot more complicated)
  – Name examples
• Human DNA may contain ~27,000 expressed genes
  – Problems?
• Deoxyribonucleic acid (DNA) comprises 4 different types of nucleotides: adenine (A), thymine (T), cytosine (C) and guanine (G). These nucleotides are sometimes also called bases
  – Ambiguities?

Y-Chromosome
• 50% of the sequence consists of NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
• Not very meaningful

Human DNA (Cont.)
• All people are different, but the DNA of different people varies by only 0.2% or less
• So, only up to 2 letters in 1000 are expected to be different.
• Evidence from current genomics studies (Single Nucleotide Polymorphisms or SNPs) implies that on average only 1 letter out of 1400 is different between individuals.
• Over the whole genome, this means that 2 to 3 million letters would differ between individuals.
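
The numbers on the Human DNA slides can be checked with a few lines of arithmetic. The following is only a minimal sketch that assumes the round figures quoted on the slides; the variable names are illustrative and not part of the course material.

```python
# Round numbers taken from the slides; variable names are illustrative only.
GENOME_LENGTH = 3 * 10**9        # nucleotides per nucleus
CELLS_PER_BODY = 3.2 * 10**12    # cells in a human body

# Total nucleotides in one body: on the order of 10^22.
total_nucleotides = GENOME_LENGTH * CELLS_PER_BODY

# Upper bound: 0.2% of positions differ (2 letters in 1000).
max_differences = GENOME_LENGTH * 2 // 1000

# SNP-based estimate: on average 1 letter in 1400 differs.
snp_differences = GENOME_LENGTH / 1400

print(f"total nucleotides per body: {total_nucleotides:.1e}")
print(f"differences at 0.2%:        {max_differences:,}")
print(f"differences at 1 in 1400:   {snp_differences:,.0f}")
```

With these inputs, the 1-in-1400 rate gives roughly 2.1 million differing letters, at the low end of the "2 to 3 million" stated on the slide, while the 0.2% upper bound corresponds to 6 million.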
Modern bioinformatics is closely associated with genomics
• The aim is to solve the genomics information problem
• Ultimately, this should lead to a biological understanding of how all parts fit (DNA, RNA, proteins, metabolites) and how they interact (gene regulation, gene expression, protein interaction, metabolic pathways, protein signaling, etc.)

Functional Genomics
From gene to function
[Diagram labels: Genome, Expressome, Proteome, Interactome?, Metabolome, TERTIARY STRUCTURE (fold)]

How much of the genome is defined?
[Chart label: Unknown Function]

What is bioinformatics?
[Diagram labels: Math, Physics, English, Bio, Comp sci, Chem, Stats, Bioinformatics]
• Machine learning
• Database systems
• Data mining
• Image processing
• Modeling
• Graph theory
• Statistical analysis
• Sequence
• Structure
• Interactions
• Regulation
• Genomes
• Evolution
• E.g. Process the spots on a microarray, determine which genes are differentially expressed, link spots to sequence via a database, analyze the sequence using predictive tools, link the genes to related genes to form a network

What is a bioinformatician?
• Somebody who knows everything

What is a bioinformatician?
• A facilitator
  – Typically has a background in biology or CS, but is comfortable with concepts from other disciplines
  – Brings together ideas (or researchers) from different domains to solve a biological problem
• Conceptualize the problem
  – Use language appropriate to the domain
• Identify potential solutions
  – Understanding of different fields helps to identify possible approaches at a broad level
• Guide the development process
  – Create in-house or find potential collaborators to work on approaches in-depth
• Integrate results into overall solution
  – Software/method, results of biological analysis

How is Bioinformatics Used?
• Bioinformatics is used to help “focus” the scientist on the bench top experiments
• Bioinformatics isn’t going to replace lab work anytime soon
• Experimental proof is still the “Gold Standard”.
Bioinformatics
• Is the application of computational tools in biology bioinformatics?
• Not really!
• In this course, however, we will only rarely go into algorithmic details (like today ;)

Mind Mapping
• Have you ever studied a subject or brainstormed an idea, only to find yourself with pages of information, but no clear view of how pieces fit together?
• Mind mapping
  – Learn more effectively
  – Improves memorization
  – Enhances creativity
  – Speeds up analyses
  – Gives structure to complex ideas
  – Records information for future use
Source: http://www.mindtools.com/pages/article/newISS_01.htm

An Example Mind Map for MicroRNAs

How to Mind Map
1. Identify the central topic and write it in the center
2. Write major parts of the topic on lines in all directions
3. Repeat 2. with ever finer level of detail until satisfied
Source: http://www.mindtools.com/pages/article/newISS_01.htm

Note Taking with Mind Maps
• Capture ideas organized into topics
  – What if the central topic which I chose is not the central topic?
  – Make a new mind map which captures the topic correctly
• Use Cases
  – Note taking in class
  – Recapitulation after lecture
  – Analysis of a new topic
  – Structuring of any intended writing
• When
  – During acquisition of new knowledge (faster than writing)
  – For review 5 min, 1 h, 6 h, 1 d, 7 d, 1 month after note taking

Mind Mapping Tips
1. Use single words or very short phrases
2. Write clearly and legibly
3. Use color!
4. Separate ideas (color, lines, shading)
5. Draw symbols and images
6. Draw links among elements

A More Elaborate Mind Map
Source: http://www.mindtools.com/pages/article/newISS_01.htm

At the Heart of Bioinformatics
Genomic
>scaffold_1152
GGTGCGGCCGTCCTCCAGCTGCTTGCCGGCGAAGATCAGGCGCTGCTGGT
CCGGGGGGATGCCTGCATCCGGTGAGGAAACGCTCGTGTCAGACAAAGTG
GGTGGGCGCAGGAAGCAGCAATCAACACAGCCCAGTGCAGCTGCAAAGCG
CCCGCCTTACCACTGACCCGCCTGGCCACCCACCCCTACCCCCCGTAAGG
AAAGAGCCCCGACTCACCCTCCTTGTCCTGAATCTTGGCCTTCACGTTCT
CAATGGTGTCCGAAGACTCCACCTCGAGCGTGATGGTCTTGCCCGTCAGG
GTCTTGACGAAGATCTGCATGCCACCGCGCAGGCGCAGCACCAGGTGCAG
…
Translated
>RF1_scaffold_1152
GAAVLQLLAGEDQALLVRGDACIR$GNARVRQSGWAQEAAINTAQCSC
KAPALPLTRLATHPYPP$GKSPDSPSLS$ILARDVAHDFAKSSPR$YA
PLIPQNLRC$SIEMKQPASLLSPIGEGACASHLQCLEKCLLP$GAIVY
MIS$GSGRR$TSWVGIGGCNDGTEKRSEVDSRRGGKGNIHD
>RF2_scaffold_1152
VRPSSSCLPAKIRRCWSGGMPASGEETLVS AATAAKPQTWSPTAWEF
KVGGRRKQQSTQPSAAAKRPPYH$PAWPPTPTPRKERAPTHPPCPESW
SRSQWCPKTPPRA$WSCPSGS$RRSACHRAGAAPGAGSTPSGCCSQPG
CGRPPAACRRRSGAAGPGGCLCVGGGGEGACASHLQCLEGE
…

Try it for yourself
Sequence: ACGGTAGTATGTGATGTATGATCGCGAAAGAGG
Pattern: TGATGT
Your Task
• You may only compare 1 character at a time
• You may create helpful structures
• You should find the location of the Pattern in the Sequence with a minimal number of comparisons

Brute Force Approach
ACGGTAGTATGTGATGTATGATCGCGAAAGAGG
TGATGT
Comparisons (running total over the successive animation steps): 1, 2, 3, 4, 6, 7-16, 17-22

Boyer-Moore Algorithm
• Preprocessing
• Good suffix matrix
• Bad character matrix (m+1) (m+1)
ACGGTAGTATGTGATGTATGATCGCGAAAGAGG
TGATGT
Comparisons (running total over the successive animation steps): 1, 2, 3-7, 8, 9-15

Questions

Define Algorithm

Website
• http://mbg305.allmer.de
• Slides
• Homework
• Additional materials and challenges
• Grades

Website
• To see your grades you need to log in
• Some material may need login as well
• Currently
  – UserID = StudentID
  – Password = StudentID
• Change now
  – UserID = working email address
  – Password = whatever you will remember

Login to mbg305.allmer.de
• We will now help you log in, add your email address, and change your password.

Assignments
– Research about Mind Maps
  • E.g.: http://en.wikipedia.org/wiki/Mind-map
  • IYTE library
– Make sure to read the lecture notes for next week (available online on Monday)
  • Prepare at least two proper questions that will be collected at the beginning of the course next week
– Read Chapters 1 and 2 from our textbook
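
For revision, the "Try it for yourself" exercise can be replayed in code. The sketch below counts single-character comparisons for the brute force approach and for a simplified relative of Boyer-Moore that uses only a bad-character shift table (the Horspool variant); it does not build the good suffix table named on the slide. The function names are illustrative assumptions, not course code, and the tallies depend on how one counts a step, so they may differ slightly from the totals shown on the slides.

```python
def brute_force_search(text, pattern):
    """Return (index of first match or -1, number of character comparisons)."""
    comparisons = 0
    n, m = len(text), len(pattern)
    for shift in range(n - m + 1):
        j = 0
        # Compare left to right and stop at the first mismatch.
        while j < m:
            comparisons += 1
            if text[shift + j] != pattern[j]:
                break
            j += 1
        if j == m:
            return shift, comparisons
    return -1, comparisons


def horspool_search(text, pattern):
    """Bad-character-only search (Horspool), a simplification of Boyer-Moore."""
    comparisons = 0
    n, m = len(text), len(pattern)
    # Preprocessing: distance from the last occurrence of each character
    # (excluding the final pattern position) to the end of the pattern.
    shift_for = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    shift = 0
    while shift <= n - m:
        j = m - 1
        # Compare right to left and stop at the first mismatch.
        while j >= 0:
            comparisons += 1
            if text[shift + j] != pattern[j]:
                break
            j -= 1
        if j < 0:
            return shift, comparisons
        # Shift according to the text character under the pattern's last position.
        shift += shift_for.get(text[shift + m - 1], m)
    return -1, comparisons


SEQUENCE = "ACGGTAGTATGTGATGTATGATCGCGAAAGAGG"
PATTERN = "TGATGT"

print("brute force:", brute_force_search(SEQUENCE, PATTERN))
print("horspool:   ", horspool_search(SEQUENCE, PATTERN))
```

Both functions report the match at 0-based index 11. This sketch tallies 21 comparisons for brute force and 14 for the bad-character variant, compared with 22 and 15 on the slides; either way, skipping ahead with a preprocessed table saves about a third of the comparisons on this example.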