Introduction to Bioinformatics
Lecturer: Dr. Yael Mandel-Gutfreund
Teaching Assistants: Oleg Rokhlenko, Ydo Wexler
http://webcourse.cs.technion.ac.il/236523

What is Bioinformatics?

Course Objectives
• To introduce the bioinformatics discipline
• To familiarize students with the major biological questions that can be addressed with bioinformatics tools
• To introduce the major tools used for sequence and structure analysis and explain in general how they work (including their limitations)

Course Structure and Requirements
1. Class Structure
Each class (except the first one) will be divided into two parts:
  1. Lecture (in the lecture room)
  2. A Training Lab (in the computer lab)*
• For the Training Lab, the class will be divided into 2 groups. Each group will meet every second week, starting from the second week.
• The work in the Training Labs will be done in pairs.
• Lab assignments will be submitted at the end of each lab.
• Preparing for the lab: a tutorial, including home exercises with their answers, will be posted on the web a week before the lab.
2. A final home exam

Grading
• 30% lab assignments
• 70% final exam

Literature list
• Gibas, C., Jambeck, P. Developing Bioinformatics Computer Skills. O'Reilly, 2001.
• Lesk, A. M. Introduction to Bioinformatics. Oxford University Press, 2002.
• Mount, D. W. Bioinformatics: Sequence and Genome Analysis. 2nd ed., Cold Spring Harbor Laboratory Press, 2004.
Advanced Reading
• Jones, N. C., Pevzner, P. A. An Introduction to Bioinformatics Algorithms. MIT Press, 2004.

Course syllabus

What is Bioinformatics?
“The field of science in which biology, computer science, and information technology merge to form a single discipline”
Ultimate goal: to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned.

Bioinformatics
From purely lab-based science to an information science
Bio = Informatics

Central Paradigm in Molecular Biology
Gene (DNA) → mRNA → Protein
21st century: Genome → Transcriptome → Proteome

Genome
• Chromosomal DNA of an organism
• Coding and non-coding DNA
• Genome size and number of genes do not necessarily determine organism complexity

Transcriptome
• Complete collection of all possible mRNAs (including splice variants) of an organism.
• Regions of an organism’s genome that get transcribed into messenger RNA.
• The transcriptome can be extended to include all transcribed elements, including non-coding RNAs used for structural and regulatory purposes.

Proteome
• The complete collection of proteins that can be produced by an organism.
• Can be studied either as a static entity (the sum of all possible proteins) or as a dynamic entity (all proteins found at a specific time point)

From DNA to Genome
[Timeline with axis ticks from 1955 to 2000; milestones shown:]
• Watson and Crick DNA model
• First protein sequence
• First protein structure
• First bacterial genome (Haemophilus influenzae)
• Yeast genome
• First human genome draft

The Human Genome Project
Initiated in 1986, completed in 2003
Project goals were to
• identify all the genes in human DNA,
• determine the sequences of the 3 billion chemical base pairs that make up human DNA,
• store this information in databases,
• improve tools for data analysis and develop new tools,
• address the ethical, legal, and social issues that may arise from the project.

Human Genome Project
[Timeline with axis ticks at 1985, 1990, 1995, 2000; milestones shown:]
• International Human Genome Organization founded
• Celera Genomics founded
• First working drafts published
• USA Department of Energy announces project
• Low resolution linkage map published
• Project successfully completed

The Human Genome Project
Initiated in 1986, completed in 2003
How did we do?
• identify all the genes in human DNA ☺ ☺
• determine the sequences of the 3 billion chemical base pairs that make up human DNA ☺ ☺ ☺
• store this information in databases ☺ ☺ ☺
• improve tools for data analysis and develop new tools ☺ ☺ ☺
• address the ethical, legal, and social issues that may arise from the project ☺

What makes us human?
CHIMP GENOME
Chimpanzees are similar to humans in so many ways: they are socially complex, sensitive and communicative, and yet indisputably on the animal side of the man/beast divide. Scientists have now sequenced the genetic code of our closest living relative, showing the striking concordances and divergences between the two species, and perhaps holding up a mirror to our own humanity.

How human are chimps?
Perhaps not surprising!
Comparison between the full drafts of the human and chimp genomes revealed that they differ by only 1.23%

Complete Genomes
• 1994: 0
• 1995: 1
• 2004: 234
• 2005: 303 (eukaryotes 24, bacteria 240, archaea 39)

What’s Next? The “post-genomics” era
• Annotation
• Comparative genomics
• Structural genomics
• Functional genomics
Goal: to understand the functional networks of a living cell

Annotation
• Open reading frames
• Functional sites
• Structure, function

CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATG
CGTGCAAATAATCAATGTGGACTTTTCTGCGATTATGGAAGAA
CTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTC
AGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGA
AGAGCCAGTAAAAGACGCAGTGACGGAGATGTCTGATG CAA TAT GGA CAA TTG GTT TCT TCT CTG AAT
...... .............. TGAAAAACGTA

[Same sequence, annotated with the labels: TF binding site, promoter, Transcription Start Site]
CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATG
CGTGCAAATAATCAATGTGGACTTTTCTGCGATTATGGAAGAA
CTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTC
AGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGA
AGAGCCAGTAAAAGACGCAGTGACGGAGATGTCTGATG CAA TAT GGA CAA TTG GTT TCT TCT CTG AAT
.................................
..............
TGAAAAACGTA
Labels: Ribosome binding Site; ORF = Open Reading Frame; CDS = Coding Sequence

Comparative genomics
• Whole genome comparison: drawing conclusions about regulatory networks

Chimps and Us

Comparative genomics (continued)
• Comparing ORFs and identifying orthologs: drawing conclusions about structure and function

Researchers have learned a great deal about the function of human genes by examining their counterparts in simpler model organisms such as the mouse.
Conservation of the IGFALS gene (insulin-like growth factor binding protein, acid labile subunit) between human and mouse.

Functional genomics
Genome-wide profiling of:
• mRNA levels
• Protein levels
Result: co-expression of genes and/or proteins

Understanding the function of genes and other parts of the genome

Functional genomics (continued)
• Identifying protein-protein interactions
Result: networks of interactions

A network of interactions can be built for all proteins in an organism
Example: a large network of 8184 interactions among 4140 S. cerevisiae proteins

Structural genomics
Assign a structure to all proteins encoded in a genome

Protein Structure

Resources and Databases
The different types of data are collected in databases
– Sequence databases
– Structural databases
– Databases of Experimental Results
All databases are connected

Database Types
• Sequence databases
  – General: GenBank, EMBL, PIR, Swiss-Prot
  – Special: TF binding sites, promoters, genomes
• Structure databases
  – General: PDB
  – Special: specific protein families, folds
• Databases of experimental results
  – Co-expressed genes, protein-protein interactions, etc.
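
The annotation slides above name open reading frames (ORFs) as a first annotation target. As a rough illustration, and not the method of any particular annotation tool, the sketch below scans the forward strand in all three frames for an ATG followed in-frame by a stop codon. The function name `find_orfs` and the `min_codons` parameter are hypothetical; real gene finders also handle the reverse strand, alternative start codons, and statistical scoring.

```python
# Minimal ORF scan (illustrative only; names are assumptions, not a library API).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return sorted (start, end) pairs of forward-strand ORFs.

    start is the index of the A in ATG; end is one past the stop codon.
    min_codons counts every codon from ATG through the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the currently open ATG in this frame, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close the ORF and look for the next ATG
    return sorted(orfs)


# Toy input: ATG AAA TAG embedded in frame 2.
print(find_orfs("CCATGAAATAGGG"))  # [(2, 11)]
```

In practice `min_codons` would be set much higher (for example 100) so that short chance ORFs in non-coding DNA are filtered out.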

Sequence databases
• Gene database
• Genome database
• SNPs database
• Disease-related mutation database

What can we learn about a Gene

mRNA, full length, EST

EST: Expressed Sequence Tags
• Partial copies of mRNA found within a particular cell
• Can be used to identify genic regions, splicing patterns of genes, etc.

Different transcripts can be related to the same gene!

Gene database
• Gives insight into gene function
• Alternative splicing of genes
  – Alternative patterns of exons included to create the gene product
• EST

Genome Databases
• Data organized by species
• Clones assembled into contiguous pieces (‘contigs’) or whole chromosomes
• Information on non-coding regions
• Relativity

Genome Browsers
• Annotation adds value to sequence
• Easy “walk” through the genome
• Comparative genomics

Genome Browsers
• Ensembl Genome Browser: http://www.ensembl.org
• UCSC Genome Browser: http://genome.ucsc.edu/
• WormBase: http://www.wormbase.org/
• AceDB: http://www.acedb.org/
• Comprehensive Microbial Resource: http://www.tigr.org/tigr-scripts/CMR2/CMRHomePage.spl
• FlyBase: http://flybase.bio.indiana.edu/

beta globin

RefSeq
• Set of mRNA sequences curated at NCBI
• Many experimentally validated
• Some partially validated via ESTs
• Some computationally predicted

SNP database
Single Nucleotide Polymorphisms (SNPs)
• A single-base difference at a single position between two individuals of the same species
• Play an important role in differentiation and disease

Sickle Cell Anemia
• Caused by a single base substitution (an A replaced by a T), so that valine is incorporated instead of glutamic acid in hemoglobin
Image source: http://www.cc.nih.gov/ccc/ccnews/nov99/

Healthy Individual
>gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA
ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGA
GGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGC
AGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATG
CTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGC
TCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGAT
CCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCA
CCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCA
CTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACT
GGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC
>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH

Diseased Individual
>gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA
ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGT
GGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGC
AGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATG
CTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGC
TCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGAT
CCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCA
CCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCA
CTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACT
GGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC
>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPVEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in databases
OMIM - Online Mendelian Inheritance in Man

Structure Databases
• 3-dimensional structures of proteins, nucleic acids, molecular complexes, etc.
• 3-D data is available thanks to techniques such as NMR and X-ray
crystallography

Databases of Experimental Results
• Data such as experimental microarray images (expression data)
• Clustering information
• Metabolic pathways, protein-protein interaction data

Literature Databases
PubMed: http://www.ncbi.nlm.nih.gov/PubMed
Service of the National Library of Medicine
• MEDLINE publication database
  – Over 17,000 journals
  – 15 million citations since 1950

Putting it All Together
• Each database contains specific information
• Like other biological systems, these databases are also interrelated

[Diagram of interrelated databases, grouped by category]
Categories: PROTEIN, DISEASE, ASSEMBLED GENOMES, MOTIFS, GENOMIC DATA, ESTs, GENES, SNPs, GENE EXPRESSION, STRUCTURE, PATHWAY, LITERATURE
Databases shown: PIR, LocusLink, SWISS-PROT, OMIM, GoldenPath, OMIA, WormBase, TIGR, BLOCKS, Pfam, Prosite, GenBank, dbEST, DDBJ, EMBL, RefSeq, unigene, AllGenes, dbSNP, PDB, MMDB, SCOP, Stanford MGDB, KEGG, NetAffx, COG, ArrayExpress, GDB, PubMed

Entrez – NCBI Engine
• Entrez is the integrated, text-based search and retrieval system used at NCBI for the major databases, including PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, Taxonomy, and others.
http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi?itool=toolbar

General Bioinformatic Webpages
– USA National Center for Biotechnology Information: www.ncbi.nlm.nih.gov
– European Bioinformatics Institute: www.ebi.ac.uk
– ExPASy Molecular Biology Server: www.expasy.org
– Israeli National Node: inn.org.il
http://www.agr.kuleuven.ac.be/vakken/i287/bioinformatica.htm
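
The sickle cell slides earlier compare a healthy and a diseased HBB sequence. A minimal sketch of that comparison follows, assuming two equal-length sequences that start in frame. The helper names `point_differences` and `codon_changes` are hypothetical, the codon table is cut down to the two codons involved, and the two sequences are only the first 30 bases of the HBB coding region (from the ATG start codon), with the A-to-T substitution written into the sickle variant.

```python
# Only the two codons needed for this example (standard genetic code).
CODON_TO_AA = {"GAG": "Glu", "GTG": "Val"}


def point_differences(ref, alt):
    """Return (index, ref_base, alt_base) for every mismatching position."""
    if len(ref) != len(alt):
        raise ValueError("sequences must have equal length")
    return [(i, r, a) for i, (r, a) in enumerate(zip(ref, alt)) if r != a]


def codon_changes(ref, alt):
    """Return (index, ref_codon, alt_codon) per mismatch; ref must start in frame."""
    changes = []
    for pos, _, _ in point_differences(ref, alt):
        start = pos - pos % 3  # first base of the codon containing pos
        changes.append((pos, ref[start:start + 3], alt[start:start + 3]))
    return changes


# First 30 bases of the HBB coding sequence, starting at ATG (illustrative excerpt).
HEALTHY = "ATGGTGCATCTGACTCCTGAGGAGAAGTCT"
SICKLE = "ATGGTGCATCTGACTCCTGTGGAGAAGTCT"

for pos, ref_codon, alt_codon in codon_changes(HEALTHY, SICKLE):
    print(pos, ref_codon, CODON_TO_AA.get(ref_codon, "?"),
          "->", alt_codon, CODON_TO_AA.get(alt_codon, "?"))
# prints: 19 GAG Glu -> GTG Val
```

A position-by-position comparison like this only works for substitutions; insertions and deletions shift the sequences against each other and need an alignment step first.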