Computational biology, bioinformatics, and high performance computing
Craig A. Stewart
stewart@iu.edu
Indiana University
SC2003 Tutorial, 16 November 2003, S14

License terms
• Please cite as: Stewart, C.A. 2003. Computational Biology. Tutorial presented at SC2003, 15-21 Nov, Phoenix, AZ. http://hdl.handle.net/2022/14000
• Some figures shown here are taken from the web, under an interpretation of fair use that seemed reasonable at the time and within reasonable readings of copyright interpretations. Such diagrams are indicated here with a source url. In several cases these web sites are no longer available, so the diagrams are included here for historical value.
Except where otherwise noted, by inclusion of a source url or some other note, the contents of this presentation are © by the Trustees of Indiana University. This content is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share – to copy, distribute and transmit the work and to remix – to adapt the work under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.

Table of Contents
• Class Plan and Objectives
• A rapid introduction to key elements of biology
• Bioinformatics data sources
• Similarity matching
• Phylogenetics
• RNA and Protein Structure
• Systems Biology
• Grand challenge problems
• Acknowledgements & references

Note: Slides with the Indiana University wordmark in the bottom left corner were generated at Indiana University, with images sometimes from other sources. In such cases the url for the source of the image is indicated on the slide.
Slides with a plain white background have been graciously provided by someone outside IU, and sources are attributed on such slides.

Class Plan & Objectives
• Class Plan & Strategy
  – Materials focus on open source software (generally not the presenter's own work)
  – One critical application will be covered in great depth, and several others will be reviewed
• Objectives. At the end of the class, participants should:
  – understand enough biology to understand key computational biology problems
  – be conversant with current key applications, and current problems facing bioinformatics and computational biology
  – be familiar with some strategies for collaborating with biologists and biomedical scientists

Motivation
• The “-omics” trend
• Finding press pieces about huge computing problems is easy
• How many bio codes really scale to hundreds of processors?
• What are the coming high performance needs of biologists?
• Importance of computational biology and bioinformatics to the HPC community
• The challenges and promise are real
• Successes and failures so far
  – Successes: Protein structure, Genome assembly, Surgical assistance, Phylogenetics
  – Mismatched priorities: Ab initio protein folding
  – Not yet successful: Drug discovery

What has changed recently?
• Bioinformatics is not new
  – Protein structure
  – Phylogenetics
• What is new is high-throughput sequencing:
  – Lots more data
  – The possibility of going from a knowledge of the DNA sequence to an understanding of diseases and health
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

Genome Projects Timeline
• 1978 First virus (SV40) sequenced (5224 base pairs)
• 1986 DOE announces Human Genome Initiative
• 1994 First complete map of all human chromosomes
• 1995 First living organism sequenced (H. influenzae) - 2 Mb
• 1996 Yeast (S. cerevisiae) - 12 Mb
• 1997 Intestinal bacterium (E. coli) - 5 Mb
• 1998 Nematode worm (C. elegans) - 100 Mb
• 1998 Celera announcement; Public effort regroups
• 1999 Human Chromosome 22 – 34 Mb
• 2000 Joint announcement by NHGRI – Celera
• 2003 “As good as it gets” human genome
This slide based on slide by Manfred D. Zorn

Definitions
• Computational Biology: any use of advanced information technology in the study of biological problems.
• “Bioinformatics applies the principles of information sciences and technologies to make the vast, diverse and complex life sciences data more understandable and useful” (NIH BISTIC Committee grants1.nih.gov/grants/bistic/CompuBioDef.pdf)
• Genomics – study of genomes and gene function
• Proteomics – study of proteins and protein function
• ___omics –

Challenges
• Different types of biological data at different scales
• Data of varying quality
• Much of the underlying biology is not well understood
• Prior to the availability of high-throughput sequencing, scientists could only study small pieces of the genetic information of any organism.
• Now the entire genomes of several organisms have been completed, but knowing the genome is different from knowing how it works!

Comparison of Complexity
• Physics & Chemistry
  – 2 elementary particles
  – 4 forces
  – 112 elements
  – When random events occur it is often possible to study average behavior
  – Typically ahistoric (astrophysics an exception)
• Biology
  – 3B base pairs in humans
  – Min. 30,000 genes in humans
  – ~1.5M species
  – Individual random events important; no law of large numbers
  – Intensely historic, heavily contingent

Complexity, Cont'd
• Chip design
  – All components known
  – Device physics for individual components known
  – Itanium has 3 x 10^8 connections and 2 x 10^8 devices
  – Unified basic currency (electrons)
  – Computer program required to understand
• Cells
  – Components not known
  – Function of individual components not known
  – # components ~10^13
  – No unified basic currency
  – Ecell, Karyote, etc. attempting to model cells

A rapid introduction to key elements of biology

Why is it important to know some biology?
Anopheles gambiae. From www.sciencemag.org/feature/data/mosquito/mtm/index.html Source Library: Centers for Disease Control. Photo Credit: Jim Gathany
• Would you study numerical methods without knowing some mathematics?
• Much current biological knowledge is very specific to particular organisms, genes, or diseases
• If you just wade into the available data online you can do some very silly things.

Central dogma of biology
• The central dogma of biology is that genes act to create phenotypes through a flow of information from DNA to RNA to proteins, to interactions among proteins (regulatory circuits and metabolic pathways), and ultimately to phenotypes. Collections of individual phenotypes constitute a population. (The central dogma was first put forward by Crick in 1958.)
http://www.ncbi.nlm.nih.gov/About/primer/genetics_cell.html

Cell Structure
http://www.ncbi.nlm.nih.gov/About/primer/genetics_cell.html
Eukaryotes
• Chromosomes linear
• Introns, exons, postprocessing
• Nucleus & nuclear wall
• Mitochondria and (in plants) Chloroplasts
Prokaryotes
• Chromosome circular
• Location is everything
• No nucleus
• No plastids

Four (or Five) Bases
• DNA consists of four nucleotides: Cytosine, Thymine, Adenine, and Guanine.
• In the double helix, A is always bound to T, and C is always bound to G
• RNA consists of four nucleotides as well: Cytosine, Uracil, Adenine, and Guanine
• RNA may loop back on itself but it does not form a double helix
http://www.ornl.gov/TechResources/Human_Genome/graphics/slides/images/structur.gif
http://www.ornl.gov/TechResources/Human_Genome/graphics/slides/images/98-647.jpg

Genetic Code
Ala Alanine          Leu Leucine
Arg Arginine         Lys Lysine
Asn Asparagine       Met Methionine
Asp Aspartic acid    Phe Phenylalanine
Cys Cysteine         Pro Proline
Glu Glutamic acid    Ser Serine
Gln Glutamine        Thr Threonine
Gly Glycine          Trp Tryptophan
His Histidine        Tyr Tyrosine
Ile Isoleucine       Val Valine
http://www.ncbi.nlm.nih.gov/Class/MLACourse/Original8Hour/Genetics/geneticcode.html

Transcribing DNA to RNA and Translating RNA to Proteins
DNA: AAAAAGGAGCAAATT
RNA: UUUUUCCUCGUUUAA
One possible amino acid string: Phe Asn Asp Ala

Human Chromosomes
http://www.ncbi.nlm.nih.gov/Class/MLACourse/Original8Hour/Genetics/cytogenetic.html
http://www.ornl.gov/TechResources/Human_Genome/graphics/slides/elsikaryotype.html

Sickle Cell
Normal RBC
• GAG codes for Glutamic acid
• disc-shaped, soft
• easily flows through small blood vessels
• lives for 120 days
Sickle RBC
• GTG codes for Valine
• sickle-shaped, hard
• often gets stuck in small blood vessels
• lives for 20 days or less
Malaria vs. Anemia!
http://www.nlm.nih.gov/medlineplus/ency/imagepages/1223.htm

What is a Gene?
• An inheritable trait associated with a region of DNA that codes for a polypeptide chain or specifies an RNA molecule which in turn has an influence on some characteristic phenotype of the organism.
  – Early views: genes lined up on the chromosome like beads on a string; one gene => one protein
  – Examples of genes: color blindness, sickle-cell anemia
  – Mendelian genes, Sex-linked genes, Quantitative traits
• Annotation: Extraction, definition, and interpretation of features on the genome sequence
• Annotations vs. genes:
  – Many annotations describe features that constitute a gene.
  – Others may not always directly correspond in this way
  – An annotation is what we think… nature may disagree!
• Inheritance problem with annotations

Gene Components
• Prokaryotes
  – Location is everything
  – Essentially all of the DNA is transcribed (few mitochondrial diseases)
• Eukaryotes
  – Non-contiguous pieces of DNA may comprise one gene
  – Start sequence (complicated and long)
  – Stop codons – end translation
  – Exons – portions of sequence that are transcribed and used
  – Introns – portions of sequence that are not used
• Genes and Chromosomes
  – In eukaryotes, an organism has two of each chromosome (in pairs).
  – Among sexually reproducing organisms, one chromosome of each pair comes from each parent
  – In “simple Mendelian genes” there are two alleles for each gene – one on each chromosome (e.g. wrinkly)

Alternate splicing
http://www.blc.arizona.edu/marty/411/Modules/altsplice.html

A (very) little about evolutionary genetics
Hardy-Weinberg Law
Parents: Ww x Ww
Offspring: WW, Ww, Ww, ww
Based on this, can you explain why the gene for Sickle Cell Anemia persists in populations of people in Africa?

Population genetics & evolution
• Mutations create the raw material for evolution
• Natural selection and chance affect the frequency with which particular genes or DNA sequences are present in populations
• Given enough time and enough change, evolution, speciation, and so forth happen
• Genes can be ‘fixed’ or ‘maintained in an equilibrium’ in a population by chance or through natural selection
http://faculty.wm.edu/bsgran/

How do sequences differ?
• Differences in individual bases
    CGTACCGTTAATAT
    CGTACCGATAATAT
• Bases may be added to a sequence
    CGTACCCCGTAATAT
    CGTACC..GTAATAT
• Bases may be deleted from a sequence
    CGTACCGTTAATAT
    CGTACCG...ATAT

Random genetic change
• “things happen”
• Molecular clock
  – Theory: ~2% change per million years (2 x 10^-9 substitutions per base location per year)
  – Practice: a rule of thumb is different from something like Newton’s 2nd law of motion
• Random change may often be responsible for speciation
  – e.g. two populations of birds, separated by a geographic barrier, may at random eventually develop into two different species

Key points (so far)
• Biological processes are complicated; the historicity and complexity of biological processes and our lack of understanding of many matters make biology an interesting topic!
• The central dogma of molecular biology is that genes act to create phenotypes through a flow of information from DNA to RNA to proteins, to interactions among proteins (regulatory circuits and metabolic pathways), and ultimately to phenotypes. Collections of individual phenotypes constitute a population.
• DNA consists of four bases (ATCG). A is always paired with T; C is always paired with G.
• DNA is transcribed into RNA. RNA consists of four bases as well (AUCG).
• The linear structure of DNA is transcribed into RNA and then translated into proteins. Proteins have their 3D configuration as the basis for their structure.

DNA sequencing
Send in the clones!
• DNA chopped into blocks
• Blocks inserted into bacterial cells using viruses
• The bacterial clones make lots of copies of DNA so that you have something to work with
• The sequence of each chunk of genetic material is determined using gel electrophoresis

Sanger
• Cut DNA at various places (at T, G, C, A)
• Add a radioactive molecule at the end of the DNA chain
• Find out how long the chain is by gel electrophoresis
• Read off the sequence
www.ornl.gov/TechResources/Human_Genome/publicat/primer/

Dye-terminator Sequencing
www.ornl.gov/TechResources/Human_Genome/graphics/slides/images/standardRGB200.jpg

Sequence assembly
• Phred – base calling
• Phrap – shotgun sequence assembly
• Consed – finishing
• http://www.phrap.org/
• High quality software

Bioinformatics data sources

Bioinformatics Data Sources
• There are many
• Characteristics vary
• There are many ways to organize a view of the biological data
• A pragmatic approach:
  – Biomedical literature sources
  – Structured vocabularies
  – DNA, RNA, Protein etc. data sources

Biomedical literature
• Abstracts of biomedical literature are largely available online
• Text processing itself is an interesting problem
• U.S. National Library of Medicine – NLM Medline http://www.nlm.nih.gov/
• ~12 million references on life sciences/biomedicine.
• Covers 1966 to present.
• Citations from over 4,600 journals; most published in English

PubMed
• Standard search tool for Medline
• http://www.ncbi.nlm.nih.gov/entrez/
• Useful limit terms:
  – Gender
  – Age Groups
  – Human or Animal
  – Publication Date
• You can save queries

Structured Languages
• NLP or write with agreed-upon terms?
• Three important structured languages:
  – MeSH
  – GO (Gene Ontology)
  – LOINC

MeSH
• Medical Subject Headings
• http://www.nlm.nih.gov/mesh/MBrowser.html
• ~17,000 Thesaurus Terms
• Typically 10-15 used per article in Medline; 3-4 as major points (indicated with * in PubMed)
• When done right…. the terms used are the most specific possible
• There are both advantages and disadvantages!

GO (Gene Ontology)
• http://www.geneontology.org/
• “The goal of the Gene Ontology(TM) Consortium is to produce a dynamic controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing.”
• Based on XML file format
• Several browsers (AmiGO, QuickGO, MGO)
• Directed Acyclic Graph (child may have multiple parents)
  – ISA (is a) %
  – Part of <
• Three ontologies
  – Molecular function
  – Biological processes
  – Cellular components

Genomic, Proteomic, etc. data sources
• A tremendous amount of data is available through public data sources via the Web, ftp, or by other means.
• To analyze biological data, we first have to get it….
• Several ways to organize presentation of material – by site, by type, etc. We will organize by data type.
• Types of Databases:
  – Chromosomal (http://www.ncbi.nlm.nih.gov/mapview)
  – DNA/Genes
  – Protein
  – Biochemistry and metabolic pathways
  – Microarray
  – Web collections

Types of genomic data
• Genomic DNA: DNA sequences, typically complete with coding and noncoding sequences
• GSS: Genome survey sequence. Single pass sequence read directly from robot.
• mRNA: an RNA sequence from an mRNA molecule. May or may not cover all of a particular gene
• cDNA: complementary DNA – a DNA sequence generated by conversion of an mRNA sequence
• EST: Expressed Sequence Tag – short cDNA sequences from studies of cells under particular circumstances. Typically incomplete.
• SNP: Single Nucleotide Polymorphism

DNA databases
• GenBank. Operated by NCBI (National Center for Biotechnology Information). http://www.ncbi.nlm.nih.gov
• European Molecular Biology Laboratory – Nucleotide Sequence Database. http://www.ebi.ac.uk/genomes/
• DNA Database of Japan (DDBJ). http://www.ddbj.nig.ac.jp
• All share data daily. Update conflicts avoided by policy.
• Differences are in “value added” and interfaces
http://www.ncbi.nlm.nih.gov

Data Structures
• Current
  – Primary DNA repository data based on ASN.1. Makes possible linkages among many types of biomedical info.
  – The software libraries now often handle XML as well
  – Software toolkits and docs available at http://www.ncbi.nlm.nih.gov/IEB/
• GenBank Flat File format
  – http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
• FASTA
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC

Primary vs. Secondary Data sources
• Primary data sources:
  – Genetic sequences in NCBI, EMBL, DDBJ
  – Protein sequences in PDB
• Secondary data sources:
  – Inferred protein sequences (what do we know already about issues here?)
  – Curated data sources

Protein Structure
• NCBI (of course…)
• Swiss-Prot/TrEMBL at http://www.expasy.org/
  – Note: 125,744 chemically determined vs 861,482 inferred from automated translation of DNA sequences!
• Protein Data Bank – PDB http://www.rcsb.org/pdb/ - one of the first online bioinformatics databases!

Biochemistry and pathways
• Biochemistry
  – ENZYME (part of the ExPASy system)
  – BIND (part of the NCBI system)
• Pathways
  – PathDB http://www.ncgr.org/software/version_2_0.html
  – KEGG WIT http://wit.mcs.anl.gov/WIT2/

Web Resources - General
• NCBI http://www.ncbi.nlm.nih.gov/
• EBI Biocatalog http://www.ebi.ac.uk/biocat/
• IUBio Archive http://iubio.bio.indiana.edu
http://www.ncbi.nlm.nih.gov/

Similarity matching

Why pattern matching (and what are the problems)
and… US!
Bonobo http://www.sandiegozoo.org/special/zoo-featured/pygmy_chimps.html

Problems!
• For proteins, 95% similarity is ~ identical, 80% similarity is a lot. Even less similarity than that needed for DNA
• Database techniques inadequate – they are too precise!
• Datasets very large to search
• Homology
  – Common ancestry
  – Sequence (and usually structure) conservation
  – Homology is inferred rather than measured
• Identity
  – Objective and well defined
  – Can be quantified easily, but not very useful!
• Similarity
  – Most common method used, but not as easily defined

Alignment
• An alignment is an arrangement of two sequences opposite one another
• It shows where they are different and where they are similar
• We want to find the optimal alignment - the most similarity and the fewest differences
• Alignments have two aspects:
  – Quantity: To what degree are the sequences similar (percentage, other scoring method)
  – Quality: Regions of similarity in a given sequence

Alignment
• Methods:
  – dynamic programming
  – Hidden Markov Models
  – Pattern matching
• Key problem: keeping the calculation time manageable
• Some alignment packages:
  – BLAST (http://www.ncbi.nlm.nih.gov/BLAST/)
  – FASTA (http://gcg.nhri.org.tw/fasta.html)

Scoring Alignments
    GCTAAATTC
    ++ x x
    GC AAGTT
• Matches are good: they get a positive value
• Mismatches are bad: they get a negative value
• Gaps are bad: they get a negative value
  – Gap opening penalty
  – Gap extension penalty
  – Score = Matches - Mismatches - ∑ over gaps {gap opening penalty + (gap length) * (gap extension penalty)}
    CGTACCGTTAATAT        CGTACCGTTAATAT
    CGTACCG...ATAT        CGT.C.GTT.ATAT

Now what?
• Taking a sequence and simply comparing it against all existing sequences in a database in all possible ways approaches O(N!) if you do it badly enough. Plus it would be silly.
• So: many algorithms possible
• Algorithms are in some ways the same, and in some ways different, between DNA and proteins.
• We’ll start with DNA, and not do things in historical order

Dotter
• Simple way to get a feel for how sequences compare to each other.
• Used both with DNA and Protein sequences
• http://www.cgr.ki.se/cgr/groups/sonnhammer/Dotter.html/
• "A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis" Erik L.L. Sonnhammer and Richard Durbin. Gene 167(2):GC1-10 (1995)
• And now (hopefully) a live demo
• Modular nature of proteins

Local Alignments with BLAST
• Basic Local Alignment Search Tool
• We’ll spend a LOT of time with BLAST
• First a quick demo (hopefully)
• http://www.ncbi.nlm.nih.gov/BLAST
• So, what did we do?
  – BLAST – Basic Local Alignment Search Tool
  – In particular, BLASTn (for nucleotides)
  – Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. Journal of Molecular Biology 215:403-410

(Original) BLAST Algorithm
• Original algorithm does not permit gaps
• The original BLAST algorithm is a local (heuristic) alignment tool
• Given a search sequence, e.g. ACGTAGGCATGAA
• BLAST first makes a list of all “words” of a given length that would possibly have a score of at least T against the search string.
• In the case of this example there would be (at least) the following:
  – ACGTAGGCATG
  – CGTAGGCATGA
  – GTAGGCATGAA

(Original) BLAST Algorithm, 2
• BLAST takes the list of all words with a score of at least T against the string one is trying to match, and then searches a database for any matches to these words.
  So if one were using the example and the NR database, BLAST would search NR for all occurrences of the words:
  – ACGTAGGCATG
  – CGTAGGCATGA
  – GTAGGCATGAA
• Suppose BLAST finds in the NR database an exact match to
  – ACGTAGGCATG
• BLAST then attempts to extend the match in both directions
  – ACGTAGGCATGA
  – ACGTAGGCATGA
• So now we have an exact match of 12 letters

(Original) BLAST Algorithm, 3
• So BLAST keeps going, and in this case would stop at an exact match of 13 letters (if one existed), since 13 letters was the entire initial search string:
  – ACGTAGGCATGAA
  – ACGTAGGCATGAA
• BLAST has a stopping algorithm for dropping particular search directions, or stopping altogether

Scoring of DNA
     A   C   G   T   R   Y   M   W   S   K   D   H   V   B   N
A    4  -3  -3  -3   1  -1   1   1  -2  -2   1   1   1  -2   1
C        4  -3  -3  -1   1   1  -2   1  -2  -2   1   1   1   1
G            4  -3   1  -1  -2  -2   1   1   1  -2   1   1   1
T                4  -1   1  -2   1  -2   1   1   1  -2   1   1
R                    1  -3   0   0   0   0   1   0   1   0   0
Y                        1   0   0   0   0   0   1   0   1   0
M                            1   0   0   0   0   1   1   0   0
W                                1   0   0   1   1   0   0   0
S                                    1   0   0   0   1   1   0
K                                        1   1   0   0   1   0
D                                            1   0   0   0   0
H                                                1   0   0   0
V                                                    1   0   0
B                                                        1   0
N                                                            1

BLAST algorithm in more detail
• The BLAST algorithm searches for MSPs – Maximal Segment Pairs – pairs whose score cannot be improved either by lengthening or by shortening them.
• “Pairs” here refers to a string – or a substring – of the initial string used as the search string – and one or more strings or substrings found in a database.
• The search starts with the creation of all possible subwords of a given length (default typically 11 for DNA sequences, 3 amino acids for protein sequences) that would score at least T when matched against the original search string. (T is short for Threshold)
• BLAST then goes through the database being searched against looking for any occurrence of each of these words that has a score of at least T. This is a “hit” – or a “High-scoring Segment Pair (HSP)”
• The search then continues by trying to extend these HSPs. Suppose “S” is the best score found for a word of length k. BLAST stops trying to extend words when the score drops a certain amount below the best value S in the previous round.
• BLAST continues on and on until it is no longer possible to improve the score of HSPs by making them longer. Then it generates a list of the best HSPs.
• Default is a cutoff E-value of 10
• BLAST (original) has an infinite gap penalty

BLAST Statistics
• BLAST reports E values rather than P values, but it turns out that when E < 0.01, E~P
• What do we do about the fact that we have done many tests?
• If the sequence is length n, and the total length of the database being searched is N, then a reasonable approach is to multiply E by N/n
• Edge effects – statistics tend to be conservative for short sequences
• Problems:
  – Highly repetitive segments
  – Low complexity regions
  – Bias in composition
• Solution: low complexity regions can be excluded

BLAST Options
• Set subsequence (of the submitted sequence)
• Choose Database (NB: nr ≠ non redundant!)
• Limit by Entrez query or select an organism
• Choose Filter
• Expect Value
• Word size (default = 11 for nucleotides)

Protein Sequence Alignment
• What most people do most of the time
• DNA sequences are useful for relationships that are close, but DNA sequences are not nearly as well conserved as amino acid sequences
• Now we need to talk about the characteristics of amino acids and ways to compare what is similar and what is not!
• Amino acids can have similar chemical properties, and similar functions as part of a protein, without being identical!

Point Accepted Mutations (PAM)
• For scoring amino acid sequence alignments
• Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C. 1978. "A model of evolutionary change in proteins." In Atlas of Protein Sequence and Structure 5(3) M.O. Dayhoff (ed.), 345 - 352, National Biomedical Research Foundation, Washington.
• PAM N corresponds to N accepted point mutations per 100 amino acids. N can be greater than 100.
• PAM 250 is most commonly used; PAM 100 is also used. PAM 250 => chains with ~20% identity
• PAM matrix calculator at www.cmbi.kun.nl/bioinf/tools/pam.shtml
http://www.psc.edu/biomed/training/tutorials/sequence/db/index.html

BLOSUM Matrices
• Henikoff and Henikoff (1992) Proc Natl Acad Sci 89(22):10915-9
• Based on analysis of the BLOCKS database (http://www.blocks.fhcrc.org/)
• BLOSUM = BLOcks SUbstitution Matrix
• Based on analysis of conserved and variable regions of proteins
• Naming convention is different from that for PAM matrices.
• BLOSUMxy is based on likelihood ratios for two chains of amino acids that are xy% identical
• BLOSUM62 is the ‘typical default’
• PAM250 is roughly equivalent to BLOSUM45

PSI BLAST
• Position Specific Iterative BLAST
• http://nar.oupjournals.org/cgi/content/full/25/17/3389
• Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25(17):3389-402
• Requires two non-overlapping similarities with the search term to occur within a certain distance (A) on the genome
• Permits gaps in the alignments
• Can be iterated to allow for user-specified scoring matrices
• By default, uses the BLOSUM-62 matrix

PSI BLAST
• In the original BLAST, the step of extending the length of the ‘hits’ took ~90% of execution time.
• The initial threshold value T must be lower than with the original BLAST, but far fewer hits are pursued, meaning that the extension time is lower
Two hits, T=11 A=40 vs One hit, T=13
http://nar.oupjournals.org/content/vol25/issue17/images/gka56202.gif
http://nar.oupjournals.org/content/vol25/issue17/images/gka56201.gif

Gaps in PSI-BLAST
• PSI BLAST seeks alignments with single gaps
• Gaps are sought only when a two-hit score exceeds the value Sg
• Gaps: handled by using a different gap cost function: -(a+bk+cj)
  – a is the cost for opening a gap
  – b is the per unit cost for the length of the gap
  – k is the length of the gap
  – c is the cost per unaligned sequence in the gap
  – j is the number of sequences left unaligned

Discontiguous MegaBLAST
• Useful especially for identifying diverged DNA sequences
• Uses templates; within the template only those items with “1”s are compared.
• E.g. 1101101101101101

How many BLASTs?
http://www.ncbi.nlm.nih.gov/BLAST/producttable.html

mpiBLAST
http://mpiblast.lanl.gov/

mpiBLAST Algorithm
• Darling, A.E., L. Carey, W.-C. Feng. 2003. The design, implementation, and evaluation of mpiBLAST. Presented at ClusterWorld2003. http://www.cs.wisc.edu/%7Edarling/mpiblast-cwce2003.pdf
• Algorithm
  – Database is segmented. Portions of the database are placed on data storage devices on multiple nodes in an HPC system. mpiformatdb is a wrapper for the BLAST formatdb program. Number of subdivisions is specified by the user
  – Foreman/worker algorithm. Portions of the database are assigned to workers, using a greedy algorithm

mpiBLAST performance
• Scaling can be superlinear when pieces are small enough that they fit into memory
• Scalability limitations due to communication, implicit barrier before assembly of results
• If pieces of data distributed out to workers are larger than available RAM, then scaling is still good but not superlinear
• BLAST is the most heavily used bioinformatics tool in existence.
Parallelization of BLAST has huge payoff for practicing biologists 76 Motivation: BLAST with Low Memory • Standard BLAST running on a system with 128 MB of memory. Slide courtesy of Wu-chun Feng feng@lanl.gov Los Alamos National Laboratory 77 mpiBLAST: Low-Memory Performance • Environment – 1, 2, or 4 nodes. – Each node w/ dual 550-MHz CPUs and 128-MB memory. – Same query and database used. • Conclusions – blastn is I/O bound. Superlinear speed-up possible. – tblastx is CPU bound. Slide courtesy of Wu-chun Feng feng@lanl.gov Los Alamos National Laboratory mpiBLAST on Green Destiny BLAST Run Time for 300-kB Query against nt Nodes Runtime (s) Speedup over 1 node 1 80774.93 1.00 4 8751.97 9.23 8 4547.83 17.76 16 2436.60 33.15 32 1349.92 59.84 64 850.75 94.95 128 473.79 170.49 The Bottom Line: mpiBLAST reduces search time from 1346 minutes (or 22.4 hours) to under 8 minutes! Slide courtesy of Wu-chun Feng feng@lanl.gov Los Alamos National Laboratory 78 Global Alignments: Needleman-Wunsch Algorithm • Start at the beginning, end t the end • Needleman, S.B., and C.D. Wunsch. 1970. A general method applicable to the search for similarities in the amino acid sequences of two proteins. J. Mol. Bio. 48: 443-453. • “The amino acid sequences of a number of proteins have been compared to determine whether the relationships existing between them could have occurred by chance. 
Generally, these sequences are from proteins having closely related functions and are so similar that simple visual comparisons can reveal sequence coincidence….” 79 80 Needleman-Wunsch • • • • Amino acid sequences are lined up as column and row headers for a matrix Ai is the ith amino acid in protein A Bj is the jth amino acid in protein B Start with a matrix where the matches between the Ai s and Bj s are 1 of there is a match, 0 otherwise • The optimal alignment can be represented as a path through the matrix • If MATmn is part of a pathway including MATij, the only permissible relationships are m> i and n>j, or m<I and n<j • The optimal pathway is found by filling out the matrix from the bottom right corner towards the upper left, where in each cell you insert the maximum score arising from an alignment that includes this cell in the matrix 81 Needleman-Wunsch and Smith-Watermann • Shortcomings of Needleman-Wunsch? • Can you think of biological situations in which you might want to use Needleman-Wunsch? • Smith-Waterman: similar to Needleman-Wunsch, except – Requires a penalty for gaps – Will do partial alignments (e.g. has stopping point) • Computational requirements – Original Needleman-Wunsch and Smith Waterman both require O(N*M) time and O(N*M) memory – There are improvements of Smith-Waterman that require O(N*M) time and O(N) space 82 ALIGN • • • • • Simple protein alignment tool Included in FASTA distributions 2.x, but not 3.x Still, it’s a nice learning tool Can be downloaded for Linux or for Windows Can also be run from web at http://fasta.bioch.virginia.edu/fasta/align.htm • Can also be run from web at http://us.expasy.org/tools 83 Protein Alignment with the FASTA family • FASTA is one of the earliest protein alignment tools, and still actively maintained • Pronounced FAST and then a long A • A local alignment, heuristic tool • Can be downloaded from http://www.people.virginia.edu/~wrp/pearson.html • FASTA family maintained by Prof. William R. 
Pearson • Can also be run from Web 84 FASTA Algorithm • Ktup = word length (2 default; 1 sometimes used) • FASTA searches for words of length ktup matching between sequences • FASTA searches for ungapped regions of a particular length that have the highest number of identical ktups • FASTA scores the 10 ungapped alignments that have the highest number of identical ktups, scoring with a scoring matrix (default is BLOSUM50) • FASTA then tests for the ability to merge the ungapped alignments into a single alignment without dropping the overall score too much • FASTA uses the Smith-Waterman algorithm within the local alignment regions! 85 Multiple Alignment - Clustal-W • Why do we need to align many different sequences at once? – Look for highly conserved regions – Gene searching (of mice and men) • http://www.ebi.ac.uk/clustalw/ • Thompson et al. 1994. Nucleic Acids Res. 22: 4673-4680 • Heuristic & Progressive – Begin with 2 sequences – Add others one-by-one • Uses profile alignment – Align sequence with group of aligned sequences – Align groups of aligned sequences – Misalignments in conserved regions penalized heavily 86 Example output FOS_RAT FOS_MOUSE FOS_CHICK FOSB_MOUSE FOSB_HUMAN MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNT MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNT MMYQGFAGEYEAPSSRCSSASPAGDSLTYYPSPADSFSSMGSPVNS -MFQAFPGDYDS-GSRCSS-SPSAESQ--YLSSVDSFGSPPTAAAS -MFQAFPGDYDS-GSRCSS-SPSAESQ--YLSSVDSFGSPPTAAAS *:..* .:*:: .***** **:.:* * *..***.* :.. :*: FOS_RAT FOS_MOUSE FOS_CHICK FOSB_MOUSE FOSB_HUMAN IPTVTAISTSPDLQWLVQPTLVSSVAPSQ-------TRAPHPYGLP IPTVTAISTSPDLQWLVQPTLVSSVAPSQ-------TRAPHPYGLP VPTVTAISTSPDLQWLVQPTLISSVAPSQ-------NRG-HPYGVP VPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMP VPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMP :******:** **********:**:* **... ::. 
.**.:* :

Clustal-W Algorithm
• Construct matrix of distances
  – Alignment scores from all pairwise combinations
  – Alignments by dynamic programming method
  – Alignment scores transformed to evolutionary distances
  – Cluster distances into hierarchical tree (neighbor joining)
• Progressively align sequences using tree as a guide
  – Begin with closest pair
  – Work up tree in order of decreasing similarity
  – Use pairwise alignment for pairs
  – Use sequence-profile alignment to add sequences to clusters
  – Use profile-profile alignment to join clusters

CLUSTAL-W key features
• Sequences weighted to reduce representation bias associated with large subfamilies (usual sum-of-pairs score problem)
• Substitution matrix used for scoring depends on distance between sequences.
  – BLOSUM80 for near sequences
  – BLOSUM50 for distant sequences
• Gap penalties at hydrophobic residues heavier than those at hydrophilic residues
• Gap penalties also contingent upon exact residue identity at gap site
• Gaps corralled by increasing penalties at sites where gaps are rare when gaps are common nearby
• When building alignment, low-scoring additions rescheduled to be added later

ClustalW-MPI
• Li, K.-B. 2003. ClustalW-MPI: ClustalW analysis using distributed and parallel computing. Bioinformatics 19: 1585-1586
• Initial pairwise alignment process is parallelized and scales very well
• Multiple alignment process is parallelized and scales modestly
• Scaling tests published thus far up to 16 processors, reduces time from hours to minutes

HMMER
• http://hmmer.wustl.edu/
• Profile HMMs for protein sequence analysis
• A profile is a statistical model of patterns that are likely for multiple alignments, including variability at various positions and probabilities of various residues
• Useful when similarities are too faint to be picked up by BLAST
• Several profiles based on existing alignments exist
• Available as a parallel code using PVM
• Scales reasonably well as regards number of processors.
Does not scale as well as regards size of the biological problem

GeneIndex
• Location of initiators, promoters, etc. a key question in genomics
• First step in this is creating a dictionary of words of various lengths (many possible next steps)
• To be useful, analysis must be performed on entire genomes at once
• GeneIndex finds frequencies and positions of all words of a given length in a DNA sequence. Visualization with Tcl/Tk.

GeneIndex Parallelization
• Genome is broken up into n sections, where n = number of processors
• After each segment is analyzed, linked lists are joined

GeneIndex Scalability: Processing Time
[Figure: Drosophila; time (seconds) versus number of CPUs]

GeneIndex Scalability: Speedup
[Figure: Drosophila; speedup versus number of CPUs]

Phylogenetics

Building Phylogenetic Trees
• Goal: an objective means by which phylogenetic trees can be estimated in tolerable amounts of wallclock time, producing phylogenetic trees with measures of their uncertainty
• All evolutionary changes are described as bifurcating trees
  – genes or gene products
  – organisms

Phylogenetic trees from DNA sequences
• Changes in DNA modeled as Markov processes
• Sequences available:
  – DNA (sequences are series of the base molecules; aligned sequences will also contain +s for gaps)
  – Amino acid sequences (series of letters indicating the 20 amino acids). Computational challenges more severe than with DNA sequences.
  – RNA
• The availability of data at present exceeds the ability of researchers to analyze it!

Why is tree-building an HPC problem?
• The number of bifurcating unrooted trees for n taxa is $(2n-5)! / [(n-3)!\,2^{n-3}]$
• For 50 taxa the number of possible trees is ~$10^{74}$; most scientists are interested in much larger problems
• NP-hard problem
• The number of rooted trees is $(2n-3)! / [(n-2)!\,2^{n-2}]$

Phylogenetic software
• Phylip. (J. Felsenstein). Collection of software packages that cover most types of analysis.
One of the most popular software collections. Free.
• PAUP. (D. Swofford). Parsimony, distance, and ML methods. Also one of the most popular software collections. Not free, but not expensive.
• fastDNAml. (G. Olsen). Maximum likelihood method for DNA; becoming one of the more popular ML packages. MPI version available soon; well suited to tree searching in large data sets. Free.
• GRAPPA (Bader et al.): Breakpoint analysis program - scales well

Stochastic change of DNA
• Markov process, independent for each site: 4 x 4 matrix for DNA, 20 x 20 for amino acids
        A         C         G         T
  A     p(A->A)   p(A->C)   p(A->G)   …
  C     p(C->A)   p(C->C)   p(C->G)   …
  G     .
  T     .
• Transitions more probable than transversions.
• Must account for heterogeneity in substitution rates among sites (DNArates – Olsen)

fastDNAml
• Developed by Gary Olsen
• Derived from Felsenstein’s PHYLIP programs
• One of the more commonly used ML methods
• The first phylogenetic software implemented in a parallel program (at Argonne National Laboratory, using P4 libraries)
• Olsen, G.J., et al. 1994. fastDNAml: a tool for construction of phylogenetic trees of DNA sequences using maximum likelihood. Computer Applications in Biosciences 10: 41-48
• MPI version produced by Indiana University in collaboration with Gary Olsen available from http://www.indiana.edu/~rac/hpc/fastDNAml/

fastDNAml algorithm – adding taxa
• Optimize tree for 3 (randomly chosen) taxa; only one topology possible
• Randomly pick another taxon – (2i-5) trees possible
• Keep the best (maximum likelihood tree)

Basic fastDNAml algorithm - Branch rearrangement
• Move any subtree crossing n vertices (if n=1 there are 2i-6 possibilities)
• Keep best resulting tree
• Repeat this step until local swapping no longer improves likelihood value

fastDNAml algorithm con’t: Iterate
• Get sequence data for next taxon
• Add new taxa (2i-5)
• Keep best
• Local rearrangements (2i-6)
• Keep best
• Keep going….
• When all taxa have been added, perform a full tree check

Overview of parallel program flow
• Program modules
  – Master (generates trees, receives back from Foreman best tree at each step)
  – Foreman (dispatches trees to workers, determines best tree, tracks activity of workers)
  – Worker
  – Monitor (instrumentation)
• Parallel versions include fault tolerance features (useful in large clusters and grid computing)

Performance of fastDNAml
[Figure: speedup versus number of processors; series: perfect scaling, 50 taxa, 101 taxa, 150 taxa]

Why bother with parallel code?
• Why not just achieve speedup of n on n processors by running n independent jobs?
• Practical benefits of seeing results quickly
• Parallel program permits assault on much more complicated problems (e.g. protein sequences)

RNA & Protein Structure

RNA Structure – Vienna RNA
• http://www.tbi.univie.ac.at/~ivo/RNA/
• Package consists of several parts (from the web site):
  – RNAfold - predict minimum energy secondary structures and pair probabilities
  – RNAeval - evaluate energy of RNA secondary structures
  – RNAheat - calculate the specific heat (melting curve) of an RNA sequence
  – RNAinverse - inverse fold (design) sequences with predefined structure
  – RNAdistance - compare secondary structures
  – RNApdist - compare base pair probabilities
  – RNAsubopt - complete suboptimal folding

Types of Proteins
• Enzymes - biological catalysts. Most of the chemical reactions which occur in biological systems are catalyzed by enzymes.
• Storage. Various ions, small molecules and other metabolites are stored by complexing with proteins; for example haemoglobin carries oxygen.
• Transport. Proteins are involved in the transportation of particles ranging from electrons to macromolecules.
• Messengers. Proteins are involved in the transmission of nervous impulses. Hormones play a coordinating role.
• Antibodies.
Proteins which bind to specific foreign particles such as bacteria and viruses.
• Regulation. Enzymes synthesize proteins by translating sequences of DNA.
• Structural proteins. Mechanical proteins (e.g. collagen)

Proteins – a sparse vocabulary built up from amino acids
• Average time to fold based on random motion
• Actual folding – small fractions of a second
• Only a small subset of possible amino acid sequences actually code for a real protein
• Minimization of free energy – the key in real life and in analysis!

[Figures; source: http://bmbiris.bmb.uga.edu/wampler/tutorial/prot0.html]

Molecular viewing software options
• VRML – Cosmo Player http://www.karmanaut.com/cosmo/player/
• RASMOL - http://www.openrasmol.org/
• CHIME - http://www.mdl.com/chime/index.html
• Swiss Pdb Viewer - http://www.expasy.ch/spdbv/
• MICE - http://mice.sdsc.edu/
• Many tend to be touchy about browsers and plugins

Different ways to view molecules
• Wireframe
• Stick
• Ball and stick
• Space filled (Van der Waals radii)
• Some examples:
  – http://class.fst.ohio-state.edu/FST822/Images/helix.pdb
  – http://www.rcsb.org/pdb/
  – http://www.rcsb.org/pdb/cgi/explore.cgi?job=graphics;pdbId=1GFL;page=;pid=264201048789105&opt=vrml_default

Protein structure determination
• X-ray crystallography
• X-ray reflections form a pattern
• Model the known sequence of atoms fitting into a 3D structure so that the reflection pattern matches the observed pattern
• Spectroscopic analysis of molecule structure precise but still slow!
• ~127,863 entries in SwissProt
• ~857,950 entries in TrEMBL
http://crystal.uah.edu/~carter/protein/xray.htm

Protein structure prediction methods
• Knowledge-based methods
  – Based on information extracted from existing structures to estimate structure
• Physico-chemical methods
  – “Ab initio” protein structure prediction
• Feature detection methods:
  – Look for post-translational modification signals
    • Cleavage sites
    • Glycosylation sites
    • Phosphorylation
• Site for prediction server: http://www.cbs.dtu.dk/services/

Protein Structure Prediction
• Key requirement: prediction of molecule position within 1 angstrom
• Measuring quality of fit
  – Root mean square of atom distances: $\mathrm{RMSD} = \sqrt{\sum_i d_i^2 / N}$
  – Q3 = (true positives + true negatives)/total residues
• Better than 70% right is really good!

Secondary Structure Prediction
• Secondary – or local – structure prediction is the first step in classifying amino acid sequences
  – Alpha helix
  – Beta sheet
  – Coil
http://www.cryst.bbk.ac.uk/PPS95/course/3_geometry/rama.html
http://www.cryst.bbk.ac.uk/PPS95/course/3_geometry/helix1.html

Different approaches to tertiary structure prediction
• Do a sequence alignment to find a protein that is like the unknown sequence in whole or in part
• Threading
  – Thread a molecule on to a guide
  – Add sidechains
  – Optimize sidechains
• Piecewise reconstruction
  – Estimate the structure of smaller pieces
  – Then estimate how they fit together

SDSC Biology Workbench
• Probably one of the best overall sites in the US
• http://workbench.sdsc.edu
• Requires registration but this is relatively painless
• You do need to read the instructions first…

Ab initio methods - Amber
• http://amber.scripps.edu/#ff
• sander: Simulated annealing with NMR-derived energy restraints.
• gibbs: Free energy perturbation (FEP) and thermodynamic integration (TI), and also allows potential of mean force (PMF) calculations.
• roar: Allows mixed quantum-mechanical/molecular-mechanical (QM/MM) calculations, "true" Ewald simulations, and alternate molecular dynamics integrators.
• nmode: Normal mode analysis program using first and second derivative information, used to search for local minima, perform vibrational analysis, and search for transition states.
• (from http://amber.scripps.edu/#code)

Ab initio methods - GAMESS
• Schmidt, M.W., K.K. Baldridge, J.A. Boatz, S.T. Elbert, M.S. Gordon, J.H. Jensen, S. Koseki, N. Matsunaga, K.A. Nguyen, S. Su, T.L. Windus, M. Dupuis, J.A. Montgomery. 1993. General Atomic and Molecular Electronic Structure System. J. Comput. Chem. 14: 1347-63.
• NPACI/SDSC Web portal for GAMESS: https://gridport.npaci.edu/gamess/

Hybrid approaches: Rosetta
• Library of short sequence motifs that correlate strongly with protein local structural properties.
• Basic idea:
  – sequence-dependent local interactions bias segments of the chain
  – nonlocal interactions select the lowest free-energy tertiary structures from the many conformations compatible
  – Use protein database and take the distribution of local structures adopted by short sequence segments (fewer than 10 residues in length) in known three-dimensional structures
  – Put these structures together using non-local interactions
    • hydrophobic burial, electrostatics, main-chain hydrogen bonding and excluded volume.
• Free energy is then minimized to create candidate structures

Molecular Docking
• Key in drug searching
• AutoDock is a commonly used package
• http://www.scripps.edu/pub/olson-web/doc/autodock/
• “AutoDock is a suite of automated docking tools. It is designed to predict how small molecules, such as substrates or drug candidates, bind to a receptor of known 3D structure.” (from the web site)
• Nice visualization of an AutoDock docking simulation: http://wwwcmc.pharm.uu.nl/moret/dockings/home.html

Systems Biology

Systems Biology
• Special issue of Science: 295, Mar.
2002
• Special issue of Nature: 420, Nov. 2002
• Nobody’s quite sure what it is, but it sure is hot!
http://www.ornl.gov/TechResources/Human_Genome/graphics/slides/images/01-0052_web.gif

Historical approach to biological experiments
• From Lazebnik, Y. 2002. Cancer Cell 2:179
• Traditional biological experimentation much like the process of trying to fix a broken radio
• (Or, for those of us who have experienced either being or living with a 12-year-old boy, the process of breaking a functioning radio)
• Some typical steps:
  – Cataloguing components and their attributes
  – Perturbing the system
  – Knock-out experiments
  – Drawing diagrams
• Eventually may find a component that, when replaced, repairs the radio

Issues
• In a very complex system, knowing what all of the parts are, and knowing the function of individual pathways, may still not tell you how the systems work. It may simply be impossible to deduce this from first-order interactions
• Interactions, multiple changes
  – Power supply and other components (well-known PC repair example!)
  – Change everything all at once so that we’ll never know what worked!

Systems Biology
• Systems biology emphasizes close integration of experiment, theory and computational modeling
• Goal: understanding the structure and dynamics of biological systems, placing the parts in the context of the dynamic whole
  – Studies the complex interactions of many levels of biological information
  – Quantitative, predictive models are central
  – Computational modeling in particular is a key tool
• Why model
  – You are forced to really state what you are hypothesizing
  – Allows you to understand an *approximation* of reality in great detail
• Computational Cell Biology. 2002. Springer Verlag (Fall et al, eds).
• Foundations of systems biology. MIT Press, 2001. Kitano (ed)

Example - MCell
• MCell is: A General Monte Carlo Simulator of Cellular Microphysiology.
http://www.mcell.cnl.salk.edu/
• MCell focuses on simulations using a Brownian dynamics random walk algorithm.
• MCell's use to date has been focused on the microphysiology of synaptic transmission.
• Images and MCell-related material courtesy of Joel R. Stiles, Pittsburgh Supercomputing Center and Carnegie Mellon University, and Thomas M. Bartol, Computational Neurobiology Laboratory, The Salk Institute. http://www.mcell.cnl.salk.edu/

MCell Scalability
[Figure]

MCell
• Uses Model Description Language (MDL), designed with biologically-oriented users in mind.
• Embarrassingly parallel Monte Carlo application
• Supports checkpointing!

CompuCell
• CompuCell currently uses a combination of "extended Potts model" for cell sorting and clustering, and "Schnakenberg Reaction Diffusion" equations to establish the underlying chemical field to which cells respond and form typical patterns found in such biological systems as a growing chicken limb.
• http://www.nd.edu/~icsb/
Image courtesy of James Glazier, http://www.biocomplexity.indiana.edu/software.php

Karyote
• Information theory approach - construction of probability for parameters so that uncertainty in their estimation is assessed.
• The incompleteness of the model is addressed via a probability functional approach for computing the time-dependence of the concentration of key enzymes
• Small features such as ribosomes or viruses behave in ways that rely on their atomic scale structure but which take part in the overall (macroscopic) balance of metabolic reaction and transport. “Zones” may be treated in more detail via the solution of mesoscopic models using finite element methods.
• Can be run over web at http://biodynamics.indiana.edu/overview/

Issue: Getting Tools to Interoperate
• There is currently a proliferation of software, but no single package answers all needs
• No single tool is likely to do so in the near future
• But: problems with using multiple packages
• One effort to address this problem:
  – Systems Biology Workbench Project
• Purpose: develop software and standards to
  – Enable sharing of simulation & analysis software
  – Enable sharing of models
• Goal: make it easier to share than to reimplement

The Systems Biology Workbench Project
• http://www.sbw-sbml.org/
• Simple framework for application interaction.
• Cross-platform compatible & language-neutral
• Modules are separately compiled executables. A module defines services which have methods
• SBW native-language libraries provide APIs.
• SBW Broker acts as coordinator
[Diagram: SBW connecting modules: Script Interpreter, Visual Editor, Database Interface, Stochastic Simulator, ODE-based Simulator]

CellML
• http://www.cellml.org/public/about/what_is_cellml.html
• XML-based specification for interchange of cell model information
• Includes:
  – Information about model structure
  – Math, based on MathML
  – Metadata about the model
• Project of Bioengineering Institute of University of Auckland with support from Physiome Sciences Inc.
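
The module / service / method structure listed on the Systems Biology Workbench slide above can be pictured with a small sketch. This is not the SBW API: the `Broker` class, the service names and the `euler_decay` function are all hypothetical, and everything runs inside one Python process, whereas SBW modules are separately compiled executables coordinated by the SBW Broker.

```python
from typing import Callable, Dict, List


class Broker:
    """Toy coordinator (hypothetical, not SBW): modules register named
    services, each exposing named methods that clients can call."""

    def __init__(self) -> None:
        self._services: Dict[str, Dict[str, Callable]] = {}

    def register(self, service: str, methods: Dict[str, Callable]) -> None:
        # A module announces one service and the methods it offers.
        self._services[service] = dict(methods)

    def services(self) -> List[str]:
        # Clients discover what is available through the broker.
        return sorted(self._services)

    def call(self, service: str, method: str, *args):
        # Clients never import the module; they go through the broker.
        return self._services[service][method](*args)


def euler_decay(x0: float, k: float, dt: float, steps: int) -> float:
    """Stand-in for an ODE-based simulator: forward Euler on dx/dt = -k*x."""
    x = x0
    for _ in range(steps):
        x += dt * (-k * x)
    return x


broker = Broker()
broker.register("ode_simulator", {"run": euler_decay})
broker.register("model_store", {"describe": lambda: "first-order decay"})

final_value = broker.call("ode_simulator", "run", 1.0, 0.5, 0.1, 10)
print(broker.services(), broker.call("model_store", "describe"), final_value)
```

The point of the indirection is the one made on the interoperability slide: a visual editor or script interpreter only needs to know a service name and a method name, not how the simulator behind it is implemented.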
Systems biology URLs
• SBW & SBML: www.sbw-sbml.org
• NetBuilder: strc.herts.ac.uk/bio/Maria/NetBuilder
• CellML: www.cellml.org
• Jarnac + JDesigner: www.cds.caltech.edu/~hsauro
• Gepasi: www.gepasi.org
• Virtual Cell: www.nrcam.uchc.edu/
• E-CELL: www.e-cell.org
• JigCell: gnida.cs.vt.edu/~cellcyclepse/
• DARPA BioSPICE: www.biospice.org
• Karyote: http://biodynamics.indiana.edu/overview/

Grand challenge problems and some thoughts about the future

Modeling Heart Function
• Based on Noble, D. 2002. Modeling the heart – from genes to cells to the whole organ. Science 295: 1678-1682
• Two mutations known for sodium channels
  – DeltaKPQ – deletion of 3 amino acids (lysine-proline-glutamine) – causes persistent sodium flow through cell wall
  – Missense mutations in sodium channels which cause ventricular fibrillation that can be fatal
• Models of heart function can produce counterintuitive predictions
• Grand challenge problem: the full scale reconstruction of a heart attack

Real-time fMRI
[Diagram: 3.0T MRI Scanner, CRAY T3E, SGI Onyx]
• In 1996, this required a supercomputer
• Today, it’s routine
Slide courtesy of Ralph Roskies, Pittsburgh Supercomputing Center, roskies@psc.edu

Gamma Knife
• Used to treat inoperable tumors
• Treatment methods currently use a standardized head model
• UITS is working with IU School of Medicine to adapt PENELOPE code to work with detailed model of an individual patient’s head

PENELOPE Basics
• “PENELOPE performs Monte Carlo simulation of coupled electron-photon transport in arbitrary materials and complex quadric geometries” (http://www.nea.fr/abs/html/nea-1525.html)
• Improvement of targeting based on CT scans of patient’s head
  – 200 512 x 512 voxel slices
• Simulation takes ~7 hours using a serial version of PENELOPE running on a 1 GHz PIII Windows system
• Goal: 5 minutes to one hour

Parallelization of PENELOPE
• Each processor:
  – Views entire target
  – Generates its own random numbers
  – Generates a set number of independent trajectories
  – Accumulates data
• Process 0:
  – Collects the raw data
  – Computes desired results
• Uses F90 for parallel random number generator from MILC consortium
• Uses MPI elsewhere

PENELOPE Scalability: processing time
[Figure: total wallclock time (sec., logarithmic axis) versus number of processors, on IBM SP/Power3]

PENELOPE Scalability: Speedup
[Figure: speedup versus number of processors]

Some very boring Vampir traces of PENELOPE
[Figure]

“Simulation-only” studies
• Aquaporins - proteins which conduct large volumes of water through cell walls while filtering out charged particles like hydrogen ions.
• Massive simulation (35,000 hours TCS) showed that water moves through aquaporin channels in single file. Oxygen leads the way in. Half way through, the water molecule flips over.
• That breaks the ‘proton wire’
• Work done at Pittsburgh Supercomputing Center
• Klaus Schulten et al, U. of Illinois, SCIENCE (April 19, 2002)

Other example large-scale computational biology grid projects
• Department of Energy “Genomes to Life” http://doegenomestolife.org/
• Encyclopedia of Life (http://eol.sdsc.edu/)
• Biomedical Informatics Research Network (BIRN) http://birn.ncrr.nih.gov/birn/
• Asia Pacific BioGrid (http://www.apbionet.org/)
• eDiamond – breast cancer/mammography grid (http://www.mirada-solutions.com/PH1.asp?PAGE_ID=739)

Visualization: OpenDX
• http://www.opendx.org/
• OpenDX is the open source software version of IBM's Visualization Data Explorer Product
• Good sources of information in books, tutorials, etc.
• Interesting example of open source
• Animations as well
http://www.opendx.org/highlights.php

Visualization: SCIRun
• Some of the most dramatic biological visualizations ever done
• Has been used for surgical support
• Scientific Computing and Imaging Institute – Christopher R.
Johnson
• http://www.sci.utah.edu/

Genomes to Life
• http://www.doegenomestolife.org/
• Goals:
  – Identify and Characterize the Molecular Machines of Life: the Multiprotein Complexes That Execute Cellular Functions and Govern Cell Form
  – Characterize Gene Regulatory Networks
  – Characterize the Functional Repertoire of Complex Microbial Communities in Their Natural Environments at the Molecular Level
  – Develop the Computational Methods and Capabilities to Advance Understanding of Complex Biological Systems and Predict Their Behavior
  – (Goals taken directly from Genomes to Life web site)

EOL Basic Topology
[Diagram: Genomic Data; Putative Functional and 3D Assignment; Integration with Other Resources; Public and Private Databases; To Serve Thousands Worldwide]
http://eol.sdsc.edu/methodology.html

Current Genomic Pipeline
[Diagram, stages as labelled:]
• Inputs: sequence info (NR, PFAM); structure info (SCOP, PDB)
• Building FOLDLIB: PDB chains, SCOP domains, PDP domains, CE matches PDB vs. SCOP; 90% sequence non-identical; minimum size 25 aa; coverage (90%, gaps <30, ends <30)
• Arabidopsis protein sequences
• Prediction of: signal peptides (SignalP, PSORT), transmembrane (TMHMM, PSORT), coiled coils (COILS), low complexity regions (SEG)
• Create PSI-BLAST profiles for protein sequences
• Structural assignment of domains by PSI-BLAST on FOLDLIB
• Only sequences w/out A-prediction: structural assignment of domains by 123D on FOLDLIB
• Only sequences w/out A-prediction: functional assignment by PFAM, NR, PSIPred assignments
• Domain location prediction by sequence
• Store assigned regions in the DB
http://eol.sdsc.edu/methodology.html

Scale of Multi-genome Analysis
[Diagram: same pipeline as above, annotated with scale:]
• FOLDLIB: $10^4$ entries
• Genomes: ~800 genomes @ 10k-20k ORFs per genome = ~$10^7$ ORFs
• CPU years shown on diagram: 4, 228, 3, 9, 252, 3
http://eol.sdsc.edu/methodology.html

BIRN
• Biomedical Informatics Research Network
• http://www.nbirn.net/
• NIH-sponsored attempt to create health-oriented cyberinfrastructure
• Function BIRN – brain function and disorders, e.g. schizophrenia
• Morphometry BIRN – brain structural disorders, e.g. Alzheimer’s
• Mouse BIRN – studying mouse brain and mouse models of human brain disorders
• Grid technology, using federated data system approach, based on Globus, SRB, etc.

Drug Design
• Target generation – so what
• Target verification – that’s important!
• Toxicity prediction – VERY important!! (Cholesterol example)
• Counterintuitive problem: the more personalized a therapy is, the smaller its target audience!

What is the killer application in computational biology?
• Systems biology – latest buzzword, but….
• Goal: multiscale modeling from cell chemistry up to multiple populations
• Current software tools still inadequate
• Multiscale modeling calls for use of established HPC techniques – e.g. adaptive mesh refinement, coupled applications
• Current challenge examples: actin fiber creation, heart attack modeling
• Opportunity for predictive biology?
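
Returning to the Encyclopedia of Life estimates on the "Scale of Multi-genome Analysis" slide above, the arithmetic can be checked with a few lines of Python. The genome count, ORFs per genome and CPU-year figures are taken from the slide; the processor counts and the perfect-scaling assumption are illustrative only, so the wall-clock figures are a lower bound rather than a prediction.

```python
# Figures quoted on the "Scale of Multi-genome Analysis" slide.
GENOMES = 800
ORFS_PER_GENOME = (10_000, 20_000)        # "10k-20k per" genome
STAGE_CPU_YEARS = [4, 228, 3, 9, 252, 3]  # per-stage labels on the diagram


def orf_range(genomes, per_genome):
    """Lower and upper estimate of the total number of ORFs."""
    low, high = per_genome
    return genomes * low, genomes * high


def wallclock_days(cpu_years, processors):
    """Idealized wall-clock time in days, ASSUMING perfect scaling
    (no I/O, scheduling or load-imbalance overhead)."""
    return cpu_years * 365.0 / processors


low_orfs, high_orfs = orf_range(GENOMES, ORFS_PER_GENOME)
total_cpu_years = sum(STAGE_CPU_YEARS)

print(f"ORFs: {low_orfs:.1e} to {high_orfs:.1e}")
print(f"Total: {total_cpu_years} CPU years")
for processors in (64, 512, 4096):  # hypothetical machine sizes
    days = wallclock_days(total_cpu_years, processors)
    print(f"{processors:5d} processors: about {days:,.0f} days")
```

The ORF range of 8 x 10^6 to 1.6 x 10^7 brackets the slide's ~10^7, and the stages sum to roughly 500 CPU years, which is why the pipeline is presented as a grid-scale problem rather than a single-cluster job.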
Computational biology, biomedical research, and HPC
• Two challenges:
  – Scalability of applications
  – Wall-clock time sensitivity
• Bioinformatics, Genomics, Proteomics, ____ics will radically change understanding of biological function and the way biomedical research is done.
• Traditional biomedical researchers must take advantage of new possibilities
• Computer-oriented researchers must take advantage of the knowledge held by traditional biomedical researchers

Peta-Scale applications?
• Is this what most biologists really need?
• Many biologists are unfamiliar with the real possibilities
• Useful – even lifesaving – applications may require straightforward application of well known principles.
• The low-hanging fruit tastes just fine, e.g. “Parallel” Matlab, GeneIndex, batch scripts (www.indiana.edu/~rac/bioinformatics/iubatchscripts.html)
• Writing a parallel application that can be used to treat people is a very difficult challenge
• Attacks on all fronts simultaneously are needed
• Interactive Tera-scale applications might for many biologists be more valuable right now than Peta-scale applications (even if we had them!)
• All of these open source codes are out there waiting for you to parallelize and/or tune them!

So how do you find biologists with whom to collaborate?
• Chicken and egg problem?
• Or more like fishing?
• Or bank robbery?
• Willie Sutton, a famous American bank robber, was asked why he robbed banks, and reportedly said “because that's where the money is.” (This is, sadly, an urban legend: Sutton never said this)
• Cultivating collaborations with biologists in the short run will require:
  – Active outreach
  – Different expectations than we might usually have
  – Patience
• There are lots of opportunities open for HPC centers willing to make the effort to cultivate relationships.
To do this, we’ll all have to spend a bit of time “going where the biologists are.”

Acknowledgments
• Some of the research described herein was supported in part by the Indiana Genomics Initiative. The Indiana Genomics Initiative of Indiana University is supported in part by Lilly Endowment Inc.
• Some of the research described herein was supported in part by Shared University Research grants from IBM, Inc. to Indiana University.
• Some of the material described herein is based upon work supported by the National Science Foundation under Grant No. 0116050 and Grant No. CDA-9601632. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).
• Some of the ideas presented here were developed while the senior author was a visiting scientist at Höchstleistungsrechenzentrum Universität Stuttgart. The support and collaboration of HLRS and Michael Resch, Matthias Müller, Peggy Lindner, Matthias Hess, and Rainer Keller are gratefully acknowledged.
• Thanks to UITS Research and Academic Computing Division managers: Mary Papakhian, Stephen Simms, Richard Repasky, Matt Link, John Samuel, Eric Wernert, Anurag Shankar
• Indiana Genomics Initiative Staff: Andy Arenson, Chris Garrison, Huian Li, Jagan Lakshmipathy, David Hancock
• UITS Senior Management: Associate Vice President and Dean (Retired) Christopher Peebles, RAC(Data) Director Gerry Bernbom, Associate Vice President and Dean Bradley Wheeler
• Assistance with this presentation: John Herrin, Malinda Lingwall, W. Les Teach

Some Good Books
• Winter, P.C., G.I. Hickey, H.L. Fletcher. 1998. Instant notes in genetics. Springer-Verlag, NY. ISBN 0-387-91562-1
• Durbin, R., S. Eddy, A. Krogh, G. Mitchison. 2000. Biological sequence analysis. Cambridge University Press.
• Gibas, C., and P. Jambeck. 2001. Developing bioinformatics computer skills. O’Reilly.
• Tisdall, J. 2001.
Beginning Perl for bioinformatics. O’Reilly.
• Gusfield, D. 1997. Algorithms on strings, trees, and sequences. Cambridge University Press.
• Berman, F., G.C. Fox, A.J.G. Hey. (eds) 2003. Grid computing: making the grid infrastructure a reality. Wiley, Sussex
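
As a companion to the Needleman-Wunsch slide earlier in this tutorial (and to the treatment in Durbin et al. listed above), here is a minimal sketch of global alignment by dynamic programming. It is the common textbook formulation rather than the original procedure on the slide: the matrix is filled from the top left instead of the bottom right, and the scoring values (match +1, mismatch -1, linear gap penalty -1) are assumptions chosen for illustration. As noted on the slides, it needs O(N*M) time and O(N*M) memory.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of sequences a and b.

    Returns (score, aligned_a, aligned_b). Scoring values are
    illustrative defaults, not those of any particular tool.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap          # b[:j] aligned against gaps only

    def sub(i, j):
        # Substitution score for pairing a[i-1] with b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal path.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, row_a, row_b = needleman_wunsch("ACGT", "AGT")
print(best)
print(row_a)
print(row_b)
```

Smith-Waterman differs mainly in clamping each cell at zero and starting the traceback from the highest-scoring cell, which is what lets it report partial (local) alignments.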