1 Introduction to computational biology Craig A. Stewart stewart@iu.edu Indiana University 1 July 2005 Höchstleistungsrechenzentrum Stuttgart License terms • • Please cite as: Stewart, C.A. 2005. Introduction to computational biology. Tutorial presented at High Performance Computing Center, Stuttgart. 1 July 2005, Stuttgart, Germany. http://hdl.handle.net/2022/13995 Some figures are shown here taken from web, under an interpretation of fair use that seemed reasonable at the time and within reasonable readings of copyright interpretations. Such diagrams are indicated here with a source url. In several cases these web sites are no longer available, so the diagrams are included here for historical value. Except where otherwise noted, by inclusion of a source url or some other note, the contents of this presentation are © by the Trustees of Indiana University. This content is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share – to copy, distribute and transmit the work and to remix – to adapt the work under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work. 2 3 Planned schedule for the day • • • • • • • • • • • • 9:00-9:15 9:15-10:00 10:00-10:15 10:15-10:30 10:30-10:45 10:45-11:30 11:30-12:00 12:00-13:00 13:00-13:30 13:30-14:00 14:00-14:30 14:30:14:45 Introduction and objectives An introduction to the biological basis [11] Biological data sources [30] Similarity searching and Alignment [39] Coffee Break Similarity searching and alignment, con;t Multiple Alignment [58] Lunch Phylogenetics [75] RNA and Protein Structure [91] Systems Biology [119] Miscellaneous semi-random topics [138] 4 Plan & Objectives • Materials focus on open source software (generally not the presenters own work) • Objectives. At the end of the class, participants should: – understand enough biology to understand key computational biology problems, and be familiar with some strategies for collaborating with biologists and biomedical scientists – be conversant with key open source applications in computational biology and bioinformatics, and current problems in these areas – Be ready to download some code and start making it better! 5 Motivation • • • • • The “-omics” trend Finding press pieces about huge computing problems is easy How many bio codes really scale to hundreds of processors? What are the coming high performance needs of biologists? Importance of computational biology and bioinformatics to the HPC community • The challenges and promise are real • Successes and failures so far – Successes: Protein structure, Genome assembly, Surgical assistance, Phylogenetics – Mismatched priorities: Ab initio protein folding – Not yet successful: Drug discovery 6 What has changed recently? • Bioinformatics not new – Protein structure – Phylogenetics • What is new is highthroughput sequencing: – Lots more data – The possibility of going from a knowledge of the DNA sequence to an understanding of diseases and health http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html 7 Genome Projects Timeline • • • • • • • • • • • 1978 1986 1994 1995 1996 1997 1998 1998 1999 2000 2003 First virus (SV40) sequenced (5224 base pairs) DOE announces Human Genome Initiative First complete map of all human chromosomes First living organism sequenced (H. influenzae) 2 Mb Yeast (S. cerevisiae) - 12 Mb Intestinal bacterium (E. coli) - 5 Mb Nematode worm (C. elegans) - 100 Mb Celera announcement; Public effort regroups Human Chromosome 22 – 34 Mb Joint announcement by NHGRI – Celera “As good as it gets” human genome This slide based on slide by Manfred D. Zorn 8 Definitions • Computational Biology: any use of advanced information technology in the study of biological problems. • “Bioinformatics applies the principles of information sciences and technologies to make the vast, diverse and complex life sciences data mnore understandable and useful” (NIH BISTIC Committee grants1.nih.gov/grants/bistic/CompuBioDef.pdf) • Genomics – study of genomes and gene function • Proteomics – study of proteins and protein function • ___omics – 9 Challenges • • • • Different types of biological data at different scales Data of varying quality Much of the underlying biology is not well understood Prior to the availability of high-throughput sequencing, scientists could only study small pieces of the genetic information of any organism. • Now the entire genome of several organisms has been completed, but knowing the genome is different than knowing how it works! 10 Comparison of Complexity • Physics & Chemistry • Biology – 2 elementary particles – 3B base pairs in humans – 4 forces – Min. 30,000 genes in humans – 112 elements – ~1.5M species – When random events occur it is often possible – Individual random to study average behavior events important; no law of large numbers – Typically ahistoric (astrophysics an – Intensely historic, exception) heavily contingent 11 Complexity, Con't • Chip design – All components known – Device physics for individual components known – Itanium has 3 x 10^8 connections and 2 x 10^8 devices – Unified basic currency (electrons) – Computer program required to understand (e.g. SPICE) • Cells – Components not known – Function of individual components not known – # components ~10^13 – No unified basic currency – Ecell, Karyote, etc. attempting to model cells – Computer programs do not yet permit full understanding 12 A rapid introduction to key elements of biology Why is it important to know some biology? Anopheles gambiae From www.sciencemag.org/feature/data/ mosquito/mtm/index.html Source Library:Centers for Disease Control Photo Credit:Jim Gathany • Would you study numerical methods without knowing some mathematics? • Much current biological knowledge is very specific to particular organisms, genes, or diseases • If you just wade into the available data online you can do some very silly things. 13 14 Central dogma of biology • The central dogma of biology is that genes act to create phenotypes through a flow of information form DNA to RNA to proteins, to interactions among proteins (regulatory circuits and metabolic pathways), and ultimately to phenotypes. Collections of individual phenotypes constitute a population (first put forward by Crick in 1958) http://www.ncbi.nlm.nih.gov/About/primer/genetics_cell.html 15 Cell Structure http://www.ncbi.nlm.nih.gov/About/primer/genetics_cell.html Eukaryotes • Chromosomes linear • Introns, exons, postprocessing • Nucleus & nuclear wall • Mictochondria and (in plants) Chloroplasts Prokaryotes • Chromosome circular • Location is everything • No nucleus • No plastids 16 Four (or Five) Bases • DNA consists of four nucleotides: Cytosine, Thymine, Adenine, and Guanine. • In the double helix, A&T are always bound, and C&G are always bound to each other • RNA consists of four nucleotides as well: Cytosine, Uracil, Adenine, and Guanine • RNA may loop back on itself but it does not form a double helix http://www.ornl.gov/TechResources/Human_Genome/graphics/slides/images/structur.gif 17 http://www.ornl.gov/sci/techresources/Human_Genome/graphics/slides/98-648jpg.shtml 18 www.ornl.gov/sci/techresources/Human_Genome/graphics/slides/images/molecularmachine.jpg 19 http://www.abdn.ac.uk/sfirc/images/ligand_receptor.gif Translating DNA to RNA and Transcribing RNA to Proteins DNA AAAAAGGAGCAAATT 1 RNA One possible amino acid string 2 4 3 6 5 UUUUUCCUCGUUUAA Phe Asn Asp Ala 20 21 Genetic Code Ala Alanine Arg Arginine Asn Asparagine Asp Aspartic acid Cys Cysteine Glu Glutamic acid Gln Glutamine Gly Glycine His Histidine Ile Isoleucine http://www.ncbi.nlm.nih.gov/Class/MLACourse/ Original8Hour/Genetics/geneticcode.html Leu Leucine Lys Lysine Met Methionine Phe Phenylalanine Pro Proline Ser Serine Thr Threonine Trp Tryptophan Tyr Tyrosine Val Valine 22 Sickle Cell Normal RBC • GAG codes for Glutamine • disc-Shaped, soft • easily flow through small blood vessels • lives for 120 days Sickle RBC • GTG codes for Valine • sickle-Shaped, hard • often get stuck in small blood vessels • lives for 20 days or less Malaria vs. Anaemia! http://www.nlm.nih.gov/medlineplus/ ency/imagepages/1223.htm 23 What is a Gene? • An inheritable trait associated with a region of DNA that codes for a polypeptide chain or specifies an RNA molecule which in turn has an influence on some characteristic phenotype of the organism. – Early views: genes lined up on the chromosome like beads on a string; one gene => one protein – Examples of genes: color blindness, sickle-cell anaemia – Mendelian genes, Sex-linked genes, Quantitative traits • Annotation: Extraction, definition, and interpretation of features on the genome sequence • Annotations vs. genes: – Many annotations describe features that constitute a gene. – Others may not always directly correspond in this way – An annotation is what we think… nature may disagree! • Inheritance problem with annotations 24 http://www.ornl.gov/TechResources/Human_Genome/graphics/slides/images/98-647.jpg 25 Human Chromosomes • Prokaryotes and Eukaryotes that reproduce asexually just get a copy of all of their genes from their ‘parent’ • Eukaryotes (that includes us) that reproduce sexually receive one chromosome from each of our parents. This means we have two copies of the same information – one copy from each parent • Genes can be dominant (brown eyes) or recessive (blue eyes) http://www.ornl.gov/TechResources/ Human_Genome/graphics/slides/ elsikaryotype.html 26 Mendelian Genetics • Round peas are “dominant” to that a pea with the genotype RR or Rr is going to look round • Only peas with the rr genotype will look wrinkled • For very simple Mendelian genes, you can use something called a Punnet square • Many genes are inherited in ways OTHER than simple Mendelian genetics, which is why sequencing the geneome has been very important! 27 Gene Components • Prokaryotes – Location is everything – Essentially all of the DNA is transcribed (few mitochondrial diseases) • Eukaryotes – Non-contiguous pieces of DNA may comprise one gene – Start sequence (complicated and long) – Stop Codons – end transcription – Exons – portions of sequence that are transcribed and used – Introns – portions of sequence that are not used 28 Alternate splicing http://www.blc.arizona.edu/marty/411/Modules/altsplice.html 29 Population genetics & evolution • Mutations create the raw material for evolution • Natural selection and chance affect the frequency with which particular genes or DNA sequences are present in populations • Given enough time and enough change, evolution, speciation, and so forth happen • Genes can be ‘fixed’ or ‘maintained in an equilibrium’ in a population by chance or through natural selection http://faculty.wm.edu/bsgran/ 30 Key points (so far) • Biological processes are complicated; the historicity and complexity of biological processes and our lack of understanding of many matters makes biology an interesting topic! • The fundamental dogma of molecular biology is that genes act to create phenotypes through a flow of information form DNA to RNA to proteins, to interactions among proteins (regulatory circuits and metabolic pathways), and ultimately to phenotypes. Collections of individual phenotypes constitute a population. • DNA consists of four base pairs (ATCG). A is always paired with T; C always paired with G. • DNA is translated into RNA. RNA consists of four base pairs as well (AUCG). • The linear structure of DNA is transcribed into RNA and then into proteins. Proteins have their 3D configuration as the basis for their structure. 31 Bioinformatics data sources 32 MedLine & PubMed • Medline: – U.S. National Library of Medicine – NLM Medline http://www.nlm.nih.gov/ – ~12 million references on life sciences/biomedicine (since 1996) • Pubmed - Standard search tool for Medline http://www.ncbi.nlm.nih.gov/entrez/ • Structured Language - Medical Subject Heading http://www.nlm.nih.gov/mesh/MBrowser.html 33 Genomic, Proteomic, etc. data sources • A tremendous amount of data is available through public data sources via the Web, ftp, or by other means. • To analyze biological data, we first have to get it…. • Several ways to organize presentation of material – by site, by type, etc. We will organize by data type. • Types of Databases: – – – – – – Chromosomal (http://www.ncbi.nlm.nih.gov/mapview) DNA/Genes Protein Biochemistry and metabolic pathways Microarray Web collections 34 DNA databases • GenBank. Operated by NCBI (National Center for Biotechnology Information). http://www.ncbi.nlm.nih.gov • European Molecular Biology Laboratory – Nucleotide Sequence Database. http://www.ebi.ac.uk/genomes/ • DNA Database of Japan (DDBJ). http://www.ddbj.nig.ac.jp • All share data daily. Update conflicts avoided by policy. • Differences are in “value added” and interfaces 35 Data Structures • Current – Primary DNA repository data based on ASN.1. Makes possible linkages among many types of biomedical info. – The software libraries now often handle XML as well – Software toolkits and docs available at http://www.ncbi.nlm.nih.gov/IEB/ • Genbank Flat File format – http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html • FASTA >gi|532319|pir|TVFV2E|TVFV2E envelope protein ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC http://www.ncbi.nlm.nih.gov 36 37 Protein Structure • NCBI • Swiss-Prot/TrEMBL at http://www.expasy.org/ – Note: 125,744 chemically determined vs 861,482 inferred from automated translation of DNA sequences!!!!! • Protein Data Base – PDB http://www.rcsb.org/pdb/ - one of the first online bioinformatics databases!!! 38 Biochemistry and pathways • Biochemistry – ENZYME (part of the ExPASy system) – BIND (part of the NCBI system) • Pathways – PathDB http://www.ncgr.org/software/version_2_0.html – Kegg WIT http://wit.mcs.anl.gov/WIT2/ 39 Web Resources - General • NCBI http://www.ncbi.nlm.nih.gov/ • EBI Biocatalog http://www.ebi.ac.uk/biocat/ • IUBio Archive http://iubio.bio.indiana.edu http://www.ncbi.nlm.nih.gov/ 40 Similarity matching & Sequence Alignment Why pattern matching (and what are the problems) and… US! Bonobo http://www.sandiegozoo.org/special/zoo-featured/pygmy_chimps.html 41 42 Problems! • For proteins, 95% similarity is ~ identical, 80% similarity is a lot. Even less similarity than that needed for DNA • Database techniques inadequate – they are too precise! • Datasets very large to search • Homology • Common ancestry • Sequence (and usually structure) conservation • Homology is inferred rather than measured • Identity • Objective and well defined • Can be quantified easily, but not very useful! • Similarity • Most common method used, but not as easily defined 43 How do sequences differ? • Differences in individual bases CGTACCGTTAATAT CGTACCGATAATAT • Bases may be added to a sequence CGTACCCCGTAATAT CGTACC . .GTAATAT • Bases may be deleted from a sequence CGTACCGTTAATAT CGTACCG . . .ATAT 44 Alignment • An alignment is an arrangement of two sequences opposite one another • It shows where they are different and where they are similar • We want to find the optimal alignment - the most similarity and the least differences • Alignments have two aspects: – Quantity: To what degree are the sequences similar (percentage, other scoring method) – Quality: Regions of similarity in a given sequence 45 Scoring Alignments GCTAAATTC ++ x x GC AAGTT • Matches are good: they get a positive value • Mismatches are bad: they get a negative value • Gaps are bad: they get a negative value – Gap opening penalty – Gap extension penalty – Score = Matches –Mismatches -∑{gap opening penalty +(length)*gap length penalty} CGTACCGTTAATAT CGTACCGTTAATAT CGTACCG . . .ATAT CGT. C . GTT .ATAT 46 Dotter • Simple way to get a feel for how sequences compare to each other. • Used both with DNA and Protein sequences • http://www.cgr.ki.se/cgr/groups/son nhammer/Dotter.html/ • "A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis" Erik L.L. Sonnhammer and Richard Durbin Gene 167(2):GC1-10 (1995) • Modular nature of proteins 47 Local Alignments with BLAST • • • • Basic Linear Alignment Search Tool First a quick demo (hopefully) http://www.ncbi.nlm.nih.gov/BLAST So, what did we do? – BLAST – Basic Linear Alignment Search Tool – In particular, BLASTn (for nucleotides) – Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic Local alignment search tool. Journal of Molecular Biology 215:403-410 48 (Original) BLAST Algorithm • Original algorithm does not permit gaps • The original BLAST algorithm is a local (heuristic) alignment tool • Given a search sequence, e.g. ACGTAGGCATGAA • BLAST first makes a list of all “words” of a given length that would possibly have a score of at least T against the search string. • In the case of this example there would be (at least) the following: – ACGTAGGCATG – CGTAGGCATGA – GTAGGCATGAA 49 (Original) BLAST Algorithm, 2 • BLAST takes the list of all words with a score of at least T against the string one is trying to match…. and then searches a database for any matches to these words. So if one were using the example and the NR database, BLAST would search NR for all occurrences of the words: – ACGTAGGCATG – CGTAGGCATGA – GTAGGCATGAA • Suppose BLAST finds in the NR database an exact match to – ACGTAGGCATG • BLAST then attempts to extend the match in both directions – ACGTAGGCATGA – ACGTAGGCATGA • So now we have an exact match of 12 letters 50 (Original) BLAST algorithm,3 • So BLAST keeps going, and in this case would stop at an exact match of 13 letters (if one existed), since 13 letters was the entire initial search string: – ACGTAGGCATGAA – ACGTAGGCATGAA • BLAST has a stopping algorithm for dropping particular search directions, or stopping altogether 51 BLAST algorithm in more detail • The BLAST algorithm searches for MSPs – Maximal Scoring Pairs – such that the score of sequences cannot be improved either by lengthening it or shortening it. “Pairs” here refers to a string – or a substring – of the initial string used as the search string – and one or more strings or substrings found in a database. • The search starts with the creation of all possible subwords of a given length (default typically 11 for DNA sequences, 3 amino acids for protein sequences) that would score at least T when matched against the original search string. (T is short for Threshold) • BLAST searches for any occurrence of each of these words that have a score of at least T. This is a “hit” – or a “High Scoring Pair (HSP)” • The search then continues by trying to extend these HSPs. • Suppose “S” is the best score found for a word of length k. BLAST stops trying to extend words when the score drops a certain amount below the best value S in the previous round. • BLAST continues on and on until it is no longer possible to improve the score of HSPs by making them longer. • Then it generates a list of the best HSPs. Default is a cutoff E-value of 10 • BLAST (original) has an infinite gap penalty 52 PSI BLAST • Position Specific Iterative BLAST • http://nar.oupjournals.org/cgi/content/full/25/17/3389 • Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSIBLAST: a new generation of protein database search programs. Nucleic Acids Res 1997 Sep 1;25(17):3389-402 • Required two non-overlapping similarities with search term to occur within a certain distance (A) on the genome • Permits gaps in the alignments • Can be iterated to allow for user-specified scoring matrices By default, uses the BLOSUM-62 Matrix 53 PSI BLAST • In the original BLAST, the step of extending the length of the ‘hits’ took ~90% of execution time. • The initial threshold value T must be lower than with the original BLAST, but far fewer hits are pursued, meaning that the extension time is lower Two hits, T=11 A=40 vs One hit, T=13 http://nar.oupjournals.org/content/ vol25/issue17/images/gka56202.gif 54 Example Problem (hoffentlich) 55 mpiBLAST http://mpiblast.lanl.gov/ 56 mpiBLAST Algorithm • Darling, A.E., L. Carey, W.-C. Feng. 2003. The design, implementation, and evaluation of mpiBLAST. Presented at ClusterWorld2003. http://www.cs.wisc.edu/%7Edarling/mpiblast-cwce2003.pdf • Algorithm – Database is segmented. Portions of database are placed on data storage devices on multiple nodes in a HPC system. mpiformatdb is a wrapper for the BLAST formatdb program. Number of subdivisions specified by user – Foreman/worker algorithm. Portions of the database are assigned to workers, using a greedy algorithm 57 mpiBLAST performance • Scaling can be super linear when pieces are small enough that they fit into memory • Scalability limitations due to communication, implicit barrier before assembly of results • If pieces of data distributed out to workers are larger than available RAM, then scaling is still good but not super linear • Blast is the most heavily used bioinformatics tool in existence. Parallelization of BLAST has huge payoff for practicing biologists BLAST superlinear scaling as memory phenomenon • Standard BLAST running on a system with 128 MB of memory. Slide courtesy of Wu-chun Feng feng@lanl.gov Los Alamos National Laboratory 58 59 Multiple Alignment 60 Multiple Alignment • Sequences lined up so that homologous residues are next to one another Color reflects residue type (e.g., green = hydrophobic) This slide based on a slide created by Dr. Richard Repasky, Indiana University 61 Uses • Alignments reveal the degree to which sequences have been conserved • Most functional sequences are conserved • Multiple alignment is used to locate them – – – – Functional groups of enzymes Predict protein structure Gene promoters Unknown functional units of non-coding regions of DNA • Alignments necessary to estimate evolutionary trees This slide based on a slide created by Dr. Richard Repasky, Indiana University 62 Progressive alignments • Pairwise dynamic programming alignment algorithms can be extended to multiple sequences but scale poorly to large numbers of sequences (or to long sequences) • Heuristic algorithms are employed. Commonly used heuristic methods are progressive - build multiple alignment by aggregating from paired alignments • Order in which sequences are added determined by a guide-tree that reflects similarity/distance • Guide tree constructed from sequences • Closely related sequences aggregated/added first • Errors in early additions tend to propagate • Algorithms differ in strategy for minimizing error propagation • Algorithms also differ in guide tree construction & scoring This slide based on a slide created by Dr. Richard Repasky, Indiana University 63 Progressive alignment steps • • • • 1 - align B & C 2 - align D & E 3 - align (D & E) & A 4 -align (D & E & A) & (B & C) This slide based on a slide created by Dr. Richard Repasky, Indiana University 64 Three algorithms • CLUSTAL W – Oldest of three & most widely used – Initial alignment error not addressed – Good performance by adding realistic details • T-COFFEE – Initial alignment error addressed by using consistency methods – Uses CLUSTAL W, improves performance • ProbCons – New this year – Initial alignment error also addressed by consistency methods – Uses hidden Markov models This slide based on a slide created by Dr. Richard Repasky, Indiana University 65 CLUSTAL W • http://www.ebi.ac.uk/clustalw/ • Thompson et al. 1994. Nucleic Acids Res. 22: 4673-4680 • Uses dynamic programming with distance matrices and gap penalties for alignments • Selective use of scoring matrices – Strict matrices for closely related sequences – Permissive matrices for distantly related sequences – Relatedness determined by branch lengths in guide tree • Uses residue-specific gap penalties from reference alignments This slide based on a slide created by Dr. Richard Repasky, Indiana University 66 CLUSTAL W • Gap penalties reduced in short stretches of hydrophilic residues (usually associated with bends and are gap-prone) • Gap penalties increased in areas within 8 residues of existing gaps because such gaps are rare in reference alignments • Sequences weighted by relatedness – Attempt to correct for unbalanced sampling across guide tree – Closely related sequences discounted in importance • Progression – Leaves of tree joined by dynamic programming – Leaves joined with internal nodes by sequence-profile alignment – Internal nodes joined by profile-profile alignment This slide based on a slide created by Dr. Richard Repasky, Indiana University 67 Example output FOS_RAT FOS_MOUSE FOS_CHICK FOSB_MOUSE FOSB_HUMAN MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNT MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNT MMYQGFAGEYEAPSSRCSSASPAGDSLTYYPSPADSFSSMGSPVNS -MFQAFPGDYDS-GSRCSS-SPSAESQ--YLSSVDSFGSPPTAAAS -MFQAFPGDYDS-GSRCSS-SPSAESQ--YLSSVDSFGSPPTAAAS *:..* .:*:: .***** **:.:* * *..***.* :.. :*: FOS_RAT FOS_MOUSE FOS_CHICK FOSB_MOUSE FOSB_HUMAN IPTVTAISTSPDLQWLVQPTLVSSVAPSQ-------TRAPHPYGLP IPTVTAISTSPDLQWLVQPTLVSSVAPSQ-------TRAPHPYGLP VPTVTAISTSPDLQWLVQPTLISSVAPSQ-------NRG-HPYGVP VPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMP VPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMP :******:** **********:**:* **... ::. .**.:* : 68 ClustalW-MPI • Li, K.-B.2003. ClustalW-MPI: ClustalW analysis using distributed and parallel computing. Bioinformatics 19: 15851586 • Initial pairwise alignment process is parallelized and scales very well • Multiple alignment process is parallelized and scales modestly • Scaling tests published thus far up to 16 processors, reduces time from hours to minutes 69 Consistency Methods • Make estimates based on more information - “averaging” • Lazy teacher analogy • In progressive multiple alignment, use as much information as possible when adding sequences to the alignment • T-COFFEE: each position in one alignment is weighted by consistency in all alignments of all pairs of sequences that include the sequences being aligned • ProbCons: posterior probabilities in pairwise HMM alignments weighted by posterior probabilities of same positions in other alignments This slide based on a slide created by Dr. Richard Repasky, Indiana University 70 T-COFFEE • Http://igs-server.cnrsmrs.fr/~cnotred/Projects_home_page/t_coffee_home_page.ht ml • Notredame, et al. (2000, J. Mol. Biol. 302:205-217) • Gives better alignments than CLUSTAL W on benchmark datasets • Avoids problem of early bad gaps using consistency methods • Progressive alignment based on weights pooled from all pairwise alignments rather than currently accumulated sequences This slide based on a slide created by Dr. Richard Repasky, Indiana University 71 T-COFFEE steps • Calculation of weights – All pairwise alignments using CLUSTAL W, local alignments using FASTA Lalign – For each aligned base pair in each pair of sequences calculate weight – Aggregate weights for aligned base pairs using triplets of sequences • Progression – Align all sequences pairwise using weights – Build guide tree from pairwise alignments – Progressively build multiple alignment using tree and weights This slide based on a slide created by Dr. Richard Repasky, Indiana University 72 ProbCons • • • • • • • • Http://probcons.stanford.edu ISMB 2004, Bioinformatics 20:Supplement 1 Constistency methods & HMM to align Gives better alignments than CLUSTAL W & T-COFFEE on alignment benchmarks Use HMM to align all pairs of sequences Keep posterior probability matrices & update each value by averaging over all alignments in which the sequence position occurs. Do twice. Create a guide tree from expected accuracies (sums of posterior probabilities of highest summing path in matrix) Progressive alignment objective function is sum of posterior probabilities for all aligned residues This slide based on a slide created by Dr. Richard Repasky, Indiana University 73 Multiple alignment viewers • CLUSTAL X - X windows – ftp://ftp-igbmc.ustrasbg.fr/pub/ClustalX/ • Jalview - Java – http://www.ebi.ac.uk/~michel e/jalview/ • Variable color schemes • Editing • Front end to aligners This slide based on a slide created by Dr. Richard Repasky, Indiana University 74 Abstracting Multiple Alignments • • • • • Hidden Markov models can be used to describe alignments Called profile HMMS Think of them as definitions of proteins or averages Useful for aligning newly discovered sequences Search sequence databases for sequences that match the alignment profile (Consider the alternative!) • Build databases of profiles and search for profiles that match query sequences This slide based on a slide created by Dr. Richard Repasky, Indiana University HMMER • • • • • • • • • http://hmmer.wustl.edu/ Profile HMMs for protein sequence analysis Builds profiles from existing alignments Searches sequence databases for molecules that match profiles Can be used to construct db’s of profiles and to search for profiles that match sequences Generates random sequences from profiles Also available as a parallel code using PVM Has been ported to vector supercomputers Scales reasonably well as regards number of processors. Does not scale as well as regards size of the biological problem 75 76 Phylogenetic Inference 77 Building Phylogenetic Trees • Goal: an objective means by which phylogenetic trees can be estimated in tolerable amounts of wallclock time, producing phylogenetic trees with measures of their uncertainty • All evolutionary changes are described as bifurcating trees -genes or gene products -organisms 78 Why is tree-building a HPC problem? • The number of bifurcating unrooted trees for n taxa is (2n-5)!/ (n-3)! 2n-3 • for 50 taxa the number of possible trees is ~1074; most scientists are interested in much larger problems • NP-hard problem • The number of rooted trees is (2n-5)! 79 Stochastic change of DNA • Markov process, independent for each site: 4 x 4 matrix for DNA, 20 x 20 for amino acids • A C G T • A p(A->A) p(A->C) p(A->G) … • C p(C->A) p(C->C) p(C->G) … • G . • T . • Transitions more probable than transversions. • Must account for heterogeneity in substitution rates among sites (DNArates – Olsen) 80 fastDNAml • • • • Developed by Gary Olsen Derived from Felsensteins’s PHYLIP programs One of the more commonly used ML methods The first phylogenetic software implemented in a parallel program (at Argonne National Laboratory, using P4 libraries) • Olsen, G.J.,et al.1994. fastDNAml: a tool for construction of phylogenetic trees of DNA sequences using maximum likelihood. Computer Applications in Biosciences 10: 41-48 • MPI version produced by Indiana University in collaboration with Gary Olsen available from http://www.indiana.edu/~rac/hpc/fastDNAml/ 81 fastDNAml algorithm – adding taxa • Optimize tree for 3 (randomly chosen) taxa only one topology possible • Randomly pick another taxon – (2i-5) trees possible • Keep the best (maximum likelihood tree) Basic fastDNAml algorithm - Branch rearrangement • Move any subtree crossing n vertices (if n=1 there are 2i6 possibilities) • Keep best resulting tree • Repeat this step until local swapping no longer improves likelihood value 82 83 fastDNAml algorithm con’t: Iterate • • • • • • • Get sequence data for next taxon Add new taxa (2i-5) Keep best Local rearrangements (2i-6) Keep best Keep going…. When all taxa have been added, perform a full tree check 84 Overview of parallel program flow • Program modules – Master (generates trees, receives back from Foreman best tree at each step) – Foreman (dispatches trees to workers, determines best tree, tracks activity of workers) – Worker – Monitor (instrumentation) – Parallel versions include fault tolerance features (useful in large clusters and grid computing) 85 Performance of fastDNAml 70 60 SpeedUp 50 40 30 20 10 0 0 10 20 30 40 50 Number of Processors Perfect Scaling 50 Taxa 101 Taxa 150 Taxa 60 70 86 SC2003 HPC Challenge (“It seemed like a good idea at the time”) Are Hexapods a single evolutionary group? Are ecdysozoans a single evolutionary group? 87 A partial bestiary All organism illustrations copyright Jennifer Fairman, 2003. www.fairmanstudios.com Used by agreement 88 Software and data analysis • Non-grid preparatory work – Download sequences from NCBI (67 Taxa, 12,162 bp, mitochondrial genes for 12 proteins) – Align sequences with Multi-Clustal – Determine rate parameters with TreePuzzle • • Grid preparatory work – Analyze performance of fastDNAml with Vampir – Meetings via Access Grid & CoVise The grid software – PACXMPI – Grid/MPI middleware – Covise – Collaboration and visualization – Application Framework – Matthias Hess – fastDNAml – Maximum Likelihood phylogenetics Application framework, COVISE, FastDNAml • • • • • ML analysis of phylogenetic trees based on DNA sequences Foreman/worker MPI program Heuristic search for best trees For 67 taxa: 2.12 ~10109 trees Goal: 300 bootstraps, 10 jumbles per – 3000 executions (more than 3x typical!) 89 90 Why this project on a grid? • • • • • Important & time-sensitive biological question requiring massive computer resources A biologically-oriented code that scales well Grid middleware environment & collaboration tool well suited to the task at hand Opportunity to create a grid spanning every continent on earth (except Antarctica) It seemed like a good idea at the time 91 The results • Hundreds of trees were analyzed during the course of the week • The biological results are still being analyzed • Our HPC challenge project was awarded the prize for “Most geographically distributed application” • We have not yet become rich or famous 92 RNA & Protein Structure 93 RNA & Protein Structure • • • • • • Want to know functions Function dictated by structure Need structure to understand function Empirical determination of structure difficult/expensive Shortcut: predict structure from sequence Algorithms & software for predicting structure 94 RNA • • • • • Nucleotide sequence Composition differs from DNA Thymine replaced by uracil Alphabet: C, G, A, U Types of RNA – Messenger RNA - template from DNA - decoded to produce protein – Transfer RNA - interface attached to amino acids - identifies amino acid to protein producing machinery – Ribosomal RNA - protein producing machinery – Regulatory RNA - small polynucleotides that bind to other molecules and alter behavior – Catalytic RNA - most catalyze reactions of DNA 95 RNA secondary structure • • • • • Single stranded RNA folds on itself Complementary bases join A – U, G - C Forms loops & hairpins • In nature, structure nearly minimizes energy • Energy - more or less bending/stress on bond angles • Zuker algorithm minimizes calculated energy tRNA 96 RNA Structure – Vienna RNA • http://www.tbi.univie.ac.at/~ivo/RNA/ • Package consists of several parts (from the web site): – RNAfold - predict minimum energy secondary structures and pair probabilities – RNAeval - evaluate energy of RNA secondary structures – RNAheat - calculate the specific heat (melting curve) of an RNA sequence – RNAinverse - inverse fold (design) sequences with predefined structure – RNAdistance - compare secondary structures – RNApdist - compare base pair probabilities – RNAsubopt - complete suboptimal folding http://www.tbi.univie.ac.at/~ivo/RNA/ 97 Protein • • • • • • • • Enzymes - catalysts Regulatory - bind with molecules to alter behavior Transport - move here to there as oxygen in hemoglobin Storage - e.g., caches of nitrogen or metal ions Mobility - contractile & motile proteins (muscle, flagella) Structural proteins - fill space, provide support Scaffold - supports for construction of macro molecules Defense/Attack - immune system proteins, venom 98 Structure and function of proteins • • • • • Enzymes receive most attention Enzymes catalyze reactions = lower energy required Place reactants in favorable positions for reaction Location is everything Example enolase 99 http://bmbiris.bmb.uga.edu/wampler/tutorial/prot0.html 100 http://bmbiris.bmb.uga.edu/wampler/tutorial/prot0.html 101 Primary structure - amino acids • Share nitrogen group (amino) • Share acid group (carboxyl) • Differ in side chains • 20 common amino acids • Polymerize amino-tocarboxyl • Side chains determine secondary & tertiary structure 102 Secondary structure & Bond rotation • Reflects angles in amino acid chain • Shape of the peptide chain over short sequences • Determined by amino acid composition Angles of rotation & secondary structure http://www.cryst.bbk.ac.uk/PPS95/ course/3_geometry/rama.html 103 104 Visualization: ways to view molecules • Wireframe – Often used by crystallographers while interpreting data – Frames fit nicely in electron density mesh • Space filled (Van der Waals radii) – Often used in docking applications – Van derWaals radii useful for thinking about hydrogen bonds 105 Visualization: ways to view molecules • Richardson-type (also ribbon or noodle) – Depict secondary and tertiary structure – Omit details of atoms and polymerization • Ball and stick – Usually more useful for ligands than for whole proteins – Atoms & covalent bonds 106 Molecular viewing software • Most programs do several types of visualizations • VRML – Cosmo Player http://www.karmanaut.com/cosmo/player/ • RASMOL - http://www.openrasmol.org/ • RasTop - http://www.geneinfinity.org/rastop/ • CHIME - http://www.mdl.com/chime/index.html • Swiss Pdb Viewer - http://www.expasy.ch/spdbv/ • MICE - http://mice.sdsc.edu/ • Many tend to be touchy about browsers and plugins 107 Empirical structure • X-ray crystallography • Yields 3-D plot of electron density in space • Create model of molecule that matches electron density • ~127,863 entries in SwissProt • ~857,950 entries in TrEMBL http://crystal.uah.edu/~carter/protein/xray.htm 108 Secondary structure prediction • Induction: assign structure based on sequence similarity to proteins of known structure • Consensus methods do best • Search database for similar sequences • Align sequences • Apply several algorithms (e.g., neural network, nearestneighbor) to predict structure type • Take consensus of predictions • JPRED: www.compbio.dundee.ac.uk/~www-jpred/ 109 Tertiary structure prediction • Long segments fold • Folds are held in place by molecular forces (e.g., electrostatic, hydrogen bonds, some covalent bonds) • Proteins fold to minimize energy • Folding algorithms seek conformation with minimum energy • Two main methods – Ab initio – Fold recognition 110 Criteria • Goal: prediction of molecule position within 1 angstrom • Remember, location is everything in enzymes • Measuring quality of fit – Root mean square of atom distances from correct position RMSD = √ (∑di2)/N – Q3 = (true positives + true negatives)/total residues • Better than 70% right is really good! 111 Fold Recognition (Threading) • • • • Impose known folds on molecule; evaluate fit Dissimilar sequences may fold similarly Number of possible folds is finite Many methods of fitting (e.g., dynamic programming, Gibbs sampling, hidden Markov models) • Calculate energy or distances • Web services - many methods http://cubic.bioc.columbia.edu/predictprotein • General recommendation: use many methods and build a consensus 112 Ab Initio methods • • • • • • L. From the beginning (O. E. D.) A real ab initio chemist would complain about use of the term A scoring function is used to judge conformations Search function used to explore conformational space Criterion: usually minimize free energy Scoring function types – Molecular mechanics calculations – Use empirically derived scoring functions based on probability distributions of data in Protein Data Bank • Search function may be coarse-grained or fine-grained, usually matches granularity of scoring function 113 Amber • Assisted Model Building with Energy Refinement • http://amber.scripps.edu • Not open source – modest license cost • D.A. Pearlman, D.A. Case, J.W. Caldwell, W.R. Ross, T.E. Cheatham, III, S. DeBolt, D. Ferguson, G. Seibel and P. Kollman. AMBER, a computer program for applying molecular mechanics, normal mode analysis, molecular dynamics and free energy calculations to elucidate the structures and energies of molecules. Comp. Phys. Commun. 91, 1-41 (1995) • Simulated annealing approach with energy refinements 114 GAMESS • General Atomic and Molecular Electronic Structure System • M.W.Schmidt, M.W., K.K.Baldridge, J.A.Boatz, S.T.Elbert, M.S.Gordon, J.H.Jensen, S.Koseki, N.Matsunaga, K.A.Nguyen, S.Su, T.L.Windus, M.Dupuis, J.A.Montgomery. 1993. General Atomic and Molecular Electronic Structure System J. Comput. Chem.14: 1347-63. • NPACI/SDSC Web portal for GAMESS: https://gridport.npaci.edu/gamess/ • No-cost academic license • It’s parallel 115 CHARMM • • • • CHARMM (Chemistry at HARvard Molecular Mechanics) For prediction of macromolecular structure Not open source - modest license cost CHARMM: A Program for Macromolecular Energy, Minimization, and Dynamics Calculations, J. Comp. Chem. 4, 187-217 (1983), by B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus. • CHARMM: The Energy Function and Its Parameterization with an Overview of the Program, in The Encyclopedia of Computational Chemistry, 1, 271-277, P. v. R. Schleyer et al., editors (John Wiley & Sons: Chichester, 1998), by A. D. MacKerell, Jr., B. Brooks, C. L. Brooks, III, L. Nilsson, B. Roux, Y. Won, and M. Karplus. 116 Rosetta • Work with fragments 3-9 amino acids in length • Restrict conformations of individual fragments to the distribution of conformations exhibited in real proteins • Fragment conformations modeled stochastically using distributions of observed conformations • Seek array of conformations that minimize energy • Local minima likely • Run many searches • Cluster results & pick large clusters as likely conformations • Typically licensed at no cost for academic purposes 117 Molecular Docking • Will two molecules bind? • Usually interested in docking of small molecules (e.g., drug candidates) to proteins • Small molecule called ligand (from Latin ligare - to bind) • Specific question: will ligand bind to a receptor in a protein? • Receptor usually the largest pocket in the surface of a protein • Steps – Characterize receptor site – Orient ligand(s) & evaluate 118 Molecular Docking • Process: usually create negative image of receptor site and ask whether ligands take that conformation • Rigid models use a grid search • Models with flexible surfaces – Usually assume receptor fixed and ligand flexible. – Explore conformation of ligand (simulated annealing) • Models with flexible surfaces do better than those with fixed surfaces • Autodock is a commonly used package – http://www.scripps.edu/pub/olson-web/doc/autodock/ – Flexible model – Can do simulated annealing and genetic algorithms 119 Systems Biology 120 Systems Biology • Special issue of Science: 295, Mar. 2002 • Special issue of Nature: 420, Nov. 2002 • “Systems biology is a new field in biology that aims at a systems-level understanding of biological systems.” • Nobody’s quite sure what it is, but it sure is hot! http://www.ornl.gov/TechResources/Human_Genome/ graphics/slides/images/01-0052_web.gif Historical approach to biological experiments • From Lazebnik, Y. 2002. Cancer cell 2:179: Traditional biological experimentation much like the process of trying to fix a broken radio (or if you are or were a 12-year old boy…) • Some typical steps: – – – – Cataloguing components and their attributes Perturbing the system Knock-out experiments Drawing diagrams • Eventually may find a component that, when replaced, repairs the radio • In a very complex system, knowing what all of the parts are, and knowing the function of individual pathways, may still not tell you how the systems work. It may simply be impossible to deduce this from 1-st order interactions • Interactions, multiple changes – Power supply and other components (well-known PC repair example!) – Change everything all at once so that we’ll never know what worked! 121 122 Systems Biology • Systems biology emphasizes close integration of experiment, theory and computational modeling • Goal: understanding the structure and dynamics of biological systems, placing the parts in the context of the dynamic whole – Studies the complex interactions of many levels of biological information – Quantitative, predictive models are central – Computational modeling in particular is a key tool • Why model – You are forced to really state what you are hypothesizing – Allows you to understand an *approximation* of reality in great detail • Computational Cell Biology. 2002. Springer Verlag (Fall et al, eds). • Foundations of systems biology. MIT Press, 2001. Kitano (ed) 123 A small sampling • • • • • • • • • • • • BALSA BASIS BIOCHAM BioCharon biocyc2SBML BioGrid BioNetGen BioPathways Explorer Bio Sketch Pad BioSPICE Dashboard BioSpreadsheet BioUML • • • • • • • • • • • • • Cellware Cytoscape DBsolve Dizzy E-CELL FluxAnalyzer Gepasi INSILICO discovery Jarnac JDesigner JSIM JWS Karyote 124 Example - MCell • MCell is: A General Monte Carlo Simulator of Cellular Microphysiology. http://www.mcell.cnl.salk.edu/ • MCell focuses on simulations using a Brownian dynamics random walk algorithm. • MCell's use to date has been focused on the microphysiology of synaptic transmission. • Images and MCell-related material courtesy of Joel R. Stiles, Pittsburgh SupercomputingCenter and Carnegie Mellon University, and Thomas M. Bartol, Computational Neurobiology Laboratory, The Salk Institute.http://www.mcell.cnl.salk.edu/ 125 MCell Scalability Images and MCell-related material courtesy of Joel R. Stiles, Pittsburgh Supercomputing Center and Carnegie Mellon University, and Thomas M. Bartol, Computational Neurobiology Laboratory, The Salk Institute. http://www.mcell.cnl.salk.edu/ 126 CompuCell • CompuCell currently uses a combination of "extended Potts model" for cell sorting and clustering, and "Schnakenberg Reaction Diffusion" equations to establish the underlying chemical field to which cells respond and form typical patterns found in such biological systems as a growing chicken limb. • http://www.nd.edu/~icsb/ Image courtesy of James Glazier http://www.biocomplexity.indiana.edu/software.php 127 SBML • There is currently a proliferation of software, but no single package answers all needs • Systems Biology Markup Language – Purpose: develop software and standards to – Enable sharing of simulation & analysis software – Enable sharing of models • • • • Goal: make it easier to share than to reimplement An XML-based markup language Active and functional leadership and reasonable funding stream SBML is focused on biochemical networks, but of all of the biologyoriented markup languages, it seems to be the one with the most traction • Permits storage, transmission, and reuse of models • Consists of “levels” 128 What does an SBML model look like? <?xml version="1.0" encoding="UTF-8" ?> - <sbml xmlns="http://www.sbml.org/sbml/level1" version="2" level="1" xmlns:celldesigner="http://www.sbml.org/2001/ns/celldesigner"> - <model name="ban00010"> - <annotation> <celldesigner:modelVersion>2.2</celldesigner:modelVersion> <celldesigner:modelDisplay sizeX="876" sizeY="1177" /> - <celldesigner:listOfCompartmentAliases> - <celldesigner:compartmentAlias id="ca1" compartment="uVol"> <celldesigner:class>SQUARE</celldesigner:class> <celldesigner:bounds x="10.0" y="10.0" w="856" h="1157" /> </celldesigner:compartmentAlias> </celldesigner:listOfCompartmentAliases> - <celldesigner:listOfSpeciesAliases> - 129 SBML Levels • Level 1 – Biochemical networks. Frozen. • Level 2 – enhancements and extensions to level 1. Frozen June 2003. – Uses MathML for equation specifications – Uses same metadatascheme as CellML (exp named function defs), catalysts, time delays – Fixes minor issues in Level 1 specification – Any Level 1 model can be run within software that supports Level 2 • Level 3 – current development effor 130 Components of a Level 1 or 2 model • Compartment: a well-stirred container • Species: chemical compounds • Reaction: transformation, transport, or binding process involving a species. May have a rate parameter • Parameter: a quantity that has a symbolic name (global and local) • Unit definition • Rule: added to set constraints, initial conditions, bounds, etc on the reactions • Everything in SBML is one of the above! 131 So you actually want to run one… • MANY programs will handle a model written in SBML • libSBML provides a C/C++ API if you want to write your own • Math SBML – an open source toolbox for running SBML models within Mathematica • SBML Toolbox – the equivalent for MatLab • While an open source toolkit for a proprietary software package seems odd at first blush… • There is a KEGG to SBML converter! 132 JWS Online From http://jjj.biochem.sun.ac.za/ 133 134 CellML • Originally designed to describe and exchange models of cellular and subcellular processes. • http://www.cellml.org/public/about/what_is_cellml.html • XML-based specification of interchange of cell model information • Includes: • Information about model structure • Math, based on MathML • Metadata about the model • Project of Bioengineering Institute of University of Auckland with support from Physiome Sciences Inc. 135 BioSpice • • • • • Lead by Adam Arkin – a DARPA-backed effort Described in some detail in two recent issues of “-Omics” www.biospice.org More licensing term details than many open source efforts The BioSpice Dashboard may be one of the better “integrative” tools under development at present • Uses SBML for model specification 136 Systems biology URLs • • • • • • • • • • SBW & SBML www.sbw-sbml.org NetBuilder strc.herts.ac.uk/bio/Maria/NetBuilder CellML www.cellml.org Jarnac + JDesigner www.cds.caltech.edu/~hsauro Gepasi www.gepasi.org Virtual Cell www.nrcam.uchc.edu/ (NIH-supported) E-CELL www.e-cell.org (based in Japan JigCell gnida.cs.vt.edu/~cellcyclepse/ DARPA BioSPICEwww.biospice.org Karyote http://biodynamics.indiana.edu/overview/ 137 Some Good Books • Winter, P.C., G.I. Hickey, H.L. Fletcher. 1998. Instant notes in genetics. Springer-Verlag, NY. ISBM 0-387-91562-1 • Durbin, R., S. Eddy, A. Krogh, G. Mitchison. 2000. Biological sequence analysis. Cambridge University Press. • Gibas, C., and P. Jambeck. 2001. Developing bioinformatics computer skills. O’Reilly. • Tisdall, J. 2001. Beginning perl for bioinformatics. O’Reilly. • Gusfield, D. 1997. Algorithms on strings, trees, and sequences. Cambridge University Press. • Berman, F., G.C. Fox, A.J.G. Hey. (eds) 2003. Grid computing: making the grid infrastructure a reality. Wiley, Sussex 138 And a few semi-random other matters 139 GeneIndex • • • • • Location of initiators, promoters, etc. a key question in genomics First step in this is creating a dictionary of words of various lengths (many possible next steps) To be useful, analysis must be performed on entire genomes at once GeneIndex finds frequencies and positions of all words of a given length in a DNA sequence. Visualization with Tcl/Tk. Genome is broken up into n sections, where n = number of processors After each segment is analyzed, linked lists are joined 140 141 GeneIndex Scalability: Speedup Drosophila 70 Speedup 60 50 40 30 20 10 0 0 20 40 Num ber of CPU 60 80 142 BioPerl • Duct tape for DNA/protein sequence bioinformatics • Perl API for – Reading many data formats/data sources – Writing many data formats/data sources – Manipulating data objects in well known ways • Object oriented Perl module(s) • As with all frameworks, it can be extremely painful for quick and dirty applications • http://bio.perl.org/ • Siblings: BIOPYTHON (www.biopython.org), BIOJAVA (www.biojava.org) 143 Apple bioclusters • Apple Xserve clusters: head node plus compute nodes • Batch & queuing with Platform LSF, Sun Grid Engine, Open PBS, or PBS Pro • Parallel API’s: MPICH, MPI Pro, LAM/MPI • Globus tookit available • iNquiry Bioinformatics tools available – Many open source bioinformatics packages pre-compiled – All available through Pise web interface • “Easy” system/cluster administration tools • http://www.apple.com/ 144 BIRN • Biomedical Informatics Research Network • http://www.nbirn.net/ • NIH-sponsored attempt to create health-oriented cyberinfrastructure • Function BIRN – brain function and disorders, e.g. schizophrenia • Morphometry BIRN – brain structural disorders, e.g. Alzheimers • Mouse BIRN – studying mouse brain and mouse models of human brain disorders • Grid technology, using federated data system approach, based on Globus, SRB, etc. 145 Grid.org • • • • Volunteer cycles to handle various problems Uses United Devices software Human proteome project – uses Rosetta Cancer research – does screening against potential projects Computational biology, biomedical research, and HPC • Two challenges: – Scalability of applications – Wall-clock time sensitivity • Bioinformatics, Genomics, Proteomics, ____ics will radically change understanding of biological function and the way biomedical research is done. • Traditional biomedical researchers must take advantage of new possibilities • Computer-oriented researchers must take advantage of the knowledge held by traditional biomedical researchers 146 147 Future directions & needs • Drug Discovery – Target generation – so what – Target verification – that’s important! – Toxicity prediction – VERY important!! (Cholesterol example) – Counterintuitive problem: the more personalized a therapy is, the smaller its target audience! • Many biologists are unfamiliar with the real possibilities • Useful applications may require straightforward application of well known principles, but writing a parallel application that can be used to treat people is a very difficult challenge • Attacks on all fronts simultaneously are needed • Interactive Tera-scale applications might for many biologists be more valuable right now than Peta-scale applications (even if we had them!) • Portals and the TeraGrid –> solutions to problems that biologists care about • Lots of open source codes are out there waiting for you 148 Acknowledgments • • • Some of the research described herein was supported by the following:\ – The Indiana Genomics Initiative of Indiana University, supported in part by Lilly Endowment Inc. – The Indiana METACyt Initiative of Indiana University, supported in part by Lilly Endowment Inc. – Shared University Research grants from IBM, Inc. to Indiana University. – National Science Foundation under Grant No. 0116050 and Grant No. CDA9601632. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. Some of the ideas presented here were developed while the senior author was a visiting scientist at Höchstleistungsrechenzentrum Universität Stuttgart. Thanks to HLRS and everyone I have worked with there, especially Michael Resch, Matthias Müller, Peggy Lindner, Matthias Hess, Rainer Keller, and Edgar Gabriel. John Herrin, Malinda Lingwall, & W. Leslie Teach assisted with graphics