ICB Fall 2009
G4120: Introduction to Computational Biology
Oliver Jovanovic, Ph.D.
Columbia University
Department of Microbiology & Immunology
Copyright © 2009 Oliver Jovanovic, All Rights Reserved.

Lecture 1 Introduction to Computing September 17, 2009

Growth of GenBank
[Figure: growth of GenBank; axis labels 1982, 1990, 2000, 2008]

A Brief History of Computing
35000 BC  Tally systems  African & European
8500 BC  Prime system  African
1000 BC  Abacus  Chinese & Babylonian
100 BC  Antikythera mechanism  Greek
1500  Mechanical calculator  Leonardo da Vinci
1621  Slide rule  William Oughtred
1642  Arithmetic Machine  Blaise Pascal
1822  Difference Engine  Charles Babbage
1831  Computer program  Lady Ada Lovelace
1936  Z1 Computer  Konrad Zuse
1936  Turing Machine  Alan Turing
1938  Boolean Circuits  Claude Shannon
1943  COLOSSUS  Alan Turing
1945  von Neumann Machine  John von Neumann
1947  Transistor  William Shockley, John Bardeen & Walter Brattain
1958  Integrated Circuit  Jack Kilby & Robert Noyce
1964  Mouse & Graphical User Interface  Douglas Engelbart

The Era of Modern Computing
1969  ARPAnet  UCLA, Stanford, UC Santa Barbara & University of Utah
1969  UNIX  Ken Thompson & Dennis Ritchie, Bell Laboratories
1973  C  Dennis Ritchie & Brian Kernighan, Bell Laboratories
1973  Ethernet  Robert Metcalfe, Harvard University/Xerox PARC
1973  FTP  Alex McKenzie, BBN
1974  TCP  Vint Cerf & Robert Kahn
1975  Microsoft Corporation  Bill Gates & Paul Allen
1976  Apple Computer  Steve Wozniak & Steve Jobs
1978  Usenet  Tom Truscott, Jim Ellis & Steve Bellovin
1981  IBM PC  IBM Corporation
1982  TCP/IP  ARPA
1984  DNS  Jon Postel
1984  Macintosh  Apple Computer
1985  Windows  Microsoft Corporation
1986  NeXT Computer  Steve Jobs
1989  HTML & HTTP  Tim Berners-Lee, CERN
1990  BSD Unix NR1  University of California, Berkeley
1991  Linux  Linus Torvalds
1993  Mosaic  Marc Andreessen
2001  OS X  Apple Computer
2004  Google  Larry Page
& Sergey Brin
2008  Cloud Computing  Amazon EC2 and EBS

Growth of the Internet

Overview
In 1981, there were 213 computers acting as Internet hosts. By the beginning of 1994, 2.2 million computers were acting as hosts. As of 2009, it is estimated that over 700 million Internet hosts exist.

Computers in Biology

Algorithms
• An algorithm is simply a series of steps used to solve a problem. One of a computer’s great strengths is its ability to rapidly and accurately carry out repetitive steps in an algorithm. Many algorithms of use to biologists could not be practically applied without computers.
• Early algorithms for comparing sequences to each other attempted to find optimal alignments. With the tremendous growth of sequence data, many modern search algorithms use heuristic (rule of thumb) approaches which may not find the optimal alignment.
• Early algorithms for searching sequence data for significant matches depended on consensus sequences. It rapidly became clear that biologically significant sequences rarely matched a consensus perfectly, and more sophisticated approaches were adopted, including the use of matrices, Markov chains and hidden Markov models.

Data Storage and Databases
• The tremendous growth of sequence and other biological data has made storing such data in digital form on modern computers a necessity. More and more of this data is being stored in databases to make it easier to retrieve and analyze.
• A database is a structured collection of data stored on a computer that can be accessed using a query language, which greatly simplifies asking questions of the data. The current trend is to make biological databases Internet accessible.

Internet
• The World Wide Web technologies (the HTML markup language and the HTTP protocol) that are the foundation of much of the modern Internet were originally developed by Tim Berners-Lee at CERN to allow scientists to share research data.
• Computational biology uses applications with Internet connectivity (EndNote, MacVector), Internet applications (DNA Artist), Web applications (BLAST, GenMark, Phylodendron) and Internet databases (GenBank, PubMed), among other Internet host resources.

History of Computational Biology
1869  DNA  Johann Friedrich Miescher
1924  Chromosomal DNA  Robert Feulgen
1928  Transforming principle  Frederick Griffith
1944  DNA transformation  Oswald Avery, Maclyn McCarty & Colin MacLeod
1948  Information Theory  Claude Shannon
1949  Chargaff’s Rule  Erwin Chargaff
1953  Double helix  James Watson & Francis Crick
1955  Protein sequencing  Fred Sanger
1961  Codons  Sidney Brenner & Francis Crick
1966  Genetic code  Marshall Nirenberg, Robert Holley & Har Gobind Khorana
1970  Restriction enzyme  Hamilton Smith, Johns Hopkins
1970  Needleman-Wunsch  S. Needleman & C. Wunsch
1971  MEDLINE  NIH/NLM
1977  DNA sequencing  Allan Maxam & Walter Gilbert/Frederick Sanger
1977  Staden programs  Roger Staden
1981  Smith-Waterman  Temple Smith & Michael Waterman
1982  GenBank  LANL/EMBL/NCBI
1988  NCBI  NIH/NLM
1988  FASTA  William Pearson & David Lipman
1988  DNA Strider  Christian Marck
1990  BLAST  Stephen Altschul & David Lipman, NCBI
1994  DNA computer  Leonard Adleman
1997  PubMed  NCBI

The Genomics Era

Overview of Published Genomes
1980  φX174 (5,386 bp)
1981  Human mitochondria (16,569 bp)
1981  Poliovirus (7,440 bp)
1990  Human Genome Project
1992  The Institute for Genomic Research
1994  RK2 (60,099 bp)
1995  Haemophilus influenzae (1.8 Mb)
1995  Mycoplasma genitalium (0.58 Mb)
1996  Methanococcus jannaschii (1.6 Mb)
1996  Saccharomyces cerevisiae (12.1 Mb)
1997  Escherichia coli (4.7 Mb)
1998  Celera, Inc.
1998  Caenorhabditis elegans (97 Mb)
2000  Drosophila melanogaster (180 Mb)
2000  Arabidopsis thaliana (115 Mb)
2001  Salmonella typhimurium (4.8 Mb)
2001  Homo sapiens (2.9 Gb)
2002  Mus musculus (2.9 Gb)
2003  Nanoarchaeum equitans (0.49 Mb)
2004  Legionella pneumophila (3.4 Mb)
2005  Pan troglodytes (2.8 Gb)

Growth of Sequenced Prokaryotic Genomes
[Figure: growth of sequenced prokaryotic genomes]
Source: David W. Ussery, Genome Update: 161 prokaryotic genomes sequenced, and counting, Microbiology. 2004 Feb;150 (Pt. 2): 261-3.

Evolution of Operating Systems
[Figure: evolution of operating systems; lineages shown: Unix, Apple, Windows]

Macintosh OS X Architecture

Classic
OS X 10.4 (and below) provides support through the Classic Environment for older Macintosh applications (OS 9 and below).

User Experience
The layer with which most users interact with the Macintosh includes Aqua (the graphical user interface (GUI) of OS X), Dashboard (which manages and displays desktop widgets), Spotlight (which provides system wide search and indexing through the use of metadata) and Accessibility (assistive technology for the disabled).

Computers and Sequence Analysis
SeqMatrix E. coli promoter output:
DNA Location: 3,075
Spacer Length: 11
Similarity Score: 55.29
CGACATTGCTTGACCC <11> GCGTGTTCAATTCG

Computers and Phylogenetic Analysis

Computers and Data Visualization

Computers and Multimedia

L27758. Birmingham IncP-a...[gi:508311]
Related Sequences, PubMed, Taxonomy

LOCUS       BIACOMGEN 60099 bp DNA linear BCT 08-JUL-1994
DEFINITION  Birmingham IncP-alpha plasmid (R18, R68, RK2, RP1, RP4) complete genome.
ACCESSION   L27758
VERSION     L27758.1 GI:508311
KEYWORDS    complete genome.
SOURCE      Birmingham IncP-alpha plasmid (plasmid Birmingham IncP-alpha plasmid, kingdom Prokaryotae) DNA.
  ORGANISM  Birmingham IncP-alpha plasmid
            broad host range plasmids.
REFERENCE   1 (bases 1 to 60099)
  AUTHORS   Pansegrau,W., Lanka,E., Barth,P.T., Figurski,D.H., Guiney,D.G., Haas,D., Helinski,D.R., Schwab,H., Stanisich,V.A. and Thomas,C.M.
  TITLE     Complete nucleotide sequence of Birmingham IncP-alpha plasmids: compilation and comparative analysis
  JOURNAL   J. Mol. Biol. 239, 623-663 (1994)
  MEDLINE   94285211
FEATURES             Location/Qualifiers
     source          1..60099
                     /organism="Birmingham IncP-alpha plasmid"
                     /plasmid="Birmingham IncP-alpha plasmid"
                     /db_xref="taxon:35419"
BASE COUNT  10839 a 18681 c 18448 g 12131 t
ORIGIN
        1 ttcacccccg aacacgagca cggcacccgc gaccactatg ccaagaatgc ccaaggtaaa
       61 aattgccggc cccgccatga agtccgtgaa tgccccgacg gccgaagtga agggcaggcc
      121 gccacccagg ccgccgccct cactgcccgg cacctggtcg ctgaatgtcg atgccagcac
      181 ctgcggcacg tcaatgcttc cgggcgtcgc gctcgggctg atcgcccatc ccgttactgc
      241 cccgatcccg gcaatggcaa ggactgccag cgccgcgatg aggaagcggg tgccccgctt
      301 cttcatcttc gcgcctcggg cctcgaggcc gcctacctgg gcgaaaacat cggtgtttgt
etc.

Software Applications

Microsoft Office
• Microsoft’s office suite, including Word, Excel and PowerPoint.
• Office 2003 and 2007 for Windows or Office 2004 and 2008 for Mac can be downloaded for free by students at http://www.columbia.edu/acis/software/msoffice/
• Macintosh users have an alternative in Apple’s iWork. Apple’s Keynote is particularly good presentation software.

EndNote
• Bibliographic search and storage utility.
• The latest version, EndNote X2, for Windows or Mac can be downloaded for free by students at http://www.columbia.edu/acis/software/endnote/

Antivirals
• If running Windows, make sure to download Symantec AntiVirus, update it, and keep it updated. It is not essential for Mac users.
• Symantec AntiVirus 11 for Windows can be downloaded for free by students at http://www.columbia.edu/acis/software/nav/

Internet Addressing

IP Address (Internet Protocol Address)
An IP address is a 32-bit number, written as four decimal numbers in the range 0-255 separated by dots (e.g. 128.59.48.24). Columbia University Medical Center (CUMC) IP addresses will always have the format 156.111.x.x or 156.145.x.x.

Subnet Mask
A subnet mask defines a local network, called a subnet, within a larger network. CUMC subnet masks will always have the format 255.255.255.0.

Router
A device that routes packets of data between networks. A router sits between your computer and local area network and the networks beyond it. CUMC router IP addresses will always have the format 156.111.x.1 or 156.145.x.1.

DNS (Domain Name System)
DNS servers are specialized servers that automatically translate an easy to remember domain name (e.g. microbiology.columbia.edu) into the appropriate IP address (e.g. 156.111.98.150). CUMC DNS IP addresses are 156.111.60.150 and 156.111.70.150.

Ethernet Address
A unique 48-bit number, usually written as 12 hexadecimal digits separated by colons in groups of two (e.g. 00:03:93:bc:3c:18), which is assigned to every piece of network hardware, including Ethernet cards and AirPort cards. It is also called a MAC (media access control) address.

Search Domains
This optional information is automatically appended to names you type in Internet applications. If you have defined a search domain of columbia.edu, typing www will take you to www.columbia.edu.

CUMC Internet Setup

Overview
The CUMC (Columbia University Medical Center) campus network has two core routers, both redundantly linked to a router in each building. Each floor of a building then has its own router.
Several microwave links and a high speed cable connect the core routers to the downtown Columbia campus, which has multiple high speed cable connections to the rest of the Internet. The CUMC network is walled off from the rest of the Internet by a firewall and is centrally administered by a group called CUMC IT. See http://cumc.columbia.edu/it/ for details.

You will need an IP address for Internet access from your lab. To register a computer for access from a lab, see http://cumc.columbia.edu/it/getting_started/wired.html for details, and use the New IP Request form. For Internet access from a dorm room, see http://cumc.columbia.edu/it/getting_started/resnet.html for details. You will need to provide your computer’s Ethernet adapter hardware address, also known as the Media Access Control (MAC) address. This is a 12-digit hexadecimal number (e.g. 00:00:af:a0:b1:89).

If you have problems getting connected, try calling the CUMC computer help line at 5-HELP.

Network Settings
IP Address: 156.111.x.x or 156.145.x.x
Subnet Mask: 255.255.255.0
Router: 156.111.x.1 or 156.145.x.1
DNS Servers: 156.111.60.150, 156.111.70.150

Columbia University Email Setup

Apple Mail
Apple’s email client, officially supported by Columbia University, features sophisticated sorting and junk mail filtering capabilities, and is linked to the Address Book application. Follow these configuration instructions: http://www.columbia.edu/acis/software/applemail/config.html

Outlook Express
The Windows email client, officially supported by Columbia University.
Follow these configuration instructions: http://www.columbia.edu/acis/software/outlookexpress/config.html

Off Campus Access
If you are not on your regular computer, you should still be able to access your email through CubMail, a web based system: https://cubmail.cc.columbia.edu/horde/imp/login.php

Cisco VPN Client
If you are traveling, or live in non campus housing and need to connect to a server on campus, or want to be able to read journal articles as if you were on a Columbia computer, you will need to install Cisco VPN (Virtual Private Network) client software on your computer. Windows or Mac versions of the software can be downloaded and configured by following these instructions: http://cumc.columbia.edu/it/getting_started/vpn.html

References

Recommended Macintosh OS X Books
Mac OS X Tiger Unleashed by John Ray & William C. Ray
Mac OS X: The Missing Manual, Tiger Edition by David Pogue
Mac OS X Tiger Killer Tips by Scott Kelby
Mac OS X Tiger Timesaving Techniques by Larry Ullman & Marc Liyanage

Recommended Computational Biology Books
Fundamental Concepts of Bioinformatics by Dan E. Krane & Michael L. Raymer
Developing Bioinformatics Computer Skills by Cynthia Gibas & Per Jambeck
Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, Third Edition edited by Andreas D. Baxevanis & B. F. Francis Ouellette
BLAST: An Essential Guide to the Basic Local Alignment Search Tool by Ian Korf, Mark Yandell & Joseph Bedell
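
Worked Example: IP Addresses and Subnet Masks
The Internet Addressing definitions in this lecture (a 32-bit IP address, a 255.255.255.0 subnet mask, and a router at x.1) can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function name describe_host is hypothetical, the address 156.111.98.150 is the example address quoted in the DNS definition, and the "router at .1" rule is simply the CUMC convention described in the lecture, not a general property of networks.

```python
import ipaddress


def describe_host(ip: str, mask: str = "255.255.255.0") -> dict:
    """Summarize the subnet a dotted-decimal IPv4 address belongs to.

    describe_host is a hypothetical helper written for this example.
    """
    # Combine address and mask; the standard library accepts "ip/netmask".
    iface = ipaddress.IPv4Interface(f"{ip}/{mask}")
    network = iface.network
    return {
        # The same address viewed as a single 32-bit number.
        "as_integer": int(iface.ip),
        # The subnet in prefix notation (255.255.255.0 is a /24).
        "network": str(network),
        # Assumption: the router is the first address in the subnet,
        # following the x.1 convention stated for CUMC.
        "router_by_convention": str(network.network_address + 1),
        # Exclude the network and broadcast addresses.
        "usable_hosts": network.num_addresses - 2,
    }


info = describe_host("156.111.98.150")
for key, value in info.items():
    print(f"{key}: {value}")
```

With a 255.255.255.0 mask, the first three numbers of the address identify the subnet and the last number identifies the host, which is why two lab computers whose addresses differ only in the last number can reach each other without going through the router.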