ICB Fall 2009
G4120: Introduction to
Computational Biology
Oliver Jovanovic, Ph.D.
Columbia University
Department of Microbiology & Immunology
Copyright © 2009 Oliver Jovanovic, All Rights Reserved.
Lecture 1
Introduction to
Computing
September 17, 2009
Growth of GenBank
[Figure: growth of GenBank, 1982 to 2008]
A Brief History of Computing
35000 BC
Tally systems
African & European
8500 BC
Prime system
African
1000 BC
Abacus
Chinese & Babylonian
100 BC
Antikythera mechanism
Greek
1500
Mechanical calculator
Leonardo da Vinci
1621
Slide rule
William Oughtred
1642
Arithmetic Machine
Blaise Pascal
1822
Difference Engine
Charles Babbage
1843
Computer program
Lady Ada Lovelace
1936
Z1 Computer
Konrad Zuse
1936
Turing Machine
Alan Turing
1938
Boolean Circuits
Claude Shannon
1943
COLOSSUS
Tommy Flowers
1945
von Neumann Machine
John von Neumann
1947
Transistor
William Shockley, John Bardeen & Walter Brattain
1958
Integrated Circuit
Jack Kilby & Robert Noyce
1964
Mouse & Graphical User Interface
Douglas Engelbart
The Era of Modern Computing
1969
ARPAnet
UCLA, Stanford, UC Santa Barbara & University of Utah
1969
UNIX
Ken Thompson & Dennis Ritchie, Bell Laboratories
1973
C
Dennis Ritchie & Brian Kernighan, Bell Laboratories
1973
Ethernet
Robert Metcalfe, Harvard University/Xerox PARC
1973
FTP
Alex McKenzie, BBN
1974
TCP
Vint Cerf & Robert Kahn
1975
Microsoft Corporation
Bill Gates & Paul Allen
1976
Apple Computer
Steve Wozniak & Steve Jobs
1978
Usenet
Tom Truscott, Jim Ellis & Steve Bellovin
1981
IBM PC
IBM Corporation
1982
TCP/IP
ARPA
1984
DNS
Jon Postel
1984
Macintosh
Apple Computer
1985
Windows
Microsoft Corporation
1986
NeXT Computer
Steve Jobs
1989
HTML & HTTP
Tim Berners-Lee, CERN
1990
BSD Unix NR1
University of California, Berkeley
1991
Linux
Linus Torvalds
1993
Mosaic
Marc Andreessen
2001
OS X
Apple Computer
2004
Google
Larry Page & Sergey Brin
2008
Cloud Computing
Amazon EC2 and EBS
Growth of the Internet
Overview
In 1981, there were 213 computers acting as Internet hosts. By the beginning of 1994, 2.2 million
computers were acting as hosts. Currently, it is estimated that over 700 million Internet hosts
exist.
Computers in Biology
Algorithms
• An algorithm is simply a series of steps used to solve a problem. One of a computer’s great
strengths is its ability to rapidly and accurately repeat the steps in an algorithm. Many
algorithms of use to biologists could not be practically applied without computers.
• Early algorithms for comparing sequences to each other attempted to find optimal alignments.
With the tremendous growth of sequence data, many modern search algorithms use heuristic
(rule of thumb) approaches which may not find the optimal alignment.
• Early algorithms for searching sequence data for significant matches depended on consensus
sequences. It rapidly became clear that biologically significant sequences rarely perfectly
matched a consensus, and more sophisticated approaches were adopted, including the use of
matrices, Markov chains and hidden Markov models.
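As an illustration of an algorithm that finds an optimal alignment, the following is a minimal sketch of global alignment scoring in the style of Needleman-Wunsch. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for the example, not values taken from this lecture.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b.

    Uses dynamic programming with a linear gap penalty. Only the previous
    row of the score table is kept, so memory use is proportional to len(b).
    """
    # Row 0: aligning an empty prefix of a against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] against an empty prefix of b costs i gaps.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            # Option 1: align a[i-1] with b[j-1] (match or mismatch).
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Option 2: gap in b. Option 3: gap in a.
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]
```

For example, needleman_wunsch_score("ACGT", "AGT") evaluates to 2 (three matches and one gap). The table has one cell per pair of positions, which is why exhaustive optimal alignment becomes slow against large databases and why heuristic searches are used there.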
Data Storage and Databases
• The tremendous growth of sequence and other biological data has made storing such data in
digital form on modern computers a necessity. More and more of this data is being stored in
databases to make it easier to retrieve and analyze.
• A database is a structured collection of data stored on a computer that can be accessed using a
query language, which greatly simplifies asking questions of the data. The current trend is to
make biological databases Internet accessible.
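To make the idea of a query language concrete, here is a small sketch using SQL through Python's built-in sqlite3 module. The table name, column names and helper functions are hypothetical; the genome sizes are the ones listed later in this lecture.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database holding a few published genomes."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE genome (organism TEXT, year INTEGER, size_mb REAL)"
    )
    rows = [
        ("Haemophilus influenzae", 1995, 1.8),
        ("Mycoplasma genitalium", 1995, 0.58),
        ("Saccharomyces cerevisiae", 1996, 12.1),
        ("Escherichia coli", 1997, 4.7),
    ]
    conn.executemany("INSERT INTO genome VALUES (?, ?, ?)", rows)
    return conn


def genomes_larger_than(conn, min_mb):
    """Ask a question of the data: which genomes exceed min_mb megabases?"""
    cursor = conn.execute(
        "SELECT organism FROM genome WHERE size_mb > ? ORDER BY size_mb",
        (min_mb,),
    )
    return [row[0] for row in cursor.fetchall()]
```

Calling genomes_larger_than(build_demo_db(), 2.0) returns the two organisms with genomes above 2 Mb, smallest first. The question is stated declaratively in the SELECT statement; the database decides how to search the stored rows.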
Internet
• The World Wide Web computer protocols (HTML and HTTP) that are the foundation of
much of the modern Internet were originally developed by Tim Berners-Lee at CERN to allow
scientists to share research data.
• Computational biology uses applications with Internet connectivity (EndNote, MacVector),
Internet applications (DNA Artist), Web applications (BLAST, GenMark, Phylodendron) and
Internet databases (GenBank, PubMed), among other Internet host resources.
History of Computational Biology
1869
DNA
Johann Friedrich Miescher
1924
Chromosomal DNA
Robert Feulgen
1928
Transforming principle
Frederick Griffith
1944
DNA transformation
Oswald Avery, Maclyn McCarty & Colin MacLeod
1948
Information Theory
Claude Shannon
1949
Chargaff’s Rule
Erwin Chargaff
1953
Double helix
James Watson & Francis Crick
1955
Protein sequencing
Fred Sanger
1961
Codons
Sidney Brenner & Francis Crick
1966
Genetic code
Marshall Nirenberg, Robert Holley & Har Gobind Khorana
1970
Restriction enzyme
Hamilton Smith, Johns Hopkins
1970
Needleman-Wunsch
S. Needleman & C. Wunsch
1971
MEDLINE
NIH/NLM
1977
DNA sequencing
Allan Maxam & Walter Gilbert/Frederick Sanger
1977
Staden programs
Roger Staden
1981
Smith-Waterman
Temple Smith & Michael Waterman
1982
GenBank
LANL/EMBL/NCBI
1988
NCBI
NIH/NLM
1988
FASTA
William Pearson & David Lipman
1988
DNA Strider
Christian Marck
1990
BLAST
Stephen Altschul & David Lipman, NCBI
1994
DNA computer
Leonard Adleman
1997
PubMed
NCBI
The Genomics Era
Overview of Published Genomes
1980
φX174 (5,386 bp)
1981
Human mitochondria (16,569 bp)
1981
Poliovirus (7,440 bp)
1990
Human Genome Project
1992
The Institute for Genomic Research
1994
RK2 (60,099 bp)
1995
Haemophilus influenzae (1.8 Mb)
1995
Mycoplasma genitalium (0.58 Mb)
1996
Methanococcus jannaschii (1.6 Mb)
1996
Saccharomyces cerevisiae (12.1 Mb)
1997
Escherichia coli (4.7 Mb)
1998
Celera, Inc.
1998
Caenorhabditis elegans (97 Mb)
2000
Drosophila melanogaster (180 Mb)
2000
Arabidopsis thaliana (115 Mb)
2001
Salmonella typhimurium (4.8 Mb)
2001
Homo sapiens (2.9 Gb)
2002
Mus musculus (2.9 Gb)
2003
Nanoarchaeum equitans (0.49 Mb)
2004
Legionella pneumophila (3.4 Mb)
2005
Pan troglodytes (2.8 Gb)
Growth of Sequenced Prokaryotic Genomes
Source: David W. Ussery, Genome Update: 161
prokaryotic genomes sequenced, and counting,
Microbiology. 2004 Feb;150 (Pt. 2): 261-3.
Evolution of Operating Systems
Unix
Apple
Windows
Macintosh OS X Architecture
Classic
OS X 10.4 (and below) provides support through the Classic Environment for older
Macintosh applications (OS 9 and below).
User Experience
The layer with which most users interact with the Macintosh includes Aqua (the
graphical user interface (GUI) of OS X), Dashboard (which manages and displays
desktop widgets), Spotlight (which provides system wide search and indexing
through the use of metadata) and Accessibility (assistive technology for the
disabled).
Computers and
Sequence Analysis
SeqMatrix E. coli promoter output:
DNA Location: 3,075
Spacer Length: 11
Similarity Score: 55.29
CGACATTGCTTGACCC <11> GCGTGTTCAATTCG
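The output above reports a location, a spacer length and a similarity score. How SeqMatrix computes its score is not described in this lecture, so the following is only a minimal sketch of the general idea of searching for a two-part promoter. It assumes the commonly cited E. coli -35 (TTGACA) and -10 (TATAAT) consensus hexamers, scores a site by simply counting matching bases, and counts the spacer between the two hexamers, which may differ from how SeqMatrix defines it. The function names and thresholds are hypothetical.

```python
MINUS_35 = "TTGACA"  # commonly cited -35 consensus hexamer
MINUS_10 = "TATAAT"  # commonly cited -10 consensus hexamer


def count_matches(window, consensus):
    """Count positions at which window agrees with the consensus."""
    return sum(1 for x, y in zip(window, consensus) if x == y)


def scan_promoters(seq, min_spacer=15, max_spacer=19, min_score=9):
    """Return (location, spacer, score) for candidate promoter sites.

    location is the 1-based position of the first base of the -35 hexamer,
    and score is the number of matching bases out of 12.
    """
    seq = seq.upper()
    n35, n10 = len(MINUS_35), len(MINUS_10)
    hits = []
    for i in range(len(seq) - n35 + 1):
        score_35 = count_matches(seq[i:i + n35], MINUS_35)
        for spacer in range(min_spacer, max_spacer + 1):
            j = i + n35 + spacer  # start of the candidate -10 hexamer
            if j + n10 > len(seq):
                break  # longer spacers would run past the end as well
            score = score_35 + count_matches(seq[j:j + n10], MINUS_10)
            if score >= min_score:
                hits.append((i + 1, spacer, score))
    return hits
```

Because real promoters rarely match the consensus perfectly, a matrix-based method would replace count_matches with position-specific weights for each base, but the scanning loop would stay much the same.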
Computers and
Phylogenetic Analysis
Computers and
Data Visualization
Computers and
Multimedia
L27758. Birmingham IncP-a...[gi:508311]
Related Sequences, PubMed, Taxonomy

LOCUS       BIACOMGEN              60099 bp    DNA     linear   BCT 08-JUL-1994
DEFINITION  Birmingham IncP-alpha plasmid (R18, R68, RK2, RP1, RP4) complete
            genome.
ACCESSION   L27758
VERSION     L27758.1  GI:508311
KEYWORDS    complete genome.
SOURCE      Birmingham IncP-alpha plasmid (plasmid Birmingham IncP-alpha
            plasmid, kingdom Prokaryotae) DNA.
  ORGANISM  Birmingham IncP-alpha plasmid
            broad host range plasmids.
REFERENCE   1  (bases 1 to 60099)
  AUTHORS   Pansegrau,W., Lanka,E., Barth,P.T., Figurski,D.H., Guiney,D.G.,
            Haas,D., Helinski,D.R., Schwab,H., Stanisich,V.A. and Thomas,C.M.
  TITLE     Complete nucleotide sequence of Birmingham IncP-alpha plasmids:
            compilation and comparative analysis
  JOURNAL   J. Mol. Biol. 239, 623-663 (1994)
  MEDLINE   94285211
FEATURES             Location/Qualifiers
     source          1..60099
                     /organism="Birmingham IncP-alpha plasmid"
                     /plasmid="Birmingham IncP-alpha plasmid"
                     /db_xref="taxon:35419"
BASE COUNT    10839 a  18681 c  18448 g  12131 t
ORIGIN
        1 ttcacccccg aacacgagca cggcacccgc gaccactatg ccaagaatgc ccaaggtaaa
       61 aattgccggc cccgccatga agtccgtgaa tgccccgacg gccgaagtga agggcaggcc
      121 gccacccagg ccgccgccct cactgcccgg cacctggtcg ctgaatgtcg atgccagcac
      181 ctgcggcacg tcaatgcttc cgggcgtcgc gctcgggctg atcgcccatc ccgttactgc
      241 cccgatcccg gcaatggcaa ggactgccag cgccgcgatg aggaagcggg tgccccgctt
      301 cttcatcttc gcgcctcggg cctcgaggcc gcctacctgg gcgaaaacat cggtgtttgt
etc.
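Because a GenBank record is structured text, simple programs can extract values from it. As a small sketch, the following parses the BASE COUNT line of the record above and computes the G+C fraction. The function names are hypothetical, and the parser assumes the line has exactly the "count base" layout shown in this record.

```python
def parse_base_count(line):
    """Turn a GenBank BASE COUNT line into a dict such as {'a': 10839, ...}."""
    tokens = line.split()
    if tokens[:2] != ["BASE", "COUNT"]:
        raise ValueError("not a BASE COUNT line")
    values = tokens[2:]
    # The remaining tokens alternate: count, base, count, base, ...
    return {base: int(count) for count, base in zip(values[0::2], values[1::2])}


def gc_fraction(counts):
    """Return the fraction of bases that are g or c."""
    total = sum(counts.values())
    return (counts["g"] + counts["c"]) / total
```

For the record above, the four counts sum to 60,099, which agrees with the length on the LOCUS line, and the G+C fraction is about 0.618.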
Software Applications
Microsoft Office
• Microsoft’s office suite, including Word, Excel and PowerPoint.
• Office 2003 and 2007 for Windows or Office 2004 and 2008 for Mac can be downloaded for free
by students at http://www.columbia.edu/acis/software/msoffice/
• Macintosh users have an alternative in Apple’s iWork. Apple’s Keynote is particularly good
presentation software.
EndNote
• Bibliographic search and storage utility.
• The latest version, EndNote X2, for Windows or Mac can be downloaded for free by students at
http://www.columbia.edu/acis/software/endnote/
Antivirals
• If running Windows, make sure to download Symantec AntiVirus, update it, and keep it updated. It
is not essential for Mac users.
• Symantec AntiVirus 11 for Windows can be downloaded for free by students at
http://www.columbia.edu/acis/software/nav/
Internet Addressing
IP Address (Internet Protocol Address)
An IP address is a 32 bit number, written in the form of four decimal numbers in the range
0-255 that are separated by dots (e.g. 128.59.48.24). Columbia University Medical Center
(CUMC) IP addresses will always have the format 156.111.x.x or 156.145.x.x.
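To show that the dotted form is just a way of writing one 32 bit number, here is a minimal sketch that converts between the two. The function names are hypothetical; ipaddress is part of the Python standard library and is used only as a cross-check.

```python
import ipaddress


def dotted_to_int(address):
    """Convert a dotted IPv4 address such as '128.59.48.24' to an integer."""
    parts = [int(p) for p in address.split(".")]
    if len(parts) != 4 or any(not 0 <= p <= 255 for p in parts):
        raise ValueError("expected four numbers in the range 0-255")
    value = 0
    for part in parts:
        # Each of the four numbers contributes 8 of the 32 bits.
        value = (value << 8) | part
    return value


def dotted_to_int_stdlib(address):
    """The same conversion using the standard library."""
    return int(ipaddress.IPv4Address(address))
```

For the example address above, dotted_to_int("128.59.48.24") gives 2151362584, which is 128 × 2^24 + 59 × 2^16 + 48 × 2^8 + 24.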
Subnet Mask
A subnet mask allows for defining a local network, called a subnet, within a larger network.
CUMC subnet masks will always have the format 255.255.255.0.
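A subnet mask works by a bitwise AND: the bits set in the mask select the network part of an address. The sketch below illustrates this with Python's standard ipaddress module. The function names are hypothetical, and 156.111.98.1 is an assumed address chosen only to fit the 156.111.x.x format.

```python
import ipaddress


def network_of(address, mask):
    """Return the network address obtained by ANDing address with mask."""
    addr_bits = int(ipaddress.IPv4Address(address))
    mask_bits = int(ipaddress.IPv4Address(mask))
    return str(ipaddress.IPv4Address(addr_bits & mask_bits))


def same_subnet(addr_a, addr_b, mask="255.255.255.0"):
    """True if both addresses fall in the same subnet under the given mask."""
    # strict=False allows a host address (not only a network address) here.
    network = ipaddress.ip_network(f"{addr_a}/{mask}", strict=False)
    return ipaddress.ip_address(addr_b) in network
```

With a mask of 255.255.255.0, network_of("156.111.98.150", "255.255.255.0") is "156.111.98.0", so only addresses that share the first three numbers are on the same subnet.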
Router
A device that routes packets of data between networks. A router sits between your computer
and local area network and the networks beyond it. CUMC router IP addresses will always have
the format 156.111.x.1 or 156.145.x.1.
DNS (Domain Name System) Server
These specialized servers automatically translate an easy to remember domain name (e.g.
microbiology.columbia.edu) into the appropriate IP address (e.g. 156.111.98.150).
CUMC DNS IP addresses are 156.111.60.150 and 156.111.70.150.
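From a program's point of view, a DNS lookup is a single call that takes a name and returns an address. The sketch below wraps the standard library call socket.gethostbyname in a small resolver that remembers earlier answers. The caching and the function name make_resolver are illustrative additions, not a description of how CUMC's DNS servers work.

```python
import socket


def make_resolver(lookup=socket.gethostbyname):
    """Return a resolve(name) function that caches name-to-address answers.

    lookup is the function that performs the real query; it defaults to the
    operating system's resolver, which in turn asks the configured DNS servers.
    """
    cache = {}

    def resolve(name):
        # Domain names are case-insensitive; a trailing dot is optional.
        key = name.lower().rstrip(".")
        if key not in cache:
            cache[key] = lookup(key)
        return cache[key]

    return resolve
```

On a networked computer, make_resolver()("microbiology.columbia.edu") would return whatever address the DNS servers currently report for that name, which may no longer be the address shown above.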
Ethernet Address
A unique 48 bit number, usually written in the form of 12 hexadecimal digits separated by colons
in groups of two (e.g. 00:03:93:bc:3c:18), which is assigned to every piece of network
hardware, including Ethernet cards and AirPort cards. It is also called a MAC (media access
control) address.
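The colon-separated form is again just a way of writing one number, here 48 bits as 12 hexadecimal digits. A minimal sketch of the conversion in both directions, with hypothetical function names:

```python
import re

# Six groups of two hexadecimal digits separated by colons.
MAC_PATTERN = re.compile(r"^[0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5}$")


def mac_to_int(mac):
    """Convert an address such as '00:03:93:bc:3c:18' to a 48 bit integer."""
    if not MAC_PATTERN.match(mac):
        raise ValueError("expected six colon-separated pairs of hex digits")
    return int(mac.replace(":", ""), 16)


def int_to_mac(value):
    """Convert a 48 bit integer back to the colon-separated form."""
    if not 0 <= value < 2**48:
        raise ValueError("value does not fit in 48 bits")
    digits = f"{value:012x}"  # 12 lowercase hex digits, zero padded
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))
```

Each hexadecimal digit encodes 4 bits, so 12 digits give exactly 48 bits.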
Search Domains
This optional information is automatically appended to names you type in Internet applications.
If you have defined a search domain of columbia.edu, typing www will take you to
www.columbia.edu.
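The effect of a search domain can be sketched as a simple rule applied to the name before it is looked up. This is a simplification: real resolvers have additional rules about when the name is tried as typed. The function name is hypothetical.

```python
def expand_name(name, search_domains):
    """Return the full names to try for a name typed by the user.

    A name without any dot is treated as a short name and each search
    domain is appended in turn; a name containing a dot is used as typed.
    """
    if "." in name:
        return [name]
    return [f"{name}.{domain}" for domain in search_domains]
```

With a search domain of columbia.edu, expand_name("www", ["columbia.edu"]) gives ["www.columbia.edu"], matching the behavior described above.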
CUMC Internet Setup
Overview
The CUMC (Columbia University Medical Center) campus network has two core routers, both
redundantly linked to a router in each building. Each floor of a building then has its own router.
Several microwave links and a high speed cable connect the core routers to the downtown
Columbia campus, which has multiple high speed cable connections to the rest of the Internet.
The CUMC network is walled off from the rest of the Internet by a firewall and is centrally
administered by a group called CUMC IT. See http://cumc.columbia.edu/it/ for details.
You will need an IP Address for Internet access from your lab. To register a computer for access
from a lab, see http://cumc.columbia.edu/it/getting_started/wired.html for details,
and use the New IP Request form. For Internet access from a dorm room, see
http://cumc.columbia.edu/it/getting_started/resnet.html for details. You will need to
provide your computer’s Ethernet adapter hardware address, also known as Media Access Control
(MAC) address. This is a 12 digit hexadecimal number (e.g. 00:00:af:a0:b1:89). If you have problems
getting connected, try calling the CUMC computer help line at 5-HELP.
Network Settings
IP Address: 156.111.x.x or 156.145.x.x
Subnet Mask: 255.255.255.0
Router: 156.111.x.1 or 156.145.x.1
DNS Servers
156.111.60.150
156.111.70.150
Columbia University Email Setup
Apple Mail
Apple’s email client, officially supported by Columbia University, features sophisticated sorting
and junk mail filtering capabilities, linked to the Address Book application. Follow these
configuration instructions:
http://www.columbia.edu/acis/software/applemail/config.html
Outlook Express
Windows’ email client, officially supported by Columbia University. Follow these configuration
instructions:
http://www.columbia.edu/acis/software/outlookexpress/config.html
Off Campus Access
If you are not on your regular computer, you should still be able to access your email through
CubMail, a web based system:
https://cubmail.cc.columbia.edu/horde/imp/login.php
Cisco VPN Client
If you are traveling, or live in non campus housing and need to connect to a server on campus,
or want to be able to read journal articles as if you were on a Columbia computer, you will
need to install Cisco VPN (Virtual Private Network) client software on your computer.
Windows or Mac versions of software can be downloaded and configured by following these
instructions:
http://cumc.columbia.edu/it/getting_started/vpn.html
References
Recommended Macintosh OS X Books
Mac OS X Tiger Unleashed by John Ray & William C. Ray
Mac OS X: The Missing Manual, Tiger Edition by David Pogue
Mac OS X Tiger Killer Tips by Scott Kelby
Mac OS X Tiger Timesaving Techniques by Larry Ullman & Marc Liyanage
Recommended Computational Biology Books
Fundamental Concepts of Bioinformatics by Dan E. Krane & Michael L. Raymer
Developing Bioinformatics Computer Skills by Cynthia Gibas & Per Jambeck
Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, Third Edition
edited by Andreas D. Baxevanis & B. F. Francis Ouellette
BLAST: An Essential Guide to the Basic Local Alignment Search Tool
by Ian Korf, Mark Yandell & Joseph Bedell