ChBi406/506
Ozlem Keskin
For today’s lectures Many slides from gersteinlab.org/courses/452
And
Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN 0-471-21004-8).
Ozlem Keskin
Engin Cukuroglu okeskin@ku.edu.tr
ecukuroglu@ku.edu.tr
Who is taking this course?
• People with very diverse backgrounds in biology, chemical engineering (MS/BS)
• People with diverse backgrounds in computer science please visit Attila Hoca’s office!
• Most people have a favorite gene, protein, or disease
• To provide an introduction to bioinformatics with a focus on the National Center for Biotechnology
Information (NCBI) and EBI
• To focus on the analysis of DNA, RNA and proteins
• To introduce you to the analysis of genomes
• To combine theory and practice to help you solve research problems
Textbooks
Web sites
Literature references
Gene/protein families
Computer labs
Textbook
The course textbook is J. Pevsner, Bioinformatics and
Functional Genomics (Wiley, 2009).
Several other bioinformatics texts are available:
Baxevanis and Ouellette
Mount
Durbin et al.
Lesk
In our library you will find (e-book)
Bioinformatics [electronic resource] : sequence and genome analysis / David W.
Mount.
ImprintCold Spring Harbor, N.Y. : Cold Spring Harbor Laboratory Press, c2001.
Bioinformatics : a practical guide to the analysis of genes and proteins / editedby
Andreas D. Baxevanis, B.F. Francis Ouellette.
ImprintHoboken, N.J. : John Wiley, 2005.
(SOON)
Themes throughout the course:
Literature references
You are encouraged to read original source articles. Although articles are not required, they will enhance your understanding of the material.
You can obtain articles through PubMed and Web of Science.
Web sites
The course website is reached via: http://pevsnerlab.kennedykrieger.org/bioinfo_course.htm
(or Google “pevsnerlab” courses)
This site contains the powerpoints for each lecture.
The textbook website is: http://www.bioinfbook.org
This has 1000 URLs, organized by chapter
This site also contains the same powerpoints.
You will also find the lecture slides at F-folder.
Midterm
Final
HWs
Project
30%
35%
20%
15%
(might change, the course will evolve)
We will use beta globin and retinol-binding protein 4
(RBP4) as model genes/proteins throughout the course.
Globins including hemoglobin and myoglobin carry oxygen. RBP4 is a member of the lipocalin family. It is a small, abundant carrier protein. We will study globins and lipocalins in a variety of contexts including
• --sequence alignment
• --gene expression
• --protein structure
• --phylogeny
• --homologs in various species
The HIV-1 pol gene encodes three proteins
Aspartyl protease
PR
Reverse transcriptase
RT
Integrase
IN
Definition of bioinformatics
Overview of the NCBI website
Accessing information about DNA and proteins
--Definition of an accession number
--Four ways to find information on proteins and DNA
Access to biomedical literature
Biological
Data
+
Computer
Calculations
• Interface of biology and computers
• Analysis of proteins, genes and genomes using computer algorithms and computer databases
• Genomics is the analysis of genomes.
The tools of bioinformatics are used to make sense of the billions of base pairs of DNA that are sequenced by genomics projects.
Protein coordinates, DNA array data, annotated gene sequences
Biological information is being generated now days in parallel.
We can easily run 10,000 simultaneous experiments on a single DNA microarray.
To cope with this much data we really need computers.
So Bioinformatics is that field that combines biology and computers.
Data from the Human Genome Project has fueled the development of new bioinformatics methods
• (Molecular) Bio informatics
• One idea for a definition?
Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical-chemistry) and then applying “informatics” techniques ( derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules, on a large-scale.
• Interface of biology and computers
Analysis of proteins, genes and genomes using computer algorithms and computer databases
Genomics is the analysis of genomes.
The tools of bioinformatics are used to make sense of the billions of base pairs of DNA
• that are sequenced by genomics projects.
[1] Precise models of where and when transcription will occur in a genome (initiation and termination)
[2] Precise, predictive models of alternative RNA splicing
[3] Precise models of signal transduction pathways; ability to predict cellular responses to external stimuli
[4] Determining protein:DNA, protein:RNA, protein:protein recognition codes
[5] Accurate ab initio protein structure prediction
[6] Rational design of small molecule inhibitors of proteins
[7] Mechanistic understanding of protein evolution
[8] Mechanistic understanding of speciation
[9] Development of effective gene ontologies: systematic ways to describe gene and protein function
[10] Education: development of bioinformatics curricula
Source: Ewan Birney,
Chris Burge, Jim Fickett
“Science is about building causal relations between natural phenomena (for instance, between a mutation in a gene and a disease). The development of instruments to increase our capacity to observe natural phenomena has, therefore, played a crucial role in the development of science - the microscope being the paradigmatic example in biology. With the human genome, the natural world takes an unprecedented turn: it is better described as a sequence of symbols. Besides high-throughput machines such as sequencers and DNA chip readers, the computer and the associated software becomes the instrument to observe it, and the discipline of bioinformatics flourishes.
However, as the separation between us (the observers) and the phenomena observed increases (from organism to cell to genome, for instance), instruments may capture phenomena only indirectly, through the footprints they leave.
Instruments therefore need to be calibrated: the distance between the reality and the observation (through the instrument) needs to be accounted for. This issue of
Genome Biology is about calibrating instruments to observe gene sequences; more specifically, computer programs to identify human genes in the sequence of the human genome.”
Martin Reese and Roderic Guig ó, Genome Biology 2006 7(Suppl I):S1, introducing EGASP, the Encyclopedia of DNA Elements (ENCODE)
Genome Annotation Assessment Project
bioinformatics public health informatics medical informatics
Tool-users databases infrastructure algorithms
Tool-makers
Page 4
After Pace NR (1997)
Science 276:734
Page 6
Time of development
Body region, physiology, pharmacology, pathology
Page 5
DNA RNA protein phenotype
Page 5
DNA RNA protein phenotype
Growth of GenBank
Updated 8-12-04:
>40b base pairs
1982 1986 1990 1994 1998 2002
Year
Fig. 2.1
Page 17
Growth of GenBank
70
60
50
40
30
20
10
0
1985 1990 1995 2000
December
1982
June
2006
Growth of the International Nucleotide
Sequence Database Collaboration
Base pairs contributed by GenBank EMBL DDBJ http://www.ncbi.nlm.nih.gov/Genbank/
Central dogma of molecular biology
DNA RNA protein genome transcriptome proteome
Central dogma of bioinformatics and genomics
• Central Dogma of Molecular Biology
DNA
-> RNA
-> Protein
-> Phenotype
-> DNA
• Molecules
– Sequence, Structure, Function
• Processes
– Mechanism, Specificity, Regulation
• Central Paradigm for Bioinformatics
Genomic Sequence
Information
-> mRNA (level)
-> Protein Sequence
-> Protein Structure
-> Protein Function
-> Phenotype
• Large Amounts of Information
– Standardized
– Statistical
•Genetic material
•Information transfer (mRNA)
•Protein synthesis (tRNA/mRNA)
•Some catalytic activity
•Most cellular functions are performed or facilitated by proteins.
•Primary biocatalyst
•Cofactor transport/storage
•Mechanical motion/support
•Immune protection
•Control of growth/differentiation
(idea from D Brutlag, Stanford, graphics from S Strobel)
• Proteins fold into 3D structures with specific functions which are reflected in a pheonotype.
• These functions are selected in a Darwinian sense by the environment of the phenotype.
• Which drives the evolution of the DNA sequence.
• Many Bioinformatics techniques address this flow of molecular biology information inside the organism hoping to understand the organization and control of genes even predicting protein structure from sequence.
• There is a second flow of information that bioinformatics seeks to address is the large amount of data generated by new high through methods.
Bioinformatics owes its lively hood to the availability of large data sets that are too complex to allow manual analysis.
DNA RNA protein phenotype genomic
DNA databases cDNA
ESTs
UniGene protein sequence databases
Fig. 2.2
Page 20
There are three major public DNA databases
EMBL GenBank DDBJ
The underlying raw DNA sequences are identical
Page 16
There are three major public DNA databases
EMBL
Housed at EBI
European
Bioinformatics
Institute
GenBank
Housed at NCBI
National
Center for
Biotechnology
Information
DDBJ
Housed in Japan
Page 16
>100,000 species are represented in GenBank all species viruses bacteria archaea eukaryota
128,941
6,137
31,262
2,100
87,147
Table 2-1
Page 17
Homo sapiens (6.9 million entries)
Mus musculus (5.0 million)
Zea mays (896,000)
Rattus norvegicus (819,000)
Gallus gallus (567,000)
Arabidopsis thaliana (519,000)
Danio rerio (492,000)
Drosophila melanogaster (350,000)
Oryza sativa (221,000)
Taxonomy nodes at NCBI
8/06 http://www.ncbi.nlm.nih.gov/Taxonomy/txstat.cgi
Homo sapiens
Mus musculus
Rattus norvegicus
Danio rerio
Zea mays
Oryza sativa
Drosophila melanogaster
Gallus gallus
Arabidopsis thaliana
10.7 billion bases
6.5b
5.6b
1.7b
1.4b
0.8b
0.7b
0.5b
0.5b
Updated 8-12-04
GenBank release 142.0
Table 2-2
Page 18
Homo sapiens
Mus musculus
Rattus norvegicus
Danio rerio
Bos taurus
Zea mays
Oryza sativa (japonica)
Xenopus tropicalis
Canis familiaris
Drosophila melanogaster
Updated 8-29-05
GenBank release 149.0
11.2 billion bases
7.5b
5.7b
2.1b
1.9b
1.4b
1.2b
0.9b
0.8b
0.7b
Table 2-2
Page 18
Homo sapiens
Mus musculus
Rattus norvegicus
Bos taurus
12.3 billion bases
8.0b
5.7b
3.5b
Danio rerio
Zea mays
2.5b
1.8b
Oryza sativa (japonica) 1.5b
Strongylocentrotus purpurata 1.2b
Sus scrofa 1.0b
Xenopus tropicalis 1.0b
Updated 7-19-06
GenBank release 154.0
Table 2-2
Page 18
• Raw DNA Sequence
– Coding or Not?
– Parse into genes?
– 4 bases: AGCT
– ~1 K in a gene,
~2 M in genome
– ~3 Gb Human atggcaattaaaattggtatcaatggttttggtcgtatcggccgtatcgtattccgtgca gcacaacaccgtgatgacattgaagttgtaggtattaacgacttaatcgacgttgaatac atggcttatatgttgaaatatgattcaactcacggtcgtttcgacggcactgttgaagtg aaagatggtaacttagtggttaatggtaaaactatccgtgtaactgcagaacgtgatcca gcaaacttaaactggggtgcaatcggtgttgatatcgctgttgaagcgactggtttattc ttaactgatgaaactgctcgtaaacatatcactgcaggcgcaaaaaaagttgtattaact ggcccatctaaagatgcaacccctatgttcgttcgtggtgtaaacttcaacgcatacgca ggtcaagatatcgtttctaacgcatcttgtacaacaaactgtttagctcctttagcacgt gttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgact gcaactcaaaaaactgtggatggtccatcagctaaagactggcgcggcggccgcggtgca tcacaaaacatcattccatcttcaacaggtgcagcgaaagcagtaggtaaagtattacct gcattaaacggtaaattaactggtatggctttccgtgttccaacgccaaacgtatctgtt gttgatttaacagttaatcttgaaaaaccagcttcttatgatgcaatcaaacaagcaatc aaagatgcagcggaaggtaaaacgttcaatggcgaattaaaaggcgtattaggttacact gaagatgctgttgtttctactgacttcaacggttgtgctttaacttctgtatttgatgca gacgctggtatcgcattaactgattctttcgttaaattggtatc . . .
. . . caaaaatagggttaatatgaatctcgatctccattttgttcatcgtattcaa caacaagccaaaactcgtacaaatatgaccgcacttcgctataaagaacacggcttgtgg cgagatatctcttggaaaaactttcaagagcaactcaatcaactttctcgagcattgctt gctcacaatattgacgtacaagataaaatcgccatttttgcccataatatggaacgttgg gttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgact acaatcgttgacattgcgaccttacaaattcgagcaatcacagtgcctatttacgcaacc aatacagcccagcaagcagaatttatcctaaatcacgccgatgtaaaaattctcttcgtc ggcgatcaagagcaatacgatcaaacattggaaattgctcatcattgtccaaaattacaa aaaattgtagcaatgaaatccaccattcaattacaacaagatcctctttcttgcacttgg
• 20 letter alphabet
– ACDEFGHIKLMNPQRSTVWY but not BJOUXZ
• Strings of ~300 aa in an average protein (in bacteria),
~200 aa in a domain
• >1M known protein sequences (uniprot) d1dhfa_ LNCIVAVSQ NM GIGKNGDLPW PP LRNEFRYFQRMT TTSSVEGKQNLVIMGKKTWFSI d8dfr__ LNSIVAVCQ NM GIGKDGNLPW PP LRNEYKYFQRMT STSHVEGKQNAVIMGKKTWFSI d4dfra_ ISLIAALAV DR VIGMENAMPW NLPADLAWFKRNT L--------N KPVIMGRHTWESI d3dfr__ TAFLWAQDR DG LIGKDGHLPW HLPDDLHYFRAQT V--------G KIMVVGRRTYESF d1dhfa_ LNCIVAVSQ NM GIGKNGDLPW PP LRNEFRYFQRMT TTSSVEGKQNLVIMGKKTWFSI d8dfr__ LNSIVAVCQ NM GIGKDGNLPW PP LRNEYKYFQRMT STSHVEGKQNAVIMGKKTWFSI d4dfra_ ISLIAALAV DR VIGMENAMPW -N LPADLAWFKRNT LD-------KPVIMGRHTWESI d3dfr__ TAFLWAQDR NG LIGKDGHLPW -H LPDDLHYFRAQT VG-------KIMVVGRRTYESF d1dhfa_ VPEKNRP LKGRINLVLS RELKEP PQGA H FLSRSLDDALKLTE QPELANKVD MVWIVGGSSVYKEAM NHP d8dfr__ VPEKNRP LKDRINIVLS RELKEA PKGA H YLSKSLDDALALLD SPELKSKVD MVWIVGGTAVYKAAM EKP d4dfra_ ---G-RP LPGRKNIILS -SQPGT DDRV TWVKSVDEAIAACG DVP-----EIMVIGGGRVYEQFL PKA d3dfr__ ---PKRP LPERTNVVLT HQEDYQ AQGA VVVHDVAAVFAYAK QHLDQ---ELVIAGGAQIFTAFK DDV d1dhfa_ -PEKNRP LKGRINLVLS RELKEP PQGA H FLSRSLDDALKLTE QPELANKVD MVWIVGGSSVYKEAM NHP d8dfr__ -PEKNRP LKDRINIVLS RELKEA PKGA H YLSKSLDDALALLD SPELKSKVD MVWIVGGTAVYKAAM EKP d4dfra_ -G---RP LPGRKNIILS SSQPGT DDRV TWVKSVDEAIAACG DVPE----.IMVIGGGRVYEQFL
PKA d3dfr__ -P--KRP LPERTNVVLT HQEDYQ AQGA VVVHDVAAVFAYAK QHLD----Q ELVIAGGAQIFTAFK DDV
• DNA/RNA/Protein
– Almost all protein
(RNA Adapted From D Soll Web Page,
Right Hand Top Protein from M Levitt web page)
Molecular Biology
Information:
Whole Genomes
• The Revolution Driving
Everything
Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkness, E.
F., Kerlavage, A. R., Bult, C. J., Tomb, J. F., Dougherty, B. A., Merrick, J.
M., McKenney, K., Sutton, G., Fitzhugh, W., Fields, C., Gocayne, J. D.,
Scott, J., Shirley, R., Liu, L. I., Glodek, A., Kelley, J. M., Weidman, J. F.,
Phillips, C. A., Spriggs, T., Hedblom, E., Cotton, M. D., Utterback, T. R.,
Hanna, M. C., Nguyen, D. T., Saudek, D. M., Brandon, R. C., Fine, L. D.,
Fritchman, J. L., Fuhrmann, J. L., Geoghagen, N. S. M., Gnehm, C. L.,
McDonald, L. A., Small, K. V., Fraser, C. M., Smith, H. O. & Venter, J. C.
(1995). "Whole-genome random sequencing and assembly of Haemophilus influenzae rd." Science 269: 496-512.
(Picture adapted from TIGR website, http://www.tigr.org)
• Integrative Data
1995, HI (bacteria): 1.6 Mb & 1600 genes done
1997, yeast: 13 Mb & ~6000 genes for yeast
1998, worm: ~100Mb with 19 K genes
1999: >30 completed genomes!
2003, human: 3 Gb & 100 K genes...
Genome sequence now accumulate so quickly that, in less than a week, a single laboratory can produce more bits of data than
Shakespeare managed in a lifetime, although the latter make better reading.
-- G A Pekso, Nature 401 : 115-116 (1999)
1995
Bacteria,
1.6 Mb,
~1600 genes
[ Science 269 : 496]
1997
Eukaryote,
13 Mb,
~6K genes
[Nature 387 : 1]
1998
Animal,
~100 Mb,
~20K genes
[ Science 282 :
1945]
2000?
Human,
~3 Gb,
~100K genes
[???]
Genomes highlight the Finiteness of the “Parts” in
Biology real thing, Apr ‘00
‘98 spoof
Cor e
– Early experiments yeast
– Now tiling array technology
• 50 M data points to tile the human genome at ~50 bp res.
– Can only sequence genome once but can do an infinite variety of array experiments
– For yeast: 6000 x 6000 / 2 ~ 18M possible interactions
– maybe 30K real
Weber
Cartoon
(courtesy of Finn Drablos)
• Understanding How Structures Bind Other
Molecules (Function)
• Designing Inhibitors
• Docking, Structure Modeling
Cor e
(From left to right, figures adapted from Olsen Group Docking Page at Scripps, Dyson NMR Group Web page at Scripps, and from Computational
Chemistry Page at Cornell Theory Center).
Cor e
• Find Similar Ones in Different Organisms
• Human vs. Mouse vs. Yeast
– Easier to do Expts. on latter!
(Section from NCBI Disease Genes Database Reproduced Below.)
Human Genes and S. cerevisiae Proteins
Human Disease MIM # Human GenBank BLASTX Yeast GenBank Yeast Gene
Gene Acc# for P-value Gene Acc# for Description
Human cDNA Yeast cDNA
Hereditary Non-polyposis Colon Cancer 120436 MSH2 U03911 9.2e-261 MSH2 M84170 DNA repair protein
Hereditary Non-polyposis Colon Cancer 120436 MLH1 U07418 6.3e-196 MLH1 U07187 DNA repair protein
Cystic Fibrosis 219700 CFTR M28668 1.3e-167 YCF1 L35237 Metal resistance protein
Wilson Disease 277900 WND U11700 5.9e-161 CCC2 L36317 Probable copper transporter
Glycerol Kinase Deficiency 307030 GK L13943 1.8e-129 GUT1 X69049 Glycerol kinase
Bloom Syndrome 210900 BLM U39817 2.6e-119 SGS1 U22341 Helicase
Adrenoleukodystrophy, X-linked 300100 ALD Z21876 3.4e-107 PXA1 U17065 Peroxisomal ABC transporter
Ataxia Telangiectasia 208900 ATM U26455 2.8e-90 TEL1 U31331 PI3 kinase
Amyotrophic Lateral Sclerosis 105400 SOD1 K00065 2.0e-58 SOD1 J03279 Superoxide dismutase
Myotonic Dystrophy 160900 DM L19268 5.4e-53 YPK1 M21307 Serine/threonine protein kinase
Lowe Syndrome 309000 OCRL M88162 1.2e-47 YIL002C Z47047 Putative IPP-5-phosphatase
Neurofibromatosis, Type 1 162200 NF1 M89914 2.0e-46 IRA2 M33779 Inhibitory regulator protein
Choroideremia 303100 CHM X78121 2.1e-42 GDI1 S69371 GDP dissociation inhibitor
Diastrophic Dysplasia 222600 DTD U14528 7.2e-38 SUL1 X82013 Sulfate permease
Lissencephaly 247200 LIS1 L13385 1.7e-34 MET30 L26505 Methionine metabolism
Thomsen Disease 160800 CLC1 Z25884 7.9e-31 GEF1 Z23117 Voltage-gated chloride channel
Wilms Tumor 194070 WT1 X51630 1.1e-20 FZF1 X67787 Sulphite resistance protein
Achondroplasia 100800 FGFR3 M58051 2.0e-18 IPL1 U07163 Serine/threoinine protein kinase
Menkes Syndrome 309400 MNK X69208 2.1e-17 CCC2 L36317 Probable copper transporter
Cor e
– e.g. how many kinases in Yeast
– Expression levels in Cancerous vs
Normal Tissues
(Clock figures, yeast v. Synechocystis, adapted from GeneQuiz Web Page, Sander Group, EBI)
EX-1: Occurrence of
EX-2: Occurrence of 1-4 functions per fold & salt bridges in genomes interactions per fold of thermophiles v over all genomes mesophiles
0.70
0.60
0.50
0.40
0.30
0.20
0.10
0.00
EK(3)
EK(4)
MP MG EC SC HP SS HI MT MJ AF AA OT
10 to 45 65 85 83 95
Mesophile Thermophile
Physiological temperature in C
98
http://www.nature.com/nature/journal/vaop/ncurrent/pdf/ nature07331.pdf
• Different Sequences Have the Same
Structure
• Organism has many similar genes
• Single Gene May Have Multiple
Functions
• Genes are grouped into Pathways
• Genomic Sequence Redundancy due to the Genetic Code
• How do we find the similarities?....
Cor e
Integrative Genomics genes
structures
functions
pathways
expression levels
regulatory systems
….
Ome molecular group
Genome
Proteome
Transcriptome
Phenome
Interactome
Metabolome
Physiome
Orfeome
Secretome
Morphome
Glycome
Regulome
Functome
Cellome
Transportome
Ribonome
Operome
'Omics: studying populations of molecules in a database framework
Ome molecular group
Genome
Proteome
Transcriptome
Phenome
Interactome
Metabolome
Physiome
Orfeome
Secretome
Morphome
Glycome
Regulome
Functome
Cellome
Transportome
Ribonome
Operome
Hits
58200000
1850000
707000
418000
87500
80700
56300
29800
23900
11400
995
618
390
246
155
131
57
'Omics: studying populations of molecules in a database framework
Ome molecular group
Genome
Proteome
Transcriptome
Phenome
Interactome
Metabolome
Physiome
Orfeome
Secretome
Morphome
Glycome
Regulome
Functome
Cellome
Transportome
Ribonome
Operome
Cor e
PubMed
Hits
48
2
34
6
1
17
1
1
0
537993
6005
1665
53
87
182
41
25
Hits
23900
11400
995
618
390
246
155
131
57
58200000
1850000
707000
418000
87500
80700
56300
29800
PubMed
First year
1953
1995
1997
1989
1999
1998
1997
2002
2000
2000
2000
2004
2001
2002
2004
2002
'Omics: studying populations of molecules in a database framework
Proteome
Extra
How many roles can these play?
How flexible and adaptable are they mechanically?
What are the shared parts ( bolt , nut , washer , spring , bearing ), unique parts ( cogs, levers )? What are the common parts -
- types of parts
( nuts & washers )?
Cor e
Extra
Where are the parts located?
Total in Databank
New Submissions
New Folds
Vast Growth in
(Structural) Data...
but number of
Fundamentally New (Fold)
Parts Not Increasing that Fast
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 … ~100000 genes
~1000 folds
(human)
(T. pallidum)
1 2 3 4 5 6 7 8 9 10 11
Same logic for pathways, functions, sequence families, blocks, motifs....
of a
from
Functions picture from www.fruitfly.org/~suzi (Ashburner); Pathways picture from, ecocyc.pangeasystems.com/ecocyc (Karp, Riley). Related resources: COGS, ProDom,
Pfam, Blocks, Domo, WIT, CATH, Scop....
12 13 14 15 …
~1000 genes
• (Molecular)
Bio informatics
• One idea for a definition?
Bioinformatics is conceptualizing biology in terms
of molecules (in the sense of physical-chemistry) and then applying “informatics” techniques
(derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules, on a large-scale.
• Structural Bioinformatics is a practical discipline with many applications that deals with biological three dimensional structural data.
• Databases
• Geometry
Building, Querying
Object DB
• Text String Comparison
Text Search
1D Alignment
Significance Statistics
Google, grep
• Finding Patterns
AI / Machine Learning
Clustering
Datamining
Robotics
Graphics (Surfaces, Volumes)
Comparison and 3D Matching
(Vision, recognition)
• Physical Simulation
Newtonian Mechanics
Electrostatics
Numerical Algorithms
Simulation
• Finding Genes in Genomic
DNA
introns
exons
promotors
• Characterizing Repeats in
Genomic DNA
Statistics
Patterns
• Duplications in the Genome
Large scale genomic alignment
• Whole-Genome Comparisons
• Finding Structural RNAs
• Sequence Alignment
non-exact string matching, gaps
How to align two strings optimally via Dynamic
Programming
Local vs Global Alignment
Suboptimal Alignment
Hashing to increase speed
(BLAST, FASTA)
Amino acid substitution scoring matrices
• Multiple Alignment and
Consensus Patterns
How to align more than one sequence and then fuse the result in a consensus representation
Transitive Comparisons
HMMs, Profiles
Motifs
• Scoring schemes and
Matching statistics
How to tell if a given alignment or match is statistically significant
A P-value (or an e-value)?
Score Distributions
(extreme val. dist.)
Low Complexity Sequences
• Evolutionary Issues
Rates of mutation and change
• Secondary Structure
“Prediction”
via Propensities
Neural Networks, Genetic
Alg.
Simple Statistics
TM-helix finding
Assessing Secondary
Structure Prediction
• Structure Prediction:
Protein v RNA
• Tertiary Structure Prediction
Fold Recognition
Threading
Ab initio
• Function Prediction
Active site identification
• Relation of Sequence Similarity to
Structural Similarity
• Prediction of structure from sequence
Fold recognition
Fragment construction
• Proteome annotation
• Protein-protein docking
Protein sequence
Protein folding code
Protein structure
Query sequence
Fold recognitio n
Matched fold
Match sequence against library of known folds
Eisenberg et al.
Jones, Taylor, Thornton
• 1 sequence search takes 12 mins (3Ghz)
• Benchmarking on 100 proteins with 100 runs for a simplex search of parameter space = 80 days
• 30 approaches explored = 7 years (on 1 cpu)