Genomics and Bioinformatics

Computational Molecular Biology
Biochem 218 – BioMedical Informatics 231
http://biochem218.stanford.edu/
Genomics and Bioinformatics
Doug Brutlag
Professor Emeritus
Biochemistry & Medicine (by courtesy)
Faculty, TAs and Staff
Doug Brutlag
Lee Kozar
Maeve O’Huallachain
Dan Davison
Course and Video Availability
• Alway M114
– Tuesdays & Thursdays 2:15-3:30 PM
• Course Web Site
– http://biochem218.stanford.edu/
• Stanford Center for Professional Development
– http://scpd.stanford.edu/
• Videos available 24 hours/day, 7 days/week
• Course offered Autumn, Winter and Spring
quarters
Course Requirements
• Lectures
– Theoretical background of current methods
– Strengths and weaknesses of current approaches
– Future directions for improvements
• Demonstrations
– Applications (Mac, PC, Unix, Web)
– Web applications
– Illustrate homework
• All homework and questions must be submitted by
email to homework218@cmgm.stanford.edu
• Several homework assignments (35%)
– Due one week after assigned
• Final project (Due March 12th)
– A critical or comparative review of computational approaches to
any problem in computational molecular biology
– Propose new approach
– Implement a new approach
– Examples of previous projects for the class can be found at
http://biochem218.stanford.edu/Projects.html
David Mount
Bioinformatics: Sequence and Genome Analysis 2nd Edition
Jin Xiong
Essential Bioinformatics
Richard Durbin et al.
Biological Sequence Analysis
Jones & Pevzner
Bioinformatics Algorithms
Dan Gusfield
Algorithms on Strings, Trees & Sequences
Baldi & Brunak
Bioinformatics: The Machine Learning Approach
Higgins & Taylor
Bioinformatics: Sequence, Structure & Databanks
NCBI Handbook
http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=handbook
EMBL-EBI Home Page
http://www.ebi.ac.uk/
Berg, Tymoczko & Stryer
Biochemistry, Fifth Edition
Benjamin Lewin
Genes IX
Genomics, Bioinformatics &
Computational Biology
Genomics
Bioinformatics
Systems Biology
Structural Genomics
Proteomics
Computational Molecular Biology
Computational Biology
Machine Learning
Robotics
Databases
Artificial Intelligence
Statistics & Probability
Algorithms
Information Theory
Graph Theory
What is Bioinformatics?
[Figure: Biological Information. Labels: DNA, RNA, Protein, Phenotype, Individuals, Populations, Selection, Evolution]
Computational Goals of Bioinformatics
• Learn & Generalize: Discover conserved patterns (models) of
sequences, structures, interactions, metabolism & chemistries from
well-studied examples.
• Prediction: Infer function or structure of newly sequenced genes,
genomes, proteins or proteomes from these generalizations.
• Organize & Integrate: Develop a systematic and genomic approach to
molecular interactions, metabolism, cell signaling, gene expression…
• Simulate: Model gene expression, gene regulation, protein folding,
protein-protein interaction, protein-ligand binding, catalytic function,
metabolism…
• Engineer: Construct novel organisms or novel functions or novel
regulation of genes and proteins.
• Gene Therapy: Target specific genes or mutations, for example with
RNAi, to change a disease phenotype.
Central Paradigm of Molecular Biology
DNA → RNA → Protein → Phenotype (Symptoms)
Molecular Biology of the Gene 1965
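The DNA to RNA to protein steps of this paradigm can be illustrated with a minimal Python sketch. It assumes the standard genetic code and reads a coding-strand sequence in frame from its first base. The DNA string is hypothetical and chosen only for illustration, and the function names are not from the course materials.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
# (first base varies slowest). '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna):
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translate an mRNA in frame from its first base until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical coding sequence used only for illustration.
dna = "ATGGTGCACCTGACTCCTGAGGAGAAGACTTAA"
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna)
print(protein)
```

Real genes need more care (introns, alternative start sites, both strands), so this sketch covers only the information flow named on the slide.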
Central Paradigm of Bioinformatics
Genetic Information → Molecular Structure → Biochemical Function → Phenotype (Symptoms)
Genetic Information:
MVHLTPEEKT AVNALWGKVN VDAVGGEALG RLLVVYPWTQ RFFESFGDLS
SPDAVMGNPK VKAHGKKVLG AFSDGLAHLD NLKGTFSQLS ELHCDKLHVD
PENFRLLGNV LVCVLARNFG KEFTPQMQAA YQKVVAGVAN ALAHKYH
Challenges Understanding
Genetic Information
Genetic Information → Molecular Structure → Biochemical Function → Phenotype
• Genetic information is redundant
• Structural information is redundant
• Genes and proteins are meta-stable
• Single genes have multiple functions
• Genes are one dimensional but function depends
on three-dimensional structure
Using A Controlled Vocabulary for Literature Search
http://www.ncbi.nlm.nih.gov/sites/entrez?db=mesh
Gene Ontology Database
http://www.geneontology.org/
UCSC Genome Browser
http://genome.ucsc.edu/
ExPASy Proteomics Server
http://www.expasy.ch/doc.html
Inferring Biological Function from
Protein Sequence
Consensus Sequences
or Sequence Motifs
Zinc Finger (C2H2 type)
C x {2,4} C x {12} H x {3,5} H
Sequences of Common
Structure or Function
Sequence Similarity
Query VLSPADKTNVKAAWGKVGAHAGEVGAEALERMFLSFPTTKTYFPHF------DLSHGS
       |:| :|: | |:||||     | |:||| |: : :|:|  :|  |      |:  |
Match HLTPEEKSAVTALWGKV--NVDEYGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGN
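One simple way to quantify the similarity shown in this alignment is to count identical residues over the aligned columns. The sketch below uses the Query and Match lines exactly as they appear on the slide. Skipping any column that contains a gap is an assumption made here, and other conventions divide by the full alignment length instead.

```python
QUERY = "VLSPADKTNVKAAWGKVGAHAGEVGAEALERMFLSFPTTKTYFPHF------DLSHGS"
MATCH = "HLTPEEKSAVTALWGKV--NVDEYGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGN"


def alignment_identity(seq_a, seq_b, gap="-"):
    """Return (identical, compared) over columns where neither row has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    identical = 0
    compared = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # gap columns are not counted (assumption of this sketch)
        compared += 1
        if a == b:
            identical += 1
    return identical, compared


identical, compared = alignment_identity(QUERY, MATCH)
print(f"{identical} identities in {compared} ungapped columns "
      f"({100 * identical / compared:.0f}% identity)")
```

Identity ignores conservative substitutions (the `:` marks in the alignment), which is why substitution matrices are normally used to score similarity.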
A Typical Motif:
Zinc Finger DNA Binding Motif
C..C............H....H
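The C2H2 pattern C x {2,4} C x {12} H x {3,5} H maps directly onto a regular expression, with x standing for any residue. The following is a minimal sketch using Python's `re` module. The helper names and the toy protein string are hypothetical, and the conversion handles only the notation used on this slide, not full PROSITE syntax.

```python
import re


def motif_to_regex(notation):
    """Convert slide-style notation ('C x {2,4} C ...') to a regex string.

    Only handles literal residues, 'x' wildcards and {n} / {m,n} repeats.
    """
    return notation.replace(" ", "").replace("x", ".")


ZINC_FINGER = re.compile(motif_to_regex("C x {2,4} C x {12} H x {3,5} H"))


def find_motifs(sequence, pattern=ZINC_FINGER):
    """Return (start, matched_text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in pattern.finditer(sequence)]


# Hypothetical protein fragment used only to exercise the pattern.
toy_protein = "MKPYACPVESCDRRFSRSDELTRHIRIHTGQKP"
print(find_motifs(toy_protein))
```

A consensus pattern like this is all-or-nothing: a single substitution at a C or H position means no match, which is part of the motivation for the weight matrices and profiles on the following slides.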
Inferring Biological Function from
Protein Sequence
Weight Matrices or Position-Specific Scoring Matrices
     1   2   3   4   5   6   7   8   9  10  11  12
A    2   1   3  13  10  12  67   4  13   9   1   2
R    7   5   8   9   4   0   1  16   7   0   1   0
N    0   8   0   1   0   0   0   2   1   1  10   0
D    0   1   0   1  13   0   0  12   1   0   4   0
C    0   0   1   0   0   0   0   0   0   2   2   1
Q    1   1  21   8  10   0   0   7   6   0   0   2
E    2   0   0   9  21   0   0  15   7   3   3   0
G    9   7   1   4   0   0   8   0   0   0  46   0
H    4   3   1   1   2   0   0   2   2   0   5   0
I   10   0  11   1   2  10   0   4   9   3   0  16
L   16   1  17   0   1  31   0   3  11  24   0  14
K    3   4   5  10  11   1   1  13  10   0   5   2
M    7   1   1   0   0   0   0   0   5   7   1   8
F    4   0   3   0   0   4   0   0   0  10   0   0
P    0   6   0   1   0   0   0   0   0   0   0   0
S    1  17   0   8   3   1   3   0   2   2   2   0
T    5  22   3  11   1   5   0   2   2   2   0   5
W    2   0   0   0   0   0   0   0   0   1   0   1
Y    1   0   4   2   0   1   0   0   2   4   0   1
V    6   3   1   1   2  15   0   0   2  12   0  28
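To show how such a matrix is used, the sketch below turns the residue counts from this slide (20 amino acids by 12 positions) into log-odds scores and slides the matrix along a sequence. The pseudocount of 1 and the uniform background frequency of 1/20 are assumptions made for illustration. The slide does not say how the course converts counts to scores, and the function names are hypothetical.

```python
import math

# Residue counts per motif position, copied from the slide (rows: amino acids).
COUNTS = {
    "A": [2, 1, 3, 13, 10, 12, 67, 4, 13, 9, 1, 2],
    "R": [7, 5, 8, 9, 4, 0, 1, 16, 7, 0, 1, 0],
    "N": [0, 8, 0, 1, 0, 0, 0, 2, 1, 1, 10, 0],
    "D": [0, 1, 0, 1, 13, 0, 0, 12, 1, 0, 4, 0],
    "C": [0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 2, 1],
    "Q": [1, 1, 21, 8, 10, 0, 0, 7, 6, 0, 0, 2],
    "E": [2, 0, 0, 9, 21, 0, 0, 15, 7, 3, 3, 0],
    "G": [9, 7, 1, 4, 0, 0, 8, 0, 0, 0, 46, 0],
    "H": [4, 3, 1, 1, 2, 0, 0, 2, 2, 0, 5, 0],
    "I": [10, 0, 11, 1, 2, 10, 0, 4, 9, 3, 0, 16],
    "L": [16, 1, 17, 0, 1, 31, 0, 3, 11, 24, 0, 14],
    "K": [3, 4, 5, 10, 11, 1, 1, 13, 10, 0, 5, 2],
    "M": [7, 1, 1, 0, 0, 0, 0, 0, 5, 7, 1, 8],
    "F": [4, 0, 3, 0, 0, 4, 0, 0, 0, 10, 0, 0],
    "P": [0, 6, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "S": [1, 17, 0, 8, 3, 1, 3, 0, 2, 2, 2, 0],
    "T": [5, 22, 3, 11, 1, 5, 0, 2, 2, 2, 0, 5],
    "W": [2, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
    "Y": [1, 0, 4, 2, 0, 1, 0, 0, 2, 4, 0, 1],
    "V": [6, 3, 1, 1, 2, 15, 0, 0, 2, 12, 0, 28],
}


def build_pssm(counts, pseudocount=1.0, background=1 / 20):
    """Convert counts to log2-odds scores (pseudocount/background are assumptions)."""
    width = len(next(iter(counts.values())))
    pssm = []
    for col in range(width):
        total = sum(row[col] for row in counts.values()) + pseudocount * len(counts)
        pssm.append({
            aa: math.log2((row[col] + pseudocount) / total / background)
            for aa, row in counts.items()
        })
    return pssm


def score_window(pssm, window):
    """Sum the per-position scores for a window as long as the matrix."""
    return sum(column[aa] for column, aa in zip(pssm, window))


def best_hit(pssm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pssm)
    return max(
        (score_window(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


def consensus(counts):
    """Most frequent residue at each position of the count matrix."""
    width = len(next(iter(counts.values())))
    return "".join(
        max(counts, key=lambda aa: counts[aa][col]) for col in range(width)
    )


pssm = build_pssm(COUNTS)
print(consensus(COUNTS))
print(best_hit(pssm, "PPPP" + consensus(COUNTS) + "PPPP"))
```

Unlike a consensus pattern, the matrix gives partial credit: a window that differs from the consensus at one position still scores well if the substituted residue was observed at that position.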
Profiles, PSI-BLAST, Hidden Markov Models
[Figure: hidden Markov model with states AA1 to AA6, I1 to I5 and D2 to D5]
Buried Treasure
Clustal Globin Alignment
Consensus Sequence From a
Multiple Sequence Alignment
ClustalW Insulin Alignments
[Figure: ClustalW multiple sequence alignment of the insulin sequences IPGP, IPDK, IPDG, IPCH, IPCA, IPBO and IPAF, with a consensus line]
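A consensus line for a multiple sequence alignment can be derived column by column. The sketch below uses a strict-majority rule: a residue is reported only if it occurs in more than half of the rows, otherwise a dot is written. The threshold, the dot placeholder and the toy alignment are assumptions for illustration. The toy rows are not the insulin sequences from the slide, and ClustalW's own conservation line follows different rules.

```python
from collections import Counter


def consensus(alignment, threshold=0.5, unknown="."):
    """Majority-rule consensus of equal-length aligned rows ('-' marks a gap)."""
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned rows must have the same length")

    result = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        if not residues:
            result.append("-")  # column made only of gaps
            continue
        residue, count = Counter(residues).most_common(1)[0]
        # Frequency is measured against all rows, so gaps count against a residue.
        result.append(residue if count / len(column) > threshold else unknown)
    return "".join(result)


# Hypothetical toy alignment, not taken from the slide.
toy_alignment = [
    "MALWMRLL",
    "MALWIRSL",
    "MAVWIQAL",
    "MA-WLQSF",
]
print(consensus(toy_alignment))
```

Raising `threshold` toward 1.0 keeps only strictly conserved positions, which is closer to how a sequence motif is usually reported.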
HMM Model of Hemoglobins
http://decypher.stanford.edu/
GrowTree VegF Neighbor Joining Tree
Human Gene Expression Signatures
T Cells Signaling
DNA Damage
Fibroblast Stimulation
B Cells Signaling
CMV Infection
Anoxia
Polio Infection
Monocytes Signaling IL4
Hormone
Clustering Gene Expression Profiles:
Comparison of Methods
D'haeseleer P (2005). Nat Biotechnol. 23,1499-501
TAMO:
Tools for the Analysis of Motifs
Finding Transcription Factor Binding Sites
Upstream Regions of Co-expressed Genes
Pho 5   GATGGCTGCACCACGTGTATGC...ACGATGTCTCGC
Pho 8   CACATCGCATCACGTGACCAGT...GACATGGACGGC
Pho 81  GCCTCGCACGTGGTGGTACAGT...AACATGACTAAA
Pho 84  TCTCGTTAGGACCATCACGTGA...ACAATGAGAGCG
Pho …   CGCTAGCCCACGTGGATCTTGA...AGAATGACTGGC
Transcription
Start
Finding Transcription Factor Binding Sites
Upstream Regions
Co-expressed
Genes
GATGGCTGCACCACGTGTATGC...ACGATGTCTCGC
CACATCGCATCACGTGACCAGT...GACATGGACGGC
GCCTCGCACGTGGTGGTACAGT...AACATGACTAAA
TCTCGTTAGGACCATCACGTGA...ACAATGAGAGCG
CGCTAGCCCACGTGGATCTTGT...AGAATGGCCTAT
Finding Transcription Factor Binding
Sites
Upstream Regions
Co-expressed
Genes
ATGGCTGCACCACGTTTATGC...ACGATGTCTCGC
CACATCGCATCACGTGACCAGT...GACATGGACGGC
GCCTCGCACGTGGTGGTACAGT...AACATGACTAAA
TTAGGACCATCACGTGA...ACAATGAGAGCG
CGCTAGCCCACGTTGATCTTGT...AGAATGGCCTAT
Pho4 binding
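One simple way to look for a candidate binding site is to ask which short words (k-mers) occur in every upstream region. The sketch below does this for the five fragments shown on the preceding slides, using only the bases printed before each elision and leaving the elided parts out. Exact matching with k = 6 is an assumption made for illustration. Motif-finding tools such as the TAMO suite named earlier work with degenerate motifs and statistics rather than exact shared words.

```python
# Upstream fragments as printed on the slides (text before each '...').
UPSTREAM = [
    "GATGGCTGCACCACGTGTATGC",
    "CACATCGCATCACGTGACCAGT",
    "GCCTCGCACGTGGTGGTACAGT",
    "TCTCGTTAGGACCATCACGTGA",
    "CGCTAGCCCACGTGGATCTTGT",
]


def kmers(sequence, k):
    """Set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_kmers(sequences, k):
    """k-mers that occur (exactly) in every sequence."""
    return set.intersection(*(kmers(seq, k) for seq in sequences))


def positions(sequences, kmer):
    """Start index of the first occurrence of kmer in each sequence."""
    return [seq.find(kmer) for seq in sequences]


for word in sorted(shared_kmers(UPSTREAM, 6)):
    print(word, positions(UPSTREAM, word))
```

Exact matching would miss sites with a single substitution, such as the altered fragments on the last of these slides, which is why weight matrices are generally preferred for describing binding sites.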
Metabolic Networks: BioCyc
http://biocyc.org/
C. crescentus Cell Cycle Gene Expression
Genome Wide Associations in
Rheumatoid Arthritis
Pearson, T. A. et al. JAMA 2008;299:1335-1344
Leveraging Genomic Information in
Medicine
• Novel Diagnostics
– Microchips & Microarrays - DNA
– Gene Expression - RNA
– Proteomics - Protein
• Novel Therapeutics
– Drug Target Discovery
– Rational Drug Design
– Molecular Docking
– Gene Therapy
– Stem Cell Therapy
• Understanding Metabolism
• Understanding Disease
– Inherited Diseases - OMIM
– Infectious Diseases: Pathogenic Bacteria, Viruses