#33 - Genomics 11/09/07 Genomics BCB 444/544

BCB 444/544
Lecture 33
Genomics
#33_Nov09
BCB 444/544 F07 ISU Dobbs#33 - Genomics

Required Reading (before lecture)
√ Mon Nov 5 - Lecture 31: Phylogenetics – Parsimony and ML
• Chp 11 - pp 142 – 169
√ Wed Nov 7 - Lecture 32: Machine Learning
Fri Nov 9 - Lecture 33: Functional and Comparative Genomics
• Chp 17 and Chp 18
Assignments & Announcements
Fri Nov 9 - HW#6 (will be posted this weekend)
HW#6 - More fun with Machine Learning!!
Due: Fri Nov 16 (or sometime before Mon Nov 26)

Seminars this Week
BCB List of URLs for Seminars related to Bioinformatics:
http://www.bcb.iastate.edu/seminars/index.html
• Nov 7 Wed - BBMB Seminar 4:10 in 1414 MBB
  • Sharon Roth Dent, MD Anderson Cancer Center
  • Role of chromatin and chromatin modifying proteins in regulating gene expression
• Nov 8 Thurs - BBMB Seminar 4:10 in 1414 MBB
  • Jianzhi George Zhang, U. Michigan
  • Evolution of new functions for proteins
• Nov 9 Fri - BCB Faculty Seminar 2:10 in 102 SciI
  • Amy Andreotti, ISU
  • T cell signaling: insights from protein NMR spectroscopy
Chp 11 – Phylogenetic Tree Construction Methods and Programs
SECTION IV MOLECULAR PHYLOGENETICS
Xiong: Chp 11 Phylogenetic Tree Construction Methods and Programs
• Distance-Based Methods
• Character-Based Methods
• Phylogenetic Tree Evaluation
• Phylogenetic Programs

Machine Learning
• What is learning?
• What is machine learning?
• Learning algorithms
• Machine learning applied to bioinformatics and computational biology
• Some slides adapted from Dr. Vasant Honavar and Dr. Byron Olson
Examples of Machine Learning Algorithms
• Naïve Bayes (NB)
  • Bayes Theorem
• Neural network (NN) or Artificial Neural Net (ANN)
  • Perceptrons
• Support Vector Machine (SVM)
  • Kernel functions
Lab - WEKA: Decision Trees (DT), NB, SVM

An Application: Predicting RNA Binding Sites in Proteins
• Problem: Given an amino acid sequence, classify each residue as RNA binding or non-RNA binding
• Input to the classifier is a string of amino acid identities
• Output from the classifier is a class label, either binding or not
Bayes Theorem Applied to RNA Binding Site Prediction

$$P(\text{binding} \mid \text{aa seq}) = \frac{P(\text{binding})\, P(\text{aa seq} \mid \text{binding})}{P(\text{aa seq})}$$

Naïve Bayes for Binary Classification

$$P(c = 1 \mid X = x) = \frac{P(c = 1)\, P(X = x \mid c = 1)}{P(X = x)}$$

$$P(c = 0 \mid X = x) = \frac{P(c = 0)\, P(X = x \mid c = 0)}{P(X = x)}$$

Assign c = 1 if

$$\frac{P(c = 1 \mid X = x)}{P(c = 0 \mid X = x)} \geq \theta$$

Otherwise, assign c = 0
Example: Is ARG 6 RNA-binding or not?
ARG 6
TSKKKRQRGSR

$$\frac{p(X_1 = T \mid c = 1)\, p(X_2 = S \mid c = 1) \cdots}{p(X_1 = T \mid c = 0)\, p(X_2 = S \mid c = 0) \cdots} \geq \theta$$

Predicted vs Actual RNA Binding for Ribosomal protein L15 (PDB ID 1JJ2:K)
[Figure panels: Predicted, Actual]
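The Naïve Bayes decision rule above (ratio of class-conditional products compared with θ) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-residue windows, the toy training labels, the add-one pseudocount, and the function names are all hypothetical and are not taken from the lecture or from any real binding-site dataset.

```python
import math
from collections import defaultdict

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def train_naive_bayes(windows, labels, pseudocount=1.0):
    """Estimate P(c) and P(X_j = a | c) from labeled, equal-length windows."""
    width = len(windows[0])
    class_counts = {0: 0, 1: 0}
    # counts[c][j][a] = number of class-c windows with residue a at position j
    counts = {c: [defaultdict(float) for _ in range(width)] for c in (0, 1)}
    for window, c in zip(windows, labels):
        class_counts[c] += 1
        for j, aa in enumerate(window):
            counts[c][j][aa] += 1
    total = sum(class_counts.values())
    prior = {c: class_counts[c] / total for c in (0, 1)}

    def cond(c, j, aa):
        # Smoothed estimate of p(X_j = aa | c); the pseudocount is an assumption.
        return (counts[c][j][aa] + pseudocount) / (
            class_counts[c] + pseudocount * len(ALPHABET))

    return prior, cond

def log_odds(window, prior, cond):
    """log of P(c=1) prod_j p(X_j|c=1) divided by P(c=0) prod_j p(X_j|c=0)."""
    score = math.log(prior[1]) - math.log(prior[0])
    for j, aa in enumerate(window):
        score += math.log(cond(1, j, aa)) - math.log(cond(0, j, aa))
    return score

def classify(window, prior, cond, theta=1.0):
    """Assign c = 1 if the likelihood ratio is at least theta, else c = 0."""
    return 1 if log_odds(window, prior, cond) >= math.log(theta) else 0

# Hypothetical toy training set: 1 = RNA binding, 0 = non-binding.
TOY_WINDOWS = ["KRK", "RKR", "KKR", "LAV", "AVL", "GAL"]
TOY_LABELS = [1, 1, 1, 0, 0, 0]
PRIOR, COND = train_naive_bayes(TOY_WINDOWS, TOY_LABELS)
```

Working in log space avoids underflow when many per-position probabilities are multiplied. Raising `theta` makes the classifier more conservative about calling a residue RNA-binding.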
Biological Neurons “Sum” Input Signals & Generate Output Signal
Dendrites receive inputs, Axon sends output
Image from Christos Stergiou and Dimitrios Siganos
http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html

Artificial Neural Networks (ANNs or NNs)
• Neural networks - classify “input vectors” or “examples” into categories (2 or more)
• They are loosely based on biological neurons
• Some of the most successful methods for predicting secondary structure are based on neural networks:
  • Neural networks are trained to recognize amino acid patterns corresponding to known secondary structure elements; these patterns are used to predict secondary structure type for aa sequences in proteins of unknown structure
Simple Neuron = “Perceptron”
Perceptron is “Simplest ANN” = feed-forward NN = linear classifier
Image from Christos Stergiou and Dimitrios Siganos
http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html

The Perceptron
Input X: X1, X2, …, XN
Weights W: w1, w2, …, wN
Summation S:

$$S = \sum_{i=1}^{N} X_i W_i$$

Threshold T
Output F:

$$F = \begin{cases} 1 & S > T \\ 0 & S < T \end{cases}$$

Perceptron combines inputs X1…XN, compares “sum” S with a threshold T, and generates output class label: either 1 or 0
If weights W and threshold T are not known in advance, the perceptron must be trained. Ideally, perceptron is trained to return correct answer for all training examples, and perform well on test examples it has never seen.
Training set must contain both classes of data (i.e., with “1” and “0” output).
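As a minimal sketch of the perceptron described above, the following Python function computes the weighted sum S and compares it with the threshold T. The weights and threshold shown (a logical AND of two binary inputs) are hand-picked for illustration and are an assumption, as is the choice to return 0 when S equals T, a case the slide leaves undefined.

```python
def perceptron_output(x, w, threshold):
    """Return 1 if the weighted sum S = sum_i x_i * w_i exceeds the threshold, else 0."""
    s = sum(xi * wi for xi, wi in zip(x, w))
    # The slide defines F only for S > T and S < T; S == T is mapped to 0 here.
    return 1 if s > threshold else 0

# Hypothetical hand-set parameters that implement logical AND.
AND_WEIGHTS = [1.0, 1.0]
AND_THRESHOLD = 1.5

and_table = {
    (a, b): perceptron_output([a, b], AND_WEIGHTS, AND_THRESHOLD)
    for a in (0, 1)
    for b in (0, 1)
}
```

Because the output depends only on which side of the hyperplane S = T the input falls, a single perceptron can represent only linearly separable functions.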
Perceptron “Sums” Inputs by Computing Dot Product S = X⋅W
• Input is a vector X; the weights are another vector W
• Perceptron Summation S computes the dot product, S = X⋅W
• Perceptron Output F is a function of S: it is often discrete (1 or 0), in which case the function is a step function
• For continuous output, a sigmoidal function is often used:

$$F(X) = \frac{1}{1 + e^{-X}}$$

[Figure: plot of F(X), with axis ticks at 0, 1/2, and 1]

Training a Perceptron
Find the weights W that minimize the error function E:

$$E = \sum_{i=1}^{P} \left( F(X_i \cdot W) - t(X_i) \right)^2$$

P: number of training examples
X_i: training vectors
F(W⋅X_i): output of perceptron
t(X_i): target value for X_i
Use steepest descent:
- compute gradient:

$$\nabla E = \left( \frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \frac{\partial E}{\partial w_3}, \ldots, \frac{\partial E}{\partial w_N} \right)$$

- update weight vector:

$$W_{new} = W_{old} - \epsilon \nabla E$$

(ε: learning rate)
- iterate
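The steepest-descent procedure above can be sketched as follows, using the sigmoid output F and the squared-error function E. The gradient comes from the chain rule, using F'(s) = F(s)(1 - F(s)). The training data (logical OR with a constant bias input of 1 in place of an explicit threshold), the learning rate, and the iteration count are illustrative assumptions, not values from the lecture.

```python
import math

def dot(x, w):
    return sum(xi * wi for xi, wi in zip(x, w))

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def error(examples, targets, w):
    """E = sum_i (F(X_i . W) - t(X_i))^2"""
    return sum((sigmoid(dot(x, w)) - t) ** 2 for x, t in zip(examples, targets))

def gradient(examples, targets, w):
    """dE/dw_j = sum_i 2 (F_i - t_i) F_i (1 - F_i) x_ij, by the chain rule."""
    grad = [0.0] * len(w)
    for x, t in zip(examples, targets):
        f = sigmoid(dot(x, w))
        common = 2.0 * (f - t) * f * (1.0 - f)
        for j, xj in enumerate(x):
            grad[j] += common * xj
    return grad

def train(examples, targets, epsilon=0.5, iterations=2000):
    """Repeat W_new = W_old - epsilon * grad(E), starting from all-zero weights."""
    w = [0.0] * len(examples[0])
    for _ in range(iterations):
        g = gradient(examples, targets, w)
        w = [wj - epsilon * gj for wj, gj in zip(w, g)]
    return w

# Hypothetical training set: logical OR; the last input is a constant bias term.
EXAMPLES = [(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1)]
TARGETS = [0, 1, 1, 1]
```

Calling `train(EXAMPLES, TARGETS)` returns a weight vector whose error is lower than that of the all-zero starting point. A learning rate that is too large can overshoot the minimum, and one that is too small slows convergence.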
Artificial Neural Network (ANN)
Artificial neural network
• Set of perceptrons interconnected such that outputs of some units become inputs of other units
• Many topologies are possible!
• Can have multiple layers
Neural networks are trained in the same way perceptrons are trained, by minimizing an error function:

$$E = \sum_{i=1}^{P} \left( NN(X^i) - t(X^i) \right)^2$$

Support Vector Machines - SVMs
Image from http://en.wikipedia.org/wiki/Support_vector_machine
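To illustrate the artificial neural network idea described earlier (perceptrons whose outputs become inputs of other perceptrons), here is a small hand-wired two-layer network of step-function units that computes XOR, a function no single perceptron can represent. The weights and thresholds are chosen by hand for illustration and are assumptions; in practice they would be learned by minimizing the error function.

```python
def step_unit(x, w, threshold):
    """A perceptron with step output: 1 if the weighted sum exceeds the threshold."""
    s = sum(xi * wi for xi, wi in zip(x, w))
    return 1 if s > threshold else 0

def xor_network(a, b):
    """Two hidden units feed one output unit (hypothetical hand-set weights)."""
    h_or = step_unit([a, b], [1.0, 1.0], 0.5)    # fires if a OR b
    h_and = step_unit([a, b], [1.0, 1.0], 1.5)   # fires if a AND b
    # Output fires when the OR unit is on and the AND unit is off.
    return step_unit([h_or, h_and], [1.0, -1.0], 0.5)

xor_table = {(a, b): xor_network(a, b) for a in (0, 1) for b in (0, 1)}
```

The hidden layer re-represents the inputs so that the final unit faces a linearly separable problem, which is the basic reason multiple layers add expressive power.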
SVM Finds Maximum-Margin Hyperplane
(i.e., hyperplane that provides maximum separation between two classes of instances in dataset)
Image from http://en.wikipedia.org/wiki/Support_vector_machine

Kernel “Trick”
Kernel Function
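A small sketch may help make the kernel "trick" concrete. The slides do not say which kernel is shown, so the degree-2 polynomial kernel below, the feature map `phi`, and the sample vectors are all assumptions chosen for illustration. The point is that the kernel value computed in the original 2-D space equals an ordinary dot product in a higher-dimensional feature space, without ever constructing that space explicitly.

```python
import math

def dot(x, z):
    return sum(xi * zi for xi, zi in zip(x, z))

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel K(x, z) = (x . z)^2, computed in input space."""
    return dot(x, z) ** 2

def phi(x):
    """Explicit feature map for 2-D input such that phi(x) . phi(z) = (x . z)^2."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

# Hypothetical sample points.
X = (1.0, 2.0)
Z = (3.0, 4.0)
kernel_value = poly2_kernel(X, Z)       # computed in 2-D
explicit_value = dot(phi(X), phi(Z))    # computed in 3-D feature space
```

An SVM only needs dot products between examples, so replacing each dot product with a kernel evaluation lets it find a maximum-margin hyperplane in the feature space, which corresponds to a nonlinear boundary in the original space.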
Take Home Messages
• Must consider how to set up the learning problem (supervised or unsupervised, generative or discriminative, classification or regression, etc.)
• Lots of algorithms out there
• No algorithm performs best on all problems
Genomics - for excellent overview lectures, see these posted by NHGRI & Pevsner:
1- Genomic sequencing
Mapping and Sequencing
Eric Green, NHGRI CTGA2005Lecture1.pdf
2- Human genome project
The Human Genome 2005-10-19_ch17.pdf
Jonathan Pevsner, Kennedy Krieger Institute
3- SNPs
Studying Genetic Variation II: Computational Techniques
Jim Mullikin, NHGRI TGA2005Lecture13.pdf
4- Comparative Genomics
Comparative Sequence Analysis
Elliott Margulies, NHGRI CTGA2005Lecture8.pdf

1- Genomic sequencing
Many thanks to: Eric Green, NHGRI
for the following slides extracted from his lecture on:
Mapping and Sequencing
CTGA2005Lecture1.pdf
Genomic Sequencing - Brief Review
Comparison of Sequenced Genome Sizes
Comparison of Genetic & Physical Maps
STSs: Provide common markers for "linking" genetic & physical maps
E Green 2005
With complete genomes (now), why bother to generate physical maps?
Genomic sequencing requires assembly of sequences obtained from cloned DNA
E Green 2005

Human Genome Sequencing
Two approaches:
• Public (government) - International Consortium (6 countries, NIH-funded in US)
  • "Hierarchical" cloning & BAC-by-BAC sequencing
  • Map-based assembly
• Private (industry) - Celera (Craig Venter)
  • Whole genome random "shotgun" sequencing
  • Computational assembly (took advantage of public maps & sequences, too)
Guess which human genome Celera sequenced?

NIH: "Hierarchical" BAC-by-BAC Sequencing
"Hierarchical" Subcloning Strategy
Celera: Whole-Genome "Shotgun" Sequencing
E Green 2005
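The computational assembly step mentioned above can be illustrated with a toy greedy overlap assembler: repeatedly merge the two reads with the longest suffix-prefix overlap. This is a simplified sketch, not the method used by Celera or the public consortium; real assemblers must also handle sequencing errors, repeats, reverse complements, and mate-pair constraints. The reads and the minimum overlap length are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair of reads until no overlaps remain."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # remaining reads form separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads

# Hypothetical reads sampled from the short sequence ATGGCGTACCTGA.
TOY_READS = ["ATGGCGT", "GCGTACC", "TACCTGA"]
contigs = greedy_assemble(TOY_READS)
```

Reads that share no sufficiently long overlap are left as separate contigs, which loosely mirrors why gaps remain after assembly and why sequence "finishing" is needed.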
"Shotgun" Sequencing Strategy
Either Strategy: Sequence "Finishing" = Hardest part !!
Advances in DNA Sequencing Technology
Sequencing Method #1: Maxam-Gilbert "Chemical Degradation"
Sequencing Method #2: Sanger "Di-deoxy Chain Termination"
Automated Sequencing for Genome Projects: Sanger method - with improvements
Another “recent” improvement: rapid & high resolution separation of fragments in capillaries instead of gels (E Yeung, Ames Lab, ISU)
E Green 2005
Recent technologies? Pyro- & 454 Sequencing
1st Eukaryotic Genome Sequence: S. cerevisiae
1st Animal Genome Sequence: C. elegans
Timetable for Human Genome Sequencing: Faster than expected!
1st Draft Human Genome: "Complete" in 2001
Public Sequencing - International Consortium
E Green 2005
"Finishing" the Human Genome - continues…
E Green 2005
After "Complete" Human Genome Sequence: What next?
Interpreting the Human Genome Sequence!
Comparative Genomics: now with complete genomic sequences
Comparing Genomes: Functional Elements
ENCODE Project
ENCODE - Web Sites
E Green 2005
ENCODE - Results? June 2007
http://www.nature.com/nature/journal/v447/n7146/full/nature05874.html
Eric Green's Genomic Sequencing Challenges (2005 List)