Genomics BCB 444/544 Lecture 33 #33_Nov09

BCB 444/544
Lecture 33
Genomics
#33_Nov09
BCB 444/544 F07 ISU Dobbs#33 - Genomics
11/09/07
Required Reading
(before lecture)
√ Mon Nov 5 - Lecture 31
Phylogenetics – Parsimony and ML
• Chp 11 - pp 142 – 169
√ Wed Nov 7 - Lecture 32
Machine Learning
Fri Nov 9 - Lecture 33
Functional and Comparative Genomics
• Chp 17 and Chp 18
Assignments & Announcements
Fri Nov 9 - HW#6
(will be posted this weekend)
HW#6 - More fun with Machine Learning!!
Due: Fri Nov 16
(or sometime before Mon Nov 26)
Seminars this Week
BCB List of URLs for Seminars related to Bioinformatics:
http://www.bcb.iastate.edu/seminars/index.html
• Nov 7 Wed - BBMB Seminar 4:10 in 1414 MBB
• Sharon Roth Dent
MD Anderson Cancer Center
• Role of chromatin and chromatin modifying proteins in
regulating gene expression
• Nov 8 Thurs - BBMB Seminar 4:10 in 1414 MBB
• Jianzhi George Zhang
U. Michigan
• Evolution of new functions for proteins
• Nov 9 Fri - BCB Faculty Seminar 2:10 in 102 SciI
• Amy Andreotti
ISU
• T cell signaling: insights from protein NMR spectroscopy
Chp 11 – Phylogenetic Tree Construction Methods
and Programs
SECTION IV MOLECULAR PHYLOGENETICS
Xiong: Chp 11 Phylogenetic Tree Construction Methods
and Programs
• Distance-Based Methods
• Character-Based Methods
• Phylogenetic Tree Evaluation
• Phylogenetic Programs
Machine Learning
• What is learning?
• What is machine learning?
• Learning algorithms
• Machine learning applied to bioinformatics and
computational biology
• Some slides adapted from Dr. Vasant Honavar and Dr. Byron Olson
Examples of Machine Learning Algorithms
• Naïve Bayes (NB)
• Bayes Theorem
• Neural network (NN) or Artificial Neural Net (ANN)
• Perceptrons
• Support Vector Machine (SVM)
• Kernel functions
Lab - WEKA: Decision Trees (DT), NB, SVM
An Application:
Predicting RNA Binding Sites in Proteins
• Problem: Given an amino acid sequence, classify
each residue as RNA binding or non-RNA binding
• Input to the classifier is a string of amino acid
identities
• Output from the classifier is a class label, either
binding or not
Bayes Theorem Applied to
RNA Binding Site Prediction
$$P(\text{binding} \mid \text{aa seq}) = \frac{P(\text{binding})\, P(\text{aa seq} \mid \text{binding})}{P(\text{aa seq})}$$
$$P(c = 1 \mid X = x) = \frac{P(c = 1)\, P(X = x \mid c = 1)}{P(X = x)}$$
$$P(c = 0 \mid X = x) = \frac{P(c = 0)\, P(X = x \mid c = 0)}{P(X = x)}$$
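As a minimal sketch of the two-class form of Bayes' theorem on this slide, the snippet below computes P(c = 1 | X = x) from a prior and the two class-conditional likelihoods. The function name `posterior_binding` and all numeric values are hypothetical, chosen only for illustration; P(X = x) is expanded over the two classes using the law of total probability.

```python
def posterior_binding(prior_binding, lik_binding, lik_nonbinding):
    """Return P(c=1 | X=x) for a two-class problem via Bayes' theorem.

    prior_binding  : P(c=1)
    lik_binding    : P(X=x | c=1)
    lik_nonbinding : P(X=x | c=0)
    """
    prior_nonbinding = 1.0 - prior_binding
    # P(X=x) = P(c=1) P(X=x|c=1) + P(c=0) P(X=x|c=0)
    evidence = prior_binding * lik_binding + prior_nonbinding * lik_nonbinding
    return prior_binding * lik_binding / evidence


# Hypothetical numbers, for illustration only
p = posterior_binding(prior_binding=0.2, lik_binding=0.03, lik_nonbinding=0.01)
print(round(p, 3))
```

Because both posteriors share the same denominator P(X = x), comparing the two classes only requires the numerators.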
Naïve Bayes for Binary Classification
Assign c = 1 if
$$P(c = 1 \mid X = x) \ge P(c = 0 \mid X = x)$$
Otherwise, assign c = 0
Example: Is ARG 6 RNA-binding or not?
ARG 6
TSKKKRQRGSR
$$\frac{p(X_1 = T \mid c = 1)\; p(X_2 = S \mid c = 1) \cdots}{p(X_1 = T \mid c = 0)\; p(X_2 = S \mid c = 0) \cdots} \ge \theta$$
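The ratio test for the ARG 6 window can be sketched as follows. This is an illustrative sketch only: the probability tables built by `toy_table` are hypothetical (they simply favor K and R in the binding class), whereas a real classifier would estimate p(X_i = aa | c) from labeled training data. Logarithms are used so that the product of many small probabilities does not underflow.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_table(favored, boost):
    """Hypothetical p(X_i = aa | c): residues in `favored` get `boost` times the weight."""
    weights = {aa: (boost if aa in favored else 1.0) for aa in AMINO_ACIDS}
    total = sum(weights.values())
    return {aa: w / total for aa, w in weights.items()}


def log_ratio(window, tables_c1, tables_c0):
    """log of [prod_i p(X_i | c=1)] / [prod_i p(X_i | c=0)] under the naive independence assumption."""
    return sum(
        math.log(t1[aa]) - math.log(t0[aa])
        for aa, t1, t0 in zip(window, tables_c1, tables_c0)
    )


def classify(window, tables_c1, tables_c0, theta=1.0):
    """Return 1 (binding) if the likelihood ratio is >= theta, else 0."""
    return 1 if log_ratio(window, tables_c1, tables_c0) >= math.log(theta) else 0


window = "TSKKKRQRGSR"  # 11-residue window centered on ARG 6
tables_c1 = [toy_table("KR", 4.0)] * len(window)  # hypothetical binding-class tables
tables_c0 = [toy_table("", 1.0)] * len(window)    # hypothetical uniform non-binding tables
label = classify(window, tables_c1, tables_c0)
print(label)
```

Raising θ makes the classifier more conservative about calling a residue RNA-binding; lowering it does the opposite.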
Predicted vs Actual RNA Binding for
Ribosomal protein L15 (PDB ID 1JJ2:K)
[Figure: predicted vs. actual RNA-binding residues; panels labeled Predicted and Actual]
Artificial Neural Networks (ANNs or NNs)
• Neural networks - classify “input vectors” or “examples”
into categories (2 or more)
• They are loosely based on biological neurons
• Some of the most successful methods for predicting
secondary structure are based on neural networks:
• Neural networks are trained to recognize amino acid patterns
corresponding to known secondary structure elements; these
patterns are used to predict secondary structure type for aa
sequences in proteins of unknown structure
Biological Neurons “Sum” Input Signals
& Generate Output Signal
Dendrites receive inputs, Axon sends output
Image from Christos Stergiou and Dimitrios Siganos
http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html
Simple Neuron = “Perceptron”
Perceptron is “Simplest ANN” = feed-forward NN
= linear classifier
Image from Christos Stergiou and Dimitrios Siganos
http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html
The Perceptron
Input X: X1, X2, …, XN
Weights W: w1, w2, …, wN
Summation S:
$$S = \sum_{i=1}^{N} X_i W_i$$
Threshold T
Output F:
$$F = \begin{cases} 1 & S > T \\ 0 & S \le T \end{cases}$$
Perceptron combines the components X1…XN of the input vector X, compares the “sum” S with a
threshold T, and generates an output class label: either 1 or 0
If the weights W and threshold T are not known in advance, the perceptron
must be trained. Ideally, the perceptron is trained to return the correct answer for
all training examples, and to perform well on test examples it has never seen.
The training set must contain both classes of data (i.e., with “1” and “0” output).
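A minimal sketch of the perceptron's forward computation, assuming the output is 1 when the sum S exceeds the threshold T and 0 otherwise. The weights and threshold below are hand-picked (hypothetical) values that make the unit behave like a logical AND; they are not learned.

```python
def perceptron_output(x, w, threshold):
    """Compute S = sum_i x_i * w_i and return 1 if S > threshold, else 0."""
    s = sum(xi * wi for xi, wi in zip(x, w))
    return 1 if s > threshold else 0


# Hand-picked weights and threshold: the unit fires only when both inputs are 1
w = (1.0, 1.0)
T = 1.5
outputs = [perceptron_output(x, w, T) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(outputs)
```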
Perceptron “Sums” Inputs by Computing
Dot Product S = XW
• Input is a vector X; Weight is are another vector W
• Perceptron Summation S computes the dot product, S = XW
• Perceptron Output F is a function of S: it is often discrete (1 or 0),
in which case the function is a step function
• For continuous output, a sigmoidal function is often used:
1
1
F(X ) 
1  e X
1/2
0
0
BCB 444/544 F07 ISU Dobbs#33 - Genomics
11/09/07
17
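The two output functions mentioned on this slide can be compared with a short sketch. The helper names `dot`, `step`, and `sigmoid` and the example vectors are illustrative assumptions, not part of the lecture.

```python
import math


def dot(x, w):
    """Dot product S = X . W of two equal-length vectors."""
    return sum(xi * wi for xi, wi in zip(x, w))


def step(s, threshold=0.0):
    """Discrete output: 1 if s exceeds the threshold, else 0."""
    return 1 if s > threshold else 0


def sigmoid(s):
    """Continuous output in (0, 1): 1 / (1 + e^(-s))."""
    return 1.0 / (1.0 + math.exp(-s))


x = (1.0, 2.0, -1.0)   # hypothetical input vector
w = (0.5, 0.25, 1.0)   # hypothetical weight vector
s = dot(x, w)          # 0.5 + 0.5 - 1.0 = 0.0
print(s, step(s), sigmoid(s))
```

Unlike the step function, the sigmoid is differentiable everywhere, which is what makes gradient-based training possible.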
Training a Perceptron
Find the weights W that minimize the error function E:
$$E = \sum_{i=1}^{P} \left( F(X_i \cdot W) - t(X_i) \right)^2$$
P: number of training examples
Xi: training vectors
F(Xi · W): output of perceptron
t(Xi): target value for Xi
Use steepest descent:
- compute gradient:
$$\nabla E = \left( \frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \frac{\partial E}{\partial w_3}, \ldots, \frac{\partial E}{\partial w_N} \right)$$
- update weight vector:
$$W_{\text{new}} = W_{\text{old}} - \eta \nabla E$$
(η: learning rate)
- iterate
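The steepest-descent loop can be sketched as below for a perceptron with a sigmoid output. Several choices here are assumptions made for illustration: the sigmoid output function, the learning rate and iteration count, the toy data set (logical OR), and the use of a constant third input of 1 so that its weight plays the role of the threshold.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def dot(x, w):
    return sum(xi * wi for xi, wi in zip(x, w))


def error(examples, w):
    """E = sum_i (F(X_i . W) - t(X_i))^2"""
    return sum((sigmoid(dot(x, w)) - t) ** 2 for x, t in examples)


def gradient(examples, w):
    """Gradient of E with respect to each weight w_j."""
    grad = [0.0] * len(w)
    for x, t in examples:
        f = sigmoid(dot(x, w))
        # Chain rule: d/dw_j (f - t)^2 = 2 (f - t) * f (1 - f) * x_j
        common = 2.0 * (f - t) * f * (1.0 - f)
        for j, xj in enumerate(x):
            grad[j] += common * xj
    return grad


def train(examples, n_weights, eta=0.5, n_iter=10000):
    """Repeat W_new = W_old - eta * grad(E), starting from all-zero weights."""
    w = [0.0] * n_weights
    for _ in range(n_iter):
        g = gradient(examples, w)
        w = [wj - eta * gj for wj, gj in zip(w, g)]
    return w


# Toy training set: logical OR; the third input is a constant 1 (bias term)
examples = [((0, 0, 1), 0), ((0, 1, 1), 1), ((1, 0, 1), 1), ((1, 1, 1), 1)]
w = train(examples, n_weights=3)
predictions = [round(sigmoid(dot(x, w))) for x, _ in examples]
print(predictions)
```

A learning rate that is too large can make the updates overshoot; one that is too small makes training slow.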
Artificial Neural Network (ANN)
Artificial neural network
• Set of perceptrons
interconnected such that
outputs of some units become
inputs of other units
• Many topologies are possible!
• Can have multiple layers
Neural networks are trained in the same way perceptrons are trained,
by minimizing an error function:
$$E = \sum_{i=1}^{P} \left( \mathrm{NN}(X_i) - t(X_i) \right)^2$$
(NN(Xi): output of the network for training vector Xi)
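To illustrate "outputs of some units become inputs of other units", here is a small sketch of a two-layer network built from step-function perceptrons. The weights and thresholds are set by hand (hypothetical values, not trained) so that the network computes XOR, a function that no single perceptron can represent because it is not linearly separable.

```python
def unit(inputs, weights, threshold):
    """A single step-function perceptron: 1 if the weighted sum exceeds the threshold."""
    s = sum(x * w for x, w in zip(inputs, weights))
    return 1 if s > threshold else 0


def xor_network(x1, x2):
    # Hidden layer: one unit acts as OR, the other as AND
    h_or = unit((x1, x2), (1.0, 1.0), 0.5)
    h_and = unit((x1, x2), (1.0, 1.0), 1.5)
    # Output unit fires when OR is on but AND is off
    return unit((h_or, h_and), (1.0, -1.0), 0.5)


truth_table = [xor_network(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(truth_table)
```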
Support Vector Machines - SVMs
Image from http://en.wikipedia.org/wiki/Support_vector_machine
SVM Finds Maximum-Margin Hyperplane
(i.e., hyperplane that provides maximum separation
between two classes of instances in dataset)
Image from http://en.wikipedia.org/wiki/Support_vector_machine
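One standard way to write the maximum-margin problem, for linearly separable data (the "hard-margin" case), is shown below. The symbols are assumptions introduced here and do not appear on the slide: x_i are the training vectors, y_i in {-1, +1} are their class labels, w is the normal vector of the hyperplane, b is its offset, and P is the number of training examples.

```latex
\min_{w,\,b} \ \tfrac{1}{2}\,\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,(w \cdot x_i + b) \ge 1, \qquad i = 1, \ldots, P
```

The width of the margin between the two classes is 2 / ||w||, so minimizing ||w|| maximizes the separation.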
Kernel “Trick”
Kernel Function
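A small sketch of the idea behind a kernel function: it returns the dot product of two vectors in a higher-dimensional feature space without constructing that space explicitly. The degree-2 polynomial kernel and the explicit feature map `phi` below are one textbook example chosen for illustration; the original slide does not specify which kernel was shown.

```python
import math


def poly2_kernel(x, z):
    """Degree-2 polynomial kernel on 2-D inputs: k(x, z) = (x . z)^2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def phi(x):
    """Explicit feature map whose dot product equals poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


x, z = (1.0, 2.0), (3.0, 4.0)   # hypothetical input vectors
implicit = poly2_kernel(x, z)    # computed in the original 2-D space
explicit = dot(phi(x), phi(z))   # computed in the 3-D feature space
print(implicit, explicit)
```

Both routes give the same value, but the kernel never builds the feature vectors, which matters when the feature space is very large.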
Take Home Messages
• Must consider how to set up the learning problem
(supervised or unsupervised, generative or
discriminative, classification or regression, etc.)
• Lots of algorithms out there
• No algorithm performs best on all problems
Genomics -
for excellent overview lectures,
see these posted by NHGRI & Pevsner:
1- Genomic sequencing
Mapping and Sequencing
Eric Green, NHGRI
CTGA2005Lecture1.pdf
2- Human genome project
The Human Genome 2005-10-19_ch17.pdf
Jonathan Pevsner, Kennedy Krieger Institute
3- SNPs
Studying Genetic Variation II: Computational Techniques
Jim Mullikin, NHGRI TGA2005Lecture13.pdf
4- Comparative Genomics
Comparative Sequence Analysis
Elliott Margulies, NHGRI CTGA2005Lecture8.pdf
1- Genomic sequencing
Many thanks to:
Eric Green, NHGRI
for the following slides extracted from his lecture on:
Mapping and Sequencing
CTGA2005Lecture1.pdf
Genomic Sequencing - Brief Review
E Green 2005
Comparison of Sequenced Genome Sizes
E Green 2005
Comparison of Genetic & Physical Maps
E Green 2005
STSs: Provide common markers for "linking"
genetic & physical maps
E Green 2005
With complete genomes (now), why bother to
generate physical maps?
E Green 2005
Genomic sequencing requires assembly of
sequences obtained from cloned DNA
E Green 2005
Human Genome Sequencing
Two approaches:
• Public (government) - International Consortium
(6 countries, NIH-funded in US)
• "Hierarchical" cloning & BAC-by-BAC sequencing
• Map-based assembly
• Private (industry) - Celera (Craig Venter)
• Whole genome random "shotgun" sequencing
• Computational assembly
(took advantage of public maps & sequences, too)
Guess which human genome Celera sequenced?
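To give a feel for what "computational assembly" of shotgun reads involves, here is a toy greedy overlap-merge sketch. It is a deliberately simplified illustration, not the method used by Celera or the public consortium: the reads and the `min_len` cutoff are hypothetical, and real assemblers must also handle sequencing errors, repeats, both DNA strands, and millions of reads.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: leave the contigs separate
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Hypothetical overlapping reads drawn from the toy sequence ATGGCGTACGTTAGC
reads = ["ACGTTAGC", "ATGGCGTA", "GCGTACGT"]
contigs = greedy_assemble(reads)
print(contigs)
```

In the hierarchical (BAC-by-BAC) approach the same kind of assembly is done within each mapped clone, which keeps each assembly problem small and anchored to the physical map.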
NIH: "Hierarchical" BAC-by-BAC Sequencing
E Green 2005
"Hierarchical" Subcloning Strategy
E Green 2005
Celera: Whole-Genome "Shotgun" Sequencing
E Green 2005
"Shotgun" Sequencing Strategy
E Green 2005
Either Strategy:
Sequence "Finishing" = Hardest part !!
E Green 2005
Advances in DNA Sequencing Technology
E Green 2005
Sequencing Method #1: Maxam-Gilbert
"Chemical Degradation"
E Green 2005
Sequencing Method #2: Sanger
"Di-deoxy Chain Termination"
E Green 2005
Automated Sequencing for Genome Projects:
Sanger method - with improvements
Another “recent” improvement: rapid & high resolution
separation of fragments in capillaries instead of gels
(E Yeung, Ames Lab, ISU)
E Green 2005
Recent technologies?
Pyro- & 454 Sequencing
1st Eukaryotic Genome Sequence:
S. cerevisiae
E Green 2005
1st Animal Genome Sequence:
C. elegans
E Green 2005
Timetable for Human Genome Sequencing:
Faster than expected!
E Green 2005
1st Draft Human Genome:
"Complete" in 2001
E Green 2005
Public Sequencing - International Consortium
E Green 2005
"Finishing" the Human Genome - continues…
E Green 2005
After "Complete" Human Genome Sequence
What next?
E Green 2005
Interpreting the Human Genome Sequence!
E Green 2005
Comparative Genomics:
now with complete genomic sequences
E Green 2005
Comparing Genomes: Functional Elements
E Green 2005
ENCODE Project
E Green 2005
ENCODE - Web Sites
E Green 2005
ENCODE - Results? June 2007
http://www.nature.com/nature/journal/v447/n7146/full/nature05874.html
Eric Green's Genomic Sequencing Challenges
(2005 List)
E Green 2005