Multiple testing correction - Noble Research Lab

The rest of bioinformatics
Prof. William Stafford Noble
Department of Genome Sciences
Department of Computer Science and Engineering
University of Washington
thabangh@gmail.com
One-minute responses
• I always like it when we ask questions and you first say good question, even though the question is not good.
• I liked the lecture although the concepts were a bit advanced for me.
• I understood about 90% of everything.
• The Python is more challenging but it is good to get confused sometimes.
• Python was more interesting!
• The comprehension of Python is improved at 95%.
• Today’s program (first one) was really challenging. I thought the second one was easier to understand.
• Python problem 3 was really challenging for me.
• The Python today was completely different from the rest and needed more time.
• Do your students at home write one-minute responses for the whole semester every day?
– Yes.
• How did we discover the first mutation?
– I am not sure I understand the question. We can observe mutations happening in microorganisms in the lab by sequencing their DNA from one generation to the next.
• Are you going to be readily available in future for consultations in case I get stuck?
– Yes, you can always email me at thabangh@gmail.com.
• I do not think species are related because I believe in creation.
Outline
• Parsimony
• Distance methods
– Computing distances
– Finding the tree
• Maximum likelihood
Revision
ACGCGTTGGG
ACGCGTTGGG
ACGCAATGAA
ACACAGGGAA
[Figure: the alignment beside a tree with leaves T, T, A, G, labeled Pr(column|tree,model)]
• How do we compute the probability of
observing this column, given this tree and an
assumed model of evolution?
Revision
[Figure: copies of the tree with leaves T, T, A, G, each with a different assignment of nucleotides to the internal nodes]
• We enumerate all possible assignments to the internal nodes, compute the probability of each assigned tree, and sum.
Revision
ACGCGTTGGG
ACGCGTTGGG
ACGCAATGAA
ACACAGGGAA
[Figure: the alignment beside the tree with leaves T, T, A, G and its internal nodes assigned, labeled Pr(column|tree,model)]
• How do we compute the probability of
observing this column, given this assigned tree
and an assumed model of evolution?
Revision
πA, πC, πG, πT
[Figure: the assigned tree with leaves T, T, A, G, carrying the labels L0 through L6]
• We use our evolutionary model to assign a probability to each branch, and then take the product of the probabilities of the branches.
• L(tree) = L0 × L1 × L2 × L3 × L4 × L5 × L6
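The product of branch probabilities can be sketched in Python. The slides do not name the evolutionary model, so this sketch assumes the Jukes-Cantor model with uniform frequencies πA = πC = πG = πT = 0.25, treats L0 as the frequency of the root nucleotide, and uses a made-up branch length of 0.1 on every branch. The internal-node assignment and the names `jc_prob` and `assigned_tree_likelihood` are hypothetical.

```python
import math

# Assumed equilibrium frequencies (uniform under Jukes-Cantor).
PI = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def jc_prob(parent, child, t):
    """Jukes-Cantor probability that `parent` is observed as `child`
    after a branch of length t (expected substitutions per site)."""
    decay = math.exp(-4.0 * t / 3.0)
    if parent == child:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay

def assigned_tree_likelihood(root, branches):
    """Multiply the root frequency by the probability of every branch."""
    likelihood = PI[root]  # plays the role of L0 (an assumption)
    for parent, child, t in branches:
        likelihood *= jc_prob(parent, child, t)  # L1 ... L6
    return likelihood

# Hypothetical assigned tree: root A, internal nodes T and A, leaves T, T, A, G.
branches = [
    ("A", "T", 0.1),  # root -> left internal node
    ("A", "A", 0.1),  # root -> right internal node
    ("T", "T", 0.1),  # left internal node -> leaf T
    ("T", "T", 0.1),  # left internal node -> leaf T
    ("A", "A", 0.1),  # right internal node -> leaf A
    ("A", "G", 0.1),  # right internal node -> leaf G
]
print(assigned_tree_likelihood("A", branches))
```

The seven factors (one root frequency plus six branches) match the seven terms L0 through L6, which is why the sketch reads L0 as the root term.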
Revision
• In maximum likelihood estimation, are mutations that
occur on branches of a single tree considered independent
or mutually exclusive events?
– Independent.
• What do different labelings of internal nodes of a tree
represent?
– Different possible evolutionary histories.
• Are the different labelings independent or mutually
exclusive?
– Mutually exclusive.
• Are the columns of a multiple alignment considered
independent or mutually exclusive?
– Independent.
Maximum likelihood revisited
for each possible tree
  for each column of the alignment
    for each assignment of internal nodes
      for each branch
        compute the probability of that branch
      assigned tree probability ← multiply branch probabilities
    column probability ← sum assigned tree probabilities
  tree probability ← multiply column probabilities
return the tree with the highest probability
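The loop above can be written out as a small, brute-force Python sketch. It is an illustration under stated assumptions, not the lecture's implementation: it assumes the Jukes-Cantor model, a fixed branch length of 0.1 on every branch, and only the three ways of pairing four sequences into a rooted tree. Node numbers 4, 5, 6 and all function names are hypothetical.

```python
import itertools
import math

NUCLEOTIDES = "ACGT"

def jc_prob(parent, child, t):
    """Jukes-Cantor substitution probability along a branch of length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if parent == child else 0.25 - 0.25 * decay

def column_likelihood(column, branches, internal_nodes, t=0.1):
    """Sum over every labeling of the internal nodes (mutually exclusive histories)."""
    total = 0.0
    for labels in itertools.product(NUCLEOTIDES, repeat=len(internal_nodes)):
        state = dict(enumerate(column))            # leaf k carries column[k]
        state.update(zip(internal_nodes, labels))  # one candidate history
        prob = 0.25                                # uniform root frequency
        for parent, child in branches:             # branches are independent: multiply
            prob *= jc_prob(state[parent], state[child], t)
        total += prob
    return total

def tree_likelihood(alignment, branches, internal_nodes, t=0.1):
    """Multiply column likelihoods (columns are treated as independent)."""
    likelihood = 1.0
    for column in zip(*alignment):
        likelihood *= column_likelihood(column, branches, internal_nodes, t)
    return likelihood

def four_leaf_tree(pair_a, pair_b):
    """Branches (parent, child) of a rooted tree: root 4, internal nodes 5 and 6."""
    return [(4, 5), (4, 6),
            (5, pair_a[0]), (5, pair_a[1]),
            (6, pair_b[0]), (6, pair_b[1])]

alignment = ["ACGCGTTGGG", "ACGCGTTGGG", "ACGCAATGAA", "ACACAGGGAA"]
candidates = {
    "((0,1),(2,3))": four_leaf_tree((0, 1), (2, 3)),
    "((0,2),(1,3))": four_leaf_tree((0, 2), (1, 3)),
    "((0,3),(1,2))": four_leaf_tree((0, 3), (1, 2)),
}
scores = {name: tree_likelihood(alignment, tree, [4, 5, 6])
          for name, tree in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores[best])
```

Enumerating every labeling costs 4 to the power of the number of internal nodes per column, so this direct form is only practical for very small trees.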
Sequence analysis tasks
• Protein structure prediction
• Remote homology detection
• Gene finding
Protein structure prediction
• Given: amino acid sequence
• Return: protein structure
A complex of earthworm hemoglobin, composed of 144 globin chains.
Source: Protein Data Bank.
Remote homology detection
[Figure: profile hidden Markov model with begin state B, end state E, match states M1 to M8, insert states I0 to I8, and delete states D1 to D8]
• The hidden Markov model generalizes the PSSM used
by PSI-BLAST.
• The model is trained using expectation-maximization.
Gene finding
Pedersen and Hein, Bioinformatics 2003.
Mass spectrometry
• Spectrum identification
• Protein inference
• Biomarker discovery
EAMPK
GDIFYPGYCPDVK
LPLENENQGK
ASVYNSFVSNGVK
YVMTFK
ENQGVVNR
Biological networks
• Functional networks
• Protein-protein interaction networks
• Metabolic networks
• Regulatory networks
Adai et al. JMB 340:179-190 (2004).
Protein-protein interactions
• Each node is a protein.
• Each edge is a physical interaction.
• Edges are measured via
– Yeast two-hybrid
– TAP tagging plus MS/MS
Jeong et al. Nature. 2001.
Regulatory networks
• Mammalian cell cycle.
• Colors represent different types of interactions
– Black: binding
– Red: covalent modifications and gene expression
– Green: enzyme actions
– Blue: stimulations and inhibitions
Kohn. Mol Cell Biol. 1999
Metabolic networks
• Nodes are enzymes or metabolites.
• Edges represent interactions.
• This network represents the Arabidopsis TCA cycle.
Gene expression
• Clustering
• Predictive modeling
• Clinical applications
Gene expression matrix
[Figure: matrix with genes as rows and experiments as columns]
The matrix entry at (i, j) is the expression level of gene i in experiment j.
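As a small illustration of this layout, the matrix can be held as a list of rows, one row per gene. The gene names, experiment labels, and numbers below are made up for the example and do not come from any dataset in the lecture.

```python
# Hypothetical labels and values, for illustration only.
genes = ["geneA", "geneB", "geneC"]       # rows
experiments = ["0h", "1h", "4h", "8h"]    # columns
expression = [
    [1.0, 2.1, 3.9, 4.2],   # geneA across the four experiments
    [1.1, 0.9, 1.0, 1.2],   # geneB
    [5.0, 2.4, 1.1, 0.6],   # geneC
]

def level(i, j):
    """Expression level of gene i in experiment j."""
    return expression[i][j]

def gene_profile(i):
    """All measurements for gene i (one row)."""
    return expression[i]

def experiment_profile(j):
    """All genes measured in experiment j (one column)."""
    return [row[j] for row in expression]

print(genes[2], experiments[1], level(2, 1))
print(experiment_profile(0))
```

Clustering genes compares rows of this matrix, while clustering experiments compares columns.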
Fibroblast gene clustering
• Cholesterol biosynthesis
• Cell cycle
• Immediate-early response
• Signaling and angiogenesis
• Wound healing and tissue remodeling
Iyer et al. “The transcriptional program in the response of
human fibroblasts to serum.” Science. 283:83-7, 1999.
Achieves >75% accuracy.
Next generation sequencing
Next generation sequencing video
Spaced seed alignment
• Tags and tag-sized pieces of reference are cut into small “seeds.”
• Pairs of spaced seeds are stored in an index.
• Look up spaced seeds for each tag.
• For each “hit,” confirm the remaining positions.
• Report results to the user.
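The steps above can be sketched in Python. This is a toy version under assumptions the slide does not state: each tag is cut into four equal chunks, every pair of chunks forms one spaced seed, and a hit is confirmed if the whole tag differs from the reference window in at most two positions. The function names and the example reference are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def seed_keys(tag, n_chunks=4):
    """Yield ((chunk_a, chunk_b), bases) for every pair of chunks in the tag."""
    size = len(tag) // n_chunks
    chunks = [tag[k * size:(k + 1) * size] for k in range(n_chunks)]
    for a, b in combinations(range(n_chunks), 2):
        yield (a, b), chunks[a] + chunks[b]

def build_index(reference, tag_len, n_chunks=4):
    """Index every tag-sized window of the reference by its spaced seeds."""
    index = defaultdict(list)
    for pos in range(len(reference) - tag_len + 1):
        window = reference[pos:pos + tag_len]
        for pair, bases in seed_keys(window, n_chunks):
            index[(pair, bases)].append(pos)
    return index

def align(tag, reference, index, max_mismatches=2, n_chunks=4):
    """Look up each seed of the tag, then confirm the remaining positions."""
    hits = set()
    for pair, bases in seed_keys(tag, n_chunks):
        for pos in index.get((pair, bases), []):
            window = reference[pos:pos + len(tag)]
            mismatches = sum(x != y for x, y in zip(tag, window))
            if mismatches <= max_mismatches:
                hits.add((pos, mismatches))
    return sorted(hits)  # (position, number of mismatches)

reference = "GATTACAGATTTCCAGGCTA"       # made-up reference
index = build_index(reference, tag_len=8)
tag = "ACCGATTT"                          # reference[4:12] with one substitution
print(align(tag, reference, index))
```

With four chunks, a tag with at most two mismatches always has at least two chunks that match exactly, so at least one chunk pair is found in the index.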
Burrows-Wheeler
• Store entire reference genome.
• Align tag base by base from the end.
• When tag is traversed, all active locations are reported.
• If no match is found, then back up and try a substitution.
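A minimal sketch of this idea in Python follows. It builds the Burrows-Wheeler index naively by sorting suffixes, which is only workable for tiny references, and it explores substitutions up to a fixed budget rather than strictly backing up after a failure. The names `bwt_index` and `search` and the example reference are hypothetical; real aligners use far more compact data structures.

```python
def bwt_index(reference):
    """Build a suffix array plus the count tables needed for backward search."""
    text = reference + "$"                       # "$" marks the end and sorts first
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    first, total = {}, 0                         # first[c]: characters sorting before c
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    occ = {c: [0] for c in alphabet}             # occ[c][k]: count of c in bwt[:k]
    for ch in bwt:
        for c in alphabet:
            occ[c].append(occ[c][-1] + (ch == c))
    return suffix_array, first, occ

def search(tag, index, max_substitutions=0):
    """Match the tag from its last base to its first; return sorted start positions."""
    suffix_array, first, occ = index
    hits = set()

    def extend(i, lo, hi, budget):
        if lo >= hi:                             # no active locations remain
            return
        if i < 0:                                # whole tag traversed: report locations
            hits.update(suffix_array[lo:hi])
            return
        for c in first:
            if c == "$":
                continue
            cost = 0 if c == tag[i] else 1       # a different base is a substitution
            if cost <= budget:
                extend(i - 1, first[c] + occ[c][lo],
                       first[c] + occ[c][hi], budget - cost)

    extend(len(tag) - 1, 0, len(suffix_array), max_substitutions)
    return sorted(hits)

reference = "GATTACAGATTTCCAGGCTA"               # made-up reference
index = bwt_index(reference)
print(search("ATT", index))                      # exact matches only
print(search("ACCGATTT", index) or search("ACCGATTT", index, 1))
```

The last line mirrors the slide's strategy at a coarse level: try an exact match first, and allow one substitution only when nothing is found.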
Spliced-read mapping
• Used for processed mRNA data.
• Reports reads that span introns.
• Examples: TopHat, ERANGE
Beyond the genome
• Epigenetics
• Chromatin state assignment
• Genome 3D architecture
Next generation assays
ENCODE Project Consortium 2011. PLoS Biol 9:e1001046
Rediscovering genes
Population genetics
• Genotype to phenotype
• Human disease genetics
• Population history
Human migrations
jbiol.com
Other topics
• Natural language processing
• Image analysis