Some Big Questions & Little Answers on Phylogenetic Inference
Elchanan Mossel, U.C. Berkeley

Inference in other areas of science
• Phylogeny is not the only area using statistical tools.
• In other areas there have been serious methodological issues.
7/27/2016
2
Inference in Psychology
• “Fraud Case Seen as a Red Flag for Psychology Research” (NY Times, Nov 2nd) – scientist: Diederik Stapel.
• Leading researcher on “hypocrisy, on racial stereotyping and on how advertisements affect how people view themselves”.
• [Photo caption: The psychologist Diederik Stapel in an undated photograph.]
• “I have failed as a scientist and researcher,” he said in a statement after a committee found problems in dozens of his papers. (NY Times)
• Related paper: “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant” (Simmons, Nelson and Simonsohn).
Inference in Oncology
• The Duke scandal (Potti).
• Refutation by Baggerly, Coombes and Goldberg.
• “What biomedical computing can learn from its mistakes” (K. Sainani).
• “The Duke University scandal — what can be done?” (Darrel).
• Steen, R. G. (2011). Retractions in the medical literature: How many patients are put at risk by flawed research? Journal of Medical Ethics.
Inference in Epidemiology
• “Most published research in epidemiology is false”.
• “Why Most Published Research Findings Are False”
by Ioannidis.
• Ben Goldacre: “Battling bad science” (TED talk).
• “Sunlight is the best disinfectant.”
Are we any better ?
• So far: no big stories of “bad science” in phylogeny.
• How come?
Are we any better ?
• So far: no big stories of “bad science” in phylogeny.
• How come?
• Answer 1: We are better.
• Answer 2: We are dealing with the past; no predictions.
• Answer 3: Nobody is invested enough to refute wrong results and techniques.
Are we any better – Methodological indications
• Q1: Do we post complete protocols of all the steps and
choices in all of our research (including all data)?
• Q2: Results repeated by multiple research groups?
• Q3: Do we develop testing methods?
• Q4: Study of robustness of our models and methods?
• Q5: Theoretical results saying when methods do / do not
work?
• Q6: Are results tested by their predictive power?
Are we any better ?
• Some indications:
• “I am not interested in negative science” – a leading phylogenetics expert, when a new method for testing trees was presented to him.
• Other areas that are relatively free of scandals: economics, quantitative methods in political science and history.
• But: the Bible code (Statistical Science).
Statistical Science, 1994, Vol. 9, No. 3, 429–438 (abridged):
“Equidistant Letter Sequences in the Book of Genesis”
Doron Witztum, Eliyahu Rips and Yoav Rosenberg
Abstract. It has been noted that when the Book of Genesis is written as two-dimensional arrays, equidistant letter sequences spelling words with related meanings often appear in close proximity. Quantitative tools for measuring this phenomenon are developed. Randomization analysis shows that the effect is significant at the level of 0.00002.
Directions to consider:
• Full reporting of methods and data including failed
attempts.
• Lowering the p-value bar.
• Repeating analysis by different groups.
• Developing and applying post-tree analysis methods.
• Theoretical directions:
• Proving negative results.
• Behavior of methods under wrong models.
Discussion Questions:
• What should the community expect as “valid”
phylogenetic analysis?
• What testing tools should be developed?
• Is big data working in our favor or against us?
• What do you do when you don’t believe a phylogenetic analysis?
Rest of talk – positive implications of
negative science
The origin of life question (2002)
• Penny/Sober/Steel: the origin of life question – is it possible to recover the deep part of the tree?
• Simplest model: no deletions/insertions. Each letter
evolves independently according to the same Markov
model.
• Jukes-Cantor: letters are {A,C,G,T} & symmetric
mutation rates.
• CFN: letters are {0,1} and symmetric mutation rates.
Broadcasting DNA
(Probabilistic Point of View)
[Figure: a dense phylogenetic tree evolving from 30 mya to today.]
The Phase transition from Statistical Physics:
There exists a critical mutation parameter:
• Above it – the root and the leaves become independent.
• Below it – they are not independent, no matter how large the tree is.
The implication to Phylogenetics
• Theorem [M ’03,’04, M-Steel ’04, M-Roch-Sly ’11, …] – the sequence length needed to recover a tree on n species is at least c^t, where
• t = evolutionary time to the root (of order (1/2) log n for a balanced tree),
• c = some constant > 1.
• Key tool – the data-processing inequality: if X → Y → Z forms a Markov chain, then I(X;Y) ≥ I(X;Z).
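The decay of root information can be seen in a small simulation. The sketch below (illustrative, not from the talk: the function names, flip probabilities and majority-vote rule are all assumptions) broadcasts a ±1 state down a complete binary tree, flipping with probability p on each edge, and estimates how often leaf majority vote recovers the root. With θ = 1 − 2p, the critical point is 2θ² = 1.

```python
import random

def broadcast(depth, p, root=1):
    """Broadcast a +-1 root state down a complete binary tree:
    each child copies its parent, flipping with probability p.
    Returns the list of leaf states."""
    states = [root]
    for _ in range(depth):
        nxt = []
        for s in states:
            for _ in range(2):
                nxt.append(-s if random.random() < p else s)
        states = nxt
    return states

def majority_accuracy(depth, p, trials=500):
    """Fraction of trials where a strict leaf majority recovers the
    root (ties count as failures)."""
    hits = sum(1 for _ in range(trials) if sum(broadcast(depth, p)) > 0)
    return hits / trials

random.seed(0)
# theta = 1 - 2p; Kesten-Stigum threshold: 2 * theta^2 = 1
low_p = 0.10   # theta = 0.8, above threshold: root signal survives
high_p = 0.40  # theta = 0.2, below threshold: signal decays with depth
for d in (4, 8):
    print(d, majority_accuracy(d, low_p), majority_accuracy(d, high_p))
```

Above the threshold the accuracy stays bounded away from 1/2 as the depth grows; below it, the estimate drifts toward coin-flipping.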
Jurassic Park – in dense trees
30mya
20mya
10mya
today
And the inverse holds too!
• Theorem [M ’03,’04, M-Steel ’04, M-Roch-Sly ’11, …] –
• If all branch lengths < c′, then the tree can be recovered from sequences of length C log(# species), even at depth (1/2) log n.
• t = evolutionary time to the root.
• c′, C – some constants.
Boosting
• Loop:
1) distance estimation
2) reconstruct one (or a few) level(s)
3) infer sequences at roots (ancestral reconstruction)
• In the reconstruction regime: k ∝ f^{-2} × log n.
• In the non-reconstruction regime: k ∝ f^{-2} × log n × e^{Depth}.
[M ’03,’04, M-Steel ’04, Daskalakis-M-Roch ’06, Mihaescu-Hill-Rao ’06, M-Roch-Sly ’11]
How to Un-mix data?
[Figure: two single-tree data sets sum to one heterogeneous data set.]
heterogeneous data
• phylogenetic mixtures – definition by picture: α1 T1 + α2 T2 + α3 T3 + ...
• special case – “rates-across-sites”
– trees are the same up to random scaling
– in this talk, will focus on the two-scaling case (SLOW + FAST)
– can think of scaling as a “hidden variable”
• biological motivation
– heterogeneous mutation rates
– inconsistent lineage histories
– hybrid speciation, gene transfer
– corrupted data
why are mixtures problematic?
• identifiability – does the distribution at the leaves determine the α’s and T’s?
– negative results: e.g. [Steel-Székely ’94], [Stefankovic-Vigoda ’07], [Matsen-Steel ’07], etc.
– positive results: e.g. [Allman-Rhodes ’06,’08], [Allman-Ané-Rhodes ’08], [Chai-Housworth ’10], etc.
• algorithmic – assuming identifiability, can we reconstruct the topologies efficiently?
– mixtures can mislead standard methods;
– ML under the full model is consistent in identifiable cases;
The Pitfalls of Generic Techniques for Mixtures
• An example of a study of robustness.
• The method assumes a single tree; the data are in fact an identifiable mixture.
• In fact, in [M-Vigoda (Science ’05, Ann. App. Prob. ’06)]: Bayesian techniques (which assume no mixture) are misled by mixtures.
• Conclusion of the algorithm: a single tree.
A new site clustering approach
• New results [M-Roch, 2005, 2011] – we give a simple way to determine which sites come from which component
– based on concentration of measure in the large-tree limit.
site clustering for rates across sites
• ideally, guess which sites were produced by each component
– scaling is “hidden” but we can try to infer it
– to be useful, a test should work with high confidence
[Figure: a two-rate (SLOW/FAST) tree with edge lengths t_ab, t_rc, t_a3, t_b1, t_b2, t_c4, t_c5 and sample site patterns at leaves 1–5.]
leaf agreement
• a natural place to start – the impact of scaling on leaf agreement
– one pair of leaves is not very informative
– we can look at many pairs
• we would like C to be concentrated:
– large number of pairs
– each pair has a small contribution
– independent (or almost independent) pairs
– nice separation between SLOW and FAST
[Figure: histograms of the agreement statistic for 64, 128, 256 and 512 leaves.]
but the tree is not complete…
• lemma 1 – on a general binary tree, the size of the set of all
pairs of leaves at distance at most 10 is linear in n
– proof: count the number of leaves with no other leaves at
distance 5
• lemma 2 – in fact, can find a linear set of leaf pairs that are
non-intersecting
– proof: sparsify above
• this is enough to build a
concentrated statistic
but we don’t know the tree…
• a simple algorithm – cannot compute exact distances but can tell
which pairs are more or less correlated
– find “close” pairs
– starting with one pair, remove all pairs that are too close
– pick one of the remaining pairs and repeat
• claim – this gives a nicely concentrated
variable (for large enough trees)
– large number of pairs
– independent (or almost independent) pairs
– nice separation between SLOW and FAST
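The greedy pair-selection loop above can be sketched in a few lines. This is a toy illustration, not the talk's implementation: `sparsify_pairs`, the path layout of the leaves, and the `min_sep` threshold are all assumptions.

```python
def sparsify_pairs(close_pairs, graph_dist, min_sep=2):
    """Greedy selection: keep a pair, discard every remaining pair
    with an endpoint too close to it, repeat. The surviving pairs
    are (almost) independent, so their agreement statistic
    concentrates.
    close_pairs: list of (u, v); graph_dist(u, v): tree distance."""
    kept = []
    pool = list(close_pairs)
    while pool:
        u, v = pool.pop(0)
        kept.append((u, v))
        pool = [(a, b) for (a, b) in pool
                if min(graph_dist(x, y)
                       for x in (u, v) for y in (a, b)) >= min_sep]
    return pool and kept or kept

# toy example: leaves laid out along a path, distance = |i - j|
positions = {u: u for u in range(11)}
dist = lambda a, b: abs(positions[a] - positions[b])
pairs = [(0, 1), (1, 2), (4, 5), (5, 6), (9, 10)]
print(sparsify_pairs(pairs, dist))  # -> [(0, 1), (4, 5), (9, 10)]
```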
site clustering + reconstruction
summary
site clustering for different trees
• ideally, guess which sites were produced by which generating tree.
[Figure: two trees T1 and T2 on the same leaves 1–5 with different edge lengths.]
• Let C1 = pairs of leaves close in T1 but not in T2.
• Let C2 = pairs of leaves close in T2 but not in T1.
• Let C3 = pairs of leaves close in both T1 and T2.
• Generically, C1 and C2 are of size Ω(n) and C3 is of size O(1).
Clustering leaf pairs
• If we know C1, C2, C3: use agreements in C1, C2 to cluster sites to trees T1 and T2.
• But how to find C1, C2 (or big subsets of C1, C2)?
• Let C be all correlated leaf pairs.
• If (u1,u2), (v1,v2) ∈ C1 or (u1,u2), (v1,v2) ∈ C2, then the events σ(u1) = σ(u2) and σ(v1) = σ(v2) are positively correlated.
• If (u1,u2) ∈ C1 and (v1,v2) ∈ C2, then the events σ(u1) = σ(u2) and σ(v1) = σ(v2) are negatively correlated.
• Cluster to find C1, C2 (may make errors on some elements of C3).
Back to discussion: Are we any better ?
Back to discussion: Are we any better ?
• So far: no big stories of “bad science” in phylogeny.
• How come?
• Answer 1: We are better.
• Answer 2: We are dealing with the past ; no predictions.
• Answer 3: Nobody invests enough to refute wrong results
and techniques.
Directions to consider:
• Full reporting of methods and data including failed
attempts.
• Lowering the p-value bar.
• Repeating analysis by different groups.
• Developing and applying post-tree analysis methods.
• Theoretical directions:
• Proving negative results.
• Behavior of methods under wrong models.
Discussion Questions:
• What should the community expect as “valid”
phylogenetic analysis?
• What testing tools should be developed?
• Is big data working in our favor or against us?
• What do you do when you don’t believe a phylogenetic analysis?
Thank you!
Elchanan Mossel, U.C. Berkeley
mossel@stat.berkeley.edu
http://www.cs.berkeley.edu/~mossel/
General plan
• Define a number of Markovian
Inheritance Models (MIM)
• Discuss how to estimate and
reconstruct from data.
• Lecture 1: Definition of Models
• Lecture 2: Reconstruction via
metric estimates.
• Lecture 3: Decay of information
and impossibility results.
• Lecture 4: Reconstruction.
• Lecture 5: Survey of more
advanced topics.
General plan
• Disclaimers:
• Won’t prove anything hard.
• Many of the easy facts are exercises.
• Questions!
Markovian Inheritance Models
• An inheritance graph is nothing but a directed acyclic graph (DAG) (V,E).
• u → v := u is a parent of v, a direct ancestor;
• Par(v) := {parents of v}.
• If u → v1 → v2 → … → vk = v, then v is a descendant of u, etc.
• Anc(v) = {ancestors of v}.
Markovian Inheritance Models
• For each v ∈ V, the genetic content is given by σ(v).
• Def: An MIM is given by 1) a DAG (V,E) and
• 2) a probability distribution P on Σ^V satisfying the Markov property:
• P(σ(v) = * | σ(Anc(v))) = P(σ(v) = * | σ(Par(v)))
• Ex 1: Phylogenies – speciation.
• Ex 2: Pedigrees – human genetics.
Phylogenetic product models
• Def: A phylogenetic tree is an MIM where (V,E) is a tree.
• Many models are given by products of simpler models.
• Lemma: Let (P,V,E) be an MIM taking values in Σ^V. Then (P^⊗k, V, E) is an MIM taking values in (Σ^k)^V.
• Pf: Exercise.
• In biological terms:
• Genetic data is given as sequences of letters.
• Each letter evolves independently according to the same law (the law includes the DAG (V,E)).
The “random cluster” model
• Infinite set A of colors.
– “real life” – large |A|; e.g. gene order.
• Defined on an un-rooted tree T=(V,E).
• Edge e has (non-mutation) probability θ(e).
• Character: perform percolation – edge e is open with probability θ(e).
• All the vertices v in the same open cluster get the same color σ_v; different clusters get different colors.
• This is the “random cluster” model (both for (P,V,E) and (P^⊗k,V,E)).
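One character of this model is easy to sample. The sketch below (an illustration; the function name, union-find representation and the toy 4-leaf tree are assumptions) percolates the edges and colors the resulting open clusters.

```python
import random

def random_cluster_character(edges, theta, rng):
    """Sample one character of the random cluster model.
    edges: list of (u, v); theta[i]: probability edge i is open
    (no mutation). Vertices in the same open component share a
    fresh color; distinct components get distinct colors."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:           # path-halving union-find
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, (u, v) in enumerate(edges):
        find(u); find(v)
        if rng.random() < theta[i]:     # edge is open: merge clusters
            parent[find(u)] = find(v)
    colors, out = {}, {}
    for x in parent:
        r = find(x)
        colors.setdefault(r, len(colors))
        out[x] = colors[r]
    return out

rng = random.Random(1)
# a 4-leaf tree: internal edge (x, y); leaves a, b under x; c, d under y
edges = [("x", "y"), ("x", "a"), ("x", "b"), ("y", "c"), ("y", "d")]
print(random_cluster_character(edges, [0.9] * 5, rng))
```

On a tree, two leaves share a color exactly when every edge on the path between them is open, which is what makes the model metrically tractable later on.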
Markov models on trees
• Finite set Σ of information values.
• Tree T=(V,E) rooted at r.
• Vertex v ∈ V has information σ_v ∈ Σ.
• Edge e=(v,u), where v is the parent of u, has a mutation matrix M^e of size |Σ| × |Σ|:
• M^{(v,u)}_{i,j} = P[σ_u = j | σ_v = i]
• For each character, we are given σ_∂T = (σ_v)_{v ∈ ∂T}, where ∂T is the boundary (the leaves) of the tree.
• The best known is the Ising/CFN model.
Insertions and Deletions on Trees
• Not a product model (Thorne, Kishino, Felsenstein ’91-’92).
• Vertex v ∈ V has information σ_v ∈ Σ*. Then:
• Apply a Markov model (e.g. CFN) to each site independently.
• Delete each letter independently with probability p_d(e).
• There also exist variants with insertions.
Original sequence: ACGACCGCTGACCGACCCGACGTTGTAAACCGT
After mutations: ACGACCGTTGACCGACCCGACATTGTAAACTGT
After deletions (mutated sequence): ACGCCGTTGACCGCCCGACTTGTAACTGT
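A single edge of this process can be sketched directly from the description above: per-site substitution, then per-site deletion. The function name, the substitution rule (uniform over the other letters) and the rates are illustrative assumptions, not the TKF model's specifics.

```python
import random

def evolve_edge(seq, p_sub, p_del, rng, alphabet="ACGT"):
    """Evolve a sequence along one edge: substitute each letter
    independently with probability p_sub (to a uniformly chosen
    different letter), then delete each letter with probability
    p_del."""
    out = []
    for ch in seq:
        if rng.random() < p_sub:
            ch = rng.choice([a for a in alphabet if a != ch])
        if rng.random() >= p_del:   # letter survives deletion
            out.append(ch)
    return "".join(out)

rng = random.Random(3)
parent = "ACGACCGCTGACCGACCCGACGTTGTAAACCGT"
child = evolve_edge(parent, 0.05, 0.1, rng)
print(parent)
print(child)
```

Note that deletions shift the positions of all later letters, which is exactly why the result is no longer a product over sites.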
A simple model of recombination on pedigrees
• Vertex v ∈ V has information σ_v ∈ Σ^k.
• Let ν be a probability distribution over subsets of [k].
• Let u, w be the father and mother of v.
• Let S be drawn from ν and let:
• σ_v(S) = σ_u(S), σ_v(S^c) = σ_w(S^c).
• Example: i.i.d. “hot spot” process on [k]: {X1, …, Xr}; let S = [1,X1] ∪ [X2,X3] ∪ …
father: ACGACCGCTGACCGACCCGAC
mother: CGATGGCATGCACGATCTGAT
child: ACGAGGCATGCCCGACCTGAT
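The hot-spot construction above can be sketched as follows. The crossover rate, 0-indexing, and helper names are illustrative assumptions; only the rule σ_v(S) = σ_u(S), σ_v(S^c) = σ_w(S^c) comes from the slide.

```python
import random

def hot_spot_mask(k, rate, rng):
    """i.i.d. 'hot spot' crossover points; return the set S of
    positions inherited from the father: the alternating blocks
    [1, X1], [X2, X3], ... (0-indexed here)."""
    points = sorted(i for i in range(1, k) if rng.random() < rate)
    S, take, prev = set(), True, 0
    for cut in points + [k]:
        if take:
            S.update(range(prev, cut))
        take = not take
        prev = cut
    return S

def recombine(father, mother, S):
    """sigma_v(S) = sigma_father(S), sigma_v(S^c) = sigma_mother(S^c)."""
    return "".join(f if i in S else m
                   for i, (f, m) in enumerate(zip(father, mother)))

rng = random.Random(5)
father = "ACGACCGCTGACCGACCCGAC"
mother = "CGATGGCATGCACGATCTGAT"
S = hot_spot_mask(len(father), 0.15, rng)
child = recombine(father, mother, S)
print(sorted(S))
print(child)
```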
The reconstruction problem
• We discuss two related problems.
• In both, want to reconstruct/estimate unknown
parameters from observations.
• The first is the “reconstruction problem”.
• Here we are given the tree/DAG and
• the values of the random variables at a subset of the
vertices.
• Want to reconstruct the value of the random variable at a
specific vertex (“root”).
• For trees this is algorithmically easy using dynamic programming / recursion.
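The tree recursion can be sketched for the two-state (CFN) case. This is a minimal illustration of the standard pruning recursion, not code from the talk: the dict-based tree representation, node names and flip probability are assumptions.

```python
def root_posterior(tree, leaf_states, p, root="root", prior=(0.5, 0.5)):
    """Posterior distribution of the root state in a two-state (CFN)
    Markov model on a rooted tree, via dynamic programming.
    tree: dict node -> list of children ([] for leaves);
    leaf_states: dict leaf -> 0/1; p: flip probability per edge."""
    M = [[1 - p, p], [p, 1 - p]]  # symmetric channel matrix

    def lik(v):
        # lik(v)[x] = P(observed leaf states below v | state of v = x)
        if not tree[v]:
            s = leaf_states[v]
            return [1.0 if x == s else 0.0 for x in (0, 1)]
        L = [1.0, 1.0]
        for c in tree[v]:
            Lc = lik(c)
            for x in (0, 1):
                L[x] *= M[x][0] * Lc[0] + M[x][1] * Lc[1]
        return L

    L = lik(root)
    joint = [prior[x] * L[x] for x in (0, 1)]
    Z = sum(joint)
    return [j / Z for j in joint]

tree = {"root": ["u", "v"], "u": ["a", "b"], "v": ["c", "d"],
        "a": [], "b": [], "c": [], "d": []}
post = root_posterior(tree, {"a": 1, "b": 1, "c": 1, "d": 0}, 0.1)
print(post)  # the posterior favors root state 1
```

Each internal node is visited once, so the cost is linear in the tree size (times |Σ|² per edge in general).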
Phylogenetic Reconstruction
• Here the tree/DAG etc. is unknown.
• Given a sequence of collections of random variables at
the leaves (“species”).
• Want to reconstruct the tree (un-rooted).
Phylogenetic Reconstruction
• Algorithmically “hard”. Many heuristics based on maximum likelihood and Bayesian statistics are used in practice.
Trees
• In biology, all internal degrees are ≥ 3.
• Given a set of species (labeled vertices) X, an X-tree is a tree which has X as its set of leaves.
• Two X-trees T1 and T2 are identical if there is a graph isomorphism between T1 and T2 that is the identity map on X.
[Figure: suppressing a degree-2 vertex combines the edge matrices M^{e′} and M^{e′′}; the three X-trees on leaves {a,b,c,d}.]
Highlights for next lectures
• Develop methods to reconstruct
Phylogenies with the following guarantees.
• Consider large trees (# of leaves n → ∞).
• Show that for all trees, with high probability (over the randomness of inheritance), we recover the true tree.
• Upper and lower bounds on amount of
information needed.
• Surprising connections with phase
transitions in statistical physics.
• Briefly discuss why non-tree models are
much harder.
Lecture plan
• Lecture 2: Reconstruction via metric estimates.
• Metrics from stochastic models.
• Tree metrics determine trees.
• Approximate tree metrics determine trees.
• Some tree reconstruction algorithms.
• Metric and geometric ideas for tree mixtures.
• Metrics and pedigrees.
The “random cluster” model
• Infinite set A of colors.
– “real life” – large |A|; e.g. gene order.
• Defined on an un-rooted tree T=(V,E).
• Edge e has (non-mutation) probability θ(e).
• Character: perform percolation – edge e is open with probability θ(e).
• All the vertices v in the same open cluster get the same color σ_v; different clusters get different colors.
• This is the “random cluster” model (both for (P,V,E) and (P^⊗k,V,E)).
An additive metric for the RC model
• Claim: For all u,v: P(σ_u = σ_v) = ∏_{e ∈ path(u,v)} θ(e), where the product is over all e in the path connecting u to v.
• Def: Let d(e) = −log θ(e), and d(u,v) = ∑_{e ∈ path(u,v)} d(e) = −log P(σ_u = σ_v).
• Claim: d(u,v) is a metric.
– Pf: Exercise.
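The multiplicative claim can be sanity-checked by Monte Carlo on the simplest case, a two-edge path u–x–v (an illustration with assumed names and θ values; on a path, u and v share a color exactly when both edges are open).

```python
import math
import random

def sample_same_color(thetas, rng):
    """One RC-model character restricted to a path u - x - v:
    u and v share a color iff every edge on the path is open."""
    return all(rng.random() < t for t in thetas)

rng = random.Random(11)
thetas = [0.9, 0.8]
n = 20000
freq = sum(sample_same_color(thetas, rng) for _ in range(n)) / n
pred = math.prod(thetas)   # claim: P(sigma_u = sigma_v) = 0.9 * 0.8 = 0.72
d_uv = -math.log(freq)     # estimated additive distance
print(freq, pred)
print(d_uv, sum(-math.log(t) for t in thetas))
```

Since −log turns the product over edges into a sum, additivity of d along paths is immediate once the product formula holds.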
Markov models on trees
• Finite set Σ of information values.
• Tree T=(V,E) rooted at r.
• Vertex v ∈ V has information σ_v ∈ Σ.
• Edge e=(v,u), where v is the parent of u, has a mutation matrix M^e of size |Σ| × |Σ|:
• M^{(v,u)}_{i,j} = P[σ_u = j | σ_v = i]
• For each character, we are given σ_∂T = (σ_v)_{v ∈ ∂T}, where ∂T is the boundary (the leaves) of the tree.
• The best known is the Ising/CFN model.
Markov models on trees
• The best known is the Ising/CFN model on {−1,1}:
• Claim: For all u,v: E[σ_u σ_v] = ∏_{e ∈ path(u,v)} θ(e).
• Pf: Exercise.
• Claim: d(u,v) = −log E[σ_u σ_v] is a metric and d(u,v) = ∑_{e ∈ path(u,v)} d(e).
• This is a special case of the log-det distance for general Markov models on trees (Steel ’94):
• d(u,v) ~ −log |det ∏_{e ∈ path(u,v)} M^e|
Insertions and Deletions on Trees
• Not a product model (Thorne, Kishino, Felsenstein ’91-’92).
• Vertex v ∈ V has information σ_v ∈ Σ*. Then:
• Delete each letter independently with probability p_d(e).
Original sequence: ACGACCGTTGACCGACCCGACATTGTAAACTGT
After deletions (mutated sequence): ACGCCGTTGACCGCCCGACTTGTAACTGT
• Define d(u,v) = −log E[Avg(σ_u) Avg(σ_v)].
• This is a metric (Ex; Daskalakis-Roch ’10).
• The same also works if insertions and mutations are allowed.
From metrics to trees
• Def: Given a tree T=(V,E), a tree metric is defined by a collection of positive numbers { d(e) : e ∈ E } by letting d(u,v) = ∑_{e ∈ path(u,v)} d(e) for all u,v ∈ V.
• Claim: Let T=(V,E) be a tree with all internal degrees at least 3, let d be a tree metric on T and let L be the set of leaves of T. Then { d(u,v) : u,v ∈ L } determines the tree T uniquely.
Think small: trees on 2 and 3 leaves
• Q: What are the possible trees on 2 / 3 leaves a, b, c?
• A: Only one tree in each case, if we assume all internal degrees > 2.
[Figure: the unique trees on leaves {a,b} and {a,b,c}.]
Think small: trees on 4 leaves
• Q: What are the possible trees on 4 leaves a,b,c,d?
• A: ab|cd, ac|bd, ad|bc or abcd
[Figure: the three resolved quartet trees ab|cd, ac|bd, ad|bc (pendant edges e1,…,e4, middle edge e) and the star tree abcd.]
• Q: How to distinguish between them, given the leaves’ pairwise distances?
• A: Look at the partition xy, zw minimizing d(x,y) + d(z,w).
– Cases 1-3: The partition corresponding to the tree gives the optimum distance d(e1)+d(e2)+d(e3)+d(e4), while the other partitions give a distance bigger by 2d(e) (they go through the middle edge twice).
– Case 4 (star): All partitions give the same result.
– Note: Approximate distances (± d(e)/8) suffice!
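The four-point rule above fits in a few lines. A minimal sketch (the function name, the distance dictionary and the toy edge lengths are assumptions): the split minimizing the sum of within-pair distances is returned, with ties reported as a star.

```python
def quartet_topology(d, x, y, z, w):
    """Infer the quartet topology from pairwise distances: the split
    xy|zw minimizing d(x,y) + d(z,w) is the one induced by the tree
    (a tie means the star tree)."""
    splits = [((x, y), (z, w)), ((x, z), (y, w)), ((x, w), (y, z))]
    scores = [d[a][b] + d[c][e] for (a, b), (c, e) in splits]
    m = min(scores)
    if scores.count(m) > 1:
        return "star"
    (a, b), (c, e) = splits[scores.index(m)]
    return f"{a}{b}|{c}{e}"

# distances realized by the tree ab|cd: pendant edges of length 2,
# middle edge of length 1
d = {u: {} for u in "abcd"}
def setd(u, v, val): d[u][v] = d[v][u] = val
setd("a", "b", 4); setd("c", "d", 4)
for u in "ab":
    for v in "cd":
        setd(u, v, 5)
print(quartet_topology(d, "a", "b", "c", "d"))  # -> ab|cd
```

Here the correct split scores 4 + 4 = 8 while the other two score 10 = 8 + 2·d(e), matching the 2d(e) gap noted above.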
From Small Tree to Big Trees
• Claim: In order to recover the tree topology it suffices to know, for each set of 4 leaves, the induced quartet tree.
• Pf: By induction on the size of the tree, using cherries.
• Definition: A cherry is a pair of leaves at graph distance 2.
• Claim 1: Vertices x, y form a cherry in the tree T iff they are a cherry in all trees induced by 4 of its leaves.
• Claim 2: Every tree with all internal degrees ≥ 3 has a cherry.
– Proof: Pick a root and take u to be the leaf farthest away from the root. The sibling of u (one must exist since the degree is ≥ 3) must be a leaf as well.
From leaf pairwise distances to trees
• Algorithm to build a tree from quartets:
– Find cherries (pairs of vertices which are coupled in all 4-leaf combinations).
– For each cherry <x,y>, replace it by a single leaf x (remove all quartets involving both x, y; in each quartet including only y, replace the y by x).
– Repeat (until # leaves ≤ 4).
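The cherry-contraction loop can be sketched on a toy example. This is an illustrative implementation, not the talk's: the nested-tuple tree representation, helper names, and the assumption that a quartet split is supplied for every 4-subset of the current leaves are all mine.

```python
from itertools import combinations

def split(p1, p2):
    """A quartet topology p1|p2, stored as a set of two leaf pairs."""
    return {frozenset(p1), frozenset(p2)}

def find_cherry(quartets, leaves):
    """A cherry: a pair {x, y} appearing together on one side of the
    split in every quartet that contains both."""
    for x, y in combinations(sorted(leaves), 2):
        pair = frozenset({x, y})
        if all(pair in s for four, s in quartets.items()
               if x in four and y in four):
            return x, y
    raise ValueError("no cherry found")

def substitute(tree, leaf, repl):
    """Replace one leaf of a nested-tuple tree by a subtree."""
    if tree == leaf:
        return repl
    if isinstance(tree, tuple):
        return tuple(substitute(t, leaf, repl) for t in tree)
    return tree

def tree_from_quartets(quartets, leaves):
    """Rebuild the topology by repeatedly contracting a cherry
    {x, y} to the single leaf x, then replaying the merges."""
    leaves = set(leaves)
    merges = []
    while len(leaves) > 3:
        x, y = find_cherry(quartets, leaves)
        merges.append((x, y))
        leaves.discard(y)
        quartets = {f: s for f, s in quartets.items() if y not in f}
    tree = tuple(sorted(leaves))  # the final 3-star
    for x, y in reversed(merges):
        tree = substitute(tree, x, (x, y))
    return tree

# quartets induced by the caterpillar tree with cherries {a,b}, {d,e}
quartets = {
    frozenset("abcd"): split("ab", "cd"),
    frozenset("abce"): split("ab", "ce"),
    frozenset("abde"): split("ab", "de"),
    frozenset("acde"): split("ac", "de"),
    frozenset("bcde"): split("bc", "de"),
}
print(tree_from_quartets(quartets, "abcde"))
```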
• A statistical Q: How many samples k are needed? In other words: what is the sequence length needed?
• A: We would like to have enough samples so that we can estimate d(u,v) with accuracy min_e{d(e)/8}.
• Define f = min_e d(e), g = max_e d(e), D = max_{u,v leaves} d(u,v).
From leaf pairwise distances to trees
• A statistical Q: How many samples are actually needed?
• A: We would like to have enough samples so that we can estimate d(u,v) with accuracy min_e{d(e)/8}.
• Define f = min_e d(e), g = max_e d(e), D = max_{u,v leaves} d(u,v).
• In the RC model: e^{−D} vs. e^{−D−f/8} agreement.
• In CFN: e^{−D} vs. e^{−D−f/8} correlation. Etc.
• Claim: In both models we need at least Ω(e^D/g^2) samples to estimate all distances within the required accuracy.
• Claim: In both models O(e^D log n / g^2) samples suffice to estimate all distances within the required accuracy with good probability.
• Exercises!
From leaf pairwise distances to trees
• Claim: In both models we need at least Ω(e^D/g^2) samples to estimate all distances within the required accuracy.
• Claim: In both models O(e^D log n / g^2) samples suffice to estimate all distances within the required accuracy with good probability.
• Q: Is this bad? How large can D be? Let n = # leaves.
• D can be as small as O(log n) and as large as O(n).
• If D = f·n we need O(e^{fn}/g^2) samples!
• Can we do better?
From leaf pairwise distances to trees
• Can we do better?
• Do we actually need *all* pairwise distances?
• Do we actually need *all* quartets?
• In fact: we need only “short quartets”, so the actual # of samples needed is O(e^{8f} log n / g^2) (Erdős-Steel-Székely-Warnow ’96).
• An alternative approach is in Mossel ’09:
[Figure: a long edge e=(u,v) with witness leaves u1, u2 and v1, v2 certifying the split.]
Distorted metrics idea sketch
• Construction: given a radius D:
• For each leaf u look at C(u,D) = all leaves v whose estimated distance to u is at most D.
• Construct the tree T(u,D) on C(u,D).
• Algorithm to stitch the T(u,D)’s together (the main combinatorial argument).
• The sequence length needed is O(e^{2D}/g^2).
• Lemma: if D > 2g log n, the balls cover the tree.
• Even for smaller D, we get a forest that refines the true tree.
Short and long edges
• Gronau, Moran, Snir 2008: dealing with short edges (sometimes one needs to contract them).
• Daskalakis, Mossel, Roch 2009: dealing with both short and long edges: “contracting the short, pruning the deep”.
Can we do better?
• Consider e.g. the CFN model with sequence length k.
• Results so far ⇒ the model can be reconstructed when k = O(n^α), where α = α(f,g).
• Can we do better?
• Can we prove lower bounds?
Can we do better?
• Can we prove lower bounds?
• Trivial lower bound:
• Claim 1: Let T_n = the set of leaf-labeled trees on n leaves (with all degrees at least 3). Then |T_n| = exp(Θ(n log n)).
• Pf: Exercise.
• Claim 2: The # of possible sequences at the leaves is 2^{kn}.
• Conclusion: To have a good probability of reconstruction we need 2^{kn} > exp(Θ(n log n)) ⇒ k ≥ Ω(log n).
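The counting argument can be checked numerically. The sketch below uses the standard count |T_n| = (2n−5)!! of unrooted binary leaf-labeled trees (a special case of the exp(Θ(n log n)) bound); the function names are assumptions.

```python
def num_trees(n):
    """Number of unrooted binary trees with n labeled leaves:
    (2n-5)!! = 1 * 3 * 5 * ... * (2n-5)."""
    r = 1
    for k in range(3, 2 * n - 4, 2):
        r *= k
    return r

def min_seq_length(n):
    """Smallest k with 2^(n k) >= |T_n|: the information-theoretic
    lower bound on CFN sequence length (the data is n binary
    sequences of length k, so at most 2^(n k) distinct inputs)."""
    bits = num_trees(n).bit_length()   # ~ log2 |T_n|, exact to 1 bit
    return -(-bits // n)               # ceiling division

for n in (10, 100, 1000):
    print(n, min_seq_length(n))
```

Since log2 |T_n| = Θ(n log n), the printed bound grows like log n, matching the conclusion above.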
Can we do better?
• More formally:
• Claim: Consider a uniform prior μ over trees. Then for all possible estimators Est:
• E_μ P[Est is correct] ≤ 2^{nk} / |T_n|.
• Pf sketch:
• The optimal estimator is deterministic:
• Est : {0,1}^{nk} → T_n.
• E_μ P[Est is correct] ≤ |Image(Est)| / |T_n| ≤ 2^{nk} / |T_n|.
• Conclusion: Reconstruction is impossible if k ≤ 0.5 log n and possible if k ≥ n^α. What is the truth?
• Next lecture …
Metric ideas for tree mixtures
• Def: Let T1=(V1,E1,P1) and T2=(V2,E2,P2) be two phylogenetic models on the same leaf set L.
• The (α, 1−α) mixture of the two models is the probability distribution α P1 + (1−α) P2.
• Construction (Matsen-Steel 2009):
• There exist 3 phylogenies T1, T2, T3 for the CFN model with (V1,E1) = (V2,E2) ≠ (V3,E3) and T3 = 0.5(T1 + T2).
• ⇒ Mixtures are not identifiable!
[Figure: the Matsen-Steel construction on four leaves a, b, c, d.]
Metric ideas for tree mixtures
• Construction (Matsen-Steel 2009):
• There exist 3 phylogenies T1, T2, T3 for the CFN model with (V1,E1) = (V2,E2) ≠ (V3,E3) and T3 = 0.5(T1 + T2).
• ⇒ Mixtures are not identifiable!
• On the other hand, using the metric idea, in recent work with Roch we show that when n is large and the trees T1 and T2 are generic, it is possible to find both of them with high probability.
Metric ideas for tree mixtures
• Proof sketch: Fix a radius D ≥ 10g.
• Let S1 = { u, v ∈ Leaves : d1(u,v) ≤ D } (and similarly S2 for d2).
• It is easy to show that |S1|, |S2| ≥ Ω(n).
• For “generic trees” we have |S1 ∩ S2| = o(n).
• By looking for high correlation between leaves we can approximately recover S1 ∪ S2.
• Note: Pairs in S1 will tend to be correlated in samples from T1, and pairs in S2 will be correlated in samples from T2.
• By checking co-occurrence of correlations we can approximately recover both S1 and S2.
• Using S1 and S2 we can determine, for each sample, whether it comes from T1 or from T2.
• The same ideas can be used for different rates …
heterogeneous data
• phylogenetic mixtures – definition by picture: α1 T1 + α2 T2 + α3 T3 + ...
• special case – “rates-across-sites”
– trees are the same up to random scaling
– in this talk, will focus on the two-scaling case (SLOW + FAST)
– can think of scaling as a “hidden variable”
• biological motivation
– heterogeneous mutation rates
– inconsistent lineage histories
– hybrid speciation, gene transfer
– corrupted data
but, on a mixture…
[Figure: two single-tree data sets combine into mixed data.]
why are mixtures problematic?
• identifiability – does the distribution at the leaves determine the α’s and T’s?
– negative results: e.g. [Steel et al. ’94], [Stefankovic-Vigoda ’07], [Matsen-Steel ’07], etc.
– positive results: e.g. [Allman-Rhodes ’06,’08], [Allman-Ané-Rhodes ’08], [Chai-Housworth ’10], etc.
• algorithmic – assuming identifiability, can we reconstruct the topologies efficiently?
– mixtures can mislead standard methods;
– ML under the full model is consistent in identifiable cases; BUT ML is already NP-hard for the pure case [Chor-Tuller ’06, R. ’06]
a new site clustering approach
• new results [M-Roch, 2011] – we give a simple way to determine which sites come from which component
– based on concentration of measure in the large-tree limit.
site clustering
• ideally, guess which sites were produced by each component
– scaling is “hidden” but we can try to infer it
– to be useful, a test should work with high confidence
[Figure: a two-rate (SLOW/FAST) tree with sample site patterns at leaves 1–5.]
leaf agreement
• a natural place to start – the impact of scaling on leaf agreement
– one pair of leaves is not very informative
– we can look at many pairs
• we would like C to be concentrated:
– large number of pairs
– each pair has a small contribution
– independent (or almost independent) pairs
– nice separation between SLOW and FAST
[Figure: histograms of the agreement statistic for 64, 128, 256 and 512 leaves.]
but the tree is not complete…
• lemma 1 – on a general binary tree, the size of the set of all pairs of leaves at distance at most 10 is linear in n
– proof: count the number of leaves with no other leaves at
distance 5
• lemma 2 – in fact, can find a linear set of leaf pairs
that are non-intersecting
– proof: sparsify above
• this is enough to build a concentrated statistic
but we don’t know the tree…
• a simple algorithm – cannot compute exact distances but can tell
which pairs are more or less correlated
– find “close” pairs
– starting with one pair, remove all pairs that are too close
– pick one of the remaining pairs and repeat
• claim – this gives a nicely concentrated
variable (for large enough trees)
– large number of pairs
– independent (or almost independent) pairs
– nice separation between SLOW and FAST
site clustering + reconstruction
summary
Metric ideas for pedigrees
• The correlation measure = inheritance by descent.
• It doesn’t really measure distance but something more complicated …