Sure Screening for Gaussian Graphical Models

Shikai Luo, Rui Song, Daniela Witten
Siyuan Ma
March 20, 2015
Outline
1 Introduction
  Gaussian Graphical Models
  Conditional Dependency
  High Dimensionality
  Previous Works
2 The GRASS Algorithm
  Graphical Sure Screening
  Theoretical Properties
3 Results
  Simulation
  Analysis of Gene Expression Data
Introduction
Gaussian Graphical Models
Graphical Models
In recent years, graphical modeling (network analyses) has been a
topic of great interest in both the scientific and statistical
communities.
In particular, in genomics, graphical models have been extensively
used to model gene regulatory networks.
Gene Regulatory Network of TF family GLI in human.
Credit: http://rulai.cshl.edu/TRED/GRN/Gli.htm
Gaussian Graphical Model
Consider the gene expression random vector $X = (X_1, X_2, \ldots, X_p)^T$.
Each $X_j$ corresponds to a gene, i.e., a row in a gene expression matrix.
Each sample (column) in a gene expression matrix can be viewed as a realization of $X$.
We could make the assumption that $X \sim N(0, \Sigma)$.
We could further assume that each gene has mean 0 and standard deviation 1.
Translating the covariance matrix $\Sigma$ into a graph (network) G:
$$
\Sigma =
\begin{pmatrix}
1 & \sigma_{12} & \sigma_{13} & \cdots & \sigma_{1p} \\
\sigma_{21} & 1 & \sigma_{23} & \cdots & \sigma_{2p} \\
\vdots & \vdots & \ddots & \sigma_{ij} & \vdots \\
\sigma_{p1} & \sigma_{p2} & \sigma_{p3} & \cdots & 1
\end{pmatrix}
$$
$|\sigma_{ij}| > 0$ ⇔ there is an edge in graph G between gene i and gene j.
In practice, we usually only have an estimate $\hat{\Sigma}$ of $\Sigma$, so we could set a threshold on $|\hat{\sigma}_{ij}|$ to decide whether an edge exists between genes i and j.
Conditional Dependency
Conditional vs. Marginal Dependency: An Example
Genes $X_1$, $X_2$, $X_3$, $X_4$.
$X_3$ and $X_4$ are independent regulators.
$X_1$ and $X_2$ are regulated by $X_3$ and $X_4$: $X_1 = X_3 + X_4 + \varepsilon_1$ and $X_2 = X_3 + X_4/2 + \varepsilon_2$.
$\varepsilon_1$ and $\varepsilon_2$ are independent noise terms.
Given $X_3$ and $X_4$, $X_1$ and $X_2$ are independent.
Marginally, $X_1$ and $X_2$ are not independent.
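The four-gene example can be checked numerically. The sketch below is only illustrative: it assumes that the two regulators and the two noise terms are independent standard normals (the unit variances are an assumption, not something the example states), builds the implied covariance matrix, and inverts it.

```python
import numpy as np

# Loading matrix A maps independent standard-normal sources
# z = (X3, X4, eps1, eps2) to the genes x = (X1, X2, X3, X4).
# Unit variance for every source is an assumption made for illustration.
A = np.array([
    [1.0, 1.0, 1.0, 0.0],   # X1 = X3 + X4 + eps1
    [1.0, 0.5, 0.0, 1.0],   # X2 = X3 + X4/2 + eps2
    [1.0, 0.0, 0.0, 0.0],   # X3
    [0.0, 1.0, 0.0, 0.0],   # X4
])

sigma = A @ A.T                # covariance of (X1, X2, X3, X4)
theta = np.linalg.inv(sigma)   # inverse covariance (precision) matrix

marginal_cov_12 = sigma[0, 1]  # nonzero: X1 and X2 are marginally dependent
precision_12 = theta[0, 1]     # zero up to rounding: conditionally independent
```

Under these assumptions the (1, 2) entry of the covariance is nonzero while the (1, 2) entry of its inverse vanishes, which matches the two statements in the example.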
Conditional vs. Marginal Dependency
The marginal dependency is determined by $\Sigma$. It is easy to estimate and widely used.
The conditional dependency is determined by $\Sigma^{-1}$. It is much harder to estimate, but might reveal more meaningful relationships.
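As a reminder of why the inverse covariance encodes conditional dependency, the standard Gaussian fact can be written out as follows. The symbol $\Theta = \Sigma^{-1}$ and the notation $X_{-\{i,j\}}$ for all remaining genes are introduced here only for this note.

```latex
% For X ~ N(0, Sigma) with Theta = Sigma^{-1}:
\[
  X_i \perp X_j \mid X_{-\{i,j\}}
  \iff \Theta_{ij} = 0,
  \qquad
  \rho_{ij \mid \text{rest}}
  = -\frac{\Theta_{ij}}{\sqrt{\Theta_{ii}\,\Theta_{jj}}} .
\]
```

So a zero in $\Sigma^{-1}$ means a zero partial correlation, while a zero in $\Sigma$ only means a zero marginal correlation.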
High Dimensionality
High Dimensionality
Article: Karell et al., 2015
High dimensionality problems arise when the number of genes (p) is
greater than the number of samples (n):
This is a problem: n = 50 samples, p = 200 genes. The data matrix is 200 × 50, with entries $x_{1,1}, \ldots, x_{1,50}$ in the first row through $x_{200,1}, \ldots, x_{200,50}$ in the last.
This is not a problem: n = 1000 samples, p = 200 genes. The data matrix is 200 × 1000, with entries $x_{1,1}, \ldots, x_{1,1000}$ in the first row through $x_{200,1}, \ldots, x_{200,1000}$ in the last.
In our case, high dimensionality leads to a singular $\hat{\Sigma}$, which means we can’t simply estimate $\Sigma^{-1}$ using $\hat{\Sigma}^{-1}$.
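The singularity is easy to see on synthetic data. The sketch below uses simulated standard-normal data with n = 50 and p = 200 (the data, the seed, and the samples-in-rows layout are assumptions for illustration) and checks the rank of the sample covariance.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily
n, p = 50, 200                   # samples, genes

# Samples are stored in rows here (n x p), i.e. the transpose of the
# genes-by-samples layout used for expression matrices.
X = rng.standard_normal((n, p))
X = X - X.mean(axis=0)           # center each gene

sigma_hat = X.T @ X / n          # p x p sample covariance
rank = np.linalg.matrix_rank(sigma_hat)
```

After centering, the rank cannot exceed n - 1, which is far below p, so the p × p matrix has no inverse.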
Previous Works
The Graphical Lasso
One of the previously established approaches is the graphical lasso:
$$
\hat{\Theta} = \arg\max_{\Theta} \left\{ \log\det(\Theta) - \operatorname{trace}\left[(X^T X / n)\,\Theta\right] - \lambda \sum_{i \neq j} |\Theta_{ij}| \right\}
$$
Solving this problem is not straightforward and is computationally expensive.
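To make the three terms of the objective concrete, here is a minimal sketch that only evaluates the penalized log-likelihood for a given candidate matrix; it does not solve the optimization. The function name `glasso_objective` and the samples-in-rows convention for X are assumptions of this sketch.

```python
import numpy as np

def glasso_objective(theta, X, lam):
    """Evaluate log det(theta) - trace(S theta) - lam * sum_{i != j} |theta_ij|.

    theta: candidate precision matrix (p x p)
    X:     data with samples in rows (n x p), columns assumed centered
    lam:   nonnegative tuning parameter lambda
    """
    n = X.shape[0]
    S = X.T @ X / n                              # empirical covariance X^T X / n
    sign, logdet = np.linalg.slogdet(theta)
    if sign <= 0:
        return -np.inf                           # log det is undefined here
    off_diag_l1 = np.abs(theta).sum() - np.abs(np.diag(theta)).sum()
    return logdet - np.trace(S @ theta) - lam * off_diag_l1
```

The penalty touches only the off-diagonal entries, which is what pushes the estimate toward a sparse graph as lambda grows.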
The GRASS Algorithm
Graphical Sure Screening
Graphical Sure Screening
The goal is to estimate the support (nonzero pattern) of $\Sigma^{-1}$, or equivalently, the edge set E of the conditional dependency graph.
Under the normality assumption, the GRASS algorithm is surprisingly simple:
Start from an estimate $\hat{\Sigma}$ of $\Sigma$.
Claim that genes i and j are conditionally dependent if $|\hat{\sigma}_{ij}| > \gamma_n$, where $\gamma_n$ is some threshold.
This gives an estimated edge set $\hat{E}_{\gamma_n}$.
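The thresholding step can be sketched in a few lines. This is an illustrative version, not the authors' code: the function name `grass_edges`, the samples-in-rows layout, and the use of the standardized sample covariance as the estimate are assumptions.

```python
import numpy as np

def grass_edges(X, gamma):
    """Return the set of pairs (i, j), i < j, with |sigma_hat_ij| > gamma.

    X:     data with samples in rows (n x p)
    gamma: threshold gamma_n
    """
    # Standardize each feature to mean 0 and standard deviation 1,
    # so the sample covariance coincides with the sample correlation.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    sigma_hat = Z.T @ Z / X.shape[0]
    i_idx, j_idx = np.triu_indices(sigma_hat.shape[0], k=1)
    keep = np.abs(sigma_hat[i_idx, j_idx]) > gamma
    return {(int(i), int(j)) for i, j in zip(i_idx[keep], j_idx[keep])}
```

The cost is dominated by forming the p × p covariance, with no iterative optimization involved.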
Theoretical Properties
Connection with the Graphical Lasso
Theorem (Witten et al., 2011)
The connected components of the graphical lasso estimator are exactly the
same as the connected components that result from the GRASS algorithm.
This theorem suggests that the results of the graphical lasso and the GRASS algorithm are similar “from a distance”.
Intuitively this makes sense, because the inverse of a block diagonal matrix is still block diagonal.
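The block-diagonal intuition can be illustrated on a toy covariance matrix. The sketch below uses a hand-written depth-first search and a made-up 4 × 4 matrix with two blocks; the matrix entries, the thresholds, and the helper name `connected_components` are assumptions for illustration only.

```python
import numpy as np

def connected_components(adj):
    """Label connected components of an undirected graph by depth-first search.

    adj: boolean p x p symmetric adjacency matrix
    Returns a list with one component label per node.
    """
    p = adj.shape[0]
    labels = [-1] * p
    current = 0
    for start in range(p):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            node = stack.pop()
            for nbr in np.flatnonzero(adj[node]):
                if labels[nbr] == -1:
                    labels[nbr] = current
                    stack.append(int(nbr))
        current += 1
    return labels

# Toy block-diagonal covariance: genes {0, 1} and {2, 3} form separate blocks.
sigma = np.array([
    [1.0, 0.6, 0.0, 0.0],
    [0.6, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.3],
    [0.0, 0.0, 0.3, 1.0],
])
theta = np.linalg.inv(sigma)

marginal_labels = connected_components(np.abs(sigma) > 0.1)
conditional_labels = connected_components(np.abs(theta) > 1e-12)
```

Both label lists group the genes into the same two components, since inverting the matrix never links genes from different blocks.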
Sure Screening Property
Theorem (Sure Screening Property)
Assume that certain assumptions hold, and that $\log(p) = C_3 n^{\xi}$ for some constants $C_3 > 0$ and $\xi \in (0, 1 - 2\kappa)$. Let $\gamma_n = \frac{2}{3} C_1 n^{-\kappa}$. Then there exist constants $C_4$ and $C_5$ such that
$$P(E \subseteq \hat{E}_{\gamma_n}) \geq 1 - C_4 \exp(-C_5 n^{1 - 2\kappa})$$
The sure screening property guarantees that with very high
probability, GRASS will not result in false negatives.
The sure screening property is based on the following assumption:
Assumption
For some constants $C_1 > 0$ and $0 < \kappa < 1/2$,
$$\min_{(i,j) \in E} |\sigma_{ij}| \geq C_1 n^{-\kappa}$$
The implication of this assumption is that, if two genes are conditionally dependent, then marginally they shouldn’t be “too independent”.
Control of False Positive Rate
Theorem (Control of False Positive Rate)
Assume that certain assumptions hold, and that $\log(p) = C_3 n^{\xi}$ for some constants $C_3 > 0$ and $\xi \in (0, 1 - 2\kappa)$ (this is the same condition as in the last theorem). We could control the false positive rate at $f/|E^c|$ by choosing
$$\gamma_n = \Phi^{-1}\left(1 - \frac{f}{p(p-1)}\right) / \sqrt{n}.$$
This theorem ensures that we could control the false positive rate by controlling the number of possible false positives at f. Furthermore, with the given threshold, the sure screening property still holds.
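The threshold in the theorem is a one-line computation. The sketch below uses the standard-library normal quantile function; the helper name `fpr_threshold` and the example values of n, p, and f are assumptions chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

def fpr_threshold(n, p, f):
    """gamma_n = Phi^{-1}(1 - f / (p (p - 1))) / sqrt(n)."""
    return NormalDist().inv_cdf(1.0 - f / (p * (p - 1))) / sqrt(n)

# Hypothetical settings: 50 samples, 200 genes, two target values of f.
gamma_strict = fpr_threshold(n=50, p=200, f=1)
gamma_loose = fpr_threshold(n=50, p=200, f=10)
```

A smaller f gives a larger threshold, so fewer pairs pass the screen.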
The false positive rate theorem is based on the following assumption:
Assumption
For the same ξ as in Theorem 1,
$$\max_{(i,j) \notin E} |\sigma_{ij}| = o\left(n^{-\frac{1 - \xi}{2}}\right)$$
The implication of this assumption is that, if two genes are conditionally independent, then marginally they shouldn’t be “too dependent”.
Results
Simulation
Simulation
Simulation A: A sparse graph. For all $i < j$, set $(i, j) \in E$ with probability 0.01.
Simulation B: A graph with ten densely connected components. Partition the p features into 10 equally sized, non-overlapping sets: $C_1 \cup C_2 \cup \cdots \cup C_{10} = \{1, \ldots, p\}$, $|C_k| = p/10$, $C_k \cap C_j = \emptyset$ for $k \neq j$. For all $i, j \in C_k$, set $(i, j) \in E$.
Simulation C: A banded graph. For $|i - j| \leq 2$, set $(i, j) \in E$.
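The three edge sets can be generated as adjacency matrices. This sketch covers only the graph structure; the slides do not say how the nonzero values of the inverse covariance are chosen, so that step is left out. The function names, the 0-based indexing, and the assignment of consecutive indices to each block in Simulation B are assumptions.

```python
import numpy as np

def sim_a(p, rng, prob=0.01):
    """Sparse graph: each pair i < j is an edge with probability prob."""
    upper = np.triu(rng.random((p, p)) < prob, k=1)
    return upper | upper.T

def sim_b(p, n_blocks=10):
    """Fully connected, equally sized, non-overlapping components.

    Assumes p is divisible by n_blocks.
    """
    block = np.arange(p) // (p // n_blocks)   # block label of each feature
    adj = block[:, None] == block[None, :]
    np.fill_diagonal(adj, False)              # no self-loops
    return adj

def sim_c(p, bandwidth=2):
    """Banded graph: (i, j) is an edge when 0 < |i - j| <= bandwidth."""
    idx = np.arange(p)
    dist = np.abs(idx[:, None] - idx[None, :])
    return (dist > 0) & (dist <= bandwidth)
```

For example, `sim_b(100)` and `sim_c(100)` give fixed edge sets, while `sim_a(100, np.random.default_rng(0))` changes with the random generator.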
Figure: For simulation B (panels a-c) and C (panels d-f) with p = 100 and n = 50, the
adjacency matrices corresponding to the true edge set (panels (a) and (d)), the graphical lasso
estimate (panels (b) and (e)), and the GRASS estimate (panels (c) and (f)) are shown. The
adjacency matrices for graphical lasso and GRASS are averaged over 10 simulated data sets; the
color of a particular cell in the heatmap corresponds to the fraction of these 10 data sets for
which the corresponding edge is estimated to be present. Results for Simulation A are not
shown, since in that setting the true edge set is not fixed across the simulated data sets.
Figure: For Simulations A-C with p = 200 and n = 50, the number of false and true positive
edges detected is displayed as the tuning parameter is varied. Results are shown for graphical
lasso (black), neighborhood selection (blue), and GRASS (green).
In Simulation B, the sparsity patterns of $\Sigma$ and $\Sigma^{-1}$ are identical. In this setting, GRASS outperforms the graphical lasso and neighborhood selection.
Even for Simulations A and C, the vast majority of the large off-diagonal elements of $\Sigma$ correspond to non-zero elements of $\Sigma^{-1}$.
Figure: For Simulations A-C with n = 100 and p = 50, the off-diagonal elements of $\Sigma^{-1}$ (x-axis) and $\Sigma$ (y-axis) are shown. The 0.5% of largest absolute off-diagonal elements of $\Sigma$ are shown in red; the rest are in black. For all three setups, the vast majority of large off-diagonal elements of $\Sigma$ correspond to non-zero elements of $\Sigma^{-1}$.
For Simulation A, because of the sparsity setting, with high probability a given column of $\Sigma^{-1}$ contains no more than one non-zero off-diagonal element.
Consequently, $\Sigma^{-1}$ is (approximately) a block-diagonal matrix with blocks containing no more than two features (genes).
Analysis of Gene Expression Data
Analysis of Gene Expression Data
Gene expression data from Spira et al., 2007.
Contains 22283 microarray-derived gene expression measurements
from large airway epithelial cells sampled from 97 patients with lung
cancer and 90 controls.
Limited to the 1778 genes with the highest marginal variance.
Features were standardized to have mean zero and standard deviation
one.
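The filtering and standardization steps could look like the following sketch. It is not the authors' preprocessing code: the function name `top_variance_standardized` and the samples-in-rows layout are assumptions, and the number of genes to keep is passed in as k.

```python
import numpy as np

def top_variance_standardized(X, k):
    """Keep the k highest-variance columns of X, then standardize them.

    X: samples in rows, genes in columns (n x p)
    Returns the standardized submatrix and the indices of the kept genes.
    """
    variances = X.var(axis=0)
    top = np.sort(np.argsort(variances)[-k:])   # indices of the k largest variances
    Z = X[:, top]
    Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)    # mean 0, standard deviation 1
    return Z, top
```

For the data set described here, k would be 1778.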
“Validation”
Split the control samples into Set 1 and Set 2.
Apply both the graphical lasso and GRASS to each set to get estimated edge sets $\hat{E}_1^{GL}$, $\hat{E}_1^{GRASS}$, $\hat{E}_2^{GL}$, $\hat{E}_2^{GRASS}$.
First treat the edges estimated by the graphical lasso on Set 1 as the
gold standard, then treat the edges estimated by GRASS on Set 1 as
the gold standard. Specifically, calculate the following:
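$\hat{E}_1^{GL} \cap \hat{E}_2^{GRASS} \cap (\hat{E}_2^{GL})^c$
$\hat{E}_1^{GL} \cap \hat{E}_2^{GL} \cap (\hat{E}_2^{GRASS})^c$
$\hat{E}_1^{GRASS} \cap \hat{E}_2^{GRASS} \cap (\hat{E}_2^{GL})^c$
$\hat{E}_1^{GRASS} \cap \hat{E}_2^{GL} \cap (\hat{E}_2^{GRASS})^c$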
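With edge sets stored as Python sets, the four intersections reduce to set operations. The sketch below reports their sizes; reporting counts rather than the sets themselves, the function name `validation_counts`, and the dictionary keys are assumptions of this sketch.

```python
def validation_counts(e1_gl, e1_grass, e2_gl, e2_grass):
    """Count Set 2 edges found by only one method that also appear in a Set 1 gold standard.

    Each argument is a set of edges, e.g. tuples (i, j) with i < j.
    """
    only_grass = e2_grass - e2_gl   # found by GRASS on Set 2 but not by graphical lasso
    only_gl = e2_gl - e2_grass      # found by graphical lasso on Set 2 but not by GRASS
    return {
        "gold=GL, only GRASS": len(e1_gl & only_grass),
        "gold=GL, only GL": len(e1_gl & only_gl),
        "gold=GRASS, only GRASS": len(e1_grass & only_grass),
        "gold=GRASS, only GL": len(e1_grass & only_gl),
    }
```

Edges that both methods find on Set 2 are excluded, so the comparison focuses on where the two methods disagree.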
Figure: Mean (and standard error) of accuracy of graphical lasso (GL) and GRASS on gene
expression data, over 20 splits of the observations into Set 1 and Set 2. |Ê|, the size of the
estimated edge set, is also reported. Regardless of whether GRASS or graphical lasso is treated
as the gold standard, GRASS yields more accurate edge set recovery than does the graphical
lasso. These results are based on an analysis of the control observations. Similar results are
obtained from the cases (results not shown).
Thanks!