Around the Reproducibility of Scientific Research
in the Big Data Era:
A Knockoff Filter for Controlling the False Discovery Rate
Emmanuel Candès
Big Data and Computational Scalability, University of Warwick, July 2015
Collaborator
Rina Foygel Barber
A crisis? The Economist: Oct 19th, 2013
Problems with scientific research: How science goes wrong | The Economist
5.11.13 18:28
Problems with scientific research
How science goes wrong
Scientific research has changed the world. Now it needs to change itself
Oct 19th 2013 | From the print edition
A SIMPLE idea underpins science: “trust, but
verify”. Results should always be subject to
challenge from experiment. That simple but
powerful idea has generated a vast body of
knowledge. Since its birth in the 17th century,
modern science has changed the world beyond
recognition, and overwhelmingly for the better.
But success can breed complacency. Modern scientists are doing too much trusting and not
enough verifying—to the detriment of the whole of science, and of humanity.
Too many of the findings that fill the academic ether are the result of shoddy experiments or
poor analysis (see article: http://www.economist.com/news/briefing/21588057-scientists-think-science-self-correcting-alarming-degree-it-not-trouble). A rule of thumb among biotechnology
venture-capitalists is that half of published research cannot be replicated. Even that may be
optimistic. Last year researchers at one biotech firm, Amgen, found they could reproduce just six
of 53 “landmark” studies in cancer research. Earlier, a group at Bayer, a drug company, managed
to repeat just a quarter of 67 similarly important papers. A leading computer scientist frets that
three-quarters of papers in his subfield are bunk. In 2000-10 roughly 80,000 patients took part
in clinical trials based on research that was later retracted because of mistakes or improprieties.
A crisis? The Economist: Oct 19th, 2013

[Screenshots of two Economist pieces from Oct 19th 2013: “Problems with scientific research: How science goes wrong” and “Unreliable research: Trouble at the lab — Scientists like to think of science as self-correcting. To an alarming degree, it is not”]
Snippets

[Screenshot: “Unreliable research: Trouble at the lab”, The Economist, Oct 19th 2013 — Daniel Kahneman’s “I SEE a train wreck looming” warning about priming research, and the failure of systematic replication attempts]

Systematic attempts to replicate widely cited priming experiments have failed
Amgen could only replicate 6 of 53 studies they considered landmarks in basic cancer science
Bayer HealthCare could only replicate about 25% of 67 seminal studies
Early report (Kaplan, ’08): 50% of Phase III FDA studies ended in failure
New Yorker: December, 2010

[Screenshot: “The Truth Wears Off — Is there something wrong with the scientific method?” by Jonah Lehrer, Annals of Science, The New Yorker, December 13, 2010. “Many results that are rigorously proved and accepted start shrinking in later studies.” The opening recounts a September 18, 2007 meeting in Brussels at which data showed the therapeutic power of second-generation antipsychotics (Abilify, Seroquel, Zyprexa) steadily waning.]
http://www.newyorker.com/reporting/2010/12/13/101213fa_fact_lehrer

“Significance chasing”
“Publication bias”
“Selective reporting”
“Why most published research findings are false” (Ioannidis, ’05)
The New York Times Jan. 2014

Personal and societal concern
Great danger in seeing erosion of public confidence in science

Seems like the scientific community is responding

Reproducibility Initiative
http://validation.scienceexchange.com/
Nature’s 18-point checklist: April 25, 2013
ANNOUNCEMENT
Reducing our
irreproducibility
Over the past year, Nature has published a string of articles that
highlight failures in the reliability and reproducibility of published research (collected and freely available at go.nature.com/huhbyr). The problems arise in laboratories, but journals such as
this one compound them when they fail to exert sufficient scrutiny
over the results that they publish, and when they do not publish
enough information for other researchers to assess results properly.
From next month, Nature and the Nature research journals will
introduce editorial measures to address the problem by improving
the consistency and quality of reporting in life-sciences articles.
To ease the interpretation and improve the reliability of published
results we will more systematically ensure that key methodological details are reported, and we will give more space to methods
sections. We will examine statistics more closely and encourage
authors to be transparent, for example by including their raw data.
Central to this initiative is a checklist intended to prompt authors
to disclose technical and statistical information in their submissions, and to encourage referees to consider aspects important for
research reproducibility (go.nature.com/oloeip). It was developed
after discussions with researchers on the problems that lead to
irreproducibility, including workshops organized last year by US
National Institutes of Health (NIH) institutes. It also draws on published concerns about reporting standards (or the lack of them) and
the collective experience of editors at Nature journals.
The checklist is not exhaustive. It focuses on a few experimental
and analytical design elements that are crucial for the interpretation of research results but are often reported incompletely. For
example, authors will need to describe methodological parameters
that can introduce bias or influence robustness, and provide precise
characterization of key reagents that may be subject to biological
variability, such as cell lines and antibodies. The checklist also consolidates existing policies about data deposition and presentation.
We will also demand more precise descriptions of statistics, and
we will commission statisticians as consultants on certain papers,
at the editor’s discretion and at the referees’ suggestion.
We recognize that there is no single way to conduct an experimental study. Exploratory investigations cannot be done with the
same level of statistical rigour as hypothesis-testing studies. Few
academic laboratories have the means to perform the level of validation required, for example, to translate a finding from the laboratory to the clinic. However, that should not stand in the way of a
full report of how a study was designed, conducted and analysed
that will allow reviewers and readers to adequately interpret and
build on the results.
To allow authors to describe their experimental design and
methods in as much detail as necessary, the participating journals, including Nature, will abolish space restrictions on the
methods section.
To further increase transparency, we will encourage authors to
provide tables of the data behind graphs and figures. This builds
on our established data-deposition policy for specific experiments
and large data sets. The source data will be made available directly
from the figure legend, for easy access. We continue to encourage authors to share detailed methods and reagent descriptions
by depositing protocols in Protocol Exchange (www.nature.com/
protocolexchange), an open resource linked from the primary paper.
Renewed attention to reporting and transparency is a small step.
Much bigger underlying issues contribute to the problem, and are
beyond the reach of journals alone. Too few biologists receive adequate training in statistics and other quantitative aspects of their
subject. Mentoring of young scientists on matters of rigour and
transparency is inconsistent at best. In academia, the ever-increasing pressures to publish and chase funds provide little incentive to
pursue studies and publish results that contradict or confirm previous papers. Those who document the validity or irreproducibility of
a published piece of work seldom get a welcome from journals and
funders, even as money and effort are wasted on false assumptions.
Tackling these issues is a long-term endeavour that will require
the commitment of funders, institutions, researchers and publishers. It is encouraging that NIH institutes have led community
discussions on this topic and are considering their own recommendations. We urge others to take note of these and of our initiatives,
and do whatever they can to improve research reproducibility. ■
NAS President’s address: April 27, 2015
Big data and a new scientific paradigm

Collect data first   =⇒   Ask questions later

Large data sets available prior to formulation of hypotheses
Need to quantify “reliability” of hypotheses generated by data snooping
Very different from hypothesis-driven research

What does statistics have to offer?
Account for the “look everywhere” effect
Understand reliability in the context of all hypotheses that have been explored
Most discoveries may be false: Sorić (’89)    (see also The Economist)

1000 hypotheses to test, 100 potential (true) discoveries

Power ≈ 80%  −→  true positives ≈ 80
False positives (5% level among the 900 nulls) ≈ 45
=⇒  False discovery rate ≈ 45/(45 + 80) ≈ 36%

With power ≈ 30%  =⇒  false discovery rate ≈ 60%
More false negatives than true positives!

Example: meta-analysis in neuroscience
Button et al. (2013) Power failure: why small sample size undermines the
reliability of neuroscience
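A quick back-of-the-envelope check of the arithmetic above (Python; the 1000/100 split, the 5% level and the two power values are the slide's numbers, everything else is just arithmetic):

# Expected false discovery proportion in the Soric thought experiment.
n_hyp = 1000          # hypotheses tested
n_signal = 100        # potential (true) discoveries
alpha = 0.05          # per-test significance level

for power in (0.80, 0.30):
    true_pos = power * n_signal                 # expected true positives
    false_pos = alpha * (n_hyp - n_signal)      # expected false positives among the nulls
    fdp = false_pos / (false_pos + true_pos)
    print(f"power={power:.0%}: ~{true_pos:.0f} TP, ~{false_pos:.0f} FP, FDP ~ {fdp:.0%}")
# power=80%: ~80 TP, ~45 FP, FDP ~ 36%
# power=30%: ~30 TP, ~45 FP, FDP ~ 60%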
False Discovery Rate (FDR): Benjamini & Hochberg (’95)
H1 , . . . Hn hypotheses subject to some testing procedure
FDR  =  E [ #false discoveries / #discoveries ]        (convention: ‘0/0 = 0’)
Natural type I error
Under independence (and PRDS) simple rules control FDR (BHq)
Widely used
FDR control with BHq (under independence)
FDR: expected proportion of false discoveries
Sorted p-values: p(1) ≤ p(2) ≤ . . . ≤ p(n) (from most to least significant)
Target FDR q
[Figure: sorted p-values p(i) plotted against index i = 1, . . . , 100]
[Figure: sorted p-values for two simulated examples with different numbers of non-nulls, together with the BHq rejection cut-off]

The cut-off is adaptive to the number of non-nulls
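For concreteness, a minimal sketch of the BHq step-up rule on a vector of p-values; this is the standard Benjamini–Hochberg procedure, not code distributed with the talk, and the toy p-values at the end are made up:

import numpy as np

def bhq(pvals, q=0.2):
    """Benjamini-Hochberg: return indices of rejected hypotheses at target FDR q."""
    p = np.asarray(pvals)
    n = p.size
    order = np.argsort(p)                          # most to least significant
    below = p[order] <= q * np.arange(1, n + 1) / n
    if not below.any():
        return np.array([], dtype=int)
    k = np.max(np.nonzero(below)[0])               # last index under the BH line
    return order[:k + 1]                           # reject the k+1 smallest p-values

# Example: 90 null p-values and 10 very small ones
rng = np.random.default_rng(0)
pvals = np.concatenate([rng.uniform(size=90), rng.uniform(0, 1e-3, size=10)])
print(bhq(pvals, q=0.2))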
Earlier work on multiple comparisons
Henry Scheffé
John Tukey
Westfall and Young (’93)
Rupert Miller
Controlled Variable Selection

Contemporary problem: inference for sparse regression

Find locations on the genome that influence a trait: e.g. cholesterol level

[Figure: genome-wide association results for plasma lipid levels (∼2,261,000 genotyped and/or imputed SNPs, 8,816 phenotyped individuals from the Diabetes Genetics Initiative, Affymetrix 500K Mapping Array Set): Manhattan plots of −log10(P value) across chromosomes 1–22 for HDL cholesterol, LDL cholesterol and triglycerides, with hits labelled by gene (e.g. GALNT2, LPL, ABCA1, MVK/MMAB, LIPC, CETP, LCAT, LIPG; PCSK9, SORT1/CELSR2/PSRC1, APOB, B3GALT4, B4GALT4, LDLR, NCAN/CILP2, APOE cluster; ANGPTL3, GCKR, RBKS, MLXIPL, TRIB1, APOA5)]

Linear model (y = Xβ + z) + e.g. Lasso fit:

    min  (1/2) ‖y − X β̂‖₂² + λ ‖β̂‖₁

y : cholesterol level of patients
X : genotype matrix; e.g. Xi,j = # of alleles of recessive type at location j
z : environmental factors (not accounted for by genetics)

How do we control the FDR of the selected variables {i : β̂i ≠ 0}?
Sparse regression

Simulated data with n = 1500, p = 500
Lasso fitted model for λ = 1.75:

[Figure: fitted coefficients β̂j plotted against index j = 1, . . . , 500; a few dozen coefficients stand clearly away from zero while the rest sit near zero. Individual selected coefficients are annotated “False positive?” / “True positive?”]

Here FDP = 26/55 ≈ 47%

Estimate FDP? Forget it...
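A rough way to reproduce this kind of experiment (a sketch, not the talk's exact setup: the sparsity k = 30, the ±3.5 amplitudes, the column normalization and the sklearn alpha that corresponds to λ = 1.75 are all assumptions):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, k = 1500, 500, 30
X = rng.standard_normal((n, p)) / np.sqrt(n)        # roughly unit-norm columns
beta = np.zeros(p)
beta[:k] = 3.5 * rng.choice([-1, 1], size=k)        # k true signals
y = X @ beta + rng.standard_normal(n)

# sklearn's Lasso scales the squared-error term by 1/(2n), so alpha = lambda / n
fit = Lasso(alpha=1.75 / n, fit_intercept=False, max_iter=10000).fit(X, y)
selected = np.flatnonzero(fit.coef_)
false_pos = np.sum(selected >= k)                   # the first k coefficients are the true signals
print(f"selected {len(selected)}, FDP = {false_pos / max(len(selected), 1):.0%}")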
Controlled variable selection

    y  =  Xβ + z  =  Σj βj Xj + z,        y ∈ Rⁿ,  X ∈ Rⁿˣᵖ,  β ∈ Rᵖ,        y ∼ N(Xβ, σ²I)

Goal: select a set of features Xj without too many false positives

    FDR  =  E [ # false positives / # features selected ]        (convention: ‘0/0 = 0’)
    (false discovery rate = expectation of the false discovery proportion)

Context of multiple testing (with possibly very many irrelevant variables):

    Hj : βj = 0,        j = 1, . . . , p
This lecture

Novel procedure for variable selection controlling FDR in finite-sample settings
Requires n ≥ p (and full column rank) for identifiability
    (n < p  =⇒  there exist β ≠ β′ with Xβ = Xβ′)
Works under any design X
Does not require any knowledge of the noise level σ
Ongoing research: n < p
The Knockoff Filter
The knockoff filter
It’s not a name brand bag, just a cheap knockoff
Thesaurize.com
For each feature Xj , construct knockoff version X̃j
Knockoffs serve as control group =⇒ can estimate FDP
(1) Construct fake variables (knockoffs)
(2) Calculate statistics for each original/knockoff pair
(3) Calculate a data-dependent threshold for the statistics
1. Knockoff features X̃j

    X̃j′ X̃k = Xj′ Xk    for all j, k
    X̃j′ Xk  = Xj′ Xk    for all j ≠ k

Would like knockoffs as uncorrelated with the features as possible

How? Compute knockoffs via matrix computations and/or numerical optimization (later):

    [X X̃]′ [X X̃]  =  [ Σ              Σ − diag{s} ]        with s ∈ Rᵖ
                      [ Σ − diag{s}    Σ           ]

    X̃ = X (I − Σ⁻¹ diag{s}) + Ũ C

Ũ ∈ Rⁿˣᵖ with column space orthogonal to that of X
C′C : Cholesky factorization of 2 diag{s} − diag{s} Σ⁻¹ diag{s} ⪰ 0
No need for new data or experiment
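A minimal numpy sketch of the equi-correlated construction above, assuming n ≥ 2p, full column rank and columns of X normalized to unit norm; the function name and the tiny diagonal jitter added before the Cholesky step are mine:

import numpy as np

def equicorrelated_knockoffs(X):
    """X~ = X(I - Sigma^{-1} diag{s}) + U~ C with s_j = min(2*lambda_min(Sigma), 1)."""
    n, p = X.shape
    Sigma = X.T @ X
    s = min(2 * np.linalg.eigvalsh(Sigma).min(), 1.0) * np.ones(p)
    Sigma_inv_S = np.linalg.solve(Sigma, np.diag(s))                    # Sigma^{-1} diag{s}
    # U~: orthonormal columns orthogonal to the column space of X
    Q, _ = np.linalg.qr(np.hstack([X, np.random.default_rng(0).standard_normal((n, p))]))
    U = Q[:, p:2 * p]
    # C'C = 2 diag{s} - diag{s} Sigma^{-1} diag{s}; small jitter keeps the Cholesky stable
    A = 2 * np.diag(s) - np.diag(s) @ Sigma_inv_S
    C = np.linalg.cholesky(A + 1e-10 * np.eye(p)).T
    return X @ (np.eye(p) - Sigma_inv_S) + U @ C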
Why? For a null feature Xj,

    Xj′ y = Xj′ Xβ + Xj′ z  =d  X̃j′ Xβ + X̃j′ z = X̃j′ y

Lemma (pairwise exchangeability property)
For any subset of nulls N,

    [X X̃]swap(N)′ y  =d  [X X̃]′ y

=⇒ knockoffs are a ‘control group’ for the nulls
Knockoff method

Compute the Lasso with the augmented n × 2p matrix [X X̃]:

    β̂(λ)  =  argmin_{b ∈ R²ᵖ}  (1/2) ‖y − [X X̃] b‖₂² + λ ‖b‖₁

[Figure: fitted coefficients for λ = 1.75 on the simulated dataset, plotted against column index; the 500 original features are followed by the 500 knockoff features]

Lasso selects 49 original features & 24 knockoff features

Pairwise exchangeability  =⇒  probably ≈ 24 false positives among the 49 selected original features
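Continuing the earlier sketches (this assumes the X, y, n from the simulation sketch and the hypothetical equicorrelated_knockoffs() helper defined above), one would fit the augmented Lasso and count original vs. knockoff selections like so:

import numpy as np
from sklearn.linear_model import Lasso

X_tilde = equicorrelated_knockoffs(X)
XX = np.hstack([X, X_tilde])                        # n x 2p augmented design

fit = Lasso(alpha=1.75 / n, fit_intercept=False, max_iter=10000).fit(XX, y)
p = X.shape[1]
orig_sel = np.flatnonzero(fit.coef_[:p])
ko_sel = np.flatnonzero(fit.coef_[p:])
print(f"{len(orig_sel)} original and {len(ko_sel)} knockoff features selected")
# By pairwise exchangeability, the knockoff count estimates the number of
# false positives among the selected original features.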
2. Statistics

    Zj = sup {λ : bj(λ) ≠ 0}        first time Xj enters the model
    Z̃j = sup {λ : b̃j(λ) ≠ 0}        first time X̃j enters the model

Test statistic Wj for feature j:

    Wj = max(Zj, Z̃j) · { +1   if Zj > Z̃j
                         −1   if Zj < Z̃j }

[Diagram: Wj values on the real line ordered by |W|; a large |Wj| with sign ‘+’ means the original feature enters early (significant) and before its knockoff, sign ‘−’ means the knockoff enters first]

Many other choices (later)
Variation

Forward selection on the augmented design [X X̃]
First time (rank) at which either the original or its knockoff enters
Tag ‘+’ if the original comes before its knockoff, ‘−’ otherwise

[Diagram: tagged ranks ordered from “enters late” to “enters early (significant)”]
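A sketch of the signed-max statistic computed from a Lasso path on the augmented design; it uses sklearn's lasso_path, so the entry value Zj is only resolved to the resolution of the regularization grid, and the helper name is mine:

import numpy as np
from sklearn.linear_model import lasso_path

def lasso_signed_max_stats(XX, y):
    """Z_j = largest lambda at which column j of [X X~] enters the Lasso path;
    W_j = max(Z_j, Z_{j+p}) * sign(Z_j - Z_{j+p})."""
    n, two_p = XX.shape
    p = two_p // 2
    alphas, coefs, _ = lasso_path(XX, y, n_alphas=200)      # sklearn: alpha = lambda / n
    lam = alphas * n
    Z = np.array([lam[np.nonzero(coefs[j])[0]].max() if np.any(coefs[j]) else 0.0
                  for j in range(two_p)])                   # grid approximation of entry times
    return np.maximum(Z[:p], Z[p:]) * np.sign(Z[:p] - Z[p:])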
Pairwise exchangeability of the nulls

[Diagram: for each null feature, the pair (Xj, X̃j) is exchangeable — swapping a null feature with its knockoff leaves the distribution of [X X̃]′y unchanged]
Estimated FDP at threshold t = 1.5

[Figure: scatter of (value of λ when Xj enters, value of λ when X̃j enters) for each feature j, with signals and nulls marked]
Consequence of exchangeability

Signs of the null Wj’s are i.i.d. ±1, independent of |W| (the ordering):

    (W1, . . . , Wp)  =d  (W1 · ε1, . . . , Wp · εp)

where the sign sequence {εj} is independent of |W|, with εj = +1 for all non-null j and εj ∼ i.i.d. {±1} for null j

Signs → 1-bit p-values
Knockoff estimate of FDR

    FDP(t)  =  #{j null : Wj ≥ t} / (#{j : Wj ≥ t} ∨ 1)
            ≈  #{j null : Wj ≤ −t} / (#{j : Wj ≥ t} ∨ 1)
            ≤  #{j : Wj ≤ −t} / (#{j : Wj ≥ t} ∨ 1)   =:  F̂DP(t)
3. Selection and FDR control

Select features with large and positive statistics: {j : Wj ≥ T}

Knockoff:     T = min { t ∈ W :  #{j : Wj ≤ −t} / (#{j : Wj ≥ t} ∨ 1)  ≤  q }

Theorem (Knockoff)
    E [ V / (R + q⁻¹) ]  ≤  q        V : # false positives,  R : total # of selections

Knockoff+:    T = min { t ∈ W :  (1 + #{j : Wj ≤ −t}) / (#{j : Wj ≥ t} ∨ 1)  ≤  q }

Theorem (Knockoff+)
    E [ V / (R ∨ 1) ]  ≤  q
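The data-dependent threshold is a one-pass scan over the observed |Wj|; a sketch covering both variants (offset = 0 for knockoff, offset = 1 for knockoff+); the function name is mine:

import numpy as np

def knockoff_threshold(W, q=0.2, offset=1):
    """Return the smallest t among the observed |W_j| at which the estimated FDP drops below q."""
    ts = np.sort(np.abs(W[W != 0]))                 # candidate thresholds
    for t in ts:
        ratio = (offset + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if ratio <= q:
            return t
    return np.inf                                   # no selection possible at level q

# Usage:
# T = knockoff_threshold(W, q=0.2, offset=1)
# selected = np.flatnonzero(W >= T)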
Stopping rule

    T = min { t :  (0/1 + #{j : Wj ≤ −t}) / (#{j : Wj ≥ t} ∨ 1)  ≤  q }

[Diagram: ‘+’ and ‘−’ signs ordered by |W|, threshold t sweeping from large to small]

Stop the first time the ratio between # negatives and # positives falls below q
Why does all this work?

    T = min { t :  (1 + #{j : Wj ≤ −t}) / (#{j : Wj ≥ t} ∨ 1)  ≤  q }

    FDP(T)  =  #{j null : Wj ≥ T} / (#{j : Wj ≥ T} ∨ 1)  ·  (1 + #{j null : Wj ≤ −T}) / (1 + #{j null : Wj ≤ −T})

            ≤  q ·  #{j null : Wj ≥ T} / (1 + #{j null : Wj ≤ −T})   =   q · V⁺(T) / (1 + V⁻(T))

since (1 + #{j null : Wj ≤ −T}) / (#{j : Wj ≥ T} ∨ 1) ≤ (1 + #{j : Wj ≤ −T}) / (#{j : Wj ≥ T} ∨ 1) ≤ q by definition of T,
where  V⁺(T) = #{j null : Wj ≥ T}   and   V⁻(T) = #{j null : Wj ≤ −T}

V⁺(t) / (1 + V⁻(t)) is a super-martingale w.r.t. a well-defined filtration
T is a stopping time
Optional stopping time theorem

    FDR  ≤  q · E [ V⁺(T) / (1 + V⁻(T)) ]
         ≤  q · E [ V⁺(0) / (1 + V⁻(0)) ]
         =   q · E [ V⁺(0) / (1 + #nulls − V⁺(0)) ]        with V⁺(0) ∼ Binomial(#nulls, 1/2)
         ≤  q
Comparison with other methods

Permutation methods
Let Xπ = X with rows randomly permuted. Then

    [X Xπ]′ [X Xπ]  ≈  [ Σ   0 ]
                       [ 0   Σ ]

[Figure: value of λ at which each variable enters the model (Zj), shown separately for original non-null features, original null features, and permuted features in the augmented matrix [X Xπ]]

                            FDR over 1000 trials (nominal level q = 20%)
    Knockoff method             12.29%
    Permutation method          45.61%
Other methods

Benjamini–Hochberg (BHq)
BHq + log-factor correction
BHq with whitened noise

    y ∼ N(Xβ, σ²I)        ⇐⇒        β̂LS ∼ N(β, σ²(X′X)⁻¹)

Apply BHq to

    Zj  =  β̂jLS / ( σ √((Σ⁻¹)jj) )

Not known to control FDR  →  log-factor correction (Benjamini–Yekutieli)
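For reference, a self-contained sketch of this BHq-on-least-squares baseline (it assumes σ is known, n ≥ p and X has full column rank; the function name is mine):

import numpy as np
from scipy.stats import norm

def bhq_least_squares(X, y, sigma, q=0.2):
    """BHq applied to least-squares z-scores; returns indices of selected variables."""
    Sigma_inv = np.linalg.inv(X.T @ X)
    z = (Sigma_inv @ X.T @ y) / (sigma * np.sqrt(np.diag(Sigma_inv)))
    pvals = 2 * norm.sf(np.abs(z))                  # two-sided p-values
    order = np.argsort(pvals)
    passed = np.nonzero(pvals[order] <= q * np.arange(1, z.size + 1) / z.size)[0]
    return order[:passed.max() + 1] if passed.size else np.array([], dtype=int)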
Empirical results

Features ∼ N(0, In), n = 3000, p = 1000
k = 30 variables with regression coefficients of magnitude 3.5

                                      FDR (%)                   Power (%)    Theor. FDR
    Method                            (nominal level q = 20%)                control?
    Knockoff+ (equivariant)           14.40                     60.99        Yes
    Knockoff (equivariant)            17.82                     66.73        No
    Knockoff+ (SDP)                   15.05                     61.54        Yes
    Knockoff (SDP)                    18.72                     67.50        No
    BHq                               18.70                     48.88        No
    BHq + log-factor correction        2.20                     19.09        Yes
    BHq with whitened noise           18.79                      2.33        Yes
Effect of sparsity level

Same setup with amplitudes set to 3.5 (q = 0.2)

[Figure: FDR (%) and power (%) of knockoff, knockoff+ and BHq as a function of the sparsity level k (50 to 200), with the nominal level marked]
Effect of signal amplitude

Same setup with k = 30 (q = 0.2)

[Figure: FDR (%) and power (%) of knockoff, knockoff+ and BHq as a function of the signal amplitude A (2.8 to 4.2), with the nominal level marked]
Effect of feature correlation

Features ∼ N(0, Θ) with Θjk = ρ^|j−k|
n = 3000, p = 1000, k = 30 and amplitude = 3.5

[Figure: FDR (%) and power (%) of knockoff, knockoff+ and BHq as a function of the feature correlation ρ (0 to 0.8), with the nominal level marked]
Application to real HIV data

HIV drug resistance

    Drug type    # drugs    Sample size    # protease or RT         # mutations appearing
                                           positions genotyped      ≥ 3 times in sample
    PI           6          848             99                      209
    NRTI         6          639            240                      294
    NNRTI        3          747            240                      319

Response y: log-fold-increase of lab-tested drug resistance
Covariate Xj: presence or absence of mutation #j
Data from R. Shafer (Stanford) available at:
http://hivdb.stanford.edu/pages/published_analysis/genophenoPNAS2006/
PI-type drug resistance

TSM list: mutations associated with the PI class of drugs in general; it is not specialized to the individual drugs in the class

Figure: q = 0.2. For each of six PI drugs (APV, ATV, IDV, LPV, NFV, SQV; data set sizes from n = 329, p = 147 up to n = 843, p = 209), the number of positions on the HIV-1 protease where mutations were selected by the knockoff filter and by BHq, split into positions that appear in the TSM list and positions that do not. Horizontal line = # HIV-1 protease positions in the TSM list.
NRTI-type drug resistance

Figure: Validation against the treatment-selected mutation (TSM) panel. For each of six NRTI drugs (3TC, ABC, AZT, D4T, DDI, TDF; data set sizes from n = 353, p = 218 up to n = 633, p = 294), the number of HIV-1 RT positions selected by the knockoff filter and by BHq, split into positions that appear in the TSM list and positions that do not.
NNRTI-type drug resistance

Figure: Validation against the treatment-selected mutation (TSM) panel. For each of three NNRTI drugs (DLV n=732 p=311, EFV n=734 p=318, NVP n=746 p=319), the number of HIV-1 RT positions selected by the knockoff filter and by BHq, split into positions that appear in the TSM list and positions that do not.
Heavy-tailed noise

Design as in the HIV example (sparse matrix)
Errors as residuals from the HIV example
Regression coefficients entered manually

                   FDR (q = 20%)    Power
    Knockoff+      20.31%           60.67%
    BHq            25.47%           69.42%
Details

Knockoff constructions (n ≥ 2p)

    X̃ = X (I − Σ⁻¹ diag{s}) + Ũ C

    [X X̃]′ [X X̃]  =  [ Σ              Σ − diag{s} ]   :=  G  ⪰  0
                      [ Σ − diag{s}    Σ           ]

    G ⪰ 0    ⇐⇒    diag{s} ⪰ 0   and   2Σ − diag{s} ⪰ 0
Knockoff constructions (n ≥ 2p)

Equi-correlated knockoffs:   sj = 2λmin(Σ) ∧ 1,   so that   ⟨Xj, X̃j⟩ = 1 − (2λmin(Σ) ∧ 1)
Under equivariance, this minimizes the value of |⟨Xj, X̃j⟩|

SDP knockoffs:

    minimize    Σj |1 − sj|            ⇐⇒        minimize    Σj (1 − sj)
    subject to  sj ≥ 0                            subject to  sj ≥ 0
                diag{s} ⪯ 2Σ                                  diag{s} ⪯ 2Σ

Highly structured semidefinite program (SDP)

Other possibilities
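This SDP can be written in a few lines with a generic convex solver. A cvxpy sketch (assumes cvxpy and an SDP-capable solver such as SCS are installed; the explicit s ≤ 1 bound reflects the |1 − sj| form of the objective, and the function name is mine):

import cvxpy as cp
import numpy as np

def sdp_knockoff_s(Sigma):
    """Maximize sum(s) subject to 0 <= s <= 1 and diag{s} <= 2*Sigma in the PSD order."""
    p = Sigma.shape[0]
    s = cp.Variable(p)
    constraints = [s >= 0, s <= 1, 2 * Sigma - cp.diag(s) >> 0]
    cp.Problem(cp.Maximize(cp.sum(s)), constraints).solve()
    return np.clip(s.value, 0, 1)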
Symmetric statistics:  W = W([X X̃], y)

Sufficiency property:

    W = f( [X X̃]′ [X X̃],  [X X̃]′ y )

Anti-symmetry property: swapping a set S of columns with their knockoffs changes signs:

    Wj( [X X̃]swap(S), y )  =  Wj( [X X̃], y ) · { +1   if j ∉ S
                                                 −1   if j ∈ S }
Examples of statistics

Zj = sup{λ : β̂j(λ) ≠ 0},  j = 1, . . . , 2p,  with β̂(λ) the solution to the augmented Lasso

    Wj = (Zj ∨ Zj+p) · sign(Zj − Zj+p)        or        Wj = Zj − Zj+p

Same with any penalized estimate   min  (1/2) ‖y − X b‖₂² + λ P(b)

Forward selection / orthogonal matching pursuit: Z1, · · · , Z2p = (reverse) order in which the 2p variables entered the model;   Wj = (Zj ∨ Zj+p) · sign(Zj − Zj+p)

Statistics based on LS estimates:

    Wj = |β̂jLS|² − |β̂j+pLS|²        or        Wj = |β̂jLS| − |β̂j+pLS|

. . . (endless possibilities)
Extensions

Other type-I errors

Can control the familywise error rate (FWER), k-FWER, ...: Janson and Su (’15)

[Diagram: ‘+’ and ‘−’ tags ordered by |W|, from “enters late” to “enters early (significant)”]

T : time at which m knockoffs have entered
Reject hypotheses with ‘+’ appearing before T
Expected number of false discoveries:  E V ≤ m
Other models

Suppose we wish to test for groups:

    y = Σ_{g∈G} Xg βg + z,        Hg : βg = 0

Group lasso:   min  (1/2) ‖y − Σg Xg β̂g‖₂² + λ Σg ‖β̂g‖₂
Forward group selection
. . .

Construct group knockoffs for exchangeability
Calculate statistics; e.g. signed reversed ranks
Compute threshold as before
Provides FDR control
Summary
Knockoff filter = inference machine
You design the statistics, knockoffs take care of inference
Works under any design X (handles arb. correlations)
Does not require any knowledge of noise level σ
Very powerful when effects are sparse
Open research: n < p...
FDR control is an extremely useful concept even away from counting errors and
successes
BHq

    y ∼ N(Xβ, σ²I)        ⇐⇒        β̂LS ∼ N(β, σ²(X′X)⁻¹)

Statistics are independent iff X′X is diagonal (orthogonal design)

Orthogonal model: y ∼ N(β, I)  (σ = 1 is known)

    TBH = min { t :  p · P{|N(0, 1)| ≥ t} / #{j : |yj| = |βj + zj| ≥ t}  ≤  q }

Knockoff procedure (with lasso) is quite different:

Make control group

    [X X̃]  =  [ Ip   0  ]        =⇒        statistics  =  [X X̃]′ y  =  ( y, z′ )
               [ 0    Ip ]

with y ∼ N(β, I) independent from z′ ∼ N(0, I)

Compute threshold (knockoff+) via

    T = min { t :  F̂DP(t)  =  #{j : |z′j| ≥ t and |z′j| > |yj|} / #{j : |yj| ≥ t and |yj| > |z′j|}  ≤  q }
Empirical comparison

p = 1000
# true signals is 200 → fraction of nulls is 0.8

[Figure: FDR and power of the BHq and knockoff+ methods vs. size A of the regression coefficients (signal magnitude), with the nominal level marked on the FDR panel]
Some Closing Remarks: How About Prediction/Estimation?

FDR for estimation: y = Xβ + z

Goal: predict the response from explanatory variables
Sparse (modern) setup: p large and only a few important variables

Model selection via Cp: for S ⊂ {1, . . . , p},

    Cp(S)  =  ‖y − X β̂[S]‖²  +  2σ² |S|
              (RSS)              (model dim.)

β̂[S]: fitted LS coefficients from the variables in S
Cp(S) is unbiased for the prediction error of model S

[Diagram: Cp, FDR and FWER placed on a spectrum between low bias and low variance]
FDR thresholding (Abramovich and Benjamini (’96))

Orthogonal design: XᵀX = Ip  =⇒  Xᵀy ∼ N(β, σ²Ip)
Select a nominal level q and perform BH(q) testing:

    β̂i = { Xiᵀy    if |Xiᵀy| ≥ tFDR
           0        otherwise }

Theorem (Abramovich, Benjamini, Donoho, Johnstone (’05))
Sparsity class ℓ0(k) = {β : ‖β‖0 ≤ k};  minimax risk

    R(k) = inf_β̂  sup_{β ∈ ℓ0(k)}  E ‖β̂ − β‖₂²

Set q < 1/2. Then asymptotically, as p → ∞ and k ∈ [(log p)⁵, p^{1−δ}],

    E ‖β̂FDR − β‖₂²  =  E ‖X β̂FDR − X β‖₂²  =  R(k) (1 + o(1))
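A sketch of the FDR thresholding estimator for an orthogonal design: apply BH(q) to the two-sided p-values of X′y and hard-threshold at the resulting cut-off (σ is assumed known; the function name is mine):

import numpy as np
from scipy.stats import norm

def fdr_threshold_estimate(X, y, sigma=1.0, q=0.1):
    """FDR thresholding for an orthogonal design (X'X = I_p)."""
    z = X.T @ y
    pvals = np.sort(2 * norm.sf(np.abs(z) / sigma))
    p = len(pvals)
    passed = np.nonzero(pvals <= q * np.arange(1, p + 1) / p)[0]
    # t_FDR: |z| value whose two-sided p-value equals the last p-value under the BH line
    t_fdr = sigma * norm.isf(pvals[passed.max()] / 2) if passed.size else np.inf
    return np.where(np.abs(z) >= t_fdr, z, 0.0)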
Other connections

SLOPE: Bogdan, van den Berg, Sabatti, Su and Candès (’13)

    min  (1/2) ‖y − X β̂‖₂²  +  λ1 |β̂|(1) + λ2 |β̂|(2) + . . . + λp |β̂|(p)

    λ1 ≥ λ2 ≥ . . . ≥ λp ≥ 0,        |β̂|(1) ≥ |β̂|(2) ≥ . . . ≥ |β̂|(p)   [order statistic]

Adaptive minimaxity: C. and Su (’15)
Sparse class ℓ0(k) = {β : ‖β‖0 ≤ k}
Fix q ∈ (0, 1] and set λi = σ · Φ⁻¹(1 − iq/2p)   (BHq)
For some linear models, adaptive minimax estimation:

    sup_{β ∈ ℓ0(k)}  E ‖X β̂SLOPE − X β‖₂²  =  R(k) (1 + o(1))
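The BHq weight sequence used here is a one-liner with the normal quantile function; a sketch (only the weights, not the SLOPE solver itself):

import numpy as np
from scipy.stats import norm

def slope_bh_weights(p, q=0.1, sigma=1.0):
    """lambda_i = sigma * Phi^{-1}(1 - i*q / (2p)), i = 1, ..., p (decreasing sequence)."""
    i = np.arange(1, p + 1)
    return sigma * norm.ppf(1 - i * q / (2 * p))

print(slope_bh_weights(5, q=0.1))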