Around the Reproducibility of Scientific Research in the Big Data Era:
A Knockoff Filter for Controlling the False Discovery Rate

Emmanuel Candès
Big Data and Computational Scalability, University of Warwick, July 2015
Collaborator: Rina Foygel Barber

A crisis?

The Economist, Oct 19th 2013: "How science goes wrong" (leader) and "Trouble at the lab" (briefing)
http://www.economist.com/news/briefing/21588057-scientists-think-science-self-correcting-alarming-degree-it-not-trouble

"A simple idea underpins science: 'trust, but verify'. Results should always be subject to challenge from experiment. That simple but powerful idea has generated a vast body of knowledge. Since its birth in the 17th century, modern science has changed the world beyond recognition, and overwhelmingly for the better. But success can breed complacency. Modern scientists are doing too much trusting and not enough verifying, to the detriment of the whole of science, and of humanity. Too many of the findings that fill the academic ether are the result of shoddy experiments or poor analysis. A rule of thumb among biotechnology venture capitalists is that half of published research cannot be replicated. Even that may be optimistic. Last year researchers at one biotech firm, Amgen, found they could reproduce just six of 53 'landmark' studies in cancer research. Earlier, a group at Bayer, a drug company, managed to repeat just a quarter of 67 similarly important papers. A leading computer scientist frets that three-quarters of papers in his subfield are bunk. In 2000-10 roughly 80,000 patients took part in clinical trials based on research that was later retracted because of mistakes or improprieties."
From "Trouble at the lab": "Scientists like to think of science as self-correcting. To an alarming degree, it is not. 'I see a train wreck looming,' warned Daniel Kahneman, an eminent psychologist, in an open letter last year. The premonition concerned research on a phenomenon known as 'priming'. Priming studies suggest that decisions can be influenced by apparently irrelevant actions or events that took place just before the cusp of choice. They have been a boom area in psychology over the past decade, and some of their insights have already made it out of the lab and into the toolkits of policy wonks keen on 'nudging' the populace. Dr Kahneman and a growing number of his colleagues fear that a lot of this priming research is poorly founded. Over the past few years various researchers have made systematic attempts to replicate some of the more widely cited priming experiments. Many of these replications have failed."

Snippets
Systematic attempts to replicate widely cited priming experiments have failed.
Amgen could only replicate 6 of 53 studies they considered landmarks in basic cancer science.
Bayer HealthCare could only replicate about 25% of 67 seminal studies.
Early report (Kaplan, '08): 50% of Phase III FDA studies ended in failure.

The New Yorker, December 13, 2010: "The Truth Wears Off: Is there something wrong with the scientific method?" by Jonah Lehrer (Annals of Science)
http://www.newyorker.com/reporting/2010/12/13/101213fa_fact_lehrer

Many results that are rigorously proved and accepted start shrinking in later studies. "On September 18, 2007, a few dozen neuroscientists, psychiatrists, and drug-company executives gathered in a hotel conference room in Brussels to hear some startling news. It had to do with a class of drugs known as atypical or second-generation antipsychotics, which came on the market in the early nineties. The drugs, sold under brand names such as Abilify, Seroquel, and Zyprexa, had been tested on schizophrenics in several large clinical trials, all of which had demonstrated a dramatic decrease in the subjects' psychiatric symptoms. As a result, second-generation antipsychotics had become one of the fastest-growing and most profitable pharmaceutical classes. By 2001, Eli Lilly's Zyprexa was generating more revenue than Prozac. It remains the company's top-selling drug. But the data presented at the Brussels meeting made it clear that something strange was happening: the therapeutic power of the drugs appeared to be steadily waning."
Diagnoses of the problem: "significance chasing", "publication bias", "selective reporting"; see also "Why most published research findings are false" (Ioannidis, '05) and coverage in The New York Times, January 2014.

Personal and societal concern
Great danger in seeing an erosion of public confidence in science.

The scientific community seems to be responding
Reproducibility Initiative: http://validation.scienceexchange.com/
Nature's 18-point checklist, announced on April 25, 2013, in the editorial "Reducing our irreproducibility":

"Over the past year, Nature has published a string of articles that highlight failures in the reliability and reproducibility of published research (collected and freely available at go.nature.com/huhbyr). The problems arise in laboratories, but journals such as this one compound them when they fail to exert sufficient scrutiny over the results that they publish, and when they do not publish enough information for other researchers to assess results properly.

From next month, Nature and the Nature research journals will introduce editorial measures to address the problem by improving the consistency and quality of reporting in life-sciences articles. To ease the interpretation and improve the reliability of published results we will more systematically ensure that key methodological details are reported, and we will give more space to methods sections. We will examine statistics more closely and encourage authors to be transparent, for example by including their raw data.

Central to this initiative is a checklist intended to prompt authors to disclose technical and statistical information in their submissions, and to encourage referees to consider aspects important for research reproducibility (go.nature.com/oloeip). It was developed after discussions with researchers on the problems that lead to irreproducibility, including workshops organized last year by US National Institutes of Health (NIH) institutes. It also draws on published concerns about reporting standards (or the lack of them) and the collective experience of editors at Nature journals.

The checklist is not exhaustive. It focuses on a few experimental and analytical design elements that are crucial for the interpretation of research results but are often reported incompletely.
For example, authors will need to describe methodological parameters that can introduce bias or influence robustness, and provide precise characterization of key reagents that may be subject to biological variability, such as cell lines and antibodies. The checklist also consolidates existing policies about data deposition and presentation.

We will also demand more precise descriptions of statistics, and we will commission statisticians as consultants on certain papers, at the editor's discretion and at the referees' suggestion.

We recognize that there is no single way to conduct an experimental study. Exploratory investigations cannot be done with the same level of statistical rigour as hypothesis-testing studies. Few academic laboratories have the means to perform the level of validation required, for example, to translate a finding from the laboratory to the clinic. However, that should not stand in the way of a full report of how a study was designed, conducted and analysed that will allow reviewers and readers to adequately interpret and build on the results.

To allow authors to describe their experimental design and methods in as much detail as necessary, the participating journals, including Nature, will abolish space restrictions on the methods section. To further increase transparency, we will encourage authors to provide tables of the data behind graphs and figures. This builds on our established data-deposition policy for specific experiments and large data sets. The source data will be made available directly from the figure legend, for easy access. We continue to encourage authors to share detailed methods and reagent descriptions by depositing protocols in Protocol Exchange (www.nature.com/protocolexchange), an open resource linked from the primary paper.

Renewed attention to reporting and transparency is a small step. Much bigger underlying issues contribute to the problem, and are beyond the reach of journals alone. Too few biologists receive adequate training in statistics and other quantitative aspects of their subject. Mentoring of young scientists on matters of rigour and transparency is inconsistent at best. In academia, the ever increasing pressures to publish and chase funds provide little incentive to pursue studies and publish results that contradict or confirm previous papers. Those who document the validity or irreproducibility of a published piece of work seldom get a welcome from journals and funders, even as money and effort are wasted on false assumptions.

Tackling these issues is a long-term endeavour that will require the commitment of funders, institutions, researchers and publishers. It is encouraging that NIH institutes have led community discussions on this topic and are considering their own recommendations. We urge others to take note of these and of our initiatives, and do whatever they can to improve research reproducibility." (Nature 496, 398, 25 April 2013)
See also the NAS President's address of April 27, 2015.

Big data and a new scientific paradigm
Collect data first ⟹ ask questions later.
Large data sets are available prior to the formulation of hypotheses.
We need to quantify the "reliability" of hypotheses generated by data snooping.
Very different from hypothesis-driven research.

What does statistics have to offer?
Account for the "look everywhere" effect: understand reliability in the context of all the hypotheses that have been explored.

Most discoveries may be false: Sorić ('89). See also The Economist.
1000 hypotheses to test, of which about 100 are potential discoveries (true effects).
Power ≈ 80% ⟹ true positives ≈ 80. Testing the ~900 nulls at the 5% level gives false positives ≈ 0.05 × 900 ≈ 45.
⟹ False discovery rate ≈ 45 / (80 + 45) ≈ 36% among the reported findings.
Power ≈ 30% ⟹ false discovery rate ≈ 60%, and more false negatives than true positives!
Example: meta-analysis in neuroscience. Button et al. (2013), "Power failure: why small sample size undermines the reliability of neuroscience."

False Discovery Rate (FDR): Benjamini & Hochberg ('95)
H_1, ..., H_n hypotheses subject to some testing procedure.
FDR = E[ #false discoveries / #discoveries ]   (with the convention 0/0 = 0)
A natural type I error criterion. Under independence (and PRDS), simple rules (BHq) control the FDR. Widely used.

FDR control with BHq (under independence)
Sort the p-values from most to least significant, p_(1) ≤ p_(2) ≤ ... ≤ p_(n), and fix a target FDR level q. BHq rejects the k most significant hypotheses, where k is the largest index such that p_(k) ≤ q k / n. The cut-off is adaptive to the number of non-nulls.
(Figure: sorted p-values plotted against their index, with the adaptive BHq cut-off.)
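As a concrete illustration of the BHq step-up rule just described, here is a minimal Python sketch (the function name and inputs are illustrative, not part of the talk):

```python
import numpy as np

def bh_rejections(pvals, q):
    """Benjamini-Hochberg step-up rule: indices rejected at target FDR level q."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)                        # most to least significant
    below = p[order] <= q * np.arange(1, n + 1) / n
    if not below.any():
        return np.array([], dtype=int)           # nothing is rejected
    k = np.max(np.nonzero(below)[0]) + 1         # largest k with p_(k) <= q*k/n
    return order[:k]                             # reject the k smallest p-values

# Example usage: rejected = bh_rejections(pvals, q=0.2)
```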
Earlier work on multiple comparisons: Henry Scheffé, John Tukey, Rupert Miller; Westfall and Young ('93).

Controlled Variable Selection

A contemporary problem: inference for sparse regression. Example: find locations on the genome that influence a trait, e.g. cholesterol level. (Background slide: a genome-wide association study from the Diabetes Genetics Initiative (DGI), with 347,010 SNPs of MAF > 5% genotyped in 2,758 Finnish and Swedish individuals on the Affymetrix 500K Mapping Array Set, and roughly 2,261,000 genotyped and/or imputed SNPs related to plasma lipid levels in 8,816 phenotyped individuals. Manhattan plots of -log10 P values across chromosomes 1-22 highlight loci such as CETP, LPL, LIPC, ABCA1, GALNT2, MVK/MMAB, LCAT and LIPG for HDL cholesterol; the APOE cluster, APOB, SORT1/CELSR2/PSRC1, LDLR, PCSK9, NCAN/CILP2, B4GALT4 and B3GALT4 for LDL cholesterol; and APOA5, GCKR, LPL, TRIB1, ANGPTL3, GALNT2, MLXIPL, RBKS, NCAN/CILP2 and LIPC for triglycerides.)

Linear model y = Xβ + z, fitted e.g. by the Lasso:
min_β̂ ½ ‖y − Xβ̂‖²₂ + λ‖β̂‖₁
y: cholesterol level of the patients;
X: genotype matrix, e.g. X_{ij} = number of alleles of recessive type at location j;
z: environmental factors (not accounted for by genetics).
How do we control the FDR of the selected variables {j : β̂_j ≠ 0}?

Sparse regression
Simulated data with n = 1500, p = 500; Lasso fitted model with λ = 1.75.
(Figure: fitted coefficients β̂_j against the index j; among the variables selected by the Lasso, some are true positives and some are false positives.)
On this example the Lasso selects 55 variables, 26 of which are nulls: FDP = 26/55 ≈ 47%.
Can we estimate the FDP from the data? Forget it...

Controlled variable selection
y = Xβ + z = Σ_j β_j X_j + z, with y ∈ R^n, X ∈ R^{n×p}, β ∈ R^p, and y ∼ N(Xβ, σ²I).
Goal: select a set of features X_j without too many false positives.
FDR (false discovery rate) = E[ #false positives / #features selected ], the expected false discovery proportion, with the convention 0/0 = 0.
This is a multiple testing problem (with possibly very many irrelevant variables) for the hypotheses H_j : β_j = 0, j = 1, ..., p.
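A minimal simulation sketch in the spirit of the example above, using scikit-learn's Lasso; the seed, the number of signals and their amplitude are assumptions for illustration, so the counts will not exactly match the 26/55 on the slide:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, k = 1500, 500, 30                      # k assumed true signals
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)               # unit-norm columns
beta = np.zeros(p)
beta[:k] = 3.5                               # assumed signal amplitude
y = X @ beta + rng.standard_normal(n)

# sklearn minimizes (1/2n)||y - Xb||^2 + alpha*||b||_1, so alpha = lambda / n
fit = Lasso(alpha=1.75 / n).fit(X, y)
selected = np.flatnonzero(fit.coef_)
false_pos = np.setdiff1d(selected, np.arange(k))
print(f"{selected.size} selected, FDP = {false_pos.size / max(selected.size, 1):.2f}")
```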
This lecture
A novel procedure for variable selection controlling the FDR in finite-sample settings.
Requires n ≥ p (and X of full column rank) for identifiability: when n < p there exist β ≠ β′ with Xβ = Xβ′.
Works under any design X.
Does not require any knowledge of the noise level σ.
Ongoing research: n < p.

The Knockoff Filter
"It's not a name brand bag, just a cheap knockoff." (Thesaurize.com)
For each feature X_j, construct a knockoff version X̃_j. The knockoffs serve as a control group ⟹ we can estimate the FDP.
(1) Construct fake variables (knockoffs).
(2) Calculate statistics for each original/knockoff pair.
(3) Calculate a data-dependent threshold for the statistics.

1. Knockoff features X̃_j
With Σ = X'X, the knockoffs must satisfy
X̃_j'X̃_k = X_j'X_k for all j, k,  and  X̃_j'X_k = X_j'X_k for all j ≠ k,
so that, for some vector s ∈ R^p with s ≥ 0,
[X X̃]'[X X̃] = ( Σ            Σ − diag{s}
                 Σ − diag{s}   Σ ).
We would like the knockoffs to be as uncorrelated with the original features as possible (s as large as possible).

How? Via matrix computations and/or numerical optimization (later):
X̃ = X(I − Σ⁻¹ diag{s}) + Ũ C,
where Ũ ∈ R^{n×p} has orthonormal columns orthogonal to the column space of X, and C'C is a Cholesky factorization of 2 diag{s} − diag{s} Σ⁻¹ diag{s} ⪰ 0.
No need for new data or a new experiment.

Why? For a null feature X_j (β_j = 0),
X_j'y = X_j'Xβ + X_j'z  =_d  X̃_j'Xβ + X̃_j'z = X̃_j'y.
Indeed X_j'Xβ = X̃_j'Xβ because X_j and X̃_j have the same inner products with every X_k, k ≠ j, and β_j = 0; and the Gaussian terms X_j'z and X̃_j'z have the same distribution since ‖X_j‖ = ‖X̃_j‖.

Lemma (pairwise exchangeability). For any subset N of null variables,
[X X̃]'_swap(N) y  =_d  [X X̃]' y,
where swap(N) exchanges X_j and X̃_j for every j ∈ N.
⟹ The knockoffs are a 'control group' for the nulls.
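A numpy sketch of this construction, assuming n ≥ 2p and columns of X normalized so that Σ = X'X has unit diagonal; the slight shrinkage of s is an implementation assumption so that the Cholesky factor exists:

```python
import numpy as np

def equicorrelated_knockoffs(X, seed=1):
    """Knockoffs Xtilde = X(I - Sigma^-1 diag{s}) + Utilde C, with equal s_j (n >= 2p)."""
    n, p = X.shape
    Sigma = X.T @ X
    Sigma_inv = np.linalg.inv(Sigma)
    s = min(2 * np.linalg.eigvalsh(Sigma).min(), 1.0) * (1 - 1e-6)
    S = s * np.eye(p)                            # diag{s} with equal entries

    # Utilde: orthonormal basis of a p-dimensional subspace orthogonal to col(X)
    Q, _ = np.linalg.qr(X)
    A = np.random.default_rng(seed).standard_normal((n, p))
    A -= Q @ (Q.T @ A)
    Utilde, _ = np.linalg.qr(A)

    # C'C = 2 diag{s} - diag{s} Sigma^-1 diag{s}  (positive definite after shrinkage)
    M = 2 * S - S @ Sigma_inv @ S
    C = np.linalg.cholesky(M).T

    return X @ (np.eye(p) - Sigma_inv @ S) + Utilde @ C
```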
Knockoff method
Compute the Lasso on the augmented design:
β̂(λ) = argmin_{b ∈ R^{2p}} ½ ‖y − [X X̃] b‖²₂ + λ‖b‖₁.
Fitted model for λ = 1.75 on the simulated dataset, with 500 original features and 500 knockoff features: the Lasso selects 49 original features and 24 knockoff features.
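Continuing the illustrative simulation, one can fit the Lasso on the augmented design and count how many originals versus knockoffs are selected; this sketch reuses equicorrelated_knockoffs from the construction above and the same assumed data-generating choices:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, k = 1500, 500, 30
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)
beta = np.zeros(p); beta[:k] = 3.5
y = X @ beta + rng.standard_normal(n)

Xt = equicorrelated_knockoffs(X)             # from the construction sketch above
fit = Lasso(alpha=1.75 / n).fit(np.hstack([X, Xt]), y)
n_orig = np.count_nonzero(fit.coef_[:p])
n_ko = np.count_nonzero(fit.coef_[p:])
print(n_orig, "original and", n_ko, "knockoff features selected")
```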
Pairwise exchangeability ⟹ there are probably about 24 false positives among the 49 selected original features.

2. Statistics
Z_j = sup{λ : β̂_j(λ) ≠ 0}: the first time (i.e. the largest λ at which) X_j enters the Lasso path; Z̃_j is defined in the same way for the knockoff X̃_j.
Test statistic for feature j:
W_j = max(Z_j, Z̃_j) · sign(Z_j − Z̃_j),
so W_j > 0 when X_j enters before its knockoff and W_j < 0 otherwise, and |W_j| is large when the pair enters early (significant). Many other choices are possible (later).

Variation: run forward selection on the augmented design [X X̃]; record the first time (rank) at which either the original or its knockoff enters, and tag it '+' if the original comes before its knockoff and '−' otherwise.

Consequence of the pairwise exchangeability of the nulls:
(W_1, ..., W_p) =_d (W_1·ε_1, ..., W_p·ε_p),
where the sign sequence {ε_j} is independent of |W| (the ordering), with ε_j = +1 for every non-null j and ε_j i.i.d. ±1 fair coin flips for the null j.
Signs → 1-bit p-values.
(Figure: values of λ at which X_j and X̃_j enter the model, for signals and nulls, with the estimated FDP at threshold t = 1.5.)

Knockoff estimate of the FDR: for a threshold t > 0,
FDP(t) = #{j null : W_j ≥ t} / (#{j : W_j ≥ t} ∨ 1)
       ≈ #{j null : W_j ≤ −t} / (#{j : W_j ≥ t} ∨ 1)
       ≤ #{j : W_j ≤ −t} / (#{j : W_j ≥ t} ∨ 1) =: FDP̂(t),
because for a null feature, W_j ≥ t and W_j ≤ −t are equally likely.
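A sketch of these Lasso-path statistics with scikit-learn's lasso_path; sklearn's alpha grid equals the talk's λ up to a factor of n, which does not change the entry order, and the grid resolution below is an assumption:

```python
import numpy as np
from sklearn.linear_model import lasso_path

def lasso_signed_max_stats(X, Xtilde, y, n_alphas=200):
    """W_j = max(Z_j, Ztilde_j) * sign(Z_j - Ztilde_j) from the augmented Lasso path."""
    p = X.shape[1]
    alphas, coefs, _ = lasso_path(np.hstack([X, Xtilde]), y, n_alphas=n_alphas)
    active = np.abs(coefs) > 0                   # coefs has shape (2p, n_alphas)
    # Z: largest alpha (alphas are decreasing) at which each variable is active
    Z = np.where(active.any(axis=1), alphas[np.argmax(active, axis=1)], 0.0)
    return np.maximum(Z[:p], Z[p:]) * np.sign(Z[:p] - Z[p:])
```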
3. Selection and FDR control
Select the features with large positive statistics, {j : W_j ≥ T}, with a data-dependent threshold T.

Knockoff: T = min{ t ∈ W : #{j : W_j ≤ −t} / (#{j : W_j ≥ t} ∨ 1) ≤ q }.
Theorem (Knockoff). E[ V / (R + q⁻¹) ] ≤ q, where V is the number of false positives and R the total number of selections.

Knockoff+: T = min{ t ∈ W : (1 + #{j : W_j ≤ −t}) / (#{j : W_j ≥ t} ∨ 1) ≤ q }.
Theorem (Knockoff+). FDR = E[ V / (R ∨ 1) ] ≤ q.

Stopping rule: scan the thresholds t from the most significant |W_j| downwards and stop the first time the ratio between the number of negatives (plus 0 or 1) and the number of positives falls below q.

Why does all this work? Write V⁺(t) = #{j null : W_j ≥ t} and V⁻(t) = #{j null : W_j ≤ −t}. Then
FDP(T) = #{j null : W_j ≥ T} / (#{j : W_j ≥ T} ∨ 1)
       = [ (1 + #{j null : W_j ≤ −T}) / (#{j : W_j ≥ T} ∨ 1) ] · [ #{j null : W_j ≥ T} / (1 + #{j null : W_j ≤ −T}) ]
       ≤ q · V⁺(T) / (1 + V⁻(T)),
since (1 + #{j null : W_j ≤ −T}) ≤ (1 + #{j : W_j ≤ −T}) and the definition of T makes the first factor at most q.
The process V⁺(t)/(1 + V⁻(t)) is a super-martingale with respect to a well-defined filtration, and T is a stopping time. By the optional stopping theorem,
FDR ≤ q E[ V⁺(T)/(1 + V⁻(T)) ] ≤ q E[ V⁺(0)/(1 + V⁻(0)) ] = q E[ V⁺(0)/(1 + #nulls − V⁺(0)) ] ≤ q,
because V⁺(0) ∼ Binomial(#nulls, 1/2) and a direct computation gives E[B/(1 + m − B)] = 1 − 2⁻ᵐ ≤ 1 for B ∼ Binomial(m, 1/2).
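A sketch of this data-dependent threshold and of the resulting selection (plus=True gives knockoff+, plus=False the knockoff variant); the commented usage lines chain the earlier illustrative sketches together:

```python
import numpy as np

def knockoff_threshold(W, q=0.2, plus=True):
    """Smallest t with (offset + #{W_j <= -t}) / max(#{W_j >= t}, 1) <= q."""
    W = np.asarray(W)
    offset = 1 if plus else 0
    for t in np.sort(np.abs(W[W != 0])):         # candidate thresholds t in |W|
        if (offset + np.sum(W <= -t)) / max(np.sum(W >= t), 1) <= q:
            return t
    return np.inf                                # infeasible: select nothing

# Chaining the sketches on the simulated data:
# Xt = equicorrelated_knockoffs(X)
# W = lasso_signed_max_stats(X, Xt, y)
# selected = np.flatnonzero(W >= knockoff_threshold(W, q=0.2, plus=True))
```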
Comparison with other methods

Permutation methods: let X^π be X with its rows randomly permuted. Then
[X X^π]'[X X^π] ≈ ( Σ  0
                    0  Σ ),
so the permuted columns do not mimic the correlations among the original features.
(Figure: value of λ at which each variable enters the model, for original non-null features, original null features and permuted features.)
FDR over 1000 trials (nominal level q = 20%): knockoff method 12.29%, permutation method 45.61%.

Other methods
Benjamini-Hochberg (BHq); BHq with a log-factor correction; BHq with whitened noise.
y ∼ N(Xβ, σ²I) ⟺ β̂^LS ∼ N(β, σ²(X'X)⁻¹). Apply BHq to the z-scores Z_j = β̂_j^LS / (σ √((Σ⁻¹)_jj)).
This is not known to control the FDR under dependence → log-factor correction (Benjamini-Yekutieli).

Empirical results
Features i.i.d. N(0, I_n), n = 3000, p = 1000, k = 30 variables with regression coefficients of magnitude 3.5; nominal level q = 20%.

Method                        | FDR (%) | Power (%) | Theoretical FDR control?
Knockoff+ (equivariant)       |  14.40  |   60.99   | Yes
Knockoff (equivariant)        |  17.82  |   66.73   | No
Knockoff+ (SDP)               |  15.05  |   61.54   | Yes
Knockoff (SDP)                |  18.72  |   67.50   | No
BHq                           |  18.70  |   48.88   | No
BHq + log-factor correction   |   2.20  |   19.09   | Yes
BHq with whitened noise       |  18.79  |    2.33   | Yes

Effect of the sparsity level: same setup with amplitudes 3.5 and q = 0.2. (Figure: FDR and power versus sparsity level k, from 50 to 200, for knockoff, knockoff+ and BHq, with the nominal level marked.)
Effect of the signal amplitude: same setup with k = 30 and q = 0.2. (Figure: FDR and power versus amplitude A for knockoff, knockoff+ and BHq.)
Effect of feature correlation: features ∼ N(0, Θ) with Θ_jk = ρ^|j−k|, n = 3000, p = 1000, k = 30, amplitude 3.5. (Figure: FDR and power versus ρ for knockoff, knockoff+ and BHq.)

Application to real HIV data: drug resistance
Response y: log-fold-increase of lab-tested drug resistance; covariate X_j: presence or absence of mutation #j.

Drug type | # drugs | Sample size | # protease or RT positions genotyped | # mutations appearing ≥ 3 times in sample
PI        |    6    |     848     |                  99                  |  209
NRTI      |    6    |     639     |                 240                  |  294
NNRTI     |    3    |     747     |                 240                  |  319

Data from R. Shafer (Stanford), available at: http://hivdb.stanford.edu/pages/published_analysis/genophenoPNAS2006/
PI-type drug resistance
The TSM list contains mutations associated with the PI class of drugs in general; it is not specialized to the individual drugs in the class.
(Figure: number of HIV-1 protease positions selected by knockoff and BHq at q = 0.2 for resistance to APV, ATV, IDV, LPV, NFV and SQV, with data set sizes ranging from n = 329, p = 147 to n = 843, p = 209, split into positions that do or do not appear in the TSM list; the horizontal line marks the number of HIV-1 protease positions in the TSM list.)

NRTI-type drug resistance
(Figure: number of HIV-1 RT positions selected by knockoff and BHq for resistance to 3TC, ABC, AZT, D4T, DDI and TDF, with data set sizes from n = 353, p = 218 to n = 633, p = 294, validated against the treatment-selected mutation (TSM) panel.)

NNRTI-type drug resistance
(Figure: number of HIV-1 RT positions selected by knockoff and BHq for resistance to DLV, EFV and NVP, with data set sizes from n = 732, p = 311 to n = 746, p = 319, validated against the TSM panel.)

Heavy-tailed noise
Design as in the HIV example (sparse matrix); errors taken as residuals from the HIV example; regression coefficients entered manually. At q = 20%: knockoff+ FDR 20.31% and power 60.67%; BHq FDR 25.47% and power 69.42%.

Details

Knockoff constructions (n ≥ 2p)
X̃ = X(I − Σ⁻¹ diag{s}) + Ũ C, so that
[X X̃]'[X X̃] = ( Σ            Σ − diag{s}
                 Σ − diag{s}   Σ ) =: G,
and G ⪰ 0 ⟺ diag{s} ⪰ 0 and 2Σ − diag{s} ⪰ 0.

Equi-correlated knockoffs: s_j = 2λ_min(Σ) ∧ 1 for all j, so that ⟨X_j, X̃_j⟩ = 1 − 2λ_min(Σ) ∧ 1. Under equivariance, this minimizes the value of |⟨X_j, X̃_j⟩|.

SDP knockoffs: minimize Σ_j |1 − s_j| subject to s_j ≥ 0 and diag{s} ⪯ 2Σ (equivalently, minimize Σ_j (1 − s_j) under the same constraints). This is a highly structured semidefinite program (SDP). Other choices are possible.
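A sketch of this SDP using cvxpy (choosing cvxpy, and the clipping of solver round-off, are my assumptions; the talk only specifies the optimization problem), with Σ assumed to have unit diagonal:

```python
import numpy as np
import cvxpy as cp

def sdp_s(Sigma):
    """minimize sum_j |1 - s_j|  subject to  s >= 0,  diag{s} <= 2*Sigma (PSD order)."""
    p = Sigma.shape[0]
    s = cp.Variable(p)
    constraints = [s >= 0, 2 * cp.Constant(Sigma) - cp.diag(s) >> 0]
    cp.Problem(cp.Minimize(cp.sum(cp.abs(1 - s))), constraints).solve()
    return np.clip(s.value, 0, None)             # remove tiny negative round-off
```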
Symmetric statistics
More generally, the statistics W = W([X X̃], y) must satisfy two properties.
Sufficiency property: W = f([X X̃]'[X X̃], [X X̃]'y), i.e. W depends on the data only through the Gram matrix and the inner products with y.
Anti-symmetry property: swapping a pair changes the sign,
W_j([X X̃]_swap(S), y) = W_j([X X̃], y) · (+1 if j ∉ S, −1 if j ∈ S).

Examples of statistics
With Z_j = sup{λ : β̂_j(λ) ≠ 0}, j = 1, ..., 2p, where β̂(λ) solves the augmented Lasso:
W_j = (Z_j ∨ Z_{j+p}) · sign(Z_j − Z_{j+p}), or W_j = Z_j − Z_{j+p}.
The same with any penalized estimate min_b ½ ‖y − [X X̃]b‖²₂ + λP(b).
Forward selection / orthogonal matching pursuit: Z_1, ..., Z_2p the (reverse) order in which the 2p variables entered the model, and W_j = (Z_j ∨ Z_{j+p}) · sign(Z_j − Z_{j+p}).
Statistics based on least-squares estimates: W_j = |β̂_j^LS|² − |β̂_{j+p}^LS|², or |β̂_j^LS| − |β̂_{j+p}^LS|.
... (endless possibilities)

Extensions

Other type-I errors
Familywise error rate (FWER), k-FWER, ... can also be controlled: Janson and Su ('15).
Let T be the time at which m knockoffs have entered before their originals; reject the hypotheses tagged '+' (original entered before its knockoff) before that time. Then the expected number of false discoveries satisfies E[V] ≤ m.

Other models: group selection
Suppose we wish to test groups of variables: y = Σ_{g∈G} X_g β_g + z, with hypotheses H_g : β_g = 0, fitted e.g. by the group lasso min_β̂ ½ ‖y − Σ_g X_g β̂_g‖²₂ + λ Σ_g ‖β̂_g‖₂, or by forward group selection, ...
Construct group knockoffs so that exchangeability holds; calculate statistics (e.g. signed reversed ranks); compute the threshold as before. This provides FDR control.
Summary
Knockoff filter = inference machine: you design the statistics, the knockoffs take care of the inference.
Works under any design X (handles arbitrary correlations).
Does not require any knowledge of the noise level σ.
Very powerful when the effects are sparse.
Open research: n < p...
FDR control is an extremely useful concept even away from counting errors and successes.

BHq in the regression setting
y ∼ N(Xβ, σ²I) ⟺ β̂^LS ∼ N(β, σ²(X'X)⁻¹). The resulting statistics are independent iff X'X is diagonal (orthogonal design).
Orthogonal model: y ∼ N(β, I) (σ = 1 known). BHq rejects at the threshold
T_BH = min{ t : p · P{|N(0,1)| ≥ t} / #{j : |y_j| = |β_j + z_j| ≥ t} ≤ q }.
The knockoff procedure (with the Lasso) is quite different: make a control group. With [X X̃] equal to the 2p × 2p identity (in blocks I_p), the statistics are y ∼ N(β, I) independent of z′ ∼ N(0, I), and the knockoff+ threshold becomes
T = min{ t : FDP̂(t) = #{j : |z′_j| ≥ t and |z′_j| > |y_j|} / #{j : |y_j| ≥ t and |y_j| > |z′_j|} ≤ q }.

Empirical comparison: p = 1000, with 200 true signals, so the fraction of nulls is 0.8.
(Figure: FDR and power of BHq and knockoff+ versus the size A of the regression coefficients (signal magnitude), with the nominal level marked.)

Some Closing Remarks: How About Prediction/Estimation?

FDR for estimation: y = Xβ + z
Goal: predict the response from the explanatory variables. Sparse (modern) setup: p is large and only a few variables are important.

Model selection via C_p:
C_p(S) = ‖y − Xβ̂[S]‖² + 2σ²|S|,  S ⊂ {1, ..., p},
where β̂[S] are the fitted least-squares coefficients from the variables in S, so that C_p(S) is unbiased for the prediction error of the model S.
(Diagram: C_p, FWER control and FDR control placed on a spectrum between low bias and low variance.)

FDR thresholding (Abramovich and Benjamini ('96))
Orthogonal design: X'X = I_p ⟹ X'y ∼ N(β, σ²I_p).
Select a nominal level q, perform BH(q) testing, and set
β̂_i = X_i'y if |X_i'y| ≥ t_FDR, and β̂_i = 0 otherwise,
where t_FDR is the BH(q) rejection threshold.

Theorem (Abramovich, Benjamini, Donoho, Johnstone ('05))
For the sparsity class ℓ_0(k) = {β : ‖β‖_0 ≤ k} and the minimax risk R(k) = inf_β̂ sup_{β ∈ ℓ_0(k)} E‖β̂ − β‖²₂: set q < 1/2. Then asymptotically, as p → ∞ and k ∈ [(log p)⁵, p^{1−δ}],
E‖β̂_FDR − β‖²₂ = E‖Xβ̂_FDR − Xβ‖²₂ = R(k)(1 + o(1)).
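A sketch of this FDR-thresholding estimator for an orthogonal design with known σ (the two-sided Gaussian p-values and the function name are my spelling-out of the recipe above):

```python
import numpy as np
from scipy.stats import norm

def fdr_threshold_estimate(X, y, q=0.1, sigma=1.0):
    """Hard-threshold X'y at the BH(q) cut-off (orthogonal design X'X = I)."""
    z = X.T @ y                                  # ~ N(beta, sigma^2 I)
    pvals = 2 * norm.sf(np.abs(z) / sigma)
    p = z.size
    order = np.argsort(pvals)
    below = pvals[order] <= q * np.arange(1, p + 1) / p
    beta_hat = np.zeros(p)
    if below.any():
        keep = order[: np.max(np.nonzero(below)[0]) + 1]   # BH rejections
        beta_hat[keep] = z[keep]                            # keep X_i'y, zero elsewhere
    return beta_hat
```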
and Su (’15) Sparse class `0 (k) = {β : kβk0 ≤ k} Fix q ∈ (0, 1] and set λi = σ · Φ−1 (1 − iq/2p) (BHq) For some linear models, adaptive minimax estimation sup β∈`0 (k) E kX β̂SLOPE − Xβk22 = R(k)(1 + o(1))