Presenting:
On the immortality of television sets: “function” in the human genome according to the evolution-free gospel of ENCODE
AKA: why the ENCODE project is full of it
by Matthew Oberhardt

What is ENCODE?
• An attempt to find all functional elements of the human genome
• A huge international consortium, 10 years running
• The exome is 1.5% of human DNA
• How much of the rest is garbage, vs. useful ‘junk’ or fully functional?
• Pilot phase ended in 2007
• Production phase, 2007 – 2012 (first major results published in 2012), funded by $80 million in grants over 4 years
• Attempts to answer questions like: why are 88% of disease-associated SNPs in non-coding DNA regions?

What did ENCODE do?
Mapped:
• RNA transcribed regions
• protein coding regions
• transcription factor binding sites
• chromatin structure
• DNA methylation sites
Performed assays on all of these in “tier 1”, “tier 2”, and “tier 3” cells (different standard cell types)
Provided 1640 ‘datasets’ designed to annotate functional elements in the human genome

ENCODE datatypes: [figure]

Major findings:
• 80.4% of the human genome participates in at least one biochemical RNA- and/or chromatin-associated event in at least one cell type (i.e., is ‘functional’ according to ENCODE)
• Primate-specific elements are in general negatively selected (fig 1)
• Classified chromatin states into groups with different promoter functionalities, and correlated RNA sequence production and processing with these chromatin states (showing that “most” variation in RNA expression can be explained by chromatin states)
• Found (or just repeated known information?) that most disease-related SNPs lie outside of coding regions

But... there are some problems with ENCODE...

“Unless a genomic functionality is actively protected by selection, it will... cease to be functional.
The absurd alternative, which unfortunately was adopted by ENCODE, is to assume that no deleterious mutations can ever occur in the regions they have deemed to be functional. Such an assumption is akin to claiming that a television set left on and unattended will still be in working condition after a million years because no natural events... can affect it.”

But let’s back up...

Major criticisms of ENCODE:
(1) using the ‘causal role’ definition of biological function
(2) committing the logical fallacy of ‘affirming the consequent’
(3) using analytical estimates that yield biased errors and inflate functionality estimates
(4) favoring statistical sensitivity over specificity
(5) emphasizing statistical significance rather than the magnitude of an effect

Criticism 1: using the ‘causal role’ definition of biological function
Two biological concepts of function:
(1) The ‘causal role’ definition: a functional element is a genome segment producing a protein or an RNA, or displaying a reproducible biochemical signature (e.g., protein binding)
(2) The ‘selected effect’ definition: for a trait, T, to have a biological function F, (a) T must originate as a ‘reproduction’ of some prior trait that performed F (or some similar function) in the past, and (b) T must exist because of F.
Example: a sequence similar to TATAAA can easily arise by chance, and will certainly bind transcription factors (being similar to the TATA box). It is therefore functional in the ‘causal role’ sense but not in the ‘selected effect’ sense.
Similarly, the human heart has the ‘causal role’ of producing sounds, but its selected effect is pumping blood...
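How easily a TATAAA-like hexamer arises by chance can be estimated on the back of an envelope. The sketch below is only illustrative: it assumes a uniformly random genome (equal base frequencies, independent positions) and a haploid genome length of roughly 3.2 billion bp. Both are simplifying assumptions introduced here, not figures from the talk, and the function name is hypothetical.

```python
def expected_chance_matches(motif_length: int, genome_length: int,
                            both_strands: bool = True) -> float:
    """Expected number of exact matches to one fixed motif in a random sequence.

    Assumes each base is drawn independently with probability 1/4 (an
    idealisation; real genomes have skewed and correlated base composition).
    """
    positions = genome_length - motif_length + 1   # possible start positions
    per_position = 0.25 ** motif_length            # chance of a match at one position
    strands = 2 if both_strands else 1             # count the reverse strand too
    return positions * per_position * strands


# Assumed haploid human genome length (~3.2 Gb); not a number from the slides.
GENOME_LENGTH = 3_200_000_000

# A 6-bp motif such as TATAAA is expected to occur more than a million times
# by chance alone under these assumptions.
print(f"{expected_chance_matches(6, GENOME_LENGTH):,.0f}")
```

Under these assumptions a chance match to any given 6-bp motif is anything but rare, which is the point of the example: the presence of a bindable sequence says little about selected-effect function.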
Bottom line: If a sequence doesn’t show signs of selection, it cannot be functional in the ‘selected effect’ sense, which is the only one that really counts. (This is a very strong statement...)

How, then, to detect selection?
• There can be positive selection, purifying selection, or recently evolved species-specific elements.
• Some of these can be subtle and hard to detect.
• SO: it is likely that more than 9% of the human genome (the current estimate) is functional
• BUT: 80% is too high.
• Comparative genomics suggests that <15% of the genome is under evolutionary selection; therefore, the % of functional elements should be below that...
• The “ENCODE Incongruity”: the claim that a biological function can be maintained without selection.

Why single out transcription as a function?
You could also say ‘acted on by DNA polymerase’ is a function, in which case 100% of the genome is functional!
ENCODE also applies this wrong definition of functionality wrongly...

Criticism 2: committing the logical fallacy of ‘affirming the consequent’
The fallacy:
1. If P then Q.
2. Q.
3. Therefore, P.
Example: A random sequence binds a transcription factor; this does not necessarily result in transcription. However, the ‘binding’ property would be enough for ENCODE.
In ENCODE, a DNA segment is ascribed ‘functionality’ if it:
(1) is transcribed
(2) is associated with a modified histone
(3) is located in an open chromatin area
(4) binds a transcription factor
(5) contains a methylated CpG dinucleotide
All of these are examples of affirming the consequent...
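The invalidity of the argument form above can be checked mechanically with a truth table. This is a minimal sketch; the helper names are hypothetical, and the reading of P as “the segment is functional” and Q as “the segment binds a transcription factor” is just one way to map the slide’s example onto the schema.

```python
from itertools import product


def implies(p: bool, q: bool) -> bool:
    """Material implication: 'if P then Q'."""
    return (not p) or q


def find_counterexamples(premises, conclusion):
    """Return every (P, Q) assignment where all premises hold but the conclusion fails.

    An argument form is valid exactly when this list is empty.
    """
    return [
        (p, q)
        for p, q in product([True, False], repeat=2)
        if all(premise(p, q) for premise in premises) and not conclusion(p, q)
    ]


# Affirming the consequent: (P -> Q), Q, therefore P.
affirming_consequent = find_counterexamples(
    premises=[implies, lambda p, q: q],
    conclusion=lambda p, q: p,
)

# Modus ponens, for contrast: (P -> Q), P, therefore Q.
modus_ponens = find_counterexamples(
    premises=[implies, lambda p, q: p],
    conclusion=lambda p, q: q,
)

print(affirming_consequent)  # [(False, True)]
print(modus_ponens)          # []
```

The single counterexample, P false and Q true, corresponds to a segment that binds a transcription factor without being functional.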
Criticism 3: using analytical estimates that yield biased errors and inflate functionality estimates
Continuing on this theme, according to ENCODE all of the below are (wrongly) considered functional:
(1) 74.7% of the genome that is transcribed – ALL OF WHICH IS CONSIDERED FUNCTIONAL
• Also, ENCODE used stem cells and cancer cells, both very transcriptionally active...
• What about pseudogenes, introns, and mobile elements (non-functional)??
• Also, RNA transcripts were mapped to DNA using a tool with a 10% rejection rate
(2) 56.1% that is associated with modified histones
• A recent study showed 2% of histone modifications to affect function
• ENCODE assigned functions to all histone modifications it analyzed
(3) 15.2% that is found in open chromatin areas
• ENCODE claims most open chromatin regions are functional transcription start sites
• In fact, only 30% of open regions are even in the neighborhood of start sites
(4) 8.5% that binds transcription factors
• Transcription factor binding sites are short, so many can occur by chance
• A better estimate is 0.28%, taking selection into account
• Mean lengths of ENCODE ‘transcription factor binding sites’ are 824, 457, and 535 nucleotides, while most binding sites are 6 – 14 bp!
(5) 4.6% that is methylated CpG dinucleotides
• ENCODE claims that 96% of CpG sites are methylated – not a sign of function, but merely that all CpG sites can be methylated!
Evidence for purifying selection in ENCODE
And the errors...:
• Instead of using all SNPs, ENCODE used only the 1.3 million primate-specific ones of ≥200 bp
  ***By doing this, they removed everything that is of interest functionally!!!
• Then, more processing left 82% of segments smaller than 100 bp, with a median of 15 bp, so inferences were made in part using ~85,000 alignment blocks of 1 bp and ~76,000 of 2 bp...
• There were other problems with the controls... (they were longer, etc.)
• But in the end, the ENCODE-containing samples had a frequency 0.20% lower than control (hence negative selection!!). The p-value was strong (4e-37) because there were so many datapoints.
IS THIS BIOLOGICALLY MEANINGFUL???
(The stat test also probably didn’t take into account dependence of variables, and other possible causes of the 0.20% are laid out.)

[Figure: (CODING) allele frequency for primate-specific elements; axis label: derived allele frequency. This is the evidence for negative selection.]

Criticism 4: favoring statistical sensitivity over specificity
(Just covered as well...)

Criticism 5: emphasizing statistical significance rather than the magnitude of an effect

Junk DNA
ENCODE would have us think that “Junk DNA is Dead”
A few distinctions:
(1) Having a potential future function does NOT mean that a DNA segment is functional (hence ‘junk’, not ‘garbage’)
(2) Evolution will drive towards a mostly functional genome only if genome size is a significant negative selector and the population size is huge. In humans neither is true (in bacteria both are), hence we expect a lot of junk.

Big vs. Small science
What is the function of ‘big science’?
-- to generate massive amounts of reliable and easily accessible data
BUT: wisdom is best gained from small science...

Take Home messages
• Selection is a *must* in ascribing a function to a gene. (Is this strictly true?)
• Don’t affirm the consequent
• Don’t believe everything you read, even in prestigious journals...
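The concern raised under Criticism 5, that a 0.20% difference can earn a p-value like 4e-37 purely because of the number of datapoints, can be illustrated with a toy two-proportion z-test. This is only a sketch: the frequencies (0.500 vs. 0.498) and sample sizes are hypothetical values chosen for illustration, not ENCODE’s data, and the test assumes independent observations, which the slides note may not hold.

```python
import math


def two_proportion_p_value(p1: float, p2: float, n: int) -> float:
    """Two-sided p-value of a pooled two-proportion z-test, n observations per group.

    Uses the normal approximation and assumes independent observations.
    """
    pooled = (p1 + p2) / 2                              # equal group sizes
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / n)  # SE of the difference
    z = abs(p1 - p2) / std_err
    return math.erfc(z / math.sqrt(2))                  # two-sided normal tail


# Hypothetical frequencies: the same tiny effect in both comparisons.
CONTROL, TREATED = 0.500, 0.498

p_small = two_proportion_p_value(CONTROL, TREATED, n=1_000)
p_large = two_proportion_p_value(CONTROL, TREATED, n=10_000_000)

print(f"n = 1,000 per group:      p = {p_small:.3g}")
print(f"n = 10,000,000 per group: p = {p_large:.3g}")
```

The effect size is identical in both runs; only the sample size changes, yet the first p-value is unremarkable and the second is vanishingly small. A small p-value therefore says nothing by itself about whether the difference is biologically meaningful.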
• resistance is growing, as are multiply resistant strains
• reverse-incentive for drug companies to produce antibiotics, esp. narrow spectrum ones
• drugs today are very safe – high hurdle! penicillin wouldn’t have passed current standards!
• current Ab’s are off-patent & thus cheap, so doctors don’t want to use expensive new Ab’s
• infections present with vague symptoms usually... broad spectrum Ab’s are the best bet.
• Ab’s actually cure disease after a short run – not so good for $$
• closing pipelines mean the intellectual base is scattering – we can’t just turn on the tap again!!