Targets of recent positive selection in Indian populations

Irene Gallego Romero
Leverhulme Centre for Human Evolutionary Studies
Department of Biological Anthropology

The Indian subcontinent
• Probably inhabited by H. sapiens ~50,000 YBP (coastal route out of Africa, mtDNA and Y data)
• Drastic population expansion ~35,000 YBP
• Decidedly not a single panmictic population, highly stratified and fragmented
  – linguistics, geography, sociocultural practices
• Very high incidence of T2D and obesity (predicted highest worldwide by 2030)
• Underrepresented in genomic diversity panels

All of which means…
• There has been ample time for ‘recent’ evolutionary adaptations to arise
• These adaptations have generally gone unexamined
  – Most Indian work to date has examined Indian population history, and been carried out on mtDNA and Y-chromosome

Allelic trajectories under selection
Bamshad & Wooding, Nat Rev Gen, 2003

Selective sweeps and haplotypes
Nielsen et al, Nat Rev Gen, 2007

Selective sweeps and haplotypes
All we are looking for is haplotypes that are uncommonly long for their frequency in the sample.
Bamshad & Wooding, Nat Rev Gen, 2003

Quantifying selective sweeps
• EHH: probability of two chromosomes in a sample being identical as a function of distance from a chosen ‘core’ SNP
• Other related metrics:
  – iHS: integral under the EHH curve, sensitive to allelic ancestry
  – XP-EHH: cross population EHH, compares population pairs, detects the action of selection in one population but not the other

Sample composition
• 156 Indian samples
  – 31 populations
• 836 further samples (HGDP-CEPH, our data)
  – Old World, Oceania
  – Split into 8 geographic groups/40 populations
• Illumina 650K, 610K chips (~550,000 autosomal SNPs)

India in a global context: FST

Computational challenges
• Phasing:
  – Inferring haplotype from genotype
• Calculating test statistics:
  – iHS and XP-EHH
• Data post-processing:
  – ~550,000 data points per population per statistic
  – SNPs to genes/genomic regions

Phasing
• Likelihood-based methods
• 550,000 SNPs per individual, ~1,000 individuals
• Phasing chromosome 2 (densest, ~50,000 SNPs) can take over a week
• Computationally intensive, and requires a lot of disk space for storing iterations, so cannot use CamGrid
  – use elephant.bio.cam.ac.uk, simultaneously run multiple chromosomes
  – < 2 weeks to phase all autosomal chromosomes

Computing XP-EHH and iHS
• Compute a value for each statistic for each SNP for each population or population pair (~10 per test)
  – >5,000,000 data points for each statistic
• Not computationally intensive, small files
  – easily run on CamGrid (each chromosome separately)
  – 4-5 hours to analyse a single population
• C++ code

Data processing
• Data sets this big suffer from high false discovery rates
• Multiple testing corrections can be too stringent
• Need to reduce the number of data points
  – windowing approach:
    • Break the genome into non-overlapping, contiguous 200kb windows, test significance at that level

Windowing
• Done using R
  – Hand-written code, no extra packages
  – Requires large amounts of RAM (> 10GB), so not suitable
    for CamGrid
  – Again, use elephant
  – Roughly 2 hours per population
• From 550,000 SNPs to 13,274 windows
  – Spanning ~20,000 genes
  – How to tease out biological meaningfulness?

Separate signals in North and South India

From SNPs to genes and beyond
• Selection acts on phenotypes, not genes
• Mining of ontologies and other databases
  – Gene Ontology terms, Mammalian Phenotype terms, other annotations
  – (not actually done by high throughput methods, but I know better by now)
  – Although it still requires a lot of manual curation
• Map biological function to windows, test for overrepresentation of categories relative to expectations

A lot of hours later…

Acknowledgements
• Toomas Kivisild, Katie Siddle (LCHES)
• Jenny Barna
• Mait Metspalu, Georgi Hudjashov, Gyaneshwer Chaubey (University of Tartu)
• Joe Pickrell (University of Chicago)
• Richard Lempicki (NIH)

Other genome-wide statistics
• Genome-wide FST and HS are both computed with simple R scripts
  – Hand-written code
  – ~5 minutes per population
  – The slowest bit is reading the data in
  – Use elephant.bio.cam.ac.uk
• AAF spectrum slopes are a bit more involved
  – To correct for sample size effects, resample every locus 1,000 times from its own allelic distribution
  – ~ 1 hour per population, requires high RAM, use R

Ancestral allele frequency slopes
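
Backup sketch: EHH on toy data

The EHH definition from the "Quantifying selective sweeps" slide (the probability that two chromosomes are identical out to a given distance from a core SNP) can be illustrated with a few lines of code. This is a minimal sketch and not the C++ code used in the analysis. It is written in Python purely for illustration; the function name `ehh`, its arguments, and the toy haplotypes are all hypothetical. It assumes phased haplotypes coded as strings of '0'/'1', measures distance in marker indices rather than base pairs, and, following the usual convention, computes EHH only among chromosomes carrying a chosen allele at the core SNP.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core, target, allele):
    """Probability that two distinct chromosomes carrying `allele` at the
    `core` marker are identical at every marker from `core` to `target`
    (inclusive). All names here are illustrative, not from the talk."""
    lo, hi = min(core, target), max(core, target)
    # Keep only chromosomes with the chosen core allele, trimmed to the span.
    carriers = [tuple(h[lo:hi + 1]) for h in haplotypes if h[core] == allele]
    n = len(carriers)
    if n < 2:
        return float("nan")  # a pair cannot be drawn from fewer than two
    # Count identical pairs within each distinct extended haplotype.
    counts = Counter(carriers)
    return sum(comb(c, 2) for c in counts.values()) / comb(n, 2)


# Toy phased data: six chromosomes, five markers, core SNP at index 2.
haps = ["11010", "11010", "11011", "11110", "00101", "01100"]

# EHH decay to the right of the core for carriers of allele '0'.
curve = [ehh(haps, 2, t, "0") for t in range(2, 5)]
print(curve)
```

In this toy sample the '0' carriers stay identical for longer than the '1' carriers, which is the "uncommonly long haplotype" pattern described earlier. As the slides note, iHS integrates a curve like this over distance; that step (and any normalisation) is deliberately left out here.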