April 1, 2014
Yang Li Lin Liu
STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology

Major Projects for Human Genetics
• Human Variation
  – HapMap (1, 2, 3)
  – 1,000 Genomes
  – 1KGP
• Genotype-Phenotype
  – dbGaP
  – CGEMS
  – TCGA

Tests for Association
• χ² Test
• Logistic Regression

χ² Test: Testing the effect

Odds Ratio: How strong is the effect

Logistic Regression: More flexible testing

PLINK
• http://pngu.mgh.harvard.edu/~purcell/plink/
• The most widely used software for GWAS
• Includes both data preprocessing and statistical analyses

Data Format for PLINK
• PED files. The PED file is a white-space (space or tab) delimited file whose first six columns are:
  – Family ID
  – Individual ID
  – Paternal ID
  – Maternal ID
  – Sex (1=male; 2=female; other=unknown)
  – Phenotype
• Affection status (the phenotype column), by default, should be coded:
  – -9 missing
  – 0 missing
  – 1 unaffected
  – 2 affected
• MAP files. By default, each line of the MAP file describes a single marker and must contain exactly 4 columns:
  – Chromosome (1-22, X, Y, or 0 if unplaced)
  – rs# or SNP identifier
  – Genetic distance (morgans)
  – Base-pair position (bp units)
• Genetic distance can be specified in centimorgans with the --cm flag.
• To save space and time, you can make a binary PED file (*.bed).
This will store the pedigree/phenotype information in a separate file (*.fam) and create an extended MAP file (*.bim), which contains the allele names that would otherwise be lost in the BED file. To create these files, use the command:
• plink --file mydata --make-bed
which creates (by default):
  plink.bed   (binary file, genotype information)
  plink.fam   (first six columns of mydata.ped)
  plink.bim   (extended MAP file: two extra columns = allele names)

Identify/remove highly related samples
• # Step 1: calculate pairwise genome-wide identity-by-state (IBS)
• plink --bfile v1gwas --genome --geno 0 --mind 1 --out ibsibd
• # Step 2: find the pairs of subjects with excessive relatedness
• gawk ' $8 > 0.8 ' ibsibd.genome

Creating the GWAS dataset: filtering bad SNPs
• # Command to filter by HWE, etc.
• plink --bfile v1gwas --mind 0.05 --geno 0.01 --maf 0.01 --hwe 1e-3 --out v2gwas --make-bed

Missing data
• # PLINK command for analyzing missing data
• plink --bfile v2gwas --out v2gwasmiss --missing

Descriptive analyses of the case/control GWAS data
• # PLINK command for testing missingness between cases and controls
• plink --bfile v2gwas --out v2gwasmisstest --test-missing
• # PLINK command for generating allele frequencies
• plink --bfile v2gwas --out v2gwasfreq --freq

• Allele frequency data (v2gwasfreq.frq)
  CHR  SNP        A1  A2  MAF     NCHROBS
  1    rs3934834  T   C   0.1679  1084
  1    rs3737728  T   C   0.2706  1094
  1    rs6687776  T   C   0.1344  1094
  1    rs9651273  A   G   0.2866  1092
• A1 and A2 are the minor and major alleles, respectively.
• MAF (minor allele frequency) is the number of occurrences of the minor allele divided by the number of non-missing allele observations (NCHROBS).
• The maximum value of NCHROBS is two times the sample size.

• # PLINK command for generating genotype frequencies and HWE estimates
• plink --bfile v2gwas --out v2gwashwe --hardy

Association Analysis of GWAS data
• # PLINK commands for association analyses
• plink --bfile v2gwas --out v2gwasassoc --assoc
• plink --bfile v2gwas --out v2gwasmodel --model --model-trend --adjust

Output
• Adjustment for multiple comparisons (v2gwasmodel.model.trend.adjusted)
  CHR  SNP        UNADJ       GC          BONF    HOLM    SIDAK_SS  SIDAK_SD  FDR_BH  FDR_BY
  10   rs4363506  4.568e-007  4.791e-007  0.2156  0.2156  0.1939    0.1939    0.2156  1
  5    rs5014235  7.823e-006  8.124e-006  1       1       0.9751    0.9751    0.9625  1

MAF and VCF file formats
• Frequently used in cancer genome projects (MAF) and the 1000 Genomes Project (VCF). Here MAF stands for Mutation Annotation Format, not minor allele frequency.
• https://wiki.nci.nih.gov/display/TCGA/Mutation+Annotation+Format+(MAF)+Specification
• http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
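
Worked example: χ² test and odds ratio
The association slides name the χ² test and the odds ratio without showing the arithmetic. The sketch below illustrates both on a 2×2 table of allele counts (cases vs. controls, minor vs. major allele). The counts and the function names chi_square_2x2 and odds_ratio are hypothetical, chosen only for illustration. This is the plain Pearson statistic with 1 degree of freedom and no continuity correction, not necessarily what PLINK computes internally.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, P(X > stat) equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def odds_ratio(a, b, c, d):
    """Odds of the minor allele in cases divided by the odds in controls."""
    return (a * d) / (b * c)


# Hypothetical allele counts (not taken from the lecture data).
cases_minor, cases_major = 60, 140
controls_minor, controls_major = 40, 160

stat, p_value = chi_square_2x2(cases_minor, cases_major,
                               controls_minor, controls_major)
effect = odds_ratio(cases_minor, cases_major,
                    controls_minor, controls_major)
print(f"chi2 = {stat:.3f}, p = {p_value:.4f}, OR = {effect:.3f}")
```

The χ² statistic answers whether the allele is associated with case status at all; the odds ratio says how strong that association is. An odds ratio above 1 means the minor allele is more common among cases.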
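
Worked example: MAF and NCHROBS
The definitions of MAF and NCHROBS on the allele frequency slide can be checked by hand on a toy SNP. The sketch below assumes genotypes are given as allele pairs in the style of PED genotype columns, with "0" as the missing code. The function name allele_frequency and the toy genotypes are hypothetical, and the code assumes a biallelic SNP with both alleles observed at least once.

```python
from collections import Counter

MISSING = "0"  # missing-allele code assumed here, as in PED files by default


def allele_frequency(genotypes):
    """Return (minor, major, maf, nchrobs) for one SNP.

    `genotypes` is a list of (allele1, allele2) pairs, one per individual.
    Assumes a biallelic SNP with both alleles observed at least once.
    """
    counts = Counter(
        allele
        for pair in genotypes
        if MISSING not in pair  # skip individuals with a missing genotype
        for allele in pair
    )
    nchrobs = sum(counts.values())  # non-missing allele observations
    (major, _), (minor, minor_count) = counts.most_common(2)
    return minor, major, minor_count / nchrobs, nchrobs


# Five hypothetical individuals; the last one has a missing genotype.
toy_genotypes = [("T", "C"), ("C", "C"), ("C", "C"), ("T", "T"), ("0", "0")]
minor, major, maf, nchrobs = allele_frequency(toy_genotypes)
print(minor, major, maf, nchrobs)
```

Four of the five individuals are genotyped, so NCHROBS is 8 rather than the maximum of 10, and the minor allele T is observed 3 times out of 8.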
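
Worked example: Bonferroni and Benjamini-Hochberg adjustment
The BONF and FDR_BH columns in the --adjust output correspond to two standard multiple-testing corrections. The sketch below shows the textbook versions of both on a short list of hypothetical p-values. The function names are made up for this example, and the code is not PLINK's implementation.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg step-up adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical unadjusted p-values for four SNPs.
unadjusted = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni(unadjusted)
fdr_bh = benjamini_hochberg(unadjusted)
print(bonf)
print(fdr_bh)
```

For the smallest p-value, both corrections start from p times the number of tests, which is consistent with the first row of the output table, where BONF and FDR_BH are equal.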