PLINK tutorial
1/20/2011
Erin Smith with John Kelsoe

Goals:
1. Run a GWAS on cleaned data for multiple phenotypes in PLINK
2. Visualize results with Manhattan and Q-Q plots.
3. Look at LD structure of regions of interest with Haploview
4. Make a regional plot of the P-values using SNAP plot

Websites of interest:
PLINK: http://pngu.mgh.harvard.edu/~purcell/plink/
A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/gwastudies/
UCSC Genome browser: http://genome.ucsc.edu/
dbSNP: http://www.ncbi.nlm.nih.gov/projects/SNP/
dbGaP: http://www.ncbi.nlm.nih.gov/sites/entrez?Db=gap
Haploview: http://www.broadinstitute.org/mpg/haploview
SNAP plot: http://www.broadinstitute.org/mpg/snap/doc.php

Setting the PATH environment variable
After placing PLINK in a convenient place, put the location in your environment path to make it easier to call. This process is temporary and will only work for the current window. “PLINK_location” is the folder where PLINK is located.

Windows in a command prompt:
    echo %PATH% (see what is now in the path)
    path = C:\PLINK_location;%PATH%

Mac in a terminal window:
    echo $PATH
    export PATH=/PLINK_location:$PATH

Introduction to data formats in PLINK:
    PED & MAP
    BED, BIM, & FAM
    Additional phenotype files

Exercise: Look at example.ped, example.map, example.bim, example.fam, and phenotypes.txt and figure out what they are

Exercise: Reading in files: use --bfile to read in the example and bipolar BED/BIM/FAM filesets. How many individuals are in the datasets? How many SNPs? What is genotyping rate?
For Windows, use plink.exe, for Mac, use plink
    plink --bfile example
    plink --bfile bipolar

Manipulating the data files
Get only the genotypes for a single chromosome or a region around a SNP
    --chr 13

Exercise: Get data from chromosome 13 and write to a new BED file. If you are having trouble running the full dataset, you can use this fileset instead of bipolar.
    plink --bfile bipolar --chr 13 --make-bed --out bipolar_chr13

Performing association tests
    --assoc        allelic association (chi-squared test of allele frequencies)

Other examples of association-related commands
    --linear       linear regression for a quantitative phenotype
    --logistic     logistic regression for a qualitative phenotype
    --pheno        pick a new phenotype file
    --pheno-name   choose a column from the phenotype file

Run a GWAS on the irritable mania phenotype
Use the commands --pheno and --pheno-name to select an alternate phenotype. For later analyses, also add the --adjust and --qq-plot commands.
    plink --bfile bipolar --assoc --out bipolar_irritable --pheno phenotypes.txt --pheno-name irritable_elated --adjust --qq-plot

Interpret genome-wide output: Manhattan & Q-Q plots

Exercise: generate a Manhattan plot in Haploview
    Load Haploview
    Choose PLINK format and read in the .assoc file
    Note: these assoc files have integrated map information.
    Plot chromosomes on the x-axis and -log10(p) on the y-axis.

Exercise: generate a Quantile-Quantile (Q-Q) plot in R
Use the results from --adjust and --qq-plot, which generated the expected null distribution of p-values: bipolar_irritable.assoc.adjusted.
    Start R
    Change the working directory to where the plink output is located (setwd(dir) or Mac: Misc -> Change working directory or Windows: File -> Change dir…)
    Read in the data:
        data <- read.table(file = "bipolar_irritable.assoc.adjusted", header = TRUE)
    Look at the first 10 lines of the table:
        data[1:10,]
    Plot the expected -log P-values on the x-axis and the observed -log P-values on the y-axis:
        plot(-log(data$QQ, 10), -log(data$UNADJ, 10), xlab = "expected -logP values", ylab = "observed -logP values")
    Add a line corresponding to y = x:
        abline(a = 0, b = 1)

Points that rise well above the line indicate more significant associations than you would expect by chance.
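To make the Q-Q plot less of a black box, here is a minimal Python sketch of how the expected and observed -log10 P-values can be paired. It is not how PLINK itself fills the QQ column: the plotting position (i - 0.5) / n is a common convention assumed here, and the function name qq_points and the example P-values are hypothetical.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) -log10(P) pairs for a Q-Q plot.

    Assumption: the i-th smallest of n P-values is compared with the
    uniform quantile (i - 0.5) / n. This is a common plotting position
    and may differ slightly from the QQ column that PLINK writes.
    All P-values must be greater than 0.
    """
    observed = sorted(p_values)  # smallest (most significant) first
    n = len(observed)
    points = []
    for i, p in enumerate(observed, start=1):
        expected = (i - 0.5) / n  # expected quantile under the null
        points.append((-math.log10(expected), -math.log10(p)))
    return points


# Hypothetical P-values, for illustration only (not from the bipolar data).
example = [0.5, 0.001, 0.25, 0.75]
for exp_logp, obs_logp in qq_points(example):
    print(f"expected {exp_logp:.3f}  observed {obs_logp:.3f}")
```

If the P-values follow the null distribution, the pairs fall close to the line y = x; a pair whose observed value is much larger than its expected value corresponds to a point above the line in the R plot.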
Interpreting regional associations

Exercise: Look at LD relationships near potential hits
Get the region +/- 250kb around the peak SNP from PLINK and output it as a ped file, using the --snp and --window commands
    plink --bfile bipolar --snp rs17079247 --window 250 --recodeHV --out rs17079247_250kb
Load into Haploview using the linkage format

Exercise: Look at zoomed-in P-values for the region with LD values (SNAP plot)
    plink --bfile bipolar --chr 13 --from-kb 84500 --to-kb 84800 --pheno phenotypes.txt --pheno-name irritable_elated --assoc --out bipolar_irritable_rs17079247_zoom
Edit the output file in a text editor: change the header P to PValue
Go to the SNAP plot website, choose “Plots” from the upper right menu and plot a “Regional Association Plot”

Get more info on genes in the region
UCSC Genome browser: http://genome.ucsc.edu/
Enter the top SNP to find the region and zoom out to find the nearest genes
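The header edit for SNAP (changing P to PValue) can also be scripted instead of done by hand. The following is a minimal Python sketch under stated assumptions: the function name rename_p_header is hypothetical, the example rows are made up, and the only thing it changes is the first standalone column name P in the header line.

```python
import re


def rename_p_header(text, old="P", new="PValue"):
    """Rename one column in the header line of a whitespace-delimited table.

    Only the first line is modified. The column name must stand alone
    between whitespace, so names such as BP are left untouched.
    """
    header, sep, rest = text.partition("\n")
    pattern = rf"(?<!\S){re.escape(old)}(?!\S)"
    header = re.sub(pattern, new, header, count=1)
    return header + sep + rest


# Hypothetical two-row excerpt in the layout of a PLINK .assoc file.
example = (
    " CHR SNP BP A1 F_A F_U A2 CHISQ P OR\n"
    "  13 rsExample1 84600000 A 0.10 0.20 G 1.00 0.30 0.50\n"
)
print(rename_p_header(example), end="")

# Usage on the tutorial output (run in the folder with the PLINK results):
#   with open("bipolar_irritable_rs17079247_zoom.assoc") as f:
#       fixed = rename_p_header(f.read())
#   with open("bipolar_irritable_rs17079247_zoom_snap.assoc", "w") as f:
#       f.write(fixed)  # output file name is an arbitrary choice
```

Writing to a new file rather than overwriting keeps the original PLINK output available for Haploview or R.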