Practical: Two-locus summaries of association in HapMap data Simon Myers, Gil McVean The Concept One important question one might ask about variation data drawn from a population regards the correlation structure between different positions. Within a population we might ask whether patterns of genetic variation are correlated at nearby positions and whether the correlation depends on the distance apart of the two sites. Between two human populations, we might ask whether correlation structures are similar at individual pairs of sites, or whether there are systematic differences in the amount of correlation seen. This practical addresses these questions empirically. We will discuss model based approaches in the next lecture. Suppose at each SNP site there are only two types, labelled 0 and 1 say, observed in a sample. Suppose also that we have complete variation data for a total of S SNP sites, and n chromosomes, represented as an n x S matrix D containing only 0’s and 1’s (Figure 1). Figure 1. Variation data for a region of the human genome. Each column represents a SNP, each row a chromosome, and the two types in the population are shown as blue and yellow respectively. The top lines map each SNP to its position in the human genome. In this case, a popular and particularly simple measure of correlation between genetic variation is a statistic known as π 2 . This is a pairwise statistic, in that for each pair of SNPs, numbered π and π say, we calculate an π 2 value, π 2 (π, π) , based on the variation data for these two SNPs in the sample. In our simple case, π 2 reduces to the squared sample pairwise correlation between columns π and π of D: πππ£(π·.π ,π·.π )2 π 2 (π, π) = πππ(π·.π , π·.π )2 =πππ(π· . .π )πππ(π·.π ) Here π·.π is defined to be the n-vector corresponding to the πth column of D and similarly for π·.π . (For more complex data types the definition of π 2 is a little more complex but in the same spirit.) This statistic is then a measure of how “similar” the πth and πth SNPs are. Obviously, 0 ≤ π 2 (π, π) ≤ 1. Further, π 2 (π, π) = 1 if, and only if, the sets of chromosomes carrying allele 1 at SNP π and allele 1 at SNP π are identical, OR the sets of chromosomes carrying allele 1 at SNP π and allele 0 at SNP π are identical. In other words, knowing what type an individual carries at SNP π completely determines their type at SNP π. If π 2 (π, π) = 0, that implies independence between the two SNPs at these positions in our sample. We will construct the n x n matrix R of all pairwise π 2 values, where π ππ gives the π 2 value between SNPs π and π, for some real datasets, and plot this matrix, to explore the positional correlation structure in the data, and compare the correlation for datasets corresponding to European, and African, population samples. R is often referred to as the linkage disequilibrium (LD) matrix for the sample. The Data We will use real data collected by the HapMap for a (fairly typical) 500,000 base pair (500kb) region of Chromosome 7, in two populations; CEPHs from Utah (CEU), and Yoruba Nigerians from Ibadan in Nigeria (YRI). There are three data files: the file “positions7q31.txt” contains a single column, giving the locations on the chromosome of 707 SNPs typed in both populations. The file “ceu7q31.txt” contains a 120 x 707 matrix of 0’s and 1’s giving the type of 120 CEU chromosomes at these 707 SNPs. The file “yri7q31.txt” contains a 120 x 707 matrix of 0’s and 1’s giving the type of 120 YRI chromosomes at these 707 SNPs. Tasks 1. Load in “positions7q31.txt”, “ceu7q31.txt” and “yri7q31.txt” using the importdata function, or another function, and store these as a vector and matrices respectively named, for example, “positions”, “ceudata” and “yridata”. Verify that ceudata and yridata contain only 0’s and 1’s. 2. Calculate the pairwise correlation matrix for the European (CEU) data and store this matrix, using the function corrcoef, which can be directly applied to the data matrix ceudata. Repeat for the African (YRI) data. (Note that each value in the matrix returned by corrcoef must be squared to give the π 2 (LD) matrix.) 3. On the same plot, plot the π 2 value between the 100th SNP and all the other SNPs for the CEU population and the YRI population. Repeat for the 200th SNP and other randomly chosen SNPs. a. What happens to π 2 in general as distance increases? Can you explain this in terms of an underlying biological process? b. Is the decline in π 2 monotonic? Are SNPs near to the 100th SNP always strongly correlated to it? Can you explain this in terms of the ancestry of the sample? c. Which population has the generally higher π 2 values? Can you explain this? 4. Visualise the entire CEU LD matrix, by producing an image plot using the MATLAB function imagesc. This will produce a colour grid, with the colour plotted at position π, π determined by π 2 (π, π). The plot may look nicer if you first use the command colormap(hot). You should use the “positions” vector to plot actual SNP positions along the chromosome for the x and y axes. a. What does the overall pattern look like? Is the pattern consistent with a smooth decay of π 2 as the distance between SNPs increases, or a sudden decay at specific positions along the chromosome? b. Explain why the pattern of LD in humans is often described as “blocklike”, with specific block boundaries. Write down the approximate positions of the five strongest block boundaries along the chromosome. c. Are there other, weaker block boundaries along the chromosome? Can you interpret these block boundaries in terms of the underlying biological process of 3 a) above? 5. Visualise both the CEU and the YRI LD matrices on the same plot, by creating a new matrix containing CEU LD values above the diagonal, and YRI LD values below the diagonal (note the original LD matrices are symmetric about the diagonal, so this does not lose information). Plot this matrix in the same way as task 4. What do you notice? Are block boundaries shared, or different, in these two human populations? Is the strength of LD consistent in both populations? Again, what do the results suggest about underlying biological processes in these populations?