PBG/MCB 622 – Class Exercise 2 Friday Nov 21st

PBG/MCB 622 – Class Exercise 2 Friday Nov 21st 2014

Today’s lab will focus on genome-wide association scanning using TASSEL 5.0, a graphical program that lets you query your data and then use General Linear Models (GLM) or Mixed Linear Models (MLM) to detect significant associations of markers with phenotypes under different corrections for population substructure.

You have two data files to load into TASSEL, and these will enable you to do association analyses. The data set is based upon a panel of 102 barley genotypes that have been genotyped with 2221 mapped SNP markers. The marker data, together with some map information, are in ‘genotype_am.txt’. Note that the data have been prepared in hapmap format (.hap), which is more commonly used for sequence data and therefore requires a whole number for the position of each marker and requires all the data to be ordered by genome position. We have therefore used the genetic map to create a sequential numbering of 1-2221 from which the positions are derived. Some more unmapped markers are available but we have excluded these for simplicity. In addition, heading date has been scored in one environment for the whole association panel and is in ‘phenotype_am.txt’.

Task 1. Import your data into TASSEL
1. Find TASSEL and double click on the icon to load the program.
2. Choose the ‘Load’ option from the Data menu, which will bring up a dialog box ‘Choose file type to load’.
3. Select ‘Load hapmap’ and then press OK.
4. Browse to the directory containing your data files, select ‘genotype_am.txt’ and press Open.
5. This will create a genotype_am node under a Sequence node in the tree at the left-hand side (LHS).
6. Repeat Task 1.2.
7. Select ‘Make Best Guess’ and then select ‘phenotype_am.txt’ from the next window. Press Open and a phenotype_am node will be created under the Numerical node in the LHS tree.

Task 2. Calculation and display of Linkage Disequilibrium for Chromosome 1
1. Highlight the genotype node and choose the Sites option from the Filter menu.
2. Filter the data set to remove markers with more than 5% missing values by inputting 97 for ‘Minimum Count’ (a marker must then be scored in at least 97 of the 102 genotypes).
3. Filter out markers with a minor allele frequency of <5% by inputting 0.05 as the Minimum Frequency.
4. Restrict the analysis to Chromosome 1 by clicking the Select Chromosomes tab, unchecking the select all box and then checking the chromosome 1 box.
5. Check ‘Remove minor SNP states’ and press Filter.
6. This creates a filtered node under the genotype node and you will notice that a number of markers have been filtered out.
7. Select the filtered node and then choose Linkage Disequilibrium from the Analysis menu.
8. Leave all the settings at default values and press Run to create an LD node under the Results node of the tree.
9. Highlight the node created in Task 2.8 and then choose the LD plot option from the Results menu to give a graphical view of LD on chromosome 1.

QUESTION 1. Highlight the LD results and report which pairs of Positions have R2 values of 1.
Answer: 6&5, 40&39, 45&43, 49&47, 52&46

QUESTION 2. Report the chromosomal location of a block of significant LD and the Positions at the beginning and end of this block.
Answer: Towards the end of the long arm of the chromosome there is a block of highly significant LD beginning at Position 46 and ending at Position 59.

Task 3. Population sub-structure effects by Principal Components Analysis
1. Highlight the unfiltered genotype_am node under the Sequence node and choose the Sites option from the Filter menu.
2. Ensure the minimum frequency is 0.05 and the minimum count is set to 97, check ‘Remove minor SNP states’ and then press Filter.
3. Highlight the new genotype node and then convert the SNP base calls into numeric coding by choosing Transform from the Data menu.
4. Leave ‘Collapse Non Major Alleles’ checked and press Create Dataset.
5. This will form a genotype node under the Numerical node of the tree, but it has a number of missing values.
6. Use imputation to replace the missing values by highlighting the numerical genotype node and then choosing Transform from the Data menu.
7. Click the Impute tab of the resulting dialog box, leave the settings at default and then click Create Dataset.
8. This will create another genotype node under the Numerical node. Highlight it and then choose PCA from the dialog box that opens up when you choose Transform from the Data menu.
9. Choose the components option from the resulting options and replace 2221 with 3 for the number of components.
10. Click Create Dataset and 3 more nodes will appear. Highlight the PC for genotype… node and then choose Chart from the Results menu.
11. Change the option to XY Scatter, set X to PC 1, Y1 to PC 2 and Y2 to PC 3 and check ‘2 Y axes’.
12. Now highlight the eigenvalues node and again choose Chart from the Results menu.
13. Change the chart to XY Scatter again and set X to PC, Y1 to individual and Y2 to cumulative proportions to display the amount of variation accounted for by each PC.

QUESTION 3. Approximately how much variation is accounted for by PC1?
Answer: 26.7%

QUESTION 4. Excel, Stander, Stellar, Robust and Morex are 6-row genotypes and Charles, Harrington, Scarlett, Arapiles and Barke are 2-row genotypes. What do you think high and low loadings on PC1 represent?
Answer: The 6-row genotypes appear to have a high loading and the 2-row genotypes a low loading.

Task 4. Create a kinship matrix
1. Highlight the Collapsed and Imputed Numerical genotype node.
2. Choose Kinship from the Analysis menu, accept the defaults and press OK.
3. This creates a kinship matrix under the Matrix node of the tree.

Task 5. GWAS using a Naïve General Linear Model (GLM)
1. Naïve model: Phenotype = Marker effect + error
2. Combine your genotypic data and phenotypic data by holding down the Ctrl key and then clicking the genotype file under the Sequence node and the phenotype file under the Numerical node. Then choose Intersect Join from the Data menu to create a joined node under the Numerical node.
3. Highlight the joined node and then choose the GLM item from the Analysis menu.
4. Check the Permutation tests box and then accept defaults by pressing OK twice to start GLM. A box under the tree window shows progress and 2 nodes appear under an Association node of the Results node.
5. Click the Marker Tests node and then choose ‘Manhattan Plot’ from the options of the Results menu to produce a graphical display of your results.
6. Now choose QQ plots from the options of the Results menu to see how good your model is.

QUESTION 5. Chromosome numbers are displayed under ‘Locus’. On what chromosome and at what position is the most significant SNP located?
Answer: Chromosome 3, Position 633

QUESTION 6. What is the effect at that SNP and what is unusual about it compared to other high scoring SNPs?
Answer: 16.9 days. It is the only SNP with a large effect that has a MAF >10%.

Task 6. GWAS with GLM taking Structure into account (Q Model)
1. Add in the Q Model: Phenotype = Population Structure + Marker Effect + error
2. The PCA analysis generated 102 PCs and we will choose to fit just PC1 to PC10. Highlight the PC node under the Numerical node and then choose Traits from the Filter menu. Exclude all, check just the first 10 and then press OK to produce a filtered PC node.
3. Hold down the Ctrl key, highlight the filtered node and the genotypic and phenotypic nodes, and join them as in Task 5.2.
4. Repeat Tasks 5.3 to 5.6.

Task 7. GWAS with Mixed Linear Model adding in the kinship matrix to account for other effects
1. Q+K Model: Phenotype = Population Structure + Marker Effect + Individuals + error
2. Hold down the Ctrl key and highlight the joined node created in Task 6.3 and the kinship node created in Task 4.
3. Choose MLM from the Analysis menu, accept the default values and press Run and then OK.
4. Repeat Tasks 5.3 to 5.6.

QUESTION 7. Taking a threshold of 0.0001, identify the SNPs that are significantly associated with heading date and their chromosomal locations and positions. Which of these also pass the same threshold in the Q and naïve analyses? (Hint: you can click on the results nodes for each analysis and export the table as a txt file that you can then copy and paste into Excel for sorting and comparison.)
Answer:
Marker   Locus  Site
2_0794   3      633
1_0331   6      1857
1_0271   4      1190
3_0117   4      1197
1_0094   5      1491
All are detected by the Q model but the naïve model doesn’t detect 1_0271 and 3_0117.

QUESTION 8. Look at all three QQ plots. Which is the best and why?
Answer: The Q+K model in MLM is best because the observed probabilities show a much closer fit to the expected.
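The comparison suggested in the Question 7 hint can also be sketched in Python instead of Excel. This is a minimal sketch only: it assumes each exported results table is tab-delimited with columns named Marker, Locus and Site (as in the answer table) plus a p-value column called p, which is an assumption about the export and may need adjusting. The two small tables and the marker names mA, mB and mC are made-up illustrations, not results from this exercise.

```python
import csv
import io

THRESHOLD = 0.0001  # significance threshold used in Question 7


def significant_markers(table_text, p_column="p", threshold=THRESHOLD):
    """Return {marker: (locus, site)} for rows whose p-value is below threshold.

    Column names (Marker, Locus, Site, p) are assumptions about the
    exported table and may differ in a real export.
    """
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    hits = {}
    for row in reader:
        p_text = row[p_column].strip()
        if not p_text:
            continue  # skip rows with no test result
        if float(p_text) < threshold:
            hits[row["Marker"]] = (row["Locus"], int(row["Site"]))
    return hits


# Hypothetical exports; in practice, read the text from the exported txt files.
naive_txt = (
    "Marker\tLocus\tSite\tp\n"
    "mA\t1\t10\t0.00002\n"
    "mB\t2\t250\t0.03\n"
    "mC\t4\t900\t0.0004\n"
)
mlm_txt = (
    "Marker\tLocus\tSite\tp\n"
    "mA\t1\t10\t0.00005\n"
    "mB\t2\t250\t0.2\n"
    "mC\t4\t900\t0.00003\n"
)

naive_hits = significant_markers(naive_txt)
mlm_hits = significant_markers(mlm_txt)

# Markers significant under the Q+K model and whether the naive model also finds them
shared = sorted(set(mlm_hits) & set(naive_hits))
missed_by_naive = sorted(set(mlm_hits) - set(naive_hits))

print("Significant in both:", shared)
print("Significant in Q+K only:", missed_by_naive)
```

To use it with real exports, replace the two in-memory strings with the contents of the files saved from each results node, and repeat the call for the Q model table.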
