PBG/MCB 622 – Class Exercise 2 Friday Nov 21st 2014
Today’s lab will focus on Genome Wide Association Scanning using TASSEL 5.0, a graphical program that
lets you query your data and then use Generalised Linear Models (GLM) or Mixed Linear Models
(MLM) to detect significant associations of markers with phenotypes using different population
substructure corrections. You have two data files to load into TASSEL; these will enable you to run the
association analyses.
The data set is based upon a panel of 102 barley genotypes that have been genotyped with 2221
mapped SNP markers; the data, together with some map information, are in ‘genotype_am.txt’. Note
that the data have been prepared in hapmap format (.hap), which is more commonly used for sequence
data and therefore requires a whole number for the position of each marker and for all the data to be
ordered by genome position. We have therefore used the genetic map to create a sequential numbering
of 1-2221 to derive the positions. Some additional unmapped markers are available but we have excluded
these for simplicity. In addition, heading date has been scored in one environment for the whole
association panel and is in ‘phenotype_am.txt’.
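The sequential numbering described above can be sketched in a few lines of Python. This is only an illustration of the idea: the marker names, chromosomes and cM values below are made up, and the real file may have been prepared by other means.

```python
# Hypothetical (marker, chromosome, genetic position in cM) records.
markers = [
    ("m_c", 2, 10.5),
    ("m_a", 1, 0.0),
    ("m_d", 2, 48.2),
    ("m_b", 1, 33.7),
]

def sequential_positions(records):
    """Order markers by chromosome, then by cM, and number them 1..N."""
    ordered = sorted(records, key=lambda r: (r[1], r[2]))
    return [(name, chrom, i) for i, (name, chrom, _cm) in enumerate(ordered, start=1)]

positions = sequential_positions(markers)
for name, chrom, pos in positions:
    print(name, chrom, pos)
```

The resulting whole-number positions keep the markers in genome order, which is all the hapmap layout needs here; they are ranks, not physical distances.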
Task 1. Import your data into TASSEL.
1. Find TASSEL and double click on the icon to load the program.
2. Choose the ‘Load’ option from the Data menu which will bring up a dialog box ‘Choose file type
to load’.
3. Select ‘Load hapmap’ and then press OK.
4. Browse to the directory containing your data files, select ‘genotype_am.txt’ and press Open
5. This will create a genotype_am node under a Sequence node in the tree on the left-hand side (LHS).
6. Repeat Task 1.2
7. Select ‘Make Best Guess’ and then select ‘phenotype_am.txt’ from the next window. Press Open
and a phenotype_am node will be created under the Numerical node in the LHS tree.
Task 2. Calculation and display of Linkage Disequilibrium for Chromosome 1
1. Highlight the genotype node and choose the Sites option from the Filter menu
2. Filter the data set to remove markers with more than 5% missing values by inputting 97 (out of
the 102 genotypes) for ‘Minimum Count’
3. Filter out markers with a minor allele frequency of <5% by inputting 0.05 as the Minimum
Frequency.
4. Restrict the analysis to Chromosome 1 by clicking the Select Chromosomes tab, unchecking the
select all box and then checking the chromosome 1 box
5. Check ‘Remove minor SNP states’ and press Filter
6. This creates a filtered node under the genotype node and you will notice that a number of
markers have been filtered out
7. Select the filtered node and then choose Linkage Disequilibrium from the Analysis menu.
8. Leave all the settings at default values and press Run to create an LD node under the Results
node of the tree
9. Highlight the node created in Task 2.8 and then choose the LD plot option from the Results
menu to give a graphical view of LD on chromosome 1
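To make the two site filters in steps 2 and 3 concrete, here is a minimal Python sketch. It assumes single-letter calls for inbred lines with missing data coded as N, and it uses a toy panel of 10 lines rather than the real 102; TASSEL's own handling of heterozygous or ambiguous calls may differ.

```python
from collections import Counter

MISSING = "N"  # assumption: missing calls are coded as N

def site_passes(calls, min_count, min_freq):
    """True if enough lines are scored and the minor allele is frequent enough."""
    scored = [c for c in calls if c != MISSING]
    if len(scored) < min_count:
        return False
    counts = Counter(scored).most_common()
    # Frequency of the second most common allele; 0 for a monomorphic site.
    minor_freq = counts[1][1] / len(scored) if len(counts) > 1 else 0.0
    return minor_freq >= min_freq

# Toy sites scored on 10 hypothetical lines.
sites = {
    "snp1": list("AAAAACCCCC"),  # no missing calls, both alleles common
    "snp2": list("AAAAAAAAAA"),  # monomorphic
    "snp3": list("AACNNNAAAA"),  # only 7 of 10 lines scored
    "snp4": list("AAAAAAAAAC"),  # minor allele in 1 of 10 lines
}
kept = [name for name, calls in sites.items()
        if site_passes(calls, min_count=9, min_freq=0.05)]
print(kept)
```

In the exercise itself, a Minimum Count of 97 out of 102 lines allows at most 5 missing calls per marker, which is just under 5%.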
QUESTION 1. Highlight the LD results and report which pairs of Positions have R2 values of 1
6&5, 40&39, 45&43, 49&47, 52&46
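An R2 of 1 means that two markers sort the lines into exactly the same two groups. The sketch below shows the usual r² formula for two biallelic sites, assuming fully inbred lines coded 0/1 with None for missing calls; it is not taken from TASSEL and the toy vectors are invented.

```python
def r_squared(site_a, site_b):
    """r^2 between two biallelic sites coded 0/1, with None for missing calls."""
    pairs = [(a, b) for a, b in zip(site_a, site_b)
             if a is not None and b is not None]
    n = len(pairs)
    p_a = sum(a for a, _ in pairs) / n            # frequency of allele 1 at site A
    p_b = sum(b for _, b in pairs) / n            # frequency of allele 1 at site B
    p_ab = sum(1 for a, b in pairs if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                          # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")                       # undefined for a monomorphic site
    return d * d / denom

same = r_squared([0, 0, 1, 1, 0, 1], [0, 0, 1, 1, 0, 1])
flipped = r_squared([0, 0, 1, 1, 0, 1], [1, 1, 0, 0, 1, 0])
unlinked = r_squared([0, 0, 1, 1], [0, 1, 0, 1])
print(same, flipped, unlinked)
```

Note that the value is also 1 when the allele labels are swapped at one of the two sites, because D is squared.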
Question 2. Report the chromosomal location of a block of highly significant LD and the Positions at the
beginning and end of this block
Towards the end of the long arm of the chromosome there is a block of highly significant LD beginning at
Position 46 and ending at Position 59.
Task 3. Population sub-structure effects by Principal Components Analysis
1. Highlight the unfiltered genotype_am node under the Sequence node and choose the Sites
option from the Filter menu
2. Ensure the minimum frequency is 0.05, the minimum count is set to 97, check ‘Remove minor
SNP states’ and then press Filter.
3. Highlight the new genotype node and then convert the SNP base calls into numeric coding by
choosing Transform from the Data menu.
4. Leave ‘Collapse Non Major Alleles’ checked and press Create Dataset.
5. This will form a genotype node under the Numerical node of the tree but it has a number of
missing values
6. Use imputation to replace the missing values by highlighting the numerical genotype node and
then choosing Transform from the Data menu
7. Click the Impute tab of the resulting dialog box, leave the settings at default and then click
Create Dataset
8. This will create another genotype node under the Numerical node. Highlight it and then choose
PCA from the dialog box that opens up when you choose Transform from the Data menu.
9. Choose the components option from the resulting options and replace 2221 with 3 for the
number of components
10. Click Create Dataset and 3 more nodes will appear. Highlight the PC for genotype… node and
then choose Chart from the Results menu.
11. Change the option to XY Scatter, set the X to PC 1, Y1 to PC 2 and Y2 to PC3 and check ‘2 Y axes’
12. Now highlight the eigenvalues node and again choose Chart from the Results menu.
13. Change the chart to XY Scatter again and set X to PC, Y1 to individual and Y2 to cumulative
proportions to display the amount of variation accounted for by each PC
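Steps 3 to 7 above turn base calls into numbers and then fill in the gaps. A rough Python sketch of the idea follows. Both choices in it are assumptions made for illustration: coding the major allele as 0 and anything else as 1 (TASSEL's own coding may be the reverse), and filling missing values with the site mean (TASSEL's default imputation method may be different).

```python
from collections import Counter

def to_numeric(calls, missing="N"):
    """Code the major allele as 0.0, any other allele as 1.0, missing as None."""
    scored = [c for c in calls if c != missing]
    major = Counter(scored).most_common(1)[0][0]
    return [None if c == missing else (0.0 if c == major else 1.0) for c in calls]

def impute_mean(values):
    """Replace None with the mean of the observed values at this site."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

calls = list("AAACNA")          # one hypothetical site scored on six lines
numeric = to_numeric(calls)     # the N becomes None
imputed = impute_mean(numeric)  # the None becomes the site mean
print(numeric)
print(imputed)
```

A complete numeric matrix is needed because PCA cannot be computed on a table with gaps.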
QUESTION 3. Approximately how much variation is accounted for by PC1?
26.7%
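The individual and cumulative proportions plotted in step 13 are simple functions of the eigenvalues. The following sketch shows the arithmetic with made-up eigenvalues, not the values from this data set.

```python
from itertools import accumulate

def variance_proportions(eigenvalues):
    """Return (individual, cumulative) proportions of total variance."""
    total = sum(eigenvalues)
    individual = [ev / total for ev in eigenvalues]
    cumulative = list(accumulate(individual))
    return individual, cumulative

eigenvalues = [4.0, 3.0, 2.0, 1.0]  # made-up values for illustration
individual, cumulative = variance_proportions(eigenvalues)
print(individual)
print(cumulative)
```

The total should be taken over all eigenvalues, not only those of the components you chose to keep; otherwise the proportions are overstated.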
Question 4. Excel, Stander, Stellar, Robust and Morex are 6 row genotypes and Charles, Harrington,
Scarlett, Arapiles and Barke are 2 row genotypes. What do you think high and low loadings on PC1
represent?
The 6 row genotypes appear to have high loadings on PC1 and the 2 row genotypes low loadings.
Task 4. Create a kinship matrix
1. Highlight the Collapsed and Imputed Numerical genotype node.
2. Choose Kinship from the Analysis menu, accept the defaults and press OK
3. This creates a kinship matrix under the Matrix node of the tree.
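A kinship matrix holds one similarity value for every pair of lines. As a rough illustration, the sketch below uses plain allele matching (the share of sites at which two lines carry the same code). This is a deliberately simple stand-in: TASSEL offers its own kinship methods and its default is likely to differ from this. The line names and genotypes are invented.

```python
def matching_kinship(genotypes):
    """Pairwise share of matching sites; genotypes maps line name -> 0/1 codes."""
    names = list(genotypes)
    n_sites = len(genotypes[names[0]])
    matrix = {}
    for a in names:
        for b in names:
            matches = sum(1 for x, y in zip(genotypes[a], genotypes[b]) if x == y)
            matrix[(a, b)] = matches / n_sites
    return matrix

# Hypothetical lines scored at four sites (no missing values).
genotypes = {
    "line1": [0, 0, 1, 1],
    "line2": [0, 0, 1, 0],
    "line3": [1, 1, 0, 0],
}
kinship = matching_kinship(genotypes)
for (a, b), value in kinship.items():
    print(a, b, value)
```

Whatever the method, the matrix is symmetric and has one row and one column per line, which is why it is computed from the complete (imputed) numerical genotypes.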
Task 5. GWAS using a Naïve Generalised Linear Model (GLM)
1. Naïve model: Phenotype = Marker effect + error
2. Combine your genotypic data and phenotypic data by holding down the Ctrl key and then
clicking the genotype file under the sequence node and the phenotype file under the numerical
node. Then choose Intersect Join from the Data menu to create a joined node under the
Numerical node
3. Highlight the joined node and then choose the GLM item from the Analysis menu.
4. Check the Permutation tests box and then accept defaults by pressing OK twice to start GLM. A
box under the tree window shows progress and 2 nodes appear under an Association node of
the Results node.
5. Click the Marker Tests node and then Choose ‘Manhattan Plot’ from the options of the Results
menu to produce a graphical display of your results.
6. Now Choose QQ plots from the option of the Results menu to see how good your model is
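For a single biallelic marker, the naïve model compares the mean phenotype of the two allele classes. The sketch below estimates the marker effect as a difference in means and attaches a simple permutation p-value. It is a simplified illustration, not TASSEL's GLM or its permutation procedure, and the heading dates are invented.

```python
import random

def marker_effect(genotype, phenotype):
    """Mean phenotype of allele class 1 minus mean phenotype of allele class 0."""
    g0 = [y for g, y in zip(genotype, phenotype) if g == 0]
    g1 = [y for g, y in zip(genotype, phenotype) if g == 1]
    return sum(g1) / len(g1) - sum(g0) / len(g0)

def permutation_p(genotype, phenotype, n_perm=1000, seed=1):
    """Share of shuffled data sets with an effect at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(marker_effect(genotype, phenotype))
    shuffled = list(phenotype)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between marker and phenotype
        if abs(marker_effect(genotype, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Ten hypothetical lines: marker codes and heading dates in days.
genotype = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
heading = [60, 61, 59, 62, 60, 75, 77, 76, 78, 74]
effect = marker_effect(genotype, heading)
p_value = permutation_p(genotype, heading)
print(effect, p_value)
```

Because nothing in this model accounts for relatedness among the lines, markers that merely track population structure can look significant, which is what the QQ plot in step 6 helps to reveal.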
QUESTION 5. Chromosome numbers are displayed under ‘Locus’. On what chromosome and at what
position is the most significant SNP located?
Chrom 3 Pos 633
Question 6. What is the effect at that SNP and what is unusual about it compared to other high
scoring SNPs?
16.9 days. It is the only SNP with a large effect that has a MAF >10%.
Task 6. GWAS with GLM taking Structure into account (Q Model)
1. Add in the Q Model: Phenotype = Population Structure + Marker Effect + error
2. The PCA generated 102 PCs and we will choose to fit just PC1 to PC10. Highlight the PC
node of the Numerical node and then choose Traits from the Filter menu. Exclude all and then
check just the first 10 and then press OK to produce a filtered PC node
3. Hold down the Ctrl key and highlight the filtered node and the genotypic and phenotypic nodes
and join them as in Task 5.2.
4. Repeat Tasks 5.3 to 5.6
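The intersect join used in Tasks 5.2 and 6.3 keeps only the lines that appear in every selected table. A minimal sketch of that idea, with hypothetical line names and values, might look like this:

```python
def intersect_join(*tables):
    """Keep only lines present in every table and concatenate their values."""
    shared = set(tables[0])
    for table in tables[1:]:
        shared &= set(table)
    return {name: [v for table in tables for v in table[name]]
            for name in sorted(shared)}

# Hypothetical tables keyed by line name.
genotype = {"line1": [0, 1], "line2": [1, 1], "line3": [0, 0]}
phenotype = {"line1": [61.0], "line3": [75.0], "line4": [66.0]}
pcs = {"line1": [0.12], "line2": [-0.30], "line3": [0.45]}

joined = intersect_join(genotype, phenotype, pcs)
print(joined)
```

Here line2 is dropped because it has no phenotype and line4 because it has no genotype, so it is worth checking how many lines survive a join before running the analysis.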
Task 7. GWAS with Mixed Linear Model adding in the kinship matrix to account for other effects
1. Q+K Model: Phenotype = Population Structure + Marker Effect + Individuals + error
2. Hold down the Ctrl key and highlight the joined node created in Task 6.3 and the kinship node
created in Task 4.
3. Choose MLM from the Analysis menu, accept the default values and press Run and then OK
4. Repeat Tasks 5.3 to 5.6
QUESTION 7. Taking a threshold of 0.0001, identify the SNPs that are significantly associated with
heading date and their chromosomal locations and positions. Which appear in the Q and naïve
analyses above the same threshold? (Hint: you can click on the results nodes for each analysis and
export the table as a txt file that you can then copy and paste into Excel for sorting and comparison.)
Marker   Locus   Site
2_0794   3       633
1_0331   6       1857
1_0271   4       1190
3_0117   4       1197
1_0094   5       1491
All are detected by the Q model but the naïve model doesn’t detect 1_0271 and 3_0117
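If you would rather script the comparison than sort in a spreadsheet, the logic is a threshold followed by set operations. The sketch below uses invented marker names and p-values purely to show the steps; in practice you would fill the dictionaries from the tables you exported.

```python
THRESHOLD = 1e-4

def significant(results, threshold=THRESHOLD):
    """Return the set of markers whose p-value is below the threshold."""
    return {marker for marker, p in results.items() if p < threshold}

# Invented p-values for three hypothetical markers under each model.
naive = {"mA": 2e-7, "mB": 5e-3, "mC": 3e-5}
q_model = {"mA": 1e-6, "mB": 8e-5, "mC": 2e-5}
mlm = {"mA": 4e-6, "mB": 6e-5, "mC": 0.02}

mlm_hits = significant(mlm)
comparison = {}
for name, results in (("naive", naive), ("Q", q_model)):
    hits = significant(results)
    comparison[name] = {
        "shared": sorted(mlm_hits & hits),   # also significant in this model
        "missed": sorted(mlm_hits - hits),   # significant in MLM only
    }
print(comparison)
```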
Question 8. Look at all three QQ plots, which is the best and why?
The Q+K model in MLM is best because the observed probabilities show a much closer fit to the
expected.
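A QQ plot pairs each observed p-value with the value expected at the same rank if no marker were associated with the trait. The sketch below computes those pairs on the -log10 scale. The plotting position (rank - 0.5)/n is one common convention and an assumption here; TASSEL may use a slightly different one. The p-values are invented.

```python
import math

def qq_points(p_values):
    """Return (expected, observed) -log10(p) pairs, smallest p-value first."""
    observed = sorted(p_values)
    n = len(observed)
    points = []
    for rank, p in enumerate(observed, start=1):
        expected = (rank - 0.5) / n  # expected p-value at this rank under the null
        points.append((-math.log10(expected), -math.log10(p)))
    return points

points = qq_points([0.5, 0.05, 0.95, 0.25])  # invented p-values
for expected, observed in points:
    print(round(expected, 3), round(observed, 3))
```

A well-behaved model gives points close to the diagonal for most markers, with only a few truly associated markers rising above it at the top end.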