Genes and metabolites to phenotypes: Arabidopsis thaliana CH927 QTL practical: dry-lab section

CH927 QTL practical: dry-lab section
Warwick Systems Biology 13/03/13
Genes and metabolites to phenotypes:
QTL mapping of glucosinolate production in Arabidopsis thaliana
Today we will analyse QTL data from the Wentzell et al. (2007) study. We will use the
metabolite profiling data as our traits and compare it to the genetic map data from the Bay-0 x
Sha cross (used to generate the RIL lines that were genotyped and phenotyped).
The main aims are to learn how to assess ‘how good’ the data is and thus what it can be used
for, how to process the data to select the part you want (the ‘good’ part), and how to compare
the output of different QTL methods.
1 file is provided for the QTL analysis section (download from module page):
gsmg11.csv (this contains the marker data relative to a genetic map and trait data for a QTL
analysis of glucosinolates)
Start by downloading R if you don’t already have it, and then install the QTL package:
0.1. Before installing R/qtl, you must first install R, which is available at the
Comprehensive R Archive Network (CRAN): http://cran.r-project.org.
** Make sure you have the latest version (R-2.14.1) even if you already have installed R. **
0.2. Once R is installed, and provided that your computer is connected to the internet, it is
easiest to install R/qtl by first invoking R and then typing the following:
install.packages("qtl")
- choose a UK mirror for installation
This will download and install R/qtl (which is known in R as the "qtl" package or library).
0.3. Load R/qtl by typing:
library(qtl)
0.4. Here is a description of the methods that R/qtl uses:
A key component of computational methods for QTL mapping is the hidden Markov
model (HMM) technology for dealing with missing genotype data. We have implemented the
main HMM algorithms, with allowance for the presence of genotyping errors, for backcrosses,
intercrosses, and phase-known four-way crosses.
The current version of R/qtl includes facilities for estimating genetic maps, identifying
genotyping errors, and performing single-QTL genome scans and two-QTL, two-dimensional
genome scans, by interval mapping (with the EM algorithm), Haley-Knott regression, and
multiple imputation. All of this may be done in the presence of covariates (such as age or
treatment). One may also fit higher-order QTL models by multiple imputation and Haley-Knott
regression.
Mapping QTLs – Points to consider
1. What is your hypothesis?
To detect QTL you need to consider:
a. population showing genetic variability for target phenotype
b. marker system that allows genotyping of population
c. reproducible quantitative phenotyping methodologies
d. appropriate experimental and statistical methods for detecting and locating QTL.
Questions:
(i) Have you got a suitable population to measure the trait? Is the population “fixed”, i.e.
not segregating (for example homozygous lines created through selfing, or doubled
haploid lines created using microspore culture)? A fixed mapping population will allow
sufficient replication to get better estimates of the trait variance components (since
polygenic traits are influenced by gene x environment interactions).
(ii) Do you have sufficient numbers within the population to accurately measure the trait
variation (segregation distortion)?
(iii) Genotype distribution vs. replication? i.e. consider covering the maximum number of
allelic combinations vs. increasing replication to have a better estimate of line means
and their variance.
(iv) Has a population been used to map the trait previously? If so, is it syntenic
(genetic/physical map) with the population you wish to use?
(v) If you want to explore the genetic variation for a particular trait – is the trait
measurable within the population with the accuracy required – e.g. broccoli (Brassica
oleracea var. italica) head diameter – could not use the Diversity Foundation Set (DFS)
since the phenotype is not present.
(vi) Is there genotype data available for the population?
What marker types were used? (May need to rescore individuals).
2. A mapping exercise requires 3 elements:
If you have data from other work or you want to map QTLs it is best to summarise the data by
constructing:
1. A map file (*.map) for the experimental population
2. A locus file (*.loc) containing marker genotype scores (that are in the map) for
individuals within the population
3. A quantitative trait file (*.qua) containing the trait data
The three data files (*.loc, *.map, *.qua) will enable you to compile the necessary information
required by different mapping software:
MapQTL, QTLCartographer, R/qtl, R/eqtl, QGene, QTLcafe, GridQTL to name a few.
3. Data checking
If you did not make the genetic map, it is worth spending time looking at the map to see if it is
the best map that explains the linkage between the markers used. This also allows you to
become familiar with the map data.
(i) If you rebuild the map – can you get the original genotype scores for all marker x all
individuals? This may contain markers that are not on the final map; and individuals
that trait data is not available for.
(ii) Check that there are not individuals replicated within the population. Identical
individuals do not contribute to the calculation of the recombination fraction, but do add
to the computation.
(iii) Check that the markers are not repeated.
(iv) Look at the segregation of the markers within the data, e.g. Doubled Haploid (DH)
lines derived from an F1 should have a 1:1 segregation of parent 1 : parent 2 alleles at
each locus. But it should be noted that the process of making DH lines may itself create
segregation distortion.
4. Let’s look at the Wentzell et al data. Follow this tutorial, using R for data analysis:
4.1. Start by getting the data into R
Read in the file. This comes from the authors’ analysis in QTL Cartographer, so the genotype
codes have to be reassigned.
Genotype conversion from QTL Cartographer to R/qtl format:
0 -> A, 2 -> B and -1 -> - (missing).
gsmg11 <- read.cross("csv", file="gsmg11.csv", genotypes=c("A","B"))
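The recoding rule above can be illustrated with a minimal base-R sketch. The toy matrix and the names cart_codes, recode and rqtl_codes are hypothetical and are not taken from gsmg11.csv; the sketch only shows how the three codes map onto R/qtl style symbols.

```r
# Hypothetical toy matrix of QTL Cartographer genotype codes
# (rows = lines, columns = markers); not taken from gsmg11.csv.
cart_codes <- matrix(c(0, 2, -1,
                       2, 2,  0),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("RIL1", "RIL2"), c("m1", "m2", "m3")))

# Lookup table for the rule 0 -> "A", 2 -> "B", -1 -> "-" (missing)
recode <- c("0" = "A", "2" = "B", "-1" = "-")

# Recode every cell while keeping the matrix shape and dimnames
rqtl_codes <- cart_codes
rqtl_codes[] <- recode[as.character(cart_codes)]
rqtl_codes
```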
Define the population as RI (recombinant inbred lines):
class(gsmg11)[1] <- "riself"
summary(gsmg11)
To view data summary:
plot(gsmg11)
The generic plot function passes (gsmg11) to the plot.cross function, which makes the plot.
Likewise these functions plot different aspects of the data:
plot.missing(gsmg11)
plot.map(gsmg11)
plot.pheno(gsmg11, 1)
plot.pheno(gsmg11, 2)
4.2. Analysing the phenotype data (metabolite levels)
Check the distribution of the phenotype data. The distribution can influence the method of
analysis, shows whether the data needs transforming, and highlights missing data.
par(mfrow=c(3,2))
for(i in 1:5) plot.pheno(gsmg11, pheno.col=i)
Explore scatterplots of phenotype vs. phenotype to locate data that may be erroneous:
pairs(jitter(as.matrix(gsmg11$pheno)), cex=0.6, las=1)
Here gsmg11$pheno pulls out the phenotype data, and as.matrix converts it to a numerical
matrix so that the jitter function can be used (jitter adds a bit of noise so that individual
points may be distinguished).
For missing values:
gsmg11$pheno[gsmg11$pheno == 0] <- NA
This locates phenotype values recorded as 0 (missing data) and replaces them with NA.
To look for trends across individuals, plot the mean phenotype of each individual in file
order and compare it with the same means plotted in a random order:
par(mfrow=c(1,2), las=1, cex=0.8)
means <- apply(gsmg11$pheno, 1, mean, na.rm=TRUE)
plot(means)
plot(sample(means), xlab="Random index", ylab="means")
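As a small illustration of these two steps, the following base-R sketch recodes zeros as NA and then takes per-line means that ignore the missing values. The data frame pheno and its values are hypothetical.

```r
# Hypothetical toy phenotype table (3 lines x 2 traits); 0 marks a missing value.
pheno <- data.frame(trait1 = c(1.2, 0, 3.4),
                    trait2 = c(0.8, 2.0, 0))

# Recode zeros as NA so that they are treated as missing
pheno[pheno == 0] <- NA

# Mean phenotype per line, ignoring missing values
line_means <- rowMeans(pheno, na.rm = TRUE)
line_means
```

Without na.rm = TRUE, any line with a missing trait would get a mean of NA.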
4.3. Examining the marker data
Check for segregation distortion at all markers, i.e. check that the genotypes appear in the
expected proportions. Distortion may indicate possible genotyping errors.
geno.table inspects the genotype frequencies at each marker. The P.value column gives the
p-value for a chi-square test of Mendelian proportions (1:2:1 in an intercross, for example). In
this case the expected proportions are 1:1.
gt <- geno.table(gsmg11)
p <- gt$P.value
gt[!is.na(p) & p < 0.01,]
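The test behind the P.value column can be sketched for a single marker in base R. The counts below are hypothetical and only illustrate a chi-square goodness-of-fit test against 1:1 proportions.

```r
# Hypothetical genotype counts at one marker in a population of 211 lines
counts <- c(A = 120, B = 91)

# Chi-square goodness-of-fit test against the expected 1:1 segregation
test <- chisq.test(counts, p = c(0.5, 0.5))
test$statistic   # chi-square statistic, 1 degree of freedom
test$p.value     # a small p-value suggests segregation distortion
```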
The recombination fraction can be calculated for the marker order given, based on the
genotypes provided. NB: the genotypes in map order correspond to the loc file.
Compare the genotypes of all individuals to detect genotyping errors, naming errors etc. Also,
if two individuals carry identical marker genotypes you may wish to exclude one from the
analysis (it doesn’t add any more data).
cg <- comparegeno(gsmg11)
hist(cg, breaks=200, xlab="proportion of identical genotypes")
rug(cg)
Then to pull out which pairs of individuals are involved, using 0.9 as an example threshold:
which(cg > 0.9, arr.ind=TRUE)
It would be worth checking these to see if errors have occurred. If a dense set of markers was
used for selection during production of the lines, two lines could differ by only one marker,
so this might not be a problem.
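The idea of comparing individuals pairwise can be sketched in base R as follows. The genotype matrix is hypothetical, and this is a simplified illustration of the proportion of matching genotypes, not the comparegeno implementation itself.

```r
# Hypothetical genotypes for 4 individuals at 5 markers (NA = missing)
geno <- rbind(ind1 = c("A", "B", "A", "B", "A"),
              ind2 = c("A", "B", "A", "B", "A"),   # identical to ind1
              ind3 = c("B", "B", "A", "A", "A"),
              ind4 = c("A", NA,  "B", "B", "A"))

n <- nrow(geno)
prop_match <- matrix(NA_real_, n, n,
                     dimnames = list(rownames(geno), rownames(geno)))

# Proportion of matching genotypes over markers typed in both individuals
for (i in seq_len(n)) for (j in seq_len(n)) {
  if (i == j) next
  both <- !is.na(geno[i, ]) & !is.na(geno[j, ])
  prop_match[i, j] <- mean(geno[i, both] == geno[j, both])
}

# Pairs sharing more than 90% of their genotypes
which(prop_match > 0.9, arr.ind = TRUE)
```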
4.4. Checking the marker order and recombination fraction
Have markers been assigned to the correct chromosome? If they have, are they in the correct
order? LOD intervals across chromosomes will appear distorted if markers are placed
incorrectly. See Figure 1 below.
Figure 1 LOD score plot highlighting a misplaced marker (red) on a linkage group.
For example, marker 7 in Figure 1 is misplaced. You can go back to marker 7 and check
whether the chromosome it has been placed on is its best position (using the grouping LOD
and RF). These can be explored using estimates of the recombination fraction.
gsmg11 <- est.rf(gsmg11)
This estimates the recombination fraction between each pair of markers.
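For a single pair of markers, the quantities involved can be sketched by hand in base R. The genotype vectors are hypothetical. The sketch counts the proportion of lines with differing genotypes (R), computes a LOD score for the test of R = 0.5 (no linkage), and converts R to a per-meiosis recombination fraction using the Haldane-Waddington relation for RILs produced by selfing, R = 2r/(1+2r). How est.rf handles each cross type should be checked in the R/qtl documentation.

```r
# Hypothetical genotypes of 10 RILs at two markers (no missing data)
m1 <- c("A", "A", "A", "B", "B", "B", "A", "B", "A", "B")
m2 <- c("A", "A", "B", "B", "B", "B", "A", "B", "A", "A")

n <- length(m1)
k <- sum(m1 != m2)        # number of recombinant lines
R <- k / n                # proportion of recombinant lines

# LOD score for linkage: likelihood at the estimate vs. at R = 0.5
# (this simple form assumes 0 < R < 1)
lod <- k * log10(R) + (n - k) * log10(1 - R) - n * log10(0.5)

# Per-meiosis recombination fraction for selfed RILs (Haldane-Waddington)
r_meiosis <- R / (2 * (1 - R))

c(R = R, LOD = lod, r = r_meiosis)
```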
plot.rf(gsmg11)
This plots the pairwise recombination fractions and LOD scores for all chromosomes (Figure 2).
If you have the genotype file that was used to build the map you could rebuild the map yourself
(using JoinMap, MapMaker or Carthagene, for example).
Figure 2 Pairwise recombination fractions and LOD scores: estimated recombination fractions
(upper left) and LOD scores (lower right) for all pairs of markers in the gsmg11 data. Both axes
show markers, grouped by chromosome (1 to 5). Red = low RF or high LOD; blue = pairs that
are not linked (high RF, low LOD).
All the chromosome assignments are correct.
If errors were present you could use:
checkAlleles(gsmg11)
To check alternate chromosomes, first use:
plot.rf(gsmg11, alternate.chrid=TRUE)
This makes the chromosome IDs easier to identify. Then to pull out specific comparisons use:
plot.rf(gsmg11, chr=c(1,2,5))
4.5. Examining the data for genotyping errors
Count the number of crossovers in each individual:
nxo <- countXO(gsmg11)
The observed number of crossovers per individual in gsmg11 can then be plotted:
plot(nxo, ylab="No. crossovers")
On viewing the chart it can be seen that some individuals have a higher number of crossovers
than the others. The mean number of crossovers is:
mean(nxo[1:211])
= 7.815166
Now we can look for lines with more crossovers than expected (the cut-off of 15 is user
defined):
nxo[nxo>15]
Individuals 60 and 173 have 16 crossovers, so it may be worth checking these lines.
Check line 173:
countXO(gsmg11, bychr=TRUE)[173,]
1 2 3 4 5 (chromosome number)
2 7 2 1 4 (number of crossovers)
Seven of the 16 crossovers are on chromosome 2, so it would be worth looking at chr 2.
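What a crossover count means for one line on one chromosome can be sketched in base R: a crossover is inferred wherever the genotype switches between consecutive typed markers. The genotype vector is hypothetical, and countXO may handle missing data and cross types in ways this simplified sketch does not.

```r
# Hypothetical genotypes of one RIL along one chromosome, in map order
chr_geno <- c("A", "A", "B", "B", "B", "A", NA, "A", "B")

# Ignore untyped markers, then count genotype switches between neighbours
typed <- chr_geno[!is.na(chr_geno)]
n_xo <- sum(typed[-1] != typed[-length(typed)])
n_xo
```

A single mis-scored marker inside a long run of one genotype adds two apparent crossovers, which is why unusually high counts can point to genotyping errors.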
4.6. Is there any missing data?
It might be useful to check for missing genotype data since these are areas where we might
want to add more markers. Also, standard interval mapping methods may give spurious results
across regions of missing genotype information. The number of missing genotypes at each
marker can be seen using:
hist(nmissing(gsmg11, what="mar"), breaks=50)
The proportion of missing genotype information along the genome can also be plotted,
measured by entropy and by variance (blue and red respectively):
plot.info(gsmg11, col=c("blue", "red"))
If you wanted this information as a table:
z <- plot.info(gsmg11, step=0)
For chr 1 and 5:
z[ z[,1]==1,]
z[ z[,1]==5,]
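The per-marker missing-data count can be illustrated with a minimal base-R sketch. The genotype matrix is hypothetical; the sketch covers only simple counts, not the entropy or variance measures.

```r
# Hypothetical genotype matrix: 3 lines (rows) x 4 markers (columns)
geno <- rbind(c("A", "B", NA,  "A"),
              c("A", NA,  NA,  "B"),
              c("B", "B", "A", "B"))
colnames(geno) <- paste0("m", 1:4)

# Number of missing genotypes and proportion of lines typed, per marker
n_missing  <- colSums(is.na(geno))
prop_typed <- 1 - n_missing / nrow(geno)

rbind(n_missing, prop_typed)
```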
5. Now we are happy with the data we can move on to QTL analysis itself.
5.1. Comparing single marker and interval mapping methods for QTL analysis
The simplest method for QTL analysis is to take each marker in turn – split the individuals
based on their genotype score at that marker, then compare the phenotype means for the
groups.
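The single-marker approach can be sketched in base R with a hypothetical marker and phenotype. The LOD score here is computed from the residual sums of squares of the no-QTL model (one overall mean) and the single-marker model (one mean per genotype group), LOD = (n/2) log10(RSS0/RSS1).

```r
# Hypothetical genotypes at one marker and phenotypes for 8 lines
geno  <- factor(c("A", "A", "A", "A", "B", "B", "B", "B"))
pheno <- c(1, 2, 3, 4, 3, 4, 5, 6)

# Phenotype means for the two genotype groups
group_means <- tapply(pheno, geno, mean)

# Residual sums of squares: one overall mean vs. one mean per genotype
rss0 <- sum((pheno - mean(pheno))^2)
rss1 <- sum((pheno - ave(pheno, geno))^2)

# LOD score for the marker
lod <- (length(pheno) / 2) * log10(rss0 / rss1)

group_means
lod
# Equivalent significance test of the difference in means
t.test(pheno ~ geno, var.equal = TRUE)$p.value
```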
Interval mapping improves on marker regression by taking account of missing genotype
data at a putative QTL. Standard interval mapping uses maximum likelihood estimation under
a mixture model, while the Haley-Knott regression method uses an approximation to the
mixture model.
We first need to calculate the conditional genotype probabilities (calc.genoprob); step=1
defines the density (in cM) of the grid on which interval mapping will be performed.
error.prob sets the assumed genotyping error rate used in these calculations.
gsmg11 <- calc.genoprob(gsmg11, step=1, error.prob=0.001)
The scanone function is used for interval mapping; the method argument specifies the method
to be used, in this case the “EM” algorithm. (In standard interval mapping the EM algorithm is
performed at each position on a grid of putative QTL locations along the genome, while the
estimates and likelihood under the null hypothesis are calculated just once. The likelihood is
non-decreasing across iterations.)
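The EM step at a single putative QTL position can be sketched in base R. This is a simplified illustration under stated assumptions, not the scanone implementation: two genotype classes with a common variance, hypothetical phenotypes y, and hypothetical probabilities p that each line carries genotype B at the position (in practice these would come from calc.genoprob).

```r
# Hypothetical phenotypes and conditional probabilities of genotype B
y <- c(1.0, 1.2, 0.8, 1.1, 2.9, 3.1, 3.0, 2.8, 2.0, 2.1)
p <- c(0.05, 0.05, 0.10, 0.05, 0.95, 0.90, 0.95, 0.95, 0.50, 0.50)
n <- length(y)

# Log-likelihood of the two-component normal mixture
mix_loglik <- function(muA, muB, s) {
  sum(log((1 - p) * dnorm(y, muA, s) + p * dnorm(y, muB, s)))
}

# Start at the null-hypothesis estimates (one common mean)
muA <- muB <- mean(y)
s <- sqrt(mean((y - mean(y))^2))
loglik0 <- mix_loglik(muA, muB, s)
loglik <- loglik0

for (iter in 1:100) {
  # E-step: posterior probability that each line has genotype B
  wB <- p * dnorm(y, muB, s)
  wA <- (1 - p) * dnorm(y, muA, s)
  w <- wB / (wA + wB)
  # M-step: weighted means and pooled standard deviation
  muA <- sum((1 - w) * y) / sum(1 - w)
  muB <- sum(w * y) / sum(w)
  s <- sqrt(sum((1 - w) * (y - muA)^2 + w * (y - muB)^2) / n)
  loglik <- c(loglik, mix_loglik(muA, muB, s))
  if (diff(tail(loglik, 2)) < 1e-8) break   # converged
}

# LOD score: log10 likelihood ratio of QTL model vs. null model
lod <- (tail(loglik, 1) - loglik0) / log(10)
c(muA = muA, muB = muB, sd = s, LOD = lod)
```

Because the iterations start from the null-hypothesis estimates and the likelihood never decreases, the resulting LOD score cannot be negative.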
out.em <-scanone(gsmg11, method="em")
plot(out.em, ylab="LOD score")
For just chr2 and 5:
plot(out.em, chr=c (2, 5), col="blue", ylab="LOD score")
Haley-Knott regression is a fast approximation of the standard interval mapping results.
out.hk <- scanone(gsmg11, method="hk")
Looking at chr 2 and 5:
plot(out.em, out.hk, chr=c(2, 5), col=c("blue", "red"), ylab="LOD score")
To compare the difference between the two methods:
plot(out.hk - out.em, chr=c(2, 5), ylim=c(-0.5, 1.0), ylab=expression(LOD[HK] - LOD[EM]))
To add a horizontal line at zero:
abline(h=0, lty=3)
In this case there is no difference between the two methods.
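The idea behind Haley-Knott regression can be sketched in base R: the phenotype is regressed on the conditional genotype probability at each position, and the LOD score is computed from the residual sums of squares. The function hk_lod and all data below are hypothetical, and the sketch assumes two genotype classes.

```r
# LOD score from regressing phenotype y on genotype probability p
hk_lod <- function(y, p) {
  rss0 <- sum((y - mean(y))^2)         # no-QTL model
  rss1 <- sum(resid(lm(y ~ p))^2)      # regression on P(genotype B)
  (length(y) / 2) * log10(rss0 / rss1)
}

# Fully informative position: probabilities are all 0 or 1,
# so the result equals ordinary marker regression
y_full <- c(1, 2, 3, 4, 3, 4, 5, 6)
p_full <- c(0, 0, 0, 0, 1, 1, 1, 1)
lod_full <- hk_lod(y_full, p_full)

# Position between markers: some probabilities are intermediate
p_part <- c(0.1, 0, 0, 0.2, 0.9, 1, 0.8, 1)
lod_part <- hk_lod(y_full, p_part)

c(full = lod_full, partial = lod_part)
```

Where the genotype information is nearly complete, as in the first case, the approximation and standard interval mapping would be expected to agree closely.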
Extended HK
An improved version of HK may be obtained by also considering the variances.
out.ehk <- scanone(gsmg11, method="ehk")
The methods can now be compared.
For chr 2 and 5:
plot(out.em, out.hk, out.ehk, chr=c(2, 5), ylab="LOD score", lty=c(1,1,2))
For chr 1 to 5:
plot(out.em, out.hk, out.ehk, chr=c(1:5), ylab="LOD score", lty=c(1,1,2))
IM = black, HK = blue, EHK = red dashed
The differences between the HK and EHK LOD scores and the LOD scores of standard IM can
also be plotted:
plot(out.hk - out.em, out.ehk - out.em, chr=c(1:5), col=c("blue", "red"), ylim=c(-0.5, 1),
ylab=expression(LOD[HK]-LOD[EM]))
HK = blue, EHK = red
abline(h=0, lty=3)
5.2. Determining the significance threshold: Permutation test
scanone can also be used to do the permutation test. Adding verbose=FALSE should suppress
the progress output, but it often gives an error.
operm <- scanone(gsmg11, n.perm=1000)
plot(operm)
For genome-wide significance thresholds at the 20% and 5% levels:
summary(operm, alpha=c(0.20, 0.05))
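The logic of the permutation test can be sketched in base R with simulated data. Everything below is hypothetical: the markers are simulated as unlinked (a simplification), the QTL effect is invented, and the LOD score uses single-marker regression, for which LOD = -(n/2) log10(1 - r^2) with r the phenotype-marker correlation. Shuffling the phenotypes breaks any genotype-phenotype association while keeping the marker data intact.

```r
set.seed(1)

# Hypothetical data: 100 lines, 20 unlinked markers coded 0/1
n_ind <- 100
n_mar <- 20
geno  <- matrix(rbinom(n_ind * n_mar, 1, 0.5), n_ind, n_mar)
pheno <- rnorm(n_ind) + 0.8 * geno[, 5]   # invented QTL at marker 5

# Maximum single-marker LOD score across the genome
max_lod <- function(y, G) {
  r <- cor(y, G)
  max(-(length(y) / 2) * log10(1 - r^2))
}

observed <- max_lod(pheno, geno)

# Null distribution: genome-wide maximum LOD after shuffling phenotypes
perm_max <- replicate(200, max_lod(sample(pheno), geno))

# Genome-wide thresholds at the 20% and 5% levels
thr <- quantile(perm_max, c(0.80, 0.95))
observed
thr
```

A QTL peak is declared significant at the 5% level if its LOD score exceeds the 95th percentile of the permutation maxima.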
5.3. Are any of the results from the mapping carried out in 5.1. significant at a 5%
threshold?
If you estimate significant QTL locations and want to explore these further, you can calculate
the LOD support interval and the Bayes credible interval.
These functions take the results from scanone and the chromosome to consider, plus the drop
in LOD units (default 1.5) or the nominal Bayes coverage (default 95%):
lodint(out.em, 5, 1.5)
bayesint(out.em, 5, 0.95)
In each output, the first and last rows give the ends of the interval (the 1.5-LOD support
interval or the 95% Bayes credible interval), and the middle row gives the estimated QTL
location (the position with the maximum LOD score).
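The definition of a 1.5-LOD support interval can be sketched in base R on a hypothetical LOD curve. This is a simplified illustration; lodint may treat the ends of the interval differently.

```r
# Hypothetical LOD curve along one chromosome (positions in cM)
pos <- seq(0, 100, by = 10)
lod <- c(0.5, 1.0, 2.0, 3.5, 5.0, 6.2, 5.1, 4.6, 3.0, 1.2, 0.4)

# Position of the maximum LOD score
peak <- pos[which.max(lod)]

# Positions whose LOD score is within 1.5 units of the maximum
drop <- 1.5
interval <- range(pos[lod >= max(lod) - drop])

peak
interval
```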
These are just coordinates; we need marker positions or the nearest flanking markers:
lodint(out.em, 5, 1.5, expandtomarkers=TRUE)
bayesint(out.em, 5, 0.95, expandtomarkers=TRUE)
We now have a significant QTL and the markers that delimit a confidence interval for it.
To explore the effects of the markers at the QTL, or of any marker we wish to examine, we use
the function effectplot. This provides estimates of the genotype-specific phenotype averages,
taking account of missing genotype data.
For our data we will use the find.marker function to locate the nearest marker to our QTL:
find.marker(gsmg11,5, 61.0)
gsmg11<-sim.geno(gsmg11, n.draws=16,error.prob=0.001)
effectplot(gsmg11,mname1="At5g44320-4")
To look at the data:
Effect61.0<- effectplot(gsmg11,mname1="At5g44320-4", draw=FALSE)
Effect61.0
It is also useful to examine the distribution of the phenotype – genotype relationship:
plot.pxg(gsmg11, marker="At5g44320-4")
If we have more than one QTL, or markers linked to different QTL, we can examine how
different combinations of QTL genotypes (marker genotypes) affect the phenotypic data. For
chromosomes 2 and 5:
effect2<-sim.geno(subset(gsmg11, chr=c(2,5)),step=2.5, error.prob=0.001, n.draws=256)
par(mfrow=c(1,1))
effectplot(effect2, mname1="c2.loc15", mname2="At5g44320-4", ylim=c(-1, 1))
It is beneficial to explore how the phenotypic data changes with different QTL genotypes. The
effect plots allow us to appreciate the different outcomes of marker-assisted selection at the
different QTL. The effect plots also allow us to gain an understanding of possible epistatic
interactions between loci at different QTL. If the lines in the plot appear roughly parallel, one
can assume that the effects of the two loci on the phenotype are additive. However, if the lines
cross or clearly diverge, this indicates possible epistatic interactions, and we then have to
consider the effect of having the alternate genotypes at the different loci during selection.
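The reading of a two-locus effect plot can be made concrete with a base-R sketch. The genotypes and phenotypes are hypothetical. The sketch computes the four genotype-combination means and an interaction contrast, which is zero when the lines in the plot are parallel (purely additive effects).

```r
# Hypothetical genotypes at two QTL-linked markers and phenotypes for 8 lines
g1 <- c("A", "A", "B", "B", "A", "A", "B", "B")
g2 <- c("A", "B", "A", "B", "A", "B", "A", "B")
y  <- c(1, 2, 2, 5, 1, 2, 2, 5)

# Mean phenotype for each two-locus genotype (rows = g1, columns = g2)
cell_means <- tapply(y, list(g1 = g1, g2 = g2), mean)
cell_means

# Interaction contrast: effect of g2 in a B background at g1
# minus effect of g2 in an A background at g1
interaction <- (cell_means["B", "B"] - cell_means["B", "A"]) -
               (cell_means["A", "B"] - cell_means["A", "A"])
interaction
```

Here the effect of the second locus is larger in the B background at the first locus, so the lines would not be parallel, which suggests epistasis in this toy example.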
These models can be extended to include multiple QTL x QTL interactions; at present we are
considering only single-QTL models.
Recommended reading:
Specific to this practical:
Wentzell et al. (2007) Linking Metabolic QTLs with Network and cis-eQTLs Controlling
Biosynthetic Pathways. PLoS Genetics 3(9): e162.
General reading on QTL analysis:
Tanksley, S.D. (1993) Mapping polygenes. Annual Review of Genetics 27: 205-233.
Broman, K.W. and Sen, S. (2009) A Guide to QTL Mapping with R/qtl. Springer.