CH927 QTL practical: dry-lab section Warwick Systems Biology 13/03/13 Genes and metabolites to phenotypes: QTL mapping of glucosinolate production in Arabidopsis thaliana Today we will analyse QTL data from the Wentzell et al. (2007) study. We will use the metabolite profiling data as our traits and compare it to the genetic map data from the Bay-0 x Sha cross (used to generate the RIL lines that were genotyped and phenotyped). The main considerations will be on learning how to analyse ‘how good’ the data is and thus what it can be used for, how to process data to select the part you want (the ‘good’ part), and how to compare the output of different QTL methods. 1 file is provided for the QTL analysis section (download from module page): gsmg11.csv (this contains the marker data relative to a genetic map and trait data for a QTL analysis of glucosinolates) Start by downloading R if you don’t already have it, and then install the QTL package: 0.1. Before installing R/qtl, you must first install R, which is available at the Comprehensive R Archive Network (CRAN): http://cran.r-project.org. ** Make sure you have the latest version (R-2.14.1) even if you already have installed R. ** 0.2. Once R is installed, and provided that your computer is connected to the internet, it is easiest to install R/qtl by first invoking R and then typing the following: install.packages("qtl") - choose a UK mirror for installation This will download and install R/qtl (which is known in R as the "qtl" package or library). 0.3. Load R/qtl by typing: library(qtl) 0.4. Here is a description of the methods that R-QTL uses: A key component of computational methods for QTL mapping is the hidden Markov model (HMM) technology for dealing with missing genotype data. We have implemented the main HMM algorithms, with allowance for the presence of genotyping errors, for backcrosses, intercrosses, and phase-known four-way crosses. The current version of R/qtl includes facilities for estimating genetic maps, identifying genotyping errors, and performing single-QTL genome scans and two-QTL, two-dimensional genome scans, by interval mapping (with the EM algorithm), Haley-Knott regression, and multiple imputation. All of this may be done in the presence of covariates (such as age or treatment). One may also fit higher-order QTL models by multiple imputation and Haley-Knott regression. 1 CH927 QTL practical: dry-lab section Warwick Systems Biology 13/03/13 Mapping QTLs – Points to consider 1. What is your hypothesis? To detect QTL need to consider: a. population showing genetic variability for target phenotype b. marker system that allows genotyping of population c. reproducible quantitative phenotyping methodologies d. appropriate experimental and statistical methods for detecting and locating QTL. Questions: (i) Have you got a suitable population to measure the trait? Is the population “fixed” i.e. not segregating, for example homozygous lines created through selfing or doubled haploid lines created using microspore culture. A fixed mapping population will allow sufficient replication to get better estimates of the trait variance components (since polygenic traits are influenced by gene x environment interactions). (ii) Do you have sufficient numbers within the population to accurately measure the trait variation (segregation distortion)? (iii) Genotype distribution vs. replication? i.e. consider covering the maximum number of allelic combinations vs. increasing replication to have a better estimate of line means and their variance. (iv) Has a population been used to map the trait previously – if so is it syntenic (genetic/physical map) with the population you wish to use. (v) If you want to explore the genetic variation for a particular trait – is the trait measurable within the population with the accuracy required – e.g. broccoli (Brassica oleracea var. italica) head diameter – could not use the Diversity Foundation Set (DFS) since the phenotype is not present. (vi) Is there genotype data available for the population? What marker types were used? (May need to rescore individuals). 2. A mapping exercise requires 3 elements: If you have data from other work or you want to map QTLs it is best to summarise the data by constructing: 1. A map file (*.map) for the experimental population 2. A locus file (*.loc) containing marker genotype scores (that are in the map) for individuals within the population 3. A quantitative trait file (*.qua) containing the trait data The three data files (*.loc, *.map, *.qua) will enable you to compile the necessary information required by different mapping software: MapQTL, QTLCartographer, R/qtl, R/eqtl, QGene, QTLcafe, GridQTL to name a few. 2 CH927 QTL practical: dry-lab section Warwick Systems Biology 13/03/13 3. Data checking If you did not make the genetic map, it is worth spending time looking at the map to see if it is the best map that explains the linkage between the markers used. This also allows you to become familiar with the map data. (i) If you rebuild the map – can you get the original genotype scores for all marker x all individuals? This may contain markers that are not on the final map; and individuals that trait data is not available for. (ii) Check that there are not individuals replicated within the population. Identical individuals do not contribute to the calculation of the recombination fraction, but do add to the computation. (iii) Check that the markers are not repeated. (iv) Look at the segregation of the markers within the data, e.g. Double Haploid (DH) lines derived from an F1 should have a 1:1 segregation of parent 1: parent 2 loci. But it should be noted that the process of making DH lines may create segregation distortion. 4. Let’s look at the Wentzell et al data. Follow this tutorial, using R for data analysis: 4.1. Start by getting the data into R Read in the file. This comes from the author’s analysis in QTL cartographer so we have to do some reassigning of variables: Genotype conversion from QTL cartographer to R/qtl format. 0 -> A and 2 -> B and -1 -> -. gsmg11 <- read.cross("csv", , "gsmg11.csv", genotypes=c("A","B")) Define the population as RI (recombinant inbred lines): class(gsmg11)[1] <- "riself" summary(gsmg11) To view data summary: plot(gsmg11) The generic plot function passes (gsmg11) to the plot.cross function, which makes the plot. Likewise these functions plot different aspects of the data: plot.missing(gsmg11) plot.map(gsmg11) plot.pheno(gsmg11, 1) plot.pheno(gsmg11, 2) 3 CH927 QTL practical: dry-lab section Warwick Systems Biology 13/03/13 4.2. Analysing the phenotype data (metabolite levels) Check the distribution of the phenotype data, this can influence the method of analysis, allows to see if the data needs transforming, highlights missing data. par(mfrow=c(3,2)) for(i in 1:5) plot.pheno(gsmg11, pheno.col=i) Explore scatterplots of phenotype vs. phenotype to locate data that maybe erroneous: pairs(jitter( as.matrix (gsmg11$pheno)), cex=0.6, las=1) The gsmg11$pheno pulls out the phenotype data and converts it to a numerical matrix with as.matrix in order to use the jitter function (which adds a bit of noise so that individual points may be distinguished). For missing values: gsmg11$pheno[gsmg11$pheno == 0] <- NA This locates missing phenotype data and adds NA. To compare the distribution of the means of a phenotype against a random order for that phenotype: par(mfrow=c(1,2), las=1, cex=0.8) means <- apply (gsmg11$pheno, 1, mean) plot(means) plot(sample(means), xlab="Random index", ylab="means") 4.3. Examining the marker data Check for segregation distortion at all markers i.e. genotypes appear in expected proportions – may indicate possible genotyping errors. Geno.table inspects genotype frequencies at each marker. The P.value column indicates pvalue for a chi sq test of Mendelian proportions (1:2:1 in an intercross for example). In this case 1:1. gt <-geno.table(gsmg11) p <- gt$P.value gt[!is.na(p) & p <0.01,] Can calculate the recombination fraction for the genotype orders based on the genotypes provided. NB: map ordered genotypes = loc file. Compare all individual’s genotypes to detect genotyping error, naming error etc. Also, if two individuals carry identical marker genotypes you may wish to exclude one from analysis (it doesn’t add any more data). cg <- comparegeno(gsmg11) hist(cg, breaks=200, xlab="proportion of identical genotypes") rug(cg) 4 CH927 QTL practical: dry-lab section Warwick Systems Biology 13/03/13 Then to pull out which individuals, using 0.9 as an example: which(cg >0.9, arr.ind=TRUE) It would be worth checking these to see if errors have occurred. If a dense set of marker are used for selection during production of lines – could vary by one marker, so might not be a problem. 4.4. Checking the marker order and recombination fraction Have markers been assigned to the correct chromosome? If they have, are they in the correct order? LOD intervals across chromosomes will appear distorted if incorrect placement of markers. See Figure 1 below. Figure 1 LOD score plot highlighting a misplaced marker (red) on a linkage group. For example, marker 7 in Figure 1 is misplaced. Can go back to marker 7 and check that the chromosome that it has been placed on is its best position (grouping LOD, and RF). Can explore these using estimates of the recombination fraction. gsmg11 <- est.rf(gsmg11) Estimates recombination fraction 5 CH927 QTL practical: dry-lab section Warwick Systems Biology 13/03/13 plot.rf(gsmg11) plots recombination fraction vs. LOD for each Chr (Figure 2, p5) If you have the genotype file that was used to build the map your could rebuild the map yourself (JoinMap, MapMaker, Carthagene, for example). Pairwise recombination fractions and LOD scores 1 2 3 4 5 5 80 4 Markers 60 3 40 2 20 1 20 40 60 80 Markers Figure 2 Estimated recombination fractions (upper left) and LOD scores (lower right) for all pairs of markers in gsmg11 data. Red = low RF or high LOD; blue = pairs that are not linked (high RF, low LOD) All the chromosome assignments are correct. If errors were present can use: checkAlleles(gsmg11) 6 CH927 QTL practical: dry-lab section Warwick Systems Biology 13/03/13 To check alternate chromosomes, first use: plot.rf(gsmg11, alternate.chrid=TRUE) This allows chromosome id’s to be identified. Then to pull out specific comparisons use: plot.rf(gsmg11, chr=c(1,2,5)) 4.5. Examining the data for genotyping errors nxo <- countXO(gsmg11) The observed number of crossovers in gsmg11 can be observed: plot(nxo, ylab="No. crossovers") on viewing the chart it can be seen that an individual seems to have a higher number of crossovers compared to the other individuals; this line can be pulled out; (20 is user defined): mean(nxo[1:211]) =7.815166 Now we can look for lines with more crossovers than expected: nxo[nxo>15] Ind 60 and173 have 16 crossovers – may be worth checking these lines. Check line 173: countXO(gsmg11, bychr=TRUE)[173,] 1 2 3 4 5 (chromosome number) 2 7 2 1 4 (number of crossovers) It would be worth looking at ch2. 4.6. Is there any missing data? It might be useful to check for missing genotype data since these are areas where we might want to add more markers. Also, standard interval mapping methods may give spurious results across regions of missing genotype information. Proportions of markers scored can be seen using: hist(nmissing(gsmg11, what="mar"), breaks=50) And by entropy and variance (blue and red respectively) plot.info(gsmg11, col=c("blue", "red")) If you wanted this information as a table: z<- plot.info(gsmg11, step=0) For chr 1 and 5: z[ z[,1]==1,] z[ z[,1]==5,] 7 CH927 QTL practical: dry-lab section Warwick Systems Biology 13/03/13 5. Now we are happy with the data we can move on to QTL analysis itself. 5.1. Comparing single marker and interval mapping methods for QTL analysis The simplest method for QTL analysis is to take each marker in turn – split the individuals based on their genotype score at that marker, then compare the phenotype means for the groups. Interval mapping improves on marker regression by taking account of missing genotype data at a putative QTL. Standard interval mapping uses maximum likelihood estimations under a mixture model; while the Haley-Knott regression methods use approximations to the mixture model. Need to calculate the conditional genotype probabilities (calc.geno), step=1 defines the density (cM) across the grid to which interval mapping will be performed. error.prob allows calculations to be made based on a given number of genotyping errors. gsmg11 <- calc.genoprob(gsmg11, step=1, error.prob=0.001) The scanone function is used for interval mapping specifying the method to be used , in this case the “EM” algorithm. (In standard interval mapping the EM algorithm is performed at each position on a grid of putative QTL locations along the genome, while the estimates and likelihood the null hypothesis are calculated just once. The likelihood estimate is non decreasing across iterations). out.em <-scanone(gsmg11, method="em") plot(out.em, ylab="LOD score") For just chr2 and 5: plot(out.em, chr=c (2, 5), col="blue", ylab="LOD score") Haley-Knott regression. Fast approximation of the standard interval mapping results. out.hk <-scanone(gsmg11, method="hk") Looking at chr 2 and 5: plot(out.em, out.hk, chr=c (2, 5), col=c("blue", "red"), ylab="LOD score") to compare the difference between the 2 methods: plot(out.hk - out.em, chr=c(2, 5), ylim=c(-0.5, 1.0), ylab=expression(LOD[HK] - LOD[EM])) to add a horizontal line: abline(h=0, lty=3) In this case no difference between the two methods. 8 CH927 QTL practical: dry-lab section Warwick Systems Biology 13/03/13 Extended HK An improved version of HK may be obtained by considering the variances. out.ehk <-scanone(gsmg11, method="ehk") can compare methods: chr2 & 5: plot(out.em, out.hk, out.ehk, chr=c (2,5), ylab="LOD score", lty=c(1,1,2)) chr1-5 plot(out.em, out.hk, out.ehk, chr=c (1:5), ylab="LOD score", lty=c(1,1,2)) IM = black, HK= blue, EHK= red dashed Can plot difference in the LOD scores from HK and EHK from the LOD scores of standard IM: plot(out.hk - out.em, out.ehk - out.em, chr=c(1:5), col=c("blue", "red"), ylim=c(-0.5, 1), ylab=expression(LOD[HK]-LOD[EM])) HK =blue, EHK= red abline (h=0, lty=3) 5.2. Determining the significance threshold: Permutation test Can use scanone to do the permutation test. Can add verbose=FALSE to suppress output, but often gives error. operm <- scanone(gsmg11, n.perm=1000) plot(operm) for genome-wide significance levels at 20% and 5%: summary(operm, alpha=c(0.20, 0.05)) 5.3. Are any of the results from the mapping carried out in 5.1. significant at a 5% threshold? If you estimate significant QTL locations and want to explore these further, you can calculate the LOD support interval and Bayes credible interval. These use results from scanone and the chromosome to consider, plus the drop in LOD units (1.5 default) or the nominal Bayes fraction (95 % default): lodint(out.em, 5, 1.5) bayesint(out.em, 5, 0.95) The first and last row indicate the ends of the 1.5 cM confidence window, the middle row indicates the maximum likelihood estimate of the QTL location. 9 CH927 QTL practical: dry-lab section Warwick Systems Biology 13/03/13 These are just coordinates, we need marker positions or the nearest flanking markers: lodint(out.em, 5, 1.5, expandtomarkers=TRUE) bayesint(out.em, 5, 0.95, expandtomarkers=TRUE) We now have a significant QTL, markers that delimit a confidence interval for the QTL. To explore the effects of the markers at the QTL or any marker we wish to examine, we use the function effectplot, this provides estimates of the genotype-specific phenotype averages, taking account of missing genotype data. For our data we will use the find.marker function to locate the nearest marker to our QTL: find.marker(gsmg11,5, 61.0) gsmg11<-sim.geno(gsmg11, n.draws=16,error.prob=0.001) effectplot(gsmg11,mname1="At5g44320-4") To look at the data: Effect61.0<- effectplot(gsmg11,mname1="At5g44320-4", draw=FALSE) Effect61.0 It is also useful to examine the distribution of the phenotype – genotype relationship: plot.pxg(gsmg11, marker="At5g44320-4") If we have more than one QTL/ markers linked to different QTL we can examine the interaction of having different QTL genotypes (marker genotypes) and the effect on the phenotypic data, for chromosomes 2 and 5: effect2<-sim.geno(subset(gsmg11, chr=c(2,5)),step=2.5, error.prob=0.001, n.draws=256) par(mfrow=c(1,1)) effectplot(effect2, mname1="c2.loc15", mname2="At5g44320-4", ylim=c(-1, 1)) It is beneficial to explore how the phenotypic data changes with different QTL genotypes. The effect plots allow us to appreciate the different outcomes of marker-assisted selection at the different QTL. The effect plots also allow us to gain an understanding of possible epistatic interactions between loci at different QTL. If the plots appear relatively parallel, then one can assume that having a particular genotype combination is additive towards the phenotype effect. However if the lines cross, this indicates possible epistatic interactions, we then have to consider the effect of having the alternate genotypes at different loci during selection. These models can be extended to include multiple QTL x QTL interactions; at present we are considering modeling 1 QTL models. 10 CH927 QTL practical: dry-lab section Warwick Systems Biology 13/03/13 Recommended reading: Specific to this practical: Wentzell et al. (2007) Linking Metabolic QTLs with Network and cis-eQTLs Controlling Biosynthetic Pathways. Plos Genetics 3(9): e162. General reading on QTL analysis: Tanksley, S.D. (1993) Annual Reviews in Genetics: 27. 205-233 Bromen and Sen. A guide to QTL mapping in R-QTL 11