LDLA User Manual
V 0.4 - 21/10/2010
Jean-Alain Grunchec, Jules Hernández-Sánchez, Sara Knott

0/ CONTRIBUTIONS
I/ INTRODUCTION
II/ REQUIREMENTS
    A/ WEB BROWSER
    B/ RESOLUTION ADVICE
III/ LOGGING ON THE PORTAL
    A/ URL OF THE PORTAL
    B/ AUTOMATED LOG OUT
    C/ THE PROFILE MANAGER
IV/ UPLOADING AND CHECKING THE LDLA INPUT FILES
    A/ THE INPUT FILES
    C/ THE CHECKING PROCESS
    D/ DEALING WITH ERRORS AND WARNINGS
    E/ MISSING ALLELE RECOVERY
V/ CHOOSING THE TYPE OF QTL ANALYSIS
    A/ TRAITS, COVARIATES AND FACTORS PANELS
    B/ THE SAVE G MATRICES PANEL
    C/ THE METHOD PANEL
    D/ COMBINATION OF TRAITS
    E/ THE SEARCH PARAMETERS PANEL
    F/ THE MARKERS USE PANEL
VI/ RUNNING THE QTL ANALYSES
    A/ STARTING THE ANALYSES
    B/ THE PRE-PROCESSING PHASE
    C/ MONITORING THE PROGRESS OF THE ANALYSES
    D/ CANCELLING
    E/ DOWNLOADING AND CHECKING THE RESULTS
    F/ CLEANING RESULTS
    G/ END NOTE: HELPING THE DEBUGGING PROCESS
    H/ KNOWN ISSUES
APPENDIX A: FILE FORMAT DESCRIPTION
APPENDIX B: ERROR CHECKING
APPENDIX C: THE GRID

0/ Contributions

QTL analyses carried out by the LDLA module make use of the ASReml software.

Gilmour AR, Gogel BJ, Cullis BR, Welham SJ, Thompson R. ASReml user guide release 1.0. 2002. Hemel Hempstead, UK: VSN International Ltd.

The epistasis module uses the SWARM meta-scheduler. The SWARM meta-scheduler has been reported to accelerate the execution of an analysis with 200 jobs by a factor of up to 147.

Grunchec JA, Hernández-Sánchez J, Knott SA. SWARM: A meta-scheduler to minimize job queuing duration in a Grid portal. Proceedings of the International Conference of Cluster and Grid Computing Systems (ICCGS 2009). July 2009. Oslo, Norway; 600-607.

QTL analyses carried out by the GridQTL module make use of the resources provided by the Edinburgh Compute and Data Facility (ECDF, http://www.ecdf.ed.ac.uk/). The ECDF is partially supported by the eDIKT initiative (http://www.edikt.org.uk).

Please note that registered accounts are personal, i.e. if some of your colleagues or supervised students want to use the GridQTL service, they need to register individually.
For very large LDLA and epistasis analyses, free use of the UK National Grid Service (NGS) has been observed to save end users weeks of waiting time compared with running the same analyses on a desktop PC. When publishing work based on use of the NGS, users should acknowledge both the relevant GridQTL publication or manual, and the NGS directly using the following line: "The authors would like to acknowledge the use of the UK National Grid Service in carrying out this work". This citation should, if possible, be used in conjunction with one of the NGS logos. If including the full citation is impossible or difficult owing to the format of the journal paper or conference proceedings, a reference to the UK National Grid Service, http://www.ngs.ac.uk/, can be included instead.

I/ Introduction

This document describes how to use the Linkage Disequilibrium and Linkage Analysis (LDLA) module, which has been developed as part of the GridQTL project.

II/ Requirements

A/ Web browser

In order to run the LDLA module, you need to contact one of the authors to have an account created for you on the GridQTL portal. You also need to have a JavaScript-enabled web browser installed on your computer. For instance, the latest versions of the following web browsers have been tested successfully with the LDLA module:

    Mozilla Firefox
    Netscape
    Opera
    Windows Internet Explorer

B/ Resolution advice

The best graphic output is provided with Firefox. The screenshots in this manual were taken while using the LDLA module with Firefox. The display may look slightly different in other web browsers. Any screen resolution can be used, but the graphic output looks better at 1024x768 or higher.

III/ Logging on the portal

A/ URL of the portal

The URL of the GridQTL portal for the LDLA module is:

    http://cleopatra.cap.ed.ac.uk/gridsphere/gridsphere

You need to type your user name and password, and then click on the Login button.

B/ Automated log out

Beware that you may be logged out after a few minutes of inactivity during a QTL analysis. This typically happens when a QTL analysis is started and the user goes away while the computation is running. This is not a problem, since the user can come back later and log in again to proceed with the rest of the analysis. Users can log in and log out, and open and close the web browser, at any time. When they log in again, their settings will be unchanged since the last time they logged out. Data are stored and computations are performed even when the user is not logged in.

Figure 1: Logging on the portal.

C/ The profile manager

Once you have logged in, you can change some of your settings, such as your name or password, in the profile manager portlet. Beware that at some point it may be convenient to communicate this password to an administrator for debugging purposes, in case you encounter a problem with the LDLA module. Therefore, do not choose a password that you use elsewhere.

Figure 2: The profile manager portlet.

IV/ Uploading and checking the LDLA input files

In Figure 2, there is an “IBD Module v03” tab in the menu. Click on it in order to load the LDLA module. IBD stands for Identity By Descent, a key concept in genetics.

A/ The input files

Five files are expected as input in the LDLA module. These files contain information about the pedigree, traits, marker genotypes, marker positions, and demographic history of the population. Their format is described in Appendix A: File format description at the end of this document. These files should be located on your file system.
You can upload them by selecting their path in the relevant text box (you can click on the browse buttons and use the file browser to select these paths). The Output text box can be used to provide a name for the compressed files that will be generated at the end of the analysis. Any name will do, but it is better to use a clear identifier for future reference. For instance, we can call it simN100T100M10, denoting the analysis of a simulated data set where historical IBD is predicted assuming Ne equal to 100 (50 males plus 50 females) and 100 generations of history, using the nearest 10 markers to each tested location. Once the analysis is completed, a bundle with compressed output files will be generated and its name will contain the string simN100T100M10.

Figure 3: Uploading and checking the input files.

C/ The checking process

Once the parameters have been typed, click on the Upload and check button. The files are then uploaded from your computer to the server in Edinburgh, where they are “checked” by the web server: if there are errors in the input files, you will be warned. The types of errors checked for are:

    1. Individuals with genotypes and/or phenotypes not in pedigree
    2. Sex exchanges
    3. Mendelian errors
    4. General formatting errors

It will take a few seconds, or maybe even a few minutes, for the web page to refresh. Do not click several times on the Upload and check button, since each click restarts the checking process from scratch and degrades the web server’s performance. Once the files have been validated, you should see a screen similar to Figure 4.

Figure 4: The LDLA portlet after the files have been uploaded.

D/ Dealing with errors and warnings

If there are errors in the input files, messages are displayed in the text area of the Errors panel (see Appendix B: Error checking). There are different types of errors: some are warnings, but others are critical and require the data to be corrected.
In case of critical errors, the data need to be corrected and uploaded again. In order to do this, the user needs to click on the New Dataset button, correct the errors and upload the files again.

E/ Missing allele recovery

Missing alleles are common. A strategy to recover as much information as possible before further analysis is therefore implemented at this stage. This method makes use of all available marker information (family trios, entire full-sib families, entire half-sib families, and distant genotyped ancestors) to recover missing alleles in an individual. The process is iterated until no more changes can be made, as recovered alleles can help recover other missing alleles in the next round.

V/ Choosing the type of QTL analysis

A/ Traits, covariates and factors panels

There are several panels below the Errors panel. The Traits, Covariates and Factors panels contain check boxes named after the records in the uploaded Traits file. Multiple traits can be uploaded simultaneously but, at this stage, only one at a time can be analyzed. Consequently, exactly one box should be ticked in the Traits panel. Traits are analyzed under the assumption of normally distributed errors. Covariates are continuous variables that may or may not be associated with the traits, e.g. age determining height. Factors are discrete variables defining groups of records with something in common, e.g. gender. Covariates are read as numerical and factors as alphanumerical.

B/ The save G matrices panel

The Save G matrices panel allows the user to keep the G matrices (IBD matrices between individuals) in the output file produced during the QTL analysis. G matrices can be very large files, and their transfer and compression can take a lot of time. Therefore, by default, G matrices are not saved. You can tick the Yes/No check box to save these matrices. These matrices can be useful for purposes other than QTL mapping.

C/ The method panel

The Method panel offers a choice of three methods for the estimation of historical IBD probabilities. The M&G method is based on Meuwissen and Goddard (2000, 2001), the H&HS method is based on Hill and Hernandez-Sanchez (2007), and the R method is based on Hernandez-Sanchez et al. (2006). M&G and H&HS require haplotype data and R requires genotype data. M&G and H&HS are more accurate than R given correct population parameters. We have yet to implement a haplotype algorithm based on the minimum recombinant concept of Qian and Beckmann (2002). Until then, and unless you have haplotypes, only R is available.

D/ Combination of traits

By default, only one QTL is searched for. Several options can be selected when one QTL is searched for:

    Polygenic
    Additive
    Dominant

The polygenic variance is calculated using the average relationship matrix across all individuals in the pedigree, and it contains all additive genetic effects on the trait, apart from the additive effect at the position being tested. The additive and dominant variances are calculated with IBD matrices based on marker data and historical information, and are made up of additive or dominant effects at the position being tested, respectively. For now, it is not possible to analyze several QTL. However, it should later be possible to analyze two QTL (Figure 5). At most 6 check boxes can be ticked in addition to Polygenic.

Figure 5: Other panels

The limit of 6 ticked boxes comes from the maximum number of ‘user-defined’ covariance matrices that ASReml can take (version 2001). Two QTL can have direct additive and/or dominant effects, plus interactions among them (epistasis). In general populations, epistasis is hard to detect. However, as gene-gene interactions are the norm rather than the exception, we opted to offer this option to users.

E/ The search parameters panel

The Search Parameters panel allows choosing the positions where a QTL analysis will be performed. Three methods can be used.

    Every – cM: a strictly positive real number should be used. A QTL analysis is then performed at regular intervals of that size (see Figure 4).

    At – cM (File): a file needs to be selected on the user file system. This file should be a list of positive floating-point numbers, separated by blank spaces, tabs or “new line” characters. A QTL will be searched for at each position indicated in this file (see Figure 5).

    At – cM (Hand): a list of positions needs to be typed in. Those positions should be separated by blank spaces (Figure 6).

Figure 6: Choice of positions by hand

F/ The markers use panel

You can select all markers for the computation of historical IBD (Figure 4). This option is computationally intensive if many markers are available, so it may be better to run a first preliminary search with fewer markers. The other option, Closest, allows using fewer markers (Figure 5). The number of markers closest to the position being analysed must be entered as an integer. The Estimate option has yet to be implemented, and will assist users in choosing the optimal number of markers in each analysis.

Note: at this stage, the method used should be R, only one QTL should be analysed, and the Estimate option in the Markers use panel should not be used. Work is in progress to make these options functional.

VI/ Running the QTL analyses

A/ Starting the analyses

When all the previous parameters have been selected, you can start the QTL analyses by clicking on the Start button of the Analysis panel. There is a small tick box, Display Grid Activity, near the Start button. Ticking it displays the progress of the jobs on the UK National Grid Service (see Appendix C: The Grid).

B/ The pre-processing phase

The Display Grid Activity tick box can be ticked or unticked at any time during the computation. A progress bar is also displayed during the computation. When the Display Grid Activity box is ticked, the Analysis panel is refreshed roughly every 5 seconds (Figure 7).
Several messages are displayed during the analysis. The pre-processing phase can take up to a few minutes.

C/ Monitoring the progress of the analyses

Afterwards, the analyses are distributed on the Grid. There is one analysis for each position selected as described in Section V-E. The dark blue part of the progress bar indicates the percentage of the analyses which have completed. The purple part of the progress bar indicates the percentage of the analyses which are being calculated on the Grid. The light blue part of the progress bar indicates the percentage of the analyses which have yet to start on the Grid. This progress bar is indicative only, and generally reports real progress with a delay of a few seconds.

Figure 7: Monitoring the progress of the analyses

D/ Cancelling

It is possible to cancel an analysis by clicking on the Cancel button. In this event, if some computations have ended, it may be possible to retrieve some of the results (incomplete graphs will be generated).

E/ Downloading and checking the results

When the analyses have all completed, a dialog box will pop up (Figure 8). Click OK.

Figure 8: The results can be downloaded.

Some graphs may be generated, depending on the parameters you chose previously (Figure 9).

Figure 9: A few plots

The results can now be downloaded. Click on the Download Results button. You should see a dialog box which allows you to save the results (Figure 10).

Figure 10: Download dialog boxes, Firefox style on the left, Internet Explorer on the right.

The name of the file is in the format number@username.userdefinedname.zip. number is an identifier which guarantees the uniqueness of the file. It can also be used as a reference to the analysis for software maintenance purposes. username is the user name of the user who started the analyses.
userdefinedname is a string of characters which is expected to be meaningful to the user and helps them manage their results on their file system. Save the file on your file system. You can open it with unzip in Linux or with other uncompressing software under Windows (7-zip, Winrar, Winzip).

The compressed output files bundle contains an .asr and an .sln file for each position where a QTL analysis was performed (Figure 11). They are indexed in the same order as the positions specified as described in Section V-E. The .asr files are ASReml result files containing the most relevant information, e.g. variance components and maximum log-likelihood. The .sln files contain the BLUP and BLUE solutions (see the ASReml manual for more explanations). Plots, in both .ps and .png format, are also included. Other files, such as a number.info.txt file, are included. This file is designed to contain general statistics on the data, e.g. number of alleles per locus, allele frequencies, number of half and full sib families, number of cohorts (a rough estimate of the number of discrete generations when they are really overlapping), etc. The files with instructions for ASReml (.as) are also included; they can be helpful if the user later needs to check that the right model and data were used.

Figure 11: Content of the compressed output files bundle.

F/ Cleaning results

You may want to use the same data set and run a new analysis with almost the same parameters. In this event, you do not need to go through the checking process again (see section IV-C), and the results of the pre-processing step do not need to be computed again (see section VI-B). This can save time when you deal with large datasets. In this case you just need to change the parameters as explained in section V, and you can submit new analyses as explained in section VI-A. In case you need to upload a new dataset, you need to click on the button NEW DATASET (see Figure 4).
When you click on this button, it will take a while for the browser to refresh, possibly a few minutes. This is normal. Until you click on this button, many files related to your previous experiment are stored on the web server in Edinburgh and on the supercomputers across the UK, and they only get removed at this stage. Once the temporary files have been removed, you should be able to upload your files as described in section IV.

G/ End note: helping the debugging process

In case you encounter a problem, or think you have found a bug, you are advised to take a screenshot of your browser, write down what you did previously, and copy-paste any error messages into an email to be sent to one of the authors. You are also advised not to proceed with your computations in this event. If problems appear after the computation has ended, do not click on the “new dataset” button, since doing so would delete the temporary files on the server and the Grid, which can be invaluable for debugging purposes.

H/ Known issues

No known issues at present.

Appendix A: File format description

We expect data files to be standard text files, e.g. those that can be opened with winword in Windows or gedit in Linux. The names of these files should not contain spaces and they should not be too long, e.g. up to 50 characters or so. The data consist of the pedigree, markers, traits and explanatory variables, and map distances.

The first line in the pedigree file contains the number of individuals in the pedigree. The remaining lines each contain three identifiers: individual, father and mother (in this order). Identifiers are considered alphanumeric, so you can use letters, numbers or a combination of both. Unknown parents must be identified with zeros. For example:

    4
    a 0 0
    b 0 0
    c 1 2
    d 1 2

The pedigree will be checked for errors (see Appendix B), and individuals will be added as required, e.g. if the first two individuals in this pedigree, ‘a’ and ‘b’, had been missing, dummy individuals would have been created to identify them.
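
The pedigree layout described here can be checked locally before uploading. The following is a minimal Python sketch and not part of the LDLA module: the function names parse_pedigree and missing_parents are hypothetical, and the sketch assumes that the identifiers on each line are separated by whitespace and that unknown parents are coded as 0.

```python
def parse_pedigree(text):
    """Read pedigree text: a count line, then 'individual father mother' lines."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    stated = int(rows[0][0])  # first line: stated number of individuals
    records = {}
    for fields in rows[1:]:
        if len(fields) != 3:
            raise ValueError(f"expected 3 identifiers, got {fields}")
        individual, father, mother = fields
        records[individual] = (father, mother)
    if stated != len(records):
        raise ValueError(
            f"first line says {stated} individuals, found {len(records)}"
        )
    return records


def missing_parents(records):
    """Parents that are named but have no line of their own ('0' = unknown)."""
    parents = {p for pair in records.values() for p in pair if p != "0"}
    return sorted(parents - set(records))


example = "4\na 0 0\nb 0 0\nc 1 2\nd 1 2\n\n"
pedigree = parse_pedigree(example)
print(missing_parents(pedigree))
```

Parents reported by missing_parents are the ones for which, according to this appendix, records would be added automatically by the checker.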
There must be at least one space between identifiers, and the file must finish with a blank line.

Marker information can come in two forms: genotypes or haplotypes. Both use the same format, but haplotypes are phased genotypes, so the first allele at each locus is always paternal and the second maternal. Again, any name (up to 50 characters or so) can be used for the markers file. However, a label, either .gen. or .hap., must be included in it to distinguish between genotypes and haplotypes, e.g. markers.gen.txt or markers.hap.txt. The following is an example of a file containing marker information:

    Chromosome1
    Marker1 Marker2
    a 1 2 3 4
    b 1 2 3 4
    c 1 1 3 3
    d 2 2 4 4

The first line contains the name of the chromosome, an alphanumeric variable of up to 20 characters. The second line contains the names of all markers in the data, each name of up to 20 characters, separated by at least one space. The remaining lines each contain an individual identifier and marker information. Each individual identifier must appear in the pedigree, or an error will be issued. The marker data following each identifier are read as alleles within loci. Each record must be written on a single line. For example, individual ‘a’ has genotypes 1 2 and 3 4 at the first and second loci, respectively. The file must end with an empty line.

Markers can be typed in any order, not necessarily their ‘real’ order on the chromosome. However, their chromosome order must be reflected in the map file, for example:

    1
    Chromosome1 2 4
    Marker2 1.4 Marker1

The first line denotes the number of chromosomes. At present we can only handle one chromosome at a time; in the future we will extend our software to handle any number of chromosomes. The second line contains the name of the chromosome, which must be the same name found in the marker file, followed by the number of loci (2) and the number of individuals genotyped (4).
The third line contains the name of each marker, in chromosome order, separated by the distance between consecutive markers in centimorgans (cM). The file must end with an empty line. In this example, Marker1 and Marker2 are separated by 1.4 cM, and Marker2 appears before Marker1 on the chromosome (despite being written in the opposite order in the marker file). A map file is necessary here because genotypes are usually given unordered, and hence an additional step is required to find their order.

The traits file must contain one or more traits, and may contain factors and/or covariates. Traits can be discrete or continuous in measurement, but all will be analysed assuming errors are normally distributed. Factors are alphanumeric categories grouping several individuals, for example gender. Covariates are continuous variables that may influence the trait, e.g. body size, since larger individuals may also be heavier than smaller ones. A missing value label has to be specified. An example of a traits file is the following:

    4 1 0 0 1 -99.9
    Id Fat Gender
    1 11.2 1
    2 10.4 2
    3 13.0 1
    4 9.9 2

The first line of the above file specifies the number of individuals (4), the number of traits (1), the number of random effects (0), the number of covariates (0), the number of factors (1) and the missing value label (-99.9). The second line contains labels for each column. Lines 3 onwards contain data. The file must end with an empty line.

The history file contains demographic information about the population. An example is the following:

    50
    50
    100
    3
    50

The first and second lines denote the effective population size (Ne) of males and females, respectively. These are assumed fixed over time. The third line contains the number of discrete generations since population foundation.
The fourth line contains the random mating system. The choices are: 0 for haploid population, 1 for diploid monoecious with selfing, 2 for diploid monoecious without selfing, 3 for diploid dioecious, and 4 for diploid dioecious with hierarchical mating. Under hierarchical mating, males are mated to females chosen at random without replacement, e.g. if Ne had been 5 and 50 for males and females, respectively, then 10 random females (without replacement) would have been mated to each male. This last option reflects common situations in animal breeding.

The fifth line contains the minimum percentage (%) of loci that must be fully genotyped for an individual to qualify as ‘base’. Base individuals are important because they correspond to the first generation in a pedigree for which historical IBD is estimated. LDLA was primarily designed to estimate historical relationships among pedigree founders based on markers and demography (in linkage analysis, pedigree founders are assumed unrelated and non-inbred), and to bring those relationships down through the whole pedigree following the segregation of marker alleles. However, in many data sets pedigree founders are not genotyped, and estimating historical IBD among them has not given us good results, presumably because all founders then have the same degree of relatedness. Thus, we must find the first generation of individuals that are at least partially genotyped (the base) and that have no ancestors as informative as they are. The level of marker information required to qualify as base is regulated by the last parameter in the history file. For example, 100 would mean that only fully genotyped individuals could qualify as base, and 0 would mean that pedigree founders are base irrespective of marker data. Finally, the file must end with an empty line.

Appendix B: Error checking

Errors in data are the norm. A crash would be most likely due to formatting errors.
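
Since formatting errors are the most likely cause of a crash, it can be worth checking small files such as the history file of Appendix A locally before uploading. The following is a minimal Python sketch and not the checker used by the portal: the function name parse_history and the dictionary keys are hypothetical, and the sketch assumes that the two Ne values and the number of generations are integers and that the base percentage lies between 0 and 100.

```python
# Mating system codes as listed in Appendix A.
MATING_SYSTEMS = {
    0: "haploid population",
    1: "diploid monoecious with selfing",
    2: "diploid monoecious without selfing",
    3: "diploid dioecious",
    4: "diploid dioecious with hierarchical mating",
}


def parse_history(text):
    """Read the five values of a history file, one value per line."""
    values = [line.strip() for line in text.splitlines() if line.strip()]
    if len(values) != 5:
        raise ValueError(f"history file needs 5 values, found {len(values)}")
    ne_males, ne_females, generations, mating, base_percent = values
    history = {
        "ne_males": int(ne_males),
        "ne_females": int(ne_females),
        "generations": int(generations),
        "mating_system": int(mating),
        "base_percent": float(base_percent),
    }
    if history["mating_system"] not in MATING_SYSTEMS:
        raise ValueError(f"unknown mating system code: {mating}")
    if not 0 <= history["base_percent"] <= 100:
        raise ValueError(f"base percentage out of range: {base_percent}")
    return history


history = parse_history("50\n50\n100\n3\n50\n\n")
print(MATING_SYSTEMS[history["mating_system"]])
```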
We endeavor to produce error messages that are as meaningful as possible, but the many ways in which formatting errors can occur can only be accommodated over time and with experience. If the program crashes and we are not able to give you an informative clue to solve the problem, you can always contact us directly. Nevertheless, some errors are captured by meaningful messages. These errors include:

    1. Stated and actual file sizes differ, e.g. the pedigree has 10 individuals but the first line says 9.

    2. Sex exchanges, e.g. an individual listed as a father appears later as a mother, or vice versa. This may be legitimate in some species of fish, so it must be considered a warning unless your working species cannot change sex.

    3. Individuals genotyped and/or phenotyped but not in the pedigree. All individuals must appear in the pedigree whether they have records or not. This should be treated as a severe error, and not taking action can have unforeseen consequences.

    4. Mendelian errors. These are common errors or warnings and should be erased from the data set. A list of all the errors detected will be issued so that a perl program can delete them from the data set. Each error is typed on a single line and consists, at least, of the genotypes of a family trio. Sometimes the genotypes of more family members are included in the error, as an error can only be considered such in reference to relatives. Our perl program takes a liberal approach and deletes all the individuals typed in the same error line. A more subtle error treatment is possible using probabilities, e.g. heterozygotes are more likely to be mistyped as homozygotes than the other way round, but we have not implemented it here.

Example:

Here we can see that individual 101 has marker genotype 2 2, that the father is individual 27 (with marker genotype 1 1), and that the mother is individual 62 (with marker genotype 2 1). This is not possible and hence is noted as an error.
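
The test behind this example can be sketched in a few lines of Python. This is only an illustration of a single-locus trio check, not the checker used by the LDLA module: the function name mendelian_consistent is hypothetical, and the sketch assumes that all three genotypes are complete (no missing alleles).

```python
def mendelian_consistent(child, father, mother):
    """True if one child allele can come from the father and the other from the mother."""
    first, second = child
    return (first in father and second in mother) or (
        second in father and first in mother
    )


# Individual 101 (2 2) with father 27 (1 1) and mother 62 (2 1):
# the father cannot transmit allele 2, so the trio is inconsistent.
print(mendelian_consistent((2, 2), (1, 1), (2, 1)))

# With the child's genotype corrected to 1 2, the trio becomes consistent.
print(mendelian_consistent((1, 2), (1, 1), (2, 1)))
```

The first call prints False and the second prints True.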
For individual 295, the marker genotype is 1 2, which is incompatible with the genotypes of the father (individual 101) and the mother (individual 196). Whenever possible these errors should be corrected (here by giving individual 101 the genotype 1 2 instead of 2 2). In rare cases, a Mendelian error may be reported for genotypes that appear to be correct (possibly due to a bug).

Appendix C: The Grid

Basically, a Grid is a set of supercomputers, and we take advantage of the large computational power provided by the UK National Grid Service (NGS) (http://www.grid-support.ac.uk). Several High Performance Clusters are part of the Grid used by the GridQTL project. Most of these clusters have more than 256 CPUs. The ECDF is a 512-CPU cluster located in Edinburgh (http://www.is.ed.ac.uk/ecdf/faq.shtml). A local Condor pool is part of the Grid, but is not expected to be used in production.

In section V-E, the user specifies a set of positions where QTL analyses will be performed. A job is created for each of these positions. For instance, if your data cover 15 cM, and an analysis is to be performed every 1 cM, 15 computations will be undertaken in different places on the Grid. This speeds up the computations significantly, since many QTL analyses run at the same time. As an example, from the user’s perspective the 15 analyses may complete in less than 10 minutes, even though each of these analyses may take 8 minutes. The NGS will then charge the GridQTL account 15 x 8 = 120 minutes. As an example of a large computation, the analysis of some of our mouse data with 30 positions takes up to 100 hours on the Grid (but only 4 hours from a user perspective). For now, this level of usage is expected to be exceptional.
Beware of this when you run large computations: it may be better to run a preliminary set of analyses with fewer positions, and then run a new analysis with a denser set of positions in particular regions of interest. The total duration of the computation on the Grid is displayed at the end of a set of analyses, before the results can be downloaded. This duration is recorded per user in a database in Edinburgh, for statistical purposes.
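
The accounting described in this appendix is simple arithmetic, and can be sketched as follows. This is a minimal Python illustration using the figures quoted in this appendix; the function names grid_usage_minutes and speedup are hypothetical, and the sketch assumes that every job takes the same time.

```python
def grid_usage_minutes(n_jobs, minutes_per_job):
    """Total Grid time charged: one job per position, all jobs equally long."""
    return n_jobs * minutes_per_job


def speedup(grid_hours, wall_clock_hours):
    """Ratio between time consumed on the Grid and time waited by the user."""
    return grid_hours / wall_clock_hours


# 15 positions at 8 minutes each are charged as 120 minutes of Grid time.
print(grid_usage_minutes(15, 8))

# 100 hours on the Grid seen as 4 hours by the user: a 25-fold speed-up.
print(speedup(100, 4))
```

Estimating the charged time in this way before submitting can help decide whether a coarse preliminary scan is worthwhile.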