GenSel4
August 2011

Command Line
- GenSel can be run from the command line, for example: gensel4 (provided the path is set appropriately)
- GenSel can be run from the BIGS web interface
- GenSel jobs can be submitted to the queue on the HPC using the bigscli command from the unix interface of BIGS
- Usage
    gensel input_file_name
    nohup gensel input_file_name -s status_file_name

Genotypes & Phenotypes
- Required for all analyses
  - trainPhenotypeFileName
  - markerFileName
- Read by GenSel4
- Used for analysis

Genotype File Structure
- Space-delimited unix text file (use dos2unix to convert)
- Header row plus one row for each animal; a column for the ID, then a column for each genotype
- One header row
  - Alphanumeric labels for each genotype/locus
- One row for each animal
  - Alphanumeric ID followed by all the genotypes
  - -10, 0 or 10 for AA, AB or BB (no support for missing genotypes)
- Ordered by genomic location if there is no map file
- Read in binary format (file names end in .newbin)
  - Text files are converted to binary in the first analysis
- Every row must have the same number of columns

Example Genotype File
    ISU_ID.bt isu_1 nadc_1 isu_2 isu_3 isu_4
    ISU_Angus_1 -10 0 10 10 0
    AAA00001 -10 10 -10 10 0
    Tag_number_a b -10 -10 0 10 0
    Casanova_bull -10 10 6.5 -10 10
- Disk requirements for 5,000 bovine 50k genotypes in text form are about 1 GB (the same file in binary format is typically half the size)
- Species are designated by the first letters of genus and species
  - bt = Bos taurus; hs = Homo sapiens; oa = Ovis aries; ss = Sus scrofa, etc.
  - This will later provide functionality for species-specific genome browsing

Phenotype File Structure
- Space-delimited unix text file
- Separate phenotype file for each trait
- Header row plus one row for each animal with a phenotype
- Alphanumeric animal ID must be in column 1
- Trait value must be in column 2 (label in header)
- Remainder of the file is arbitrary but defines the model for the trait
  - Recommended to include at least a column of 1's for the mean
- Columns are headed by alphanumeric names; all rows have the same number of columns
  - Columns headed by a name ending in $ are
    class variables
  - Columns headed by other names are covariates
  - Columns headed by a name ending in # are ignored
  - A column headed by rinverse specifies a weighted analysis

Example Phenotype File
    Animal   IQ   mean  dob  Sex$    Family#  rinverse
    A_1      100  1     100  male    1        1.0
    B_2      95   1     105  female  1        0.9
    C.12345  103  1     97   spey    2        .95
    Spot     110  1     90   male    2        1.1
- rinverse is only proportional (scalar variance factored out)
- Covariates must be numbers!
- Categorical traits must be numbered from 1 upwards
- Trait in column 2 (not required for prediction)
- Sensible to at least have the mean
- Model does not need to be full rank

GenSel matches IDs
- Only records with the same alphanumeric ID in both the genotype and phenotype files are available for subsequent analysis
- The start of the analysis reports the number of animals in the genotype file, in the phenotype file, and with matching records

Genotypes & Map Files
- GenSel now supports the use of a map file
- A map file provides chromosome and base-pair position information for at least one build
  - Can support any number of builds
- A map file may provide multiple aliases for marker names
- Every marker name from the genotype file must exist somewhere in the map file
  - Additional marker names can be in the map file
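
The ID matching described under "GenSel matches IDs" can be checked before submitting a job. The following is a minimal sketch, not GenSel's own code: it assumes both files are space-delimited with one header row and the ID in column 1, as described above. The function names and the two miniature in-memory files (including the animal Z_9) are hypothetical and exist only for illustration.

```python
import io


def read_ids(handle):
    """Collect the column-1 IDs of a space-delimited file, skipping the header row."""
    lines = iter(handle)
    next(lines, None)  # the first row is the header, not an animal
    ids = set()
    for line in lines:
        fields = line.split()
        if fields:  # ignore blank lines
            ids.add(fields[0])
    return ids


def match_ids(genotype_handle, phenotype_handle):
    """Return (genotype IDs, phenotype IDs, IDs present in both files)."""
    geno = read_ids(genotype_handle)
    pheno = read_ids(phenotype_handle)
    return geno, pheno, geno & pheno


# Hypothetical miniature files, for illustration only.
genotype_text = """ISU_ID.bt isu_1 nadc_1 isu_2
A_1 -10 0 10
B_2 0 10 -10
Z_9 10 10 0
"""
phenotype_text = """Animal IQ mean
A_1 100 1
B_2 95 1
Spot 110 1
"""

geno, pheno, both = match_ids(io.StringIO(genotype_text), io.StringIO(phenotype_text))
print(len(geno), len(pheno), len(both))  # animals genotyped, phenotyped, matched
```

With real data, the two `io.StringIO` objects would be replaced by open file handles. Animals that appear in only one file (here Z_9 and Spot) would not be available for analysis.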

Map File Structure
    Rs_num  Ss_num  ISU_ID  UMD_chr  UMD_pos    BTA_chr  BTA_pos
    Rs_001  101     isu_1   1        100000     1        95123
    1234    102     isu_2   2        1234567    2        1500000
    5678    103     isu_9   2        987654321  2        10000000
    910a    104     isu_5   X        0          PAR      2543
    newS    newS    nadc_1  unk      0          unk      N/A
- Space-delimited unix text file

Map File Options
- The minimum requirements are
  - mapFileName
  - linkageMap (options depend upon your map file), e.g. UMD or BTA for the example map file on the previous slide
- This results in the columns of the genotype file being sorted into genomic order to facilitate formation of contiguous marker windows, which are formed automatically in 1 Mb sizes
- Options include
  - addMapInfoToMarkers yes
    - Results in chromosome and base-pair position being added to the output
  - outputMarkerHeaderName (options are the aliases in your map file)

Filtering Genotypes
- Four methods are available to filter columns of the genotype file for analysis
- Two approaches are always available
  - includeFileName or excludeFileName
  - These files contain a list of marker names, as in the genotype file header, that are to be included or excluded
  - Include takes precedence over exclude
- Two other approaches are available if a map file is used
  - windowIncFileName or windowExclFileName
  - List of chromosome_names to include/exclude entire chromosomes
  - List of chromosome_name start_bp end_bp to include/exclude a base-pair range

Map files & SNP names
- Sometimes the genotype file uses one marker name (e.g. a database numeric ID), but the marker output file would benefit from having a different name (e.g. an rs number)
- Given a map file, Predict can cross-reference the different marker names so you can exchange marker results (.mrkRes) files with other users

Output File Name Conventions
- Suppose GenSel is run using: gensel4 demo.inp
- The root for all output files will be "demo"
- All options produce output in demo.out#, where # is the next available integer not already used
  - The first run produces demo.out1, the next demo.out2, etc.
- Most other options produce additional files that have the same root name and the same suffix number as the .out file:
  demo.LD1, demo.mrkRes1,
  demo.ghat1, demo.winVar1, etc.

Analysis Options
- Many calculations are time consuming
  - Computing window variance
  - Validating predictive accuracy in test data
  - Computing PEV and R2
- These are only done in some iterations, according to the outputFrequency option
  - Default is 100, so these calculations occur in 1% of iterations
- Markov chains use many random numbers
  - The seed option (default 1234) can be used to alter sampling

Print
- analysisType Print
- This can be used to get a printout of the X matrix, ordered by map position if a map file is used, for just those animals in both the genotype and phenotype files
- The output contains the covariates on a 0, 1, 2 scale, before centering, not on the -10, 0, 10 scale used in the marker genotype file

LD
- analysisType LD
- Computes the squared correlation between every pair of markers in the filtered genotype file
- Also computes the minor allele frequencies (MAF)
- The output file will be very large if you don't filter it
  - Only squared correlations exceeding minLDoutput are stored
  - minLDoutput (default 0.1)

StepWise
- analysisType StepWise
- Computes (unweighted) forward and reverse submodels after first fitting all the fixed effects
- R2 is defined as the proportion of sums of squares after the fixed effects
- Three options control the model
  - inputMaxRsquared (default 0.8) will stop the analysis
  - inputMaxMarkers (default 100) will stop the analysis
  - alphaValue (default 0.05) controls significance

Bayes
- analysisType Bayes
- bayesType BayesB
  - Metropolis-Hastings
  - Gibbs sampling
- bayesType BayesA (actually just BayesB with pi = 0)
- bayesType BayesC
- bayesType BayesCPi (actually BayesC but with pi estimation)
- bayesType RBR (Robust Bayesian Regression)
  - Really BayesB but with estimation of pi, scale and df (genetic)
- FindScale options (no, yes) or, for BayesCPi, (thruPi)

Bayes Priors
- Priors and associated degrees of freedom are required for the genetic and residual variances
  - genVariance (default 1)
  - degreesFreedomEffectVar (default 4)
  - resVariance (default 1)
  - nuRes
    (default 10)
- Better estimates of genVariance and resVariance should be used
  - From knowledge of heritability and phenotypic sd

Bayes Options
- All analysisType Bayes jobs have extra options
- burnIn is the number of iterations in the chain to discard
  - Probably doesn't need to be very many (e.g. 1,000)
- chainLength is the number of iterations in the chain
  - Typically use 41,000 or more (this includes burnIn)
- Mixture models (BayesB, BayesC, BayesCPi, RBR) assume that a fraction (1-pi) of markers have an effect and a fraction pi have zero effect
  - The option is, for example, probFixed 0.95

Bayes Options
- BayesB (and therefore BayesA = B0) used to use a Metropolis-Hastings rather than a Gibbs sampler
  - MHG did 100 MH iterations
  - Our fast version used a different proposal distribution and required no more than 10 MH iterations
  - You can specify numMHIter
- Long developed an alternative sampler that does not use MH
  - You select this option using numMHIter 0
  - It is faster: the same speed as BayesC

Bayes Options
- The 1 Mb windows formed using a map file can be used to compute the variance of each window
  - This is turned on using windowBV yes
  - Note that the number of markers in each window varies with SNP density along the genome (many markers for chromosome unk)
- This provides posterior distributions of windows, so the previous Permute and Bootstrap options are no longer needed or supported
- In the absence of a map file, the columns in the genotype file are assumed to be consecutive, and the number of markers in a window is defined by the windowWidth option
  - The default is 5
- Automatically get graphs of posteriors and a table of variances

[Figure: Percentage of Genetic Variance by 1 Mb windows]
- Note that window variances typically don't sum to 100 due to nonzero covariances

[Figure: Cumulative Genetic Variance by largest windows]

[Figure: Window contains 20 SNPs from Gga_rs14490890 to Gga_rs14491074]

[Figure: Window contains 27 SNPs from Gga_rs14693113 to Gga_rs13758442]

Predict
- analysisType Predict
- markerSolFilename defines the name of a .mrkRes file from a previous training analysis
- windowWidth defines the number of markers in a consecutive window from which the overlapping window variances are computed
- windowBV yes will result in a file of ghats with a row for each animal and a column for each overlapping window

GenerateData
- Randomly chooses a proportion 1-probFixed of loci to be QTL
- Samples QTL effects and residual effects from normal distributions with mean 0 and variances determined by varGenotypic and varResidual
- Outputs the simulated genotypes and phenotypes
- Phenotypes will be categorical if isCategorical yes, with as many categories as specified by numCat (default 2)
- Categories will be of equal size unless specified by the option PortOfCat (e.g. 0.70:0.20 if numCat 3)

Validation
- There are two options for validation
- Validation can be done jointly with the training analysis
  - trainPhenotypeFileName
  - testPhenotypeFileName
  - If there is no testPhenotypeFileName, the training data is used
  - This produces ghat, PEV and R2 for validation animals
- Validation can be done in a later session, separate from training
  - This produces ghat but no PEV or R2
- All columns of the phenotype file are copied into the ghat file to facilitate downstream analysis

Graphing Posteriors
- Various posterior distributions are output if desired, using the keyword plotPosteriors yes
- Samples used in the graphs are in .mcmcSamples, which can be produced without graphing using mcmcSamples yes
- Requires that gnuplot is installed on the machine in a location accessible through the defined path

Categorical Options
- All analysisType Bayes jobs will do categorical analyses if the option isCategorical yes is used
- Categories must start from 1 and be ordered without missing categories

Required Libraries
- Many routines use matvec libraries
- Most matrix and vector
  computations use Eigen3
- GSL is no longer used
- Boost is used (only for format statements)
- Limited use of STL
- Graphics options require gnuplot
  - Environment must include paths to gnuplot (/opt/local/bin)

R version
- We are developing an R version that will allow you to run any or all of the options from R
- It will also allow you access to variables created during the analysis
- We hope to allow you to replace existing procedures with your own, for prototyping new methods or features

Planned Developments
- Addition of partial least squares (PLS), Bayesian Lasso
- Addition of further random factors beyond the genotypes
  - Using pedigree, genomic or identity variance-covariance matrices
- Extension to multiple trait analysis
- Implementation using CUDA graphics processors
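
Worked example for the Bayes Priors slide, which recommends setting genVariance and resVariance from knowledge of heritability and phenotypic sd. The sketch below is one plausible way to do that arithmetic, under stated assumptions: that genVariance refers to the total additive genetic variance, that the phenotypic variance splits into genetic plus residual parts, and that options are written as "name value" pairs as in the slides (e.g. probFixed 0.95). The function name and the values h2 = 0.25 and sd = 10 are hypothetical.

```python
def prior_variances(heritability, phenotypic_sd):
    """Split the phenotypic variance into genetic and residual parts.

    Assumes var_P = var_G + var_E and heritability = var_G / var_P.
    """
    if not 0.0 <= heritability <= 1.0:
        raise ValueError("heritability must lie between 0 and 1")
    if phenotypic_sd <= 0.0:
        raise ValueError("phenotypic_sd must be positive")
    phenotypic_variance = phenotypic_sd ** 2
    genetic_variance = heritability * phenotypic_variance
    residual_variance = phenotypic_variance - genetic_variance
    return genetic_variance, residual_variance


# Hypothetical trait: heritability 0.25, phenotypic standard deviation 10.
gen_var, res_var = prior_variances(0.25, 10.0)
print(f"genVariance {gen_var:g}")
print(f"resVariance {res_var:g}")
```

For these inputs the phenotypic variance is 100, so the sketch gives a genetic variance of 25 and a residual variance of 75. Whether GenSel interprets genVariance exactly this way should be confirmed against the GenSel documentation before relying on these values.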