cnvPipe – A Copy Number Variation Meta-Analysis Pipeline

Aim:

The aim of cnvPipe is to enable CNV meta-analyses between cohorts which have been analysed with different CNV prediction algorithms and assayed on different genotyping chips. See the section ‘Modifying the cnvPipe.sh script’ for information on how to specify the CNV algorithm you are using.

Pre-requisites:

1) For each cohort, a directory (e.g. cohort1/, cohort2/) is required which contains:

   a) cohort1/cnv_*.txt
      - Files which contain the CNV predictions made by a CNV segmentation algorithm, such as cnvHap, PennCNV, QuantiSNP, BirdSuite, etc.
      - There can be multiple files; cnvPipe will read in every file whose name starts with ‘cnv’ and ends with ‘.txt’.
      - CNV files should be tab- or space-delimited and contain the following information for each predicted CNV in each sample (the header names and column order can be different; you can specify the header names in the configuration of the shell script, see ‘Modifying the cnvPipe.sh script’):

            Sample Chr FirstProbe LastProbe NbSnp length_min Type Avg_certainty

        i.   ‘Sample’ is the id of a given sample
        ii.  ‘Chr’ is the chromosome of the CNV
        iii. ‘FirstProbe’ is the bp position of the first probe within the CNV
        iv.  ‘LastProbe’ is the bp position of the last probe within the CNV
        v.   ‘Type’ is the copy number (0,1,2,3,4)
        vi.  ‘Avg_certainty’ is the probability that the CNV is genuine (between 0 and 1)
      - Extra columns are allowed but will not be read by the program.

   b) cohort1/snps.txt
      - A single file which contains the co-ordinates of each SNP, with the entries chromosome and base pair position, e.g.

            chr1 10004

      - This file does not have a header and should be tab-delimited.
      - Extra columns are permitted but will not be read.
      - This can be derived from the ‘build’ file of the genotyping chip you are using.

   c) cohort1/samples.txt
      - A single file in which the first column lists all of the samples for which CNV segmentation was run. This should not include any samples which failed QC.
      - The samples.txt file can have more columns, but they will not be read (tab-delimited if there is more than one column). This file does not have a header.

2) You will also need the script cnvPipe.sh and the lib/ directory (which contains the executables) in the working directory (which is also the parent directory of the project files).

3) You will need a working installation of Java, version 1.6 or above.

4) If you want to take advantage of the power calculations, you will also need R installed, as well as the Java-R interface (http://www.rforge.net/rJava/).

Running the program

Stage 1: Calculating the CNV regions and genotypes at these regions for each project

In the working directory, type the following command for each project:

    ./cnvPipe.sh cohort1/ 0:1 0.5 5

The second argument lists the copy numbers to interrogate, separated by ‘:’ (all deletions in this case). The third and fourth arguments are optional, and only required if power calculations are desired. The third argument is the disease penetrance (here we are saying that 50% of the cohort have the disease) and the fourth argument is the odds ratio (5 in this case). If you want to investigate different penetrances and odds ratios, you can supply several values with a ‘:’ separator, e.g.

    ./cnvPipe.sh cohort1/ 0:1 0.1:0.2:0.5 3:5:10

This will generate the output file cohort1/genos_0_1.txt. Each line of this file contains information on a CNV region (CNVR). The first 9 columns of the output file are as follows:

    Sample Chr FirstProbe LastProbe Avg_certainty regionId NbSnp length_min Type

You will notice that these column headings (with the exception of the last one) are the same as those of the input ‘cnv_*.txt’ files. The reason for this will become apparent in the second stage of the analysis, which combines these output files across multiple cohorts: the first 8 columns of each output file become the input for that stage.
In this case the ‘Sample’ which will be assigned to each CNVR is simply the name of the cohort. ‘Chr’ is the chromosome location of the CNVR, and ‘FirstProbe’ and ‘LastProbe’ are the locations of the first and last probes within the CNVR. ‘NbSnp’ lists the number of probes within the CNVR, and ‘length_min’ lists the length of this region.

The ‘Type’ entry is now a little different from that in the input cnv_*.txt files. In those files, it was simply the copy number of the event, but now different individuals may have different copy number aberrations within the same CNVR. So ‘Type’ is now a ‘:’-separated list of numbers, which specify the number of individuals in that dataset that had each copy number. The first number in this list is the number of individuals with a double deletion, followed by the number of individuals with a single deletion, and so on. ‘Avg_certainty’ is also a ‘:’-separated list of numbers, with each number representing the average certainty of the CNV predictions made in that class. Lastly, ‘regionId’ is an automatically generated identifier for this CNVR.

If disease penetrance and odds ratio parameters were specified, then the next four columns correspond to the lower and upper bounds of the possible odds ratios and the corresponding p-values which would be achieved with these odds ratios. (It is usually impossible to achieve the odds ratio exactly as specified, so cnvPipe finds the largest possible odds ratio less than, and the smallest possible odds ratio greater than, the specified odds ratio.) For each set of disease parameters specified, 4 columns will appear in the output.

The remaining columns correspond to the genotypes for each sample at each of the CNVRs. A ‘normal’ copy number of 2 is indicated by a ‘.’; otherwise the copy number of the sample in that region is given.
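As an illustration of how the ‘:’-separated ‘Type’ list described above can be read, the following is a minimal sketch, not part of cnvPipe. The helper name type_counts and the example value 3:12:85 are hypothetical, and the sketch assumes that the first entry of the list corresponds to copy number 0, the second to copy number 1, and so on.

```shell
#!/bin/sh
# Hypothetical helper (not part of cnvPipe): expand a ':'-separated Type value
# into one "copy_number<TAB>count" line per entry.
# Assumption: the first entry is copy number 0, the second copy number 1, etc.
type_counts() {
    printf '%s\n' "$1" |
        awk -F: '{ for (i = 1; i <= NF; i++) printf "%d\t%s\n", i - 1, $i }'
}

# Made-up Type value, for illustration only.
type_counts "3:12:85"
```

The same approach could be applied to the ‘Avg_certainty’ list, since it uses the same separator and ordering.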
Note that the same Stage 1 procedure can be carried out with

    ./cnvPipe.sh cohort1/ 3:4 0.5 5

to produce the cohort1/genos_3_4.txt file for duplications.

Stage 2: Calculating CNV regions from multiple cohorts

First, make a new directory, e.g. meta_0_1/ (and meta_3_4/ for amplifications), and copy into it the genos_0_1.txt file from each of the cohort directories, renaming them as follows (cnv_ prefix, followed by the cohort name, then .txt):

    cp cohort1/genos_0_1.txt meta_0_1/cnv_cohort1.txt
    cp cohort1/genos_3_4.txt meta_3_4/cnv_cohort1.txt

Note that only the first eight columns of these files are used, so you can also simply extract these first 8 columns, e.g.

    cut -f 1-8 cohort1/genos_0_1.txt > meta_0_1/cnv_cohort1.txt
    cut -f 1-8 cohort1/genos_3_4.txt > meta_3_4/cnv_cohort1.txt

Then re-run the previous step, but with the meta directories:

    ./cnvPipe.sh meta_0_1/ 0:1 0.5 5
    ./cnvPipe.sh meta_3_4/ 3:4 0.5 5

for deletions and amplifications respectively. This will generate the output files meta_0_1/genos_0_1.txt and meta_3_4/genos_3_4.txt, whose format is described above. In the ‘Type’ column, the total numbers of samples in each copy number category are given, as described above.

Stage 3: Re-calculating the CNV genotypes for each CNVR defined in Stage 2

Stage 3 is exactly the same as Stage 1, except that you put a ‘regions.txt’ file in the parent directory, which specifies the regions for which you want to calculate the genotypes. This regions file just consists of the first 8 columns of meta_0_1/genos_0_1.txt (for deletions) or meta_3_4/genos_3_4.txt (for amplifications). So, for example, you can run

    cp meta_0_1/genos_0_1.txt regions.txt
    ./cnvPipe.sh cohort1/ 0:1 0.5 5

to generate a new cohort1/genos_0_1.txt, which now has CNVRs as defined by the regions.txt file.
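Since the regions file only needs the first 8 columns, the copy step for deletions could also be written with cut. The following is a minimal sketch under that assumption; the helper name make_regions is hypothetical and not part of cnvPipe, and it assumes the genos file is tab-delimited (the default delimiter for cut).

```shell
#!/bin/sh
# Hypothetical helper (not part of cnvPipe): write the first 8 tab-separated
# columns of a meta-analysis genos file to a regions file.
# Usage: make_regions <genos_file> [regions_file]   (default: regions.txt)
make_regions() {
    genos_file=$1
    regions_file=${2:-regions.txt}
    # Stop early if the input is missing, so no empty regions file is written.
    [ -r "$genos_file" ] || { echo "cannot read $genos_file" >&2; return 1; }
    cut -f 1-8 "$genos_file" > "$regions_file"
}

# Example for deletions (paths as used in this guide):
#   make_regions meta_0_1/genos_0_1.txt regions.txt
#   ./cnvPipe.sh cohort1/ 0:1 0.5 5
```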
To repeat Stage 3 for amplifications, type

    cp meta_3_4/genos_3_4.txt regions.txt
    ./cnvPipe.sh cohort1/ 3:4 0.5 5

Modifying the cnvPipe.sh script

Currently the cnvPipe.sh script assumes cnvHap input, via the variable

    cnv_header="Sample:Chr:FirstProbe:LastProbe:NbSnp:length_min:Type:Avg_certainty"

This variable should be modified to use output from other programs, but the order of the information should remain the same. As an example, for PennCNV tab-delimited output:

    cnv_header="samplefile:chr:start:end:numSnps:NA:CN:NA"
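Because the order of the fields in cnv_header must stay the same, a modified value should still have 8 ‘:’-separated entries, as both examples above do. The following is a minimal sketch of such a check; the function check_cnv_header is hypothetical and not part of cnvPipe.sh, and it only counts fields, it does not verify that the names match the columns in your CNV files.

```shell
#!/bin/sh
# Hypothetical check (not part of cnvPipe.sh): succeed only if a cnv_header
# value has exactly 8 ':'-separated fields, like the cnvHap default.
check_cnv_header() {
    n=$(printf '%s\n' "$1" | awk -F: '{ print NF }')
    [ "$n" -eq 8 ]
}

# PennCNV example value taken from this guide.
cnv_header="samplefile:chr:start:end:numSnps:NA:CN:NA"
if check_cnv_header "$cnv_header"; then
    echo "cnv_header has 8 fields"
else
    echo "cnv_header should have 8 ':'-separated fields" >&2
fi
```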