cnvPipe – A Copy Number Variation Meta-Analysis Pipeline
Aim:
The aim of cnvPipe is to enable CNV meta-analyses between cohorts which have
been analysed with different CNV prediction algorithms, and assayed on different
genotyping chips. See section ‘Modifying the cnvPipe.sh script’ for information
on how to specify the CNV algorithm you are using.
Pre-requisites:
1) For each cohort, a directory (e.g. cohort1/, cohort2/) is required which contains
a) cohort1/cnv_*.txt
- Files which contain the CNV predictions made by a CNV segmentation
algorithm, such as cnvHap, PennCNV, QuantiSNP, BirdSuite, etc.
- There can be multiple files; cnvPipe will read in everything starting with
‘cnv’ and ending with ‘.txt’
- CNV files should be tab or space-delimited and contain the following
information for each predicted CNV in each sample (although the
header names and order can be different, you can specify the header
names in the configuration of the shell script below):
Sample Chr FirstProbe LastProbe NbSnp length_min Type Avg_certainty
i. ‘Sample’ is the ID of a given sample
ii. ‘Chr’ is the chromosome of the CNV
iii. ‘FirstProbe’ is the bp position of the first probe within the CNV
iv. ‘LastProbe’ is the bp position of the last probe within the CNV
v. ‘Type’ is the copy number (0,1,2,3,4)
vi. ‘Avg_certainty’ is the probability that the CNV is genuine (between 0 and 1)
- Extra columns are allowed but will not be read by the program
b) cohort1/snps.txt
- A single file which contains the co-ordinates of each SNP, with the
entries chromosome and base pair position, e.g. chr1 10004
- This file does not have a header and should be tab-delimited
- Extra columns are permitted but will not be read
- This can be derived from the ‘build’ file of the genotyping chip you are using.
c) cohort1/samples.txt
- A single file in which the first column lists all of the samples for which
CNV segmentation was run. This should not include any samples
which failed QC. This can have more columns, but they will not be
read (tab-delimited if there is more than one column). This file does
not have a header.
2) You will also need to have the script cnvPipe.sh and the lib/ directory (which
contains the executables) in the working directory (which is also the parent directory of
the project files).
3) You will need to have a working version of Java installed, version 1.6 or above.
4) If you want to take advantage of the power calculations, you will also need to have
R installed, as well as the Java-R interface rJava (http://www.rforge.net/rJava/).
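The file requirements in item 1 can be checked before a run. The following is a minimal sketch, not part of cnvPipe: `check_cohort` is a hypothetical helper, and the toy cohort it is tried on is invented for illustration. It only tests that the three kinds of input file exist and that snps.txt has at least two tab-separated columns.

```shell
#!/usr/bin/env bash
# Sketch only: check_cohort is a hypothetical helper, not shipped with cnvPipe.
check_cohort() {
    local dir=${1%/} rc=0 f
    # At least one file starting with 'cnv' and ending with '.txt' must exist.
    if ! ls "$dir"/cnv*.txt >/dev/null 2>&1; then
        echo "missing: $dir/cnv*.txt" >&2; rc=1
    fi
    # snps.txt and samples.txt must exist and be non-empty.
    for f in snps.txt samples.txt; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2; rc=1
        fi
    done
    # Every snps.txt line needs at least chromosome<TAB>position.
    if [ -s "$dir/snps.txt" ] &&
       ! awk -F'\t' 'NF < 2 { bad = 1 } END { exit bad + 0 }' "$dir/snps.txt"; then
        echo "snps.txt: fewer than two tab-separated columns on some line" >&2; rc=1
    fi
    return $rc
}

# Toy cohort with invented contents, built in a temporary directory.
demo=$(mktemp -d)
mkdir "$demo/cohort1"
printf 'Sample\tChr\tFirstProbe\tLastProbe\tNbSnp\tlength_min\tType\tAvg_certainty\n' \
    > "$demo/cohort1/cnv_batch1.txt"
printf 'chr1\t10004\n' > "$demo/cohort1/snps.txt"
printf 'S1\nS2\n' > "$demo/cohort1/samples.txt"
check_cohort "$demo/cohort1/" && echo "cohort1 looks complete"
```

For a real cohort the call would simply be `check_cohort cohort1/` from the working directory.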
Running the program
Stage 1: Calculating the CNV regions and genotypes at these regions for each project
In the working directory, you can type the following command, for each project
 ./cnvPipe.sh cohort1/ 0:1 0.5 5
Here the second argument lists the copy numbers to interrogate, separated by ‘:’ (all
deletions in this case). The third and fourth arguments are optional, and only required
if power calculations are desired. In this case the third argument is the disease
penetrance (here we are saying that 50% of the cohort have the disease) and the fourth
argument is the odds ratio (5 in this case). If you want to investigate several
penetrances and odds ratios, you can give each as a ‘:’-separated list, e.g.
 ./cnvPipe.sh cohort1/ 0:1 0.1:0.2:0.5 3:5:10
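Since the command has to be repeated for each cohort, it can be wrapped in a loop. The sketch below only prints the commands by default; the cohort names, the `DRY_RUN` switch and the `stage1_cmd` helper are illustrative assumptions, not features of cnvPipe.sh.

```shell
#!/usr/bin/env bash
# Sketch only: DRY_RUN and stage1_cmd are hypothetical, not part of cnvPipe.sh.
DRY_RUN=${DRY_RUN:-1}        # set DRY_RUN=0 to actually call cnvPipe.sh
copy_numbers="0:1"           # deletions; use "3:4" for duplications
penetrances="0.1:0.2:0.5"    # optional, only needed for power calculations
odds_ratios="3:5:10"         # optional, only needed for power calculations

stage1_cmd() {
    # Print the Stage 1 command for one cohort directory (trailing slash enforced).
    printf './cnvPipe.sh %s/ %s %s %s\n' \
        "${1%/}" "$copy_numbers" "$penetrances" "$odds_ratios"
}

for cohort in cohort1 cohort2; do
    cmd=$(stage1_cmd "$cohort")
    if [ "$DRY_RUN" = 1 ]; then
        echo "$cmd"          # show what would be run
    else
        $cmd                 # run it from the working directory
    fi
done
```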
This will generate the output file cohort1/genos_0_1.txt. Each line of this file contains
information on a CNV region (CNVR). The first 9 columns of the output file are as
follows:
Sample Chr FirstProbe LastProbe NbSnp length_min Type Avg_certainty regionId
You will notice that these column headings (with the exception of the last one) are the
same as in the input cnv_*.txt files. The reason for this will become apparent when we
move on to the second stage of this analysis, which combines these output files
across multiple cohorts: the first 8 columns of each output file become the input
for that stage.
In this case the ‘Sample’ which will be assigned to each CNVR is simply the name of
the cohort. ‘Chr’ is the chromosome location of the CNVR, and FirstProbe and
LastProbe are the locations of the first and last probes within the CNVR. ‘NbSnp’
lists the number of probes within the CNV, ‘length_min’ lists the length of this region.
The ‘Type’ entry is now a little different from that in the cnv_*.txt files above. In those
files we simply had the copy number of the event, but now different individuals may have
different copy number aberrations within the same CNVR. So ‘Type’ is now a ‘:’-separated
list of numbers, which specify the number of individuals in that dataset that
had each copy number. The first number in this list is the number of individuals with a
double deletion, followed by the number of individuals with a single deletion, and so
on. ‘Avg_certainty’ is also a ‘:’-separated list of numbers, with each number
representing the average certainty of the CNV predictions made in that class.
Lastly the ‘regionId’ is an automatically generated identifier for this CNVR.
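As an illustration of how the ‘:’-separated ‘Type’ and ‘Avg_certainty’ entries line up, the sketch below splits both and prints one line per copy-number class. `describe_cnvr` is a hypothetical helper and the values are invented; it assumes the two lists have the same length and the same ordering of classes.

```shell
#!/usr/bin/env bash
# Sketch only: describe_cnvr is a hypothetical helper; the values are invented.
describe_cnvr() {
    # $1 = Type entry, $2 = Avg_certainty entry (both ':'-separated lists)
    awk -v type="$1" -v cert="$2" 'BEGIN {
        n = split(type, count, ":")      # individuals per copy-number class
        split(cert, certainty, ":")      # average certainty per class
        total = 0
        for (i = 1; i <= n; i++) {
            printf "class %d: %d individuals, average certainty %s\n",
                   i, count[i], certainty[i]
            total += count[i]
        }
        printf "total: %d individuals with a CNV call\n", total
    }'
}

# For a deletion run (0:1), class 1 is the double deletion and class 2 the
# single deletion, following the description above.
describe_cnvr "2:15" "0.91:0.84"
```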
If disease penetrance and odds ratio parameters were specified, then the next four
columns correspond to the lower and upper bounds of the possible odds ratios and the
corresponding p-values which would be achieved with these odds ratios. (It is
usually impossible to achieve the specified odds ratio exactly, so cnvPipe works
out the largest possible odds ratio less than, and the smallest possible odds ratio
greater than, the specified value.) For each set of disease parameters specified,
4 columns will appear in the output.
The remaining columns correspond to the genotypes of each sample at each
CNVR. A ‘normal’ copy number of 2 is indicated by a ‘.’; otherwise the copy
number of the sample in that region is given.
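The genotype columns can be summarised with standard tools. The sketch below counts, for each CNVR row, how many samples have an entry other than ‘.’. It assumes the file is tab-delimited with one header line, and that the caller knows how many leading non-genotype columns there are (9, plus 4 for each set of disease parameters); `count_carriers` and the toy file are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: count_carriers is a hypothetical helper; the toy file is invented.
count_carriers() {
    # $1 = genos file, $2 = number of leading non-genotype columns
    awk -F'\t' -v skip="$2" 'NR > 1 {        # NR > 1 skips the assumed header
        carriers = 0
        for (i = skip + 1; i <= NF; i++)
            if ($i != ".") carriers++         # "." marks normal copy number 2
        printf "row %d: %d of %d samples carry a CNV\n", NR - 1, carriers, NF - skip
    }' "$1"
}

# Toy file: 9 placeholder columns followed by genotypes for 3 samples.
toy=$(mktemp)
{
    printf 'c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tS1\tS2\tS3\n'
    printf 'x\tx\tx\tx\tx\tx\tx\tx\tx\t.\t1\t0\n'
    printf 'x\tx\tx\tx\tx\tx\tx\tx\tx\t.\t.\t1\n'
} > "$toy"
count_carriers "$toy" 9
```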
Note, the same procedure can be carried out with
 ./cnvPipe.sh cohort1/ 3:4 0.5 5
to produce the file cohort1/genos_3_4.txt for duplications.
Stage 2: Calculating CNV regions from multiple cohorts
Firstly, you have to make a new directory, e.g. meta_0_1/ (and meta_3_4/ for
amplifications), and copy into this directory the genos_0_1.txt file from each of the
cohort directories, renaming each one as follows (cnv_ prefix, followed by the cohort
name, then .txt):
 cp cohort1/genos_0_1.txt meta_0_1/cnv_cohort1.txt
 cp cohort1/genos_3_4.txt meta_3_4/cnv_cohort1.txt
Note, only the first eight columns of these files are used, so you can also simply
extract these first 8 columns, e.g.
 cut -f 1-8 cohort1/genos_0_1.txt > meta_0_1/cnv_cohort1.txt
 cut -f 1-8 cohort1/genos_3_4.txt > meta_3_4/cnv_cohort1.txt
Then you can simply re-run the previous step, but with the meta_0_1 directory:
 ./cnvPipe.sh meta_0_1/ 0:1 0.5 5
 ./cnvPipe.sh meta_3_4/ 3:4 0.5 5
for deletions and amplifications respectively. This will generate the output files
meta_0_1/genos_0_1.txt and meta_3_4/genos_3_4.txt, in the format described above.
In the ‘Type’ column, the total numbers of samples in each copy number category
are given, as described above.
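With more than a couple of cohorts, populating the meta directories by hand becomes tedious. The following sketch automates the `cut` step described in this stage; `collect_for_meta` is a hypothetical helper, and it assumes it is run from the working directory that holds the cohort directories.

```shell
#!/usr/bin/env bash
# Sketch only: collect_for_meta is a hypothetical helper, not part of cnvPipe.
collect_for_meta() {
    # $1 = copy-number tag ("0_1" or "3_4"); remaining arguments = cohort directories
    local tag=$1 cohort name src
    shift
    mkdir -p "meta_$tag"
    for cohort in "$@"; do
        name=$(basename "$cohort")
        src="${cohort%/}/genos_$tag.txt"
        if [ -f "$src" ]; then
            # Keep only the first 8 columns, named cnv_<cohort>.txt
            cut -f 1-8 "$src" > "meta_$tag/cnv_$name.txt"
        else
            echo "skipping $name: $src not found" >&2
        fi
    done
}

# Example call (commented out so that nothing is written by accident):
# collect_for_meta 0_1 cohort1/ cohort2/
```

After `collect_for_meta 0_1 cohort1/ cohort2/`, the directory meta_0_1/ would contain cnv_cohort1.txt and cnv_cohort2.txt, ready for the `./cnvPipe.sh meta_0_1/ 0:1 0.5 5` call.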
Stage 3: Re-calculating the CNV genotypes for each CNVR defined in Stage 2
Stage 3 is exactly the same as Stage 1, with the exception that you put a
‘regions.txt’ file in the parent directory, which specifies the regions you want to
calculate the genotypes for. This regions file just consists of the first 8 columns of
meta_0_1/genos_0_1.txt (for deletions) or meta_3_4/genos_3_4.txt (for amplifications).
So, for example, you can run
 cp meta_0_1/genos_0_1.txt regions.txt
 ./cnvPipe.sh cohort1/ 0:1 0.5 5
to generate a new cohort1/genos_0_1.txt, which now has CNVR as defined by the
regions.txt file. To do the same for amplifications, type
 cp meta_3_4/genos_3_4.txt regions.txt
 ./cnvPipe.sh cohort1/ 3:4 0.5 5
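Because both examples write to the same regions.txt, the order of the commands matters when several cohorts are processed: all cohorts should be re-run for one class of CNV before regions.txt is replaced for the other. The sketch below only prints such a sequence; `stage3_plan` and the cohort names are hypothetical, and the power-calculation arguments are copied from the examples above.

```shell
#!/usr/bin/env bash
# Sketch only: stage3_plan is a hypothetical helper that prints commands
# without running them.
stage3_plan() {
    local tag copy_numbers cohort
    for tag in 0_1 3_4; do
        copy_numbers=$(printf '%s' "$tag" | tr '_' ':')   # "0_1" -> "0:1"
        # Install the regions for this class of CNV first ...
        echo "cp meta_$tag/genos_$tag.txt regions.txt"
        # ... then re-run every cohort before regions.txt is replaced again.
        for cohort in "$@"; do
            echo "./cnvPipe.sh ${cohort%/}/ $copy_numbers 0.5 5"
        done
    done
}

stage3_plan cohort1 cohort2
```

The printed lines can be reviewed and then piped to `sh` from the working directory.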
Modifying the cnvPipe.sh script
Currently the cnvPipe.sh script assumes cnvHap input, via the variable
cnv_header="Sample:Chr:FirstProbe:LastProbe:NbSnp:length_min:Type:Avg_certainty"
This variable should be modified to use output from other programs, but the order of
the information should remain the same. As an example, for PennCNV tab-delimited
output:
cnv_header="samplefile:chr:start:end:numSnps:NA:CN:NA"
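To check a new cnv_header value before editing the script, the names can be lined up against the expected order. This is a sketch only: `show_header_map` is a hypothetical helper, and reading ‘NA’ as "this tool does not provide the column" is an inference from the PennCNV example rather than documented behaviour.

```shell
#!/usr/bin/env bash
# Sketch only: show_header_map is a hypothetical helper, not part of cnvPipe.sh.
show_header_map() {
    # $1 = candidate cnv_header value (':'-separated header names)
    local expected="Sample:Chr:FirstProbe:LastProbe:NbSnp:length_min:Type:Avg_certainty"
    awk -v want="$expected" -v have="$1" 'BEGIN {
        n = split(want, w, ":")
        m = split(have, h, ":")
        if (m != n) { printf "expected %d names, got %d\n", n, m; exit 1 }
        for (i = 1; i <= n; i++)
            printf "%-14s <- %s\n", w[i], (h[i] == "NA" ? "(not provided)" : h[i])
    }'
}

show_header_map "samplefile:chr:start:end:numSnps:NA:CN:NA"
```

Each output line pairs a cnvHap-style field with the header name that will be read for it, so a misplaced or missing entry is easy to spot.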