mec12082-sup-0003-FileS1

File S1
Here we detail the steps taken to identify and genotype SNPs from RAD-seq data
using the bioinformatic pipeline described and developed in:
Miller MR, Brunelli JP, Wheeler PA, et al. (2012) A conserved haplotype controls parallel adaptation in
geographically distant salmonid populations. Molecular Ecology 21, 237-249.
All of the scripts referred to here can be obtained by contacting the corresponding author of Miller
et al. (2012), with the exception of the "GenotyperII.pl" script, which can be obtained by contacting Ben
Hecht at hecb@critfc.org.
The following is an example only and should not be interpreted as the optimal method for all data
sets.
Commands are preceded by "%" and were executed in a Unix environment.
Preliminary steps
Create a directory for all raw Illumina sequence data and change to the new directory.
Move or copy all raw Illumina data for current project into this directory.
% mkdir Omy_Smolt
% cd Omy_Smolt
Create a directory for the individual .fastq RAD sequence files.
% mkdir Individuals
Create a directory for all secondary files generated by the RAD pipeline.
% mkdir RAD
Proceed through pipeline
Step 1) Count raw reads for each library (each FASTQ record spans four lines, so the reported line
count should be divided by four to get the number of reads)
% wc -l Omy_Smolt1.fastq.txt
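The division by four can be done in the shell. The following is a minimal sketch only, assuming uncompressed FASTQ files with exactly four lines per record; the function name count_reads is hypothetical and is not one of the pipeline scripts.

```shell
# count_reads: hypothetical helper, not part of the Miller et al. (2012) scripts.
# Assumes every FASTQ record occupies exactly four lines.
count_reads() {
    lines=$(wc -l < "$1")   # total number of lines in the file
    echo $((lines / 4))     # four lines per read
}
# Example call: count_reads Omy_Smolt1.fastq.txt
```

The same function can be applied to the quality-filtered files counted in Step 4.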
Step 2) Quality filter
(edit QualityFilter.pl script parameters: percent_filter = 80; $length = 71; $phred = 33)
We recommend using FASTQ quality control software (e.g. FastQC v0.10.1) to determine the optimal trim
length for this step.
% perl QualityFilter.pl Omy_Smolt1.fastq.txt > Omy_Smolt1.qf.fastq.txt
Step 3) Barcode split, where ****** represents the barcode sequence of an individual. This step is
repeated for each individual.
% perl BarcodeSplit.pl Omy_Smolt1.qf.fastq.txt ******TGCAGG > Individuals/ID001.qf.fastq.txt
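Because this command is repeated for every individual, it can be wrapped in a loop. The sketch below is an illustration only: it assumes a two-column, whitespace-separated text file (here called barcodes.txt, a hypothetical name) listing each individual ID and its barcode, and the function name split_all is also hypothetical. It should be run from the project directory so that BarcodeSplit.pl and the Individuals directory are found.

```shell
# split_all: hypothetical wrapper around the Step 3 command.
# $1 = quality-filtered library file
# $2 = assumed barcode table, one "<individual ID> <barcode>" pair per line
split_all() {
    library=$1
    table=$2
    while read -r id barcode; do
        # TGCAGG is appended to the barcode exactly as in the Step 3 command.
        # stdin is redirected so the script cannot consume the barcode table.
        perl BarcodeSplit.pl "$library" "${barcode}TGCAGG" \
            < /dev/null > "Individuals/${id}.qf.fastq.txt"
    done < "$table"
}
# Example call: split_all Omy_Smolt1.qf.fastq.txt barcodes.txt
```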
Step 4) Count quality-filtered reads for each library and each individual
% wc -l Omy_Smolt*.qf.fastq.txt
% wc -l Individuals/*.qf.fastq.txt
Step 5) Standardize the number of sequences from 10 individuals (6 from Yakima River and 4 from
Upper Mann Creek, each with at least 3 million quality-filtered reads) to 3,000,000 reads each
(12,000,000 lines, at four lines per read) to minimize ascertainment bias. The individuals selected
here serve as the index for SNP discovery. We advise selecting enough individuals across the
populations to adequately represent the genetic variation in the sample and to minimize ascertainment
bias, but not so many as to increase the probability of identifying sequencing errors as true SNPs.
% head -n 12000000 Individuals/ID009.qf.fastq.txt > RAD/YakS1.head
% head -n 12000000 Individuals/ID012.qf.fastq.txt > RAD/YakS2.head
% head -n 12000000 Individuals/ID033.qf.fastq.txt > RAD/YakS3.head
% head -n 12000000 Individuals/ID084.qf.fastq.txt > RAD/YakR1.head
% head -n 12000000 Individuals/ID104.qf.fastq.txt > RAD/YakR2.head
% head -n 12000000 Individuals/ID112.qf.fastq.txt > RAD/YakR3.head
% head -n 12000000 Individuals/ID139.qf.fastq.txt > RAD/UMCR1.head
% head -n 12000000 Individuals/ID152.qf.fastq.txt > RAD/UMCR2.head
% head -n 12000000 Individuals/ID153.qf.fastq.txt > RAD/UMCS1.head
% head -n 12000000 Individuals/ID160.qf.fastq.txt > RAD/UMCS2.head
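The line count passed to head is the target number of reads multiplied by four. A minimal sketch of that arithmetic follows, assuming four lines per FASTQ record; the function name take_first_reads is hypothetical and not part of the pipeline.

```shell
# take_first_reads: hypothetical helper; keeps the first N reads of a FASTQ file.
# $1 = number of reads to keep, $2 = input file, $3 = output file
# Assumes every FASTQ record occupies exactly four lines.
take_first_reads() {
    n_reads=$1
    head -n $((n_reads * 4)) "$2" > "$3"
}
# Example call:
# take_first_reads 3000000 Individuals/ID009.qf.fastq.txt RAD/YakS1.head
```

Note that this keeps the first reads in file order, as the head commands above do, rather than drawing a random subsample.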
Step 6) Concatenate sequences into pools representing each population
% cat RAD/YakS1.head RAD/YakS2.head RAD/YakS3.head RAD/YakR1.head RAD/YakR2.head RAD/YakR3.head > RAD/Yakima.cat.qf.fastq
% cat RAD/UMCR1.head RAD/UMCR2.head RAD/UMCS1.head RAD/UMCS2.head > RAD/UMC.cat.qf.fastq
Step 7) Hash sequences (this step identifies all unique sequences within a fastq file and outputs a file
that contains a sequence ID for each unique sequence and a count of how many times the sequence
occurred, and appends a user-specified identifier, "Yak" or "UMC" as specified below)
% perl HashSeqs.pl RAD/Yakima.cat.qf.fastq Yak > RAD/Yak.hash
% perl HashSeqs.pl RAD/UMC.cat.qf.fastq UMC > RAD/UMC.hash
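To illustrate what counting unique sequences means, the sketch below tallies identical read sequences with awk. It is not HashSeqs.pl and does not reproduce that script's sequence IDs, identifier suffix, or output format; the function name count_unique_seqs is hypothetical, and four-line FASTQ records are assumed.

```shell
# count_unique_seqs: illustration only, not a replacement for HashSeqs.pl.
# Prints "<count> <sequence>" for each distinct read sequence, most frequent first.
# Assumes every FASTQ record occupies exactly four lines.
count_unique_seqs() {
    # Line 2 of each four-line record holds the sequence.
    awk 'NR % 4 == 2 { n[$0]++ } END { for (s in n) print n[s], s }' "$1" |
        sort -k1,1nr -k2,2
}
# Example call: count_unique_seqs RAD/Yakima.cat.qf.fastq | head
```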
Step 8) Examine Histograms of hash files
Here we identify the number of sequences that occur infrequently in the hash files, which likely
represent sequence errors. By removing all the sequences that occur infrequently (for example only 1 or
2 times) from these files we can speed the computation in the following alignments.
% perl PrintHashHisto.pl RAD/Yak.hash
% perl PrintHashHisto.pl RAD/UMC.hash
Step 9) Concatenate Yak and UMC hash files into single hash file
% cat RAD/Yak.hash RAD/UMC.hash > RAD/Smolt.hash
Step 10) Create Novoindex of Smolt.hash file (See http://www.novocraft.com/main/index.php for
software and documentation)
% novoindex RAD/Smolt.novoindex RAD/Smolt.hash
Step 11) Align the hash file against the novoindex using Novoalign
% novoalign -r E 20 -t 250 -d RAD/Smolt.novoindex -f RAD/Smolt.hash > RAD/Smolt.novoalign
Step 12) Place sequences into distinct loci
Edit the IdentifyLoci.pl script with the following parameters:
max_alignment_score = 30;
divergence_factor = 3;
min_count = 6;
max_internal_alignments = 1;
min_internal_alignments = 0;
max_external_alignments = 2;
min_external_alignments = 1;
max_total_alignments = 3;
min_total_alignments = 1;
min_alleles = 2;
max_alleles = 2;
% perl IdentifyLoci.pl RAD/Smolt.novoalign > RAD/Smolt.loci
Step 13) Look at the histogram of the Smolt.loci file to determine whether min_count = 6 is appropriate
(it may be necessary to start lower, e.g. 4, or to go higher)
% perl PrintLociHisto.pl RAD/Smolt.loci
Step 14) Generate alignment index for the loci
% novoindex RAD/Smolt.loci.novoindex RAD/Smolt.loci
Step 15) Align quality-filtered reads from each individual to the loci index
% mkdir IndNovo
% novoalign -t 0 -d RAD/Smolt.loci.novoindex -f Individuals/ID001.qf.fastq.txt > IndNovo/ID001.novo
Step 16) Count number of reads present for each allele in each individual
% mkdir Count
% perl CountAlleles.pl RAD/Smolt.loci IndNovo/ID001.novo > Count/ID001.count
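Steps 15 and 16 are shown for a single individual (ID001). A sketch of how both commands could be looped over every file in the Individuals directory follows. The function name align_and_count is hypothetical, and the sketch assumes it is run from the project directory after the IndNovo and Count directories have been created.

```shell
# align_and_count: hypothetical wrapper around the Step 15 and Step 16 commands.
# Assumes the current directory is the project directory and that the
# IndNovo and Count directories already exist.
align_and_count() {
    for fq in Individuals/*.qf.fastq.txt; do
        [ -e "$fq" ] || continue                  # skip if no files match
        id=$(basename "$fq" .qf.fastq.txt)        # e.g. ID001
        novoalign -t 0 -d RAD/Smolt.loci.novoindex -f "$fq" > "IndNovo/${id}.novo"
        perl CountAlleles.pl RAD/Smolt.loci "IndNovo/${id}.novo" > "Count/${id}.count"
    done
}
# Example call: align_and_count
```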
Step 17) Convert allele counts to genotypes
Select options: Min count = 5, het threshold = 10, output = [2] Genotypes only.
GenotyperII.pl must be in the Count directory, with no other files present except the *.count files.
Execute the script from the Count directory.
% cd Count
% perl GenotyperII.pl