File S1

Here we detail the steps taken to identify and genotype SNPs from RAD-seq data using the bioinformatic pipeline described and developed in:

Miller MR, Brunelli JP, Wheeler PA, et al. (2012) A conserved haplotype controls parallel adaptation in geographically distant salmonid populations. Molecular Ecology 21, 237-249.

All of the scripts referred to here can be obtained by contacting the corresponding author of Miller et al. (2012), with the exception of the "GenotyperII.pl" script, which can be obtained by contacting Ben Hecht at hecb@critfc.org. The following is an example only and should not be interpreted as the optimal method for all data sets. Commands are preceded by "%" and were executed in a Unix environment.

Preliminary steps

Create a directory for all raw Illumina sequence data and change to the new directory. Move or copy all raw Illumina data for the current project into this directory.
% mkdir Omy_Smolt
% cd Omy_Smolt

Create a directory for individual .fastq RAD sequence files
% mkdir Individuals

Create a directory for all RAD generated secondary files from the pipeline
% mkdir RAD

Proceed through pipeline

Step 1) Count raw reads for each library (the output value should be divided by four to get the number of reads, because each read occupies four lines in a fastq file)
% wc -l Omy_Smolt1.fastq.txt

Step 2) Quality filter (edit QualityFilter.pl script parameters: percent_filter = 80; $length = 71; $phred = 33). We recommend using fastq quality control software (e.g. FastQC V0.10.1) to determine the optimal trim length of sequence for this step.
% perl QualityFilter.pl Omy_Smolt1.fastq.txt > Omy_Smolt1.qf.fastq.txt

Step 3) Barcode split, where ****** represents the barcode sequence of an individual. This step is repeated for each individual.
% perl BarcodeSplit.pl Omy_Smolt1.qf.fastq.txt ******TGCAGG > Individuals/ID001.qf.fastq.txt

Step 4) Count quality filtered reads for each library and each individual
% wc -l Omy_Smolt*.qf.fastq.txt
% wc -l Individuals/*.qf.fastq.txt

Step 5) Standardize the number of sequences from 10 individuals (6 from Yakima River and 4 from Upper Mann Creek that have at least 3 million quality filtered reads) to 3,000,000 reads (12,000,000 lines, at four lines per read) to minimize ascertainment bias. The individuals selected here will serve as the index for SNP discovery. It is advisable to select enough individuals across the populations to adequately represent the genetic variation in the sample and to minimize ascertainment bias, but not so many as to increase the probability of identifying a sequencing error as a true SNP.
% head -n 12000000 Individuals/ID009.qf.fastq.txt > RAD/YakS1.head
% head -n 12000000 Individuals/ID012.qf.fastq.txt > RAD/YakS2.head
% head -n 12000000 Individuals/ID033.qf.fastq.txt > RAD/YakS3.head
% head -n 12000000 Individuals/ID084.qf.fastq.txt > RAD/YakR1.head
% head -n 12000000 Individuals/ID104.qf.fastq.txt > RAD/YakR2.head
% head -n 12000000 Individuals/ID112.qf.fastq.txt > RAD/YakR3.head
% head -n 12000000 Individuals/ID139.qf.fastq.txt > RAD/UMCR1.head
% head -n 12000000 Individuals/ID152.qf.fastq.txt > RAD/UMCR2.head
% head -n 12000000 Individuals/ID153.qf.fastq.txt > RAD/UMCS1.head
% head -n 12000000 Individuals/ID160.qf.fastq.txt > RAD/UMCS2.head

Step 6) Concatenate sequences into pools representing each population
% cat RAD/YakS1.head RAD/YakS2.head RAD/YakS3.head RAD/YakR1.head RAD/YakR2.head RAD/YakR3.head > RAD/Yakima.cat.qf.fastq
% cat RAD/UMCR1.head RAD/UMCR2.head RAD/UMCS1.head RAD/UMCS2.head > RAD/UMC.cat.qf.fastq

Step 7) Hash sequences. This step identifies all unique sequences within a fastq file and outputs a file that contains a sequence ID for each unique sequence and a count of how many times the sequence occurred, and appends a user-specified identifier ("Yak" or "UMC", as specified below).
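
As a rough illustration of the counting described in Step 7, the sketch below tallies the unique sequences in a fastq file with standard Unix tools. It is not the HashSeqs.pl implementation: the helper name hash_sketch, the ID scheme (label_N) and the tab-separated output layout are assumptions made for illustration, and the real script's output format may differ.

```shell
# Illustrative only: hash_sketch is a hypothetical helper, not part of the pipeline.
# Usage: hash_sketch FASTQ LABEL
hash_sketch() {
    fastq=$1
    label=$2
    # In a four-line fastq record the sequence is the 2nd line.
    awk 'NR % 4 == 2' "$fastq" |
        sort |
        uniq -c |
        # Most frequent sequences first; ties ordered by sequence.
        sort -k1,1nr -k2,2 |
        # Assumed output columns: ID (label_N), count, sequence.
        awk -v label="$label" '{ printf "%s_%d\t%d\t%s\n", label, NR, $1, $2 }'
}
```

For example, `hash_sketch RAD/Yakima.cat.qf.fastq Yak | head` would list the most frequent sequences in the Yakima pool. The pipeline itself uses HashSeqs.pl, as in the commands that follow.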
% perl HashSeqs.pl RAD/Yakima.cat.qf.fastq Yak > RAD/Yak.hash
% perl HashSeqs.pl RAD/UMC.cat.qf.fastq UMC > RAD/UMC.hash

Step 8) Examine histograms of hash files. Here we identify the number of sequences that occur infrequently in the hash files, which likely represent sequencing errors. Removing all the sequences that occur infrequently (for example, only 1 or 2 times) from these files speeds up the computation in the following alignments.
% perl PrintHashHisto.pl RAD/Yak.hash
% perl PrintHashHisto.pl RAD/UMC.hash

Step 9) Concatenate Yak and UMC hash files into a single hash file
% cat RAD/Yak.hash RAD/UMC.hash > RAD/Smolt.hash

Step 10) Create Novoindex of Smolt.hash file (see http://www.novocraft.com/main/index.php for software and documentation)
% novoindex RAD/Smolt.novoindex RAD/Smolt.hash

Step 11) Run alignment of novoindex and hash file using Novoalign
% novoalign -r E 20 -t 250 -d RAD/Smolt.novoindex -f RAD/Smolt.hash > RAD/Smolt.novoalign

Step 12) Place sequences into distinct loci. Edit the IdentifyLoci.pl script with the following parameters:
# max_alignment_score = 30;
# divergence_factor = 3;
# min_count = 6;
# max_internal_alignments = 1;
# min_internal_alignments = 0;
# max_external_alignments = 2;
# min_external_alignments = 1;
# max_total_alignments = 3;
# min_total_alignments = 1;
# min_alleles = 2;
# max_alleles = 2;
% perl IdentifyLoci.pl RAD/Smolt.novoalign > RAD/Smolt.loci

Step 13) Look at the histogram of the Smolt.loci file to determine whether min_count = 6 is appropriate (may need to start lower, e.g.
4, or go higher)
% perl PrintLociHisto.pl RAD/Smolt.loci

Step 14) Generate alignment index for the loci
% novoindex RAD/Smolt.loci.novoindex RAD/Smolt.loci

Step 15) Align quality filtered (qf) reads from each individual to the loci index
% mkdir IndNovo
% novoalign -t 0 -d RAD/Smolt.loci.novoindex -f Individuals/ID001.qf.fastq.txt > IndNovo/ID001.novo

Step 16) Count number of reads present for each allele in each individual
% mkdir Count
% perl CountAlleles.pl RAD/Smolt.loci IndNovo/ID001.novo > Count/ID001.count

Step 17) Convert allele counts to genotypes. Select options: Min count = 5, het threshold = 10, output = [2] Genotypes only. GenotyperII.pl must be in the Count directory, with no other files except the *.count files. Execute the script from the Count directory.
% cd Count
% perl GenotyperII.pl
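
Steps 15 and 16 above are shown for a single individual (ID001). Assuming they are meant to be repeated for every file in the Individuals directory, the sketch below prints the corresponding pair of commands for each individual so they can be reviewed before being run. The helper name print_individual_commands is hypothetical; the commands it prints are the ones given in Steps 15 and 16, with only the individual ID substituted.

```shell
# Hypothetical helper: prints (does not run) the Step 15 and Step 16 commands
# for every *.qf.fastq.txt file in the given directory (default: Individuals).
print_individual_commands() {
    dir=${1:-Individuals}
    for fq in "$dir"/*.qf.fastq.txt; do
        # Skip the unexpanded pattern when no files match.
        [ -e "$fq" ] || continue
        id=$(basename "$fq" .qf.fastq.txt)
        # Step 15: align this individual's reads to the loci index.
        echo "novoalign -t 0 -d RAD/Smolt.loci.novoindex -f $fq > IndNovo/$id.novo"
        # Step 16: count reads per allele for this individual.
        echo "perl CountAlleles.pl RAD/Smolt.loci IndNovo/$id.novo > Count/$id.count"
    done
}
```

Run from the project directory after creating IndNovo and Count, the printed commands can be inspected and then executed, for example with `print_individual_commands Individuals | sh`. This assumes file names without spaces or shell metacharacters, as with the ID001-style names used here.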