ESM 2. Processing lobSTR vcf files. The program lobSTR is a likelihood-based program that makes allele
calls for tandem repeats in next-generation sequencing data and returns a vcf file (see below). This file
contains information about every repeat found. I was only interested in pure AC repeats, so I wrote a C++
program to assay each output line in the vcf file. My annotations are in red.
vcf header lines, all starting with a double hash (the final column-header line starts with a single hash):
##fileformat=VCFv4.1
##fileDate=2015-02-11.23:20:35
##source=# version=lobSTR_3.0.3;
##source=allelotype_3.0.3;command=classify;bam=/keep/c06138c2e230390ebd5c158a19bc5780+487/test.sorted.bam;noisemodel=/keep/513a1bdbbc2ac2165a0b84e37ab91e31+10812/models/illumina_v3.pcrfree;out=HG00143;strinfo=/keep/d341a6f1db391a780d694e240e95e475+3805/lobstr_v3.0.2_hg19_strinfo.tab;indexprefix=/keep/d341a6f1db391a780d694e240e95e475+3805/lobstr_v3.0.2_hg19_ref/lobSTR_;noweb;
##INFO=<ID=RPA,Number=A,Type=Float,Description="Repeats per allele">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of variant">
##INFO=<ID=MOTIF,Number=1,Type=String,Description="Canonical repeat motif">
##INFO=<ID=REF,Number=1,Type=Float,Description="Reference copy number">
##INFO=<ID=RL,Number=1,Type=Integer,Description="Reference STR track length in bp">
##INFO=<ID=RU,Number=1,Type=String,Description="Repeat motif">
##INFO=<ID=VT,Number=1,Type=String,Description="Variant type">
##FORMAT=<ID=ALLREADS,Number=1,Type=String,Description="All reads aligned to locus">
##FORMAT=<ID=AML,Number=1,Type=String,Description="Allele marginal likelihood ratio scores">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=GB,Number=1,Type=String,Description="Genotype given in bp difference from reference">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=Q,Number=1,Type=Float,Description="Likelihood ratio score of allelotype call">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=STITCH,Number=1,Type=Integer,Description="Number of stitched reads">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT mysample
Below is a typical entry, an AC microsatellite at position 856020 on chromosome 1. In this case only one
allele is found, which lobSTR calls as (AC)8, ignoring the interruption. My code calls this allele (AC)5,
the length of the longest uninterrupted tract (which appears as GT repeats in the REF sequence).
CHROM:    chr1
POS:      856020
ID:       .
REF:      TGTGCTGTGTGTGTGT
ALT:      .
QUAL:     0
FILTER:   .
INFO:     END=856035;MOTIF=AC;REF=8;RL=16;RU=TG;VT=STR
FORMAT:   GT:ALLREADS:AML:DP:GB:PL:Q:STITCH
mysample: 0/0:0|3:1/1:3:0/0:0:1:0
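To make the difference between the two calls concrete, the following is a minimal sketch, not the code used in the study. The helper names longestPureRun and checkExampleEntry are hypothetical. It assumes that the reference copy number of 8 in the INFO field corresponds to the tract length RL (16 bp) divided by the 2-bp motif length, and it counts the longest uninterrupted run of a dinucleotide in the REF sequence.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not the author's routine): length, in repeat units,
// of the longest uninterrupted run of a 2-bp motif within seq.
int longestPureRun(const std::string& seq, const std::string& motif) {
    int best = 0;
    for (std::size_t i = 0; i + 2 <= seq.size(); ++i) {
        int run = 0;
        // step two bases at a time while the motif keeps matching
        for (std::size_t g = i;
             g + 2 <= seq.size() && seq.compare(g, 2, motif) == 0;
             g += 2) {
            ++run;
        }
        best = std::max(best, run);
    }
    return best;
}

// Checks the entry shown above: REF=TGTGCTGTGTGTGTGT, RL=16.
void checkExampleEntry() {
    const std::string ref = "TGTGCTGTGTGTGTGT";
    const int rl = 16;                          // RL from the INFO field
    assert(rl / 2 == 8);                        // whole tract, interruption ignored
    assert(longestPureRun(ref, "GT") == 5);     // longest pure tract after the C interruption
    assert(longestPureRun(ref, "AC") == 0);     // no AC pairs in this REF sequence
}
```

The single C at the fifth base splits the tract into a run of one GT and a run of five GT, which is why the pure-repeat count is 5 rather than 8.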
Code to return AC repeat number. I wrote a simple program that reads the file line by line, discards the
header lines, then extracts the REF field and, if present, the ALT field. If two alleles are present, neither
of which matches REF, both are given in the ALT field, separated by a comma. Each allele sequence is
processed by the following sub-routine, which takes the query sequence as its argument:
int Repeats(const char seq[500])
{
    int max = 0; // set maximum repeat number to zero
    // analyse the sequence starting with each base in turn (i+1 < 500 keeps seq[i+1] inside the buffer)
    for (int i = 0; i + 1 < 500 && seq[i] != '\0'; i++) {
        if (seq[i] == 'A' && seq[i+1] == 'C') { // if an AC pair is found, look to see if a microsatellite exists
            int micro = 0; // zero the repeat number count
            for (int g = i; g + 1 < 500 && seq[g] == 'A' && seq[g+1] == 'C'; g += 2) micro++; // while AC pairs are found, count
            if (micro > max) max = micro; // if longer than the maximum length, store as maximum
        }
    }
    // repeat the process for GT motifs
    for (int i = 0; i + 1 < 500 && seq[i] != '\0'; i++) {
        if (seq[i] == 'G' && seq[i+1] == 'T') {
            int micro = 0;
            for (int g = i; g + 1 < 500 && seq[g] == 'G' && seq[g+1] == 'T'; g += 2) micro++;
            if (micro > max) max = micro;
        }
    }
    return max; // return the longest AC (or GT) tract found, in repeat units
}
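The line-reading step described above is not shown in the original. The following is a minimal sketch of how it might look, not the author's code. The function name allelesFromVcfLine is hypothetical, and the sketch assumes that the first five columns of a data line (CHROM, POS, ID, REF, ALT) contain no internal whitespace, so they can be read with whitespace-delimited extraction.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: collect the allele sequences (REF first, then each ALT)
// from one line of a VCF file. Header lines (starting with '#') and lines with
// fewer than five columns yield an empty vector.
std::vector<std::string> allelesFromVcfLine(const std::string& line) {
    std::vector<std::string> alleles;
    if (line.empty() || line[0] == '#') return alleles; // discard header lines

    std::istringstream fields(line);
    std::string chrom, pos, id, ref, alt;
    if (!(fields >> chrom >> pos >> id >> ref >> alt)) return alleles;

    alleles.push_back(ref);
    if (alt != ".") {                       // "." means no alternative allele
        std::istringstream alts(alt);
        std::string allele;
        while (std::getline(alts, allele, ',')) { // two ALT alleles are comma-separated
            if (!allele.empty()) alleles.push_back(allele);
        }
    }
    return alleles;
}
```

Each string returned could then be copied into a 500-character buffer and passed to Repeats; sequences longer than the buffer would need to be handled separately.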