ESM 2. Processing lobSTR vcf files.

The program lobSTR is a likelihood-based program that makes allele calls for tandem repeats in next generation sequencing data and returns a vcf file (see below). This file contains information about every repeat found. I was only interested in pure AC repeats so wrote a C++ script to assay each output line in the vcf file. My annotations are in red.

vcf header lines, all starting with double hash
##fileformat=VCFv4.1
##fileDate=2015-02-11.23:20:35
##source=# version=lobSTR_3.0.3;
##source=allelotype_3.0.3;command=classify;bam=/keep/c06138c2e230390ebd5c158a19bc5780+487/test.sorted.bam;noisemodel=/keep/513a1bdbbc2ac2165a0b84e37ab91e31+10812/models/illumina_v3.pcrfree;out=HG00143;strinfo=/keep/d341a6f1db391a780d694e240e95e475+3805/lobstr_v3.0.2_hg19_strinfo.tab;indexprefix=/keep/d341a6f1db391a780d694e240e95e475+3805/lobstr_v3.0.2_hg19_ref/lobSTR_;noweb;
##INFO=<ID=RPA,Number=A,Type=Float,Description="Repeats per allele">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of variant">
##INFO=<ID=MOTIF,Number=1,Type=String,Description="Canonical repeat motif">
##INFO=<ID=REF,Number=1,Type=Float,Description="Reference copy number">
##INFO=<ID=RL,Number=1,Type=Integer,Description="Reference STR track length in bp">
##INFO=<ID=RU,Number=1,Type=String,Description="Repeat motif">
##INFO=<ID=VT,Number=1,Type=String,Description="Variant type">
##FORMAT=<ID=ALLREADS,Number=1,Type=String,Description="All reads aligned to locus">
##FORMAT=<ID=AML,Number=1,Type=String,Description="Allele marginal likelihood ratio scores">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=GB,Number=1,Type=String,Description="Genotype given in bp difference from reference">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=Q,Number=1,Type=Float,Description="Likelihood ratio score of allelotype call">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=STITCH,Number=1,Type=Integer,Description="Number of stitched reads">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT mysample

Below is a typical entry, an AC microsatellite at position 856020 on chromosome 1. In this case only one allele is found, which lobSTR calls as (AC)8, ignoring the interruption. My code calls this allele (AC)5, the length of the longest uninterrupted tract.

CHROM     chr1
POS       856020
ID        .
REF       TGTGCTGTGTGTGTGT
ALT       .
QUAL      0
FILTER    .
INFO      END=856035;MOTIF=AC;REF=8;RL=16;RU=TG;VT=STR
FORMAT    GT:ALLREADS:AML:DP:GB:PL:Q:STITCH
mysample  0/0:0|3:1/1:3:0/0:0:1:0

Code to return AC repeat number.

I wrote a simple program that reads the file line by line, discards the header lines, then extracts the REF field and, if present, the ALT field. If two alleles are present, neither of which matches REF, both are given in the ALT field, separated by a comma. Each allele sequence is processed by the following sub-routine, which takes the query sequence as its argument and returns the length, in repeat units, of the longest uninterrupted AC tract (or GT tract, the same motif read from the opposite strand):

int Repeats(char seq[500]) {
    int max = 0;  // set maximum repeat number to zero
    // analyse the sequence starting with each base in turn;
    // stopping at 499 keeps seq[i+1] inside the 500-character buffer
    for (int i = 0; i < 499 && seq[i] != '\0'; i++) {
        if (seq[i] == 'A' && seq[i+1] == 'C') {  // if an AC pair is found, look to see if a microsatellite exists
            int micro = 0;  // zero the repeat number count
            // while AC pairs are found, count
            for (int g = i; g < 499 && seq[g] == 'A' && seq[g+1] == 'C'; g += 2) micro++;
            if (micro > max) max = micro;  // if longer than the maximum length, store as maximum
        }
    }
    // repeat the process for GT motifs
    for (int i = 0; i < 499 && seq[i] != '\0'; i++) {
        if (seq[i] == 'G' && seq[i+1] == 'T') {
            int micro = 0;
            for (int g = i; g < 499 && seq[g] == 'G' && seq[g+1] == 'T'; g += 2) micro++;
            if (micro > max) max = micro;
        }
    }
    return max;  // return the longest AC (or GT) tract found
}
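The line-handling step described above (skip header lines, take REF, then split ALT on commas) is not shown in the original code. The following is a minimal sketch of how it might look, not the author's actual implementation. It assumes tab-delimited columns in the standard VCF order and uses std::string in place of the fixed-size character array. The names longestRun, acRepeats, split and alleleRepeatCounts are hypothetical and chosen for illustration only.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

// Longest uninterrupted run of a two-base unit (e.g. "AC") in seq,
// measured in repeat units. Hypothetical helper, not from the original.
int longestRun(const std::string& seq, const std::string& unit) {
    int best = 0;
    for (std::size_t i = 0; i + 1 < seq.size(); ++i) {
        int run = 0;
        // step two bases at a time while the unit keeps matching
        for (std::size_t g = i;
             g + 1 < seq.size() && seq.compare(g, 2, unit) == 0; g += 2)
            ++run;
        best = std::max(best, run);
    }
    return best;
}

// Longest pure tract counted as AC or as GT, mirroring the two passes
// of the sub-routine in the text.
int acRepeats(const std::string& seq) {
    return std::max(longestRun(seq, "AC"), longestRun(seq, "GT"));
}

// Split a string on a single-character delimiter.
std::vector<std::string> split(const std::string& s, char delim) {
    std::vector<std::string> out;
    std::stringstream ss(s);
    std::string field;
    while (std::getline(ss, field, delim)) out.push_back(field);
    return out;
}

// Repeat counts for REF followed by each ALT allele of one vcf line.
// Returns an empty vector for header lines (starting with '#') and for
// lines with fewer than five columns. Assumes tab-delimited columns:
// CHROM POS ID REF ALT ...
std::vector<int> alleleRepeatCounts(const std::string& line) {
    std::vector<int> counts;
    if (line.empty() || line[0] == '#') return counts;
    std::vector<std::string> cols = split(line, '\t');
    if (cols.size() < 5) return counts;
    counts.push_back(acRepeats(cols[3]));          // REF allele
    if (cols[4] != ".") {                          // "." means no ALT allele
        for (const std::string& alt : split(cols[4], ','))
            counts.push_back(acRepeats(alt));      // one count per ALT allele
    }
    return counts;
}
```

Under these assumptions, calling alleleRepeatCounts on the example entry for chr1:856020 gives a single count of 5 for the REF allele, in agreement with the (AC)5 call described above.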