Guide to lobSTR VCF format

Steps to create VCF format
• Initial round of lobSTR allelotyping to generate priors on allele frequencies
• Second round of lobSTR calling to generate genotype likelihoods and posteriors for all possible genotypes
• Merge VCFs from each sample using GATK

Fixed Fields
#CHROM: Chromosome of the STR variant
POS: Start position of the STR variant
ID: Set to “.”
REF: Nucleotide at CHROM:POS in the reference genome

##ALT=<ID=STRVAR,Description="Short tandem variation">
Alternate alleles are given as <STRVAR:$ALLELE>, where $ALLELE is the length difference, in base pairs, between the alternate allele and the reference sequence, e.g. ALT <STRVAR:-4>, <STRVAR:-2>, <STRVAR:2>

QUAL: Set to -10*log10(1-P), where P is the posterior probability of the genotype call given the observed reads.
FILTER: Defaults to “.”

INFO fields
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed"> (Standard)
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed"> (Standard)
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes"> (Standard)
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth"> (Standard)

##INFO=<ID=END,Number=1,Type=Integer,Description="End position of variant">
An STR at position chr1:422395-422435 will have END=422435.

##INFO=<ID=MOTIF,Number=1,Type=String,Description="Repeat motif">
An STR with motif “AAAT” will have MOTIF=AAAT.

##INFO=<ID=REF,Number=1,Type=Float,Description="Reference copy number">
The number of copies of the MOTIF in the reference genome. E.g. if MOTIF=AT and REF=14.5, there are 14.5 copies of AT in the reference genome, spanning 14.5*2=29bp.

##INFO=<ID=VT,Number=1,Type=String,Description="Variant type"> (Standard)
For lobSTR calls, VT=STR.
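As a rough illustration of how the ALT, MOTIF and REF (INFO) values described above fit together, the following Python sketch recovers the length in base pairs of each alternate allele. The helper names (parse_strvar_alts, parse_info, allele_lengths) are hypothetical and not part of lobSTR. The sketch assumes that ALT entries are comma-separated and that INFO entries are semicolon-separated key=value pairs, as in standard VCF.

```python
import re


def parse_strvar_alts(alt_field):
    """Return the bp offsets in an ALT column such as '<STRVAR:-4>,<STRVAR:2>'."""
    return [int(m) for m in re.findall(r"<STRVAR:(-?\d+)>", alt_field)]


def parse_info(info_field):
    """Split an INFO column into a dict; keys without '=' map to True."""
    info = {}
    for item in info_field.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def allele_lengths(alt_field, info_field):
    """Length in bp of each ALT allele: reference span plus the STRVAR offset.

    The reference span is taken as (INFO REF copy number) * len(MOTIF),
    following the MOTIF=AT, REF=14.5 -> 29bp example in the guide.
    """
    info = parse_info(info_field)
    ref_span = float(info["REF"]) * len(info["MOTIF"])
    return [ref_span + offset for offset in parse_strvar_alts(alt_field)]


if __name__ == "__main__":
    alt = "<STRVAR:-4>,<STRVAR:-2>,<STRVAR:2>"
    info = "END=422435;MOTIF=AT;REF=14.5;VT=STR"
    print(allele_lengths(alt, info))  # [25.0, 27.0, 31.0]
```

Note that the REF key in the INFO column (reference copy number) is distinct from the fixed REF column (reference nucleotide); the sketch only reads the former.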
##INFO=<ID=set,Number=1,Type=String,Description="Source VCF for the merged record in CombineVariants"> (Standard)
Files had to be merged in stages to prevent GATK from crashing, so these file names reflect those intermediate files.

FORMAT fields
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> (Standard)

##FORMAT=<ID=ALLREADS,Number=1,Type=String,Description="All reads aligned to locus">
Gives the alleles of all reads seen in the form $allele1|$readcount1;$allele2|$readcount2, etc., where each allele is given as the number of base pairs difference from the reference and each read count is the number of reads supporting that allele. If no reads fully spanned the STR but there is other data for this locus, this field is set to “NA”.

##FORMAT=<ID=ALLPARTIALREADS,Number=1,Type=String,Description="All partially spanning reads aligned to locus">
Same format as ALLREADS, but for reads that only partially span the STR. In this case, the allele given is not an actual allele call but an upper bound on the length of the allele the read could exhibit if it fully spanned the STR.

##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth"> (Standard)

##FORMAT=<ID=GB,Number=1,Type=String,Description="Genotype given in bp difference from reference">
If an allelotype call was made, this gives the call in the form $A/$B, where $A and $B are the number of base pairs difference from the reference of each called allele. If no call was made, this is given as “./.”.

##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
For each possible genotype, gives -10*log10(L), where L is the likelihood of the genotype given the reads seen. The order is as specified in the VCF 4.1 format: for allele indices j and k with j <= k (0 for the reference allele, 1 and up for the alleles in the ALT field, in order), the position of genotype <allele[j], allele[k]> in this list is (k*(k+1)/2)+j.
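The genotype ordering used for PL can be checked with a small Python sketch. It assumes diploid genotypes and 0-based allele indices with 0 as the reference allele, as in the VCF 4.1 specification; the function names pl_index and genotype_order are hypothetical.

```python
def pl_index(j, k):
    """Position of genotype allele[j]/allele[k] in a PL-style list (diploid)."""
    if j > k:
        j, k = k, j  # the formula expects j <= k
    return k * (k + 1) // 2 + j


def genotype_order(n_alleles):
    """All diploid genotypes (j, k) in the order their values are listed."""
    return [(j, k) for k in range(n_alleles) for j in range(k + 1)]


if __name__ == "__main__":
    # One reference allele plus two ALT alleles gives six genotypes.
    for position, (j, k) in enumerate(genotype_order(3)):
        assert pl_index(j, k) == position
        print(position, f"{j}/{k}")
```

With three alleles the listed order is 0/0, 0/1, 1/1, 0/2, 1/2, 2/2, so a heterozygous call for the first and second ALT alleles (1/2) is the fifth value in the list.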
##FORMAT=<ID=GPP,Number=G,Type=Float,Description="Genotype Posterior probabilities (phred scaled, -10log10)">
For each possible genotype, gives -10*log10(P), where P is the posterior probability of the genotype given the reads seen. Based on priors on allele frequencies (assumes Hardy-Weinberg equilibrium, HWE) when available.

##FORMAT=<ID=PP,Number=1,Type=Float,Description="Posterior probability of call">
Gives the posterior probability of the maximum a posteriori genotype call, which is the genotype returned in the GT field.

##FORMAT=<ID=MP,Number=1,Type=Float,Description="Upper bound on maximum partially spanning allele">
If any reads partially spanned the locus, gives the longest possible allele supported by any of those reads.

##FORMAT=<ID=PC,Number=1,Type=Integer,Description="Coverage by partially spanning reads">
Gives the number of reads partially spanning the STR locus.

##FORMAT=<ID=STITCH,Number=1,Type=Integer,Description="Number of stitched reads">
The number of paired-end reads that overlapped enough to be stitched together into one long read.
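The string-encoded and Phred-scaled per-sample values described in this guide can be decoded with a few lines of Python. This is a minimal sketch under stated assumptions: allele offsets in ALLREADS are taken to be integers, GPP values are taken to be comma-separated, and the QUAL logarithm is taken to be base 10 as in the usual Phred convention. All function names are hypothetical and not part of lobSTR.

```python
import math


def parse_allreads(value):
    """Turn '-4|3;0|10' into {-4: 3, 0: 10}; 'NA' (no spanning reads) gives {}."""
    if value in ("NA", ".", ""):
        return {}
    counts = {}
    for entry in value.split(";"):
        allele, count = entry.split("|")
        counts[int(allele)] = int(count)  # assumes integer bp offsets
    return counts


def phred_to_prob(q):
    """Invert Q = -10*log10(P)."""
    return 10.0 ** (-q / 10.0)


def gpp_to_posteriors(value):
    """Convert a comma-separated GPP string into posterior probabilities."""
    return [phred_to_prob(float(q)) for q in value.split(",")]


def qual_from_pp(pp):
    """QUAL = -10*log10(1 - P), with P the posterior of the call (the PP field)."""
    if pp >= 1.0:
        return math.inf  # log10(0) is undefined; treat as unbounded quality
    return -10.0 * math.log10(1.0 - pp)


if __name__ == "__main__":
    print(parse_allreads("-4|3;0|10"))
    print(gpp_to_posteriors("0.0,20.0"))
    print(round(qual_from_pp(0.99), 3))
```

The same parse_allreads helper applies to ALLPARTIALREADS, with the caveat from the guide that the keys are then upper bounds rather than allele calls.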