Guide to lobSTR VCF format
Steps to create VCF format
• Initial round of lobSTR allelotyping to generate priors on allele frequencies
• Second round of lobSTR calling to generate genotype likelihoods and posteriors for all possible genotypes
• Merge VCFs from each sample using GATK
Fixed Fields
#CHROM
Chromosome of the STR variant
POS
Start position of the STR variant
ID
Set to “.”
REF
Nucleotide at CHROM:POS in the reference genome
ALT
##ALT=<ID=STRVAR,Description="Short tandem variation">
Alternate alleles are given as <STRVAR:$ALLELE>, where $ALLELE is the length
difference, in base pairs, between the alternate allele and the reference sequence,
e.g. <STRVAR:-4>, <STRVAR:-2>, <STRVAR:2>
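The ALT encoding above can be unpacked with a few lines of Python. This is a minimal sketch, not part of lobSTR: the function name parse_strvar_alts is hypothetical, and it assumes every alternate allele follows the <STRVAR:$ALLELE> pattern with an integer $ALLELE.

```python
import re

# Matches one symbolic allele such as <STRVAR:-4> and captures the bp difference.
_STRVAR = re.compile(r"^<STRVAR:([-+]?\d+)>$")


def parse_strvar_alts(alt_field):
    """Return the bp length differences encoded in an ALT column value."""
    if alt_field == ".":
        return []  # no alternate alleles listed
    diffs = []
    for token in alt_field.split(","):
        match = _STRVAR.match(token.strip())
        if match is None:
            raise ValueError(f"not a STRVAR allele: {token!r}")
        diffs.append(int(match.group(1)))
    return diffs


print(parse_strvar_alts("<STRVAR:-4>,<STRVAR:-2>,<STRVAR:2>"))  # [-4, -2, 2]
```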
QUAL
Set to -10*log10(1-P), where P is the posterior probability of the
genotype call given the observed reads.
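As a worked example of the QUAL formula, the sketch below assumes the logarithm is base 10, as is conventional for Phred scaling. The function name qual_from_posterior is hypothetical.

```python
import math


def qual_from_posterior(p):
    """Phred-scale the probability (1 - p) that the genotype call is wrong."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("posterior must lie in [0, 1]")
    if p == 1.0:
        return math.inf  # zero error probability has no finite Phred score
    return -10.0 * math.log10(1.0 - p)


# A posterior of 0.99 gives a QUAL of roughly 20; 0.999 gives roughly 30.
print(round(qual_from_posterior(0.99), 2))
```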
FILTER
Defaults to “.”
INFO fields
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for
each ALT allele, in the same order as listed">
(Standard)
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT
allele, in the same order as listed">
(Standard)
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in
called genotypes">
(Standard)
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
(Standard)
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of variant">
An STR at position chr1:422395-422435 will have END=422435
##INFO=<ID=MOTIF,Number=1,Type=String,Description="Repeat motif">
An STR with motif “AAAT” will have MOTIF=AAAT
##INFO=<ID=REF,Number=1,Type=Float,Description="Reference copy number">
The number of copies of the MOTIF in the reference genome. E.g. if MOTIF=AT and
REF=14.5, there are 14.5 copies of AT in the reference genome, spanning 14.5*2=29bp.
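The relationship between MOTIF, REF and allele lengths can be checked with a short calculation. This is an illustrative sketch: both function names are hypothetical, and the second one assumes an allele's copy number is the reference copy number plus its bp difference divided by the motif length.

```python
def reference_length_bp(motif, ref_copies):
    """Length in bp of the reference STR tract: copies times motif length."""
    return ref_copies * len(motif)


def allele_copy_number(motif, ref_copies, bp_diff):
    """Copy number of an allele given its bp difference from the reference."""
    return ref_copies + bp_diff / len(motif)


print(reference_length_bp("AT", 14.5))     # 29.0
print(allele_copy_number("AT", 14.5, -4))  # 12.5
```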
##INFO=<ID=VT,Number=1,Type=String,Description="Variant type">
(Standard) For lobSTR calls, VT=STR.
##INFO=<ID=set,Number=1,Type=String,Description="Source VCF for the merged
record in CombineVariants">
(Standard) Files had to be merged in stages to prevent GATK from crashing, so the
values of this field are the names of those intermediate files.
FORMAT fields
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
(Standard)
##FORMAT=<ID=ALLREADS,Number=1,Type=String,Description="All reads aligned to
locus">
Gives the alleles of all reads seen in the form:
$allele1|$readcount1;$allele2|$readcount2, etc. where the allele is given in the
number of base pairs difference from the reference, and the read count is the number
of reads supporting that allele. If no reads fully spanned the STR but there is other
data for this locus, this field is set to “NA”.
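A possible way to read the ALLREADS string into a dictionary is sketched below. The function name parse_allreads and the example value are made up for illustration. The sketch assumes alleles are integer bp differences, and treating "." or an empty string like "NA" is an extra assumption.

```python
def parse_allreads(value):
    """Map each observed allele (bp difference from reference) to its read count."""
    if value in ("NA", ".", ""):
        return {}  # no fully spanning reads reported
    counts = {}
    for entry in value.split(";"):
        if not entry:
            continue  # tolerate a trailing semicolon
        allele, count = entry.split("|")
        counts[int(allele)] = counts.get(int(allele), 0) + int(count)
    return counts


print(parse_allreads("-4|3;0|7;2|1"))  # {-4: 3, 0: 7, 2: 1}
print(parse_allreads("NA"))            # {}
```

The same function should also work for ALLPARTIALREADS, since that field uses the same format, but there the keys are upper bounds rather than allele calls.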
##FORMAT=<ID=ALLPARTIALREADS,Number=1,Type=String,Description="All partially
spanning reads aligned to locus">
Same format as ALLREADS, but for reads that only partially span the STR. In this case,
the allele given is not an actual allele call but an upper bound on the length of the
possible allele exhibited by the read if it were to fully span the STR.
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
(Standard)
##FORMAT=<ID=GB,Number=1,Type=String,Description="Genotype given in bp
difference from reference">
If an allelotype call was made, this gives the call in the form $A/$B where $A and $B
are the number of base pairs difference from reference of each called allele. If no call
was made, this is given as “./.”.
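The GB value can be split into its two alleles as sketched here. The function name parse_gb is hypothetical, and the sketch assumes both alleles are integers separated by a single "/".

```python
def parse_gb(value):
    """Return (A, B) bp differences from a GB value, or None if no call was made."""
    if value == "./.":
        return None
    first, second = value.split("/")
    return int(first), int(second)


print(parse_gb("-4/2"))  # (-4, 2)
print(parse_gb("./."))   # None
```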
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled
likelihoods for genotypes as defined in the VCF specification">
For each possible genotype, gives -10*log10(L), where L is the likelihood of the
genotype given the reads seen. The order is as specified in the VCF 4.1 format: for
allele indices j and k with j <= k (0 is the reference allele; 1, 2, ... index the ALT
field in the order listed), the position in this list of genotype <allele[j], allele[k]>
is (k*(k+1)/2)+j.
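The ordering formula can be made concrete with a short sketch. It follows the VCF convention that allele index 0 is the reference allele and indices 1 and up are the ALT alleles in the order listed, with j <= k. The function names pl_index and genotype_order are hypothetical.

```python
def pl_index(j, k):
    """Position of genotype (j, k) in a Number=G list such as PL or GPP."""
    if j < 0 or k < 0:
        raise ValueError("allele indices must be non-negative")
    if j > k:
        j, k = k, j  # the formula expects j <= k
    return k * (k + 1) // 2 + j


def genotype_order(n_alleles):
    """All diploid genotypes in list order, for n_alleles including REF."""
    return [(j, k) for k in range(n_alleles) for j in range(k + 1)]


# REF plus two ALT alleles gives six genotypes:
print(genotype_order(3))  # [(0, 0), (0, 1), (1, 1), (0, 2), (1, 2), (2, 2)]
print(pl_index(1, 2))     # 4
```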
##FORMAT=<ID=GPP,Number=G,Type=Float,Description="Genotype Posterior
probabilities (phred scaled, -10log10)">
For each possible genotype, gives -10*log10(P), where P is the posterior probability of
the genotype given the reads seen. Based on priors on allele frequencies
(assuming Hardy-Weinberg equilibrium) when available.
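To get back to plain probabilities, the Phred scaling can be inverted. The sketch below assumes the values follow the -10*log10(P) convention stated in the GPP header description. The function names and the example values are hypothetical.

```python
def gpp_to_probabilities(gpp_values):
    """Invert the Phred scaling: P = 10 ** (-GPP / 10) for each genotype."""
    return [10.0 ** (-value / 10.0) for value in gpp_values]


def most_probable_genotype(gpp_values):
    """Index of the genotype with the highest posterior (lowest scaled value)."""
    return min(range(len(gpp_values)), key=lambda i: gpp_values[i])


example = [30.0, 0.5, 20.0]  # made-up values for three genotypes
print([round(p, 3) for p in gpp_to_probabilities(example)])
print(most_probable_genotype(example))  # 1
```

The returned index uses the same genotype ordering as the PL field.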
##FORMAT=<ID=PP,Number=1,Type=Float,Description="Posterior probability of
call">
Gives the posterior probability of the maximum a posteriori genotype call, which is
the genotype reported in the GT field.
##FORMAT=<ID=MP,Number=1,Type=Float,Description="Upper bound on maximum
partially spanning allele">
If any reads partially spanned the locus, gives the longest possible allele supported by
any of those reads.
##FORMAT=<ID=PC,Number=1,Type=Integer,Description="Coverage by partially
spanning reads">
Gives the number of reads partially spanning the STR locus.
##FORMAT=<ID=STITCH,Number=1,Type=Integer,Description="Number of stitched
reads">
The number of paired-end reads that overlapped enough to be stitched together into
one long read.