Documentation_Submit_NGS_SNPS - URGI

Exchange format for NGS variants submission
REMARK: We advise you to also submit the NGS project and run description (file named submit_NGS.xls, related to the GnpSeq NGS project) before submitting the files described below (files dedicated to sequence variation submission). Doing so lets you benefit from all GnpIS functionalities.
Here is the link to GnpSeqNGS submission procedure: http://urgi.versailles.inra.fr/Data/NGS-sequences/Data-submission
NGS Variant Submission Format description
List of files:
M_Run_Description (mandatory Excel sheet): Metadata about the sequencing. Information about the run, the line sequenced and the reference genome.
M_Variation_File (mandatory file): VCF or Varscan file containing the polymorphisms of the run described above.
M_Run_Description
Columns: Field name, Mandatory? (y = yes), Description of the field, Regexp / Authorized values, Length max (no value given).

RUN_NAME (mandatory: y)
Description: Name of the run from which the fastq sequences come.
Example(s): 705KBAAXX
Authorized values: Field must be filled with a word/name.

RUN_DESCRIPTION (mandatory: y)
Description: Description of the run and, optionally, description of the mapping step and SNP calling step.
Example(s): Sequencing of PN40024 in Illumina GA2. Mapping and SNP calling with MAPHiTS pipeline.
Authorized values: Field must be filled with a sentence composed of alphanumeric words.

RUN_DATE (mandatory: y)
Description: Date of the sequencing step.
Example(s): 22/11/2011
Authorized values: Field must be filled with a date. Convention is dd/mm/yyyy.

SUBRUN_NAME (mandatory: y)
Description: Name of the subrun (lane for Illumina) from which the fastq sequences come. Preferably it contains the genotype name.
Example(s): 705KBAAXX_s2_GenotypeName
Authorized values: Field must be filled with a word/name.

ANALYSIS_NAME (mandatory: y)
Description: Name of the analysis performed to call the SNPs.
Example(s): Polymorphism discovery on X genotype.
Authorized values: Field must be filled with a word/name.

ANALYSIS_SOFTWARE_NAME (mandatory: y)
Description: Name of the software used to compute the analysis.
Example(s): BWA or Maphits
Authorized values: Field must be filled with a word/name.

ANALYSIS_CONTACT_NAME (mandatory: y)
Description: Person who has done the analysis.
Example(s): Pierre MARTIN
Authorized values: Field must be filled with 2 words: the first name (first letter in uppercase, the others in lowercase), then the last name in uppercase.

PROTOCOL_NAME (mandatory: y)
Description: Name of the protocol used for the sequencing step.
Default value (if unknown): NGS sequencing protocol
Example(s): Sequencing with Illumina GA2 protocol.
Authorized values: Field must be filled with a sentence composed of alphanumeric words.

MAPPING_GENOME_NAME (mandatory: y)
Description: Name of the reference genome version used for the mapping step.
Example(s): Vitis vinifera 12x.
Authorized values: Field must be filled with a sentence composed of alphanumeric words.

MAPPING_GENOME_TAXON_NAME (mandatory: y)
Description: Organism scientific name of the reference genome used for the mapping step.
Example(s): Vitis vinifera 12x.
Authorized values: Field must be filled with a group of alphabetic words.

MAPPING_GENOME_DESCRIPTION (mandatory: y)
Description: Description of the reference genome used for the mapping step.
Example(s): 12X version of PN40024 genome
Authorized values: Field must be filled with a sentence composed of alphanumeric words.

GENOTYPE_NAME (mandatory: y)
Description: Name of the genotype or accession/line/strain sequenced in the run.
Example(s): PN40024.
Authorized values: Field must be filled with a word/name.

GENOTYPE_TAXON_NAME (mandatory: y)
Description: Organism scientific name of the genotype or accession/line/strain sequenced in the run.
Example(s): Vitis vinifera L.
Authorized values: Field must be filled with a group of alphabetic words.

PROJECT_NAME (mandatory: y)
Description: Name of the project that funded the sequencing.
Authorized values: Field must be filled with a word/name or with a group of alphabetic words.
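
The field rules of M_Run_Description can be checked before submission. The following is only a minimal sketch in Python, not the validation actually run by GnpIS: the function names, the dictionary-based record and the regular expression for ANALYSIS_CONTACT_NAME are assumptions made for illustration. In particular, the contact pattern rejects accented or hyphenated names, which the real submission may well accept.

```python
import re
from datetime import datetime

# Fields marked "y" in the M_Run_Description sheet.
MANDATORY_FIELDS = [
    "RUN_NAME", "RUN_DESCRIPTION", "RUN_DATE", "SUBRUN_NAME",
    "ANALYSIS_NAME", "ANALYSIS_SOFTWARE_NAME", "ANALYSIS_CONTACT_NAME",
    "PROTOCOL_NAME", "MAPPING_GENOME_NAME", "MAPPING_GENOME_TAXON_NAME",
    "MAPPING_GENOME_DESCRIPTION", "GENOTYPE_NAME", "GENOTYPE_TAXON_NAME",
    "PROJECT_NAME",
]

# Assumed pattern for "Firstname LASTNAME" (ASCII letters only).
CONTACT_RE = re.compile(r"[A-Z][a-z]+ [A-Z]+")


def is_valid_run_date(value):
    """Return True if value is a real calendar date written dd/mm/yyyy."""
    if not re.fullmatch(r"\d{2}/\d{2}/\d{4}", value):
        return False
    try:
        datetime.strptime(value, "%d/%m/%Y")
    except ValueError:
        return False
    return True


def validate_run_description(record):
    """Return a list of error messages for one row of the sheet (a dict)."""
    errors = []
    for field in MANDATORY_FIELDS:
        if not str(record.get(field) or "").strip():
            errors.append(f"{field}: mandatory value is missing")
    run_date = str(record.get("RUN_DATE") or "")
    if run_date and not is_valid_run_date(run_date):
        errors.append("RUN_DATE: expected a date in dd/mm/yyyy format")
    contact = str(record.get("ANALYSIS_CONTACT_NAME") or "")
    if contact and not CONTACT_RE.fullmatch(contact):
        errors.append("ANALYSIS_CONTACT_NAME: expected 'Firstname LASTNAME'")
    return errors
```

An empty list means that no rule covered by the sketch was violated. It does not check the consistency of names with a previous GnpSeqNGS submission.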
Key information:
- GOOD PRACTICE: values of RUN_NAME, ANALYSIS_NAME and SUBRUN_NAME should be consistent with those already submitted to GnpSeqNGS, if any.
To link the SNPs submitted here to the project, please submit to GnpSeqNGS first and use the same RUN_NAME as in the GnpSeqNGS submission.
This gives you access to all the options of the GnpSNP-NGS interface.
- MANDATORY: the value of PROJECT_NAME must be consistent with the one used in any previous submission (to GnpSeqNGS, for example).
If the project does not exist yet and you do not want to submit to GnpSeqNGS first, please contact us: urgi_support@versailles.inra.fr
M_Variation_File
SNP calling result file (in VCF or Varscan format):
- VCF outputs: version 4.0 or 4.1 recommended.
- Varscan outputs: Varscan v2.2.2 recommended, possibly v2.2.8.
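
The VCF version of a file can be read from its first line, which the VCF specification requires to be of the form ##fileformat=VCFv4.1. The sketch below, in Python, checks that line against the recommended versions; the function names are hypothetical and the check is only an illustration, not part of the submission tooling.

```python
import re

# Versions recommended in this document.
RECOMMENDED_VCF_VERSIONS = {"4.0", "4.1"}


def vcf_version(first_line):
    """Return the version declared in a '##fileformat=VCFvX.Y' line, else None."""
    match = re.match(r"^##fileformat=VCFv(\d+\.\d+)\s*$", first_line)
    return match.group(1) if match else None


def is_recommended_vcf(path):
    """Return True if the file at path declares a recommended VCF version."""
    with open(path, encoding="utf-8") as handle:
        return vcf_version(handle.readline()) in RECOMMENDED_VCF_VERSIONS
```

A file that declares another version is not necessarily refused; the document only states a recommendation.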
EXAMPLE OF OUTPUT V2.2.0
Tab-delimited SNP calls with the following columns:
Chrom        chromosome name
Position     position (1-based)
Ref          reference allele at this position
VarAllele    Non-reference allele observed
Reads1       reads supporting reference allele
Reads2       reads supporting variant allele
VarFreq      frequency of variant allele by read count
Strands1     strands on which reference allele was observed
Strands2     strands on which variant allele was observed
Qual1        average base quality of reference-supporting read bases
Qual2        average base quality of variant-supporting read bases
Pvalue       Significance of variant read count vs. expected baseline error
VarAllele    Most frequent non-reference allele observed
EXAMPLE OF OUTPUT V2.2.8
Tab-delimited SNP calls with the following columns:
Chrom        chromosome name
Position     position (1-based)
Ref          reference allele at this position
Cons         Consensus genotype of sample in IUPAC format.
Reads1       reads supporting reference allele
Reads2       reads supporting variant allele
VarFreq      frequency of variant allele by read count
Strands1     strands on which reference allele was observed
Strands2     strands on which variant allele was observed
Qual1        average base quality of reference-supporting read bases
Qual2        average base quality of variant-supporting read bases
Pvalue       Significance of variant read count vs. expected baseline error
MapQual1     Average map quality of ref reads (only useful if in pileup)
MapQual2     Average map quality of var reads (only useful if in pileup)
Reads1Plus   Number of reference-supporting reads on + strand
Reads1Minus  Number of reference-supporting reads on - strand
Reads2Plus   Number of variant-supporting reads on + strand
Reads2Minus  Number of variant-supporting reads on - strand
VarAllele    Most frequent non-reference allele observed
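
A tab-delimited Varscan file with the columns listed above can be read with a few lines of Python. This is a hedged sketch: it assumes that the file starts with a header row naming the columns and that VarFreq may carry a trailing percent sign. The function name parse_varscan and the sample row are invented for illustration and use only the first twelve columns.

```python
import csv
import io


def parse_varscan(handle):
    """Yield one dict per SNP call from a tab-delimited Varscan output.

    Assumes the first row is a header naming the columns.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        row["Position"] = int(row["Position"])  # 1-based position
        row["Reads1"] = int(row["Reads1"])      # reference-supporting reads
        row["Reads2"] = int(row["Reads2"])      # variant-supporting reads
        # Strip a trailing "%" if present, then convert to a number.
        row["VarFreq"] = float(row["VarFreq"].rstrip("%"))
        yield row


# Invented sample data, for illustration only.
SAMPLE = (
    "Chrom\tPosition\tRef\tCons\tReads1\tReads2\tVarFreq\t"
    "Strands1\tStrands2\tQual1\tQual2\tPvalue\n"
    "chr1\t1234\tA\tR\t10\t5\t33.33%\t2\t2\t30\t28\t0.01\n"
)

calls = list(parse_varscan(io.StringIO(SAMPLE)))
```

The remaining columns are left as strings, so that values such as Pvalue are passed through unchanged.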