Exchange format for NGS variants submission

REMARK: We advise you to also submit the NGS project and run description (file named submit_NGS.xls, related to the GnpSeq NGS project) before submitting the files described below, which are dedicated to sequence variation submission. Doing so lets you benefit from all GnpIS functionalities. The GnpSeqNGS submission procedure is described here: http://urgi.versailles.inra.fr/Data/NGS-sequences/Data-submission

NGS Variant Submission Format description

List of files:

| File | Status | Content |
|---|---|---|
| M_Run_Description | Mandatory Excel sheet | Metadata about the sequencing: information about the run, the line sequenced and the reference genome. |
| M_Variation_File | Mandatory file | VCF or Varscan file containing the polymorphisms of the run described above. |

M_Run_Description

| Field name | Mandatory? | Description of the field | Regexp / Authorized values | Length max |
|---|---|---|---|---|
| RUN_NAME | y | Name of the run from which the fastq sequences come. Example(s): 705KBAAXX | Must be a word/name. | |
| RUN_DESCRIPTION | y | Description of the run and, optionally, of the mapping step and SNP calling step. Example(s): Sequencing of PN40024 in Illumina GA2. Mapping and SNP calling with MAPHiTS pipeline. | Must be a sentence composed of alphanumeric words. | |
| RUN_DATE | y | Date of the sequencing step. Example(s): 22/11/2011 | Must be a date. Convention is dd/mm/yyyy. | |
| SUBRUN_NAME | y | Name of the subrun (lane for Illumina) from which the fastq sequences come. Preferably contains the genotype name. Example(s): 705KBAAXX_s2_GenotypeName | Must be a word/name. | |
| ANALYSIS_NAME | y | Name of the analysis performed to call the SNPs. Example: Polymorphism discovery on X genotype. | Must be a word/name. | |
| ANALYSIS_SOFTWARE_NAME | y | Name of the software used to compute the analysis. Example: BWA or Maphits | Must be a word/name. | |
| ANALYSIS_CONTACT_NAME | y | Person who performed the analysis. Example: Pierre MARTIN | Must be 2 words: the first name (first letter in uppercase, the others in lowercase), then the last name in uppercase. | |
| PROTOCOL_NAME | y | Name of the protocol used for the sequencing step. Default value (if unknown): NGS sequencing protocol. Example(s): Sequencing with Illumina GA2 protocol. | Must be a sentence composed of alphanumeric words. | |
| MAPPING_GENOME_NAME | y | Name of the reference genome version used for the mapping step. Example(s): Vitis vinifera 12x. | Must be a sentence composed of alphanumeric words. | |
| MAPPING_GENOME_TAXON_NAME | y | Organism scientific name of the reference genome used for the mapping step. Example(s): Vitis vinifera 12x. | Must be a group of alphabetic words. | |
| MAPPING_GENOME_DESCRIPTION | y | Description of the reference genome used for the mapping step. Example(s): 12X version of PN40024 genome | Must be a sentence composed of alphanumeric words. | |
| GENOTYPE_NAME | y | Name of the genotype or accession/line/strain sequenced in the run. Example(s): PN40024. | Must be a word/name. | |
| GENOTYPE_TAXON_NAME | y | Organism scientific name of the genotype or accession/line/strain sequenced in the run. Example(s): Vitis vinifera L. | Must be a group of alphabetic words. | |
| PROJECT_NAME | y | Name of the project that funded the sequencing. | Must be a word/name or a group of alphabetic words. | |

Key information:

GOOD PRACTICE: the values of RUN_NAME, ANALYSIS_NAME and SUBRUN_NAME should be consistent with those that may already have been submitted to GnpSeqNGS. To link the SNPs (submitted here) to the project, please submit to GnpSeqNGS first and use the same RUN_NAME as in the GnpSeqNGS format. This gives you access to all the options of the GnpSNP-NGS interface.

MANDATORY: the values of PROJECT_NAME must be consistent with those given in a previous submission (to GnpSeqNGS, for example). If the project does not exist and you do not want to submit to GnpSeqNGS first, please contact us: urgi_support@versailles.inra.fr

M_Variation_File

SNP calling result file (in VCF or Varscan format):

VCF outputs: version 4.0 or 4.1 recommended.
Varscan outputs: Varscan v2.2.2 recommended, possibly v2.2.8.

EXAMPLE OF OUTPUT V2.2.0

Tab-delimited SNP calls with the following columns:

Chrom        chromosome name
Position     position (1-based)
Ref          reference allele at this position
VarAllele    non-reference allele observed
Reads1       reads supporting reference allele
Reads2       reads supporting variant allele
VarFreq      frequency of variant allele by read count
Strands1     strands on which reference allele was observed
Strands2     strands on which variant allele was observed
Qual1        average base quality of reference-supporting read bases
Qual2        average base quality of variant-supporting read bases
Pvalue       significance of variant read count vs. expected baseline error
VarAllele    most frequent non-reference allele observed

EXAMPLE OF OUTPUT V2.2.8

Tab-delimited SNP calls with the following columns:

Chrom        chromosome name
Position     position (1-based)
Ref          reference allele at this position
Cons         consensus genotype of sample in IUPAC format
Reads1       reads supporting reference allele
Reads2       reads supporting variant allele
VarFreq      frequency of variant allele by read count
Strands1     strands on which reference allele was observed
Strands2     strands on which variant allele was observed
Qual1        average base quality of reference-supporting read bases
Qual2        average base quality of variant-supporting read bases
Pvalue       significance of variant read count vs. expected baseline error
MapQual1     average map quality of ref reads (only useful if in pileup)
MapQual2     average map quality of var reads (only useful if in pileup)
Reads1Plus   number of reference-supporting reads on + strand
Reads1Minus  number of reference-supporting reads on - strand
Reads2Plus   number of variant-supporting reads on + strand
Reads2Minus  number of variant-supporting reads on - strand
VarAllele    most frequent non-reference allele observed
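
Before submitting a Varscan file, it can be useful to check that it really contains the V2.2.8 columns listed above. The Python sketch below is one possible way to do this and is not part of the GnpIS tooling. It assumes that the file starts with a header row using exactly these column names and that VarFreq may carry a trailing percent sign. The function name parse_varscan and the sample line are invented for illustration.

```python
import csv
import io

# Column names as listed in the V2.2.8 example of this document.
VARSCAN_228_COLUMNS = [
    "Chrom", "Position", "Ref", "Cons", "Reads1", "Reads2", "VarFreq",
    "Strands1", "Strands2", "Qual1", "Qual2", "Pvalue", "MapQual1",
    "MapQual2", "Reads1Plus", "Reads1Minus", "Reads2Plus", "Reads2Minus",
    "VarAllele",
]

# Columns converted to integers (position and read counts).
INT_COLUMNS = (
    "Position", "Reads1", "Reads2",
    "Reads1Plus", "Reads1Minus", "Reads2Plus", "Reads2Minus",
)


def parse_varscan(handle):
    """Yield one dict per SNP call from a tab-delimited Varscan file.

    Assumption: the first line is a header row with the names above.
    Raises ValueError if an expected column is absent.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [name for name in VARSCAN_228_COLUMNS if name not in header]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))
    for row in reader:
        for name in INT_COLUMNS:
            row[name] = int(row[name])
        # Assumption: VarFreq may be written with a percent sign ("45.45%").
        row["VarFreq"] = float(row["VarFreq"].rstrip("%"))
        row["Pvalue"] = float(row["Pvalue"])
        yield row


# Invented sample content: a header row plus one SNP call.
example = "\t".join(VARSCAN_228_COLUMNS) + "\n" + "\t".join([
    "chr1", "1042", "A", "R", "12", "10", "45.45%", "2", "2", "38", "37",
    "0.001", "1", "1", "6", "6", "5", "5", "G",
]) + "\n"

calls = list(parse_varscan(io.StringIO(example)))
for call in calls:
    print(call["Chrom"], call["Position"], call["Ref"], call["VarAllele"])
```

To check a real file, open it in text mode and pass the file object instead of the io.StringIO wrapper. A V2.2.0 file would need its own column list.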