Data Submission Guide

DATA SUBMISSION GUIDELINES

Reviewed by NIA on September 12, 2012

To submit data, please contact data@niagads.org with the required documentation. Please use the following guidelines when submitting data to NIAGADS.

Required documentation for all data submissions:

NIA_AD_genetics_sharing_plan_updated_4_16_2010.pdf
Please include the NIA AD Genetics Sharing Plan, signed by both the PI and a supervisor with signatory authority for the PI’s institution.

Informed Consent Form(s)
Please provide the IRB-approved informed consent form(s) in PDF format.

Phenotype Data File
Please use tab-delimited plain text (.txt) or Excel (.xls) file formats, along with a data dictionary listing each variable and its description.

README file
This file should contain citations and abstracts of publications that describe the study, a concise description of study design, final subject and marker count, platform, list of included files and formats, and primary contact information. Please use plain text (.txt), PDF (.pdf), or Microsoft Word (.doc or .docx).

If you are submitting Polymorphism Genotyping data:

APOE Genotypes
When available, please provide the APOE genotype. Describe the lab(s) that performed genotyping and the genotyping methodology in the README file mentioned above.

Preferred Format
Computer files containing genotype or genetic mapping data should be plain text files in the genetic pedigree file format. We ask the contributor(s) to use either the PLINK (.ped and .map files) or MERLIN (.ped, .map, and .dat files) pedigree formats. For more detailed definitions, please refer to the following two URLs or the example file formats listed below.

PLINK: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped
MERLIN: http://www.sph.umich.edu/csg/abecasis/merlin/tour/input_files.html

There must be separate documentation (preferably in the README file) that clearly explains the format used for the files mentioned above.
For PLINK pedigree format, including a URL to the definition of the file format is enough; for MERLIN pedigree format, both a detailed definition of the fields and a URL to the format definition should be included. The columns in each file should be listed and explained. Also, there should be some summary statistics, such as the number of individuals, number of markers, and so on. If some system was used to divide the genotyping into "plates", etc., this system should be explained. For example, if the genotype files are named "plate1, plate2, ...", this naming system should be explained.

NIAGADS Data Submission Guidelines Last Update: 2/7/2016 Created: 8/18/2012

Loci Labeling
Microsatellite markers, SNPs, genes, etc., should be labeled according to the common usage at NCBI. For microsatellites, use the "DnSmmmm" format so that the marker appears in the Marshfield or deCODE maps. For SNPs, use dbSNP rs numbers, not ss numbers. For genes, use the official NCBI Entrez Gene name and numerical gene ID, not an alias. For example, DRD2 is an official name and has aliases D2DR, D2R, etc.

Alternative Genotype Format
If it is difficult to format data into the formats above, data sets can be formatted as additional plain text files according to the following guidelines:

● Data should be formatted as tables in plain text, using space or tab to separate fields.
● Include a line at the top of each file that indicates the labels of the columns. These labels should begin with letters and not contain spaces (i.e., standard rules for the definition of variables in computer programs). The field labels should be distinct and case insensitive (e.g., SEX, Sex, and sex are considered to be the same).
● Use standard labels for the following fields: FAMID (family ID), SUBJID (subject ID), FATHER (father ID), MOTHER (mother ID), SEX (1 for male and 2 for female), AGE, DX (for diagnosis: 1 for control, 2 for case).
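The column-label rules above can be checked mechanically before a file is sent. The following is a minimal sketch, not part of the NIAGADS guidelines: the function names check_header and absent_standard_labels are hypothetical, and the pattern assumes that "standard rules for the definition of variables" means a letter followed by letters, digits, or underscores. The guidelines do not say whether every standard label must be present, so absent ones are only reported, not treated as errors.

```python
import re

# Standard labels named in the guidelines.
STANDARD_LABELS = ["FAMID", "SUBJID", "FATHER", "MOTHER", "SEX", "AGE", "DX"]

# Assumption: a valid label is a letter followed by letters, digits, or underscores.
LABEL_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def check_header(header_line):
    """Return a list of problems found in the column-label line (empty if none)."""
    problems = []
    seen = set()
    # split() with no argument separates on any run of spaces or tabs.
    for label in header_line.split():
        if not LABEL_PATTERN.match(label):
            problems.append(f"invalid label: {label!r}")
        key = label.upper()  # labels are compared case-insensitively
        if key in seen:
            problems.append(f"duplicate label (ignoring case): {label!r}")
        seen.add(key)
    return problems


def absent_standard_labels(header_line):
    """Return the standard labels that do not appear in the header (informational)."""
    present = {label.upper() for label in header_line.split()}
    return [name for name in STANDARD_LABELS if name not in present]
```

For the header used in the guidelines' own example (FAMID SUBJID FATHER MOTHER SEX AGE DX RS1001 RS1002), check_header returns an empty list and absent_standard_labels reports nothing missing.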
Example (alternative genotype format):

FAMID  SUBJID  FATHER  MOTHER  SEX  AGE  DX  RS1001  RS1002
100    1       0       0       1    45   0   A/T     A/A
100    2       0       0       2    43   0   A/A     C/C
100    3       1       2       1    12   0   A/T     A/C
100    4       1       2       1    10   0   A/T     A/C

If you are submitting Next Generation Sequencing (NGS) data:

Required Files
● Called reads prior to quality control in FASTQ format, compressed using the gzip or bzip2 program.
● Mapped reads in BAM format (see SAMtools, http://samtools.sourceforge.net).
● For studies focusing on genetic variants (e.g. genomic resequencing): called variants in the Variant Call Format (VCF, http://1000genomes.org/wiki/doku.php?id=1000_genomes:analysis:vcfv3.2).
● For read abundance studies (e.g. RNA-seq for gene expression profiling, ChIP-seq), provide summaries in tab-separated file format with explanations.

Additional Relevant Information
The contributor(s) should also provide additional information to facilitate future analysis and enable replication of primary findings by other researchers:

● Sequencer information (technology, machine type, version, protocol).
● How the FASTQ files are generated (i.e. pipeline version, base calling software, settings).
● How the BAM files are generated (i.e. workflow, read alignment software, parameters, reference genome, quality control parameters).
● How the VCF files are generated (i.e. call program, stringency settings).
● For targeted enrichment sequencing and whole exome sequencing, provide information on the enrichment target regions using the UCSC Genome Browser BED format (see http://genome.ucsc.edu/FAQ/FAQformat.html). The genomic coordinates should be based on the reference genome version used for read alignment. Note how the coordinates are defined (using “zero-based” coordinate system, and the ending position does not include the rightmost basepair in the interval) in BED files.
● For RNA-seq experiments, how the transcript-level summaries are generated (i.e. statistical/computational steps to generate the summaries, transcript and isoform assemblers, software used).
● For ChIP-seq experiments, how the summaries are generated (i.e. peak caller software).
● Other information that the submitter deems relevant.

Other considerations:

Family and Individual Identifiers
If using dashes to represent different identifiers of an individual, please explain the different parts of the ID. Example: individual "SJ-12321-1". Do not use identification numbers with leading zeros to avoid issues with some computer programs such as Excel, which may convert ID text into numerals without notice.

Information for the reference genome
If a genetic or physical map is supplied, please explain the source. For example, "NCBI physical map build 34.3".

Be consistent
Make sure all files use exactly the same system. We noticed in the past that this rule is typically violated when data are sent at different time points. Please make sure subsequent submissions use the same format as the original data.

If you have any further questions about data submission or would like to submit data, please contact data@niagads.org.

Developed by NIH/NIA, Washington University at St. Louis, and University of Pennsylvania Perelman School of Medicine
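The advice on leading zeros under "Family and Individual Identifiers" can be checked before submission. The sketch below is an illustration only and not part of the NIAGADS guidelines: the function names are hypothetical, and it assumes the IDs at risk are those made up entirely of digits and starting with "0", since those are the values a spreadsheet may silently turn into numbers.

```python
def has_leading_zero(identifier):
    """True if the ID is all digits, longer than one character, and starts with '0'."""
    text = identifier.strip()
    return len(text) > 1 and text.isdigit() and text.startswith("0")


def flag_ids(identifiers):
    """Return the IDs that a spreadsheet program might alter by dropping leading zeros."""
    return [i for i in identifiers if has_leading_zero(i)]
```

An ID such as "SJ-12321-1" is not flagged because it is not purely numeric, while an ID such as "0045" is.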
