Data Submission Guide

DATA SUBMISSION GUIDELINES
Reviewed by NIA on September 12, 2012
To submit data, please contact data@niagads.org with the required documentation. Please use the
following guidelines when submitting data to NIAGADS.
Required documentation for all data submissions:
NIA_AD_genetics_sharing_plan_updated_4_16_2010.pdf
Please include the NIA AD Genetics Sharing Plan, signed by both the PI and a supervisor with
signatory authority for the PI’s institution.
Informed Consent Form(s)
Please provide the IRB-approved informed consent form(s) in PDF format.
Phenotype Data File
Please use tab-delimited plain text (.txt) or Excel (.xls) file formats, along with a data dictionary
listing each variable and its description.
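One way a submitter might confirm that every phenotype variable is covered by the data dictionary is sketched below. It assumes a tab-delimited phenotype file with a header row and a tab-delimited dictionary whose first column holds the variable name; both layouts, the function name undocumented_columns, and the sample contents are illustrative assumptions, not NIAGADS requirements.

```python
import csv
import io


def undocumented_columns(phenotype_text, dictionary_text):
    """Return phenotype column names that have no data dictionary entry."""
    # The first row of the phenotype file is assumed to hold the variable names.
    header = next(csv.reader(io.StringIO(phenotype_text), delimiter="\t"))
    # The first column of each dictionary row is assumed to be the variable name.
    dictionary_rows = csv.reader(io.StringIO(dictionary_text), delimiter="\t")
    documented = {row[0] for row in dictionary_rows if row}
    return [name for name in header if name not in documented]


# Hypothetical file contents, for illustration only.
phenotype = "SUBJID\tSEX\tAGE\tDX\n1\t1\t45\t1\n"
dictionary = "SUBJID\tSubject ID\nSEX\t1 for male, 2 for female\nAGE\tAge in years\n"

print(undocumented_columns(phenotype, dictionary))
```

In this sample the DX column is reported because the dictionary does not describe it.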
README file
This file should contain citations and abstracts of publications that describe the study, a concise
description of study design, final subject and marker count, platform, list of included files and
formats, and primary contact information. Please use plain text (.txt), PDF (.pdf), or Microsoft
Word (.doc or .docx).
If you are submitting Polymorphism Genotyping data:
APOE Genotypes
When available, please provide the APOE genotype. Describe the lab(s) that performed
genotyping and the genotyping methodology in the README file mentioned above.
Preferred Format
Computer files containing genotype or genetic mapping data should be plain text files in the
genetic pedigree file format. We ask the contributor(s) to use either the PLINK (.ped and .map
files) or MERLIN pedigree formats (.ped, .map, and .dat files). For more detailed definitions,
please refer to the following two URLs or the example file formats listed below.
PLINK: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped
MERLIN: http://www.sph.umich.edu/csg/abecasis/merlin/tour/input_files.html
There must be separate documentation (preferably in the README file) that clearly explains the
format used for the files mentioned above. For PLINK pedigree format, including a URL to the
definition of the file format is enough; for MERLIN pedigree format, both a detailed definition of
the fields and a URL to the format definition should be included. The columns in each file should
be listed and explained. Please also include summary statistics, such as the number of
individuals and the number of markers. If the genotyping was divided into "plates" or a
similar scheme, explain the system used. For example, if the genotype files are named
"plate1, plate2...", explain that naming scheme.
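The summary statistics requested above can be computed directly from the files. The following is a minimal sketch, assuming the standard PLINK layout (six leading columns in the .ped file followed by two alleles per marker, and one marker per line in the .map file). The function name summarize_plink and the sample data are hypothetical.

```python
def summarize_plink(ped_text, map_text):
    """Count individuals and markers and flag .ped rows with the wrong width."""
    markers = [line.split() for line in map_text.splitlines() if line.strip()]
    people = [line.split() for line in ped_text.splitlines() if line.strip()]
    # Six leading columns (family, individual, father, mother, sex, phenotype)
    # followed by two alleles for every marker listed in the .map file.
    expected_fields = 6 + 2 * len(markers)
    malformed = [
        number
        for number, row in enumerate(people, start=1)
        if len(row) != expected_fields
    ]
    return {
        "individuals": len(people),
        "markers": len(markers),
        "malformed_rows": malformed,
    }


# Hypothetical two-marker, two-person data set.
map_text = "1 rs1001 0 1000\n1 rs1002 0 2000\n"
ped_text = "100 1 0 0 1 1 A T A A\n100 2 0 0 2 1 A A C C\n"

summary = summarize_plink(ped_text, map_text)
print(summary)
```

The counts can be copied into the README, and a non-empty malformed_rows list suggests that the .ped and .map files do not describe the same marker set.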
NIAGADS Data Submission Guidelines
Last Update: 2/7/2016
Created: 8/18/2012
Page 1 of 3
Loci Labeling
Microsatellite markers, SNPs, genes, etc., should be labeled according to the common usage at
NCBI. For microsatellites, use the "DnSmmmm" format so that the marker appears in the
Marshfield or deCODE maps. For SNPs, use dbSNP rs numbers, and not ss numbers. For genes,
use the official NCBI Entrez Gene name and numerical gene ID, not an alias. For example, DRD2
is an official name and has aliases D2DR, D2R, etc.
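A quick check of marker labels against these conventions might look like the sketch below. The regular expressions are assumptions about what "DnSmmmm" and rs numbers look like (a chromosome number, X, or Y for "n", and any run of digits for "mmmm"); they are not an official NCBI grammar, and the labels passed in are hypothetical.

```python
import re

# Assumed patterns: "rs" followed by digits, and D<chromosome>S<digits>.
RS_NUMBER = re.compile(r"^rs\d+$", re.IGNORECASE)
SS_NUMBER = re.compile(r"^ss\d+$", re.IGNORECASE)
MICROSATELLITE = re.compile(r"^D(\d{1,2}|X|Y)S\d+$")


def classify_label(label):
    """Return a short description of how a marker label matches the guidelines."""
    if RS_NUMBER.match(label):
        return "dbSNP rs number"
    if SS_NUMBER.match(label):
        return "ss number (use the rs number instead)"
    if MICROSATELLITE.match(label):
        return "microsatellite"
    return "unrecognized"


# Hypothetical labels, for illustration only.
for label in ["rs1001", "ss12345", "D19S178", "DRD2"]:
    print(label, "->", classify_label(label))
```

Gene names such as DRD2 come back as "unrecognized" here because checking them would require a lookup against NCBI Entrez Gene, which this sketch does not attempt.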
Alternative Genotype Format
If it is difficult to format data into the formats above, data sets can be formatted as additional
plain text files according to the following guidelines:
● Data should be formatted as tables in plain text, using a space or tab to separate fields.
● Include a line at the top of each file that indicates the labels of the columns. These labels
should begin with letters and not contain spaces (i.e., standard rules for the definition of
variables in computer programs). Field labels should be distinct and are treated as case
insensitive (e.g., SEX, Sex, and sex are considered to be the same).
● Use standard labels for the following fields: FAMID (family ID), SUBJID (subject ID), FATHER
(father ID), MOTHER (mother ID), SEX (1 for male and 2 for female), AGE, DX (for diagnosis:
1 for control, 2 for case).
Example:
FAMID  SUBJID  FATHER  MOTHER  SEX  AGE  DX  RS1001  RS1002
100    1       0       0       1    45   0   A/T     A/A
100    2       0       0       2    43   0   A/A     C/C
100    3       1       2       1    12   0   A/T     A/C
100    4       1       2       1    10   0   A/T     A/C
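The structural rules above (labels begin with a letter, labels are distinct when case is ignored, and every row has one value per column) can be checked before submission. The following is a minimal sketch; the function name check_table and the wording of its messages are hypothetical, and it deliberately does not validate the values in each field.

```python
def check_table(text):
    """Return a list of structural problems found in a whitespace-delimited table."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    if not lines:
        return ["file is empty"]
    header, rows = lines[0], lines[1:]
    problems = []
    for label in header:
        if not label[0].isalpha():
            problems.append(f"label does not begin with a letter: {label}")
    # Labels are compared in upper case because they are case insensitive.
    if len({label.upper() for label in header}) != len(header):
        problems.append("labels are not distinct when case is ignored")
    for number, row in enumerate(rows, start=1):
        if len(row) != len(header):
            problems.append(
                f"data row {number}: expected {len(header)} fields, found {len(row)}"
            )
    return problems


example = """FAMID SUBJID FATHER MOTHER SEX AGE DX RS1001 RS1002
100 1 0 0 1 45 0 A/T A/A
100 2 0 0 2 43 0 A/A C/C
100 3 1 2 1 12 0 A/T A/C
100 4 1 2 1 10 0 A/T A/C
"""

print(check_table(example))
```

An empty list means no structural problems were found; the example table from this guide passes.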
If you are submitting Next Generation Sequencing (NGS) data:
Required Files
● Called reads prior to quality control in FASTQ format, compressed using the gzip or bzip2
program.
● Mapped reads in BAM format (see SAMtools, http://samtools.sourceforge.net).
● For studies focusing on genetic variants (e.g., genomic resequencing): called variants in the
Variant Call Format (VCF,
http://1000genomes.org/wiki/doku.php?id=1000_genomes:analysis:vcfv3.2).
● For read abundance studies (e.g., RNA-seq for gene expression profiling, ChIP-seq), provide
summaries in tab-separated file format with explanations.
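Before uploading, it may help to confirm that a gzip-compressed FASTQ file decompresses cleanly and is well formed. The sketch below assumes the common four-line record layout (header, sequence, separator, quality) with no wrapped sequence lines. The function name count_fastq_records is hypothetical, and the reads are built in memory only so that the example is self-contained.

```python
import gzip
import io


def count_fastq_records(handle):
    """Count four-line FASTQ records, raising ValueError on a malformed record."""
    count = 0
    while True:
        header = handle.readline()
        if not header:
            return count  # end of file
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"record {count + 1}: bad header or separator line")
        if len(sequence) != len(quality):
            raise ValueError(f"record {count + 1}: sequence and quality lengths differ")
        count += 1


# Hypothetical reads, compressed in memory instead of read from disk.
raw = gzip.compress(b"@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nIIII\n")
with gzip.open(io.BytesIO(raw), "rt") as handle:
    n_records = count_fastq_records(handle)
print(n_records)
```

For a real submission, the same function could be given gzip.open(path, "rt") for a .gz file or bz2.open(path, "rt") for a .bz2 file.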
Additional Relevant Information
The contributor(s) should also provide additional information to facilitate future analysis and
enable replication of primary findings by other researchers:
● Sequencer information (technology, machine type, version, protocol).
● How the FASTQ files are generated (e.g., pipeline version, base calling software, settings).
● How the BAM files are generated (e.g., workflow, read alignment software, parameters,
reference genome, quality control parameters).
● How the VCF files are generated (e.g., call program, stringency settings).
● For targeted enrichment sequencing and whole exome sequencing, provide information on
the enrichment target regions using the UCSC Genome Browser BED format (see
http://genome.ucsc.edu/FAQ/FAQformat.html). The genomic coordinates should be based on
the reference genome version used for read alignment. Note how the coordinates are
defined: BED files use a "zero-based" coordinate system in which the ending position is
excluded from the interval.
● For RNA-seq experiments, how the transcript-level summaries are generated (e.g.,
statistical/computational steps to generate the summaries, transcript and isoform assemblers,
software used).
● For ChIP-seq experiments, how the summaries are generated (e.g., peak caller software).
● Other information that the submitter deems relevant.
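The BED coordinate convention mentioned in the list above is a frequent source of off-by-one errors. As a minimal sketch, assuming target regions are initially recorded with 1-based, inclusive coordinates, they could be converted to BED lines as follows. The function names and the sample region are hypothetical.

```python
def to_bed_line(chrom, first_base, last_base):
    """Convert a 1-based inclusive region to a zero-based, end-exclusive BED line."""
    if first_base < 1 or last_base < first_base:
        raise ValueError("expected 1-based coordinates with first_base <= last_base")
    # Only the start shifts down by one; the end value stays the same because
    # BED excludes the end position from the interval.
    return f"{chrom}\t{first_base - 1}\t{last_base}"


def bed_interval_length(bed_line):
    """Number of bases covered by a BED line: end minus start."""
    _, start, end = bed_line.split("\t")[:3]
    return int(end) - int(start)


# Hypothetical target region covering bases 1001 through 2000.
line = to_bed_line("chr19", 1001, 2000)
print(line)
print(bed_interval_length(line))
```

Here bases 1001 through 2000 become start 1000 and end 2000, and the interval length is simply end minus start, 1000 bases.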
Other considerations:
Family and Individual Identifiers
If an ID uses dashes to separate different components, please explain each part of the ID
(for example, individual "SJ-12321-1"). Do not use identification numbers with leading
zeros: some computer programs, such as Excel, may convert ID text into numerals without
notice and drop the zeros.
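A simple pre-submission check for problematic identifiers could look like the sketch below. It flags only IDs made up entirely of digits that start with a zero, on the assumption that these are the ones a spreadsheet would convert to numbers. The function name and the sample IDs are hypothetical.

```python
def ids_with_leading_zeros(ids):
    """Return purely numeric IDs that start with a zero (a lone "0" is allowed)."""
    return [
        identifier
        for identifier in ids
        if identifier.isdigit() and len(identifier) > 1 and identifier.startswith("0")
    ]


# Hypothetical identifiers, for illustration only.
sample_ids = ["SJ-12321-1", "00123", "1001", "0"]
flagged = ids_with_leading_zeros(sample_ids)
print(flagged)
```

A lone "0" is not flagged because it is commonly used for a missing parent in pedigree files.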
Information for the reference genome
If a genetic or physical map is supplied, please explain the source. For example, "NCBI physical
map build 34.3".
Be consistent
Make sure all files use exactly the same formats and conventions. In the past, this rule has
most often been violated when data were sent at different time points. Please make sure
subsequent submissions use the same format as the original data.
If you have any further questions about data submission or would like to submit data, please contact
data@niagads.org.
Developed by NIH/NIA, Washington University in St. Louis, and University of Pennsylvania Perelman
School of Medicine