Submitting DNA Barcode Sequences to GenBank: A Tutorial

Todd Osmundson, Garbelotto Lab
September, 2008

Contents

General Introduction
Introduction to NCBI’s Barcode Submission Tool
Sequence files
Chromatograph files
Attribute tables
    Table 1: Sequence attributes
    Table 2: Trace file attributes
    Generating file lists
Submitting the barcode data to GenBank using BarSTool
Using NCBI’s Sequin and tbl2asn
    Element 1: The submit-block template file
    Element 2: The sequence data
    Element 3: The feature annotation table
    Element 4: The source annotation table
    Submitting the data
Appendices
    Appendix I: A sample FASTA file for a DNA barcode submission
    Appendix II: Source modifiers for barcode submissions through BarSTool
    Appendix III: Using the tar utility to make file archives
    Appendix IV: Conducting batch BLAST searches using SeqTools
    Appendix V: Source modifiers for FASTA definition lines or tbl2asn source tables

General Introduction

DNA barcode sequences can be submitted to GenBank (the genetic sequence database at the National Center for Biotechnology Information, NCBI) using several different methods. The emphasis in this tutorial is on methods for batch data checking and submission so that many sequences can be handled at one time.
There are two main ways of making batch sequence submissions to GenBank: NCBI’s Barcode Submission Tool (BarSTool), which is specifically for DNA barcode sequences, and Sequin (or the similar but more automated tool tbl2asn), which handles both barcode and non-barcode sequences. When submitting true DNA barcode sequences (i.e., sequences that meet the length, quality and voucher criteria for official barcodes), it is preferable to use BarSTool, as it has functionality for batch submission of the necessary ancillary data (voucher specimen data, chromatograph trace files, etc.).

However, in working with fungi, we have a problem: whereas DNA barcoding for most animal groups uses a portion of the mitochondrial cytochrome c oxidase subunit I gene (CO1 or cox1), mycologists have adopted the nuclear ribosomal internal transcribed spacer region (nrDNA-ITS) as a barcoding standard, and BarSTool is currently configured to accept only CO1 sequences. I am currently in a dialogue with NCBI to reconfigure BarSTool to accept ITS data, but am unsure of when, or if, such a change will take place. Therefore, this tutorial covers BarSTool in hopes that NCBI will reconfigure it in the near future, and Sequin/tbl2asn in case they do not.

Introduction to NCBI’s Barcode Submission Tool

GenBank allows bulk submission of DNA barcode sequences and ancillary descriptive data via its Barcode Submission Tool (BarSTool). The following tutorial describes how to prepare data for submission and how to submit these data using BarSTool.
For additional information, see the BarSTool website: http://www.ncbi.nlm.nih.gov/WebSub/index.cgi?tool=barcode

The data that you will need consist of 3 parts:

* The sequences themselves, in FASTA format
* Chromatograph trace files for each sequence (one forward, one reverse per sequence)
* Ancillary data: 2 tab-delimited tables (in .txt format), one containing the descriptive data for the sequences and one for the trace files, plus the names and sequences of the forward and reverse primers used for sequencing

Sequence files

The following instructions assume that you have a set of completed sequence contigs (e.g., prepared in Sequencher) of known provenance (i.e., checked against closest matches in GenBank using BLAST). See Appendix IV for instructions on how to conduct BLAST searches.

The sequence data should be in a single FASTA file. A sequence in FASTA format consists of a description line, which begins with a greater-than symbol (">"), followed by a line break and then one or more lines of sequence data. The sequence data can be in one continuous line, but for ease of reading GenBank recommends that all lines of text be shorter than 80 characters in length. The sequence data are followed by a line break, followed by the next sequence. An example FASTA file containing 3 sequences is:

[Figure: example FASTA file containing 3 sequences]

Create a FASTA file for each sequence by exporting the contig’s consensus sequence from Sequencher (File->Export->Consensus). The resulting FASTA file will have as its descriptor line whatever you have named the contig, so make sure that the contig has a unique name that will distinguish it from all other sequences in the dataset.

The individual FASTA files can be joined into a single file using a script; the easiest way to implement such a script is using the Automator program included in Mac OS. Set up the following workflow in Automator (see right side of window):

[Figure: Automator window showing the three-step workflow]

Tasks are added to the workflow using the two windows on the left. For step one, select “Finder” in the far left window (under Library->Applications), then select “Get Specified Finder Items” under the Action column. Steps 2 and 3 are found within TextEdit in the Library column. Before running the workflow, remove all files from the first (Get Specified Finder Items) step (select the files, and click the “-” button), and choose an appropriate name in the third (New Text File) step. Run the workflow by clicking the “Run” button in the upper right-hand corner. Workflows can be saved so you can use them again.

The following UNIX shell command (this can be run through the Terminal program in Mac OS) will do the same concatenation operation, but requires typing in all of the file names:

    $ cat <file1> <file2> > <cat.out>

This operation will concatenate the files “file1” and “file2” into the output file “cat.out”. For a slightly easier way than typing all of the file names, see the instructions for generating file lists, later in this document.

If you prepare your FASTA files in a word processing program (e.g., Microsoft Word) rather than in a text editor (e.g., TextEdit, BBEdit, Text Wrangler, etc.), be sure to save your file as a plain text (.txt) file rather than as a word processor document (.doc, etc.), as the embedded formatting tags in the latter can cause problems downstream.

See Appendix I for a sample FASTA file for a DNA barcode submission.

Chromatograph files

The output from the ABI sequencer includes 3 files for each sequence: the chromatograph trace file (.ab1), a Phred file that includes the base call quality scores (.phd.1), and a text file that includes the sequence itself (.seq). The .seq file includes the raw base calls and is therefore virtually useless without editing, but the .ab1 and .phd.1 files are essential for the barcode submission. The attribute tables (see below) will be used to tie together the trace, Phred, and sequence (as edited contigs in FASTA format) files.
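Returning briefly to the sequence files: the concatenation step, together with the requirement that every contig name be unique, can also be scripted. The following is a minimal Python sketch rather than part of the original workflow; the function names and the file names in the usage note are hypothetical, and it assumes that the sequence name is the first word after the ">" symbol.

```python
from pathlib import Path


def read_fasta_ids(text):
    """Return the identifiers (first word after '>') found in FASTA text."""
    return [line[1:].split()[0] for line in text.splitlines()
            if line.startswith(">") and line[1:].split()]


def concatenate_fasta(paths, out_path):
    """Join FASTA files into one file; refuse to write if a name repeats."""
    pieces = []
    for path in paths:
        text = Path(path).read_text()
        # Normalise line endings and make sure each file ends with a newline,
        # so the next '>' line always starts on a line of its own.
        text = text.replace("\r\n", "\n").replace("\r", "\n").strip("\n") + "\n"
        pieces.append(text)
    combined = "".join(pieces)
    ids = read_fasta_ids(combined)
    duplicates = sorted({name for name in ids if ids.count(name) > 1})
    if duplicates:
        raise ValueError("duplicate sequence names: " + ", ".join(duplicates))
    Path(out_path).write_text(combined)
    return ids
```

For example, `concatenate_fasta(sorted(Path("contigs").glob("*.fasta")), "barcodes.fasta")` would join every .fasta file in a (hypothetical) folder named "contigs" and return the list of sequence names it found.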
The chromatograph trace (.ab1) files must be submitted as a single compressed archive file (.zip, .gz, .tar, etc.). To prepare the trace file, first create a new directory (folder) named “traces” containing all the traces for this submission. Then, assemble the .ab1 files into a single archive file and compress it. The archiving and compression steps can be done simultaneously or in sequence, depending on the tools at your disposal.

To do these steps simultaneously, use a zip utility (e.g., WinZip for Windows). In Mac OS, you can easily prepare a .zip file as follows:

1. Select the files to be put in the archive (this should be all files in the “traces” folder that you just created).
2. Select the cogwheel drop-down menu and select “Create Archive of X items”; a .zip file containing a compressed archive will appear in the folder. This is the file that you will submit through BarSTool.

Alternatively, you can produce the archive using a tar utility, and compress it using the gzip utility. For instructions on using the UNIX tar utility, see Appendix III. For instructions on using the UNIX gzip utility, see the gzip home page: http://www.gzip.org/#intro

Attribute tables

The main difference between submission of barcode sequences and that of other DNA sequence data is that barcode sequences are held to a higher standard: they must correspond to vouchered specimens, must be from particular (agreed-upon) loci, and must be of high quality. High quality here means a low percentage of ambiguous bases (“N”s), both forward and reverse sequences, and a link to a chromatograph file and/or a file of read quality metrics. The attribute tables are the way to link each sequence to its appropriate chromatograph file and voucher specimen data. These tables must be submitted to GenBank as tab-delimited text files.
One can use a text editor to make the attribute tables, but it is probably easier to use a spreadsheet application such as Excel; just be sure to save the file as tab-delimited text when you’re finished.

Table 1: Sequence attributes

This is a tab-delimited text file that includes information about the sequence and the specimen from which it is derived (NCBI refers to this information as “source modifiers”). The NCBI website has downloadable templates for a table including the source modifiers recommended by the Barcoding Consortium, and a table including all possible source modifiers. In general, it is sufficient to use just the recommended source modifiers. Here are the links:

* Template including recommended source modifiers: http://www.ncbi.nlm.nih.gov/WebSub/html/help/templates/source-table-recommended.txt
* Template including all source modifiers: http://www.ncbi.nlm.nih.gov/WebSub/html/help/templates/source-table-all.txt

The links above display the templates; to download the files, go to the NCBI help page and download them from there: http://www.ncbi.nlm.nih.gov/WebSub/html/help/source-table.html

Here is a sample source modifier table:

[Figure: sample source modifier table]

The first column includes the sequence ID. This must be the same ID that was used in the description line of that sequence in the FASTA file. The rest of the columns include information about the collection and the accession number of the voucher specimen. While more information is usually better, GenBank will accept a table that includes only the first (Sequence_ID) and last (Specimen_voucher) columns. For an example file with the 2-column format, see the following URL: http://www.ncbi.nlm.nih.gov/WebSub/html/help/sample_files/source-table-2-col-sample.txt

Official barcode submissions also require the “Country” modifier (country where the specimen was collected).

So, the end result of this table is to tell GenBank which voucher specimen corresponds to each barcode sequence.
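If the specimen data already live in a script or database rather than a spreadsheet, a table of this kind can be written directly. The sketch below is an illustration only: the function name is hypothetical, the three column headings are the ones named in the text above (the full NCBI templates contain more columns), and the sample values are invented placeholders.

```python
def write_source_table(rows, out_path,
                       columns=("Sequence_ID", "Country", "Specimen_voucher")):
    """Write specimen records (dicts) as a tab-delimited source modifier table."""
    if columns[0] != "Sequence_ID":
        raise ValueError("the first column heading must be Sequence_ID")
    lines = ["\t".join(columns)]
    for row in rows:
        # A modifier that is missing for a specimen becomes an empty cell.
        cells = [str(row.get(name, "")) for name in columns]
        if any("\t" in cell or "\n" in cell for cell in cells):
            raise ValueError("values may not contain tabs or line breaks")
        lines.append("\t".join(cells))
    with open(out_path, "w", newline="\n") as handle:
        handle.write("\n".join(lines) + "\n")
```

A call such as `write_source_table([{"Sequence_ID": "Mp_MG15", "Country": "USA", "Specimen_voucher": "MG 15"}], "source.txt")` produces a two-line file: the heading row, then one row for the specimen.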
Note the following requirements for the source modifiers table:

* The heading for the first column must be exactly Sequence_ID, as shown in the sample table.
* Each specimen in the set must have a line in the source modifiers file, even if there are no modifiers to apply to the specimen.
* Each Sequence_ID may appear only once in the source modifier file.

See Appendix II for descriptions of all source modifier fields.

Table 2: Trace file attributes

This is a tab-delimited text file that includes information about the trace files. The NCBI website has a downloadable template for this table; to see the format, go to the following URL: http://www.ncbi.nlm.nih.gov/WebSub/html/help/templates/trace-table-required.txt

To download the template, go to the NCBI website: http://www.ncbi.nlm.nih.gov/WebSub/html/help/trace-table.html

Here is a sample trace attribute table:

[Figure: sample trace attribute table]

The first row of the table includes the column headings. The columns of the table are as follows (descriptions taken from the NCBI website):

* Template_ID - identifies the sequence. This identifier must be the same value as the Sequence_ID used in the source modifier table and in the nucleotide FASTA file, and allows GenBank to tie together the sequence, trace file, and voucher specimen data for each barcode.
* Trace_file - the path to a specific trace in the trace archive file. If you set up the trace archive by putting all the traces into a directory (folder) named “traces”, the path would start with "traces/". For example: traces/filename.ab1. Note: If you set up your traces directory with subdirectories (e.g., for each separate submission set or for each separate organism), the path listed in the Trace_file column must include the subdirectory name. For example: traces/subdirectory_name/filename.scf.
* Trace_format - names the format of the provided trace file. Trace_format can have the following values: SCF, SFF, ZTR, and ABI.
* Center_project - a sequencing center's internal designation for a specific sequencing project. This field can be useful for grouping related traces.
* Program_ID - the base calling program. This field is free text. Program name, version numbers or dates are very useful. Examples include: phred-19980904e abi-3.1 ATQA TraceTuner Licor Megabase Beckman
* Trace_end - labels which end of the sequence is contained in the read. Possible values: F, R, N for Forward, Reverse, and uNknown.

Note that a Trace_file may appear only once in a Trace Information file; however, a Template_ID may appear more than once.

For more documentation on the NCBI Trace Archive, including additional fields and their descriptions, consult the following website: http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=rfc_b&m=doc&s=rfc_b#PROGRAM_ID

Generating file lists

If it sounds like a lot of work to enter all of the filenames into the spreadsheet, here is some good news: it is not necessary to manually type the file names or copy/paste them individually into the spreadsheet. Let your operating system do most of the work for you by generating a file list! Here’s how:

1. In Mac OS, open a Terminal window (or run cmd in Windows), then navigate to the folder where the trace files and score files have been placed (use the cd command in all operating systems). Note that in Mac OS Terminal, you do not need to manually type the folder path following the “cd” command; just type “cd”, then drag the folder icon (from a Finder window) into the Terminal window and Terminal will write the path for you.

2. Use the following commands to generate a list of all files in the folder and write them to a file entitled list.txt (in Windows, the /b switch limits the listing to bare file names, without sizes and dates):

    Windows: dir /b > list.txt
    MacOS: ls > list.txt
    Linux/Unix: ls > list.txt

3. Generally, in building the spreadsheets, however, you will not want a list of all files, but rather a list of all files within a certain class (e.g., trace files, or Phred files).
These commands will generate a list of files having a particular file extension in the current folder, and save it in a text file entitled list.txt.

To generate a list of the trace files, use the following commands:

    Windows: dir /b *.ab1 > list.txt
    MacOS: ls *.ab1 > list.txt
    Linux/Unix: ls *.ab1 > list.txt

To generate a list of the score files, use:

    Windows: dir /b *.phd.1 > list.txt
    MacOS: ls *.phd.1 > list.txt
    Linux/Unix: ls *.phd.1 > list.txt

You can then open list.txt directly in Excel, or open list.txt in a text editor and copy/paste into a column in an Excel spreadsheet.

Submitting the barcode data to GenBank using BarSTool

[Note: The NCBI BarSTool information page can be found at the following URL: http://www.ncbi.nlm.nih.gov/WebSub/index.cgi?tool=barcode]

Once you have all of the data prepared, it is time to submit them to GenBank. First, you will need to register for a My NCBI account: http://www.ncbi.nlm.nih.gov/entrez/login.fcgi

To begin the submission process, make sure that you have the following available:

* A web browser that supports both JavaScript and cookies
* The title of a published or in-press paper that discusses the Barcode Set
* A text file of the set of nucleotide sequences in FASTA format
* The names and sequences of the forward and reverse primers
* A tab-delimited table of source modifier data for the set
* A text file of the set of protein sequences in FASTA format (optional for CO1; not applicable for ITS)
* A tab-delimited table of trace attributes and a compressed archive containing the traces (optional, but highly recommended)

From the NCBI Barcode of Life home page (http://www.ncbi.nlm.nih.gov/WebSub/index.cgi?tool=barcode), select the link “Sign in to use barcode” in the upper right corner.

On the first page, enter your contact information.

On the second page, enter the names of the sequence authors and the study (either a published or in-press paper, or the name of an unpublished study). Sequence authors for the Venice study should include Matteo Garbelotto, Lydia Baker, and anyone involved in the data acquisition and/or submission of the particular group of sequences being submitted. For the study title we should use a single, agreed-upon name so that all of the sequences can be grouped together even if they are submitted separately. I propose using the name of the study that appears on the lab website: “Barcoding the Venice Fungal Collection.”

On the third page, select a release date for the sequences; this step should be done in consultation with Matteo. Also on the third page, upload the nucleotide FASTA file containing the sequence data.

On the fourth page, you may upload a protein translation file; since ITS is not a protein-coding gene, continue past this step.

The fifth page prompts for primer information. If all sequences were generated using the same primers (e.g., ITS1-F and ITS4-B), choose the option “Set one value for all sequences” and then enter the primer name and sequence.

On the sixth page, upload your sequence source modifier file; note that, if your file does not contain all of the columns recommended by the Barcode Consortium, your submission will still be accepted but may not be given the official “Barcode” label in GenBank.

The seventh page will prompt you to upload the trace information table and the trace archive (the compressed archive file containing all of the chromatograph trace files). Be sure to upload the correct file in the correct place.

Following all information entry and upload, BarSTool presents a text file (in GenBank flat file format) containing your submission as it will appear in GenBank. Review this file carefully to confirm that the specimen, locus, author, and study information are correct. If you are submitting a protein-coding sequence (e.g., CO1), make sure that the translation makes sense (e.g., no stop codons, signified by an asterisk in the protein sequence).
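Because BarSTool ties the FASTA file, the source modifier table, and the trace attribute table together by identifier, it may be worth checking mechanically that the three files agree before the final step. The following Python sketch is one possible check, not an NCBI tool. It applies only the rules stated in this tutorial and assumes that the column headings are spelled exactly as in the field descriptions (Sequence_ID, Template_ID, Trace_file, Trace_format, Trace_end); all function names are hypothetical.

```python
import csv

TRACE_FORMATS = {"SCF", "SFF", "ZTR", "ABI"}  # values listed for Trace_format
TRACE_ENDS = {"F", "R", "N"}                  # Forward, Reverse, uNknown


def fasta_ids(path):
    """Return the first word of every '>' description line."""
    with open(path) as handle:
        return [line[1:].split()[0] for line in handle
                if line.startswith(">") and line[1:].split()]


def read_table(path):
    """Read a tab-delimited table; return (column headings, rows as dicts)."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        rows = list(reader)
        return list(reader.fieldnames or []), rows


def check_submission(fasta_path, source_path, trace_path):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    ids = fasta_ids(fasta_path)
    source_cols, source_rows = read_table(source_path)
    _, trace_rows = read_table(trace_path)

    if not source_cols or source_cols[0] != "Sequence_ID":
        problems.append("source table: first column heading is not Sequence_ID")
    # Every sequence needs exactly one row in the source modifier table.
    source_ids = [row.get("Sequence_ID") for row in source_rows]
    for seq_id in sorted(set(ids)):
        count = source_ids.count(seq_id)
        if count != 1:
            problems.append(f"{seq_id}: {count} source table rows (expected 1)")

    # A Trace_file may appear once; a Template_ID may repeat (forward/reverse).
    seen_traces = set()
    for row in trace_rows:
        trace = row.get("Trace_file")
        if trace in seen_traces:
            problems.append(f"{trace}: Trace_file listed more than once")
        seen_traces.add(trace)
        if row.get("Template_ID") not in ids:
            problems.append(f"{trace}: Template_ID not found in FASTA file")
        if row.get("Trace_format") not in TRACE_FORMATS:
            problems.append(f"{trace}: unexpected Trace_format")
        if row.get("Trace_end") not in TRACE_ENDS:
            problems.append(f"{trace}: unexpected Trace_end")
    return problems
```

Calling `check_submission("barcodes.fasta", "source.txt", "traces.txt")` (file names are placeholders) returns an empty list when no problem is found, and otherwise one message per problem.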
Finalize your submission, and you’re finished!

Using NCBI’s Sequin and tbl2asn

Though one can submit sequences directly to GenBank via a web interface (BankIt), this method only accommodates submission of sequences one at a time, which is certainly an unpalatable option if one has many sequences to submit. Batch submission of sequences is facilitated by NCBI’s Sequin utility. Sequin allows the creation of a single file containing descriptive information for a batch of sequences (author information, etc.) through forms completed by the user, and then packages this file with the sequence files (in FASTA format) into a single Sequin (.sqn) file that can be submitted to GenBank via e-mail.

The utility tbl2asn, as its name (albeit cryptically) suggests, converts information from tables to ASN.1 (Abstract Syntax Notation 1), the file format used by GenBank. It is a command-line program, so there are no menus; you must enter commands directly into a shell (UNIX, DOS, etc.) window.

Regardless of whether Sequin or tbl2asn is used, a submission consists of the following elements:

* Sequence data in FASTA format
* General information about the submission (e.g., author information)
* Annotation of sequence features (e.g., coding regions, non-coding regions)
* Source information (e.g., organism, collection information, etc.)

Besides the difference in the way that the programs are run (command line vs. on-screen forms), Sequin and tbl2asn differ primarily in the way in which some of these submission elements are organized. The general information (author names and institutions, etc.) is in both cases entered into a Sequin project, but when preparing for tbl2asn the user stops after this information is entered and exports the results into a standalone file that can be read by tbl2asn. The feature annotations can be either entered into a form or imported as a tab-delimited text file for Sequin, and must be imported as a table for tbl2asn.
Source information can either be entered into a form (the tedious way) or embedded in the FASTA definition line (the more straightforward way) for Sequin, and is either embedded in the FASTA definition line or stored in a tab-delimited text table for tbl2asn.

Because there is a fair amount of overlap between Sequin and tbl2asn and because tbl2asn is a bit more efficient for large numbers of sequences, I will describe the use of tbl2asn here. Instructions specific to Sequin are available from the following URL: http://www.ncbi.nlm.nih.gov/Sequin/QuickGuide/sequin.htm.

First, download tbl2asn. A link to the FTP site containing the download is available at http://www.ncbi.nlm.nih.gov/Genbank/tbl2asn2.html. Be sure to download the correct version for your platform, then uncompress the file and change the file permissions if necessary. Also download Sequin (at http://www.ncbi.nlm.nih.gov/Sequin/), as you will need this program for the initial steps of the process.

The submission can contain up to 6 types of items, as follows:

1. Template file containing a text ASN.1 Submit-block object (suffix .sbt).
2. Nucleotide sequence data in FASTA format (suffix .fsa).
3. Tab-delimited text file with a table containing sequence features (suffix .tbl).
4. Protein sequence if the gene encodes a protein (suffix .pep).
5. Source Table (suffix .src).
6. Quality Scores (suffix .qvl).

In our nrDNA-ITS sequence submissions, we can omit #4 since ITS is non-coding. We will also omit #6. The remaining four elements are described in more detail below.

Element 1: The submit-block template file

This file is generated through Sequin. To make the file, first open Sequin. Choose “GenBank” as the database for submission, then select the button “Start New Submission.” The following form will open:

[Figure: Sequin submission form]

Select a date for when the sequence record may be released (in consultation with Matteo), and fill in a tentative manuscript title.
Then, select the other tabs and enter the contact, author, and affiliation information. After you have done this, return to the submission tab and use File->Export Submitter Info. Save the file as template.sbt.

Element 2: The sequence data

Sequence data should be given in FASTA format, just as in a BarSTool submission. As in a BarSTool submission, it is easiest if all sequences are combined into a single FASTA file. The FASTA file should be placed in the same directory as the template and table files generated in the other steps of this process.

It is possible to provide source information in the FASTA definition line (see Appendix V), or to store it in a separate tab-delimited table. Keep in mind that the sequence identifier (sequence “title”) used in the definition line (i.e., following the “>” symbol) must be identical to the one used in the source modifier and feature annotation tables. This sequence ID will be changed to a GenBank accession number by the NCBI staff after the sequences are submitted. For our submissions, we will put the source modifier data in a separate file (see Element 4), so the FASTA definition lines need only contain a unique identifier.

Element 3: The feature annotation table

Sequence features such as the location of coding regions, introns, or different structural parts of a gene must be identified prior to submission. For example, a typical ITS1/ITS4-primed sequence product contains a small portion of the 18S ribosomal RNA gene, followed by the first internal transcribed spacer (ITS1), the 5.8S ribosomal RNA gene, the second internal transcribed spacer (ITS2), and a small portion of the 28S ribosomal RNA gene. Identification of these elements and their positions will be, by far, the most time-consuming part of this process.

These feature annotations must then be stored in a tab-delimited table having a specific 5-column format (columns separated by tabs).
The file begins with a definition line similar to that in a FASTA-formatted sequence; for example:

    >Feature SeqId table_name

The sequence identifier (SeqId) must be the same as that used in the sequence FASTA file. The table name portion is optional. Subsequent lines of the table list the features, each on a separate line. Each feature can contain additional notes or qualifiers, placed on the lines below the feature type and location. The columns are as follows:

    Column 1: Start location of feature
    Column 2: Stop location of feature
    Column 3: Feature key (type)
    Column 4: Qualifier key (placed on the row below the information in the first 3 columns)
    Column 5: Qualifier value (also placed on the row below the information in the first 3 columns)

An example table may look like this (columns are separated by tabs in the actual file):

    >Feature Lp_1625
    1      629    source
                          organism          Laccaria pseudomontana
                          mol_type          genomic DNA
                          isolate           pse1625
                          specimen_voucher  Cripps 1625(type)
                          db_xref           taxon:344594
                          tissue_type       basidiome
                          country           USA: Colorado, Ten Mile Range, Blue Lake Dam
                          note              type strain of Laccaria sp. CLC1771
    <1     14     rRNA
                          product           18S ribosomal RNA
    15     249    misc_RNA
                          product           internal transcribed spacer 1
    250    407    rRNA
                          product           5.8S ribosomal RNA
    408    614    misc_RNA
                          product           internal transcribed spacer 2
    615    >629   rRNA
                          product           28S ribosomal RNA

Note that, in columns 1 and 2 (feature start and stop positions), the first rRNA entry begins with <1, and the last entry ends with >629. The “<” denotes that the feature actually begins before the first nucleotide of our sequence; the “>” denotes that the feature actually ends after the last nucleotide of our sequence.
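Once the four internal boundaries of an ITS sequence are known, the rRNA and spacer rows follow a fixed pattern, so they can be generated rather than typed. The sketch below illustrates the 5-column layout described above; the function name and parameters are hypothetical, the feature keys and product names are copied from the example table, and the source feature is left out on the assumption that source modifiers are kept in a separate table.

```python
def its_feature_table(seq_id, length, its1_start, s58_start, its2_start, lsu_start):
    """Build a 5-column feature table for one 18S/ITS1/5.8S/ITS2/28S sequence.

    Positions are 1-based; each *_start is the first base of that feature.
    """
    bounds = [1, its1_start, s58_start, its2_start, lsu_start, length + 1]
    if bounds != sorted(set(bounds)):
        raise ValueError("feature starts must increase and lie within the sequence")
    features = [
        ("rRNA", "18S ribosomal RNA"),
        ("misc_RNA", "internal transcribed spacer 1"),
        ("rRNA", "5.8S ribosomal RNA"),
        ("misc_RNA", "internal transcribed spacer 2"),
        ("rRNA", "28S ribosomal RNA"),
    ]
    lines = [f">Feature {seq_id}"]
    for index, (key, product) in enumerate(features):
        start = str(bounds[index])
        stop = str(bounds[index + 1] - 1)
        if index == 0:
            start = "<" + start   # 18S gene begins before our first base
        if index == len(features) - 1:
            stop = ">" + stop     # 28S gene continues past our last base
        lines.append(f"{start}\t{stop}\t{key}")
        # Qualifier rows leave columns 1-3 empty, hence three leading tabs.
        lines.append(f"\t\t\tproduct\t{product}")
    return "\n".join(lines) + "\n"
```

With the coordinates of the example, `its_feature_table("Lp_1625", 629, 15, 250, 408, 615)` returns the five rRNA and misc_RNA rows shown in the example table, each followed by its product qualifier row.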
The example annotation table for Lp_1625 will yield a GenBank file having the following feature annotations (see “FEATURES,” below):

LOCUS       DQ149871                 629 bp    DNA     linear   PLN 13-MAR-2006
DEFINITION  Laccaria pseudomontana isolate pse1625 18S ribosomal RNA gene,
            partial sequence; internal transcribed spacer 1, 5.8S ribosomal
            RNA gene, and internal transcribed spacer 2, complete sequence;
            and 28S ribosomal RNA gene, partial sequence.
ACCESSION   DQ149871
VERSION     DQ149871.1  GI:76781901
KEYWORDS    .
SOURCE      Laccaria pseudomontana
  ORGANISM  Laccaria pseudomontana
            Eukaryota; Fungi; Dikarya; Basidiomycota; Agaricomycotina;
            Agaricomycetes; Agaricomycetidae; Agaricales; Tricholomataceae;
            Laccaria.
REFERENCE   1  (bases 1 to 629)
  AUTHORS   Osmundson,T.W., Cripps,C.L. and Mueller,G.M.
  TITLE     Morphological and molecular systematics of Rocky Mountain alpine
            Laccaria
  JOURNAL   Mycologia 97 (5), 949-972 (2006)
REFERENCE   2  (bases 1 to 629)
  AUTHORS   Osmundson,T.W., Cripps,C.L. and Mueller,G.M.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-JUL-2005) Ecology, Evolution and Environmental
            Biology, Columbia University, 1200 Amsterdam Avenue, MC 5557,
            New York, NY 10027, USA
FEATURES             Location/Qualifiers
     source          1..629
                     /organism="Laccaria pseudomontana"
                     /mol_type="genomic DNA"
                     /isolate="pse1625"
                     /specimen_voucher="Cripps 1625 (type)"
                     /db_xref="taxon:344594"
                     /tissue_type="basidiome"
                     /country="USA: Colorado, Ten Mile Range, Blue Lake Dam"
                     /note="type strain of Laccaria sp. CLC 1771"
     rRNA            <1..14
                     /product="18S ribosomal RNA"
     misc_RNA        15..249
                     /product="internal transcribed spacer 1"
     rRNA            250..407
                     /product="5.8S ribosomal RNA"
     misc_RNA        408..614
                     /product="internal transcribed spacer 2"
     rRNA            615..>629
                     /product="28S ribosomal RNA"
ORIGIN
        1 aggatcatta ttgaataaac ctgatgtggc tgttagctgg cttttcaaag catgtgctcg
       61 tccgtcatct ttaatttctc cacctgtgca cattttgtag tcttggatac ctctcgaggc
      121 aactcggatt ttaggatcgc cgtgctgtaa aagtcagctt tcctctcatt tccaagacta
      181 tgttttcata tacaccaaag tatgtttaaa gaatgtcatc aatgggaact tgtttcctat
      241 aaaattatac aactttcagc aacggatctc ttggctctcg catcgatgaa gaacgcagcg
      301 aaatgcgata agtaatgtga attgcagaat tcagtgaatc atcgaatctt tgaacgcacc
      361 ttgcgctcct tggtattccg aggagcatgc ctgtttgagt gtcattaaat tctcaacctt
      421 ccaactttta ttagcttggt taggcttgga tgtgggggtt gcgggcttca tcaatgaggt
      481 cggctctcct taaatgcatt agcggaactt ttgtggaccg tctattggtg tgataattat
      541 ctacgccgtg gatttgaagc agctttatga agttcagcct ctaaccgtcc attgacttgg
      601 acaattttga caatttgacc tcaaatcag
//

It is best to make this table, as well as the source modifier table, in a text editor rather than a word processor in order to avoid the surreptitious insertion of formatting codes. Save the feature annotation table with the file extension .tbl.

Element 4: The source annotation table

This table contains information about the biological source of the sequence. It can contain a wealth of different information (see Appendix V for a complete list of accepted source modifiers), but usually only includes a small subset of the possible fields. The first column must include the sequence ID (SeqID); this code must be identical to the one in the definition line of the corresponding FASTA file. The second column should include the organism name (Latin binomial).
For our submissions, we should also use the fields recommended for DNA barcode data by the Consortium for the Barcode of Life, as doing so will facilitate obtaining official barcode designation for our sequences; these fields are: Collected-by, Collection-date, Country, Identified-by, LatLon, and Specimen-voucher. An example table is as follows: SeqID organism Collected-by Collection-date Country Identified-by Mp_MG15 Mycena pura Matteo Garbelotto 1-April-2008 USA Todd Osmundson Lb_MG99 Mycena impura Matteo Garbelotto 1-April-2008 USA Doug Schmidt Lat-lon Specimen-voucher 13.57 N 24.68 W MG 15 13.57 N 24.68 W MG 99 The table must be saved as a tab-delimited file with a .src extension. Submitting the data Now we will run the tbl2asn program, which will generate a Sequin (.sqn) file that we can submit to GenBank via e-mail. First, copy all files into the directory that contains the tbl2asn program, as this simplifies the path specification in the command line. In MacOS, open a Terminal window; in Windows, open a DOS command window; then navigate to the directory that contains the tbl2asn program. Typing “tbl2asn -” at the shell prompt will produce the full list of command line arguments; the following page contains a summary of the most common ones, as well as some example command lines. We will use the following command line, for a batch submission with multiple sequences per .fsa file: tbl2asn -t template.sbt –p. -a s -V v This command line makes several assumptions: that the Sequin template is named “template.sbt”, that all sequences are in a single FASTA file, and that all files are in the same directory as the tbl2asn program; make sure that your data meet these assumptions, or change the command line accordingly. Note also that the .fsa, .tbl, and .src files must have the same filename prefix (e.g., mycena.fsa and mycena.tbl), or tbl2asn will not match them correctly. Most common command line arguments for tbl2asn (from the NCBI website). 
-p   Path to the directory. If files are in the current directory, -p . should be used.
-r   Path for the resulting .sqn file(s) (if the -r argument is not used, the .sqn files will be saved in the source directory).
-t   Specifies the template file (.sbt). If the .sbt file is in a different directory, the full path must be specified.
-i   Creates a single submission from the indicated .fsa file in a directory of multiple .fsa files.
-a   Specifies the file type:
       s   FASTA Set (s Batch, s1 Pop, s2 Phy, s3 Mut, s4 Eco)
       l   FASTA+Gap Alignment
       z   FASTA with Gap Lines
       e   PHRAP/ACE
       d   FASTA Delta, di FASTA Delta with Implicit Gaps
       a   Any (default)
     Sample command line: -a s
-s   Instructs tbl2asn to read multiple FASTA components in one file as a set of unrelated sequences. Equivalent to “-a s”. This creates a single file of multiple submissions. (1000 sequences per file is the usual maximum.)
-j   Allows the addition of source qualifiers that will be the same for each submission.
     Example: -j "[organism=Saccharomyces cerevisiae] [strain=S288C]"
-V   Verification (combine any of the following letters):
       v   Validates the data records. The output is saved to files with a .val suffix.
       b   Generates GenBank flatfiles with a .gbf suffix.
       r   Validates without Country Check
     Sample command line: -V vb
-k   CDS flags (combine any of the following letters):
       c   Instructs tbl2asn to annotate the longest open reading frame (ORF) if a .tbl file is not provided. The product name will be ‘unknown’ unless a product name is included in the FASTA definition, [product=xyz].
       m   Allows alternative start codons to be used in ORF searches.
       r   Allows Runon ORFs
     Sample command line: -k c
-y   Adds a COMMENT to each submission.
     Example: -y "Contigs larger than 2kb have been annotated, representing approx. 87% of the total genome"
-Y   Like -y, but adds a COMMENT to each submission from a file.
-Z   Runs the Discrepancy Report. Must supply an output file name. Recommended only for annotated genome submissions, complete or WGS.
     See the Discrepancy Report page for information about its output.
-o   Creates a single submission from multiple fasta files.

Example Command Lines:

Single submission: one sequence per .fsa file:
    tbl2asn -t template.sbt -p path_to_files -V v

Batch submission: multiple sequences per .fsa file:
    tbl2asn -t template.sbt -p path_to_files -a s -V v

Single submission: one .fsa file in directory of multiple .fsa files:
    tbl2asn -t template.sbt -i x.fsa -V v

The -V v portion of the command line generates a validation file with a .val extension. Before submitting your .sqn files to GenBank, review the .val files (open them with a text editor); you will need to correct every message reported at the ERROR level. Taxonomy-related errors about missing lineages can (and should) generally be ignored.

To correct errors, open the newly created .sqn file in Sequin by double-clicking on it. Double-click on the portion of the GenBank output that contains an error; the program will draw a black, vertical line to the left of the portion and open a dialog box where you can correct the error. When you open the Sequin file, the first record in the set will be displayed; to go to other records, click in the box that contains the sequence ID.

[Figure: screenshots of the Sequin window, with the callouts “Double-click here to go to another sequence” and “Double-click here to correct the source information”.]

Unfortunately, if we wish to submit chromatograph trace files to the NCBI Trace Archive, we will have to do so separately, rather than as part of an integrated submission as in BarSTool. At present, a trace file does not appear to be a requirement to obtain official barcode designation for a sequence, so we may wish to skip this step for now.
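The review of .val files described above can be partly automated when a batch produces many messages. The following Python sketch is only an illustration, not part of tbl2asn or Sequin: it assumes that each validator message sits on one line of the form "SEVERITY: valid [CATEGORY.Code] message", and that ERROR, REJECT and FATAL are the severities that block a submission. The function name blocking_messages and the two sample lines are invented for this example; check the pattern against your own .val files before relying on it.

```python
import re

# Assumed line shape: "SEVERITY: valid [CATEGORY.Code] message ..."
# This layout is an assumption about .val files; adjust the pattern if yours differ.
VAL_LINE = re.compile(
    r"^(?P<severity>[A-Z]+):\s*(?:valid\s*)?\[(?P<code>[^\]]+)\]\s*(?P<message>.*)$"
)

# Severities treated as "must fix before submitting" (an assumption for this sketch).
BLOCKING = {"ERROR", "REJECT", "FATAL"}


def blocking_messages(val_text):
    """Return (severity, code, message) tuples for lines with a blocking severity."""
    found = []
    for line in val_text.splitlines():
        match = VAL_LINE.match(line.strip())
        if match and match.group("severity") in BLOCKING:
            found.append(
                (match.group("severity"), match.group("code"), match.group("message"))
            )
    return found


# Invented sample text, for illustration only (not real tbl2asn output).
sample = """\
WARNING: valid [SEQ_DESCR.MissingLineage] No lineage for this BioSource
ERROR: valid [SEQ_DESCR.BadCollectionDate] Collection_date format is not in DD-Mmm-YYYY format
"""

for severity, code, message in blocking_messages(sample):
    print(severity, code, message)
```

To use it on a real file, read the file's text first (for example with pathlib.Path("mycena.val").read_text(), where mycena.val is a hypothetical file name) and pass the result to blocking_messages. Messages you have decided to ignore, such as the missing-lineage ones mentioned above, still need to be judged by eye.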
Appendices

Appendix I: A Sample FASTA File for a DNA Barcode Submission

(Source: NCBI)

>Seq1 [organism=Carpodacus mexicanus]
CCTTTATCTAATCTTTGGAGCATGAGCTGGCATAGTTGGAACCGCCCTCAGCCTCCTCATCCGTGCAGAA
CTTGGACAACCTGGAACTCTTCTAGGAGACGACCAAATTTACAATGTAATCGTCACTGCCCACGCCTTCG
TAATAATTTTCTTTATAGTAATACCAATCATGATCGGTGGTTTCGGAAACTGACTAGTCCCACTCATAAT
CGGCGCCCCCGACATAGCATTCCCCCGTATAAACAACATAAGCTTCTGACTACTTCCCCCATCATTTCTT
TTACTTCTAGCATCCTCCACAGTAGAAGCTGGAGCAGGAACAGGGTGAACAGTATATCCCCCTCTCGCTG
GTAACCTAGCCCATGCCGGTGCTTCAGTAGACCTAGCCATCTTCTCCCTCCACTTAGCAGGTGTTTCCTC
TATCCTAGGTGCTATTAACTTTATTACAACCGCCATCAACATAAAACCCCCAACCCTCTCCCAATACCAA
ACCCCCCTATTCGTATGATCAGTCCTTATTACCGCCGTCCTTCTCCTACTCTCTCTCCCAGTCCTCGCTG
CTGGCATTACTATACTACTAACAGACCGAAACCTAAACACTACGTTCTTTGACCCAGCTGGAGGAGGAGA
CCCAGTCCTGTACCAACACCTCTTCTGATTCTTCGGCCATCCAGAAGTCTATATCCTCATTTTAC
>Seq2 [organism=Vireo solitarius]
GGTAGGTACCGCCCTAAGNCTCCTAATCCGAGCAGAACTANGCCAACCCGGAGCCCTTCTGGGAGACGAC
CAAATCTACAACGTAGTCGTTACGGCCCACGCCTTCGTAATAATCTTTTTCATAGTAATGCCAATCATAA
TCGGAGGATTCGGGAACTGACTAGTTCCTCTAATGATTGGGGCCCCAGACATAGCATTCCCTCGAATAAA
CAACATAAGCTTTTGACTACTACCACCATCATTCCTACTCCTAATAGCCTCCTCAACAGTAGAAGCAGGA
GCCGGAACCGGATGAACCGTGTACCCACCACTAGCTGGAAACCTGGCCCACGCCGGAGCCTCAGTAGACC
TAGCTATCTTCTCCCTACACCTAGCAGGTATCTCATCCATCCTGGGGGCAATTAACTTCATTACAACAGC
AATCAACATAAAACCACCCGCCCTCTCACAATACCAAACACCACTATTCGTGTGATCCGTCCTAATTACG
GCCGTACTACTCCTACTATCTCTCCCAGTACTAGCCGCCGGTATCACCATGCTACTCACAGACCGCAACC
TCAACACCACCTTCTTTGACCCAGCAGGAGGAGGAGACCCAGTACTATACCAGCACCTATTCTGATTCTT
CGGACACCCAGAAGTCTACATCCTAATTCTC
>Seq3 [organism=Dendroica tigrina]
CCTATACCTAATTTTCGGCGCATGAGCCGGAATGGTGGGTACCGCTCTAAGCCTCCTCATTCGAGCAGAA
CTAGGCCAACCCGGAGCCCTTCTGGGAGACGACCAAGTCTACAACGTGGTTGTCACGGCCCATGCCTTCG
TAATAATCTTCTTTATAGTTATGCCGATTATAATCGGAGGATTCGGAAACTGACTAGTCCCCCTAATAAT
CGGAGCCCCAGACATAGCATTTCCGCGAATAAACAACATAAGCTTCTGACTACTCCCACCATCATTCCTC
CTCCTCTTAGCATCCTCCACAGTGGAAGCAGGCGTAGGTACAGGCTGAACAGTGTATCCCCCACTAGCTG
GCAACCTAGCTCATGCCGGGGCCTCAGTCGACCTCGCAATCTTCTCCTTACACCTAGCTGGTATTTCCTC
AATCCTCGGAGCAATTAACTTCATTACAACAGCAATTAACATGAAACCTCCTGCCCTCTCACAATACCAA
ACCCCACTATTCGTCTGATCAGTGTTAATTACTGCAGTCCTCCTTCTCCTTTCCCTTCCAGTTCTAGCTG
CAGGAATCACAATGCTCCTCACAGACCGCAACCTCAACACCACATTCTTCGACCCTGCCGGAGGAGGAGA
TCCCGTCCTATATCAACATCTCTTCTGATTCTTCGGCCACCCAGAAGTCTACATCCTAATCCTC
>Seq4 [organism=Vireo gilvus]
CATGAGCTGGAATAGTAGGTACCGCCCTAAGCCTCCTAATTCGAGCAGAGCTAGGCCAACCCGGAGCCCT
ACTGGGAGACGACCAAATCTACAACGTAGTCGNCACGGCCCATGCTTTTGTAATAATCTTCTTCATAGTA
ATGCCAATCATAATCGGAGGGTTTGGAAACTGACTGGTCCCCCTAATAATTGGAGCTCCAGACATAGCAT
TCCCCCGAATAAACAACATGAGTTTCTGACTACTTCCCCCATCATTCCTACTACTAATAGCCTCCTCAAC
AGTAGAAGCAGGCGTTGGAACAGGATGAACCGTATATCCACCACTAGCCGGAAACCTAGCCCATGCAGGA
GCCTCAGTAGACCTAGCTATCTTCTCCCTACACCTAGCAGGTATCTCCTCCATCCTAGGGGCAATCAACT
TCATTACAACAGCAATCAACATAAAACCACCCGCCCTATCACAATACCAAACACCACTATTCGTATGATC
CGTCCTAATCACAGCCGTACTACTCCTCCTATCACTCCCAGTGCTAGCTGCTGGAATTACCATGCTACTT
ACAGACCGCAACCTCAACACTACCTTCTTTGACCCAGCAGGGGGAGGAGACCCAGTGCTATACCAACATC
TATTCTGATTCTTCGGACACCCAGAAGTTTACATCCTAATTCTC
>Seq5 [organism=Dendroica castanea]
CCTATACCTAATTTTCGGCGCATGAGCCGGAATAGTGGGTACCGCCCTAAGCCTCCTCATTCGAGCAGAA
CTAGGCCAACCCGGAGCCCTTCTGGGAGACGACCAAGTCTATAACGTAGTTGTCACGGCCCATGCCTTCG
TAATAATTTTCTTTATAGTTATGCCGATTATAATCGGAGGATTCGGAAACTGACTAGTCCCCCTAATAAT
CGGAGCCCCAGACATAGCATTCCCACGAATAAACAACATAAGCTTCTGACTACTCCCACCATCATTCCTT
CTCCTCCTAGCATCCTCCACAGTCGAAGCAGGCGTAGGTACAGGCTGAACAGTATACCCCCCACTAGCTG
GCAACCTAGCTCACGCCGGAGCCTCAGTCGACCTCGCAATCTTCTCTCTACACCTAGCTGGTATTTCCTC
AATCCTCGGAGCAATCAACTTCATTACAACAGCAATTAACATAAAACCTCCTGCCCTCTCACAATACCAA
ACCCCACTGTTCGTCTGATCCGTCCTAATCACTGCAGTCCTCCTGCTCCTTTCCCTTCCAGTTCTAGCTG
CAGGAATCACAATACTCCTCACAGACCGCAACCTAAACACCACATTCTTCGACCCTGCTGGAGGAGGAGA
TCCCGTCCTATATCAACACCTTTTCTGATTCTTCGGCCACCCAGAAGTCTACATCCTAATCNTC
>Seq6 [organism=Vireo gilvus]
CATGAGCTGGAATAGTAGGTACCGCCCTAAGCCTCCTAATTCGAGCAGAGCTAGGCCAACCCGGAGCCCT
ACTGGGAGACGACCAAATCTACAACGTAGTCGTCACGGCCCATGCTTTTGTAATAATCTTCTTCATAGTA
ATGCCAATCATAATCGGAGGGTTTGGAAACTGACTGGTCCCCCTAATAATTGGAGCTCCAGACATAGCAT
TCCCCCGAATAAACAACATGAGTTTCTGACTACTTCCCCCATCATTCCTACTACTAATAGCCTCCTCAAC
AGTAGAAGCAGGCGTTGGAACAGGATGAACTGTATACCCGCCACTAGCCGGTAACCTAGCCCATGCAGGA
GCCTCAGTAGACCTAGCTATCTTCTCCCTACACCTAGCAGGTATCTCCTCCATCCTAGGGGCAATCAACT
TCATTACAACAGCAATCAACATAAAACCACCCGCCCTATCACAATACCAAACACCACTATTCGTATGATC
CGTCCTAATCACAGCCGTACTACTCCTCCTATCACTCCCAGTGCTAGCTGCTGGAATTACCATGCTACTT
ACAGACCGCAACCTCAACACTACCTTCTTTGACCCAGCAGGGGGAGGAGACCCAGTGCTATACCAACATC
TATTCTGATTCTTCGGACACCCAGAAGTTTACATCCTAATTCTC

Appendix II: Source Modifiers for Barcode Submissions through BarSTool

(from the NCBI website http://www.ncbi.nlm.nih.gov/WebSub/html/help/sourcetable.html)

In addition to the Sequence ID, the following source modifiers are required for Barcode submissions:

Country - The country of origin of DNA samples used.
Specimen_voucher - An identifier of the individual or collection of the source organism and the place where it is currently stored, usually an institution.

The following source modifiers are recommended for Barcode submissions:

Collected_by - Name of person who collected the sample.
Collection_date - Date the specimen was collected, in the format DD-Mon-YYYY, that is, 2-digit day, three-character abbreviation of month, and 4-digit year (e.g., 11-Feb-2002). Mon-YYYY and YYYY are alternate formats to use when date information is less complete.
Identified_by - Name of the person or persons who identified by taxonomic name the organism from which the sequence was obtained.
Lat_Lon - Latitude and longitude, in decimal degrees, of where the sample was collected.

The following optional source modifiers are available to further describe the sequences in a Barcode set:

Authority - The author or authors of the organism name from which sequence was obtained.
Biotype - Variety of a species (usually a fungus, bacteria, or virus) characterized by some specific biological property (often geographical, ecological, or physiological). Same as biovar.
Biovar - See Biotype.
Breed - The named breed from which sequence was obtained (usually applied to domesticated mammals).
Cell_line - Cell line from which sequence was obtained.
Cell_type - Type of cell from which sequence was obtained.
Chemovar - Variety of a species (usually a fungus, bacteria, or virus) characterized by its biochemical properties.
Clone - Name of clone from which sequence was obtained.
Cultivar - Cultivated variety of plant from which sequence was obtained.
Dev_stage - Developmental stage of organism.
Ecotype - The named ecotype (population adapted to a local habitat) from which sequence was obtained (customarily applied to populations of Arabidopsis thaliana).
Forma - The forma (lowest taxonomic unit governed by the nomenclatural codes) of organism from which sequence was obtained. This term is usually applied to plants and fungi.
Forma_specialis - The physiologically distinct form from which sequence was obtained (usually restricted to certain parasitic fungi).
Genotype - Genotype of the organism.
Haplotype - Haplotype of the organism.
Isolate - Identification or description of the specific individual from which this sequence was obtained.
Isolation_source - Describes the local geographical source of the organism from which the sequence was obtained.
Lab_host - Laboratory host used to propagate the organism from which the sequence was obtained.
Natural_host - When the sequence submission is from an organism that exists in a symbiotic, parasitic, or other special relationship with some second organism, the 'natural host' modifier can be used to identify the name of the host species.
Note - Any additional information that you wish to provide about the sequence.
Pathovar - Variety of a species (usually a fungus, bacteria, or virus) characterized by the biological target of the pathogen. Examples include Pseudomonas syringae pathovar tomato and Pseudomonas syringae pathovar tabaci.
Pop_variant - Name of the population variant from which the sequence was obtained.
Serogroup - Variety of a species (usually a fungus, bacteria, or virus) characterized by its antigenic properties. Same as serotype and serovar.
Serotype - See Serogroup.
Serovar - See Serogroup.
Sex - Sex of the organism from which the sequence was obtained.
Strain - Strain of organism from which sequence was obtained.
Sub_species - Subspecies of organism from which sequence was obtained.
Subclone - Name of subclone from which sequence was obtained.
Subtype - Subtype of organism from which sequence was obtained.
Substrain - Sub-strain of organism from which sequence was obtained.
Tissue_lib - Tissue library from which the sequence was obtained.
Tissue_type - Type of tissue from which sequence was obtained.
Type - Type of organism from which sequence was obtained.
Variety - Variety of organism from which sequence was obtained.

Appendix III: Using the tar utility to make file archives

By Owen L. Astrachan, Duke University (http://www.cs.duke.edu/~ola/courses/programming/tar.html)

The program tar (originally for tape archive) is useful for archiving and transmitting files. For example, you may want to 'tar up' all your work for a course on the acpub and save it to your own computer's disk drive so you don't run into quota problems. You might also want to submit (e.g., for cps 108 or cps 100) an entire directory at once rather than the individual files in the directory. The tar program is useful for these and other tasks and is simple to use. You can see more information by reading the man page (type man tar). The examples below are not meant to be exhaustive. You can also use the utility gtar instead.

Create, Extract, See Contents

The tar program takes one of three function command line arguments (there are two others I won't talk about).

c --- to create a tar file, writing the file starts at the beginning.
t --- table of contents, see the names of all files or those specified in other command line arguments.
x --- extract (restore) the contents of the tar file.

(The other options are u for update and r for replace; see the man page for details.)

Exactly one function argument, c, t, or x, is used in conjunction with the other command line arguments shown below. Again, these examples are not meant to be complete, just useful.

Compression, Verbose, File specified

In addition to a function command line argument, the arguments below are useful. I usually use z and f all the time, and v when creating/extracting.

f --- specifies the filename (which follows the f) used to tar into or to tar out from; see the examples below.
z --- use gzip to compress the tar file or to read from a compressed tar file.
v --- verbose output, show, e.g., during create or extract, the files being stored into or restored from the tar file.

Examples

To tar all .cc and .h files into a tar file named foo.tgz use:

    tar cvzf foo.tgz *.cc *.h

This creates (c) a compressed (z) tar file named foo.tgz (f) and shows the files being stored into the tar file (v). The .tgz suffix is a convention for gzipped tar files; it's useful to use the convention since you'll know to use z to restore/extract.

It's often more useful to tar a directory (which tars all files and subdirectories recursively unless you specify otherwise). The nice part about tarring a directory is that it is untarred as a directory rather than as individual files.

    tar cvzf foo.tgz cps100

will tar the directory cps100 (and its files/subdirectories) into a tar file named foo.tgz.

To see a tar file's table of contents use:

    tar tzf foo.tgz

To extract the contents of a tar file use:

    tar xvzf foo.tgz

This untars/extracts (x) into the directory from which the command is invoked, and prints the files being extracted (v). If you want to untar into a specified directory, change into that directory and then use tar.
For example, to untar into a directory named newdir:

    mkdir newdir
    cd newdir
    tar xvzf ../foo.tgz

You can extract only one (or several) files if you know the name of the file. For example, to extract the file named anagram.cc from the tarfile foo.tgz:

    tar xvzf foo.tgz anagram.cc

Other Archiving/Compression Tools

Many PC/Mac programs will be able to restore files that have been archived using tar. For example, on Macs, the Stuffit Deluxe program can handle Unix tar files. On PCs, the pkunzip program will handle Unix tar files. This makes it possible to tar files up on [a server] and then use ftp to bring them to your personal machine where you can store the tar files and restore when needed. Of course you can run Linux too.

The zip and unzip commands available on some systems are very useful replacements for tar. Zip/unzip programs are nearly standard on Windows 95/NT machines and zip will archive entire directory structures with the right options (type zip by itself for help).

Appendix IV. Conducting batch BLAST searches using SEQTools

(Thanks to Silvia for the software suggestion!)

SEQTools is a versatile software package for sequence manipulation and analysis. Among the many tasks that SEQTools can accomplish is facilitating the submission of batches of DNA or protein sequences to the NCBI BLAST web interface. The following tutorial describes how to create a sequence project and conduct a batch BLAST search using SEQTools.

SEQTools can be downloaded for a 60-day trial period (this license can be extended in 60-day increments free of charge for students, or investigators can purchase a long-term license following the trial period) from the website http://www.seqtools.dk . The software is currently available only for the Windows operating system.

Once you have downloaded the program, build a new project. It is easiest if you put all of your sequences together in a folder. Note that all files must be of the same type (e.g., FASTA, chromatograph trace files, etc.)
and correspond to the same class of macromolecule (e.g., nucleotide and protein sequences will not be handled correctly in a single project).

From the File menu, select “Open Sequence Files.” From the subsequent menu, select “Sequences, All Types.” In the Project References dialogue box, select a name for your project. Through the Main dialogue box, select the folder that contains your sequences, then select the “Add to List” button, followed by the “Load Files” button.

Batch BLAST searches can be run either against a local BLAST database or over the internet against GenBank; we will do the latter. To conduct the search, open the Search menu, select “Blast Batch Search,” then select “Sequential, NCBI – QBlast.” This will open the BLAST dialogue box.

In the “Final Blast Program” tab, select BlastN as the Blast program for final search. Choose the number of top BLAST hits that you would like the program to report (“Number of description returned from NCBI”), and the number of alignments that you would like the program to report (if you are only interested in the top hits, select zero for this value). You can also select a maximum expect value for reporting. It is best to deselect the checkbox for “Result in HTML format,” as it is easier to automate downstream applications using text output than HTML.

In the “Final Databases” tab, you may choose the GenBank databases to search. For our purposes, choose the nr (general nucleotide) database. The “Advanced Options” tab offers further options for database selection.

The “Destination” tab contains options for specifying the output of your search. You could choose to ‘parse results into sequence headers’ if you plan to use the BLAST results further (e.g., using a Perl script to modify the output), or choose to ‘save results as separate files without parsing’ if you wish to simply view the files to check whether the top hits make sense.
If you use the second option, be sure to save the results as text files, not HTML.

A few words about the remaining tabs: the “Range” tab allows you to choose a subset of the sequences to submit to BLAST, and the “View Search Progress” tab contains a window that allows you to follow the progress of your batch search.

Appendix V. Source Modifiers for FASTA Definition Lines or tbl2asn Source Tables

(from the GenBank website: http://www.ncbi.nlm.nih.gov/BankIt/examples/eukrrna.html)

Source modifiers contain information about the biological source of the sequence to be submitted. These modifiers can either be embedded in the definition line of a FASTA-formatted sequence by placing them in square brackets, or can be stored in a separate, tab-delimited table. The proper format for embedding multiple modifiers in the FASTA definition line is demonstrated in this example:

>MG104 [organism=Mycena pura] [molecule=DNA] [collection-date=Oct-2005]
CGGTTGCTTGGGTTTTATAACATCAGTCAGTGACAGGCATTTCCAGAGTTGCCCTGTTCATTCATATTATT
TATTACTGACCAGTGAGGATCCACCTAGCAGTATAGATTCGGATGCAGTATAGGCGTATGATTACAACATC
...

Accepted source modifiers are presented below. Note that some modifier names have restricted values or formats (Note: the following information is taken directly from the NCBI website).

organism should use the unabbreviated scientific name. Example: [organism=Drosophila melanogaster]

molecule should use either "DNA" or "RNA". Example: [molecule=DNA]

moltype should use one of the following values. Example: [moltype=genomic]
    genomic, precursor RNA, mRNA, rRNA, tRNA, snRNA, scRNA, other-genetic, cRNA, snoRNA, transcribed RNA

location should use one of the following values. Example: [location=mitochondrion]
    genomic, chloroplast, kinetoplast, mitochondrion, plastid, macronuclear, extrachromosomal, plasmid, cyanelle, proviral, virion, nucleomorph, apicoplast, leucoplast, proplastid, endogenous-virus, hydrogenosome

collection-date should be in the form YYYY or Mmm-YYYY or DD-Mmm-YYYY.
    Example: [collection-date=2005] or [collection-date=Oct-2005] or [collection-date=25-Oct-2005]

The following modifiers should use only TRUE or FALSE. Example: [transgenic=TRUE].
    environmental-sample, germline, metagenomic, rearranged, transgenic

Other accepted modifiers for nucleotide sequences are: