Submitting DNA Barcode Sequences to GenBank: A Tutorial

Todd Osmundson, Garbelotto Lab
September, 2008
Contents
GENERAL INTRODUCTION
INTRODUCTION TO NCBI’S BARCODE SUBMISSION TOOL
SEQUENCE FILES
CHROMATOGRAPH FILES
ATTRIBUTE TABLES
    TABLE 1: SEQUENCE ATTRIBUTES
    TABLE 2: TRACE FILE ATTRIBUTES
    GENERATING FILE LISTS
SUBMITTING THE BARCODE DATA TO GENBANK USING BARSTOOL
USING NCBI’S SEQUIN AND TBL2ASN
    ELEMENT 1: THE SUBMIT-BLOCK TEMPLATE FILE
    ELEMENT 2: THE SEQUENCE DATA
    ELEMENT 3: THE FEATURE ANNOTATION TABLE
    ELEMENT 4: THE SOURCE ANNOTATION TABLE
    SUBMITTING THE DATA
APPENDICES
    APPENDIX I: A SAMPLE FASTA FILE FOR A DNA BARCODE SUBMISSION
    APPENDIX II: SOURCE MODIFIERS FOR BARCODE SUBMISSIONS THROUGH BARSTOOL
    APPENDIX III: USING THE TAR UTILITY TO MAKE FILE ARCHIVES
    APPENDIX IV: CONDUCTING BATCH BLAST SEARCHES USING SEQTOOLS
    APPENDIX V: SOURCE MODIFIERS FOR FASTA DEFINITION LINES OR TBL2ASN SOURCE TABLES
General Introduction
DNA barcode sequences can be submitted to GenBank (the genetic sequence database at
the National Center for Biotechnology Information, NCBI) using several different methods.
The emphasis in this tutorial is on methods for batch data checking and submission so that
many sequences can be handled at one time. There are two main ways of making batch
sequence submissions to GenBank: NCBI’s Barcode Submission Tool (BarSTool)
specifically for DNA barcode sequences, and Sequin (or the similar but more automated
tool tbl2asn) for either barcode or non-barcode sequences. When submitting true DNA
barcode sequences (i.e., sequences that meet the length, quality and voucher criteria for
official barcodes), it is preferable to use BarSTool, as it has functionality for batch
submission of the necessary ancillary data (voucher specimen data, chromatograph trace
files, etc.). However, fungi present a problem: whereas DNA barcoding
for most animal groups uses a portion of the mitochondrial cytochrome c oxidase subunit I gene
(CO1 or cox1), mycologists have adopted the nuclear ribosomal internal transcribed spacer region
(nrDNA-ITS) as a barcoding standard, and BarSTool is currently configured to accept
only CO1 sequences. I am currently in dialogue with NCBI about reconfiguring BarSTool to
accept ITS data, but am unsure when, or if, such a change will take place. Therefore, this
tutorial covers BarSTool in the hope that NCBI will reconfigure it in the near future, and
Sequin/tbl2asn in case it does not.
Introduction to NCBI’s Barcode Submission Tool
GenBank allows bulk submission of DNA barcode sequences and ancillary descriptive
data via its Barcode Submission Tool (BarSTool). The following tutorial describes how to
prepare data for submission and how to submit these data using BarSTool. For additional
information, see the BarSTool website:
http://www.ncbi.nlm.nih.gov/WebSub/index.cgi?tool=barcode
The data that you will need consist of 3 parts:
* The sequences themselves, in FASTA format
* Chromatograph trace files for each sequence (one forward, one reverse per sequence)
* Ancillary data: 2 tab-delimited tables (in .txt format), one containing the descriptive
  data for the sequences and one for the trace files, plus the names and sequences of
  the forward and reverse primers used for sequencing
Sequence files
The following instructions assume that you have a set of completed sequence contigs
(e.g., prepared in Sequencher) of known provenance (i.e., checked against closest matches in
GenBank using BLAST). See Appendix IV for instructions on how to conduct BLAST
searches.
The sequence data should be in a single FASTA file. A sequence in FASTA format
consists of a description line, which begins with a greater-than symbol (">"), a carriage
return, and then one or more lines of sequence data. The sequence data can be in one
continuous line, but for ease of reading GenBank recommends that all lines of text be
shorter than 80 characters in length. The sequence data are followed by a carriage return,
followed by the next sequence. An example FASTA file containing 3 sequences is:
Create a FASTA file for each sequence by exporting the contig’s consensus
sequence from Sequencher (File -> Export -> Consensus). The resulting FASTA file will
have as its descriptor line whatever you have named the contig, so make sure that the contig
has a unique name that will distinguish it from all other sequences in the dataset.
The individual FASTA files can be joined into a single file using a script; the easiest way
to implement such a script is using the Automator program included in Mac OS. Set up the
following workflow in Automator (see right side of window):
Tasks are added to the workflow using the two windows on the left. For step one, select
“Finder” in the far left window (under Library -> Applications), then select “Get Specified
Finder Items” under the Action column. Steps 2 and 3 are found within TextEdit in the
Library column. Before running the workflow, remove any files left in the first (Get Specified
Finder Items) step (select the files, and click the “-” button), add the FASTA files to be
joined, and choose an appropriate name in the third (New Text File) step. Run the workflow by
clicking the “Run” button in the upper right-hand corner. Workflows can be saved so you can
use them again.
The following UNIX shell command (this can be run through the Terminal program in
Mac OS) will do the same concatenation operation, but requires typing in all of the file
names:
$ cat file1 file2 > cat.out
This operation will concatenate the files “file1” and “file2” into the output file “cat.out”. For
a slightly easier way than typing all of the file names, see instructions for generating file lists,
later in this document.
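As a rough sketch of the same concatenation without typing each file name, a shell wildcard can stand in for the list of files. The .fasta extension, the output name combined.fsa, and the two demo sequences are assumptions made for illustration only; adjust them to match your own files.

```shell
#!/bin/sh
# Sketch: join per-contig FASTA files into one file using a wildcard.
# Assumption: each exported contig was saved with a .fasta extension.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two tiny made-up FASTA files standing in for Sequencher exports.
printf '>Mp_MG15\nACGTACGT\n' > Mp_MG15.fasta
printf '>Lb_MG99\nTTGACCTC\n' > Lb_MG99.fasta

# The wildcard expands to every .fasta file in the current directory.
# The output uses a different extension so the wildcard does not match it.
cat *.fasta > combined.fsa

# Count the sequences (description lines start with ">").
n_seqs=$(grep -c '^>' combined.fsa)
echo "sequences in combined.fsa: $n_seqs"

# List any description line that occurs more than once; none is the goal.
dup_ids=$(grep '^>' combined.fsa | sort | uniq -d)
echo "duplicate IDs: ${dup_ids:-none}"
```

If an input file does not end with a line break, the next description line will be joined onto the last line of sequence, so it is worth checking that the sequence count matches the number of input files.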
If you prepare your FASTA files in a word processing program (e.g., Microsoft Word)
rather than in a text editor (e.g., TextEdit, BBEdit, TextWrangler, etc.), be sure to save your
file as a plain text (.txt) file rather than as a word processor document (.doc, etc.), as the
embedded formatting codes in the latter can cause problems downstream.
See Appendix I for a sample FASTA file for a DNA barcode submission.
Chromatograph files
The output from the ABI sequencer includes 3 files for each sequence: the
chromatograph trace file (.ab1), a Phred file that includes the base call quality scores (.phd.1),
and a text file that includes the sequence itself (.seq). The .seq file includes the raw base calls
and is therefore virtually useless without editing, but the .ab1 and .phd.1 files are essential for
the barcode submission. The attribute tables (see below) will be used to tie together the
trace, Phred, and sequence (as edited contigs in FASTA format) files.
The chromatograph trace (.ab1) files must be submitted as a single compressed format /
archive file (.zip, .gz, .tar, etc.). To prepare the trace file, first create a new directory (folder)
named “traces” containing all the traces for this submission. Then, assemble the .ab1 files
into a single archive file and compress it. The archiving and compression steps can be done
simultaneously or in sequence, depending on the tools at your disposal. To do these steps
simultaneously, use a zip utility (e.g., WinZip for Windows). In Mac OS, you can easily
prepare a .zip file as follows:
1. Select the files to be put in the archive (this should be all files in the “traces” folder
that you just created).
2. Open the cogwheel drop-down menu and select “Create Archive of X items”; a .zip
file containing a compressed archive will appear in the folder. This is the file that
you will submit through BarSTool.
Alternatively, you can produce the archive using the tar utility and compress it using the gzip
utility. For instructions on using the UNIX tar utility, see Appendix III. For instructions
on using the UNIX gzip utility, see the gzip home page: http://www.gzip.org/#intro
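For the tar and gzip route, the two steps can also be done in one command on most UNIX-like systems. The following is a minimal sketch, assuming a tar that supports the -z option (GNU tar and the tar shipped with Mac OS both do); the archive name traces.tar.gz and the empty demo files are made up for illustration.

```shell
#!/bin/sh
# Sketch: bundle the "traces" folder into one compressed archive.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Empty placeholder files standing in for real .ab1 traces.
mkdir traces
: > traces/Mp_MG15_F.ab1
: > traces/Mp_MG15_R.ab1

# -c create an archive, -z compress it with gzip, -f write to the named file.
tar -czf traces.tar.gz traces

# -t lists the archive contents without extracting, to confirm the paths.
archive_listing=$(tar -tzf traces.tar.gz)
echo "$archive_listing"
```

Running tar from the folder that contains “traces” (rather than from inside it) keeps the traces/ prefix on every path in the archive.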
Attribute tables
The main difference between submission of barcode sequences and that of other DNA
sequence data is that barcode sequences are held to a higher standard: they must
correspond to vouchered specimens, must be from particular (agreed-upon) loci, and must
be of high quality. High quality here means a low percentage of ambiguous bases (“N”s),
both forward and reverse sequences, and a link to a chromatograph file and/or a file of read
quality metrics. The attribute tables link each sequence to its appropriate
chromatograph file and voucher specimen data. These tables must be submitted to
GenBank as tab-delimited text files. One can use a text editor to make the files, but it is
probably easier to use a spreadsheet application such as Excel; just be sure to save the file as
tab-delimited text when you’re finished.
Table 1: Sequence attributes
This is a tab-delimited text file that includes information about the sequence and the
specimen from which it is derived (NCBI refers to this information as “source modifiers”).
The NCBI website has downloadable templates for a table including the source modifiers
recommended by the Barcoding Consortium, and a table including all possible source
modifiers. In general, it is sufficient to use just the recommended source modifiers.
Here are the links:
* Template including recommended source modifiers:
http://www.ncbi.nlm.nih.gov/WebSub/html/help/templates/source-table-recommended.txt
* Template including all source modifiers:
http://www.ncbi.nlm.nih.gov/WebSub/html/help/templates/source-table-all.txt
To download the files, it is better to go directly to NCBI and download them from there:
http://www.ncbi.nlm.nih.gov/WebSub/html/help/source-table.html
Here is a sample source modifier table:
The first column includes the sequence ID. This must be the same ID that was used in
the description line of that sequence in the FASTA file. The rest of the columns include
information about the collection and the accession number of the voucher specimen. While
more information is usually better, GenBank will accept a table that includes only the first
(Sequence_ID) and last (Specimen_voucher) columns. For an example file with the 2-column
format, see the following URL:
http://www.ncbi.nlm.nih.gov/WebSub/html/help/sample_files/source-table-2-col-sample.txt
Official barcode submissions also require the “Country” modifier (country where the
specimen was collected).
So, the end result of this table is to tell GenBank which voucher specimen corresponds
to each barcode sequence.
Note the following requirements for the source modifiers table:
* The heading for the first column must be exactly Sequence_ID as shown in the
  sample table.
* Each specimen in the set must have a line in the source modifiers file, even if
  there are no modifiers to apply to the specimen.
* Each Sequence_ID may appear only once in the source modifier file.
See Appendix II for descriptions of all source modifier fields.
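The requirements above can be checked mechanically before uploading. The sketch below is one possible way to do so with standard UNIX tools; the file names barcodes.fsa and source_table.txt and the demo contents are hypothetical, and it assumes that the sequence ID is the text between “>” and the first space of each description line.

```shell
#!/bin/sh
# Sketch: check that every FASTA ID has exactly one row in the source table.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up demo inputs (file names and values are hypothetical).
printf '>Mp_MG15\nACGTACGT\n>Lb_MG99\nTTGACCTC\n' > barcodes.fsa
printf 'Sequence_ID\tCountry\tSpecimen_voucher\n' > source_table.txt
printf 'Mp_MG15\tUSA\tMG 15\n' >> source_table.txt
printf 'Lb_MG99\tUSA\tMG 99\n' >> source_table.txt

# IDs from the description lines: drop ">" and anything after the first space.
grep '^>' barcodes.fsa | sed 's/^>//; s/[[:space:]].*$//' | sort > fasta_ids.txt

# IDs from column 1 of the table, skipping the header row.
tail -n +2 source_table.txt | cut -f 1 | sort > table_ids.txt

# The heading of the first column must be exactly Sequence_ID.
header=$(head -n 1 source_table.txt | cut -f 1)

# IDs that appear more than once in the table.
repeated=$(uniq -d table_ids.txt)

# IDs present in only one of the two files (comm needs sorted input).
missing_from_table=$(comm -23 fasta_ids.txt table_ids.txt)
missing_from_fasta=$(comm -13 fasta_ids.txt table_ids.txt)

echo "first column heading: $header"
echo "repeated in table: ${repeated:-none}"
echo "in FASTA but not in table: ${missing_from_table:-none}"
echo "in table but not in FASTA: ${missing_from_fasta:-none}"
```

With real data, replace the four printf lines with your own files; any line that does not report “none” points to an ID that needs fixing.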
Table 2: Trace file attributes
This is a tab-delimited text file that includes information about the trace files.
The NCBI website has a downloadable template for this table; to see the format, go to the
following URL:
http://www.ncbi.nlm.nih.gov/WebSub/html/help/templates/trace-table-required.txt
To download the template, go to the NCBI website:
http://www.ncbi.nlm.nih.gov/WebSub/html/help/trace-table.html
Here is a sample trace attribute table:
The first row of the table includes the column headings.
The columns of the table are as follows (descriptions taken from the NCBI website):
* Template_ID - identifies the sequence. This identifier must be the same value
  as the Sequence_ID used in the source modifier table and in the nucleotide
  FASTA file, and allows GenBank to tie together the sequence, trace file, and
  voucher specimen data for each barcode.
* Trace_file - the path to a specific trace in the trace archive file. If you set up the
  trace archive by putting all the traces into a directory (folder) named “traces”, the
  path would start with "traces/". For example: traces/filename.ab1.
  Note: If you set up your traces directory with subdirectories (e.g., for each separate
  submission set or for each separate organism, etc.), the path listed in the Trace_file
  column must include the subdirectory name. For example:
  traces/subdirectory_name/filename.scf.
* Trace_format - names the format of the provided trace file. Trace_format can
  have the following values: SCF, SFF, ZTR, and ABI.
* Center_project - a sequencing center's internal designation for a specific
  sequencing project. This field can be useful for grouping related traces.
* Program_ID - the base calling program. This field is free text. Program name,
  version numbers or dates are very useful. Examples include:
    phred-19980904e
    abi-3.1
    ATQA
    TraceTuner
    Licor
    Megabase
    Beckman
* Trace_end - labels which end of the sequence is contained in the read. Possible
  values: F, R, N for Forward, Reverse, and uNknown.
Note that a Trace_file may appear only once in a Trace Information file; however, a
Template_ID may appear more than once.
For more documentation on the NCBI Trace Archive, including additional fields and their
descriptions, consult the following website:
http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=rfc_b&m=doc&s=rfc_b#PROGRAM_ID
Generating file lists
If it sounds like a lot of work to enter all of the filenames into the spreadsheet, here is
some good news – it is not necessary to manually type the file names or copy/paste them
individually into the spreadsheet – let your operating system do most of the work for you by
generating a file list! Here’s how:
1. In Mac OS, open a Terminal window (or run cmd in Windows), then navigate to the
folder where the trace files and score files have been placed (use the cd command in all
operating systems). Note that in Mac OS Terminal, you do not need to manually type the folder
path following the “cd” command: just type “cd”, and then drag the folder icon (from a Finder
window) into the Terminal window and Terminal will write the path for you.
2. Use the following commands to generate a list of all files in the folder and write them to a
file named list.txt:
Windows:    dir /b > list.txt
MacOS:      ls > list.txt
Linux/Unix: ls > list.txt
In Windows, the /b switch limits the output to bare file names; without it, dir also prints
dates, sizes, and summary lines that would have to be removed by hand.
3. In building the spreadsheets, however, you will generally not want a list of all files, but
rather a list of all files within a certain class (e.g., trace files, or Phred files). These
commands will generate a list of files having a particular file extension in the current folder,
and save it in a text file named list.txt.
To generate a list of the trace files, use the following commands:
Windows:    dir /b *.ab1 > list.txt
MacOS:      ls *.ab1 > list.txt
Linux/Unix: ls *.ab1 > list.txt
To generate a list of the score files, use:
Windows:    dir /b *.phd.1 > list.txt
MacOS:      ls *.phd.1 > list.txt
Linux/Unix: ls *.phd.1 > list.txt
You can then open list.txt directly in Excel, or open list.txt in a text editor and copy/paste
into a column in an Excel spreadsheet.
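If the trace files follow a consistent naming scheme, the file list can be turned into a draft of the trace attribute table directly, without a spreadsheet. The sketch below assumes a hypothetical scheme in which files are named <Sequence_ID>_F.ab1 and <Sequence_ID>_R.ab1; the Center_project and Program_ID values are placeholders, and the column headings should be checked against the NCBI template before submission.

```shell
#!/bin/sh
# Sketch: turn a list of .ab1 files into a draft trace attribute table.
# Assumption (hypothetical naming scheme): files are named
# <Sequence_ID>_F.ab1 and <Sequence_ID>_R.ab1 for forward and reverse reads.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Empty placeholder files standing in for real traces.
mkdir traces
: > traces/Mp_MG15_F.ab1
: > traces/Mp_MG15_R.ab1

# Listing from the parent folder keeps the "traces/" prefix in each path.
ls traces/*.ab1 > list.txt

# Header row, then one row per trace file. The project and program values
# are placeholders to be replaced with real ones.
printf 'Template_ID\tTrace_file\tTrace_format\tCenter_project\tProgram_ID\tTrace_end\n' > trace_table.txt
awk -v project='my_project' -v program='my_basecaller' '
{
    path = $0
    name = path
    sub(/^.*\//, "", name)          # strip the folder part
    sub(/\.ab1$/, "", name)         # strip the extension
    trace_end = "N"                 # uNknown unless the suffix says otherwise
    if (name ~ /_F$/) trace_end = "F"
    if (name ~ /_R$/) trace_end = "R"
    id = name
    sub(/_[FR]$/, "", id)           # what is left is the sequence ID
    printf "%s\t%s\tABI\t%s\t%s\t%s\n", id, path, project, program, trace_end
}' list.txt >> trace_table.txt

cat trace_table.txt
```

The resulting trace_table.txt is already tab-delimited, so it can be opened in Excel for review or edited in a text editor.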
Submitting the barcode data to GenBank using BarSTool
[Note: The NCBI BarSTool information page can be found at the following URL:
http://www.ncbi.nlm.nih.gov/WebSub/index.cgi?tool=barcode]
Once you have all of the data prepared, it is time to submit them to GenBank. First, you
will need to register for a My NCBI account:
http://www.ncbi.nlm.nih.gov/entrez/login.fcgi
To begin the submission process, make sure that you have the following available:
* A web browser that supports both JavaScript and cookies
* The title of a published or in-press paper that discusses the Barcode Set
* A text file of the set of nucleotide sequences in FASTA format
* The names and sequences of the forward and reverse primers
* A tab-delimited table of source modifier data for the set
* A text file of the set of protein sequences in FASTA format (optional for CO1; not
  applicable for ITS)
* A tab-delimited table of trace attributes and a compressed archive containing the
  traces (optional, but highly recommended)
From the NCBI Barcode of Life home page
(http://www.ncbi.nlm.nih.gov/WebSub/index.cgi?tool=barcode),
select the link “Sign in to use barcode” in the upper right corner. The submission then
proceeds through the following pages:
Page 1: Enter your contact information.
Page 2: Enter the names of the sequence authors and the study (either a published or in-press
paper, or the name of an unpublished study). Sequence authors for the Venice study
should include Matteo Garbelotto, Lydia Baker, and anyone involved in the data acquisition
and/or submission of the particular group of sequences being submitted. For the study title
we should use a single, agreed-upon name so that all of the sequences can be grouped
together even if they are submitted separately. I propose using the name of the study that
appears on the lab website: “Barcoding the Venice Fungal Collection.”
Page 3: Select a release date for the sequences; this step should be done in consultation
with Matteo. Also on this page, upload the nucleotide FASTA file containing the sequence data.
Page 4: You may upload a protein translation file; since ITS is not a protein-coding
region, continue past this step.
Page 5: Enter the primer information. If all sequences were generated using the same
primers (e.g., ITS1-F and ITS4-B), choose the option “Set one value for all sequences” and
then enter the primer names and sequences.
Page 6: Upload your sequence source modifier file. Note that, if your file does not
contain all of the columns recommended by the Barcode Consortium, your submission will
still be accepted but may not be given the official “Barcode” label in GenBank.
Page 7: Upload the trace information table and the trace archive (the
compressed archive file containing all of the chromatograph trace files). Be sure to upload
the correct file in the correct place.
Following all information entry and upload, BarSTool
presents a text file (in GenBank flat file format) containing your submission as it will appear
in GenBank. Review this file carefully to confirm that the specimen, locus, author, and
study information are correct. If you are submitting a protein-coding sequence (e.g., CO1),
make sure that the translation makes sense (e.g., no stop codons, signified by an asterisk in
the protein sequence). Finalize your submission, and you’re finished!
Using NCBI’s Sequin and tbl2asn
Though one can submit sequences directly to GenBank via a web interface (BankIt), this
method only accommodates submission of sequences one-at-a-time – certainly an
unpalatable option if one has many sequences to submit. Batch submission of sequences is
facilitated by NCBI’s Sequin utility. Sequin allows the creation of a single file containing
descriptive information for a batch of sequences (author information, etc.) through web
forms completed by the user, and then packages this file with the sequence files (in FASTA
format) into a single Sequin (.sqn) file that can be submitted to GenBank via e-mail. The
utility tbl2asn, as its name (albeit cryptically) suggests, converts information from tables to
ASN.1 (Abstract Syntax Notation 1), the file format used by GenBank. It is a command-line
program, so there are no menus – you must enter commands directly into a shell (UNIX,
DOS, etc.) window. Regardless of whether Sequin or tbl2asn is used, a submission consists
of the following elements:
* Sequence data in FASTA format
* General information about the submission (e.g., author information)
* Annotation of sequence features (e.g., coding regions, non-coding regions)
* Source information (e.g., organism, collection information, etc.)
Besides the difference in the way that the programs are run (command line vs. web forms),
Sequin and tbl2asn differ primarily in the way in which some of these submission elements
are organized. The general information (author names and institutions, etc.) is in both cases
entered into a Sequin project, but when prepared for tbl2asn the user stops after this
information is entered and exports the results into a standalone file that can be read by
tbl2asn. The feature annotations can be either entered into a web form or imported as a
tab-delimited text file for Sequin, and must be imported as a table for tbl2asn. Source
information can either be entered into the web form (the tedious way) or embedded in the
FASTA definition line (the more straightforward way) for Sequin, and are either embedded
in the FASTA definition line or stored in a tab-delimited text table for tbl2asn.
Because there is a fair amount of overlap between Sequin and tbl2asn and because
tbl2asn is a bit more efficient for large numbers of sequences, I will describe the use of
tbl2asn here. Instructions specific to Sequin are available from the following URL:
http://www.ncbi.nlm.nih.gov/Sequin/QuickGuide/sequin.htm.
First, download tbl2asn. A link to the FTP site containing the download is available at
http://www.ncbi.nlm.nih.gov/Genbank/tbl2asn2.html. Be sure to download the correct
version for your platform, then uncompress the file and change the file permissions if
necessary. Also download Sequin (at http://www.ncbi.nlm.nih.gov/Sequin/), as you will
need this program for the initial steps of the process.
The submission can contain up to 6 types of items, as follows:
1. Template file containing a text ASN.1 Submit-block object (suffix .sbt).
2. Nucleotide sequence data in FASTA format (suffix .fsa).
3. Tab-delimited text file with a table containing sequence features (suffix .tbl).
4. Protein sequence if the gene encodes a protein (suffix .pep).
5. Source Table (suffix .src).
6. Quality Scores (suffix .qvl).
In our nrDNA-ITS sequence submissions, we can omit #4 since ITS is non-coding. We
will also omit #6. The remaining four elements are described in more detail below.
Element 1: The submit-block template file
This file is generated through Sequin. To make the file, first open Sequin. Choose
“GenBank” as the database for submission, then select the button “Start New
Submission.” The following form will open:
Select a date for when the sequence record may be released (in consultation with
Matteo), and fill in a tentative manuscript title. Then, select the other tabs and enter the
contact, author, and affiliation information. After you have done this, return to the
submission tab and use File->Export Submitter Info. Save the file as template.sbt.
Element 2: The sequence data
Sequence data should be given in FASTA format, just as in a BarSTool submission.
As in a BarSTool submission, it is easiest if all sequences are combined into a single
FASTA file. The FASTA file should be placed in the same directory as the template and
table files generated in the other steps of this process. It is possible to provide source
information in the FASTA definition line (See Appendix V), or to store it in a separate
tab-delimited table. Keep in mind that the sequence identifier (sequence “title”) used in
the definition line (i.e., following the “>” symbol) must be identical to that used in the
source modifier and feature annotation tables. This sequence ID will be changed to a
GenBank accession number by the NCBI staff after the sequences are submitted. For
our submissions, we will put the source modifier data in a separate file (See Element 4),
so the FASTA definition lines need only contain a unique identifier.
Element 3: The feature annotation table
Sequence features such as the location of coding regions, introns, or different
structural parts of a gene must be identified prior to submission. For example, a typical
ITS1/ITS4 – primed sequence product contains a small portion of the 18S ribosomal
RNA gene, followed by the first internal transcribed spacer (ITS1), the 5.8S ribosomal
RNA gene, the second internal transcribed spacer (ITS2), and a small portion of the 28S
ribosomal RNA gene. Identification of these elements and their positions will be, by far,
the most time-consuming part of this process. These feature annotations must then be
stored in a tab-delimited table having a specific 5-column format (columns separated by
tabs). The file begins with a definition line similar to that in a FASTA-formatted
sequence; for example: >Feature SeqId table_name
The sequence identifier (SeqId) must be the same as that used in the sequence FASTA file.
The table name portion is optional. Subsequent lines of the table list the features, each on a
separate line. Each feature can carry additional notes or qualifiers, which are placed on the
lines below the feature type and location. The columns are as follows:
* Column 1: Start location of feature
* Column 2: Stop location of feature
* Column 3: Feature key (type)
* Column 4: Qualifier key (placed on the row below the information in the first 3
  columns)
* Column 5: Qualifier value (also placed on the row below the information in the first
  3 columns)
An example table may look like this:
>Feature Lp_1625
1      629    source
                        organism          Laccaria pseudomontana
                        mol_type          genomic DNA
                        isolate           pse1625
                        specimen_voucher  Cripps 1625(type)
                        db_xref           taxon:344594
                        tissue_type       basidiome
                        country           USA: Colorado, Ten Mile Range, Blue Lake Dam
                        note              type strain of Laccaria sp. CLC1771
<1     14     rRNA
                        product           18S ribosomal RNA
15     249    misc_RNA
                        product           internal transcribed spacer 1
250    407    rRNA
                        product           5.8S ribosomal RNA
408    614    misc_RNA
                        product           internal transcribed spacer 2
615    >629   rRNA
                        product           28S ribosomal RNA
Note that, in columns 1 and 2 (feature start and stop positions), the first entry begins
with <1, and the last entry ends with >629. The “<” denotes that the feature actually begins
before the first nucleotide of our sequence; the “>” denotes that the feature actually ends
after the last nucleotide of our sequence. This annotation table will yield a GenBank file
having the following feature annotations (See “FEATURES,” below):
LOCUS       DQ149871                 629 bp    DNA     linear   PLN 13-MAR-2006
DEFINITION  Laccaria pseudomontana isolate pse1625 18S ribosomal RNA
            gene, partial sequence; internal transcribed spacer 1, 5.8S ribosomal
            RNA gene, and internal transcribed spacer 2, complete sequence; and 28S
            ribosomal RNA gene, partial sequence.
ACCESSION   DQ149871
VERSION     DQ149871.1  GI:76781901
KEYWORDS    .
SOURCE      Laccaria pseudomontana
  ORGANISM  Laccaria pseudomontana
            Eukaryota; Fungi; Dikarya; Basidiomycota; Agaricomycotina;
            Agaricomycetes; Agaricomycetidae; Agaricales;
            Tricholomataceae; Laccaria.
REFERENCE   1  (bases 1 to 629)
  AUTHORS   Osmundson,T.W., Cripps,C.L. and Mueller,G.M.
  TITLE     Morphological and molecular systematics of Rocky Mountain
            alpine Laccaria
  JOURNAL   Mycologia 97 (5), 949-972 (2006)
REFERENCE   2  (bases 1 to 629)
  AUTHORS   Osmundson,T.W., Cripps,C.L. and Mueller,G.M.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-JUL-2005) Ecology, Evolution and
            Environmental Biology, Columbia University, 1200 Amsterdam
            Avenue, MC 5557, New York, NY 10027, USA
FEATURES             Location/Qualifiers
     source          1..629
                     /organism="Laccaria pseudomontana"
                     /mol_type="genomic DNA"
                     /isolate="pse1625"
                     /specimen_voucher="Cripps 1625 (type)"
                     /db_xref="taxon:344594"
                     /tissue_type="basidiome"
                     /country="USA: Colorado, Ten Mile Range, Blue Lake
                     Dam"
                     /note="type strain of Laccaria sp. CLC 1771"
     rRNA            <1..14
                     /product="18S ribosomal RNA"
     misc_RNA        15..249
                     /product="internal transcribed spacer 1"
     rRNA            250..407
                     /product="5.8S ribosomal RNA"
     misc_RNA        408..614
                     /product="internal transcribed spacer 2"
     rRNA            615..>629
                     /product="28S ribosomal RNA"
ORIGIN
        1 aggatcatta ttgaataaac ctgatgtggc tgttagctgg cttttcaaag catgtgctcg
       61 tccgtcatct ttaatttctc cacctgtgca cattttgtag tcttggatac ctctcgaggc
      121 aactcggatt ttaggatcgc cgtgctgtaa aagtcagctt tcctctcatt tccaagacta
      181 tgttttcata tacaccaaag tatgtttaaa gaatgtcatc aatgggaact tgtttcctat
      241 aaaattatac aactttcagc aacggatctc ttggctctcg catcgatgaa gaacgcagcg
      301 aaatgcgata agtaatgtga attgcagaat tcagtgaatc atcgaatctt tgaacgcacc
      361 ttgcgctcct tggtattccg aggagcatgc ctgtttgagt gtcattaaat tctcaacctt
      421 ccaactttta ttagcttggt taggcttgga tgtgggggtt gcgggcttca tcaatgaggt
      481 cggctctcct taaatgcatt agcggaactt ttgtggaccg tctattggtg tgataattat
      541 ctacgccgtg gatttgaagc agctttatga agttcagcct ctaaccgtcc attgacttgg
      601 acaattttga caatttgacc tcaaatcag
//
It is best to make this table, as well as the source modifier table, in a text editor rather
than a word processor in order to avoid the surreptitious insertion of formatting codes.
Save the feature annotation table with the file extension .tbl.
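Because every ITS sequence in a batch has the same five features, the feature table can be generated from a short list of boundary positions rather than typed by hand. The sketch below is one way to do this with awk; the input file boundaries.txt, its column layout, and the output name its_features.tbl are hypothetical, and the positions used are those of the Lp_1625 example above. The source feature is left out here on the assumption that source data are supplied separately.

```shell
#!/bin/sh
# Sketch: write a feature table (.tbl) for ITS sequences from boundary positions.
# Assumption: boundaries.txt is a hypothetical tab-delimited file with columns
# SeqID, sequence length, and the start positions of ITS1, 5.8S, ITS2 and 28S.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# One demo row, using the positions from the Lp_1625 example.
printf 'Lp_1625\t629\t15\t250\t408\t615\n' > boundaries.txt

awk -F '\t' '
{
    id = $1; len = $2; its1 = $3; s58 = $4; its2 = $5; lsu = $6
    printf ">Feature %s\n", id
    # "<" and ">" mark features that extend past the ends of the sequence.
    # Qualifier rows start with three tabs so the key lands in column 4.
    printf "<1\t%d\trRNA\n\t\t\tproduct\t18S ribosomal RNA\n", its1 - 1
    printf "%d\t%d\tmisc_RNA\n\t\t\tproduct\tinternal transcribed spacer 1\n", its1, s58 - 1
    printf "%d\t%d\trRNA\n\t\t\tproduct\t5.8S ribosomal RNA\n", s58, its2 - 1
    printf "%d\t%d\tmisc_RNA\n\t\t\tproduct\tinternal transcribed spacer 2\n", its2, lsu - 1
    printf "%d\t>%d\trRNA\n\t\t\tproduct\t28S ribosomal RNA\n", lsu, len
}' boundaries.txt > its_features.tbl

cat its_features.tbl
```

Adding one row per sequence to boundaries.txt produces one “>Feature” block per sequence in the same output file. The boundary positions themselves still have to be determined for each sequence.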
Element 4: The source annotation table
This table contains information about the biological source of the sequence. It can
contain a wealth of different information (see Appendix V for a complete list of
accepted source modifiers), but usually only includes a small subset of the possible fields.
The first column must include the sequence ID (SeqID); this code must be identical to
the one in the definition line of the corresponding FASTA file. The second column
should include the organism name (Latin binomial). For our submissions, we should
also use the fields recommended for DNA barcode data by the Consortium for the
Barcode of Life, as doing so will facilitate obtaining official barcode designation for our
sequences; these fields are: Collected-by, Collection-date, Country, Identified-by,
Lat-lon, and Specimen-voucher. An example table is as follows:
SeqID    organism       Collected-by       Collection-date  Country  Identified-by   Lat-lon          Specimen-voucher
Mp_MG15  Mycena pura    Matteo Garbelotto  1-April-2008     USA      Todd Osmundson  13.57 N 24.68 W  MG 15
Lb_MG99  Mycena impura  Matteo Garbelotto  1-April-2008     USA      Doug Schmidt    13.57 N 24.68 W  MG 99
The table must be saved as a tab-delimited file with a .src extension.
Submitting the data
Now we will run the tbl2asn program, which will generate a Sequin (.sqn) file that we
can submit to GenBank via e-mail. First, copy all files into the directory that contains
the tbl2asn program, as this simplifies the path specification in the command line. In
MacOS, open a Terminal window; in Windows, open a DOS command window; then
navigate to the directory that contains the tbl2asn program. Typing “tbl2asn -” at the
shell prompt will produce the full list of command line arguments; the following page
contains a summary of the most common ones, as well as some example command lines.
We will use the following command line, for a batch submission with multiple sequences
per .fsa file:
tbl2asn -t template.sbt -p . -a s -V v
This command line makes several assumptions: that the Sequin template is named
“template.sbt”, that all sequences are in a single FASTA file, and that all files are in the
same directory as the tbl2asn program; make sure that your data meet these assumptions,
or change the command line accordingly. Note also that the .fsa, .tbl, and .src files must
have the same filename prefix (e.g., mycena.fsa and mycena.tbl), or tbl2asn will not
match them correctly.
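The shared-prefix requirement is easy to check before running tbl2asn. The sketch below is a hedged illustration: the function name missing_companions is hypothetical, and it assumes (following the note above) that every .fsa file is expected to have both a .tbl and a .src companion.

```python
from pathlib import PurePath


def missing_companions(filenames):
    """For each .fsa file name, list the companion suffixes (.tbl, .src) that are absent.

    Returns a dict mapping the filename prefix to its missing suffixes;
    an empty dict means every .fsa file has both companions.
    """
    names = {PurePath(name).name for name in filenames}
    problems = {}
    for name in sorted(names):
        path = PurePath(name)
        if path.suffix != ".fsa":
            continue
        missing = [suffix for suffix in (".tbl", ".src")
                   if path.with_suffix(suffix).name not in names]
        if missing:
            problems[path.stem] = missing
    return problems


if __name__ == "__main__":
    import os
    # Check the current directory (the one passed to tbl2asn with -p .).
    print(missing_companions(os.listdir(".")))
```

For example, a directory holding only mycena.fsa and mycena.tbl would be reported as {'mycena': ['.src']}.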
Most common command line arguments for tbl2asn (from the NCBI website).
-p   Path to the directory. If files are in the current directory, -p . should be used.
-r   Path for the resulting .sqn file(s) (if the -r argument is not used, the .sqn files will be
     saved in the source directory).
-t   Specifies the template file (.sbt). If the .sbt file is in a different directory, the full path
     must be specified.
-i   Creates single submission from indicated .fsa file in a directory of multiple .fsa files.
-a   Specifies the file type:
       s   FASTA Set (s Batch, s1 Pop, s2 Phy, s3 Mut, s4 Eco)
       l   FASTA+Gap Alignment
       z   FASTA with Gap Lines
       e   PHRAP/ACE
       d   FASTA Delta (di: FASTA Delta with Implicit Gaps)
       a   Any (default)
     Sample command line: -a s
-s   Instructs tbl2asn to read multiple FASTA components in one file as a set of unrelated
     sequences. Equivalent to "-a s". This creates a single file of multiple submissions. (1000
     sequences per file is the usual maximum.)
-j   Allows the addition of source qualifiers that will be the same for each submission.
     Example: -j "[organism=Saccharomyces cerevisiae] [strain=S288C]"
-V   Verification (combine any of the following letters):
       v   Validates the data records. The output is saved to files with a .val suffix.
       b   Generates GenBank flatfiles with a .gbf suffix.
       r   Validates without Country Check.
     Sample command line: -V vb
-k   CDS Flags (combine any of the following letters):
       c   Instructs tbl2asn to annotate the longest open reading frame (ORF) if a .tbl file is
           not provided. The product name will be 'unknown' unless a product name is included
           in the FASTA definition, [product=xyz].
       m   Allows alternative start codons to be used in ORF searches.
       r   Allows Runon ORFs.
     Sample command line: -k c
-y   Adds a COMMENT to each submission. Example: -y "Contigs larger than 2kb have been
     annotated, representing approx. 87% of the total genome"
-Y   Like -y, but adds a COMMENT to each submission from a file.
-Z   Runs the Discrepancy Report. Must supply an output file name. Recommended only for
     annotated genome submissions, complete or WGS. See the Discrepancy Report page for
     information about its output.
-o   Creates a single submission from multiple fasta files.
Example Command Lines:
Single submission: one sequence per .fsa file:
tbl2asn -t template.sbt -p path_to_files -V v
Batch submission: multiple sequences per .fsa file:
tbl2asn -t template.sbt -p path_to_files -a s -V v
Single submission: one .fsa file in directory of multiple .fsa files:
tbl2asn -t template.sbt -i x.fsa -V v
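If the same submissions are run repeatedly, the three command lines above can be assembled by a small helper. This is a sketch under stated assumptions: build_tbl2asn_command is a hypothetical function, it uses only the -t, -p, -i, -a s, and -V v arguments shown above, and it only builds the argument list rather than running the program.

```python
import shlex


def build_tbl2asn_command(template, path=None, single_file=None, batch=False, validate=True):
    """Assemble a tbl2asn argument list from the options used in the examples above."""
    if (path is None) == (single_file is None):
        raise ValueError("give exactly one of path (-p) or single_file (-i)")
    command = ["tbl2asn", "-t", template]
    if path is not None:
        command += ["-p", path]          # directory holding the .fsa/.tbl/.src files
    else:
        command += ["-i", single_file]   # one .fsa file among several
    if batch:
        command += ["-a", "s"]           # multiple sequences per .fsa file
    if validate:
        command += ["-V", "v"]           # write a .val validation file
    return command


# Batch submission from the current directory.
print(shlex.join(build_tbl2asn_command("template.sbt", path=".", batch=True)))
# prints: tbl2asn -t template.sbt -p . -a s -V v
```

The returned list can be handed to subprocess.run(command, check=True) once tbl2asn is on the search path; keeping the arguments as a list avoids quoting problems with paths that contain spaces.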
The -V v portion of the command line generates a validation file with a .val extension.
Before submitting your .sqn files to GenBank, review the .val files (open with a text editor);
you will need to correct any error-level errors. Taxonomy-related errors about missing
lineages can (and should) generally be ignored. To correct errors, open the newly-created
.sqn file in Sequin by double clicking on it. Double-click on the portion of the GenBank
output that contains an error; the program will draw a black, vertical line to the left of the
portion and open a dialog box where you can correct the error (see diagrams on next page).
When you open the Sequin file, the first file in the set will be open; to go to other files, click
in the box that contains the sequence ID number.
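With many sequences, the .val files can be long, and a quick tally by severity helps locate the error-level messages. The sketch below is an illustration only. It assumes that each line of a .val file begins with a severity label followed by a colon (for example ERROR: or WARNING:); check this against your own files before relying on it. The function name group_by_severity and the sample lines are made up for the example.

```python
def group_by_severity(lines):
    """Group validation messages by the label that precedes the first colon.

    Lines without a colon are collected under 'UNKNOWN'. Blank lines are skipped.
    """
    groups = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        label, separator, _rest = line.partition(":")
        key = label.strip().upper() if separator else "UNKNOWN"
        groups.setdefault(key, []).append(line)
    return groups


if __name__ == "__main__":
    # Invented sample lines, not real tbl2asn output.
    sample = [
        "ERROR: example message about sequence Mp_MG15",
        "WARNING: example message about sequence Lb_MG99",
        "",
    ]
    for severity, messages in group_by_severity(sample).items():
        print(severity, len(messages))
```

To use it on a real file, pass an open file handle, for instance group_by_severity(open("mycena.val")), where mycena.val is a hypothetical file name.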
[Figure: Sequin window with two callouts: "Double-click here to go to another sequence" and "Double-click here to correct the source information".]
Unfortunately, if we wish to submit chromatograph trace files to the NCBI Trace Archive,
we will have to do so separately rather than integrated as in BarSTool. At present, a trace file
does not appear to be a requirement to obtain official barcode designation for a sequence, so
we may wish to skip this step for now.
Appendices
Appendix I: A Sample FASTA File for a DNA Barcode Submission
(Source: NCBI)
>Seq1 [organism=Carpodacus mexicanus]
CCTTTATCTAATCTTTGGAGCATGAGCTGGCATAGTTGGAACCGCCCTCAGCCTCCTCATCCGTGCAGAA
CTTGGACAACCTGGAACTCTTCTAGGAGACGACCAAATTTACAATGTAATCGTCACTGCCCACGCCTTCG
TAATAATTTTCTTTATAGTAATACCAATCATGATCGGTGGTTTCGGAAACTGACTAGTCCCACTCATAAT
CGGCGCCCCCGACATAGCATTCCCCCGTATAAACAACATAAGCTTCTGACTACTTCCCCCATCATTTCTT
TTACTTCTAGCATCCTCCACAGTAGAAGCTGGAGCAGGAACAGGGTGAACAGTATATCCCCCTCTCGCTG
GTAACCTAGCCCATGCCGGTGCTTCAGTAGACCTAGCCATCTTCTCCCTCCACTTAGCAGGTGTTTCCTC
TATCCTAGGTGCTATTAACTTTATTACAACCGCCATCAACATAAAACCCCCAACCCTCTCCCAATACCAA
ACCCCCCTATTCGTATGATCAGTCCTTATTACCGCCGTCCTTCTCCTACTCTCTCTCCCAGTCCTCGCTG
CTGGCATTACTATACTACTAACAGACCGAAACCTAAACACTACGTTCTTTGACCCAGCTGGAGGAGGAGA
CCCAGTCCTGTACCAACACCTCTTCTGATTCTTCGGCCATCCAGAAGTCTATATCCTCATTTTAC
>Seq2 [organism=Vireo solitarius]
GGTAGGTACCGCCCTAAGNCTCCTAATCCGAGCAGAACTANGCCAACCCGGAGCCCTTCTGGGAGACGAC
CAAATCTACAACGTAGTCGTTACGGCCCACGCCTTCGTAATAATCTTTTTCATAGTAATGCCAATCATAA
TCGGAGGATTCGGGAACTGACTAGTTCCTCTAATGATTGGGGCCCCAGACATAGCATTCCCTCGAATAAA
CAACATAAGCTTTTGACTACTACCACCATCATTCCTACTCCTAATAGCCTCCTCAACAGTAGAAGCAGGA
GCCGGAACCGGATGAACCGTGTACCCACCACTAGCTGGAAACCTGGCCCACGCCGGAGCCTCAGTAGACC
TAGCTATCTTCTCCCTACACCTAGCAGGTATCTCATCCATCCTGGGGGCAATTAACTTCATTACAACAGC
AATCAACATAAAACCACCCGCCCTCTCACAATACCAAACACCACTATTCGTGTGATCCGTCCTAATTACG
GCCGTACTACTCCTACTATCTCTCCCAGTACTAGCCGCCGGTATCACCATGCTACTCACAGACCGCAACC
TCAACACCACCTTCTTTGACCCAGCAGGAGGAGGAGACCCAGTACTATACCAGCACCTATTCTGATTCTT
CGGACACCCAGAAGTCTACATCCTAATTCTC
>Seq3 [organism=Dendroica tigrina]
CCTATACCTAATTTTCGGCGCATGAGCCGGAATGGTGGGTACCGCTCTAAGCCTCCTCATTCGAGCAGAA
CTAGGCCAACCCGGAGCCCTTCTGGGAGACGACCAAGTCTACAACGTGGTTGTCACGGCCCATGCCTTCG
TAATAATCTTCTTTATAGTTATGCCGATTATAATCGGAGGATTCGGAAACTGACTAGTCCCCCTAATAAT
CGGAGCCCCAGACATAGCATTTCCGCGAATAAACAACATAAGCTTCTGACTACTCCCACCATCATTCCTC
CTCCTCTTAGCATCCTCCACAGTGGAAGCAGGCGTAGGTACAGGCTGAACAGTGTATCCCCCACTAGCTG
GCAACCTAGCTCATGCCGGGGCCTCAGTCGACCTCGCAATCTTCTCCTTACACCTAGCTGGTATTTCCTC
AATCCTCGGAGCAATTAACTTCATTACAACAGCAATTAACATGAAACCTCCTGCCCTCTCACAATACCAA
ACCCCACTATTCGTCTGATCAGTGTTAATTACTGCAGTCCTCCTTCTCCTTTCCCTTCCAGTTCTAGCTG
CAGGAATCACAATGCTCCTCACAGACCGCAACCTCAACACCACATTCTTCGACCCTGCCGGAGGAGGAGA
TCCCGTCCTATATCAACATCTCTTCTGATTCTTCGGCCACCCAGAAGTCTACATCCTAATCCTC
>Seq4 [organism=Vireo gilvus]
CATGAGCTGGAATAGTAGGTACCGCCCTAAGCCTCCTAATTCGAGCAGAGCTAGGCCAACCCGGAGCCCT
ACTGGGAGACGACCAAATCTACAACGTAGTCGNCACGGCCCATGCTTTTGTAATAATCTTCTTCATAGTA
ATGCCAATCATAATCGGAGGGTTTGGAAACTGACTGGTCCCCCTAATAATTGGAGCTCCAGACATAGCAT
TCCCCCGAATAAACAACATGAGTTTCTGACTACTTCCCCCATCATTCCTACTACTAATAGCCTCCTCAAC
AGTAGAAGCAGGCGTTGGAACAGGATGAACCGTATATCCACCACTAGCCGGAAACCTAGCCCATGCAGGA
GCCTCAGTAGACCTAGCTATCTTCTCCCTACACCTAGCAGGTATCTCCTCCATCCTAGGGGCAATCAACT
TCATTACAACAGCAATCAACATAAAACCACCCGCCCTATCACAATACCAAACACCACTATTCGTATGATC
CGTCCTAATCACAGCCGTACTACTCCTCCTATCACTCCCAGTGCTAGCTGCTGGAATTACCATGCTACTT
ACAGACCGCAACCTCAACACTACCTTCTTTGACCCAGCAGGGGGAGGAGACCCAGTGCTATACCAACATC
TATTCTGATTCTTCGGACACCCAGAAGTTTACATCCTAATTCTC
>Seq5 [organism=Dendroica castanea]
CCTATACCTAATTTTCGGCGCATGAGCCGGAATAGTGGGTACCGCCCTAAGCCTCCTCATTCGAGCAGAA
CTAGGCCAACCCGGAGCCCTTCTGGGAGACGACCAAGTCTATAACGTAGTTGTCACGGCCCATGCCTTCG
TAATAATTTTCTTTATAGTTATGCCGATTATAATCGGAGGATTCGGAAACTGACTAGTCCCCCTAATAAT
CGGAGCCCCAGACATAGCATTCCCACGAATAAACAACATAAGCTTCTGACTACTCCCACCATCATTCCTT
CTCCTCCTAGCATCCTCCACAGTCGAAGCAGGCGTAGGTACAGGCTGAACAGTATACCCCCCACTAGCTG
GCAACCTAGCTCACGCCGGAGCCTCAGTCGACCTCGCAATCTTCTCTCTACACCTAGCTGGTATTTCCTC
AATCCTCGGAGCAATCAACTTCATTACAACAGCAATTAACATAAAACCTCCTGCCCTCTCACAATACCAA
ACCCCACTGTTCGTCTGATCCGTCCTAATCACTGCAGTCCTCCTGCTCCTTTCCCTTCCAGTTCTAGCTG
CAGGAATCACAATACTCCTCACAGACCGCAACCTAAACACCACATTCTTCGACCCTGCTGGAGGAGGAGA
TCCCGTCCTATATCAACACCTTTTCTGATTCTTCGGCCACCCAGAAGTCTACATCCTAATCNTC
>Seq6 [organism=Vireo gilvus]
CATGAGCTGGAATAGTAGGTACCGCCCTAAGCCTCCTAATTCGAGCAGAGCTAGGCCAACCCGGAGCCCT
ACTGGGAGACGACCAAATCTACAACGTAGTCGTCACGGCCCATGCTTTTGTAATAATCTTCTTCATAGTA
ATGCCAATCATAATCGGAGGGTTTGGAAACTGACTGGTCCCCCTAATAATTGGAGCTCCAGACATAGCAT
TCCCCCGAATAAACAACATGAGTTTCTGACTACTTCCCCCATCATTCCTACTACTAATAGCCTCCTCAAC
AGTAGAAGCAGGCGTTGGAACAGGATGAACTGTATACCCGCCACTAGCCGGTAACCTAGCCCATGCAGGA
GCCTCAGTAGACCTAGCTATCTTCTCCCTACACCTAGCAGGTATCTCCTCCATCCTAGGGGCAATCAACT
TCATTACAACAGCAATCAACATAAAACCACCCGCCCTATCACAATACCAAACACCACTATTCGTATGATC
CGTCCTAATCACAGCCGTACTACTCCTCCTATCACTCCCAGTGCTAGCTGCTGGAATTACCATGCTACTT
ACAGACCGCAACCTCAACACTACCTTCTTTGACCCAGCAGGGGGAGGAGACCCAGTGCTATACCAACATC
TATTCTGATTCTTCGGACACCCAGAAGTTTACATCCTAATTCTC
Appendix II: Source Modifiers for Barcode Submissions through BarSTool
(from the NCBI website http://www.ncbi.nlm.nih.gov/WebSub/html/help/sourcetable.html)
In addition to the Sequence ID, the following source modifiers are required for Barcode
submissions:

Country - The country of origin of DNA samples used.

Specimen_voucher - An identifier of the individual or collection of the source
organism and the place where it is currently stored, usually an institution.
The following source modifiers are recommended for Barcode submissions:

Collected_by - Name of person who collected the sample.

Collection_date - Date the specimen was collected, in the format DD-Mon-YYYY, that
is, 2-digit date, three-character abbreviation of month, and 4-digit year (e.g., 11-Feb-2002). Mon-YYYY and YYYY are alternate formats to use when date information is
less complete.

Identified_by - Name of the person or persons who identified, by taxonomic name,
the organism from which the sequence was obtained.

Lat_Lon - Latitude and longitude, in decimal degrees, of where the sample was
collected.
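GPS units and mapping tools often report coordinates as signed decimal degrees (negative for south and west). The following sketch converts such values to the "13.57 N 24.68 W" style used in the example source table earlier in this tutorial. The function name format_lat_lon and the default of two decimal places are assumptions made for illustration.

```python
def format_lat_lon(latitude, longitude, decimals=2):
    """Format signed decimal degrees as '<lat> N|S <lon> E|W'."""
    if not -90.0 <= latitude <= 90.0:
        raise ValueError(f"latitude out of range: {latitude}")
    if not -180.0 <= longitude <= 180.0:
        raise ValueError(f"longitude out of range: {longitude}")
    north_south = "N" if latitude >= 0 else "S"
    east_west = "E" if longitude >= 0 else "W"
    return (f"{abs(latitude):.{decimals}f} {north_south} "
            f"{abs(longitude):.{decimals}f} {east_west}")


print(format_lat_lon(13.57, -24.68))  # prints: 13.57 N 24.68 W
```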
The following optional source modifiers are available to further describe the sequences in a
Barcode set:

Authority - The author or authors of the organism name from which sequence was
obtained.

Biotype - Variety of a species (usually a fungus, bacteria, or virus) characterized by
some specific biological property (often geographical, ecological, or physiological).
Same as biovar.

Biovar - See biotype

Breed - The named breed from which sequence was obtained (usually applied to
domesticated mammals).

Cell_line - Cell line from which sequence was obtained.

Cell_type - Type of cell from which sequence was obtained.

Chemovar - Variety of a species (usually a fungus, bacteria, or virus) characterized by
its biochemical properties.

Clone - Name of clone from which sequence was obtained.

Cultivar - Cultivated variety of plant from which sequence was obtained.

Dev_stage - Developmental stage of organism.

Ecotype - The named ecotype (population adapted to a local habitat) from which
sequence was obtained (customarily applied to populations of Arabidopsis thaliana).

Forma - The forma (lowest taxonomic unit governed by the nomenclatural codes) of
organism from which sequence was obtained. This term is usually applied to plants
and fungi.

Forma_specialis - The physiologically distinct form from which sequence was
obtained (usually restricted to certain parasitic fungi).

Genotype - Genotype of the organism.

Haplotype - Haplotype of the organism.

Isolate - Identification or description of the specific individual from which this
sequence was obtained.

Isolation_source - Describes the local geographical source of the organism from
which the sequence was obtained.

Lab_host - Laboratory host used to propagate the organism from which the
sequence was obtained.

Natural_host - When the sequence submission is from an organism that exists in a
symbiotic, parasitic, or other special relationship with some second organism, the
'natural host' modifier can be used to identify the name of the host species.

Note - Any additional information that you wish to provide about the sequence.

Pathovar - Variety of a species (usually a fungus, bacteria or virus) characterized by
the biological target of the pathogen. Examples include Pseudomonas syringae
pathovar tomato and Pseudomonas syringae pathovar tabaci.

Pop_variant - Name of the population variant from which the sequence was obtained.

Serogroup - Variety of a species (usually a fungus, bacteria, or virus) characterized by
its antigenic properties. Same as serotype and serovar.

Serotype - See Serogroup

Serovar - See Serogroup

Sex - Sex of the organism from which the sequence was obtained.

Strain - Strain of organism from which sequence was obtained.

Sub_species - Subspecies of organism from which sequence was obtained.

Subclone - Name of subclone from which sequence was obtained.

Subtype - Subtype of organism from which sequence was obtained.

Substrain - Sub-strain of organism from which sequence was obtained.

Tissue_lib - Tissue library from which the sequence was obtained.

Tissue_type - Type of tissue from which sequence was obtained.

Type - Type of organism from which sequence was obtained.

Variety - Variety of organism from which sequence was obtained.
Appendix III: Using the tar utility to make file archives
By Owen L. Astrachan, Duke University
(http://www.cs.duke.edu/~ola/courses/programming/tar.html)
The program tar (originally for tape archive) is useful for archiving and transmitting files. For
example, you may want to 'tar up' all your work for a course on the acpub and save it to your
own computer's disk drive so you don't run into quota problems. You might also want to
submit (e.g., for cps 108 or cps 100) an entire directory at once rather than the individual
files in the directory. The tar program is useful for these and other tasks and is simple to use.
You can see more information by reading the man page (type man tar). The examples below
are not meant to be exhaustive. You can also use the utility gtar instead.
Create, Extract, See Contents
The tar program takes one of three function command line arguments (there are two others I
won't talk about).
• c --- to create a tar file, writing the file starts at the beginning.
• t --- table of contents, see the names of all files or those specified in other command
line arguments.
• x --- extract (restore) the contents of the tar file.
(the other options are u for update and r for replace, see the man page for details).
Exactly one function argument, c, t, x, is used in conjunction with other command line
arguments shown below. Again, these examples are not meant to be complete, just useful.
Compression, Verbose, File specified
In addition to a function command line argument the arguments below are useful. I usually
use z and f all the time, and v when creating/extracting.

• f --- specifies the filename (which follows the f) used to tar into or to tar out from;
see the examples below.
• z --- use zip/gzip to compress the tar file or to read from a compressed tar file.
• v --- verbose output, show, e.g., during create or extract, the files being stored into or
restored from the tar file.
Examples
To tar all .cc and .h files into a tar file named foo.tgz use:
tar cvzf foo.tgz *.cc *.h
This creates (c) a compressed (z) tar file named foo.tgz (f) and shows the files being stored
into the tar file (v). The .tgz suffix is a convention for gzipped tar files, it's useful to use the
convention since you'll know to use z to restore/extract.
It's often more useful to tar a directory (which tars all files and subdirectories recursively
unless you specify otherwise). The nice part about tarring a directory is that it is untarred as a
directory rather than as individual files.
tar cvzf foo.tgz cps100
will tar the directory cps100 (and its files/subdirectories) into a tar file named foo.tgz.
To see a tar file's table of contents use:
tar tzf foo.tgz
To extract the contents of a tar file use:
tar xvzf foo.tgz
This untars/extracts (x) into the directory from which the command is invoked, and prints
the files being extracted (v).
If you want to untar into a specified directory, change into that directory and then use tar.
For example, to untar into a directory named newdir:
mkdir newdir
cd newdir
tar xvzf ../foo.tgz
You can extract only one (or several) files if you know the name of the file. For example, to
extract the file named anagram.cc from the tarfile foo.tgz:
tar xvzf foo.tgz anagram.cc
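On systems where the tar command is not available, the same create, list, and extract steps can be carried out with the tarfile module in Python's standard library. This is a minimal sketch, not part of the original appendix; the three function names are illustrative.

```python
import tarfile
from pathlib import Path


def create_archive(archive, source_dir):
    """Roughly 'tar czf archive source_dir': gzip-compressed, directory stored recursively."""
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps only the directory's own name, not its full path.
        tar.add(source_dir, arcname=Path(source_dir).name)


def list_archive(archive):
    """Roughly 'tar tzf archive': return the names stored in the archive."""
    with tarfile.open(archive, "r:gz") as tar:
        return tar.getnames()


def extract_archive(archive, destination):
    """Roughly 'cd destination; tar xzf archive'. Only extract archives you trust."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(destination)
```

For example, create_archive("foo.tgz", "cps100") followed by list_archive("foo.tgz") mirrors the tar cvzf and tar tzf examples above.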
Other Archiving/Compression Tools
Many PC/Mac programs will be able to restore files that have been archived using tar. For
example, on Macs, the Stuffit Deluxe program can handle Unix tar files. On PCs, the
pkunzip program will handle Unix tar files. This makes it possible to tar files up on [a server]
and then use ftp to bring them to your personal machine where you can store the tar files
and restore when needed. Of course you can run Linux too.
The zip and unzip commands available on some systems are very useful replacements for tar.
Zip/unzip programs are nearly standard on Windows 95/NT machines and zip will archive
entire directory structures with the right options (type zip by itself for help).
Appendix IV. Conducting batch BLAST searches using SEQTools
(Thanks to Silvia for the software suggestion!)
SEQTools is a versatile software package for sequence manipulation and analysis.
Among the many tasks that SEQTools can accomplish is facilitating the submission of
batches of DNA or protein sequences to the NCBI BLAST web interface. The following
tutorial describes how to create a sequence project and conduct a batch BLAST search using
SEQTools.
SEQTools can be downloaded for a 60-day trial period (this license can be extended in
60-day increments free of charge for students, or investigators can purchase a long-term
license following the trial period) from the website http://www.seqtools.dk . The software
is currently available only for the Windows operating system.
Once you have downloaded the program, build a new project. It is easiest if you put all of
your sequences together in a folder. Note that all files must be of the same type (e.g.,
FASTA, chromatograph trace files, etc.) and correspond to the same class of macromolecule
(e.g., nucleotide and protein sequences will not be handled correctly in a single project).
From the File menu, select “Open Sequence Files.” From the subsequent menu, select
“Sequences, All Types.” In the Project References dialogue box, select a name for your
project:
Through the Main dialogue box, select the folder that contains your sequences, then select
the “Add to List” button, followed by the “Load Files” button.
Batch BLAST searches can be run on either a local BLAST database or using the internet to
search GenBank online; we will do the latter. To conduct the search, open the Search
menu, select “Blast Batch Search,” then select “Sequential, NCBI – QBlast.” This will open
the BLAST dialogue box.
In the “Final Blast Program” tab, select BlastN as the Blast program for final search.
Choose the number of top BLAST hits that you would like the program to report (“Number
of description returned from NCBI”), and the number of alignments that you would like the
program to report (if you are only interested in the top hits, select zero for this value). You
can also select a maximum expect value for reporting. It is best to deselect the checkbox for
“Result in HTML format,” as it is easier to automate downstream applications using a text
output rather than HTML.
In the “Final Databases” tab, you may choose the GenBank databases to search. For our
purposes, choose the nr (general nucleotide) database:
The “Advanced Options” tab offers further options for database selection.
The “Destination” tab contains options for specifying the output of your search. You
could choose to ‘parse results into sequence headers’ if you plan to use the BLAST results
further (e.g., using a Perl script to modify the output), or choose to ‘save results as separate
files without parsing’ if you wish to simply view the files to check whether the top hits make
sense. If you use the second option, be sure to save the results as text files, not HTML.
A few words about the remaining tabs: the “Range” tab allows you to choose a subset of
the sequences to submit to BLAST, and the “View Search Progress” tab contains a window
that allows you to follow the progress of your batch search.
Appendix V. Source Modifiers for FASTA Definition Lines or tbl2asn Source Tables
(from the GenBank website: http://www.ncbi.nlm.nih.gov/BankIt/examples/eukrrna.html)
Source modifiers contain information about the biological source of the sequence to be
submitted. These modifiers can either be embedded in the definition line of a FASTA-formatted sequence by placing them in square brackets, or can be stored in a separate, tab-delimited table. The proper format for embedding multiple modifiers in the FASTA
definition line is demonstrated in this example:
>MG104 [organism=Mycena pura] [molecule=DNA] [collection-date=Oct-2005]
CGGTTGCTTGGGTTTTATAACATCAGTCAGTGACAGGCATTTCCAGAGTTGCCCTGTTCATTCATATTATT
TATTACTGACCAGTGAGGATCCACCTAGCAGTATAGATTCGGATGCAGTATAGGCGTATGATTACAACATC
...
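To check that embedded modifiers are well formed before submission, the definition line can be split into its SeqID and its bracketed name=value pairs. The sketch below is illustrative only: parse_defline is a hypothetical helper, and it assumes the SeqID is the first whitespace-delimited token after the ">" character.

```python
import re

# Matches one [name=value] modifier; neither part may contain a closing bracket.
MODIFIER = re.compile(r"\[([^=\]]+)=([^\]]*)\]")


def parse_defline(line):
    """Split a FASTA definition line into (SeqID, {modifier name: value})."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA definition line")
    body = line[1:].strip()
    seq_id = body.split(None, 1)[0] if body else ""
    modifiers = {name.strip(): value.strip() for name, value in MODIFIER.findall(body)}
    return seq_id, modifiers


seq_id, modifiers = parse_defline(
    ">MG104 [organism=Mycena pura] [molecule=DNA] [collection-date=Oct-2005]"
)
print(seq_id, modifiers["organism"])  # prints: MG104 Mycena pura
```

The SeqID returned here is the value that must also appear in the first column of the source table when the modifiers are stored separately.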
Accepted source modifiers are presented below. Note that some modifier names
have restricted values or formats (Note: the following information is taken directly
from the NCBI website).

organism should use the unabbreviated scientific name. Example:
[organism=Drosophila melanogaster]

molecule should use either "DNA" or "RNA". Example: [molecule=DNA]

moltype should use one of the following values. Example: [moltype=genomic]
genomic
precursor RNA
mRNA
rRNA
tRNA
snRNA
scRNA
other-genetic
cRNA
snoRNA
transcribed RNA

location should use one of the following values. Example:
[location=mitochondrion]
genomic
chloroplast
kinetoplast
mitochondrion
plastid
macronuclear
extrachromosomal
plasmid
cyanelle
proviral
virion
nucleomorph
apicoplast
leucoplast
proplastid
endogenous-virus
hydrogenosome

collection-date should be in the form YYYY or Mmm-YYYY or DD-Mmm-YYYY. Example: [collection-date=2005] or [collection-date=Oct-2005] or
[collection-date=25-Oct-2005]
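The three accepted date forms can be checked mechanically before the source table is saved. The following is a sketch under stated assumptions: is_valid_collection_date is a hypothetical helper, and it assumes English three-letter month abbreviations with an initial capital and a two-digit day, as in the examples above.

```python
import calendar
import re

MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# Optional 'DD-' may appear only together with 'Mmm-'; the year is always required.
DATE_PATTERN = re.compile(r"^(?:(?:(\d{2})-)?(" + "|".join(MONTHS) + r")-)?(\d{4})$")


def is_valid_collection_date(text):
    """Return True for YYYY, Mmm-YYYY, or DD-Mmm-YYYY with a day that exists in that month."""
    match = DATE_PATTERN.match(text)
    if match is None:
        return False
    day, month, year = match.groups()
    if day is None:
        return True
    days_in_month = calendar.monthrange(int(year), MONTHS.index(month) + 1)[1]
    return 1 <= int(day) <= days_in_month


for value in ("2005", "Oct-2005", "25-Oct-2005", "1-April-2008"):
    print(value, is_valid_collection_date(value))
```

Under this check, the "1-April-2008" value shown in the example source table earlier in this tutorial would be flagged, whereas "01-Apr-2008" would pass.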
The following modifiers should use only TRUE or FALSE. Example:
[transgenic=TRUE].

environmental-sample

germline

metagenomic

rearranged

transgenic
Other accepted modifiers for nucleotide sequences are: