ARB nifH Database
Last database update: December 2, 2011
Last documentation update: February 17, 2012
University of California, Santa Cruz, California
Maintained and Distributed by Zehr lab (http://www.es.ucsc.edu/~wwwzehr/research/database/)
Important additions to this update:
Upgraded to ARB 5.2: This database has been upgraded to be compatible with ARB 5.2. This means
users might have difficulty merging old databases with this current database. Please contact us if you
need to do this, and we can guide you through updating old databases to ARB 5.2.
Integrated CD-HIT and CD-HIT-EST analysis into the pipeline: We now send out the entire database,
after updating with new sequences, to CD-HIT (Huang et al 2010), to determine representative
sequences (based on both amino acid and nucleic acid sequences).
Integrated Chimera-check analysis: We have used UCHIME (Edgar et al, 2011) to evaluate potential
chimeras in this database. This is a necessary, but imperfect, approach. Sequences that not only meet
threshold criteria (outlined below) to be defined as a chimera, but also have parent sequences from the
same study, are marked as Putative Chimeras, and left out of most trees.
Utilized new masks for the creation of trees: We have several new masks that, in some cases, mask out
regions of the gene that are problematic for HMMalign. These can be explored by searching for *mask*
in the name field.
Basics about this database:
Nitrogenase gene sequences in the databases are accumulating rapidly. BLAST analysis is not always the
best approach for comparing sequences or identifying phylogenetic relationships. Due to the large
number of sequences in the databases, and different formats, it is not simple to download and align all
extant nifH protein sequences and their corresponding DNA encoding sequence. Such capabilities are
necessary for environmental studies where the amino acid sequence is needed for phylogenetic
analysis, but the corresponding DNA sequence is also needed for probe design. The problem of
obtaining all extant sequences is compounded by misannotation and by homologous proteins in the
databases.
The ARB software environment is a useful environment for visualizing and manipulating aligned and
unaligned sequences, and for maintaining metadata on sources, publications etc. ARB also contains
features for probe design, and the construction of phylogenetic trees. However, ARB is not well suited
for downloading and validating new data. Our group has developed a semi-automated process for
constructing the nifH database from public genomic data sources.
The procedure uses representative nifH protein sequences as BLAST queries against GenBank to identify
potential nifH and nifH-like genes. The output is screened for false positives, which are eliminated from
the database. Once identified, the nifH protein and the encoding DNA sequence are retrieved. The
sequences are imported into ARB using the nucleotide GI (GenBank identifier) as the sequence name to
prevent redundancy problems. After import, the DNA sequence is used to generate the amino acid
sequence (which should be identical to the GenBank record). The amino acid sequences are then exported,
aligned against a nifH Pfam model using HMMER, and re-imported into ARB. The
aligned amino acid sequences are then used to align the DNA sequences using the "Backalign" feature of
ARB.
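The back-alignment step described above can be illustrated with a minimal Python sketch. This is not ARB's own code: the function name backalign and the gap characters are assumptions made for illustration, and the sketch does not verify that each codon actually translates to the aligned residue. It only shows the idea of threading codons onto an aligned protein sequence, with three gap characters for every gap in the protein.

```python
GAP_CHARS = "-."  # assumed gap symbols; the alignment in ARB may use others


def backalign(aligned_protein: str, dna: str) -> str:
    """Thread an unaligned coding DNA sequence onto an aligned protein.

    Each residue consumes one codon (3 nt); each protein gap becomes '---'.
    Trailing codons (for example a stop codon) are ignored.
    """
    usable = len(dna) - len(dna) % 3
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    residues = [ch for ch in aligned_protein if ch not in GAP_CHARS]
    if len(codons) < len(residues):
        raise ValueError("DNA is too short for the aligned protein")

    next_codon = iter(codons)
    out = []
    for ch in aligned_protein:
        out.append("---" if ch in GAP_CHARS else next(next_codon))
    return "".join(out)


if __name__ == "__main__":
    print(backalign("M-K.", "ATGAAA"))
```

A sequence for which this threading fails is the kind of record that would be annotated in a problem field rather than aligned.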
Features of the nifH_ARB database
The database contains all nifH amino acid and DNA sequences obtainable from BLAST analysis. The
Cluster IV nif-like sequences are included, to allow identification in environmental surveys.
The amino acid sequences are aligned with a Hidden Markov Model, not by Clustal.
The DNA sequences are aligned according to the amino acid sequences so that DNA sequences can also
be used for phylogenetic analysis.
Start and stop positions of amino acid sequences are included in searchable fields, enabling rapid
selection of equal length sequences for phylogenetic analysis.
Virtually all of the GenBank metadata is imported along with the sequences, allowing rapid searches and
assembly of sequences for analysis.
A nomenclature for nifH clusters is provided. This should be used with caution, since many branches of
the phylogenetic tree (such as within the Proteobacteria) are poorly supported.
Disclaimers and Notes for Use
Most of the nifH sequences have been obtained by PCR amplification. A variety of primers have been
used, so the sequence database comprises sequences of a variety of lengths, some of which are
very short. The trees provided and the cluster naming are very dependent upon the length of sequence
and the region amplified and used in the phylogenetic analysis. Unfortunately, shorter sequences and
the deletion of certain regions in the analysis quickly degrade the robust clustering of the trees. The
analysis provided in the database starts with genomic length sequences, and uses the sequence
clustering based on the genome tree to evaluate trees built from shorter sequences. Clustering
deteriorates rapidly when the 3' end of the sequence is not used. Thus any use of the analysis for
trees generated from shorter length sequences should be interpreted very cautiously.
Furthermore, although the QuickAdd function in ARB is very useful for quickly screening the phylogeny
for new sequences, we are finding that clustering breaks down when adding thousands of sequences to
a backbone tree. We advocate building neighbor joining trees with your sequences, and close relatives,
for publication quality analyses.
Putative chimeric sequences were determined using the UCHIME algorithm (Edgar et al, 2011).
Nucleotide alignments for all 22,579 sequences (Dec 2010 update; including those obtained from
genomes) were clustered using CD-HIT (Huang et al 2010) at a 98% sequence identity cut-off, and the
resulting 8579 representative sequences were analyzed for chimeras using the UCHIME algorithm in de
novo mode. As accurate abundance data were unavailable for many studies in this database, the number
of sequences in each CD-HIT cluster was used as a proxy. UCHIME was run with all default
parameters, but the resulting chimeras were subjected to additional, empirically determined criteria
(please contact us for further details) to reduce the number of false positives. Putative chimeric
sequences were tagged as such in a field named “PutativeChimera” in the ARB database. A sequence
was further defined as a likely chimera if its two parent sequences were recovered from the same
study, and was tagged in the “PutativeChimeraSameStudy” field.
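The two-level tagging described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the input structures (a dictionary of chimera hits with their two parent IDs, and a dictionary mapping each sequence ID to a study label), the function name chimera_flags, and the "Y"/"N" values are all hypothetical. The additional empirical criteria mentioned above are not reproduced here, and the sketch treats "same study" as the query and both parents sharing one study label.

```python
def chimera_flags(hits, study_of):
    """Assign the two chimera fields to every sequence reported as a chimera.

    hits:     {query_id: (parent_a_id, parent_b_id)}  (hypothetical input)
    study_of: {sequence_id: study label}              (hypothetical input)
    """
    flags = {}
    for query, (parent_a, parent_b) in hits.items():
        query_study = study_of.get(query)
        same_study = (
            query_study is not None
            and study_of.get(parent_a) == query_study
            and study_of.get(parent_b) == query_study
        )
        flags[query] = {
            "PutativeChimera": "Y",
            "PutativeChimeraSameStudy": "Y" if same_study else "N",
        }
    return flags


if __name__ == "__main__":
    hits = {"q1": ("a", "b"), "q2": ("a", "c")}
    study_of = {"q1": "S1", "q2": "S1", "a": "S1", "b": "S1", "c": "S2"}
    for seq_id, fields in chimera_flags(hits, study_of).items():
        print(seq_id, fields)
```

Sequences with an unknown study label are never promoted to the same-study flag in this sketch, which errs on the side of keeping them in trees.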
Fields in ARB
The fields in the nifH database are primarily derived from GenBank records. A number of fields have
been created by the curators for convenience and to facilitate analysis of the data.
Fields in ARB that have been derived from GenBank fields are described in Appendix Table 1. Some
fields have been obtained from translation to EMBL format, and a few field names were changed for the
corresponding field in ARB.
Fields in ARB that have been created for data analysis and curation are:
StartAlign-the position in the current database for the first amino acid in the alignment
EndAlign-the position in the current database for the last amino acid in the alignment
Raymond_group- Major cluster designation (1-5) as defined by Raymond et al. Clusters 1-3 annotated.
Young_group- Major cluster designations (B, A, C) as used by Young.
SEQ_PROBLEMS-field where problems in sequences can be stored. Currently the only annotation is
"DNA unaligned" to designate sequences where translation "X" prevented backaligning the DNA
according to the amino acid sequence.
AMINO_2011- Amino acid sequence clusters as in AMINO_2009, except using a new numerical
designation (1.1, 1.2, 3.1, etc).
AMINO_2010- Current designation of clusters using the Alphabetical clustering system of Zehr et al.
2003. Numerous sequences had to be reclassified according to the new tree topology. Therefore, some
subgroups will differ, and there are fewer subclusters than in Zehr et al. Many of the Proteobacterial
groups are not robust, although there is good general agreement between the 2003 and 2010 clustering
(as long as the shorter nifH sequences are not used to make the tree). The cluster labeling of
AMINO_2010 was based on the tree included in the database:
tree_genome_AA_Dec2010_MASK_GENOME
AMINO_2003- Amino acid sequence clusters as defined in Zehr et al. 2003.
DNA_2003 - DNA groups as defined in Zehr et al. 2003.
PutativeChimera – Sequences identified as possible chimeras using UCHIME.
PutativeChimeraSameStudy – Sequences identified as likely chimeras using UCHIME, because parent
sequences are from the same study as the chimeric sequence.
AddUpdt_MonthYear – Sequences pulled in on the update date designated in the field name.
CDHITnt_ClusterNo – Cluster ID from the most recent CD-HIT-EST analysis (98% nucleotide identity).
CDHITnt_NumSeq – Number of sequences associated with the cluster number from the most recent
CD-HIT-EST analysis (98% nucleotide identity).
CDHITnt_RepSeqFlag – This field will have a “Y” if the sequence is the designated representative of a
cluster from the most recent CD-HIT-EST analysis (98% nucleotide identity).
CDHITnt_RepSeqID – This field will have the sequence ID of the designated cluster representative from
the most recent CD-HIT-EST analysis (98% nucleotide identity).
CDHITaa_ClusterNo – Cluster ID from the most recent CD-HIT analysis (98% amino acid identity).
CDHITaa_NumSeq – Number of sequences associated with the cluster number from the most recent
CD-HIT analysis (98% amino acid identity).
CDHITaa_RepSeqFlag – This field will have a “Y” if the sequence is the designated representative of a
cluster from the most recent CD-HIT analysis (98% amino acid identity).
CDHITaa_RepSeqID – This field will have the sequence ID of the designated cluster representative from
the most recent CD-HIT analysis (98% amino acid identity).
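As an illustration of how the four CDHITnt_* (or CDHITaa_*) fields relate to a CD-HIT run, the Python sketch below parses cluster report lines in the usual CD-HIT ".clstr" layout (a ">Cluster N" header followed by member lines, with the representative marked by a trailing "*") and derives the per-sequence values. The function names, the sample IDs, and the "N" value for non-representatives are assumptions for illustration; the text above only states that representatives carry a "Y".

```python
import re

# Member lines look like: "0<TAB>327nt, >12345... *" or "... at +/98.50%"
MEMBER_RE = re.compile(r">(\S+?)\.\.\.")


def parse_clstr(lines):
    """Return {cluster number: {"rep": id or None, "members": [ids]}}."""
    clusters = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
            clusters[current] = {"rep": None, "members": []}
            continue
        match = MEMBER_RE.search(line)
        if match is None or current is None:
            continue
        seq_id = match.group(1)
        clusters[current]["members"].append(seq_id)
        if line.endswith("*"):  # CD-HIT marks the representative with '*'
            clusters[current]["rep"] = seq_id
    return clusters


def cluster_fields(clusters, prefix="CDHITnt"):
    """Build the four per-sequence field values for every cluster member."""
    fields = {}
    for number, info in clusters.items():
        for seq_id in info["members"]:
            fields[seq_id] = {
                f"{prefix}_ClusterNo": number,
                f"{prefix}_NumSeq": len(info["members"]),
                # "N" for non-representatives is an assumption
                f"{prefix}_RepSeqFlag": "Y" if seq_id == info["rep"] else "N",
                f"{prefix}_RepSeqID": info["rep"],
            }
    return fields


SAMPLE = [
    ">Cluster 0",
    "0\t327nt, >111... *",
    "1\t324nt, >222... at +/98.50%",
    ">Cluster 1",
    "0\t300nt, >333... *",
]

if __name__ == "__main__":
    for seq_id, values in cluster_fields(parse_clstr(SAMPLE)).items():
        print(seq_id, values)
```

Passing prefix="CDHITaa" gives the amino acid field names for a protein-level CD-HIT report of the same layout.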
Fields relevant to old database users:
NR- sequences with "AA99" written to this field were selected as representative sequences by the
program "cd-hit", run with the default parameters and a 99% sequence identity threshold at the
amino acid level.
Cluster – Original CD-HIT cluster number, based on Dec2010 database.
RepFlag - Original CD-HIT Representative Sequence Flag (“Y” or “N”), based on Dec2010 database.
RepSeq - Original CD-HIT Representative Sequence, based on Dec2010 database.
NumSeqsInCluster - Original CD-HIT number of sequences in cluster, based on Dec2010 database.
Using the Database
The nifH database is provided as a resource for the community. We have tried to curate and maintain
the database to facilitate the analysis of environmental sequences in particular. The alignments are
generated by HMMER from Pfam models in order to provide some objectivity in approach, such that multiple
users will obtain similar tree phylogenies, and sequences have been identified by cluster names in order
to make it easier to discuss and compare datasets. However, neither the alignments nor the cluster naming
is absolute. Much of the cluster naming appears to be robust, but some branches and clusters are poorly
resolved and typically not supported by reasonable bootstrap values. The cluster naming in the
2010/2011 efforts has condensed some clusters to approach a more robust cluster designation. There
are multiple sequences that have problems, as imported from GenBank (for example sequences that
cannot be backaligned because of X's in the nucleotide sequence).
Trees
We’ve added several additional trees to this new version of the database:
Trees created using new masks:
From Dec 2010 update:
tree_Genomes_AA_Dec2010_MASK_GENOME – Non-redundant genome sequences (as of Dec 2010),
tree created using the MASK_GENOME mask.
tree_AA_RepSeqsDec2010_MASK1 – Representative sequences from the original CD-HIT analysis; tree
created using the MASK1 mask.
tree_AA_QuickAddtoRepSeqsDec2010_MASK1 – Quick add tree of all sequences of suitable length (as
of Dec 2010), added using the MASK1 mask.
tree_AA_QuickAddtoRepSeqsDec2010_NoPutChimeras_MASK1 – same as above, but with likely
chimeras removed.
From Dec 2011 update:
tree_Genomes_AA_Dec2011_MASK_GENOME – In this case, all genome sequences as of Dec 2011 are
included, meaning that there are redundancies (e.g. draft vs. complete genomes).
tree_AA_RepSeqsDec2011_MASK1 – Representative sequences from the most recent CD-HIT analysis;
tree created using the MASK1 mask.
tree_AA_RepSeqsDec2011_plusAllGenomeSeqs_noPutChimeras_MASK1 – Representative sequences
from the most recent CD-HIT analysis with all the genome sequences included and likely chimeras not
included; tree created using the MASK1 mask.
tree_AA_QuickAddtoRepSeqsDec2011_noPutChimeras_MASK1 – Quick add tree of all sequences of
suitable length (as of Dec 2011), with likely chimeras removed, added using the MASK1 mask.
From the old version of the database:
Several trees are provided as starting points. Tree names have information on the type of sequences (AA
or DNA), whether it was generated from genome or nonredundant representative sequences (NR), and
the start and stop positions used for the mask to generate the tree. Kimura correction was used for
amino acid sequence trees, and Jukes-Cantor for DNA trees.
Genome tree: tree_genome2010_81_630. This is probably the most robust tree since it uses the longest
amino acid sequences obtained from genome sequencing efforts. The sequences were selected by
searching the records for "genome". 81-630 refers to the positions in the current amino acid alignment.
tree_AA_NR_Dec3_2010_134_481: Includes 2982 sequences identified as representative sequences by
cd-hit (at 99% identity clustering). cd-hit identified 8389 representative sequences, but only those
that were also long enough for the 134_481 mask were included in this tree. The original cd-hit
representative sequences can be found by searching the NR field for "AA99".
tree_AA_NR_134_481_quickadd: Tree was built by the quickadd parsimony feature in ARB using the
149_478 mask, in order to add as many sequences as possible. This tree is not as reliable as the other
two trees, but generally shows the same clustering and allows positioning of more sequences. There are
17627 sequences in this tree (out of a total of 22574 sequences in the database).
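For readers unfamiliar with the distance corrections named above, the following Python sketch shows the textbook forms of the Jukes-Cantor correction for DNA and the Kimura (1983) correction for proteins, applied to the observed proportion of differing sites p. Whether ARB implements exactly these formulas (including its handling of gaps and ambiguity codes) is an assumption; the function names are hypothetical.

```python
import math


def p_distance(seq_a, seq_b, gaps="-."):
    """Proportion of differing sites, skipping columns with a gap in either sequence."""
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a in gaps or b in gaps:
            continue
        compared += 1
        differing += a != b
    if compared == 0:
        raise ValueError("no shared ungapped positions")
    return differing / compared


def jukes_cantor(p):
    """Jukes-Cantor DNA distance: d = -(3/4) * ln(1 - 4p/3)."""
    if p >= 0.75:
        raise ValueError("p must be below 0.75 for Jukes-Cantor")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def kimura_protein(p):
    """Kimura (1983) protein distance: d = -ln(1 - p - 0.2 * p^2)."""
    inner = 1.0 - p - 0.2 * p * p
    if inner <= 0.0:
        raise ValueError("p is too large for the Kimura correction")
    return -math.log(inner)


if __name__ == "__main__":
    p = p_distance("ACGTACGT", "ACGTACGA")
    print(p, jukes_cantor(p))
```

Both corrections grow faster than p itself, which is one reason short or highly divergent sequences give unstable distances.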
Tip for making your own trees
In order to use this database for making trees of additional sequences (e.g. newly derived sequences
from PCR, genomes, etc), the sequences can be 1) manually aligned (very slow), 2) quick aligned (fast,
but check alignment manually), 3) aligned using the same procedure used to make the database
(requires some skills, and results in loss of some information on aligned positions and features of the
current database, since the sequences may be repositioned). In order to use the quickalign feature (see
ARB documentation):
Import sequences (in fasta format). The other option is to create a new ARB database from just your
sequences, then use the merge function to bring the sequences from your database into this one. Note that
the GenBank fields will not have any information in them. You might want to create a field of your own
and write something to it (e.g. "myseqs") so that you can easily search for your own sequences.
Mark your sequences and some of the sequences that are already in the database. If you know of
sequences that are relatively close to yours (e.g. cyanobacteria if you are working on cyanobacteria),
it is probably better to mark those with your sequences and use one of them as the aligning sequence.
The sequences are generally so well conserved that the quickalign function works pretty well, as long
as you use sequences in the same major cluster (Clusters 1-4) to align. Once the sequences are marked,
open the ARB_EDIT window. Put the cursor somewhere in the sequence you want to use to align, and unmark
it (and any other sequences that are already aligned). Go to Edit, Integrated Aligners. Click the Fast
Aligner radio button, click the Align Marked Species radio button, and click the Reference Species by
name radio button. Then make sure the GI of the sequence you want to use as the aligning sequence is in
the box to the right (you should be able to do this by clicking the Copy button on the right, if your
sequence GI appeared in the Align what box above).
Click the Range Whole sequence radio button. Click Go. Check the alignment visually.
Selecting sequences of the same length by hand is time consuming. Sequences that span the
same region can now easily be selected using the Search and Query feature. Search StartAlign for "<xxx"
AND EndAlign for ">yyy", where xxx is the starting position of the amino acid alignment mask you want
(the tree mask can start 1 residue before xxx, since "<" returns all sequences that start at least one
position to the left) and yyy is the end of the alignment mask (the tree mask can end 1 residue after
yyy). Do the search, mark the listed species and make the tree.
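The StartAlign/EndAlign query above amounts to a simple range filter, sketched here in Python outside of ARB. The record layout (a dictionary of per-sequence field values), the function names, and the use of 1-based alignment positions are assumptions for illustration; in practice the search is done in ARB's Search and Query window.

```python
def align_bounds(aligned_seq, gaps="-."):
    """Return (first, last) 1-based alignment positions holding a residue, or None."""
    positions = [i for i, ch in enumerate(aligned_seq, start=1) if ch not in gaps]
    if not positions:
        return None
    return positions[0], positions[-1]


def select_for_mask(records, xxx, yyy):
    """Mimic searching StartAlign for "<xxx" AND EndAlign for ">yyy"."""
    return [
        name
        for name, fields in records.items()
        if fields["StartAlign"] < xxx and fields["EndAlign"] > yyy
    ]


if __name__ == "__main__":
    records = {
        "seqA": {"StartAlign": 80, "EndAlign": 640},   # spans positions 81-630
        "seqB": {"StartAlign": 150, "EndAlign": 640},  # starts too late
        "seqC": {"StartAlign": 81, "EndAlign": 629},   # ends too early
    }
    # "<82" and ">629" select sequences that cover an 81-630 mask
    print(select_for_mask(records, 82, 629))
```

Here the thresholds 82 and 629 are one position inside the 81-630 mask on each side, matching the note above that the tree mask can extend one residue beyond xxx and yyy.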
We typically open up the amino acid sequence alignment prior to making the tree, to make sure the
correct sequences are selected, and put the cursor in one of the sequences. That sequence will then
appear as an option for making the mask by positions in the Neighbor Joining "Filter" tree window.
Recent contributors to the nif_ARB project are:
Rachel Foster
Philip Heller
Pia Moisander
Kendra Turk
H. James Tripp
Jonathan P. Zehr
The database can currently be acknowledged with Zehr et al. (2003). New publications are in
preparation on the 2011 developments of the database.
References
Edgar et al (2011). “UCHIME improves sensitivity and speed of chimera detection.” Bioinformatics
27(16): 2194-2200.
Huang et al. (2010) “CD-HIT Suite: a web server for clustering and comparing biological sequences.”
Bioinformatics 26:680.
Ludwig, W., O. Strunk, et al. (2004). "ARB: a software environment for sequence data." Nucleic Acids
Research 32(4): 1363-1371.
Raymond, J., J. L. Siefert, et al. (2004). "The natural history of nitrogen fixation." Molecular Biology and
Evolution 21(3): 541-554.
Young, J. P. W. (2005). "The phylogeny and evolution of nitrogenases." In: Genomes and genomics of
nitrogen-fixing organisms. R. Palacios and W. E. Newton (eds.). Netherlands, Springer. 3: 221-241.
Zehr, J. P., B. D. Jenkins, et al. (2003). "Nitrogenase gene diversity and microbial community structure: a
cross-system comparison." Environmental Microbiology 5(7): 539-554.
APPENDIX
Table 1. Fields for nifH metadata in the nifH ARB database and their source in GenBank records.
GenBank                       ARB
parsed from "LOCUS" line      nuc_len
parsed from "/coded_by"       nuc_acc_and_pos
parsed from "/coded_by"       nuc_version
parsed from "LOCUS" line      date
DEFINITION                    description
KEYWORDS                      key_words
SOURCE (One line)             full_taxon_name
ORGANISM (All lines)          tax
REFERENCE                     num_bib
medline-ID                    medline
AUTHORS                       author
TITLE                         title
JOURNAL                       submission
mol_type                      mol_type
taxon                         taxon
clone                         clone
collection_date               collection_date
collected_by                  collected_by
country                       country
GI                            name
GI                            GI
gene                          gene
host                          host
strain + isolate              strain
isolation_source              isolation_source
lat_lon                       lat_lon
note                          note
operon                        operon
PCR_primers                   PCR_primers
product                       product
protein_id                    protein_acc
translation                   amino_acid
Table 2. Unique nifH ARB database fields. (This does not include ARB-specific fields that are not specific
to the nif database).
Field name                  Description
NR                          Field for non-redundant information. AA99 in this field indicates
                            sequence was selected by cd-hit as representative of cluster based on
                            99% identity.
AlignedAAs                  Total number of aligned amino acids in sequence
StartAlign                  First position of amino acid sequence in current alignment
EndAlign                    Last position of amino acid sequence in current alignment
SEQ_PROBLEMS                Field for keeping sequence problem information (not curated)
DNA_2003                    Cluster designation based on DNA sequences from Zehr et al. 2003
AMINO_2003                  Amino acid sequence clusters as defined in Zehr et al. 2003
LOCATION                    Field for writing sampling location information (not curated)
Raymond_group               Cluster designation using Raymond et al. scheme
Young_group                 Cluster designation using Young scheme
AMINO_2011                  Cluster designations from current alignment (1.1, 1.2, etc)
AMINO_2010                  Cluster designations from current alignment (1A, 1B, etc).
Cluster                     Original CD-HIT analysis: Cluster No
RepFlag                     Original CD-HIT analysis: Representative sequence flag
RepSeq                      Original CD-HIT analysis: Representative sequence ID
NumSeqInCluster             Original CD-HIT analysis: number of sequences in cluster
PutativeChimera             Potential chimeric sequence based on UCHIME analysis
PutativeChimeraSameStudy    Likely chimeric sequence; parent sequences from the same study
AddUpdt_Jul2011             New sequences from the July 2011 update
CDHITnt_ClusterNo           Most current CD-HIT-EST analysis: Cluster No
CDHITnt_NumSeq              Most current CD-HIT-EST analysis: number of sequences in cluster
CDHITnt_RepSeqFlag          Most current CD-HIT-EST analysis: Representative sequence flag
CDHITnt_RepSeqID            Most current CD-HIT-EST analysis: Representative sequence ID
CDHITaa_ClusterNo           Most current CD-HIT analysis: Cluster No
CDHITaa_NumSeq              Most current CD-HIT analysis: number of sequences in cluster
CDHITaa_RepSeqFlag          Most current CD-HIT analysis: Representative sequence flag
CDHITaa_RepSeqID            Most current CD-HIT analysis: Representative sequence ID
AddUpdt_Dec2011             New sequences from the Dec 2011 update