ARB nifH Database

Last database update: December 2, 2011
Last documentation update: February 17, 2012
University of California, Santa Cruz, California
Maintained and Distributed by Zehr lab (http://www.es.ucsc.edu/~wwwzehr/research/database/)

Important additions to this update:

Upgraded to ARB 5.2: This database has been upgraded to be compatible with ARB 5.2. As a result, users might have difficulty merging old databases with the current database. Please contact us if you need to do this, and we can guide you through updating old databases to ARB 5.2.

Integrated CD-HIT and CD-HIT-EST analysis into the pipeline: After updating with new sequences, we now run the entire database through CD-HIT (Huang et al. 2010) to determine representative sequences (based on both amino acid and nucleic acid sequences).

Integrated chimera-check analysis: We have used UCHIME (Edgar et al. 2011) to evaluate potential chimeras in this database. This is a necessary, but imperfect, approach. Sequences that not only meet the threshold criteria (outlined below) to be defined as a chimera, but also have parent sequences from the same study, are marked as Putative Chimeras and left out of most trees.

Utilized new masks for the creation of trees: We have several new masks that, in some cases, mask out regions of the gene that are problematic for HMMalign. These can be explored by searching for *mask* in the name field.

Basics about this database:

Nitrogenase gene sequences in the databases are accumulating rapidly. BLAST analysis is not always the best approach for comparing sequences or identifying phylogenetic relationships. Because of the large number of sequences in the databases, and their different formats, it is not simple to download and align all extant nifH protein sequences and their corresponding DNA encoding sequences.
Such capabilities are necessary for environmental studies where the amino acid sequence is needed for phylogenetic analysis, but the corresponding DNA sequence is also needed for probe design. The problem of obtaining all extant sequences is compounded by misannotation and by homologous proteins in the databases.

The ARB software environment is useful for visualizing and manipulating aligned and unaligned sequences, and for maintaining metadata on sources, publications, etc. ARB also contains features for probe design and the construction of phylogenetic trees. However, ARB is not well suited for downloading and validating new data.

Our group has developed a semi-automated process for constructing the nifH database from public genomic data sources. The procedure uses representative nifH protein sequences to BLAST against GenBank to identify potential nifH and nifH-like genes. The output is screened for false positives, which are eliminated from the database. Once a nifH protein is identified, it and its encoding DNA sequence are retrieved. The sequences are imported into ARB using the nucleotide GI (GenBank identifier) as the sequence name to prevent redundancy problems. After import, the DNA sequence is used to generate the amino acid sequence (which should be identical to the GenBank record). The amino acid sequences are then exported, aligned against a nifH PFAM using HMMER, and re-imported into ARB. The aligned amino acid sequences are then used to align the DNA sequences using the "Backalign" feature of ARB.

Features of the nifH_ARB database

The database contains all nifH amino acid and DNA sequences obtainable from BLAST analysis. The Cluster IV nif-like sequences are included, to allow identification in environmental surveys. The amino acid sequences are aligned with a Hidden Markov Model, not by Clustal.
The DNA sequences are aligned according to the amino acid sequences so that DNA sequences can also be used for phylogenetic analysis. Start and stop positions of amino acid sequences are included in searchable fields, enabling rapid selection of equal length sequences for phylogenetic analysis. Virtually all of the GenBank metadata is imported along with the sequences, allowing rapid searches and assembly of sequences for analysis. A nomenclature for nifH clusters is provided. This should be used with caution, since many branches of the phylogenetic tree (such as within the Proteobacteria) are poorly supported.

Disclaimers and Notes for Use

Most of the nifH sequences have been obtained from PCR amplification. A variety of primers have been used, so the database consists of sequences of a variety of lengths, some of which are very short. The trees provided and the cluster naming are very dependent upon the length of sequence and the region amplified and used in the phylogenetic analysis. Unfortunately, shorter sequences and the deletion of certain regions in the analysis quickly degrade the robust clustering of the trees. The analysis provided in the database starts with genomic length sequences, and uses the sequence clustering based on the genome tree to evaluate trees built from shorter sequences. Clustering deteriorates rapidly when the 3' end of the sequence is not used. Thus any use of the analysis for trees generated from shorter length sequences should be interpreted very cautiously.

Furthermore, although the QuickAdd function in ARB is very useful for quickly screening the phylogeny for new sequences, we are finding that clustering breaks down when adding thousands of sequences to a backbone tree. We advocate building neighbor joining trees with your sequences, and close relatives, for publication quality analyses.

Putative chimeric sequences were determined using the UCHIME algorithm (Edgar et al. 2011).
Nucleotide alignments for all 22,579 sequences (Dec 2010 update; including those obtained from genomes) were clustered using CD-HIT (Huang et al. 2010) at a 98% sequence identity cut-off, and the resulting 8579 representative sequences were analyzed for chimeras using the UCHIME algorithm in de novo mode. As accurate abundance data were unavailable for many studies in this database, the number of sequences in each CD-HIT cluster was used as a proxy. UCHIME was run using all the default parameters, but the resulting chimeras were subject to additional, empirically determined criteria (please contact us for further details) to reduce the number of false positives. Putative chimeric sequences were tagged as such in a field named "PutativeChimera" in the ARB database. Likely chimeras were further defined if the two parent sequences were recovered from the same study, and tagged in the "PutativeChimeraSameStudy" field.

Fields in ARB

The fields in the nifH database are primarily derived from GenBank records. A number of fields have been created by the curators for convenience and to facilitate analysis of the data. Fields in ARB that have been derived from GenBank fields are described in the Appendix, Table 1. Some fields have been obtained from translation to EMBL format, and a few field names were changed for the corresponding field in ARB.

Fields in ARB that have been created for data analysis and curation are:

StartAlign: The position in the current database for the first amino acid in the alignment.
EndAlign: The position in the current database for the last amino acid in the alignment.
Raymond_group: Major cluster designation (1-5) as defined by Raymond et al. Clusters 1-3 annotated.
Young_group: Major cluster designations (B, A, C) as used by Young.
SEQ_PROBLEMS: Field where problems in sequences can be stored. Currently the only annotation is "DNA unaligned", which designates sequences where an "X" in the translation prevented backaligning the DNA according to the amino acid sequence.
AMINO_2011: Amino acid sequence clusters as in AMINO_2009, except using a new numerical designation (1.1, 1.2, 3.1, etc).
AMINO_2010: Current designation of clusters using the alphabetical clustering system of Zehr et al. 2003. Numerous sequences had to be reclassified according to the new tree topology. Therefore, some subgroups will differ, and there are fewer subclusters than in Zehr et al. Many of the Proteobacterial groups are not robust, although there is good general agreement between the 2003 and 2010 clustering (as long as the shorter nifH sequences are not used to make the tree). The cluster labeling of AMINO_2010 was based on the tree included in the database: tree_genome_AA_Dec2010_MASK_GENOME.
AMINO_2003: Amino acid sequence clusters as defined in Zehr et al. 2003.
DNA_2003: DNA groups as defined in Zehr et al. 2003.
PutativeChimera: Sequences identified as possible chimeras using UCHIME.
PutativeChimeraSameStudy: Sequences identified as likely chimeras using UCHIME, because parent sequences are from the same study as the chimeric sequence.
AddUpdt_MonthYear: Sequences pulled in on the update date designated in the field name.
CDHITnt_ClusterNo: Cluster ID from the most recent CD-HIT-EST analysis (98% nucleotide identity).
CDHITnt_NumSeq: Number of sequences associated with the cluster number from the most recent CD-HIT-EST analysis (98% nucleotide identity).
CDHITnt_RepSeqFlag: This field will have a "Y" if the sequence is the designated representative of a cluster from the most recent CD-HIT-EST analysis (98% nucleotide identity).
CDHITnt_RepSeqID: This field will have the sequence ID of the designated cluster representative from the most recent CD-HIT-EST analysis (98% nucleotide identity).
CDHITaa_ClusterNo: Cluster ID from the most recent CD-HIT analysis (98% amino acid identity).
CDHITaa_NumSeq: Number of sequences associated with the cluster number from the most recent CD-HIT analysis (98% amino acid identity).
CDHITaa_RepSeqFlag: This field will have a "Y" if the sequence is the designated representative of a cluster from the most recent CD-HIT analysis (98% amino acid identity).
CDHITaa_RepSeqID: This field will have the sequence ID of the designated cluster representative from the most recent CD-HIT analysis (98% amino acid identity).

Fields relevant to old database users:

NR: Sequences with "AA99" written to this field were selected as representative sequences by the program "cd-hit", run with the default parameters, at 99% sequence identity at the amino acid level.
Cluster: Original CD-HIT cluster number, based on Dec2010 database.
RepFlag: Original CD-HIT Representative Sequence Flag ("Y" or "N"), based on Dec2010 database.
RepSeq: Original CD-HIT Representative Sequence, based on Dec2010 database.
NumSeqsInCluster: Original CD-HIT number of sequences in cluster, based on Dec2010 database.

Using the Database

The nifH database is provided as a resource for the community. We have tried to curate and maintain the database to facilitate the analysis of environmental sequences in particular. The alignments are generated by HMMER from PFAMs in order to provide some objectivity in approach, such that multiple users will obtain similar tree phylogenies, and sequences have been identified by cluster names in order to make it easier to discuss and compare datasets. However, neither the alignments nor the cluster naming is absolute. Much of the cluster naming appears to be robust, but some branches and clusters are poorly resolved and typically not supported by reasonable bootstrap values. The cluster naming in the 2010/2011 efforts has condensed some clusters to approach a more robust cluster designation. Multiple sequences have problems as imported from GenBank (for example, sequences that cannot be backaligned because of X's in the nucleotide sequence).
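
The CD-HIT and chimera fields described above can be combined to choose sequences for a tree, for example nucleotide cluster representatives that are not flagged as likely chimeras. The following is a minimal Python sketch, not part of the database pipeline. It assumes the field values have already been exported from ARB into a list of dictionaries keyed by field name; the export step, the example records, and the helper name select_representatives are hypothetical. It also assumes that a chimera field is empty or absent when a sequence is not flagged, which should be checked against the actual field contents.

```python
def select_representatives(records,
                           rep_field="CDHITnt_RepSeqFlag",
                           chimera_field="PutativeChimeraSameStudy"):
    """Keep records flagged "Y" as cluster representatives and not flagged as likely chimeras.

    Assumption: any non-empty value in chimera_field means the sequence is flagged.
    """
    selected = []
    for rec in records:
        is_rep = rec.get(rep_field, "").strip().upper() == "Y"
        is_chimera = bool(rec.get(chimera_field, "").strip())
        if is_rep and not is_chimera:
            selected.append(rec)
    return selected


# Hypothetical records for illustration only; real values come from the ARB fields.
example_records = [
    {"name": "seq1", "CDHITnt_RepSeqFlag": "Y", "PutativeChimeraSameStudy": ""},
    {"name": "seq2", "CDHITnt_RepSeqFlag": "Y", "PutativeChimeraSameStudy": "Y"},
    {"name": "seq3", "CDHITnt_RepSeqFlag": "N"},
    {"name": "seq4", "CDHITnt_RepSeqFlag": "Y"},
]

kept = [rec["name"] for rec in select_representatives(example_records)]
print(kept)
```

Passing rep_field="CDHITaa_RepSeqFlag" would apply the same selection to the amino acid clustering instead.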
Trees

We've added several additional trees to this new version of the database.

Trees created using new masks:

From Dec 2010 update:
tree_Genomes_AA_Dec2010_MASK_GENOME: Non-redundant genome sequences (as of Dec 2010); tree created using the MASK_GENOME mask.
tree_AA_RepSeqsDec2010_MASK1: Representative sequences from the original CD-HIT analysis; tree created using the MASK1 mask.
tree_AA_QuickAddtoRepSeqsDec2010_MASK1: Quick add tree of all sequences of suitable length (as of Dec 2010), added using the MASK1 mask.
tree_AA_QuickAddtoRepSeqsDec2010_NoPutChimeras_MASK1: Same as above, but with likely chimeras removed.

From Dec 2011 update:
tree_Genomes_AA_Dec2011_MASK_GENOME: In this case, all genome sequences as of Dec 2011 are included, meaning that there are redundancies (e.g. draft vs. complete genomes).
tree_AA_RepSeqsDec2011_MASK1: Representative sequences from the most recent CD-HIT analysis; tree created using the MASK1 mask.
tree_AA_RepSeqsDec2011_plusAllGenomeSeqs_noPutChimeras_MASK1: Representative sequences from the most recent CD-HIT analysis, with all the genome sequences included and likely chimeras excluded; tree created using the MASK1 mask.
tree_AA_QuickAddtoRepSeqsDec2011_noPutChimeras_MASK1: Quick add tree of all sequences of suitable length (as of Dec 2011), with likely chimeras removed, added using the MASK1 mask.

From the old version of the database:

Several trees are provided as starting points. Tree names have information on the type of sequences (AA or DNA), whether the tree was generated from genome or nonredundant representative sequences (NR), and the start and stop positions used for the mask to generate the tree. Kimura correction was used for amino acid sequence trees, and Jukes-Cantor for DNA trees.

Genome tree: tree_genome2010_81_630. This is probably the most robust tree since it uses the longest amino acid sequences obtained from genome sequencing efforts.
The sequences were selected by searching the records for "genome". 81-630 refers to the positions in the current amino acid alignment.

tree_AA_NR_Dec3_2010_134_481: Includes 2982 sequences identified as representative sequences by cd-hit (at 99% identity clustering). cd-hit identified 8389 representative sequences, but only those that were also long enough for the 134_481 mask were included. The original cd-hit representative sequences can be found by searching the NR field for "AA99".

tree_AA_NR_134_481_quickadd: Tree was built by the quickadd parsimony feature in ARB using the 149_478 mask, in order to add as many sequences as possible. This tree is not as reliable as the other two trees, but generally shows the same clustering and allows positioning of more sequences. There are 17627 sequences in this tree (out of a total of 22574 sequences in the database).

Tip for making your own trees

In order to use this database for making trees of additional sequences (e.g. newly derived sequences from PCR, genomes, etc), the sequences can be
1) manually aligned (very slow),
2) quick aligned (fast, but check alignment manually),
3) aligned using the same procedure used to make the database (requires some skills, and results in loss of some information on aligned positions and features of the current database, since the sequences may be repositioned).

In order to use the quickalign feature (see ARB documentation):

Import sequences (in FASTA format). The other option is to create a new ARB database from just your sequences, then use the merge function to bring the sequences from your database into this one. Note that the GenBank fields will not have any information in them. You might want to create a field of your own and write something to it (e.g. "myseqs") so that you can easily search for your own sequences.
Mark your sequences and some of the sequences that are already in the database. It is probably better to mark sequences that you know are relatively close (e.g. cyanobacteria if you are working on cyanobacteria) so that one of them can be used as the aligning sequence. The sequences are generally so well conserved that the quickalign function works pretty well, as long as you use sequences in the same major cluster (Clusters 1-4) to align.

Once the sequences are marked, open the ARB_EDIT window. Put the cursor somewhere in the sequence you want to use to align, and unmark it (and any other sequences that are already aligned).

Go to Edit, Integrated Aligners, then:
Click the Fast Aligner radio button.
Click the Align Marked Species radio button.
Click the Reference Species by name radio button, and make sure the GI of the sequence you want to use as the aligning sequence is in the box to the right. You should be able to do this by clicking the Copy button on the right, if your sequence GI appeared in the "Align what" box above.
Click the Range Whole sequence radio button.
Click Go.
Check the alignment visually.

Selecting sequences of the same length by hand is time consuming. They can now easily be selected using the Search and Query feature. Search StartAlign for "<xxx" AND EndAlign for ">yyy", where xxx is the starting position of the amino acid alignment mask you want and yyy is the end of the alignment mask. Because "<" returns only sequences that start at least one position to the left of xxx, the tree mask can start 1 residue before xxx; likewise, the tree mask can end 1 residue after yyy. Do the search, mark the listed species and make the tree. We typically open the amino acid sequence alignment prior to making the tree, to make sure the correct sequences are selected, and put the cursor in one of the sequences. That sequence will then appear as an option for making the mask by positions in the Neighbor Joining "Filter" tree window.
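
The StartAlign and EndAlign search described above amounts to a strict comparison on two numeric fields. The following is a minimal Python sketch of the same logic applied outside ARB. It assumes the two fields have been exported as a list of dictionaries; the example records and the helper name spans_mask are hypothetical, and the mask positions 134 and 481 are borrowed from one of the tree names only for illustration.

```python
def spans_mask(record, mask_start, mask_end):
    """Return True if the sequence starts before mask_start and ends after mask_end.

    Mirrors searching StartAlign for "<xxx" AND EndAlign for ">yyy":
    both comparisons are strict.
    """
    start = int(record["StartAlign"])
    end = int(record["EndAlign"])
    return start < mask_start and end > mask_end


# Hypothetical records for illustration only; real values come from the ARB fields.
example_records = [
    {"name": "seqA", "StartAlign": "81", "EndAlign": "630"},
    {"name": "seqB", "StartAlign": "150", "EndAlign": "470"},
    {"name": "seqC", "StartAlign": "120", "EndAlign": "500"},
]

# xxx = 134 and yyy = 481
covering = [rec["name"] for rec in example_records if spans_mask(rec, 134, 481)]
print(covering)
```

Here seqB is excluded because it starts after position 134 and ends before position 481, so it would not cover the mask.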
Recent contributors to the nif_ARB project are:

Rachel Foster
Philip Heller
Pia Moisander
Kendra Turk
H. James Tripp
Jonathan P. Zehr

The database can currently be acknowledged with Zehr et al. (2003). New publications are in preparation on the 2011 developments of the database.

References

Edgar et al. (2011). "UCHIME improves sensitivity and speed of chimera detection." Bioinformatics 27(16): 2194-2200.
Huang et al. (2010). "CD-HIT Suite: a web server for clustering and comparing biological sequences." Bioinformatics 26: 680.
Ludwig, W., O. Strunk, et al. (2004). "ARB: a software environment for sequence data." Nucleic Acids Research 32(4): 1363-1371.
Raymond, J., J. L. Siefert, et al. (2004). "The natural history of nitrogen fixation." Molecular Biology and Evolution 21(3): 541-554.
Young, J. P. W. (2005). "The phylogeny and evolution of nitrogenases." In Genomes and genomics of nitrogen-fixing organisms. R. Palacios and W. E. Newton. Netherlands, Springer. 3: 221-241.
Zehr, J. P., B. D. Jenkins, et al. (2003). "Nitrogenase gene diversity and microbial community structure: a cross-system comparison." Environmental Microbiology 5(7): 539-554.

APPENDIX

Table 1. Fields for nifH metadata in the nifH ARB database and their source in GenBank records.

GenBank                          ARB
parsed from "LOCUS" line         nuc_len
parsed from "/coded_by"          nuc_acc_and_pos
parsed from "/coded_by"          nuc_version
parsed from "LOCUS" line         date
DEFINITION                       description
KEYWORDS                         key_words
SOURCE (One line)                full_taxon_name
ORGANISM (All lines)             tax
REFERENCE                        num_bib
medline-ID                       medline
AUTHORS                          author
TITLE                            title
JOURNAL                          submission
mol_type                         mol_type
taxon                            taxon
clone                            clone
collection_date                  collection_date
collected_by                     collected_by
country                          country
GI                               name
GI                               GI
gene                             gene
host                             host
strain + isolate                 strain
isolation_source                 isolation_source
lat_lon                          lat_lon
note                             note
operon                           operon
PCR_primers                      PCR_primers
product                          product
protein_id                       protein_acc
translation                      amino_acid

Table 2. Unique nifH ARB database fields.
(This does not include ARB-specific fields that are not specific to the nif database.)

Field name                  Description
NR                          Field for non-redundant information. AA99 in this field indicates sequence was selected by cd-hit as representative of cluster based on 99% identity.
AlignedAAs                  Total number of aligned amino acids in sequence
StartAlign                  First position of amino acid sequence in current alignment
EndAlign                    Last position of amino acid sequence in current alignment
SEQ_PROBLEMS                Field for keeping sequence problem information (not curated)
DNA_2003                    Cluster designation based on DNA sequences from Zehr et al. 2003
AMINO_2003                  Amino acid sequence clusters as defined in Zehr et al. 2003
LOCATION                    Field for writing sampling location information (not curated)
Raymond_group               Cluster designation using Raymond et al. scheme
Young_group                 Cluster designation using Young scheme
AMINO_2011                  Cluster designations from current alignment (1.1, 1.2, etc)
AMINO_2010                  Cluster designations from current alignment (1A, 1B, etc)
Cluster                     Original CD-HIT analysis: Cluster No
RepFlag                     Original CD-HIT analysis: Representative sequence flag
RepSeq                      Original CD-HIT analysis: Representative sequence ID
NumSeqInCluster             Original CD-HIT analysis: number of sequences in cluster
PutativeChimera             Potential chimeric sequence based on UCHIME analysis
PutativeChimeraSameStudy    Likely chimeric sequence; parent sequences from the same study
AddUpdt_Jul2011             New sequences from the July 2011 update
CDHITnt_ClusterNo           Most current CD-HIT-EST analysis: Cluster No
CDHITnt_NumSeq              Most current CD-HIT-EST analysis: number of sequences in cluster
CDHITnt_RepSeqFlag          Most current CD-HIT-EST analysis: Representative sequence flag
CDHITnt_RepSeqID            Most current CD-HIT-EST analysis: Representative sequence ID
CDHITaa_ClusterNo           Most current CD-HIT analysis: Cluster No
CDHITaa_NumSeq              Most current CD-HIT analysis: number of sequences in cluster
CDHITaa_RepSeqFlag          Most current CD-HIT analysis: Representative sequence flag
CDHITaa_RepSeqID            Most current CD-HIT analysis: Representative sequence ID
AddUpdt_Dec2011             New sequences from the Dec 2011 update