Protocol for assembly and analysis of 454 transcriptome sequences
Eli Meyer – version of December 13, 2008

General overview of the process

This document describes the de novo analysis of 454 transcriptome sequences using basic scripting tools and publicly available software packages. Disclaimer: it is my opinion that researchers wishing to analyze large sequence datasets are best served by making their own tools for the purpose, to best understand the output from those tools. But for (relatively!) quick analysis of new transcriptome data, the scripts we've developed for sequence analysis are available online at our website. This document provides a set of step-by-step instructions for the use of those scripts.

These scripts were developed for use on a small computing cluster consisting of two Dell PowerEdge 1900 servers joined together with ROCKS clustering software v5.0. Each server had two Intel Quad Core E5345 CPUs (2.33 GHz, 1333 MHz FSB, 2x4 MB L2 cache) and 16 GB of 667 MHz DDR2 RAM. The cluster had a combined total of 580 GB of disk space. All analyses were conducted by connecting remotely to the server via SSH (PuTTY 0.60), in a Unix environment (tcsh shell). The scripts should be easily portable to other operating systems with minor changes.

I have run the process in separate steps as described here, but the same strategy could easily be adapted into an automated pipeline. One advantage of running it in discrete steps like this is that particular steps can be skipped or modified as needed for a new project, and other steps can be run in parallel if computational power allows, to speed up the process. For example, the annotation steps are most efficiently finished if run in parallel.

Note that some of the steps described here take a long time, either due to blast search speed (blasting 10^5-10^6 queries and writing blast reports) or the data transfers involved in remotely accessing GenBank records. For that reason, it is recommended that you test the tools with small subsets of your data before running the main analysis.

Before beginning sequence analysis

Before beginning, it is assumed that the user has installed and configured the following software, and is generally familiar with its use.

Install software
- Blast executables from NCBI, including blast, blastcl3, and blastclust
- Washington University blast (WU-BLAST)
- ESTate sequence clustering software
- Perl

Download and format sequence databases
- NCBI: nr and Swiss-Prot protein sequence databases (format for blast and WU-BLAST); CDD conserved domain database (format for rps-blast), along with its definitions table (cddid_all.tbl)
- EMBL: UniProt-TrEMBL protein sequence database (format for WU-BLAST)

Download Gene Ontology files
- Uniprot annotation file (gene_association.goa_uniprot)

Copy these scripts to the scripts directory
AdaptorTrim.pl
SizeScreener.pl
ContigJoin.pl
fasta_wrapper.pl
NtCluster.pl
ClusterMerge.pl
GeneAnnotate.pl
DomainAnnotate.pl
seqtable_ext.pl
GOAnnotTable.pl
GOAnnotate.pl
DefLineMerge.pl
FastaToTable.pl
TableQuery.pl
ProteinCluster.pl

Path environment variable
Make sure to set your path environment variable appropriately so that you can call the scripts and software described above from any directory. The recommended strategy for using these scripts is to create (1) a scripts directory, (2) a databases directory, and (3) a set of individual working directories for each stage of the analysis.
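As an illustration, in the tcsh shell used here the scripts directory can be added to the path for the current session as shown below (the directory locations are placeholders; substitute your own, and add the line to ~/.cshrc to make it persistent). If your blast installation honors the BLASTDB variable, it can point at the databases directory so that database names can be given without full paths:

setenv PATH ${PATH}:/home/username/scripts
setenv BLASTDB /home/username/databases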
Step-by-step protocol for transcriptome analysis

1. Adaptor trimming
This procedure trims known adaptor sequences from sequencing reads using stringent criteria; i.e., regions that show even weak matches to the adaptors are removed.
a. Obtain a fasta-formatted file of raw sequencing reads from 454 FLX, and a second file containing the quality scores associated with these sequences, and save these files into the current working directory.
b. Create a fasta-formatted file containing all adaptor sequences to be screened. These should include the complete A and B adaptor sequences that are used in all 454 library protocols, in addition to any specific sequences used in cDNA library preparation. Save this file into the current working directory.
c. Begin adaptor trimming by entering the following at the command line:
AdaptorTrim.pl adaptors.fasta reads.fasta scores.qual
Where "adaptors.fasta" is the fasta file of adaptor sequences, "reads.fasta" is the fasta file of raw sequencing reads, and "scores.qual" is the set of quality scores obtained from pyrosequencing. The output produced by this script is:
i. "trimmed.fasta": the cleaned sequencing reads
ii. "trimmed.qual": quality scores for the cleaned reads
iii. "trimmed.tab": a tab-delimited text file containing three columns: (a) sequence identifier, (b) position of the beginning of adaptor-free sequence, and (c) position of the end of adaptor-free sequence.
Notes: Many assembly programs do not use quality scores, but they are included in the script because some assemblers can use them. Whether you intend to use them or not, they are required for the trimming script. Any sequences that are shorter than 10 bp after removal of adaptor sequence are discarded by the script.

2. Outlier sequence removal (size selection)
This procedure removes all sequences that fall outside the usual size range, on the basis that these might represent artifacts. We recommend that the size thresholds be empirically determined for each new set of sequence data, based on a size distribution of the cleaned reads. For our study using 454 FLX, we removed all sequences that were not between 60 and 340 bp, cutoffs that removed ~5% of reads. (A worked example with these thresholds follows the note below.)
a. Copy the trimmed reads and their quality scores into the current working directory.
b. First remove all reads smaller than the lower threshold, by entering at the command line:
SizeScreener.pl in.fasta in.qual X g first_cut.fasta first_cut.qual
Where "in.fasta" is the set of input sequences, "in.qual" is the file of their quality scores, "X" is the size threshold, "g" means "keep sequences larger than X", "first_cut.fasta" is the set of all sequences greater than that size, and "first_cut.qual" is the file of their quality scores.
c. Next remove all reads greater than the upper size threshold, by entering at the command line:
SizeScreener.pl first_cut.fasta first_cut.qual Y l second_cut.fasta second_cut.qual
Where "first_cut.fasta" is the set of input sequences, "first_cut.qual" is the file of their quality scores, "Y" is the size threshold, "l" means "keep sequences smaller than Y", "second_cut.fasta" is the set of all sequences smaller than that size, and "second_cut.qual" is the file of their quality scores.
Note: The cleaned and size-selected reads are ready for assembly at this point using the assembler of your choice. Because the Newbler assembler does not actually use quality scores, but rather uses the raw SFF file, that SFF file will have to be modified according to the "trimmed.tab" file prior to assembly in Newbler. The sfffile utility from Roche can be used for this purpose.
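For example, using the 60 and 340 bp thresholds from our study and the trimmed output files produced in step 1 (the intermediate and final filenames here are arbitrary choices), the two size screens could be run as:

SizeScreener.pl trimmed.fasta trimmed.qual 60 g first_cut.fasta first_cut.qual
SizeScreener.pl first_cut.fasta first_cut.qual 340 l second_cut.fasta second_cut.qual

The file second_cut.fasta then contains only reads between 60 and 340 bp, ready for assembly.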
3. Contig joining
This procedure should be applied after assembly. Assembled sequences are first compared against the smaller Swiss-Prot database, and then the sequences lacking matches in this first search are compared against the larger nr protein database. Query sequences that match different, non-overlapping parts of the same protein are joined together into a single sequence, with a string of twelve X characters (XXXXXXXXXXXX) marking the junction between joined sequences.
a. Compile all contigs (and singletons, if desired) from assembly into a single fasta file, and copy that file into the current working directory.
b. The current version of this script is hard-coded to search the Swiss-Prot database using NCBI blast (it's a small database, so speed is not as important), and to search the nr database using the faster WU-BLAST program. So before beginning, be sure that your Swiss-Prot and nr databases are formatted accordingly (see the formatting sketch at the end of this step).
c. To begin the contig joining process, enter the following at the command line:
ContigJoin.pl -i=input.fasta -p=swissprot -p=nr >output.log
Where "input.fasta" is the file containing all contigs and singletons, "swissprot" and "nr" are the names of locally installed protein sequence databases, and "output.log" is the name given to a log file summarizing the script's progress. The script produces several output files in addition to the log:
i. "tq1.br": blast report from the first blast search (Swiss-Prot)
ii. "tq2.br": blast report from the second blast search (nr)
iii. "matches.tab": a tab-delimited text file listing the best match for each query
iv. "GC_seqs.fasta": the main output containing all sequences after contig joining, both those that were joined together and those that were not.
Note: The joined sequences are assigned new identifiers in this process, and the original sequence identifiers are retained in the sequence description.
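Steps (3b) and (4b) both assume that Swiss-Prot has been formatted for NCBI blast and nr for WU-BLAST. As a minimal sketch, assuming the databases were downloaded as the fasta files swissprot.fasta and nr.fasta (exact options vary between releases, so check each formatter's usage message):

formatdb -i swissprot.fasta -p T -n swissprot
xdformat -p nr.fasta

Here formatdb is the formatter distributed with the NCBI blast executables (-p T indicates a protein database, -n sets the database name), and xdformat is the formatter distributed with WU-BLAST (-p indicates a protein database).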
4. Protein-based sequence clustering
This procedure builds single-linkage sequence clusters from the assembled sequences based on similarity between their top blast hits. Because each assembled sequence could represent only a portion of the original transcript, but the protein matches are complete sequences, this is expected to cluster together highly similar paralogs as well as different fragments of the same transcript.
a. Copy the joined contigs (FASTA format) into the current working directory.
b. The current version of this script is hard-coded to search the Swiss-Prot database using NCBI blast (it's a small database, so speed is not as important), and to search the nr database using the faster WU-BLAST program. So before beginning, be sure that your Swiss-Prot and nr databases are formatted accordingly.
c. Enter the following at the command line:
ProteinCluster.pl -i=joined.fasta -p=db1 -p=db2 -o=protclust.index > protclust.log
Where "joined.fasta" is the fasta-formatted file of joined contig sequences, "db1" is the first database against which to compare the queries (e.g., Swiss-Prot), "db2" is the second database against which to compare queries for which no match was found in db1 (e.g., nr), "protclust.index" is a name for the output file of clusters, and "protclust.log" is a text file that logs the progress of the script. This script produces, in addition to the log file:
i. "protein_families.fasta": a fasta-formatted file of joined contigs annotated with protein cluster information in the definition line, as "ProteinFamily=N", where "N" is a designation of the protein cluster number. Singletons (sequences not assigned to a protein cluster) are not included in this file.
ii. "protclust.index": a blastclust index file, for later use in cluster merging
iii. "hits1.list" and "hits2.list": tab-delimited text files listing the best matches for each query sequence analyzed
iv. "db1.br" and "db2.br": the blast reports from these blast searches
v. "prots.fasta": the list of protein sequences (best matches), with sequence IDs assigned based on the query matching each protein.

5. Nucleotide-based sequence clustering
This procedure builds single-linkage sequence clusters from assembled sequences based on nucleotide sequence similarity. This process has the advantage over (4) that it can cluster similar sequences that lack protein matches in public databases; however, because it relies on the assembled sequences themselves, it cannot cluster together sequences that correspond to different, non-overlapping parts of the same transcript. The two clustering processes have complementary strengths and weaknesses, and should therefore both be used if either is used.
a. Copy the joined contigs (FASTA format) into the current working directory.
b. Enter the following at the command line:
NtCluster.pl joined.fasta output.directory >ntclust.log
Where "joined.fasta" is the set of joined contig sequences from (3); "output.directory" is an arbitrary name for an output directory in which to store the temporary output files; and "ntclust.log" is a text file logging the progress of the script. This script produces, in addition to the log file:
i. "precluster.log": a log file for the precluster program (ESTate).
ii. "estcluster.log": a log file for the estcluster program (ESTate).
iii. "ntclusters.fasta": a fasta-formatted file containing all sequences annotated with the cluster information in the definition line as "NtFamily=". Singletons (sequences not assigned to a nucleotide cluster) are included in this file.
iv. "ntclusters.tab": a tab-delimited text file suitable for input into cluster merging (6). Each row contains the cluster ID followed by the cluster members.

6. Sequence cluster merging
The goal of this procedure is to find the union of the two sets of sequence clusters produced by protein and nucleotide sequence clustering (4, 5). The logic of the script is that, within each protein cluster, all nucleotide-sequence relatives of each of the cluster members are brought into the cluster. If both protein and nucleotide clustering are performed, this step is required to arrive at a unified set of clusters.
a. Make a new directory in which to merge sequence clusters, and navigate into that directory.
b. At the command prompt, enter the following (a worked example with the default output filenames follows below):
ClusterMerge.pl joined.fasta nt.index protein.index >merge.log
Where "joined.fasta" is the file containing assembled sequences from (3), "nt.index" is the table of nucleotide clusters produced in (5), "protein.index" is the table of protein clusters produced in (4), and "merge.log" is a text file containing a log of the process. This script produces, in addition to the log file:
i. "merged_clusters.fasta": all sequences annotated with their sequence cluster membership in the definition line, as "SeqCluster=N", where "N" is the name of the cluster containing that sequence.
ii. "merged_clusters.tab": a tab-delimited text file documenting the cluster-merging process.
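For example, assuming the default output filenames from the preceding steps (GC_seqs.fasta from contig joining in step 3, ntclusters.tab from nucleotide clustering in step 5, and protclust.index from protein clustering in step 4), the merge could be run as:

ClusterMerge.pl GC_seqs.fasta ntclusters.tab protclust.index >merge.log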
7. Gene name annotation
The goal of this process is to assign gene names to assembled sequences with significant matches to previously identified genes. Sequences are queried first against a small, well-annotated database (e.g., Swiss-Prot), and then those sequences lacking matches in the first search are queried against a larger but less extensively annotated database (e.g., nr). For each query, the GenBank records are downloaded and parsed for each significant match until a record is found that contains useful gene name annotation. This gene name is then assigned to the query sequence, and the script moves on to the next sequence. For computational speed, the search is limited to the top 10 blast matches for each query.
a. Make a new directory in which to perform gene name annotation, and navigate to that directory.
b. At the command line, enter the following (a worked example follows at the end of this step):
GeneAnnotate.pl -i=joined.fasta -b=badwords.list -o=local -p=db1 -p=db2 >gene_ann.log
The arguments for this script are:
"-i": the set of assembled sequences from (3).
"-b": a list of annotation terms to avoid, delimited by new lines. An example list of terms to avoid is: "Unknown", "uncharacterized", "hypothetical", "RIKEN", "predicted", and "similar". Any blast match that is annotated with these uninformative terms will be skipped by the script. If all blast matches for a given query contain one or more of these uninformative terms, that query will not be assigned a gene name.
"-o": "local" (search locally using the blast executable) or "remote" (search against the NCBI databases using blastcl3).
"-p": the protein databases against which to search. List first a small and well-annotated database (e.g., Swiss-Prot), and second a larger but less well annotated database (e.g., nr).
"-n": like "-p", but for nucleotide sequence databases.
The script produces, in addition to the log:
i. "gene_annotated.fasta": all sequences annotated in the definition line with gene names shown as "Gene=", and the accession number of the best annotated match shown as "Match_Acc=".
ii. "tq1.br": the blast report from the first blast search.
iii. "tq2.br": the blast report from the second blast search.
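For example, a badwords.list file built from the terms listed above would contain one term per line:

Unknown
uncharacterized
hypothetical
RIKEN
predicted
similar

and the annotation could then be run against locally installed Swiss-Prot and nr databases (the database and input filenames here are assumptions; substitute your own):

GeneAnnotate.pl -i=GC_seqs.fasta -b=badwords.list -o=local -p=swissprot -p=nr >gene_ann.log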
8. Protein domain annotation
This process compares assembled sequences with the CDD database of conserved protein domains, or any other desired domain database (e.g., the PFAM database would be another reasonable alternative). Sequence comparisons are made using RPS-BLAST. The best domain match for each query sequence is assigned to that sequence.
a. Find the locations on your system of the domain database file formatted for RPS-BLAST, and the domain descriptions table (e.g., cddid_all.tbl for NCBI's CDD database). Alternatively, these can be copied into the working directory, or added to environment variables.
b. Make a new directory in which to perform domain annotation, and navigate to that directory.
c. At the command line, enter the following:
DomainAnnotate.pl -i=joined.fasta -d=database -t=table -o=(v/s) >dom.log
The arguments for this script are:
"-i": the assembled sequences from (3) to be annotated.
"-d": the database of domains to be searched, with file path if needed.
"-t": the table of domain IDs and descriptions for the database specified. For the CDD database, this is the "cddid_all.tbl" file. Specify the file path if needed.
"-o": either "v" (verbose – list the full domain description) or "s" (short – descriptions truncated to the first 50 characters).
The script produces, in addition to the log:
i. "domain_annotated.fasta": all sequences annotated in the definition line with domain IDs shown as "DomainID=", a description of the domain as "DomainDesc=", and the database identifier of the best domain match shown as "DomainMatch=".
ii. "tq.br": the blast report for the domain annotation search.

9. Gene Ontology term annotation
This process compares assembled sequences with the Uniprot-TrEMBL protein sequence database, which is extensively annotated with Gene Ontology (GO) terms. For each query, the script loops through all significant blast matches and, for each, looks up the GO annotation for that match. Once a significant match with GO annotation is found, those GO terms are assigned to that query sequence and the script moves on to the next query.
a. Navigate to the directory containing the set of Uniprot annotations from Gene Ontology (gene_association.goa_uniprot).
b. At the command prompt, enter the following:
GOAnnotTable.pl input.tab >uniprot_annot.tab
Where "input.tab" is the name of the annotations file for your database of interest (e.g., the Uniprot-TrEMBL database), and "uniprot_annot.tab" is the name of a tab-delimited text file in which each row contains an accession number followed by all GO annotations for that sequence.
c. After that table has been built, navigate to a new directory in which you will conduct GO annotation of your assembled sequences.
d. At the command prompt, enter the following (a worked example follows at the end of this step):
GOAnnotate.pl -i=joined.fasta -d=db -t=annot.table > GO_annot.log
The arguments for this script are:
"-i": the assembled sequences from (3) to be annotated.
"-d": the protein sequence database to be searched (e.g., Uniprot-TrEMBL), with file path if needed.
"-t": the table of GO annotations for sequences in the database specified (the output from step 9b).
In addition to the log file, the script produces:
i. "GO_annotated.fasta": all sequences annotated in the definition line, where "GOTerms=" lists the GO terms assigned to the sequence and "GOMatch=" denotes the blast match used to assign GO annotation.
ii. "tq.br": the blast report for the GO annotation search.
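For example, assuming the GO association file downloaded during setup and a locally installed Uniprot-TrEMBL database named "trembl" (that database name and the input filename are assumptions; substitute your own), the two stages could be run as:

GOAnnotTable.pl gene_association.goa_uniprot >uniprot_annot.tab
GOAnnotate.pl -i=GC_seqs.fasta -d=trembl -t=uniprot_annot.tab > GO_annot.log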
10. Compiling annotations
Once all the desired annotations have been completed in parallel, the annotated files can be merged together into one final fasta-formatted file of sequences, each of which is annotated in the definition line with all information gathered in the various annotation procedures. Finally, this annotated dataset is used to build a tab-delimited text file that can be imported into Microsoft Excel, built into a database using the database software of your choice, or simply queried as is using simple Perl tools. For example, the script TableQuery.pl can be used for this purpose, allowing text searches for desired gene names, etc. (A worked example of the complete merging and querying process follows at the end of this section.)
a. Navigate to a new directory in which to compile annotations.
b. At the command line, enter the following:
DefLineMerge.pl annot1.fasta annot2.fasta merged.fasta
Where "annot1.fasta" is one file of annotated sequences; "annot2.fasta" is a second file containing the same sequences with different annotation (e.g., the merged clusters fasta file and the gene name fasta file); and "merged.fasta" is the output file. The script will produce a single fasta file in which each sequence is annotated with the information from the first file and the information from the second file. Annotations remain in the definition line, each taking the form "tag=value", with spaces separating annotations.
c. Merge the output from step (b) with the third annotation file (e.g., the domain annotation fasta file).
d. Repeat steps (b-c) until all annotations are merged into one final fasta file containing all annotations.
e. At the command prompt, enter the following:
FastaToTable.pl all_annot.fasta (y/n) > final.tab
Where "all_annot.fasta" is the final fasta file containing all merged annotations, followed by either "y" or "n" to indicate whether to print sequences as strings in the last column of the table. The output is a searchable, tab-delimited text file containing all annotation information.
f. The output table can be imported into Microsoft Excel, or the database software of your choice. Alternatively, it can be searched directly from the command prompt by entering:
TableQuery.pl final.tab field value
Where "final.tab" is the searchable table from step (10e) above, "field" is the column of the table to be searched (e.g., "GeneName" – note that this field is case sensitive), and "value" is the value to be searched. Perl search syntax works here, so that "^act." matches "actin" but not "interacting".
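As a worked example of the full compilation, assuming the default output filenames produced by the annotation steps above (the intermediate names merged1.fasta and merged2.fasta are arbitrary), the merging, table building, and querying might look like:

DefLineMerge.pl merged_clusters.fasta gene_annotated.fasta merged1.fasta
DefLineMerge.pl merged1.fasta domain_annotated.fasta merged2.fasta
DefLineMerge.pl merged2.fasta GO_annotated.fasta all_annot.fasta
FastaToTable.pl all_annot.fasta n > final.tab
TableQuery.pl final.tab Gene "^act."

The last command searches the column built from the "Gene=" tag assigned in step 7 for entries beginning with "act", as described above; adjust the field name to match the tags actually present in your table.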