Protocol for assembly and analysis of 454 transcriptome sequences
Eli Meyer – version of December 13, 2008
General overview of the process
This document describes the de novo analysis of 454 transcriptome sequences using basic scripting tools
and publicly available software packages. Disclaimer: it is my opinion that researchers wishing to
analyze large sequence datasets are best served by making their own tools for the purpose, to best
understand the output from those tools. But for (relatively!) quick analysis of new transcriptome data,
the scripts we’ve developed for sequence analysis are available online at our website. In this document,
a set of step-by-step instructions for the use of those scripts is provided.
These scripts were developed for use on a small computing cluster consisting of two Dell PowerEdge
1900 servers joined together with ROCKS clustering software v5.0. Each server had two Intel Quad
Core E5345 CPUs (2.33 GHz, 1333 MHz FSB, 2x4 MB L2 cache) and 16 GB of 667 MHz DDR2 RAM. The
cluster had a combined total of 580 GB disk space. All analyses were conducted by connecting remotely
to the server via SSH (PuTTY 0.60), in a Unix environment (tcsh shell). The scripts should be easily
portable to other operating systems with minor changes.
I have run the process in separate steps as described here, but the same strategy could be easily
adapted into an automated pipeline. One advantage of running it in discrete steps like this is that
particular steps can be skipped or modified as needed for a new project. And other steps can be run in
parallel if computational power allows, to speed up the process. For example, the annotation steps are
most efficiently finished if run in parallel.
Note that some of the steps described here take a long time, either due to blast search speed (blasting
10^5-10^6 queries and writing blast reports), or the data transfers involved in remotely accessing GenBank
files. For that reason, it is recommended that you test the tools with small subsets of your data before
running the main analysis.
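One way to follow this recommendation is to pull the first few records out of a fasta file and run each step on that subset first. The following is a minimal Python sketch, not one of the published scripts; the function name first_n_records is hypothetical, and it assumes every record begins with a ">" definition line.

```python
def first_n_records(fasta_lines, n):
    """Return the lines belonging to the first n FASTA records.

    fasta_lines: an iterable of lines (e.g., an open file handle).
    Assumes each record starts with a '>' definition line.
    """
    out = []
    count = 0
    for line in fasta_lines:
        if line.startswith(">"):
            count += 1
            if count > n:
                break  # stop reading once n records are collected
        if count > 0:
            out.append(line)
    return out
```

Because a quality file uses the same ">" definition lines in the same order, the same function can be applied to it to keep the two test files in step.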
Before beginning sequence analysis
Before beginning, it is assumed that the user has installed and configured the following software, and is
generally familiar with its use:
Install software
- Blast executables from NCBI, including blast, blastcl3, and blastclust
- Washington University blast (WU-blast)
- ESTate sequence clustering software
- Perl
Download and format sequence databases
- NCBI:
  - nr and Swiss-Prot protein sequence databases (format for blast and WU-blast)
  - CDD conserved domain database (format for rps-blast), along with definitions
    table (cddid_all.tbl)
- EMBL:
  - UniProt-trEMBL protein sequence database (format for WU-blast)
Download Gene Ontology files
- Uniprot annotation file (gene_association.goa_uniprot)
Copy these scripts to the scripts directory
AdaptorTrim.pl
SizeScreener.pl
ContigJoin.pl
fasta_wrapper.pl
NtCluster.pl
ClusterMerge.pl
GeneAnnotate.pl
DomainAnnotate.pl
seqtable_ext.pl
GOAnnotTable.pl
GOAnnotate.pl
DefLineMerge.pl
FastaToTable.pl
TableQuery.pl
ProteinCluster.pl
Path environment variable
Make sure to set your PATH environment variable appropriately so that you can call the scripts and
software described above from any directory. The recommended strategy for using these scripts is to
create (1) a scripts directory, (2) a databases directory, and (3) a set of individual working directories, one for
each stage of the analysis.
Step-by-step protocol for transcriptome analysis
1. Adaptor trimming
This procedure trims known adaptor sequences from sequencing reads using stringent criteria; i.e.,
regions that show even weak matches to the adaptors are removed.
a. Obtain a fasta-formatted file of raw sequencing reads from 454 FLX, and a second file containing
the quality scores associated with these sequences, and save these files into the current working
directory.
b. Create a fasta-formatted file containing all adaptor sequences to be screened. These should
include the complete A and B adaptor sequences that are used in all 454 library protocols, in
addition to any specific sequences used in cDNA library preparation. Save this file into the
current working directory.
c. Begin adaptor trimming by entering the following at the command line:
AdaptorTrim.pl adaptors.fasta reads.fasta scores.qual
Where “adaptors.fasta” is the fasta file of adaptor sequences, “reads.fasta” is the fasta file of
raw sequencing reads, and “scores.qual” is the set of quality scores obtained from
pyrosequencing. The output produced by this script is:
i. “trimmed.fasta”: the cleaned sequencing reads
ii. “trimmed.qual”: quality scores for the cleaned reads
iii. “trimmed.tab”: a tab-delimited text file containing three columns: (a) sequence
identifier, (b) position of the beginning of adaptor-free sequence, and (c) the
position of the end of adaptor-free sequence.
Notes: Many assembly programs do not use quality scores, but they are included in the script
because some assemblers can use them. Whether you intend to use them or not, they are
required for the trimming script. Any sequences that are shorter than 10 bp after removal of
adaptor sequence are discarded by the script.
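The coordinates in “trimmed.tab” can in principle be applied to other copies of the same reads. The Python sketch below illustrates the idea only; it is not one of the published scripts, and the function name apply_trim is hypothetical. It rests on two assumptions that should be checked against real output: that the positions are 1-based and inclusive, and that the 10 bp minimum applies to the length remaining after trimming.

```python
MIN_LENGTH = 10  # minimum trimmed length stated in the notes above

def apply_trim(seq, start, end, min_length=MIN_LENGTH):
    """Return the adaptor-free part of seq, or None if it is too short.

    ASSUMPTION: start and end are 1-based, inclusive positions, as they
    might appear in columns 2 and 3 of trimmed.tab. Verify before use.
    """
    trimmed = seq[start - 1:end]
    if len(trimmed) < min_length:
        return None  # treated as discarded
    return trimmed
```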
2. Outlier sequence removal (size-selection)
This procedure removes all sequences that fall outside the usual size range on the basis that these might
represent artifacts. We recommend that the size thresholds be empirically determined for each new set
of sequence data, based on a size-distribution of the cleaned reads. For our study using 454 FLX, we
removed all sequences outside the 60-340 bp range, cutoffs that removed ~5% of reads.
a. Copy the trimmed reads and their quality scores into the current working directory.
b. First remove all reads smaller than the lower threshold, by entering at the command line:
SizeScreener.pl in.fasta in.qual X g first_cut.fasta first_cut.qual
Where “in.fasta” is the set of input sequences, “in.qual” is the file of their quality scores, “X” is
the size threshold, “g” means “keep sequences larger than X”, “first_cut.fasta” is the set of all
sequences greater than that size, and “first_cut.qual” is the file of their quality scores.
c. Next remove all reads greater than the upper size threshold, by entering at the command line:
SizeScreener.pl first_cut.fasta first_cut.qual Y l second_cut.fasta second_cut.qual
Where “first_cut.fasta” is the set of input sequences, “first_cut.qual” is the file of their quality
scores, “Y” is the size threshold, “l” means “keep sequences smaller than Y”, “second_cut.fasta”
is the set of all sequences smaller than that size, and “second_cut.qual” is the file of their quality
scores.
Note: The cleaned and size selected reads are ready for assembly at this point using the
assembler of your choice. Because the Newbler assembler does not actually use quality scores,
but rather uses the raw SFF file, that SFF file will have to be modified according to the
“trimmed.tab” file prior to assembly in Newbler. The sfffile utility from Roche can be used for
this purpose.
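To choose the size thresholds X and Y empirically, it helps to tabulate read lengths and check what fraction of reads a candidate pair of cutoffs would remove. The following Python sketch is one way to do that; it is not one of the published scripts, and the function names are hypothetical. It assumes that reads are kept only when strictly larger than X and strictly smaller than Y, matching the wording of the “g” and “l” options.

```python
def read_lengths(fasta_lines):
    """Return one sequence length per FASTA record."""
    lengths = []
    current = None  # length of the record being read, None before the first '>'
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths

def fraction_removed(lengths, lower, upper):
    """Fraction of reads falling outside lower < length < upper.

    ASSUMPTION: both comparisons are strict.
    """
    if not lengths:
        return 0.0
    removed = sum(1 for n in lengths if not (lower < n < upper))
    return removed / len(lengths)
```

Trying several cutoff pairs on the trimmed reads shows how quickly the removed fraction grows; in the study described above, the chosen cutoffs removed about 5% of reads.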
3. Contig joining
This procedure should be applied after assembly. Assembled sequences are first compared against the
smaller Swiss-Prot database, and then the sequences lacking matches in this first search are compared
against the larger nr protein database. Query sequences that match to different, non-overlapping parts
of the same protein are joined together into a single sequence, with a string of twelve X characters
(XXXXXXXXXXXX) marking the junction between joined sequences.
a. Compile all contigs (and singletons, if desired) from assembly into a single fasta file, and copy
that file into the current working directory.
b. The current version of this script is hard-coded to search the Swiss-Prot database using NCBI
blast (it’s a small database, so speed is not as important), and to search the nr database using
the faster WU-blast program. So before beginning, be sure that your Swiss-Prot and nr
databases are formatted accordingly.
c. To begin the contig joining process, enter the following at the command line:
ContigJoin.pl -i=input.fasta -p=swissprot -p=nr >output.log
Where “input.fasta” is the file containing all contigs and singletons, “swissprot” and “nr” are the
names of locally installed protein sequence databases, and “output.log” is the name given to a
log file summarizing the script progress.
The script produces several output files in addition to the log:
i. “tq1.br”: blast report from the first blast search (Swiss-Prot)
ii. “tq2.br”: blast report from the second blast search (nr)
iii. “matches.tab”: a tab-delimited text file listing the best match for each query
iv. “GC_seqs.fasta”: the main output containing all sequences after contig joining,
both those that were joined together and those that were not.
Note: The joined sequences are assigned new identifiers in this process, and the original
sequence identifiers are retained in the sequence description.
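The joining step itself can be pictured with a short Python sketch. It is a simplified illustration, not the logic of ContigJoin.pl: the function name join_contigs is hypothetical, the input is assumed to be queries already known to match non-overlapping parts of one protein, and ordering the pieces by their start position on that protein is an assumption. Strand and reading frame are ignored here.

```python
SPACER = "X" * 12  # junction marker described above

def join_contigs(hits):
    """Join query sequences that match one protein into a single sequence.

    hits: list of (query_sequence, start_position_on_protein) tuples.
    ASSUMPTION: pieces are ordered by where they match on the protein.
    """
    ordered = sorted(hits, key=lambda hit: hit[1])
    return SPACER.join(seq for seq, _ in ordered)
```

For example, two queries matching residues 10 and 150 of the same protein would be returned as the first sequence, twelve X characters, then the second sequence.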
4. Protein-based sequence clustering
This procedure builds single-linkage sequence clusters from the assembled sequences based on similarity
between their top blast hits. Because each assembled sequence could represent only a portion of the
original transcript, but the protein matches are complete sequences, this is expected to cluster together
highly similar paralogs as well as different fragments of the same transcript.
a. Copy the joined contigs (FASTA format) into the current working directory.
b. The current version of this script is hard-coded to search the Swiss-Prot database using NCBI
blast (it’s a small database, so speed is not as important), and to search the nr database
using the faster WU-blast program. So before beginning, be sure that your Swiss-Prot and
nr databases are formatted accordingly.
c. Enter the following at the command line:
ProteinCluster.pl -i=joined.fasta -p=db1 -p=db2 -o=protclust.index > protclust.log
Where “joined.fasta” is the fasta-formatted file of joined contig sequences, “db1” is the first
database against which to compare the queries (e.g., Swiss-Prot), “db2” is the second
database against which to compare queries for which no match was found in db1 (e.g., nr),
“protclust.index” is a name for the output file of clusters, and “protclust.log” is a text file
that logs the progress of the script.
This script produces, in addition to the log file:
i. “protein_families.fasta”: a fasta-formatted file of joined contigs annotated with
protein cluster information in the definition line, as “ProteinFamily=N”, where
“N” is a designation of the protein cluster number. Singletons (sequences not
assigned to a protein cluster) are not included in this file.
ii. “protclust.index”: a blastclust index file, for later use in cluster merging.
iii. “hits1.list” and “hits2.list”: tab-delimited text files listing the best matches for
each query sequence analyzed.
iv. “db1.br” and “db2.br”: the blast reports from these blast searches.
v. “prots.fasta”: the list of protein sequences (best matches), with sequence IDs
assigned based on the query matching each protein.
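Single-linkage clustering means that two sequences end up in the same cluster whenever a chain of pairwise links connects them. The Python sketch below shows that idea with a small union-find structure. It is a generic illustration, not the implementation used by ProteinCluster.pl or blastclust; the function name single_linkage is hypothetical, and the links are assumed to be given as pairs of sequence identifiers.

```python
def single_linkage(items, links):
    """Group items into single-linkage clusters.

    items: list of sequence identifiers.
    links: list of (id_a, id_b) pairs judged similar (hypothetical input).
    Returns a sorted list of sorted clusters.
    """
    parent = {x: x for x in items}

    def find(x):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for x in items:
        clusters.setdefault(find(x), []).append(x)
    return sorted(sorted(members) for members in clusters.values())
```

Note that c1 and c3 are clustered together when c1 links to c2 and c2 links to c3, even if c1 and c3 share no direct link; this chaining is why fragments of one transcript can be grouped through a shared protein match.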
5. Nucleotide-based sequence clustering
This procedure builds single-linkage sequence clusters from assembled sequences based on nucleotide
sequence similarity. This process has the advantage over (4) that it can cluster similar sequences that
lack protein matches in public databases; however, because it relies on the assembled sequences
themselves, it cannot cluster together sequences that correspond to different, non-overlapping parts of
the same transcript. The two clustering processes have complementary strengths and weaknesses, and
should therefore both be used if either is used.
a. Copy the joined contigs (FASTA format) into the current working directory.
b. Enter the following at the command line:
NtCluster.pl joined.fasta output.directory >ntclust.log
Where “joined.fasta” is the set of joined contig sequences from (3); “output.directory” is an
arbitrary name for an output directory in which to store the temporary output files; and
“ntclust.log” is a text file logging the progress of the script.
This script produces, in addition to the log file:
i. “precluster.log”: a log file for the precluster program (ESTate).
ii. “estcluster.log”: a log file for the estcluster program (ESTate).
iii. “ntclusters.fasta”: fasta-formatted file containing all sequences annotated with
the cluster information in definition line as “NtFamily=”. Singletons (sequences
not assigned to a nucleotide cluster) are included in this file.
iv. “ntclusters.tab”: tab-delimited text file suitable for input into cluster merging
(6). Each row contains the cluster ID followed by the cluster members.
6. Sequence cluster merging
The goal of this procedure is to find the union of the two sets of sequence clusters produced by protein
and nucleotide sequence clustering (4, 5). The logic of the script is that, within each protein cluster, all
nucleotide-sequence relatives for each of the cluster members are brought into the cluster. If both
protein and nucleotide clustering are performed, this step is required to arrive at a unified set of clusters.
a. Make a new directory in which to merge sequence clusters, and navigate into that directory.
b. At the command prompt, enter the following:
ClusterMerge.pl joined.fasta nt.index protein.index >merge.log
Where “joined.fasta” is the file containing assembled sequences from (3), “nt.index” is the
table of nucleotide clusters produced in (5) (“ntclusters.tab”), “protein.index” is the index of
protein clusters produced in (4) (“protclust.index”), and “merge.log” is a text file containing a
log of the process.
This script produces, in addition to the log file:
i. “merged_clusters.fasta”: all sequences annotated with their sequence cluster
membership in the definition line, as “SeqCluster=N”, where “N” is the name of
the cluster containing that sequence.
ii. “merged_clusters.tab”: a tab-delimited text file documenting the cluster-merging process.
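The merging logic described for this step can be sketched in a few lines of Python. This follows only the one-sentence description given above (each protein cluster absorbs the nucleotide-cluster relatives of its members) and should not be read as the actual behavior of ClusterMerge.pl; the function name merge_clusters is hypothetical. The sketch makes a single pass, so it neither combines two protein clusters that happen to share nucleotide relatives nor reports nucleotide clusters that touch no protein cluster.

```python
def merge_clusters(protein_clusters, nt_clusters):
    """Expand each protein cluster with its members' nucleotide relatives.

    protein_clusters, nt_clusters: lists of sets of sequence identifiers.
    Returns a list of sets, one per protein cluster, in the same order.
    """
    # Map each sequence to the nucleotide cluster that contains it.
    nt_of = {}
    for cluster in nt_clusters:
        for seq_id in cluster:
            nt_of[seq_id] = cluster

    merged = []
    for protein_cluster in protein_clusters:
        expanded = set(protein_cluster)
        for seq_id in protein_cluster:
            expanded |= nt_of.get(seq_id, set())
        merged.append(expanded)
    return merged
```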
7. Gene name annotation
The goal of this process is to assign gene names to assembled sequences with significant matches to
previously identified genes. Sequences are queried first against a small, well annotated database (e.g.,
Swiss-Prot), and then those sequences lacking matches in the first search are queried against a larger but
less extensively annotated database (e.g., nr). For each query, the GenBank records are downloaded and
parsed for each significant match until a record is found that contains useful gene name annotation.
This gene name is then assigned to the query sequence, and the script moves on to the next sequence.
For computational speed, the search is limited to the top 10 blast matches for each query.
a. Make a new directory in which to perform gene name annotation, and navigate to that
directory.
b. At the command line, enter the following:
GeneAnnotate.pl -i=joined.fasta -b=badwords.list -o=local -p=db1 -p=db2 >gene_ann.log
The arguments for this script are:
“-b”: a list of annotation terms to avoid, delimited by new lines. An example list of
terms to avoid is: “Unknown”, “uncharacterized”, “hypothetical”, “RIKEN”, “predicted”,
and “similar”.
Any blast match that is annotated with these uninformative terms will be skipped by the
script. If all blast matches for a given query contain one or more of these uninformative
terms, that query will not be assigned a gene name.
“-i”: the set of assembled sequences from (3).
“-o”: “local” - search locally using blast.exe, or “remote” - search against NCBI databases
using blastcl3.exe.
“-p”: First, list the first database against which to search (a small and well-annotated
one, e.g., Swiss-Prot). Next, list the second, a larger or less well-annotated database (e.g.,
nr).
“-n”: like “-p”, but for nucleotide sequence databases.
The script produces, in addition to the log:
i. “gene_annotated.fasta”: all sequences annotated in the definition line with
gene names shown as “Gene=”, and the accession number of the best
annotated match shown as “Match_Acc=”.
ii. “tq1.br”: the blast report from the first blast search.
iii. “tq2.br”: the blast report from the second blast search.
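The way uninformative matches are skipped can be illustrated with a short Python sketch. It covers only the selection rule described above (walk down at most the top 10 matches and take the first one free of the terms in the “-b” list), not the blast searches or the GenBank record parsing. The function name pick_gene_name is hypothetical, and case-insensitive substring matching is an assumption; GeneAnnotate.pl may compare terms differently.

```python
def pick_gene_name(match_descriptions, bad_words, max_hits=10):
    """Return the first informative description among the top matches.

    match_descriptions: descriptions of blast matches, best match first.
    bad_words: terms to avoid (the contents of the '-b' list).
    ASSUMPTION: terms are matched as case-insensitive substrings.
    Returns None when every examined match contains a term to avoid.
    """
    lowered = [word.lower() for word in bad_words]
    for description in match_descriptions[:max_hits]:
        text = description.lower()
        if not any(word in text for word in lowered):
            return description
    return None
```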
8. Protein domain annotation
This process compares assembled sequences with the CDD database of conserved protein domains, or
any other desired domain database (e.g., the PFAM database would be another reasonable alternative).
Sequence comparisons are made using RPS-BLAST. The best domain match for each query sequence is
assigned to that sequence.
a. Find the locations on your system of the domain database file formatted for RPS-BLAST, and
the domain descriptions table (e.g., cddid_all.tbl, for NCBI’s CDD database). Alternatively,
these can be copied into the working directory, or added to environment variables.
b. Make a new directory in which to perform domain annotation, and navigate to that
directory.
c. At the command line, enter the following:
DomainAnnotate.pl -i=joined.fasta -d=database -t=table -o=(v/s) >dom.log
The arguments for this script are:
“-i”: the assembled sequences from (3) to be annotated.
“-d”: the database of domains to be searched, with file path if needed.
“-t”: the table of domain IDs and descriptions for the database specified. For the CDD
database, this is the “cddid_all.tbl” file. Specify file path if needed.
“-o”: either “v” (verbose – list full domain description) or “s” (short – descriptions truncated
to first 50 characters).
The script produces, in addition to the log:
i. “domain_annotated.fasta”: all sequences annotated in the definition line
with domain IDs shown as “DomainID=”, a description of the domain as
“DomainDesc=”, and the database identifier of the best domain match shown as
“DomainMatch=”.
ii. “tq.br”: the blast report for the domain annotation search.
9. Gene Ontology term annotation
This process compares assembled sequences with the Uniprot-trEMBL protein sequence database, which
is extensively annotated with Gene Ontology (GO) terms. For each query, the script loops through all
significant blast matches and for each, looks up the GO annotation for that match. Once a significant
match with GO annotation is found, those GO terms are assigned to that query sequence and the script
moves on to the next query.
a. Navigate to the directory containing the set of Uniprot annotations from Gene Ontology
(gene_association.goa_uniprot).
b. At the command prompt, enter the following:
GOAnnotTable.pl input.tab >uniprot_annot.tab
Where “input.tab” is the name of the annotations file for your database of interest (e.g., the
Uniprot-trEMBL database), and “uniprot_annot.tab” is the name of a tab-delimited text file
in which each row contains an accession number followed by all GO annotations for that
sequence.
c. After that table has been built, navigate to a new directory in which you will conduct GO
annotation of your assembled sequences.
d. At the command prompt, enter the following:
GOAnnotate.pl -i=joined.fasta -d=db -t=annot.table > GO_annot.log
The arguments for this script are:
“-i”: the assembled sequences from (3) to be annotated.
“-d”: the protein sequence database to be searched (e.g., Uniprot-trEMBL), with file path if needed.
“-t”: the table of GO annotations for sequences in the database specified (the output from
step 9b).
In addition to the log file, the script produces:
i. “GO_annotated.fasta”: all sequences annotated in the definition line where
“GOTerms=” lists the GO terms assigned to this sequence, and “GOMatch=”
denotes the blast match used to assign GO annotation.
ii. “tq.br”: the blast report for the GO annotation search.
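The table built in step 9b collapses the annotation file, which has one row per annotation, into one row per accession. The Python sketch below shows that reshaping. It is not GOAnnotTable.pl: the function names are hypothetical, and the column positions rely on the standard GAF layout of Gene Ontology annotation files (accession in column 2, GO ID in column 5, comment lines starting with “!”). Separating the GO terms with tabs in the output is an assumption.

```python
def build_go_table(gaf_lines):
    """Map each accession to its list of distinct GO IDs, in file order."""
    table = {}
    for line in gaf_lines:
        if line.startswith("!") or not line.strip():
            continue  # skip comment and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip malformed rows
        accession, go_id = fields[1], fields[4]
        terms = table.setdefault(accession, [])
        if go_id not in terms:
            terms.append(go_id)
    return table

def format_rows(table):
    """One tab-delimited row per accession: accession, then its GO IDs.

    ASSUMPTION: GO terms are tab-separated in the output table.
    """
    return ["\t".join([accession] + terms) for accession, terms in table.items()]
```

The accessions and GO IDs used to try this out can be any few rows copied from the annotation file; building the table once and reusing it avoids re-reading the full file for every query.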
10. Compiling annotations
Once all the desired annotations have been completed in parallel, the annotated files can be merged
together into one final fasta-formatted file of sequences each of which is annotated in the definition line
with all information gathered in the various annotation procedures. Finally, this annotated dataset is
used to build a tab-delimited text file that can be imported into Microsoft Excel, built into a database
using the database software of your choice, or simply queried as is using simple Perl tools. For example,
the script TableQuery.pl can be used for this purpose, allowing text searches for desired gene names, etc.
a. Navigate to a new directory in which to compile annotations.
b. At the command line, enter the following:
DefLineMerge.pl annot1.fasta annot2.fasta merged.fasta
Where “annot1.fasta” is one file of annotated sequences; “annot2.fasta” is a second file
containing the same sequences with different annotation (e.g., the merged clusters
fasta file and the gene name fasta file); and “merged.fasta” is the output file.
The script will produce a single fasta file in which each sequence is annotated with the
information from the first file and the information from the second file. Annotations
remain in the definition line, each taking the form “tag=value”, with a single space
separating annotations.
c. Merge the output from step (b) with the third annotation file (e.g., the domain
annotation fasta file).
d. Repeat steps (b-c) until all annotations are merged into one final fasta file containing all
annotations.
e. At the command prompt, enter the following:
FastaToTable.pl all_annot.fasta (y/n) > final.tab
Where “all_annot.fasta” is the final fasta file containing all merged annotations;
followed by either “y” or “n” to indicate whether to print sequences as strings in the last
column of the table.
The output is a searchable, tab-delimited text file containing all annotation information.
f. The output table can be imported into Microsoft Excel, or the database software of your
choice. Alternatively, it can be searched directly from the command prompt by
entering:
TableQuery.pl final.tab field value
Where “final.tab” is the searchable table from step (10e) above, “field” is the column of
the table to be searched (e.g., “GeneName”; note that this field is case sensitive), and
“value” is the value to be searched. Perl search syntax works here, so that “^act.”
matches “actin” but not “interacting”.
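The same kind of search can be reproduced outside the scripts. The Python sketch below is an illustration rather than a port of TableQuery.pl: the function names are hypothetical, and it assumes the first row of the table holds the column names. For simple patterns such as “^act.”, Python's re module uses the same syntax as Perl.

```python
import re

def read_table(lines):
    """Parse tab-delimited lines into a list of dicts keyed by column name.

    ASSUMPTION: the first line is a header row naming the columns.
    """
    header = lines[0].rstrip("\n").split("\t")
    return [dict(zip(header, line.rstrip("\n").split("\t"))) for line in lines[1:]]

def query_table(rows, field, pattern):
    """Return the rows whose value in 'field' matches the regular expression.

    The field name is case sensitive, as is the pattern.
    """
    regex = re.compile(pattern)
    return [row for row in rows if regex.search(row.get(field, ""))]
```

The pattern “^act.” requires the value to begin with “act” followed by at least one more character, so a value of “actin” is returned while “interacting protein” is not.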