README

README contents:
1. Overview
2. Installation
3. Sample Run
4. RNA-Seeker Procedures
5. Run RNA-Seeker
6. Required File Formats
7. Output File Formats
1. Overview
RNA-Seeker is a package of scripts that supplement the Bowtie mapping algorithm. The recommended
use for RNA-Seeker is to help glean useful information from sets of gene families (e.g. tRNAs), where
multiple genes are identical or have identical portions, resulting in RNA-seq reads mapping to multiple
locations. RNA-Seeker requires a "Library" file, which contains names, classifications, and genomic
locations for the genes that are of interest to the user. There are two primary procedures in the RNA-Seeker package.
The Compile Reference Gene Files procedure extracts from the reference genomes the DNA sequences
of genes of interest to the user. Additionally, non-coding DNA including any introns and the flanking
sequences of the gene within a defined distance are included in the reference files. This allows Bowtie to
map processed and unprocessed reads alike without allowing gaps during the mapping, a process that
raises the possibility of false-positive matches.
The Classify Reads procedure addresses the problem of reads mapping to multiple locations in the
genome, as is common when dealing with sets of gene families like tRNAs. Instead of discarding these
multimapping reads or using a probabilistic approach to distribute them to genes, this procedure notes
the relationship of all of the genes that the read maps to and classifies the read to a category of
uniqueness based on this. If all of the genes that a read maps to are identical copies of the same gene,
this is much more useful to note than the case where a read maps to two or more completely unrelated
genes. This procedure distinguishes between these types of multimap reads and more, depending on
the user’s customization of the analysis. Importantly, the results of this procedure give only the factual
results of the classification, and leave any inference or interpretation of these results to the user.
The RNA-Seeker tool was developed by Matt Spencer and Jonathan Morrand.
MS: Informatics Institute, University of Missouri
JM: Department of Biochemistry, University of Missouri
Contact: mcsgx2@mail.missouri.edu
Do not hesitate to use this email for support and questions.
Please include "RNA-Seeker" in the subject of any messages.
Version: 1.0
Last Updated: December 2014
2. Installation
Note that the scripts included in RNA-Seeker are implemented in Perl, and are not designed for use
other than on Unix/Linux systems.
To decompress the RNA-Seeker package on a Unix/Linux system, navigate to the directory where the
package (rnaseeker-v1.0.tar.gz) is stored and use the command:
tar -xvzf rnaseeker-v1.0.tar.gz
Next, enter the rnaseeker-v1.0/scripts/ directory (e.g. cd rnaseeker-v1.0/scripts/ ). Use the
following command to complete installation:
sh install.sh
The RNA-Seeker program will run the relevant bowtie commands for the user. Since bowtie is not
included in the installation of RNA-Seeker, this program must be able to locate bowtie. Try running the
command "bowtie --version". If it tells you the bowtie version, then nothing more needs to be done.
Otherwise, identify the location of installation of the bowtie tool (or install it using the link below if it is
not yet installed on this machine), and the bowtie command parameter must be changed to reflect this
file location. To do this, make sure you are still in the rnaseeker-v1.0/scripts/ directory and run the
following command:
./change_params.pl
Follow the instructions for changing the Bowtie Command (additional instructions about this can be
found in “Parameter File: How to change the parameters for RNA-Seeker” in section 6: Required File
Formats).
The RNA-Seeker package is now installed. The contents of the scripts directory should be left together. If
the scripts/ directory is ever moved, renamed, or its contents are removed from that directory, the ‘sh
install.sh’ command should be used again to reinstall the package after the changes have occurred.
In such a case, be sure that the six files (install.sh, params.txt, default_params.txt, change_params.pl,
seeker_functions.pm, and RNA-Seeker.pl) are all together in the same directory, or else the installation
will fail.
The Bowtie bioinformatics tool is necessary for RNA-Seeker's intended use. This tool is not included with
the RNA-Seeker package, but it is free, open-source software that can be downloaded from the
following website.
http://sourceforge.net/projects/bowtie-bio/
3. Sample Run
We provide a small sample dataset and the results obtained using these data. These data files can be
found in the sample directory, with the expected results in the expected_sample directory. Note that
RNA-Seeker does not employ stochastic (random) processes, so the sample run should yield the exact
results that are saved in the expected_sample/results directory if the analysis is run as described below.
To run the Sample Run, navigate to the sample directory (rnaseeker-v1.0/sample). Then run the
command below:
../scripts/RNA-Seeker.pl -input sample_input.txt
Compare the files that are produced in the sample/results directory to those stored in the
expected_sample/results directory. They should be identical except for the comments that are written
at the top of the files in the expected_sample/results directory. These comments explain what can be found in
each of the output files.
This sample run utilizes the RNA-Seeker functionality to store input information in an input file. More
details on the different ways that RNA-Seeker can receive inputs are discussed below.
4. RNA-Seeker Procedures
a) Compile Reference Genes
This procedure is a precursor to the bowtie-build indexer, which can be used to align reads to the
custom-made reference. Here the program utilizes the user input stored in the Library file to make
reduced reference gene files that only include the specified genes and the immediately surrounding
non-coding DNA, including introns and flanking sequences. This strategy allows Bowtie to map both
processed and unprocessed reads without allowing for gaps during the mapping, and the resulting reads
are more accurate for the analysis without investing significantly more time for the genome-wide
mapping with gaps.
Reference fasta files are produced that include the sequences of the indicated genes and the
surrounding non-coding information. Changing the Flank Length parameter allows the user to indicate
just how much non-coding DNA around the genes should be included (directions for setting inputs like
this can be found in section 5 and in the Sample Input File discussion in section 6).
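The effect of the Flank Length can be illustrated with a minimal Python sketch. This is not the RNA-Seeker implementation (which is written in Perl); the function name extract_with_flanks and the toy sequence are hypothetical, and the sketch assumes 1-based, inclusive coordinates as used in the Library file. For a gene with several exons, start would be the first nucleotide of the first exon and end the last nucleotide of the last exon, so that introns are included.

```python
def extract_with_flanks(genome: str, start: int, end: int, flank: int) -> str:
    """Return genome[start..end] (1-based, inclusive) plus up to `flank`
    nucleotides on each side, clamped at the ends of the sequence."""
    if not (1 <= start <= end <= len(genome)):
        raise ValueError("gene coordinates fall outside the reference sequence")
    left = max(start - 1 - flank, 0)        # 0-based index of first base kept
    right = min(end + flank, len(genome))   # 0-based index one past last base kept
    return genome[left:right]


# Toy reference: the "gene" is the run of C's at positions 5 to 8.
genome = "AAAACCCCGGGGTTTT"
region = extract_with_flanks(genome, start=5, end=8, flank=2)
print(region)
```

With a flank of 2, the extracted region holds two nucleotides before the gene, the gene itself, and two nucleotides after it. A flank of 0 would return the gene alone.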
b) Bowtie-Build and Bowtie Map
RNA-Seeker automatically incorporates the necessary Bowtie procedures into a quick and easy workflow
that includes the mapping software. Bowtie-build allows the customized reference gene files to be used
for the Bowtie Map algorithm, which quickly and efficiently determines where in these reference genes
the reads map.
c) Classify Reads
This procedure allows for post-processing the reads mapped by the Bowtie algorithm. Once reads are
mapped to the custom-built reference gene files, this script will sort the reads according to their degree
of uniqueness, using the user-defined hierarchy of groups. Reads that map to one unique location are
separated, counted, and noted as uniquely mapping genes, and no further processing is done to these
reads. The major benefit of this procedure is how it processes reads that map to multiple locations in
the reference genes (multimaps). Depending on the methods, many analyses discard these
multimapping reads or use probabilistic algorithms to guess their gene of origin. This procedure
incorporates no guesswork, but instead derives as much information from these multimaps as possible
by analyzing the relationship of the genes to which the read maps. Note that the relationship of genes
must be included by the user, and we make no attempt to determine the relationship between genes
other than the annotations that are given.
For example, in our application of RNA-Seeker, we classified reads mapping to tRNA genes in the plant
species Brachypodium distachyon. Since many tRNA genes are duplicated many times in the genome,
sometimes perfectly and sometimes with small distinguishing factors, this mapping produced a large
number of multimaps. We manually determined which genes are perfect copies of each other (Families),
and noted the genes that produce the same tRNA isotype despite not being perfect copies. We made an
additional group that distinguishes genes of the same isotype that may have been transcribed in an
organelle. By providing the annotations showing which of each of these groups each gene belongs to,
the Classify Reads procedure separated the reads into useful categories which allowed the researcher to
quickly move on to making conclusions based on this evidence.
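The idea behind the classification can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual Classify Reads code: the function classify_read, the LEVELS list, and the "Unrelated" label are hypothetical names, and the annotations are copied from the sample library shown in section 6. The sketch reports the most specific group level at which all genes hit by a read carry the same value.

```python
# Group levels ordered from most specific to most broad (sample library headings).
LEVELS = ["Gene", "Family", "Isotype", "GlobIso"]

# Annotations for the four genes of the sample library.
LIBRARY = {
    "SerineM1":   {"Family": "SerineMA",   "Isotype": "SerineM",   "GlobIso": "Serine"},
    "SerineM2":   {"Family": "SerineMB",   "Isotype": "SerineM",   "GlobIso": "Serine"},
    "CysteineM1": {"Family": "CysteineMA", "Isotype": "CysteineM", "GlobIso": "Cysteine"},
    "SerineM3":   {"Family": "SerineMC",   "Isotype": "SerineM",   "GlobIso": "Serine"},
}


def classify_read(genes):
    """Return (level, value) for the most specific level shared by every gene
    the read maps to, or ("Unrelated", None) if no level is shared."""
    if not genes:
        raise ValueError("a mapped read must hit at least one gene")
    for level in LEVELS:
        values = {(gene if level == "Gene" else LIBRARY[gene][level]) for gene in genes}
        if len(values) == 1:
            return level, values.pop()
    return "Unrelated", None


print(classify_read(["SerineM1"]))
print(classify_read(["SerineM1", "SerineM2"]))
print(classify_read(["SerineM1", "CysteineM1"]))
```

A read hitting only SerineM1 is classified at the Gene level, a read hitting SerineM1 and SerineM2 at the Isotype level (their Families differ), and a read hitting a Serine and a Cysteine gene shares no group at all.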
5. Run RNA-Seeker
The most important part of using RNA-Seeker is the appropriate creation of the Library file. The Library
file will provide the system with the annotation information it needs to compile the reference gene files
and to perform the classification of multimap reads. Note that this software was deliberately created to
be flexible and unrestrictive to allow for a variety of research questions to be answered by the creative
generation and organization of the Library file. A detailed explanation of the format and requirements of
the Library file can be found in the next section: Required File Formats.
Once a Library file appropriate to your analysis is created, there are several ways to provide input
information to the RNA-Seeker program: in-program prompts, command line options, and input file
selection. These are described below, using the included sample dataset as an example.
a) In-Program Prompts
Unlike most bioinformatics tools, RNA-Seeker provides the user with the opportunity to designate input
after starting the program. This is useful because the user is taken through a step-by-step process asking
for one required input or file at a time. Inputs designated in another way are also checked for validity,
and the program will ask for an input to be restated if there is a problem. Try this out by following the
instructions below.
Navigate to the rnaseeker-v1.0/sample directory. Enter the following command:
../scripts/RNA-Seeker.pl
The program will ask for missing input (since you didn’t indicate any input, all inputs are currently
missing). Enter the following inputs when prompted:
Flank Length: 100
Library File: data/library.txt
RNA-seq Reads File: data/RNA-seq_reads.trim.fq
Results Directory: results
The following prompt asks for an intermediate results directory. If this directory is not specified, the
intermediate files will be saved to the results directory indicated above. While designating such a directory is
not strictly required, we recommend indicating a separate directory because this is where the bowtie
files will be saved, and segregating them will prevent cluttering the main results directory.
Would you like to indicate a Intermediate directory?
Enter y/n: y
Intermediate Directory: bowtie_files
Next the program asks if the entire analysis should be run. Here it is possible to indicate that only certain
procedures should be run. This can be useful if the user wants to run a similar analysis with slight
changes. For example, to run an analysis a second time using different options for the Bowtie Map
algorithm, the Compile Reference Gene Files and Bowtie-Build procedures do not need to be rerun (they
can be, but it will save some time not to). For now let’s run the entire analysis.
Do you want to run the entire analysis, including
Compile Reference Gene Files
Bowtie-Build
Bowtie Map
and Classify Reads?
Enter y/n: y
At this point, the program will quickly scan the designated library file for all of the chromosome codes
included in the file. It will then ask the user to designate a reference genome file that corresponds to
each chromosome code. Thus, the program will extract the DNA sequence for the gene from the
corresponding reference genome file while compiling the reference genes. Note that if the user does not
want to include all of the codes in the analysis, entering ‘ignore’ will result in the program omitting all
genes labeled with this code.
Reference file for code C: data/chl_ref_genome.fasta
Reference file for code M: data/mit_ref_genome.fasta
At this point, the program has all of the inputs it needs, and will run the analysis (or at least the
procedures that were indicated in the above steps). Note that at the conclusion of the analysis, the
location of the input file generated is shown, as below.
Input file saved as:
/home/mspencer/rnaseeker-v1.0/input_files/input_file_3.txt
This file stores all of the inputs used for this execution of the program. Thus, if you would like to run the
same analysis again, you can indicate this input file (see the Input File Selection option below).
b) Command Line Options
Although the interactive in-program prompts can be convenient, they are slow, especially if you would
like to run more than one analysis. This command line option method of indicating input is typical of
Unix programs, so it may be more familiar to users. It also allows for a very concise way to communicate
the inputs used for an analysis. Try using the command below to run the same analysis performed
above.
../scripts/RNA-Seeker.pl -FL 100 -LF data/library.txt -RF data/RNA-seq_reads.trim.fq -RD results -ID bowtie_files -RG M::data/mit_ref_genome.fasta -RG C::data/chl_ref_genome.fasta -CRG -BTB -BTM -CLR
Notice how simply copying and pasting this command into the command line will result in the analysis
being performed, which is a much quicker alternative to individually typing the inputs. The two-letter
tags stand for input value types (FL = Flank Length, etc.) and should be followed immediately by the value
for this input (e.g. “-FL 100” indicates using a flank length of 100). The three-letter tags represent the
procedures (CRG = Compile Reference Gene files, etc.) and do not need any additional input other than
their presence (e.g. including “-CLR” is sufficient to tell the program to include the Classify Reads
procedure). More discussion of the tags and their meanings can be found below in the Required File
Formats section, where the sample_input.txt file is described.
When indicating a reference genome file to use, it is important to indicate the chromosome code that
represents it. In this example, the “M” code represents genes found in the mitochondrial genome, which
is stored in the data/mit_ref_genome.fasta file. Thus, to indicate this in the command line option, we
use the tag “RG” to indicate that we are specifying a reference genome file, followed by the
chromosome code “M” and the file name, separated by two colons “::”. This format (-RG <Code>::<File>)
is interpreted by the program, which will use the indicated file to be the reference genome for genes
with that code.
c) Input File Selection
The third method of selecting input is to indicate an input file where all inputs are saved. This is a very
quick and easy way to run an analysis, but requires that the input file is already written. Input files are
automatically generated in the rnaseeker-v1.0/input_files/ directory every time the RNA-Seeker analysis
is executed (the most recent input file will be the one with the highest number attached to the file
name). Thus, in order to run an analysis twice, one need only select the input file that was generated
from the first analysis and all of the same inputs will be used. Try running the sample analysis using the
prepared sample input file as below.
../scripts/RNA-Seeker.pl -input sample_input.txt
In addition to using automatically generated input files in this fashion, it is encouraged to modify these
input files to slightly change the inputs for the next analysis, or even write an input file from scratch.
However, the recommended way to introduce slight changes to a previous analysis is described below.
d) Combination
The recommended strategy for designating inputs for RNA-Seeker analyses is to use a combination of
the above methods. It is important to note that the program will look for inputs in this order:
1. Input File Selection
2. Command Line Options
3. In-Program Prompts
First, if an input file is designated, the program will read the input file and load all inputs stored there.
Then the command line options are read, and any inputs designated on the command line replace those
found in the input file. Finally, any inputs that are still missing or entered incorrectly will generate
prompts to be indicated before the analysis is attempted.
Due to this prioritization structure, the recommended way to introduce slight changes to a previously
run analysis is to use an input file and indicate any changes using the command line options. For
example, use the command below to run a slightly modified version of the sample analysis using a
combination of input types.
../scripts/RNA-Seeker.pl -input sample_input.txt -FL 50 -LF data/library_file.txt
When prompted to revise the name of the library file:
Library File: data/library.txt
In this case, most of the inputs are gathered from the sample_input.txt input file. However, the
command line -FL option indicating to use a Flank Length of 50 overwrites the input file’s Flank Length
of 100. Similarly, the command line -LF option to specify the library file overwrites the library file
location indicated in the input file. However, there is not actually any file with that name, so when the
program fails to find it, it uses the in-program prompts to request the real library file name. Once this is
obtained, the analysis will run.
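The prioritization described above can be summarized with a short Python sketch. It is an illustration only: the function resolve_inputs, the REQUIRED list, and the is_valid callback are hypothetical names, and treating FL, LF, RF, and RD as the required tags is an assumption based on the essential inputs listed in section 6.

```python
# Tags assumed to be required before an analysis can start (an assumption).
REQUIRED = ["FL", "LF", "RF", "RD"]


def resolve_inputs(from_file, from_cli, is_valid=lambda tag, value: True):
    """Merge inputs so that command line values replace input file values,
    then list the tags that would still need an in-program prompt."""
    merged = {**from_file, **from_cli}   # later (command line) entries win
    to_prompt = [
        tag for tag in REQUIRED
        if tag not in merged or not is_valid(tag, merged[tag])
    ]
    return merged, to_prompt


from_file = {"FL": "100", "LF": "data/library.txt",
             "RF": "data/RNA-seq_reads.trim.fq", "RD": "results/"}
from_cli = {"FL": "50", "LF": "data/library_file.txt"}

# Stand-in validity check: pretend the misspelled library file does not exist.
merged, to_prompt = resolve_inputs(
    from_file, from_cli,
    is_valid=lambda tag, value: value != "data/library_file.txt",
)
print(merged["FL"], to_prompt)
```

In this example the flank length from the command line (50) replaces the value from the input file (100), and the library file is the only input left for the prompts, mirroring the combined run described above.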
6. Required File Formats
Library File: How to create a library file for your purposes.
The sample library file can be found at rnaseeker-v1.0/sample/data/library.txt
The Library file is the primary input file that RNA-Seeker utilizes. There are some formatting
requirements, which are explained in detail below, as well as in the file sample/data/library.txt and
in the sample Excel document Example_Library.xls. Note that the organization of groups
is largely up to the user, and the usefulness of this package depends on the user sculpting the
organization of groups to meet their individual needs.
The Library file is easily constructed out of an Excel Spreadsheet. There is an example of the basic
structure of the library file excel sheet included in this package. Fill out the spreadsheet using the
guidelines that can be found written in comments in the spreadsheet (more guidelines can be found
in the sample/data/library.txt text file). Once the appropriate information is listed in this excel
sheet, follow the below instructions:
Select File -> Save As...
Name the file
Indicate for "Save as type:" the selection "Text (Tab delimited) (*.txt)"
Save
The resulting file will be in the appropriate format for the library file. No parameters in the
params.txt will need to be changed (explanation of params.txt is below).
Formatting requirements for the Library File:
The first non-comment line should be the Headings line. Comment lines begin with a ‘#’ symbol.
The headings line should include the user-specified names of the groups to be used in the analysis.
Headings for the “!Reverse”, “!Chrom”, and “!Exons” columns should remain as written, including
remaining in the same order. The exclamation points indicate that these are specific, non-user-defined
fields. Headings for the Uniq Name and Group columns are user-defined. These headings will be
used as the classifications for the genes.
!Revers: This column should contain either "Yes" or "No", indicating whether the gene of interest is
on the forward (No) or reverse (Yes) strand, as the reference genome is written. RNA-Seeker will
search for reads mapping to the area in the indicated direction, and ignore reads that match the
opposite strand.
!Chrom: This column should contain a chromosomal ID code (which may be any character, or even
multiple characters). All genes with the same ID should have sequences found in the same gene
reference file, though this file need not be named according to the chromosomal ID listed here. As
described above, the program will require that you indicate which reference genome file
corresponds to each chromosome code.
!Exons: This column should contain the first and last nucleotide that is included in each exon. They
should be listed in ascending order, even for genes on the reverse strand. The two numbers should
be separated by a comma, and individual exons should be separated by a tab (if you prefer different
characters separating these fields, see the params.txt discussion below). Most of the below genes
are examples of genes with a single exon, though CysteineM1 is an example of a gene with multiple
exons.
Group Names: The column headings that do not begin with an exclamation point are user-defined
columns, and these names can be changed at will (Certain special characters may cause problems.
The underscore is recommended instead of space, e.g. Alanine_1 instead of Alanine 1.) Also note
that the number of these user-defined groups can be changed (increased or decreased) to the user's
specifications.
An example header line and four example genes are shown below. The columns, from left to right, are:
Uniq Name (here "Gene"): Unique identifier for the gene (title for this category can be changed)
Group1 (here "Family"): User-defined group to which the gene belongs (most specific)
Group2 (here "Isotype"): User-defined group
Group3 (here "GlobIso"): User-defined group (most broad)
!Revers: Yes if gene is on reverse strand, otherwise No
!Chrom: Chromosome # where gene can be found
!Exons: Exon boundary information; use comma (exon delim) between start and stop nucleotide, and use tab (file delimiter) between exons
Gene        Family      Isotype    GlobIso   !Revers  !Chrom  !Exons
SerineM1    SerineMA    SerineM    Serine    No       M       262563,262649
SerineM2    SerineMB    SerineM    Serine    No       M       337683,337770
CysteineM1  CysteineMA  CysteineM  Cysteine  No       M       338021,338058  338109,338140
SerineM3    SerineMC    SerineM    Serine    No       M       358823,358909
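To make the Library file layout concrete, the following Python sketch reads a tab-delimited library in the format described above. It is not part of RNA-Seeker; the function parse_library and the record keys "reverse", "chrom", and "exons" are hypothetical. The sketch assumes that the three "!" columns follow the user-defined columns in the order reverse, chromosome, exons, and that every field after the chromosome column is an exon.

```python
def parse_library(text, file_delim="\t", exon_delim=","):
    """Parse library text into a list of gene records (dicts)."""
    headings = None
    n_user = 0
    genes = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue                      # skip blank and comment lines
        fields = line.split(file_delim)
        if headings is None:
            headings = fields             # first non-comment line is the headings line
            n_user = sum(1 for h in headings if not h.startswith("!"))
            continue
        record = dict(zip(headings[:n_user], fields[:n_user]))
        record["reverse"] = fields[n_user] == "Yes"
        record["chrom"] = fields[n_user + 1]
        # Each remaining non-empty field is one exon: "start,end".
        record["exons"] = [
            tuple(int(x) for x in field.split(exon_delim))
            for field in fields[n_user + 2:] if field
        ]
        genes.append(record)
    return genes


sample_text = (
    "# Example library (comment line)\n"
    "Gene\tFamily\tIsotype\tGlobIso\t!Revers\t!Chrom\t!Exons\n"
    "SerineM1\tSerineMA\tSerineM\tSerine\tNo\tM\t262563,262649\n"
    "CysteineM1\tCysteineMA\tCysteineM\tCysteine\tNo\tM\t338021,338058\t338109,338140\n"
)
genes = parse_library(sample_text)
print(genes[1]["Gene"], genes[1]["exons"])
```

Empty trailing fields are ignored because a spreadsheet export may pad shorter rows with extra tabs.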
Parameter File: How to change the parameters for RNA-Seeker.
The parameter files can be found at rnaseeker-v1.0/scripts/params.txt and rnaseeker-v1.0/scripts/default_params.txt. These files should not be moved from this directory. Do not touch
the default_params.txt file. It is the same as the params.txt file before any changes are made to it,
and it exists as a backup in case the params.txt file somehow gets corrupted.
The parameter file stores adjustable parameters for the program. This file is kept separate from the
input files that are generated on each execution because these parameters are ones that we
anticipate will not be changed very often (probably, in fact, zero or one time(s)). If you don’t already
have something in mind that you want to change and RNA-Seeker hasn’t specifically recommended
a parameter change, you probably don’t need to read this section.
The parameters present in this file may (and should) be changed to the user specifications. In
particular, the min and max gene length parameters are not required for the
functionality of the scripts, but were implemented as an error-checking mechanism to
help verify that the Library file was written correctly, according to the expectations of the user.
These parameters also allow the user to vary the delimiters of the Library file, if the user would
prefer this file to be constructed in a slightly different format. We advise leaving the delimiter
parameters as the default values, as this is appropriate for library files that are formatted in the
recommended manner described above.
The recommended way to change the values of these parameters is to navigate to the rnaseeker-v1.0/scripts directory and use the below command:
./change_params.pl
This script provides a friendly interface to indicate parameters that you would like to change, and
provide the new value for the changed parameters. The script will check to make sure it is a valid
value for the parameter before updating it. Furthermore, if you would like to reset to the parameter
values that were assigned when it was installed, that is an option in this script. If you see any
errors while executing RNA-Seeker that mention anything about parameters, this can usually be
solved by navigating to the scripts directory and running the above change_params.pl command.
This script recognizes when the params.txt file is corrupted and automatically reverts to the default
parameters. Below is the entirety of the params.txt file. Each parameter defined here includes a
description of its effect.
params.txt
This is the parameter file. Below, parameters are explained in detail, and parameter
values can be permanently altered.
>FD : File Delimiter
The lines in the gene library file should be formatted as such:
UniqueName,Group1,(...),GroupN,Reverse,Chromosome,ExonBounds where the columns are
separated by a delimiter character. Above, the delimiter is a comma. Set the
parameter value to a different character to indicate a different delimiter. Note
that '\t' indicates tab and '\s' indicates space.
The current value of File Delimiter is shown below:
#FD: \t
>ED : Exon Delimiter
The last column(s) of the gene library file should have exon boundary information.
This information should include the start nucleotide of the exon and the end
nucleotide of the exon, separated by the exon delimiter. For example, if the
delimiter is a colon, the exon information should look like this:
12345678:12345876. Note that '\t' indicates tab and '\s' indicates space. Also note
that the exon delimiter cannot be identical to the file delimiter.
The current value of Exon Delimiter is shown below:
#ED: ,
>Min : Min span
The min span value exists to safeguard against errors in the exon boundaries. Set
the min span to be slightly smaller than the shortest gene (the sum of the exons),
and the program will notify you if any genes are found to be less than this minimum
length. It can also be temporarily reset once during the program run.
The current value of Min span is shown below:
#Min: 60
>Max : Max span
The max span value exists to safeguard against errors in the exon boundaries. Set
the max span to be slightly larger than the largest gene (the sum of the exons),
and the program will notify you if any genes are found to be greater than this
maximum length. It can be temporarily reset once during the program run.
The current value of Max span is shown below:
#Max: 90
>BC : Bowtie Command
The RNA-Seeker program will run the relevant bowtie commands for the user. Since
bowtie is not included in the installation of RNA-Seeker, this program must be able
to locate bowtie. Often tools can be accessed from any directory, but sometimes
this is not the case. Try running the command "bowtie --version". If it tells you
the bowtie version, then leave this parameter as "bowtie". Otherwise, identify the
location of installation of the bowtie tool, and change this parameter to the
command that correctly accesses bowtie. (e.g. if using the command "/usr/lib/bowtie
--version" succeeds, set this parameter to "/usr/lib/bowtie")
The current value of the Bowtie Command is shown below:
#BC: bowtie
>BO : Bowtie Options
Bowtie has many options that can be used. The most relevant ones to the RNA-Seeker
application are the -v and -a options. Changing -v will specify the amount of
mismatches allowed in the alignment. Thus, -v0 means that reads have to perfectly
match the reference. Allowing for one or more mismatches is encouraged, as the
classification algorithm does not suffer from this specification. Using -a tells
the bowtie algorithm to report all valid alignments for each read, resulting in
reads mapping to multiple locations. Dealing with such multimapping reads is the
primary benefit of RNA-Seeker, so including this option is highly recommended. Feel
free to use other options specific to your needs.
The current value of Bowtie Options is shown below:
#BO: -v0 -a
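As an illustration of how the "#" value lines and the span limits fit together, here is a minimal Python sketch. It is not the RNA-Seeker code: parse_params and check_span are hypothetical names, the gene length is assumed to be the sum of the exon lengths with both boundary nucleotides counted, and the example text reproduces only a few lines of params.txt.

```python
# Escape codes documented in params.txt for the delimiter parameters.
ESCAPES = {r"\t": "\t", r"\s": " "}


def parse_params(text):
    """Collect '#KEY: value' lines; '>' lines and descriptions are ignored."""
    params = {}
    for line in text.splitlines():
        if line.startswith("#") and ":" in line:
            key, _, value = line[1:].partition(":")
            value = value.strip()
            params[key.strip()] = ESCAPES.get(value, value)
    return params


def check_span(exons, min_span, max_span):
    """Return (within_limits, length), where length is the sum of exon lengths."""
    length = sum(end - start + 1 for start, end in exons)
    return min_span <= length <= max_span, length


params_text = "\n".join([
    ">FD : File Delimiter",
    r"#FD: \t",
    ">ED : Exon Delimiter",
    "#ED: ,",
    "#Min: 60",
    "#Max: 90",
    "#BO: -v0 -a",
])
params = parse_params(params_text)

# CysteineM1 from the sample library: two exons of 38 and 32 nucleotides.
ok, length = check_span([(338021, 338058), (338109, 338140)],
                        int(params["Min"]), int(params["Max"]))
print(ok, length)
```

With the default limits of 60 and 90, the 70-nucleotide CysteineM1 gene passes the check. A gene outside the limits would be the kind of case the program reports to the user.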
Sample Input File: What components are saved in the input file.
The sample input file can be found at rnaseeker-v1.0/sample/sample_input.txt. This file contains all
of the necessary inputs to execute the sample run, as described in Section 3: Sample Run. Below we
list each of the required and optional inputs and describe how to appropriately list the values for
these inputs.
Note that every time RNA-Seeker is run, an input file will be saved to this directory
(scripts/input_files). If a similar but slightly different analysis is to be run, you may edit the resulting
input file with the desired changes. See section 5: Run RNA-Seeker for a more detailed discussion on how to use input files to run the program.
It is recommended to list inputs in an input file as absolute paths (for example,
"/home/myusername/myproject/library.txt" as opposed to “myproject/library.txt”), as this simplifies
things. The absolute path of the current directory can be found with the Unix command "pwd". Unfortunately, the
sample input file is written such that it will only work if the sample run is executed from the sample
directory, as indicated in Section 3: Sample Run. Fortunately, all input files that are saved after other
RNA-Seeker runs will not have this problem, and can be used from any directory.
The four inputs below are the essential inputs. To indicate the value of an input, include a
line beginning with a ">" symbol that names the input, followed by a line beginning with the "#" symbol
that shows the desired value of the input. Here are some examples, defining the Library File, the Reads File, and
the Results Directory (the fourth essential input, the Flank Length, is described after them).
>LF: Library File
#LF: data/library.txt
>RF: Reads File
#RF: data/RNA-seq_reads.trim.fq
>RD: Results Directory
#RD: results/
The flank length dictates the number of nucleotides to either side of the gene that will be included
in the pseudo reference file, so that reads that partially include these nucleotides will be mapped.
We chose a flank length of 100 because that was sufficiently large to capture any expected reads
from unprocessed RNA strands, but small enough to reasonably reduce the processing time of the
Bowtie algorithm. However, the flank length is flexible depending on the needs of the user. It can be
set to 0 to include only reads that are contained completely within the genes. Similarly, using a small
flank length will limit the distance away from the gene that the reads are allowed to map to. It is
recommended to keep the flank length at a reasonably high value (slightly higher than the maximum
read length) to include all reads that may include fragments of the genes of interest. It is important
to note that any reads that map completely within the flanking sequence will be removed by the
system automatically, so that is not a concern of choosing a large flank length. Below the flank
length is defined.
>FL: Flank Length
#FL: 100
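The removal of reads that fall entirely within a flank can be pictured with a small Python sketch. This is a guess at the logic, not the RNA-Seeker implementation: overlaps_gene is a hypothetical name, the alignment position is assumed to be a 0-based offset into the flank + gene + flank reference sequence, and the full flank is assumed to be present on the left side (that is, not truncated at the start of a chromosome).

```python
def overlaps_gene(pos, read_len, flank, gene_span):
    """True if a read aligned at 0-based offset `pos` covers at least one
    nucleotide of the gene region inside a flank + gene + flank reference."""
    gene_start = flank               # first gene base (0-based)
    gene_end = flank + gene_span     # one past the last gene base
    return pos < gene_end and pos + read_len > gene_start


flank, gene_span = 100, 87           # SerineM1 spans 87 nucleotides
print(overlaps_gene(0, 50, flank, gene_span))     # entirely in the left flank
print(overlaps_gene(80, 50, flank, gene_span))    # runs from the flank into the gene
print(overlaps_gene(190, 30, flank, gene_span))   # entirely in the right flank
```

Only the second read would be kept under these assumptions, since it is the only one that includes part of the gene.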
Note that indicating the below directory is not strictly essential. The intermediate results will save by
default to the Results directory, unless the Intermediate Results Directory is specified. However, it is
recommended to save intermediate results to a separate directory to avoid cluttering the primary
results directory.
>ID: Intermediate Results Directory
#ID: bowtie_files/
Below are the specifications for the reference genome files. They are formatted slightly differently
because it is important to indicate which file corresponds to each chromosome code in the library
file. Thus, these entries must include the two characters "::" in between the chromosome code and
the corresponding file, as below. Note that in this case the same tag "RG" can be used multiple
times.
>Reference Genome File
#RG: C::data/chl_ref_genome.fasta
>Reference Genome File
#RG: M::data/mit_ref_genome.fasta
If you want to run only part of the RNA-Seeker analysis, the options below can be
specified. For example, if you are executing a follow-up run that uses a slightly different set of
parameters during the Bowtie mapping, then you would not have to perform the reference genome
compilation or the Bowtie-build procedures, since those would remain unchanged. It is advised that
the final results from the original analysis be copied into a different folder to avoid them being
overwritten. Below we use the value "1" to indicate that a process will be included in the analysis. If
a process should not be included in the analysis, set this to "0" or simply delete the lines for that
procedure.
>CRG: Compile Reference Gene Files
#CRG: 1
>BTB: Bowtie-Build
#BTB: 1
>BTM: Bowtie Map
#BTM: 1
>CLR: Classify Reads
#CLR: 1
Below, we show an example input file. This file is sufficient to indicate the sample inputs.
sample_input.txt
>LF: Library File
#LF: data/library.txt
>RF: Reads File
#RF: data/RNA-seq_reads.trim.fq
>RD: Results Directory
#RD: results/
>FL: Flank Length
#FL: 100
>ID: Intermediate Results Directory
#ID: bowtie_files/
>Reference Genome File
#RG: C::data/chl_ref_genome.fasta
>Reference Genome File
#RG: M::data/mit_ref_genome.fasta
>CRG: Compile Reference Gene Files
#CRG: 1
>BTB: Bowtie-Build
#BTB: 1
>BTM: Bowtie Map
#BTM: 1
>CLR: Classify Reads
#CLR: 1
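For readers who want to inspect or generate input files programmatically, the format above can be read with a few lines of Python. This is a minimal sketch under our own assumptions (the function name is hypothetical, lines starting with ">" are treated as descriptions, and a missing procedure tag is treated as "0"); it is not the parser used by RNA-Seeker.

```python
def parse_input_file(text):
    """Parse '#TAG: value' lines; every other line is ignored."""
    settings = {}
    genomes = {}  # chromosome code -> reference genome file
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("#"):
            continue
        tag, _, value = line[1:].partition(":")
        tag, value = tag.strip(), value.strip()
        if tag == "RG":
            # RG may repeat; each value is "<chromosome code>::<file>"
            code, sep, path = value.partition("::")
            if not sep:
                raise ValueError(f"RG entry lacks '::': {value!r}")
            genomes[code.strip()] = path.strip()
        else:
            settings[tag] = value
    return settings, genomes


sample = """\
>FL: Flank Length
#FL: 100
>Reference Genome File
#RG: C::data/chl_ref_genome.fasta
>Reference Genome File
#RG: M::data/mit_ref_genome.fasta
>BTB: Bowtie-Build
#BTB: 0
"""
settings, genomes = parse_input_file(sample)
# A procedure runs only if its tag is present and set to "1".
steps = {tag: settings.get(tag, "0") == "1" for tag in ("CRG", "BTB", "BTM", "CLR")}
print(genomes)
print(steps)
```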
7. Output File Formats
Reference Gene Files
The reference gene files generated by the Compile Reference Gene Files procedure are saved in the
indicated results directory. The sample run will generate two reference gene files:
results/RG_C.fasta (Reference Genome for Chromosome “C”) and results/RG_M.fasta (Reference
Genome for Chromosome “M”).
Each of these files contains the nucleotide sequences of the genes of interest labeled in the library
file. As demonstrated below, each gene has an indicated name, followed by a nucleotide sequence.
The sequence consists of the flanking sequence preceding the gene, the coding sequence itself, and
the flanking sequence following the gene. In this file, the flanking sequences to either side are 100
nucleotides long, as indicated by using the Flank Length input of 100.
In the case where a gene was indicated to have two or more exons, multiple versions of the gene
are written which show the sequence including the intron as well as the sequence with the intron
excised, designated by a tag at the end. For example, the gene label MethionineM1-eie indicates
that the following sequence contains two exons split by an intron, whereas MethionineM1-ee
indicates that the two exons are adjacent in the sequence.
Here is a small sample of the RG_M.fasta file for the sample run. Notice that the intron is
included in CysteineM1-eie but not in CysteineM1-ee.
>SerineM2
AGAGTCTTTCTCGCTCAGTAAAGAAGAGTACGCGCGCTACGGCTTACGCAGTGGATCTTCGGGCAACCAACCCGGCACATCCA
ATTCCGATCAACAACTTGGAGGTATGGCTGAGTGGCTTAAGGCATTGGTTTGCTAAATCGACATACAAGAAGATTGTATCATG
GGTTCGAATCCCATTTCCTCCGGCACGGAAGTAGAACGGGCGGGCGAAATTACGTGAGAGAAAGAACCTCAGATTGATGGAGT
CCGCCGTCGGACAGAATAGCACTACTTAGTGACTAGGAG
>CysteineM1-eie
CCCGAACGAGGGCCGTCCAGTCCCTGGCCGGCTCTCGGTTCTTGAGCAAACTCCTCCACTGGGATGCTCATAGATGAAGAAAA
GAGACTTTAGGCAAGTGGTTCTGGTAGCTCAGCTGGTTAGAGCAAAGGACTGCAAATCCTTTTTTGCTTGTTTCAGTGGGAAG
AGCAAGGGGCATTGCCCTTGAAATCCTTCAGTGGTTCGATTCCACCTCAGAACGAAAGAAAATGCAAGTGAAAGAGAATTGTA
GGACAAGAGCTGATGGATGATAGAACAAAAGAGATGGATAGAGGCGAGAGAGATTGAGTGGAGGGTTCGAA
>CysteineM1-ee
CCCGAACGAGGGCCGTCCAGTCCCTGGCCGGCTCTCGGTTCTTGAGCAAACTCCTCCACTGGGATGCTCATAGATGAAGAAAA
GAGACTTTAGGCAAGTGGTTCTGGTAGCTCAGCTGGTTAGAGCAAAGGACTGCAAATCCTTCAGTGGTTCGATTCCACCTCAG
AACGAAAGAAAATGCAAGTGAAAGAGAATTGTAGGACAAGAGCTGATGGATGATAGAACAAAAGAGATGGATAGAGGCGAGAG
AGATTGAGTGGAGGGTTCGAA
>SerineM3
AGAACCATGGAACCCCCGAGCGGTTGTGGTTGTACCTGTACTTGCAGGGATACGAAAACTCGCTATTCACTCAGTTTCTGGTC
AATAATAAGATTATGTAGGAGAGATGGCCGAGCGGTTCAAGGCGTAGCATTGGAACTGCTATGTAGGCTTTTGTTTACCGAGG
GTTCGAATCCCTCTCTTTCCGTTTCTGTTAATTCACCACCGACCACAATGTATCAAATCAAATAGATTTCAGCAATAAGACTT
TTATTTGATAGAAATTCATTATTAAAAATTAATTACTT
Classification Files
The Classify Reads procedure produces several Classification Files, which are named according to the
header of the Library file. In our sample run, the group names used were Gene, Family, Isotype and
GlobIso (Global Isotype), so the files produced are Gene_reads.txt, Family_reads.txt,
Isotype_reads.txt, and GlobIso_reads.txt. Additionally, regardless of the group names specified in
the Library file, two extra categories are produced, saved in the files Splice_reads.txt and
Unclassified.txt.
Splice_reads.txt will contain any reads that mapped to at least one splice variant of a gene, but not
all splice variants. These reads are still unique to that gene, but since they are also unique to one
splice variant we note this observation by segregating these reads to a separate file. In the sample
run, there are no reads that end up mapping to just one splice variant of the genes with introns, so
this file is empty.
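The Splice criterion can be stated compactly: every match of the read belongs to the same gene, and the matches cover some but not all of that gene's splice variants. The following sketch expresses that test; the helper names are hypothetical and the "-eie" and "-ee" suffixes are assumed to be the only variant tags.

```python
def base_gene(label):
    """Strip an assumed '-eie' or '-ee' variant tag from a reference label."""
    for tag in ("-eie", "-ee"):
        if label.endswith(tag):
            return label[: -len(tag)]
    return label


def is_splice_specific(hit_labels, all_labels):
    """True if the hits cover some, but not all, variants of a single gene."""
    hits = set(hit_labels)
    genes = {base_gene(label) for label in hits}
    if len(genes) != 1:
        return False
    gene = next(iter(genes))
    variants = {label for label in all_labels if base_gene(label) == gene}
    return len(variants) > 1 and hits < variants  # proper subset


labels = ["CysteineM1-eie", "CysteineM1-ee", "CysteineM2"]
print(is_splice_specific(["CysteineM1-eie"], labels))
print(is_splice_specific(["CysteineM1-eie", "CysteineM1-ee"], labels))
```

In the second call the read matches both variants, so it would be treated as an ordinary gene-specific read rather than a splice-specific one.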
Gene_reads.txt will contain the original read lines which were found to be unique to a single
specified gene. In other words, these reads mapped to only a single location in any of the genomes
supplied. The reads are grouped by the gene to which they mapped, with the genes arranged in
alphabetical order.
Note that since 100 flanking nucleotides were included before and after the gene sequence, reads
that begin at the first nucleotide of the gene were mapped to start at position 100 of the
reference gene (positions are counted from 0, as in Bowtie's default output). Reads that have 100
in the position column therefore mapped to the beginning of the gene, and reads that have a
position less than 100 include some of the flanking sequence before the gene. If a flank length
other than 100 is used, the position of the start of the gene changes accordingly.
Below is a sample from the Gene_reads.txt file generated during the sample run. Note that all of the
information about the read that was originally included after the Bowtie mapping is still present in
these files, so there was no loss of information during the Classify Reads procedure.
>AsparagineM1 specific reads:
HWI-ST538:280:C1FE9ACXX:5:1314:19180:6288 1:N:0:AGTCAA + AsparagineM1 135 AACTGACTGGTCGTAGGTTCGAATCCTACTTGGGGAGATTTGATTCATTCAGAATTCGAATTGATAGTTATAGCTTTTCTGACTAGC @@CFFFFFHHFHHFEHIEHHIJFGHGGIJJJJIJJGIGHIIIJJIIIJJHIJIHHIJJEGHHIHHHGGHEDDFFFDFEEEEEEEDDC 0
HWI-ST538:280:C1FE9ACXX:5:1110:18949:14936 1:N:0:AGTCAA + AsparagineM1 89 AAGTCCGTTCTTCCTCAGTAGCTCAGTGGTAGAGCGGTCGGCTGTTAACTGACTGGTCGTAGGTTCGAAT @CCDFFFFGHHHDCGIJJJIIIIJGGFHIHIICGGHIIJIJGIIIIIJJJJGHIIIGHHEHFFFFEDDDD 0
HWI-ST538:280:C1FE9ACXX:5:2104:11621:55887 1:N:0:AGTCAA + AsparagineM1 81 GCTTCCTTAAGTCCGTTCTTCCTCAGTAGCTCAGTGGTAGAGCGGTCGGCT CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJHIJJJJJJIJJJJJ 0
>AsparticacidC1 specific reads:
HWI-ST538:280:C1FE9ACXX:5:1116:8902:29828 1:N:0:AGTCAA + AsparticacidC1 100 GGGATTGTAGTTCAATTGGTCAGAGCACCGCCCTGT @@@DDDDD<FFBFFGII<A<EHEH?GFIEFF:FF?D 0
HWI-ST538:280:C1FE9ACXX:5:2111:16256:97578 1:N:0:AGTCAA + AsparticacidC1 100 GGGATTGTAGTTCAATTGGTCAGAGCACC @@CFFFFDFHFDDGHGGEHFFFHJFCGHI 0
HWI-ST538:280:C1FE9ACXX:5:2113:4857:57765 1:N:0:AGTCAA + AsparticacidC1 100 GGGATTGTAGTTCAATTGGTCAGAGCACCGCCCTGTCAA <@@DDDDDAF?<?A<EHGGG@>GHEFHIEHGBGIGGGEG 0
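Because each read line keeps the Bowtie mapping fields, it can be split into named fields and its position expressed relative to the gene start. The sketch below assumes the lines keep Bowtie's tab-delimited default layout (read name, strand, reference name, 0-based offset, sequence, qualities, count of other instances); the function and field names are ours.

```python
from collections import namedtuple

Alignment = namedtuple("Alignment", "read strand gene offset seq qual other")


def parse_alignment(line):
    """Split one tab-delimited read line into named fields."""
    fields = line.rstrip("\n").split("\t")
    read, strand, gene, offset, seq, qual, other = fields[:7]
    return Alignment(read, strand, gene, int(offset), seq, qual, int(other))


def gene_relative_start(alignment, flank_len):
    """0 means the read starts on the first gene nucleotide;
    negative values mean it starts inside the upstream flank."""
    return alignment.offset - flank_len


example = "\t".join([
    "HWI-ST538:280:C1FE9ACXX:5:1110:18949:14936 1:N:0:AGTCAA",
    "+",
    "AsparagineM1",
    "89",
    "AAGTCCGTTCTTCCTCAGTAGCTCAGTGGTAGAGCGGTCGGCTGTTAACTGACTGGTCGTAGGTTCGAAT",
    "@CCDFFFFGHHHDCGIJJJIIIIJGGFHIHIICGGHIIJIJGIIIIIJJJJGHIIIGHHEHFFFFEDDDD",
    "0",
])
aln = parse_alignment(example)
print(aln.gene, gene_relative_start(aln, 100))  # AsparagineM1 -11
```

Here the read begins 11 nucleotides upstream of the gene, inside the flanking sequence.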
Family_reads.txt will contain the reads that mapped only to genes that were labeled with the same
Family. Keep in mind that the group name “Family” itself is not important; in the Library file header
the name of this category can be customized to be relevant to the user’s analysis. As with all of the
read output files, all the reads that were sorted to a family are listed together. In this file and others that
include multimaps (Isotype_reads.txt and GlobIso_reads.txt), all of the matches that were found for
a single read are printed on adjacent lines, with a blank line separating different reads.
Below is a sample from Family_reads.txt, as it was generated from the sample run.
>AsparagineCA specific reads:
HWI-ST538:280:C1FE9ACXX:5:2216:7612:45579 1:N:0:AGTCAA + AsparagineC1 98 GTTCCTCAGTAGCTCAGTGGTAGAGCGGTCGGCTGT C@CFFFFFHHHHHJJJJGHIFIIJJJJJIJJJJJJJ 1
HWI-ST538:280:C1FE9ACXX:5:2216:7612:45579 1:N:0:AGTCAA + AsparagineC2 98 GTTCCTCAGTAGCTCAGTGGTAGAGCGGTCGGCTGT C@CFFFFFHHHHHJJJJGHIFIIJJJJJIJJJJJJJ 1

HWI-ST538:280:C1FE9ACXX:5:1306:4719:10665 1:N:0:AGTCAA + AsparagineC1 138 TGACTGGTCGTAGGTTCGAATCCTACTTGGGGAGATTTGATTCATTCTTTAATGT CCCFFFFFHHHHHJJJJJJIJJJJJJJJJJJJDGGIIJJIJJJIJJJJJJJJJJJ 1
HWI-ST538:280:C1FE9ACXX:5:1306:4719:10665 1:N:0:AGTCAA + AsparagineC2 138 TGACTGGTCGTAGGTTCGAATCCTACTTGGGGAGATTTGATTCATTCTTTAATGT CCCFFFFFHHHHHJJJJJJIJJJJJJJJJJJJDGGIIJJIJJJIJJJJJJJJJJJ 1
>HistidineMA specific reads:
HWI-ST538:280:C1FE9ACXX:5:2313:4726:24348 1:N:0:AGTCAA + HistidineM1 44 AATAAATAGAATAATGATGAATGGAAAAAAGAGAAAATCCTTTAGCTGGATAAGGGGCGGATGTAGCCAAGTGGATCAAGGCAGTGGATTGTGAATCCAC @@CFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJIIJJJJJJJJJIHJJIIJIJJFDDDDDEEDDDDD@CDDCDDDDD1<BACDDDCCACCC>:ACC 1
HWI-ST538:280:C1FE9ACXX:5:2313:4726:24348 1:N:0:AGTCAA + HistidineM2 44 AATAAATAGAATAATGATGAATGGAAAAAAGAGAAAATCCTTTAGCTGGATAAGGGGCGGATGTAGCCAAGTGGATCAAGGCAGTGGATTGTGAATCCAC @@CFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJIIJJJJJJJJJIHJJIIJIJJFDDDDDEEDDDDD@CDDCDDDDD1<BACDDDCCACCC>:ACC 1

HWI-ST538:280:C1FE9ACXX:5:1215:20823:23696 1:N:0:AGTCAA + HistidineM1 68 AAAAAAGAGAAAATCCTTTAGCTGGATAAGGGGCGGATGTAGCCAAGTGGATCAAGGCAGT @CCFFFDFHHHHGHJJJJJJJJJIJJIGHGHHIIIJHFFBBACEEED@CDCDDDDCCBBB? 1
HWI-ST538:280:C1FE9ACXX:5:1215:20823:23696 1:N:0:AGTCAA + HistidineM2 68 AAAAAAGAGAAAATCCTTTAGCTGGATAAGGGGCGGATGTAGCCAAGTGGATCAAGGCAGT @CCFFFDFHHHHGHJJJJJJJJJIJJIGHGHHIIIJHFFBBACEEED@CDCDDDDCCBBB? 1
>PhenylalanineMA specific reads:
HWI-ST538:280:C1FE9ACXX:5:1102:9958:77076 1:N:0:AGTCAA + PhenylalanineM1 138 TCCTCGTGTCACCAGTTCAAATCTGGTTCCTGGCAC CCCFFFFFHHHHHJJHIJJJJJJJJJFHIJJJJJJJ 1
HWI-ST538:280:C1FE9ACXX:5:1102:9958:77076 1:N:0:AGTCAA + PhenylalanineM2 138 TCCTCGTGTCACCAGTTCAAATCTGGTTCCTGGCAC CCCFFFFFHHHHHJJHIJJJJJJJJJFHIJJJJJJJ 1

HWI-ST538:280:C1FE9ACXX:5:2209:15859:82108 1:N:0:AGTCAA + PhenylalanineM1 127 AGGACTGAAAATCCTCGTGTCACCAGTTCAAATCTGGTTCCTGGCAC CCCFFFFFHHHHHJJJJJJJJIJJJJHIJJJJJJJJJHIJJJJIIJJ 1
HWI-ST538:280:C1FE9ACXX:5:2209:15859:82108 1:N:0:AGTCAA + PhenylalanineM2 127 AGGACTGAAAATCCTCGTGTCACCAGTTCAAATCTGGTTCCTGGCAC CCCFFFFFHHHHHJJJJJJJJIJJJJHIJJJJJJJJJHIJJJJIIJJ 1
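To work with a multimap file such as Family_reads.txt, the matches of each read first have to be collected into one block. The sketch below does this by treating ">" lines as group headers and blank lines as read separators, as described above; the function name and the toy read names are hypothetical, and tab-delimited read lines are assumed.

```python
def read_blocks(lines):
    """Yield (group_label, [read lines]) for each read in a classification file."""
    label, block = None, []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            if block:                      # close the read before the new header
                yield label, block
                block = []
            label = line[1:].partition(" specific reads")[0]
        elif not line.strip():             # blank line separates reads
            if block:
                yield label, block
                block = []
        else:
            block.append(line)
    if block:
        yield label, block


toy_file = [
    ">FamA specific reads:",
    "r1\t+\tG1\t98\tACGT\tIIII\t1",
    "r1\t+\tG2\t98\tACGT\tIIII\t1",
    "",
    "r2\t+\tG1\t138\tTTGA\tIIII\t1",
    "r2\t+\tG2\t138\tTTGA\tIIII\t1",
    ">FamB specific reads:",
    "r3\t+\tG3\t44\tAAAA\tIIII\t1",
    "r3\t+\tG4\t44\tAAAA\tIIII\t1",
]
for label, block in read_blocks(toy_file):
    genes = [line.split("\t")[2] for line in block]
    print(label, genes)
```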
Isotype_reads.txt will contain the reads that mapped only to genes with identical isotypes, as
labeled in the Library File. In the sample run, none of the reads met this criterion, so the resulting
Isotype_reads.txt file is empty. However, it would be formatted similarly to the above
Family_reads.txt file if it were populated.
GlobIso_reads.txt contains reads that mapped to genes with the same isotype in different genome
reference files; this grouping exists only because it was defined in the library file from the start. All
the reads mapping to a global isotype are listed under the name of that isotype, and all the
matches for a read are listed consecutively. Although it is somewhat verbose to store every instance
of a read instead of condensing the data so that one line represents the read and the genes to which it
maps, we feel that this long-hand form is appropriate because it prevents the loss of
information. This way, each line specifies the starting position in the corresponding gene, which might
not always be the same depending on the analysis. Furthermore, this format allows the user to see
which members of the group the read mapped to, instead of implying that the read mapped to
all of the members of the group, which is not necessarily the case. In fact, for Global
Isotype this is rather uncommon: below, the first read does not map to ArginineC1 or ArginineC2, both
of which are part of the Arginine global isotype.
Below is a sample from the GlobIso_reads.txt file, obtained from the sample run.
>Arginine specific reads:
HWI-ST538:280:C1FE9ACXX:5:1108:14815:76263 1:N:0:AGTCAA + ArginineC3 100 GCGTCCATTGTCTAATGGATAGGACAGAGGTCTTCTAAAC CCCFFFFFHHHHHJIJJJIIIJIJJGJIJJEGHIIIIJJJ 0
HWI-ST538:280:C1FE9ACXX:5:1108:14815:76263 1:N:0:AGTCAA + ArginineM1 100 GCGTCCATTGTCTAATGGATAGGACAGAGGTCTTCTAAAC CCCFFFFFHHHHHJIJJJIIIJIJJGJIJJEGHIIIIJJJ 0

HWI-ST538:280:C1FE9ACXX:5:1309:19990:51827 1:N:0:AGTCAA + ArginineC3 100 GCGTCCATTGTCTAATGGATAGGACAGAGGTCTTCTAAAC CCCFFFFFHHHHHIIIJJJJJJJIJJJIJICGHHIIJIJJ 0
HWI-ST538:280:C1FE9ACXX:5:1309:19990:51827 1:N:0:AGTCAA + ArginineM1 100 GCGTCCATTGTCTAATGGATAGGACAGAGGTCTTCTAAAC CCCFFFFFHHHHHIIIJJJJJJJIJJJIJICGHHIIJIJJ 0
>Asparagine specific reads:
HWI-ST538:280:C1FE9ACXX:5:1111:1839:76285 1:N:0:AGTCAA + AsparagineC1 100 TCCTCAGTAGCTCAGTGGTAGAGCGGTCGGCTG CCCFFFFFHHHHHJJJIIHHIJJJJJHHIJJJI 1
HWI-ST538:280:C1FE9ACXX:5:1111:1839:76285 1:N:0:AGTCAA + AsparagineC2 100 TCCTCAGTAGCTCAGTGGTAGAGCGGTCGGCTG CCCFFFFFHHHHHJJJIIHHIJJJJJHHIJJJI 1
HWI-ST538:280:C1FE9ACXX:5:1111:1839:76285 1:N:0:AGTCAA + AsparagineM1 100 TCCTCAGTAGCTCAGTGGTAGAGCGGTCGGCTG CCCFFFFFHHHHHJJJIIHHIJJJJJHHIJJJI 0

HWI-ST538:280:C1FE9ACXX:5:1309:11844:17432 1:N:0:AGTCAA + AsparagineC1 100 TCCTCAGTAGCTCAGTGGTAGAGCGGTCGGCTG CCCFFFFFHHHGHJJFHIIIIIHJJJJJJIJJJ 1
HWI-ST538:280:C1FE9ACXX:5:1309:11844:17432 1:N:0:AGTCAA + AsparagineC2 100 TCCTCAGTAGCTCAGTGGTAGAGCGGTCGGCTG CCCFFFFFHHHGHJJFHIIIIIHJJJJJJIJJJ 1
HWI-ST538:280:C1FE9ACXX:5:1309:11844:17432 1:N:0:AGTCAA + AsparagineM1 100 TCCTCAGTAGCTCAGTGGTAGAGCGGTCGGCTG CCCFFFFFHHHGHJJFHIIIIIHJJJJJJIJJJ 0
>Cysteine specific reads:
HWI-ST538:280:C1FE9ACXX:5:2114:10203:18024 1:N:0:AGTCAA + CysteineC1 100 GGCGGCATGGCCAAG @C@FFDDFAHGHFBD 0
HWI-ST538:280:C1FE9ACXX:5:2114:10203:18024 1:N:0:AGTCAA + CysteineM2 100 GGCGGCATGGCCAAG @C@FFDDFAHGHFBD 0
Unclassified.txt contains multimap reads that remain unclassified. Reads would be defined as
unclassified in the sample run if they mapped to at least two genes of different isotypes, such as if a
read mapped to LeucineC1 and SerineC1, as these two genes do not have a common group name in
any of the designated group columns.
Generally it is expected for some of the reads to be unclassified, but due to the limited number of
genes represented in the sample library file, this did not occur in the sample run analysis. Thus, the
resulting Unclassified.txt file is empty. If there were unclassified reads, however, this file would list
them, grouping lines representing the same read together and showing which genes the read did
map to.
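The classification idea described in this section can be summarized as: find the first group column in which every gene matched by the read carries the same label, and fall back to Unclassified if there is none. The sketch below illustrates this under our own assumptions. It is not the Classify Reads implementation, the column order is assumed to run from most to least specific, the Family and Isotype labels in the toy library are invented for illustration, and splice variants are not handled.

```python
GROUPS = ["Gene", "Family", "Isotype", "GlobIso"]  # assumed: most to least specific


def classify(hit_genes, library, groups=GROUPS):
    """Return (group, label) for the first group shared by every matched gene."""
    for group in groups:
        labels = {library[gene][group] for gene in hit_genes}
        if len(labels) == 1:
            return group, labels.pop()
    return "Unclassified", None


def row(gene, family, isotype, globiso):
    return {"Gene": gene, "Family": family, "Isotype": isotype, "GlobIso": globiso}


# Hypothetical library entries, not copied from data/library.txt.
library = {
    "AsparagineC1": row("AsparagineC1", "AsparagineCA", "AsparagineC", "Asparagine"),
    "AsparagineC2": row("AsparagineC2", "AsparagineCA", "AsparagineC", "Asparagine"),
    "AsparagineM1": row("AsparagineM1", "AsparagineMA", "AsparagineM", "Asparagine"),
    "LeucineC1": row("LeucineC1", "LeucineCA", "LeucineC", "Leucine"),
    "SerineC1": row("SerineC1", "SerineCA", "SerineC", "Serine"),
}

print(classify(["AsparagineC1"], library))
print(classify(["AsparagineC1", "AsparagineC2"], library))
print(classify(["AsparagineC1", "AsparagineC2", "AsparagineM1"], library))
print(classify(["LeucineC1", "SerineC1"], library))
```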
Read Summary File
The Read Summary file is also generated during the Classify Reads procedure. This file contains a
concise summary of the results of the RNA-Seeker procedure. First, the number of reads
that were classified to each group is listed to give a general overview of the degree of specificity of
the data set.
Next, each gene present in the Library file is listed along with the number of reads that mapped to it
in each category. This allows the user to make general observations in the data that can be
confirmed or expanded upon by looking at the specific reads in the classification files.
Note that the sum of the numbers in a column does not necessarily equal the total number of reads
that were classified to that group, because a read that maps to several genes is counted once under
each of those genes. In fact, the sums match only for the Splice and Gene specific reads.
Below is a sample of the read_summary.txt file generated from the sample run.
Number of reads classified to each group:
Total reads: 7383
Splice: 0
Gene: 4405
Family: 2500
Isotype: 0
GlobIso: 478
Unclassified: 0
Name            Splice  Gene  Family  Isotype  GlobIso  Unclassified
ArginineC1      0       0     11      0        0        0
ArginineC2      0       0     11      0        0        0
ArginineC3      0       0     0       0        188      0
ArginineM1      0       0     0       0        188      0
AsparagineC1    0       0     2       0        289      0
AsparagineC2    0       0     2       0        289      0
AsparagineM1    0       3     0       0        289      0
AsparticacidC1  0       2170  0       0        0        0
AsparticacidM1  0       1109  0       0        0        0
CysteineC1      0       13    0       0        1        0
CysteineM1-ee   0       0     0       0        0        0
CysteineM1-eie  0       0     0       0        0        0
CysteineM2      0       45    0       0        1        0
GlutamicacidC1  0       409   0       0        0        0
GlutamicacidM1  0       0     0       0        0        0
GlutamicacidM2  0       0     0       0        0        0
GlutamineC1     0       83    0       0        0        0
GlutamineM1     0       0     5       0        0        0
GlutamineM2     0       0     5       0        0        0
GlycineC1       0       155   0       0        0        0
HistidineC1     0       0     0       0        0        0
HistidineC2     0       0     0       0        0        0
HistidineM1     0       0     2       0        0        0
HistidineM2     0       0     2       0        0        0
LeucineC1       0       0     0       0        0        0
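The relationship between the group totals and the per-gene columns can be reproduced with a small counting sketch. It is only an illustration of the bookkeeping, with a hypothetical function name and toy input, and is not the code that writes read_summary.txt.

```python
from collections import Counter


def summarize(classified_reads):
    """classified_reads: iterable of (group, [genes the read mapped to])."""
    totals = Counter()      # reads per group
    per_gene = Counter()    # reads per (gene, group) cell of the table
    for group, genes in classified_reads:
        totals[group] += 1
        for gene in set(genes):
            per_gene[(gene, group)] += 1
    return totals, per_gene


reads = [
    ("Gene", ["AsparagineM1"]),
    ("GlobIso", ["AsparagineC1", "AsparagineC2", "AsparagineM1"]),
    ("GlobIso", ["AsparagineC1", "AsparagineC2", "AsparagineM1"]),
]
totals, per_gene = summarize(reads)
column_sum = sum(n for (gene, group), n in per_gene.items() if group == "GlobIso")
print(totals["GlobIso"], column_sum)  # 2 6
```

Two GlobIso reads produce a column sum of six because each read is counted under all three genes it mapped to, whereas a Gene specific read contributes to exactly one cell.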