README contents:
1. Overview
2. Installation
3. Sample Run
4. RNA-Seeker Procedures
5. Run RNA-Seeker
6. Required File Formats
7. Output File Formats

1. Overview

RNA-Seeker is a package of scripts that supplement the Bowtie mapping algorithm. It is intended to help glean useful information from sets of gene families (e.g. tRNAs), where multiple genes are identical or have identical portions, resulting in RNA-seq reads mapping to multiple locations. RNA-Seeker requires a "Library" file, which contains names, classifications, and genomic locations for the genes that are of interest to the user.

There are two primary procedures in the RNA-Seeker package.

The Compile Reference Gene Files procedure extracts the DNA sequences of the genes of interest from the reference genomes. Non-coding DNA, including any introns and the flanking sequences within a defined distance of each gene, is also included in the reference files. This allows Bowtie to map processed and unprocessed reads alike without allowing gaps during the mapping, a process that raises the possibility of false-positive matches.

The Classify Reads procedure addresses the problem of reads mapping to multiple locations in the genome, as is common when dealing with sets of gene families like tRNAs. Instead of discarding these multimapping reads or using a probabilistic approach to distribute them to genes, this procedure notes the relationship between all of the genes that the read maps to and assigns the read to a category of uniqueness on that basis. A read whose mapped genes are all identical copies of the same gene is much more informative than a read that maps to two or more completely unrelated genes. This procedure distinguishes between these types of multimap reads and more, depending on the user's customization of the analysis.
Importantly, this procedure reports only the factual results of the classification and leaves any inference or interpretation of these results to the user.

The RNA-Seeker tool was developed by Matt Spencer and Jonathan Morrand.
MS: Informatics Institute, University of Missouri
JM: Department of Biochemistry, University of Missouri
Contact: mcsgx2@mail.missouri.edu
Do not hesitate to use this email for support and questions. Please include "RNA-Seeker" in the subject of any messages.

Version: 1.0
Last Updated: December 2014

2. Installation

Note that the scripts included in RNA-Seeker are implemented in Perl and are designed for use on Unix/Linux systems only.

To decompress the RNA-Seeker package on a Unix/Linux system, navigate to the directory where the package (rnaseeker-v1.0.tar.gz) is stored and use the command:

    tar -xvzf rnaseeker-v1.0.tar.gz

Next, enter the rnaseeker-v1.0/scripts/ directory (e.g. cd rnaseeker-v1.0/scripts/ ). Use the following command to complete installation:

    sh install.sh

The RNA-Seeker program will run the relevant bowtie commands for the user. Since bowtie is not included in the installation of RNA-Seeker, this program must be able to locate bowtie. Try running the command "bowtie --version". If it tells you the bowtie version, then nothing more needs to be done. Otherwise, identify where the bowtie tool is installed (or install it using the link below if it is not yet installed on this machine) and change the Bowtie Command parameter to reflect this location. To do this, make sure you are still in the rnaseeker-v1.0/scripts/ directory and run the following command:

    ./change_params.pl

Follow the instructions for changing the Bowtie Command (additional instructions about this can be found in "Parameter File: How to change the parameters for RNA-Seeker" in section 6: Required File Formats). The RNA-Seeker package is now installed.
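The "bowtie --version" check described above can also be scripted. The following Python sketch is only an illustration and is not part of the RNA-Seeker package; the function names find_bowtie and bowtie_version are hypothetical. It looks up the command on the PATH (or at an explicit location) and, if found, asks it for its version.

```python
import shutil
import subprocess


def find_bowtie(command="bowtie"):
    """Return the resolved path of the bowtie executable, or None if not found.

    `command` may be a bare name looked up on PATH or an explicit path such as
    "/usr/lib/bowtie" (the example path used in the parameter file description).
    """
    return shutil.which(command)


def bowtie_version(command="bowtie"):
    """Return the text printed by `<command> --version`, or None if unavailable."""
    path = find_bowtie(command)
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    return result.stdout.strip() or None


if __name__ == "__main__":
    if find_bowtie() is None:
        print("bowtie not found; set the Bowtie Command parameter with change_params.pl")
    else:
        print(bowtie_version())
```

If find_bowtie returns None, the Bowtie Command parameter should be set to the full path of the bowtie executable, as described above.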
The contents of the scripts directory should be left together. If the scripts/ directory is ever moved or renamed, or its contents are removed from that directory, run the 'sh install.sh' command again to reinstall the package after the changes have occurred. In such a case, be sure that the six files (install.sh, params.txt, default_params.txt, change_params.pl, seeker_functions.pm, and RNA-Seeker.pl) are all together in the same directory, or else the installation will fail.

The Bowtie bioinformatics tool is necessary for RNA-Seeker's intended use. This tool is not included with the RNA-Seeker package, but it is free, open-source software that can be downloaded from the following website.

    http://sourceforge.net/projects/bowtie-bio/

3. Sample Run

We provide a small sample dataset and the results obtained using these data. The data files can be found in the sample directory, with the expected results in the expected_sample directory. Note that RNA-Seeker does not employ stochastic (random) processes, so the sample run should yield exactly the results that are saved in the expected_sample/results directory if the analysis is run as described below.

To run the Sample Run, navigate to the sample directory (rnaseeker-v1.0/sample). Then run the command below:

    ../scripts/RNA-Seeker.pl -input sample_input.txt

Compare the files that are produced in the sample/results directory to those stored in the expected_sample/results directory. They should be identical except for the comments that are written at the top of the files in the expected_sample/results directory. These comments explain what can be found in each of the output files.

This sample run uses RNA-Seeker's ability to read its inputs from an input file. The different ways that RNA-Seeker can receive inputs are discussed in more detail below.

4. RNA-Seeker Procedures

a) Compile Reference Genes

This procedure is a precursor to the bowtie-build indexer, which can be used to align reads to the custom-made reference. Here the program uses the user input stored in the Library file to make reduced reference gene files that include only the specified genes and the immediately surrounding non-coding DNA, including introns and flanking sequences. This strategy allows Bowtie to map both processed and unprocessed reads without allowing for gaps during the mapping. The resulting alignments are more accurate for the analysis, and the user avoids the significantly longer run time of a genome-wide mapping with gaps. Reference fasta files are produced that include the sequences of the indicated genes and the surrounding non-coding information. Changing the Flank Length parameter allows the user to indicate just how much non-coding DNA around the genes should be included (directions for changing parameters like this can be found in the Options section below).

b) Bowtie-Build and Bowtie Map

RNA-Seeker automatically incorporates the necessary Bowtie procedures into a quick and easy workflow that includes the mapping software. Bowtie-build allows the customized reference gene files to be used for the Bowtie Map algorithm, which quickly and efficiently determines where in these reference genes the reads map.

c) Classify Reads

This procedure post-processes the reads mapped by the Bowtie algorithm. Once reads are mapped to the custom-built reference gene files, this script sorts the reads according to their degree of uniqueness, using the user-defined hierarchy of groups. Reads that map to one unique location are separated, counted, and noted as uniquely mapping, and no further processing is done to these reads. The major benefit of this procedure is how it processes reads that map to multiple locations in the reference genes (multimaps).
Depending on the methods, many analyses discard these multimapping reads or use probabilistic algorithms to guess their gene of origin. This procedure incorporates no guesswork, but instead derives as much information from these multimaps as possible by analyzing the relationship of the genes to which the read maps. Note that the relationships between genes must be supplied by the user; we make no attempt to determine the relationship between genes other than through the annotations that are given.

For example, in our application of RNA-Seeker, we classified reads mapping to tRNA genes in the plant species Brachypodium distachyon. Since many tRNA genes are duplicated many times in the genome, sometimes perfectly and sometimes with small distinguishing factors, this mapping produced a large number of multimaps. We manually determined which genes are perfect copies of each other (Families), and noted the genes that produce the same tRNA isotype despite not being perfect copies. We made an additional group that distinguishes genes of the same isotype that may have been transcribed in an organelle. Given annotations showing which of these groups each gene belongs to, the Classify Reads procedure separated the reads into useful categories, which allowed the researcher to quickly move on to drawing conclusions from this evidence.

5. Run RNA-Seeker

The most important part of using RNA-Seeker is the appropriate creation of the Library file. The Library file provides the system with the annotation information it needs to compile the reference gene files and to perform the classification of multimap reads. Note that this software was deliberately created to be flexible and unrestrictive, so that a variety of research questions can be answered through the creative generation and organization of the Library file. A detailed explanation of the format and requirements of the Library file can be found in the next section: Required File Formats.
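To illustrate how the group annotations in a Library file can drive the classification of a multimap read, here is a minimal Python sketch. It is not the Perl implementation shipped with RNA-Seeker: the function name classify_read, the dictionary layout, and the rule "report the most specific group level shared by every gene the read maps to" are assumptions made for this example. The annotations are taken from the sample Library rows described in section 6.

```python
# Group columns ordered from most specific to most broad, as in the sample Library file.
LEVELS = ["Gene", "Family", "Isotype", "GlobIso"]

# Annotations for four genes from the sample Library file (section 6).
LIBRARY = {
    "SerineM1": {"Gene": "SerineM1", "Family": "SerineMA",
                 "Isotype": "SerineM", "GlobIso": "Serine"},
    "SerineM2": {"Gene": "SerineM2", "Family": "SerineMB",
                 "Isotype": "SerineM", "GlobIso": "Serine"},
    "CysteineM1": {"Gene": "CysteineM1", "Family": "CysteineMA",
                   "Isotype": "CysteineM", "GlobIso": "Cysteine"},
    "SerineM3": {"Gene": "SerineM3", "Family": "SerineMC",
                 "Isotype": "SerineM", "GlobIso": "Serine"},
}


def classify_read(mapped_genes, library=LIBRARY, levels=LEVELS):
    """Return (level, value) for the most specific level on which all mapped
    genes agree, or None if the genes share no group at any level."""
    for level in levels:
        values = {library[gene][level] for gene in mapped_genes}
        if len(values) == 1:
            return level, values.pop()
    return None


if __name__ == "__main__":
    print(classify_read(["SerineM1"]))                # maps to a single gene
    print(classify_read(["SerineM1", "SerineM2"]))    # different genes, same isotype
    print(classify_read(["SerineM1", "CysteineM1"]))  # unrelated genes
```

In this sketch a read that hits only SerineM1 is reported at the Gene level, a read that hits SerineM1 and SerineM2 is reported at the Isotype level, and a read that hits SerineM1 and CysteineM1 shares no group. No read is discarded or probabilistically assigned, which mirrors the intent described in section 4c.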
Once a Library file appropriate to your analysis is created, there are several ways to provide input information to the RNA-Seeker program: in-program prompts, command line options, and input file selection. These are described below, using the included sample dataset as an example.

a) In-Program Prompts

Unlike most bioinformatics tools, RNA-Seeker gives the user the opportunity to designate input after starting the program. This is useful because the user is taken through a step-by-step process that asks for one required input or file at a time. Inputs designated in another way are also checked for validity, and the program will ask for an input to be restated if there is a problem. Try this out by following the instructions below.

Navigate to the rnaseeker-v1.0/sample directory. Enter the following command:

    ../scripts/RNA-Seeker.pl

The program will ask for missing input (since you didn't indicate any input, all inputs are currently missing). Enter the following inputs when prompted:

    Flank Length: 100
    Library File: data/library.txt
    RNA-seq Reads File: data/RNA-seq_reads.trim.fq
    Results Directory: results

The following prompt asks for an intermediate results directory. If this directory is not specified, the intermediate files will be saved to the results directory indicated above. While designating such a directory is not strictly required, we recommend indicating a separate directory because this is where the bowtie files will be saved, and segregating them will prevent cluttering the main results directory.

    Would you like to indicate a Intermediate directory? Enter y/n: y
    Intermediate Directory: bowtie_files

Next the program asks if the entire analysis should be run. Here it is possible to indicate that only certain procedures should be run. This can be useful if the user wants to run a similar analysis with slight changes.
For example, to run an analysis a second time using different options for the Bowtie Map algorithm, the Compile Reference Gene Files and Bowtie-Build procedures do not need to be rerun (they can be, but it will save some time not to). For now let's run the entire analysis.

    Do you want to run the entire analysis, including Compile Reference Gene Files Bowtie-Build Bowtie Map and Classify Reads? Enter y/n: y

At this point, the program will quickly scan the designated library file for all of the chromosome codes included in the file. It will then ask the user to designate a reference genome file that corresponds to each chromosome code. Thus, the program will extract the DNA sequence for each gene from the corresponding reference genome file while compiling the reference genes. Note that if the user does not want to include all of the codes in the analysis, entering 'ignore' will result in the program omitting all genes labeled with this code.

    Reference file for code C: data/chl_ref_genome.fasta
    Reference file for code M: data/mit_ref_genome.fasta

At this point, the program has all of the inputs it needs, and will run the analysis (or at least the procedures that were indicated in the above steps). Note that at the conclusion of the analysis, the location of the generated input file is shown, as below.

    Input file saved as: /home/mspencer/rnaseeker-v1.0/input_files/input_file_3.txt

This file stores all of the inputs used for this execution of the program. Thus, if you would like to run the same analysis again, you can indicate this input file (see the Input File Selection option below).

b) Command Line Options

Although the interactive in-program prompts can be convenient, they are slow, especially if you would like to run more than one analysis. Indicating input through command line options is typical of Unix programs, so it may be more familiar to users. It also allows for a very concise way to communicate the inputs used for an analysis.
Try using the command below to run the same analysis performed above.

    ../scripts/RNA-Seeker.pl -FL 100 -LF data/library.txt -RF data/RNA-seq_reads.trim.fq -RD results -ID bowtie_files -RG M::data/mit_ref_genome.fasta -RG C::data/chl_ref_genome.fasta -CRG -BTB -BTM -CLR

Notice how simply copying and pasting this command into the command line will result in the analysis being performed, which is a much quicker alternative to individually typing the inputs. The two-letter tags stand for input value types (FL = Flank Length, etc.) and should be followed immediately by the value for this input (e.g. "-FL 100" indicates using a flank length of 100). The three-letter tags represent the procedures (CRG = Compile Reference Gene files, etc.) and need no value: their presence alone is sufficient (e.g. including "-CLR" tells the program to include the Classify Reads procedure). More discussion of the tags and their meanings can be found below in the Required File Formats section, where the sample_input.txt file is described.

When indicating a reference genome file to use, it is important to indicate the chromosome code that represents it. In this example, the "M" code represents genes found in the mitochondrial genome, which is stored in the data/mit_ref_genome.fasta file. Thus, to indicate this on the command line, we use the tag "RG" to indicate that we are specifying a reference genome file, followed by the chromosome code "M" and the file name, separated by two colons "::". This format (-RG <Code>::<File>) is interpreted by the program, which will use the indicated file as the reference genome for genes with that code.

c) Input File Selection

The third method of selecting input is to indicate an input file where all inputs are saved. This is a very quick and easy way to run an analysis, but requires that the input file is already written.
Input files are automatically generated in the rnaseeker-v1.0/input_files/ directory every time the RNA-Seeker analysis is executed (the most recent input file will be the one with the highest number attached to the file name). Thus, to run an analysis a second time, one need only select the input file that was generated by the first analysis, and all of the same inputs will be used. Try running the sample analysis using the prepared sample input file as below.

    ../scripts/RNA-Seeker.pl -input sample_input.txt

In addition to using automatically generated input files in this fashion, you are encouraged to modify these input files to slightly change the inputs for the next analysis, or even to write an input file from scratch. However, the recommended way to introduce slight changes to a previous analysis is described below.

d) Combination

The recommended strategy for designating inputs for RNA-Seeker analyses is to use a combination of the above methods. It is important to note that the program will look for inputs in this order:

1. Input File Selection
2. Command Line Options
3. In-Program Prompts

First, if an input file is designated, the program will read the input file and load all inputs stored there. Then the command line options are read, and any inputs designated on the command line replace those found in the input file. Finally, any inputs that are still missing or were entered incorrectly will generate prompts, which must be answered before the analysis is attempted.

Due to this prioritization structure, the recommended way to introduce slight changes to a previously run analysis is to use an input file and indicate any changes using the command line options. For example, use the command below to run a slightly modified version of the sample analysis using a combination of input types.
    ../scripts/RNA-Seeker.pl -input sample_input.txt -FL 50 -LF data/library_file.txt

When prompted to revise the name of the library file:

    Library File: data/library.txt

In this case, most of the inputs are gathered from the sample_input.txt input file. However, the command line -FL option indicating a Flank Length of 50 overrides the input file's Flank Length of 100. Similarly, the command line -LF option overrides the library file location indicated in the input file. However, there is no file with that name, so when the program fails to find it, it uses the in-program prompts to request the real library file name. Once this is obtained, the analysis will run.

6. Required File Formats

Library File: How to create a library file for your purposes.

The sample library file can be found at rnaseeker-v1.0/sample/data/library.txt

The Library file is the primary input file that RNA-Seeker utilizes. There are some formatting requirements, which are explained in detail below, as well as in the file sample/data/library.txt and in the sample Excel document Example_Library.xls. Note that the organization of groups is largely up to the user, and the usefulness of this package depends on the user sculpting the organization of groups to meet their individual needs.

The Library file is easily constructed from an Excel spreadsheet. An example of the basic structure of the library file Excel sheet is included in this package. Fill out the spreadsheet using the guidelines written in comments in the spreadsheet (more guidelines can be found in the sample/data/library.txt text file). Once the appropriate information is listed in this Excel sheet, follow the instructions below:

    Select File -> Save As...
    Name the file
    Indicate for "Save as type:" the selection "Text (Tab delimited) (*.txt)"
    Save

The resulting file will be in the appropriate format for the library file.
No parameters in params.txt will need to be changed (an explanation of params.txt is below).

Formatting requirements for the Library File:

The first non-comment line should be the Headings line. Comment lines begin with a '#' symbol. The headings line should include the user-specified names of the groups to be used in the analysis. Headings for the "!Revers", "!Chrom", and "!Exons" columns should remain as written, and should remain in the same order. The exclamation points indicate that these are specific, non-user-defined fields. Headings for the Uniq Name and Group columns are user-defined. These headings will be used as the classifications for the genes.

!Revers: This column should contain either "Yes" or "No", indicating whether the gene of interest is on the forward (No) or reverse (Yes) strand, as the reference genome is written. RNA-Seeker will search for reads mapping to the area in the indicated direction, and ignore reads that match the opposite strand.

!Chrom: This column should contain a chromosomal ID code (which may be any character, or even multiple characters). All genes with the same ID should have sequences found in the same reference genome file, though this file need not be named according to the chromosomal ID listed here. As described above, the program will require that you indicate which reference genome file corresponds to each chromosome code.

!Exons: This column should contain the first and last nucleotide that is included in each exon. They should be listed in ascending order, even for genes on the reverse strand. The two numbers should be separated by a comma, and individual exons should be separated by a tab (if you prefer different characters separating these fields, see the params.txt discussion below). Most of the genes in the example below have a single exon, though CysteineM1 is an example of a gene with multiple exons.
Group Names: The column headings that do not begin with an exclamation point are user-defined columns, and these names can be changed at will. (Certain special characters may cause problems. The underscore is recommended instead of space, e.g. Alanine_1 instead of Alanine 1.) Also note that the number of these user-defined groups can be changed (increased or decreased) to the user's specifications.

An example header line is shown below, with some brief descriptions. Columns are aligned with spaces here for readability; the file itself is tab-delimited.

Uniq Name: Unique identifier for the gene (title for this category can be changed)
|           Group1: User-defined group to which the gene belongs (most specific)
|           |           Group2: User-defined group
|           |           |          Group3: User-defined group (most broad)
|           |           |          |         !Revers: Yes if gene is on reverse strand, otherwise No
|           |           |          |         |        !Chrom: Chromosome # where gene can be found
|           |           |          |         |        |       !Exons: Exon boundary information
|           |           |          |         |        |       |       use comma (exon delim) between
|           |           |          |         |        |       |       start and stop nucleotide
|           |           |          |         |        |       |       use tab (file delimiter) between exons
v           v           v          v         v        v       v
Gene        Family      Isotype    GlobIso   !Revers  !Chrom  !Exons
SerineM1    SerineMA    SerineM    Serine    No       M       262563,262649
SerineM2    SerineMB    SerineM    Serine    No       M       337683,337770
CysteineM1  CysteineMA  CysteineM  Cysteine  No       M       338021,338058  338109,338140
SerineM3    SerineMC    SerineM    Serine    No       M       358823,358909

Parameter File: How to change the parameters for RNA-Seeker.

The parameter files can be found at rnaseeker-v1.0/scripts/params.txt and rnaseeker-v1.0/scripts/default_params.txt. These files should not be moved from this directory. Do not touch the default_params.txt file. It is the same as the params.txt file before any changes are made to it, and it exists as a backup in case the params.txt file somehow gets corrupted.

The parameter file stores adjustable parameters for the program. This file is kept separate from the input files that are generated on each execution because these parameters are ones that we anticipate will not be changed very often (probably, in fact, zero or one time(s)).
If you don't already have something in mind that you want to change and RNA-Seeker hasn't specifically recommended a parameter change, you probably don't need to read this section.

The parameters present in this file may (and should) be changed to the user's specifications. In particular, the min and max gene length parameters are not required for the scripts to function; they were implemented as an error-checking mechanism to help verify that the Library file was written correctly, according to the expectations of the user. These parameters also allow the user to vary the delimiters of the Library file, if the user would prefer this file to be constructed in a slightly different format. We advise leaving the delimiter parameters at their default values, as these are appropriate for library files that are formatted in the recommended manner described above.

The recommended way to change the values of these parameters is to navigate to the rnaseeker-v1.0/scripts directory and use the command below:

    ./change_params.pl

This script provides a friendly interface for indicating which parameters you would like to change and providing their new values. The script checks that each new value is valid for the parameter before updating it. The script also offers the option to reset the parameters to the values that were assigned at installation.

If you see any errors while executing RNA-Seeker that mention anything about parameters, this can usually be solved by navigating to the scripts directory and running the above change_params.pl command. This script recognizes when the params.txt file is corrupted and automatically reverts to the default parameters.

Below is the entirety of the params.txt file. Each parameter defined here includes a description of its effect.

params.txt

This is the parameter file.
Below, parameters are explained in detail, and parameter values can be permanently altered.

>FD : File Delimiter
The lines in the gene library file should be formatted as such:
UniqueName,Group1,(...),GroupN,Reverse,Chromosome,ExonBounds
where the columns are separated by a delimiter character. Above, the delimiter is a comma. Set the parameter value to a different character to indicate a different delimiter. Note that '\t' indicates tab and '\s' indicates space.
The current value of File Delimiter is shown below:
#FD: \t

>ED : Exon Delimiter
The last column(s) of the gene library file should have exon boundary information. This information should include the start nucleotide of the exon and the end nucleotide of the exon, separated by the exon delimiter. For example, if the delimiter is a colon, the exon information should look like this: 12345678:12345876. Note that '\t' indicates tab and '\s' indicates space. Also note that the exon delimiter cannot be identical to the file delimiter.
The current value of Exon Delimiter is shown below:
#ED: ,

>Min : Min span
The min span value exists to safeguard against errors in the exon boundaries. Set the min span to be slightly smaller than the shortest gene (the sum of the exons), and the program will notify you if any genes are found to be less than this minimum length. It can also be temporarily reset once during the program run.
The current value of Min span is shown below:
#Min: 60

>Max : Max span
The max span value exists to safeguard against errors in the exon boundaries. Set the max span to be slightly larger than the largest gene (the sum of the exons), and the program will notify you if any genes are found to be greater than this maximum length. It can be temporarily reset once during the program run.
The current value of Max span is shown below:
#Max: 90

>BC : Bowtie Command
The RNA-Seeker program will run the relevant bowtie commands for the user.
Since bowtie is not included in the installation of RNA-Seeker, this program must be able to locate bowtie. Often tools can be accessed from any directory, but sometimes this is not the case. Try running the command "bowtie --version". If it tells you the bowtie version, then leave this parameter as "bowtie". Otherwise, identify the location of installation of the bowtie tool, and change this parameter to the command that correctly accesses bowtie. (e.g. if using the command "/usr/lib/bowtie --version" succeeds, set this parameter to "/usr/lib/bowtie")
The current value of the Bowtie Command is shown below:
#BC: bowtie

>BO : Bowtie Options
Bowtie has many options that can be used. The most relevant ones to the RNA-Seeker application are the -v and -a options. Changing -v will specify the amount of mismatches allowed in the alignment. Thus, -v0 means that reads have to perfectly match the reference. Allowing for one or more mismatches is encouraged, as the classification algorithm does not suffer from this specification. Using -a tells the bowtie algorithm to report all valid alignments for each read, resulting in reads mapping to multiple locations. Dealing with such multimapping reads is the primary benefit of RNA-Seeker, so including this option is highly recommended. Feel free to use other options specific to your needs.
The current value of Bowtie Options is shown below:
#BO: -v0 -a

Sample Input File: What components are saved in the input file.

The sample input file can be found at rnaseeker-v1.0/sample/sample_input.txt. This file contains all of the necessary inputs to execute the sample run, as described in Section 3: Sample Run. Below we list each of the required and optional inputs and describe how to appropriately list the values for these inputs.

Note that every time RNA-Seeker is run, an input file will be saved to this directory (scripts/input_files).
If a similar but slightly different analysis is to be run, you may edit the resulting input file with the desired changes. See section 5: Run RNA-Seeker for a more detailed discussion on how to use input files to run the program.

It is recommended to list inputs in an input file as absolute paths (for example, "/home/myusername/myproject/library.txt" as opposed to "myproject/library.txt"; determine the absolute path of the current directory with the Unix command "pwd"), as this allows the input file to be used from any directory. Unfortunately, the sample input file is written such that it will only work if the sample run is executed from the sample directory, as indicated in Section 3: Sample Run. Fortunately, all input files that are saved after other RNA-Seeker runs will not have this problem, and can be used from any directory.

The four inputs below are the essential inputs. To indicate the value of an input, include a line beginning with a ">" symbol, followed by a line beginning with the "#" symbol that shows the desired value of the input. Here are some examples, defining the Library File, the Reads File, and the Results Directory.

>LF: Library File
#LF: data/library.txt
>RF: Reads File
#RF: data/RNA-seq_reads.trim.fq
>RD: Results Directory
#RD: results/

The flank length dictates the number of nucleotides to either side of the gene that will be included in the pseudo reference file, so that reads that partially include these nucleotides will be mapped. We chose a flank length of 100 because that was sufficiently large to capture any expected reads from unprocessed RNA strands, but small enough to reasonably reduce the processing time of the Bowtie algorithm. However, the flank length is flexible depending on the needs of the user. It can be set to 0 to include only reads that are contained completely within the genes. Similarly, using a small flank length will limit the distance away from the gene that the reads are allowed to map to.
It is recommended to keep the flank length at a reasonably high value (slightly higher than the maximum read length) to include all reads that may include fragments of the genes of interest. Note that any reads that map completely within the flanking sequence are removed by the system automatically, so choosing a large flank length does not introduce such reads into the results. Below the flank length is defined.

>FL: Flank Length
#FL: 100

Note that indicating the below directory is not strictly essential. The intermediate results are saved by default to the Results directory, unless the Intermediate Results Directory is specified. However, it is recommended to save intermediate results to a separate directory to avoid cluttering the primary results directory.

>ID: Intermediate Results Directory
#ID: bowtie_files/

Below are the specifications for the reference genome files. They are formatted slightly differently because it is important to indicate which file corresponds to each chromosome code in the library file. Thus, these entries must include the two characters "::" in between the chromosome code and the corresponding file, as below. Note that in this case the same tag "RG" can be used multiple times.

>Reference Genome File
#RG: C::data/chl_ref_genome.fasta
>Reference Genome File
#RG: M::data/mit_ref_genome.fasta

In the event that you want to run only part of the RNA-Seeker analysis, the below options can be specified. For example, if you are executing a follow-up run that uses a slightly different set of parameters during the Bowtie mapping, then you would not have to perform the reference genome compilation or the Bowtie-build procedures, since those would remain unchanged. It is advised that the final results from the original analysis be copied into a different folder to avoid them being overwritten.

Below we use the value "1" to indicate that a process will be included in the analysis.
If a process should not be included in the analysis, set its value to "0" or simply delete the lines for that procedure.

>CRG: Compile Reference Gene Files
#CRG: 1
>BTB: Bowtie-Build
#BTB: 1
>BTM: Bowtie Map
#BTM: 1
>CLR: Classify Reads
#CLR: 1

Below is an example input file, which is sufficient to specify the inputs for the sample run.

sample_input.txt

>LF: Library File
#LF: data/library.txt
>RF: Reads File
#RF: data/RNA-seq_reads.trim.fq
>RD: Results Directory
#RD: results/
>FL: Flank Length
#FL: 100
>ID: Intermediate Results Directory
#ID: bowtie_files/
>Reference Genome File
#RG: C::data/chl_ref_genome.fasta
>Reference Genome File
#RG: M::data/mit_ref_genome.fasta
>CRG: Compile Reference Gene Files
#CRG: 1
>BTB: Bowtie-Build
#BTB: 1
>BTM: Bowtie Map
#BTM: 1
>CLR: Classify Reads
#CLR: 1

7. Output File Formats

Reference Gene Files

The reference gene files generated by the Compile Reference Gene Files procedure are saved in the indicated results directory. The sample run will generate two reference gene files: results/RG_C.fasta (Reference Genome for Chromosome "C") and results/RG_M.fasta (Reference Genome for Chromosome "M"). Each of these files contains the nucleotide sequences of the genes of interest labeled in the library file. As demonstrated below, each gene has an indicated name, followed by a nucleotide sequence. The sequence consists of the flanking sequence preceding the gene, the coding sequence itself, and the flanking sequence following the gene. In this file, the flanking sequences to either side are 100 nucleotides long, as set by the Flank Length input of 100. When a gene is indicated to have two or more exons, multiple versions of the gene are written: the sequence including the intron as well as the sequence with the intron excised, each designated by a tag at the end of the gene name.
For example, the gene label MethionineM1-eie indicates that the following sequence contains two exons split by an intron, whereas MethionineM1-ee indicates that the two exons are adjacent in the sequence. Here is a small sample of the RG_M.fasta file for the sample run. Notice that the intron is included in CysteineM1-eie but not in CysteineM1-ee.

>SerineM2
AGAGTCTTTCTCGCTCAGTAAAGAAGAGTACGCGCGCTACGGCTTACGCAGTGGATCTTCGGGCAACCAACCCGGCACATCCA
ATTCCGATCAACAACTTGGAGGTATGGCTGAGTGGCTTAAGGCATTGGTTTGCTAAATCGACATACAAGAAGATTGTATCATG
GGTTCGAATCCCATTTCCTCCGGCACGGAAGTAGAACGGGCGGGCGAAATTACGTGAGAGAAAGAACCTCAGATTGATGGAGT
CCGCCGTCGGACAGAATAGCACTACTTAGTGACTAGGAG
>CysteineM1-eie
CCCGAACGAGGGCCGTCCAGTCCCTGGCCGGCTCTCGGTTCTTGAGCAAACTCCTCCACTGGGATGCTCATAGATGAAGAAAA
GAGACTTTAGGCAAGTGGTTCTGGTAGCTCAGCTGGTTAGAGCAAAGGACTGCAAATCCTTTTTTGCTTGTTTCAGTGGGAAG
AGCAAGGGGCATTGCCCTTGAAATCCTTCAGTGGTTCGATTCCACCTCAGAACGAAAGAAAATGCAAGTGAAAGAGAATTGTA
GGACAAGAGCTGATGGATGATAGAACAAAAGAGATGGATAGAGGCGAGAGAGATTGAGTGGAGGGTTCGAA
>CysteineM1-ee
CCCGAACGAGGGCCGTCCAGTCCCTGGCCGGCTCTCGGTTCTTGAGCAAACTCCTCCACTGGGATGCTCATAGATGAAGAAAA
GAGACTTTAGGCAAGTGGTTCTGGTAGCTCAGCTGGTTAGAGCAAAGGACTGCAAATCCTTCAGTGGTTCGATTCCACCTCAG
AACGAAAGAAAATGCAAGTGAAAGAGAATTGTAGGACAAGAGCTGATGGATGATAGAACAAAAGAGATGGATAGAGGCGAGAG
AGATTGAGTGGAGGGTTCGAA
>SerineM3
AGAACCATGGAACCCCCGAGCGGTTGTGGTTGTACCTGTACTTGCAGGGATACGAAAACTCGCTATTCACTCAGTTTCTGGTC
AATAATAAGATTATGTAGGAGAGATGGCCGAGCGGTTCAAGGCGTAGCATTGGAACTGCTATGTAGGCTTTTGTTTACCGAGG
GTTCGAATCCCTCTCTTTCCGTTTCTGTTAATTCACCACCGACCACAATGTATCAAATCAAATAGATTTCAGCAATAAGACTT
TTATTTGATAGAAATTCATTATTAAAAATTAATTACTT

Classification Files

The Classify Reads procedure produces several Classification Files, which are named according to the header of the Library file. In our sample run, the group names used were Gene, Family, Isotype and GlobIso (Global Isotype), so the files produced are Gene_reads.txt, Family_reads.txt, Isotype_reads.txt, and GlobIso_reads.txt.
Additionally, regardless of the group names specified in the Library file, two extra categories are produced, saved in the files Splice_reads.txt and Unclassified.txt.

Splice_reads.txt will contain any reads that mapped to at least one splice variant of a gene, but not all splice variants. These reads are still unique to that gene, but since they are also unique to one splice variant we note this observation by segregating these reads to a separate file. In the sample run, no reads map to just one splice variant of the genes with introns, so this file is empty.

Gene_reads.txt will contain the original read lines which were found to be unique to a single specified gene. In other words, these reads only mapped to a single location in any of the genomes supplied. The reads are sorted by the gene to which they mapped, arranged in alphabetical order. Since 100 flanking nucleotides were included before and after the gene sequence, reads that begin at the first nucleotide of the gene are reported at position 100 of the reference gene. Reads that have 100 in the position column therefore mapped to the beginning of the gene, and reads that have a position less than 100 included some of the flanking sequence before the gene. If a flank length other than 100 is used, the position of the start of the gene changes accordingly. All of the information that Bowtie originally reported for each read is still present in these files, so no information is lost during the Classify Reads procedure. Below is a sample from the Gene_reads.txt file generated during the sample run.

>AsparagineM1 specific reads:
HWI-ST538:280:C1FE9ACXX:5:1314:19180:6288 1:N:0:AGTCAA + AsparagineM1 135 AACTGACTGGTCGTAGGTTCGAATCCTACTTGGGGAGATTTGATTCATTCAGAATTCGAATTGATAGTTATAGCTTTTCTGACTAGC @@CFFFFFHHFHHFEHIEHHIJFGHGGIJJJJIJJGIGHIIIJJIIIJJHIJIHHIJJEGHHIHHHGGHEDDFFFDFEEEEEEEDDC 0
HWI-ST538:280:C1FE9ACXX:5:1110:18949:14936 1:N:0:AGTCAA + AsparagineM1 89 AAGTCCGTTCTTCCTCAGTAGCTCAGTGGTAGAGCGGTCGGCTGTTAACTGACTGGTCGTAGGTTCGAAT @CCDFFFFGHHHDCGIJJJIIIIJGGFHIHIICGGHIIJIJGIIIIIJJJJGHIIIGHHEHFFFFEDDDD 0
HWI-ST538:280:C1FE9ACXX:5:2104:11621:55887 1:N:0:AGTCAA + AsparagineM1 81 GCTTCCTTAAGTCCGTTCTTCCTCAGTAGCTCAGTGGTAGAGCGGTCGGCT CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJHIJJJJJJIJJJJJ 0

>AsparticacidC1 specific reads:
HWI-ST538:280:C1FE9ACXX:5:1116:8902:29828 1:N:0:AGTCAA + AsparticacidC1 100 GGGATTGTAGTTCAATTGGTCAGAGCACCGCCCTGT @@@DDDDD<FFBFFGII<A<EHEH?GFIEFF:FF?D 0
HWI-ST538:280:C1FE9ACXX:5:2111:16256:97578 1:N:0:AGTCAA + AsparticacidC1 100 GGGATTGTAGTTCAATTGGTCAGAGCACC @@CFFFFDFHFDDGHGGEHFFFHJFCGHI 0
HWI-ST538:280:C1FE9ACXX:5:2113:4857:57765 1:N:0:AGTCAA + AsparticacidC1 100 GGGATTGTAGTTCAATTGGTCAGAGCACCGCCCTGTCAA <@@DDDDDAF?<?A<EHGGG@>GHEFHIEHGBGIGGGEG 0

Family_reads.txt will contain the reads that mapped only to genes that were labeled with the same Family. Keep in mind that the group name "Family" is not important, and in the Library file header the name of this category can be customized to be relevant to the user's analysis. As with all of the read output files, all the reads that sorted to a family are listed together. In this file and others that include multimaps (Isotype_reads.txt and GlobIso_reads.txt), all of the matches that were found for a single read are printed on adjacent lines, with a blank line separating different reads. Below is a sample from Family_reads.txt, as it was generated from the sample run.

>AsparagineCA specific reads:
HWI-ST538:280:C1FE9ACXX:5:2216:7612:45579 1:N:0:AGTCAA + AsparagineC1 98 GTTCCTCAGTAGCTCAGTGGTAGAGCGGTCGGCTGT C@CFFFFFHHHHHJJJJGHIFIIJJJJJIJJJJJJJ 1
HWI-ST538:280:C1FE9ACXX:5:2216:7612:45579 1:N:0:AGTCAA + AsparagineC2 98 GTTCCTCAGTAGCTCAGTGGTAGAGCGGTCGGCTGT C@CFFFFFHHHHHJJJJGHIFIIJJJJJIJJJJJJJ 1

HWI-ST538:280:C1FE9ACXX:5:1306:4719:10665 1:N:0:AGTCAA + AsparagineC1 138 TGACTGGTCGTAGGTTCGAATCCTACTTGGGGAGATTTGATTCATTCTTTAATGT CCCFFFFFHHHHHJJJJJJIJJJJJJJJJJJJDGGIIJJIJJJIJJJJJJJJJJJ 1
HWI-ST538:280:C1FE9ACXX:5:1306:4719:10665 1:N:0:AGTCAA + AsparagineC2 138 TGACTGGTCGTAGGTTCGAATCCTACTTGGGGAGATTTGATTCATTCTTTAATGT CCCFFFFFHHHHHJJJJJJIJJJJJJJJJJJJDGGIIJJIJJJIJJJJJJJJJJJ 1

>HistidineMA specific reads:
HWI-ST538:280:C1FE9ACXX:5:2313:4726:24348 1:N:0:AGTCAA + HistidineM1 44 AATAAATAGAATAATGATGAATGGAAAAAAGAGAAAATCCTTTAGCTGGATAAGGGGCGGATGTAGCCAAGTGGATCAAGGCAGTGGATTGTGAATCCAC @@CFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJIIJJJJJJJJJIHJJIIJIJJFDDDDDEEDDDDD@CDDCDDDDD1<BACDDDCCACCC>:ACC 1
HWI-ST538:280:C1FE9ACXX:5:2313:4726:24348 1:N:0:AGTCAA + HistidineM2 44 AATAAATAGAATAATGATGAATGGAAAAAAGAGAAAATCCTTTAGCTGGATAAGGGGCGGATGTAGCCAAGTGGATCAAGGCAGTGGATTGTGAATCCAC @@CFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJIIJJJJJJJJJIHJJIIJIJJFDDDDDEEDDDDD@CDDCDDDDD1<BACDDDCCACCC>:ACC 1

HWI-ST538:280:C1FE9ACXX:5:1215:20823:23696 1:N:0:AGTCAA + HistidineM1 68 AAAAAAGAGAAAATCCTTTAGCTGGATAAGGGGCGGATGTAGCCAAGTGGATCAAGGCAGT @CCFFFDFHHHHGHJJJJJJJJJIJJIGHGHHIIIJHFFBBACEEED@CDCDDDDCCBBB? 1
HWI-ST538:280:C1FE9ACXX:5:1215:20823:23696 1:N:0:AGTCAA + HistidineM2 68 AAAAAAGAGAAAATCCTTTAGCTGGATAAGGGGCGGATGTAGCCAAGTGGATCAAGGCAGT @CCFFFDFHHHHGHJJJJJJJJJIJJIGHGHHIIIJHFFBBACEEED@CDCDDDDCCBBB? 1

>PhenylalanineMA specific reads:
HWI-ST538:280:C1FE9ACXX:5:1102:9958:77076 1:N:0:AGTCAA + PhenylalanineM1 138 TCCTCGTGTCACCAGTTCAAATCTGGTTCCTGGCAC CCCFFFFFHHHHHJJHIJJJJJJJJJFHIJJJJJJJ 1
HWI-ST538:280:C1FE9ACXX:5:1102:9958:77076 1:N:0:AGTCAA + PhenylalanineM2 138 TCCTCGTGTCACCAGTTCAAATCTGGTTCCTGGCAC CCCFFFFFHHHHHJJHIJJJJJJJJJFHIJJJJJJJ 1

HWI-ST538:280:C1FE9ACXX:5:2209:15859:82108 1:N:0:AGTCAA + PhenylalanineM1 127 AGGACTGAAAATCCTCGTGTCACCAGTTCAAATCTGGTTCCTGGCAC CCCFFFFFHHHHHJJJJJJJJIJJJJHIJJJJJJJJJHIJJJJIIJJ 1
HWI-ST538:280:C1FE9ACXX:5:2209:15859:82108 1:N:0:AGTCAA + PhenylalanineM2 127 AGGACTGAAAATCCTCGTGTCACCAGTTCAAATCTGGTTCCTGGCAC CCCFFFFFHHHHHJJJJJJJJIJJJJHIJJJJJJJJJHIJJJJIIJJ 1

Isotype_reads.txt will contain the reads that mapped only to genes with identical isotypes, as labeled in the Library File. In the sample run, none of the reads met this criterion, so the resulting Isotype_reads.txt file is empty. However, it would be formatted similarly to the above Family_reads.txt file if it were populated.

GlobIso_reads.txt contains reads that mapped to genes with the same isotype in different genome reference files, though only because this distinction was defined in the library file from the start. All the reads mapping to a global isotype are listed under the name of that isotype, where all the matches for a read are listed consecutively. Although it is somewhat verbose to store every instance of a read instead of condensing the data so that one line represents the read and the genes it maps to, we feel that it is appropriate to include this long-hand form of the data to prevent the loss of information. This way, the lines specify the starting position in the corresponding gene, which might not always be the same depending on the analysis.
Furthermore, this format allows the user to see which members of the group the read mapped to, rather than implying that the read mapped to all of the members of the group, which is not necessarily the case. In fact, for Global Isotype this is rather uncommon: below, the first read does not map to ArginineC1 or ArginineC2, both of which are part of the Arginine global isotype. Below is a sample from the GlobIso_reads.txt file, obtained from the sample run.

>Arginine specific reads:
HWI-ST538:280:C1FE9ACXX:5:1108:14815:76263 1:N:0:AGTCAA + ArginineC3 100 GCGTCCATTGTCTAATGGATAGGACAGAGGTCTTCTAAAC CCCFFFFFHHHHHJIJJJIIIJIJJGJIJJEGHIIIIJJJ 0
HWI-ST538:280:C1FE9ACXX:5:1108:14815:76263 1:N:0:AGTCAA + ArginineM1 100 GCGTCCATTGTCTAATGGATAGGACAGAGGTCTTCTAAAC CCCFFFFFHHHHHJIJJJIIIJIJJGJIJJEGHIIIIJJJ 0

HWI-ST538:280:C1FE9ACXX:5:1309:19990:51827 1:N:0:AGTCAA + ArginineC3 100 GCGTCCATTGTCTAATGGATAGGACAGAGGTCTTCTAAAC CCCFFFFFHHHHHIIIJJJJJJJIJJJIJICGHHIIJIJJ 0
HWI-ST538:280:C1FE9ACXX:5:1309:19990:51827 1:N:0:AGTCAA + ArginineM1 100 GCGTCCATTGTCTAATGGATAGGACAGAGGTCTTCTAAAC CCCFFFFFHHHHHIIIJJJJJJJIJJJIJICGHHIIJIJJ 0

>Asparagine specific reads:
HWI-ST538:280:C1FE9ACXX:5:1111:1839:76285 1:N:0:AGTCAA + AsparagineC1 100 TCCTCAGTAGCTCAGTGGTAGAGCGGTCGGCTG CCCFFFFFHHHHHJJJIIHHIJJJJJHHIJJJI 1
HWI-ST538:280:C1FE9ACXX:5:1111:1839:76285 1:N:0:AGTCAA + AsparagineC2 100 TCCTCAGTAGCTCAGTGGTAGAGCGGTCGGCTG CCCFFFFFHHHHHJJJIIHHIJJJJJHHIJJJI 1
HWI-ST538:280:C1FE9ACXX:5:1111:1839:76285 1:N:0:AGTCAA + AsparagineM1 100 TCCTCAGTAGCTCAGTGGTAGAGCGGTCGGCTG CCCFFFFFHHHHHJJJIIHHIJJJJJHHIJJJI 0

HWI-ST538:280:C1FE9ACXX:5:1309:11844:17432 1:N:0:AGTCAA + AsparagineC1 100 TCCTCAGTAGCTCAGTGGTAGAGCGGTCGGCTG CCCFFFFFHHHGHJJFHIIIIIHJJJJJJIJJJ 1
HWI-ST538:280:C1FE9ACXX:5:1309:11844:17432 1:N:0:AGTCAA + AsparagineC2 100 TCCTCAGTAGCTCAGTGGTAGAGCGGTCGGCTG CCCFFFFFHHHGHJJFHIIIIIHJJJJJJIJJJ 1
HWI-ST538:280:C1FE9ACXX:5:1309:11844:17432 1:N:0:AGTCAA + AsparagineM1 100 TCCTCAGTAGCTCAGTGGTAGAGCGGTCGGCTG CCCFFFFFHHHGHJJFHIIIIIHJJJJJJIJJJ 0

>Cysteine specific reads:
HWI-ST538:280:C1FE9ACXX:5:2114:10203:18024 1:N:0:AGTCAA + CysteineC1 100 GGCGGCATGGCCAAG @C@FFDDFAHGHFBD 0
HWI-ST538:280:C1FE9ACXX:5:2114:10203:18024 1:N:0:AGTCAA + CysteineM2 100 GGCGGCATGGCCAAG @C@FFDDFAHGHFBD 0

Unclassified.txt contains multimap reads that remain unclassified. Reads would be defined as unclassified in the sample run if they mapped to at least two genes of different isotypes, such as a read that mapped to both LeucineC1 and SerineC1, as these two genes do not have a common group name in any of the designated group columns. Generally some of the reads are expected to be unclassified, but due to the limited number of genes represented in the sample library file, this did not occur in the sample run analysis. Thus, the resulting Unclassified.txt file is empty. If there were unclassified reads, however, this file would list them, grouping lines representing the same read together and showing which genes the read did map to.

Read Summary File

The Read Summary file is also generated during the Classify Reads procedure. This file contains a concise summary of the results of the RNA-Seeker procedure. First, the number of reads that were classified to each group is listed to give a general overview of the degree of specificity of the data set. Next, each gene present in the Library file is listed along with the number of reads that mapped to it in each category. This allows the user to make general observations in the data that can be confirmed or expanded upon by looking at the specific reads in the classification files. Note that the sum of the numbers in a column does not necessarily equal the total number of reads that were classified to that group. In fact, this is only the case for the Splice and Gene specific reads. Below is a sample of the read_summary.txt file generated from the sample run.
Number of reads classified to each group:
Total reads: 7383
Splice: 0
Gene: 4405
Family: 2500
Isotype: 0
GlobIso: 478
Unclassified: 0

Name            Splice  Gene  Family  Isotype  GlobIso  Unclassified
ArginineC1      0       0     11      0        0        0
ArginineC2      0       0     11      0        0        0
ArginineC3      0       0     0       0        188      0
ArginineM1      0       0     0       0        188      0
AsparagineC1    0       0     2       0        289      0
AsparagineC2    0       0     2       0        289      0
AsparagineM1    0       3     0       0        289      0
AsparticacidC1  0       2170  0       0        0        0
AsparticacidM1  0       1109  0       0        0        0
CysteineC1      0       13    0       0        1        0
CysteineM1-ee   0       0     0       0        0        0
CysteineM1-eie  0       0     0       0        0        0
CysteineM2      0       45    0       0        1        0
GlutamicacidC1  0       409   0       0        0        0
GlutamicacidM1  0       0     0       0        0        0
GlutamicacidM2  0       0     0       0        0        0
GlutamineC1     0       83    0       0        0        0
GlutamineM1     0       0     5       0        0        0
GlutamineM2     0       0     5       0        0        0
GlycineC1       0       155   0       0        0        0
HistidineC1     0       0     0       0        0        0
HistidineC2     0       0     0       0        0        0
HistidineM1     0       0     2       0        0        0
HistidineM2     0       0     2       0        0        0
LeucineC1       0       0     0       0        0        0
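The note above about column sums can be checked with a short script. The Python sketch below parses a few per-gene rows copied from the sample summary and sums two of the columns. It assumes whitespace-separated rows in the column order Name, Splice, Gene, Family, Isotype, GlobIso, Unclassified; the actual layout of read_summary.txt may differ, and parse_rows and column_sum are hypothetical helper names, not part of RNA-Seeker.

```python
# Column order assumed from the sample summary shown in this section.
HEADER = ["Name", "Splice", "Gene", "Family", "Isotype", "GlobIso", "Unclassified"]

# A few per-gene rows copied from the sample run summary.
SAMPLE_ROWS = """\
ArginineC3 0 0 0 0 188 0
ArginineM1 0 0 0 0 188 0
AsparagineC1 0 0 2 0 289 0
AsparagineC2 0 0 2 0 289 0
AsparagineM1 0 3 0 0 289 0
CysteineC1 0 13 0 0 1 0
CysteineM2 0 45 0 0 1 0
"""


def parse_rows(text):
    """Map each gene name to a dict of {category: read count}."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        counts = [int(value) for value in fields[1:]]
        table[fields[0]] = dict(zip(HEADER[1:], counts))
    return table


def column_sum(table, column):
    """Sum one category over all genes in the table."""
    return sum(row[column] for row in table.values())


table = parse_rows(SAMPLE_ROWS)
print(column_sum(table, "Gene"))
print(column_sum(table, "GlobIso"))
```

For these seven rows the script should print 61 and 1245. The GlobIso column sum (1245) is much larger than the 188 + 289 + 1 = 478 reads classified to GlobIso in the sample run, because a multimapped read is counted once for every gene it maps to.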