SEQSELECTOR VERSION 1.1B MANUAL

Andreanna J. Welch, Andre E. Moura, Annalora Irvine, and A. Rus Hoelzel

14 May 2014

TABLE OF CONTENTS

1. OVERVIEW
    1.1 License and Warranty
    1.2 Citing the Program
    1.3 Conventions Used in This Manual
    1.4 Getting help
2. INSTALLATION
    2.1 SeqSelector Toolset Download
    2.2 Required Software
    2.3 Optional Software
3. SUGGESTED WORKFLOW AND TIPS
    3.1 Understanding the scripts
    3.2 Some basics of python to keep in mind
    3.3 Using the scripts
    3.4 Identifying genes using functional GO annotations
    3.5 Selecting gene sequences using SeqFinder
    3.6 Removing duplicate sequences using SeqFinderRemoveDuplicates
    3.7 Quality control of sequences from the reference
    3.8 Performing a local BLAST search
    3.9 Using AccessionNumberChecker to list all accession numbers for the non-model species genome sequences
    3.10 Obtaining sequences from the BLAST results using ParseBLAST
    3.11 Merging sequences from the reference and non-model species for bait design
    3.12 Quality control with QCMaskedMissing
    3.13 Quality Control with QCComplementarity
    3.14 Quality Control with QCDuplicates
    3.15 Sorting and obtaining stats for fasta files independently
4. DETAILED DESCRIPTION OF SCRIPTS
    4.1 GoSelect
    4.2 SeqFinder
    4.3 SeqFinderRemoveDuplicates
    4.4 AccessionNumberChecker
    4.5 ParseBLAST
    4.6 MergeSeqs
    4.7 QCMaskedMissing
    4.8 QCComplementarity
    4.9 QCDuplicates
    4.10 SequencesSortStats
5. TROUBLESHOOTING
    5.1 Error Messages
    5.2 Unexpected results
APPENDIX 1. COMMAND LINE TUTORIAL FOR WINDOWS
    Getting Started
    Locations of files and directories
    Working with files and directories
    Getting help
APPENDIX 2. COMMAND LINE TUTORIAL FOR MAC/UNIX/LINUX
    Getting Started
    Locations of files and directories
    Working with files and directories
    Getting help
    Permissions

1.
OVERVIEW

The SeqSelector toolset is a suite of scripts for identifying and selecting sequences of interest from the genomes of model and non-model species for use in capture enrichment of next-generation sequencing libraries. The suggested workflow (Figure 1) may start at two different steps. One option is for users to use the toolset to identify genes of interest. In addition to information from the literature and results from previous work, the GoSelect script can be used to search GO term annotation files (available for model species from the Ensembl BioMart database) for keywords related to the function of gene products. Once a suitable set of genes has been identified, the SeqSelector scripts will search the annotated genome sequence files of a working reference species (i.e. an annotated genome sequence for a model or non-model species in GenBank format) for those genes, and return the sequences in a fasta file. Depending on the version of the script, either full or partial gene, exon (CDS), or mRNA sequences will be returned. Baits for sequence capture can be designed from these sequences directly, depending on the evolutionary distance to the species of interest. If desired, these sequences can instead be used as queries in a BLAST search to find the corresponding sequences in the assembled, unannotated genome sequences of a non-model species. The workflow may also begin by bypassing these steps and starting with EST, transcriptome, or other (e.g. ultraconserved or noncoding) sequences from the reference species, which can be used as BLAST queries. The ParseBLAST script can then be used to retrieve these sequences from the genome of the non-model organism.
Depending on the needs of the user, and the evolutionary distance between the reference and non-model species, the sequences from the reference and the non-model species can be merged such that the sequence from the reference species will be returned whenever a sequence was not found in the non-model species. We provide some additional tools for quality control of selected sequences, as well as other tools to facilitate progression through the workflow.

The SeqSelector toolset has been developed to be flexible. Each tool stands alone and may be implemented according to the user’s needs. The SeqFinder script focuses on selecting the sequences of genes from the genome, but non-coding regions may also be introduced into the workflow at the BLAST search/ParseBLAST step, as described above.

The tools were also developed with the aim of being user-friendly, as well as quick and easy to run. The scripts are written in python, although essentially no knowledge of python is required to run them. They will work on any of the most common operating systems, including Windows and Mac OSX, and should run quickly (probably in less than 15 minutes in most cases) on standard desktop or laptop computers. The input files required are either in standard formats (csv formatted text, GenBank, and fasta format files) or produced through the workflow (BLAST search results). We have tried to provide thorough and user-friendly documentation, tips, tutorials, and troubleshooting suggestions, as well as example files and results. We recommend starting with the Suggested Workflow after installation of the required software.

Figure 1. Overview of the SeqSelector workflow demonstrating major steps and associated tools. Gray shading represents steps using information from the working reference genome or additional published and unpublished data.
Blue text and dashed arrows indicate steps to obtain sequences of genes of interest from the reference species genome, while green text and dotted arrows indicate the starting point when EST or transcriptome data are available. Gray dotted arrows indicate an optional step.

1.1 License and Warranty

The SeqSelector toolset is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This software is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of merchantability or fitness for a particular purpose. See the GNU General Public License at http://www.gnu.org/copyleft/gpl.html for more details.

1.2 Citing the Program

Welch AJ, Moura AE, Irvine A, Hoelzel AR. 2014. SeqSelector: Bioinformatic toolset to select genomic regions for capture enrichment of next-generation sequencing libraries from model and non-model species. In review.

Please feel free to email andreanna05@gmail.com for a copy of the manuscript.

1.3 Conventions Used in This Manual

In an attempt to increase clarity we use certain conventions throughout this manual. The terms reference, query, and model species are used interchangeably, and by this we mean a species whose genome sequences are annotated and available in GenBank format. These species may or may not be traditional model species, as long as annotated sequences are available. The terms target, subject, and non-model species are used interchangeably to mean a species whose genome sequences have been assembled, but not yet annotated. Tips that we thought might be helpful for getting started are indented and written in gray. Arrows (→) are used to indicate selections that you should make (generally from interactive dialogue boxes). Commands that should be entered at the command-line are specified in the typewriter font.
Output from the scripts is written in black text with gray shading.

1.4 Getting help

For issues related to installation of the required and optional software, Google is often the best bet since these programs are widely used and generally well supported. This manual contains several sections with helpful information on how to run the SeqSelector toolset. If you are unfamiliar with executing programs from the command line, you may wish to start by going through the command line tutorial for Windows or Mac/Unix/Linux (Appendix 1 and 2). The suggested workflow provides step-by-step instructions and includes tips and additional information. There is also a detailed description for each script, which contains important information on input files, settings, and output files. Finally, at the end of the manual there is a Troubleshooting section, which contains a list of the most frequently encountered errors and some suggestions of what to do if you receive unexpected output from the scripts. If you require additional assistance, or have suggestions for how to improve the usefulness of these tools, you can email Andreanna Welch at andreanna05@gmail.com.

2. INSTALLATION

If you are not familiar with operating from the command line on your computer, you may wish to start with the Command Line tutorials in Appendix 1 for Windows and Appendix 2 for Unix, Linux, and Mac.

2.1 SeqSelector Toolset Download

Programs are available at: https://sourceforge.net/projects/seqselector/

Programs are written in the python scripting language. Python is platform independent and works on all common operating systems, including MacOS, Windows, Unix, and Linux.

2.2 Required Software

Python 2.5 - 2.7
*Note: The code was written in python 2.7 and may or may not be compatible with python 3.
There are many free options for obtaining python:
– Downloads are available at the python website http://www.python.org/download/ which includes installers for Windows and Mac.
If you use Unix or Linux, you will need to compile from the source code when using this option.
– Enthought python distribution. This is a user-friendly distribution that includes helpful modules (like NumPy), and is free for academic users:
https://www.enthought.com/products/epd/
https://www.enthought.com/products/canopy/academic/
– NumPy provides installers for Windows and Mac that include python 2.7
http://www.scipy.org/scipylib/download.html

NumPy 1.5 and above
There are many options for obtaining NumPy:
– NumPy provides installers for Windows and Mac, as well as source code for installation on other platforms
http://www.scipy.org/scipylib/download.html
– NumPy also comes with the Enthought python distribution, which is free for academic users:
https://www.enthought.com/products/epd/
https://www.enthought.com/products/canopy/academic/

Biopython
You can find downloads and installation instructions for biopython at http://biopython.org/wiki/Download
Biopython installers are available for Windows, but not Mac. Mac users will have to install Apple’s XCode tools and the optional command line tools (see http://docwiki.embarcadero.com/RADStudio/XE4/en/Installing_the_Xcode_Command_Line_Tools_on_a_Mac), which are now available at the App Store. The download is large and may require free registration as an Apple Developer, but the package includes installers and should not be difficult to obtain. The installation instructions for biopython should provide quick and easy installation.

Test your installation

For Windows:
Go to Accessories → Command Prompt to open the command prompt window
Type python to start python. Some information about version should be printed to the screen.
Type import numpy to test for NumPy installation
Type from Bio import SeqIO to test for Biopython installation

For Mac:
Go to Applications → Utilities → Terminal to open the terminal window
Type python to start python. Some information about version should be printed to the screen.
Type import numpy to test for NumPy installation
Type from Bio import SeqIO to test for Biopython installation

2.3 Optional Software

NCBI BLAST+ 2.2+ Command Line Applications

Installation
Manual with installation and usage instructions: http://www.ncbi.nlm.nih.gov/books/NBK1763/
Download Site: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/
*Note: You may choose to follow the steps to configure BLAST+ so that you can call it from any directory. If you choose not to configure BLAST+ this way, you will need to execute BLAST search commands from the directory where the executable resides. To make things simple, keep your BLAST+ executable, BLAST database, and query files together in the same directory. If you keep your database and query files somewhere else, you will have to provide the path to the database and query files when issuing a command.

Selecting a BLAST database to use
A) You can download and use preformatted BLAST databases if these are suitable for your purposes. Available databases include human genomic, refseq genomic, other genomic, and the non-redundant database. These databases can be found at the NCBI ftp site ftp://ftp.ncbi.nlm.nih.gov/blast/db/ The BLAST+ program comes with scripts that can be set up to update the database files regularly, and to only download files with a newer version than the local installation. For more information see the readme file at ftp://ftp.ncbi.nlm.nih.gov/blast/db/README In this case you will need to use the blastdbcmd program to generate the original FASTA files for using the ParseBLAST scripts.
B) You can also build your own BLAST database using fasta files from your species of interest using the makeblastdb program. See the section called “Building a BLAST database with local sequences” in the BLAST+ help manual/cookbook. If genomic sequences of your species of interest are available on GenBank, perform a search (e.g.
using the advanced search feature) that returns the sequences you are interested in. Download all of the sequences to a single file by clicking the “Send to:” button in the upper right hand corner of the search results screen, and then selecting “file” and then “fasta” format. If you don’t click any checkboxes then all of the sequences will be downloaded, otherwise only the selected sequences will be downloaded. Use the makeblastdb program to create the BLAST database from the fasta file:

    makeblastdb -in MySpecies.fasta -dbtype nucl -parse_seqids -out MySpecies -title "MySpecies Genome Contigs"

Testing your blast database
Test your database using a command like this:

    blastn -db MySpecies -query QuerySequence.fasta -out Results.out

RepeatMasker

RepeatMasker (http://repeatmasker.org/) can identify and mask repetitive and low complexity regions of sequence, and is a handy tool for quality control of sequences before bait design. A web server version is available at http://repeatmasker.org/cgi-bin/WEBRepeatMasker, which should be sufficient in most cases. Small datasets will be processed immediately and the results displayed in the web browser. However, for larger datasets (which you are likely to use), you will need to select ‘email’ as the return method; the run will then be queued and you will be notified upon its completion. For extremely large datasets it may be beneficial to install a local copy, although this is available for Unix-based systems only. See http://repeatmasker.org/RMDownload.html for more details.

3. SUGGESTED WORKFLOW AND TIPS

3.1 Understanding the scripts

Python scripts are just plain text documents that contain a series of commands for the computer to perform. There are two ways to use the SeqSelector scripts. The first method is to simply execute the script (see next section) and change the settings interactively to match your filenames and desired options.
The second method is to open the script in your favorite text editor and change the initial settings there directly. This way when you execute the script, all of the settings will be correct. You can also optionally turn off the interactive mode completely. The second method has two benefits: 1) if you need to re-run the script (or change just one or two settings) you will not have to re-enter all of the settings at the prompt; 2) you can save a copy of the script for each run, and therefore keep a record of your work. The second method does have one slight drawback, however. In order to input your settings into the script directly, you will have to follow some simple python rules (see next section).

If you would like to input your settings directly in the script, then open the script using a text editor such as WordPad or TextEdit. There are also more advanced text editors available (see tip below), which are often free. Once the script is open you will see that it is set up with different sections. The first section at the top of each file contains some housekeeping commands. This is because certain commands are needed (such as module import) before other input can be accepted. The second section, called “Initial Input”, is where you can specify your filenames and desired options. Be sure to save the file after changing any filenames or options, or when you execute the script it will execute the last version that was saved (without your changes). The final section contains the rest of the script, which carries out your commands.

    Tip: For Mac, TextWrangler is a particularly nice text editor. For Windows, Notepad++ is very practical and versatile.
    Tip: AVOID OPENING SCRIPTS IN MICROSOFT WORD or other word processing programs as this could introduce ‘invisible characters’ to the file, which will cause errors.
    Tip: If you open the scripts and it looks like a jumble of text without line breaks, the line endings may be an issue.
Windows uses a carriage return plus a line feed character to represent line endings, while Mac and Unix use just the line feed character. The easiest way to avoid this problem is to open the files in a smarter text editor such as WordPad for Windows or TextWrangler for Mac.
    Tip: Only change settings in the “Initial Input” section or the script may not work.
    Tip: Be careful when selecting the outfile names as well. If you accidentally select the same name as one of the input files, that file will be overwritten.
    Tip: Script files can be saved with different names if you want to keep a record of your analyses.

3.2 Some basics of python to keep in mind

If you decide to enter your settings interactively after executing the script, you can probably skip most of this section, although it may still be helpful to have an idea of some of the basics of python. The only thing you must keep in mind is that python is case sensitive, except for some situations where the script has been written to get around this.

If you would like to enter your settings in the script file directly then this section contains important information for you.

First, as mentioned above, python is case sensitive. There are some situations in which the scripts have been written in a way to make them case insensitive, and when this is true it will be clearly specified. All other times you should assume that case is important.

Second, words or phrases should have quotes around them. These are also known as strings. Depending on the content of the word/phrase, single, double, or triple quotes can be used. The key is to make sure that the type of quote used matches on both sides of the text.
    Tip: If your phrase contains double quotes, then use single quotes around the outside, and vice versa.
    Tip: If your phrase contains both double and single quotes, then use triple quotes around the outside.

Third, when entering numbers that should be treated as numbers (e.g.
minimum sequence length, number of Ns when considering missing data) then numbers should not have quotes around them. Putting quotes around numbers will result in them being treated as text, which could cause errors depending on their usage in the script.

Finally, when entering settings directly into the scripts, the # is the comment symbol, and any text following it will not be executed. Therefore, you can use the # to write notes or comments in the file.

3.3 Using the scripts

The scripts are executed from the command line. See the Windows and Unix tutorials for help on using the command line if you are not already familiar with it.

For Windows:
Go to Accessories → Command Prompt to open the command prompt window
Navigate to the appropriate folder
Type the name of the script
    Tip: You may need to type ./ or python in front of the script name

For Mac:
Go to Applications → Utilities → Terminal to open the terminal window
Navigate to the appropriate folder
Type the name of the script
    Tip: You may need to type ./ or python in front of the script name
    Tip: If you get an error about permissions, see the command line tutorial for how to fix this

If a run is going and you decide you would like to stop it, you can kill the run by pressing control c (for Windows or Mac) or control ScrLock (for Windows). This will usually cause python to output several errors before returning to the command prompt.

The script will start by printing its name and the list of initial settings. You will be given a prompt to either change the settings or to type run to start the analysis. To change a setting, just type its name. The setting name will be on the left side of the equals sign and the current setting will be on the right.

When using the scripts, you will need to specify the location and name of the input files. If you keep the input files in the same folder as the script then you simply need to type the name of the input files in the script.
If you keep the input files somewhere else, you will need to specify the entire path to the file (e.g. something like /YourName/Desktop/Folder/Input.fa). See the Windows and Unix tutorials for more on this.
    Tip: Spaces in the names of files or folders can be difficult to handle and should probably be avoided; use underscores instead.
    Tip: Copying and pasting the names of files and directories is a helpful way to prevent errors caused by typos. Windows may not allow you to copy and paste at the command line.

In cases where a setting has a small number of options (e.g. yes or no), the script will prompt you to try again if your input is not valid. In cases where the options are unlimited (e.g. filenames) the script will not know there has been an error until the analysis is started. If that happens you will need to execute the script again and re-enter all of your settings.

3.4 Identifying genes using functional GO annotations

Downloading GO annotation files

Candidate genes for targeted sequence capture can be identified in various ways, including literature searches and from the results of previous experiments. It may also be useful to identify genes by searching for a particular annotated function. Gene Ontology (GO) terms are a standardized vocabulary for specifying gene product functions (see http://www.geneontology.org/ for more details). The genome sequences of many model organisms have been annotated with GO terms and these annotations can be downloaded and searched.

To download GO term annotations for genes of a reference species:
Go to the Ensembl BioMart database (http://www.ensembl.org/biomart/).
Under database choose Ensembl Genes, and under dataset select the reference species of interest.
Click on attributes on the left side of the screen and select relevant options (note that attributes are added in the order that you select them), such as:
Under Gene:
    o Ensembl Gene ID
    o Ensembl Transcript ID
    o Description
    o Chromosome Name
    o Gene Start (bp)
    o Gene End (bp)
    o Associated Gene Name
Under External:
    o GO Term Accession
    o GO Term Name
    o EMBL (GenBank) ID
When finished, click on Results on the left. GO term annotations can be downloaded by selecting ‘Export all results to’ → File → TSV → Go.
Note: It’s important to download the results as TSV (tab delimited) and not CSV (comma separated values) because some of the GO term descriptions contain commas, which interfere with parsing the file.

Using GoSelect to search for functions

The GoSelect.py script will identify all entries (lines) of the GO term annotation file that contain a particular gene name or function of interest. See the detailed description of the GoSelect script for more information, and example files on SourceForge. Users can choose to search for a single term, or to search for two terms whose relationship is defined using the Boolean operators and, or, and not. The makeunique setting can be used so that the script also returns a list of results for only the first instance of a gene or transcript ID. In this way a list of only the unique genes/transcripts that contain the search term(s) in the function or gene name is returned. The full results file is useful to look at because a single gene may have several functions of interest. However, it can be difficult to tell from this file how many unique genes were found with your search terms. This information is much easier to see from the unique results file. Finally, the geneslist option can be set so that after the search, the unique list of gene names is returned in the format required for the SeqSelector scripts. To search within the results for a further subset of genes of interest, simply run the script again on the results file.
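The search logic described above can be sketched in a few lines of python. This is only an illustration and not the GoSelect.py code: the function names (row_matches, search_annotations, first_instances) are hypothetical, the example rows are invented rather than real BioMart output, and two behaviours are assumptions made for the sketch, namely that matching is a case-insensitive substring search and that the "not" operator means "contains the first term but not the second". The sketch is written for Python 3, whereas the SeqSelector scripts themselves target python 2.5 - 2.7.

```python
import csv
import io


def row_matches(row, term1, term2=None, operator="and"):
    """Return True if the row's text satisfies the one- or two-term search."""
    text = "\t".join(row).lower()  # assumption: case-insensitive substring match
    has1 = term1.lower() in text
    if term2 is None:
        return has1
    has2 = term2.lower() in text
    if operator == "and":
        return has1 and has2
    if operator == "or":
        return has1 or has2
    if operator == "not":
        return has1 and not has2  # assumption: term1 present, term2 absent
    raise ValueError("operator must be 'and', 'or', or 'not'")


def search_annotations(handle, term1, term2=None, operator="and"):
    """Read a tab-delimited annotation file; return (header, matching rows)."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    hits = [row for row in reader if row_matches(row, term1, term2, operator)]
    return header, hits


def first_instances(rows, id_column=0):
    """Keep only the first row seen for each ID, preserving file order."""
    seen = set()
    unique = []
    for row in rows:
        if row[id_column] not in seen:
            seen.add(row[id_column])
            unique.append(row)
    return unique


# Invented example rows (not real BioMart output), for illustration only.
SAMPLE = (
    "Ensembl Gene ID\tAssociated Gene Name\tGO Term Name\n"
    "ID_0001\tTPK1\tthiamine metabolic process\n"
    "ID_0001\tTPK1\tATP binding\n"
    "ID_0002\tGENE_B\tATP binding\n"
    "ID_0003\tGENE_C\timmune response\n"
)

header, hits = search_annotations(io.StringIO(SAMPLE), "ATP", "thiamin", "not")
unique = first_instances(hits)
# Gene names in a single comma-separated row, the layout SeqFinder expects.
print(",".join(row[1] for row in unique))
```

Reading the file with a tab delimiter is what makes commas inside GO term descriptions harmless, which is the reason for downloading TSV rather than CSV.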
You may wish to examine and manipulate the search result text files in Excel. If you decide not to include some genes, you can either remove them from the list of unique genes created for the SeqFinder tool (if you opted to make this file), or you can delete the corresponding rows in Excel. Your final list of selected genes can be exported from Excel in the csv format required by the SeqFinder script by using the ‘Save As’ function and selecting .csv format.
Note: For the SeqFinder script, the gene names must be in a single row, and separated by commas (csv format).
    Tip: If gene names are in a vertical column in Excel, highlight them and then use Edit → Copy → Paste Special → Transpose to paste them into a single row on a new sheet.
Repeat as necessary until a suitable set of target genes has been identified.

3.5 Selecting gene sequences using SeqFinder

Running SeqFinder

Once you have a list of interesting genes in csv format (all gene names in a single row, separated by commas; see example files on SourceForge), the SeqFinder scripts can be used to select the sequences for these genes from the annotated genome of a reference species. This script requires the genome sequences to be in GenBank format, but there can be many GenBank entries per file, and many files can be used (e.g. one file with multiple GenBank entries per chromosome).

Genome sequences are available from the GenBank ftp site (ftp://ftp.ncbi.nlm.nih.gov/genomes/). Navigate to the genome of interest and you should find a list of directories labeled something like “CHR_01”. Within each directory select the GenBank file to download, which is the file that has the .gbk.gz extension. These are gzip compressed files, which will need to be uncompressed before you can use them.
    Tip: Uncompressing .gz files – On Macs you should be able to double click the file to uncompress it. On Windows, you may be able to double click the file to uncompress it as well. However, if double clicking doesn’t work (e.g.
if a file full of strange symbols appears) then you will need to download and install the free 7-zip program from http://www.7-zip.org/ (simply select the .exe file and install). To unzip a file, navigate to the installation location (default will be something like C:\Program Files\7-Zip\) and double click on the .exe file to open the 7-zip utility. Click the Add + button and find the file to unzip. Click the Extract – button to extract the file.

If you have annotated sequences that are not in GenBank format yet, but that you would like to use as a reference sequence, GenBank format files can be created using NCBI’s Sequin program (available at http://www.ncbi.nlm.nih.gov/guide/data-software/#submissions_). For example, this software takes fasta format sequence files and gff annotation files to create a GenBank-style file. You can create these files locally without submission to the GenBank database.

Select the SeqFinder script that best suits your needs. SeqFinderExons will return exon sequences only. In the case where multiple transcripts have been annotated in the GenBank file, you can choose to return the exon sequences for the longest transcript, the shortest transcript, or all transcripts. SeqFinderWhole will return the entire gene sequence, including introns (default), or optionally it will return all CDS sequences or mRNA sequences instead. SeqFinderPartial will return the first N bp of a gene (exons and introns), where N is a length set by the user.

The genome sequences of many model species have been assigned to their respective chromosomes; however, the sequences from on-going or recently completed sequencing projects of non-model species may not have been. The SeqFinder scripts are set up to allow you to specify a number of files for autosomal chromosomes in cases where that is possible (assumes they are contiguous, and start with chromosome 1).
In addition to or instead of this you can specify files for sex chromosomes, sequences that have unidentified locations, or a subset of chromosome files (e.g. files for chromosomes 21 and 22). For more on this, and other options, see the detailed description of the SeqFinder scripts. Once all the options are set, execute the script by typing run.
    Tip: Keep all of the chromosome files, the csv file with your genes of interest, and the SeqFinder script in the same directory to simplify specification of file names.

Inspecting the results

SeqFinder will output a variety of files when the run is finished. The first file to look at is the Stats file. This gives information about the settings and the results, including how many genes were found and not found, as well as some summary information about average and total length, and GC content. It will also specify sequences for which there were annotation uncertainties (e.g. where the start and stop codons are not precisely known), and sequences with a user-specified amount or more of missing data. It will also list sequences that were found multiple times. This may be because the sequences are duplicated and/or annotated in multiple locations in the genome, or because you selected to have the sequences for all transcripts (or CDS or mRNA) returned. The Info file contains the list of genes found in each input file. The sequences found will be in the .fa file (in the order in which they were found) or sorted.fa file (sorted by name). See the detailed description of the SeqFinder script for more information about output files.

If some genes were not found, it may be that the gene name differs between the annotations in the GO file and the annotations in the GenBank file. There are several ways to resolve this. You may be able to identify synonyms for the gene by going to the UCSC Genome Browser (http://genome.ucsc.edu/). Once there, click “Genomes” in the top left corner and search for the gene in your reference genome.
From the results you can compare the annotations to other species. Clicking on the transcript name and/or the protein name may also help identify synonyms.

If you know which chromosome the gene should be located on, and if the GenBank file for that chromosome is not too large, it can be opened in TextEdit or WordPad and you can search the file for a word from the gene name or function. For example, if the gene TPK1 is not found, try searching the file for ‘thiamin’ instead to determine the synonym. Alternatively, this may be done at the GenBank website as well.

Repeat as necessary until you have identified the appropriate gene names in the GenBank files.
    Tip: To reduce the number of files used for further steps, run SeqFinder one last time on the final list of gene names.

Multiple transcripts

If you chose to return all transcripts, you can use some criteria other than length (e.g. experimental support) to decide which transcript to retain. The GenBank files will contain information on the support for each mRNA for a gene, as well as specify the accession number for the associated CDS. Using the accession number, you can then look in the GenBank file to determine which of the returned transcripts corresponds to the transcript you selected. SeqFinder writes the CDS sequences in the order of the GenBank file; therefore, if the first CDS has the accession number that corresponds to the transcript with the most support, then keep the first exon listed in the UNSORTED output file.
    Tip: The mRNA and associated CDS are not necessarily listed in the same order in the GenBank file (e.g. the first mRNA for a gene may or may not be represented by the first CDS for a gene).
    Tip: The online version of the GenBank file should be most up-to-date.

3.6 Removing duplicate sequences using SeqFinderRemoveDuplicates

The SeqFinder script returns the sequence of a gene each time it is annotated in the reference genome (e.g.
when the same gene is annotated on multiple chromosomes). You can use the SeqFinderRemoveDuplicates script to search for sequences in the SeqFinder results file that have the same name, keep one copy, and discard the extras. This script was written with duplicate genes in mind, and therefore it compares different sets of exons found at different genomic locations (although the duplicated genes may be adjacent to each other). The input sequence file for this script should therefore be the file sorted by genome location and NOT the file sorted by name.

Using the SeqFinderRemoveDuplicates script, you can choose whether to keep the duplicate with the largest number of exons or the duplicate with the longest total sequence length. In cases where multiple duplicates have the same number of exons or the same total sequence length, the first occurrence is retained.

The input files for this script are the list of genes in csv format used for SeqFinder and the UNSORTED output sequence file in fasta format from SeqFinder. Note that the script uses the gene name to identify exons. Since SeqFinder inserts a species identifier before the locus name, you will have to specify this identifier in the settings. See the detailed description of the SeqFinderRemoveDuplicates script for more information.

During the run, the script will print the name of the gene that it is working on, and some indication of whether or not duplicates were found. This information is also written to the Info file when the run has completed. The resulting sequences (minus the duplicates) are written to a fasta format file in the order that the genes are given in the csv formatted list of genes, and also to a fasta file that is sorted by name. As usual, some summary information about the sequences, such as the total number of sequences retained, average and total length, etc., is written to the Stats file.
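The selection rule described above (keep the copy with the most exons or the longest total length, and keep the first copy on a tie) can be sketched in a few lines of Python. This is only an illustration of the rule, not the code used by SeqFinderRemoveDuplicates; the function name pick_duplicate and the representation of each copy as a list of exon sequences are assumptions made for the example.

```python
def pick_duplicate(copies, criterion="exons"):
    """Choose which copy of a duplicated gene to keep.

    copies    -- list of copies in file order; each copy is a list of
                 exon sequences (hypothetical representation)
    criterion -- "exons" (most exons) or "length" (longest total length)
    """
    if criterion == "exons":
        score = len
    elif criterion == "length":
        def score(exons):
            return sum(len(exon) for exon in exons)
    else:
        raise ValueError("criterion must be 'exons' or 'length'")
    # max() returns the first maximal item, so the first occurrence
    # is retained when two copies tie.
    return max(copies, key=score)


# Made-up sequences for illustration only.
copies = [["ATGC", "GG"], ["ATGCATGC"], ["AT", "GC"]]
print(pick_duplicate(copies, "exons"))   # first copy with two exons
print(pick_duplicate(copies, "length"))  # copy with the longest total length
```

Because ties go to the first occurrence, the order of sequences in the input file matters, which is one reason the script expects the UNSORTED SeqFinder output.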
As noted above, this script was written particularly to compare duplicated genes, and it may not remove single duplicated exons, although this latter instance should be relatively rare. If you are only interested in single copy loci for your project, you may wish to remove the duplicated genes from the list of genes and then re-run SeqFinder. This will help avoid the inclusion of duplicated genes in your dataset.

3.7 Quality control of sequences from the reference

If you are only selecting sequences from your working reference genome, and you don’t need to use the rest of the workflow, then you should proceed to the sections on final Quality Control to finalize your sequences before bait design.

If you are going to use the sequences selected from the reference genome (whether they were obtained using the SeqFinder tools or from previously available data such as EST or transcriptome sequences) as queries to find the corresponding sequences in the unannotated genome of a non-model species, then it might be helpful at this point to investigate whether the sequences contain repetitive or low complexity regions, as these will return many or low quality BLAST hits. This is particularly important if your sequences are non-coding. See the section on Quality control with QCMaskedMissing for information on how to use RepeatMasker to identify these regions and then discard sequences with a certain threshold of masked or missing data.

3.8 Performing a local BLAST search

The sequences that were selected from the genome of the working reference species (including any other sequences, such as EST or transcriptome sequences) can be used as queries for a BLAST search to find corresponding sequences in an un-annotated genome of a non-model (subject) species. This works best for sequences of reasonable lengths; performing a BLAST search for a 50 kb sequence of a whole gene may not be feasible.
See the NCBI BLAST+ installation instructions (Section 2.3) for how to make a local database for sequences from the non-model species.

Tip: Either keep the NCBI BLAST+ executable in the same directory as your local database and the output from SeqFinder, or verify that the BLAST+ program is in your path. On Mac, type echo $PATH at the command prompt and verify that the directory containing the BLAST+ program is listed.

Conduct the local BLAST search using this command:

    blastn -db DatabaseName -max_target_seqs 1 -outfmt "10 qacc sacc evalue bitscore pident qstart qend sstart send" -query SeqFinderResultsFile.fa -out BLASTResults.csv

Where DatabaseName is the name of the local BLAST database you created, SeqFinderResultsFile.fa is the fasta format file of sequences you wish to find (queries), and BLASTResults.csv is the name you designate for the BLAST results file in csv format. This outputs the BLAST results in the proper format for the ParseBLAST.py script.

Tip: It is important not to change the -outfmt options (including their order) or the ParseBLAST script will not work properly.

Tip: You may also want to conduct the BLAST search and return the standard BLAST output for visual inspection. Note that this format will NOT work for the ParseBLAST script. To return the full output, use this command:

    blastn -db DatabaseName -query SeqFinderResultsFile.fa -out BLASTResults.out

Where DatabaseName is the name of the local BLAST database you created during installation, SeqFinderResultsFile.fa is the fasta format file of sequences you wish to find, and BLASTResults.out is the name you designate for the BLAST results file in the full output format.

3.9 Using AccessionNumberChecker to list all accession numbers for the non-model species genome sequences

The results from the BLAST search (see above) give the accession number of the non-model (subject) genomic sequence where a match was found for the reference (query) sequence.
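To see which subject accession numbers occur in a results file, the csv output can be read with a short Python 3 sketch like the one below. It assumes the nine columns appear in the order given in the -outfmt string of Section 3.8; the function names and the example rows are made up for illustration and are not part of the SeqSelector toolset.

```python
import csv
import io

# Column order taken from the -outfmt string in Section 3.8.
BLAST_COLUMNS = ["qacc", "sacc", "evalue", "bitscore", "pident",
                 "qstart", "qend", "sstart", "send"]


def read_blast_csv(handle):
    """Return one dict per BLAST hit, with numeric fields converted."""
    hits = []
    for row in csv.DictReader(handle, fieldnames=BLAST_COLUMNS):
        for name in ("evalue", "bitscore", "pident"):
            row[name] = float(row[name])
        for name in ("qstart", "qend", "sstart", "send"):
            row[name] = int(row[name])
        hits.append(row)
    return hits


def subject_accessions(hits):
    """List each subject accession once, in order of first appearance."""
    seen = []
    for hit in hits:
        if hit["sacc"] not in seen:
            seen.append(hit["sacc"])
    return seen


# Made-up rows for illustration only (not real BLAST output).
example = io.StringIO(
    "Ref_GENE1,ACC0001,1e-50,200.0,95.5,1,300,1200,1499\n"
    "Ref_GENE2,ACC0002,0.0,900.0,99.0,1,1000,1702,703\n"
    "Ref_GENE3,ACC0001,2e-20,110.0,88.0,1,150,40,189\n"
)
hits = read_blast_csv(example)
print(subject_accessions(hits))
```

For a real file, replace the io.StringIO object with an opened file, e.g. open("BLASTResults.csv", newline="").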
The ParseBLAST script (see next section) uses this accession number to find the correct subject genome sequence, and then extracts the appropriate part of it. Therefore, in order for ParseBLAST to work correctly, it needs to know how to find the accession number amongst all of the information given after the > sign in the description for each sequence in the subject genome fasta file. It is helpful to look at the descriptions of the subject genome sequences, but often these files are too large to open with a text editor. The AccessionNumberChecker script takes the genome sequence file in fasta format as input and will return the description line for every sequence in a file to a separate text file. You can scroll through this file to see if the formats of the description lines are consistent for all sequences. See the detailed description of the AccessionNumberChecker script for more information.

3.10 Obtaining sequences from the BLAST results using ParseBLAST

Running ParseBLAST

Using the output from the BLAST search (in the format as described above), the ParseBLAST scripts return the best sequences identified, based on a threshold e-value set by the user and the bit score, which reflects the quality of the alignment between the query and the subject sequences. In cases where the sequence from the non-model species does not match the full length of the sequence from the reference species, the ParseBLAST script will return additional sequence upstream and/or downstream until a certain length is reached. For ParseBLASTWhole, a sequence of similar length to the query sequence will be returned. For ParseBLASTPartial, sequences of a particular length will be returned (e.g. all will be ~1000 bp). For ParseBLASTExons, a sequence of similar length to the query sequence will be returned (similar to ParseBLASTWhole), but the output will contain some information that is more relevant to exons.
Note that because of insertions/deletions between the sequences of the reference and the non-model species, the lengths of sequences returned may not be exactly identical to the length of the query or to the length specified.

Three input files are needed in order to run this script: the BLAST results (Note: these must be in the format described above or the script will produce nonsensical results), the query sequences used in the BLAST search, and the fasta file of sequences represented in the BLAST database for the non-model species. If you created your own local BLAST database, it is simply the fasta file that you used to do this. If you used a precompiled database from NCBI, then you will need to use the blastdbcmd program to generate the original FASTA file (see the BLAST+ manual at http://www.ncbi.nlm.nih.gov/books/NBK1763/ for more information).

Accession number identification methods

For each line of the BLAST results, ParseBLAST looks in the non-model species genome file for the sequence with the accession number that matches that given in the BLAST hit. Here is an example of a BLAST search result, in which the accession number of the hit (HQ420351) is the second field:

    Reference_CYTB,HQ420351,0.0,100,1,1000,1702,695

The genome sequences of the non-model species are in fasta format, and contain a line like this before each sequence (the accession number is again HQ420351):

    >gi|317409317|gb|HQ420351.1| Pterodroma sandwichensis

Therefore, to run ParseBLAST, you need to tell it how to find the accession number (or just the name if your sequences don’t have an accession number), in order to return the proper sequences. AccessionNumberMethod1 assumes that the accession number is the only text between the > and the first space for each sequence, and therefore no other options are required.
AccessionNumberMethod2 assumes that the accession number is found at the exact same position for each sequence name and is always the exact same length, although the accession number may not be the only text in the description. For this method you need to enter the exact starting and exact ending position of the accession number (not including the >). AccessionNumberMethod3 attempts to find the accession number based on some information you provide (query and offset). This works best if your accession numbers have some consistent value in them, but will work fine if they differ in position in the line or in length. For more information see the detailed description of the ParseBLAST script.

Tip: Inputting the accession number method settings correctly may take a couple of tries. The ParseBLAST script outputs to the screen each accession number it is searching for. If the output to the screen does not match the format of the accession numbers in the fasta file used to make the local BLAST database, kill the run by typing control-c and adjust the settings accordingly.

Inspecting the results

ParseBLAST will output a variety of files when the run is finished. The first file to look at is the Stats file. This gives information about how many loci/exons were found and not found, as well as some summary information about the % identity, average and total length, and GC content. It will also return a list of sequences that were below the minimum length specified (although it does not remove them at this point; you can do that later using the QCMaskedMissing tool), sequences that may have been found multiple times, and sequences with a user-specified amount of missing data or more. The UniqueBLASTtable file will also be helpful to look at. This file is sorted by the % identity between the query reference sequence and the corresponding sequence in the non-model species, and also gives the e-values for each hit.
The final column is the % match length between the query reference sequence and the corresponding sequence in the non-model species. Sequences with short regions of alignment and/or low identity may be spurious. The Info file gives the accession number that each sequence was retrieved from. The sequences from the non-model species will be output to the .fa file (in the order that the sequences were found) or the sorted.fa file (sorted by sequence name).

ParseBLAST will return a single subject sequence for all query sequences found during the BLAST search (assuming the e-value for the hit is below the user-defined threshold). If sequences for some of your loci are missing at this point, there are several reasons why they may not have been found during the BLAST search: they may be missing from the genomic sequences obtained for the non-model species (even though they may in reality be present in the genome) because of low coverage, the sequences may contain low complexity or repetitive regions masked by BLAST, the sequences may be too divergent, or they may have been lost from the genome of the non-model species. If desired, you can identify additional genes/loci to take the place of sequences that were not found. Or, if the genes are of particular interest, you may choose to retain the sequences from the reference species for those loci and use them for bait design, under the assumption that coverage may be an issue. In this case, the MergeSeqs script may be of use (see next section).

Tip: If you used EST or other sequences not obtained with the SeqFinder tool as the query and the sequence identifiers (the text after the > in the fasta file) look strange, then you may need to change the way you have specified the query species. See the detailed description of the ParseBLAST script for more information.
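The first two accession number identification methods described in this section can be illustrated with a minimal Python sketch. It is not the ParseBLAST code: the function names are hypothetical, and for the position-based method the positions are assumed to be 1-based and inclusive, counted after the > sign (check the detailed description of ParseBLAST for the convention the script actually uses). AccessionNumberMethod3 is not sketched because its query and offset settings are only described in the detailed section.

```python
def accession_method1(header):
    """Method 1 idea: the accession is all text between > and the first space."""
    description = header.strip().lstrip(">")
    return description.split(" ", 1)[0]


def accession_method2(header, start, end):
    """Method 2 idea: the accession sits at fixed positions in every header.

    start and end are assumed to be 1-based, inclusive, and counted
    without the leading > (an assumption made for this sketch).
    """
    description = header.strip().lstrip(">")
    return description[start - 1:end]


# Header taken from the example in this section.
ncbi_header = ">gi|317409317|gb|HQ420351.1| Pterodroma sandwichensis"
# Hypothetical header in which the accession is the only text before the space.
simple_header = ">HQ420351 Pterodroma sandwichensis"

print(accession_method1(simple_header))
print(accession_method2(ncbi_header, 17, 24))
print(accession_method1(ncbi_header))
```

The last call shows why the method matters: applied to the NCBI-style header, the first-space rule returns the whole gi|...| string rather than the accession number reported in the BLAST results, so no sequence would be matched.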
3.11 Merging sequences from the reference and non-model species for bait design

If the sequences for some loci could not be found during the BLAST search, you may decide to design baits from the sequences of the reference genome instead, assuming the sequences are not too divergent from your species of interest. The MergeSeqs script opens the file of query sequences obtained from the reference genome, and then looks for a corresponding sequence from the subject or non-model species genome. If a sequence was found for the non-model species, it is written to the results file; otherwise the sequence from the reference species will be returned instead. Note that the matching is based on the locus name. Since SeqFinder and ParseBLAST both insert a species identifier before the locus name, you will have to specify these before running the script. See the detailed description of the MergeSeqs script for more information.

While the script is running it will output the locus names that it is looking for. If these are not in the correct format, simply kill the run by typing control-c, adjust the species identifiers as necessary and try again.

3.12 Quality control with QCMaskedMissing

Before designing baits from the selected sequences some quality control may be helpful. Baits designed from sequences with repetitive or low complexity regions will hybridize nonspecifically to similar regions across the genome rather than your locus of interest. Additionally, sequences much shorter than the length of baits in your kit may need to be removed or they will have to be heavily padded. The QCMaskedMissing tool will help you investigate these issues in your set of selected sequences.

To identify and mask repetitive or low complexity regions use the RepeatMasker program. The webserver version available at http://repeatmasker.org/cgi-bin/WEBRepeatMasker will allow submission of large fasta format datasets and will email you a message when the results are ready.
Select the rmblast option, which is compatible with NCBI’s BLAST program, and select your desired speed/sensitivity (default should be fine in most cases). For the return format, you can select ‘html’ to have the results displayed in a web browser, in which case you will need to download the results files individually, or you can select ‘tar file’ and have the results displayed both in the browser and downloadable as a “zipped” folder. For return method you will most likely need to select ‘email’ unless your sequence input file is small (< 50 kb). There are additional options you can select if your species of interest is closely related to a model organism. Finally, make sure that ‘Masking options’ under ‘Advanced Options’ says that repetitive sequences will be replaced by strings of X; otherwise the default is to mask with strings of N, which will be indistinguishable from missing data. The results from RepeatMasker include a summary table (.tbl file, which can be opened in a text editor) with information about the regions identified, as well as your set of sequences with the repetitive/low complexity regions masked by Xs (.masked file).

The QCMaskedMissing tool will go through a set of sequences and discard any that have a percentage of missing or masked data above the user-defined threshold. It will also investigate sequence lengths and, if desired, discard sequences shorter than a user-defined length. The input for this tool is a set of fasta format sequences, such as those produced from the SeqFinder workflow and/or from RepeatMasker. The user can specify the threshold of missing/masked data to allow. This should be entered as a percentage: for example 10% would be written as 10 rather than 0.10. The user can also specify the minimum sequence length. During the run each sequence being investigated will be printed to the screen.
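The filtering idea can be sketched as follows. This is a minimal Python illustration rather than the QCMaskedMissing code itself: it assumes that missing data are coded as N and masked data as X (in either case), that a sequence is discarded only when its percentage is strictly above the threshold, and the function names are made up for the example.

```python
def percent_masked_missing(sequence):
    """Percentage of positions that are N (missing) or X (masked)."""
    if not sequence:
        return 100.0  # treat an empty sequence as entirely missing
    upper = sequence.upper()
    return 100.0 * (upper.count("N") + upper.count("X")) / len(upper)


def passes_qc(sequence, max_percent, min_length=0):
    """True if the sequence is long enough and not too heavily masked.

    max_percent is a whole-number percentage (10 means 10%, not 0.10).
    """
    long_enough = len(sequence) >= min_length
    clean_enough = percent_masked_missing(sequence) <= max_percent
    return long_enough and clean_enough


# Made-up sequences for illustration only.
print(percent_masked_missing("ACGTNNXXAC"))      # 4 of 10 positions
print(passes_qc("ACGTNNXXAC", 10))               # too much masked/missing data
print(passes_qc("ACGTACGTAN", 10))               # exactly at the threshold
print(passes_qc("ACGT", 10, min_length=5))       # too short
```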
The output files include the set of fasta formatted sequences that pass the threshold for masked/missing data (and sequence length, if desired), and stats and info files that contain results and information about the run. For more information see the detailed description of the QCMaskedMissing script.

3.13 Quality Control with QCComplementarity

It may also be helpful to identify any bait sequences that are highly complementary to each other. If baits are complementary then they could bind to each other rather than to your sequencing library. For example, MYcroArray suggests avoiding sequences with greater than 90% sequence identity for greater than 100 bp. In order to investigate this, you can create a BLAST database for the sequences you have selected for targeted enrichment, then use your selected sequences as queries to BLAST against that database, and then use the QCComplementarity tool to remove sequences that have high complementarity.

First, create a new BLAST database for your sequences using the command:

    makeblastdb -in TargetSequenceFile.fasta -dbtype nucl -parse_seqids -out TargetSequences -title "Sequences for Targeted Enrichment"

Where TargetSequenceFile.fasta is the file containing the sequences you have selected for targeted enrichment (e.g. through the SeqSelector workflow), TargetSequences is a name you selected for the database, and “Sequences for Targeted Enrichment” is some description of the database.

Next, BLAST your sequences against this database using this command (this produces the correct format for the QCComplementarity script):

    blastn -db TargetSequences -outfmt "10 qacc sacc pident length mismatch gaps qstart qend sstart send evalue bitscore" -query TargetSequenceFile.fasta -out SelfBLASTResults.csv

Where TargetSequences is the name of the database you created, TargetSequenceFile.fasta is the file containing the sequences you have selected for targeted enrichment (e.g.
through the SeqSelector workflow), and SelfBLASTResults.csv is the name you selected for the file containing the results.

The SelfBLASTResults.csv file is used as the input for the QCComplementarity tool, along with the query sequence file. The user can set thresholds for percent identity (as above, percent should be written as whole numbers, e.g. 90% is input as 90) and match length (in bp). When a BLAST hit between a pair of sequences has a percent identity higher than the setting over a region longer than the length specified, then the first sequence in the pair will be removed. The reciprocal BLAST hit between the pair is ignored.

The output files include a reduced BLAST results file that contains only the hits that fail to meet the user-defined thresholds, a stats file with information about the run, including the IDs for the sequences removed, and a fasta formatted file with all of the sequences retained. Note that some sequences may be highly complementary to several other sequences, and removing this one sequence may resolve more than one instance of high complementarity. Therefore, fewer sequences may be removed than the number of hits in the reduced BLAST table (see example file). See the detailed description of the QCComplementarity script for more information.

3.14 Quality Control with QCDuplicates

It may also be helpful to investigate whether your selected sequences come from single copy or multi-copy loci. If you used the SeqFinder and SeqFinderRemoveDuplicates scripts you may already know whether the sequences for your genes of interest were annotated in multiple regions of the working reference genome. However, you may wish to check again, or if you started with EST, transcriptome, or other data you may wish to investigate this further. The QCDuplicates script parses the results of a BLAST search of your selected sequences against a database created for your genome of interest (e.g.
a non-model species and potentially your working reference species) and identifies those that have multiple hits above a certain percent identity and below a specified e-value threshold. Those with multiple hits that fail the threshold settings are discarded.

First, create a BLAST database for your genome of interest using this command (if you haven’t already):

    makeblastdb -in GenomeSequenceFile.fasta -dbtype nucl -parse_seqids -out GenomeSequences -title "Genome Sequences for Species of Interest"

Where GenomeSequenceFile.fasta is the file containing the genome sequences for the species of interest in fasta format, GenomeSequences is the name you selected for the database, and “Genome Sequences for Species of Interest” is a description of the database.

Next, BLAST your sequences against the genome database using this command:

    blastn -db GenomeSequences -outfmt "10 qacc sacc pident length mismatch gaps qstart qend sstart send evalue bitscore" -query TargetSequenceFile.fasta -out GenomeBLASTResults.csv

Where GenomeSequences is the name of the database you created, TargetSequenceFile.fasta is the file containing the sequences you have selected for targeted enrichment (e.g. through the SeqSelector workflow), and GenomeBLASTResults.csv is the name you selected for the file containing the results.

The QCDuplicates tool takes the genome BLAST results and the fasta format file of the query (target) sequences as inputs. The user specifies a percent identity (as a whole number) and an e-value threshold. Sequences that have multiple hits above the percent identity threshold and below the e-value threshold are flagged as potentially duplicated regions and discarded from the dataset. The sequence being examined is shown on the screen during the run.
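The flagging rule can be sketched in Python as below. This is an illustration under stated assumptions, not the QCDuplicates code: the columns are assumed to follow the -outfmt string given above, the thresholds are treated as inclusive, and the function name and example rows are made up.

```python
import csv
import io
from collections import Counter

# Column order taken from the -outfmt string used for the genome BLAST search.
GENOME_BLAST_COLUMNS = ["qacc", "sacc", "pident", "length", "mismatch", "gaps",
                        "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def find_multicopy(handle, min_pident, max_evalue):
    """Return query IDs that have more than one hit passing both thresholds."""
    passing_hits = Counter()
    for row in csv.DictReader(handle, fieldnames=GENOME_BLAST_COLUMNS):
        good_identity = float(row["pident"]) >= min_pident
        good_evalue = float(row["evalue"]) <= max_evalue
        if good_identity and good_evalue:
            passing_hits[row["qacc"]] += 1
    return sorted(q for q, n in passing_hits.items() if n > 1)


# Made-up rows for illustration only (not real BLAST output).
example = io.StringIO(
    "LocusA,scaffold1,99.0,500,5,0,1,500,100,599,1e-100,900\n"
    "LocusA,scaffold7,96.0,480,19,0,1,480,2000,2479,1e-80,750\n"
    "LocusB,scaffold2,98.5,400,6,0,1,400,50,449,1e-90,700\n"
    "LocusB,scaffold9,70.0,60,18,0,10,69,10,69,0.5,40\n"
)
flagged = find_multicopy(example, min_pident=90, max_evalue=1e-10)
print(flagged)
```

Here LocusA has two strong hits and would be flagged as potentially duplicated, while the second LocusB hit fails both thresholds, so LocusB is treated as single copy.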
The output files include the fasta format sequence file with the potentially duplicated sequences removed, an info file that gives information about the potential number of duplicates for each duplicated sequence, and a stats file with information about the run, including how many sequences were removed and their IDs, as well as some information about the sequences that were retained. For further information see the detailed description of the QCDuplicates script.

Final note

Further considerations and quality checks may be necessary for your project depending on the capture format (array or in-solution), design specifications of the company producing the baits, etc.

3.15 Sorting and obtaining stats for fasta files independently

At times during the sequence selection process, you may wish to sort or obtain stats for a file of sequences independent from the pipeline. The SequencesSortStats script will do this for you.

4. DETAILED DESCRIPTION OF SCRIPTS

Note: If you would like to enter the settings interactively, execute the script and at the prompt type the name of the setting you would like to change. The setting names appear on the left side of the equal sign. At the next prompt simply type what the new setting should be.

Note: If you would like to enter the settings into the script directly, then you will need to include quotes around settings that should be treated as text and no quotes around settings that should be treated as numbers (e.g. min sequence length). In the descriptions of the user input below, information about whether or not to include quotes is given in square brackets.
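As an illustration of the quoting rule, directly entered settings might look like the lines below. The setting names are taken from the GoSelect and SeqFinder descriptions that follow, but the values (file name, search terms, column number) are hypothetical, and the exact layout of the settings block inside each script may differ.

```python
# Text settings: surrounded by quotes.
GOfilename = "GO_annotations.txt"      # hypothetical file name
searchphrase1 = "immune"               # hypothetical search term
boolean = "and"
searchphrase2 = "response"             # hypothetical search term
makeunique = "Yes"

# Numeric settings: no quotes.
numbersearchterms = 2
idcolumn = 1                           # hypothetical column number

# A list of text values: square brackets, each item in quotes.
# Chromosome names such as 10 are text here, so they are quoted too.
extrachromosomes = ["X", "Un", "10"]

# Python treats quoted and unquoted values as different types,
# which is why the distinction matters.
print(type(GOfilename).__name__, type(numbersearchterms).__name__)
print("10" == 10)
```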
4.1 GoSelect

Functions
Reads an Ensembl GO annotation file for a reference genome and outputs all lines that contain particular search term(s) to a separate file
A search can be performed for one or two terms, using the boolean operators ‘and’, ‘or’, ‘not’
It can take the output file produced above and return a separate file with only the first instance of the Ensembl transcript or gene number
Finally, it can produce a list of gene names resulting from your search terms in the format required by SeqFinder

Assumptions
The script will find any instance of the query text, including if it occurs in a gene name, gene description, or GO term, etc.
When creating the unique file it assumes that entries with the same transcript or gene number are on adjacent lines (this is the format of the unmodified file downloaded from Ensembl)

User Input
GOfilename: Name of the GO annotation file to search [When entering settings directly put this in quotes]
    o Case sensitive
outname: Name to use for the beginning of all output files [quotes]
    o Additional text will be appended to this to designate specific results files
numbersearchterms: Number of terms to search for; select 1 or 2 [no quotes]
searchphrase1: First search term [quotes, see Basics of Python above if you would like to include quotes in your search term]
boolean: If searching for two terms, enter the appropriate Boolean operator, either and, or, not [quotes]
searchphrase2: Second search term, if applicable [quotes, see Basics of Python above if you would like to include quotes in your search term]
makeunique: Whether or not to make the unique GO gene/transcript file
    o Answer Yes or No [quotes]
idcolumn: The column in the GO annotation file downloaded from Ensembl that contains the transcript or gene ID that should be used when making the unique file [no quotes]
genelist: This setting specifies whether or not a separate file should be made that contains the list of unique genes in the format required by SeqFinder.
Specify ‘yes’ or ‘no’ [quotes]
genecolumn: The column in the GO annotation file downloaded from Ensembl that contains the names of the genes [no quotes]
[Optional setting: disableinteractive: When entering settings directly into the script this setting allows you to shut off the interactive mode. Enter ‘yes’ to disable it and ‘no’ to leave it on]

Output to Screen
Whether or not the unique file and genes files have been made

Output files
The names of all output files will contain the search terms
It will return a text file which includes the full search results. This file will contain all lines for which a match was found, in the same format as the initial GO annotation file (i.e. multiple lines for each transcript/gene as long as the search term is found in that line)
Optionally, it will also return a separate file with only the first instance of a match for each Ensembl gene/transcript number. The name of this file will be the same as the name of the full file but will include ‘Unique.txt’
Optionally, it will also return a separate file that contains all of the unique gene names found during the search (i.e. with the duplicates removed) that is in the correct format for the SeqFinder script. The name of this file will be the same as the name of the full file but will include ‘UniqueGenes.txt’

4.2 SeqFinder

Function
Reads a list of genes in csv format
Opens each reference genome file in GenBank format (e.g.
a file for each chromosome) and searches for each gene in the list
For SeqFinderPartial.py, if the gene name is found in the GenBank file, it will return the first N basepairs, where the length N is set by the user
    o Use this script if you want partial gene sequences
For SeqFinderWhole.py, if the gene name is found in the GenBank file, it will return the entire sequence
    o This script will return the whole gene (exons + introns), whole mRNA, or whole CDS sequences, based on the options selected
    o If CDS is selected the function is similar to SeqFinderExons (see below), but the output is more crude. E.g. all transcripts will be returned by this script, whereas with SeqFinderExons you can choose which one(s) to output. Also, the Info file will not contain information about the number of transcripts found, etc.
    o Similarly, if mRNA is selected, all mRNAs will be returned
For SeqFinderExons.py, if the gene name is found in the GenBank file, it will return the sequence for each exon
    o Use this script if you want whole exon sequences
    o If multiple transcripts are annotated, you can select whether to return the shortest, the longest, or all of them
    o Additional exon specific and transcript specific information is output in the results files
For SeqFinderPartial, if the length of the sequence is shorter than the target length set by the user, then the whole gene sequence will be returned
For all scripts, if the length of the selected sequence is shorter than the minimum length set by the user, the sequence plus additional sequence after the gene will be returned until the min length is reached
Some statistics are calculated, including average length of sequence, average GC content, total length of regions found, names of duplicate sequences found, names of sequences with more than the user-specified number of Ns

Assumptions
The GenBank files used all begin with the same string, have the same extension, and differ only by a number/text as entered by the user
The script
will only find a gene if the name in the csv file matches the name in the GenBank file exactly (case sensitive)
SeqFinderExons.py will only find a gene if it has a CDS annotation in the GenBank file
    o E.g. if the gene you are looking for is a pseudogene and does not have a CDS sequence annotated in the GenBank file then it will not be found by this script.
    o SeqFinderWhole can search for a gene, whether or not it is annotated with “CDS” or “mRNA”
Annotations are accurate
    o If there is some uncertainty in the annotation of a region (e.g. if the start or stop codons are not precisely known) then the script will make a list of these regions and report them in the ‘AnnotationUncertainties.txt’ file and in the ‘Stats.txt’ file
If a gene name is annotated on more than one chromosome then more than one sequence will be returned

User input
genesfilename: Name of the csv file containing the names of the query genes [When entering settings directly put this in quotes]
    The csv file should have all of the gene names on a single line, separated by commas
    The last gene should not have a comma after it
    Do not use quotes around the gene names in this file
    When entering the name of the csv file, case is important
Name of the annotated reference genome files to search. These must be in GenBank format and must be unzipped.
genbankfilename: Base name for the GenBank file [quotes]
    o If you have multiple files that have the same text at the beginning of the name, input the part of the name that is common to all of them
    o Case sensitive
genbankfileextension: Extension for the GenBank file [quotes]
    o As above, but input the extension for the files, including anything the files have in common after the chromosome number
    o Case sensitive
chromnumber: Number of autosomal chromosomes [no quotes]
    o This is the number of consecutive autosomal chromosome files, starting at 1, whose names differ by a number
    o If the reference genome sequences have not been mapped to chromosomes, put a zero here and then treat the GenBank file as if it were an ‘extra chromosome’ file (see below)
    o If you want to search some autosomal chromosome files, but not all of them, put a zero here and then treat the autosomal chromosome files as ‘extra chromosome’ files (see below)
useextrachromosomes: Whether or not you would like to use extra chromosomes (these would be files whose base name and extensions are the same but differ by a character, a string of characters, or non-consecutive numbers; e.g. X, Un, 10, 11)
    o Answer Yes or No [quotes]
extrachromosomes: Extra files to use (these have the same base name and extensions but differ by a character, a string of characters, or non-consecutive numbers)
    o Enter the number of extra files you would like to use, and type the unique part of the name for each file at the prompt [Not applicable when entering settings directly in the script file]
    o [For direct entry of settings give the unique part of the filename of the extra chromosomes using quotes. Multiple strings can be entered here if they are separated by a comma. This all needs to be surrounded by square brackets. Note: Here numbers should be treated as text and be surrounded by quotes, e.g.
    ['X', 'Un', '10', '12']]

Example: For these files:
  DogChromosome1.gbk
  DogChromosome2.gbk
  DogChromosome3.gbk
  DogChromosomeX.gbk
  DogChromosomeUnk.gbk
The appropriate settings would be:
  GenBankfilename = DogChromosome [in quotes]
  GenBankfileextension = .gbk [in quotes]
  chromnumber = 3 [no quotes]
  useextrachromosomes = Yes [quotes]
  extrachromosomes = X ['X']

transcripthandling: Handling of multiple transcripts
  o If multiple transcripts are found you can use this option to specify which to return [no quotes]:
    o 1 – Return only the longest
    o 2 – Return only the shortest
    o 3 – Return all
  o For each of these options the number of transcripts found for each gene, the length(s) of the transcripts returned, and the sequence location of the transcript(s) are printed in the Info file
outname: Base name for output files
  o Enter a name that you would like to use as the base name for the output files [quotes]
queryspecies: Reference/query species name
  o Enter the name of the reference/query species [quotes]
numberNs: Identify sequences that have this many Ns (missing data)
  o Enter a number [no quotes]
targetsequencesize: For SeqFinderPartial.py only: Length of sequence to return
  o Enter a number [no quotes]
seqtype: For SeqFinderWhole.py only: Select the type of sequence to return
  o Choose gene, CDS, or mRNA [no quotes]
  o Note: The seqtypes CDS and mRNA will return sequences that lack introns and therefore may not be suitable for bait design, because baits may inadvertently span exon boundaries
[Optional setting – disableinteractive: When entering settings directly into the script this setting allows you to shut off the interactive mode. Enter ‘yes’ to disable it and ‘no’ to leave it on]

Output to screen
Whether or not additional chromosome files are being used (e.g.
sex chromosomes, files with names other than numbers, or non-consecutive chromosome files)
A list of files to search
The current chromosome file being searched

Output files (assuming the base name selected was Results)
ResultsStats.txt
  o This file gives some information about the run, including the settings used, number of genes found or missing, total length of the sequences, average sequence length and GC content
  o It also lists sequences that had uncertainties associated with the annotated locations of the start and/or stop codons, sequences that were found multiple times, and sequences with strings of length N or longer, where N is the value specified by the user
ResultsInfo.txt
  o This file gives the name of each chromosome file searched, the genes found in that file, their chromosomal location including strand, and for SeqFinderExons the number of transcripts found, along with the total length of the sequence returned for each of the selected transcripts
Results.fa
  o Sequences selected from the reference genome, in the order they were found in the GenBank files
ResultsSorted.fa
  o Sequences selected from the reference genome, sorted by name
ResultsAnnotationUncertainties.txt
  o Lists the names of loci for which there are uncertainties in the annotated location of the start and/or stop codons
ResultsFoundGenes.txt
  o The subset of genes from the .csv file that were found by SeqFinder
ResultsMissingGenes.txt
  o The subset of genes from the .csv file that were not found by SeqFinder
ResultsLengths.txt
  o A list of the lengths of the sequences found. This could be put into Excel or some other program to further examine the distribution of lengths
ResultsGC_Content.txt
  o A list of the GC content for each of the sequences found.
    This could be put into Excel or some other program to further examine the distribution of GC content of the sequences

4.3 SeqFinderRemoveDuplicates

Function
Searches through the fasta format sequence file produced by SeqFinder to identify sequences returned when the same genes are annotated multiple times in the reference GenBank file(s)
  o Searches the fasta file for multiple sequences with the same name
  o Counts the total number of exons and the total combined sequence length
If a gene has not been duplicated it is written to the results file
If a gene has been duplicated, the script will return either the copy with the most exons or the copy with the longest combined length, as set by the user
If multiple copies have the same number of exons or the same sequence length, then the first occurrence will be returned
Some statistics are calculated, including average length of sequence, average GC content, total length of regions found, names of duplicate sequences found, and names of sequences with more than the user-specified number of Ns

Assumptions
This script was developed for use with SeqFinderExons. It assumes that pseudogenes would not have been annotated as coding sequences in the reference genome, and therefore would not have been returned by the SeqFinderExons script.
This script compares sets of exons, and therefore UNSORTED results files from SeqFinder should be used. If the results file is sorted by name, the exons will be disassociated from the set they belong to, which will cause the script to return the wrong exons.
In order to identify a duplicated gene, the script looks to see how many occurrences were found for exon 1. If part of a gene that does not include exon 1 was duplicated, then this duplication event will not be detected by the script.
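
The duplicate-handling rules described above can be sketched in a few lines of Python. This is a minimal illustration, not the script's actual code: the function names `split_copies` and `choose_copy`, and the `Species_Gene_ExonN` naming pattern (modelled on names such as `Dog_Gene1_Exon1` used elsewhere in this manual), are assumptions made for the example.

```python
def split_copies(records):
    """Group (name, sequence) pairs, in file order, into copies of each gene.

    Assumes names look like 'Dog_Gene1_Exon1' (hypothetical pattern).
    A new copy is started every time exon 1 of a gene is seen, which is
    why the input must be the UNSORTED results file.
    """
    copies = {}  # gene name -> list of copies; each copy is a list of records
    for name, seq in records:
        # Drop the species prefix, then split the gene name from the exon label
        gene, exon = name.split("_", 1)[1].rsplit("_", 1)
        gene_copies = copies.setdefault(gene, [])
        if exon == "Exon1" or not gene_copies:
            gene_copies.append([])
        gene_copies[-1].append((name, seq))
    return copies


def choose_copy(gene_copies, keepmethod):
    """Return the copy to keep: 1 = most exons, 2 = longest combined length."""
    if keepmethod == 1:
        score = len
    else:
        score = lambda copy: sum(len(seq) for _, seq in copy)
    # max() returns the first maximal item, so ties go to the first occurrence
    return max(gene_copies, key=score)


records = [
    ("Dog_GeneA_Exon1", "AAAA"),
    ("Dog_GeneA_Exon2", "CC"),
    ("Dog_GeneB_Exon1", "GGG"),
    ("Dog_GeneA_Exon1", "TTTTTTTTTT"),  # second annotation of GeneA
]
copies = split_copies(records)
kept_most_exons = choose_copy(copies["GeneA"], keepmethod=1)
kept_longest = choose_copy(copies["GeneA"], keepmethod=2)
```

Because copies are delimited by exon 1, a sorted input file would place both `Exon1` records of `GeneA` next to each other and detach the remaining exons from their copy, which is the failure described in the assumptions above.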

User input
genesfilename: Name of the csv file used with SeqFinder containing the names of the query genes [When entering settings directly put this in quotes]
  o The csv file should have all of the gene names on a single line, separated by commas
  o The last gene should not have a comma after it
  o Do not use quotes around the gene names in this file
  o When entering the name of the csv file, case is important
sequencefilename: Name of the UNSORTED results file from SeqFinder [in quotes]
  o Case sensitive
queryspecies: [in quotes]
  o Because SeqFinder adds the name of the query species to the sequences, you need to enter this information here so that the gene name from the csv file will match the gene name for the sequence
  o Include here all text between the > and the beginning of the gene name, which should be common to all sequences
keepmethod: Enter the number corresponding to which duplicate you would like to keep [no quotes]
  o 1 – Keep the duplicate with the most exons
  o 2 – Keep the duplicate with the longest combined sequence length
outname: Enter the base name you would like to use for the output of the run [in quotes]
  o Case sensitive
minsequencesize: This setting identifies sequences shorter than the desired length (e.g. the length of the baits). Enter a number [no quotes]
numberNs: This setting identifies sequences that have a certain number of Ns (missing data). Enter a number [no quotes]
[Optional setting – disableinteractive: When entering settings directly into the script this setting allows you to shut off the interactive mode.
Enter ‘yes’ to disable it and ‘no’ to leave it on]

Output to screen
Whether you selected to keep the duplicate with the most exons or the longest length
The gene being processed for duplicates and an indication of the number of occurrences found
If the gene has been duplicated, the number of exons in each duplicate, the total length of each duplicate, and which duplicate is being returned will also be printed to the screen

Output files (assuming the base name is Results)
Results.fa
  o This is the set of sequences with the duplicates removed, sorted in the order of the genes listed in the .csv file
ResultsSorted.fa
  o This is the same file as above but sorted by name
ResultsStats.txt
  o File with summary information including the settings used, total number of sequences, total length of sequences, average sequence length, average GC content, names of duplicate sequences found, and names of sequences with more than the user-specified number of Ns

4.4 AccessionNumberChecker

Function
Reads a file of sequences and returns the description line for each sequence to another text file

Assumptions
Sequences are in fasta format

User input
sequencefilename: Name of fasta format sequence file [When entering settings directly put this in quotes]
outname: Name for the output files [quotes]
[Optional setting – disableinteractive: When entering settings directly into the script this setting allows you to shut off the interactive mode.
Enter ‘yes’ to disable it and ‘no’ to leave it on]

Output to screen
None

Output files (assuming that outname is Results)
It will return a text file with one line for each sequence that contains all of the information from the description of that sequence

4.5 ParseBLAST

Function
Parses output from a BLAST search of a local database of un-annotated, non-model species genomic sequences assembled as scaffolds/contigs to retrieve the sequences that correspond to the reference species sequences
For the BLAST match, if the alignment of the reference and non-model sequences does not cover the whole length of the reference sequence (e.g. a match with an alignment for 900 out of 1000 bp), additional sequence will be added from upstream and/or downstream as necessary to return a sequence of comparable length
ParseBLASTPartial.py will return sequences all of approximately the same length, which is set by the user
  o Some differences in length may still occur due to insertions/deletions
ParseBLASTExons.py will return sequences of similar length to those used in the BLAST search
  o Includes some more detailed output regarding the number of genes searched, and which genes had exons lacking BLAST hits
  o Some differences in length may still occur due to insertions/deletions
ParseBLASTWhole.py will return sequences of similar length to those used in the BLAST search
  o Similar to ParseBLASTExons, but the output is more general and does not focus specifically on exons
  o Some differences in length may still occur due to insertions/deletions
Some statistics are calculated, including average length of sequence, average GC content, total length of regions found, names of duplicate sequences found, and names of sequences with more than the user-specified number of Ns

Assumptions
Reference and subject are relatively closely related
Returns only the best BLAST hit for each sequence based on the bitscore, after first removing BLAST hits with e-values higher than the user-defined
threshold
  o As defined by NCBI, “The bit score gives an indication of how good the alignment is [between the query and the subject sequences]; the higher the score, the better the alignment. In general terms, the bit score is calculated from a formula that takes into account the alignment of similar or identical residues, as well as any gaps introduced to align the sequences.”
As with other programs that utilize BLAST searches, the sequence returned may not be homologous to the regions of interest
If multiple copies of the same sequence are present in the BLAST query file (e.g. duplicated genes, multiple transcripts), then the single best BLAST hit will be returned for all of the sequences

User input
blastfilename: Name of the file with the BLAST results [When entering settings directly in script use quotes]
  o Case sensitive
  o A BLAST search must be run so that it outputs specific information in a particular order. See the Installation Instructions for how to make a local BLAST database. Use this BLAST command to generate the appropriate output (Note: the outfmt options are the important part here.
    They must all be specified and specified in this order)

      blastn -db DatabaseName -max_target_seqs 1 -outfmt "10 qacc sacc evalue bitscore pident qstart qend sstart send" -query SeqFinderResultsFile.fa -out BLASTResults.csv

fastaqueriesfilename: Name of the fasta file with the query sequences used for the BLAST search [quotes]
  o This is used for the output summary file to identify genes/exons that were and were not found during the run
  o Case sensitive
subjectfilename: The name of the file that contains the fasta sequences of the non-model species genome used to create the BLAST database [quotes]
  o Case sensitive
  o As when creating the database, all scaffold/contig sequences for the non-model species should be included in this single fasta format file
AccessionNumberMethod: Enter the number of the Accession number method you would like to use (1, 2, or 3; see below) [no quotes]
The ParseBLAST script works by matching the accession number from the BLAST hit to the accession number in the subject sequence file
  o You need to tell ParseBLAST how to find the accession number for sequences in the sequence file
  o Here is an example BLAST hit to illustrate the three accession number method options (with the accession number in bold):

      Reference_CYTB,HQ420351,0.0,1,1000,1702,695

AccessionNumberMethod 1 – assumes that the accession number is the only text between the > and the first space for each sequence.
  o There are no settings specifically for this method
  o You could use this method if the accession number in your sequence file looks like this:

      > HQ420351

AccessionNumberMethod 2 – assumes that the accession number is found at the exact same position for each sequence name, although the accession number may not be the only text in the name
  o For this method you need to enter the exact starting and exact ending position of the accession number (not including the >) for Accession2Start and Accession2End
  o Example:

      >gi|317409317|gb|HQ420351.1| Pterodroma sandwichensis

  o Here the accession number starts at character 18 with H and ends at character 25 with 1
  o The BLAST hit above did not include the revision number (the decimal or the text after it), so that should not be included here
AccessionNumberMethod 3 – attempts to find the accession number based on some information you provide
  o This works best if your accession numbers have some consistent value in them, but differ in length or in position in the name of the sequence
  o Using the same example as immediately above:

      >gi|317409317|gb|HQ420351.1| Pterodroma sandwichensis
      >gi|317409317|gb|HQ420352.1| Pterodroma sandwichensis

  o Here all of the accession numbers start with ‘HQ’, so HQ could be used as the start query in Accession3StartFind. Python will find this position each time and store its location for you
  o Since H is the exact start of the accession number you do not need to specify any offset for Accession3StartFindOffset.
  o Here all of the accession numbers end at ‘.’, so the decimal point could be used as the end query for Accession3EndFind. Python will find this position and store its location for you
  o In this case, since the decimal occurs immediately after the end of the accession number, you do not need any offset for Accession3EndFindOffset
  o If your search queries occurred before or after the accession number (e.g.
    you searched for the gb in the example above instead of HQ) you can specify the number of characters that the query is from the start or end of the accession number. If you searched for a string of letters instead of a single character, Python returns the position of the beginning of the string. For example, if you searched for gb above, you would need to specify the offset from the letter g.
Note: Finding the right values to enter for Accession Number Methods 2 and 3 may take some trial and error. Once you start the ParseBLAST script it will output the accession numbers it is searching for to the screen. If these do not match the format of the accession numbers in your BLAST results file (e.g. the first letter of the accession number is missing), kill the run by pressing Control-C and try again
  o In Python, ranges include the beginning number but not the end number (e.g. a range of 1 – 5 only includes the values 1, 2, 3, and 4), and Python starts counting at 0 instead of 1
  o Accession number method 2 has been adjusted for this, so that when you specify the exact start and exact end positions of the accession number it will know what to do
  o Accession number method 3 has not been adjusted for this because Python is searching for the accession number for us. Therefore, if you need to specify an offset for Accession Number Method 3 you should specify the distance to the exact starting position of the accession number and the position of the character immediately after the end of the accession.
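
The three accession number methods can be illustrated with ordinary Python string operations. The sketch below is a hedged reconstruction, not the ParseBLAST source: the function names are hypothetical, and Method 2 is assumed to count positions from 1 on the whole description line (including the >), because that is the counting that reproduces the positions 18 and 25 quoted in the example above.

```python
def accession_method1(header):
    """Method 1: the accession is the only text between '>' and the first space."""
    return header.lstrip(">").split()[0]


def accession_method2(header, start, end):
    """Method 2: fixed positions, counted from 1 and inclusive (assumed).

    Python slices count from 0 and exclude the end index, hence start - 1.
    """
    return header[start - 1:end]


def accession_method3(header, start_find, start_offset, end_find, end_offset):
    """Method 3: locate the accession with two search strings plus offsets.

    str.find returns the position of the first character of the match,
    so the start offset is measured from that character.
    """
    start = header.find(start_find) + start_offset
    # Look for the end marker only after the start of the accession
    end = header.find(end_find, start) + end_offset
    return header[start:end]


header = ">gi|317409317|gb|HQ420351.1| Pterodroma sandwichensis"

simple = accession_method1(">HQ420351 Pterodroma sandwichensis")
fixed = accession_method2(header, 18, 25)
found = accession_method3(header, "HQ", 0, ".", 0)
found_via_gb = accession_method3(header, "gb", 3, ".", 0)
```

All four calls yield `HQ420351`. Searching for `gb` needs a start offset of 3 because the accession begins three characters after the `g`. The sketch does not handle a search string that is absent from the line (`str.find` returns -1 in that case).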
evalue: E-value threshold to use to filter the BLAST search results [no quotes]
  o As defined by NCBI, “the E-value is a parameter that describes the number of hits one can ‘expect’ to see by chance in a database of a particular size…For example, an E-value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size, one might expect to see one match with a similar score simply by chance.”
  o BLAST hits with e-values larger than this threshold will not be returned by the script
queryspecies: Query species [quotes]
  o This is anything that occurs before the name of the gene (excluding the >) in the fasta definition line
  o If you used the SeqFinder script, this is the same text you entered for this option there
  o If you are using sequences that were not obtained with the SeqFinder script, enter any text between the > and the beginning of the gene (assuming it is the same length for all sequences). If there is no extra text simply enter an underscore (‘_’) here.
subjectspecies: Subject (non-model) species [quotes]
  o Enter this so you can later differentiate the sequences from the non-model and reference species
outname: Base name for the output files [quotes]
minsequencesize: Minimum sequence length [no quotes]
  o The output summary file will tell you which sequences are shorter than this length
  o Short sequences may arise because a contig/scaffold ended in the middle of a gene, or because of insertions and deletions between the reference and subject species
numberNs: Number of Ns to find [no quotes]
  o The output summary file will tell you which sequences have strings of Ns this long or longer
[Optional setting – disableinteractive: When entering settings directly into the script this setting allows you to shut off the interactive mode. Enter ‘yes’ to disable it and ‘no’ to leave it on]

Output to screen
This script looks at the accession number for each sequence in the subject file and compares that to the accession number for each BLAST hit.
It will write that accession number to the screen as it progresses through the file. If the accession numbers look strange, kill the run by pressing Control-C and adjust the settings of the AccessionNumberMethod

Output files (assuming the base name is Results)
Results.fa
  o This file contains one sequence from the non-model species for the best, longest BLAST hit. If a BLAST hit was not found for a particular reference sequence then there will be no corresponding subject sequence in this file
  o This is sorted in the order of the accession number the sequence came from
ResultsSorted.fa
  o This is the file above sorted by locus name instead
ResultsStats.txt
  o This file contains summary information about the run:
      Settings used
      Number of regions found and missing
      For ParseBLASTExons only: Number of unique genes from the query file
      For ParseBLASTExons only: Number of genes with missing exons after the BLAST search
      Average query length, average BLAST match length, average identity
      Average length and GC content of the sequences found
      Total length of sequences found
      Sequences shorter than the minimum size set by the user
      Sequences that were found more than once
      Sequences with strings of Ns as long or longer than the value input by the user
ResultsUniqueBLASTtable.txt
  o This is the final BLAST table used by ParseBLAST, sorted by percent identity
      Column 1 – Name of query sequence
      Column 2 – Accession number (or name) of the sequence where the hit was found
      Column 3 – E-value for hit
      Column 4 – Bitscore
      Column 5 – Percent identity
      Column 6 – Query sequence start for hit alignment
      Column 7 – Query sequence end for hit alignment
      Column 8 – Subject sequence start for hit alignment
      Column 9 – Subject sequence end for hit alignment
      Column 10 – Length of aligned region
      Column 11 – Length of query sequence
      Column 12 – % of query covered by BLAST hit
  o Comparison of the e-value in the third column, the percent identity in the fifth column, and the % of the query covered by the hit in column 12 will give
    some information about the quality of the BLAST hit for which the sequence was returned
ResultsAccessions.txt
  o This file contains a non-redundant list of accession numbers for which a sequence from the non-model species was retrieved
  o These sequences can be used later for mapping the resulting next-generation sequence reads
For ParseBLASTExons.py – ResultsFoundExons.txt
  o A list of the regions/exons found and the number of times found
For ParseBLASTExons.py – ResultsMissingExons.txt
  o A list of the exons that were not found by BLAST
For ParseBLASTExons.py – ResultsGenesMissingExons.txt
  o A list of the genes for which at least one exon was not found by BLAST
  o This can be compared to the ResultsMissingExons.txt file to determine how many and which exons are missing
For ParseBLASTPartial.py and ParseBLASTWhole.py – ResultsFoundGenes.txt
  o A list of the regions/exons found and the number of times found
For ParseBLASTPartial.py and ParseBLASTWhole.py – ResultsMissingRegions.txt
  o A list of the regions that were not found by BLAST
ResultsLengths.txt
  o A list of the lengths for the sequences returned. This could be put into Excel or some other program to further examine the distribution of lengths.
ResultsMatchLength.txt
  o A list of the BLAST match lengths. This could be put into Excel or some other program to further examine the distribution of lengths
ResultsIdentity.txt
  o A list giving the identity between the reference query sequence and the subject sequence
ResultsInfo.txt
  o This gives the accession number where the sequence for each region was found

4.6 MergeSeqs

Function
This script merges the sequences from the reference and non-model species. It starts by examining the reference/query sequence file. If there is a sequence for the non-model species present, the sequence for the non-model species will be added to the output file.
If a sequence from the reference species is present, but there is no corresponding sequence from the non-model species for it, the reference sequence will be added to the merged output file instead. Using this script allows baits to be designed from the reference species sequence wherever no sequence was recovered from the non-model species. This is helpful if your reference is relatively closely related to your target species, and if there are genes of high interest from the reference that were not found in the target species. The script will also return the standard summary information

Assumptions
Only sequences with corresponding members in the query sequence file will be written to the merged file
  o If, for example, a sequence is present only for the non-model species but not the query species, then it will not be written to the merged file

User input
queryfilename: Name of the fasta format query/reference sequence file (e.g. from SeqFinder) [When entering settings directly into script use quotes]
subjectfilename: Name of the fasta format non-model species sequence file (e.g. from ParseBLAST) [quotes]
queryidentifier [quotes]
  o From the query fasta sequence file, this is the string of characters immediately following (but not including) the > sign up to the beginning of the locus name
  o It should be the same in every sequence
  o Example: For these sequences:
      >Dog_Gene1_Exon1
      >Dog_Gene2_Exon3
      >Dog_Gene3_Exon1
    The query identifier would be ‘Dog_’
subjectidentifier: Subject identifier [quotes]
  o Similar to above, but for the non-model fasta sequence file
outname: Base name for the output files [quotes]
numberNs: Identify sequences with this number (or more) Ns when returning sequences [no quotes]
[Optional setting – disableinteractive: When entering settings directly into the script this setting allows you to shut off the interactive mode.
Enter ‘yes’ to disable it and ‘no’ to leave it on]

Output to screen
Name of the locus being searched for in the subject file, and subsequently being copied to the merged file. If the locus names look strange, kill the run by pressing Control-C, adjust the settings of the query and subject species identifiers, and try again.

Output files (assuming base name for output files is Results)
ResultsMerged.fa
  o This file contains the sequences merged from both the query (reference) species and the subject (non-model) species, sorted in the order of the gene name
ResultsStats.txt
  o This file contains summary information about the merged sequence file
QueryFileSorted.fa
  o This file contains the query sequences sorted by name

4.7 QCMaskedMissing

Function
This tool reads in a set of fasta format sequences and will discard sequences with too much masked/missing data, as well as sequences below a desired length
It calculates the percentage of sites with IUPAC ambiguity codes (e.g. R, Y, M, etc.), the percentage of sites with missing data (N or ?), and the percentage of sites that have been masked for repetitive or low complexity sequences (indicated by X), and then discards any sequences for which the percent missing/masked data is above a user-defined threshold. It will also discard sequences shorter than the desired length set by the user

Assumptions
Input is a fasta formatted sequence file
Ambiguous sites are identified with IUPAC ambiguity codes and/or question marks
Repetitive and low complexity regions are masked with Xs

User input
sequencefilename: Name of the fasta sequence file to analyze [quotes]
outname: Base name to use for the output files [quotes]
threshold: Percentage of missing + masked data above which the sequence should be discarded. Enter a whole number (e.g.
10% is entered as 10 not 0.10) [no quotes]
minsequencesize: Identify sequences shorter than this length in the results [no quotes]
discardshortseqs: Whether or not to discard sequences shorter than minsequencesize. Enter ‘yes’ or ‘no’ [quotes]
numberNs: Identify sequences with this number (or more) Ns when returning sequences [no quotes]
[Optional setting – disableinteractive: When entering settings directly into the script this setting allows you to shut off the interactive mode. Enter ‘yes’ to disable it and ‘no’ to leave it on]

Output to screen
ID for sequence being evaluated.

Output files (assuming base name for output files is Results)
ResultsQCMaskedMissing.fa
  o File with the sequences that were retained
ResultsStats.txt
  o File with summary information including total number of sequences read, number of sequences retained and discarded, total length of sequences retained, average length of sequences, average GC content of sequences, names of sequences that were discarded because they contained too much missing/masked data, names of sequences shorter than the length specified by the user and whether they were retained or discarded, and names of sequences with more than the user-specified number of Ns

4.8 QCComplementarity

Function
This script reads in a results file from a BLAST search of the selected sequences (e.g. from the SeqSelector workflow) against themselves and then parses these results to identify pairs of sequences that demonstrate high complementarity to each other. The first sequence (e.g. Seq1) of a pair (e.g. Seq1, Seq2) with percent identity greater than the user-defined threshold for a region longer than that defined by the user will be discarded. Self BLAST hits (Seq1, Seq1) and the reciprocal BLAST hit (Seq2, Seq1) are ignored.
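
The filtering rule described above can be sketched as follows. This is an illustration under stated assumptions, not the script itself: the function name `complementary_queries` is hypothetical, and the csv rows are assumed to follow the 12-column layout listed for the match table in this section (query ID, subject ID, percent identity, match length, and so on).

```python
import csv
import io


def complementary_queries(blast_csv, percentidentity, matchlength):
    """Return the query IDs that would be discarded as highly complementary."""
    discard = set()
    seen_pairs = set()
    for row in csv.reader(blast_csv):
        query, subject = row[0], row[1]
        pident, length = float(row[2]), int(row[3])
        if query == subject:
            continue  # self hit (Seq1, Seq1)
        if (subject, query) in seen_pairs:
            continue  # reciprocal hit (Seq2, Seq1)
        seen_pairs.add((query, subject))
        if pident > percentidentity and length > matchlength:
            discard.add(query)  # the first sequence of the pair is discarded
    return discard


# Made-up BLAST rows for illustration only
example = io.StringIO(
    "Seq1,Seq1,100.00,500,0,0,1,500,1,500,0.0,900\n"
    "Seq1,Seq2,95.00,200,10,0,1,200,301,500,1e-90,330\n"
    "Seq2,Seq1,95.00,200,10,0,301,500,1,200,1e-90,330\n"
    "Seq3,Seq4,80.00,300,60,0,1,300,1,300,1e-40,200\n"
    "Seq5,Seq6,99.00,50,0,0,1,50,1,50,1e-20,90\n"
)
to_discard = complementary_queries(example, percentidentity=90, matchlength=100)
```

With these made-up rows only `Seq1` is flagged: the `Seq3` hit is below the identity threshold and the `Seq5` hit is shorter than the match length.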

Assumptions
Input is a csv formatted BLAST results file created using the command given in the workflow

User input
blastfilename: Name of the file with the BLAST results [When entering settings directly in script use quotes]
fastaqueriesfilename: Name of the fasta file with the query sequences used for the BLAST search [quotes]
percentidentity: One part of the criteria above which a pair of sequences will be considered complementary. Percent identity should be entered as a whole number. For example, 90% identity should be entered as 90 and not 0.90 [no quotes]
matchlength: The second part of the criteria above which a pair of sequences will be considered complementary. If a pair of sequences has higher percent identity for a region longer than the matchlength then it will be discarded. Matchlength should be entered as a whole number [no quotes]
outname: Base name to use for the output files [quotes]
minsequencesize: Identify sequences shorter than this length in the results [no quotes]
numberNs: Identify sequences with this number (or more) Ns when returning sequences [no quotes]
[Optional setting – disableinteractive: When entering settings directly into the script this setting allows you to shut off the interactive mode.
Enter ‘yes’ to disable it and ‘no’ to leave it on]

Output to screen
None

Output files (assuming base name for output files is Results)
ResultsQCComplementarity.fa
  o Fasta format file with the complementary sequences removed
ResultsQCComplementarityStats.txt
  o File with summary information including the number of sequence pairs that demonstrated complementarity above the percent identity and match length thresholds, the number of sequences removed, total number of sequences retained, total length of sequences retained, average length of sequences retained, average GC content of sequences retained, names of highly complementary sequences that were removed, names of sequences shorter than a specified length, names of sequences found multiple times in the data set, and those with more than the user-specified number of Ns.
ResultsQCComplementarityMatchTable.txt
  o A table with the BLAST hits that exceeded the percent identity and match length thresholds. The columns in this table are:
      Column 1 - query sequence ID
      Column 2 - subject sequence ID
      Column 3 - percent identity
      Column 4 - match length
      Column 5 - number of mismatches
      Column 6 - number of gaps
      Column 7 - query starting position for the match
      Column 8 - query ending position
      Column 9 - subject starting position for the match
      Column 10 - subject ending position
      Column 11 - E-value
      Column 12 - bitscore

4.9 QCDuplicates

Function
This script reads in a results file from a BLAST search of the selected sequences (e.g. from the SeqSelector workflow) against the genome of a species of interest (e.g. non-model or working reference species) and then parses these results to identify sequences with multiple high quality hits with percent identity above and e-value below user-specified thresholds, respectively. Sequences with multiple high quality hits are discarded.
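
Counting high quality hits per query can be sketched as below. This is a hedged example rather than the script's code: the function name `potential_duplicates` is hypothetical, and the csv rows are assumed to use the same 12-column layout as the match table in section 4.8 (percent identity in column 3, E-value in column 11); check this against the BLAST command given in the workflow before relying on it.

```python
import csv
import io
from collections import Counter


def potential_duplicates(blast_csv, percentidentity, evalue):
    """Map each query ID with more than one high quality hit to its hit count."""
    good_hits = Counter()
    for row in csv.reader(blast_csv):
        query = row[0]
        pident = float(row[2])        # column 3 (assumed layout)
        hit_evalue = float(row[10])   # column 11 (assumed layout)
        if pident > percentidentity and hit_evalue < evalue:
            good_hits[query] += 1
    return {query: n for query, n in good_hits.items() if n > 1}


# Made-up BLAST rows for illustration only
example = io.StringIO(
    "SeqA,scaffold1,99.00,400,4,0,1,400,1001,1400,1e-150,700\n"
    "SeqA,scaffold7,98.00,390,8,0,1,390,5001,5390,1e-140,680\n"
    "SeqB,scaffold2,99.50,400,2,0,1,400,201,600,1e-150,720\n"
    "SeqB,scaffold9,80.00,120,24,0,1,120,301,420,1e-5,60\n"
)
flagged = potential_duplicates(example, percentidentity=95, evalue=1e-50)
```

Here `SeqA` has two hits that pass both thresholds and would be discarded, while `SeqB` has only one and would be retained.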

Assumptions
Input is a csv formatted BLAST results file created using the command given in the workflow
Depending on the settings selected this script may also remove sequences that are complementary to other regions even though they may not be duplicates

User input
blastfilename: Name of the file with the BLAST results [When entering settings directly in script use quotes]
fastaqueriesfilename: Name of the fasta file with the query sequences used for the BLAST search [quotes]
percentidentity: One part of the criteria for considering whether a sequence may potentially be duplicated. Percent identity should be entered as a whole number. For example, 95% identity should be entered as 95 and not 0.95 [no quotes]
evalue: The second part of the criteria for identifying duplicated regions [no quotes]. If a sequence has multiple hits with percent identity higher than the threshold specified above and e-value less than that specified here, it will be discarded
outname: Base name to use for the output files [quotes]
minsequencesize: Identify sequences shorter than this length in the results [no quotes]
numberNs: Identify sequences with this number (or more) Ns when returning sequences [no quotes]
[Optional setting – disableinteractive: When entering settings directly into the script this setting allows you to shut off the interactive mode. Enter ‘yes’ to disable it and ‘no’ to leave it on]

Output to screen
Each sequence examined will be output to the screen along with information on the number of potential duplicates.
Output files (assuming base name for output files is Results)
ResultsQCDuplicates.fa
  o Fasta format file with the potentially duplicated sequences removed
ResultsQCDuplicatesStats.txt
  o File with summary information including the number of sequences that are potentially duplicated, the number of sequences removed, total number of sequences retained, total length of sequences retained, average length of sequences retained, average GC content of sequences retained, names of potentially duplicated sequences that were removed, names of sequences shorter than a specified length, names of sequences found multiple times in the data set, and those with more than the user-specified number of Ns.
ResultsQCDuplicatesInfo.txt
  o A list of potentially duplicated sequences and potential number of duplicates (i.e. similar to the output to the screen during the run).

4.10 SequencesSortStats

Function
This is a stand-alone utility for summarizing a set of sequences in a fasta format file, separate from the SeqFinder or ParseBLAST scripts.
This script will output a sorted file of your sequences and calculate some statistics including total number of sequences, total length of sequences, average sequence length, average GC content, names of duplicate sequences found, and names of sequences with more than the user-specified number of Ns.

Assumptions
Input is a fasta formatted sequence file
Sequences will be sorted by their names, which include the text after the > sign up to the first space
Duplicate sequences will be found on adjacent lines of the sorted sequence file
Only sequences with exactly the same sequence name (up to the first space) will be considered duplicates; this is case sensitive.
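The assumptions above can be made concrete with a short Python 3 sketch. It is not the SequencesSortStats code: the function names read_fasta and sort_and_summarize are hypothetical, and "average GC content" is assumed here to mean the total number of G and C bases divided by the total sequence length.

```python
def read_fasta(lines):
    """Return a list of (name, sequence) tuples from an iterable of fasta lines."""
    records = []
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # The name is the text after '>' and up to the first space
            name = line[1:].split(" ")[0]
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def sort_and_summarize(records, minsequencesize, numberNs):
    """Sort records by name (case sensitive) and compute summary statistics."""
    ordered = sorted(records, key=lambda rec: rec[0])
    names = [name for name, _ in ordered]
    # After sorting, identical names sit next to each other
    duplicates = sorted({a for a, b in zip(names, names[1:]) if a == b})
    total_length = sum(len(seq) for _, seq in ordered)
    gc_bases = sum(seq.upper().count("G") + seq.upper().count("C")
                   for _, seq in ordered)
    number = len(ordered)
    return {
        "number": number,
        "total_length": total_length,
        # Guard against dividing by zero when no sequences were read
        "average_length": total_length / number if number else 0.0,
        "average_gc": gc_bases / total_length if total_length else 0.0,
        "duplicates": duplicates,
        "short": [n for n, s in ordered if len(s) < minsequencesize],
        "many_Ns": [n for n, s in ordered if s.upper().count("N") >= numberNs],
    }
```

Because names are compared exactly, GeneA and genea would be treated as two different sequences in this sketch, in line with the case-sensitivity assumption above.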
User input
sequencefilename: Name of the fasta sequence file to analyze [quotes]
outname: Base name to use for the output files [quotes]
minsequencesize: Identify sequences shorter than this length in the results [no quotes]
numberNs: Identify sequences with this number (or more) Ns when returning sequences [no quotes]
[Optional setting – disableinteractive: When entering settings directly into the script this setting allows you to shut off the interactive mode. Enter ‘yes’ to disable it and ‘no’ to leave it on]

Output to screen
None

Output files (assuming base name for output files is Results)
ResultsSorted.fa
  o File with the sequences sorted by name
ResultsStats.txt
  o File with summary information including total number of sequences, total length of sequences, average sequence length, average GC content, names of duplicate sequences found, and names of sequences with more than the user-specified number of Ns

5. TROUBLESHOOTING

5.1 Error Messages

In the section below, error messages given by Python are shown first, and potential solutions to the problem are described below each message.

    Traceback (most recent call last):
    [several lines of text]
    IOError: [Errno 2] No such file or directory: 'Filename.fasta'

Python could not find a file that you told it to use. Check the spelling of the filenames (note that this is case sensitive!), and make sure that the files are present in the correct directory (usually the directory where you are executing the script).

    File "./ParseBLASTPartial.py", line 32
        outname = 'Filename.fasta
                                 ^
    SyntaxError: EOL while scanning string literal

Quotation marks have been misused around some text. Make sure that quotation marks are present at the beginning and end of a string of text. Double check that you use the same style quotation marks at each end of the text (e.g.
'word').

    Traceback (most recent call last):
      File "./ParseBLASTExons.py", line 32, in <module>
        outname = Filename
    NameError: name 'Filename' is not defined

In this case the quotation marks may be completely missing from around some text. Please add them and try again.

    Traceback (most recent call last):
      File "./ParseBLASTPartial.py", line 86, in <module>
        Accession2Start = Accession2Start - 1
    TypeError: unsupported operand type(s) for -: 'str' and 'int'

Python was expecting one type of input (e.g. a number without quotes) and found a different type of input (e.g. a string with quotes around it). Please check that all of the user input items are correct.

    Traceback (most recent call last):
      File "./ParseBLASTExons.py", line 569, in <module>
        avelength = x / numgenes
    ZeroDivisionError: division by zero

No genes or sequences were found to match your query. Check to make sure you have entered correct gene names (and that they are spelled correctly), that there were BLAST hits for your genes, that the matching record for the BLAST hit is present in your fasta file of subject sequences, etc. If this message is obtained while using a ParseBLAST script then look at the line directly above to see if an error message was printed. If there is an error message it will give hints on how to fix the problem.

    ERROR! Please enter 1, 2, or 3 for AccessionNumberMethod.
    Traceback (most recent call last):
      File "./ParseBLASTExons.py", line 706, in <module>
        avelength = x / numgenes
    ZeroDivisionError: division by zero

As the error message suggests, you have selected an inappropriate value for AccessionNumberMethod in the ParseBLAST script. Please select 1, 2, or 3 instead.

5.2 Unexpected results

For the SeqFinder scripts, if some genes of interest were not found, then there may be a mismatch between the gene name (e.g. from the Ensembl file) and the annotated gene name in the reference GenBank file.
There are a couple of things you can try:
If the gene has synonyms in other species, try running the scripts again using those names (UCSC Genome Browser is useful for identifying these)
If the gene still isn’t found, try searching the reference genome (e.g. UCSC Genome Browser or the appropriate chromosome GenBank file) for a key word in the gene’s name or the gene’s function to identify an alternative name. The gene may have been annotated in the GenBank file using a generic name, such as “LOC0895654” instead.

For the ParseBLAST script, if no Accession numbers are printed while the script is running or if they seem incomplete, it is likely that the script is not able to understand how it should look for the Accession number. Check to make sure you have selected the correct AccessionNumberMethod, and that the start and end coordinates, the search strings, or the offsets have been entered correctly. If they seem incorrect, adjust them and try again. The script outputs to the screen what it is searching for, so if the format of the Accession number on the screen doesn’t look similar to the format of the Accession number in the BLAST result then you will not get any sequences back.

For the ParseBLAST script, if you get an error message stating ‘BLAST hits but no genes found.’ in the Stats file, there may be an issue with how you are specifying the query or subject species. Check to make sure these are entered correctly and/or that you are using the correct files.

For the ParseBLAST script, if you get anomalous looking gene names, check that the query species has been input correctly.

APPENDIX 1. COMMAND LINE TUTORIAL FOR WINDOWS

Getting Started

In order to execute scripts from the command line it is helpful to know some basic Windows commands as well as have an understanding of file structure. This appendix contains some of the basics for those who are unfamiliar with working at the command line. Let us start with an example.
Suppose you want to execute the GoSelect.py script. We’ll assume that this script has been saved in a folder called Scripts on the Desktop. For Windows 8, the first thing to do is open the Command Prompt application, which accepts and executes commands from the user. Go to Programs > Windows System > Command Prompt. The location of the Command Prompt application may be slightly different in other versions of Windows, but should be similar. Once open you will see a prompt that is waiting for your command. Usually the prompt contains information about the folder (also called directory) that you are currently in (also called the current working directory). The prompt usually ends with a ‘>’. Note that Windows commands are generally not case-sensitive, but it is good practice to type file and directory names exactly as they are written.

Locations of files and directories

Before working with a file, you need to navigate to the directory where the file resides. This should be a familiar concept as you have often navigated through directories to find files using the mouse. This is the file structure. Each directory is specified as a certain location in the file structure. Usually the prompt will tell you your location, but in case it is truncated, to see your current location, use this command:

    chdir

This will show an output like this:

    \Users\YourName

This is your home directory. Your home directory actually resides within a set of directories above this, but they are not shown because most of the time you don’t need to use them (see below regarding the root). Each file and directory will have a location.
The full description of the location of a file is the absolute path:

    \Users\YourName\Desktop\Scripts\GoSelect.py

Assuming this is where the GoSelect.py script file is saved, you can navigate to it using the change directory command:

    cd Desktop
    cd Scripts

Note: To accomplish this in one command you could have typed

    cd Desktop\Scripts

Tip: On most operating systems you can begin typing a directory or file name and then press the tab button and the computer will try to guess what you are typing. If you have typed enough letters the computer will complete the word. If you have not typed enough letters the computer won’t be sure what you are trying to type and may be able to complete none or only part of the word you are typing, in which case you will need to provide more characters.

Print the current working directory:

    chdir
    \Users\YourName\Desktop\Scripts

You can also change directories by specifying the relative path of where you want to go.
    .\ = the current directory
    ..\ = the parent directory of the current directory (one directory above the current directory)
    \ = the root directory (or the highest directory in your computer). This location stores files used by the operating system and some of the programs that are installed. If you are a new user you might want to avoid going here.

If you don’t specify an absolute or relative path, the computer will assume that you mean for it to look in the current directory.

To change directories back to the Desktop you could type:

    cd ..

However, for now we want to be in the Scripts directory, so if you moved back to the Desktop type

    cd Scripts

Tip: If you misspell a directory or file name you will get an error saying something to the effect of “The system cannot find the path specified”

Tip: There is another handy way to change directories. At the command prompt type cd plus a space and then drag and drop a directory onto the command prompt. The absolute path of the directory will be automatically written for you.
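The same distinction between absolute and relative paths applies when a file name is typed into one of the Python scripts. The sketch below is a small helper, with the hypothetical name describe_path, that reports how Python would interpret a file name; it only uses the standard os.path module and is not part of the SeqSelector toolset.

```python
import os


def describe_path(filename):
    """Report how Python interprets a file name given as a setting."""
    return {
        # False for names such as 'GoSelect.py' or '..\\Test\\GoSelect.py'
        "is_absolute": os.path.isabs(filename),
        # Relative names are resolved against the current working directory
        "absolute_path": os.path.abspath(filename),
        # False usually means a misspelled name or the wrong directory
        "exists": os.path.exists(filename),
    }


if __name__ == "__main__":
    print(describe_path("GoSelect.py"))
```

If "exists" is False for a file you expected to find, compare "absolute_path" with the location where the file is actually saved.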
Locations of programs and scripts

When programs, such as NCBI BLAST+, are installed, the executable file for the program is often put in your path in a location above the home directory. When you execute a program by typing its name at the command line the computer searches all of the directories that are in your path, and if it finds the program it will execute it. If it can’t find the program, it will give you the “The system cannot find the path specified” error, or perhaps tell you that the command “is not recognized as an internal or external command, operable program, or batch file”. Unless a program has been installed in such a way that it is put into your path, the simplest thing is to keep the program executable (including scripts) in the same directory as the input files. If you do this, you don’t have to specify an absolute or relative path because the computer will just look for the program and the input files in the current directory. If you choose to put the program executable or input files in different directories you will need to specify the absolute path or the relative path to the program/files from the current working directory.

Working with files and directories

To see the contents of the current directory:

    dir

This will probably show something like this:

    GoSelect.py
    MergeSeqs.py
    ParseBLASTExons.py

To make a new directory (here, called Test) within the current directory:

    mkdir Test

To copy a file, use the copy command. To make a copy of a file and place the copy in the current directory type:

    copy GoSelect.py GoSelect2.py

You can also copy the file to the directory Test that we created above:

    copy GoSelect.py .\Test\GoSelect2.py

Tip: If you want the file that was copied to the Test directory to keep the same name you could have just typed

    copy GoSelect.py .\Test\

Tip: You can also use the relative path.
For example, to copy a file from the Test directory back into the Scripts directory you could type (assuming you have changed directory into the Test directory):

    copy GoSelect.py .\..\GoSelectTest.py

Tip: When copying files, if you specify that the name of the copied file should be the same as the name of a file that currently exists in that location, the existing file will be replaced with the copy. Depending on your settings, the computer may not ask whether you want to replace it, so check the names carefully.

Go back to the Scripts directory. To move a file without copying it, you can use the move command:

    move ParseBLASTExons.py .\Test\

You can also use the move command to rename a file (move it to a file with a different name in the same directory):

    move ParseBLASTWhole.py .\ParseBLASTWholeTest.py

Change to the Test directory. To delete a file use the del command:

    del GoSelect2.py

Move back to the Scripts directory. To delete the Test directory (and all of the files within it) type:

    rmdir .\Test /s

Tip: For new users it may be safer at first to delete files and directories using the mouse. When deleting files and directories from the command line the computer will not usually ask if you are sure (although there are options to turn on this behavior), and usually the files and directories will NOT be put in the recycle bin.

To view (but not modify) the contents of a file that is too large to open with a text editor, you can see a few lines at a time using this command:

    more GoSelect.py

Pressing the space bar will progress through the file. Type q to exit and return to the command prompt.

To concatenate large files, such as fasta files containing genome sequences, you can use the command below. This will concatenate all of the files listed before the > together, immediately one after another (i.e. without inserting a blank line or an end of line character).
    type Filename1.txt Filename2.txt > ConcatenatedFiles.txt

Getting help

In most cases, you should be able to get information about a command by typing the name of the command followed by /?. For help on the copy command type:

    copy /?

You will probably find that Google is also a good source of help.

APPENDIX 2. COMMAND LINE TUTORIAL FOR MAC/UNIX/LINUX

Getting Started

Mac OSX and Linux are UNIX-based operating systems. In order to execute scripts from the command line it is helpful to know some basic UNIX commands as well as have an understanding of file structure. This appendix contains some of the basics for those who are unfamiliar with working at the command line. Let us start with an example. Suppose you want to execute the GoSelect.py script. We’ll assume that this script has been saved in a folder called Scripts on the Desktop. For Mac OSX, the first thing to do is open the Terminal application, which accepts and executes commands from the user. Go to Finder > Applications > Utilities > Terminal. The location of the Terminal application may be slightly different in Linux, but should be similar. Once open you will see a prompt that is waiting for your command. Usually the prompt contains information about the folder (also called directory) that you are currently in (also called the current working directory). The prompt usually ends with a $. Note that Unix is case-sensitive.

Locations of files and directories

Before working with a file, you need to navigate to the directory where the file resides. This should be a familiar concept as you have often navigated through directories to find files using the mouse. This is the file structure. Each directory is specified as a certain location in the file structure. To see your current location, print the working directory using this command:

    pwd

This will show an output like this:

    /Users/YourName

This is your home directory.
Your home directory actually resides within a set of directories above this, but they are not shown because most of the time you don’t need to use them (see below regarding the root). Each file and directory will have a location. The full description of the location of a file is the absolute path:

    /Users/YourName/Desktop/Scripts/GoSelect.py

Assuming this is where the GoSelect.py script file is saved, you can navigate to it using the change directory command:

    cd Desktop
    cd Scripts

Note: To accomplish this in one command you could have typed

    cd Desktop/Scripts

Tip: On most operating systems you can begin typing a directory or file name and then press the tab button and the computer will try to guess what you are typing. If you have typed enough letters the computer will complete the word. If you have not typed enough letters the computer won’t be sure what you are trying to type and may be able to complete none or only part of the word you are typing, in which case you will need to provide more characters.

Print the current working directory:

    pwd
    /Users/YourName/Desktop/Scripts

You can also change directories by specifying the relative path of where you want to go.
    ./ = the current directory
    ../ = the parent directory of the current directory (one directory above the current directory)
    / = the root directory (or the highest directory in your computer). This location stores files used by the operating system and some of the programs that are installed. If you are a new user you might want to avoid going here.

If you don’t specify an absolute or relative path, the computer will assume that you mean for it to look in the current directory.

To change directories back to the Desktop you could type:

    cd ..
However, for now we want to be in the Scripts directory, so if you moved back to the Desktop type

    cd Scripts

Tip: If you misspell a directory or file name you will get an error saying something to the effect of “No such file or directory”

Tip: For Mac OSX, if you ever get lost within the file structure, you can simply type cd (with nothing after it), and you will automatically change back to your home directory.

Tip: Mac OSX also has another handy way to change directories. At the terminal command prompt type cd plus a space and then drag and drop a directory onto the command prompt. The absolute path of the directory will be automatically written for you.

Locations of programs and scripts

When programs, such as NCBI BLAST+, are installed, the executable file for the program is often put in your path in a location above the home directory. When you execute a program by typing its name at the command line the computer searches all of the directories that are in your path, and if it finds the program it will execute it. If it can’t find the program, you will get a “command not found” or “No such file or directory” error. Unless a program has been installed in such a way that it is put into your path, the simplest thing is to keep the program executable (including scripts) in the same directory as the input files. If you do this, you don’t have to specify an absolute or relative path because the computer will just look for the program and the input files in the current directory. If you choose to put the program executable or input files in different directories you will need to specify the absolute path or the relative path to the program/files from the current working directory.

Working with files and directories

To see the contents of the current directory:

    ls

This will probably show something like this:

    GoSelect.py
    MergeSeqs.py
    ParseBLASTExons.py

To make a new directory (here, called Test) within the current directory:

    mkdir Test

To copy a file, use the cp command.
To make a copy of a file and place the copy in the current directory type:

    cp GoSelect.py GoSelect2.py

You can also copy the file to the directory Test that we created above:

    cp GoSelect.py ./Test/GoSelect2.py

Tip: If you want the file that was copied to the Test directory to keep the same name you could have just typed

    cp GoSelect.py ./Test/

Tip: You can also use the relative path. For example, to copy a file from the Test directory back into the Scripts directory you could type (assuming you have changed directory into the Test directory):

    cp GoSelect.py ./../GoSelectTest.py

Tip: When copying files, if you specify that the name of the copied file should be the same as the name of a file that currently exists in that location, the current file will be replaced with the copy without asking if you want to replace it.

Go back to the Scripts directory. To move a file without copying it, you can use the move command:

    mv ParseBLASTExons.py ./Test/

You can also use the move command to rename a file (move it to a file with a different name in the same directory):

    mv ParseBLASTWhole.py ./ParseBLASTWholeTest.py

Change to the Test directory. To delete a file use the remove command:

    rm GoSelect2.py

Move back to the Scripts directory. To delete the Test directory (and all of the files within it) type:

    rm -r Test

Tip: For new users it may be safer at first to delete files and directories using the mouse. When deleting files and directories from the command line the computer will not usually ask if you are sure (although there are options to turn on this behavior), and usually the files and directories will NOT be put in the Trash or recycle bin.

To view (but not modify) the contents of a file that is too large to open with a text editor, you can see a few lines at a time using this command:

    less GoSelect.py

Pressing the space bar will progress through the file. Type q to exit and return to the command prompt.
The more command is similar in many ways, but the contents of the file that you viewed will remain in the window after typing q to exit.

To concatenate large files, such as fasta files containing genome sequences, you can use the command below. This will concatenate all of the files listed before the > together, immediately one after another (i.e. without inserting a blank line or an end of line character).

    cat Filename1.txt Filename2.txt > ConcatenatedFiles.txt

Getting help

In most cases, you should be able to get information about a command by typing man (for manual) and then the name of the command. For help on the cp command type:

    man cp

To exit the man page, type:

    q

You will probably find that Google is also a good source of help.

Permissions

In order to read, edit, or execute a script, you need to have permission to do so. To see detailed information about the contents of the current directory, including the permissions:

    ls -l

This will show something like this:

    -rwxr-xr-x 1 AW staff 4949 Dec 8 12:50 GoSelect.py
    -r--r--r-- 1 AW staff 5979 Dec 7 10:11 MergeSeqs.py

The letters on the left here are the permissions. The first dash tells us that GoSelect.py and MergeSeqs.py are not directories (if they were directories there would be a d instead of a dash). The next letters (rwx) specify permission to read (r), write/edit (w), and execute (x). The first set of three represents you, the user, the second set of three specifies permissions for the group, and the final set of three specifies permissions for others. Only the owner of a file or directory may change the permissions. Here you can see that you have permission to read, write and execute GoSelect.py; however, you only have permission to read MergeSeqs.py. As the owner of these scripts you can change the permissions using the chmod command.
To add permission for you, the user, to write and execute the MergeSeqs.py script:

    chmod u+w+x MergeSeqs.py

To add permission for the group to write (edit) and execute the script:

    chmod g+w+x MergeSeqs.py

To remove permission for the group to write/edit the script:

    chmod g-w MergeSeqs.py

Typing ls -l again should give you:

    -rwxr-xr-x 1 AW staff 4949 Dec 8 12:50 GoSelect.py
    -rwxr-xr-- 1 AW staff 5979 Dec 7 10:11 MergeSeqs.py

Tip: Sometimes the computer may not recognize that you are the owner of a file and return the error “Permission denied”. In this case you may need to use the sudo command to tell the computer that you are in charge:

    sudo chmod u+w+x MergeSeqs.py

In this case you will need to provide the password that you use to sign into your computer.
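The permission letters shown by ls can also be inspected from Python, which may help when checking whether a script is executable. The following is a minimal sketch for Mac/Unix/Linux using the standard os and stat modules (Python 3.3 or later is assumed); the function names are hypothetical and are not part of the SeqSelector toolset.

```python
import os
import stat


def permission_string(path):
    """Return the permissions in the same form as ls shows, e.g. '-rwxr-xr-x'."""
    return stat.filemode(os.stat(path).st_mode)


def add_user_write_execute(path):
    """Add write and execute permission for the user, like chmod u+w+x."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IWUSR | stat.S_IXUSR)
```

As with the chmod command, add_user_write_execute will only succeed if you are the owner of the file.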