SeqSelector Documentation1.1b

SEQSELECTOR VERSION 1.1B MANUAL
Andreanna J. Welch, Andre E. Moura, Annalora Irvine, and A. Rus Hoelzel
14 May 2014
TABLE OF CONTENTS
1. OVERVIEW
1.1 License and Warranty
1.2 Citing the Program
1.3 Conventions Used in This Manual
1.4 Getting help
2. INSTALLATION
2.1 SeqSelector Toolset Download
2.2 Required Software
2.3 Optional Software
3. SUGGESTED WORKFLOW AND TIPS
3.1 Understanding the scripts
3.2 Some basics of python to keep in mind
3.3 Using the scripts
3.4 Identifying genes using functional GO annotations
3.5 Selecting gene sequences using SeqFinder
3.6 Removing duplicate sequences using SeqFinderRemoveDuplicates
3.7 Quality control of sequences from the reference
3.8 Performing a local BLAST search
3.9 Using AccessionNumberChecker to list all accession numbers for the non-model species genome sequences
3.10 Obtaining sequences from the BLAST results using ParseBLAST
3.11 Merging sequences from the reference and non-model species for bait design
3.12 Quality control with QCMaskedMissing
3.13 Quality Control with QCComplementarity
3.14 Quality Control with QCDuplicates
3.15 Sorting and obtaining stats for fasta files independently
4. DETAILED DESCRIPTION OF SCRIPTS
4.1 GoSelect
4.2 SeqFinder
4.3 SeqFinderRemoveDuplicates
4.4 AccessionNumberChecker
4.5 ParseBLAST
4.6 MergeSeqs
4.7 QCMaskedMissing
4.8 QCComplementarity
4.9 QCDuplicates
4.10 SequencesSortStats
5. TROUBLESHOOTING
5.1 Error Messages
5.2 Unexpected results
APPENDIX 1. COMMAND LINE TUTORIAL FOR WINDOWS
Getting Started
Locations of files and directories
Working with files and directories
Getting help
APPENDIX 2. COMMAND LINE TUTORIAL FOR MAC/UNIX/LINUX
Getting Started
Locations of files and directories
Working with files and directories
Getting help
Permissions
1. OVERVIEW
The SeqSelector toolset is a suite of scripts for identification and selection of the sequences of
interest from the genomes of model and non-model species for use in capture enrichment of
next-generation sequencing libraries. The suggested workflow (Figure 1) may start at two
different steps. One option is for users to use the toolset to identify genes of interest. In addition
to information from the literature and results from previous work, the GoSelect script can be
used to search GO term annotation files (available for model species from the Ensembl BioMart
database) for key words related to the function of gene products. Once a suitable set of genes has
been identified, the SeqSelector scripts will search the annotated genome sequence files of a
working reference species (i.e. an annotated genome sequence for a model or non-model species
in Genbank format) for those genes, and return the sequences in a fasta file. Depending on the
version of the script, either full or partial gene, exon (CDS), or mRNA sequences will be
returned. Baits for sequence capture can be designed from these sequences directly, depending
on the evolutionary distance to the species of interest, or if desired, these sequences can be used
as queries to perform a BLAST search to find the corresponding sequences in the assembled,
unannotated genome sequences of a non-model species. The workflow may also begin by
bypassing these steps and starting with EST, transcriptome, or other (e.g. ultraconserved or non-coding) sequences from the reference species, which can be used as BLAST queries. The
ParseBLAST script can then be used to retrieve these sequences from the genome of the non-model organism. Depending on the needs of the user, and the evolutionary distance between the
reference and non-model species, the sequences from the reference and the non-model species
can be merged such that the sequence from the reference species will be returned whenever a
sequence was not found in the non-model species. We provide some additional tools for quality
control of selected sequences, as well as other tools to facilitate progression through the
workflow.
The SeqSelector toolset has been developed to be flexible. Each tool stands alone and may be
implemented according to the user’s needs. The SeqFinder script focuses on selecting the
sequences of genes from the genome, but non-coding regions may also be introduced into the
workflow at the BLAST search/ParseBLAST step, as described above.
The tools were also developed with the aim of being user-friendly, as well as quick and easy to
run. The scripts are written in python, although essentially no knowledge of python is required to
run them. They will work on any of the most common operating systems, including Windows
and Mac OSX, and should run quickly (probably in less than 15 minutes in most cases) on
standard desktop or laptop computers. The infiles required are standard formats (csv formatted
text, GenBank, and fasta format files) or produced through the workflow (BLAST search
results). We have tried to provide thorough and user-friendly documentation, tips, tutorials, and
troubleshooting suggestions, as well as example files and results.
We recommend starting with the Suggested Workflow after installation of the required software.
Figure 1. Overview of the SeqSelector work flow demonstrating major steps and associated
tools. Gray shading represents steps using information from the working reference genome or
additional published and unpublished data. Blue text and dashed arrows indicate steps to obtain
sequences of genes of interest from the reference species genome, while green text and dotted
arrows indicate starting point when EST or transcriptome data are available. Gray dotted arrows
indicate an optional step.
1.1 License and Warranty
The SeqSelector toolset is free software: you can redistribute it and/or modify it under the terms
of the GNU General Public License as published by the Free Software Foundation, either version
3 of the License, or (at your option) any later version. This software is distributed in the hope
that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of
merchantability or fitness for a particular purpose. See the GNU General Public License at
http://www.gnu.org/copyleft/gpl.html for more details.
1.2 Citing the Program
Welch AJ, Moura AE, Irvine A, Hoelzel AR. 2014. SeqSelector: Bioinformatic toolset to select
genomic regions for capture enrichment of next-generation sequencing libraries from model and
non-model species. In review.
Please feel free to email andreanna05@gmail.com for a copy of the manuscript.
1.3 Conventions Used in This Manual
In an attempt to increase clarity we use certain conventions throughout this manual. The terms
reference, query, and model species are used interchangeably, and by this we mean a species
whose genome sequences are annotated and available in GenBank format. These species may or
may not be traditional model species, as long as annotated sequences are available. The terms
target, subject, and non-model species are used interchangeably to mean a species whose genome
sequences have been assembled, but not yet annotated.
Tips that we thought might be helpful for getting started are indented and written in gray.
Arrows (→) are used to indicate selections that you should make (generally from interactive
dialogue boxes).
Commands that should be entered at the command-line are specified in the typewriter
font.
Output from the scripts is written in black text with gray shading.
1.4 Getting help
For issues related to installation of the required and optional software, Google is often the best
bet since these programs are widely used and generally well supported. This manual contains
several sections with helpful information on how to run the SeqSelector toolset. If you are
unfamiliar with executing programs from the command line, you may wish to start by going
through the command line tutorial for Windows or Mac/Unix/Linux (Appendix 1 and 2). The
suggested workflow provides step-by-step instructions and includes tips and additional
information. There is also a detailed description for each script, which contains important
information on input files, settings, and output files. Finally, at the end of the manual there is a
Troubleshooting section, which contains a list of the most frequently encountered errors and
some suggestions of what to do if you receive unexpected output from the scripts. If you require
additional assistance, or have suggestions for how to improve the usefulness of these tools, you
can email Andreanna Welch at andreanna05@gmail.com.
2. INSTALLATION
If you are not familiar with operating from the command line on your computer, you may wish to
start with the Command Line tutorials in Appendix 1 for Windows and Appendix 2 for Unix,
Linux, and Mac.
2.1 SeqSelector Toolset Download
Programs are available at: https://sourceforge.net/projects/seqselector/
Programs are written in the python scripting language. Python is platform independent and works on
all common operating systems, including MacOS, Windows, Unix, and Linux.
2.2 Required Software
Python 2.5 - 2.7
*Note: The code was written in python 2.7 and may or may not be compatible with python 3
There are many free options for obtaining python:
– Downloads are available at the python website http://www.python.org/download/ which
includes installers for Windows and Mac. If you use Unix or Linux, you will need to compile
from the source code when using this option.
– Enthought python distribution. This is a user-friendly distribution that includes helpful
modules (like NumPy), and is free for academic users:
https://www.enthought.com/products/epd/
https://www.enthought.com/products/canopy/academic/
– NumPy provides installers for Windows and Mac that include python 2.7
http://www.scipy.org/scipylib/download.html
NumPy 1.5 and above
There are many options for obtaining NumPy:
– NumPy provides installers for Windows and Mac, as well as source code for installation on
other platforms
http://www.scipy.org/scipylib/download.html
– NumPy also comes with the Enthought python distribution, which is free for academic users:
https://www.enthought.com/products/epd/
https://www.enthought.com/products/canopy/academic/
Biopython
You can find downloads and installation instructions for biopython at
http://biopython.org/wiki/Download
Biopython installers are available for Windows, but not Mac. Mac users will have to install
Apple’s XCode tools and the optional command line tools (see
http://docwiki.embarcadero.com/RADStudio/XE4/en/Installing_the_Xcode_Command_Line_Tools_on_a_Mac),
which are now available at the App Store. The download is large and may
require free registration as an Apple Developer, but the package includes installers and should
not be difficult to obtain.
The installation instructions for biopython should provide quick and easy installation.
Test your installation
For Windows:
→ Go to Accessories → Command Prompt to open the command prompt window
→ Type python to start python. Some information about version should be printed to the
screen.
→ Type import numpy to test for NumPy installation
→ Type from Bio import SeqIO to test for Biopython installation
For Mac:
→ Go to Applications → Utilities → Terminal to open the terminal window
→ Type python to start python. Some information about version should be printed to the
screen.
→ Type import numpy to test for NumPy installation
→ Type from Bio import SeqIO to test for Biopython installation
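The same checks can be run in one go from a short script. The sketch below is not part of the SeqSelector toolset; the function name check_modules is made up for illustration, and it relies on the importlib module, which is assumed to be available (python 2.7 or later).

```python
import importlib
import sys


def check_modules(names):
    """Return a dict mapping each module name to True if it can be imported."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


if __name__ == "__main__":
    # Report the python version, then whether NumPy and Biopython import cleanly
    print("python version: " + sys.version.split()[0])
    for name, ok in sorted(check_modules(["numpy", "Bio.SeqIO"]).items()):
        print(name + ": " + ("found" if ok else "MISSING"))
```

Any module reported as MISSING needs to be installed (or installed for the same python that runs the scripts) before continuing.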
2.3 Optional Software
NCBI BLAST+ 2.2+ Command Line Applications
Installation
Manual with installation and usage instructions:
http://www.ncbi.nlm.nih.gov/books/NBK1763/
Download Site:
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/
*Note: You may choose to follow the steps to configure BLAST+ so that you can call it from
any directory. However, if you choose not to configure BLAST+ this way then you will need to
execute BLAST search commands from the directory where the executable resides. To make
things simple, keep your BLAST+ executable, BLAST database, and query files in the same
directory. If you keep your database and query files somewhere else, you will
have to provide the path to the database and query files when issuing a command.
Selecting a BLAST database to use
A) You can download and use preformatted BLAST databases if these are suitable for your
purposes. Available databases include human genomic, refseq genomic, other genomic, and the
non-redundant database. These databases can be found at the NCBI ftp site
(ftp://ftp.ncbi.nlm.nih.gov/blast/db/). The BLAST+ program comes with scripts that can be set up
to update the database files regularly, and to only download files with a newer version than the
local installation. For more information see the readme file
(ftp://ftp.ncbi.nlm.nih.gov/blast/db/README). In this case you will need to use the blastdbcmd
program to generate the original FASTA files for using the ParseBLAST scripts.
B) You can also build your own BLAST database using fasta files from your species of interest
using the makeblastdb program. See the section called “Building a BLAST database with local
sequences” in the BLAST+ help manual/cookbook. If genomic sequences of your species of
interest are available on GenBank, perform a search (e.g. using the advanced search feature) that
returns the sequences you are interested in. Download all of the sequences to a single file by clicking
the “Send to:” button in the upper right hand corner of the search results screen, and then
selecting “file” and then “fasta” format. If you don’t click any checkboxes then all of the
sequences will be downloaded, otherwise only the selected sequences will be downloaded. Use
the makeblastdb program to create the BLAST database from the fasta file:
makeblastdb -in MySpecies.fasta -dbtype nucl -parse_seqids -out MySpecies -title "MySpecies Genome Contigs"
Testing your blast database
Test your database using a command like this:
blastn -db MySpecies -query QuerySequence.fasta -out Results.out
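If you prefer to keep a record of the exact commands, the two BLAST+ calls above can also be assembled in python. This is only a sketch: the helper names are made up for illustration, the options are exactly those shown in the commands above, and running them assumes the BLAST+ executables can be found from the current directory.

```python
import subprocess


def makeblastdb_command(fasta, db_name, title):
    """Build the makeblastdb argument list shown in the manual."""
    return ["makeblastdb", "-in", fasta, "-dbtype", "nucl",
            "-parse_seqids", "-out", db_name, "-title", title]


def blastn_command(db_name, query, out_file):
    """Build the blastn test-search argument list shown in the manual."""
    return ["blastn", "-db", db_name, "-query", query, "-out", out_file]


def run(command):
    """Run a command and return its exit status (0 means success)."""
    return subprocess.call(command)


if __name__ == "__main__":
    # Print the argument lists; pass them to run(...) once BLAST+ is installed
    print(makeblastdb_command("MySpecies.fasta", "MySpecies",
                              "MySpecies Genome Contigs"))
    print(blastn_command("MySpecies", "QuerySequence.fasta", "Results.out"))
```

Passing the arguments as a list means the database title can contain spaces without any extra quoting.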
RepeatMasker
RepeatMasker (http://repeatmasker.org/) can identify and mask repetitive and low complexity
regions of sequence, and is a handy tool for quality control of sequences before bait design.
A web server version is available at http://repeatmasker.org/cgi-bin/WEBRepeatMasker, which
should be sufficient in most cases. Small datasets will be processed immediately and the results
displayed in the web browser. However, for larger datasets (which you are likely to use), you will
need to select ‘email’ as the return method, and then the run will be queued and you will be
notified upon its completion. For extremely large datasets it may be beneficial to install a local
copy, although this is available for Unix-based systems only. See
http://repeatmasker.org/RMDownload.html for more details.
3. SUGGESTED WORKFLOW AND TIPS
3.1 Understanding the scripts
Python scripts are just plain text documents that contain a series of commands for the computer
to perform. There are two ways to use the SeqSelector scripts. The first method is to simply
execute the script (see next section) and change the settings interactively to match your filenames
and desired options. The second method is to open the script in your favorite text editor and
change the initial settings there directly. This way when you execute the script, all of the settings
will be correct. You can also optionally turn off the interactive mode completely. The second
method has two benefits: 1) if you need to re-run the script (or change just one or two settings)
you will not have to re-enter all of the settings at the prompt; 2) You can save a copy of the script
for each run, and therefore keep a record of your work. The second method does have one slight
drawback, however. In order to input your settings into the script directly, you will have to
follow some simple python rules (see next section).
If you would like to input your settings directly in the script, then open the script using a text
editor such as WordPad or TextEdit. There are also more advanced text editors available (see tip
below), which are often free. Once the script is open you will see that it is set up with different
sections. The first section at the top of each file contains some housekeeping things. This is
because certain commands are needed (such as module import) before other input can be
accepted. The second section, called “Initial Input”, is where you can specify your filenames and
desired options. Be sure to save the file after changing any filenames or options, or when you
execute the script it will execute the last version that was saved (without your changes). The final
section contains the rest of the script, which carries out your commands.
Tip: For Mac, TextWrangler is a particularly nice text editor. For Windows, Notepad++ is
very practical and versatile.
Tip: AVOID OPENING SCRIPTS IN MICROSOFT WORD or other word processing
programs as this could introduce ‘invisible characters’ to the file, which will cause errors.
Tip: If you open the scripts and it looks like a jumble of text without line breaks, the line
endings may be an issue. Windows uses a carriage return plus a line feed character to
represent line endings, while Mac and Unix use just the line feed character. The easiest way
to avoid this problem is to open the files in a smarter text editor such as WordPad for
Windows or TextWrangler for Mac.
Tip: Only change settings in the “Initial Input” section or the script may not work.
Tip: Be careful when selecting the outfile names as well. If you accidentally select the same
name as one of the input files it will be written over.
Tip: Script files can be saved with different names if you want to keep a record of your
analyses
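If the line endings described in the tip above remain a problem, they can also be converted with a few lines of python. This is a minimal sketch and not part of the toolset; the function names are made up for illustration, and it also converts bare carriage returns, which is an extra assumption beyond what the tip describes.

```python
def normalize_line_endings(text):
    """Convert carriage return + line feed (and bare carriage returns) to line feeds."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def convert_file(in_path, out_path):
    """Write a copy of in_path with line-feed-only line endings to out_path."""
    with open(in_path, "rb") as handle:
        data = handle.read()
    # Work on raw bytes so that no other characters in the file are altered
    data = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    with open(out_path, "wb") as handle:
        handle.write(data)
```

Write the converted copy to a new filename rather than over the original, so that the original script is still available if something goes wrong.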
3.2 Some basics of python to keep in mind
If you decide to enter your settings interactively after executing the script, you can probably skip
most of this section, although it may still be helpful to have an idea of some of the basics of
python. The only thing you must keep in mind is that python is case sensitive, except for some
situations where the script has been written to get around this.
If you would like to enter your settings in the script file directly then this section contains
important information for you. First, as mentioned above, python is case sensitive. There are
some situations in which the scripts have been written in a way to make them case insensitive,
and when this is true it will be clearly specified. All other times you should assume that case is
important.
Second, words or phrases should have quotes around them. These are also known as strings.
Depending on the content of the word/phrase, single, double, or triple quotes can be used. The
key is to make sure that the type of quote used matches on both sides of the text.
Tip: If your phrase contains double quotes, then use single quotes around the outside, and
vice versa
Tip: If your phrase contains both double and single quotes, then use triple quotes around the
outside
Third, numbers that should be treated as numbers (e.g. minimum sequence length,
number of Ns when considering missing data) should not have quotes around
them. Putting quotes around numbers will result in them being treated as text, which could cause
errors depending on their usage in the script.
Finally, when entering settings directly into the scripts, the # is the comment symbol, and any
text following it will not be executed. Therefore, you can use the # to write notes or comments in
the file.
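The rules in this section can be illustrated with a few lines of python. The setting names below are hypothetical and are not the names used in the SeqSelector scripts; see the “Initial Input” section of each script for the real ones.

```python
# Hypothetical settings for illustration only; these are NOT the variable
# names used in the SeqSelector scripts.

infile = "MyGenes.csv"               # text (a string) needs matching quotes
title = 'Run with "strict" filters'  # double quotes inside, single quotes outside
note = """Uses both 'single' and "double" quotes"""  # triple quotes outside
min_length = 120                     # a number: no quotes
min_length_as_text = "120"           # quoted, so python treats it as text

# Case matters: Infile and infile would be two different names.
print(type(min_length).__name__)          # int
print(type(min_length_as_text).__name__)  # str
```

Everything after a # on a line is a comment, so the notes on the right are ignored when the file is executed.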
3.3 Using the scripts
The scripts are executed from the command line. See the Windows and Unix tutorials for help on
using the command line if you are not already familiar with it.
For Windows:
→ Go to Accessories → Command Prompt to open the command prompt window
→ Navigate to the appropriate folder
→ Type the name of the script
Tip: You may need to type ./ or python in front of the script name
For Mac:
→ Go to Applications → Utilities → Terminal to open the terminal window
→ Navigate to the appropriate folder
→ Type the name of the script
Tip: You may need to type ./ or python in front of script name
Tip: If you get an error about permissions, see the command line tutorial for how to fix this
If a run is going and you decide you would like to stop it you can kill the run by pressing
control c (for Windows or Mac) or control ScrLock (for Windows). This will usually
cause python to output several errors before returning to the command prompt.
The script will start by printing its name and the list of initial settings. You will be given a
prompt to either change the settings or to type run to start the analysis. To change a setting, just
type its name. The setting name will be on the left side of the equals sign and the current setting
will be on the right. When using the scripts, you will need to specify the location and name of the
input files. If you keep the input files in the same folder as the script then you simply need to
type the name of the input files in the script. If you keep the input files somewhere else, you will
need to specify the entire path to the file (e.g. something like
/YourName/Desktop/Folder/Input.fa). See the Windows and Unix tutorials for more on this.
Tip: Spaces in the names of files or folders can be difficult to handle and should probably be
avoided – use underscores instead
Tip: Copying and pasting the names of files and directories is a helpful way to prevent errors
caused by typos. Windows may not allow you to copy and paste at the command line.
In cases where a setting has a small number of options (e.g. yes or no), if your input is not valid
it will prompt you to try again. In cases where the options are unlimited (e.g. filenames) the
script will not know there has been an error until the analysis is started. If that happens you will
need to execute the script again and re-enter all of your settings.
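Because a mistyped filename is only detected once the analysis starts, it can save time to confirm that the input files exist before running a script. The sketch below is a stand-alone helper, not something the SeqSelector scripts do themselves; the function name is made up for illustration.

```python
import os


def missing_files(paths):
    """Return the paths that do not point to an existing file."""
    return [path for path in paths if not os.path.isfile(path)]


if __name__ == "__main__":
    # Example filenames; replace them with your own input files
    for path in missing_files(["Input.fa", "MyGenes.csv"]):
        print("Not found: " + path)
```

An empty result means every file was found at the location given, whether as a bare filename in the current folder or as a full path.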
3.4 Identifying genes using functional GO annotations
Downloading GO annotation files
Candidate genes for targeted sequence capture can be identified in various ways, including
literature searches and from the results of previous experiments. It may also be useful to identify
genes by searching for a particular annotated function. Gene Ontology (GO) terms are a
standardized vocabulary for specifying gene product functions (see
http://www.geneontology.org/ for more details). The genome sequences of many model
organisms have been annotated with GO terms and these annotations can be downloaded and
searched.
To download GO term annotations for genes of a reference species:
Go to the Ensembl BioMart database (http://www.ensembl.org/biomart/). Under database choose
Ensembl Genes, and under dataset select the reference species of interest. Click on attributes on
the left side of the screen and select relevant options (note that attributes are added in the order that you
select them), such as:
– Under Gene:
o Ensembl Gene ID
o Ensembl Transcript ID
o Description
o Chromosome Name
o Gene Start (bp)
o Gene End (bp)
o Associated Gene Name
– Under External:
o GO Term Accession
o GO Term Name
o EMBL (GenBank) ID
When finished, click on Results on the left. GO term annotations can be downloaded by selecting
‘Export all results to’ → File → TSV → Go.
Note: It’s important to download the results as TSV (tab delimited) and not CSV (comma
separated values) because some of the GO term descriptions contain commas, which interfere
when parsing the file.
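To see why the tab-delimited format is safer, the sketch below reads a two-column example with python's csv module. The gene ID is made up and the helper name is hypothetical; the point is only that a GO term name containing a comma stays in one field when tabs are used as the delimiter.

```python
import csv


def read_tsv(lines):
    """Parse tab-delimited lines into a list of rows (lists of fields)."""
    return list(csv.reader(lines, delimiter="\t"))


example = [
    "Ensembl Gene ID\tGO Term Name",
    "GENE0001\tregulation of transcription, DNA-templated",  # made-up gene ID
]
rows = read_tsv(example)
print(rows[1][1])  # the GO term name, with its comma, is still a single field
```

For a downloaded file, pass the open file handle to read_tsv in place of the example list.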
Using GoSelect to search for functions
The GoSelect.py script will identify all entries (lines) of the GO term annotation file that contain
a particular gene name or function of interest. See the detailed description of the GoSelect script
for more information, and example files on SourceForge. Users can select to search for a single
term, or to search for two terms, whose relationship is defined using the Boolean operators and,
or, and not. The makeunique setting can be used so that the script also returns a list of results
for only the first instance of a gene or transcript ID. In this way a list of only unique
genes/transcripts is returned, each of which contains the search term(s) in the function or gene name. The
full results file is useful to look at because a single gene may have several functions of interest.
However, it can be difficult to tell from this file how many unique genes were found with your
search terms. This information is much easier to see from the unique results file. Finally, the
geneslist option can be set so that after the search, the unique list of gene names is returned in the
format required for the SeqSelector scripts. To search within the results for a further subset of
genes of interest, simply run the script again on the results file.
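The kind of search described above can be sketched in a few lines of python. This is an illustration of the idea only, not the GoSelect code: the function names are made up, matching is assumed to ignore case, “not” is assumed to mean “contains the first term but not the second”, and the gene or transcript ID is assumed to be in the first tab-delimited column.

```python
def matches(text, term1, term2=None, operator="and"):
    """Test one line of text against one or two search terms (case ignored)."""
    text = text.lower()
    first = term1.lower() in text
    if term2 is None:
        return first
    second = term2.lower() in text
    if operator == "and":
        return first and second
    if operator == "or":
        return first or second
    if operator == "not":
        return first and not second
    raise ValueError("operator must be 'and', 'or', or 'not'")


def search(lines, term1, term2=None, operator="and"):
    """Return every line that satisfies the search."""
    return [line for line in lines if matches(line, term1, term2, operator)]


def make_unique(lines, id_column=0):
    """Keep only the first line seen for each ID in a tab-delimited column."""
    seen = set()
    unique = []
    for line in lines:
        key = line.split("\t")[id_column]
        if key not in seen:
            seen.add(key)
            unique.append(line)
    return unique
```

Running search again on its own output corresponds to searching within the results for a further subset of genes.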
You may wish to examine and manipulate the search result text files in Excel. If you decide not
to include some genes, you can either remove them from the list of unique genes created for the
SeqFinder tool (if you opted to make this file), or you can delete the corresponding rows in
Excel. Your final list of selected genes can be exported from Excel in the csv format required by
the SeqFinder script by using the ‘Save As’ function and selecting .csv format. Note: For the
SeqFinder script, the gene names must be in a single row, and separated by commas (csv
format).
Tip: If gene names are in a vertical column in Excel, highlight them and then use Edit →
Copy → Paste Special → Transpose to paste them into a single row on a new sheet.
Repeat as necessary until a suitable set of target genes has been identified.
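If you would rather not use Excel, a column of gene names can be turned into the single comma-separated row with a short python function. This is a sketch with a made-up function name; apart from TPK1 the gene names are placeholders, and it assumes that no spaces are wanted after the commas.

```python
def names_to_csv_row(names):
    """Join gene names into one comma-separated row, dropping blank entries."""
    cleaned = [name.strip() for name in names if name.strip()]
    return ",".join(cleaned)


column = ["TPK1", "GENE_B ", "", "GENE_C"]  # GENE_B and GENE_C are placeholders
print(names_to_csv_row(column))  # TPK1,GENE_B,GENE_C
```

Save the resulting line as the only line of a text file with a .csv extension.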
3.5 Selecting gene sequences using SeqFinder
Running SeqFinder
Once you have a list of interesting genes in csv format (all gene names in a single row, separated
by commas – see example files on SourceForge), the SeqFinder scripts can be used to select the
sequences for these genes from the annotated genome of a reference species. This script requires
the genome sequences to be in GenBank format, but there can be many GenBank entries per file,
and many files can be used (e.g. one file with multiple GenBank entries per chromosome).
Genome sequences are available from the GenBank ftp site
(ftp://ftp.ncbi.nlm.nih.gov/genomes/). Navigate to the genome of interest and you should find a
list of directories labeled something like “CHR_01”. Within each directory select the GenBank
file to download, which is the file that has the .gbk.gz extension. These are gzip compressed
files, which will need to be uncompressed before you can use them.
Tip: Uncompressing .gz files – On Macs you should be able to double click the file to
uncompress it. On Windows, you may be able to double click the file to uncompress it as
well. However, if double clicking doesn’t work (e.g. if a file full of strange symbols appears)
then you will need to download and install the free 7-zip program from http://www.7-zip.org/
Simply select the .exe file and install. To unzip a file, navigate to the installation location
(default will be something like C:\Program Files\7-Zip\) and double click on the .exe file to
open the 7-zip utility. Click the Add + button and find the file to unzip. Click the Extract –
button to extract the file.
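Since python is already installed for the toolset, .gz files can also be uncompressed with python's built-in gzip module on any operating system. The sketch below uses a made-up function name and example filenames, and assumes python 2.7 or later.

```python
import gzip
import shutil


def gunzip(gz_path, out_path):
    """Decompress a .gz file to out_path, leaving the original in place."""
    with gzip.open(gz_path, "rb") as source, open(out_path, "wb") as target:
        # Copy in chunks so that large chromosome files do not fill memory
        shutil.copyfileobj(source, target)


if __name__ == "__main__":
    import os
    # Example filename; replace it with the file you downloaded
    if os.path.isfile("chr1.gbk.gz"):
        gunzip("chr1.gbk.gz", "chr1.gbk")
```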
If you have annotated sequences that are not in GenBank format yet, but that you would like to
use as a reference sequence, GenBank format files can be created using NCBI’s Sequin program
(available at http://www.ncbi.nlm.nih.gov/guide/data-software/#submissions_). For example, this
software takes fasta format sequence files and gff annotation files to create a GenBank-style file.
You can create these files locally without submission to the GenBank database.
Select the SeqFinder script that best suits your needs. SeqFinderExons will return exon
sequences only. In the case where multiple transcripts have been annotated in the GenBank file,
you can choose to return the exon sequences for the longest transcript, the shortest transcript, or
all transcripts. SeqFinderWhole will return the entire gene sequence, including introns (default)
or optionally it will return all CDS sequences or mRNA sequences instead. SeqFinderPartial will
return the first N bp of a gene (exons and introns), where N is a length set by the user.
The genome sequences of many model species have been assigned to their respective
chromosomes; however, the sequences from ongoing or recently completed sequencing projects
of non-model species may not have been. The SeqFinder scripts are set up to allow you to
specify a number of files for autosomal chromosomes in cases where that is possible (assumes
they are contiguous, and start with chromosome 1). In addition to or instead of this you can
specify files for sex chromosomes, sequences that have unidentified locations, or a subset of
chromosome files (e.g. files for chromosomes 21 and 22). For more on this, and other options see
the detailed description of the SeqFinder scripts. Once all the options are set, execute the script
by typing run.
Tip: Keep all of the chromosome files, the csv file with your genes of interest and the
SeqFinder script in the same directory to simplify specification of file names.
Inspecting the results
SeqFinder will output a variety of files when the run is finished. The first file to look at is the
Stats file. This gives information about the settings and the results, including how many genes
were found and not found, as well as some summary information about average and total length,
and GC content. It will also specify sequences for which there were annotation uncertainties (e.g.
where the start and stop codons are not precisely known), and sequences with a user-specified
amount or more of missing data. It will also list sequences that were found multiple times. This
may be because the sequences are duplicated and/or annotated in multiple locations in the
genome, or because you selected to have the sequences for all transcripts (or CDS or mRNA)
returned. The Info file contains the list of genes found in each input file. The sequences found
will be in the .fa file (in the order in which they were found) or sorted.fa file (sorted by name).
See the detailed description of the SeqFinder script for more information about output files.
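To make the summary values in the Stats file concrete, the sketch below computes similar numbers from fasta-formatted text. It is not the SeqFinder code: the function names are made up, and GC content is assumed here to be the fraction of G and C over the total length, including any Ns.

```python
def read_fasta(lines):
    """Return a list of (name, sequence) pairs from fasta-formatted lines."""
    records = []
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def summarize(records):
    """Return count, total length, average length, and GC fraction."""
    total = sum(len(seq) for _, seq in records)
    gc = sum(seq.upper().count("G") + seq.upper().count("C")
             for _, seq in records)
    return {
        "count": len(records),
        "total_length": total,
        "average_length": float(total) / len(records) if records else 0.0,
        "gc_fraction": float(gc) / total if total else 0.0,
    }
```

Calling sorted(records) orders the sequences by name, which mirrors the difference between the .fa file and the sorted.fa file.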
If some genes were not found, it may be that the gene name differs between the annotations
in the GO file and the annotations in the GenBank file. There are several ways to resolve this.
You may be able to identify synonyms for the gene by going to the UCSC Genome Browser
(http://genome.ucsc.edu/). Once there, click “Genomes” in the top left corner and search for the
gene in your reference genome. From the results you can compare the annotations to other
species. Clicking on the transcript name and/or the protein name may also help identify
synonyms.
If you know which chromosome the gene should be located on, and if the GenBank file for that
chromosome is not too large, it can be opened in TextEdit or WordPad and you can search the
file for a word from the gene name or function. For example, if the gene TPK1 is not found, try
searching the file for ‘thiamin’ instead to determine the synonym. Alternatively, this may be
done at the GenBank website as well.
Repeat as necessary until you have identified the appropriate gene names in the GenBank files.
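When a GenBank file is too large to open comfortably in a text editor, the same keyword search can be done in python by reading the file one line at a time. This is a minimal sketch; the function name and the example filename are made up, and matching is assumed to ignore case.

```python
def find_lines(path, keyword):
    """Return (line number, line) pairs for lines containing keyword."""
    keyword = keyword.lower()
    hits = []
    with open(path) as handle:
        # Reading line by line keeps memory use low for large files
        for number, line in enumerate(handle, 1):
            if keyword in line.lower():
                hits.append((number, line.rstrip()))
    return hits


if __name__ == "__main__":
    import os
    # Example filename; replace it with your chromosome file
    if os.path.isfile("chr3.gbk"):
        for number, line in find_lines("chr3.gbk", "thiamin"):
            print(str(number) + ": " + line)
```

The lines around each hit in the GenBank file should show the gene name used in that annotation.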
Tip: To reduce the number of files used for further steps, run SeqFinder one last time on the
final list of gene names
Multiple transcripts
If you chose to return all transcripts, you can use some criteria other than length (e.g.
experimental support) to decide which transcript to retain. The GenBank files will contain
information on the support for each mRNA for a gene, as well as specify the accession number
for the associated CDS. Using the accession number, you can then look in the GenBank file to
determine which of the returned transcripts corresponds to the transcript you selected. SeqFinder
writes the CDS sequences in the order of the GenBank file, therefore if the first CDS has the
accession number that corresponds to the transcript with the most support, then keep the first
exon listed in the UNSORTED output file.
Tip: The mRNA and associated CDS are not necessarily listed in the same order in the
GenBank file (e.g. the first mRNA for a gene may or may not be represented by the first CDS
for a gene).
Tip: The online version of the GenBank file should be the most up-to-date.
3.6 Removing duplicate sequences using SeqFinderRemoveDuplicates
The SeqFinder script returns the sequences of a gene wherever it is annotated in the reference
genome (e.g. when the same gene is annotated on multiple chromosomes). You
can use the SeqFinderRemoveDuplicates script to search for sequences in the SeqFinder results
file that have the same name, and keep one copy of the sequences and get rid of the extras. This
script was written with duplicate genes in mind, and therefore the script compares different sets
of exons found at different genomic locations (although the duplicated genes may be adjacent to
each other). Therefore the input sequence file for this script should be the file sorted by genome
location and NOT the file sorted by name. Using the SeqFinderRemoveDuplicates script, you
can choose whether to keep the duplicate with the largest number of exons or the duplicate with
the longest total sequence length. In cases where multiple duplicates have the same number of
exons or the same total sequence length, the first occurrence is retained. The input files for this
script are the list of genes in csv format used for SeqFinder and the UNSORTED output
sequence file in fasta format from SeqFinder. Note that the script uses the gene name to identify
exons. Since SeqFinder inserts a species identifier before the locus name, you will have to
specify this identifier in the settings. See the detailed description of the
SeqFinderRemoveDuplicates script for more information.
During the run, the script will print the name of the gene that it is working on, and some
indication of whether or not duplicates were found. This information is also written to the Info
file when the run has completed. The resulting sequences (minus the duplicates) are written to a
fasta format file in the order that the genes are given in the csv formatted list of genes, and also
to a fasta file that is sorted by name. As usual some summary information about the sequences,
such as the total number of sequences retained, average and total length, etc., are written to the
Stats file. As noted above, this script was written particularly to compare duplicated genes, and it
may not remove single duplicated exons, although this latter instance should be relatively rare.
If you are only interested in single copy loci for your project, you may wish to remove the
duplicated genes from the list of genes and then re-run SeqFinder. This will help avoid the
inclusion of duplicated genes in your dataset.
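The selection rule described in this section can be sketched in Python as follows. This is only an illustration and not the script's actual code: the function name choose_duplicate and the representation of each copy of a gene as a list of exon sequences are assumptions made for the example.

```python
def choose_duplicate(copies, criterion="exons"):
    """Pick one copy of a duplicated gene.

    copies: copies of the gene in genome order; each copy is a list of
            exon sequences (an assumed layout, for illustration only).
    criterion: "exons" keeps the copy with the most exons,
               "length" keeps the copy with the longest total length.
    """
    if criterion == "exons":
        score = len
    elif criterion == "length":
        score = lambda exons: sum(len(exon) for exon in exons)
    else:
        raise ValueError("criterion must be 'exons' or 'length'")
    # max() returns the first maximal item, so ties keep the first occurrence
    return max(copies, key=score)


copy_a = ["ATGC", "GGTA"]      # 2 exons, 8 bp in total
copy_b = ["ATGCATGCATGC"]      # 1 exon, 12 bp in total
copy_c = ["TTTT", "CCCC"]      # 2 exons, 8 bp in total (ties with copy_a)
print(choose_duplicate([copy_a, copy_b, copy_c], "exons"))
print(choose_duplicate([copy_a, copy_b, copy_c], "length"))
```

Because the first occurrence wins a tie, the order of the input file matters, which is one reason to use the file sorted by genome location.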
3.7 Quality control of sequences from the reference
If you are only selecting sequences from your working reference genome, and you don’t need to
use the rest of the workflow, then you should proceed to the sections on final Quality Control to
finalize your sequences before bait design.
If you are going to use the sequences selected from the reference genome (whether they were
obtained using the SeqFinder tools or from previously available data such as EST or
transcriptome sequences) as queries to find the corresponding sequences in the unannotated
genome of a non-model species, then it might be helpful at this point to investigate if the
sequences contain repetitive or low complexity regions, as these will return many or low quality
BLAST hits. This is particularly important if your sequences are non-coding. See the section on
Quality control with QCMaskedMissing for information on how to use RepeatMasker to identify
these regions and then discard sequences with a certain threshold of masked or missing data.
3.8 Performing a local BLAST search
The sequences that were selected from the genome of the working reference species (including
any other sequences, such as EST or transcriptome sequences) can be used as queries for a
BLAST search to find corresponding sequences in an un-annotated genome of a non-model
(subject) species. This works best for sequences of reasonable lengths – performing a BLAST
search for a 50 kb sequence of a whole gene may not be feasible. See the NCBI BLAST+
installation instructions (Section 2.3) for how to make a local database for sequences from the
non-model species.
Tip: Either keep the NCBI BLAST+ executable in the same directory as your local database
and the output from SeqFinder, or verify that the BLAST+ program is in your path. On Mac,
at the command prompt type echo $PATH and verify that the directory containing the BLAST+
program is listed.
Conduct the local BLAST search using this command:
blastn -db DatabaseName -max_target_seqs 1 -outfmt "10 qacc sacc evalue bitscore pident qstart qend sstart send" -query SeqFinderResultsFile.fa -out BLASTResults.csv
Where DatabaseName is the name of the local BLAST database you created;
SeqFinderResultsFile.fa is the fasta format file of sequences you wish to find (queries), and
BLASTResults.csv is the name you designate for the BLAST results file in csv format.
This outputs the BLAST results in the proper format for the ParseBLAST.py script.
Tip: It is important not to change the -outfmt options (including their order) or the
ParseBLAST script will not work properly.
Tip: You may also want to conduct the BLAST search and return the standard BLAST
output for visual inspection. Note that this format will NOT work for the ParseBLAST
script. To return the full output, use this command:
blastn -db DatabaseName -query SeqFinderResultsFile.fa -out BLASTResults.out
Where DatabaseName is the name of the local BLAST database you created during
installation, SeqFinderResultsFile.fa is the fasta format file of sequences you wish to find,
and BLASTResults.out is the name you designate for the BLAST results file in the full
output format.
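If you would rather launch the search from Python, the csv-format command above could be wrapped as in the following sketch. The function names are hypothetical and are not part of the SeqSelector toolset; the blastn options are exactly those given above, and blastn is assumed to be in your path.

```python
import shlex
import subprocess

# Field order expected by ParseBLAST, copied from the command above
OUTFMT = "10 qacc sacc evalue bitscore pident qstart qend sstart send"


def build_blastn_command(database, query_fasta, out_csv):
    """Return the blastn call as an argument list (nothing is executed here)."""
    return ["blastn", "-db", database, "-max_target_seqs", "1",
            "-outfmt", OUTFMT, "-query", query_fasta, "-out", out_csv]


def run_blastn(database, query_fasta, out_csv):
    """Run the search; raises CalledProcessError if blastn reports a failure."""
    subprocess.run(build_blastn_command(database, query_fasta, out_csv), check=True)


command = build_blastn_command("DatabaseName", "SeqFinderResultsFile.fa",
                               "BLASTResults.csv")
# Print the command in a form that can be pasted at the command prompt
print(" ".join(shlex.quote(part) for part in command))
```

Passing the arguments as a list keeps the quoted -outfmt string intact as a single argument, so the field order cannot be split or reordered by the shell.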
3.9 Using AccessionNumberChecker to list all accession numbers for the non-model species genome sequences
The results from the BLAST search (see above) give the accession number of the non-model
(subject) genomic sequence where a match was found for the reference (query) sequence. The
ParseBLAST script (see next section) uses this accession number to find the correct subject
genome sequence, and then extracts the appropriate part of it. Therefore, in order for
ParseBLAST to work correctly, it needs to know how to find the accession number amongst all
of the information given after the > sign in the description for each sequence in the subject
genome fasta file. It is helpful to look at the descriptions of the subject genome sequences, but
often these files are too large to open with a text editor. The AccessionNumberChecker script
takes the genome sequence file in fasta format as input and will return the description line for
every sequence in a file to a separate text file. You can scroll through this file to see if the
formats of the description lines are consistent for all sequences. See the detailed description of
the AccessionNumberChecker script for more information.
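The idea behind this step can be sketched in a few lines of Python. This is an illustration under simple assumptions (one description line per sequence, each starting with >), not the AccessionNumberChecker code itself, and the function names are hypothetical.

```python
def description_lines(lines):
    """Yield every FASTA description line (those starting with '>')."""
    for line in lines:
        if line.startswith(">"):
            yield line.rstrip("\r\n")


def write_description_lines(fasta_path, out_path):
    """Copy the description lines of a FASTA file to a separate text file."""
    # The file is read line by line, so large genome files are never
    # loaded into memory all at once.
    with open(fasta_path) as fasta, open(out_path, "w") as out:
        for header in description_lines(fasta):
            out.write(header + "\n")


example = [">gi|317409317|gb|HQ420351.1| Pterodroma sandwichensis\n",
           "ATGGCCCCTAACC\n",
           ">scaffold_2 length=13\n",
           "TTGACCGGTAACC\n"]
for header in description_lines(example):
    print(header)
```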
3.10 Obtaining sequences from the BLAST results using ParseBLAST
Running ParseBLAST
Using the output from the BLAST search (in the format as described above), the ParseBLAST
scripts return the best sequences identified, based on a threshold e-value set by the user and the
bit score, which reflects the quality of the alignment between the query and the subject
sequences. In cases where the sequence from the non-model species does not match the full
length of the sequence from the reference species, the ParseBLAST script will return additional
sequence upstream and/or downstream until a certain length is reached. For ParseBLASTWhole,
a sequence of similar length to the query sequence will be returned. For ParseBLASTPartial,
sequences of a particular length will be returned (e.g. all will be ~1000 bp). For
ParseBLASTExons, a sequence of similar length to the query sequence will be returned (similar
to ParseBLASTWhole), but the output will contain some information that is more relevant to
exons. Note that because of insertions/deletions between the sequences of the reference and the
non-model species, the lengths of sequences returned may not be exactly identical to the length
of the query or to the length specified.
Three input files are needed in order to run this script: the BLAST results (Note: these must be in
the format described above or the script will produce nonsensical results), the query sequences
used in the BLAST search, plus the fasta file of sequences represented in the BLAST database
for the non-model species. If you created your own local BLAST database, it is simply the fasta
file that you used to do this. If you used a precompiled database from NCBI, then you will need
to use the blastdbcmd program to generate the original FASTA file (see the BLAST+ manual at
http://www.ncbi.nlm.nih.gov/books/NBK1763/ for more information).
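As a rough sketch of how a best hit might be chosen from results in the csv format described above, the following Python keeps, for each query, the hit with the highest bit score among those passing the e-value threshold. It is not the ParseBLAST code: the function name, the treatment of hits exactly at the threshold, and the example rows (which are invented) are all assumptions.

```python
import csv
import io

# Column order of the -outfmt string used for the BLAST search
COLUMNS = ["qacc", "sacc", "evalue", "bitscore", "pident",
           "qstart", "qend", "sstart", "send"]


def best_hits(csv_text, max_evalue):
    """Return {query: hit} keeping the highest bit score per query."""
    best = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        hit = dict(zip(COLUMNS, row))
        if float(hit["evalue"]) > max_evalue:
            continue  # e-value above the user-defined threshold
        kept = best.get(hit["qacc"])
        if kept is None or float(hit["bitscore"]) > float(kept["bitscore"]):
            best[hit["qacc"]] = hit
    return best


# Invented rows, for illustration only
example = ("locusA,scaf1,1e-50,180.0,95.0,1,100,500,599\n"
           "locusA,scaf2,1e-80,250.0,98.0,1,100,10,109\n"
           "locusB,scaf3,0.5,30.0,80.0,1,40,7,46\n")
hits = best_hits(example, 1e-10)
for query, hit in hits.items():
    print(query, hit["sacc"], hit["bitscore"])
```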
Accession number identification methods
For each line of the BLAST results, ParseBLAST looks in the non-model species genome file for
the sequence with the accession number that matches that given in the BLAST hit. Here is an
example of a BLAST search result; the accession number of the hit, HQ420351, is the second field:
Reference_CYTB,HQ420351,0.0,100,1,1000,1702,695
The genome sequences of the non-model species are in fasta format, and contain a line like this
before each sequence (the accession number, HQ420351, follows gb|):
>gi|317409317|gb|HQ420351.1| Pterodroma sandwichensis
Therefore, to run ParseBLAST, you need to tell it how to find the accession number (or just the
name if your sequences don’t have an accession number), in order to return the proper
sequences. AccessionNumberMethod1 assumes that the accession number is the only text
between the > and the first space for each sequence, and therefore no other options are required.
AccessionNumberMethod2 assumes that the accession number is found at the exact same
position for each sequence name and is always the exact same length, although the accession
number may not be the only text in the description. For this method you need to enter the exact
starting and exact ending position of the accession number (not including the >).
AccessionNumberMethod3 attempts to find the accession number based on some information
you provide (query and offset). This works best if your accession numbers have some consistent
value in them, but will work fine if they differ in position in the line or in length. For more
information see the detailed description of the ParseBLAST script.
Tip: Inputting the Accession number method settings correctly may take a couple of
tries. The ParseBLAST script outputs to the screen each accession number it is searching for.
If the output to the screen does not match the format of the accession numbers in the fasta file
used to make the local BLAST database, kill the run by pressing Ctrl+C and adjust the
settings accordingly.
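The first two methods can be illustrated with a short Python sketch. The function names are hypothetical, and counting positions from 1 (inclusive at both ends) is an assumption made here; check the detailed description of the ParseBLAST script for the convention it actually uses. Method 3 is not sketched because its query and offset settings are only explained in the detailed description.

```python
def accession_method1(description):
    """Method 1: the text between '>' and the first space."""
    text = description[1:] if description.startswith(">") else description
    return text.split(" ", 1)[0]


def accession_method2(description, start, end):
    """Method 2: fixed positions after the '>'.

    Positions are assumed to count from 1 and to include both ends.
    """
    text = description[1:] if description.startswith(">") else description
    return text[start - 1:end]


header = ">gi|317409317|gb|HQ420351.1| Pterodroma sandwichensis"
print(accession_method1(header))          # the whole first token
print(accession_method2(header, 17, 24))  # only the accession number
```

For this description line, method 1 returns the whole token gi|317409317|gb|HQ420351.1|, which would not match the accession number HQ420351 reported in the BLAST results, so a position-based or query-based method is the better choice here.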
Inspecting the results
ParseBLAST will output a variety of files when the run is finished. The first file to look at is the
Stats file. This gives information about how many loci/exons were found and not found, as well
as some summary information about the % identity, average and total length, and GC content. It
will also return a list of sequences that were below the minimum length specified (although it
does not remove them at this point – you can do that later using the QCMaskedMissing tool),
sequences that may have been found multiple times, and sequences with a user-specified amount
of missing data or more. The UniqueBLASTtable file will also be helpful to look at. This file is
sorted by the % identity between the query reference sequence and the corresponding sequence
in the non-model species, and also gives the e-values for each hit. The final column is the %
match length between the query reference sequence and the corresponding sequence in the
non-model species. Sequences with short regions of alignment and/or low identity may be spurious.
The Info file gives the accession number that each sequence was retrieved from. The sequences
from the non-model species will be output to the .fa file (in the order that the sequences were
found) or the sorted.fa file (sorted by sequence name).
ParseBLAST will return a single subject sequence for each query sequence found during the
BLAST search (assuming the e-value for the hit is below the user-defined threshold). If
sequences for some of your loci are missing at this point, there are several reasons why they may
not have been found during the BLAST search: they may be missing from the genomic
sequences obtained for the non-model species (even though they may in reality be present in the
genome) because of low coverage, the sequences may contain low complexity or repetitive
regions masked by BLAST, the sequences may be too divergent, or they may have been lost from the
genome of the non-model species. If desired, you can identify additional genes/loci to take the
place of sequences that were not found. Or, if the genes are of particular interest, you may
choose to retain the sequences from the reference species for those loci and use them for bait
design, under the assumption that coverage may be an issue. In this case, the MergeSeqs script
may be of use (see next section).
Tip: If you used EST or other sequences not obtained with the SeqFinder tool as the query
and the sequence identifiers (the text after the > in the fasta file) look strange then you may
need to change the way you have specified the query species. See the detailed description of
the ParseBLAST script for more information.
3.11 Merging sequences from the reference and non-model species for bait
design
If the sequences for some loci could not be found during the BLAST search, you may decide to
design baits from the sequences of the reference genome instead, assuming the sequences are not
too divergent from your species of interest. The MergeSeqs script opens the file of query
sequences obtained from the reference genome, and then looks for a corresponding sequence
from the subject or non-model species genome. If a sequence was found for the non-model
species, it is written to the results file, otherwise the sequence from the reference species will be
returned instead. Note that the matching is based on the locus name. Since SeqFinder and
ParseBLAST both insert a species identifier before the locus name, you will have to specify
these before running the script. See the detailed description of the MergeSeqs script for more
information. While the script is running it will output the locus names that it is looking for. If
these are not in the correct format, simply kill the run by pressing Ctrl+C, adjust the species
identifiers as necessary and try again.
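The merging logic can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, sequences are held in dictionaries keyed by sequence name, and the species identifiers are assumed to be plain prefixes such as Reference_ and Petrel_ (the second is invented for the example).

```python
def locus_name(seq_id, species_identifier):
    """Strip a leading species identifier to recover the locus name."""
    if seq_id.startswith(species_identifier):
        return seq_id[len(species_identifier):]
    return seq_id


def merge_sequences(reference, subject, ref_identifier, subj_identifier):
    """Prefer the non-model sequence; fall back to the reference sequence."""
    subject_by_locus = {locus_name(name, subj_identifier): (name, seq)
                        for name, seq in subject.items()}
    merged = {}
    for name, seq in reference.items():
        locus = locus_name(name, ref_identifier)
        if locus in subject_by_locus:
            subj_name, subj_seq = subject_by_locus[locus]
            merged[subj_name] = subj_seq   # found in the non-model species
        else:
            merged[name] = seq             # not found: keep the reference
    return merged


reference = {"Reference_CYTB": "ATGACC", "Reference_ND2": "ATGAAT"}
subject = {"Petrel_CYTB": "ATGACT"}
merged = merge_sequences(reference, subject, "Reference_", "Petrel_")
print(merged)
```

If the identifiers are specified incorrectly, no locus names match and every sequence falls back to the reference, which is why it is worth watching the locus names printed during the run.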
3.12 Quality control with QCMaskedMissing
Before designing baits from the selected sequences some quality control may be helpful. Baits
designed from sequences with repetitive or low complexity regions will hybridize nonspecifically to similar regions across the genome rather than your locus of interest. Additionally,
sequences much shorter than the length of baits in your kit may need to be removed or they will
have to be heavily padded. The QCMaskedMissing tool will help you investigate these issues in
your set of selected sequences.
To identify and mask repetitive or low complexity regions use the RepeatMasker program. The
webserver version available at http://repeatmasker.org/cgi-bin/WEBRepeatMasker will allow
submission of large fasta format datasets and will email you a message when the results are
ready. Select the rmblast option, which is compatible with NCBI’s BLAST program, and select
your desired speed/sensitivity (default should be fine in most cases). For the return format, you
can select ‘html’ to have the results displayed in a web browser, in which case you will
need to download the results files individually, or you can select ‘tar file’ and have the results
displayed both in the browser and downloadable as a “zipped” folder. For return method you will
most likely need to select ‘email’ unless your sequence input file is small (< 50kb). There are
additional options you can select if your species of interest is closely related to a model
organism. Finally, make sure that ‘Masking options’ under ‘Advanced Options’ says that
repetitive sequences will be replaced by strings of X, otherwise the default is to mask with
strings of N, which will be indistinguishable from missing data. The results from RepeatMasker
include a summary table (.tbl file, which can be opened in a text editor) with information about
the regions identified, as well as your set of sequences with the repetitive/low complexity regions
masked by Xs (.masked file).
The QCMaskedMissing tool will go through a set of sequences and discard any that have a
percentage of missing or masked data above the user-defined threshold. It will also investigate
sequence lengths and, if desired, discard sequences shorter than a user-defined length. The input
for this tool is a set of fasta format sequences, such as those produced from the SeqFinder
workflow and/or from RepeatMasker. The user can specify the threshold of missing/masked data
to allow. This should be entered as a percentage: for example 10% would be written as 10
rather than 0.10. The user can also specify the minimum sequence length. During the run each
sequence being investigated will be printed to the screen. The output files include the set of fasta
formatted sequences that pass the threshold for masked/missing data (and sequence length, if
desired), and stats and info files that contain results and information about the run. For more
information see the detailed description of the QCMaskedMissing script.
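The filtering rule can be sketched in Python as below. The function names are hypothetical, and counting N as missing data and X as masked data is an assumption based on the RepeatMasker settings recommended above; the real script may recognise other symbols.

```python
def percent_masked_missing(seq, symbols="NX"):
    """Percentage of the sequence made up of missing (N) or masked (X) bases."""
    if not seq:
        return 100.0  # an empty sequence is treated as entirely missing
    seq = seq.upper()
    flagged = sum(seq.count(symbol) for symbol in symbols)
    return 100.0 * flagged / len(seq)


def passes_qc(seq, max_percent, min_length=0):
    """Keep a sequence that is long enough and not above the threshold."""
    return (len(seq) >= min_length
            and percent_masked_missing(seq) <= max_percent)


print(percent_masked_missing("ACGTNNXXAC"))
print(passes_qc("ACGTNNXXAC", 10))
print(passes_qc("ACGTACGTAN", 10))
print(passes_qc("ACGT", 10, min_length=5))
```

Note that the threshold is given as a whole-number percentage (10, not 0.10), matching the way the setting is entered for the tool.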
3.13 Quality Control with QCComplementarity
It may also be helpful to identify any bait sequences that are highly complementary to each
other. If baits are complementary then they could bind to each other rather than to your
sequencing library. For example, MYcroArray suggests avoiding sequences with greater than
90% sequence identity for greater than 100 bp. In order to investigate this, you can create a
BLAST database for the sequences you have selected for targeted enrichment, then use your
selected sequences as queries to BLAST against that database, and then use the
QCComplementarity tool to remove sequences that have high complementarity.
First, create a new blast database for your sequences using the command:
makeblastdb -in TargetSequenceFile.fasta -dbtype nucl -parse_seqids -out TargetSequences -title "Sequences for Targeted Enrichment"
Where TargetSequenceFile.fasta is the file containing the sequences you have selected for
targeted enrichment (e.g. through the SeqSelector workflow), TargetSequences is a name you
selected for the database, and “Sequences for Targeted Enrichment” is some description of the
database.
Next, BLAST your sequences against this database using this command (this produces the
correct format for the QCComplementarity script):
blastn -db TargetSequences -outfmt "10 qacc sacc pident length mismatch gaps qstart qend sstart send evalue bitscore" -query TargetSequenceFile.fasta -out SelfBLASTResults.csv
Where TargetSequences is the name of the database you created, TargetSequenceFile.fasta is the
file containing the sequences you have selected for targeted enrichment (e.g. through the
SeqSelector workflow), and SelfBLASTResults.csv is the name you selected for the file
containing the results.
The SelfBLASTResults.csv file is used as the input for the QCComplementarity tool, along with
the query sequence file. The user can set thresholds for percent identity (as above percent should
be written as whole numbers – e.g. 90% is input as 90) and match length (in bp). When a BLAST
hit between a pair of sequences has a percent identity higher than the setting over a region longer
than the length specified, then the first sequence in the pair will be removed. The reciprocal
BLAST hit between the pair is ignored. The output files include a reduced BLAST results file
that contains only the hits that fail to meet the user-defined thresholds, a stats file with
information about the run, including the IDs for the sequences removed, and a fasta formatted
file with all of the sequences retained. Note that some sequences may be highly complementary
to several other sequences, and removing this one sequence may resolve more than one instance
of high complementarity. Therefore, fewer sequences may be removed than the number of hits in
the reduced BLAST table (see example file). See the detailed description of the
QCComplementarity script for more information.
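One possible reading of the removal rule is sketched below in Python. It is not the QCComplementarity code: the function name is hypothetical, self hits are assumed to be skipped, a hit is assumed to need no action once either sequence in the pair has already been removed (which also covers the reciprocal hit), and the example rows are invented.

```python
import csv
import io


def sequences_to_remove(csv_text, min_identity, min_length):
    """List the sequences flagged for removal, in the order encountered.

    Columns follow the -outfmt string above: qacc, sacc, pident, length, ...
    """
    removed = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        query, subject = row[0], row[1]
        pident, length = float(row[2]), int(row[3])
        if query == subject:
            continue  # a sequence always matches itself
        if query in removed or subject in removed:
            continue  # already resolved by an earlier removal
        if pident > min_identity and length > min_length:
            removed.append(query)  # remove the first sequence of the pair
    return removed


# Invented rows, for illustration only
example = ("A,A,100.0,500,0,0,1,500,1,500,0.0,900\n"
           "A,B,95.0,150,7,0,1,150,20,169,1e-60,230\n"
           "B,A,95.0,150,7,0,20,169,1,150,1e-60,230\n"
           "A,C,92.0,120,9,0,300,419,5,124,1e-40,160\n"
           "C,D,85.0,300,45,0,1,300,1,300,1e-70,250\n")
print(sequences_to_remove(example, 90, 100))
```

In this example three hits exceed the thresholds, but removing sequence A alone resolves all of them, which is why fewer sequences may be removed than there are hits in the reduced BLAST table.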
3.14 Quality Control with QCDuplicates
It may also be helpful to investigate whether your selected sequences come from single copy or
multi-copy loci. If you used the SeqFinder and SeqFinderRemoveDuplicates scripts you may
already know whether the sequences for your genes of interest were annotated in multiple
regions of the working reference genome. However, you may wish to check again, or, if you
started with EST, transcriptome, or other data, you may wish to investigate this further.
The QCDuplicates script parses the results of a BLAST search of your selected sequences
against a database created for your genome of interest (e.g. a non-model species and potentially
your working reference species) and identifies those that have multiple hits above a certain
percent identity and below a specified e-value threshold. Those with multiple hits that fail the
threshold settings are discarded.
First, create a BLAST database for your genome of interest using this command (if you haven’t
already):
makeblastdb -in GenomeSequenceFile.fasta -dbtype nucl -parse_seqids -out GenomeSequences -title "Genome Sequences for Species of Interest"
Where GenomeSequenceFile.fasta is the file containing the genome sequences for the species of
interest in fasta format, GenomeSequences is the name you selected for the database, and
“Genome Sequences for Species of Interest” is a description of the database.
Next, BLAST your sequences against the genome database using this command:
blastn -db GenomeSequences -outfmt "10 qacc sacc pident length mismatch gaps qstart qend sstart send evalue bitscore" -query TargetSequenceFile.fasta -out GenomeBLASTResults.csv
Where GenomeSequences is the name of the database you created, TargetSequenceFile.fasta is
the file containing the sequences you have selected for targeted enrichment (e.g. through the
SeqSelector workflow), and GenomeBLASTResults.csv is the name you selected for the file
containing the results.
The QCDuplicates tool takes the genome BLAST results and the fasta format file of the query
(target) sequences as inputs. The user specifies a percent identity (as a whole number) and e-value threshold. Sequences that have multiple hits above the percent identity threshold and below
the e-value threshold are flagged as potentially duplicated regions and discarded from the
dataset. The sequence being examined is shown on the screen during the run. The output files
include the fasta format sequence file with the potentially duplicated sequences removed, an info
file that gives information about the potential number of duplicates for each duplicated sequence,
and a stats file with information about the run, including how many sequences were removed and
their IDs, as well as some information about the sequences that were retained. For further
information see the detailed description of the QCDuplicates script.
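The flagging rule can be sketched in Python as follows. This is a simplified illustration rather than the QCDuplicates code: the function name is hypothetical, every line of the BLAST results is counted as one hit (so several alignments to the same scaffold would each count), and the example rows are invented.

```python
import csv
import io
from collections import Counter


def duplicated_queries(csv_text, min_identity, max_evalue):
    """Return {query: number of passing hits} for queries with more than one.

    Columns follow the -outfmt string above; pident is column 3 and
    evalue is column 11.
    """
    counts = Counter()
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        pident, evalue = float(row[2]), float(row[10])
        if pident > min_identity and evalue < max_evalue:
            counts[row[0]] += 1
    return {query: n for query, n in counts.items() if n > 1}


# Invented rows, for illustration only
example = ("locus1,scafA,98.0,400,8,0,1,400,1000,1399,1e-100,700\n"
           "locus1,scafB,97.0,400,12,0,1,400,5000,5399,1e-90,650\n"
           "locus2,scafC,99.0,300,3,0,1,300,200,499,1e-120,540\n"
           "locus2,scafD,70.0,60,18,0,1,60,900,959,1e-5,40\n")
print(duplicated_queries(example, 90, 1e-20))
```

Here locus1 is flagged because two hits pass both thresholds, while locus2 is retained because its second hit is too weak to count.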
Final note
Further considerations and quality checks may be necessary for your project depending on the
capture format (array or in-solution), design specifications of the company producing the baits,
etc.
3.15 Sorting and obtaining stats for fasta files independently
At times during the sequence selection process, you may wish to sort or obtain stats for a file of
sequences independent from the pipeline. The SequencesSortStats script will do this for you.
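As a minimal sketch of what sorting and summarising a fasta file involves, the following Python sorts sequences by name and computes the kinds of statistics mentioned elsewhere in this manual (number of sequences, total and average length, GC content). The function names and the exact set of statistics are assumptions; see the detailed description of the SequencesSortStats script for what it actually reports.

```python
def read_fasta(lines):
    """Parse FASTA lines into {name: sequence}, joining wrapped lines."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def sort_and_stats(records):
    """Return the records sorted by name plus some summary statistics."""
    ordered = sorted(records.items())
    total = sum(len(seq) for _, seq in ordered)
    gc = sum(seq.count("G") + seq.count("C") for _, seq in ordered)
    stats = {
        "count": len(ordered),
        "total_length": total,
        "average_length": total / len(ordered) if ordered else 0.0,
        "gc_percent": 100.0 * gc / total if total else 0.0,
    }
    return ordered, stats


records = read_fasta([">b", "ACGT", ">a", "GGCC", "AT"])
ordered, stats = sort_and_stats(records)
print([name for name, _ in ordered])
print(stats)
```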
4. DETAILED DESCRIPTION OF SCRIPTS
Note: If you would like to enter the settings interactively, execute the script and at the prompt
type the name of the setting you would like to change. The setting names appear on the left side of
the equal sign. At the next prompt simply type what the new setting should be.
Note: If you would like to enter the settings into the script directly, then you will need to include
quotes around settings that should be treated as text and no quotes around settings that should be
treated as numbers (e.g. min sequence length). In the description of the user input below,
information about whether to include quotes is given in square brackets.
4.1 GoSelect
Functions
• Reads an Ensembl GO annotation file for a reference genome and outputs all lines that
contain particular search term(s) to a separate file
• A search can be performed for one or two terms, using the boolean operators ‘and’, ‘or’, ‘not’
• It can take the output file produced above and return a separate file with only the first
instance of the Ensembl transcript or gene number
• Finally, it can produce a list of gene names resulting from your search terms in the format
required by SeqFinder
Assumptions
• The script will find any instance of the query text, including if it occurs in a gene name, gene
description, or GO term, etc.
• When creating the unique file it assumes that entries with the same transcript or gene number
are on adjacent lines (this is the format of the unmodified file downloaded from Ensembl)
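The search and the construction of the unique file can be sketched in Python as follows. This is an illustration only, not the GoSelect code: the function names are hypothetical, ‘not’ is assumed to mean “contains the first term but not the second”, the file is assumed to be tab-separated with columns counted from zero, and the example lines are invented.

```python
def matches(line, term1, operator=None, term2=None):
    """Case-sensitive search for one term, or two joined by and/or/not."""
    first = term1 in line
    if operator is None:
        return first
    second = term2 in line
    if operator == "and":
        return first and second
    if operator == "or":
        return first or second
    if operator == "not":
        return first and not second
    raise ValueError("operator must be 'and', 'or' or 'not'")


def first_instances(lines, id_column, sep="\t"):
    """Keep only the first of each run of adjacent lines sharing an ID."""
    unique = []
    previous = None
    for line in lines:
        current = line.split(sep)[id_column]
        if current != previous:
            unique.append(line)
        previous = current
    return unique


# Invented annotation lines, for illustration only
lines = ["ENSG01\tENST01\tTPK1\tthiamin metabolic process",
         "ENSG01\tENST01\tTPK1\tkinase activity",
         "ENSG02\tENST02\tABC1\tkinase activity"]
hits = [line for line in lines if matches(line, "kinase", "not", "thiamin")]
print(len(hits))
print(first_instances(lines, 0))
```

Because only adjacent lines are compared, a file that has been re-sorted so that lines with the same ID are no longer adjacent would produce repeated entries in the unique file.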
User Input
• GOfilename: Name of the GO annotation file to search [When entering settings directly put
this in quotes]
  o Case sensitive
• outname: Name to use for the beginning of all output files [quotes]
  o Additional text will be appended to this to designate specific results files
• numbersearchterms: Number of terms to search for – select 1 or 2 [no quotes]
• searchphrase1: First search term [quotes, see Basics of Python above if you would like to
include quotes in your search term]
• boolean: If searching for two terms, enter the appropriate Boolean operator, either and, or,
not [quotes]
• searchphrase2: Second search term, if applicable [quotes, see Basics of Python above if you
would like to include quotes in your search term]
• makeunique: Whether or not to make the unique GO gene/transcript file
  o Answer Yes or No [quotes]
• idcolumn: The column in the GO annotation file downloaded from Ensembl that contains the
transcript or gene ID that should be used when making the unique file [no quotes]
• genelist: This setting specifies whether or not a separate file should be made that contains the
list of unique genes in the format required by SeqFinder. Specify ‘yes’ or ‘no’ [quotes]
• genecolumn: The column in the GO annotation file downloaded from Ensembl that contains
the names of the genes [no quotes]
• [Optional setting – disableinteractive: When entering settings directly into the script this
setting allows you to shut off the interactive mode. Enter ‘yes’ to disable it and ‘no’ to leave
it on]
Output to Screen
• Whether or not the unique file and genes files have been made
Output files
• The names of all output files will contain the search terms
• It will return a text file which includes the full search results. This file will contain all lines
for which a match was found, in the same format as the initial GO annotation file (i.e.
multiple lines for each transcript/gene as long as the search term is found in that line)
• Optionally, it will also return a separate file with only the first instance of a match for each
Ensembl gene/transcript number. The name of this file will be the same as the name of the
full file but will include ‘Unique.txt’
• Optionally, it will also return a separate file that contains all of the unique gene names found
during the search (i.e. with the duplicates removed) that is in the correct format for the
SeqFinder script. The name of this file will be the same as the name of the full file but will
include ‘UniqueGenes.txt’
4.2 SeqFinder
Function
• Reads a list of genes in csv format
• Opens each reference genome file in GenBank format (e.g. a file for each chromosome) and
searches for each gene in the list
• For SeqFinderPartial.py, if the gene name is found in the GenBank file, it will return the
first N basepairs, where the length N is set by the user
  o Use this script if you want partial gene sequences
• For SeqFinderWhole.py, if the gene name is found in the GenBank file, it will return the
entire sequence
  o This script will return the whole gene (exons + introns), whole mRNA, or whole CDS
  sequences, based on the options selected
  o If CDS is selected the function is similar to SeqFinderExons (see below), but the
  output is more crude
    ▪ E.g. All transcripts will be returned by this script, whereas with
    SeqFinderExons you can choose which one(s) to output. Also, the Info file
    will not contain information about the number of transcripts found, etc.
  o Similarly, if mRNA is selected, all mRNAs will be returned
• For SeqFinderExons.py, if the gene name is found in the GenBank file, it will return the
sequence for each exon
  o Use this script if you want whole exon sequences
  o If multiple transcripts are annotated, you can select whether to return the shortest, the
  longest, or all of them
  o Additional exon specific and transcript specific information is output in the results
  files
• For SeqFinderPartial, if the length of the sequence is shorter than the target length set by the
user, then the whole gene sequence will be returned
• For all scripts, if the length of the selected sequence is shorter than the minimum length set
by the user, the sequence plus additional sequence after the gene will be returned until the
min length is reached
• Some statistics are calculated, including average length of sequence, average GC content,
total length of regions found, names of duplicate sequences found, names of sequences with
more than the user-specified number of Ns
Assumptions
• The GenBank files used all begin with the same string, have the same extension, and differ
only by a number/text as entered by the user
• The script will only find a gene if the name in the csv file matches the name in the GenBank
file exactly (case sensitive)
• SeqFinderExons.py will only find a gene if it has a CDS annotation in the GenBank file
  o E.g. if the gene you are looking for is a pseudogene and does not have a CDS
  sequence annotated in the GenBank file then it will not be found by this script.
  o SeqFinderWhole can search for a gene, whether or not it is annotated with “CDS” or
  “mRNA”
• Annotations are accurate
  o If there is some uncertainty in the annotation of a region (e.g. if the start or stop
  codons are not precisely known) then the script will make a list of these regions and
  report them in the ‘AnnotationUncertainties.txt’ file and in the ‘Stats.txt’ file
• If a gene name is annotated on more than one chromosome then more than one sequence will
be returned
User input
• genesfilename: Name of the csv file containing the names of the query genes [When entering
settings directly put this in quotes]
  ▪ The csv file should have all of the gene names on a single line, separated by commas
  ▪ The last gene should not have a comma after it
28

 Do not use quotes around the gene names in this file
 When entering the name of the csv file, case in important
Name of the annotated reference genome files to search. These must be in GenBank
format and must be unzipped.
 genbankfilename: Base name for the GenBank file [quotes]
o If you have multiple files that have the same text at the beginning of the name,
input the part of the name that is common to all of them
o Case sensitive
 genbankfileextension: extension for the GenBank file [quotes]
o As above, but input the extension for the files, including anything the files
have in common after the chromosome number
o Case sensitive
 chromnumber: Number of autosomal chromosomes [no quotes]
o This is the number of consecutive autosomal chromosome files, starting at 1,
whose names differ by a number
o If the reference genome sequences have not been mapped to chromosomes,
put a zero here and then treat the GenBank file as if it was an ‘extra
chromosome’ file (see below)
o If you want to search some autosomal chromosome files, but not all of
them, put a zero here and then treat the autosomal chromosome files as ‘extra
chromosome’ files (see below)
 useextrachromosomes: Whether or not you would like to use extra chromosomes
(these would be files whose base name and extensions are the same but differ by a
character, a string of characters, or non-consecutive numbers; e.g. X, Un, 10, 11)
o Answer Yes or No [quotes]
o extrachromosomes: Extra files to use (these have the same base name and extensions
but differ by a character, a string of characters, or non-consecutive numbers)
o Enter the number of extra files you would like to use, and type the unique part
of the name for each file at the prompt
 [Not applicable when entering settings directly in the script file]
o [For direct entry of settings give the unique part of the filename of the extra
chromosomes using quotes. Multiple strings can be entered here if they are
separated by a comma. This all needs to be surrounded by square brackets.
Note: Here numbers should be treated as text and be surrounded by quotes,
e.g. [‘X’, ‘Un’, ‘10’, ‘12’]]
Example:
For these files:
DogChromosome1.gbk
DogChromosome2.gbk
DogChromosome3.gbk
DogChromosomeX.gbk
DogChromosomeUnk.gbk
The appropriate settings would be:
genbankfilename = DogChromosome [in quotes]
genbankfileextension = .gbk [in quotes]
chromnumber = 3 [no quotes]
useextrachromosomes = Yes [quotes]
extrachromosomes = X [‘X’]
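To illustrate how these settings combine into file names, here is a minimal Python sketch. It is an illustration only and not taken from the SeqFinder source; the function name genbank_file_names is hypothetical, and the order in which the script searches files is an assumption.

```python
def genbank_file_names(base, extension, chromnumber, extras=()):
    """Build the list of GenBank files implied by the settings.

    Numbered files run from 1 to chromnumber; extra files use the
    unique text given for each one (entered as strings, e.g. 'X').
    """
    numbered = [base + str(n) + extension for n in range(1, chromnumber + 1)]
    extra = [base + tag + extension for tag in extras]
    return numbered + extra


files = genbank_file_names("DogChromosome", ".gbk", 3, ["X"])
```

With chromnumber set to 0 and every file listed as an extra chromosome, the same function would return only the extra files, which mirrors the advice given for unmapped genomes.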







transcripthandling: Handling of multiple transcripts
 If multiple transcripts are found you can use this option to specify which to return [no
quotes]:
o 1 – Return only the longest
o 2 – Return only the shortest
o 3 – Return all
 For each of these options the number of transcripts found for each gene, the length(s)
of the transcripts returned, and the sequence location of the transcript(s) is printed in
the Info file
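The three transcripthandling options can be sketched as follows. This is a hedged illustration, not the script's own code: the function name select_transcripts and the (name, sequence) tuple layout are assumptions, as is the choice to keep the first transcript when lengths tie.

```python
def select_transcripts(transcripts, transcripthandling):
    """Pick transcripts for one gene.

    transcripts: list of (name, sequence) tuples (hypothetical layout).
    transcripthandling: 1 = longest, 2 = shortest, 3 = all.
    """
    if not transcripts:
        return []
    if transcripthandling == 1:
        # max() keeps the first of several equally long transcripts
        return [max(transcripts, key=lambda t: len(t[1]))]
    if transcripthandling == 2:
        return [min(transcripts, key=lambda t: len(t[1]))]
    if transcripthandling == 3:
        return list(transcripts)
    raise ValueError("transcripthandling must be 1, 2, or 3")
```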
outname: Base name for output files
 Enter a name that you would like to use as the basename for the output files [quotes]
queryspecies: Reference/query species name
 Enter the name of the reference/query species [quotes]
numberNs: Identify sequences that have this many Ns (missing data)
 Enter a number [no quotes]
targetsequencesize: For SeqFinderPartial.py only: Length of sequence to return
 Enter a number [no quotes]
seqtype: For SeqFinderWhole.py only: Select the type of sequence to return
 Choose gene, CDS, or mRNA [no quotes]
 Note: The seqtypes CDS and mRNA will return sequences that lack introns and
therefore may not be suitable for bait design because baits may inadvertently span
exon boundaries
[Optional setting – disableinteractive: When entering settings directly into the script this
setting allows you to shut off the interactive mode. Enter ‘yes’ to disable it and ‘no’ to leave
it on]
Output to screen



Whether or not additional chromosome files are being used (e.g. sex chromosomes, files with
names other than numbers, or non-consecutive chromosome files)
A list of files to search
The current chromosome file being searched
Output files (assuming the base name selected was Results)

ResultsStats.txt
o This file gives some information about the run, including the settings used, number of
genes found or missing, total length of the sequences, average sequence length and
GC content
o It also lists sequences that had uncertainties associated with the annotated locations of
the start and/or stop codons, sequences that were found multiple times, and sequences
with strings of Ns as long as or longer than the value specified by the user
ResultsInfo.txt
o This file gives the name of each chromosome file searched, the genes found in that
file, their chromosomal location including strand, and for SeqFinderExons the
number of transcripts found, along with the total length of the sequence returned for
each of the selected transcripts
Results.fa
o Sequences selected from the reference genome, in the order they were found in the
GenBank files
ResultsSorted.fa
o Sequences selected from the reference genome, sorted by name
ResultsAnnotationUncertainties.txt
o Lists the names of loci for which there are uncertainties in the annotated location of
the start and/or stop codons
ResultsFoundGenes.txt
o The subset of genes from the .csv file that were found by SeqFinder
ResultsMissingGenes.txt
o The subset of genes from the .csv file that were not found by SeqFinder
ResultsLengths.txt
o A list of the lengths of the sequences found. This could be put into Excel or some
other program to further examine the distribution of lengths
ResultsGC_Content.txt
o A list of the GC content for each of the sequences found. This could be put into Excel
or some other program to further examine the distribution of GC content of the
sequences
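If you prefer to examine GC content and runs of Ns without a spreadsheet, the calculations can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the formulas used by SeqFinder: the function names are hypothetical, GC content is computed over A, C, G, and T only, and a sequence is flagged when it contains a run of Ns at least as long as the number given.

```python
import re


def gc_content(seq):
    """Percent G+C among the A/C/G/T bases of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def has_n_run(seq, number_ns):
    """True if seq contains number_ns or more consecutive Ns."""
    return re.search("N{%d,}" % number_ns, seq.upper()) is not None
```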
4.3 SeqFinderRemoveDuplicates
Function





Searches through the fasta format sequence file produced by SeqFinder to identify sequences
returned when the same genes are annotated multiple times in the reference GenBank file(s)
o Searches the fasta file for multiple sequences with the same name
Counts the total number of exons and the total combined sequence length
If a gene has not been duplicated it is written to the results file
If a gene has been duplicated it will return either the copy with the most exons or the copy
with longest combined length, as set by the user
If multiple copies have the same number of exons or same sequence length then the first
occurrence will be returned
Some statistics are calculated, including average length of sequence, average GC content,
total length of regions found, names of duplicate sequences found, names of sequences with
more than the user-specified number of Ns
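The rule for choosing which copy of a duplicated gene to keep can be sketched as below. This is a minimal illustration and not the script's implementation; the function name choose_copy and the representation of each copy as a list of exon sequences are assumptions.

```python
def choose_copy(copies, keepmethod):
    """Choose one copy of a duplicated gene.

    copies: list of copies in the order found; each copy is a list of
            exon sequences (hypothetical layout).
    keepmethod: 1 = most exons, 2 = longest combined length.
    """
    if keepmethod == 1:
        score = len
    elif keepmethod == 2:
        score = lambda exons: sum(len(exon) for exon in exons)
    else:
        raise ValueError("keepmethod must be 1 or 2")
    best = copies[0]
    for copy in copies[1:]:
        # Strict comparison: on a tie the first occurrence is kept
        if score(copy) > score(best):
            best = copy
    return best
```

Note that the two methods can disagree: a copy with many short exons wins under method 1, while a copy with fewer but longer exons wins under method 2.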
Assumptions



This script was developed for use with SeqFinderExons. It assumes that pseudogenes would
not have been annotated as coding sequences in the reference genome, and therefore would
not have been returned by the SeqFinderExons script.
This script compares sets of exons, and therefore UNSORTED results files from SeqFinder
should be used. If the results file is sorted by name, the exons will be disassociated from the
set they belong to, which will cause the script to return the wrong exons.
In order to identify a duplicated gene, the script looks to see how many occurrences were
found for exon 1. If part of a gene that does not include exon 1 was duplicated, then this
duplication event will not be detected by the script.
User input








genesfilename: Name of the csv file used with SeqFinder containing the names of the
query genes [When entering settings directly put this in quotes]
o The csv file should have all of the gene names on a single line, separated by a comma
o The last gene should not have a comma after it
o Do not use quotes around the gene names in this file
o When entering the name of the csv file, case is important
sequencefilename: Name of the UNSORTED results file from SeqFinder [in quotes]
o Case sensitive
queryspecies: [in quotes]
o Because SeqFinder adds the name of the query species to the sequences you need to
enter this information here so that the gene name from the csv file will match the gene
name for the sequence
o Include here all text between the > and the beginning of the gene name, which should
be common for all sequences
keepmethod: Enter the number corresponding to which duplicate you would like to keep [no
quotes]
o 1 – Keep the duplicate with the most exons
o 2 – Keep the duplicate with the longest combined sequence length
outname: Enter the base name you would like to use for the output of the run [in quotes]
o Case sensitive
minsequencesize: This setting identifies sequences shorter than the desired length (e.g. the
length of the baits). Enter a number [no quotes]
numberNs: This setting identifies sequences that have a certain number of Ns (missing data).
Enter a number [no quotes]
[Optional setting – disableinteractive: When entering settings directly into the script this
setting allows you to shut off the interactive mode. Enter ‘yes’ to disable it and ‘no’ to leave
it on]
Output to screen



Whether you selected to keep the duplicate with the most exons or the longest length
The gene being processed for duplicates and an indication of the number of occurrences
found
If the gene has been duplicated, the number of exons in each duplicate, the total length of
each duplicate, and which duplicate is being returned will also be printed to the screen
Output files (assuming the basename is Results)



Results.fa
o This is the set of sequences with the duplicates removed, sorted in the order of the genes
listed in the .csv file
ResultsSorted.fa
o This is the same file as above but sorted by name
ResultsStats.txt
o File with summary information including the settings used, total number of sequences,
total length of sequences, average sequence length, average GC content, names of
duplicate sequences found, names of sequences with more than the user-specified
number of Ns
4.4 AccessionNumberChecker
Function

Reads a file of sequences and returns the description line for each sequence to another text
file
Assumptions

Sequences are in fasta format
User input



sequencefilename: Name of fasta format sequence file [When entering settings directly put
this in quotes]
outname: Name for the output files [quotes]
[Optional setting – disableinteractive: When entering settings directly into the script this
setting allows you to shut off the interactive mode. Enter ‘yes’ to disable it and ‘no’ to leave
it on]
Output to screen
None
Output files (assuming that outname is Results)

It will return a text file with one line for each sequence that contains all of the information
from the description of that sequence
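The idea behind this script can be sketched in Python as follows. This is an illustration, not the AccessionNumberChecker source; the function name description_lines is hypothetical, and whether the real output keeps the leading > is not confirmed here (this sketch drops it).

```python
def description_lines(fasta_text):
    """Return the description line of every sequence in fasta text."""
    return [line[1:].strip()
            for line in fasta_text.splitlines()
            if line.startswith(">")]


example = ">gi|317409317|gb|HQ420351.1| Pterodroma sandwichensis\nACGT\nACGT\n>Scaffold2\nGGCC\n"
names = description_lines(example)
```

Looking through such a list is a quick way to decide which AccessionNumberMethod fits your non-model genome file.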
4.5 ParseBLAST
Function






Parses output from a BLAST search of a local database of un-annotated, non-model species
genomic sequences assembled as scaffolds/contigs to retrieve the sequences that correspond
to the reference species sequences
For the BLAST match, if the alignment of the reference and non-model sequences does not
cover the whole length of the reference sequence (e.g. a match with an alignment for 900 out
of 1000 bp), additional sequence will be added from upstream and/or downstream as
necessary to return a sequence of comparable length
ParseBLASTPartial.py will return sequences all of approximately the same length, which is
set by the user
o Some differences in length may still occur due to insertions/deletions
ParseBLASTExons.py will return sequences of similar length to those used in the BLAST
search
o Includes some more detailed output regarding the number of genes searched, and
which genes had exons lacking BLAST hits
o Some differences in length may still occur due to insertions/deletions
ParseBLASTWhole.py will return sequences of similar length to those used in the BLAST
search
o Similar to ParseBLASTExons, but the output is more general and does not focus
specifically on exons
o Some differences in length may still occur due to insertions/deletions
Some statistics are calculated, including average length of sequence, average GC content,
total length of regions found, names of duplicate sequences found, names of sequences with
more than the user-specified number of Ns
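The extension of a partial alignment to a sequence of comparable length can be sketched with simple coordinate arithmetic. The following is an assumption-laden illustration, not the ParseBLAST algorithm: the function name subject_window is hypothetical, coordinates are 1-based and inclusive as in BLAST tabular output, and insertions/deletions are ignored.

```python
def subject_window(qstart, qend, sstart, send, query_length, subject_length):
    """Extend a BLAST hit on the subject to cover the whole query.

    Returns (low, high, strand) in 1-based inclusive subject coordinates,
    clipped to the ends of the scaffold/contig.
    """
    missing_before = qstart - 1           # query bases before the alignment
    missing_after = query_length - qend   # query bases after the alignment
    if sstart <= send:
        # Hit on the plus strand of the subject
        low = sstart - missing_before
        high = send + missing_after
        strand = "+"
    else:
        # blastn reports sstart > send for hits on the minus strand
        low = send - missing_after
        high = sstart + missing_before
        strand = "-"
    return max(1, low), min(subject_length, high), strand
```

For the example above (an alignment covering 900 of 1000 bp), subject_window(1, 900, 5001, 5900, 1000, 10000) extends the hit by 100 bp downstream. Clipping at the scaffold ends is one way a returned sequence can end up shorter than the query.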
Assumptions


Reference and subject are relatively closely related
Returns only the best BLAST hit for each sequence based on the bitscore, after first
removing BLAST hits with e-values higher than the user defined threshold
o As defined by NCBI, “The bit score gives an indication of how good the alignment is
[between the query and the subject sequences]; the higher the score, the better the
alignment. In general terms, the bit score is calculated from a formula that takes into
account the alignment of similar or identical residues, as well as any gaps introduced
to align the sequences.”
As with other programs that utilize BLAST searches, the sequence returned may not be
homologous to the regions of interest
If multiple copies of the same sequence are present in the BLAST query file (e.g. duplicated
genes, multiple transcripts), then the single best BLAST hit will be returned for all of the
sequences
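The filtering described above (drop hits above the e-value threshold, then keep the hit with the highest bit score for each query) can be sketched in Python 3. This is an illustration rather than the ParseBLAST code: the function name best_hits is hypothetical, the column order follows the outfmt string required by this manual (qacc sacc evalue bitscore pident qstart qend sstart send), and the rows in the usage example are made up.

```python
import csv
import io

COLUMNS = ["qacc", "sacc", "evalue", "bitscore", "pident",
           "qstart", "qend", "sstart", "send"]


def best_hits(blast_csv_text, evalue_threshold):
    """Return {query name: best hit} from comma-separated BLAST output."""
    best = {}
    for row in csv.reader(io.StringIO(blast_csv_text)):
        if not row:
            continue
        hit = dict(zip(COLUMNS, row))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if hit["evalue"] > evalue_threshold:
            continue  # e-value too high, hit is not considered
        current = best.get(hit["qacc"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qacc"]] = hit
    return best


# Made-up rows for illustration only
table = ("Gene1,ScafA,1e-50,200,98.5,1,100,500,599\n"
         "Gene1,ScafB,1e-80,350,99.0,1,100,10,109\n"
         "Gene2,ScafC,5.0,20.1,80.0,1,30,7,36\n")
hits = best_hits(table, 1e-10)
```

In this sketch Gene2 has no surviving hit, so no subject sequence would be returned for it.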
User input

blastfilename: Name of the file with the BLAST results [When entering settings directly in
script use quotes]
 Case sensitive
 A BLAST search must be run so that it outputs specific information in a particular
order. See the Installation Instructions for how to make a local BLAST database.
 Use this BLAST command to generate the appropriate output (Note: the outfmt
options are the important part here. They must all be specified and specified in
this order)
blastn -db DatabaseName -max_target_seqs 1 -outfmt "10 qacc sacc evalue bitscore pident qstart qend sstart send" -query SeqFinderResultsFile.fa -out BLASTResults.csv



fastaqueriesfilename: Name of the fasta file with the query sequences used for the BLAST
search [quotes]
 This is used for the output summary file to identify genes/exons that were and were
not found during the run
 Case sensitive
subjectfilename: The name of the file that contains the fasta sequences of the non-model
species genome used to create the BLAST database [quotes]
 Case sensitive
 As when creating the database, all scaffold/contig sequences for the non-model
species should be included in this single fasta format file
AccessionNumberMethod: Enter the number of the Accession number method you would
like to use (1, 2, or 3; see below) [no quotes]
 The ParseBLAST script works by matching the accession number from the BLAST
hit to the accession number in the subject sequence file
o You need to tell ParseBLAST how to find the accession number for sequences
in the sequence file
o Here is an example BLAST hit to illustrate the three accession number
method options (the accession number, HQ420351, is the second field):
Reference_CYTB,HQ420351,0.0,1,1000,1702,695
AccessionNumberMethod 1 – assumes that the accession number is the only text
between the > and the first space for each sequence.
o There are no settings specifically for this method
o You could use this method if the accession number in your sequence file looks
like this:
>HQ420351

AccessionNumberMethod 2 – assumes that the accession number is found at the exact
same position for each sequence name, although the accession number may not be the
only text in the name
o For this method you need to enter the exact starting and exact ending
position of the accession number (not including the >) for Accession2Start
and Accession2End
o Example:
>gi|317409317|gb|HQ420351.1| Pterodroma sandwichensis

o Here the accession number starts at character 18 with H and ends at character
25 with 1
 The BLAST hit above did not include the revision number (the
decimal or the text after it) so that should not be included here
AccessionNumberMethod 3 – attempts to find the accession number based on some
information you provide
o This works best if your accession numbers have some consistent value in
them, but differ in length or in position in the name of the sequence
o Using the same example as immediately above:
>gi|317409317|gb|HQ420351.1| Pterodroma sandwichensis
>gi|317409317|gb|HQ420352.1| Pterodroma sandwichensis
o Here all of the accession numbers start with ‘HQ’, so HQ could be used as the
start query in Accession3StartFind. Python will find this position each time
and store its location for you
 Since H is the exact start of the accession number you do not need to
specify any offset for Accession3StartFindOffset.
o Here all of the accession numbers end at ‘.’ so the decimal point could be used
as the end query for Accession3EndFind. Python will find this position and
store its location for you
 In this case, since the decimal occurs immediately after the end of the
accession number you do not need any offset for
Accession3EndFindOffset
o If your search queries occurred before or after the accession number (e.g. you
searched for the gb in the example above instead of HQ) you can specify the
number of characters that the query is from the start or end of the accession
number.
 If you searched for a string of letters instead of a single character,
python returns the position of the beginning of the string. For example,
if you searched for gb above, you would need to specify the offset
from the letter g.
 Note: Finding the right values to enter for Accession Number Methods 2 and 3 may
take some trial and error. Once you start the ParseBLAST script it will output the
accession numbers it is searching for to the screen. If these do not match the format of
the accession numbers in your BLAST results file (e.g. the first letter of the accession
number is missing), kill the run by typing control c and try again
o In python, ranges include the beginning number but not the end number (e.g. a
range of 1 – 5 only includes the values 1, 2, 3, and 4); and python starts
counting at 0 instead of 1
o Accession number method 2 has been adjusted for this so that when you
specify the exact start and exact end positions of the accession number it will
know what to do
o Accession number method 3 has not been adjusted for this because Python is
searching for the accession number for us. Therefore, if you need to specify
an offset for Accession Number Method 3 you should specify the distance
to the exact starting position of the accession number and the position of
the character immediately after the end of the accession.
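The search-and-offset logic of Accession Number Method 3 and the Python counting rules mentioned above can be sketched as follows. This is a hedged illustration, not the ParseBLAST source: the function name find_accession is hypothetical, its parameters merely mirror the Accession3StartFind, Accession3StartFindOffset, Accession3EndFind, and Accession3EndFindOffset settings, and searching for the end string only after the start position is a choice made for this sketch.

```python
def find_accession(header, start_find, start_offset, end_find, end_offset):
    """Cut an accession number out of a fasta description line."""
    start = header.find(start_find)      # position of the first character found
    if start == -1:
        raise ValueError("start query not found")
    start += start_offset
    end = header.find(end_find, start)   # look for the end after the start
    if end == -1:
        raise ValueError("end query not found")
    end += end_offset
    # Python slices include the start position but not the end position
    return header[start:end]


header = ">gi|317409317|gb|HQ420351.1| Pterodroma sandwichensis"
by_hq = find_accession(header, "HQ", 0, ".", 0)
by_gb = find_accession(header, "gb", 3, ".", 0)
```

Searching for gb needs an offset of 3 because find returns the position of the g, and the accession number starts three characters later. The same counting rule is visible in list(range(1, 5)), which gives 1, 2, 3, and 4 but not 5.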
evalue: E-value threshold to use to filter the BLAST search results [no quotes]
o As defined by NCBI, “the E-value is a parameter that describes the number of hits
one can ‘expect’ to see by chance in a database of a particular size…For example, an
E-value of 1 assigned to a hit can be interpreted as meaning that in a database of the
current size, one might expect to see one match with a similar score simply by
chance.”
o BLAST hits with e-values larger than this threshold will not be returned by the script
queryspecies: Query species [quotes]
 This is anything that occurs before the name of the gene (excluding the >) in the fasta
definition line
 If you used the SeqFinder script, this is the same text you entered for this option there
 If you are using sequences that were not obtained with the SeqFinder script, enter any
text between the > and the beginning of the gene (assuming it is the same length for
all sequences). If there is no extra text simply enter an underscore (‘_’) here.
subjectspecies: Subject (non-model) species [quotes]
 Enter this so you can later differentiate the sequences from the non-model and
reference species
outname: Base name for the output files [quotes]
minsequencesize: Minimum sequence length [no quotes]
 The output summary file will tell you which sequences are shorter than this length
 Short sequences may arise because a contig/scaffold ended in the middle of a gene, or
because of insertions and deletions between the reference and subject species
numberNs: Number of Ns to find [no quotes]
 The output summary file will tell you which sequences have strings of Ns this long or
longer
[Optional setting – disableinteractive: When entering settings directly into the script this
setting allows you to shut off the interactive mode. Enter ‘yes’ to disable it and ‘no’ to leave
it on]
Output to screen

This script looks at the accession number for each sequence in the subject file and compares
that to the accession number for each BLAST hit. It will write that accession number to the
screen as it progresses through the file. If the accession numbers look strange, kill the run by
typing control c and adjust the settings of the AccessionNumberMethod
Output files (assuming the basename is Results)




Results.fa
o This file contains one sequence from the non-model species for the best, longest
BLAST hit. If a BLAST hit was not found for a particular reference sequence then
there will be no corresponding subject sequence in this file
o This is sorted in the order of the accession number the sequence came from
ResultsSorted.fa
o This is the file above sorted by locus name instead
ResultsStats.txt
o This file contains summary information about the run:
 Settings used
 Number of regions found and missing
 For ParseBLASTExons only: Number of unique genes from the query file
 For ParseBLASTExons only: Number of genes with missing exons after the
BLAST search
 Average query length, average BLAST match length, average identity
 Average length and GC content of the sequences found
 Total length of sequences found
 Sequences shorter than the minimum size set by the user
 Sequences that were found more than once
 Sequences with strings of Ns as long or longer than the value input by the user
ResultsUniqueBLASTtable.txt
o This is the final BLAST table used by ParseBLAST, sorted by percent identity
 Column 1 – name of query sequence
 Column 2 – Accession number (or name) of the sequence where the hit was
found
 Column 3 – E-value for hit
 Column 4 – Bitscore
 Column 5 – Percent identity
 Column 6 – Query sequence start for hit alignment
 Column 7 – Query sequence end for hit alignment
 Column 8 – Subject sequence start for hit alignment
 Column 9 – Subject sequence end for hit alignment
 Column 10 – Length of aligned region
 Column 11 – Length of query sequence
 Column 12 – % of query covered by BLAST hit
o Comparison of the e-value in the third column, the percent identity in the fifth
column, and the % of the query covered in column 12 will give some information about the
quality of the BLAST hit for which the sequence was returned
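As a rough guide to how the last three columns relate to each other, the arithmetic can be sketched as below. This is an assumption, not the script's documented formula: the function name coverage_columns is hypothetical, and the aligned length is taken here from the query start and end positions.

```python
def coverage_columns(qstart, qend, query_length):
    """Return (aligned length, percent of query covered) for one hit."""
    aligned = abs(qend - qstart) + 1   # positions are 1-based and inclusive
    percent = 100.0 * aligned / query_length
    return aligned, percent
```

A hit aligned over positions 1 to 900 of a 1000 bp query would therefore give an aligned length of 900 and a coverage of 90%.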
ResultsAccessions.txt
o This file contains a non-redundant list of accession numbers for which a sequence
from the non-model species was retrieved
o These sequences can be used later for mapping the resulting next-generation sequence
reads
For ParseBLASTExons.py – ResultsFoundExons.txt
o A list of the regions/exons found and the number of times found
For ParseBLASTExons.py – ResultsMissingExons.txt
o A list of the exons that were not found by BLAST
For ParseBLASTExons.py – ResultsGenesMissingExons.txt
o A list of the genes for which at least one exon was not found by BLAST
o This can be compared to the ResultsMissingExons.txt file to determine how many
and which exons are missing
For ParseBLASTPartial.py and ParseBLASTWhole.py – ResultsFoundGenes.txt
o A list of the regions/exons found and the number of times found
For ParseBLASTPartial.py and ParseBLASTWhole.py – ResultsMissingRegions.txt
o A list of the regions that were not found by BLAST
ResultsLengths.txt
o A list of the lengths for the sequences returned. This could be put into Excel or some
other program to further examine the distribution of lengths.
ResultsMatchLength.txt
o A list of the BLAST match lengths. This could be put into Excel or some other
program to further examine the distribution of lengths
ResultsIdentity.txt
o A list giving the identity between the reference query sequence and the subject
sequence
ResultsInfo.txt
o This gives the accession number where the sequence for each region was found
4.6 MergeSeqs
Function


This script merges the sequences from the reference and non-model species.
It starts by examining the reference/query sequence file. If there is a sequence for the non-model
species present, the sequence for the non-model species will be added to the output
file. If a sequence from the reference species is present, but there is no corresponding
sequence from the non-model species for it, the reference sequence will be added to the
merged output file instead.
Using this script will allow baits to be designed from the reference species for loci where no
non-model species sequence was found. This is helpful if your reference is relatively closely
related to your target species, and if there are genes of high interest from the reference that
were not found in the target species.
The script will also return the standard summary information
Assumptions

Only sequences with corresponding members in the query sequence file will be written to the
merged file
o If, for example, a sequence is present only for the non-model species but not the
query species, then it will not be written to the merged file
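The merge rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the MergeSeqs source: the function name merge_sequences is hypothetical, sequences are held in dictionaries keyed by fasta name without the >, and the identifier 'Wolf_' in the usage example is made up.

```python
def merge_sequences(query_seqs, subject_seqs, queryidentifier, subjectidentifier):
    """Prefer the non-model sequence for each locus in the query file.

    query_seqs, subject_seqs: dicts of {fasta name: sequence}.
    Loci present only in the subject file are not written.
    """
    subject_by_locus = {}
    for name, seq in subject_seqs.items():
        if name.startswith(subjectidentifier):
            locus = name[len(subjectidentifier):]
            subject_by_locus[locus] = (name, seq)
    merged = {}
    for name, seq in query_seqs.items():
        if not name.startswith(queryidentifier):
            continue
        locus = name[len(queryidentifier):]
        if locus in subject_by_locus:
            subject_name, subject_seq = subject_by_locus[locus]
            merged[subject_name] = subject_seq   # non-model sequence wins
        else:
            merged[name] = seq                   # fall back to the reference
    return merged


query = {"Dog_Gene1_Exon1": "AAAA", "Dog_Gene2_Exon3": "CCCC"}
subject = {"Wolf_Gene1_Exon1": "AAAT", "Wolf_Gene9_Exon1": "GGGG"}
merged = merge_sequences(query, subject, "Dog_", "Wolf_")
```

Gene9 is absent from the result because it has no corresponding sequence in the query file, as stated in the assumption above.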
User input



queryfilename: Name of the fasta format query/reference sequence file (e.g. from SeqFinder)
[When entering settings directly into script use quotes]
subjectfilename: Name of the fasta format non-model species sequence file (e.g. from
ParseBLAST) [quotes]
queryidentifier [quotes]
 From the query fasta sequence file, this is the string of characters immediately
following (but not including) the > sign up to the beginning of the locus name
 It should be the same in every sequence
Example:
For these sequences:
>Dog_Gene1_Exon1
>Dog_Gene2_Exon3
>Dog_Gene3_Exon1




The query identifier would be ‘Dog_’
subjectidentifier: Subject identifier [quotes]
 Similar to above, but for the non-model fasta sequence file
outname: Base name for the output files [quotes]
numberNs: Identify sequences with this number (or more) Ns when returning sequences [no
quotes]
[Optional setting – disableinteractive: When entering settings directly into the script this
setting allows you to shut off the interactive mode. Enter ‘yes’ to disable it and ‘no’ to leave
it on]
Output to screen
Name of the locus being searched for in the subject file, and subsequently being copied to the
merged file. If the locus names look strange, kill the run by typing control c, adjust the
settings of the query and subject species identifiers, and try again.
Output files (assuming base name for output files is Results)



ResultsMerged.fa
o This file contains the sequences merged from both the query (reference) species and
the subject (non-model) species, sorted in the order of the gene name
ResultsStats.txt
o This file contains summary information about the merged sequence file
QueryFileSorted.fa
o This file contains the query sequences sorted by name
4.7 QCMaskedMissing
Function



This tool reads in a set of fasta format sequences and will discard sequences with too much
masked/missing data, as well as sequences below a desired length
It calculates the percentage of sites with IUPAC ambiguity codes (e.g. R, Y, M, etc.),
percentage of sites with missing data (N or ?), percentage of sites that have been masked for
repetitive or low complexity sequences (indicated by X), and then discards any sequences for
which the percent missing/masked data is above a user-defined threshold.
It will also discard sequences shorter than the desired length set by the user
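The percentages and the discard decision can be sketched as follows. This is a minimal illustration, not the QCMaskedMissing code: the function names are hypothetical, N is counted as missing data rather than as an ambiguity code, the threshold is applied to missing plus masked sites as described for the threshold setting, and discardshortseqs is passed as True/False here instead of 'yes'/'no'.

```python
AMBIGUITY = set("RYSWKMBDHV")  # IUPAC ambiguity codes other than N
MISSING = set("N?")
MASKED = set("X")


def site_percentages(seq):
    """Return (% ambiguous, % missing, % masked) for one sequence."""
    seq = seq.upper()
    total = len(seq)
    if total == 0:
        return 0.0, 0.0, 0.0

    def percent(symbols):
        return 100.0 * sum(1 for base in seq if base in symbols) / total

    return percent(AMBIGUITY), percent(MISSING), percent(MASKED)


def keep_sequence(seq, threshold, minsequencesize, discardshortseqs):
    """Decide whether a sequence is retained."""
    ambiguous, missing, masked = site_percentages(seq)
    if missing + masked > threshold:
        return False
    if discardshortseqs and len(seq) < minsequencesize:
        return False
    return True
```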
Assumptions



Input is a fasta formatted sequence file
Ambiguous sites are identified with IUPAC ambiguity codes and/or question marks
Repetitive and low complexity regions are masked with Xs
User input






sequencefilename: Name of the fasta sequence file to analyze [quotes]
outname: Base name to use for the output files [quotes]
threshold: Percentage of missing + masked data above which the sequence should be
discarded. Enter a whole number (e.g. 10% is entered as 10 not 0.10) [no quotes]
minsequencesize: Identify sequences shorter than this length in the results [no quotes]
discardshortseqs: Whether or not to discard sequences shorter than minsequencesize. Enter
‘yes’ or ‘no’ [quotes]
numberNs: Identify sequences with this number (or more) Ns when returning sequences [no
quotes]
[Optional setting – disableinteractive: When entering settings directly into the script this
setting allows you to shut off the interactive mode. Enter ‘yes’ to disable it and ‘no’ to leave
it on]
Output to screen

ID for sequence being evaluated.
Output files (assuming base name for output files is Results)


ResultsQCMaskedMissing.fa
o File with the sequences that were retained
ResultsStats.txt
o File with summary information including total number of sequences read, number of
sequences retained and discarded, total length of sequences retained, average length
of sequences, average GC content of sequences, names of sequences that were
discarded because they contained too much missing/masked data, names of sequences
shorter than length specified by the user and whether they were retained or discarded,
names of sequences with more than the user-specified number of Ns
4.8 QCComplementarity
Function



This script reads in a results file from a BLAST search of the selected sequences (e.g. from
the SeqSelector workflow) against themselves and then parses these results to identify pairs
of sequences that demonstrate high complementarity to each other.
The first sequence (e.g. Seq1) of a pair (e.g. Seq1, Seq2) with percent identity greater than
the user defined threshold for a region longer than that defined by the user will be discarded.
Self BLAST hits (Seq1, Seq1) and the reciprocal BLAST hit (Seq2, Seq1) are ignored.
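The pair-filtering rule can be sketched in Python as below. This is an illustration and not the QCComplementarity source: the function name sequences_to_discard is hypothetical, hits are given as (query ID, subject ID, percent identity, match length) tuples, and a pair is only treated as already handled once one of its hits has exceeded both thresholds.

```python
def sequences_to_discard(hits, percentidentity, matchlength):
    """Return the set of sequence IDs to remove.

    hits: iterable of (query_id, subject_id, pident, length) tuples.
    """
    flagged_pairs = set()
    discard = set()
    for query, subject, pident, length in hits:
        if query == subject:
            continue  # self hit (Seq1, Seq1)
        pair = frozenset((query, subject))
        if pair in flagged_pairs:
            continue  # reciprocal hit (Seq2, Seq1) of a pair already flagged
        if pident > percentidentity and length > matchlength:
            flagged_pairs.add(pair)
            discard.add(query)  # the first sequence of the pair is removed
    return discard


hits = [("Seq1", "Seq1", 100.0, 500),
        ("Seq1", "Seq2", 95.0, 200),
        ("Seq2", "Seq1", 95.0, 200),
        ("Seq3", "Seq4", 80.0, 300)]
removed = sequences_to_discard(hits, 90, 100)
```

Only Seq1 is removed in this made-up example: Seq2 is kept because its hit is the reciprocal of a pair already dealt with, and the Seq3/Seq4 pair falls below the percent identity threshold.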
Assumptions

Input is a csv formatted BLAST results file created using the command given in the
workflow
User input


blastfilename: Name of the file with the BLAST results [When entering settings directly in
script use quotes]
fastaqueriesfilename: Name of the fasta file with the query sequences used for the BLAST
search [quotes]
percentidentity: One part of the criteria above which a pair of sequences will be considered
complementary. Percent identity should be entered as a whole number. For example, 90%
identity should be entered as 90 and not 0.90 [no quotes]
matchlength: The second part of the criteria above which a pair of sequences will be
considered complementary. If a pair of sequences has higher percent identity for a region
longer than the matchlength then it will be discarded. Matchlength should be entered as a
whole number [no quotes]
outname: Base name to use for the output files [quotes]
minsequencesize: Identify sequences shorter than this length in the results [no quotes]
numberNs: Identify sequences with this number (or more) Ns when returning sequences [no
quotes]
[Optional setting – disableinteractive: When entering settings directly into the script this
setting allows you to shut off the interactive mode. Enter ‘yes’ to disable it and ‘no’ to leave
it on]
Output to screen
None
Output files (assuming base name for output files is Results)



ResultsQCComplementarity.fa
o Fasta format file with the complementary sequences removed
ResultsQCComplementarityStats.txt
o File with summary information including the number of sequence pairs that
demonstrated complementarity above the percent identity and match length
thresholds, the number of sequences removed, total number of sequences retained,
total length of sequences retained, average length of sequences retained, average GC
content of sequences retained, names of highly complementary sequences that were
removed, names of sequences shorter than a specified length, names of sequences
found multiple times in the data set, and those with more than the user-specified
number of Ns.
ResultsQCComplementarityMatchTable.txt
o A table with the BLAST hits that failed the percent identity and match length
thresholds. The columns in this table are:
 Column 1 - query sequence ID
 Column 2 - subject sequence ID
 Column 3 - percent identity
 Column 4 - match length
 Column 5 - number of mismatches
 Column 6 - number of gaps
 Column 7 - query starting position for the match
 Column 8 - query ending position
 Column 9 - subject starting position for the match
 Column 10 - subject ending position
 Column 11 - E-value
 Column 12 - bitscore
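To illustrate how the columns above can be used, the following minimal Python 3 sketch reads a 12-column table and reports the hits that exceed both thresholds. It is not the code used by QCComplementarity. The tab delimiter, the function names (read_hits, failing_hits), the exclusion of self-hits, the strict "greater than" comparisons, and the example rows are all assumptions made for this example.

```python
import csv
import io

# Column names follow the 12-column table described above.
COLUMNS = ["query", "subject", "pident", "length", "mismatches", "gaps",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def read_hits(handle, delimiter="\t"):
    """Yield one dictionary per BLAST hit (the delimiter is an assumption)."""
    for row in csv.reader(handle, delimiter=delimiter):
        if len(row) != len(COLUMNS):
            continue  # skip blank or malformed lines
        hit = dict(zip(COLUMNS, row))
        hit["pident"] = float(hit["pident"])
        hit["length"] = int(hit["length"])
        yield hit


def failing_hits(hits, percentidentity, matchlength):
    """Return hits between two different sequences that exceed both thresholds."""
    return [h for h in hits
            if h["query"] != h["subject"]        # ignore a sequence matching itself
            and h["pident"] > percentidentity
            and h["length"] > matchlength]


# Made-up example rows, for illustration only.
example = io.StringIO(
    "seqA\tseqA\t100.00\t500\t0\t0\t1\t500\t1\t500\t0.0\t924\n"
    "seqA\tseqB\t95.00\t120\t6\t0\t10\t129\t300\t181\t2e-50\t190\n"
    "seqA\tseqC\t85.00\t200\t30\t0\t1\t200\t1\t200\t1e-40\t150\n"
)
flagged = failing_hits(read_hits(example), percentidentity=90, matchlength=100)
for h in flagged:
    print(h["query"], h["subject"], h["pident"], h["length"])
```

With a percent identity of 90 and a matchlength of 100, only the seqA/seqB pair is reported in this example: the first row is a self-hit and the third row falls below the percent identity threshold.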
Back to Workflow
4.9 QCDuplicates
Function

This script reads in a results file from a BLAST search of the selected sequences (e.g. from
the SeqSelector workflow) against the genome of a species of interest (e.g. non-model or
working reference species) and then parses these results to identify sequences with multiple
high quality hits with percent identity above and e-value below user-specified thresholds,
respectively. Sequences with multiple high quality hits are discarded.
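The idea described above can be sketched in a few lines of Python 3. This is an illustration only and not the QCDuplicates code: it assumes the csv file has the same 12 columns as a standard BLAST table (query ID first, percent identity third, e-value eleventh), and the function names and example rows are hypothetical.

```python
import csv
from collections import Counter


def count_high_quality_hits(rows, percentidentity, evalue):
    """Count, for each query, the hits with identity above and e-value below the thresholds."""
    counts = Counter()
    for row in rows:
        query, pident, hit_evalue = row[0], float(row[2]), float(row[10])
        if pident > percentidentity and hit_evalue < evalue:
            counts[query] += 1
    return counts


def potential_duplicates(counts):
    """Return the queries that have more than one high quality hit."""
    return sorted(query for query, n in counts.items() if n > 1)


# Made-up csv lines, for illustration only.
rows = list(csv.reader([
    "gene1,scaffold1,98.5,400,6,0,1,400,1000,1399,1e-150,700",
    "gene1,scaffold7,97.0,390,12,0,1,390,52,441,1e-140,650",
    "gene2,scaffold3,99.0,500,5,0,1,500,10,509,0.0,900",
    "gene2,scaffold9,80.0,60,12,0,1,60,77,136,1e-5,60",
]))
counts = count_high_quality_hits(rows, percentidentity=95, evalue=1e-10)
print(potential_duplicates(counts))
```

Here gene1 has two hits that pass both thresholds and would be treated as potentially duplicated, while gene2 has only one such hit and would be retained.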
Assumptions


Input is a csv formatted BLAST results file created using the command given in the
workflow
Depending on the settings selected this script may also remove sequences that are
complementary to other regions even though they may not be duplicates
User input








blastfilename: Name of the file with the BLAST results [When entering settings directly in
script use quotes]
fastaqueriesfilename: Name of the fasta file with the query sequences used for the BLAST
search [quotes]
percentidentity: One part of the criteria for considering whether a sequence may potentially
be duplicated. Percent identity should be entered as a whole number. For example, 95%
identity should be entered as 95 and not 0.95 [no quotes]
evalue: The second part of the criteria for identifying duplicated regions [no quotes]. If a
sequence has multiple hits with percent identity higher than the threshold specified above and
e-value less than that specified here, it will be discarded
outname: Base name to use for the output files [quotes]
minsequencesize: Identify sequences shorter than this length in the results [no quotes]
numberNs: Identify sequences with this number (or more) Ns when returning sequences [no
quotes]
[Optional setting – disableinteractive: When entering settings directly into the script this
setting allows you to shut off the interactive mode. Enter ‘yes’ to disable it and ‘no’ to leave
it on]
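As a hedged illustration of the quotes and no-quotes convention, settings entered directly into the script might look something like the lines below. The setting names are those listed above, but every value (file names, thresholds) is hypothetical, and the exact layout of the settings block in your copy of the script may differ.

```python
# Text settings are entered with quotes.
blastfilename = 'QueriesVsGenome.csv'   # hypothetical file name
fastaqueriesfilename = 'Queries.fa'     # hypothetical file name
outname = 'Results'

# Numeric settings are entered without quotes.
percentidentity = 95                    # 95, not 0.95
evalue = 1e-10                          # hypothetical threshold
minsequencesize = 120                   # hypothetical threshold
numberNs = 10                           # hypothetical threshold

# Optional: 'yes' disables the interactive mode, 'no' leaves it on.
disableinteractive = 'yes'
```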
Output to screen
Each sequence examined will be output to the screen along with information on the number
of potential duplicates.
Output files (assuming base name for output files is Results)



ResultsQCDuplicates.fa
o Fasta format file with the potentially duplicated sequences removed
ResultsQCDuplicatesStats.txt
o File with summary information including the number of sequences that are potentially
duplicated, the number of sequences removed, total number of sequences retained,
total length of sequences retained, average length of sequences retained, average GC
content of sequences retained, names of potentially duplicated sequences that were
removed, names of sequences shorter than a specified length, names of sequences
found multiple times in the data set, and those with more than the user-specified
number of Ns.
ResultsQCDuplicatesInfo.txt
o A list of potentially duplicated sequences and potential number of duplicates (i.e.
similar to the output to the screen during the run).
Back to Workflow
4.10 SequencesSortStats
Function


This is a stand-alone utility for summarizing a set of sequences in a fasta format file, separate
from the SeqFinder or ParseBLAST scripts.
This script will output a sorted file of your sequences and calculate some statistics including
total number of sequences, total length of sequences, average sequence length, average GC
content, names of duplicate sequences found, and names of sequences with more than the
user-specified number of Ns.
Assumptions




Input is a fasta formatted sequence file
Sequences will be sorted by their names, which includes text after the > sign and until the
first space
Duplicate sequences will be found on adjacent lines of the sorted sequence file
Only sequences with exactly the same sequence name (up to the first space) will be
considered duplicates; this is case sensitive.
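The assumptions above can be illustrated with a short Python 3 sketch. It is not the SequencesSortStats code: the function names are hypothetical, and GC content is calculated here over all bases pooled together, which is only one possible definition of "average GC content".

```python
def read_fasta(lines):
    """Return (name, sequence) pairs; the name is the text after '>' up to the first space."""
    records, name, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].split(" ", 1)[0], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def sort_and_summarize(records, numberNs):
    """Sort records by name (case sensitive) and compute summary statistics."""
    ordered = sorted(records, key=lambda record: record[0])
    names = [name for name, _ in ordered]
    # After sorting, records with identical names sit next to each other.
    duplicates = sorted({names[i] for i in range(1, len(names))
                         if names[i] == names[i - 1]})
    total_length = sum(len(seq) for _, seq in ordered)
    gc_bases = sum(seq.upper().count("G") + seq.upper().count("C")
                   for _, seq in ordered)
    stats = {
        "count": len(ordered),
        "total_length": total_length,
        "average_length": total_length / len(ordered) if ordered else 0.0,
        "gc_content": gc_bases / total_length if total_length else 0.0,
        "duplicates": duplicates,
        "many_ns": sorted({name for name, seq in ordered
                           if seq.upper().count("N") >= numberNs}),
    }
    return ordered, stats


# Made-up sequences, for illustration only.
lines = [">geneB extra text", "ACGT", "NNNN", ">geneA", "GGCC", ">geneB", "AT"]
ordered, stats = sort_and_summarize(read_fasta(lines), numberNs=4)
print([name for name, _ in ordered])
print(stats)
```

In this example geneB appears twice, so it is reported as a duplicate, and its first copy contains four Ns, so it is also reported under the Ns threshold.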
User input

sequencefilename: Name of the fasta sequence file to analyze [quotes]




outname: Base name to use for the output files [quotes]
minsequencesize: Identify sequences shorter than this length in the results [no quotes]
numberNs: Identify sequences with this number (or more) Ns when returning sequences [no
quotes]
[Optional setting – disableinteractive: When entering settings directly into the script this
setting allows you to shut off the interactive mode. Enter ‘yes’ to disable it and ‘no’ to leave
it on]
Output to screen
None
Output files (assuming base name for output files is Results)


ResultsSorted.fa
o File with the sequences sorted by name
ResultsStats.txt
o File with summary information including total number of sequences, total length of
sequences, average sequence length, average GC content, names of duplicate
sequences found, names of sequences with more than the user-specified number of Ns
Back to Workflow
5. TROUBLESHOOTING
5.1 Error Messages
In the section below error messages given by python are shaded in gray and potential solutions to
the problem are described below.
Traceback (most recent call last):
[several lines of text]
IOError: [Errno 2] No such file or directory: 'Filename.fasta'
Python could not find a file that you told it to use. Check the spelling of the filenames (Note this
is case sensitive!), and make sure that the files are present in the correct directory (usually the
directory where you are executing the script).
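If you would like to check for missing input files before running a script, a small Python 3 helper along the following lines can be used. This is a sketch only: the function name missing_files is hypothetical, and Filename.fasta is simply the name from the error message above.

```python
import os


def missing_files(filenames, directory="."):
    """Return the names that are not present as files in the given directory."""
    return [name for name in filenames
            if not os.path.isfile(os.path.join(directory, name))]


# Check the current directory for the file named in the error message.
missing = missing_files(["Filename.fasta"])
if missing:
    print("Could not find:", ", ".join(missing))
else:
    print("All input files were found.")
```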
File "./ParseBLASTPartial.py", line 32
outname = ‘Filename.fasta
^
SyntaxError: EOL while scanning string literal
Quotation marks have been misused around some text. Make sure that quotation marks are
present at both the beginning and the end of a string of text. Double check that you use the
same style of quotation mark at each end of the text (e.g. 'word' or "word", but not 'word")
Traceback (most recent call last):
File "./ParseBLASTExons.py", line 32, in <module>
outname = Filename
NameError: name 'Filename' is not defined
In this case the quotation marks may be completely missing from around some text. Please add
them and try again.
Traceback (most recent call last):
File "./ParseBLASTPartial.py", line 86, in <module>
Accession2Start = Accession2Start - 1
TypeError: unsupported operand type(s) for -: 'str' and 'int'
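Python was expecting one type of input and found a different type. In the example above, a
setting that should be a number (entered without quotes) was entered with quotes, so Python
treated it as text ('str') and could not subtract 1 from it. Please check that all of the user
input items are correct and that quotes are used only where indicated.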
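The 'str' and 'int' in the message above are what Python reports when a numeric setting has been entered with quotes. The short Python 3 sketch below reproduces the problem and the fix; the value 12 is made up for illustration.

```python
# Entered with quotes: Python treats the value as text, so subtraction fails.
Accession2Start = '12'
try:
    Accession2Start - 1
except TypeError as error:
    message = str(error)
    print("TypeError:", message)

# Entered without quotes: Python treats the value as a number.
Accession2Start = 12
result = Accession2Start - 1
print(result)
```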
Traceback (most recent call last):
File "./ParseBLASTExons.py", line 569, in <module>
avelength = x / numgenes
ZeroDivisionError: division by zero
No genes or sequences were found to match your query. Check to make sure you have entered
correct gene names (and that they are spelled correctly), that there were BLAST hits for your
genes, that the matching record for the BLAST hit is present in your fasta file of subject
sequences, etc. If this message is obtained while using a ParseBLAST script then look at the line
directly above to see if an error message was printed. If there is an error message it will give
hints how to fix the problem.
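The reason for this error is that an average is calculated by dividing by the number of genes found, which is zero when nothing matched. The Python 3 sketch below shows the arithmetic with a guard against an empty result; the function name and the guard are hypothetical and are not part of the ParseBLAST scripts.

```python
def average_length(total_length, numgenes):
    """Return the average length, or None when no genes were found."""
    if numgenes == 0:
        return None  # dividing by zero would raise ZeroDivisionError
    return total_length / numgenes


print(average_length(300, 3))
print(average_length(0, 0))
```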
ERROR! Please enter 1, 2, or 3 for AccessionNumberMethod.
Traceback (most recent call last):
File "./ParseBLASTExons.py", line 706, in <module>
avelength = x / numgenes
ZeroDivisionError: division by zero
As the error message suggests, you have selected an inappropriate value for
AccessionNumberMethod in the ParseBLAST script. Please select 1, 2, or 3 instead.
5.2 Unexpected results
For the SeqFinder scripts, if some genes of interest were not found, then there may be a
mismatch between the gene name (e.g. from the Ensembl file) and the annotated gene name in
the reference GenBank file. There are a couple of things you can try:
 If the gene has synonyms in other species, try running the scripts again using those names
(UCSC Genome Browser is useful for identifying these)
 If the gene still isn’t found, try searching the reference genome (e.g. UCSC Genome
Browser or the appropriate chromosome GenBank file) for a key word in the gene’s name
or the gene’s function to identify an alternative name. The gene may have been annotated
in the GenBank file using a generic name, such as “LOC0895654” instead.
For the ParseBLAST script, if no Accession numbers are printed while the script is running or if
they seem incomplete, it is likely that the script is not able to understand how it should look for
the Accession number. Check to make sure you have selected the correct
AccessionNumberMethod, and that the start and end coordinates, the search strings, or the
offsets have been entered correctly. If they seem incorrect adjust them and try again. The script
outputs to the screen what it is searching for, so if the format of the Accession number on the
screen doesn’t look similar to the format of the Accession number in the BLAST result then you
will not get any sequences back.
For the ParseBLAST script, if you get an error message stating ‘BLAST hits but no genes
found.’ in the Stats file, there may be an issue with how you are specifying the query or subject
species. Check to make sure these are entered correctly and/or that you are using the correct files.
For the ParseBLAST script, if you get anomalous looking gene names, check that the query
species has been input correctly.
APPENDIX 1. COMMAND LINE TUTORIAL FOR WINDOWS
Getting Started
In order to execute scripts from the command line it is helpful to know some basic Windows
commands as well as have an understanding of file structure. This appendix contains some of the
basics for those who are unfamiliar with working at the command line.
Let us start with an example. Suppose you want to execute the GoSelect.py script. We’ll assume
that this script has been saved in a folder called Scripts on the Desktop.
For Windows 8, the first thing to do is open the Command Prompt application, which accepts and
executes commands from the user. Go to Programs > Windows System > Command Prompt.
The location of the Command Prompt application may be slightly different in other versions of
Windows, but should be similar. Once open you will see a prompt that is waiting for your
command. Usually the prompt contains information about the folder (also called directory) that
you are currently in (also called the current working directory). The prompt usually ends with a
‘>’. Note that Windows commands and file names are generally not case-sensitive, but the
contents of the Python scripts are.
Locations of files and directories
Before working with a file, you need to navigate to the directory where the file resides. This
should be a familiar concept as you have often navigated through directories to find files using
the mouse. This is the file structure. Each directory is specified as a certain location in the file
structure.
Usually the prompt will tell you your location, but in case it is truncated, to see your current
location, use this command:
chdir
This will show an output like this:
\Users\YourName
This is your home directory. Your home directory actually resides within a set of directories
above this, but they are not shown because most of the time you don’t need to use them (see
below regarding the root).
Each file and directory will have a location. The full description of the location of a file is the
absolute path:
\Users\YourName\Desktop\Scripts\GoSelect.py
Assuming this is where the GoSelect.py script file is saved, you can navigate to it using the
change directory command:
cd Desktop
cd Scripts
Note: To accomplish this in one command you could have typed cd Desktop\Scripts
Tip: On most operating systems you can begin typing a directory or file name and then press
the tab button and the computer will try to guess what you are typing. If you have typed
enough letters the computer will complete the word. If you have not typed enough letters the
computer won’t be sure what you are trying to type and may be able to complete none or
only part of the word you are typing, in which case you will need to provide more characters.
Print the current working directory:
chdir
\Users\YourName\Desktop\Scripts
You can also change directories by specifying the relative path of where you want to go.
.\ = current directory
..\ = the parent directory of the current directory (one directory above the current directory)
\ = the root directory (or the highest directory in your computer). This location stores files used
by the operating system and some of the programs that are installed. If you are a new
user you might want to avoid going here.
If you don’t specify an absolute or relative path, the computer will assume that you mean
for it to look in the current directory.
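If you want to see how a relative path is interpreted without moving any files, the rules above can be tried out in Python 3. The sketch below uses the standard ntpath module, which applies Windows path rules on any operating system. The function name resolve is hypothetical and the directory names are the ones used in this tutorial.

```python
import ntpath

# The current working directory used in this tutorial.
current = r"\Users\YourName\Desktop\Scripts"


def resolve(path, cwd=current):
    """Return the location that a path refers to when typed from cwd."""
    return ntpath.normpath(ntpath.join(cwd, path))


print(resolve("GoSelect.py"))            # no path given: current directory
print(resolve(r".\Test\GoSelect2.py"))   # .\ is the current directory
print(resolve(".."))                     # .. is the parent directory
print(resolve("\\"))                     # \ is the root directory
```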
To change directories back to the Desktop you could type:
cd ..
However for now, we want to be in the Scripts directory, so if you moved back to the desktop
type
cd Scripts
Tip: If you misspell a directory or file name you will get an error saying something to the
effect of “The system cannot find the path specified”
Tip: There is another handy way to change directories. At the terminal command prompt
type cd plus a space and then drag and drop a directory onto the command prompt. The
absolute path of the directory will be automatically written for you.
Locations of programs and scripts
When programs, such as NCBI BLAST+, are installed, the executable file for the program is
often put in your path in a location above the home directory. When you execute a program by
typing its name at the command line the computer searches all of the directories that are in your
path, and if it finds the program it will execute it. If it can’t find the program, it will give you the
“The system cannot find the path specified” error, or perhaps tell you that the command “is not
recognized as an internal or external command, operable program, or batch file”. Unless a program has been
installed in such a way that it is put into your path, the simplest thing is to keep the program
executable (including scripts) in the same directory as the input files. If you do this, you don’t
have to specify an absolute or relative path because the computer will just look for the program
and the input files in the current directory. If you choose to put the program executable or input
files in different directories you will need to specify the absolute path or the relative path to the
program/files from the current working directory.
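To check whether a program is in your path, Python 3 provides the standard function shutil.which, which performs the same kind of search as the command line. The sketch below is an illustration only: the wrapper name find_program is hypothetical, and blastn is used as an example because it is one of the programs installed with NCBI BLAST+.

```python
import shutil


def find_program(name):
    """Return the full location of a program found in the path, or None."""
    return shutil.which(name)


location = find_program("blastn")
if location is None:
    print("blastn was not found in your path.")
else:
    print("blastn was found at:", location)
```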
Working with files and directories
To see the contents of the current directory:
dir
This will probably show something like this:
GoSelect.py
MergeSeqs.py
ParseBLASTExons.py
To make a new directory (here, called Test) within the current directory:
mkdir Test
To copy a file, use the copy command. To make a copy of a file and place the copy in the
current directory type:
copy GoSelect.py GoSelect2.py
You can also copy the file to the directory Test that we created above:
copy GoSelect.py .\Test\GoSelect2.py
Tip: If you want the file that was copied to the Test directory to keep the same name you
could have just typed copy GoSelect.py .\Test\
Tip: You can also use the relative path. For example, to copy a file from the Test directory
back into the scripts directory you could type (assuming you have changed directory into the
Test directory): copy GoSelect.py .\..\GoSelectTest.py
Tip: When copying files, if you specify that the name of the copied file should be the same
as the name of a file that currently exists in that location, the current file will be replaced
with the copy. Windows may ask whether you want to replace it, but do not rely on this (for
example, no confirmation is requested when the /Y option is used).
Go back to the Scripts directory. To move a file without copying it, you can use the move
command:
move ParseBLASTExons.py .\Test\
You can also use the move command to rename a file (move a file to a file with a different
name in the same directory)
move ParseBLASTWhole.py .\ParseBLASTWholeTest.py
Change to the Test directory. To delete a file use the del (delete) command:
del GoSelect2.py
Move back to the Scripts directory. To delete the Test directory (and all of the files within it)
type:
rmdir .\Test /s
Tip: For new users it may be safer at first to delete files and directories using the mouse.
When deleting files and directories from the command line the computer will not usually ask
if you are sure (although there are options to turn on this behavior), and usually the files and
directories will NOT be put in the recycle bin.
To view (but not modify) the contents of a file that is too large to open with a text editor, you can
see a few lines at a time using this command:
more GoSelect.py
Pressing the space bar will progress through the file. Type q to exit and return to the command
prompt.
To concatenate large files, such as fasta files containing genome sequences, you can use the
command below. This will concatenate all of the files listed before the > together,
immediately one after another (i.e. without inserting a blank line or an end of line
character).
type Filename1.txt Filename2.txt > ConcatenatedFiles.txt
Getting help
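In most cases, you should be able to get information about a command by typing the name of
the command followed by /? (Windows does not have the man command used on
Mac/Unix/Linux). For help on the copy command type:
copy /?
The help text is printed to the screen and you are returned to the command prompt.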
You will probably find that Google is also a good source of help.
APPENDIX 2. COMMAND LINE TUTORIAL FOR MAC/UNIX/LINUX
Getting Started
Mac OSX and Linux are UNIX-based operating systems. In order to execute scripts from the
command line it is helpful to know some basic UNIX commands as well as have an
understanding of file structure. This appendix contains some of the basics for those who are
unfamiliar with working at the command line.
Let us start with an example. Suppose you want to execute the GoSelect.py script. We’ll assume
that this script has been saved in a folder called Scripts on the Desktop.
For Mac OSX, the first thing to do is open the Terminal application, which accepts and executes
commands from the user. Go to Finder > Applications > Utilities > Terminal. The location of
the Terminal application may be slightly different in Linux, but should be similar. Once open
you will see a prompt that is waiting for your command. Usually the prompt contains
information about the folder (also called directory) that you are currently in (also called the
current working directory). The prompt usually ends with a $. Note that Unix is case-sensitive.
Locations of files and directories
Before working with a file, you need to navigate to the directory where the file resides. This
should be a familiar concept as you have often navigated through directories to find files using
the mouse. This is the file structure. Each directory is specified as a certain location in the file
structure.
To see your current location, print the working directory using this command:
pwd
This will show an output like this:
/Users/YourName
This is your home directory. Your home directory actually resides within a set of directories
above this, but they are not shown because most of the time you don’t need to use them (see
below regarding the root).
Each file and directory will have a location. The full description of the location of a file is the
absolute path:
/Users/YourName/Desktop/Scripts/GoSelect.py
Assuming this is where the GoSelect.py script file is saved, you can navigate to it using the
change directory command:
cd Desktop
cd Scripts
Note: To accomplish this in one command you could have typed cd Desktop/Scripts
Tip: On most operating systems you can begin typing a directory or file name and then press
the tab button and the computer will try to guess what you are typing. If you have typed
enough letters the computer will complete the word. If you have not typed enough letters the
computer won’t be sure what you are trying to type and may be able to complete none or
only part of the word you are typing, in which case you will need to provide more characters.
Print the current working directory:
pwd
/Users/YourName/Desktop/Scripts
You can also change directories by specifying the relative path of where you want to go.
./ = current directory
../ = the parent directory of the current directory (one directory above the current directory)
/ = the root directory (or the highest directory in your computer). This location stores files used
by the operating system and some of the programs that are installed. If you are a new
user you might want to avoid going here.
If you don’t specify an absolute or relative path, the computer will assume that you mean
for it to look in the current directory.
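The same rules can be explored from Python 3 without moving any files. The sketch below uses the standard posixpath module, which applies Mac/Unix/Linux path rules; the function name resolve is hypothetical and the directory names are the ones used in this tutorial.

```python
import posixpath

# The current working directory used in this tutorial.
current = "/Users/YourName/Desktop/Scripts"


def resolve(path, cwd=current):
    """Return the location that a path refers to when typed from cwd."""
    return posixpath.normpath(posixpath.join(cwd, path))


print(resolve("GoSelect.py"))      # no path given: current directory
print(resolve("./GoSelect.py"))    # ./ is the current directory
print(resolve("../"))              # ../ is the parent directory
print(resolve("/"))                # / is the root directory
```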
To change directories back to the Desktop you could type:
cd ..
However for now, we want to be in the Scripts directory, so if you moved back to the desktop
type
cd Scripts
Tip: If you misspell a directory or file name you will get an error saying something to the
effect of “No such file or directory”
Tip: For Mac OSX, if you ever get lost within the file structure, you can simply type cd
(with nothing after it), and you will automatically change back to your home directory.
Tip: Mac OSX also has another handy way to change directories. At the terminal command
prompt type cd plus a space and then drag and drop a directory onto the command prompt.
The absolute path of the directory will be automatically written for you.
Locations of programs and scripts
When programs, such as NCBI BLAST+, are installed, the executable file for the program is
often put in your path in a location above the home directory. When you execute a program by
typing its name at the command line the computer searches all of the directories that are in your
path, and if it finds the program it will execute it. If it can’t find the program, you will get a
“command not found” or “No such file or directory” error. Unless a program has been installed in such a way that it is put
into your path, the simplest thing is to keep the program executable (including scripts) in the
same directory as the input files. If you do this, you don’t have to specify an absolute or relative
path because the computer will just look for the program and the input files in the current
directory. If you choose to put the program executable or input files in different directories you
will need to specify the absolute path or the relative path to the program/files from the current
working directory.
Working with files and directories
To see the contents of the current directory:
ls
This will probably show something like this:
GoSelect.py
MergeSeqs.py
ParseBLASTExons.py
To make a new directory (here, called Test) within the current directory:
mkdir Test
To copy a file, use the cp command. To make a copy of a file and place the copy in the current
directory type:
cp GoSelect.py GoSelect2.py
You can also copy the file to the directory Test that we created above:
cp GoSelect.py ./Test/GoSelect2.py
Tip: If you want the file that was copied to the Test directory to keep the same name you
could have just typed cp GoSelect.py ./Test/
Tip: You can also use the relative path. For example, to copy a file from the Test directory
back into the scripts directory you could type (assuming you have changed directory into the
Test directory): cp GoSelect.py ./../GoSelectTest.py
Tip: When copying files, if you specify that the name of the copied file should be the same
as the name of a file that currently exists in that location, the current file will be replaced
with the copy without asking if you want to replace it.
Go back to the Scripts directory. To move a file without copying it, you can use the move
command:
mv ParseBLASTExons.py ./Test/
You can also use the move command to rename a file (move a file to a file with a different
name in the same directory)
mv ParseBLASTWhole.py ./ParseBLASTWholeTest.py
Change to the Test directory. To delete a file use the remove command:
rm GoSelect2.py
Move back to the Scripts directory. To delete the Test directory (and all of the files within it)
type:
rm -r Test
Tip: For new users it may be safer at first to delete files and directories using the mouse.
When deleting files and directories from the command line the computer will not usually ask
if you are sure (although there are options to turn on this behavior), and usually the files and
directories will NOT be put in the recycle bin.
To view (but not modify) the contents of a file that is too large to open with a text editor, you can
see a few lines at a time using this command:
less GoSelect.py
Pressing the space bar will progress through the file. Type q to exit and return to the command
prompt.
The more command is similar in many ways, but the contents of the file that you viewed will
remain in the window after typing q to exit.
To concatenate large files, such as fasta files containing genome sequences, you can use the
command below. This will concatenate all of the files listed before the > together,
immediately one after another (i.e. without inserting a blank line or an end of line
character).
cat Filename1.txt Filename2.txt > ConcatenatedFiles.txt
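The same concatenation can be done from Python 3 if you prefer. The sketch below is an illustration only; the function name concatenate is hypothetical. Like the command above, it joins the files immediately one after another, so if the first file does not end with an end of line character, its last line will run into the first line of the second file.

```python
import shutil


def concatenate(input_names, output_name):
    """Write the contents of each input file, in order, to the output file."""
    with open(output_name, "wb") as out:
        for name in input_names:
            with open(name, "rb") as source:
                # Copies in chunks, so large genome files do not fill the memory.
                shutil.copyfileobj(source, out)
```

For example, concatenate(['Filename1.txt', 'Filename2.txt'], 'ConcatenatedFiles.txt') would give the same result as the command above.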
Getting help
In most cases, you should be able to get information about a command by typing man (for
manual) and then the name of the command. For help on the cp command type:
man cp
To exit the man page, type:
q
You will probably find that Google is also a good source of help.
Permissions
In order to read, edit, or execute a script, you need to have permission to do so. To see detailed
information about the contents of the current directory, including the permissions:
ls -l
This will show something like this:
-rwxr-xr-x 1 AW staff 4949 Dec 8 12:50 GoSelect.py
-r--r--r-- 1 AW staff 5979 Dec 7 10:11 MergeSeqs.py
The letters on the left here are the permissions. The first dash tells us that GoSelect.py and
MergeSeqs.py are not directories (if they were directories there would be a d instead of a dash).
The next letters (rwx) specify permission to read (r), write/edit (w), and execute (x). The first set
of three represents you, the user, the second set of three specifies permissions for the group, and
the final set of three specifies permissions for others. Only the owner of a file or directory may
change the permissions.
Here you can see that you have permission to read, write and execute GoSelect.py, however you
only have permission to read MergeSeqs.py. As the owner of these scripts you can change the
permissions using the chmod command.
To add permission for you, the user, to write and execute the MergeSeqs.py script:
chmod u+w+x MergeSeqs.py
To add permission for the group to write (edit) and execute the script:
chmod g+w+x MergeSeqs.py
To remove permission for the group to write/edit the script:
chmod g-w MergeSeqs.py
Typing ls -l again should give you:
-rwxr-xr-x 1 AW staff 4949 Dec 8 12:50 GoSelect.py
-rwxr-xr-- 1 AW staff 5979 Dec 7 10:11 MergeSeqs.py
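To see how each chmod command changes the permission letters, the three steps above can be traced in Python 3 using the standard stat module. This sketch only works with the permission values themselves and does not touch any file; the starting value corresponds to the -r--r--r-- shown for MergeSeqs.py.

```python
import stat

mode = 0o100444                          # a regular file with -r--r--r--
steps = [stat.filemode(mode)]

mode |= stat.S_IWUSR | stat.S_IXUSR      # chmod u+w+x
steps.append(stat.filemode(mode))

mode |= stat.S_IWGRP | stat.S_IXGRP      # chmod g+w+x
steps.append(stat.filemode(mode))

mode &= ~stat.S_IWGRP                    # chmod g-w
steps.append(stat.filemode(mode))

for step in steps:
    print(step)
```

The last line printed matches the permissions shown for MergeSeqs.py in the listing above.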
Tip: Sometimes the computer may not recognize that you are the owner of a file and return
the error “Permission denied”. In this case you may need to use the sudo command to tell the
computer that you are in charge:
sudo chmod u+w+x MergeSeqs.py
In this case you will need to provide the password that you use to sign into your computer.