ppt

advertisement
NGS Bioinformatics Workshop
1.3 Tutorial - Sequence Alignment and
Searching
March 22nd, 2012
IRMACS 10900
Facilitator: Richard Bruskiewich
Adjunct Professor, MBB
Learning Objectives
A few more Linux tips
FASTA and BLAST on the web
Local BLAST
Local installation of BLAST
Making a blast database
Running a local blast query
Parsing search results using a script
First, a few more Linux tips…
 “cat”, “more” or “less”: list contents of a (text) file
 “head” or “tail”: display the start or end of a file
 One can ‘redirect’ the contents of a program into or out of
a file:
 ‘<‘ for input (‘<<TAG‘ appends from keyboard)
 ‘>’ for output (‘>>’ appends rather than overwrites)
 TRY:
cat >myfile.txt <<EOF
This is my file!
Another line
EOF
 “more” or “less”: like “cat” but controllable
More….
 “wget”: downloads an internet file
 “which”: displays where a program is located on the system
 “mkdir”: makes a directory
 “cp”: copies files or directories
 “mv”: moves a file or directory
 “ln”: creates alias names/locations to files
 ln –s source target # -s link “symbolic”
 e.g. ln –s ncbi-blast-2.2.26+ ncbi
 “export”: exposes an environment variable, e.g.
 export BLAST=/usr/local/ncbi
Environment variables provide general configuration and global
contextual information that operating system scripts and
computer programs can read if they need to know about such
context.
File archives (and programs)
 “.gz” files: gzip, gunzip archive commands
 “.tar” files: tar command
Flags:
-x # extract
-f filename # file to extract
-v # verbose output
-z # uncompress gz file on the fly…
e.g.
 tar –xvf file.tar
 Tar –zxvf file.tar.gz
 “.bz2”: bzip2 and bunzip2 archive commands
 “.zip”: zip and unzip archive commands
FASTA Walkthrough
http://www.ebi.ac.uk/Tools/sss/fasta/
Database: Knowledgebase
(complete source for protein
information, combines other
databases), Swiss-Prot (only
manually-curated proteins)
Paste your sequence here!!!!
Program: fasta (DNA or protein
sequence against a like
database)
Matrix: scoring matrix used
(first click More options… )
Results: interactive or email
Don’t forget to push SUBMIT!
http://blast.ncbi.nlm.nih.gov
BLAST Walkthrough
Note: the program used;
nucleotide blast (blastn), protein
blast (blastp), etc. is chosen at
an earlier screen
Paste your sequence here!!!
Database: nr, swissprot, etc.
Algorithm: blastp (protein blast)
To configure blastp options,
click Algorithm parameters.
BLAST Walkthrough
Algorithm paramters
Expect value: can be changed
Matrix
Filtering: Off by default, can turn
on
Don’t forget to push BLAST!
BLAST Results
Protein domains in your
sequence
Graphical representation of
the alignment results; each
line represents an
alignment; color indicates
similarity
List of the hits from the
database you searched
against (ID, name, E-value
of top HSP)
 click on score to jump
down to textual alignment
Individual display for each
alignment (HSP)
http://www.youtube.com/watch?v=HXEpBnUbAMo&feature=related
You can also use a Script to directly do the
BLAST’ing – e.g. Biopython
Example 1:
If you have a nucleotide sequence you want to search against the
nucleotide database (nt) using BLASTN, and you know the GI
number of your query sequence, you can use:
from Bio.Blast import NCBIWWW
result_handle = NCBIWWW.qblast("blastn", "nt", "8332116")
Then, save the result…
save_file = open("my_blast.xml", "w")
save_file.write(result_handle.read())
save_file.close()
result_handle.close()
# save as XML format
More BLASTing…
Example 2:
Alternatively, if we have our query sequence already in a FASTA
formatted file, we just need to open the file and read in this
record as a string, and use that as the query argument:
from Bio.Blast import NCBIWWW
fasta_string = open("m_cold.fasta").read()
result_handle = NCBIWWW.qblast("blastn", "nt", fasta_string)
See http://biopython.org/DIST/docs/tutorial/Tutorial.html for
more sample BLAST query code (or see the equivalent sections
of other open-bio toolkits)
Locally installed BLAST - Advantages
Search many input sequences at once
Customizable databases
No need for internet access
Faster for small to medium-sized databases
Integrate BLAST searches into a larger, automated
bioinformatics analysis
Can also run local BLAST through open-bio scripts (e.g. see
http://biopython.org/DIST/docs/tutorial/Tutorial.html)
Parts of the standalone
BLAST equation
FASTA file
containing sequences
to build database to
BLAST against (NCBI or
your own file)
BLAST Programs
File downloaded from
NCBI contains:
makeblastdb
blastp
blastn
Your
Query
Sequence
blastx
tblastx
rpsblast
makeblastdb
etc…
BLAST
database
Output
Let’s try this ourselves….
We will:
Obtain and install the BLAST executables (Linux)
Set up a BLAST database
Copy an archive
Use ‘makeblastdb’ to create a novel database to search
against
Use the ‘blastp’ program to carry out a BLAST
analysis over the command-line
Output your BLAST results into a more flexible
format
Use a small BioPython script to parse the output
Step 1 – Installing BLAST tools
 Go to http://blast.ncbi.nlm.nih.gov and follow
links to latest release and follow instructions for
favorite operating system:
Windows 32/64: .exe installer
Linux 32/64: compiled binaries (RPM or tar.gz)
Other Unix: compiled binaries (in tar.gz)
 Apply platform-specific configuration details for
your operating system
 Read the good documentation:
http://www.ncbi.nlm.nih.gov/books/NBK1763/
Installing from Linux .tar.gz archive
wget
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/
blast+/LATEST/ncbi-blast-2.2.26+-x64-linux.tar.gz
tar -zxvf ncbi-blast-2.2.26+-x64-linux.tar.gz
sudo mv ncbi-blast-2.2.26+ /usr/local
cd /usr/local/
sudo ln -s ncbi-blast-2.2.26+ ncbi
export BLAST=/usr/local/ncbi
export PATH=$BLAST/bin:$PATH
Step 2 – Getting a BLAST database
Option A:
Pre-formatted databases: download archives
from NCBI: ftp://ftp.ncbi.nlm.nih.gov/blast/db/
Option B:
“Roll your own”: construct de novo from a file of
custom FASTA sequences using makeblastdb
Option A: Obtain a existing NCBI database... (Linux)
sudo mkdir $BLAST/db
wget
ftp://ftp.ncbi.nlm.nih.gov/blast/db/swissprot.tar.gz
tar -zxvf swissprot.tar.gz $BLAST/db
cd
Option B: Create a simple BLAST database
from a local FASTA sequence file
The makeblastdb application produces BLAST databases from
FASTA files. In the simplest case the FASTA definition lines are not
parsed by makeblastdb and may be completely unstructured
(but can only be BLAST’ed and not be directly retrieved)
makeblastdb -in mydb.fsa -dbtype nucl
creates a BLAST database from a nucleotide FASTA sequences
which can be put into the “db” directory for searching. Of course,
like all blast programs, there are a rich set of parameters which
can be used to customize the generation of the database (see the
BLAST manual).
With either option, still need some
configuration details
cd
# back to home directory
# need to point to the database…
cat >.ncbirc <<EOF
[BLAST]
BLASTDB=/usr/local/ncbi/db
EOF
Step 3 – Executing a BLAST operation
Command line programs (only) but
parameters are generally equivalent to (or a
superset of) the NCBI web BLAST application
Sample run:
Retrieve a sequence from the database:
blastcmd –db swissprot –entry Q9MAH0 –outform "%f" –out test_query.txt
Blast it back against the same database:
blastp –query text_query.txt –db swissprot –out output.txt –outfmt 5
Different BLAST programs
blastn – search nucleotide database using a nucleotide query
blastp – search protein database using a protein query
blastx – search protein database using a translated nucleotide
query
tblastn – search translated nucleotide database using a protein
query
tblastx – search translated nucleotide database using a translated
nucleotide query
psiblast – Position-Specific Iterated BLAST
rpsblast – Reversed Position Specific BLAST
See BLAST documentation for related utility programs
Command line parameters: statistics
-evalue: expect value, normally set to 10
-word_size: “k-tuple” size; increase for speed, decrease
for sensitivity
-gapopen: cost to open a gap; increase for stringency
-gapextend: cost to extend a gap; increase for stringency
-matrix: substitution scoring matrix (default BLOSUM62);
change if sequences too related or too distant
To get more information use option “-help”
Command line parameters: input/output
-query in.txt: specify input file
-out out.txt: specify output file
-db nr: which database (created with makeblastdb)
-dust yes/no: filter low complexity regions in nucleotide
sequence search yes/no (default is yes)
-seg yes/no: filter low complexity regions in protein sequence
search yes/no (default is no)
-html: format output as HTML
-outfmt: specify output format, e.g. 5 = XML blast output
(use –help flag to see other options)
Additional useful program options
Depending on program:
-num_threads: use multiple CPUs (speeds up search)
-subject: specify a second input sequence instead of a database (former
‘bl2seq’)
-task megablast: much faster for highly similar nucleotide sequences
-task blastn_short: find similar short sequences (e.g. primer sequences)
Step 4 – Parse the output
If you just have one query sequence, simply
view the BLAST text file
If you are doing a lot of queries on the
database and looking for “best hits”, you may
wish to use a parsing script (e.g. biopython or
equivalent)
Parsing BLAST output with Biopython
 Good to use the BLAST XML format for this…
result_handle = open("my_blast.xml")
 Now that we’ve got a handle, we are ready to parse the
output. The code to parse it is really quite small.
1. If you expect a single BLAST result (i.e. you used a
single query):
from Bio.Blast import NCBIXML
blast_record = NCBIXML.read(result_handle)
More parsing…
2. or, if you have lots of results (i.e. multiple
query sequences):
from Bio.Blast import NCBIXML
blast_records = NCBIXML.parse(result_handle)
for blast_record in blast_records:
... # Do something with blast_record
What’s in a BLAST record?
E_VALUE_THRESH = 0.04
for alignment in blast_record.alignments:
for hsp in alignment.hsps:
if hsp.expect < E_VALUE_THRESH:
print '****Alignment****‘
print 'sequence:', alignment.title
print 'length:', alignment.length
print 'e value:', hsp.expect
print hsp.query[0:75] + '...‘
print hsp.match[0:75] + '...‘
print hsp.sbjct[0:75] + '...'
Gives output something like this…
****Alignment****
sequence: >gb|AF283004.1|AF283004 Arabidopsis thaliana cold acclimation
protein WCOR413-like protein
alpha form mRNA, complete cds
length: 783
e value: 0.034
tacttgttgatattggatcgaacaaactggagaaccaacatgctcacgtcacttttagtcccttacatattcctc...
||||||||| | ||||||||||| || |||| || || |||||||| |||||| | | |||||||| | ||| ||...
tacttgttggtgttggatcgaaccaattggaagacgaatatgctcacatcacttctcattccttacatcttcttc...
Again, see http://biopython.org/DIST/docs/tutorial/Tutorial.html
for more details...
Download