UNIX and Perl

Lecture 2
Matt Hudson
Review
• Unix is text based:
doesn’t waste computer resources on graphics
allows you to write and use scripts easily
makes remote access easy
don’t have to learn “where everything is”
gives the user more power
Review
• When navigating file systems, it is important to
remember the directory structure and the
commands cd, ls and pwd.
• You must be very wary of creating multiple files
with the same name, as it is easy to over-write
an existing, important file
• There is no undelete or trash basket in UNIX –
delete or overwrite a file and it is gone
Review
• Edit text files with nano. The standard format
for sequences in text-based bioinformatics
applications is called fasta:
Name, or ID, of sequence
>sequence id|more info|yet more info
ACCCGTGAGTAGTGTGTGCTCATCTCGTAGCATCGATCGAAATCGTCATCG
CAGATGTCGTGCTAGCTGTGCATGCATCGATCGATCGATGTAGCTAGTAGT
CAGTAGTAGTCGTAGATGTCGTCGATGCTAGTAGTGCTGCTGTGTGCTGTC
CCACTAGCTGCATCGATG
Sequence itself, in one or many
lines
Review
• The blastall command takes the following obligatory
arguments
-p <program name, e.g. blastn, blastp>
-i <input file name>
And these arguments are also very important
-d <database name>
-e <E-value cutoff>
-M <matrix name>
Using your UNIX skills
• Now we’re going to use our new UNIX
skills in anger with some more
sophisticated bioinformatics programs.
• It’s a good idea to make an “experiment” or
“scratch” folder
• Use this as a place for stuff that might
explode….
Bioinformatics programs
• blastall:
blastp
blastn
blastx
bl2seq… etc
• HMMER
hmmpfam
hmmsearch
hmmbuild
• clustalw
• fasta34
The background
• If you are running a program that takes a
long time, especially if redirecting output to
a file, put it in the background, and you
can keep working.
Either put & at the end of the command
Or stop (ctrl-Z) then bg %.
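The & approach can be sketched as follows. This is only an illustration: slow_job is a hypothetical stand-in for a long-running program such as blastall, and the output file is a temporary file created just for the demonstration.

```shell
# slow_job is a hypothetical stand-in for a long-running program.
slow_job() {
    sleep 1
    echo "job finished"
}

outfile=$(mktemp)               # temporary file to hold the redirected output

slow_job > "$outfile" 2>&1 &    # the trailing & puts the job in the background
job_pid=$!                      # $! holds the process ID of that background job

echo "job $job_pid is running; the prompt is free for other work"

wait "$job_pid"                 # block here until the background job has ended
cat "$outfile"                  # the output went to the file, not the screen
```

At an interactive prompt you would normally skip the wait and simply carry on working; the jobs command lists what is still running.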
Running overnight
• If you are running a program overnight,
use qsub from the head node to control
the program command – this way the
command will keep running when you exit
the shell.
• Don’t forget to redirect output to a file
when doing this (you can still get the output
otherwise, but it can be hard to find).
Viewing running processes
• You can see all the processes on the
system, ranked by how much memory and
CPU time they are using.
[user@server ~]$ top
Fasta format
• The standard format for nucleotide and
protein sequence is fasta, named after the
program. It is very easy to read and write
manually or with a program:
Name of sequence
>sequence id|more info|yet more info
ACCCGTGAGTAGTGTGTGCTCATCTCGTAGCATCGATCGAAATCGTCATCG
CAGATGTCGTGCTAGCTGTGCATGCATCGATCGATCGATGTAGCTAGTAGT
CAGTAGTAGTCGTAGATGTCGTCGATGCTAGTAGTGCTGCTGTGTGCTGTC
CCACTAGCTGCATCGATG
Sequence itself, in one or many
lines
Multiple fasta format
>sequence id|more info|yet more info
ACCCGTGAGTAGTGTGTGCTCATCTCGTAGCATCGATCGAAATCGTCATCG
CAGATGTCGTGCTAGCTGTGCATGCATCGATCGATCGATGTAGCTAGTAGT
CAGTAGTAGTCGTAGATGTCGTCGATGCTAGTAGTGCTGCTGTGTGCTGTC
CCACTAGCTGCATCGATG
>sequence 2 id|more info|yet more info
ACCCGTGAGTAGTGTGTGCTCATCTCGTAGCATCGATCGAAATCGTCATCG
CAGATGTCGTGCTAGCTGTGCATGCATCGATCGATCGATGTAGCTAGTAGT
CAGTAGTAGTCGTAGATGTCGTCGATGCTAGTAGTGCTGCTGTGTGCTGTC
CCACTAGCTGCATCGATG
>sequence 3 id|more info|yet more info
ACCCGTGAGTAGTGTGTGCTCATCTCGTAGCATCGATCGAAATCGTCATCG
CAGATGTCGTGCTAGCTGTGCATGCATCGATCGATCGATGTAGCTAGTAGT
CAGTAGTAGTCGTAGATGTCGTCGATGCTAGTAGTGCTGCTGTGTGCTGTC
CCACTAGCTGCATCGATG
Multiple fasta format
• Multiple sequence formats are essential for
doing batch or high-throughput work
• Most bioinformatics programs accept
multiple sequences in this format
• Some websites still accept it, but most have
stopped because batch submissions use too
many resources.
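As a small illustration of batch work on a multiple fasta file, the sketch below prints the name and length of every sequence using awk. The function name fasta_lengths and the demo file are assumptions made for this example; they are not part of any bioinformatics package.

```shell
# fasta_lengths FILE: print "name<TAB>length" for each record in a fasta file.
fasta_lengths() {
    awk '
        /^>/ {                            # a header line starts a new record
            if (name != "") print name "\t" len
            name = substr($0, 2)          # drop the leading ">"
            len = 0
            next
        }
        { len += length($0) }             # sequence line: add its length
        END { if (name != "") print name "\t" len }
    ' "$1"
}

# Demo on a tiny two-sequence file (made-up data).
demo=$(mktemp)
printf '>seq1|demo\nACGTAC\nGT\n>seq2|demo\nGGCC\n' > "$demo"
fasta_lengths "$demo"
```

Because the sequence may be split over one or many lines, the lengths are summed line by line until the next > is seen.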
DNA sequence output from ABI 377 (a gel-based sequencer)
1. Trace files (dye signals) are analyzed and
bases called to create chromatograms.
2. Chromatograms from opposite strands are
reconciled with software to create double-stranded sequence data.
Quality and phred
• When manually interpreting Sanger
sequence, you interpret the quality of the
base intuitively.
• Phil Green’s program “phred” made
genome sequencing possible by doing this
mathematically.
Fasta + quality
• This is the standard output of Sanger
platforms – two files
>sequence id|more info|yet more info
ACCCGTGA
>sequence id|more info|yet more info
9 13 20 24 26 30 29 30
• It’s great and easy to read, but takes up a
lot of disk space (at least 4 bytes / base)
fastq
• Much less fun to read, but only two bytes
per base and only one file.
@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
• Encoding varies. Illumina 1.3+ format can
encode a Phred quality score from 0 to 62
using ASCII 64 to 126
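Because encoding varies, it can help to convert a quality character to a number by hand. The sketch below assumes the two common offsets: 33 for Sanger-style files (the SRR001666 read above must be one of these, since "9" is below ASCII 64) and 64 for Illumina 1.3+. The function name phred_score is made up for this illustration.

```shell
# phred_score CHAR OFFSET: convert one quality character to a Phred score.
phred_score() {
    ascii=$(printf '%d' "'$1")    # a leading quote makes printf give the character code
    echo $((ascii - $2))
}

phred_score I 33    # "I" is ASCII 73, so offset 33 gives 40
phred_score 9 33    # "9" is ASCII 57, so offset 33 gives 24
phred_score h 64    # "h" is ASCII 104, so offset 64 gives 40
```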
sam and bam
• These formats originated as ways to store short reads
aligned to a reference genome
• Now a lot of Illumina raw data also comes as bam
• The sam format is tab-delimited text that is possible to
process in Perl, but bam is binary and difficult to handle
with Perl.
• The samtools utility (installed on biocluster) can convert
between sam, bam, fastq and some other formats.
The BLAST command
• The blast command is blastall
• you need to tell it what program to use,
what database, and what input file. Many
other options are available.
• e.g.
[user@server ~]$ blastall -p myprogram -i myfile.txt -d mydatabase
genomics programs
• Blast:
formatdb
blastall:
blastp
blastn
blastx
bl2seq… etc
• novoalign
novoindex
novoalign
• bowtie
bowtie-build
bowtie
Noticing a pattern?
All of these programs
generate an “index” file
which speeds up searching
of the query (short) against
the “database” (big)
Index files
Making your own database
• You can use formatdb to make your
own BLAST database
[user@server ~]$ wget ftp://ftp.ncbi.nih.gov/blast/db/swissprot.tar.gz
[user@server ~]$ tar -xzvf swissprot.tar.gz
[user@server ~]$ formatdb -i swissprot -p T
[user@server ~]$ blastall -p blastp -i exampleprotein.txt -d swissprot
Hidden Markov models
• The HMMER package:
hmmpfam Search for matches to the pfam database
hmmbuild Make your own model
hmmsearch Search a protein file with your model
Let’s try it. But not all at once – very compute intensive.
[user@server ~]$ nice -n 10 hmmpfam /home/bio/db/pfam testprotein.txt
clustalw
• The most commonly used alignment
program.
• Try aligning the proteins in “testprotein.txt”
• You can use the alignment for
phylogenetic analysis, or to create a
hidden Markov model.
Making an HMM
• Using the .aln file from your clustalw
output:
[user@server ~]$ hmmbuild myhmm testprotein.aln
This creates a model of your alignment that you can use to
search for sequences that belong in it.
Searching with the HMM
[user@server ~]$ hmmsearch myhmm testprotein.txt
• In this example we’re searching against the file we used to
make the hmm.
• But you could search against the whole of genbank or
swissprot, or against a whole genome, to find proteins with
structural similarity to yours.
Fasta
• The fasta34 program is also installed on
mrmarsh
• Fasta34 is an alternative to BLAST that is
slower, but provides more accurate output, and
can use any fasta format file as a database.
• Try searching exampleprotein.txt against
testprotein.txt
[user@server ~]$ fasta34
Getting information from output
files
• Often these are huge text files
• grep is a great tool for getting at the nitty-gritty.
• awk is more powerful, but mostly involves
writing scripts, and has been largely
superseded by Perl.
grep
• My favorite of all UNIX commands
• “global regular expression print”
• Allows you to pick out lines of a text file
that match a query, count them, and
retrieve lines around the match.
grep - continued
grep 'Query=' myblast.txt
What sequences did I BLAST?
grep -c '>' testprotein.txt
How many sequences are in this file?
grep -A 10 '>' testprotein.txt
Give me each header line plus the ten lines that follow it
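To try these options without a real BLAST run, the sketch below builds a tiny fasta file and applies the same grep commands to it. The file contents are made up for illustration, and the pattern is anchored with ^ so that only lines starting with > are matched.

```shell
demo=$(mktemp)
cat > "$demo" <<'EOF'
>prot1|demo
MKTAYIAKQR
QISFVKSHFS
>prot2|demo
MSDNELLQKA
EOF

grep -c '^>' "$demo"        # count the header lines, i.e. the sequences
grep '^>' "$demo"           # list the sequence names
grep -A 1 '^>' "$demo"      # each header plus the one line after it
```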
Getting files from remote servers
• Before there was the world wide web,
there was ftp.
• Note that WORKER NODES ARE NOT
CONNECTED TO THE INTERNET, so do
network stuff from the head node.
[user@server ~]$ ftp ftp.ncbi.nih.gov
ftp commands
• open    open a connection
• ls      same as UNIX
• cd      same as UNIX
• get     get me this file
• mget    get more than one file
• put     put a file on the server
• lcd     local cd
• close   close connection
• bye     exit the ftp program
Secure ftp
• Although NCBI allows you to connect using
ftp, this is because they have only public
files, and they don’t let you upload anything.
• Most UNIX computers disallow ftp logins.
However, if you can ssh to a computer, you
can also use sftp. The commands are
identical to ftp, but you can access your own
files securely.
wget
• But what if you want to get a file which is
available for download from a website, but
not by ftp?
• wget will get the contents of any URL and
put them in a file.
[user@server ~]$ wget www.cnn.com
How do I write a script, then?
[user@server ~]$ nano myscript.pl
So what’s a program look
like?
All programming courses traditionally start
with a program that prints “Hello, world!”. So
in keeping with that tradition:
#!/usr/bin/perl
print "Hello, world\n";
Note:
No line numbers.
Each statement ends with a semicolon.
Exit and run
Control-O to save, then Control-X to exit nano.
[user@server ~]$ perl myscript.pl
Or, if you’re feeling fancy
[user@server ~]$ chmod 755 myscript.pl
[user@server ~]$ ./myscript.pl
The chmod command makes the file EXECUTABLE.
The numbers after chmod are octal
numbers, but don’t worry too much about
that.
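For the curious, each octal digit is a sum of 4 (read), 2 (write) and 1 (execute), given in turn for the owner, the group and everyone else. So 755 means the owner can read, write and execute, while others can only read and execute. A minimal sketch on a throwaway file (a temporary stand-in, not myscript.pl):

```shell
script=$(mktemp)                        # throwaway file standing in for a script
printf '#!/bin/sh\necho "Hello, world"\n' > "$script"

chmod 755 "$script"                     # 7 = 4+2+1 (rwx), 5 = 4+1 (r-x)
mode=$(ls -l "$script" | cut -c1-10)    # first ten characters are the permissions
echo "$mode"
```

The permission string printed by ls -l reads as file type, then three letters each for owner, group and others.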
GNU
• GNU stands for “GNU’s Not Unix”
• This is what computer people think is a joke
• The reason for GNU is that the name “UNIX” was
owned by private companies. GNU exists to make
free software for UNIX, but couldn’t use the name
• Linux systems combine the Linux kernel with GNU software
• Not only is GNU software free, but the instructions
the software was built from (the “source
code”) are also public and easily downloaded.
Downloading programs
• “ready to run” programs are called
binaries in unix-speak.
• They are often “zipped” in a .tar.gz file.
• To unzip, use gunzip and tar -xvf
• To run, specify the path to the program.
E.g., ./program or /home/matt/bin/program
• You can download programs for UNIX just
as you would for a PC
Bowtie
• Bowtie is a more heavily indexed search
program, which requires a more exact
match than BLAST. It is much, much
faster.
• See if you can figure this out.
• Use bowtie-build to build the database,
bowtie to search it…
Your path
• To see your path, type echo $PATH
• If you are bored with typing the full path to
programs, you can put them in your path.
• E.g.
mkdir ~/bin/
mv program ~/bin/
export PATH=$PATH:~/bin
program
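The same steps can be tried safely with a made-up program in a temporary directory. In this sketch, myprogram and the directory are hypothetical stand-ins for a real binary and ~/bin.

```shell
demo_bin=$(mktemp -d)                   # stands in for ~/bin
printf '#!/bin/sh\necho "hello from myprogram"\n' > "$demo_bin/myprogram"
chmod 755 "$demo_bin/myprogram"         # must be executable to be run by name

PATH=$PATH:$demo_bin                    # append the directory to the search path
export PATH

command -v myprogram                    # shows where the shell found it
myprogram                               # no path needed any more
```

The change only lasts for the current shell session. With bash, a common way to make it permanent is to put the export line in a startup file such as ~/.bashrc.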
Source Code
• Most bioinformatics software is free, and
open source. That is, you can download
the actual instructions the programmer
wrote.
• This is great, because it means you can
install these programs on almost any
machine.
• If somebody asks you for money for
bioinformatics software, DON’T DO IT!
The GNU install pragma
• GNU source code can be complicated to compile,
so it comes with programs to help you. There is a
standard way to build and install GNU software.
[user@server ~]$ gunzip program.tar.gz
[user@server ~]$ tar -xvf program.tar
[user@server ~]$ cd program
[user@server ~]$ ./configure
[user@server ~]$ make
[user@server ~]$ make install
The root user
• Most UNIX machines have an account called
“root”
• root can see everything, change everything,
delete everything, including other users’ work
• Unless you buy your own machine, nobody sane
will give you root access
• You usually need root access to install programs
in the default location. But you can put them in
your home directory instead.
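For GNU-style packages, a common way to install into your home directory is the --prefix option of ./configure. The sketch below wraps the steps in a function. The function name and the $HOME/local directory are assumptions chosen for illustration, and it only works from inside an unpacked source directory that has a configure script.

```shell
# install_in_home: build the package in the current directory and
# install it under $HOME/local instead of the system-wide default.
install_in_home() {
    prefix="$HOME/local"                # assumed install location
    # --prefix tells the build where "make install" should put the files,
    # so no root access is needed when it points into your home directory.
    ./configure --prefix="$prefix" &&
    make &&
    make install &&
    echo "installed under $prefix; add $prefix/bin to your PATH"
}
```

You would call install_in_home after cd program, in place of the plain ./configure, make, make install sequence.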
UNIX summary
• Use ls, cd, mv, cp, nano and friends to deal with files and
directories
• Install, or compile, any program you like. Most are free.
• Use blastall, hmmer etc. on the command line for
high-throughput work. Redirect the output to a file for best
results and run in the background. Grep the output file to
get pertinent information….
• Or process it with a Perl script