Getting started with UNIX
for bioinformatics
Matt Hudson
Dept of Crop Sciences
Command lines
• You’ll all have interacted with computer operating
systems such as Windows and MacOS.
• These have graphical user interfaces (GUIs): programs that let you
interact with the computer intuitively. If you want to get closer
to the computer system, you need to use a
command line.
Command lines
• Although the old Mac System 7 did not have one,
all modern operating systems have one: Windows
has a DOS-style command prompt, MacOSX has a form of BSD UNIX, and
Linux has, well, Linux.
• The command line is much more central to your work in Linux or another
form of ‘true’ UNIX, like Solaris, HP-UX or Tru64, than it is in Windows or MacOS.
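• (There are standards for UNIX called the POSIX
standards. MacOSX has been certified against the related Single UNIX Specification since version 10.5; Linux follows POSIX closely but is mostly not formally certified.)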
Commands
• The first word you type on a command line
names a program to run; any further words are passed to it as options and arguments. So it is easy to add your
own commands: just add, or write, another
program.
• The output of the program is returned to the
terminal unless you say otherwise. So all
your interaction is through one text window.
This makes it easy to log in remotely.
UNIX Commands
• Why use UNIX?
- designed for lots of small programs
- can link programs together
- doesn’t waste computer resources on
graphics
- gives the user much more power
• Unfortunately, this comes at the expense of being user-friendly. But once you learn it properly you
won’t want to go back.
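The "link programs together" point is easiest to see with a pipe. This is only a small sketch with a made-up DNA string: three small programs are chained with | so that the output of one becomes the input of the next.

```shell
# Count how often each base occurs in a short, made-up DNA string.
# fold -w 1 puts each character on its own line,
# sort groups identical lines together,
# uniq -c counts how many lines are in each group.
counts=$(printf 'ACGTAACC\n' | fold -w 1 | sort | uniq -c)
echo "$counts"
```

None of these programs knows anything about DNA; the pipe is what turns them into a base counter.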
Your UNIX course
• You each have a Windows computer that can
log into our UNIX cluster via a “terminal”
program called PuTTY.
• This class has a bioinformatics cluster, a
common way to do computationally intensive
work. You can log in remotely with your user
account from any machine on campus,
including Windows or Mac. On Windows you need
PuTTY or another SSH client installed; Mac has
one built in.
Logging in to the cluster
• The class cluster is in a cooled server room in
the IGB basement.
• http://biocluster.igb.illinois.edu/ganglia/
• Most UNIX computing is done remotely like this.
• There are 30 “class” accounts. Sign up for one
of them on the sheet.
Structure of a cluster
[Diagram: the user’s client, with an internet connection, reaches the Head Node via the internet. The Head Node controls jobs and is linked to compute nodes 01 to 06 over a private, passwordless network.]
Logging onto the cluster
• Log in to the head node through PuTTY:
Server: biocluster.igb.illinois.edu
Username: from sheet, e.g. cpsc565_stud28
Password: see sheet
Note: if you use MacOSX or Linux, type this in a terminal window instead:
ssh -X username@biocluster.igb.illinois.edu
Hit “return” in answer to all the first
login questions
Unix Files
• Unix is CaSe SeNsiTivE! Like you!
• UNIX filenames should contain only letters, numbers, and the
_ (underscore), . (dot), and - (dash) characters. NO
SPACES! Underanycircumstances! (UNIX will accept other characters, but they cause trouble on the command line.)
• The extension (e.g. .txt, .fasta) can be any number of
letters and is optional. It’s for your own convenience so
you know what kind of file is what.
• You can only have one file with a given name in
a directory. If you create or copy another one with the same name, the old
file is overwritten without warning. This is VERY easy to do.
Second command
• Once you have logged in to the server,
you are automatically on the head node.
Log in to a compute node:
[cpscstu@biocluster ~]$ qsub -IX -q classroom
– Log into a compute node before you do anything
computationally intensive
– Jobs should not be run on the head node
Working with Directories
• Directories organize files on a Unix
computer.
– They are equivalent to folders in Windows and
Mac, except that you should not put a space in their names.
– The list of directories that locates a file is
called a PATH (e.g. /home/matt/drivel.txt is the
FULL PATH to the file drivel.txt).
• Understanding directories is vital.
Directory commands
cd – change current directory
mkdir – make a directory
rmdir – remove (delete) a directory
pwd – print working directory
(= where am I?)
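A short session using all four commands might look like the sketch below. The directory name project1 is made up, and everything happens inside a scratch directory created with mktemp so that nothing in your real home directory is touched.

```shell
# Work inside a throwaway scratch directory.
scratch=$(mktemp -d)
cd "$scratch"

mkdir project1     # make a directory
cd project1        # move into it
here=$(pwd)        # pwd reports the full path of the current directory
echo "$here"

cd ..              # go back up one level
rmdir project1     # rmdir only removes EMPTY directories
```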
Typical UNIX directory structure
/            pronounced ‘slash’ or ‘root’.
/bin         = where the programs live. Don’t mess
/lib         = programming libraries. Ignore
/etc         = admin stuff. Ignore.
/usr         = more programs, not user files. Don’t mess
/mnt         = ‘mount point’ for floppies, cd roms etc. If you put a cd rom in, it is in /mnt/cdrom
/tmp         = temporary files. Ignore.
/var         = more temporary files. Ignore.
/home
/home/matt   where ALL my files are
/home/fred
/home/jane   where Jane’s files are. I can’t see them unless she lets me.
A UNIX workstation is usually set up like this; cygwin and MacOSX are different.
Your Home Directory
• When you log in to any UNIX computer, you start
off in your own home directory
• This is your home – keep it tidy. Create subdirectories to store specific projects or groups of
information.
• Don’t accumulate hundreds of files in your home
directory
File & Directory Commands
• This is a minimal list of Unix commands
that you need for file management:
ls (list)
cp (copy)
mv (move)
rm (remove, i.e. delete a file)
nano (a text editor that lets you edit any text file)
gunzip and unzip (extract compressed files)
tar (extract files in an ‘archive’)
man (help)
All of these commands can be modified with many
options. To see the options, use man (e.g. $ man cp).
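As a rough sketch of cp, mv and rm in action (the file names are invented for illustration, and the work is done in a scratch directory):

```shell
work=$(mktemp -d)
cd "$work"

printf '>seq1\nACGT\n' > first.fasta   # create a small file to play with
cp first.fasta backup.fasta            # copy: both files now exist
mv first.fasta renamed.fasta           # move/rename: first.fasta is gone
rm backup.fasta                        # delete: there is no recycle bin
ls                                     # only renamed.fasta is left
```

Note that mv is also how you rename a file, and that rm does not ask for confirmation.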
List my files
• You can list the files in your current directory
with
[user@server ~]$ ls
• You can modify a command with Options.
• Try:
ls -F
ls -a
ls -l
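To see what each option changes, here is a small sketch run in a scratch directory containing one directory, one ordinary file and one hidden file (all names made up):

```shell
demo=$(mktemp -d)
cd "$demo"
mkdir results
touch notes.txt .hidden

plain=$(ls)       # names only; files starting with a dot are not shown
all=$(ls -a)      # -a also shows dot files, plus . and ..
marked=$(ls -F)   # -F appends / to directory names
echo "$plain"
echo "$all"
echo "$marked"
ls -l             # -l gives a long listing: permissions, owner, size, date
```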
Help! I messed up!
• If you accidentally start something that will take
forever, hit control + z
• This will stop the process but not make it go
away entirely.
• Restart it with
[user@server ~]$ fg %1
• Kill it forever with
[user@server ~]$ kill %1
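Control + z, fg and %1 are features of an interactive terminal. Inside a script, roughly the same thing can be done by sending signals to a process ID. This is a sketch under that assumption, with sleep standing in for a long-running job:

```shell
sleep 60 &          # stand-in for a slow job, started in the background
pid=$!              # $! holds the process ID of that job

kill -STOP "$pid"   # pause it (control + z sends the similar TSTP signal)
kill -CONT "$pid"   # let it carry on (fg also sends CONT)
kill "$pid"         # terminate it for good

# Collect the exit status; it is non-zero because the job was killed.
wait "$pid" 2>/dev/null || true
```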
Edit a file
• There are also many text editors in UNIX.
• These are ways to edit a file via the
terminal. Many of them are very old, and
very cranky. But UNIX buffs still love them.
• Let’s make a text file. I recommend you
use:
[user@server ~]$ nano text.txt
The nano editor
GNU nano 1.2.1    File: test
This is text in a file. I edit it just like in Word.
All you really need to know is:
1) ^ means control
2) $ means “this line is longer than the screen” – useful for DNA sequence files.
3) If you type
[user@server ~]$ nano myfile
where myfile already exists it will edit it. If not, it will make a new file.
[ Read 1 line ]
^G Get Help  ^O WriteOut  ^R Read File  ^Y Prev Page  ^K Cut Text   ^C Cur Pos
^X Exit      ^J Justify   ^W Where Is   ^V Next Page  ^U UnCut Txt  ^T To Spell
Other ways to view files
These can be very useful. Try them out:
[user@server ~]$ more text.txt
[user@server ~]$ less text.txt
[user@server ~]$ head text.txt
[user@server ~]$ tail text.txt
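more and less are interactive pagers, so they are best tried at the terminal. head and tail can be sketched non-interactively; here a made-up file of 100 numbered lines stands in for a long text file:

```shell
view=$(mktemp -d)
cd "$view"
seq 1 100 > numbers.txt          # stand-in for a long text file

first=$(head -n 3 numbers.txt)   # first 3 lines (the default is 10)
last=$(tail -n 2 numbers.txt)    # last 2 lines
echo "$first"
echo "$last"
```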
The gedit graphical editor
• In a terminal window on your desktop,
while logged into the head node, type
gedit &
• Now try the same thing while logged into a
biocluster worker node.
• You are now running gedit on the cluster,
and can see and edit files on a remote
server. This is called SSH tunnelling (here, X11 forwarding, which is what ssh -X turns on).
Shortcuts
There are several shortcuts in Unix for specifying directories
They don’t make much sense. You just need to learn them.
. (dot) means "the working directory" – the one you’re in. So
[user@server ~]$ cd .
does absolutely nothing.
.. means "the parent directory" - the directory one level above
the working directory. So
[user@server ~]$ cd ..
will move you up (towards /) one level
~ (tilde) means your Home directory, so
[user@server ~/work]$ cd ~
will take you home.
Note that your current directory is IN THE PROMPT
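The three shortcuts can be checked with pwd. A minimal sketch, using a made-up work/data directory inside a scratch area:

```shell
top=$(mktemp -d)
mkdir -p "$top/work/data"
cd "$top/work/data"
before=$(pwd)

cd .             # . is the directory you are already in, so nothing changes
same=$(pwd)

cd ..            # .. is the parent directory: you are now in .../work
parent=$(pwd)

echo "$before"
echo "$parent"
echo ~           # the shell replaces ~ with the path of your home directory
```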
Bioinformatics commands
• No bioinformatics programs come with UNIX
• Most biology department servers have them
installed already. But you should probably know
how to do it yourself
• It is pretty much the same as installing any other
program on UNIX – except you need to keep in
mind the requirements for disk space and
memory.
Disk space and memory
• These are different things.
• Disk space is the amount of free space for data on your
hard disk drive.
• Memory is the amount of RAM installed in the computer.
• Both of these are critical for many bioinformatics
applications. For example, BLAST databases can be very
large and take up a lot of disk space, and in order to
search through them, the BLAST program needs to load a
lot of data into RAM.
Running programs
• To run a program with a command, it needs to be either in a directory listed in your PATH, or you need to specify the
path to it.
• On mrmarsh, all this is done for you, but you might need to understand how to set it up on
another machine.
• E.g. say I have the blastall binary in my home directory. I could run blastall with either of
the following:
[user@server /var/tmp]$ cd /home/matt
[user@server ~]$ ./blastall
• Or, from any directory,
[user@server /var/tmp]$ /home/matt/blastall
• Or, I can install it and put it in my path:
[user@server ~]$ mkdir -p /home/matt/bin
[user@server ~]$ mv blastall /home/matt/bin
[user@server ~]$ export PATH=$PATH:/home/matt/bin
[user@server ~]$ blastall
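The same idea can be tried without blastall. In this sketch, hello_tool is a hypothetical two-line script and a scratch directory stands in for /home/matt/bin; it is run first by full path, then by name once its directory is on the PATH.

```shell
mybin=$(mktemp -d)    # stand-in for /home/matt/bin

# Create a tiny program and mark it as executable.
printf '#!/bin/sh\necho "hello from hello_tool"\n' > "$mybin/hello_tool"
chmod +x "$mybin/hello_tool"

"$mybin/hello_tool"               # 1) run it by giving the full path

PATH=$PATH:$mybin                 # 2) add its directory to the search path
export PATH
found=$(command -v hello_tool)    # where the shell now finds the command
out=$(hello_tool)                 # run it by name alone
echo "$found"
echo "$out"
```

The export only lasts for the current login session; to make it permanent it would usually go in a shell startup file.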
Fasta format
• The standard format for nucleotide and
protein sequence is fasta, named after the FASTA
program. It is very easy to read and write
manually or with a program:
Name of sequence (the line starting with >):
>sequence id|more info|yet more info
Sequence itself, in one or many lines:
ACCCGTGAGTAGTGTGTGCTCATCTCGTAGCATCGATCGAAATCGTCATCG
CAGATGTCGTGCTAGCTGTGCATGCATCGATCGATCGATGTAGCTAGTAGT
CAGTAGTAGTCGTAGATGTCGTCGATGCTAGTAGTGCTGCTGTGTGCTGTC
CCACTAGCTGCATCGATG
Multiple fasta format
>sequence id|more info|yet more info
ACCCGTGAGTAGTGTGTGCTCATCTCGTAGCATCGATCGAAATCGTCATCG
CAGATGTCGTGCTAGCTGTGCATGCATCGATCGATCGATGTAGCTAGTAGT
CAGTAGTAGTCGTAGATGTCGTCGATGCTAGTAGTGCTGCTGTGTGCTGTC
CCACTAGCTGCATCGATG
>sequence 2 id|more info|yet more info
ACCCGTGAGTAGTGTGTGCTCATCTCGTAGCATCGATCGAAATCGTCATCG
CAGATGTCGTGCTAGCTGTGCATGCATCGATCGATCGATGTAGCTAGTAGT
CAGTAGTAGTCGTAGATGTCGTCGATGCTAGTAGTGCTGCTGTGTGCTGTC
CCACTAGCTGCATCGATG
>sequence 3 id|more info|yet more info
ACCCGTGAGTAGTGTGTGCTCATCTCGTAGCATCGATCGAAATCGTCATCG
CAGATGTCGTGCTAGCTGTGCATGCATCGATCGATCGATGTAGCTAGTAGT
CAGTAGTAGTCGTAGATGTCGTCGATGCTAGTAGTGCTGCTGTGTGCTGTC
CCACTAGCTGCATCGATG
Multiple fasta format
• This is the BIG TRICK to doing batch or
high-throughput work
• Almost all bioinformatics programs accept
multiple sequences in this format
• Some websites still do, but most have
stopped accepting it because people used too
many resources.
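Because every record starts with a > line, ordinary UNIX tools can already do useful things with a multiple fasta file. A small sketch, using an invented three-sequence file (the ids and sequences are for illustration only):

```shell
fa=$(mktemp -d)
cd "$fa"

# Build a small multiple fasta file.
cat > example.fasta <<'EOF'
>seq1|demo
ACCCGTGAGTAGTGTG
TGCTCATC
>seq2|demo
CAGATGTCGTGC
>seq3|demo
CCACTAGCTGCATCGATG
EOF

# Count the records: one header line per sequence.
nseq=$(grep -c '^>' example.fasta)

# List the ids: drop the leading > and keep the text before the first |.
ids=$(grep '^>' example.fasta | cut -c 2- | cut -d '|' -f 1)

echo "$nseq sequences"
echo "$ids"
```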
Example files
• You have example files in your home
directory
[user@server ~]$ ls -l
[user@server ~]$ nano exampleprotein.txt
BLAST
• We’re going to concentrate on BLAST as an
example of a bioinformatics application that is
widely used.
• It’s probably the most widely used of all
bioinformatics programs
• Unfortunately, the databases need so much disk
space and memory that you need to learn “nice”.
• What is more powerful is the ability to create your
own databases.
The BLAST command
• The blast command is blastall
• You need to tell it what program to use,
what database, and what input file. Many
other options are available.
• e.g.
[user@server ~]$ blastall -p myprogram -i myfile.txt -d mydatabase
Being “nice”
• blastall takes a lot of resources.
• So that more important jobs take
precedence (i.e. other people can still read
their terminals) you can use “nice”.
[user@server ~]$ nice -n 10 blastall -p etc.
• This is important on a shared machine, not so
much on the cluster.
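You can see what nice does without running BLAST. This sketch assumes a nice that prints the current "niceness" when given no command (GNU coreutils does); a higher number means a lower priority.

```shell
base=$(nice)                 # with no command, nice prints the current niceness
lowered=$(nice -n 10 nice)   # run nice itself with the niceness raised by 10
echo "normal: $base, niced: $lowered"
```

On most systems the values run from -20 (greediest) to 19 (most polite), and ordinary users can only raise the number.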
Fun with Modules
• The biocluster uses 'modules' to systematically organize,
version control, and load software and libraries.
• Try the command ' module ' to see all of your available
options with the tool.
• Try the command ' module avail ' to see all of the
modules available on the server. Before we can run the
correct version of blast, we need to specify the version of
blast or blast+ with the correct module.
blast/2.2.25
blast/2.2.26
blast+/2.2.25+
Let’s BLAST
blast/2.2.25
blast/2.2.26
blast+/2.2.25+
• Select the latest version of blast with either of the following
commands:
– module load blast/2.2.26
– module load blast
If you do not specify a version number for the
module, the latest installed version of the
software will be loaded.
Let’s BLAST
• There are two example files in your home
directory, exampledna.txt and exampleprotein.txt
• Try some BLAST searches against the protein
database ArabidopsisP and the DNA database
ArabidopsisN.
• Remember: log in to a compute node first! (see “Second command”)
[user@server ~]$ blastall -p blastx -i exampledna.txt -d nr
Blastall programs
blastall -p blastn    nucleotide against nucleotide
blastall -p blastp    protein against protein
blastall -p blastx    nucleotide against protein
blastall -p tblastn   protein against nucleotide
blastall -p tblastx   nucleotide against nucleotide at the protein level
And then there’s blast+ …
Try making some more files
• Go to the NCBI website, or anywhere else
• Download some genes of interest to you,
in fasta format
• Do some blasts, maybe some bl2seq
Big output
• How are we going to deal with the size of
the output text?
[user@server ~]$ nice -n 10 blastall -p blastp -i exampleprotein.txt -d nr > myblastfile.txt
[user@server ~]$ nice -n 10 blastall blah blah | more
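Both tricks work with any program that prints text. In this sketch, seq stands in for a BLAST run that produces thousands of lines, and head replaces more so that it can run unattended; at the terminal you would pipe into more or less instead.

```shell
out=$(mktemp -d)
cd "$out"

# > sends the output to a file instead of the screen.
seq 1 5000 > bigoutput.txt
nlines=$(wc -l < bigoutput.txt | tr -d ' ')
echo "$nlines lines saved"

# | sends the output to another program; here, keep only the first 5 lines.
top5=$(seq 1 5000 | head -n 5)
echo "$top5"
```

Be careful: > overwrites an existing file of the same name without asking.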
UNIX 1 summary
• Use a text terminal for powerful, remote
computing
• Use ls, cd, mv, cp, nano and friends to
deal with files and directories
• You can use many tools quickly – but
generally the output is in text format