Getting started with UNIX for bioinformatics
Matt Hudson
Dept of Crop Sciences

Command lines
• You’ll all have interacted with computer operating systems such as Windows and MacOS.
• These are GUIs (graphical user interfaces): programs that let you interact with the computer intuitively. If you want to get closer to the computer system, you need to use a command line.

Command lines
• Although the old Mac System 7 did not have one, all modern operating systems have one – Windows has the Command Prompt (descended from DOS), MacOSX has a form of BSD UNIX, and Linux has, well, Linux.
• The command line is much more central to your work in Linux or another form of ‘true’ UNIX, like Solaris, HPUX or Tru64, than it is in Windows or MacOS.
• (There are standards for UNIX called the POSIX standards. Early versions of MacOSX did not fully follow them; versions from 10.5 onward are certified as UNIX.)

Commands
• The first word you type on the command line is the name of a program to run; the words after it are passed to that program as options and arguments. So it is easy to add your own commands – just add, or write, another program.
• The output of the program is returned to the terminal unless you say otherwise. So all your interaction is through one text window. This makes it easy to log in remotely.

UNIX Commands
• Why use UNIX?
  - designed for lots of small programs
  - can link programs together
  - doesn’t waste computer resources on graphics
  - gives the user much more power
• Unfortunately, this comes at the expense of being user-friendly. But once you learn it properly you won’t want to go back.

Your UNIX course
• You each have a Windows computer that can log into our UNIX cluster via a “terminal” program called PuTTY.
• This class has a bioinformatics cluster, a common way to do computationally intensive work. You can log in remotely with your user account from any machine on campus, including Windows or Mac. On Windows you need PuTTY or another SSH client installed; Mac has SSH built in.

Logging in to the cluster
• The class cluster is in a cooled server room in the IGB basement.
• http://biocluster.igb.illinois.edu/ganglia/
• Most UNIX computing is done remotely like this.
• There are 30 “class” accounts.
Sign up for one of them on the sheet.

Structure of a cluster
[Diagram: a user client with an internet connection reaches the head node via the internet. The head node controls jobs and connects to compute nodes 01–06 over a private, passwordless network.]

Logging onto the cluster
• Log in to the head node through PuTTY:
    Server: biocluster.igb.illinois.edu
    Username: from sheet, e.g. cpsc565_stud28
    Password: see sheet
• Note: if you use MacOSX or Linux, type this in a terminal window:
    ssh -X username@biocluster.igb.illinois.edu
• Hit “return” in answer to all the first login questions.

Unix Files
• Unix is CaSe SeNsiTivE! Like you!
• UNIX filenames should contain only letters, numbers, and the _ (underscore), . (dot), and - (dash) characters. NO SPACES! Underanycircumstances!
• The extension (e.g. .txt, .fasta) can be any number of letters and is optional. It’s for your own convenience so you know what kind of file is what.
• You can only have one file with a given name in a directory. If you make another one with the same name, the old file will be overwritten. This is VERY easy to do.

Second command
• Once you have logged in to the server, you are automatically on the head node. Log in to a compute node:
    [cpscstu@biocluster ~]$ qsub -IX -q classroom
  – Log into a compute node before you do anything computationally intensive
  – Jobs should not be run on the head node

Working with Directories
• Directories organize files on a Unix computer.
  – They are equivalent to folders in Windows and Mac, except that you should not put spaces in their names.
  – The list of directories that lets you locate a file is called a PATH (e.g. /home/matt/drivel.txt is the FULL PATH to the file drivel.txt).
• Understanding directories is vital.

Directory commands
    cd – change current directory
    mkdir – make a directory
    rmdir – remove (delete) a directory
    pwd – print working directory (= where am I?)

Typical UNIX directory structure
    / pronounced ‘slash’ or ‘root’.
    /bin = where the programs live.
    Don’t mess.
    /lib = programming libraries. Ignore.
    /etc = admin stuff. Ignore.
    /usr = more programs, not user files. Don’t mess.
    /mnt = ‘mount point’ for floppies, CD-ROMs etc. If you put a CD-ROM in, it is in /mnt/cdrom
    /tmp = temporary files. Ignore.
    /var = logs and other changing system files. Ignore.
    /home
    /home/matt = where ALL my files are
    /home/fred
    /home/jane = where Jane’s files are. I can’t see them unless she lets me.
A UNIX workstation is usually set up like this; cygwin and MacOSX are different.

Your Home Directory
• When you log in to any UNIX computer, you start off in your own home directory
• This is your home – keep it tidy. Create subdirectories to store specific projects or groups of information.
• Don’t accumulate hundreds of files in your home directory

File & Directory Commands
• This is a minimal list of Unix commands that you need for file management:
    ls (list)
    cp (copy)
    mv (move)
    rm (remove, i.e. delete a file)
    nano (a text editor that lets you edit any text file)
    gunzip and unzip (extract compressed files)
    tar (extract files in an ‘archive’)
    man (help)
All of these commands can be modified with many options. To see the options, use man (e.g. $ man cp).

List my files
• You can list the files in your current directory with
    [user@server ~]$ ls
• You can modify a command with options.
• Try:
    ls -F
    ls -a
    ls -l

Help! I messed up!
• If you accidentally start something that will take forever, hit control + z
• This will stop (suspend) the process but not make it go away entirely.
• Restart it with
    [user@server ~]$ fg %1
• Kill it forever with
    [user@server ~]$ kill %1

Edit a file
• There are also many text editors in UNIX.
• These are ways to edit a file via the terminal. Many of them are very old, and very cranky. But UNIX buffs still love them.
• Let’s make a text file. I recommend you use:
    [user@server ~]$ nano text.txt

The nano editor
    GNU nano 1.2.1    File: test
    This is text in a file.
    I edit it just like in Word.
All you really need to know is:
1) ^ means control
2) $ means “this line is longer than the screen” – useful for DNA sequence files.
3) If you type
    [user@server ~]$ nano myfile
   where myfile already exists, nano will edit it. If not, it will make a new file.
    [ Read 1 line ]
    ^G Get Help   ^O WriteOut   ^R Read File   ^Y Prev Page   ^K Cut Text    ^C Cur Pos
    ^X Exit       ^J Justify    ^W Where Is    ^V Next Page   ^U UnCut Txt   ^T To Spell

Other ways to view files
These can be very useful. Try them out:
    [user@server ~]$ more text.txt
    [user@server ~]$ less text.txt
    [user@server ~]$ head text.txt
    [user@server ~]$ tail text.txt

The gedit graphical editor
• In a terminal window on your desktop, while logged into the head node, type
    gedit &
• Now try the same thing while logged into a biocluster worker node.
• You are now running gedit on the cluster, and can see and edit files on a remote server. This is called SSH tunnelling: the -X option you gave to ssh and qsub forwards the program’s graphical display to your desktop.

Shortcuts
There are several shortcuts in Unix for specifying directories. They don’t make much sense. You just need to learn them.
. (dot) means “the working directory” – the one you’re in. So
    [user@server ~]$ cd .
does absolutely nothing.
.. means “the parent directory” – the directory one level above the working directory. So
    [user@server ~]$ cd ..
will move you up (towards /) one level.
~ (tilde) means your home directory, so
    [user@server ~/work]$ cd ~
will take you home.
Note that your current directory is IN THE PROMPT.

Bioinformatics commands
• No bioinformatics programs come with UNIX
• Most biology department servers have them installed already. But you should probably know how to install them yourself.
• It is pretty much the same as installing any other program on UNIX – except you need to keep in mind the requirements for disk space and memory.

Disk space and memory
• These are different things.
• Disk space is the amount of free space for data on your hard disk drive.
• Memory is the amount of RAM installed in the computer.
• Both of these are critical for many bioinformatics applications.
For example, BLAST databases can be very large and take up a lot of disk space, and in order to search through them, the BLAST program needs to load a lot of data into RAM.

Running programs
• To run a program by typing its name as a command, the program needs to be in a directory listed in your PATH (the list of directories the shell searches for commands); otherwise you must specify the path to it.
• On mrmarsh, all this is done for you, but you might need to understand how to set it up on another machine.
• E.g. say I have the blastall binary in my home directory. I could run blastall in any of the following ways:
    [user@server /var/tmp]$ cd /home/matt
    [user@server ~]$ ./blastall
• Or, from any directory,
    [user@server /var/tmp]$ /home/matt/blastall
• Or, I can install it and put it in my path:
    [user@server ~]$ mkdir -p /home/matt/bin
    [user@server ~]$ mv blastall /home/matt/bin
    [user@server ~]$ export PATH=$PATH:/home/matt/bin
    [user@server ~]$ blastall

Fasta format
• The standard format for nucleotide and protein sequence is fasta, named after the FASTA program. It is very easy to read and write manually or with a program:
Name of sequence:
    >sequence id|more info|yet more info
Sequence itself, in one or many lines:
    ACCCGTGAGTAGTGTGTGCTCATCTCGTAGCATCGATCGAAATCGTCATCG
    CAGATGTCGTGCTAGCTGTGCATGCATCGATCGATCGATGTAGCTAGTAGT
    CAGTAGTAGTCGTAGATGTCGTCGATGCTAGTAGTGCTGCTGTGTGCTGTC
    CCACTAGCTGCATCGATG

Multiple fasta format
    >sequence id|more info|yet more info
    ACCCGTGAGTAGTGTGTGCTCATCTCGTAGCATCGATCGAAATCGTCATCG
    CAGATGTCGTGCTAGCTGTGCATGCATCGATCGATCGATGTAGCTAGTAGT
    CAGTAGTAGTCGTAGATGTCGTCGATGCTAGTAGTGCTGCTGTGTGCTGTC
    CCACTAGCTGCATCGATG
    >sequence 2 id|more info|yet more info
    ACCCGTGAGTAGTGTGTGCTCATCTCGTAGCATCGATCGAAATCGTCATCG
    CAGATGTCGTGCTAGCTGTGCATGCATCGATCGATCGATGTAGCTAGTAGT
    CAGTAGTAGTCGTAGATGTCGTCGATGCTAGTAGTGCTGCTGTGTGCTGTC
    CCACTAGCTGCATCGATG
    >sequence 3 id|more info|yet more info
    ACCCGTGAGTAGTGTGTGCTCATCTCGTAGCATCGATCGAAATCGTCATCG
    CAGATGTCGTGCTAGCTGTGCATGCATCGATCGATCGATGTAGCTAGTAGT
    CAGTAGTAGTCGTAGATGTCGTCGATGCTAGTAGTGCTGCTGTGTGCTGTC
    CCACTAGCTGCATCGATG

Multiple fasta format
• This is the BIG TRICK to doing batch or high-throughput work
• Almost all bioinformatics programs accept multiple sequences in this format
• Some websites still do, but most have stopped accepting it because people used too many resources.

Example files
• You have example files in your home directory
    [user@server ~]$ ls -l
    [user@server ~]$ nano exampleprotein.txt

BLAST
• We’re going to concentrate on BLAST as an example of a bioinformatics application that is widely used.
• It’s probably the most widely used of all bioinformatics programs
• Unfortunately, the databases need so much disk space and memory that you need to learn “nice”.
• What is more powerful is the ability to create your own databases.

The BLAST command
• The blast command is blastall
• You need to tell it what program to use (-p), what input file (-i), and what database (-d). Many other options are available.
• e.g.
    [user@server ~]$ blastall -p myprogram -i myfile.txt -d mydatabase

Being “nice”
• blastall takes a lot of resources.
• So that more important jobs take precedence (i.e. other people can still read their terminals) you can use “nice”.
    [user@server ~]$ nice -n 10 blastall -p etc.
• This is important on a shared machine, not so much on the cluster.

Fun with Modules
• The biocluster uses 'modules' to systematically organize, version control, and load software and libraries.
• Try the command 'module' to see all of your available options with the tool.
• Try the command 'module avail' to see all of the modules available on the server.
• Before we can run the correct version of blast, we need to specify the version of blast or blast+ with the correct module.
    blast/2.2.25
    blast/2.2.26
    blast+/2.2.25+

Let’s BLAST
• Select the latest version of blast with either of the following commands:
  – module load blast/2.2.26
  – module load blast
• If you do not specify a version number for the module, the latest installed version of the software will be loaded.
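After loading a module it is worth checking that the shell can actually find the program. The lines below are a minimal sketch, not the cluster's official procedure: they assume the module system described above, and because the module command normally exists only on machines set up that way, the sketch tests for it first. The variable name blast_status is made up for illustration.

```shell
# Assumption: 'module' is only defined on machines that use a modules system.
if type module >/dev/null 2>&1; then
    module load blast     # no version number given
    module list           # show which modules are currently loaded
else
    echo "no module command on this machine"
fi

# Either way, check whether blastall can be found through PATH.
if command -v blastall >/dev/null 2>&1; then
    blast_status="found"
    echo "blastall is at: $(command -v blastall)"
else
    blast_status="missing"
    echo "blastall is not in PATH"
fi
```

If the last line reports that blastall is not in PATH, the module was not loaded in this shell, or the program lives somewhere PATH does not cover.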
Let’s BLAST
• There are two example files in your home directory, exampledna.txt and exampleprotein.txt
• Try some BLAST searches against the protein database ArabidopsisP and the DNA database ArabidopsisN.
• Remember: log in to a compute node first! (see “Second command”)
    [user@server ~]$ blastall -p blastx -i exampledna.txt -d nr

Blastall programs
    blastall -p blastn     nucleotide against nucleotide
    blastall -p blastp     protein against protein
    blastall -p blastx     nucleotide against protein
    blastall -p tblastn    protein against nucleotide
    blastall -p tblastx    nucleotide against nucleotide at the protein level
And then there’s blast+ …

Try making some more files
• Go to the NCBI website, or anywhere else
• Download some genes of interest to you, in fasta format
• Do some blasts, maybe some bl2seq

Big output
• How are we going to deal with the size of the output text? Send it to a file with >, or page through it by piping it into more with |:
    [user@server ~]$ nice -n 10 blastall -p blastp -i exampleprotein.txt -d nr > myblastfile.txt
    [user@server ~]$ nice -n 10 blastall -p blastp -i exampleprotein.txt -d nr | more

UNIX 1 summary
• Use a text terminal for powerful, remote computing
• Use ls, cd, mv, cp, nano and friends to deal with files and directories
• You can use many tools quickly – but generally the output is in text format
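The redirect (>) and pipe (|) idioms from the “Big output” slide work with any program that prints text, so they can be practised without BLAST. The sketch below is only an illustration: the file names demo.fasta and headers.txt and the three short sequences are invented, and everything is written into a throwaway temporary directory.

```shell
# Work in a throwaway directory so nothing in your home directory is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Write a small multiple-fasta file with three made-up sequences.
cat > demo.fasta <<'EOF'
>seq1|demo
ACCCGTGAGTAGTGTG
>seq2|demo
CAGATGTCGTGCTAGC
>seq3|demo
CAGTAGTAGTCGTAGA
EOF

# Redirect: save the header lines to a file instead of printing them.
grep '^>' demo.fasta > headers.txt

# Pipe: hand one program's output to another (here, count the header lines).
count=$(grep '^>' demo.fasta | wc -l | tr -d ' ')
echo "sequences: $count"

# Look at just the first lines of the saved file.
head -n 2 headers.txt
```

Counting the lines that start with > is a quick way to check how many sequences a multiple-fasta file holds before submitting it to a long-running search.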