UNIX and Perl
Lecture 2
Matt Hudson

Review
• Unix is text based:
    doesn’t waste computer resources on graphics
    allows you to write and use scripts easily
    makes remote access easy
    you don’t have to learn “where everything is”
    gives the user more power

Review
• When navigating file systems, remember the directory structure and the commands cd, ls and pwd.
• Be very wary of creating multiple files with the same name, as it is easy to overwrite an existing, important file.
• There is no undelete or trash basket in UNIX: delete or overwrite a file and it is gone.

Review
• Edit text files with nano. The standard format for bioinformatics text-based applications is called fasta:

    Name, or ID, of sequence:
    >sequence id|more info|yet more info
    Sequence itself, in one or many lines:
    ACCCGTGAGTAGTGTGTGCTCATCTCGTAGCATCGATCGAAATCGTCATCG
    CAGATGTCGTGCTAGCTGTGCATGCATCGATCGATCGATGTAGCTAGTAGT
    CAGTAGTAGTCGTAGATGTCGTCGATGCTAGTAGTGCTGCTGTGTGCTGTC
    CCACTAGCTGCATCGATG

Review
• The blastall command takes the following obligatory arguments:
    -p <program name, e.g. blastn, blastp>
    -i <input file name>
• These arguments are also very important:
    -d <database name>
    -e <E-value cutoff>
    -M <matrix name>

Using your UNIX skills
• Now we’re going to use our new UNIX skills in anger with some more sophisticated bioinformatics programs.
• It’s a good idea to make an “experiment” or “scratch” folder.
• Use this as a place for stuff that might explode….

Bioinformatics programs
• blastall:
    blastp blastn blastx bl2seq… etc
• HMMER:
    hmmpfam hmmsearch hmmbuild
• clustalw
• fasta34

The background
• If you are running a program that takes a long time, especially if redirecting output to a file, put it in the background so you can keep working.
    Either put & at the end of the command,
    or stop it (ctrl-Z) and then type bg %.

Running overnight
• If you are running a program overnight, use qsub from the head node to submit the command. This way the command will keep running after you exit the shell.
• Don’t forget to redirect output to a file when doing this (you can still get the output otherwise, but it can be hard to find).

Viewing running processes
• You can see all the processes on the system, ranked by how much memory and CPU time they are using.
    $ top

Fasta format
• The standard format for nucleotide and protein sequence is fasta, named after the program. It is very easy to read and write manually or with a program:

    Name of sequence:
    >sequence id|more info|yet more info
    Sequence itself, in one or many lines:
    ACCCGTGAGTAGTGTGTGCTCATCTCGTAGCATCGATCGAAATCGTCATCG
    CAGATGTCGTGCTAGCTGTGCATGCATCGATCGATCGATGTAGCTAGTAGT
    CAGTAGTAGTCGTAGATGTCGTCGATGCTAGTAGTGCTGCTGTGTGCTGTC
    CCACTAGCTGCATCGATG

Multiple fasta format

    >sequence id|more info|yet more info
    ACCCGTGAGTAGTGTGTGCTCATCTCGTAGCATCGATCGAAATCGTCATCG
    CAGATGTCGTGCTAGCTGTGCATGCATCGATCGATCGATGTAGCTAGTAGT
    CAGTAGTAGTCGTAGATGTCGTCGATGCTAGTAGTGCTGCTGTGTGCTGTC
    CCACTAGCTGCATCGATG
    >sequence 2 id|more info|yet more info
    ACCCGTGAGTAGTGTGTGCTCATCTCGTAGCATCGATCGAAATCGTCATCG
    CAGATGTCGTGCTAGCTGTGCATGCATCGATCGATCGATGTAGCTAGTAGT
    CAGTAGTAGTCGTAGATGTCGTCGATGCTAGTAGTGCTGCTGTGTGCTGTC
    CCACTAGCTGCATCGATG
    >sequence 3 id|more info|yet more info
    ACCCGTGAGTAGTGTGTGCTCATCTCGTAGCATCGATCGAAATCGTCATCG
    CAGATGTCGTGCTAGCTGTGCATGCATCGATCGATCGATGTAGCTAGTAGT
    CAGTAGTAGTCGTAGATGTCGTCGATGCTAGTAGTGCTGCTGTGTGCTGTC
    CCACTAGCTGCATCGATG

Multiple fasta format
• Multiple-sequence formats are essential for doing batch or high-throughput work.
• Most bioinformatics programs accept multiple sequences in this format.
• Some websites still do, but most have stopped accepting it because people used too many resources.

DNA sequence output from ABI 377 (a gel-based sequencer)
1. Trace files (dye signals) are analyzed and bases called to create chromatograms.
2. Chromatograms from opposite strands are reconciled with software to create double-stranded sequence data.

Quality and phred
• When manually interpreting Sanger sequence, you judge the quality of each base intuitively.
• Phil Green’s program “phred” made genome sequencing possible by doing this mathematically.

Fasta + quality
• This is the standard output of Sanger platforms: two files

    >sequence id|more info|yet more info
    ACCCGTGA

    >sequence id|more info|yet more info
    9 13 20 24 26 30 29 30

• It’s great and easy to read, but takes up a lot of disk space (at least 4 bytes / base).

fastq
• Much less fun to read, but only two bytes per base and only one file.

    @SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
    GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
    +SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
    IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC

• Encoding varies. Illumina 1.3+ format can encode a Phred quality score from 0 to 62 using ASCII 64 to 126.

sam and bam
• These formats originated as ways to store short reads aligned to a reference genome.
• Now a lot of Illumina raw data also comes as bam.
• The sam format is tab-delimited text that can be edited in Perl, but bam is binary and difficult to handle in Perl.
• The samtools utility (installed on biocluster) can convert between sam, bam, fastq and some other formats.

The BLAST command
• The blast command is blastall.
• You need to tell it what program to use, what database, and what input file. Many other options are available.
• e.g.
    [user@server ~]$ blastall -p myprogram -i myfile.txt -d mydatabase

genomics programs
• Blast
    formatdb
    blastall: blastp blastn blastx bl2seq… etc
• novoalign
    novoindex
    novoalign
• bowtie
    bowtie-build
    bowtie

Noticing a pattern?
All of these programs generate an “index” file, which speeds up searching of the query (short) against the “database” (big).

Index files

Making your own database
• You can use formatdb to make your own BLAST database
    [user@server ~]$ wget ftp://ftp.ncbi.nih.gov/blast/db/swissprot.tar.gz
    [user@server ~]$ tar -xzvf swissprot.tar.gz
    [user@server ~]$ formatdb -i swissprot -p T
    [user@server ~]$ blastall -p blastp -i exampleprotein.txt -d swissprot

Hidden Markov models
• The HMMER package:
    hmmpfam      Search for matches to the pfam database
    hmmbuild     Make your own model
    hmmsearch    Search a protein file with your model
• Let’s try it. But not all at once: it is very compute intensive.
    [user@server ~]$ nice -n 10 hmmpfam /home/bio/db/pfam testprotein.txt

clustalw
• The most commonly used alignment program.
• Try aligning the proteins in “testprotein.txt”
• You can use the alignment for phylogenetic analysis, or to create a hidden Markov model.

Making a HMM
• Using the .aln file from your clustalw output:
    [user@server ~]$ hmmbuild myhmm testprotein.aln
• This creates a model of your alignment that you can use to search for sequences that belong in it.

Searching with the HMM
    [user@server ~]$ hmmsearch myhmm testprotein.txt
• In this example we’re searching against the file we used to make the hmm.
• But you could search against the whole of genbank or swissprot, or against a whole genome, to find proteins with structural similarity to yours.

Fasta
• The fasta34 program is also installed on mrmarsh
• Fasta34 is an alternative to BLAST that is slower, but provides more accurate output, and can use any fasta format file as a database.
• Try searching exampleprotein.txt against testprotein.txt
    [user@server ~]$ fasta34

Getting information from output files
• Often these are huge text files
• grep is a great tool for getting at the nitty-gritty.
• awk is more powerful, but mostly involves writing scripts, and has been largely superseded by Perl.
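To give a feel for what “writing scripts” in awk means, here is a minimal sketch that lists each sequence ID in a fasta file together with its length. The file contents, the IDs seq1 and seq2, and the variable names are made up for illustration; only the header layout (fields separated by “|”) follows the fasta slides.

```shell
#!/bin/sh
# Toy multi-fasta file (made-up sequences, for illustration only).
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>seq1|more info|yet more info
ACCCGTGAGTAGTG
CAGATGTCGTGC
>seq2|more info|yet more info
CCACTAGCTGCATCGATG
EOF

# Header lines start a new record: print the previous ID and its length,
# then remember the new ID (first "|" field, minus the leading ">").
# Every other line is sequence, so add its length to the running total.
summary=$(awk -F'|' '
    /^>/ { if (id != "") print id, len; id = substr($1, 2); len = 0; next }
         { len += length($0) }
    END  { if (id != "") print id, len }
' "$fasta")

echo "$summary"
rm -f "$fasta"
```

grep alone can pick out the header lines, but it cannot add up the sequence lengths; that kind of bookkeeping is where awk (or Perl) comes in.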
grep
• My favorite of all UNIX commands
• The name comes from “globally search for a regular expression and print”
• Allows you to pick out lines of a text file that match a query, count them, and retrieve lines around the match.

grep - continued
    grep 'Query=' myblast.txt
        What sequences did I BLAST?
    grep -c '>' testprotein.txt
        How many sequences are in this file?
    grep -A 10 '>' testprotein.txt
        Give me each header line plus the ten lines after it

Getting files from remote servers
• Before there was the world wide web, there was ftp.
• Note that WORKER NODES ARE NOT CONNECTED TO THE INTERNET, so do network stuff from the head node.
    [user@server ~]$ ftp ftp.ncbi.nih.gov

ftp commands
• open     open a connection
• ls       same as UNIX
• cd       same as UNIX
• get      get me this file
• mget     get more than one file
• put      put a file on the server
• lcd      local cd
• close    close connection
• bye      exit the ftp program

Secure ftp
• NCBI allows you to connect using ftp because they have only public files, and they don’t let you upload anything.
• Most UNIX computers disallow ftp logins. However, if you can ssh to a computer, you can also use sftp. The commands are the same as for ftp, but you can access your own files securely.

wget
• But what if you want to get a file which is available for download from a website, but not by ftp?
• wget will get the contents of any URL and put them in a file.
    [user@server ~]$ wget www.cnn.com

How do I write a script, then?
    [user@server ~]$ nano myscript.pl

So what’s a program look like?
All programming courses traditionally start with a program that prints “Hello, world!”. So in keeping with that tradition:
    #!/usr/bin/perl
    print "Hello, world\n";
Note: No line numbers. Each command line ends with a semicolon.

Exit and run
Control-O (to save), then Control-X (to exit nano).
    [user@server ~]$ perl myscript.pl
Or, if you’re feeling fancy
    [user@server ~]$ chmod 755 myscript.pl
    [user@server ~]$ ./myscript.pl
The chmod command makes the file EXECUTABLE. Those numbers after chmod are octal numbers, but don’t worry too much about that.
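For the curious, here is a small sketch of what the 755 in chmod 755 means. Each digit is the sum of read (4), write (2) and execute (1), given for the owner, the group and everyone else, in that order. The scratch directory and the variable names are made up for illustration.

```shell
#!/bin/sh
# Write the hello-world script into a scratch directory.
workdir=$(mktemp -d)
script="$workdir/myscript.pl"
printf '#!/usr/bin/perl\nprint "Hello, world\\n";\n' > "$script"

chmod 755 "$script"

# Build each octal digit from its parts: read = 4, write = 2, execute = 1.
owner=$((4 + 2 + 1))   # rwx
group=$((4 + 0 + 1))   # r-x
other=$((4 + 0 + 1))   # r-x
echo "mode digits: $owner$group$other"

# The first ten characters of "ls -l" show the same permissions as letters.
perms=$(ls -l "$script" | cut -c1-10)
echo "permissions: $perms"

rm -rf "$workdir"
```

So 755 means the owner can read, write and run the script, while everybody else can read and run it but not change it.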
GNU
• GNU stands for “GNU’s Not Unix”
• This is what computer people think is a joke
• The reason for GNU is that the name “UNIX” was owned by private companies. GNU exists to make free software for UNIX, but couldn’t use the name.
• Linux systems combine the Linux kernel with GNU software
• Not only is GNU software free, but the instructions the software was built from (the “source code”) are also public and easily downloaded.

Downloading programs
• “Ready to run” programs are called binaries in unix-speak.
• They are often “zipped” in a .tar.gz file.
• To unzip, use gunzip and tar -xvf
• To run, specify the path to the program. E.g., ./program or /home/matt/bin/program
• You can download programs for UNIX just as you would for a PC

Bowtie
• Bowtie is a more heavily indexed search program, which requires a more exact match than BLAST. It is much, much faster.
• See if you can figure this out.
• Use bowtie-build to build the database, bowtie to search it…

Your path
• To see your path, type echo $PATH
• If you are bored with typing the full path to programs, you can put them in your path.
• E.g.
    mkdir ~/bin/
    mv program ~/bin/
    export PATH=$PATH:~/bin
    program

Source Code
• Most bioinformatics software is free, and open source. That is, you can download the actual instructions the programmer wrote.
• This is great, because it means you can install these programs on almost any machine.
• If somebody asks you for money for bioinformatics software, DON’T DO IT!

The GNU install procedure
• GNU source code can be complicated to compile, so it comes with programs to help you. There is a standard way to build and install GNU software.
    [user@server ~]$ gunzip program.tar.gz
    [user@server ~]$ tar -xvf program.tar
    [user@server ~]$ cd program
    [user@server ~]$ ./configure
    [user@server ~]$ make
    [user@server ~]$ make install

The root user
• Most UNIX machines have an account called “root”
• root can see everything, change everything, delete everything, including other users’ work
• Unless you buy your own machine, nobody sane will give you root access
• You usually need root access to install programs in the default location. But you can put them in your home directory instead.

UNIX summary
• Use ls, cd, mv, cp, nano and friends to deal with files and directories
• Install, or compile, any program you like. Most are free.
• Use blastall, hmmer etc on the command line for high-throughput work. Redirect the output to a file for best results and run in the background. Grep the output file to get pertinent information….
• Or process it with a Perl script
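The workflow in the summary can be sketched roughly as follows. So that it runs on a machine without BLAST, the search itself is replaced by a stand-in function, fake_search, which is purely hypothetical and only prints made-up “Query=” lines; on the cluster you would put your real blastall command in its place.

```shell
#!/bin/sh
# Stand-in for a long-running search such as blastall (hypothetical).
# It only prints made-up report lines so the sketch runs anywhere.
fake_search() {
    for id in seqA seqB seqC; do
        echo "Query= $id"
        echo "  (report text for $id would go here)"
    done
}

outfile=$(mktemp)

# Redirect the output to a file and put the job in the background with &.
fake_search > "$outfile" &

# In a real session you would keep working; here we just wait for the job.
wait

# Grep the output file for the pertinent lines.
n_queries=$(grep -c 'Query=' "$outfile")
echo "queries in report: $n_queries"
grep 'Query=' "$outfile"

rm -f "$outfile"
```

The same three steps (redirect, background, grep) apply whichever search program produced the output file.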