Module 10. Whole Genome Annotation: Misc

Background
Fathom (part of the SNAP gene predictor) analyzes annotation sets and can sort and categorize genes into different groups based on their annotation quality. Grep is used to pull information from many files at once in Linux. A shell script is used to run many commands automatically. Command line BLAST allows one to search the genome multifasta file for sequence motifs. Remmina is a remote desktop client for Linux, similar to PuTTY on Windows. Installing software in Linux is an essential skill.

Goals
Assess and summarize gene quality from a whole genome annotation using Fathom
Use grep and shell scripts to pull sequences out of a large multifasta file
Use a remote desktop client on Ubuntu to transfer files from a virtual box
Practice installing software in the Linux environment

V&C core competencies addressed
1) Ability to apply the process of science: Evaluation of experimental evidence, Developing problem-solving strategies
2) Ability to use quantitative reasoning: Applying statistical methods to diverse data, Managing and analyzing large data sets
3) Applying informatics tools, Managing and analyzing large data sets

GCAT-SEEK sequencing requirements
None

Computer/program requirements for data analysis
Linux OS, Fathom, BLAST, GCATSEEK Linux Virtual Machine
If starting from Windows OS: PuTTY
If starting from Mac OS: SSH

Protocols

I. Using Fathom to assess Annotation Quality
This tutorial uses the output from Maker after genome.fa was annotated. Running any of the listed commands without additional input (or with the -h option) will display the help message, including a description of the program, its usage, and all options.

1) Gather all of the GFFs into one inclusive file.
In the genome.maker.output directory:
$gff3_merge -g -d genome_master_datastore_index.log

2) Convert the gff to zff (a proprietary format used by fathom).
$maker2zff -n genome.all.gff
This will create two new files: genome.ann and genome.dna. The .ann file is a tab-delimited file similar to a gff with sequence information. The .dna file is a fasta file that corresponds to the sequences in the .ann file.

3) First, determine the total number of genes and the total number of errors. The results will appear on your screen. Enter them in the table below.
$fathom -validate genome.ann genome.dna

4) Next, calculate the number of multi-exon and single-exon genes as well as the average exon and intron lengths. Enter them in the table below.
$fathom -gene-stats genome.ann genome.dna

5) Categorize splits the genes in the genome.ann and genome.dna files into 5 different categories: Unique (uni.ann uni.dna), Warning (wrn.ann wrn.dna), Alternative splicing (alt.ann alt.dna), Overlapping (olp.ann olp.dna), and Errors (err.ann err.dna).
$fathom -categorize 1000 genome.ann genome.dna

Explanation of categories

Category               Explanation
Unique                 Single complete gene with no problems: good start and stop codons, canonical intron-exon boundaries (if a multi-exon gene), no other genes overlapping it, and no other potential problems
Warning                Something about the gene is incorrect: missing start or stop codon, does not have a canonical intron-exon boundary, or has a short intron or exon (or is a short gene if it has a single exon)
Alternative splicing   Genes that are complete, but have alternative splicing
Overlapping            Two or more genes that overlap each other
Error                  Something occurred and the gene is incorrect. Usually caused by mis-ordered exons.

6) To complete the table, repeat 3) and 4) for each of the non-error categories.

Using Fathom, complete the following table using your data

                      Count     Percent of Total
Total
Complete Genes
Warnings
Alternative splice
Overlapping
Errors
Single Exon

II. Grep Tutorial

grep [options] "pattern" filename

Command    Function                                                       Example usage
grep       Searches for lines in the file containing the search pattern  grep "GENE" annotation.gff
grep -i    Same as grep but is not case sensitive                         grep -i "$think of something" $_
grep -c    Counts the number of lines in which the pattern appears        grep -c ">" proteins.fasta
grep -An   Prints the line matching the pattern and n lines after it      grep -A1 "scaffold1" genome.fasta

grep
Finds and prints every line that contains the pattern being searched for. If you had a file with:
APPLE
pineapple
apples
and ran the command:
grep "APPLE" file
grep will print:
APPLE

grep -i
The same as grep but is not case sensitive. So running the command:
grep -i "APPLE" file
grep will print:
APPLE
pineapple
apples
(pineapple is printed too, because the line contains the letters "apple".)

grep -c
Will count the number of lines in a file in which the pattern appears. Useful for counting the number of sequences in a fasta file. If you had a fasta file with the sequence data below (each sequence is a single line in the file; it is wrapped here only to fit the page):
>sequence1
GCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATG
CATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCA
TGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATG
CATGCATG
>sequence2
AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGGTCTGTTCCAAGGGCCTTTGCGTCAGGT
GGGCTCAGGATTCCAGGGTGGCTGGACCCCAGGCCCCAGCTCTGCAGCAGGGAGGACGTGGCTGGGCTC
GTGAAGCATGTGGGGGTGAGCCCAGGGGCCCCAAGGCAGGGCACCTGGCCTTCAGCCTGCTCAGCCCTG
CCTGTCTCCCAGATCACTGTCCTTCTGCCATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGCGCTGCTG
and ran the command:
grep -c ">" file
grep will print:
2
(the number of lines containing > in the file, which is also the number of sequences)

grep -A n
Will print the line matching the pattern and the n lines after it.
So, running the command:
grep -A1 "sequence1" file
grep will print the header and the one sequence line that follows it (wrapped here only to fit the page):
>sequence1
GCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATG
CATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCA
TGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATG
CATGCATG

Some notes on grep:
If a line contains the pattern more than once, the line will be printed or counted only once.
Grep can also make use of regular expressions.

Other options:
-n        Print the line number for lines that match the pattern
-B n      Print n lines before the matching line (the counterpart of -A n)
-H        Print the file's name before each line (when searching multiple files)
-f file   Read the patterns from a file, one per line
-v        Print lines that don't match the pattern
-x        Print only exact matches: the whole line must match the pattern
-m n      Stop after finding n matches
egrep     Uses the extended set of regular expressions (same as grep -E)
fgrep     Does not use regular expressions and only searches for literal strings (same as grep -F; faster than grep)

Regular expressions:
Regular expressions are special characters that stand for other characters, often for several types of characters or for several characters at once.
Some (but definitely not all) regular expressions:
*     Zero or more of the preceding character (so .* matches any run of characters)
.     Any character other than new line (\n)
\t    Tab
\s    Whitespace (space, tab, new line)
^     Beginning of line
$     End of line
Note that \t and \s are not understood by every version of grep.
More regular expressions can be found at: http://regexlib.com/CheatSheet.aspx?AspxAutoDetectCookieSupport=1

III. Creating a grep shell script
If you have a large number of search terms to extract with grep, typing a separate grep command for each one is repetitive and time consuming. One way around this is to make a shell script. A shell script is a list of commands that will be executed one after the other. Shell scripts are easy to create and let you run many commands with a single command. Earlier in this tutorial we saw how grep -A1 will extract a sequence title and the sequence.
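To make the idea of a shell script concrete before the step-by-step walkthrough, here is a minimal sketch. It builds a tiny fasta file and then runs one grep -A1 per sequence name, reading the names from a list. The file names demo.fa, names.txt, and demo_out.fa are made up for this illustration and are not part of the tutorial data; the sketch also assumes each sequence sits on a single line.

```shell
#!/bin/bash
# Minimal sketch of a grep shell script.
# demo.fa, names.txt and demo_out.fa are hypothetical names used only here.

# Build a three-sequence fasta file (one sequence per line) and a list of names.
printf '>sequence_1\nGCATGCAT\n>sequence_2\nAGCCCTCC\n>sequence_3\nTTGACCGA\n' > demo.fa
printf 'sequence_1\nsequence_3\n' > names.txt

# Start with an empty output file so reruns do not pile up old results.
: > demo_out.fa

# Run one grep per name; -A1 prints the header line plus the line after it.
while read -r name; do
    grep -A1 ">$name" demo.fa >> demo_out.fa
done < names.txt

cat demo_out.fa
```

Because grep matches substrings, a name such as sequence_1 would also match sequence_10 in a larger file; adding the -w option (match whole words only) is one way to guard against that.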
Extracting 5 sequences using a grep shell script:

1) Make a list of the sequences you want to extract from a large genome.
$nano
Enter the sequence names below:
___________
sequence_1
sequence_2
sequence_3
sequence_4
sequence_5
___________
Press Control-X to exit, and save the file when prompted.

2) Open the list in vi (vi is a command-line text editor with a built-in substitute command that handles regular expressions).
$vi [filename]

3) In vi, add the grep -A1 command to the front of each line. The command below uses a regular expression to find the beginning of each line and insert grep -A1 " there. The colon opens the vi command line. The % applies the command to every line. The s is the substitute (search and replace) command. The characters between the first two slashes (^) are the search term and the characters between the last two slashes (grep -A1 ") are the replacement. The special character ^ matches the beginning of each line.
:%s/^/grep -A1 "/
This will turn the list into:
grep -A1 "sequence_1
grep -A1 "sequence_2
grep -A1 "sequence_3
grep -A1 "sequence_4
grep -A1 "sequence_5

4) Now, add the rest of the grep command (" genome.fa >> output) to the end of each line. The special character $ matches the end of each line.
:%s/$/" genome.fa >> output
The >> characters will append the grep output to the end of the file "output". This will turn the list into:
grep -A1 "sequence_1" genome.fa >> output
grep -A1 "sequence_2" genome.fa >> output
grep -A1 "sequence_3" genome.fa >> output
grep -A1 "sequence_4" genome.fa >> output
grep -A1 "sequence_5" genome.fa >> output

5) Save the file and close vi. The command below (hold Shift and press z twice, with no colon) will save and close the file in vi. Typing :wq does the same thing.
ZZ

6) Open the file in nano.
$nano filename

7) Now, change the first ">>" to ">". ">>" will append information to the end of a file. A single ">" will create a new file if it does not exist yet and clear the contents of the file if it already exists.
The list should now look like:
grep -A1 "sequence_1" genome.fa > output
grep -A1 "sequence_2" genome.fa >> output
grep -A1 "sequence_3" genome.fa >> output
grep -A1 "sequence_4" genome.fa >> output
grep -A1 "sequence_5" genome.fa >> output

8) Now add "#!/bin/bash" as the first line of the file. The #! (known as the shebang) tells the computer that the commands are to be sent to the program bash, which is a common UNIX shell and the one that both the virtual machines and the HHMI cluster use by default.

9) Make the file executable by changing the file permissions. "u+x" gives the file's owner (the user) execute permission. (sudo is only needed if you do not own the file.)
$sudo chmod u+x filename

10) Run the script. The "./" below tells the computer to look for the program you just made in the working directory rather than in the directories listed in the PATH set up in your .bashrc file.
$./filename
The script will now run, extract the 5 sequence headers along with their DNA from genome.fa, and put them in the file "output".

IV. Transfer files from the HHMI cluster to your Linux virtual box using the Remmina client SFTP feature

Connect to the HHMI Cluster from Virtual Box using Remmina Remote Desktop
Remmina Remote Desktop is the default remote desktop application for the Ubuntu OS. It is the equivalent of PuTTY on Windows.
1) Select the "Dash Home" button in the top left of your desktop screen (similar to the "Start Menu" in Windows).
2) Type in "remm".
3) Select Remmina Remote Desktop.
4) Select the icon showing a paper with a plus sign to create a new connection.
5) Under "profile":
Name: HHMI Cluster
Protocol: ssh - Secure Shell
Server: 192.112.102.21 (10.39.6.10 for JC internal users)
Username: [enter your cluster username]
6) Connect.
7) If a warning message saying the server is unknown appears, click ok.
8) Enter your password and click ok.

Transfer Files from the cluster to Virtual box using sFTP
1) Click on the icon showing three gears at the top of the Remmina Remote Desktop window.
2) Select "Open Secure File Transfer".
3) Navigate within the GUI to the folder containing the file you want to download.
4) Select the file you want to download and click "Download" near the top of the Remmina Remote Desktop window.

OPTIONAL: Upload files from your virtual machine to the cluster
1) Navigate to the directory you wish to upload your file to.
2) Click on "Upload".
3) Navigate to the file you wish to upload.
4) Select the file.
5) Click ok.

V. Command line Blast Tutorial
The Basic Local Alignment Search Tool, or BLAST, is easily the most common and widely used piece of bioinformatics software. BLAST can perform several different types of alignments with several different forms of output. BLAST can compare the sequence of interest either to publicly available, previously created databases or to a custom database.
This tutorial assumes you are comparing a sequence file called genome.fa to another set of reference sequences called reference.fa.

Creating a custom BLAST database
1) Make a directory for blast in your home directory, and copy your genome.fa file from your Maker folder to the blast folder. We will use the genome file to make a custom database that we can then use as a reference subject to search with a protein query (tblastn).
$mkdir blast
$cp maker/genome.fa blast/

2) Move to your blast folder and create a BLAST database from a reference set of fasta sequences (here, genome.fa).
$cd blast
$makeblastdb -in [reference.fa] -dbtype [nucl or prot] -title [reference] -out [reference]
This will create three database files, all starting with "reference" and ending in certain file extensions. If the initial sequences were proteins, dbtype would be set to prot, and if they were nucleotides, dbtype would be set to nucl.
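As an illustration of how the bracketed template might be filled in for a nucleotide genome, here is a small sketch. The file demo_genome.fa is a made-up stand-in for genome.fa, and the database name "reference" simply follows the tutorial's wording; the script only calls makeblastdb if BLAST+ is actually installed.

```shell
#!/bin/bash
# Sketch: fill in the makeblastdb template for a nucleotide fasta file.
# demo_genome.fa is a hypothetical stand-in for genome.fa.
printf '>scaffold1\nGCATGCATGCATGCATGCATGCATGCATGCAT\n' > demo_genome.fa

# The command is held in an array so each argument stays a separate word.
cmd=(makeblastdb -in demo_genome.fa -dbtype nucl -title reference -out reference)
echo "Command: ${cmd[*]}"

if command -v makeblastdb >/dev/null 2>&1; then
    "${cmd[@]}"
    # List the database files that were written.
    ls reference.n*
else
    echo "makeblastdb not found; install BLAST+ before building the database"
fi
```

For a nucleotide database the files typically end in .nhr, .nin, and .nsq; recent BLAST+ releases may write a few additional index files alongside them.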
Use BLAST to align the sequence of interest to the custom database
1) Run tblastn (or another BLAST program, chosen by the letters in brackets) as follows:
$[t]blast[n p x] -query proteins.fa -db reference -out blastresults.txt -outfmt 0
The outfmt (output format) option tells BLAST what format to output the data in. Pairwise (the default, or -outfmt 0) is a good way to visualize results. Format 7 (-outfmt 7) is a tabular output with comments indicating column titles. This is a good format for exporting to a spreadsheet.
BLAST also has options to control the stringency of the match between subject and query. To see the options, call up the help documentation for the type of blast you are running. For example:
$tblastn -help
For example, to run a blastn query with the e-value cutoff changed to 1e-10 (1 x 10^-10):
$blastn -query genome.fa -db reference -evalue 1e-10 -out blastresults.txt -outfmt 0

Blast types
blastn     Nucleotide query to nucleotide database
blastp     Protein query to protein database
blastx     Translated nucleotide query to protein database
tblastn    Protein query to translated nucleotide database
tblastx    Translated nucleotide query to translated nucleotide database

VI. For Virtual Box Users: Installing software in a Linux OS
One of the most important, but often most difficult, parts of doing computer work in Linux is installing new software. Since there is a wide variety of UNIX versions (aka distributions) and each individual machine can have a different setup, programs are often written on one distribution but may not work on another. UNIX has some support for this, but installation often depends on the user first installing other programs that are required for the installed program to work properly (aka dependencies). The user must then provide certain information about their system to the program being installed in order for it to run from any directory.
There are four basic forms of software installation:
1) automatic installation using a graphical package manager such as Ubuntu Software Center, Synaptic Package Manager, or Red-hat Package Manager (RPM),
2) automatic installation using a command line package manager such as "apt-get" or "yum",
3) semi-automatic installation using the program "make" to help compile source code, and
4) direct compilation of source code.

Installation using a graphical package manager is as simple as opening the package manager, searching for the program you wish to install, and selecting install. Installation using a command line package manager requires you to know the exact name of the program you want to install and to use the command "sudo [apt-get or yum] install [program name]".

The program "make" is a UNIX program that looks for an input file called either "Makefile" or "makefile". The Makefile contains instructions on how to compile source code, such as which compilers to use, compilation options, and where to put the programs as they are compiled. The $PATH environment variable tells the computer where to look for programs, and is set up in the .bashrc file that we copied from ~sickler in module 2 above in order for all the programs to work for you on the cluster. If the Makefile is set up correctly, there is no need to add programs to the $PATH environment variable yourself. Sometimes, an additional program called either "configure" or "configure.sh" is included, which will either guess where certain programs are already installed or prompt the user for the locations of those programs.

The bottom line is that most software comes with either a readme or an installation guide specifying how to install the software. If the Makefile does not automatically add the programs being installed to the $PATH, the user must adjust their path to include the location of the installed software. Directly compiling a program from source is very rarely done except for the simplest of programs.
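The role of $PATH can be seen with a small experiment. The sketch below creates a one-line program in a new directory, shows that the shell cannot find it by name, then appends the directory to PATH and runs it. The names demo_bin, hello_demo, and demo_path_out.txt are invented for this illustration, and the PATH change lasts only for the current shell session.

```shell
#!/bin/bash
# Sketch: how $PATH lets the shell find a program by name.
# demo_bin, hello_demo and demo_path_out.txt are hypothetical names.
mkdir -p demo_bin
printf '#!/bin/bash\necho "hello from demo_bin"\n' > demo_bin/hello_demo
chmod u+x demo_bin/hello_demo

# Before the PATH change the shell cannot find the program by name.
command -v hello_demo || echo "hello_demo is not on the PATH yet"

# Append the directory (as an absolute path) to PATH for this session only.
PATH="$PATH:$PWD/demo_bin"
export PATH

# Now the program is found and can be run from any directory.
command -v hello_demo
hello_demo > demo_path_out.txt
cat demo_path_out.txt
```

Putting the same PATH and export lines in .bashrc is what makes such a change apply to every new terminal.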
Example: Installing CEGMA
To practice installing programs in Linux, this tutorial will guide you through installing CEGMA (Core Eukaryotic Genes Mapping Approach). CEGMA is a program that aligns the most highly conserved eukaryotic proteins to a draft genome and can be used to: 1) generate an initial training set for gene predictor training, and 2) measure the completeness of an assembly as the percent of CEGMA genes found. We will start off by installing "genewise," a dependency of CEGMA. We will use "apt-get," a terminal-based package installer.

1) Install genewise using apt-get. It will download and install genewise from a central repository. You will first need to update your repositories and operating system. NOTE: You may have to hit Control-C at "headers" during the first step if your installation hangs.
$sudo apt-get update
$sudo apt-get install wise
Enter your password when prompted.

2) From your home directory, download CEGMA to the virtual box using the command "curl" (a tool for transferring data from a URL), then untar and decompress it using the command "tar." You will first need to install curl.
$sudo apt-get install curl
$curl http://korflab.ucdavis.edu/Datasets/cegma/cegma_v2.4.010312.tar.gz > cegma.tar.gz
$tar -xvzf cegma.tar.gz
Above, x extracts the archive, v is verbose (lists the files as they are extracted), z decompresses with gzip, and f reads from the named file.

3) Because the file was a tar archive of a directory, we now need to go to the directory where CEGMA was unpacked. Check the name of the directory using ls first.
$ls
$cd cegma… (type the start of the directory name, then press Tab to complete it)

4) Use "make" to compile CEGMA.
$make
CEGMA could be added to the $PATH by running the command "make install", but this tutorial will show you how to manually add the programs to the $PATH.

5) Go back to your home directory and open your .bashrc profile.
$cd
$nano .bashrc

6) "Make" generated a folder called "bin" in your cegma directory. The bin directory contains the program executables.
You now need to add the path to the CEGMA bin directory to the end of your .bashrc profile so that you can execute CEGMA from any directory. Notice that the .bashrc file is hidden (its name begins with a dot) because many years ago it was determined that only super geniuses should be able to edit the .bashrc, and this is now you. Add these two lines to the end of the file:
PATH=$PATH:/home/[username]/cegma_v2.4.010312/bin
export PATH
Press Control-X to save and exit nano.

7) Close and reopen the terminal for the changes to .bashrc to take effect.

8) To validate that CEGMA is working, type the following to see if the help files appear.
$cegma -h

9) Go to the Korf lab page describing the CEGMA software. See if you can figure out how to run it on the mini-genome you used for Maker annotation earlier and interpret the output. http://korflab.ucdavis.edu/Datasets/cegma/#SCT3. What fraction of CEGMA genes were found on the mini-genome?

Assessment
The tools in this Module are essential for working with a draft genome and genome annotation. These are perhaps best assessed as part of a project involving summarizing Maker results, or working with subsets of scaffolds or annotations.

Discussion

References
Korf I. 2004. Gene finding in novel genomes. BMC Bioinformatics 5:59 [SNAP and FATHOM]
Haddock SHD, Dunn CW. 2011. Practical computing for biologists. Sinauer Associates
Korf I. Unix and Perl Primer for Biologists. korflab.ucdavis.edu/Unix_and_Perl/unix_and_perl_v3.0.pdf