Whole Genome Annotation: Misc

advertisement
Module 10. Whole Genome Annotation: Misc
Background






Fathom (part of the snap gene predictor) analyzes annotation sets and can sort and categorize
genes into different groups based on their annotation quality.
Grep is used to grab information from many files at once using Linux.
A shell script is used to run many commands automatically.
Command line Blast allows one to search the genome multifasta file for sequence motifs.
Remmina is a remote desktop client similar to PuTTY used for windows.
Installing software in Linux is an essential skill.
Goals




Assess and summarize gene quality from a whole genome annotation using Fathom
Use Grep and shell scripts to pull sequences out of a large multifasta file
Use a remote desktop client on Ubuntu to transfer files from a virtual box
Practice installing software in the Linux environment
V&C core competencies addressed
1) Ability to apply the process of science: Evaluation of experimental evidence, Developing problemsolving strategies
2) Ability to use quantitative reasoning: Applying statistical methods to diverse data, Managing and
analyzing large data sets
3) Applying informatics tools, Managing and analyzing large data sets
GCAT-SEEK sequencing requirements
None
Computer/program requirements for data analysis
Linux OS, Fathom, Blast, GCATSEEK Linux Virtual Machine
If starting from Window OS: Putty
If starting from Mac OS: SSH
1
Protocols
I. Using Fathom to assess Annotation Quality
This tutorial uses the output from Maker after genome.fa was annotated. Also, running any of the listed
commands without any additional input (or running them with the -h option) will display the help
message including a description of the program, the usage, and all options.
1) Gather all of the GFFs into one inclusive file.
In the genome.maker.output directory
$gff3_merge -g -d genome_master_datastore_index.log
2) Convert the gff to zff (a proprietary format used by fathom)
$maker2zff -n genome.all.gff
This will create two new files genome.ann & genome.dna
The .ann file is a tab delimited file similar to a gff with sequence information. The .dna file is a fasta file
that corresponds to the sequences in the .ann file.
3) First, determine the total number of genes and total number of errors there were. The results will
appear on your screen. Enter them in the table below.
$fathom –validate genome.ann genome.dna
4) Next, calculate the number of Multi and Single exon genes as well as average exon and intron lengths.
Enter them in the table below.
$fathom –gene-stats genome.ann genome.dna
5) Categorize splits the genes in the genome.ann and genome.dna files into 5 different categories:
Unique (uni.ann uni.dna), Warning (wrn.ann uni.dna), Alternative splicing (alt.ann alt.dna), Overlapping
(olp.ann olp.dna), and Errors (err.ann err.dna).
$fathom –categorize 1000 genome.ann genome.dna
Explanation of categories
Category
Unique
Explanation
Single complete gene with no problems:
Good start and stop codons, canonical intron exon
boundaries (if a multi-exon gene) no other genes
overlapping it or any other potential problems
Something about the gene is incorrect: missing
start or stop codon, does not have a canonical
intron exon boundary, or has a short intron or exon
Warning
2
(or is a short gene if it has a single exon)
Genes that are complete, but have alternative
splicing
Two or more genes that overlap each other
Something occurred and the gene is incorrect.
Usually caused by miss-ordered exons.
Alternative splicing
Overlapping
Error
6) To complete the table, repeat 3) and 4) for each of the non-error categories.
Using Fathom, complete the following table using your data
Count
Percent of Total
Total
Complete Genes
Warnings
Alternative splice
Overlapping
Errors
Single Exon
II. Grep Tutorial
grep [options] “pattern” filename
Command
Function
Example Usage
Grep
Searches for lines in the file
containing the search pattern
grep “GENE” annotation.gff
grep –i
Same as grep but is not case
sensitive
grep -i “$think of something” $_
grep –c
Counts the number of times the
pattern appears
grep -c “>” proteins.fasta
grep -An
Prints the line matching the pattern grep -A1 “scaffold1” genome.fasta
and n lines after the pattern
3
grep
Finds and prints every line that contains the pattern being searched for.
If you had a file with:
APPLE
pineapple
apples
And ran the command: grep “APPLE” file grep will print:
APPLE
$grep -i
The same as grep but is not case sensitive. So running the command: grep “APPLE” file grep will
print:
APPLE
apples
$grep -c
Will count the number of lines the pattern appears in in a file. Useful for counting the number of
sequences in a fasta file.
If you had a fasta file with sequence data:
>sequence1
GCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATG
CATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCA
TGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATG
CATGCATG
>sequence 2
AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGGTCTGTTCCAAGGGCCTTTGCGTCAGGT
GGGCTCAGGATTCCAGGGTGGCTGGACCCCAGGCCCCAGCTCTGCAGCAGGGAGGACGTGGCTGGGCTC
GTGAAGCATGTGGGGGTGAGCCCAGGGGCCCCAAGGCAGGGCACCTGGCCTTCAGCCTGCTCAGCCCTG
CCTGTCTCCCAGATCACTGTCCTTCTGCCATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGCGCTGCTG
and ran the command: grep -c “>” file grep will print:
2 (the number of lines containing > in the file which is also the number of sequences)
$grep -A n
Will print the line matching the pattern and n number of lines after it. So, running the command:
grep -A1 “sequence1” file grep will print:
4
>sequence1
GCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATG
CATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCA
TGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATG
CATGCATG
Some notes on grep:
If a line contains the pattern more than one time, the line will be printed or counted only once.
Grep can also make use of regular expressions.
Other options:
-n
Print the line number for lines that match the pattern
-B n
Print lines before the pattern (opposite of -An)
-H
Print the file's name before each line (when searching
multiple files)
-f file
Read the pattern from a file line by line
-v
Will print lines that don't match the pattern
-x
Print the exact match. The line can only
-m n
Stop after finding n matches
egrep
Uses the full set of regular expressions
fgrep
Does not use regular expressions and only searches of
literal strings. (Faster than grep)
Regular expressions:
Regular expressions are certain characters that stand for other characters. Often times multiple
types of characters or multiple characters.
Some (but definitely not all) Regular Expressions:
*
Wildcard, any character
.
Any character other than new line (\n)
\t
Tab
\s
Whitespace (space, Tab, new line)
^
Beginning of line
$
End of line
More regular expressions can be found at:
http://regexlib.com/CheatSheet.aspx?AspxAutoDetectCookieSupport=1
5
III. Creating a grep shell script
If you have a large number of search terms to extract with grep, executing a large number of grep
commands can be repetitive and overly time consuming. One way around this is by making a
shell script. A shell script is a list of commands that will be executed on after the other. Shell
scripts are easy to create and can run multiple commands with only one command.
Earlier in this tutorial we saw how grep -A1 will extract a sequence title and the sequence.
Extracting 5 sequences using a grep shell script:
1) Make a list of the sequences you want to extract from a large genome.
$nano
Enter the sequence names below:
___________
sequence_1
sequence_2
sequence_3
sequence_4
sequence_5
___________
Control X to exit
Save the file
2) Open the list in vi (vi is a command line level text editor that can handle regular expressions. Nano
cannot.).
$vi [filename]
3) In vi, add the grep -A1 command to the front of each line. This is a regular expression that
will find the beginning of each line and replace it with “grep –A1 ” The colon is the command
line for vi. The s option is search. The characters between the first two slashes (/^/) is the search
term and the characters between the last two slashes (/grep -A1 “/) is the replace term. The
special character ^ matches the beginning of each line.
:%s/^/grep -A1 “/
This will turn the list into:
grep -A1 “sequence_1
6
grep -A1 “sequence_2
grep -A1 “sequence_3
grep -A1 “sequence_4
grep -A1 “sequence_5
4) Now, add the rest of the grep command (” genome.fa >> output) to the end of each line.
:%s/$/” genome.fa >> output
The >> characters will append the grep output to the end of the file “output”.
This will turn the list into:
grep -A1 “sequence_1” genome.fa >> output
grep -A1 “sequence_2” genome.fa >> output
grep -A1 “sequence_3” genome.fa >> output
grep -A1 “sequence_4” genome.fa >> output
grep -A1 “sequence_5” genome.fa >> output
5) Save the file and close vi. The command below “shift-zz” will save and close the file in vi.
:ZZ
6) Open the file in nano
nano filename
7) Now, change the first “>>” to “>”. “>>” will append information to the end of a file. A single > will
create a new file if it does not exist yet and clear the contents of the file if it does already.
The list should now look like:
grep -A1 “sequence_1” genome.fa > output
grep -A1 “sequence_2” genome.fa>> output
grep -A1 “sequence_3” genome.fa >> output
grep -A1 “sequence_4” genome.fa >> output
grep -A1 “sequence_5” genome.fa >> output
8) Now add “#! /bin/bash” to the first line of the file.
The #! (known as shebang) tells the computer that the commands are to be sent to the program
bash which is a common UNIX shell and the one that both the virtual machines and the HHMI
cluster use by default.
9) Make the file executable by changing the file permissions. “u+x” will give users executable
permissions.
7
$sudo chmod u+x filename
10) Run the script using “./” below will tell the computer to look in the working directory rather than
the path specified in your .bashrc file for the program to execute that you just made.
$./filename
The script will now run and extract the 5 sequence headers along with their DNA from genome.fa and put
them in the file “output.”
IV. Transfer files from the HHMI cluster to your Linux virtual box using
the Remmina client FTP server.
Connect to the HHMI Cluster from Virtual Box using Remmina Remote Desktop
Remmina remote desktop is the default remote desktop application for the Ubuntu OS. This is the
equivalent of Putty on windows.
1) Select the “Dash Home” button which is similar to the “Start Menu” in Windows in the top left of your
desktop screen.
2) Type in “remm”
3) Select Remmina Remote Desktop.
4) Select the icon showing a paper with a plus sign to create a new connection.
5) Under “profile”:
Name: HHMI Cluster
Protocol: ssh – Secure Shell
Server: 192.112.102.21 (10.39.6.10 for JC internal users)
username: [enter your cluster username]
6) Connect
7) If a warning message saying the server is unknown appears, click ok.
8) Enter in your password and click ok.
Transfer Files from the cluster to Virtual box using sFTP
1) Click on the icon showing three gears in the top of the Remmina Remote Desktop window.
8
2) Select “Open Secure File Transfer”
3) Navigate within the GUI to the folder containing the file you want to download.
4) Select the file you want to download and click on the “Download” near the top of the Remmina
Remote Desktop window.
OPTIONAL: Upload files from your virtual machine to the cluster
1) Navigate to the directory you wish to upload your file to.
2) Click on “Upload”
3) Navigate to the file you wish to upload.
4) Select the file.
5) Click ok
V. Command line Blast Tutorial
The Basic Local Alignment Search Tool or BLAST is easily the most common and widely used piece of
bioinformatical software. BLAST can perform several different types of alignments with several different
forms of output. BLAST can use either publicly available, previously created databases to compare the
sequence of interest too, or it can be used with a custom database. This tutorial assumes you are
comparing a sequence file called genome.fa to another set of reference sequences called reference.fa.
Creating a custom BLAST database
1) Make a directory for blast from your home page, and copy your genome.fa file from your Maker
folder to blast folder. We will use the genome file to make a custom database that we can then use as a
reference subject to search for a protein query (tBlastn).
$mkdir blast
$cp maker/genome.fa blast/
2) Move to your blast folder and create a BLAST database from a reference set of fasta files using
genome.fa.
$cd blast
$makeblastdb -in [reference.fa] -dbtype [nucl or prot] -title
[reference] -out [reference]
9
This will create three database files all starting with “reference” and ending in certain file extensions. If
the initial sequences were proteins, dbtype would be set to prot and if they were nucleotides, dbtype
would be set to nucl.
Use BLAST to align the sequence of interest to the custom database
1) Run tBlastn as follows:
$[t]blast[n p x] -query proteins.fa -db reference -out blastresults.txt -outfmt 0
The outfmt (outformat) option tells BLAST what format to output the data in. Pairwise (default or –
outfmt 0) is a good way to visualize results. 7 (-outfmt 7) is a tabular output with comments indicating
column titles. This is a good format for exporting to a spreadsheet.
BLAST also has options to control stringency of match between subject and query. To see the options
call up the help documentation for the type of blast you are running. For example:
$tblastn –help
For example, to run the same query, but change the e-value to 1 x 10E-10:
$blastn -query genome.fa -db reference –evalue 1e-10 -out blastresults.txt -outfmt 0
Blast types
Blastn
Nucleotide query to nucleotide database
Blastp
Protein query to protein database
Blastx
Translated nucleotide query to protein database
Tblastn
Protein query to translated nucleotide database
Tblastx
Translated nucleotide query to translated
nucleotide database
VI. For Virtual Box Users: Installing software in a Linux OS
One of the most important, but often most difficult parts of doing computer work in Linux is installing
new software. Since there are a wide variety of UNIX varieties (aka distributions) and each individual
machine can have a different set up, programs are often written on one distribution but may not work
on another. UNIX has some support for this, but installation oftentimes depends on the user first
installing other programs that are required for the installed program to work properly (aka
dependencies). The user must then provide certain information on their system to the program being
installed in order for it to run from any directory.
10
There are four basic forms of software installation: 1) automatic installation using a graphical package
manager such as Ubuntu Software Center, Synaptic Package Manager, or Red-hat Package Manager
(RPM), 2) automatic installation using a command line package manager such as “apt-get” or “yum”, 3)
semi-automatic installation using the program “make” to help compile source code, and 4) direct
compilation of source code.
Installation using a graphical package manager is as simple as opening the package manager, searching
for the program you wish to install, and selecting install.
Installation using a command line package manager requires you to know the exact name of the
program you want to install and to use the command “sudo [apt-get or yum] install [program name]”.
The program “make” is UNIX program that looks for an input file called either “Makefile” or “makefile”.
The Makefile contains instructions on how to compile source code such as which compilers to use,
compilation options, and where to put the programs as they are compiled. The $PATH environment
variable tells the computer where to look for programs, and is set up in the .bashrc file that we copied
from ~sickler in module 2 above in order for all the programs to work for you on the cluster. If the
Makefile is set up correctly, there is no need to add programs to the $PATH environment variable.
Sometimes, an additional program called either “configure” or “configure.sh” is included, which will
either guess where certain programs are already installed or prompt the user for the locations of those
programs. The bottom line is that most software comes with either a readme or installation guide
specifying how to install the software. If the Makefile does not automatically add the programs being
installed, the user must adjust their path to include the location of the installed software.
Directly compiling a program from source is very rarely done except on the simplest of programs.
Example: Installing CEGMA
To practice installing programs in Linux, this tutorial will guide you through installing CEGMA (core
eukaryotic genes mapping approach). CEGMA is a program that aligns the most highly conserved
eukaryotic proteins to a draft genome and can be used to: 1) generate an initial training set for gene
predictor training, 2) measure completeness of an assembly as percent of CEGMA genes found. We will
start off by installing “genewise,” a dependency of CEGMA. We will use “apt-get,” a terminal based
package installer.
11
1) Install genewise using apt-get. It will download and install genewise from a central repository. You
will first need to update your repositories and operating system. NOTE: You may have to hit Control-C at
“headers” during the first step if your installation hangs up.
$sudo apt-get update
$sudo apt-get install wise
Enter your password
2) From home directory, download CEGMA to the virtual box using the command “curl” (Copy URL),
untar and decompress it using the command “tar.” You will first need to install Curl.
$sudo apt-get install curl
$curl http://korflab.ucdavis.edu/Datasets/cegma/cegma_v2.4.010312.tar.gz >
cegma.tar.gz
$tar –xvzf cegma.tar.gz
Above, x flags the program gzip, v flags verbose, z flags unzipping, f flags reading from file.
3) Because the file was a tar directory, we now need to go to the directory where CEGMA was unpacked.
Check the name of the directory using LS first.
$ls
$cd cegma…blah blah blah, just press tab
4) Use “make” to compile CEGMA.
$make
CEGMA could be added to the $PATH by running the command “make install”, but this tutorial will show
you how to manually add the programs to the $PATH.
5) Go back to your home directory and open your .bashrc profile.
$cd
$nano .bashrc
6) “Make” generated a folder called “bin” in your cegma directory. The bin directory contains the
program executables. You now need to add the path to the CEGMA bin directory to the end of your
.bashrc profile so that you can execute Cegma from any directory. Notice that the .bashrc file is hidden
12
because many years ago it was determined that only super geniuses should be able to edit the .bashrc,
and this is now you.
PATH=$PATH:/home/[username]/cegma_v2.4.010312/bin
Press control X to save and exit nano.
export $PATH
7) Close and reopen the terminal for the changes to .bashrc to take effect.
8) To validate that CEGMA is working type the following to see if the help files appear.
$cegma –h
9) Go to the Korf lab page describing the CEGMA software. See if you can figure out how to run it on
the mini-genome you used for Maker annotation earlier and interpret the output.
http://korflab.ucdavis.edu/Datasets/cegma/#SCT3. What fraction of CEGMA genes were found on the
mini-genome?
Assessment
The tools in this Module are essential for working with a draft genome and genome annotation. These
are perhaps best assessed as part of a project involving summarizing Maker results, or working with
subsets of scaffolds or annotations.
Discussion
References
Korf I. 2004. Gene finding in novel genomes. BMC Bioinformatics 5:59 [SNAP and FATHOM]
Haddock SHD, Dunn CW. 2011. Practical computing for biologists. Sinauer Associates
Korf I. Unix and Perl Primer for Biologists. korflab.ucdavis.edu/Unix_and_Perl/unix_and_perl_v3.0.pdf
13
Download