Gene Prediction in Genomic Studies Ab-initio/Homology

Gene Prediction in Genomic Studies
Ab-initio based methods
Angela Pena Gonzalez
Lavanya Rishishwar
What Gene Prediction means and a brief background
INTRODUCTION
Introduction: Gene Prediction
• Gene prediction is the process of locating open
reading frames (ORFs) and, if the genes of interest
are of eukaryotic origin, delineating the structure
of their introns and exons.
• The ultimate goal is to describe all the genes
computationally with near 100% accuracy
Introduction: ORF
• Reading Frame: A sequence of DNA/RNA that is
translated into an amino acid sequence, three bases
at a time, each triplet sequence coding for a single
amino acid
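• Every region of double-stranded DNA has six possible
reading frames (three on each strand)
• An Open Reading Frame (ORF) is a stretch of a reading
frame uninterrupted by a stop codon; gene finders
usually look for the longest such stretches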
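A minimal sketch of the six-frame idea, assuming the standard stop codons (TAA, TAG, TGA) and ATG as the only start codon. The function names and the `min_codons` cutoff are illustrative, not taken from any particular gene finder.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq: str, frame: int, min_codons: int) -> List[Tuple[int, int]]:
    """Return (start, end) offsets of ATG...stop stretches in one frame.

    `end` is exclusive and includes the stop codon.
    """
    orfs = []
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i  # first start codon since the previous stop
        elif start is not None and codon in STOP_CODONS:
            if (i - start) // 3 >= min_codons:  # codons before the stop
                orfs.append((start, i + 3))
            start = None
    return orfs


def find_orfs(seq: str, min_codons: int = 2):
    """Scan all six reading frames: three per strand.

    Offsets for the "-" strand are relative to the reverse complement.
    """
    seq = seq.upper()
    result = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, min_codons):
                result.append((strand, frame, start, end, s[start:end]))
    return result


if __name__ == "__main__":
    for hit in find_orfs("CCATGAAATTTTAGCC"):
        print(hit)
```

Real gene finders also consider alternative start codons and score each candidate, so this only illustrates the frame bookkeeping.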
Introduction: ORF
• Not all translations have a biochemical support for
them, some are merely derived theoretically or
computationally
• In other words, each protein-coding gene is an ORF
but not every ORF is a gene
Introduction: Gene
• Genes are the functional and physical units of
heredity passed from parent to offspring.
• Genes are pieces of DNA, and most genes contain
the information for making a specific protein.
Introduction: Gene Models
Prokaryotic
Eukaryotic
Introduction: Coding v/s Noncoding
• Coding regions are the parts of DNA that give rise
to a mature messenger RNA, which is translated into
the specific amino acids of the protein product.
• Noncoding regions are the parts of DNA that do not
encode protein sequences. They may or may not be
transcribed into RNA.
E.g.: tRNA, rRNA, sRNA genes
Why do we need gene prediction algorithms?
NECESSITY
Necessity
• There has been a sharp upward trend in the number
of genomes sequenced in the past decade.
Necessity
[Figure: No. of Genomes in KEGG, plotted from 7/98 to 1/11 on a scale of 0 to 2000. Source: KEGG Genome: Release Update of Jan 2012]
Necessity
• Accurately predicting genes can significantly reduce
the amount of experimental verification work, which
is time-consuming, labor-intensive, and expensive to
carry out
• Current state-of-the-art gene predictors have a high
accuracy of ~90-99% (i.e., able to predict >90% of the
experimentally validated genes)
How do gene predictors make their predictions?
METHODS
Gene Prediction Methods
• Gene Prediction represents one of the most difficult
problems in the field of pattern recognition,
particularly in the case of eukaryotes
• The principal difficulties are:
o Detection of initiation site (AUG)
o Alternative start codons
o Gene overlap
o Undetected small proteins
Gene Prediction Methods
ACGTACTACGTACGTACGTACGATCGATCGATCGATCGATC
GACTGATCGATCGATCGATCGTACGTAGCGACTGACTGAC
TGATCGACTACGTAGCTGCAGTCAGTCGACTGACTGACTA
Ab-initio methods
Homology based methods
Ab-initio Methods
• Predict genes based on the given sequence alone.
• Consist of two types of models:
o Markov based models
o Dynamic Programming
A brief introduction of HMMs
• Hidden Markov models (HMMs) are discrete Markov
processes where every state generates an
observation at each time step.
• A hidden Markov model (HMM) is a statistical Markov
model in which the system being modeled is
assumed to be a Markov process with unobserved
(hidden) states.
Markov Model (Discrete Markov Process)
• A discrete Markov process is a sequence of random
variables q1,…,qt that take values in a discrete set
S={s1,…,sN} where the Markov property holds.
• Markov property: the next state depends only on the
current state, i.e. P(q_t | q_1, …, q_{t-1}) = P(q_t | q_{t-1})
• Parameters
– Initial state probabilities: πi
– State transition probabilities: aij
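With just the initial and transition probabilities, the probability of a state sequence is the initial probability of the first state times the transition probabilities along the path. A small sketch, assuming a toy two-state chain with made-up numbers (the states "A" and "B" and all values are illustrative only):

```python
import math


def sequence_log_prob(states, initial, transition):
    """Log-probability of a state sequence under a first-order Markov chain.

    initial[s]       plays the role of the initial state probability (pi_i)
    transition[s][t] plays the role of the transition probability (a_ij)
    """
    logp = math.log(initial[states[0]])
    for prev, cur in zip(states, states[1:]):
        logp += math.log(transition[prev][cur])
    return logp


# Toy parameters (assumed for illustration).
initial = {"A": 0.5, "B": 0.5}
transition = {
    "A": {"A": 0.9, "B": 0.1},
    "B": {"A": 0.2, "B": 0.8},
}

if __name__ == "__main__":
    # P(A, A, B) = 0.5 * 0.9 * 0.1
    print(math.exp(sequence_log_prob("AAB", initial, transition)))
```

Working in log space avoids numerical underflow on long sequences.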
From Markov Model to HMM
• HMMs are discrete Markov processes where each
state also emits an observation according to some
probability distribution, so we need to augment our
model with emission probabilities.
• Parameters
– Initial state probabilities: πi
– State transition probabilities: aij
– Emission probabilities: ei(k)
Markov Model: each state emits an observation
with 100% probability
Hidden Markov Model: each state emits an observation
according to a certain probability distribution
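Given the three parameter sets (initial, transition, emission), the most likely hidden state path for an observed sequence can be recovered with the Viterbi algorithm. The sketch below assumes a toy two-state model ("GC_rich" and "AT_rich") with invented probabilities; it is not the model used by any specific gene finder.

```python
import math


def viterbi(obs, states, initial, transition, emission):
    """Return the most probable hidden state path for `obs` (log space)."""
    # Scores for the first observation.
    V = [{s: math.log(initial[s]) + math.log(emission[s][obs[0]]) for s in states}]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for o in obs[1:]:
        scores, ptr = {}, {}
        for s in states:
            best_prev = max(
                states, key=lambda p: V[-1][p] + math.log(transition[p][s])
            )
            scores[s] = (
                V[-1][best_prev]
                + math.log(transition[best_prev][s])
                + math.log(emission[s][o])
            )
            ptr[s] = best_prev
        V.append(scores)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# Toy model (all numbers assumed for illustration).
states = ("GC_rich", "AT_rich")
initial = {"GC_rich": 0.5, "AT_rich": 0.5}
transition = {
    "GC_rich": {"GC_rich": 0.9, "AT_rich": 0.1},
    "AT_rich": {"GC_rich": 0.1, "AT_rich": 0.9},
}
emission = {
    "GC_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "AT_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

if __name__ == "__main__":
    print(viterbi("GGCCAATT", states, initial, transition, emission))
```

In gene prediction the hidden states would stand for labels such as coding and noncoding, and the observations would be the nucleotides.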
LET'S GET STARTED!
Say Adios to your Windows and get
to Linux!!
Say “Si” when you are ready to work on Linux!!!
A Quick Linux How-To Manual
• Terminal (and Kernel)!
“That’s Linux to me!” – Lava
• Basic navigation in the terminal:
– Change to a specific directory – cd
– List the contents of the folder – ls
– Go up one folder level – cd ..
– Copy a file from one location to another – cp
– Move a file from one location to another – mv
– Rename a file (file1) to (file2) – mv file1 file2
A Quick Linux How-To Manual
– Autocomplete – tab!
– Extract a file – tar -xvf [file name]
– Installing software:
• Navigate to the folder where “Makefile” is present
• Type make
• Wait for the build to finish
• Programs will be stored in the same folder or in a different folder
named “bin” (short for binaries)
That’s all Folks, Thank you for coming, Gracias!
Naah, just kidding! Let’s get down to business!
GeneMark
• Developed by Dr. Mark Borodovsky (from Georgia
Tech!)
• Works on elegant pseudo-HMMs and HMMs
• Several versions available – prokaryotic/eukaryotic,
self-training
Running GeneMark
• ./gmsn.pl --prok --out [output file] [genome file]
Glimmer3
• Works by creating a variable-length Markov model
from a training set of genes
• Uses the model to identify all genes in a DNA
sequence
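To make the "train on known genes, then score new sequence" idea concrete, here is a deliberately simplified sketch. It assumes a fixed-order Markov model with add-one smoothing and a uniform background, which is not Glimmer3's actual interpolated, variable-length model; the function names and training string are illustrative.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(sequences, order=2):
    """Count how often each base follows each context of length `order`."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return counts


def log_odds(seq, counts, order=2):
    """Log-odds of `seq` under the trained model versus a uniform background."""
    score = 0.0
    for i in range(order, len(seq)):
        ctx = counts.get(seq[i - order:i], {})
        total = sum(ctx.values())
        # Add-one smoothing so unseen contexts fall back to 1/4 per base.
        p = (ctx.get(seq[i], 0) + 1) / (total + len(ALPHABET))
        score += math.log(p / 0.25)
    return score


if __name__ == "__main__":
    model = train_markov(["ATGAAAATGAAAATGAAA"])  # toy "training genes"
    print(log_odds("ATGAAA", model))  # resembles the training data
    print(log_odds("CCGGCC", model))  # contexts never seen in training
```

A positive score means the sequence looks more like the training genes than like random sequence. An interpolated model would additionally blend several context lengths depending on how much training data supports each.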
Running Glimmer3
• It’s a 2-step process
1. A probability model of coding sequences, called an
interpolated context model (ICM), must be built.
./build-icm [model name] < [training sequences]
2. The program is run to analyze the sequences and make
gene predictions
./glimmer3 [genome] [icm_model] [output]
o Best results require the longest possible training set of genes
Glimmer3 programs (if you are curious)
• long-orfs: uses an amino-acid distribution model
to filter the set of ORFs
• extract: builds a training set from long,
nonoverlapping ORFs
• build-icm: builds an interpolated context model from
training sequences
• glimmer3: analyzes sequences and makes
predictions
RNA Prediction
Running tRNAscan-SE
tRNAscan-SE -B -o <outputfile1> -f <outputfile2> -m <outputfile3> <inputfile>
-B : search for bacterial tRNAs
This option selects the bacterial covariance model for tRNA analysis, and loosens the
search parameters for EufindtRNA to improve detection of bacterial tRNAs.
-o <file> : save final results in <file>
Specify this option to write results to <file>.
-f <file> : save results and tRNA secondary structures to <file>.
-m <file> : save statistics summary for run
The summary contains the run options selected as well as statistics on the number of tRNAs
detected at each phase of the search, search speed, and other statistics.
Output using “-o” parameter
Output using “-f” parameter
Yes I am serious. We are done. You are saved!
THANK YOU