Discovering gapped binding sites
Chengwei Lei
Dr. Jianhua Ruan
University of Texas at San Antonio
Department of Computer Science
Outline of Talk
• Motif Finding Background
• Gapped Motif Finding
– Chen’s method
– SPACE
• The PSO-motif algorithm
• Future Work
Introduction/Motivation
• Introduction: Identifying transcription factor
binding sites is an important aspect of the
analysis of genetic regulation. Many programs
have been developed for discovering motifs.
• Motivation: Previous algorithms need too
much memory or time to produce a result; this
work aims to develop a new algorithm that uses
less memory and less time to find motifs.
What is motif finding
• Motif finding, the process of discovering
a meaningful pattern (of nucleotides or
amino acids) that is shared by two or
more sequences, is an important part of
the study of gene function.
Cells respond to environment
[Figure: a cell receives various external messages (e.g. heat, food supply) and responds to environmental conditions]
Regulation of Genes
[Figure sequence, four slides. Labels: Transcription Factor (TF, a protein); RNA polymerase (a protein); DNA with a promoter and a gene. The DNA site associated with the TF is called a regulatory element, TF binding site, TF binding motif, or cis-regulatory motif (element). Final slide: transcription factor and RNA polymerase on the DNA at the regulatory element and gene, and a new protein.]
Real example
Look Like
• I need a refrigerator, so I go to a
refrigerator shop, I try to pick a very
beautiful refrigerator from a lot of
refrigerator(s). Finally I decide that I will
buy a GE refrigerator.
Look Like
• I need a refrigeretor, so I go to a
rafrigerator shop, I try to pick a very
beautiful refragerator from a lot of
refrigerater(s). Finally I decide that I will
buy a GE refrigarator.
Mismatch
…TACGAT…
…TAAAAT…
…TATACT…
…GATAAT…
…TATAAT…
…TATGTT…
.
.
.
Real example
Consensus: TATAAT
• …TACGAT…
• …TAAAAT…
• …TATACT…
• …GATAAT…
• …TATAAT…
• …TATGTT…
refrigerator
• refrigeretor
• rafrigerator
• refragerator
• refrigerater
• refrigarator
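The consensus on this slide can be reproduced with a simple column-wise majority vote. The sketch below is only an illustration under that assumption (the slides do not say how the consensus was derived); the helper names `consensus` and `mismatches` are hypothetical.

```python
from collections import Counter

def consensus(sites):
    """Return the most common letter in each column of equal-length sites."""
    if len({len(s) for s in sites}) != 1:
        raise ValueError("all sites must have the same length")
    # zip(*sites) walks the aligned sites column by column.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*sites))

def mismatches(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

# The six sites listed on the slide.
sites = ["TACGAT", "TAAAAT", "TATACT", "GATAAT", "TATAAT", "TATGTT"]
motif = consensus(sites)
distances = [mismatches(s, motif) for s in sites]
print(motif, distances)
```

For the six sites above this prints `TATAAT [2, 1, 1, 1, 0, 2]`: every site is within two mismatches of the consensus, just as each misspelling is close to "refrigerator".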
Gapped Motif
[Figure: the gene regulation diagram again, with labels New protein, RNA polymerase, Transcription Factor, DNA, Regulatory Element, Gene]
Gapped DNA binding?
Gapped Motif
• Together
• Separate
Together
mutations
n=5
5+3+5
L
• Red + blue + green, pooled: (5 + 15 + 5) / (25 + 15 + 25) = 25/65
• Red + xxx + green, gap ignored: (5 + 5) / (25 + 25) = 10/50
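The two ratios above are pooled counts, not ordinary fraction sums. A minimal sketch of that arithmetic follows, assuming each term reads as (mutations, positions) for one block across the n=5 sequences; that reading is inferred from the "mutations" label on the slide, and `pooled_rate` is a hypothetical helper.

```python
from fractions import Fraction

def pooled_rate(blocks):
    """Pool (mutations, positions) pairs into one overall rate."""
    mutations = sum(m for m, _ in blocks)
    positions = sum(p for _, p in blocks)
    return Fraction(mutations, positions)

# Values taken from the slide: two conserved blocks and a 3-column gap.
red, blue, green = (5, 25), (15, 15), (5, 25)

with_gap = pooled_rate([red, blue, green])  # 25/65, stored reduced as 5/13
without_gap = pooled_rate([red, green])     # 10/50, stored reduced as 1/5
print(float(with_gap), float(without_gap))
```

Scoring the gap columns together with the conserved blocks roughly doubles the apparent mutation rate, which is the reason to treat the gap as wildcards.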
Separate
mutations
n=5
L
• Red=5/25
• Green=5/25
• Pink=4/25
What can we do with the gap?
• Chen’s method
• SPACE
• PSO
Chen’s method
• ChIP-chip experiment
– Get a positive set Ga
– Get a negative set G-a
Compact Blocks
• Patterns that are found in Ga with a
proportion larger than a predefined
value (25% by default) are included in
the pattern list.
Compact Blocks
• Sufficiently long patterns (containing at least six
nonwildcards) are taken as candidate motifs.
Short patterns (blocks of 3 or 4 bp) are filtered out.
Hit/Seq ratio
• The sequences that match the pattern are
called the supporting sequences of a pattern.
It is possible that a pattern matches a
sequence at more than one position.
• The Hit/Seq ratio of a pattern is the average
number of occurrences of a pattern among its
supporting sequences.
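The definition above can be sketched in a few lines. This is an illustration only: the use of `n` as the wildcard letter, the counting of overlapping matches, and the function names are assumptions, not details taken from Chen's method.

```python
def count_hits(pattern, seq):
    """Count positions (overlaps allowed) where pattern matches seq.

    The letter 'n' in the pattern is treated as a wildcard (assumed convention).
    """
    width = len(pattern)
    return sum(
        all(p == "n" or p == s for p, s in zip(pattern, seq[i:i + width]))
        for i in range(len(seq) - width + 1)
    )

def hit_seq_ratio(pattern, seqs):
    """Average number of occurrences among the supporting sequences."""
    hits = [count_hits(pattern, s) for s in seqs]
    supporting = [h for h in hits if h > 0]  # sequences with at least one match
    if not supporting:
        return 0.0
    return sum(supporting) / len(supporting)

# Toy sequences, made up for illustration.
toy = ["TATATACA", "GGGG", "CTAGAC"]
ratio = hit_seq_ratio("TAnA", toy)
print(ratio)
```

Here `TAnA` matches the first toy sequence three times and the third once, so two supporting sequences share four hits and the ratio is 2.0.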
Block Filtering
• A pattern is filtered out if its Hit/Seq ratio
is larger than 15.
• A large Hit/Seq ratio implies that the
compact blocks are frequently repeated
in a single promoter region.
• In addition to the Hit/Seq ratio, they also
use an upper threshold for f-a (the
proportion of sequences in G-a that contain
a pattern P) to eliminate repetitive elements
present across different promoter
sequences. A pattern is retained only if
f-a is less than 0.16.
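The filtering rules on these slides can be combined into one predicate. The sketch below uses the thresholds quoted in the slides (25% support in Ga, Hit/Seq ratio of 15, 0.16 for f-a); the `PatternStats` record, its field names, and the example values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class PatternStats:
    pattern: str
    f_pos: float    # proportion of sequences in Ga containing the pattern
    f_neg: float    # f-a: proportion of sequences in G-a containing it
    hit_seq: float  # Hit/Seq ratio among supporting sequences

def keep_pattern(stats, min_f_pos=0.25, max_hit_seq=15, max_f_neg=0.16):
    """Apply the three block filters with the slide's default thresholds."""
    return (
        stats.f_pos > min_f_pos           # frequent enough in the positive set
        and stats.hit_seq <= max_hit_seq  # not repeated within one promoter
        and stats.f_neg < max_f_neg       # not common across the negative set
    )

# Made-up statistics, for illustration only.
candidates = [
    PatternStats("CGGnnnCCG", f_pos=0.40, f_neg=0.05, hit_seq=1.3),
    PatternStats("AAAAAA", f_pos=0.90, f_neg=0.80, hit_seq=22.0),
    PatternStats("TTCnnGA", f_pos=0.10, f_neg=0.02, hit_seq=1.0),
]
kept = [c.pattern for c in candidates if keep_pattern(c)]
print(kept)
```

Only the first made-up candidate passes: the second looks like a repetitive element and the third is too rare in the positive set.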
Growing Gapped Motifs
• Growing gapped motifs is similar to
growing compact motifs.
Pattern Ranking
• An identified pattern is filtered out
before ranking if its Hit/Seq ratio is larger
than 2, which is considered a reasonable
upper bound for selecting reliable
patterns.
• Sd measures the preferential occurrence of a
pattern in Ga relative to G-a.
• Sp is a formula value.
• Sc is the conservation score.
Sd
• The proportions of sequences in Ga and G-a
that contain a pattern P are denoted as fa and
f-a. The one-tailed two-sample proportion test
can be performed as follows:
• Patterns with a z score (Sd) smaller than $z_{1-0.01}$
are treated as nonsignificant and are
removed before the ranking process.
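The formula itself did not survive on this slide. As a hedged illustration, the sketch below uses the textbook pooled form of the one-tailed two-sample proportion test; whether Chen's method uses exactly this variant is an assumption, and the counts in the example are made up.

```python
import math
from statistics import NormalDist

def two_proportion_z(hits_pos, n_pos, hits_neg, n_neg):
    """z score for f_a > f_-a using the pooled-proportion standard error."""
    f_pos = hits_pos / n_pos  # f_a
    f_neg = hits_neg / n_neg  # f_-a
    pooled = (hits_pos + hits_neg) / (n_pos + n_neg)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_pos + 1 / n_neg))
    if se == 0:
        return 0.0  # pattern is in every sequence or in none
    return (f_pos - f_neg) / se

# Critical value z_{1-0.01} of the standard normal distribution.
Z_CRIT = NormalDist().inv_cdf(1 - 0.01)

# Made-up counts: pattern in 30 of 100 positive and 100 of 1000 negative sequences.
z = two_proportion_z(30, 100, 100, 1000)
significant = z >= Z_CRIT
print(round(Z_CRIT, 3), round(z, 3), significant)
```

A pattern whose z score falls below `Z_CRIT` (about 2.33) would be dropped before ranking.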
Sp
Sc
• Sc is the degree of evolutionary conservation
among a set of orthologous sequences
(from Saccharomyces paradoxus,
Saccharomyces kudriavzevii, Saccharomyces
mikatae, and Saccharomyces bayanus).
Result
Key point
• Filter !!
SPACE
• Generation of motif candidates
– Consider L=20, r=0.5, l=5, d=1 and q=4.
Refining Motif
• GAAGAnnnnnnnTAGAAAnn is a
spaced motif of five sequences.
• Motif Score(M) = (formula not recovered from the slide; a sum of two terms)
• Let E(M, e) be the expected frequency of M
with at most e mutations, based on a set
of background sequences.
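One way to estimate E(M, e) is an empirical count over the background set. The sketch below makes that assumption explicitly: the slides only say that the expectation is based on background sequences, so the per-window frequency and the function names are hypothetical rather than SPACE's actual procedure.

```python
def mismatch_count(motif, window):
    """Mismatches between a spaced motif and a window; 'n' positions are free."""
    return sum(m != "n" and m != w for m, w in zip(motif, window))

def background_frequency(motif, background, e):
    """Fraction of background windows matching motif with at most e mutations."""
    width = len(motif)
    windows = 0
    hits = 0
    for seq in background:
        for i in range(len(seq) - width + 1):
            windows += 1
            if mismatch_count(motif, seq[i:i + width]) <= e:
                hits += 1
    return hits / windows if windows else 0.0

# Toy background, made up for illustration.
toy_background = ["ACGTACAT", "TTTT"]
exact = background_frequency("ACnT", toy_background, e=0)
loose = background_frequency("ACnT", toy_background, e=2)
print(exact, loose)
```

Allowing more mutations can only raise the background frequency, so a motif should be judged against the expectation at the same e.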
Why PSO method
Background
• Particle swarm optimization (PSO) is a population-based
stochastic optimization technique inspired by the social
behavior of bird flocking or fish schooling.
• PSO shares many similarities with evolutionary computation
techniques such as Genetic Algorithms (GA), but it is simpler
and faster than GA.
• It has been shown to be effective in optimizing difficult
multidimensional problems in a variety of fields.
• PSO has been widely applied to ANN (Artificial Neural Networks),
nonlinear control, electromagnetics, antenna design, and
bioinformatics.
Some key terms used to describe PSO
Agent (Particle): One single individual in the swarm
Position: An agent’s N-dimensional coordinates, which represent a solution to the problem
Swarm: The entire collection of agents
Fitness: A single number representing the goodness of a given solution
Pbest: The location in parameter space of the best fitness returned for a specific agent
Gbest: The location in parameter space of the best fitness returned for the entire swarm
V: The velocity of each agent
[Figure: agents shown relative to gbest, Pbest1, and Pbest2]
$$V_n = V_n + C_1 \,\mathrm{rand}()\,(p_{best,n} - x_n) + C_2 \,\mathrm{rand}()\,(g_{best,n} - x_n)$$
$$x_n = x_n + V_n$$
• One agent’s movement in the PSO
algorithm.
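The update rule can be tried on a toy continuous problem. The sketch below is a generic PSO loop, not the PSO-motif algorithm: it assumes lower fitness is better, and the velocity clamp `v_max`, the search bounds, and the sphere test function are assumptions added for illustration.

```python
import random

def pso_minimize(fitness, dim, n_agents=20, iters=100,
                 c1=2.0, c2=2.0, v_max=1.0, bounds=(-5.0, 5.0), seed=0):
    """Generic PSO loop; returns (gbest, gbest_fit, history of gbest_fit)."""
    rng = random.Random(seed)
    lo, hi = bounds
    xs = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_agents)]
    vs = [[0.0] * dim for _ in range(n_agents)]
    pbest = [x[:] for x in xs]
    pbest_fit = [fitness(x) for x in xs]
    best = min(range(n_agents), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[best][:], pbest_fit[best]
    history = [gbest_fit]
    for _ in range(iters):
        for a in range(n_agents):
            for n in range(dim):
                # V_n = V_n + C1*rand()*(pbest_n - x_n) + C2*rand()*(gbest_n - x_n)
                v = (vs[a][n]
                     + c1 * rng.random() * (pbest[a][n] - xs[a][n])
                     + c2 * rng.random() * (gbest[n] - xs[a][n]))
                vs[a][n] = max(-v_max, min(v_max, v))  # assumed velocity clamp
                xs[a][n] += vs[a][n]                   # x_n = x_n + V_n
            fit = fitness(xs[a])
            if fit < pbest_fit[a]:
                pbest[a], pbest_fit[a] = xs[a][:], fit
                if fit < gbest_fit:
                    gbest, gbest_fit = xs[a][:], fit
        history.append(gbest_fit)
    return gbest, gbest_fit, history

def sphere(x):
    """Toy fitness: squared distance from the origin."""
    return sum(v * v for v in x)

gbest, gbest_fit, history = pso_minimize(sphere, dim=2)
print(gbest_fit)
```

Because gbest is replaced only when an agent finds a strictly better fitness, the recorded best value never gets worse from one iteration to the next.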
Flow chart of the PSO algorithm
• In a typical PSO algorithm, one wishes to control
the velocity so that at the beginning stage the
particles can fly around quickly inside the search
space, and when a particle approaches the
optimal solution, it should slow down so it can
converge quickly.
• …TACGATA…
• …TAAAAT…
• …TATACT…
• …GATAAT…
• …TATGAT…
• …TATGTT…
• One can achieve this if the fitness function is
continuous, since the velocity is updated
according to the distances between the current
position and the positions of pbest and gbest.
How to solve
• Remap
• Redefine
Remap the neighborhood information
[Figure: positions 1, 2, …, N along the sequence A C G T T C C A T … A C G T T C C T, annotated "mis is 6" and "mis is 1" (presumably mismatch counts)]
Redefine
n=5
L
• Green: Current
• Red: Gbest
• Pink: Pbest
• Blue: Random
Redefine
• Good for gapped motif finding.
– Quick
– Flexible
– High sensitivity
– High extensibility
Thank you !