Subclass approach for mutational spectrum analysis

From: ISMB-95 Proceedings. Copyright © 1995, AAAI (www.aaai.org). All rights reserved.
Subclass
approach for mutational
spectrum analysis
Igor B. Rogozin, Galina V. Glasko, Evgeniy L Latkin*
Institute of Cytologyand Genetics; 102_,avrentyevAve., Novosibirsk630090,Russia
*RIMIBE
at the NovosibirskState University; 2. PirogovaSt., Novosibirsk630090,Russia
rogozin@cgi.nsk.su
(Glickmunand Ripley, 1984), with repair systems(Topal
et al. 1986;Bums
et al. 1986).
Abstract
It shouldbe emphasized
that the context of hotspotscan
help
specify
the
underlying
molecnlar mechanismof
Analysis and comparison of mutational spectra
represents a burningquestion in molecularbiology.
mutagenesis
(Horsfallet all. 1990).
Wereport an algorithm based uponthe SEM
subclass
A number
of theoretical
methods
hasbeendeveloped
approach(SEM- stochastique, estimations, maximifor mutationalspectra analysis. Several approacheshave
zations). Anyreal mutationalspectrumis regardedas
been applied for comparisonof two or several mutational
a mixture of standard binomial distributions. The
spectra (Adams& Skopek 1986; Piegorsch & Bailer
separation proc~ureis run by rounds. Eachiteration
1994).Benigniet al. (1992)used the standardmultivariate
includes simulation, maximizationand estimation.
method.Multiple regression analysis have been applied
The algorithm has been checked on randomspectra
for construction of the best modelfor context effects on
with the preset parameters and on real mutational
the 2APinsertion (Stormo et al. 1986). However,the
spectra.
As hasbeenshown,
anyrealmutational
problem
of identification of hotspots in the mutational
specmlrn
canberepresented
asa mixture
oftwoand
spectra
remains unresolved. The mutational spectrum
morebinomial
distributions,
ofwhich
onecontains
induced by O6-methylguaninewas approximated by the
hotspots
ofmutation.
Keywords: subclass approach, classification,
Poisson distribution (Topal et al. 1986). Bywayof that
mutational spectrum, SEMalgorithm, maximization,
approach,the contentofcold
spots
ofmutation(.positions
likelihood.function.
with lower mutationfrequencies) was analyzed. However,
with this approach, one can go as far as to reveal
discrepancies between a real and expected mutational
Introduction
spectra, but fail to uncover the causes of these
discrepancies0aotspots,coldspots).
Novelgene engineering techniques have brought up a lot
To solve the problem, we decided to approximatethe
of information on the mutations observedin nucleotide
real mutational spectrum by a mixture of binomial
sequences. Thedata of the sort were called "mutational
distributions. Weapplied
the standard classification
spectra". Analysisof these spectra has shownspontaneous
method,
slightly
modified
though. The programthat we
and inducedmutations to be largely confined to certain
have developedwastested on real mutationalspectra. As
regions of nucleotide sequences. Examinationof such
wasshown,the mutationalpositions of the spectra can be
regions
(mutational
"hotspots")
hasprovided
evidence
that
theobserved
nonrandom
distribution
ofmutations
along groupedinto two or moreclasses, one of whichcontains
thesequence
mustbeaccounted
forby somestructural hotspots.
features
ofthehotspot
subsequences
(theDNAcontext).
An example
of a mutational
spectrum
is presented
in
Statistical
modelof the experiment
Fig.la.
Asis seen,a range
of sites
display
elevated
mutation
frequencies
(e.g.,
positions
84,185,202).
Considertwosets Y=(yl..... yn), X=(xl,...,xn)of integers,
TheideathatDNAcontext
mayaffect
mutability
was
whereyi stands for the numberof the position in the
first
suggested
byBenzer
(1961).
Todate,
evidence
has
sequenceat whicha mutation(s) can occur, and xi stands
been
gathered
that
this
isquite
so.Investigation
into
these for the nmnberof mutations at this position; n - total
fenmres
mayprovide
valuable
information
aboutthe
numberof positions. Hg.lpresents a mutational
spectrum
underlying
molecnlar
mechanisms.
Theinfluence
ofDNA
andthecorresponding
twosetsY andX.
context maybe associated with base pair reactivity
(Matters et al. 1986), with the secondarystructure of DNA
Rogozin 309
(a)
40
50
60
2
g
a
a
a
c
c
agt
a
a
C
gt
t
at
a
at
80
0
C a g
70
5
gt
at
g
c
c
ggt
gt
ct
Ct
t
I00
1
cC
t
2 0
G cGt
g c
g a
a
t
gg
c
gt
gt
I0
C c
a
gt
ag
ggt
gaa
cC
a a
130
0
a C g c
aggc
g gg
c
agc
c
gga
0 0
aC
a
t
120
6
t C
a
gt
t
150
140
4
a a a Gt
a a
c
g g
a a
g cgg
170
gct
g
a
at
1
aC
c
t
a
c
a
a
ct
0
ggCggG
a
90
7
t C
II0
190
14
gGC
at
0
160
g
cg
2 0
c GC
c
at
180
200
15
c
t
c
6
C c
a
a
c
c
g
210
0
a
a
a
C
a
gt
c
(b)
Y= (42,57,58,75,
84,89, 90, 92,93, 95,120,140,174,185,186,188,191,198,202,206)
x= (2,2,0,5,10,2,7, I, 2,0,6, 4,6, 14,0,0, i, 0,15,0)
FIGURE
I.Thesequence
of lac/gene
withmutations
induced
byNMAM(Burns
etal.,1986)(a)and¢xnzespcmding
X andY arrays
(b). Number
abovesequencemeansthe number
of mutationsin the position. 0 meansthat no mutationswereobservedin the site.
Sites detectedpreviouslyin lacI gene (Gordon
et al., 1988)are shownby u~cletters.
Let ~ be a random value denoting the number of
mutations at any site of the sequence. In that case X is the
sampling of values for ~.
Theexperiment
is briefly
described
as follows:
The
knownnumber,
N, of copiesof thesequence
is affected
bya mutagene.
Oneor lessmutations
occurs
in eachcopy.
The numberof mutations
at eachpositions
is summed
overallcopies
andthesumrepresents
theresult
of the
experiment.
Formally, we may assume that N experiments have
been made over one sequence. Here pi is the probability
of a mutation occurring at the i-th site over one
experiment. For the i-th site, N experiments present a
succession of independent Bernoulli’s probes; and the
probability of k successes and (N-k) failures over
experiments is
510 ISMB--95
k k
P(~=k) ffi C pi (I N
N-k
The sample of values xl,...,xn for ~ maybe regarded as
a mixture of I binomial distributions with the different
probabilities of mutations for any distribution of the
mixture. Thus any distribution is responsible for the
operation of certain mechanismof mutagenesis.
By virtue
oftheformula
forthetotalprobability,
the
probability
ofxjmutations
occm’ring
atthej-thsiteis
1
P(~=xJ) = ~ v(~=xJ [ Si) * P(Si)
i=l
whereSi is the i-th group of sites at each of whicha
mutation occurs one experimentwith the probability pi;
P(Si) is the probabilityof the Si groupbeingselected from
amongall the groups;I is the numberof groups.
Thusthe densityof distribution is
f(x)
1
x
x
= ~ ~i C pi (i i=l
N
Over the first iteration, a memberof the samplemay
co-occur in several classes. The class whichdoes not
contain any memberis omitted from consideration, and
the number of classes decreases. The matrix G is
recalculated for the newnumberof classes:
k
giJ=giJ / E giJ
J=l
N-x
~i is theprobability
ofSi beingselected.
In order to assess thenumber of components and
parametersof the mixtureof distributions, it necessaryto
separate the distributions fromthe mixture.
(ii) Maximization
Nowevaluating the parametersof the classes. First, the
weightsof the distributions are defined by the following
formula:
12
pJ =
Algorithm of separation
i=l
Weused the algorithm SEM(stochastique - estimations
maximizations)(Celeux&Riebolt 1984) for assessing
numberof distributions of a mixture with the unknown
parameters and weights of each component.
Tostart with, weset the initial numberof distributions
in the mixture, kmax,which should be higher than the
true numberof the componentdistributions. Wealso set
upthe matrixof the posteriorprobabilitiesgil ..... gikmax
(i=l,n) suchthat
giJ=i/kmax,
~ giJ=l,
i
O<giJ<l
Here gij is the posterior probability of xi belongingto
classes j= 1,2,...,kmax.
Threesteps of the algorithmfollowin succession.
(i) Simulation
Here we classify the membersxl ..... xn of the initial
sample. By virtue of the matrix G={gij) (i=l ..... n;
j=l,...JO the matrixE is built; k - number
of classes.
f I with the probability
gij
[ 0 with the probability
l-giJ
=
eiJ
As result, each memberof the samplexl ..... xn was
assigned to one or more classes depending on gij
randomly.
It means that the rows of the matrix E={eij} are
represented by polynomiallydistributed randomvalues
(with the parametersgil,...,gik). ThematrixE sets up the
following rule by whichto classify the membersof the
sample:
SJ={xi
:
eJ(xi)=l),
i/n ~ eJ (xi)
Then we evaluate the parameters of the classes by
maximization
of the likelihood function
in f(xi,OJ)
-) sup J=l,...,k
xeSJ
OJ
(lii) Estimation
TheBayes’reestimationof probabilities gij is basedupon
the estimatesobtainedfromthe precedingiteration
A
giJ = pj
A
f(xi,OJ)
/
k A
Z pl
i=i
A
f(xi,ej)
The iteration procedure terminates whenall members
have been classified. If there are member’swith a high
probability of membership
in two classes (and therefore
no classification is possible for them), the procedure
terminates whenthe difference betweenthe values of the
parametersof the i-th and (i+l)-th iteration has fallen
short of accuracy.
As was shownby Shlezinger (1964), the algorithm
converges to the stationary points of the logarithmic
likelihoodfunction used.
Results
Wechecked our version
of the algorithm on random
binomial dism~butions with various numbers of
componentsand parameters. A standard randomgenerator
wasused.
Fig.2presenls the dependence of variance and the
expectation of pl for a one- componentmixture on
samplevolume.The original value of p was 0.01. Fifty
at each J=l...k
Rogozin 311
-7
Dp x 10
4
Epx 10
/
_=
lo --~
0
10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200
FIGURE
2. The d ep~mdence
of the variance (Dp*)and the expectation (Ep*)of pl for a one- componentmixture on samplevolume
Sv. Thewluesof Epare gzvennear the correspondingpoints of the plot.
random distributions were generated for sample volumes
of 20 to 200 mutations. For each of the distributions
generated,
except for four, only one-component
distribution was obtained. The four incorrect distributions
were omitted fxom the calculation
of the mean and
variance. For each sample volume, variance (I39") and
expectation (Ep*) were evaluated. As is seen, the variance
of pl decreases as 1/~Sv with an increase of sample
volume Sv.
The algorithm was also tested on a mixture of two
binomial distributions at the following parameter values:
sample volume N = 100; I1 = 12 = 0.5; pl=0.09, p2
ranged from 0.01 through 0.08. The algorithm was unable
to separate two distributions ff the difference betweenP1
and P2 was less than 0.02. In that case, the two
distributions were "mingled" and could not be separated
signif’manfly.
The algorithm was also tested on real mutational
spectra induced by alkylating agents MNAM
and MNNG
in the E.coli genome.The role of context in the incidence
of these mutations is well known. The overwhelming
majority of the mutations occurs at the G:C positions.
Mutations at the positions with the RGcontext (G is the
mutation position) occur several times as frequently as
mutations at the YGpositions. Thus, these mutational
specWa
should represent a mixture of not less that.two
distributions. The mutational spectrtma induced by the
mutagene MNAM
(I-Iorsfall
& Glickman 1988)
presented in Fig.la. Analysis of this spectrum by the
algorithm as described suggests two distributions with the
312
ISMB-95
parameter values N = 100, pl = 0.005, ~.1 = 0.64, p2 =
0.08, X2. = 0.36. Oneof the distributions included only the
sites with the YGcontext (number of mutations < 3,
positions 42, 57, 58, 92, 93, 186, 188, 191,198,206), the
other included the spectra with the RGcontext (numberof
mutations > 3, positions 75, 84, 90, 120, 140, 174, 185,
202). The mutational spectra induced by MNNG
(Burns
al. 1987) was revealed within the 1- 580 bp region of the
lacI gene. However,the results obtained are quite similar
with what we had obtained for MNAM
(within 1-232 bp):
one of the distributions included only the sites with the
YGcontext, the other included the spectra with the RG
context.The parameter values were evaluated as N = 100,
pl = 0.01, 7L1= 0.61, p2 = 0.1, k2 = 0.39.
Discussion
In all, the analysis of real and generated data has proven
the efficiency of the algorithm suggested for analysis of
mutational spectra. For any real spectrum studied, one of
the classes contains hotspot sites. Thus, the algorithm may
be applied for mutationalspectra _analysis.
A lot of mutational spectra has nowbeen obtained. For
only part of them, the mechanisms of mutagenesis has
been revealed. The algorithm reported allows more
detailed analysis of any spectrum to be performed. It can
be successful at the study of correlations betweenhotspots
and the nucleotide context features within any class of
sites revealed. These prospects axe pursued by the system
for analysis of mutational spectra MutAn(mutation
analysis) (Rogozinet al. 1992) whichis currently under
development.
The programhas been written in C and tested on a PC
(MS DOS)computer. The program is freely available
fi’omthe first author(E.mail: rogozin~cgi.nsk.su).
Acknowledgements
This workwas supportedby grants fromthe RussianState
Program"Frontiers in Genetics"; Russian Mini.~c,y of
Sciences, Education and Technical Politics and the
Departmentof Energy of the USA.Weare ~,nk’ful to
V,F’donenkofor translation of this manuscript from
Russianinto English.
References
Adams,
W.T., andScope,k, T. R. 1987.Statistical test for
the comparison of samples from mutational spectra.
d.Mol.Biol.194:391-396.
Begnini, R., Palombo, F., and Dogliotti, E. 1992.
Multivariate statistical analysis of mutationalspectra of
alkylating agents. Mmat.Res. 267: 77-88.
Benzer, S. 1961. On the topology of the genetic fine
structure. Proc. Natl. Acad.Sci. USA47: 403-415.
Bolshoy, A., McNamura,P., Harrington, R. E., and
Trifonov, E. N. 1991. Curved DNAwithout A-A:
experimental estimation of all 16 DNAwedgeangles.
Proc. Natl. Acad. Sci. USA88: 2312-2318.
BumsP. A., Gordon,A. ft. E., and Glickman,B. W.1987.
Influence of neighbouringbase sequenceon N-methyl-N’nitro-N-nitro-soguanidinemutagenesisin the lacI geneof
Escherichiacoil J.Mol.Biol. 194: 385-390.
Celeux, Q., and Riebolt, J. 1984. Reconnaissance de
melange de densite et classification. Un algorithme
d’apprentissage probabiliste: l’algorithme SEM.Rapports
de Recerchede HNRIA.Centre de Rocquencort.
Glickman, B. W.; and Ripley, L. S. 1984. Structural
intermediates of deletion mutagenesis: a role for
palindromic DNA.Proc. Natl. Acad. Sci. USA81: 512516.
Horsfall, M. J., and Glickanan,B. W.1988. Mutationsite
specificity of N-nitroso-N-methyl-N-alpha-acetoxybenzylmaine: a modelderivative of an esophageal carcinogen.
Carcinogenesis9: 1529-1532.
Horsfall., M.J., Gordon,A. J.E., Bums,P. A., Zielenska,
M., van ¢ler Vliet, G. M. E., and Glickman;B.W.1990.
Mutational specificity of alky]atlng agents and the
influence of DNArepair. Envirom. Mol. Mutagen. 15:
107-122.
Matters, W. B., Hartley, J. A., and Kohn,IC W. 1986.
DNAsequence selectivity of guantine-N7 alkytron by
nitrogen mustards.Nucleic Acids Res. 14: 2971-2987.
Piegorsch, W.W., and Bailer, A. J. 1994. Statistical
approaches for analyzing mutational spectra: some
recomendationsfor categorial data. Genetics 136: 403416.
Rogozin, I. B., Sredneva, N. E., and Kolchanov,N. A.
1992. Intelligent system of mutational analysis. In
Modelling and ComputerMethodsin Molecular Biology
and Genetics (Ramer,V.A., and Kolchanov,N.A., ¢ds.),
63-72. NovaSience Publishers, Inc., NY.
Shlezinger, M. L On the spontaneousimagerecognition.
Readingautomats.1965, NaychnaiaMisl’, Kiev, 38-45 (in
Russian).
Stormo, G. D., Schneider, T. D., and Gold, L. 1986.
Quantmive
analysis of the relationship betweennuclentide
sequence and functional activity. Nucleic Acids Res.
14(16): 6661~679.
Topal, M. D., Eadie, J. S., and Conrad M. 1986. 06methylg~mnine
mutationand repair is nonuniform,J. Biol.
Chem.261: 9879-9885.
Rogozin 313