From: ISMB-95 Proceedings. Copyright © 1995, AAAI (www.aaai.org). All rights reserved. Subclass approach for mutational spectrum analysis Igor B. Rogozin, Galina V. Glasko, Evgeniy L Latkin* Institute of Cytologyand Genetics; 102_,avrentyevAve., Novosibirsk630090,Russia *RIMIBE at the NovosibirskState University; 2. PirogovaSt., Novosibirsk630090,Russia rogozin@cgi.nsk.su (Glickmunand Ripley, 1984), with repair systems(Topal et al. 1986;Bums et al. 1986). Abstract It shouldbe emphasized that the context of hotspotscan help specify the underlying molecnlar mechanismof Analysis and comparison of mutational spectra represents a burningquestion in molecularbiology. mutagenesis (Horsfallet all. 1990). Wereport an algorithm based uponthe SEM subclass A number of theoretical methods hasbeendeveloped approach(SEM- stochastique, estimations, maximifor mutationalspectra analysis. Several approacheshave zations). Anyreal mutationalspectrumis regardedas been applied for comparisonof two or several mutational a mixture of standard binomial distributions. The spectra (Adams& Skopek 1986; Piegorsch & Bailer separation proc~ureis run by rounds. Eachiteration 1994).Benigniet al. (1992)used the standardmultivariate includes simulation, maximizationand estimation. method.Multiple regression analysis have been applied The algorithm has been checked on randomspectra for construction of the best modelfor context effects on with the preset parameters and on real mutational the 2APinsertion (Stormo et al. 1986). However,the spectra. As hasbeenshown, anyrealmutational problem of identification of hotspots in the mutational specmlrn canberepresented asa mixture oftwoand spectra remains unresolved. The mutational spectrum morebinomial distributions, ofwhich onecontains induced by O6-methylguaninewas approximated by the hotspots ofmutation. Keywords: subclass approach, classification, Poisson distribution (Topal et al. 1986). Bywayof that mutational spectrum, SEMalgorithm, maximization, approach,the contentofcold spots ofmutation(.positions likelihood.function. with lower mutationfrequencies) was analyzed. However, with this approach, one can go as far as to reveal discrepancies between a real and expected mutational Introduction spectra, but fail to uncover the causes of these discrepancies0aotspots,coldspots). Novelgene engineering techniques have brought up a lot To solve the problem, we decided to approximatethe of information on the mutations observedin nucleotide real mutational spectrum by a mixture of binomial sequences. Thedata of the sort were called "mutational distributions. Weapplied the standard classification spectra". Analysisof these spectra has shownspontaneous method, slightly modified though. The programthat we and inducedmutations to be largely confined to certain have developedwastested on real mutationalspectra. As regions of nucleotide sequences. Examinationof such wasshown,the mutationalpositions of the spectra can be regions (mutational "hotspots") hasprovided evidence that theobserved nonrandom distribution ofmutations along groupedinto two or moreclasses, one of whichcontains thesequence mustbeaccounted forby somestructural hotspots. features ofthehotspot subsequences (theDNAcontext). An example of a mutational spectrum is presented in Statistical modelof the experiment Fig.la. Asis seen,a range of sites display elevated mutation frequencies (e.g., positions 84,185,202). Considertwosets Y=(yl..... yn), X=(xl,...,xn)of integers, TheideathatDNAcontext mayaffect mutability was whereyi stands for the numberof the position in the first suggested byBenzer (1961). Todate, evidence has sequenceat whicha mutation(s) can occur, and xi stands been gathered that this isquite so.Investigation into these for the nmnberof mutations at this position; n - total fenmres mayprovide valuable information aboutthe numberof positions. Hg.lpresents a mutational spectrum underlying molecnlar mechanisms. Theinfluence ofDNA andthecorresponding twosetsY andX. context maybe associated with base pair reactivity (Matters et al. 1986), with the secondarystructure of DNA Rogozin 309 (a) 40 50 60 2 g a a a c c agt a a C gt t at a at 80 0 C a g 70 5 gt at g c c ggt gt ct Ct t I00 1 cC t 2 0 G cGt g c g a a t gg c gt gt I0 C c a gt ag ggt gaa cC a a 130 0 a C g c aggc g gg c agc c gga 0 0 aC a t 120 6 t C a gt t 150 140 4 a a a Gt a a c g g a a g cgg 170 gct g a at 1 aC c t a c a a ct 0 ggCggG a 90 7 t C II0 190 14 gGC at 0 160 g cg 2 0 c GC c at 180 200 15 c t c 6 C c a a c c g 210 0 a a a C a gt c (b) Y= (42,57,58,75, 84,89, 90, 92,93, 95,120,140,174,185,186,188,191,198,202,206) x= (2,2,0,5,10,2,7, I, 2,0,6, 4,6, 14,0,0, i, 0,15,0) FIGURE I.Thesequence of lac/gene withmutations induced byNMAM(Burns etal.,1986)(a)and¢xnzespcmding X andY arrays (b). Number abovesequencemeansthe number of mutationsin the position. 0 meansthat no mutationswereobservedin the site. Sites detectedpreviouslyin lacI gene (Gordon et al., 1988)are shownby u~cletters. Let ~ be a random value denoting the number of mutations at any site of the sequence. In that case X is the sampling of values for ~. Theexperiment is briefly described as follows: The knownnumber, N, of copiesof thesequence is affected bya mutagene. Oneor lessmutations occurs in eachcopy. The numberof mutations at eachpositions is summed overallcopies andthesumrepresents theresult of the experiment. Formally, we may assume that N experiments have been made over one sequence. Here pi is the probability of a mutation occurring at the i-th site over one experiment. For the i-th site, N experiments present a succession of independent Bernoulli’s probes; and the probability of k successes and (N-k) failures over experiments is 510 ISMB--95 k k P(~=k) ffi C pi (I N N-k The sample of values xl,...,xn for ~ maybe regarded as a mixture of I binomial distributions with the different probabilities of mutations for any distribution of the mixture. Thus any distribution is responsible for the operation of certain mechanismof mutagenesis. By virtue oftheformula forthetotalprobability, the probability ofxjmutations occm’ring atthej-thsiteis 1 P(~=xJ) = ~ v(~=xJ [ Si) * P(Si) i=l whereSi is the i-th group of sites at each of whicha mutation occurs one experimentwith the probability pi; P(Si) is the probabilityof the Si groupbeingselected from amongall the groups;I is the numberof groups. Thusthe densityof distribution is f(x) 1 x x = ~ ~i C pi (i i=l N Over the first iteration, a memberof the samplemay co-occur in several classes. The class whichdoes not contain any memberis omitted from consideration, and the number of classes decreases. The matrix G is recalculated for the newnumberof classes: k giJ=giJ / E giJ J=l N-x ~i is theprobability ofSi beingselected. In order to assess thenumber of components and parametersof the mixtureof distributions, it necessaryto separate the distributions fromthe mixture. (ii) Maximization Nowevaluating the parametersof the classes. First, the weightsof the distributions are defined by the following formula: 12 pJ = Algorithm of separation i=l Weused the algorithm SEM(stochastique - estimations maximizations)(Celeux&Riebolt 1984) for assessing numberof distributions of a mixture with the unknown parameters and weights of each component. Tostart with, weset the initial numberof distributions in the mixture, kmax,which should be higher than the true numberof the componentdistributions. Wealso set upthe matrixof the posteriorprobabilitiesgil ..... gikmax (i=l,n) suchthat giJ=i/kmax, ~ giJ=l, i O<giJ<l Here gij is the posterior probability of xi belongingto classes j= 1,2,...,kmax. Threesteps of the algorithmfollowin succession. (i) Simulation Here we classify the membersxl ..... xn of the initial sample. By virtue of the matrix G={gij) (i=l ..... n; j=l,...JO the matrixE is built; k - number of classes. f I with the probability gij [ 0 with the probability l-giJ = eiJ As result, each memberof the samplexl ..... xn was assigned to one or more classes depending on gij randomly. It means that the rows of the matrix E={eij} are represented by polynomiallydistributed randomvalues (with the parametersgil,...,gik). ThematrixE sets up the following rule by whichto classify the membersof the sample: SJ={xi : eJ(xi)=l), i/n ~ eJ (xi) Then we evaluate the parameters of the classes by maximization of the likelihood function in f(xi,OJ) -) sup J=l,...,k xeSJ OJ (lii) Estimation TheBayes’reestimationof probabilities gij is basedupon the estimatesobtainedfromthe precedingiteration A giJ = pj A f(xi,OJ) / k A Z pl i=i A f(xi,ej) The iteration procedure terminates whenall members have been classified. If there are member’swith a high probability of membership in two classes (and therefore no classification is possible for them), the procedure terminates whenthe difference betweenthe values of the parametersof the i-th and (i+l)-th iteration has fallen short of accuracy. As was shownby Shlezinger (1964), the algorithm converges to the stationary points of the logarithmic likelihoodfunction used. Results Wechecked our version of the algorithm on random binomial dism~butions with various numbers of componentsand parameters. A standard randomgenerator wasused. Fig.2presenls the dependence of variance and the expectation of pl for a one- componentmixture on samplevolume.The original value of p was 0.01. Fifty at each J=l...k Rogozin 311 -7 Dp x 10 4 Epx 10 / _= lo --~ 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 FIGURE 2. The d ep~mdence of the variance (Dp*)and the expectation (Ep*)of pl for a one- componentmixture on samplevolume Sv. Thewluesof Epare gzvennear the correspondingpoints of the plot. random distributions were generated for sample volumes of 20 to 200 mutations. For each of the distributions generated, except for four, only one-component distribution was obtained. The four incorrect distributions were omitted fxom the calculation of the mean and variance. For each sample volume, variance (I39") and expectation (Ep*) were evaluated. As is seen, the variance of pl decreases as 1/~Sv with an increase of sample volume Sv. The algorithm was also tested on a mixture of two binomial distributions at the following parameter values: sample volume N = 100; I1 = 12 = 0.5; pl=0.09, p2 ranged from 0.01 through 0.08. The algorithm was unable to separate two distributions ff the difference betweenP1 and P2 was less than 0.02. In that case, the two distributions were "mingled" and could not be separated signif’manfly. The algorithm was also tested on real mutational spectra induced by alkylating agents MNAM and MNNG in the E.coli genome.The role of context in the incidence of these mutations is well known. The overwhelming majority of the mutations occurs at the G:C positions. Mutations at the positions with the RGcontext (G is the mutation position) occur several times as frequently as mutations at the YGpositions. Thus, these mutational specWa should represent a mixture of not less that.two distributions. The mutational spectrtma induced by the mutagene MNAM (I-Iorsfall & Glickman 1988) presented in Fig.la. Analysis of this spectrum by the algorithm as described suggests two distributions with the 312 ISMB-95 parameter values N = 100, pl = 0.005, ~.1 = 0.64, p2 = 0.08, X2. = 0.36. Oneof the distributions included only the sites with the YGcontext (number of mutations < 3, positions 42, 57, 58, 92, 93, 186, 188, 191,198,206), the other included the spectra with the RGcontext (numberof mutations > 3, positions 75, 84, 90, 120, 140, 174, 185, 202). The mutational spectra induced by MNNG (Burns al. 1987) was revealed within the 1- 580 bp region of the lacI gene. However,the results obtained are quite similar with what we had obtained for MNAM (within 1-232 bp): one of the distributions included only the sites with the YGcontext, the other included the spectra with the RG context.The parameter values were evaluated as N = 100, pl = 0.01, 7L1= 0.61, p2 = 0.1, k2 = 0.39. Discussion In all, the analysis of real and generated data has proven the efficiency of the algorithm suggested for analysis of mutational spectra. For any real spectrum studied, one of the classes contains hotspot sites. Thus, the algorithm may be applied for mutationalspectra _analysis. A lot of mutational spectra has nowbeen obtained. For only part of them, the mechanisms of mutagenesis has been revealed. The algorithm reported allows more detailed analysis of any spectrum to be performed. It can be successful at the study of correlations betweenhotspots and the nucleotide context features within any class of sites revealed. These prospects axe pursued by the system for analysis of mutational spectra MutAn(mutation analysis) (Rogozinet al. 1992) whichis currently under development. The programhas been written in C and tested on a PC (MS DOS)computer. The program is freely available fi’omthe first author(E.mail: rogozin~cgi.nsk.su). Acknowledgements This workwas supportedby grants fromthe RussianState Program"Frontiers in Genetics"; Russian Mini.~c,y of Sciences, Education and Technical Politics and the Departmentof Energy of the USA.Weare ~,nk’ful to V,F’donenkofor translation of this manuscript from Russianinto English. References Adams, W.T., andScope,k, T. R. 1987.Statistical test for the comparison of samples from mutational spectra. d.Mol.Biol.194:391-396. Begnini, R., Palombo, F., and Dogliotti, E. 1992. Multivariate statistical analysis of mutationalspectra of alkylating agents. Mmat.Res. 267: 77-88. Benzer, S. 1961. On the topology of the genetic fine structure. Proc. Natl. Acad.Sci. USA47: 403-415. Bolshoy, A., McNamura,P., Harrington, R. E., and Trifonov, E. N. 1991. Curved DNAwithout A-A: experimental estimation of all 16 DNAwedgeangles. Proc. Natl. Acad. Sci. USA88: 2312-2318. BumsP. A., Gordon,A. ft. E., and Glickman,B. W.1987. Influence of neighbouringbase sequenceon N-methyl-N’nitro-N-nitro-soguanidinemutagenesisin the lacI geneof Escherichiacoil J.Mol.Biol. 194: 385-390. Celeux, Q., and Riebolt, J. 1984. Reconnaissance de melange de densite et classification. Un algorithme d’apprentissage probabiliste: l’algorithme SEM.Rapports de Recerchede HNRIA.Centre de Rocquencort. Glickman, B. W.; and Ripley, L. S. 1984. Structural intermediates of deletion mutagenesis: a role for palindromic DNA.Proc. Natl. Acad. Sci. USA81: 512516. Horsfall, M. J., and Glickanan,B. W.1988. Mutationsite specificity of N-nitroso-N-methyl-N-alpha-acetoxybenzylmaine: a modelderivative of an esophageal carcinogen. Carcinogenesis9: 1529-1532. Horsfall., M.J., Gordon,A. J.E., Bums,P. A., Zielenska, M., van ¢ler Vliet, G. M. E., and Glickman;B.W.1990. Mutational specificity of alky]atlng agents and the influence of DNArepair. Envirom. Mol. Mutagen. 15: 107-122. Matters, W. B., Hartley, J. A., and Kohn,IC W. 1986. DNAsequence selectivity of guantine-N7 alkytron by nitrogen mustards.Nucleic Acids Res. 14: 2971-2987. Piegorsch, W.W., and Bailer, A. J. 1994. Statistical approaches for analyzing mutational spectra: some recomendationsfor categorial data. Genetics 136: 403416. Rogozin, I. B., Sredneva, N. E., and Kolchanov,N. A. 1992. Intelligent system of mutational analysis. In Modelling and ComputerMethodsin Molecular Biology and Genetics (Ramer,V.A., and Kolchanov,N.A., ¢ds.), 63-72. NovaSience Publishers, Inc., NY. Shlezinger, M. L On the spontaneousimagerecognition. Readingautomats.1965, NaychnaiaMisl’, Kiev, 38-45 (in Russian). Stormo, G. D., Schneider, T. D., and Gold, L. 1986. Quantmive analysis of the relationship betweennuclentide sequence and functional activity. Nucleic Acids Res. 14(16): 6661~679. Topal, M. D., Eadie, J. S., and Conrad M. 1986. 06methylg~mnine mutationand repair is nonuniform,J. Biol. Chem.261: 9879-9885. Rogozin 313