OPEN ACCESS DOCUMENT Information of the Journal in which the present paper is published: Elsevier, Analytica Chimica Acta, 2009, 642 (1-2), pp. 117-126. DOI: dx.doi.org/10.1016/j.aca.2008.12.052 1 Classification of nucleic acids structures by means of the chemometric analysis of circular dichroism spectra Joaquim Jaumot1*, Ramon Eritja2, Susana Navea3 and Raimundo Gargallo1 1. Department of Analytical Chemistry, Universitat de Barcelona, Diagonal 647, Barcelona, E-08028 Spain. 2. Department of Structural Biology, IBMB-CSIC, Jordi Girona 18-26, Barcelona, E-08034 Spain 3. Acciona Agua, Av. de les Garrigues 22, El Prat de Llobregat, E-08820, Spain * Author to whom correspondence should be addressed Tel: +34-934034445 Fax: +34-934021233 E-mail: joaquim@apolo.qui.ub.es 2 1. ABSTRACT DNA can adopt structures in solution apart from the well-known Watson-Click double helix, ranging from disordered single strands to high-order structures such as triplexes or quadruplexes. Moreover, different topologies can be adopted depending on the polarity of the DNA strands. The elucidation of the structure and topology adopted by a DNA sequence is usually carried out by means of spectroscopic techniques, such as circular dichroism. In this work, the ability of several chemometric methods to efficiently classify DNA structures from circular dichroism data is tested. With this objective in mind, a data set including 50 experimental spectra corresponding to different DNA structures (random coil, duplex, hairpin, reversed and normal triplex, parallel and antiparallel Gquadruplex, and i-motif) has been analyzed by means of unsupervised Hierarchical Clustering Analysis, Principal Component Analysis and Partial Least Squares Discriminant Analysis. The results have shown than those methods allow efficiently the classification of DNA structures from circular dichroism spectra. Moreover, these classification methods also provided the most characteristic wavelengths used in the classification procedures. Keywords: DNA structure, classification, Principal Component Analysis, Clustering, Partial Least Squares Discriminant Analysis, Circular Dichroism spectroscopy 3 2. INTRODUCTION Since the initial elucidation of the secondary structure of B-DNA duplex by Watson and Crick in 1953 [1], additional DNA secondary structures have been described in the literature (Scheme 1) [2, 3]. These are related to base and sugar geometries different from those found in the B-DNA duplex structure and are possible due to the existence of base pairs different from those initially proposed by Watson and Crick. Well-known examples of these base pairs are inverted Watson-Crick, Hoogsteen, or Wooble base pairs [3]. Furthermore, it has been observed that interactions involving more than two nitrogenated bases are also possible. This allows the formation of higher order structures such as triplex (when three bases interact, for example, (thymine•adenine)*adenine or (cytosine•guanine)*guanine) or quadruplex (when four bases interact, for example, the G-tetrad in which four guanine bases simultaneously interact form a planar arrangement) (see Scheme 1) [4-6]. The structure adopted by a DNA in solution is often strongly fixed by the sequence of nitrogenated bases. As an example, the presence of several tracks containing guanines can favor the formation of G-quadruplex structures [7, 8], whereas the presence of several tracks containing cytosines can favor the formation of i-motif structures at pH values near 5 -6 [9]. It is interesting to point out that these DNA secondary structures can be built up by means of intramolecular or intermolecular interactions [3]. In the first case, we will have a DNA structure built up by just a single strand. This is the case of the hairpin structure (Scheme 1), which is a intramolecular duplex consisting of an ordered part with base pairs (stem) and another part without base pairs (loop). Intramolecular folding can also be observed in higher order structures such as triplexes or quadruplexes [4]. The complexity of DNA secondary structures increases if we also consider the variability in the structures caused by the topologies that can adopt the different interacting strands. Thus, it is possible to distinguish among parallel (if the chains run in the same sense) 4 and antiparallel (if the chains run in opposite sense) topologies, or even mixtures of both. Finally, we have to consider the disordered structures or random coil [3]. In this case, there are no hydrogen bonds between the nitrogenated bases and, because of this, it is not possible the formation of ordered structures. The random coil is observed when the nucleic acids are in denaturating conditions, such as high temperature, extreme pH values or in the presence of denaturing agents. Several spectroscopic techniques can be used to monitor the formation and stability of these structures, like UV molecular absorption, circular dichroism (CD) or Nuclear Magnetic Resonance. Among these, CD in the UV region can be considered the most appropriate technique because the measured instrumental response is extremely sensitive to the distance between the interacting strands, the inclination and distance between the nitrogenated bases and the axis of the structure, with an acceptable cost [10-12]. Therefore, CD spectroscopy is widely used to distinguish between ordered and disordered structures and, also, between different types of ordered structures. In this work, we have tested the ability of CD spectroscopy and Chemometrics to efficiently classify DNA secondary structures. This classification has been first attempted by using unsupervised classification methods such as Hierarchical Clustering Analysis (HCA) [13] and Principal Component Analysis (PCA) [14] in order to explore the data set and obtain different sample groups. Finally, Partial Least Squares Discriminant Analysis [15] (a supervised method) has been used to model the different DNA structures classes from the CD spectra. 3. EXPERIMENTAL DNA synthesis Oligonucleotide sequences were synthesized on an Applied Biosystems 392 DNA synthesizer using the 1 mol scale synthesis cycle. Standard phosphoramidites were used for the natural bases. Sequences were deprotected using standard protocols 5 (concentrated ammonia, 55ºC, and overnight). After deprotection, oligonucleotides were purified using purification cartridges and, finally, desalted using Sephadex G-25 cartridges (NAP-10, Amersham Biosciences). Besides, the parallel-stranded hairpins were prepared as described elsewhere [16, 17]. 5’-5’ Hairpins were prepared in three steps. First, the pyrimidine part was prepared using reversed C and T phosphoramidites and reversed C-support (linked to the support through the 5' end). Second, a hexaethyleneglycol linker was added using a commercially available phosphoramidite. Third, the purine part was assembled using standard phosphoramidites. For the preparation of 3’-3’ hairpins a similar approach was used. In this case, the purine part was assembled first, followed by the hexaethyleneglycol. The pyrimidine part was the last to be assembled using reversed phosphoramidites. Finally, oligonucleotides sequence 5’-T12-(EG)6-A12-(EG)6-T12-3’ was assembled using standard phosphoramidites and the hexaethyleneglycol linker. The synthesis of 3’-T125’-(EG)6-5’-A12-(EG)6-T12-3’ was prepared in two steps. First, DMT-(EG)6-5’-A12-(EG)6T12-3’ was assembled using standard phosphoramidites and the last dodecathymidine sequence was assembled using reversed phosphoramidites. Sample Preparation Samples described in Table I were prepared at a concentration of 3 M in strand. DNA concentration was determined by measuring the UV absorbance at 260 nm at 90ºC and calculating the concentration by means of the nearest-neighbor method as implemented in Oligo Parameter Calculation [18]. Appropriate volumes of phosphate (pH 6.9) or acetate (pH 5.1) buffer solutions were added to the samples. For the preparation of buffers, NaH2PO4 (Panreac, a.r., Spain), KH2PO4 (Panreac, a.r., Spain), CH3COOH (Merck, a.r., Germany) and CH3COONa (Panreac, a.r., Spain) were used. Ionic strength was adjusted to 150 mM by adding appropriate volumes of KCl, NaCl and MgCl2 stock solutions. KCl (Merck, a.r., Germany), NaCl (Merck, a.r., Germany) 6 and MgCl2 (Panreac, a.r., Spain) were used. The samples used to test the prediction ability of the proposed models consisted of equimolar mixtures of two complementary strands. These samples were prepared by using the same ionic strength buffer as for the other samples but the pH was adjusted using small volumes of HCl (Panreac, a.r., Spain) or NaOH (Panreac, a.r., Spain). All the solutions were prepared in Ultrapure water (Millipore, France). Finally, samples were heated at 90ºC for 10 minutes and allowed to renaturalise, cooling slowly until room temperature. Oligonucleotides samples were kept at 4ºC until measurement. Spectroscopic measurements CD spectra were recorded on a Jasco J-810 spectropolarimeter equipped with a Julabo F-25/HD control unit. Spectra were recorded between 220 and 360 nm (data pitch: 0.5 nm; scan mode: continuous, sensitivity: 10 mdeg, speed: 50 nm/min, response: 4 s, bandwidth: 1 nm, 2 accumulations). Spectra were recorded using a Hellma quartz cell with pathlength of 10 mm and volume of 1400 l. Measurements were carried out at two different temperatures: 20ºC in annealing conditions and 85ºC in denaturing conditions. Samples used for building up the chemometrical models Table I lists all the DNA sequences used in this work. The data set includes disordered single stranded DNAs, B-DNA duplex, intramolecular and intermolecular triplexes, parallel and antiparallel G-quadruplexes, and i-motifs. Disordered DNAs include samples whose CD spectrum has been measured at high temperature where it is expected that the secondary structure is lost (samples 1-10) [3]. Moreover, some sequences which do not form secondary structures have been included (samples 11-12). Duplex structures include both intra- and intermolecular structures and parallel and antiparallel topologies. The data set includes two intramolecular triplex structures: one normal (sample 28) and another reversed 7 structure (sample 29). Several DNA structures involving four strands have been included. First, G-quadruplexes, which is a structure formed in guanine-rich sequences. The data set contains both parallel (samples 30 - 35) and antiparallel (samples 36 - 41) G-quadruplexes. Second, the data set includes two spectra corresponding to i-motif structures (samples 42 - 43). These are only stable at pH 5 - 6 because its formation requires half-protonation of cytosine bases. For that reason, the CD spectra of these sequences have been measured in acetate buffer. Finally, 7 samples prepared by mixing two DNA sequences have been included. These samples could potentially present more than one structure simultaneously. Hence, samples 44 - 45 can produce a mixture of duplex and G-quadruplex structures due to the mixing of a G-quadruplexforming oligonucleotide with its complementary strand. In some samples (46 - 50), duplex and triplex structures could be simultaneously present after the addition of a single strand target to a hairpin structure. 4. CHEMOMETRICAL METHODS Hierarchical Clustering Analysis (HCA) Cluster analysis is used to classify objects, characterized by the values of a set of variables, into clusters or groups [14], in such a way that one object within a cluster is more closely related to another object in the same cluster than to another object assigned to a different cluster. In HCA the data are not partitioned into a particular cluster in a single step. Instead of this, a series of partitions takes place, which may run from a single cluster containing all objects to n clusters each containing a single object. In order to build up these groups a measurement of the similarity between the different objects is considered. This measurement is also known as the distance between the objects considered. There are several methods to measure distances and its selection will influence the shape of the clusters [19]. Examples of these distances are the Euclidean distance, City block distance or Mahalanobis distance. 8 Among all the linkage cluster methods in this work the agglomerative Ward’s method has been selected. Ward’s clustering method generates the different clusters in order to minimize the loss associated with each cluster. At each step in the analysis, the union of every possible cluster pair is considered and the two clusters whose fusion results in minimum increase in 'information loss' are combined. Information loss is defined by Ward in terms of an error sum-of-squares criterion. Principal Component Analysis Principal Component Analysis (PCA) is a multivariate technique that allows the reduction of matrices to their lowest orthogonal space [14]. PCA assumes a bilinear model to explain the observed data variance using a reduced number of factors (also known as principal components): X = U VT + E Equation 1 In particular, the principal components identified by PCA are linear combinations of the original variables which are orthonormal (orthogonal and normalized to unit length) and explain maximum variance. The goal of PCA is to represent the variation presents in many samples using the smallest number of components [20]. A new row space is built up in which to plot the samples by redefining the axes using factors rather than the original measured variables. The new axes, the principal components, allow the investigation of data matrices with many variables and the display of the true multivariate nature in a relatively small number of dimensions. In PCA, the matrix related to the sample contributions (U), is called the score matrix, and the matrix related to the variables contributions (VT) is called the loadings matrix. By retaining only the significant components, one compress the relevant data information into these two data matrices, U and VT, and, supposedly, the random noise contribution (E) is suppressed. In the present study, X contains the CD spectra for the different samples considered (Table I). Therefore, X contains 50 rows (corresponding to the number of samples) and 9 282 columns (corresponding to the number of measured wavelengths). In this case, the scores matrix, U, provides information about the samples (DNA structures) distribution and grouping, whereas the loadings matrix, VT, provides information about the most relevant wavelengths used to obtain this classification. Partial Least Squares Discriminant Analysis (PLS-DA) PLS-DA is a variant of PLS used as a classification tool. In this method, X contains the input information (spectra) about the objects to be classified and Y the class membership information [15]. So, this method fulfills the general equation of PLS methods: Y=XB Equation 2 where X and Y are represented by their latent variables and B contains the regression coefficients in the calibration step. In this work, PLS-DA has been applied to a subset of the full data set (calibration data set) whereas the remaining samples have been used as a validation data set. The validation set includes samples 44 - 50 (in which there is a mixture of sequences) and 9 additional samples selected by using a Kennard-Stone algorithm [21]. A crossvalidated leave-one out model on the calibration data set was used to test the ability of the method to carry out the class recognition. The threshold value to separate different classes is calculated using a Bayesian statistical approach and allows to separate DNA structures in the class and out of the class. Software The software used in this work for the PCA and PLS-DA was the PLS toolbox 3.5 for MATLAB® from Eigenvector Research. HCA was performed using pdist (calculates the pairwise distance between observations using the distance measurement method specified by the user), linkage (creates a hierarchical clustering tree using the algorithm 10 specified by the user) and dendrogram (generates the dendrogram plot) functions of the Statistics Toolbox for Matlab®. 5. RESULTS A) Building up the chemometrical models Figure 1 shows the measured CD spectra that have been later analyzed. Baseline subtraction and Savitzky-Golay smoothing [22] data pretreatments have been applied prior to the analysis. The results obtained have been organized in three blocks corresponding to the considered chemometrical methods used in the classification procedure. First, we have considered the unsupervised classification methods (Hierarchical Clustering and Principal Component Analysis) and, finally, the supervised classification method (Partial Least Squares – Discriminant Analysis). Hierarchical clustering analysis (HCA) As explained in the “Chemometrical Methods” section there are two parameters (distance measurement and linkage method) that should be optimized to built up a reliable dendrogram. Several options for these two parameters have been investigated and, finally, the Euclidean distance, as distance measurement, and the Ward’s method, as linkage method, were selected. Other distance measurements, such as the Semieuclidean or Chebychev distances, and other linkage methods, such as the complete or the weighted methods, also provided acceptable results. Figure 2 shows the calculated dendrogram. Two main branches can be clearly distinguished. The branch on the left contains the disordered DNAs together with several triplex, i-motifs and antiparallel G-quadruplex structures, whereas the branch on the right includes most of the parallel G-quadruplex and duplex structures listed in Table I. Based on the known CD signatures for these structures and on the characteristics of the data set studied in this work, it can be deduced that the right 11 branch includes those DNA secondary structures showing a positive CD band at 260 nm (Figure 1), whereas the left branch contains those structures which do not show this CD signature [10]. Now, we will study in detail each one of these two main branches. The left one contains in turn three clusters. The first one (samples 2 to 16) undoubtedly contains the samples that present a disordered structure. This cluster includes samples 1 to 10 (measured at denaturing conditions (i.e. at high temperature) and samples 11 and 12, whose base sequence does not allow the formation of higher order structures. In all these cases, the measured CD spectra showed low intensity (less than 5 mdeg) and no significant CD signatures. Finally, this cluster also contains samples 16 and 41. CD spectra of these samples showed weak signals probably because of the ionic medium in which the DNA sequences were dissolved (Na+ cations). For instance, in the case of sample 41, it is known that the formation of the G-quadruplex structure is clearly favored in the presence of K+ over Na+ [7]. The second cluster includes 5 samples that show weak positive bands between 270-290 nm and a strong negative band around 250 nm. These five samples are the two triplex structures (samples 28 and 29) that are present in the data set and three B-DNA duplex structures: the Dickerson oligonucleotide (sample 19) and the two d(CGCGCGCG) oligonucleotides, either in potassium (sample 13) or sodium saline medium (sample 14). Finally, the third cluster can be assigned to the antiparallel quadruplex structures. These samples are characterized by a positive band around 285-290 nm and weaker bands negative and positive, respectively, around 265 and 240 nm. Moreover, we can distinguish that samples 42 and 43 correspond to the structure known as i-motif while samples 32, 38, 39 and 40 correspond to the antiparallel G-quadruplex structure. As explained above, the right branch corresponds to samples that show a strong positive CD band at 260 nm. This band has been assigned to both the duplex DNA and to the parallel G-quadruplex. Differentiation of these two structures within the different clusters is not so obvious. However, a careful analysis of CD spectra allows us to 12 explain the differentiation between the two major clusters that may be observed in this branch. Hence, the cluster on the left (samples 15 to 33) corresponds to DNA samples whose CD spectra showed additional features. Thus, these samples show small contributions in the CD spectrum around 280 nm possibly due to the presence of mixed topologies or to the existence of additional structures in solution. A clear example of this behavior is sample 33 that shows a maximum at 260 nm and a shoulder around 290 nm. In this case, it has been proposed a mixed parallel / antiparallel topology for this sequence reference [23]. On the contrary, the cluster on the right (samples 17 to 50) corresponds to DNA structures whose spectra only show maxima around 260 nm and minima around 240 nm. Principal Component Analysis (PCA) The first step in the creation of the PCA model was to determine the optimal number of components needed to explain most of the variance of the data whereas overfitting is avoided. In this case, the selected number of components was 3, explaining approximately 91% of the data variance (PC1 explains a 62.1% of the variance, PC2 a 17.5% of the variance and PC3 a 11.3% of the variance). The results obtained are shown in Figure 3. The loadings plot allows us the selection of the key wavelengths for each one of the 3 components (Figure 3a) whereas the scores plot allows us the classification of the samples (Figure 3b-d). The first component separates samples showing a clear positive signal at 260 nm, like B-DNA duplexes and parallel G-quadruplexes, from the others. In fact, the shape of the corresponding loading reminds that of the typical CD spectrum for a B-DNA duplex [11]. The second component models samples that show a positive signal around 285 nm and negative signal around 260 nm. This group comprises i-motifs (samples 42 and 43) and also some characteristic B-DNA duplex structures (samples 13, 14 and 19). In this case, although described as B-DNA, the CD spectra of these samples is clearly 13 different from the one typical associated with B-DNA (which is similar to the loading of the first component). This is due to the high content of guanine and cytosine bases in these sequences which produce several distortions to the canonical B-DNA structure [3]. Finally, the third component models samples which show positive signals around 295 nm and 245 nm. This group mainly comprises antiparallel G-quadruplex, which show a typical signature at 295 nm. On the other hand, samples 13, 14, 19, 28 and 29 show a negative correlation because their CD spectra present a significant negative signal around 250 nm. Looking at the scores plots, it is interesting to point out that DNA samples showing a disordered structure are very close to the origin of coordinates in all three principal components. This is probably because these samples do not show any significant CD signal and, therefore, are not relevant to any of the three components. Globally, the results obtained with PCA are consistent with those previously obtained with HCA. Partial Least Squares Discriminant Analysis (PLS-DA) Finally, the supervised PLS-DA method has been applied to classify the distinct DNA structures based on their CD spectra. In order to analyze matrix X it is necessary to define previously the class membership information (matrix Y). Hence, PLS-DA analysis has been carried out from two different viewpoints. Firstly, the classes in Y have been defined from the information previously obtained with HCA and PCA. Secondly, the classes in Y have been defined from the information about the DNA structures found in the literature (as listed in Table I). In both cases the data set has been split into a calibration set and a validation set. The validation set contains 16 samples, including 7 samples in which there is a mixing of two DNA sequences (samples 44 to 50) and 9 samples selected using the Kennard-Stone algorithm. 14 The results obtained when using the first viewpoint are shown in Figure 4. In this case, 3 latent variables have been used to build the model in B in Equation 2. Figure 4a shows the modeling of the samples with disordered structure in which the CD spectra do not show any significant feature. As expected, it is difficult to classify these samples into a single class because of the absence of key features in their CD spectra. On the other hand, a good classification of the remaining classes is obtained. Hence, triplex and B-DNA structures (which were classified in the second HCA cluster) appear above the calculated threshold (Figure 4b). Moreover, sample number 14 included in the validation set have been properly classified. Similarly, antiparallel G-quadruplex (which were classified in the third HCA cluster) appear quite well solved (Figure 4c), despite a false positive that corresponds to the sample 33 (this sample shows the maximum in the spectrum at 260 nm and a significant contribution around 290 nm). Also, it can be seen that validation sample number 40 has been correctly classified. Finally, the last HCA cluster, which corresponds to B-DNA duplex and parallel G-quadruplex structures, has been also correctly classified by the PLS-DA method (Figure 4d). PLS-DA allowed the prediction of the structure for samples consisting in a mixture of two sequences (for example, samples 44 and 45). It has been attempted to carry out the classification of these samples distinguishing between the two subgroups (duplex and quadruplex) that can be seen in HCA or PCA but the results obtained were not satisfactory. Finally, the PLS-DA results obtained when considering the second viewpoint are shown in Figure 5. In this case, the classification was considerably more complex than in the previous case because it does not take into account the spectral properties of the different DNA structures. So, in the building of the model we have only considered the a priori expected structure. As observed previously when using the classes predicted by HCA, the classification of the disordered samples is not good (Figure 5a). These samples are distributed throughout all the space, hampering an efficient classification. In contrast, the 15 classification of the duplex structures is correct (Figure 5b) as most of the samples are above the calculated threshold. However, two false negatives (samples 22 and 25) and two false positives for G-quadruplex structures (samples 33 and 37) can be observed. Looking at the validation set, samples that were expected to be duplex structures have been classified correctly. Figure 5c shows the classification of the two samples identified as triplex structures and its difference from all the other samples. Finally, Figure 5d shows the classification obtained for both parallel and antiparallel Gquadruplexes. In this case, we have obtained almost a correct classification for the considered samples despite the existence of three false negatives and three false positives. In the case of the validation set, the samples that could be expected as Gquadruplex structures (for instance, samples 30 and 40) have been properly classified. The comparison of the two calculated PLS-DA models allows us to determine that using information obtained by HCA or PCA provided us a better model. Despite this, the results obtained with the model created only with the structural information available in the literature can be considered acceptable. B) Application examples of the proposed chemometrical models to unknown samples Two additional samples (see last rows of Table I) were used to test the prediction ability of the previously proposed HCA, PCA and PLS-DA models. Both samples contain a equimolar mixture of a guanine-rich strand and of a cytosine-rich strand but were measured at different pH values. Guanine-rich strands can form antiparallel or parallel G-quadruplex structures from neutral to slightly acid pH values. Cytosine-rich strands form antiparallel i-motif structures at slightly acid pH values. At neutral pH values, it is expected the formation of B-DNA duplex structure. According to this, the major structure in sample A, prepared at pH 7.1, would correspond to a B-DNA duplex (Figure 6a). On the contrary, it is more difficult to predict the structures present in sample B, prepared at pH 2.9, because the protonation of cytosine bases will hinder 16 the formation of the B-DNA duplex structure. Therefore, a mixture of an i-motif and an G-quadruplex could be expected. Hierarchical clustering analysis (HCA) A new HCA model was built up using all 52 samples listed in Table I. The Euclidean distance and the Ward’s method were used for measuring distances and linking, respectively. The new dendrogram was almost identical to that previously calculated when only the 50 calibration samples were considered (Figure 2). Because of this, only the classification of the new samples will now be discussed. First, sample A has been classified inside cluster number 4, between samples 31 and 33. As commented previously, this cluster is characterized by a strong positive band at 260 nm and weaker contributions around 280 nm. According to this, DNA in sample A will have a major contribution of B-DNA duplex structure and minor contributions of other structures, like G-quadruplex or i-motif. On the other hand, sample B has been classified inside cluster 3, between samples 40 and 38. As said before, this cluster contains mainly antiparallel quadruplex structures. According to this, a mixture of i-motif structure and / or antiparallel G-quadruplex will be predominant. Principal Component Analysis (PCA) PCA has also been used to predict the structures adopted by the DNA sequences in the new samples. The projection of the corresponding CD spectra into the space spanned by the previously calculated PCA model is shown in Figures 6b and 6c, respectively. In Figure 6b, sample A is now located at the right end of the PC1 axis, surrounded by duplex-related samples. According to this, sample A will be assigned to a B-DNA duplex structure. As in the case of HCA, more difficulties were found when studying sample B. In this case, the assignation of this sample to an unambiguous group is not straightforward. In Figure 6c, this sample has been placed in the middle of the group of antiparallel quadruplex structures (see cluster in the first quadrant of 17 Figure 3d). However, when looking at Figure 6b, this sample is only close to samples 42 and 43 and far away from other samples of this group (such as 32, 38 or 39). Moreover, it could be considered that the sample is located also close to the B-DNA duplex structures. This behavior can help us to explain the composition of the mixture with a component of antiparallel quadruplex structure (probably an i-motif because its proximity to samples 42 and 43) and another component of B-DNA duplex structure or parallel G-quadruplex. Partial Least Squares Discriminant Analysis (PLS-DA) Finally, PLS-DA has been applied to predict the structures present in samples A and B using the first point of view above discussed. Figures 6d and 6e show the prediction according to classes 3 (antiparallel G-quadruplex and i-motif) and 4 (B-DNA duplex and parallel G-quadruplex). The prediction for classes 1 (disordered structures) and 2 (triplex and some characteristic B-DNA structures) is not shown because it does not provide any significant information. According to Figure 6d it is not possible to classify sample A as belonging to class 3 but, on the contrary, it can be clearly assigned to the class 4 (Figure 6e). In the case of sample B, it can not be labeled as belonging to only one class because it shows characteristic features of two classes (see Figures 6d and 63): 3 (antiparallel G-quadruplex and i-motif) and 4 (B-DNA duplex and parallel Gquadruplex). So, the classification of both samples according to the PLS-DA model is quite concordant with that previously obtained according to the PCA model. 18 6. CONCLUSIONS The application of multivariate data analysis methods like HCA, PCA or PLS-DA to a CD spectra data set has been shown to be a useful tool in order to classify DNA sequences according to their experimental CD spectra. The three chemometric methods used in this work have allowed to extract information from the data set. Unsupervised Hierarchical Clustering Analysis and Principal Component Analysis have provided complementary information about the grouping of the samples. However, PCA has provided information about the key wavelength to perform this classification. Finally, PLS-DA, as a supervised method, has allowed to classify the samples that present a spectral signature but fails when the samples do not show significant features. The procedure for classification proposed in this work can be especially useful when it is applied to the elucidation of mixtures of DNA sequences where more than one major structure could be present. This has been demonstrated in the example presented where the chemometrical methods have allowed the prediction of the structure of the DNA present in solution. This would be the case more usual when research related to DNA structures is performed such as the case of DNA sequences corresponding to the promoter regions of several oncogenes, like bcl-2, c-kit or c-myc [24, 25]. In those cases, the simultaneous presence of guanine-rich and cytosine-rich strands can produce a mixture of several structures, like G-quadruplex, i-motif and BDNA duplex, depending on the experimental conditions [26, 27]. The knowledge of the structures present in these mixtures can be useful when designing drugs which mode of interaction strongly depends on the DNA structure. In order to make available the results showed in this manuscript to the chemometrical and circular dichroism communities, we will follow two different ways. First, the CD dataset is publicly available to download from our webpage (http://www.ub.es/gesq/dna/). Second, models obtained with this dataset (or extended datasets) will be implemented in a web based application to allow users without chemometrics expertise to predict DNA structures from circular dichroism spectra. 19 7. ACKNOWLEDGEMENTS This research was supported by the Spanish MEC (CTQ2006-15052-C02-01/BQU, CTQ2007-28940-E/BQU and BFU2007-63287/BMC). 20 8. TABLES AND FIGURES CAPTIONS Table I. Description of the CD spectra data set. Scheme 1. Examples of DNA structures. a) duplex (antiparallel, intermolecular), b) hairpin (antiparallel, intramolecular), c) triplex (parallel, intermolecular), d) Gquadruplex (antiparallel, intramolecular), e) i-motif (intramolecular). Figure 1. Experimental CD spectra of the DNA samples Figure 2. HCA obtained dendrogram using Ward’s linkage method and Euclidian distance. Figure 3. PCA analysis: a) Loadings plot for three components (Solid Line: PC1, Green Dotted Line: PC2, Dashed Line: PC3), b) PC1 vs. PC2 scores plot c) PC1 vs. PC3 scores plot and d) PC2 vs. PC3 scores plot. Symbols: ▼: class 1 (disordered structures), : class 2 (duplex structures), ■: class 3 (triplex structures), +: class 4 (quadruplex structures) and ◊: class 5 (mixture samples). Figure 4. PLS-DA model for the first viewpoint. a) Y predicted for class 1, b) Y predicted for class 2, c) Y predicted for class 3 and d) Y predicted for class 4. Symbols: ▼: class 1 (HCA cluster number 1, : class 2 (HCA cluster number 2), ■: class 3 (HCA cluster number 3), +: class 4 (HCA cluster number 4) and ◊: validation samples. Figure 5. PLS-DA model for the second viewpoint. a) Y predicted for class 1, b) Y predicted for class 2, c) Y predicted for class 3 and d) Y predicted for class 4. Symbols: ▼: class 1 in Table I, : class 2 in Table I, ■: class 3 in Table I, +: class 4 in Table I and ◊: validation set. Figure 6. Results obtained for the prediction of two new samples. a) Experimental CD spectra. Solid Line: Sample A, Dashed Line: Sample B, b) PCA analysis: PC1 vs. PC2 scores plot, c) PCA analysis: PC2 vs. PC3 scores plot, d) PLS-DA analysis: Y predicted for class 3 and e) PLS-DA analysis: Y predicted for class 4. Legends for symbols in figures b) and c) like in Figure 3. Legends for figures d) and e) like in Figure 4. 21 9. TABLES AND FIGURES Table I. Code DNA Sequence Expected structure 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 A 5’-CCGGCCGG-3’ 5’-TCTCCTCCTTC-3’ 5’-GAAGGA GGAGA -3’-(EG)6- 3’-TCTCCTCCTTC-5’ 5’-GAAGGAGGAGA-T4-TGTGGTGGTTG-3’ 5’-phos-AGGAGA-T4-AGAGGAGGAAG-T4-GAAGG Mixture of DNA Sequences 2 & 4 Mixture of DNA Sequences 2 & 5 5’-CGCGCGCG-3’ 5’-CCGGCCGG-3’ 5’-CCCCGGGG-3’ 5’-TCTCCTCCTTC-3’ 5’-ACCCTAACCCTA-3’ 5’-CGCGCGCG-3’ 5’-CGCGCGCG-3’ 5’-CCGGCCGG-3’ 5’-CCGGCCGG-3’ 5’-CCCCGGGG-3’ 5’-CCCCGGGG-3’ 5’-CGCGAATTCGCG-3’ 5’-GAAGGAGGAGA -3’-(EG)6- 3’-TCTCCTCCTTC-5’ 3’-AGANGGANGGAAG-5’-5’-T4-CTTCCTCCTCT-3’ 3’-AGANGGANGGAAG-CTTTG-5’-5’-CTTCCTCCTCT-3’ 5’-GAAGGANGGANGA-T4-AGAGGAGGAAG-3’ 5’-GAAGGAGGAGA-T4-TGTGGTGGTTG-3’ 5’-GAAGGANGGANGA-T4-TGTGGTGGTTG-3’ 5’-phos-AGGAGA-T4-TGTGGTGGTTG-T4-GAAGG-3’ 5’-phos-AGGAGA-T4-AGAGGAGGAAG-T4-GAAGG-3’ 5’-T12 -(EG)6-A12 -3’-(EG)6-3’-T12-5’ 5’-T12 -(EG)6-A12-(EG)6-T12-3’ 5’-CGGGCACGGGAGGAAGGGGGCGGG-3’ 5’-CGGGCACGGGAGGAPAGGGGGCGGG-3’ 5’-GGCGCGGGAGGAATTGGGCGGG-3’ 5’-GCGCGGGAGGAATTGGGCGGG-3’ 5’-TGGGGGT-3’ 5’-TGGGGGT-3’ 5’ GGNGTTGGGTGTGGGTTGGG 3’ 5’ GGGNTTGGGTGTGGGTTGGG 3’ 5’-GGTTGGTGTGGTTGG -3’- biot 5’-TAGGGTTAGGGT-3’ 5’ GGNGTTGGGTGTGGGTTGGG 3’ 5’ GGGNTTGGGTGTGGGTTGGG 3’ 5’-CCCGCCCAATTCCTCCCGCGCCCG-3’ 5’-CCCGAC CCCTTCAPTCCCGAGCCCG-3’ Mixture of DNA Sequences 33 & 42 Mixture of DNA Sequences 33 & 42 Mixture of DNA Sequences 2 & 20 Mixture of DNA Sequences 2 & 25 Mixture of DNA Sequences 2 & 22 Mixture of DNA Sequences 2 & 26 Mixture of DNA Sequences 2 & 27 5’-CCCCTCCCTCGCGCCCGCCCG-3’ + 5’-CGGGCGGGCGCGAGGGAGGGG-3’ 5’-CCCCTCCCTCGCGCCCGCCCG-3’ + 5’-CGGGCGGGCGCGAGGGAGGGG-3’ B Conditions* Single Strand Single Strand Single Strand Single Strand Single Strand Single Strand Single Strand Single Strand Single Strand Single Strand Single Strand Single Strand Duplex Inter Antiparallel Duplex Inter Antiparallel Duplex Inter Antiparallel Duplex Inter Antiparallel Duplex Inter Antiparallel Duplex Inter Antiparallel Duplex Inter Antiparallel Duplex Intra Parallel Duplex Intra Parallel Duplex Intra Parallel Duplex Intra Antiparallel Duplex Intra Antiparallel Duplex Intra Antiparallel Duplex Intra Antiparallel Duplex Intra Antiparallel Triplex reversed Triplex normal G-quadruplex Parallel G-quadruplex Parallel G-quadruplex Parallel G-quadruplex Parallel G-quadruplex Parallel G-quadruplex Parallel G-quadruplex Antiparallel G-quadruplex Antiparallel G-quadruplex Antiparallel G-quadruplex Antiparallel G-quadruplex Antiparallel G-quadruplex Antiparallel i-motif i-motif Duplex + Quadruplex Duplex + Quadruplex Duplex + Triplex Duplex + Triplex Duplex + Triplex Duplex + Triplex Duplex + Triplex Duplex Class Code 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 - Unknown - LT, K, pH2.9 HT, K, pH7 HT, K, pH7 HT, K, pH7 HT, K, pH7 HT, K, pH7 HT, K, pH7 HT, K, pH7 HT, Na, pH7 HT, Na, pH7 HT, Na, pH7 LT, K, pH7 LT, K, pH7 LT, K, pH7 LT, Na, pH7 LT, K, pH7 LT, Na, pH7 LT, K, pH7 LT, Na, pH7 LT, K, pH7 LT, K, pH7 LT, K, pH7 LT, K, pH7 LT, K, pH7 LT, K, pH7 LT, K, pH7 LT, K, pH7 LT, K, pH7 LT, K, pH7 LT, K, pH7 LT, K, pH5 LT, K, pH7 LT, Na, pH7 LT, K, pH7 LT, Na, pH7 LT, K, pH7 LT, K, pH7 LT, K, pH7 LT, K, pH7 LT, K, pH7 LT, Na, pH7 LT, Na, pH7 LT, K, pH5 LT, K, pH5 LT, K, pH7 LT, K, pH5 LT, K, pH7 LT, K, pH7 LT, K, pH7 LT, K, pH7 LT, K, pH7 LT, K, pH7.1 where inter denotes intermolecular: intra denotes intramolecular; (EG)6 denotes hexaethyleneglycol linker; biot denotes biotine tetraethylenglycol (biotine-TEG); phos denotes phosphate; AN denotes 8aminoadenine; AP denotes 2-aminopurine and GN denotes 8-aminoguanine. * where in conditions column HT refers to High Temperature (85 ºC), LT to Low Temperature (20 ºC), K to a 150 mM potassium medium, Na to a 150 mM sodium medium and pH7 and pH5 to the pH of the solution (pH=6.9 was obtained with a phosphate buffer, pH=5.1 was obtained with an acetate buffer and pH 7.1 and 2,9 were obtained with the appropriate volumes of HCl/NaOH). 22 10. REFERENCES [1] J.D. Watson, F.H.C. Crick, Nature, 171 (1953) 737. [2] W. Saenger, Principles of nucleic acid structure, Springer, New York, NY, USA, 1988. [3] V.A. Bloomfield, D.M. Crothers, I.T. Jr., Nucleics Acids. Structure, Properties and Functions, University Science Books, Sausalito, CA, USA, 1999. [4] D.E. Gilbert, J. Feigon, Curr Opin Struc Biol, 9 (1999) 305. [5] L.E. Xodo, G. Manzini, M. Alunnifabbroni, B. Scaggiante, F. Quadrifoglio, Acta Pharmaceut, 42 (1992) 299. [6] T. Simonsson, Biological Chemistry, 382 (2001) 621. [7] S. Neidle, S. Balasubramanian (Eds.), Quadruplex Nucleic Acids, RSC Biomolecular Sciences, Cambridge, 2006. [8] M.A. Keniry, Biopolymers, 56 (2000) 123. [9] M. Gueron, J.L. Leroy, Curr Opin Struc Biol, 10 (2000) 326. [10] D.M. Gray, S.H. Hung, K.H. Johnson, Methods in enzymology, 216 (1995) 19. [11] G.D. Fasman, Circular Dichroism and the conformational analysis of biomolecules, Plenum Press, New York, NY, USA, 1996. [12] N. Berova, K. Nakanishi, R.W. Woody (Eds.), Circular Dichroism. Principles and Applications, Wiley-VCH Inc., New York, US, 2000. [13] D.L. Massart, L. Kaufman, The interpretation of analytical chemical data by the use of cluster analysis, Wiley, New York, US, 1983. [14] D.L. Massart, L.M.C. Buydens, B.G.M. Vandegiste, Handbook of Chemometrics and Qualimetrics, Elsevier, Amsterdam, The Netherlands, 1997. [15] S. Wold, A. Ruhe, H. Wold, W.J. Dunn, Siam Journal on Scientific and Statistical Computing, 5 (1984) 735. 23 [16] A. Avino, M. Frieden, J.C. Morales, B. Garcia de la Torre, R. Guimil Garcia, F. Azorin, J.L. Gelpi, M. Orozco, C. Gonzalez, R. Eritja, Nucleic Acids Res, 30 (2002) 2609. [17] M.G. Grimau, A. Avino, R. Gargallo, R. Eritja, Chem Biodivers, 2 (2005) 275. [18] http://proligo2.proligo.com/Calculation/calculation.html. [19] L. Kaufman, P.J. Rousseeuw, Finding Groups in Data; An Introduction to Cluster Analysis., Wiley, New York, 1990. [20] S. Wold, K. Esbensen, P. Geladi, Chemometrics and Intelligent Laboratory Systems, 2 (1987) 37. [21] R.W. Kennard, L.A. Stone, Technometrics, 11 (1969) 137. [22] A. Savitzky, M.J.E. Golay, Analytical Chemistry, 36 (1964) 1627. [23] J.X. Dai, T.S. Dexheimer, D. Chen, M. Carver, A. Ambrus, R.A. Jones, D.Z. Yang, Journal of the American Chemical Society, 128 (2006) 1096. [24] A. Chanan-Khan, Blood Reviews, 19 (2005) 213. [25] D.J. Patel, A.T. Phan, V. Kuryavyi, Nucl. Acids Res., 35 (2007) 7429. [26] S. Neidle, M.A. Read, Biopolymers, 56 (2000) 195. [27] L.H. Hurley, D.D. Von Hoff, A. Siddiqui-Jain, D.Z. Yang, Seminars in Oncology, 33 (2006) 498. 24 Scheme 1 25 Figure 1 26 Figure 2 27 Figure 3 28 Figure 4 29 Figure 5 30 Figure 6 31