Triclinic Labs Chemometric Engine

Simon Bates, Ph.D.

The chemometric tools assembled at Triclinic Labs are designed to help our scientists ask various questions about a set of measured data files:

- How many variable components describe the differences in the observed data sets?
- How many independent solid forms (‘pure’ phases) are represented in the data ensemble?
- When the data sets are clustered according to the variable components, how many of the input data sets represent ‘pure’ phases?
- How are the clustering and the relationships between clusters best displayed visually?
- With respect to the ‘pure’ phases, what is the quantitative composition of each data set?
- What do the ‘reference’ data sets corresponding to the ‘pure’ phases look like?
- What are the most important observations (e.g. peaks) for classification of each ‘pure’ phase?
- Are the solid-matrix effects small enough to allow traditional quantitative analysis?
- What are the magnitudes of the potential errors associated with quantitative analysis?
- How are the variable components in the data ensemble related to control variables?

The Triclinic Labs Chemometric Engine is built around randomly generated conditional inference trees (decision trees) [Breiman 2001 and Hothorn et al 2006]. This allows for an unbiased classification of the similarities and differences between a set of input data files that is not tied to the analytical tools used to collect the data and is not dependent on the type of material being studied. For each tree, the observations used to define the classification of a data file at each decision node are randomly selected from the input data set, which allows the method to identify the most predictive observations in the input data set (that is, it acts as a learning algorithm).
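The random selection of observations at each decision node can be illustrated with a minimal sketch. It is not the Triclinic Labs implementation, and it does not use the conditional inference tests of Hothorn et al. As a simplifying assumption it grows single-split trees (stumps) on bootstrap samples, scores splits with Gini impurity in the spirit of Breiman's random forests, and counts how often each observation is chosen. All function names and the demo data are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, candidates):
    """Return (impurity, feature, threshold) for the best split found
    among the candidate features, or None if no split is possible."""
    best = None
    n = len(rows)
    for f in candidates:
        values = sorted({r[f] for r in rows})
        # Candidate thresholds lie midway between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


def tally_selected_features(rows, labels, n_trees=200, n_candidates=2, seed=0):
    """Grow n_trees stumps and count which observation each one splits on."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    counts = Counter()
    for _ in range(n_trees):
        # Bootstrap sample of the data files (drawn with replacement).
        idx = [rng.randrange(len(rows)) for _ in rows]
        sample = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        # Only a random subset of observations is offered to this node.
        candidates = rng.sample(range(n_features), n_candidates)
        split = best_split(sample, sample_labels, candidates)
        if split is not None:
            counts[split[1]] += 1
    return counts


# Hypothetical data: each row is one data file, each column the intensity
# at one grid point. Only column 1 separates class "A" from class "B".
DEMO_ROWS = [
    [0.10, 0.90, 0.30, 0.52], [0.35, 0.85, 0.10, 0.40],
    [0.20, 0.95, 0.45, 0.61], [0.40, 0.80, 0.25, 0.47],
    [0.15, 0.20, 0.35, 0.50], [0.30, 0.10, 0.15, 0.58],
    [0.25, 0.15, 0.40, 0.43], [0.45, 0.25, 0.20, 0.55],
]
DEMO_LABELS = ["A", "A", "A", "A", "B", "B", "B", "B"]

demo_counts = tally_selected_features(DEMO_ROWS, DEMO_LABELS)
print(demo_counts.most_common())
```

In this toy data set only observation 1 distinguishes the two classes, so it should be selected far more often than the others. On real data, a tally of this kind is one simple way to rank which observations carry the most predictive weight.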
The random nature of the individual tree growing, and the subsequent assembly of individual trees into multiple classification forests, further allows for an inbuilt error estimate for the classification results, which is generated as part of the classification process.

Conditional inference trees and random forests are widely used for classification in a number of scientific disciplines; however, getting ‘appropriate’ data into the method and interpreting the output from the method have often proved a challenge. At Triclinic Labs we have written a number of software tools around the classification method that take raw, as-measured data (structural, spectroscopic and thermal) and convert that data into a suitable input data ensemble for classification. In each case, the input data ensemble is representative of the measured data and does not include any attempt to identify peaks, valleys or events of interest. In keeping with the unbiased methodology, the random forests method identifies the most significant observations as part of the classification procedure. Similarly, the output results can be interpreted by a number of support tools to give a meaningful overview of the classification with respect to the questions being asked of the data by our scientists.

Because the method is a learning method and requires only the input data ensemble to arrive at the optimum classification procedure, semi-quantitative results can be returned without the use of any artificial standards. This is particularly useful when dealing with production optimization or troubleshooting a production problem. For a drug product, for example, it can be difficult or even impossible to make representative reference standards to perform a traditional quantitative analysis. The solid-state matrix and micro-structure generated by the production process often cannot be reproduced by mixing together the input excipients and API in known quantities.
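As an illustration of how raw, as-measured data might be turned into an input data ensemble without any peak picking, the sketch below resamples each measured pattern onto a common grid so that every grid point becomes one observation. The source does not describe the actual conversion; the linear interpolation, the unit-sum scaling, and all function names here are assumptions made for illustration.

```python
from bisect import bisect_right


def interpolate(xs, ys, x):
    """Linearly interpolate a measured curve at position x.
    xs must be strictly ascending; values outside the range are clamped."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x)  # now xs[i - 1] <= x < xs[i]
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def common_grid(patterns, n_points):
    """Evenly spaced grid over the x-range shared by all patterns."""
    start = max(xs[0] for xs, _ in patterns)
    stop = min(xs[-1] for xs, _ in patterns)
    step = (stop - start) / (n_points - 1)
    return [start + k * step for k in range(n_points)]


def build_ensemble(patterns, n_points=50):
    """Resample every (xs, ys) pattern onto one grid and scale each row
    to unit sum (an assumed normalization). No peaks are identified:
    every grid point is kept as an observation."""
    grid = common_grid(patterns, n_points)
    ensemble = []
    for xs, ys in patterns:
        row = [interpolate(xs, ys, x) for x in grid]
        total = sum(row)
        ensemble.append([v / total for v in row] if total else row)
    return grid, ensemble


# Hypothetical usage: two patterns measured on different x-axes.
demo_patterns = [
    ([0.0, 1.0, 2.0], [0.0, 2.0, 4.0]),
    ([0.0, 0.5, 2.0], [1.0, 1.0, 1.0]),
]
demo_grid, demo_ensemble = build_ensemble(demo_patterns, n_points=5)
```

Each row of the resulting ensemble has the same length regardless of how the original file was sampled, which is the property a tree-based classifier needs from its input.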
Conditional inference trees and random forests can identify ‘effective reference’ patterns that correspond to the individual phase contributions as they appear within the drug product matrix. Semi-quantitative [1] analysis is performed with respect to these effective reference patterns.

[1] The method is considered to be semi-quantitative in that external reference standards and known mixtures are not used to calibrate the output. However, in many traditional quantitative methods, one of the largest sources of error in absolute accuracy can be the choice of standards and the use of physical mixtures to reproduce the solid-state matrix of a drug product. This error is not identified during traditional quantitative method development. The effective reference patterns generated by the random forests method can be compared with the measured data collected on pure ‘standard’ material to characterize the degree of solid-state matrix effects and their impact on absolute accuracy. If the matrix effects are minimal, then a standard quantitative analysis can be considered to be representative of the real drug product.

For more information or to discuss how to apply the discussed techniques to your molecules, please contact:

Simon Bates, Ph.D.
Triclinic Labs, Inc.
1201 Cumberland Ave., Suite S.
West Lafayette, IN 47906
Direct line: +1 765-588-5632
Triclinic Labs: +1 765-588-6200
sbates@tricliniclabs.com

References:

Breiman, L. Random forests. Machine Learning 2001; 45:5-32.
Hothorn, T., Hornik, K., & Zeileis, A. Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics 2006; 15:651-674.

Available random forest and conditional inference tree classification software:

Breiman, L., Cutler, A., Liaw, A., & Wiener, M. (2010). randomForest: Breiman and Cutler’s random forests for classification and regression (R package version 4.6-2) {URL: http://cran.r-project.org/package=randomForest}
Hothorn, T., Hornik, K., Strobl, C., & Zeileis, A. (2010). PARTY: A laboratory for recursive part(y)itioning (R package version 0.9-99991) {URL: http://cran.r-project.org/package=party}