Generating Synthetic Data to Match Data Mining Patterns
Eno, J.; Thompson, C.W.
Univ. of Arkansas, Little Rock, AR

This paper appears in: IEEE Internet Computing
Publication Date: May-June 2008
Volume: 12, Issue: 3
On page(s): 78-82
ISSN: 1089-7801
INSPEC Accession Number: 9995883
Digital Object Identifier: 10.1109/MIC.2008.55
Current Version Published: 2008-05-07

Outline
• Introduction
• Related Works
  1. Synthetic Data Description Language (SDDL)
  2. Predictive Model Markup Language (PMML)
• Experiment
  1. Four-step process
  2. Flower classification
• To analyze the data

Introduction
This article explores how to combine information derived from data mining applications with the descriptive ability of synthetic data generation software. Our goal is to demonstrate that at least some data mining techniques (in particular, a decision tree) can discover patterns that we can then inverse-map into synthetic data sets. These synthetic data sets can be of any size and will faithfully exhibit the same (decision-tree) patterns.

Related Works
This work builds on two technologies:
1. Synthetic Data Description Language (SDDL)
2. Predictive Model Markup Language (PMML)

Synthetic Data Description Language (SDDL)
SDDL, a language developed at the University of Arkansas, is an XML format that describes data generated by synthetic data generation software. While SDDL doesn't specify the generator, our work uses the Parallel Synthetic Data Generator (PSDG) to generate SDDL-specified data in grid computing environments.

SDDL Constraint Types
– Min/Max/Step
– Probabilistic distribution
– Pool reference: essentially a parameterized dictionary lookup; users can define their own dictionaries.
– Formula: the field value is computed from a mathematical formula involving constants and other fields.
– Supported data types: integer, real, string, date, time, timestamp, Boolean.

SDDL Pool Example
<pool name="colors">
  <choice name="red">
    <weight>12</weight>
  </choice>
  <choice name="yellow">
    <weight>2</weight>
  </choice>
  <choice name="green">
    <weight>6</weight>
  </choice>
</pool>

Predictive Model Markup Language (PMML)
PMML is an open standard developed by the Data Mining Group. A PMML file consists of a data dictionary followed by one or more mining models. The data dictionary contains information about field types and data ranges and is independent of any specific mining model. Specific information about the distribution of values in a field is stored in the mining models, not in the data dictionary.

Experiment
To evaluate the algorithms and implementation, we used a simpler flower classification data set and analyzed it using SPSS Clementine, a commercial data mining software package. Clementine created a decision tree model for each data set, which was stored as a PMML file. We then used these files to test the software using a four-step process.

Four-step process
1. Our parsing software used the PMML file to create an SDDL file (a sketch of this step follows the list).
2. PSDG generated a large data set based on the SDDL file.
3. We loaded the generated data into a relational database for analysis.
4. We analyzed the data through a series of SQL queries, which determined how many rows were generated for each tree node and whether the records generated for a leaf node had the correct ratio of correct to incorrect classifications.
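Step 1 sketch: parsing the PMML tree into SDDL building blocks
The following is a minimal sketch of step 1, not the paper's actual parsing software. It assumes the PMML element names shown in the tree-model excerpt later in this summary (Node, SimplePredicate, ScoreDistribution), numeric split fields, and a PMML file without an XML namespace; the file name and the flat tuple output are illustrative rather than the exact PSDG/SDDL schema.

import xml.etree.ElementTree as ET

DEFAULT_MIN, DEFAULT_MAX = 0.0, 10.0   # the field defaults described in the SDDL tree-model slide

def primary_predicate(node):
    """Return a Node's primary SimplePredicate, skipping surrogate tests
    inside a CompoundPredicate (only the first test defines the split)."""
    compound = node.find("CompoundPredicate")
    if compound is not None:
        return compound.find("SimplePredicate")
    return node.find("SimplePredicate")

def walk(node, bounds, leaves):
    """Walk PMML Node elements, tightening per-field min/max bounds from the
    split predicates and recording counts and class distributions at the leaves."""
    pred = primary_predicate(node)
    if pred is not None:                       # the root node usually carries <True/> instead
        field = pred.get("field")
        op = pred.get("operator")
        value = float(pred.get("value"))       # assumes numeric split fields, as in the iris tree
        lo, hi = bounds.get(field, (DEFAULT_MIN, DEFAULT_MAX))
        if op in ("greaterThan", "greaterOrEqual"):
            lo = max(lo, value)
        elif op in ("lessThan", "lessOrEqual"):
            hi = min(hi, value)
        bounds = {**bounds, field: (lo, hi)}
    children = node.findall("Node")
    if not children:                           # leaf node: one future SDDL pool choice
        dist = {sd.get("value"): int(sd.get("recordCount"))
                for sd in node.findall("ScoreDistribution")}
        leaves.append((node.get("id"), int(node.get("recordCount")), bounds, dist))
    for child in children:
        walk(child, bounds, leaves)

root = ET.parse("iris_tree.pmml").getroot()    # hypothetical file name
leaves = []
walk(root.find(".//Node"), {}, leaves)
for node_id, count, bounds, dist in leaves:
    # Raw material for one weighted SDDL pool choice: weight = count,
    # auxiliary min/max fields = bounds, species sub-pool = dist.
    print(node_id, count, bounds, dist)

Each leaf tuple maps onto the SDDL node pool described under "SDDL Tree Model" below: the leaf's record count becomes the choice weight, the accumulated bounds become the auxiliary min/max fields, and the class distribution becomes the species sub-pool.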
Flower classification
The flower classification data set consists of 150 flower measurement records (petal width, petal length, sepal width, and sepal length), divided evenly between three species. Because the generating process for the data is the measurement of flowers, it's not surprising that the four measurements are each normally distributed within each species and aren't independent.

PMML Tree Model
<Node score="Iris-virginica" recordCount="6" id="5">
  <CompoundPredicate booleanOperator="surrogate">
    <SimplePredicate field="petal length" operator="greaterThan" value="4.95"/>
    <SimplePredicate field="sepal length" operator="greaterThan" value="7.1"/>
    <False/>
  </CompoundPredicate>
  <ScoreDistribution value="Iris-setosa" recordCount="0"/>
  <ScoreDistribution value="Iris-versicolor" recordCount="2"/>
  <ScoreDistribution value="Iris-virginica" recordCount="4"/>
  <Node score="Iris-virginica" recordCount="3" id="6"> … </Node>
  <Node score="Iris-versicolor" recordCount="3" id="7"> … </Node>
</Node>

SDDL Tree Model
The node pool contains five choices, one for each leaf node. The weights are the record counts of the respective leaf nodes. Each choice contains eight auxiliary fields, specifying the minimum and maximum for each numeric field. The default minimum for each field is zero, and the default maximum is 10. In most nodes, any given field will have the default value for either the minimum or the maximum. Each choice also includes a sub-pool, which specifies the weights for each species.

To analyze the data (1)
In the original data, there was a 2 percent (3/150) misclassification rate. Of these, 0.67 percent (1/150) were Versicolor misclassified as Virginica, while 1.33 percent (2/150) were Virginica misclassified as Versicolor. With the generated synthetic data set, which is a thousand times as large as the original data set, we would expect around 3,000 misclassifications, with about 1,000 Versicolor classified as Virginica and 2,000 the opposite. In fact, 1,019 Versicolor rows and 1,983 Virginica rows are misclassified, for a total of 3,002 misclassifications. This is well within the expected random variation and yields misclassification rates of 0.68 percent, 1.32 percent, and 2.00 percent for Versicolor, Virginica, and total misclassifications, respectively.

To analyze the data (2)
From a quick examination, it's clear that the record counts in the generated data are all close to 1,000 times the original record counts. Table 2 lists the record counts divided by the total number of records to give the probabilities. Although the rates vary slightly from the training data to the generated data, none of the differences is greater than 0.001, indicating that the synthetic data set successfully preserves the tree classification pattern of the original data set.
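Step 4 sketch: SQL checks on the generated data
As a companion to the analysis above, here is a minimal sketch of the kind of SQL queries step 4 refers to, run from Python with sqlite3. The database file, the table name iris_synth, and the columns node_id, species, and predicted_species are hypothetical stand-ins; the paper does not spell out its schema.

import sqlite3

conn = sqlite3.connect("iris_synth.db")        # hypothetical database of generated rows

# Rows generated per leaf node: each count should be roughly 1,000 times
# the corresponding record count in the training data.
per_node = conn.execute(
    "SELECT node_id, COUNT(*) FROM iris_synth GROUP BY node_id ORDER BY node_id"
).fetchall()

# Overall misclassification rate: rows whose generated species differs from
# the species predicted by the leaf node that produced them (expect about 0.02).
total, wrong = conn.execute(
    "SELECT COUNT(*), "
    "       SUM(CASE WHEN species <> predicted_species THEN 1 ELSE 0 END) "
    "FROM iris_synth"
).fetchone()

print(per_node)
print(f"misclassification rate: {wrong / total:.4f}")
conn.close()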