Generating Synthetic Data to Match Data Mining Patterns

Eno, J.; Thompson, C.W.
Univ. of Arkansas, AR
This paper appears in: Internet Computing, IEEE
Publication Date: May-June 2008
Volume: 12, Issue: 3
On page (s): 78-82
ISSN: 1089-7801
INSPEC Accession Number: 9995883
Digital Object Identifier: 10.1109/MIC.2008.55
Current Version Published: 2008-05-07
Outline
• Introduction
• Related Work
  1. Synthetic Data Description Language (SDDL)
  2. Predictive Model Markup Language (PMML)
• Experiment
  1. Four-step process
  2. Flower classification
• Analyzing the data
Introduction
This article explores how to combine information
derived from data mining applications with the
descriptive ability of synthetic data generation
software. Our goal is to demonstrate that at least
some data mining techniques (in particular, a
decision tree) can discover patterns that we can
then inverse-map into synthetic data sets. These
synthetic data sets can be of any size and will
faithfully exhibit the same (decision tree)
patterns.
Related Work
Our work builds on two technologies:
1. Synthetic Data Description Language (SDDL)
2. Predictive Model Markup Language (PMML)
Synthetic Data Description Language
SDDL, a language developed at the
University of Arkansas, is an XML
format that describes data to be generated
by synthetic data generation software.
While SDDL doesn't specify the
generator, our work uses the Parallel
Synthetic Data Generator (PSDG) to
generate SDDL-specified data in grid
computing environments.
Synthetic Data Description Language (SDDL)
SDDL Constraint Types
– Min/Max/Step
– Probabilistic distribution
– Pool reference: essentially a parameterized dictionary
lookup; users can define their own dictionaries
– Formula: field value computed from a mathematical
formula involving constants and other fields
Supported data types: integer, real, string, date, time,
timestamp, Boolean.
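The constraint types above can be sketched in a few lines of Python. This is a minimal illustration of the ideas, not PSDG's actual implementation; the function names and example constants are ours:

```python
import random

def min_max_step(rng, lo, hi, step):
    # Min/Max/Step: uniform choice over lo, lo+step, ..., up to hi
    k = int((hi - lo) // step)
    return lo + rng.randrange(k + 1) * step

def pool_choice(rng, pool):
    # Pool reference: weighted dictionary lookup
    names = list(pool)
    return rng.choices(names, weights=[pool[n] for n in names])[0]

rng = random.Random(7)
width = min_max_step(rng, 0.0, 10.0, 0.5)                      # numeric constraint
color = pool_choice(rng, {"red": 12, "yellow": 2, "green": 6}) # pool constraint
area = width * width                                           # Formula-style field
```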
SDDL
<pool name="colors">
  <choice name="red">
    <weight>12</weight>
  </choice>
  <choice name="yellow">
    <weight>2</weight>
  </choice>
  <choice name="green">
    <weight>6</weight>
  </choice>
</pool>
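As a sanity check, a short standard-library Python snippet can parse a pool like the one above and confirm that weighted sampling yields roughly the 12:2:6 proportions (about 60 percent red):

```python
import random
import xml.etree.ElementTree as ET

SDDL_POOL = """
<pool name="colors">
  <choice name="red"><weight>12</weight></choice>
  <choice name="yellow"><weight>2</weight></choice>
  <choice name="green"><weight>6</weight></choice>
</pool>
"""

root = ET.fromstring(SDDL_POOL)
names = [c.get("name") for c in root.findall("choice")]
weights = [int(c.findtext("weight")) for c in root.findall("choice")]

rng = random.Random(0)
sample = rng.choices(names, weights=weights, k=20000)
red_rate = sample.count("red") / len(sample)  # expect ~12/20 = 0.6
```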
Predictive Model Markup Language
PMML is an open standard developed by the
Data Mining Group. A PMML file consists of a data
dictionary followed by one or more mining models.
The data dictionary contains information about field
types and data ranges and is independent of any
specific mining model. Specific information about
the distribution of values in a field is stored in
mining models, not in the data dictionary.
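A minimal sketch of reading such a data dictionary with Python's ElementTree. The fragment below is modeled on PMML's DataDictionary/DataField elements but is much reduced from the real schema, and the field values are illustrative:

```python
import xml.etree.ElementTree as ET

# Reduced PMML-style data dictionary fragment (illustrative, not the full schema)
PMML = """
<DataDictionary numberOfFields="2">
  <DataField name="petal length" optype="continuous" dataType="double">
    <Interval closure="closedClosed" leftMargin="1.0" rightMargin="6.9"/>
  </DataField>
  <DataField name="species" optype="categorical" dataType="string"/>
</DataDictionary>
"""

dd = ET.fromstring(PMML)
# Field types and ranges live here, independent of any mining model
fields = {f.get("name"): f.get("optype") for f in dd.findall("DataField")}
```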
Experiment
To evaluate the algorithms and
implementation, we used a simple flower
classification data set, which we analyzed
using SPSS Clementine, a commercial data
mining software package. Clementine
created a decision tree model for the data
set, which was stored as a PMML file. We
then used this file to test the software
using a four-step process.
Four-step process:
1. Our parsing software used the PMML file to create an SDDL
file.
2. PSDG generated a large data set based on the SDDL File.
3. We loaded the generated data into a relational database for
analysis.
4. We analyzed the data through a series of SQL queries, which
determined how many rows were generated for each tree
node and whether the records generated for a leaf node have
the correct ratio of correct to incorrect classifications.
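Steps 3 and 4 can be sketched with an in-memory SQLite table. The table layout and the six sample rows are hypothetical, chosen to mirror a leaf node with six records, four of them correctly classified:

```python
import sqlite3

# Step 3: load generated rows (node id, predicted class, actual class)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE iris (node_id INTEGER, predicted TEXT, actual TEXT)")
rows = [(5, "Iris-virginica", "Iris-virginica")] * 4 + \
       [(5, "Iris-virginica", "Iris-versicolor")] * 2
con.executemany("INSERT INTO iris VALUES (?, ?, ?)", rows)

# Step 4: per leaf node, count rows and correct classifications
counts = con.execute("""
    SELECT node_id,
           COUNT(*) AS total,
           SUM(predicted = actual) AS correct
    FROM iris GROUP BY node_id
""").fetchall()
```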
Flower classification
The data set consists of 150 measurements of
flowers, including petal width, petal length,
sepal width, and sepal length, divided evenly
among three species. Because the generating
process for the data is the measurement of
flowers, it's not surprising that the four
measurements are each normally distributed
within each species and aren't independent.
PMML Tree Model
<Node score="Iris-virginica" recordCount="6" id="5">
  <CompoundPredicate booleanOperator="surrogate">
    <SimplePredicate field="petal length"
        operator="greaterThan" value="4.95"/>
    <SimplePredicate field="sepal length"
        operator="greaterThan" value="7.1"/>
    <False/>
  </CompoundPredicate>
  <ScoreDistribution value="Iris-setosa"
      recordCount="0"/>
  <ScoreDistribution value="Iris-versicolor"
      recordCount="2"/>
  <ScoreDistribution value="Iris-virginica"
      recordCount="4"/>
  <Node score="Iris-virginica"
      recordCount="3" id="6"> … </Node>
  <Node score="Iris-versicolor"
      recordCount="3" id="7"> … </Node>
</Node>
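The surrogate predicate in this node can be sketched in Python. This is our reading of PMML's "surrogate" operator: each predicate is tried in turn when the previous predicate's field is missing, and the trailing `<False/>` supplies the default outcome:

```python
def surrogate(predicates, row):
    """Sketch of PMML 'surrogate' semantics: evaluate predicates in order,
    falling through to the next only when the field value is missing."""
    for field, op, value in predicates:
        x = row.get(field)
        if x is None:
            continue                # missing field: try the surrogate
        return x > value if op == "greaterThan" else x <= value
    return False                    # corresponds to the trailing <False/>

# Predicates from node id=5 above
node5 = [("petal length", "greaterThan", 4.95),
         ("sepal length", "greaterThan", 7.1)]
```

For example, a row with petal length 5.2 matches directly, while a row missing petal length falls back to the sepal length test.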
SDDL Tree Model
The node pool contains five choices, one for
each leaf node. The weights are the record
counts for each respective leaf node. Each
choice contains eight auxiliary fields, specifying
the minimum and maximum for each numeric
field. The default minimum for each field is zero,
and the default maximum is 10. In most nodes,
any given field will have the default value for
either the minimum or maximum. Each choice
also includes a sub-pool, which specifies the
weights for each species.
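A toy Python version of this pool structure, with a single entry for leaf node 5; the dictionary layout and names are illustrative, not SDDL syntax:

```python
import random

# One weighted choice per leaf node; each choice carries per-field
# min/max bounds (defaults 0 and 10) and a species sub-pool.
LEAF_POOL = {
    5: {"weight": 6,
        "bounds": {"petal length": (4.95, 10.0),   # default max
                   "sepal length": (0.0, 7.1)},    # default min
        "species": {"Iris-versicolor": 2, "Iris-virginica": 4}},
    # ... one entry per remaining leaf node
}

def gen_row(rng, pool):
    nodes = list(pool)
    node = rng.choices(nodes, weights=[pool[n]["weight"] for n in nodes])[0]
    leaf = pool[node]
    row = {f: rng.uniform(lo, hi) for f, (lo, hi) in leaf["bounds"].items()}
    sub = leaf["species"]
    row["species"] = rng.choices(list(sub), weights=list(sub.values()))[0]
    return node, row
```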
Analyzing the data (1)
In the original data, there was a two percent (3/150) rate of
misclassification. Of these, 0.67 percent (1/150) were
Versicolor misclassified as Virginica, while 1.33 percent
(2/150) were Virginica misclassified as Versicolor. With the
generated synthetic data set, which is a thousand times as
large as the original data set, we would expect around 3,000
misclassifications, with 1,000 Versicolor classified as
Virginica, and 2,000 the opposite. In fact, 1,019
Versicolor rows are misclassified, while 1,983 Virginica are
misclassified, for a total of 3,002 misclassifications. This is
well within the expected random variation and results in
misclassification rates of 0.68 percent, 1.32 percent, and
2.00 percent for Versicolor, Virginica, and total
misclassifications, respectively.
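The percentages quoted above can be checked directly:

```python
# Rates from the 150,000-row synthetic set versus the original 3/150 rates
total_rows = 150 * 1000
versicolor_err = 1019 / total_rows   # vs. expected 1/150 ≈ 0.67%
virginica_err = 1983 / total_rows    # vs. expected 2/150 ≈ 1.33%
overall_err = (1019 + 1983) / total_rows
```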
Analyzing the data (2)
From a quick examination, it's clear that the counts in the
generated data are all close to 1,000 times the original
record counts. Table 2 lists the record counts divided by
total records to give the probabilities. Although the rates
do vary slightly from the training data to the generated data,
none of the differences is greater than 0.001, indicating that
the synthetic data set successfully preserves the tree
classification pattern of the original data set.