Parameterizing Random Test Data According to Equivalence Classes Columbia University

Parameterizing Random Test Data
According to Equivalence Classes
Chris Murphy, Gail Kaiser, Marta Arias
Columbia University
What is random testing?
Random testing is the notion of using
“random” input to test the application
As opposed to using pre-determined and
manually selected “equivalence classes”
or “partitions”
We are investigating the quality assurance
of Machine Learning (ML) applications
• Currently we are concerned with a real-world application for potential future use in
predicting electrical device failures
• Using ranking instead of classification
Our concern is not whether an algorithm
predicts well but whether an
implementation operates correctly
Data Set Options
Real-world data sets
• Not always accessible/available
• May not necessarily contain the separation or
combination of traits that we desire to test
Hand-generation of data
• Only useful for small tests
Random testing
• Limited by the lack of a reliable test oracle
• ML applications of interest fall into the category
of “non-testable programs”
Without a reliable test oracle, we can only:
• Look for obvious faults
• Consider intermediate results
• Detect discrepancies in the specification
We need to restrict some properties of
random test data generation
Our Solution
Parameterized Random Test Data Generation
Automatically generate random data sets, but
parameterized to control the range and
characteristics of those random values
Parameterization allows us to create a hybrid
between equivalence class partitioning and
random testing
• Machine Learning Background
• Data Generation Framework
• Findings and Results
• Evaluation and Observations
• Conclusions and Future Work
Machine Learning Fundamentals
Data sets consist of a number of
examples, each of which has attributes
and a label
• In the first phase (“training”), a model is
generated that attempts to generalize how
attributes relate to the label
• In the second phase (“validation”), the
model is applied to a previously-unseen
data set with unknown labels to produce a
classification (or, in our case, a ranking)
Problems Faced in Testing
The testing input should be based on the
problem domain
• Need to consider a way to mimic all of the
traits of the real-world data sets
• Also need to keep in mind that we do not
have a reliable test oracle
Analyzing the Problem Domain
Consider properties of data sets in general
• Data set size: number of attributes and examples
• Range of values: attributes and labels
• Precision of floating-point numbers
• Whether values can repeat
Consider properties of real-world data sets in the
domain of interest
• How alphanumeric attributes are to be interpreted
• Whether data values might be missing
Equivalence Classes
Data sizes of different orders of magnitude
Repeating vs. non-repeating attribute values
Missing vs. no-missing attribute values
Categorical vs. non-categorical data
0/1 labels vs. non-negative integer labels
Predictable vs. non-predictable data sets
Used data set generator to parameterize test
case selection criteria
How Data Are Generated
M attributes and N examples
No-repeat mode:
• Generate a list of integers from 1 to M*N and
then randomly permute them
Repeat mode:
• Each value in the data set is simply a random
integer between 1 and M*N
• Tool ensures at least one set of repeating values
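The two generation modes above can be sketched as follows. This is a minimal illustration, not the authors' tool: the function name `generate_values`, the `seed` parameter, and the way a duplicate is forced in repeat mode are all assumptions.

```python
import random

def generate_values(num_attributes, num_examples, repeat, seed=None):
    """Return num_examples rows of num_attributes integers drawn from 1..M*N."""
    rng = random.Random(seed)
    total = num_attributes * num_examples
    if repeat:
        # Repeat mode: every cell is an independent random integer in 1..M*N.
        values = [rng.randint(1, total) for _ in range(total)]
        # Assumption: if the draw happened to contain no duplicate,
        # copy one cell onto another so at least one value repeats.
        if total > 1 and len(set(values)) == total:
            i, j = rng.sample(range(total), 2)
            values[j] = values[i]
    else:
        # No-repeat mode: a random permutation of 1..M*N.
        values = list(range(1, total + 1))
        rng.shuffle(values)
    # Cut the flat list into one row per example.
    return [values[r * num_attributes:(r + 1) * num_attributes]
            for r in range(num_examples)]

no_repeat_rows = generate_values(4, 3, repeat=False, seed=1)
repeat_rows = generate_values(4, 3, repeat=True, seed=1)
```

In no-repeat mode every integer from 1 to M*N appears exactly once, so any difference between two runs of the application under test cannot be attributed to tie handling.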
Generating Labels
Specify percentage of “positive examples” to
include in the data set
• Positive examples have a label of 1
• Negative examples have a label of 0
Data generation framework guarantees that the
number of positive examples comes out to be
the right number, even though the values are
randomly placed throughout the data set
Labels are never unknown/missing
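One way to guarantee the exact number of positive examples while still placing them randomly is to build the label list first and then shuffle it. The sketch below assumes this approach; the name `generate_labels`, the `seed` parameter, and the rounding rule are hypothetical and not taken from the framework.

```python
import random

def generate_labels(num_examples, percent_positive, seed=None):
    """Return a shuffled list of 0/1 labels with an exact count of positives."""
    rng = random.Random(seed)
    # Assumption: the positive count is the percentage rounded to a whole example.
    num_positive = round(num_examples * percent_positive / 100)
    labels = [1] * num_positive + [0] * (num_examples - num_positive)
    rng.shuffle(labels)  # positions are random, the count is fixed
    return labels

labels = generate_labels(10, 40, seed=7)
```

Drawing each label independently with probability 0.4 would only give 40% positives on average, which is why the count is fixed before shuffling.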
Categorical Data
For some alphanumeric attributes, data
pre-processing is used to expand K
distinct values to K attributes
• Same as in real-world ranking application
Input parameter to data generation tool is
of the format (a1, a2, ..., aK-1, aK, m)
• a1 through aK represent the percentage
distribution of those values for the categorical attribute
• m is the percentage of unknown values
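A rough sketch of how a categorical attribute might be generated from (a1, ..., aK, m) and then expanded to K attributes. Several points are assumptions, since the slides do not specify them: that a1 through aK and m together sum to 100, that the expansion uses 0/1 indicator attributes, and that an unknown value becomes unknown in all K expanded attributes. The function names are hypothetical.

```python
import random

def generate_categorical(num_examples, percents, percent_missing, seed=None):
    """Return one column of category indices 0..K-1, with None for unknown."""
    if sum(percents) + percent_missing != 100:
        raise ValueError("a1..aK and m are assumed to sum to 100")
    rng = random.Random(seed)
    column = []
    for category, percent in enumerate(percents):
        column.extend([category] * round(num_examples * percent / 100))
    # Trim or pad to exactly num_examples entries; padding entries are unknown.
    column = column[:num_examples]
    column.extend([None] * (num_examples - len(column)))
    rng.shuffle(column)
    return column

def expand_to_attributes(column, num_categories):
    """Expand K distinct values to K indicator attributes per example."""
    rows = []
    for value in column:
        if value is None:
            rows.append([None] * num_categories)  # assumption: unknown everywhere
        else:
            rows.append([1 if value == c else 0 for c in range(num_categories)])
    return rows

column = generate_categorical(20, (50, 30), 20, seed=3)
expanded = expand_to_attributes(column, 2)
```

Each expanded attribute can take only a few distinct values, so repeats are unavoidable in categorical data whatever the repeat/no-repeat setting is.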
Data Set Generator - Parameters
• # of examples
• # of attributes
• % positive examples (label = 1)
• % missing
• any categorical data
• repeat/no-repeat modes
Sample Data Sets
10 examples, 10 attributes, 40% positive
examples, 20% missing, repeats allowed
27,81,88,59, ?,16,88, ?,41, ?,0
15,70,91,41, ?, 3, ?, ?, ?,64,0
82, ?,51,47, ?, 4, 1,99, ?,51,0
22,72,11, ?,96,24,44,92, ?,11,1
57,77, ?,86,89,77,61,76,96,98,1
76,11, 4,51,43, ?,79,21,28, ?,0
6,33, ?, ?,52,63,94,75, 8,26,0
77,36,91, ?,47, 3,85,71,35,45,1
?,17,15, 2,90,70, ?, 7,41,42,0
8,58,42,41,74,87,68,68, 1,15,1
35, 3,20,41,91, ?,32,11,43, ?,1
19,50,11,57,36,94, ?,96, 7,23,1
24,36,36,79,78,33,34, ?,32, ?,0
?,15, ?,19,65,80,17,78,43, ?,0
40,31,89,50,83,55,25, ?, ?,45,1
52, ?, ?, ?, ?,39,79,82,94, ?,0
86,45, ?, ?,74,68,13,66,42,56,0
?,53,91,23,11, ?,47,61,79, 8,0
77,11,34,44,92, ?,63,62,51,51,1
21, 1,70,14,16,40,63,94,69,83,0
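The stated parameters can be checked mechanically against a generated file. The sketch below parses the first ten rows of the listing above and recomputes the percentages; the helper name `summarize` is hypothetical, and the file layout (comma-separated, `?` for missing, label in the last column) is inferred from the listing.

```python
def summarize(lines):
    """Recompute data set properties from comma-separated rows (label is last)."""
    rows = [[field.strip() for field in line.split(",")] for line in lines]
    labels = [int(row[-1]) for row in rows]
    cells = [cell for row in rows for cell in row[:-1]]
    return {
        "examples": len(rows),
        "attributes": len(rows[0]) - 1,
        "percent_positive": 100 * sum(labels) / len(labels),
        "percent_missing": 100 * cells.count("?") / len(cells),
    }

first_sample = [
    "27,81,88,59, ?,16,88, ?,41, ?,0",
    "15,70,91,41, ?, 3, ?, ?, ?,64,0",
    "82, ?,51,47, ?, 4, 1,99, ?,51,0",
    "22,72,11, ?,96,24,44,92, ?,11,1",
    "57,77, ?,86,89,77,61,76,96,98,1",
    "76,11, 4,51,43, ?,79,21,28, ?,0",
    "6,33, ?, ?,52,63,94,75, 8,26,0",
    "77,36,91, ?,47, 3,85,71,35,45,1",
    "?,17,15, 2,90,70, ?, 7,41,42,0",
    "8,58,42,41,74,87,68,68, 1,15,1",
]
summary = summarize(first_sample)
print(summary)
# → {'examples': 10, 'attributes': 10, 'percent_positive': 40.0, 'percent_missing': 20.0}
```

The counts match the description: 4 of 10 labels are 1, and 20 of the 100 attribute cells are `?`.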
The Testing Framework
• Data set generator
• Model comparison
• Ranking comparison: includes metrics like
normalized equivalence and AUCs
• Tracing options: for generating and
comparing outputs of debugging
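As an illustration of one of the ranking metrics named above, the sketch below computes the AUC of a ranking with 0/1 labels as the fraction of (positive, negative) pairs in which the positive example is ranked higher. How the framework itself computes AUC is not described in the slides; the function name and the convention that positives should rank first are assumptions.

```python
def ranking_auc(labels_in_rank_order):
    """AUC of a ranking, given its 0/1 labels listed from top rank to bottom."""
    positives_seen = 0
    correctly_ordered_pairs = 0
    for label in labels_in_rank_order:
        if label == 1:
            positives_seen += 1
        else:
            # Every positive already seen is ranked above this negative.
            correctly_ordered_pairs += positives_seen
    negatives = len(labels_in_rank_order) - positives_seen
    if positives_seen == 0 or negatives == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    return correctly_ordered_pairs / (positives_seen * negatives)

print(ranking_auc([1, 1, 0, 0]))  # → 1.0
print(ranking_auc([1, 0, 1, 0]))  # → 0.75
```

Two rankings produced from the same data by two implementations can then be compared by their AUC values as well as position by position.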
MartiRank and SVM
MartiRank was specifically designed for
the real-world device failure application
• Seeks to find the sequence of attributes to
segment and sort the data to produce the best
SVM is typically a classification algorithm
• Seeks to find a hyperplane that separates
examples from different classes
• SVM-Light has a ranking mode based on the
distance from the hyperplane
Testing approach and framework were
developed for MartiRank then applied to SVM
Only the findings most related to parameterized
random testing are presented here
• More details and case studies about the testing of
MartiRank can be found in our tech report
Issue #1: Repeating Values
One version of MartiRank did not use
“stable” sorting
91,41,19, 3,57,11,20,64,0
36,73,47, 3,85,71,35,45,1
91,41,19, 3,57,11,20,64,0
36,73,47, 3,85,71,35,45,1
unstable ...
36,73,47, 3,85,71,35,45,1
91,41,19, 3,57,11,20,64,0
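A check of this kind can be automated: after sorting, rows that tie on the sort attribute should still appear in their original relative order. The sketch below uses the two rows from the slide, which tie on the fourth attribute (value 3). The slide does not say which attribute was sorted, so the column choice is an assumption, and `preserves_tie_order` is a hypothetical helper.

```python
def preserves_tie_order(before, after, column):
    """True if rows that tie on `column` keep their relative order."""
    def groups(rows):
        by_key = {}
        for row in rows:
            by_key.setdefault(row[column], []).append(row)
        return by_key
    return groups(before) == groups(after)

original = [
    [91, 41, 19, 3, 57, 11, 20, 64, 0],
    [36, 73, 47, 3, 85, 71, 35, 45, 1],
]
# Python's sorted() is guaranteed stable, so it serves as the reference here.
stable_result = sorted(original, key=lambda row: row[3])
# The order an unstable sort may produce, as in the slide.
swapped_result = [original[1], original[0]]

print(preserves_tie_order(original, stable_result, 3))   # → True
print(preserves_tie_order(original, swapped_result, 3))  # → False
```

Because the two rows carry different labels, swapping them changes the resulting ranking, which is how repeat-mode data sets expose the difference.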
Issue #2: Sparse Data Sets
Not specifically addressed in specification
41,91, ?,32,11,43, ?,1
57,36,94, ?,96, 7,23,1
79,78,33,34, ?,31, ?,0
19,65,80,17,78,46, ?,0
50,83,55,25, ?, ?,45,1
?, ?,39,79,82,94, ?,0
Three ways to handle missing values when sorting on the fifth attribute:
Randomly insert missing values:
41,91, ?,32,11,43, ?,1
50,83,55,25, ?, ?,45,1
19,65,80,17,78,46, ?,0
79,78,33,34, ?,31, ?,0
?, ?,39,79,82,94, ?,0
57,36,94, ?,96, 7,23,1
Sort “around” missing values:
41,91, ?,32,11,43, ?,1
19,65,80,17,78,46, ?,0
79,78,33,34, ?,31, ?,0
?, ?,39,79,82,94, ?,0
50,83,55,25, ?, ?,45,1
57,36,94, ?,96, 7,23,1
Put missing values at end:
41,91, ?,32,11,43, ?,1
19,65,80,17,78,46, ?,0
?, ?,39,79,82,94, ?,0
57,36,94, ?,96, 7,23,1
79,78,33,34, ?,31, ?,0
50,83,55,25, ?, ?,45,1
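The three treatments of missing values can be written down as three small sort routines. The sketch below applies them to the six rows of this slide, sorting in ascending order on the fifth attribute (index 4), with `None` standing in for `?`. The function names are hypothetical, and none of this is MartiRank's actual code.

```python
import random

def sort_missing_last(rows, column):
    """Sort rows that have a value; append rows with a missing value at the end."""
    present = sorted((r for r in rows if r[column] is not None),
                     key=lambda r: r[column])
    missing = [r for r in rows if r[column] is None]
    return present + missing

def sort_around_missing(rows, column):
    """Rows with a missing value keep their position; the rest are sorted."""
    present = iter(sorted((r for r in rows if r[column] is not None),
                          key=lambda r: r[column]))
    return [r if r[column] is None else next(present) for r in rows]

def sort_random_insert(rows, column, seed=None):
    """Sort rows that have a value; insert the others at random positions."""
    rng = random.Random(seed)
    result = sorted((r for r in rows if r[column] is not None),
                    key=lambda r: r[column])
    for r in (r for r in rows if r[column] is None):
        result.insert(rng.randrange(len(result) + 1), r)
    return result

rows = [
    [41, 91, None, 32, 11, 43, None, 1],
    [57, 36, 94, None, 96, 7, 23, 1],
    [79, 78, 33, 34, None, 31, None, 0],
    [19, 65, 80, 17, 78, 46, None, 0],
    [50, 83, 55, 25, None, None, 45, 1],
    [None, None, 39, 79, 82, 94, None, 0],
]
around = sort_around_missing(rows, 4)
at_end = sort_missing_last(rows, 4)
randomly = sort_random_insert(rows, 4, seed=0)
```

All three are defensible readings of a specification that does not mention missing values, yet they order the labels differently, so two implementations can disagree without either being obviously wrong.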
Issue #3: Categorical Data
Discovered that refactoring had introduced
a bug into an important calculation
• A global variable was being used incorrectly
This bug did not appear in tests that used
only repeating values or only
missing values
• However, categorical data necessarily has
repeating values and may have missing values
Randomly permuting the input data led to
different models (and then different
rankings) generated by SVM-Light
Caused by “chunking” data for use by an
approximating variant of optimization
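A permutation check of this kind needs no test oracle: the same examples are presented in several random orders and the outputs are compared with each other. The sketch below assumes the system under test can be wrapped as a function from a list of rows to a ranking; both toy rankers are hypothetical stand-ins and do not model SVM-Light.

```python
import random

def ranking_is_order_independent(rank, rows, trials=20, seed=0):
    """Return True if `rank` gives the same output for shuffled copies of rows."""
    rng = random.Random(seed)
    baseline = rank(rows)
    for _ in range(trials):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        if rank(shuffled) != baseline:
            return False
    return True

def rank_by_total(rows):
    # Toy ranker: a total order on the rows, so input order cannot matter.
    return sorted(rows, key=lambda row: (-sum(row), row))

def rank_first_chunk_only(rows, chunk_size=3):
    # Toy order-dependent ranker: only the first chunk of the input is sorted.
    head = sorted(rows[:chunk_size], key=lambda row: -sum(row))
    return head + rows[chunk_size:]

rows = [[5, 1], [2, 9], [7, 7], [1, 0], [4, 4], [8, 3]]
independent = ranking_is_order_independent(rank_by_total, rows)
dependent = ranking_is_order_independent(rank_first_chunk_only, rows)
```

A failed check does not by itself say which output is right, only that the implementation is sensitive to input order, which then needs an explanation such as the chunking described above.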
Parameterized random testing allowed us
to isolate the traits of the data sets
These traits may appear in real-world data
but not necessarily in the desired
Algorithm’s failure to address specific data
set traits can lead to discrepancies
Related Work – Machine Learning
There has been much research into
applying Machine Learning techniques to
software testing, but not the other way around
• Reusable real-world data sets and
Machine Learning frameworks are
available for checking how well a Machine
Learning algorithm predicts, but not for
testing its correctness
Related Work – Random Testing
Parameterization generally refers to
specifying data type or range of values
• Our work differs from that of Thévenod-Fosse et al. [’91] on “structural statistical
testing”, which focuses on path selection and
coverage testing, not system testing
• Also differs from “uniform statistical testing”
because although we do select random data
over a uniform distribution, we parameterize
it according to equivalence classes
Limitations and Future Work
Test suite adequacy for coverage not
addressed or measured
• Could also consider non-deterministic
Machine Learning algorithms
Can also include mutation testing for
effectiveness of data sets
• Should investigate creating large data sets
that correlate to real-world data
Our contribution is an approach that
combines parameterization and
randomness to control the properties of
very large data sets
Critical for limiting the scope of individual
tests and for pinpointing specific issues
related to the traits of the input data