
Parameterizing Random Test Data
According to Equivalence Classes
Chris Murphy, Gail Kaiser, Marta Arias
Columbia University
What is random testing?
This is not part of the talk!!!!

- Random testing is the notion of using “random” input to test the application
- As opposed to using pre-determined and manually selected “equivalence classes” or “partitions”
Introduction
- We are investigating the quality assurance of Machine Learning (ML) applications
  - Currently we are concerned with a real-world application for potential future use in predicting electrical device failures
  - Using ranking instead of classification
- Our concern is not whether an algorithm predicts well but whether an implementation operates correctly
Data Set Options

- Real-world data sets
  - Not always accessible/available
  - May not necessarily contain the separation or combination of traits that we desire to test
- Hand-generation of data
  - Only useful for small tests
- Random testing
  - Limited by the lack of a reliable test oracle
  - ML applications of interest fall into the category of “non-testable programs”
Motivation

- Without a reliable test oracle, we can only:
  - Look for obvious faults
  - Consider intermediate results
  - Detect discrepancies in the specification
- We need to restrict some properties of random test data generation
Our Solution

- Parameterized Random Test Data Generation
  - Automatically generate random data sets, but parameterized to control the range and characteristics of those random values
- Parameterization allows us to create a hybrid between equivalence class partitioning and random testing
Overview
- Machine Learning Background
- Data Generation Framework
- Findings and Results
- Evaluation and Observations
- Conclusions and Future Work
Machine Learning Fundamentals
- Data sets consist of a number of examples, each of which has attributes and a label
- In the first phase (“training”), a model is generated that attempts to generalize how attributes relate to the label
- In the second phase (“validation”), the model is applied to a previously-unseen data set with unknown labels to produce a classification (or, in our case, a ranking)
Problems Faced in Testing
- The testing input should be based on the problem domain
- Need to consider a way to mimic all of the traits of the real-world data sets
- Also need to keep in mind that we do not have a reliable test oracle
Analyzing the Problem Domain

- Consider properties of data sets in general
  - Data set size: number of attributes and examples
  - Range of values: attributes and labels
  - Precision of floating-point numbers
  - Whether values can repeat
- Consider properties of real-world data sets in the domain of interest
  - How alphanumeric attributes are to be interpreted
  - Whether data values might be missing
Equivalence Classes
- Data sizes of different orders of magnitude
- Repeating vs. non-repeating attribute values
- Missing vs. no missing attribute values
- Categorical vs. non-categorical data
- 0/1 labels vs. non-negative integer labels
- Predictable vs. non-predictable data sets
- Used data set generator to parameterize test case selection criteria
How Data Are Generated
- M attributes and N examples
- No-repeat mode:
  - Generate a list of integers from 1 to M*N and then randomly permute them
- Repeat mode:
  - Each value in the data set is simply a random integer between 1 and M*N
  - Tool ensures at least one set of repeating numbers
- Both modes are sketched below
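A minimal sketch of the two generation modes, assuming a Python implementation; the function name generate_data and its parameters are illustrative, not the authors' actual tool:

```python
import random

def generate_data(m, n, repeat=False, seed=None):
    """Generate N examples of M attributes each, in one of the two modes."""
    rng = random.Random(seed)
    total = m * n
    if not repeat:
        # No-repeat mode: the integers 1..M*N in a random permutation
        values = list(range(1, total + 1))
        rng.shuffle(values)
    else:
        # Repeat mode: every cell is an independent draw from 1..M*N
        values = [rng.randint(1, total) for _ in range(total)]
        if len(set(values)) == total:
            # Force at least one repeated value, mirroring the tool's guarantee
            values[0] = values[-1]
    # Split the flat list into N rows of M attribute values
    return [values[i * m:(i + 1) * m] for i in range(n)]
```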
Generating Labels

- Specify percentage of “positive examples” to include in the data set
  - Positive examples have a label of 1
  - Negative examples have a label of 0
- Data generation framework guarantees that the number of positive examples comes out exactly right, even though the positive labels are randomly placed throughout the data set (see the sketch below)
- Labels are never unknown/missing
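A minimal sketch of how the exact positive-example count could be guaranteed; the function name and the percentage handling are our own illustrative choices:

```python
import random

def generate_labels(n, pct_positive, seed=None):
    """Exactly round(n * pct/100) labels are 1, the rest 0, shuffled into random positions."""
    rng = random.Random(seed)
    n_pos = round(n * pct_positive / 100.0)  # pct_positive given as a percentage, e.g. 40
    labels = [1] * n_pos + [0] * (n - n_pos)
    rng.shuffle(labels)
    return labels

# generate_labels(10, 40) always yields exactly four 1s among ten labels
```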
Categorical Data

- For some alphanumeric attributes, data pre-processing is used to expand K distinct values to K attributes
  - Same as in the real-world ranking application
- Input parameter to data generation tool is of the format (a1, a2, ..., aK-1, aK, m)
  - a1 through aK represent the percentage distribution of those values for the categorical attribute
  - m is the percentage of unknown values
- One possible reading of these parameters is sketched below
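A minimal sketch of one way the (a1, ..., aK, m) parameter could drive generation of a categorical column, together with the K-attribute expansion; the function names and the exact interpretation of the percentages are our assumptions:

```python
import random

def generate_categorical(n, percentages, pct_missing, seed=None):
    """One categorical column: `percentages` maps each of the K distinct values
    to its share (a1..aK); `pct_missing` is m. Both are given as percentages."""
    rng = random.Random(seed)
    values = list(percentages)
    weights = list(percentages.values())
    column = []
    for _ in range(n):
        if rng.random() * 100 < pct_missing:
            column.append("?")  # unknown value
        else:
            column.append(rng.choices(values, weights=weights, k=1)[0])
    return column

def expand(column, categories):
    """Pre-processing: expand K distinct values into K 0/1 attributes (one per value)."""
    return [[1 if v == c else 0 for c in categories] for v in column]
```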
Data Set Generator - Parameters
- # of examples
- # of attributes
- % positive examples (label = 1)
- % missing
- any categorical data
- repeat/no-repeat modes
Sample Data Sets

- 10 examples, 10 attributes, 40% positive examples, 20% missing, repeats allowed
27,81,88,59, ?,16,88, ?,41, ?,0
15,70,91,41, ?, 3, ?, ?, ?,64,0
82, ?,51,47, ?, 4, 1,99, ?,51,0
22,72,11, ?,96,24,44,92, ?,11,1
57,77, ?,86,89,77,61,76,96,98,1
76,11, 4,51,43, ?,79,21,28, ?,0
6,33, ?, ?,52,63,94,75, 8,26,0
77,36,91, ?,47, 3,85,71,35,45,1
?,17,15, 2,90,70, ?, 7,41,42,0
8,58,42,41,74,87,68,68, 1,15,1
35, 3,20,41,91, ?,32,11,43, ?,1
19,50,11,57,36,94, ?,96, 7,23,1
24,36,36,79,78,33,34, ?,32, ?,0
?,15, ?,19,65,80,17,78,43, ?,0
40,31,89,50,83,55,25, ?, ?,45,1
52, ?, ?, ?, ?,39,79,82,94, ?,0
86,45, ?, ?,74,68,13,66,42,56,0
?,53,91,23,11, ?,47,61,79, 8,0
77,11,34,44,92, ?,63,62,51,51,1
21, 1,70,14,16,40,63,94,69,83,0
The Testing Framework
- Data set generator
- Model comparison
- Ranking comparison: includes metrics like normalized equivalence and AUCs
- Tracing options: for generating and comparing outputs of debugging statements
MartiRank and SVM

- MartiRank was specifically designed for the real-world device failure application
  - Seeks to find the sequence of attributes to segment and sort the data to produce the best result
- SVM is typically a classification algorithm
  - Seeks to find a hyperplane that separates examples from different classes
  - SVM-Light has a ranking mode based on the distance from the hyperplane
Findings

- Testing approach and framework were developed for MartiRank, then applied to SVM
- Only the findings most related to parameterized random testing are presented here
  - More details and case studies about the testing of MartiRank can be found in our tech report
Issue #1: Repeating Values

- One version of MartiRank did not use “stable” sorting
- Example from the slide: two rows tied on the sort attribute (both have the value 3):
  91,41,19, 3,57,11,20,64,0
  36,73,47, 3,85,71,35,45,1
  - A stable sort keeps the tied rows in their original relative order
  - An unstable sort may reverse them, producing a different ordering (illustrated below)
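For illustration only (Python, not MartiRank’s code): a stable sort keeps tied rows in input order, while an unstable sort is free to reorder them, simulated here with a random tie-breaker:

```python
import random

# Two rows tied on the sort attribute (both have 3 in the fourth column)
rows = [
    [91, 41, 19, 3, 57, 11, 20, 64, 0],
    [36, 73, 47, 3, 85, 71, 35, 45, 1],
]

# Stable: Python's sorted() is guaranteed stable, so tied rows keep their
# original relative order and repeated runs give identical output
stable = sorted(rows, key=lambda r: r[3])

# Unstable: an unstable implementation may order tied rows arbitrarily,
# simulated here by breaking ties at random
unstable = sorted(rows, key=lambda r: (r[3], random.random()))
```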
Issue #2: Sparse Data Sets

- Not specifically addressed in the specification
- Example from the slide: the same six rows containing missing (“?”) values are sorted three different ways, one per strategy:
  - sort “around” the missing values
  - randomly insert the missing values
  - put the missing values at the end
- Two of these strategies are sketched below
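A minimal sketch of two of the strategies (Python, illustrative; None stands in for “?”, and the function names are ours, not MartiRank’s):

```python
import random

def sort_missing_last(rows, col):
    """Sort on column `col`, keeping rows whose value is missing together at the end."""
    return sorted(rows, key=lambda r: (r[col] is None,
                                       0 if r[col] is None else r[col]))

def sort_missing_random(rows, col):
    """Sort the rows that have the attribute, then re-insert the
    missing-value rows at random positions."""
    present = sorted([r for r in rows if r[col] is not None], key=lambda r: r[col])
    for r in rows:
        if r[col] is None:
            present.insert(random.randrange(len(present) + 1), r)
    return present
```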
Issue #3: Categorical Data

- Discovered that refactoring had introduced a bug into an important calculation
  - A global variable was being used incorrectly
- This bug did not appear in any of the tests that had only repeating values or only missing values
  - However, categorical data necessarily has repeating values and may have missing values
Issue #4: Permuted Input Data

- Randomly permuting the input data led to different models (and then different rankings) generated by SVM-Light
- Caused by “chunking” the data for use by an approximating variant of the optimization algorithm
Observations

- Parameterized random testing allowed us to isolate the traits of the data sets
- These traits may appear in real-world data but not necessarily in the desired combinations
- An algorithm’s failure to address specific data set traits can lead to discrepancies
Related Work – Machine Learning
- There has been much research into applying Machine Learning techniques to software testing, but not the other way around
- Reusable real-world data sets and Machine Learning frameworks are available for checking how well a Machine Learning algorithm predicts, but not for testing its correctness
Related Work – Random Testing
- Parameterization generally refers to specifying data type or range of values
- Our work differs from that of Thévenod-Fosse et al. [’91] on “structural statistical testing”, which focuses on path selection and coverage testing, not system testing
- Also differs from “uniform statistical testing” because although we do select random data over a uniform distribution, we parameterize it according to equivalence classes
Limitations and Future Work
- Test suite adequacy for coverage not addressed or measured
- Could also consider non-deterministic Machine Learning algorithms
- Can also include mutation testing to measure the effectiveness of the data sets
- Should investigate creating large data sets that correlate to real-world data
Conclusion

- Our contribution is an approach that combines parameterization and randomness to control the properties of very large data sets
- Critical for limiting the scope of individual tests and for pinpointing specific issues related to the traits of the input data