Variance Gradient Optimized Classification on Vertically Structured Data
Dr. William Perrizo and Dr. Vasant Ubhaya
North Dakota State University
(william.perrizo@ndsu.edu)
ABSTRACT:
This paper describes an approach to the data mining technique called classification or prediction using
vertically structured data and a linear partitioning methodology based on the scalar product with a
judiciously chosen unit vector whose direction has been optimized using a gradient ascent method.
Choosing the best (or a good) unit vector is a challenge. We address that challenge by deriving a series
of theorems to guide the process, identifying a starting vector that is guaranteed to result in maximum
variance in the scalar product values. Knowing that the chosen unit vector maximizes the variance of the
scalar product values gives the user of this technology high assurance that gaps in those scalar values
will be prominent. A prominent gap in the scalar product values is a good cut point for a linear separation
hyperplane. Placing that hyperplane at the middle of that gap guarantees a substantial margin between classes.
For so-called “big data”, the speed at which a classification model can be trained is a critical issue.
Many very good classification algorithms are unusable in the big data environment because the training
step takes an unacceptable amount of time. Therefore, speed of training is very important.
To address the speed issue, in this paper we use horizontal processing of vertically structured data rather
than the ubiquitous vertical (scan) processing of horizontal (record) data. We use pTree bit-level vertical
structuring. pTree technology represents and processes data differently from the ubiquitous horizontal data
technologies. In pTree technology, the data is structured column-wise and the columns are processed
horizontally (typically across a few to a few hundred columns), while in horizontal technologies, data is
structured row-wise and those rows are processed vertically (often down millions, even billions, of rows).
pTree technology is a vertical data technology. pTrees are lossless, compressed, data-mining-ready
data structures [9][10]. pTrees are lossless because the vertical bit-wise partitioning used in pTree
technology guarantees that all information is retained completely; there is no loss of information
in converting horizontal data to this vertical format. pTrees are compressed because, in this technology,
segments of bit sequences which are either purely 1-bits or purely 0-bits are represented by a single bit.
This compression saves a considerable amount of space but, more importantly, facilitates faster
processing. pTrees are data-mining ready because the fast, horizontal data mining processes involved
can be done without the need to decompress the structures first. pTree vertical data structures have been
exploited in various domains and data mining algorithms, ranging from classification [1, 2, 3] and clustering
[4, 7] to association rule mining [5, 8], as well as other data mining algorithms. pTree technology is patented
in the U.S. Speed improvements are very important in data mining because many quite accurate
algorithms require an unacceptable amount of processing time to complete, even with today’s powerful
computing systems and efficient software platforms. In this paper, we evaluate and compare the
performance of several clustering algorithms implemented using pTree technology.
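To make the compression idea concrete, the following is a minimal Python sketch of our own (not the patented pTree layout): a non-negative integer column is sliced into bit-position slices, and each slice is broken into fixed-length segments, with purely-1 and purely-0 segments collapsed to a single token. The function names and the segment length are assumptions made for exposition only.

# Minimal sketch (not the patented pTree layout): slice a non-negative
# integer column into bit-position slices and run-compress each slice.
# Names such as bit_slices() and compress_slice() are illustrative only.

def bit_slices(column, nbits):
    """Return a list of bit slices; slices[i][r] is bit i of row r (i = 0 is the LSB)."""
    return [[(v >> i) & 1 for v in column] for i in range(nbits)]

def compress_slice(slice_bits, seg_len=8):
    """Represent each segment as 'pure1', 'pure0', or the mixed raw bits."""
    out = []
    for s in range(0, len(slice_bits), seg_len):
        seg = slice_bits[s:s + seg_len]
        if all(b == 1 for b in seg):
            out.append("pure1")
        elif all(b == 0 for b in seg):
            out.append("pure0")
        else:
            out.append(seg)          # mixed segment kept verbatim
    return out

column = [7, 7, 7, 7, 0, 0, 0, 0, 5, 2, 7, 7, 7, 7, 7, 7]
slices = bit_slices(column, nbits=3)
print(compress_slice(slices[0], seg_len=4))   # ['pure1', 'pure0', [1, 0, 1, 1], 'pure1']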
pTree Horizontal Processing of Vertical Data
Supervised Machine Learning, Classification, or Prediction is one of the important data mining
technologies for mining information out of large data sets. The assumption is usually that there is a
(usually very large) table of data in which the “class” of each instance is given (the training data set) and
another data set in which the classes are not known (the test data set) but are to be predicted based on class
information found in the training data set (therefore “supervised” prediction) [1, 2, 3].
Unsupervised Machine Learning or Clustering is also an important data mining technology for mining
information out of new data sets. The assumption is usually that there is essentially nothing yet known
about the data set (therefore it is “unsupervised”). The goal is often to partition the data set into subsets of
“similar” or “correlated” records [4, 7].
There may be various additional levels of supervision available in either classification or clustering and,
of course, that additional information should be used to advantage during the classification or clustering
process. That is to say, often the problem is not a purely supervised nor purely unsupervised problem.
For instance, it may be known that there are exactly k similarity subsets, in which case, a method such as
k-means clustering may be a productive method. To mine an RGB image for, say, red cars, white cars,
grass, pavement, bare ground, and other, k would be six. It would make sense to use that supervising
knowledge by employing k-means clustering starting with a mean set consisting of RGB vectors as
closely approximating the clusters as one can guess, e.g., red_car=(150,0,0), white_car=(85,85,85),
grass=(0,150,0), etc. That is to say, we should view the level of supervision available to us as a
continuum and not just the two extremes. The ultimate in supervising knowledge is a very large training
set, which has enough class information in it to very accurately assign predicted classes to all test
instances. We can think of a training set as a set of records that have been “classified” by an expert
(human or machine) into similarity classes (and assigned a class or label).
In this paper we assume there is no supervising knowledge except for an idea of what “similar instances”
should mean. We will assume the data set is a table of non-negative integers with n columns and N rows
and that two rows are similar if they are close in the Euclidean sense. More general assumptions could
be made (e.g., that there are categorical data columns as well, or that the similarity is based on L1 distance
or some correlation-based similarity) but we feel that would only obscure the main points we want to
make by generalizing.
We structure the data vertically into columns of bits (possibly compressed into tree structures), called
predicate Trees or pTrees. The simplest example of pTree structuring of a non-negative integer table is to
slice it vertically into its bit-position slices. The main reason we do that is so that we can process across
the (usually relatively few) vertical pTree structures rather than processing down the (usually very
numerous) rows. Very often these days, data is called Big Data because there are many, many rows
(billions or even trillions) while the number of columns, by comparison, is relatively small (tens,
hundreds, or a few thousand). Therefore, processing across (bit) columns rather than down
the rows has a clear speed advantage, provided that the column processing can be done very efficiently.
That is where the advantage of our approach lies, in devising very efficient (in terms of time taken)
algorithms for horizontal processing of vertical (bit) structures. Our approach also benefits greatly from
the fact that, in general, modern computing platforms can do logical processing of (even massive) bit
arrays very quickly [9,10].
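The following sketch (our own illustration under assumed conventions, not the pTree implementation) shows the kind of horizontal logical processing we mean: each bit slice is held as one bit array (here a Python integer whose r-th bit is bit i of row r), and the set of rows satisfying a range predicate such as value ≥ c is computed with one AND/OR/NOT pass across the slices rather than a scan down the rows.

# Sketch of horizontal logical processing of vertical bit slices (assumed
# encoding: slice P[i] is a Python int whose r-th bit is bit i of row r).
def rows_at_least(P, n_rows, c):
    """Return a bit mask of the rows whose value is >= c, using only slice-wide logic."""
    full = (1 << n_rows) - 1      # pure-1 mask over all rows
    gt, eq = 0, full              # rows already known greater; rows still tied with c
    for i in range(len(P) - 1, -1, -1):          # MSB down to LSB
        if (c >> i) & 1:
            eq &= P[i]                           # tie continues only where this bit is 1
        else:
            gt |= eq & P[i]                      # tied rows with a 1 here exceed c
            eq &= ~P[i] & full                   # tie continues only where this bit is 0
    return gt | eq                               # value > c or value == c

column = [5, 2, 7, 0, 3]
P = [sum(((v >> i) & 1) << r for r, v in enumerate(column)) for i in range(3)]
mask = rows_at_least(P, len(column), 3)
print([r for r in range(len(column)) if (mask >> r) & 1])   # rows with value >= 3: [0, 2, 4]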
The broad question we address in this paper is “do we give up quality (accuracy) when horizontally
processing vertical bit-slices compared to vertically processing horizontal records?” That is, if we structure
our data vertically and process across those vertical structures (horizontally), can those horizontal
algorithms compete quality-wise with the time-honored methods that process horizontal (record) data
vertically? The simplest and clearest setting in which to make that point, we believe, is that of totally
unsupervised machine learning, or clustering.
Horizontal Clustering of Vertical Data Algorithms
We have developed a series of algorithms to cluster datasets by employing a distance-dominating
functional (one that assigns a non-negative integer to each row). By distance-dominating we simply mean that the
distance between any two output functional values is always dominated by the distance between the two
input vectors. A class of functionals we have found productive is based on the dot or scalar product with
a unit vector.
We first note that the dot product with any unit vector d is distance-dominating, because
(x−y)∘d = |x−y||d|cos θ, where θ is the angle between x−y and d. Since |d| = 1 and |cos θ| ≤ 1, we get
distance(x∘d, y∘d) = |(x−y)∘d| ≤ |x−y| = distance(x, y).
The goal in each of these algorithms is to produce a large gap (or several large gaps) in the functional
values. A large gap in the functional values reveals an at-least-as-large gap between the functional
preimages (due to the distance dominance of the functional). Thus, each gap in the functional’s range partitions the
dataset into two clusters.
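The cut itself is straightforward once the functional values are in hand. The sketch below (NumPy, with a helper name of our own choosing) projects the rows onto a unit vector d, finds the largest gap in the sorted functional values, and places the cut point at the middle of that gap, which is where the separating hyperplane is placed.

import numpy as np

# Illustrative gap-cut sketch (helper name is ours): project rows onto a unit
# vector d, locate the largest gap in the functional values, and cut at the
# midpoint of that gap.
def largest_gap_cut(X, d):
    f = X @ d                            # functional values (dot product with d)
    order = np.argsort(f)
    gaps = np.diff(f[order])             # consecutive differences of the sorted values
    g = int(np.argmax(gaps))             # index of the widest gap
    cut = (f[order][g] + f[order][g + 1]) / 2.0
    return cut, f <= cut                 # cut point and membership in the lower cluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(8, 1, (50, 3))])
d = np.ones(3) / np.sqrt(3)              # any unit vector; MVM/GV/GM choose it better
cut, lower = largest_gap_cut(X, d)
print(round(cut, 2), int(lower.sum()), int((~lower).sum()))   # roughly a 50/50 split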
The algorithms we compare in this paper are:
1. MVM (Mean-to-Vector_of_Medians),
2. GV (Gradient-based hill-climbing of the Variance),
3. GM (essentially, GV applied to MVM)
MVM: In the MVM algorithm, which is heuristic in nature, we simply take the vector D running from
the mean of the data set to the vector of column-wise medians, then unitize it by dividing by its length to
get a unit vector d. The functional is then just the dot product with this unit vector d. In order to bit-slice
the column of functional values, we need it to contain only non-negative integers, so we subtract the
minimum from each functional value and round.
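A minimal sketch of the MVM functional, under the assumption that the data set is a NumPy array of non-negative integers with one row per record (the function name is ours):

import numpy as np

# Sketch of the MVM functional (name is ours): unitize the vector from the
# mean to the vector of medians, project, then shift to non-negative integers.
def mvm_functional(X):
    mean = X.mean(axis=0)
    vom = np.median(X, axis=0)               # vector of column-wise medians
    D = vom - mean                            # vector from the mean to the VOM
    d = D / np.linalg.norm(D)                 # unitize
    f = X @ d                                 # dot-product functional
    return np.rint(f - f.min()).astype(int)   # subtract the minimum and round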
GV: In the GV algorithm, we start with a particular unit vector d (e.g., we can compute the variance
matrix of the data set and take as our initial d the vector ek = (0, …, 0, 1, 0, …, 0), with a 1 only in the position k
corresponding to the maximal diagonal element of the variance matrix). Next, we “hill-climb” that initial
unit vector using the gradient of the variance, until a [local] maximum is reached (a unit vector that
[locally] maximizes the variance of the range set of the functional). Roughly speaking, “high variance” is likely to
imply “larger gap(s)”.
GM: In the GM algorithm, we simply apply the GV algorithm but starting with what we believe to be a
“very good” unit vector in terms of separating the space into clusters at a functional gap. One such “very
good unit vector” is the unitization of the vector running from the mean to the vector of medians (the one
used in MVM).
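A sketch of the GV and GM hill-climbs, written against ordinary NumPy arrays rather than pTrees (the function names are ours; the update formulas are derived in the Formulas section that follows):

import numpy as np

# Sketch of GV / GM hill-climbing (illustrative, not the authors' pTree code).
# The gradient of the projected variance V(d) = d'Ad is 2Ad, so each step
# replaces d by the re-unitized gradient direction.
def hill_climb_variance(X, d0, steps=100, tol=1e-9):
    A = np.cov(X, rowvar=False, bias=True)    # variance matrix of the columns
    d = d0 / np.linalg.norm(d0)
    for _ in range(steps):
        g = 2.0 * A @ d                       # gradient of V(d)
        d_new = g / np.linalg.norm(g)
        if np.linalg.norm(d_new - d) < tol:
            break
        d = d_new
    return d

def gv_direction(X):
    A = np.cov(X, rowvar=False, bias=True)
    k = int(np.argmax(np.diag(A)))            # GV start: e_k for the highest-variance column
    return hill_climb_variance(X, np.eye(X.shape[1])[k])

def gm_direction(X):
    start = np.median(X, axis=0) - X.mean(axis=0)   # GM start: the MVM direction
    return hill_climb_variance(X, start)

Since each step replaces d by the re-unitized gradient direction 2Ad, the iteration behaves like power iteration on the variance matrix and therefore climbs toward the direction of maximal projected variance.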
Formulas
In the MVM algorithm, the calculation of the mean or average function, which computes the average of
one complete table column by processing the vertical bit slices horizontally (with ANDs and ORs), goes as
shown here. The COUNT function is probably the simplest and most useful of all these aggregate
functions. It is not necessary to write a special function for Count because the pTree RootCount function,
which efficiently counts the number of 1-bits in a full slice, provides the mechanism to implement it.
Given a pTree Pi, RootCount(Pi) returns the number of 1-bits in Pi.
The Sum aggregate function can total a column of numerical values.
Evaluating Sum() with pTrees:
total = 0;
for i = 0 to n {
    total = total + 2^i * RootCount(Pi);
}
return total;
The Average Aggregate will show the average value of a column and is calculated from Count and Sum.
Average () = Sum ()/Count ().
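For concreteness, here is a runnable Python sketch of Count, Sum, and Average over bit slices, under the assumption (ours) that each slice is held as a single Python integer whose r-th bit is bit i of row r, so that RootCount is simply a population count:

# Runnable sketch of Count / Sum / Average over bit slices (assumed encoding:
# slice P[i] is a Python int whose r-th bit is bit i of row r; names are ours).
def root_count(p):
    return bin(p).count("1")                  # number of 1-bits in the slice

def ptree_sum(P):
    return sum((1 << i) * root_count(P[i]) for i in range(len(P)))

def ptree_average(P, n_rows):
    return ptree_sum(P) / n_rows

column = [5, 2, 7, 0, 3]
P = [sum(((v >> i) & 1) << r for r, v in enumerate(column)) for i in range(3)]
print(ptree_sum(P), ptree_average(P, len(column)))   # 17 and 3.4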
The calculation of the Vector of Medians, or simply the median of any column, goes as follows.
Median returns the median value in a column. Rank(K) returns the value that is the Kth largest value in a
field. Therefore, for a very fast and quite accurate determination of the median value, use
K = Ceiling(TableCardinality/2).
Evaluating Median() with pTrees:
median = 0;
pos = Ceiling(N/2);                (for Rank(K), set pos = K)
c = 0;
Pc = the pure-1 pTree for the attribute (all rows);
for i = n down to 0 {
    c = RootCount(Pc AND Pi);
    if (c >= pos) {
        median = median + 2^i;
        Pc = Pc AND Pi;
    } else {
        pos = pos - c;
        Pc = Pc AND NOT(Pi);
    }
}
return median;
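The same slice encoding used above gives a runnable sketch of this Rank/Median scan (names are ours):

# Runnable sketch of the bit-sliced rank/median scan (assumed encoding as
# above: P[i] is a Python int whose r-th bit is bit i of row r).
def ptree_rank(P, n_rows, K):
    """Return the K-th largest value of the column encoded by slices P."""
    full = (1 << n_rows) - 1                      # pure-1 mask over all rows
    pc, pos, value = full, K, 0
    for i in range(len(P) - 1, -1, -1):           # MSB down to LSB
        c = bin(pc & P[i]).count("1")             # RootCount(Pc AND Pi)
        if c >= pos:
            value += 1 << i
            pc &= P[i]
        else:
            pos -= c
            pc &= ~P[i] & full
    return value

column = [5, 2, 7, 0, 3]
P = [sum(((v >> i) & 1) << r for r, v in enumerate(column)) for i in range(3)]
print(ptree_rank(P, len(column), (len(column) + 1) // 2))   # 3, the median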
The gradient-based variance hill-climbing calculations are as follows. V(d) is the variance of the dot
product projection of the dataset X onto the unit vector d; writing an upper bar for the mean,

$$V(d) \equiv \mathrm{VarDPP}_d(X) = \overline{(X\circ d)^2} - \big(\overline{X}\circ d\big)^2 .$$

We can derive the following expression for that variance:

$$
\begin{aligned}
V(d) &= \tfrac{1}{N}\sum_{i=1}^{N}\Big(\sum_{j=1}^{n} x_{i,j}\,d_j\Big)^{2} - \Big(\sum_{j=1}^{n} \overline{X_j}\,d_j\Big)^{2}\\
&= \tfrac{1}{N}\sum_{i}\Big(\sum_{j} x_{i,j}d_j\Big)\Big(\sum_{k} x_{i,k}d_k\Big) - \Big(\sum_{j}\overline{X_j}\,d_j\Big)\Big(\sum_{k}\overline{X_k}\,d_k\Big)\\
&= \tfrac{1}{N}\Big(\sum_{i}\sum_{j} x_{i,j}^{2}d_j^{2} + 2\sum_{i}\sum_{j<k} x_{i,j}x_{i,k}\,d_j d_k\Big) - \Big(\sum_{j}\overline{X_j}^{\,2}d_j^{2} + 2\sum_{j<k}\overline{X_j}\,\overline{X_k}\,d_j d_k\Big)\\
&= \sum_{j=1}^{n}\big(\overline{X_j^{2}} - \overline{X_j}^{\,2}\big)d_j^{2} \;+\; 2\sum_{j<k}\big(\overline{X_j X_k} - \overline{X_j}\,\overline{X_k}\big)d_j d_k .
\end{aligned}
$$

Therefore the ij-th component of the variance matrix A is

$$a_{ij} = \overline{X_i X_j} - \overline{X_i}\,\overline{X_j},$$

so that $V(d) = d^{\mathsf T} A\, d$, and the gradient vector is

$$\nabla V(d) = \begin{pmatrix} 2a_{11}d_1 + 2\sum_{j\neq 1} a_{1j}d_j\\ 2a_{22}d_2 + 2\sum_{j\neq 2} a_{2j}d_j\\ \vdots\\ 2a_{nn}d_n + 2\sum_{j\neq n} a_{nj}d_j \end{pmatrix} = 2A\circ d .$$

The hill-climbing steps, starting at d0, are

$$d_1 \equiv \nabla V(d_0), \qquad d_2 \equiv \nabla V(d_1), \;\ldots$$

(each re-unitized, i.e., divided by its length, so that it remains a unit vector). Thus, we simply compute the
variance matrix A (and hence the gradient map 2A) one time, and then apply it recursively to the unit
vectors d:

$$\nabla V(d) = 2A\circ d = \begin{pmatrix} 2a_{11} & 2a_{12} & \cdots & 2a_{1n}\\ 2a_{21} & 2a_{22} & \cdots & 2a_{2n}\\ \vdots & & \ddots & \vdots\\ 2a_{n1} & 2a_{n2} & \cdots & 2a_{nn} \end{pmatrix}\begin{pmatrix} d_1\\ \vdots\\ d_i\\ \vdots\\ d_n \end{pmatrix}.$$
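As a quick numeric check of the closed form and the gradient above (our own verification sketch; np.cov with bias=True matches the 1/N convention of the derivation):

import numpy as np

# Numeric check of V(d) = d'Ad and grad V = 2Ad on random integer data.
rng = np.random.default_rng(1)
X = rng.integers(0, 16, size=(1000, 4)).astype(float)
d = rng.normal(size=4)
d /= np.linalg.norm(d)

A = np.cov(X, rowvar=False, bias=True)        # a_ij = mean(Xi*Xj) - mean(Xi)*mean(Xj)
V_direct = np.var(X @ d)                      # variance of the projection
V_form   = d @ A @ d                          # closed form d'Ad
grad     = 2 * A @ d                          # gradient 2Ad

eps, e0 = 1e-6, np.eye(4)[0]                  # central-difference check of component 0
numeric0 = (np.var(X @ (d + eps * e0)) - np.var(X @ (d - eps * e0))) / (2 * eps)
print(np.isclose(V_direct, V_form), np.isclose(grad[0], numeric0, rtol=1e-4))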
Finding the Optimal Scalar Product Unit Vector
VASANT’s Theorems go here. (Vasant develop).
Performance Evaluation
In this section we compare the accuracy of the three algorithms, MVM, GV, and GM, using four
datasets taken from the University of California Irvine Machine Learning Repository (UCI MLR): IRIS,
SEEDS, CONCRETE, and WINE. These datasets were selected because they are commonly used for
such performance evaluations and because they provide “supervision”, that is, they are already classified.
We didn’t use the supervision in applying the algorithms but only in evaluating their effectiveness in
terms of accuracy based on the expert classifications provided with these datasets.
The accuracy percentage results were as follows.
ACCURACY (%)    CONCRETE    IRIS    SEEDS    WINE
GV              76          82.7    94       62.7
MVM             78.8        94      93.3     66.7
GM              83          94.7    96       81.3
REFERENCES
[1] T. Abidin and W. Perrizo, “SMART-TV: A Fast and Scalable Nearest Neighbor Based Classifier for Data Mining,” Proceedings of the 21st Association of Computing Machinery Symposium on Applied Computing (SAC-06), Dijon, France, April 23-27, 2006.
[2] T. Abidin, A. Dong, H. Li, and W. Perrizo, “Efficient Image Classification on Vertically Decomposed Data,” Institute of Electrical and Electronic Engineers (IEEE) International Conference on Multimedia Databases and Data Management (MDDM-06), Atlanta, Georgia, April 8, 2006.
[3] M. Khan, Q. Ding, and W. Perrizo, “K-nearest Neighbor Classification on Spatial Data Stream Using P-trees,” Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 02), pp. 517-528, Taipei, Taiwan, May 2002.
[4] A. Perera, T. Abidin, M. Serazi, G. Hamer, and W. Perrizo, “Vertical Set Squared Distance Based Clustering without Prior Knowledge of K,” International Conference on Intelligent and Adaptive Systems and Software Engineering (IASSE-05), pp. 72-77, Toronto, Canada, July 20-22, 2005.
[5] I. Rahal, D. Ren, and W. Perrizo, “A Scalable Vertical Model for Mining Association Rules,” Journal of Information and Knowledge Management (JIKM), V3:4, pp. 317-329, 2004.
[6] D. Ren, B. Wang, and W. Perrizo, “RDF: A Density-Based Outlier Detection Method using Vertical Data Representation,” Proceedings of the 4th Institute of Electrical and Electronic Engineers (IEEE) International Conference on Data Mining (ICDM-04), pp. 503-506, Nov 1-4, 2004.
[7] E. Wang, I. Rahal, and W. Perrizo, “DAVYD: An Iterative Density-based Approach for Clusters with Varying Densities,” International Society of Computers and their Applications (ISCA) International Journal of Computers and Their Applications, V17:1, pp. 1-14, March 2010.
[8] Qin Ding, Qiang Ding, and W. Perrizo, “PARM - An Efficient Algorithm to Mine Association Rules from Spatial Data,” Institute of Electrical and Electronic Engineers (IEEE) Transactions on Systems, Man, and Cybernetics, Volume 38, Number 6, ISSN 1083-4419, pp. 1513-1525, December 2008.
[9] I. Rahal, M. Serazi, A. Perera, Q. Ding, F. Pan, D. Ren, W. Wu, and W. Perrizo, “DataMIME™,” Association of Computing Machinery, Management of Data (ACM SIGMOD 04), Paris, France, June 2004.
[10] Treeminer Inc., The Vertical Data Mining Company, 175 Admiral Cochrane Drive, Suite 300, Annapolis, Maryland 21401, http://www.treeminer.com