Variance Gradient Optimized Classification on Vertically Structured Data

Dr. William Perrizo and Dr. Vasant Ubhaya
North Dakota State University (william.perrizo@ndsu.edu)

ABSTRACT: This paper describes an approach to the data mining technique called classification (or prediction) using vertically structured data and a linear partitioning methodology based on the scalar product with a judiciously chosen unit vector whose direction has been optimized using a gradient ascent method. Choosing the best (or a good) unit vector is a challenge. We address that challenge by deriving a series of theorems to guide the process, in which a starting vector is identified that is guaranteed to result in maximum variance in the scalar product values. Knowing that the chosen unit vector is one which maximizes the variance of the scalar product values gives the user of this technology high assurance that gaps in those scalar values will be prominent. A prominent gap in the scalar product values is a good cut point for a linear separation hyperplane. Placing that hyperplane at the middle of the gap guarantees a substantial margin between classes.

For so-called "big data", the speed at which a classification model can be trained is a critical issue. Many very good classification algorithms are unusable in the big data environment because the training step takes an unacceptable amount of time. Therefore, speed of training is very important. To address the speed issue, in this paper we use horizontal processing of vertically structured data rather than the ubiquitous vertical (scan) processing of horizontal (record) data. We use pTree bit-level vertical structuring. pTree technology represents and processes data differently from the ubiquitous horizontal data technologies. In pTree technology, the data is structured column-wise and the columns are processed horizontally (typically across a few to a few hundred columns), while in horizontal technologies, data is structured row-wise and those rows are processed vertically (often down millions, even billions of rows). pTree technology is a vertical data technology. pTrees are lossless, compressed and data-mining-ready data structures [9][10]. pTrees are lossless because the vertical bit-wise partitioning used in the pTree technology guarantees that all information is retained completely; there is no loss of information in converting horizontal data to this vertical format. pTrees are compressed because, in this technology, segments of bit sequences which are either purely 1-bits or purely 0-bits are represented by a single bit. This compression saves a considerable amount of space but, more importantly, facilitates faster processing (a minimal sketch of this idea follows this introduction). pTrees are data-mining ready because the fast, horizontal data mining processes involved can be done without the need to decompress the structures first. pTree vertical data structures have been exploited in various domains and data mining algorithms, ranging from classification [1,2,3] and clustering [4,7] to association rule mining [5,8] and outlier detection [6], as well as other data mining algorithms. pTree technology is patented in the U.S. Speed improvements are very important in data mining because many quite accurate algorithms require an unacceptable amount of processing time to complete, even with today's powerful computing systems and efficient software platforms. In this paper, we evaluate and compare the accuracy of several clustering data mining algorithms built on pTree technology.
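To make the compression idea above concrete, here is a minimal illustrative sketch in Python of a one-level compressed bit slice: the bit column is cut into fixed-size segments, any segment that is purely 0-bits or purely 1-bits collapses to a single flag, and a count of 1-bits is obtained without expanding the pure segments. This is not the patented pTree implementation; the segment size and the names (SEG, compress_slice, root_count) are hypothetical choices made only for illustration.

    # Illustrative sketch of a one-level "compressed bit slice".
    # NOT the patented pTree implementation; names are hypothetical.

    SEG = 8  # segment (fan-out) size; real pTrees organize such segments in a tree

    def compress_slice(bits):
        """Split a bit column into segments; pure segments collapse to one flag."""
        nodes = []
        for i in range(0, len(bits), SEG):
            seg = bits[i:i + SEG]
            if all(b == 1 for b in seg):
                nodes.append(("pure1", len(seg)))   # purely 1-bits -> single flag
            elif all(b == 0 for b in seg):
                nodes.append(("pure0", len(seg)))   # purely 0-bits -> single flag
            else:
                nodes.append(("mixed", seg))        # mixed segment kept raw
        return nodes

    def root_count(nodes):
        """Count 1-bits without decompressing pure segments."""
        total = 0
        for kind, payload in nodes:
            if kind == "pure1":
                total += payload                    # every bit in the segment is 1
            elif kind == "mixed":
                total += sum(payload)
            # pure0 segments contribute nothing
        return total

    if __name__ == "__main__":
        column = [1] * 16 + [0] * 8 + [1, 0, 1, 1, 0, 0, 1, 0]
        print(root_count(compress_slice(column)))   # 20 one-bits, pure runs never expanded

In the actual pTree structure this collapsing is organized recursively as a tree of pure/mixed nodes rather than the single level shown here.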
pTree Horizontal Processing of Vertical Data

Supervised Machine Learning (Classification or Prediction) is one of the important data mining technologies for mining information out of large data sets. The assumption is usually that there is a (typically very large) table of data in which the "class" of each instance is given (the training data set) and there is another data set in which the classes are not known (the test data set) but are to be predicted based on the class information found in the training data set (hence "supervised" prediction) [1, 2, 3]. Unsupervised Machine Learning, or Clustering, is also an important data mining technology for mining information out of new data sets. The assumption is usually that essentially nothing is yet known about the data set (hence it is "unsupervised"). The goal is often to partition the data set into subsets of "similar" or "correlated" records [4, 7]. There may be various additional levels of supervision available in either classification or clustering and, of course, that additional information should be used to advantage during the classification or clustering process. That is to say, often the problem is neither purely supervised nor purely unsupervised. For instance, it may be known that there are exactly k similarity subsets, in which case a method such as k-means clustering may be productive. To mine an RGB image for, say, red cars, white cars, grass, pavement, bare ground, and other, k would be six. It would make sense to use that supervising knowledge by employing k-means clustering starting with a mean set consisting of RGB vectors approximating the clusters as closely as one can guess, e.g., red_car=(150,0,0), white_car=(85,85,85), grass=(0,150,0), etc. That is to say, we should view the level of supervision available to us as a continuum and not just the two extremes. The ultimate in supervising knowledge is a very large training set, which has enough class information in it to very accurately assign predicted classes to all test instances. We can think of a training set as a set of records that have been "classified" by an expert (human or machine) into similarity classes (and assigned a class label). In this paper we assume there is no supervising knowledge except for an idea of what "similar instances" should mean. We will assume the data set is a table of non-negative integers with n columns and N rows and that two rows are similar if they are close in the Euclidean sense. More general assumptions could be made (e.g., that there are categorical data columns as well, or that the similarity is based on L1 distance or some correlation-based similarity), but we feel that generalizing would only obscure the main points we want to make. We structure the data vertically into columns of bits (possibly compressed into tree structures), called predicate Trees or pTrees. The simplest example of pTree structuring of a non-negative integer table is to slice it vertically into its bit-position slices, as sketched below. The main reason we do this is so that we can process across the (usually relatively few) vertical pTree structures rather than down the (usually very numerous) rows. Very often these days, data is called Big Data because there are many, many rows (billions or even trillions) while the number of columns, by comparison, is relatively small (tens, hundreds, or a few thousand).
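As an illustration of horizontal processing of vertical data, the following Python sketch bit-slices one non-negative integer column into its bit-position slices and then counts the rows whose value is at least a given threshold purely with bitwise AND/OR/NOT operations across the slices, never scanning down the rows. The helper names (bit_slices, count_ge) and the use of Python integers as bit masks are our own illustrative assumptions, not part of the pTree system.

    # Illustrative sketch: horizontal (bitwise) processing of vertical bit slices.

    def bit_slices(column, nbits):
        """Slice a column of non-negative integers into bit-position slices.
        Slice i is an integer whose r-th bit is bit i of row r's value."""
        slices = [0] * nbits
        for r, v in enumerate(column):
            for i in range(nbits):
                if (v >> i) & 1:
                    slices[i] |= 1 << r
        return slices

    def count_ge(slices, nbits, nrows, threshold):
        """Count rows with value >= threshold using only bitwise logic on slices."""
        all_rows = (1 << nrows) - 1
        gt, eq = 0, all_rows                  # strictly-greater mask, still-equal mask
        for i in range(nbits - 1, -1, -1):    # most significant bit first
            if (threshold >> i) & 1:
                eq &= slices[i]               # need a 1 here to stay equal
            else:
                gt |= eq & slices[i]          # a 1 here makes the value greater
                eq &= ~slices[i] & all_rows
        return bin((gt | eq) & all_rows).count("1")

    if __name__ == "__main__":
        col = [3, 7, 5, 0, 6, 2]
        s = bit_slices(col, nbits=3)
        print(count_ge(s, nbits=3, nrows=len(col), threshold=4))  # -> 3 (rows 7, 5, 6)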
Therefore, processing across (bit) columns rather than down the rows has a clear speed advantage, provided that the column processing can be done very efficiently. That is where the advantage of our approach lies: in devising very efficient (in terms of time taken) algorithms for horizontal processing of vertical (bit) structures. Our approach also benefits greatly from the fact that, in general, modern computing platforms can do logical processing of (even massive) bit arrays very quickly [9,10]. The broad question we address in this paper is: do we give up quality (accuracy) when horizontally processing vertical bit slices compared to vertically processing horizontal records? That is, if we structure our data vertically and process across those vertical structures (horizontally), can those horizontal algorithms compete quality-wise with the time-honored methods that process horizontal (record) data vertically? The simplest and clearest setting in which to make that point, we believe, is that of totally unsupervised machine learning, or clustering.

Horizontal Clustering of Vertical Data Algorithms

We have developed a series of algorithms to cluster datasets by employing a distance-dominating functional (one that assigns a non-negative integer to each row). By distance dominating we simply mean that the distance between any two output functional values is always dominated by the distance between the two input vectors. A class of functionals we have found productive are those based on the dot (scalar) product with a unit vector. We first note that the dot product with any unit vector d is distance dominating: for any vector u, u∘d = |u||d|cos θ, where θ is the angle between u and d; applying this to u = x−y, and since |d| = 1 and |cos θ| ≤ 1, we get distance(x∘d, y∘d) = |(x−y)∘d| ≤ |x−y| = distance(x, y). The goal in each of these algorithms is to produce a large gap (or several large gaps) in the functional values. A large gap in the functional values reveals an at least as large gap between the functional preimages (due to the distance dominance of the functional). Thus, each functional range gap partitions the dataset into two clusters. The algorithms we compare in this paper are:

1. MVM (Mean-to-Vector-of-Medians),
2. GV (Gradient-based hill-climbing of the Variance),
3. GM (essentially, GV applied to MVM).

MVM: In the MVM algorithm, which is heuristic in nature, we simply take the vector D running from the mean of the data set to the vector of column-wise medians, then unitize it by dividing by its length to get a unit vector d. The functional is then just the dot product with this unit vector d. In order to bit-slice the column of functional values, we need it to contain only non-negative integers, so we subtract the minimum from each functional value and round.

GV: In the GV algorithm, we start with a particular unit vector d (e.g., we can compute the variance matrix of the data set and take as our initial d the standard basis vector e_k = (0, ..., 0, 1, 0, ..., 0), with a 1 only in the position k corresponding to the maximal diagonal element of the variance matrix). Next, we "hill-climb" the initial unit vector using the gradient of the variance, until a [local] maximum is reached (a unit vector that [locally] maximizes the variance of the range set of the functional). Roughly speaking, "high variance" is likely to imply "larger gap(s)".

GM: In the GM algorithm, we simply apply the GV algorithm, but starting with what we believe to be a "very good" unit vector in terms of separating the space into clusters at a functional gap. One such "very good" unit vector is the unitization of the vector running from the mean to the vector of medians (the one used in MVM). A brief sketch of the MVM gap-based split follows.
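The following Python sketch puts the MVM step just described together: form d from the mean to the vector of column-wise medians, project every row onto d, shift and round the projections to non-negative integers, and cut the data at the middle of the largest gap. It is an illustrative reading of the description above, not the authors' implementation; mvm_split and the toy data are hypothetical.

    # Illustrative sketch of the MVM step (hypothetical helper, not the authors' code).
    import numpy as np

    def mvm_split(X):
        """Project rows of X onto the unitized mean-to-vector-of-medians vector
        and split the data at the middle of the largest gap in the projections."""
        mean = X.mean(axis=0)
        vom = np.median(X, axis=0)                  # vector of column-wise medians
        D = vom - mean
        d = D / np.linalg.norm(D)                   # unitize
        f = X @ d                                   # functional: dot product with d
        f = np.rint(f - f.min()).astype(int)        # shift/round to non-negative ints

        order = np.argsort(f)
        gaps = np.diff(f[order])
        g = int(np.argmax(gaps))                    # position of the largest gap
        cut = (f[order[g]] + f[order[g + 1]]) / 2.0 # cut at the middle of the gap

        left = f <= cut
        return left, ~left, d, cut

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        # toy data: two well-separated blobs of different sizes
        X = np.vstack([rng.normal(0, 1, (70, 3)), rng.normal(10, 1, (30, 3))])
        a, b, d, cut = mvm_split(X)
        print(sorted([int(a.sum()), int(b.sum())]))  # -> [30, 70]

Each side of the cut can then be treated as a cluster (and re-examined for further gaps in the same way).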
Formulas

In the MVM algorithm, the mean (average) of a complete table column is computed by processing the vertical bit slices horizontally (with ANDs and ORs), as shown here. The COUNT function is probably the simplest and most useful of all these aggregate functions. It is not necessary to write a special function for Count because the pTree RootCount function, which efficiently counts the number of 1-bits in a full slice, provides the mechanism to implement it. Given a pTree Pi, RootCount(Pi) returns the number of 1-bits in Pi. The Sum aggregate function can total a column of numerical values.

Evaluating Sum() with pTrees:

    total = 0;
    for i = 0 to n {
        total = total + 2^i * RootCount(P_i);
    }
    return total;

The Average aggregate gives the average value of a column and is calculated from Count and Sum: Average() = Sum() / Count().

The calculation of the Vector of Medians, or simply the median of any column, goes as follows. Median returns the median value in a column. Rank(K) returns the value that is the K-th largest value in a field. Therefore, for a very fast and quite accurate determination of the median value, use K = Roof(TableCardinality/2).

Evaluating Median() with pTrees (single attribute with bit slices P_n, ..., P_0):

    median = 0;
    pos = K;                        // K = Roof(N/2) for the median rank
    Pc = pure-1 pTree;              // all rows are candidates initially
    for i = n down to 0 {
        c = RootCount(Pc AND P_i);
        if (c >= pos) {
            median = median + 2^i;
            Pc = Pc AND P_i;
        } else {
            pos = pos - c;
            Pc = Pc AND NOT(P_i);
        }
    }
    return median;

The gradient-based variance hill-climbing calculations are as follows. Let V(d) be the variance of the dot product projection of the data set X onto the unit vector d (we use an upper bar for the MEAN):

$$V(d) \equiv \mathrm{VarDPP}_d(X) = \overline{(X\circ d)^2} - (\overline{X}\circ d)^2.$$

We can derive the following expression for that variance:

$$
\begin{aligned}
V(d) &= \overline{(X\circ d)^2} - (\overline{X}\circ d)^2
      = \frac{1}{N}\sum_{i=1}^{N}\Big(\sum_{j=1}^{n} x_{i,j}\,d_j\Big)^{2} - \Big(\sum_{j=1}^{n}\overline{X_j}\,d_j\Big)^{2} \\
     &= \frac{1}{N}\sum_{i}\Big(\sum_{j} x_{i,j}\,d_j\Big)\Big(\sum_{k} x_{i,k}\,d_k\Big) - \Big(\sum_{j}\overline{X_j}\,d_j\Big)\Big(\sum_{k}\overline{X_k}\,d_k\Big) \\
     &= \frac{1}{N}\Big(\sum_{i}\sum_{j} x_{i,j}^{2}\,d_j^{2} + 2\sum_{i}\sum_{j<k} x_{i,j}x_{i,k}\,d_j d_k\Big) - \Big(\sum_{j}\overline{X_j}^{\,2} d_j^{2} + 2\sum_{j<k}\overline{X_j}\,\overline{X_k}\,d_j d_k\Big) \\
     &= \sum_{j=1}^{n}\big(\overline{X_j^{2}} - \overline{X_j}^{\,2}\big)\,d_j^{2} + 2\sum_{j<k}\big(\overline{X_j X_k} - \overline{X_j}\,\overline{X_k}\big)\,d_j d_k .
\end{aligned}
$$

Therefore the jk-th component of the variance (covariance) matrix A is a_{jk} = \overline{X_j X_k} - \overline{X_j}\,\overline{X_k}, so that V(d) = \sum_{j,k} a_{jk} d_j d_k = d\circ(A\circ d), and the gradient vector has components

$$\frac{\partial V}{\partial d_j} = 2a_{jj}\,d_j + 2\sum_{k\neq j} a_{jk}\,d_k, \qquad j = 1,\ldots,n.$$

The hill-climbing steps, starting at d_0, are d_1 ≡ ∇V(d_0), d_2 ≡ ∇V(d_1), ..., with each d_i re-unitized so that it remains a unit vector. Thus, we simply compute the variance matrix A one time and then apply the gradient map repeatedly to the current unit vector d:

$$\mathrm{GRADIENT}(V) = 2A\circ d = 2\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n}\\ a_{21} & a_{22} & \cdots & a_{2n}\\ \vdots & & \ddots & \vdots\\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}\begin{pmatrix} d_1\\ \vdots\\ d_n \end{pmatrix}.$$

A sketch of this hill-climb appears below.
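The following Python sketch is one reading of the GV hill-climb derived above: compute the covariance matrix A once, start from the standard basis vector e_k for the largest diagonal entry, and repeatedly replace d by the unitized gradient 2A∘d until the projection variance stops improving. The function name gv_direction, the tolerance, and the iteration cap are our own assumptions, not the authors' code.

    # Illustrative sketch of the GV variance hill-climb (hypothetical helper).
    import numpy as np

    def gv_direction(X, max_iter=100, tol=1e-9):
        """Hill-climb a unit vector d to (locally) maximize Var(X o d)."""
        A = np.cov(X, rowvar=False, bias=True)    # a_jk = mean(XjXk) - mean(Xj)mean(Xk)

        # start at e_k, k = index of the maximal diagonal element of A
        k = int(np.argmax(np.diag(A)))
        d = np.zeros(X.shape[1])
        d[k] = 1.0

        var = d @ A @ d                           # V(d) = sum_jk a_jk d_j d_k
        for _ in range(max_iter):
            g = 2.0 * (A @ d)                     # gradient of V at d
            d_new = g / np.linalg.norm(g)         # re-unitize
            var_new = d_new @ A @ d_new
            if var_new - var <= tol:              # no further improvement
                break
            d, var = d_new, var_new
        return d, var

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        X = rng.normal(size=(500, 4)) @ np.diag([1.0, 5.0, 2.0, 0.5])
        d, v = gv_direction(X)
        print(np.round(d, 3), round(float(v), 3))  # d should align with the high-variance axis

In effect the iteration is the power method applied to A, which is consistent with the claim that the climb reaches a unit vector that (locally) maximizes the variance of the projected values.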
Finding the Optimal Scalar Product Unit Vector

VASANT's Theorems go here. (Vasant develop.)

Performance Evaluation

In this section we compare the accuracy performance of the three algorithms, MVM, GV and GM, using four datasets taken from the University of California Irvine Machine Learning Repository (UCI MLR): IRIS, SEEDS, CONCRETE, and WINE. These datasets were selected because they are commonly used for such performance evaluations and because they provide "supervision", that is, they are already classified. We did not use the supervision in applying the algorithms but only in evaluating their effectiveness in terms of accuracy, based on the expert classifications provided with these datasets. The accuracy percentage results were as follows.

ACCURACY (%)   GV     MVM    GM
CONCRETE       76     78.8   83
IRIS           82.7   94     94
SEEDS          93.3   94.7   96
WINE           62.7   66.7   81.3

REFERENCES

[1] T. Abidin and W. Perrizo, "SMART-TV: A Fast and Scalable Nearest Neighbor Based Classifier for Data Mining," Proceedings of the 21st Association for Computing Machinery Symposium on Applied Computing (SAC-06), Dijon, France, April 23-27, 2006.
[2] T. Abidin, A. Dong, H. Li, and W. Perrizo, "Efficient Image Classification on Vertically Decomposed Data," Institute of Electrical and Electronics Engineers (IEEE) International Conference on Multimedia Databases and Data Management (MDDM-06), Atlanta, Georgia, April 8, 2006.
[3] M. Khan, Q. Ding, and W. Perrizo, "K-nearest Neighbor Classification on Spatial Data Streams Using P-trees," Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-02), pp. 517-528, Taipei, Taiwan, May 2002.
[4] A. Perera, T. Abidin, M. Serazi, G. Hamer, and W. Perrizo, "Vertical Set Squared Distance Based Clustering without Prior Knowledge of K," International Conference on Intelligent and Adaptive Systems and Software Engineering (IASSE-05), pp. 72-77, Toronto, Canada, July 20-22, 2005.
[5] I. Rahal, D. Ren, and W. Perrizo, "A Scalable Vertical Model for Mining Association Rules," Journal of Information and Knowledge Management (JIKM), V3:4, pp. 317-329, 2004.
[6] D. Ren, B. Wang, and W. Perrizo, "RDF: A Density-Based Outlier Detection Method Using Vertical Data Representation," Proceedings of the 4th Institute of Electrical and Electronics Engineers (IEEE) International Conference on Data Mining (ICDM-04), pp. 503-506, Nov 1-4, 2004.
[7] E. Wang, I. Rahal, and W. Perrizo, "DAVYD: An Iterative Density-Based Approach for Clusters with Varying Densities," International Society of Computers and their Applications (ISCA) International Journal of Computers and Their Applications, V17:1, pp. 1-14, March 2010.
[8] Qin Ding, Qiang Ding, and W. Perrizo, "PARM - An Efficient Algorithm to Mine Association Rules from Spatial Data," Institute of Electrical and Electronics Engineers (IEEE) Transactions on Systems, Man, and Cybernetics, Volume 38, Number 6, ISSN 1083-4419, pp. 1513-1525, December 2008.
[9] I. Rahal, M. Serazi, A. Perera, Q. Ding, F. Pan, D. Ren, W. Wu, and W. Perrizo, "DataMIME™," Association for Computing Machinery, Management of Data (ACM SIGMOD 04), Paris, France, June 2004.
[10] Treeminer Inc., The Vertical Data Mining Company, 175 Admiral Cochrane Drive, Suite 300, Annapolis, Maryland 21401, http://www.treeminer.com