The Universality of Nearest Neighbor Sets in Classification and Prediction

William Perrizo, Gregory Wettstein, Amal Shehan Perera, Tingda Lu
Computer Science Department, North Dakota State University, Fargo, ND 58105 USA
William.perrizo@ndsu.edu

ABSTRACT
In this paper, we make the case that essentially all classification and prediction algorithms are nearest neighbor vote classifications and predictions. The only issue is how one defines "near". This is important because the first decision that needs to be made when faced with a classification or prediction problem is which classification or prediction algorithm to employ. We will show how, e.g., Decision Tree Induction, as a classification method, can be viewed legitimately as a particular type of Nearest Neighbor Classification. We will also show how Neural Network Predictors are really Nearest Neighbor Classifiers.

Keywords
Predicate-Tree, Nearest-Neighbor Classification, Decision Tree Induction, Neural Network.

1. INTRODUCTION
What is data mining? Data mining, in its most restricted form, can be broken down into three general methodologies for extracting information and knowledge from data: Rule Mining, Classification, and Clustering. To have a unified context in which to discuss these three methodologies, let us assume that the "data" is in one relation, R(A1, ..., An) (a universal relation - unnormalized), which can be thought of as a subset of the product of the attribute domains, D1 x D2 x ... x Dn.

Rule Mining is a matter of discovering strong antecedent-consequent relationships among the subsets of the columns (in the schema). Classification is a matter of discovering signatures for the individual values in a specified column or attribute (called the class label attribute) from the values of the other attributes (called the feature attributes) in a table (called the training table). Clustering is a matter of using some notion of tuple similarity to group together training table rows so that within a group (a cluster) there is high similarity and across groups there is low similarity [20].

1.1 P-tree Data Structure
We convert input data to vertical Predicate-trees, or P-trees. P-trees are a lossless, compressed, and data-mining-ready vertical data structure. P-trees are used for fast computation of counts and for masking specific phenomena. This vertical data representation consists of set structures representing the data column-by-column rather than row-by-row (horizontal relational data). Predicate-trees are one choice of vertical data representation, which can be used for data mining instead of the more common sets of relational records. This data structure has been successfully applied in data mining applications ranging from Classification and Clustering with K-Nearest Neighbor, to Classification with Decision Tree Induction, to Association Rule Mining [12][7][19][1][22][2][13][18][24][6].

A basic P-tree represents one attribute bit that is reorganized into a tree structure by recursive sub-division, while recording the predicate truth value for each division. Each level of the tree contains truth-bits that represent sub-trees and can then be used for phenomena masking and fast computation of counts. This construction is continued recursively down each tree path until downward closure is reached. E.g., if the predicate is "purely 1 bits", downward closure is reached when purity is reached (either purely 1 bits or purely 0 bits).
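To make this construction concrete, the following minimal sketch (in Python) builds a basic P-tree over a single bit column using the "purely 1 bits" predicate and recursive sub-division. The class and function names are illustrative only; this is not the P-tree API of [6].

    # Minimal sketch of a basic (single bit column) P-tree with a "purely 1 bits"
    # predicate. Names are illustrative, not the authors' implementation.
    class PTreeNode:
        def __init__(self, pure1=False, pure0=False, children=None, count=0, size=0):
            self.pure1 = pure1            # this sub-division is entirely 1-bits
            self.pure0 = pure0            # this sub-division is entirely 0-bits
            self.children = children or []
            self.count = count            # number of 1-bits under this node
            self.size = size              # number of bits covered by this node

    def build_ptree(bits, fanout=2):
        """Recursively sub-divide a bit column, terminating a branch on purity."""
        n, ones = len(bits), sum(bits)
        if ones == n:                     # purely 1 bits: downward closure reached
            return PTreeNode(pure1=True, count=n, size=n)
        if ones == 0:                     # purely 0 bits: downward closure reached
            return PTreeNode(pure0=True, count=0, size=n)
        step = max(1, n // fanout)
        children = [build_ptree(bits[i:i + step], fanout) for i in range(0, n, step)]
        return PTreeNode(children=children, count=ones, size=n)

    # Example: one bit slice of an attribute over eight training tuples.
    tree = build_ptree([1, 1, 1, 1, 0, 0, 1, 0])
    print(tree.count)   # root count: 5 tuples have a 1 in this bit position

ANDing such basic P-trees (and their complements) yields the mask P-trees and root counts described next.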
In this case, a tree branch is terminated when a sub-division is reached that is entirely pure (which may or may not be at the leaf level). These basic P-trees and their complements are combined using Boolean algebra operators such as AND (&), OR (|), and NOT (') to produce mask P-trees for individual values, individual tuples, value intervals, tuple rectangles, or any other attribute pattern [3]. The root count of any P-tree indicates the occurrence count of that pattern. The P-tree data structure thus provides a structure for counting patterns in an efficient, highly scalable manner.

2. The Case: Most Classifiers are Near Neighbor Classifiers
2.1 Classification
Given a (large) TRAINING SET, T(A1, ..., An, C), with CLASS label C and FEATURES A = (A1, ..., An), C-Classification of an unclassified sample (a1, ..., an) is just:

    SELECT Max(Count(T.C))
    FROM   T
    WHERE  T.A1 = a1 AND T.A2 = a2 AND ... AND T.An = an
    GROUP BY T.C;

i.e., just a SELECTION, since C-Classification assigns to (a1, ..., an) the most frequent C-value among the training tuples with A = (a1, ..., an). But if the EQUALITY SELECTION is empty, then we need a FUZZY QUERY to find NEAR NEIGHBORS (NNs) instead of exact matches. That's Nearest Neighbor Classification (NNC).

E.g., a Medical Expert System ("Ask a Nurse"): symptoms plus past diagnoses are collected into a table called CASES. For each undiagnosed new_symptoms, CASES is searched for matches:

    SELECT DIAGNOSIS
    FROM   CASES
    WHERE  CASES.SYMPTOMS = new_symptoms;

If there is a predominant DIAGNOSIS, then report it.
ElseIf there is no predominant DIAGNOSIS, then Classify instead of Query, i.e., find the fuzzy matches (near neighbors):

    SELECT DIAGNOSIS
    FROM   CASES
    WHERE  CASES.SYMPTOMS ≅ new_symptoms;

Else call your doctor in the morning.
That's exactly (Nearest Neighbor) Classification!!

CAR TALK radio show example: Click and Clack, the Tappet brothers, have a vast TRAINING SET on car problems and solutions built from experience. They search that TRAINING SET for close matches to predict solutions based on previous successful cases. That's (Nearest Neighbor) Classification!!

We all perform Nearest Neighbor Classification every day of our lives. E.g., we learn when to apply specific programming/debugging techniques so that we can apply them to similar situations thereafter. COMPUTERIZED NNC = MACHINE LEARNING (most clustering, which is just partitioning, is done as a simplifying prelude to classification).

Again, given a TRAINING SET, R(A1, ..., An, C), with C = CLASSES and (A1, ..., An) = FEATURES, Nearest Neighbor Classification (NNC) means selecting a set of R-tuples with features similar to those of the unclassified sample and then letting the corresponding class values vote.

Nearest Neighbor Classification won't work very well if the vote is inconclusive (close to a tie) or if "similar" (near) is not well defined; in that case, we build a MODEL of the TRAINING SET (at, possibly, great one-time expense). When a MODEL is built first, the technique is called Eager classification, whereas model-less methods like Nearest Neighbor are called Lazy or Sample-based. Eager classifier models can be decision trees, probabilistic models (e.g., Bayesian classifiers), Neural Networks, Support Vector Machines, etc.

How do you decide when an EAGER model is good enough to use? How do you decide whether a Nearest Neighbor Classifier is working well enough? We have a TEST PHASE. Typically, we set aside some training tuples as a Test Set (then, of course, those Test tuples cannot be used in model building and cannot be used as nearest neighbors).
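As a small, concrete illustration of the nearest neighbor vote and of the test phase just described, the following sketch uses illustrative data and assumes Euclidean distance as one possible definition of "near" (not the paper's P-tree-based computation). It classifies each held-out Test Set tuple by letting the k most similar training tuples vote and reports the fraction classified correctly.

    # Sketch of Nearest Neighbor Classification with a held-out Test Set.
    # Euclidean distance and the data below are illustrative assumptions only.
    from collections import Counter
    import math

    def distance(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    def nnc_vote(training, sample, k=3):
        """Select the k most similar training tuples and let their class values vote."""
        neighbors = sorted(training, key=lambda row: distance(row[0], sample))[:k]
        return Counter(label for _, label in neighbors).most_common(1)[0][0]

    def test_phase(training, test_set, k=3):
        """Fraction of held-out Test Set tuples classified correctly."""
        correct = sum(1 for feats, label in test_set
                      if nnc_vote(training, feats, k) == label)
        return correct / len(test_set)

    # (features, class) pairs; Test Set tuples are excluded from the neighbor search.
    training = [((24, 55), 'yes'), ((58, 94), 'no'), ((48, 14), 'no'),
                ((58, 19), 'no'), ((28, 18), 'no')]
    test_set = [((26, 60), 'yes'), ((60, 20), 'no')]
    print(test_phase(training, test_set, k=1))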
If the classifier passes the test (a high enough percentage of Test tuples are correctly classified by the classifier), it is accepted.

EXAMPLE: Computer Ownership TRAINING SET for predicting who owns a computer:

    Customer:  Age | Salary | Job        | Owns Computer
               24  | 55,000 | Programmer | yes
               58  | 94,000 | Doctor     | no
               48  | 14,000 | Laborer    | no
               58  | 19,000 | Domestic   | no
               28  | 18,000 | Builder    | no

A Decision Tree classifier might be built from this TRAINING SET as follows:

    Age < 30?
      True:  Salary > 50K?
        True:  Yes   (24 | 55,000 | Programmer | yes)
        False: No    (28 | 18,000 | Builder    | no)
      False: No      (58 | 94,000 | Doctor     | no)
                     (48 | 14,000 | Laborer    | no)
                     (58 | 19,000 | Domestic   | no)

The question is: how is this really a Near Neighbor Classifier (where are the near neighborhoods involved)? In actuality, what we are doing is saying that the training subset at the bottom of each decision path represents a near neighborhood of any unclassified sample that traverses the decision tree to that leaf. The concept of "near" or "correlation" used is that the unclassified sample meets the same set of condition criteria as the near neighbors at the bottom of that path of condition criteria. Thus, in a real sense, we are using a different (accumulative) "correlation" definition along each branch of the decision tree, and the subsets at the leaf of each branch are true Near Neighbor sets for the respective correlations or notions of nearness.

Similarly, for any Neural Network classifier [8][9][16], we train the network by adjusting the weights and biases through back-propagation until we reach an acceptable level of performance. In so doing, we are using the matrix of weights and biases as the determiner of our near neighbor sets. We do not stop training until those near neighbor sets (the sets of inputs that produce the same class output) are sufficiently "near" to each other to give us a level of accuracy that is sufficient for our needs. A sketch of the decision-tree case follows.
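To make the decision-tree case concrete for the Computer Ownership example, the following minimal sketch (illustrative function names, not the paper's implementation) hard-codes the two splits, takes the near neighbor set of an unclassified sample to be the training tuples that satisfy the same chain of decision conditions (i.e., that reach the same leaf), and lets that set vote.

    # The leaf a sample reaches defines its near neighborhood: the training tuples
    # that satisfy the same accumulated decision conditions. Names are illustrative.
    from collections import Counter

    TRAINING = [  # (Age, Salary, Job, Owns Computer)
        (24, 55000, 'Programmer', 'yes'),
        (58, 94000, 'Doctor',     'no'),
        (48, 14000, 'Laborer',    'no'),
        (58, 19000, 'Domestic',   'no'),
        (28, 18000, 'Builder',    'no'),
    ]

    def leaf_of(age, salary):
        """The path of satisfied conditions is the accumulated 'correlation' on that branch."""
        if age < 30:
            return 'Age<30, Salary>50K' if salary > 50000 else 'Age<30, Salary<=50K'
        return 'Age>=30'

    def decision_tree_as_nnc(age, salary):
        """Classify by letting the training tuples in the sample's leaf vote."""
        leaf = leaf_of(age, salary)
        neighborhood = [row for row in TRAINING if leaf_of(row[0], row[1]) == leaf]
        vote = Counter(row[3] for row in neighborhood).most_common(1)[0][0]
        return leaf, neighborhood, vote

    leaf, neighbors, label = decision_tree_as_nnc(27, 62000)
    print(leaf)        # 'Age<30, Salary>50K'
    print(neighbors)   # the near neighbor set at that leaf (the 24-year-old Programmer)
    print(label)       # 'yes'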
2.2 Boundary based Classification
By contrast, it is only fair to say that viewing Support Vector Machine classification as a form of Nearest Neighbor Classification is a stretch. On the other hand, the very first step in SVM classification is often to isolate a neighborhood in which to examine the boundary and the margins of the boundary between classes (assuming a binary classification problem). We will not go into any more detail on the issue of SVM as an NNC method due to the size limitations of this paper.

3. CONCLUSIONS AND FUTURE WORK
In this paper, we have made the case that classification and prediction algorithms (at least Decision Tree Induction type classifiers and Neural Network type classifiers) are nearest neighbor vote classifications and predictions. The conclusion depends upon how one defines "near", and we have shown that there are clearly "nearnesses" or "correlations" that provide these definitions. Two samples are considered near if their correlation is high enough. Broadly speaking, this is the way we always proceed in Classification. This is important because the first decision that needs to be made when faced with a classification or prediction problem is which classification or prediction algorithm to employ. We have shown how, e.g., Decision Tree Induction, as a classification method, can be viewed legitimately as a particular type of Nearest Neighbor Classification. We have also shown how Neural Network Predictors are really Nearest Neighbor Classifiers. What good does this understanding do for someone faced with a classification or prediction problem?

In a real sense, the point of this paper is to head off the standard way of approaching Classification, which seems to be to use a model-based classification method unless it just doesn't work well enough, and only then to use Nearest Neighbor Classification. Our point is that essentially "it is all Nearest Neighbor Classification" and that standard NNC should be used UNLESS it takes too long. Only then should one consider giving up accuracy (of the near neighbor set) for speed by using a model (Decision Tree or Neural Network).

4. REFERENCES
[1] Abidin, T. and Perrizo, W., SMART-TV: A Fast and Scalable Nearest Neighbor Based Classifier for Data Mining, Proceedings of the 21st ACM Symposium on Applied Computing, Dijon, France, April 2006.
[2] Abidin, T., Perera, A., Serazi, M., and Perrizo, W., Vertical Set Square Distance: A Fast and Scalable Technique to Compute Total Variation in Large Datasets, CATA-2005, New Orleans, 2005.
[3] Ding, Q., Khan, M., Roy, A., and Perrizo, W., The P-tree Algebra, Proceedings of the ACM Symposium on Applied Computing, pp. 426-431, 2002.
[4] Bandyopadhyay, S. and Murthy, C.A., Pattern Classification Using Genetic Algorithms, Pattern Recognition Letters, Vol. 16, (1995) 801-808.
[5] Cost, S. and Salzberg, S., A Weighted Nearest Neighbor Algorithm for Learning with Symbolic Features, Machine Learning, 10, 57-78, 1993.
[6] DataSURG, P-tree Application Programming Interface Documentation, North Dakota State University, http://midas.cs.ndsu.nodak.edu/~datasurg/ptree/
[7] Ding, Q., Ding, Q., and Perrizo, W., ARM on RSI Using P-trees, Pacific-Asia KDD Conference, pp. 66-79, Taipei, May 2002.
[8] Duch, W., Grudziński, K., and Diercksen, G., Neural Minimal Distance Methods, World Congress of Computational Intelligence, IJCNN'98 Proceedings, Anchorage, Alaska, May 1998, pp. 1299-1304.
[9] Goldberg, D.E., Genetic Algorithms in Search, Optimization, and Machine Learning, Addison Wesley, 1989.
[10] Guerra-Salcedo, C. and Whitley, D., Feature Selection Mechanisms for Ensemble Creation: A Genetic Search Perspective, Data Mining with Evolutionary Algorithms: Research Directions - Papers from the AAAI Workshop, 13-17, Technical Report WS-99-06, AAAI Press (1999).
[11] Jain, A.K. and Zongker, D., Feature Selection: Evaluation, Application, and Small Sample Performance, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 2, February 1997.
[12] Khan, M., Ding, Q., and Perrizo, W., k-Nearest Neighbor Classification on Spatial Data Streams Using P-trees, Advances in KDD, Springer Lecture Notes in Artificial Intelligence, LNAI 2336, 2002, pp. 517-528.
[13] Khan, M., Ding, Q., and Perrizo, W., K-Nearest Neighbor Classification of Spatial Data Streams Using P-trees, Proceedings of the PAKDD, pp. 517-528, 2002.
[14] Krishnaiah, P.R. and Kanal, L.N., Handbook of Statistics 2: Classification, Pattern Recognition and Reduction of Dimensionality, North Holland, Amsterdam, 1982.
[15] Kuncheva, L.I. and Jain, L.C., Designing Classifier Fusion Systems by Genetic Algorithms, IEEE Transactions on Evolutionary Computation, Vol. 33 (2000) 351-373.
[16] Lane, T., ACM Knowledge Discovery and Data Mining Cup 2006, http://www.kdd2006.com/kddcup.html
[17] Martin-Bautista, M.J. and Vila, M.A., A Survey of Genetic Feature Selection in Mining Issues, Proceedings of the Congress on Evolutionary Computation (CEC-99), Washington D.C., July 1999, 1314-1321.
[18] Perera, A., Abidin, T., Serazi, M., Hamer, G., and Perrizo, W., Vertical Set Square Distance Based Clustering without Prior Knowledge of K, 14th International Conference on Intelligent and Adaptive Systems and Software Engineering (IASSE'05), Toronto, Canada, 2005.
[19] Perera, A., Denton, A., Kotala, P., Jockheck, W., Valdivia, W., and Perrizo, W., P-tree Classification of Yeast Gene Deletion Data, SIGKDD Explorations, Volume 4, Issue 2, December 2002.
[20] Perera, A. and Perrizo, W., Vertical K-Median Clustering, Proceedings of the 21st International Conference on Computers and Their Applications (CATA-06), March 23-25, 2006, Seattle, Washington, USA.
[21] Punch, W.F., Goodman, E.D., Pei, M., Chia-Shun, L., Hovland, P., and Enbody, R., Further Research on Feature Selection and Classification Using Genetic Algorithms, Proceedings of the Fifth International Conference on Genetic Algorithms, pp. 557-564, San Mateo, CA, 1993.
[22] Rahal, I. and Perrizo, W., An Optimized Approach for KNN Text Categorization Using P-Trees, Proceedings of the ACM Symposium on Applied Computing, pp. 613-617, 2004.
[23] Raymer, M.L., Punch, W.F., Goodman, E.D., Kuhn, L.A., and Jain, A.K., Dimensionality Reduction Using Genetic Algorithms, IEEE Transactions on Evolutionary Computation, Vol. 4, (2000) 164-171.
[24] Serazi, M., Perera, A., Ding, Q., Malakhov, V., Rahal, I., Pan, F., Ren, D., Wu, W., and Perrizo, W., DataMIME, ACM SIGMOD, Paris, France, June 2004.
[25] Vafaie, H. and De Jong, K., Robust Feature Selection Algorithms, Proceedings of the IEEE International Conference on Tools with AI, Boston, Mass., USA, November 1993, 356-363.