Dr. William Perrizo
University Distinguished Professor of Computer Science
North Dakota State University, Fargo, ND 58102, USA
Voice: +1 701 231 7248
Email: william.perrizo@ndsu.edu
http://www.cs.ndsu.nodak.edu/~perrizo

P-tree Vertical Data Mining Technology

Vertical technologies represent and process data differently from the ubiquitous horizontal data technologies. In vertical technologies, the data is structured column-wise and the columns are processed horizontally (typically across a few to a few hundred columns), while in horizontal technologies, data is structured row-wise and the rows are processed vertically (often down millions, even billions, of rows).

The patented P-tree technology is a vertical data technology. P-trees are lossless, compressed, and data-mining-ready data structures [9][10]. P-trees are lossless because the vertical bit-wise partitioning used in the P-tree technology retains all information: nothing is lost in converting horizontal data to this vertical format. P-trees are compressed because segments of a bit sequence that consist purely of 1-bits or purely of 0-bits are represented by a single bit. This compression saves a considerable amount of space but, more importantly, facilitates greater processing speed. P-trees are data-mining ready because the fast, horizontal data mining processes involved can be carried out without first decompressing the structures.

P-tree vertical data structures have been exploited in various domains and data mining algorithms, including classification [1][2][3], clustering [4][7], association rule mining [5][9], and outlier analysis [6], as well as other data mining algorithms. P-tree technology is patented in the United States by North Dakota State University.

Typically, a new data mining technology will tout either improved speed or improved accuracy. P-trees can facilitate both.
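
As a rough illustration of the vertical bit-wise partitioning, the single-bit representation of pure segments, and the AND-then-count style of processing described above, the following Python sketch may help. It is not the patented NDSU or Treeminer implementation. The function names (to_bit_slices, from_bit_slices, count_equal, compress, count_ones), the use of Python integers as bit vectors, and the simple halving scheme used for compression are all assumptions made for this example.

```python
from typing import List, Tuple, Union

# A compressed node is 0 or 1 (a pure segment) or a pair of child nodes.
Node = Union[int, Tuple["Node", "Node"]]


def to_bit_slices(values: List[int], width: int) -> List[int]:
    """Split one column of unsigned ints into `width` vertical bit slices.

    Slice j (j = 0 is the most significant bit) is a Python int used as a
    bit vector: bit i of the slice equals bit j of values[i].
    """
    slices = []
    for j in range(width):
        shift = width - 1 - j
        s = 0
        for i, v in enumerate(values):
            s |= ((v >> shift) & 1) << i
        slices.append(s)
    return slices


def from_bit_slices(slices: List[int], n_rows: int) -> List[int]:
    """Rebuild the original column, showing that the slicing is lossless."""
    values = []
    for i in range(n_rows):
        v = 0
        for s in slices:  # most significant slice first
            v = (v << 1) | ((s >> i) & 1)
        values.append(v)
    return values


def count_equal(slices: List[int], n_rows: int, target: int) -> int:
    """Count rows equal to `target` with one pass across the slices.

    Each slice (or its complement) is ANDed into a running mask, and the
    1-bits of the final mask are counted. No row is visited individually.
    """
    width = len(slices)
    all_rows = (1 << n_rows) - 1
    mask = all_rows
    for j, s in enumerate(slices):
        wanted_bit = (target >> (width - 1 - j)) & 1
        mask &= s if wanted_bit else (~s & all_rows)
    return bin(mask).count("1")


def compress(bits: List[int]) -> Node:
    """Collapse a non-empty bit sequence: a pure segment becomes one bit."""
    if all(b == bits[0] for b in bits):
        return bits[0]
    mid = len(bits) // 2
    return (compress(bits[:mid]), compress(bits[mid:]))


def count_ones(node: Node, length: int) -> int:
    """Count 1-bits directly on the compressed form, without decompressing."""
    if isinstance(node, tuple):
        left, right = node
        mid = length // 2
        return count_ones(left, mid) + count_ones(right, length - mid)
    return length if node == 1 else 0


if __name__ == "__main__":
    column = [5, 3, 5, 7, 0, 5, 2, 5]          # one 3-bit attribute, 8 rows
    slices = to_bit_slices(column, width=3)
    print(from_bit_slices(slices, len(column)) == column)
    print(count_equal(slices, len(column), target=5))
    tree = compress([1, 1, 1, 1, 0, 0, 1, 0])
    print(tree, count_ones(tree, 8))
```

In this sketch the cost of count_equal grows with the number of bit slices rather than with per-row work in the loop, which is the sense in which a vertical layout is processed "horizontally". A production system would presumably use fixed-width machine words and hardware population counts rather than Python integers.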
In fact, the “Closed Nearest Neighbor Classification” P-tree technology, first introduced in [2], has been shown to do both simultaneously. Speed improvements are very important in data mining because many quite accurate algorithms require an unacceptable amount of processing time to complete, even with today’s powerful computing systems and efficient software platforms. Arguably the most important breakthrough offered by P-tree technology is the ability to process all instances (even billions) of an entity with one horizontal pass across a small number (a few to a few hundred) of vertical, compressed P-tree data structures.

Development of P-tree vertical data structuring and mining technology is continuing. Commercial development is being done by Treeminer Inc. [10], which licensed NDSU’s P-tree patents. P-tree research is ongoing at NDSU in Dr. William Perrizo’s DataSURG group. A central expansion and improvement to P-tree technology currently being worked on is faster logical processing of a selection of P-trees. This work focuses on careful choice of P-tree segment-to-cache strides and segment-to-memory extents for fast access, and on faster AND/OR processing and faster 1-counting using multi-core CPU, GP-GPU, and FPGA technologies.

References

[1] T. Abidin and W. Perrizo, “SMART-TV: A Fast and Scalable Nearest Neighbor Based Classifier for Data Mining,” Proceedings of the 21st Association of Computing Machinery Symposium on Applied Computing (SAC-06), Dijon, France, April 23-27, 2006.

[2] T. Abidin, A. Dong, H. Li, and W. Perrizo, “Efficient Image Classification on Vertically Decomposed Data,” IEEE International Conference on Multimedia Databases and Data Management (MDDM-06), Atlanta, Georgia, April 8, 2006.

[3] M. Khan, Q. Ding, and W. Perrizo, “K-nearest Neighbor Classification on Spatial Data Stream Using P-trees,” Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 02), pp. 517-528, Taipei, Taiwan, May 2002.

[4] A. Perera, T. Abidin, M. Serazi, G. Hamer, and W. Perrizo, “Vertical Set Squared Distance Based Clustering without Prior Knowledge of K,” International Conference on Intelligent and Adaptive Systems and Software Engineering (IASSE-05), pp. 72-77, Toronto, Canada, July 20-22, 2005.

[5] I. Rahal, D. Ren, and W. Perrizo, “A Scalable Vertical Model for Mining Association Rules,” Journal of Information & Knowledge Management (JIKM), Vol. 3, No. 4, pp. 317-329, 2004.

[6] D. Ren, B. Wang, and W. Perrizo, “RDF: A Density-Based Outlier Detection Method using Vertical Data Representation,” Proceedings of the 4th IEEE International Conference on Data Mining (ICDM-04), pp. 503-506, Nov 1-4, 2004.

[7] E. Wang, I. Rahal, and W. Perrizo, “DAVYD: an iterative Density-based Approach for clusters with Varying Densities,” International Journal of Computers and Their Applications, Vol. 17, No. 1, pp. 1-14, March 2010.

[8] Qin Ding, Qiang Ding, and W. Perrizo, “PARM - An Efficient Algorithm to Mine Association Rules from Spatial Data,” IEEE Transactions on Systems, Man, and Cybernetics, Vol. 38, No. 6 (ISSN 1083-4419), pp. 1513-1525, December 2008.

[9] I. Rahal, M. Serazi, A. Perera, Q. Ding, F. Pan, D. Ren, W. Wu, and W. Perrizo, “DataMIME™,” Association of Computing Machinery, Management of Data (ACM SIGMOD 04), Paris, France, June 2004.

[10] Treeminer Inc., The Vertical Data Mining Company, 175 Admiral Cochrane Drive, Suite 300, Annapolis, Maryland 21401, http://www.treeminer.com