P-tree Vertical Data Mining Technology

Dr. William Perrizo
University Distinguished Professor of Computer Science
North Dakota State University, Fargo, ND 58102, USA
Voice: +1 701 231 7248 Email: william.perrizo@ndsu.edu
http://www.cs.ndsu.nodak.edu/~perrizo
Vertical technologies represent and process data differently from the ubiquitous horizontal data
technologies. In vertical technologies, the data is structured column-wise and the columns are
processed horizontally (typically across a few to a few hundred columns), while in horizontal
technologies, data is structured row-wise and those rows are processed vertically (often down
millions, even billions of rows). The patented P-tree technology is a vertical data technology.
P-trees are lossless, compressed and data-mining ready data structures [9][10].
P-trees are lossless because the vertical bit-wise partitioning used in the P-tree technology
retains all information: nothing is lost in converting horizontal data to this vertical format.
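
As an illustration of why bit-wise vertical partitioning loses nothing, the following minimal sketch splits one column of integers into bit slices and rebuilds it. The function names and the most-significant-bit-first ordering are assumptions made for this example; they are not taken from the P-tree implementation.

```python
def to_bit_slices(values, width):
    """Split a column of non-negative ints into `width` vertical bit slices.

    slices[0] holds the most significant bit of every row (an ordering
    chosen for this sketch only).
    """
    return [[(v >> (width - 1 - j)) & 1 for v in values] for j in range(width)]


def from_bit_slices(slices):
    """Rebuild the original column from its bit slices."""
    n_rows = len(slices[0]) if slices else 0
    values = []
    for row in range(n_rows):
        v = 0
        for bit_slice in slices:
            v = (v << 1) | bit_slice[row]
        values.append(v)
    return values


column = [5, 3, 7, 0, 6]            # five rows of a 3-bit attribute
slices = to_bit_slices(column, 3)   # three vertical bit columns
restored = from_bit_slices(slices)  # identical to `column`
```

Each slice is one vertical bit column, and the round trip returns exactly the original values.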
P-trees are compressed because, in this technology, segments of bit sequences that are either
purely 1-bits or purely 0-bits are represented by a single bit. This compression saves a
considerable amount of space but, more importantly, facilitates greater processing speed.
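
The idea of representing a pure segment by a single bit can be sketched with a simple recursive halving scheme. This is an assumed layout for illustration only (a binary split, with the hypothetical helpers build_tree, expand and count_leaves); the actual P-tree structure may partition segments differently.

```python
def build_tree(bits):
    """Compress a non-empty bit list: a pure segment becomes a single 0 or 1,
    a mixed segment becomes a (left, right) pair of subtrees."""
    if all(b == 1 for b in bits):
        return 1
    if not any(bits):
        return 0
    mid = len(bits) // 2
    return (build_tree(bits[:mid]), build_tree(bits[mid:]))


def expand(tree, length):
    """Rebuild the original `length` bits from a tree made by build_tree."""
    if isinstance(tree, int):
        return [tree] * length
    mid = length // 2
    left, right = tree
    return expand(left, mid) + expand(right, length - mid)


def count_leaves(tree):
    """Number of single-bit leaves stored in the tree."""
    if isinstance(tree, int):
        return 1
    return count_leaves(tree[0]) + count_leaves(tree[1])


bits = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0]
tree = build_tree(bits)
# tree == ((1, 0), ((1, (0, 1)), 0)): 6 leaves stand in for 16 bits
```

Long pure runs, which are common in many real bit columns, collapse to very few leaves, and expand(tree, len(bits)) recovers the original sequence.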
P-trees are data-mining ready because the fast, horizontal data mining processes involved can be
done without the need to decompress the structures first. P-tree vertical data structures have
been exploited in various domains and data mining algorithms, including classification
[1][2][3], clustering [4][7], association rule mining [5][9], and outlier analysis [6]. P-tree
technology is patented in the United States by North Dakota State University.
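
To make "without the need to decompress" concrete, here is a hedged sketch that ANDs two compressed bit columns and counts the 1-bits directly on the compressed form. It assumes a simple binary tree in which a pure segment is stored as a single 0 or 1; the helper names are hypothetical and this is not the patented algorithm itself.

```python
def build_tree(bits):
    """Pure segment -> single 0 or 1; mixed segment -> (left, right) pair."""
    if all(b == 1 for b in bits):
        return 1
    if not any(bits):
        return 0
    mid = len(bits) // 2
    return (build_tree(bits[:mid]), build_tree(bits[mid:]))


def tree_and(a, b):
    """AND two trees built over bit lists of the same length."""
    if a == 0 or b == 0:
        return 0              # a pure-0 segment decides the result at once
    if a == 1:
        return b              # pure-1 is the identity for AND
    if b == 1:
        return a
    left = tree_and(a[0], b[0])
    right = tree_and(a[1], b[1])
    if isinstance(left, int) and left == right:
        return left           # both halves pure and equal: collapse again
    return (left, right)


def count_ones(tree, length):
    """Count 1-bits in a tree that represents `length` bits."""
    if tree == 0:
        return 0
    if tree == 1:
        return length
    mid = length // 2
    return count_ones(tree[0], mid) + count_ones(tree[1], length - mid)


a = [1, 1, 1, 1, 0, 0, 0, 0]
b = [1, 1, 0, 1, 1, 1, 1, 1]
both = tree_and(build_tree(a), build_tree(b))
matches = count_ones(both, len(a))   # rows where both bits are 1
```

Whenever one operand is a pure segment, the AND finishes for that whole segment without visiting individual bits, which is where the speed benefit of operating on the compressed form comes from in this sketch.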
Typically, a new data mining technology will tout either improved speed or improved accuracy.
P-trees can facilitate both. In fact, the “Closed Nearest Neighbor Classification” P-tree
technology, first introduced in [2], has been shown to improve both simultaneously.
Speed improvements are very important in data mining because many quite accurate algorithms
require an unacceptable amount of processing time to complete, even with today’s powerful
computing systems and efficient software platforms. The most important
breakthrough offered by P-tree technology is the ability to process all instances (even billions)
of an entity with one horizontal pass across a small number (a few to a few hundred) of vertical,
compressed P-tree data structures.
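
The following sketch illustrates a single horizontal pass across a few vertical bit columns answering a question about every row at once. As a simplifying assumption, each bit column is held as an uncompressed Python integer (bit i belongs to row i) rather than as a compressed P-tree, and pack_slices and count_equal are names invented for the example.

```python
def pack_slices(values, width):
    """Pack each bit position of a column into one int; bit i of slices[j]
    is bit j (0 = least significant) of the value in row i."""
    slices = [0] * width
    for row, v in enumerate(values):
        for j in range(width):
            if (v >> j) & 1:
                slices[j] |= 1 << row
    return slices


def count_equal(slices, n_rows, target):
    """Count rows whose value equals `target` with one pass over the slices."""
    all_rows = (1 << n_rows) - 1
    mask = all_rows
    for j, bit_slice in enumerate(slices):
        if (target >> j) & 1:
            mask &= bit_slice               # keep rows where bit j is 1
        else:
            mask &= all_rows & ~bit_slice   # keep rows where bit j is 0
    return bin(mask).count("1")


values = [5, 3, 7, 0, 6, 5, 5, 2]
slices = pack_slices(values, 3)
fives = count_equal(slices, len(values), 5)
```

The loop in count_equal runs once per bit column (three times here) no matter how many rows the column holds; each iteration combines whole columns with a single logical operation.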
P-tree vertical data structuring and mining technology development is continuing. Commercial
development is being done by Treeminer Inc. [10], which licensed NDSU’s P-tree patents.
P-tree research is ongoing at NDSU in Dr. William Perrizo’s DataSURG group. A central
expansion and improvement to P-tree technology being worked on at this time is faster logical
processing of a selection of P-trees. This work focuses on carefully chosen P-tree segment-to-cache
strides and segment-to-memory extents for fast access, and on faster AND/OR processing and
faster 1-counting using multi-core CPU, GP-GPU and FPGA technologies.
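
One way to picture segment-wise AND processing and 1-counting is sketched below. It is only an assumed decomposition: the segment size, the byte-level lookup table and the use of a thread pool are choices made for this example, and CPython threads will not show a real multi-core speedup. The point is that segments are independent, so each could be handed to a separate core, GPU block or FPGA unit.

```python
from concurrent.futures import ThreadPoolExecutor

# Lookup table: number of 1-bits in each possible byte value.
POPCOUNT = [bin(b).count("1") for b in range(256)]


def and_count_segment(seg_a, seg_b):
    """AND two equal-length byte segments and count the 1-bits in the result."""
    return sum(POPCOUNT[x & y] for x, y in zip(seg_a, seg_b))


def and_count(a, b, segment_bytes=4096, workers=4):
    """Count 1-bits of (a AND b) for two equal-length bit vectors stored as
    bytes, working on independent fixed-size segments."""
    if len(a) != len(b):
        raise ValueError("bit vectors must have the same length")
    starts = range(0, len(a), segment_bytes)

    def work(start):
        end = start + segment_bytes
        return and_count_segment(a[start:end], b[start:end])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(work, starts))


a = bytes([0b11110000] * 10000)
b = bytes([0b10101010] * 10000)
total = and_count(a, b)   # each byte ANDs to 0b10100000, i.e. two 1-bits
```

Because per-segment counts are simply summed, the segment size can be tuned to a cache line or memory page without changing the result.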
References
[1] T. Abidin and W. Perrizo, “SMART-TV: A Fast and Scalable Nearest Neighbor Based
Classifier for Data Mining,” Proceedings of the 21st Association for Computing Machinery
Symposium on Applied Computing (SAC-06), Dijon, France, April 23-27, 2006.
[2] T. Abidin, A. Dong, H. Li, and W. Perrizo, “Efficient Image Classification on Vertically
Decomposed Data,” IEEE International Conference on Multimedia Databases and Data
Management (MDDM-06), Atlanta, Georgia, April 8, 2006.
[3] M. Khan, Q. Ding, and W. Perrizo, “K-nearest Neighbor Classification on Spatial Data
Stream Using P-trees,” Proceedings of the Pacific-Asia Conference on Knowledge
Discovery and Data Mining (PAKDD 02), pp. 517-528, Taipei, Taiwan, May 2002.
[4] A. Perera, T. Abidin, M. Serazi, G. Hamer, and W. Perrizo, “Vertical Set Squared
Distance Based Clustering without Prior Knowledge of K,” International Conference on
Intelligent and Adaptive Systems and Software Engineering (IASSE-05), pp. 72-77,
Toronto, Canada, July 20-22, 2005.
[5] I. Rahal, D. Ren, and W. Perrizo, “A Scalable Vertical Model for Mining Association
Rules,” Journal of Information & Knowledge Management (JIKM), Vol. 3, No. 4,
pp. 317-329, 2004.
[6] D. Ren, B. Wang, and W. Perrizo, “RDF: A Density-Based Outlier Detection Method
using Vertical Data Representation,” Proceedings of the 4th IEEE International Conference
on Data Mining (ICDM-04), pp. 503-506, Nov 1-4, 2004.
[7] E. Wang, I. Rahal, and W. Perrizo, “DAVYD: an iterative Density-based Approach for
clusters with Varying Densities,” International Journal of Computers and Their
Applications, V17:1, pp. 1-14, March 2010.
[8] Qin Ding, Qiang Ding, and W. Perrizo, “PARM - An Efficient Algorithm to Mine
Association Rules from Spatial Data,” Institute of Electrical and Electronics Engineers
(IEEE) Transactions on Systems, Man, and Cybernetics, Volume 38, Number 6 (ISSN
1083-4419), pp. 1513-1525, December 2008.
[9] I. Rahal, M. Serazi, A. Perera, Q. Ding, F. Pan, D. Ren, W. Wu, and W. Perrizo,
“DataMIME™,” Association for Computing Machinery, Management of Data (ACM
SIGMOD 04), Paris, France, June 2004.
[10] Treeminer Inc., The Vertical Data Mining Company, 175 Admiral Cochrane Drive, Suite
300, Annapolis, Maryland 21401, http://www.treeminer.com