Vertical Data Mining William Perrizo Voice: +1 701-231-7248 Fax: +1 701-231-8255 Email: william.perrizo@ndsu.nodak.edu Department of Computer Science North Dakota State University 258 A10 IACC Building 1301 12th Ave. N. Fargo, ND 58105, USA Qiang Ding Voice: +1 218-299-3347 Fax: +1 218-299-4308 Email: ding@cord.edu Department of Computer Science Concordia College 234E Ivers Building 901 8th St. S. Moorhead, MN 56562, USA Qin Ding Voice: +1 717-948-6636 Fax: +1 717-948-6352 Email: qding@psu.edu Department of Computer Science Pennsylvania State University - Harrisburg W-256 Olmsted Building 777 West Harrisburg Pike Middletown, PA 17057, USA Taufik Abidin Voice: +1 701-231-6257 Fax: +1 701-231-8255 Email: taufik.abidin@ndsu.nodak.edu Department of Computer Science North Dakota State University 108 IACC Building 1301 12th Ave. N. Fargo, ND 58105, USA Vertical Data Mining INTRODUCTION The volume of data keeps increasing, and many data sets have become extremely large. It is both important and challenging to develop scalable methodologies for efficient and effective data mining on such data sets. The vertical data mining strategy addresses these scalability issues by organizing data in vertical layouts and conducting logical operations on vertically partitioned data, instead of scanning the entire database horizontally. BACKGROUND The traditional horizontal database structure (files of horizontally structured records) and traditional scan-based data processing approaches (scanning files of horizontal records) are known to be inadequate for knowledge discovery in very large data repositories due to the problem of scalability. For this reason, much effort has been put into sub-sampling and indexing as ways to address the scalability problem. However, sub-sampling requires that the sub-sampler know enough about the large dataset in the first place to sub-sample “representatively”. 
That is, sub-sampling requires considerable knowledge about the data, which, for many large datasets, may be inadequate or non-existent. Index files are vertical structures; that is, they are vertical access paths to sets of horizontal records. Indexing files of horizontal data records does address the scalability problem in many cases, but it does so at the cost of creating and maintaining the index files separately from the data files themselves. An alternative is to organize the data vertically rather than horizontally. Data miners are typically interested in collective properties or predictions that can be expressed very briefly (e.g., a yes/no answer). Therefore, the result of a data mining query can be represented by a bitmap vector. This important property makes it possible to do data mining directly on vertical data structures. MAIN THRUST OF THE CHAPTER Vertical data structures, vertical mining approaches, and multi-relational vertical mining will be explored in detail to show how vertical data mining works. Vertical Data Structures The concept of vertical partitioning has been studied within the context of both centralized and distributed database systems for a long time, yet much remains to be done (Winslett, 2002). Vertical partitioning has great advantages: it makes hardware caching work well, it makes compression easy, and it may greatly increase the effectiveness of the I/O device, since only the participating fields are retrieved each time. The vertical decomposition of a relation also permits a number of transactions to execute concurrently. Copeland & Khoshafian (1985) presented an attribute-level Decomposition Storage Model called DSM, similar to the Attribute Transposed File model (ATF) (Batory, 1979), which stores each column of a relational table in a separate table. DSM was shown to perform well. 
It utilizes surrogate keys to map individual attributes together, hence requiring a surrogate key to be associated with each attribute of each record in the database. Attribute-level vertical decomposition is also used in Remotely Sensed Imagery (e.g., Landsat Thematic Mapper Imagery), where it is called Band Sequential (BSQ) format. Beyond attribute-level decomposition, Wong et al. (1985) presented the Bit Transposed File model (BTF), which encodes attribute values in a small number of bits to reduce storage space. In addition to the ATF, BTF, and DSM models, there has been other work on vertical data structuring, such as Bit-Sliced Indexes (BSI) (Chan & Ioannidis, 1998, O’Neil & Quass, 1997, Rinfret et al., 2001), Encoded Bitmap Indexes (EBI) (Wu & Buchmann, 1998, Wu, 1998), and the Domain Vector Accelerator (DVA) (Perrizo et al., 1991). A Bit-Sliced Index (BSI) is an ordered list of bitmaps used to represent the values of a column or attribute, C. These bitmaps are called bit-slices; together they provide binary representations of the C-values for all the rows. In the EBI approach, an encoding function is applied to the attribute domain and a binary-based bit-sliced index is built on the encoded domain. EBIs minimize the space requirement and offer more optimization potential than plain binary bit-slices. Both BSIs and EBIs are auxiliary index structures, so the indexed columns are, in effect, stored twice. Even the simplest index structure in use today incurs a substantial increase in total storage requirements. The increased database size, in turn, translates into higher media and maintenance costs and lower performance. The Domain Vector Accelerator (DVA) is a method for performing relational operations based on vertical bit-vectors. The DVA method performs particularly well for joins involving a primary key attribute and an associated foreign key attribute. 
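As a concrete illustration of the bit-slice idea, the following is a minimal Python sketch (not the implementation of any of the cited systems): each bit position of a column's values becomes one bitmap over all rows, stored here as a Python integer whose i-th bit corresponds to row i, and an equality query is answered purely with bitwise AND and complement.

```python
# Minimal Bit-Sliced Index sketch: one bitmap (packed into an int)
# per bit position of a non-negative integer column.

def build_bsi(column, nbits):
    """Build bit-slices for a column of non-negative integers."""
    slices = [0] * nbits              # slices[b] holds bit b of every value
    for row, value in enumerate(column):
        for b in range(nbits):
            if (value >> b) & 1:
                slices[b] |= 1 << row
    return slices

def equality_bitmap(slices, nbits, nrows, v):
    """Bitmap of the rows where the column equals v, via AND/complement."""
    mask = (1 << nrows) - 1           # all-rows bitmap
    result = mask
    for b in range(nbits):
        bit_slice = slices[b] if (v >> b) & 1 else ~slices[b] & mask
        result &= bit_slice
    return result

column = [5, 3, 5, 0, 7]
slices = build_bsi(column, nbits=3)
rows = equality_bitmap(slices, 3, len(column), 5)
print([i for i in range(len(column)) if (rows >> i) & 1])  # rows holding value 5
```

No horizontal scan of records is needed at query time; the work is one bitwise operation per bit-slice, which is the property the BSI and EBI literature exploits.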
Vertical mining requires data to be organized vertically and processed horizontally through fast, multi-operand logical operations, such as AND, OR, XOR, and complement. The Predicate tree (P-tree), a patent-pending technology developed by Dr. William Perrizo’s DataSURG research group at North Dakota State University, is one form of lossless vertical structure that meets this requirement. P-trees are suitable for representing numerical and categorical data and have been successfully used in various data mining applications, including classification (Khan et al., 2002), clustering (Denton et al., 2002), and association rule mining (Ding et al., 2002). P-trees can be 1-dimensional, 2-dimensional, or multi-dimensional. If the data has a natural dimension (e.g., spatial data), the P-tree dimension is matched to the data dimension; otherwise, the dimension can be chosen to optimize the compression ratio. To convert a relational table of horizontal records to a set of vertical P-trees, the table is first projected into columns, one for each attribute, retaining the original record order in each. Each attribute column is then further decomposed into separate bit vectors, one for each bit position of the values in that attribute. Finally, each bit vector is compressed into a tree structure by recording the truth of the predicate “purely 1-bits” recursively on halves until purity is reached. Vertical Mining Approaches A number of vertical data mining algorithms have been proposed, especially in the area of association rule mining, and mining algorithms using the vertical format have been shown to outperform horizontal approaches, such as the Frequent Pattern Growth algorithm based on Frequent Pattern trees (Han et al., 2000), in many cases. The advantage comes from the fact that frequent patterns can be counted via transaction-id-set (tidset) intersections, instead of using complex internal data structures. 
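The P-tree construction just described (project to columns, decompose each column into bit vectors, then compress each bit vector by evaluating the predicate "purely 1-bits" recursively on halves) can be sketched in a few lines of Python. This is a minimal 1-dimensional illustration with our own node encoding, not the DataSURG storage format:

```python
# Hedged sketch of 1-dimensional P-tree construction and root counting.

def to_bit_vectors(column, nbits):
    """Vertically decompose an integer column into bit vectors,
    high-order bit position first."""
    return [[(v >> b) & 1 for v in column] for b in reversed(range(nbits))]

def build_ptree(bits):
    """Compress one bit vector by the recursive purity predicate.

    A pure run is stored as its single bit value (0 or 1); a mixed
    run is stored as ('mixed', left_subtree, right_subtree).
    """
    if all(b == 1 for b in bits):
        return 1                       # purely 1-bits
    if all(b == 0 for b in bits):
        return 0                       # purely 0-bits
    mid = len(bits) // 2
    return ('mixed', build_ptree(bits[:mid]), build_ptree(bits[mid:]))

def root_count(tree, size):
    """Number of 1-bits (e.g., a support count) without scanning:
    pure subtrees contribute in one step."""
    if tree == 1:
        return size
    if tree == 0:
        return 0
    _, left, right = tree
    return root_count(left, size // 2) + root_count(right, size - size // 2)

column = [3, 3, 2, 0, 1, 1, 1, 1]      # 8 records, 2-bit attribute
for bv in to_bit_vectors(column, nbits=2):
    tree = build_ptree(bv)
    print(bv, '->', tree, 'count =', root_count(tree, len(bv)))
```

The key design point is visible in `root_count`: long pure runs collapse to a single node, so counts used by mining algorithms come from tree traversal rather than a scan of the records.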
The horizontal approach, on the other hand, requires complex hash/search trees. Zaki & Hsiao (2002) introduced a vertical representation called the diffset, which keeps track of the differences between the tidset of a candidate pattern and the tidsets of its generating frequent patterns. Diffsets drastically reduce the memory required to store intermediate results; even in dense domains, the entire working set of patterns of several vertical mining algorithms can therefore fit in main memory, facilitating the mining of very large databases. Shenoy et al. (2000) proposed a vertical approach, called VIPER, for association rule mining of large databases. VIPER stores data in compressed bit-vectors and integrates a number of optimizations for efficient generation, intersection, counting, and storage of bit-vectors, providing significant performance gains for large databases, with close-to-linear scale-up in database size. P-trees have been applied to a wide variety of data mining areas. The efficient P-tree storage structure and the P-tree algebra provide a fast way to calculate various measurements for data mining tasks, such as support and confidence in association rule mining, information gain in decision tree classification, and Bayesian probability values in Bayesian classification. P-trees have also been successfully used in many kinds of distance-based classification and clustering techniques. A computationally efficient distance metric called the Higher Order Basic Bit distance (HOBBit) (Khan et al., 2002) has been proposed based on P-trees. In one dimension, the HOBBit distance is defined as the number of positions by which the binary representations of two integers have to be right-shifted to make them equal. In more than one dimension, the HOBBit distance is defined as the maximum of the HOBBit distances in the individual dimensions. 
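The HOBBit definition above translates directly into shift operations. The following is a minimal sketch (assumed here for non-negative integers, which is the setting of the definition):

```python
# HOBBit distance sketch: count right-shifts until two integers agree,
# then take the maximum over dimensions.

def hobbit_1d(a, b):
    """One-dimensional HOBBit distance between non-negative integers."""
    shifts = 0
    while a != b:                 # shift off low-order bits until equal
        a >>= 1
        b >>= 1
        shifts += 1
    return shifts

def hobbit(p, q):
    """Multi-dimensional HOBBit distance between equal-length tuples."""
    return max(hobbit_1d(a, b) for a, b in zip(p, q))

print(hobbit_1d(12, 15))          # 1100 vs 1111 agree after 2 shifts -> 2
print(hobbit((12, 7), (15, 7)))   # max(2, 0) -> 2
```

Because the metric depends only on the position of the highest-order differing bit, it can be evaluated on bit-sliced (P-tree) data with pure bitwise operations, which is what makes it computationally cheap.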
Since computers use binary systems to represent numbers in memory, bit-wise logical operations are much faster than ordinary arithmetic operations such as the addition and multiplication of decimal numbers. Therefore, knowledge discovery algorithms utilizing vertical bit representations can accomplish their goals quickly. Multi-Relational Vertical Mining Multi-Relational Data Mining (MRDM) is the process of knowledge discovery from relational databases consisting of multiple tables. The rise of several application areas in Knowledge Discovery in Databases (KDD) that are intrinsically relational has provided, and continues to provide, strong motivation for the development of MRDM approaches. Scalability has always been an important concern in the field of data mining, and it is even more important in the multi-relational context, which is inherently more complex. From a database perspective, multi-relational data mining usually involves one or more joins between tables, which is not the case for classical data mining methods. To date, there is still a lack of accurate, efficient, and scalable multi-relational data mining methods that can handle large databases with complex schemas. Databases are usually normalized for implementation reasons. For data mining workloads, however, denormalizing the relations into a view may better represent the real world. In addition, if a view is materialized by storing it on disk, it can be accessed much faster, without being recomputed on the fly. Two alternative materialized view approaches, the relational materialized view model and the multidimensional materialized view model, can be utilized to model relational data (Ding, 2004). In the relational model, a relational or extended-relational DBMS is used to store and manage data. A set of vertical materialized views can be created to encompass all the information necessary for data mining. 
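The bitwise-efficiency point made at the start of this section can be shown concretely for the tidset intersections discussed earlier. In this sketch (a hypothetical four-transaction mini-database, not data from the chapter), each item's transaction set is a bitmap packed into a single Python integer, so the support of an itemset is a chain of AND operations plus a population count, with no scan of the records:

```python
# Support counting via bitwise AND of vertical item bitmaps.

transactions = [
    {'a', 'b', 'c'},
    {'a', 'c'},
    {'b', 'c'},
    {'a', 'b', 'c'},
]

# One bitmap per item: bit t is set iff transaction t contains the item.
bitmaps = {}
for t, items in enumerate(transactions):
    for item in items:
        bitmaps[item] = bitmaps.get(item, 0) | (1 << t)

def support(itemset):
    """Support of an itemset: AND the item bitmaps, then count 1-bits."""
    result = (1 << len(transactions)) - 1      # all transactions
    for item in itemset:
        result &= bitmaps.get(item, 0)
    return bin(result).count('1')              # population count

print(support({'a', 'c'}))   # contained in transactions 0, 1, 3 -> 3
```

A horizontal algorithm would re-scan the transaction file (or probe a hash tree) for each candidate; here each candidate costs a handful of machine-level AND operations, which is the efficiency argument made above.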
Vertical materialized views can be generated directly from the vertical representation of the original data, and the transformation can be done in parallel through Boolean operations. In the multidimensional materialized view model, multidimensional data are mapped directly to a data cube structure. The advantage of using a data cube is that it allows fast indexing into precomputed data by offset calculation. The vertical materialized views of the data cube require more storage than those of the relational model; with the P-tree technology, however, the difference is small, due to the compression inside the P-tree structure. All the vertical materialized views can easily be stored in P-tree format. For any given data mining task, relevance analysis and feature selection can then be used to retrieve all the relevant materialized-view P-trees. FUTURE TRENDS Vertical data structures and vertical data mining will become more and more important, and new vertical data structures will be needed for various types of data. There is great potential to combine vertical data mining with parallel mining as well as with hardware support. Scalability will remain a very important issue in the area of data mining: the challenge is not just dealing with a large number of tuples but also handling high dimensionality (Fayyad, 1999, Han & Kamber, 2001). CONCLUSION Horizontal data structures have proven inefficient for data mining on very large data sets due to the large cost of scanning. It is therefore important to develop vertical data structures and algorithms to solve the scalability issue. Various structures have been proposed, among which the P-tree is a very promising vertical structure. P-trees have shown great performance in processing data containing large numbers of tuples, due to fast logical AND operations that avoid scanning (Ding et al., 2002). 
Vertical structures, such as P-trees, also provide an efficient basis for multi-relational data mining. In general, horizontal data structures are preferable for transactional data whose intended output is a relation, while vertical data mining is more appropriate for knowledge discovery on very large data sets. REFERENCES Batory, D. S. (1979). On Searching Transposed Files. ACM Transactions on Database Systems, 4(4):531-544. Chan, C. Y. & Ioannidis, Y. (1998). Bitmap Index Design and Evaluation. Proceedings of the ACM SIGMOD, 355-366. Copeland, G. & Khoshafian, S. (1985). A Decomposition Storage Model. Proceedings of the ACM SIGMOD, 268-279. Denton, A., Ding, Q., Perrizo, W., & Ding, Q. (2002). Efficient Hierarchical Clustering of Large Data Sets Using P-trees. Proceedings of the International Conference on Computer Applications in Industry and Engineering, 138-141. Ding, Q. (2004). Multi-Relational Data Mining Using Vertical Database Technology. Ph.D. Thesis, North Dakota State University. Ding, Q., Ding, Q., & Perrizo, W. (2002). Association Rule Mining on Remotely Sensed Images Using P-trees. Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, 66-79. Fayyad, U. (1999). Editorial. SIGKDD Explorations, 1(1), 1-3. Han, J. & Kamber, M. (2001). Data Mining: Concepts and Techniques. San Francisco, CA: Morgan Kaufmann. Han, J., Pei, J., & Yin, Y. (2000). Mining Frequent Patterns without Candidate Generation. Proceedings of the ACM SIGMOD, 1-12. Khan, M., Ding, Q., & Perrizo, W. (2002). K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees. Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, 517-528. O’Neil, P. & Quass, D. (1997). Improved Query Performance with Variant Indexes. Proceedings of the ACM SIGMOD, 38-49. Perrizo, W., Gustafson, J., Thureen, D., & Wenberg, D. (1991). Domain Vector Accelerator for Relational Operations. Proceedings of the IEEE International Conference on Data Engineering, 491-498. 
Rinfret, D., O’Neil, P., & O’Neil, E. (2001). Bit-Sliced Index Arithmetic. Proceedings of the ACM SIGMOD, 47-57. Shenoy, P., Haritsa, J. R., Sudarshan, S., Bhalotia, G., Bawa, M., & Shah, D. (2000). Turbo-charging Vertical Mining of Large Databases. Proceedings of the ACM SIGMOD, 22-33. Winslett, M. (2002). David DeWitt Speaks Out. ACM SIGMOD Record, 31(2):50-62. Wong, H. K. T., Liu, H.-F., Olken, F., Rotem, D., & Wong, L. (1985). Bit Transposed Files. Proceedings of VLDB, 448-457. Wu, M-C. (1998). Query Optimization for Selections Using Bitmaps. Technical Report, DVS98-2, DVS1, Computer Science Department, Technische Universität Darmstadt. Wu, M-C & Buchmann, A. (1998). Encoded Bitmap Indexing for Data Warehouses. Proceedings of the IEEE International Conference on Data Engineering, 220-230. Zaki, M. J., & Hsiao, C-J. (2002). CHARM: An Efficient Algorithm for Closed Itemset Mining. Proceedings of the SIAM International Conference on Data Mining. TERMS AND THEIR DEFINITION Association Rule Mining: The process of finding interesting association or correlation relationships among a large set of data items. Data Mining: The application of analytical methods and tools to data for the purpose of identifying patterns and relationships such as classification, prediction, estimation, or affinity grouping. HOBBit Distance: A computationally efficient distance metric. In one dimension, it is the number of positions by which the binary representations of two integers have to be right-shifted to make them equal. In more than one dimension, it is the maximum of the HOBBit distances in the individual dimensions. Multi-Relational Data Mining: The process of knowledge discovery from relational databases consisting of multiple tables. Multi-Relational Vertical Mining: The process of knowledge discovery from relational databases consisting of multiple tables using vertical data mining approaches. 
Predicate tree (P-tree): A lossless tree that is vertically structured and horizontally processed through fast multi-operand logical operations. Vertical Data Mining (Vertical Mining): The process of finding patterns and knowledge from data organized in vertical formats, which aims to address scalability issues.