Vertical Data Mining
William Perrizo
Voice: +1 701-231-7248
Fax: +1 701-231-8255
Email: william.perrizo@ndsu.nodak.edu
Department of Computer Science
North Dakota State University
258 A10 IACC Building
1301 12th Ave. N.
Fargo, ND 58105, USA
Qiang Ding
Voice: +1 218-299-3347
Fax: +1 218-299-4308
Email: ding@cord.edu
Department of Computer Science
Concordia College
234E Ivers Building
901 8th St. S.
Moorhead, MN 56562, USA
Qin Ding
Voice: +1 717-948-6636
Fax: +1 717-948-6352
Email: qding@psu.edu
Department of Computer Science
Pennsylvania State University - Harrisburg
W-256 Olmsted Building
777 West Harrisburg Pike
Middletown, PA 17057, USA
Taufik Abidin
Voice: +1 701-231-6257
Fax: +1 701-231-8255
Email: taufik.abidin@ndsu.nodak.edu
Department of Computer Science
North Dakota State University
108 IACC Building
1301 12th Ave. N.
Fargo, ND 58105, USA
INTRODUCTION
The volume of data keeps increasing, and many data sets have become extremely large.
It is both important and challenging to develop scalable methodologies that can
perform efficient and effective data mining on such large data sets. The vertical data
mining strategy addresses these scalability issues by organizing data in vertical
layouts and conducting logical operations on vertically partitioned data instead of
scanning the entire database horizontally.
BACKGROUND
The traditional horizontal database structure (files of horizontally structured
records) and traditional scan-based data processing approaches (scanning files of
horizontal records) are known to be inadequate for knowledge discovery in very large
data repositories due to the problem of scalability. For this reason, much effort has
been put into sub-sampling and indexing as ways to address the scalability problem.
However, sub-sampling requires that the sub-sampler know enough about the large dataset
in the first place to sub-sample "representatively". That is, sub-sampling requires
considerable knowledge about the data, which, for many large datasets, may be
inadequate or non-existent. Index files are vertical structures: they are vertical
access paths to sets of horizontal records. Indexing files of horizontal data records
does address the scalability problem in many cases, but it does so at the cost of
creating and maintaining the index files separately from the data files themselves.
An alternative is to organize data vertically instead of horizontally. Data miners are
typically interested in collective properties or predictions that can be expressed very
briefly (e.g., a yes/no answer). Therefore, the result of a data mining query can be
represented by a bitmap vector. This important property makes it possible to do data
mining directly on vertical data structures.
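To make the bitmap-vector idea concrete, here is a toy sketch; the column values and the predicate are hypothetical, not taken from the chapter:

```python
# Hypothetical example: a yes/no mining query over one column of a table
# can be answered with one bit per row (a bitmap vector), rather than a
# set of full horizontal records.
ages = [23, 41, 35, 19, 52]                  # one attribute column

# The predicate "age >= 35", evaluated for every row, yields a bitmap.
bitmap = [1 if a >= 35 else 0 for a in ages]

print(bitmap)        # [0, 1, 1, 0, 1]
print(sum(bitmap))   # 3 rows satisfy the predicate
```

The bitmap is both the query result and a compact input to further logical operations.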
MAIN THRUST OF THE CHAPTER
Vertical data structures, vertical mining approaches and multi-relational vertical
mining will be explored in detail to show how vertical data mining works.
Vertical Data Structures
The concept of vertical partitioning has been studied within the context of both
centralized and distributed database systems for a long time, yet much remains to be done
(Winslett, 2002). Vertical partitioning offers significant advantages: it makes
hardware caching work well, it makes compression easy, and it can greatly increase the
effectiveness of the I/O device, since only the participating fields are retrieved on
each access. The vertical decomposition of a relation also permits a number of
transactions to execute concurrently. Copeland & Khoshafian (1985) presented an
attribute-level Decomposition Storage Model called DSM, similar to the Attribute
Transposed File model (ATF) (Batory, 1979), that stores each column of a relational
table into a separate table. DSM was shown to perform well. It utilizes surrogate keys to
map individual attributes together, hence requiring a surrogate key to be associated with
each attribute of each record in the database. Attribute-level vertical decomposition is
also used in Remotely Sensed Imagery (e.g., Landsat Thematic Mapper Imagery), where
it is called Band Sequential (BSQ) format. Beyond attribute-level decomposition, Wong
et al. (1985) presented the Bit Transposed File model (BTF), which took advantage of
encoded attribute values using a small number of bits to reduce the storage space.
In addition to ATF, BTF, and DSM models, there has been other work on vertical
data structuring, such as Bit-Sliced Indexes (BSI) (Chan & Ioannidis, 1998, O’Neil &
Quass, 1997, Rinfret et al., 2001), Encoded Bitmap Indexes (EBI) (Wu & Buchmann,
1998, Wu, 1998), and Domain Vector Accelerator (DVA) (Perrizo et al. 1991).
A Bit-Sliced Index (BSI) is an ordered list of bitmaps used to represent values of a
column or attribute, C. These bitmaps are called bit-slices, which provide binary
representations of C-values for all the rows.
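A minimal sketch of bit-slicing, assuming small unsigned integer values; the column data here is invented for illustration:

```python
# Bit-sliced index sketch: for an integer column C, store one bitmap per
# bit position of the binary C-values.
column = [5, 3, 0, 7, 2]   # hypothetical C-values, each fits in 3 bits
BITS = 3

# bit_slices[j][i] = bit j (j = 0 is the most significant) of row i's value
bit_slices = [[(v >> (BITS - 1 - j)) & 1 for v in column] for j in range(BITS)]

for j, s in enumerate(bit_slices):
    print(f"slice {j}: {s}")

# The slices alone are lossless: every row value can be reconstructed.
recovered = [sum(bit_slices[j][i] << (BITS - 1 - j) for j in range(BITS))
             for i in range(len(column))]
print(recovered == column)   # True
```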
In the EBI approach, an encoding function is applied to the attribute domain and a
binary-based bit-sliced index is built on the encoded domain. EBIs minimize the space
requirement and offer more optimization potential than plain binary bit-slices.
Both BSIs and EBIs are auxiliary index structures that must be stored in addition to
the data columns they index, effectively storing those columns twice. As we know, even
the simplest index structure used today incurs a substantial increase in total storage
requirements. The increased database size, in turn, translates into higher media and
maintenance costs and results in lower performance.
Domain Vector Accelerator (DVA) is a method to perform relational operations
based on vertical bit-vectors. The DVA method performs particularly well for joins
involving a primary key attribute and an associated foreign key attribute.
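The following is only an illustrative sketch of the bit-vector idea behind such joins, not the published DVA algorithm; all names and data are hypothetical:

```python
# Hypothetical sketch: a primary-key/foreign-key join can be reduced to a
# bitwise AND of two bit vectors defined over the shared key domain.
domain = range(8)              # shared key domain 0..7
pk_rows = {1, 3, 4, 6}         # keys present in the primary-key table
fk_rows = {0, 3, 4, 7}         # keys referenced by the foreign-key table

pk_vec = [1 if k in pk_rows else 0 for k in domain]
fk_vec = [1 if k in fk_rows else 0 for k in domain]

# Keys that participate in the join: AND of the two domain vectors.
join_vec = [p & f for p, f in zip(pk_vec, fk_vec)]
joinable = [k for k, bit in zip(domain, join_vec) if bit]
print(joinable)    # [3, 4]
```

Only the rows whose keys survive the AND need ever be fetched, which is the appeal of operating on the vertical vectors first.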
Vertical mining requires data to be organized vertically and processed horizontally
through fast, multi-operand logical operations, such as AND, OR, XOR, and complement.
The predicate tree (P-tree) is one form of lossless vertical structure that meets this
requirement. P-trees are suitable for representing both numerical and categorical data
and have been successfully used in various data mining applications, including
classification (Khan et al., 2002), clustering (Denton et al., 2002), and association
rule mining (Ding et al., 2002).
P-trees can be 1-dimensional, 2-dimensional, and multi-dimensional. If the data
has a natural dimension (e.g., spatial data), the P-tree dimension is matched to the data
dimension. Otherwise, the dimension can be chosen to optimize the compression ratio.
To convert a relational table of horizontal records to a set of vertical P-trees, the
table has to be projected into columns, one for each attribute, retaining the original record
order in each. Then each attribute column is further decomposed into separate bit
vectors, one for each bit position of the values in that attribute. Each bit vector is then
compressed into a tree structure by recording the truth of the predicate “purely 1-bits”
recursively on halves until purity is reached.
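The recursive "purely 1-bits" predicate described above can be sketched as follows; this is a simplified, hypothetical rendering, and real P-tree implementations also record 1-bit counts and use a fixed fan-out:

```python
def build_ptree(bits):
    """Recursively record the predicate "this run is purely 1-bits",
    splitting into halves until a pure (all-0 or all-1) run is reached."""
    if all(b == 1 for b in bits):
        return 1                 # pure-1 node: the whole run is 1-bits
    if all(b == 0 for b in bits):
        return 0                 # pure-0 node
    mid = len(bits) // 2         # mixed run: recurse on the two halves
    return (build_ptree(bits[:mid]), build_ptree(bits[mid:]))

bitvec = [1, 1, 1, 1, 1, 0, 0, 0]   # one bit position of one attribute
print(build_ptree(bitvec))          # (1, ((1, 0), 0))
```

Long pure runs collapse to a single node, which is where the compression comes from.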
Vertical Mining Approaches
A number of vertical data mining algorithms have been proposed, especially in the
area of association rule mining. Mining algorithms using the vertical format have been
shown to outperform horizontal approaches in many cases. One example is the Frequent
Pattern Growth algorithm using Frequent Pattern Trees introduced by Han in (Han et al.,
(Footnote: P-tree is a patent-pending technology developed by Dr. William Perrizo's
DataSURG research group at North Dakota State University.)
2000). The advantages come from the fact that frequent patterns can be counted via
transaction_id_set intersections, instead of using complex internal data structures. The
horizontal approach, on the other hand, requires complex hash/search trees. Zaki (Zaki &
Hsiao, 2002) introduced a vertical representation called diffset, which keeps track of
the differences between the tidset of a candidate pattern and those of its generating
frequent patterns. Diffsets drastically reduce the memory required to store
intermediate results; therefore, even in dense domains, the entire working set of
patterns of several vertical mining algorithms can fit entirely in main memory,
facilitating the mining of very large databases. Shenoy et al. (2000) propose a
vertical approach, called VIPER, for association rule mining of large databases. VIPER
stores data in compressed bit-vectors and integrates a number of optimizations for
efficient generation, intersection, counting, and storage of bit-vectors, which
provides significant performance gains for large databases with close to linear
scale-up with database size.
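Zaki's diffset idea mentioned above can be illustrated with hypothetical tidsets:

```python
# Hypothetical tidsets: instead of storing the full tidset of an extended
# pattern AB, a diffset stores only the tids AB loses relative to its
# prefix A -- usually a much smaller set in dense data.
tids_A  = {1, 2, 3, 4, 5, 6}    # transactions containing A
tids_AB = {1, 3, 5}             # transactions containing both A and B

diffset_AB = tids_A - tids_AB   # what AB "loses" from A
support_AB = len(tids_A) - len(diffset_AB)
print(sorted(diffset_AB), support_AB)   # [2, 4, 6] 3
```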
P-trees have been applied to a wide variety of data mining areas. The efficient P-tree
storage structure and the P-tree algebra provide a fast way to calculate various
measurements for data mining tasks, such as support and confidence in association rule
mining, information gain in decision tree classification, and Bayesian probability
values in Bayesian classification.
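As a toy illustration of such a measurement, support can be counted on vertical bit vectors with a single AND and a population count; the transaction data is invented, and packed Python integers stand in for compressed P-trees:

```python
# Items as vertical bit vectors: bit i of each integer marks whether
# transaction i contains the item (hypothetical data).
bread = 0b101101   # transactions 0, 2, 3, 5 contain bread
milk  = 0b100111   # transactions 0, 1, 2, 5 contain milk

both = bread & milk              # transactions containing both items
support = bin(both).count("1")   # population count of the AND result
print(support)                   # 3 (transactions 0, 2, 5)
```

No horizontal scan of the transaction file is needed; the count falls out of the logical operation itself.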
P-trees have also been successfully used in many kinds of distance-based
classification and clustering techniques. A computationally efficient distance metric
called the Higher Order Basic Bit Distance (HOBBit) (Khan et al., 2002) has been
proposed based on P-trees. In one dimension, the HOBBit distance is defined as the
number of digits by which the binary representations of two integers have to be
right-shifted to make them equal. In more than one dimension, the HOBBit distance is
defined as the maximum of the HOBBit distances in the individual dimensions.
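The HOBBit definition above can be sketched directly; the helper names are hypothetical, and non-negative integer attribute values are assumed:

```python
def hobbit_1d(a, b):
    """Number of right-shifts needed to make two binary values equal."""
    shifts = 0
    while a != b:
        a >>= 1
        b >>= 1
        shifts += 1
    return shifts

def hobbit(xs, ys):
    """Multi-dimensional HOBBit: maximum of the per-dimension distances."""
    return max(hobbit_1d(a, b) for a, b in zip(xs, ys))

print(hobbit_1d(12, 9))         # 12 = 1100, 9 = 1001: equal after 3 shifts
print(hobbit([12, 7], [9, 6]))  # max(3, 1) = 3
```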
Since computers use binary systems to represent numbers in memory, bit-wise logical
operations are much faster than ordinary arithmetic operations such as addition and
multiplication of decimal numbers. Therefore, knowledge discovery algorithms that
utilize vertical bit representations can accomplish their goals quickly.
Multi-Relational Vertical Mining
Multi-Relational Data Mining (MRDM) is the process of knowledge discovery from
relational databases consisting of multiple tables. The rise of several application
areas in Knowledge Discovery in Databases (KDD) that are intrinsically relational has
provided, and continues to provide, strong motivation for the development of MRDM
approaches. While scalability has always been an important concern in the field of
data mining, it is even more important in the multi-relational context, which is
inherently more complex. From a database perspective, multi-relational data mining
usually involves one or more joins between tables, which is not the case for classical
data mining methods. To date, there is still a lack of accurate, efficient, and
scalable multi-relational data mining methods that can handle large databases with
complex schemas.
Databases are usually normalized for implementation reasons. For data mining
workloads, however, denormalizing the relations into a view may better represent the
real world. In addition, if a view is materialized by storing it on disk, it can be
accessed much faster without being computed on the fly. Two alternative materialized
view approaches, the relational materialized view model and the multidimensional
materialized view model, can be utilized to model relational data (Ding, 2004).
In the relational model, a relational or extended-relational DBMS is used to store
and manage data. A set of vertical materialized views can be created to encompass all the
information necessary for data mining. Vertical materialized views can be generated
directly from vertical representation of the original data. The transformation can be done
in parallel by Boolean operations.
In the multidimensional materialized view model, multidimensional data are mapped
directly to a data cube structure. The advantage of using a data cube is that it
allows fast indexing via offset calculations into precomputed data. The vertical
materialized views of the data cube require larger storage than those of the
relational model. However, with the P-tree technology, there would not be much
difference, due to the compression inside the P-tree structure.
All the vertical materialized views can easily be stored in P-tree format. For any
given data mining task, relevance analysis and feature selection can then be used to
retrieve the relevant materialized-view P-trees.
FUTURE TRENDS
Vertical data structures and vertical data mining will become more and more important,
and new vertical data structures will be needed for various types of data. There is
great potential in combining vertical data mining with parallel mining and with
hardware support. Scalability will remain a very important issue in the area of data
mining: the challenge is not just dealing with a large number of tuples but also
handling high dimensionality (Fayyad, 1999; Han & Kamber, 2001).
CONCLUSION
The horizontal data structure has been proven inefficient for data mining on very
large data sets due to the large cost of scanning. It is therefore important to
develop vertical data structures and algorithms to solve the scalability issue.
Various structures have been proposed, among which the P-tree is a very promising
vertical structure. P-trees have shown great performance in processing data containing
large numbers of tuples, thanks to fast logical AND operations that avoid scanning
(Ding et al., 2002). Vertical structures such as P-trees also provide an efficient way
to do multi-relational data mining. In general, the horizontal data structure is
preferable for transactional data whose intended output is a relation, while vertical
data mining is more appropriate for knowledge discovery on very large data sets.
REFERENCES
Batory, D. S. (1979). On Searching Transposed Files. ACM Transactions on Database
Systems, 4(4):531-544.
Chan, C. Y. & Ioannidis, Y. (1998). Bitmap index design and evaluation. Proceedings of
the ACM SIGMOD, 355-366.
Copeland, G. & Khoshafian, S. (1985). Decomposition Storage Model. Proceedings of
the ACM SIGMOD, 268-279.
Denton, A., Ding, Q., Perrizo, W., & Ding, Q. (2002). Efficient Hierarchical Clustering
of Large Data Sets Using P-Trees. Proceedings of the International Conference on
Computer Applications in Industry and Engineering, 138-141.
Ding, Q. (2004). Multi-Relational Data Mining Using Vertical Database Technology.
Ph.D. Thesis, North Dakota State University.
Ding, Q., Ding, Q., & Perrizo, W. (2002). Association Rule Mining on Remotely Sensed
Images Using P-trees. Proceedings of the Pacific-Asia Conference on Knowledge
Discovery and Data Mining, 66-79.
Khan, M., Ding, Q., & Perrizo, W. (2002). K-nearest Neighbor Classification on Spatial
Data Streams Using P-trees. Proceedings of the Pacific-Asia Conference on
Knowledge Discovery and Data Mining, 517-528.
Fayyad, U. (1999). Editorial, SIGKDD Explorations, 1(1), 1-3.
Han, J. & Kamber, M. (2001). Data Mining: Concepts and Techniques. San Francisco,
CA: Morgan Kaufmann.
Han, J., Pei, J., & Yin, Y. (2000). Mining Frequent Patterns without Candidate
Generation. Proceedings of the ACM SIGMOD, 1-12.
O’Neil, P. & Quass, D. (1997). Improved Query Performance with Variant Indexes.
Proceedings of the ACM SIGMOD, 38-49.
Perrizo, W., Gustafson, J., Thureen, D., & Wenberg, D. (1991). Domain Vector
Accelerator for Relational Operations. Proceedings of IEEE International
Conference on Data Engineering, 491-498.
Rinfret, D., O’Neil, P., & O’Neil, E. (2001). Bit-Sliced Index Arithmetic. Proceedings of
the ACM SIGMOD, 47-57.
Shenoy, P., Haritsa, J. R., Sudarshan, S., Bhalotia, G., Bawa, M., & Shah, D. (2000).
Turbo-charging vertical mining of large databases. Proceedings of the ACM
SIGMOD, 22-33.
Winslett, M. (2002). David DeWitt Speaks Out. ACM SIGMOD Record, 31(2):50-62.
Wong, H. K. T., Liu, H.-F., Olken, F., Rotem, D., & Wong, L. (1985). Bit Transposed
Files. Proceedings of VLDB, 448-457.
Wu, M-C. (1998). Query Optimization for Selections using Bitmaps. Technical Report,
DVS98-2, DVS1, Computer Science Department, Technische Universität
Darmstadt.
Wu, M-C & Buchmann, A. (1998). Encoded bitmap indexing for data warehouses.
Proceedings of IEEE International Conference on Data Engineering, 220-230.
Zaki, M. J., & Hsiao, C-J. (2002). CHARM: An Efficient Algorithm for Closed Itemset
Mining. Proceedings of the SIAM International Conference on Data Mining.
TERMS AND THEIR DEFINITIONS
Association Rule Mining: The process of finding interesting association or correlation
relationships among a large set of data items.
Data Mining: The application of analytical methods and tools to data for the purpose of
identifying patterns and relationships such as classification, prediction, estimation,
or affinity grouping.
HOBBit Distance: A computationally efficient distance metric. In one dimension, it is
the number of digits by which the binary representations of two integers have to be
right-shifted to make them equal. In multiple dimensions, it is the maximum of
the HOBBit distances in the individual dimensions.
Multi-Relational Data Mining: The process of knowledge discovery from relational
databases consisting of multiple tables.
Multi-Relational Vertical Mining: The process of knowledge discovery from relational
databases consisting of multiple tables using vertical data mining approaches.
Predicate tree (P-tree): A lossless tree that is vertically structured and horizontally
processed through fast multi-operand logical operations.
Vertical Data Mining (Vertical Mining): The process of finding patterns and
knowledge from data organized in vertical formats, which aims to address the
scalability issues.