LECTURE NOTES
ON
DATA WAREHOUSING AND
MINING
2018 – 2019
II MCA II Semester
Mrs. T.LAKSHMI PRASANNA, Assistant Professor
CHADALAWADA RAMANAMMA ENGINEERING COLLEGE
(AUTONOMOUS)
Chadalawada Nagar, Renigunta Road, Tirupati – 517 506
Department of Master of Computer Applications
Unit - I
Data Warehousing
Classes:10
Introduction to data mining: Motivation, importance, definition of data mining, kinds of data
mining, kinds of patterns, data mining technologies, kinds of applications targeted, major issues
in data mining; Preprocessing: data objects and attribute types, basic statistical descriptions of
data, data visualization, data quality, data cleaning, data integration, data reduction, data
transformation and data discretization.
MOTIVATION AND IMPORTANCE:
Data Mining is defined as the procedure of extracting information from huge sets of data. In
other words, data mining is mining knowledge from data. These notes start with a basic
overview and the terminology involved in data mining, and then gradually move on to topics
such as knowledge discovery, query languages, classification and prediction, decision tree
induction, cluster analysis, and mining the Web.
There is a huge amount of data available in the information industry. This data is of no use
until it is converted into useful information. It is therefore necessary to analyze this huge
amount of data and extract useful information from it.
Extraction of information is not the only process we need to perform; data mining also
involves other processes such as Data Cleaning, Data Integration, Data Transformation, Data
Mining, Pattern Evaluation and Data Presentation. Once all these processes are over, we
would be able to use this information in many applications such as Fraud Detection, Market
Analysis, Production Control, Science Exploration, etc.
DEFINITION OF DATA MINING:
Data Mining is defined as extracting information from huge sets of data. In other words, we
can say that data mining is the procedure of mining knowledge from data. The information or
knowledge extracted so can be used for any of the following applications −
• Market Analysis
• Fraud Detection
• Customer Retention
• Production Control
• Science Exploration
* Major Sources of Abundant Data:
- Business – Web, E-commerce, Transactions, Stocks
- Science – Remote Sensing, Bioinformatics, Scientific Simulation
- Society and Everyone – News, Digital Cameras, YouTube
* Need for turning data into knowledge – Drowning in data, but starving for knowledge
* Applications that use data mining:
- Market Analysis
- Fraud Detection
- Customer Retention
- Production Control
- Scientific Exploration
Definition of Data Mining:
Extracting and “mining” knowledge from large amounts of data.
“Gold mining from rock or sand” is analogous to “knowledge mining from data”.
Other terms for Data Mining:
o Knowledge Mining
o Knowledge Extraction
o Pattern Analysis
o Data Archeology
o Data Dredging
Data Mining is not the same as KDD (Knowledge Discovery from Data).
Data Mining is a step in KDD.
Data Cleaning – Remove noisy and inconsistent data
Data Integration – Multiple data sources combined
Data Selection – Data relevant to analysis retrieved
Data Transformation – Transform into form suitable for Data Mining (Summarized /
Aggregated)
Data Mining – Extract data patterns using intelligent methods
Pattern Evaluation – Identify interesting patterns
Knowledge Presentation – Visualization / Knowledge Representation
– Presenting mined knowledge to the user
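As a concrete illustration, the sketch below runs a toy version of these KDD steps in Python with pandas. The table and column names (sales, customers, cust_id, item, amount, region) are invented for the example and are not part of the notes.

```python
import pandas as pd

# Toy transaction data standing in for the raw sources (illustrative only)
sales = pd.DataFrame({"cust_id": [1, 1, 2, 3, 3],
                      "item": ["beer", "crisps", "beer", "milk", None],
                      "amount": [5.0, 2.0, 5.0, 1.5, 3.0]})
customers = pd.DataFrame({"cust_id": [1, 2, 3],
                          "region": ["North", "South", "North"]})

cleaned = sales.dropna(subset=["item"])               # data cleaning: drop incomplete rows
integrated = cleaned.merge(customers, on="cust_id")   # data integration: combine sources
selected = integrated[["region", "item", "amount"]]   # data selection: keep relevant fields
summary = selected.groupby(["region", "item"])["amount"].sum()  # transformation: aggregate

print(summary)   # the aggregated view is what pattern evaluation / presentation would work on
```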
DIFFERENT KINDS OF DATA MINING:
Several major data mining techniques have been developed and are used in data mining
projects, including
association,
classification,
clustering,
prediction,
sequential patterns and
decision trees.
We will briefly examine these data mining techniques in the following sections.
Association:
Association is one of the best-known data mining techniques. In association, a pattern is
discovered based on a relationship between items in the same transaction. That is why the
association technique is also known as the relation technique. The association technique is
used in market basket analysis to identify a set of products that customers frequently
purchase together.
Retailers use the association technique to study customers' buying habits. Based on
historical sales data, retailers might find that customers often buy crisps when they buy
beer, and therefore they can put beer and crisps next to each other to save time for the
customer and increase sales.
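A minimal sketch of the idea behind market basket analysis, assuming a handful of made-up transactions: counting how often pairs of items occur together gives their support. Dedicated itemset-mining libraries exist, but a plain counter is enough to show the principle.

```python
from itertools import combinations
from collections import Counter

# Illustrative market-basket transactions
transactions = [
    {"beer", "crisps", "nappies"},
    {"beer", "crisps"},
    {"milk", "bread"},
    {"beer", "crisps", "milk"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):   # count every item pair in the basket
        pair_counts[pair] += 1

n = len(transactions)
for pair, count in pair_counts.most_common(3):
    print(pair, "support =", count / n)   # ('beer', 'crisps') support = 0.75
```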
Classification
Classification is a classic data mining technique based on machine learning. Basically,
classification is used to classify each item in a set of data into one of a predefined set of classes
or groups. The classification method makes use of mathematical techniques such as decision
trees, linear programming, neural networks, and statistics. In classification, we develop software
that can learn how to classify data items into groups. For example, we can apply
classification in the application "given all records of employees who left the company,
predict who will probably leave the company in a future period." In this case, we divide the
records of employees into two groups named "leave" and "stay", and then we can ask our
data mining software to classify the employees into these groups.
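A minimal sketch of the employee leave/stay example, assuming scikit-learn is available; the two features (years of service, salary) and the training records are invented for illustration, not taken from the notes.

```python
from sklearn.tree import DecisionTreeClassifier

# Illustrative employee records: [years_of_service, salary_in_thousands]
X = [[1, 30], [2, 32], [10, 80], [8, 75], [1, 28], [12, 90]]
y = ["leave", "leave", "stay", "stay", "leave", "stay"]   # known outcomes (training labels)

model = DecisionTreeClassifier(max_depth=2)
model.fit(X, y)

# Classify a new employee into one of the predefined groups
print(model.predict([[3, 35]]))   # e.g. ['leave']
```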
Clustering
Clustering is a data mining technique that automatically groups objects with similar
characteristics into meaningful or useful clusters. The clustering technique defines the classes
and puts objects into each class, while in classification techniques objects are assigned to
predefined classes. To make the concept clearer, we can take book management in a library
as an example. In a library, there is a wide range of books on various topics available. The
challenge is how to keep those books in such a way that readers can take several books on a
particular topic without hassle. By using the clustering technique, we can keep books that
have some kind of similarity in one cluster or one shelf and label it with a meaningful name.
If readers want books on that topic, they would only have to go to that shelf instead of
searching the entire library.
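A minimal sketch of clustering with k-means, assuming scikit-learn is available; the numeric "topic score" features standing in for books are invented for illustration.

```python
from sklearn.cluster import KMeans

# Illustrative 2-D feature vectors for books (e.g. topic scores derived from keyword counts)
books = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15],   # mostly topic A
         [0.1, 0.9], [0.2, 0.8], [0.15, 0.85]]   # mostly topic B

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(books)

print(labels)                    # e.g. [1 1 1 0 0 0] -- one shelf per cluster
print(kmeans.cluster_centers_)   # each centroid summarizes one shelf
```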
Prediction
Prediction, as its name implies, is a data mining technique that discovers the relationship
between independent variables and the relationship between dependent and independent
variables. For instance, the prediction technique can be used in sales to predict future profit:
if we consider sales as the independent variable, profit can be the dependent variable. Then,
based on the historical sales and profit data, we can draw a fitted regression curve that is
used for profit prediction.
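A minimal sketch of fitting a regression line to historical sales and profit figures and using it for prediction; the numbers are invented, and NumPy's polyfit is used here as one possible way to fit the line.

```python
import numpy as np

# Illustrative historical data: sales (independent) vs. profit (dependent)
sales = np.array([10, 20, 30, 40, 50], dtype=float)
profit = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

slope, intercept = np.polyfit(sales, profit, deg=1)   # fit a straight regression line
predicted = slope * 60 + intercept                    # predict profit for future sales of 60

print(f"profit = {slope:.2f} * sales + {intercept:.2f}; prediction for sales=60: {predicted:.1f}")
```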
Sequential Patterns
Sequential pattern analysis is a data mining technique that seeks to discover or identify
similar patterns, regular events or trends in transaction data over a business period.
In sales, with historical transaction data, businesses can identify a set of items that customers
buy together at different times of the year. Businesses can then use this information to
recommend these items to customers with better deals, based on their purchasing frequency
in the past.
Decision trees
A decision tree is one of the most commonly used data mining techniques because its
model is easy for users to understand. In the decision tree technique, the root of the decision
tree is a simple question or condition that has multiple answers. Each answer then leads to a
set of questions or conditions that help us refine the data so that we can make the final
decision based on it. For example, we use the following decision tree to determine whether
or not to play tennis:
Starting at the root node, if the outlook is overcast then we should definitely play tennis. If it
is rainy, we should only play tennis if the wind is weak. And if it is sunny, then we should play
tennis if the humidity is normal.
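The play-tennis tree described above can be written directly as nested conditions. This is only a hand-coded sketch of that specific tree, not a tree-induction algorithm.

```python
def play_tennis(outlook: str, wind: str, humidity: str) -> bool:
    """Decision tree from the notes: the root question is the outlook."""
    if outlook == "overcast":
        return True                      # overcast -> always play
    if outlook == "rainy":
        return wind == "weak"            # rainy -> play only if the wind is weak
    if outlook == "sunny":
        return humidity == "normal"      # sunny -> play only if humidity is normal
    return False

print(play_tennis("rainy", "weak", "high"))    # True
print(play_tennis("sunny", "strong", "high"))  # False
```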
We often combine two or more of those data mining techniques together to form an
appropriate process that meets the business needs.
1. Classification analysis
This analysis is used to retrieve important and relevant information about data, and metadata.
It is used to classify different data into different classes. Classification is similar to clustering in
that it also segments data records into different segments, called classes. But unlike
clustering, in classification the data analysts already have knowledge of the different classes or
clusters. So, in classification analysis you would apply algorithms to decide how new data
should be classified. A classic example of classification analysis is the Outlook email client:
Outlook uses certain algorithms to characterize an email as legitimate or spam.
2. Association rule learning
It refers to a method that can help you identify interesting relations (dependency
modeling) between different variables in large databases. This technique can help you unpack
hidden patterns in the data, identifying variables within the data and the co-occurrence of
different variables that appear very frequently in the dataset. Association rules are useful for
examining and forecasting customer behavior and are highly recommended for retail industry
analysis. The technique is used for shopping basket data analysis, product clustering, catalog
design and store layout. Programmers also use association rules to build programs capable of
machine learning.
3. Anomaly or outlier detection
This refers to the observation of data items in a dataset that do not match an expected
pattern or expected behavior. Anomalies are also known as outliers, novelties, noise,
deviations and exceptions. Often they provide critical and actionable information. An anomaly
is an item that deviates considerably from the common average within a dataset or a
combination of data. These items are statistically distant from the rest of the data and hence
indicate that something out of the ordinary has happened and requires additional attention.
This technique can be used in a variety of domains, such as intrusion detection, system health
monitoring, fraud detection, fault detection, event detection in sensor networks, and
detecting ecosystem disturbances. Analysts often remove the anomalous data from the
dataset to discover results with increased accuracy.
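A minimal sketch of one simple outlier-detection idea, flagging values whose z-score is far from the mean; the amounts and the threshold of 2.0 are illustrative choices, not part of the notes.

```python
import statistics

# Illustrative daily transaction amounts; 95 is the suspicious spike
amounts = [10, 12, 11, 13, 12, 95]

mean = statistics.mean(amounts)
std = statistics.pstdev(amounts)

# Flag values that deviate strongly from the average (the threshold is a tunable choice)
outliers = [x for x in amounts if abs(x - mean) / std > 2.0]
print(outliers)   # [95]
```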
4. Clustering analysis
A cluster is a collection of data objects that are similar within the same cluster. That means
the objects are similar to one another within the same group and are dissimilar or unrelated
to the objects in other groups or clusters. Clustering analysis is the process of discovering
groups and clusters in the data in such a way that the degree of association between two
objects is highest if they belong to the same group and lowest otherwise. The results of this
analysis can be used to create customer profiles.
5. Regression analysis
In statistical terms, regression analysis is the process of identifying and analyzing the
relationship among variables. It can help you understand how the value of the dependent
variable changes when any one of the independent variables is varied. This means one
variable is dependent on another, but not vice versa. Regression is generally used for
prediction and forecasting.
All of these techniques can help analyze different data from different perspectives. Now you
have the knowledge to decide the best technique to summarize data into useful information –
information that can be used to solve a variety of business problems to increase revenue,
customer satisfaction, or decrease unwanted cost.
DATA MINING TECHNIQUES:
1.Classification:
This analysis is used to retrieve important and relevant information about data, and metadata.
This data mining method helps to classify data in different classes.
2. Clustering:
Clustering analysis is a data mining technique to identify data that are like each other. This
process helps to understand the differences and similarities between the data.
3. Regression:
Regression analysis is the data mining method of identifying and analyzing the relationship
between variables. It is used to identify the likelihood of a specific variable, given the presence
of other variables.
4. Association Rules:
This data mining technique helps to find the association between two or more Items. It
discovers a hidden pattern in the data set.
5. Outlier detection:
This type of data mining technique refers to the observation of data items in the dataset which
do not match an expected pattern or expected behavior. This technique can be used in a variety
of domains, such as intrusion detection, fraud or fault detection, etc. Outlier detection is also
called outlier analysis or outlier mining.
6. Sequential Patterns:
This data mining technique helps to discover or identify similar patterns or trends in
transaction data over a certain period.
7. Prediction:
Prediction uses a combination of the other data mining techniques such as trend analysis,
sequential patterns, clustering, classification, etc. It analyzes past events or instances in the
right order to predict a future event.
DATA MINING TECHNIQUES(IN DETAIL):
1.Tracking patterns. One of the most basic techniques in data mining is learning to recognize
patterns in your data sets. This is usually a recognition of some aberration in your data
happening at regular intervals, or an ebb and flow of a certain variable over time. For example,
you might see that your sales of a certain product seem to spike just before the holidays, or
notice that warmer weather drives more people to your website.
2. Classification. Classification is a more complex data mining technique that forces you to
collect various attributes together into discernable categories, which you can then use to draw
further conclusions, or serve some function. For example, if you’re evaluating data on
individual customers’ financial backgrounds and purchase histories, you might be able to
classify them as “low,” “medium,” or “high” credit risks. You could then use these
classifications to learn even more about those customers.
3. Association. Association is related to tracking patterns, but is more specific to dependently
linked variables. In this case, you’ll look for specific events or attributes that are highly
correlated with another event or attribute; for example, you might notice that when your
customers buy a specific item, they also often buy a second, related item. This is usually what’s
used to populate “people also bought” sections of online stores.
4. Outlier detection. In many cases, simply recognizing the overarching pattern can’t give you
a clear understanding of your data set. You also need to be able to identify anomalies, or
outliers in your data. For example, if your purchasers are almost exclusively male, but during
one strange week in July, there’s a huge spike in female purchasers, you’ll want to investigate
the spike and see what drove it, so you can either replicate it or better understand your
audience in the process.
5. Clustering. Clustering is very similar to classification, but involves grouping chunks of data
together based on their similarities. For example, you might choose to cluster different
demographics of your audience into different packets based on how much disposable income
they have, or how often they tend to shop at your store.
6. Regression. Regression, used primarily as a form of planning and modeling, is used to
identify the likelihood of a certain variable, given the presence of other variables. For example,
you could use it to project a certain price, based on other factors like availability, consumer
demand, and competition. More specifically, regression’s main focus is to help you uncover
the exact relationship between two (or more) variables in a given data set.
7. Prediction. Prediction is one of the most valuable data mining techniques, since it’s used to
project the types of data you’ll see in the future. In many cases, just recognizing and
understanding historical trends is enough to chart a somewhat accurate prediction of what will
happen in the future. For example, you might review consumers’ credit histories and past
purchases to predict whether they’ll be a credit risk in the future.
DATA MINING TOOLS:
So do you need the latest and greatest machine learning technology to be able to apply these
techniques? Not necessarily. In fact, you can probably accomplish some cutting-edge data
mining with relatively modest database systems, and simple tools that almost any company
will have. And if you don’t have the right tools for the job, you can always create your own.
However you approach it, data mining is the best collection of techniques you have for making
the most out of the data you’ve already gathered. As long as you apply the correct logic, and
ask the right questions, you can walk away with conclusions that have the potential to
revolutionize your enterprise.
Challenges of Implementing Data Mining:
• Skilled experts are needed to formulate the data mining queries.
• Overfitting: due to a small training database, a model may not fit future states.
• Data mining needs large databases, which are sometimes difficult to manage.
• Business practices may need to be modified to make use of the information uncovered.
• If the data set is not diverse, data mining results may not be accurate.
• Integrating information from heterogeneous databases and global information systems can be complex.
MAJOR ISSUES IN DATA MINING:
• Mining Methodology Issues:
o Mining different kinds of knowledge in databases
o Incorporation of background knowledge
o Handling noisy or incomplete data
o Pattern evaluation – the interestingness problem
• User Interaction Issues:
o Interactive mining of knowledge at multiple levels of abstraction
o Data mining query languages and ad-hoc data mining
• Performance Issues:
o Efficiency and scalability of data mining algorithms
o Parallel, distributed and incremental mining algorithms
• Issues related to the diversity of data types:
o Handling of relational and complex types of data
o Mining information from heterogeneous databases and global information systems
DATA MINING TECHNOLOGIES:
As a highly application-driven domain, data mining has incorporated many techniques
from other domains such as statistics, machine learning, pattern recognition, database
and data warehouse systems, information retrieval, visualization, algorithms, high-performance
computing, and many application domains. The interdisciplinary nature of data mining research
and development contributes significantly to the success of data mining and its extensive
applications. In this section, we give examples of several disciplines that strongly influence the
development of data mining methods.
DATA OBJECTS AND ATTRIBUTE TYPES:
When we talk about data mining, we usually talk about knowledge discovery from data. To
get to know the data, it is necessary to discuss data objects, data attributes and the types of
data attributes. Mining data includes knowing about the data and finding relations between
data, and for this we need to discuss data objects and attributes.
Data objects are the essential part of a database. A data object represents an entity and is
described by a group of attributes of that entity. For example, in sales data an object may
represent a customer, a sale or a purchase. When data objects are stored in a database, they
are called data tuples.
Attribute
An attribute can be seen as a data field that represents a characteristic or feature of a data
object. For a customer object, attributes can be customer ID, address, etc. A set of attributes
used to describe a given object is known as an attribute vector or feature vector.
Types of attributes:
This is the first step of data preprocessing. We differentiate between different types of
attributes and then preprocess the data. The attribute types are:
1. Qualitative (Nominal (N), Ordinal (O), Binary (B))
2. Quantitative (Discrete, Continuous)
Qualitative Attributes
1. Nominal Attributes – related to names: The values of a nominal attribute are names of
things or some kind of symbols. The values of a nominal attribute represent some category
or state, which is why nominal attributes are also referred to as categorical attributes. There
is no order (rank, position) among the values of a nominal attribute.
2. Binary Attributes: Binary data has only 2 values/states, for example yes or no,
affected or unaffected, true or false.
i) Symmetric: both values are equally important (e.g., Gender).
ii) Asymmetric: both values are not equally important (e.g., Result).
3. Ordinal Attributes: Ordinal attributes contain values that have a meaningful
sequence or ranking (order) between them, but the magnitude between values is not
actually known; the order of values shows what is important but does not indicate
how important it is.
Quantitative Attributes
1. Numeric: A numeric attribute is quantitative because it is a measurable quantity,
represented in integer or real values. Numeric attributes are of 2 types: interval and ratio.
i) An interval-scaled attribute has values whose differences are interpretable, but the
attribute does not have a true reference point, or zero point. Interval-scaled data can be
added and subtracted but cannot meaningfully be multiplied or divided. Consider the
example of temperature in degrees Centigrade: if the temperature on one day is twice
that of another day, we cannot say that one day is twice as hot as the other.
ii) A ratio-scaled attribute is a numeric attribute with a fixed zero point. If a
measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of
another value. The values are ordered, and we can also compute the difference
between values, as well as the mean, median, mode, quantile range and five-number
summary.
2. Discrete: Discrete data have finite values; they can be numerical or categorical. These
attributes have a finite or countably infinite set of values.
3. Continuous: Continuous data have an infinite number of possible states. Continuous
data are of float type; there can be many values between 2 and 3.
DATA PRE-PROCESSING
Definition - What does Data Preprocessing mean?
Data preprocessing is a data mining technique that involves transforming raw data into an
understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in
certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a
proven method of resolving such issues. Data preprocessing prepares raw data for further
processing.
Data preprocessing is used in database-driven applications such as customer relationship
management and rule-based applications (like neural networks).
Data goes through a series of steps during preprocessing:
• Data Cleaning: Data is cleansed through processes such as filling in missing values,
smoothing noisy data, or resolving inconsistencies in the data.
• Data Integration: Data with different representations are put together and conflicts
within the data are resolved.
• Data Transformation: Data is normalized, aggregated and generalized.
• Data Reduction: This step aims to present a reduced representation of the data in a
data warehouse.
• Data Discretization: Involves reducing the number of values of a continuous attribute
by dividing the range of the attribute into intervals.
Why Data Pre-processing?
Data preprocessing prepares raw data for further processing. The traditional data
preprocessing approach is reactive: it starts with data that is assumed to be ready for
analysis, and there is no feedback to improve the way the data is collected. Inconsistency
between data sets is the main difficulty in data preprocessing.
The major tasks of preprocessing are described below.
Data Cleaning
Data in the real world is dirty; that is, it is incomplete, noisy or inconsistent.
Incomplete: means lacking attribute values, lacking certain attributes of interest,
or containing only aggregate data
e.g., occupation=“ ”
Noisy: means containing errors or outliers
e.g., Salary=“-10”
Inconsistent: means containing discrepancies in codes or names
e.g., Age=“42” Birthday=“03/07/1997”
e.g., Was rating “1,2,3”, now rating “A, B, C”
e.g., discrepancy between duplicate records
Why Is Data Dirty?
Data is dirty for the following reasons.
Incomplete data may come from
“Not applicable” data value when collected
Different considerations between the time when the data was collected and
when it is analyzed.
Human / hardware / software problems
Noisy data (incorrect values) may come from
Faulty data collection instruments
Human or computer error at data entry
Errors in data transmission
Inconsistent data may come from
Different data sources
Functional dependency violation (e.g., modify some linked data)
Duplicate records also need data cleaning
Why Is Data Pre-processing Important?
Data pre-processing is important because:
No quality data, no quality mining results!
Quality decisions must be based on quality data
e.g., duplicate or missing data may cause incorrect or even misleading statistics.
A data warehouse needs consistent integration of quality data
Data extraction, cleaning, and transformation comprise the majority of the work
of building a data warehouse
DATA CLEANING
Importance of Data Cleaning
“Data cleaning is one of the three biggest problems in data warehousing”
“Data cleaning is the number one problem in data warehousing”—DCI survey
Data cleaning tasks are:
Filling in missing values
Identifying outliers and smoothing out noisy data
Correcting inconsistent data
Resolving redundancy caused by data integration
Explanation of Data Cleaning
Missing Data
Eg. Missing customer income attribute in the sales data
Methods of handling missing values:
a) Ignore the tuple
1) Done when the tuple's class label is missing or the attribute with
missing values does not contribute to any of the classes.
2) Effective only when the tuple has many attributes with missing
values.
3) Not effective when only a few of the attribute values are missing
in a tuple.
b) Fill in the missing value manually
1) This method is time consuming
2) It is not efficient
3) The method is not feasible for large data sets with many missing values
c) Use of a Global constant to fill in the missing value
1) This means filling with “Unknown” or “Infinity”
2) This method is simple
3) This is not recommended generally
d) Use the attribute mean to fill in the missing value
That is, take the average of all existing income values and fill in the missing
income value.
e) Use the attribute mean of all samples belonging to the same class as that of
the given tuple.
Say, there is a class “Average income” and the tuple with the missing value
belongs to this class and then the missing value is the mean of all the values
in this class.
f) Use the most probable value to fill in the missing value
This method uses inference based tools like Bayesian Formula, Decision tree
etc.
Data Cleaning in Data Mining
The quality of your data is critical to the final analysis. Any data that is incomplete, noisy or
inconsistent can affect your results.
Data cleaning in data mining is the process of detecting and removing corrupt or inaccurate
records from a record set, table or database.
Some data cleaning methods:
1 You can ignore the tuple. This is done when the class label is missing. This method is not
very effective unless the tuple contains several attributes with missing values.
2 You can fill in the missing value manually. This approach is effective on a small data set
with few missing values.
3 You can replace all missing attribute values with a global constant, such as a label like
“Unknown” or minus infinity.
4 You can use the attribute mean to fill in the missing value. For example, if the customers'
average income is 25000, you can use this value to replace a missing income value.
5 Use the most probable value to fill in the missing value.
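A minimal sketch of methods 3 and 4 (global constant and attribute mean) using pandas, assuming a small table with one missing income value; the column names are illustrative.

```python
import pandas as pd

# Illustrative customer records with a missing income value
df = pd.DataFrame({"cust_id": [1, 2, 3, 4],
                   "income": [20000, 30000, None, 25000]})

# Method 3: fill the gap with a global constant label
filled_const = df["income"].fillna("Unknown")

# Method 4: fill the gap with the attribute mean (here 25000)
filled_mean = df["income"].fillna(df["income"].mean())
print(filled_mean.tolist())   # [20000.0, 30000.0, 25000.0, 25000.0]
```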
Noisy Data
Noise is a random error or variance in a measured variable. Noisy data may be due to faulty
data collection instruments, data entry problems and technology limitations.
How to Handle Noisy Data?
Binning:
Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the
values around it. The sorted values are distributed into a number of “buckets,” or bins.
For example
Price = 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bin a: 4, 8, 15
Bin b: 21, 21, 24
Bin c: 25, 28, 34
In this example, the data for price are first sorted and then partitioned into equal-frequency
bins of size 3.
Smoothing by bin means:
Bin a: 9, 9, 9
Bin b: 22, 22, 22
Bin c: 29, 29, 29
In smoothing by bin means, each value in a bin is replaced by the mean value of the bin.
Smoothing by bin boundaries:
Bin a: 4, 4, 15
Bin b: 21, 21, 24
Bin c: 25, 25, 34
In smoothing by bin boundaries, each bin value is replaced by the closest boundary value.
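The worked example above can be reproduced with a few lines of Python; the bin size of 3 follows the example, and rounding the bin means is a simplification.

```python
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]   # equal-frequency bins of size 3

# Smoothing by bin means: replace every value with the mean of its bin
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: replace every value with its closest bin boundary
by_boundaries = [[min(b) if x - min(b) <= max(b) - x else max(b) for x in b] for b in bins]

print(bins)            # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(by_means)        # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_boundaries)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```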
Regression
Data can be smoothed by fitting the data to a regression function.
Clustering:
Outliers may be detected by clustering, where similar values are organized into groups, or
“clusters.” Values that fall outside of the set of clusters may be considered outliers.
DATA INTEGRATION
i) Data Integration
- Combines data from multiple sources into a single store
- Sources include multiple databases, data cubes or flat files
ii) Schema Integration
- Integrates metadata from different sources
- Eg. A.cust_id = B.cust_no
iii) Entity Identification Problem
- Identify real-world entities from different data sources
- Eg. A pay-type field in one data source can take the values 'H' or 'S',
whereas in another data source it can take the values 1 or 2
iv) Detecting and resolving data value conflicts:
- For the same real-world entity, the attribute value can be different
in different data sources
- Possible reasons: different interpretations, different representations
and different scaling
- Eg. Sales amount represented in Dollars (USD) in one data source
and in Pounds (£) in another data source
v) Handling Redundancy in data integration:
- When we integrate multiple databases, data redundancy can occur
- Object Identification – the same attributes / objects in different data
sources may have different names
- Derivable Data – an attribute in one data source may be derived from
attribute(s) in another data source
Eg. Monthly_revenue in one data source and Annual_revenue
in another data source
- Such redundant attributes can be detected using correlation analysis
- So, careful integration of data from multiple sources can help in
reducing or avoiding data redundancy and inconsistency, which will
in turn improve mining speed and quality.
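A minimal sketch of detecting a derivable (redundant) attribute with correlation analysis, assuming NumPy; the Monthly_revenue and Annual_revenue columns here are invented toy data echoing the example above.

```python
import numpy as np

# Illustrative attributes from two sources: one is derivable from the other
monthly_revenue = np.array([10.0, 12.0, 9.0, 15.0, 11.0])
annual_revenue = monthly_revenue * 12          # exactly derivable, hence fully redundant

# Pearson correlation coefficient between the two attributes
r = np.corrcoef(monthly_revenue, annual_revenue)[0, 1]
print(round(r, 3))   # 1.0 -- a correlation near +/-1 suggests a redundant attribute
```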
DATA INTEGRATION IN DATA MINING
Data Integration is a data preprocessing technique that combines data from multiple sources
and provides users a unified view of these data.
Data Integration
These sources may include multiple databases, data cubes, or flat files. One of the most
well-known implementations of data integration is building an enterprise's data warehouse.
The benefit of a data warehouse is that it enables a business to perform analyses based on
the data in the data warehouse.
There are mainly 2 major approaches to data integration:
1 Tight Coupling
In tight coupling, data is combined from different sources into a single physical location
through the process of ETL – Extraction, Transformation and Loading.
2 Loose Coupling
In loose coupling, data remains only in the actual source databases. In this approach, an
interface is provided that takes a query from the user, transforms it into a form the source
database can understand, and then sends the query directly to the source databases to
obtain the result.
DATA TRANSFORMATION
Smoothing:- Removes noise from the data
Aggregation:- Summarization, Data cube Construction
Generalization:- Concept Hierarchy climbing
Attribute / Feature Construction:- New attributes constructed from the given ones
Normalization:- Data scaled to fall within a specified range
- min-max normalization
- z-score normalization
- normalization by decimal scaling
In data transformation process data are transformed from one format to another format, that
is more appropriate for data mining.
Some Data Transformation Strategies:
1 Smoothing
Smoothing is the process of removing noise from the data.
2 Aggregation
Aggregation is a process where summary or aggregation operations are applied to the data.
3 Generalization
In generalization, low-level data are replaced with high-level data by concept hierarchy
climbing.
4 Normalization
Normalization scales attribute data so that it falls within a small specified range, such as
0.0 to 1.0.
5 Attribute Construction
In Attribute construction, new attributes are constructed from the given set of attributes.
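A minimal sketch of the three normalization methods listed above (min-max, z-score and decimal scaling), assuming NumPy; the values and the target range [0.0, 1.0] are illustrative.

```python
import numpy as np

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization to the range [0.0, 1.0]
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score normalization: (value - mean) / standard deviation
z_score = (values - values.mean()) / values.std()

# Decimal scaling: divide by 10^j, where j makes the largest |value| fall below 1
j = int(np.ceil(np.log10(np.abs(values).max() + 1)))
decimal_scaled = values / (10 ** j)

print(min_max)         # [0.    0.125 0.25  0.5   1.   ]
print(decimal_scaled)  # [0.02 0.03 0.04 0.06 0.1 ]
```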
DATA DISCRETIZATION AND CONCEPT HIERARCHY GENERATION
Data discretization techniques can be used to divide the range of a continuous attribute into
intervals. Numerous continuous attribute values are replaced by a small number of interval labels.
This leads to a concise, easy-to-use, knowledge-level representation of mining results.
Top-down discretization
If the process starts by first finding one or a few points (called split points or cut points) to split
the entire attribute range, and then repeats this recursively on the resulting intervals, then it is
called top-down discretization or splitting.
Bottom-up discretization
If the process starts by considering all of the continuous values as potential split points and
removes some by merging neighboring values to form intervals, then it is called bottom-up
discretization or merging.
Discretization can be performed rapidly on an attribute to provide a hierarchical partitioning of
the attribute values, known as a concept hierarchy.
Concept hierarchies
Concept hierarchies can be used to reduce the data by collecting and replacing low-level
concepts with higher-level concepts.
In the multidimensional model, data are organized into multiple dimensions, and each
dimension contains multiple levels of abstraction defined by concept hierarchies. This
organization provides users with the flexibility to view data from different perspectives.
Data mining on a reduced data set means fewer input/output operations and is more efficient
than mining on a larger data set.
Because of these benefits, discretization techniques and concept hierarchies are typically
applied before data mining, rather than during mining.
Discretization and Concept Hierarchy Generation for Numerical Data
Typical methods
1 Binning
Binning is a top-down splitting technique based on a specified number of bins. Binning is an
unsupervised discretization technique.
2 Histogram Analysis
Because histogram analysis does not use class information, it is an unsupervised discretization
technique. Histograms partition the values of an attribute into disjoint ranges called buckets.
3 Cluster Analysis
Cluster analysis is a popular data discretization method. A clustering algorithm can be applied
to discretize a numeric attribute A by partitioning the values of A into clusters or groups.
Each initial cluster or partition may be further decomposed into several subclusters, forming a
lower level of the hierarchy.
UNIT - II
Business Analysis
Classes:08
Data warehouse and OLAP technology for data mining, what is a data warehouse, multidimensional data model, data warehouse architecture, data warehouse implementation,
development of data cube technology, data warehousing to data mining; Data preprocessing:
Data summarization, data cleaning, data integration and transformation data reduction,
discretization and concept hierarchy generation.
DATA WAREHOUSE AND OLAP TECHNOLOGY FOR DATA MINING:
Online Analytical Processing (OLAP) is based on the multidimensional data model, which allows
users to extract and view data from different points of view. OLAP data is stored in a
multidimensional database. OLAP is the technology behind many Business Intelligence (BI)
applications.
Using OLAP, a user can create a spreadsheet showing all of a company's products sold in India
in the month of May, compare revenue figures, and so on.
WHAT IS A DATA WAREHOUSE:
The term "Data Warehouse" was first coined by Bill Inmon in 1990. According to Inmon, a
data warehouse is a subject oriented, integrated, time-variant, and non-volatile collection of
data. This data helps analysts to take informed decisions in an organization.
An operational database undergoes frequent changes on a daily basis on account of the
transactions that take place. Suppose a business executive wants to analyze previous
feedback on any data such as a product, a supplier, or any consumer data, then the executive
will have no data available to analyze because the previous data has been updated due to
transactions.
A data warehouse provides us with generalized and consolidated data in a multidimensional
view. Along with this generalized and consolidated view of data, a data warehouse also
provides us with Online Analytical Processing (OLAP) tools. These tools help us in the
interactive and effective analysis of data in a multidimensional space. This analysis results in
data generalization and data mining.
Data mining functions such as association, clustering, classification and prediction can be
integrated with OLAP operations to enhance interactive mining of knowledge at multiple
levels of abstraction. That is why the data warehouse has now become an important platform
for data analysis and online analytical processing.
Understanding a Data Warehouse
• A data warehouse is a database, which is kept separate from the organization's operational database.
• There is no frequent updating done in a data warehouse.
• It possesses consolidated historical data, which helps the organization to analyze its business.
• A data warehouse helps executives to organize, understand, and use their data to take strategic decisions.
• Data warehouse systems help in the integration of a diversity of application systems.
• A data warehouse system helps in consolidated historical data analysis.
Why a Data Warehouse is Separated from Operational Databases
A data warehouse is kept separate from operational databases for the following reasons −
• An operational database is constructed for well-known tasks and workloads such as searching particular records, indexing, etc. In contrast, data warehouse queries are often complex and present a general form of data.
• Operational databases support concurrent processing of multiple transactions. Concurrency control and recovery mechanisms are required for operational databases to ensure robustness and consistency of the database.
• An operational database query allows read and modify operations, while an OLAP query needs only read-only access to stored data.
• An operational database maintains current data. On the other hand, a data warehouse maintains historical data.
Data Warehouse Features
The key features of a data warehouse are discussed below −
• Subject Oriented − A data warehouse is subject oriented because it provides information around a subject rather than the organization's ongoing operations. These subjects can be product, customers, suppliers, sales, revenue, etc. A data warehouse does not focus on the ongoing operations; rather, it focuses on modelling and analysis of data for decision making.
• Integrated − A data warehouse is constructed by integrating data from heterogeneous sources such as relational databases, flat files, etc. This integration enhances the effective analysis of data.
• Time Variant − The data collected in a data warehouse is identified with a particular time period. The data in a data warehouse provides information from a historical point of view.
• Non-volatile − Non-volatile means the previous data is not erased when new data is added to it. A data warehouse is kept separate from the operational database, and therefore frequent changes in the operational database are not reflected in the data warehouse.
Note − A data warehouse does not require transaction processing, recovery, and concurrency
controls, because it is physically stored and separate from the operational database.
Data Warehouse Applications
As discussed before, a data warehouse helps business executives to organize, analyze, and
use their data for decision making. A data warehouse serves as a sole part of a
plan-execute-assess "closed-loop" feedback system for enterprise management. Data
warehouses are widely used in the following fields −
• Financial services
• Banking services
• Consumer goods
• Retail sectors
• Controlled manufacturing
DATAWAREHOUSE ARCHITECTURE:
The business analyst gets information from the data warehouse to measure performance and
make critical adjustments in order to win over other business holders in the market. Having a
data warehouse offers the following advantages −
• Since a data warehouse can gather information quickly and efficiently, it can enhance business productivity.
• A data warehouse provides us a consistent view of customers and items; hence, it helps us manage customer relationships.
• A data warehouse also helps in bringing down costs by tracking trends and patterns over a long period in a consistent and reliable manner.
To design an effective and efficient data warehouse, we need to understand and analyze the
business needs and construct a business analysis framework. Each person has different views
regarding the design of a data warehouse. These views are as follows −
• The top-down view − This view allows the selection of relevant information needed for a data warehouse.
• The data source view − This view presents the information being captured, stored, and managed by the operational system.
• The data warehouse view − This view includes the fact tables and dimension tables. It represents the information stored inside the data warehouse.
• The business query view − It is the view of the data from the viewpoint of the end user.
Three-Tier Data Warehouse Architecture
Generally a data warehouse adopts a three-tier architecture. Following are the three tiers of
the data warehouse architecture.
• Bottom Tier − The bottom tier of the architecture is the data warehouse database server. It is the relational database system. We use back-end tools and utilities to feed data into the bottom tier. These back-end tools and utilities perform the Extract, Clean, Load, and Refresh functions.
• Middle Tier − In the middle tier, we have the OLAP server, which can be implemented in either of the following ways.
o By Relational OLAP (ROLAP), which is an extended relational database management system. ROLAP maps operations on multidimensional data to standard relational operations.
o By the Multidimensional OLAP (MOLAP) model, which directly implements multidimensional data and operations.
• Top Tier − This tier is the front-end client layer. This layer holds the query tools and reporting tools, analysis tools and data mining tools.
The following diagram depicts the three-tier architecture of a data warehouse −
Data Warehouse Models
From the perspective of data warehouse architecture, we have the following data warehouse
models −
• Virtual Warehouse
• Data Mart
• Enterprise Warehouse
Virtual Warehouse
The view over an operational data warehouse is known as a virtual warehouse. It is easy to
build a virtual warehouse. Building a virtual warehouse requires excess capacity on
operational database servers.
Data Mart
Data mart contains a subset of organization-wide data. This subset of data is valuable to
specific groups of an organization.
In other words, we can claim that data marts contain data specific to a particular group. For
example, the marketing data mart may contain data related to items, customers, and sales.
Data marts are confined to subjects.
Points to remember about data marts −
• Windows-based or Unix/Linux-based servers are used to implement data marts. They are implemented on low-cost servers.
• The implementation cycle of a data mart is measured in short periods of time, i.e., in weeks rather than months or years.
• The life cycle of a data mart may be complex in the long run, if its planning and design are not organization-wide.
• Data marts are small in size.
• Data marts are customized by department.
• The source of a data mart is a departmentally structured data warehouse.
• Data marts are flexible.
Enterprise Warehouse
• An enterprise warehouse collects all the information and the subjects spanning an entire organization.
• It provides us enterprise-wide data integration.
• The data is integrated from operational systems and external information providers.
• This information can vary from a few gigabytes to hundreds of gigabytes, terabytes or beyond.
Load Manager
This component performs the operations required to extract and load process.
The size and complexity of the load manager varies between specific solutions from one data
warehouse to other.
Load Manager Architecture
The load manager performs the following functions −
• Extracts the data from the source system.
• Fast-loads the extracted data into a temporary data store.
• Performs simple transformations into a structure similar to the one in the data warehouse.
Extract Data from Source
The data is extracted from the operational databases or the external information providers.
Gateways are the application programs used to extract data. A gateway is supported by the
underlying DBMS and allows a client program to generate SQL to be executed at a server.
Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) are examples of
gateways.
Fast Load
• In order to minimize the total load window, the data needs to be loaded into the warehouse in the fastest possible time.
• The transformations affect the speed of data processing.
• It is more effective to load the data into a relational database prior to applying transformations and checks.
• Gateway technology is often not suitable, since gateways tend not to be performant when large data volumes are involved.
Simple Transformations
While loading, it may be required to perform simple transformations. After this has been
completed, we are in a position to do the complex checks. Suppose we are loading the EPOS
sales transactions; we need to perform the following checks:
• Strip out all the columns that are not required within the warehouse.
• Convert all the values to the required data types.
Warehouse Manager
A warehouse manager is responsible for the warehouse management process. It consists of
third-party system software, C programs, and shell scripts.
The size and complexity of warehouse managers varies between specific solutions.
Warehouse Manager Architecture
A warehouse manager includes the following −
• The controlling process
• Stored procedures or C with SQL
• Backup/Recovery tool
• SQL scripts
Operations Performed by Warehouse Manager
• A warehouse manager analyzes the data to perform consistency and referential integrity checks.
• Creates indexes, business views and partition views against the base data.
• Generates new aggregations and updates existing aggregations. Generates normalizations.
• Transforms and merges the source data into the published data warehouse.
• Backs up the data in the data warehouse.
• Archives the data that has reached the end of its captured life.
Note − A warehouse manager also analyzes query profiles to determine whether indexes and
aggregations are appropriate.
Query Manager
• The query manager is responsible for directing the queries to the suitable tables.
• By directing the queries to the appropriate tables, the speed of querying and response generation can be increased.
• The query manager is responsible for scheduling the execution of the queries posed by the user.
Query Manager Architecture
The following diagram shows the architecture of a query manager. It includes the following:
• Query redirection via C tool or RDBMS
• Stored procedures
• Query management tool
• Query scheduling via C tool or RDBMS
• Query scheduling via third-party software
Detailed Information
Detailed information is not kept online, rather it is aggregated to the next level of detail and
then archived to tape. The detailed information part of data warehouse keeps the detailed
information in the starflake schema. Detailed information is loaded into the data warehouse
to supplement the aggregated data.
The following diagram shows a pictorial impression of where detailed information is stored
and how it is used.
MULTI DIMENSIONAL DATA MODEL(SCHEMAS EXPLANATION):
A schema is a logical description of the entire database. It includes the name and description
of records of all record types, including all associated data items and aggregates. Much like a
database, a data warehouse is also required to maintain a schema. A database uses the
relational model, while a data warehouse uses the Star, Snowflake, and Fact Constellation
schemas. In this chapter, we will discuss the schemas used in a data warehouse.
Star Schema
The star schema is the simplest and most fundamental data mart schema. This schema is
widely used to develop or build a data warehouse and dimensional data marts. It includes
one or more fact tables referencing any number of dimension tables. The star schema is a
special case of the snowflake schema. It is also efficient for handling basic queries.
It is called a star schema because its physical model resembles a star, with a fact table at its
center and the dimension tables at its periphery representing the star's points. Below is an
example to demonstrate the star schema:
In the above demonstration, SALES is a fact table having the attributes (Product ID, Order ID,
Customer ID, Employer ID, Total, Quantity, Discount), which reference the dimension tables.
The Employee dimension table contains the attributes: Emp ID, Emp Name, Title, Department
and Region. The Product dimension table contains the attributes: Product ID, Product Name,
Product Category, Unit Price. The Customer dimension table contains the attributes:
Customer ID, Customer Name, Address, City, Zip. The Time dimension table contains the
attributes: Order ID, Order Date, Year, Quarter, Month.
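A minimal sketch of how a star-schema query works: the fact table is joined to the dimension tables it references and then aggregated. The tables below are heavily simplified, made-up versions of the SALES example (column names without spaces), and pandas stands in for the relational engine.

```python
import pandas as pd

# Simplified fact and dimension tables following the SALES example above
sales = pd.DataFrame({"ProductID": [1, 2, 1], "CustomerID": [10, 11, 10],
                      "Quantity": [2, 1, 5], "Total": [40.0, 15.0, 100.0]})
product = pd.DataFrame({"ProductID": [1, 2],
                        "ProductName": ["Pen", "Notebook"],
                        "ProductCategory": ["Stationery", "Stationery"]})
customer = pd.DataFrame({"CustomerID": [10, 11],
                         "CustomerName": ["Asha", "Ravi"],
                         "City": ["Tirupati", "Chennai"]})

# A star-schema query joins the fact table with the needed dimension tables, then aggregates
report = (sales.merge(product, on="ProductID")
               .merge(customer, on="CustomerID")
               .groupby(["City", "ProductName"])["Total"].sum())
print(report)
```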
Model of Star Schema –
In a star schema, business process data that holds the quantitative facts about a business is
stored in fact tables, while dimensions hold the descriptive characteristics related to the fact
data. Sales price, sale quantity, distance, speed, weight, and weight measurements are a few
examples of fact data in a star schema.
Often, a star schema having multiple dimensions is termed a centipede schema. It is easy to
handle a star schema whose dimensions have few attributes.
Advantages of Star Schema –
1. Simpler Queries:
The join logic of a star schema is quite simple compared to the join logic needed to fetch
data from a transactional schema that is highly normalized.
2. Simplified Business Reporting Logic:
Compared to a transactional schema that is highly normalized, the star schema simplifies
common business reporting logic, such as as-of reporting and period-over-period reporting.
3. Feeding Cubes:
The star schema is widely used by OLAP systems to design OLAP cubes efficiently. In fact,
major OLAP systems deliver a ROLAP mode of operation which can use a star schema
as a source without designing a cube structure.
Disadvantages of Star Schema –
1. Data integrity is not enforced well, since the schema is in a highly de-normalized state.
2. Not as flexible in terms of analytical needs as a normalized data model.
3. Star schemas don't reinforce many-to-many relationships within business entities – at
least not frequently.
• Each dimension in a star schema is represented with only one dimension table.
• This dimension table contains the set of attributes.
• The following diagram shows the sales data of a company with respect to the four dimensions, namely time, item, branch, and location.
• There is a fact table at the center. It contains the keys to each of the four dimensions.
• The fact table also contains the attributes, namely dollars sold and units sold.
Note − Each dimension has only one dimension table and each table holds a set of attributes.
For example, the location dimension table contains the attribute set {location_key, street, city,
province_or_state,country}. This constraint may cause data redundancy. For example,
"Vancouver" and "Victoria" both the cities are in the Canadian province of British Columbia.
The entries for such cities may cause data redundancy along the attributes province_or_state
and country.
Snowflake Schema
Introduction: The snowflake schema is a variant of the star schema. Here, the centralized fact
table is connected to multiple dimensions. In the snowflake schema, dimensions are present
in a normalized form in multiple related tables. The snowflake structure materializes when
the dimensions of a star schema are detailed and highly structured, having several levels of
relationship, and the child tables have multiple parent tables. The snowflake effect affects
only the dimension tables and does not affect the fact tables.
Example:
The Employee dimension table now contains the attributes: EmployeeID, EmployeeName,
DepartmentID, Region, Territory. The DepartmentID attribute links the Employee dimension
table with the Department dimension table. The Department dimension is used to provide
detail about each department, such as the Name and Location of the department. The
Customer dimension table now contains the attributes: CustomerID, CustomerName, Address,
CityID. The CityID attribute links the Customer dimension table with the City dimension table.
The City dimension table has details about each city such as CityName, Zipcode, State and
Country.
The main difference between the star schema and the snowflake schema is that the dimension
tables of the snowflake schema are maintained in normalized form to reduce redundancy. The
advantage here is that such normalized tables are easy to maintain and save storage space.
However, it also means that more joins will be needed to execute a query, which will adversely
impact system performance.
What is snowflaking?
The snowflake design is the result of further expansion and normalization of the dimension
tables. In other words, a dimension table is said to be snowflaked if the low-cardinality
attributes of the dimension have been divided into separate normalized tables. These tables
are then joined to the original dimension table with referential constraints (foreign key
constraints).
Generally, snowflaking is not recommended in the dimension table, as it hampers the
understandability and performance of the dimension model, since more tables would be
required to be joined to satisfy the queries.
• Some dimension tables in the snowflake schema are normalized.
• The normalization splits up the data into additional tables.
• Unlike the star schema, the dimension tables in a snowflake schema are normalized. For example, the item dimension table in the star schema is normalized and split into two dimension tables, namely the item and supplier tables.
• Now the item dimension table contains the attributes item_key, item_name, type, brand, and supplier_key.
• The supplier key is linked to the supplier dimension table. The supplier dimension table contains the attributes supplier_key and supplier_type.
Note − Due to normalization in the snowflake schema, the redundancy is reduced; therefore,
it becomes easy to maintain and saves storage space.
Fact Constellation Schema
• A fact constellation has multiple fact tables. It is also known as a galaxy schema.
• The following diagram shows two fact tables, namely sales and shipping.
• The sales fact table is the same as that in the star schema.
• The shipping fact table has five dimensions, namely item_key, time_key, shipper_key, from_location, to_location.
• The shipping fact table also contains two measures, namely dollars sold and units sold.
• It is also possible to share dimension tables between fact tables. For example, the time, item, and location dimension tables are shared between the sales and shipping fact tables.
Schema Definition
Multidimensional schema is defined using Data Mining Query Language (DMQL). The two
primitives, cube definition and dimension definition, can be used for defining the data
warehouses and data marts.
DEVELOPMENT OF DATA CUBE TECHNOLOGY
Data Cube
A data cube helps us represent data in multiple dimensions. It is defined by dimensions and
facts. The dimensions are the entities with respect to which an enterprise preserves the
records.
Illustration of Data Cube
Suppose a company wants to keep track of sales records with the help of a sales data warehouse
with respect to time, item, branch, and location. These dimensions allow the company to keep track
of monthly sales and of the branch at which the items were sold. There is a table associated with each
dimension. This table is known as a dimension table. For example, the "item" dimension table may
have attributes such as item_name, item_type, and item_brand.
The following table represents the 2-D view of Sales Data for a company with respect to time,
item, and location dimensions.
But here in this 2-D table, we have records with respect to time and item only. The sales for
New Delhi are shown with respect to time, and item dimensions according to type of items
sold. If we want to view the sales data with one more dimension, say, the location dimension,
then the 3-D view would be useful. The 3-D view of the sales data with respect to time, item,
and location is shown in the table below −
The above 3-D table can be represented as 3-D data cube as shown in the following figure −
Data Mart
Data marts contain a subset of organization-wide data that is valuable to specific groups of
people in an organization. In other words, a data mart contains only those data that is specific
to a particular group. For example, the marketing data mart may contain only data related to
items, customers, and sales. Data marts are confined to subjects.
Points to Remember About Data Marts
•
Windows-based or Unix/Linux-based servers are used to implement data marts. They
are implemented on low-cost servers.
•
The implementation cycle of a data mart is measured in short periods of time, i.e., in
weeks rather than months or years.
•
The life cycle of data marts may be complex in the long run, if their planning and design
are not organization-wide.
•
Data marts are small in size.
•
Data marts are customized by department.
•
The source of a data mart is a departmentally structured data warehouse.
•
Data marts are flexible.
The following figure shows a graphical representation of data marts.
Virtual Warehouse
The view over an operational data warehouse is known as virtual warehouse. It is easy to build
a virtual warehouse. Building a virtual warehouse requires excess capacity on operational
database servers.
DATA SUMMARIZATION
Data summarization is the process of condensing a large volume of data into a short, meaningful
result: you run your analysis or code over the full data set and, at the end, report the outcome in a
summarized form. Data summarization is of great importance in data mining, since programmers and
developers increasingly work with big data. Declaring a final result used to be difficult, but today
there are many tools available that can be used in a program, or wherever they are needed, to
summarize data.
Why Data Summarization?
Why do we need summarization in the mining process? We live in a digital world where data is
transferred in seconds, far faster than any human can follow. In the corporate field, employees work
on huge volumes of data derived from different sources such as social networks, media, newspapers,
books, cloud storage, etc. This can make it difficult to summarize the data. The volume is often
unexpected: when you retrieve data from relational sources you cannot predict how much data will
be stored in the database.
As a result, the data becomes more complex and takes time to summarize. One practical solution is
to retrieve data by category, i.e., to apply filtration when you retrieve it. The data summarization
technique then gives a good-quality summary of the data, from which a customer or user can benefit
in their research. Spreadsheet tools such as Excel are commonly used for simple data summarization.
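As a small programmatic illustration (not tied to any particular tool named above), the sketch below shows how a category-wise summary could be produced with pandas; the sales DataFrame and its column names are invented for the example.

# A minimal sketch of data summarization with pandas.
# The 'sales' DataFrame and its column names are hypothetical.
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "item":   ["TV", "Phone", "TV", "Phone"],
    "amount": [1200, 800, 950, 700],
})

# Overall statistical summary of the numeric column.
print(sales["amount"].describe())

# Category-wise summary ("retrieve data by category"): totals per region.
summary = sales.groupby("region")["amount"].agg(["sum", "mean", "count"])
print(summary)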
DATA CLEANING:
"Data cleaning is the number one problem in data warehousing."
Data quality is an essential characteristic that determines the reliability of data for making
decisions. High-quality data is:
1. Complete: All relevant data, such as accounts, addresses and relationships for a given
customer, is linked.
2. Accurate: Common data problems like misspellings, typos, and random
abbreviations have been cleaned up.
3. Available: Required data are accessible on demand; users do not need to
search manually for the information.
4. Timely: Up-to-date information is readily available to support decisions.
In general, data quality is defined as an aggregated value over a set of quality criteria. To measure
the quality of a data collection, a score has to be assessed for each of the quality criteria affected by
comprehensive data cleansing. These scores can be used to quantify the necessity of data cleansing
for a data collection as well as the success of a performed data cleansing process. Quality criteria can
also be used within the optimization of data cleansing by specifying priorities for each of the criteria,
which in turn influences the execution of the data cleansing methods affecting the specific criteria.
Data cleaning routines work to "clean" the data by filling in missing values, smoothing noisy
data, identifying or removing outliers, and resolving inconsistencies. The actual process of data
cleansing may involve removing typographical errors or validating and correcting values
against a known list of entities. The validation may be strict. Data cleansing differs from data
validation in that validation almost invariably means data is rejected from the system at entry
and is performed at entry time, rather than on batches of data. Data cleansing may also
involve activities like harmonization of data and standardization of data. For example,
harmonization of short codes (St, Rd) to actual words (Street, Road). Standardization of data is a
means of changing a reference data set to a new standard, e.g., use of standard codes.
The major data cleaning tasks include
1. Identify outliers and smooth out noisy data
2. Fill in missing values
3.Correct inconsistent data
4. Resolve redundancy caused by data integration
Among these tasks, missing values cause inconsistencies for data mining. To
overcome these inconsistencies, handling the missing values is a good solution.
In the medical domain, missing data might occur because the value is not relevant
to a particular case, could not be recorded when the data was collected, is
ignored by users because of privacy concerns, or because it is unfeasible for the
patient to undergo the clinical tests, because of equipment malfunctioning, etc. Methods for
resolving missing values are therefore needed in health care systems to enhance
the quality of diagnosis. The following sections describe the proposed data
cleaning methods.
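A minimal sketch of such routine cleaning steps with pandas is given below; the patients table, its column names and the chosen rules (mean imputation, an age cut-off, title-casing city names) are only illustrative assumptions, not a prescribed method.

import numpy as np
import pandas as pd

# Hypothetical, deliberately dirty table.
patients = pd.DataFrame({
    "age":  [25, np.nan, 47, 47, 300],                  # a missing value and an outlier
    "city": ["St Louis", "st. louis", "Chicago", "Chicago", "Chicago"],
})

# 1. Fill in the missing numeric value with the attribute mean.
patients["age"] = patients["age"].fillna(patients["age"].mean())

# 2. Remove an obvious outlier (an implausible age).
patients = patients[patients["age"] <= 120]

# 3. Harmonize inconsistent text representations of the same entity.
patients["city"] = patients["city"].str.title().str.replace(".", "", regex=False)

# 4. Resolve redundancy (duplicate records) caused by integration.
patients = patients.drop_duplicates()
print(patients)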
Handling Missing Values
The method used to treat missing values plays an important role in data
preprocessing. Missing data is a common problem in statistical analysis. The
tolerance level of missing data is classified as:
Missing Value (Percentage) – Significance
Up to 1%                   – Trivial
1-5%                       – Manageable
5-15%                      – Sophisticated methods needed to handle
More than 15%              – Severe impact on interpretation
Several methods have been proposed in the literature to treat missing data. Those
methods are divided into three categories, as proposed by Dempster et al. [1977].
The different patterns of missing values are discussed in the next section.
4.4.2.1 Patterns of missing values
Missing values in a database fall into three categories, viz., Missing
Completely at Random (MCAR), Missing at Random (MAR) and Non-Ignorable
(NI)
Missing Completely at Random (MCAR)
This is the highest level of randomness. It occurs when the probability of an
instance (case) having a missing value for an attribute does not depend on either
the known values or the missing value itself; the missing values are randomly
distributed across all observations. This is not a realistic assumption for many real-world data.
Missing at Random (MAR)
Here the missingness does not depend on the true value of the missing
variable, but it might depend on the values of other variables that are observed.
This pattern occurs when missing values are not randomly distributed across all
observations, but are randomly distributed within one or more sub-samples.
Non-Ignorable (NI)
NI exists when missing values are not randomly distributed across
observations. If the probability that a cell is missing depends on the unobserved
value of the missing response, then the process is non-ignorable.
In the next section the theoretical framework for handling missing values is discussed.
DATA INTEGRATION
Data Integration
- Combines data from multiple sources into a single store.
- Includes multiple databases, data cubes or flat files
Schema integration
- Integrates meta data from different sources
- Eg. A.cust_id = B.cust_no
Entity Identification Problem
- Identify real world entities from different data sources
- Eg. The Pay_type field in one data source can take the values 'H' or 'S', Vs
in another data source it can take the values 1 or 2
Detecting and resolving data value conflicts:
- For the same real world entity, the attribute value can be different
in different data sources
- Possible reasons can be - different interpretations,
different representations and different scaling
- Eg. Sales amount represented in Dollars ($) in one data
source and in Pounds (£) in another data source.
Handling Redundancy in data integration:
- When we integrate multiple databases data redundancy occurs
- Object Identification – Same attributes / objects in different data
sources may have different names.
- Derivable Data – Attribute in one data source may be derived from
Attribute(s) in another data source
Eg. Monthly_revenue in one data source and Annual revenue
in another data source.
- Such redundant attributes can be detected using Correlation Analysis
- So, Careful integration of data from multiple sources can help in
reducing or avoiding data redundancy and inconsistency which will
in turn improve mining speed and quality.
Correlation Analysis – Numerical Data:
- Formula for Correlation Co-efficient (Pearson's Product Moment Co-efficient):
  $r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$
- Where, n = No. of Tuples; $\bar{A}$ and $\bar{B}$ are respective means of A & B;
  σA and σB are respective standard deviations of A & B;
  Σ(AB) is the sum of the cross product of A & B
- If the correlation co-efficient between the attributes A & B is
positive then they are positively correlated.
- That is if A’s value increases, B’s value also increases.
- As the correlation co-efficient value increases, the stronger
the correlation.
- If the correlation co-efficient between the attributes A & B is zero
then they are independent attributes.
- If the correlation co-efficient value is negative then they are negatively
Correlated.
(Example figures: Positive Correlation, Negative Correlation, No Correlation)
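The sketch below computes the correlation co-efficient directly from the definition above and checks it against NumPy's built-in function; the two data vectors are made up for illustration.

import numpy as np

A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])      # values of attribute A (invented)
B = np.array([1.0, 3.0, 5.0, 7.0, 11.0])      # values of attribute B (invented)

n = len(A)
# Pearson's product moment co-efficient, using sample standard deviations.
r_AB = ((A - A.mean()) * (B - B.mean())).sum() / ((n - 1) * A.std(ddof=1) * B.std(ddof=1))
print(r_AB)                      # close to +1 -> strong positive correlation
print(np.corrcoef(A, B)[0, 1])   # library result, for comparison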
Correlation Analysis – Categorical Data:
- Applicable for data where values of each attribute are divided
into different categories.
- Use Chi-Square Test (using the below formula):
  $\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$
- The higher the value of Χ², the more strongly the attributes are related.
- The cells that contribute the most to the value of Χ² are the ones whose
Observed frequency is very different from their Expected frequency.
- The Expected frequency is calculated using the data distribution in the
two categories of the attributes.
- Consider there are two Attributes A & B; the values of A are categorized into
category Ai and Aj; the values of B are categorized into category Bi and Bj
- The expected frequency of Ai and Bj =
Eij = (Count(Ai) * Count(Bj)) / N
                             Play chess   Not play chess   Sum (row)
Like science fiction         250 (90)     200 (360)        450
Not like science fiction     50 (210)     1000 (840)       1050
Sum (col.)                   300          1200             1500
- Eg. Consider a sample population of 1500 people who are surveyed to see
whether they Play Chess or not and whether they Like Science Fiction books or not.
- The counts given within parenthesis are expected frequency and
the remaining one is the observed frequency.
- For Example the Expected frequency for the cell (Play Chess, Like Science
Fiction) is:
= (Count (Play Chess) * Count (Like Science Fiction)) / Total sample
population
= (300 * 450) / 1500 = 90
$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$
- This shows that the categories Play Chess and Like Science Fiction are
strongly correlated.
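The computation above can be reproduced with a few lines of NumPy, as sketched below; only the observed counts from the example table are given, and the expected counts are derived from the row and column totals.

import numpy as np

observed = np.array([[250,  200],     # like science fiction:     play chess, not play chess
                     [ 50, 1000]])    # not like science fiction: play chess, not play chess

row_totals = observed.sum(axis=1, keepdims=True)    # 450, 1050
col_totals = observed.sum(axis=0, keepdims=True)    # 300, 1200
total = observed.sum()                              # 1500

expected = row_totals * col_totals / total          # [[90, 360], [210, 840]]
chi_square = ((observed - expected) ** 2 / expected).sum()
print(round(chi_square, 2))                         # about 507.93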
DATA REDUCTION
Why Data Reduction?
- A database or data warehouse may store terabytes of data
- Complex data analysis or mining will take a long time to run on the
complete data set
What is Data Reduction?
- Obtaining a reduced representation of the complete dataset
- Produces the same (or almost the same) mining / analytical results as those
of the original.
Data Reduction Strategies:
1. Data cube Aggregation
2. Dimensionality reduction – remove unwanted attributes
3. Data Compression
4. Numerosity reduction – Fit data into mathematical models
5. Discretization and Concept Hierarchy Generation
1. Data Cube Aggregation:
- The lowest level of data cube is called as base cuboid.
- Single Level Aggregation - Select a particular entity or attribute and
Aggregate based on that particular attribute.
Eg. Aggregate along ‘Year’ in a Sales data.
- Multiple Level of Aggregation – Aggregates along multiple attributes –
Further reduces the size of the data to
analyze.
-
When a query is posed by the user, use the appropriate level of
Aggregation or data cube to solve the task
-
Queries regarding aggregated information should be answered
using the data cube whenever possible.
2. Attribute Subset Selection
Feature Selection: (attribute subset selection)
-
The goal of attribute subset selection is to find the minimum set of
Attributes such that the resulting probability distribution of data
classes is as close as possible to the original distribution obtained
using all Attributes.
-
This will help to reduce the number of patterns produced and
those patterns will be easy to understand
Heuristic Methods: (Due to exponential number of attribute choices)
-
Step wise forward selection
-
Step wise backward elimination
-
Combining forward selection and backward elimination
-
Decision Tree induction - Class 1 - A1, A5, A6; Class 2 - A2, A3, A4
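As an illustration of the step-wise forward selection idea listed above, the sketch below greedily adds one attribute at a time; the scoring function (absolute correlation with the class) is only a stand-in assumption, where a real system would use an attribute evaluation measure such as information gain, and the data are invented.

import numpy as np

def forward_selection(X, y, n_select):
    """Greedily pick n_select attribute (column) indices of X."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < n_select:
        # Score each remaining candidate attribute on its own (stand-in measure).
        scores = {a: abs(np.corrcoef(X[:, a], y)[0, 1]) for a in remaining}
        best = max(scores, key=scores.get)      # best attribute of this step
        selected.append(best)
        remaining.remove(best)
    return selected

X = np.array([[1, 5, 9], [2, 3, 7], [3, 8, 5], [4, 1, 3]], dtype=float)  # 4 tuples, 3 attributes
y = np.array([0, 0, 1, 1], dtype=float)                                  # class labels
print(forward_selection(X, y, n_select=2))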
3. Data Compression
- Compressed representation of the original data.
- This data reduction is called as Lossless if the original data can be
reconstructed from the compressed data without any loss of information.
- The data reduction is called as Lossy if only an approximation of the original
data can be reconstructed.
- Two Lossy Data Compression methods available are:
o
Wavelet Transforms
o Principal Components Analysis
3.1 Discrete Wavelet Transform (DWT):
-
Is a linear Signal processing technique
-
It transforms the data vector X into a numerically different vector X’.
-
These two vectors are of the same length.
-
Here each tuple is an n-dimensional data vector.
-
X = {x1,x2,…xn} n attributes
-
This wavelet transform data can be truncated.
-
Compressed Approximation: Stores only small fraction of strongest of
wavelet coefficients.
-
Apply inverse of DWT to obtain the original data approximation.
-
Similar to discrete Fourier transforms (Signal processing technique
involving Sines and Cosines)
-
DWT uses Hierarchical Pyramid Algorithm
o Fast computational speed
o Halves the data in each iteration
1. The length of the data vector should be an integer power of two.
(Padding with zeros can be done if required)
2. Each transform applies two functions:
a. Smoothing – sum / weighted average
b. Difference – weighted difference
3. These functions are applied to pairs of data so that two sets of
data of length L/2 is obtained.
4.
Applies these two transforms iteratively until a user desired
data length is obtained.
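A simplified sketch of this hierarchical pyramid idea, using the (unnormalized) Haar wavelet, is given below; the input vector is invented, and real DWT implementations differ in normalization and boundary handling.

import numpy as np

def haar_transform(x):
    """Repeated pairwise smoothing and differencing, halving the data each pass."""
    x = np.asarray(x, dtype=float)
    assert (len(x) & (len(x) - 1)) == 0, "length must be a power of two (pad with zeros)"
    coefficients = []
    while len(x) > 1:
        smooth = (x[0::2] + x[1::2]) / 2.0     # smoothing: pairwise average
        detail = (x[0::2] - x[1::2]) / 2.0     # difference: pairwise weighted difference
        coefficients.append(detail)            # keep the detail (wavelet) coefficients
        x = smooth                             # recurse on the smoothed half (length L/2)
    coefficients.append(x)                     # the final overall average
    return coefficients

print(haar_transform([2, 2, 0, 2, 3, 5, 4, 4]))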
3.2 Principal Components Analysis (PCA):
- Say, data to be compressed consists of N tuples and k attributes.
- Tuples can be called as Data vectors and attributes can be called as dimensions.
- So, data to be compressed consists of N data vectors each having k-dimensions.
- Consider a number c which is much smaller than N. That is, c << N.
- PCA searches for c orthogonal vectors that have k dimensions and that can
best be used to represent the data.
- Thus data is projected to a smaller space and hence compressed.
- In this process PCA also combines the essence of existing attributes and
produces a smaller set of attributes.
- Initial data is then projected on to this smaller attribute set.
Basic Procedure:
1. Input data Normalized. All attributes values are mapped to the same range.
2. Compute k orthonormal vectors, called principal components.
These are unit vectors perpendicular to each other.
Thus the input data is a linear combination of the principal components.
3. The Principal Components are ordered in decreasing order of
"Significance" or strength.
4. The size of the data can be reduced by eliminating the components with less
"Significance", i.e. the weaker components are removed. Thus the
strongest Principal Components can be used to reconstruct a good
approximation of the original data.
- PCA can be applied to ordered & unordered attributes, sparse and skewed data.
- It can also be applied on multi dimensional data by reducing the same into
2 dimensional data.
- Works only for numeric data.
(Figure: Principal Component Analysis – principal components Y1 and Y2 for data plotted on the original axes X1 and X2)
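A compact sketch of the basic procedure with NumPy is shown below; the 2-dimensional toy data set is invented, and centring is used as the normalization step.

import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])   # N=8 tuples, k=2 attributes

# 1. Normalize: centre each attribute on its mean.
Xc = X - X.mean(axis=0)

# 2. Principal components = eigenvectors of the covariance matrix (orthonormal vectors).
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Order the components by decreasing significance (eigenvalue).
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order]

# 4. Keep only the strongest component -> reduced (compressed) representation.
reduced = Xc @ components[:, :1]
print(reduced.shape)    # (8, 1)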
4. Numerosity Reduction
- Reduces the data volume by choosing smaller forms of data representations.
- Two types – Parametric, Non-Parametric.
- Parametric – Data estimated into a model
– only the data parameters stored and not the actual data.
-
Stored data includes outliers also
-
Eg. Log-Linear Models
- Non-Parametric – Does not fit data into models
- Eg. Histograms, Clustering and Sampling
Regression and Log-Linear Models:
- Linear Regression - data are modeled to fit in a straight line.
- That is, the data can be modeled with the mathematical equation: Y = α + βX
- Where Y is called the "Response Variable" and X is called the "Predictor Variable".
- Alpha (α) and Beta (β) are called the regression coefficients.
- Alpha is the Y-intercept and Beta is the slope of the line.
- These regression coefficients can be solved for by using the "method of least squares".
- Multiple Regression – Extension of linear regression
– Response variable Y is modeled as a multidimensional vector.
- Log-Linear Models: Estimates the probability of each cell in a base cuboid for a set
of discretized attributes.
- In this higher order data cubes are constructed from lower ordered data cubes.
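The least-squares fit of Y = α + βX can be written in a few lines, as sketched below with invented data; only α and β then need to be stored instead of the raw tuples.

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # predictor variable (invented)
Y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])      # response variable (invented)

# Method of least squares for the regression coefficients.
beta = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
alpha = Y.mean() - beta * X.mean()
print(alpha, beta)                            # stored model parameters
print(alpha + beta * 6.0)                     # estimated Y for a new X value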
Histograms:
- Uses binning to distribute the data.
- Histogram for an attribute A:
- Partitions the data of A into disjoint subsets / buckets.
- Buckets are represented in a horizontal line in a histogram.
- Vertical line of histogram represents frequency of values in bucket.
- Singleton Bucket – Has only one attribute value / frequency pair
- Eg. Consider the list of prices (in $) of the sold items.
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15,
18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25,
25, 25, 25, 25, 28, 28, 30, 30, 30.
- consider a bucket of uniform width, say $10.
-
Methods of determining the bucket / partitioning the attribute values:
o Equi-Width: Width of each bucket is a constant
o Equi-Depth: Frequency of each bucket is a constant
o V-Optimal: Histogram with the least variance
Histogram Variance = Weighted sum of values in each bucket
Bucket Weight = Number of values in the bucket.
o MaxDiff: Find the difference between each pair of adjacent values. Bucket
boundaries are placed between the pairs having the β-1 largest
differences, where β (the number of buckets) is user specified.
o V-Optimal & MaxDiff are most accurate and practical.
- Histograms can be extended for multiple attributes – Multidimensional histograms –
which can capture dependencies between attributes.
- Histograms of up to five attributes have been found to be effective so far.
- Singleton buckets are useful for storing outliers with high frequency.
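The equi-width bucketing of the price list above (bucket width $10) can be sketched as below; only the bucket boundaries and frequencies would be stored in place of the raw values.

import numpy as np

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15,
          18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25,
          25, 25, 25, 25, 28, 28, 30, 30, 30]

counts, edges = np.histogram(prices, bins=[1, 11, 21, 31])   # three $10-wide buckets
for low, high, count in zip(edges[:-1], edges[1:], counts):
    print(f"bucket {low}-{high}: {count} values")            # the reduced representation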
4.3 Clustering:
- Considers data tuples as objects.
- Partition objects into clusters.
- Objects within a cluster are similar to one another and the objects in
different clusters are dissimilar.
-
Quality of a cluster is represented by its ‘diameter’
– maximum distance between any two objects in a cluster.
-
Another measure of cluster quality = Centroid Distance = Average distance of
each cluster object from the cluster centroid.
-
The cluster representation of the data can be used to replace the actual data
-
Effectiveness depends on the nature of the data.
-
Effective for data that can be categorized into distinct clusters.
-
Not effective if data is ‘smeared’.
-
Can also have hierarchical clustering of data
-
For faster data access in such cases we use multidimensional index trees.
-
There are many choices of clustering definitions and algorithms available.
Diagram for clustering
4.4 Sampling:
- Can be used as a data reduction technique.
- Selects random sample or subset of data.
- Say large dataset D contains N tuples.
1. Simple Random Sample WithOut Replacement (SRSWOR) of size n:
- Draw n tuples from the original N tuples in D, where n<N.
- The probability of drawing any tuple in D is 1/N. That is all tuples have
equal chance
2. Simple Random Sample With Replacement (SRSWR) of size n:
- Similar to SRSWOR, except that each time when a tuple is drawn from
D it is recorded and replaced.
- After a tuple is drawn it is placed back in D so that it can be drawn again.
3. Cluster Sample:
- Tuples in D are grouped into M mutually disjoint clusters.
- Apply SRS (SRSWOR / SRSWR) to each cluster of tuples.
- Each page of data fetching of tuples can be considered as a cluster.
4. Stratified Sample:
- D is divided into mutually disjoint strata.
- Apply SRS (SRSWOR / SRSWR) to each Stratum of tuples.
- In this way the group having the smallest number of tuples is
also represented.
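The four sampling schemes can be sketched with NumPy's random generator as below; D is a hypothetical set of 100 tuple ids, the page size of 20 and the strata are invented, and the seed is fixed only for reproducibility.

import numpy as np

rng = np.random.default_rng(seed=42)
D = np.arange(1, 101)                                   # 100 tuples, identified by id

srswor = rng.choice(D, size=10, replace=False)          # SRSWOR: no tuple drawn twice
srswr  = rng.choice(D, size=10, replace=True)           # SRSWR: a tuple may be drawn again

# Cluster sample: group tuples into "pages" of 20 and draw whole clusters.
clusters = D.reshape(5, 20)
cluster_sample = clusters[rng.choice(5, size=2, replace=False)].ravel()

# Stratified sample: draw from each stratum so small groups are also represented.
strata = {"young": D[:70], "senior": D[70:]}
stratified = np.concatenate(
    [rng.choice(s, size=max(1, len(s) // 10), replace=False) for s in strata.values()])
print(len(srswor), len(srswr), len(cluster_sample), len(stratified))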
DATA DISCRETIZATION AND CONCEPT HIERARCHY GENERATION:
Data Discretization Technique:
-
Reduces the number of attribute values
-
Divides the attribute values into intervals
-
Interval labels are used to replace the attribute values
-
Result – Data easy to use, concise, Knowledge level representation of data
Types of Data Discretization Techniques:
1. Supervised Discretization
a. Uses Class information of the data
2. Unsupervised Discretization
a. Does not use Class information of the data
3. Top-down Discretization (splitting)
a. Identifies 'Split-Points' or 'Cut-Points' in data values
b. Splits attribute values into intervals at split-points
c. Repeats recursively on resulting intervals
d. Stops when the specified number of intervals is reached or some stop criterion is
reached.
4. Bottom-up Discretization (merging)
a. Divide the attribute values into intervals where each interval has a distinct
attribute value.
b. Merge two intervals based on some merging criteria
c. Repeats recursively on resulting intervals
d. Stops when the specified number of intervals is reached or some stop criterion is
reached.
Discretization results in – Hierarchical Partitioning of Attributes = Called as Concept
Hierarchy
Concept Hierarchy used for Data Mining at multiple levels of abstraction.
Eg. For Concept Hierarchy – Numeric values for the attribute Age can be replaced with
the class labels ‘Youth’, ‘Middle Aged’ and ‘Senior’
Discretization and Concept Hierarchy are pre-processing steps for Data Mining
For a Single Attribute multiple Concept Hierarchies can be produced to meet various user
needs.
Manual Definition of concept Hierarchies by Domain experts is a tedious and time
consuming task.
Automated discretization methods are available.
Some Concept hierarchies are implicit at the schema definition level and are defined
when the schema is being defined by the domain experts.
Eg of Concept Hierarchy using attribute ‘Age’
Interval denoted by (Y,X]: Value Y (exclusive) and Value X (inclusive)
Discretization and Concept Hierarchy Generation for Numeric Data:
-
Concept Hierarchy generation for numeric data is a difficult and tedious task, as numeric attributes
have a wide range of data values and undergo frequent updates in any database.
-
Automated Discretization Methods:
o Binning
o Histogram analysis
o Entropy based Discretization Method
o X2 – Merging (Chi-Merging)
o Cluster Analysis
o Discretization by Intuition Partitioning
- These methods assume the data is in sorted order
Binning:
-
Top-Down Discretization Technique Used
-
Unsupervised Discretization Technique – No Class Information Used
-
User specified number of bins is used.
-
Same technique as used for Smoothing and Numerosity reduction
-
Data Discretized using Equi-Width or Equi-Depth method
-
Replace each bin value by bin mean or bin median.
-
Same technique applied recursively on resulting bins or partitions to generate
Concept Hierarchy
-
Outliers are also fitted in separate bins or partitions or intervals
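A quick sketch of unsupervised equi-width binning followed by smoothing with the bin mean is shown below; the values and the choice of three bins are illustrative.

import numpy as np

values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)
n_bins = 3

edges = np.linspace(values.min(), values.max(), n_bins + 1)   # equi-width bin boundaries
bin_ids = np.digitize(values, edges[1:-1])                    # bin index for each value

smoothed = values.copy()
for b in range(n_bins):
    in_bin = bin_ids == b
    smoothed[in_bin] = values[in_bin].mean()      # replace each value by its bin mean
print(list(zip(values, bin_ids, smoothed)))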
Histogram Analysis:
-
Unsupervised Discretization; Top-Down Discretization Technique.
-
Data Values split into buckets – Equi-Width or Equi-Frequency
-
Repeats recursively on resulting buckets to generate multi-level Concept
Hierarchies.
-
Stops when user specified numbers of Concept Hierarchy Levels are generated.
Entropy-Based Discretization:
-
Supervised Discretization Method; Top-Down Discretization Method
-
Calculates and determines split-points; The value of the attribute A that has minimum
entropy is the split point; Data divided into partitions at the split points
-
Repeats recursively on resulting partitions to produce Concept Hierarchy of A
-
Basic Method:
o Consider a database D which has many tuples and A is one of the attribute.
o This attribute A is the Class label attribute as it decides the class of
the tuples.
o Attribute value of A is considered as Split-point - Binary Discretization.
Tuples with data values of A<= Split-point = D1
Tuples with data values of A > Split-point = D2
o Uses Class information. Consider there are two classes of tuples C1 and
C2. Then the ideal partitioning should be that the first partition should
have the class C1 tuples and the second partition should have the class
C2 tuples. But this is unlikely.
o First partition may have many tuples of class C1 and few tuples of class
C2 and Second partition may have many tuples of class C2 and few tuples
of class C1.
o To obtain a perfect partitioning the amount of Expected Information
Requirement is given by the formula:
o $Info_A(D) = \frac{|D_1|}{|D|}\,Entropy(D_1) + \frac{|D_2|}{|D|}\,Entropy(D_2)$
o where $Entropy(D_1) = -\sum_{i=1}^{m} p_i \log_2(p_i)$, and Entropy(D2) is computed similarly.
o Consider that there are m classes and pi is the probability of class i in D1.
o Select a split-point so that it has minimum amount of Expected
Information Requirement.
o Repeat this recursively on the resulting partitions to obtain the
Concept Hierarchy and stop when the number of intervals exceed
the max-intervals (user specified)
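The sketch below tries every candidate split point of an attribute A and keeps the one with the smallest expected information requirement Info_A(D), following the formula above; the attribute values and class labels are invented.

import numpy as np

A = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)   # attribute values (invented)
C = np.array([0, 0, 0, 0, 1, 1, 1, 1])                # class labels (invented)

def entropy(labels):
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels) / len(labels)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def best_split(A, C):
    best_point, best_info = None, float("inf")
    for point in np.unique(A)[:-1]:                    # candidate split points
        left, right = C[A <= point], C[A > point]      # partitions D1 and D2
        info = len(left) / len(C) * entropy(left) + len(right) / len(C) * entropy(right)
        if info < best_info:
            best_point, best_info = point, info
    return best_point, best_info

print(best_split(A, C))    # splits at A = 4 with Info_A(D) = 0 for this toy data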
X2 Merging (Chi-Merging):
-
Bottom-up Discretization; Supervised Discretization
-
Best neighboring intervals identified and merged recursively.
-
Basic concept is that adjacent intervals should have the same class
distribution. If so they are merged, otherwise remain separate.
o Each distinct value of the numeric attribute = one interval
o Perform X2 test on each pair of adjacent intervals.
o Intervals with least X2 value are merged to form a larger interval
o A low Χ² value implies a similar class distribution
o Merging done recursively until pre-specified stop criteria is reached.
-
Stop criteria determined by 3 conditions:
o Stops when X2 value of every pair of adjacent intervals exceeds
a pre- specified significance level – set between 0.10 and 0.01
o Stops when number of intervals exceeds a pre-specified max interval (say
10 to 15)
o Relative Class frequencies should be consistent within an interval.
Allowed level of inconsistency within an interval should be within a prespecified threshold say 3%.
Cluster Analysis:
-
Uses Top-Down Discretization or Bottom-up Discretization
-
Data values of an attribute are partitioned into clusters
- Uses the closeness of data values; produces high quality discretization results.
-
Each cluster is a node in the concept hierarchy
Each cluster further sub-divided into sub-clusters in case of Top-down
approach to create lower level clusters or concepts.
-
Clusters are merged in Bottom-up approach to create higher level cluster
or concepts.
Discretization by Intuitive Partitioning:
-
Users like numerical value intervals to be uniform, easy-to-use, ‘Intuitive’,
Natural.
-
Clustering analysis produces intervals such as ($53,245.78,$62,311.78].
-
But intervals such as ($50,000,$60,000] is better than the above.
-
3-4-5 Rule:
o Partitions the given data range into 3 or 4 or 5 equi-width intervals
o Partitions recursively, level-by-level, based on value range at most
significant digit.
o Real world data can contain extremely high or low values, which need to
be treated as outliers. Eg. The assets of some people may be several orders of
magnitude higher than those of others
o Such outliers are handled separately in a different interval
o So, majority of the data lies between 5% and 95% of the given data range.
o Eg. Profit of an ABC Ltd in the year 2004.
o Majority data between 5% and 95% - (-$159,876,$1,838,761]
o MIN = -$351,976; MAX = $4,700,896; LOW = -$159,876; HIGH =
$1,838,761;
o Most Significant digit – msd = $1,000,000;
o Hence LOW’ = -$1,000,000 & HIGH’ = $2,000,000
o Number of Intervals = ($2,000,000 – (-$1,000,000))/$1,000,000 = 3.
Example of 3-4-5 Rule
o Hence intervals are: (-$1,000,000,$0], ($0,$1,000,000],
($1,000,000,$2,000,000]
o LOW’ < MIN => Adjust the left boundary to make the interval smaller.
o Most significant digit of MIN is $100,000 => MIN’ = -$400,000
o Hence first interval reduced to (-$400,000,$0]
o HIGH’ < MAX => Add new interval ($2,000,000,$5,000,000]
o Hence the Top tier Hierarchy intervals are:
o (-$400,000,$0], ($0,$1,000,000], ($1,000,000,$2,000,000], ($2,000,000,$5,000,000]
o These are further subdivided as per 3-4-5 rule to obtain the lower
level hierarchies.
o Interval (-$400,000,$0] is divided into 4 equi-width intervals
o Intervals ($0,$1,000,000] & is divided into 5 Equi-width intervals
o Interval ($1,000,000,$2,000,000] is divided into 5 Equi-width intervals
o Interval ($2,000,000, $5,000,000] is divided into 3 Equi-width intervals.
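A simplified sketch of the top-tier step of the 3-4-5 rule for this profit example is shown below; the outlier handling and the recursive lower levels are omitted, and the split into 3 intervals follows because the rounded range covers 3 distinct values at the most significant digit.

import math

LOW, HIGH = -159_876, 1_838_761            # 5th and 95th percentile values from the example

msd = 10 ** math.floor(math.log10(max(abs(LOW), abs(HIGH))))   # 1,000,000
low_r = math.floor(LOW / msd) * msd        # round LOW down  -> -1,000,000
high_r = math.ceil(HIGH / msd) * msd       # round HIGH up   ->  2,000,000
n = (high_r - low_r) // msd                # 3 distinct msd values -> 3 equi-width intervals

width = (high_r - low_r) // n
intervals = [(low_r + i * width, low_r + (i + 1) * width) for i in range(n)]
print(intervals)   # [(-1000000, 0), (0, 1000000), (1000000, 2000000)]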
Concept Hierarchy Generation for Categorical Data:
Categorical Data = Discrete data; Eg. Geographic Location, Job type, Product Item type
Methods Used:
1. Specification of partial ordering of attributes explicitly at the schema level by users or
Experts.
2. Specification of a portion of a hierarchy by explicit data grouping.
3. Specification of the set of attributes that form the concept hierarchy, but not
their partial ordering.
4. Specification of only a partial set of attributes.
1. Specification of a partial ordering of attributes at the schema level by the users or
domain experts:
- Eg. The Dimension 'Location' in a Data warehouse has the attributes 'Street', 'City', 'State'
& 'Country'.
- A hierarchical definition of these attributes is obtained by ordering them as:
- Street < City < State < Country at the schema level itself by the user or expert.
2. Specification of a portion of the hierarchy by explicit data grouping:
- Manual definition of concept hierarchy.
- In real time large databases it is unrealistic to define the concept hierarchy for
the entire database
manually by value enumeration.
-
But we can easily specify intermediate-level grouping of data - a small portion of
hierarchy.
-
For Eg. Consider the attribute State, where we can specify as below:
- {Chennai, Madurai, Trichy} ⊂ (belongs to) Tamilnadu
- {Bangalore, Mysore, Mangalore} ⊂ (belongs to) Karnataka
3. Specification of a set of attributes but not their partial ordering:
- User specifies set of attributes of the concept hierarchy; but omits to specify
their ordering
- Automatic concept hierarchy generation or attribute ordering can be done in such
cases.
- This is done using the rule that counts and uses the distinct values of each attribute.
- The attribute that has the most distinct values is placed at the bottom of the hierarchy
- And the attribute that has the least distinct values is placed at the top of the hierarchy
- This heuristic rule applies for most cases but fails for some.
- Users or experts can examine the concept hierarchy and can perform manual
adjustment.
-
Eg. Concept Hierarchy for ‘Location’ dimension:
-
Country (10); State (508), City (10,804), Street (1,234,567)
-
Street < City < State < Country
-
In this case user need not modify the generated order / concept hierarchy.
-
But this heuristic rule may fail for the ‘Time’ dimension.
-
Distinct Years (100); Distinct Months (12); Distinct Days-of-week (7)
-
So in this case the attribute ordering or the concept hierarchy is:
-
Year < Month < Days-of-week
-
This is not correct.
4. Specification of only partial set of attributes:
- User may have vague idea of the concept hierarchy
- So they just specify only few attributes that form the concept hierarchy.
- Eg. User specifies just the Attributes Street and City.
- To get the complete concept hierarchy in this case we have to link these user
specified attributes
with the data semantics specified by the domain experts.
-
Users have the authority to modify this generated hierarchy.
-
The domain expert may have defined that the attributes given below are
semantically linked
-
Number, Street, City, State, Country.
-
Now the newly generated concept hierarchy by linking the domain expert
specification and the users specification will be that:
-
Number < Street < City < State < Country
-
Here the user can inspect this concept hierarchy and can remove the unwanted
attribute ‘Number’ to generate the new Concept Hierarchy as below:
-
Street < City < State < Country
DATA WAREHOUSE IMPLEMENTATION
-
It is important for a data warehouse system to be implemented with:
o Highly Efficient Computation of Data cubes
o Access Methods; Query Processing Techniques
-
Efficient Computation of Data cubes:
o = Efficient computation of aggregations across many sets of dimensions.
o Compute Cube operator and its implementations:
Extends SQL to include compute cube operator
Create Data cube for the dimensions item, city, year
and sales_in_dollars:
Example Queries to analyze data:
o Compute sum of sales grouping by item and city
o Compute sum of sales grouping by item
o Compute sum of sales grouping by city
Here dimensions are item, city and year; Measure / Fact
is sales_in_dollars
Hence the total number of cuboids or group-bys for this data
cube is 2^3 = 8.
Possible group bys are {(city, item, year), (city, item), (city,
year), (item, year), (city), (item), (year), ()}; These group bys
forms the lattice of cuboid
0-D (Apex) cuboid is (); 3-D (Base) cuboid is (city, item, year)
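A hedged sketch of what this cube computation amounts to is given below: every subset of the three dimensions is grouped and aggregated, yielding the 2^3 = 8 cuboids; the pandas DataFrame and its values are invented.

from itertools import combinations
import pandas as pd

sales = pd.DataFrame({
    "city": ["Chennai", "Chennai", "Madurai", "Madurai"],
    "item": ["TV", "Phone", "TV", "Phone"],
    "year": [2004, 2004, 2005, 2005],
    "sales_in_dollars": [400, 300, 250, 150],
})

dims = ["city", "item", "year"]
for r in range(len(dims), -1, -1):
    for group in combinations(dims, r):                 # all 2^3 = 8 cuboids
        if group:                                       # group-by cuboids
            cuboid = sales.groupby(list(group))["sales_in_dollars"].sum()
        else:                                           # apex cuboid (): grand total
            cuboid = sales["sales_in_dollars"].sum()
        print(tuple(group), "->")
        print(cuboid)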
o Hence, for a cube with n dimensions there are 2^n cuboids in total.
o The statement ‘compute cube sales’ computes sales aggregate cuboids for the
eight subsets.
o Pre-computation of cuboids leads to faster response time and avoids
redundant computation.
o But challenge in pre-computation is that the required storage space
may explode.
o Number of cuboids in an n-dimensional data cube if there is no
concept hierarchy attached with any dimension = 2^n cuboids.
o Consider that the Time dimension has the concept hierarchy day < month < quarter < year.
o Then the total number of cuboids is $\prod_{i=1}^{n} (L_i + 1)$, where Li is the number of
levels associated with dimension i (the extra 1 accounts for the virtual top level 'all').
o Eg. If a cube has 10 dimensions and each dimension has 4 levels, then the
total number of cuboids generated will be 5^10 ≈ 9.8 x 10^6.
o This shows it is unrealistic to pre-compute and materialize all cuboids for a
data cube.
-
Hence we go for Partial Materialization:
o Three choices of materialization:
o No Materialization:
Pre-compute only base cuboid and no other cuboids;
Slow computation
o Full Materialization: Pre-compute all cuboids; Requires huge space
o Partial Materialization: Pre-compute a proper subset of whole set of cuboids
Considers 3 factors:
Identify cuboids to materialize – based on workload,
frequency, accessing cost, storage need, cost of update,
index usage. (or simply use greedy Algorithm that has good
performance)
Exploit materialized cuboids during query processing
Update materialized cuboid during load and refresh
(use parallelism and incremental update)
-
Multiway array aggregation in the computation of data
cubes:
o To ensure fast online analytical processing we need to go for full
materialization
o But should consider amount of main memory available and time taken
for computation.
o ROLAP and MOLAP uses different cube computation techniques.
o Optimization techniques for ROLAP cube computations:
Sorting, hashing and grouping operations applied to dimension
attributes – to reorder and cluster tuples.
Grouping performed on some sub-aggregates – a 'Partial grouping step' – to speed up computations
Aggregates computed from sub aggregates (rather than from
base tables).
In ROLAP, dimension values are accessed by using value-based / key-based addressing search strategies.
o Optimization techniques for MOLAP cube computations:
MOLAP uses direct array addressing to access dimension values
Partition the array into Chunks (sub cube small enough to fit into
main memory).
Compute aggregates by visiting cube cells. The number of times
each cell is revisited is minimized to reduce memory access and
storage costs.
This is called as multiway array aggregation in data cube computation.
o MOLAP cube computation is faster than ROLAP cube computation
-
Indexing OLAP Data: Bitmap Indexing; Join Indexing
-
Bitmap Indexing:
o Index on a particular column; Each distinct value in a column has a bit vector
o The length of each bit vector = No. of records in the base table
o The i-th bit is set if the i-th row of the base table has the value for the indexed
column.
o This approach is not suitable for high cardinality domains
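A tiny sketch of a bitmap index on one low-cardinality column is shown below; the base table column and its values are hypothetical.

import numpy as np

region = np.array(["Asia", "Europe", "Asia", "America", "Europe"])   # indexed column

# One bit vector per distinct value; bit i is set if row i holds that value.
bitmap_index = {value: (region == value).astype(np.uint8) for value in np.unique(region)}
for value, bits in bitmap_index.items():
    print(value, bits)

# "region = 'Asia' OR region = 'Europe'" answered by OR-ing two bit vectors.
mask = bitmap_index["Asia"] | bitmap_index["Europe"]
print(np.nonzero(mask)[0])    # matching row ids: 0 1 2 4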
-
Join Indexing:
o In traditional databases, the join index maps the attribute values of the join
attributes to a list of record ids: it registers the joinable rows of two relations.
o Consider relations R (RID, A) & S (SID, B), where RID and SID are the record
identifiers of R & S respectively. For joining the attributes A & B, the join
index record contains the pair (RID, SID).
o But in data warehouses, the join index relates the values of the dimensions of a
star schema to rows in the fact table.
o Eg. For the fact table Sales and the two dimensions city and product, a join
index on city maintains, for each distinct city, a list of RIDs of the tuples of
the fact table Sales that relate to that city.
o Join indices can span across multiple dimensions – Composite join indices.
o To speed up query processing, join indexing and bitmap indexing can be
integrated to form Bitmapped join indices.
-
Efficient processing of OLAP queries:
o Steps for efficient OLAP query processing:
o 1. Determine which OLAP operations should be performed on the
available cuboids:
Transform the OLAP operations like drill-down, roll-up,… to its
corresponding SQL (relational algebra) operations.
Eg. Dice = Selection + Projection
o 2. Determine to which materialized cuboids the relevant OLAP operations
should be applied:
Involves (i) Pruning of cuboids using knowledge of “dominance”
(ii) Estimate the cost of remaining materialized cuboids (iii)
Selecting the cuboid with least cost
Eg. Cube: “Sales [time, item, location]: sum(sales_in_dollars)”
Dimension hierarchies used are:
“day < month < quarter < year” for time dimension
“Item_name < brand < type for item dimension
“street < city < state < country for location dimension
Say query to be processed is on {brand, state} with the condition
year = “1997”
Say there are four materialized cuboids available
Cuboid 1: {item_name, city, year} ; Cuboid 2: {brand,
country, year}
Cuboid 3: {brand, state, year} ;
Cuboid 4: {item_name, state} where year = 1997
Which cuboid selected for query processing?
Step 1: Pruning of cuboids – prune cuboid 2 as higher level of
concept “country” can not answer query at lower granularity
“state”
Step 2: Estimate cuboid cost; Cuboid 1 costs the most of the 3
cuboids as item_name and city are at a finer granular level
than brand and state as mentioned in the query.
Step 3: If there are less number of years and there are more
number of item_names under each brand then Cuboid 3 has the
least cost. But if otherwise and there are efficient indexes on
item_name then Cuboid 4 has the least cost. Hence select Cuboid
3 or Cuboid 4 accordingly.
-
Metadata repository:
o Metadata is the data defining warehouse objects. It stores:
Description of the structure of the data warehouse:
Schema, views, dimensions, hierarchies, derived
data definition, data mart locations and contents
Operational meta data:
Data lineage: History of migrated data and its
transformation path
Currency of data: Active, archived or purged
Monitoring information:
o Warehouse usage statistics, error reports, audit trails
Algorithms used for summarization:
Measure and Dimension definition algorithm
Granularity, Subject, Partitions definition
Aggregation, Summarization and pre-defined queries
and reports.
Mapping from operational environment to the data warehouse:
Source database information; Data refresh & purging rules
Data extraction, cleaning and transformation rules
Security rules (authorization and access control)
Data related to system performance:
Data access performance; data retrieval performance
Rules for timing and scheduling of refresh
Business metadata:
Business terms and definitions
Data ownership information; Data charging policies
-
Data Warehouse Back-end Tools and Utilities:
o Data Extraction: Get data from multiple, heterogeneous and
external sources
o Data Cleaning: Detects errors in the data and rectify them when possible
o Data Transformation: Convert data from legacy or host format to
warehouse format
o Load: Sort; Summarize, Consolidate; Compute views; Check integrity
Build indices and partitions
FROM DATA WAREHOUSING TO DATA MINING:
-
Data Warehousing Usage:
o Data warehouse and Data Marts used in wide range of applications;
o Used in feedback systems for enterprise management – the "Plan-execute-assess" loop
o Applied in Banking, Finance, Retail, Manufacturing,…
o Data warehouse used for knowledge discovery and strategic decision
making using data mining tools
o There are three kinds of data warehouse applications:
Information Processing:
Supports querying & basic statistical analysis
Reporting using cross tabs, tables, charts and graphs
Analytical Processing:
Multidimensional analysis of data warehouse data
Supports basic OLAP operations slice-dice, drilling and
pivoting
Data Mining:
Knowledge discovery from hidden patterns
Supports associations, classification & prediction and
Clustering
Constructs analytical models
Presenting mining results using visualization tools
-
From Online Analytical Processing (OLAP) to Online Analytical Mining
(OLAM):
o OLAM also called as OLAP Mining – Integrates OLAP with mining
techniques
o Why OLAM?
High Quality of data in data warehouses:
DWH has cleaned, transformed and integrated data
(Preprocessed data)
Data mining tools need such costly preprocessing of data.
Thus DWH serves as a valuable and high quality data
source for OLAP as well as for Data Mining
Available information processing infrastructure surrounding
data warehouses:
Includes accessing, integration, consolidation and
transformation of multiple heterogeneous databases; ODBC/OLEDB connections;
Web accessing and servicing facilities; Reporting and
OLAP analysis tools
OLAP-based exploratory data analysis:
OLAM provides facilities for data mining on different
subsets of data and at different levels of abstraction
Eg. Drill-down, pivoting, roll-up, slicing, dicing on OLAP
and on intermediate DM results
Enhances power of exploratory data mining by use of
visualization tools
On-line selection of data mining functions:
OLAM provides the flexibility to select desired data mining
functions and swap data mining tasks dynamically.
Architecture of Online Analytical Mining:
-
OLAP and OLAM engines accept on-line queries via User GUI API
-
And they work with the data cube in data analysis via Data Cube API
-
A meta data directory is used to guide the access of data cube
-
MDDB constructed by integrating multiple databases or by filtering a data
warehouse via Database API which may support ODBC/OLEDB connections.
-
OLAM Engine consists of multiple data mining modules – Hence sophisticated
than OLAP engine.
-
Data Mining should be a human centered process – users should often
interact with the system to perform exploratory data analysis
4.6 OLAP Need
OLAP systems vary quite a lot, and they have generally been distinguished by a letter tagged
onto the front of the word OLAP. ROLAP and MOLAP are the big players, and the other
distinctions represent little more than the marketing programs on the part of the vendors to
distinguish themselves, for example, SOLAP and DOLAP. Here, we aim to give you a hint as
to what these distinctions mean.
4.7 Categorization of OLAP Tools
Major Types:
Relational OLAP (ROLAP) –Star Schema based
Considered the fastest growing OLAP technology style, ROLAP or “Relational” OLAP systems
work primarily from the data that resides in a relational database, where the base data and
dimension tables are stored as relational tables. This model permits multidimensional
analysis of data as this enables users to perform a function equivalent to that of the
traditional OLAP slicing and dicing feature. This is achieved through the use of any SQL
reporting tool to extract or 'query' data directly from the data warehouse, wherein
specifying a 'WHERE clause' amounts to performing a certain slice and dice action.
One advantage of ROLAP over the other styles of OLAP analytic tools is that it is deemed to
be more scalable in handling huge amounts of data. ROLAP sits on top of relational
databases therefore enabling it to leverage several functionalities that a relational database
is capable of. Another gain of a ROLAP tool is that it is efficient in managing both numeric
and textual data. It also permits users to “drill down” to the leaf details or the lowest level of
a hierarchy structure.
However, ROLAP applications display slower performance compared to other styles of
OLAP tools since, oftentimes, calculations are performed inside the server. Another demerit
of a ROLAP tool is that as it is dependent on use of SQL for data manipulation, it may not be
ideal for performance of some calculations that are not easily translatable into an SQL
query.
Multidimensional OLAP (MOLAP) –Cube based
Multidimensional OLAP, with a popular acronym of MOLAP, is widely regarded as the classic
form of OLAP. One of the major distinctions of MOLAP against a ROLAP tool is that data are
pre-summarized and are stored in an optimized format in a multidimensional cube, instead
of in a relational database. In this type of model, data are structured into proprietary
formats in accordance with a client’s reporting requirements with the calculations
pre- generated on the cubes.
This is probably by far, the best OLAP tool to use in making analysis reports since this
enables users to easily reorganize or rotate the cube structure to view different aspects of
data. This is done by way of slicing and dicing. MOLAP analytic tools are also capable of
performing complex calculations. Since calculations are predefined upon cube creation, this
results in the faster return of computed data. MOLAP systems also provide users the ability
to quickly write back data into a data set. Moreover, in comparison to ROLAP, MOLAP is
considerably less heavy on hardware due to compression techniques. In a nutshell, MOLAP
is more optimized for fast query performance and retrieval of summarized information.
There are certain limitations to implementation of a MOLAP system, one primary
weakness of which is that MOLAP tool is less scalable than a ROLAP tool as the former is
capable of handling only a limited amount of data. The MOLAP approach also introduces
data redundancy. There are also certain MOLAP products that encounter difficulty in
updating models with dimensions with very high cardinality.
Hybrid OLAP (HOLAP)
HOLAP is the product of the attempt to incorporate the best features of MOLAP and ROLAP
into a single architecture. This tool tried to bridge the technology gap of both products by
enabling access or use to both multidimensional database (MDDB) and Relational Database
Management System (RDBMS) data stores. HOLAP systems stores larger quantities of
detailed data in the relational tables while the aggregations are stored in the pre-calculated
cubes. HOLAP also has the capacity to “drill through” from the cube down to the relational
tables for delineated data.
Some of the advantages of this system are better scalability, quick data processing
and flexibility in accessing of data sources.
Other Types:
There are also less popular types of OLAP styles upon which one could stumble upon every
so often. We have listed some of the less famous types existing in the OLAP industry.
Web OLAP (WOLAP)
Simply put, a Web OLAP which is likewise referred to as Web-enabled OLAP, pertains to
OLAP application which is accessible via the web browser. Unlike traditional client/server
OLAP applications, WOLAP is considered to have a three-tiered architecture which consists
of three components: a client, a middleware and a database server.
Probably some of the most appealing features of this style of OLAP are the considerably
lower investment involved, enhanced accessibility as a user only needs an internet
connection and a web browser to connect to the data and ease in installation, configuration
and deployment process.
But despite all of its unique features, it could still not compare to a conventional
client/server machine. Currently, it is inferior in comparison to OLAP applications which
involve deployment in client machines in terms of functionality, visual appeal and
performance.
Desktop OLAP (DOLAP)
Desktop OLAP, or “DOLAP” is based on the idea that a user can download a section of the
data from the database or source, and work with that dataset locally, or on their desktop.
DOLAP is easier to deploy and has a cheaper cost but comes with a very limited functionality
in comparison with other OLAP applications.
Mobile OLAP (MOLAP)
Mobile OLAP simply refers to OLAP functionality on a wireless or mobile device. This
enables users to access and work on OLAP data and applications remotely through the
use of their mobile devices.
Spatial OLAP (SOLAP)
With the aim of integrating the capabilities of both Geographic Information Systems (GIS)
and OLAP into a single user interface, “SOLAP” or Spatial OLAP emerged. SOLAP is created
to facilitate management of both spatial and non-spatial data, as data could come not only
in an alphanumeric form, but also in images and vectors. This technology provides easy and
quick exploration of data that resides on a spatial database.
Other blends of OLAP products, like the less popular 'DOLAP' and 'ROLAP'
(here standing for Database OLAP and Remote OLAP), 'LOLAP' for Local OLAP and
'RTOLAP' for Real-Time OLAP, also exist but have barely made a noise in the OLAP
industry.
Unit - IV
Association Rule Mining And Classification
Classes:09
Mining frequent patterns, associations and correlations, mining methods, mining various kinds
of association rules, correlation analysis, constraint based association mining, classification
and prediction, basic concepts, decision tree induction, Bayesian classification, rule based
classification, classification by back propagation.
CLASSIFICATION AND PREDICTION:
There are two forms of data analysis that can be used for extracting models describing
important classes or to predict future data trends. These two forms are as follows −
•
Classification
•
Prediction
Classification models predict categorical class labels; and prediction models predict
continuous valued functions. For example, we can build a classification model to categorize
bank loan applications as either safe or risky, or a prediction model to predict the
expenditures in dollars of potential customers on computer equipment given their income
and occupation.
What is classification?
Following are the examples of cases where the data analysis task is Classification −
•
A bank loan officer wants to analyze the data in order to know which customer (loan
applicant) are risky or which are safe.
•
A marketing manager at a company needs to analyze a customer with a given profile,
who will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the categorical
labels. These labels are risky or safe for loan application data and yes or no for marketing
data.
What is prediction?
Following are the examples of cases where the data analysis task is Prediction −
Suppose the marketing manager needs to predict how much a given customer will spend
during a sale at his company. In this example we are required to predict a numeric value.
Therefore the data analysis task is an example of numeric prediction. In this case, a model or
a predictor will be constructed that predicts a continuous-valued-function or ordered value.
Note − Regression analysis is a statistical methodology that is most often used for numeric
prediction.
How Does Classification Works?
With the help of the bank loan application that we have discussed above, let us understand
the working of classification. The Data Classification process includes two steps −
•
Building the Classifier or Model
•
Using Classifier for Classification
Building the Classifier or Model
•
This step is the learning step or the learning phase.
•
In this step the classification algorithms build the classifier.
•
The classifier is built from the training set made up of database tuples and their
associated class labels.
•
Each tuple that constitutes the training set is assumed to belong to a predefined category or
class. These tuples can also be referred to as samples, objects or data points.
Using Classifier for Classification
In this step, the classifier is used for classification. Here the test data is used to estimate the
accuracy of classification rules. The classification rules can be applied to the new data tuples
if the accuracy is considered acceptable.
Classification and Prediction Issues
The major issue is preparing the data for Classification and Prediction. Preparing the data
involves the following activities −
•
Data Cleaning − Data cleaning involves removing the noise and treatment of missing
values. The noise is removed by applying smoothing techniques and the problem of
missing values is solved by replacing a missing value with most commonly occurring
value for that attribute.
•
Relevance Analysis − Database may also have the irrelevant attributes. Correlation
analysis is used to know whether any two given attributes are related.
•
Data Transformation and reduction − The data can be transformed by any of the
following methods.
o
Normalization − The data is transformed using normalization. Normalization
involves scaling all values for given attribute in order to make them fall within
a small specified range. Normalization is used when in the learning step, the
neural networks or the methods involving measurements are used.
o
Generalization − The data can also be transformed by generalizing it to the
higher concept. For this purpose we can use the concept hierarchies.
Note − Data can also be reduced by some other methods such as wavelet transformation,
binning, histogram analysis, and clustering.
Comparison of Classification and Prediction Methods
Here are the criteria for comparing the methods of Classification and Prediction −
•
Accuracy − Accuracy of a classifier refers to its ability to predict the class label
correctly; accuracy of a predictor refers to how well a given predictor can guess the
value of the predicted attribute for new data.
•
Speed − This refers to the computational cost in generating and using the classifier or
predictor.
•
Robustness − It refers to the ability of classifier or predictor to make correct
predictions from given noisy data.
•
Scalability − Scalability refers to the ability to construct the classifier or predictor
efficiently, given a large amount of data.
•
Interpretability − It refers to the level of understanding and insight that the classifier or predictor provides.
DECISION TREE INDUCTION:
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each
internal node denotes a test on an attribute, each branch denotes the outcome of a test, and
each leaf node holds a class label. The topmost node in the tree is the root node.
A decision tree for the concept buy_computer, for example, indicates whether a
customer at a company is likely to buy a computer or not. Each internal node represents a
test on an attribute. Each leaf node represents a class.
The benefits of having a decision tree are as follows −
•
It does not require any domain knowledge.
•
It is easy to comprehend.
•
The learning and classification steps of a decision tree are simple and fast.
Decision Tree Induction Algorithm
A machine learning researcher, J. Ross Quinlan, developed a decision tree algorithm known
as ID3 (Iterative Dichotomiser) in 1980. He later presented C4.5, the successor of ID3. ID3
and C4.5 adopt a greedy approach: there is no backtracking, and the trees are constructed
in a top-down, recursive, divide-and-conquer manner.
Generating a decision tree from the training tuples of data partition D
Algorithm : Generate_decision_tree
Input:
Data partition, D, which is a set of training tuples
and their associated class labels.
attribute_list, the set of candidate attributes.
Attribute_selection_method, a procedure to determine the
splitting criterion that best partitions the data
tuples into individual classes. This criterion includes a
splitting_attribute and either a split point or a splitting subset.
Output:
A Decision Tree
Method
create a node N;
if tuples in D are all of the same class, C then
return N as leaf node labeled with class C;
if attribute_list is empty then
return N as a leaf node labeled
with the majority class in D; // majority voting
apply attribute_selection_method(D, attribute_list)
to find the best splitting_criterion;
label node N with splitting_criterion;
if splitting_attribute is discrete-valued and
multiway splits allowed then // not restricted to binary trees
attribute_list = attribute_list - splitting_attribute; // remove splitting_attribute
for each outcome j of splitting criterion
// partition the tuples and grow subtrees for each partition
let Dj be the set of data tuples in D satisfying outcome j; // a partition
if Dj is empty then
attach a leaf labeled with the majority
class in D to node N;
else
attach the node returned by
Generate_decision_tree(Dj, attribute_list) to node N;
end for
return N;
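The core of the algorithm above is the attribute selection step. Below is a minimal sketch in Python of ID3-style attribute selection using information gain; the training tuples and attribute names are hypothetical, and only the selection measure is shown, not the full recursive tree construction:

import math
from collections import Counter

# Minimal sketch of ID3-style attribute selection using information gain.
def entropy(tuples, class_attr):
    counts = Counter(t[class_attr] for t in tuples)
    total = len(tuples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(tuples, attr, class_attr):
    total = len(tuples)
    remainder = 0.0
    for value in set(t[attr] for t in tuples):
        subset = [t for t in tuples if t[attr] == value]
        remainder += (len(subset) / total) * entropy(subset, class_attr)
    return entropy(tuples, class_attr) - remainder

def best_splitting_attribute(tuples, attribute_list, class_attr):
    return max(attribute_list, key=lambda a: information_gain(tuples, a, class_attr))

# Hypothetical training tuples for the buy_computer example.
D = [
    {"age": "youth",  "student": "no",  "buys_computer": "no"},
    {"age": "youth",  "student": "yes", "buys_computer": "yes"},
    {"age": "senior", "student": "no",  "buys_computer": "yes"},
    {"age": "senior", "student": "yes", "buys_computer": "yes"},
]
print(best_splitting_attribute(D, ["age", "student"], "buys_computer"))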
Tree Pruning
Tree pruning is performed in order to remove anomalies in the training data due to noise or
outliers. The pruned trees are smaller and less complex.
Tree Pruning Approaches
There are two approaches to prune a tree −
•
Pre-pruning − The tree is pruned by halting its construction early.
•
Post-pruning - This approach removes a sub-tree from a fully grown tree.
Cost Complexity
The cost complexity is measured by the following two parameters −
•
Number of leaves in the tree, and
•
Error rate of the tree.
BAYESIAN CLASSIFICATION
Bayesian classification is based on Bayes' Theorem. Bayesian classifiers are statistical
classifiers. Bayesian classifiers can predict class membership probabilities, such as the
probability that a given tuple belongs to a particular class.
Bayes' Theorem
Bayes' Theorem is named after Thomas Bayes. There are two types of probabilities −
•
Posterior Probability [P(H|X)]
•
Prior Probability [P(H)]
where X is data tuple and H is some hypothesis.
According to Bayes' Theorem,
P(H|X) = P(X|H) P(H) / P(X)
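As a quick worked example of the theorem (with made-up numbers, purely for illustration):

# Minimal sketch: applying Bayes' theorem with illustrative numbers.
# H: "customer buys a computer", X: "customer is a youth with medium income".
p_h = 0.6          # prior P(H), assumed
p_x_given_h = 0.3  # likelihood P(X|H), assumed
p_x = 0.25         # evidence P(X), assumed

p_h_given_x = p_x_given_h * p_h / p_x   # posterior P(H|X)
print(p_h_given_x)                      # 0.72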
Bayesian Belief Network
Bayesian Belief Networks specify joint conditional probability distributions. They are also
known as Belief Networks, Bayesian Networks, or Probabilistic Networks.
•
A Belief Network allows class conditional independencies to be defined between
subsets of variables.
•
It provides a graphical model of causal relationships on which learning can be
performed.
•
We can use a trained Bayesian Network for classification.
There are two components that define a Bayesian Belief Network −
•
Directed acyclic graph
•
A set of conditional probability tables
Directed Acyclic Graph
•
Each node in a directed acyclic graph represents a random variable.
•
These variables may be discrete- or continuous-valued.
•
These variables may correspond to actual attributes given in the data.
Directed Acyclic Graph Representation
As an example, consider a directed acyclic graph over six Boolean variables describing a
patient. The arcs allow the representation of causal knowledge. For example, lung cancer is
influenced by a person's family history of lung cancer, as well as whether or not the person is
a smoker. It is worth noting that the variable PositiveXray is independent of whether the
patient has a family history of lung cancer or that the patient is a smoker, given that we know
the patient has lung cancer.
Conditional Probability Table
The conditional probability table for the values of the variable LungCancer (LC) showing each
possible combination of the values of its parent nodes, FamilyHistory (FH), and Smoker (S) is
as follows −
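The table itself is not reproduced here; the following minimal sketch shows how such a CPT could be stored and queried in Python, with purely illustrative probability values:

# Minimal sketch of a conditional probability table P(LC | FH, S).
# All probability values below are illustrative, not from any real study.
cpt_lung_cancer = {
    # (FamilyHistory, Smoker): P(LC = "yes")
    ("yes", "yes"): 0.8,
    ("yes", "no"):  0.5,
    ("no",  "yes"): 0.7,
    ("no",  "no"):  0.1,
}

def p_lc(lc_value, fh, s):
    p_yes = cpt_lung_cancer[(fh, s)]
    return p_yes if lc_value == "yes" else 1.0 - p_yes

print(p_lc("yes", "yes", "no"))   # 0.5
print(p_lc("no", "no", "no"))     # 0.9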
RULE BASED CLASSIFICATION
IF-THEN Rules
A rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a
rule in the following form −
IF condition THEN conclusion
Let us consider a rule R1,
R1: IF age = youth AND student = yes
THEN buy_computer = yes
Points to remember −
•
The IF part of the rule is called rule antecedent or precondition.
•
The THEN part of the rule is called rule consequent.
•
The antecedent part (the condition) consists of one or more attribute tests, and these
tests are logically ANDed.
•
The consequent part consists of class prediction.
Note − We can also write rule R1 as follows −
R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes)
If the condition holds true for a given tuple, then the antecedent is satisfied.
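A minimal sketch of how such IF-THEN rules could be applied in code is shown below; the rule set, the attribute names, and the default class are illustrative assumptions:

# Minimal sketch of a rule-based classifier applying IF-THEN rules in order.
# Each rule: (antecedent as attribute/value pairs, consequent class label).
rules = [
    ({"age": "youth", "student": "yes"}, "buys_computer=yes"),   # R1
    ({"age": "senior", "credit": "fair"}, "buys_computer=no"),   # illustrative R2
]

def classify(tuple_, rules, default="buys_computer=no"):
    for antecedent, consequent in rules:
        if all(tuple_.get(attr) == value for attr, value in antecedent.items()):
            return consequent        # antecedent satisfied: the rule fires
    return default                   # no rule covers the tuple

print(classify({"age": "youth", "student": "yes"}, rules))   # buys_computer=yes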
Rule Extraction
Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from a
decision tree.
Points to remember −
To extract a rule from a decision tree −
•
One rule is created for each path from the root to the leaf node.
•
To form a rule antecedent, each splitting criterion is logically ANDed.
•
The leaf node holds the class prediction, forming the rule consequent.
Rule Induction Using Sequential Covering Algorithm
The Sequential Covering Algorithm can be used to extract IF-THEN rules directly from the
training data; we do not need to generate a decision tree first. In this algorithm, each rule
for a given class covers many of the tuples of that class.
Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As per the general
strategy, the rules are learned one at a time. Each time a rule is learned, the tuples covered
by the rule are removed, and the process continues for the remaining tuples.
Note − Decision tree induction, by contrast, can be considered as learning a set of rules
simultaneously, since the path from the root to each leaf of the tree corresponds to a rule.
The following is the sequential learning algorithm, where rules are learned for one class at a
time. When learning a rule for class Ci, we want the rule to cover tuples of class Ci only and
no tuples of any other class.
Algorithm: Sequential Covering
Input:
D, a data set of class-labeled tuples,
Att_vals, the set of all attributes and their possible values.
Output: A Set of IF-THEN rules.
Method:
Rule_set={ }; // initial set of rules learned is empty
for each class c do
repeat
Rule = Learn_One_Rule(D, Att_vals, c);
remove tuples covered by Rule from D;
until termination condition;
Rule_set=Rule_set+Rule; // add a new rule to rule-set
end for
return Rule_Set;
Rule Pruning
Rule pruning is done for the following reasons −
•
The assessment of quality is made on the original set of training data. The rule may
perform well on the training data but less well on subsequent data. That is why rule
pruning is required.
•
The rule is pruned by removing a conjunct. Rule R is pruned if the pruned version of R
has greater quality, as assessed on an independent set of (pruning) tuples.
FOIL is one of the simple and effective methods for rule pruning. For a given rule R,
FOIL_Prune(R) = (pos - neg) / (pos + neg)
where pos and neg are the number of positive and negative tuples covered by R, respectively.
Note − This value will increase with the accuracy of R on the pruning set. Hence, if the
FOIL_Prune value is higher for the pruned version of R, then we prune R.
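A minimal sketch of this pruning check in Python; the tuple counts are made up for illustration:

# Minimal sketch of the FOIL_Prune test for rule pruning.
def foil_prune(pos, neg):
    return (pos - neg) / (pos + neg)

# Illustrative counts on an independent pruning set.
original_rule = foil_prune(pos=40, neg=15)   # rule R with all conjuncts
pruned_rule   = foil_prune(pos=38, neg=8)    # R with one conjunct removed

# If the pruned version scores higher, we keep the pruned rule.
print(original_rule, pruned_rule, pruned_rule > original_rule)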
Unit - V
Clustering And Trends In Data Mining
Classes:08
Cluster analysis: Types of data, categorization of major clustering methods, K-means partitioning
methods, hierarchical methods, density based methods, grid based methods, model based
clustering methods, clustering, high dimensional data, constraint based cluster analysis, outlier
analysis; Trends in data mining: Data mining applications, data mining system products and
research prototypes, social impacts of data mining.
What is Cluster Analysis?
The process of grouping a set of physical objects into classes of similar objects is called
clustering.
Cluster – collection of data objects
– Objects within a cluster are similar and objects in different clusters are dissimilar.
Cluster applications – pattern recognition, image processing and market research.
- helps marketers to discover the characterization of customer groups based
on purchasing patterns
- Categorize genes in plant and animal taxonomies
- Identify groups of houses in a city according to house type, value and
geographical location
- Classify documents on WWW for information discovery
Clustering is a preprocessing step for other data mining steps like classification,
characterization.
Clustering – Unsupervised learning – does not rely on predefined classes with class labels.
Typical requirements of clustering in data mining:
1. Scalability – Clustering algorithms should work for huge databases
2. Ability to deal with different types of attributes – Clustering algorithms should work
not only for numeric data, but also for other data types.
3. Discovery of clusters with arbitrary shape – Clustering algorithms (based on distance
measures) should work for clusters of any shape.
4. Minimal requirements for domain knowledge to determine input parameters –
Clustering results are sensitive to input parameters to a clustering algorithm (example
– number of desired clusters). Determining the value of these parameters is
difficult and requires some domain knowledge.
5. Ability to deal with noisy data – Outlier, missing, unknown and erroneous data
detected by a clustering algorithm may lead to clusters of poor quality.
6. Insensitivity to the order of input records – Clustering algorithms should produce
same results even if the order of input records is changed.
7. High dimensionality – Data in high dimensional space can be sparse and highly
skewed, hence it is challenging for a clustering algorithm to cluster data objects
in high dimensional space.
8. Constraint-based clustering – In real-world scenarios, clustering may need to be
performed under various constraints. It is a challenging task to find groups of data
with good clustering behavior that also satisfy various constraints.
9. Interpretability and usability – Clustering results should be interpretable,
comprehensible and usable. So we should study how an application goal
may influence the selection of clustering methods.
Types of data in Clustering Analysis
1. Data Matrix: (object-by-variable structure)
Represents n objects (such as persons) with p variables or attributes (such as age,
height, weight, gender, race and so on). The structure is in the form of a relational
table, or an n x p matrix, often called a “two-mode” matrix.
2. Dissimilarity Matrix: (object-by-object structure)
This stores a collection of proximities (closeness or distance) that are available for all
pairs of the n objects. It is represented by an n-by-n table, often called a “one-mode”
matrix, where d(i, j) is the dissimilarity between objects i and j, d(i, j) = d(j, i), and
d(i, i) = 0.
Many clustering algorithms use Dissimilarity Matrix. So data represented using Data
Matrix are converted into Dissimilarity Matrix before applying such clustering
algorithms.
Clustering of objects is done based on their similarities or dissimilarities.
Similarity coefficients or dissimilarity coefficients may be derived from correlation
coefficients.
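As a small illustration of converting a data matrix into a dissimilarity matrix, the following Python sketch computes pairwise Euclidean distances; the three sample objects and their attribute values are made up:

import math

# Minimal sketch: build an n-by-n dissimilarity matrix from an n-by-p data matrix
# using Euclidean distance, so that d(i, j) = d(j, i) and d(i, i) = 0.
def dissimilarity_matrix(data):
    n = len(data)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(data[i], data[j])))
            d[i][j] = d[j][i] = dist
    return d

# Hypothetical data matrix: 3 objects described by (age, height in cm).
objects = [(25, 170.0), (32, 165.0), (47, 180.0)]
for row in dissimilarity_matrix(objects):
    print(row)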
Categorization of Major Clustering Methods
The choice among the many available clustering algorithms depends on the type of data
available and on the intended application.
Major Categories are:
1. Partitioning Methods:
- Construct k partitions of the n data objects, where each partition is a cluster and
k <= n.
- Each partition should contain at least one object, and each object should belong
to exactly one partition.
- Iterative Relocation Technique – attempts to improve the partitioning by
moving objects from one group to another.
- Good Partitioning – Objects in the same cluster are “close” / related and objects
in different clusters are “far apart” / very different.
Uses the Algorithms:
o K-means Algorithm: - Each cluster is represented by the mean value of the
objects in the cluster.
o K-medoids Algorithm: - Each cluster is represented by one of the objects
located near the center of the cluster.
o These work well in small to medium sized databases.
2. Hierarchical Methods:
- Creates hierarchical decomposition of the given set of data objects.
- Two types – Agglomerative and Divisive
- Agglomerative Approach: (Bottom-Up Approach):
o Each object forms a separate group
o Successively merges groups close to one another (based on distance
between clusters)
o Done until all the groups are merged to one or until a termination
condition holds. (Termination condition can be desired number of clusters)
- Divisive Approach: (Top-Down Approach):
o Starts with all the objects in the same cluster
o Successively clusters are split into smaller clusters
o Done until each object is in one cluster or until a termination condition
holds (Termination condition can be desired number of clusters)
- Disadvantage – Once a merge or split is done, it cannot be undone.
- Advantage – Less computational cost
- Combining both approaches gives the advantages of each.
- Clustering algorithms with this integrated approach are BIRCH and CURE.
3. Density Based Methods:
- The above methods tend to produce spherical-shaped clusters.
- To discover clusters of arbitrary shape, clustering is done based on the notion
of density (the number of objects or data points in a region).
- Used to filter out noise or outliers.
- A cluster continues growing as long as the density in the neighborhood
exceeds some threshold; that is, for each data point within a given cluster, the
neighborhood of a given radius has to contain at least a minimum number of points.
- Uses the algorithms DBSCAN and OPTICS.
4. Grid-Based Methods:
- Divides the object space into a finite number of cells to form a grid structure.
- Performs clustering operations on the grid structure.
- Advantage – Fast processing time, independent of the number of data objects and
dependent only on the number of cells in the data grid.
- STING – a typical grid-based method.
- CLIQUE and Wave-Cluster – both grid-based and density-based clustering algorithms.
5. Model-Based Methods:
- Hypothesizes a model for each of the clusters and finds a best fit of the data to
the model.
- Forms clusters by constructing a density function that reflects the
spatial distribution of the data points.
- Robust clustering methods
- Detects noise / outliers.
Many algorithms combine several clustering methods.
A cluster is a group of objects that belong to the same class. In other words, similar objects
are grouped in one cluster and dissimilar objects are grouped in another cluster.
What is Clustering?
Clustering is the process of making a group of abstract objects into classes of similar objects.
Points to Remember
•
A cluster of data objects can be treated as one group.
•
While doing cluster analysis, we first partition the set of data into groups based on data
similarity and then assign the labels to the groups.
•
The main advantage of clustering over classification is that it is adaptable to changes
and helps single out useful features that distinguish different groups.
Applications of Cluster Analysis
•
Clustering analysis is broadly used in many applications such as market research,
pattern recognition, data analysis, and image processing.
•
Clustering can also help marketers discover distinct groups in their customer base. And
they can characterize their customer groups based on the purchasing patterns.
•
In the field of biology, it can be used to derive plant and animal taxonomies, categorize
genes with similar functionalities and gain insight into structures inherent to
populations.
•
Clustering also helps in identification of areas of similar land use in an earth
observation database. It also helps in the identification of groups of houses in a city
according to house type, value, and geographic location.
•
Clustering also helps in classifying documents on the web for information discovery.
•
Clustering is also used in outlier detection applications such as detection of credit card
fraud.
•
As a data mining function, cluster analysis serves as a tool to gain insight into the
distribution of data to observe characteristics of each cluster.
Requirements of Clustering in Data Mining
The following points throw light on why clustering is required in data mining −
•
Scalability − We need highly scalable clustering algorithms to deal with large
databases.
•
Ability to deal with different kinds of attributes − Algorithms should be capable of
being applied to any kind of data, such as interval-based (numerical) data, categorical
data, and binary data.
•
Discovery of clusters with arbitrary shape − The clustering algorithm should be
capable of detecting clusters of arbitrary shape. They should not be bounded to only
distance measures that tend to find spherical clusters of small sizes.
•
High dimensionality − The clustering algorithm should not only be able to handle
low-dimensional data but also the high-dimensional space.
•
Ability to deal with noisy data − Databases contain noisy, missing or erroneous data.
Some algorithms are sensitive to such data and may lead to poor quality clusters.
•
Interpretability − The clustering results should be interpretable, comprehensible, and
usable.
Clustering Methods
Clustering methods can be classified into the following categories −
•
Partitioning Method
•
Hierarchical Method
•
Density-based Method
•
Grid-Based Method
•
Model-Based Method
•
Constraint-based Method
Partitioning Method
Suppose we are given a database of ‘n’ objects and the partitioning method constructs ‘k’
partitions of the data. Each partition will represent a cluster and k ≤ n. It means that it will
classify the data into k groups, which satisfy the following requirements −
•
Each group contains at least one object.
•
Each object must belong to exactly one group.
Points to remember −
•
For a given number of partitions (say k), the partitioning method will create an initial
partitioning.
•
Then it uses the iterative relocation technique to improve the partitioning by moving
objects from one group to other.
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects. We can
classify hierarchical methods on the basis of how the hierarchical decomposition is formed.
There are two approaches here −
•
Agglomerative Approach
•
Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each object
forming a separate group. It keeps on merging the objects or groups that are close to one
another. It keeps on doing so until all of the groups are merged into one or until the termination
condition holds.
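To make the agglomerative (bottom-up) approach concrete, here is a hedged usage sketch with scikit-learn's AgglomerativeClustering, assuming scikit-learn is available; the points and the choice of two clusters are illustrative:

# Minimal sketch of bottom-up (agglomerative) hierarchical clustering.
from sklearn.cluster import AgglomerativeClustering

points = [[1.0, 2.0], [1.2, 1.8], [8.0, 8.5], [8.2, 8.0], [0.9, 2.1]]

# Merge the closest groups until only two clusters remain.
model = AgglomerativeClustering(n_clusters=2, linkage="average")
labels = model.fit_predict(points)
print(labels)   # e.g. [0 0 1 1 0] - cluster membership of each point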
Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of the objects
in the same cluster. In each iteration, a cluster is split up into smaller clusters. This is done
until each object is in one cluster or the termination condition holds. This method is rigid,
i.e., once a merging or splitting is done, it can never be undone.
Approaches to Improve Quality of Hierarchical Clustering
Here are the two approaches that are used to improve the quality of hierarchical clustering −
•
Perform careful analysis of object linkages at each hierarchical partitioning.
•
Integrate hierarchical agglomeration by first using a hierarchical agglomerative
algorithm to group objects into micro-clusters, and then performing macro-clustering
on the micro-clusters.
Density-based Method
This method is based on the notion of density. The basic idea is to continue growing the given
cluster as long as the density in the neighborhood exceeds some threshold, i.e., for each data
point within a given cluster, the neighborhood of a given radius has to contain at least a
minimum number of points.
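DBSCAN, mentioned earlier, is a widely used density-based algorithm. Below is a hedged usage sketch with scikit-learn (assuming it is installed); the eps and min_samples values and the points are illustrative:

# Minimal sketch of density-based clustering with DBSCAN (scikit-learn).
from sklearn.cluster import DBSCAN

points = [[1.0, 1.1], [1.1, 1.0], [0.9, 1.2],       # a dense region
          [5.0, 5.1], [5.1, 5.0], [4.9, 5.2],       # another dense region
          [9.0, 0.0]]                                # an isolated point (noise)

# eps = neighborhood radius, min_samples = minimum points required within it.
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(points)
print(labels)   # noise points are labeled -1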
Grid-based Method
In this method, the objects together form a grid: the object space is quantized into a finite
number of cells that form a grid structure, and clustering operations are performed on it.
Advantages
•
The major advantage of this method is fast processing time.
•
It is dependent only on the number of cells in each dimension in the quantized space.
Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of data for a given
model. This method locates the clusters by clustering the density function. It reflects spatial
distribution of the data points.
This method also provides a way to automatically determine the number of clusters based on
standard statistics, taking outlier or noise into account. It therefore yields robust clustering
methods.
Constraint-based Method
In this method, the clustering is performed by the incorporation of user or application-oriented
constraints. A constraint refers to the user expectation or the properties of desired clustering
results. Constraints provide us with an interactive way of communication with the clustering
process. Constraints can be specified by the user or the application requirement.
PARTITIONING METHODS
Database has n objects and k partitions where k<=n; each partition is a cluster.
Partitioning criterion = Similarity function:
Objects within a cluster are similar; objects of different clusters are dissimilar.
Classical Partitioning Methods: k-means and k-mediods:
(A) Centroid-based technique: The k-means method:
- Cluster similarity is measured using mean value of objects in the cluster (or
clusters center of gravity)
- Randomly select k objects. Each object is a cluster mean or center.
- Each of the remaining objects is assigned to the most similar cluster – based
on the distance between the object and the cluster mean.
- Compute new mean for each cluster.
- This process of reassignment and mean recomputation iterates until the
partitioning criterion function converges.
- This algorithm determines k partitions that minimize the squared-error function.
- The squared-error function is defined as:
E = Σ (i = 1 to k) Σ (x ∈ Ci) |x − mi|²
where x is the point in space representing an object and mi is the mean of cluster Ci.
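A minimal from-scratch sketch of the k-means iteration described above; the 2-D points, the value of k, and the iteration cap are illustrative:

import math, random

# Minimal sketch of the k-means method: assign each object to the nearest
# cluster mean, then recompute the means, until the means stop changing.
def kmeans(points, k, iterations=100):
    means = random.sample(points, k)                 # k initial cluster centers
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                             # assignment step
            i = min(range(k), key=lambda c: math.dist(p, means[c]))
            clusters[i].append(p)
        new_means = [
            tuple(sum(v) / len(c) for v in zip(*c)) if c else means[i]
            for i, c in enumerate(clusters)
        ]
        if new_means == means:                       # converged
            break
        means = new_means
    return means, clusters

points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.3, 4.9), (0.9, 1.1)]
means, clusters = kmeans(points, k=2)
print(means)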
HIERARCHICAL METHODS
This works by grouping data objects into a tree of clusters. Two types – Agglomerative and
Divisive.
Clustering algorithms with integrated approach of these two types are BIRCH, CURE, ROCK
and CHAMELEON.
BIRCH – Balanced Iterative Reducing and Clustering using Hierarchies:
- Integrated Hierarchical Clustering algorithm.
- Introduces two concepts – Clustering Feature (CF) and CF tree (Clustering Feature
Tree).
- CF trees give a summarized cluster representation, which helps achieve good speed
and clustering scalability.
- Good for incremental and dynamic clustering of incoming data points.
- The Clustering Feature CF is the summary statistic for a subcluster, defined as the
triple CF = (N, LS, SS), where N is the number of points in the subcluster, LS is the
linear sum of the N points (Σ xi), and SS is the square sum of the data points (Σ xi²).
A small sketch after this BIRCH outline shows how CFs are computed and merged.
- CF Tree – A height-balanced tree that stores the Clustering Features.
- It has two parameters – a branching factor B and a threshold T. The branching
factor specifies the maximum number of children per non-leaf node.
- The threshold parameter T is the maximum diameter of subclusters stored at the
leaf nodes. Changing the threshold value changes the size of the tree.
- The non-leaf nodes store the sums of their children’s CFs, thus summarizing
information about their children.
- The BIRCH algorithm has the following two phases:
o Phase 1: Scan database to build an initial in-memory CF tree – Multi-level
compression of the data – Preserves the inherent clustering structure of
the data.
The CF tree is built dynamically as data points are inserted into
the closest leaf entry.
If the diameter of the subcluster in the leaf node after insertion
becomes larger than the threshold, then the leaf node (and
possibly other nodes) is split.
After a new point is inserted, the information about it is
passed towards the root of the tree.
If the CF tree grows larger than the available main memory, a
larger threshold value is specified and the CF tree is rebuilt.
The rebuild process works from the leaf entries of the old tree, so
the data has to be read from the database only once.
o Phase 2: Apply a (selected) clustering algorithm to cluster the leaf entries of the
CF tree.
Advantages:
o Produces best clusters with available resources.
o Minimizes the I/O time
o Computational complexity of this algorithm is O(N), where N is the number of
objects to be clustered.
Disadvantage:
o Not a natural way of clustering;
o Does not work for non-spherical shaped clusters.
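As promised above, here is a minimal sketch of BIRCH's clustering feature: computing CF = (N, LS, SS) for a set of points and merging two CFs by addition. The points are illustrative, and this shows only the summary statistic, not the CF tree itself.

# Minimal sketch of BIRCH clustering features: CF = (N, LS, SS).
def clustering_feature(points):
    n = len(points)
    ls = [sum(p[d] for p in points) for d in range(len(points[0]))]  # linear sum
    ss = sum(v * v for p in points for v in p)                       # square sum
    return n, ls, ss

def merge_cf(cf1, cf2):
    # CFs are additive, which is what makes incremental clustering cheap.
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return n1 + n2, [a + b for a, b in zip(ls1, ls2)], ss1 + ss2

cf_a = clustering_feature([(1.0, 2.0), (2.0, 2.0)])
cf_b = clustering_feature([(1.5, 1.5)])
print(merge_cf(cf_a, cf_b))   # (3, [4.5, 5.5], 17.5)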
CURE – Clustering Using Representatives:
- Integrates hierarchical and partitioning algorithms.
- Handles clusters of different shapes and sizes; Handles outliers separately.
- Here a set of representative points is used to represent a cluster.
- These points are generated by first selecting well-scattered points in a cluster
and shrinking them towards the center of the cluster by a specified fraction
(the shrinking factor).
- Closest pair of clusters are merged at each step of the algorithm.
- Having more than one representative point per cluster allows CURE to handle
clusters of non-spherical shape.
Shrinking helps to identify the outliers.
To handle large databases – CURE employs a combination of random sampling
and partitioning.
The resulting clusters from these samples are again merged to get the final cluster.
CURE Algorithm:
o Draw a random sample s
o Partition sample s into p partitions each of size s/p
o Partially cluster the partitions into s/pq clusters, where q > 1
o Eliminate outliers – if a cluster grows too slowly, eliminate it
o Cluster the partial clusters
o Mark the data with the corresponding cluster labels
Advantage:
o High quality clusters
o Removes outliers
o Produces clusters of different shapes & sizes
o Scales for large database
Disadvantage:
o Needs parameters – Size of the random sample; Number of Clusters and
Shrinking factor
o These parameter settings have significant effect on the results.
ROCK:
- An agglomerative hierarchical clustering algorithm.
- Suitable for clustering categorical attributes.
- It measures the similarity of two clusters by comparing the aggregate
inter-connectivity of the two clusters against a user-specified static
inter-connectivity model.
- The inter-connectivity of two clusters C1 and C2 is defined by the number of
cross links between the two clusters, where link(pi, pj) is the number of common
neighbors between two points pi and pj.
Two steps:
o First construct a sparse graph from a given data similarity matrix using a
similarity threshold and the concept of shared neighbors.
o Then performs a hierarchical clustering algorithm on the sparse graph.
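A minimal sketch of the link(pi, pj) notion used by ROCK; the records, the overlap similarity function, and the threshold are all illustrative assumptions:

# Minimal sketch of ROCK's link(pi, pj): the number of common neighbors.
def neighbors(point, points, similarity, threshold):
    return {q for q in points if q != point and similarity(point, q) >= threshold}

def link(p, q, points, similarity, threshold):
    return len(neighbors(p, points, similarity, threshold) &
               neighbors(q, points, similarity, threshold))

# Illustrative categorical records and a simple overlap (Jaccard) similarity.
records = [("a", "x"), ("a", "y"), ("a", "z"), ("b", "z")]
overlap = lambda r1, r2: len(set(r1) & set(r2)) / len(set(r1) | set(r2))

print(link(records[0], records[1], records, overlap, threshold=0.3))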
CHAMELEON – A hierarchical clustering algorithm using dynamic modeling:
- In this clustering process, two clusters are merged if the inter-connectivity
and closeness (proximity) between two clusters are highly related to the
internal interconnectivity and closeness of the objects within the clusters.
- This merge process produces natural and homogeneous clusters.
- Applies to all types of data as long as the similarity function is specified.
- It first uses a graph partitioning algorithm to cluster the data items into a
large number of small sub-clusters.
- Then it uses an agglomerative hierarchical clustering algorithm to find the
genuine clusters by repeatedly combining the sub-clusters created by the graph
partitioning algorithm.
- To determine the pairs of most similar sub-clusters, it considers both the
inter-connectivity and the closeness of the clusters.
- The objects are represented using a k-nearest-neighbor graph: each vertex of the
graph represents an object, and an edge connects two vertices (objects) if one
object is among the k most similar objects of the other.
- The graph is partitioned by removing the edges in sparse regions and keeping the
edges in dense regions; each of the partitioned sub-graphs forms a (sub-)cluster.
- The final clusters are then formed by iteratively merging the clusters from the
previous cycle based on their inter-connectivity and closeness.
- CHAMELEON determines the similarity between each pair of clusters Ci and Cj
according to their relative inter-connectivity RI(Ci, Cj) and their relative
closeness RC(Ci, Cj), defined in terms of:
- EC{Ci,Cj} = edge-cut of the cluster containing both Ci and Cj
- EC_Ci = size of the min-cut bisector of cluster Ci
- S_EC{Ci,Cj} = average weight of the edges that connect vertices in Ci to
vertices in Cj
- S_EC_Ci = average weight of the edges that belong to the min-cut bisector of
cluster Ci
RI(Ci, Cj) = |EC{Ci,Cj}| / ( (|EC_Ci| + |EC_Cj|) / 2 )
RC(Ci, Cj) = S_EC{Ci,Cj} / ( (|Ci| / (|Ci| + |Cj|)) · S_EC_Ci + (|Cj| / (|Ci| + |Cj|)) · S_EC_Cj )
Advantages:
o More powerful than BIRCH and CURE.
o Produces arbitrary shaped clusters
Processing cost:
- O(n²) time in the worst case, where n = number of objects.
APPLICATIONS OF DATA MINING:
Data Mining is primarily used today by companies with a strong consumer
focus — retail, financial, communication, and marketing organizations, to “drill
down” into their transactional data and determine pricing, customer
preferences and product positioning, impact on sales, customer satisfaction
and corporate profits. With data mining, a retailer can use point-of-sale
records of customer purchases to develop products and promotions to
appeal to specific customer segments.
14 areas where data mining is widely used
Here is a list of 14 important areas where data mining is widely used:
Future Healthcare
Data mining holds great potential to improve health systems. It uses data and
analytics to identify best practices that improve care and reduce costs.
Researchers use data mining approaches like multi-dimensional databases,
machine learning, soft computing, data visualization and statistics. Mining
can be used to predict the volume of patients in every category. Processes
are developed that make sure that the patients receive appropriate care at
the right place and at the right time. Data mining can also help healthcare
insurers to detect fraud and abuse.
Market Basket Analysis
Market basket analysis is a modelling technique based upon a theory that if
you buy a certain group of items you are more likely to buy another group of
items. This technique may allow the retailer to understand the purchase
behaviour of a buyer. This information may help the retailer to know the
buyer’s needs and change the store’s layout accordingly. Using differential
analysis, results can be compared between different stores and between customers
in different demographic groups.
Education
There is a new emerging field, called Educational Data Mining (EDM), concerned with
developing methods that discover knowledge from data originating from
educational environments. The goals of EDM are identified as predicting
students’ future learning behaviour, studying the effects of educational
support, and advancing scientific knowledge about learning. Data mining can
be used by an institution to take accurate decisions and also to predict the
results of the student. With the results the institution can focus on what to
teach and how to teach. Learning pattern of the students can be captured
and used to develop techniques to teach them.
Manufacturing Engineering
Knowledge is the best asset a manufacturing enterprise would possess. Data
mining tools can be very useful to discover patterns in complex
manufacturing processes. Data mining can be used in system-level design to
extract the relationships between product architecture, product portfolio, and
customer needs data. It can also be used to predict the product development
span time, cost, and dependencies among other tasks.
CRM
Customer Relationship Management is all about acquiring and retaining
customers, also improving customers’ loyalty and implementing customer
focused strategies. To maintain a proper relationship with a customer, a
business needs to collect data and analyse the information. This is where data
mining plays its part. With data mining technologies the collected data can be
used for analysis. Instead of being confused where to focus to retain
customer, the seekers for the solution get filtered results.
Fraud Detection
Billions of dollars are lost to fraud every year. Traditional methods
of fraud detection are time-consuming and complex. Data mining aids in
providing meaningful patterns and turning data into information. Any
information that is valid and useful is knowledge. A perfect fraud detection
system should protect information of all the users. A supervised method
includes collection of sample records. These records are classified fraudulent
or non-fraudulent. A model is built using this data and the algorithm is made
to identify whether the record is fraudulent or not.
Intrusion Detection
Any action that will compromise the integrity and confidentiality of a resource
is an intrusion. The defensive measures to avoid an intrusion include user
authentication, avoiding programming errors, and information protection. Data
mining can help improve intrusion detection by adding a level of focus to
anomaly detection. It helps an analyst to distinguish an activity from common
everyday network activity. Data mining also helps extract data which is more
relevant to the problem.
Lie Detection
Apprehending a criminal is easy whereas bringing out the truth from him is
difficult. Law enforcement can use mining techniques to investigate crimes,
monitor communication of suspected terrorists. This field includes text mining
also. This process seeks to find meaningful patterns in data which is usually
unstructured text. The data sample collected from previous investigations are
compared and a model for lie detection is created. With this model processes
can be created according to the necessity.
Customer Segmentation
Traditional market research may help us to segment customers but data
mining goes deeper and increases market effectiveness. Data mining aids in
aligning customers into distinct segments and in tailoring offerings to the needs
of those customers. Marketing is always about retaining the customers.
Data mining allows to find a segment of customers based on vulnerability and
the business could offer them with special offers and enhance satisfaction.
Financial Banking
With computerised banking everywhere huge amount of data is supposed to
be generated with new transactions. Data mining can contribute to solving
business problems in banking and finance by finding patterns, causalities,
and correlations in business information and market prices that are not
immediately apparent to managers because the volume of data is too large or it is
generated too quickly for experts to screen. Managers may use this
information for better segmenting, targeting, acquiring, retaining and
maintaining profitable customers.
Corporate Surveillance
Corporate surveillance is the monitoring of a person or group’s behaviour by
a corporation. The data collected is most often used for marketing purposes
or sold to other corporations, but is also regularly shared with government
agencies. It can be used by the business to tailor their products desirable by
their customers. The data can be used for direct marketing purposes, such as
the targeted advertisements on Google and Yahoo, where ads are targeted
to the user of the search engine by analyzing their search history and emails.
Research Analysis
History shows that we have witnessed revolutionary changes in research.
Data mining is helpful in data cleaning, data pre-processing and integration of
databases. The researchers can find any similar data from the database that
might bring any change in the research. Identification of any co-occurring
sequences and the correlation between any activities can be known. Data
visualisation and visual data mining provide us with a clear view of the data.
Criminal Investigation
Criminology is a process that aims to identify crime characteristics. Actually
crime analysis includes exploring and detecting crimes and their relationships
with criminals. The high volume of crime datasets and also the complexity of
relationships between these kinds of data have made criminology an
appropriate field for applying data mining techniques. Text based crime
reports can be converted into word processing files. These information can
be used to perform crime matching process.
Bio Informatics
Data Mining approaches seem ideally suited for Bioinformatics, since it is
data-rich. Mining biological data helps to extract useful knowledge from
massive datasets gathered in biology, and in other related life sciences areas
such as medicine and neuroscience. Applications of data mining to
bioinformatics include gene finding, protein function inference, disease
diagnosis, disease prognosis, disease treatment optimization, protein and
gene interaction network reconstruction, data cleansing, and protein subcellular location prediction.
Another potential application of data mining is the automatic recognition of patterns that
were not previously known. Imagine if you had a tool that could automatically search
your database to look for patterns which are hidden. If you had access to this technology,
you would be able to find relationships that could allow you to make strategic decisions.
Data mining is becoming a pervasive technology in activities as diverse as using historical
data to predict the success of a marketing campaign, looking for patterns in financial
transactions to discover illegal activities or analyzing genome sequences. From this
perspective, it was just a matter of time for the discipline to reach the important area of
computer security.
Applications of Data Mining in Computer Security presents a collection of research efforts
on the use of data mining in computer security.
Data mining has been loosely defined as the process of extracting information from large
amounts of data. In the context of security, the information we are seeking is the
knowledge of whether a security breach has been experienced, and if the answer is yes,
who is the perpetrator. This information could be collected in the context of discovering
intrusions that aim to breach the privacy of services, data in a computer system or
alternatively, in the context of discovering evidence left in a computer system as part of
criminal activity.
Applications of Data Mining in Computer Security concentrates heavily on the use of data
mining in the area of intrusion detection. The reason for this is twofold. First, the volume of
data dealing with both network and host activity is so large that it makes it an ideal
candidate for using data mining techniques. Second, intrusion detection is an extremely
critical activity. This book also addresses the application of data mining to computer
forensics. This is a crucial area that seeks to address the needs of law enforcement in
analyzing the digital evidence.
Applications of Data Mining in Computer Security is designed to meet the needs of a
professional audience composed of researchers and practitioners in industry and
graduate level students in computer science.
SOCIAL IMPACTS OF DATA MINING
Data Mining can offer the individual many benefits by improving customer service and
satisfaction, and lifestyle in general. However, it also has serious implications regarding
one’s right to privacy and data security.
Is Data Mining Hype or a Persistent, Steadily Growing Business?
Data mining has recently become a very popular area for research, development and
business, as it has become an essential tool for deriving knowledge from data to help
business people in the decision-making process.
The adoption of data mining technology typically passes through the following phases:
Innovators
Early Adopters
Chasm
Early Majority
Late Majority
Laggards
Is Data Mining Merely a Manager’s Business or Everyone’s Business?
Data Mining will surely help company executives a great deal in understanding the market
and their business. However, one can expect that everyone will have needs and means of
data mining as it is expected that more and more powerful, user friendly, diversified and
affordable data mining systems or components are made available.
Data Mining can also have multiple personal uses such as:
Identifying patterns in medical applications
To choose the best companies based on customer service
To classify email messages, etc.
Is Data Mining a threat to Privacy and Data Security?
With more and more information accessible in electronic forms and available on the web
and with increasingly powerful data mining tools being developed and put into use, there
are increasing concern that data mining may pose a threat to our privacy and data security.
Data Privacy:
In 1980, the Organisation for Economic Co-operation and Development (OECD) established a
set of international guidelines, referred to as fair information practices. These guidelines aim
to protect privacy and data accuracy.
They include the following principles:
Purpose specification and use limitation.
Openness
Security Safeguards
Individual Participation
Data Security:
Many data security enhancing techniques have been developed to help protect data.
Databases can employ a multilevel security model to classify and restrict data according
to various security levels with users permitted access to only their authorized level.
Some of the data security techniques are:
Encryption
Intrusion Detection
Secure multiparty computation
Data obscuration