Data Mining: Concepts and Techniques — Slides for Textbook

Chapter I
Introduction
MIS 463
Fall 2011
Chapter 1. Introduction

Motivation: Why data mining?

Methodology of Knowledge Discovery in Databases

Data mining functionalities

Are all the patterns interesting?

Business applications of data mining
Motivation: “Necessity is the
Mother of Invention”

Data explosion problem

Automated data collection tools and mature database technology
lead to tremendous amounts of data stored in databases, data
warehouses and other information repositories

Need to convert such data into knowledge and information

Applications

Business management

Production control

Market analysis

Engineering design

Science exploration
Evolution of Database Technology (1)

Data collection, database creation
Data management
    data storage and retrieval
    database transaction processing
Data analysis and understanding
    data mining and data warehousing
Evolution of Database Technology (2) (see Fig. 1.1, Han)

1960s:
    Data collection, database creation, primitive file processing, hierarchical and network DBMS
1970s:
    Relational data model, relational DBMS implementation
    Query languages like SQL (Structured Query Language)
    Online transaction processing
1980s:
    Advanced DBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
    Data warehousing, data mining, OLAP, multimedia databases, and Web databases
1990s-2000s:
    Web-based database systems: XML-based database systems, web mining
Developments in computer hardware

Powerful and affordable computers

Data collection equipment

Storage media

Communication and networking
Data Warehouse

Repository of multiple heterogeneous data sources, organized under a unified schema at a single site in order to facilitate management decision making.
Data warehouse technology includes:
    Data cleaning
    Data integration
    On-Line Analytical Processing (OLAP): techniques that support multidimensional analysis and decision making, with the following functionalities:
        summarization
        consolidation
        aggregation
        viewing information from different angles
but additional data analysis tools are needed for
    classification
    clustering
    characterization of data changing over time
Data-rich, information-poor state

Abundance of data AND need for powerful data analysis tools
“data tombs”: data archives that are seldom visited
Important decisions are made
    not on the information-rich data stored in databases
    but on a decision maker’s intuition
No tool to extract knowledge embedded in vast amounts of data
Current expert system technology
    Users or domain experts manually input knowledge, which is time consuming, costly, and prone to biases and errors
What Is Data Mining?

Data mining (knowledge discovery in databases):
    Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
Alternative names and their “inside stories”:
    Data mining: a misnomer? Gold mining vs. sand mining: mining is usually named after what is extracted, so “knowledge mining” would be the more accurate term
    Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
What is not data mining?
    Query processing
    Expert systems or small ML/statistical programs
Data Mining vs. Data Query

Data query, e.g.:
    A list of all customers who used a credit card to buy a PC
    A list of all MIS students who have a GPA of 3.5 or higher and have studied four or fewer semesters
Data mining problems, e.g.:
    What is the likelihood that a customer purchases a PC with a credit card?
    Given the characteristics of an MIS student, predict her SPA in the coming term
    What are the characteristics of MIS undergraduate students?
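The contrast between a query and a mining question can be sketched in a few lines of Python. The purchase records and customer names below are hypothetical, made up only for illustration: the query returns rows that already exist, while the mining-style question estimates a likelihood that is stored nowhere explicitly.

```python
# Hypothetical purchase records: (customer, item, payment method)
purchases = [
    ("Ann", "PC", "credit card"),
    ("Bob", "PC", "cash"),
    ("Cem", "printer", "credit card"),
    ("Dee", "PC", "credit card"),
    ("Eve", "PC", "cash"),
    ("Fay", "PC", "credit card"),
]

# Data query: retrieve records that already exist in the database.
pc_by_card = [name for name, item, pay in purchases
              if item == "PC" and pay == "credit card"]

# Data mining flavour: estimate a likelihood from the data, here the
# fraction of PC purchases that were paid by credit card.
pc_payments = [pay for _, item, pay in purchases if item == "PC"]
likelihood = pc_payments.count("credit card") / len(pc_payments)

print(pc_by_card)   # customers returned by the query
print(likelihood)   # estimated P(credit card | PC purchase)
```

A real mining task would condition this estimate on customer characteristics; the sketch only shows the difference in the kind of answer.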
Why Data Mining?

Four questions to be answered
    Can the problem be clearly defined?
    Does potentially meaningful data exist?
    Does the data contain hidden knowledge, or is it useful only for reporting purposes?
    Will the cost of processing the data be less than the likely increase in profit from the knowledge gained by applying a data mining project?
Steps of a KDD Process (1)

1. Goal identification:
    define the problem
    relevant prior knowledge and goals of the application
2. Creating a target data set: data selection
3. Data preprocessing (may take 60%-80% of the effort!):
    removal of noise or outliers
    strategies for handling missing data fields
    accounting for time sequence information
4. Data reduction and transformation:
    find useful features, dimensionality/variable reduction, invariant representation
Steps of a KDD Process (2)

5. Data mining:
    choosing the functions of data mining: summarization, classification, regression, association, clustering
    choosing the mining algorithm(s): which models or parameters
    search for patterns of interest
6. Presentation and evaluation:
    visualization, transformation, removing redundant patterns, etc.
7. Taking action:
    incorporating into the performance system
    documenting
    reporting to interested parties
An example: Customer Segmentation

1. The marketing department wants to perform a segmentation study on the customers of AE Company
2. Decide on relevant variables from a data warehouse on customers, sales, promotions
    Customers: name, ID, income, age, education, ...
    Sales: history of sales
    Promotions: promotion types, durations, ...
3. Handle missing income, addresses, ...; determine outliers, if any
4. Generate new index variables representing the wealth of customers
    Wealth = a*income + b*#houses + c*#cars + ...
    Make necessary transformations (e.g., z-scores) so that some data mining algorithms work more efficiently
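Step 4 can be sketched as follows. The customer values and the weights a, b, c are hypothetical (the slides do not give them); the z-score transformation rescales the index to mean 0 and standard deviation 1.

```python
from statistics import mean, pstdev

# Hypothetical customers: (income, number of houses, number of cars)
customers = [(1000, 0, 1), (2500, 1, 1), (4000, 2, 2), (1500, 0, 0)]

# Hypothetical weights for Wealth = a*income + b*#houses + c*#cars
a, b, c = 1.0, 500.0, 100.0
wealth = [a * inc + b * houses + c * cars for inc, houses, cars in customers]

def z_scores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

wealth_z = z_scores(wealth)
print(wealth)
print([round(z, 2) for z in wealth_z])
```

Standardizing matters for distance-based methods such as k-means, where a variable measured in large units would otherwise dominate the distance.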
Example: Customer Segmentation (cont.)

5.a: Choose clustering as the data mining functionality: it is the natural one for a segmentation study, since the aim is to find groups of customers with similar characteristics
5.b: Choose a clustering algorithm
    k-means, k-medoids, or any one suitable for the problem
5.c: Apply the algorithm
    find clusters or segments
6. Make reverse transformations, visualize the customer segments
7. Present the results in the form of a report to the marketing department
    Implement the segmentation as part of a DSS so that it can be applied repeatedly at certain intervals as new customers arrive
    Develop marketing strategies for each segment
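Steps 5.b and 5.c can be illustrated with a minimal k-means sketch (Lloyd's algorithm). The data points and the choice of starting centroids are hypothetical; a real study would use a library implementation and several random restarts.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical standardized customers: (wealth z-score, age z-score)
points = [(-1.2, -1.0), (-1.0, -0.8), (-0.9, -1.1),
          (1.0, 0.9), (1.1, 1.2), (0.9, 1.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
for c, cl in zip(centroids, clusters):
    print(c, cl)
```

Each resulting cluster is one candidate customer segment; step 6 would map the centroids back to the original units before reporting.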
Data Mining: A KDD Process

Data mining: the core of the knowledge discovery process.
[Figure: the KDD process as a pipeline: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Data Selection and Data Transformation -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
Architecture of a Typical Data Mining System

[Figure: layered architecture, top to bottom: Graphical user interface; Pattern evaluation; Data mining engine; Database or data warehouse server; Data cleaning & data integration and Filtering; Databases and Data Warehouse; with a Knowledge-base alongside]
Architecture of a Typical Data Mining System

Database, data warehouse
Database or data warehouse server
Knowledge base
    concept hierarchies
    user beliefs: assess a pattern’s interestingness
    other thresholds
Data mining engine
    functional modules: characterization, association, classification, cluster analysis, evolution and deviation analysis
Pattern evaluation module
Graphical user interface
Data Mining: Confluence of Multiple Disciplines

[Figure: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning, Information Science, Visualization, and Other Disciplines]
Efficient and Scalable Techniques

For an algorithm to be efficient and scalable, its running time should be predictable and acceptable
How?
    Parallel and distributed algorithms
    Sampling from databases
Two Styles of Data Mining

Descriptive data mining
    characterizes the general properties of the data in the database
    finds patterns in data, and the user determines which ones are important
Predictive data mining
    performs inference on the current data to make predictions
    we know what to predict
Not mutually exclusive
    used together
    Descriptive -> predictive
    E.g., customer segmentation (descriptive, by clustering) followed by a risk assignment model (predictive, by ANN)
Supervised vs. Unsupervised Learning

Supervised learning (classification, prediction)
    Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
    New data is classified based on the training set
Unsupervised learning (summarization, association, clustering)
    The class labels of the training data are unknown
    Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Descriptive Data Mining (1)

Discovering new patterns inside the data
Used during the data exploration steps
Typical questions answered by descriptive data mining
    what is in the data
    what does it look like
    are there any unusual patterns
    what does the data suggest for customer segmentation
Users may have no idea which kinds of patterns may be interesting
Descriptive Data Mining (2)

Patterns at various granularities
    geography: country - city - region - street
    student: university - faculty - department - minor
Functionalities of descriptive data mining
    Clustering
        Ex: customer segmentation
    Summarization
    Visualization
    Association
        Ex: market basket analysis
A model is a black box

X: vector of independent variables or inputs
Y = f(X): an unknown function
Y: dependent variables or output
    a single variable or a vector
[Figure: inputs X1, X2 -> Model -> output Y]
The user does not care what the model is doing
    it is a black box
    the user is interested in the accuracy of its predictions
Predictive Data Mining (1)

Using known examples, the model is trained
    the unknown function is learned from data
    the more data with known outcomes is available, the better the predictive power of the model
Used to predict outcomes whose inputs are known but whose output values are not yet realized
Never 100% accurate
Predictive Data Mining (2)

The performance of a model on past data, i.e., in predicting the already known outcomes, is not important
Its performance on unknown data is much more important
Typical questions answered by predictive models

Who is likely to respond to our next offer?
    based on the history of previous marketing campaigns
Which customers are likely to leave in the next six months?
What transactions are likely to be fraudulent?
    based on known examples of fraud
What is the total amount a customer will spend in the next month?
Data Mining Functionalities (1)

Concept description: characterization and discrimination
    Generalize, summarize, and contrast data characteristics, e.g., big spenders vs. budget spenders
Association (correlation and causality)
    Multi-dimensional vs. single-dimensional association
    age(X, “20..29”) ^ income(X, “20..29K”) => buys(X, “PC”) [support = 2%, confidence = 60%]
    contains(T, “computer”) => contains(T, “software”) [support = 1%, confidence = 75%]
Data Mining Functionalities (2)

Classification and prediction
    Finding models (functions) that describe and distinguish classes or concepts for future prediction
    E.g., classify people as healthy or sick, or classify transactions as fraudulent or not
    Methods: decision tree, classification rule, neural network
    Prediction: predict some unknown or missing numerical values
Cluster analysis
    Class label is unknown: group data to form new classes, e.g., cluster customers of a retail company to learn about characteristics of different segments
    Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity
Data Mining Functionalities (3)

Outlier analysis

Outlier: a data object that does not comply with the general behavior
of the data

It can be considered as noise or exception but is quite useful in fraud
detection, rare events analysis


Trend and evolution analysis

Trend and deviation: regression analysis

Sequential pattern mining: click stream analysis

Similarity-based analysis
Other pattern-directed or statistical analyses
Concept Description

Characterization
Discrimination
Data can be associated with classes or concepts
    classes of items for sale: computers, printers
    concepts of customers: bigSpenders, budgetSpenders
Data Characterization

Summarizing the data of the class under study (target class)
Methods
    SQL queries
    OLAP roll-up operation
        user-controlled data summarization along a specified dimension
    attribute-oriented induction
        without step-by-step user interaction
The output of characterization
    pie charts, bar charts, curves, multidimensional data cubes, or cross tabs
    in rule form, as characteristic rules
Characterization example

Description summarizing the characteristics of customers who spend more than $1000 a year at AllElectronics
    age, employment, income
    drill down on any dimension
        on occupation: view these customers according to their type of employment
Data Discrimination

Comparing the target class with one or a set of comparative classes (contrasting classes)
    these classes can be specified by the use of database queries
Methods and output
    similar to those used for characterization
    include comparative measures to distinguish between the target and contrasting classes
Discrimination examples

Example 1: Compare the general features of software products
    whose sales increased by 10% in the last year (target class)
    whose sales decreased by at least 30% during the same period (contrasting class)
Example 2: Compare two groups of AE customers
    I) who shop for computer products regularly (target class)
        more than two times a month
    II) who rarely shop for such products (contrasting class)
        less than three times a year
The resulting description:
    80% of group I customers
        university education
        ages 20-40
    60% of group II customers
        seniors or young
        no university degree
Multidimensional Data

Sales according to region, month and product type
Dimensions: Product, Location, Time
Hierarchical summarization paths
    Product:  Industry > Category > Product
    Location: Region > Country > City > Office
    Time:     Year > Quarter > Month or Week > Day
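Summarizing along a hierarchy (a roll-up) amounts to grouping the same facts by a coarser key. The sales facts below are hypothetical and only illustrate moving from the City level to the Country level of the Location dimension, and combining it with the Quarter level of Time.

```python
from collections import defaultdict

# Hypothetical sales facts: (city, country, month, quarter, product, amount)
sales = [
    ("Istanbul", "Turkey", "Jan", "Q1", "PC", 100),
    ("Ankara", "Turkey", "Feb", "Q1", "PC", 50),
    ("Istanbul", "Turkey", "Apr", "Q2", "printer", 30),
    ("Berlin", "Germany", "Jan", "Q1", "PC", 70),
]

def roll_up(facts, key):
    """Sum the sales amount for each value of the grouping key."""
    totals = defaultdict(int)
    for fact in facts:
        totals[key(fact)] += fact[-1]
    return dict(totals)

by_city = roll_up(sales, key=lambda f: f[0])                      # City level
by_country = roll_up(sales, key=lambda f: f[1])                   # rolled up to Country
by_country_quarter = roll_up(sales, key=lambda f: (f[1], f[3]))   # Country x Quarter
print(by_city)
print(by_country)
print(by_country_quarter)
```

Drill-down is the reverse move: regrouping by a finer key such as city or month.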
Association Analysis

Discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data
Widely used for
    market basket analysis
    transaction data analysis
More formally
    X => Y, that is,
    A1 ^ A2 ^ ... ^ Ak => B1 ^ B2 ^ ... ^ Bl
    where each Ai and Bj is an attribute-value pair or predicate
Example: association analysis

From the AllElectronics (AE) database
    age(X, “20..29”) ^ income(X, “1,000...2,000”) => buys(X, “CD player”)
    (support = 2%, confidence = 60%)
X is a variable representing a customer
2% of the AE customers
    are between 20 and 29 years of age,
    have incomes ranging from 1 to 2 billion TL, and
    buy a CD player
There is a 60% probability that customers in those age and income groups will buy a CD player
A multidimensional association rule
    contains more than one attribute or predicate
Market basket analysis

Customers’ buying behaviour is investigated
Based only on the transaction data
    no information about customer properties: age, income
Managers
    are interested in which products or product groups are sold together
Transactional Database

Transaction ID   Item List
10001            Computer, CD, printer
10002            Plotter, monitor, mouse
10003            Computer, DVD player
10004            Printer
10005            Plotter, UPS, modem
Example: basket analysis rule

buys(computer) => buys(printer)
(support = 1%, confidence = 60%)
1% of all transactions contain both computer and printer
If a transaction contains computer
    there is a 60% chance that it contains printer as well
A single-dimensional association rule
    contains a single predicate
An association rule is interesting if
    its support exceeds a minimum threshold and
    its confidence exceeds a minimum threshold
These minimum values are set by specialists
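Support and confidence can be computed directly from a transaction list. The sketch below uses the five transactions of the Transactional Database table (item names lower-cased); the function names are illustrative. On this tiny table the values naturally differ from the 1% and 60% quoted for the rule, which refer to a full-size database.

```python
# The five transactions from the Transactional Database table.
transactions = {
    10001: {"computer", "cd", "printer"},
    10002: {"plotter", "monitor", "mouse"},
    10003: {"computer", "dvd player"},
    10004: {"printer"},
    10005: {"plotter", "ups", "modem"},
}

def support(itemset):
    """Fraction of all transactions that contain every item in itemset."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(lhs, rhs):
    """Estimated P(rhs | lhs): support(lhs and rhs) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)

print(support({"computer", "printer"}))        # 1 of the 5 transactions
print(confidence({"computer"}, {"printer"}))   # 1 of the 2 computer transactions
```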
Classification



Learning is supervised
Dependent variable is categorical
Build a model able to assign new
instances to one of a set of well-defined
classes
Typical Classification Problems

Given the characteristics of individuals, differentiate those who have suffered a heart attack from those who have not
Determine if a credit card purchase is fraudulent
Classify a car loan applicant as a good or a poor credit risk
Methods of Classification

Decision Trees
Artificial Neural Networks
Bayesian Classification
    Naïve
    Belief Networks
k-nearest neighbor
Regression
    Logistic (logit), probit
        predicts the probability of each class
        used when the dependent variable is categorical
            good customer / bad customer, or employed / unemployed
Steps of classification process

(1) Train the model
    using a training set
    data objects whose class labels are known
(2) Test the model
    on a test sample
    whose class labels are known but not used for training the model
(3) Use the model for classification
    on new data whose class labels are unknown
An example - classification

Historical data (training set): each customer type is known, i.e., each customer has a label

ID   age   income   education   Type
1    35    800      undergrad   risky
2    26    600      HighSch     risky
3    48    1200     grad        normal
8    52    2500     undergrad   good
44   29    1700     HighSch     good

Testing set: labels are also known but not used in training the model

CustID   age   income   education   Type
17       43    550      Ph.D.       risky
27       68    1650     grad        normal

New customers: the type has to be estimated; each new customer has to be classified as risky, normal or good

CustID   age   income   education   Type
11       36    850      undergrad   ?
27       28    1650     grad        ?
An example – classification (cont.)

Based on historical data, develop a classification model
    Decision tree, neural network, regression, ...
Test the performance of the model on a portion of the historical data
If the accuracy of the model is satisfactory
    use the model on the new customers 11 and 27 to assign a type to these new customers
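The train / test / use cycle can be sketched on the customer tables of this example. The classifier here is a 1-nearest-neighbour rule on (age, income), chosen only because it fits in a few lines; the slides do not prescribe it. As simplifying assumptions, education is ignored and the variables are not rescaled, so income dominates the distance.

```python
import math

# (age, income) -> type, copied from the three customer tables.
train = [((35, 800), "risky"), ((26, 600), "risky"), ((48, 1200), "normal"),
         ((52, 2500), "good"), ((29, 1700), "good")]
test = [((43, 550), "risky"), ((68, 1650), "normal")]
new_customers = {11: (36, 850), 27: (28, 1650)}

def classify(x):
    """1-nearest-neighbour: return the label of the closest training customer.

    Step 1 (training) is trivial for this method: the model is the stored
    training set itself.
    """
    nearest = min(train, key=lambda row: math.dist(row[0], x))
    return nearest[1]

# Step 2: test on labelled data that was not used for training.
accuracy = sum(classify(x) == label for x, label in test) / len(test)

# Step 3: classify the new, unlabelled customers.
predictions = {cust_id: classify(x) for cust_id, x in new_customers.items()}
print(accuracy, predictions)
```

With only two test customers the accuracy estimate is very coarse; in practice a much larger hold-out sample is needed before trusting the model on new customers.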
Example AE customers

[Figure: scatter plot of AE customers, age against yearly income, with each point marked as good or risky]
Example AE customers

[Figure: the same scatter plot of age against yearly income, with a new customer marked "?"]
Assign the new customer, whose type is unknown, to either * or +
Solution

[Figure: scatter plot with x1: yearly income and x2: age, split at yearly income = 1000 and age = 35 into good and risky regions]
Rule: IF yearly income > 1000 AND age > 35 THEN good ELSE risky
Credit Card Promotion Policy

Credit card companies
    include promotional offerings with their monthly credit card billing
    offers provide the opportunity to purchase items such as magazines, …
A data mining study
    predict individual behaviour
    what is the likelihood of an individual taking advantage of the promotions?
    based on individual characteristics, credit history, ...
    expected reduction in postage, paper and processing costs for the credit card company
Credit Card Promotion Database

Income    Magazine    Watch       Life Insurance   Gender   Age   Credit Card
Range     Promotion   Promotion   Promotion                       Insurance
40-50 K   Yes         No          No               Male     45    No
30-40 K   Yes         Yes         Yes              Female   40    No
40-50 K   No          No          No               Male     42    No
30-40 K   Yes         Yes         Yes              Male     43    Yes
50-60 K   Yes         No          Yes              Female   38    No
20-30 K   No          No          No               Female   55    No
30-40 K   Yes         No          Yes              Male     35    Yes
20-30 K   No          Yes         No               Male     27    No
30-40 K   Yes         No          No               Male     43    No
30-40 K   Yes         Yes         Yes              Female   41    No
40-50 K   No          Yes         Yes              Female   43    No
20-30 K   No          Yes         Yes              Male     29    No
50-60 K   Yes         Yes         Yes              Female   39    No
40-50 K   No          Yes         No               Male     55    No
20-30 K   No          No          Yes              Female   19    Yes
Decision Trees for Credit Card Insurance Database

Dependent variable: Life Insurance Promotion

age
    > 43:  N 3, Y 0  ->  Decision: No
    <= 43: Gender
        Female:  N 0, Y 6  ->  Decision: Yes
        Male:  Credit Card Insurance
            No:   N 3, Y 1  ->  Decision: No
            Yes:  N 0, Y 2  ->  Decision: Yes

The critical value of 43 is determined by the algorithm

A production rule from the tree:
IF (age <= 43) & (Gender = Male) & (Credit Card Insurance = No)
THEN Life Insurance Promotion = No
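The tree can be written as nested conditions and checked against the fifteen rows of the credit card promotion database. The function name and the row layout are illustrative; the data are the table's own.

```python
# Rows of the credit card promotion table, reduced to the columns the tree uses:
# (life insurance promotion, gender, age, credit card insurance)
rows = [
    ("No", "Male", 45, "No"), ("Yes", "Female", 40, "No"),
    ("No", "Male", 42, "No"), ("Yes", "Male", 43, "Yes"),
    ("Yes", "Female", 38, "No"), ("No", "Female", 55, "No"),
    ("Yes", "Male", 35, "Yes"), ("No", "Male", 27, "No"),
    ("No", "Male", 43, "No"), ("Yes", "Female", 41, "No"),
    ("Yes", "Female", 43, "No"), ("Yes", "Male", 29, "No"),
    ("Yes", "Female", 39, "No"), ("No", "Male", 55, "No"),
    ("Yes", "Female", 19, "Yes"),
]

def predict(gender, age, cc_insurance):
    """Follow the tree: age first, then gender, then credit card insurance."""
    if age > 43:
        return "No"
    if gender == "Female":
        return "Yes"
    return "Yes" if cc_insurance == "Yes" else "No"

correct = sum(predict(g, a, cc) == life for life, g, a, cc in rows)
print(f"{correct} of {len(rows)} rows classified correctly")
```

The one misclassified row is the 29-year-old male without credit card insurance who did accept the life insurance promotion. Because these are the same rows the tree was built from, this is training accuracy, not an estimate of performance on new customers.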
Artificial Neural Networks


Set of interconnected nodes designed
to imitate the functioning of the human
brain
Feed-forward network

Supervised learner model
For the promotion example

Encode all variables
Assign a numerical value even to qualitative variables such as gender
Say X1 represents gender
    Male:   X1 = 1
    Female: X1 = 0
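[Figure: feed-forward network with an input layer (X1 = +1, X2 = 0, X3 = 0.5, X4 = -1), a hidden layer and an output layer; sample weights W1,5 = 0.014 (node 1 to node 5) and W5,9 = -0.17 (node 5 to node 9)]
(1 - 0.78)^2 is the squared error: 1 is the actual value of O9 for a particular data object, 0.78 is the predicted value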
Weights updating


Weights between nodes are adjusted so
as to reduce error
Details of the training process for neural
networks are not important for the time
being
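Without going into the full training procedure, the idea of adjusting a weight to reduce the error can be shown on a single connection. Everything in this sketch is an assumption made for illustration: one input, one weight, a sigmoid output unit, a target of 1, and a learning rate of 0.5.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical single connection: input x, weight w, desired output `target`.
x, w, target, learning_rate = 1.0, 0.5, 1.0, 0.5

output = sigmoid(w * x)
error_before = (target - output) ** 2

# Gradient of the squared error with respect to w (chain rule), followed by
# one gradient-descent step that moves w against the gradient.
gradient = -2 * (target - output) * output * (1 - output) * x
w = w - learning_rate * gradient

error_after = (target - sigmoid(w * x)) ** 2
print(error_before, error_after)
```

Repeating such steps over all weights and all training objects is, in outline, how a feed-forward network is trained.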
Estimation-Prediction

Similar to classification
Output is a continuous variable
Estimation: current value
Prediction: future outcome rather than current behavior
Typical Estimation-Prediction Problems

Estimate the salary of an individual who owns a sports car
Predict next week's closing price for the IMKB100 index
Forecast the next day's temperature
Prediction methods

Artificial neural networks
Linear regression
    Y_i = a_0 + a_1 X_{1,i} + a_2 X_{2,i} + ... + a_k X_{k,i} + u_i
Non-linear regression
    Y_i = f(X_{1,i}, X_{2,i}, ..., X_{k,i}; a_1, a_2, ..., a_k, u_i)
Generalized linear regression
    logistic: logit, probit
    Poisson regression: for count variables
Regression trees
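For the linear case with a single input variable, the coefficients a_0 and a_1 can be estimated by ordinary least squares. The sketch below uses hypothetical data that lie exactly on a line, so the error term u_i is zero and the fit recovers the coefficients exactly.

```python
def fit_line(xs, ys):
    """Ordinary least squares for Y = a0 + a1*X with a single input variable."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx              # slope
    a0 = mean_y - a1 * mean_x   # intercept
    return a0, a1

# Hypothetical data lying exactly on Y = 1 + 2*X.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
a0, a1 = fit_line(xs, ys)
print(a0, a1)
```

With several input variables the same idea is applied in matrix form, which is what statistical packages do.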
Example: Prediction and Classification

Classification is used to classify customers applying for credit cards
    known class labels: risky, reliable
    when a new customer applies, her characteristics (the independent variables) are examined
        income, age, education, wealth, region, ...
    the customer class is predicted
Prediction: the monthly expense of a new customer (a real, continuous variable) is predicted based on personal information
    income, education, wealth, profession, ...
    some are numeric, some categorical
Cluster Analysis

Class label is unknown: group data to form new classes
    the unknown labels are generated by the clustering model
    assign class labels to each data object
    e.g., cluster customers to find customer segments
Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity
    objects within a cluster have high similarity in comparison to one another
    but are very dissimilar to objects in other clusters
There may be a hierarchy of classes
Example: Clustering



Can be performed on AE customer data
to identify homogenous subpopulations
of customers
represent individual target groups for
marketing
[Figure: scatter plot of customers by income and distance to store, with three groups labelled Type 1, Type 2 and Type 3]
Clustering according to income and distance to store: three clusters of data points are evident
Outlier Analysis

Outlier: a data object that does not comply with the general behavior of the data
It can be considered as noise or an exception, but is quite useful in fraud detection and rare events analysis
Detected using
    statistical tests
    distance measures
    visually inspecting the data
Examples:
Reasons for outliers

Measurement errors
Coding errors
    age is entered as 999
Nature of the data
    the salary of the general manager is much higher than that of the other employees
    in a crisis the interest rate was in the order of 1000s
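A simple way to flag a value such as the miscoded age is the interquartile-range rule (Tukey's fences), a common rule of thumb that the slides do not name; it is used here only as one example of a statistical check. The ages are hypothetical.

```python
from statistics import quantiles

# Hypothetical ages; 999 stands for the coding error mentioned above.
ages = [25, 29, 31, 35, 38, 42, 999]

q1, _, q3 = quantiles(ages, n=4)   # first quartile, median, third quartile
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Anything outside the fences is reported for inspection.
outliers = [a for a in ages if a < low or a > high]
print(outliers)
```

Whether a flagged value is an error to be corrected or a genuine extreme (such as the general manager's salary) still has to be decided by someone who knows the data.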
Evolution Analysis

Describes and models regularities or trends for objects whose
behavior changes over time


Distinct features include

Trend and deviation: time-series data analysis

Sequential pattern mining, periodicity analysis

Similarity-based analysis
Example

Stock market predictions: future stock prices

for overall stocks: indexes or individual company stocks
Sequential Pattern Analysis

Determine sequential patterns in data
Based on the time sequence of actions
Similar to associations
    but the relationship is based on time
Example 1: buy a CD player today => buy CDs within one week
Example 2: in what sequence are the web pages of an e-business company accessed?
    70% of visitors follow A -> B -> C or A -> D -> B -> C or A -> E -> B -> C
    the site owner then decides to add a link directly from page A to page C
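The web example can be reproduced by counting page sequences in session logs. The ten sessions below are hypothetical and were chosen so that 70% of them lead from page A to page C.

```python
from collections import Counter

# Hypothetical sessions: the ordered list of pages each visitor opened.
sessions = [
    ["A", "B", "C"], ["A", "D", "B", "C"], ["A", "E", "B", "C"],
    ["A", "B", "C"], ["A", "B"], ["A", "D"], ["A", "B", "C"],
    ["B", "C"], ["A", "E", "B", "C"], ["A", "D", "B", "C"],
]

# How often does each complete path occur?
path_counts = Counter(tuple(s) for s in sessions)

# Share of sessions that start at A and end at C, whatever lies in between.
a_to_c = sum(1 for s in sessions if s[0] == "A" and s[-1] == "C")
share = a_to_c / len(sessions)

print(path_counts.most_common(3))
print(share)
```

A high share of A-to-C journeys through intermediate pages is the kind of evidence that would motivate a direct link from A to C.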
Are All the “Discovered” Patterns Interesting?

A data mining system/query may generate thousands of patterns; not all of them are interesting.
Are all patterns interesting?
    Typically not: only a small fraction of patterns are interesting to any given user
Interestingness measures: a pattern is interesting if
    it is easily understood by humans,
    valid on new or test data with some degree of certainty,
    potentially useful,
    novel, or
    validates some hypothesis that a user seeks to confirm
Objective vs. subjective interestingness measures

Objective: based on statistics and structures of patterns, e.g.,
    support of X => Y
        P(X ∪ Y): the probability that a transaction contains both X and Y
    confidence: degree of certainty of the detected association
        P(Y | X): the conditional probability that a transaction containing X also contains Y
    thresholds: controlled by the user
        ex: rules that do not satisfy a confidence threshold of 50% are uninteresting
Subjective: based on the user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc.
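Applying objective thresholds is a simple filter over the discovered rules. The three rules and their support and confidence values are the ones quoted on earlier slides; the minimum support of 1.5% is a hypothetical user choice, while the 50% confidence threshold is the one from the example above.

```python
# (rule, support, confidence) as quoted on earlier slides.
rules = [
    ("age 20..29 ^ income 20..29K => buys PC", 0.02, 0.60),
    ("contains computer => contains software", 0.01, 0.75),
    ("buys computer => buys printer", 0.01, 0.60),
]

MIN_SUPPORT = 0.015     # hypothetical user-chosen threshold
MIN_CONFIDENCE = 0.50   # the 50% confidence threshold from the example

# Keep only the rules that clear both thresholds.
interesting = [rule for rule, sup, conf in rules
               if sup >= MIN_SUPPORT and conf >= MIN_CONFIDENCE]
print(interesting)
```

Rules that pass this objective filter would still be judged subjectively, for instance on whether they are unexpected or actionable.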
Potential Business Applications

Market analysis and management
    target marketing, customer relationship management, market basket analysis, cross-selling, market segmentation
Risk analysis and management
    Banks assume a financial risk when they grant loans
        risk models attempt to predict the probability of default, i.e., failure to pay back the borrowed amount
    Credit cards
    Insurance companies
Fraud detection and management
Other applications
    Text mining (news groups, email, documents) and Web analysis
    Intelligent query answering
Market Analysis and Management (1)

Where are the data sources for analysis?
    Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies, clickstreams
Customer profiling and segmentation
    data mining can tell you what types of customers buy what products (clustering or classification)
Target marketing
    Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
Market Analysis and Management (2)

Effectiveness of sales campaigns
    advertisements, coupons, discounts, bonuses
    promote products and attract customers
    can help improve profits
Compare the amount of sales and the number of transactions
    during the sales period versus before or after the sales campaign
Association analysis
    which items are likely to be purchased together with the items on sale
Market Analysis and Management (3)

Customer retention: analysis of customer loyalty
    sequences of purchases of particular customers
    goods purchased at different periods by the same customers can be grouped into sequences
    changes in customer consumption or loyalty
    suggest adjustments to the pricing and variety of goods
    to retain old customers and attract new customers
Cross-selling and up-selling
    associations from sales records
    a customer who buys a PC is likely to buy a printer
    purchase recommendations
Fraud Detection and Management

Applications
    widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
Approach
    use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
Examples
    Credit card transactions: the FALCON fraud assessment system by HNC Inc. signals possibly fraudulent credit card transactions
    Money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
    Detecting telephone fraud: ASPECT European Research Group
        Unsupervised clustering to detect fraud in mobile phone networks
        Telephone call model: destination of the call, duration, time of day or week
        Analyze patterns that deviate from an expected norm
Health Care

Storing patients' records in electronic format, developments in medical information systems
    large amount of clinical data
Regularities, trends and surprising events are extracted by data mining methods
    ANN, temporal reasoning
    assist clinicians in making informed decisions and improve health services
MERCK-MEDCO Managed Care, Pharmaceutical Insurance … company
    uncover less expensive but equally effective drug treatments
Financial Data Analysis


Financial data
 complete, reliable, high quality
Loan payment prediction and customer
credit policy analysis
Loan payment prediction and customer credit policy analysis

Factors influencing loan payment performance
    loan-to-value ratio
    term of the loan
    debt ratio (total monthly debt / total monthly income)
    payment-to-income ratio
    income level
    education level
    residence region
    credit history
Analysis may find that
    payment-to-income ratio is a dominant factor, while
    education level and debt ratio are not
Risk Management and Insurance

Determine insurance rates
Manage investment portfolios
Differentiate between companies and/or individuals who are good and poor credit risks
Farmer's Group discovered a scenario:
    someone who owns a sports car is not a higher accident risk
    conditions: the sports car is a second car and the family car is a station wagon or a sedan
Data Mining for the Telecommunication Industry

Telecommunication data are multidimensional
    calling-time
    duration
    location of caller
    location of callee
    type of call
Used to identify and compare
    data traffic
    resource usage
    profit
    system workload
    user group behavior
Fraudulent pattern analysis and identification of unusual patterns
To achieve customer loyalty
    characteristics of customers affecting line usage
Other Applications

Sports and Gaming
    Predicting the outcome of football games
Text Mining
    Spam detection
Internet Web Mining
    Web usage mining
        Improve link structure
        Recommender systems
    Web structure mining: mining the link structure of the Web
Other Applications

Educational Data Mining
    Clustering students
    Designing entrance exams, selection policies
Human Resources
    How to select applicants
Online Dating
    Recommendations to visitors
Summary





Data mining: discovering interesting patterns from large
amounts of data
A natural evolution of database technology, in great
demand, with wide applications
A KDD process includes data cleaning, data integration,
data selection, transformation, data mining, pattern
evaluation, and knowledge presentation
Mining can be performed in a variety of information
repositories
Data mining functionalities: characterization,
discrimination, association, classification, clustering,