Analytical and Visual Data Mining
Michael Welge
welge@ncsa.uiuc.edu
Automated Learning Group, NCSA
www.ncsa.uiuc.edu/STI/ALG
October 14, 1998
University of Illinois at Urbana-Champaign
Why Data Mining? -- Potential Applications
• Database analysis, decision support, and
automation
– Market and Sales Analysis
– Fraud Detection
– Manufacturing Process Analysis
– Risk Analysis and Management
– Experimental Results Analysis
– Scientific Data Analysis
– Text Document Analysis
Data Mining: Confluence of Multiple
Disciplines
• Database Systems, Data Warehouses, and OLAP
• Machine Learning
• Statistics
• Mathematical Programming
• Visualization
• High Performance Computing
Data Mining: On What Kind of Data?
• Relational Databases
• Data Warehouses
• Transactional Databases
• Advanced Database Systems
– Object-Relational
– Spatial
– Temporal
– Text
– Heterogeneous, Legacy, and Distributed
– WWW
Why Do We Need Data Mining?
• Leverage the organization’s data assets
– Only a small portion (typically 5%-10%) of the
collected data is ever analyzed
– Data that may never be analyzed continues to be
collected, at great expense, out of fear that
something that may prove important in the future
will be missed
– The growth rate of data precludes the traditional
manually intensive approach
Why Do We Need Data Mining?
• As databases grow, supporting the decision
process with traditional query languages becomes
infeasible
– Many queries of interest are difficult to state in a
query language (the query formulation problem)
– “find all cases of fraud”
– “find all individuals likely to buy a FORD Expedition”
– “find all documents that are similar to this customer’s
problem”
Knowledge Discovery Process
• Data Mining: a step in the knowledge discovery process
consisting of particular algorithms (methods) that, under
some acceptable objective, produce a particular
enumeration of patterns (models) over the data.
• Knowledge Discovery Process: the process of using
data mining methods (algorithms) to extract (identify) what
is deemed knowledge according to the specified
measures and thresholds, using a database along with any
necessary preprocessing or transformations.
Data Mining: A KDD Process
Knowledge Discovery Process - Application Domain
First and foremost you must understand your data
and your business.
It may be that you wish to increase the response
from a direct mail campaign. So do you want to
build a model to:
– increase the response rate
– increase the value of the response
Depending on your specific goal, the model you
choose may be different.
Knowledge Discovery - Selecting Data
The task of selecting data begins with deciding
what data is needed to solve the problem.
Issues:
– Database incompatibility
– Data may be in an obscure form
– Data is incomplete
Knowledge Discovery - Preparing The Data
Data may have to be loaded from legacy systems
or external sources, stored, cleaned, and
validated.
Issues:
– Data may be in a format incompatible with its end use
– Data may have many missing, incomplete, or
erroneous values
– Field descriptions may be unclear, confusing, or
have different meanings depending on the source
– Data may be stale
Knowledge Discovery - Transforming Data
Considerable planning and knowledge of your
data should go into the transformation decision.
Data transformations are at the heart of developing
a sound model.
Knowledge Discovery - Types of Transformations
• Feature construction
– applying a mathematical formula to existing data
features
• Feature subset selection
– removing columns that are redundant, not pertinent,
or contain uninteresting predictors
• Aggregating data
– grouping features together and finding sums,
maximums, minimums, or averages
• Binning the data
– breaking up continuous ranges into discrete segments
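The four transformation types above can be sketched in plain Python. This is only an illustration: the records, the field names (`region`, `price`, `quantity`, `clerk_id`), and the bin edges are hypothetical and do not come from the presentation.

```python
from collections import defaultdict

# Hypothetical sales records; all field names and values are illustrative.
records = [
    {"region": "east", "price": 10.0, "quantity": 3, "clerk_id": 7},
    {"region": "east", "price": 4.0, "quantity": 5, "clerk_id": 7},
    {"region": "west", "price": 25.0, "quantity": 1, "clerk_id": 9},
]

def construct_feature(row):
    """Feature construction: derive revenue from two existing features."""
    return {**row, "revenue": row["price"] * row["quantity"]}

def select_features(row, keep):
    """Feature subset selection: keep only the named columns."""
    return {name: row[name] for name in keep}

def aggregate_sum(rows, group_key, value_key):
    """Aggregation: sum value_key within each distinct group_key value."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_key]] += row[value_key]
    return dict(totals)

def bin_value(value, edges, labels):
    """Binning: map a continuous value to the label of its segment.

    edges are ascending upper bounds; a value at or above the last edge
    gets the final label, so len(labels) must be len(edges) + 1.
    """
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]

rows = [construct_feature(r) for r in records]
rows = [select_features(r, ["region", "revenue"]) for r in rows]
revenue_by_region = aggregate_sum(rows, "region", "revenue")
size_labels = [
    bin_value(r["revenue"], [22.0, 28.0], ["low", "mid", "high"]) for r in rows
]
```

In practice the choice of derived features and bin edges depends on the business question, which is why the planning step matters more than the mechanics shown here.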
Knowledge Discovery - Data Mining
The process of building models differs among:
– Supervised learning (classification, regression,
time series problems)
– Unsupervised learning (database segmentation)
– Pattern identification and description (link analysis)
Once you have decided on the model type, you
must choose a method for building the model
(decision tree, neural net, K-nearest neighbor),
then the algorithm (for example, backpropagation
for a neural net).
Knowledge Discovery - Analyze and Deploy
Once the model is built, its implications must be
understood. Graphical representations of
relationships between independent and dependent
variables may be necessary. Also, attention
should be focused on important aspects of the
model such as outliers or value.
Model deployment may mean writing a new
application, embedding into an existing system, or
applying it to an existing data set. Model
monitoring should be established.
Required Effort for Each KDD Step
[Bar chart: effort (%), on a scale of 0 to 60, for each step: Business Objectives Determination, Data Preparation, Data Mining, Analysis & Assimilation]
What Data Mining Will Not Do
• Automatically find answers to questions you do
not ask
• Constantly monitor your database for new and
interesting relationships
• Eliminate the need to understand your business
and your data
• Remove the need for good data analysis skills
Data Mining Models and Methods
• Predictive Modeling
– Classification
– Value prediction
• Database Segmentation
– Demographic clustering
– Neural clustering
• Link Analysis
– Associations discovery
– Sequential pattern discovery
– Similar time sequence discovery
• Deviation Detection
– Visualization
– Statistics
Deviation Detection
• identify outliers in a dataset
• typical techniques - probability distribution
contrasts, supervised/unsupervised learning
• hypothetical example: Point-of-sale fraud
detection
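As a minimal sketch of outlier identification, the example below flags values that lie far from the mean, measured in standard deviations. This simple statistical test is an assumption chosen for illustration; it is not necessarily the technique used in the fraud project described in this presentation, and the associate ids and override counts are hypothetical.

```python
from statistics import mean, pstdev

def find_outliers(values_by_id, threshold=2.0):
    """Return ids whose value lies more than `threshold` population
    standard deviations from the mean of all values."""
    values = list(values_by_id.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [
        key
        for key, value in values_by_id.items()
        if abs(value - mu) / sigma > threshold
    ]

# Hypothetical monthly override counts per sales associate.
overrides = {
    "sa_01": 4, "sa_02": 5, "sa_03": 6, "sa_04": 5,
    "sa_05": 4, "sa_06": 6, "sa_07": 5, "sa_08": 30,
}
flagged = find_outliers(overrides)
```

A flagged record is only a candidate for review. Because the mean and standard deviation are themselves pulled toward extreme values, a real application would likely compare each associate against a profile of similar associates instead of a single global threshold.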
Fraud and Inappropriate Practice Prevention
Background:
Through regular review, HR has developed a
collaborative relationship with its Sales Associates
(SAs). Semi-annual meetings allow review of the
SAs’ practices with similar SAs across the country.
Goal:
The approach is aimed at modifying SAs’ behavior
to promote better service rather than at
investigating and prosecuting SAs, although both
strategies are employed.
Fraud and Inappropriate Practice Prevention
Business Objective:
The focus of this project was on the recent and
steady 12% annual rise in overrides. The overall
business objective of the project was to find a way
to ensure that the overrides were appropriate
without negatively affecting the service provided
by the SAs.
Fraud and Inappropriate Practice Prevention
Approach:
• To identify potentially fraudulent overrides or
overrides arising from inappropriate practices.
• To develop general profiles of the SAs’ practices in
order to compare the practice behavior of individual
SAs.
Fraud and Inappropriate Practice Prevention
Database Segmentation
• regroup datasets into clusters that share common
characteristics
• typical technique - unsupervised learning (SOMs,
K-Means)
• hypothetical example: Cluster all similar regimes
(financial, free form text)
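K-Means, one of the techniques named above, can be sketched in a few lines. The version below is a bare-bones form of the standard iteration (assign each point to its nearest centroid, then move each centroid to the mean of its members). The two-dimensional points and the choice of starting centroids are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math

def kmeans(points, centroids, max_iterations=100):
    """Alternate assignment and centroid update until the
    assignments stop changing or max_iterations is reached."""
    centroids = [tuple(c) for c in centroids]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: index of the nearest centroid for each point.
        new_assignments = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return assignments, centroids

# Hypothetical records already reduced to two numeric features.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
# Seeding with the first and fourth points is an arbitrary choice.
assignments, centroids = kmeans(points, centroids=[points[0], points[3]])
```

The result depends on the starting centroids and on how the features are scaled, so real use typically involves several restarts and normalized inputs.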
Self Organizing Maps Example - Text Clustering
This data is considered to be confidential and proprietary to Caterpillar
and may only be used with prior written consent from Caterpillar.
Predictive Modeling
• past data predicts future response
• typical technique - supervised learning (Artificial
Neural Networks, Decision Trees, Naïve Bayesian)
• hypothetical example (classification): Who is most
likely to respond to a direct mailing?
• hypothetical example (prediction): How will the
German Stock Price Index perform in the next 3, 5,
or 7 days?
Predictive Modeling - Prior Probabilities
Predictive Modeling - Posterior Probabilities
Link Analysis
• relationships between records/attributes in
datasets
• typical techniques - rule association, sequence
discovery
• hypothetical example (rule association): When
people buy a hammer they also buy nails 50% of
the time
• hypothetical example (sequence discovery):
When people buy a hammer they also buy nails
within the next 3 months 18% of the time, and
within the subsequent 3 months 12% of the time
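The sequence discovery example can be read as a windowed follow-up rate: of the customers who bought the first item, what fraction bought the second item within a given number of days afterward. The sketch below computes that rate under this reading. The function name, the purchase histories, and the 90-day window are all hypothetical.

```python
from datetime import date, timedelta

def follow_up_rate(purchases, first_item, next_item, window_days):
    """Fraction of buyers of `first_item` who bought `next_item` within
    `window_days` days after their earliest purchase of `first_item`."""
    buyers = 0
    followers = 0
    for history in purchases.values():
        first_dates = [d for d, item in history if item == first_item]
        if not first_dates:
            continue  # this customer never bought the first item
        buyers += 1
        start = min(first_dates)
        end = start + timedelta(days=window_days)
        if any(item == next_item and start < d <= end for d, item in history):
            followers += 1
    return followers / buyers if buyers else 0.0

# Hypothetical purchase histories: customer id -> list of (date, item).
purchases = {
    "c1": [(date(1998, 1, 5), "hammer"), (date(1998, 2, 1), "nails")],
    "c2": [(date(1998, 1, 9), "hammer"), (date(1998, 8, 1), "nails")],
    "c3": [(date(1998, 3, 2), "hammer")],
    "c4": [(date(1998, 3, 3), "nails")],
}
rate = follow_up_rate(purchases, "hammer", "nails", window_days=90)
```

Here three customers bought a hammer and one of them bought nails within the window, so the rate is one third. Calling the function with successive windows would give the kind of per-period breakdown quoted in the example.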
Link Analysis (Rule Association)
• Given a database, find all associations of the form:
IF <LHS> THEN <RHS>
Prevalence = fraction of all records in which the LHS
and RHS occur together
Predictability = fraction of the records containing the
LHS that also contain the RHS
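The two measures can be computed directly from a list of transactions. The sketch below treats each transaction as a set of items and reads prevalence and predictability as fractions of transactions (these quantities are commonly called support and confidence). The function name and the basket data are hypothetical.

```python
def rule_metrics(transactions, lhs, rhs):
    """Return (prevalence, predictability) for the rule IF lhs THEN rhs."""
    if not transactions:
        return 0.0, 0.0
    lhs, rhs = set(lhs), set(rhs)
    # Transactions that contain every item on the left-hand side.
    with_lhs = [t for t in transactions if lhs <= set(t)]
    # Of those, the ones that also contain the right-hand side.
    with_both = [t for t in with_lhs if rhs <= set(t)]
    prevalence = len(with_both) / len(transactions)
    predictability = len(with_both) / len(with_lhs) if with_lhs else 0.0
    return prevalence, predictability

# Hypothetical market baskets.
baskets = [
    {"hammer", "nails"},
    {"hammer", "nails", "tape"},
    {"hammer", "saw"},
    {"hammer"},
    {"nails"},
    {"paint", "brush"},
    {"saw", "tape"},
    {"paint"},
]
prevalence, predictability = rule_metrics(baskets, {"hammer"}, {"nails"})
```

With this data, hammer and nails appear together in 2 of 8 baskets (prevalence 25%), and nails appear in 2 of the 4 baskets that contain a hammer (predictability 50%). Finding all rules, as opposed to scoring one, requires a search over candidate item sets that this sketch does not attempt.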
Rule Association - Basket Analysis
Association Rules - Basket Analysis
• <Dairy-Milk-Refrigerated> implies <Soft Drinks Carbonated>
– prevalence = 4.99%, predictability = 22.89%
• <Dry Dinners - Pasta> implies <Soup-Canned>
– prevalence = 0.94%, predictability = 28.14%
• <Paper Towels - Jumbo> implies <Toilet Tissue>
– prevalence = 2.11%, predictability = 38.22%
• <Dry Dinners - Pasta> implies <Cereal - Ready to Eat>
– prevalence = 1.36%, predictability = 41.02%
• <Cheese-Processed Slices - American> implies <Cereal - Ready to Eat>
– prevalence = 1.16%, predictability = 38.01%
Requirements For Successful Data Mining
• There is a sponsor for the application.
• The business case for the application is clearly
understood and measurable, and the objectives
are likely to be achievable given the resources
being applied.
• The application has a high likelihood of having a
significant impact on the business.
• Business domain knowledge is available.
• Good quality, relevant data in sufficient quantities
is available.
Requirements For Successful Data Mining
• The right people – business domain, data
management, and data mining experts. People
who have “been there and done that”
For a first time project the following criteria could be
added:
• The scope of the application is limited. Try to
show results within 3-6 months.
• The data sources should be limited to those that
are well known, relatively clean, and freely
accessible.
Rapid KDD Development Environment
Rapid KDD Development Environment
Why Information Visualization
• Gain insight into the contents and complexity of
the database being analyzed
• Vast amounts of underutilized data
• Time-critical decisions hampered
• Key information difficult to find
• Results presentation
• Reduced perceptual, interpretative, cognitive
burden
Typical Data
• Abstract corporate data
• Mostly discrete not continuous
• Often multi-dimensional
• Quantitative
• Text
• Historical or real-time
Typical Applications
• Historical Data Analysis
– Marketing Data Mining Analysis
– Portfolio Performance Attribution
– Fraud/Surveillance Analysis
• Decision Support
– Financial Risk Management
– Operations Planning
– Military Strategic Planning
Typical Applications (cont)
• Monitoring Real-Time Status
– Industrial Process Control
– Capital Markets Trading Management
– Network Monitoring
• Management Reporting
– Financial Reporting
– Sales and Marketing Reporting
Marketing Data Mining Analysis
Risk Management
Capital Markets Trading Management
Network Monitoring
Industrial Process Control
Crisis Monitoring
• Ground (Student) View
• Aerial/Oracular (Instructor) View
• Color code for compartment status: Normal, Ignited, Engulfed, Extinguished, Destroyed, Fire Alarm, Flooding
3D Financial Reporting
Statistics Visualizer
Scatter Visualizer
Splat Visualizer
Tree Visualizer
Map Visualizer
Decision Tree