Data Mining

Xuequn Shang
Northwestern Polytechnical University
September 2006
Data Mining Techniques
About the Course
• Time
– Tue. 7:00 pm-9:00 pm
– Fri. 7:00 pm-9:00 pm
• Location
– Room XA107 West building
• Instructor
– Xuequn Shang, Ph.D.
– shang@nwpu.edu.cn
Mini Survey
• How many people have taken a database course before?
• How many people have taken a statistics course?
• How many people have taken a machine learning course before?
Textbook and Reference
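• Textbook
– Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann, 2001.
– Chinese edition: 数据挖掘概念与技术 (Data Mining: Concepts and Techniques), translated by Fan Ming, Meng Xiaofeng, et al., China Machine Press, August 2001.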
• References
– Principles of Data Mining (Adaptive Computation and
Machine Learning), David J. Hand, Heikki Mannila,
Padhraic Smyth, MIT Press, 2001
– Many research papers
Course Introduction
• Data that has relevance for managerial decisions is accumulating at
an incredible rate due to a host of technological advances.
– Electronic data capture has become inexpensive and ubiquitous as a
by-product of innovations such as the internet, e-commerce, electronic
banking, point-of-sale devices, bar-code readers, and intelligent
machines.
– Such data is often stored in data warehouses and data marts
specifically intended for management decision support.
• Data mining is a rapidly growing field concerned with
developing techniques that help managers make intelligent use of
these repositories.
– Applications include credit rating, fraud detection, database marketing, customer
relationship management, and stock market investments.
• This course will examine methods that have emerged from both
fields and proven to be of value in recognizing patterns and making
predictions from an applications perspective. We will survey
applications and provide an opportunity for hands-on
experimentation with algorithms for data mining using easy-to-use
software and cases.
Course Objective
• To provide an introduction to knowledge discovery in
databases and complex data repositories, and to present
basic concepts relevant to real data mining applications,
as well as reveal important research issues germane to
the knowledge discovery domain and advanced mining
applications.
• Students will understand the fundamental concepts
underlying knowledge discovery in databases and gain
hands-on experience with implementation of some data
mining algorithms applied to real world cases.
Evaluation
• Assignments (2) 20%
• Class participation 10%
• Project 20%
– Quality of presentation + quality of report + quality of demos
• Final Exam 50%
About the Project
• Implement and experimentally evaluate
the major method in the paper (60%)
• If possible, improve the method in
effectiveness or efficiency, implement and
experimentally evaluate your improvement
• Write a technical report (40%)
Contents
• Introduction to Data Mining
• Association analysis
• Sequential Pattern Mining
• Classification and prediction
• Data Clustering
• Data preprocessing
• Advanced topics
Course Schedule(1)
Date   | Time            | Session   | Topic
Sep-19 | 7:00 pm-9:00 pm | Session 1 | Welcome and introduction
Sep-22 | 7:00 pm-9:00 pm | Session 2 | Association rule mining
Sep-26 |                 | Session 3 |
Sep-29 |                 | Session 4 | Sequential Pattern Mining
Oct-10 |                 | Session 5 | Classification
Oct-13 |                 | Session 6 |
Course Schedule(2)
Date   | Time | Session    | Topic
Oct-17 |      | Session 7  | Data Clustering
Oct-20 |      | Session 8  | Data preprocessing
Oct-24 |      | Session 9  |
Oct-27 |      | Session 10 |
Oct-31 |      | Session 11 |
Nov-3  |      | Session 12 | Advanced topics, Seminar
Course Schedule(3)
Date   | Time | Session   | Topic
Nov-7  |      | Session 7 | Examination
Nov-10 |      | Session 8 |
Useful Information
• How to get a paper online?
– DBLP
• A good index for good papers
– CiteSeer
– Just google it
– Send requests to the authors
• Conferences and Journals on Data Mining
– KDD, PAKDD, ICDM, DaWaK, PKDD, etc.
– DMKD, TKDE, ACM Trans. on KDD, etc.
Additional Hints
• Be a good citizen
• Be a good graduate student
• Be a good scientist
– There are three chief ethical problems: fraud,
plagiarism, and duplicate or simultaneous
submissions
– There are four basic considerations in
technical ethics: honesty, justice, respect for
others' work, and respect for copyrights held by others.
Introduction
• Why data mining?
• What is data mining?
• What kinds of data can be mined?
• Are all the patterns interesting?
• Data mining functionality
• Major issues in data mining
Why Data Mining?
• Changes in the Business Environment
– Customers becoming more demanding
– Markets are saturated
• Databases today are huge:
– More than 1,000,000 entities/records/rows
– From 10 to 10,000 fields/attributes/variables
– Gigabytes and terabytes
• Databases are growing at an unprecedented rate
• Decisions must be made rapidly
• Decisions must be made with maximum knowledge
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”: data mining is the automated
analysis of massive data sets
Why Data Mining?
“The key in business is to know something that
nobody else knows.”
— Aristotle Onassis
“To understand is to perceive patterns.”
— Sir Isaiah Berlin
What Is Data Mining?
• Mining data: extracting or “mining” knowledge
from large amounts of data
• Data mining
– is the non-trivial process of identifying valid, novel,
potentially useful, and ultimately understandable
patterns in data [Fayyad, Piatetsky-Shapiro, Smyth,
96]
Applications
• Data analysis and decision support
– Market analysis and management
• Target marketing, customer relationship management (CRM), market
basket analysis, cross selling, market segmentation
– Risk analysis and management
• Forecasting, customer retention, improved underwriting, quality
control, competitive analysis
– Fraud detection and detection of unusual patterns (outliers)
• Other Applications
– Text mining (newsgroups, email, documents) and Web mining
– Stream data mining
– Bioinformatics and bio-data analysis
Ex. 1: Market Analysis and Management
• Where does the data come from? Credit card transactions, loyalty cards, discount coupons,
customer complaint calls, plus (public) lifestyle studies
• Target marketing
– Find clusters of “model” customers who share the same characteristics: interest, income
level, spending habits, etc.
– Determine customer purchasing patterns over time
• Cross-market analysis: find associations/correlations between product sales, and predict
based on such associations
• Customer profiling: what types of customers buy what products (clustering or classification)
• Customer requirement analysis
– Identify the best products for different customers
– Predict what factors will attract new customers
• Provision of summary information
– Multidimensional summary reports
– Statistical summary information (data central tendency and variation)
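The statistical summary mentioned above (central tendency and variation) can be sketched with Python's standard library. The purchase amounts are made-up illustrative values, not data from the course.

```python
import statistics

# Hypothetical purchase amounts for one customer segment (illustrative only).
purchases = [12.0, 15.0, 15.0, 18.0, 40.0]

summary = {
    "mean": statistics.mean(purchases),        # central tendency
    "median": statistics.median(purchases),    # central tendency, robust to extremes
    "stdev": statistics.stdev(purchases),      # variation (sample standard deviation)
    "range": max(purchases) - min(purchases),  # variation (spread between extremes)
}
print(summary)
```

The single large purchase pulls the mean well above the median, which is one reason a summary report usually shows both.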
Ex. 2: Corporate Analysis & Risk Management
• Finance planning and asset evaluation
– Cash flow analysis and prediction
– Contingent claim analysis to evaluate assets
– Cross-sectional and time series analysis (financial-ratio, trend analysis,
etc.)
• Resource planning
– Summarize and compare the resources and spending
• Competition
– Monitor competitors and market directions
– Group customers into classes and set up a class-based pricing procedure
– Set pricing strategy in a highly competitive market
Ex. 3: Fraud Detection & Mining Unusual Patterns
• Approaches: clustering and model construction for frauds, outlier analysis
• Applications: health care, retail, credit card service, telecommunications
– Auto insurance: ring of collisions
– Money laundering: suspicious monetary transactions
– Medical insurance
• Professional patients, ring of doctors, and ring of references
• Unnecessary or correlated screening tests
– Telecommunications: phone-call fraud
• Phone call model: destination of the call, duration, time of day or
week. Analyze patterns that deviate from an expected norm
– Retail industry
• Analysts estimate that 38% of retail shrink is due to dishonest
employees
– Anti-terrorism
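One way to picture "patterns that deviate from an expected norm" in the phone-call model is a simple outlier test on call durations. The sketch below uses a modified z-score based on the median and the median absolute deviation (MAD). That technique, the 3.5 threshold, and the sample durations are assumptions chosen for illustration, not the method from the slides.

```python
import statistics

def flag_unusual(values, threshold=3.5):
    """Return the indices whose modified z-score exceeds the threshold.

    The median and MAD are used instead of mean and standard deviation
    because they are less distorted by the outliers themselves.
    """
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread at all: nothing can be called unusual
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical call durations in minutes for one subscriber.
durations = [3, 4, 5, 4, 3, 5, 4, 120]
print(flag_unusual(durations))
```

A real fraud model would combine several attributes (destination, duration, time of day or week) rather than test one in isolation.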
The KDD Process
– Data mining: core of the knowledge discovery process
[Figure: the KDD process, from bottom to top: Databases; Data Cleaning and Data Integration; Data Warehouse; Selection; Task-relevant Data; Data Mining; Pattern Evaluation]
KDD Process Steps
• Preprocessing
– Data cleaning
– Data integration
• Data selection
• Data transformation
• Data mining
• Pattern evaluation
• Knowledge presentation
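The steps listed above can be sketched as a chain of small functions. Everything here (the record layout, the two sources, the function names, and the counting "miner") is a hypothetical toy chosen to make each step visible, not a real KDD system.

```python
from collections import Counter

# Two hypothetical sources with inconsistent formatting (illustrative only).
source_a = [{"id": 1, "city": " Xi'an ", "items": ["milk", "bread"]},
            {"id": 2, "city": None, "items": ["beer"]}]
source_b = [{"id": 3, "city": "xi'an", "items": ["milk", "beer"]},
            {"id": 4, "city": "Beijing", "items": ["milk", "bread"]}]

def clean(records):
    # Data cleaning: drop records with a missing city, normalize the text.
    return [{**r, "city": r["city"].strip().lower()}
            for r in records if r["city"]]

def integrate(*sources):
    # Data integration: combine several sources into one collection.
    return [r for src in sources for r in src]

def select(records, city):
    # Data selection: keep only the task-relevant records.
    return [r for r in records if r["city"] == city]

def transform(records):
    # Data transformation: reduce each record to a set of items.
    return [frozenset(r["items"]) for r in records]

def mine(baskets):
    # Data mining: count how often each item occurs.
    return Counter(item for basket in baskets for item in basket)

def evaluate(counts, min_count):
    # Pattern evaluation: keep only the items that occur often enough.
    return {item: n for item, n in counts.items() if n >= min_count}

records = integrate(clean(source_a), clean(source_b))
patterns = evaluate(mine(transform(select(records, "xi'an"))), min_count=2)
print(patterns)  # knowledge presentation, here just a printed dictionary
```

In practice the steps are iterated rather than run once: the result of pattern evaluation often sends the analyst back to selection or cleaning.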
Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by the disciplines it draws on: Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithm, Visualization, Other Disciplines]
Classification Schemes
• General functionality
– Descriptive data mining
– Predictive data mining
• Different views lead to different classifications
– Data view: Kinds of data to be mined
– Knowledge view: Kinds of knowledge to be discovered
– Method view: Kinds of techniques utilized
– Application view: Kinds of applications adapted
What Kind of Data?
• Database-oriented data sets and applications
– Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
– Data streams and sensor data
– Time-series data, temporal data, sequence data (incl. bio-sequences)
– Structured data, graphs, social networks and multi-linked data
– Object-relational databases
– Heterogeneous databases and legacy databases
– Spatial data and spatiotemporal data
– Multimedia database
– Text databases
– The World-Wide Web
Relational Databases
• Structured data
– Tables, records, attributes
– Indexes & SQL
• Online transaction processing (OLTP)
– Insert a student “Jennet” into class CMPT 741, fall
2005
• Online analytical processing (OLAP)
– Find the average class size of CMPT 700 level
courses in the last 3 years, grouped by semesters
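The contrast between the two workloads can be sketched with Python's built-in sqlite3 module. The table layout, the extra students, and the query are assumptions made for illustration. The "last 3 years" filter from the slide is left out because the toy table has no date column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE enrollment (student TEXT, course TEXT, semester TEXT)"
)

# OLTP: a short transaction that touches a single row.
with conn:
    conn.execute(
        "INSERT INTO enrollment VALUES (?, ?, ?)",
        ("Jennet", "CMPT 741", "fall 2005"),
    )

# A few more hypothetical rows so the analytical query has data to aggregate.
with conn:
    conn.executemany(
        "INSERT INTO enrollment VALUES (?, ?, ?)",
        [("Ann", "CMPT 741", "fall 2005"),
         ("Bob", "CMPT 705", "fall 2005"),
         ("Cat", "CMPT 705", "spring 2006")],
    )

# OLAP: scan many rows and aggregate. The inner query computes the size of
# each 700-level class; the outer query averages those sizes per semester.
query = """
    SELECT semester, AVG(class_size)
    FROM (SELECT semester, course, COUNT(*) AS class_size
          FROM enrollment
          WHERE course LIKE 'CMPT 7%'
          GROUP BY semester, course)
    GROUP BY semester
    ORDER BY semester
"""
rows = conn.execute(query).fetchall()
print(rows)
```

The insert reads and writes one row, while the analytical query reads the whole table to return a handful of aggregates.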
Data Warehouses
• A subject-oriented, integrated, time-variant,
and nonvolatile collection of data in support of
management’s decision making process [Inmon]
[Figure: source data is cleaned, transformed, integrated, and loaded into the Data Warehouse, which Clients reach through query and analysis tools]
Data Cube
• A Multi-dimensional Database
[Figure: a three-dimensional data cube with dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3); its 4 × 4 × 4 = 64 cells are numbered 1 to 64]
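A data cube over dimensions A, B, and C can be thought of as the set of all group-by aggregates over subsets of those dimensions. The sketch below assumes a tiny fact table with a numeric measure and SUM as the aggregate. The rows and the helper name `cuboid` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (A, B, C, measure).
facts = [("a0", "b0", "c0", 5), ("a0", "b1", "c0", 3),
         ("a1", "b0", "c1", 2), ("a1", "b0", "c0", 4)]
DIMS = ("A", "B", "C")

def cuboid(group_by):
    """Sum the measure, grouping by the listed dimensions only."""
    idx = [DIMS.index(d) for d in group_by]
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[i] for i in idx)] += row[3]
    return dict(totals)

# Three dimensions give 2**3 = 8 cuboids, one per subset of dimensions.
cube = {dims: cuboid(dims)
        for k in range(len(DIMS) + 1)
        for dims in combinations(DIMS, k)}

print(cube[("A",)])  # totals per value of A
print(cube[()])      # the grand total (the apex cuboid)
```

Precomputing these aggregates is what lets OLAP tools answer roll-up and drill-down queries quickly.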
Transactional Databases
TID  | Itemset
T100 | Milk, bread, beer, diaper
T200 | Beer, cook, fish, potato, orange, apple
…    | …
What kinds of product combinations do
customers like to buy together?
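The question on this slide can be explored by counting how often each pair of items appears in the same transaction. T100 and T200 are taken from the slide as written. T300 and T400 are hypothetical additions so that some pairs repeat, and the function name is an assumption.

```python
from collections import Counter
from itertools import combinations

transactions = {
    "T100": {"milk", "bread", "beer", "diaper"},
    "T200": {"beer", "cook", "fish", "potato", "orange", "apple"},
    "T300": {"milk", "bread", "diaper"},   # hypothetical
    "T400": {"beer", "diaper", "bread"},   # hypothetical
}

def frequent_pairs(transactions, min_support_count):
    """Count every 2-item combination and keep those bought together often."""
    counts = Counter()
    for items in transactions.values():
        # Sorting makes ("beer", "milk") and ("milk", "beer") the same key.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support_count}

pairs = frequent_pairs(transactions, min_support_count=2)
print(pairs)
```

Counting all pairs directly is fine for a handful of items but grows quadratically, which is why association rule mining prunes the candidates it counts.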
Spatial Databases
• Spatial information
– Geographic databases (map)
– VLSI chip design databases
– Satellite image databases
• Spatial patterns
– What are the changes of the forest in the last 10
years?
– Find clusters of homes with kids of age 5-10
Time Series Data
• A sequence of values that change over time
– A stock price sampled every 5 minutes
– The daily temperature
• Typical operations
– Similarity search
– Trend analysis
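The two typical operations can be sketched in a few lines. A moving average stands in for trend analysis and Euclidean distance for similarity search. Both are common simple choices assumed here for illustration, and the temperature values are made up.

```python
import math

def moving_average(series, window):
    """Trend analysis: smooth the series with a sliding-window mean."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

def euclidean_distance(a, b):
    """Similarity search: a smaller distance means more similar sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical daily temperatures for two cities (illustrative only).
city_a = [10, 12, 14, 13, 15, 17]
city_b = [10, 12, 14, 13, 15, 20]

trend = moving_average(city_a, window=3)
distance = euclidean_distance(city_a, city_b)
print(trend, distance)
```

Euclidean distance requires sequences of equal length that are aligned in time. Sequences that are shifted or stretched need other measures.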
Semi-Structured Data
• HTML web documents
• XML documents
• Digital libraries
• Annotated multimedia databases
– Image, audio and video data
Biological Data
• Bio-sequences
– DNA, gene, protein: very long sequences
• Micro-array data
• Medical documents and images
• Typically very noisy
– Data cleaning and integration are challenging
What Can Be Discovered?
• What can be discovered depends upon the data
mining task employed.
• Descriptive DM tasks
– Characterize the general properties of the data
• Predictive DM tasks
– Perform inference on the available data in order to make predictions
What Kinds of Patterns?
• Association rules and sequential patterns
• Classification
• Clustering
• Outlier analysis
• Other data mining tasks
Are All the “Discovered” Patterns
Interesting?
• Data mining may generate thousands or even
millions of patterns: not all of them are interesting
– What makes a pattern interesting?
– Can a data mining system generate all of the
interesting patterns?
– Can a data mining system generate only interesting
patterns?
What makes a pattern interesting?
• Interestingness measures
– A pattern is interesting if it is easily understood by humans, valid
on new or test data with some degree of certainty, potentially
useful, novel, or validates some hypothesis that a user seeks to
confirm
• Objective vs. subjective interestingness measures
– Objective: based on statistics and structures of patterns, e.g.,
support, confidence, etc.
– Subjective: based on the user’s beliefs about the data, e.g.,
unexpectedness, novelty, etc.
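The two objective measures named above, support and confidence, can be computed directly from a list of transactions. The baskets below are hypothetical, and the function names are chosen for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) for antecedent -> consequent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market baskets (illustrative only).
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "beer"},
    {"milk"},
    {"beer", "diaper"},
]

s = support(baskets, {"milk", "bread"})       # 2 of the 4 baskets
c = confidence(baskets, {"milk"}, {"bread"})  # 2 of the 3 milk baskets
print(s, c)
```

A rule is usually reported only when both values clear user-chosen minimum thresholds. Subjective measures such as unexpectedness need knowledge about the user and cannot be computed from the data alone.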
Find All Interesting Patterns?
• Find all the interesting patterns:
Completeness
– Can a data mining system find all the interesting
patterns? Do we need to find all of the
interesting patterns?
– Heuristic vs. exhaustive search
– Association vs. classification vs. clustering
Find Only Interesting Patterns?
• Search for only interesting patterns: An
optimization problem
– Can a data mining system find only the
interesting patterns?
– Approaches
• First generate all the patterns and then filter out the
uninteresting ones
• Generate only the interesting patterns: mining query
optimization
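The two approaches can be compared on a toy frequent-itemset task, taking "interesting" to mean "appears in at least MIN_COUNT baskets". The second function uses Apriori-style level-wise pruning as one example of pushing the constraint into the search. The slides do not name a specific algorithm here, and the data and names are hypothetical.

```python
from itertools import combinations

baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"d"}]
MIN_COUNT = 2

def count(itemset):
    return sum(itemset <= b for b in baskets)

def generate_then_filter():
    """Approach 1: enumerate every itemset, then discard the infrequent ones."""
    items = sorted(set().union(*baskets))
    candidates = [frozenset(c) for k in range(1, len(items) + 1)
                  for c in combinations(items, k)]
    frequent = {c for c in candidates if count(c) >= MIN_COUNT}
    return frequent, len(candidates)

def generate_only_frequent():
    """Approach 2: grow itemsets level by level from frequent ones only."""
    items = sorted(set().union(*baskets))
    level = {frozenset([i]) for i in items}
    frequent, examined = set(), 0
    while level:
        examined += len(level)
        survivors = {c for c in level if count(c) >= MIN_COUNT}
        frequent |= survivors
        # Join survivors that differ by exactly one item ...
        joined = {a | b for a in survivors for b in survivors
                  if len(a | b) == len(a) + 1}
        # ... and drop any candidate that has an infrequent subset, since
        # such a candidate cannot be frequent itself.
        level = {c for c in joined
                 if all(frozenset(s) in survivors
                        for s in combinations(c, len(c) - 1))}
    return frequent, examined

all_first, n_all = generate_then_filter()
pruned, n_pruned = generate_only_frequent()
print(n_all, n_pruned, all_first == pruned)
```

Both functions return the same itemsets, but the second examines fewer candidates, and the gap widens quickly as the number of items grows.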
Research Issues in Data Mining
• Effectiveness
• Efficiency
• Applications
• Theory
Effectiveness
• What kind of patterns to mine?
– Propose interesting data mining problems
• How to identify interesting patterns?
– Interestingness measures
– Useful constraints
• Visualization and interaction
– Presentation of mining results
– Interactive, adaptive mining
Efficiency
• Develop fast data mining algorithms
– Identify effective heuristics for mining
– Theoretical and/or empirical justification
• Systematic implementation
– Parallel, distributed, and incremental mining
• Integration to product systems
– Data mining module in DBMS and data warehouses
Applications
• Handle noisy or incomplete data
• Incorporate background knowledge
• Application/domain-oriented solutions
– Vertical solutions
Foundation for Data Mining
• Knowledge representation
• Data mining algebra and language
– Integration of multiple mining tasks/DBMS
– Open for new data/knowledge
– Interaction and visualization
• Data mining query optimization
– Common construct
– Automatic optimization by construct rewriting
Major Issues in Data Mining
• Mining methodology
– Mining different kinds of knowledge from diverse data types, e.g., bio, stream,
Web
– Performance: efficiency, effectiveness, and scalability
– Pattern evaluation: the interestingness problem
– Incorporation of background knowledge
– Handling noise and incomplete data
– Parallel, distributed and incremental mining methods
– Integration of the discovered knowledge with existing one: knowledge fusion
• User interaction
– Data mining query languages and ad-hoc mining
– Expression and visualization of data mining results
– Interactive mining of knowledge at multiple levels of abstraction
• Applications and social impacts
– Domain-specific data mining & invisible data mining
– Protection of data security, integrity, and privacy
A Brief History of Data Mining Society
• 1989 IJCAI Workshop on Knowledge Discovery in Databases
– Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley,
1991)
• 1991-1994 Workshops on Knowledge Discovery in Databases
– Advances in Knowledge Discovery and Data Mining (U. Fayyad, G.
Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
• 1995-1998 International Conferences on Knowledge Discovery in Databases
and Data Mining (KDD’95-98)
– Journal of Data Mining and Knowledge Discovery (1997)
• ACM SIGKDD conferences since 1998 and SIGKDD Explorations
• More conferences on data mining
– PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM
(2001), etc.
• ACM Transactions on KDD starting in 2007
Summary
• Data mining: Discovering interesting patterns from large amounts of
data
• A natural evolution of database technology, in great demand, with
wide applications
• A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
• Mining can be performed in a variety of information repositories
• Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis, etc.
• Data mining systems and architectures
• Major issues in data mining
Assignment (Ⅰ)
• What is data mining?
– Data mining is the task of discovering interesting
patterns from large amounts of data, where the
data can be stored in databases, data warehouses,
or other information repositories. It is a young
interdisciplinary field, drawing from areas such as
database systems, data warehousing, statistics,
machine learning, data visualization, information
retrieval, and high-performance computing. Other
contributing areas include neural networks, pattern
recognition, spatial data analysis, image databases,
signal processing, and many application fields, such
as business, economics, and bioinformatics.
Assignment (Ⅱ)
• Define each of the following data mining functionalities: association and
correlation analysis, classification, prediction, clustering, and evolution
analysis. Give examples of each data mining functionality, using a real-life
database with which you are familiar.
– Association analysis
• Showing attribute-value conditions that occur frequently together in a given set of data
– Classification
• Finding a set of models that describe and distinguish data classes or concepts, for the
purpose of being able to use the model to predict the class of objects whose class
label is unknown
– Clustering analysis
• Analyzing data objects without consulting a known class label
– Outlier analysis
• Finding data objects that do not comply with the general behavior or model of the data
– Evolution analysis
• Describing and modeling regularities or trends for objects whose behavior changes over
time
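The difference between classification (which uses known class labels) and clustering (which does not) can be seen in a toy one-dimensional example. The nearest-neighbor classifier and the 2-means clustering below are simple stand-ins picked for illustration, and the spending figures are hypothetical.

```python
# Hypothetical customers described by one attribute: yearly spending.
labeled = [(100, "low"), (120, "low"), (900, "high"), (950, "high")]

def classify(x):
    """Classification: predict the label of a new object from labeled
    examples, here by copying the label of the nearest example."""
    return min(labeled, key=lambda pair: abs(pair[0] - x))[1]

def cluster(values, iterations=10):
    """Clustering: group objects without consulting any class label,
    here with 2-means starting from the smallest and largest value."""
    centers = [min(values), max(values)]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centers)]
    return groups

print(classify(200))
print(cluster([100, 120, 900, 950]))
```

The classifier needs the "low" and "high" labels to exist beforehand. The clustering finds the two groups on its own but cannot name them.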
Complement (Ⅰ)
• A student asked me what the difference between
data mining and information retrieval is
– There is really no clear difference
– In fact, some recent information retrieval
systems do discover associations between words and
paragraphs
Complement (Ⅱ)
• What is the difference between data mining (DM)
and pattern recognition (PR)?
– Both aim to find useful relations
– In PR, we typically deal with data sets of moderate size,
while in a typical DM application, we are concerned
with data sets that are large in terms of dimension
and number of clusters
– PR is an important technique used in DM
• Data mining involves an integration of techniques from
multiple disciplines
Architecture: Typical Data Mining System
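[Figure: layers of a typical data mining system, from top to bottom: Graphical User Interface; Pattern Evaluation; Data Mining Engine; Database or Data Warehouse Server; data cleaning, integration, and selection; data sources (Database, Data Warehouse, World-Wide Web, Other Info Repositories). A Knowledge Base sits alongside the Pattern Evaluation and Data Mining Engine layers.]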
Thank you!