Uploaded by Kevin Nguyen

01 Intro Data Mining

Why Data Mining?
◼ We are living in the Big Data Age!
  ◼ The Explosive Growth of Data: TB of data are generated every second
  ◼ The data is also very complex
    ◼ Multiple types of data: tables, text, time series, images, voice, etc.
    ◼ Spatial and temporal aspects.
    ◼ Interconnected data of different types.
◼ Need to analyze the raw data to extract knowledge
◼ We are drowning in data, but starving for knowledge!
  ◼ Data rich but information poor!
  ◼ What do those data mean?
  ◼ How to analyze data?
◼ Data mining: automated analysis of massive data sets to
  harness the collective intelligence.
◼ “Necessity is the mother of invention”: data mining is the automated
  analysis of massive data sets
Why Data Mining?
◼ The Explosive Growth of Data
  ◼ Major sources of abundant data
    ◼ Business: Web, e-commerce, transactions, stocks, …
    ◼ Science: Remote sensing, bioinformatics, scientific simulation, …
    ◼ Society and everyone: news, digital cameras, YouTube
What Is Data Mining?
◼ Data mining (knowledge discovery from data)
  ◼ Extraction of interesting (non-trivial, implicit, previously
    unknown and potentially useful) patterns or knowledge from
    huge amounts of data
  ◼ To refer to the mining of gold from rocks or sand, we say “gold
    mining” instead of rock or sand mining. Analogously, data
    mining should have been more appropriately named
    “knowledge mining from data.”
◼ Alternative names
  ◼ Knowledge discovery (mining) in databases (KDD), knowledge
    extraction, data/pattern analysis, data archeology, data
    dredging, information harvesting, business intelligence, etc.
What Is Data Mining?
Business Perspective
◼ Data Mining is a business process by which
  organizations extract value from data to increase
  profit or efficiency, manage risk, improve service,
  formulate strategy and increase knowledge.
◼ Data mining is a business process that uses business
  knowledge to create new knowledge.
  ◼ New knowledge is created by discovering and interpreting
    patterns in data.
Terminology Definition
◼ The terms artificial intelligence, machine learning, and data mining are often grouped
  together or used interchangeably because their definitions overlap, with no
  clear boundaries. The following general definitions illustrate that overlap:
◼ Artificial Intelligence (AI)
  ◼ Demonstrates human-like intelligence and cognitive functions
  ◼ Deduction, pattern recognition, and interpretation of complex data
  ◼ Examples: Deep Blue playing chess, Watson
◼ Machine Learning
  ◼ Application of AI that allows the computer to learn from data automatically
  ◼ Uncovers hidden patterns and relationships
  ◼ Uses self-learning algorithms to evaluate results and improve performance over
    time
  ◼ Example: predict rider demand to strategically dispatch drivers for Uber
◼ Data Mining
  ◼ Process of applying a set of analytical techniques (AI, ML) for a specific goal.
  ◼ Uncovers hidden patterns and relationships in data
  ◼ Data segmentation, pattern recognition, classification, prediction
  ◼ Example: group customers into segments for customized promotions
These are often grouped together or used interchangeably.
Data Mining vs. Data Analysis
◼ Data Analysis (“Statistics”) is about hypothesis testing (“Confirmative”)
  ◼ We start with a theory or idea about how something works
◼ Data Mining is about hypothesis generation (“Explorative”)
  ◼ Discovering connections and relationships not previously known.

Data Analysis “Statistics”                Data Mining
Confirmative                              Explorative
Techniques are not optimized for large    Can find patterns in very large
amounts of data. Small data sets.         amounts of data
Small number of variables                 Large number of variables
Numeric data                              Numeric and non-numeric
Clean data                                Data cleaning
Requires strong statistical skills        Requires understanding of data and
                                          business problem
Data Mining vs. Business Intelligence
◼ Business Intelligence broadly encompasses reporting
  techniques and simple visualization, used when a relationship or a set
  of relationships is under investigation
  ◼ Sales data broken down by product, region, time and personnel.
  ◼ BI gives you an instant picture of sales performance and trends.
◼ Data Mining is used when key relationships are not known, for
  example, what factors drive sales.
  ◼ There might be tens or hundreds of factors to be considered.
  ◼ DM helps you sift through these factors and find the single or combined
    factors that have important effects.
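One simple way to picture "sifting through factors" is to rank each candidate factor by how strongly it moves with sales. The sketch below is only an illustration: the factor names and numbers are invented, and Pearson correlation is an assumed stand-in for whatever technique a real project would use. It only captures single-factor, linear effects, not combined ones.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)

def rank_factors(factors, target):
    """Sort factor names by absolute correlation with the target, strongest first."""
    scores = {name: pearson(values, target) for name, values in factors.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Hypothetical weekly data, invented for illustration only.
sales = [10, 12, 15, 19, 24, 30]
factors = {
    "ad_spend": [1, 2, 3, 4, 5, 6],
    "discount_pct": [5, 5, 5, 5, 5, 5],
    "competitor_price": [9, 7, 8, 6, 7, 5],
}

ranking = rank_factors(factors, sales)
for name, score in ranking:
    print(f"{name}: {score:+.2f}")
```

With tens or hundreds of factors the same loop applies unchanged; only the `factors` dictionary grows.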
Most Common Data Mining Applications
◼ The area of business in which data mining most commonly delivers benefit
  is Customer Analytics (CA).
  ◼ Sometimes known as analytical CRM (Customer Relationship Management).
◼ Organizations use DM to help them understand their customers, predict their
  behavior, and forge stronger and more profitable customer relationships.
◼ Some of the common modeling techniques in CA:
  ◼ Segmentation: Discovering through DM what different types of customers
    the organization has and what market segments can be targeted.
  ◼ Response Modeling: Discovering who is most likely to respond to a
    marketing campaign. This can be used at the strategy level or at the
    individual level to target customers with a specific marketing message.
  ◼ Attrition Modeling: Helps organizations understand why customers leave the
    customer base, so that retention campaigns can be targeted at the customers
    most at risk.
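As a rough illustration of segmentation, the sketch below groups customers by a single attribute using k-means (Lloyd's algorithm). The choice of k-means, the `annual_spend` figures, and the deterministic starting centroids are all assumptions made for this example; the slides do not prescribe a specific segmentation method.

```python
def kmeans_1d(values, k, iterations=20):
    """Lloyd's algorithm on one numeric attribute; returns (centroids, labels)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids across the sorted values.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each customer joins the nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centroids.append(sum(members) / len(members) if members else centroids[c])
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, labels

# Hypothetical annual spend per customer, invented for illustration.
annual_spend = [120, 150, 130, 900, 950, 1000, 5000, 5200, 4800]
centroids, labels = kmeans_1d(annual_spend, k=3)

for spend, segment in zip(annual_spend, labels):
    print(f"spend={spend:>5}  segment={segment}")
```

Each segment label could then drive a different promotion. Real segmentations usually combine several attributes and need scaling before distances are compared.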
Most Common Data Mining Applications
◼ Risk and Fraud is the second most common application for DM.
◼ Risk Management: an important aspect of every business, especially
  financial services.
  ◼ Credit risk (some people default on a debt).
  ◼ Market risk due to market movement.
  ◼ Data Mining is particularly useful when insight into sources of risk is required,
    and risk assessment needs to be integrated into operational systems.
◼ Fraud Detection: addresses the risk of loss through fraud.
  ◼ Data mining helps organizations detect and prevent fraud that is buried in a large
    volume of transactions.
  ◼ Detecting fraudulent or suspect transactions depends on two types of recognition:
    1. Known patterns: Recognize patterns of fraud that resemble types seen in
       previous known cases.
    2. Unknown patterns: Discover anomalies that represent types of fraud
       which have not been seen or detected before.
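The two types of recognition can be sketched side by side: rules for known patterns, and an outlier test for unknown ones. Everything specific here is an assumption for illustration: the rule names, the transaction fields, and the use of the MAD-based modified z-score with the conventional 3.5 cutoff as the anomaly test.

```python
from statistics import median

# Hypothetical signatures of previously seen fraud (assumptions for illustration).
KNOWN_FRAUD_RULES = {
    "card_testing": lambda t: t["amount"] < 2 and t["merchant"] == "online",
    "blocked_country": lambda t: t["country"] in {"XX"},
}

def match_known_patterns(txn):
    """Return the names of the known fraud signatures this transaction matches."""
    return [name for name, rule in KNOWN_FRAUD_RULES.items() if rule(txn)]

def flag_anomalies(amounts, threshold=3.5):
    """Flag amounts far from the median using the modified z-score (MAD-based)."""
    med = median(amounts)
    mad = median([abs(a - med) for a in amounts])
    if mad == 0:
        return [False] * len(amounts)  # no spread to measure deviations against
    return [abs(0.6745 * (a - med) / mad) > threshold for a in amounts]

# Hypothetical transactions, invented for illustration.
transactions = [
    {"amount": 40.0, "merchant": "store", "country": "US"},
    {"amount": 55.0, "merchant": "store", "country": "US"},
    {"amount": 1.0, "merchant": "online", "country": "US"},
    {"amount": 48.0, "merchant": "online", "country": "US"},
    {"amount": 62.0, "merchant": "store", "country": "US"},
    {"amount": 5000.0, "merchant": "store", "country": "US"},
    {"amount": 51.0, "merchant": "store", "country": "US"},
]

known_hits = [match_known_patterns(t) for t in transactions]
anomalies = flag_anomalies([t["amount"] for t in transactions])

for txn, hits, odd in zip(transactions, known_hits, anomalies):
    print(txn["amount"], hits, "ANOMALY" if odd else "")
```

The rule set only catches what has been seen before, while the outlier test can surface a new kind of fraud at the cost of false alarms that an analyst must review.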
Data Mining Tasks
◼ Descriptive data mining, also called unsupervised
  learning techniques.
  ◼ These tasks characterize properties of the data in a target data
    set.
  ◼ The objective is to derive patterns (associations, trends,
    clusters, and anomalies) that can summarize the underlying
    relationships in data.
◼ Predictive mining tasks, also called supervised
  learning techniques.
  ◼ These tasks perform induction on the current data in order to make
    predictions.
  ◼ The objective of these tasks is to predict the value of a
    particular attribute based on the values of other attributes.
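To make the predictive case concrete, the sketch below learns a relationship from current data and uses it to predict one attribute from another. Ordinary least squares on a single attribute is an assumed, minimal choice, and the store sizes and sales figures are invented.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical historical records: store size (in 100 m^2) and monthly sales.
sizes = [1, 2, 3, 4, 5]
sales = [12, 14, 16, 18, 20]  # constructed as 10 + 2 * size

# Induction: learn the relationship from the current data.
intercept, slope = fit_line(sizes, sales)

# Prediction: estimate the target attribute for an unseen store of size 6.
predicted = intercept + slope * 6
print(intercept, slope, predicted)
```

A descriptive task would differ in having no target column at all: it would summarize the records (for example by clustering them) rather than predict `sales`.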
Data Mining Tasks
◼ There are various data mining functionalities and tasks
  that can discover different kinds of patterns:
  ◼ Unsupervised Learning:
    ◼ Association Analysis: the task of discovering patterns that
      describe relationships.
    ◼ Anomaly Detection: the task of detecting unusual
      deviations.
    ◼ Clustering: the task of discovering groups and structures.
  ◼ Supervised Learning:
    ◼ Classification: the task of assigning records to one of several
      predefined categories (a discrete target variable).
    ◼ Regression: the task of finding a function that models
      a continuous target variable.
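Association analysis is often the least familiar of these tasks, so here is a minimal sketch using the standard support and confidence measures on pairs of items. The baskets and both thresholds are invented for illustration, and this brute-force pair enumeration is an assumption of the example; it is not a full frequent-itemset algorithm such as Apriori.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def pair_rules(transactions, min_support, min_confidence):
    """Enumerate one-to-one rules lhs -> rhs that meet both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b}, transactions)
        if pair_support < min_support:
            continue  # the pair is too rare to be interesting
        for lhs, rhs in ((a, b), (b, a)):
            # confidence(lhs -> rhs) = support(lhs and rhs) / support(lhs)
            confidence = pair_support / support({lhs}, transactions)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules

# Hypothetical market baskets, invented for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

rules = pair_rules(baskets, min_support=0.4, min_confidence=0.8)
for lhs, rhs, sup, conf in rules:
    print(f"{lhs} -> {rhs}  support={sup:.2f}  confidence={conf:.2f}")
```

Lowering `min_confidence` admits more rules, which is the usual trade-off between coverage and reliability in this task.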
August 30, 2022    Data Mining: Concepts and Techniques
Data Mining Process
◼ Data mining is a complex process of examining large sets of
  data to identify patterns and then using them for valuable
  business insights.
◼ Established standards with defined steps
  ◼ Cross-Industry Standard Process for Data Mining (CRISP-DM)
  ◼ Sample, Explore, Modify, Model, and Assess (SEMMA): SAS
  ◼ Knowledge discovery from data (KDD)
Data Mining Process
CRISP-DM
Data Mining Process
◼ Business understanding: situational context, specific
  objectives, project schedule, deliverables
◼ Data understanding: collecting raw data, preliminary
  results, potential hypotheses
◼ Data preparation: record and variable selection, wrangling,
  cleaning
◼ Modeling: selection and execution of data mining
  techniques, convert or transform data to formats/types
  needed for certain analyses, document assumptions, cross-validation
◼ Evaluation: evaluate performance of competing models,
  select best models, review and interpret results, develop
  recommendations
◼ Deployment: develop a set of actionable insights and a
  strategy for deployment/monitoring/feedback
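The cross-validation mentioned under Modeling can be sketched in a few lines. This is a minimal k-fold illustration under stated assumptions: the folds are contiguous (a real project would normally shuffle first), and the "model" is a deliberately trivial baseline that predicts the training mean, chosen only to keep the example short.

```python
def k_fold_indices(n_records, k):
    """Yield (train_idx, test_idx) pairs so every record is held out exactly once."""
    indices = list(range(n_records))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_records // k + (1 if i < n_records % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def cross_validate(ys, k):
    """Average held-out mean absolute error of a predict-the-training-mean baseline."""
    errors = []
    for train_idx, test_idx in k_fold_indices(len(ys), k):
        prediction = sum(ys[i] for i in train_idx) / len(train_idx)
        fold_error = sum(abs(ys[i] - prediction) for i in test_idx) / len(test_idx)
        errors.append(fold_error)
    return sum(errors) / len(errors)

folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)

# Hypothetical target values, invented for illustration.
baseline_error = cross_validate([3, 5, 4, 6, 5, 7, 6, 8, 7, 9], k=5)
```

In the Evaluation step, the same held-out error would be computed for each competing model so they can be compared on data they were not fitted to.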
Data Mining Overview
SEMMA
◼ Sample: identify appropriate variables, merging and/or
  dividing, sample
◼ Explore: exploratory data analysis
◼ Modify: variables are selected, created, and/or transformed
◼ Model: analysis techniques and models are chosen and
  applied
◼ Assess: results from different models are presented to end
  users, compare outcomes and performance, new
  observations are scored
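The five SEMMA stages can be walked through on a toy data set. This is a sketch under stated assumptions, not SAS's implementation: the function names, the 70/30 split, the synthetic records, and the two competing models (a mean baseline and a one-variable least-squares line) are all invented for illustration.

```python
import random
from statistics import mean

def sample(records, n, seed=0):
    """Sample: draw n records at random, then divide them into training and validation."""
    drawn = random.Random(seed).sample(records, n)
    cut = int(0.7 * n)
    return drawn[:cut], drawn[cut:]

def explore(records, field):
    """Explore: basic summary statistics for one variable."""
    values = [r[field] for r in records]
    return {"min": min(values), "mean": mean(values), "max": max(values)}

def modify(records):
    """Modify: create a derived variable (tenure in years)."""
    return [{**r, "tenure_years": r["tenure_months"] / 12} for r in records]

def model_mean(train, target):
    """Model 1: always predict the training mean of the target."""
    avg = mean([r[target] for r in train])
    return lambda r: avg

def model_line(train, feature, target):
    """Model 2: least-squares line on a single feature."""
    xs = [r[feature] for r in train]
    ys = [r[target] for r in train]
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return lambda r: my + slope * (r[feature] - mx)

def assess(predict, validation, target):
    """Assess: mean absolute error on held-out records."""
    return mean([abs(r[target] - predict(r)) for r in validation])

# Hypothetical customer records, generated synthetically for illustration.
records = [{"tenure_months": m, "spend": 100 + 50 * (m / 12) + (m % 5)} for m in range(1, 61)]

train, validation = sample(records, n=40)
summary = explore(train, "spend")
train, validation = modify(train), modify(validation)

baseline = model_mean(train, "spend")
line = model_line(train, "tenure_years", "spend")
mean_error = assess(baseline, validation, "spend")
line_error = assess(line, validation, "spend")

# Score a new observation with whichever model assessed better.
best = line if line_error < mean_error else baseline
score = best({"tenure_years": 2.0})
print(summary, mean_error, line_error, score)
```

Keeping each stage as its own function mirrors the SEMMA ordering and makes it easy to swap in a different sampling scheme or model without touching the rest.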