Data Mining Week 9 Introduction to Data Mining Fox MIS

Spring 2011
[Figure: diagram with the labels Competitive Advantage, Performance, Good Business Decision, Better Understanding, Data Mining, External Source, Data Warehouse, MySQL, and ERD, and with the table fields Product No., Product Name, Price, Customer No., Name, Address, Membership, and Description]
Defining User Communities
• Information user
– Generally requires standard reports that
often include charts and tables
– Wants to scan consistently structured
reports without needing to slice or dice to
find the desired values
– Static or simple interactive reports
• Information consumer
– Requires the ability to dynamically query
the database, without becoming an
expert at database design or the query
tool
– Ad hoc multidimensional analysis
– Many business people cross the line
between information users and
information consumers
• Power analyst
– Requires the full analytical power of the
data mart in order to perform free-form
ad hoc analysis
Some Questions Analysts Need to Answer
• Sales analysis:
– What are the sales by quarter and geography?
– How do sales compare in two different stores in the same
state?
• Profitability analysis:
– Which is the most profitable store in the state CA?
– Which product lines are the highest revenue producers this
year?
– Which products and product lines are the most profitable
this quarter?
• Sales force analysis:
– Which salesperson is the best revenue producer this year?
– Does salesperson X meet his sales target this quarter?
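Questions like these are retrospective aggregations over sales records. A minimal sketch of the first two follows, assuming a hypothetical list of (quarter, state, store, revenue) records. The data, the field layout, and the helper name total_by are invented for illustration. The sketch ranks stores by revenue only, since profitability would also need cost data.

```python
from collections import defaultdict

# Hypothetical sales records: (quarter, state, store, revenue).
# All values are invented for illustration.
SALES = [
    ("Q1", "CA", "Store 1", 1200.0),
    ("Q1", "CA", "Store 2", 800.0),
    ("Q1", "PA", "Store 3", 500.0),
    ("Q2", "CA", "Store 1", 1000.0),
    ("Q2", "CA", "Store 2", 1500.0),
    ("Q2", "PA", "Store 3", 700.0),
]


def total_by(records, key_func):
    """Sum revenue over the groups defined by key_func(record)."""
    totals = defaultdict(float)
    for record in records:
        totals[key_func(record)] += record[3]
    return dict(totals)


# What are the sales by quarter and geography (here: state)?
by_quarter_state = total_by(SALES, lambda r: (r[0], r[1]))

# How do sales compare in two different stores in the same state?
ca_stores = total_by([r for r in SALES if r[1] == "CA"], lambda r: r[2])
best_ca_store = max(ca_stores, key=ca_stores.get)

print(by_quarter_state)
print(ca_stores, best_ca_store)
```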
Finding a Pattern from Data
• Tenure and sick days by department
– Average tenure for each department: 9.0
– Average number of sick days for each department: 7.5
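Identical averages can hide different patterns. The sketch below uses two invented departments (the employee-level values are hypothetical, chosen only so that both departments average 9.0 years of tenure and 7.5 sick days) and shows that the relationship between tenure and sick days can still point in opposite directions.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical employee-level data; not taken from the slides.
DEPARTMENTS = {
    "A": {"tenure": [6, 8, 10, 12], "sick_days": [4.5, 6.5, 8.5, 10.5]},
    "B": {"tenure": [6, 8, 10, 12], "sick_days": [10.5, 8.5, 6.5, 4.5]},
}

for name, data in DEPARTMENTS.items():
    print(
        name,
        mean(data["tenure"]),       # same average tenure in both departments
        mean(data["sick_days"]),    # same average sick days in both departments
        pearson(data["tenure"], data["sick_days"]),  # opposite signs
    )
```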
Finding a Pattern: Graphical Representation
Data Analysis Evolutionary Step
• Data Collection (1960s)
– Business question: "What was my total revenue in the last five years?"
– Enabling technologies: Computers, tapes, disks
– Characteristics: Retrospective, static data delivery
• Data Access (1980s)
– Business question: "What were unit sales in New England last March?"
– Enabling technologies: Relational databases (RDBMS), Structured Query Language (SQL)
– Characteristics: Retrospective, dynamic data delivery at record level
• Data Warehousing & Decision Support (1990s)
– Business question: "What were unit sales in New England last March? Drill down to Boston."
– Enabling technologies: On-line analytic processing (OLAP), multidimensional databases, data warehouses
– Characteristics: Retrospective, dynamic data delivery at multiple levels
• Data Mining (Emerging Today)
– Business question: "What’s likely to happen to Boston unit sales next month? Why?"
– Enabling technologies: Advanced algorithms, multiprocessor computers, massive databases
– Characteristics: Prospective, proactive information delivery
Data Mining
• The application of specific algorithms for
extracting patterns from data
• Data mining tools automatically search data for
patterns and relationships
• Data mining tools
– Analyze data
– Uncover problems or opportunities
– Form computer models based on findings
– Predict business behavior with models
– Require minimal end-user intervention
Data Mining
• Goal
– Simplification and automation of the overall
statistical process, from data source(s) to
model application
• Data mining is ready for application in the
business community because it is supported by
three technologies that are now sufficiently
mature:
– Massive data collection
– Powerful multiprocessor computers
– Data mining algorithms
Convergence of Three Key Technologies
Data Mining and Knowledge Discovery in the Real World
• Marketing
– If customer bought X, he/she is also likely to
buy Y and Z
• Investment
– Stock investment
• Fraud detection
– Identify financial transactions that might
indicate money-laundering activity
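The marketing example ("if customer bought X, he/she is also likely to buy Y and Z") is usually quantified with the support and confidence of a rule. A minimal sketch follows, assuming a hypothetical list of market baskets. The transactions and the function names are invented for illustration.

```python
# Hypothetical market baskets; each set is one customer's purchase.
TRANSACTIONS = [
    {"X", "Y", "Z"},
    {"X", "Y"},
    {"X", "Y", "Z"},
    {"X", "Z"},
    {"Y"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for basket in transactions if itemset <= basket)


def support(itemset, transactions):
    """Fraction of all transactions that contain itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the baskets containing antecedent, the fraction that also contain consequent."""
    return count(antecedent | consequent, transactions) / count(antecedent, transactions)


# Rule "bought X -> also buys Y and Z"
print(support({"X", "Y", "Z"}, TRANSACTIONS))       # share of baskets with X, Y, and Z
print(confidence({"X"}, {"Y", "Z"}, TRANSACTIONS))  # share of X buyers who bought Y and Z
```

A rule is typically considered interesting only when both values clear thresholds chosen by the analyst.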
A Problem...
• You are a marketing manager for a brokerage
company
• Problem: Churn is too high
– Turnover (after the six-month introductory period
ends) is 40%
– Customers receive incentives (average cost:
$160) when account is opened
– Giving new incentives to everyone who might
leave is very expensive (as well as wasteful)
– Bringing back a customer after they leave is both
difficult and costly
… A Solution
• One month before the introductory period ends,
predict which customers will leave
• If you want to keep a customer that is predicted
to churn, offer them something based on their
predicted value
• The ones that are not predicted to churn need
no attention
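The targeting logic of this solution can be sketched as follows. The churn probabilities and predicted values are assumed to come from a predictive model that is not shown. The customers, the 0.5 threshold, and the rule of offering 10% of predicted value are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    name: str
    churn_probability: float  # assumed output of a churn model
    predicted_value: float    # assumed predicted future value, in dollars


def plan_retention(customers, churn_threshold=0.5, offer_fraction=0.1):
    """Return {name: offer amount} for customers predicted to churn.

    Customers below the threshold get no offer, mirroring "the ones that
    are not predicted to churn need no attention".
    """
    offers = {}
    for c in customers:
        if c.churn_probability >= churn_threshold:
            # The offer scales with the customer's predicted value.
            offers[c.name] = round(c.predicted_value * offer_fraction, 2)
    return offers


# Hypothetical scored customers, one month before the introductory period ends.
CUSTOMERS = [
    Customer("Ann", 0.8, 2000.0),
    Customer("Bob", 0.1, 5000.0),
    Customer("Cy", 0.6, 300.0),
]

offers = plan_retention(CUSTOMERS)
print(offers)
```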
A weather problem
A numeric weather problem
Benefits of Data Mining
• New business opportunities by providing these
capabilities:
• Automated prediction of trends and behaviors
– Targeted marketing.
• Use data on past promotional mailings to identify the targets
most likely to maximize return on investment in future mailings.
– Forecasting bankruptcy and other forms of default
• Automated discovery of previously unknown
patterns.
– Data mining tools sweep through databases and
identify previously hidden patterns in one step
– Analysis of retail sales data to identify seemingly
unrelated products that are often purchased
together
Descriptive Data Mining
• Descriptive Data Mining
– Seeks to describe new patterns in the data and
requires human interaction to determine the
significance and meaning of these patterns
– Affinity grouping
• Which items go together
– Clustering
• Divides data into smaller groups based on similarity
without predefinition of the groups
– Customers with similar buying habits
– Visualization
• Graphical representation of data
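Clustering, as described above, groups records by similarity without predefined groups. A minimal k-means sketch in plain Python follows, assuming hypothetical customers described by (store visits per month, average basket in dollars). The data, the choice of k = 2, and the naive initialization from the first k points are illustrative assumptions. Real use would normally scale the features first and try several initializations.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, iterations=10):
    """Return (labels, centers) after a fixed number of k-means iterations."""
    centers = list(points[:k])  # naive, deterministic initialization
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = centroid(members)
    return labels, centers


# Hypothetical customers: (visits per month, average basket in dollars).
CUSTOMERS = [(2, 20.0), (12, 95.0), (3, 25.0), (11, 105.0), (2, 30.0), (13, 100.0)]

labels, centers = k_means(CUSTOMERS, k=2)
print(labels)
print(centers)
```

The groups that come out still need human interpretation, for example "occasional small-basket shoppers" versus "frequent large-basket shoppers".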
Predictive Data Mining
• Likelihood of a particular outcome
• Mathematical algorithms are used to create models
• Classification
– A new record is assigned to a specific category
defined by the model
– New credit applicants as low risk, medium risk, or
high risk
• Estimation
– A new record is assigned a predicted value
– Length of time a customer will stay
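Classification can be sketched with a nearest-neighbor classifier, one of several possible techniques. The example assumes hypothetical past applicants described by (income in thousands of dollars, debt as a percentage of income) with known risk labels. The data, the features, and k = 3 are invented for illustration.

```python
from collections import Counter

# Hypothetical labeled applicants: ((income_k, debt_percent), risk category).
TRAINING = [
    ((80, 10), "low"),
    ((75, 15), "low"),
    ((50, 35), "medium"),
    ((45, 40), "medium"),
    ((20, 70), "high"),
    ((25, 65), "high"),
]


def squared_distance(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def classify(record, training, k=3):
    """Assign the majority category among the k nearest labeled records."""
    nearest = sorted(training, key=lambda item: squared_distance(record, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# New credit applicants are assigned to a category defined by the labeled data.
print(classify((47, 38), TRAINING))
print(classify((78, 12), TRAINING))
```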
Defining Data Mining
• The automated extraction of predictive
information from (large) databases
• Two key words:
– Automated
– Predictive
• Data mining lets you be proactive
• Prospective rather than Retrospective
How Data Mining Works: Modeling
• Modeling is simply the act of building a model in one
situation where you know the answer and then applying
it to another situation that you don't.
• Some models are better than others
– Accuracy
– Understandability
• Models range from “easy to understand” to
incomprehensible
• Decision trees
• Rule induction
• Regression models
• Neural networks
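The modeling idea (build a model where the answer is known, then apply it where it is not) can be sketched with a simple regression model, fitted by ordinary least squares. The data are hypothetical: number of products a customer holds versus months the customer stayed. The mean absolute error at the end is one simple way to express accuracy, measured here on the training data only.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Situation where the answer is known (hypothetical past customers).
products_held = [1, 2, 3, 4]
months_stayed = [12, 19, 31, 38]

intercept, slope = fit_line(products_held, months_stayed)


def predict(x):
    """Apply the fitted model to a situation where the answer is unknown."""
    return intercept + slope * x


# Accuracy on the known cases: mean absolute error of the fitted line.
mae = sum(abs(predict(x) - y) for x, y in zip(products_held, months_stayed)) / len(products_held)

print(intercept, slope)
print(predict(5))  # estimate for a new customer holding 5 products
print(mae)
```

A model this small is also easy to understand: each additional product held adds a fixed number of predicted months.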
Techniques in Data Mining
• Decision Trees
• Nearest Neighbor Classification
• Neural Networks
• Rule Induction
• K-means Clustering
Distinctions
Distinctions (Continued)