Introduction to BI
Automated Decision-Making
Framework
BI (according to Wikipedia)
• http://he.wikipedia.org/wiki/%D7%91%D7%99%D7%A0%D7%94_%D7%A2%D7%A1%D7%A7%D7%99%D7%AA
Contents
• 1 History
• 2 Work process
• 3 Data warehouse and BI
• 4 Online analytical processing (OLAP)
• 5 Data mining (all the learning methods we have studied)
• 6 Operational business intelligence
• 7 Main uses
• 8 BI products
History of DSS
Classical Definitions of DSS
• "Interactive computer-based systems, which help decision
makers utilize data and models to solve unstructured
problems" - Gorry and Scott-Morton, 1971
• "Decision support systems couple the intellectual resources of
individuals with the capabilities of the computer to improve
the quality of decisions. It is a computer-based support
system for management decision makers who deal with
semistructured problems" - Keen and Scott-Morton, 1978
Types of DSS
• Two major types:
– Model-oriented DSS
– Data-oriented DSS
• Evolution of DSS into Business Intelligence
– Use of DSS moved from specialist to managers, and
then whomever, whenever, wherever
– Enabling tools like OLAP, data warehousing, data mining,
intelligent systems, delivered via Web technology have
collectively led to the term “business intelligence” (BI)
and “business analytics”
From Wikipedia...
Since the mid-2000s, new business intelligence tools have been available
under an approach called Business Intelligence 2.0 (BI 2.0).
They let employees run queries on the organization's data
in real time. The term BI 2.0 was coined by analogy with Web 2.0,
because processing of this kind is built around
a browser in a Web environment. BI 2.0 tools allow more dynamic
reporting than the static reports that characterized previous-generation tools.
An important basis for this kind of processing is the use of SOA, which comes
together with more flexible middleware products
and the use of standards for transferring information.
SOA = Service Oriented Architecture
DSS Description
• DSS application
A DSS program built for a specific purpose
(e.g., a scheduling system for a specific
company)
• Business intelligence (BI)
A conceptual framework for decision
support. It combines architecture, databases
(or data warehouses), analytical tools, and
applications
Business Intelligence (BI)
• BI is an evolution of decision support
concepts over time.
– Meaning of EIS/DSS…
• Then: Executive Information System
• Now: Everybody’s Information System (BI)
• BI systems are enhanced with additional
visualizations, alerts, and performance
measurement capabilities.
• The term BI emerged from industry apps.
The Evolution of BI Capabilities
The Architecture of BI
• A BI system has four major components
– a data warehouse, with its source data
– business analytics, a collection of tools for manipulating,
mining, and analyzing the data in the data warehouse;
– business performance management (BPM) for monitoring
and analyzing performance
– a user interface (e.g., dashboard)
– In recent years, business intelligence has taken a central place
in information systems. The rapid growth of information accumulated in
computerized systems requires relevant data to be consolidated and presented
so that the information is meaningful. One sign of the field's importance is
the acquisition of prominent companies specializing in it by large
software companies
A High-Level Architecture of BI
Learning Objectives
• Explain data integration and the extraction,
transformation, and load (ETL) processes
• Describe real-time (a.k.a. right-time and/or active)
data warehousing
• Understand data warehouse administration and
security issues
Stage 1: Data Warehouse
• A physical repository where relational data are
specially organized to provide enterprise-wide,
cleansed data in a standardized format
• “The data warehouse is a collection of integrated,
subject-oriented databases designed to support
DSS functions, where each unit of data is nonvolatile
and relevant to some moment in time”
DW Framework
[Figure: data warehouse framework. Data sources (legacy, POS, other OLTP/Web, ERP, external data) feed an ETL process (select, extract, transform, integrate, load) into the enterprise data warehouse and its metadata. Data is replicated to data marts (finance, engineering, marketing, ...), or served directly under the "no data marts" option. Access goes through API / middleware to applications and visualization: routine business reporting, data/text mining, OLAP, dashboard, Web, custom-built applications.]
Data Integration and the Extraction,
Transformation, and Load (ETL) Process
Extraction, transformation, and load (ETL)
[Figure: ETL process. Sources (transient data source, packaged application, legacy system, other internal applications) pass through extract, transform, cleanse, and load steps into the data warehouse and data mart.]
Data Mart
A departmental data warehouse that stores
only relevant data
– Dependent data mart
A subset that is created directly from a data
warehouse
– Independent data mart
A small data warehouse designed for a
strategic business unit or a department
OLAP vs. OLTP
Online Analytical vs. Online Transaction (Processing)
OLAP
[Figure: Slicing operations on a simple three-dimensional data cube. A 3-dimensional OLAP cube with axes Time, Geography, and Product; cells are filled with numbers representing sales volumes. Three slices are shown:
– Sales volumes of a specific Product on variable Time and Region
– Sales volumes of a specific Region on variable Time and Products
– Sales volumes of a specific Time on variable Region and Products]
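A slicing operation fixes one dimension of the cube and returns the remaining two-dimensional table. The sketch below is only an illustration: the cube contents, the dimension values, and the `slice_cube` helper are hypothetical and not taken from the slides.

```python
# A tiny 3-dimensional sales cube keyed by (time, region, product).
# All values are made up for illustration.
cube = {
    ("Q1", "North", "Laptop"): 120,
    ("Q1", "North", "Phone"): 200,
    ("Q1", "South", "Laptop"): 80,
    ("Q1", "South", "Phone"): 150,
    ("Q2", "North", "Laptop"): 130,
    ("Q2", "North", "Phone"): 210,
    ("Q2", "South", "Laptop"): 90,
    ("Q2", "South", "Phone"): 160,
}
DIMENSIONS = ("time", "region", "product")


def slice_cube(cube, dimension, value):
    """Fix one dimension to a value and return the remaining 2-D table."""
    axis = DIMENSIONS.index(dimension)
    return {
        key[:axis] + key[axis + 1:]: sales
        for key, sales in cube.items()
        if key[axis] == value
    }


# Sales volumes of a specific Product on variable Time and Region.
laptop_slice = slice_cube(cube, "product", "Laptop")
for (time, region), sales in sorted(laptop_slice.items()):
    print(time, region, sales)
```

A production OLAP engine would pre-aggregate and index the cube; a dictionary is used here only to make the operation visible.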
Star vs Snowflake Schema
[Figure: Star schema vs. snowflake schema.
Star Schema: fact table SALES (UnitsSold, ...) surrounded by dimensions TIME (Quarter, ...), PRODUCT (Brand, ...), PEOPLE (Division, ...), and GEOGRAPHY (Country, ...).
Snowflake Schema: fact table SALES (UnitsSold, ...) with dimensions DATE (Date, ...), PRODUCT (LineItem, ...), PEOPLE (Division, ...), and STORE (LocID, ...), further normalized into dimensions MONTH (M_Name, ...), QUARTER (Q_Name, ...), BRAND (Brand, ...), CATEGORY (Category, ...), and LOCATION (State, ...).]
Another example of a SNOWFLAKE schema
Data Mining
• Classification (is it worthwhile or not to ...)
– Lend money, invest in a field, open a new branch
• Cluster analysis (Clustering)
– How many types of customers are there? What do they have in common?
• Regression analysis
– How much will we earn, optimization
Types of Information
• Mining information from data
– The "simpler" kind
• Mining information from text
– INFORMATION RETRIEVAL
– SENTIMENT ANALYSIS, TREND ANALYSIS
Categories of Models
• Optimization of problems with few alternatives
– Objective: Find the best solution from a small number of alternatives
– Techniques: Decision tables, decision trees
• Optimization via algorithm
– Objective: Find the best solution from a large number of alternatives using a step-by-step process
– Techniques: Linear and other mathematical programming models
• Optimization via an analytic formula
– Objective: Find the best solution in one step using a formula
– Techniques: Some inventory models
• Simulation
– Objective: Find a good enough solution by experimenting with a dynamic model of the system
– Techniques: Several types of simulation
• Heuristics
– Objective: Find a good enough solution using “common-sense” rules
– Techniques: Heuristic programming and expert systems
• Predictive and other models
– Objective: Predict future occurrences, what-if analysis, …
– Techniques: Forecasting, Markov chains, financial, …
Static and Dynamic Models
• Static Analysis
– Single snapshot of the situation
– Single interval
– Steady state
• Dynamic Analysis
– Dynamic models
– Evaluate scenarios that change over time
– Time dependent
– Represents trends and patterns over time
– More realistic: Extends static models
Decision Analysis: A Few Alternatives
Single Goal Situations
• Decision trees
– Graphical representation of
relationships
– Multiple criteria approach
– Demonstrates complex
relationships
– Cumbersome if many alternatives
exist
Decision Tables
• Investment example
• One goal: maximize the yield after one year
• Yield depends on the status of the economy
(the state of nature)
– Solid growth
– Stagnation
– Inflation
Investment Example:
Possible Situations
1. If solid growth in the economy, bonds yield 12%; stocks
15%; time deposits 6.5%
2. If stagnation, bonds yield 6%; stocks 3%; time deposits 6.5%
3. If inflation, bonds yield 3%; stocks lose 2%; time deposits
yield 6.5%
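The investment example can be written as a decision table and evaluated with a few common decision rules. This is a minimal sketch: the yields come from the example above, while the choice of rules (maximax, maximin, expected value) and the state probabilities are assumptions added for illustration.

```python
# Yields (%) from the investment example: one row per alternative,
# one column per state of nature.
STATES = ("solid growth", "stagnation", "inflation")
PAYOFFS = {
    "bonds": (12.0, 6.0, 3.0),
    "stocks": (15.0, 3.0, -2.0),
    "time deposits": (6.5, 6.5, 6.5),
}


def maximax(payoffs):
    """Optimistic rule: pick the alternative with the best best-case yield."""
    return max(payoffs, key=lambda alt: max(payoffs[alt]))


def maximin(payoffs):
    """Pessimistic rule: pick the alternative with the best worst-case yield."""
    return max(payoffs, key=lambda alt: min(payoffs[alt]))


def expected_value(payoffs, probabilities):
    """Expected yield of each alternative for given state probabilities."""
    return {
        alt: sum(p * y for p, y in zip(probabilities, yields))
        for alt, yields in payoffs.items()
    }


# Assumed probabilities for the three states (not given in the example).
probs = (0.5, 0.3, 0.2)
ev = expected_value(PAYOFFS, probs)
print("optimist:", maximax(PAYOFFS))
print("pessimist:", maximin(PAYOFFS))
print("best expected yield:", max(ev, key=ev.get))
```

The three rules can recommend different alternatives, which is why the attitude toward risk has to be stated before the table is "solved".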
Optimization
via Mathematical Programming
• Mathematical Programming
A family of tools designed to help solve managerial
problems in which the decision maker must allocate
scarce resources among competing activities to optimize a
measurable goal
• Optimal solution: The best possible solution to a
modeled problem
– Linear programming (LP): A mathematical model for the
optimal solution of resource allocation problems. All the
relationships are linear
LP Problem Characteristics
1. Limited quantity of economic resources
2. Resources are used in the production of products or
services
3. Two or more ways (solutions, programs) to use the
resources
4. Each activity (product or service) yields a return in
terms of the goal
5. Allocation is usually restricted by constraints
Linear Programming Steps
• 1. Identify the …
– Decision variables
– Objective function
– Objective function coefficients
– Constraints
• Capacities / Demands
• 2. Represent the model
– LINDO: Write mathematical formulation
– EXCEL: Input data into specific cells in Excel
• 3. Run the model and observe the results
LP Example
The Product-Mix Linear Programming Model
• MBI Corporation
• Decision: How many computers to build next month?
• Two types of mainframe computers: CC7 and CC8
• Constraints: Labor limits, Materials limit, Marketing lower limits
               CC7      CC8      Rel   Limit
Labor (days)   300      500      <=    200,000 /mo
Materials ($)  10,000   15,000   <=    8,000,000 /mo
Units          1                 >=    100
Units                   1        >=    200
Profit ($)     8,000    12,000   Max
Objective: Maximize Total Profit / Month
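With only two decision variables, the product-mix model can be solved by checking the corner points of the feasible region. The sketch below uses the coefficients from the MBI example; the corner-enumeration approach is a stand-in chosen for illustration (the slides name LINDO and Excel as the tools), and it assumes the feasible region is bounded and non-empty.

```python
from itertools import combinations

# Each constraint is (a, b, c), meaning a*x + b*y <= c,
# where x = CC7 units and y = CC8 units.
CONSTRAINTS = [
    (300, 500, 200_000),          # labor days per month
    (10_000, 15_000, 8_000_000),  # materials budget per month
    (-1, 0, -100),                # x >= 100 (marketing lower limit)
    (0, -1, -200),                # y >= 200 (marketing lower limit)
]
PROFIT = (8_000, 12_000)


def intersection(c1, c2):
    """Intersection of two constraint boundary lines, or None if parallel."""
    a, b, c = c1
    d, e, f = c2
    det = a * e - b * d
    if det == 0:
        return None
    return ((c * e - b * f) / det, (a * f - c * d) / det)


def feasible(point, tol=1e-6):
    """True if the point satisfies every constraint (with a small tolerance)."""
    x, y = point
    return all(a * x + b * y <= c + tol * max(1, abs(c)) for a, b, c in CONSTRAINTS)


def solve():
    """Return the feasible corner point with the highest profit."""
    best, best_profit = None, float("-inf")
    for c1, c2 in combinations(CONSTRAINTS, 2):
        point = intersection(c1, c2)
        if point is None or not feasible(point):
            continue
        value = PROFIT[0] * point[0] + PROFIT[1] * point[1]
        if value > best_profit:
            best, best_profit = point, value
    return best, best_profit


(x, y), profit = solve()
print(f"CC7 = {x:.2f}, CC8 = {y:.2f}, profit = ${profit:,.2f}")
```

Under these numbers the labor constraint and the CC8 lower limit are binding, giving roughly 333.33 units of CC7 and 200 units of CC8. The model is continuous, so the fractional unit count would need rounding or an integer formulation in practice.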
Sensitivity, What-if, and
Goal Seeking Analysis
• Sensitivity
– Assesses impact of change in inputs on outputs
– Eliminates or reduces variables
– Can be automatic or trial and error
• What-if
– Assesses solutions based on changes in variables or
assumptions (scenario analysis)
• Goal seeking
– Backwards approach, starts with goal
– Determines values of inputs needed to achieve goal
– Example is break-even point determination
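What-if analysis and goal seeking can be contrasted on a simple profit model. This is a sketch under stated assumptions: the price, cost figures, and the bisection-based `goal_seek` helper are hypothetical, and the helper assumes the model increases with its input.

```python
def profit(units, price=50.0, variable_cost=30.0, fixed_cost=10_000.0):
    """Profit model with hypothetical price and cost figures."""
    return units * (price - variable_cost) - fixed_cost


def goal_seek(model, target, low, high, tol=1e-6):
    """Find the input in [low, high] where model(input) equals target.

    Uses bisection; assumes the model is increasing and that the
    target lies between model(low) and model(high).
    """
    if not model(low) <= target <= model(high):
        raise ValueError("target is not bracketed by [low, high]")
    while high - low > tol:
        mid = (low + high) / 2
        if model(mid) < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2


# What-if: change an input and observe the output.
what_if_profit = profit(600)

# Goal seeking: start from the goal (zero profit) and work backwards.
break_even_units = goal_seek(profit, target=0.0, low=0.0, high=10_000.0)
print(what_if_profit, round(break_even_units, 2))
```

What-if runs the model forwards from inputs to output, while goal seeking searches for the input that produces a required output.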
Heuristic Programming
• Cuts the search space
• Gets satisfactory solutions more
quickly and less expensively
• Finds good enough feasible
solutions to very complex problems
• Heuristics can be
– Quantitative
– Qualitative (in ES)
• Traveling Salesman Problem >>>
Heuristic Programming - SEARCH
Traveling Salesman Problem
• What is it?
– A traveling salesman must visit customers in several cities,
visiting each city only once, across the country. Goal: Find
the shortest possible route
– Total number of unique routes (TNUR):
TNUR = (1/2) (Number of Cities – 1)!
Number of Cities   TNUR
5                  12
6                  60
9                  20,160
20                 6.08 × 10^16
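The TNUR formula is easy to evaluate directly, which shows how fast the search space grows. A minimal sketch follows; the function name `unique_routes` is hypothetical, and the formula assumes a symmetric round trip (a route and its reverse count once).

```python
from math import factorial


def unique_routes(cities):
    """TNUR = (1/2) * (cities - 1)! for a symmetric round trip."""
    if cities < 3:
        raise ValueError("the formula assumes at least 3 cities")
    return factorial(cities - 1) // 2


for n in (5, 6, 9, 20):
    print(n, unique_routes(n))
```

For 20 cities the formula already evaluates to about 6.08 × 10^16 routes, which is why exhaustive search is replaced by heuristics.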
When to Use Heuristics
– Inexact or limited input data
– Complex reality
– Reliable, exact algorithm not available
– Computation time excessive
– For making quick decisions
Limitations of Heuristics
– Cannot guarantee an optimal solution
Modern Heuristic Methods
• Tabu search
– Intelligent search algorithm
• Genetic algorithms
– Survival of the fittest
• Simulated annealing
– Analogy to Thermodynamics
Simulation
• Technique for conducting experiments with a
computer on a comprehensive model of the behavior
of a system
• Frequently used in DSS tools
Major Characteristics of Simulation
• Imitates reality and captures its richness
• Technique for conducting experiments
• Descriptive, not normative tool
• Often used to “solve” very complex problems
Simulation is normally used only when a problem is
too complex to be treated using numerical
optimization techniques
Advantages of Simulation
• The theory is fairly straightforward
• Great deal of time compression
• Experiment with different alternatives
• The model reflects manager’s perspective
• Can handle wide variety of problem types
• Can include the real complexities of problems
• Produces important performance measures
• Often it is the only DSS modeling tool for nonstructured problems
Limitations of Simulation
• Cannot guarantee an optimal solution
• Slow and costly construction process
• Cannot transfer solutions and inferences to solve
other problems (problem specific)
• So easy to explain/sell to managers that it may lead
to overlooking analytical solutions
• Software may require special skills
Simulation Types
• Stochastic vs. Deterministic Simulation
– In stochastic simulations: We use distributions (Discrete or Continuous
probability distributions)
• Time-dependent vs. Time-independent Simulation
– Time independent stochastic simulation via Monte Carlo technique (X
= A + B)
• Discrete event vs. Continuous simulation
• Steady State vs. Transient Simulation
• Simulation Implementation
– Visual simulation
– Object-oriented simulation
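The time-independent stochastic case (X = A + B via the Monte Carlo technique) can be sketched as follows. The input distributions, the trial count, and the threshold of 22 are assumptions made for illustration only.

```python
import random
import statistics


def simulate_sum(trials=100_000, seed=42):
    """Monte Carlo samples of X = A + B with assumed input distributions."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        a = rng.uniform(10, 20)  # assumed: A ~ Uniform(10, 20)
        b = rng.gauss(5, 1)      # assumed: B ~ Normal(mean 5, sd 1)
        samples.append(a + b)
    return samples


samples = simulate_sum()
mean_x = statistics.fmean(samples)
stdev_x = statistics.stdev(samples)
# Estimated probability that X exceeds a threshold of interest.
share_above_22 = sum(x > 22 for x in samples) / len(samples)
print(round(mean_x, 2), round(stdev_x, 2), round(share_above_22, 3))
```

The simulation describes the distribution of X rather than prescribing a best decision, which matches the "descriptive, not normative" characterization of the technique.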
Data Mining Methods: Classification
• Most frequently used DM method
• Part of the machine-learning family
• Employs supervised learning
• Learns from past data, classifies new data
• The output variable is categorical (nominal
or ordinal) in nature
• Classification versus regression?
• Classification versus clustering?
Assessment Methods for
Classification
• Predictive accuracy
– Hit rate
• Speed
– Model building; predicting
• Robustness
• Scalability
• Interpretability
– Transparency, explainability
Accuracy of Classification Models
• In classification problems, the primary source for
accuracy estimation is the confusion matrix
                            True Class: Positive        True Class: Negative
Predicted Class: Positive   True Positive Count (TP)    False Positive Count (FP)
Predicted Class: Negative   False Negative Count (FN)   True Negative Count (TN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
True Positive Rate = TP / (TP + FN)
True Negative Rate = TN / (TN + FP)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Estimation Methodologies for
Classification
• Simple split (or holdout or test sample estimation)
– Split the data into 2 mutually exclusive sets: training
(~70%) and testing (~30%)
[Figure: Preprocessed data is split into training data (2/3), used for model development to produce a classifier, and testing data (1/3), used for model assessment (scoring) to estimate prediction accuracy.]
Estimation Methodologies for
Classification
• k-Fold Cross Validation (rotation estimation)
– Split the data into k mutually exclusive subsets
– Use each subset as testing while using the rest of the
subsets as training
– Repeat the experiment k times
– Aggregate the test results for a true estimation of
prediction accuracy
• Other estimation methodologies
– Leave-one-out, bootstrapping, jackknifing
– Area under the ROC curve
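The k-fold procedure can be sketched with index lists only, leaving the actual model open. The `evaluate` callback is a hypothetical placeholder for "train on these rows, test on those rows, return the accuracy"; the shuffling seed is likewise an assumption.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for k mutually exclusive folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        # Every fold except the i-th one is used for training.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(evaluate, n_samples, k):
    """Average the accuracy returned by evaluate(train, test) over k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


for train, test in k_fold_indices(10, 5):
    print("train:", sorted(train), "test:", sorted(test))
```

Setting k equal to the number of samples gives the leave-one-out variant listed among the other estimation methodologies.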
Classification Techniques
• Decision tree analysis
• Statistical analysis
• Neural networks
• Support vector machines
• Case-based reasoning
• Bayesian classifiers
• Genetic algorithms
• Rough sets
Decision Trees
• Employs the divide and conquer method
• Recursively divides a training set until each division
consists of examples from one class
1. Create a root node and assign all of the training data to it
2. Select the best splitting attribute
3. Add a branch to the root node for each value of the split.
Split the data into mutually exclusive subsets along the
lines of the specific split
4. Repeat steps 2 and 3 for each and every leaf node
until the stopping criteria are reached
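The "select the best splitting attribute" step needs a scoring rule. The sketch below uses information gain, the criterion associated with ID3; that choice, the helper names, and the loan data are assumptions for illustration, since the slides do not fix a particular criterion.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting rows on the attribute."""
    subsets = {}
    for row in rows:
        subsets.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy([row[target] for row in rows]) - remainder


def best_split(rows, attributes, target):
    """Pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Hypothetical training data for a loan decision.
rows = [
    {"income": "high", "owns_home": "yes", "approve": "yes"},
    {"income": "high", "owns_home": "no", "approve": "yes"},
    {"income": "low", "owns_home": "yes", "approve": "no"},
    {"income": "low", "owns_home": "no", "approve": "no"},
]
print(best_split(rows, ["income", "owns_home"], "approve"))
```

A full tree builder would call `best_split` recursively on each subset until a stopping criterion is met, for example when every subset contains a single class.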
Decision Trees
• DT algorithms mainly differ on
– Splitting criteria
• Which variable to split first?
• What values to use to split?
• How many splits to form for each node?
– Stopping criteria
• When to stop building the tree
– Pruning (generalization method)
• Pre-pruning versus post-pruning
• Most popular DT algorithms include
– ID3, C4.5, C5; CART; CHAID; M5
Cluster Analysis for Data Mining
• k-Means Clustering Algorithm
– k : pre-determined number of clusters
– Algorithm (Step 0: determine value of k)
Step 1: Randomly generate k random points as initial cluster
centers
Step 2: Assign each point to the nearest cluster center
Step 3: Re-compute the new cluster centers
Repetition step: Repeat steps 2 and 3 until some
convergence criterion is met (usually that the assignment
of points to clusters becomes stable)
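The k-means steps can be sketched in plain Python for two-dimensional points. This is an illustrative variant, not a reference implementation: it initializes the centers by sampling k of the data points (rather than generating arbitrary random points), and the sample data and seed are made up.

```python
import random


def k_means(points, k, max_iterations=100, seed=0):
    """Plain k-means on 2-D points; returns (centers, assignments)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # Step 1: pick k data points as initial centers
    assignments = None
    for _ in range(max_iterations):
        # Step 2: assign each point to the nearest center (squared distance).
        new_assignments = [
            min(
                range(k),
                key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2,
            )
            for p in points
        ]
        if new_assignments == assignments:  # assignments are stable: converged
            break
        assignments = new_assignments
        # Step 3: recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centers, assignments


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centers, assignments = k_means(points, k=2)
print(centers, assignments)
```

The result depends on the initial centers in general, so k-means is commonly run several times with different seeds.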
Cluster Analysis for Data Mining - k-Means Clustering Algorithm
[Figure: illustration of Step 1, Step 2, and Step 3 of the k-means algorithm]
Data Mining Myths
• Data mining …
– provides instant solutions/predictions
– is not yet viable for business applications
– requires a separate, dedicated database
– can only be done by those with advanced degrees
– is only for large firms that have lots of customer
data
– is another name for the good-old statistics
Common Data Mining Mistakes
1. Selecting the wrong problem for data mining
2. Ignoring what your sponsor thinks data mining is
and what it really can/cannot do
3. Leaving insufficient time for data acquisition,
selection and preparation
4. Looking only at aggregated results and not at
individual records/predictions
5. Being sloppy about keeping track of the data
mining procedure and results
Common Data Mining Mistakes
6. Ignoring suspicious (good or bad) findings and
quickly moving on
7. Running mining algorithms repeatedly and blindly,
without thinking about the next stage
8. Naively believing everything you are told about
the data
9. Naively believing everything you are told about
your own data mining analysis
10. Measuring your results differently from the way
your sponsor measures them
Text Mining Application Area
• Information extraction
• Topic tracking
• Summarization
• Categorization
• Clustering
• Concept linking
• Question answering
Text Mining Terminology
• Unstructured or semistructured data
• Corpus (and corpora)
• Terms
• Concepts
• Stemming
• Stop words (and include words)
• Synonyms (and polysemes)
• Tokenizing
Text Mining Terminology
• Term dictionary
• Word frequency
• Part-of-speech tagging
• Morphology
• Term-by-document matrix
– Occurrence matrix
• Singular value decomposition
– Latent semantic indexing
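Several of these terms (tokenizing, stop words, word frequency, occurrence matrix) come together when a small corpus is turned into a term-by-document matrix. A minimal sketch, assuming a tiny hand-picked stop-word list and a two-document corpus invented for illustration; no stemming is applied.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "of", "and", "to", "in"}  # tiny illustrative list


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_document_matrix(corpus):
    """Return (terms, matrix); matrix[i][j] counts term j in document i."""
    counts = [
        Counter(t for t in tokenize(doc) if t not in STOP_WORDS)
        for doc in corpus
    ]
    terms = sorted(set().union(*counts))
    matrix = [[c[term] for term in terms] for c in counts]
    return terms, matrix


corpus = [
    "Data mining is the analysis of data",
    "Text mining is mining of text",
]
terms, matrix = term_document_matrix(corpus)
print(terms)
for row in matrix:
    print(row)
```

Here rows are documents and columns are terms. Singular value decomposition would then be applied to a (much larger) matrix of this kind to reduce its dimensionality.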
Natural Language Processing (NLP)
• Structuring a collection of text
– Old approach: bag-of-words
– New approach: natural language processing
• NLP is …
– a very important concept in text mining
– a subfield of artificial intelligence and computational
linguistics
– the study of "understanding" natural human
language
• Syntax versus semantics based text mining
Natural Language Processing (NLP)
• Challenges in NLP
– Part-of-speech tagging
– Text segmentation
– Word sense disambiguation
– Syntax ambiguity
– Imperfect or irregular input
– Speech acts
• Dream of AI community
– to have algorithms that are capable of automatically
reading and obtaining knowledge from text
NLP Task Categories
• Information retrieval
• Information extraction
• Named-entity recognition
• Question answering
• Automatic summarization
• Natural language generation and understanding
• Machine translation
• Foreign language reading and writing
• Speech recognition
• Text proofing
• Optical character recognition
Text Mining Applications
• Marketing applications
– Enables better CRM
• Security applications
– ECHELON, OASIS
– Deception detection (…)
• Medicine and biology
– Literature-based gene identification (…)
• Academic applications
– Research stream analysis
Web Mining Success Stories
• Amazon.com, Ask.com, Scholastic.com, …
• Website Optimization Ecosystem
[Figure: Website optimization ecosystem. Customer interaction on the Web feeds the analysis of interactions (Web analytics, voice of customer, customer experience management), which yields knowledge about the holistic view of the customer.]
Web Mining Tools
Product Name                         URL
Angoss Knowledge WebMiner            angoss.com
ClickTracks                          clicktracks.com
LiveStats from DeepMetrix            deepmetrix.com
Megaputer WebAnalyst                 megaputer.com
MicroStrategy Web Traffic Analysis   microstrategy.com
SAS Web Analytics                    sas.com
SPSS Web Mining for Clementine       spss.com
WebTrends                            webtrends.com
XML Miner                            scientio.com
Machine Learning Methods
Machine Learning
• Supervised Learning
– Classification
• Decision Tree
• Neural Networks
• Support Vector Machines
• Case-based Reasoning
• Rough Sets
• Discriminant Analysis
• Logistic Regression
• Rule Induction
– Regression
• Regression Trees
• Neural Networks
• Support Vector Machines
• Linear Regression
• Non-linear Regression
• Bayesian Linear Regression
• Reinforcement Learning
– Q-Learning
– Adaptive Heuristic Critic (AHC)
– State-Action-Reward-State-Action (SARSA)
– Genetic Algorithms
– Gradient Descent
• Unsupervised Learning
– Clustering / Segmentation
• SOM (Neural Networks)
• Adaptive Resonance Theory
• Expectation Maximization
• K-Means
• Genetic Algorithms
– Association
• Apriori
• ECLAT Algorithm
• FP-Growth
• One-attribute Rule
• Zero-attribute Rule
BPM versus BI
• BPM is an outgrowth of BI and incorporates many of
its technologies, applications, and techniques.
– The same companies market and sell them.
– BI has evolved so that many of the original differences
between the two no longer exist (e.g., BI used to be
focused on departmental rather than enterprise-wide
projects).
– BI is a crucial element of BPM.
• BPM = BI + Planning (a unified solution)
Performance Measurement
KPIs and Operational Metrics
• Key performance indicator (KPI)
A KPI represents a strategic objective and
metric that measures performance against
a goal
• Distinguishing features of KPIs
– Strategy
– Targets
– Ranges
– Encodings
– Time frames
– Benchmarks
Performance Measurement
• Key performance indicator (KPI)
Outcome KPIs (lagging indicators, e.g., revenues) vs.
Driver KPIs (leading indicators, e.g., sales leads)
• Operational areas covered by driver KPIs
– Customer performance
– Service performance
– Sales operations
– Sales plan/forecast
BPM Methodologies
• The meaning of “balance”
– BSC is designed to overcome the limitations of
systems that are financially focused
– Nonfinancial objectives fall into one of three
perspectives:
1. Customer
2. Internal business process
3. Learning and growth
BPM Methodologies
• In BSC, the term “balance” arises because
the combined set of measures is
supposed to encompass indicators that are:
– Financial and nonfinancial
– Leading and lagging
– Internal and external
– Quantitative and qualitative
– Short term and long term
BPM Methodologies
Strategy map
A visual display that
delineates the
relationships
among the key
organizational
objectives for all
four BSC
perspectives
Performance Dashboards
• Dashboards and scorecards both provide
visual displays of important information that is
consolidated and arranged on a single screen
so that information can be digested at a single
glance and easily explored
Performance Dashboards
• Dashboards versus scorecards
– Performance dashboards
Visual display used to monitor operational
performance (free form)
– Performance scorecards
Visual display used to chart progress against
strategic and tactical goals and targets
(predetermined measures)