Profiting from Data Mining Gio Wiederhold November 2003 Gio Wiederhold PDM 1

Steps needed to profit
1. Obtaining relevant data
   – Always incomplete
2. Extracting relationships
   – Imputing causality?
3. Finding applicability
   – Determining leverage points (model based)
4. Inventing candidate actions
   – Assessing likely outcomes and benefits
5. Selecting action to be taken
   – Measuring the outcome
   – Collecting data for next round
Gio Wiederhold PDM 2
Today's Problem: Disjointness
1. Database administrators
   • Focus on data collection, organization, currency
2. Analysts
   • Focus on slicing, dicing, relationships
3. Middle managers
   • Focus on their costs, profits
4. MBAs
   • Focus on business models, planning
5. Executives
   • Must make decisions based on diverse inputs
Gio Wiederhold PDM 3
1. Data Collection
Two choices
1. (rare) Collect data specifically for analysis
   • allows careful design: model causes and effects
     Purchase = f(price, color, size, customer income, gender, ...)
   • costly
   • often small to make collection manageable
   • imposes delays
2. (common) Use data collected for other purposes
   • take advantage of what is readily available
   • low cost
   • filtering, reformatting, integration
   • incomplete: rarely covers all causes / effects
   • biased: missing categories
     only people with phones, cars; shopping in supermarkets
Gio Wiederhold PDM 4
1a. Data Integration
Needed when sources have inadequate coverage
• in distinct DBs for
– Prices, Number purchased
– Customer segments (supermarket, stores, on-line)
implies some expectations
append attributes where keys match: Joe
include semantic match Joe = 012 34 567
append rows where key types match: customer
include semantic match customer = owner
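The two integration moves above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names, and the helper functions `append_attributes` and `append_rows` are hypothetical, and the semantic matches (Joe = 012 34 567, owner = customer) are supplied by hand, as an expert would.

```python
# Hypothetical records; names, keys, and values are illustrative only.
prices = {"012 34 567": {"price": 4.50, "number_purchased": 3}}
segments = {"Joe": {"segment": "supermarket"}}

# Semantic match supplied by an expert: the name "Joe" and the
# identifier "012 34 567" denote the same customer.
same_entity = {"Joe": "012 34 567"}


def append_attributes(left, right, key_map):
    """Append attributes of `right` to rows of `left` where keys match,
    after translating right-hand keys through the semantic match."""
    merged = {key: dict(row) for key, row in left.items()}
    for right_key, row in right.items():
        left_key = key_map.get(right_key, right_key)
        if left_key in merged:
            merged[left_key].update(row)
    return merged


def append_rows(tables, type_map, target):
    """Append rows of every table whose key type matches `target`,
    after mapping synonymous key types (e.g. owner -> customer)."""
    rows = []
    for key_type, table_rows in tables.items():
        if type_map.get(key_type, key_type) == target:
            rows.extend(dict(row) for row in table_rows)
    return rows


wide = append_attributes(prices, segments, same_entity)
tall = append_rows(
    {
        "customer": [{"id": "012 34 567"}],
        "owner": [{"id": "987 65 432"}],
        "supplier": [{"id": "555 00 111"}],
    },
    {"owner": "customer"},
    "customer",
)
```

Without the semantic match tables, neither append would find anything to combine, which is why integration "implies some expectations" about what the sources mean.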
Gio Wiederhold PDM 5
2. Data analysis
• Find relationships
– already known - ignore or adjust in next round
» requires comparison with expert knowledge
» now have quantification
– unknown
» uninteresting per expert
» interesting per expert
Gio Wiederhold PDM 6
3. Establish causality
• Already known -- Prior Model
  – But is it complete, i.e., does it explain all effects?
• Analyze relationships
  – use expertise to decide direction
    » often obvious: "common world knowledge"
    » sometimes ambiguous: use temporal information
      smoking → cancer → not-smoking
    » often major true cause not captured in data
      purchase of Chinese vs other food:
        food color 10%, food price 20%, buyer gender 2%, unknown 75%
      guess: ethnicity, income
      invent surrogates: names, ZIP codes, ...
Gio Wiederhold PDM 7
Establishing causality is risky
1. Is a Volvo a safe car?
   Mined: Volvos have fewer accidents
2. What causes accidents? Drivers!
3. Who buys Volvos? Careful drivers!
4. Must determine
   • effect of safe drivers
   • percentage of safe drivers overall
   • percentage of safe drivers with Volvos
5. How much of the accident rate is now explained?
   The unexplained difference can be attributed to the car.
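A small worked example may make the decomposition concrete. Every number below is hypothetical, and the sketch assumes a deliberately simple model: two driver types whose accident rates do not depend on the car. The part of the Volvo advantage that the driver mix cannot account for is what remains attributable to the car.

```python
def expected_rate(share_careful, rate_careful, rate_other):
    """Accident rate expected from the driver mix alone."""
    return share_careful * rate_careful + (1.0 - share_careful) * rate_other


# All numbers are hypothetical, chosen only to illustrate the steps.
RATE_CAREFUL = 2.0           # accidents per 100 driver-years, careful drivers
RATE_OTHER = 6.0             # accidents per 100 driver-years, other drivers
SHARE_CAREFUL_OVERALL = 0.5  # percentage of safe drivers overall
SHARE_CAREFUL_VOLVO = 0.8    # percentage of safe drivers with Volvos
OBSERVED_VOLVO_RATE = 2.5    # the mined fact: Volvos have fewer accidents

overall_rate = expected_rate(SHARE_CAREFUL_OVERALL, RATE_CAREFUL, RATE_OTHER)
volvo_rate_from_drivers = expected_rate(SHARE_CAREFUL_VOLVO, RATE_CAREFUL, RATE_OTHER)

raw_difference = overall_rate - OBSERVED_VOLVO_RATE            # what mining reports
explained_by_drivers = overall_rate - volvo_rate_from_drivers  # due to who buys Volvos
attributable_to_car = volvo_rate_from_drivers - OBSERVED_VOLVO_RATE

print(f"raw {raw_difference:.1f}, drivers {explained_by_drivers:.1f}, "
      f"car {attributable_to_car:.1f}")
```

With these made-up inputs most of the raw difference comes from the drivers, and only a small remainder is left for the car itself.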
Gio Wiederhold PDM 8
To use results of data mining
• have to understand direction of relationships
[Figure: model diagram. Labels: controllable causes; external causes; hidden; captured by data; Model; interesting, beneficial effects; side effects]
Gio Wiederhold PDM 9
4. Causes provide the leverage
Language of analyst / Language of modeling
• Many causes -- independent variables
– A few may be controllable
– Some may be controlled by our competition
– Others are forces-of-nature
• Even more effects -- dependent variables
– A few may be desired
– Some may be disastrous
– Many are poorly understood
• Intermediate effects
– Provide a means for measuring effectiveness
– Allow correction of actions taken
Gio Wiederhold PDM 10
5. Planning & Assessment
Analyze Alternatives
• Current Capabilities
• Future Expectations
[Figure: predict the future from now]
Process tasks:
• List resources
• Enumerate alternatives
• Prune alternatives
• Compare alternatives
Gio Wiederhold PDM 11
Prediction Requires Tools
E-mail this book,
Alfred Knopf, 1997
Gio Wiederhold PDM 12
Simulations predict
1. Back-of-the-envelope
   • Common
   • Adequate if model is simple
   • Assumptions are easily forgotten after some time,
     not distinguished from data: "Why are we doing this?"
2. Spreadsheets
   • Most common computing tool
   • Specialist modeler can help
   • New, recent data can be pasted in
   • Awkward for the tree of future alternatives
3. Constructed to order
   • Costly, powerful technology
   • Specialist modelers required
   • Expressive simulation languages
   • Requires specialists to set up, run, and rerun with new data
Gio Wiederhold PDM 13
Simulation results: likelihoods
[Figure: tree branching from "now" along the time axis into next-period alternatives and subsequent periods; each branch is labeled with a likelihood, and uncertainty increases with time]
Gio Wiederhold PDM 14
Simulation services
Wide variety, but common principle: Inputs → Model → Output (time, $, place, ...)
1. Spreadsheets
   – Identify independent, controllable, and resulting values
2. Execution specific to query: what-if assessment
   – may require HPC power for adequate response
3. Continuously executing: weather prediction
   – Search for best match (location, time)
4. Past simulation results collected for future use
   – Typically sparse: the dimension of the futures is too large
   – Tables in a design handbook: materials
   – Perform inter- or extrapolations to match query parameters
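As an illustration of the last case, stored simulation results can be matched to a query parameter by linear inter- or extrapolation. This is a minimal sketch, assuming a one-dimensional table of (parameter, result) pairs; the function name `interpolate` and the stored values are hypothetical.

```python
from bisect import bisect_left


def interpolate(table, x):
    """Linearly inter- or extrapolate y at x from (x, y) pairs sorted by x.

    Requires at least two stored points."""
    xs = [point[0] for point in table]
    i = bisect_left(xs, x)
    # Clamp to the nearest segment so out-of-range queries extrapolate.
    i = min(max(i, 1), len(table) - 1)
    (x0, y0), (x1, y1) = table[i - 1], table[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Hypothetical stored results, e.g. rows of a design-handbook table.
stored = [(10.0, 100.0), (20.0, 180.0), (40.0, 300.0)]

inside = interpolate(stored, 15.0)   # between two stored points
beyond = interpolate(stored, 50.0)   # past the last stored point
```

Extrapolation beyond the stored range is much less trustworthy than interpolation inside it, which is one reason a service would want to report uncertainty along with the value.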
Gio Wiederhold PDM 15
6. Specify Value of Effects
Still needed: Value of alternative outcomes
• Decision maker / owner input
– Benefits and Costs
– Potential Profit
[Figure: timeline from past through now to alternative futures, whose outcomes carry values of 1000, 2000, 5000, 1000, 0, -2000, and -6000]
– Correct for risk, and adjust to present value
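One common way to carry out this adjustment is to discount each future value back to the present, with a risk premium added to the discount rate. The sketch below assumes that approach; the rates and the function name `present_value` are hypothetical, and other risk corrections are equally possible.

```python
def present_value(value, periods, discount_rate, risk_premium=0.0):
    """Discount a value received `periods` from now back to the present.

    Adding a premium to the rate is one simple way to correct for risk."""
    return value / (1.0 + discount_rate + risk_premium) ** periods


# Hypothetical inputs: an outcome worth 1210 two periods out, 10% per period.
safe_outcome = present_value(1210.0, periods=2, discount_rate=0.10)

# The same kind of adjustment for a riskier outcome one period out.
risky_outcome = present_value(1000.0, periods=1, discount_rate=0.05, risk_premium=0.20)
```

Values adjusted this way become comparable across outcomes that arrive at different times.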
Gio Wiederhold PDM 16
Having it all together
• Relationships from analyses of past data
• Data representing the current state
• List of actionable alternatives
• Tree of subsequent alternatives
• Probabilities of those alternatives
• Values of the outcomes
• Ability to predict the likelihood of futures
[Figure: tree of alternatives with branch probabilities and outcome values of 1000, 2000, 5000, 1000, 0, -2000, and -6000]
Gio Wiederhold PDM 17
Vision: Putting it all together
Combine results mined from past data, current
observations, and predictions into the future.
[Figure: a decision maker, assisted by support specialists]
Gio Wiederhold PDM 18
Needed: Information Systems that also
project seamlessly into the Futures
[Figure: timeline running from past through now to future]
Support of decision-making requires dealing with the futures,
as well as the past
• Databases deal well with the past
• Streaming sensors supply current status
• Spreadsheets, simulations deal with the likely futures
Future information systems should combine all these sources
Gio Wiederhold PDM 19
Connecting it all
Build super systems
• Coherent, consistent
• Expensive
• Unmaintainable
• Too many cooks:
  – Database folk
  – Data miners
  – Analysts
  – Planners
  – Simulation specialists
  – Decision makers
Develop interfaces
• Incremental
• Composable as needed
• Heterogeneous
• Interfaces required: Metadata
  – Database to miners: SQL
  – Mined results to analysts: XML?
  – Analysts to planners ?
  – Planners to Simulations? SimQL
  – Decision makers: New tools !
Gio Wiederhold PDM 20
Interfaces enable integration:
New: SimQL to access Simulations
[Figure: timeline running from past through now to futures]
• Past: Databases and schemas, accessed via SQL or XML
• Now: Msg systems, sensors, streaming data
• Futures: Simulations, accessed via SimQL and schema compliant wrappers
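To make the idea of a schema compliant wrapper more tangible, here is a minimal Python sketch. It is not SimQL and does not reproduce its syntax; the class `Wrapper`, its fields, and the toy shipping model are all hypothetical. It only illustrates the principle that a client sees a schema of inputs (with defaults) and outputs, while the model itself stays hidden.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Wrapper:
    """Hypothetical schema-compliant wrapper around one simulation."""
    inputs: Dict[str, float]               # input names with default values
    outputs: Tuple[str, ...]               # names of the values exposed
    model: Callable[..., Dict[str, float]]

    def query(self, **params: float) -> Dict[str, float]:
        """Run the model with defaults overridden by the query parameters."""
        unknown = set(params) - set(self.inputs)
        if unknown:
            raise KeyError(f"not in schema: {sorted(unknown)}")
        result = self.model(**{**self.inputs, **params})
        # Expose only the outputs named in the schema.
        return {name: result[name] for name in self.outputs}


def shipping_model(distance_km: float, speed_kmh: float) -> Dict[str, float]:
    """Toy stand-in for a real simulation."""
    return {"hours": distance_km / speed_kmh, "internal_state": 0.0}


shipping = Wrapper(
    inputs={"distance_km": 100.0, "speed_kmh": 50.0},
    outputs=("hours",),
    model=shipping_model,
)

answer = shipping.query(distance_km=200.0)
```

The client supplies only the parameters it cares about, which parallels how a database query names columns without knowing how the tables are stored.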
Gio Wiederhold PDM 21
SimQL proof-of-concept Implementation
[Figure: architecture diagram. Labels: Developer (development interaction, help); Customer (production interaction, help); Query Parser; Schema; Schema Commands; Query Manager; Use of manager; Access Specs; Metadata Manager; Metadata; Error; Filing of reports; Wrapped Simulations; Initiation and Results of Simulations]
Gio Wiederhold PDM 22
Demonstration of SimQL
[Figure: a simple GUI and test applications, reflecting common language requirements, reach their sources through wrappers: business planning spreadsheets, weather on the Internet, a shipping location database, and an engineering simulation]
Gio Wiederhold PDM 23
Information system
use of simulation results
[Figure: tree of alternatives over time, each branch labeled with its probability]
Simulation results are mapped to alternative courses of action.
The information system should support the model driving the computation and recomputation of likelihoods.
Likelihoods change as now moves forward and eliminates earlier alternatives.
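The recomputation can be sketched as conditioning: once the first step of the tree has actually happened, paths that began differently are eliminated and the remaining likelihoods are renormalized. The path names, the likelihoods, and the function `condition_on` are hypothetical.

```python
def condition_on(paths, realized_step):
    """Drop paths that did not start with the realized step, then
    renormalize the remaining joint likelihoods so they sum to 1.

    `paths` maps a tuple of steps to the joint likelihood of that path."""
    kept = {path: p for path, p in paths.items() if path[0] == realized_step}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("realized step had zero likelihood")
    # The realized step is now in the past, so strip it from each path.
    return {path[1:]: p / total for path, p in kept.items()}


# Hypothetical joint likelihoods of two-step futures, as seen from "now".
futures = {
    ("expand", "boom"): 0.3,
    ("expand", "bust"): 0.3,
    ("hold", "boom"): 0.4,
}

# One period later we know that "expand" happened.
updated = condition_on(futures, "expand")
```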
Gio Wiederhold PDM 24
The likelihoods multiply out to the end-effects
then their values can be applied to earlier nodes
[Figure: tree running from past through now to future, covering next-period alternatives and subsequent periods. Each branch carries a probability, each end-effect carries a value, and probability-weighted values are shown at the earlier nodes.]
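The two sentences at the top of this slide describe a rollback over the tree, which a short recursive sketch can show. The tree below is hypothetical and smaller than the one on the slide; branch likelihoods here are conditional on reaching the parent node.

```python
def path_likelihoods(node, prefix=1.0):
    """Multiply likelihoods along each path out to the end-effects.

    A leaf is a number (its value); an inner node is a list of
    (likelihood, child) branches. Returns (joint likelihood, value) pairs."""
    if isinstance(node, (int, float)):
        return [(prefix, float(node))]
    pairs = []
    for likelihood, child in node:
        pairs.extend(path_likelihoods(child, prefix * likelihood))
    return pairs


def expected_value(node):
    """Apply end-effect values back to an earlier node."""
    if isinstance(node, (int, float)):
        return float(node)
    return sum(likelihood * expected_value(child) for likelihood, child in node)


# Hypothetical two-period tree as seen from "now".
tree = [
    (0.6, [(0.5, 1000), (0.5, 2000)]),
    (0.4, [(0.25, 5000), (0.75, -2000)]),
]

leaves = path_likelihoods(tree)
value_now = expected_value(tree)
value_first_branch = expected_value(tree[0][1])
```

Summing likelihood times value over the end-effects gives the same number as rolling the values back node by node, so either view can be used to compare the alternatives available now.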
Gio Wiederhold PDM 25
Recomputation is needed
at the next time phase
A Pruned Bush
Re-assess as time marches forward!
[Figure: the pruned tree of alternatives with its values, from past through now to future; values at earlier nodes are marked "?" pending recomputation. Inputs: spreadsheets, other simulations, databases, messages, sensors]
Gio Wiederhold PDM 26
Even the present needs SimQL
[Figure: timeline from past through now to future. The last recorded observations lie in the past; simple simulations extrapolate the data to the point-in-time for situational assessment.]
Not all data are current:
• Is the delivery truck in X?
• Is the right stuff on the truck?
• Will the crew be at X?
• Will the forces be ready to accept delivery?
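The first question can be answered with a very simple simulation that extrapolates the last recorded observation to now. The sketch assumes constant speed along a known route; the function names, distances, and tolerance are hypothetical.

```python
def extrapolate_position(last_km, speed_kmh, hours_since_report):
    """Dead-reckon the current position along a route from the last report."""
    return last_km + speed_kmh * hours_since_report


def is_at(target_km, last_km, speed_kmh, hours_since_report, tolerance_km=5.0):
    """Answer 'is the truck in X?' using the extrapolated position."""
    now_km = extrapolate_position(last_km, speed_kmh, hours_since_report)
    return abs(now_km - target_km) <= tolerance_km


# Hypothetical report: truck at km 120, moving 60 km/h, reported 30 minutes ago.
estimate_km = extrapolate_position(120.0, 60.0, 0.5)
at_x = is_at(150.0, 120.0, 60.0, 0.5)
```

The older the last report, the less the constant-speed assumption can be trusted, so the answer is an estimate rather than an observation.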
Gio Wiederhold PDM 27
Integrative information systems:
research questions
• What human interfaces can support the decision maker?
• How to move seamlessly from the past to the future?
• What system interfaces are good now and stay adaptable?
• How can multiple futures be managed (indexed)?
• How can multiple futures be compared, selected?
• How should joint uncertainty be computed?
• How can the NOW point be moved automatically?
Gio Wiederhold PDM 28
SimQL research questions
• How little of the model needs to be exposed?
• How can defaults be set rationally?
• How should expected execution cost be reported?
• How should uncertainty be reported?
• Are there differences among application areas that
require different language structures?
• Are there differences among application areas that
require different language features?
• How will the language interface support effective
partitioning and distribution?
Gio Wiederhold PDM 29
Moving to a Service Paradigm
Interfaces define service potentials
• Server is an independent contractor, defines service
• Client selects service, and specifies parameters
• Server’s success depends on value provided
• Some form of payment is due for services
Databases are a current example.
Simulations have the same potential.
Gio Wiederhold PDM 30
Summary of SimQL
A new service for Decision Making:
• follows database paradigm
– ( by about 25 years )
• coherence in prediction
– displacement of ad-hoc practices
• seamless information integration
– single paradigm for decision makers
• simulation industry infrastructure
– investment has a potential market
– should follow database industry model:
Interfaces promote new industries
Gio Wiederhold PDM 31
Summary:
Today decision-making support is disjoint; each community improves its area and ignores others.
[Figure labels: Planning Science, Distribution]
Extensions for network support are also disjoint.
Gio Wiederhold PDM 32
The decision maker has few tools
[Figure: timeline running from past through now to future]
• Past: organized support
  – Data integration
  – Databases: distributed, heterogeneous
• Future: disjointed support
  – Intuition +
  – Spreadsheets
  – Planning of allocations
  – Other simulations
  – various point assessments
Gio Wiederhold PDM 33
Coda:
Put relevant work together and move on
Support integration of results mined from past data,
current observations, and predictions about the futures.
[Figure: the decision maker works through human interfaces; service interfaces connect databases, data mining, modeling tools, and simulation support services]
Gio Wiederhold PDM 34