Profiting from Data Mining
Gio Wiederhold
November 2003
Gio Wiederhold PDM 1

Steps needed to profit
1. Obtaining relevant data
   – Always incomplete
2. Extracting relationships
   – Imputing causality ?
3. Finding applicability
   – Determining leverage points
   – Model based
4. Inventing candidate actions
   – Assessing likely outcomes and benefits
5. Selecting action to be taken
   – Measuring the outcome
   – Collecting data for next round

Today's Problem: Disjointness
1. Database administrators
   • Focus on data collection, organization, currency
2. Analysts
   • Focus on slicing, dicing, relationships
3. Middle managers
   • Focus on their costs, profits
4. MBAs
   • Focus on business models, planning
5. Executives
   • Must make decisions based on diverse inputs

1. Data Collection
Two choices
1. (rare) Collect data specifically for analysis
   • allows careful design - model causes and effects
     Purchase = f(price, color, size, customer income, gender, ...)
   • costly
   • often small to make collection manageable
   • imposes delays
2. (common) Use data collected for other purposes
   • take advantage of what is readily available
   • low cost
   • filtering, reformatting, integration
   • incomplete - rarely covers all causes / effects
   • biased -- missing categories
     only people with phones, cars -- shopping in supermarkets

1a. Data Integration
Needed when sources have inadequate coverage
• in distinct DBs for
  – Prices, Number purchased
  – Customer segments (supermarket, stores, on-line)
implies some expectations
• append attributes where keys match: Joe
  – include semantic match: Joe = 012 34 567
• append rows where key types match: customer
  – include semantic match: customer = owner

2. Data analysis
• Find relationships
  – already known - ignore or adjust in next round
    » requires comparison with expert knowledge
    » now have quantification
  – unknown
    » uninteresting per expert
    » interesting per expert

3.
Establish causality
• Already known -- Prior Model
  – But is it complete, i.e., does it explain all effects ?
• Analyze relationships
  – use expertise to decide direction
    » often obvious: "common world knowledge"
    » sometimes ambiguous: use temporal information
      [Diagram labels: smoking, cancer, not-smoking]
    » often major true cause not captured in data
      purchase of Chinese vs other food:
      food color 10%, food price 20%, buyer gender 2%, unknown 75%
      guess: ethnicity, income
      invent surrogates: names, ZIP codes, ...

Establishing causality is risky
1. Is a Volvo a safe car?
   Mined: Volvos have fewer accidents
2. What causes accidents? Drivers!
3. Who buys Volvos? Careful drivers!
4. Must determine
   • effect of safe drivers
   • percentage of safe drivers overall
   • percentage of safe drivers with Volvos
5. How much of the accident rate is now explained?
   The unexplained difference can be attributed to the car.

To use results of data mining
• have to understand direction of relationships
[Diagram labels: controllable causes, external causes, hidden, captured by data, Model, interesting, beneficial effects, side effects]

4. Causes provide the leverage
Language of analyst / Language of modeling
• Many causes -- independent variables
  – A few may be controllable
  – Some may be controlled by our competition
  – Others are forces-of-nature
• Even more effects -- dependent variables
  – A few may be desired
  – Some may be disastrous
  – Many are poorly understood
• Intermediate effects
  – Provide a means for measuring effectiveness
  – Allow correction of actions taken

5. Planning & Assessment
Analyze Alternatives
• Current Capabilities
• Future Expectations
Predict the future
[Timeline: now, future]
Process tasks:
• List resources
• Enumerate alternatives
• Prune alternatives
• Compare alternatives

Prediction Requires Tools
E-mail this book, Alfred Knopf, 1997

Simulations predict
1.
Back-of-the-envelope
   • Common
   • Adequate if model is simple
   • Assumptions are easily forgotten after some time, not distinguished from data
     "Why are we doing this"
2. Spreadsheets
   • Most common computing tool
   • Specialist modeler can help
   • New, recent data can be pasted in
   • Awkward for the tree of future alternatives
3. Constructed to order
   • Costly, powerful technology
   • Specialist modelers required
   • Expressive simulation languages
   • Requires specialists to set up, run, and rerun with new data

Simulation results: likelihoods
[Figure: tree branching from "now" into next-period alternatives and subsequent periods, each branch labeled with a likelihood; uncertainty increases with time]

Simulation services
Wide variety, but common principle
Inputs -> Model -> Output (time, $, place, ...)
1. Spreadsheets
   Identify independent, controllable, and resulting values
2. Execution specific to query: what-if assessment
   – may require HPC power for adequate response
3. Continuously executing: weather prediction
   – Search for best match (location, time)
4. Past simulation results collected for future use
   Typically sparse -- the dimension of the futures is too large:
   – Tables in a design handbook: materials
   Perform inter- or extrapolations to match query parameters

6.
Specify Value of Effects
Still needed: Value of alternative outcomes
• Decision maker / owner input
  – Benefits and Costs
  – Potential Profit
  – Correct for risk, and adjust to present value
[Figure: timeline from past through now to the futures; the end outcomes carry the values 1000, 2000, 5000, 1000, 0, -2000, -6000]

Having it all together
• Relationships from analyses of past data
• Data representing the current state
• List of actionable alternatives
• Tree of subsequent alternatives
• Probabilities of those alternatives
• Values of the outcomes
• Ability to predict the likelihood of futures
[Figure: the tree of alternatives with a likelihood on each branch and the values 1000, 2000, 5000, 1000, 0, -2000, -6000 on the outcomes]

Vision: Putting it all together
Combine results mined from past data, current observations, and predictions into the future.
[Diagram labels: Decision Maker, Support specialists]

Needed: Information Systems that also project seamlessly into the Futures
[Timeline: past, now, future]
Support of decision-making requires dealing with the futures, as well as the past
• Databases deal well with the past
• Streaming sensors supply current status
• Spreadsheets, simulations deal with the likely futures
Future information systems should combine all these sources

Connecting it all
Build super systems
• Coherent, consistent
• Expensive
• Unmaintainable
• Too many cooks:
  – Database folk
  – Data miners
  – Analysts
  – Planners
  – Simulation specialists
  – Decision makers
Develop interfaces
• Incremental
• Composable as needed
• Heterogeneous
• Interfaces required: Metadata
  – Database to miners: SQL
  – Mined results to analysts: XML?
  – Analysts to planners: ?
  – Planners to Simulations: SimQL?
  – Decision makers: New tools!
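The pieces listed under "Having it all together" (a tree of alternatives, a probability on each branch, a value on each outcome) can be combined mechanically. The following is only a minimal sketch of that roll-back, not the method used in the talk: the `Node` class, the function names, and the tree shape with its probabilities are all made up for illustration; only the idea of multiplying likelihoods out to the end effects and applying their values to earlier nodes comes from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One state in the tree of alternatives (hypothetical structure)."""
    label: str
    value: float = 0.0  # value of the outcome; only used at leaves
    branches: list = field(default_factory=list)  # (probability, Node) pairs


def expected_value(node):
    """Roll outcome values back toward 'now', weighted by branch probability."""
    if not node.branches:
        return node.value
    return sum(p * expected_value(child) for p, child in node.branches)


def leaf_likelihoods(node, prob=1.0):
    """Multiply probabilities along each path to get end-effect likelihoods."""
    if not node.branches:
        return {node.label: prob}
    result = {}
    for p, child in node.branches:
        result.update(leaf_likelihoods(child, prob * p))
    return result


# Illustrative tree: probabilities and shape are invented, not read off the slide.
now = Node("now", branches=[
    (0.6, Node("A", branches=[(0.5, Node("A1", 1000)), (0.5, Node("A2", 2000))])),
    (0.4, Node("B", branches=[(0.25, Node("B1", 5000)), (0.75, Node("B2", -2000))])),
])

if __name__ == "__main__":
    print(leaf_likelihoods(now))
    print(expected_value(now))
```

With these invented numbers, branch A is worth 1500 and branch B is worth -250, so the value at "now" comes to about 800. The sketch ignores risk correction and discounting to present value, which the slides list as further steps.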

Interfaces enable integration:
New: SimQL to access Simulations
[Timeline: past, now, futures]
• Databases and schemas, accessed via SQL or XML
• Simulations, accessed via SimQL and schema-compliant wrappers
• Msg systems, Sensors: Streaming data

SimQL proof-of-concept Implementation
[Architecture diagram labels: Developer, Development Interaction, Customer, Production Interaction, Help, Query Parser, Schema Commands, Schema, Query Manager, Metadata Manager, Access Specs, Metadata, Error, Filing of reports, Wrapped Simulations, Initiation and Results of Simulations]

Demonstration of SimQL
• Simple GUI
• Common language requirements
Test Applications (accessed through wrappers)
• Business planning spreadsheets
• Weather on the Internet
• Shipping location database
• Engineering simulation

Information system use of simulation results
[Figure: tree of alternatives over time, with a probability on each branch]
Simulation results are mapped to alternative courses of action.
The information system should support a model that drives the computation and recomputation of likelihoods.
Likelihoods change as "now" moves forward and eliminates earlier alternatives.

The likelihoods multiply out to the end effects; then their values can be applied to earlier nodes
[Figure: tree from past through now to future, covering next-period alternatives and subsequent periods; branches carry probabilities, and nodes carry values derived from the values of the end effects]

Recomputation is needed at the next time phase
A Pruned Bush
Re-assess as time marches forward!
[Figure: the same tree after "now" has advanced; eliminated alternatives are pruned and the values of the remaining nodes are marked "?" until recomputed]
Spreadsheets, other simulations, Databases, . . .
Msgs, sensors

Even the present needs SimQL
[Timeline labels: past, now, future; last recorded observations; point-in-time for situational assessment; simple simulations to extrapolate data]
Not all data are current:
• Is the delivery truck in X?
• Is the right stuff on the truck?
• Will the crew be at X?
• Will the forces be ready to accept delivery?

Integrative information systems: research questions
• What human interfaces can support the decision maker?
• How to move seamlessly from the past to the future?
• What system interfaces are good now and stay adaptable?
• How can multiple futures be managed (indexed)?
• How can multiple futures be compared, selected?
• How should joint uncertainty be computed?
• How can the NOW point be moved automatically?

SimQL research questions
• How little of the model needs to be exposed?
• How can defaults be set rationally?
• How should expected execution cost be reported?
• How should uncertainty be reported?
• Are there differences among application areas that require different language structures?
• Are there differences among application areas that require different language features?
• How will the language interface support effective partitioning and distribution?

Moving to a Service Paradigm
Interfaces define service potentials
• Server is an independent contractor, defines service
• Client selects service, and specifies parameters
• Server's success depends on value provided
• Some form of payment is due for services
Databases are a current example. Simulations have the same potential.
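The point made under "Even the present needs SimQL", that a last recorded observation has to be extrapolated to answer a question about now, can be illustrated with a very small simulation that a client calls with parameters and that reports an uncertainty along with its answer. This is a sketch under loud assumptions: it is not SimQL, whose syntax the slides do not show; the names `Observation`, `Estimate`, `extrapolate_position`, and `is_at`, the linear motion model, and the default speed error are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Last recorded state of a truck (hypothetical fields)."""
    time_h: float       # time of the observation, in hours
    position_km: float  # distance along the route at that time
    speed_kmh: float    # speed reported at that time


@dataclass
class Estimate:
    value: float        # extrapolated position, in km
    uncertainty: float  # crude +/- bound, in km


def extrapolate_position(obs, now_h, speed_error_kmh=10.0):
    """Linearly extrapolate the last observation to 'now'.

    The uncertainty grows with the age of the observation. The default
    speed_error_kmh is an arbitrary illustrative figure, not a measured one.
    """
    age = now_h - obs.time_h
    if age < 0:
        raise ValueError("observation is later than 'now'")
    return Estimate(value=obs.position_km + obs.speed_kmh * age,
                    uncertainty=speed_error_kmh * age)


def is_at(est, target_km):
    """True if the target lies within the estimate's uncertainty band."""
    return abs(est.value - target_km) <= est.uncertainty


if __name__ == "__main__":
    last_seen = Observation(time_h=8.0, position_km=120.0, speed_kmh=60.0)
    est = extrapolate_position(last_seen, now_h=10.0)
    print(est, is_at(est, target_km=250.0))
```

Returning the uncertainty together with the value touches one of the SimQL research questions listed above (how uncertainty should be reported); a real service would presumably also need to report its expected execution cost.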

Summary of SimQL
A new service for Decision Making:
• follows database paradigm
  – (by about 25 years)
• coherence in prediction
  – displacement of ad-hoc practices
• seamless information integration
  – single paradigm for decision makers
• simulation industry infrastructure
  – investment has a potential market
  – should follow the database industry model: Interfaces promote new industries

Summary:
Today decision making support is disjoint; each community improves its area and ignores others
[Diagram labels: Planning Science Distribution]
Extensions for network support are also disjoint

The decisionmaker has few tools
[Timeline: past, now, future]
organized support
• Data integration
• Databases: distributed, heterogeneous
disjointed support
• Intuition +
• Spreadsheets
• Planning of allocations
• Other simulations
• various point assessments

Coda: Put relevant work together and move on
Support integration of results mined from past data, current observations, and predictions about the futures.
[Diagram labels: Decision Maker, Databases, Human interfaces, Service interfaces, Data Mining, Modeling tools ?, Simulation, Support Services]