Data Mining

References:
• U.S. News and World Report's Business & Technology section, 12/21/98, by William J. Holstein
• Prof. Juran’s lecture note 1 (at Columbia University)
• J.H. Friedman (1999) Data Mining and Statistics. Technical report, Dept. of Stat., Stanford University

Main Goal
• Study statistical tools useful in managerial decision making.
  – Most management problems involve some degree of uncertainty.
  – People have poor intuitive judgment of uncertainty.
  – IT revolution... abundance of available quantitative information
    • data mining: large databases of info, ...
    • market segmentation & targeting
    • stock market data
    • almost anything else you may want to know...
• What conclusions can you draw from your data?
• How much data do you need to support your conclusions?

Applications in Management
• Operations management
  – e.g., model uncertainty in demand, production function...
• Decision models
  – portfolio optimization, simulation, simulation based optimization...
• Capital markets
  – understand risk, hedging, portfolios, betas...
• Derivatives, options, ...
  – it is all about modeling uncertainty
• Operations and information technology
  – dynamic pricing, revenue management, auction design, ...
• Data mining... many applications

Portfolio Selection
• You want to select a stock portfolio of companies A, B, C, …
• Information:
  Stock   Annual returns by year
  A       10%, 14%, 13%, 27%, …
  B       16%, 27%, 42%, 23%, …
• Questions:
  – How do we measure the volatility of each stock?
  – How do we quantify the risk associated with a given portfolio?
  – What is the tradeoff between risk and returns?

[Figure: Currency Value (Relative to Jan 2 1998)]

Introduction
• Premise: All business becomes information driven.
  – The concept of Data Mining is becoming increasingly popular as a business information management tool where it is expected to reveal knowledge structures that can guide decisions in conditions of limited certainty.
• Competitiveness: How do you collect and exploit information to your advantage?
• The challenges
  – Most corporate data systems are not ready.
    • Can they share information?
    • What is the quality of the information going in?
  – Most data techniques come from the empirical sciences; the world is not a laboratory.
  – Cutting through vendor hype, info-topia.
  – Defining good metrics; abandoning gut rules of thumb may be too "risky" for the manager.
  – Communicating success, setting the right expectations.

Wal-Mart
• U.S. News and World Report's Business & Technology section, 12/21/98, by William J. Holstein
  Data-Crunching Santa: Wal-Mart knows what you bought last Christmas
• Wal-Mart is expected to finish the year with $135 billion in sales, up from $118 billion last year.
  – That growth hurts department stores such as Sears, J. C. Penney, and Federated's Macy's and Bloomingdale's units, which have been slower to link all their operations from stores directly to manufacturers.
  – For example, Sears stocked too many winter coats this season and was surprised by warmer than average weather.
• The field of business analytics has improved significantly over the past few years, giving business users better insights, particularly from operational data stored in transactional systems.
  – ... business analytics in its everyday activities.
  – Analytics are now routinely used in sales, marketing, supply chain optimization, and fraud detection.

[Figure: A visualization of a Naive Bayes model for predicting who in the U.S. earns more than $50,000 in yearly salary. The higher the bar, the greater the amount of evidence that a person with this attribute value earns a high salary.]

Telecommunications
• Data mining flourishes in telecommunications due to the availability of vast quantities of high-quality data.
  – A significant stream of it consists of call records collected at network switches and used primarily for billing; it enables data mining applications in toll fraud detection and consumer marketing.
• The best-known marketing application of data mining, albeit via unconfirmed anecdote, concerns MCI’s “Friends & Family” promotion launched in the domestic U.S. market in 1991.
  – As the anecdote goes, market researchers observed relatively small subgraphs in this long-distance phone company’s large call-graph of network activity.
  – It reveals the promising strategy of adding entire calling circles to the company’s subscriber base, rather than the traditional and costly approach of seeking individual customers one at a time. Indeed, MCI increased its domestic U.S. market share in the succeeding years by exploiting the “viral” capabilities of calling circles; one infected member causes others to become infected.
  – Interestingly, the plan was abandoned some years later (not available since 1997), possibly because the virus had run its course but more ...

Telecommunications
• In toll-fraud detection, data mining has been instrumental in completely changing the landscape for how anomalous behaviors are detected.
  – Nearly all fraud detection systems in the telecommunications industry 10 years ago were based on global threshold models.
    • They can be expressed as rule sets of the form “If a customer makes more than X calls per hour to country Y, then apply treatment Z.”
    • The placeholders X, Y, and Z are parameters of these rule sets applied to all customers.
  – Given the range of telecommunication customers, blanket application of these rules produces many false positives.
• Data mining methods for customized monitoring of land and mobile phone lines were subsequently developed by leading service providers, including AT&T, MCI, and Verizon, whereby each customer’s historic calling patterns are used as a baseline against which all new calls are compared.
  – For customers routinely calling country Y more than X times a day, such alerts would be suppressed, but if they ventured to call a different country Y’, an alert might be generated.
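The contrast between a global threshold rule and a per-customer baseline can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the call records, the customer names, the function names, and the "new destination" test used as the baseline are all hypothetical, not the method of any provider named above.

```python
from collections import defaultdict

def global_threshold_alerts(calls, max_calls_per_hour, country):
    """Global rule: flag any customer with more than X calls in one hour to country Y."""
    counts = defaultdict(int)
    for customer, dest, hour in calls:
        if dest == country:
            counts[(customer, hour)] += 1
    return sorted({c for (c, _hour), n in counts.items() if n > max_calls_per_hour})

def baseline_alerts(history, new_calls):
    """Customized rule (simplified): flag calls to a destination the customer never called before."""
    seen = defaultdict(set)
    for customer, dest, _hour in history:
        seen[customer].add(dest)
    return [(c, d, h) for c, d, h in new_calls if d not in seen[c]]

# Hypothetical records: (customer, destination country, hour of day).
history = [("alice", "FR", 9)] * 5 + [("bob", "US", 10)]
new_calls = [("alice", "FR", 11)] * 4 + [("bob", "FR", 11)]

# The global rule flags alice, although calling FR is routine for her.
global_flags = global_threshold_alerts(new_calls, max_calls_per_hour=3, country="FR")
# The baseline rule stays quiet for alice and flags bob's first call to FR.
custom_flags = baseline_alerts(history, new_calls)
print(global_flags, custom_flags)
```

In this toy data the global rule produces a false positive for the routine caller, while the baseline comparison raises an alert only for behavior that is new for that customer.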
Risk management and targeted marketing
• Insurance and direct mail are two industries that rely on data analysis to make profitable business decisions.
  – Insurers must be able to accurately assess the risks posed by their policyholders to set insurance premiums at competitive levels.
    • For example, overcharging low-risk policyholders would motivate them to seek lower premiums elsewhere; undercharging high-risk policyholders would attract more of them due to the lower premiums.
    • In either case, costs would increase and profits inevitably decrease.
  – Effective data analysis leading to the creation of accurate predictive models is essential for addressing these issues.
• In direct-mail targeted marketing, retailers must be able to identify subsets of the population likely to respond to promotions in order to offset mailing and printing costs.
  – Profits are maximized by mailing only to those potential customers most likely to generate net income to a retailer in excess of the retailer’s mailing and printing costs.

Medical applications (diabetic screening)
• Preprocessing and postprocessing steps are often the most critical elements determining the effectiveness of real-life data-mining applications, as illustrated by the following recent medical application in diabetic patient screening.
  – In the 1990s in Singapore, about 10% of the population was diabetic. Diabetes is a disease with many side effects, including increased risk of eye disease, kidney failure, and other complications.
  – However, early detection and proper care management can make a difference in the health and longevity of individual sufferers.
  – To combat the disease, the government of Singapore introduced a regular screening program for diabetic patients in its public hospitals in 1992.
    • Patient information, clinical symptoms, eye-disease diagnosis, treatments, and other details were captured in a database maintained by government medical authorities.
• After almost 10 years of collecting data, a wealth of medical information is available. This vast store of historical data leads naturally to the application of data mining techniques to discover interesting patterns.
  – The objective is to find rules physicians can use to understand more about diabetes and how it might be associated with different segments of the population.

Christmas Season: Georgia Stores
• Store at Decatur (just east of Atlanta)
  – A black middle-income community
    • Decoration display: African-American angels and ethnic Santas aplenty
    • Music section: Promoting seasonal disks like "Christmas on Death Row," which features rapper Snoop Doggy Dogg.
    • Toy department: a large selection of brown-skinned dolls
• Store at Dunwoody (20 miles away from Decatur)
  – An affluent, mostly white suburb (north of Atlanta)
    • Music section: Showcasing Christmas tunes by country superstar Garth Brooks.
    • Toy department: a few expensive toys that aren't available in the Decatur store; out of the hundreds of dolls in stock, only two have brown skin.
• How does Wal-Mart determine the kinds of products carried by its various stores across the land?

Wal-Mart system
• Every item in the store has a laser bar code, so when customers pay for their purchases a scanner captures information about
  – what is selling on what day of the week and at what price.
  – The scanner also records what other products were in each shopper's basket.
  – Wal-Mart analyzes what is in the shopping cart itself.
  – The combination of [what's in a purchaser's cart] gives you a good indication of the age of that consumer and the preferences in terms of ethnic background.
• Wal-Mart combines the in-store data with information about the demographics of communities around each store.
  – The end result is surprisingly different personalities for Wal-Marts.
  – It also helps Wal-Mart figure out how to place goods on the floor to get what retailers call "affinity sales," or sales of related products.

Wal-Mart system (Cont.)
• One big strength of the system is that about 5,000 manufacturers are tied into it through the company's Retail Link program, which they access via the Internet.
  – Pepsi, Disney, or Mattel, for example, can tap into Wal-Mart's data warehouse to see how well each product is selling at each Wal-Mart.
  – They can look at how things are selling in individual areas and make decisions about categories where there may be an opportunity to expand.
  – That tight information link helps Wal-Mart work with its suppliers to replenish stock of products that are selling well and to quickly pull those that aren't.

Data Mining and Statistics
• Data Mining is used to discover patterns and relationships in data, with an emphasis on large observational data bases.
  – It sits at the common frontiers of several fields including Data Base Management, Artificial Intelligence, Machine Learning, Pattern Recognition, and Data Visualization.
  – From a statistical perspective it can be viewed as computer automated exploratory data analysis of large complex data sets.
  – Many organizations have large transaction oriented data bases used for inventory, billing, accounting, etc. These data bases were very expensive to create and are costly to maintain. For a relatively small additional investment, DM tools offer to discover highly profitable nuggets of information hidden in these data.
• Data, especially large amounts of it, reside in data base management systems (DBMS).
  – Conventional DBMS are focused on online transaction processing (OLTP); that is, the storage and fast retrieval of individual records for purposes of data organization. They are used to keep track of inventory, payroll records, billing records, invoices, etc.
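To make the OLTP idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The invoices table, its columns, and its rows are invented for illustration; the second query is only a stand-in for the kind of whole-table summary that exploratory analysis starts from, not a data mining method in itself.

```python
import sqlite3

# In-memory database with a hypothetical invoices table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices (id INTEGER PRIMARY KEY, customer TEXT, item TEXT, amount REAL)"
)
rows = [
    (1, "c1", "coat", 80.0),
    (2, "c2", "doll", 15.0),
    (3, "c1", "doll", 15.0),
    (4, "c3", "coat", 80.0),
]
conn.executemany("INSERT INTO invoices VALUES (?, ?, ?, ?)", rows)

# OLTP-style access: store and quickly retrieve one individual record by its key.
invoice = conn.execute(
    "SELECT customer, item, amount FROM invoices WHERE id = ?", (3,)
).fetchone()

# Exploratory-style access: summarize every record to look for a pattern.
revenue_by_item = conn.execute(
    "SELECT item, SUM(amount) FROM invoices GROUP BY item ORDER BY item"
).fetchall()

print(invoice)
print(revenue_by_item)
conn.close()
```

The first query touches a single row, which is what transaction systems are tuned for; the second reads the whole table, which is the access pattern analysis tools impose on the same data.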
Data Mining Techniques
• Data Mining is an analytic process designed to
  – explore data (usually large amounts of data, typically business or market related) in search of consistent patterns and/or systematic relationships between variables, and then
  – validate the findings by applying the detected patterns to new subsets of data.
  – The ultimate goal of data mining is prediction; predictive data mining is the most common type of data mining and the one with the most direct business applications.
• The process of data mining consists of three stages:
  – the initial exploration,
  – model building or pattern identification with validation and verification, and
  – deployment (i.e., the application of the model to new data in order to generate predictions).

Stage 1: Exploration
• It usually starts with data preparation, which may involve cleaning data, data transformations, selecting subsets of records and, in the case of data sets with large numbers of variables ("fields"), performing some preliminary feature selection operations to bring the number of variables to a manageable range (depending on the statistical methods being considered).
• Depending on the nature of the analytic problem, this first stage of the process of data mining may involve anything from a simple choice of straightforward predictors for a regression model to elaborate exploratory analyses using a wide variety of graphical and statistical methods in order to identify the most relevant variables and determine the complexity and/or the general nature of models that can be taken into account in the next stage.

Stage 2: Model building and validation
• This stage involves considering various models and choosing the best one based on their predictive performance:
  – explaining the variability in question, and
  – producing stable results across samples.
• How do we achieve these goals?
• This may sound like a simple operation, but in fact it sometimes involves a very elaborate process.
  – "Competitive evaluation of models": applying different models to the same data set and then comparing their performance to choose the best.
  – These techniques, which are often considered the core of predictive data mining, include Bagging (Voting, Averaging), Boosting, Stacking (Stacked Generalizations), and Meta-Learning.

Models for Data Mining
• In the business environment, complex data mining projects may require the coordinated efforts of various experts, stakeholders, or departments throughout an entire organization.
• In the data mining literature, various "general frameworks" have been proposed to serve as blueprints for how to organize the process of gathering data, analyzing data, disseminating results, implementing results, and monitoring improvements.
  – CRISP (Cross-Industry Standard Process for data mining) was proposed in the mid-1990s by a European consortium of companies to serve as a non-proprietary standard process model for data mining.
  – The Six Sigma methodology is a well-structured, data-driven methodology for eliminating defects, waste, or quality control problems of all kinds in manufacturing, service delivery, management, and other business activities.

CRISP
• This general approach postulates the following (perhaps not particularly controversial) general sequence of steps for data mining projects:
[Figure: CRISP sequence of steps]

Six Sigma
• This model has recently become very popular (due to its successful implementations) in various American industries, and it appears to be gaining favor worldwide. It postulates a sequence of so-called DMAIC steps.
  – The categories of activities: Define (D), Measure (M), Analyze (A), Improve (I), Control (C).
  – Postulates the following general sequence of steps for data mining projects: Define (D) → Measure (M) → Analyze (A) → Improve (I) → Control (C)
  – It grew out of the manufacturing, quality improvement, and process control traditions and is particularly well suited to production environments (including "production of services," i.e., service industries).
• Define. This phase is concerned with the definition of project goals and boundaries, and the identification of issues that need to be addressed to achieve the higher sigma level.
• Measure. The goal of this phase is to gather information about the current situation, to obtain baseline data on current process performance, and to identify problem areas.
• Analyze. The goal of this phase is to identify the root cause(s) of quality problems, and to confirm those causes using the appropriate data analysis tools.
• Improve. The goal of this phase is to implement solutions that address the problems (root causes) identified during the previous (Analyze) phase.
• Control. The goal of the Control phase is to evaluate and monitor the results of the previous phase (Improve).

Six Sigma Process
• A six sigma process is one that can be expected to produce only 3.4 defects per one million opportunities.
  – The concept of the six sigma process is important in Six Sigma quality improvement programs.
• The term Six Sigma derives from the goal of achieving a process variation such that ±6×sigma (where sigma is the estimate of the population standard deviation) will "fit" inside the lower and upper specification limits for the process.
  – In that case, even if the process mean shifts by 1.5×sigma in one direction (e.g., to +1.5 sigma in the direction of the upper specification limit), the process will still produce very few defects.
• For example, suppose we expressed the area above the upper specification limit in terms of one million opportunities to produce defects.
  The 6×sigma process shifted upwards by 1.5×sigma will only produce 3.4 defects (i.e., "parts" or "cases" greater than the upper specification limit) per one million opportunities.

Statistician's remark on DM paradigms
• The DM community may have to moderate its romance with "big".
  – A prevailing attitude seems to be that unless an analysis involves gigabytes or terabytes of data, it cannot possibly be worthwhile.
  – It seems to be a requirement that all of the data that has been collected must be used in every aspect of the analysis.
  – Sophisticated procedures that cannot simultaneously handle data sets of such size are not considered relevant to DM.
  – Most DM applications routinely require data sets that are considerably larger than those that have been addressed by traditional statistical procedures (kilobytes).
  – It is often the case that the questions being asked of the data can be answered to sufficient accuracy with less than the entire giga or terabyte data base.
  – Sampling methodology, which has a long tradition in Statistics, can profitably be used to improve accuracy while mitigating computational requirements.
  – Also, a powerful computationally intense procedure operating on a subsample of the data may in fact provide better accuracy than a less sophisticated one using the entire data base.

Sampling
• Objective: Determine the average amount of money spent in the Central Mall.
• Sampling: A Central City official randomly samples 12 people as they exit the mall.
  – He asks them the amount of money spent and records the data.
  – Data for the 12 people:

    Person  $ spent    Person  $ spent    Person  $ spent
    1       $132       5       $123       9       $449
    2       $334       6       $5         10      $133
    3       $33        7       $6         11      $44
    4       $10        8       $14        12      $1

  – The official is trying to estimate the mean and variance of the population based on a sample of 12 data points.
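The official's estimates can be reproduced with a short Python sketch. The variable names are illustrative; the data are the 12 amounts listed above, and the variance uses the usual n − 1 denominator for a sample.

```python
import statistics

# Amounts (in dollars) reported by the 12 sampled shoppers.
spent = [132, 334, 33, 10, 123, 5, 6, 14, 449, 133, 44, 1]

n = len(spent)
mean = sum(spent) / n                                      # sample mean
variance = sum((x - mean) ** 2 for x in spent) / (n - 1)   # sample variance (n - 1 denominator)
std_dev = variance ** 0.5                                  # sample standard deviation

# Cross-check against the standard library's sample variance.
assert abs(statistics.variance(spent) - variance) < 1e-9

print(f"mean = {mean:.2f}, variance = {variance:.2f}, std dev = {std_dev:.2f}")
```

Run as written, this gives a mean of 107, a variance of 20,854 (in squared dollars), and a standard deviation of about 144.41.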
Population versus Sample
A population is usually a group we want to know something about: all potential customers, all eligible voters, all the products coming off an assembly line, all items in inventory, etc.
Finite population: {u1, u2, ... , uN} versus infinite population.
A population parameter is a number (θ) relevant to the population that is of interest to us: the proportion (in the population) that would buy a product, the proportion of eligible voters who will vote for a candidate, the average number of M&M's in a pack, etc.
A sample is a subset of the population that we actually do know about (by taking measurements of some kind): a group who fill out a survey, a group of voters that are polled, a number of randomly chosen items off the line, etc.: {x1, x2, ... , xn}
A sample statistic g(x1, x2, ... , xn) is often the only practical estimate of a population parameter. We will use g(x1, x2, ... , xn) as a proxy for θ, but remember the difference between them.

Average Amount of Money spent in the Central Mall
• A sample (x1, x2, ... , xn)
• Its mean is the sum of the values divided by the number of observations:
  \[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n} \]
• The sample mean, the sample variance, and the sample standard deviation are $107, 20,854 (in squared dollars), and $144.41, respectively.
• The claim is that on average $107 is spent per shopper, with a standard deviation of $144.41.
• Why can we claim so?
  \[ s^2 = \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n-1} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 \]
  \[ s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2} \]
• The variance s^2 of a set of observations is, roughly, the average of the squares of the deviations of the observations from their mean (the sum of squared deviations divided by n − 1).
• The standard deviation s is the square root of the variance s^2.
• How far are the observations from the mean? s^2 and s will be
  – large if the observations are widely spread about their mean,
  – small if they are all close to the mean.

Stock Market Indexes
• A stock market index is a statistical measure that shows how the prices of a group of stocks change over time.
  – Price-Weighted Index: DJIA
  – Market-Value-Weighted Index: Standard and Poor’s 500 Composite Index
  – Equally Weighted Index: Wilshire 5000 Equity Index
• Price-Weighted Index: It shows the change in the average price of the stocks that are included in the index.
  – Price per share in the current period is P0, and price per share in the next period is P1.
  – Number of shares outstanding in the current period is Q0, and number of shares outstanding in the next period is Q1.

DJIA
• Dow Jones industrial average (DJIA):
  – Charles Dow first concocted his 12-stock industrial average in 1896 (expanding to 30 in 1928).
  – Original: It is an arithmetic average of the thirty stock prices that make up the index.
    DJIA = [(P0,1 + P0,2 + … + P0,30)/30] / [(P1,1 + P1,2 + … + P1,30)/30]
  – Current: It is adjusted for stock splits and the issuance of stock dividends.
    DJIA = [(P0,1 + P0,2 + … + P0,30)/AD1] / (P1,1 + P1,2 + … + P1,30)
    where AD1 is the appropriate divisor.
• How do we adjust AD1 to account for stock splits, adding new stocks, ...?
  – The adjustment process is designed to keep the index value the same as it would have been if the split had not occurred.
  – Suppose X30 splits 2:1 from $100 to $50. Then change the divisor c to c0 such that
    (X1 + X2 + … + 100)/c = (X1 + X2 + … + 50)/c0
  – Change to c0 < c to keep the index constant before and after the split.
• How about when new stocks are added and others are removed?

DJIA
• How each stock in the Dow performed during the period when the Dow rose 100 percent (from its close above 5,000 on Nov. 21, 1995 until it closed above 10,000 on March 29, 1999).
  *Companies not in the Dow when it crossed 5,000.
  **Adjusted for spinoffs. Does not reflect performance of stocks spun off to shareholders.

  Company                Weight in the Dow (%)   Change in Price (%)
  Alcoa                  1.9                     +52
  AlliedSignal           2.3                     +129
  Amer. Express          5.5                     +185
  AT&T**                 3.6                     +87
  Boeing                 1.5                     -5
  Caterpillar            2.1                     +59
  Chevron                4.0                     +77
  Citigroup*             2.8                     +262
  Coca-Cola              3.0                     +69
  Du Pont                2.5                     +76
  Eastman Kodak          2.9                     -6
  Exxon                  3.2                     +83
  General Electric       5.3                     +232
  General Motors**       3.9                     +89
  Goodyear               2.2                     +23
  Hewlett-Packard*       3.1                     +66
  I.B.M.                 1.9                     +276
  International Paper    2.0                     +24
  J. P. Morgan           5.0                     +63
  Johnson & Johnson*     4.2                     +120
  McDonald's             2.0                     +102
  Merck                  3.6                     +175
  Minnesota Mining**     3.2                     +15
  Philip Morris          1.8                     +37
  Procter & Gamble       4.5                     +134
  Sears, Roebuck         2.1                     +18
  Union Carbide          2.1                     +19
  United Technologies    6.0                     +196
  Wal-Mart*              4.2                     +288
  Walt Disney            1.5                     +62

S&P 500
• The S&P 500, which started in 1957, weights stocks on the basis of their total market value.
• How about when new stocks are added and others are removed?
• S&P 500 is computed by
    S&P 500 = (w1X1 + w2X2 + … + w500X500)/c
  where Xi = price of the ith stock and wi = number of shares of the ith stock.
• What happens when a stock splits?
• It is a weighted average.

Sample vs Population
• For both problems, we try to infer properties of a large group (the population) by analyzing a small subgroup (the sample).
  – The population is the group we are trying to analyze; e.g., all eligible voters, etc.
  – A sample is a subset of the total population that we have observed or collected data from; e.g., voters that are actually polled, etc.
• How do we draw a sample that can be used to make statements about the population?
  – The sample must be representative of the population.
  – Sampling is the way to obtain reliable information in a cost effective way (why not a census?).

Issues in sampling
• Representativeness
  – Interviewer discretion
  – Respondent discretion: non-response
  – Key question: is the reason for non-response related to the attribute you are trying to measure?
    • Illegal aliens/Census.
    • Start-up companies/not in phone book.
    • Library exit survey.
• Good samples
  – Probability samples: each unit in the population has a known probability of being in the sample.
  – Simplest case: equal probability sample, where each unit has the same chance of being in the sample.

Utopian Sample for Analysis
• You have a complete and accurate list of ALL the units in the target population (the sampling frame).
• From this you draw an equal probability sample (generate a list of random numbers).
• Reality check: incomplete frame, impossible frame, practical constraints on the simple random sample (cost and time of sampling).
• Precision considerations
  – How large a sample do I need?
  – Focus on the confidence interval: choose the coverage rate (90%, 95%, 99%) and the margin of error (half the width). Typically you trade off width against coverage rate.
  – Simple rule of thumb for a population proportion: for a 95% CI, use n = 1/(margin of error)^2. This follows from the worst case p = 0.5, where the margin of error 1.96 × sqrt(p(1 − p)/n) is about 1/sqrt(n); e.g., a 3% margin of error needs n of about 1,100.

Data Analysis
• Statistical Thinking is understanding variation and how to deal with it.
• Move as far as possible to the right on this continuum: Ignorance --> Uncertainty --> Risk --> Certainty
• Information science: learning from data
  – Probabilistic inference based on mathematics
  – What is Statistics?
  – What is the connection, if any
  – fields including Data Base Management Artificial In