Data Mining
Chapter 1

Data vs Information
• Society produces huge amounts of data
• Potentially a valuable resource
• Needs to be organized – patterns underlying the data

Data Mining
• People have always looked for patterns
• Computing power makes this more feasible
• Data Mining = extraction of implicit, previously unknown, and potentially useful information from data

Data Mining Has Commercial Value
• Example Practical Uses
  – Database Marketing
  – Credit Approval
  – Fraud Detection (e.g. Credit Card)

3 General Techniques
• Statistical
• Machine Learning
• Visualization

Machine Learning
• Skip the philosophy
• Acquisition of knowledge and the ability to use it
• Improving computer performance via indirect programming (learning from data) rather than direct programming

Black Box vs Clear Box
• Black Box – what the program has learned, and how it is using it, is incomprehensible to humans
• Clear Box – what is learned is understandable
  – Structural descriptions represent patterns explicitly (e.g. as rules)
  – …

Let's Look at Some Data and Structural Patterns
• clraw – Access database with data related to community crime, including socioeconomic and law enforcement data
• cl-revised-attribute-list – lists the attribute names
• NJDOHalldata – Access database with data related to community addictions
• njcrimenominal – heavily processed from clraw; ready to run Weka data mining tools on
• …

Run Weka (a code sketch follows the weather-data slides below)
• On njcrimenominal
  – Rules – Prism
  – Tree – ID3
• On my-weather
  – Rules – Prism
  – Tree – ID3

Black Box
• Some machine learning / data mining methods may be successful but do not produce understandable structural descriptions
• E.g. neural networks
• This book does not focus on them
  – Frequently what is desired is a "take-home message" for future HUMAN decision making – this requires an understandable result of learning
  – Frequently, even if the results of learning may be used "automatically" by a computer program, the human decision to TRUST the automatic program may depend on the human being able to make sense of and evaluate what has been learned

The Real World
• The examples learned from are NOT complete
• The data is NOT complete
• Frequently there are ERRORS or mistakes in the data

Lessons from Simple Examples – Weather
• Numeric vs "symbolic" attributes
  – Nominal / categorical weather data has at most 3 × 3 × 2 × 2 = 36 possible combinations of attribute values
  – With numeric values for temperature and humidity there is a much larger set of possibilities, and the learning problem gets much more difficult – the learner may need to discover inequality tests (e.g. temperature > 80)

Lessons from Simple Examples – Weather
• Classification vs Association tasks
  – In our examples so far, we learned rules / trees to predict the value of the "Play" attribute
    • It was pre-determined what was to be predicted
  – A less focused approach would be to look for any rule that can be inferred from the data (e.g. if temperature = cool then humidity = normal)
  – It is possible to generate very large sets of association rules; this must be carefully controlled
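The "Run Weka" slide above can also be reproduced in code. Below is a minimal sketch against the Weka Java API, assuming an older Weka 3.x distribution that still bundles the educational Prism and ID3 learners (recent releases move them into an optional add-on package), and assuming the nominal weather data has been saved locally as weather.nominal.arff; the class name RunWekaDemo and the file path are illustrative, not part of the slides.

```java
// Build and print a PRISM rule set and an ID3 decision tree on the nominal
// weather data -- the programmatic equivalent of the
// "Run Weka ... Rules - Prism / Tree - ID3" demo.
import weka.classifiers.rules.Prism;
import weka.classifiers.trees.Id3;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RunWekaDemo {
    public static void main(String[] args) throws Exception {
        // Load the dataset and declare the last attribute ("play") as the class.
        Instances data = DataSource.read("weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Covering (rule) learner: prints the learned rule set.
        Prism prism = new Prism();
        prism.buildClassifier(data);
        System.out.println(prism);

        // Divide-and-conquer (tree) learner: prints the learned tree.
        Id3 id3 = new Id3();
        id3.buildClassifier(data);
        System.out.println(id3);
    }
}
```

Pointing the same two learners at an ARFF export of njcrimenominal works the same way; the Weka Explorer GUI produces equivalent output without writing any code.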
Lessons from Simple Data – Contact Lenses
• Decision Rules vs Decision Trees
  – A decision tree may be a more compact representation of the patterns in the data – see the next two slides

If tear production rate = reduced then recommendation = none
If age = young and astigmatic = no and tear production rate = normal then recommendation = soft
If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft
If age = presbyopic and spectacle prescription = myope and astigmatic = no then recommendation = none
If spectacle prescription = hypermetrope and astigmatic = no and tear production rate = normal then recommendation = soft
If spectacle prescription = myope and astigmatic = yes and tear production rate = normal then recommendation = hard
If age = young and astigmatic = yes and tear production rate = normal then recommendation = hard
If age = pre-presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none
If age = presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none

Figure 1.1 Rules for the contact lens data.
Figure 1.2 Decision tree for the contact lens data.

Lessons from Simple Data – CPU Performance
• In some cases what is desired is a numeric prediction rather than a classification (here, a prediction of CPU performance) – see Table 1.5 on p. 15
• This is more challenging, and some methods used for classification are not appropriate for numeric prediction
• Statistical regression is the standard for comparison

Lessons from Simple Data – Labor Negotiations
• As we saw with the crime dataset, realistic datasets sometimes have missing values. This is also seen in the labor negotiations dataset (sketched in Table 1.6 on p. 16)
• Also – not all data mining involves thousands of records – generating this 57-record dataset was A LOT OF WORK

Lessons from Simple Data – Labor Negotiations
• Training and Testing
  – A program that uses data to learn (train) should be tested on other data (as an independent judgment)
• Overfitting
  – Getting 100% correct on the training data may not always be the best thing to do
  – The learner may be adjusting to idiosyncratic aspects of the training data that will not generalize to yet-to-be-seen instances
  – Assuming the goal is to learn something that will be useful in the future, it is better to avoid "overfitting" what is learned to the training data – see the next slide

Figure 1.3 Decision trees (a) and (b) for the labor negotiations data.

Fielded Applications
• Loan approval (for borderline cases)
  – 20 attributes – age, years with current employer, years at current address, years with the bank, other credit, …
  – 1000 training cases
  – Learn rules
  – 2/3 correct vs ½ correct by human decision makers
  – Rules could be used to explain decisions to rejected applicants

Fielded Applications
• Marketing and Sales
  – E.g. a bank detecting when it might be in danger of losing a customer
    • E.g. bank-by-phone customers who call at times when response is slow
• Market Basket Analysis (Supermarkets and …)
  – Use of "association" techniques to determine groups of items that tend to occur together in transactions
  – Famous example – diapers and beer
    • May help in planning store layouts
    • Send out coupons for one of the items
    • Give out cash register coupons for one item when the other is purchased
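The "association techniques" just mentioned can be tried directly with Weka's Apriori implementation. Below is a minimal sketch, again against the Weka Java API, run on the nominal weather data as a stand-in for a real transaction file (which would first have to be exported to a nominal ARFF); the file path and the class name AssociationDemo are assumptions for illustration.

```java
// Mine association rules from the nominal weather data with Apriori.
// Unlike classification, no class attribute is singled out: any attribute
// may appear on either side of a rule.
import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AssociationDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff");

        Apriori apriori = new Apriori();
        apriori.setNumRules(10);          // cap the output so the rule set stays manageable
        apriori.buildAssociations(data);
        System.out.println(apriori);      // e.g. rules like "temperature=cool ==> humidity=normal"
    }
}
```

This is also where the earlier warning about association rules applies: Apriori's minimum-support and minimum-confidence thresholds, together with the rule cap above, are the controls that keep the generated rule set from exploding.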
Fielded Applications
• Direct Marketing
  – A better response rate allows fewer mailings to produce the same result, or the same mailings to produce larger sales
  – May use data from outside organizations too (e.g. socio-economic data based on zip code)

Other Fielded Applications
• Oil slick detection from images
• Energy use (load) prediction
• Diagnosis of machine faults

1.5 Learning as "Search"
• The set of possible rule sets or possible decision trees is far too large to consider them all and pick the best
• Many learning algorithms follow a "hill climbing" approach to search – try one possibility, then look for small changes that improve it; stop when no improvement can be found (a toy sketch of this idea appears at the end of these notes)
• Such a process IS NOT guaranteed to find the best solution. It is "heuristic" – it uses rules of thumb that are likely to help, but which are not a sure thing
• But it is far more tractable than an exhaustive brute-force search
• In many learning schemes, search is also "greedy" – decisions, once made, are not retracted
• Learning methods have a "bias" – for efficiency's sake, they are more likely to come to some conclusions than to others

1.6 Data Mining and Ethics
• Discrimination
• Privacy policies
• Human common sense when using conclusions
• Profits vs service

End Chapter 1
• Show coming into Weka from the Start menu, opening a file, and visualizing the data (a code sketch of the same steps follows)
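For the demo above, here is a minimal sketch of the same steps done in code rather than through the Explorer GUI: load a dataset and print a per-attribute summary (the Explorer's Preprocess panel shows the same information graphically, with histograms). The file path and the class name OpenAndInspect are assumptions.

```java
// Open an ARFF file and inspect the data -- a text-only stand-in for the
// "open file, visualize data" walkthrough in the Weka Explorer.
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class OpenAndInspect {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);   // treat the last attribute as the class

        // Instance/attribute counts and basic statistics for each attribute.
        System.out.println(data.toSummaryString());
    }
}
```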
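To make the "hill climbing" and "greedy" ideas from section 1.5 concrete, here is a toy, self-contained sketch (not any particular learner's algorithm): the candidate solution is just a set of items to include, the score is a made-up value-minus-complexity objective, and the search repeatedly tries single-item changes, keeps only improvements, and stops when no small change helps.

```java
// Toy hill-climbing search: try one possibility, look for small changes that
// improve it, keep them greedily (never retract), stop when nothing improves.
// Real learners search over rule sets or trees instead of bit vectors, but the
// heuristic, greedy character of the search is the same -- and, in general,
// such a search can stop at a local optimum rather than the best solution.
import java.util.Arrays;

public class HillClimbDemo {

    // Made-up objective: reward each selected item by a fictional value and
    // charge a small complexity penalty per item ("accuracy minus complexity").
    static double score(boolean[] selected) {
        double[] value = {5.0, -2.0, 3.0, 1.5, -4.0, 2.5};
        double total = 0.0;
        for (int i = 0; i < selected.length; i++) {
            if (selected[i]) total += value[i] - 0.5;
        }
        return total;
    }

    public static void main(String[] args) {
        boolean[] current = new boolean[6];   // start with nothing selected
        double best = score(current);

        boolean improved = true;
        while (improved) {                    // stop when no single change helps
            improved = false;
            for (int i = 0; i < current.length; i++) {
                current[i] = !current[i];     // small change: flip one choice
                double s = score(current);
                if (s > best) {               // greedy: keep the improvement for good
                    best = s;
                    improved = true;
                } else {
                    current[i] = !current[i]; // undo an unhelpful change
                }
            }
        }
        System.out.println("Selected " + Arrays.toString(current) + " with score " + best);
    }
}
```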