LECTURE 3-DATA MINING WHY DATA MINING AND THE KDD PROCESS? 2 WHY DATA WAREHOUSING AND DATA MINING? • Huge volumes of Data available: from terabytes to petabytes • Data collection and data availability • Automated data collection tools, database systems, Web, computerized society • Cameras, publication tools, scanned text and images • Major sources of abundant data • Business: Web, e-commerce, transactions, stocks, … • Science: Remote sensing, bioinformatics, scientific simulation, … • Society and everyone: news, digital cameras, YouTube • Medical data, demographic data, financial data and marketing data • We are drowning in data, but starving for knowledge! • “Necessity is the mother of invention”—Data mining— Automated analysis of massive data sets 3 WHY DATA WAREHOUSING AND DATA MINING?—POTENTIAL APPLICATIONS • Data analysis and decision support • Market analysis and management • Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation • Risk analysis and management • Forecasting, customer retention, improved underwriting, quality control, competitive analysis • Fraud detection and detection of unusual patterns (outliers) • Other Applications • Text mining (news group, email, documents) and Web mining • Stream data mining • Bioinformatics and bio-data analysis 4 EX. 1: MARKET ANALYSIS AND MANAGEMENT • Where does the data come from?—Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies • Target marketing • Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc. • Determine customer purchasing patterns over time • Cross-market analysis—Find associations/co-relations between product sales & predict based on such association • Customer profiling—What types of customers buy what products (clustering or classification) • Customer requirement analysis • Identify the best products for different groups of customers • Predict what factors will attract new customers • Provision of summary information • Multidimensional summary reports • Statistical summary information (data central tendency and variation) 5 EX. 2: FRAUD DETECTION & MINING UNUSUAL PATTERNS • Approaches: Clustering & model construction for frauds, outlier analysis • Applications: Health care, retail, credit card service, telecomm. • Auto insurance: ring of collisions • Money laundering: suspicious monetary transactions • Medical insurance • Professional patients, ring of doctors, and ring of references • Unnecessary or correlated screening tests • Telecommunications: phone-call fraud • Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm • Retail industry • Analysts estimate that 38% of retail shrink is due to dishonest employees • Anti-terrorism 6 CLASS QUIZ 1. Which of the following is the reason why Data Mining is required today? a) b) c) d) 2. Which of the following is not a source of data: a) b) c) d) 3. Youtube Web Online medical data repositories Cars Which is the most important application of Data Mining? a) b) c) d) 4. Data has not been mined before Mining can lead to good quality data Data is hard to find Huge amounts of Data is available today hence we need advanced techniques to make sense of it Decision support Data analysis Searching Providing data Match the techniques on the left to their description on the right a) b) c) d) Cross market analysis Target Marketing Customer requirement analysis Customer profiling i. what customer buy what kind of product ii. Predict what will attract a customer iii. Determine customer purchasing patterns 7 iv. Finding association between products DATA MINING: CONFLUENCE OF MULTIPLE DISCIPLINES Database Technology Machine Learning Pattern Recognition Statistics Data Mining Algorithm Visualization Artificial Intelligence 8 KNOWLEDGE DISCOVERY (KDD) PROCESS Data mining—core of knowledge discovery process Pattern Evaluation Data Mining Task-relevant Data Data Warehouse Selection Data Cleaning Data Integration Databases 9 KDD PROCESS: SEVERAL KEY STEPS • Learning the application domain • relevant prior knowledge and goals of application • Creating a target data set: data selection • Data cleaning and preprocessing: (may take 60% of effort!) • Data reduction and transformation • Find useful features, dimensionality/variable reduction, invariant representation • Choosing functions of data mining • summarization, classification, regression, association, clustering • Choosing the mining algorithm(s) • Data mining: search for patterns of interest • Pattern evaluation and knowledge presentation • visualization, transformation, removing redundant patterns, etc. • Use of discovered knowledge 10 Suppose you are asked to do some Data Mining for Chelsea. 1. What possible solutions using Data Mining would you propose to them?ما هي الحلول باستخدام بيانات التعدين ممكن هل تقترح لهم؟ Ans. Which player will score the most goals? Which club will win the cup? 2. What is the first thing you would do? Ans. Learn more about the domain. 3. What would you consider of utmost importance in data selection?ماذا كنت تنظر من أهمية قصوى في اختيار البيانات ؟ Ans. Select good quality data important for further analysis. 4. How would you acquire data reduction? كيف يمكنك الحصول على تخفيض البيانات ؟ Ans. Remove data that is redundant. 5. What data transformation would be required? ما تحويل البيانات سيكون مطلوبا ؟ Ans. Convert dates from Gregorian to Hijri 11 CLASS WORK Suppose you are asked to do some Data Mining for Hyper Panda. 1. What possible solutions using Data Mining would you propose to them?ما هي الحلول باستخدام بيانات التعدين ممكن هل تقترح لهم؟ Ans. What will the customers buy next year? Will they be interested in a new kind of milk? 2. What is the first thing you would do? Ans. Learn more about the domain. 3. What would you consider of utmost importance in data selection?ماذا كنت تنظر من أهمية قصوى في اختيار البيانات ؟ Ans. Select good quality data important for further analysis. 4. How would you acquire data reduction? كيف يمكنك الحصول على تخفيض البيانات ؟ Ans. Remove data that is redundant. 5. What data transformation would be required? ما تحويل البيانات سيكون مطلوبا ؟ Ans. Convert dates from Gregorian to Hijri 12 LET US DISCUSS AN EXAMPLE Suppose you are asked to do some Data Mining for Abdul Lateef Jamil. 1. What possible solutions using Data Mining would you propose to them? Ans.Which car will sell more in the next year? Which car will be more popular in Province Riyadh? 2. What is the first thing you would do? Ans. Learn more about the domain. 3. What would you consider of utmost importance in data selection? Ans Select good quality data important for further analysis. 4. How would you acquire reduction? Ans. Remove data that is redundant. 5. What transformation would be required? Ans. Convert dates from Gregorian to Hijri 13 CLASS WORK Suppose you are asked to do some data mining for ELM who would like to know the opinion of the users of there system using a database of online opinion blogs and tweet messages. 1. What possible solutions using Data Mining would you propose to them or what knowledge can be mined using Data Mining? 2. What is the first thing you would do? 3. What would you consider of utmost importance in data selection? 4. How would you acquire reduction? 5. What transformation would be required? 14