Data Mining
UMUC CSMN 667
Lecture #2
By Dr. Borne 2005 UMUC Data Mining Lecture 2

Term Paper - Data Mining Case Analysis
• Refer to the Project Descriptions section of the WebTycho course Syllabus for detailed information.
• 1-page Summary (Abstract + Outline) due: April 4, 2005
• Final Paper due: 12 midnight, April 18, 2005
• Submit both in your WebTycho Assignments Folder.
• Term Paper page restrictions: 5-8 pages
• I will submit your paper to TurnItIn.com for verification of originality, per UMUC Graduate School policies.
• Format/Style: Use the SPIE Conference Proceedings Style, which is available at: http://www.spie.org/app/Publications/index.cfm?fuseaction=authinfo&type=manspecs [ONLY USE THIS FOR STYLE FILES AND FORMATTING INSTRUCTIONS]

Case Analysis Instructions (1)
The goal of the paper assignment is to complete an in-depth study of a data mining application. Examples of applications include financial, scientific, medical, intrusion detection, and web mining. Describe the data types, data volumes, technical challenges, and end goals; who the user community is; which data mining algorithms are most relevant; why data mining is used and how; and the current status of data mining usage in this field.

Possible case topics include:
• A direct mailing application looking to maximize cross-selling opportunities (e.g., Doubleclick).
• A bank determining the credit worthiness of a potential customer (e.g., American Express, Bank of America).
• A medical insurer looking to detect medical fraud.
• Gene detection in BioInformatics (e.g., Celera).
• Glitch or anomaly detection in scientific time series data.
• Abnormal network access behavior for detection of computer system intrusion and security violation.

Case Analysis Instructions (2)
• You may choose to go in depth in either one of these two areas:
– A data mining application domain: Evaluate the application area in detail, as explained on the previous slide, including a review and analysis of the different data mining techniques employed there.
Or
– A data mining technique: Research in depth the different application domains where this technique has been used. Answer the questions on the previous slide when evaluating this technique’s different application areas.

Case Analysis Paper - Instructions (3)
• Please e-mail me your suggested topic (application area to be researched) so that I may verify that it is okay.

Case Analysis Paper - Instructions (4)
• Submit your completed paper in WebTycho.
• You may submit your paper in any of these formats: PDF, Microsoft WORD, or PostScript (PS).
• You must submit it no later than midnight on April 18. WebTycho will not allow submissions after that time.
• Submit the paper in your "Assignments Folder" (on the left menu bar within the WebTycho course website).

Lecture 2: “Data Mining Roots” (Chapter 2 of Dunham textbook)

Lecture 2 Outline
• Summary of “What is Data Mining?” Tutorial
• Foundations of Data Mining
• Database Systems
• Data Warehousing and OLAP
• Statistics and Data Mining
• Information Retrieval
• Data Mining as “Rule Induction”
• Fuzzy Sets and Logic
• Machine Learning
• Steps in the Data Mining Process
• Major Issues in Data Mining
• A Case Study: The NASA Mars Rover

“What is Data Mining?”
From the online reading assignment, the Data Mining Tutorial at: http://www.megaputer.com/dm/dm101.php3

Summary of “What is Data Mining?” Tutorial
• What is data mining?
• Why use data mining?
• What can Data Mining do for you?
• Reasons for the growing popularity of Data Mining
• Tasks Solved by Data Mining
• Different DM Technologies and Systems
– Subject-oriented analytical systems
– Statistical packages
– Neural Networks
– Evolutionary Programming
– Memory Based Reasoning
– Decision Trees
– Genetic Algorithms
– Nonlinear Regression Methods

What can Data Mining do for you? (business-focused list)
• Identify your best prospects and then retain them as customers.
• Predict cross-sell opportunities and make recommendations.
• Learn parameters influencing trends in sales and margins.
• Segment markets and personalize communications.

Reasons for the Growing Popularity of Data Mining
• Growing Data Volumes
• Limitations of Human Analysis
• Low Cost of Machine Learning

Tasks Solved by Data Mining
• Prediction
• Explicit Modeling
• Classification
• Detection of Relations
• Deviation Detection
• Clustering
• Market Basket Analysis

Foundations of Data Mining

Foundations of Data Mining: Databases, Statistics, and Machine Learning
• David Hand (1998, “Data Mining: Statistics and More?”, The American Statistician, 52, pp. 112–118) used the following definition:
– “Data mining is a new discipline lying at the interface of statistics, database technology, pattern recognition, machine learning, and other areas. It is concerned with the secondary analysis of large databases in order to find previously unsuspected relationships which are of interest or value to the database owners.”
– Why “secondary”? … Because the data were typically collected for other purposes (such as billing, accounting, customer addresses, etc.).
  Primary analysis of large databases is generally the domain of STATISTICS.

Evolution of Data Mining (slide from Lecture 1)
<http://www.thearling.com/text/dmwhite/dmwhite.htm>

Data Collection (1960s)
– Business question: "What was my total revenue in the last five years?"
– Enabling technologies: Computers, tapes, disks
– Characteristics: Retrospective, static data delivery

Data Access (1980s)
– Business question: "What were unit sales in New England last March?"
– Enabling technologies: Relational databases (RDBMS), Structured Query Language (SQL), ODBC
– Characteristics: Retrospective, dynamic data delivery at record level

Data Warehousing & Decision Support (1990s)
– Business question: "What were unit sales in New England last March? Drill down to Boston."
– Enabling technologies: On-line analytic processing (OLAP), multidimensional databases, data warehouses
– Characteristics: Retrospective, dynamic data delivery at multiple levels

Data Mining (Emerging Today)
– Business question: "What’s likely to happen to Boston unit sales next month? Why?"
– Enabling technologies: Advanced algorithms, multiprocessor computers, massive databases
– Characteristics: Prospective, proactive information delivery

Foundation for Data Mining Techniques
• 1960s:
– Data collection, database creation, IMS, and hierarchical DBMS
• 1970s:
– Relational data model, relational DBMS implementation
• 1980s:
– RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, financial, manufacturing, sales, etc.)
• 1990s-2000s:
– Data mining and data warehousing, multimedia databases, and Web databases

History of Data Mining
• Dates for specific events were imprecise in the preceding slides. This might be a little better:

Data Mining: Confluence of Multiple Disciplines
Disciplines feeding into Data Mining (diagram labels):
• Database Technology
• Machine Learning
• Statistics
• Information Science
• Visualization
• Other Disciplines

Data Mining Stepping Stones
http://www.cs.sfu.ca/~han/DM_Book.html
Layers, listed from top to bottom, with increasing potential to support business decisions toward the top:
• Making Decisions
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts: OLAP, MDA
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles shown alongside the layers (top to bottom): End User, Business Analyst, Data Analyst, DBA

Database Systems

Database Systems
• DBMS joins “AI and statistics” to become Data Mining
• Data mining usually asks complex statistical questions that are difficult to answer via traditional SQL queries
• Data mining relies on special algorithms outside of the standard DBMS/SQL family of tools
• Data mining is used to extract knowledge from DBMS, not just the data bits (i.e., KDD)
• Data mining applies familiar statistical concepts to large DBMS (e.g., outlier detection; cluster analysis; data modeling; evolutionary analysis; prediction)

Data Mining is a core database function
• Data Mining has many names / aliases:
– Knowledge Discovery in Databases (KDD)
– Machine Learning (ML)
– Exploratory Data Analysis (EDA)
– Intelligent Data Analysis (IDA)
– On-Line Analytical Processing (OLAP)
– Business Intelligence (BI)
– Customer Relationship Management (CRM)
– Business Analytics
– Target Marketing
– Cross-Selling
– Market Basket Analysis
– Credit Scoring
– Case-Based Reasoning (CBR)
– Connecting the Dots
– Intrusion Detection Systems (IDS)
– Recommendation / Personalization Systems!

Database Systems and Data Mining
• Data mining brings novel non-traditional concepts to large DBMS (e.g., association mining; neural nets; decision trees; link analysis; pattern recognition; classification; regression; SOMs). For example:
– Clustering Analysis = group together similar items and separate dissimilar items
– Classification Prediction = predict the class label
– Regression = predict a numeric attribute value
– Association Analysis = detect attribute-value conditions that occur frequently together (e.g., Beer & Diapers example)

Types of Databases to be Mined
• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories:
– Object-oriented and object-relational databases
– Spatial databases
– Time-series data and temporal data
– Text databases and multimedia databases
– Heterogeneous and legacy databases
– WWW, and eventually the Semantic Web

Data Warehousing and OLAP

Data Warehousing
• Data warehouse = Materialized view
• Integrated view of data from distributed sources
• If the transformation process can be represented via SQL, then the data warehouse can be seen as a DB view:
– CREATE VIEW warehouse_table AS SELECT … FROM source_table1, source_table2, … WHERE …
– except that the view is materialized = the result is stored and needs to be maintained when the source data change

Order of Database Operations (1)
• When building a DW, pay attention to the order of operations in the SQL command
– particularly if large data need to be selected, grouped, and ordered
– perhaps build intermediate views to cull data down to manageable size
• Order of operations . . .

Order of Database Operations (2)
Clauses are listed in the order they are written; the number in parentheses is the operational order.
(4) select ..... specifies attributes and computations to appear in answer
(1) from ..... indicates Cartesian product of source tables
(2) where ..... provides boolean to filter Cartesian product
(3) groupby ..... specifies attributes necessary to cluster the results of the where-filter
(5) orderby ..... indicates attributes on which to order any visual display or sequential tuple returns
(6) into ..... specifies a temporary table to hold the answer

Maintaining the Data Warehouse
The key concept is ETL:
– Extraction: extract relevant data and/or changes from the DB sources
– Transformation: transform the data to match the warehouse schema
– Loading: integrate data (and subsequent changes to data) into the warehouse

Data Warehousing “features”
• Data are integrated into the DW in advance, prior to queries being formulated
– Caution: Query results could therefore be stale
• Data are copied from distributed sources
– Care must be exercised to maintain consistency
– Query processing is local to the DW:
   • faster
   • can operate even when data sources are unavailable

Selecting views to materialize
• Factors that affect what to materialize:
– Storage cost
– Update cost
– Which queries will benefit from it
– How much will those queries benefit from it
• Examples:
– GROUP BY A1 is small, but not useful for most queries
– GROUP BY A1, B2, C3 is useful for most queries, but too large to be of much benefit

Data Warehousing and OLAP (On-Line Analytical Processing)
• OLAP as Data Mining:
– Read data from integrated view of data sources
– Complex queries of DW for Data Analysis
– Data Analysis for Knowledge Discovery (KDD = Data Mining)
– Knowledge Discovery for Decision Making
– Goal: optimize reads and data warehouse queries for data exploration, mining, analysis

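The materialized-view idea from the Data Warehousing slides can be sketched with Python's built-in sqlite3 module. This is a minimal illustration under stated assumptions, not how any particular warehouse product works: the `sales` table, its values, and the `refresh_warehouse` and `new_england_total` helpers are all hypothetical. SQLite has no native materialized views, so the sketch stores the query result in an ordinary table and rebuilds it on demand, which also shows why warehouse query results can be stale.

```python
import sqlite3

# Hypothetical source table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, city TEXT, units INTEGER);
    INSERT INTO sales VALUES
        ('New England', 'Boston', 120),
        ('New England', 'Hartford', 80),
        ('Mid-Atlantic', 'Baltimore', 95);
""")

def refresh_warehouse(conn):
    """Rebuild the stored (materialized) result from the current source data."""
    conn.executescript("""
        DROP TABLE IF EXISTS warehouse_sales;
        CREATE TABLE warehouse_sales AS
            SELECT region, SUM(units) AS total_units  -- evaluated 4th
            FROM sales                                -- evaluated 1st
            WHERE units > 0                           -- evaluated 2nd
            GROUP BY region;                          -- evaluated 3rd
    """)

def new_england_total(conn):
    """Read one aggregate from the warehouse table, not from the source."""
    row = conn.execute(
        "SELECT total_units FROM warehouse_sales WHERE region = 'New England'"
    ).fetchone()
    return row[0]

refresh_warehouse(conn)
initial = new_england_total(conn)

# A new source row arrives after the warehouse was loaded.
conn.execute("INSERT INTO sales VALUES ('New England', 'Providence', 50)")
stale = new_england_total(conn)   # warehouse has not been refreshed yet

refresh_warehouse(conn)           # the "maintenance" step
fresh = new_england_total(conn)

print(initial, stale, fresh)
```

Rebuilding the whole table is the simplest maintenance policy; an incremental load of only the changed rows, as the ETL slide suggests, would be the usual choice for large sources.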
OLTP versus OLAP (On-Line Transaction Processing vs. On-Line Analytical Processing)
• OLTP
– Mostly updates
– Short, simple transactions
– DBA, clerical users
– Goal: transaction throughput
– Local sources: heterogeneous DBs
• OLAP
– Mostly reads
– Long, complex queries
– Analysts, decision makers
– Goal: fast queries
– Distributed sources: single integrated view (data warehouse)

OLAP Operations in the Warehouse
• Slice (select one dimensional view)
• Dice (select multi-dimensional view; aids in the search for trends and patterns)
• Roll-up (consolidation; dimension reduction; aggregation; using simple or complex expressions)
• Drill-down (querying specific items)
• Visualize (“see” the results; allows for intuitive data understanding)

The Data Warehouse as the Source for the Mining Process (from Lecture #1)

From “DataMines for DataWarehouses” article (available in Webliography)
Figure panels: Data Mining external to the Data Warehouse; Data Mining within the Data Warehouse

Statistics and Data Mining

Data Mining = Statistical Analysis?
• "Data mining … is the exploration and analysis, by automatic and semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules."
  (Berry, M. J. A. & Linoff, G. [1997]. Data Mining Techniques: For Marketing, Sales, and Customer Support, John Wiley & Sons, Inc., New York, p. 5, http://www.dataminers.com/books/order.html )
• "Data mining is the process of selecting, exploring, and modeling large amounts of data to uncover previously unknown patterns of data for business advantage."
  (SAS Institute Inc., http://www.sas.com/technologies/analytics/datamining/index.html )
• "Data mining simply means finding patterns in your business data which you can use to do your business better"
  (SPSS Inc., http://www.statistical.com.au/dm.htm )
• "Data mining is the use of statistical analysis and machine learning techniques, in a semiautomatic fashion, on large collections of data."
  (Jorgensen, M. & Gentleman, R. [1998]. Data Mining. Chance 11, 34–42.)

Statistics and Data Mining
• Data mining got a bad name initially because it was viewed as “statistical dredging” or a “fishing expedition”.
• Data mining became an acceptable practice because its users exercised statistical rigor in their analyses.
• Challenges and concerns:
– Data volumes are huge.
– Techniques don’t often scale.
– Contaminated or corrupt data values (6-sigma effect)
– Selection bias; non-independent observations
Fishing expedition = if you look hard enough, you will find something. But is it really useful or not? … this is the “Interestingness” Problem …
• Are the data mining results interesting to anyone?

Quality Management and Data Mining
• The focus of TQM (Total Quality Management) is total customer satisfaction.
• This can be realized through CRM (Customer Relationship Management) systems = a data mining technology:
– Gather data
– Analyze data
– Make decisions based upon results
• Related to this are 6-Sigma quality control processes: customer satisfaction is maximized by minimizing defects in the products and services delivered.
• Some references:
– http://www.sbaer.uca.edu/newsletter/2002/012202.pdf
– http://www.qualitydigest.com/apr99/html/body_spcguide.html

Information Retrieval
Information Retrieval (IR)
• IR is a combination of data discovery and data mining in digital libraries or other information repositories.
• An IR system operates on a collection of documents (e.g., the WWW)
• IR is sometimes called Text Mining or Web Mining
• Effectiveness of an IR project is measured by precision and recall

Information Retrieval Metrics
Precision = (relevant & retrieved) / (retrieved)
– “Am I interested in the documents retrieved?”
– High Precision means most of the retrieved documents are relevant to my query
Recall = (relevant & retrieved) / (relevant)
– “Have all relevant documents been retrieved?”
– High Recall means that most of the relevant documents have been retrieved.

IR and Text/Web Mining
• Semantic markup of Web or other text documents using XML (eXtensible Markup Language)
• XML enables metadata / keyword harvesting from document collections (e.g., Web screen-scraping)
• Harvested metadata can be stored in a Data Warehouse for mining -- this is clearly an example of a materialized view of distributed data sources
• Other metrics: “similarity” to other documents (e.g., common keywords, common keyphrases)
• Application area: Automated Recommendation System

Information Retrieval Issues
• Semantic content of documents
• Unstructured versus structured content
• Multi-modal content (image, text, numeric)
• Reliability of sources
• Quality of sources
• Indexing for efficient & effective access
• Similarity metrics (e.g., how do you do a Groupby or a Roll-up?)
• Privacy, Copyright, Intellectual Property

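The precision and recall definitions from the Information Retrieval Metrics slide translate directly into set operations. The sketch below is illustrative only: the document IDs and keyword sets are made up, and the Jaccard ratio is just one assumed choice for the keyword-based “similarity” that the slides mention without naming a formula.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant                     # relevant AND retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

def jaccard(keywords_a, keywords_b):
    """Keyword overlap: shared keywords divided by all distinct keywords."""
    a, b = set(keywords_a), set(keywords_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical query result: 4 documents retrieved, 5 documents truly relevant.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d5", "d6", "d7"}
precision, recall = precision_recall(retrieved, relevant)

# Hypothetical keyword sets harvested from two documents.
similarity = jaccard({"mars", "rover", "rock"}, {"mars", "rock", "dust", "storm"})

print(precision, recall, similarity)
```

Retrieving every document in the collection would push recall to 1.0 while precision collapses, which is why the two metrics are reported together.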
IR and Image Mining
• Image Mining is a form of IR and data mining
• Techniques:
– Wavelet analysis and summarization
– Pixel value (color) histograms and vectorization
– Scene pattern recognition and indexing
– Event/anomaly detection and cataloguing (e.g., forest fires seen in satellite photos)
– Edge detection (unsharp masking) and graphs
• The data to be mined are the information databases extracted from the images (not the raw image data themselves)

Data Mining as “Rule Induction”

Decision Tree Classification: based on rules at each node of the tree (from Lecture #1)
Should I play tennis today?

Intelligent actions (decision support) are often represented by a set of rules…
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “no”
(example of Decision Tree rules)

Rule-Based Algorithms (RBA)
• RBA = Decision Support via “if-then rules”
• Can generate the rules from a Decision Tree (DT).
• But, rules do not need to be derived from a DT.
• Rules have no order, unlike Decision Trees.
• Trees are built by examining all cases, whereas rules are generated one case at a time.
• Rule Induction is the method for deriving rules.
• Case-Based Reasoning (CBR) is a related application of rule-based algorithms.

Sometimes the rules are fuzzy…
(example of Fuzzy Rule Induction)

Fuzzy Sets and Logic
Fuzzy Sets and Logic
• Data mining does not always yield absolute answers, but statistical answers that indicate the probability or frequency of occurrence of patterns or classes, or the likelihood that an object in the database belongs to a given class.
• In predictive data mining, the result is fuzzy (e.g., predicting loan default through bank account analysis does not guarantee that the customer will indeed default on their loan).
• Fuzzy Logic is a method for handling uncertainty in data, in decision-making, and in control systems.

Sets and Logic - Classical (Boolean)

Sets and Logic - Fuzzy

Classical versus Fuzzy

Fuzzy Logic, Control Systems, and Data Mining
• Suppose you have a R/T (real-time) data monitoring (data mining) control system attached to machinery in a large manufacturing plant.
• Temperature sensor on a machine says that it is running very hot (... what is “hot”? -- that’s fuzzy).
• Motion sensor within machine says that it is running at high RPM, very fast (… what is “fast”? -- that’s fuzzy).
• The machine is not technically over-heating, which you know because of past experience and common sense.
• Control System responds to data and knowledge-base by invoking a rule to slow down the motor speed a little bit.

Application of Fuzzy Logic to Data Mining - 1
<http://www.cs.uah.edu/~thinke/CS687/Fall97/Tech/rahul_dbase_paper.html>
Direct Mailing System
• The problem is to identify customers from a customer database who can be targeted for a sale under the assumption that these customers responded positively to advertisements mailed to them.
The additional constraint is that the mailing list budget is limited, so the number of advertisements to be mailed is to be controlled to increase profit. The first step involves analyzing the database for attributes like "frequency of visits to the store", "sum of purchases", etc. Analysis and plots of the data then determine the cluster of good customers. Next, one has to find the attribute relationships to define a query condition, which is represented by a pair of attributes and a fuzzy linguistic value. One then verifies and refines the query condition by using another customer database. Thus the customer database is ranked and sorted by degree values based on a given fuzzy query condition. The customers retrieved by the query make up the list of potential good customers.

Application of Fuzzy Logic to Data Mining - 2
<http://www.cs.uah.edu/~thinke/CS687/Fall97/Tech/rahul_dbase_paper.html>
Vibration Sensor
• A product that was used to sense vibrations and predict their causes (e.g., earthquakes) was improved by utilizing fuzzy rules. The original sensor was based on a simple threshold rule. The error rate for this sensor was around 12%. The fuzzy rules were created by analyzing the actual data in specified cases of earthquakes, automobiles, etc. A feature extraction was done on the data set to identify each kind of cause. Relationships between the feature parameters and the kind of vibration were discovered to develop the fuzzy rules. These rules were then tested and refined. The accuracy of the sensor’s prediction improved dramatically, with the error rate falling to within 1%.

Non-Fuzzy Logic System

Adaptive Fuzzy Logic System
This example is related to air conditioner settings in a warm room, but the adaptive fuzzy logic system may be applied to activate other “thinking machines”.

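The contrast between a crisp threshold and a fuzzy rule, as in the machinery example (“what is hot?”, “what is fast?”), can be sketched as below. Everything numeric here is an assumption made for illustration: the 60-90 °C and 3000-5000 RPM ramps, the 75 °C crisp threshold, and the 50% maximum speed reduction do not come from the lecture. The fuzzy AND is taken as the minimum of the two membership degrees, which is one common convention.

```python
def ramp(x, low, high):
    """Membership degree rising linearly from 0 at `low` to 1 at `high`."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)

def is_hot_boolean(temp_c, threshold=75.0):
    """Classical (Boolean) set: a reading is either hot or not hot."""
    return temp_c >= threshold

def speed_reduction(temp_c, rpm, max_reduction=0.5):
    """Fuzzy rule: IF hot AND fast THEN slow the motor down, by degree."""
    hot = ramp(temp_c, 60.0, 90.0)       # assumed range for "hot"
    fast = ramp(rpm, 3000.0, 5000.0)     # assumed range for "fast"
    firing = min(hot, fast)              # fuzzy AND as the minimum
    return firing * max_reduction        # fraction by which to cut speed

# Boolean logic flips abruptly between two nearly identical readings ...
crisp = (is_hot_boolean(74.0), is_hot_boolean(76.0))
# ... while the fuzzy membership changes only slightly.
fuzzy = (ramp(74.0, 60.0, 90.0), ramp(76.0, 60.0, 90.0))

reduction = speed_reduction(temp_c=75.0, rpm=4500.0)
print(crisp, fuzzy, reduction)
```

The graded output is what lets the controller slow the motor “a little bit” instead of switching between full speed and shutdown.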
Machine Learning – a tool for Data Mining and Intelligent Decision Support

Machine Learning
• What is Machine Learning? -- “ML is the application of computer algorithms that improve automatically through experience.”
• Why is ML applicable to Data Mining?
– Refer to earlier slide “Reasons for the growing popularity of data mining”:
   • Growing Data Volume -- ML enables the intelligent analysis of overwhelmingly large data/knowledge repositories
   • Limitations of Human Analysis -- ML enables automated searches for complex multifactor dependencies in data
   • Low Cost of Machine Learning -- machines and software are cheaper than people; the ML process is repeatable, consistent, and robust in handling very large data analysis tasks; adaptive ML algorithms can scale with the problem.

Machine Learning and Data Mining
• ML Techniques for DM (to be covered later):
– Decision Trees
– Rule Mining and Rule Learning
– Case-Based Reasoning (CBR)
– Neural Nets (NN)
– Supervised and Unsupervised Learning
– Support Vector Machines (SVM)
– Bayesian Networks
– Genetic Algorithms (GA)

Neural Nets
• “Neural networks are the second best way of doing just about anything.” (John Denker)
Diagram labels: Data, Neural Network, Fuzzy Rules
• The best way “is to apply all available domain knowledge and spend a considerable amount of time, money and effort in building a rule system that will give the right answer. The second best way of doing anything is to learn from experience.” (Burbidge & Buxton)

Supervised vs. Unsupervised Learning
• In Supervised Learning algorithms, a training set is provided (data with correct answers), which is used to mine for known patterns.
• In Unsupervised Learning algorithms, data are provided with no a priori knowledge of the hidden patterns (knowledge) that they contain. The goal is to discover (learn) these patterns.
• A class known as Semi-Supervised Learning also exists, where knowledge is known and applied from one data collection in order to mine, analyze, classify, and interpret a related data collection.

Machine Learning, Data Mining, and Support Vector Machines (SVM)
• SVM is the tool of choice for the application of ML to the data mining classification problem.
• So what are they? … “a statistical learning system for predictive data mining -- for estimating regression functions.”
• Loads of information available here:
  http://www.cs.rpi.edu/~bij2/svm.html
  http://www.kernel-machines.org/tutorial.html

SVM Process Overview
Diagram labels: Initial Classification Data, Data, SVM Training, SVM Weights, Classification, Elements In Classification, Elements Out of Classification

SVM Classification
• SVM attempts to find an optimal separating hyperplane between members of the two initial classifications.
Diagram labels: Separating hyperplane, Class “A”, Class “B”

SVM Class Separation Problem
• An optimal hyperplane partitions the initial classification correctly and maximizes distance from the plane to elements on either ‘side’: positive and negative examples.
• When the training examples (initial classification) consist of very diverse expression patterns, then finding an optimal hyperplane can be impossible.

SVM Kernel Construction
The expression data can be transformed to a higher dimensional space (feature space) by applying a kernel function. This transformation can have the effect of allowing a separating hyperplane to be found.

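The effect described on the SVM Kernel Construction slide can be seen on a toy data set. The sketch below is not an SVM trainer: the points, the feature map phi(x) = (x, x²), and the hand-picked hyperplane weights are all assumptions chosen for illustration. It only shows that data with no separating threshold in one dimension can become linearly separable after the mapping, and that the matching kernel function gives the feature-space dot product without building the mapped vectors.

```python
def separable_by_threshold(points, labels):
    """True if a single cut on the line puts each class on its own side."""
    pos = [x for x, y in zip(points, labels) if y == +1]
    neg = [x for x, y in zip(points, labels) if y == -1]
    return max(pos) < min(neg) or max(neg) < min(pos)

def phi(x):
    """Assumed feature map from 1-D input space to 2-D feature space."""
    return (x, x * x)

def kernel(x, y):
    """Dot product of phi(x) and phi(y), computed in the input space."""
    return x * y + (x * x) * (y * y)

def classify(x, w=(0.0, -1.0), b=3.5):
    """Sign of w . phi(x) + b for a hand-picked (not trained) hyperplane."""
    fx = phi(x)
    score = w[0] * fx[0] + w[1] * fx[1] + b
    return +1 if score > 0 else -1

# Toy data: the positive class sits between two groups of negatives.
points = [-3.0, -2.5, -1.0, 0.0, 1.0, 3.0, 4.0]
labels = [-1, -1, +1, +1, +1, -1, -1]

separable_in_1d = separable_by_threshold(points, labels)
predictions = [classify(x) for x in points]
separable_in_feature_space = predictions == labels

# The kernel agrees with the explicit dot product in feature space.
a, c = phi(2.0), phi(3.0)
explicit_dot = a[0] * c[0] + a[1] * c[1]

print(separable_in_1d, separable_in_feature_space, kernel(2.0, 3.0), explicit_dot)
```

A real SVM would choose the hyperplane by maximizing the margin during training; the fixed weights here merely stand in for that step.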
Practical SVM Issues
• Results depend heavily on the input parameters.
• Using a high degree kernel function risks artificial separation of the data.
• An iterative approach to increasing the kernel power is advisable.

SVM Results
• Two classes are produced:
– Positive Class: contains elements with expression patterns similar to those in the positive examples in the training set.
– Negative Class: contains all other members of the input set.
• Each of these classes has elements that fall in two groups:
– Those initially in the class (true positives and true negatives)
– Those recruited into the class (false positives and false negatives)

Machine Learning Resources
• 1. Massive compilation of ML resources at: http://home.earthlink.net/~dwaha/research/machine-learning.html
• 2. Excellent Reference Book: Tom Mitchell’s “Machine Learning” (1997; McGraw-Hill): http://www-2.cs.cmu.edu/~tom/mlbook-chapter-slides.html
• 3. Machine Learning & Data Mining Resources: http://www.mlnet.org/
  My favorite ML site … Click on Software … a site dedicated to “machine learning, knowledge discovery, case-based reasoning, knowledge acquisition, and data mining.”

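The four groups named on the SVM Results slide (true and false positives, true and false negatives) can be tallied by comparing each element's initial label with the class the classifier assigns. The sketch below works for any two-class classifier output, not just an SVM; the label strings and the six example elements are hypothetical.

```python
from collections import Counter

def confusion_counts(actual, predicted, positive="pos"):
    """Count TP, FP, TN, and FN from paired actual and predicted labels."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            # Element ended up in the positive class.
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Element ended up in the negative class.
            counts["TN" if a != positive else "FN"] += 1
    return counts

# Hypothetical initial labels and classifier output for six elements.
actual = ["pos", "pos", "neg", "neg", "neg", "pos"]
predicted = ["pos", "neg", "neg", "pos", "neg", "pos"]
counts = confusion_counts(actual, predicted)

print(dict(counts))
```

Read against the slide, the positive class holds the TP and FP elements and the negative class holds the TN and FN elements.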
Recap of ML and DM
• DM requires machine assistance in the search and analysis of very large (often distributed, heterogeneous) databases
• Intelligent analysis of complex multi-dimensional multiple-dependency data also demands machine assistance
• Algorithms for DM are most efficient when they are adaptable to the type and content of the data (i.e., the system “learns”)
• Machines are less expensive than humans
• Machines are usually scalable as the problem size grows
• Actionable data (the end-goal of DM) depends in many cases on an embedded ML algorithm to take appropriate action (in control systems; decision-support systems; robotics; autonomous systems)
• ML and DM are historically, technically, and functionally intertwined (e.g., some data mining research groups call themselves Machine Learning Groups)

Steps in the Data Mining Process

Steps in the Data Mining Process
http://www.cs.sfu.ca/~han/DM_Book.html
• Learning the application domain:
– relevant prior knowledge and goals of DM application
• Creating a target data set: Data selection
• Data cleaning and preprocessing: (may take 40-60% of effort!)
• Data reduction and transformation:
– Find useful features, dimensionality/variable reduction, invariant representation.
• Choosing data mining functions
– summarization, classification, regression, association, clustering
• Choosing the mining algorithm(s)
• Data mining & KDD: search for patterns of interest
• Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
• Using the discovered knowledge = Actionable Data!

Steps in the Data Mining Process - Pictorial View
Cleaning the “Dirty Data”
• Excellent reference: Dorian Pyle’s book “Data Preparation for Data Mining” (1999, Morgan Kaufmann; 540pp)
• Frequent problem: missing (NULL) values
• Empty value ≠ Missing value (must treat each case differently)
• Various options for NULLs (may introduce bias):
– use “fill value” (e.g., -999)
– use estimated value (prediction from data model)
– use interpolated value (from surrounding entries)
– ignore any records with nulls
• November 2003 Workshop on Data Cleaning: http://dimacs.rutgers.edu/Workshops/DataCleaning/

Data Preprocessing (Laundering the Data)
(may take 40-80% of the total data mining project effort!)
(Reference: “Data Scrubbing” article in Computerworld 2003)

“Data Scrubbing by the Numbers”
(http://www.computerworld.com/printthis/2003/0,4814,78260,00.html)
Here are some of the findings:
• Data cleansing accounts for up to 70% of the cost and effort of implementing most data warehouse projects, according to analysts.
• In 2001, The Data Warehousing Institute estimated that dirty data costs U.S. businesses $600 billion per year.
• Data cleanliness and quality was the No. 2 problem -- right behind budget cuts -- cited in a 2003 IDC survey of 1,648 companies implementing business analytics software enterprise-wide.
• Only 23% of 130 companies surveyed by Cutter Consortium on their data warehousing and business-intelligence practices use specialized data cleansing tools.
• Of those companies in the Cutter Consortium study using specialized data scrubbing software, 31% are using tools that were built in-house.

Major Issues in Data Mining
Major Issues in Data Mining (1)
• Mining methodology and user interaction
  – Mining different kinds of knowledge in databases
  – Interactive mining of knowledge at multiple levels of abstraction
  – Incorporation of background knowledge
  – Data mining query languages and ad-hoc data mining
  – Expression and visualization of data mining results
  – Handling of noise and incomplete data
  – Pattern evaluation: the interestingness problem
• Performance and scalability
  – Handling very large data volumes (the “data flood”)
  – Efficiency and scalability of data mining algorithms
  – Parallel, distributed, and incremental mining methods

Major Issues in Data Mining (2)
• Issues relating to the diversity of data types
  – Handling relational and complex types of data
  – Mining information from heterogeneous databases and global information systems (WWW)
• Issues related to applications and social impacts
  – Application of discovered knowledge
    • Domain-specific data mining tools
    • Intelligent query answering
    • Process control and decision making
  – Integration of the discovered knowledge with existing knowledge: A knowledge fusion problem
  – Protection of data security, integrity, and privacy
• Dirty data (60% of the effort, or more)
  – Preparing the data for mining (transformation, cleaning, processing)

Case Study - The Mars Rover
http://mars.jpl.nasa.gov/mer/mission/spacecraft_surface_rover.html

Data Mining in Action
• Data Mining facilitates Intelligent Data Understanding
• Data Mining enables Decision Support and Active Control Systems

What is Intelligent Data Understanding?
• IDU refers to the application of techniques for transforming data into understanding. … (sound familiar?)
  Data → Information → Knowledge → Understanding / Wisdom!
• Web reference: http://is.arc.nasa.gov/IDU/index.html
• IDU specifically refers to automating the following techniques for machine-assisted data analysis:
  – Data Mining (e.g., http://is.arc.nasa.gov/IDU/tasks/NVODDM.html)
  – Knowledge Discovery
  – Machine Learning

Intelligent Data System Applications (1)
• Rove around the surface of Mars and take samples of rocks (mass spectroscopy = a data histogram)
• Supervised Learning (search for rocks with known compositions)
• Unsupervised Learning (discover what types of rocks are present, without preconceived biases)
• Association Mining (find unusual associations)
• Clustering (find the set of unique classes of rocks)
• Classification (assign rocks to known classes)
• Deviation/Outlier Detection (one-of-a-kind; interesting?)

Intelligent Data System Applications (2)
• On-board Intelligent Data Understanding & Decision Support Systems (Fuzzy Logic & Decision Trees & Case-Based Reasoning)
  – Science Goal Monitoring:
  – “stay here and do more”; or else “move on to another rock”
  – “send results to Earth immediately”; or “send results later”
• Learn as it goes (Machine Learning & Neural Nets)
• Relate the results to other factors, such as dust storms (XML & Information Retrieval & Information Fusion with other data from orbiting satellite “mother ship”)
• Predict where to go in order to find interesting rocks (Logistic Regression & Case-Based Reasoning)

Mars Rover as an Adaptive Fuzzy Logic System
• Decisions are based on data mined, prior experience, new knowledge, and fuzzy logic
• Rover acts autonomously, without human intervention, in a Deep Space environment
• Actions are driven by mining actionable data from all sensors
Summary of Topics Covered
• Summary of “What is Data Mining?” Tutorial
• Foundations of Data Mining
• Database Systems
• Data Warehousing and OLAP
• Statistics and Data Mining
• Information Retrieval
• Data Mining as “Rule Induction”
• Fuzzy Sets and Logic
• Machine Learning
• Steps in the Data Mining Process
• Major Issues in Data Mining
• A Case Study: The NASA Mars Rover