Department of Computer and Information Science
Data Mining Basics
CISC 4631 Data Mining, L1 U2

Outline
o From Data to Information
o What is NOT Data Mining?
o A KDD Process
o Data Mining Techniques
o Data Mining: Miscellaneous

From Data to Information
o Data vs. information (do you agree that they differ?)
  • Data is a collection of facts, while information puts those facts into context.
  • Data is raw and unorganized, while information is organized.

From Data to Information: information is crucial
o Example 1: Covid-19
  • Given: coronavirus exposure described by 30 or more features
  • Problem: find the virus-strength and patient-immunity factors that may affect the patient for longer
  • Data: historical records of patients and outcomes
o Example 2: Inventing a new product
  • Given: market attraction described by 20 or more requirements
  • Problem: perform market analysis to identify new product bundles
  • Data: historical records of market behavior and manufacturing issues, and more accurate customer profiles
o Example 3: Medicine
  • Given: the patient’s information
  • Problem: provide an accurate diagnosis
  • Data: historical medical records, physical examinations, and treatment patterns

Society produces huge amounts of data
o Sources: social, business, science, medicine, economics, geography, environment, sports, …
o “All scientists are data scientists” - Monica Rogati, Senior Research Scientist @LinkedIn
o Search engines
  • Early search engines relied mainly on the keywords on a page and were subject to manipulation.
  • Google’s success is due to its algorithm, which relies mainly on links to the page.
  • Google founders Sergey Brin and Larry Page were students at Stanford doing research in databases and data mining in 1998, which led to Google.

o Growth trends
  • Moore’s law: computer speed doubles every 18 months
  • Storage law: total storage doubles every 9 months
o Consequence
  • Very little data will ever be looked at by a human
o Knowledge discovery is NEEDED to make sense of and use data.
  • Data mining may help scientists classify and segment data during hypothesis formation.
http://www.intelfreepress.com/news/3d-xpoint-memory-storage/9790/

Data Mining vs. Machine Learning
o DM finds/extracts information from data to provide a conclusion
  • Provides insights and information, or enables fast and accurate decision-making
  • Strong, accurate patterns are needed to make a decision
    Problem 1: most patterns are not interesting
    Problem 2: patterns may be inexact
    Problem 3: data may be garbled or missing
o ML identifies patterns (e.g., a model) in data and provides many tools for data mining
  • ML provides structural descriptions

Structural descriptions
o Patterns that are found may be represented as structural descriptions or as black-box models
o Example: if-then rules
  If tear production rate = reduced then recommendation = none
  Otherwise, if age = young and astigmatic = no then recommendation = soft

Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
Young           Myope                   No           Reduced               None
Young           Hypermetrope            No           Normal                Soft
Pre-presbyopic  Hypermetrope            No           Reduced               None
Presbyopic      Myope                   Yes          Normal                Hard
…               …                       …            …                     …

The weather problem
o Conditions for playing a certain game

Outlook   Temperature  Humidity  Windy  Play
Sunny     Hot          High      False  No
Sunny     Hot          High      True   No
Overcast  Hot          High      False  Yes
Rainy     Mild         Normal    False  Yes
…         …            …         …      …

If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes

Definitions of “learning” from the dictionary
o Difficult to measure:
  • To get knowledge of by study, experience, or being taught
  • To become aware by information or from observation
o Trivial for computers:
  • To commit to memory
  • To be informed of, ascertain; to receive instruction
o Operational definition: things learn when they change their behavior in a way that makes them perform better in the future.
  • Does a slipper learn?
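
The weather rules above read naturally as an ordered decision list: rules are tried from top to bottom and the first one that matches decides the outcome. The following is a minimal sketch of that reading, not a learned model; the function name `play_decision` and the lowercase string encoding of attribute values are assumptions made for illustration.

```python
def play_decision(outlook: str, humidity: str, windy: bool) -> str:
    """Apply the slide's if-then rules in order; the first match wins."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"  # default rule: none of the above


# The four rows shown in the weather table: (outlook, humidity, windy, play).
# Temperature is omitted because none of the rules tests it.
rows = [
    ("sunny", "high", False, "no"),
    ("sunny", "high", True, "no"),
    ("overcast", "high", False, "yes"),
    ("rainy", "normal", False, "yes"),
]
predictions = [play_decision(o, h, w) for o, h, w, _ in rows]
```

Because the rules are ordered, a rainy, windy day with normal humidity yields "no": the second rule fires before the humidity rule is reached.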
http://www.differencebetween.net/technology/difference-between-data-mining-and-machine-learning/

What is (not) Data Mining?
o What is not
  • Look up a phone number in a phone directory
  • Query a Web search engine for information about “Amazon”
  • Total amount of books sold in a day
  • Selection, interpretation
o What is
  • Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
  • Group together similar documents returned by a search engine according to their context (e.g., Amazon rainforest, Amazon.com)
  • What is sold more on a particular day than on other days
  • Clustering, analysis, characterization, discrimination

Data Mining: A KDD Process
o Learning the application domain: relevant prior knowledge and goals of the application
o Creating a target data set: data selection
o Data cleaning and preprocessing (may take 60% of effort!)
o Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
o Choosing functions of data mining: summarization, classification, regression, association, clustering
o Choosing the mining algorithm(s)
o Data mining: search for patterns of interest
o Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
o Understanding Data Mining

[Figure: KDD pipeline, built up stage by stage: Databases --> Data Cleaning / Data Integration --> Data Warehouse --> Data Selection / Data Preprocessing --> Task-relevant Data --> Data Mining --> Pattern Evaluation]

Data Mining Techniques
o Common data mining techniques or tasks
  • Classification [Predictive]
  • Clustering [Descriptive]
  • Association Rule Discovery [Descriptive]
  • Sequential Pattern Discovery [Descriptive]
  • Regression [Predictive]
  • Deviation Detection [Predictive]

Classification
o Databases to be mined
o Knowledge to be mined
o Techniques to be utilized
o Applications to be adapted

Classification: Definition
o A data mining task that assigns items in a collection to target categories or classes.
o Goal: to accurately predict the target class for each case in the data.
  • Previously unseen records should be assigned a class as accurately as possible.
o Training vs. test set
  • Training set: a subset used to train a model
  • Test set: a subset used to test the trained model and determine its accuracy
o Classifier
  • A model that assigns items in a collection to target classes
o Classification rule: predicts the value of a given attribute (the classification of an example)
  If outlook = sunny and humidity = high then play = no

Classification Example

Training set (used to discover potentially predictive relationships)
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test set (used to assess the strength and utility of a predictive relationship)
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Training Set --> Learn Classifier --> Model <-- Test Set

Classification: Application 1
o Direct marketing
  • Goal: reduce the cost of mailing by targeting a set of consumers likely to buy.
  • Approach:
    Use data from similar products
    The {buy, don’t buy} decision forms the target class
    Various demographic and lifestyle attributes
    Type of business, social status

Classification: Application 2
o Classifying galaxies
  • Class: stages of formation (Early, Intermediate, Late)
  • Attributes: image features, characteristics of light waves received, etc.
  • Data size: 72 million stars, 20 million galaxies; Object Catalog: 9 GB; Image Database: 150 GB

Clustering: Definition
o Find a “natural” grouping of instances given unlabeled data
o Find a similarity measure such that
  • Data points in one cluster are more similar to one another.
  • Data points in separate clusters are less similar to one another.
o Similarity measures
  • Euclidean distance
  • Other problem-specific measures
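
To make the Euclidean distance mentioned above concrete: for two points a and b with the same number of numeric attributes, d(a, b) = sqrt(sum_i (a_i - b_i)^2). The sketch below is illustrative only; the helper name `euclidean_distance` and the three sample points are hypothetical and do not come from the slides.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of attributes")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical 2-D data points: p1 and p2 lie close together, p3 is far away.
points = {"p1": (1.0, 2.0), "p2": (1.5, 1.8), "p3": (8.0, 8.0)}

d_near = euclidean_distance(points["p1"], points["p2"])
d_far = euclidean_distance(points["p1"], points["p3"])
```

A clustering method built on this measure would tend to place p1 and p2 in the same cluster, since `d_near` is much smaller than `d_far`. Attributes on very different scales usually need normalizing first, otherwise the largest-valued attribute dominates the distance.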
Clustering: Application 1
o Market segmentation
  • Goal: subdivide a market into distinct subsets of customers
  • Approach:
    Collect different attributes of customers: geographical, lifestyle, business
    Find clusters of similar customers
    Measure the clustering quality by observing buying patterns

Clustering: Application 2
o Document clustering
  • Goal: find groups of documents that are similar to each other
  • Approach: identify frequently occurring terms

Association Rule Discovery
o Given a set of records, each of which contains some number of items from a given collection
  • Produce dependency rules that predict the occurrence of an item based on the occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules discovered:
  {Milk} --> {Coke}
  {Diaper, Milk} --> {Beer}

Regression
o Predict the value of a given continuous-valued variable based on the values of other variables
  • Extensively studied in statistics and in the neural network field.
o Examples:
  • Predicting sales amounts
  • Predicting wind velocities
  • Time series prediction of stock market indices

Data Mining: Miscellaneous

Large-scale Endeavors
[Table: data mining products (SAS, SPSS, Oracle (Darwin), IBM, DBMiner (Simon Fraser), Weka, Colab) against the tasks they support (Clustering, Classification, Association, Sequence, Deviation); methods named in the cells include Decision Trees, ANN, and Time Series. Weka and Colab are annotated "especially for ML".]

Next
o Data Warehouse
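
Returning to the market-basket table in the Association Rule Discovery slide, one common way to judge rules such as {Milk} --> {Coke} is by their support and confidence. These two measures are standard in association rule mining but are not defined in the slides, so treat the sketch below as an illustrative assumption; the function names `support` and `confidence` are hypothetical.

```python
# The five transactions from the Association Rule Discovery table.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent. Assumes the antecedent occurs at least once."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


milk_coke_support = support({"Milk", "Coke"}, transactions)
milk_coke_confidence = confidence({"Milk"}, {"Coke"}, transactions)
diaper_milk_beer_confidence = confidence({"Diaper", "Milk"}, {"Beer"}, transactions)
```

On this table, Milk appears in four of the five transactions and Coke accompanies it in three of them, so {Milk} --> {Coke} has support 3/5 and confidence 3/4. {Diaper, Milk} --> {Beer} holds in two of the three transactions that contain both Diaper and Milk.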