DATA MINING TECHNIQUES
Introductory and Advanced Topics
Eamonn Keogh
(some slides adapted from) Margaret Dunham
Dr. M.H. Dunham, Data Mining, Introductory and Advanced Topics, Prentice Hall, 2002.
http://iubio.bio.indiana.edu/treeapp/treeprint-sample1.html
© Prentice Hall

Data Mining Outline
– Introduction
– Related Concepts
– Data Mining Techniques

Introduction Outline
Goal: Provide an overview of data mining.
Define data mining
Data mining vs. databases
Basic data mining tasks
Data mining issues

Introduction
Data is growing at a phenomenal rate (read “How Much Information Is There In the World?” by Michael Lesk)
Users expect more sophisticated information
How?
UNCOVER HIDDEN INFORMATION
DATA MINING

Data Mining Definition
Finding hidden information in a database
Data mining has been defined as “the nontrivial extraction of implicit, previously unknown, and potentially useful information from data”.
Similar terms
– Exploratory data analysis
– Data driven discovery
– Deductive learning
– Discovery Science
– Knowledge Discovery

Database Processing vs. Data Mining Processing
Database processing
  Query
  – Well defined
  – SQL
  Output
  – Subset of database
Data mining processing
  Query
  – Poorly defined
  – No precise query language
  Output
  – Not a subset of database

Query Examples
Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more than $10,000 in the last month.
– Find all customers who have purchased milk.
Data Mining
– Find all credit applicants who are poor credit risks. (classification)
– Identify customers with similar buying habits. (clustering)
– Find all items which are frequently purchased with milk.
(association rules)

Data Mining Models and Tasks

Basic Data Mining Tasks I
Classification maps data into predefined groups or classes.
– Supervised learning
– Pattern recognition
– Prediction
Regression maps a data item to a real-valued prediction variable.
– Example: H = 1.31 (Fem + Fib) + 63.05
Clustering groups similar data together into clusters.
– Unsupervised learning
– Segmentation
– Partitioning

Basic Data Mining Tasks II
Summarization maps data into subsets with associated simple descriptions.
– Characterization
– Generalization
Link Analysis uncovers relationships among data.
– Affinity Analysis
– Association Rules
– Sequential Analysis determines sequential patterns.

KDD Process
Modified from [FPSS96C]
Selection: Obtain data from various sources.
Preprocessing: Cleanse data.
Transformation: Convert to common format. Transform to new format.
Data Mining: Obtain desired results.
Interpretation/Evaluation: Present results to user in meaningful manner.
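
The five KDD steps can be read as a chain of functions, each consuming the output of the previous one. The sketch below is only an illustration under stated assumptions: the two toy sources, the field names, and the "above-average spender" rule are all hypothetical and do not come from the slides or from [FPSS96C].

```python
from statistics import mean

# Hypothetical toy records standing in for two data sources (assumption).
SOURCE_A = [{"customer": "a1", "spend": "120.5"}, {"customer": "a2", "spend": None}]
SOURCE_B = [{"customer": "b1", "spend": "80"}, {"customer": "b2", "spend": "300"}]


def select(*sources):
    """Selection: gather records from several sources into one list."""
    return [record for source in sources for record in source]


def preprocess(records):
    """Preprocessing: cleanse by dropping records with a missing value."""
    return [r for r in records if r["spend"] is not None]


def transform(records):
    """Transformation: convert every spend field to a common format (float)."""
    return [{"customer": r["customer"], "spend": float(r["spend"])} for r in records]


def mine(records):
    """Data mining: a deliberately trivial rule that flags above-average spenders."""
    threshold = mean(r["spend"] for r in records)
    return [r["customer"] for r in records if r["spend"] > threshold]


def interpret(customers):
    """Interpretation/evaluation: present the result in a readable form."""
    return "Above-average spenders: " + ", ".join(customers)


def kdd(*sources):
    """Run the five steps in order."""
    return interpret(mine(transform(preprocess(select(*sources)))))


print(kdd(SOURCE_A, SOURCE_B))  # Above-average spenders: b2
```

In a real project the mining step would be a proper classification, clustering, or association-rule algorithm; the point here is only the order of the stages and the fact that each stage hands a cleaner representation to the next.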
KDD Process Ex: Shuttle Data
Selection:
– Select data (which missions, etc.) to use
Preprocessing:
– Remove spikes
Transformation:
– DFT, DWT, PAA, etc.
[Figure: two time-series plots, x-axis 0 to 1000, y-axis 0 to 1]
Data Mining:
– Look for rules…
Interpretation/Evaluation:
– Show rules to domain experts
Potential User Applications:
– Prediction of failures

Data Mining Development
• Relational Data Model
• SQL
• Association Rule Algorithms
• Data Warehousing
• Scalability Techniques
• Similarity Measures
• Hierarchical Clustering
• IR Systems
• Imprecise Queries
• Textual Data
• Web Search Engines
• Bayes Theorem
• Regression Analysis
• EM Algorithm
• K-Means Clustering
• Time Series Analysis
• Algorithm Design Techniques
• Algorithm Analysis
• Data Structures
• Neural Networks
• Decision Tree Algorithms

KDD Issues
Human Interaction
Overfitting
Outliers
Interpretation
Visualization
Large Datasets
High Dimensionality

KDD Issues (cont’d)
Multimedia Data
Missing Data
Irrelevant Data
Noisy Data
Changing Data (streams)
Integration
Application

Social Implications of DM
Privacy
Profiling
Unauthorized use

Data Mining Metrics
Usefulness
Return on Investment (ROI)
Accuracy
Space/Time Complexity
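
Two of the metrics listed above reduce to simple arithmetic. The sketch below uses the common textbook definitions, which the slides do not spell out, so treat them as assumptions: accuracy as the fraction of correct predictions, and ROI as net gain divided by cost. The labels and dollar figures are made up for illustration.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def roi(gain, cost):
    """Return on investment: net gain divided by cost."""
    if cost == 0:
        raise ValueError("cost must be non-zero")
    return (gain - cost) / cost


# Hypothetical credit-risk predictions versus the true outcomes.
predicted = ["good", "poor", "good", "good"]
actual = ["good", "poor", "poor", "good"]

print(accuracy(predicted, actual))      # 0.75
print(roi(gain=15000.0, cost=10000.0))  # 0.5
```

Usefulness and space/time complexity are harder to reduce to a single formula and are left out of the sketch.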