Data Mining

Chris Williams
Institute for Adaptive and Neural Computation
Division of Informatics, University of Edinburgh, UK
24 October 2002

Overview

• Key idea: Finding Structure in Data
• The Data Mining Process
  – Predictive Modelling
  – Descriptive Modelling
• Scientific Data Mining
• Probabilistic Modelling

What is data mining?

“Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.”
  Hand, Mannila, Smyth

“[Data mining is the] extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases.”
  Han

“We are drowning in data, but starving for knowledge!”
  Han

The Data Mining Process

[Figure from Han and Kamber: Databases, Flat files → Cleaning and Integration → Data warehouse → Selection and Transformation → Data Mining → Patterns → Evaluation and Knowledge Presentation]

Data Mining Tasks

• Exploratory Data Analysis
• Descriptive Modelling
  – Probability density estimation
  – Cluster analysis/segmentation
• Predictive Modelling: Classification and Regression
• Discovering Patterns
  – Association rules
  – Outlier detection
• Mining Complex Types of Data
  – Retrieval by Content (RBC) for text, images
  – Time series and sequence data
  – Spatial data
  – Text mining
  – Mining the WWW (content, structure, usage)

Models and Patterns

• A model structure is a global summary of the data set.
  Example: linear regression, which makes a prediction for all input values
• Pattern structures make statements only about restricted regions of the space spanned by the variables.
  Example, with an equivalent form: [formulas not recoverable from the source]
  Example: detection of outliers

Definition from Hand, Mannila, Smyth (2001)

Predictive Modelling

• Learning from input-output pairs
• Predict output(s) given inputs
• Supervised learning is always inductive (from the specific to the general). Need inductive bias
• Learning as search: usually we have a class of functions, and wish to find suitable candidates within the class
• Classification and regression problems
• Key issue is generalization, the ability to make predictions for new inputs
• Example methods
  – Neural Networks
  – Decision Trees
  – Nearest-neighbour methods
  – Support Vector Machines
• Examples
  – SKICAT (JPL/Caltech): predict if an astronomical object is a star or a galaxy
  – Predicting disease presence/absence based on gene expression data
  – PapNet: searching for abnormal cells in Pap smears

Descriptive Modelling

• Task is to discover significant patterns or features in the input data, without a teacher; self-organization
• No external teacher or critic, but often an internal quality measure is optimized
• Examples
  – Clustering
  – Dimensionality reduction/fitting a lower-dimensional manifold

Some lower-dimensional structures in a higher-dimensional space, e.g.

• Cluster centres (points in 0-d)
  [Figure: scatter of data points with cluster centres marked X]
• Lower-dimensional manifolds, e.g. lines, sheets (1-d, 2-d)
  [Figure: scatter of data points lying near a lower-dimensional manifold]

Computational Environments

• Data mining is greatly facilitated by having a computational environment in which many operations can be carried out, pipelined, evaluated, visualized, etc.
• Some data mining examples: WEKA, Clementine
• Some general scientific computational environments, e.g. MATLAB, IDL

Data Mining: Relationships to Other Fields

• Statistics
• Machine Learning
• Database technology
• Visualization
• ...

Relationship of Machine Learning to Data Mining

• Machine Learning is concerned with making computers that learn things for themselves
• Data mining is more concerned with enabling humans to learn from data

Scientific Data Mining (SDM)

• To what extent do we need to incorporate domain knowledge in SDM?
• How useful are general-purpose data mining tools for SDM?
• Sometimes standard data mining techniques will be enough, e.g. SKICAT, Clustering of Regimes in the Earth’s Upper Atmosphere
• Domain knowledge can be introduced by the choice and construction of appropriate features, if these are known
• However, many standard machine learning methods encode only vague notions of prior knowledge

Clustering of Regimes in the Earth’s Upper Atmosphere

• Smyth, Ghil, Ide (1998), see [link missing in source]
• Observations of geopotential height on a spatial grid of 500 points, recorded twice daily since 1948
• Reduced using PCA, then clustered
• Showed good agreement with three well-known maps in atmospheric science

Probabilistic Modelling

• A tool for modelling (possibly) complex networks of relationships which are non-deterministic
• Use of probability theory as a calculus of uncertainty
• Graphical descriptions are used to define (in)dependence
• A directed graphical structure. Joint probability is defined as
    P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{pa}(x_i))
  where pa(x_i) denotes the parents of x_i in the graph
• Probabilistic expert systems (can deal comfortably with exceptions), cf. rule-based systems
• Probabilistic expert systems for medical diagnosis (e.g. QMR-DT, MUNIN)
• Can learn probability models from data
• A framework for dealing with hidden-cause (latent variable) models
• Association for Uncertainty in AI

Example 1: Does my car start?

[Figure, Heckerman (1995): Battery (b) → Gauge (g) and Turn Over (t); Fuel (f) → Gauge (g) and Start (s); Turn Over (t) → Start (s)]

P(b=bad) = 0.02
P(f=empty) = 0.05

P(g=empty | b=good, f=not empty) = 0.04
P(g=empty | b=good, f=empty) = 0.97
P(g=empty | b=bad, f=not empty) = 0.10
P(g=empty | b=bad, f=empty) = 0.99

P(t=no | b=good) = 0.03
P(t=no | b=bad) = 0.98

P(s=no | t=yes, f=not empty) = 0.01
P(s=no | t=yes, f=empty) = 0.92
P(s=no | t=no, f=not empty) = 1.0
P(s=no | t=no, f=empty) = 1.0

Hidden Markov Models

[Figure: chain of hidden states q0 → q1 → q2 → q3 → ... → qT, with initial distribution π0 and transition matrix A on each arrow; each state qt emits an observation yt]

• Models for sequences (biological, temporal)
• Examples in Scientific Data Mining
  – Hidden Markov Models for gene finding in DNA (profile HMMs)
  – Condition monitoring in antenna control systems for NASA (Smyth)

Summary

• “Standard view” of data mining, including visualization, predictive and descriptive modelling
• Scientific data mining: how much do we need to go beyond the “standard view”?
• Belief networks as one way of modelling complex networks of non-deterministic relationships
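
Worked sketch: models versus patterns

The Models and Patterns slide contrasts a global model (linear regression, which predicts for every input) with a local pattern (detection of outliers). The Python sketch below illustrates that contrast on a small made-up data set. The data, the function names `fit_line` and `find_outliers`, and the two-standard-deviation threshold are all assumptions chosen for illustration; they are not taken from the slides or from Hand, Mannila, Smyth.

```python
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (a global model)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def find_outliers(xs, ys, slope, intercept, threshold=2.0):
    """Indices whose residual exceeds `threshold` residual standard deviations.

    This is a statement about a few points only (a local pattern).
    The threshold is an illustrative assumption.
    """
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    scale = pstdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r) > threshold * scale]


# Toy data (hypothetical): y = 2x + 1, with the point at index 5 shifted by +15.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[5] += 15

slope, intercept = fit_line(xs, ys)
outliers = find_outliers(xs, ys, slope, intercept)

# The model gives a prediction for any input, including unseen ones.
prediction_at_100 = slope * 100 + intercept

print("slope, intercept:", slope, intercept)
print("outlier indices:", outliers)
print("prediction at x = 100:", prediction_at_100)
```

The fitted line summarizes the whole data set, while the outlier list says something only about the restricted region where the data depart from that summary.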
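
Worked sketch: nearest-neighbour prediction

The Predictive Modelling slide describes learning from input-output pairs and lists nearest-neighbour methods among the example methods. As a minimal sketch of that idea, the Python code below classifies a new input by majority vote among its k closest training examples. The two-dimensional feature values are invented, the labels merely echo the SKICAT star/galaxy task, and `knn_predict` is a hypothetical helper name; none of this reproduces SKICAT itself.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict a label for `query` by majority vote of its k nearest neighbours.

    `train` is a list of (feature_tuple, label) input-output pairs.
    """
    neighbours = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Toy training set (hypothetical features, illustrative labels).
train = [
    ((1.0, 1.0), "star"),
    ((1.2, 0.8), "star"),
    ((0.9, 1.1), "star"),
    ((3.0, 3.2), "galaxy"),
    ((3.1, 2.9), "galaxy"),
    ((2.8, 3.0), "galaxy"),
]

# Generalization: predictions for inputs that are not in the training set.
label_a = knn_predict(train, (1.1, 0.9))
label_b = knn_predict(train, (3.0, 3.0))
print(label_a, label_b)
```

The inductive bias here is the assumption that nearby inputs share a label; a different distance measure or value of k encodes a different bias.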
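
Worked sketch: inference in the car-start network

The conditional probabilities given in Example 1 (Does my car start?) are enough to answer queries by brute-force enumeration: multiply the five local tables to get the joint probability, then sum over the variables that are not of interest. The Python sketch below does this, assuming the dependency structure implied by the tables (gauge depends on battery and fuel, turn-over on battery, start on turn-over and fuel). The function names and the value label "not empty" for the gauge are assumptions introduced here; enumeration is only one possible inference method and is practical only for very small networks.

```python
from itertools import product

# Probability tables copied from Example 1 (Heckerman, 1995).
P_B_BAD = 0.02
P_F_EMPTY = 0.05
P_G_EMPTY = {  # keyed by (battery, fuel)
    ("good", "not empty"): 0.04,
    ("good", "empty"): 0.97,
    ("bad", "not empty"): 0.10,
    ("bad", "empty"): 0.99,
}
P_T_NO = {"good": 0.03, "bad": 0.98}  # keyed by battery
P_S_NO = {  # keyed by (turn over, fuel)
    ("yes", "not empty"): 0.01,
    ("yes", "empty"): 0.92,
    ("no", "not empty"): 1.0,
    ("no", "empty"): 1.0,
}

# Value names are assumptions; the slide only names one value per variable.
DOMAINS = {
    "b": ("good", "bad"),
    "f": ("not empty", "empty"),
    "g": ("not empty", "empty"),
    "t": ("yes", "no"),
    "s": ("yes", "no"),
}


def pick(p_event, happened):
    """Probability of a binary outcome given the probability of the event."""
    return p_event if happened else 1.0 - p_event


def joint(b, f, g, t, s):
    """Joint probability as the product of each variable given its parents."""
    return (
        pick(P_B_BAD, b == "bad")
        * pick(P_F_EMPTY, f == "empty")
        * pick(P_G_EMPTY[(b, f)], g == "empty")
        * pick(P_T_NO[b], t == "no")
        * pick(P_S_NO[(t, f)], s == "no")
    )


def probability(**evidence):
    """Sum the joint over all assignments consistent with the evidence."""
    names = list(DOMAINS)
    total = 0.0
    for values in product(*DOMAINS.values()):
        assignment = dict(zip(names, values))
        if all(assignment[name] == value for name, value in evidence.items()):
            total += joint(**assignment)
    return total


def posterior(query, evidence):
    """P(query | evidence) by the definition of conditional probability."""
    return probability(**query, **evidence) / probability(**evidence)


p_no_start = probability(s="no")
p_empty_given_no_start = posterior({"f": "empty"}, {"s": "no"})
p_bad_given_no_start = posterior({"b": "bad"}, {"s": "no"})

print("P(s=no) =", round(p_no_start, 4))
print("P(f=empty | s=no) =", round(p_empty_given_no_start, 4))
print("P(b=bad | s=no) =", round(p_bad_given_no_start, 4))
```

Observing that the car does not start raises the probability of an empty tank from the prior 0.05 to roughly 0.45, which shows how the network reasons from effects back to causes.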
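
Worked equation: joint probability of a hidden Markov model

The Hidden Markov Models figure shows a chain of hidden states q_0, ..., q_T with initial distribution π_0 and transition matrix A, each state emitting an observation y_t. Read as a directed graphical model, each state has the previous state as its only parent and each observation has its own state as its only parent, so the joint probability factorizes as sketched below. The emission notation P(y_t | q_t) and the indexing of A by (current state, next state) are assumptions made here, since the slide does not spell them out.

```latex
P(q_0, \ldots, q_T,\; y_0, \ldots, y_T)
  = \pi_0(q_0)\,
    \prod_{t=0}^{T-1} A(q_t, q_{t+1})\,
    \prod_{t=0}^{T} P(y_t \mid q_t)
```

This is the same parent-based factorization used for the car-start network, specialized to a chain.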