Descriptive Exploratory Data Analysis
9/6/2007
Jagdish S. Gangolly
State University of New York at Albany

Data Manipulation
– Matrices: bind rows (rbind), bind columns (cbind)
– Arrays: rowMeans, colMeans, rowSums, colSums, rowVars, colVars, …
– apply(data, dim, function, …): apply a function over the given dimension (margin) of an array or data frame
– attach(framename): permits you to refer to a frame's variables without cumbersome frame$variable notation. You can detach the frame when done.
– function(x) { function definition }: to define your own functions
– rm(comma-separated list of S-Plus objects): to remove objects

Trellis Graphics I
• A matrix of graphs
Example:
> par(mfrow=c(2,2))    # 2 x 2 matrix of figures
> x <- 1:100/100:1
> plot(x)              # cell (1,1): scatter plot
> plot(x, type="l")    # cell (1,2): line plot
> hist(x)              # cell (2,1): histogram
> boxplot(x)           # cell (2,2): boxplot

Trellis Graphics II
Syntax: dependent variable ~ explanatory variable | conditioning variable, data = data set
Output:
> trellis.device(motif)   # open a Trellis graphics device
> dev.off()               # or graphics.off(), to close the device(s) when done

Trellis Graphics III
Example: histogram(~ height | voice.part, data = singer)
– No dependent variable for a histogram
– height is the explanatory variable
– voice.part is the conditioning variable
– The data set is singer

Trellis Graphics IV
• Layout: the layout, skip, and aspect parameters (p. 147)
• Ordering of graphs: left to right, bottom to top. If as.table=T, left to right, top to bottom (p. 149)

Data Mining
• What is data mining?
• Data mining primitives
– Task-relevant data
– Kinds of knowledge to be mined
– Background knowledge
– Interestingness measures
– Visualisation of discovered patterns
• Query language

Data Mining (contd.)
• Concept description (descriptive data mining)
– Data generalisation
• Data cube (OLAP) approach (offline pre-computation)
• Attribute-oriented induction approach (online aggregation)
– Presentation of the generalisation
– Descriptive statistical measures and displays

What is Data Mining?
• Discovery of knowledge from databases
– A set of data mining primitives to facilitate such discovery (what data, what kinds of knowledge, what measures are to be evaluated, how the knowledge is to be visualised)
– A query language for the user to interactively visualise the knowledge mined

Data Mining Primitives I
• Task-relevant data: the attributes relevant for the study of the problem at hand
• Kinds of knowledge to be mined: characterisation, discrimination, association, classification, clustering, evolution, …
• Background knowledge: knowledge about the domain of the problem (concept hierarchies, beliefs about relationships, expected patterns in the data, …)

Data Mining Primitives II
• Interestingness measures: support measures (prevalence of the rule pattern) and confidence measures (strength of the implication of the rule)
• Visualisation of discovered patterns: rules, tables, charts, graphs, decision trees, cubes, …
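To make support and confidence concrete, here is a minimal S-Plus/R-style sketch. The data frame trans and its logical columns A and B (flagging whether each tuple contains item A or item B) are hypothetical, invented only for illustration; the two ratios are the quantities defined under Interestingness Measures I and II below.

> # hypothetical transaction data: one row per tuple, logical flags for items A and B
> trans <- data.frame(A=c(T,T,F,T,F,T), B=c(T,F,F,T,F,T))
> sup  <- sum(trans$A & trans$B) / nrow(trans)    # support: fraction of all tuples containing both A and B
> conf <- sum(trans$A & trans$B) / sum(trans$A)   # confidence: fraction of tuples with A that also contain B
> sup     # 0.5
> conf    # 0.75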
Task-relevant Data
Steps:
• Derivation of the initial relation through database queries (data retrieval operations), i.e., obtaining a minable view
• Data cleaning & transformation of the initial relation to facilitate mining
• Data mining

Kinds of Knowledge to be Mined
• Kinds of knowledge & templates (metapatterns, meta-rules, meta-queries)
– Association. An example: age(X: customer, W) ∧ income(X, Y) ⇒ buys(X, Z)
– Classification
– Discrimination
– Clustering
– Evolution analysis

Background Knowledge
• Knowledge from the problem domain, usually in the form of
– concept hierarchies (rolling up or drilling down)
– schema hierarchies (lattices)
– set-grouping hierarchies (successive sub-grouping of attribute values)
– rule-based hierarchies

Interestingness Measures I
• Simplicity: the more complex the structure, the more difficult it is to interpret, and so the less interesting it is likely to be (rule length, …)
• Certainty: validity, trustworthiness
confidence(A ⇒ B) = (# tuples containing both A and B) / (# tuples containing A)
Sometimes called the "certainty factor".

Interestingness Measures II
• Utility: support is the percentage of task-relevant data tuples for which the pattern is true
support(A ⇒ B) = (# tuples containing both A and B) / (total # tuples)

Visualisation of Discovered Patterns
• Hierarchies
• Tables
• Pie/bar charts
• Dot/box plots
• …

Descriptive Data Mining (Concept Description & Characterisation)
• Concept description: a description of the data generalised at multiple levels of abstraction
• Concept characterisation: a concise and succinct summarisation of a given collection of data
• Concept comparison: discrimination

Data Generalisation
• Abstraction of task-relevant data at a high conceptual level from a database containing data at relatively low conceptual levels
– Data cube (OLAP) approach (offline pre-computation) (Figs 2.1 & 2.2, pp. 46-47)
– Attribute-oriented induction approach (online aggregation)
• Presentation of the generalisation (Tables 5.3 & 5.4 on p. 191, and Figs 5.2, 5.3, & 5.4 on pp. 192-193)

Descriptive Statistical Measures and Displays I
• Measures of central tendency
– Mean, weighted mean (with weights signifying importance or frequency of occurrence)
– Median
– Mode
• Measures of dispersion
– Quartiles, outliers, boxplots

Descriptive Statistical Measures and Displays II
• Displays
– Histograms (Fig 5.6, p. 214)
– Bar charts
– Quantile plot (Fig 5.7, p. 215)
– Quantile-quantile plot (Fig 5.8, p. 216)
– Scatter plot (Fig 5.9, p. 216)
– Loess curve (Fig 5.10, p. 217)

Descriptive Data Exploration
• summary: mean, median, quartiles (p. 171)
• stem: stem-and-leaf display (p. 171)
• quantile (p. 172)
• stdev (p. 173)
• tapply: splits data by a grouping variable (p. 174)
• by (p. 175)
• mean works on a vector; other structures need to be converted to vectors before computing means (example on pp. 176-177)
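As a quick illustration of these exploration functions, here is a minimal sketch using the singer data frame from the Trellis example above (heights of choir members by voice part; it ships with S-Plus and with R's lattice package). stdev is the S-Plus name; R calls the same function sd.

> attach(singer)                     # refer to height and voice.part directly
> summary(height)                    # mean, median, quartiles
> quantile(height, c(.1, .5, .9))    # selected quantiles
> stdev(height)                      # standard deviation (S-Plus; use sd(height) in R)
> stem(height)                       # stem-and-leaf display
> tapply(height, voice.part, mean)   # mean height split by voice part
> detach("singer")                   # detach the frame when done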
Data Preprocessing for Data Mining I
• Why preprocess?
– Incomplete data
• attribute values not available, equipment malfunctions, values not considered important at collection time
– Noisy data (errors)
• instrument problems, human or computer errors, transmission errors
– Inconsistent data
• inconsistencies due to differing data definitions

Data Preprocessing for Data Mining II
• Data cleaning (a minimal code sketch of mean imputation and bin smoothing appears at the end of this section)
– Missing values:
• ignore the tuple, fill in values manually, use a global constant ("unknown"), replace the missing value with the attribute mean, with the attribute mean for the tuple's group (class), or with the most probable value
– Noisy data:
• Binning: partitioning into equi-sized bins, then smoothing by bin means or bin boundaries
• Clustering
• Inspection: combined computer & human inspection
• Regression
– Inconsistencies

Data Preprocessing for Data Mining III
• Data integration: combining data from different sources into a coherent whole
– Schema integration: combining data models (entity identification problems)
– Redundancy (derived values, calculated fields, use of different key attributes): use of correlations to detect redundancies
– Resolution of data value conflicts (values coded in different units or scales)

Data Preprocessing for Data Mining IV
• Transformation
– Smoothing
– Aggregation
– Generalisation
– Normalisation
– Attribute (or feature) construction

Data Preprocessing for Data Mining V
• Data reduction & compression
– Data cube aggregation (p. 117)
– Dimension reduction: minimise the loss of information
• Attribute selection
• Decision tree induction
• Principal components analysis
– Numerosity reduction
• Regression and log-linear models
• Histograms
• Clustering
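As noted under Data Cleaning above, here is a minimal S-Plus/R-style sketch of two of the cleaning steps: filling a missing value with the attribute mean, and smoothing by bin means. The vector x is a toy attribute invented for illustration, and equal-width bins are used for simplicity (equi-depth binning would partition on the ranks of the values instead).

> x <- c(4, 8, 9, NA, 21, 21, 24, 25, 26, 34)           # toy attribute with a missing value
> x[is.na(x)] <- mean(x, na.rm=T)                       # missing value = attribute mean
> bins <- cut(x, 3)                                     # partition the range into three equal-width bins
> tapply(x, bins, mean)                                 # the mean of each bin
> smoothed <- tapply(x, bins, mean)[as.integer(bins)]   # smoothing by bin means: replace each value by its bin's mean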