Descriptive Exploratory Data Analysis
9/6/2007
Jagdish S. Gangolly
State University of New York at Albany

Data Manipulation
– Matrices: bind rows (rbind), bind columns (cbind)
– Arrays: rowMeans, colMeans, rowSums, colSums, rowVars, colVars, …
– apply(data, dim, function, …): apply a function over the given dimension (margin) of an array or data frame
– attach(framename): permits you to refer to a frame's variables without cumbersome frame$variable notation. You can detach the frame when done.
– function(x) { function definition }: to define your own functions
– rm(comma-separated list of S-Plus objects): to remove objects

Trellis Graphics I
• A matrix of graphs
Example:
> par(mfrow=c(2,2))    # 2 x 2 matrix of figures
> x <- 1:100/100:1
> plot(x)              # cell (1,1): scatter plot
> plot(x, type="l")    # cell (1,2): line plot
> hist(x)              # cell (2,1): histogram
> boxplot(x)           # cell (2,2): boxplot

Trellis Graphics II
Syntax: dependent variable ~ explanatory variable | conditioning variable, data = data set
Output:
> trellis.device(motif)   # open a Trellis graphics device
> dev.off()               # or graphics.off(), to close the device(s) when done

Trellis Graphics III
Example: histogram(~ height | voice.part, data = singer)
– No dependent variable for a histogram
– height is the explanatory variable
– voice.part is the conditioning variable
– The data set is singer

Trellis Graphics IV
• Layout: the layout, skip, and aspect parameters (p. 147)
• Ordering of graphs: left to right, bottom to top. If as.table=T, left to right, top to bottom (p. 149)

Data Mining
• What is data mining?
• Data mining primitives
– Task-relevant data
– Kinds of knowledge to be mined
– Background knowledge
– Interestingness measures
– Visualisation of discovered patterns
• Query language

Data Mining (contd.)
• Concept description (descriptive data mining)
– Data generalisation
• Data cube (OLAP) approach (offline pre-computation)
• Attribute-oriented induction approach (online aggregation)
– Presentation of the generalisation
– Descriptive statistical measures and displays

What is Data Mining?
• Discovery of knowledge from databases
– A set of data mining primitives to facilitate such discovery (what data, what kinds of knowledge, what measures are to be evaluated, how the knowledge is to be visualised)
– A query language for the user to interactively visualise the knowledge mined

Data Mining Primitives I
• Task-relevant data: the attributes relevant for the study of the problem at hand
• Kinds of knowledge to be mined: characterisation, discrimination, association, classification, clustering, evolution, …
• Background knowledge: knowledge about the domain of the problem (concept hierarchies, beliefs about relationships, expected patterns in the data, …)

Data Mining Primitives II
• Interestingness measures: support measures (prevalence of the rule pattern) and confidence measures (strength of the implication of the rule)
• Visualisation of discovered patterns: rules, tables, charts, graphs, decision trees, cubes, …
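To make support and confidence concrete, here is a minimal S-Plus/R-style sketch. The data frame trans and its logical columns A and B (flagging whether each tuple contains item A or item B) are hypothetical, invented only for illustration; the two ratios are the quantities defined under Interestingness Measures I and II below.

> # hypothetical transaction data: one row per tuple, logical flags for items A and B
> trans <- data.frame(A=c(T,T,F,T,F,T), B=c(T,F,F,T,F,T))
> sup  <- sum(trans$A & trans$B) / nrow(trans)    # support: fraction of all tuples containing both A and B
> conf <- sum(trans$A & trans$B) / sum(trans$A)   # confidence: fraction of tuples with A that also contain B
> sup     # 0.5
> conf    # 0.75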
Task-relevant Data
Steps:
• Derivation of the initial relation through database queries (data retrieval operations), i.e., obtaining a minable view
• Data cleaning & transformation of the initial relation to facilitate mining
• Data mining

Kinds of Knowledge to be Mined
• Kinds of knowledge & templates (metapatterns, meta-rules, meta-queries)
– Association. An example: age(X: customer, W) ∧ income(X, Y) ⇒ buys(X, Z)
– Classification
– Discrimination
– Clustering
– Evolution analysis

Background Knowledge
• Knowledge from the problem domain, usually in the form of
– concept hierarchies (rolling up or drilling down)
– schema hierarchies (lattices)
– set-grouping hierarchies (successive sub-grouping of attribute values)
– rule-based hierarchies

Interestingness Measures I
• Simplicity: the more complex the structure, the more difficult it is to interpret, and so the less interesting it is likely to be (rule length, …)
• Certainty: validity, trustworthiness
confidence(A ⇒ B) = (# tuples containing both A and B) / (# tuples containing A)
Sometimes called the "certainty factor".

Interestingness Measures II
• Utility: support is the percentage of task-relevant data tuples for which the pattern is true
support(A ⇒ B) = (# tuples containing both A and B) / (total # tuples)

Visualisation of Discovered Patterns
• Hierarchies
• Tables
• Pie/bar charts
• Dot/box plots
• …

Descriptive Data Mining (Concept Description & Characterisation)
• Concept description: a description of the data generalised at multiple levels of abstraction
• Concept characterisation: a concise and succinct summarisation of a given collection of data
• Concept comparison: discrimination

Data Generalisation
• Abstraction of task-relevant data at a high conceptual level from a database containing data at relatively low conceptual levels
– Data cube (OLAP) approach (offline pre-computation) (Figs 2.1 & 2.2, pp. 46-47)
– Attribute-oriented induction approach (online aggregation)
• Presentation of the generalisation (Tables 5.3 & 5.4 on p. 191, and Figs 5.2, 5.3, & 5.4 on pp. 192-193)

Descriptive Statistical Measures and Displays I
• Measures of central tendency
– Mean, weighted mean (with weights signifying importance or frequency of occurrence)
– Median
– Mode
• Measures of dispersion
– Quartiles, outliers, boxplots

Descriptive Statistical Measures and Displays II
• Displays
– Histograms (Fig 5.6, p. 214)
– Bar charts
– Quantile plot (Fig 5.7, p. 215)
– Quantile-quantile plot (Fig 5.8, p. 216)
– Scatter plot (Fig 5.9, p. 216)
– Loess curve (Fig 5.10, p. 217)

Descriptive Data Exploration
• summary: mean, median, quartiles (p. 171)
• stem: stem-and-leaf display (p. 171)
• quantile (p. 172)
• stdev (p. 173)
• tapply: splits data by a grouping variable (p. 174)
• by (p. 175)
• mean works on a vector; other structures need to be converted to vectors before computing means (example on pp. 176-177)
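As a quick illustration of these exploration functions, here is a minimal sketch using the singer data frame from the Trellis example above (heights of choir members by voice part; it ships with S-Plus and with R's lattice package). stdev is the S-Plus name; R calls the same function sd.

> attach(singer)                     # refer to height and voice.part directly
> summary(height)                    # mean, median, quartiles
> quantile(height, c(.1, .5, .9))    # selected quantiles
> stdev(height)                      # standard deviation (S-Plus; use sd(height) in R)
> stem(height)                       # stem-and-leaf display
> tapply(height, voice.part, mean)   # mean height split by voice part
> detach("singer")                   # detach the frame when done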
Data Preprocessing for Data Mining I
• Why preprocess?
– Incomplete data
• attribute values not available, equipment malfunctions, values not considered important at collection time
– Noisy data (errors)
• instrument problems, human or computer errors, transmission errors
– Inconsistent data
• inconsistencies due to differing data definitions

Data Preprocessing for Data Mining II
• Data cleaning (a minimal code sketch of mean imputation and bin smoothing appears at the end of this section)
– Missing values:
• ignore the tuple, fill in values manually, use a global constant ("unknown"), replace the missing value with the attribute mean, with the attribute mean for the tuple's group (class), or with the most probable value
– Noisy data:
• Binning: partitioning into equi-sized bins, then smoothing by bin means or bin boundaries
• Clustering
• Inspection: combined computer & human inspection
• Regression
– Inconsistencies

Data Preprocessing for Data Mining III
• Data integration: combining data from different sources into a coherent whole
– Schema integration: combining data models (entity identification problems)
– Redundancy (derived values, calculated fields, use of different key attributes): use of correlations to detect redundancies
– Resolution of data value conflicts (values coded in different units or scales)

Data Preprocessing for Data Mining IV
• Transformation
– Smoothing
– Aggregation
– Generalisation
– Normalisation
– Attribute (or feature) construction

Data Preprocessing for Data Mining V
• Data reduction & compression
– Data cube aggregation (p. 117)
– Dimension reduction: minimise the loss of information
• Attribute selection
• Decision tree induction
• Principal components analysis
– Numerosity reduction
• Regression and log-linear models
• Histograms
• Clustering
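As noted under Data Cleaning above, here is a minimal S-Plus/R-style sketch of two of the cleaning steps: filling a missing value with the attribute mean, and smoothing by bin means. The vector x is a toy attribute invented for illustration, and equal-width bins are used for simplicity (equi-depth binning would partition on the ranks of the values instead).

> x <- c(4, 8, 9, NA, 21, 21, 24, 25, 26, 34)           # toy attribute with a missing value
> x[is.na(x)] <- mean(x, na.rm=T)                       # missing value = attribute mean
> bins <- cut(x, 3)                                     # partition the range into three equal-width bins
> tapply(x, bins, mean)                                 # the mean of each bin
> smoothed <- tapply(x, bins, mean)[as.integer(bins)]   # smoothing by bin means: replace each value by its bin's mean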