Multidimensional data processing

• Multivariate data consist of several variables for each observation.
• Actually, serious data is always multivariate.
• Some variables are usually not collected, to simplify collecting and processing.
  • Removal of variables before data analysis leads to information loss.
  • Unknown information is never recovered.
• One of the most common tasks is clustering or classification.
• classification
  • target classes are known
  • properties of target classes are usually unknown
  • goal: find rules which separate observed data into target classes
• clustering
  • target classes are unknown
  • goal: find observations with common properties which may (or may not) represent classes in the real world
  • difficult situation

• we are trying to extract information from data
  • measurements, observations, surveys
• data preparation
  • data adjustment – removal of invalid or incomplete observations/measurements
  • normalization? – best handled when collecting
• extracting information
  • we know what we are looking for – testing of a hypothesis
  • trying to discover something new – data exploration

• preliminary analysis of the data
  • better understanding of its characteristics
  • allows selecting the right tools for preprocessing or analysis
  • wrong tools may yield invalid information or hide important patterns
• also known as Exploratory Data Analysis (EDA)
  • a different approach – mind shift is required
  • concentrates on the larger view
  • 1977+
  • aka visual data mining

Richard Wesley Hamming, Numerical Methods for Scientists and Engineers, 1962

• steps
  • maximize insight into a data set
  • uncover underlying structure
  • extract important variables
  • detect outliers and anomalies
  • test underlying assumptions
  • develop minimalistic models
  • determine optimal property settings

• heavily relies on graphics
• numbers are very abstract
• Characteristics:
  • N = 11
  • Mean of X = 9.0
  • Mean of Y = 7.5
  • Intercept = 3
  • Slope = 0.5
  • Residual standard deviation = 1.237
  • Correlation = 0.816

      X       Y
    10.00    8.04
     8.00    6.95
    13.00    7.58
     9.00    8.81
    11.00    8.33
    14.00    9.96
     6.00    7.24
     4.00    4.26
    12.00   10.84
     7.00    4.82
     5.00    5.68

• Have we realized something important?

• Run-sequence plot
  • similar to a line chart in Excel
  • shifts in variation
  • shifts in location
  • outliers
• Histogram
  • center, spread, skew, multimodality
  • outliers
  • very useful – know how to create it!
  • nice presentations (e.g. word-cloud, tag-cloud)

• Lag plot
  • check whether the data set is random or not
  • random data should have no observable structure
  • lag = fixed time displacement
    • can be arbitrary
    • most common is 1
  • observe
    • weak autocorrelation
    • strong autocorrelation
    • sinusoidal model
    • outliers

• 1 dimension – piece of cake (pie)
• 2 dimensions – still easy – Cartesian coordinate system
• 3 dimensions – still doable in Cartesian system
• 4 and more dimensions – only Chuck Norris can do that in Cartesian system
  • other types of visualization are required
  • some may be useful only for some types of data

• understanding the data is very important
  • good visualization can help us understand the contained information
  • results need to be presented to other people
  • sanity check, intuition – people capture patterns which are missed by automated methods
• some options:
  • bubble chart (3dim scatter plot)
  • scatter plot array
  • star plot, Radviz, Polyviz
  • parallel coordinates

• Bubble chart
  • also called: 3 dimensional scatter plot
  • 2 data dimensions – graph X and Y
  • 3rd dimension – point size
  • optional 4th dimension – point color
  • advantages
    • allows uncovering clusters and variable dependencies
    • easy to understand
  • disadvantages
    • different combinations need to be tried

• Scatter plot array
  • extension to common scatter plot
  • 2 dimensional array of scatter plots
    • each combination of variables is drawn (twice)
    • diagonal descriptions
  • easy to create
  • messy
  • dependencies between more than two variables are still hidden

[Figure: scatter plot array; variables: Sepal length, Sepal width, Petal length, Petal width]

• axes radiate from central point
• Star plot
  • values of a data point are connected to form a polygon
  • can display only a small number of points
  • order of variables may be important
• Radviz
  • values of a data point act as spring stiffness
  • values normalized into interval [0, 1]
  • object is placed in equilibrium of all forces
  • order of variables becomes very important

[Figure legend: Iris-virginica, Iris-versicolor, Iris-setosa]

• Polyviz
  • similar principle to Radviz
  • data points are not attracted to a single point
  • data points are attracted to an axis
    • circle becomes polygon → Polyviz
    • order of variables is less important
  • polygon edges become very important
    • candidates for classification rules
    • different combinations of variables
    • exact position of point is displayed – no information loss

• Parallel coordinates
  • orthogonal system uses up the plane very fast
  • geometrical transformation
    • unlike the previously mentioned methods
    • has other uses than just visualization
  • low representational complexity – scatter plot array has O(n²)
  • equidistant parallel axes
    • same positive orientation
    • a point C in ℝ³ is represented by a polygonal line 𝐶
    • a plane in ℝ³ is represented by lines
  • advantages
    • determine correlation between variables
      • both positive and negative
    • determine partial correlations
      • only some values of some variable are correlated with some values of other variable
      • very important
  • disadvantages
    • dependent on variable ordering
    • not that useful without interactive software
    • may be hard to understand for newbies

• Exploratory data analysis:
  • http://www.itl.nist.gov/div898/handbook/eda/eda.htm
• Have a look at the graphical techniques:
  • http://www.itl.nist.gov/div898/handbook/eda/section3/eda33.htm
• Orange Canvas – open-source data mining
  • http://orange.biolab.si/
  • interface similar to IBM Clementine (SPSS Modeler)
  • widget documentation: http://orange.biolab.si/doc/widgets/
• Sample data
  • http://archive.ics.uci.edu/ml/index.html
  • http://www958.ibm.com/software/data/cognos/manyeyes/
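
• Worked check of the "Characteristics" numbers
  • a minimal sketch, assuming the eleven (X, Y) pairs listed with the characteristics (N = 11, Mean of X = 9.0, ..., Correlation = 0.816) are paired as reconstructed below; these values match the first data set of Anscombe's quartet
  • the function name `summarize` and the dictionary keys are hypothetical, chosen for illustration; only the Python standard library is used
  • residual standard deviation is assumed to use n − 2 degrees of freedom (two fitted parameters)

```python
import math

# (X, Y) pairs listed with the characteristics; the pairing is an assumption
# reconstructed from the flattened table in the slides.
X = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
Y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]


def summarize(xs, ys):
    """Summary characteristics of a least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Centered sums of squares and cross-products.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual sum of squares around the fitted line.
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

    return {
        "n": n,
        "mean_x": mean_x,
        "mean_y": mean_y,
        "intercept": intercept,
        "slope": slope,
        "residual_sd": math.sqrt(rss / (n - 2)),  # n - 2 degrees of freedom
        "correlation": sxy / math.sqrt(sxx * syy),
    }


if __name__ == "__main__":
    for name, value in summarize(X, Y).items():
        print(f"{name}: {value:.3f}")
```

  • the other three data sets of Anscombe's quartet give practically the same numbers but look completely different when plotted, which is why EDA relies on graphics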