Multidimensional data processing

• Multivariate data consist of several variables
for each observation.
• Actually, serious data is always multivariate.
• Some variables are usually not collected to
simplify collecting and processing.
  • Removal of variables before data analysis leads to information loss.
  • Unknown information is never recovered.
• One of the most common tasks is clustering or classification.
• classification
  • target classes are known
  • properties of target classes are usually unknown
  • goal: find rules which separate observed data into target classes
• clustering
  • target classes are unknown
  • goal: find observations with common properties which may (or may not) represent classes in the real world
  • difficult situation
• we are trying to extract information from data
  • measurements, observations, surveys
• data preparation
  • data adjustment – removal of invalid or incomplete observations/measurements
  • normalization? – best handled when collecting
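A minimal Python sketch of the data adjustment step above, assuming each observation is stored as a tuple in which a missing measurement shows up as None or NaN. The helper names is_missing and drop_incomplete and the sample rows are hypothetical, chosen only for illustration.

```python
import math

def is_missing(value):
    """True for None and for floating-point NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def drop_incomplete(rows):
    """Keep only observations in which every measurement is present."""
    return [row for row in rows if not any(is_missing(v) for v in row)]

# Hypothetical observations: (length, width)
raw = [(5.1, 3.5), (None, 3.0), (4.9, float("nan")), (6.2, 2.9)]
clean = drop_incomplete(raw)
print(clean)
```

What counts as an invalid (as opposed to incomplete) measurement depends on the domain, so that check is left out of the sketch.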
• extracting information
  • we know what we are looking for – testing a hypothesis
  • trying to discover something new – data exploration
• preliminary analysis of the data
  • better understanding of its characteristics
  • allows selecting the right tools for preprocessing or analysis
  • wrong tools may yield invalid information or hide important patterns
• also known as Exploratory Data Analysis (EDA)
  • a different approach – mind shift is required
  • concentrates on the larger view
  • 1977+
  • aka visual data mining
Richard Wesley Hamming, Numerical Methods for Scientists and Engineers, 1962
• steps
  • maximize insight into a data set
  • uncover underlying structure
  • extract important variables
  • detect outliers and anomalies
  • test underlying assumptions
  • develop minimalistic models
  • determine optimal property settings
• heavily relies on graphics
  • numbers are very abstract
• Characteristics:
  • N = 11
  • Mean of X = 9.0
  • Mean of Y = 7.5
  • Intercept = 3
  • Slope = 0.5
  • Residual standard deviation = 1.237
  • Correlation = 0.816
    X        Y
  10.00     8.04
   8.00     6.95
  13.00     7.58
   9.00     8.81
  11.00     8.33
  14.00     9.96
   6.00     7.24
   4.00     4.26
  12.00    10.84
   7.00     4.82
   5.00     5.68
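The characteristics listed above can be recomputed from the eleven (X, Y) pairs. The sketch below uses only the Python standard library and assumes an ordinary least-squares line fit, with the residual standard deviation taken over n - 2 degrees of freedom. The variable names are illustrative.

```python
import math

# The eleven (X, Y) observations listed above
x = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squares and cross-products around the means
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Least-squares line y = intercept + slope * x
slope = sxy / sxx
intercept = mean_y - slope * mean_x
correlation = sxy / math.sqrt(sxx * syy)

# Residual standard deviation with n - 2 degrees of freedom
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
residual_sd = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))

print(f"N = {n}")
print(f"Mean of X = {mean_x:.1f}")
print(f"Mean of Y = {mean_y:.1f}")
print(f"Intercept = {intercept:.3f}")
print(f"Slope = {slope:.3f}")
print(f"Residual standard deviation = {residual_sd:.3f}")
print(f"Correlation = {correlation:.3f}")
```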
• Have we realized something important?
• Run-sequence plot
  • similar to a line chart in Excel
  • shifts in variations
  • shifts in location
  • outliers
• Histogram
  • center, spread, skew, multimodality
  • outliers
  • very useful – know how to create it!
  • nice presentations (e.g. word-cloud, tag-cloud)
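Since the slide asks the reader to know how to create a histogram, here is a minimal sketch using equal-width bins and the Python standard library only. The function name histogram, the bin count and the sample values are assumptions made for illustration. Real tools offer more refined binning rules.

```python
def histogram(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; cannot form bins")
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value belongs to the last bin, not to a new one
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts

# Hypothetical sample with one value far from the rest
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
edges, counts = histogram(sample, bins=4)

# Text rendering: one row per bin, one '#' per observation
for left, right, count in zip(edges, edges[1:], counts):
    print(f"[{left:4.1f}, {right:4.1f}) {'#' * count}")
```

In the text rendering, the isolated value shows up as a short bar separated from the bulk of the data, which is how a histogram hints at outliers.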
Lag plot
• check whether the data set is random or not
• random data should have no observable structure
• lag = fixed time displacement
  • can be arbitrary
  • most common is 1
• observe
  • weak autocorrelation
  • strong autocorrelation
  • sinusoidal model
  • outliers
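The points of such a plot are pairs of values separated by the chosen lag. The sketch below builds those pairs and computes the usual sample autocorrelation coefficient for one lag. The function names lag_pairs and autocorrelation and the two test series are hypothetical.

```python
def lag_pairs(series, lag=1):
    """Pairs (earlier value, value `lag` steps later), one per plotted point."""
    if not 1 <= lag < len(series):
        raise ValueError("lag must be between 1 and len(series) - 1")
    return list(zip(series[:-lag], series[lag:]))

def autocorrelation(series, lag=1):
    """Sample autocorrelation of the series at the given lag."""
    mean = sum(series) / len(series)
    denominator = sum((v - mean) ** 2 for v in series)
    numerator = sum((a - mean) * (b - mean) for a, b in lag_pairs(series, lag))
    return numerator / denominator

# Hypothetical series: a steady trend and a strict alternation
trend = list(range(1, 11))
alternating = [1, -1] * 5

print(lag_pairs(trend)[:3])
print(autocorrelation(trend))
print(autocorrelation(alternating))
```

A series without structure would give pairs scattered with no pattern and a coefficient close to zero. The trend and the alternation both give clearly non-zero values with opposite signs.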
• 1 dimension – piece of cake (pie)
• 2 dimensions – still easy – Cartesian coordinate
system
• 3 dimensions – still doable in Cartesian system
• 4 and more dimensions – only Chuck Norris can
do that in Cartesian system
  • other types of visualization are required
  • some may be useful only for some types of data
• understanding the data is very important
  • good visualization can help us understand the contained information
  • results need to be presented to other people
  • sanity check, intuition – people catch patterns that automated methods miss
• some options:
  • bubble chart (3-dimensional scatter plot)
  • scatter plot array
  • star plot, Radviz, Polyviz
  • parallel coordinates
Bubble chart
• also called: 3-dimensional scatter plot
• 2 data dimensions – graph X and Y
• 3rd dimension – point size
• optional 4th dimension – point color
• advantages
  • allows uncovering clusters and variable dependencies
  • easy to understand
• disadvantages
  • different combinations of variables need to be tried
Scatter plot array
• extension of the common scatter plot
• 2-dimensional array of scatter plots
  • each combination of variables is drawn (twice)
  • variable names are shown on the diagonal
• easy to create
• messy
• dependencies between more than two variables are still hidden
[Figure: scatter plot array; diagonal labels: Sepal length, Sepal width, Petal length, Petal width]
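The layout of a scatter plot array can be enumerated without drawing anything. The sketch below lists the panels for the four variables named above. It illustrates why every combination appears twice and why the number of panels grows quadratically with the number of variables. The grid text format is an assumption made for display only.

```python
from itertools import permutations

variables = ["Sepal length", "Sepal width", "Petal length", "Petal width"]

# Off-diagonal panels: every ordered pair (x variable, y variable)
panels = list(permutations(variables, 2))

# Each unordered pair appears twice, once per axis assignment
unique_pairs = {frozenset(pair) for pair in panels}

# Grid layout: variable name on the diagonal, a scatter plot elsewhere
grid = [
    [row if row == col else f"{col} vs {row}" for col in variables]
    for row in variables
]

for line in grid:
    print(" | ".join(line))
print(len(panels), "panels for", len(unique_pairs), "distinct pairs")
```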
• axes radiate from a central point
• Star plot
  • values of a data point are connected to form a polygon
  • can display only a small number of points
  • order of variables may be important
• Radviz
  • values of a data point act as spring stiffness
  • values are normalized into the interval [0, 1]
  • the object is placed at the equilibrium of all forces
  • order of variables becomes very important
[Figure legend: Iris-virginica, Iris-versicolor, Iris-setosa]
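The Radviz placement rule described above can be sketched in a few lines. The sketch assumes the variable anchors are spaced evenly on a unit circle and that each anchor pulls the point with a spring whose stiffness equals the normalized value. The equilibrium is then the stiffness-weighted average of the anchor positions. The function names and the sample rows are hypothetical.

```python
import math

def normalize_columns(rows):
    """Min-max scale every variable (column) into the interval [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def radviz_position(values):
    """Equilibrium point of springs attached to anchors on the unit circle."""
    n = len(values)
    total = sum(values)
    if total == 0:
        return (0.0, 0.0)  # no spring pulls: leave the point in the center
    anchors = [
        (math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
        for i in range(n)
    ]
    px = sum(v * ax for v, (ax, _) in zip(values, anchors)) / total
    py = sum(v * ay for v, (_, ay) in zip(values, anchors)) / total
    return (px, py)

# Made-up observations with four variables each
rows = [(5.0, 3.4, 1.5, 0.2), (6.4, 2.9, 4.3, 1.3), (6.9, 3.1, 5.4, 2.1)]
for scaled in normalize_columns(rows):
    print(radviz_position(scaled))
```

Reordering the variables moves the anchors, and with them every point, which is why the order of variables matters so much in this plot.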
• similar principle to Radviz
• data points are not attracted to a single point
• data points are attracted to an axis
  • circle becomes polygon → Polyviz
  • order of variables is less important
• polygon edges become very important
  • candidates for classification rules
  • different combinations of variables
  • exact position of point is displayed – no information loss
Parallel coordinates
• an orthogonal system uses up the plane very fast
• geometrical transformation
  • unlike the previously mentioned methods
  • has other uses than just visualization
• low representational complexity – a scatter plot array needs O(n²) panels
• equidistant parallel axes
  • same positive orientation
  • a point C in ℝ³ is represented by a polygonal line C
  • a plane in ℝ³ is represented by lines
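The mapping from a point to its polygonal line can be written down directly. The sketch below assumes vertical axes placed at horizontal positions 0, 1, 2, ... (the spacing parameter is an illustrative choice) and returns the vertices of the polygonal line for one point. The function name to_polyline is hypothetical.

```python
def to_polyline(point, spacing=1.0):
    """Vertices of the polygonal line representing `point`.

    Axis i stands at horizontal position i * spacing; the i-th coordinate
    of the point is marked on that axis.
    """
    return [(i * spacing, coordinate) for i, coordinate in enumerate(point)]

# A point C in three dimensions becomes a line through three vertices
C = (2, 5, 3)
print(to_polyline(C))
```

Adding a variable adds one axis and one vertex, so the representation grows linearly with the number of dimensions.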
• advantages
  • determine correlation between variables
    • both positive and negative
    • determine partial correlations
      • only some values of one variable are correlated with some values of another variable
      • very important
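Between two adjacent axes with the same orientation, positively correlated variables produce mostly non-crossing segments, while negatively correlated variables produce many crossings. One way to make this reading concrete, offered as an illustration rather than a standard measure, is to count the crossing segment pairs. The function name count_crossings and the sample values are hypothetical.

```python
from itertools import combinations

def count_crossings(left, right):
    """Number of segment pairs that cross between two adjacent axes.

    Observation i is drawn as a segment from left[i] on the first axis to
    right[i] on the second. Two segments cross when their order on the
    first axis is reversed on the second.
    """
    pairs = combinations(zip(left, right), 2)
    return sum(1 for (a1, b1), (a2, b2) in pairs if (a1 - a2) * (b1 - b2) < 0)

first = [1, 2, 3, 4]
print(count_crossings(first, [2, 4, 6, 8]))   # second variable rises with the first
print(count_crossings(first, [8, 6, 4, 2]))   # second variable falls as the first rises
```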
• disadvantages
  • dependent on variable ordering
  • not that useful without interactive software
  • may be hard to understand for newbies
• Exploratory data analysis:
  • http://www.itl.nist.gov/div898/handbook/eda/eda.htm
• Have a look at the graphical techniques:
  • http://www.itl.nist.gov/div898/handbook/eda/section3/eda33.htm
• Orange Canvas – open-source data mining
  • http://orange.biolab.si/
  • interface similar to IBM Clementine (SPSS Modeler)
  • widget documentation: http://orange.biolab.si/doc/widgets/
• Sample data
  • http://archive.ics.uci.edu/ml/index.html
  • http://www958.ibm.com/software/data/cognos/manyeyes/