Uploaded by erinabock

Data Analytics Summary: EDA, Visualization, Data Mining

Summary Data Analytics
Data analytics for engineers (Technische Universiteit Eindhoven)
StudeerSnel is not sponsored or endorsed by any college or university
Downloaded by stijn buurman (buurmanstijn@gmail.com)
2IAB0 : Data Analytics 2020
Week 1: EDA = Exploratory Data Analysis
Descriptive Data Analytics:
Data collected now may be used later for other purposes
Insight into the past
We study the methods for organizing, displaying and describing data.
Predictive Data Analytics:
Only prediction, but no indication of what we should do
Looking into the future
Prescriptive Data Analytics:
Data-driven advice on how to take action to influence or change the future
Data forms/types:
Categorical (Not numbers)
o Dichotomous (yes/no, male/female)
o Nominal: no ordering. (genre, nationality)
o Ordinal: has ordering (ratings (bad, neutral, good))
Numerical (Numbers)
o Interval : no fixed ‘zero point’, only difference has a meaning (release date,
world ranking, temperature)
o Ratio: has fixed ‘zero point’, (budget, athlete weight)
Charts/Graphs are good:
For making comparisons
To get a quick view
Usually used for categorical data
Useful tool to discover unexpected relations
Tables are good:
For reading off values
To draw attention to actual sizes
Reference table: Store all data in a table so that it can be looked up easily; used to
organize and store raw data.
Demonstration table: Table to illustrate a point (so just present enough data), short
and designed to make a particular, succinct point.
Scatter plots: (Actual values and structure of numerical variables)
Good for showing actual values and structure of numerical variables
Not suitable for large data sets
The jitter option (i.e., slight changes in horizontal
displacement) may help to avoid overlapping dots.
Dot plots/strip plots:
o Showing actual values/structure of numerical variables
o Not suitable for large data sets
o Easily judge distributional properties like (a)symmetry
Histograms:
o Distribution of numerical data
o The range of data values is split into bins (= intervals of values)
o The histogram shows the number of observations in the dataset for every bin
o Choose √n as the number of bins, where n is the number of observations
o No gaps between the bars
o A histogram is for NUMERICAL data
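The binning rule above can be sketched as follows. This is only an illustration: the helper name `histogram_counts`, the choice of equal-width bins, the rounding of √n and the sample values are assumptions, not part of the course material.

```python
import math

def histogram_counts(values):
    """Count observations per bin, using about sqrt(n) equal-width bins."""
    n = len(values)
    n_bins = max(1, round(math.sqrt(n)))
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 when all values are equal
    counts = [0] * n_bins
    for v in values:
        # The maximum value is put in the last bin instead of opening a new one.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts

data = [3, 4, 6, 7, 7, 10, 15, 22, 34, 40]  # illustrative values
counts = histogram_counts(data)  # one count per bin, no gaps between bins
```

With 10 observations this gives 3 bins; the counts always add up to the number of observations.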
Bar chart (categorical values)
o Gaps between the bars
o Distribution of CATEGORICAL data
KDP: Kernel Density Plot
o Improved histograms
o Useful to illustrate thresholds
o Overcome the drawbacks of histograms because
they use moving bins instead of fixed bins
o Good tools to explore distribution shapes
o Good for detailed inspection of the shape of the distribution
Location Statistics:
Mean: average
Median: middle value
Mode: most frequent value
1st quartile: cut-off point for 25% of the data
2nd quartile: cut-off point for 50% of the data (median)
3rd quartile: cut-off point for 75% of the data
Scale Statistics: (The higher these statistics are, the bigger the spread in the data)
Range: max – min
Interquartile range (IQR): 3rd quartile minus 1st quartile
Sample variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
o Involves differences between the observations and their mean
 Sample standard deviation: $s = \sqrt{s^2}$ (the square root of the sample variance)
 If all observations are equal, the sample variance and the sample standard deviation are 0; they are never negative.
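As a rough sketch, the location and scale statistics above can be computed with Python's standard `statistics` module. The sample values are made up, and quartile conventions differ between tools, so the quartiles from `statistics.quantiles` (default method) may deviate slightly from those of other software.

```python
import statistics

data = [3, 4, 6, 7, 7, 10, 15, 22, 34, 40]  # illustrative values

# Location statistics
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
q1, q2, q3 = statistics.quantiles(data, n=4)  # cut-off points for 25%, 50%, 75%

# Scale statistics
data_range = max(data) - min(data)
iqr = q3 - q1
sample_variance = statistics.variance(data)  # divides by n - 1
sample_std = statistics.stdev(data)          # square root of the sample variance
```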
Standardization and Z-scores:
$z_i = \frac{x_i - \bar{x}}{s}$, i.e. (individual result − mean) / sample standard deviation
Transforms data from their original units into the universal statistical unit of standard
deviations from the mean.
Negative Z-score → The value is below the mean
Positive Z-score → The value is above the mean
Rule of thumb: Observations with a z-score larger than 2.5 in absolute value are considered to be
extreme (‘outliers’)
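A minimal sketch of standardization and the 2.5 rule of thumb, assuming the sample standard deviation (n − 1 in the denominator) is used. The function name `z_scores` and the data are illustrative only.

```python
import statistics

def z_scores(values):
    """Return (x - mean) / sample standard deviation for every value."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(x - mean) / std for x in values]

data = [10, 12, 11, 13, 12, 11, 12, 10, 13, 60]  # illustrative values
scores = z_scores(data)

# Rule of thumb: flag observations whose z-score exceeds 2.5 in absolute value.
outliers = [x for x, z in zip(data, scores) if abs(z) > 2.5]
```

Here only the value 60 is flagged; the z-scores themselves always sum to (approximately) zero.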
Allowed actions for data types:
[Table not recoverable from the source. Surviving labels: "Ok to compute"; "Median and ..."; "Add or subtract"; "Mean, Standard ..."]
Box-and-Whisker Plot: (Quartiles and IQR, variance)
Is better than histogram/kernel density estimators to compare groups
Week 2: VIS = Visualization
Visualization is the process that transforms (abstract) data into (interactive) graphical
representations for the purpose of exploration, confirmation or communication.
Why visualize?
Communication to inform humans
o Annotations in graph: highlight important aspects
o Compare data: 2 graphs (upper and lower) with comparable axes and scales
o Support specific analysis: Select and summarize the data
Exploration when questions are not well-defined → generate hypotheses
o Display all the data: allows different perspectives on the dataset
o Not suitable for testing a specific hypothesis
When NOT to visualize?
Well-defined question on well-defined dataset
Decision needed in minimal time
Summary so far
Visualizations can be made to explore and communicate data
A good data visualization:
o Combines the strengths of humans and computers
o Makes data accessible
o Effectively enables insight
o Truthfully communicates
Analyze (Action):
Visualization for consuming information:
o For end-users, often non-technical
o Needs a good balance of detail and message
o Design matters more than for other tasks
o Dataset will not be changed
Visualization for producing data:
o Extends the dataset
o Usually interactive, more technical users
o Additional meta-data (semantics)
o Additional data (volume)
o Additional dimensions (derived)
Search (Action):
Look up → You know what to look for and where
Browse → You don’t know what, you look around
Locate → You know what, but not where
Explore → You don’t know what, nor where
We look at all data
o Trends or patterns define the mainstream
o Outliers stand out from the mainstream
o Features are task-dependent structures of interest
We look at attributes only
o One: Analyze the distribution or extremes
o Many: Analyze dependency, correlation
or similarity
Our brain plays an active role in what we see and it automatically groups objects according
to distance, color and shape. We are naturally faster and more accurate in position than in
Perception of color begins with 3 specialized retinal cells containing pigments with
different spectral sensitivities, known as cone cells
The output of these cells in the retina is used in the visual cortex and associative
areas of the brain
A key attribute acts as an index that is used to look up value attributes
Synonym for key attribute is independent attribute
Synonym for value attribute is dependent attribute
Data visualization makes use of
Marks (geometric primitive)
o Points
o Lines
o Areas
o Complex shapes
Channels (appearance of marks)
o Position
o Color
o Length
o Length, size
o Shape, orientation, curvature
Arrangement is about how data values determine the position and alignment of
visual representatives.
Express: attribute is mapped to spatial position along an axis
Separate: emphasize similarity and distinction
Order: emphasize order
Align: emphasize quantitative comparison
Use: using an existing structure for arrangement
Munzner’s reference model of visualisation:
What? → What is the data and how is it structured?
Why? → What are the actions and targets of the
How? → What is the mapping between data items and
Summary so far:
The reference model allows us to
o Separately analyse why, what and how to visualise
o Shows a process to construct effective visualisations
Why: tasks, targets and actions
What: dataset and data attributes
How: encoding means arrangement (position) and mapping (other channels)
The reference framework uses the word idiom to indicate chart types. Idioms do not serve
all tasks equally, so choose them carefully.
Elements of a chart:
Data is expressed in visual elements, but that is not enough for a chart; we also need:
o Coordinate system and scaling of the data
o Axes and grid (help reading the chart)
o Legend (provides mapping information)
o Annotations (highlight or emphasize)
o Layout and layering (support tasks such as comparison)
Keys and values:
Value: dependent attribute, or the value of a cell in a table
Key: independent attribute, used as a unique index to look up items
Simple tables: 1 key
Multidimensional tables: multiple keys
Bar vs Line charts:
Depends on the type of key attribute
o Bar chart if the key is categorical (nominal)
o Line chart if the key is ordered
The difference between open exploration, confirmation and communication is the
purpose of visualization
Angles of degree 0, 90, 76, 137, 180, 223 and 270 are similar in their accuracy when
visualising data.
Explorative visualization should be used when there is (a lot) of data that you do
not understand well
A line chart fits a dataset with two time-series
The common task of a histogram is to find distribution
The most important visual attribute for visual encoding is to mark position
Studies have shown that the length of shapes compares easier than area or even
A chart title is NOT a common accessibility issue for data visualisations.
Week 3: DMM = Data Mining Methods
Data mining = Distilling information from raw data
Supervised vs unsupervised methods: predefined target vs no predefined target
Global vs local models: searching for information on all observations vs on
only a few
Data mining often finds correlations, not causations
There are 4 methods of data mining: Linear Regression, Clustering, Decision Tree Mining,
Association Rule Learning
Linear regression
Global, supervised → We have a specific target in mind and the model tries to say
something about the whole dataset.
Goal → Find a linear model that relates an output variable to one or more input
variables. Predict the output values from a linear model over the input values
The lower the Sum of the Squared Deviations (SSD), the better the linear model
$SSD = \sum_{i=1}^{n} \left( y_i - (\beta_0 + \beta_1 x_i) \right)^2$
Alternative measure of quality: R² = percentage of variation in output values that is
explained by the variations in input values
o The higher R², the better the quality of the linear regression
May have multiple inputs
"Linear" refers to the way the parameters occur, not the form of the regression model!
Add a cross term to the model when two input columns amplify or dampen each
other’s effect on the output
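A minimal sketch of simple linear regression with one input, using the closed-form least-squares estimates that minimise the SSD. The function names and data are illustrative, and R² is computed here as 1 − SSD / (total variation of the output around its mean), which is the usual definition but is an assumption about what the course uses.

```python
def fit_line(xs, ys):
    """Least-squares estimates of beta0 and beta1 for y = beta0 + beta1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1

def ssd(xs, ys, beta0, beta1):
    """Sum of squared deviations between observed and predicted outputs."""
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))

def r_squared(xs, ys, beta0, beta1):
    """Share of the variation in the output explained by the model."""
    y_mean = sum(ys) / len(ys)
    total = sum((y - y_mean) ** 2 for y in ys)
    return 1 - ssd(xs, ys, beta0, beta1) / total

xs = [1, 2, 3, 4, 5]              # illustrative input values
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # illustrative output values
beta0, beta1 = fit_line(xs, ys)
quality = r_squared(xs, ys, beta0, beta1)
```

Any other choice of beta0 and beta1 gives a larger SSD on the same data, which is what "best" linear model means here.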
Clustering
Global, unsupervised → We don’t have a specific target and we try to say something
about the whole data set.
Goal → Find a natural grouping of observations.
Can also be applied where labels are unknown, uncertain or too expensive to obtain
Each cluster is represented by one point: the centroid
Clustering algorithm: k-means (convenient way to find clusters)
1. Pick k random points as initial centroid
2. Assign points to nearest centroid
3. Recompute centroids by moving each centroid to the mean of its observations
and repeat the steps
Strive for low within-cluster distance
Good clusterings try to find a balance between small within-cluster distances and the
number of clusters.
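The three k-means steps can be sketched in plain Python as below. This is a simplified illustration under assumptions: the initial centroids are drawn from the observations themselves, squared Euclidean distance is used, and the loop stops when the centroids no longer move. The function names and data are made up.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, iterations=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: k random observations as centroids
    clusters = []
    for _ in range(iterations):
        # Step 2: assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        # Step 3: move each centroid to the mean of its observations.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]  # keep an empty cluster's centroid in place
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # nothing moved: converged
            break
        centroids = new_centroids
    return centroids, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]  # illustrative
centroids, clusters = k_means(points, k=2)
```

On this small example the two well-separated groups are recovered; on real data the result can depend on the random initial centroids.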
Euclidean distance:
Movement is unrestricted
Network distance:
Network of possible
Manhattan distance:
Network is a grid
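For two of these distance measures a short sketch is easy to give (network distance needs a graph and is left out). The function names are illustrative.

```python
import math

def euclidean(p, q):
    """Straight-line distance: movement is unrestricted."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Grid distance: movement only along the axes."""
    return sum(abs(a - b) for a, b in zip(p, q))

straight = euclidean((0, 0), (3, 4))  # 5.0
grid = manhattan((0, 0), (3, 4))      # 7
```

The Manhattan distance is never smaller than the Euclidean distance between the same two points.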
Decision Tree Mining
Global, supervised → We have a specific target in mind and the model tries to say
something about the whole dataset.
Goal → Separate successful from unsuccessful observations by few, interpretable
True positives and True negatives are good
False positives and False negatives are bad
Internal node: Test on an attribute (nodes
in decision trees)
Branch: An outcome of that test
Leaf: Class label distribution
Decision trees are built by finding splits
that have the highest information gain
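The notes do not define information gain. A common definition, assumed here, is the entropy of the class labels before the split minus the weighted average entropy after the split. The sketch below uses that definition with made-up labels.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 4 + ["no"] * 4  # illustrative class labels

# A split that separates the classes perfectly has the highest gain.
perfect = information_gain(parent, [["yes"] * 4, ["no"] * 4])
# A split that leaves the class mix unchanged gains nothing.
useless = information_gain(parent, [["yes", "yes", "no", "no"]] * 2)
```

A tree-building algorithm would evaluate candidate splits like these and pick the one with the highest gain.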
Association Rule Learning
Local, unsupervised → The results do not say something about the whole dataset and
we do not have a specific goal in mind.
Goal → Identify strong rules: an event that is strongly associated with another event
in the data.
Rules of the form X → Y are strong if both apply:
o Their support (= how often the combination of X and Y occurs in
the dataset) is high enough: frequent itemsets
 Frequent is when it happens at least 3 times
o Their confidence (= if X occurs, how often does Y also occur?) is high
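Support and confidence of a rule X → Y can be sketched as follows, with support taken as a count of transactions (as in the "at least 3 times" threshold above). The transactions and function names are made up for illustration.

```python
def support(transactions, items):
    """Number of transactions that contain every item in `items`."""
    return sum(1 for t in transactions if items <= t)

def confidence(transactions, x, y):
    """Of the transactions containing X, the fraction that also contain Y."""
    return support(transactions, x | y) / support(transactions, x)

transactions = [  # illustrative shopping baskets
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"bread", "butter", "jam"},
    {"milk"},
]

rule_support = support(transactions, {"bread", "butter"})          # 3
rule_confidence = confidence(transactions, {"bread"}, {"butter"})  # 0.75
```

The rule bread → butter occurs 3 times, so it counts as frequent under the threshold above, and it holds in 3 of the 4 baskets that contain bread.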
Week 4: ORG = Organization
Data needs to be stored, used and manipulated in many types of applications:
Scientific: biology, chemistry
Technical/engineering: air traffic control, embedded systems
Administrative: banking, student administration
Document-oriented: like Google
Object system: the ‘real’ world of a company, organization or experiment with people,
machines, products, etc.
Information system: A representation (always an approximation) of the real world in a
computer system, using data to represent objects such as people, machines, products, etc.
Primary key → A minimal set of attributes of a table that uniquely identifies each row of this
table. If we leave only the ‘key columns’ in the table, then there are no repetitions of values
Database Management System (DBMS) is a very common type of information system and
is designed to support systematic, principled solutions to:
Data redundancy and inconsistency
Data security
Expressive and efficient Data Analytics
Relational Database model organizes data into one or more tables of columns and rows
with a unique key identifying each row. Describes data and their relationships
Instances: The actual content of the database at a particular point in time
Table (relation instance): Contains multiple instances with their attributes
Logical schema (data model): Logical structure of the database, contains multiple
Entity-relationship model: We model the relationship between the entities, it is thus entity
specific so it only represents the values in the tables
Entity: Object that exists and is distinguishable from other objects, described by attributes
Entities have attributes, like people have names and addresses
Domain: The set of permitted values for each attribute
E-R Diagrams syntax:
Rectangles represent entity sets
Diamonds represent relationship sets
Lines link entity sets to relationship sets
Attributes listed inside entity rectangles
Underline indicates unique ID (primary key)
In this case ‘date’ indicates an attribute associated with a relationship set
A database represents the information of a particular domain:
Determine the information needs and the users
Design a conceptual model for this information
Determine functional requirements for the system
Example code (SQL):
SELECT instructor.instructor_id, instructor.name AS name
FROM instructor, advisor, student
WHERE instructor.instructor_id = advisor.instructor_ID AND
      student.student_id = advisor.student_ID;
SELECT: Lists attributes to retrieve
- Using ‘ * ‘ shows all attributes
- Operators can be used on numeric attributes.
FROM: Lists tables from which we query
WHERE: Defines a predicate
- AND, OR, NOT and parentheses can be used.
- For comparison can be used: =, <, <=, >, >=, <>
- In a LIKE pattern, ‘ % ‘ represents zero, one or multiple characters
- In a LIKE pattern, ‘ _ ‘ represents a single character
- Can be used for linking data from different tables (see example code)
DISTINCT: Removes duplicates (insert after SELECT)
AS: Can be used in the SELECT and FROM clauses to rename attributes / tables (the new name can then
also be used in the query code).
GROUP BY: Choose attribute to group on
HAVING: Like a WHERE condition, but applied to the groups formed by GROUP BY
- WHERE is evaluated first, then HAVING on the groups.
UNION, INTERSECT and EXCEPT can be used on two queries
Aggregate functions: COUNT, MIN, MAX, AVG, SUM
Order of operations in SQL: FROM → WHERE → GROUP BY → HAVING → SELECT
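To try these clauses out, the sketch below runs a query against an in-memory SQLite database from Python. The table layout follows the example query above, but the rows, the instructor names and the `n_students` alias are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE instructor (instructor_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE advisor (instructor_id INTEGER, student_id INTEGER);
    INSERT INTO instructor VALUES (1, 'Ada'), (2, 'Bob');
    INSERT INTO student VALUES (10, 'Cas'), (11, 'Dee'), (12, 'Eva');
    INSERT INTO advisor VALUES (1, 10), (1, 11), (2, 12);
""")

# FROM and WHERE link the tables, GROUP BY forms one group per instructor,
# HAVING keeps only groups with at least 2 students, SELECT picks the output.
rows = conn.execute("""
    SELECT instructor.name AS name, COUNT(*) AS n_students
    FROM instructor, advisor, student
    WHERE instructor.instructor_id = advisor.instructor_id
      AND student.student_id = advisor.student_id
    GROUP BY instructor.name
    HAVING COUNT(*) >= 2
""").fetchall()
conn.close()
```

Only the instructor who advises two students remains in the result; dropping the HAVING line would return both instructors with their counts.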
Week 5: DAS = Data Aggregation and Sampling
Data sampling: Any process of reducing the size of a data set by selecting a subset of the data.
Data aggregation: Any process in which available data is partitioned and the data in a
single partition is expressed in a summary form by means of one or more extracted features
All information that is a priori considered to be potentially useful is usually stored to minimize
the risk that essential information is missing. The problem statement determines which
information is actually needed. Some information may have to be made explicit.
Fitts's law: States that the average time needed to select the target is linearly related to the
index of difficulty $ID = \log_2\left(1 + \frac{D}{W}\right)$, where W = width and D = distance (target amplitude).
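A small sketch of the index of difficulty and the linear relation with selection time. It assumes the base-2 logarithm (the common formulation) and hypothetical coefficients `a` and `b`, which in practice would be estimated from measurements, for example by linear regression.

```python
import math

def index_of_difficulty(distance, width):
    """ID = log2(1 + D / W); harder targets are further away or narrower."""
    return math.log2(1 + distance / width)

def predicted_time(distance, width, a, b):
    """Average selection time as a linear function of ID (a, b are hypothetical)."""
    return a + b * index_of_difficulty(distance, width)

easy = index_of_difficulty(distance=1, width=1)  # log2(2) = 1.0
hard = index_of_difficulty(distance=7, width=1)  # log2(8) = 3.0
```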
Primary data: collected by you/your team
Secondary data: collected by others
Deductive reasoning: If the premises are true, the conclusion must be true.
Inductive reasoning: The opposite of deductive reasoning; makes broad generalizations
from specific observations.
Occam’s razor/ Parsimony principle: when faced with two possible explanations, the
simpler of the two is the one most likely to be true. A simpler explanation of the phenomenon
is preferred. → A model with fewer variables is preferred, if it fits the data equally well.
There should be as few independent variables as possible in a regression model.
Measurements or conclusions are:
 Valid when they accurately describe the world
o Internal validity: Are the conclusions valid within the study?
o External validity: Can the conclusions of a scientific study be applied beyond
the context of the study?
 Reliable when the same results are obtained under the same conditions.
Reproducibility: The ability of others to replicate the findings. If others do not get similar
measurements/conclusions, analysis results will not be accepted.
Random errors: not forming any pattern.
Systematic errors: consistent errors, like offset errors or scale errors.
Precision: refers to the errors introduced by the measuring instrument.
Accuracy: refers to deviations from real values (systematic errors).
Data cleaning: a process of detecting, diagnosing and editing faulty data.
Types of problems with data:
 Incomplete (missing) data
 Inconsistent data
 Invalid data
Handling inconsistent, invalid or missing data:
 Discard rows with at least 1 inconsistent, invalid or missing value or discard a column
with lots of such values.
 Impute values => fill in estimated values.
 Work in the presence of missing data.
Time series: Sequence of pairs (tn, xn), where tn is the observation time and xn is the
observed value, with tn < tn+1.
For equispaced time series, the spacing of observation times is constant; its reciprocal is called the
sampling frequency.
Noise: Unwanted disturbance in time-series data.
Median filter: Good at filtering away outliers.
 Choose a window size, for example 3: the filter takes
the median of the 3 values in the window, moves the
window by 1 and repeats, so that a median is computed
at every position, one by one.
Mean filter: (moving average)
 Works the same as the median filter but will
compute the mean instead of the median. It is
more sensitive to outliers. Values far away have
the same influence as values nearby.
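Both filters can be sketched with one sliding-window helper. The helper name, the signal and the choice to use only full windows (so the output is shorter than the input) are assumptions for illustration; real implementations often pad the edges instead.

```python
import statistics

def sliding_filter(values, window, summarize):
    """Apply `summarize` to every full window, moving the window by 1 each step."""
    return [
        summarize(values[i:i + window])
        for i in range(len(values) - window + 1)
    ]

signal = [1, 2, 100, 3, 4, 5]  # illustrative signal; 100 is an outlier

median_filtered = sliding_filter(signal, 3, statistics.median)
mean_filtered = sliding_filter(signal, 3, statistics.mean)
```

The median filter removes the outlier completely, while the mean filter smears it over every window that contains it.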
Gaussian filter:
 Consider a pair (tn, xn) from the time series and
assign weights wi = T(ti – tn, ω) to each value xi:
the further away, the lower the weight. Compute the
filtered value as $\hat{x}_n = \sum_i w_i x_i$; note that
$\sum_i w_i = 1$. Parameter ω indicates the width of the
applied kernel.
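A minimal sketch of this weighting scheme, assuming the kernel T is a Gaussian bell curve with standard deviation ω and that the weights are normalised per output point so that they sum to 1. The function name and data are illustrative.

```python
import math

def gaussian_filter(times, values, omega):
    """Replace every x_n by a weighted mean of all x_i, weighted by closeness in time."""
    filtered = []
    for t_n in times:
        # Unnormalised Gaussian weights: values far from t_n get low weight.
        raw = [math.exp(-0.5 * ((t_i - t_n) / omega) ** 2) for t_i in times]
        total = sum(raw)
        weights = [w / total for w in raw]  # normalise so the weights sum to 1
        filtered.append(sum(w * x for w, x in zip(weights, values)))
    return filtered

times = [0, 1, 2, 3, 4]      # illustrative, equispaced observation times
spike = [0, 0, 10, 0, 0]     # a single noisy spike
smoothed = gaussian_filter(times, spike, omega=1.0)
```

A larger ω spreads the weights over more neighbours and smooths more strongly; a constant signal is left unchanged because the weights sum to 1.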
When computing derivatives, the number of zero crossings is equal to the derivative
order. Even order = symmetric, odd order = asymmetric
Convolution filters:
 The same idea as the Gaussian filter, but using differently shaped functions to generate the weights
A sample is any part of the population. Biased if some part of the population is
overrepresented compared to others.
Convenience sampling: Going for data that is easier to collect
 Advantages: saving time, effort, money, etc.
 Disadvantages: possible bias that is a threat to external validity.
Random sampling: Each individual is equally likely to be included in the sample.
 Ignoring any knowledge of the population.
 Stratified random sampling:
o Defining strata → disjoint parts forming the whole target population
o Define sample size
o Proportionate stratified random sampling: take a random
sample from every stratum in the proportion equal to
proportion of this stratum in the population.
o Disproportionate stratified random sampling, when you
want to over represent particular strata in the sample.
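Proportionate stratified random sampling can be sketched as follows. The function name, the `stratum_of` callback and the population are hypothetical; because the per-stratum sizes are rounded, the total can differ slightly from the requested sample size for awkward proportions.

```python
import random
from collections import defaultdict

def proportionate_stratified_sample(population, stratum_of, sample_size, seed=0):
    rng = random.Random(seed)
    # Split the population into disjoint strata.
    strata = defaultdict(list)
    for individual in population:
        strata[stratum_of(individual)].append(individual)
    sample = []
    for members in strata.values():
        # The stratum's share of the sample equals its share of the population.
        k = round(sample_size * len(members) / len(population))
        sample.extend(rng.sample(members, k))  # simple random sample within the stratum
    return sample

# Illustrative population: 80 bachelor students and 20 master students.
population = [("bachelor", i) for i in range(80)] + [("master", i) for i in range(20)]
sample = proportionate_stratified_sample(population, lambda person: person[0], 10)
```

For a disproportionate design one would replace the computed `k` by a chosen size per stratum.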
Voluntary sampling: Individuals select themselves
 Self-selection bias, the effects of which are difficult to measure.
Feature generation: identifying informative variables
 Goal: reduce the number of data/variables in the dataset by creating new, more
informative variables from existing ones.
Independent variables: Are used to see how a change of their value is reflected in a
change of value of a dependent variable. The two can be correlated, but that does not make the
relation causal. Independent variables are plotted on the x-axis in regression models.
Confounding variable: variable that is not taken into account and that can provide an
explanation to the observed effect of independent variable on the dependent variable.
Week 6: HYP = Hypothesis
Probability theory: Useful mathematical tool to express uncertainty.
 Probability is a mathematical notion. It is a number between 0 and 1 that indicates how
likely an outcome is. The higher the value, the more likely the outcome is. P stands
for probability.
2 reasons to study more than just averages:
 Data may be very skewed and averages may be misleading
 The average itself is not interesting in a given context: not interested in average
person, but in what most people do/need.
Modeling successes (yes/no data) is like
modeling flipping coins, but the coin can be biased,
so heads and tails may not be equally likely.
General formula: n independent
observations, each with success
probability p. X is the number of successes in
n observations.
$P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}$, k = 0, 1, ..., n
This gives the probability of a single number of successes, also known as the Probability
Mass Function (PMF)
It is also useful to look at cumulative probabilities:
$P(X \le l) = \sum_{k=0}^{l} \binom{n}{k} p^k (1 - p)^{n-k}$
$P(X = k) = P(X \le k) - P(X \le k - 1)$
Binomial probability distribution X~Bin(n,p)
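The binomial PMF and the cumulative probabilities can be checked numerically with a short sketch. The function names are illustrative; `math.comb` computes the binomial coefficient "n choose k".

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def binomial_cdf(l, n, p):
    """P(X <= l): sum of the PMF over k = 0, ..., l."""
    return sum(binomial_pmf(k, n, p) for k in range(l + 1))

# Example: 4 fair coin flips, probability of exactly 2 heads.
two_heads = binomial_pmf(2, 4, 0.5)       # 6 * 0.5**4 = 0.375
at_most_two = binomial_cdf(2, 4, 0.5)     # 0.0625 + 0.25 + 0.375 = 0.6875
```

The PMF can be recovered from the cumulative probabilities as P(X = k) = P(X ≤ k) − P(X ≤ k − 1), and the PMF values over k = 0, ..., n add up to 1.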
Other probability distributions:
Uniform distribution: All outcomes have the same probability. $P(X = k) = \frac{1}{n}$, k = 1, …, n
Geometric distribution: How many failed tries occur before the first success.
$P(X = k) = (1 - p)^k p$, k = 0, 1, …
Poisson distribution: For modeling counts of rare events; the count has no upper bound.
$P(X = k) = e^{-\lambda} \frac{\lambda^k}{k!}$, k = 0, 1, 2, …, where λ is the mean number of events
Continuous distribution (density): has no direct interpretation only area under the curve
has meaning -> sums become integrals.
Normal distribution: When we let the number of observations increase indefinitely (limit to
infinity) then the binomial distribution changes into a probability distribution called normal
distribution/Gaussian distribution/Bell curve distribution that:
 Assumes all values on the real line
 Has probability 0 for individual values
 Has an integral representation to describe cumulative probabilities
 Notation: N(μ, σ²)
 Symmetric around the mean
 μ = 0 and σ² = 1 is called the standard normal distribution
 μ is the mean (location parameter), to be estimated by the sample mean $\bar{x}$
 σ² is the variance (larger values indicate more spread), to be estimated by the sample variance
ECDF is a function that for a given value x returns
the fraction of observations that are smaller or
equal to x
 Example: 3,4,6,7,7,10,15,22,34,40
ECDF(22) = 0.80 since 80 % of the
measurements are smaller or equal to 22
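The ECDF example above can be reproduced with a few lines. The function name `ecdf` is illustrative.

```python
def ecdf(data, x):
    """Fraction of observations that are smaller than or equal to x."""
    return sum(1 for value in data if value <= x) / len(data)

measurements = [3, 4, 6, 7, 7, 10, 15, 22, 34, 40]
fraction = ecdf(measurements, 22)  # 8 of the 10 measurements are <= 22
```

The function is 0 below the smallest observation, 1 from the largest observation onwards, and jumps at every observed value.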
Confidence intervals: Intervals that, with a certain probability, contain the true, unknown value that
we wish to estimate.
 The wider the interval, the more uncertainty there is about the true value
 The width of the interval increases when the standard deviation σ increases (less
 The width of the interval decreases when the sample size n increases (more
 Confidence intervals also exist for:
o The difference of two means (normal distributions)
o The difference of two proportions (binomial distributions)
Summary so far:
 The binomial distribution is a discrete probability distribution that counts the number
of successes
 When the number of observations is large, the binomial distribution is well
approximated by a continuous probability distribution called the normal distribution
 Confidence intervals indicate how certain we are about a value that we estimated
from data: wider intervals means more uncertainty
(Statistical) hypothesis testing: Using probability theory to answer questions objectively.
 The null hypothesis is rejected when the outcome of the experiment is too unlikely (too low
probability) if the null hypothesis were true.
 The null hypothesis is, for example: someone is not guilty until proven guilty.
Test statistic: Properly scaled difference of the sample proportions,
 If value of statistic too extreme, H0 rejected
 If not, don’t reject H0
P value: H0 is rejected when the p-value is below the threshold; the usual threshold is 0.05
Choosing hypothesis tests:
 Is it one-sample, paired sample or two-sample
 Is it proportions or means
 Is the alternative two-sided or one-sided
 Is the data normally distributed? If not, is the sample size ‘large’?
One-sided interval: Ha uses > or < a value
Two-sided interval: Ha uses ≠
Two-sided test: in hypotheses there is also an unknown factor
One-sample case: $\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
Here z is actually $z_{\alpha/2}$, a standard normal quantile ($z_{0.025} = 1.96$; $z_{0.05} = 1.645$)
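A sketch of a two-sided confidence interval for a mean in the one-sample case, assuming the standard form: sample mean ± normal quantile × σ / √n, with a known standard deviation σ. The function name and the numbers are illustrative; the quantile comes from `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Two-sided interval: sample_mean +/- z_(alpha/2) * sigma / sqrt(n)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for 95% confidence
    half_width = z * sigma / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width

low, high = z_confidence_interval(sample_mean=10.0, sigma=2.0, n=100)
```

The interval gets wider when σ increases and narrower when the sample size n increases.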
Hypothesis testing may be performed by:
 Critical regions
 P values
 Confidence intervals
 Rule of thumb: data points more than 2 or 3 standard deviations away from the mean are considered outliers
 Outliers should not be deleted.
Normality testing
 Graphical (gives insight why normality may not be appropriate)
o Kernel density plot (good for global assessment of shape, not good enough
for tails)
o Normal probability plot (good for detecting whether there are problems)
 Similar to ECDF (an improved histogram)
 Trick: transform the y-axis so that the ECDF becomes a straight line for normally distributed data
 Goodness-of-fit test (gives objective decision criterion)
o Anderson-Darling test = statistical test with objective answer
Residuals: The differences between the original observations and the fitted values
corresponding to the model.
Model diagnostics: Checks of the model assumptions for linear regression, carried out after a model has been fitted
Raw residuals: Differences $y_i - \hat{y}_i$ between the observations and the fitted values corresponding to
them (the regression model)
Studentized residuals: Scaled raw residuals
Linear regression – normality
 Normal probability plot of standardized residuals (optimal for detecting deviations)
 Kernel density plots of standardized residuals (better in detecting where deviations
 Anderson-Darling test applied to standardized residuals
Differences between exploratory and confirmatory analysis:
Exploratory analysis:
 What seems to be in the data?
 Magnitude of effects
 Simple techniques
 Generate questions (“hypothesis”)
Confirmatory analysis:
 Use the data to answer a specific question posed before collecting the data.
 “level of significance”
 Advanced techniques
 Test hypothesis