Summary Data Analytics
Data analytics for engineers (Technische Universiteit Eindhoven)
2IAB0: Data Analytics 2020

Week 1: EDA = Exploratory Data Analysis

Descriptive Data Analytics:
- Data collected now may be used later for other purposes
- Insight into the past
- We study the methods for organizing, displaying and describing data.

Predictive Data Analytics:
- Only prediction, but no indication of what we should do
- Looking into the future

Prescriptive Data Analytics:
- Data-driven advice on how to take action to influence or change the future

Data forms/types:
- Categorical (not numbers)
  o Dichotomous (yes/no, male/female)
  o Nominal: no ordering (genre, nationality)
  o Ordinal: has ordering (ratings: bad, neutral, good)
- Numerical (numbers)
  o Interval: no fixed 'zero point', only differences have a meaning (release date, world ranking, temperature)
  o Ratio: has a fixed 'zero point' (budget, athlete weight)

Charts/Graphs are good:
- For making comparisons
- To get a quick view
- Usually used for categorical data
- Useful tool to discover unexpected relations

Tables are good:
- For reading off values
- To draw attention to actual sizes

Reference table: stores all data in a table so that it can be looked up easily; used to organize and store raw data.
Demonstration table: table to illustrate a point (so present just enough data); short and designed to make a particular, succinct point.

Scatter plots:
- Good for showing actual values and structure of numerical variables
- Not suitable for large data sets
- The jitter option (i.e., a slight change in horizontal displacement) may help to avoid overlapping dots.

Graphs:
- Dot plots/strip plots:
  o Show actual values/structure of numerical variables
  o Not suitable for large data sets
- Histogram:
  o A histogram is for NUMERICAL data
  o Easily judge distributional properties like asymmetry
  o The range of data values is split into bins (= intervals of values)
  o The histogram shows the number of observations in the dataset for every bin
  o Choose √n as the number of bins, where n is the number of observations
  o No gaps between the bars
- Bar chart:
  o Distribution of CATEGORICAL data
  o Gaps between the bars
- KDP: Kernel Density Plot
  o Improved histograms
  o Useful to illustrate thresholds
  o Overcomes the drawbacks of histograms because they do not have fixed bins: they use moving bins instead
  o Good tool to explore distribution shapes
  o Good for detailed inspection of the shape of the data

Location Statistics:
- Mean: average
- Median: middle value
- Mode: most frequent value
- 1st quartile: cut-off point for 25% of the data
- 2nd quartile: cut-off point for 50% of the data (median)
- 3rd quartile: cut-off point for 75% of the data

Scale Statistics: (the higher these statistics are, the bigger the spread in the data)
- Range: max − min
- Interquartile range (IQR): 3rd quartile minus 1st quartile
- Sample variance: s² = (1/(n−1)) · Σᵢ₌₁ⁿ (xᵢ − x̄)²
  o Involves the differences between the observations and their mean
- Sample standard deviation = √(sample variance)
- If all observations are equal, the sample variance and standard deviation are 0; they are never negative.
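Example code (Python): a minimal sketch, not from the course material, computing the location and scale statistics above with only the standard library; the sample data reuses the ECDF example from Week 6.

    import statistics

    data = [3, 4, 6, 7, 7, 10, 15, 22, 34, 40]   # small example sample
    n = len(data)

    mean = statistics.mean(data)                  # average
    median = statistics.median(data)              # middle value
    mode = statistics.mode(data)                  # most frequent value (here: 7)
    q1, q2, q3 = statistics.quantiles(data, n=4)  # 25%, 50% and 75% cut-off points

    range_ = max(data) - min(data)                # range: max - min
    iqr = q3 - q1                                 # interquartile range
    s2 = statistics.variance(data)                # sample variance, divides by n - 1
    s = statistics.stdev(data)                    # sample standard deviation = sqrt(variance)

    print(mean, median, mode, q1, q3, range_, iqr, s2, s)

Note that statistics.variance and statistics.stdev already use the n−1 divisor from the formula above.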
Standardization and Z-scores:
- z = (individual result − mean) / (sample standard deviation)
- Transforms data in their original units into the universal statistical unit of standard deviations from the mean.
- Negative z-score: the value is below the mean
- Positive z-score: the value is above the mean
- Rule of thumb: observations with a z-score larger than 2.5 (in absolute value) are considered to be extreme ('outliers') — see the sketch below.

Allowed actions for data types:

  Ok to compute               Nominal  Ordinal  Interval  Ratio
  Frequency distribution      Yes      Yes      Yes      Yes
  Median and percentiles      No       Yes      Yes      Yes
  Add or subtract             No       No       Yes      Yes
  Mean, standard deviation    No       No       Yes      Yes
  Ratio                       No       No       No       Yes

Box-and-Whisker Plot: (quartiles and IQR)
- Better than histograms/kernel density estimators for comparing groups
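Example code (Python): a short sketch of the z-score computation and the |z| > 2.5 outlier rule of thumb above; the sample data is invented.

    import statistics

    data = [3, 4, 6, 7, 7, 10, 15, 22, 34, 40]   # invented sample
    mean = statistics.mean(data)
    sd = statistics.stdev(data)

    # z-score: (individual result - mean) / sample standard deviation
    z_scores = [(x - mean) / sd for x in data]

    # rule of thumb: |z| > 2.5 is considered extreme ('outlier')
    outliers = [x for x, z in zip(data, z_scores) if abs(z) > 2.5]
    print(outliers)   # [] for this sample: no value is more than 2.5 sd from the mean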
Week 2: VIS = Visualization

Visualization is the process that transforms (abstract) data into (interactive) graphical representations for the purpose of exploration, confirmation or communication.

Why visualize?
- Communication: to inform humans
  o Annotations in a graph: highlight important aspects
  o Compare data: 2 graphs (upper and lower) with comparable axes and scales
  o Support specific analysis: select and summarize the data
- Exploration: when questions are not well-defined; generate hypotheses
  o Display all the data: allows different perspectives on the dataset
  o Not suitable for testing a specific hypothesis

When NOT to visualize?
- Well-defined question on a well-defined dataset
- Decision needed in minimal time

Summary so far:
- Visualizations can be made to explore and communicate data
- A good data visualization:
  o Combines the strengths of humans and computers
  o Makes data accessible
  o Effectively enables insight
  o Truthfully communicates

Analyze (Action):
- Visualization for consuming information:
  o For end-users, often non-technical
  o Needs a good balance of details and message
  o Design matters more than for other tasks
  o Dataset will not be changed
- Visualization for producing data:
  o Extends the dataset
  o Usually interactive, more technical users
  o Additional meta-data (semantics)
  o Additional data (volume)
  o Additional dimensions (derived)

Search (Action):
- Look up: you know what to look for and where
- Browse: you don't know what, you look around
- Locate: you know what, but not where
- Explore: you don't know what, nor where

Targets:
- We look at all data
  o Trends or patterns define the 'mainstream'
  o Outliers stand out from the mainstream
  o Features are task-dependent structures of interest
- We look at attributes only
  o One: analyze the distribution or extremes
  o Many: analyze dependency, correlation or similarity

Our brain plays an active role in what we see: it automatically groups objects according to distance, color and shape. We are naturally faster and more accurate with position than with color.

Colors:
- Perception of color begins with 3 specialized retinal cells containing pigments with different spectral sensitivities, known as cone cells
- The output of these cells in the retina is used in the visual cortex and associative areas of the brain

A key attribute acts as an index that is used to look up value attributes.
- Synonym for key attribute: independent attribute
- Synonym for value attribute: dependent attribute

Data visualization makes use of:
- Marks (geometric primitives)
  o Points
  o Lines
  o Areas
  o Complex shapes
- Channels (appearance of marks)
  o Position
  o Color
  o Length, size
  o Shape, orientation, curvature

Arrangement is about how data values determine the position and alignment of visual representatives.
- Express: attribute is mapped to spatial position along an axis
- Separate: emphasize similarity and distinction
- Order: emphasize order
- Align: emphasize quantitative comparison
- Use: using an existing structure for arrangement

Munzner's reference model of visualization:
- What? What is the data and how is it structured?
- Why? What are the actions and targets of the visualization?
- How? What is the mapping between data items and visuals?

Summary so far:
- The reference model allows you to:
  o Separately analyse why, what and how to visualize
  o Follow a process to construct effective visualizations
- Why: tasks, targets and actions
- What: dataset and data attributes
- How: encoding means arrangement (position) and mapping (other channels)

The reference framework uses the word idiom to indicate chart types. Idioms do not serve all tasks equally, so choose them carefully.

Elements of a chart:
- Data is expressed in visual elements, but that is not enough for a chart; we also need:
  o Coordinate system and scaling of the data
  o Axes and grid (help reading the chart)
  o Legend (provides mapping information)
  o Annotations (highlight or emphasize)
  o Layout and layering (support tasks such as comparison)

Keys and values:
- Value
  o Dependent attribute, or value of a cell in a table
- Key
  o Independent attribute
  o Used as a unique index to look up items
  o Simple tables: 1 key
  o Multidimensional tables: multiple keys

Bar vs Line charts: depends on the type of the key attribute (see the sketch at the end of this week's notes)
- Bar chart if the key is categorical (nominal)
- Line chart if the key is ordered

The difference between open exploration, confirmation and communication is the purpose of the visualization.
Angles of 0, 90, 76, 137, 180, 223 and 270 degrees are similar in their accuracy when visualizing data.
Explorative visualization should be used when there is (a lot of) data that you do not understand well.
A line chart fits a dataset with two time series.
The common task of a histogram is to find a distribution.
The most important visual attribute for visual encoding is mark position.
Studies have shown that lengths of shapes are easier to compare than areas, and areas easier than volumes.
A chart title is NOT a common accessibility issue for data visualizations.
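Example code (Python): a sketch of the bar-vs-line idiom choice above, assuming matplotlib is installed; both datasets are invented for illustration.

    import matplotlib.pyplot as plt

    # categorical (nominal) key -> bar chart
    genres = ['action', 'comedy', 'drama']
    counts = [12, 7, 9]

    # ordered key (time) -> line chart
    years = [2016, 2017, 2018, 2019, 2020]
    sales = [5.1, 5.8, 6.4, 6.1, 7.0]

    fig, (ax1, ax2) = plt.subplots(1, 2)
    ax1.bar(genres, counts)      # gaps between bars, categorical key
    ax1.set_title('Bar chart: nominal key')
    ax2.plot(years, sales)       # connected line, ordered key
    ax2.set_title('Line chart: ordered key')
    plt.show()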
Week 3: DMM = Data Mining Methods

Data mining = distilling information from raw data
- Supervised vs unsupervised methods: predefined target vs no predefined target
- Global vs local models: searching for information on all observations vs on only a few
- Data mining often finds correlations, not causations

There are 4 methods of data mining: Linear Regression, Clustering, Decision Tree Mining, Association Rule Learning

Linear regression — global, supervised
So we have a specific target in mind and it tries to say something about the whole dataset.
- Goal: find a linear model that relates an output variable to one or more input variables.
- Predict the output values from a linear model over the input values
- The lower the Sum of the Squared Deviations (SSD), the better the linear model
  o SSD = Σᵢ₌₁ⁿ (yᵢ − (β₀ + β₁·xᵢ))²
- Alternative measure of quality: R² = percentage of the variation in the output values that is explained by the variation in the input values
  o The higher, the better the quality of the linear regression
- May have multiple inputs
- 'Linear' refers to the way the parameters occur, not to the form of the regression model!
- Add a cross term to the model when two input columns amplify or dampen each other's effect on the output
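Example code (Python): a minimal sketch of simple linear regression, assuming numpy is installed; it computes the SSD and R² measures described above on invented data.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # invented output values

    # least-squares fit of y = b0 + b1*x
    b1, b0 = np.polyfit(x, y, 1)

    fitted = b0 + b1 * x
    ssd = np.sum((y - fitted) ** 2)            # sum of squared deviations: lower is better

    # R^2: fraction of the variation in y explained by the model (higher is better)
    r2 = 1 - ssd / np.sum((y - np.mean(y)) ** 2)
    print(b0, b1, ssd, r2)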
Clustering — global, unsupervised
We don't have a specific target and we try to say something about the whole dataset.
- Goal: find a natural grouping of observations.
- Can also be applied where labels are unknown, uncertain or too expensive to obtain
- Each cluster is represented by one point: the centroid
- Clustering algorithm: k-means (a convenient way to find clusters)
  1. Pick k random points as the initial centroids
  2. Assign each point to the nearest centroid
  3. Recompute the centroids by moving each centroid to the mean of its observations, and repeat the steps
- Strive for a low within-cluster distance
- Good clusterings try to find a balance between small within-cluster distances and the number of clusters.

Distance:
- Euclidean distance: movement is unrestricted
- Network distance: network of possible movements
- Manhattan distance: network is a grid
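Example code (Python): a sketch of the k-means steps above, assuming numpy is installed; the observations are randomly generated, the number of iterations is fixed for simplicity, and Euclidean distance is used.

    import numpy as np

    rng = np.random.default_rng(0)
    points = rng.normal(size=(100, 2))   # invented 2-D observations
    k = 3

    # step 1: pick k random points as the initial centroids
    centroids = points[rng.choice(len(points), k, replace=False)]

    for _ in range(10):   # repeat steps 2 and 3 a fixed number of times
        # step 2: assign every point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: move each centroid to the mean of its observations
        # (a centroid that lost all its points is left where it is)
        centroids = np.array([points[labels == j].mean(axis=0)
                              if np.any(labels == j) else centroids[j]
                              for j in range(k)])

    # good clusterings strive for a low within-cluster distance
    within = sum(np.linalg.norm(points[labels == j] - centroids[j], axis=1).sum()
                 for j in range(k))
    print(centroids, within)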
Decision Tree Mining — global, supervised
So we have a specific target in mind and it tries to say something about the whole dataset.
- Goal: separate successful from unsuccessful observations by a few, interpretable decisions.
- True positives and true negatives are good; false positives and false negatives are bad
- Internal node: a test on an attribute (the nodes in decision trees)
- Branch: an outcome of that test
- Leaf: a class label distribution
- Decision trees are built by finding the splits that have the highest information gain (entropy)

Association Rule Learning — local, unsupervised
So the results do not say something about the whole dataset and we do not have a specific goal in mind.
- Goal: identify strong rules, i.e., an event that is strongly associated with another event in the data.
- Rules of the form X → Y are strong if both apply:
  o Their support (= how often does the combination of X and Y occur in the dataset?) is high enough: frequent itemsets
    (in the course examples, 'frequent' means occurring at least 3 times)
  o Their confidence (= if X occurs, how often does Y also occur?) is high enough

Week 4: ORG = Organization

Data needs to be stored, used and manipulated in many types of applications:
- Scientific: biology, chemistry
- Technical/engineering: air traffic control, embedded systems
- Administrative: banking, student administration
- Document-oriented: like Google

Object system: the 'real' world of a company, organization or experiment, with people, machines, products, etc.
Information system: a representation (always an approximation) of the real world in a computer system, using data to represent objects such as people, machines, products, etc.

Primary key: a minimal set of attributes of a table that uniquely identifies each row of this table. If we leave only the 'key columns' in the table, then there are no repetitions of values.

A Database Management System (DBMS) is a very common type of information system and is designed to support systematic, principled solutions to:
- Data redundancy and inconsistency
- Data security
- Expressive and efficient data analytics

The relational database model organizes data into one or more tables of columns and rows, with a unique key identifying each row. It describes data and their relationships.
- Instance: the actual content of the database at a particular point in time
- Table (relation instance): contains multiple instances with their attributes
- Logical schema (data model): the logical structure of the database; contains multiple tables

Entity-relationship model: we model the relationships between the entities; it is entity-specific, so it only represents the values in the tables.
- Entity: an object that exists and is distinguishable from other objects, described by attributes
- Entities have attributes, like people have names and addresses
- Domain: the set of permitted values for each attribute

E-R diagram syntax:
- Rectangles represent entity sets
- Diamonds represent relationship sets
- Lines link entity sets to relationship sets
- Attributes are listed inside the entity rectangles
- Underline indicates a unique ID (primary key)
- An attribute such as 'date' on a relationship indicates an attribute associated with a relationship set

A database represents the information of a particular domain:
- Determine the information needs and the users
- Design a conceptual model for this information
- Determine the functional requirements for the system

Example code (SQL):

SELECT DISTINCT instructor.instructor_id, instructor.name AS name
FROM instructor, advisor, student
WHERE instructor.instructor_id = advisor.instructor_ID
  AND student.student_id = advisor.student_ID

- SELECT: lists the attributes to retrieve
  o Using '*' shows all attributes
  o Operators can be used on numeric attributes
- FROM: lists the tables from which we query
- WHERE: defines a predicate
  o AND, OR, NOT and parentheses can be used
  o For comparison: =, <, <=, <>
  o '%' represents zero, one or multiple characters
  o '_' represents a single character
  o Can be used for linking data from different tables (see the example code)
- DISTINCT: removes duplicates (insert after SELECT)
- AS: can be used in the SELECT and FROM statements to rename attributes/tables (the new name can then also be used elsewhere in the query)
- GROUP BY: choose the attribute to group on
- HAVING: a WHERE-like condition applied to the groups formed by GROUP BY
  o WHERE is evaluated first, then HAVING on the groups (see the sketch below)
- UNION, INTERSECT and EXCEPT can be used on two queries
- Aggregate functions: COUNT, MIN, MAX, AVG, SUM

Order of operations in SQL: FROM → WHERE → GROUP BY → HAVING → SELECT → UNION / INTERSECT / EXCEPT
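Example code (Python): a runnable sketch of GROUP BY / HAVING and the evaluation order above, using Python's built-in sqlite3 module; the enrollment table and its values are invented, not the course's schema.

    import sqlite3

    con = sqlite3.connect(':memory:')
    con.execute('CREATE TABLE enrollment (student_id INT, dept TEXT)')
    con.executemany('INSERT INTO enrollment VALUES (?, ?)',
                    [(1, 'math'), (2, 'math'), (3, 'math'), (4, 'physics')])

    # FROM provides the rows, WHERE would filter them first,
    # then GROUP BY forms groups, HAVING filters the groups, SELECT is last
    query = """
        SELECT dept, COUNT(*) AS n
        FROM enrollment
        GROUP BY dept
        HAVING COUNT(*) >= 2
    """
    print(con.execute(query).fetchall())   # [('math', 3)]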
Week 5: DAS = Data Aggregation and Sampling

Data sampling: any process of reducing the size of a data set by selecting a subset of the data.
Data aggregation: any process in which the available data is partitioned and the data in a single partition is expressed in a summary form by means of one or more extracted features.

All information that is a priori considered to be potentially useful is usually stored, to minimize the risk that essential information is missing. The problem statement determines which information is actually needed. Some information may have to be made explicit.

Fitts's law: states that the average time needed to select a target is linearly related to the index of difficulty ID = log₂(1 + D/W), where W = width of the target and D = distance (target amplitude).

Primary data: collected by you/your team.
Secondary data: collected by others.

Deductive reasoning: if the premises are true, the conclusion is valid.
Inductive reasoning: the opposite of deductive reasoning; makes broad generalizations from specific observations.

Occam's razor / parsimony principle: when faced with two possible explanations, the simpler of the two is the one most likely to be true; a simpler explanation of the phenomenon is preferred.
- A model with fewer variables is preferred if it fits the data equally well.
- There should be as few independent variables as possible in a regression model.

Measurements or conclusions are:
- Valid when they accurately describe the world
  o Internal validity: are the conclusions valid within the study?
  o External validity: can the conclusions of a scientific study be applied beyond the context of the study?
- Reliable when the same results are obtained under the same conditions.
- Reproducibility: the ability of others to replicate the findings. If others do not get similar measurements/conclusions, the analysis results will not be accepted.

Random errors: do not form any pattern.
Systematic errors: consistent errors, like offset errors or scale errors.
Precision: refers to the errors introduced by the measuring instrument.
Accuracy: refers to deviations from the real values (systematic errors).

Data cleaning: a process of detecting, diagnosing and editing faulty data.

Types of problems with data:
- Incomplete (missing) data
- Inconsistent data
- Invalid data

Handling inconsistent, invalid or missing data:
- Discard rows with at least 1 inconsistent, invalid or missing value, or discard a column with lots of such values.
- Impute values => fill in estimated values.
- Work in the presence of missing data.

Time series: a sequence of pairs (tₙ, xₙ), where tₙ is the observation time, xₙ is the observed value and tₙ < tₙ₊₁. For equispaced time series the spacing of observation times is constant; its reciprocal is called the sampling frequency.

Noise: an unwanted disturbance in time-series data.

Median filter: good at filtering away outliers. Choose a window size, for example 3; the filter takes the median of the 3 values in the window, moves the window by 1, and so computes the medians one by one.
Mean filter (moving average): works the same as the median filter but computes the mean instead of the median. It is more sensitive to outliers. Values far away have the same influence as values nearby.
Gaussian filter: consider a pair (tₙ, xₙ) from the time series and assign weights wᵢ = T(tᵢ − tₙ, ω) to each value xᵢ: the further away, the lower the weight. Compute the filtered value as x̂ₙ = Σᵢ wᵢ·xᵢ, where Σᵢ wᵢ = 1. The parameter ω indicates the width of the applied kernel.

When computing derivatives: the number of zero crossings of the kernel is equal to the derivative order. Even order = symmetric, odd order = asymmetric.

Convolution filters: the same idea as the Gaussian filter, but using differently shaped functions to generate the weights.
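Example code (Python): a minimal sketch of the median and mean filters described above, standard library only; the time series is invented and contains one artificial outlier (15).

    import statistics

    series = [2, 3, 2, 15, 3, 4, 3, 2]   # invented series with one outlier
    w = 3                                 # window size

    # median filter: robust against outliers
    median_filtered = [statistics.median(series[i:i + w])
                       for i in range(len(series) - w + 1)]

    # mean filter (moving average): more sensitive to outliers
    mean_filtered = [statistics.mean(series[i:i + w])
                     for i in range(len(series) - w + 1)]

    print(median_filtered)   # the outlier is largely removed
    print(mean_filtered)     # the outlier still pulls neighbouring values up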
A sample is any part of the population. It is biased if some part of the population is overrepresented compared to others.
- Convenience sampling: going for the data that is easiest to collect.
  o Advantages: saves time, effort, money, etc.
  o Disadvantages: possible bias, which is a threat to external validity.
- Random sampling: each individual is equally likely to be included in the sample, ignoring any knowledge of the population.
- Stratified random sampling:
  o Define strata: disjoint parts forming the whole target population
  o Define the sample size
  o Proportionate stratified random sampling: take a random sample from every stratum in a proportion equal to the proportion of this stratum in the population.
  o Disproportionate stratified random sampling: when you want to overrepresent particular strata in the sample.
- Voluntary sampling: individuals select themselves; this gives self-selection bias, the effects of which are difficult to measure.

Feature generation: identifying informative variables.
- Goal: reduce the number of data/variables in the dataset by creating new, more informative variables from existing ones.

Independent variables: used to see how a change of their value is reflected in a change of the value of a dependent variable. They can be correlated with the dependent variable, but that does not make the relationship causal. Independent variables are on the x-axis in regression models.
Confounding variable: a variable that is not taken into account and that can provide an explanation for the observed effect of the independent variable on the dependent variable.

Week 6: HYP = Hypothesis

Probability theory: a useful mathematical tool to express uncertainty. Probability is a mathematical notion: a number between 0 and 1 that indicates how likely an outcome is. The higher the value, the more likely the outcome. P stands for probability.

2 reasons to study more than just averages:
- Data may be very skewed, so averages may be misleading
- The average itself is not interesting in a given context: we are not interested in the average person, but in what most people do/need

Modeling successes (yes/no data) is like modeling coin flips, but the coin may be biased, so heads and tails may not be equally likely.

General formula: n independent observations, each with success probability p; X is the number of successes in the n observations.
- P(X = k) = C(n, k) · pᵏ · (1 − p)ⁿ⁻ᵏ, for k = 0, 1, ..., n, where C(n, k) is the binomial coefficient ('n choose k')
- This gives the probability of a single number of successes, also known as the Probability Mass Function (PMF)
- It is also useful to look at cumulative probabilities:
  o P(X ≤ l) = Σₖ₌₀ˡ C(n, k) · pᵏ · (1 − p)ⁿ⁻ᵏ
  o P(X = k) = P(X ≤ k) − P(X ≤ k − 1)
- This is the binomial probability distribution: X ~ Bin(n, p)

Other probability distributions:
- Uniform distribution: all outcomes have the same probability. P(X = k) = 1/n, for k = 1, ..., n
- Geometric distribution: how many tries you need before the first success. P(X = k) = (1 − p)ᵏ · p, for k = 0, 1, ...
- Poisson distribution: for modeling counts of rare events; k can be arbitrarily large. P(X = k) = e⁻ᵏ·λᵏ / k! with rate λ, for k = 0, 1, 2, ... (the exponent of e is −λ)
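Example code (Python): a sketch of the binomial PMF and cumulative probabilities above, plus the Poisson PMF, standard library only; n, p and λ are invented example parameters.

    from math import comb, exp, factorial

    n, p = 10, 0.5   # invented example parameters

    # binomial PMF: P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)
    def binom_pmf(k):
        return comb(n, k) * p**k * (1 - p)**(n - k)

    # cumulative probability: P(X <= l) = sum of the PMF for k = 0..l
    def binom_cdf(l):
        return sum(binom_pmf(k) for k in range(l + 1))

    # P(X = k) can be recovered as P(X <= k) - P(X <= k - 1)
    print(binom_pmf(3), binom_cdf(3) - binom_cdf(2))   # both ~0.117

    # Poisson PMF for counts of rare events: P(X = k) = e^(-lam) * lam^k / k!
    lam = 2.0
    print([exp(-lam) * lam**k / factorial(k) for k in range(5)])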
Continuous distribution (density): individual values have no direct interpretation; only the area under the curve has meaning, so sums become integrals.

Normal distribution: when we let the number of observations increase indefinitely (in the limit to infinity), the binomial distribution changes into a probability distribution called the normal distribution / Gaussian distribution / bell curve, which:
- Assumes all values on the real line
- Has probability 0 for individual values
- Has an integral representation to describe cumulative probabilities
- Notation: N(μ, σ²)
- Is symmetric around the mean
- μ = 0 and σ² = 1 is called the standard normal distribution
- μ is the mean (location parameter), to be estimated by the sample mean x̄
- σ² is the variance (larger values indicate more spread), to be estimated by the sample variance

ECDF: a function that for a given value x returns the fraction of observations that are smaller than or equal to x.
Example: for the data 3, 4, 6, 7, 7, 10, 15, 22, 34, 40 we have ECDF(22) = 0.80, since 80% of the measurements are smaller than or equal to 22.

Confidence intervals: intervals that, with a certain probability, contain the true, unknown value that we wish to estimate.
- The wider the interval, the more uncertainty there is about the true value
- The width of the interval increases when the standard deviation σ increases (less certainty)
- The width of the interval decreases when the sample size n increases (more certainty)
- Confidence intervals also exist for:
  o The difference of two means (normal distributions)
  o The difference of two proportions (binomial distributions)

Summary so far:
- The binomial distribution is a discrete probability distribution that counts the number of successes
- When the number of observations is large, the binomial distribution is well approximated by a continuous probability distribution called the normal distribution
- Confidence intervals indicate how certain we are about a value that we estimated from data: a wider interval means more uncertainty

(Statistical) hypothesis testing: using probability theory to answer questions objectively. A hypothesis is rejected when the outcome of the experiment would be too unlikely (too low a probability) if the null hypothesis were true. A null hypothesis is, for example, 'someone is not guilty' (not guilty until proven guilty).

Test statistic: a properly scaled difference of the sample proportions.
- If the value of the statistic is too extreme, H0 is rejected
- If not, don't reject H0
P-value: the usual threshold is 0.05; H0 is rejected when the p-value is below 0.05.

Choosing hypothesis tests:
- Is it one-sample, paired-sample or two-sample?
- Is it about proportions or means?
- Is the alternative two-sided or one-sided?
- Is the data normally distributed? If not, is the sample size 'large'?

One-sided interval: Ha states > or < a value.
Two-sided interval: Ha states ≠ a value.
Two-sided test: the hypotheses also contain an unknown factor.

One-sample case: x̄ ± z · σ/√n, where z is the normal quantile zα/2 (z0.025 = 1.96; z0.05 = 1.645) — see the sketch at the end of these notes.

Hypothesis testing may be performed by:
- Critical regions
- P-values
- Confidence intervals

Outliers: rule of thumb: data points more than 2 or 3 standard deviations away from the mean are suspect. They should not simply be deleted.

Normality testing:
- Graphical (gives insight into why normality may not be appropriate)
  o Kernel density plot (good for a global assessment of the shape, not good enough for the tails)
  o Normal probability plot (good for detecting whether there are problems)
    Similar to the ECDF (an improved histogram); trick: transform the y-axis so that the ECDF becomes a straight line
- Goodness-of-fit test (gives an objective decision criterion)
  o Anderson-Darling test = a statistical test with an objective answer

Residuals: the differences between the original observations and the fitted values corresponding to the model.
Model diagnostics: checking the model assumptions for linear regression after a model has been fitted.
- Raw residuals: the differences yᵢ − ŷᵢ between the observations and the fitted values corresponding to them (the regression model)
- Studentized residuals: scaled raw residuals

Linear regression — normality tools:
- Normal probability plot of the standardized residuals (optimal for detecting deviations)
- Kernel density plot of the standardized residuals (better at detecting where deviations occur)
- Anderson-Darling test applied to the standardized residuals

Differences between exploratory and confirmatory analysis:

  Exploratory                          Confirmatory
  What seems to be in the data?        Use the data to answer a specific question posed before collecting the data.
  Magnitude of effects                 'Level of significance'
  Simple techniques                    Advanced techniques
  Generate questions ('hypotheses')    Test hypotheses
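Example code (Python): a sketch of the one-sample confidence interval x̄ ± z·σ/√n and the ECDF, standard library only; the sample reuses the ECDF example data, and the sample standard deviation is used as an estimate of σ (for small samples a t-quantile would be more appropriate).

    import statistics
    from math import sqrt

    data = [3, 4, 6, 7, 7, 10, 15, 22, 34, 40]   # sample from the ECDF example
    n = len(data)
    xbar = statistics.mean(data)
    s = statistics.stdev(data)     # estimate of sigma

    z = 1.96                       # z_{0.025}, for a 95% two-sided interval
    half_width = z * s / sqrt(n)   # width shrinks as n grows, widens with sigma
    print((xbar - half_width, xbar + half_width))

    # ECDF: fraction of observations smaller than or equal to x
    def ecdf(x):
        return sum(v <= x for v in data) / n

    print(ecdf(22))   # 0.8, matching the example above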