Data Correlation
Dr. Amira Abdelatey

Correlation
Key Concepts:
• Types of correlation
• Scatter diagram
• Heat map
• Degree of correlation

Correlation
◼ Correlation is a statistical method for measuring and analyzing the degree of relationship between two variables.
◼ The measure of correlation is called the correlation coefficient.
◼ The degree of relationship is expressed by a coefficient r that ranges from -1 to +1 (-1 ≤ r ≤ +1).
◼ The direction of change is indicated by the sign of the coefficient.
◼ Correlation analysis gives us an idea of the degree and direction of the relationship between the two variables under study.

Correlation
• Purpose:
  o Predict a value by converting an input (x) to an output (f(x)). We can also say that a function uses the relationship between two variables for prediction.
  o Describe the relation between two variables (or among several variables).
• The table shows the amount of time spent studying and the grade earned on a test for 10 students. These variables are correlated.
• A scatterplot can be used to visualize this relationship.

Correlation
• Correlation is a relationship between two quantitative variables.
• The correlation coefficient measures how strong the correlation between two variables is. It is always between -1 and 1:
  o -1 indicates a perfect negative linear correlation
  o 0 indicates no linear correlation (the variables may still be related in a nonlinear way, so 0 alone does not mean they are independent)
  o 1 indicates a perfect positive linear correlation

Correlation Type
• Positive correlation
• Negative correlation

Correlation Type
◼ Positive Correlation: The correlation is said to be positive if the values of the two variables change in the same direction.
  ◼ As X increases, Y increases
  ◼ As X decreases, Y decreases
  ◼ E.g., as height increases, so does weight.
◼ Negative Correlation: The correlation is said to be negative if the values of the two variables change in opposite directions.
  ◼ As X increases, Y decreases
  ◼ As X decreases, Y increases
  ◼ E.g., as TV time increases, grades decrease.

Pearson correlation coefficient (r)
$$ r = \frac{N\sum xy - (\sum x)(\sum y)}{\sqrt{\left[N\sum x^2 - (\sum x)^2\right]\left[N\sum y^2 - (\sum y)^2\right]}} $$
• Where:
  N = the number of pairs of scores
  Σxy = the sum of the products of paired scores
  Σx = the sum of x scores
  Σy = the sum of y scores
  Σx² = the sum of squared x scores
  Σy² = the sum of squared y scores

Pearson correlation Strength

When to use the Pearson correlation coefficient
• Both variables are quantitative: You will need to use a different method if either of the variables is qualitative.
• The variables are normally distributed: You can create a histogram of each variable to verify whether the distributions are approximately normal. It’s not a problem if the variables are a little non-normal.
• The data have no outliers: Outliers are observations that don’t follow the same patterns as the rest of the data. A scatterplot is one way to check for outliers: look for points that are far away from the others.
• The relationship is linear: “Linear” means that the relationship between the two variables can be described reasonably well by a straight line. You can use a scatterplot to check whether the relationship between two variables is linear.

A Perfect Linear Relationship (Correlation Coefficient = 1)

A Perfect Negative Linear Relationship (Correlation Coefficient = -1)
• Insight about correlation: If we work longer hours, we tend to burn fewer calories because we are exhausted before the training session.

No Linear Relationship (Correlation Coefficient = 0)
• There is no linear relationship between the two variables: a longer training session does not lead to a higher Max_Pulse.

Correlation Matrix
• A correlation matrix is simply a table showing the correlation coefficients between variables.
  The variables are listed in the first row and in the first column.

Insights from correlation matrix
• We observe that Duration and Calorie_Burnage are closely related, with a correlation coefficient of 0.89. This makes sense: the longer we train, the more calories we burn.
• We observe that there is almost no linear relationship between Average_Pulse and Calorie_Burnage (correlation coefficient of 0.02).
• Can we conclude that Average_Pulse does not affect Calorie_Burnage?

Correlation matrix in Python

Heatmap
• Using a heatmap to visualize the correlation between variables.

Correlation vs. Causality
• “Correlation does not imply causation.”
• Correlation tests for a relationship between two variables.
• However, seeing two variables move together does not necessarily tell us whether one variable causes the other.

Correlation vs. Causality
Classic example:
• During the summer, ice cream sales at a beach increase.
• At the same time, drowning accidents also increase.
Does this mean that the increase in ice cream sales directly causes the increase in drowning accidents?
No. Both simply happen more often in summer, so the two variables are correlated without one causing the other.

Different r values

Spearman correlation coefficient
• Spearman’s rank correlation coefficient is another widely used correlation coefficient. It’s a better choice than the Pearson correlation coefficient when one or more of the following is true:
  o The variables are ordinal.
  o The variables aren’t normally distributed.
  o The data include outliers.
  o The relationship between the variables is nonlinear and monotonic.

Scatter Plots of Data with Various Correlation Coefficients
[Figure: six scatter plots of Y against X, with r = -1, r = -0.6, r = 0, r = +1, r = +0.3, and r = 0]

Linear Correlation
[Figure: scatter plots of Y against X contrasting linear relationships with curvilinear relationships]

Linear Correlation
[Figure: scatter plots of Y against X contrasting strong relationships with weak relationships]

Linear Correlation
[Figure: scatter plots of Y against X showing no relationship]
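
Worked example in Python
The sketch below is one possible way to compute the coefficients discussed in these slides, using only the Python standard library. It builds Pearson's r from the raw-score sums N, Σx, Σy, Σxy, Σx², Σy², obtains Spearman's coefficient by applying the same calculation to ranks, and assembles a small correlation matrix. The function names (pearson_r, average_ranks, spearman_rho, correlation_matrix) and the data values are hypothetical, made up for illustration; they are not taken from the lecture's data set.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson r computed from the raw-score sums N, Σx, Σy, Σxy, Σx², Σy²."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("r is undefined when one variable is constant")
    return numerator / denominator


def average_ranks(values):
    """Rank values from 1 to N; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's coefficient: Pearson r applied to the ranks of x and y."""
    return pearson_r(average_ranks(x), average_ranks(y))


def correlation_matrix(columns, corr=pearson_r):
    """Nested dict with the coefficient for every pair of named columns."""
    names = list(columns)
    return {a: {b: corr(columns[a], columns[b]) for b in names} for a in names}


# Made-up illustrative data (not the lecture's table).
hours = [1, 2, 3, 4, 5, 6, 7, 8]             # hours spent studying
grade = [52, 55, 61, 64, 70, 74, 79, 85]     # rises with hours
tv_hours = [8, 7, 6, 5, 4, 3, 2, 1]          # falls exactly linearly with hours
cubed = [h ** 3 for h in hours]              # nonlinear but monotonic in hours

print("Pearson  hours vs grade   :", round(pearson_r(hours, grade), 3))
print("Pearson  hours vs tv_hours:", round(pearson_r(hours, tv_hours), 3))
print("Pearson  hours vs cubed   :", round(pearson_r(hours, cubed), 3))
print("Spearman hours vs cubed   :", round(spearman_rho(hours, cubed), 3))

matrix = correlation_matrix({"hours": hours, "grade": grade, "tv_hours": tv_hours})
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

The hours/cubed pair illustrates why Spearman's coefficient suits nonlinear monotonic relationships: the ranks agree perfectly even though the points do not lie on a straight line, so Pearson's r stays below 1. If pandas is available, DataFrame.corr() returns the same kind of table for a whole data set, and a plotting library such as seaborn can draw it as a heatmap.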