
Lect7-correlation and causality

Data Correlation
Dr. Amira Abdelatey
Correlation
Key Concepts:
• Types of correlation
• Scatter diagram
• Heat Map
• Degree of Correlation
Correlation
◼ Correlation is a statistical method that helps to measure and analyze the degree of relationship between two variables.
◼ The measure of correlation is called the correlation coefficient.
◼ The degree of relationship is expressed by a coefficient that ranges from -1 to +1 (-1 ≤ r ≤ +1).
◼ The direction of change is indicated by the sign of the coefficient.
◼ Correlation analysis enables us to have an idea about the degree and direction of the relationship between the two variables under study.
Correlation
• Purpose:
o Predict a value by converting an input (x) to an output (f(x)). We can also say that a function uses the relationship between two variables for prediction.
o Describe the relation between two variables (or multiple variables).
• The table shows the amount of time spent studying and the grade earned on a test for 10 students. These variables are correlated.
• A scatterplot can be used to visualize this relationship.
Correlation
• Correlation is a relationship between two
quantitative variables.
• The correlation coefficient measures how strong
the correlation between two variables is. It is
always between -1 and 1
o -1 indicates a perfect negative linear correlation
o 0 indicates no linear correlation (a nonlinear relationship between the variables may still exist)
o 1 indicates a perfect positive linear correlation
Correlation Type
◼ Positive Correlation: The correlation is said to be positive if the values of the two variables change in the same direction.
◼ As X is increasing, Y is increasing
◼ As X is decreasing, Y is decreasing
◼ E.g., As height increases, so does weight.
◼ Negative Correlation: The correlation is said to be negative when the values of the two variables change in opposite directions.
◼ As X is increasing, Y is decreasing
◼ As X is decreasing, Y is increasing
◼ E.g., As TV time increases, grades decrease.
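The direction of a correlation can be illustrated with a minimal sketch. The function name `correlation_direction` and all data values below are invented for illustration; they are not from the lecture. The sketch only looks at the sign of the summed products of deviations from the mean, which is the quantity that gives r its sign.

```python
def correlation_direction(xs, ys):
    """Return 'positive', 'negative', or 'none' from the sign of the co-variation."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the means:
    # > 0 when X and Y tend to move together, < 0 when they move in opposite directions.
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if co_variation > 0:
        return "positive"
    if co_variation < 0:
        return "negative"
    return "none"


# Hypothetical data, invented for illustration only.
height_cm = [150, 160, 170, 180, 190]
weight_kg = [50, 58, 66, 75, 85]
tv_hours = [1, 2, 3, 4, 5]
grades = [90, 85, 78, 70, 60]

print(correlation_direction(height_cm, weight_kg))  # positive
print(correlation_direction(tv_hours, grades))      # negative
```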
Pearson correlation coefficient (r)
• r = \frac{N \sum xy - (\sum x)(\sum y)}{\sqrt{[N \sum x^2 - (\sum x)^2][N \sum y^2 - (\sum y)^2]}}
Where:
N = the number of pairs of scores
Σxy = the sum of the products of paired scores
Σx = the sum of x scores
Σy = the sum of y scores
Σx² = the sum of squared x scores
Σy² = the sum of squared y scores
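As a sketch, r can be computed directly from the sums listed above using the standard computational form of Pearson's r: the numerator is NΣxy − ΣxΣy, and the denominator is the square root of (NΣx² − (Σx)²)(NΣy² − (Σy)²). The function name `pearson_r` and the sample data are illustrative assumptions, not taken from the lecture.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from the raw sums N, Σx, Σy, Σxy, Σx², Σy²."""
    n = len(xs)                                  # N: number of pairs of scores
    sum_x = sum(xs)                              # Σx
    sum_y = sum(ys)                              # Σy
    sum_xy = sum(x * y for x, y in zip(xs, ys))  # Σxy
    sum_x2 = sum(x * x for x in xs)              # Σx²
    sum_y2 = sum(y * y for y in ys)              # Σy²
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Hypothetical scores, invented for illustration only.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]    # rises exactly with x
y_down = [10, 8, 6, 4, 2]  # falls exactly as x rises

print(pearson_r(x, y_up))    # 1.0
print(pearson_r(x, y_down))  # -1.0
```

Note that the denominator is zero when either variable is constant, so r is undefined in that case.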
Pearson correlation Strength
When using Pearson correlation coefficient
• Both variables are quantitative: You will need to use a different method if either of the variables is qualitative.
• The variables are normally distributed: You can create a histogram of each variable to verify whether the distributions are approximately normal. It’s not a problem if the variables are a little non-normal.
• The data have no outliers: Outliers are observations that don’t follow the same patterns as the rest of the data. A scatterplot is one way to check for outliers: look for points that are far away from the others.
• The relationship is linear: “Linear” means that the relationship between the two variables can be described reasonably well by a straight line. You can use a scatterplot to check whether the relationship between two variables is linear.
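To see why the no-outliers condition matters, here is a small sketch comparing r with and without a single outlying point. The data and the helper name `pearson_r` are invented for illustration; the helper uses the deviation-from-the-mean form of Pearson's r.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r computed from deviations around the means."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(ss_x * ss_y)


# Hypothetical data that follow a roughly straight, rising line.
x_clean = [1, 2, 3, 4, 5, 6, 7, 8]
y_clean = [2, 4, 5, 4, 6, 7, 8, 9]

# The same data plus one point far from the pattern (an outlier).
x_outlier = x_clean + [9]
y_outlier = y_clean + [0]

r_clean = pearson_r(x_clean, y_clean)
r_outlier = pearson_r(x_outlier, y_outlier)
print(f"without outlier: r = {r_clean:.2f}")
print(f"with outlier:    r = {r_outlier:.2f}")
```

In this made-up data set a single extra point is enough to turn a strong correlation into a weak one, which is why a scatterplot check before computing r is worthwhile.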
A Perfect Linear Relationship (Correlation Coefficient = 1)
A Perfect Negative Linear Relationship (Correlation Coefficient = -1)
Insight about correlation: If we work longer hours, we tend to have lower calorie burnage because we are exhausted before the training session.
No Linear Relationship (Correlation coefficient = 0)
There is no linear relationship between the two variables. It means that a longer training session does not lead to a higher Max_Pulse.
Correlation Matrix
• A correlation matrix is simply a table showing the correlation coefficients between variables.
• The variables are represented in the first row and in the first column.
Insights from correlation matrix
• We observe that Duration and Calorie_Burnage are closely related, with a correlation coefficient of 0.89. This makes sense, as the longer we train, the more calories we burn.
• We observe that there is almost no linear relationship between Average_Pulse and Calorie_Burnage (correlation coefficient of 0.02).
• Can we conclude that Average_Pulse does not affect Calorie_Burnage?
Correlation matrix in python
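A minimal sketch of building a correlation matrix with pandas, assuming the data sit in a DataFrame. The variable name `health_data` and all values are invented for illustration (only the column names come from the slides), so the coefficients will not match the 0.89 and 0.02 quoted above. `DataFrame.corr()` computes pairwise Pearson coefficients by default.

```python
import pandas as pd

# Hypothetical training-session data, invented for illustration only.
health_data = pd.DataFrame(
    {
        "Duration": [30, 45, 45, 60, 60, 75, 90, 120],
        "Average_Pulse": [110, 100, 120, 105, 115, 100, 110, 105],
        "Calorie_Burnage": [240, 280, 300, 410, 400, 480, 600, 800],
    }
)

# Pairwise Pearson correlation coefficients, rounded to two decimals.
corr_matrix = health_data.corr().round(2)
print(corr_matrix)
```

The result is itself a DataFrame with the variable names on both the rows and the columns; the diagonal is 1 because every variable is perfectly correlated with itself.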
Heatmap
• Using a Heatmap to Visualize the Correlation Between Variables:
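One possible way to draw such a heatmap is sketched below with NumPy and matplotlib's `imshow`; seaborn's `heatmap` function is a common alternative. The data values are invented for illustration, and the choice of the "coolwarm" colormap is an assumption, not something specified in the lecture.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

labels = ["Duration", "Average_Pulse", "Calorie_Burnage"]

# Hypothetical data: one row per training session, one column per variable.
data = np.array(
    [
        [30, 110, 240],
        [45, 100, 280],
        [45, 120, 300],
        [60, 105, 410],
        [60, 115, 400],
        [75, 100, 480],
        [90, 110, 600],
        [120, 105, 800],
    ]
)

# rowvar=False tells NumPy that the variables are in the columns.
corr = np.corrcoef(data, rowvar=False)

fig, ax = plt.subplots()
image = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels, rotation=45, ha="right")
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)

# Write each coefficient inside its cell.
for i in range(len(labels)):
    for j in range(len(labels)):
        ax.text(j, i, f"{corr[i, j]:.2f}", ha="center", va="center")

fig.colorbar(image, ax=ax, label="Correlation coefficient")
fig.tight_layout()
# Call fig.savefig("correlation_heatmap.png") or plt.show() to view the result.
```

Fixing the color scale to the range -1 to 1 keeps colors comparable across different data sets.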
Correlation Vs. Causality
• “correlation does not imply causation.”
• Correlation tests for a relationship
between two variables.
• However, seeing two variables moving
together does not necessarily mean we
know whether one variable causes the
other to occur.
Correlation Vs. Causality
Classic Example:
• During the summer, the sale of ice cream at a beach increases.
• Simultaneously, drowning accidents also increase.
Does this mean that the increase in ice cream sales is a direct cause of the increased drowning accidents?
NO. It is likely that these two variables are accidentally correlating with each other: they are both things that happen more in summer.
Different r values
Spearman correlation coefficient
• Spearman’s rank correlation coefficient is another
widely used correlation coefficient. It’s a better
choice than the Pearson correlation coefficient
when one or more of the following is true:
• The variables are ordinal.
• The variables aren’t normally distributed.
• The data includes outliers.
• The relationship between the variables is nonlinear and monotonic.
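A minimal sketch of Spearman's coefficient, computed as Pearson's r on the ranks of the data (tied values receive their average rank). The helper names and the sample data are illustrative assumptions. The example uses a monotonic but nonlinear relationship (y = x³) to show the difference from Pearson's r.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r computed from deviations around the means."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(ss_x * ss_y)


def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1  # mean of 1-based positions start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def spearman_r(xs, ys):
    """Spearman's rank correlation: Pearson's r applied to the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


# Hypothetical monotonic but nonlinear data: y grows as the cube of x.
x = [1, 2, 3, 4, 5, 6]
y = [value ** 3 for value in x]

print(f"Pearson:  {pearson_r(x, y):.2f}")   # below 1, because the points are not on a line
print(f"Spearman: {spearman_r(x, y):.2f}")  # 1.00, because y always rises with x
```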
Scatter Plots of Data with Various Correlation Coefficients
[Figure: six scatter plots of Y against X, with r = -1, r = -.6, r = 0, r = +.3, r = +1, and r = 0]
Linear Correlation
[Figure: scatter plots of Y against X contrasting linear relationships with curvilinear relationships]
Linear Correlation
[Figure: scatter plots of Y against X contrasting strong relationships with weak relationships]
Linear Correlation
[Figure: two scatter plots of Y against X showing no relationship]