Algebra 1C:
Correlation Project
post hoc ergo propter hoc
Goal: To gather observable data and analyze whether or not there is a
correlation.
Stages:
1: Form a hypothesis of two variables that may have a correlation
2: Gather data on those two variables, collecting at least 10 data points from reliable
sources, creating a box-and-whisker plot for each variable
3: Create a scatter plot that compares these variables, using an online tool to
calculate the correlation coefficient
4: Create a predictive line of best fit (aka the regression line) and analyze your line
using a residual plot
5: Reflect on your data. Write a report that incorporates your data (use Google
Sheets to create a scatter plot) and analyze your original hypothesis from Stage 1.
Monday
Stage 3
Tuesday
Wednesday
Thursday
Friday
Intro/Stage 1
Stage 2
Stages 1+2 due
Celebration
Stage 4
Stage 5
Due/Present!
Stage 1: Form a hypothesis of two variables that may have a correlation
Data analysis is used to answer real-world questions. For this project you will be
collecting and analyzing data to see if two events have a correlation. A correlation is a
relationship that shows how two events occur relative to each other: positively, negatively, or
not at all.
For example, a person's HEIGHT has a positive correlation with their
SHOE SIZE. Taller people tend to have larger feet; shorter people tend to have smaller feet.
Perhaps there is a correlation between school ATTENDANCE and GPA, or EDUCATION
LEVEL and SALARY.
What are some other things that you see every day that may have a positive correlation?
______________________________ & ______________________________
______________________________ & ______________________________
______________________________ & ______________________________
On the other hand, things such as DRIVING SPEED and COMMUTE TIME may have a
negative correlation because the higher the speed, the lower the time.
What are some things that may have a negative correlation?
______________________________ & ______________________________
______________________________ & ______________________________
______________________________ & ______________________________
For some things it is difficult to tell whether there is a correlation. For example, there is an old
phrase that “money can’t buy happiness.” If we were to investigate the correlation between
MONEY and HAPPINESS we would face some challenges. What challenges might we face in
trying to gather data on these two things?
The Notorious B.I.G. famously sang “Mo’ Money, Mo’ Problems.” He hypothesizes
that there is a positive correlation between MONEY and PROBLEMS. I will investigate
whether this is true. Before I do that, I have to create a plan for how I will measure MONEY
and PROBLEMS.
● MONEY - I will look at the big picture. I will look at the wealthiest and poorest
regions in the United States as measured by their GDP (gross domestic product).
● PROBLEMS - I will measure problems as CRIME RATE.
● My hypothesis will align with Biggie Smalls: the wealthier regions will have the highest
crime.
Create your own hypothesis. What TWO THINGS that may be related are you going
to investigate? The data should be available ONLINE, or it should be possible to CREATE and
collect the data yourself.
I will see how (x) ____________________________ relates to (y) ____________________________ .
My hypothesis is that it is a positive/negative/neither relationship.
My methodology to collect data for (x) ____________________________ will be to ______________.
My methodology to collect data for (y) ____________________________ will be to ______________.
For my example:
I will see how (x) MONEY relates to (y) PROBLEMS.
My hypothesis is that it is a positive relationship.
My methodology to collect data for (x) MONEY will be to collect GDP data of different U.S. regions.
My methodology to collect data for (y) PROBLEMS will be to collect crime rate statistics for those same U.S. regions.
Stage 2: Gather data on those two variables, collecting at least 10 data points from
reliable sources, creating a box-and-whisker plot for each variable
Now that you’ve narrowed down the two variables you want to compare, you need to collect
data. There are two main ways you will do that for this project.
Direct data collection: In this option you will collect the data yourself. You will make
observations, conduct interviews, or survey people yourself. To ensure your data is usable, ask as
many different people as possible (not just your friends, for example) and be sure to poll
people for both variables.
Online research: Search internet databases for the statistics you are interested in. Be sure
to use trustworthy sites.
What are trustworthy sites? Data published by government or educational institutions
(sites whose addresses end in .gov or .edu) is generally more trustworthy than data from for-profit commercial sites.
Perform the following test to make sure the source is appropriate:
C - Is the source contemporary or current? New sources are best.
R - Is the source related and relevant to the specific thing you are looking for? It should be.
A - Is the author an authority on the subject? Look for experts that other experts trust.
P - What is the purpose of the source? It should be unbiased and without an agenda.
Data collection:
X Variable:
Y Variable:
Here is my data in Google Sheets:
Notes (year, city, name, etc)
Once you have collected your data, justify it below. How do you know your sources are
accurate? What potential problems may there be with the data?
C-
R-
A-
P-
For my “Mo’ Money, Mo’ Problems” example, I have the following:
C - All of my data is from 2012, the last year made available on the FBI’s website.
R - This data looks at exactly what I want, though for the crime rates I limited it to
violent crime and ignored property crime because violent crime is, in my opinion, the more
serious of the two. My results could change had I looked at the total.
A - The authority behind both data sets is the U.S. government, which is merely collecting
statistics. There may be errors in reporting on the state and city level but I trust the
statistics I found.
P - There is no other purpose than to make this information public.
Finally, enter your data into Google Sheets. This will make Stage 3 easier.
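The box-and-whisker plot for each variable in Stage 2 is built from five numbers: the minimum, the first quartile (Q1), the median, the third quartile (Q3), and the maximum. Here is a minimal Python sketch of that calculation. It assumes the "median of each half" method for quartiles (other tools, including some spreadsheets, may use a slightly different quartile rule), and the sample values are made up for illustration only.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]           # values below the middle position
    upper = data[half + n % 2:]   # values above the middle position
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Hypothetical data, made up for illustration only.
sample = [3, 7, 8, 5, 12, 14, 21, 13, 18, 9]
print(five_number_summary(sample))  # (3, 7, 10.5, 14, 21)
```

The box runs from Q1 to Q3 with a line at the median, and the whiskers reach out to the minimum and maximum.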
Stage 3: Create a scatter plot that compares these variables, using an online tool to
calculate the correlation coefficient
This step should be fairly straightforward: plot your data onto a scatter plot.
Attached are all of the appropriate graphs for this part of the project.
Then use an online tool to calculate the correlation coefficient.
(for example: http://www.socscistatistics.com/tests/pearson/)
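If you are curious what an online calculator of this kind is doing, the Pearson correlation coefficient can be computed directly from your two columns of data. The sketch below is only an illustration: the function name and the sample numbers are made up, and it assumes both lists have the same length and that neither variable is constant.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how each varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data, made up for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(pearson_r(xs, ys), 4))
```

A result near 1 suggests a strong positive correlation, a result near -1 a strong negative correlation, and a result near 0 little or no linear correlation.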
Here is my data for example:
Correlation Coefficient: 0.1434
Observations: While there appears to be a VERY WEAK POSITIVE correlation, I wonder if I
have any outliers.
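One common way to check a single variable for outliers is the 1.5 x IQR rule: any value more than 1.5 times the interquartile range below Q1 or above Q3 is flagged. This is a sketch of that rule, not necessarily the method used for my example. It assumes the median-of-halves quartile method, and the sample values are made up for illustration only.

```python
from statistics import median

def iqr_outliers(values):
    """Return the values that fall outside the 1.5 x IQR fences."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    q1 = median(data[:half])            # median of the lower half
    q3 = median(data[half + n % 2:])    # median of the upper half
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

# Hypothetical data, made up for illustration only.
sample = [3, 7, 8, 5, 12, 14, 21, 13, 18, 60]
print(iqr_outliers(sample))  # [60]
```

Run the check on your x values and your y values separately. A flagged point is worth a second look, but it is not automatically wrong.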
Correlation Coefficient:
Observations:
Stage 4: Create a predictive line of best fit (aka the regression line) and analyze
your line using a residual plot
Use an online regression calculator to create a line of best fit
(for example: http://www.alcula.com/calculators/statistics/linear-regression/)
Your input should look like this:
And it will give you a result like this:
If you look above the plot it will give you the equation:
The format is a bit strange, written as y=b+mx. Convert it to slope-intercept form, y=mx+b, and round
to the nearest tenth.
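To see where the slope and intercept come from, here is a minimal Python sketch of the least-squares line of best fit. It prints the equation in y = mx + b form, rounded to the nearest tenth. The function name and the sample data are made up for illustration, and the sketch assumes the x values are not all the same.

```python
def best_fit_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx               # slope
    b = mean_y - m * mean_x     # y-intercept
    return m, b

# Hypothetical data, made up for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
m, b = best_fit_line(xs, ys)
sign = "+" if b >= 0 else "-"
print(f"y = {m:.1f}x {sign} {abs(b):.1f}")  # y = 0.6x + 2.2
```

Your online calculator should give the same slope and intercept, just written in a different order.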
For example my equation will become:
Graph this line on your plot from the last step. Then, calculate and graph your residuals.
X:
Y:
f(x) (y value of line):
Residual:
Then plot your residuals:
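Each residual is the actual y value minus the y value the line predicts, f(x). The sketch below fills in a residual table for a made-up line and made-up data. The slope 0.6 and intercept 2.2 are assumptions for illustration only, so substitute your own equation and data.

```python
def residuals(xs, ys, m, b):
    """Return actual y minus predicted y for each data point."""
    return [y - (m * x + b) for x, y in zip(xs, ys)]

# Hypothetical data and line, made up for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
m, b = 0.6, 2.2

print("X  Y  f(x)  Residual")
for x, y, r in zip(xs, ys, residuals(xs, ys, m, b)):
    print(x, y, round(m * x + b, 1), round(r, 1))
```

To make the residual plot, graph each x value against its residual. If the points scatter randomly above and below zero with no clear pattern, a straight line is a reasonable model for your data.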
Stage 5: Reflect on your data. Write a report that incorporates your data (use
Google Sheets to create a scatter plot) and analyze your original hypothesis from Stage 1.
Write a one-page report on your findings. Use the following paragraph format to do so:
I. State the Problem
A. Why were you interested in your variables?
B. What did you think the relationship was?
C. What results did you think you would get?
II. Methodology
A. What method did you use to collect your data?
B. Did your data pass the CRAP test? Explain.
C. What potential problems were there with the data you collected?
III. Data
A. Include a scatter plot from Google Sheets (copy/paste).
B. Include your correlation coefficient.
IV. Findings
A. What did your data show?
B. Did this confirm or disprove your hypothesis?
C. What would you change to improve your process next time?