CHAPTER 4, REGRESSION
ANALYSIS…
EXPLORING ASSOCIATIONS
BETWEEN VARIABLES
RELATIONSHIPS BETWEEN ...
Talk to the person next to you. Think of two things
that you believe may be related.
For example, height and weight are generally
related... generally, the taller the person, generally,
the more they weigh.
Or, the age of your car and its value... generally,
the older a car, the less it is worth.
Share out two numerical categories that you
believe are related on the board.
DO YOU BELIEVE THERE IS A
RELATIONSHIP BETWEEN...
•TIME SPENT STUDYING AND GPA?
•# OF CIGARETTES SMOKED DAILY & LIFE
EXPECTANCY
•SALARY AND EDUCATION LEVEL?
•AGE AND HEIGHT?
RELATIONSHIPS
When we consider data that comes in pairs (two variables measured
on each individual), the data is referred to as bivariate data.
Much of the bivariate data we will examine is numeric.
There may or may not exist a relationship/an association
between the 2 variables.
Does one variable influence the other? Or vice versa? Or
do the two variables just ‘go together’ by chance? Or is
the relationship influenced by another variable(s) that we
are unaware of?
Does one variable ‘cause’ the other? Caution!
Put some examples here to discuss possibly
BIVARIATE DATA
Proceed similarly as univariate distributions …
(review... which graphical models do we typically use
with univariate data?)
Still graph (use visual model(s) to describe data;
scatter plot; LSRL; Least Squares Regression Line)
Still look at overall patterns and deviations from those
patterns (DOFS; Direction, Outlier(s), Form, Strength);
review how did we look for patterns in univariate data;
what did we use?
Still analyze numerical summary (descriptive statistics)
BIVARIATE
DISTRIBUTIONS
Explanatory variable, x, ‘factor,’ may help predict or explain
changes in response variable; usually on horizontal axis
Response variable, y, measures an outcome of a study,
usually on vertical axis
BIVARIATE DATA DISTRIBUTIONS
For example ... Alcohol (explanatory) and body temperature
(response). Generally, the more alcohol consumed, the lower
the body temperature. Still use caution with ‘cause.’
Sometimes we don’t have variables that are clearly explanatory
and response.
Sometimes there could be two ‘explanatory’ variables, such as
ACT scores and SAT scores, or activity level and physical fitness.
Discuss with a partner for 1 minute; come up with another
situation where we have two variables that are related, but neither
is clearly explanatory or response.
GRAPHICAL MODELS…
Many graphing models display uni-variate data exclusively
(review). Discuss for 30 seconds and share out.
Main graphical representation used to display bivariate data
(two quantitative variables) is scatterplot.
SCATTERPLOTS
Scatterplots show relationship between two quantitative
variables measured on the same individuals or objects.
Each individual/object in data appears as a point (x, y) on
the scatterplot.
Plot explanatory variable (if there is one) on horizontal
axis. If no distinction between explanatory and
response, either can be plotted on horizontal axis.
Label both axes. Scale both axes with uniform intervals
(but scales don’t have to match)
LABEL & SCALE SCATTERPLOT
VARIABLES: CLEARLY EXPLANATORY
AND RESPONSE??
CREATING & INTERPRETING
SCATTERPLOTS
Let’s collect some data; on the board write
your height in inches and your hand span in
inches (to nearest ½ inch)
Input into Minitab & create scatterplot; which is
our explanatory and which is our response
variable?
Let’s do some predicting... to the best of our
ability...
INTERPRETING SCATTERPLOTS
Look for overall patterns (DOFS) including:
•direction: up or down, + or – association?
•outliers/deviations: individual value(s) that fall outside the
overall pattern; there is no outlier rule for bivariate data,
unlike univariate data
•form: linear? curved? clusters? gaps?
•strength: how closely do the points follow a clear
form? Strong, weak, moderate?
MEASURING LINEAR ASSOCIATION
Scatterplots (bi-variate data) show direction, outliers/
deviation(s), form, strength of relationship between two
quantitative variables
Linear relationships are important; common, simple
pattern; linear relationships are our focus in this course
Linear relationship is strong if points are close to a
straight line; weak if scattered about
Other relationships (quadratic, logarithmic, etc.)
CREATING & INTERPRETING
SCATTERPLOTS
Go to my website, download the COC Math 140
Survey Data Fall 2015
Copy & paste last 2 columns (‘Approximately how
many minutes a day, on average, do you spend on
social media?’ And ‘How many people live in your
household, including yourself?’)
Is data messy? Does it need to be ‘fixed?’ ... Hint,
scan for ordered pairs (this is bivariate data); each
and every point must be an ordered pair.
CREATING & INTERPRETING
SCATTERPLOTS
‘Approximately how many minutes a day,
on average, do you spend on social
media?’ And ‘How many people live in your
household, including yourself?’
Create a scatter plot of the data. Analyze
(DOFS)
Let’s do some predictions...
SCATTERPLOTS: NOTE
Might be asked to graph a scatterplot from
data
Might need to sketch what’s on Minitab
Doesn’t have to be 100% exactly accurate; do
your best
Scaling, labeling: a must!
HOW STRONG ARE THESE
RELATIONSHIPS? WHICH ONE IS
STRONGER?
MEASURING LINEAR
ASSOCIATION: CORRELATION OR
“R”
Sometimes our eyes are not a good judge
Need to specify just how strong or weak a
linear relationship is with bivariate data
Need a numeric measure
Correlation or ‘r’
MEASURING LINEAR ASSOCIATION:
CORRELATION OR “R”
• Correlation (r) is a numeric measure of direction
and strength of a linear relationship between two
quantitative variables
• Correlation (r) is always between -1 and 1:
-1 ≤ r ≤ 1
• Correlation (r) is not resistant (look at formula;
based on mean)
• r doesn’t tell us about individual data points, but
rather about trends in the data
• Never calculate by formula; use Minitab
(dependent on having raw data)
CALCULATING
CORRELATION “R”
n, x₁, x₂, etc., x̄, y₁, y₂, etc., ȳ, s_x, s_y, …
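For reference, the standard formula for the sample correlation, written with the ingredients listed above, is sketched below. It assumes the usual n − 1 convention for s_x and s_y.

```latex
r = \frac{1}{n-1} \sum_{i=1}^{n}
    \left( \frac{x_i - \bar{x}}{s_x} \right)
    \left( \frac{y_i - \bar{y}}{s_y} \right)
```

Each factor is a z-score, which is why r has no units and why, like the mean and standard deviation, it is not resistant to outliers.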
MEASURING LINEAR
ASSOCIATION: CORRELATION OR
“R”
r ≈ 0 ► not a strong linear relationship
r close to 1 ► strong positive linear relationship
r close to -1 ► strong negative linear relationship
Go back to our height/hand span data & calculate
‘r,’ correlation; then practice calculating ‘r’ with our
social media/# in household data (stat, regression,
fitted line)
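As a cross-check on the Minitab value, r can also be computed directly as the sum of z-score products divided by n − 1. The sketch below is a minimal Python illustration. The height and hand-span numbers are made-up placeholder values (not class data), and `correlation` is a helper name introduced here.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r: sum of z-score products divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Hypothetical measurements in inches, for illustration only.
heights = [62, 64, 65, 67, 68, 70, 72, 74]
hand_spans = [7.0, 7.5, 7.5, 8.0, 8.5, 8.5, 9.0, 9.5]

r = correlation(heights, hand_spans)
print(f"r = {r:.3f}")  # close to 1 here: strong positive linear relationship
```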
GUESS THE CORRELATION
WWW.ROSSMANCHANCE.COM/APPLETS
‘March Madness’ bracket-style Guess the Correlation
tournament
Playing cards; match up head-to-head competition/rounds
Look at a scatterplot, make your guess
Student who is closest survives until the next round
CORRELATION &
REGRESSION APPLET
PARTNER ACTIVITY
Go to www.whfreeman.com/tps5e
Go to applets
Go to Correlation & Regression
Follow the directions on the hand out (or see my website)
Partner up with the person next to you; this should take no
more than 15-20 minutes, including the write-up; print out &
turn in with both your names on it.
CAUTION… INTERPRETING CORRELATION
Note: be careful when addressing form in scatterplots
Strong positive linear relationship ► correlation ≈ 1
But
Correlation ≈ 1 does not necessarily mean relationship is
linear; always plot data!
R ≈ 0.816 FOR EACH OF THESE
FACTS ABOUT CORRELATION
Correlation doesn’t care which variable is
considered explanatory and which is considered
response; can switch x & y; still same correlation
(r) value
Try with hand span & height data; try with minutes
on social media & # household data
CAUTION! Switching x & y WILL change your
scatterplot; try with our data sets!… just won’t
change ‘r’
FACTS ABOUT CORRELATION
r is in standard units, so r doesn’t change if
units are changed
If we change from yards to feet, or years to
months, or gallons to liters ... r is not
affected
+ r, positive association
- r, negative association
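Two of the facts above (switching x and y, and changing units) can be checked numerically. This is a minimal sketch with made-up height and hand-span values. The `correlation` helper is an illustrative name, not part of Minitab.

```python
from math import isclose
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r: sum of z-score products divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Hypothetical data in inches, for illustration only.
heights_in = [62, 64, 65, 67, 68, 70, 72, 74]
hand_spans_in = [7.0, 7.5, 7.5, 8.0, 8.5, 8.5, 9.0, 9.5]

r_original = correlation(heights_in, hand_spans_in)
r_swapped = correlation(hand_spans_in, heights_in)   # x and y switched
heights_cm = [h * 2.54 for h in heights_in]          # inches -> centimeters
r_rescaled = correlation(heights_cm, hand_spans_in)  # same data, new units

print(isclose(r_original, r_swapped), isclose(r_original, r_rescaled))
```

The scatterplot itself does change when the axes are switched or rescaled; only r stays the same.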
FACTS ABOUT CORRELATION
Correlation is always between -1 & 1
Makes no sense for r = 13 or r = -5
r = 0 means no linear relationship (r near 0: very weak)
r = 1 or -1 means a perfect linear
association
FACTS ABOUT CORRELATION
Both variables must be quantitative,
numerical. Doesn’t make any sense to
discuss r for qualitative or categorical data
Correlation is not resistant (like mean and
SD). Be careful using r when outliers are
present (think of the formula, think of our
partner activity)
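To see what "not resistant" means in practice, the sketch below adds a single unusual point to an otherwise tight linear pattern and recomputes r. All values are invented for illustration, and `correlation` is a helper name introduced here.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r: sum of z-score products divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Hypothetical data with a clear upward linear trend.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 3, 5, 4, 6, 7, 9, 8]
r_clean = correlation(xs, ys)

# Add one point far from the pattern (large x, small y).
r_with_outlier = correlation(xs + [20], ys + [1])

print(f"without outlier: r = {r_clean:.2f}")
print(f"with outlier:    r = {r_with_outlier:.2f}")
```

With these made-up numbers, one extra point is enough to pull r from about 0.95 to slightly below zero, which is why the scatterplot should always be checked alongside r.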
FACTS ABOUT CORRELATION
r isn’t enough! … if we just consider r, it
could be misleading; we must also
consider the distribution’s mean,
standard deviation, graphical
representation, etc.
Correlation does not imply causation;
e.g., # ice cream sales in a given week
and # of pool accidents
ABSURD EXAMPLES…
CORRELATION DOES NOT IMPLY
CAUSATION…
Did you know that eating chocolate makes winning
a Nobel Prize more likely? The correlation
between per capita chocolate consumption and the
number of Nobel laureates per 10 million people
for 23 selected countries is r = 0.791
Did you know that statistics is causing global
warming? As the number of statistics courses
offered has grown over the years, so has the
average global temperature!
LEAST SQUARES REGRESSION
Last section… scatterplots of two
quantitative variables
r measures strength and direction of
linear relationship of scatterplot
WHAT WOULD WE EXPECT THE
SODIUM LEVEL TO BE IN A HOT DOG
THAT HAS 170 CALORIES?
LEAST SQUARES REGRESSION
BETTER model to summarize overall
pattern by drawing a line on scatterplot
Not any line; we want a best-fit line over
scatterplot
Least Squares Regression Line (LSRL) or
Regression Line
LEAST-SQUARES
REGRESSION LINE
LET’S DO SOME PREDICTING BY USING
THE LSRL...
About how much would a home cost if it were:
2,000 square feet?
2,600 square feet?
1,600 square feet?
LET’S DO SOME PREDICTING BY USING
THE LSRL...
About how large would a home be if it were worth:
$450,000?
$350,000?
$220,000?
Also, let’s discuss where the x and y axes start...
LEAST SQUARES REGRESSION
EQUATION TO PREDICT VALUES
LSRL Model: ŷ = a + bx
ŷ is predicted value of response variable
a is y-intercept of LSRL
b is slope of LSRL; slope is predicted (expected)
rate of change
x is explanatory variable
LEAST SQUARES REGRESSION
EQUATION
Typical to be asked to interpret slope & y-intercept
of the equation of the LSRL, in context
Caution: Interpret the slope of the LSRL equation as the
predicted (average or expected) change
in the response variable for a one-unit change in the
explanatory variable
NOT the exact change in y for a unit change in x; LSRL is a
model; models are not perfect
INTERPRET SLOPE & Y-INTERCEPT...
Notice the embedded
context in the equation
of the LSRL
LSRL: OUR DATA
Go back to our data (hand span & height)
and/or minutes on social media & # in your
household.
Create scatterplot; then put LSRL on our
scatterplot; also determine the equation of the
LSRL
Minitab: stat, regression, fitted line plot
LSRL: OUR DATA
Look at graph of our LSRL for our data
Look at our LSRL equation for our data
Our line fits scatterplot well (best fit) but not perfectly
Make some predictions… do we use our graph or our
equation? Which is easier? Which is better? More on
this in a minute...
Interpret our y-intercept; does it make sense?
Interpretation of our slope?
ANOTHER EXAMPLE… VALUE
OF A TRUCK
TRUCK EXAMPLE…
Suppose we were given the LSRL equation for our truck data
as predicted price = 38,257 − 0.1629 (miles driven)
We want to find a more precise estimation of the value if we
have driven 100,000 miles. Use the LSRL equation.
Using graph, estimate price if we have driven 40,000 miles.
Then use the above LSRL equation to calculate the predicted
value of the truck.
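The two equation-based predictions can be worked out as below. This is a small sketch using the LSRL equation given for the truck data; the function name `predicted_price` is introduced here for illustration.

```python
def predicted_price(miles_driven):
    """Predicted truck price in dollars from the given LSRL equation."""
    return 38257 - 0.1629 * miles_driven

for miles in (100_000, 40_000):
    print(f"{miles:,} miles -> predicted price ${predicted_price(miles):,.0f}")
```

The slope says the predicted price drops by about $0.16 for each additional mile driven, so 100,000 miles lowers the prediction by about $16,290 from the intercept.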
AGES & HEIGHTS…
Age (years) | Height (inches)
0 | 18
1 | 28
4 | 40
5 | 42
8 | 49
LET’S REVIEW FOR A
MOMENT…
Input into Minitab
Create scatterplot and describe scatterplot (what do we
include in a description?)
Calculate r (btw, different from slope; why?), equation of
LSRL; interpret equation of LSRL in context; does y-intercept
make sense?
Based on LSRL or the equation of the LSRL (you choose),
make a prediction as to the height of a person at age 35.
LSRL: OUR DATA
Extrapolation: Use of a regression line (or
equation of a regression line) for prediction
outside the range of values of the
explanatory variable, x, used to obtain the
line/equation of the line.
Such predictions are often not accurate.
Friends don’t let friends extrapolate!
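The age and height data (ages 0, 1, 4, 5, 8) show why extrapolation is risky. The sketch below fits the LSRL using b = r·(s_y/s_x) and the point (x̄, ȳ), then predicts height at age 35. The helper names `correlation` and `lsrl` are introduced here, and Minitab's output may differ in rounding.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r: sum of z-score products divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

def lsrl(xs, ys):
    """Return (a, b) for y-hat = a + b*x via b = r*sy/sx and (x-bar, y-bar)."""
    b = correlation(xs, ys) * stdev(ys) / stdev(xs)
    a = mean(ys) - b * mean(xs)
    return a, b

ages = [0, 1, 4, 5, 8]          # years
heights = [18, 28, 40, 42, 49]  # inches

a, b = lsrl(ages, heights)
r = correlation(ages, heights)
height_at_35 = a + b * 35  # age 35 is far outside the observed 0-8 range

print(f"r = {r:.3f}; predicted height = {a:.2f} + {b:.2f}(age)")
print(f"predicted height at age 35: {height_at_35:.0f} inches")
```

With these five points the prediction at age 35 is more than 12 feet, even though r is high. The line only describes the pattern within the ages that were observed.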
CALCULATING THE EQUATION OF THE
LSRL: WHAT IF WE DON’T HAVE THE
RAW DATA?
We still can calculate the equation for the LSRL, but a little
more time consuming
Note: Every LSRL goes through the point (x̄, ȳ)
Formula for slope of LSRL: b = r (s_y / s_x)
LSRL: ŷ = a + bx
CALCULATING THE EQUATION FOR THE
LSRL: WHAT IF WE DON’T HAVE THE
RAW DATA?
Equation of LSRL: ŷ = a + bx
If you do not have raw data, but still need to calculate a
LSRL, you will be given:
x̄, ȳ, r (or r²), s_y, and s_x
Remember, (x̄, ȳ) is an ordered pair that is on the graph of the
LSRL
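Putting the two facts together gives both coefficients from summary statistics alone. A short derivation sketch, in the same notation:

```latex
b = r \, \frac{s_y}{s_x},
\qquad
\bar{y} = a + b\,\bar{x}
\;\Longrightarrow\;
a = \bar{y} - b\,\bar{x}
```

If only r² is given, r = ±√(r²), with the sign taken from the direction of the association (positive for an upward trend, negative for a downward one).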
EXAMPLE: CREATING EQUATION OF
LSRL (WITHOUT RAW DATA)
•predicted BAL = a + b (# of beers consumed)
(equation of LSRL in context – better than x & y)
Remember, slope formula of LSRL: b = r (s_y / s_x)
Givens:
x̄ = 4.8125, ȳ = 0.07375
s_x = 2.1975, s_y = 0.0441, and r² = 0.80
Calculate slope for equation of LSRL
EXAMPLE: CREATING EQUATION
OF LSRL (WITHOUT RAW DATA)
predicted BAL = a + b (# of beers consumed)
Givens:
x̄ = 4.8125, ȳ = 0.07375, s_x = 2.1975, s_y = 0.0441, and r² = 0.80
So, slope = b = 0.0179
Remember, equations of all LSRLs go through (x̄, ȳ) …
so what’s next?
EXAMPLE: CREATING EQUATION OF
LSRL (WITHOUT RAW DATA)
predicted BAL = a + b (# of beers consumed)
Givens:
x̄ = 4.8125, ȳ = 0.07375, s_x = 2.1975, s_y = 0.0441, and r² = 0.80
ŷ = a + 0.0179x
Substitute (x̄, ȳ) into equation
EXAMPLE: CREATING
EQUATION OF LSRL
(WITHOUT RAW DATA)
0.07375 = a + (0.0179)(4.8125) and solve for ‘a’
predicted BAL = a + b (# of beers consumed)
predicted BAL = -0.0124 + 0.0179 (# of beers consumed)
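The BAL calculation can be reproduced step by step from the givens. This is a sketch only; it assumes r is the positive square root of r² because BAL rises with the number of beers, and the variable names are introduced here.

```python
from math import sqrt

# Givens from the example.
x_bar, y_bar = 4.8125, 0.07375  # mean # of beers, mean BAL
s_x, s_y = 2.1975, 0.0441
r_squared = 0.80

# Assumption: positive association, so take the positive root.
r = sqrt(r_squared)
b = r * s_y / s_x      # slope of the LSRL
a = y_bar - b * x_bar  # intercept, from the point (x-bar, y-bar)

# Intercept if the slope is rounded to 4 decimal places before substituting.
a_rounded_slope = y_bar - round(b, 4) * x_bar

print(f"b = {b:.4f}")
print(f"a = {a:.4f} (unrounded slope)")
print(f"a = {a_rounded_slope:.4f} (slope rounded to 0.0179 first)")
```

The intercept differs slightly in the fourth decimal place depending on whether the slope is rounded before substituting, so carrying extra digits until the final step is the safer habit.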
INTERPRETING SOFTWARE
OUTPUT…
Age vs. Gesell Score
DETOUR… MEMORY MONDAY
(OR WAY-BACK WEDNESDAY)…
What is r? What is r’s range?
r tells us how linear (and direction)
scatterplot is. ‘r’ ranges from -1 to 1. ‘r’
describes the scatterplot only (not
LSRL)
Why do we want/need ‘r’?
NOW...
We need a numerical measurement that tells
us how well the LSRL fits/accurately
describes the scatter plot points, the data.
Coefficient of Determination, or r²
COEFFICIENT OF
DETERMINATION …
Do all the points on the scatterplot fall
exactly on the LSRL?
Sometimes too high and sometimes too low
Is LSRL a good model to
use for a particular data
set?
How well does our model
fit our data?
COEFFICIENT OF
DETERMINATION OR r²
“R-sq” software output
Always 0 ≤ r² ≤ 1
Never calculate by hand; always use Minitab
No need to memorize formula; trust me … it’s ugly
COEFFICIENT OF
DETERMINATION OR r²
Remember “r”: correlation, direction and strength
of linear relationship of scatterplot
−1 ≤ r ≤ 1
r², coefficient of determination: fraction of the
variation in the values of y that is explained by
the LSRL; describes the LSRL
0 ≤ r² ≤ 1
COEFFICIENT OF
DETERMINATION OR r²
Interpretation of r²:
We say, “x% of variation in (y variable) is
explained by LSRL relating (y variable) to (x
variable).”
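One way to see where the "fraction of variation explained" comes from is to compare the leftover variation around the LSRL with the total variation in y. The sketch below uses the age and height data from this chapter. The helper names are introduced here, and Minitab reports the same quantity as "R-sq".

```python
from math import isclose
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r: sum of z-score products divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

def lsrl(xs, ys):
    """Return (a, b) for y-hat = a + b*x via b = r*sy/sx and (x-bar, y-bar)."""
    b = correlation(xs, ys) * stdev(ys) / stdev(xs)
    a = mean(ys) - b * mean(xs)
    return a, b

ages = [0, 1, 4, 5, 8]          # years
heights = [18, 28, 40, 42, 49]  # inches

a, b = lsrl(ages, heights)
predictions = [a + b * x for x in ages]

# Variation left over around the line vs. total variation around y-bar.
sse = sum((y - p) ** 2 for y, p in zip(heights, predictions))
sst = sum((y - mean(heights)) ** 2 for y in heights)
r_squared = 1 - sse / sst

print(f"r^2 = {r_squared:.3f}")
print(isclose(r_squared, correlation(ages, heights) ** 2))
```

In context: about 93% of the variation in height is explained by the LSRL relating height to age.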
GENERAL FACTS TO REMEMBER
ABOUT BIVARIATE DATA
Distinction between explanatory and
response variables.
If switched, scatterplot changes and LSRL
changes (but what doesn’t change?)
LSRL minimizes distances from data points
to line only vertically
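"Least squares" refers to the sum of squared vertical distances (residuals) between the points and the line. The sketch below checks, on the age and height data from this chapter, that nudging the LSRL's intercept or slope only makes that sum larger. The helper names and the size of the nudges are arbitrary choices for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r: sum of z-score products divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

def lsrl(xs, ys):
    """Return (a, b) for y-hat = a + b*x via b = r*sy/sx and (x-bar, y-bar)."""
    b = correlation(xs, ys) * stdev(ys) / stdev(xs)
    a = mean(ys) - b * mean(xs)
    return a, b

def sum_squared_residuals(a, b, xs, ys):
    """Sum of squared vertical distances from each point to the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

ages = [0, 1, 4, 5, 8]          # years
heights = [18, 28, 40, 42, 49]  # inches

a, b = lsrl(ages, heights)
best = sum_squared_residuals(a, b, ages, heights)

# Other candidate lines: shift the intercept or tilt the slope a little.
nudges = [(1, 0), (-1, 0), (0, 0.5), (0, -0.5)]
others = [sum_squared_residuals(a + da, b + db, ages, heights) for da, db in nudges]

print(f"LSRL: {best:.2f}")
print("nudged lines:", [round(s, 2) for s in others])
```

Because only vertical distances are measured, regressing x on y instead generally gives a different line, which is why the explanatory/response distinction matters for the LSRL but not for r.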
GENERAL FACTS TO REMEMBER
ABOUT BIVARIATE DATA
b = r (s_y / s_x)
Close relationship between correlation (r) and
slope of LSRL; but r and b are (often) not the
same; when would r and b have the same value?
LSRL always passes through (x̄, ȳ)
Don’t have to have raw data to identify the
equation of LSRL
GENERAL FACTS TO REMEMBER
ABOUT BIVARIATE DATA
Correlation (r) describes direction and
strength of straight-line relationships in
scatterplots
Coefficient of determination (r²) is the
fraction of variation in values of y explained
by LSRL
CORRELATION & REGRESSION WISDOM
Which of the following scatterplots has the
highest correlation?
CORRELATION & REGRESSION WISDOM
All r = 0.816; all have same exact LSRL equation
Lesson: Always graph your data! … because correlation and
regression describe only linear relationships
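The four scatterplots appear to be Anscombe's quartet. That is an assumption based on the shared r = 0.816 and identical LSRL, since the plots themselves are not reproduced here. Under that assumption, the sketch below recomputes r and the LSRL for each of the four published data sets; the helper names are introduced here.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r: sum of z-score products divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

def lsrl(xs, ys):
    """Return (a, b) for y-hat = a + b*x via b = r*sy/sx and (x-bar, y-bar)."""
    b = correlation(xs, ys) * stdev(ys) / stdev(xs)
    a = mean(ys) - b * mean(xs)
    return a, b

# Anscombe's quartet (assumed to be the data behind the four plots).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

results = {}
for name, (xs, ys) in quartet.items():
    a, b = lsrl(xs, ys)
    results[name] = (correlation(xs, ys), a, b)
    print(f"{name}: r = {results[name][0]:.3f}, y-hat = {a:.2f} + {b:.2f}x")
```

The numerical summaries agree to about two decimal places even though only one of the four plots shows a roughly linear pattern, so the numbers alone cannot stand in for the graph.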
CORRELATION & REGRESSION
WISDOM
Correlation and regression describe only
linear relationships
CORRELATION & REGRESSION
WISDOM
Correlation is not causation! Association
does not imply causation… want a Nobel
Prize? Eat some chocolate! How about
Methodist ministers & rum imports?
Year | Number of Methodist Ministers in New England | Cuban Rum Imported to Boston (in # of barrels)
1860 | 63 | 8,376
1865 | 48 | 6,506
1870 | 53 | 7,005
1875 | 64 | 8,486
1890 | 85 | 11,265
1900 | 80 | 10,547
1915 | 140 | 18,559
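The correlation for the ministers and rum table can be computed directly, along with each variable's correlation with the year. This is a minimal sketch using the seven rows above; `correlation` is a helper name introduced here.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r: sum of z-score products divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

years = [1860, 1865, 1870, 1875, 1890, 1900, 1915]
ministers = [63, 48, 53, 64, 85, 80, 140]
rum_barrels = [8376, 6506, 7005, 8486, 11265, 10547, 18559]

r_ministers_rum = correlation(ministers, rum_barrels)
r_year_ministers = correlation(years, ministers)
r_year_rum = correlation(years, rum_barrels)

print(f"ministers vs rum:  r = {r_ministers_rum:.4f}")
print(f"year vs ministers: r = {r_year_ministers:.3f}")
print(f"year vs rum:       r = {r_year_rum:.3f}")
```

The ministers-rum correlation is very close to 1, yet both series simply trend upward with the year, which is the more plausible common variable.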
BEWARE OF NONSENSE ASSOCIATIONS…
r = 0.9749, but no economic relationship between these
variables
Strong association is due entirely to the fact that both imports
& health spending grew rapidly in these years.
Common year is other variable.
Any two variables that both increase
over time will show a strong association.
Doesn’t mean one explains the other
or influences the other
CORRELATION & REGRESSION
WISDOM
Correlation is not resistant; always plot
data and look for unusual trends.
… what if Bill Gates walked into a bar?
CORRELATION &
REGRESSION WISDOM
Extrapolation! Don’t do it… ever.
Example: Growth data from children from
age 1 month to age 12 years … LSRL
predicted height = 1.5 ft + 0.25 (age in years)
What is the predicted height of a 40-year
old?
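Plugging age 40 into the equation, as a quick worked step:

```latex
\widehat{\text{height}} = 1.5 + 0.25(40) = 1.5 + 10 = 11.5 \text{ ft}
```

The arithmetic is fine; the problem is that age 40 lies far outside the 1-month to 12-year range used to build the line.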
OUTLIERS & INFLUENTIAL POINTS
All influential points are outliers, but not all outliers are
influential points.
Outliers: observations lie outside overall pattern
OUTLIERS &
INFLUENTIAL POINTS
Influential points/observations: If removed would
significantly change LSRL (slope and/or y-intercept)
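A simple check for influence is to fit the LSRL with and without the suspect point and compare the two slopes. The sketch below does this on invented data in which the last point sits far to the right of the others. The helper names are introduced here, and what counts as a "significant" change is a judgment call.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r: sum of z-score products divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

def lsrl(xs, ys):
    """Return (a, b) for y-hat = a + b*x via b = r*sy/sx and (x-bar, y-bar)."""
    b = correlation(xs, ys) * stdev(ys) / stdev(xs)
    a = mean(ys) - b * mean(xs)
    return a, b

# Hypothetical data; the last point (20, 1) is far from the rest in x.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 20]
ys = [2, 3, 5, 4, 6, 7, 9, 8, 1]

a_all, b_all = lsrl(xs, ys)
a_without, b_without = lsrl(xs[:-1], ys[:-1])  # drop the suspect point

print(f"with point:    y-hat = {a_all:.2f} + {b_all:.2f}x")
print(f"without point: y-hat = {a_without:.2f} + {b_without:.2f}x")
```

Here removing one point changes the slope from slightly negative to clearly positive, so the point is influential. Points that are extreme in the x-direction tend to have this kind of pull on the line.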
CLASS ACTIVITY…
Groups of 2 or 3; measure each other’s head circumferences &
arm spans (both in inches, rounded to the nearest ½"). Write
data on board.
1. Create scatterplot and describe the association between
head circumference and arm span using DOFS. Calculate the
correlation of the scatter plot (r).
2. Is a regression line appropriate for our data? Why or why
not? If so, create LSRL graph & calculate equation;
calculate the coefficient of determination & interpret r².
3. Interpret the slope and the y-intercept of the LSRL in context.
4. (a) Make a prediction (you can use your LSRL graph or
your equation of the LSRL; your choice). If a student’s
head circumference is 24.5”, what would be the predicted
arm span (in inches) for that given person? (b) If a
student’s head circumference is 36”, what would be the
predicted arm span (in inches) for that given person?
5. If there is an outlier that is not an influential point on your
scatter plot, circle it in red and label it as an outlier.
6. If there is an influential point on your scatter plot, circle it
in red and label it as an influential point.
7. Print everything up, put each group member name on it,
turn it in.