GAISEing2013-day2-association-v9-with-additions

GAISEing into the
Statistics Common Core
Day 2: Statistical
Association
June 27, 2013
Team
• Dr. Stephanie Casey is an Assistant Prof. of MathEd at EMU. Her research focuses on
teacher knowledge for teaching statistics at the middle and secondary levels,
motivated by her experience of teaching secondary mathematics for fourteen years.
• Dr. Andrew Ross is an Associate Prof. of Math at EMU, specializing in operations
research. He was named the Michigan MAA Distinguished Teaching Awardee in 2011.
• Dr. Brenda Gunderson is a Senior Lecturer in Stats Dept at the University of Michigan.
She coordinates and teaches Statistics and Data Analysis, with approximately 1800
students each term.
• Anamaria Kazanis, PStat, is a Senior Statistician at MSU. She is the current president of
the Ann Arbor Chapter of ASA.
• Karen Nielsen is a PhD student in the Stats Dept. at the University of Michigan. She
has taught 2 years of undergraduate introductory Statistics labs and served as a
mentor to other Graduate Student Instructors. As part of a cross-disciplinary team,
she helped to bring online learning objects into large-enrollment gateway classes.
• Mackenzie Fankell graduated from the U of M in 2009 with a degree in
psychology. After graduating she worked as an English teacher in Chile for two years
before returning to the US and working as a high school math teacher in Dearborn,
MI. She began her masters in education at U of M in 2012 but transferred to a
masters program in statistics later that year. She hopes to pursue research in
education and the social sciences.
Outline of Our Day
• 9:00-10:30 a.m.: GAISE into the CCSS-M statistics standard(s) of the day:
• The standard,
• its learning trajectory, and
• content
• 10:30-10:40 a.m.: BREAK
• 10:40 a.m.-12:10 p.m. GAISE activities part 1
• activities that teach the standard through the GAISE process,
• debrief on the experience and how to utilize the activity in their own classroom
• 12:10-1:00 p.m.: LUNCH BREAK
• 1:00-2:00 p.m.: GAISE activities part 2
• 2:00-2:30 p.m.: Interactive lecture on
• knowledge of standard and students,
• discussing what students are likely to think about and do as they progress through the learning
trajectory for the standard;
• common student conceptions, effective ways to support students as they move through the learning
trajectory
• 2:30-3:00 p.m.: Reflections on the day’s standard(s), share ideas, comments, concerns, etc. for
teaching the standard(s)
9:00-10:30 a.m.
GAISE into the CCSS-M statistics standards of the day:
The standards
Learning trajectory
Content
Standards, Grade 8 (part 1)
Investigate patterns of association in
bivariate data.
• CCSS.Math.Content.8.SP.A.1 Construct and interpret scatter
plots for bivariate measurement data to investigate patterns
of association between two quantities. Describe patterns such
as clustering, outliers, positive or negative association, linear
association, and nonlinear association.
• CCSS.Math.Content.8.SP.A.2 Know that straight lines are
widely used to model relationships between two quantitative
variables. For scatter plots that suggest a linear association,
informally fit a straight line, and informally assess the model
fit by judging the closeness of the data points to the line.
Standards, Grade 8 (part 2)
Investigate patterns of association in bivariate
data.
• CCSS.Math.Content.8.SP.A.3 Use the equation of a linear model to solve
problems in the context of bivariate measurement data, interpreting the
slope and intercept. For example, in a linear model for a biology
experiment, interpret a slope of 1.5 cm/hr as meaning that an additional
hour of sunlight each day is associated with an additional 1.5 cm in mature
plant height.
• CCSS.Math.Content.8.SP.A.4 Understand that patterns of association can
also be seen in bivariate categorical data by displaying frequencies and
relative frequencies in a two-way table. Construct and interpret a two-way
table summarizing data on two categorical variables collected from the
same subjects. Use relative frequencies calculated for rows or columns to
describe possible association between the two variables. For example,
collect data from students in your class on whether or not they have a
curfew on school nights and whether or not they have assigned chores at
home. Is there evidence that those who have a curfew also tend to have
chores?
Standards, High School (part 1)
Summarize, represent, and interpret data on two
categorical and quantitative variables
• CCSS.Math.Content.HSS-ID.B.5 Summarize categorical data for two
categories in two-way frequency tables. Interpret relative
frequencies in the context of the data (including joint, marginal, and
conditional relative frequencies). Recognize possible associations
and trends in the data.
• CCSS.Math.Content.HSS-ID.B.6 Represent data on two quantitative
variables on a scatter plot, and describe how the variables are
related.
• CCSS.Math.Content.HSS-ID.B.6a Fit a function to the data; use
functions fitted to data to solve problems in the context of the data.
Use given functions or choose a function suggested by the context.
Emphasize linear, quadratic, and exponential models.
• CCSS.Math.Content.HSS-ID.B.6b Informally assess the fit of a
function by plotting and analyzing residuals.
• CCSS.Math.Content.HSS-ID.B.6c Fit a linear function for a scatter plot
that suggests a linear association.
Standards, High School (part 2)
Interpret linear models
• CCSS.Math.Content.HSS-ID.C.7 Interpret the
slope (rate of change) and the intercept
(constant term) of a linear model in the
context of the data.
• CCSS.Math.Content.HSS-ID.C.8 Compute
(using technology) and interpret the
correlation coefficient of a linear fit.
• CCSS.Math.Content.HSS-ID.C.9 Distinguish
between correlation and causation.
AP Statistics (part 1)
• 1 . Exploring Data: Describing patterns and departures from patterns (20%–30%)
Exploratory analysis of data makes use of graphical and numerical techniques to
study patterns and departures from patterns. Emphasis should be placed on
interpreting information from graphical and numerical displays and summaries
D . Exploring bivariate data
1 . Analyzing patterns in scatterplots
2 . Correlation and linearity
3 . Least-squares regression line
4 . Residual plots, outliers and influential points
5 . Transformations to achieve linearity: logarithmic and power
transformations
E . Exploring categorical data
1 . Frequency tables and bar charts
2 . Marginal and joint frequencies for two-way tables
3 . Conditional relative frequencies and association
4 . Comparing distributions using bar charts
AP Statistics (part 2)
• IV . Statistical Inference: Estimating population
parameters and testing hypotheses (30%–40%)
Statistical inference guides the selection of appropriate
models.
A . Estimation (point estimators and confidence
intervals)
8 . Confidence interval for the slope of a least-squares regression line
B . Tests of significance
6 . Chi-square test for … homogeneity of
proportions, and independence (…two-way tables)
7 . Test for the slope of a least-squares regression
line
Learning
Trajectories/Progressions
• TurnOnCCMath.net
• Progressions for the Common Core State Standards in
Mathematics
• Project SET: http://project-set.com/
• http://project-set.com/presentations/121712-regressionlpfinal-released/
Turn On CC Math.net (up to 8th grade)
Progressions for the Common Core
State Standards in Mathematics
• By The Common Core Standards Writing Team themselves
GAISE Level A, assoc.-related
• I. Formulate the Question
• → Teachers help pose questions (questions in contexts of interest to
the student).
• II. Collect Data to Answer the Question
• → Students conduct a census of the classroom.
• → Students understand individual-to-individual natural variability.
• → Students conduct simple experiments with nonrandom
assignment of treatments.
• III. Analyze the Data
• → Students observe association between two variables
• → Students use tools for exploring … association, including:
• ▪ Scatterplot ▪ Tables (using counts)
• IV. Interpret Results
Example:
GAISE Level B, assoc.-related
• I. Formulate Questions
• → Students begin to pose their own questions
• III. Analyze Data
• → Students quantify the strength of association between two
variables, develop simple models for association between two
numerical variables, and use expanded tools for exploring
association, including:
• ▪ Contingency tables for two categorical variables
• ▪ Time series plots
• ▪ The QCR (Quadrant Count Ratio) as a measure of strength of
association
• ▪ Simple lines for modeling association between two numerical
variables
• IV. Interpret Results
• → Students understand basic interpretations of measures of
association.
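A minimal sketch of the QCR, assuming the usual definition: split the scatterplot into four quadrants at (mean x, mean y), then take (points in the upper-right and lower-left quadrants minus points in the upper-left and lower-right quadrants) divided by the number of points. The function name and the data are hypothetical, invented for illustration; how to treat points lying exactly on a dividing line is a convention, and here they count toward neither group.

```python
def quadrant_count_ratio(xs, ys):
    """QCR = (points in quadrants I and III - points in quadrants II and IV) / n."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    agree = 0      # upper-right or lower-left of (mean_x, mean_y)
    disagree = 0   # upper-left or lower-right of (mean_x, mean_y)
    for x, y in zip(xs, ys):
        product = (x - mean_x) * (y - mean_y)
        if product > 0:
            agree += 1
        elif product < 0:
            disagree += 1
        # Points exactly on a dividing line count toward neither group.
    return (agree - disagree) / n


# Hypothetical data, invented for illustration only.
heights = [150, 155, 160, 165, 170, 175]
arm_spans = [148, 157, 158, 166, 172, 174]
qcr = quadrant_count_ratio(heights, arm_spans)
print("QCR:", qcr)
```

Like r, the QCR lies between -1 and +1, but it only counts points and ignores how far each point is from the means.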
Example: favorite music
GAISE Level C, assoc.-related
• I. Formulate Questions
• → Students should be able to formulate questions and
determine how data can be collected and analyzed to provide
an answer
• III. Analyze Data
• → Students should be able to recognize association between
two categorical variables.
• → Students should be able to recognize when the relationship
between two numerical variables is reasonably linear, know
that Pearson’s correlation coefficient is a measure of the
strength of the linear relationship between two numerical
variables, and understand the least squares criterion in line
fitting
Example
Example: plotting residuals
• http://project-set.com (there are many other Project SET’s)
• Aimed at high school
• Loop 1, golf ball drop, could be used in middle school
• Informal lines of fit
• Loop 2, vertical leap, is for HS: least-squares, residuals
• And possibility of categorical association
• Loop 2, used car prices, is for HS: least-squares, residuals
• Loop 3, NFL QB salaries, is for HS: least-squares, r or R^2
• Loop 4&5, txting, just for AP Stat
Loop 1: Informal Fit
• Using Golf Ball Drop data
• Please read the handout, use spaghetti to
show your informally fitted line.
• Not allowed to break spaghetti to connect
individual dots!
• Finish instructions on handout.
• Also, what is wrong with experimental
plan?
Lack of Replication!
• When possible, should do at least 2 experiments under each
experimental setting (drop height, in this case)
• Helps quantify uncertainty at each x value
• Can then use fancy tests for nonlinearity (post-AP-level stats)
What if we had only done
one trial at each dose?
Might see just the diamonds,
or just the Xs.
Also, when designing,
choose 3 or more
X values, so we can
detect nonlinearity.
Show 3 Types of Scatterplots:
• Designed experiment, with replication
Don’t average the y values at each x value to “make it simpler”!
Show 3 Types of Scatterplots:
• Observational Study
[Scatterplot: Pennsylvania, district-by-district; x = Avg Teacher Salary ($0 to $80,000),
y = Math test scores (0 to 1800); trendline y = 0.0059x + 1023.8, R^2 = 0.2108]
Show 3 Types of Scatterplots:
• Time Series
Common Suggestions for
Informal Fits
• Connect First and Last Points
• Connect Lowest and Highest Points
• Divide the data in half
• Connect as many points as possible
• And others we’ll get to later.
• Before we go on, sketch graphs that
show these ideas aren’t great.
Common suggestions for
informal fits:
Another common suggestion
Loop 2:
Residuals = actual - predicted
NOTE:
Residuals are
measured
VERTICALLY,
not
horizontally
and not
perpendicular
to the line of
best fit.
New ideas for informal fit?
Usual student’s answer:
sum the absolute residuals
• Not a bad idea!
• But, some bad points about it:
• Historically, harder to do than what we’ll see next.
• Sometimes the choice of line is not unique.
• Advanced statistical theory supports a different choice.
• Good points:
• Modern software can do it.
• It’s resistant to outliers.
Usual statistician’s answer:
sum the squared residuals
• This applet shows the geometric squares of the residuals:
http://www.geogebra.org/en/upload/files/mrfox001/line_of_
best_fit.html
• Does CCSSM require use or knowledge of formulas to find the
line that minimizes the sum of squared residuals?
• Standards aren’t so clear to me; the draft Progressions document
seems to focus only on using technology to fit the line
automatically.
What is the length of this line?
What is the length of this line?
Is this a square?
What is its Area?
Popular drawing: sum of
squared residuals
But squares are actually coming “out of the page” at us; both base & depth are
measured in $
• What is the danger lurking in the equation that it shows?
http://xkcd.com/833/
Label the axes!
• It is very easy to get confused: is y=original data, or
y=residuals?
• Other, more advanced plots have:
• X=predicted, y=actual
• X=predicted, y=residual
• X=run sequence of data (1st, 2nd, etc) , y=residual
• Here are six recommended plots for examining the residuals:
http://www.itl.nist.gov/div898/handbook/eda/section3/6plot.htm
However, it neglects another type that it mentions elsewhere:
a run-order or run-sequence plot.
It is standard practice to graph
the residuals!
Timing data from yesterday
Let’s try it on the TI calculators.
Mackenzie 17 17
Lori 21 27
Paul 17 21
ASK 24 22
Katelyn 23 20
Karin 24 33
Allison 18 19
Karen 20 20
Jamie 22 19
Andra 18 20
Susan 24 25
Sherita 53 45
Susan 15 16
Stephanie 23 27
Jordan 27 26
Ed 18 18
Mila 25 27
Wendy 24 23
Claudia 28 27
Steve 24 26
Linda 25 25
Karen 28 28
Elizabeth 28 26
Jeff 26 25
Kim 38 26
Jeannette 19 24
Lisa 29 28
Joanne 30 25
Molly 31 38
Laura 33 35
With line of best fit:
• What if we flip the x & y data before doing regression?
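One way to explore the flip question away from the calculator is a short script. This is a sketch, not the workshop's method: the two lists are the two number columns from the timing table above (the slide does not label the columns, so the names `first` and `second` are placeholders), and `fit_line` and `correlation` are hypothetical helper names.

```python
import math

# The two numeric columns of the timing table, in the order listed.
first = [17, 21, 17, 24, 23, 24, 18, 20, 22, 18, 24, 53, 15, 23, 27,
         18, 25, 24, 28, 24, 25, 28, 28, 26, 38, 19, 29, 30, 31, 33]
second = [17, 27, 21, 22, 20, 33, 19, 20, 19, 20, 25, 45, 16, 27, 26,
          18, 27, 23, 27, 26, 25, 28, 26, 25, 26, 24, 28, 25, 38, 35]


def fit_line(xs, ys):
    """Least-squares line predicting ys from xs; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def correlation(xs, ys):
    """Pearson correlation coefficient r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


slope_a, intercept_a = fit_line(first, second)   # second predicted from first
slope_b, intercept_b = fit_line(second, first)   # first predicted from second
r = correlation(first, second)

print("second from first: slope", slope_a, "intercept", intercept_a)
print("first from second: slope", slope_b, "intercept", intercept_b)
print("r:", r, " product of the two slopes:", slope_a * slope_b)
```

The two slopes multiply to r^2 rather than to 1, so the flipped line is the algebraic inverse of the original line only when the points are perfectly collinear.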
It is standard practice to graph
the residuals!
What should residual graphs
look like?
• No patterns!
• If there are any patterns, that means our original regression
missed something. Which of these are okay/not okay?
Each graph has x=original x data values, y=residuals
Usual Procedure
1. Graph the data
2. Fit a function
3. Compute and graph residuals
4. Any pattern left? Repeat from step 2
5. No pattern left? We’re done!
• Students get confused: do I want to see a pattern, or
not?
• In the original data, yes (usually). In the residuals,
no.
Correlation Coefficient
• r (that is, little-r)
• Always between -1 and +1
• Close to zero: no linear relationship
• Close to -1 or +1: close-to-linear relationship
• 0 to 0.5 is weak, 0.5 to 0.8 is moderate, above 0.8 is strong
(though that’s for social-science/biology stuff, not engr/physics)
• r doesn’t change if x or y units change (or both), or axes flip.
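A quick check of the units-and-axes claim, as a sketch with made-up numbers: the heights and weights are hypothetical, and "axes flip" is read here as swapping which variable is x and which is y.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data, invented for illustration only.
heights_in = [60, 62, 65, 68, 70, 72]
weights_lb = [110, 125, 140, 155, 160, 185]

# Change units on both variables: inches to cm, pounds to kg.
heights_cm = [2.54 * h for h in heights_in]
weights_kg = [0.45359237 * w for w in weights_lb]

r_original = correlation(heights_in, weights_lb)
r_metric = correlation(heights_cm, weights_kg)
r_swapped = correlation(weights_lb, heights_in)   # x and y exchanged

print(r_original, r_metric, r_swapped)
```

All three values agree up to floating-point rounding, which is a useful contrast with the regression slope: the slope does change with units and with the choice of x and y.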
Which has the highest correl.?
• This is called Anscombe’s Quartet
• Again, which has the highest
correlation?
Teacher Value-Added Scores
• New York City, 2006 vs 2008 school year
• Data on each individual teacher
• Same school, same subject, same grade
level
• At least 3 years of experience
• X=VA z-score in 2006; Y=VA z-score in 2008
• What will the scatterplot look like?
• What will the r or R^2 value be?
Outliers & Influential Points
• Each data point could be:
• An x outlier
• A y outlier
• Both x and y outlier, or neither x nor y outlier
• A regression outlier (far from the pattern of the data)
• Influential (if removed, the slope of the regression line would
change more than just a little bit, whatever that means in the
context of the problem)
• Not all outliers are influential!
Outlier boundaries
• Using the usual definition of outlier: more than 1.5 IQR from
Q1 or Q3
• Slanted lines for regression outliers use Q1, Q3, IQR of
residuals:
• Trendline + Q3 + 1.5 IQR, and
• Trendline + Q1 – 1.5 IQR (Q1 of residuals will be < 0, almost
always)
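The slanted boundary lines can be sketched in code as follows. This is an illustration under assumptions: the data are invented (one point is deliberately pushed far above the trend), the helper names are hypothetical, and quartiles come from Python's `statistics.quantiles` with its default method, which may differ slightly from the quartile convention used by a calculator or textbook.

```python
import statistics


def fit_line(xs, ys):
    """Least-squares line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data; the sixth point is an intentional regression outlier.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 3.9, 6.2, 8.0, 9.8, 25.0, 14.1, 16.0, 17.9, 20.1]

slope, intercept = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

q1, _median, q3 = statistics.quantiles(residuals, n=4)
iqr = q3 - q1
upper_offset = q3 + 1.5 * iqr   # upper boundary: trendline + this offset
lower_offset = q1 - 1.5 * iqr   # lower boundary: trendline + this offset

# A point is a regression outlier if it falls outside the slanted band.
flagged = [i for i, res in enumerate(residuals)
           if res > upper_offset or res < lower_offset]
print("indices of regression outliers:", flagged)
```

Note that the flagged point sits near the middle of the x range, so it is neither an x outlier nor especially influential on the slope, which ties in with "not all outliers are influential."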
Influence on Slope
• Consider a lattice of possible points we might add to data set
• Compute abs(%change in regression slope)
• Color small changes blue, large changes red.
• Center is near (mean x, mean y)
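The lattice computation behind such a picture might look like the sketch below. The data set and the grid are hypothetical, the function names are invented for illustration, and the coloring step is left out; only the abs(% change in slope) values are computed.

```python
def slope(xs, ys):
    """Slope of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def slope_change_map(xs, ys, x_grid, y_grid):
    """abs(% change in slope) caused by adding each lattice point, one at a time."""
    base = slope(xs, ys)
    changes = {}
    for gx in x_grid:
        for gy in y_grid:
            new = slope(xs + [gx], ys + [gy])
            changes[(gx, gy)] = abs(100 * (new - base) / base)
    return changes


# Hypothetical data with mean x = 3 and mean y = 4.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
changes = slope_change_map(xs, ys, range(0, 7), range(0, 9))

print("added at the center (3, 4):", changes[(3, 4)])
print("added at a far corner (6, 0):", changes[(6, 0)])
```

Adding a point exactly at (mean x, mean y) leaves the slope unchanged, which is why the center of the picture shows the smallest changes.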
Influence on R^2
• Using abs(% change in R^2)
• Red regions show large change, blue shows little change
• Note white-ish regions at bottom-left & top-right: adding
points from those regions (which are near the original
trendline) increases R^2
• Adding points from top-left or bottom-right decreases R^2
Common Names for Variables
horizontal
x
independent
free
predictor
stimulus
cause
controlled
explanatory
regressor
vertical
y
dependent
outcome
predicted
response
effect
uncontrolled
explained
result
Example
• Suppose you read an article that says that people who eat at
least one carrot a day tend to spend less on health care than
those that don’t. Does this mean you should eat more carrots
to stay healthier?
• Perhaps a hidden variable is a person’s attitude toward health.
People who try to take good care of themselves probably eat a
lot of all veggies. They probably also have better health than
those who don’t care what they eat.
• This proposes a specific lurking variable (rather than saying
“there is a lurking variable” with no further explanation), and
it says how that variable affects both of the variables already
mentioned. It doesn’t argue that the link doesn’t exist.
Correlation does not imply
Causation
• Perhaps a 3rd variable, not in the study, is affecting both
variables that were in the study (“lurking”)
• Perhaps the causation runs the opposite way of what was
proposed
• Common lurking variables:
• SES = Socio-Economic Status (poverty, etc.)
• A person’s overall health
• A person’s health attitude
• Population Size (of a city/state/country)
• Inflation or flow of time
• Weather
• Local cost of living
• % liberal/conservative by region
Argue about these:
• There is a positive link between consumption of tobacco and
non-use of seatbelts. So if you want to cut down on smoking,
buckle up!
• There is a link between the # of years of math someone takes
in high school and their future income. So, Michigan should
require high school students to take at least Algebra 2.
• An actual article said something like: there is a link between
credit card debt and health problems. So, to make yourself
healthier, pay down your credit cards, since credit card debt
causes stress which can cause health problems.
• There is a link between the presence of computers in K-12
schools and their standardized test scores. Therefore, we
should spend more money on computers in schools.
• Smoking and seat belts: health attitude
• Algebra-2: lurking variable of geekiness?
• Debt and health problems: maybe health
problems cause debt, more than debt
causes health problems?
• Computers in schools: socio-economic
status?
http://xkcd.com/552/
Interpreting Slope
[Scatterplot: Pennsylvania, district-by-district; x = Avg Teacher Salary ($0 to $80,000),
y = Math test scores (0 to 1800); trendline y = 0.0059x + 1023.8, R^2 = 0.2108]
Algebra vs Statistics
vocabulary: Slope
• Algebra: slope =
how much y will change for a 1-unit change in x
• Statistics: slope of regression line =
AVERAGE change in y per 1-unit DIFFERENCE in x
• AVERAGE: no guarantee that y will change exactly that much.
• DIFFERENCE: saying “change” might give the impression that
we are changing the x value of a data point (putting someone
on a stretching machine) instead of comparing two different x
values (two people of different heights).
True or False?
1. T/F: If you give a raise of $10,000 to each teacher in a particular
district, that district’s avg. test score will go up 59 points.
2. T/F: If you live in a district with a $50,000 average teacher salary and
move to a district with a $60,000 average, your child’s test scores
will go up, on average, by 59 points.
3. T/F: If you live in a district with a $50,000 average and move to a
district with a $60,000 average, that district’s score will be 59 points
higher.
4. T/F: If you live in a district with a $50,000 average and move to a
district with a $60,000 average, then on average the scores in that
district will be 59 points higher.
5. T/F: Since there is a district with a salary of about $30,000 with test
scores above 1400, and another district around $65,000 with test
scores below 1200, we can see that there’s no correlation between
salary and test scores.
Answers
1. False; this presumes that the correlation is a causation.
It talks about changing the x-value of one data point, not comparing
two data points.
2. False; your child is still your child with all their existing
demographics. While an increase might happen, there’s no
reason to think it would even average 59 points.
3. False; this statement sounds like a guarantee.
4. True; saying “on average” is the key point.
5. False; a single counter-example (or even many of them)
doesn’t disprove a general trend.
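For reference, the 59 points in these statements comes straight from the trendline on the Pennsylvania scatterplot. As a worked check, writing \hat{y} for the predicted score (notation assumed here, not taken from the slide):

```latex
\hat{y} = 0.0059\,x + 1023.8
\quad\Longrightarrow\quad
\hat{y}(60{,}000) - \hat{y}(50{,}000) = 0.0059 \times 10{,}000 = 59
```

The arithmetic is the same in all four salary statements; only the interpretation of the 59 differs.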
x=% free lunch; y=score
Free School Lunches cause bad test scores?
[Scatterplot: x = Pct Free Lunch (0% to 100%), y = score (0 to 1800);
trendline y = -304.48x + 1394.4, R^2 = 0.4524]
http://xkcd.com/605/
http://xkcd.com/1007/
Ecological Fallacy
• Better to call it Aggregation Fallacy (my personal opinion)
• The fallacy is: aggregate data gives useful info on individuals.
10:40 a.m.-12:10 p.m.
• GAISE activities part 1: participants engage in activities that
teach the standard through the GAISE process, then debrief
on the experience and how to utilize the activity in their own
classroom
• Possible / Favorite activities for Quantitative Association:
• Barbie Bungee
• Spaghetti Bridge
• Balloon Descent Time
• Paper Helicopter Descent Time
• Sports ball bounce height (Golf? Ping-pong? Superbounce?)
• M&M Exponential Survival Curve
• Two Estimates of Timing, or weight of objects, or age of a person
12:10-1:00 p.m. : Lunch Break
Workshop participants engage in a process
of balancing nutritional value, price, and
flavor options to decide what food to eat.
They might have already done this in pre-class work (“brown bag”)
Participants will eat their selected food (or
randomly selected?) n=30 times and
record the resulting observations.
1:00-2:00 p.m.
GAISE activities part 2: participants engage in activities that
teach the standard through the GAISE process and utilize
technology, then debrief on the experience and how to utilize
the activity in their own classroom
Categorical Association Activity: Eyes!
Categorical Association
• CCSS.Math.Content.8.SP.A.4 Understand that patterns of
association can also be seen in bivariate categorical data
by displaying frequencies and relative frequencies in a
two-way table. Construct and interpret a two-way table
summarizing data on two categorical variables collected
from the same subjects. Use relative frequencies
calculated for rows or columns to describe possible
association between the two variables. For example,
collect data from students in your class on whether or not
they have a curfew on school nights and whether or not
they have assigned chores at home. Is there evidence
that those who have a curfew also tend to have chores?
Categorical Association
• CCSS.Math.Content.HSS-ID.B.5 Summarize categorical
data for two categories in two-way frequency tables.
Interpret relative frequencies in the context of the data
(including joint, marginal, and conditional relative
frequencies). Recognize possible associations and trends
in the data.
Summer Employment and
Gender: Joint Frequency
Job Experience                                    Male   Female   Total
Never had a part-time job                         21     31
Had a part-time job during summer only            15     13
Had a part-time job but not only during summer    12     8
Total
What numbers/tables and graphs could we look at to explore the data
looking for any association between gender and summer employment?
Are these helpful?
[Figure: two column charts of the counts for Male and Female; vertical axes run
0 to 40 and 0 to 35; one label reads “Never had a part-time job”]
Marginal Frequency
Job Experience                                    Male   Female   Total
Never had a part-time job                         21     31       52
Had a part-time job during summer only            15     13       28
Had a part-time job but not only during summer    12     8        20
Total                                             48     52       100
What numbers/tables and graphs could we look at to explore the data
looking for any association between gender and summer employment?
Conditional Frequency, by
Gender
Job Experience                                    Male   Female   Total
Never had a part-time job                         44%    60%      52%
Had a part-time job during summer only            31%    25%      28%
Had a part-time job but not only during summer    25%    15%      20%
Total                                             100%   100%     100%
Grade 8: Use relative frequencies calculated for rows or columns.
HS: Interpret relative frequencies in the context of the data (including joint,
marginal, and conditional relative frequencies)
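The conditional relative frequencies can be recomputed from the joint counts in the summer-employment table. A minimal sketch follows; the dictionary layout, variable names, and shortened row labels are choices made here for illustration.

```python
# Joint counts from the summer-employment table:
# rows are job experience, values are (Male, Female).
counts = {
    "Never had a part-time job": (21, 31),
    "Summer only": (15, 13),
    "Not only during summer": (12, 8),
}

male_total = sum(m for m, f in counts.values())     # 48
female_total = sum(f for m, f in counts.values())   # 52

# Conditional on gender: each column adds up to 100%.
by_gender = {
    job: (100 * m / male_total, 100 * f / female_total)
    for job, (m, f) in counts.items()
}

# Conditional on job experience: each row adds up to 100%.
by_experience = {
    job: (100 * m / (m + f), 100 * f / (m + f))
    for job, (m, f) in counts.items()
}

for job in counts:
    male_pct, female_pct = by_gender[job]
    print(f"{job}: {male_pct:.0f}% of males, {female_pct:.0f}% of females")
```

Comparing the two percentages within each printed row is the Grade 8 move: if the male and female percentages differ noticeably, that suggests a possible association.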
Conditional Frequency, by
Gender: 100% Stacked Column
• A.K.A. Segmented Bar Graph
[100% stacked column chart: one column each for Male and Female (0% to 100%),
segmented by job experience: Never had a part-time job; Had a part-time job
during summer only; Had a part-time job but not only during summer]
Conditional Frequency by
Experience
Job Experience                                    Male   Female   Total
Never had a part-time job                         40%    60%      100%
Had a part-time job during summer only            54%    46%      100%
Had a part-time job but not only during summer    60%    40%      100%
Total                                             48%    52%      100%
What numbers/tables and graphs could we look at to explore the data
looking for any association between gender and summer employment?
Conditional Frequency, by
Experience
[100% stacked column chart: one column for each job-experience category (Never
had a part-time job; Had a part-time job during summer only; Had a part-time
job but not only during summer), segmented into Female and Male]
Famous Discrimination Case
          Recommended for Promotion   Not Recommended for Promotion   Total
Male      21                          3                               24
Female    14                          10                              24
Total     35                          13                              48
Suppose these numbers show an actual association
(“statistically significant”).
Is that evidence of illegal discrimination?
Activity starter
• Everyone get a triangle-ended slip of paper.
• With both eyes open, hold hands at arm’s distance like this,
• Focus on a distant object.
• Close left eye. Reopen.
• Close right eye. Reopen.
• Whichever eye still saw the object is your
“dominant eye”
• Write your dominant eye (L or R) on the
triangle-end of your slip of paper.
• What might be related to eye dominance?
• (suggestions from the whole group)
• Write that on the non-triangle end of your
slip of paper.
• Pass slips of paper to the front
Let’s make a table
            Left Hand   Right Hand   Total
Left Eye    2           10           12
Right Eye   2           18           20
Total       4           28           32
Example table 1 (fake data)
        Eye   Hand
Left    5     9
Right   35    31
Example table 2 (fake data)
            Left Hand   Right Hand
Left Eye    3           2
Right Eye   6           29
What if extreme association?
            Left Hand   Right Hand   Total
Left Eye    4           8            12
Right Eye   0           20           20
Total       4           28           32
What if no association?
•First, make a segmented bar chart (100% Stacked Column chart)
•Then, make a table of conditional relative frequencies
•Then, make a table of joint frequencies
Conditional:   Left Hand   Right Hand   Total
Left Eye       12.5%       87.5%        100%
Right Eye      12.5%       87.5%        100%

Joint:         Left Hand   Right Hand   Total
Left Eye       12*12.5%    12*87.5%     12
Right Eye      20*12.5%    20*87.5%     20
Total          4           28           32
What if no association?
•First, make a segmented bar chart (100% Stacked Column chart)
•Then, make a table of conditional relative frequencies
•Then, make a table of joint frequencies
Conditional:   Left Hand   Right Hand   Total
Left Eye       12.5%       87.5%        100%
Right Eye      12.5%       87.5%        100%

Joint:         Left Hand   Right Hand   Total
Left Eye       1.5         10.5         12
Right Eye      2.5         17.5         20
Total          4           28           32
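The "no association" joint table can be produced directly from the row and column totals, since each cell is (row total) x (column total) / (grand total). A minimal sketch using the totals from the table; the variable names are chosen here for illustration.

```python
# Marginal totals from the eye/hand table.
row_totals = {"Left Eye": 12, "Right Eye": 20}
col_totals = {"Left Hand": 4, "Right Hand": 28}
n = sum(row_totals.values())   # 32

# Under no association, every row has the same conditional distribution,
# so each cell is row total * column total / grand total.
expected = {
    (eye, hand): row_totals[eye] * col_totals[hand] / n
    for eye in row_totals
    for hand in col_totals
}

for (eye, hand), value in expected.items():
    print(eye, hand, value)
```

The fractional counts are fine here: the table describes what perfect non-association would look like, not a data set anyone could actually collect.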
Simulating No Association
(artificial data)
            Left Hand   Right Hand   Total
Left Eye                             5
Right Eye                            35
Total       9           31           40
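One way to simulate no association with these margins is to keep the eye labels fixed and shuffle the hand labels, which is the software version of shuffling slips of paper. This is a sketch under that assumption; the function name and the seed are hypothetical, and the workshop may have run the simulation differently.

```python
import random
from collections import Counter


def simulate_no_association(eye_counts, hand_counts, seed=None):
    """Randomly pair eye and hand labels, keeping both sets of totals fixed."""
    eyes = [label for label, k in eye_counts.items() for _ in range(k)]
    hands = [label for label, k in hand_counts.items() for _ in range(k)]
    rng = random.Random(seed)
    rng.shuffle(hands)   # break any link between a person's eye and hand
    return Counter(zip(eyes, hands))


# Margins from the table: 5 left-eyed, 35 right-eyed; 9 left-handed, 31 right-handed.
table = simulate_no_association({"Left": 5, "Right": 35},
                                {"Left": 9, "Right": 31}, seed=1)

for eye in ("Left", "Right"):
    print(eye, "eye:", table[(eye, "Left")], "left-handed,",
          table[(eye, "Right")], "right-handed")
```

Running it many times with different seeds shows how much the inner cells vary by chance alone, which gives a baseline for judging the two example tables.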
2:00-2:30 p.m.
• Interactive lecture on
• knowledge of standard and students,
• what students are likely to think about and do as they progress
through the learning trajectory for the standard
• common student (mis-)conceptions,
• effective ways to support students as they move through the
learning trajectory
Common student conceptions
“What Fits” --from Project SET and STEW
Activity: drop a golf ball from various heights, record height of
its first bounce
The following graphs show where actual students placed a
“best-fit” line (thin metal rod)
For each, try to figure out the student's reasoning; then we'll
reveal it.
• I thought of the line of best fit like the mode, so I put the line
through two points with the same y-coordinate because they
occur most often.
• The line needs to start at (0,0) then go through the most dots.
I got my line to go through two of the dots so I put it there.
• The line should be in the middle of the highest and lowest
points because that’s like the average.
• I know the line should go in the middle of the data. I put it
here so it would be in the middle, four points on each side.
• I tried to make my line go through the most dots.
Help these students
• You are leading the class in analyzing the relationship between
GPA and ACT scores for a data set, using technology. You have
asked your students to find the correlation coefficient,
coefficient of determination, and regression line for the data
set. One pair of students asks for your help. They have done
their work independently and are now comparing their
answers. They have the same correlation coefficient &
coefficient of determination, but their regression lines are
different.
• What went wrong?
• How would you help them?
The answer
• One of them used GPA to predict ACT, and the other did the
reverse.
• The r and R^2 values will be the same
• The slopes will be different (slope1 approximately = 1/slope2 but
not exactly)
• The intercepts will be different
• Predictions will be different
• How to help? Perhaps ask each to predict the ACT of a 4.0
student; one will be able to answer quickly, the other will have
to solve a linear equation.
• Perhaps sketch a quick graph with vertical residuals and
horizontal residuals?
Rerouting a Student Idea
• To start a lesson on lines of best fit, you are following the
curriculum guide and have presented your students the
following data set about the pounds of beans used by families
of different sizes when traveling on the Overland Trail:
#people           5   8   6   7   11   10   5   7    10   5   8    7    9    12   10
Pounds of beans   61  95  56  75  125  135  80  100  103  75  100  105  125  150  125
• First question: how much does a party of 20 need?
• Student suggests: You could look at people with like 10 people
in their families and just double that amount.
• How to reroute into a line-of-best-fit?
Rerouting a Student Idea
• Which party of 10? There are 3 of them.
• Why not take a party of 5 and quadruple their usage?
• Why not take a party of 12 and a party of 8 and add them?
• What if the parties of 10 were by chance not big eaters? Can
we use all of the data to estimate needs?
• What if a student suggests to compute pounds-per-person for
each party, average them all, then multiply by 20?
• Industrial engineers would point out that while a linear
regression will forecast the needed amount of food, you
should actually bring along some safety stock, above the
forecast.
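To compare the competing suggestions numerically, here is a sketch that uses all of the data. It assumes the (people, pounds) pairs read off in order from the slide's table, and the helper and variable names are hypothetical. Note that a party of 20 lies well beyond the largest party in the data (12), so any answer is an extrapolation.

```python
# Bean data, assuming the pairs are listed in order in the slide's table.
people = [5, 8, 6, 7, 11, 10, 5, 7, 10, 5, 8, 7, 9, 12, 10]
pounds = [61, 95, 56, 75, 125, 135, 80, 100, 103, 75, 100, 105, 125, 150, 125]


def fit_line(xs, ys):
    """Least-squares line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Suggestion A: line of best fit through all parties.
slope, intercept = fit_line(people, pounds)
regression_forecast = intercept + slope * 20

# Suggestion B: average the pounds-per-person ratios, then multiply by 20.
ratios = [lb / p for p, lb in zip(people, pounds)]
ratio_forecast = 20 * sum(ratios) / len(ratios)

# Suggestion C: double each party of 10 (there are three, so three answers).
doubled_tens = [2 * lb for p, lb in zip(people, pounds) if p == 10]

print("regression forecast:", regression_forecast)
print("pounds-per-person forecast:", ratio_forecast)
print("doubling each party of 10:", doubled_tens)
```

Seeing three different doubled answers next to a single forecast from the line is one way to motivate using all of the data.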
Vocabulary Time
• A student in your class raises his hand and asks “What’s the
difference between correlation, association, and regression?”.
• What would you say, do, or draw?
The answer
• Association is the most general: any relationship between two
variables, whether linear, curved, or otherwise patterned.
Association also applies to categorical data.
• Correlation is the direction and strength of any linear
relationship
• Regression is the process of fitting a line or curve to a data set.
• But, many people use them as if they mean the same thing, so
we can't assume when we read or hear something that the
person is being technically accurate in their usage. This is even
true for some textbooks.
• R^2 = % of variation in the response variable (y) explained by
the predictor variable, using the model that you used.
• = 1 – (sum of squared residuals)/(sum of squared y deviations
from the y mean)
• Also called the “Coefficient of Determination”
• Quadratic fits will ALWAYS be better (or at least as good) than
linear fits, as measured by R^2 or sum of squared residuals.
• Similarly, cubic will ALWAYS beat quadratic (or be at least as
good), etc.
• This is the danger of “Overfitting”: modeling your existing data
points too well, at the expense of making good predictions of
future data points.
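A short sketch of the R^2 formula above, applied to polynomial fits of increasing degree. The data values are made up for illustration, and `r_squared` is a helper name introduced here, not something from the workshop materials.

```python
import numpy as np

# Made-up, roughly linear data, for illustration only.
x = np.arange(1, 11, dtype=float)
y = np.array([2.3, 4.1, 5.8, 8.4, 9.7, 12.5, 13.6, 16.2, 18.1, 19.8])

def r_squared(x, y, degree):
    """R^2 = 1 - (sum of squared residuals)/(sum of squared y deviations)."""
    coeffs = np.polyfit(x, y, degree)      # least-squares polynomial fit
    fitted = np.polyval(coeffs, x)
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Linear, quadratic, and cubic fits to the same data.
r2 = {degree: r_squared(x, y, degree) for degree in (1, 2, 3)}
for degree, value in r2.items():
    print("degree", degree, "R^2 =", round(value, 5))
```

Each higher-degree model contains the lower-degree one as a special case, so its least-squares R^2 cannot be smaller. A higher R^2 on the existing points says nothing by itself about predictions for new points.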
Zero Correlation
• Create three distinctly differently patterned scatterplots which
all have a correlation coefficient of approximately zero.
• Explain why the correlation coefficient is near zero for each of
the cases.
Zero Correlation
• Also, if all the data is on a perfect flat line, y=0x+b, the
correlation coefficient is technically undefined (divide-by-zero),
but we might redefine it to be zero in that instance.
• But that would never happen with real data.
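One possible set of answers, sketched in code. The three patterns below (a symmetric parabola, a circle, and an X shape) are illustrative choices, not the ones shown in the workshop. The helper `pearson_r` is written out by hand so the divide-by-zero case for a flat line is visible.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation; returns NaN when either variable has no spread."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return math.nan  # the formula would divide by zero
    return sxy / math.sqrt(sxx * syy)

# Pattern 1: symmetric parabola. Strong relationship, but not linear.
parabola_x = [-3, -2, -1, 0, 1, 2, 3]
parabola_y = [x ** 2 for x in parabola_x]

# Pattern 2: points evenly spaced around a circle.
angles = [2 * math.pi * k / 12 for k in range(12)]
circle_x = [math.cos(a) for a in angles]
circle_y = [math.sin(a) for a in angles]

# Pattern 3: an X shape, where a rising line and a falling line cancel out.
cross_x = [-2, -1, 1, 2, -2, -1, 1, 2]
cross_y = [-2, -1, 1, 2, 2, 1, -1, -2]

r_parabola = pearson_r(parabola_x, parabola_y)
r_circle = pearson_r(circle_x, circle_y)
r_cross = pearson_r(cross_x, cross_y)

# Perfectly flat line: y has no variation, so r is undefined.
r_flat = pearson_r([1, 2, 3], [5, 5, 5])

print(r_parabola, r_circle, r_cross, r_flat)
```

In each patterned case the positive and negative products of deviations cancel, so r is near zero even though the points are clearly related.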
Model Selection
• You have been teaching your class about finding the best
model for a data set. A student says “So I just try all of these
different equations on the data set and whichever one gives
me the biggest value of r is the best one, right?”.
An answer
• First, is there theory that indicates which model might be best?
Exponential growth for populations, for example. Decay toward zero
in some cases (could be power or exponential). Horizontal
asymptotes predicted by theory? Are negative values (for x or for y)
allowed?
• If theory has nothing to say, we have an inherent preference for
linear models, since (a) Occam’s razor says to use the simplest thing,
and while mathematically they are all equally simple in some sense,
we really like linear stuff, and (b) if you zoom in on any of these
smooth curves you will get approximately a line. This preference
might lead us to select a linear model with R^2=0.80 instead of a
power/log/exponential model with R^2=0.83, for example. How big
a difference is acceptable? Hard to say.
• If we’re including polynomial models like a*x^2+b*x+c, warn about
the danger of overfitting: adding more terms never decreases R^2
(and usually increases it), but can make the predictions of future
values ridiculous.
And then there’s Logs
• Logarithmic transforms aren’t part of the CCSSM, but are part
of AP Statistics.
• If some of the data has already been log-transformed
(decibels, earthquake Richter scale magnitudes, pH, etc.) then
it’s unlikely you would want to log it again.
• If the data (usually y rather than x) spans roughly 2 or more
orders of magnitude, consider logging it
• If the residuals have a fan-shape, consider transforming y
(either logging or square-rooting, usually)
• If the residuals look skewed, consider transforming y to bring
them back toward normality.
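A small sketch of logging y when it spans about two orders of magnitude. The data are made up and exactly exponential (y = 3 · 2^x), chosen so the effect of the transform is easy to see. Real data would scatter around the line.

```python
import numpy as np

# Made-up, exactly exponential data: y runs from 3 to 384.
x = np.arange(0, 8, dtype=float)
y = 3.0 * 2.0 ** x

# Fit a straight line to log10(y) versus x.
log_y = np.log10(y)
slope, intercept = np.polyfit(x, log_y, 1)

# Undo the transform to read the model as y = start_value * growth_factor^x.
growth_factor = 10 ** slope
start_value = 10 ** intercept

# Compare linearity before and after the transform.
r_raw = np.corrcoef(x, y)[0, 1]
r_log = np.corrcoef(x, log_y)[0, 1]

print("growth factor per unit x:", round(growth_factor, 3))
print("starting value:", round(start_value, 3))
print("r with raw y:", round(r_raw, 3), " r with log10(y):", round(r_log, 3))
```

Predictions from the fitted line are on the log scale, so they need to be converted back with 10^(prediction) before being reported in the original units.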
Common Misconceptions on
Categorical Association
• Research has found that students commonly have three incorrect
conceptions about association of categorical variables:
• Determinist: students believed that an association meant all cases
must show an association with no exceptions. These students
believed that the cells in the two-way table that did not agree with
the association should have zero frequency.
• Unidirectional: students believed dependence occurred only when it
was direct. This could be explained by the tendency of students to
give more relevance to positive cases than negative cases that
confirm a given hypothesis.
• Localist: students looked at part of the data to determine if an
association existed, often only looking at the cell with the highest
frequency or at only one conditional distribution.
Determinist: no exceptions!
Artificial data:
                    Has AIDS    Doesn't have AIDS
Has HIV                97               1
Doesn't have HIV        2            9900
• What are some arguments, based on the numbers in this
table, that they are NOT associated?
• What are some arguments, based on the numbers in this
table, that they ARE associated?
• What does a Determinist look at in a segmented bar
chart?
• Does an association have to hold for every single case for
the association to be true?
• Does the strong association shown in the table show that
HIV causes AIDS?
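A sketch of the conditional distributions behind this discussion, using the counts from the artificial table. The helper name `row_proportions` is introduced here for illustration.

```python
# Counts from the artificial HIV/AIDS table on the slide.
table = {
    "Has HIV": {"Has AIDS": 97, "Doesn't have AIDS": 1},
    "Doesn't have HIV": {"Has AIDS": 2, "Doesn't have AIDS": 9900},
}

def row_proportions(two_way):
    """Convert each row of counts into a conditional distribution."""
    result = {}
    for row_label, row in two_way.items():
        total = sum(row.values())
        result[row_label] = {col: count / total for col, count in row.items()}
    return result

props = row_proportions(table)
for row_label, dist in props.items():
    print(row_label, {col: round(p, 4) for col, p in dist.items()})
```

The off-diagonal cells (1 and 2) are not zero, yet the two conditional distributions are very different. That contrast is what indicates association, not the absence of exceptions.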
Unidirectional?
• Poll 100 people at your high school about their college plans.
                Private    Public
In-state           5         60
Out-of-state      25         10
Does this table, above, show an association?
Does the table below show an association?
                Public    Private
In-state          60         5
Out-of-state      10        25
What does Unidirectional thinking look like on
a segmented bar chart?
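A quick check, sketched in code, that the two tables carry identical information: only the column order differs. The helper name `row_proportions` is introduced here for illustration.

```python
# First table: Private column listed first.
table_a = {
    "In-state": {"Private": 5, "Public": 60},
    "Out-of-state": {"Private": 25, "Public": 10},
}
# Second table: same counts, Public column listed first.
table_b = {
    "In-state": {"Public": 60, "Private": 5},
    "Out-of-state": {"Public": 10, "Private": 25},
}

def row_proportions(two_way):
    """Conditional distribution of school type within each row."""
    result = {}
    for row_label, row in two_way.items():
        total = sum(row.values())
        result[row_label] = {col: count / total for col, count in row.items()}
    return result

props_a = row_proportions(table_a)
props_b = row_proportions(table_b)

# Difference in the proportion choosing Public, in-state minus out-of-state.
diff_public = props_a["In-state"]["Public"] - props_a["Out-of-state"]["Public"]
# The same comparison for Private has the opposite sign and the same size.
diff_private = props_a["In-state"]["Private"] - props_a["Out-of-state"]["Private"]

print("Same conditional distributions:", props_a == props_b)
print("Difference for Public:", round(diff_public, 3))
print("Difference for Private:", round(diff_private, 3))
```

Whether the large counts fall on the main diagonal or the other diagonal depends only on how the columns are ordered, so an "inverse" pattern is just as much an association as a "direct" one.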
Localist?
• Ask 100 people whether they like summer or winter weather
more, and their birthplace.
          East of the Mississippi    West of the Mississippi
Summer              72                          8
Winter              18                          2
It's pretty clear that most people from the East prefer Summer to Winter.
Does this mean that there is an association between birthplace and weather
preference?
Compute the conditional distributions:
What is the probability that someone from the east likes Summer weather?
[72/(72+18)=72/90=80%]
What is the probability that someone from the west likes Summer weather?
[8/(8+2)=8/10=80%]
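Another way to see the same result, sketched in code: compare the observed counts with the counts expected if birthplace and preference were unrelated (row total × column total / grand total). The variable names are introduced here for illustration.

```python
# Observed counts: rows are Summer, Winter; columns are East, West.
observed = [
    [72, 8],
    [18, 2],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count in each cell if there were no association at all.
expected = [
    [r * c / grand_total for c in col_totals]
    for r in row_totals
]

# Proportion preferring Summer within each birthplace (each column).
summer_share = [observed[0][j] / col_totals[j] for j in range(2)]

print("Expected counts under no association:", expected)
print("Share preferring Summer (East, West):", summer_share)
```

The expected counts match the observed counts exactly, so the large 72 in one cell reflects only that most respondents were from the East and most people prefer Summer. Looking at that single cell is the Localist trap.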
Categorical Association and
Experimental Design
• You are leading your students in an activity in which they
complete all 4 steps of the GAISE framework (formulate a
research question, collect their own data, analyze the data,
and interpret the results) in a situation relevant to categorical
association.
• Students have brainstormed the following research questions.
• For each question, determine if it is relevant to categorical
association, and provide your response to the student.
• Consider data collection issues as well. For those that are not
relevant and/or have data collection issues, suggest some
modifications.
Relevant to Categ. Assoc.?
Data collection issues?
• LAURA: I want to study if girls are smarter than boys, so I am
going to compare the GPAs of boys and girls at school.
• BILL: I think people with higher GPAs are less likely to have had
a car crash while driving. I am going to ask seniors with
driver's licenses their GPAs and if they’ve crashed while
driving.
• ANNE MARIE: I want to see whether boys are more likely to
own smartphones than girls. So I’m going to count during
passing periods the number of boys I see with smartphones
and the number of girls I see with no phones.
Relevant to Categ. Assoc.?
Data collection issues?
• NICK: I think girls are more likely to bring their lunch to school
than boys. So I’m going to stand at the entrance to the
lunchroom and count the number of boys and girls that bring
their lunches.
• JEFF: I want to study the relationships between gender, race,
grade level, political party and whether students make honor
roll or not at our school.
• KRISTIN: I read that 77% of high school students go on to
college. I want to see if that’s true of the students here.
2:30-3:00 p.m.
• Reflections on the day’s standard(s), share ideas, comments,
concerns, etc. for teaching the standard(s)
Standards for Mathematical
Practice
• How do they relate to Statistical Association?
• (not sure what part of the day this fits in the best, if at all)
• CCSS.Math.Practice.MP1 Make sense of problems and
persevere in solving them.
• Make sense of trends in scatterplots
• Make sense of outliers/influential points
• Persevere in trying different functional fits
• CCSS.Math.Practice.MP2 Reason abstractly and quantitatively.
• “The slope has important practical interpretations for most
statistical investigations of this type” (CCSS Progression grade
8)
• CCSS.Math.Practice.MP3 Construct viable arguments and
critique the reasoning of others.
• Think of lurking variables that would argue against a
correlation being a causation
• Evaluate linear/exponential/power/log models based on their
asymptotic behavior
• CCSS.Math.Practice.MP4 Model with mathematics.
• Everything!
• “build statistical models to explore the relationship between
two variables” (CCSS Progression grade 8)
• “using their knowledge of functions to fit models to
quantitative data” (CCSS Progression HS)
• CCSS.Math.Practice.MP5 Use appropriate tools strategically.
• Not much argument that technology is needed to do most
regression analysis?
• Tension between classroom technology and real-world
technology?
• CCSS.Math.Practice.MP6 Attend to precision.
• Don't report too many decimal places in slope, intercept, R^2,
or predictions of new values
• Phrase predictions about new values carefully, avoiding causal
or deterministic language.
• CCSS.Math.Practice.MP7 Look for and make use of structure.
• “looking for and making use of structure to describe possible
association in bivariate Data” (CCSS Progression grade 8)
• “using their knowledge of proportions to describe categorical
associations”, “Looking for patterns in tables” (CCSS
Progression HS)
• CCSS.Math.Practice.MP8 Look for and express regularity in
repeated reasoning.
Day 2 Wrap-Up
• What surprised you today?
• What did you find interesting?
• How might you bring these ideas to your
class?
• What would you change?
• Other activities/ideas to share with the
group?
Project-SET sports data
Gender    Height (in)    Vertical Jump Height (in)    Likes Sports?
male      72.5           22.5                         Yes
female    70             18.5                         Yes
male      71             17                           No
female    64             17                           No
female    69             16.5                         Yes
male      72.5           27.5                         Yes
female    64.75          12.5                         Yes
male      70             16                           Yes
female    67.5           5.5                          No
female    66             12                           Yes
male      65.5           20.5                         Yes
female    66.75          13.5                         No
female    59.25          11                           No
male      69.25          16                           No
What categorical association question
can we ask?
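One candidate question is whether gender is associated with liking sports. The sketch below builds the two-way table for that question, assuming the gender and Likes Sports? columns line up row by row in the order they appear in the data listing.

```python
from collections import Counter

# Columns as transcribed from the Project-SET listing, assumed to align by row.
gender = ["male", "female", "male", "female", "female", "male", "female",
          "male", "female", "female", "male", "female", "female", "male"]
likes_sports = ["Yes", "Yes", "No", "No", "Yes", "Yes", "Yes",
                "Yes", "No", "Yes", "Yes", "No", "No", "No"]

# Two-way table of counts, keyed by (gender, answer).
counts = Counter(zip(gender, likes_sports))

# Conditional distribution of the answer within each gender.
for g in ("male", "female"):
    total = counts[(g, "Yes")] + counts[(g, "No")]
    share_yes = counts[(g, "Yes")] / total
    print(g, "Yes:", counts[(g, "Yes")], "No:", counts[(g, "No")],
          "share Yes:", round(share_yes, 2))
```

With only 14 students, any difference between the two shares is weak evidence, which is worth raising with students when interpreting the results.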