Statistics 213 – 3.1: Exploratory Data Analysis (EDA)
© Scott Robison, Claudia Mahler 2019 all rights reserved.
Objectives:
Be able to understand the differences between variable classifications
Be able to differentiate between a population and a sample as well as a parameter and a statistic
Be able to identify the sampling procedure used to collect a sample
Be able to compute sample proportions for qualitative variables
Be able to compute summary values for quantitative variables
Be able to create (in R) and interpret a boxplot
Motivation:
In Unit 2, we introduced the idea of a random variable. Recall the definition: a random variable is a quantity whose values are (usually) the numerical values associated with the random outcomes of an experiment. Oftentimes, we represent a random variable with a letter.
Examples
Let X represent the number of heads that we observe from 10 coin flips
Let Y represent the number of customers who enter a bank on a given day
Let W represent an individual’s score on the Stanford-Binet IQ test
So far in this course, we’ve concerned ourselves with calculating probabilities relating to such variables as the
ones above. The focus of Unit 3 is slightly different. We are now going to switch gears and talk more about
variables and their importance in research settings. More specifically, we’re going to talk about variables more as
characteristics or properties that we are interested in studying or understanding in greater detail.
When we refer to a variable, we're still going to think of something whose value varies from iteration to iteration - just as we defined in Unit 2 - but now we're going to focus our attention on what these variables might be in "real world" situations and, specifically, in research.
We can still discuss calculating probabilities relating to these other types of variables, but Unit 3 is focused on
classifying variables and collecting data in order to make claims about variables.
Data and Variables
There are a few basic terms we should be familiar with before we go any further (some of these we have seen
before). To understand these terms, let’s consider an example. In nearly every field of study, research typically
begins with a question.
Example
“How long is a typical eruption of the Old Faithful geyser in Yellowstone National Park?”
One way to try to find an answer to this question would be to go to Yellowstone and observe, say, 20 eruptions of
Old Faithful, recording the length of each eruption. In statistics, an observation is an individual occurrence, case,
or instance of interest for which information is recorded.
In our Yellowstone scenario, each eruption would be an observation.
Notice that the thing of interest in this example is information about a specific characteristic of the eruptions - in
particular, the length of the eruptions. If we were to actually go out and record these eruption lengths, we would
see that they would most likely differ from observation to observation (eruption to eruption). Thus, the eruption
lengths are, in this scenario, our random variable (often just called a variable) of interest - a characteristic,
attribute, or outcome that can vary from observation to observation.
When we talk about the information that is recorded for one (or more) variables for a set of observations, we are
referring to data. Data is a collection of recorded information on one or more variables for a set of observations. A
particular set of data is sometimes just called a dataset.
In our Yellowstone scenario, our data would be the set of eruption lengths for the 20 observed eruptions (the 20
values of the variable “eruption length”).
Classifying Variables
Variables (and datasets) come in all shapes and sizes depending on what we’re interested in and/or what our
research is about. As we will see, knowing what type(s) of variable(s) we are dealing with will be useful when we
wish to use statistical techniques to explain or understand the variable(s) in more detail. So let’s briefly discuss the
general types of variables that we may encounter.
At the most general level, variables can be classed as one of two types: variables that describe a quality of an
observation or variables that describe a quantity of an observation.
Qualitative (categorical) variables are those that classify observations into one of a group of categories. Think
“quality” or “what type?”
Quantitative (numerical) variables are those that can be measured on a naturally numerical scale. Think
“quantity” or “how much?”
Example 3.1.1
Determine if the following variables are qualitative or quantitative.
1. Body weight (in pounds) – qualitative or quantitative
2. Clothing size (small, medium, large) – qualitative or quantitative
3. Number of trees in a park – qualitative or quantitative
4. Breed of a cat – qualitative or quantitative
Quantitative (numerical) variables can be further classified as discrete or continuous variables. We’ve seen these
definitions before!
Knowing which type of variable we’re working with is important! As we’ll see further on in this set of notes, different
summary and descriptive values are appropriate for different types of variables.
Making Claims about Variables
Whenever we’re interested enough in a variable to study it statistically, we’re usually interested in it on some fairly
large scale. That is, we want to study the behavior of the variable across many instances.
Examples
In Canada, what is the age at which people first consume alcohol?
How many registered voters in Idaho voted for Biden in the 2020 U.S. presidential election?
To do so, we collect data, or observations of these instances, and note the behavior of the variable for each
observation. One of the best ways of “summarizing” a variable’s behavior is by aggregating all of this information
into a single numerical description of the variable. While there are many different ways of doing this depending on
the variable(s) of interest, we will mainly focus on two of these summary values in this class: the mean and the
proportion.
The mean is used for quantitative (numerical) data and aggregates all measured observations for a given variable
into a single number that summarizes the “typical” behavior of the variable across those observations.
A proportion is used for qualitative (categorical) data and is a ratio of the number of all observations that fall into a
particular category of interest to the number of total observations.
Thus, when we’re interested in the behavior of a variable, we usually express our interest in terms of one of these
summary quantities.
Examples
In Canada, what is the average age at which people first consume alcohol?
What proportion of registered voters in Idaho voted for Biden?
We will learn the specifics of the calculations for the mean and for proportions a bit later. For now just think of a
mean as a “summary” value for quantitative variables and a proportion as a “summary” value for qualitative
variables.
Populations vs. Samples
As mentioned above, whenever we’re interested enough in a variable to study it statistically, we’re usually
interested in it on some fairly large scale. That is, we want to understand the variable’s behavior across many
instances.
A population is the set of all instances (units, people, objects, events, regions, etc.) we are interested in when
wanting to study the behavior of a variable. We usually denote the size of a population (if known) with N .
Examples
N = all people in Canada
N = all registered voters in Idaho
Ideally, we would record information on a variable of interest for every observation in a given population. However,
in a lot of cases (like in the examples above), it is impractical (or too expensive, or even impossible) to do so, since
most populations are very large. Instead we most often select a subset of observations from a population of
interest and record information on our variable of interest for every observation in the sample.
A sample is a selected subset of observations from the population, usually much smaller than the population itself.
We usually denote the size of a sample with n.
Examples
n = 2,000 Canadians
n = 400 registered voters in Idaho
At the population level, our summary values (means and proportions) are called parameters.
The population mean is usually denoted μ. This value μ is a parameter.
The population proportion is usually denoted p. This value p is a parameter.
At the sample level, our summary values (means and proportions) are called statistics or sample statistics.
The sample mean is usually denoted x̄. This value x̄ is a statistic and is an estimate of μ.
The sample proportion is usually denoted p̂. This value p̂ is a statistic and is an estimate of p.
Examples
μ = average age of first alcohol consumption for all Canadians
x̄ = average age of first alcohol consumption for a sample of 2,000 Canadians
p = proportion of all registered voters in Idaho who voted for Biden
p̂ = proportion of a sample of 400 registered voters in Idaho who voted for Biden
Samples are smaller and therefore easier to work with than large populations. However, the main goal of interest is
still to examine the behavior of a variable in the population of interest as a whole, not just in the smaller subset of a
sample.
In other words, our main interests are in our population summary values - μ and p. However, since we are often unable to calculate these values exactly, we use our sample summary values - x̄ and p̂ - and generalize our sample findings to the larger population of interest. We'll see more of this later!
Data Collection
Because we wish to generalize our sample findings to our population of interest, it is important that our sample be
representative of our population. That is, we want to make sure that our sample reflects the characteristics and
composition of the population from which it’s taken.
Sampling Techniques
There are multiple different ways of obtaining a sample, depending on time/resources/observation type/etc. We will
briefly discuss several different types.
Ideally, a sample is a “perfect” representation of the population from which it was taken. While this is likely never
going to be the case, we can get a better representative sample if our sampling method involves random sampling.
To help give examples of each of these types of sampling methods, let’s consider the following scenario: a
researcher wishes to know what proportion of voters in Idaho voted for Biden.
A simple random sample (SRS) is one in which every conceivable subgroup of n observations has the same
chance of being selected for the sample of size n. A random number generator (or some other way of
randomizing all observations in the population) is required to assure the random selection.
Example
The researcher obtains a list of all N Idaho voters, then uses a random number generator to select n of those
voters to be in the sample.
(note that obtaining a list of an entire population may not always be possible!)
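In R, a simple random sample can be drawn with the built-in sample() function. A minimal sketch, under the assumption (hypothetical here) that the N voters on the list are numbered 1 through N:

# Hypothetical population size taken from the voter list
N = 868000
# Draw an SRS of n = 400 voter IDs; every subset of 400 IDs is equally likely
srs = sample(1:N, size = 400)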
Stratified sampling is done when the population of interest contains naturally-occurring groups (or "strata") of observations that are similar on one or more characteristics. A stratified sample is obtained by selecting observations from each stratum and combining them to form the sample.
Example
The researcher divides voters into those who are registered as Democrats, those registered as Republicans, and those registered with some other party (or no party). These are the three strata. She then takes samples of, say, 30 individuals from each stratum and combines them to form the sample.
Cluster sampling is done when the population of interest contains naturally-occurring groups (or “clusters”) that
do not greatly differ from one another but contain observations with differing characteristics. A cluster sample is
obtained by selecting one of these clusters as the sample.
Example
Assuming that all counties in Idaho are relatively equal in terms of political and demographic makeup, the
researcher selects all voters from Latah County to act as the sample.
Note: there are differences between strata, but within strata the observations share similar characteristics.
Conversely, there are similarities between clusters, but within the clusters, the observations have differing
characteristics.
In some cases - due to monetary/time restrictions, types of variables of interest, etc. - random sampling is not a
possibility. In such cases, there are a few non-random sampling techniques that could be employed.
Convenience sampling is done when the observations selected for the sample are simply those that are the
easiest to reach/obtain.
Example
The researcher uses the first page of a phone book and calls every (registered to vote) individual on the page to
ask if they voted for Biden.
Voluntary sampling is done when the observations (people, in this case) “volunteer” to be in the sample.
Example
A researcher sets up a booth in a mall with a banner asking, “Who Did You Pick in the 2020 Election?” and asks all
those who approach if they had voted for Biden. The sample is then all who approached the booth.
What is the “best” sampling method? Ideally, a SRS is best, as it is most likely to lead to a genuinely “random”
selection from the population. However, a true SRS is difficult to do. Thus, the “best” sampling method really
depends on the situation. The more “randomly” you can select individuals for your sample, the more likely your
sample is to be a good representation of the population!
The goal of collecting data in a sample is to be able to analyze that data and then generalize the findings back to
the larger population. Remember that it is important that our sample be representative of our population in order to
be able to do this!
Now we’ll look at the step that comes after sampling: examining and summarizing the information in the collected
data. The techniques we discuss in this unit are often grouped under the general term exploratory data analysis
(EDA), which involves using both numerical and graphical summaries to explore sample data with the goal of
understanding what it means and how best to use it.
Describing Qualitative (Categorical) Variables Numerically
Recall the definition of qualitative variables: qualitative (categorical) variables are those that classify
observations into one of a group of categories. Think “quality” or “what type?”
Examples
Eye color (blue, brown, green, etc.)
Ratings or rankings of songs (“dislike very much” to “like very much”)
Letter grades on a test (A, B, C, etc.)
Sample Proportions
When we want to summarize or describe a qualitative variable numerically, the most common way of doing so is
by calculating a proportion. A proportion is a ratio or fraction and is computed by taking the number of
observations that fall into a particular category or class of interest and dividing it by the number of total
observations.
If we are calculating a sample proportion, we usually denote it p̂ and calculate it as follows:

p̂ = X / n

where X represents the number of observations in a particular category/class of interest and n is the sample size.
Recall that we often want to use p̂ as an estimate of p, a (usually unknown) population proportion. We’ll talk more
about this type of estimation later in the course.
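For instance, with hypothetical numbers: if X = 52 students in a sample of n = 208 report having green eyes, then p̂ = 52/208 = 0.25.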
Example 3.1.2
The following table shows a distribution of eye color for 208 students taking an intro
calculus course.
What proportion of students in this sample have green eyes?
Describing Quantitative (Numeric) Variables Numerically
Recall the definition of quantitative variables: quantitative (numerical) variables are those that are measured on
a naturally numerical scale. Think “quantity” or “how much?” When we wish to describe quantitative variables
numerically, we tend to focus on ways to describe two separate features: center and spread.
Center: Measures of Central Tendency
Measures of central tendency (or measures of location) describe the tendency of the variables’ values to
center or cluster about a certain value. The two most common measures of central tendency are the mean and the
median.
The mean is the arithmetic average and is the sum of all observations divided by the total number of observations.
If we are calculating a sample mean, we usually denote it x̄ and calculate it as follows:

x̄ = ( ∑ᵢ₌₁ⁿ xi ) / n
where xi represents the value of the random variable for the ith observation and n is the sample size.
Recall that we often want to use x̄ as an estimate of μ, a (usually unknown) population mean. We'll talk more about this type of estimation later in the course.
The median is the middle number when the observations are ordered from smallest to largest. If we are calculating a sample median, we usually denote it x̃.
If n is odd, x̃ is simply the middle number of the ordered values
If n is even, x̃ is the mean of the middle two numbers of the ordered values
Just like with the mean, we can use x̃ as an estimate of a (usually unknown) population median, which is usually denoted μ̃.
Consider a vector of data named dataset. In R, we can calculate the mean of the data by using mean(dataset)
and the median of the data by using median(dataset).
Example 3.1.3
Consider the following data set: 1 3 5 5 6 9 13 14 50
Using R, calculate the mean and the median of this data set.
dataset = c(1,3,5,5,6,9,13,14,50)
mean(dataset)
## [1] 11.77778
median(dataset)
## [1] 6
Spread: Measures of Variation
Measures of variation (or measures of spread) describe the variation of the observations about a measure of
central tendency. There are several measures of spread that we will discuss in this class.
The variance measures how far a set of observations are spread out from their average value. If we are calculating a sample variance, we usually denote it s² and calculate it as follows:

s² = ∑ᵢ₌₁ⁿ (xi − x̄)² / (n − 1)

where xi represents the value of the random variable for the ith observation, x̄ is the sample mean, and n is the sample size.
The variance involves the calculation of the squared average deviation of each observation from the mean (this is
the numerator term). Because of this, variance is expressed in squared units, whatever those units happen to be
for a particular variable. This can make it hard to interpret (how can we interpret “dollars squared” or “grades
squared?”).
The standard deviation is the square root of the variance and is expressed in whatever units the variable is expressed in. This makes it an easier-to-interpret measure of spread, and it is thus much more commonly used. If we are calculating a sample standard deviation, we usually denote it s and calculate it as follows:

s = √s² = √[ ∑ᵢ₌₁ⁿ (xi − x̄)² / (n − 1) ]
Consider a vector of data named dataset. In R, we can calculate the variance of the data by using var(dataset) and the standard deviation of the data by using sd(dataset) (or, of course, by taking the square root of var(dataset)).
Example 3.1.4
Consider the following data set: 1 3 5 5 6 9 13 14 50
Using R, calculate the variance and standard deviation of this data set.
dataset = c(1,3,5,5,6,9,13,14,50)
var(dataset)
## [1] 224.1944
sd(dataset)
## [1] 14.97312
A percentile indicates the value below which a given percentage of observations fall. For example, the 90th
percentile is the value below which 90% of the observations fall.
Some useful percentiles are given specific names:
The 25th percentile of the data is sometimes called the first quartile (Q1)
The median is the same as the 50th percentile of the data
The 75th percentile of the data is sometimes called the third quartile (Q3)
Think of Q1 and Q3 as the median values of the lower half of the ordered data and the upper half of the ordered data, respectively (not counting the value of the median). If we think of the interval of data between Q1 and Q3, we can think of that interval as capturing/containing the "middle" 50% of our data.
Consider a vector of data named dataset. In R, we can calculate any percentile that we want by using quantile(dataset, p), where p is the desired percentile expressed as a proportion between 0 and 1 (e.g., p = 0.25 for the 25th percentile).
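For example, using the data set from Example 3.1.3, the 90th percentile can be computed as follows (R's default method interpolates between ordered values):

dataset = c(1,3,5,5,6,9,13,14,50)
quantile(dataset, 0.90)

## 90%
## 21.2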
Describing Quantitative (Numeric) Variables Graphically
We will focus on a specific type of graph that can be used to help describe quantitative variables: the boxplot.
A boxplot is a graphical representation based on five numbers: the minimum value of the data, Q1 , the median,
Q3 , and the maximum value of the data.
Example
The following is a boxplot representing the distribution of sepal widths for n = 150 iris flowers from a built-in dataset in R.
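Since the sepal widths come from R's built-in iris data frame, a minimal sketch of the code behind a plot like this is:

boxplot(iris$Sepal.Width)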
In R’s printout of a boxplot, values marked as ○ are classified as outliers. An outlier is a value that is either
extremely large or extremely small when compared to the rest of the values in a sample. Specifically, an outlier is a
value that is larger than the upper fence value or smaller than the lower fence value. These values are calculated
as follows:
Upper fence: Q3 + 1.5(Q3 − Q1)
Lower fence: Q1 − 1.5(Q3 − Q1)
Note that the “whiskers” of the boxplot extend to the most extreme upper and lower points that are not outliers, not
to the fence values!
Consider a vector of data named dataset. In R, we can plot a boxplot of the data by using boxplot(dataset).
Example 3.1.5
The following dataset contains the heights (in inches) of a random selection of 100
active MLB players.
mlb=c(72,76,73,73,80,74,72,74,70,75,73,75,75,73,75,74,68,76,72,72,74,74,74,77,75,75,77,77,74,79,
75,74,76,70,75,74,72,75,75,75,68,76,75,76,77,77,74,73,73,74,75,78,76,73,74,72,71,73,70,77,73,76,
79,75,71,74,73,70,73,72,74,72,75,71,77,72,73,74,79,74,71,75,75,77,74,72,72,76,76,74,70,75,75,72,
72,72,71,77,72,76)
1. Use R to compute the median, Q1 , and Q3 of this data set.
median(mlb)
## [1] 74
quantile(mlb, 0.25)
## 25%
## 72
quantile(mlb, 0.75)
## 75%
## 75
2. Use R to create a boxplot of this data.
boxplot(mlb)
3. For the boxplot above, calculate the lower and upper fences.
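Using the quartiles from part 1 (Q1 = 72, Q3 = 75), we have Q3 − Q1 = 3, so:

Upper fence: 75 + 1.5(3) = 79.5
Lower fence: 72 − 1.5(3) = 67.5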
4. Compute the 10th percentile of heights.
quantile(mlb, 0.10)
## 10%
## 71
Statistics 213 – 3.2: Regression
© Scott Robison, Claudia Mahler 2019 all rights reserved.
Objectives:
Be able to create and interpret a scatterplot
Be able to calculate and interpret the correlation coefficient
Be able to calculate and interpret the coefficient of determination
Be able to run a regression analysis in R and appropriately interpret the output
Be able to make predictions using a regression equation
Be able to calculate residuals
Be able to check regression model assumptions
Motivation:
In the previous set of notes, we focused in part on exploratory data analysis (EDA). The techniques we learned in
that set of notes are used to describe the behavior of a variable in a given sample.
Examples (from the 3.1 notes)
Letter grades on a midterm for a sample of n = 107 students
Eye colors for a sample of n = 208 students
Heights (in inches) for a sample of n = 500 MLB players
In the above examples, notice how we only focused on one variable at a time - letter grade, eye color, or height.
While it is important to be able to describe individual variables in a given sample, there are many situations in
which the relationship between two variables might be of more interest.
Examples
How can we describe the relationship between height and weight in high school students?
To what degree are temperature and rainfall levels related?
Is there an association between the duration of an eruption of the Old Faithful geyser and the amount of
time between eruptions?
In this unit, we will discuss three main ways of describing and summarizing the relationship between two variables:
using scatterplots, computing the correlation coefficient, and determining the regression equation.
Bivariate Data
When we refer to bivariate data, we’re referring to information collected on two variables from the same set or
sample of observations.
Example
Suppose we were interested in the relationship between height and weight in high school students. We take a sample of n = 43 students and measure each student's height and weight.
In this case, we have two variables of interest: height and weight. Each individual in our sample is measured on
both of these variables. We could keep track of this information in a spreadsheet, with one row per student and one column per variable.
Notice that these measurements of height and weight are “paired” in the sense that there is one height and one
weight for each of the n students.
Explanatory vs. Response Variables
As we saw in the 3.1 notes, when researchers are interested in the relationship between two variables, they often
suspect that one variable may be responsible (at least in part) for some of the change in the other variable. Thus,
when we’re looking at the relationship between two variables, we often consider one as the “explainer” and one as
the “responder.” Recall these definitions from the 3.1 notes:
An explanatory variable (independent variable, predictor variable, x-variable) is one that may explain or
cause some degree of change in another variable.
A response variable (dependent variable, y-variable) is a variable that changes - at least in part - due to the
changes in the explanatory variable.
We’ll return to explanatory and response variables in a little bit. The rest of the notes will focus on three main ways
of describing and summarizing the relationship between two variables, starting with scatterplots.
Scatterplots
As we saw in the last set of notes, the quickest and easiest way to get an idea of the behavior of a variable is to
generate some sort of picture or graph of it.
A scatterplot is a two-dimensional plot with one variable's values plotted along the horizontal axis (x-axis) and the other variable's values plotted along the vertical axis (y-axis). It is a good way to visualize the relationship between two variables.
It is accepted practice to plot the explanatory variable (x-variable) along the x-axis and the response variable (y-variable) along the y-axis.
Example 3.2.1
The heights (in inches) and weights (in pounds) were recorded for a sample of
n = 43 high school students. The following R code shows how these values are read
into R and then displayed as a data frame called “students”:
heights = c(73, 69, 70, 72, 73, 69, 68, 71, 71, 68, 69, 67, 66, 67, 72, 68, 75, 68, 73, 72, 72,
72, 72, 74, 68, 73, 68, 70, 72, 70, 67, 67, 71, 72, 73, 68, 72, 68, 67, 70, 71, 70, 67)
weights = c(195, 135, 145, 170, 172, 168, 155, 185, 175, 158, 185, 146, 135, 150, 160, 155, 230,
149, 240, 170, 198, 163, 230, 170, 151, 220, 145, 130, 160, 210, 145, 185, 237, 205, 147, 170,
181, 150, 150, 200, 175, 155, 167)
students = data.frame(heights, weights)
head(students)
##   heights weights
## 1      73     195
## 2      69     135
## 3      70     145
## 4      72     170
## 5      73     172
## 6      69     168
To make a scatterplot of the data, we use the R function plot(x,y)
where x is the variable whose values are to be plotted along the x-axis and y is the variable whose values are to
be plotted along the y-axis.
plot(heights, weights)
Notice that there is a point on the scatterplot for each of the n = 43 students. The points represent each pair of measurements (height, weight) for each student.
As you can see, a scatterplot can give a general idea of the relationship between the two variables. In general, we
like to describe this relationship in terms of both its direction and its strength.
The direction of a relationship can generally be described as positive, negative, curvilinear, or non-existent.
A positive linear relationship exists when, as the values of one of the variables increase, the values of the
other generally increase as well. (Plot A)
A negative linear relationship exists when, as the values of one of the variables increase, the values of the
other generally decrease. (Plot B)
A curvilinear relationship exists when the relationship between the variables' values can best be described by a curve (rather than a line). (Plot C)
A non-existent relationship exists when there is no consistent relationship between the values of the two
variables. (Plot D)
The strength (or magnitude) of a relationship is based on how tightly the "cloud" of points is clustered about the trend line (the line/curve that best describes the direction of the relationship). The more tightly clustered the cloud, the stronger the relationship is between the two variables.
Example 3.2.1 (revisited)
The relationship between height and weight in the example with n = 43 high school
students appears to be a moderately strong positive linear relationship.
Correlation
Using a scatterplot is a good way to get a quick general idea of the relationship between two variables, but what if
we wanted some way to quantify this relationship beyond the subjective interpretation of a scatterplot?
Pearson’s correlation coefficient is a measure of the direction and strength of a linear relationship between two
quantitative variables x and y. The sample correlation coefficient is usually denoted r and is computed as:
r = ( ∑ᵢ₌₁ⁿ xi·yi − n·x̄·ȳ ) / ( (n − 1)·sx·sy )

where xi is the ith observation of the x variable, yi is the ith observation of the y variable, x̄ and ȳ are the means of the x and y variables, respectively, sx and sy are the standard deviations of the x and y variables, respectively, and n is the sample size.
The following are important features of the correlation coefficient:
The distinction between the explanatory variable and the response variable doesn’t matter in the calculation
or interpretation of r
−1 ≤ r ≤ 1
The sign of r indicates the direction of the linear relationship
+r indicates a positive relationship
−r indicates a negative relationship
Values of r closer to −1 or 1 suggest a strong linear relationship (with r = 1 and r = −1 representing
perfect positive and negative correlations, respectively); values of r closer to 0 suggest a weak linear
relationship
To compute correlation in R, we use the R function cor(x,y).
Example 3.2.1 (Revisited)
The heights (in inches) and weights (in pounds) were recorded for a sample of
n = 43 high school students. Use R to compute the correlation coefficient for the
heights and weights of these students. Interpret this value.
cor(heights, weights)
## [1] 0.5684901
Example 3.2.2
The annual rainfall (in mm) and maximum daily temperature (in Celsius) were
recorded for n = 11 different locations in Mongolia.
rain = c(196, 196, 179, 197, 149, 112, 125, 99, 125, 84, 115)
temperature = c(5.7, 5.7, 7, 8, 8.5, 10.7, 11.4, 10.9, 11.4, 11.4, 11.4)
Use R to compute the correlation coefficient for the rainfall and temperatures of these
locations. Interpret this value.
cor(temperature, rain)
## [1] -0.918617
Coefficient of Determination
Another way to examine a bivariate relationship is to measure the contribution of the explanatory variable in
predicting the value of the response variable.
The coefficient of determination is the squared correlation coefficient (r² or R²) and represents the proportion of the total sample variation in the response variable that can be explained by its linear relationship with the explanatory variable.
Note that since −1 ≤ r ≤ 1, 0 ≤ r² ≤ 1. Also note that you need to know which variable is being treated as the explanatory variable and which variable is being treated as the response variable in order to interpret r².
Example 3.2.3
In the height/weight example involving the n = 43 high school students, what
percentage of variation in weight can be explained by its linear relationship with
height?
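A sketch of the answer, using the correlation r = 0.5684901 computed in Example 3.2.1: r² = 0.5684901² ≈ 0.3232, so about 32.3% of the variation in weight can be explained by its linear relationship with height. In R:

cor(heights, weights)^2

## [1] 0.323181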
Regression
We can get even more detailed than correlation when describing the relationship between two variables.
Regression is a technique that allows us to not only summarize a linear relationship but to make predictions about
future values that have yet to be observed.
In this class, we will focus on simple linear regression, which is a predictive model that describes the relationship
between our explanatory variable and our response variable. Simple linear regression fits (or models) the
prediction of the response variable from its relationship with the explanatory variable. This is done by observing
measures on both an explanatory variable and a response variable and “fitting” a line that best describes the
relationship between the two variables.
The regression equation is what defines the straight line that best describes how the values of the response
variable are related, on average, to the values of one explanatory variable. The regression line itself is just the
name of the line defined by the regression equation - the line of “best fit.”
The Regression Equation and Regression Line
Let’s look at the heights/weights example with the n = 43 high school students again. Suppose we wanted an
equation that tells us how best to predict weight (y) given a specific height (x ). In this case, we’ve got a set of data
that is made up of 43 (x, y) points. We can use this data to come up with an equation of a line that “best fits” the
relationship between height and weight.
Recall the equation for a straight line:
y = mx + b
where y = a value of the y-variable, m = slope, b = y-intercept, and x = a value of the x-variable.

Rearrange the right-hand side (and change the symbols) and you have our regression equation:

ŷ = β̂0 + β̂1x

where ŷ is the predicted mean y value for a given x, β̂0 is an estimate of the y-intercept based on the data, β̂1 is an estimate of the slope based on the data, and x is a value of our explanatory variable.
The value of the y-intercept in a regression context has the same meaning as the value of the y-intercept in a general math context. That is, it is the average value of the y-variable when x = 0. In some scenarios, the y-intercept has a meaningful interpretation (e.g., the growth rate of a tree when temperature = 0 degrees). However, in other cases, the interpretation is meaningless and/or nonsensical (e.g., the weight of a child when height = 0 inches). It depends on the variables!
A much more meaningful value in the context of regression is the slope. In regression, the value of the slope has
a slightly more specific meaning than it does in a general math context. In math, the slope is defined as the
change in the y-variable compared to the change in the x-variable. Some may be familiar with the phrase “rise
over run.”
In regression, it’s the same idea - however, it’s more specific. The regression slope is defined as the average
change in the y-variable for every unit change in the x-variable. Think of it as “rise over one.”
While we’ll mostly be relying on R to do our regression calculations for us, the following are the equations used to
obtain the sample slope and y-intercept values.
β̂1 = ∑ᵢ₌₁ⁿ (xi − x̄)(yi − ȳ) / ∑ᵢ₌₁ⁿ (xi − x̄)² = r·(sy/sx)

β̂0 = ȳ − β̂1·x̄
To run a regression analysis on an explanatory variable x and a response variable y in R, we use the function lm(y~x). Make sure your response variable (y) is listed first; otherwise you will not get the correct regression output!
Example 3.2.4
The annual rainfall (in mm) and maximum daily temperature (in Celsius) were
recorded for n = 11 different locations in Mongolia. A regression analysis was
performed to express annual rainfall (“rain”) as a linear function of maximum daily
temperature (“temperature”).
rain = c(196, 196, 179, 197, 149, 112, 125, 99, 125, 84, 115)
temperature = c(5.7, 5.7, 7, 8, 8.5, 10.7, 11.4, 10.9, 11.4, 11.4, 11.4)
fit = lm(rain~temperature)
fit
## 
## Call:
## lm(formula = rain ~ temperature)
## 
## Coefficients:
## (Intercept)  temperature
##      295.25       -16.36
1. Write down the regression equation. Interpret the slope and the y-intercept.
2. What percentage of variation in annual rainfall is explained by its linear
relationship with temperature?
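A sketch of the answers based on the output above: (1) the regression equation is ŷ = 295.25 − 16.36x, where ŷ is the predicted annual rainfall (mm) and x is the maximum daily temperature (°C). Slope: for every 1°C increase in maximum daily temperature, annual rainfall decreases by 16.36 mm on average. Y-intercept: the predicted rainfall at a maximum daily temperature of 0°C is 295.25 mm, though 0°C lies outside the observed temperatures, so this interpretation should be treated with caution. (2) From Example 3.2.2, r = −0.918617, so r² ≈ 0.8439: about 84.4% of the variation in annual rainfall is explained by its linear relationship with temperature.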
Example 3.2.5
Old Faithful is a popular geyser in Yellowstone National Park that is famous for its
consistent eruptions. The duration (length) of the geyser’s eruptions (in minutes) as
well as the amount of time spent waiting after the previous eruption (in minutes) were
recorded for n = 144 eruptions. A regression analysis was performed to express
eruption duration (“duration”) as a linear function of waiting time (“waiting”).
duration = c(3.3, 3.3, 3.3, 3.4, 3.5, 3.5, 3.7, 3.7, 3.7, 3.7, 3.8, 3.8, 3.8, 3.9, 3.9, 3.9, 3.9,
3.9, 3.9, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4.1, 4.1, 4.1, 4.1, 4.1, 4.1, 4.1, 4.2, 4.2, 4.2, 4.2, 4.2,
4.2, 4.2, 4.2, 4.2, 4.2, 4.2, 4.2, 4.2, 4.2, 4.3, 4.3, 4.3, 4.3, 4.3, 4.3, 4.3, 4.3, 4.3, 4.3,
4.3, 4.3, 4.3, 4.3, 4.3, 4.4, 4.4, 4.4, 4.4, 4.4, 4.4, 4.4, 4.4, 4.4, 4.4, 4.4, 4.4, 4.4, 4.5,
4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5,
4.5, 4.6, 4.6, 4.6, 4.6, 4.6, 4.6, 4.6, 4.6, 4.6, 4.6, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7,
4.7, 4.7, 4.7, 4.7, 4.7, 4.8, 4.8, 4.8, 4.8, 4.8, 4.8, 4.8, 4.8, 4.8, 4.8, 4.9, 4.9, 4.9, 4.9,
5, 5, 5, 5, 5, 5, 5.1, 5.3, 5.5)
waiting = c(74, 74, 74, 73, 85, 73, 74, 71, 84, 73, 71, 94, 82, 69, 77, 76, 77, 84, 77, 71, 92,
79, 75, 88, 74, 87, 89, 76, 78, 86, 80, 76, 77, 80, 85, 84, 91, 80, 73, 81, 71, 83, 88, 76, 68,
93, 82, 86, 68, 77, 87, 89, 81, 87, 79, 72, 72, 87, 78, 84, 89, 89, 87, 93, 86, 73, 81, 75, 78,
79, 92, 87, 87, 76, 93, 75, 87, 76, 75, 80, 84, 79, 80, 88, 88, 80, 81, 72, 90, 80, 81, 93, 91,
79, 80, 84, 86, 96, 93, 92, 78, 80, 85, 82, 78, 87, 72, 73, 83, 81, 74, 82, 75, 87, 85, 80, 75,
83, 84, 84, 78, 81, 82, 87, 93, 96, 78, 84, 78, 85, 87, 90, 76, 88, 98, 85, 93, 79, 89, 84, 83,
89, 77, 89)
fit = lm(duration~waiting)
fit
## 
## Call:
## lm(formula = duration ~ waiting)
## 
## Coefficients:
## (Intercept)      waiting
##     2.78244      0.01948
1. Write down the regression equation. Interpret the slope.
2. Use R to compute the correlation. Interpret this value.

cor(duration, waiting)
## [1] 0.3294965
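A sketch of the answers: (1) the regression equation is ŷ = 2.78244 + 0.01948x, where ŷ is the predicted eruption duration (minutes) and x is the waiting time (minutes). Slope: each additional minute of waiting time is associated with an average increase of about 0.019 minutes in eruption duration. (2) r = 0.3294965 indicates a fairly weak positive linear relationship between waiting time and duration.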
Predicting Using the Regression Equation
We can get a general summary of our bivariate relationship by examining the regression equation and knowing
how to interpret the y-intercept and the slope. We can also use our regression equation to predict the value of the
response variable (y) for a specific value of the explanatory variable (x ).
A predicted ith value of our response variable, denoted ŷi, can be obtained by plugging the specific ith explanatory variable value (xi) into the regression equation and solving for ŷi:

ŷi = β̂0 + β̂1xi
Note: be wary of extrapolation! Extrapolation involves using the regression equation to predict y values for x
values outside of the original range of the data set.
Example 3.2.6
The heights (in inches) and weights (in pounds) were recorded for a sample of
n = 43 high school students. A regression analysis was performed to express weight
(“weights”) as a linear function of height (“heights”).
heights = c(73, 69, 70, 72, 73, 69, 68, 71, 71, 68, 69, 67, 66, 67, 72, 68, 75, 68, 73, 72, 72,
72, 72, 74, 68, 73, 68, 70, 72, 70, 67, 67, 71, 72, 73, 68, 72, 68, 67, 70, 71, 70, 67)
weights = c(195, 135, 145, 170, 172, 168, 155, 185, 175, 158, 185, 146, 135, 150, 160, 155, 230,
149, 240, 170, 198, 163, 230, 170, 151, 220, 145, 130, 160, 210, 145, 185, 237, 205, 147, 170,
181, 150, 150, 200, 175, 155, 167)
fit = lm(weights~heights)
fit
## 
## Call:
## lm(formula = weights ~ heights)
## 
## Coefficients:
## (Intercept)      heights
##    -317.919        6.996
1. Predict the weight of a student who is 70 inches tall.
2. Should we predict the weight of a student who is 62 inches tall?
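A sketch of part 1: ŷ = −317.919 + 6.996(70) = 171.801, so we predict a weight of about 171.8 pounds. The same prediction can be obtained with base R's predict() function (the printed coefficients are rounded, so the result may differ slightly):

predict(fit, newdata = data.frame(heights = 70))

For part 2: no. The smallest height in the sample is 66 inches, so predicting at 62 inches would be extrapolation.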
Residuals
Let’s look again at the scatterplot of heights and weights for the n = 43 high school students, except this time let’s
also plot the regression line.
plot(heights, weights)
abline(fit)
Notice that the regression line isn’t perfect! For example, our calculation for the weight of a student who is 70
inches tall is not the same as any of the weights of the 70-inch-tall students in our actual sample.
In other words, there exist prediction errors in our estimation. A prediction error (or residual) is the difference between the observed y value (yi) and the predicted y value (ŷi) for any given x value. For the ith x value, the residual ei is calculated as:

ei = yi − ŷi
Example 3.2.7
In our height and weight example, a 70-inch tall student weighed 130 pounds.
Compute this student’s residual.
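A sketch of the answer, using the prediction from Example 3.2.6: ŷ = −317.919 + 6.996(70) ≈ 171.8, so e = 130 − 171.8 = −41.8 pounds. The negative residual means this student weighs less than the regression line predicts.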
So you may be wondering: “if the regression line isn’t perfect, why is there a unique regression equation/line for
any given problem?”
Our method of calculating the regression line involves the method of least squares. The regression line is the
line that minimizes the total (summed) squared residuals. In other words, it is the line for which
∑ᵢ₌₁ⁿ (yi − ŷi)²
is minimized. The regression line is therefore the best possible linear explanation for the relationship between the
two variables.
Regression Model Assumptions
As you can see, regression is quite powerful. It allows us to describe the linear relationship between two variables
and make predictions. However, in order for a regression analysis to be accurate, a few assumptions must be met:
1. The assumption of constant variance (homoscedasticity): the variance of the residuals is constant. That is, the variance of the ei values is the same regardless of the value of xi.
2. The assumption of normality: the distribution of the residuals is normal.
Checking Model Assumptions
In this class, we will focus on how to check these assumptions using plots.
Checking for constant variance (homoscedasticity): look at a residual (or “residuals vs. fits”) plot.
Good: The points appear fairly uniformly scattered about the flat dotted line. This suggests
homoscedasticity.
Bad: The points either gradually fan out from the line or gradually condense about the line. This suggests
that the variance is not constant (heteroscedasticity).
Checking for normality: look at a normal probability plot.
Good: The points appear in a straight (or near-straight) line and follow the diagonal line. This suggests
normality.
Bad: The points deviate from the diagonal line. This suggests non-normality.
Suppose you have a regression analysis that you have named fit. To create the normal probability plot and
residual plot, you would use the R function plot(fit).
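Note that plot(fit) cycles through several diagnostic plots. To request a specific one, pass the which argument: plot(fit, which = 1) produces the residuals vs. fitted plot, and plot(fit, which = 2) produces the normal probability (Q-Q) plot.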
Example 3.2.8
Recall the heights (in inches) and weights (in pounds) that were recorded for a
sample of n = 43 high school students. A regression analysis was performed to
express weight (“weights”) as a linear function of height (“heights”).
heights = c(73, 69, 70, 72, 73, 69, 68, 71, 71, 68, 69, 67, 66, 67, 72, 68, 75, 68, 73, 72, 72,
72, 72, 74, 68, 73, 68, 70, 72, 70, 67, 67, 71, 72, 73, 68, 72, 68, 67, 70, 71, 70, 67)
weights = c(195, 135, 145, 170, 172, 168, 155, 185, 175, 158, 185, 146, 135, 150, 160, 155, 230,
149, 240, 170, 198, 163, 230, 170, 151, 220, 145, 130, 160, 210, 145, 185, 237, 205, 147, 170,
181, 150, 150, 200, 175, 155, 167)
fit = lm(weights~heights)
fit
## 
## Call:
## lm(formula = weights ~ heights)
## 
## Coefficients:
## (Intercept)      heights
##    -317.919        6.996
Use the residual plot and the normal probability plot to assess the assumptions of
homoscedasticity and normality.
plot(fit)
Statistics 213 – 4.1: Sampling from Well-Known Distributions
© Scott Robison, Claudia Mahler 2019 all rights reserved.
Objectives:
Be able to understand the concept of a sampling distribution
Be able to use the Central Limit Theorem (CLT) to describe the distribution of the sample mean for
samples from well-known distributions
Be able to use a sampling distribution to calculate probabilities related to sample mean values
Motivation:
So far, the main focus in this class has been on describing how a single observation from a particular distribution
will behave.
Examples
Suppose X represents the weight of a baby panda at the zoo. If X is normally distributed with a mean μ of 100g and a standard deviation σ of 12g, what is P(X > 105)?
Suppose Y represents the amount of time spent waiting at a bus stop for a bus. If Y is exponentially distributed with β = 7.5 minutes, what is P(Y < 6)?
Suppose W represents the number of times you roll a 1 when rolling a fair six-sided die ten times. If W is binomial with n = 10 and p = 1/6, what is P(W = 0)?
Now let’s shift our focus a little. Back in the 3.1 notes, we discussed the idea of sampling observations from a
population. Recall that a sample is a selected subset of n observations from the population and is usually much
smaller than the population itself.
We can consider this same idea when we think of taking multiple observations from a particular distribution.
Examples
Suppose X1, X2, ..., X30 represents a sample of 30 observations from a normal distribution with a mean of 100g and a standard deviation of 12g.
Suppose Y1, Y2, ..., Y50 represents a sample of 50 observations from an exponential distribution with β = 7.5 minutes.
Suppose W1, W2, ..., W88 represents a sample of 88 observations from a binomial distribution with n = 10 and p = 1/6.
The goal of this set of notes is to understand how we can use the Central Limit Theorem to describe how a sample
of observations from a particular distribution will behave.
The Central Limit Theorem
Definition
Consider a random sample of n observations selected from a population (any population) with mean μx and standard deviation σx. Then, when n is sufficiently large, the sampling distribution of x̄ will be approximately a normal distribution with mean μx̄ = μx and standard deviation σx̄ = σx/√n. The larger the sample size, the better will be the normal approximation to the sampling distribution of x̄.

Z = (x̄ − μx) / (σx/√n) = (X̄ − E[X]) / (SD[X]/√n) ≈ N(0, 1)
Example 4.1.1
Let X be a Poisson random variable with λ = 5. That is, X ∼ Pois(λ = 5). We have seen previously that μx = E[X] = λ and that σx = SD[X] = √λ.

1. Find the probability that if seven samples were drawn from the distribution of X, the mean X̄ = ( ∑ᵢ₌₁⁷ xi ) / 7 will be between 4.76 and 5.35. In other words, we want to find P(4.76 ≤ X̄ ≤ 5.35) where X ∼ Pois(λ = 5).
pnorm(q = 5.35, mean = 5, sd = sqrt(5/7)) - pnorm(q = 4.76, mean = 5, sd = sqrt(5/7))
## [1] 0.2723929
pnorm((5.35-5)/(sqrt(5/7))) - pnorm((4.76-5)/(sqrt(5/7)))
## [1] 0.2723929
2. Suppose we are interested in an interval for X̄ such that P(a ≤ X̄ ≤ b) = 0.95. What could the values a and b take? Use R to determine these values.
qnorm(p = c(.025,.975), mean = 5, sd = sqrt(5/7))
## [1] 3.343528 6.656472
That is, we can say that if we were to take a random sample of seven values from a Poisson random variable with λ = 5, 95% of such samples would have a sample mean between 3.343528 and 6.656472. Or, equivalently, 3.343528 ≤ x̄ ≤ 6.656472.
Example 4.1.2
Suppose we randomly select five cards (without replacement) from an ordinary deck of 52 playing cards. Let Y be the number of face cards you pick (a "face card" meaning a jack, queen, or king).
1. What is the probability of selecting between zero and three (inclusive) face
cards? Find this probability using R.
phyper(q = 3, m = 12, n = 40, k = 5)
## [1] 0.9920768
sum(dhyper(x = 0:3, m = 12, n = 40, k = 5))
## [1] 0.9920768
Note that since we know that Y follows a hypergeometric distribution with m = 12, n = 40, and k = 5, we also know

E[Y] = μy = km/(m + n) = (5 · 12)/(12 + 40) = 60/52 = 15/13

SD[Y] = σy = √[ k · (m/(m + n)) · (n/(m + n)) · ((m + n − k)/(m + n − 1)) ] = √[ 5 · (12/52) · (40/52) · (47/51) ] = √(2350/2873)
2. You select five cards out of the shuffled deck and count the number of face cards. You then repeat this process 14 more times (for a total of 15 times). Consider the average number of face cards you would see out of the five cards selected on a given iteration. What is P(0 ≤ Ȳ ≤ 3)?
pnorm(q = 3, mean = 15/13, sd = (2350/2873)^.5/(15)^.5) - pnorm(q = 0, mean = 15/13, sd = (2350/
2873)^.5/(15)^.5)
## [1] 0.9999996
pnorm((3-15/13)/((2350/2873)^.5/(15)^.5)) - pnorm((0-15/13)/((2350/2873)^.5/(15)^.5))
## [1] 0.9999996
3. Suppose we are interested in an interval for Ȳ such that P(a ≤ Ȳ ≤ b) = 0.95. What could the values a and b take? Use R to determine these values.
qnorm(p = c(.025,.975), mean = 15/13, sd=(2350/2873)^.5/(15)^.5)
## [1] 0.6961592 1.6115332
That is, we can say that if we were to draw five cards from a deck of cards 15 separate times, then computed the average number of face cards from each of those 15 individual draws of five cards, 95% of the time the average number of face cards selected out of the five cards will be between 0.6961592 and 1.6115332. Or, equivalently, 0.6961592 ≤ Ȳ ≤ 1.6115332.
One thing to note about the CLT that is very interesting: no matter what the distribution of our random variable X is - normal, exponential, Poisson, or something else - the distribution of the sample mean, X̄, will be approximately normal with mean μx̄ = μx and standard deviation σx̄ = σx/√n, provided your sample size n is "large enough."
This applies even when the distribution of X is not a “named” probability distribution, as long as you have a
probability distribution table.
Example 4.1.3
Consider rolling two dice and consider the summed total of the two die faces. The probability distribution table for the sum of the dice, X, is given as follows:

x:      2     3     4     5     6     7     8     9     10    11    12
P(X=x): 1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
1. What is the expected sum when you roll two dice, E[X] = μX?

2. What is the standard deviation of the sum when you roll two dice, SD[X] = σX?
3. If the pair of dice were rolled n = 47 times and the outcomes were recorded,
give an interval that would account for the mean of the 47 rolls 95% of the time.
x = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
p = c(1/36, 2/36, 3/36, 4/36, 5/36, 6/36, 5/36, 4/36, 3/36, 2/36, 1/36)
E = sum(x*p)
E
## [1] 7
E2 = sum((x^2)*(p))
VAR = E2 - E^2
VAR
## [1] 5.833333
SD = sqrt(VAR)
SD
## [1] 2.415229
qnorm(p = c(0.025, 0.975), mean = 7, sd = SD/sqrt(47))
## [1] 6.30951 7.69049
qnorm(p = c(0.025, 0.975))
## [1] -1.959964  1.959964
Example 4.1.4
Consider X ∼ Bin(n = 1, p = 0.3). Suppose 64 observations are selected at random from this distribution (n* = 64).

1. What is the expected value of X, E[X] = μX?

2. What is the standard deviation of X, SD[X] = σX?

Now consider p̂ = X̄ = ( ∑ Xi ) / n*, where the sum runs over all n* = 64 observations. Note that μX = E[X] = np since X ∼ Bin.

3. Give an interval that would account for p̂ 95% of the time.
n = 1
p = 0.3
N = 64
E = n*p
E
## [1] 0.3
SD = sqrt(n*p*(1-p))
SD
## [1] 0.4582576
qnorm(p = c(0.025, 0.975), mean = p, sd = SD/sqrt(N))
## [1] 0.187729 0.412271
qnorm(p = c(0.025, 0.975))
## [1] -1.959964  1.959964
4. Give an interval that would account for p̂ 99% of the time.
qnorm(p = c(0.005, 0.995), mean = p, sd = SD/sqrt(N))
## [1] 0.1524508 0.4475492
qnorm(p = c(0.005, 0.995))
## [1] -2.575829  2.575829
This is a precursor to what we’ll be covering in the 4.2 notes…
p̂ ± Zα/2 · √( p̂(1 − p̂) / n )
Statistics 213 – 4.2: Sampling from Unknown Distributions
© Scott Robison, Claudia Mahler 2019 all rights reserved.
Objectives:
Review the definitions of a parameter and a statistic
Be familiar with the concept of confidence intervals and know how to compute them
Understand the basics behind bootstrapping
Motivation:
In the 4.1 notes, we discussed the concept of sampling distributions as well as the Central Limit Theorem, which allows us to describe how a sample of observations from a particular distribution will behave. Recall that the distributions we discussed in the 4.1 notes came from what we considered to be "known" distributions - distributions for which we know the specifics of how to compute the mean and variance/standard deviation of the distribution.
Examples
Suppose X1, X2, ..., X30 represents a sample of 30 observations from a normal distribution with a mean of 100 and a standard deviation of 12. Then the distribution of x̄ is approximately normal with mean μx̄ = μx = 100 and standard deviation σx̄ = σx/√n = 12/√30.
Suppose W1, W2, ..., W88 represents a sample of 88 observations from a binomial distribution with n = 1 and p = 1/6. Then the distribution of p̂ is approximately normal with mean μp̂ = p = 1/6 and standard deviation σp̂ = √( (1/6)(5/6) / 88 ).
But what happens if we run into a scenario where we have a random variable that doesn’t follow one of these
“known” distributions? In this set of notes, we’ll discuss several different approaches we can take when we must
sample from an unknown distribution.
Sampling from Populations
Variation in Sampling
One goal of sampling is to use the sample we collect to make an inference or claim about the larger population
from which the sample was taken. That is, based on a smaller (and hopefully representative!) subset of the
population, what can be said about the population itself?
Example
Suppose we are interested in the average height of all active Major League Baseball (MLB) players. If we take a
sample of n = 30 players and measure their heights, we can use the average height of those 30 players to make
a claim about the average height of all active players.
In the example above, we are using a sample mean to make a claim about a population mean. One thing to note
about using samples to estimate what’s going on in the population is that different samples may produce different
results. That is, samples will vary even when the samples are collected in the same manner.
Example
One sample of n = 30 MLB players gives an average sample height of 74".
Another sample of n = 30 MLB players (collected in the same fashion) gives an average sample height of 75.3".
A third sample of n = 30 MLB players (again, collected in the same fashion) gives an average sample height of 73.9".
Etc.
This variation or variance from sample-to-sample is really the focus of statistics!
Parameters and Statistics - A Review
As was first discussed in the 3.1 notes, one of the best ways of “summarizing” the variation that comes with
collecting data is to aggregate all of the information collected from each object in the target population into a single
numerical property. Recall from the 3.1 notes that a population is the set of all instances (units, people, objects,
events, regions, etc.) we are interested in when wanting to study the behavior of a variable. We usually denote the
size of a population (if known) with N . At the population level these numerical properties are called parameters.
Ideally, we would record information on a variable of interest for every observation in a given population and thus
would be able to compute appropriate population parameters. However, in a lot of cases, it is impractical (or too
expensive, or even impossible) to do so, since most populations are very large. Instead we most often select a
subset of observations from a population of interest and record information on our variable of interest for every
observation in that subset.
Recall from the 3.1 notes that a sample is a selected subset of observations from the population, usually much
smaller than the population itself. We usually denote the size of a sample with n.
By obtaining a measurement from each object in a sample of the population, we can then make inferences about
the population as a whole. Just as we can think about aggregating all of the information collected from each object
in a population into a single numerical property, we can do so in a sample as well. At the sample level,
these summary values are called statistics.
Example 4.2.1
The following data were produced by a simple random sample of houses that have
sold in my area in the past six months. The variable of interest is the selling price of a
home, in $1000s.
575.0, 549.0, 572.5, 649.9, 485.0
Let's consider a few sample statistics based on this sample of n = 5 houses.

The sample mean is calculated as x̄ = ( ∑ᵢ₌₁ⁿ xi ) / n

The sample standard deviation is calculated as s = √[ ∑ᵢ₌₁ⁿ (xi − x̄)² / (n − 1) ]

The 50th percentile (also called the median) is the number such that 50% of the data is less than that number.

The 25th percentile (also called the 1st quartile or Q1) is the number such that 25% of the data is less than that number.

The 75th percentile (also called the 3rd quartile or Q3) is the number such that 75% of the data is less than that number.
library(mosaic)
x = c(575.0, 549.0, 572.5, 649.9, 485.0)
favstats(x)
##  min  Q1 median  Q3   max   mean       sd n missing
##  485 549  572.5 575 649.9 566.28 59.18629 5       0
Now consider a sixth house, randomly picked from those recently sold. The selling price of this particular house was $975,000. Compute the mean and median of this new sample:
485.0, 549.0, 572.5, 575.0, 649.9, 975.0
x=c(575.0, 549.0, 572.5, 649.9, 485.0, 975.0)
favstats(x)
##  min      Q1 median      Q3 max  mean       sd n missing
##  485 554.875 573.75 631.175 975 634.4 175.0555 6       0
Result: the mean jumped from 566.28 to 634.4, while the median barely moved (from 572.5 to 573.75). The sample median is said to be robust, or less sensitive to irregular/extreme data points, compared to the sample mean.
Irregular data points can (and often do) occur when the variable of interest possesses (in the population) a skewed
distribution (either right-skewed or left-skewed). In such cases, the sample median should be used as the
appropriate measure of center.
One must also be careful in using the sample standard deviation as a measure of spread! Use the value of s to estimate the spread in the population variable of interest only when the sample mean x̄ is being used as the measure of center. This is because the sample standard deviation (like the sample mean) is very sensitive to irregular/extreme values. The “skewing effect” on the sample mean is compounded in the calculation of s.
So what is one way we can deal with skewed distributions in the context of trying to estimate parameters? One
option to consider is bootstrapping.
Bootstrapping
In statistics, bootstrapping is any test or metric that relies on random sampling with replacement. Bootstrapping
allows assigning measures of accuracy (defined in terms of bias, variance, confidence intervals, prediction error or
some other such measure) to sample estimates. This technique allows estimation of the sampling distribution of
almost any statistic using random sampling methods.
The general process of bootstrapping is as follows:
1. An initial sample of n observations is taken from a distribution.
2. From that initial sample, we sample with replacement n times to obtain a bootstrap sample. We do this
repeatedly to obtain a large number of bootstrap samples.
3. For each bootstrap sample, the sample statistic of interest (x̄ or p̂) is calculated.
4. These bootstrap sample statistics form the sampling distribution of the sample statistic.
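As a minimal sketch of these steps (ours, not from the notes), here is the same process in base R for the five-house sample from Example 4.2.1; the mosaic package used later wraps the same idea:

# Step 1: the initial sample of n = 5 house prices from Example 4.2.1
x = c(575.0, 549.0, 572.5, 649.9, 485.0)
set.seed(1)  # for reproducibility
# Steps 2-3: resample n values with replacement, many times, and
# compute the statistic of interest (here, the sample mean) each time
boot_means = replicate(1000, mean(sample(x, length(x), replace = TRUE)))
# Step 4: these 1000 means approximate the sampling distribution of the mean
mean(boot_means); sd(boot_means)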
As we will see later on in the notes, we can use this bootstrap-based sampling distribution to form confidence
interval estimates of parameters.
Bootstrapping allows for a better estimation of the sampling distribution of a sample statistic when one cannot
assume that the variable follows a normal distribution (or any of the other well-known distributions, such as
Poisson, exponential, etc.).
To better demonstrate this, let’s go back to the modified Example 4.2.1.
Example 4.2.1 (revisited)
The following data were produced by a simple random sample of six houses that
have sold in my area in the past six months. The variable of interest is the selling
price of a home, in $1000s.
575.0, 549.0, 572.5, 649.9, 485.0, 975.0
Let’s use R to compare the distribution of the sample itself, a distribution based on
the normal distribution, and a bootstrap distribution.
First, the distribution of the sample:
x=c(575.0, 549.0, 572.5, 649.9, 485.0, 975.0)
mean(x)
## [1] 634.4
sd(x)
## [1] 175.0555
favstats(x)
## min      Q1 median      Q3 max  mean       sd n missing
## 485 554.875 573.75 631.175 975 634.4 175.0555 6       0
Now, a distribution based on the normal distribution:
y=rnorm(n = 1000, mean = 634.4, sd = 175.0555)
favstats(y)
##      min       Q1   median       Q3      max     mean       sd    n missing
## 118.7369 521.6396 648.6604 757.2487 1306.529 637.3441 178.8723 1000       0
Finally, a bootstrap distribution:
library(mosaic)
x = c(575.0, 549.0, 572.5, 649.9, 485.0, 975.0)
B = do(1000)*mean(resample(x, 6))
favstats(B$mean)
## min       Q1   median       Q3      max    mean       sd    n missing
## 500 580.2167 632.5833 680.0875 920.8167 635.286 64.70401 1000       0
Compare the means and standard deviations of these three “distributions.” What do we see? While the medians (and means) of the three distributions are approximately the same, the standard deviation of the bootstrap distribution (about 64.7) is far smaller than that of the normal-based distribution (about 178.9). Boxplots of the three show the same thing: there is far less variability in the bootstrap boxplot than in the “normal” boxplot; a sketch for drawing them follows below.
Why does it matter? Bootstrap-based distributions can be considered as more robust against extreme values
compared to traditional methods of estimating parameters. We will see the use of this in the next section.
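The boxplots referred to above are not reproduced in these notes; a quick sketch to draw them (ours, reusing x, y, and B from the three chunks above):

# Draw the three boxplots compared above, reusing x (the sample),
# y (the normal-based draws), and B (the bootstrap means)
boxplot(list(sample = x, normal = y, bootstrap = B$mean),
        ylab = "Selling price ($1000s)")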
Estimating Parameters with Confidence Intervals
As discussed above, parameters represent numerical properties that act as summary values used to describe the
behavior of a given variable in the population of interest. Statistics are these summary values calculated at the
sample level and are typically used to estimate the corresponding (unknown) population parameters.
Examples
The average height in a sample of n = 40 MLB players, x̄, can be used to estimate the average height of all MLB players, μ.

The proportion of voters who would vote for the Liberal Party in a sample of n = 1,034 registered Canadian voters, p̂, can be used to estimate the proportion of all registered Canadian voters who would vote for the Liberal Party, p.
These sample statistics on their own are what we call point estimates, or single-value estimates of their
corresponding population parameters. Since these point estimates vary from sample to sample, using them on
their own to estimate the corresponding population parameter may not be extremely useful, as this variation of the
sample statistics is not taken into account.
Another option for estimation involves the computation of confidence intervals. A confidence interval is
constructed based on sample data and a point estimate and is an interval estimate of the target population
parameter. It gives us a range of values that are plausible for the population parameter.
In this class, we will discuss how to construct confidence intervals for two parameters of interest: p and μ.
Estimating p with Confidence Intervals
Selecting the Margin of Error
The margin of error in a confidence interval is the amount that we will both add and subtract from the point
estimate of interest in order to produce a desired confidence level for a confidence interval. We can change the
confidence level by changing the margin of error.
The greater the margin of error, the higher our confidence level will be. We represent the allowable probability of error (the chance that the interval misses the parameter) with α. For example, if we construct an interval that we could be 100% sure always ranges across all possible values of the parameter of interest, then the allowable error is α = 0. Another way to state this is to say that we have made a (1 − α) ∗ 100% confidence interval or, in this case, a (1 − 0) ∗ 100% or 100% confidence interval.
The structure of a confidence interval for p is:

$$\hat{p} \pm \text{margin of error}$$

$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}, \quad \text{where } \hat{p} = \frac{X}{n}$$
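A small helper implementing this formula (the name prop_ci and its arguments are ours, not from the notes):

# Compute a z-based confidence interval for a proportion, where
# x = number of "successes" and n = sample size
prop_ci = function(x, n, conf = 0.95) {
  p_hat = x / n
  z = qnorm(conf + (1 - conf) / 2)        # z_{alpha/2}
  me = z * sqrt(p_hat * (1 - p_hat) / n)  # margin of error
  c(lower = p_hat - me, upper = p_hat + me)
}
prop_ci(382, 1034)  # reproduces Example 4.2.2 below: about (0.340, 0.399)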
Example 4.2.2
An Ipsos-Reid poll of n = 1,034 randomly selected Canadian voters was taken between February 14th and February 18th. Each voter was asked the following question: “If a federal election were to be held tomorrow, what political party would you vote for?” 382 would vote for the Liberal Party should a federal election occur tomorrow.
1. Compute the sample proportion as well as the margin of error.
382/1034
## [1] 0.3694391
qnorm(.975)*(382/1034*(1-382/1034)/1034)^(.5)
## [1] 0.02941865
2. Use the information computed above to find a 95% confidence interval estimate
for the percentage of all Canadian voters who would vote for the Liberal Party.
382/1034 - qnorm(.975)*(382/1034*(1-382/1034)/1034)^(.5)
## [1] 0.3400204
382/1034 + qnorm(.975)*(382/1034*(1-382/1034)/1034)^(.5)
## [1] 0.3988577
Confidence Intervals Based on Bootstrap Percentiles
If we were only concerned with 95% confidence intervals and always had a symmetric, bell-shaped bootstrap
distribution, the confidence interval as it is computed in the above section would probably be all that we need. But
we may end up with a bootstrap distribution that is symmetric but subtly flatter (or steeper) so that more (or less)
than 95% of bootstrap statistics are within z_{α/2} standard errors of the center.

Fortunately, we can use the percentiles of the bootstrap distribution to locate the actual middle (1 − α) ∗ 100% of the distribution. Specifically, if we want the middle (1 − α) ∗ 100% of the bootstrap distribution (the values that are most likely to be close to the center), we can just chop off the lowest (α/2) ∗ 100% and highest (α/2) ∗ 100% of the bootstrap statistics to produce an interval.
Visually: (figure omitted: a bootstrap distribution with the lowest and highest (α/2) ∗ 100% of the bootstrap statistics chopped off in each tail)
Example 4.2.2 (revisited)
An Ipsos-Reid poll of n = 1,034 randomly selected Canadian voters was taken between February 14th and February 18th. Each voter was asked the following question: “If a federal election were to be held tomorrow, what political party would you vote for?” 382 would vote for the Liberal Party should a federal election occur tomorrow. Compute a bootstrap 95% confidence interval estimate for the percentage of all Canadian voters who would vote for the Liberal Party.
library(mosaic)
B = do(1000)*mean(resample(c(rep(1, 382), rep(0, 1034-382)), 1034));
quantile(B$mean, 0.025);
##      2.5%
## 0.3413685
quantile(B$mean, 0.975);
##     97.5%
## 0.3984526
Interpreting Confidence Intervals
A confidence interval for a sample proportion gives a set of values that are plausible for the population proportion.
If a value is not in the confidence interval, we conclude that it is an implausible/unlikely value for the actual
population proportion. It’s not impossible that the population value is outside the interval, but it would be pretty
surprising.
For example, suppose a candidate for political office conducts a poll and finds that a 95% confidence interval for
the proportion of voters who will vote for him is 42% to 48%. He would be wise to conclude that he does not have
50% of the population voting for him. The reason is that the value 50% is not in the confidence interval, so it is
implausible to believe that the population value is 50%. Sometimes drawing a picture helps!
There are many common misinterpretations of confidence intervals that you must avoid. The most common
mistake that is made is trying to turn confidence intervals into some sort of probability problem. For example, if
asked to interpret a 95% confidence interval of 45.9% to 53.1%, many people would mistakenly say, “This means
there is a 95% chance that the population percentage is between 45.9% and 53.1%.”
What’s wrong with this statement? Remember that probabilities are long-run frequencies. The above interpretation
claims that if we were to repeat this survey many times, then in 95% of the surveys the true population
percentage would be a number between 45.9% and 53.1%. This claim is wrong! This is because the true
population percentage doesn’t change. It is either always between 45.9% and 53.1% or it is never between these
two values. It can’t be between these two numbers 95% of the time and somewhere else the rest of the time.
Another analogy will help make this clear. Suppose there is a skateboard factory where 95% of the skateboards
produced are perfect, but the other 5% have no wheels. Once you buy a skateboard from this factory, you can’t
say that there is a 95% chance that it has wheels. Either it has wheels or it does not have wheels. It is not true that
the board has wheels 95% of the time and, mysteriously, no wheels the other 5% of the time. A confidence interval
is like one of these skateboards. Either it contains the true parameter (has wheels) or it does not. The “95%
confidence” refers to the “factory” that “manufactures” confidence intervals: 95% of its products are good, 5% are
bad. Our confidence is in the process, not in the product.
A correct interpretation: We are (1 − α) ∗ 100 % sure that the true population proportion is between the lower
and the upper limit calculated.
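To make the “factory” analogy concrete, here is a small simulation sketch (ours, not from the notes): manufacture many 95% intervals from samples where the true proportion is known, and count how often they contain it.

# Manufacture 1000 z-based 95% intervals for a known true proportion
# p = 0.45 and count how many contain it; roughly 95% should
# ("good skateboards")
set.seed(213)
p_true = 0.45; n = 1034; z = qnorm(0.975)
covered = replicate(1000, {
  p_hat = rbinom(1, n, p_true) / n
  me = z * sqrt(p_hat * (1 - p_hat) / n)
  (p_true >= p_hat - me) && (p_true <= p_hat + me)
})
mean(covered)  # close to 0.95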
Example 4.2.3
A random sample of n = 3,005 Canadians between the ages of 30 and 65 revealed that 1,683 expect to work past the traditional retirement age of 65.
1. Find a 99% confidence interval for p , the proportion of Canadians aged 30 to 65
who expect to be working past the age of 65.
1683/3005 - qnorm(.995)*(1683/3005*(1-1683/3005)/3005)^(.5)
## [1] 0.5367423
1683/3005 + qnorm(.995)*(1683/3005*(1-1683/3005)/3005)^(.5)
## [1] 0.5833908
2. Interpret the meaning of this interval in the context of the data.
3. Can you infer from the interval above that (a) p = 0.54? (b) p < 0.60?
Example 4.2.4
A random sample of n = 109 first-year University of Calgary students revealed that
23 had used marijuana in the past six months.
1. Find a 95% confidence interval for p , the proportion of all first-year University of
Calgary students that have used marijuana in the past six months, based on the
distribution of p̂ and based on bootstrapping.
p = 23/109
n = 109
conf = 0.95
p - qnorm(conf+(1-conf)/2)*(p*(1-p)/n)^(.5)
## [1] 0.1344105
p + qnorm(conf+(1-conf)/2)*(p*(1-p)/n)^(.5)
## [1] 0.2876079
library(mosaic)
B = do(1000)*mean(resample(c(rep(1,23),rep(0,n-23)), n));
quantile(B$mean, (1-conf)/2);
##     2.5%
## 0.146789
quantile(B$mean, conf+(1-conf)/2);
##    97.5%
## 0.293578
2. Can you conclude from the findings above that (a) 20% of first-year University of
Calgary students have used marijuana in the past six months? (b) more than
25% of first-year University of Calgary students have used marijuana in the past
six months?
Estimating μ with Confidence Intervals
Selecting the Margin of Error
Recall that we used the CLT to create the confidence interval formula for proportions. It would be nice to use CLT
to create the confidence interval formula for means as well!
Recall also the following formula:
$$Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$

It would be nice if we could divide by the true standard error, $\sigma/\sqrt{n}$. The problem is that in real life, we almost never
know the value of σ , the population standard deviation. In fact, in order to calculate it, we would have to know μ,
which is what we are trying to estimate!
So instead, we replace it with an estimate: the sample standard deviation, s. This gives us an estimate of the standard error: $s/\sqrt{n}$.
However,

$$\frac{\bar{x} - \mu}{s/\sqrt{n}} \neq Z$$

since we changed the σ to an s. In fact,

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$
That is, this computation is not a z-score but rather a t-score and does not come from the normal distribution.
Instead, it comes from a new distribution called the Student’s t-distribution (or just t-distribution).
Note: the t-distribution was discovered by William Gosset. However, he was working at the Guinness Brewery at
the time, which did not allow employees to publish their work. So instead, he published his work under the pen
name “Student.”
This t-distribution is a better model for the sampling distribution of x̄ than the normal distribution when σ is not known (that is, when it must be estimated with s).
$$s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}} \quad \text{and} \quad \sigma = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}}$$
The t-distribution shares many characteristics with the standard normal distribution. Both are symmetric, unimodal,
and might be described as “bell-shaped.”
The t-distribution’s shape depends on only one parameter, called the degrees of freedom (df). The number of
degrees of freedom is (usually) an integer: 1, 2, 3, and so on. In this case, the degrees of freedom is the number of
gaps in the data or n − 1. Ultimately, when the degrees of freedom is infinitely large, the t-distribution is exactly
the same as the standard normal distribution.
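A quick check of this convergence in R (ours, not from the notes), comparing t critical values with the standard normal value as the degrees of freedom grow:

# t critical values approach the standard normal value as df grows
qnorm(0.975)
## [1] 1.959964
sapply(c(1, 5, 30, 1000), function(df) qt(0.975, df))
## [1] 12.706205  2.570582  2.042272  1.962339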
Therefore, to create a (1 − α) ∗ 100% confidence interval for the population mean μ when σ is unknown, we will use the following formula:

$$\bar{x} \pm t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}, \quad \text{where } P(T_{n-1} \geq t_{\alpha/2,\,n-1}) = \frac{\alpha}{2}$$
Remember, if σ is known we can use it (and the standard normal distribution). Also note that if n is large, s will be close to σ and the t-distribution will be close to the standard normal. However, this is only an approximation; it is still best to use t when σ is unknown!
To create a (1 − α) ∗ 100% confidence interval for the population mean μ when σ is known, we will use the following formula:

$$\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}, \quad \text{where } P(Z \geq z_{\alpha/2}) = \frac{\alpha}{2}$$
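A small helper wrapping both formulas (the name mean_ci and its arguments are ours, not from the notes); if sigma is supplied the z-based interval is used, otherwise the t-based interval with n − 1 degrees of freedom:

# Compute a confidence interval for a mean from summary statistics
mean_ci = function(xbar, n, s = NULL, sigma = NULL, conf = 0.95) {
  if (!is.null(sigma)) {
    me = qnorm(conf + (1 - conf) / 2) * sigma / sqrt(n)  # sigma known: z
  } else {
    me = qt(conf + (1 - conf) / 2, n - 1) * s / sqrt(n)  # sigma unknown: t
  }
  c(lower = xbar - me, upper = xbar + me)
}
mean_ci(9.85, 40, sigma = 8)   # matches Example 4.2.5 below
mean_ci(26680, 200, s = 4500)  # matches Example 4.2.6 below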
Confidence Intervals Based on Bootstrap Percentiles
Just as was the case for confidence intervals involving proportions, we may end up with a bootstrap distribution that is symmetric but subtly flatter (or steeper) so that more (or less) than 95% of bootstrap statistics are within z_{α/2} standard errors of the center.

So we can use the percentiles of the bootstrap distribution to locate the actual middle (1 − α) ∗ 100% of the distribution. Specifically, if we want the middle (1 − α) ∗ 100% of the bootstrap distribution (the values that are most likely to be close to the center), we can just chop off the lowest (α/2) ∗ 100% and highest (α/2) ∗ 100% of the bootstrap statistics to produce an interval.
Example 4.2.5
One of the exciting aspects of a university professor’s life is the time one spends in meetings. A stratified random sample of 40 professors from various science departments was taken. Each professor was asked, “In a week, how many hours do you typically spend in meetings?” The mean of this sample was x̄ = 9.85 hours. Assume that the standard deviation in the number of hours per week spent in meetings for all professors in this particular science faculty is 8 hours, or σ = 8 hours.
1. Find a 95% confidence interval for μ , the mean number of hours a professor in
this particular science faculty spends in meetings in a week.
mean = 9.85
sigma = 8
n = 40
conf = .95
mean - qnorm(conf+(1-conf)/2)*sigma/n^.5
## [1] 7.37082
mean + qnorm(conf+(1-conf)/2)*sigma/n^.5
## [1] 12.32918
2. Interpret the meaning of the above interval in the context of the data.
3. If the level of confidence was increased from 95% to, say, 99% what would
happen to the width of the confidence interval?
Example 4.2.6
A study focusing on financial issues and concerns of post-secondary students in
Canada was recently conducted by the Royal Bank of Canada. A subset of n = 200
recent graduates from an undergraduate program or diploma was randomly chosen
and the debt as a result of going to school (defined as student debt) was determined
for each. This produced an average student debt of $26,680 and a standard deviation of $4,500. You want to find a 95% confidence interval estimate for μ, the
average level of student debt for all recent graduates from a post-secondary
institution (excluding graduate programs).
1. Find the standard error and the margin of error for 95% confidence.
mean = 26680
s = 4500
n = 200
conf = 0.95
s/n^.5
## [1] 318.1981
qt(conf+(1-conf)/2,n-1)
## [1] 1.971957
qt(conf+(1-conf)/2,n-1)*s/n^.5
## [1] 627.4727
2. Find a 95% confidence interval estimate for the average level of student debt for
all recent graduates from a post-secondary institution (non-graduate programs).
mean = 26680
s = 4500
n = 200
conf = 0.95
mean-qt(conf+(1-conf)/2,n-1)*s/n^.5
## [1] 26052.53
mean+qt(conf+(1-conf)/2,n-1)*s/n^.5
## [1] 27307.47
3. Interpret the meaning of the interval calculated above in the context of the data.
Example 4.2.7
The amount of sewage and industrial pollutants dumped into a body of water affects
the health of the water by reducing the amount of dissolved oxygen available for
aquatic life. Over a two-month period, sixteen samples of water were taken from a
river one kilometer downstream from a sewage treatment plant. The amount of
dissolved oxygen in each sample of river water was determined and is given
below.
5.4, 5.4, 5.6, 4.2, 4.7, 5.3, 4.4, 4.9, 5.2, 5.9, 4.7, 4.9, 4.8, 4.9, 5.0, 5.5
The mean, median, and standard deviation of the above sample are given as x̄ = 5.05, x̃ = 4.95, and s = 0.453.
1. Find a 95% confidence interval estimate for μ , the mean dissolved oxygen level
during the two-month period in the river located one-kilometer downstream from
the sewage plant. Compute this interval using the appropriate margin of error.
x=c(5.4, 5.4, 5.6, 4.2, 4.7, 5.3, 4.4, 4.9, 5.2, 5.9, 4.7, 4.9, 4.8, 4.9, 5.0, 5.5)
mean=mean(x)
s=sd(x)
n=length(x)
conf=.95
mean-qt(conf+(1-conf)/2,n-1)*s/n^.5
## [1] 4.80854
mean+qt(conf+(1-conf)/2,n-1)*s/n^.5
## [1] 5.29146
2. Find a 95% confidence interval estimate for μ , the mean dissolved oxygen level
during the two-month period in the river located one-kilometer downstream from
the sewage plant. Compute this interval based on bootstrapping.
library(mosaic)
B = do(1000)*mean(resample(c(5.4, 5.4, 5.6, 4.2, 4.7, 5.3, 4.4, 4.9, 5.2, 5.9, 4.7, 4.9, 4.8, 4.9, 5.0, 5.5), n));
quantile(B$mean,(1-conf)/2);
##    2.5%
## 4.83125
quantile(B$mean,((1-conf)/2)+conf);
##     97.5%
## 5.250156
3. From the above confidence intervals, could you conclude that μ < 5? Which interval is better?
Example 4.2.8
In a report from the Bank of Montreal Outlook on holiday spending for the year 2014,
a survey was conducted by Pollara in which 115 Albertans were randomly chosen
and each was asked how much they would spend on gifts for people in the upcoming
holiday season (excluding amount spent on trips, entertaining, and other spending).
The mean, median, and standard deviation resulting from this survey are: x̄ = $652.00, x̃ = $643.00, s = $175.
1. From this sample, construct a 95% confidence interval for μ , the mean amount
Albertans spent in the holiday season in 2014.
mean = 652
s = 175
n = 115
conf = 0.95
mean-qt(conf+(1-conf)/2,n-1)*s/n^.5
## [1] 619.6725
mean+qt(conf+(1-conf)/2,n-1)*s/n^.5
## [1] 684.3275
2. The same study revealed that the mean amount spent by all Quebec consumers
during the 2014 holiday season was $460 . Does the interval computed above
suggest that, on average, Albertans will spend more this holiday season
compared to consumers in Quebec?
Which Interval Should We Compute? (A Guide!)
In this set of notes, we’ve learned about three different “structures” of confidence intervals: one that relies on the z-distribution (a z-score), one that relies on the t-distribution (a t-score), and one that is built on bootstrapping the original sample. Which interval should be used, and when? The following guide should help you determine when each interval is most appropriate to use.
Intervals for μ
Use bootstrapping when…
You have the raw data
You are not sure if the data are normal
Basically, if you can bootstrap (that is, if you have the raw data), use that method!
Use an interval based on the z-distribution when…
You do not have the raw data
σ is known

Use an interval based on the t-distribution when…
You do not have the raw data
You can assume the data are normally distributed
σ is not known
Intervals for p
Use bootstrapping when…
You have the raw data
Again, if you can bootstrap (if you have the raw data or can use R to make “fake” 0 and 1 data), use that
method!
Use an interval based on the z-distribution when…
You are unable to perform bootstrapping
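As a capstone, here is a sketch of this guide as a single helper (the name choose_ci and its arguments are ours, not from the notes): with raw data we bootstrap; with summaries only, we use z when σ is known and t otherwise.

library(mosaic)
choose_ci = function(x = NULL, xbar = NULL, n = NULL,
                     s = NULL, sigma = NULL, conf = 0.95) {
  if (!is.null(x)) {
    # raw data available: bootstrap percentile interval
    B = do(1000)*mean(resample(x, length(x)))
    quantile(B$mean, c((1 - conf)/2, conf + (1 - conf)/2))
  } else if (!is.null(sigma)) {
    # summaries only, sigma known: z-based interval
    me = qnorm(conf + (1 - conf)/2) * sigma / sqrt(n)
    c(xbar - me, xbar + me)
  } else {
    # summaries only, sigma unknown: t-based interval
    me = qt(conf + (1 - conf)/2, n - 1) * s / sqrt(n)
    c(xbar - me, xbar + me)
  }
}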