Statistics 213 – 3.1: Exploratory Data Analysis (EDA)
© Scott Robison, Claudia Mahler 2019, all rights reserved.

Objectives:
- Be able to understand the differences between variable classifications
- Be able to differentiate between a population and a sample, as well as a parameter and a statistic
- Be able to identify the sampling procedure used to collect a sample
- Be able to compute sample proportions for qualitative variables
- Be able to compute summary values for quantitative variables
- Be able to create (in R) and interpret a boxplot

Motivation:
In Unit 2, we introduced the idea of a random variable. Recall the definition: a random variable is a quantity whose values are (usually) the numerical values associated with the random outcomes of an experiment. Oftentimes, we represent a random variable with a letter.

Examples:
- Let X represent the number of heads that we observe from 10 coin flips
- Let Y represent the number of customers who enter a bank on a given day
- Let W represent an individual's score on the Stanford-Binet IQ test

So far in this course, we've concerned ourselves with calculating probabilities relating to variables such as the ones above. The focus of Unit 3 is slightly different. We are now going to switch gears and talk more about variables and their importance in research settings. More specifically, we're going to talk about variables as characteristics or properties that we are interested in studying or understanding in greater detail. When we refer to a variable, we're still going to think of something whose value varies from iteration to iteration, just as we defined in Unit 2, but now we're going to focus our attention on what these variables might be in "real world" situations and, specifically, in research. We can still discuss calculating probabilities relating to these other types of variables, but Unit 3 is focused on classifying variables and collecting data in order to make claims about variables.
1 Data and Variables

There are a few basic terms we should be familiar with before we go any further (some of these we have seen before). To understand these terms, let's consider an example. In nearly every field of study, research typically begins with a question.

Example: "How long is a typical eruption of the Old Faithful geyser in Yellowstone National Park?"

One way to try to find an answer to this question would be to go to Yellowstone and observe, say, 20 eruptions of Old Faithful, recording the length of each eruption.

In statistics, an observation is an individual occurrence, case, or instance of interest for which information is recorded. In our Yellowstone scenario, each eruption would be an observation.

Notice that the thing of interest in this example is information about a specific characteristic of the eruptions, in particular, the length of the eruptions. If we were to actually go out and record these eruption lengths, we would see that they would most likely differ from observation to observation (eruption to eruption). Thus, the eruption lengths are, in this scenario, our random variable (often just called a variable) of interest: a characteristic, attribute, or outcome that can vary from observation to observation.

When we talk about the information that is recorded for one (or more) variables for a set of observations, we are referring to data. Data is a collection of recorded information on one or more variables for a set of observations. A particular set of data is sometimes just called a dataset. In our Yellowstone scenario, our data would be the set of eruption lengths for the 20 observed eruptions (the 20 values of the variable "eruption length").

Classifying Variables

Variables (and datasets) come in all shapes and sizes depending on what we're interested in and/or what our research is about.
As we will see, knowing what type(s) of variable(s) we are dealing with will be useful when we wish to use statistical techniques to explain or understand the variable(s) in more detail. So let's briefly discuss the general types of variables that we may encounter.

At the most general level, variables can be classed as one of two types: variables that describe a quality of an observation or variables that describe a quantity of an observation.
- Qualitative (categorical) variables are those that classify observations into one of a group of categories. Think "quality" or "what type?"
- Quantitative (numerical) variables are those that can be measured on a naturally numerical scale. Think "quantity" or "how much?"

Example 3.1.1
Determine if the following variables are qualitative or quantitative.
1. Body weight (in pounds) – qualitative or quantitative
2. Clothing size (small, medium, large) – qualitative or quantitative
3. Number of trees in a park – qualitative or quantitative
4. Breed of a cat – qualitative or quantitative

Quantitative (numerical) variables can be further classified as discrete or continuous variables. We've seen these definitions before!

Knowing which type of variable we're working with is important! As we'll see further on in this set of notes, different summary and descriptive values are appropriate for different types of variables.

Making Claims about Variables

Whenever we're interested enough in a variable to study it statistically, we're usually interested in it on some fairly large scale. That is, we want to study the behavior of the variable across many instances.

Examples:
- In Canada, what is the age at which people first consume alcohol?
- How many registered voters in Idaho voted for Biden in the 2020 U.S. presidential election?

To do so, we collect data, or observations of these instances, and note the behavior of the variable for each observation.
One of the best ways of "summarizing" a variable's behavior is by aggregating all of this information into a single numerical description of the variable. While there are many different ways of doing this depending on the variable(s) of interest, we will mainly focus on two of these summary values in this class: the mean and the proportion.
- The mean is used for quantitative (numerical) data and aggregates all measured observations for a given variable into a single number that summarizes the "typical" behavior of the variable across those observations.
- A proportion is used for qualitative (categorical) data and is the ratio of the number of observations that fall into a particular category of interest to the total number of observations.

Thus, when we're interested in the behavior of a variable, we usually express our interest in terms of one of these summary quantities.

Examples:
- In Canada, what is the average age at which people first consume alcohol?
- What proportion of registered voters in Idaho voted for Biden?

We will learn the specifics of the calculations for the mean and for proportions a bit later. For now, just think of a mean as a "summary" value for quantitative variables and a proportion as a "summary" value for qualitative variables.

Populations vs. Samples

As mentioned above, whenever we're interested enough in a variable to study it statistically, we're usually interested in it on some fairly large scale. That is, we want to understand the variable's behavior across many instances.

A population is the set of all instances (units, people, objects, events, regions, etc.) we are interested in when wanting to study the behavior of a variable. We usually denote the size of a population (if known) with N.

Examples:
- N = all people in Canada
- N = all registered voters in Idaho

Ideally, we would record information on a variable of interest for every observation in a given population.
However, in a lot of cases (like in the examples above), it is impractical (or too expensive, or even impossible) to do so, since most populations are very large. Instead, we most often select a subset of observations from a population of interest and record information on our variable of interest for every observation in the sample.

A sample is a selected subset of observations from the population, usually much smaller than the population itself. We usually denote the size of a sample with n.

Examples:
- n = 2,000 Canadians
- n = 400 registered voters in Idaho

At the population level, our summary values (means and proportions) are called parameters.
- The population mean is usually denoted μ. This value μ is a parameter.
- The population proportion is usually denoted p. This value p is a parameter.

At the sample level, our summary values (means and proportions) are called statistics or sample statistics.
- The sample mean is usually denoted x̄. This value x̄ is a statistic and is an estimate of μ.
- The sample proportion is usually denoted p̂. This value p̂ is a statistic and is an estimate of p.

Examples:
- μ = average age of first alcohol consumption for all Canadians
- x̄ = average age of first alcohol consumption for a sample of 2,000 Canadians
- p = proportion of all registered voters in Idaho who voted for Biden
- p̂ = proportion of a sample of 400 registered voters in Idaho who voted for Biden

Samples are smaller and therefore easier to work with than large populations. However, the main goal of interest is still to examine the behavior of a variable in the population of interest as a whole, not just in the smaller subset of a sample. In other words, our main interests are in our population summary values, μ and p. However, since we are often unable to calculate these values exactly, we use our sample summary values, x̄ and p̂, and generalize our sample findings to the larger population of interest. We'll see more of this later!
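The parameter/statistic distinction above can be sketched in a few lines of code. These notes use R, but purely to illustrate the idea, here is a Python sketch; the population itself is invented (every number below is made up for the example):

```python
import random
import statistics

random.seed(1)

# Invented population: N = 100,000 hypothetical "ages of first alcohol consumption"
population = [random.gauss(17, 3) for _ in range(100_000)]
mu = statistics.mean(population)        # population mean: a parameter

# Draw a sample of n = 2,000 and compute the sample mean: a statistic
sample = random.sample(population, 2000)
x_bar = statistics.mean(sample)

print(round(mu, 2), round(x_bar, 2))    # the statistic lands near the parameter
```

The point of the sketch: we rarely get to compute μ directly, but the sample statistic x̄ tends to land close to it, which is what justifies using statistics as estimates of parameters.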
Data Collection

Because we wish to generalize our sample findings to our population of interest, it is important that our sample be representative of our population. That is, we want to make sure that our sample reflects the characteristics and composition of the population from which it's taken.

Sampling Techniques

There are multiple different ways of obtaining a sample, depending on time, resources, observation type, etc. We will briefly discuss several different types. Ideally, a sample is a "perfect" representation of the population from which it was taken. While this is likely never going to be the case, we can get a more representative sample if our sampling method involves random sampling.

To help give examples of each of these types of sampling methods, let's consider the following scenario: a researcher wishes to know what proportion of voters in Idaho voted for Biden.

A simple random sample (SRS) is one in which every conceivable subgroup of n observations has the same chance of being selected as the sample of size n. A random number generator (or some other way of randomizing all observations in the population) is required to ensure the random selection.

Example: The researcher obtains a list of all N Idaho voters, then uses a random number generator to select n of those voters to be in the sample. (Note that obtaining a list of an entire population may not always be possible!)

Stratified sampling is done when the population of interest contains naturally occurring groups (or "strata") of observations that are similar on one or more characteristics. A stratified sample is obtained by selecting observations from each stratum and combining them to form the sample.

Example: The researcher divides voters into those who are registered as Democrats, those registered as Republicans, and those registered with some other party (or no party). These are the three strata.
She then takes samples of, say, 30 individuals from each stratum and combines them to form the sample.

Cluster sampling is done when the population of interest contains naturally occurring groups (or "clusters") that do not greatly differ from one another but contain observations with differing characteristics. A cluster sample is obtained by selecting one of these clusters as the sample.

Example: Assuming that all counties in Idaho are relatively equal in terms of political and demographic makeup, the researcher selects all voters from Latah County to act as the sample.

Note: there are differences between strata, but within a stratum the observations share similar characteristics. Conversely, there are similarities between clusters, but within a cluster the observations have differing characteristics.

In some cases, due to monetary/time restrictions, types of variables of interest, etc., random sampling is not a possibility. In such cases, there are a few non-random sampling techniques that could be employed.

Convenience sampling is done when the observations selected for the sample are simply those that are the easiest to reach/obtain.

Example: The researcher uses the first page of a phone book and calls every individual on the page who is registered to vote to ask if they voted for Biden.

Voluntary sampling is done when the observations (people, in this case) "volunteer" to be in the sample.

Example: A researcher sets up a booth in a mall with a banner asking, "Who Did You Pick in the 2020 Election?" and asks all those who approach whether they voted for Biden. The sample is then all who approached the booth.

What is the "best" sampling method? Ideally, an SRS is best, as it is most likely to lead to a genuinely "random" selection from the population. However, a true SRS is difficult to do. Thus, the "best" sampling method really depends on the situation.
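The SRS recipe above (a list of the whole population plus a random number generator) can be sketched in code. This is a Python illustration, not part of the notes' R workflow, and the "voter list" is just hypothetical ID numbers:

```python
import random

random.seed(42)

# Hypothetical population: N voters, represented here by ID numbers 1..N
N = 10_000
voter_list = list(range(1, N + 1))

# An SRS of size n: random.sample makes every subset of size n equally likely,
# which is exactly the defining property of a simple random sample
n = 400
srs = random.sample(voter_list, n)

print(len(srs))
```

Sampling without replacement is the key detail: `random.sample` never picks the same voter twice, just as a real SRS drawn from a voter list would not.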
The more "randomly" you can select individuals for your sample, the more likely your sample is to be a good representation of the population! The goal of collecting data in a sample is to be able to analyze that data and then generalize the findings back to the larger population. Remember that it is important that our sample be representative of our population in order to be able to do this!

Now we'll look at the step that comes after sampling: examining and summarizing the information in the collected data. The techniques we discuss in this unit are often grouped under the general term exploratory data analysis (EDA), which involves using both numerical and graphical summaries to explore sample data with the goal of understanding what it means and how best to use it.

Describing Qualitative (Categorical) Variables Numerically

Recall the definition of qualitative variables: qualitative (categorical) variables are those that classify observations into one of a group of categories. Think "quality" or "what type?"

Examples:
- Eye color (blue, brown, green, etc.)
- Ratings or rankings of songs ("dislike very much" to "like very much")
- Letter grades on a test (A, B, C, etc.)

Sample Proportions

When we want to summarize or describe a qualitative variable numerically, the most common way of doing so is by calculating a proportion. A proportion is a ratio or fraction and is computed by taking the number of observations that fall into a particular category or class of interest and dividing it by the total number of observations.

If we are calculating a sample proportion, we usually denote it p̂ and calculate it as follows:

p̂ = X / n

where X represents the number of observations in a particular category/class of interest and n is the sample size.

Recall that we often want to use p̂ as an estimate of p, a (usually unknown) population proportion. We'll talk more about this type of estimation later in the course.
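The computation p̂ = X/n is easy to check by hand. A short Python sketch for illustration; the eye-color counts below are invented, not taken from any example in these notes:

```python
# Invented sample of n = 50 recorded eye colors
colors = ["brown"] * 22 + ["blue"] * 15 + ["green"] * 8 + ["hazel"] * 5

n = len(colors)            # total number of observations
X = colors.count("green")  # observations in the category of interest
p_hat = X / n              # sample proportion: p-hat = X / n

print(p_hat)  # 0.16
```

So in this made-up sample, p̂ = 8/50 = 0.16, i.e., 16% of the sampled people have green eyes.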
Example 3.1.2
The following table shows a distribution of eye color for 208 students taking an intro calculus course. What proportion of students in this sample have green eyes?

Describing Quantitative (Numeric) Variables Numerically

Recall the definition of quantitative variables: quantitative (numerical) variables are those that are measured on a naturally numerical scale. Think "quantity" or "how much?"

When we wish to describe quantitative variables numerically, we tend to focus on ways to describe two separate features: center and spread.

Center: Measures of Central Tendency

Measures of central tendency (or measures of location) describe the tendency of the variable's values to center or cluster about a certain value. The two most common measures of central tendency are the mean and the median.

The mean is the arithmetic average and is the sum of all observations divided by the total number of observations. If we are calculating a sample mean, we usually denote it x̄ and calculate it as follows:

x̄ = (Σᵢ₌₁ⁿ xᵢ) / n

where xᵢ represents the value of the random variable for the ith observation and n is the sample size.

Recall that we often want to use x̄ as an estimate of μ, a (usually unknown) population mean. We'll talk more about this type of estimation later in the course.

The median is the middle number when the observations are ordered from smallest to largest. If we are calculating a sample median, we usually denote it x̃.
- If n is odd, x̃ is simply the middle number of the ordered values
- If n is even, x̃ is the mean of the middle two numbers of the ordered values

Just like with the mean, we can use x̃ as an estimate of a (usually unknown) population median, which is usually denoted μ̃.

Consider a vector of data named dataset. In R, we can calculate the mean of the data by using mean(dataset) and the median of the data by using median(dataset).
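Since the R example that follows happens to use an odd n, the even-n median rule (average the two middle values) is worth a quick check by hand. A Python sketch for illustration; Python's statistics.median applies the same odd/even rule described above:

```python
import statistics

# Same data as the R example, n = 9 (odd): the median is the single middle value
odd = [1, 3, 5, 5, 6, 9, 13, 14, 50]

# Dropping the 50 gives n = 8 (even): the median averages the middle two (5 and 6)
even = [1, 3, 5, 5, 6, 9, 13, 14]

print(statistics.median(odd))   # 6
print(statistics.median(even))  # 5.5
```

Note also that removing the extreme value 50 barely moves the median (6 to 5.5), a preview of the median's resistance to outliers.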
Example 3.1.3
Consider the following data set: 1 3 5 5 6 9 13 14 50
Using R, calculate the mean and the median of this data set.

dataset = c(1,3,5,5,6,9,13,14,50)
mean(dataset)

## [1] 11.77778

median(dataset)

## [1] 6

Spread: Measures of Variation

Measures of variation (or measures of spread) describe the variation of the observations about a measure of central tendency. There are several measures of spread that we will discuss in this class.

The variance measures how far a set of observations is spread out from its average value. If we are calculating a sample variance, we usually denote it s² and calculate it as follows:

s² = Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1)

where xᵢ represents the value of the random variable for the ith observation, x̄ is the sample mean, and n is the sample size.

The variance involves the calculation of the squared average deviation of each observation from the mean (this is the numerator term). Because of this, variance is expressed in squared units, whatever those units happen to be for a particular variable. This can make it hard to interpret (how can we interpret "dollars squared" or "grades squared"?).

The standard deviation is the square root of the variance and is expressed in whatever units the variable is expressed in. This makes it an easier-to-interpret measure of spread, and it is thus much more commonly used. If we are calculating a sample standard deviation, we usually denote it s and calculate it as follows:

s = √(s²) = √[Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1)]

Consider a vector of data named dataset. In R, we can calculate the variance of the data by using var(dataset) and the standard deviation of the data by using sd(dataset) (or, of course, by taking the square root of var(dataset)).

Example 3.1.4
Consider the following data set: 1 3 5 5 6 9 13 14 50
Using R, calculate the variance and standard deviation of this data set.
dataset = c(1,3,5,5,6,9,13,14,50)
var(dataset)

## [1] 224.1944

sd(dataset)

## [1] 14.97312

A percentile indicates the value below which a given percentage of observations fall. For example, the 90th percentile is the value below which 90% of the observations fall. Some useful percentiles are given specific names:
- The 25th percentile of the data is sometimes called the first quartile (Q1)
- The median is the same as the 50th percentile of the data
- The 75th percentile of the data is sometimes called the third quartile (Q3)

Think of Q1 and Q3 as the median values of the lower half of the ordered data and the upper half of the ordered data, respectively (not counting the value of the median). If we think of the interval of data between Q1 and Q3, we can think of that interval as capturing/containing the "middle" 50% of our data.

Consider a vector of data named dataset. In R, we can calculate any percentile that we want by using quantile(dataset, p), where p is the percentile of interest expressed as a proportion between 0 and 1 (e.g., p = 0.90 for the 90th percentile).

Describing Quantitative (Numeric) Variables Graphically

We will focus on a specific type of graph that can be used to help describe quantitative variables: the boxplot. A boxplot is a graphical representation based on five numbers: the minimum value of the data, Q1, the median, Q3, and the maximum value of the data.

Example: The following is a boxplot representing the distribution of sepal widths for n = 150 iris flowers from a built-in dataset in R.

In R's printout of a boxplot, values marked as ○ are classified as outliers. An outlier is a value that is either extremely large or extremely small when compared to the rest of the values in a sample. Specifically, an outlier is a value that is larger than the upper fence value or smaller than the lower fence value.
These values are calculated as follows:
- Upper fence: Q3 + 1.5(Q3 − Q1)
- Lower fence: Q1 − 1.5(Q3 − Q1)

Note that the "whiskers" of the boxplot extend to the most extreme upper and lower points that are not outliers, not to the fence values!

Consider a vector of data named dataset. In R, we can plot a boxplot of the data by using boxplot(dataset).

Example 3.1.5
The following dataset contains the heights (in inches) of a random selection of 100 active MLB players.

mlb = c(72,76,73,73,80,74,72,74,70,75,73,75,75,73,75,74,68,76,72,72,74,74,74,77,75,75,77,77,74,79,
75,74,76,70,75,74,72,75,75,75,68,76,75,76,77,77,74,73,73,74,75,78,76,73,74,72,71,73,70,77,73,76,
79,75,71,74,73,70,73,72,74,72,75,71,77,72,73,74,79,74,71,75,75,77,74,72,72,76,76,74,70,75,75,72,
72,72,71,77,72,76)

1. Use R to compute the median, Q1, and Q3 of this data set.

median(mlb)

## [1] 74

quantile(mlb, 0.25)

## 25%
##  72

quantile(mlb, 0.75)

## 75%
##  75

2. Use R to create a boxplot of this data.

boxplot(mlb)

3. For the boxplot above, calculate the lower and upper fences.

4. Compute the 10th percentile of heights.

quantile(mlb, 0.10)

## 10%
##  71

Statistics 213 – 3.2: Regression
© Scott Robison, Claudia Mahler 2019, all rights reserved.

Objectives:
- Be able to create and interpret a scatterplot
- Be able to calculate and interpret the correlation coefficient
- Be able to calculate and interpret the coefficient of determination
- Be able to run a regression analysis in R and appropriately interpret the output
- Be able to make predictions using a regression equation
- Be able to calculate residuals
- Be able to check regression model assumptions

Motivation:
In the previous set of notes, we focused in part on exploratory data analysis (EDA). The techniques we learned in that set of notes are used to describe the behavior of a variable in a given sample.
Examples (from the 3.1 notes):
- Letter grades on a midterm for a sample of n = 107 students
- Eye colors for a sample of n = 208 students
- Heights (in inches) for a sample of n = 100 MLB players

In the above examples, notice how we only focused on one variable at a time: letter grade, eye color, or height. While it is important to be able to describe individual variables in a given sample, there are many situations in which the relationship between two variables might be of more interest.

Examples:
- How can we describe the relationship between height and weight in high school students?
- To what degree are temperature and rainfall levels related?
- Is there an association between the duration of an eruption of the Old Faithful geyser and the amount of time between eruptions?

In this unit, we will discuss three main ways of describing and summarizing the relationship between two variables: using scatterplots, computing the correlation coefficient, and determining the regression equation.

Bivariate Data

When we refer to bivariate data, we're referring to information collected on two variables from the same set or sample of observations.

Example: Suppose we were interested in the relationship between height and weight in high school students. We take a sample of n = 43 students and measure each student's height and weight. In this case, we have two variables of interest: height and weight. Each individual in our sample is measured on both of these variables. We could keep track of this information in a spreadsheet as follows:

Notice that these measurements of height and weight are "paired" in the sense that there is one height and one weight for each of the n students.

Explanatory vs. Response Variables

As we saw in the 3.1 notes, when researchers are interested in the relationship between two variables, they often suspect that one variable may be responsible (at least in part) for some of the change in the other variable.
Thus, when we're looking at the relationship between two variables, we often consider one as the "explainer" and one as the "responder." Recall these definitions from the 3.1 notes:
- An explanatory variable (independent variable, predictor variable, x-variable) is one that may explain or cause some degree of change in another variable.
- A response variable (dependent variable, y-variable) is a variable that changes, at least in part, due to the changes in the explanatory variable.

We'll return to explanatory and response variables in a little bit. The rest of the notes will focus on three main ways of describing and summarizing the relationship between two variables, starting with scatterplots.

Scatterplots

As we saw in the last set of notes, the quickest and easiest way to get an idea of the behavior of a variable is to generate some sort of picture or graph of it. A scatterplot is a two-dimensional plot with one variable's values plotted along the horizontal axis (x-axis) and the other variable's values plotted along the vertical axis (y-axis). It is a good way to visualize the relationship between two variables. It is accepted practice to plot the explanatory variable (x-variable) along the x-axis and the response variable (y-variable) along the y-axis.

Example 3.2.1
The heights (in inches) and weights (in pounds) were recorded for a sample of n = 43 high school students.
The following R code shows how these values are read into R and then displayed as a data frame called "students":

heights = c(73, 69, 70, 72, 73, 69, 68, 71, 71, 68, 69, 67, 66, 67, 72, 68, 75, 68, 73, 72, 72, 72, 72, 74, 68, 73, 68, 70, 72, 70, 67, 67, 71, 72, 73, 68, 72, 68, 67, 70, 71, 70, 67)
weights = c(195, 135, 145, 170, 172, 168, 155, 185, 175, 158, 185, 146, 135, 150, 160, 155, 230, 149, 240, 170, 198, 163, 230, 170, 151, 220, 145, 130, 160, 210, 145, 185, 237, 205, 147, 170, 181, 150, 150, 200, 175, 155, 167)
students = data.frame(heights, weights)
head(students)

##   heights weights
## 1      73     195
## 2      69     135
## 3      70     145
## 4      72     170
## 5      73     172
## 6      69     168

To make a scatterplot of the data, we use the R function plot(x,y), where x is the variable whose values are to be plotted along the x-axis and y is the variable whose values are to be plotted along the y-axis.

plot(heights, weights)

Notice that there is a point on the scatterplot for each of the n = 43 students. The points represent each pair of measurements (height, weight) for each student.

As you can see, a scatterplot can give a general idea of the relationship between the two variables. In general, we like to describe this relationship in terms of both its direction and its strength. The direction of a relationship can generally be described as positive, negative, curvilinear, or non-existent.
- A positive linear relationship exists when, as the values of one of the variables increase, the values of the other generally increase as well. (Plot A)
- A negative linear relationship exists when, as the values of one of the variables increase, the values of the other generally decrease. (Plot B)
- A curvilinear relationship exists when the relationship between the variables' values can best be described by a curve (rather than a line). (Plot C)
- A non-existent relationship exists when there is no consistent relationship between the values of the two variables.
(Plot D)

The strength (or magnitude) of a relationship is based on how tightly the "cloud" of points is clustered about the trend line (the line/curve that best describes the direction of the relationship). The more tightly clustered the cloud, the stronger the relationship is between the two variables.

Example 3.2.1 (revisited)
The relationship between height and weight in the example with n = 43 high school students appears to be a moderately strong positive linear relationship.

Correlation

Using a scatterplot is a good way to get a quick general idea of the relationship between two variables, but what if we wanted some way to quantify this relationship beyond the subjective interpretation of a scatterplot?

Pearson's correlation coefficient is a measure of the direction and strength of a linear relationship between two quantitative variables x and y. The sample correlation coefficient is usually denoted r and is computed as:

r = (Σᵢ₌₁ⁿ xᵢyᵢ − n x̄ ȳ) / ((n − 1) sx sy)

where xᵢ is the ith observation of the x variable, yᵢ is the ith observation of the y variable, x̄ and ȳ are the means of the x and y variables, respectively, sx and sy are the standard deviations of the x and y variables, respectively, and n is the sample size.

The following are important features of the correlation coefficient:
- The distinction between the explanatory variable and the response variable doesn't matter in the calculation or interpretation of r
- −1 ≤ r ≤ 1
- The sign of r indicates the direction of the linear relationship: a positive r indicates a positive relationship, a negative r indicates a negative relationship
- Values of r closer to −1 or 1 suggest a strong linear relationship (with r = 1 and r = −1 representing perfect positive and negative correlations, respectively); values of r closer to 0 suggest a weak linear relationship

To compute correlation in R, we use the R function cor(x,y).

Example 3.2.1 (revisited)
The heights (in inches) and weights (in pounds) were recorded for a sample of n = 43 high school students. Use R to compute the correlation coefficient for the heights and weights of these students. Interpret this value.

cor(heights, weights)

## [1] 0.5684901
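The formula for r above can be verified by hand on a small dataset. A Python sketch for illustration (the five (x, y) pairs are invented; R's cor() computes the same Pearson r):

```python
import math

# Small invented (x, y) dataset
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sample standard deviations sx and sy (divide by n - 1)
sx = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
sy = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))

# r = (sum of xi*yi - n*x_bar*y_bar) / ((n - 1) * sx * sy)
r = (sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar) / ((n - 1) * sx * sy)

print(round(r, 4))  # 0.7746
```

Here r ≈ 0.77, a fairly strong positive linear relationship, which matches what a scatterplot of these five points would suggest.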
Example 3.2.1 (Revisited) The heights (in inches) and weights (in pounds) were recorded for a sample of n = 43 high school students. Use R to compute the correlation coefficient for the heights and weights of these students. Interpret this value. cor(heights, weights) ## [1] 0.5684901 6 Example 3.2.2 The annual rainfall (in mm) and maximum daily temperature (in Celsius) were recorded for n = 11 different locations in Mongolia. rain = c(196, 196, 179, 197, 149, 112, 125, 99, 125, 84, 115) temperature = c(5.7, 5.7, 7, 8, 8.5, 10.7, 11.4, 10.9, 11.4, 11.4, 11.4) Use R to compute the correlation coefficient for the rainfall and temperatures of these locations. Interpret this value. cor(temperature, rain) ## [1] -0.918617 Coefficient of Determination Another way to examine a bivariate relationship is to measure the contribution of the explanatory variable in predicting the value of the response variable. The coefficient of determination is the squared correlation coefficient (r 2 or R2 ) and represents the proportion of the total sample variation in the response variable that can be explained by its linear relationship with the explanatory variable. Note that since −1 ≤ r ≤ 1 , 0 ≤ r 2 ≤ 1. Also note that you need to know which variable is being treated as the explanatory variable and which variable is being treated as the response variable in order to interpret r 2 . 7 Example 3.2.3 In the height/weight example involving the n = 43 high school students, what percentage of variation in weight can be explained by its linear relationship with height? Regression We can get even more detailed than correlation when describing the relationship between two variables. Regression is a technique that allows us to not only summarize a linear relationship but to make predictions about future values that have yet to be observed. 
In this class, we will focus on simple linear regression, which is a predictive model that describes the relationship between our explanatory variable and our response variable. Simple linear regression fits (or models) the prediction of the response variable from its relationship with the explanatory variable. This is done by observing measures on both an explanatory variable and a response variable and "fitting" a line that best describes the relationship between the two variables.

The regression equation is what defines the straight line that best describes how the values of the response variable are related, on average, to the values of one explanatory variable. The regression line itself is just the name of the line defined by the regression equation: the line of "best fit."

The Regression Equation and Regression Line

Let's look at the heights/weights example with the n = 43 high school students again. Suppose we wanted an equation that tells us how best to predict weight (y) given a specific height (x). In this case, we've got a set of data that is made up of 43 (x, y) points. We can use this data to come up with an equation of a line that "best fits" the relationship between height and weight.

Recall the equation for a straight line:

y = mx + b

where y = a value of the y-variable, m = slope, b = y-intercept, and x = a value of the x-variable.

Rearrange the right-hand side (and change the symbols) and you have our regression equation:

ŷ = β̂0 + β̂1x

where ŷ is the predicted mean y value for a given x, β̂0 is an estimate of the y-intercept based on the data, β̂1 is an estimate of the slope based on the data, and x is a value of our explanatory variable.

The value of the y-intercept in a regression context has the same meaning as the value of the y-intercept in a general math context. That is, it is the average value of the y-variable when x = 0.
In some scenarios, the y-intercept has a meaningful interpretation (e.g., the growth rate of a tree when temperature = 0 degrees). However, in other cases, the interpretation is meaningless and/or nonsensical (e.g., the weight of a child when height = 0 inches). It depends on the variables!

A much more meaningful value in the context of regression is the slope. In regression, the value of the slope has a slightly more specific meaning than it does in a general math context. In math, the slope is defined as the change in the y-variable compared to the change in the x-variable. Some may be familiar with the phrase “rise over run.” In regression, it’s the same idea - however, it’s more specific. The regression slope is defined as the average change in the y-variable for every unit change in the x-variable. Think of it as “rise over one.”

While we’ll mostly be relying on R to do our regression calculations for us, the following are the equations used to obtain the sample slope and y-intercept values:

β̂₁ = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ₌₁ⁿ (xᵢ − x̄)² = r·(s_y/s_x)

β̂₀ = ȳ − β̂₁·x̄

To run a regression analysis on an explanatory variable x and a response variable y in R, we use the function lm(y~x). Make sure your response variable (y) is listed first, otherwise you will not get the correct regression output!

Example 3.2.4
The annual rainfall (in mm) and maximum daily temperature (in Celsius) were recorded for n = 11 different locations in Mongolia. A regression analysis was performed to express annual rainfall (“rain”) as a linear function of maximum daily temperature (“temperature”).

rain = c(196, 196, 179, 197, 149, 112, 125, 99, 125, 84, 115)
temperature = c(5.7, 5.7, 7, 8, 8.5, 10.7, 11.4, 10.9, 11.4, 11.4, 11.4)

fit = lm(rain~temperature)
fit
##
## Call:
## lm(formula = rain ~ temperature)
##
## Coefficients:
## (Intercept)  temperature
##      295.25       -16.36

1. Write down the regression equation. Interpret the slope and the y-intercept.
2.
What percentage of variation in annual rainfall is explained by its linear relationship with temperature?

Example 3.2.5
Old Faithful is a popular geyser in Yellowstone National Park that is famous for its consistent eruptions. The duration (length) of the geyser’s eruptions (in minutes) as well as the amount of time spent waiting after the previous eruption (in minutes) were recorded for n = 144 eruptions. A regression analysis was performed to express eruption duration (“duration”) as a linear function of waiting time (“waiting”).

duration = c(3.3, 3.3, 3.3, 3.4, 3.5, 3.5, 3.7, 3.7, 3.7, 3.7, 3.8, 3.8, 3.8, 3.9, 3.9, 3.9, 3.9, 3.9, 3.9, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4.1, 4.1, 4.1, 4.1, 4.1, 4.1, 4.1, 4.2, 4.2, 4.2, 4.2, 4.2, 4.2, 4.2, 4.2, 4.2, 4.2, 4.2, 4.2, 4.2, 4.2, 4.3, 4.3, 4.3, 4.3, 4.3, 4.3, 4.3, 4.3, 4.3, 4.3, 4.3, 4.3, 4.3, 4.3, 4.3, 4.4, 4.4, 4.4, 4.4, 4.4, 4.4, 4.4, 4.4, 4.4, 4.4, 4.4, 4.4, 4.4, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.6, 4.6, 4.6, 4.6, 4.6, 4.6, 4.6, 4.6, 4.6, 4.6, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 4.8, 4.8, 4.8, 4.8, 4.8, 4.8, 4.8, 4.8, 4.8, 4.8, 4.9, 4.9, 4.9, 4.9, 5, 5, 5, 5, 5, 5, 5.1, 5.3, 5.5)

waiting = c(74, 88, 74, 86, 68, 87, 87, 84, 86, 84, 78, 89, 79, 75, 93, 82, 79, 92, 79, 80, 83, 84, 89, 77, 74, 87, 77, 76, 96, 81, 74, 89, 87, 93, 93, 82, 73, 76, 89, 75, 92, 87, 85, 78, 81, 87, 78, 93, 73, 86, 87, 76, 80, 96, 74, 80, 79, 75, 85, 78, 71, 76, 72, 80, 82, 84, 84, 77, 72, 84, 78, 78, 73, 80, 87, 79, 87, 85, 71, 85, 78, 80, 72, 87, 94, 84, 84, 88, 73, 90, 82, 91, 89, 88, 83, 76, 69, 80, 89, 80, 81, 88, 77, 73, 87, 81, 74, 98, 76, 81, 93, 72, 82, 85, 77, 71, 86, 90, 75, 93, 84, 83, 73, 80, 87, 79, 77, 88, 81, 81, 85, 89, 71, 76, 75, 93, 80, 84, 92, 68, 78, 91, 75, 83)

fit = lm(duration~waiting)
fit
##
## Call:
## lm(formula = duration ~ waiting)
##
## Coefficients:
## (Intercept)      waiting
##     2.78244      0.01948

1.
Write down the regression equation. Interpret the slope.
2. Use R to compute the correlation. Interpret this value.

cor(duration, waiting)
## [1] 0.3294965

Predicting Using the Regression Equation
We can get a general summary of our bivariate relationship by examining the regression equation and knowing how to interpret the y-intercept and the slope. We can also use our regression equation to predict the value of the response variable (y) for a specific value of the explanatory variable (x). A predicted ith value of our response variable, denoted ŷᵢ, can be obtained by plugging the specific ith explanatory variable value (xᵢ) into the regression equation and solving for ŷᵢ:

ŷᵢ = β̂₀ + β̂₁xᵢ

Note: be wary of extrapolation! Extrapolation involves using the regression equation to predict y values for x values outside of the original range of the data set.

Example 3.2.6
The heights (in inches) and weights (in pounds) were recorded for a sample of n = 43 high school students. A regression analysis was performed to express weight (“weights”) as a linear function of height (“heights”).

heights = c(73, 69, 70, 72, 73, 69, 68, 71, 71, 68, 69, 67, 66, 67, 72, 68, 75, 68, 73, 72, 72, 72, 72, 74, 68, 73, 68, 70, 72, 70, 67, 67, 71, 72, 73, 68, 72, 68, 67, 70, 71, 70, 67)
weights = c(195, 135, 145, 170, 172, 168, 155, 185, 175, 158, 185, 146, 135, 150, 160, 155, 230, 149, 240, 170, 198, 163, 230, 170, 151, 220, 145, 130, 160, 210, 145, 185, 237, 205, 147, 170, 181, 150, 150, 200, 175, 155, 167)

fit = lm(weights~heights)
fit
##
## Call:
## lm(formula = weights ~ heights)
##
## Coefficients:
## (Intercept)      heights
##    -317.919        6.996

1. Predict the weight of a student who is 70 inches tall.
2. Should we predict the weight of a student who is 62 inches tall?

Residuals
Let’s look again at the scatterplot of heights and weights for the n = 43 high school students, except this time let’s also plot the regression line.
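As a quick check of Example 3.2.6, the fitted coefficients can be plugged into the regression equation by hand (a sketch using the lm output above; the 62-inch case is flagged because it falls below the smallest observed height):

```r
# coefficients reported by lm(weights ~ heights)
b0 = -317.919   # intercept
b1 = 6.996      # slope

# 1. predicted weight for a 70-inch student
b0 + b1 * 70
## about 171.8 pounds

# 2. a 62-inch student lies outside the observed range of heights
#    (66 to 75 inches), so predicting here would be extrapolation
#    and should be avoided
```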
plot(heights, weights)
abline(fit)

Notice that the regression line isn’t perfect! For example, our calculation for the weight of a student who is 70 inches tall is not the same as any of the weights of the 70-inch-tall students in our actual sample. In other words, there exist prediction errors in our estimation.

A prediction error (or residual) is the difference between the observed y value (yᵢ) and the predicted y value (ŷᵢ) for any given x value. For the ith x value, the residual eᵢ is calculated as:

eᵢ = yᵢ − ŷᵢ

Example 3.2.7
In our height and weight example, a 70-inch tall student weighed 130 pounds. Compute this student’s residual.

So you may be wondering: “If the regression line isn’t perfect, why is there a unique regression equation/line for any given problem?” Our method of calculating the regression line involves the method of least squares. The regression line is the line that minimizes the total (summed) squared residuals. In other words, it is the line for which

Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

is minimized. The regression line is therefore the best possible linear explanation for the relationship between the two variables.

Regression Model Assumptions
As you can see, regression is quite powerful. It allows us to describe the linear relationship between two variables and make predictions. However, in order for a regression analysis to be accurate, a few assumptions must be met:
1. The assumption of constant variance (homoscedasticity): the variance of the residuals is constant. That is, the variance of the eᵢ values is the same regardless of the value of xᵢ.
2. The assumption of normality: the distribution of the residuals is normal.

Checking Model Assumptions
In this class, we focus on how to check these assumptions using plots.
Checking for constant variance (homoscedasticity): look at a residual (or “residuals vs. fits”) plot.
Good: The points appear fairly uniformly scattered about the flat dotted line. This suggests homoscedasticity.
Bad: The points either gradually fan out from the line or gradually condense about the line. This suggests that the variance is not constant (heteroscedasticity).

Checking for normality: look at a normal probability plot.
Good: The points appear in a straight (or near-straight) line and follow the diagonal line. This suggests normality.
Bad: The points deviate from the diagonal line. This suggests non-normality.

Suppose you have a regression analysis that you have named fit. To create the normal probability plot and residual plot, you would use the R function plot(fit).

Example 3.2.8
Recall the heights (in inches) and weights (in pounds) that were recorded for a sample of n = 43 high school students. A regression analysis was performed to express weight (“weights”) as a linear function of height (“heights”).

heights = c(73, 69, 70, 72, 73, 69, 68, 71, 71, 68, 69, 67, 66, 67, 72, 68, 75, 68, 73, 72, 72, 72, 72, 74, 68, 73, 68, 70, 72, 70, 67, 67, 71, 72, 73, 68, 72, 68, 67, 70, 71, 70, 67)
weights = c(195, 135, 145, 170, 172, 168, 155, 185, 175, 158, 185, 146, 135, 150, 160, 155, 230, 149, 240, 170, 198, 163, 230, 170, 151, 220, 145, 130, 160, 210, 145, 185, 237, 205, 147, 170, 181, 150, 150, 200, 175, 155, 167)

fit = lm(weights~heights)
fit
##
## Call:
## lm(formula = weights ~ heights)
##
## Coefficients:
## (Intercept)      heights
##    -317.919        6.996

Use the residual plot and the normal probability plot to assess the assumptions of homoscedasticity and normality.

plot(fit)

Statistics 213 – 4.1: Sampling from Well-Known Distributions
© Scott Robison, Claudia Mahler 2019 all rights reserved.
Objectives:
Be able to understand the concept of a sampling distribution
Be able to use the Central Limit Theorem (CLT) to describe the distribution of the sample mean for samples from well-known distributions
Be able to use a sampling distribution to calculate probabilities related to sample mean values

Motivation:
So far, the main focus in this class has been on describing how a single observation from a particular distribution will behave.

Examples
Suppose X represents the weight of a baby panda at the zoo. If X is normally distributed with a mean μ of 100g and a standard deviation σ of 12g, what is P(X > 105)?
Suppose Y represents the amount of time spent waiting at a bus stop for a bus. If Y is exponentially distributed with β = 7.5 minutes, what is P(Y < 6)?
Suppose W represents the number of times you roll a 1 when rolling a fair six-sided die ten times. If W is binomial with n = 10 and p = 1/6, what is P(W = 0)?

Now let’s shift our focus a little. Back in the 3.1 notes, we discussed the idea of sampling observations from a population. Recall that a sample is a selected subset of n observations from the population and is usually much smaller than the population itself.

We can consider this same idea when we think of taking multiple observations from a particular distribution.

Examples
Suppose X₁, X₂, ..., X₃₀ represents a sample of 30 observations from a normal distribution with a mean of 100g and a standard deviation of 12g.
Suppose Y₁, Y₂, ..., Y₅₀ represents a sample of 50 observations from an exponential distribution with β = 7.5 minutes.
Suppose W₁, W₂, ..., W₈₈ represents a sample of 88 observations from a binomial distribution with n = 10 and p = 1/6.

The goal of this set of notes is to understand how we can use the Central Limit Theorem to describe how a sample of observations from a particular distribution will behave.
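Before stating the theorem, a quick simulation shows the idea (a sketch; the seed, the sample size of 50, and the 1000 repetitions are illustrative choices, with β = 7.5 borrowed from the bus-stop example above): individual exponential observations are strongly skewed, but averages of many observations pile up symmetrically around the mean.

```r
set.seed(1)  # for reproducibility

# 1000 sample means, each computed from n = 50 exponential
# observations with mean beta = 7.5 (i.e., rate = 1/7.5)
xbar = replicate(1000, mean(rexp(50, rate = 1/7.5)))

mean(xbar)  # close to the population mean, 7.5
sd(xbar)    # close to sigma/sqrt(n) = 7.5/sqrt(50), about 1.06

# hist(xbar) would show a roughly normal, bell-shaped pile
```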
The Central Limit Theorem
Definition: Consider a random sample of n observations selected from a population (any population) with mean μₓ and standard deviation σₓ. Then, when n is sufficiently large, the sampling distribution of x̄ will be approximately a normal distribution with mean μ_x̄ = μₓ and standard deviation σ_x̄ = σₓ/√n. The larger the sample size, the better the normal approximation to the sampling distribution of x̄ will be.

Z = (x̄ − μₓ) / (σₓ/√n) = (X̄ − E[X]) / (SD[X]/√n) ≈ N(0, 1)

Example 4.1.1
Let X be a Poisson random variable with λ = 5. That is, X ∼ Pois(λ = 5). We have seen previously that μₓ = E[X] = λ and that σₓ = SD[X] = √λ.

1. Find the probability that if seven samples were drawn from the distribution of X, the mean X̄ = Σᵢ₌₁⁷ xᵢ / 7 will be between 4.76 and 5.35. In other words, we want to find P(4.76 ≤ X̄ ≤ 5.35) where X ∼ Pois(λ = 5) and n = 7.

pnorm(q = 5.35, mean = 5, sd = sqrt(5/7)) - pnorm(q = 4.76, mean = 5, sd = sqrt(5/7))
## [1] 0.2723929
pnorm((5.35-5)/(sqrt(5/7))) - pnorm((4.76-5)/(sqrt(5/7)))
## [1] 0.2723929

2. Suppose we are interested in an interval for X̄ such that P(a ≤ X̄ ≤ b) = 0.95. What could the values a and b take? Use R to determine these values.

qnorm(p = c(.025,.975), mean = 5, sd = sqrt(5/7))
## [1] 3.343528 6.656472

That is, we can say that if we were to take a random sample of seven values from a Poisson random variable with λ = 5, 95% of such samples would have a sample mean between 3.343528 and 6.656472. Or, equivalently, 3.343528 ≤ x̄ ≤ 6.656472.

Example 4.1.2
Suppose we randomly select five cards (without replacement) from an ordinary deck of 52 playing cards. Let Y be the number of face cards you would expect to pick (a “face card” meaning a jack, queen, or king).

1. What is the probability of selecting between zero and three (inclusive) face cards? Find this probability using R.
phyper(q = 3, m = 12, n = 40, k = 5)
## [1] 0.9920768
sum(dhyper(x = 0:3, m = 12, n = 40, k = 5))
## [1] 0.9920768

Note that since we know that Y follows a hypergeometric distribution with m = 12, n = 40, and k = 5, we also know

E[Y] = μ_y = k·m/(m + n) = 5·12/52 = 60/52 = 15/13

SD[Y] = σ_y = √( k · (m/(m+n)) · (n/(m+n)) · ((m+n−k)/(m+n−1)) ) = √( 5 · (12/52) · (40/52) · (47/51) ) = √(2350/2873)

2. You select five cards out of the shuffled deck and count the number of face cards. You then repeat this process 14 more times (for a total of 15 times). Consider the average number of face cards you would see out of the five cards selected on a given iteration. What is P(0 ≤ Ȳ ≤ 3)?

pnorm(q = 3, mean = 15/13, sd = (2350/2873)^.5/(15)^.5) - pnorm(q = 0, mean = 15/13, sd = (2350/2873)^.5/(15)^.5)
## [1] 0.9999996
pnorm((3-15/13)/((2350/2873)^.5/(15)^.5)) - pnorm((0-15/13)/((2350/2873)^.5/(15)^.5))
## [1] 0.9999996

3. Suppose we are interested in an interval for Ȳ such that P(a ≤ Ȳ ≤ b) = 0.95. What could the values a and b take? Use R to determine these values.

qnorm(p = c(.025,.975), mean = 15/13, sd=(2350/2873)^.5/(15)^.5)
## [1] 0.6961592 1.6115332

That is, we can say that if we were to draw five cards from a deck of cards 15 separate times, then computed the average number of face cards from each of those 15 individual draws of five cards, 95% of the time the average number of face cards selected out of the five cards will be between 0.6961592 and 1.6115332. Or, equivalently, 0.6961592 ≤ Ȳ ≤ 1.6115332.
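The CLT-based interval in part 3 can be sanity-checked by simulation (a sketch; the seed and the 10,000 repetitions are illustrative choices — rhyper draws the number of face cards in a five-card hand directly):

```r
set.seed(42)  # for reproducibility

# 10,000 repetitions of: deal 5 cards 15 times and average the
# number of face cards (m = 12 face cards, n = 40 others, k = 5)
ybar = replicate(10000, mean(rhyper(15, m = 12, n = 40, k = 5)))

# proportion of simulated sample means inside the CLT interval
mean(ybar >= 0.6961592 & ybar <= 1.6115332)
## close to 0.95
```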
One thing to note about the CLT that is very interesting: no matter what the distribution of our random variable X is - normal, exponential, Poisson, or something else - the distribution of the sample mean, X̄, will be approximately normal with mean μ_x̄ = μₓ and standard deviation σ_x̄ = σₓ/√n, provided your sample size n is “large enough.” This applies even when the distribution of X is not a “named” probability distribution, as long as you have a probability distribution table.

Example 4.1.3
Consider rolling two dice and consider the summed total of the two die faces. The probability distribution table for the sum of the dice, X, is given as follows:

x     2     3     4     5     6     7     8     9     10    11    12
p(x)  1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

1. What is the expected sum when you roll two dice, E[X] = μₓ?
2. What is the standard deviation of the sum when you roll two dice, SD[X] = σₓ?
3. If the pair of dice were rolled n = 47 times and the outcomes were recorded, give an interval that would account for the mean of the 47 rolls 95% of the time.

x = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
p = c(1/36, 2/36, 3/36, 4/36, 5/36, 6/36, 5/36, 4/36, 3/36, 2/36, 1/36)
E = sum(x*p)
E
## [1] 7
E2 = sum((x^2)*(p))
VAR = E2 - E^2
VAR
## [1] 5.833333
SD = sqrt(VAR)
SD
## [1] 2.415229
qnorm(p = c(0.025, 0.975), mean = 7, sd = SD/sqrt(47))
## [1] 6.30951 7.69049
qnorm(p = c(0.025, 0.975))
## [1] -1.959964 1.959964

Example 4.1.4
Consider X ∼ Bin(n = 1, p = 0.3). Suppose 64 observations are selected at random from this distribution (n* = 64).

1. What is the expected value of X, E[X] = μₓ?
2. What is the standard deviation of X, SD[X] = σₓ?
Now consider p̂ = X̄ = Σᵢ₌₁^{n*} Xᵢ / n*. Note that μₓ = E[X] = np since X ∼ Bin.
3. Give an interval that would account for p̂ 95% of the time.

n = 1
p = 0.3
N = 64
E = n*p
E
## [1] 0.3
SD = sqrt(n*p*(1-p))
SD
## [1] 0.4582576
qnorm(p = c(0.025, 0.975), mean = p, sd = SD/sqrt(N))
## [1] 0.187729 0.412271
qnorm(p = c(0.025, 0.975))
## [1] -1.959964 1.959964

4.
Give an interval that would account for p̂ 99% of the time.

qnorm(p = c(0.005, 0.995), mean = p, sd = SD/sqrt(N))
## [1] 0.1524508 0.4475492
qnorm(p = c(0.005, 0.995))
## [1] -2.575829 2.575829

This is a precursor to what we’ll be covering in the 4.2 notes…

p̂ ± Z_{α/2} √( p̂(1 − p̂)/n )

Statistics 213 – 4.2: Sampling from Unknown Distributions
© Scott Robison, Claudia Mahler 2019 all rights reserved.

Objectives:
Review the definitions of a parameter and a statistic
Be familiar with the concept of confidence intervals and know how to compute them
Understand the basics behind bootstrapping

Motivation:
In the 4.1 notes, we discussed the concept of sampling distributions as well as the Central Limit Theorem, which allows us to describe how a sample of observations from a particular distribution will behave. Recall that the distributions we discussed in the 4.1 notes came from what we considered to be “known” distributions - distributions for which we know the specifics for how to compute the mean and variance/standard deviation of the distribution.

Examples
Suppose X₁, X₂, ..., X₃₀ represents a sample of 30 observations from a normal distribution with a mean of 100 and a standard deviation of 12. Then the distribution of x̄ is approximately normal with mean μ_x̄ = μₓ = 100 and standard deviation σ_x̄ = σₓ/√n = 12/√30.
Suppose W₁, W₂, ..., W₈₈ represents a sample of 88 observations from a binomial distribution with n = 1 and p = 1/6. Then the distribution of p̂ is approximately normal with mean μ = p = 1/6 and standard deviation σ = √( (1/6)·(5/6)/88 ).

But what happens if we run into a scenario where we have a random variable that doesn’t follow one of these “known” distributions? In this set of notes, we’ll discuss several different approaches we can take when we must sample from an unknown distribution.
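The two standard deviations quoted in the motivating examples can be computed directly (a quick sketch):

```r
# standard deviation of the sample mean of 30 normal
# observations with sigma = 12: sigma/sqrt(n)
12 / sqrt(30)
## about 2.19

# standard deviation of the sample proportion from 88
# observations with p = 1/6: sqrt(p*(1-p)/n)
sqrt((1/6) * (5/6) / 88)
## about 0.0397
```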
Sampling from Populations

Variation in Sampling
One goal of sampling is to use the sample we collect to make an inference or claim about the larger population from which the sample was taken. That is, based on a smaller (and hopefully representative!) subset of the population, what can be said about the population itself?

Example
Suppose we are interested in the average height of all active Major League Baseball (MLB) players. If we take a sample of n = 30 players and measure their heights, we can use the average height of those 30 players to make a claim about the average height of all active players.

In the example above, we are using a sample mean to make a claim about a population mean. One thing to note about using samples to estimate what’s going on in the population is that different samples may produce different results. That is, samples will vary even when the samples are collected in the same manner.

Example
One sample of n = 30 MLB players gives an average sample height of 74”.
Another sample of n = 30 MLB players (collected in the same fashion) gives an average sample height of 75.3”.
A third sample of n = 30 MLB players (again, collected in the same fashion) gives an average sample height of 73.9”.
Etc.

This variation from sample to sample is really the focus of statistics!

Parameters and Statistics - A Review
As was first discussed in the 3.1 notes, one of the best ways of “summarizing” the variation that comes with collecting data is to aggregate all of the information collected from each object in the target population into a single numerical property. Recall from the 3.1 notes that a population is the set of all instances (units, people, objects, events, regions, etc.) we are interested in when wanting to study the behavior of a variable. We usually denote the size of a population (if known) with N. At the population level, these numerical properties are called parameters.
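This sample-to-sample variation is easy to see in a small simulation (a sketch; the population of 750 “player” heights, the normal parameters, and the seed are all hypothetical choices for illustration):

```r
set.seed(7)  # for reproducibility

# a hypothetical population of 750 player heights (inches)
population = rnorm(750, mean = 74, sd = 2.5)

# three samples of n = 30, drawn the same way, give three
# different sample means
mean(sample(population, 30))
mean(sample(population, 30))
mean(sample(population, 30))
## three values near 74, but not identical
```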
Ideally, we would record information on a variable of interest for every observation in a given population and thus would be able to compute appropriate population parameters. However, in a lot of cases, it is impractical (or too expensive, or even impossible) to do so, since most populations are very large. Instead, we most often select a subset of observations from a population of interest and record information on our variable of interest for every observation in that subset. Recall from the 3.1 notes that a sample is a selected subset of observations from the population, usually much smaller than the population itself. We usually denote the size of a sample with n. By obtaining a measurement from each object in a sample of the population, we can then make inferences about the population as a whole.

Just as we can think about aggregating all of the information collected from each object in a population into a single numerical property, we can do so in a sample as well. At the sample level, these summary values are called statistics.

Example 4.2.1
The following data were produced by a simple random sample of houses that have sold in my area in the past six months. The variable of interest is the selling price of a home, in $1000s.

575.0, 549.0, 572.5, 649.9, 485.0

Let’s consider a few sample statistics based on this sample of n = 5 houses.

The sample mean is calculated as x̄ = Σᵢ₌₁ⁿ xᵢ / n.
The sample standard deviation is calculated as s = √( Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1) ).
The 50th percentile (also called the median) is the number such that 50% of the data is less than that number.
The 25th percentile (also called the 1st quartile or Q1) is the number such that 25% of the data is less than that number.
The 75th percentile (also called the 3rd quartile or Q3) is the number such that 75% of the data is less than that number.
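These definitions can be applied with base R functions (a sketch using the five selling prices, written out alongside the built-ins so the formulas can be checked):

```r
x = c(575.0, 549.0, 572.5, 649.9, 485.0)

# sample mean: sum of the values divided by n
sum(x) / length(x)   # 566.28
mean(x)              # the same, via the built-in

# sample standard deviation: root of the squared deviations
# from the mean, divided by n - 1
sqrt(sum((x - mean(x))^2) / (length(x) - 1))
sd(x)                # the same, about 59.19

# 50th percentile
median(x)            # 572.5
```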
library(mosaic)
x = c(575.0, 549.0, 572.5, 649.9, 485.0)
favstats(x)
##  min  Q1 median  Q3   max   mean       sd n missing
##  485 549  572.5 575 649.9 566.28 59.18629 5       0

Now suppose a sixth house that has been sold was randomly picked. The selling price of this particular house was $975,000. Compute the mean and median of this new sample:

485.0, 549.0, 572.5, 575.0, 649.9, 975.0

x = c(575.0, 549.0, 572.5, 649.9, 485.0, 975.0)
favstats(x)
##  min      Q1 median      Q3 max  mean       sd n missing
##  485 554.875 573.75 631.175 975 634.4 175.0555 6       0

Result: The sample median is said to be robust, or less sensitive to irregular/extreme data points, compared to the sample mean. Irregular data points can (and often do) occur when the variable of interest possesses (in the population) a skewed distribution (either right-skewed or left-skewed). In such cases, the sample median should be used as the appropriate measure of center.

One must also be careful in using the sample standard deviation as a measure of spread! Use the value of s to estimate the spread in the population variable of interest only when the sample mean x̄ is being used as the measure of center. This is because the sample standard deviation (like the sample mean) is very sensitive to irregular/extreme values. The “skewing effect” of the sample mean is compounded in the calculation of s.

So what is one way we can deal with skewed distributions in the context of trying to estimate parameters? One option to consider is bootstrapping.

Bootstrapping
In statistics, bootstrapping is any test or metric that relies on random sampling with replacement. Bootstrapping allows assigning measures of accuracy (defined in terms of bias, variance, confidence intervals, prediction error, or some other such measure) to sample estimates. This technique allows estimation of the sampling distribution of almost any statistic using random sampling methods.

The general process of bootstrapping is as follows:
1.
An initial sample of n observations is taken from a distribution.
2. From that initial sample, we sample with replacement n times to obtain a bootstrap sample. We do this repeatedly to obtain a large number of bootstrap samples.
3. For each bootstrap sample, the sample statistic of interest is calculated (x̄ or p̂).
4. These bootstrap sample statistics form the sampling distribution of the sample statistic.

As we will see later on in the notes, we can use this bootstrap-based sampling distribution to form confidence interval estimates of parameters. Bootstrapping allows for a better estimation of the sampling distribution of a sample statistic when one cannot assume that the variable follows a normal distribution (or any of the other well-known distributions, such as Poisson, exponential, etc.). To better demonstrate this, let’s go back to the modified Example 4.2.1.

Example 4.2.1 (revisited)
The following data were produced by a simple random sample of six houses that have sold in my area in the past six months. The variable of interest is the selling price of a home, in $1000s.

575.0, 549.0, 572.5, 649.9, 485.0, 975.0

Let’s use R to compare the distribution of the sample itself, a distribution based on the normal distribution, and a bootstrap distribution.
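The resampling step at the heart of this process is just sampling with replacement, which base R can do directly with sample() (a sketch using the six selling prices; the seed and the 1000 repetitions are illustrative choices):

```r
x = c(575.0, 549.0, 572.5, 649.9, 485.0, 975.0)

set.seed(1)  # for reproducibility

# one bootstrap sample: draw n = 6 values from x with replacement
sample(x, size = 6, replace = TRUE)

# 1000 bootstrap sample means (a base-R equivalent of the mosaic
# idiom do(1000) * mean(resample(x, 6)) used later in these notes)
boot_means = replicate(1000, mean(sample(x, 6, replace = TRUE)))
sd(boot_means)  # the bootstrap standard error of the mean
```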
First, the distribution of the sample:

x = c(575.0, 549.0, 572.5, 649.9, 485.0, 975.0)
mean(x)
## [1] 634.4
sd(x)
## [1] 175.0555
favstats(x)
##  min      Q1 median      Q3 max  mean       sd n missing
##  485 554.875 573.75 631.175 975 634.4 175.0555 6       0

Now, a distribution based on the normal distribution:

y = rnorm(n = 1000, mean = 634.4, sd = 175.0555)
favstats(y)
##       min       Q1   median       Q3      max     mean       sd    n missing
##  118.7369 521.6396 648.6604 757.2487 1306.529 637.3441 178.8723 1000       0

Finally, a bootstrap distribution:

library(mosaic)
x = c(575.0, 549.0, 572.5, 649.9, 485.0, 975.0)
B = do(1000)*mean(resample(x, 6))
favstats(B$mean)
##  min       Q1   median       Q3      max    mean       sd    n missing
##  500 580.2167 632.5833 680.0875 920.8167 635.286 64.70401 1000       0

Compare the means and standard deviations of these three “distributions.” What do we see? While the medians of these three distributions are approximately the same (as are the means), notice that there is far less variability in the bootstrap boxplot than in the “normal” boxplot.

Why does it matter? Bootstrap-based distributions can be considered more robust against extreme values compared to traditional methods of estimating parameters. We will see the use of this in the next section.

Estimating Parameters with Confidence Intervals
As discussed above, parameters represent numerical properties that act as summary values used to describe the behavior of a given variable in the population of interest. Statistics are these summary values calculated at the sample level and are typically used to estimate the corresponding (unknown) population parameters.

Examples
The average height in a sample of n = 40 MLB players, x̄, can be used to estimate the average height of all MLB players, μ.
The proportion of voters who would vote for the Liberal Party in a sample of n = 1,034 registered Canadian voters, p̂, can be used to estimate the proportion of all registered Canadian voters who would vote for the Liberal Party, p.
These sample statistics on their own are what we call point estimates, or single-value estimates of their corresponding population parameters. Since these point estimates vary from sample to sample, using them on their own to estimate the corresponding population parameter may not be extremely useful, as this variation of the sample statistics is not taken into account.

Another option for estimation involves the computation of confidence intervals. A confidence interval is constructed based on sample data and a point estimate and is an interval estimate of the target population parameter. It gives us a range of values that are plausible for the population parameter. In this class, we will discuss how to construct confidence intervals for two parameters of interest: p and μ.

Estimating p with Confidence Intervals

Selecting the Margin of Error
The margin of error in a confidence interval is the amount that we will both add and subtract from the point estimate of interest in order to produce a desired confidence level for a confidence interval. We can change the confidence level by changing the margin of error. The greater the margin of error, the higher our confidence level will be.

We represent the allowable error that our interval will contain with α. For example, if we construct an interval for which we could be 100% sure that the interval would always range across all possible values of the parameter of interest, then the allowable error is α = 0. Another way to state this is to say that we have made a (1 − α)·100% confidence interval or, in this case, a (1 − 0)·100% or a 100% confidence interval.

The structure of a confidence interval for p is:

p̂ ± margin of error
p̂ ± Z_{α/2} √( p̂(1 − p̂)/n ), where p̂ = X/n

Example 4.2.2
An Ipsos-Reid poll of n = 1,034 randomly selected Canadian voters was taken between February 14th and February 18th.
Each voter was asked the following question: “If a federal election were to be held tomorrow, what political party would you vote for?” 382 would vote for the Liberal Party should a federal election occur tomorrow.

1. Compute the sample proportion as well as the margin of error.

382/1034
## [1] 0.3694391
qnorm(.975)*(382/1034*(1-382/1034)/1034)^(.5)
## [1] 0.02941865

2. Use the information computed above to find a 95% confidence interval estimate for the percentage of all Canadian voters who would vote for the Liberal Party.

382/1034 - qnorm(.975)*(382/1034*(1-382/1034)/1034)^(.5)
## [1] 0.3400204
382/1034 + qnorm(.975)*(382/1034*(1-382/1034)/1034)^(.5)
## [1] 0.3988577

Confidence Intervals Based on Bootstrap Percentiles
If we were only concerned with 95% confidence intervals and always had a symmetric, bell-shaped bootstrap distribution, the confidence interval as it is computed in the above section would probably be all that we need. But we may end up with a bootstrap distribution that is symmetric but subtly flatter (or steeper), so that more (or less) than 95% of bootstrap statistics are within Z_{α/2} standard errors of the center.

Fortunately, we can use the percentiles of the bootstrap distribution to locate the actual middle (1 − α)·100% of the distribution. Specifically, if we want the middle (1 − α)·100% of the bootstrap distribution (the values that are most likely to be close to the center), we can just chop off the lowest (α/2)·100% and the highest (α/2)·100% of the bootstrap statistics to produce an interval.

Example 4.2.2 (revisited)
An Ipsos-Reid poll of n = 1,034 randomly selected Canadian voters was taken between February 14th and February 18th. Each voter was asked the following question: “If a federal election were to be held tomorrow, what political party would you vote for?” 382 would vote for the Liberal Party should a federal election occur tomorrow.
Compute a bootstrap 95% confidence interval estimate for the percentage of all Canadian voters who would vote for the Liberal Party.

library(mosaic)
B = do(1000)*mean(resample(c(rep(1, 382), rep(0, 1034-382)), 1034))
quantile(B$mean, 0.025)
##      2.5%
## 0.3413685
quantile(B$mean, 0.975)
##     97.5%
## 0.3984526

Interpreting Confidence Intervals
A confidence interval for a sample proportion gives a set of values that are plausible for the population proportion. If a value is not in the confidence interval, we conclude that it is an implausible/unlikely value for the actual population proportion. It’s not impossible that the population value is outside the interval, but it would be pretty surprising.

For example, suppose a candidate for political office conducts a poll and finds that a 95% confidence interval for the proportion of voters who will vote for him is 42% to 48%. He would be wise to conclude that he does not have 50% of the population voting for him. The reason is that the value 50% is not in the confidence interval, so it is implausible to believe that the population value is 50%. Sometimes drawing a picture helps!

There are many common misinterpretations of confidence intervals that you must avoid. The most common mistake is trying to turn confidence intervals into some sort of probability problem. For example, if asked to interpret a 95% confidence interval of 45.9% to 53.1%, many people would mistakenly say, “This means there is a 95% chance that the population percentage is between 45.9% and 53.1%.”

What’s wrong with this statement? Remember that probabilities are long-run frequencies. The above interpretation claims that if we were to repeat this survey many times, then in 95% of the surveys the true population percentage would be a number between 45.9% and 53.1%. This claim is wrong! This is because the true population percentage doesn’t change. It is either always between 45.9% and 53.1% or it is never between these two values.
It can’t be between these two numbers 95% of the time and somewhere else the rest of the time.

Another analogy will help make this clear. Suppose there is a skateboard factory where 95% of the skateboards produced are perfect, but the other 5% have no wheels. Once you buy a skateboard from this factory, you can’t say that there is a 95% chance that it has wheels. Either it has wheels or it does not have wheels. It is not true that the board has wheels 95% of the time and, mysteriously, no wheels the other 5% of the time.

A confidence interval is like one of these skateboards. Either it contains the true parameter (has wheels) or it does not. The “95% confidence” refers to the “factory” that “manufactures” confidence intervals: 95% of its products are good, 5% are bad. Our confidence is in the process, not in the product.

A correct interpretation: We are (1 − α) ∗ 100% sure that the true population proportion is between the lower and the upper limit calculated.

Example 4.2.3

A random sample of n = 3,005 Canadians between the ages of 30 and 65 revealed that 1,683 expect to work past the traditional retirement age of 65.

1. Find a 99% confidence interval for p, the proportion of Canadians aged 30 to 65 who expect to be working past the age of 65.

1683/3005 - qnorm(.995)*(1683/3005*(1-1683/3005)/3005)^(.5)

## [1] 0.5367423

1683/3005 + qnorm(.995)*(1683/3005*(1-1683/3005)/3005)^(.5)

## [1] 0.5833908

2. Interpret the meaning of this interval in the context of the data.

3. Can you infer from the interval above that (a) p = 0.54? (b) p < 0.60?

Example 4.2.4

A random sample of n = 109 first-year University of Calgary students revealed that 23 had used marijuana in the past six months.

1. Find a 95% confidence interval for p, the proportion of all first-year University of Calgary students that have used marijuana in the past six months, based on the distribution of p̂ and based on bootstrapping.
p = 23/109
n = 109
conf = 0.95
p - qnorm(conf+(1-conf)/2)*(p*(1-p)/n)^(.5)

## [1] 0.1344105

p + qnorm(conf+(1-conf)/2)*(p*(1-p)/n)^(.5)

## [1] 0.2876079

library(mosaic)
B = do(1000)*mean(resample(c(rep(1,23),rep(0,n-23)), n))
quantile(B$mean, (1-conf)/2)

## 2.5%
## 0.146789

quantile(B$mean, conf+(1-conf)/2)

## 97.5%
## 0.293578

2. Can you conclude from the findings above that (a) 20% of first-year University of Calgary students have used marijuana in the past six months? (b) more than 25% of first-year University of Calgary students have used marijuana in the past six months?

Estimating μ with Confidence Intervals

Selecting the Margin of Error

Recall that we used the CLT to create the confidence interval formula for proportions. It would be nice to use the CLT to create the confidence interval formula for means as well! Recall also the following formula:

Z = (x̄ − μ) / (σ/√n)

It would be nice if we could divide by the true standard error, σ/√n. The problem is that in real life, we almost never know the value of σ, the population standard deviation. In fact, in order to calculate it, we would have to know μ, which is what we are trying to estimate! So instead, we replace it with an estimate: the sample standard deviation, s. This gives us an estimate of the standard error: s/√n.

However, (x̄ − μ) / (s/√n) ≠ Z, since we changed the σ to an s. In fact,

t = (x̄ − μ) / (s/√n)

That is, this computation is not a z-score but rather a t-score, and it does not come from the normal distribution. Instead, it comes from a new distribution called the Student’s t-distribution (or just t-distribution).

Note: the t-distribution was discovered by William Gosset. However, he was working at the Guinness Brewery at the time, which did not allow employees to publish their work.
So instead, he published his work under the pen name “Student.”

This t-distribution is a better model for the sampling distribution of x̄ than the normal distribution when σ is not known (that is, when it must be estimated with s). Recall that

s = √( Σ (xᵢ − x̄)² / (n − 1) ), where the sum runs over i = 1, …, n

and

σ = √( Σ (xᵢ − μ)² / N ), where the sum runs over i = 1, …, N.

The t-distribution shares many characteristics with the standard normal distribution. Both are symmetric, unimodal, and might be described as “bell-shaped.” The t-distribution’s shape depends on only one parameter, called the degrees of freedom (df). The number of degrees of freedom is (usually) an integer: 1, 2, 3, and so on. In this case, the degrees of freedom is the number of gaps in the data, or n − 1. Ultimately, when the degrees of freedom is infinitely large, the t-distribution is exactly the same as the standard normal distribution.

Therefore, to create a (1 − α) ∗ 100% confidence interval for the population mean μ when σ is unknown, we will use the following formula:

x̄ ± t_{α/2, n−1} · s/√n, where t_{α/2, n−1} is the value satisfying P(T_{n−1} ≥ t_{α/2, n−1}) = α/2.

Remember, if σ is known we can use it (and the standard normal distribution). Also note that if n is large, s will be very close to σ. However, this is only an approximation; it is still best to use t when σ is unknown!

To create a (1 − α) ∗ 100% confidence interval for the population mean μ when σ is known, we will use the following formula:

x̄ ± z_{α/2} · σ/√n, where z_{α/2} is the value satisfying P(Z ≥ z_{α/2}) = α/2.

Confidence Intervals Based on Bootstrap Percentiles

Just as was the case for confidence intervals involving proportions, we may end up with a bootstrap distribution that is symmetric but subtly flatter (or steeper), so that more (or less) than 95% of the bootstrap statistics fall within z_{α/2} standard errors of the center. So we can use the percentiles of the bootstrap distribution to locate the actual middle (1 − α) ∗ 100% of the distribution.
Specifically, if we want the middle (1 − α) ∗ 100% of the bootstrap distribution (the values that are most likely to be close to the center), we can just chop off the lowest (α/2) ∗ 100% and the highest (α/2) ∗ 100% of the bootstrap statistics to produce an interval.

Example 4.2.5

One of the exciting aspects of a university professor’s life is the time one spends in meetings. A stratified random sample of 40 professors from various science departments was taken. Each professor was asked, “In a week, how many hours do you typically spend in meetings?” The mean of this sample was x̄ = 9.85 hours. Assume that the standard deviation in the number of hours per week spent in meetings for all professors in this particular science faculty is 8 hours, or σ = 8 hours.

1. Find a 95% confidence interval for μ, the mean number of hours a professor in this particular science faculty spends in meetings in a week.

mean = 9.85
sigma = 8
n = 40
conf = .95
mean - qnorm(conf+(1-conf)/2)*sigma/n^.5

## [1] 7.37082

mean + qnorm(conf+(1-conf)/2)*sigma/n^.5

## [1] 12.32918

2. Interpret the meaning of the above interval in the context of the data.

3. If the level of confidence was increased from 95% to, say, 99%, what would happen to the width of the confidence interval?

Example 4.2.6

A study focusing on financial issues and concerns of post-secondary students in Canada was recently conducted by the Royal Bank of Canada. A subset of n = 200 recent graduates from an undergraduate degree or diploma program was randomly chosen, and the debt incurred as a result of going to school (defined as student debt) was determined for each. This produced an average student debt of $26,680 and a standard deviation of $4,500. You want to find a 95% confidence interval estimate for μ, the average level of student debt for all recent graduates from a post-secondary institution (excluding graduate programs).

1. Find the standard error and the margin of error for 95% confidence.
mean = 26680
s = 4500
n = 200
conf = 0.95
s/n^.5

## [1] 318.1981

qt(conf+(1-conf)/2,n-1)

## [1] 1.971957

qt(conf+(1-conf)/2,n-1)*s/n^.5

## [1] 627.4727

2. Find a 95% confidence interval estimate for the average level of student debt for all recent graduates from a post-secondary institution (excluding graduate programs).

mean = 26680
s = 4500
n = 200
conf = 0.95
mean - qt(conf+(1-conf)/2,n-1)*s/n^.5

## [1] 26052.53

mean + qt(conf+(1-conf)/2,n-1)*s/n^.5

## [1] 27307.47

3. Interpret the meaning of the interval calculated above in the context of the data.

Example 4.2.7

The amount of sewage and industrial pollutants dumped into a body of water affects the health of the water by reducing the amount of dissolved oxygen available for aquatic life. Over a two-month period, sixteen samples of water were taken from a river one kilometer downstream from a sewage treatment plant. The amount of dissolved oxygen in each sample of river water was determined and is given below.

5.4, 5.4, 5.6, 4.2, 4.7, 5.3, 4.4, 4.9, 5.2, 5.9, 4.7, 4.9, 4.8, 4.9, 5.0, 5.5

The mean, median, and standard deviation of the above sample are given as x̄ = 5.05, x̃ = 4.95, s = 0.453.

1. Find a 95% confidence interval estimate for μ, the mean dissolved oxygen level during the two-month period in the river located one kilometer downstream from the sewage plant. Compute this interval using the appropriate margin of error.

x = c(5.4, 5.4, 5.6, 4.2, 4.7, 5.3, 4.4, 4.9, 5.2, 5.9, 4.7, 4.9, 4.8, 4.9, 5.0, 5.5)
mean = mean(x)
s = sd(x)
n = length(x)
conf = .95
mean - qt(conf+(1-conf)/2,n-1)*s/n^.5

## [1] 4.80854

mean + qt(conf+(1-conf)/2,n-1)*s/n^.5

## [1] 5.29146

2. Find a 95% confidence interval estimate for μ, the mean dissolved oxygen level during the two-month period in the river located one kilometer downstream from the sewage plant. Compute this interval based on bootstrapping.

library(mosaic)
B = do(1000)*mean(resample(c(5.4, 5.4, 5.6, 4.2, 4.7, 5.3, 4.4, 4.9, 5.2, 5.9, 4.7, 4.9, 4.8, 4.9, 5.0, 5.5), n))
quantile(B$mean,(1-conf)/2)

## 2.5%
## 4.83125

quantile(B$mean,((1-conf)/2)+conf)

## 97.5%
## 5.250156

3. From the above confidence intervals, could you conclude that μ < 5? Which interval is better?

Example 4.2.8

In a report from the Bank of Montreal Outlook on holiday spending for the year 2014, a survey was conducted by Pollara in which 115 Albertans were randomly chosen and each was asked how much they would spend on gifts for people in the upcoming holiday season (excluding amounts spent on trips, entertaining, and other spending). The mean, median, and standard deviation resulting from this survey are x̄ = $652.00, x̃ = $643.00, s = $175.

1. From this sample, construct a 95% confidence interval for μ, the mean amount Albertans spent in the holiday season in 2014.

mean = 652
s = 175
n = 115
conf = 0.95
mean - qt(conf+(1-conf)/2,n-1)*s/n^.5

## [1] 619.6725

mean + qt(conf+(1-conf)/2,n-1)*s/n^.5

## [1] 684.3275

2. The same study revealed that the mean amount spent by all Quebec consumers during the 2014 holiday season was $460. Does the interval computed above suggest that, on average, Albertans will spend more this holiday season compared to consumers in Quebec?

Which Interval Should We Compute? (A Guide!)

In this set of notes, we’ve learned about three different “structures” of confidence intervals: one that relies on the z-distribution (a z-score), one that relies on the t-distribution (a t-score), and one that is built on bootstrapping the original sample. Which interval should be used, and when? The following guide should help you determine when each interval is most appropriate to use.

Intervals for μ

Use bootstrapping when…
- You have the raw data
- You are not sure if the data are normal

Basically, if you can bootstrap (that is, if you have the raw data), use that method!
Use an interval based on the z-distribution when…
- You do not have the raw data
- σ is known

Use an interval based on the t-distribution when…
- You do not have the raw data
- You can assume the data are normally distributed
- σ is not known

Intervals for p

Use bootstrapping when…
- You have the raw data

Again, if you can bootstrap (if you have the raw data or can use R to make “fake” 0 and 1 data), use that method!

Use an interval based on the z-distribution when…
- You are unable to perform bootstrapping
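As a quick illustration of the guide, the following R sketch reuses the dissolved-oxygen data from Example 4.2.7 (raw data available, σ unknown) and computes both the t-based interval and a bootstrap percentile interval side by side. Note that base R's sample() is used here in place of mosaic's resample() so the sketch runs without extra packages; the two functions draw resamples in the same way.

```r
# Raw data: dissolved oxygen readings from Example 4.2.7
x = c(5.4, 5.4, 5.6, 4.2, 4.7, 5.3, 4.4, 4.9, 5.2, 5.9,
      4.7, 4.9, 4.8, 4.9, 5.0, 5.5)
n = length(x)
conf = 0.95
alpha = 1 - conf

# t-based interval: x-bar +/- t_{alpha/2, n-1} * s/sqrt(n)
t.int = mean(x) + c(-1, 1) * qt(1 - alpha/2, n - 1) * sd(x)/sqrt(n)
t.int  # roughly 4.81 to 5.29, matching Example 4.2.7

# Bootstrap percentile interval: resample with replacement 1000 times,
# then chop off the lowest and highest (alpha/2)*100% of bootstrap means
set.seed(1)  # for reproducibility of the resampling
boot.means = replicate(1000, mean(sample(x, n, replace = TRUE)))
b.int = quantile(boot.means, c(alpha/2, 1 - alpha/2))
b.int
```

Because this bootstrap distribution is roughly symmetric and bell-shaped, the two intervals come out very close to each other, which is exactly the situation the guide describes.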