Unit 1 - Introduction

Jenna G. Tichon

Objectives: By the end of this class the student should be able to:

- explain the scope and purpose of statistics
- identify samples vs populations
- explain the definition of a sampling distribution

1.1 What is statistics?

Statistics is about making sense of the world around us based on limited data. We need to:

- make sense of the past
- understand the present
- make predictions about the future

but we only have access to a limited amount of data. We need to take the information that is available and use it to infer the answers to our questions. Statistics involves all aspects of this task:

- stating appropriate research questions
- collecting data
- organizing data
- summarizing/graphing data
- analyzing data
- interpreting output
- presenting data
- drawing actionable conclusions

Statistics is an interesting branch of science in that statisticians do not do statistics for themselves. We do statistics with and for people in other fields, and we help make sense of the data for the non-statisticians who need the results.

QUESTION: What is the difference between a statistical conclusion and a conclusion to a research question?

1.2 Samples vs Populations

Every time we take a sample from a population, we get something that is a bit different. Generally, if we are picking samples randomly and with a sufficiently large sample size, they should reflect the population.

[Figure: histograms of four random samples of size 500, compared with the population]

All of these histograms are slightly different, with small differences in the sample means and standard deviations, but they closely approximate the actual population.

Let's see what happens when we redo this with a small sample size.

[Figure: histograms of four random samples of size 10]

Here we took samples of size 10, which is very small. Note that these histograms don't look very alike. We do not have enough observations in our sample to get an accurate picture of the population at large.

This is what happens when we use a sampling rule that is not random.

[Figure: histogram of a sample chosen by a non-random rule]

Here I used the sampling rule that I would pick any observation from my list of the population (we'll find out later this is called a sampling frame) that is smaller than the preceding number (note that my list of the population was in random order). This selection method is biased as it systematically favours smaller numbers. We note that the sample has a much smaller mean than the first set of four samples. It also has a smaller max and selects more numbers that are under 20. When sampling units are chosen based on one of their traits and not at random, this creates a sample that is not reflective of the population.

1.3 Sampling Distributions

Notice that none of the samples had the exact same mean and standard deviation as the population? Every time we take a sample we get something that is slightly different than the previous time. We do not expect every sample to be identical. There is, however, an expected range that we expect our sample means and sample standard deviations to fall into. It would be impossible to get an x̄ of 40 because we have no observations that large. If we took the 500 largest observations in our population we would have a mean of 26.14. If we took the 500 smallest observations in our population we would have a mean of 13.76. Theoretically, when we were taking samples of size 500, we could have seen x̄'s between 13.76 and 26.14, but it is extremely unlikely we would ever see either bound.
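If you want to play with this yourself, here is a minimal R sketch of the repeated-sampling idea. The actual population, seed, and plotting code used in these notes are not shown here, so the N(20, 3) population below is an assumption made for illustration (it is consistent with the μ = 20 and the 13.76/26.14 bounds quoted above):

```r
# A minimal sketch of repeated random sampling.
# Assumption: a hypothetical population of 10,000 values with mean 20 and
# standard deviation 3 (the actual population from these notes is not shown).
set.seed(1150)
pop <- rnorm(10000, mean = 20, sd = 3)

# Four random samples of size 500: their means cluster tightly around 20
sapply(1:4, function(i) mean(sample(pop, size = 500)))

# Four random samples of size 10: their means bounce around far more
sapply(1:4, function(i) mean(sample(pop, size = 10)))

# The most extreme (non-random) selections bound where a mean of 500 can fall
mean(sort(pop, decreasing = TRUE)[1:500])  # mean of the 500 largest
mean(sort(pop)[1:500])                     # mean of the 500 smallest
```

For a population like this one, those last two means land in the same ballpark as the 26.14 and 13.76 quoted above.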
As we were picking samples at random, and there are $\binom{10000}{500}$ possible samples (which is too large a number for R to calculate!), we could spend our entire lifetime picking samples and it's unlikely we would ever happen to pick the 500 smallest or the 500 largest by random chance alone. This means we could reasonably narrow the bounds on our expectations.

Analogy: The analogy I like to give is that if you told me a person was standing outside of my front door, I would not be surprised if they were only 5 feet tall or if they were 6 feet tall. But if you told me 100 people were outside of my door (maybe a very organized group of Christmas carolers), I would be really confused if they had an average height of 5 feet or an average height of 6 feet. We have an intuitive understanding that what we expect to see in a single individual is different than what we expect to see in a large group of individuals. This is the idea of a sampling distribution.

Let's look at what happens when we repeatedly take samples of size 100 from the population, calculate the sample mean, and put our results in a histogram:

[Figure: histogram of the sample means from 10,000 samples of size 100]

This is the Central Limit Theorem in action! The distribution of all the sample means we see in 10,000 samples follows a normal distribution:

- It is bell shaped.
- It is centered around the true value of the mean (μ = 20).
- There are an equal number of samples above and below the mean, and they fall in a symmetric pattern.
- 95% of all samples had sample means within approximately 2 standard deviations, $\pm 2 \cdot \frac{3}{\sqrt{100}}$, of the population mean (μ = 20).

QUESTION: What is it called when the mean for the distribution of the sample mean is equal to the population mean? (Because this is a symmetric distribution, the center is also the mean.)

QUESTION: When we take random samples, usually our sample statistic is a reasonable estimate of the population mean. What are some reasons why our estimate might be drastically off?

While graphing the distribution of sample means clearly shows us that the center is the true value of the population mean, we cannot in practice make this distribution in order to do inference. Taking 10,000 samples of size 100 is a hugely cumbersome task. Many times even 100 observations will be costly (making the sampling frame, taking the sample, physically finding and contacting the people in the sample, paying the people who will collect the data), so repeatedly taking the sample 100s or 1000s of times is not realistic.

Main Idea

Q: If every sample is different, and we see this much variation from sample to sample, and we are limited to taking one sample, how can we use the mean of one sample to estimate the mean of the whole population?

A: Approximately 95% of x̄'s are within $2\frac{\sigma}{\sqrt{n}}$ of μ. This means approximately 95% of μ's are within $2\frac{\sigma}{\sqrt{n}}$ of x̄. If, every time we take a sample, we create an interval by going $\bar{x} \pm 2\frac{\sigma}{\sqrt{n}}$, then approximately 95% of the time we will create an interval that contains the value we are trying to estimate, μ. This is a confidence interval.

Things to note:

- We know that every sample is different, so it's silly to claim the population mean is equal to x̄.
- We can't make intervals where we are 100% confident without making them so large that they are useless. (Garfield knows what's what (https://i.pinimg.com/originals/3b/d2/c5/3bd2c5d56e49479d019034050d5f65e3.png))
- We have to live with an error rate, which is generally accepted among the scientific community to be 5%.
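As a concrete check on that 95% claim, here is a minimal simulation sketch in the same spirit as above. It again assumes the hypothetical N(20, 3) population; the object names and seed are illustrative, not taken from these notes:

```r
# Coverage sketch: build xbar +/- 2*sigma/sqrt(n) intervals from many random
# samples and count how often they capture mu.
# Assumption: same hypothetical N(20, 3) population of 10,000 values as above.
set.seed(1150)
pop   <- rnorm(10000, mean = 20, sd = 3)
mu    <- mean(pop)   # the population mean we are pretending not to know
sigma <- sd(pop)     # treated as known for this illustration
n     <- 100

covered <- replicate(10000, {
  xbar  <- mean(sample(pop, size = n))
  lower <- xbar - 2 * sigma / sqrt(n)
  upper <- xbar + 2 * sigma / sqrt(n)
  lower <= mu & mu <= upper
})
mean(covered)  # proportion of intervals containing mu; should be close to 0.95
```

Roughly 95% of these intervals capture μ, which is exactly the long-run guarantee that the confidence interval construction makes.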
We will never know when we've made one of the 5% of intervals that don't contain μ, but we sleep better at night knowing that, over our careers, we'll mostly make intervals that are "correct".

1.4 Summary

- Statistics makes use of data from samples to make inferences about how populations behaved, behave, or will behave in the future.
- Samples should be taken randomly and be of sufficiently large size.
- In the long run, sample means will, on average, equal the population mean.
- We can never be certain whether our sample mean is "close" to the population mean, but we can use confidence intervals to quantify how far away from the true value our estimate is likely to be.

1.5 References for Reading

- Try this applet (https://digitalfirst.bfwpub.com/stats_applet/stats_applet_4_ci.html) to test out taking confidence intervals for various sample sizes.
- See my STAT 1150 Lab 2 handout for further discussion of sampling distributions of other statistics.
- Read section 3.4 of the text.