STATISTICS FOR THE SOCIAL SCIENCES: A
SESSION 8: PROBABILITY AND SAMPLES

INTRODUCTION
On a piece of paper, rate how cute you think Delta is
• I am going to calculate:
  – The population mean
  – The mean of each sample I draw
• Key things to notice:
  1. The mean of each sample ≠ the mean of the population
  2. The sample means ≠ each other
• On point 1: the discrepancy between a sample mean and the population parameter is called sampling error

DISTRIBUTION OF SAMPLE MEANS
• A THEORETICAL distribution: what we would get if we collected all the possible random samples of a given size from a population and plotted their means as a distribution
• A sampling distribution: a distribution of statistics (remember, statistics refer to samples) obtained by selecting all the possible samples of a specific size from a population
  – We could plot means, or standard deviations, or variances
  – NB: we are plotting information from each sample into a new distribution

DISTRIBUTION OF SAMPLE MEANS: WHY??
• The ultimate reason is inferential statistics, where we have to include the concept of error in our calculations to account for the fact that our sample statistics do not exactly match the population parameters.
• We are going to go on a logical journey:
  – What the distribution of sample means (DoSM) is
  – Characteristics of the DoSM
  – Central limit theorem
  – Standard error of M
  – Inferential stats and the DoSM
  – Z-scores and probability for sample means

SAMPLES AND DISTRIBUTIONS
• So far we have spoken about populations and samples; the distribution of our sample is called a sample distribution

SAMPLE DISTRIBUTION
• Is practical
• Created by drawing a sample of scores from a population and plotting each of the scores onto a frequency distribution table
• Plots individual scores drawn from a single sample of the population

SAMPLING DISTRIBUTION
• Is theoretical
• Created by drawing all possible random samples from a population, calculating a statistic (e.g. mean) for each of these samples, and then plotting each of the sample statistics onto a frequency distribution table
• Plots individual sample statistics calculated from multiple samples drawn from a population

CHARACTERISTICS OF THE DISTRIBUTION OF SAMPLE MEANS
1. The distribution of sample means almost always approximates a normal distribution
  – It should make sense that most of the sample means will cluster around the population mean (μ), and that it is relatively rare to find sample means that differ greatly from it
  – This happens even when the population itself is not normally distributed, provided the samples are large enough (because #mathematicians)
2. The larger the sample size, the closer the sample means will be to the population mean
  – Stated another way, as sample size increases, the sampling distribution becomes more compact (clustered around the mean)

CENTRAL LIMIT THEOREM
• This theorem provides us with a precise description of the distribution that would be obtained if you selected every possible sample, calculated every sample mean, and constructed the distribution of sample means (again, because #mathematicians)

“For any population with mean μ and standard deviation σ, the distribution of sample means for sample size n will have a mean of μ and a standard deviation of $\sigma/\sqrt{n}$, and will approach a normal distribution as n approaches infinity.”

• Where it says “the distribution of sample means will approach a normal distribution”:
  – In practice the distribution is already close to normal by about n = 30; this is the magic number
• http://onlinestatbook.com/stat_sim/

CENTRAL LIMIT THEOREM: SHAPE OF DOSM
• The shape of the distribution of sample means tends to be nearly perfectly normal if either of the following holds:
  – The population from which the samples are drawn is a normal distribution
  – The number of scores
per sample is relatively large (n ≥ 30)

CENTRAL LIMIT THEOREM: MEAN OF DOSM
• The mean of the distribution of sample means = the population mean (μ)
• The mean of the DoSM is called the expected value of M, as the sample means are “expected” to be near the population mean: $\mu_M = \mu$
• We use the Greek letter μ because the distribution of sample means is a kind of population

CENTRAL LIMIT THEOREM: STANDARD DEVIATION
• The standard deviation of a DoSM is called the standard error of M and is denoted by $\sigma_M$
• Why is it called standard error?
  – It provides an estimate of how much distance is expected, on average, between a sample mean and the population mean
  – Because we would ideally want our sample mean to = our population mean, any deviations are considered “error”
• The standard error describes how spread out the sample means are (variability of the sample means)
  – When it is small, the sample means are close together (clustered around the mean)
  – When it is large, the sample means are spread out (big differences from one mean to another)
• The standard error also measures how well an individual sample mean represents the entire distribution, by telling us how much distance it is reasonable to expect between a sample mean and the overall mean
• The standard error ($\sigma_M$) is calculated:
    $\sigma_M = \frac{\sigma}{\sqrt{n}}$ OR $\sigma_M = \sqrt{\frac{\sigma^2}{n}}$
• The magnitude of the standard error is consequently affected by
  – Sample size
  – Population standard deviation
• Sample size: the greater the sample size, the more accurately the sample mean represents the population mean (as sample size ↑, error ↓)
• Standard deviation: when a sample consists of a single score (n = 1), $\sigma_M = \sigma$; as n increases, $\sigma_M$ shrinks below σ

THE THREE DISTRIBUTIONS

Z-SCORES AND PROBABILITY OF SAMPLE MEANS
• The process for calculating z-scores of sample means is the same as the calculation for individual scores, except we infer some information:
  – The distribution is normal
  – The distribution’s mean = population mean
  – Standard error can be calculated

Z-SCORES AND PROBABILITY OF
SAMPLE MEANS
• E.g. the population of Stats A scores forms a normal distribution, where μ = 500 and σ = 100. If I take a random sample of n = 16 students, what is the probability that the sample mean will be greater than M = 525?
1. Reconceptualise as a proportions problem
2. Infer the relevant characteristics:
  – Distribution is normal because the population distribution is normal
  – Distribution has a mean of 500 because μ = 500
  – For n = 16, the standard error is:
    $\sigma_M = \frac{\sigma}{\sqrt{n}} = \frac{100}{\sqrt{16}} = \frac{100}{4} = 25$
3. Calculate the z-score for the stipulated mean score:
    $z = \frac{M - \mu}{\sigma_M} = \frac{525 - 500}{25} = \frac{25}{25} = +1$
4. Use the unit normal table to determine the proportion associated with z = +1. The table indicates that 0,1587 (15,87%) of the distribution is in the tail
5. It is therefore relatively unlikely, p = 0,1587, to obtain a random sample of n = 16 students with an average Stats A score > 525.

MORE ABOUT ERROR

SAMPLING ERROR
• Refers to the idea that a sample typically does not provide a perfectly accurate description of its population
• There will therefore be some discrepancy between the mean of a sample and the mean of the corresponding population
• The discrepancy is random

STANDARD ERROR
• A measure of the standard/typical distance between the population mean and a sample mean
• Size of standard error depends on size of samples
• Discrepancy is not random (sample dependent)

[Figure labels: SEM, μ, $\mu_M$]

STANDARD ERROR & INFERENTIAL STATS
• When we move into inferential stats calculations, we often use the measure of standard error in our calculations. E.g.:
    $z = \frac{X - \mu}{\sigma}$    $t = \frac{M - \mu}{s_M}$
• We can also use data from two separate samples to draw inferences about the mean difference between two populations / treatment conditions

EXAMPLE
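The Stats A problem from this session (μ = 500, σ = 100, n = 16, M = 525) can be checked numerically. What follows is a minimal Python sketch and not part of the original slides: the function names are made up for illustration, and the simulation assumes a normally distributed population, as the worked problem does. It computes the standard error and tail probability from the formulas, then draws many random samples to show that the simulated distribution of sample means behaves roughly as the central limit theorem describes.

```python
import math
import random
from statistics import NormalDist, stdev


def standard_error(sigma: float, n: int) -> float:
    """Standard error of M: sigma divided by the square root of n."""
    return sigma / math.sqrt(n)


def prob_sample_mean_above(m: float, mu: float, sigma: float, n: int) -> float:
    """P(sample mean > m), treating the DoSM as normal with mean mu and SD sigma_M."""
    z = (m - mu) / standard_error(sigma, n)
    # Upper-tail proportion of the unit normal distribution beyond z.
    return 1.0 - NormalDist().cdf(z)


def simulate_sample_means(mu: float, sigma: float, n: int,
                          num_samples: int, seed: int = 0) -> list[float]:
    """Draw num_samples random samples of size n from a normal population
    (an assumption of this sketch) and return the mean of each sample."""
    rng = random.Random(seed)
    means = []
    for _ in range(num_samples):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(sum(sample) / n)
    return means


if __name__ == "__main__":
    mu, sigma, n, m = 500, 100, 16, 525

    print("standard error:", standard_error(sigma, n))               # 25.0
    print("p(M > 525):", round(prob_sample_mean_above(m, mu, sigma, n), 4))  # 0.1587

    # An approximation to the (theoretical) distribution of sample means.
    means = simulate_sample_means(mu, sigma, n, num_samples=20_000, seed=1)
    print("mean of sample means:", sum(means) / len(means))          # should be close to mu
    print("SD of sample means:", stdev(means))                       # should be close to sigma_M
    print("share above 525:", sum(x > m for x in means) / len(means))
```

The simulated values will not match the theoretical ones exactly, because 20 000 samples is still far short of "all possible samples"; the gap between them shrinks as the number of simulated samples grows.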