STAM4000 Quantitative Methods
Week 5 Sampling distributions
Kaplan Business School (KBS), Australia

COMMONWEALTH OF AUSTRALIA
Copyright Regulations 1969
WARNING
This material has been reproduced and communicated to you by or on behalf of Kaplan Business School pursuant to Part VB of the Copyright Act 1968 (the Act). The material in this communication may be subject to copyright under the Act. Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act. Do not remove this notice.

Week 5 Sampling distributions: Learning Outcomes
#1 Recognise when to use a census or a sample
#2 Examine sampling distributions of the sample mean, $\bar{X}$, read as “X bar”
#3 Examine sampling distributions of the sample proportion, $\hat{P}$, read as “P hat” or “P cap”

Why does this matter?
We need sampling distributions for statistical inference: using one sample to infer, or draw conclusions about, the entire population.

#1 Recognise when to use a census or a sample
A census is the process of collecting information on items or individuals in the entire population.
Sampling is the process of collecting information on only a subset of the population.
Advantages of sampling versus a census:
• Can save time, money and resources.
• Does not destroy all of the product, as a census may do in a research process or quality control process.
• Is practical: accessing the entire population is often impossible.
• In real life, calculating the parameters of populations (population measurements) is often prohibitive because populations can be very large.
What is best: a census or sampling? See what the Bureau of Statistics has to say about census and samples: http://www.abs.gov.au/websitedbs/a3121120.nsf/home/statistical+language++census+and+sample

#1 Parameters and statistics
Descriptive measures calculated from a population of data are called population parameters.
Descriptive measures calculated from a sample of data are called sample statistics.
In the simplest of terms, statistical inference is when we use a sample statistic to infer or draw conclusions about a population parameter.
Recall, there are two branches of statistics:
i) Descriptive statistics: analysis done on a sample of data to describe just that sample.
ii) Inferential statistics: analysis done on a sample of data to infer or draw conclusions about the entire population.
A number of factors determine which statistical technique should be used, but two of these are especially important:
• Data type: the type of data being measured
• Problem objective: the purpose of the statistical inference

#1 Sampling distributions
So far, we have seen a continuous random variable, X, with N population values $x_1, x_2, \dots, x_N$.
The population parameter is usually unknown and fixed.
We can take a sample of size n, with values $x_1, x_2, \dots, x_n$, and use this sample to calculate a statistic.
If we picture taking all possible samples of the same size n from the population and calculating the same statistic for each sample, we create a sampling distribution of the statistic.
The sampling distribution of the statistic is the tool that tells us how close the statistic is to the parameter.
See video at https://www.youtube.com/watch?v=olK80ngCbXc
A sampling distribution tells us which outcomes we should expect for some sample statistic.
Don’t confuse the distribution of a sample with the sampling distribution:
• With a sample from a population, you can find descriptive statistics to summarize just that sample.
• With a sampling distribution, we are thinking of all the possible values that a statistic can take, based on all possible samples of the same size n. The sampling distribution is used to describe how the statistic varies.

#2 Examine sampling distributions of the sample mean, $\bar{X}$
In Week 4, we learned about Z, the standard normal variable, $Z \sim \text{Normal}(0, 1)$.
We also learned about a normally distributed random variable, X, where $X \sim \text{Normal}(\mu, \sigma)$:
μ = population mean
σ = population standard deviation

The sampling distribution of the sample mean, $\bar{X}$, is the collection of all possible sample means, $\bar{x}_1, \bar{x}_2, \dots$, for random samples taken from a population, each based on the same sample size, n.
Two ways of creating a sampling distribution of the mean, $\bar{X}$:
1. Draw samples from the population, calculate the mean, $\bar{x}$, for each, then represent the distribution (graph or table).
2. Use the laws of probability and expected value (long run average) to derive the distribution.
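The first of the two ways above (draw samples, calculate each mean, then summarise the results) can be sketched as a small simulation. This is only an illustration and not part of the unit materials: the population values, the sample size of 10, the 5000 repetitions and the function name `simulate_sample_means` are all assumptions chosen for the example.

```python
import random
import statistics


def simulate_sample_means(population, n, num_samples, seed=0):
    """Draw num_samples random samples of size n (with replacement)
    and return the mean of each sample."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        statistics.mean(rng.choices(population, k=n))
        for _ in range(num_samples)
    ]


# Hypothetical population: the whole numbers 1 to 100 (an assumption).
population = list(range(1, 101))
sample_means = simulate_sample_means(population, n=10, num_samples=5000)

print("Population mean:      ", statistics.mean(population))
print("Mean of sample means: ", round(statistics.mean(sample_means), 2))
print("Stdev of sample means:", round(statistics.pstdev(sample_means), 2))
```

With many repetitions, the mean of the simulated sample means should sit close to the population mean, while their spread should be noticeably smaller than the spread of the population itself.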
The sampling distribution of the sample mean is the tool that tells us how close a sample mean, $\bar{x}$, is to the population mean, μ.

Week 4: with X, the values were $x_1, x_2, \dots, x_n$.
Week 5: we now have $\bar{X}$, whose values are $\bar{x}_1, \bar{x}_2, \dots$
This is so we can have a normally distributed random variable and use the corresponding statistical tables that require normality, e.g. Z tables.

#2 Properties of the sampling distribution of the mean, $\bar{X}$
Recall, if we have a continuous random variable, X, then the
• population mean of X is μ
• population standard deviation (stdev) of X is σ
Now, we have the sampling distribution of the sample mean, $\bar{X}$:
• The population mean of $\bar{X}$ is $\mu_{\bar{X}} = \mu$. The population mean of all the sample means is the population mean, μ, of X.
• The population standard deviation (or population standard error) of $\bar{X}$ is $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$, which is smaller than the stdev of X whenever n > 1.
Note that the standard deviation of $\bar{X}$ decreases as the sample size increases. Why? As we increase n, we bring more of the population's characteristics into our samples. This increases the information we have, which decreases variation/dispersion/deviation.
How about the SHAPE of $\bar{X}$? Let's do an illustration first.
Note: the population standard deviation of the sampling distribution of the sample mean is usually called the “standard error of the mean”. This is not an error in terms of a mistake. This is a standard error in terms of standard variability.
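To see how the standard error of the mean shrinks as n grows, the formula σ/√n can be evaluated for a few sample sizes. The sketch below is illustrative only: the value σ = 6 and the sample sizes are assumptions, and `standard_error` is a hypothetical helper name.

```python
import math


def standard_error(sigma, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


sigma = 6.0  # hypothetical population standard deviation (an assumption)
for n in (1, 4, 25, 100):
    print(f"n = {n:>3}: standard error = {standard_error(sigma, n):.2f}")
```

Because n sits under a square root, quadrupling the sample size only halves the standard error.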
#2 Illustration - developing a sampling distribution
Let X be a random variable of ages of individuals (years).
Say that a population of size N = 4 (N = population size, in this context) has the following values: 18, 20, 22, 24 in years.
What does X look like? Each value occurs only once, so each has a frequency of one and a relative frequency of $\frac{1}{4}$ = 0.25 or 25%. X is UNIFORM.
A uniform distribution is a distribution where each value in the data set has the same frequency. The distribution is considered as having no mode.
[Figure: Histogram of X; horizontal axis: X, Age (years); vertical axis: P(X)]
As we have a population of data (formulae are from Week 2):
• population mean, $\mu = \frac{\sum x}{N} = \frac{18 + 20 + 22 + 24}{4} = 21$ years
• population standard deviation, $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}} = \sqrt{\frac{(18 - 21)^2 + (20 - 21)^2 + (22 - 21)^2 + (24 - 21)^2}{4}} = \sqrt{\frac{20}{4}} = \sqrt{5} = 2.236$ years

#2 Illustration continued …
Say that a population of size N = 4 has the following values: 18, 20, 22, 24 in years.
• Now, consider all possible samples of size n = 2 (n = sample size).
• If sampling with replacement, there are 16 possible samples, EACH OF SIZE n = 2.
• With our 16 samples, we can find the mean, $\bar{x}$, for each sample; we now have 16 $\bar{x}$'s.

Samples of size n = 2 (rows: 1st observation; columns: 2nd observation)
1st \ 2nd | 18 | 20 | 22 | 24
18 | 18, 18 so $\bar{x}$ = 18 | 18, 20 so $\bar{x}$ = 19 | 18, 22 so $\bar{x}$ = 20 | 18, 24 so $\bar{x}$ = 21
20 | 20, 18 so $\bar{x}$ = 19 | 20, 20 so $\bar{x}$ = 20 | 20, 22 so $\bar{x}$ = 21 | 20, 24 so $\bar{x}$ = 22
22 | 22, 18 so $\bar{x}$ = 20 | 22, 20 so $\bar{x}$ = 21 | 22, 22 so $\bar{x}$ = 22 | 22, 24 so $\bar{x}$ = 23
24 | 24, 18 so $\bar{x}$ = 21 | 24, 20 so $\bar{x}$ = 22 | 24, 22 so $\bar{x}$ = 23 | 24, 24 so $\bar{x}$ = 24
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) – 9781442549272/Berenson/Business Statistics /2e

Note:
• Sampling with replacement occurs when we sample (select) an individual/item from the population, note its value, then return it to the population before selecting the next item/individual.
• Sampling without replacement is when we select an item/individual from the population and do not return it to the population before selecting the next item/individual. When sampling without replacement, we must be careful that the population size is large enough that sampling will not distort the proportions of characteristics/values in the population. See more on this later.
“Sampling with replacement” refers to the process of taking the first sample from the population, recording the values, then replacing the values, before taking the second sample from the population. This process is then repeated multiple times.
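The enumeration described above can be reproduced in a few lines of Python. This is a sketch for checking the illustration by hand rather than a required method: it lists every ordered sample of size 2 drawn with replacement from the four ages, then tallies the sample means. The variable names are assumptions made for the example.

```python
from collections import Counter
from itertools import product
import statistics

population = [18, 20, 22, 24]  # ages in years, from the illustration
n = 2

# All ordered samples of size n drawn with replacement: 4 ** 2 = 16 of them.
samples = list(product(population, repeat=n))
sample_means = [sum(sample) / n for sample in samples]

# Frequency and relative frequency of each sample mean.
counts = Counter(sample_means)
for value in sorted(counts):
    print(f"x-bar = {value:g}: frequency {counts[value]}, "
          f"relative frequency {counts[value] / len(samples):.4f}")

mean_of_means = statistics.mean(sample_means)
standard_error = statistics.pstdev(sample_means)  # divides by 16, not 15
print("Mean of the sample means:", mean_of_means)
print("Standard deviation of the sample means:", round(standard_error, 3))
print("sigma / sqrt(n):", round(statistics.pstdev(population) / n ** 0.5, 3))
```

The last two printed values should agree, since the standard deviation of all 16 sample means is the same quantity as σ/√n for this population.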
#2 Illustration continued …
Sample Means Distribution, $\bar{X}$
Sample mean | Frequency | P($\bar{X}$)
18 | 1 | $\frac{1}{16}$ = 0.0625
19 | 2 | 0.125
20 | 3 | 0.1875
21 | 4 | 0.25
22 | 3 | 0.1875
23 | 2 | 0.125
24 | 1 | 0.0625
Total | 16 | 1
A probability is like a long run relative frequency: $\frac{\text{frequency}}{\text{TOTAL frequency}}$.
[Figure: Histogram of sample mean ages; horizontal axis: Sample mean (years); vertical axis: P($\bar{X}$)]
$\bar{X}$ is unimodal and symmetric, with a shape that already resembles a normal distribution. Previously, we noted that X had a uniform distribution.
Recall, to have a valid probability distribution:
• 0 ≤ P(X) ≤ 1
• ∑P(X) = 1

#2 Illustration continued - comparing summary measures
Population distribution, X:
$\mu = \frac{18 + 20 + 22 + 24}{4} = 21$ years
$\sigma = \sqrt{\frac{(18 - 21)^2 + \dots + (24 - 21)^2}{4}} = 2.236$ years
Sampling distribution of the mean, $\bar{X}$ (summing over all 16 sample means):
$\mu_{\bar{X}} = \frac{\sum \bar{x}}{16} = \frac{18 + 19 + 19 + \dots + 24}{16} = 21$ years
$\sigma_{\bar{X}} = \sqrt{\frac{\sum (\bar{x} - \mu_{\bar{X}})^2}{16}} = \sqrt{\frac{(18 - 21)^2 + (19 - 21)^2 + (19 - 21)^2 + \dots + (24 - 21)^2}{16}} = 1.581$ years
[Figure: Histogram of ages, X, beside Histogram of sample mean ages, $\bar{X}$]

Alternative calculation for the population standard error of the sampling distribution of the sample mean:
For our population, N = 4. For our samples, n = 2.
$\sigma_{\bar{X}} = \frac{\text{population standard deviation}}{\sqrt{n}} = \frac{\sqrt{\frac{\sum (x - \mu)^2}{N}}}{\sqrt{n}} = \frac{\sqrt{\frac{(18 - 21)^2 + (20 - 21)^2 + (22 - 21)^2 + (24 - 21)^2}{4}}}{\sqrt{2}} = \frac{2.236}{1.414} = 1.581$ years

#2 Central Limit Theorem
The fact that the histogram of sample means on the previous slides appears to be bell-shaped (Normal) is a consequence of the Central Limit Theorem.
Hence, for sufficiently large sample sizes (approximately n ≥ 30), the sampling distribution of the sample mean is approximately normal, even if the population distribution, X, is non-normal.
[Figure: e.g. Distribution of X (uniform distribution, X = 1 to 6) beside Distribution of $\bar{X}$ (normal distribution)]
More on the shape of $\bar{X}$:
1. Sampling from Normal populations: if X ~ Normal, then $\bar{X}$ ~ Normal. This is NOT the Central Limit Theorem.
2. Sampling from Non-Normal populations: if X is NOT Normal BUT n ≥ 30 then, by the CENTRAL LIMIT THEOREM, $\bar{X}$ is approximately Normal.

#2 The Central Limit Theorem continued …
The Central Limit Theorem: $\bar{X}$ approaches normality as the sample size, n, increases, regardless of the shape of the population distribution, X. Usually, at n = 30, $\bar{X}$ is approximately Normal.
[Figure: Normal, Uniform, Skewed and Bimodal populations of X, each shown with the distribution of $\bar{X}$ for n = 2 and n = 30]
The sampling distribution of any mean approaches a normal distribution as the sample size, n, grows. This is true regardless of the shape of the population distribution, X.
However, if the population distribution is very skewed, it may take a sample size of dozens or even hundreds of observations for the Normal model to work well.
The Central Limit Theorem applies to the sampling distribution of the sample mean, $\bar{X}$, as well as the sampling distribution of the sample proportion, $\hat{P}$.
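The effect described by the Central Limit Theorem can be seen in a short simulation. This is a sketch under stated assumptions and not part of the unit materials: the skewed population is an exponential distribution with mean 1, the number of repetitions is 5000, and `skewness` and `simulate_means` are hypothetical helper names. Skewness close to 0 indicates a symmetric shape, as a normal curve has.

```python
import random
import statistics


def skewness(values):
    """Average cubed z-score: about 0 for a symmetric distribution."""
    centre = statistics.fmean(values)
    spread = statistics.pstdev(values)
    return sum(((v - centre) / spread) ** 3 for v in values) / len(values)


def simulate_means(n, num_samples, rng):
    """Means of num_samples samples of size n from a right-skewed population."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(num_samples)
    ]


rng = random.Random(42)  # fixed seed so the run is repeatable
means_n2 = simulate_means(2, 5000, rng)
means_n30 = simulate_means(30, 5000, rng)

print("Skewness of sample means, n = 2: ", round(skewness(means_n2), 2))
print("Skewness of sample means, n = 30:", round(skewness(means_n30), 2))
```

The sample means for n = 30 should be much less skewed than those for n = 2, although for a population this skewed some asymmetry remains even at n = 30.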
As n increases, the population standard error (or standard deviation) of $\bar{X}$, $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$, gets smaller and smaller. As n increases, we capture more population characteristics INTO our sample, so our sample has LESS sampling error (variability).

#2 Conditions to check before using $\bar{X}$
Random Sample Condition:
• The data values must be sampled randomly.
• This is to try and ensure that the samples are independent and representative of the population. If random, they should be INDEPENDENT, and hopefully are REPRESENTATIVE of the population.
10% Condition:
• If sampling without replacement, the sample size, n, should be no more than 10% of the population size.
• We can also satisfy this condition by checking whether the population size is at least n × 10.
• This is to ensure that, if we are sampling without returning items/individuals to the population before sampling the next item/individual, the proportions of characteristics in the population are not DISTORTED by the non-replacement and we can assume independence.
Normal or Large Enough Sample Condition:
• If the population is normal, then a small n is ok.
• If the population is non-normal or unknown, we need n ≥ 30 to apply the Central Limit Theorem.
o Note: if the population is very skewed, we may need a sample size greater than 30 for our sampling distribution to be approximately normal by the Central Limit Theorem.
If any of the conditions are not satisfied, this should be noted, with a statement that we proceed with caution.
10% condition: “sampling without replacement” is the process of taking the 1st sample from the population and NOT returning the sampled values before taking a 2nd sample from the population, then repeating the process many times.

#2 Example
Suppose that the population of coffee drinkers in a city spend an average of μ = $50 per month on coffee, with a standard deviation of σ = $6. Assume that the amount spent on coffee, X, is normally distributed. If a random sample of n = 25 coffee drinkers is chosen from this population, answer the following:
a) Check the conditions for $\bar{X}$.
• Random Sample Condition: We are told the sample was randomly chosen.
• 10% Condition: If sampling without replacement, 25 coffee drinkers is no more than 10% of the population of coffee drinkers. Or, we could say, the population size of coffee drinkers is at least 25 × 10, or 250. As we have a city, we can assume this to be true.
• Normal or Large Enough Sample Condition: We are told to assume the amount spent on coffee, X, is normally distributed, which tells us that $\bar{X}$ ~ N. (Note: this is NOT the Central Limit Theorem.)
As the conditions are satisfied, we can use the Z tables.
b) What is the mean of the sampling distribution of the mean, $\bar{X}$?
$\mu_{\bar{X}} = \mu = \$50$
c) What is the standard error of the sampling distribution of the mean, $\bar{X}$?
$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{6}{\sqrt{25}} = \$1.2$

#2 Probabilities and $\bar{X}$
We can take samples and ask probability type questions about the mean, $\bar{X}$, in the same way we solved problems related to a single random variable, X.
The new Z formula is
$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
or we can write this as
$Z = \frac{\bar{x} - \mu_{\bar{X}}}{\sigma_{\bar{X}}}$
such that
$\bar{x}$ = sample mean
μ = population mean (since $\mu_{\bar{X}} = \mu$)
σ = population standard deviation of X
n = sample size
$\sigma_{\bar{X}}$ = population standard error (standard deviation) of $\bar{X}$
Note: when the standard deviation of the sampling distribution of a statistic, in this case $\bar{x}$, is estimated from data, the resulting estimate is called a standard error (SE).

General structure of a Z value, Z score or standardized score:
$Z = \frac{\text{value of interest} - \text{centre}}{\text{spread}}$
Last week, in Week 4, we transformed an X value to a Z value using the formula:
$Z = \frac{X - \mu}{\sigma}$
Now, in Week 5, we transform an $\bar{X}$ into a Z using:
$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$

#2 4 Steps to find probabilities about $\bar{X}$
Step 1: Check relevant conditions, unless otherwise stated
Step 2: Convert $\bar{X}$ to Z, using $Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
Step 3: Sketch a curve(s) for the area (probability)
Step 4: Find the area using Z tables

#2 Example
Suppose that the population of coffee drinkers spend on average μ = $50 per month on coffee with a standard deviation of σ = $6. Assume that the amount spent on coffee is normally distributed and the conditions are satisfied.
a) Suppose a random sample of n = 25 coffee drinkers is chosen from this population. What is the probability that the average monthly spend, $\bar{X}$, on coffee is less than $47?
b) Suppose a random sample of 25 coffee drinkers is chosen from this population. What is the probability that the average monthly spend, $\bar{X}$, on coffee is between $50 and $53?
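Steps 2 and 4 above can be checked with a few lines of Python for the coffee example. This is only a sketch for verifying hand calculations: it uses `statistics.NormalDist` from the Python standard library in place of the printed Z tables (an assumption, since the unit works from tables), and `z_for_sample_mean` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def z_for_sample_mean(x_bar, mu, sigma, n):
    """Z = (x_bar - mu) / (sigma / sqrt(n))."""
    return (x_bar - mu) / (sigma / sqrt(n))


mu, sigma, n = 50, 6, 25        # values from the coffee example
standard_normal = NormalDist()  # mean 0, standard deviation 1

# a) P(X-bar < 47)
z_a = z_for_sample_mean(47, mu, sigma, n)
p_a = standard_normal.cdf(z_a)

# b) P(50 < X-bar < 53) = P(Z < z_upper) - P(Z < z_lower)
z_lower = z_for_sample_mean(50, mu, sigma, n)
z_upper = z_for_sample_mean(53, mu, sigma, n)
p_b = standard_normal.cdf(z_upper) - standard_normal.cdf(z_lower)

print(f"a) Z = {z_a:.2f}, probability = {p_a:.4f}")
print(f"b) Z from {z_lower:.2f} to {z_upper:.2f}, probability = {p_b:.4f}")
```

Software works with unrounded Z values, so results can differ slightly in the last decimal place from answers read off a Z table.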
#2 Example solution
a) Want P($\bar{X}$ < 47). How to type this for ONLINE ASSESSMENT: P(X bar < 47)
Step 1: Check conditions. We are told the conditions are satisfied.
Step 2: Find the z value.
$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{47 - 50}{6 / \sqrt{25}} = \frac{-3}{1.2} = -2.50$
Step 3: Sketch a curve(s).
[Figure: $\bar{X}$ curve centred at μ = 50 with the area to the left of 47 shaded; matching Z curve centred at 0 with the area of 0.0062 to the left of −2.50 shaded]
Step 4: Use the Z tables.
P($\bar{X}$ < 47) = P(Z < −2.50) = 0.0062

If asked to “SHOW WORKINGS” with ONLINE ASSESSMENT:
• Substitute into the formula
• Give the Z value
• Give the probability, being careful of LAYOUT
a) P(X bar < 47)
Z = (47 – 50)/(6/sqrt(25))
Z = -2.50
P(Z < -2.50) = 0.0062

#2 Example solution continued
b) Want P(50 < $\bar{X}$ < 53).
Step 1: Check conditions. We are told the conditions are satisfied.
Step 2: Find the z values.
For 50: $z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{50 - 50}{6 / \sqrt{25}} = 0$
For 53: $z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{53 - 50}{6 / \sqrt{25}} = 2.5$
Step 3: Sketch a curve.
[Figure: $\bar{X}$ curve with the wanted area of 0.4938 between 50 and 53; matching Z curve with the area between 0 and 2.5, where the area to the left of 2.5 is 0.9938 and the area to the left of 0 is 0.5]
Step 4: Use the Z tables.
P(50 < $\bar{X}$ < 53) = P(0 < Z < 2.5) = P(Z < 2.5) − P(Z < 0) = 0.9938 − 0.5 = 0.4938

#2 Exercise
A petrol station is open, on average, μ = 100 hours per week, with a standard deviation of σ = 12 hours per week. The opening hours are not normally distributed. A random sample of n = 36 petrol stations is taken.
a) Check conditions.
b) What is the probability that the mean of this sample is less than 105 hours?
c) What is the probability that the mean of this sample is above 102.2 hours per week?

#2 Exercise solution
a) Check conditions for $\bar{X}$.
• Random Sample Condition: We are told the sample was randomly taken.
• 10% Condition: If sampling without replacement, we can say that 36 petrol stations is no more than 10% of all petrol stations. Or, we can say that we need at least 36 × 10 = 360 petrol stations in the population. Assume this is true.
• Normal or Large Enough Sample Condition: We are told the opening hours are not normally distributed. However, as n = 36 > 30, by the Central Limit Theorem, $\bar{X}$ ~ N (approximately).
As the conditions are satisfied, we can use Z tables to find probabilities.

#2 Exercise solution continued
b) Want P($\bar{X}$ < 105). Conditions checked earlier.
$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{105 - 100}{12 / \sqrt{36}} = 2.5$
[Figure: $\bar{X}$ curve centred at 100 with the area of 0.9938 to the left of 105 shaded; matching Z curve with the area to the left of 2.5 shaded]
P($\bar{X}$ < 105) = P(Z < 2.5) = 0.9938

#2 Exercise solution continued
c) Want P($\bar{X}$ > 102.2). Conditions checked earlier.
$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{102.2 - 100}{12 / \sqrt{36}} = 1.10$
[Figure: $\bar{X}$ curve centred at 100 and Z curve centred at 0; total area = 100% or 1; the area of 0.8643 to the left of 102.2 (Z = 1.1) is NOT wanted; the wanted area to the right is 0.1357]
P($\bar{X}$ > 102.2) = P(Z > 1.10) = 1 − P(Z < 1.10) = 1 − 0.8643 = 0.1357

#2 Reverse normal problems with $\bar{X}$
Reverse normal problems with $\bar{X}$ are like those with X.
Rearrange the correct Z formula, making the unknown variable the subject of the equation:
$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
E.g., if solving for $\bar{X}$, rearrange this Z formula to get
$\bar{X} = \mu + Z \frac{\sigma}{\sqrt{n}}$
Remember the units for $\bar{X}$ are the same as the units for X.

#2 Example
Suppose that a population of coffee drinkers spend on average μ = $50 per month on coffee with a standard deviation of σ = $6.
Assume that the amount spent on coffee is normally distributed. If a random sample of n = 25 coffee drinkers is chosen from this population, what is the average monthly spend on coffee, $\bar{x}$, at or below which the lowest 10% of sample means fall?
The conditions were checked earlier.
10% = 0.10 ≈ 0.1003, so Z = −1.28
$\bar{x} = \mu + Z \frac{\sigma}{\sqrt{n}}$
$\bar{x} = 50 - 1.28 \frac{6}{\sqrt{25}}$
$\bar{x} = 50 - 1.28(1.2)$
$\bar{x} = \$48.46$
[Figure: $\bar{X}$ curve centred at 50 with the unknown $\bar{x}$ marked in the left tail; matching Z curve centred at 0 with −1.28 marked]
Using the Z tables “backwards”, we try to find the probability of 0.10 in the body of the first set of Z tables (those with the left tail shaded), as we are interested in the left tail of the curve. We find the closest value to 0.10 in the table is 0.1003. From the probability of 0.1003, we work to the border of the Z table and find the corresponding Z value is −1.28.

Steps for Reverse Normal:
i) Sketch a curve and place the known values.
ii) Use the Z tables in REVERSE (inside out): look for a probability in the BODY of the Z tables, then work to the border for the Z value (respecting the sign, + or −).
iii) Substitute into the relevant formula.
iv) Solve for $\bar{X}$.

#2 Exercise
A petrol station is open, on average, μ = 100 hours per week, with a standard deviation of σ = 12 hours per week. The opening hours are not normally distributed. A random sample of n = 36 petrol stations is taken. Assume the conditions are satisfied. What is the minimum average number of hours a petrol station is opened for the 0.6% of longest opening hours?

#2 Exercise solution
What is the minimum average number of hours a petrol station is opened for the 0.6% of longest opening hours?
$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
$2.51 = \frac{\bar{x} - 100}{12 / \sqrt{36}}$
$\bar{x} = 105.02$ hours
[Figure: $\bar{X}$ curve centred at 100 and Z curve centred at 0, with the right-tail area of 0.0060 (told: 0.6%) beyond Z = 2.51; total probability under the curve = 1, so the area to the left is 1 − 0.0060 = 0.9940]
Here, we have 0.006 in the right tail of the Z curve. This is equivalent to an area of 1 − 0.006 = 0.994 in the left side of the Z curve.
Using the Z tables “backwards”, we try to find the probability of 0.994 in the body of the second set of Z tables (those with the large left shaded area), as we are interested in the area to the left of a positive Z value. We find the exact probability of 0.9940 in the body of the Z table. From the probability of 0.9940, we work to the border of the Z table and find the corresponding Z value is 2.51.
We want the $\bar{x}$ that has an area of 0.6% in the RIGHT TAIL, because the “longest opening hours” are in the RIGHT TAIL.

#3 Examine sampling distributions of the sample proportion, $\hat{P}$
If the objective is to describe a single population for a categorical variable, the population parameter is the proportion, p, of times that a specific characteristic of interest (success) occurs.
Population:
p = population proportion of success (of interest)
q = population proportion of failure (not of interest), where q = 1 − p
Sample:
$\hat{p}$ = sample proportion of success = p hat = p cap
$\hat{q}$ = sample proportion of failure = q hat = q cap, where $\hat{q} = 1 - \hat{p}$
Note: p is estimated by $\hat{p}$, and $\hat{p} = \frac{x}{n} = \frac{\text{the count of successes in a sample}}{\text{sample size}}$

We have a categorical variable which has different characteristics. Our data is counts.
We are interested in a specific characteristic, which we will label as “success”. All other characteristics are labelled as “failures”. We count all successes in the population. The population parameter of interest is labelled p, the proportion of success in the population. The corresponding sample statistic that estimates p is $\hat{p}$, which may be read as “p hat” or “p cap”.

#3 Sampling distribution of the proportion, $\hat{P}$
Think about the true proportion, p, and the proportion we might expect to get in a random sample, $\hat{p}$. From sample to sample, $\hat{p}$ will vary.
Imagine repeated, independent samples (each of the same size, n), finding $\hat{p}$ for each sample and building the distribution of the sample proportion, $\hat{P}$.
[Figure: Sampling Distribution of the proportion; horizontal axis: $\hat{P}$ from 0 to 1; vertical axis: P($\hat{p}$)]
Note that the sampling distribution of the sample proportion is based on a categorical variable, whereas the sampling distribution of the sample mean is based on a quantitative variable.
Proportions can only lie between 0 and 1 (or 0% and 100%) and are units free, whereas means can lie between −∞ and ∞, with the same units as the original population, X.
Note: Technically, here we are looking at the “sampling distribution of the sample proportion”, but sometimes, for ease, we refer to this as just the “sampling distribution of the proportion”.
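The idea of imagining repeated, independent samples and recording the sample proportion for each can be sketched as a simulation. The values below are assumptions chosen for illustration (a true proportion of 0.3, samples of size 100, 2000 repetitions), and `simulate_sample_proportions` is a hypothetical helper name.

```python
import random
import statistics


def simulate_sample_proportions(p, n, num_samples, seed=0):
    """Return p-hat for each of num_samples random samples of size n,
    where each individual is a "success" with probability p."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    proportions = []
    for _ in range(num_samples):
        successes = sum(1 for _ in range(n) if rng.random() < p)
        proportions.append(successes / n)  # count of successes / sample size
    return proportions


# Hypothetical values (assumptions): true proportion 0.3, samples of 100.
p_hats = simulate_sample_proportions(p=0.3, n=100, num_samples=2000)

print("Smallest p-hat:", min(p_hats))
print("Largest p-hat: ", max(p_hats))
print("Mean of p-hats:", round(statistics.fmean(p_hats), 3))
```

Each simulated value is a count divided by the sample size, and the values vary from sample to sample while clustering around the true proportion.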
#3 Properties of the sampling distribution of the proportion, $\hat{P}$, and the corresponding Z formula
The sampling distribution of the sample proportion, $\hat{P}$, for a categorical variable:
• The population (true) mean of $\hat{P}$ is $\mu_{\hat{p}} = p$.
• The population standard deviation of $\hat{P}$ is $\sigma_{\hat{p}} = \sqrt{\frac{pq}{n}}$ = standard error of $\hat{P}$.
• We standardize $\hat{p}$ to a Z value with the following formula: $Z = \frac{\hat{p} - p}{\sqrt{\frac{pq}{n}}}$
This can be re-written as $Z = \frac{\hat{p} - p}{\sqrt{\frac{p(1 - p)}{n}}}$
Note:
• In the denominator of the Z formula, we use p, not $\hat{p}$. Why? Well, p is the population proportion and is usually fixed and reliable, whereas $\hat{p}$ is the sample proportion and will vary from sample to sample.
• When the standard deviation of the sampling distribution of a statistic (in this class, either $\bar{x}$ or $\hat{p}$) is estimated from data, the resulting estimate is called a standard error (SE).
• With proportions, we work with the decimal form, not the percentage.
• We have the proportion value on the horizontal axis of the $\hat{P}$ curve.
• We have areas or probabilities under the $\hat{P}$ curve.
• We usually work with at least three decimal places when calculating with proportions.
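The Z formula for a sample proportion can be sketched in Python as follows. The numbers are hypothetical and chosen only for illustration (p = 0.40, n = 150, p-hat = 0.45); `z_for_sample_proportion` is an assumed helper name, and `statistics.NormalDist` stands in for the printed Z tables.

```python
from math import sqrt
from statistics import NormalDist


def z_for_sample_proportion(p_hat, p, n):
    """Z = (p_hat - p) / sqrt(p * q / n), with q = 1 - p."""
    q = 1 - p
    return (p_hat - p) / sqrt(p * q / n)


# Hypothetical values (assumptions), all in decimal form, not percentages.
p, n, p_hat = 0.40, 150, 0.45

z = z_for_sample_proportion(p_hat, p, n)
prob_above = 1 - NormalDist().cdf(z)  # P(P-hat > 0.45) = P(Z > z)

print(f"Z = {z:.3f}")
print(f"P(P-hat > {p_hat}) = {prob_above:.4f}")
```

Note that the denominator is built from the population proportion p, not from the sample proportion, matching the first note above.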
#3 4 Steps to find probabilities about $\hat{P}$
Step 1: Check relevant conditions, unless otherwise stated
Step 2: Convert $\hat{p}$ to Z, using $Z = \frac{\hat{p} - p}{\sqrt{\frac{pq}{n}}}$
Step 3: Sketch a curve for the area (probability)
Step 4: Find the area using Z tables

#3 Conditions for the sampling distribution of the proportion
Randomization Condition: The sample is random and representative (without bias).
10% Condition: If sampling without replacement, the sample size, n, must be no larger than 10% of the population. This is to ensure that, if we are sampling without returning items/individuals to the population before sampling the next item/individual, the proportions of characteristics in the population are not affected by the non-replacement.
Success/Failure Condition: The sample size must be big enough so that both the number of “successes”, np, and the number of “failures”, nq, are expected to be at least 10 (i.e., np ≥ 10 and nq ≥ 10). This is related to the Central Limit Theorem, in that the sample size must be large enough to rely on the Normal distribution.
Note: Only when p and q are unknown should we use $\hat{p}$ and $\hat{q}$ in the Success/Failure Condition.
p = population proportion of “success” (of interest)
q = population proportion of “failure” = 1 − p
$\hat{p}$ = sample proportion of “success” (of interest)
$\hat{q}$ = sample proportion of “failure” = 1 − $\hat{p}$
n = sample size
Note: with proportions, we work with the decimal form, not the percentage.
If any of the conditions are not satisfied, this should be noted, with a statement that we proceed with caution.
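The two numerical conditions above can be checked with a small helper. This is a sketch with assumed names and hypothetical inputs: `check_proportion_conditions` is not part of any library, and the Randomization Condition is left out because it depends on how the data were collected and cannot be checked from the numbers alone.

```python
def check_proportion_conditions(n, p, population_size=None):
    """Check the 10% and Success/Failure conditions for P-hat.

    p must be in decimal form. The Randomization Condition is not
    checked here because it depends on how the sample was taken.
    """
    q = 1 - p
    results = {
        # Expected successes and failures must both be at least 10.
        "success_failure": n * p >= 10 and n * q >= 10,
    }
    if population_size is not None:
        # Only relevant when sampling without replacement.
        results["ten_percent"] = n <= 0.10 * population_size
    return results


# Hypothetical values (assumptions).
print(check_proportion_conditions(n=400, p=0.25, population_size=50_000))
print(check_proportion_conditions(n=40, p=0.05, population_size=300))
```

In the first call both conditions hold; in the second, the expected number of successes is only 2 and the sample is more than 10% of the population, so both checks fail and we would proceed with caution.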
Success/failure condition: Check by SUBSTITUTING relevant values from the question INTO the condition:
• np ≥ 10
• nq ≥ 10

#3 Example
https://www.nbcnews.com/politics/meet-the-press/did-biden-win-little-or-lot-answer-yes-n1251845
The true proportion (population proportion) of U.S. voters who supported Joe Biden in the 2020 U.S. presidential election was p = 51.3%. If a sample of n = 1000 U.S. voters were selected, what is the probability that more than 500 of those sampled were in favour of Joe Biden? Comment on your solution.

With proportions, we may be given a percentage value or a fractional value out of 1. In the Conditions AND in the Z formula, we do NOT use the percentage form. If we are given a “percentage”, we must convert it to a fractional value out of 1 before checking conditions and using the Z formula for proportions.
E.g. In this question, p = proportion of voters in the U.S.A. population who voted for Joe Biden. Given p = 51.3% = 0.513, USE 0.513 in the conditions and Z formula.
With proportions, please use the entire window of your calculator. It is best NOT to round intermediate numbers, otherwise the rounding errors get compounded.
Want $P(\hat{P} > \frac{500}{1000})$

#3 Example solution
p = population proportion in favour of Joe Biden = 0.513
q = population proportion NOT in favour of Joe Biden = 1 − p = 1 − 0.513 = 0.487
n = 1000
$\hat{p}$ = 500/1000 = 0.5
Step 1: Check conditions
Random sample condition: not told whether the sample was random. Assume random and proceed with caution.
10% condition: If sampling without replacement, we know that 1000 is far less than 10% of the population of U.S. voters.
Success/Failure Condition: np = 1000(0.513) = 513 > 10 and nq = 1000(0.487) = 487 > 10.
Two of the three conditions are satisfied, so we proceed with caution and conclude that $\hat{P}$ is approximately normal. We can use the Z tables to find probabilities.
Note:
• With proportions, we work with the decimal form, not the percentage.
• We usually work with at least three decimal places when calculating with proportions.

#3 Example solution
Step 2: Find the z value
$Z = \dfrac{\hat{p} - p}{\sqrt{\dfrac{pq}{n}}} = \dfrac{0.5 - 0.513}{\sqrt{\dfrac{0.513(0.487)}{1000}}} = -0.822$
Step 3: Sketch a curve(s)
[Sketch: $\hat{P}$ curve centred at p = 0.513 with $\hat{p}$ = 0.5 marked, and the matching Z curve centred at 0 with Z = −0.82 marked. Total area = 100% = 1. Area to the left (do NOT want) = 0.2061. Area to the right (want) = 1 − 0.2061 = 0.7939.]
Step 4: Use the Z tables
P($\hat{P}$ > 0.50) = P(Z > −0.822)
= 1 − P(Z < −0.82)
= 1 − 0.2061
= 0.7939
Interpretation: There is a 79.39% chance, or 0.7939 probability, that more than half of those U.S. voters sampled were in favour of Joe Biden in the 2020 U.S. presidential election.
Note:
$\hat{P}$ is the sampling distribution of the proportion of U.S. voters who were in favour of Joe Biden.
p = population proportion of U.S. voters in favour of Joe Biden = 51.3% = 0.513, the form we use in our calculation with proportions.
$\hat{p}$ = sample proportion of U.S. voters in favour of Joe Biden = 50% = 0.50, the form we use in our calculation with proportions.
Note: with proportions, be careful.
• We have the proportion value on the horizontal axis of the $\hat{P}$ curve.
• We have areas or probabilities under the $\hat{P}$ curve.
• We usually work with at least three decimal places when calculating with proportions.

For Online Assessment:
Z = (0.5 – 0.513)/sqrt[(0.513*0.487)/1000]
Z = -0.822
P(Z > -0.82) = 1 – 0.2061 = 0.7939

#3 Exercise
Based on past experience, a bank believes that p = 7% of customers who receive loans will not make payments on time. This is termed “defaulting” on a loan. The bank has recently approved n = 200 loans.
a) What are the mean and standard deviation of the proportion of customers in this group who may default on their loan?
b) Check the conditions.
c) What is the probability that over 20 of the customers sampled will default on their loan? That is, we want $P(\hat{P} > \frac{20}{200}) = P(\hat{P} > 0.10)$. Interpret.
https://www.glasbergen.com/ngg_tag/real-estate-cartoon-comics/
Note:
$\hat{P}$ is the sampling distribution of the sample proportion of bank customers who default on their loan.
p = population proportion of bank customers who default = 7% = 0.07, the form we use in our calculation with proportions.
$\hat{p}$ = sample proportion of bank customers who default on their loan = 20 customers defaulting from 200 customers in the sample = 20/200 = 0.10, the form we use in our calculation with proportions.
This question is based on Sharpe et al. “Business Statistics” Pearson International Edition 2010, Chapter 9, Exercise 35, page 252.

#3 Exercise solution
“Success” in this question is the outcome of interest: customers who default.
a) The population mean of $\hat{P}$ is $\mu_{\hat{p}} = p$ = 7% = 0.07.
The population standard deviation of $\hat{P}$ is $\sigma_{\hat{p}} = \sqrt{\dfrac{pq}{n}} = \sqrt{\dfrac{0.07(1 - 0.07)}{200}} = 0.018$
b) Random sample condition: not told that this is a random sample. We have to assume that these new customers are a random sample from the same population on which the default percentage is based.
10% condition: If sampling without replacement, this bank has to have approved at least 2000 loans in the past. Assume true.
Success/failure condition: np = 200(0.07) = 14 > 10 and nq = 200(0.93) = 186 > 10.
As the first two conditions involve assumptions, we should proceed with caution. $\hat{P}$ ~ Normal (approximately), so we can use the Z tables to find probabilities.
Note: we usually work with at least three decimal places when calculating with proportions.
Recall, from the question:
p = 7% = 0.07
q = 1 − p = 1 − 0.07 = 0.93
$\hat{p}$ = 20/200 = 0.10

#3 Exercise solution, part c)
Step 2: Find the z value
$Z = \dfrac{\hat{p} - p}{\sqrt{\dfrac{pq}{n}}} = \dfrac{0.10 - 0.07}{\sqrt{\dfrac{0.07(0.93)}{200}}} = 1.663 \approx 1.66$
Step 3: Sketch a curve(s)
[Sketch: $\hat{P}$ curve centred at p = 0.07 with $\hat{p}$ = 0.10 marked, and the matching Z curve centred at 0 with Z = 1.66 marked. Total area = 100% = 1. Area to the left (do NOT want) = 0.9515. Area to the right (want) = 1 − 0.9515 = 0.0485.]
Step 4: Use the Z tables
P($\hat{P}$ > 0.10) = P(Z > 1.66)
= 1 − P(Z < 1.66)
= 1 − 0.9515
= 0.0485
Interpretation: There is a 4.85% chance, or 0.0485 probability, that more than 20 out of these 200 customers sampled will default on their loan payments.
Note:
• We have the proportion value on the horizontal axis of the $\hat{P}$ curve.
• We have the area or probability under the $\hat{P}$ curve.
• We want $P(\hat{P} > \frac{20}{200}) = P(\hat{P} > 0.10)$. As p = 0.07 = μ, and $\hat{p}$ = 0.10 > μ, we know that P($\hat{P}$ > 0.10) will be in the right tail of the curve.
• We usually work with at least three decimal places when calculating with proportions.

Supplementary Exercises
• Students are advised that Supplementary Exercises to this topic may be found on the subject portal under “Weekly materials”.
• Solutions to the Supplementary Exercises may be available on the portal under “Weekly materials” at the end of each week.
• Time permitting, the lecturer may ask students to work through some of these exercises in class.
• Otherwise, it is expected that all students work through all Supplementary Exercises outside of class time.

Extension
• The following slides are an extension to this week’s topic.
• The work covered in the extension:
o Is not covered in class by the lecturer.
o May be assessed.
Extension topics may be studied outside class time.

Your turn: build a sampling distribution
Say you are in a class with a population of N = 14 students, and you asked everyone their age. You have the data set listed under “Raw data on age” below.
• From this dataset, you could select a sample of size n = 5 ages and find the sample mean age.
• You can repeat this until you have, say, 20 different samples of data.
• Now, you can create a frequency distribution using all of the sample mean ages.
• See an example on the next slide with 20 sample means.

Example of sampling distribution
Raw data on age: 20 18 21 34 26 22 20 20 21 18 19 21 19 25
Each row is a sample, with the corresponding sample mean (sample average). The next slide has the frequency distribution and histogram.
Sample       Ages in sample        Average
Sample 1     20  18  34  19  21    22.4
Sample 2     27  29  20  18  20    22.8
Sample 3     18  26  21  20  22    21.4
Sample 4     25  20  19  21  21    21.2
Sample 5     21  22  19  20  25    21.4
Sample 6     22  25  34  18  19    23.6
Sample 7     20  20  18  25  20    20.6
Sample 8     34  21  18  20  21    22.8
Sample 9     20  22  18  21  19    20
Sample 10    26  21  18  25  18    21.6
Sample 11    20  20  22  21  22    21
Sample 12    19  20  18  21  25    20.6
Sample 13    26  34  20  19  19    23.6
Sample 14    22  20  21  25  22    22
Sample 15    18  26  19  21  18    20.4
Sample 16    25  19  18  20  19    20.2
Sample 17    20  19  18  34  20    22.2
Sample 18    19  26  20  19  21    21
Sample 19    26  18  25  21  19    21.8
Sample 20    21  22  20  25  20    21.6
We have only represented 20 samples here.

Example continued …
Class        Frequency
>19 - 20     1
>20 - 21     6
>21 - 22     7
>22 - 23     4
>23 - 24     2
You can see that the sampling distribution of mean age, $\bar{X}$, is unimodal and approximately symmetric.

Exercise
Assume that the population mean age for all students enrolled in this subject is 22 years with a standard deviation of 1.6 years. The distribution of age is approximately normal. Assume the conditions are satisfied.
a) What is the probability that a sample of 5 students has a mean age less than 23 years?
b) For samples of 5 students, below what mean age do the lowest 33% of sample means fall?

Exercise solution
a) As we are told the conditions are satisfied, $\bar{X}$ ~ N, and we can use the Z tables.
$Z = \dfrac{23 - 22}{1.6/\sqrt{5}} = 1.40$
Want P($\bar{X}$ < 23) = P(Z < 1.40) = 0.9192
b) Want $\bar{x}$ for P($\bar{X}$ < $\bar{x}$) = 0.33. Using the Z tables backwards, a left tail area of 0.33 gives Z = −0.44.
$Z = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}}$
$-0.44 = \dfrac{\bar{x} - 22}{1.6/\sqrt{5}}$
$\bar{x}$ = 21.685 years

Here is another illustration, from X to $\bar{X}$
Let’s use some fair six-sided dice, where the sides of each are labelled 1, 2, 3, 4, 5, 6.
[Figure: two histograms with values 1 to 6 on the horizontal axis. Left: “Distribution of X”, uniform distribution. Right: “Distribution of $\bar{X}$”, normal distribution.]

Simulate the sampling distribution of the sample mean
Note: singular, die and plural, dice.
We can simulate rolling 1 fair die 50,000 times and record the value the die lands on from each roll (here sample size, n = 1). Note, the distribution of values, X, is uniform, where each side has the same relative frequency.
Now, we can simulate rolling 3 fair dice 50,000 times and calculate the mean value from each roll (now sample size, n = 3). Note, the distribution of values, $\bar{X}$, is unimodal and symmetric, that is, approximately normally distributed.
The fact that the histogram of sample means on the right appears to be bell-shaped (Normal) is a consequence of the Central Limit Theorem, even for such a small n here. Hence, for sufficiently large sample sizes (n ≥ 30), the sampling distribution of the sample mean is approximately normal, even if the population distribution is non-normal.

Given: σ = 8, and the area BELOW x = 160 is 5.05% (probability = 0.0505).
Note: n not given, so we must be dealing with X.
Steps for reverse normal (from Week 4):
1. Sketch, placing known values
2. Use Z tables BACKWARDS or INSIDE OUT, find Z value for left tail area of 0.0505
3. Substitute known values into Z formula: Z = (x – μ)/σ
4. Solve for μ
–1.64 = (160 – μ)/8
–13.12 = 160 – μ
173.12 = μ

Weekly Content, Week 4, Supplementary Exercises, Q7.
a) Steps:
1) Sketch
2) Use Z tables backwards and search for the area of 1 – 0.05 = 0.95 in the LEFT SIDE; 0.95 is the average of 0.9495 and 0.9505
3) Find the Z value, CHECK SIGN OF Z: Z = (1.64 + 1.65)/2 = 1.645
4) Substitute into Z = (x – mean)/standard deviation and solve for the standard deviation.
1.645 = (130 – 125)/σ
σ = (130 – 125)/1.645
σ = 3.04 units

Weekly Content, Week 4, Supplementary Exercises, Q7.
b) Steps:
1) Sketch
2) Use Z tables backwards and search for the area of
3) Find the Z value, CHECK SIGN OF Z: Z = -1.17 etc.
4) Substitute into Z = (x – mean)/standard deviation and solve for the standard deviation.
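The reverse normal steps above can also be sketched in Python, with `statistics.NormalDist().inv_cdf` standing in for reading the Z tables backwards. The helper names `z_for_left_area`, `solve_mean` and `solve_sd` are illustrative and not part of the course material. The inputs are taken from this section: x = 160 with σ = 8 and a left tail area of 0.0505, and Q7 a) with x = 130, mean 125 and a left area of 0.95. Rounded to two decimal places, the results should agree with the table-based answers of 173.12 and 3.04.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def z_for_left_area(area: float) -> float:
    """Z value with the given area (probability) to its left."""
    return STANDARD_NORMAL.inv_cdf(area)


def solve_mean(x: float, sigma: float, left_area: float) -> float:
    """Rearrange Z = (x - mu) / sigma to get mu = x - Z * sigma."""
    return x - z_for_left_area(left_area) * sigma


def solve_sd(x: float, mu: float, left_area: float) -> float:
    """Rearrange Z = (x - mu) / sigma to get sigma = (x - mu) / Z."""
    return (x - mu) / z_for_left_area(left_area)


# Unknown mean: sigma = 8 and an area of 0.0505 below x = 160.
mu = solve_mean(x=160, sigma=8, left_area=0.0505)

# Q7 a): x = 130, mean = 125, area of 0.95 to the left of x.
sigma = solve_sd(x=130, mu=125, left_area=0.95)

print(f"mu is about {mu:.2f}")
print(f"sigma is about {sigma:.2f}")
```

As with the tables, the sign of Z matters: a left area below 0.5 gives a negative Z, which is why the unknown mean lands above x = 160.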