ELEMENTARY STATISTICS
Study Guide
Dr. Shinemin Lin

Table of Contents
1. Introduction to Statistics
2. Descriptive Statistics
3. Probabilities and Standard Normal Distribution
4. Estimates and Sample Sizes
5. Hypothesis Testing
6. Correlations and Regression
7. Analysis of Variance
8. Statistical Process Control

Project 1 Collecting Data
Many factors influence the complexity of written words, such as subject matter, overall length of discussion, choice of vocabulary, and sentence structure. To simplify the question, I propose to look at the length of the sentences in articles written for a national newspaper and a local paper. You are going to collect sentences from local and national newspapers and record the length of each sentence (and the complexity of the sentence). This project requires you to use random sampling or pseudo-random sampling to obtain your sample. Concentrate on the following questions:
1. How will you accomplish this?
2. How will you measure your variables with as much reliability and as little bias as possible?
3. How will you collect your data?
4. Is your data collection plan unbiased?
5. What descriptive statistics do you plan to compute for what variables?
6. What graphs and tables do you plan to display?
7. What inferences do you want to make about your population from the sample you observe?

Chapter 1 Introduction to Statistics
The word STATISTICS has two basic meanings. We sometimes use this word when referring to actual numbers derived from data. A second meaning refers to statistics as a method of analysis. A statistical study usually consists of data collection, data presentation, data analysis, and decision-making.
In statistics, we commonly use the terms population and sample. We investigate a sample in order to make predictions about the population.
A population is the complete collection of elements to be studied.
A sample is a subcollection of elements drawn from a population.
A parameter is a numerical measurement describing some characteristic of a population.
A statistic is a numerical measurement describing some characteristic of a sample.

Nature of Data
• Qualitative data can be separated into different categories that are distinguished by some nonnumeric characteristic.
• Quantitative data consist of numbers representing counts and measurements.
  ♦ Discrete data result from either a finite number of possible values or a countable number of possible values. When data represent counts, they are discrete.
  ♦ Continuous numerical data result from infinitely many possible values that can be associated with points on a continuous scale in such a way that there are no gaps or interruptions. When data represent measurements, they are continuous.

Levels of Measurement of Data
The nominal level of measurement is characterized by data that consist of names, labels, or categories only. The data cannot be arranged in an ordering scheme.
Example. Voter distribution: 45 Democrats, 80 Republicans, 90 Independents.
The ordinal level of measurement involves data that may be arranged in some order, but differences between data values either cannot be determined or are meaningless.
Example. Voter distribution: 45 low-income voters, 80 middle-income voters, 90 upper-income voters. An order is determined by 'low, middle, upper'.
The interval level of measurement is like the ordinal level, with the additional property that meaningful amounts of difference between data values can be determined. However, there is no natural zero starting point.
Example. Temperatures of steel rods: 45 F, 80 F, and 90 F. 90 F is not twice as hot as 45 F.
The ratio level of measurement is the interval level modified to include a natural zero starting point. For values at this level, both differences and ratios are meaningful.
Example. Lengths of steel rods: 45 cm, 60 cm, and 90 cm. 90 cm is twice as long as 45 cm.

Methods of Sampling
Guidelines for data collection
1.
Ensure that the sample size is large enough for the required purpose.
2. If you are obtaining measurements of some characteristic from people, you will get better results if you do the measuring yourself instead of asking the subject for the value.
3. When conducting a survey, consider the medium to be used.
4. Ensure that the method used to collect data actually results in a sample that is representative of the population.

Sampling methods
1. Random sampling: members of the population are selected in such a way that each has an equal chance of being selected.
2. Stratified sampling: we subdivide the population into at least two different subpopulations whose members share the same characteristic (such as gender), and then we draw a sample from each subpopulation.
3. Systematic sampling: we choose some starting point and then select every kth element in the population.
4. Cluster sampling: we divide the population area into sections (or clusters), randomly select a few of those sections, and then choose all the members from the selected sections.
5. Convenience sampling: we simply use results that are readily available.

Homework:

Chapter 2 Descriptive Statistics
Summarizing Data
When beginning an analysis of a large set of values, we must often organize and summarize the data by developing tables and graphs. We begin with a frequency table. A frequency table lists categories (or classes) of scores, along with counts (or frequencies) of the number of scores that fall into each category. Constructing a frequency table is not very difficult, and many statistics software packages can do it automatically.
• Lower class limits are the smallest numbers that can actually belong to the different classes.
• Upper class limits are the largest numbers that can actually belong to the different classes.
• Class boundaries are the numbers used to separate classes, but without the gaps created by class limits.
• Class marks are the midpoints of the classes.
• Class width is the difference between two consecutive lower class limits.

Example 1. Construct a frequency table (with 4 classes) for the data 80, 68, 84, 86, 85, 77, 64, 81, 93, 94, 97, 93, 89, 82, 76, 75, 83, 90, 83, 84, 92, 94, 90, 92, 91, 84, 81, 84, 79, 80, 80.
Data | Frequency | Cumulative frequency | Relative frequency (%) | Cumulative percent

Pictures of Data
1. Histograms use a frequency table
2. Pie charts use a relative frequency table
3. Maps
4. Stem and leaf plots

Another interesting way of summarizing data is to use what is called a stem and leaf plot. To illustrate this procedure, let's consider the grades obtained by two classes as follows.
Class I: 56, 64, 73, 72, 84, 98, 80, 86, 75, 68, 46, 78, 75, 91, 63, 84, 79, 69, 76, 58.
Class II: 99, 81, 50, 64, 76, 63, 71, 78, 81, 92, 87, 79, 74, 60, 68, 92, 84, 86, 65, 78.
The first digit serves as the stem, and the second digit as the leaf. For example, the stem of 46 in Class I is 4, and the leaf is 6. Likewise, 56 and 58 have stems of 5 and leaves of 6 and 8, respectively.

Stem and Leaf Plots
Class I
Stems | Leaves
4 | 6
5 | 6, 8
6 | 4, 8, 3, 9
7 | 2, 3, 5, 8, 5, 9, 6
8 | 4, 0, 6, 4
9 | 8, 1

Class II
Stems | Leaves
Complete the stem and leaf plot for Class II.

Steps in Making a Stem and Leaf Plot
1. Decide on the number of digits in the data to be listed under stems (one-digit, two-digit, ...). Usually only one digit is given under leaves and the other digits are listed under stems.
2. List the stems in a column, from least to greatest.
3. List the remaining digits in each data entry as leaves. (You may wish to order these from smallest to largest.)
Example. Construct the stem-and-leaf plot for the data 80, 68, 84, 86, 85, 77, 64, 81, 93, 94, 97, 93, 89, 82, 76, 75, 83, 90, 83, 84, 92, 94, 90, 92, 91, 84, 81, 84, 79, 80, 80.

Measures of Central Tendency
A measure of central tendency is a value at the center or middle of a data set. The measures we will use most often are the mean, weighted mean, mode, median, and midrange, together with skewness.
Mean = sum / count: $\bar{x} = \frac{\sum x_i}{n}$
Weighted mean: $\bar{x} = \frac{\sum w_i x_i}{\sum w_i}$
Median: The median of a set of scores is the middle value of the sorted data.
Mode: The mode of a data set is the score that occurs most frequently.
Midrange = (highest + lowest)/2
Examples
1. Given the data list 5, 5, 5, 3, 1, 1, 5, 4, 3, and 5, find the mean, mode, and median.
2. How do we find the mean, mode, and median if we are given a frequency table?
Skewness: A distribution of data is skewed if it is not symmetric and extends more to one side than the other.
Skewed to the left (mean < median < mode)
Symmetric (mean = mode = median)
Skewed to the right (mode < median < mean)

Measures of Variation
Range = largest - smallest
Variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
Standard deviation (SD): $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
The deviation of a score is the difference between the score and the mean.
Example. Find the variance and SD of the data 6.5, 6.6, 6.7, 6.8, 7.1, 7.3, 7.4, 7.7, 7.7, 7.7.

Calculating SD from a frequency table
Example. Find the (a) mean and (b) standard deviation of the data described below.
X | Frequency
-------------------------------
4 | 10
-------------------------------
5 | 7
-------------------------------
6 | 3
-------------------------------
7 | 2

Range Rule of Thumb
The range is close to 4s, and hence s can be approximated by (range / 4).

Interquartile Range (IQR)
IQR = Q3 - Q1
Example. The following are 16 grades received on a test, arranged in increasing order. Find the mean, Q1, Q3, and IQR.
Boxplots display Q1, the median, and Q3.

Outlier
An outlier is any data point farther than 1.5 IQRs above Q3 or farther than 1.5 IQRs below Q1.

Measure of Position
The standard score, or z score, is the number of standard deviations that a given value x is above or below the mean: $z = \frac{x - \mu}{\sigma}$
Percentile = cumulative percent
Example. Two equivalent IQ tests are given to similar groups, but the tests are designed with different scales. The statistics for the tests are listed below.
Which is better: a score of 130 on test A or a score of 52 on test B?
Test A: mean = 100, s = 15; Test B: mean = 40, s = 5.
Solution.

Chapter 3 Probability and Standard Normal Distribution
Probability of a single event
Pr(E) = k/n = number of successes / number of possible outcomes.
Example.
1) If we draw a ball from a bag containing 4 white balls and 6 black balls, what is the probability of
a) getting a white ball?
b) getting a black ball?
c) not getting a white ball?
2) A die is rolled. What is the probability that
a) a 4 will result?
b) an odd number will result?
c) a number bigger than 4 will result?

Sample space. A set that contains all possible outcomes of an experiment is called a sample space. Each element of the sample space is called a sample point, and an event is a subset of the sample space.
Examples.
1) Write the sample space and all events of example 2 above.
2) Suppose a coin is tossed 3 times. Construct the sample space for the experiment and the event of getting at least 2 heads.
3) Ten blank cards are marked with the numbers 1 to 10. An experiment consists of shuffling the cards and then drawing one card.
a) Determine the sample space for the experiment.
b) How many sample points are in the sample space?
c) What is the event of getting a card with an even number?
Pr(event) = # sample points in the event / # sample points in the sample space

Addition rule: the probability of obtaining any one of several different and distinct outcomes equals the sum of their separate probabilities. The addition rule in this form always assumes that the outcomes being considered are mutually exclusive.
Multiplication rule: the probability of obtaining a combination of independent outcomes equals the product of their separate probabilities.
Examples. Flip two fair coins. What is the probability of getting
a) two heads?
b) one head and one tail?
Example 1. Draw one card at random from a standard deck of cards. The sample space S is the collection of the 52 cards.
Assume that the probability set function assigns 1/52 to each of these 52 outcomes. Let
A = {x: x is a jack, queen, or king},
B = {x: x is a 9, 10, or jack and x is red},
C = {x: x is a club},
D = {x: x is a diamond, a heart, or a spade}.
Find a) P(A), b) P(A and B), c) P(A or B), d) P(C or D), and e) P(C and D).
Example 2. If P(A) = 0.4, P(B) = 0.5, and P(A and B) = 0.3, find P(A or B) and P(A and B').

Independent and Dependent Events
If the occurrence of one event affects the occurrence of the other, the events are said to be dependent. If the occurrence of one event does not affect the occurrence of the other, the events are called independent.
If E and F are any two events, then the probability that both events occur, denoted Pr(E and F), is given by
Pr(E and F) = Pr(E) * Pr(F|E),
where Pr(F|E) is the probability that F occurs, given that E has occurred. We call Pr(F|E) a conditional probability.
Examples.
1) Two cards are drawn from a regular deck of cards (without replacement).
a) The probability that the first card is a king is 4/52.
b) The probability that the second card is a king, given that the first card is a king, is 3/51.
c) The probability that both cards are kings is 4/52 * 3/51 = 1/221.
d) Find the probability that both cards are hearts.
e) Find the probability of a heart on the first draw and a club on the second draw.
f) Find the probability of a heart on the first draw and an ace on the second draw.
2) From a deck of 52 cards, two cards are drawn, one after another, without replacement. What is the probability that
(a) the first will be a king and the second will be a jack?
(b) the first will be a king and the second will be a jack of the same suit?
3) Suppose that we are given 20 tulip bulbs that are very similar in appearance and told that 8 tulips will bloom early, 12 will bloom late, 13 will be red, and 7 will be yellow. If a bulb is selected at random, find
a) the probability that it will produce a red tulip,
b) the probability that it will be red and will bloom early.
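The multiplication rule for dependent events, Pr(E and F) = Pr(E) * Pr(F|E), can be checked numerically. The following is a minimal sketch in Python for the two-kings example above. The function names and the way the deck is encoded are illustrative assumptions, not part of the course materials. It computes the probability once with the rule and once by counting every ordered pair of distinct cards.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical encoding of a standard 52-card deck as (rank, suit) pairs.
RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]
DECK = [(rank, suit) for rank in RANKS for suit in SUITS]


def prob_two_kings_by_rule():
    """Pr(first king) * Pr(second king | first king), without replacement."""
    p_first_king = Fraction(4, 52)
    p_second_king_given_first = Fraction(3, 51)
    return p_first_king * p_second_king_given_first


def prob_two_kings_by_counting():
    """Count ordered pairs of distinct cards in which both cards are kings."""
    pairs = list(permutations(DECK, 2))  # 52 * 51 ordered draws
    favorable = [p for p in pairs if p[0][0] == "K" and p[1][0] == "K"]
    return Fraction(len(favorable), len(pairs))


if __name__ == "__main__":
    print(prob_two_kings_by_rule())      # 1/221
    print(prob_two_kings_by_counting())  # 1/221
```

Both approaches give 1/221. Changing the condition inside `prob_two_kings_by_counting` is one way to check answers to the other card questions.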
The Normal Distribution
If we modify some line graphs to indicate probability rather than frequency, the resulting graphs closely approximate a smooth, bell-shaped curve called the normal probability curve.
1. The area under a normal curve is equal to 1.
2. The normal curve is symmetric about a vertical line through the mean of the set of data.
3. The interval extending from 2 SDs to the left of the mean to 2 SDs to the right of the mean contains approximately 95% of all data.
4. If x is a data value from a normally distributed set of data, then the probability that x is greater than a and less than b is the area under the normal curve between a and b.

Finding Probabilities When Given z Scores
Prob(a < z < b) = the probability that the z score is between a and b.
Prob(z > a) = the probability that the z score is greater than a.
Prob(z < a) = the probability that the z score is less than a.

Using the Z-distribution Table
Examples
1. Assume that IQ scores are normally distributed with a mean of 100 and a standard deviation of 15. An IQ score is randomly selected from this population. Find the indicated probability.
a) P(100 < x < 130)
b) P(x < 125)
c) P(x > 85)
d) P(85 < x < 115)
2. If IQ scores are normally distributed with a mean of 100 and a standard deviation of 15, find the probability of randomly selecting a person with an IQ score between 100 and 130.

Finding z Scores When Given Probabilities
Examples
1. Consider thermometers with temperature readings that are normally distributed with a mean of 0 C and a standard deviation of 1 C. Find the temperatures corresponding to the 95th percentile and the 90th percentile.
2. The Chemco Company, which manufactures car tires, finds that the tires last for distances that are normally distributed with a mean of 35600 miles and a standard deviation of 4275 miles. The manufacturer wants to guarantee the tires so that only 3% will be replaced because of failure before the guaranteed number of miles.
For how many miles should the tires be guaranteed?

The Central Limit Theorem
As the sample size increases, the sampling distribution of sample means approaches a normal distribution; that is, the mean of the sample means is almost equal to the population mean ($\mu_{\bar{x}} = \mu$), and the standard deviation of the sample means will be $\sigma_{\bar{x}} = \sigma / \sqrt{n}$.

Central Limit Theorem
Given:
1. The random variable x has a distribution with mean $\mu$ and standard deviation $\sigma$.
2. Samples of size n are randomly selected from this population.
Conclusion:
1. The distribution of the sample means $\bar{x}$ will, as the sample size n increases, approach a normal distribution.
2. The mean of the sample means will be the population mean $\mu$.
3. The standard deviation of the sample means will be $\sigma / \sqrt{n}$.

Practical Rules Commonly Used:
1. For sample size n larger than 30, the distribution of the sample means can be approximated reasonably well by a normal distribution. The approximation gets better as the sample size n becomes larger.
2. If the original population itself is normally distributed, the sample means will be normally distributed for any sample size n.

Example. Assume that the population of human body temperatures has a mean of 98.6 F, as is commonly believed. Also assume that the population standard deviation is 0.62 F. If a sample of size 106 is randomly selected, find the probability of getting a mean of 98.2 F or lower.
Solution.
$\mu_{\bar{x}} = \mu = 98.6$
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{0.62}{\sqrt{106}} = 0.060$
z = (98.2 - 98.6)/0.060 = -6.67
P(z < -6.67) < 0.0001.

Chapter 4 Estimates and Sample Sizes
A point estimate is a single value used to approximate a population parameter. As an example, the sample mean $\bar{x}$ is the best point estimate of the population mean $\mu$.
A confidence interval (or interval estimate) is a range (or an interval) of values that is likely to contain the true value of the population parameter. A confidence interval is associated with a degree of confidence.
The degree of confidence is the probability $1 - \alpha$ that the population parameter is contained in the confidence interval. This probability is often expressed as the equivalent percentage value. The degree of confidence is also referred to as the level of confidence or the confidence coefficient.
• Common choices for the degree of confidence are 95% ($\alpha = 0.05$) and 99% ($\alpha = 0.01$).
Notation: $z_{\alpha/2}$ = the positive z value that is at the vertical boundary for the area of $\alpha/2$ in the right tail of the standard normal distribution.
A critical value is the number on the borderline separating sample statistics that are likely to occur from those that are unlikely to occur. The number $z_{\alpha/2}$ is a critical value with the property that the area under the curve bounded by $-z_{\alpha/2}$ and $z_{\alpha/2}$ is $1 - \alpha$.
Example. Given a 95% degree of confidence, find the critical value $z_{\alpha/2}$.

When sample data are used to estimate a population mean $\mu$, the margin of error, denoted by E, is the maximum likely (with probability $1 - \alpha$) difference between the observed sample mean and the true value of the population mean $\mu$:
$E = z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}$
If n > 30, we can replace $\sigma$ by the sample standard deviation s.
The confidence interval for the population mean $\mu$ is $(\bar{x} - E, \bar{x} + E)$.
Example. For a 95% degree of confidence, find the confidence interval for the population mean given the statistics n = 106, $\bar{x}$ = 98.2, and s = 0.62.

Small Sample Cases and the Student t-distribution
If n ≤ 30 and the population standard deviation is unknown, then we cannot use the previous formula to find the confidence interval for the population mean. If this is the case, we can apply the Student t-distribution.

Student t-distribution
If the distribution of a population is essentially normal, then the distribution of
$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$
is essentially a Student t-distribution for all samples of size n.
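Before turning to the t table, it may help to see the large-sample interval computed once. The following Python sketch works the earlier 95% example (n = 106, sample mean 98.2, s = 0.62) using E = z_{α/2} * s / sqrt(n). The function name `z_confidence_interval` is a hypothetical helper introduced here, and the critical value comes from Python's standard `statistics.NormalDist` rather than from the printed z table, so the last digits may differ slightly from a table-based answer.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(sample_mean, sd, n, confidence=0.95):
    """Large-sample (n > 30) confidence interval for a population mean.

    Returns (lower, upper, margin_of_error). Hypothetical helper for
    illustration; sd may be the sample standard deviation s when n > 30.
    """
    alpha = 1 - confidence
    # z_{alpha/2}: the z value with area alpha/2 in the right tail.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z_crit * sd / sqrt(n)
    return sample_mean - margin, sample_mean + margin, margin


if __name__ == "__main__":
    low, high, margin = z_confidence_interval(98.2, 0.62, 106)
    print(f"E = {margin:.3f}, interval = ({low:.2f}, {high:.2f})")
```

Under these assumptions the margin of error is about 0.118, giving an interval of roughly (98.08, 98.32). For small samples the same structure applies with a t critical value in place of z.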
Using the t-distribution Table
The number of degrees of freedom for a data set corresponds to the number of scores that can vary after certain restrictions have been imposed on all scores. DF = n - 1.

Important facts about the t-distribution
1. The t-distribution is different for different sample sizes.
2. The t-distribution has the same general symmetric bell shape as the normal distribution, but it reflects the greater variability that is expected with small samples.
3. The t-distribution has a mean of t = 0.
4. The standard deviation of the t-distribution varies with the sample size, but it is greater than 1.
5. As n gets larger, the t-distribution gets closer to the normal distribution.

When do we use the t-distribution?
(1) The sample is small (n ≤ 30); (2) $\sigma$ is unknown; and (3) the population has a distribution that is essentially normal.
The confidence interval will be $(\bar{x} - E, \bar{x} + E)$, where $E = t_{\alpha/2} \cdot \frac{s}{\sqrt{n}}$.
Example. Suppose that we have only the following 10 randomly selected body temperatures.
98.6, 98.6, 98.0, 99.0, 98.4, 98.4, 98.4, 98.6, 98.4, 98.0
Construct the 95% confidence interval for the mean of all body temperatures. (Assume that body temperatures are normally distributed.)

What is the appropriate sample size?
$n = \left[ \frac{z_{\alpha/2} \cdot \sigma}{E} \right]^2$, rounded up. If we don't know the population standard deviation, we can use s instead of $\sigma$.
Example. We want to estimate the mean weight of plastic discarded by households in one week. How many households must we randomly select if we want to be 99% sure that the sample mean is within 0.25 lb of the true population mean? Assume that $\sigma$ = 1.10 lb.

Estimating a Population Variance
In a normally distributed population with variance $\sigma^2$, we randomly select independent samples of size n and compute the sample variance $s^2$ for each sample. The sample statistic $\chi^2 = (n - 1) s^2 / \sigma^2$ has a distribution called the chi-square distribution with DF = n - 1.
Properties of the chi-square distribution
1.
The chi-square distribution is not symmetric.
2. The value of chi-square can be zero or positive, but cannot be negative.
3. The chi-square distribution is different for each number of degrees of freedom. As the number of DF increases, the chi-square distribution approaches a normal distribution.
Example. Find the critical values of $\chi^2$ that determine critical regions containing an area of 0.025 in each tail. Assume that the relevant sample size is 10.
• The sample variance $s^2$ is the best point estimate of the population variance $\sigma^2$.
The confidence interval for the population variance is
$\frac{(n - 1) s^2}{\chi^2_R} < \sigma^2 < \frac{(n - 1) s^2}{\chi^2_L}$,
where $\chi^2_L$ and $\chi^2_R$ are the left and right critical values.
Question: What is the confidence interval for the population standard deviation?
Example. The following IQ scores are obtained from a randomly selected sample.
85 91 93 99 103 111 115 122 92
a) Find the best point estimate of the population variance.
b) Construct a 95% confidence interval estimate of the population standard deviation.

Chapter 5 Hypothesis Testing
In the previous chapter we studied how to use sample statistics to estimate values of population parameters. In this chapter we study how to use sample statistics to test hypotheses made about population parameters.
In statistics, a hypothesis is a statement that something is true.

Components of a Formal Hypothesis Test
1. The null hypothesis (H0) is a statement about the value of a population parameter, and it must contain the condition of equality.
2. The alternative hypothesis (H1) is the statement that must be true if the null hypothesis is false.
Hypothesis testing is not simply a matter of being right or wrong. Different types of errors can have dramatically different consequences.
• Type I error: The mistake of rejecting the null hypothesis when it is true. This type of error is not a miscalculation or procedural misstep; it is an actual error that can occur when a rare event happens by chance.
The probability of rejecting the null hypothesis when it is true is called the significance level; that is, the significance level is the probability of a type I error. The symbol $\alpha$ is used to represent the significance level. The values 0.05 and 0.01 are commonly used.
• Type II error: The mistake of failing to reject the null hypothesis when it is false. The symbol $\beta$ is used to represent the probability of a type II error.

True State of Nature
                  | H0 is true       | H0 is false
------------------------------------------------------------
Reject H0         | Type I error     | Correct decision
Fail to reject H0 | Correct decision | Type II error

3. Test statistic: A sample statistic or a value based on the sample data. A test statistic is used in making the decision about rejection of the null hypothesis.
4. Critical region: The set of all values of the test statistic that would cause us to reject the null hypothesis.
5. Critical value: The value or values that separate the critical region from the values of the test statistic that would not lead to rejection of the null hypothesis. The critical values depend on the nature of the null hypothesis, the relevant sampling distribution, and the level of significance.
6. Conclusion: a) Fail to reject the null hypothesis H0, or b) Reject the null hypothesis.

Example.
• Original claim: A medical researcher claims that the mean body temperature of healthy adults is not equal to 98.6 F.
• Hypotheses: H0: $\mu$ = 98.6; H1: $\mu \neq$ 98.6.
• Significance level: $\alpha$ = 0.05
• Test statistic: $z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma / \sqrt{n}} = \frac{98.2 - 98.6}{0.62 / \sqrt{106}} = -6.64$
• Critical region: It consists of values of the test statistic less than z = -1.96 or greater than z = 1.96.
• Critical values: The critical values are z = -1.96 and z = 1.96.

The following practical considerations may be relevant:
1) For any fixed $\alpha$, an increase in the sample size n will cause a decrease in $\beta$. That is, a larger sample will lessen the chance that you fail to reject a false null hypothesis.
2) For any fixed sample size n, a decrease in $\alpha$ will cause an increase in $\beta$.
Conversely, an increase in $\alpha$ will cause a decrease in $\beta$.
3) To decrease both $\alpha$ and $\beta$, increase the sample size.

Summary
Start => Does the original claim contain the condition of equality?
  If the answer is yes => the original claim becomes H0. Do you reject H0?
    If yes => There is sufficient evidence to warrant rejection of the claim.
    If no => There is not sufficient evidence to warrant rejection of the claim.
  If the answer is no => the original claim becomes H1. Do you reject H0?
    If yes => The sample data support the claim.
    If no => There is not sufficient sample evidence to support the claim.

Two-tailed test: H1 contains $\neq$
Left-tailed test: H1 contains <
Right-tailed test: H1 contains >

Example. After analyzing 106 body temperatures of healthy adults, a medical researcher makes a claim that the mean body temperature is less than 98.6 degrees F.
a) Express the claim in symbolic form:
b) Identify the null hypothesis:
c) Identify the alternative hypothesis:
d) Identify this test as being two-tailed, left-tailed, or right-tailed:
e) Identify the type I error:
f) Identify the type II error:
g) Assume that the conclusion is to reject the null hypothesis. State the conclusion in nontechnical terms.
h) Assume that the conclusion is failure to reject the null hypothesis. State the conclusion in nontechnical terms.

Testing a Claim about a Mean: Large Samples
Test statistic for claims about $\mu$ when n > 30: $z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma / \sqrt{n}}$

Traditional Method of Hypothesis Testing:
1. Identify the specific claim or hypothesis to be tested and put it in symbolic form.
2. Give the symbolic form that must be true when the original claim is false.
3. Of the two symbolic expressions obtained so far, let the null hypothesis H0 be the one that contains the condition of equality; H1 is the other statement.
4. Select the significance level $\alpha$ based on the seriousness of a type I error. Make $\alpha$ small if the consequences of rejecting a true H0 are severe. The values 0.05 and 0.01 are very common.
5.
Identify the statistic that is relevant to this test and its sampling distribution.
6. Determine the test statistic, the critical values, and the critical region. Draw a graph and include the test statistic, critical value(s), and critical region.
7. Reject H0 if the test statistic is in the critical region. Fail to reject H0 if the test statistic is not in the critical region.
8. Restate this previous decision in simple nontechnical terms.
• Failing to reject H0 is not equivalent to supporting H0.
Example. Using the sample data given at the beginning of the chapter (n = 106, $\bar{x}$ = 98.2, s = 0.62) and a 0.05 significance level, test the claim that the mean body temperature of healthy adults is equal to 98.6 F. Use the traditional method by following the procedure outlined above.

The p-value Method of Testing Hypotheses
Many professional articles and software packages use another approach to hypothesis testing that is based on the calculation of a probability value, or p-value. A p-value is the probability of getting a value of the sample test statistic that is at least as extreme as the one found from the sample data, assuming that the null hypothesis is true. P-values can be found using Table A3.
P-values measure how confident we are in rejecting a null hypothesis. For example, a p-value of 0.0002 would lead us to reject the null hypothesis, but it would also suggest that the sample results are extremely unusual if the claimed value of $\mu$ is in fact correct.
The p-value approach uses most of the same basic procedures as the traditional approach, but steps 6 and 7 are different:
Step 6: Find the p-value.
Step 7: Report the p-value.
Some statisticians prefer to simply report the p-value and leave the conclusion to the reader. Others prefer to use the following decision criterion:
• Reject H0 if the p-value is less than or equal to the significance level.
• Fail to reject H0 if the p-value is greater than the significance level.
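The decision criterion above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it treats the body temperature data (n = 106, sample mean 98.2, s = 0.62) as a two-tailed large-sample z test of H0: µ = 98.6, and the function names are hypothetical. The p-value is computed from the standard normal distribution with `math.erfc` instead of being read from a table.

```python
from math import erfc, sqrt


def z_statistic(sample_mean, claimed_mean, sd, n):
    """Large-sample test statistic z = (x_bar - mu) / (sd / sqrt(n))."""
    return (sample_mean - claimed_mean) / (sd / sqrt(n))


def two_tailed_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z.

    erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) and stays accurate
    for very small tail areas.
    """
    return erfc(abs(z) / sqrt(2))


def decide(p_value, alpha=0.05):
    """Apply the decision criterion: reject H0 when p-value <= alpha."""
    return "reject H0" if p_value <= alpha else "fail to reject H0"


if __name__ == "__main__":
    z = z_statistic(98.2, 98.6, 0.62, 106)
    p = two_tailed_p_value(z)
    print(f"z = {z:.2f}, p-value = {p:.2e}, decision: {decide(p)}")
```

With these inputs z is about -6.64 and the p-value is far below 0.05, so the criterion says to reject H0. For a left-tailed or right-tailed test, only one tail area would be used, so `two_tailed_p_value` would need to be replaced accordingly.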
If the conclusion is based on the p-value alone, the following guide may be helpful:
Less than 0.01: Highly statistically significant; very strong evidence against the null hypothesis
0.01 to 0.05: Statistically significant; adequate evidence against the null hypothesis
Greater than 0.05: Insufficient evidence against the null hypothesis
Example. Use the p-value method to test the claim that the mean body temperature of healthy adults is equal to 98.6 F. As before, use a 0.05 significance level and the sample data from the previous example.

Testing a Claim about a Mean: Small Samples
If the sample size is small (n ≤ 30), the population standard deviation is unknown, and the population is essentially normally distributed, then we use the t-distribution to test our hypothesis.
Test statistic: $t = \frac{\bar{x} - \mu_{\bar{x}}}{s / \sqrt{n}}$
Example. In one part of a test developed by a psychologist, the test subject is asked to form a word by unscrambling the letters 'ciiatttsss'. Given below are the times (in seconds) required by 15 randomly selected persons to unscramble the letters. Test the claim that the mean time is equal to 60 seconds at the 0.05 level of significance.
68.7, 27.4, 26.0, 60.5, 34.6, 61.1, 68.6, 48.4, 43.6, 39.5, 85.3, 26.3, 43.4, 83.7, 68.9.

Testing a Claim about a Standard Deviation or Variance
In testing a hypothesis made about a population standard deviation or variance, we assume that the population has values that are normally distributed.
Test statistic for testing a hypothesis about a standard deviation or variance:
$\chi^2 = (n - 1) s^2 / \sigma^2$,
where n = sample size; $s^2$ = sample variance; and $\sigma^2$ = population variance (given in H0).
Example. With individual lines at its various windows, the Jefferson Bank found that the standard deviation for normally distributed waiting times on Friday afternoon was 62 min. The bank experimented with a single main waiting line and found that for a random sample of 25 customers, the waiting times have a standard deviation of 3.8 min.
Based on previous studies, we can assume that the waiting times are normally distributed. At the $\alpha$ = 0.05 significance level, test the claim that a single line causes lower variation among the waiting times.

Chapter 6 Correlation and Regression
This chapter involves estimating parameters and testing hypotheses, but the methods we will use are different because of the very different issue we will be considering: given paired data, we want to investigate the relationship between the two variables. Specifically, we want to determine whether there is a relationship between the two variables and, if so, identify what the relationship is. We begin by considering the concept of correlation. We also investigate regression analysis.
A correlation exists between two variables when one of them is related to the other in some way.
Minitab provides a scatter diagram, which is a plot of paired (x, y) data with a horizontal x-axis and a vertical y-axis. The scatter diagram can sometimes reveal the general pattern of the paired data.
The linear correlation coefficient r measures the strength of the linear relationship between the paired x and y values in a sample. Its value is computed by using the formula
$r = \frac{n \sum xy - (\sum x)(\sum y)}{\sqrt{n(\sum x^2) - (\sum x)^2} \, \sqrt{n(\sum y^2) - (\sum y)^2}} = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1) s_x s_y}$
r is a sample statistic. We might think of r as a point estimate of the population parameter $\rho$, which is the linear correlation coefficient for all pairs of data in the population.
Example. Using Table 6.1, find the value of the linear correlation coefficient r. (r = 0.842)

Table 6.1 Data from the Garbage Project
x Plastic (lb)   | 0.27 1.41 2.19 2.83 2.19 1.81 0.85 3.05
y Household size | 2    3    3    6    4    2    1    5

After calculating r, how do we interpret the result? If r is close to zero, we conclude that there is no significant linear correlation between x and y.

Properties of r
1. r is always between -1 and 1.
r does not change if all values of either variables are converted to a different scale. r is not affected by the choice of x or y. r measures the strength of a linear relationship. Hypothesis Test of the Significance of r H1: ρ ≠ 0 H0: ρ = 0; For the test statistic, we use one of the following methods. Method I: Test Statistic is t (r − µ r ) r t= = ; since we assume that ρ = 0 , it follows that µ r = 0 Also, sr 1− r2 n−2 it can be shown that the standard deviation of linear correlation coefficients, can be expresses as (1 − r 2 ) /(n − 2) . . Critical value: Use Table A-3 with degrees of freedom = n-2. Method 2: Test Statistic is r Critical values: refer to Table A-6. Example. Using the sample data in Table 6.1, test the claim that there is a linear correlation between weights of discarded plastic and household sizes use method 1. Common Errors Involving Correlation 1. We must be careful to avoid conducting that a significant linear correlation between two variables is proof that there is a cause-effect relationship between them. 2. Another source of potential error arises with data based on rates or averages. If we suppress the variation of individuals, it may lead to an inflates correlation coefficient. 3. A third error involves the property of linearity. The conclusion that there is no significant linear correlation coefficient does not mean that x and y are not related in any way. Regression Our goal in this section is to identify the relationship between variables so that we can predict the value of one variable, given the value of the other variable. Given a collection of paired sample data, the regression equation describes the relationship between the two variables. The graph is Yˆi = b0 + b1 X i called the regression line or line of best fit, or least-squares line. 
$b_1 = \dfrac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}$, $\quad b_0 = \bar{Y} - b_1\bar{X}$

Notation of Regression Equation
                               | Population parameter | Point estimate
y-intercept of regression line | b0                   | b0'
Slope of regression line       | b1                   | b1'
Equation of the line           | y = b0 + b1x         | y = b0' + b1'x

Example. Using the Table 6.1 data, find the regression equation of the straight line that relates x and y. (y = 0.549 + 1.48x)

Predictions

In predicting a value of y based on some given value of x:
1. If there is not a significant linear correlation, the best predicted y value is $\bar{y}$.
2. If there is a significant linear correlation, the best predicted y value is found by substituting the x value into the regression equation.

Example. Use the previous regression equation y = 0.549 + 1.48x to predict the size of a household that discards 2.50 lb of plastic in a week.
Solution. y = 0.549 + 1.48(2.50) = 4.25

Guidelines for Using the Regression Equation
• If there is no significant linear correlation, don't use the regression equation to make predictions.
• When using the regression equation for prediction, stay within the scope of the available sample data.
• A regression equation based on old data is not necessarily valid now.
• Don't make predictions about a population that is different from the population from which the sample data were drawn.

Chapter 7 Analysis of Variance

In Chapter 5 we developed procedures for testing the hypothesis that two population means are equal. In this chapter we will develop a procedure for testing the hypothesis that three or more population means are equal.

Analysis of Variance (ANOVA) is a method of testing the equality of three or more population means by analyzing sample variances.

The ANOVA methods use the F-distribution. Assume that two populations are independent of each other, are normally distributed, and have equal variances. Then

$F(n, m) = \dfrac{s_1^2}{s_2^2}$

has an F-distribution with degrees of freedom n - 1 and m - 1, where $s_1^2$ and $s_2^2$ are the variances of samples of sizes n and m.

Properties of F-distributions
1. The F distribution is not symmetric; it is skewed to the right.
2. The values of F can be zero or positive, but they cannot be negative.
3. There is a different F distribution for each pair of degrees of freedom for the numerator and denominator.

In this chapter we assume that
1. The populations have normal distributions.
2. The populations have the same variance.
3. The samples are random and independent of each other.

One-Way ANOVA with Equal Sample Sizes

Notation for One-Way ANOVA with Equal Sample Sizes
n = size of each sample
k = number of samples
$s_{\bar{x}}^2$ = variance of the sample means
$s_p^2$ = pooled variance obtained by calculating the mean of the sample variances

H0: $\mu_1 = \mu_2 = \mu_3$
H1: at least one of the equalities does not hold.

The variance between samples (variation due to treatment) is an estimate of $\sigma^2$ based on the sample means.
Variance between samples = $n s_{\bar{x}}^2$, where $s_{\bar{x}}^2$ = variance of the sample means.

The variance within samples (variation due to error) is an estimate of $\sigma^2$ based on the sample variances. With all samples of the same size n,
Variance within samples = $s_p^2$ = pooled variance obtained by finding the mean of the sample variances.

Test Statistic for One-Way ANOVA with Equal Sample Sizes

$F = \dfrac{n s_{\bar{x}}^2}{s_p^2}$

numerator degrees of freedom = k - 1
denominator degrees of freedom = k(n - 1)
The critical value of F is F(k - 1, k(n - 1)).

Example. Do different age groups have different body temperatures? Table 7-1 lists the body temperatures of 5 randomly selected subjects from each of 3 different age groups. Informal examination of the 3 sample means (97.940, 98.580, 97.800) seems to suggest that the 3 samples come from populations with means that are not significantly different. In addition to the values of the 3 sample means, however, we should consider their standard deviations and the sample sizes. We need to conduct a formal hypothesis test to determine whether the sample means are significantly different.
Using a significance level of 0.05, we will test the claim that the 3 age-group populations have the same mean body temperature.

Table 7-1 Body Temperature (Categorized by Age)
18 - 20               | 21 - 29               | 30 and older
98.0                  | 99.6                  | 98.6
98.4                  | 98.2                  | 98.6
97.7                  | 99.0                  | 97.0
98.5                  | 98.2                  | 97.5
97.1                  | 97.9                  | 97.3
n1 = 5                | n2 = 5                | n3 = 5
$\bar{X}_1$ = 97.940  | $\bar{X}_2$ = 98.580  | $\bar{X}_3$ = 97.800
s1 = 0.568            | s2 = 0.701            | s3 = 0.752

Solution.
Step 1 and Step 2 (omit)
Step 3: H0: $\mu_1 = \mu_2 = \mu_3$; H1: The three means are not all equal.
Step 4: Significance level = 0.05
Step 5: Because we test the claim that 3 or more population means are equal, we use ANOVA with an F test statistic.
Step 6: For one-way ANOVA with equal sample sizes, the test statistic is calculated as follows: $F = n s_{\bar{x}}^2 / s_p^2 = 1.8803$. The critical value of F = 3.8853 is found by referring to the table for which α = 0.05. The degrees of freedom are as follows:
numerator degrees of freedom = k - 1 = 3 - 1 = 2
denominator degrees of freedom = k(n - 1) = 3(5 - 1) = 12
Step 7: Because the test statistic of F = 1.8803 does not fall in the critical region bounded by F = 3.8853, we fail to reject the null hypothesis that the 3 means are equal.
Step 8: There is not sufficient evidence to warrant rejection of the claim that the 3 populations of different age groups have the same mean body temperature. Perhaps there really is a difference, but the sample size is too small and/or the sample differences are not large enough to justify that conclusion.
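The calculation in Step 6 can be sketched in a few lines of Python. This is only an illustration, using the standard-library statistics module and the Table 7-1 data; the function name one_way_anova_equal_n is made up for this sketch, and the critical value 3.8853 is taken from the text rather than computed. Because the sketch uses unrounded sample variances, its F comes out near 1.88 and may differ from the 1.8803 above in the third decimal place.

```python
from statistics import mean, variance

# Body temperatures from Table 7-1, one list per age group.
age_18_20 = [98.0, 98.4, 97.7, 98.5, 97.1]
age_21_29 = [99.6, 98.2, 99.0, 98.2, 97.9]
age_30_up = [98.6, 98.6, 97.0, 97.5, 97.3]


def one_way_anova_equal_n(samples):
    """Return (F, numerator df, denominator df) for samples of equal size.

    Hypothetical helper for illustration: F = n * s_xbar^2 / s_p^2.
    """
    k = len(samples)          # number of samples
    n = len(samples[0])       # common sample size
    if any(len(s) != n for s in samples):
        raise ValueError("all samples must have the same size")

    sample_means = [mean(s) for s in samples]
    var_of_means = variance(sample_means)              # s_xbar^2
    pooled_var = mean([variance(s) for s in samples])  # s_p^2

    f_stat = n * var_of_means / pooled_var
    return f_stat, k - 1, k * (n - 1)


f_stat, df_num, df_den = one_way_anova_equal_n([age_18_20, age_21_29, age_30_up])

critical_value = 3.8853  # F(2, 12) at alpha = 0.05, as quoted in the text
reject_h0 = f_stat > critical_value

print(f"F = {f_stat:.4f}, df = ({df_num}, {df_den}), reject H0: {reject_h0}")
```

Since F is below the critical value, the sketch reaches the same decision as Step 7: fail to reject the null hypothesis.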
One-Way ANOVA with Unequal Sample Sizes

Notation
$\bar{\bar{X}}$ = overall mean (sum of all sample scores divided by the total number of scores)
k = number of population means being compared
$n_i$ = number of values in the ith sample
N = total number of values in all samples combined ($N = n_1 + n_2 + \dots + n_k$)
$\bar{X}_i$ = mean of values in the ith sample
$s_i^2$ = variance of values in the ith sample

Using the preceding notation, we can now express the test statistic as follows:

$F = \dfrac{\text{variance between samples}}{\text{variance within samples}} = \dfrac{\left[\sum n_i(\bar{X}_i - \bar{\bar{X}})^2\right]/(k - 1)}{\left[\sum (n_i - 1)s_i^2\right]/\left[\sum (n_i - 1)\right]}$

The numerator is really a form of the variance formula $s^2 = \sum (x - \bar{x})^2/(n - 1)$.

Key components in our ANOVA method are listed below.

SS(total) = total sum of squares = a measure of the total variation (around the overall mean) in all of the sample data combined.
SS(total) = $\sum (x - \bar{\bar{X}})^2$ = SS(treatment) + SS(error)

SS(treatment) = a measure of the variation between the sample means = SS(between groups)
SS(treatment) = $\sum n_i(\bar{X}_i - \bar{\bar{X}})^2$

SS(error) = sum of squares representing the variability that is assumed to be common to all the populations being considered.
SS(error) = $\sum (n_i - 1)s_i^2$

Example. Table 7.2 includes sample data with movie lengths arranged according to the numbers of stars the movies were given. Use the data in Table 7.2 to find the values of SS(treatment), SS(error), and SS(total).

Table 7.2 Lengths (in minutes) of Movies Categorized by Star Ratings
Poor Fair 0.0-1.5 Stars 105 108 96 91 2.0 - 2.5 Stars 110 114 98 100 96 123 101 92 155 92 155 92 99 100 108
Good 3.0 - 3.5 Stars 93 123 115 97 133 104 94 82 94 98 106 107 93 95 129 94 102 117 90 104 102 117 90 104 104 119 105 96 139 134 111 100 111
Excellent 4.0 Stars 72 120 120 104 159 125 103 160 193 168 193 168 88 121 144 90

Solution.
k = 4 (number of samples)
mean of all 60 sample scores = 6630/60 = 110.5000
SS(treatment) = 4113.1122
SS(error) = 25466.0514
SS(total) = 29579.1636

SS(treatment) and SS(error) are both sums of squares, and if we divide each by its corresponding number of degrees of freedom, we get mean squares, as defined below.
MS(treatment) is a mean square for treatment, obtained as follows: MS(treatment) = SS(treatment)/(k - 1)
MS(error) is a mean square for error, obtained as follows: MS(error) = SS(error)/(N - k)
MS(total) is a mean square for the total variation, obtained as follows: MS(total) = SS(total)/(N - 1)

Example. Use the sample in Table 7.2 to find the values of MS(treatment), MS(error), and MS(total).
Solution.
MS(treatment) = SS(treatment)/(k - 1) = 4113.1122/(4 - 1) = 1371.0374
MS(error) = SS(error)/(N - k) = 25466.0514/(60 - 4) = 454.7509
MS(total) = SS(total)/(N - 1) = 29579.1636/(60 - 1) = 501.3418

Test Statistic for ANOVA with Unequal Sample Sizes
H0: All means are equal.
H1: The means are not all equal.
The test statistic is F = MS(treatment)/MS(error).
The critical value is F(k - 1, N - k).

Example. Are bad movies as long as good movies, or does it just seem that way? Refer to the sample data given in Table 7.2. Examination of summary statistics seems to suggest that there are differences in the mean length of movies, with movies rated as excellent tending to be longer. But are those differences significant? Test the claim that the 4 categories of movies have the same mean length. That is, test the claim that $\mu_1 = \mu_2 = \mu_3 = \mu_4$.

Solution.
H0: $\mu_1 = \mu_2 = \mu_3 = \mu_4$
H1: The preceding means are not all equal.
Significance level = 0.05. Use the F distribution.

ANOVA Table
Source of Variation | SS         | Degrees of Freedom | MS        | F
Treatments          | 4113.1122  | 3                  | 1371.0374 | 3.0149
Error               | 25466.0514 | 56                 | 454.7509  |
Total               | 29579.1636 | 59                 |           |

Critical Value = 2.7581

Because the test statistic of F = 3.0149 exceeds the critical value F = 2.7581, we reject the null hypothesis that the means are equal. There is sufficient sample evidence to warrant rejection of the claim that the 4 population means are equal. It appears the mean movie length is not the same for poor, fair, good, and excellent movies. It seems that movies rated with 4 stars are longer than other movies, but we need other methods to formally justify this conclusion.
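As a rough check on the arithmetic in the ANOVA table, the following Python sketch rebuilds the mean squares and F from the sums of squares quoted in the solution. It also includes a small helper that computes the sums of squares from raw samples of any sizes, using the standard definitions SS(treatment) = sum of n_i times the squared distance of each sample mean from the overall mean, and SS(error) = sum of (n_i - 1) times each sample variance. The function names anova_table and sums_of_squares are invented for this illustration, and the critical value 2.7581 is copied from the text, not computed.

```python
from statistics import mean, variance


def sums_of_squares(samples):
    """Return (SS(treatment), SS(error), SS(total)) for samples of any sizes."""
    all_values = [x for s in samples for x in s]
    overall_mean = mean(all_values)
    ss_treatment = sum(len(s) * (mean(s) - overall_mean) ** 2 for s in samples)
    ss_error = sum((len(s) - 1) * variance(s) for s in samples)
    ss_total = sum((x - overall_mean) ** 2 for x in all_values)
    return ss_treatment, ss_error, ss_total


def anova_table(ss_treatment, ss_error, k, n_total):
    """Build the one-way ANOVA quantities from the two sums of squares."""
    ms_treatment = ss_treatment / (k - 1)
    ms_error = ss_error / (n_total - k)
    ms_total = (ss_treatment + ss_error) / (n_total - 1)
    return {
        "df_treatment": k - 1,
        "df_error": n_total - k,
        "MS_treatment": ms_treatment,
        "MS_error": ms_error,
        "MS_total": ms_total,
        "F": ms_treatment / ms_error,
    }


# Sums of squares quoted in the movie-length solution (k = 4, N = 60).
movie = anova_table(4113.1122, 25466.0514, k=4, n_total=60)
critical_value = 2.7581  # as quoted in the text
reject_h0 = movie["F"] > critical_value
print(f"F = {movie['F']:.4f}, reject H0: {reject_h0}")

# The helper applied to the Table 7-1 body temperatures; the identity
# SS(total) = SS(treatment) + SS(error) should hold up to rounding error.
temps = [
    [98.0, 98.4, 97.7, 98.5, 97.1],
    [99.6, 98.2, 99.0, 98.2, 97.9],
    [98.6, 98.6, 97.0, 97.5, 97.3],
]
ss_tr, ss_err, ss_tot = sums_of_squares(temps)
print(f"SS(treatment) + SS(error) - SS(total) = {ss_tr + ss_err - ss_tot:.2e}")
```

The sketch keeps the sums of squares separate from the table arithmetic so that either step can be checked on its own against hand calculations.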
Chapter 8 Nonparametric Statistics