Unit 1 Descriptive Statistics & Basic Probability

2/5/2016 Math 131 Table of Contents

Math 131 Notes
Unit 1 Descriptive Statistics & Basic Probability
  Chapter 1: Introduction
    Section 1.1: Overview of Statistics (p 2)
    Section 1.2: Data Classification (p 8)
    Section 1.3 Experimental Design (p 15)
      Generating random numbers in Minitab
      Sorting numbers in Minitab (Manual p 11)
      Using Minitab to select a random sample from a dataset stored in columns
      Generating a Sequential set of numbers in Minitab and then selecting randomly from them (Manual p 8)
  Chapter 2: Descriptive Statistics
    Section 2.1 Frequency Distributions and their Graphs (p 32)
      Constructing a Histogram in Minitab (Manual p 37)
      Constructing a Frequency Polygon in Minitab (Manual p 51)
      Constructing an Ogive in Minitab (Manual p 54)
    Section 2.2 More Graphs and Displays (p 46)
      Constructing a stem-and-leaf chart in Minitab (Manual p 45)
      Constructing a Pie Chart in Minitab (Manual p 25)
      Constructing a Pareto (Bar) Chart in Minitab (Manual p 15)
    Section 2.3 Measures of Central Tendency (p 57)
      Finding Measures of Central Tendency in Minitab (Manual p 67)
      Using Minitab to Obtain Frequency of Individual Variables
    Section 2.4 Measures of Variation (p 70)
      Finding Measures of Variation in Minitab
    Section 2.5 Measures of Position (p 87)
      Finding Quartiles in Minitab (Manual p 88)
      Constructing a Boxplot in Minitab (Manual p 90)
      Using Minitab to Compute z-scores (Manual p 86)
  Chapter 3 Probability (p 109)
    Section 3.1 Basic Concepts of Probability
Unit 2 Probability & Probability Distributions
    Section 3.2 Conditional Probability and the Multiplication Rule (p 121)
    Section 3.3 The Addition Rule
      Simulating the Birthday Problem in Minitab
    Section 3.4 Counting Principles (p 140)
  Chapter 4 Discrete Probability Distributions (p 161)
    Section 4.1 Probability Distributions (p 162)
    Section 4.2 Binomial Distributions (p 174)
      Constructing a binomial Distribution using Minitab (Manual p 128)
  Chapter 5 Normal Probability Distributions (p 205)
    Section 5.1 Introduction to Normal Distributions (p 206)
    Section 5.2 The Standard Normal Distribution (p 214)
    Section 5.3 Normal Distributions: Finding Probabilities
      Using Minitab to find the probability that a normally distributed random variable is less than a specified value (Manual p 157)
      Using Minitab to find the probability that a normally distributed random variable is between two specified values (Manual p 159)
    Section 5.4 Normal Distributions: Finding Values (p 229)
    Section 5.5 The Central Limit Theorem (p 238)
    Section 5.6 Normal Approximations to Binomial Distributions (p 251)
Unit 3 Inferential Statistics
  Chapter 6 Confidence Intervals (p 269)
    Section 6.1 Confidence Intervals for the Mean (Large Samples)
      Using Minitab to find the Confidence Interval with a Sample in a Column for a Normal Distribution (Manual p 183)
      Using Minitab to find the Confidence Interval with Summarized Data for a Normal Distribution
    Determining Sample Size (p 276)
    Section 6.2 Confidence Intervals for the Mean (Small Samples) (p 284)
    Summary of when the normal distribution or the t-distribution can be used (p 288)
      Using Minitab to find the Confidence Interval for a t-Distribution with the Sample in a Column (Manual p 193)
      Using Minitab to find the Confidence Interval with Summarized Data for a t-Distribution
    Section 6.3 Confidence Intervals for Population Proportions (p 293)
  Chapter 7 Hypothesis Testing with One Sample
    Section 7.1 Introduction to Hypothesis Testing (p 321)
      Alternative Hypothesis
      Area of Normal Curve
    Section 7.2 Hypothesis Testing for the Mean (Large Samples) (p 334)
      Using Minitab for Hypothesis testing for the mean with summarized data from a large sample
    Section 7.3 Hypothesis Testing for the Mean (Small Samples) (p 350)
      Using Minitab to perform Hypothesis testing for the mean with summarized data when the sample is small (Manual p 211)
    Section 7.4 Hypothesis Testing for Proportions (p 360)
      Using Minitab to perform Hypothesis testing for a proportion with summarized data (Manual p 215)
  Chapter 9 Correlation and Regression
    Section 9.1 Correlation (p 442)
      Using Minitab to draw a scatter plot (Manual p 93)
      Using Minitab to find the Correlation Coefficient (Manual p 95)
      Using Minitab to determine whether the correlation coefficient is significant (Manual p 95)
    Section 9.2 Linear Regression (p 458)
      Using Minitab to find the Least Squares Regression Equation (p 98)
      Using Minitab to find the Regression equation and a predicted value for the Old Faithful Data (p 460)
      Using Minitab to draw the least squares regression line on the scatter plot for the Old Faithful Data
  Chapter 10 Chi-Square Tests and the F-Distribution (p 493)
    Section 10.1 Goodness of Fit
      Using Minitab to perform the Chi-Square Goodness-of-Fit Test (Manual p 237)
    Chi-Square with M&M’s
    Section 10.2 Independence (p 504)
      Using Minitab to perform the Chi-Square Independence Test (Manual p 242)

Unit 1 Descriptive Statistics & Basic Probability

Chapter 1: Introduction

Section 1.1: Overview of Statistics (p 2)
• Data consists of information coming from observations, counts, measurements, or responses. The singular of data is datum. (p 2)
• Statistics is the science of collecting, organizing, analyzing and interpreting data in order to make decisions. (p 3)
• A population is a collection of all outcomes, responses, measurements, or counts that are of interest. (p 3)
• A sample is a subset of a population. (p 3)
• A parameter is a numerical description of a population characteristic. (p 4)
• A statistic is a numerical description of a sample characteristic. (p 4)
• Descriptive statistics is the branch of statistics that involves the organization, summarization, and display of data. (p 5)
• Inferential statistics is the branch of statistics that involves using a sample to draw conclusions about a population. A basic tool in the study of inferential statistics is probability. (p 5)

Section 1.2: Data Classification (p 8)
• Qualitative data consist of attributes, labels, or nonnumerical entries. (p 8)
• Quantitative data consist of numerical entries or counts.
• Nominal level of measurement: qualitative. (p 9)
• Ordinal level of measurement: qualitative or quantitative, can be ordered, but differences are not meaningful.
• Interval level of measurement: quantitative, can be ordered, differences are meaningful, no inherent zero (e.g. 0 degrees is just a position on the temperature scale). (p 10)
• Ratio level of measurement: quantitative, can be ordered, differences are meaningful, inherent zero (e.g. 0 dollars).

Section 1.3 Experimental Design (p 15)
Guidelines for designing a statistical study (p 15)
1. Identify the variable(s) of interest (the focus) and the population of study.
2. Develop a detailed plan for collecting data. If you use a sample, make sure it is representative.
3. Collect the data.
4. Describe the data using descriptive techniques.
5. Interpret the data and make decisions about the population using inferential statistics.
6. Identify any possible errors.
Data can be collected as follows (p 15-16)
• Census: A count or measure of the entire population.
• Sampling: A count or measure of part of the population.
• Simulation: Using a mathematical or physical model.
• Experiment: A treatment is applied to part of a population and responses are observed. A second part of the population is often used as a control group and given no treatment or a placebo.

Sampling techniques: (p 17-19)
• Random sample: Select the sample randomly from the entire population.
• Stratified sample: Break the population into subsets called strata (e.g. ethnicity) and take a random sample from each stratum.
• Cluster sample: Break the population into groups called clusters (e.g. zip codes), then randomly select clusters and select all the members of each selected cluster.
• Systematic sample: Assign a number to each member of the population, randomly pick a starting number, then select members at a fixed interval from that starting number.
A convenience sample is not reliable!

Generating random numbers in Minitab
Calc->Random Data->Integer. For Generate, enter the number of random numbers (e.g. the sample size); Store in column(s): C1; Minimum of: 1; Maximum of: the population size.
Note that this way of generating random numbers can give repeats. Also, this method is not described in the Minitab Manual. An easy way to eliminate repeats is to sort the numbers so that the repeats appear sequentially, then delete the repeats.

Sorting numbers in Minitab (Manual p 11)
To sort data: Data->Sort. Select the column to sort, choose By Column (usually the same column as the one being sorted), and choose where to Store sorted data in (usually the original column).

Using Minitab to select a random sample from a dataset stored in columns
Calc->Random Data->Sample from columns. Sample the sample size (e.g. 40) rows from column(s). Select the columns the data are stored in (e.g. C1 C2 C3). Store samples in (usually just overwrite the original columns, e.g. C1 C2 C3).
Click OK.
Note that the default way of sampling from columns in Minitab is without replacement. The dialog box allows you to choose Sample with replacement, but we usually do not want this.

Generating a Sequential set of numbers in Minitab and then selecting randomly from them (Manual p 8)
To generate a sequential number for each member of the population and store it in C1: Calc->Make Patterned Data->Simple Set of Numbers. Choose Store patterned data in C1, From first value 1, To last value population size, In steps of 1. Click OK.
To select numbers randomly from these numbers: Calc->Random Data->Sample from columns. Sample number of rows from column(s) C1, Store samples in C2. Click OK.

Chapter 2: Descriptive Statistics

Section 2.1 Frequency Distributions and their Graphs (p 32)
A frequency distribution is a table that shows classes or intervals of data entries with a count of the number of entries in each class. The frequency, f, of a class is the number of data entries in the class.
• Midpoint of a class = (lower limit + upper limit)/2
• Relative frequency = class frequency/sample size
• Cumulative frequency = sum of the frequencies of the class and all previous classes

Guidelines for Constructing a Frequency Distribution from a Data Set (p 32)
1. Decide the number of classes. To detect patterns, this should be between 5 and 20.
2. Find the width of each class by dividing the range by the number of classes and rounding up.
3. Find the class limits. The minimum entry can be the lower limit of the first class. To find the remaining lower limits, add the width to the lower limit of the preceding class.
4. Make a tally mark for each data entry in the row of the appropriate class.
5. Count the tallies to find the frequency of each class.

The following are techniques for representing quantitative data:
• A frequency histogram is a bar graph that represents the frequency distribution of the data set.
• A frequency polygon is a line graph that represents the frequency distribution of the data set.
• A relative frequency histogram is similar to a frequency histogram except that it plots relative frequencies (i.e. the portion or percent of the data that falls in each class). (p 34)
• An ogive is a cumulative frequency graph (i.e. the frequencies of succeeding classes are added up). (p 39)
• A stem-and-leaf plot is a plot in which each number is represented as a stem (e.g. the leftmost digits) and a leaf (e.g. the rightmost digit). (p 46)

Using the Internet Usage data on p 33 as an example, we will create some of these graphs. First we will sort the data using Minitab. Place the Usage data in column C1 of a new Worksheet, then sort the data to make it easier to find the frequency of each class.
Data -> Sort -> Sort column C1 by column C1. Choose Store sorted data in original column.

InterUse
7, 7, 11, 17, 17, 18, 19, 20, 21, 22, 23, 28, 29, 29, 30, 30, 31, 31, 33, 34, 36, 37, 39, 39, 39, 40, 41, 41, 42, 44, 44, 46, 50, 51, 53, 54, 54, 56, 56, 56, 59, 62, 67, 69, 72, 73, 77, 78, 80, 88

Divide the range by the number of classes: $\frac{88 - 7}{7} = \frac{81}{7} \approx 11.57$. Rounding up gives a class width of 12.
This gives boundaries of 7, 19, 31, 43, 55, 67, 79, 91.
NOTE: We will have the classes running from 7 to 19, etc., where the upper bound is exclusive, i.e. the first class does not include 19.

Class     Freq   Rel freq   Cum freq   Cum rel freq
7 - 19      6      0.12         6         0.12
19 - 31    10      0.20        16         0.32
31 - 43    13      0.26        29         0.58
43 - 55     8      0.16        37         0.74
55 - 67     5      0.10        42         0.84
67 - 79     6      0.12        48         0.96
79 - 91     2      0.04        50         1.00

We can now use the Freq or the Rel freq column to construct our histogram. We can label the x-axis with either the class boundaries 7, 19, 31, 43, 55, 67, 79, 91 or the class midpoints 13, 25, 37, 49, 61, 73, 85. Note that the first class midpoint = (7 + 19)/2 = 13, and the rest can be obtained by adding 12 repeatedly. The histograms are similar to those of the text on page 36, except that the text uses the boundaries 6.5, 18.5, 30.5, etc. and the midpoints 12.5, 24.5, 36.5, etc.
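The frequency table above can also be checked outside Minitab. The following is a minimal Python sketch, not part of the course materials; the variable names are illustrative, and it assumes the same convention as these notes (7 classes, lower bound inclusive, upper bound exclusive).

```python
import math

# Internet usage data (minutes) as listed in these notes
usage = [7, 7, 11, 17, 17, 18, 19, 20, 21, 22, 23, 28, 29, 29, 30, 30,
         31, 31, 33, 34, 36, 37, 39, 39, 39, 40, 41, 41, 42, 44, 44, 46,
         50, 51, 53, 54, 54, 56, 56, 56, 59, 62, 67, 69, 72, 73, 77, 78,
         80, 88]

num_classes = 7
n = len(usage)

# Class width: range divided by number of classes, rounded up
width = math.ceil((max(usage) - min(usage)) / num_classes)

# Lower boundary of each class, starting at the minimum entry
lower = [min(usage) + i * width for i in range(num_classes)]

# Frequency of each class: lower bound inclusive, upper bound exclusive
freq = [sum(1 for x in usage if lo <= x < lo + width) for lo in lower]

rel_freq = [f / n for f in freq]
midpoints = [lo + width / 2 for lo in lower]

# Cumulative frequency: running total of the class frequencies
cum_freq = []
running = 0
for f in freq:
    running += f
    cum_freq.append(running)

print("Class      Freq  Rel freq  Cum freq  Cum rel freq")
for lo, f, rf, cf in zip(lower, freq, rel_freq, cum_freq):
    print(f"{lo:2d} - {lo + width:2d}  {f:5d}  {rf:8.2f}  {cf:8d}  {cf / n:12.2f}")
```

Changing num_classes or the boundary convention will change the counts, which is why the Minitab histogram (which uses the exact width 81/7) can differ from this table.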
Constructing a Histogram in Minitab (Manual p 37)
Graph->Histogram. Select Simple from the Histogram Dialog Box and click OK. Select the column in the Simple Histogram Dialog Box. Click on Scale and, under the Y-Scale Type tab, choose either Frequency or Percent. Click Labels and, under the Data Labels tab, click Use y-value labels. Click OK, OK. (Minitab includes a default title, but you can click on Labels in the Simple Histogram Dialog Box to enter your own title.)
The histogram should be modified to include our breakpoints and bins. Place the cursor near the X-axis so that the screen tip says X-scale, then right-click and choose Edit X Scale. Under the Binning tab choose Cutpoint and under Interval Definition set the number of intervals to 7. (Note that Minitab chooses 9 as the default for this data, but we set it to 7 as specified in the text.)
The following is the frequency histogram plotted by Minitab for the Internet Usage data on p 33.

[Figure: Histogram of InterUse. x-axis: InterUse, with cutpoints 7, 18.571, 30.143, 41.714, 53.286, 64.857, 76.429, 88; y-axis: Frequency.]

Note that Minitab uses the exact value of 81/7 to find the class boundaries.

Constructing a Frequency Polygon in Minitab (Manual p 51)
Graph -> Histogram -> Simple, then choose the column. To make a polygon instead of a histogram, click on the Data View button. On the Data Display tab, remove the check mark from Bars and place a check mark on Symbols. Under the Smoother tab, choose Lowess for Smoother, make the Degree of smoothing 0 and the Number of steps 1. Then click OK twice.
The polygon should be modified to include our breakpoints and bins. Right-click on the X-axis and choose Edit X Scale. Under the Binning tab choose Cutpoint and under Interval Definition set the number of intervals to 7.
The results for the Internet Usage Data on p 33 are:

[Figure: frequency polygon of the Internet Usage data (Minitab titles it Histogram of InterUse). x-axis: InterUse, with cutpoints 7, 18.571, 30.143, 41.714, 53.286, 64.857, 76.429, 88; y-axis: Frequency.]

Constructing an Ogive in Minitab (Manual p 54)
We will construct an ogive with the data on internet usage presented in the text on p 33. To make it a little easier, we will make the classes go from 7 to 19 exclusive, etc., instead of from 6.5 to 18.5, etc. as the book does.
Minitab doesn’t have an automatic ogive function. All it can do is plot the class limits and the cumulative frequencies. So the procedure is to do all the calculations ourselves, enter them in Minitab and tell Minitab to plot them.
To use Minitab to plot the ogive, in a new worksheet make the Upper Class Boundaries column C1 and the Cum rel freq column C2. Then proceed as follows:
Select Graph -> Scatterplot -> With Connect Line. Select C2 for the Y-variable and C1 for the X-variable. Click on the Data View button and be sure that both Symbols and Connect line are selected. With both selected, Minitab will connect the dots at each data point on the graph. Click on Labels and title the ogive ‘Ogive of Internet Usage in Minutes’. To label the points, click the Data Labels tab and choose Use y-value labels. Click OK.
After the graph is created, it should be edited to show each upper class limit. Right-click on the X-axis of the graph and select Edit X Scale. Enter the Position of ticks as 19: 91/12. This tells Minitab that the tick marks should go from 19 to 91 in steps of 12. (We could have made it 7: 91/12 but this would indicate that 7 ...)
The results are:

[Figure: Ogive of Internet Usage in Minutes. x-axis: Class Boundaries (7, 19, 31, 43, 55, 67, 79, 91); y-axis: Cum Rel Freq; plotted values 0.00, 0.12, 0.32, 0.58, 0.74, 0.84, 0.96, 1.00.]

Section 2.2 More Graphs and Displays (p 46)
Section 2.1 discussed traditional ways to display quantitative data.
A stem-and-leaf plot is a newer way. In a stem-and-leaf plot, each number is separated into a stem (e.g. the leftmost digits) and a leaf (e.g. the rightmost digit). Two advantages of the stem-and-leaf plot are that it provides an easy way to sort the data and that the graph contains the original data.
The following table shows the stem-and-leaf plot for the first row of data on page 46 in the text. The leaf is the last digit of each number and the stem is the first two digits:

Stem   Leaf
10     5
11     6 4
12     9 6
13     0
14     4 5
15     5 9

The stem-and-leaf plot can also have two rows for each stem, one for leaves from 0 to 4 and the other for leaves from 5 to 9. (p 47) This increases the refinement of the graph.

Constructing a stem-and-leaf chart in Minitab (Manual p 45)
Graph->Stem-and-Leaf->Select the column, click OK.
Minitab presents an ordered stem-and-leaf plot. The results are presented as follows: First, the number of items and the Leaf Unit are given. The Leaf Unit is explained below. Then the stem-and-leaf plot is presented:
• The first column is a cumulative count. For rows above the row containing the median, it is the number of data points counted from the first row down through that row. For the row containing the median, it is the number of points in that row, shown in parentheses. For rows below the row containing the median, it is the number of data points counted from the last row up through that row.
• The second column is the stem. There may be several rows for a stem, the first for lower valued leaves, etc. The stem value is multiplied by 10 times the Leaf Unit.
• The third column contains the leaf values. The leaf values may be actual values or they may be truncated. The leaf values are multiplied by the Leaf Unit, so that if the Leaf Unit is 1, the leaf values represent actual values (e.g. if the Leaf Unit is 1, a stem value of 3 and a leaf value of 7 indicate an actual value of 37).
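The stem and Leaf Unit rules above can be sketched in a few lines of Python. This is only an illustration of the arithmetic described in these notes, not Minitab's actual implementation; the function name split_stem_leaf is hypothetical, and the sketch assumes nonnegative whole-number data.

```python
def split_stem_leaf(value, leaf_unit):
    """Return (stem, leaf) for a nonnegative integer value.

    Digits smaller than the leaf unit are truncated (not rounded),
    the leaf is the digit in the leaf-unit position, and the stem is
    everything to the left of the leaf.
    """
    truncated = value // leaf_unit   # drop digits below the leaf unit
    return truncated // 10, truncated % 10


for value, unit in [(37, 1), (179, 10), (1234, 1000)]:
    stem, leaf = split_stem_leaf(value, unit)
    # Value represented in the plot: stem * (10 * Leaf Unit) + leaf * Leaf Unit
    represented = stem * 10 * unit + leaf * unit
    print(f"value={value}, leaf unit={unit}: stem={stem}, leaf={leaf}, "
          f"represented as {represented}")
```

With a Leaf Unit of 10, for instance, 179 is split into stem 1 and leaf 7 and is therefore represented as 170.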
For example, consider the following data: 100, 120, 140, 145, 179, 190, 200. The results are:

Stem-and-leaf of C1   N = 7
Leaf Unit = 10
 1   1   0
 2   1   2
(2)  1   44
 3   1   7
 2   1   9
 1   2   0

In this example the Leaf Unit is 10, so the leaves are multiplied by 10 and the stems are multiplied by 100. E.g. 120 = 100*1 + 10*2 (the second row).
The median is 145, so the row representing 140 and 145 is the median row, indicated by the parentheses. The numbers 145 and 179 are truncated, e.g. 179 is represented by 100*1 + 10*7 = 170.

Another example: 145, 179, 190, 200, 350, 380, 400, 555, 700, 900

Stem-and-leaf of C1   N = 10
Leaf Unit = 10
 3   1   479
 4   2   0
(2)  3   58
 4   4   0
 3   5   5
 2   6
 2   7   0
 1   8
 1   9   0

Note that stem values 6 and 8 have no leaves, indicating that there are no such values.

Another example: 900, 1234, 1468, 5432, 5789, 7777, 8500, 9765 is presented as follows:

Stem-and-leaf of C1   N = 8
Leaf Unit = 1000
 3   0   011
 3   0
(2)  0   55
 3   0   7
 2   0   89

The first value, 900, is represented by the first 0 in the leaf column. Since the stem is 0, the first value is represented as 0*1000 = 0. The second value, 1234, is represented by the first 1 in the leaf column, so the second value is represented as 1*1000 = 1000. The 0 in the Stem column contributes 0*10000 = 0, so it does not change the value of the data point.

Two common techniques for graphing qualitative data are pie charts and Pareto (bar) charts.
A pie chart is a convenient way of showing qualitative data. A pie chart is a circle with slices proportional to the relative frequency of each category.

Constructing a Pie Chart in Minitab (Manual p 25)
Method 1: Used when we have a categorical variable specified in each row of a column, e.g.
Grades
A
A
A
B
B
C

To graph the frequency of a categorical variable (note that the Manual does not describe this technique): Graph->Pie Chart. Choose Chart Raw Data. Click on Labels, choose the Slice Labels tab and click Category name (you can also click Frequency and/or Percent). (You can also click on the Titles/Footnotes tab and enter a different title from the Minitab default.) Click OK, OK.
The results are:

[Figure: Pie Chart of Grades. Slices: A 3 (50.0%), B 2 (33.3%), C 1 (16.7%).]

To graph a variable based on another categorical variable (Manual p 25), e.g. for the following data from p 50 of the text:

Causes of Shrinkage     $million
Employee Theft          15.6
Shoplifting             14.7
Administrative Error     7.8
Vendor Fraud             2.9

Graph->Pie Chart. Click on Chart values from a table, choose the Categorical variable (C1 Causes of Shrinkage) and the Summary variable (C2 $million). Click on Labels, choose the Slice Labels tab and click Category name (you can also click Frequency and/or Percent). Click OK, OK.
The results are:

[Figure: Pie Chart of $million vs Causes of Shrinkage. Slices: Employee Theft 15.6 (38.0%), Shoplifting 14.7 (35.9%), Administrative Error 7.8 (19.0%), Vendor Fraud 2.9 (7.1%).]

A Pareto chart (a kind of bar chart) is another way of showing qualitative data. A Pareto chart is a graph in which the categories are plotted horizontally and the frequencies are plotted vertically, with the bars arranged in order of decreasing height.

Constructing a Pareto (Bar) Chart in Minitab (Manual p 15)
We will use the Grades example for the first technique.
To graph the frequency of a categorical variable: Graph->Bar Chart. For Bars represent choose Counts of unique values, choose Simple, choose the variable to graph (C1 Grades), and click OK.

[Figure: Chart of Grades. x-axis: Grades (A, B, C); y-axis: Count.]

Next we want to graph a summary variable against a categorical variable.
(Manual p 17) We will use the same example as we used for the Pie Chart (“Causes of Inventory Shrinkage” from the text, p 50).
Graph->Bar Chart. For Bars represent choose Values from a table, choose Simple and click OK. Choose the Graph Variable ($million) and the Categorical Variable (Causes of Shrinkage). To place the values above the bars, click Labels, choose the Data Labels tab and choose Use y-value labels. Click OK, OK.

[Figure: Chart of $million vs Causes of Shrinkage. x-axis: Causes of Shrinkage; y-axis: $million; bar labels: Employee Theft 15.6, Shoplifting 14.7, Administrative Error 7.8, Vendor Fraud 2.9.]

Section 2.3 Measures of Central Tendency (p 57)
• Population mean: $\mu = \frac{\sum x}{N}$. The population mean is also called the expected value: $E(X) = \sum_{i=1}^{N} x_i \, p(x_i)$. If each element has the same probability of being selected, $p(x_i) = \frac{1}{N}$.
• Sample mean: $\bar{x} = \frac{\sum x}{n}$
• Median: The middle entry of the ordered data (the mean of the two middle entries if the number of entries is even).
• Mode: The entry that occurs the greatest number of times, if there is one.

The weighted mean is the mean of a dataset whose entries have varying weights: $\bar{x} = \frac{\sum (x \cdot w)}{\sum w}$. Usually $\sum w = 1$, so that $\bar{x} = \sum (x \cdot w)$.
For example, for this course the final grade = 0.2*lab + 0.2*test1 + 0.3*test2 + 0.3*test3.

An estimate of the mean can be obtained from the frequency distribution as follows: $\bar{x} = \frac{\sum (x \cdot f)}{n}$, where x is the midpoint and f is the frequency of a class. (p 62)
Example 8 on page 62 estimates the mean for Internet Usage this way and finds it to be 41.8.

The following are the general shapes that distributions can take on:
• Symmetric: The histogram has approximately mirror images on both sides of a vertical line in the middle; mean, median and mode are about the same.
• Uniform: The histogram is flat; mean and median are about the same.
• Skewed left: The mean is less than the median and mode.
• Skewed right: The mean is more than the median and mode.
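The measures defined in this section can be illustrated with a short Python sketch using the standard statistics module. This is only an illustration and not part of the course materials: the scores and grades below are hypothetical, while the weights come from the course grading example and the midpoints and frequencies are the text's Internet Usage classes (midpoints 12.5, 24.5, etc.).

```python
from statistics import mean, median, multimode

# Hypothetical scores, used only to illustrate mean, median and mode
scores = [70, 85, 85, 90, 100]
print("mean:", mean(scores))
print("median:", median(scores))
print("mode(s):", multimode(scores))

# Weighted mean: sum(x * w) / sum(w), with the course weights from the notes
weights = {"lab": 0.2, "test1": 0.2, "test2": 0.3, "test3": 0.3}
grades = {"lab": 90, "test1": 80, "test2": 70, "test3": 100}  # hypothetical
final_grade = (sum(grades[k] * weights[k] for k in weights)
               / sum(weights.values()))
print("final grade:", round(final_grade, 2))

# Mean estimated from a frequency distribution: sum(x * f) / n,
# where x is the class midpoint and f is the class frequency
midpoints = [12.5, 24.5, 36.5, 48.5, 60.5, 72.5, 84.5]
freqs = [6, 10, 13, 8, 5, 6, 2]
grouped_mean = sum(x * f for x, f in zip(midpoints, freqs)) / sum(freqs)
print("estimated mean of Internet Usage:", round(grouped_mean, 1))
```

The grouped estimate rounds to 41.8, matching the value quoted from Example 8; it is only an approximation because every entry in a class is treated as if it sat at the class midpoint.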
Finding Measures of Central Tendency in Minitab (Manual p 67)
Using exercise 19 p 65 (EX2_3-19.MTP) as an example.
Minitab: Stat->Basic Statistics->Display Descriptive Statistics->(Choose the column)->Click on Statistics…->choose the statistics from the Dialog box (note that there is no Mode choice, although Minitab can help find the mode as discussed below). Click OK, OK.

Results for: EX2_3-19.MTP
Descriptive Statistics: Points per game
Variable          Mean     Median
Points per game   97.000   97.200

The results for the mean and median are the same as those presented in the answers on p A50.

Using Minitab to Obtain Frequency of Individual Variables
To determine the frequency of the individual values of a variable in Minitab, click Stat -> Tables -> Tally Individual Variables, check Counts, and click OK. The value with the highest frequency is the mode.
The results in the Session window show that 94.8, 95.4, 97.2 and 103.1 appear twice, while the other scores appear only once, so these four values are the modes. This agrees with the answer on p A50.

Section 2.4 Measures of Variation (p 70)
• Range = Max entry - Min entry
• Deviation of an entry x: $x - \mu$
• Population variance: $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$. Note that $Var(X) = E[(X - \mu)^2] = \sum_{i=1}^{N} (x_i - \mu)^2 \, p(x_i)$. If each element has an equal chance of being selected, $p(x_i) = \frac{1}{N}$.
• Population standard deviation: $\sigma = \sqrt{\sigma^2}$
• Sample variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
• Sample standard deviation: $s = \sqrt{s^2}$

Why do we divide by n - 1 and not by n when we define the sample variance? The reason is that for random samples from an infinite population, this makes $s^2$ an unbiased estimator of $\sigma^2$, i.e. $E(s^2) = \sigma^2$. This is proven in Freund (p 216). Freund notes, however, that $s$ is not an unbiased estimator of the standard deviation $\sigma$. Also, for a finite population as defined in Freund on p 182, $s^2$ is not an unbiased estimator of the variance.
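The effect of dividing by n - 1 can be checked exactly on a tiny example. The sketch below is only an illustration under stated assumptions, not a proof and not taken from Freund: the population [1, 2, 3] is hypothetical, samples of size 2 drawn with replacement stand in for independent draws from an infinite population, and samples drawn without replacement stand in for the finite-population case.

```python
from itertools import combinations, product
from statistics import mean, pvariance, variance

population = [1, 2, 3]            # hypothetical tiny population
sigma2 = pvariance(population)    # population variance (divides by N)

# Every equally likely ordered sample of size 2 drawn WITH replacement
with_replacement = list(product(population, repeat=2))
avg_s2 = mean(variance(s) for s in with_replacement)       # divides by n - 1
avg_div_n = mean(pvariance(s) for s in with_replacement)   # divides by n

# Every sample of size 2 drawn WITHOUT replacement (finite population)
without_replacement = list(combinations(population, 2))
avg_s2_finite = mean(variance(s) for s in without_replacement)

print("population variance:            ", round(sigma2, 4))
print("average s^2, with replacement:  ", round(avg_s2, 4))
print("average using n, with replace.: ", round(avg_div_n, 4))
print("average s^2, without replace.:  ", round(avg_s2_finite, 4))
```

Averaged over all samples drawn with replacement, the n - 1 version equals the population variance while the n version is too small; without replacement, the n - 1 version no longer matches, in line with the remark about finite populations.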
Finding Measures of Variation in Minitab

Using the Try it Yourself Example on p 74 (TIY2_4-5.MTP).

Stat->Basic Statistics->Display Descriptive Statistics->(Choose the Column)->Click on Statistics…->choose the stats from the Dialog box. Click OK, OK. The results are as follows:

Descriptive Statistics: Rental rates

Variable        N   N*    Mean   StDev
Rental rates   20    0  37.888   3.979

These results are the same as those given in the Try it Yourself appendix on p A32.

Empirical Rule (p 76): For data with a symmetric bell-shaped distribution, about 68% of data lies within 1 standard deviation of the mean, about 95% lies within 2 standard deviations of the mean, and about 99.7% lies within 3 standard deviations of the mean.

Chebychev's Theorem (p 77): The portion of any data set lying within k standard deviations (k > 1) of the mean is at least \( 1 - \frac{1}{k^2} \). Mathematically this is:

\[ P(|X - \mu| < k\sigma) \ge 1 - \frac{1}{k^2} \]

For example, at least 75% of the data lies within 2 standard deviations of the mean.

Sample standard deviation. In a sample of grouped data which has much repeated data (such as number of children per household presented in the example on p 78), the formula for the standard deviation can be simplified as follows:

\[ s = \sqrt{\frac{\sum (x - \bar{x})^2 f}{n - 1}} \]

Also, as in the case with the mean (p 62), this formula can be used as an approximation with x being the midpoint and f the frequency of each class.

Section 2.5 Measures of Position (p 87)

DEFINITIONS (p 87)

• Fractiles are data values that divide an ordered set into equal parts.
• The median is a fractile because about one half the data lies below it and one half above.
• Quartiles divide the set into four parts. About one quarter of the data falls on or below the first quartile \( (Q_1) \), half below the second quartile (the median) and three fourths below the third quartile \( (Q_3) \).
• Deciles divide the data into ten parts.
• Percentiles divide the data into 100 parts, e.g. 90% of the data falls below the 90th percentile.
To find the fractiles, first order the data, then count the number of elements. For example, sorting the data in example 1 on page 87 (CPR Test Scores) gives:

5 7 9 10 11 13 14 15 16 17 18 18 20 21 37

The first, second and third quartiles are 10, 15 and 18. Just as with the median, Q1 and Q3 may fall between two actual items. In our example we choose 10 for Q1 because 4/15 = .27 of the data are less than or equal to it.

Finding Quartiles in Minitab (Manual p 88)

Using Example 2, p 88 (TIY2_5-2.MTP) as an example.

Stat->Basic Statistics->Display Descriptive Statistics->Choose column C1->Statistics…->choose First quartile, Median and Third quartile. Click OK. The results are as follows:

Results for: TIY2_5-2.MTP
Descriptive Statistics: Tuition Costs

Variable          Q1   Median     Q3
Tuition Costs  17.00    23.00  28.50

This is the same as the answer on page A33.

DEFINITION (p 89) The Interquartile Range: \( IQR = Q_3 - Q_1 \)

Note that Minitab also displays the IQR in the same way we displayed the other statistics. The IQR for the data from example 1 is 18 - 10 = 8 (example 3 p 89).

A Box-and-Whisker Plot (or simply a Boxplot) consists of a line from the minimum entry to Q1, a box from Q1 to Q3 and a line from Q3 to the maximum entry. It gives a representation of how much of the data is in the middle.

GUIDELINES (p 90)
1. Find the five-number summary: Min, Q1, M, Q3, Max.
2. Construct the horizontal scale that spans the range of the data.
3. Plot the five numbers on the horizontal scale.
4. Draw a box above the horizontal scale from Q1 to Q3 and draw a vertical line in the box at M.
5. Draw whiskers from the box to Min and Max.

Example 4 on p 90 gives the Box-and-Whisker Plot of the data in example 1. The five numbers are 5, 10, 15, 18, 37.

[Figure: Box-and-Whisker Plot of the CPR test scores on a horizontal scale running from 5 to 37.]

If a whisker or box is short, this indicates that the data is concentrated in this range.
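The five-number summary for the CPR scores can be reproduced with a short Python sketch. It assumes one common convention, namely that Q1 and Q3 are the medians of the lower and upper halves of the ordered data, with the overall median excluded from both halves when the number of entries is odd. Minitab and other software may interpolate differently, so results can differ on other datasets. The function name is made up for this sketch.

```python
from statistics import median


def five_number_summary(data):
    """Return (Min, Q1, M, Q3, Max) using the median-of-halves convention."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n the middle element (the median) is left out of both halves.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


# CPR test scores from example 1 (p 87).
cpr_scores = [5, 7, 9, 10, 11, 13, 14, 15, 16, 17, 18, 18, 20, 21, 37]
summary = five_number_summary(cpr_scores)
iqr = summary[3] - summary[1]

print(summary, iqr)
```

For this dataset the sketch gives the same five numbers and the same IQR as the notes.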
This boxplot for our example indicates that one quarter of the data is concentrated between 15 (the median) and 18 (the third quartile). This is confirmed by the following histogram:

[Figure: Histogram of CPR Scores. Horizontal axis: CPR Scores (5 to 40); vertical axis: Frequency (0 to 5).]

Comparing Boxplot and Histogram for Internet Usage

[Figure: Histogram of InterUse. Horizontal axis: InterUse, with class boundaries 7.00, 18.57, 30.14, 41.71, 53.29, 64.86, 76.43 and 88.00; vertical axis: Frequency (0 to 12).]

[Figure: Boxplot of InterUse. Horizontal axis: InterUse (0 to 90).]

Notice that where the histogram has the highest bar (between 30 and 40) is where the boxplot has the narrowest box. This is because many sample points are crowded in this region: enough to constitute a quartile. Note also that if the left whisker and the left part of the box are longer than those on the right, the data is skewed to the left (a long tail on the left), and if the right part of the box and the right whisker are longer, the data is skewed to the right.

Constructing a Boxplot in Minitab (Manual p 90)

Minitab is demonstrated with exercise 33, page 96 (EX2_5-33.MTP).

Method 1 (This is given as the second method on Manual p 93, but it seems more obvious to me.)

Click on Graph->Boxplot and select Simple boxplot. Click on OK. Select C1 for the Graph variable. To view a horizontal boxplot (rather than a vertical one) click on Scale and select Transpose value and category scales. Click on OK twice. The result is shown below.

Note that the meaning of the whisker in Minitab seems to differ from what is stated in the book. In the book the whisker extends to the smallest and largest element, whereas in Minitab there is a concept of an outlier, which is a sample value that is much larger or smaller than the rest. So, if there is no outlier, the whiskers extend to the largest and smallest items as defined in the book. But if there is an outlier, it is indicated with an asterisk and the whisker does not extend to it.
The answer to EX2_5-33.MTP is shown on page A52 of the book and the whisker extends to the largest age (82). In the Minitab results below, the whisker does not extend to this age and it is presented with an asterisk.

[Figure: Boxplot of Ages of Executives. Horizontal axis: Ages of Executives (20 to 80). Minitab result of exercise 33, page 96 (EX2_5-33.MTP).]

Method 2 (Note, this does not give the option of constructing a horizontal boxplot.)

Stat->Basic Statistics->Display Descriptive Statistics. Select the column, click on the Graphs button and select Boxplot of Data.

DEFINITION (p 92) The standard score or z-score represents the number of standard deviations a given value x falls from the mean μ. That is:

\[ z = \frac{x - \mu}{\sigma} \]

Example 6 on p 92 calculates the z-score for speeds on a stretch of highway where the mean is 56 mph and the standard deviation is 4 mph. Someone traveling 47 miles per hour has the following z-score:

\[ z = \frac{47 - 56}{4} = -2.25 \]

Chebychev's theorem with k = 2.25 tells us that at most \( 1/2.25^2 \approx 19.8\% \) of drivers are at least this far from the average of 56 mph (the cruder bound with k = 2 gives at most 25%).

Someone driving 68 mph is 3 standard deviations above the mean. Chebychev's theorem tells us that at most 1/9 = 11.1% of drivers drive this far from the mean.

Using Minitab to Compute z-scores (Manual p 86)

Calc -> Standardize. Choose the Input column and the column to Store results in (usually an empty column). Click on Subtract mean and divide by std. dev., click OK. The results for each value in the input column are stored in the column you chose.

Chapter 3 Probability (p 109)

Section 3.1 Basic Concepts of Probability

DEFINITION (p 110) A probability experiment is an action, or trial, through which specific results (counts, measurements or responses) are obtained. The result of a single trial in a probability experiment is an outcome. The set of all possible outcomes of a probability experiment is the sample space. An event consists of one or more outcomes and is a subset of the sample space.
Example 1 (p 110) The experiment consists of tossing a coin then rolling a die. The sample space consists of:

      1    2    3    4    5    6
H    H1   H2   H3   H4   H5   H6
T    T1   T2   T3   T4   T5   T6

How many outcomes are there? (There are 2 * 6 = 12.)

Another experiment (p 111) is a survey that asks: Do you agree, disagree, or have no opinion, and what is your gender?

An event that consists of a single outcome is called a simple event (p 111).

DEFINITION (p 112) Classical (or theoretical) probability is used when each outcome in a sample space is equally likely to occur. The classical probability of an event E is given by:

\[ P(E) = \frac{\text{Number of outcomes in } E}{\text{Total number of outcomes in sample space}} \]

Example 3 (p 112) Roll a die: What is the sample space? {1,2,3,4,5,6}
Event A: rolling a 3, p = 1/6 = 0.167. Note this is a simple event.
Event C: rolling < 5, p = 4/6 = 0.667. Note this is not a simple event.

DEFINITION (p 113) Empirical (or statistical) probability is based on observations obtained from probability experiments. The empirical probability of an event E is the relative frequency of event E:

\[ P(E) = \frac{\text{Frequency of event } E}{\text{Total frequency}} = \frac{f}{n} \]

Example: Finding Empirical Probabilities (p 113). Each fish (Bluegill, Redgill, and Crappie) is equally likely to get caught. You catch and release the following.

Fish Type    Number of times caught, f
Bluegill     13
Redgill      17
Crappie      10
             Σf = 40

Probability of catching a bluegill = 13/40 = 0.325

Law of Large Numbers (p 114): As an experiment is repeated over and over, the empirical probability of the event approaches the theoretical (actual) probability of the event. For example, the theoretical probability of getting a head on a fair toss of a coin is 0.5. If you toss the coin 10 times, there's a good chance that you'll get 4 or fewer or 6 or more heads, but if you toss it 1000 times, there's only a very small chance that you'll get 400 or fewer or 600 or more heads. See Example 5 on p 114 for an example about using frequency distributions to find probabilities.
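The coin-toss illustration of the Law of Large Numbers can be made concrete by computing the two chances exactly rather than by simulation. The sketch below assumes a fair coin and independent tosses, so that every sequence of heads and tails is equally likely and classical probability applies; the function name is made up for this sketch.

```python
from math import comb


def prob_heads_outside(n, low, high):
    """P(number of heads <= low or >= high) in n tosses of a fair coin."""
    # comb(n, k) counts the toss sequences with exactly k heads.
    favorable = sum(comb(n, k) for k in range(n + 1) if k <= low or k >= high)
    return favorable / 2 ** n  # 2**n equally likely sequences in total


p_10 = prob_heads_outside(10, 4, 6)          # 4 or fewer, or 6 or more, in 10 tosses
p_1000 = prob_heads_outside(1000, 400, 600)  # 400 or fewer, or 600 or more, in 1000 tosses

print(p_10, p_1000)
```

With 10 tosses the proportion of heads misses 0.5 by at least 0.1 about three times out of four, while with 1000 tosses the same relative miss is extremely unlikely.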
A third type of probability is subjective probability, e.g. predicting a patient's chances for full recovery (p 114).

An important property of probability is that the sum of the probabilities of all outcomes in the sample space is 1. (p 116)

DEFINITION (p 116) The complement of event E is the set of all outcomes in a sample space that are not included in event E. The complement of event E is denoted by E' and is read as "E prime".

For example, the sample space for rolling a die is {1,2,3,4,5,6}. If E is the event that the number is at least 5, the complement is the event that the number is less than 5: E = {5,6}, E' = {1,2,3,4}.

From the above it is clear that:

\[ P(E) + P(E') = 1 \]

We often use a Venn diagram to illustrate the relationship between a sample space, an event E and its complement E'.

Unit 2 Probability & Probability Distributions

Section 3.2 Conditional Probability and the Multiplication Rule (p 121)

DEFINITION (p 121) A conditional probability is the probability of an event occurring, given that another event has already occurred. The conditional probability of event B occurring, given that event A has occurred, is denoted by \( P(B \mid A) \) and is read as "probability of B, given A". (p 121)

DEFINITION (p 122) Two events are independent if the occurrence of one of the events does not affect the probability of the occurrence of the other event. Two events A and B are independent if \( P(B \mid A) = P(B) \) or if \( P(A \mid B) = P(A) \). Events that are not independent are dependent.

Often it is important to determine whether two events are independent. To determine if A and B are independent, calculate \( P(B) \) and \( P(B \mid A) \). If the values are equal, the events are independent. If \( P(B) \ne P(B \mid A) \), then A and B are dependent events. (p 122)

Example (p 122): Select a King from a deck of cards (event K), not replacing it, and then select a Queen (event Q): \( P(Q \mid K) = \frac{4}{51} \), whereas \( P(Q) = \frac{4}{52} \), so the events are dependent. (Also \( P(K) = \frac{4}{52} \).)

The Multiplication Rule for the probability that two events A and B will occur in sequence is \( P(A \text{ and } B) = P(A) \cdot P(B \mid A) \). If events A and B are independent, then the rule can be simplified to \( P(A \text{ and } B) = P(A) \cdot P(B) \). This simplified rule can be extended to any number of independent events. (p 123)

Example (p 123). What is the probability of selecting a King then a Queen?

\[ P(K) \, P(Q \mid K) = \frac{4}{52} \cdot \frac{4}{51} = \frac{16}{2652} \approx 0.006 \]

Another example (Hogg & Craig, p 59): A bowl contains eight chips, three red and five blue. Two chips are drawn successively, at random and without replacement. What is the probability that the first is red and the second is blue? P(R) = 3/8, P(B|R) = 5/7, so P(R and B) = (3/8)*(5/7) = 15/56 = 0.268.

Section 3.3 The Addition Rule

Two events A and B are mutually exclusive if A and B cannot occur at the same time. (p 130)

The probability that events A or B will occur is:

\[ P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B) \]

If events A and B are mutually exclusive, then the rule can be simplified to \( P(A \text{ or } B) = P(A) + P(B) \). This simplified rule can be extended to any number of mutually exclusive events. (p 131)

Example (p 131) Select a card from a deck. What is the probability that it is either a 4 or an Ace? The favorable cards are 4C, 4H, 4D, 4S, AC, AH, AD, AS.

P(4 or A) = P(4) + P(A) = 4/52 + 4/52 = 0.154.

My Example: What is the probability that it is a 4 or a Club?

[Figure: diagram of the overlapping events 4 and Club.]

\[ P(4 \text{ or Club}) = P(4) + P(\text{Club}) - P(4 \text{ and Club}) = \frac{4}{52} + \frac{13}{52} - \frac{1}{52} = \frac{16}{52} = 0.308 \]

Example (p 131): Roll a die. What is the probability that it is < 3 or odd? Two-sixths + three-sixths - one-sixth = four-sixths.

TIY 2 (p 132): Probability of Face Card or Heart: 12/52 + 13/52 - 3/52 = 22/52.

Exercise 19 (p 137) In a sample of 1000 people, 120 are left handed. If two unrelated people are selected at random from the sample, find the probability of the following:

1. \( P(LL) = \frac{120}{1000} \cdot \frac{119}{999} = 0.014294 \)
2. \( P(LR) = \frac{120}{1000} \cdot \frac{880}{999} = 0.105706 \)
3. \( P(RL) = \frac{880}{1000} \cdot \frac{120}{999} = 0.105706 \)
4. \( P(RR) = \frac{880}{1000} \cdot \frac{879}{999} = 0.774294 \)

Item 1 answers part A (both are left handed). Part B (at least one is left handed) can be answered as follows: it is Item 1 + Item 2 + Item 3 = 0.225706. It is also 1 - Item 4 = 0.225706. Part C (neither is left handed) is answered by Item 4. Part D: C (neither is left handed) is complementary with B (at least one is left handed).

Simulating the Birthday Problem in Minitab

Calc->Random Data->Integer, Generate 24 rows of data, Store in column(s) C1, Minimum value 1, Maximum value 365, OK.
Then Stat->Tables->Tally Individual Variables, select column C1 and check Counts, OK.
Note whether any value appears more than once.

Section 3.4 Counting Principles (p 140)

The Fundamental Counting Principle: If one event can occur in m ways and a second event can occur in n ways, the number of ways the two events can occur in sequence is \( m \cdot n \). This rule can be extended to any number of events occurring in sequence. (p 140)

Example 1 (p 140)

Manufacturer   Ford, GM, Chrysler
Car size       small, medium
Color          White, Red, Black, Green

The number of ways of selecting one manufacturer, one size and one color is 3*2*4 = 24.

A permutation (p 141) is an ordered arrangement of objects. The number of different permutations of n distinct objects is n factorial, which is written as n! and equals n*(n-1)*(n-2)…1. For example: how many batting orders are possible with the starting 9 players? The first player can be chosen 9 ways, the second 8, the third 7 etc. So the number of ways is 9! = 362,880. (p 142)

The number of permutations of n objects taken r at a time is: (p 142)

\[ {}_nP_r = n(n-1)(n-2)\cdots(n-r+1) = \frac{n!}{(n-r)!} \]
For example, how many ways can we select the batting order of the first three players who will start the game? We are choosing 3 players out of 9 in order, so the number is:

\[ {}_9P_3 = 9 \cdot 8 \cdot 7 = 504 \]

Note: Distinguishable Permutations (p 143) are not covered.

Suppose the above question was: How many ways can we select the first three players who will start the game? I.e. order does not matter, so that selection A, B, C is the same as selection C, B, A. The selection of r objects from n where order does not matter is called a combination. We can see that

\[ {}_9C_3 = {}_9P_3 / 3! = 504 / 6 = 84 \]

In general \( {}_nC_r = {}_nP_r / r! \). This leads to the following DEFINITION: A combination (p 144) is a selection of r objects from a group of n objects without regard to order and is denoted by

\[ {}_nC_r = \frac{n!}{(n-r)! \, r!} \]

Note that this is called the combination of n things taken r at a time and is often denoted by \( \binom{n}{r} \).

How many poker hands are there?

\[ \binom{52}{5} = 2598960 \]

Example 9 (p 146): What is the probability of a diamond flush?

\[ \frac{\binom{13}{5}}{\binom{52}{5}} = \frac{1287}{2598960} = 0.0004952 \]

The denominator is the number of ways of selecting 5 objects from 52, i.e. the number of poker hands. The numerator is the number of ways of selecting 5 objects from 13 (the number of diamonds).

My example: What is the probability of a flush? There are four suits in which a flush can be obtained, so the probability is

\[ \frac{4 \cdot 1287}{2598960} = \frac{5148}{2598960} = 0.0019808 \]

Note: Wikipedia (http://en.wikipedia.org/wiki/Poker_probability) states the following about the probability of a flush: The flush contains any five of the thirteen ranks, all of which belong to one of the four suits, minus the 40 straight flushes. Thus, the total number of flushes is:

\[ 4 \binom{13}{5} - 40 = 5148 - 40 = 5108 \]

Thus the probability is 5108/2598960 = 0.0019654. So although a straight flush is a flush, Wikipedia excludes it from the probability of a flush because it has its own category with 40 combinations.
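The counts in this section can be checked with Python's standard library, which provides `math.factorial`, `math.perm` and `math.comb` (the last two from Python 3.8). The variable names are only for this sketch.

```python
from math import comb, factorial, perm

batting_orders = factorial(9)      # 9! orderings of the starting 9 players
first_three_ordered = perm(9, 3)   # 9P3: order matters
first_three_unordered = comb(9, 3) # 9C3: order does not matter

poker_hands = comb(52, 5)          # 5 cards chosen from 52
diamond_flushes = comb(13, 5)      # 5 cards chosen from the 13 diamonds
any_flushes = 4 * diamond_flushes  # one count per suit
flushes_excluding_straight = any_flushes - 40  # Wikipedia's convention

p_diamond_flush = diamond_flushes / poker_hands
p_any_flush = any_flushes / poker_hands
p_flush_excluding_straight = flushes_excluding_straight / poker_hands

print(batting_orders, first_three_ordered, first_three_unordered)
print(poker_hands, p_diamond_flush, p_any_flush, p_flush_excluding_straight)
```

The relation between the two counts for the first three players is the general one: `perm(n, r) == comb(n, r) * factorial(r)`.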
Section 3.4 Exercise 9

The access code for a car's security system consists of four digits. The first digit cannot be zero and the last digit must be odd. How many different codes are available? 9*10*10*5 = 4500.

Chapter 4 Discrete Probability Distributions (p 161)

Section 4.1 Probability Distributions (p 162)

DEFINITIONS (p 162)

• A random variable X represents a numerical value associated with each outcome of a probability experiment.
• A random variable is discrete if it has a finite or countable number of possible outcomes that can be listed.
• A random variable is continuous if it has an uncountable number of possible outcomes, represented by an interval on the number line.

The number of calls a salesperson makes in one day is an example of a discrete random variable, while the time in hours he spends making calls in one day is an example of a continuous random variable. (p 162)

A discrete probability distribution lists each possible value the random variable can assume, together with its probability. A probability distribution must satisfy the following conditions (p 163):

• The probability of each value of the discrete random variable is between 0 and 1: \( 0 \le P(x) \le 1 \)
• The sum of all the probabilities is 1: \( \sum P(x) = 1 \)

Guidelines for constructing a discrete probability distribution: (p 164)
1. Make a frequency distribution for the possible outcomes.
2. Find the sum of the frequencies.
3. Find the probability of each possible outcome by dividing its frequency by the sum of the frequencies.
4. Check that each probability is between 0 and 1 and that the sum is 1.

Example (p 164) Individuals are rated on a score of 1 to 5 for passive-aggressive traits, where 1 is extremely passive and 5 is extremely aggressive.
Score, X   Frequency, f   P(X)
1          24             0.16
2          33             0.22
3          42             0.28
4          30             0.20
5          21             0.14
Total      150            1.00

The mean (also called the expected value) of a discrete random variable is given by (p 166):

\[ \text{Expected Value} = E(x) = \mu = \sum x P(x) \]

Note that each value of x is multiplied by its corresponding probability and the products are added.

Example (p 166) Find the mean for passive-aggressive traits above:

X    P(X)    XP(X)
1    0.16    1*0.16 = 0.16
2    0.22    2*0.22 = 0.44
3    0.28    3*0.28 = 0.84
4    0.20    4*0.20 = 0.80
5    0.14    5*0.14 = 0.70
     ΣP(X) = 1    ΣXP(X) = 2.94

The variance of a discrete random variable is the expected value of \( (x - \mu)^2 \):

\[ \sigma^2 = E(x - \mu)^2 = \sum (x - \mu)^2 P(x) \]

The standard deviation is \( \sigma = \sqrt{\sigma^2} \).

Example (p 167) Find the Variance and Standard Deviation of the passive-aggressive measure in the above example.

X    P(X)    x - μ    (x - μ)²    P(x)(x - μ)²
1    0.16    -1.94    3.764       0.602
2    0.22    -0.94    0.884       0.194
3    0.28     0.06    0.004       0.001
4    0.20     1.06    1.124       0.225
5    0.14     2.06    4.244       0.594
     ΣP(X) = 1                    ΣP(x)(x - μ)² = 1.616

So \( Var(x) = \sigma^2 = 1.616 \) and \( \sigma = \sqrt{1.616} \approx 1.27 \).

Section 4.2 Binomial Distributions (p 174)

A binomial experiment is a probability experiment that satisfies the following conditions:
1. The experiment is repeated for a fixed number of trials where each trial is independent of the other trials.
2. There are only two possible outcomes of interest for each trial. The outcomes can be classified as a success (S) or as a failure (F).
3. The probability of a success P(S) is the same for each trial.
4. The random variable x counts the number of successful trials.

Notation for Binomial Experiments

Symbol      Description
n           The number of times the trial is repeated
p = P(S)    The probability of success in a single trial
q = P(F)    The probability of failure in a single trial (q = 1 - p)
x           The random variable represents a count of the number of successes in n trials: x = 0,1,2,3,…,n

Suppose we have 9 trials.
If we let 0 mean failure and 1 mean success, the probability of getting the results 0 0 1 0 1 1 0 0 0 is \( p^3 q^6 \). (See Mood and Graybill p 66.) This is a specific way of getting 3 successes: on the third, fifth and sixth tries. Each try can be viewed as a box, and the number of ways we can place 3 1's in 9 boxes is the same as the number of ways we can choose the first 3 players from 9 on a baseball team. This is \( \binom{9}{3} \). In general the probability of a specific arrangement of x 1's and n-x 0's is \( p^x q^{n-x} \) and there are \( \binom{n}{x} \) arrangements. This leads to the following formula for the binomial distribution.

In a binomial experiment, the probability of exactly x successes in n trials is:

\[ P(x) = {}_nC_x \, p^x q^{n-x} = \frac{n!}{(n-x)! \, x!} \, p^x q^{n-x} = \binom{n}{x} p^x q^{n-x}, \quad x = 0,1,2,\ldots,n \]

This is often referred to as \( b(x; n, p) \).

We can also see how this formula is derived from a simple example: Suppose we have 3 trials. The possible results are:

Sample Points   Probability of sample point   Value of x
SSS             p³                            3
SSF             p²q                           2
SFS             p²q                           2
SFF             pq²                           1
FSS             p²q                           2
FSF             pq²                           1
FFS             pq²                           1
FFF             q³                            0

So,

\[ P(0) = \binom{3}{0} q^3, \quad P(1) = \binom{3}{1} p q^2, \quad P(2) = \binom{3}{2} p^2 q, \quad P(3) = \binom{3}{3} p^3 \]

Appendix B, Table 2 gives, for the binomial distribution, the probabilities of x successes in n trials, for values of n = 2-16, 20 and x = 0 to n, for various probabilities of success.

Population Parameters of a Binomial Distribution (p 182)

\[ \mu = np, \qquad \sigma^2 = npq, \qquad \sigma = \sqrt{npq} \]

The following are derivations of the mean for n = 1 and 2. (Mendenhall p 123)

For n = 1: \( E(x) = \sum_{x=0}^{1} x \, p(x) = 0 \cdot q + 1 \cdot p = p \)

For n = 2: \( E(x) = \sum_{x=0}^{2} x \, p(x) = 0 \cdot q^2 + 1 \cdot 2pq + 2 \cdot p^2 = 2p(q + p) = 2p \)

The following is a derivation of the variance for n = 1. (Mendenhall p 123)

\[ \sigma^2 = E(x - \mu)^2 = \sum_{x=0}^{1} (x - \mu)^2 p(x) = (0 - p)^2 q + (1 - p)^2 p = p^2 q + q^2 p = pq(p + q) = pq \]
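The binomial formula and its parameters can be checked numerically. The sketch below builds the whole distribution from \( \binom{n}{x} p^x q^{n-x} \) and then computes the mean and variance directly from their definitions, so they can be compared with np and npq. The values n = 3 and p = 0.25 are hypothetical, chosen only for illustration, and the function name is made up for this sketch.

```python
from math import comb


def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n trials: C(n, x) * p**x * q**(n - x)."""
    q = 1 - p
    return comb(n, x) * p ** x * q ** (n - x)


n, p = 3, 0.25  # hypothetical values for illustration
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

# Mean and variance from the definitions for a discrete random variable.
mean = sum(x * px for x, px in enumerate(pmf))
variance = sum((x - mean) ** 2 * px for x, px in enumerate(pmf))

print(pmf)
print(mean, n * p)                # definition vs np
print(variance, n * p * (1 - p))  # definition vs npq
```

The same function evaluated at a single x gives one entry of a binomial table such as Appendix B, Table 2.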
x 0 Example (p 184 Exercise 11): 54 percent of men consider themselves basketball fans. You randomly select 10 men and ask each of he considers himself a basketball fan. Find the probabilities that the number who are fans is: Exactly eight At least eight Less than eight 10   0.54 8  0.46 2  0.069 8 10  10  0.069   0.54 9  0.461   0.5410  0.46 0  0.089 9 10  1  0.089  0.911 Example: What is the probability of getting 3 kings in five draws of the card without replacement. This is NOT the binomial distribution because the draws are not independent. The probability is described by the hypergeometric distribution. This is discussed briefly in exercise 16 on p 194. The hypergeometric distribution is defined as follows  a  b     x  n  x   h( x; n, a, b)  where a is the number of “success” elements, b is the number of “failure”  a  b    n  elements, n is the sample size and x is the number of successes. So getting 3 kings in five draws without replacement is 24 2/5/2016 Math 131 Unit 2  4  48     3 2 94 h(3;5,4,48)       0.002 54145  52    5 When the sample size n is small compared with the population size, a + b, we sometimes use the binomial distribution to approximate the hypergeometric distribution. For example, suppose we know that we have a room with 100 people and we know that 60 support candidate A. If we select 10 people, what is the probability that 5 support candidate A?  60  40     5 5 h(5;10,60,40)      0.208 100     10  Since n = 10 is small compared to a + b = 100, we can approximate with the binomial distribution: 10  b(5;10,0.6)   (0.6) 5 (0.4) 5  0.201 5 Constructing a binomial Distribution using Minitab (Manual p 128) Using Minitab to find a binomial distribution (i.e. 
the probability of x successes in n trials) (Using Try it Yourself Section 4.2 p 177 as an example): Enter the x values (the number of successes that you want the probabilities for, usually 0,1,…,n (0,1,2,3,4,5,6,7 in this example) in C1. Calc -> Probability Distributions -> Binomial->Select Probability and enter n (7 in this example) for the Number of Trials, the p (.34 in this example) for the Probability of Success and the Input Column (C1). Click OK The results are as follows: Probability Density Function Binomial with n = 7 and p = 0.34 x 0 1 2 3 4 5 6 7 P( X = x ) 0.054552 0.196716 0.304016 0.261024 0.134467 0.041563 0.007137 0.000525 This agrees with the answer given on p A35. Using Minitab to find a particular value of a binomial distribution (Using Example 5 Section 4.5 p 179 as an example): Calc -> Probability Distributions -> Binomial -> Select Probability, and enter the Number of Trials, the Probability of Success and enter the particular value in the Input Constant. The results are as follows: Probability Density Function 25 2/5/2016 Math 131 Unit 2 Binomial with n = 250 and p = 0.71 x 178 P( X = x ) 0.0555120 This agrees with the answer given on p A35. Chapter 5 Normal Probability Distributions (p 205) Section 5.1 Introduction to Normal Distributions (p 206) GUIDELINES (p 206) A normal distribution is a continuous probability distribution for a random variable x. The graph of a normal distribution is called the normal curve. A normal distribution has the following properties. 1. The mean, median and mode are equal 2. The normal curve is bell shaped and is symmetric about the mean 3. The total area under the normal curve is equal to one 4. The normal curve approaches, but never touches, the x-axis as it extends farther and farther away from the mean 5. Between    and    (in the center of the curve) the graph curves downward. The graph curves upward to the left of    and to the right of    . 
The points at which the curve changes from curving upward to curving downward are called inflection points.

The graph of the normal distribution (the density function) is given by the following equation (p 206):

\[ y = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-(x - \mu)^2 / (2\sigma^2)} \]

The normal distribution follows the empirical rule, which states that (p 209)
1. About 68% of the area lies between \( \mu - \sigma \) and \( \mu + \sigma \)
2. About 95% of the area lies between \( \mu - 2\sigma \) and \( \mu + 2\sigma \)
3. About 99.7% of the area lies between \( \mu - 3\sigma \) and \( \mu + 3\sigma \)

Section 5.2 The Standard Normal Distribution (p 214)

The standard normal distribution is a normal distribution with a mean of 0 and a standard deviation of 1. The z-score of any normally distributed variable has the standard normal distribution. As noted in section 2.5 the z-score is:

\[ z = \frac{x - \mu}{\sigma} \]

The density function for the standard normal distribution is:

\[ y = \frac{1}{\sqrt{2\pi}} \, e^{-x^2 / 2} \]

Section 5.3 Normal Distributions: Finding Probabilities

To find the probability for any normal curve, convert the x values to their z-scores, and then find the probability for the standard normal distribution. For example, suppose x has a normal distribution with \( \mu = 2.4 \) and \( \sigma = 0.5 \). To find the probability that x < 2, we convert it to the z-score:

\[ z = \frac{x - \mu}{\sigma} = \frac{2 - 2.4}{0.5} = -0.8 \]

Now we can use the standard normal table to find \( P(z < -0.8) = 0.2119 \).

We noted above that the normal distribution follows the empirical rule. Let's see precisely what the probabilities are that X lies within the following standard deviations of the mean.

one st dev      P(z < 1) = 0.8413, P(z < -1) = 0.1587    P(-1 < z < 1) = 0.8413 - 0.1587 = 0.6826
two st devs     P(z < 2) = 0.9772, P(z < -2) = 0.0228    P(-2 < z < 2) = 0.9772 - 0.0228 = 0.9544
three st devs   P(z < 3) = 0.9987, P(z < -3) = 0.0013    P(-3 < z < 3) = 0.9987 - 0.0013 = 0.9974

Since the normal curve is symmetric, the above calculations could be simplified. For example, since P(z < -1) = 0.1587, P(z > 1) = 0.1587. So P(-1 < z < 1) = 1 - 2*0.1587 = 0.6826.
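The table look-ups above can be reproduced with `statistics.NormalDist` from the Python standard library (Python 3.8 or later), whose `cdf` method returns the cumulative area to the left of a value. This is a sketch for checking the figures, not a replacement for the Standard Normal Table; small differences in the last digit come from the table's four-place rounding.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def area_within(k):
    """P(-k < z < k) for the standard normal distribution."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(area_within(k), 4))

# The example with mu = 2.4 and sigma = 0.5: P(x < 2) two equivalent ways.
z = (2 - 2.4) / 0.5
p_via_z_score = standard_normal.cdf(z)
p_direct = NormalDist(mu=2.4, sigma=0.5).cdf(2)
print(round(p_via_z_score, 4), round(p_direct, 4))
```

Converting to a z-score first and using the general normal distribution directly give the same probability, which is the point of standardizing.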
[Figure: the standard normal curve with the area between -1 and 1 highlighted.]

What is the probability of being between 1 and 2 standard deviations from the mean? P(-2 < z < 2) - P(-1 < z < 1) = 0.9544 - 0.6826 = 0.2718. Since the graph is symmetric, the probability of being between 1 and 2 standard deviations above the mean equals the probability of being between 1 and 2 standard deviations below the mean: P(1 < z < 2) = 0.1359 and P(-2 < z < -1) = 0.1359.

Comparing with Chebychev's Theorem

Within this distance from the mean   Chebychev   Normal
σ                                    -           68%
2σ                                   75%         95%
3σ                                   89%         99.7%

Example 2 (p 223) A shopper spends a mean of 45 minutes in a supermarket, with a standard deviation of 12 minutes, and the time is normally distributed. This example finds the probabilities that a shopper will be in the store between 24 and 54 minutes, and also more than 39 minutes.

Using Minitab to find the probability that a normally distributed random variable is less than a specified value (Manual p 157)

Calc->Probability Distributions->Normal->Select Cumulative probability. Enter the Mean μ and Standard deviation σ, then enter the value in the Input Constant.

Using Minitab to find the probability that a normally distributed random variable is between two specified values (Manual p 159)

Using Try it Yourself Section 5.3 p 223 as an example. Find the probability that it is less than the smaller value and the probability that it is less than the larger value. Then subtract the former from the latter.

Cumulative Distribution Function
Normal with mean = 45 and standard deviation = 12

x    P( X <= x )
33   0.158655

Cumulative Distribution Function
Normal with mean = 45 and standard deviation = 12

x    P( X <= x )
60   0.894350

Subtracting, 0.8944 - 0.1587 = 0.7357. This agrees with the answer given on p A36.
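The subtract-two-cumulative-probabilities procedure can be sketched in Python with `statistics.NormalDist`, using the shopping-time distribution (mean 45, standard deviation 12). The helper name `prob_between` is made up for this sketch. The first call repeats the Minitab calculation for 33 and 60 minutes; the other two apply the same idea to the two questions of Example 2, whose answers are not quoted in these notes, so the values are simply what this computation gives.

```python
from statistics import NormalDist

shopping_time = NormalDist(mu=45, sigma=12)  # minutes


def prob_between(dist, low, high):
    """P(low < X < high): cumulative area up to high minus area up to low."""
    return dist.cdf(high) - dist.cdf(low)


p_33_to_60 = prob_between(shopping_time, 33, 60)  # Try it Yourself, p 223
p_24_to_54 = prob_between(shopping_time, 24, 54)  # Example 2, first question
p_more_than_39 = 1 - shopping_time.cdf(39)        # Example 2, second question

print(round(p_33_to_60, 4), round(p_24_to_54, 4), round(p_more_than_39, 4))
```

For a "more than" question the area to the left is subtracted from 1, since the total area under the normal curve is one.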
Section 5.4 Normal Distributions: Finding Values (p 229)

We can find the z-score that corresponds to a particular area or percentile by looking it up in the standard normal table.

Example 1 (p 229) What is the z-score that corresponds to a cumulative area of 0.3632? Looking up 0.3632 in the Standard Normal Table shows that the z-score is -0.35. Similarly the z-score corresponding to a probability of 0.8925 is 1.24.

To find the x-value that corresponds to a particular area or percentile, look up the z-score in the standard normal table, then use the equation \( x = \mu + z\sigma \) to convert the z-score to the x-value. For example (from example 4, p 232), suppose a normal distribution has a mean of 75 and a standard deviation of 6.5, and we want the x-value corresponding to the 95th percentile. From the standard normal table we find that the z-score corresponding to .95 is about 1.645. Therefore

\[ x = \mu + z\sigma = 75 + 1.645 \cdot 6.5 = 85.69 \]

Example 5, p 233. Mean cholesterol is 211 and standard deviation is 39.2. What is the highest level a man can have and still be in the lowest 1%? Looking up 1% in the Standard Normal Table gives z = -2.33. So

\[ x = \mu + z\sigma = 211 + (-2.33)(39.2) = 119.66 \]

Section 5.5 The Central Limit Theorem (p 238)

The following explanation is from Freund (p 176). Suppose we draw a random sample of size n: \( x_1, x_2, \ldots, x_n \) are the values assumed by the random variables \( X_1, X_2, \ldots, X_n \), which are independent and have the same distribution. The mean of the sample is the value assumed by the random variable

\[ \bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} \]

The mean (also known as the expected value) of \( \bar{X} \) is denoted by \( E(\bar{X}) \) and also by \( \mu_{\bar{x}} \). The variance of \( \bar{X} \) is denoted by \( Var(\bar{X}) \) and also by \( \sigma_{\bar{x}}^2 \).

It can be shown that

\[ \mu_{\bar{x}} = \mu \quad \text{and} \quad \sigma_{\bar{x}}^2 = \frac{\sigma^2}{n} \]

so that the standard deviation of \( \bar{X} \) is

\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \]

Note that \( \sigma_{\bar{x}} \) is called the standard error of the mean.
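The two results quoted from Freund can be verified exactly on a small case by listing every possible sample. The sketch below is an illustration rather than a proof. It assumes the population is the six faces of a fair die and that samples of size n = 2 are drawn independently, so all 36 ordered samples are equally likely; the function name is made up for this sketch.

```python
from fractions import Fraction
from itertools import product


def mean_and_variance(values):
    """Mean and variance of a list of equally likely values."""
    m = Fraction(sum(values), len(values))
    v = sum((x - m) ** 2 for x in values) / len(values)
    return m, v


faces = [1, 2, 3, 4, 5, 6]          # assumed population: one fair die
mu, sigma2 = mean_and_variance(faces)

n = 2
# Sample mean of every equally likely ordered sample of size n.
sample_means = [Fraction(sum(s), n) for s in product(faces, repeat=n)]
mu_xbar, sigma2_xbar = mean_and_variance(sample_means)

print(mu, sigma2)             # population mean and variance
print(mu_xbar, sigma2_xbar)   # mean and variance of the sample mean
```

The mean of the sample means equals the population mean, and their variance is the population variance divided by n, as stated above.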
The fact that the standard error of the mean decreases as n increases is a very important result: it says that, whatever the population distribution (provided that it has a finite variance), the distribution of the sample mean becomes more and more concentrated near the population mean as the sample size increases (Mood and Graybill, p 146).

Note, the book states this in the following way (p 238):

DEFINITION (p 238) A sampling distribution is a probability distribution of a sample statistic that is formed when samples of size n are repeatedly taken from a population. If the sample statistic is the sample mean, then the distribution is the sampling distribution of sample means.

The mean of sample i is denoted by $\bar{x}_i$. The mean of the sample means is (p 238)

$\mu_{\bar{x}} = \mu$

and the standard deviation of the sample means is (p 238)

$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$

This is also called the standard error of the mean.

The central limit theorem states that for any population, when n is large (the book says $n \ge 30$), $\bar{x}$ will be approximately normally distributed with mean $\mu_{\bar{x}} = \mu$ and standard deviation (standard error of the mean) $\sigma_{\bar{x}} = \sigma/\sqrt{n}$, and the approximation becomes better as n increases (p 240). If the original population is normally distributed, the sampling distribution of sample means is normally distributed for any sample size n.

Example 2 (p 241) Phone bills for residents of Cincinnati have mean $64 and standard deviation $9. Random samples of size 36 are drawn and the mean of each sample is determined. Find the mean and the standard error of the mean for the sampling distribution.

$\mu_{\bar{x}} = \mu = 64$

$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{9}{\sqrt{36}} = 1.5$

From the Central Limit Theorem, since $n = 36 \ge 30$ the sample mean has an approximately normal distribution with mean 64 and standard deviation 1.5.

Example 6 (p 245) Credit card balances are normally distributed with mean $2870 and standard deviation $900. What is the probability that a randomly selected credit card holder has a balance less than $2500?
$z = \frac{x - \mu}{\sigma} = \frac{2500 - 2870}{900} = -0.41$, so $P(x < 2500) = P(z < -0.41) = 0.3409$.

What is the probability that a random sample of 25 credit card holders has a mean balance less than $2500?

$z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma/\sqrt{n}} = \frac{2500 - 2870}{900/\sqrt{25}} = -2.06$, so $P(\bar{x} < 2500) = P(z < -2.06) = 0.0197$.

Section 5.6 Normal Approximations to Binomial Distributions (p 251)

The Central Limit Theorem can be restated to apply to the sum of sample measurements as follows: $\sum x$ is approximately normally distributed with mean $n\mu$ and standard deviation $\sigma\sqrt{n}$ as n becomes large. Given that $E(\bar{X}) = \mu$ and $Var(\bar{X}) = \sigma^2/n$, this is easy to show:

$E(\sum X) = E(n\bar{X}) = nE(\bar{X}) = n\mu$

$Var(\sum X) = Var(n\bar{X}) = n^2 Var(\bar{X}) = n\sigma^2$

Applying this version of the Central Limit Theorem to the binomial distribution gives the following: if $np \ge 5$ and $nq \ge 5$, the binomial random variable is approximately normally distributed with mean $\mu = np$ and standard deviation $\sigma = \sqrt{npq}$.

To see why this result is valid, look at the graphs of various binomial distributions on p 251. Note that the mean and standard deviation of the normal distribution are the same as the mean and standard deviation of the binomial distribution. The following table explains how to use the normal approximation to the binomial distribution.

Using the Normal Distribution to Approximate Binomial Probabilities

Procedure | Equations | Example (p 254)
Verify that the binomial distribution applies | Specify n, p and q | p = .37, q = .63, n = 15. Want the probability that the number of successes x < 8 (that is, $x \le 7$)
Determine whether you can use the normal distribution to approximate x, the binomial variable | Is $np \ge 5$? Is $nq \ge 5$? If both are true, you can proceed. | $np = 15 \times .37 = 5.55$, $nq = 15 \times .63 = 9.45$
Find the mean and standard deviation for the distribution | $\mu = np$, $\sigma = \sqrt{npq}$ | $\mu = 5.55$, $\sigma = 1.87$
Apply the appropriate continuity correction | Subtract 0.5 from the left boundary, if there is one, and add 0.5 to the right boundary, if there is one | 7 + 0.5 = 7.5
Find the corresponding z-score(s) | $z = \frac{x - \mu}{\sigma}$ | $z = \frac{7.5 - 5.55}{1.87} = 1.04$
Find the probability | Use the Standard Normal Table | $P(z < 1.04) = 0.8508$

Example 4 (p 255) In the U.S., 29% of people believe that passenger trips to the moon will occur in their lifetime. You randomly select 200 people. What is the probability that at least 50 will say they believe it?

$np = 200 \times 0.29 = 58$ and $nq = 200 \times 0.71 = 142$, so the binomial is approximately normal with $\mu = np = 58$ and $\sigma = \sqrt{npq} = \sqrt{200 \times 0.29 \times 0.71} = 6.42$. Using the correction for continuity, we want $P(x \ge 49.5)$:

$z = (49.5 - 58)/6.42 = -1.32$

$P(x \ge 49.5) = P(z \ge -1.32) = 1 - P(z < -1.32) = 1 - 0.0934 = 0.9066$

Unit 3 Inferential Statistics

Chapter 6 Confidence Intervals (p 269)

Section 6.1 Confidence Intervals for the Mean (Large Samples)

A point estimate is a single value estimate for a population parameter. The most unbiased point estimate of the population mean $\mu$ is the sample mean $\bar{x}$.

An interval estimate is an interval, or range of values, used to estimate a population parameter.

The level of confidence c is the probability that the interval estimate contains the population parameter.

Since we can hardly expect that point estimates based on samples always hit the parameters they are supposed to estimate exactly, it is often desirable to give an interval rather than a single number. We can then assert with a certain probability (or degree of confidence) that such an interval contains the parameter it is intended to estimate. (Freund p 214)

For large samples, the Central Limit Theorem applies. From the CLT, when $n \ge 30$, the sampling distribution of the sample mean $\bar{x}$ is approximately normal. The level of confidence c is the area under the standard normal curve between the critical values $-z_c$ and $z_c$. For example, if c = 95%, then 2.5% of the area lies to the left of $-z_c$ and 2.5% lies to the right of $z_c$. Looking up the z-score in Table 4 (p A16), we see that $z_{.95} = 1.96$ and $-z_{.95} = -1.96$.
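The critical value lookup just described can be sketched as follows. This is only an illustration: `critical_z` is a hypothetical helper name, and it relies on the fact that the area to the left of $z_c$ is $(1 + c)/2$ when the central area is c.

```python
from statistics import NormalDist

def critical_z(c):
    """Critical value z_c: central area c, so area (1 - c)/2 in each tail."""
    return NormalDist().inv_cdf((1 + c) / 2)

# Critical values for the three most common confidence levels.
for c in (0.90, 0.95, 0.99):
    print(c, round(critical_z(c), 3))
```

For c = 0.95 this returns approximately 1.96, matching the table value above.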
The distance between the point estimate and the actual parameter value is called the error of estimate. When estimating $\mu$, the error of estimate is the distance $|\bar{x} - \mu|$. Given a level of confidence c, the maximum error of estimate (sometimes called the margin of error or error tolerance) E is the greatest possible distance between the point estimate and the value of the parameter it is estimating.

$E = z_c \sigma_{\bar{x}} = z_c \frac{\sigma}{\sqrt{n}}$

When $n \ge 30$, the sample standard deviation s can be used in place of $\sigma$.

In Example 1 and Example 2 on pages 270 – 272, the sample has 54 observations, and the sample mean and sample standard deviation are:

$\bar{x} = \frac{\sum x}{n} = 12.4$

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = 5.0$

Substituting,

$E = z_c \frac{\sigma}{\sqrt{n}} \approx z_c \frac{s}{\sqrt{n}} = 1.96 \times \frac{5.0}{\sqrt{54}} \approx 1.3$

So we are 95% confident that the error of estimate for the population mean is at most about 1.3.

The c-confidence interval for the population mean is $\bar{x} - E < \mu < \bar{x} + E$.

In the above example the left endpoint of the 95% confidence interval (often called the lower confidence limit or LCL) is 12.4 – 1.3 = 11.1 and the right endpoint (often called the upper confidence limit or UCL) is 12.4 + 1.3 = 13.7. So the 95% confidence interval is $11.1 < \mu < 13.7$.

The confidence interval is often denoted in the following ways:

$\bar{x} \pm E$

$(\bar{x} - E, \bar{x} + E) = (\bar{x} - z_c \frac{\sigma}{\sqrt{n}}, \bar{x} + z_c \frac{\sigma}{\sqrt{n}})$

Summary for finding the confidence interval for the population mean (p 273)

What to do | Equations | Example (from above)
Find the sample statistics n and $\bar{x}$ | $\bar{x} = \frac{\sum x}{n}$ | n = 54, $\bar{x} = 12.4$
Specify $\sigma$ if known. Otherwise, if $n \ge 30$, find the sample standard deviation s | $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$ | s = 5.0
Find the critical value $z_c$ that corresponds to the given level of confidence | Use the Standard Normal Table to find the value $z_c$ such that the area to the right of $z_c$ is $(1 - c)/2$ | $z_{.95} = 1.96$
Find the maximum error of estimate E. Note this is the critical value times the standard error of the mean. | $E = z_c \sigma_{\bar{x}} = z_c \frac{\sigma}{\sqrt{n}} \approx z_c \frac{s}{\sqrt{n}}$ | E = 1.3
Find the left and right endpoints and form the confidence interval | Left endpoint (LCL): $\bar{x} - E$. Right endpoint (UCL): $\bar{x} + E$. Interval: $\bar{x} - E < \mu < \bar{x} + E$ | (11.1, 13.7)

In summary, $CI = (\bar{x} - z_c \frac{\sigma}{\sqrt{n}}, \bar{x} + z_c \frac{\sigma}{\sqrt{n}})$.

Example 5, p 275 Take a sample of size 20 from a Normal distribution with standard deviation = 1.5. The sample mean is 22.9. What is the 90% CI? Looking in Table 4 p A16 (Standard Normal Distribution), the critical value for c = 0.90 (area 0.05 in each tail) is $z_c = 1.645$.

$CI = (\bar{x} - z_c \frac{\sigma}{\sqrt{n}}, \bar{x} + z_c \frac{\sigma}{\sqrt{n}}) = (22.9 - 1.645 \times \frac{1.5}{\sqrt{20}}, 22.9 + 1.645 \times \frac{1.5}{\sqrt{20}}) = (22.9 - 0.55, 22.9 + 0.55) = (22.35, 23.45)$

Using Minitab to find the Confidence Interval with a Sample in a Column for a Normal Distribution (Manual p 183)

Enter data in column 1 (for example, enter the 54 data points on page 270). (Note: the Manual generates 20 random samples from a Normal Distribution instead.) Then determine the standard deviation: Stat->Basic Statistics->Display Descriptive Statistics (click Statistics to be sure the standard deviation is included)->Click OK and note the standard deviation (in this example it is 5.015).

Stat->Basic Statistics->1-Sample Z…->Choose Column C1 and enter the standard deviation from the first step (in this example it is 5.015)->Click Options, then choose not equal for the Alternative and enter the Confidence level (95% in the example). Click OK, then click OK again. If the data from the example on p 270 is entered, the confidence interval is given as (11.0883, 13.7635).

Using Minitab to find the Confidence Interval with Summarized Data for a Normal Distribution

Stat->Basic Statistics->1-Sample Z…->Choose Summarized Data, enter the Sample size (e.g. 100), the Mean (e.g. 50) and the Standard deviation (e.g. 5). Click Options, enter the Confidence level (e.g. 95.0) and choose not equal for Alternative.
Click OK, OK. The results are presented in the session window as follows:

One-Sample Z
The assumed standard deviation = 5
   N      Mean  SE Mean  95% CI
 100   50.0000   0.5000  (49.0200, 50.9800)

Determining Sample Size (p 276)

How large a sample size (n) is needed to guarantee a certain level of confidence for a given maximum error of estimate (E)? This can be derived from the formula for E above:

$E = z_c \frac{\sigma}{\sqrt{n}}$

Solving for n gives:

$n = \left( \frac{z_c \sigma}{E} \right)^2$

If $\sigma$ is unknown, s can be used as an estimate if there is a preliminary sample of size at least 30.

Example 6, p 276 We want to estimate the mean number of sentences in a magazine ad. How many ads must be in the sample if you want to be 95% confident that the sample mean is within one sentence of the population mean? From Example 2, p 272, s = 5.0, so

$n = \left( \frac{z_c \sigma}{E} \right)^2 \approx \left( \frac{1.96 \times 5.0}{1} \right)^2 = 96.04$

Rounding up, you need a sample of size 97.

Section 6.2 Confidence Intervals for the Mean (Small Samples) (p 284)

When the sample size is small (less than 30), the sample standard deviation s is not a reliable enough estimate of $\sigma$ to justify using the normal distribution. However, when the random variable x is drawn from an approximately normal distribution, the distribution of the following random variable t is known and is called the t-distribution:

$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$

The t-distribution is bell shaped and symmetric about the mean. The t-distribution is a family of curves, each determined by a parameter called the degrees of freedom (d.f.), where d.f. = n – 1 (n is the sample size). The total area under the t-curve is 1. The mean, median, and mode of the t-distribution are equal to zero.
As the degrees of freedom increase, the t-distribution approaches the standard normal distribution.

Constructing a confidence interval using the t-distribution is similar to constructing it for the normal distribution, as the following table indicates.

Procedure | Equations | Example 2 p 286
Identify the sample statistics n, $\bar{x}$, and s | $\bar{x} = \frac{\sum x}{n}$, $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$ | n = 16, $\bar{x} = 162$, s = 10
Identify the degrees of freedom, the level of confidence c, and the critical value $t_c$ | d.f. = n – 1; $t_c$ is found in Table 5, Appendix B | c = .95, d.f. = n – 1 = 15, $t_c = 2.131$
Find the maximum error of estimate E | $E = t_c \frac{s}{\sqrt{n}}$ | $E = 2.131 \times \frac{10}{\sqrt{16}} = 5.3275$
Find the confidence interval | $(\bar{x} - E, \bar{x} + E)$ | (156.6725, 167.3275)

Summary of when the normal distribution or the t-distribution can be used (p 288)

If $n \ge 30$, the normal distribution can be used, and s can be used to estimate $\sigma$.

If $n < 30$ and the population is normally or approximately normally distributed, use the normal distribution if $\sigma$ is known, otherwise use the t-distribution.

If $n < 30$ and the population is not approximately normally distributed, a CI cannot be constructed with these methods.

Using Minitab to find the Confidence Interval for a t-Distribution with the Sample in a Column (Manual p 193)

We will use the data in Ex6_2-23.mtp (Chapter 6, p 291), Sports Cars: Miles per Gallon, to illustrate the method.

First verify that the data is approximately normal. To draw a normal probability plot, click on Graph -> Probability Plot and select the Single plot. Click on OK. Select C1 for the Graph variable. Click on OK and the probability plot will be displayed. Notice that all the data points are contained within the confidence bands of the plot.

Next check for outliers using a boxplot. Click on Graph->Boxplot and select Simple boxplot. Click on OK. Select C1 for the Graph variable. To view a horizontal boxplot (rather than a vertical one), click on Scale and select Transpose value and category scales. Click on OK twice.
There are no outliers shown in the boxplot, so you may now proceed with the confidence interval.

Since n = 25 and the population standard deviation is unknown, you should construct a t-interval for this problem. Click on Stat -> Basic Statistics -> 1-Sample t. Select Samples in Columns and enter C1. Next select Options, enter 95.0 for the Confidence Level and select 'not equal' for the Alternative. Click on OK twice and the interval will be displayed in the Session Window as follows:

One-Sample T: Miles per gallon
Variable            N     Mean   StDev  SE Mean  95% CI
Miles per gallon   25  24.0000  3.0000   0.6000  (22.7617, 25.2383)

Note that the 95% Confidence Interval is (22.7617, 25.2383).

Using Minitab to find the Confidence Interval with Summarized Data for a t-Distribution

Stat->Basic Statistics->1-Sample t…->Choose Summarized Data, enter the Sample size (e.g. 10), the Mean (e.g. 50) and the Standard deviation (e.g. 5). Click Options, enter the Confidence level (e.g. 95.0) and choose not equal for Alternative. Click OK, OK. The results are presented in the session window as follows:

One-Sample T
  N     Mean   StDev  SE Mean  95% CI
 10  50.0000  5.0000   1.5811  (46.4232, 53.5768)

Section 6.3 Confidence Intervals for Population Proportions (p 293)

The point estimate for p, the population proportion of successes, is given by the proportion of successes in a sample and is denoted by

$\hat{p} = \frac{x}{n}$

where n is the sample size and x is the number of successes in the sample. The point estimate for the proportion of failures is $\hat{q} = 1 - \hat{p}$. The symbols $\hat{p}$ and $\hat{q}$ are read as "p hat" and "q hat". Note this is derived from the equation in Section 5.5 where the mean of $\bar{X}$ was obtained.

The mean and standard deviation of the estimate $\hat{p}$ are:

$\mu_{\hat{p}} = p$, $\sigma_{\hat{p}} = \sqrt{\frac{pq}{n}}$

This is the standard error of the mean (Section 5.5) when the random variable $X_i$ can only take on the value 0 or 1.
Relate this to the sample mean of a random variable that has a binomial distribution. With $\hat{p} = \frac{x}{n}$,

$E(\hat{p}) = E\left(\frac{x}{n}\right) = \frac{1}{n} E(x) = \frac{np}{n} = p$

$Var(\hat{p}) = Var\left(\frac{x}{n}\right) = \frac{1}{n^2} Var(x) = \frac{1}{n^2} npq = \frac{pq}{n}$

The following table explains how to construct a confidence interval for the population proportion.

Constructing a Confidence Interval for the Population Proportion (p 294)

Procedure | Equations | Examples 1, 2 (p 293, 295)
Identify the sample statistics | n is the number of trials and x is the number of successes | n = 1024, x = 287
Find the point estimate $\hat{p}$. Also find the estimate of the standard deviation (the standard error of the mean) $s_{\hat{p}}$ | $\hat{p} = \frac{x}{n}$, $s_{\hat{p}} = \sqrt{\frac{\hat{p}\hat{q}}{n}}$ | $\hat{p} = \frac{287}{1024} \approx 0.28$
Verify that the sampling distribution of $\hat{p}$ can be approximated by the normal distribution | $n\hat{p} \ge 5$, $n\hat{q} \ge 5$ | $n\hat{p} = 1024 \times 0.28 \approx 287$, $n\hat{q} = 1024 \times 0.72 \approx 737$
Find the critical value $z_c$ that corresponds to the given level of confidence c | Use the Standard Normal Table to find the value $z_c$ such that the area to the right of $z_c$ is $(1 - c)/2$ | $z_{.95} = 1.96$
Find the maximum error of estimate E. This is the critical value times the standard error of the mean. | $E = z_c \sqrt{\frac{\hat{p}\hat{q}}{n}}$ | $E = 1.96 \sqrt{\frac{0.28 \times 0.72}{1024}} \approx 0.028$
Find the left and right endpoints of the confidence interval | $(\hat{p} - E, \hat{p} + E)$ | (0.28 – 0.028, 0.28 + 0.028) = (0.252, 0.308)

Finding the minimum sample size is done by substituting in the formula that was derived above:

$n = \left( \frac{z_c \sigma}{E} \right)^2 = \hat{p}\hat{q} \left( \frac{z_c}{E} \right)^2$

Note that $\sqrt{\hat{p}\hat{q}}$ takes the place of $\sigma$ here: it is the estimated standard deviation of a single 0/1 trial.

Example 4 p 297. We want to estimate the proportion of voters who support our candidate with 95% confidence that we are within 3% of the actual proportion. Since there is no preliminary estimate for $\hat{p}$, we use 0.5. Substituting into the above equation gives:

$n = 0.5 \times 0.5 \times \left( \frac{1.96}{0.03} \right)^2 = 1067.11$

Rounding up, we need at least 1068 registered voters to be included in the sample.
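The two proportion calculations above can be sketched in a few lines. This is a minimal illustration, not part of the course materials: the function names are hypothetical, the interval uses the normal approximation described in the table, and the critical value comes from `NormalDist.inv_cdf` rather than the printed table.

```python
import math
from statistics import NormalDist

def proportion_ci(x, n, c):
    """Normal-approximation confidence interval for a population proportion."""
    p_hat = x / n
    z_c = NormalDist().inv_cdf((1 + c) / 2)            # critical value for level c
    margin = z_c * math.sqrt(p_hat * (1 - p_hat) / n)  # maximum error of estimate E
    return p_hat - margin, p_hat + margin

def sample_size_for_proportion(e, c, p_guess=0.5):
    """Smallest n whose margin of error is at most e; p_guess=0.5 is the cautious default."""
    z_c = NormalDist().inv_cdf((1 + c) / 2)
    return math.ceil(p_guess * (1 - p_guess) * (z_c / e) ** 2)

low, high = proportion_ci(287, 1024, 0.95)         # Examples 1 and 2
n_voters = sample_size_for_proportion(0.03, 0.95)  # Example 4
print(round(low, 3), round(high, 3), n_voters)
```

Because the script keeps $\hat{p} = 287/1024$ unrounded, the lower endpoint comes out near 0.253 rather than the 0.252 obtained from the rounded values in the table; the required sample size is 1068 either way.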
Chapter 7 Hypothesis Testing with One Sample

Section 7.1 Introduction to Hypothesis Testing (p 321)

A null hypothesis $H_0$ is a statistical hypothesis that contains a statement of equality, such as $\le$, $=$, or $\ge$.

The alternative hypothesis $H_a$ is the complement of the null hypothesis. It is a statement that must be true if $H_0$ is false, and it contains a statement of inequality such as $>$, $\ne$, or $<$.

A Type I error occurs if the null hypothesis is rejected when it is actually true.

A Type II error occurs if the null hypothesis is not rejected when it is actually false.

Example 1 (p 322) and Try it Yourself (p 322)

Claim | Hypotheses
A university claims that the proportion of students who graduate in four years is 82% | H0: p = 0.82 (Claim); Ha: p $\ne$ 0.82
A water faucet manufacturer claims that the mean flow rate of a faucet is less than 2.5 gpm | H0: mean $\ge$ 2.5 gpm; Ha: mean < 2.5 gpm (Claim)
A cereal company claims that the mean weight of the contents of its 20-ounce size cereal boxes is more than 20 oz | H0: mean $\le$ 20 oz; Ha: mean > 20 oz (Claim)
An automobile battery manufacturer claims that the mean life of a certain battery type is 74 months | H0: mean = 74 (Claim); Ha: mean $\ne$ 74
A television manufacturer claims that the variance of the life of a certain type of TV is at most 3.5 | H0: variance $\le$ 3.5 (Claim); Ha: variance > 3.5
A radio station claims that its proportion of the local listening audience is greater than 39% | H0: p $\le$ 0.39; Ha: p > 0.39 (Claim)

DEFINITION (p 325) In a hypothesis test, the level of significance is your maximum allowable probability of making a Type I error. It is denoted by $\alpha$.

Three commonly used levels of significance are $\alpha = 0.10, 0.05, 0.01$. Note that making $\alpha$ small means that we want a very small chance of rejecting a null hypothesis that is true.

The probability of a Type II error is denoted by $\beta$.
The following table summarizes this:

Decision | $H_0$ true | $H_a$ true
Do not reject $H_0$ | Correct decision | Type II error (probability = $\beta$)
Reject $H_0$ | Type I error (probability = $\alpha$) | Correct decision

The statistic that is compared to the parameter in the null hypothesis is called the test statistic. The following table shows the relationships between population parameters and their corresponding test statistics, sampling distributions, and standardized test statistics. (p 325)

Population parameter | Test statistic | Sampling distribution | Standardized test statistic
$\mu$ | $\bar{x}$ | If $n \ge 30$, Normal | z
$\mu$ | $\bar{x}$ | If $n < 30$, Student t | t
p | $\hat{p}$ | Normal | z

DEFINITION: Assuming the null hypothesis is true, a P-value (or probability value) of a hypothesis test is the probability of obtaining a sample statistic with a value as extreme or more extreme than the one determined from the sample data. (p 325)

If the alternative hypothesis $H_a$ contains the less-than inequality symbol (<), the hypothesis test is a left-tailed test, i.e. P is the area of the standard normal curve to the left of z.

If the alternative hypothesis $H_a$ contains the greater-than inequality symbol (>), the hypothesis test is a right-tailed test, i.e. P is the area of the standard normal curve to the right of z.

If the alternative hypothesis $H_a$ contains the not-equal-to symbol ($\ne$), the hypothesis test is a two-tailed test. In a two-tailed test, each tail has an area of $\frac{1}{2}P$.

[Figure: example of a left-tailed test on the standard normal curve. Labels: P-value = probability of getting a result as extreme as or more extreme than the sample statistic; $\alpha$ = maximum allowable probability of rejecting a true null hypothesis; z = standardized test statistic; $z_\alpha$ = critical value, called $z_0$ (p 339).]

For a left-tailed test, reject if $z \le z_\alpha$ or, equivalently, reject if $P \le \alpha$. In the figure above, $z > z_\alpha$ (so $P > \alpha$), so do not reject the null hypothesis.
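The three tail rules above translate directly into a small function. This is a sketch under the assumption that the standardized test statistic is a z-score; `p_value` and `reject_null` are illustrative names, not Minitab or textbook terms.

```python
from statistics import NormalDist

def p_value(z, tail):
    """P-value for a standardized z statistic; tail is 'left', 'right' or 'two'."""
    std_normal = NormalDist()
    if tail == "left":
        return std_normal.cdf(z)                 # area to the left of z
    if tail == "right":
        return 1 - std_normal.cdf(z)             # area to the right of z
    if tail == "two":
        return 2 * (1 - std_normal.cdf(abs(z)))  # twice the area in one tail
    raise ValueError("tail must be 'left', 'right' or 'two'")

def reject_null(z, tail, alpha):
    """Decision rule: reject H0 when the P-value is at most alpha."""
    return p_value(z, tail) <= alpha

print(round(p_value(-2.57, "left"), 4), reject_null(-2.57, "left", 0.01))
```

Comparing the P-value with $\alpha$ gives the same decision as comparing z with the critical value, which is the equivalence the figure illustrates.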
Page 326 has a DEFINITION that shows the P area corresponding to each alternative hypothesis:

Alternative hypothesis | Area of normal curve
< | P is in the left tail
> | P is in the right tail
$\ne$ | Half of P is in each tail

Example 3 p 327 shows the P-areas for the cases discussed in Example 1.

Section 7.2 Hypothesis Testing for the Mean (Large Samples) (p 334)

Remember that the P-value is the probability of getting a result at least as extreme as the one you obtained, and the level of significance $\alpha$ is the maximum allowable probability of making a Type I error (rejecting the null hypothesis when it is actually true).

The decision rule based on the P-value is to compare the P-value to $\alpha$ (p 334):

If $P \le \alpha$, then reject $H_0$.

If $P > \alpha$, then fail to reject $H_0$.

An equivalent way of deciding whether to reject or fail to reject the null hypothesis is to determine whether the standardized test statistic falls within a range of values called the rejection region of the sampling distribution (p 339). We will discuss both methods at the same time to show their equivalence. This is why the discussion on p 339 is inserted here.

DEFINITION (p 339) A rejection region (or critical region) of the sampling distribution is the range of values for which the null hypothesis is not probable. If a test statistic falls in this region, the null hypothesis is rejected. A critical value, $z_0$, separates the rejection region from the nonrejection region.

Using P-values for a z-Test for Mean $\mu$ (p 336). Also using the Critical Value Method (p 441)

Procedure | Equations | Example 4 p 337
State the claim mathematically and verbally. Identify the null and alternative hypotheses | State $H_0$ and $H_a$ | Claim: delivery time is < 30 min. $H_0: \mu \ge 30$, $H_a: \mu < 30$ (Claim). Note, this alternative means we use a left-tailed test.
Specify the level of significance | Identify $\alpha$ | $\alpha = 0.01$
Determine the standardized test statistic. Note that this is the sample mean minus the hypothesized mean over the standard error of the mean. | $\bar{x}$ is the test statistic and z is the standardized test statistic: $z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$, or if $n \ge 30$ use s in place of $\sigma$ | n = 36, $\bar{x} = 28.5$, s = 3.5, so $z = \frac{28.5 - 30}{3.5/\sqrt{36}} = -2.57$. Note that $\sigma_{\bar{x}} = \sigma/\sqrt{n}$ is the standard error of the mean and is approximated by $3.5/\sqrt{36} = .5833$
Find the area that corresponds to z. Or, using the Critical Value Method described on p 34, determine the critical value $z_0$. | Use Table 4 in Appendix B, the Standard Normal Distribution | Area corresponding to z = -2.57 is 0.0051. Or, using the Critical Value Method, from the Standard Normal Table $z_0 = z_\alpha = z_{.01} = -2.33$
Find the P-value | Left-tailed test, P = area in left tail. Right-tailed test, P = area in right tail. Two-tailed test, P = 2(area in tail) | Since this is a left-tailed test, the P-value is 0.0051.
Make a decision to reject or fail to reject the null hypothesis | Reject $H_0$ if P-value $\le \alpha$, otherwise do not reject. Or, using the Critical Value Method, determine if z is in the rejection region. | The P-value < $\alpha$, so reject $H_0$. Or, using the Critical Value Method, $z < z_0 = -2.33$, so reject $H_0$.

So, since the claim (that delivery time is less than 30 minutes) is the alternative hypothesis, we have sufficient evidence to support the claim.

Using Minitab for Hypothesis testing for the mean with summarized data from a large sample

Minitab: Stat->Basic Statistics->1-sample z->In the dialog box there are two choices: Samples in columns and Summarized data. Choosing Summarized data for the above example, enter Sample size = 36, Mean = 28.5, Standard deviation = 3.5 and Test mean = 30. Then choose Options… and enter Confidence Level = .99 (this is $1 - \alpha$) and Alternative: less than, then click OK, OK.

Result: P = 0.005.

Note the 99% Upper Bound is given to be 29.8570. What does this mean?
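The delivery-time test above can be reproduced numerically. This is a hedged sketch: it uses the Python standard library rather than Minitab, and it assumes that the one-sided upper bound is computed as $\bar{x} + z_{1-\alpha} \cdot s/\sqrt{n}$, which reproduces the reported 29.857 but is not stated in the notes.

```python
import math
from statistics import NormalDist

# Example 4 (p 337): H0: mu >= 30, Ha: mu < 30 (left-tailed), alpha = 0.01.
n, x_bar, s, mu_0, alpha = 36, 28.5, 3.5, 30, 0.01

std_error = s / math.sqrt(n)       # s stands in for sigma because n >= 30
z = (x_bar - mu_0) / std_error     # standardized test statistic
p = NormalDist().cdf(z)            # left-tailed P-value
z_0 = NormalDist().inv_cdf(alpha)  # left-tail critical value

# Assumed form of the one-sided 99% upper confidence bound for mu.
upper_bound = x_bar + NormalDist().inv_cdf(1 - alpha) * std_error

print(round(z, 2), round(p, 4), round(z_0, 2), round(upper_bound, 3))
print("reject H0" if p <= alpha else "fail to reject H0")
```

Under that assumption the upper bound falls below the hypothesized mean of 30, which is another way of seeing why the left-tailed test rejects $H_0$ at $\alpha = 0.01$.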
Section 7.3 Hypothesis Testing for the Mean (Small Samples) (p 350)

For samples of size less than 30 and when $\sigma$ is unknown, if the population has a normal, or nearly normal, distribution, the t-distribution is used to test for the mean $\mu$.

Using the t-Test for a Mean $\mu$ when the sample is small (p 352)

Procedure | Equations | Example 4 p 353
State the claim mathematically and verbally. Identify the null and alternative hypotheses | State $H_0$ and $H_a$ | $H_0: \mu \ge 16500$, $H_a: \mu < 16500$
Specify the level of significance | Specify $\alpha$ | $\alpha = 0.05$; n = 14, $\bar{x} = 15700$, s = 1250
Identify the degrees of freedom and sketch the sampling distribution | d.f. = n – 1 | d.f. = 13
Determine any critical values | Table 5 (t-distribution) in Appendix B. If the test is left-tailed, use the One tail, $\alpha$ column with a negative sign. If the test is right-tailed, use the One tail, $\alpha$ column with a positive sign. If the test is two-tailed, use the Two tails, $\alpha$ column with a negative and a positive sign. | The test is left-tailed. Since the test is left-tailed and d.f. = 13, the critical value is $t_0 = -1.771$
Determine the rejection regions | The rejection region is $t < t_0$ | The rejection region is $t < -1.771$
Find the standardized test statistic | $t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$ | $t = \frac{15700 - 16500}{1250/\sqrt{14}} = -2.39$
Make a decision to reject or fail to reject the null hypothesis | If t is in the rejection region, reject $H_0$. Otherwise do not reject $H_0$. | Since $-2.39 < -1.771$, reject $H_0$
Interpret the decision in the context of the original claim | | Reject the claim that the mean is at least 16500.

Using Minitab to perform Hypothesis testing for the mean with summarized data when the sample is small (Manual p 211)

We are assuming the population is normally distributed. Click on Stat -> Basic Statistics -> 1 sample t. Choose Summarized data and enter 14 for the Sample size, 15700 for the Mean, 1250 for the Standard deviation and 16500 for the Test mean. Click Options and enter 95 for the Confidence level and 'less than' for the Alternative.
Click OK twice. The results displayed in the Session Window are:

One-Sample T
Test of mu = 16500 vs < 16500
  N     Mean   StDev  SE Mean  95% Upper Bound      T      P
 14  15700.0  1250.0    334.1          16291.6  -2.39  0.016

Since the t-value (-2.39) is less than the critical value (-1.771), we reject the null hypothesis.

Section 7.4 Hypothesis Testing for Proportions (p 360)

Hypothesis testing for proportions is similar to hypothesis testing for the mean (Section 7.2). Recall from Section 6.3 that, with $\hat{p} = \frac{x}{n}$, the mean and the standard error for the proportion are:

$\mu_{\hat{p}} = p$, $\sigma_{\hat{p}} = \sqrt{\frac{pq}{n}}$

If $np \ge 5$ and $nq \ge 5$, then the sampling distribution is approximately normal and the following table summarizes the procedure.

Using Critical Values for a z-Test for Proportion p (p 360)

Procedure | Equations | Example 1 p 361
Verify that we can use the normal approximation | Verify that $np \ge 5$ and $nq \ge 5$ | np = 100(0.20) = 20, nq = 100(0.80) = 80
State the claim mathematically and verbally. Identify the null and alternative hypotheses | State $H_0$ and $H_a$ | Claim: less than 20% are allergic. $H_0: p \ge 0.2$, $H_a: p < 0.2$ (Claim)
Specify the level of significance | Identify $\alpha$ | $\alpha = 0.01$
Sketch the sampling distribution. Determine any critical values | Use Table 4 in Appendix B to determine the critical value(s) from the level of significance | $z_0 = z_{0.01} = -2.33$
Determine the rejection regions | This is based on the critical value(s) and specifies the values of the test statistic z that cause the rejection of the null hypothesis | $z < -2.33$
Determine the standardized test statistic. Note that this is the sample proportion minus the hypothesized proportion over the standard error. | $z = \frac{\hat{p} - p}{\sqrt{pq/n}}$ | n = 100, $\hat{p} = 0.15$, so $z = \frac{0.15 - 0.2}{\sqrt{(0.2)(0.8)/100}} = -1.25$ (Note: P-value = 0.1056)
Make a decision to reject or fail to reject the null hypothesis | If z is in the rejection region, reject $H_0$, otherwise do not reject. (If P-value $\le \alpha$, reject $H_0$, otherwise do not reject.) | $z > z_{0.01}$, so do not reject $H_0$. (The P-value > $\alpha$, so do not reject $H_0$.)

Using Minitab to perform Hypothesis testing for a proportion with summarized data (Manual p 215)

We will test the hypothesis given in the example above. We showed above that we could assume normality.

Click on Stat -> Basic Statistics -> 1-Proportion. The data is in a summarized form, so select Summarized data. Enter 100 for the Number of trials and 15 for the Number of events. Click on Options. Enter 99 (or .99) for the Confidence level, 0.20 for the Test Proportion and 'less than' for the Alternative. Because the assumption of normality has been met, select Use test and interval based on normal distribution, and then click on OK twice.

The results are:

Test and CI for One Proportion
Test of p = 0.2 vs p < 0.2
Sample   X    N  Sample p  99% Upper Bound  Z-Value  P-Value
     1  15  100  0.150000         0.233067    -1.25    0.106

Chapter 9 Correlation and Regression

Section 9.1 Correlation (p 442)

DEFINITION A correlation is a relationship between two variables. The data can be represented by the ordered pairs (x, y), where x is the independent, or explanatory, variable and y is the dependent, or response, variable.

A scatter plot is a graph of the (x, y) points (first discussed in Section 2.2). To construct a scatter plot we plot the independent variable on the x axis and the dependent variable on the y axis. As p 442 shows, there can be a negative linear correlation, a positive linear correlation, no correlation or a nonlinear correlation.

Using Minitab to draw a scatter plot (Manual p 93)

Using exercise 13, p 454 (EX9_1-13.MTP): Click on Graph -> Scatterplot -> Simple. Click OK. Enter C2 for the Y variable and C1 for the X variable.
(You can also click Labels and specify your own title to replace the default provided by Minitab.) Click OK. For exercise EX9_1-13.MTP, the plot is Test score vs. Hours study. The results are:

[Figure: Scatterplot of Test Score vs Hours study. x-axis: Hours study; y-axis: Test Score.]

Obtaining the scatter plot for the Old Faithful data in Example 3 p 444: Click on Graph -> Scatterplot -> Simple. Click OK. Enter Time for the Y variable and Duration for the X variable. (You can also click Labels and specify your own title to replace the default provided by Minitab.) Click OK. The results are:

[Figure: Scatterplot of Time y vs Duration x. x-axis: Duration x; y-axis: Time y.]

The Correlation Coefficient is a measure of the strength and the direction of a linear relationship between two variables. The symbol r represents the sample correlation coefficient. The formula for r is (p 445)

$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}$

where n is the number of pairs of data. The population correlation coefficient is represented by $\rho$.

Note, Hogg and Craig (p 69) give the following equation for the population correlation coefficient:

$\rho = \frac{E[(X - \mu_1)(Y - \mu_2)]}{\sigma_1 \sigma_2} = \frac{E(XY) - \mu_1 \mu_2}{\sigma_1 \sigma_2} = \frac{\sigma_{12}}{\sigma_1 \sigma_2}$

where $\sigma_{12}$ is the covariance of X and Y (Freund p 300). The covariance is positive if there is a high probability that large values of X go with large values of Y and small values of X go with small values of Y. If there is a high probability that large values of X go with small values of Y and vice versa, the covariance is negative (Freund p 112).

Hogg and Craig (p 339) give the following equation for the sample correlation coefficient:

$R = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$

This equation can be manipulated to yield the equation given in the text on p 445.

The range of the correlation coefficient is –1 to 1.
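To see that the textbook formula and the Hogg and Craig form agree, both can be evaluated on the same data. The sketch below uses the eight advertising and sales pairs of Example 1 (p 443); the function names are illustrative and not from either book.

```python
import math

# Advertising expenses and company sales (thousands of $), Example 1 (p 443).
x = [2.4, 1.6, 2.0, 2.6, 1.4, 1.6, 2.0, 2.2]
y = [225, 184, 220, 240, 180, 184, 186, 215]

def r_from_sums(x, y):
    """Textbook formula (p 445), built from the sums of x, y, xy, x^2 and y^2."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / (
        math.sqrt(n * sxx - sx ** 2) * math.sqrt(n * syy - sy ** 2))

def r_from_deviations(x, y):
    """Hogg and Craig form, built from deviations about the sample means."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    numerator = sum((a - mx) * (b - my) for a, b in zip(x, y))
    denominator = math.sqrt(
        sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return numerator / denominator

print(round(r_from_sums(x, y), 3), round(r_from_deviations(x, y), 3))
```

Both functions give r of about 0.913 for these data, a strong positive linear correlation.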
If x and y have a strong positive linear correlation, r is close to 1. If x and y have a strong negative linear correlation, r is close to -1. If there is no linear correlation or a weak linear correlation, r is close to 0. See some examples on p 445.

Note that when the slope is negative, $Y_i - \bar{Y}$ is positive when $X_i - \bar{X}$ is negative and vice versa. So, using the Hogg and Craig formula, we see that the correlation is negative when the slope is negative, as in the graph at the left.

Example 1 (p 443) gives eight points for advertising expenses and company sales.

    Advertising expenses (thousands of $)   2.4   1.6   2.0   2.6   1.4   1.6   2.0   2.2
    Company sales (thousands of $)          225   184   220   240   180   184   186   215

The correlation coefficient for these data is calculated on p 446.

Using Minitab to find the Correlation Coefficient (Manual p 95)

Again using exercise 13, p 454 (EX9_1-13.MTP):
Click on Stat -> Basic Statistics -> Correlation.
Choose the columns C1 and C2 and click OK.

The results are displayed in the Session Window as follows:

    Correlations: Hours study, Test Score
    Pearson correlation of Hours study and Test Score = 0.923
    P-Value = 0.000

The Session Window shows that the correlation coefficient is 0.923. Note that Minitab gives the P-value of the correlation coefficient. The P-value is the probability of getting a result as extreme as or more extreme than the sample statistic if there were no correlation.

Using Minitab to find the correlation coefficient for the Old Faithful data (Example 5 p 447)

Click on Stat -> Basic Statistics -> Correlation.
Choose the columns C1 (Duration) and C2 (Time) and click OK.

The results are displayed in the Session Window as follows:

    Correlations: Duration x, Time y
    Pearson correlation of Duration x and Time y = 0.972
    P-Value = 0.000

We can test whether there is enough evidence to conclude that the population correlation coefficient is significant at a specified level of significance $\alpha$.
We use Table 11 in Appendix B to find the critical value for the specified $\alpha$. If |r| > critical value, we can conclude that the correlation is significant. (p 448)

Procedure, equations, and example (Examples 3 p 444 and 6 p 449):

1. Determine the number of pairs of data in the sample.
   Equation: Determine n.
   Example: $n = 35$
2. Specify the level of significance.
   Example: $\alpha = 0.05$
3. Find the critical value.
   Equation: Table 11 in Appendix B.
   Example: critical value = 0.334
4. Decide if the correlation is significant.
   Equation: If |r| > critical value, the correlation is significant.
   Example: $r \approx 0.970 > 0.334$. Therefore the correlation is significant.
5. Interpret the decision in the context of the original claim.
   Example: There is a significant correlation between the duration of Old Faithful's eruptions and the time until the next eruption.

Using Minitab to determine whether the correlation coefficient is significant (Manual, p 95)

For the specified level of significance $\alpha$, if |r| is greater than the critical value, then the P-value is < $\alpha$. In the Minitab example for Hours study and Test Score, we see that the P-value is 0.000. This is less than 0.05, so there is a significant correlation.

In the Old Faithful example, the P-value is also 0.000, so there is a significant correlation. (Looking at Table 11 in Appendix B, we see that for sample size 35 and α = 0.05, the critical value is 0.334, confirming that the correlation is significant.)

Correlation does not necessarily mean causation. More in-depth study is needed to decide among the following possibilities (p 452):
1. x causes y, e.g. Old Faithful: duration affects time until next eruption
2. y causes x, e.g. Old Faithful: time since last eruption affects duration of next eruption
3. Some other variable or variables affect both x and y.
4. Coincidence

Section 9.2 Linear Regression (p 458)

Residuals are the differences between the observed and predicted y-values.

The regression line, also called a line of best fit, is the line for which the sum of squares of the residuals is a minimum.
The equation of a regression line for an independent variable x and a dependent variable y is (p 459)

$$\hat{y} = mx + b$$

where $\hat{y}$ is the predicted y-value for a given x-value. The slope m and y-intercept b are given by

$$m = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} \qquad \text{and} \qquad b = \bar{y} - m\bar{x} = \frac{\sum y}{n} - m\,\frac{\sum x}{n}$$

where $\bar{y}$ is the mean of the y-values in the data set and $\bar{x}$ is the mean of the x-values. The regression line always passes through the point $(\bar{x}, \bar{y})$.

Using Minitab to find the Least Squares Regression Equation (p 98)

Again using exercise 13, p 454 (EX9_1-13.MTP):
Click Stat -> Regression -> Regression.
Choose the Predictor (x variable) and Response (y variable).
Click on Results and choose Regression equation, table of coefficients, s, R-squared, and basic analysis of variance. Click OK.

The results shown in the Session Window begin with:

    Regression Analysis: Test Score versus Hours study
    The regression equation is
    Test Score = 34.6 + 7.35 Hours study

The data window also contains a new column for the residuals, i.e. the differences between the observed and predicted values.

We can also obtain the equation when we create a scatter plot. This is described below.

If you want to make a prediction of a y-value for a specific x-value, do the above and also click on Options in the Regression dialog box, enter the x-value for Prediction intervals for new observations, and select Fits. This is described below. If we enter 5 as the x-value, the following is shown in the Session Window:

    Predicted Values for New Observations
    New Obs    Fit  SE Fit          95% CI          95% PI
          1  71.37    2.16  (66.60, 76.13)  (53.76, 88.97)

    Values of Predictors for New Observations
    New Obs  Hours study
          1         5.00

So the predicted value is 71.37. The result is also presented in the data window as the first element in a new column labeled PFIT1. The result is 71.3653.
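The slope and intercept formulas above can be sketched in a few lines of Python. Because the exercise 13 data are not reproduced in these notes, this sketch assumes the Example 1 (p 443) advertising and sales data instead; the function and variable names are illustrative only. It also evaluates the rounded Minitab equation at 5 hours for comparison with the reported fit.

```python
# Example 1 (p 443) data as listed in these notes (thousands of $).
# Assumption: used here only because the exercise 13 data are not given.
advertising = [2.4, 1.6, 2.0, 2.6, 1.4, 1.6, 2.0, 2.2]   # x
sales = [225, 184, 220, 240, 180, 184, 186, 215]          # y


def least_squares_line(x, y):
    """Return (m, b) for y-hat = m*x + b using the raw-sum formulas."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    m = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    b = sy / n - m * sx / n
    return m, b


m, b = least_squares_line(advertising, sales)

# Residuals: observed y minus predicted y.
residuals = [yi - (m * xi + b) for xi, yi in zip(advertising, sales)]

# Means, used to check that the line passes through (x-bar, y-bar).
x_bar = sum(advertising) / len(advertising)
y_bar = sum(sales) / len(sales)

print(f"y-hat = {m:.3f} x + {b:.3f}")

# Separate check on the Minitab output above: the rounded equation
# Test Score = 34.6 + 7.35 Hours study evaluated at 5 hours.
rounded_fit = 34.6 + 7.35 * 5
print(f"rounded-equation fit at x = 5: {rounded_fit:.2f}")
```

For the Example 1 data this gives a slope of about 50.729 and an intercept of about 104.061. The rounded Minitab equation gives 71.35 at 5 hours, slightly below the reported fit of 71.3653, which is consistent with Minitab using unrounded coefficients.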
To draw the least squares regression line on the scatter plot, click Stat -> Regression -> Fitted line plot, choose the Response (Y) and the Predictor (X), select Linear for the Type of Regression Model and click OK. The Fitted Line: Test Score versus Hours study regression line is presented with the scatter plot.

Using Minitab to find the Regression equation and a predicted value for the Old Faithful Data (p 460)

Next we will determine the regression equation where x is the duration and y is the time until the next eruption for Old Faithful. We will also predict the time when the duration is 5.1 minutes and draw the regression line on the scatter plot.

Click Stat -> Regression -> Regression.
Choose the Predictor (duration) and Response (time).
Click on Results and choose Regression equation, table of coefficients, s, R-squared, and basic analysis of variance. Click OK.
Then click on Options, enter the duration (5.1) for the Prediction intervals for new observations, and select Fits. Then click OK and OK again.
The results are:

    Regression Analysis: Time versus Duration
    The regression equation is
    Time = 35.0 + 12.0 Duration

    Predictor     Coef  SE Coef      T      P
    Constant    34.983    1.770  19.76  0.000
    Duration   11.9634   0.5080  23.55  0.000

    S = 3.21485   R-Sq = 94.4%   R-Sq(adj) = 94.2%

    Analysis of Variance
    Source          DF      SS      MS       F      P
    Regression       1  5732.8  5732.8  554.68  0.000
    Residual Error  33   341.1    10.3
    Total           34  6073.9

    Predicted Values for New Observations
    New Obs     Fit  SE Fit            95% CI             95% PI
          1  95.996   1.057  (93.847, 98.146)  (89.112, 102.881)

    Values of Predictors for New Observations
    New Obs  Duration
          1      5.10

The regression equation is Time = 35.0 + 12.0 Duration and the predicted value for the Duration of 5.10 is 95.996.

Using Minitab to draw the least squares regression line on the scatter plot for the Old Faithful Data

Click Stat -> Regression -> Fitted line plot, choose the Response (Time) and the Predictor (Duration), select Linear for the Type of Regression Model and click OK. The Fitted Line: Time versus Duration regression line is presented with the scatter plot. The results are:

[Figure: Fitted Line Plot, Time = 34.98 + 11.96 Duration; S = 3.21485, R-Sq = 94.4%, R-Sq(adj) = 94.2%. x-axis: Duration; y-axis: Time.]

Chapter 10 Chi-Square Tests and the F-Distribution (p 493)

Section 10.1 Goodness of Fit

DEFINITION A chi-square goodness-of-fit test is used to test whether a frequency distribution fits an expected distribution.

The test is used in a multinomial experiment to determine whether the number of results in each category fits the null hypothesis:
$H_0$: The distribution fits the proposed proportions.
$H_1$: The distribution differs from the claimed distribution.

To calculate the test statistic for the chi-square goodness-of-fit test, you can use observed frequencies and expected frequencies.

DEFINITION The observed frequency O of a category is the frequency for the category observed in the sample data.
The expected frequency E of a category is the calculated frequency for the category. Expected frequencies are obtained assuming the specified (or hypothesized) distribution. The expected frequency for the ith category is

$$E_i = n p_i$$

where n is the number of trials (the sample size) and $p_i$ is the assumed probability of the ith category.

The Chi-square Goodness of Fit Test: The sampling distribution for the goodness-of-fit test is a chi-square distribution with $k - 1$ degrees of freedom, where k is the number of categories. The test statistic is

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where O represents the observed frequency of each category and E represents the expected frequency of each category.

To use the chi-square goodness of fit test, the following must be true (p 496):
1. The observed frequencies must be obtained using a random sample.
2. The expected frequencies must be $\ge 5$.

Performing the Chi-Square Goodness-of-Fit Test (p 496)

Procedure, equations, and example (p 497):

1. Identify the claim. State the null and alternative hypotheses.
   Equation: State $H_0$ and $H_1$.
   Example: $H_0$: Classical 4%, Country 36%, Gospel 11%, Oldies 2%, Pop 18%, Rock 29%
2. Specify the significance level.
   Equation: Specify $\alpha$.
   Example: $\alpha = 0.01$
3. Determine the degrees of freedom.
   Equation: d.f. = #categories - 1
   Example: d.f. $= 6 - 1 = 5$
4. Find the critical value.
   Equation: $\chi^2_\alpha$: obtain from Table 6, Appendix B.
   Example: $\chi^2_{0.01}(\text{d.f.} = 5) = 15.086$
5. Identify the rejection region.
   Equation: $\chi^2 > \chi^2_\alpha$
   Example: $\chi^2 > 15.086$
6. Calculate the test statistic.
   Equation: $\chi^2 = \sum \frac{(O - E)^2}{E}$
   Example: Survey results, n = 500:

       Classical  O =   8   E = .04*500 =  20
       Country    O = 210   E = .36*500 = 180
       Gospel     O =  72   E = .11*500 =  55
       Oldies     O =  10   E = .02*500 =  10
       Pop        O =  75   E = .18*500 =  90
       Rock       O = 125   E = .29*500 = 145

   Substituting, $\chi^2 \approx 22.713$.
7. Make the decision to reject or fail to reject the null hypothesis.
   Equation: Reject if $\chi^2$ is in the rejection region. Equivalently, we reject if the P-value (the probability of getting as extreme a value or more extreme) is $\le \alpha$.
   Example: Since 22.713 > 15.086, we reject the null hypothesis. Equivalently, $P(X \ge 22.713) < 0.01$, so reject the null hypothesis. (Note Table 6 of Appendix B doesn't have a value less than 0.005.)
8. Interpret the decision in the context of the original claim.
   Example: Music preferences differ from the radio station's claim.

Using Minitab to perform the Chi-Square Goodness-of-Fit Test (Manual p 237)

The data from the example above (Example 2 p 497) will be used.

Enter three columns: Music Type: Classical, etc.; Observed: 8, etc.; Distribution: 0.04, etc. (Note the names of the columns 'Music Type', 'Observed' and 'Distribution' are entered in the gray row at the top.)

Select Calc -> Calculator, store the results in C4, and calculate the Expression C3*500. Click OK. Name C4 'Expected' since it now contains the expected frequencies.

    Music Type  Observed  Distribution  Expected
    Classical          8          0.04        20
    Country          210          0.36       180
    Gospel            72          0.11        55
    Oldies            10          0.02        10
    Pop               75          0.18        90
    Rock             125          0.29       145

Next calculate the contributions to the chi-square statistic, $(O - E)^2/E$, as follows:
Click Calc -> Calculator. Store the results in C5 and calculate the Expression (C2-C4)**2/C4. Click on OK and C5 should contain the calculated values:

    7.2000
    5.0000
    5.2545
    0.0000
    2.5000
    2.7586

Next add up the values in C5; the sum is the test statistic:
Click on Calc -> Column Statistics. Select Sum and enter C5 for the Input Variable. Click OK.
The chi-square statistic is displayed in the Session Window as follows:

    Sum of C5
    Sum of C5 = 22.7132

Next calculate the P-value:
Click on Calc -> Probability Distributions -> Chi-square. Select Cumulative Probability and enter 5 Degrees of Freedom.
Enter the value of the test statistic 22.7132 for the Input Constant. Click OK.

The following is displayed in the Session Window:

    Cumulative Distribution Function
    Chi-Square with 5 DF
          x  P( X <= x )
    22.7132     0.999617

$P(X \le 22.7132) = 0.999617$, so the P-value = 1 - 0.999617 = 0.000383. This is less than α = 0.01, so we reject the null hypothesis.

Instead of calculating the P-value, we could have found the critical value from the Chi-Square table (Table 6 Appendix B) for 5 degrees of freedom as we did above. The value is 15.086, and since our test statistic is 22.7132, we reject the null hypothesis.

Chi-Square with M&M's

$H_0$: Brown: 13%, Yellow: 14%, Red: 13%, Orange: 20%, Green: 16%, Blue: 24%
Significance level: α = 0.05
Degrees of freedom: number of categories - 1 = 5
Critical Value: $\chi^2_{0.05}(\text{d.f.} = 5) = 11.071$
Rejection Region: $\chi^2 > 11.071$
Test Statistic: $\chi^2 = \sum \frac{(O - E)^2}{E}$, where O is the actual number of M&M's of each color in the bag and E is the proportion specified under $H_0$ times the total number.
Reject $H_0$ if the test statistic is greater than the critical value (11.071).

Section 10.2 Independence (p 504)

This section describes the chi-square test for independence, which tests whether two random variables are independent of each other.

DEFINITION An r x c contingency table shows the observed frequencies for the two variables. The observed frequencies are arranged in r rows and c columns. The intersection of a row and a column is called a cell. (p 504)

The following is a contingency table for two variables A and B, where $f_{ij}$ is the frequency in row i and column j (that is, the frequency with which B equals $B_i$ and A equals $A_j$), $f_{i.}$ is the total of row i, $f_{.j}$ is the total of column j, and f is the sample size.

             A1     A2     A3     A4     Total
    B1       f11    f12    f13    f14    f1.
    B2       f21    f22    f23    f24    f2.
    B3       f31    f32    f33    f34    f3.
    Total    f.1    f.2    f.3    f.4    f

If A and B are independent, we'd expect

$$f_{ij} \approx P(B = B_i)\,P(A = A_j)\,f = \left(\frac{f_{i.}}{f}\right)\left(\frac{f_{.j}}{f}\right) f = \frac{(f_{i.})(f_{.j})}{f} = \frac{(\text{sum of row } i)(\text{sum of column } j)}{\text{sample size}} \qquad \text{(p 504)}$$

Example 1 (p 505) Determining the expected frequencies of CEO's ages as a function of company size under the assumption that age is independent of company size.

Observed frequencies:

             Small/midsize  Large  Total
    <= 39               42      5     47
    40 - 49             69     18     87
    50 - 59            108     85    193
    60 - 69             60    120    180
    >= 70               21     22     43
    Total              300    250    550

Expected frequencies:

                   <= 39                  40 - 49                50 - 59                  60 - 69                 >= 70                  Total
    Small/midsize  300*47/550 ≈ 25.64     300*87/550 ≈ 47.45     300*193/550 ≈ 105.27     300*180/550 ≈ 98.18     300*43/550 ≈ 23.45     300
    Large          250*47/550 ≈ 21.36     250*87/550 ≈ 39.55     250*193/550 ≈ 87.73      250*180/550 ≈ 81.82     250*43/550 ≈ 19.55     250
    Total          47                     87                     193                      180                     43                     550

After finding the expected frequencies under the assumption that the variables are independent, you can test whether they are independent using the chi-square independence test.

DEFINITION A chi-square independence test is used to test the independence of two random variables. Using a chi-square test, you can determine whether the occurrence of one variable affects the probability of occurrence of the other variable. (p 506)

To use the test,
1. The observed frequencies must be obtained from a random sample.
2. Each expected frequency must be $\ge 5$.

The sampling distribution for the test is a chi-square distribution with $(r - 1)(c - 1)$ degrees of freedom, where r and c are the number of rows and columns, respectively, of the contingency table. The test statistic for the chi-square independence test is

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where O represents the observed frequencies and E represents the expected frequencies.

To begin the test, we state the null hypothesis that the variables are independent and the alternative hypothesis that they are dependent.
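The expected-frequency rule (row total times column total, divided by the sample size) can be sketched in Python using the observed CEO counts from Example 1 (p 505). The function and variable names here are illustrative assumptions.

```python
# Observed CEO counts from Example 1 (p 505).
# Rows: company size; columns: age groups <= 39, 40-49, 50-59, 60-69, >= 70.
observed = [
    [42, 69, 108, 60, 21],    # Small/midsize
    [5, 18, 85, 120, 22],     # Large
]


def expected_frequencies(table):
    """E[i][j] = (sum of row i) * (sum of column j) / sample size."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


expected = expected_frequencies(observed)
for row in expected:
    print([round(value, 2) for value in row])
```

The rounded values match the expected-frequency table above, and each row of expected values sums to the same row total as the observed counts (300 and 250).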
Performing a Chi-Square Test for Independence (p 507)

Procedure, equations, and example (Example 2, p 507):

1. Identify the claim. State the null and alternative hypotheses.
   Equation: State $H_0$ and $H_1$.
   Example: $H_0$: CEO's ages are independent of company size. $H_1$: CEO's ages are dependent on company size.
2. Specify the level of significance.
   Equation: Specify $\alpha$.
   Example: $\alpha = 0.01$
3. Determine the degrees of freedom.
   Equation: d.f. $= (r - 1)(c - 1)$
   Example: d.f. $= (2 - 1)(5 - 1) = 4$
4. Find the critical value.
   Equation: $\chi^2_\alpha$: obtain from Table 6, Appendix B.
   Example: $\chi^2_\alpha = 13.277$
5. Identify the rejection region.
   Equation: $\chi^2 > \chi^2_\alpha$
   Example: Reject if $\chi^2 > 13.277$.
6. Calculate the test statistic.
   Equation: $\chi^2 = \sum \frac{(O - E)^2}{E}$
   Example: $\chi^2 = \sum \frac{(O - E)^2}{E} \approx 77.9$. Note that O is in the table of actual CEO's ages above, and E is in the table of expected CEO's ages (if independent of size) above.
7. Make a decision to reject or fail to reject the null hypothesis.
   Equation: Reject if $\chi^2$ is in the rejection region. Equivalently, we reject if the P-value (the probability of getting as extreme a value or more extreme) is $\le \alpha$.
   Example: Since 77.9 > 13.277, we reject the null hypothesis. Equivalently, $P(X \ge 77.9) < \alpha$, so reject the null hypothesis. (Note Table 6 of Appendix B doesn't have a value less than 0.005.)
8. Interpret the decision in the context of the original claim.
   Example: CEO's ages and company size are dependent.

Using Minitab to perform the Chi-Square Independence Test (Manual p 242)

Enter the names of the columns Comp Size, 39 and under, 40 - 49, etc., enter the row names Small and Large, and enter the values in the rows as follows:

    Comp Size  39 and under  40 - 49  50 - 59  60 - 69  70 and over
    Small                42       69      108       60           21
    Large                 5       18       85      120           22

Click Stat -> Tables -> Chi-square Test.
On the Chi-square Test dialog box, enter columns C2-C6 for the Columns containing the table. (Note you can select several columns in the Windows way, i.e. with Shift-Click or Control-Click.)
Click OK.
The results displayed in the Session Window are:

    Chi-Square Test: 39 and under, 40 - 49, 50 - 59, 60 - 69, 70 and over

    Expected counts are printed below observed counts
    Chi-Square contributions are printed below expected counts

           39 and under  40 - 49  50 - 59  60 - 69  70 and over  Total
    1                42       69      108       60           21    300
                  25.64    47.45   105.27    98.18        23.45
                 10.445    9.782    0.071   14.848        0.257

    2                 5       18       85      120           22    250
                  21.36    39.55    87.73    81.82        19.55
                 12.534   11.739    0.085   17.818        0.308

    Total            47       87      193      180           43    550

    Chi-Sq = 77.887, DF = 4, P-Value = 0.000

The test statistic (77.887) is greater than the critical value obtained from Table 6, Appendix B (13.277), so the null hypothesis is rejected. (Alternatively, the P-Value (0.000) is less than the level of significance, α (0.01), so the null hypothesis is rejected.)
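The Minitab result above can be cross-checked with a short Python sketch that recomputes the test statistic from the observed counts. The P-value here uses a closed form for the chi-square upper tail that is valid only for an even number of degrees of freedom (which covers d.f. = 4); that shortcut and all function names are assumptions of this sketch, not Minitab's method.

```python
import math

# Observed CEO counts (rows: Small, Large; columns: five age groups).
observed = [
    [42, 69, 108, 60, 21],
    [5, 18, 85, 120, 22],
]


def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected frequency
            statistic += (o - e) ** 2 / e           # cell contribution
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


def chi2_upper_tail_even_df(x, df):
    """P(X >= x) for a chi-square variable; closed form for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


statistic, df = chi_square_independence(observed)
p_value = chi2_upper_tail_even_df(statistic, df)
critical_value = 13.277   # Table 6, Appendix B, alpha = 0.01, d.f. = 4
reject_h0 = statistic > critical_value

print(f"Chi-Sq = {statistic:.3f}, DF = {df}")
print(f"P-value below 0.01: {p_value < 0.01}")
print(f"Reject H0: {reject_h0}")
```

The statistic agrees with Minitab's 77.887 with 4 degrees of freedom, and the P-value is far below α = 0.01, which is why Minitab prints it as 0.000.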
