Stats 845
Applied Statistics

This course will cover:
1. Regression
   – Non-linear regression
   – Multiple regression
2. Analysis of Variance and Experimental Design

The emphasis will be on:
1. Learning techniques through examples.
2. Use of common statistical packages:
   • SPSS
   • Minitab
   • SAS
   • SPlus

What is Statistics?

It is the major mathematical tool of scientific inference: the art of drawing conclusions from data, where the data are to some extent corrupted by a component of random variation (random noise).

An analogy can be drawn between data affected by random components of variation and signals corrupted by noise. Quite often the sounds heard or received by a radio receiver can be thought of as signals with superimposed noise. The objective in signal theory is to extract the signal from the received sound (i.e. to remove the noise to the greatest extent possible). The same is true in data analysis.

Example A
Suppose we are comparing the effect of three different diets on weight loss. An observation on weight loss can be thought of as being made up of two components:
1. A component due to the effect of the diet being applied to the subject (the signal).
2. A random component due to other factors affecting weight loss that are not considered (initial weight of the subject, sex of the subject, metabolic makeup of the subject): the random noise.

Note: random assignment of subjects to diets will ensure that this component is a random effect.

Example B
In this example we are again comparing the effect of three diets on weight gain. Subjects are randomly divided into three groups. Diets are randomly distributed amongst the groups. Measurements on weight gain are taken at the following times:
- one month
- two months
- 6 months and
- 1 year
after commencement of the diet.
In addition to the two factors Time and Diet affecting weight gain, there are two random sources of variation (noise):
- between subject variation and
- within subject variation

This can be illustrated in a schematic fashion as follows:

[Schematic: the deterministic factors (Diet, Time) and the random noise (within subject, between subject) together determine the response (weight gain).]

The circle of Research

[Diagram: Questions arise about a phenomenon -> A decision is made to collect data -> A decision is made as to how to collect the data (Statistics) -> The data is collected -> The data is summarized and analyzed (Statistics) -> Conclusions are drawn from the analysis -> back to new questions.]

Notice the two points on the circle where statistics plays an important role:
1. The analysis of the collected data.
2. The design of a data collection procedure.

The analysis of the collected data
• This of course is the traditional use of statistics.
• Note that if the data collection procedure is well thought out and well designed, the analysis step of the research project will be straightforward.
• Usually experimental designs are chosen with the statistical analysis already in mind.
• Thus the strategy for the analysis is usually decided upon when the study is designed.
• It is a dangerous practice to select the form of analysis after the data has been collected (the choice may favour certain predetermined conclusions and therefore result in a considerable loss of objectivity).
• Sometimes, however, a decision to use a specific type of analysis has to be made after the data has been collected (because it was overlooked at the design stage).

The design of a data collection procedure
• The importance of statistics is quite often ignored at this stage.
• It is important that the data collection procedure will eventually result in answers to the research questions,
• and that it will result in the most accurate answers for the resources available to the research team.
• Note that the success of a research project should not depend on the answers that it comes up with but on the accuracy of the answers.
• This fact is usually an indicator of a valuable research project.

Some definitions important to Statistics

A population: this is the complete collection of subjects (objects) that are of interest in the study. There may be (and frequently is) more than one population, in which case a major objective is that of comparison.

A case (elementary sampling unit): this is an individual unit (subject) of the population.

A variable: a measurement or type of measurement that is made on each individual case in the population.

Types of variables
Some variables may be measured on a numerical scale while others are measured on a categorical scale. The nature of the variables has a great influence on which analysis will be used.

For variables measured on a numerical scale the measurements will be numbers.
Ex: Age, Weight, Systolic Blood Pressure

For variables measured on a categorical scale the measurements will be categories.
Ex: Sex, Religion, Heart Disease

In addition, some variables are labeled as dependent variables and some as independent variables. This usually depends on the objectives of the analysis. Dependent variables are output or response variables, while the independent variables are the input variables or factors. Usually one is interested in determining equations that describe how the dependent variables are affected by the independent variables.

A sample: a subset of the population.

Types of Samples
Different types of samples are determined by how the sample is selected.

Convenience Samples
In a convenience sample the subjects that are most convenient to the researcher are selected as objects in the sample. This is not a very good procedure for inferential statistical analysis but is useful for exploratory preliminary work.

Quota samples
In quota samples subjects are chosen conveniently until quotas are met for different subgroups of the population. This also is useful for exploratory preliminary work.
Random Samples
Random samples of a given size are selected in such a way that all possible samples of that size have the same probability of being selected.

Convenience samples and quota samples are useful for preliminary studies. It is however difficult to assess the accuracy of estimates based on this type of sampling scheme. Sometimes, however, one has to be satisfied with a convenience sample and assume that it is equivalent to a random sampling procedure.

A population statistic (parameter): any quantity computed from the values of variables for the entire population.

A sample statistic: any quantity computed from the values of variables for the cases in the sample.

Statistical Decision Making
• Almost all problems in statistics can be formulated as a problem of making a decision.
• That is, given some data observed from some phenomenon, a decision will have to be made about the phenomenon.

Decisions are generally broken into two types:
• Estimation decisions and
• Hypothesis Testing decisions.

Probability Theory plays a very important role in these decisions and in the assessment of the error made by these decisions.

Definition: A random variable X is a numerical quantity that is determined by the outcome of a random experiment.

Example: An individual is selected at random from a population and X = the weight of the individual.

The probability distribution of a (continuous) random variable is described by its probability density curve f(x), i.e. a curve which has the following properties:
• 1. f(x) is never negative.
• 2. The total area under the curve f(x) is one.
• 3. The area under the curve f(x) between a and b is the probability that X lies between the two values.
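The three properties of a density curve can be checked numerically. The sketch below is only an illustration: it assumes a normal density with mean 50 and standard deviation 15 (values chosen for the example), and the helper `area_under` is a hypothetical name for a simple trapezoid-rule integrator, not part of any package.

```python
from statistics import NormalDist


def area_under(f, a, b, steps=20_000):
    """Approximate the area under f between a and b with the trapezoid rule."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h


# Illustrative choice (an assumption): X is normal with mean 50 and sd 15.
X = NormalDist(mu=50, sigma=15)
f = X.pdf

# Property 1: the curve never drops below zero (checked on a grid of points).
never_negative = all(f(x) >= 0 for x in range(-100, 201))

# Property 2: the total area is one (the tails beyond 10 sd are negligible).
total_area = area_under(f, 50 - 150, 50 + 150)

# Property 3: the area between a and b is P(a < X < b).
a, b = 40, 70
area_ab = area_under(f, a, b)
prob_ab = X.cdf(b) - X.cdf(a)

print(never_negative, round(total_area, 6), round(area_ab, 4), round(prob_ab, 4))
```

The last two printed numbers should agree: the area between 40 and 70 computed from the curve is the same quantity as the probability obtained from the cumulative distribution function.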
[Figure: a probability density curve f(x)]

Examples of some important Univariate distributions

1. The Normal distribution
A common probability density curve is the “Normal” density curve: symmetric and bell shaped.

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Comment: If $\mu = 0$ and $\sigma = 1$ the distribution is called the standard normal distribution.

[Figure: normal density curves with $\mu = 50$, $\sigma = 15$ and with $\mu = 70$, $\sigma = 20$]

2. The Chi-squared distribution with $\nu$ degrees of freedom

$$f(x) = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\, x^{(\nu-2)/2}\, e^{-x/2} \quad \text{if } x \ge 0$$

[Figure: chi-squared density curve]

Comment: If $z_1, z_2, \ldots, z_\nu$ are independent random variables each having a standard normal distribution then

$$U = z_1^2 + z_2^2 + \cdots + z_\nu^2$$

has a chi-squared distribution with $\nu$ degrees of freedom.

3. The F distribution with $\nu_1$ degrees of freedom in the numerator and $\nu_2$ degrees of freedom in the denominator

$$f(x) = K\, x^{(\nu_1-2)/2} \left(1 + \frac{\nu_1}{\nu_2}\, x\right)^{-(\nu_1+\nu_2)/2} \quad \text{if } x \ge 0$$

where

$$K = \frac{\Gamma\!\left(\frac{\nu_1+\nu_2}{2}\right)}{\Gamma\!\left(\frac{\nu_1}{2}\right)\Gamma\!\left(\frac{\nu_2}{2}\right)} \left(\frac{\nu_1}{\nu_2}\right)^{\nu_1/2}$$

[Figure: F density curve]

Comment: If $U_1$ and $U_2$ are independent random variables having Chi-squared distributions with $\nu_1$ and $\nu_2$ degrees of freedom respectively then

$$F = \frac{U_1/\nu_1}{U_2/\nu_2}$$

has an F distribution with $\nu_1$ degrees of freedom in the numerator and $\nu_2$ degrees of freedom in the denominator.

4. The t distribution with $\nu$ degrees of freedom

$$f(x) = K \left(1 + \frac{x^2}{\nu}\right)^{-(\nu+1)/2} \quad \text{where } K = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\;\Gamma\!\left(\frac{\nu}{2}\right)}$$

[Figure: t density curve]

Comment: If z and U are independent random variables, and z has a standard Normal distribution while U has a Chi-squared distribution with $\nu$ degrees of freedom, then

$$t = \frac{z}{\sqrt{U/\nu}}$$

has a t distribution with $\nu$ degrees of freedom.
An Applet showing critical values and tail probabilities for various distributions
1. Standard Normal
2. T distribution
3. Chi-square distribution
4. Gamma distribution
5. F distribution

The Sampling distribution of a statistic

A random sample from a probability distribution with density function f(x) is a collection of n independent random variables, $x_1, x_2, \ldots, x_n$, each with the probability distribution described by f(x).

If, for example, we collect a random sample of individuals from a population and
– measure some variable X for each of those individuals,
– the n measurements $x_1, x_2, \ldots, x_n$ will form a set of n independent random variables with a probability distribution equivalent to the distribution of X across the population.

A statistic T is any quantity computed from the random observations $x_1, x_2, \ldots, x_n$.
• Any statistic will necessarily also be a random variable and therefore will have a probability distribution described by some probability density function $f_T(t)$.
• This distribution is called the sampling distribution of the statistic T.
• This distribution is very important if one is using this statistic in a statistical analysis.
• It is used to assess the accuracy of a statistic if it is used as an estimator.
• It is used to determine thresholds for acceptance and rejection if it is used for Hypothesis testing.

Some examples of Sampling distributions of statistics

Distribution of the sample mean for a sample from a Normal population
Let $x_1, x_2, \ldots, x_n$ be a sample from a normal population with mean $\mu$ and standard deviation $\sigma$. Let

$$\bar{x} = \frac{\sum_i x_i}{n}$$

Then $\bar{x}$ has a normal sampling distribution with mean $\mu_{\bar{x}} = \mu$ and standard deviation $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$.

Distribution of the z statistic
Let $x_1, x_2, \ldots, x_n$ be a sample from a normal population with mean $\mu$ and standard deviation $\sigma$. Let

$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$

Then z has a standard normal distribution.

Comment: Many statistics T have a normal distribution with mean $\mu_T$ and standard deviation $\sigma_T$.
Then $z = \dfrac{T - \mu_T}{\sigma_T}$ will have a standard normal distribution.

Distribution of the $\chi^2$ statistic for sample variance
Let $x_1, x_2, \ldots, x_n$ be a sample from a normal population with mean $\mu$ and standard deviation $\sigma$. Let

$$s^2 = \frac{\sum_i (x_i - \bar{x})^2}{n-1} = \text{sample variance}$$

and

$$s = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n-1}} = \text{sample standard deviation}$$

Let

$$\chi^2 = \frac{\sum_i (x_i - \bar{x})^2}{\sigma^2} = \frac{(n-1)s^2}{\sigma^2}$$

Then $\chi^2$ has a chi-squared distribution with $\nu = n - 1$ degrees of freedom.

[Figure: the chi-squared distribution]

Distribution of the t statistic
Let $x_1, x_2, \ldots, x_n$ be a sample from a normal population with mean $\mu$ and standard deviation $\sigma$. Let

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

Then t has Student’s t distribution with $\nu = n - 1$ degrees of freedom.

Comment: Suppose an estimator T has a normal distribution with mean $\mu_T$ and standard deviation $\sigma_T$. If $s_T$ is an estimator of $\sigma_T$ based on $\nu$ degrees of freedom, then

$$t = \frac{T - \mu_T}{s_T}$$

will have Student’s t distribution with $\nu$ degrees of freedom.

[Figure: the t distribution compared with the standard normal distribution]

Point estimation
• A statistic T is called an estimator of the parameter $\theta$ if its value is used as an estimate of the parameter $\theta$.
• The performance of an estimator T is determined by how “close” the sampling distribution of T is to the parameter $\theta$ being estimated.
• An estimator T is called an unbiased estimator of $\theta$ if $\mu_T$, the mean of the sampling distribution of T, satisfies $\mu_T = \theta$.
• This implies that in the long run the average value of T is $\theta$.
• An estimator T is called the Minimum Variance Unbiased estimator of $\theta$ if T is an unbiased estimator and it has the smallest standard error $\sigma_T$ amongst all unbiased estimators of $\theta$.
• If the sampling distribution of T is normal, the standard error of T is extremely important. It completely describes the variability of the estimator T.

Interval Estimation (confidence intervals)
• Point estimators give only single values as an estimate. There is no indication of the accuracy of the estimate.
• The accuracy can sometimes be measured and shown by displaying the standard error of the estimate.
• There is however a better way:
• using the idea of confidence interval estimates.
• The unknown parameter is estimated with a range of values that has a given probability of capturing the parameter being estimated.

Confidence Intervals
• The interval $T_L$ to $T_U$ is called a $(1 - \alpha)100\%$ confidence interval for the parameter $\theta$ if the probability that $\theta$ lies in the range $T_L$ to $T_U$ is equal to $1 - \alpha$.
• Here $T_L$ and $T_U$ are
  – statistics
  – random numerical quantities calculated from the data.

Examples

Confidence interval for the mean of a Normal population (based on the z statistic)

$$T_L = \bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \quad \text{to} \quad T_U = \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$

is a $(1 - \alpha)100\%$ confidence interval for $\mu$, the mean of a normal population. Here $z_{\alpha/2}$ is the upper $(\alpha/2)100\%$ percentage point of the standard normal distribution.

More generally, if T is an unbiased estimator of the parameter $\theta$ and has a normal sampling distribution with known standard error $\sigma_T$, then

$$T_L = T - z_{\alpha/2}\,\sigma_T \quad \text{to} \quad T_U = T + z_{\alpha/2}\,\sigma_T$$

is a $(1 - \alpha)100\%$ confidence interval for $\theta$.

Confidence interval for the mean of a Normal population (based on the t statistic)

$$T_L = \bar{x} - t_{\alpha/2}\frac{s}{\sqrt{n}} \quad \text{to} \quad T_U = \bar{x} + t_{\alpha/2}\frac{s}{\sqrt{n}}$$

is a $(1 - \alpha)100\%$ confidence interval for $\mu$, the mean of a normal population. Here $t_{\alpha/2}$ is the upper $(\alpha/2)100\%$ percentage point of Student’s t distribution with $\nu = n - 1$ degrees of freedom.

More generally, if T is an unbiased estimator of the parameter $\theta$ and has a normal sampling distribution with estimated standard error $s_T$, based on $\nu$ degrees of freedom, then

$$T_L = T - t_{\alpha/2}\,s_T \quad \text{to} \quad T_U = T + t_{\alpha/2}\,s_T$$

is a $(1 - \alpha)100\%$ confidence interval for $\theta$.
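The two intervals for a normal mean can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the ten observations are the Diet 1 weight gains from the diet example later in these notes, the "known" $\sigma = 15$ used for the z interval is a hypothetical value, and the t critical value 2.262 (9 degrees of freedom, $\alpha = 0.05$) is taken from a table rather than computed. The function names `z_interval` and `t_interval` are made up for the example.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def z_interval(xbar, sigma, n, alpha=0.05):
    """(1 - alpha)100% interval for the mean when sigma is known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # upper alpha/2 point of N(0, 1)
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width


def t_interval(xbar, s, n, t_crit):
    """(1 - alpha)100% interval for the mean when sigma is estimated by s.

    t_crit is the upper alpha/2 point of the t distribution with n - 1
    degrees of freedom, looked up in a table by the caller.
    """
    half_width = t_crit * s / sqrt(n)
    return xbar - half_width, xbar + half_width


# Diet 1 weight gains (grams) from the diet example in these notes.
gains = [73, 102, 118, 104, 81, 107, 100, 87, 117, 111]
n = len(gains)
xbar = mean(gains)
s = stdev(gains)  # sample standard deviation (divisor n - 1)

# z interval: sigma = 15 is a hypothetical "known" value for illustration.
lo_z, hi_z = z_interval(xbar, sigma=15, n=n)

# t interval: 2.262 is the tabulated t(0.025) value for 9 degrees of freedom.
lo_t, hi_t = t_interval(xbar, s, n, t_crit=2.262)

print(f"z interval: {lo_z:.2f} to {hi_z:.2f}")
print(f"t interval: {lo_t:.2f} to {hi_t:.2f}")
```

The t interval comes out wider than the z interval, which reflects the extra uncertainty from estimating the standard deviation with only 9 degrees of freedom.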
Common Confidence intervals

Situation: Sample from the Normal distribution with unknown mean and known variance (estimating $\mu$) (n large)
Confidence interval: $\bar{x} \pm z_{\alpha/2}\dfrac{\sigma_0}{\sqrt{n}}$

Situation: Sample from the Normal distribution with unknown mean and unknown variance (estimating $\mu$) (n small)
Confidence interval: $\bar{x} \pm t_{\alpha/2}\dfrac{s}{\sqrt{n}}$

Situation: Estimation of a binomial probability p
Confidence interval: $\hat{p} \pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$

Situation: Two independent samples from the Normal distribution with unknown means and known variances (estimating $\mu_1 - \mu_2$) (n, m large)
Confidence interval: $\bar{x} - \bar{y} \pm z_{\alpha/2}\sqrt{\dfrac{\sigma_x^2}{n} + \dfrac{\sigma_y^2}{m}}$

Situation: Two independent samples from the Normal distribution with unknown means and unknown but equal variances (estimating $\mu_1 - \mu_2$) (n, m small)
Confidence interval: $\bar{x} - \bar{y} \pm t_{\alpha/2}\, s_{\text{Pooled}}\sqrt{\dfrac{1}{n} + \dfrac{1}{m}}$

Situation: Estimation of the difference between two binomial probabilities, $p_1 - p_2$
Confidence interval: $\hat{p}_1 - \hat{p}_2 \pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$

Multiple Confidence intervals
In many situations one is interested in estimating not only a single parameter, $\theta$, but a collection of parameters, $\theta_1, \theta_2, \theta_3, \ldots$

A collection of intervals, $T_{L1}$ to $T_{U1}$, $T_{L2}$ to $T_{U2}$, $T_{L3}$ to $T_{U3}$, ..., is called a set of $(1 - \alpha)100\%$ multiple confidence intervals if the probability that all the intervals capture their respective parameters is $1 - \alpha$.

Hypothesis Testing
• Another important area of statistical inference is that of Hypothesis Testing.
• In this situation one has a statement (Hypothesis) about the parameter(s) of the distributions being sampled, and one is interested in deciding whether the statement is true or false.
• In fact there are two hypotheses:
  – the Null Hypothesis ($H_0$) and
  – the Alternative Hypothesis ($H_A$).
• A decision will be made either to
  – Accept $H_0$ (Reject $H_A$) or to
  – Reject $H_0$ (Accept $H_A$).
• The following table gives the different possibilities for the decision and the different possibilities for the correctness of the decision:

                 H0 is true          H0 is false
   Accept H0     Correct Decision    Type II error
   Reject H0     Type I error        Correct Decision

• Type I error: the Null Hypothesis $H_0$ is rejected when it is true.
• The probability that a decision procedure makes a type I error is denoted by $\alpha$, and is sometimes called the significance level of the test.
• Common significance levels that are used are $\alpha = .05$ and $\alpha = .01$.
• Type II error: the Null Hypothesis $H_0$ is accepted when it is false.
• The probability that a decision procedure makes a type II error is denoted by $\beta$.
• The probability $1 - \beta$ is called the Power of the test and is the probability that the decision procedure correctly rejects a false Null Hypothesis.

A statistical test is defined by
• 1. Choosing a statistic for making the decision to Accept or Reject $H_0$. This statistic is called the test statistic.
• 2. Dividing the set of possible values of the test statistic into two regions: an Acceptance Region and a Critical Region.
• If, upon collection of the data and evaluation of the test statistic, its value lies in the Acceptance Region, a decision is made to accept the Null Hypothesis $H_0$.
• If, upon collection of the data and evaluation of the test statistic, its value lies in the Critical Region, a decision is made to reject the Null Hypothesis $H_0$.
• The probability of a type I error, $\alpha$, is usually set at a predefined level by choosing the critical thresholds (boundaries between the Acceptance and Critical Regions) appropriately.
• The probability of a type II error, $\beta$, is decreased (and the power of the test, $1 - \beta$, is increased) by
1. Choosing the “best” test statistic.
2. Selecting the most efficient experimental design.
3.
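Increasing the amount of information (usually by increasing the sample sizes involved) on which the decision is based.

Some common Tests

Situation: Sample from the Normal distribution with unknown mean and known variance (testing $\mu$) (n large)
Test statistic: $z = \dfrac{\sqrt{n}\,(\bar{x} - \mu_0)}{\sigma}$
$H_0$: $\mu = \mu_0$
   $H_A$: $\mu \ne \mu_0$   Critical region: $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$
   $H_A$: $\mu > \mu_0$   Critical region: $z > z_\alpha$
   $H_A$: $\mu < \mu_0$   Critical region: $z < -z_\alpha$

Situation: Sample from the Normal distribution with unknown mean and unknown variance (testing $\mu$) (n small)
Test statistic: $t = \dfrac{\sqrt{n}\,(\bar{x} - \mu_0)}{s}$
$H_0$: $\mu = \mu_0$
   $H_A$: $\mu \ne \mu_0$   Critical region: $t < -t_{\alpha/2}$ or $t > t_{\alpha/2}$
   $H_A$: $\mu > \mu_0$   Critical region: $t > t_\alpha$
   $H_A$: $\mu < \mu_0$   Critical region: $t < -t_\alpha$

Situation: Testing of a binomial probability p
Test statistic: $z = \dfrac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$
$H_0$: $p = p_0$
   $H_A$: $p \ne p_0$   Critical region: $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$
   $H_A$: $p > p_0$   Critical region: $z > z_\alpha$
   $H_A$: $p < p_0$   Critical region: $z < -z_\alpha$

Situation: Two independent samples from the Normal distribution with unknown means and known variances (testing $\mu_1 - \mu_2$) (n, m large)
Test statistic: $z = \dfrac{\bar{x} - \bar{y}}{\sqrt{\dfrac{\sigma_x^2}{n} + \dfrac{\sigma_y^2}{m}}}$
$H_0$: $\mu_1 = \mu_2$
   $H_A$: $\mu_1 \ne \mu_2$   Critical region: $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$
   $H_A$: $\mu_1 > \mu_2$   Critical region: $z > z_\alpha$
   $H_A$: $\mu_1 < \mu_2$   Critical region: $z < -z_\alpha$

Situation: Two independent samples from the Normal distribution with unknown means and unknown but equal variances (testing $\mu_1 - \mu_2$)
Test statistic: $t = \dfrac{\bar{x} - \bar{y}}{s_{\text{Pooled}}\sqrt{\dfrac{1}{n} + \dfrac{1}{m}}}$
$H_0$: $\mu_1 = \mu_2$
   $H_A$: $\mu_1 \ne \mu_2$   Critical region: $t < -t_{\alpha/2}$ or $t > t_{\alpha/2}$
   $H_A$: $\mu_1 > \mu_2$   Critical region: $t > t_\alpha$
   $H_A$: $\mu_1 < \mu_2$   Critical region: $t < -t_\alpha$

Situation: Testing the difference between two binomial probabilities, $p_1 - p_2$
Test statistic: $z = \dfrac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$, where $\hat{p}$ is the pooled estimate of the common proportion
$H_0$: $p_1 = p_2$
   $H_A$: $p_1 \ne p_2$   Critical region: $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$
   $H_A$: $p_1 > p_2$   Critical region: $z > z_\alpha$
   $H_A$: $p_1 < p_2$   Critical region: $z < -z_\alpha$

The p-value approach to Hypothesis Testing

In hypothesis testing we need
1. A test statistic
2. A Critical and an Acceptance region for the test statistic

The Critical Region is set up under the sampling distribution of the test statistic, so that the area above the critical region is $\alpha$ (0.05 or 0.01). The critical region may be one tailed or two tailed.

The Critical region:

[Figure: standard normal curve; $H_0$ is accepted for $-z_{\alpha/2} \le z \le z_{\alpha/2}$ and rejected in the two tails, each of area $\alpha/2$]

$$P[\text{Accept } H_0 \text{ when true}] = P[-z_{\alpha/2} \le z \le z_{\alpha/2}] = 1 - \alpha$$

$$P[\text{Reject } H_0 \text{ when true}] = P[z < -z_{\alpha/2} \text{ or } z > z_{\alpha/2}] = \alpha$$

The test is carried out by
1. Computing the value of the test statistic
2. Making the decision
   a. Reject if the value is in the Critical region and
   b. Accept if the value is in the Acceptance region.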
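The two steps above can be sketched for the z test of a normal mean with known $\sigma$. This is an illustration only: the function name `z_test` and the numbers in the usage lines (sample mean 104.6 or 102.4, $\mu_0 = 100$, $\sigma = 20$, n = 100) are hypothetical and were chosen so that the test statistic comes out at round values.

```python
from math import sqrt
from statistics import NormalDist


def z_test(xbar, mu0, sigma, n, alpha=0.05, alternative="two-sided"):
    """Critical-region z test of H0: mu = mu0 with known sigma.

    Returns the test statistic, the critical value and the decision
    (True means the statistic fell in the Critical region: reject H0).
    """
    # Step 1: compute the value of the test statistic.
    z = sqrt(n) * (xbar - mu0) / sigma

    # Step 2: compare it with the critical region for the chosen alternative.
    std_normal = NormalDist()
    if alternative == "two-sided":      # HA: mu != mu0
        crit = std_normal.inv_cdf(1 - alpha / 2)
        reject = z < -crit or z > crit
    elif alternative == "greater":      # HA: mu > mu0
        crit = std_normal.inv_cdf(1 - alpha)
        reject = z > crit
    elif alternative == "less":         # HA: mu < mu0
        crit = std_normal.inv_cdf(1 - alpha)
        reject = z < -crit
    else:
        raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")
    return z, crit, reject


# Hypothetical numbers: these give z = 2.3 and z = 1.2 respectively.
z1, crit1, reject1 = z_test(xbar=104.6, mu0=100, sigma=20, n=100)
z2, crit2, reject2 = z_test(xbar=102.4, mu0=100, sigma=20, n=100)

print(f"z = {z1:.2f}, critical value = {crit1:.3f}, reject H0: {reject1}")
print(f"z = {z2:.2f}, critical value = {crit2:.3f}, reject H0: {reject2}")
```

The decision returned is a plain accept/reject; it says nothing about how close the statistic was to the boundary of the Critical region.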
The value of the test statistic may be in the Acceptance region but close to being in the Critical region, or it may be in the Critical region but close to being in the Acceptance region. To measure this we compute the p-value.

Definition: Once the test statistic has been computed from the data, the p-value is defined to be:

p-value = P[the test statistic is as or more extreme than the observed value of the test statistic]

Here “more extreme” means giving stronger evidence for rejecting $H_0$.

Example: Suppose we are using the z-test for the mean $\mu$ of a normal population and $\alpha = 0.05$. Then $z_{0.025} = 1.960$.

Thus the critical region is to reject $H_0$ if z < -1.960 or z > 1.960.

Suppose z = 2.3; then we reject $H_0$.

p-value = P[the test statistic is as or more extreme than the observed value of the test statistic]
        = P[z > 2.3] + P[z < -2.3]
        = 0.0107 + 0.0107 = 0.0214

[Graph: the p-value is the area under the standard normal curve below -2.3 and above 2.3]

If the value of z = 1.2, then we accept $H_0$.

p-value = P[the test statistic is as or more extreme than the observed value of the test statistic]
        = P[z > 1.2] + P[z < -1.2]
        = 0.1151 + 0.1151 = 0.2302

There is a 23.02% chance that the test statistic is as or more extreme than 1.2. This is fairly high, hence 1.2 is not very extreme.

[Graph: the p-value is the area under the standard normal curve below -1.2 and above 1.2]

Properties of the p-value
1. If the p-value is small (< 0.05 or 0.01) $H_0$ should be rejected.
2. The p-value measures the plausibility of $H_0$.
3. If the test is two tailed the p-value should be two tailed.
4. If the test is one tailed the p-value should be one tailed.
5. It is customary to report p-values when reporting the results. This gives the reader some idea of the strength of the evidence for rejecting $H_0$.

Multiple testing
Quite often one is interested in performing a collection (family) of tests of hypotheses.
1. $H_{0,1}$ versus $H_{A,1}$.
2. $H_{0,2}$ versus $H_{A,2}$.
3. $H_{0,3}$ versus $H_{A,3}$.
etc.
• Let $\alpha^*$ denote the probability that at least one type I error is made in the collection of tests that are performed.
• The value of $\alpha^*$, the family type I error rate, can be considerably larger than $\alpha$, the type I error rate of each individual test.
• The value of the family error rate, $\alpha^*$, can be controlled by altering the thresholds of each individual test appropriately.
• A testing procedure of this nature is called a Multiple testing procedure.

A chart illustrating Statistical Procedures

Independent variables: Categorical
   Dependent variables Categorical: Multiway frequency Analysis (Log Linear Model)
   Dependent variables Continuous: ANOVA (single dep var), MANOVA (mult dep var)
   Dependent variables Continuous & Categorical: ??

Independent variables: Continuous
   Dependent variables Categorical: Discriminant Analysis
   Dependent variables Continuous: MULTIPLE REGRESSION (single dependent variable), MULTIVARIATE MULTIPLE REGRESSION (multiple dependent variables)
   Dependent variables Continuous & Categorical: ??

Independent variables: Continuous & Categorical
   Dependent variables Categorical: Discriminant Analysis
   Dependent variables Continuous: ANACOVA (single dep var), MANACOVA (mult dep var)
   Dependent variables Continuous & Categorical: ??

Comparing k Populations
Means – One way Analysis of Variance (ANOVA)

The F test – for comparing k means

Situation
• We have k normal populations.
• Let $\mu_i$ and $\sigma_i$ denote the mean and standard deviation of population i,
• i = 1, 2, 3, …, k.
• Note: we assume that the standard deviation for each population is the same: $\sigma_1 = \sigma_2 = \cdots = \sigma_k = \sigma$.

We want to test

$$H_0: \mu_1 = \mu_2 = \mu_3 = \cdots = \mu_k$$

against

$$H_A: \mu_i \ne \mu_j \text{ for at least one pair } i, j$$

To do so, use the test statistic

$$F = \frac{s^2_{\text{Between}}}{s^2_{\text{Error}}} = \frac{\sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2 \Big/ (k-1)}{\sum_{i=1}^{k} (n_i - 1)s_i^2 \Big/ \left(\sum_{i=1}^{k} n_i - k\right)}$$

where $\bar{x}_i$ = mean for the i-th sample,
$s_i$ = standard deviation for the i-th sample, and

$$\bar{x} = \frac{n_1\bar{x}_1 + \cdots + n_k\bar{x}_k}{n_1 + \cdots + n_k} = \text{overall mean}$$

The statistic $\sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2$ is called the Between Sum of Squares and is denoted by $SS_{\text{Between}}$. It measures the variability between samples.

k – 1 is known as the Between degrees of freedom, and

$$\frac{\sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2}{k-1}$$

is called the Between Mean Square and is denoted by $MS_{\text{Between}}$.

The statistic $\sum_{i=1}^{k} (n_i - 1)s_i^2$ is called the Error Sum of Squares and is denoted by $SS_{\text{Error}}$.

$\sum_{i=1}^{k} n_i - k = N - k$ is known as the Error degrees of freedom, and

$$\frac{\sum_{i=1}^{k} (n_i - 1)s_i^2}{\sum_{i=1}^{k} n_i - k}$$

is called the Error Mean Square and is denoted by $MS_{\text{Error}}$. Then

$$F = \frac{MS_{\text{Between}}}{MS_{\text{Error}}}$$

The Computing formula for F

Compute
1) $T_i = \sum_{j=1}^{n_i} x_{ij}$ = Total for sample i
2) $G = \sum_{i=1}^{k} T_i = \sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}$ = Grand Total
3) $N = \sum_{i=1}^{k} n_i$ = Total sample size
4) $\sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}^2$
5) $\sum_{i=1}^{k} \dfrac{T_i^2}{n_i}$

Then
1) $SS_{\text{Between}} = \sum_{i=1}^{k} \dfrac{T_i^2}{n_i} - \dfrac{G^2}{N}$
2) $SS_{\text{Error}} = \sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}^2 - \sum_{i=1}^{k} \dfrac{T_i^2}{n_i}$
3) $F = \dfrac{SS_{\text{Between}}/(k-1)}{SS_{\text{Error}}/(N-k)}$

The critical region for the F test
We reject $H_0: \mu_1 = \mu_2 = \mu_3 = \cdots = \mu_k$ if $F \ge F_\alpha$, where $F_\alpha$ is the critical point under the F distribution with $\nu_1 = k - 1$ degrees of freedom in the numerator and $\nu_2 = N - k$ degrees of freedom in the denominator.

Example
In the following example we are comparing weight gains resulting from the following six diets:
1. Diet 1 - High Protein, Beef
2. Diet 2 - High Protein, Cereal
3. Diet 3 - High Protein, Pork
4. Diet 4 - Low protein, Beef
5. Diet 5 - Low protein, Cereal
6. Diet 6 - Low protein, Pork

Gains in weight (grams) for rats under six diets differing in level of protein (High or Low) and source of protein (Beef, Cereal, or Pork)

   Diet   Weight gains                                Mean    Std. Dev.   Σx      Σx²
   1       73  102  118  104   81  107  100   87  117  111   100.0   15.14     1000   102062
   2       98   74   56  111   95   88   82   77   86   92    85.9   15.02      859    75819
   3       94   79   96   98  102  102  108   91  120  105    99.5   10.92      995   100075
   4       90   76   90   64   86   51   72   90   95   78    79.2   13.89      792    64462
   5      107   95   97   80   98   74   74   67   89   58    83.9   15.71      839    72613
   6       49   82   73   86   81   97  106   70   61   82    78.7   16.55      787    64401

Hence

   i      1      2     3     4     5     6     Total (G)
   T_i    1000   859   995   792   839   787   5272

$$N = \sum_{i=1}^{k} n_i = \text{Total sample size} = 60$$

$$\sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}^2 = 479432 \qquad \sum_{i=1}^{k} \frac{T_i^2}{n_i} = 467846$$

Thus

$$SS_{\text{Between}} = \sum_{i=1}^{k} \frac{T_i^2}{n_i} - \frac{G^2}{N} = 467846 - \frac{5272^2}{60} = 4612.933$$

$$SS_{\text{Error}} = \sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}^2 - \sum_{i=1}^{k} \frac{T_i^2}{n_i} = 479432 - 467846 = 11586$$

$$F = \frac{SS_{\text{Between}}/(k-1)}{SS_{\text{Error}}/(N-k)} = \frac{4612.933/5}{11586/54} = \frac{922.6}{214.56} = 4.3$$

$F_{0.05} = 2.386$ with $\nu_1 = 5$ and $\nu_2 = 54$.

Thus, since F > 2.386, we reject $H_0$.

The ANOVA Table
A convenient method for displaying the calculations for the F-test

Anova Table

   Source    d.f.    Sum of Squares    Mean Square    F-ratio
   Between   k - 1   SS_Between        MS_Between     MS_B / MS_E
   Within    N - k   SS_Error          MS_Error
   Total     N - 1   SS_Total

The Diet Example

   Source    d.f.    Sum of Squares    Mean Square    F-ratio
   Between   5        4612.933         922.587        4.3 (p = 0.0023)
   Within    54      11586.000         214.556
   Total     59      16198.933

Using SPSS
Note: The use of another statistical package such as Minitab is similar to using SPSS.

Assume the data is contained in an Excel file. Each variable is in a column:
1. Weight gain (wtgn)
2. diet
3. Source of protein (Source)
4. Level of Protein (Level)

After starting the SPSS program a dialogue box appears. If you select Opening an existing file and press OK, a second dialogue box appears for choosing the file. A further dialogue box then appears: if the variable names are in the file, ask it to read the names.
If you do not specify the Range the program will identify the Range.

Once you click OK, two windows will appear: one that will contain the output, the other containing the data.

To perform ANOVA select Analyze -> General Linear Model -> Univariate. A dialog box appears. Select the dependent variable and the fixed factors, then press OK to perform the Analysis.

The Output

Tests of Between-Subjects Effects
Dependent Variable: wtgn

   Source            Type III Sum of Squares   df   Mean Square    F          Sig.
   Corrected Model     4612.933(a)              5      922.587        4.300   .002
   Intercept         463233.067                 1   463233.067     2159.036   .000
   diet                4612.933                 5      922.587        4.300   .002
   Error              11586.000                54      214.556
   Total             479432.000                60
   Corrected Total    16198.933                59

   a. R Squared = .285 (Adjusted R Squared = .219)

Comments
• The F-test tests $H_0$: $\mu_1 = \mu_2 = \mu_3 = \cdots = \mu_k$ against $H_A$: at least one pair of means is different.
• If $H_0$ is accepted we conclude that the means are not significantly different.
• If $H_0$ is rejected we conclude that at least one pair of means is significantly different.
• The F-test gives no information as to which pairs of means are different.
• One can now use two sample t tests to determine which pairs of means are significantly different.

Fisher's LSD (least significant difference) procedure:
1. Test $H_0$: $\mu_1 = \mu_2 = \mu_3 = \cdots = \mu_k$ against $H_A$: at least one pair of means is different, using the ANOVA F-test.
2. If $H_0$ is accepted we conclude that the means are not significantly different, and stop.
3. If $H_0$ is rejected we conclude that at least one pair of means is significantly different, and follow this by
   • using two sample t tests to determine which pairs of means are significantly different.

Linear Regression
Hypothesis testing and Estimation

Assume that we have collected data on two variables X and Y.
Let $(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n)$ denote the pairs of measurements on the two variables X and Y for n cases in a sample (or population).

The Statistical Model
Each $y_i$ is assumed to be randomly generated from a normal distribution with mean $\mu_i = \alpha + \beta x_i$ and standard deviation $\sigma$. ($\alpha$, $\beta$ and $\sigma$ are unknown.)

[Figure: the line $Y = \alpha + \beta X$ with slope $\beta$ and intercept $\alpha$; $y_i$ is normally distributed about $\alpha + \beta x_i$ with standard deviation $\sigma$]

The Data
The Linear Regression Model
• The data falls roughly about a straight line.

[Figure: scatter of the data about the unseen line $Y = \alpha + \beta X$]

The Least Squares Line
Fitting the best straight line to “linear” data

Let Y = a + bX denote an arbitrary equation of a straight line, where a and b are known values. This equation can be used to predict, for each value of X, the value of Y.

For example, if $X = x_i$ (as for the i-th case) then the predicted value of Y is:

$$\hat{y}_i = a + b x_i$$

The residual

$$r_i = y_i - \hat{y}_i = y_i - (a + b x_i)$$

can be computed for each case in the sample:

$$r_1 = y_1 - \hat{y}_1,\; r_2 = y_2 - \hat{y}_2,\; \ldots,\; r_n = y_n - \hat{y}_n$$

The residual sum of squares (RSS) is

$$RSS = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left(y_i - (a + b x_i)\right)^2$$

a measure of the “goodness of fit” of the line Y = a + bX to the data.

The optimal choice of a and b will result in the residual sum of squares

$$RSS = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left(y_i - (a + b x_i)\right)^2$$

attaining a minimum.
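If this is the case then the line Y = a + bX is called the Least Squares Line.

The equation for the least squares line

Let

$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 \qquad S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

Computing Formulae:

$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$$

$$S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}$$

$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$$

Then the slope of the least squares line can be shown to be:

$$b = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

and the intercept of the least squares line can be shown to be:

$$a = \bar{y} - b\bar{x} = \bar{y} - \frac{S_{xy}}{S_{xx}}\bar{x}$$

The residual sum of Squares

$$RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left(y_i - (a + b x_i)\right)^2 = S_{yy} - \frac{S_{xy}^2}{S_{xx}} \quad \text{(computing formula)}$$

Estimating $\sigma$, the standard deviation in the regression model:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n} \left(y_i - (a + b x_i)\right)^2}{n-2}} = \sqrt{\frac{1}{n-2}\left(S_{yy} - \frac{S_{xy}^2}{S_{xx}}\right)} \quad \text{(computing formula)}$$

This estimate of $\sigma$ is said to be based on n – 2 degrees of freedom.

Sampling distributions of the estimators

The sampling distribution of the slope of the least squares line:

$$b = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

It can be shown that b has a normal distribution with mean and standard deviation

$$\mu_b = \beta \quad \text{and} \quad \sigma_b = \frac{\sigma}{\sqrt{S_{xx}}} = \frac{\sigma}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$

Thus

$$z = \frac{b - \mu_b}{\sigma_b} = \frac{b - \beta}{\sigma/\sqrt{S_{xx}}}$$

has a standard normal distribution, and

$$t = \frac{b - \mu_b}{s_b} = \frac{b - \beta}{s/\sqrt{S_{xx}}}$$

has a t distribution with df = n - 2.

$(1 - \alpha)100\%$ Confidence Limits for the slope $\beta$:

$$\hat{\beta} \pm t_{\alpha/2}\frac{s}{\sqrt{S_{xx}}}$$

where $\hat{\beta} = b$ and $t_{\alpha/2}$ is the critical value for the t-distribution with n – 2 degrees of freedom. (In these limits $\alpha$ denotes the significance level, not the intercept.)

Testing the slope

$$H_0: \beta = \beta_0 \quad \text{vs} \quad H_A: \beta \ne \beta_0$$

The test statistic is:

$$t = \frac{b - \beta_0}{s/\sqrt{S_{xx}}}$$

which has a t distribution with df = n – 2 if $H_0$ is true.

The Critical Region
Reject $H_0: \beta = \beta_0$ in favour of $H_A: \beta \ne \beta_0$ if

$$t = \frac{b - \beta_0}{s/\sqrt{S_{xx}}} < -t_{\alpha/2} \quad \text{or} \quad t > t_{\alpha/2}, \qquad \text{df} = n - 2$$

This is a two tailed test.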
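One tailed tests are also possible.

The sampling distribution of the intercept of the least squares line:

$$a = \hat{\alpha} = \bar{y} - b\bar{x} = \bar{y} - \frac{S_{xy}}{S_{xx}}\bar{x}$$

It can be shown that a has a normal distribution with mean and standard deviation

$$\mu_a = \alpha \quad \text{and} \quad \sigma_a = \sigma\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$

Thus

$$z = \frac{a - \mu_a}{\sigma_a} = \frac{a - \alpha}{\sigma\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}}$$

has a standard normal distribution, and

$$t = \frac{a - \mu_a}{s_a} = \frac{a - \alpha}{s\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}}$$

has a t distribution with df = n - 2.

$(1 - \alpha)100\%$ Confidence Limits for the intercept $\alpha$:

$$\hat{\alpha} \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}$$

where $\hat{\alpha} = a$ and $t_{\alpha/2}$ is the critical value for the t-distribution with n – 2 degrees of freedom. (In the confidence level and in $t_{\alpha/2}$, $\alpha$ denotes the significance level, not the intercept.)

Testing the intercept

$$H_0: \alpha = \alpha_0 \quad \text{vs} \quad H_A: \alpha \ne \alpha_0$$

The test statistic is:

$$t = \frac{a - \alpha_0}{s\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}}$$

which has a t distribution with df = n – 2 if $H_0$ is true.

The Critical Region
Reject $H_0: \alpha = \alpha_0$ in favour of $H_A: \alpha \ne \alpha_0$ if

$$t = \frac{a - \alpha_0}{s_a} < -t_{\alpha/2} \quad \text{or} \quad t > t_{\alpha/2}, \qquad \text{df} = n - 2$$

Example
The following data show the per capita consumption of cigarettes per month (X) in various countries in 1930, and the death rates from lung cancer for men in 1950.

TABLE: Per capita consumption of cigarettes per month ($X_i$) in n = 11 countries in 1930, and the death rates, $Y_i$ (per 100,000), from lung cancer for men in 1950.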
Country (i)       Xi     Yi
Australia         48     18
Canada            50     15
Denmark           38     17
Finland          110     35
Great Britain    110     46
Holland           49     24
Iceland           23      6
Norway            25      9
Sweden            30     11
Switzerland       51     25
USA              130     20

[Figure: scatter plot of death rates from lung cancer (1950) against per capita consumption of cigarettes, with each point labelled by country]

Fitting the Least Squares Line

$$\sum_{i=1}^{n} x_i = 664, \qquad \sum_{i=1}^{n} y_i = 226, \qquad \sum_{i=1}^{n} x_i^2 = 54{,}404, \qquad \sum_{i=1}^{n} y_i^2 = 6{,}018, \qquad \sum_{i=1}^{n} x_i y_i = 16{,}914$$

First compute the following three quantities:

$$S_{xx} = 54404 - \frac{(664)^2}{11} = 14322.55$$

$$S_{yy} = 6018 - \frac{(226)^2}{11} = 1374.73$$

$$S_{xy} = 16914 - \frac{(664)(226)}{11} = 3271.82$$

Computing the estimates of the slope (β), the intercept (α) and the standard deviation (σ):

$$b = \frac{S_{xy}}{S_{xx}} = \frac{3271.82}{14322.55} = 0.228$$

$$a = \bar{y} - b\bar{x} = \frac{226}{11} - (0.2284)\frac{664}{11} = 6.756$$

$$s = \sqrt{\frac{1}{n-2}\left[S_{yy} - \frac{S_{xy}^2}{S_{xx}}\right]} = 8.35$$

95% Confidence Limits for slope β:

$$\hat{\beta} \pm t_{\alpha/2}\,\frac{s}{\sqrt{S_{xx}}} = 0.228 \pm 2.262\,\frac{8.35}{\sqrt{14322.55}}$$

0.0706 to 0.3862

t.025 = 2.262, the critical value for the t-distribution with 9 degrees of freedom.

95% Confidence Limits for intercept α:

$$\hat{\alpha} \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}} = 6.756 \pm 2.262\,(8.35)\sqrt{\frac{1}{11} + \frac{(664/11)^2}{14322.55}}$$

-4.34 to 17.85

t.025 = 2.262, the critical value for the t-distribution with 9 degrees of freedom.

[Figure: the same scatter plot with the fitted line Y = 6.756 + (0.228)X. 95% confidence limits for slope: 0.0706 to 0.3862. 95% confidence limits for intercept: -4.34 to 17.85]

Testing for a positive slope

$$H_0: \beta = 0 \quad \text{vs} \quad H_A: \beta > 0$$

The test statistic is:

$$t = \frac{b - 0}{s / \sqrt{S_{xx}}}$$

The Critical Region

Reject H0: β = 0 in favour of HA: β > 0 if

$$t = \frac{b - 0}{s / \sqrt{S_{xx}}} > t_{0.05} = 1.833, \qquad df = 11 - 2 = 9$$

This is a one-tailed test.

Since

$$t = \frac{0.228}{8.35 / \sqrt{14322.55}} = 3.27 > 1.833$$

we reject H0: β = 0 and conclude HA: β > 0.

Confidence Limits for Points on the Regression Line

• The intercept α is a specific point on the regression line.
• It is the y-coordinate of the point on the regression line when x = 0.
• It is the predicted value of y when x = 0.
• We may also be interested in other points on the regression line, e.g. when x = x0.
• In this case the y-coordinate of the point on the regression line when x = x0 is α + βx0.

[Figure: the regression line y = α + βx, with the point at x = x0 marked at height α + βx0]

(1 – α)100% Confidence Limits for α + βx0:

$$a + b x_0 \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$

t_{α/2} is the α/2 critical value for the t-distribution with n – 2 degrees of freedom.

Prediction Limits for new values of the Dependent variable y

• An important application of the regression line is prediction.
• Knowing the value of x (x0), what is the value of y?
• The predicted value of y when x = x0 is:

$$\hat{y} = \alpha + \beta x_0$$

• This in turn can be estimated by:

$$\hat{y} = \hat{\alpha} + \hat{\beta} x_0 = a + b x_0$$

The predictor

$$\hat{y} = \hat{\alpha} + \hat{\beta} x_0 = a + b x_0$$

• Gives only a single value for y.
• A more appropriate piece of information would be a range of values.
• A range of values that has a fixed probability of capturing the value for y.
• A (1 – α)100% prediction interval for y.

(1 – α)100% Prediction Limits for y when x = x0:

$$a + b x_0 \pm t_{\alpha/2}\, s\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$

t_{α/2} is the α/2 critical value for the t-distribution with n – 2 degrees of freedom.

Example

In this example we are studying building fires in a city and are interested in the relationship between:
1. X = the distance between the building that puts out the alarm and the closest fire hall (miles), and
2. Y = cost of the damage (1000$).

The data were collected on n = 15 fires.
The Data

Fire    Distance (miles)    Damage (1000$)
 1            3.4                26.2
 2            1.8                17.8
 3            4.6                31.3
 4            2.3                23.1
 5            3.1                27.5
 6            5.5                36.0
 7            0.7                14.1
 8            3.0                22.3
 9            2.6                19.6
10            4.3                31.3
11            2.1                24.0
12            1.1                17.3
13            6.1                43.2
14            4.8                36.4
15            3.8                26.1

[Figure: Scatter Plot of Damage (1000$) against Distance (miles)]

Computations

$$\sum_{i=1}^{n} x_i = 49.2, \qquad \sum_{i=1}^{n} x_i^2 = 196.16, \qquad \sum_{i=1}^{n} y_i = 396.2, \qquad \sum_{i=1}^{n} y_i^2 = 11376.48, \qquad \sum_{i=1}^{n} x_i y_i = 1470.65$$

Computations Continued

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{49.2}{15} = 3.28, \qquad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{396.2}{15} = 26.4133$$

$$S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} = 196.16 - \frac{(49.2)^2}{15} = 34.784$$

$$S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} = 11376.48 - \frac{(396.2)^2}{15} = 911.517$$

$$S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n} = 1470.65 - \frac{(49.2)(396.2)}{15} = 171.114$$

$$b = \hat{\beta} = \frac{S_{xy}}{S_{xx}} = \frac{171.114}{34.784} = 4.92$$

$$a = \hat{\alpha} = \bar{y} - b\bar{x} = 26.4133 - 4.919\,(3.28) = 10.28$$

$$s = \sqrt{\frac{S_{yy} - \dfrac{S_{xy}^2}{S_{xx}}}{n-2}} = \sqrt{\frac{911.517 - \dfrac{(171.114)^2}{34.784}}{13}} = 2.316$$

95% Confidence Limits for slope β:

$$\hat{\beta} \pm t_{\alpha/2}\,\frac{s}{\sqrt{S_{xx}}}$$

4.07 to 5.77

t.025 = 2.160, the critical value for the t-distribution with 13 degrees of freedom.

95% Confidence Limits for intercept α:

$$\hat{\alpha} \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}$$

7.21 to 13.35

t.025 = 2.160, the critical value for the t-distribution with 13 degrees of freedom.

[Figure: Least Squares Line. Damage (1000$) against Distance (miles) with the fitted line y = 4.92x + 10.28]

(1 – α)100% Confidence Limits for α + βx0:

$$a + b x_0 \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$

t_{α/2} is the α/2 critical value for the t-distribution with n – 2 degrees of freedom.

95% Confidence Limits for α + βx0:

x0     lower    upper
1      12.87    17.52
2      18.43    21.80
3      23.72    26.35
4      28.53    31.38
5      32.93    36.82
6      37.15    42.44

[Figure: 95% Confidence Limits for α + βx0. Damage (1000$) against Distance (miles) with the fitted line and confidence limits]

(1 – α)100% Prediction Limits for y when x = x0:

$$a + b x_0 \pm t_{\alpha/2}\, s\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$

t_{α/2} is the α/2 critical value for the t-distribution with n – 2 degrees of freedom.

95% Prediction Limits for y when x = x0:

x0     lower    upper
1       9.68    20.71
2      14.84    25.40
3      19.86    30.21
4      24.75    35.16
5      29.51    40.24
6      34.13    45.45

[Figure: 95% Prediction Limits for y when x = x0. Damage (1000$) against Distance (miles) with the fitted line and prediction limits]

Linear Regression Summary

Hypothesis testing and Estimation

(1 – α)100% Confidence Limits for slope β:

$$\hat{\beta} \pm t_{\alpha/2}\,\frac{s}{\sqrt{S_{xx}}}$$

t_{α/2} = critical value for the t-distribution with n – 2 degrees of freedom.

Testing the slope

$$H_0: \beta = \beta_0 \quad \text{vs} \quad H_A: \beta \neq \beta_0$$

The test statistic is:

$$t = \frac{b - \beta_0}{s / \sqrt{S_{xx}}}$$

- has a t distribution with df = n – 2 if H0 is true.

(1 – α)100% Confidence Limits for intercept α:

$$\hat{\alpha} \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}$$

t_{α/2} = critical value for the t-distribution with n – 2 degrees of freedom.

Testing the intercept

$$H_0: \alpha = \alpha_0 \quad \text{vs} \quad H_A: \alpha \neq \alpha_0$$

The test statistic is:

$$t = \frac{a - \alpha_0}{s\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}}$$

- has a t distribution with df = n – 2 if H0 is true.
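The confidence and prediction limits tabulated for the fire data can be checked numerically. What follows is a minimal sketch, not the course's own software: the helper name `limits_at` is hypothetical, and the critical value 2.160 (t-distribution, 13 degrees of freedom) is assumed to come from tables.

```python
import math

# Fire data from the example: distance (miles) and damage (1000$).
distance = [3.4, 1.8, 4.6, 2.3, 3.1, 5.5, 0.7, 3.0, 2.6, 4.3, 2.1, 1.1, 6.1, 4.8, 3.8]
damage = [26.2, 17.8, 31.3, 23.1, 27.5, 36.0, 14.1, 22.3, 19.6, 31.3, 24.0, 17.3, 43.2, 36.4, 26.1]

n = len(distance)
x_bar = sum(distance) / n
y_bar = sum(damage) / n
s_xx = sum((x - x_bar) ** 2 for x in distance)
s_yy = sum((y - y_bar) ** 2 for y in damage)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(distance, damage))

b = s_xy / s_xx                                      # slope estimate
a = y_bar - b * x_bar                                # intercept estimate
s = math.sqrt((s_yy - s_xy ** 2 / s_xx) / (n - 2))   # estimate of sigma

T_CRIT = 2.160  # tabulated t(0.025) for 13 df; an input, not computed here


def limits_at(x0):
    """Return (confidence limits, prediction limits) at x = x0."""
    fit = a + b * x0
    core = 1 / n + (x0 - x_bar) ** 2 / s_xx
    half_ci = T_CRIT * s * math.sqrt(core)       # limits for the point on the line
    half_pi = T_CRIT * s * math.sqrt(1 + core)   # limits for a new observation y
    return (fit - half_ci, fit + half_ci), (fit - half_pi, fit + half_pi)


for x0 in range(1, 7):
    ci, pi = limits_at(x0)
    print(f"x0 = {x0}: confidence {ci[0]:.2f} to {ci[1]:.2f}, "
          f"prediction {pi[0]:.2f} to {pi[1]:.2f}")
```

The only difference between the two intervals is the extra 1 under the square root, which accounts for the variation of a new observation about the line. That is why the prediction limits are always the wider pair.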
(1 – α)100% Confidence Limits for α + βx0:

$$a + b x_0 \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$

t_{α/2} is the α/2 critical value for the t-distribution with n – 2 degrees of freedom.

(1 – α)100% Prediction Limits for y when x = x0:

$$a + b x_0 \pm t_{\alpha/2}\, s\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$

t_{α/2} is the α/2 critical value for the t-distribution with n – 2 degrees of freedom.

Comparing k Populations: Proportions

The χ² test for independence

Situation

• We have two categorical variables R and C.
• The number of categories of R is r.
• The number of categories of C is c.
• We observe n subjects from the population and count x_ij = the number of subjects for which R = i and C = j.
• R = rows, C = columns.

Example

Both Systolic Blood Pressure (C) and Serum Cholesterol (R) were measured for a sample of n = 1237 subjects.

The categories for Blood Pressure are: <127, 127-146, 147-166, 167+
The categories for Cholesterol are: <200, 200-219, 220-259, 260+

Table: two-way frequency

Serum                     Systolic blood pressure
cholesterol     <127    127-146    147-166    167+    Total
<200             117        121         47      22      307
200-219           85         98         43      20      246
220-259          119        209         68      43      439
260+              67         99         46      33      245
Total            388        527        204     118     1237

The χ² test for independence

Define

$$R_i = \sum_{j=1}^{c} x_{ij} = i\text{th row total}$$

$$C_j = \sum_{i=1}^{r} x_{ij} = j\text{th column total}$$

$$E_{ij} = \frac{R_i C_j}{n} = \text{expected frequency in the } (i,j)\text{th cell in the case of independence}$$

Then to test

H0: R and C are independent
against
HA: R and C are not independent

use the test statistic

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(x_{ij} - E_{ij})^2}{E_{ij}}$$

E_ij = expected frequency in the (i,j)th cell in the case of independence.
x_ij = observed frequency in the (i,j)th cell (recall that E_ij = R_i C_j / n).

Sampling distribution of the test statistic when H0 is true

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(x_{ij} - E_{ij})^2}{E_{ij}}$$

- has a χ² distribution with degrees of freedom ν = (r - 1)(c - 1).

Critical and Acceptance Region

Reject H0 if: χ² > χ²_α
Accept H0 if: χ² ≤ χ²_α

Table: Expected frequencies, (Observed frequencies), Standardized Residuals

Serum                     Systolic blood pressure
cholesterol     <127      127-146    147-166    167+      Total
<200             96.29     130.79      50.63     29.29      307
                (117)     (121)       (47)      (22)
                  2.11      -0.86      -0.51     -1.35
200-219          77.16     104.80      40.57     23.47      246
                 (85)      (98)       (43)      (20)
                  0.89      -0.66       0.38     -0.72
220-259         137.70     187.03      72.40     41.88      439
                (119)     (209)       (68)      (43)
                 -1.59       1.61      -0.52      0.17
260+             76.85     104.38      40.40     23.37      245
                 (67)      (99)       (46)      (33)
                 -1.12      -0.53       0.88      1.99
Total           388        527        204       118        1237

χ² = 20.85

Standardized residuals

$$r_{ij} = \frac{x_{ij} - E_{ij}}{\sqrt{E_{ij}}}$$

Test statistic

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(x_{ij} - E_{ij})^2}{E_{ij}} = \sum_{i=1}^{r} \sum_{j=1}^{c} r_{ij}^2 = 20.85$$

degrees of freedom ν = (r - 1)(c - 1) = 9

χ²_0.05 = 16.919

Reject H0 using α = 0.05.

Another Example

This data comes from a Globe and Mail study examining the attitudes of the baby boomers. Data was collected on various age groups.

Age group                        Total
Echo (Age 20 – 29)                398
Gen X (Age 30 – 39)               342
Younger Boomers (Age 40 – 49)     378
Older Boomers (Age 50 – 59)       286
Pre Boomers (Age 60+)             445
Total                            1849

One question with responses

In an average week, how many times would you drink alcohol?

Age group                        never    once    twice    three or four times    five or more times    Total
Echo (Age 20 – 29)                115     135       64             48                     36              398
Gen X (Age 30 – 39)               130     123       38             31                     20              342
Younger Boomers (Age 40 – 49)     136      87       64             57                     34              378
Older Boomers (Age 50 – 59)       109      74       40             43                     20              286
Pre Boomers (Age 60+)             218      80       45             40                     62              445
Total                             708     499      251            219                    172             1849

Are there differences in weekly consumption of alcohol related to age?
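A question like this can be examined with the χ² test for independence. Here is a minimal sketch, assuming the observed alcohol table above is typed in row by row; the function name `chi_square_independence` is hypothetical, and the 5% critical value for 16 degrees of freedom (26.296) is assumed to come from tables rather than being computed.

```python
import math


def chi_square_independence(observed):
    """Return (expected, residuals, chi2, df) for a two-way frequency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # E_ij = R_i * C_j / n, the expected frequency under independence.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    # r_ij = (x_ij - E_ij) / sqrt(E_ij), the standardized residual.
    residuals = [
        [(x - e) / math.sqrt(e) for x, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(observed, expected)
    ]
    chi2 = sum(r * r for row in residuals for r in row)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, residuals, chi2, df


# Rows: Echo, Gen X, Younger Boomers, Older Boomers, Pre Boomers.
# Columns: never, once, twice, three or four times, five or more times.
alcohol = [
    [115, 135, 64, 48, 36],
    [130, 123, 38, 31, 20],
    [136, 87, 64, 57, 34],
    [109, 74, 40, 43, 20],
    [218, 80, 45, 40, 62],
]

expected, residuals, chi2, df = chi_square_independence(alcohol)
CRITICAL_05 = 26.296  # tabulated chi-square 5% point for 16 df (an input)

print(f"chi-square = {chi2:.2f} on {df} degrees of freedom")
print("reject H0" if chi2 > CRITICAL_05 else "do not reject H0")
```

The same function applies to any r × c table, so the blood pressure and cholesterol example could be run through it by swapping in that table and the critical value for 9 degrees of freedom.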
Table: Expected frequencies

Age group                        never     once      twice    three or four times    five or more times    Total
Echo (Age 20 – 29)               152.40    107.41    54.03          47.14                  37.02             398
Gen X (Age 30 – 39)              130.96     92.30    46.43          40.51                  31.81             342
Younger Boomers (Age 40 – 49)    144.74    102.01    51.31          44.77                  35.16             378
Older Boomers (Age 50 – 59)      109.51     77.18    38.82          33.87                  26.60             286
Pre Boomers (Age 60+)            170.39    120.09    60.41          52.71                  41.40             445
Total                            708       499       251            219                    172              1849

Table: Residuals

$$r_{ij} = \frac{x_{ij} - E_{ij}}{\sqrt{E_{ij}}}$$

Age group                        never     once      twice    three or four times    five or more times
Echo (Age 20 – 29)               -3.029     2.662     1.357          0.125                 -0.168
Gen X (Age 30 – 39)              -0.083     3.196    -1.237         -1.494                 -2.095
Younger Boomers (Age 40 – 49)    -0.726    -1.486     1.771          1.828                 -0.196
Older Boomers (Age 50 – 59)      -0.049    -0.362     0.189          1.568                 -1.280
Pre Boomers (Age 60+)             3.647    -3.659    -1.982         -1.750                  3.203

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(x_{ij} - E_{ij})^2}{E_{ij}} = \sum_{i=1}^{r} \sum_{j=1}^{c} r_{ij}^2 = 93.97$$

χ²_.05 = 26.296 for 4 × 4 = 16 d.f.

Conclusion: There is a significant relationship between age group and weekly alcohol use.

Examining the residuals (see the table of residuals above) allows one to identify the cells that indicate a departure from independence.

• Large positive residuals indicate cells where the observed frequencies were larger than expected if independent.
• Large negative residuals indicate cells where the observed frequencies were smaller than expected if independent.

Another question with responses

In an average week, how many times would you surf the internet?
Age group                        never    1 to 4 times    5 to 9 times    10 or more times    Total
Echo (Age 20 – 29)                 48          72             100               178            398
Gen X (Age 30 – 39)                51          82              92               117            342
Younger Boomers (Age 40 – 49)      79         128              76                95            378
Older Boomers (Age 50 – 59)        92          63              57                74            286
Pre Boomers (Age 60+)             276          71              67                31            445
Total                             546         416             392               495           1849

Are there differences in weekly internet use related to age?

Table: Expected frequencies

Age group                        never     1 to 4 times    5 to 9 times    10 or more times    Total
Echo (Age 20 – 29)               117.53        89.54           84.38            106.55          398
Gen X (Age 30 – 39)              100.99        76.95           72.51             91.56          342
Younger Boomers (Age 40 – 49)    111.62        85.04           80.14            101.20          378
Older Boomers (Age 50 – 59)       84.45        64.35           60.63             76.57          286
Pre Boomers (Age 60+)            131.41       100.12           94.34            119.13          445
Total                            546          416             392               495            1849

Table: Residuals

$$r_{ij} = \frac{x_{ij} - E_{ij}}{\sqrt{E_{ij}}}$$

Age group                        never     1 to 4 times    5 to 9 times    10 or more times
Echo (Age 20 – 29)               -6.41        -1.85            1.70              6.92
Gen X (Age 30 – 39)              -4.97         0.58            2.29              2.66
Younger Boomers (Age 40 – 49)    -3.09         4.66           -0.46             -0.62
Older Boomers (Age 50 – 59)       0.82        -0.17           -0.47             -0.29
Pre Boomers (Age 60+)            12.61        -2.91           -2.82             -8.07

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(x_{ij} - E_{ij})^2}{E_{ij}} = \sum_{i=1}^{r} \sum_{j=1}^{c} r_{ij}^2 = 406.29$$

χ²_.05 = 21.03 for 4 × 3 = 12 d.f.

Conclusion: There is a significant relationship between age group and weekly internet use.

[Figure: five bar charts, one per age group (Echo, Gen X, Younger Boomers, Older Boomers, Pre Boomers), over the response categories never, 1 to 4 times, 5 to 9 times, 10 or more times; vertical axis from 0.0 to 70.0]

Next topic: Fitting equations to data