Lecture Notes

EDM6403: Quantitative Data Management and Analysis in Educational Research
Professor Chang Lei, Department of Educational Psychology, CUHK
Measurements of Individual Differences
People differ in terms of the amount of an attribute they possess. Theoretically, no two
people have exactly the same amount of an attribute, e.g., same height or same smartness.
Such an attribute, on which people differ from one another, is considered a continuum of
quantities along which a person may possess any amount. Individual differences along a trait
continuum usually follow a normal distribution; that is, most people possess a moderate
amount with few having an extremely large amount and few having a very small amount of
the attribute. The ultimate goal of any research is to explain these individual
differences: why do some people have more or less of an attribute? The initial question, however, is
how to measure and quantify individual differences, that is, how to
differentiate people numerically with respect to an attribute continuum. There are four ways
of quantifying individual differences, four ways of assigning numbers to people to indicate
their differences with respect to an attribute. In statistics, these are called four levels of
measurement. The level of measurement for a variable directly affects the choice of statistical
technique because different levels of measurement justify different mathematical
operations. These four levels are called ratio, interval, ordinal, and nominal scales. All of them
involve truncating the trait continuum into manageable and meaningful units and categories.
The ratio and interval methods truncate the continuum into equal-distance units and measure
individual differences by counting how many units of an attribute a person possesses.
Someone having 5 units of an attribute is one unit higher than someone having 4 units and so
is a person possessing 8 units in relation to another having 7 units. These are interval or ratio
measures. Centimetres as a measure of height and pounds as a measure of weight are ratio
measurements, while WISC points as a measure of intelligence are an interval measurement. Centimetres, for example, have equal
distance between each other. The distance between 4 and 5 centimetres is the same as that
between 7 and 8 centimetres. So using this equal interval measure, one can distinguish people
on the height continuum, so that someone 180 centimetres tall is taller than someone 179 centimetres tall. Of
course, you lose some information in that not all 179-centimetre people are of exactly the same height,
even though, on the ruler measurement, they are indicated as having the same height. The
difference between ratio and interval is that ratio has a meaningful zero such as the ruler
measurement of height. For some constructs, zero is meaningless. IQ, for example, has a
meaningless zero: what would having no IQ even mean? In this case, zero is simply a marker with no meaning. Or
it is meaningful only within the measurement instrument itself as a reference to other values.
For example, it may mean that you did not attempt any questions on an IQ test, and therefore
you have a zero IQ score. But it does not mean you have no intelligence because you have a
brain and you are still breathing. With a meaningful zero, the ratio scale permits ratio
comparison. For example, someone weighing 120 pounds is twice as heavy as someone of 60
pounds. Such a comparison cannot be made using an interval scale, where zero does not mean the
absence of the attribute. For example, someone with an IQ score of 120 is not twice as intelligent as
someone scoring 60. Interval scales are used more often than ratio scales in social science.
A less precise, or coarser, way of quantifying differences is to rank-order people
without too much concern for equal-interval assumptions. This is the ordinal data method
where rank orders become the measurement unit. Heights of 180, 179, 150, 150, and 149 will be truncated
into 1st, 2nd, 3rd, 3rd, and 4th. The distance between ranks 2 and 3 is much bigger than
that between ranks 1 and 2 or between ranks 3 and 4, but that information is lost.
An even rougher way to quantify information is nominal or categorical where you
differentiate people by assigning labels to them, e.g., a giant vs. a dwarf. The truncation is so
coarse that the differences represented by categorical data are no longer defined quantitatively
but qualitatively. We no longer think of how much bigger giants are than dwarfs
but think of them as two categories of people. A better example is political affiliation. The
Democrats and DAB are two categories of the political affiliation variable, which is called a
categorical variable. Here, the two parties differ by name. One has one label and the other has
a different label. We no longer make any numerical differentiations between them. However,
their political views can still be differentiated numerically by examining their emphasis
on different political principles; the extent to which they are anti- versus pro-China is only
one of them. However, when using nominal measurement, we think and
measure in terms of categories rather than in terms of numerically different standings on a continuum of,
for example, pro versus anti China. An extension of the concept is to use categories to
distinguish individual differences along multiple dimensions; i.e., not just tall vs. short or liberal vs.
conservative, but also different ears, noses, education, etc. A good example is our names.
Name is a categorical variable. Names summarize so many quantitative differences on so
many dimensions that we are no longer able to perceive those quantitative differences.
Instead, we perceive each person as categorically or qualitatively different from everyone
else. Another example is friendship. A friend who does enough bad things to you becomes an
enemy. Underlying these labels, friendliness-animosity is an attitude continuum on which you (probably
unconsciously) assign different amounts to your different friends and enemies, resulting in
relationship categories such as the best friend, a friend, and the worst enemy.
In summary, nominal variables only classify cases into categories. Categories are not
numerical and can only be compared in terms of the number of cases classified in them. Only
“=” or “≠” can be applied to a nominal variable.
In addition to classification, ordinal variables can be ordered from ‘high’ to ‘low’ or
‘more’ to ‘less’, but the distance between scores cannot be described in precise terms. More
mathematical operations (=, ≠, >, <) can be applied to ordinal data.
Interval variables have the property of categorization, ranking, and the distance
between two scores can be expressed in terms of equal units. Mathematical operations (=, ≠, >, <, +, −) are permitted. But there is no true zero.
The highest level of measurement is ratio, which possesses all the properties of the
above levels of measurement. All arithmetic operations (+, −, ×, ÷) are permitted and there is
a true zero.
Interval and ratio variables are also called continuous variables, while nominal and
ordinal variables are also called discrete (or category) variables.
Population and sample
A population is the total collection of all units in which the researcher is interested and is
thus the entity that the researcher wishes to understand.
A sample is a carefully chosen subset of a population.
A case is a unit in a sample. Usually a case is a subject in an experiment or a respondent in
a survey. The number of cases in the sample is sample size (denoted by N or n).
Frequency analysis
For a nominal variable, the number of cases in each category is the frequency of
occurrence in each category.
Proportion for each category is the frequency divided by the sample size. It is called
relative frequency. The proportion multiplied by 100 is percentage.
The frequencies/percentages of all categories of a nominal variable give frequency
distribution.
For a continuous variable, a grouped frequency distribution can be constructed by dividing
the scores into appropriate or convenient intervals and treating each interval as a category.
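As a minimal sketch of the arithmetic just described (frequency, relative frequency, and percentage), the following Python snippet counts a small, made-up set of nominal responses; the category labels and data are hypothetical.

```python
from collections import Counter

# Hypothetical nominal data: party preference of 10 respondents
responses = ["DAB", "Democrat", "DAB", "DAB", "Democrat",
             "Other", "DAB", "Democrat", "DAB", "Other"]

n = len(responses)                 # sample size
frequencies = Counter(responses)   # frequency of each category

for category, f in frequencies.items():
    proportion = f / n             # relative frequency
    print(f"{category}: f = {f}, proportion = {proportion:.2f}, "
          f"percentage = {100 * proportion:.0f}%")
```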
Distribution
The pattern of variation of a variable is called its distribution, which can be described
both mathematically and graphically. The distribution records all possible numerical values of
a variable and how often each value occurs (i.e., frequency).
Four attributes are often used to define a distribution: central tendency (mean, median,
and mode); variability (variance and standard deviation); symmetry or skew; peak or kurtosis.
Measures of central tendency
Mode is the value of a variable that occurs most frequently in the distribution. The mode
is often used for category variables.
Median is the middle observation when the observations are arranged in order of
magnitude. The median identifies the position of an observation. The median is used primarily
for ordinal data, but it is also appropriate for interval/ratio data.
The median is also called the 50th percentile. A percentile is the value on a scale below
which a given percentage of cases fall.
Each half of the observations can be further divided in half, giving quartiles. The 25th percentile
is called the lower quartile; the 75th percentile is called the upper quartile.
Mean is the arithmetic average of the observations. Mean is appropriate for continuous
variables.
Measures of dispersion
The range is the maximum minus the minimum of a distribution.
The variance is the average of the squared deviations. The deviation of an observation is
the distance of the observation from the mean. The square root of the variance is called the
standard deviation (SD).
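To make the definitions above concrete, here is a small Python sketch computing the mode, median, mean, range, variance, and standard deviation for a made-up set of scores (the data are hypothetical; Python's statistics module uses the N − 1 denominator for the sample variance).

```python
import statistics

scores = [4, 7, 7, 8, 9, 10, 12, 15]   # hypothetical interval-level scores

print("mode     =", statistics.mode(scores))       # most frequent value
print("median   =", statistics.median(scores))     # middle observation (50th percentile)
print("mean     =", statistics.mean(scores))       # arithmetic average
print("range    =", max(scores) - min(scores))     # maximum minus minimum
print("variance =", statistics.variance(scores))   # squared deviations averaged over N - 1
print("SD       =", statistics.stdev(scores))      # square root of the variance
```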
The shape of a distribution
Skew is a measure of deviation from symmetry. If the mean and the median are the same, the
distribution is symmetric and the skew is zero; the skew of a normal curve is zero. When the mean is
larger than the median, the skew is positive and the distribution is said to be positively skewed,
with a thick right-hand tail. When the mean is smaller than the median, the skew is negative and the
distribution is said to be negatively skewed, with a thick left-hand tail.
Kurtosis is a measure of the relative peakedness or flatness of the curve defined by the
distribution of a variable. When the kurtosis is zero, the curve is a normal curve. If the peak of
the curve is above that of a normal curve, it is called leptokurtic (positive kurtosis). If it is
below that of a normal curve, it is called platykurtic (negative kurtosis).
Normal distribution
The best-studied and most widely used distribution is the normal distribution. It is the most
widely used because this mathematical function can describe many natural
observations (behaviors, attitudes, and abilities, as far as education is concerned). When
something is measured many times (e.g., a person’s height and weight), there are small but
detectable fluctuations in the different measurements. Guess what is the distribution of these
measurements. Yes, normal distribution. Because the fluctuation indicates measurement error,
the distribution is also called a "normal curve of error." From this perspective, we can say that
distributions or patterns of variations of, say, height, weight, IQ, abilities and personalities of
various sorts are simply records of "nature's mistakes" against the corresponding population
means.
If X is distributed normally, this is denoted as X ~ N(μ, σ²), where μ and σ are the mean and
standard deviation respectively. There is not one but a family of normal distributions, with
varying μ and σ. All normal curves are continuous, bell-shaped, and symmetrical (skew = 0);
the two points of inflection are at μ ± σ (kurtosis = 0); and the curve approaches the X-axis
asymptotically (it never touches the X-axis).
N(0, 1) is called the standard normal distribution, with mean 0 and standard deviation 1.
(Figure: normal curves with different means and standard deviations.)
Linear transformation
The shape of a distribution does not change when a constant is added to or subtracted
from scores or when scores are multiplied or divided by a constant. Any of these changes is
called a linear transformation. Because of this property, the raw scores or original
measurement units are often transformed to z-scores, which express the score location in
terms of the standard deviation units of the distribution. A z-score is one kind of standard
score, which has μ = 0 and σ = 1.0. Any observation that is expressed in terms of the
relationship between the mean and standard deviation is a standard score. For example, GRE
and SAT scores, which have μ = 500 and σ = 100, are standard scores. The Wechsler Intelligence Test
(μ = 100, σ = 15) and the Stanford-Binet Intelligence Test (μ = 100, σ = 16) also yield
standard scores. The z-score is only one kind of standard score.
Notice that any variable, normal or not, can be transformed into a z-score by the formula
Z = (X − μ) / σ. The z-transformation does not "normalize" the variable. When a normal variable
is converted into a z-score, it becomes a standard normal score or a standard normal variable;
that is, Z ~ N(0, 1).
Transforming a normal distribution into a standard normal distribution involves converting
the raw scores into standard scores using the same formula, Z = (X − μ) / σ. The formula first
calculates how far a score is from the mean and then expresses that distance in terms of the
number of standard deviations. A standard score thus tells us how many standard deviations a raw
score falls above or below the mean of the distribution. The standard score is also called a
z-score; its sign tells whether the score falls above or below the mean.
Distribution of probabilities
In statistics, the probability curve of a variable is nonnegative, and the total area under the curve is 1. The
probability (or the relative frequency) of observations with values between [a, b] is the area
under the curve between a and b. The probability of observations with values below b is the
area under the curve below b.
Using integration we can determine the area under the curve between any two values (a
to b), and thus determine the probability (or the relative frequency) that a randomly drawn
score will fall within that interval. These integrations have been done and summarized in a
table of normal distribution where the probability that any score will fall within a range of
two points is computed. As long as a variable is normally distributed or approximates
normality, we can find the probability that X falls within a certain range. Again, we do not
need to really use integration. Statisticians have already written down those probability
values in a table based on a standard normal distribution or N (0, 1). Software packages, such
as Excel and SPSS, can give the probability values easily.
Areas within some special intervals of standard normal distribution
Some useful probabilities and corresponding intervals:
P(1.65  Z  1.65)  0.90
P(1.96  Z  1.96)  0.95
P(2.58  Z  2.58)  0.99
P( Z  1.65)  P( Z  1.65)  0.05
P( Z  1.96)  P( Z  1.96)  0.025
P( Z  2.58)  P( Z  2.58)  0.005
Again, changing raw scores into z-scores does not alter the shape of the distribution. The
difference is that distances between scores, originally expressed in raw-score units, are now
expressed in numbers of standard deviations. However, if the scores are far from normality,
the probabilities obtained from a standard normal distribution table will not accurately
describe the probabilities associated with scores from a non-normally distributed variable.
If the variable is normally distributed, by converting its raw scores into z-scores we can
determine relative frequencies or probabilities, and percentile ranks or cumulative
probabilities, by looking them up in the table rather than calculating them ourselves.
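As mentioned above, software can replace the printed normal table. Here is a minimal Python sketch using scipy; the IQ numbers are hypothetical, assuming a Wechsler-type scale with μ = 100 and σ = 15.

```python
from scipy.stats import norm

# Area under the standard normal curve between -1.96 and 1.96 (about .95)
print(norm.cdf(1.96) - norm.cdf(-1.96))

# Convert a raw score to a z-score and find its percentile rank.
mu, sigma, x = 100, 15, 130        # hypothetical IQ score on a mu = 100, sigma = 15 scale
z = (x - mu) / sigma               # z = 2.0
print(z, norm.cdf(z))              # cumulative probability ~ .977, about the 98th percentile
```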
Nonlinear transformation
Finally, it is useful to mention that some score transformations change the
pattern of variation of a distribution. A monotonic transformation maintains the order of the
measurements but not the relative distance between scores. Such a transformation can be used
to make a skewed distribution normal when it can be assumed that the variable underlying the
observed distribution is normally distributed. Usually, square root and log transformations
reduce positive skew and a power transformation reduces negative skew.
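The following sketch illustrates the point about monotonic transformations with simulated data; the lognormal variable is an arbitrary choice made only because its log is exactly normal.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=5000)   # a positively skewed variable

print("skew of x       :", round(skew(x), 2))            # clearly positive
print("skew of sqrt(x) :", round(skew(np.sqrt(x)), 2))   # reduced
print("skew of log(x)  :", round(skew(np.log(x)), 2))    # near zero by construction
```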
Parameter estimation
Parameter estimation is one type of inferential statistics. Another type of inferential
statistics is hypothesis testing.
Usually, we use a sample mean to estimate the population mean, and a sample SD to
estimate the population SD.
Consider the population mean. Different random samples are likely to give us different
estimates of the population mean. But by using a confidence interval we can indicate that,
despite the uncertainty of random samples, only for 5% of samples will the population mean
lie outside the confidence interval, while for 95% of samples the population mean will lie inside the
confidence interval. The probability of including the population mean within the confidence
interval is called the confidence level (95% or 99% is often chosen as the confidence level).
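A minimal sketch of a 95% confidence interval for a mean, assuming a large enough sample that the normal critical value 1.96 applies; the data below are simulated, not real.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
sample = rng.normal(loc=50, scale=10, size=100)   # hypothetical sample, N = 100

mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))    # estimated standard error of the mean
z = norm.ppf(0.975)                               # 1.96 for a 95% confidence level

print(f"95% CI: [{mean - z * se:.2f}, {mean + z * se:.2f}]")
```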
Hypothesis testing
Statistical hypotheses always refer to the population. They can be classified as
parametric hypotheses and non-parametric hypotheses.
A parameter hypothesis is always about a population parameter which is unknown or
something about which we are not sure and, in fact, will never be sure. Corresponding to the
population parameter is our sample statistic which is something we computed from a real
sample and is something about which we have 100 percent certainty. In hypothesis testing,
we use the sample statistic (something we are sure of) to make an inference about the population
parameter (something we are not sure of). Hypothesis testing is one type of inferential statistics.
Operationally, this hypothesis testing process consists of the following steps:
1. State the research hypothesis reflecting our verbal reasoning. e.g.,
There is a relationship between motivation to learn and math achievement.
Girls have higher math achievement than boys.
2. Convert our research hypothesis into a statistical form. e.g.,
a) > 0, or there is a positive correlation between motivation to learn and math
achievement. The statistic of correlation is used to summarize our sample data.
b) μgirl − μboy > 0: the mean math achievement of girls is higher than that of boys.
Mean is used to summarize our sample data.
3. State the null hypothesis which provides a way to test the statistical hypothesis. e.g.,
H0: ρ = 0. There is no correlation between motivation and math achievement.
H0: μgirl − μboy = 0. The mean math achievement of girls is the same as that of boys.
4. Hypothesis testing is conducted with the assumption that the null hypothesis is true.
Question to be answered: What is the probability of finding a positive correlation as
high as the present sample or even higher when the truth is there is no correlation?
Question to be answered: What is the probability of finding a difference between the
two means as large as the present sample or even larger when there is no difference?
5. Make a decision regarding our hypothesis. Do we reject the null hypothesis or not?
This decision means the data from the one sample we have supports or does not
support our research hypothesis. This decision will be associated with one of two
potential errors.
Statistical significance and testing errors
The level of significance is the Type I error rate (α).
Type I error rate (α) is the probability of rejecting the null hypothesis when the null
hypothesis is true. We make such an error only when the null is rejected.
Type II error rate (β) is the probability of not rejecting the null hypothesis when the
null hypothesis is false. We make such an error only when we fail to reject the null
hypothesis.
Power (1 − β) is the probability of correctly rejecting the null hypothesis when the null
hypothesis is false.
The sampling distribution of the mean
The Central Limit Theorem defines the characteristics of the sampling distribution of a
statistic. Let's use the sample mean as the statistic; it has the following three characteristics:
1. It is normally distributed. For samples of 30 subjects or more, the sampling distribution
of means will be approximately normal, independent of the shape of the population
distribution. For example, even if the population is not normally distributed, a sampling
distribution of means computed from drawing random samples of 30 or more cases each time
from this population will be normally distributed. Therefore, if there is no available
information about the shape of the population distribution, the central limit theorem states
that, as the sample size increases, the shape of the sampling distribution of means becomes
increasingly like the normal distribution. Of course, if the distribution of scores in the
population is normal, the sampling distribution of means, regardless of N, shall be normal.
Because, with a sample of size 30, the normal distribution provides a reasonably good
approximation of the sampling distribution of means, N = 30 is often regarded as the lower
end of sample size for conducting research.
2. It has a mean equal to the population mean. This simply means that the sample mean
is an unbiased estimator of the population mean. In general, a sample statistic is an unbiased
estimator of the corresponding population parameter, if the mean of the statistic is equal to
the population parameter.
3. Its standard deviation, which is called the standard error of the mean, equals the
population standard deviation divided by the square root of the sample size. The standard
error of the mean (σ_X̄) is a function of the population standard deviation (σ) and the sample
size (N): σ_X̄ = σ / √N. The standard error of the mean provides an index of how much the
sample means vary about the population mean, which is the mean of the sampling
distribution of means.
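A small simulation can be used to check these three characteristics; the exponential population below is an arbitrary, strongly skewed choice with mean 2 and SD 2.

```python
import numpy as np

rng = np.random.default_rng(2)

N, reps = 30, 10_000
samples = rng.exponential(scale=2.0, size=(reps, N))  # skewed population, mean 2, SD 2
sample_means = samples.mean(axis=1)                   # one mean per sample of size 30

print("mean of the sample means:", sample_means.mean())       # close to 2, the population mean
print("SD of the sample means  :", sample_means.std(ddof=1))  # close to 2 / sqrt(30) ~ 0.365
```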
Hypothesis testing about the mean
Continuing with the above, the significance level represents the probability at which we
would reject the null hypothesis. In other words, it is the probability at which we would
declare that the sample mean (or any sample statistic) is unlikely (what is unlikely is defined
by the significance level) to have been drawn from the population defined by the null
hypothesis. We often use α = .05 as the significance level. That means we state, ahead of the
game, that 5% is small enough to be considered unlikely. Then all we need to do is to draw a
(random) sample, compute its mean, and then determine the probability of obtaining this
sample mean assuming the population is that defined by the null hypothesis. If the probability
is equal to or smaller than the significance level, we declare that the probability is small
enough to be considered unlikely; i.e., the sample is unlikely to be drawn from the population
defined by the null hypothesis. Thus, we reject the null hypothesis.
The question then becomes how to find out the probability associated with the sample
mean we have in hand. Since the sampling distribution of means for N = 30 can be modeled
by a normal distribution, we can find the probability values associated with any number
under the normal curve. Simply convert the value of the sample mean into a z-score and go to
the table to find the corresponding probability value associated with that z-score.
Compare the probability value we find from the table to that we set for the level of
significance and decide whether to reject the null hypothesis.
Many textbooks use the opposite logic: they identify the point of the
sampling distribution (which is normal) that corresponds to the level of significance and
refer to this point as the critical value. This is the point beyond which the probability of
observing a sample mean is less than .05. In order to find this part of the sampling
distribution, identify the critical value of z_X̄ beyond which 5 percent of the sample means
fall (i.e., α = .05); this critical value can be denoted z_X̄critical. From the table of
normal probabilities, z_X̄critical equals 1.65 for a one-tailed test at α = .05. A sample mean
falling in the rejection region leads to a decision to reject the null hypothesis.
We also need to decide which tail of the normal distribution to use as the critical region
for rejecting the null hypothesis.
Suppose the null hypothesis is H0: μ = 50. The alternative hypothesis and the
corresponding tail may be:
H1: μ > 50
H1: μ < 50
H1: μ ≠ 50
The first two hypotheses are directional; they indicate the direction of the difference
between the mean of the population under the null hypothesis and the mean of the population
under the alternative hypothesis. If the alternative hypothesis predicts the true population
mean to be above the mean in the null hypothesis, the critical region for rejecting the null
hypothesis lies in the upper tail of the sampling distribution. If the alternative hypothesis
predicts the true population mean to be below the mean in the null hypothesis, the critical
region for rejecting the null hypothesis lies in the lower tail of the sampling distribution.
Directional hypotheses are sometimes called one-tailed hypotheses because the critical
region for rejecting the null hypothesis falls in only one tail of the probability distribution.
The third alternative hypothesis, H1: μ ≠ 50, is non-directional. It predicts that the true
mean does not equal the mean in the null hypothesis, but it does not say whether it is below
or above. Thus, we must consider the critical region to lie in either tail of the distribution.
That means, we have to consider both tails if the alternative hypothesis is non-directional.
In order to test the null hypothesis against a non-directional alternative hypothesis at
α = .05, mark off critical regions in both tails of the distribution, each of size α/2 = .05/2 = .025. We
use α/2 because we want to mark off no more than a total of 5 percent of the normal
sampling distribution as a critical region for testing the null hypothesis: .025 + .025 = .05.
Note that if we marked off both tails at .05, we would have actually marked off 10 percent of
the distribution: .05 + .05 = .10. By using α/2, we set the level of significance at α = .05.
Non-directional alternative hypotheses are also called two-tailed hypotheses because the
critical region for rejecting the null hypothesis lies in both tails of the probability distribution.
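A minimal sketch of the one-sample z test described above, with made-up numbers (H0: μ = 50, known σ = 10, N = 36, sample mean 53), showing both the one-tailed and the two-tailed probability.

```python
from math import sqrt
from scipy.stats import norm

mu0, sigma, n, xbar = 50, 10, 36, 53   # hypothetical values

se = sigma / sqrt(n)                   # standard error of the mean
z = (xbar - mu0) / se                  # observed z = 1.8

p_one_tailed = 1 - norm.cdf(z)               # H1: mu > 50 -> p ~ .036, reject at alpha = .05
p_two_tailed = 2 * (1 - norm.cdf(abs(z)))    # H1: mu != 50 -> p ~ .072, do not reject
print(z, p_one_tailed, p_two_tailed)
```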
Recapping the Logic and Process of Hypothesis Testing
Inferential statistics refer to the process of using sample statistics to estimate population
parameters; the sample statistic that is used for such estimation is called an estimator.
Inferential statistics can also be seen as the process of determining the probability of obtaining
a sample statistic as large as or larger than (in absolute value) the one you actually
computed from a random sample, assuming a given population parameter.
Sampling distribution of a statistic is a theoretical or imagined distribution of an infinite
number of the statistic (which can be a mean, a difference between two means, a regression
coefficient, etc.) which is computed from an infinite number of samples of an equal size
randomly drawn from the same population. The mean of this sampling distribution of
statistics equals the population parameter (which corresponds to the sample statistic, of
course) and the variance of this sampling distribution (variance error) equals the population
variance (or a sample estimate of the population variance if the population variance is not
known) divided by the size of the sample (from which your sample statistic is computed).
Taking the square root of the variance gives you the standard deviation, which, in the
sampling distribution of a statistic is called standard error of the statistic. This sampling
distribution is used as a probability distribution to determine, under the null hypothesis, what
is the probability of obtaining a sample statistic as large as or larger than (in terms of absolute
values) the one you computed from your sample. This process of using a sampling
distribution to determine the probability of obtaining a sample statistic is the essence of
inferential statistics.
The sampling distribution of a statistic can take on different shapes or mathematical
properties, depending on the statistic and other things. In this course, we are going to deal
with the normal distribution, t distribution, F distribution, and chi-square distribution. These
in turn correspond to different statistics such as the mean, difference between two means,
frequency, correlation and regression coefficient. The statistic is different and the distribution
is different for each different kind of hypothesis testing. The process and the logic are
identical and are just like what I described so far. I summarize this process and logic again in
the following equation:
Inferential Statistical Testing (1) = [Sample Statistic (2) − Population Parameter (3)] / Standard Error of the Statistic (4)

1. This is a process, although sometimes it is simply called inferential statistics.
2. This is real and is your actual research.
3. This is unknown, and finding the answer (never with certainty) is the purpose of your research.
4. One important criterion of the central limit theorem is that this standard error is a direct function of the sample size.
Hypothesis Testing about Other Statistics
z-test of differences between two means (population variance is known)
From a population, two random samples are drawn (the sample sizes can be different), a mean
score for each group is calculated, and the difference between the means is found by
subtracting one sample mean from the other: X̄1 − X̄2. Imagine that this is done over and
over again by drawing pairs of random samples from the population and finding the
difference (X̄1 − X̄2) between the sample means of the two groups. These differences
between pairs of sample means would be expected to vary from one pair to the next. Imagine
that we plot these differences in the form of a frequency chart, with the values of X̄1 − X̄2 on
the abscissa and the frequency with which each value of X̄1 − X̄2 is observed on the ordinate.
This plot is the sampling distribution of differences between two means. It shows the
variability that is expected to arise from randomly sampling differences between pairs of
means from the population under the null hypothesis. This sampling distribution of
differences between two means is a theoretical probability distribution with differences
between sample means (X̄1 − X̄2) on the abscissa and probabilities on the ordinate. By
treating the sampling distribution of differences between means as a probability distribution,
we can answer such questions as “What is the probability of observing a difference between
sample means (X̄1 − X̄2) as large as or larger than the observed sample difference, assuming
that the null hypothesis is true?” This sampling distribution of differences between means is,
in concept, exactly the same as the sampling distribution of means discussed earlier and has
the same or similar characteristics:
1. It is normally distributed.
2. It has a mean equal to the population parameter, which is zero (i.e., since the
samples are taken from the same population, μ1 = μ2 and thus μ1 − μ2 = 0).
3. It has a standard deviation called the standard error of the difference between
means.
\sigma_{\bar{X}_1 - \bar{X}_2} = \sqrt{\sigma^2_{\bar{X}_1} + \sigma^2_{\bar{X}_2}} = \sqrt{\frac{\sigma^2}{N_1} + \frac{\sigma^2}{N_2}}
The rest are the same as the earlier explanation about hypothesis testing of one
sample mean, except for one important assumption, "homogeneity of variance." It is assumed
that the two populations have the same variance.
Examples
The null hypothesis is that there is no difference between mean GRE verbal scores of the
Chinese and English speaking students (By the way, in trying to find out whether there is
large scale cheating involved in the GRE testing in China, ETS indeed investigated
hypotheses similar to this one because they were suspicious of cheating in China and other
Asian countries):
H0: μE − μC = 0
The alternative hypothesis is that the mean GRE verbal scores of the English speaking
students is higher than that of the Chinese students:
H1: μE − μC > 0
In testing the null hypothesis, we set α=.01 as the level of significance.
Since the research hypothesis is directional, one tail of the sampling distribution of the
difference between means, i.e., the upper tail, is of interest. The critical value of z for the
difference between means that marks off this small upper-tail portion is 2.33 (see the normal
table). Note that we now speak of the z of the difference between two means, z_{X̄E − X̄C}, and
not z_X̄, since we are dealing with the sampling distribution of the difference between two
means. If the observed z_{X̄E − X̄C} is greater than this critical value, the null
hypothesis should be rejected.
Assume that the observed sample means are 610 for the English speaking students and
570 for the Chinese students; the standard error of the differences between means is 15. In
order to decide whether to reject the null hypothesis, we convert these sample data to a
z-score, using
z_{observed} = \frac{(\bar{X}_E - \bar{X}_C) - (\mu_E - \mu_C)}{\sigma_{\bar{X}_E - \bar{X}_C}}

Since μE − μC = 0 under the null hypothesis, this equation can be rewritten as

z_{observed} = \frac{\bar{X}_E - \bar{X}_C}{\sigma_{\bar{X}_E - \bar{X}_C}}

Thus we obtain:

z_{observed} = \frac{610 - 570}{15} = 2.67

Since the observed z of 2.67 exceeds the critical value of 2.33, the decision is made to reject the null
hypothesis and to conclude that GRE verbal scores of English speaking students are on
average higher than those of Chinese students.
Alternatively, we can find the probability value associated with the observed z of 2.67,
which is 0.004. Since it is less than 0.01, we reject the null hypothesis.
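The same one-tailed probability can be obtained with software instead of the normal table; here is a minimal sketch reproducing the numbers above.

```python
from scipy.stats import norm

se_diff = 15                            # standard error of the difference, given above
z_obs = (610 - 570) / se_diff           # 2.67
p_one_tailed = 1 - norm.cdf(z_obs)      # ~ 0.004, smaller than alpha = .01

print(round(z_obs, 2), round(p_one_tailed, 3))
```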
In another example of a study of the disparity between men’s and women’s college
grades, one hypothesis was that women earned higher grades than men. The mean grade point
average (GPA) for the 20 men in the sample was 2.905 and for the 25 women was 2.955. The
variance in the population, which is assumed to be equal for the two groups, is known to be
0.10.
H0: μF − μM = 0, H1: μF − μM > 0

\sigma_{\bar{X}_F - \bar{X}_M} = \sqrt{\sigma^2_{\bar{X}_F} + \sigma^2_{\bar{X}_M}} = \sqrt{\frac{.10}{25} + \frac{.10}{20}} \approx 0.09

z_{observed} = \frac{\bar{X}_F - \bar{X}_M}{\sigma_{\bar{X}_F - \bar{X}_M}} = \frac{2.955 - 2.905}{0.09} = 0.55
The critical z associated with α = .05 for this one-tailed test is 1.65. The observed z of 0.55 is far
smaller than the critical z, and we thus do not reject the null hypothesis.
t test of independent sample means (population variance unknown)
The above provides ways to test hypotheses regarding one sample mean and regarding the
difference between two sample means. The problem with the above is that, in order to use the
sampling distribution of means and the sampling distribution of the differences between
means, we must know the population standard deviation. In reality, we often do not know the
population standard deviation and have to use the sample standard deviation to estimate it.
All the logic discussed earlier still holds. The only difference is that, if we first have to
estimate the population standard deviation, the sampling distribution of means and the
sampling distribution of the differences between means are no longer normal distributions.
Instead, they are best described by the t distribution. Thus, when testing the same hypotheses,
i.e., those regarding a sample mean or sample mean differences, we have to use the t
distribution to determine the probabilities (instead of the normal distribution), and the tests
are thus called t-tests.
Here we focus on the t-test for two independent samples. The first step is to estimate the
population standard deviation so that we can compute the standard error of difference.
Repeating the earlier formula for the standard error of difference:
\sigma_{\bar{X}_1 - \bar{X}_2} = \sqrt{\sigma^2_{\bar{X}_1} + \sigma^2_{\bar{X}_2}} = \sqrt{\frac{\sigma^2}{N_1} + \frac{\sigma^2}{N_2}}

Now, without knowing the population standard deviation, the estimate of the standard
error of the difference can be obtained by using the standard deviation from each sample as an
estimate of σ. For samples of equal size (N1 = N2), the estimate of the standard error of the
difference between means is

s_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{s_1^2}{N_1} + \frac{s_2^2}{N_2}}
The next step is to provide a measure of the distance between the difference of the
sample means (X̄1 − X̄2) and the difference of their corresponding population means (μ1 − μ2). This measure is the t:

t_{observed} = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{s_{\bar{X}_1 - \bar{X}_2}}

where s_{\bar{X}_1 - \bar{X}_2} = \sqrt{s^2_{\bar{X}_1} + s^2_{\bar{X}_2}}.

In this case, t is a measure of the distance between the difference of the sample means
(X̄1 − X̄2) and the difference of their corresponding population means (μ1 − μ2). Notice that,
since under the null hypothesis we sample from populations with identical means in setting
up the sampling distribution under H0, μ1 = μ2, and so μ1 − μ2 = 0. Thus, t can be
simplified as

t_{observed} = \frac{\bar{X}_1 - \bar{X}_2}{s_{\bar{X}_1 - \bar{X}_2}}
The final step is to specify the sampling distribution of t. Clearly the observed value of t
will vary from one sample to the next due to random fluctuation in both X̄ and s. If the value
of t [which is (X̄1 − X̄2)/s_{X̄1 − X̄2}] is calculated for each of a very large number of samples
from a normal population, a frequency distribution for t can be graphed, with the values of t
on the abscissa and the frequencies on the ordinate. This frequency distribution can be
modeled by the theoretical (mathematical) distribution of t. The theoretical model is called
the sampling distribution of t for differences between independent means. It can be used as
probability distribution for deciding whether or not to reject the null hypothesis
when σ_{X̄1 − X̄2} is unknown.
Actually, the sampling distribution of t for differences between means is a family of
distributions. Each particular member of the family depends on degrees of freedom.
Specifically, the t distribution for differences depends on the number of degrees of freedom in
the first sample (N1 – 1) and in the second sample (N2 – 1). So the t distribution used to model
the sampling procedure described before is based on (N1 – 1) + (N2 – 1) degrees of freedom,
or on N − 2 degrees of freedom, where N = N1 + N2. For example, with N1 = N2 = 10, a t
distribution with 18 df would be used to model the sampling distribution [(10-1) + (10-1) = 9
+ 9 = 18].
The t distribution of differences has the following characteristics: First, its mean is equal
to the population parameter, which is zero under the null hypothesis (μ1 − μ2 = 0). Second,
the distribution is symmetric in shape and looks like a bell-shaped curve. However, the
mathematical rule specifying the t distribution is not the same as the rule for the normal
distribution (but very close). As the sample size increases, the t distribution becomes
increasingly normal in shape. With an infinite sample size, the t distribution and the standard
normal distribution are identical.
In using the sampling distribution of t to test for differences between independent means,
we make the following assumptions: The scores in the two groups are randomly sampled
from their respective populations and are independent of one another. The scores in the
respective populations are normally distributed. The variances of scores in the two
populations are equal (i.e., σ1² = σ2²). This assumption is often called the assumption of
homogeneity of variance. Again, this is an important assumption based on which we can
obtain the "pooled estimate" of the population variance (to be discussed next). (However,
when this assumption is not met, we can still use other methods to estimate the standard error
of differences. This won't be discussed here.)
However, in order to use the t test where sample sizes may be either equal or unequal,
we must define s_{X̄1 − X̄2} more generally. The goal in defining s_{X̄1 − X̄2} is to find the best
estimate of the population variance by using the sample variances. This can be done as follows:
under the assumption of equal population variances (σ1² = σ2² = σ²), s1² and s2² each
estimate σ². The best estimate of the value of σ², then, is the average of s1² and s2². This
average is called a pooled estimate. So the formula for the t test can be written to reflect this
pooling.
t_{observed} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s^2_{pooled}\left(\frac{1}{N_1} + \frac{1}{N_2}\right)}} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s^2_{pooled}}{N_1} + \frac{s^2_{pooled}}{N_2}}}

The pooled variance estimate, s²pooled, is obtained by computing a weighted average of s1² and s2²:

s^2_{pooled} = \frac{(N_1 - 1)s_1^2 + (N_2 - 1)s_2^2}{(N_1 - 1) + (N_2 - 1)} = \frac{(N_1 - 1)s_1^2 + (N_2 - 1)s_2^2}{N_1 + N_2 - 2}
In order to decide whether or not to reject the null hypothesis, compare the observed t
value with a critical value. The symbol tcritical designates the points in the t distribution
beyond which differences between sample means are unlikely to arise under the null
hypothesis. Since there is a different critical value of t for each degree of freedom, the first
step is to determine the number of degrees of freedom. The next step is to find the critical
value of t for the related degrees of freedom at a specified level of significance.
For example, an experiment was conducted on preschool children's word
acquisition. In the experimental group (N1 = 15), children were given some kind of phonological
awareness training which was expected to enhance word acquisition. The control group
children (N2 = 20) were simply told stories. At the end of the experiment, the two groups
were tested on a word acquisition test and the following results were obtained.
Group          Mean     Variance
Experimental   18.00    5.286
Control        15.25    3.671
H0: μE = μC
H1: μE > μC

t_{observed} = \frac{\bar{X}_1 - \bar{X}_2}{s_{\bar{X}_1 - \bar{X}_2}} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s^2_{pooled}}{N_1} + \frac{s^2_{pooled}}{N_2}}}

s^2_{pooled} = \frac{(N_1 - 1)s_1^2 + (N_2 - 1)s_2^2}{N_1 + N_2 - 2} = \frac{14(5.286) + 19(3.671)}{15 + 20 - 2} = \frac{74.004 + 69.749}{33} = 4.356
Note that the pooled variance is somewhat closer in value to s2² than to s1² because of the
greater weight given to s2² in the formula. Then
t_{observed} = \frac{18.00 - 15.25}{\sqrt{\frac{4.356}{15} + \frac{4.356}{20}}} = \frac{2.75}{\sqrt{0.5082}} = \frac{2.75}{0.713} = 3.86
For this example we have N1 − 1 = 14 df for the experimental group and N2 − 1 = 19 df
for the control group, making a total of N1 − 1 + N2 − 1 = 33 df. From the sampling distribution
shown in the t table (at the back of the notes), t.05(33) ≈ 2.04. Because the observed value of t
far exceeds this critical value, we reject H0 (at α = .05, two-tailed).
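For reference, the same pooled-variance t test can be reproduced from the summary statistics with scipy; note that scipy expects standard deviations rather than variances and reports a two-tailed p value.

```python
from math import sqrt
from scipy.stats import ttest_ind_from_stats

t, p = ttest_ind_from_stats(mean1=18.00, std1=sqrt(5.286), nobs1=15,
                            mean2=15.25, std2=sqrt(3.671), nobs2=20,
                            equal_var=True)   # pooled-variance (Student) t test
print(round(t, 2), p)                         # t ~ 3.86, two-tailed p < .001
```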
t test for correlated samples
The t test for dependent means is used in situations where the two samples (and the
corresponding populations) are correlated. What makes two samples correlated? First, when
two measures are obtained from the same person, the two samples of these measurements are
correlated. For example, midterm and final are correlated. Views about Hong Kong
democratic reform from husbands and wives represent correlated samples because two people
get married in part because they share similar views and, after marriage, they also influence
each other’s views. In all these examples, there is a correlation between the two persons (e.g.,
husband and wife) or two occasions (midterm and final of the same person). All the
procedures and logic for testing the difference between these kinds of “correlated sample
means” are the same except for the fact that the standard error of the differences between
means is usually larger for independent samples than for dependent samples. This is so
because the dependent samples are correlated. We have to calculate and take into
consideration the correlation between the two scores. This “little exception,” however, makes
the computation much more cumbersome than that for the t-test of independent samples. Instead
of presenting this cumbersome method, many statistics books use a computational method to
conduct the t-test of dependent samples, or t-test of matched pairs.
This computational method uses the same logic as that of the z-test of single sample
mean using the normal distribution, which was described earlier. The only difference is that
the normal distribution serves as the probability distribution for the z-test whereas the t
distribution serves as the probability distribution for the t-test of dependent samples. The
procedure is to first compute a difference score between the two variables and then work on
the difference score, which is one variable, as if we are dealing with one sample mean test.
Let us see how it works using the following table as an example.
(1) Subject   (2) X1   (3) X2   (4) D = X1 − X2   (5) D̄       (6) D − D̄   (7) (D − D̄)²
1             60       107      −47               −24.40      −22.60       510.76
2             85       111      −26               −24.40      −1.60        2.56
3             90       117      −27               −24.40      −2.60        6.76
4             110      125      −15               −24.40      9.40         88.36
5             115      122      −7                −24.40      17.40        302.76
Sum           460      582      (−122)            (−122)      0            911.20
In the above table, instead of comparing columns 2 and 3, which are the two variables, we
compute the difference between the two variables, which resides in column 4. Then, instead of
testing the following hypotheses:
H0: μ1 − μ2 = 0
H1: μ1 − μ2 > 0 (or H1: μ1 − μ2 < 0)
We test this hypothesis that involves only the difference score:
H0: μD = 0
H1: μD > 0 (or H1: μD < 0)
The corresponding t-test is:
t_{observed} = \frac{\bar{D} - \mu_D}{s_{\bar{D}}} = \frac{\bar{D}}{s_{\bar{D}}}, where s_{\bar{D}} = \frac{s_D}{\sqrt{N}}

Of course, s_D is obtained by the following equation, which is the same as that for a standard deviation:

s_D = \sqrt{\frac{\sum (D - \bar{D})^2}{N - 1}}

In the above example,

s_D = \sqrt{\frac{911.20}{5 - 1}} = \sqrt{227.8} = 15.093

s_{\bar{D}} = \frac{15.093}{\sqrt{5}} = 6.750

t_{observed} = \frac{-24.40}{6.750} = -3.61

df = N − 1 = 5 − 1 = 4

t_{critical}(.01/2, 4) = 4.604 for a two-tailed test. Since |t_{observed}| = 3.61 does not exceed 4.604, we do not reject H0 at α = .01.
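The same paired test can be run directly on the two columns of raw scores; scipy's ttest_rel works on the difference scores internally and reports a two-tailed p value.

```python
from scipy.stats import ttest_rel

x1 = [60, 85, 90, 110, 115]      # column (2) from the table above
x2 = [107, 111, 117, 125, 122]   # column (3)

t, p = ttest_rel(x1, x2)         # equivalent to a one-sample t test on D = X1 - X2
print(round(t, 2), round(p, 3))  # t ~ -3.61 with df = 4; two-tailed p ~ .02,
                                 # so H0 is not rejected at alpha = .01
```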
Steps in t test of difference between two means
1. Make assumptions based on the research issue. State the null hypothesis and the alternative
hypothesis, which determine whether the test is one-tailed or two-tailed and, for a one-tailed
test, which tail is of interest.
2. Distinguish whether the two groups are two independent samples or two paired
samples (i.e., correlated samples).
3. Choose a suitable command in Excel or SPSS; the t statistic, the corresponding
probability, and other related information can then be obtained. For independent samples, we need
to test whether the two population variances are equal or not before testing the difference
between the two population means.
4. Report the result of the hypothesis test and give reasonable explanations.
Chi-square test for frequency data
The chi-square statistic (χ²) is used to test whether the observed frequencies differ
significantly from the expected frequencies. The expected frequencies might be based on a
null hypothesis, such as that height should be normally distributed with an extremely small
frequency of people above 2 meters. The expected frequencies might also be based
on what would be expected if chance assigned subjects to categories; for example, the number of people
who carry their bags on the right versus the left shoulder should be 50:50 under such a
chance (null) hypothesis.
χ² is used with data in the form of counts (in contrast, for example, to scores on a test).
Thus χ² can be used with frequency data (f), proportion data (f ÷ n), probability data
(number of outcomes in an event ÷ total number of outcomes), and percentages (proportion ×
100). If the data are in the form other than frequencies (e.g., proportions), the data need to be
converted to frequencies.
Chi-square test for one-way design
In the chi-square test for a one-way design, only one variable is involved. We test
whether the variable has the distribution as expected. Suppose N subjects can be assigned
to k categories. The observed and expected frequencies of the ith category are Oi and Ei,
respectively. The test statistic is:
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}
The chi-square statistic provides a measure of how much the observed and expected
frequencies differ from one another. But how much difference should be tolerated before
concluding that the observed frequencies were not sampled from a distribution represented by
the expected frequencies? In other words, how large should the observed χ² be in order to reject the
null hypothesis that the observed frequencies were sampled from a distribution represented
by the expected frequencies? The question becomes one of finding the probability of obtaining
an observed χ² as large as or larger than some value, assuming the null hypothesis is true. To do so,
we need a sampling distribution for χ².
The chi-square distribution, like the t distributions, is a theoretical distribution—actually,
a family of theoretical distributions depending on the number of distinct categories on which
the statistic is based. Or more accurately, there is a different member of family for each
number of degrees of freedom, which is determined by: df = number of categories minus 1 or
k – 1.
Notice that the degrees of freedom for the  2 distribution depend on the number of
categories (k) but not the number of subjects in the sample (N).
For example, to find out whether HK movie goers have any preference among the four
common kinds of movies (Action, Horror, Drama, and Comedy), one can use the
chi-square test to compare the observed movie choices against the expected choices, which,
under the null hypothesis (no preference), would be 1/4 of the total for each kind. Assume a
researcher stood at the ticket counter, tallied 32 tickets sold, and obtained the following.
Movies      Action   Horror   Drama   Comedy   Total
Observed    4        5        8       15       32
Expected    8        8        8       8        32
H0: Four kinds of movies are equally preferred.
H1: Some kinds of movies are preferred more than others.
2  
(O  E ) 2
E
(4  8) 2 (5  8) 2 (8  8) 2 (15  8) 2



8
8
8
8
 9.25
With k − 1 = 3 df, the critical value χ².05 = 7.82. The observed chi-square
value of 9.25 is greater than 7.82 and thus the researcher can reject the null hypothesis and
conclude that HK movie goers have preferences. But which kind do the HK movie goers
prefer? Unfortunately, the chi-square test alone does not answer that question. The easiest
way to determine that is by eyeballing the four frequencies. From the data, it is clear that
more people than expected go to see comedy, whereas Action and Horror choices were made
less frequently than expected.
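The one-way test above can be reproduced with scipy's chisquare function, which takes the observed and expected counts directly.

```python
from scipy.stats import chisquare

observed = [4, 5, 8, 15]     # Action, Horror, Drama, Comedy
expected = [8, 8, 8, 8]      # equal preference: 32 tickets / 4 kinds

chi2, p = chisquare(f_obs=observed, f_exp=expected)
print(round(chi2, 2), round(p, 3))   # chi-square = 9.25 with df = 3, p ~ .026
```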
One requirement is that the expected frequency for each category is not less than 5 for df
≥ 2 and not less than 10 for df = 1. Note that this assumption is for expected, not observed,
frequencies.
The observed values of χ² with 1 degree of freedom must be corrected for continuity in
order to use the table of values of χ²critical. This is called Yates' correction for continuity used
to correct the inconsistency between the theoretical sampling distribution of χ² and the
actual sampling distribution of the frequency data, which only approximates the chi-square
distribution. Yates’s correction:
1. Subtract .5 from the observed frequency if the observed frequency is greater than the
expected frequency; that is, if O > E, subtract .5 from O.
2. Add .5 to the observed frequency if the observed frequency is less than the expected
frequency; that is, if O < E, add .5 to O.
Chi-square test for two-way design
In the case of a two-way design, i.e., two variables which are categorical, the question
becomes whether the two variables are independent of one another or are related to
each other. For example, is movie preference, which, let us assume is defined by four
preference categories of action, horror, drama, and comedy only, related to gender (another
categorical variable with two categories)?
The formula for calculating the two-way χ² is the same as for the one-way χ² when each
cross-category in the two-way design is treated as one category; only the degrees of freedom
change. The degrees of freedom for the two-way χ² depend on the number of rows (r,
which represents the number of categories for one variable) and the number of columns (c,
which represents the number of categories for the other variable) in the design: df = (r − 1)(c − 1).
The chi-square statistic with (r – 1)(c – 1) degrees of freedom is used to compare the
observed and expected frequencies in a two-way contingency table. Here, the expected
frequencies represent the frequencies that would be expected if two variables were
independent. The statistical rule of independence is that when two events are independent,
their joint probability is the product of their individual probabilities.
For example, to test whether there is a correlation between gender and movie choice (in
other words, whether men and women have different preferences for movies), one can
conduct a chi-square test to examine whether these two variables are independent.
H0: Gender and movie choice are independent in the population.
H1: Gender and movie choice are related in the population.
The data are contained in the table below. In this example, the expected frequency in the
(1,1) cell, which is presented in parentheses, is obtained by multiplying the row total (564 for the
first row) by the column total (116 for the first column) and dividing this product by the
total frequency (897):

E_{11} = \frac{564 \times 116}{897} = \frac{65{,}424}{897} = 72.94
This procedure for calculating the expected frequencies is carried out for all cells; the
results are presented in parentheses in the table below. Applying the chi-square formula, we obtain:
 2 observed = 16.78 + 28.48 + 265.36 + 110.08 + 70.53 + 119.19 + 20.02 + 33.13 = 463.57
df = (r – 1)(c – 1) = (4 - 1)(2 - 1) = 3.
With df = 3, the critical chi-square value is 7.81 at α = .05 (and 11.34 at α = .01). The observed chi-square of
463.57 far exceeds the critical chi-square and we reject the null hypothesis.
Sex       Action     Horror      Drama       Comedy    Total
Male      108 (73)   345 (224)   94 (218)    17 (48)   564
Female    8 (43)     12 (133)    253 (129)   60 (29)   333
Total     116        357         347         77        897

(Expected frequencies are shown in parentheses.)
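The two-way test can be reproduced from the observed counts with scipy's chi2_contingency, which also returns the table of expected frequencies computed by the row-total × column-total ÷ grand-total rule described above.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: Male, Female; columns: Action, Horror, Drama, Comedy
observed = np.array([[108, 345,  94, 17],
                     [  8,  12, 253, 60]])

chi2, p, df, expected = chi2_contingency(observed)
print(round(chi2, 2), df)        # ~ 463.6 with df = 3
print(np.round(expected, 1))     # matches the expected counts shown in parentheses
```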
Correlation and Regression
Joint distribution
The joint distribution of two variables is graphically presented as a scatter plot. Unlike
the frequency polygon for a univariate distribution, on both the ordinate and abscissa of a
scatter plot are the measurement units of each of the two variables. Each dot represents the
intersection of a pair of observations, the measurements of which are on the ordinate and
abscissa. Scatter plots depict the strength and direction of the association between two
variables. These characteristics can be summarized numerically by one of two statistics:
covariance and correlation.
Covariance
The covariance of X and Y is the average of the cross-products of deviations of the two
variables. It represents the degree to which the two variables vary together.
s_{XY} = \frac{1}{N-1} \sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})   (Covariance of X and Y)

s_X^2 = \frac{1}{N-1} \sum_{i=1}^{N} (X_i - \bar{X})(X_i - \bar{X})   (Variance of X)

s_Y^2 = \frac{1}{N-1} \sum_{i=1}^{N} (Y_i - \bar{Y})(Y_i - \bar{Y})   (Variance of Y)
The concept of covariance of two variables is similar to the variance of one variable.
Variance summarizes deviations of observations from the mean, whereas covariance
summarizes the deviations of pairs of observations from two respective means. If we replace
all the Y's in the covariance equation with X's, we would have the variance of X. However,
there is one important difference between variance and covariance. While a variance is
always positive, a covariance can be positive or negative. The sign of the covariance
indicates the direction of the association of the two variables. Of course, the magnitude of a
covariance indicates the degree to which the two variables vary together or the strength of the
association of the two variables. However, the value of covariance is also influenced by the
measurement units of the two variables. A larger numerical quantity as measurement of one
or both variables, e.g., X = 24 inches and s X = 120 inches versus X = 2 feet and s X = 10
sY2 
feet and/or Y = 36 ounces and sY = 144 ounces versus Y = 3 pounds and sY = 12 pounds,
will result in a larger s XY independent of the actual association of the two variables.
Correlation Coefficient
One way to eliminate the influence of measurement units is to use z-scores of X and Y
since the z-scores have the same mean and standard deviation whether or not the original
units are a large quantity of inches and ounces or smaller quantity of feet and pounds. The
covariance of X and Y expressed in the z-score form is called a correlation coefficient.
r_{XY} = \frac{1}{N-1} \sum_{i=1}^{N} \left(\frac{X_i - \bar{X}}{S_X}\right)\left(\frac{Y_i - \bar{Y}}{S_Y}\right) = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{N} (X_i - \bar{X})^2 \sum_{i=1}^{N} (Y_i - \bar{Y})^2}}
The relation between the correlation and the covariance can be described by:

r_{XY} = \frac{s_{XY}}{s_X s_Y}, \quad s_{XY} = r_{XY} s_X s_Y
The correlation being discussed here was derived by Karl Pearson (1857-1936)
and is called the product moment correlation coefficient. It is denoted by ρ (rho) as a
population parameter and r as a sample statistic. The term "moment" in physics refers to a
function of the distance of an object from the center of gravity, which is the mean of a
frequency distribution. x = X − X̄ and y = Y − Ȳ are the first moments, and xy is a product of
the first moments. The correlation is also called the Pearson coefficient.
Like the covariance, a correlation can be positive or negative, indicating the direction of
the relationship. Since a correlation is free from the influence of measurement units, the
magnitude of the correlation ranges from 0, indicating no linear relationship, to 1 or -1,
indicating a perfect or strongest association.
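Using the same made-up heights and weights as in the covariance sketch above, a small sketch shows that the correlation, unlike the covariance, does not change when the units change:

```python
import numpy as np

def correlation(x, y):
    # Pearson r as the average cross product of z-scores (N - 1 in the denominator)
    zx = (x - x.mean()) / x.std(ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)
    return (zx * zy).sum() / (len(x) - 1)

height_in = np.array([60.0, 64.0, 66.0, 70.0, 72.0])
weight_oz = np.array([1800.0, 1950.0, 2100.0, 2300.0, 2450.0])

r_inch_ounce = correlation(height_in, weight_oz)
r_foot_pound = correlation(height_in / 12.0, weight_oz / 16.0)
print(r_inch_ounce, r_foot_pound)                 # identical values
print(np.corrcoef(height_in, weight_oz)[0, 1])    # numpy's built-in routine agrees
```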
Linear Association
There are many ways two variables can be related. We only deal with one of the types,
that of a linear association. In a linear relation, the change of one variable corresponding to
the change of the other variable is in one direction and at a constant rate. That is, as X goes up, Y
either consistently goes up or consistently goes down, but does not do both. An association is curvilinear if
Y sometimes goes up and sometimes goes down in correspondence to the upward movement
of X. The Pearson correlation measures only linear association. Thus, a correlation coefficient
should be examined together with the scatter plot to be certain that a zero or moderate
coefficient is not due to the nonlinearity of an existing association. The strength of a linear
association can also be examined from scatter plots. When the dots (representing the
correspondence between X and Y) are tightly clustered together in one direction or the other,
the association between X and Y is stronger than when the dots are loosely scattered.
Outliers and restriction of range
As can be seen from the earlier formula, the correlation between X and Y is determined
by the deviations of X from X̄ and the deviations of Y from Ȳ. Two things may artificially
influence these deviations and therefore the correlation coefficient. One is called an outlier,
which is an extremely large or small score not typical of the X or Y population, creating an
extremely large deviation. This large deviation sometimes may artificially raise the
correlation if the large deviation is consistent with most of other observations. It may also
artificially attenuate the correlation if the large deviation is inconsistent with other
observations. The other problem is called the "restriction of range" fallacy. This refers to the
fact that unusually small variability in either of the X or Y variable will result in small
deviations and thus a weak correlation. Such reduced variability occurs when part rather than
the full range of the X and/or Y distribution is sampled. For example, the correlation between
GRE and graduate GPA would be much higher had the full range of GRE scores been
available. In reality, we only have the graduate GPA measures for some but not all of the
GRE scores since graduate schools are open only to students whose GRE are above a certain
level. There are ways to predict the potential correlation of a full range using the correlation
estimated from a restricted range. It is important to examine the scatter plot to determine if
the relationship is linear, if there are outliers, and, according to theory and experience, if the
sampled X and Y values represent the full range of these two variables.
Hypothesis Testing
The purpose of the test for ρXY = a specified value is to determine whether or not the
observed value of rXY was sampled from a population in which the linear relationship between
X and Y is some specified value. When ρXY (in the population, or the unknown truth) is
positive (e.g., ρ = .30), the sampling distribution for rXY is negatively skewed. When ρXY is
negative (e.g., ρ = -.25), the sampling distribution is skewed in the positive direction. Since
skewed distributions can be transformed into approximately normal distributions by taking a
logarithmic transformation of the data, a log transformation of rXY is made. With this
transformation, the sampling distribution of rXY is approximately normal, regardless of
whether the sample correlation is drawn from a population with ρXY = 0 or ρXY = some
specific value. This log transformation is known as Fisher's Z transformation. We do not have
to carry out the transformation ourselves; statistical tables give the Z values corresponding
to different correlation coefficients. The standard error of the sampling distribution of rXY
when it has been transformed to Z is given by:
$$s_Z = \frac{1}{\sqrt{N-3}}$$
In order to test whether or not an observed correlation rXY differs from the hypothesized
value of the population parameter ρXY, we use the table of Z transformations and compute
the z-scores (as we did earlier) and then go to the normal table to find the probability values
(probability associated with the observed z-score and/or probability associated with the
critical z-score.)
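As a sketch of this procedure, the standard Fisher Z transformation z' = ½ ln[(1 + r)/(1 − r)] (available as arctanh in numpy) can be applied directly; the sample correlation, hypothesized value, and sample size below are made up:

```python
import numpy as np
from scipy.stats import norm

r_obs = 0.45    # observed sample correlation (hypothetical)
rho_0 = 0.30    # hypothesized population correlation (hypothetical)
N = 100         # sample size (hypothetical)

z_obs = np.arctanh(r_obs)     # Fisher's Z of the observed correlation
z_0 = np.arctanh(rho_0)       # Fisher's Z of the hypothesized value
se = 1.0 / np.sqrt(N - 3)     # standard error of the transformed correlation

z_statistic = (z_obs - z_0) / se
p_value = 2 * (1 - norm.cdf(abs(z_statistic)))  # two-tailed probability
print(z_statistic, p_value)
```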
More often, of course, we are interested in whether two correlation coefficients are the same
or whether one is higher or lower than the other. This involves hypothesis testing of the difference
between two correlations. The sampling distribution of the differences between sample
correlations (rX1Y1 − rX2Y2) based on independent samples can also be modeled by the normal
distribution if the observed correlations are transformed to Fisher’s Z’s. The standard error of
the difference is estimated by
$$s_{r_1 - r_2} = \sqrt{\frac{1}{N_1 - 3} + \frac{1}{N_2 - 3}}$$
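A parallel sketch for comparing two correlations from independent samples (again with made-up values):

```python
import numpy as np
from scipy.stats import norm

r1, n1 = 0.52, 85   # correlation and sample size in the first sample (hypothetical)
r2, n2 = 0.31, 90   # correlation and sample size in the second sample (hypothetical)

z1, z2 = np.arctanh(r1), np.arctanh(r2)              # Fisher's Z transforms
se_diff = np.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))   # standard error of the difference

z_statistic = (z1 - z2) / se_diff
p_value = 2 * (1 - norm.cdf(abs(z_statistic)))       # two-tailed probability
print(z_statistic, p_value)
```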
Non-standardized regression equation
The non-standardized regression equation, or simply, regression equation is
$$Y_i = \beta_0 + \beta_1 X_i + e_i$$

where β0 is called the intercept, which denotes the value of Y when X is zero, and β1 is the
regression coefficient, which represents the change in Y associated with a one-unit change in X.
The intercept is estimated by:

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$

β1 can be estimated by

$$\hat{\beta}_1 = r_{XY}\,\frac{s_Y}{s_X}$$

where rXY is the correlation between X and Y, and sY and sX are the standard deviations
of Y and X.
Once the intercept and regression coefficient are estimated, we can estimate the
predicted value of Y at a given value of X.
Prediction: $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$

Residual: $\hat{e}_i = Y_i - \hat{Y}_i$
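A minimal sketch of these estimates, using a small made-up set of SAT-like (X) and GRE-like (Y) scores:

```python
import numpy as np

# Hypothetical predictor (X) and outcome (Y) scores
X = np.array([950.0, 1000.0, 1050.0, 1100.0, 1200.0, 1250.0])
Y = np.array([480.0, 500.0, 540.0, 560.0, 610.0, 630.0])

r_xy = np.corrcoef(X, Y)[0, 1]
s_x, s_y = X.std(ddof=1), Y.std(ddof=1)

beta1_hat = r_xy * s_y / s_x                   # slope: r times (sY / sX)
beta0_hat = Y.mean() - beta1_hat * X.mean()    # intercept: Y-bar minus slope times X-bar

Y_hat = beta0_hat + beta1_hat * X              # predicted values
residuals = Y - Y_hat                          # prediction errors
print(beta0_hat, beta1_hat)
```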
The purpose of regression is to explain the variability of an outcome variable (called the
dependent variable in experimental research or the response variable in non-experimental
research) from the knowledge of some other predictor variables (called the independent or
explanatory variable in experimental and non-experimental research, respectively). Without any predictors, the best prediction
is the mean of the outcome variable. For example, if we were to guess the GRE score of a
randomly chosen graduate student without knowing things like which graduate school he/she
is attending, which university he/she graduated from, what was his/her undergraduate GPA,
what was his/her SAT, etc., the best guess is perhaps the population mean of GRE. That is, we
would be wondering what kind of GRE score an average graduate student would get. And
that score would be the average of GRE scores taken by all the graduate students. If we use Y
to denote GRE as an outcome variable, the best prediction of one randomly chosen student's
GRE is μ(Y), or the population mean of GRE. Of course, the larger the variability of the
GRE scores, the less certain we are of our guess. And we are right if we think that the
variance of all the GRE scores, σ²(Y), is quite large.
What if you are told that his/her SAT is above the mean of SAT, μ(X), using X to
denote SAT? Would you bet his/her GRE to be above μ(Y)? If you feel more certain with
this second bet, it is because, by knowing his/her SAT score, you increase your chance by
picking from a smaller range of GRE scores, those above the mean. We call this smaller
range of GRE scores the conditional distribution of GRE and the full range of GRE scores the
unconditional distribution of GRE. You are more confident with the second guess because the
variability of the conditional distribution, σ²(Y|X), is smaller. σ²(Y|X) is the variance of Y
at a given level of X. In our example, it is the variance of GRE scores of those who scored
above the mean on the SAT.
What regression does is divide the whole population of an outcome variable into many
sub-populations according to the values of the predictor variable. The sub-populations are
called conditional population or conditional distribution whereas the whole population is
called unconditional population or unconditional distribution. Without knowledge of a
predictor, the best prediction of the outcome variable is the mean of the unconditional
population, μ(Y). Knowing the value of a predictor, the best prediction is the mean of the
conditional distribution, μ(Y|X). The predicted Y, Ŷi, is an estimate of the mean of the
conditional population of Y, μ(Y|X). Knowing that his/her SAT is 1200, the best guess of
his/her GRE would be the mean of all GRE scores of people with an SAT of 1200, or μ(Y|X=1200).
What we have done earlier through the regression equation is to estimate these means of
conditional population. The regression equation for population parameters is presented below
and is accompanied by the equation for sample statistics described earlier.

$$\mu(Y \mid X) = \beta_0 + \beta_1 X$$

$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$$
The relationship between the above two equations is that between sample estimates and
population parameters. Ŷi is a sample estimate of μ(Y|X=Xi); β̂0 is an estimate of β0;
β̂1 is an estimate of β1. If we do not have any predictor variables, the best prediction will
be the mean of the unconditional distribution, or μ(Y), and the sample estimate of μ(Y) is
Ȳ. Ȳ in this case will be our sample prediction, Ŷi. It is shown in the following that, when
β̂1 = 0, or when there is no prediction from X to Y, Ŷi = Ȳ.
$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$$

Recall that $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$. Replacing β̂0 by Ȳ − β̂1X̄, we have

$$\hat{Y}_i = \bar{Y} - \hat{\beta}_1 \bar{X} + \hat{\beta}_1 X_i = \bar{Y} + \hat{\beta}_1 (X_i - \bar{X}) = \bar{Y}, \quad \text{if } \hat{\beta}_1 = 0.$$
Fitting a linear regression line to the scatter plots
(A graph will be shown in the class which will facilitate the understanding of this part of
the text.)
Knowing the values of β̂0, the intercept, and β̂1, the regression coefficient, one can draw
a straight line, known as the linear regression line. Or, to put it differently, the Ŷi's form a
straight line. β̂0 is the intercept of the regression line, or the value of Y when X is zero; β̂1 is
the slope of the regression line, or how steep or flat the line is. The regression line which
represents a linear functional relation between X and Y best fits the data points in the joint
distribution or scatter plot. There are several methods to fit the regression line to the data
points. In other words, there are several ways to estimate  0 and  1 in a regression equation.
The one we focus on here, which is also the most commonly used, is the Ordinary
Least Squares (OLS) method. By the least squares criterion, the intercept and slope of the
regression equation are determined so that the sum of the squared deviations of the actual
data points from the regression line is a minimum.
Since the regression line is formed by the Ŷi's and the data points are the actual Yi's for
each level of X, the least squares criterion specifies that Σ(Yi − Ŷi)² is a minimum. Ŷi is the
predicted value. Yi − Ŷi = êi is the residual or prediction error. Just as the sum of deviations from
the mean is zero, Σ(Yi − Ŷi), or Σêi, based on the same cases that estimate β̂0 and β̂1, is
also zero. Conceptually, Σêi = 0 implies that over-predictions (when data points fall below
the regression line) and under-predictions (when data points are above the regression line)
cancel each other out. Of course, the closer the data points are to the regression line, the
smaller the sum of squared residuals and, thus, the better the prediction. In an extreme case
when all the data points fall on the regression line, there will be no prediction errors and a
perfect prediction from X to Y.
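With the same made-up X and Y scores as in the earlier sketch, one can verify numerically that the OLS residuals sum to (essentially) zero:

```python
import numpy as np

X = np.array([950.0, 1000.0, 1050.0, 1100.0, 1200.0, 1250.0])
Y = np.array([480.0, 500.0, 540.0, 560.0, 610.0, 630.0])

# OLS slope and intercept, estimated as before
b1 = np.corrcoef(X, Y)[0, 1] * Y.std(ddof=1) / X.std(ddof=1)
b0 = Y.mean() - b1 * X.mean()

residuals = Y - (b0 + b1 * X)
print(residuals.sum())          # ~0: over- and under-predictions cancel out
print((residuals ** 2).sum())   # the quantity that the least squares criterion minimizes
```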
The deviation of an observed Yi (for a given value of Xi) from the mean of Y, Ȳ, is
made of two components, Ŷi − Ȳ and Yi − Ŷi. For each observation, we can write:

$$(Y_i - \bar{Y}) = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)$$
We can square the deviations and sum them for all the observations:
$$\sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2$$

(Note:

$$\sum (Y_i - \bar{Y})^2 = \sum \left[ (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i) \right]^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2 + 2 \sum (\hat{Y}_i - \bar{Y})(Y_i - \hat{Y}_i)$$

The last term is zero because of the least squares criterion.)
The above terms are called "sum of squared deviations" or simply "sum of squares" or
SS. The equation is often simply written as:
SStotal = SSreg + SSres
which decomposes the total sum of squares into two parts: the sum of squares due to
regression, SSreg, and the sum of squares of residuals, SSres.
The sum of squares can be divided by the corresponding degrees of freedom to arrive at
estimated variance components.
dftotal = N - 1, where N is the number of sampled observations.
dfreg = k, where k is the number of predictors, X1, X2, …, Xk.
dfres = N - k - 1.
Dividing SStotal , SSreg and SSres by corresponding df, yields the estimated variances or the
mean of sum of squares (mean squares): MStotal , MSreg and MSres. An F statistic can be used
to test whether the regression is significant. The null hypothesis is H0: β1 = 0; the
alternative hypothesis is H1: β1 ≠ 0.
$$F = \frac{MS_{reg}}{MS_{res}}$$
When Fobs. is larger than Fcri, reject H0: β1 = 0, and conclude that the regression is
significant.
SS reg
It can be shown that r 2 XY 
, that is, the squared correlation between X and Y
SS total
represents the proportion of variation in Y that is linearly associated with X, or in the other
words, that can be explained by X.
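The decomposition, the F test, and the equality r² = SSreg/SStotal can all be checked with the same made-up data as above (k = 1 predictor); scipy is assumed for the F probability:

```python
import numpy as np
from scipy.stats import f as f_dist

X = np.array([950.0, 1000.0, 1050.0, 1100.0, 1200.0, 1250.0])
Y = np.array([480.0, 500.0, 540.0, 560.0, 610.0, 630.0])
N, k = len(Y), 1

b1 = np.corrcoef(X, Y)[0, 1] * Y.std(ddof=1) / X.std(ddof=1)
b0 = Y.mean() - b1 * X.mean()
Y_hat = b0 + b1 * X

ss_total = ((Y - Y.mean()) ** 2).sum()
ss_reg = ((Y_hat - Y.mean()) ** 2).sum()
ss_res = ((Y - Y_hat) ** 2).sum()
print(np.isclose(ss_total, ss_reg + ss_res))    # True: SS_total = SS_reg + SS_res

ms_reg = ss_reg / k                             # df_reg = k
ms_res = ss_res / (N - k - 1)                   # df_res = N - k - 1
F = ms_reg / ms_res
p = 1 - f_dist.cdf(F, k, N - k - 1)
print(F, p)

print(ss_reg / ss_total, np.corrcoef(X, Y)[0, 1] ** 2)  # both equal r squared
```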
Dividing the sums of squares by these df's yields the estimated variances:

s²y = s²ŷ + s²r  (s²r is also written as s²y.x.)

s²y = s²ŷ + s²y.x decomposes the total variance in Y into two parts: that which is
linearly related to X, s²ŷ, and that which cannot be accounted for by the linear function of
X, s²y.x. We learned earlier that r²xy represents the proportion of variance in Y that is linearly
associated with X. We can see now that the same interpretation applies to s²ŷ.
In s²y = s²ŷ + s²y.x, s²y is the sample estimate of the unconditional population variance,
or the total variability in Y. In our GRE example, this is the variance of all GRE scores. s²ŷ
is the sample estimate of the variability among the conditional population means. Because the
conditional population means can be predicted by the linear regression equation, s²ŷ is called
the variance due to regression. In the GRE example, this is an estimate of the variance among
the means of GRE at different levels of SAT.
example, this variance can give you some idea about the difference between the mean of
GRE with an SAT of 1000 and the mean of GRE with an SAT of 900. Finally, s²y.x is an estimate of
the conditional population variance, σ²(Y|X), or the variability among all the Y's in the
population at a given level of X. In the GRE example, not everyone having an SAT of 1000
would have the same GRE. The error variance of prediction describes the differences among
the GRE scores of these people who all scored 1000 on the SAT. Of course, according to our linear
regression prediction, all these people are predicted to have a common GRE, which is the
mean of their GRE scores. s²y.x shows how far off we are in our regression prediction.
Note. s²y, s²ŷ, and s²y.x are unbiased estimators of their respective population
parameters. For sampled data, these three terms do not add up because they are not divided
by a common denominator, or degrees of freedom. Thus, the equation s²y = s²ŷ + s²y.x does
not really hold. Precisely, E(s²y) ≠ E(s²ŷ) + E(s²y.x). We use the equation because it is
conceptually easy to understand.
Analysis of Variance
When three or more populations (and their means) are being compared, analysis of
variance (ANOVA) and the F-test are applied instead of the t-test, which is appropriate for
comparing two population means. Here we focus on ANOVA for the one-way design only.
Procedure of ANOVA
1. Hypothesis
In ANOVA, the null hypothesis is that all population means are identical. That is,
$$H_0: \mu_1 = \mu_2 = \cdots = \mu_k$$

where k is the number of populations (or groups) being compared.
2. Decomposition of the sum of squared deviations
Let Xij be the score of the jth subject in the ith group, X̄i be the mean of the ith group
with a total of ni subjects, and X̄ be the grand mean of all N = n1 + n2 + ⋯ + nk subjects.
The variation of a data set can be partitioned into two sources: one caused by the
differences between groups (variation between groups), the other coming from the differences
among subjects within each group (variation within groups). The variations are usually
measured by the sum of squared deviations (or the sum of squares).
Total sum of squares: $SS_T = \sum_i \sum_j (X_{ij} - \bar{X})^2$

Sum of squares between groups: $SS_B = \sum_i n_i (\bar{X}_i - \bar{X})^2$

Sum of squares within groups: $SS_W = \sum_i \sum_j (X_{ij} - \bar{X}_i)^2$

The above three sums of squares satisfy

$$SS_T = SS_B + SS_W$$
which decomposes the total sum of squares into two parts: the sum of squares between
groups and the sum of squares within groups.
3. Decomposition of the degrees of freedom and the mean squares
Without considering the degrees of freedom, the sum of squares between groups and the
sum of squares within groups are not comparable.
Total: dfT = N − 1
Between groups: dfB = k − 1
Within groups: dfW = N − k
Corresponding to the decomposition of the total sum of squares, the total degrees of freedom
can be decomposed as:
dfT = dfB + dfW
Dividing SSB and SSW by the corresponding df yields the mean of the sum of squares (mean squares):

mean squares between groups: MSB = SSB / dfB

mean squares within groups: MSW = SSW / dfW
4. F-test
An F statistic can be used to test whether the differences among groups are significant.
$$F = \frac{MS_B}{MS_W}$$

with degrees of freedom (dfB, dfW).
If there are small differences among groups, the mean squares between groups will be
close to the mean squares within groups, and the F value calculated from the sample, Fobs., will
be close to 1. On the other hand, if there are large differences among groups, Fobs. will be
substantially larger than 1. When Fobs. is larger than Fcri, we reject H0 and conclude that
not all groups have the same mean. Alternatively, if the probability corresponding to Fobs. is
less than α = 0.05, reject H0.
In application, an ANOVA table is often displayed, through which one can view the
results clearly.

ANOVA Table:

Source            SS      df        MS      F            P
Between groups    SSB     k − 1     MSB     MSB/MSW
Within groups     SSW     N − k     MSW
Total             SST     N − 1
Example of ANOVA
Three different teaching methods were applied in three classes, one method for each
class. After one year, a test was conducted using the same items in the three classes. Ten
scores from each class were randomly drawn (see table below). Are there significant
differences among the three classes?
Scores of samples

Class   Score                                       Mean    Variance
A       76, 78, 71, 68, 74, 67, 73, 80, 72, 70      72.9    17.7
B       83, 70, 76, 76, 69, 74, 72, 80, 79, 75      75.4    19.6
C       82, 88, 83, 85, 79, 77, 84, 82, 80, 75      81.5    14.9
The null hypothesis to be tested is

$$H_0: \mu_1 = \mu_2 = \mu_3$$
Since F = 11.2 and P = 0.000, we reject H0 and conclude that there are at least two classes
with significantly different mean scores. If all other conditions are similar, we can conclude
that the effect of teaching methods on achievement is significant.
ANOVA

Source            SS       df     MS       F       P
Between groups    391.4     2     195.7    11.2    0.000
Within groups     469.8    27      17.4
Total             861.2    29
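The table above can be reproduced from the raw scores with a short sketch; scipy's f_oneway gives the same F and its P value directly:

```python
import numpy as np
from scipy.stats import f_oneway

class_a = np.array([76, 78, 71, 68, 74, 67, 73, 80, 72, 70], dtype=float)
class_b = np.array([83, 70, 76, 76, 69, 74, 72, 80, 79, 75], dtype=float)
class_c = np.array([82, 88, 83, 85, 79, 77, 84, 82, 80, 75], dtype=float)
groups = [class_a, class_b, class_c]

all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()
k, N = len(groups), len(all_scores)

# Decompose the total sum of squares into between- and within-group parts
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (N - k)
F = ms_between / ms_within
print(ss_between, ss_within, F)              # about 391.4, 469.8, and 11.2

print(f_oneway(class_a, class_b, class_c))   # same F statistic with its exact P value
```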