Lecture Notes

EDM6403: Quantitative Data Management and Analysis in Educational Research
Professor Chang Lei, Department of Educational Psychology, CUHK
Measurements of Individual Differences
People differ in terms of the amount of an attribute they possess. Theoretically, no two
people have exactly the same amount of an attribute, e.g., same height or same smartness.
Such an attribute, on which people differ from one another, is considered a continuum of
quantities along which a person may possess any amount. Individual differences along a trait
continuum usually follow a normal distribution; that is, most people possess a moderate
amount with few having an extremely large amount and few having a very small amount of
the attribute. The ultimate goal of any research is to explain these individual
differences: why do some people have more or less of an attribute? The initial question, however, is
how to measure and quantify individual differences, that is, how to
differentiate people numerically with respect to an attribute continuum. There are four ways
of quantifying individual differences, four ways of assigning numbers to people to indicate
their differences with respect to an attribute. In statistics, these are called four levels of
measurement. The level of measurement for a variable directly affects the choice of statistical
technique because different levels of measurement justify different mathematical
operations. These four levels are called ratio, interval, ordinal, and nominal scales. All of them
involve truncating the trait continuum into manageable and meaningful units and categories.
The ratio and interval methods truncate the continuum into equal-distance units and measure
individual differences by counting how many units of an attribute a person possesses.
Someone having 5 units of an attribute is one unit higher than someone having 4 units and so
is a person possessing 8 units in relation to another having 7 units. These are interval or ratio
measures. Centimetres as a measure of height and pounds as a measure of weight are ratio
measurements, while WISC points as a measure of intelligence are an interval measurement. Centimetres, for example, have equal
distance between each other. The distance between 4 and 5 centimetres is the same as that
between 7 and 8 centimetres. So using this equal interval measure, one can distinguish people
on the height continuum, so that someone 180 centimetres tall is taller than someone 179 centimetres tall. Of
course, you lose some information in that not all 179-centimetre people are of exactly the same height,
even though, on the ruler measurement, they are indicated as having the same height. The
difference between ratio and interval is that ratio has a meaningful zero such as the ruler
measurement of height. For some constructs, zero is meaningless. IQ, for example, has a
meaningless zero: what would having no IQ even mean? In this case, zero is simply a marker with no meaning. Or
it is meaningful only within the measurement instrument itself as a reference to other values.
For example, it may mean that you did not attempt any questions on an IQ test, and therefore
you have a zero IQ score. But it does not mean you have no intelligence because you have a
brain and you are still breathing. With a meaningful zero, the ratio scale permits ratio
comparison. For example, someone weighing 120 pounds is twice as heavy as someone of 60
pounds. Such a comparison cannot be made using an interval scale, where zero does not mean the
absence of the attribute. For example, someone with an IQ score of 120 is not twice as intelligent as
someone scoring 60. Interval scales are used more often than ratio scales in social science.
A less precise, or coarser, way of quantifying differences is to rank-order people
without too much concern for equal-interval assumptions. This is the ordinal data method
where rank orders become the measurement unit. Heights of 180, 179, 150, 150, and 149 will be truncated
into 1st, 2nd, 3rd, 3rd, and 4th. The distance between ranks 2 and 3 is much bigger than
that between ranks 1 and 2 or between ranks 3 and 4, but that information is lost.
An even rougher way to quantify information is nominal or categorical where you
differentiate people by assigning labels to them, e.g., a giant vs. a dwarf. The truncation is so
coarse that the differences represented by categorical data are no longer defined quantitatively
but qualitatively. We no longer think of how much bigger giants are than dwarfs
but think of them as two categories of people. A better example is political affiliation. The
Democrats and DAB are two categories of the political affiliation variable, which is called a
categorical variable. Here, the two parties differ by name. One has one label and the other has
a different label. We no longer make any numerical differentiations between them. However,
their political views can still be differentiated numerically by examining their emphasis
on different political principles; the extent to which they are anti- versus pro-China is only
one of them. However, when using nominal measurement, we think and
measure in terms of categories rather than in terms of numerically different standings on a continuum of,
for example, pro versus anti China. An extension of the concept is to use categories to
distinguish individual differences along multiple dimensions; i.e., not just tall vs. short or liberal vs.
conservative, but also different ears, noses, education, etc. A good example is our names.
Name is a categorical variable. Names summarize so many quantitative differences on so
many dimensions that we are no longer able to perceive those quantitative differences.
Instead, we perceive each person as categorically or qualitatively different from everyone
else. Another example is friendship. A friend who does enough bad things to you becomes an
enemy. Underlying these labels, friendliness-animosity is an attitude continuum on which you (probably
unconsciously) assign different amounts to your different friends and enemies, resulting in
relationship categories such as the best friend, a friend, and the worst enemy.
In summary, nominal variables only classify cases into categories. Categories are not
numerical and can only be compared in terms of the number of cases classified in them. Only
“=” or “≠” can be applied to a nominal variable.
In addition to classification, ordinal variables can be ordered from ‘high’ to ‘low’ or
‘more’ to ‘less’, but the distance between scores cannot be described in precise terms. More
mathematical operations (=, ≠, >, <) can be applied to ordinal data.
Interval variables have the property of categorization, ranking, and the distance
between two scores can be expressed in terms of equal units. Mathematical operations (=, ≠, >, <, +, −) are permitted. But there is no true zero.
The highest level of measurement is ratio, which possesses all the properties of the
above levels of measurement. All arithmetic operations (+, −, ×, ÷) are permitted and there is
a true zero.
Interval and ratio variables are also called continuous variables, while nominal and
ordinal variables are also called discrete (or category) variables.
Population and sample
A population is the total collection of all units in which the researcher is interested and is
thus the entity that the researcher wishes to understand.
A sample is a carefully chosen subset of a population.
A case is a unit in a sample. Usually a case is a subject in an experiment or a respondent in
a survey. The number of cases in the sample is sample size (denoted by N or n).
Frequency analysis
For a nominal variable, the number of cases in each category is the frequency of
occurrence in each category.
Proportion for each category is the frequency divided by the sample size. It is called
relative frequency. The proportion multiplied by 100 is percentage.
The frequencies/percentages of all categories of a nominal variable give frequency
distribution.
For a continuous variable, a grouped frequency distribution can be constructed by dividing
the scores into appropriate or convenient intervals and treating each interval as a category.
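As a minimal sketch of the arithmetic just described (frequency, relative frequency, and percentage), the following Python snippet counts a small, made-up set of nominal responses; the category labels and data are hypothetical.

```python
from collections import Counter

# Hypothetical nominal data: party preference of 10 respondents
responses = ["DAB", "Democrat", "DAB", "DAB", "Democrat",
             "Other", "DAB", "Democrat", "DAB", "Other"]

n = len(responses)                 # sample size
frequencies = Counter(responses)   # frequency of each category

for category, f in frequencies.items():
    proportion = f / n             # relative frequency
    print(f"{category}: f = {f}, proportion = {proportion:.2f}, "
          f"percentage = {100 * proportion:.0f}%")
```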
Distribution
The pattern of variation of a variable is called its distribution, which can be described
both mathematically and graphically. The distribution records all possible numerical values of
a variable and how often each value occurs (i.e., frequency).
Four attributes are often used to define a distribution: central tendency (mean, median,
and mode); variability (variance and standard deviation); symmetry or skew; peak or kurtosis.
Measures of central tendency
Mode is the value of a variable that occurs most frequently in the distribution. The mode
is often used for category variables.
Median is the middle observation when the observations are arranged in order of
magnitude. The median identifies the position of an observation. The median is used primarily
for ordinal data, but it is also appropriate for interval/ratio data.
The median is also called the 50th percentile. A percentile is the value on a scale below
which a given percentage of cases fall.
Each half of the observations can be further divided in half, giving quartiles. The 25th percentile
is called the lower quartile; the 75th percentile is called the upper quartile.
Mean is the arithmetic average of the observations. Mean is appropriate for continuous
variables.
Measures of dispersion
The range is the maximum minus the minimum of a distribution.
The variance is the average of the squared deviations. The deviation of an observation is
the distance of the observation from the mean. The square root of the variance is called the
standard deviation (SD).
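To make the definitions above concrete, here is a small Python sketch computing the mode, median, mean, range, variance, and standard deviation for a made-up set of scores (the data are hypothetical; Python's statistics module uses the N − 1 denominator for the sample variance).

```python
import statistics

scores = [4, 7, 7, 8, 9, 10, 12, 15]   # hypothetical interval-level scores

print("mode     =", statistics.mode(scores))       # most frequent value
print("median   =", statistics.median(scores))     # middle observation (50th percentile)
print("mean     =", statistics.mean(scores))       # arithmetic average
print("range    =", max(scores) - min(scores))     # maximum minus minimum
print("variance =", statistics.variance(scores))   # squared deviations averaged over N - 1
print("SD       =", statistics.stdev(scores))      # square root of the variance
```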
The shape of a distribution
Skew is a measure of deviation from symmetry. If the mean and the median are the same, the
distribution is symmetric and the skew is zero; the skew of a normal curve is zero. When the mean is
larger than the median, the skew is positive and the distribution is said to be positively skewed,
with a thick right-hand tail. When the mean is smaller than the median, the skew is negative and the
distribution is said to be negatively skewed, with a thick left-hand tail.
Kurtosis is a measure of the relative peakedness or flatness of the curve defined by the
distribution of a variable. When the kurtosis is zero, the curve is a normal curve. If the peak of
the curve is above that of a normal curve, it is called leptokurtic (positive kurtosis). If it is
below that of a normal curve, it is called platykurtic (negative kurtosis).
Normal distribution
The best-studied and most widely used distribution is the normal distribution. It is the most
widely used because this mathematical function can describe many natural
observations (behaviors, attitudes, and abilities, as far as education is concerned). When
something is measured many times (e.g., a person’s height and weight), there are small but
detectable fluctuations in the different measurements. Guess what is the distribution of these
measurements. Yes, normal distribution. Because the fluctuation indicates measurement error,
the distribution is also called a "normal curve of error." From this perspective, we can say that
distributions or patterns of variations of, say, height, weight, IQ, abilities and personalities of
various sorts are simply records of "nature's mistakes" against the corresponding population
means.
If X is distributed normally, this is denoted as X ~ N(μ, σ²), where μ and σ are the mean and
standard deviation respectively. There is not one but a family of normal distributions, with
varying μ and σ. All normal curves are continuous, bell-shaped, and symmetrical (skew = 0);
the two points of inflection are at μ ± σ (kurtosis = 0); and the curve approaches the X-axis
asymptotically (it never touches the X-axis).
N(0, 1) is called the standard normal distribution, with mean 0 and standard deviation 1.
(Figure: normal curves with different means and standard deviations.)
Linear transformation
The shape of a distribution does not change when a constant is added to or subtracted
from scores or when scores are multiplied or divided by a constant. Any of these changes is
called a linear transformation. Because of this property, the raw scores or original
measurement units are often transformed to z-scores, which express the score location in
terms of the standard deviation units of the distribution. A z-score is one kind of standard
score, which has μ = 0 and σ = 1.0. Any observation that is expressed in terms of the
relationship between the mean and standard deviation is a standard score. For example, GRE
and SAT scores, which have μ = 500 and σ = 100, are standard scores. The Wechsler Intelligence Test
(μ = 100, σ = 15) and the Stanford-Binet Intelligence Test (μ = 100, σ = 16) also yield
standard scores. The z-score is only one kind of standard score.
Notice that any variable, normal or not, can be transformed into a z-score by the formula
Z = (X − μ) / σ. The z-transformation does not "normalize" the variable. When a normal variable
is converted into a z-score, it becomes a standard normal score or a standard normal variable;
that is, Z ~ N(0, 1).
Transforming a normal distribution into a standard normal distribution involves converting
the raw scores into standard scores using the same formula, Z = (X − μ) / σ. The formula first
calculates how far a score is from the mean and then expresses that distance in terms of the
number of standard deviations. A standard score thus tells us how many standard deviations a raw
score falls above or below the mean of the distribution. The standard score is also called a
z-score; its sign tells whether the score falls above or below the mean.
Distribution of probabilities
In statistics, the probability curve of a variable is nonnegative, and the total area under the curve is 1. The
probability (or the relative frequency) of observations with values between [a, b] is the area
under the curve between a and b. The probability of observations with values below b is the
area under the curve below b.
Using integration we can determine the area under the curve between any two values (a
to b), and thus determine the probability (or the relative frequency) that a randomly drawn
score will fall within that interval. These integrations have been done and summarized in a
table of normal distribution where the probability that any score will fall within a range of
two points is computed. As long as a variable is normally distributed or approximates
normality, we can find the probability that X falls within a certain range. Again, we do not
need to really use integration. Statisticians have already written down those probability
values in a table based on a standard normal distribution or N (0, 1). Software packages, such
as Excel and SPSS, can give the probability values easily.
Areas within some special intervals of standard normal distribution
Some useful probabilities and corresponding intervals:
P(1.65  Z  1.65)  0.90
P(1.96  Z  1.96)  0.95
P(2.58  Z  2.58)  0.99
P( Z  1.65)  P( Z  1.65)  0.05
P( Z  1.96)  P( Z  1.96)  0.025
P( Z  2.58)  P( Z  2.58)  0.005
Again, changing raw scores into z-scores does not alter the shape of the distribution. The
difference is that distances between scores, originally expressed in raw-score units, are now
expressed in numbers of standard deviations. However, if the scores are far from normality,
the probabilities obtained from a standard normal distribution table will not accurately
describe the probabilities associated with scores from a non-normally distributed variable.
If the variable is normally distributed, by converting its raw scores into z-scores we can
determine relative frequencies or probabilities, and percentile ranks or cumulative
probabilities, by looking them up in the table rather than calculating them ourselves.
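As mentioned above, software can replace the printed normal table. Here is a minimal Python sketch using scipy; the IQ numbers are hypothetical, assuming a Wechsler-type scale with μ = 100 and σ = 15.

```python
from scipy.stats import norm

# Area under the standard normal curve between -1.96 and 1.96 (about .95)
print(norm.cdf(1.96) - norm.cdf(-1.96))

# Convert a raw score to a z-score and find its percentile rank.
mu, sigma, x = 100, 15, 130        # hypothetical IQ score on a mu = 100, sigma = 15 scale
z = (x - mu) / sigma               # z = 2.0
print(z, norm.cdf(z))              # cumulative probability ~ .977, about the 98th percentile
```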
Nonlinear transformation
Finally, it is useful to mention that some score transformations change the
pattern of variation of a distribution. A monotonic transformation maintains the order of the
measurements but not the relative distance between scores. Such a transformation can be used
to make a skewed distribution normal when it can be assumed that the variable underlying the
observed distribution is normally distributed. Usually, square root and log transformations
reduce positive skew and a power transformation reduces negative skew.
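The following sketch illustrates the point about monotonic transformations with simulated data; the lognormal variable is an arbitrary choice made only because its log is exactly normal.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=5000)   # a positively skewed variable

print("skew of x       :", round(skew(x), 2))            # clearly positive
print("skew of sqrt(x) :", round(skew(np.sqrt(x)), 2))   # reduced
print("skew of log(x)  :", round(skew(np.log(x)), 2))    # near zero by construction
```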
Parameter estimation
Parameter estimation is one type of inferential statistics. Another type of inferential
statistics is hypothesis testing.
Usually, we use a sample mean to estimate the population mean, and a sample SD to
estimate the population SD.
Consider the population mean. Different random samples are likely to give us different
estimates of the population mean. But by using a confidence interval we can indicate that,
despite the uncertainty of random samples, only for 5% of samples will the population mean
lie outside the confidence interval, while for 95% of samples the population mean will lie inside the
confidence interval. The probability of including the population mean within the confidence
interval is called the confidence level (95% or 99% is often chosen as the confidence level).
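A minimal sketch of a 95% confidence interval for a mean, assuming a large enough sample that the normal critical value 1.96 applies; the data below are simulated, not real.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
sample = rng.normal(loc=50, scale=10, size=100)   # hypothetical sample, N = 100

mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))    # estimated standard error of the mean
z = norm.ppf(0.975)                               # 1.96 for a 95% confidence level

print(f"95% CI: [{mean - z * se:.2f}, {mean + z * se:.2f}]")
```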
Hypothesis testing
Statistical hypotheses always refer to the population. They can be classified as
parametric hypotheses and non-parametric hypotheses.
A parameter hypothesis is always about a population parameter which is unknown or
something about which we are not sure and, in fact, will never be sure. Corresponding to the
population parameter is our sample statistic which is something we computed from a real
sample and is something about which we have 100 percent certainty. In hypothesis testing,
we use the sample statistic (something we are sure of) to make an inference about the population
parameter (something we are not sure of). Hypothesis testing is one type of inferential statistics.
Operationally, this hypothesis testing process consists of the following steps:
1. State the research hypothesis reflecting our verbal reasoning. e.g.,
There is a relationship between motivation to learn and math achievement.
Girls have higher math achievement than boys.
2. Convert our research hypothesis into a statistical form. e.g.,
a) > 0, or there is a positive correlation between motivation to learn and math
achievement. The statistic of correlation is used to summarize our sample data.
b) μgirl − μboy > 0: the mean math achievement of girls is higher than that of boys.
Mean is used to summarize our sample data.
3. State the null hypothesis which provides a way to test the statistical hypothesis. e.g.,
H0: ρ = 0. There is no correlation between motivation and math achievement.
H0: μgirl − μboy = 0. The mean math achievement of girls is the same as that of boys.
4. Hypothesis testing is conducted with the assumption that the null hypothesis is true.
Question to be answered: What is the probability of finding a positive correlation as
high as the present sample or even higher when the truth is there is no correlation?
Question to be answered: What is the probability of finding a difference between the
two means as large as the present sample or even larger when there is no difference?
5. Make a decision regarding our hypothesis. Do we reject the null hypothesis or not?
This decision means the data from the one sample we have supports or does not
support our research hypothesis. This decision will be associated with one of two
potential errors.
Statistical significance and testing errors
The level of significance is the Type I error rate (α).
Type I error rate (α) is the probability of rejecting the null hypothesis when the null
hypothesis is true. We make such an error only when the null is rejected.
Type II error rate (β) is the probability of not rejecting the null hypothesis when the
null hypothesis is false. We make such an error only when we fail to reject the null
hypothesis.
Power (1 − β) is the probability of correctly rejecting the null hypothesis when the null
hypothesis is false.
The sampling distribution of the mean
The Central Limit Theorem defines the characteristics of the sampling distribution of a
statistic. Let's use the sample mean as the statistic; it has the following three characteristics:
1. It is normally distributed. For samples of 30 subjects or more, the sampling distribution
of means will be approximately normal, independent of the shape of the population
distribution. For example, even if the population is not normally distributed, a sampling
distribution of means computed from drawing random samples of 30 or more cases each time
from this population will be normally distributed. Therefore, if there is no available
information about the shape of the population distribution, the central limit theorem states
that, as the sample size increases, the shape of the sampling distribution of means becomes
increasingly like the normal distribution. Of course, if the distribution of scores in the
population is normal, the sampling distribution of means, regardless of N, shall be normal.
Because, with a sample of size 30, the normal distribution provides a reasonably good
approximation of the sampling distribution of means, N = 30 is often regarded as the lower
end of sample size for conducting research.
2. It has a mean equal to the population mean. This simply means that the sample mean
is an unbiased estimator of the population mean. In general, a sample statistic is an unbiased
estimator of the corresponding population parameter, if the mean of the statistic is equal to
the population parameter.
3. Its standard deviation, which is called the standard error of the mean, equals the
population standard deviation divided by the square root of the sample size. The standard
error of the mean (σ_X̄) is a function of the population standard deviation (σ) and the sample
size (N): σ_X̄ = σ / √N. The standard error of the mean provides an index of how much the
sample means vary about the population mean, which is the mean of the sampling
distribution of means.
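A small simulation can be used to check these three characteristics; the exponential population below is an arbitrary, strongly skewed choice with mean 2 and SD 2.

```python
import numpy as np

rng = np.random.default_rng(2)

N, reps = 30, 10_000
samples = rng.exponential(scale=2.0, size=(reps, N))  # skewed population, mean 2, SD 2
sample_means = samples.mean(axis=1)                   # one mean per sample of size 30

print("mean of the sample means:", sample_means.mean())       # close to 2, the population mean
print("SD of the sample means  :", sample_means.std(ddof=1))  # close to 2 / sqrt(30) ~ 0.365
```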
Hypothesis testing about the mean
Continuing with the above, the significance level represents the probability at which we
would reject the null hypothesis. In other words, it is the probability at which we would
declare that the sample mean (or any sample statistic) is unlikely (what is unlikely is defined
by the significance level) to have been drawn from the population defined by the null
hypothesis. We often use α = .05 as the significance level. That means we state, ahead of the
game, that 5% is small enough to be considered unlikely. Then all we need to do is to draw a
(random) sample, compute its mean, and then determine the probability of obtaining this
sample mean assuming the population is that defined by the null hypothesis. If the probability
is equal to or smaller than the significance level, we declare that the probability is small
enough to be considered unlikely; i.e., the sample is unlikely to be drawn from the population
defined by the null hypothesis. Thus, we reject the null hypothesis.
The question then becomes how to find out the probability associated with the sample
mean we have in hand. Since the sampling distribution of means for N = 30 can be modeled
by a normal distribution, we can find the probability values associated with any number
under the normal curve. Simply convert the value of the sample mean into a z-score and go to
the table to find the corresponding probability value associated with that z-score.
Compare the probability value we find from the table to that we set for the level of
significance and decide whether to reject the null hypothesis.
Many textbooks use the opposite logic: they identify the point of the
sampling distribution (which is normal) that corresponds to the level of significance and
refer to this point as the critical value. This is the point beyond which the probability of
observing a sample mean is less than .05. In order to find this part of the sampling
distribution, identify the critical value of z_X̄ beyond which 5 percent of the sample means
fall (i.e., α = .05); this critical value can be denoted z_X̄critical. From the table of
normal probabilities, z_X̄critical equals 1.65 for a one-tailed test at α = .05. A sample mean
falling in the rejection region leads to a decision to reject the null hypothesis.
We also need to decide which tail of the normal distribution to use as the critical region
for rejecting the null hypothesis.
Suppose the null hypothesis is H0: μ = 50. The alternative hypothesis and the
corresponding tail may be:
H1: μ > 50
H1: μ < 50
H1: μ ≠ 50
The first two hypotheses are directional; they indicate the direction of the difference
between the mean of the population under the null hypothesis and the mean of the population
under the alternative hypothesis. If the alternative hypothesis predicts the true population
mean to be above the mean in the null hypothesis, the critical region for rejecting the null
hypothesis lies in the upper tail of the sampling distribution. If the alternative hypothesis
predicts the true population mean to be below the mean in the null hypothesis, the critical
region for rejecting the null hypothesis lies in the lower tail of the sampling distribution.
Directional hypotheses are sometimes called one-tailed hypotheses because the critical
region for rejecting the null hypothesis falls in only one tail of the probability distribution.
The third alternative hypothesis, H1: μ ≠ 50, is non-directional. It predicts that the true
mean does not equal the mean in the null hypothesis, but it does not say whether it is below
or above. Thus, we must consider the critical region to lie in either tail of the distribution.
That means, we have to consider both tails if the alternative hypothesis is non-directional.
In order to test the null hypothesis against a non-directional alternative hypothesis at
α = .05, mark off critical regions in both tails of the distribution, each of size α/2 = .05/2 = .025. We
use α/2 because we want to mark off no more than a total of 5 percent of the normal
sampling distribution as a critical region for testing the null hypothesis: .025 + .025 = .05.
Note that if we marked off both tails at .05, we would have actually marked off 10 percent of
the distribution: .05 + .05 = .10. By using α/2, we set the level of significance at α = .05.
Non-directional alternative hypotheses are also called two-tailed hypotheses because the
critical region for rejecting the null hypothesis lies in both tails of the probability distribution.
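A minimal sketch of the one-sample z test described above, with made-up numbers (H0: μ = 50, known σ = 10, N = 36, sample mean 53), showing both the one-tailed and the two-tailed probability.

```python
from math import sqrt
from scipy.stats import norm

mu0, sigma, n, xbar = 50, 10, 36, 53   # hypothetical values

se = sigma / sqrt(n)                   # standard error of the mean
z = (xbar - mu0) / se                  # observed z = 1.8

p_one_tailed = 1 - norm.cdf(z)               # H1: mu > 50 -> p ~ .036, reject at alpha = .05
p_two_tailed = 2 * (1 - norm.cdf(abs(z)))    # H1: mu != 50 -> p ~ .072, do not reject
print(z, p_one_tailed, p_two_tailed)
```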
Recapping the Logic and Process of Hypothesis Testing
Inferential statistics refer to the process of using sample statistics to estimate population
parameters; the sample statistic that is used for such estimation is called an estimator.
Inferential statistics can also be seen as the process of determining the probability of obtaining
a sample statistic as large as or larger than (in absolute value) the one you actually
computed from a random sample, assuming a given population parameter.
Sampling distribution of a statistic is a theoretical or imagined distribution of an infinite
number of the statistic (which can be a mean, a difference between two means, a regression
coefficient, etc.) which is computed from an infinite number of samples of an equal size
randomly drawn from the same population. The mean of this sampling distribution of
statistics equals the population parameter (which corresponds to the sample statistic, of
course) and the variance of this sampling distribution (variance error) equals the population
variance (or a sample estimate of the population variance if the population variance is not
known) divided by the size of the sample (from which your sample statistic is computed).
Taking the square root of the variance gives you the standard deviation, which, in the
sampling distribution of a statistic is called standard error of the statistic. This sampling
distribution is used as a probability distribution to determine, under the null hypothesis, what
is the probability of obtaining a sample statistic as large as or larger than (in terms of absolute
values) the one you computed from your sample. This process of using a sampling
distribution to determine the probability of obtaining a sample statistic is the essence of
inferential statistics.
The sampling distribution of a statistic can take on different shapes or mathematical
properties, depending on the statistic and other things. In this course, we are going to deal
with the normal distribution, t distribution, F distribution, and chi-square distribution. These
in turn correspond to different statistics such as the mean, difference between two means,
frequency, correlation and regression coefficient. The statistic is different and the distribution
is different for each different kind of hypothesis testing. The process and the logic are
identical and are just like what I described so far. I summarize this process and logic again in
the following equation:
Inferential Statistical Testing (1) = [Sample Statistic (2) − Population Parameter (3)] / Standard Error of the Statistic (4)

1. This is a process, although sometimes it is simply called inferential statistics.
2. This is real and is your actual research.
3. This is unknown, and finding the answer (never with certainty) is the purpose of your research.
4. One important criterion of the central limit theorem is that this standard error is a direct function of the sample size.
Hypothesis Testing about Other Statistics
z-test of differences between two means (population variance is known)
From a population, two random samples are drawn (the sample sizes can be different), a mean
score for each group is calculated, and the difference between the means is found by
subtracting one sample mean from the other: X̄1 − X̄2. Imagine that this is done over and
over again by drawing pairs of random samples from the population and finding the
difference (X̄1 − X̄2) between the sample means of the two groups. These differences
between pairs of sample means would be expected to vary from one pair to the next. Imagine
that we plot these differences in the form of a frequency chart, with the values of X̄1 − X̄2 on
the abscissa and the frequency with which each value of X̄1 − X̄2 is observed on the ordinate.
This plot is the sampling distribution of differences between two means. It shows the
variability that is expected to arise from randomly sampling differences between pairs of
means from the population under the null hypothesis. This sampling distribution of
differences between two means is a theoretical probability distribution with differences
between sample means (X̄1 − X̄2) on the abscissa and probabilities on the ordinate. By
treating the sampling distribution of differences between means as a probability distribution,
we can answer such questions as “What is the probability of observing a difference between
sample means (X̄1 − X̄2) as large as or larger than the observed sample difference, assuming
that the null hypothesis is true?” This sampling distribution of differences between means is,
in concept, exactly the same as the sampling distribution of means discussed earlier and has
the same or similar characteristics:
1. It is normally distributed.
2. It has a mean equal to the population parameter, which is zero (i.e., since the
samples are taken from the same population, μ1 = μ2 and thus μ1 − μ2 = 0).
3. It has a standard deviation called the standard error of the difference between
means.
\sigma_{\bar{X}_1 - \bar{X}_2} = \sqrt{\sigma^2_{\bar{X}_1} + \sigma^2_{\bar{X}_2}} = \sqrt{\frac{\sigma^2}{N_1} + \frac{\sigma^2}{N_2}}
The rest are the same as the earlier explanation about hypothesis testing of one
sample mean, except for one important assumption, "homogeneity of variance." It is assumed
that the two populations have the same variance.
Examples
The null hypothesis is that there is no difference between mean GRE verbal scores of the
Chinese and English speaking students (By the way, in trying to find out whether there is
large scale cheating involved in the GRE testing in China, ETS indeed investigated
hypotheses similar to this one because they were suspicious of cheating in China and other
Asian countries):
H0: μE − μC = 0
The alternative hypothesis is that the mean GRE verbal scores of the English speaking
students is higher than that of the Chinese students:
H1: μE − μC > 0
In testing the null hypothesis, we set α=.01 as the level of significance.
Since the research hypothesis is directional, one tail of the sampling distribution of the
difference between means, i.e., the upper tail, is of interest. The critical value of z for the
difference between means that marks off this small upper-tail portion is 2.33 (see the normal
table). Note that we now speak of the z of the difference between two means, z_{X̄E − X̄C}, and
not z_X̄, since we are dealing with the sampling distribution of the difference between two
means. If the observed z_{X̄E − X̄C} is greater than this critical value, the null
hypothesis should be rejected.
Assume that the observed sample means are 610 for the English speaking students and
570 for the Chinese students; the standard error of the differences between means is 15. In
order to decide whether to reject the null hypothesis, we convert these sample data to a
z-score, using
z_{observed} = \frac{(\bar{X}_E - \bar{X}_C) - (\mu_E - \mu_C)}{\sigma_{\bar{X}_E - \bar{X}_C}}

Since μE − μC = 0 under the null hypothesis, this equation can be rewritten as

z_{observed} = \frac{\bar{X}_E - \bar{X}_C}{\sigma_{\bar{X}_E - \bar{X}_C}}

Thus we obtain:

z_{observed} = \frac{610 - 570}{15} = 2.67

Since the observed z of 2.67 exceeds the critical value of 2.33, the decision is made to reject the null
hypothesis and to conclude that GRE verbal scores of English speaking students are on
average higher than those of Chinese students.
Alternatively, we can find the probability value associated with the observed z of 2.67,
which is 0.004. Since it is less than 0.01, we reject the null hypothesis.
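The same one-tailed probability can be obtained with software instead of the normal table; here is a minimal sketch reproducing the numbers above.

```python
from scipy.stats import norm

se_diff = 15                            # standard error of the difference, given above
z_obs = (610 - 570) / se_diff           # 2.67
p_one_tailed = 1 - norm.cdf(z_obs)      # ~ 0.004, smaller than alpha = .01

print(round(z_obs, 2), round(p_one_tailed, 3))
```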
In another example of a study of the disparity between men’s and women’s college
grades, one hypothesis was that women earned higher grades than men. The mean grade point
average (GPA) for the 20 men in the sample was 2.905 and for the 25 women was 2.955. The
variance in the population, which is assumed to be equal for the two groups, is known to be
0.10.
H0: μF − μM = 0, H1: μF − μM > 0

\sigma_{\bar{X}_F - \bar{X}_M} = \sqrt{\sigma^2_{\bar{X}_F} + \sigma^2_{\bar{X}_M}} = \sqrt{\frac{.10}{25} + \frac{.10}{20}} \approx 0.09

z_{observed} = \frac{\bar{X}_F - \bar{X}_M}{\sigma_{\bar{X}_F - \bar{X}_M}} = \frac{2.955 - 2.905}{0.09} = 0.55
The critical z associated with α = .05 for this one-tailed test is 1.65. The observed z of 0.55 is far
smaller than the critical z, and we thus do not reject the null hypothesis.
t test of independent sample means (population variance unknown)
The above provides ways to test hypotheses regarding one sample mean and regarding the
difference between two sample means. The problem with the above is that, in order to use the
sampling distribution of means and the sampling distribution of the differences between
means, we must know the population standard deviation. In reality, we often do not know the
population standard deviation and have to use the sample standard deviation to estimate it.
All the logic discussed earlier still holds. The only difference is that, if we first have to
estimate the population standard deviation, the sampling distribution of means and the
sampling distribution of the differences between means are no longer normal distributions.
Instead, they are best described by the t distribution. Thus, when testing the same hypotheses,
i.e., those regarding a sample mean or sample mean differences, we have to use the t
distribution to determine the probabilities (instead of the normal distribution), and the tests
are thus called t-tests.
Here we focus on the t-test for two independent samples. The first step is to estimate the
population standard deviation so that we can compute the standard error of difference.
Repeating the earlier formula for the standard error of difference:
\sigma_{\bar{X}_1 - \bar{X}_2} = \sqrt{\sigma^2_{\bar{X}_1} + \sigma^2_{\bar{X}_2}} = \sqrt{\frac{\sigma^2}{N_1} + \frac{\sigma^2}{N_2}}

Now, without knowing the population standard deviation, the estimate of the standard
error of the difference can be obtained by using the standard deviation from each sample as an
estimate of σ. For samples of equal size (N1 = N2), the estimate of the standard error of the
difference between means is

s_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{s_1^2}{N_1} + \frac{s_2^2}{N_2}}
The next step is to provide a measure of the distance between the difference of the
sample means (X̄1 − X̄2) and the difference of their corresponding population means (μ1 − μ2). This measure is the t:

t_{observed} = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{s_{\bar{X}_1 - \bar{X}_2}}

where s_{\bar{X}_1 - \bar{X}_2} = \sqrt{s^2_{\bar{X}_1} + s^2_{\bar{X}_2}}.

In this case, t is a measure of the distance between the difference of the sample means
(X̄1 − X̄2) and the difference of their corresponding population means (μ1 − μ2). Notice that,
since under the null hypothesis we sample from populations with identical means in setting
up the sampling distribution under H0, μ1 = μ2, and so μ1 − μ2 = 0. Thus, t can be
simplified as

t_{observed} = \frac{\bar{X}_1 - \bar{X}_2}{s_{\bar{X}_1 - \bar{X}_2}}
The final step is to specify the sampling distribution of t. Clearly the observed value of t
will vary from one sample to the next due to random fluctuation in both X̄ and s. If the value
of t [which is (X̄1 − X̄2)/s_{X̄1 − X̄2}] is calculated for each of a very large number of samples
from a normal population, a frequency distribution for t can be graphed, with the values of t
on the abscissa and the frequencies on the ordinate. This frequency distribution can be
modeled by the theoretical (mathematical) distribution of t. The theoretical model is called
the sampling distribution of t for differences between independent means. It can be used as
probability distribution for deciding whether or not to reject the null hypothesis
when σ_{X̄1 − X̄2} is unknown.
Actually, the sampling distribution of t for differences between means is a family of
distributions. Each particular member of the family depends on degrees of freedom.
Specifically, the t distribution for differences depends on the number of degrees of freedom in
the first sample (N1 – 1) and in the second sample (N2 – 1). So the t distribution used to model
the sampling procedure described before is based on (N1 – 1) + (N2 – 1) degrees of freedom,
or on N − 2 degrees of freedom, where N = N1 + N2. For example, with N1 = N2 = 10, a t
distribution with 18 df would be used to model the sampling distribution [(10-1) + (10-1) = 9
+ 9 = 18].
The t distribution of differences has the following characteristics: First, its mean is equal
to the population parameter, which is zero under the null hypothesis (μ1 − μ2 = 0). Second,
the distribution is symmetric in shape and looks like a bell-shaped curve. However, the
mathematical rule specifying the t distribution is not the same as the rule for the normal
distribution (but very close). As the sample size increases, the t distribution becomes
increasingly normal in shape. With an infinite sample size, the t distribution and the standard
normal distribution are identical.
In using the sampling distribution of t to test for differences between independent means,
we make the following assumptions: The scores in the two groups are randomly sampled
from their respective populations and are independent of one another. The scores in the
respective populations are normally distributed. The variances of scores in the two
populations are equal (i.e., σ1² = σ2²). This assumption is often called the assumption of
homogeneity of variance. Again, this is an important assumption based on which we can
obtain the "pooled estimate" of the population variance (to be discussed next). (However,
when this assumption is not met, we can still use other methods to estimate the standard error
of differences. This won't be discussed here.)
However, in order to use the t test where sample sizes may be either equal or unequal,
we must define s_{X̄1 − X̄2} more generally. The goal in defining s_{X̄1 − X̄2} is to find the best
estimate of the population variance by using the sample variances. This can be done as follows:
under the assumption of equal population variances (σ1² = σ2² = σ²), s1² and s2² each
estimate σ². The best estimate of the value of σ², then, is the average of s1² and s2². This
average is called a pooled estimate. So the formula for the t test can be written to reflect this
pooling.
t_{observed} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s^2_{pooled}\left(\frac{1}{N_1} + \frac{1}{N_2}\right)}} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s^2_{pooled}}{N_1} + \frac{s^2_{pooled}}{N_2}}}

The pooled variance estimate, s²pooled, is obtained by computing a weighted average of s1² and s2²:

s^2_{pooled} = \frac{(N_1 - 1)s_1^2 + (N_2 - 1)s_2^2}{(N_1 - 1) + (N_2 - 1)} = \frac{(N_1 - 1)s_1^2 + (N_2 - 1)s_2^2}{N_1 + N_2 - 2}
In order to decide whether or not to reject the null hypothesis, compare the observed t
value with a critical value. The symbol tcritical designates the points in the t distribution
beyond which differences between sample means are unlikely to arise under the null
hypothesis. Since there is a different critical value of t for each degree of freedom, the first
step is to determine the number of degrees of freedom. The next step is to find the critical
value of t for the related degrees of freedom at a specified level of significance.
For example, an experiment was conducted on preschool children's word
acquisition. In the experimental group (N1 = 15), children were given some kind of phonological
awareness training which was expected to enhance word acquisition. The control group
children (N2 = 20) were simply told stories. At the end of the experiment, the two groups
were tested on a word acquisition test and the following results were obtained.
Group          Mean     Variance
Experimental   18.00    5.286
Control        15.25    3.671
H0: μE = μC
H1: μE > μC

t_{observed} = \frac{\bar{X}_1 - \bar{X}_2}{s_{\bar{X}_1 - \bar{X}_2}} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s^2_{pooled}}{N_1} + \frac{s^2_{pooled}}{N_2}}}

s^2_{pooled} = \frac{(N_1 - 1)s_1^2 + (N_2 - 1)s_2^2}{N_1 + N_2 - 2} = \frac{14(5.286) + 19(3.671)}{15 + 20 - 2} = \frac{74.004 + 69.749}{33} = 4.356
Note that the pooled variance is somewhat closer in value to s2² than to s1² because of the
greater weight given to s2² in the formula. Then
t_{observed} = \frac{18.00 - 15.25}{\sqrt{\frac{4.356}{15} + \frac{4.356}{20}}} = \frac{2.75}{\sqrt{0.5082}} = \frac{2.75}{0.713} = 3.86
For this example we have N1 − 1 = 14 df for the experimental group and N2 − 1 = 19 df
for the control group, making a total of N1 − 1 + N2 − 1 = 33 df. From the sampling distribution
shown in the t table (at the back of the notes), t.05(33) ≈ 2.04. Because the observed value of t
far exceeds this critical value, we reject H0 (at α = .05, two-tailed).
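For reference, the same pooled-variance t test can be reproduced from the summary statistics with scipy; note that scipy expects standard deviations rather than variances and reports a two-tailed p value.

```python
from math import sqrt
from scipy.stats import ttest_ind_from_stats

t, p = ttest_ind_from_stats(mean1=18.00, std1=sqrt(5.286), nobs1=15,
                            mean2=15.25, std2=sqrt(3.671), nobs2=20,
                            equal_var=True)   # pooled-variance (Student) t test
print(round(t, 2), p)                         # t ~ 3.86, two-tailed p < .001
```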
t test for correlated samples
The t test for dependent means is used in situations where the two samples (and the
corresponding populations) are correlated. What makes two samples correlated? First, when
two measures are obtained from the same person, the two samples of these measurements are
correlated. For example, midterm and final are correlated. Views about Hong Kong
democratic reform from husbands and wives represent correlated samples because two people
get married in part because they share similar views and, after marriage, they also influence
each other’s views. In all these examples, there is a correlation between the two persons (e.g.,
husband and wife) or two occasions (midterm and final of the same person). All the
procedures and logic for testing the difference between these kinds of “correlated sample
means” are the same except for the fact that the standard error of the differences between
means is usually larger for independent samples than for dependent samples. This is so
because the dependent samples are correlated. We have to calculate and take into
consideration the correlation between the two scores. This “little exception,” however, makes
the computation much more cumbersome than that for the t-test of independent samples. Instead
of presenting this cumbersome method, many statistics books use a computational method to
conduct the t-test of dependent samples, or t-test of matched pairs.
This computational method uses the same logic as that of the z-test of single sample
mean using the normal distribution, which was described earlier. The only difference is that
the normal distribution serves as the probability distribution for the z-test whereas the t
distribution serves as the probability distribution for the t-test of dependent samples. The
procedure is to first compute a difference score between the two variables and then work on
the difference score, which is one variable, as if we are dealing with one sample mean test.
Let us see how it works using the following table as an example.
(1) Subject   (2) X1   (3) X2   (4) D = X1 − X2   (5) D̄       (6) D − D̄   (7) (D − D̄)²
1             60       107      −47               −24.40      −22.60       510.76
2             85       111      −26               −24.40      −1.60        2.56
3             90       117      −27               −24.40      −2.60        6.76
4             110      125      −15               −24.40      9.40         88.36
5             115      122      −7                −24.40      17.40        302.76
Sum           460      582      (−122)            (−122)      0            911.20
In the above table, instead of comparing columns 2 and 3, which are the two variables, we
compute the difference between the two variables, which resides in column 4. Then, instead of
testing the following hypotheses:
H0: μ1 − μ2 = 0
H1: μ1 − μ2 > 0 (or H1: μ1 − μ2 < 0)
We test this hypothesis that involves only the difference score:
H0: μD = 0
H1: μD > 0 (or H1: μD < 0)
The corresponding t-test is:
t_{observed} = \frac{\bar{D} - \mu_D}{s_{\bar{D}}} = \frac{\bar{D}}{s_{\bar{D}}}, where s_{\bar{D}} = \frac{s_D}{\sqrt{N}}

Of course, s_D is obtained by the following equation, which is the same as that for a standard deviation:

s_D = \sqrt{\frac{\sum (D - \bar{D})^2}{N - 1}}

In the above example,

s_D = \sqrt{\frac{911.20}{5 - 1}} = \sqrt{227.8} = 15.093

s_{\bar{D}} = \frac{15.093}{\sqrt{5}} = 6.750

t_{observed} = \frac{-24.40}{6.750} = -3.61

df = N − 1 = 5 − 1 = 4

t_{critical}(.01/2, 4) = 4.604 for a two-tailed test. Since |t_{observed}| = 3.61 does not exceed 4.604, we do not reject H0 at α = .01.
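The same paired test can be run directly on the two columns of raw scores; scipy's ttest_rel works on the difference scores internally and reports a two-tailed p value.

```python
from scipy.stats import ttest_rel

x1 = [60, 85, 90, 110, 115]      # column (2) from the table above
x2 = [107, 111, 117, 125, 122]   # column (3)

t, p = ttest_rel(x1, x2)         # equivalent to a one-sample t test on D = X1 - X2
print(round(t, 2), round(p, 3))  # t ~ -3.61 with df = 4; two-tailed p ~ .02,
                                 # so H0 is not rejected at alpha = .01
```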
Steps in t test of difference between two means
1. Make assumptions based on the research issue. State the null hypothesis and the alternative
hypothesis, which determine whether the test is one-tailed or two-tailed and, for a one-tailed
test, which tail is of interest.
2. Distinguish whether the two groups are two independent samples or two paired
samples (i.e., correlated samples).
3. Choose a suitable command in Excel or SPSS; the t statistic, the corresponding
probability, and other related information can then be obtained. For independent samples, we need
to test whether the two population variances are equal or not before testing the difference
between the two population means.
4. Report the result of the hypothesis test and give reasonable explanations.
Chi-square test for frequency data
The chi-square statistic (χ²) is used to test whether the observed frequencies differ
significantly from the expected frequencies. The expected frequencies might be based on a
null hypothesis, such as that height should be normally distributed with an extremely small
frequency of people above 2 meters. The expected frequencies might also be based
on what would be expected if chance assigned subjects to categories; for example, the number of people
who carry their bags on the right versus the left shoulder should be 50:50 under such a
chance (null) hypothesis.
χ² is used with data in the form of counts (in contrast, for example, to scores on a test).
Thus χ² can be used with frequency data (f), proportion data (f ÷ n), probability data
(number of outcomes in an event ÷ total number of outcomes), and percentages (proportion ×
100). If the data are in the form other than frequencies (e.g., proportions), the data need to be
converted to frequencies.
Chi-square test for one-way design
In the chi-square test for a one-way design, only one variable is involved. We test
whether the variable has the distribution as expected. Suppose N subjects can be assigned
to k categories. The observed and expected frequencies of the ith category are Oi and Ei,
respectively. The test statistic is:
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}
The chi-square statistic provides a measure of how much the observed and expected
frequencies differ from one another. But how much difference should be tolerated before
concluding that the observed frequencies were not sampled from a distribution represented by
the expected frequencies? In other words, how large should the observed χ² be in order to reject the
null hypothesis that the observed frequencies were sampled from a distribution represented
by the expected frequencies? The question becomes one of finding the probability of obtaining
an observed χ² as large as or larger than some value, assuming the null hypothesis is true. To do so,
we need a sampling distribution for χ².
The chi-square distribution, like the t distributions, is a theoretical distribution—actually,
a family of theoretical distributions depending on the number of distinct categories on which
the statistic is based. Or more accurately, there is a different member of family for each
number of degrees of freedom, which is determined by: df = number of categories minus 1 or
k – 1.
Notice that the degrees of freedom for the  2 distribution depend on the number of
categories (k) but not the number of subjects in the sample (N).
For example, to find out whether HK movie goers have any preference among the four
common kinds of movies (Action, Horror, Drama, and Comedy), one can use the
chi-square test to compare the observed movie choices against the expected choices, which,
under the null hypothesis (no preference), would be 1/4 of the total for each kind. Assume a
researcher stood at the ticket counter, tallied 32 tickets sold, and obtained the following.
Movies      Action   Horror   Drama   Comedy   Total
Observed    4        5        8       15       32
Expected    8        8        8       8        32
H0: Four kinds of movies are equally preferred.
H1: Some kinds of movies are preferred more than others.
2  
(O  E ) 2
E
(4  8) 2 (5  8) 2 (8  8) 2 (15  8) 2



8
8
8
8
 9.25
With k − 1 = 3 df, the critical value χ².05 = 7.82. The observed chi-square
value of 9.25 is greater than 7.82 and thus the researcher can reject the null hypothesis and
conclude that HK movie goers have preferences. But which kind do the HK movie goers
prefer? Unfortunately, the chi-square test alone does not answer that question. The easiest
way to determine that is by eyeballing the four frequencies. From the data, it is clear that
more people than expected go to see comedy, whereas Action and Horror choices were made
less frequently than expected.
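The one-way test above can be reproduced with scipy's chisquare function, which takes the observed and expected counts directly.

```python
from scipy.stats import chisquare

observed = [4, 5, 8, 15]     # Action, Horror, Drama, Comedy
expected = [8, 8, 8, 8]      # equal preference: 32 tickets / 4 kinds

chi2, p = chisquare(f_obs=observed, f_exp=expected)
print(round(chi2, 2), round(p, 3))   # chi-square = 9.25 with df = 3, p ~ .026
```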
One requirement is that the expected frequency for each category is not less than 5 for df
≥ 2 and not less than 10 for df = 1. Note that this assumption is for expected, not observed,
frequencies.
The observed values of χ² with 1 degree of freedom must be corrected for continuity in
order to use the table of values of χ²critical. This is called Yates' correction for continuity used
to correct the inconsistency between the theoretical sampling distribution of χ² and the
actual sampling distribution of the frequency data, which only approximates the chi-square
distribution. Yates’s correction:
1. Subtract .5 from the observed frequency if the observed frequency is greater than the
expected frequency; that is, if O > E, subtract .5 from O.
2. Add .5 to the observed frequency if the observed frequency is less than the expected
frequency; that is, if O < E, add .5 to O.
Chi-square test for two-way design
In the case of a two-way design, i.e., two variables which are categorical, the question
becomes whether the two variables are independent of one another or are related to
each other. For example, is movie preference, which, let us assume is defined by four
preference categories of action, horror, drama, and comedy only, related to gender (another
categorical variable with two categories)?
The formula for calculating the two-way χ² is the same as for the one-way χ² when each
cross-category in the two-way design is treated as one category; only the degrees of freedom
change. The degrees of freedom for the two-way χ² depend on the number of rows (r,
which represents the number of categories for one variable) and the number of columns (c,
which represents the number of categories for the other variable) in the design: df = (r − 1)(c − 1).
The chi-square statistic with (r – 1)(c – 1) degrees of freedom is used to compare the
observed and expected frequencies in a two-way contingency table. Here, the expected
frequencies represent the frequencies that would be expected if two variables were
independent. The statistical rule of independence is that when two events are independent,
their joint probability is the product of their individual probabilities.
For example, to test whether there is a correlation between gender and movie choice (in
other words, whether men and women have different preferences for movies), one can
conduct a chi-square test to examine whether these two variables are independent.
H0: Gender and movie choice are independent in the population.
H1: Gender and movie choice are related in the population.
The data are contained in the table below. In this example, the expected frequency in the
(1,1) cell, which is presented in parentheses, is obtained by multiplying the row total (564 for the
first row) by the column total (116 for the first column) and dividing this product by the
total frequency (897):

E_{11} = \frac{564 \times 116}{897} = \frac{65{,}424}{897} = 72.94
This procedure for calculating the expected frequencies is carried out for all cells; the
results are presented in parentheses in the table below. Applying the chi-square formula, we obtain:
 2 observed = 16.78 + 28.48 + 265.36 + 110.08 + 70.53 + 119.19 + 20.02 + 33.13 = 463.57
df = (r – 1)(c – 1) = (4 - 1)(2 - 1) = 3.
With df = 3, the critical chi-square value is 7.81 at α = .05 (and 11.34 at α = .01). The observed chi-square of
463.57 far exceeds the critical chi-square and we reject the null hypothesis.
Sex       Action     Horror      Drama       Comedy    Total
Male      108 (73)   345 (224)   94 (218)    17 (48)   564
Female    8 (43)     12 (133)    253 (129)   60 (29)   333
Total     116        357         347         77        897

(Expected frequencies are shown in parentheses.)
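The two-way test can be reproduced from the observed counts with scipy's chi2_contingency, which also returns the table of expected frequencies computed by the row-total × column-total ÷ grand-total rule described above.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: Male, Female; columns: Action, Horror, Drama, Comedy
observed = np.array([[108, 345,  94, 17],
                     [  8,  12, 253, 60]])

chi2, p, df, expected = chi2_contingency(observed)
print(round(chi2, 2), df)        # ~ 463.6 with df = 3
print(np.round(expected, 1))     # matches the expected counts shown in parentheses
```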
Correlation and Regression
Joint distribution
The joint distribution of two variables is graphically presented as a scatter plot. Unlike
the frequency polygon for a univariate distribution, on both the ordinate and abscissa of a
scatter plot are the measurement units of each of the two variables. Each dot represents the
intersection of a pair of observations, the measurements of which are on the ordinate and
abscissa. Scatter plots depict the strength and direction of the association between two
variables. These characteristics can be summarized numerically by one of two statistics:
covariance and correlation.
Covariance
The covariance of X and Y is the average of the cross-products of deviations of the two
variables. It represents the degree to which the two variables vary together.
s_{XY} = \frac{1}{N-1} \sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})   (Covariance of X and Y)

s_X^2 = \frac{1}{N-1} \sum_{i=1}^{N} (X_i - \bar{X})(X_i - \bar{X})   (Variance of X)

s_Y^2 = \frac{1}{N-1} \sum_{i=1}^{N} (Y_i - \bar{Y})(Y_i - \bar{Y})   (Variance of Y)
The concept of covariance of two variables is similar to the variance of one variable.
Variance summarizes deviations of observations from the mean, whereas covariance
summarizes the deviations of pairs of observations from two respective means. If we replace
all the Y's in the covariance equation with X's, we would have the variance of X. However,
there is one important difference between variance and covariance. While a variance is
always positive, a covariance can be positive or negative. The sign of the covariance
indicates the direction of the association of the two variables. Of course, the magnitude of a
covariance indicates the degree to which the two variables vary together or the strength of the
association of the two variables. However, the value of covariance is also influenced by the
measurement units of the two variables. A larger numerical quantity as measurement of one
or both variables, e.g., X = 24 inches and s X = 120 inches versus X = 2 feet and s X = 10
sY2 
feet and/or Y = 36 ounces and sY = 144 ounces versus Y = 3 pounds and sY = 12 pounds,
will result in a larger s XY independent of the actual association of the two variables.
Correlation Coefficient
One way to eliminate the influence of measurement units is to use z-scores of X and Y
since the z-scores have the same mean and standard deviation whether or not the original
units are a large quantity of inches and ounces or smaller quantity of feet and pounds. The
covariance of X and Y expressed in the z-score form is called a correlation coefficient.
r_{XY} = \frac{1}{N-1} \sum_{i=1}^{N} \left(\frac{X_i - \bar{X}}{S_X}\right)\left(\frac{Y_i - \bar{Y}}{S_Y}\right) = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{N} (X_i - \bar{X})^2 \sum_{i=1}^{N} (Y_i - \bar{Y})^2}}
The relation between the correlation and the covariance can be described by:

r_{XY} = \frac{s_{XY}}{s_X s_Y}, \quad s_{XY} = r_{XY} s_X s_Y
The correlation being discussed here was derived by Karl Pearson (1857-1936)
and is called the product moment correlation coefficient. It is denoted by ρ (rho) as a
population parameter and r as a sample statistic. The term "moment" in physics refers to a
function of the distance of an object from the center of gravity, which is the mean of a
frequency distribution. x = X − X̄ and y = Y − Ȳ are the first moments, and xy is a product of
the first moments. The correlation is also called the Pearson coefficient.
Like the covariance, a correlation can be positive or negative, indicating the direction of
the relationship. Since a correlation is free from the influence of measurement units, the
magnitude of the correlation ranges from 0, indicating no linear relationship, to 1 or -1,
indicating a perfect or strongest association.
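Using the same made-up heights and weights as in the covariance sketch above, a small sketch shows that the correlation, unlike the covariance, does not change when the units change:

```python
import numpy as np

def correlation(x, y):
    # Pearson r as the average cross product of z-scores (N - 1 in the denominator)
    zx = (x - x.mean()) / x.std(ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)
    return (zx * zy).sum() / (len(x) - 1)

height_in = np.array([60.0, 64.0, 66.0, 70.0, 72.0])
weight_oz = np.array([1800.0, 1950.0, 2100.0, 2300.0, 2450.0])

r_inch_ounce = correlation(height_in, weight_oz)
r_foot_pound = correlation(height_in / 12.0, weight_oz / 16.0)
print(r_inch_ounce, r_foot_pound)                 # identical values
print(np.corrcoef(height_in, weight_oz)[0, 1])    # numpy's built-in routine agrees
```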
Linear Association
There are many ways two variables can be related. We only deal with one of the types,
that of a linear association. In a linear relation, the change of one variable corresponding to
the change of the other variable is in one direction and at a constant rate. That is, as X goes up, Y
either consistently goes up or consistently goes down, but does not do both. An association is curvilinear if
Y sometimes goes up and sometimes goes down in correspondence to the upward movement
of X. The Pearson correlation measures only linear association. Thus, a correlation coefficient
should be examined together with the scatter plot to be certain that a zero or moderate
coefficient is not due to the nonlinearity of an existing association. The strength of a linear
association can also be examined from scatter plots. When the dots (representing the
correspondence between X and Y) are tightly clustered together in one direction or the other,
the association between X and Y is stronger than when the dots are loosely scattered.
Outliers and restriction of range
As can be seen from the earlier formula, the correlation between X and Y is determined
by the deviations of X from X̄ and the deviations of Y from Ȳ. Two things may artificially
influence these deviations and therefore the correlation coefficient. One is called an outlier,
which is an extremely large or small score not typical of the X or Y population, creating an
extremely large deviation. This large deviation sometimes may artificially raise the
correlation if the large deviation is consistent with most of other observations. It may also
artificially attenuate the correlation if the large deviation is inconsistent with other
observations. The other problem is called the "restriction of range" fallacy. This refers to the
fact that unusually small variability in either of the X or Y variable will result in small
deviations and thus a weak correlation. Such reduced variability occurs when part rather than
the full range of the X and/or Y distribution is sampled. For example, the correlation between
GRE and graduate GPA would be much higher had the full range of GRE scores been
available. In reality, we only have the graduate GPA measures for some but not all of the
GRE scores since graduate schools are open only to students whose GRE are above a certain
level. There are ways to predict the potential correlation of a full range using the correlation
estimated from a restricted range. It is important to examine the scatter plot to determine if
the relationship is linear, if there are outliers, and, according to theory and experience, if the
sampled X and Y values represent the full range of these two variables.
Hypothesis Testing
The purpose of the test for ρXY = a specified value is to determine whether or not the
observed value of rXY was sampled from a population in which the linear relationship between
X and Y is some specified value. When ρXY (in the population, or the unknown truth) is
positive (e.g., ρ = .30), the sampling distribution for rXY is negatively skewed. When ρXY is
negative (e.g., ρ = -.25), the sampling distribution is skewed in the positive direction. Since
skewed distributions can be transformed into approximately normal distributions by taking a
logarithmic transformation of the data, a log transformation of rXY is made. With this
transformation, the sampling distribution of rXY is approximately normal, regardless of
whether the sample correlation is drawn from a population with ρXY = 0 or ρXY = some
specific value. This log transformation is known as Fisher's Z transformation. We do not have
to carry out the transformation ourselves; statistical tables give the Z values corresponding
to different correlation coefficients. The standard error of the sampling distribution of rXY
when it has been transformed to Z is given by:
$$s_Z = \frac{1}{\sqrt{N-3}}$$
In order to test whether or not an observed correlation rXY differs from the hypothesized
value of the population parameter ρXY, we use the table of Z transformations and compute
the z-scores (as we did earlier) and then go to the normal table to find the probability values
(probability associated with the observed z-score and/or probability associated with the
critical z-score.)
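As a sketch of this procedure, the standard Fisher Z transformation z' = ½ ln[(1 + r)/(1 − r)] (available as arctanh in numpy) can be applied directly; the sample correlation, hypothesized value, and sample size below are made up:

```python
import numpy as np
from scipy.stats import norm

r_obs = 0.45    # observed sample correlation (hypothetical)
rho_0 = 0.30    # hypothesized population correlation (hypothetical)
N = 100         # sample size (hypothetical)

z_obs = np.arctanh(r_obs)     # Fisher's Z of the observed correlation
z_0 = np.arctanh(rho_0)       # Fisher's Z of the hypothesized value
se = 1.0 / np.sqrt(N - 3)     # standard error of the transformed correlation

z_statistic = (z_obs - z_0) / se
p_value = 2 * (1 - norm.cdf(abs(z_statistic)))  # two-tailed probability
print(z_statistic, p_value)
```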
More often, of course, we are interested in whether two correlation coefficients are the same
or whether one is higher or lower than the other. This involves hypothesis testing of the difference
between two correlations. The sampling distribution of the differences between sample
correlations (rX1Y1 − rX2Y2) based on independent samples can also be modeled by the normal
distribution if the observed correlations are transformed to Fisher’s Z’s. The standard error of
the difference is estimated by
$$s_{r_1 - r_2} = \sqrt{\frac{1}{N_1 - 3} + \frac{1}{N_2 - 3}}$$
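A parallel sketch for comparing two correlations from independent samples (again with made-up values):

```python
import numpy as np
from scipy.stats import norm

r1, n1 = 0.52, 85   # correlation and sample size in the first sample (hypothetical)
r2, n2 = 0.31, 90   # correlation and sample size in the second sample (hypothetical)

z1, z2 = np.arctanh(r1), np.arctanh(r2)              # Fisher's Z transforms
se_diff = np.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))   # standard error of the difference

z_statistic = (z1 - z2) / se_diff
p_value = 2 * (1 - norm.cdf(abs(z_statistic)))       # two-tailed probability
print(z_statistic, p_value)
```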
Non-standardized regression equation
The non-standardized regression equation, or simply, regression equation is
$$Y_i = \beta_0 + \beta_1 X_i + e_i$$

where β0 is called the intercept, which denotes the value of Y when X is zero, and β1 is the
regression coefficient, which represents the change in Y associated with a one-unit change in X.
The intercept is estimated by:

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$

β1 can be estimated by

$$\hat{\beta}_1 = r_{XY}\,\frac{s_Y}{s_X}$$

where rXY is the correlation between X and Y, and sY and sX are the standard deviations
of Y and X.
Once the intercept and regression coefficient are estimated, we can estimate the
predicted value of Y at a given value of X.
Prediction: $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$

Residual: $\hat{e}_i = Y_i - \hat{Y}_i$
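A minimal sketch of these estimates, using a small made-up set of SAT-like (X) and GRE-like (Y) scores:

```python
import numpy as np

# Hypothetical predictor (X) and outcome (Y) scores
X = np.array([950.0, 1000.0, 1050.0, 1100.0, 1200.0, 1250.0])
Y = np.array([480.0, 500.0, 540.0, 560.0, 610.0, 630.0])

r_xy = np.corrcoef(X, Y)[0, 1]
s_x, s_y = X.std(ddof=1), Y.std(ddof=1)

beta1_hat = r_xy * s_y / s_x                   # slope: r times (sY / sX)
beta0_hat = Y.mean() - beta1_hat * X.mean()    # intercept: Y-bar minus slope times X-bar

Y_hat = beta0_hat + beta1_hat * X              # predicted values
residuals = Y - Y_hat                          # prediction errors
print(beta0_hat, beta1_hat)
```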
The purpose of regression is to explain the variability of an outcome variable (called the
dependent variable in experimental research or the response variable in non-experimental
research) from the knowledge of some other predictor variables (called the independent or
explanatory variable in experimental and non-experimental research, respectively). Without any predictors, the best prediction
is the mean of the outcome variable. For example, if we were to guess the GRE score of a
randomly chosen graduate student without knowing things like which graduate school he/she
is attending, which university he/she graduated from, what was his/her undergraduate GPA,
what was his/her SAT, etc., the best guess is perhaps the population mean of GRE. That is, we
would be wondering what kind of GRE score an average graduate student would get. And
that score would be the average of GRE scores taken by all the graduate students. If we use Y
to denote GRE as an outcome variable, the best prediction of one randomly chosen student's
GRE is μ(Y), or the population mean of GRE. Of course, the larger the variability of the
GRE scores, the less certain we are of our guess. And we are right if we think that the
variance of all the GRE scores, σ²(Y), is quite large.
What if you are told that his/her SAT is above the mean of SAT, μ(X), using X to
denote SAT? Would you bet his/her GRE to be above μ(Y)? If you feel more certain with
this second bet, it is because, by knowing his/her SAT score, you increase your chance by
picking from a smaller range of GRE scores, those above the mean. We call this smaller
range of GRE scores the conditional distribution of GRE and the full range of GRE scores the
unconditional distribution of GRE. You are more confident with the second guess because the
variability of the conditional distribution, σ²(Y|X), is smaller. σ²(Y|X) is the variance of Y
at a given level of X. In our example, it is the variance of GRE scores of those who scored
above the mean on the SAT.
What regression does is divide the whole population of an outcome variable into many
sub-populations according to the values of the predictor variable. The sub-populations are
called conditional population or conditional distribution whereas the whole population is
called unconditional population or unconditional distribution. Without knowledge of a
predictor, the best prediction of the outcome variable is the mean of the unconditional
population, μ(Y). Knowing the value of a predictor, the best prediction is the mean of the
conditional distribution, μ(Y|X). The predicted Y, Ŷi, is an estimate of the mean of the
conditional population of Y, μ(Y|X). Knowing that his/her SAT is 1200, the best guess of
his/her GRE would be the mean of all GRE scores of people with an SAT of 1200, or μ(Y|X=1200).
What we have done earlier through the regression equation is to estimate these means of
conditional population. The regression equation for population parameters is presented below
and is accompanied by the equation for sample statistics described earlier.

$$\mu(Y \mid X) = \beta_0 + \beta_1 X$$

$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$$
The relationship between the above two equations is that between sample estimates and
population parameters. Ŷi is a sample estimate of μ(Y|X=Xi); β̂0 is an estimate of β0;
β̂1 is an estimate of β1. If we do not have any predictor variables, the best prediction will
be the mean of the unconditional distribution, or μ(Y), and the sample estimate of μ(Y) is
Ȳ. Ȳ in this case will be our sample prediction, Ŷi. It is shown in the following that, when
β̂1 = 0, or when there is no prediction from X to Y, Ŷi = Ȳ.
$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$$

Recall that $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$. Replacing β̂0 by Ȳ − β̂1X̄, we have

$$\hat{Y}_i = \bar{Y} - \hat{\beta}_1 \bar{X} + \hat{\beta}_1 X_i = \bar{Y} + \hat{\beta}_1 (X_i - \bar{X}) = \bar{Y}, \quad \text{if } \hat{\beta}_1 = 0.$$
Fitting a linear regression line to the scatter plots
(A graph will be shown in the class which will facilitate the understanding of this part of
the text.)
Knowing the values of β̂0, the intercept, and β̂1, the regression coefficient, one can draw
a straight line, known as the linear regression line. Or, to put it differently, the Ŷi's form a
straight line. β̂0 is the intercept of the regression line, or the value of Y when X is zero; β̂1 is
the slope of the regression line, or how steep or flat the line is. The regression line which
represents a linear functional relation between X and Y best fits the data points in the joint
distribution or scatter plot. There are several methods to fit the regression line to the data
points. In other words, there are several ways to estimate  0 and  1 in a regression equation.
The one we focus on here, which is also the most commonly used, is the Ordinary
Least Squares (OLS) method. By the least squares criterion, the intercept and slope of the
regression equation are determined so that the sum of the squared deviations of the actual
data points from the regression line is a minimum.
Since the regression line is formed by the Ŷi's and the data points are the actual Yi's for
each level of X, the least squares criterion specifies that Σ(Yi − Ŷi)² is a minimum. Ŷi is the
predicted value. Yi − Ŷi = êi is the residual or prediction error. Just as the sum of deviations from
the mean is zero, Σ(Yi − Ŷi), or Σêi, based on the same cases that estimate β̂0 and β̂1, is
also zero. Conceptually, Σêi = 0 implies that over-predictions (when data points fall below
the regression line) and under-predictions (when data points are above the regression line)
cancel each other out. Of course, the closer the data points are to the regression line, the
smaller the sum of squared residuals and, thus, the better the prediction. In an extreme case
when all the data points fall on the regression line, there will be no prediction errors and a
perfect prediction from X to Y.
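With the same made-up X and Y scores as in the earlier sketch, one can verify numerically that the OLS residuals sum to (essentially) zero:

```python
import numpy as np

X = np.array([950.0, 1000.0, 1050.0, 1100.0, 1200.0, 1250.0])
Y = np.array([480.0, 500.0, 540.0, 560.0, 610.0, 630.0])

# OLS slope and intercept, estimated as before
b1 = np.corrcoef(X, Y)[0, 1] * Y.std(ddof=1) / X.std(ddof=1)
b0 = Y.mean() - b1 * X.mean()

residuals = Y - (b0 + b1 * X)
print(residuals.sum())          # ~0: over- and under-predictions cancel out
print((residuals ** 2).sum())   # the quantity that the least squares criterion minimizes
```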
The deviation of an observed Yi (for a given value of Xi) from the mean of Y, Ȳ, is
made of two components, Ŷi − Ȳ and Yi − Ŷi. For each observation, we can write:

$$(Y_i - \bar{Y}) = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)$$
We can square the deviations and sum them for all the observations:
$$\sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2$$

(Note:

$$\sum (Y_i - \bar{Y})^2 = \sum \left[ (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i) \right]^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2 + 2 \sum (\hat{Y}_i - \bar{Y})(Y_i - \hat{Y}_i)$$

The last term is zero because of the least squares criterion.)
The above terms are called "sum of squared deviations" or simply "sum of squares" or
SS. The equation is often simply written as:
SStotal = SSreg + SSres
which decomposes the total sum of squares into two parts: the sum of squares due to
regression, SSreg, and the sum of squares of residuals, SSres.
The sum of squares can be divided by the corresponding degrees of freedom to arrive at
estimated variance components.
dftotal = N - 1, where N is the number of sampled observations.
dfreg = k, where k is the number of predictors, X1, X2, …, Xk.
dfres = N - k - 1.
Dividing SStotal , SSreg and SSres by corresponding df, yields the estimated variances or the
mean of sum of squares (mean squares): MStotal , MSreg and MSres. An F statistic can be used
to test whether the regression is significant. The null hypothesis is H0: β1 = 0; the
alternative hypothesis is H1: β1 ≠ 0.
$$F = \frac{MS_{reg}}{MS_{res}}$$
When Fobs. is larger than Fcri, reject H0: β1 = 0, and conclude that the regression is
significant.
SS reg
It can be shown that r 2 XY 
, that is, the squared correlation between X and Y
SS total
represents the proportion of variation in Y that is linearly associated with X, or in the other
words, that can be explained by X.
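The decomposition, the F test, and the equality r² = SSreg/SStotal can all be checked with the same made-up data as above (k = 1 predictor); scipy is assumed for the F probability:

```python
import numpy as np
from scipy.stats import f as f_dist

X = np.array([950.0, 1000.0, 1050.0, 1100.0, 1200.0, 1250.0])
Y = np.array([480.0, 500.0, 540.0, 560.0, 610.0, 630.0])
N, k = len(Y), 1

b1 = np.corrcoef(X, Y)[0, 1] * Y.std(ddof=1) / X.std(ddof=1)
b0 = Y.mean() - b1 * X.mean()
Y_hat = b0 + b1 * X

ss_total = ((Y - Y.mean()) ** 2).sum()
ss_reg = ((Y_hat - Y.mean()) ** 2).sum()
ss_res = ((Y - Y_hat) ** 2).sum()
print(np.isclose(ss_total, ss_reg + ss_res))    # True: SS_total = SS_reg + SS_res

ms_reg = ss_reg / k                             # df_reg = k
ms_res = ss_res / (N - k - 1)                   # df_res = N - k - 1
F = ms_reg / ms_res
p = 1 - f_dist.cdf(F, k, N - k - 1)
print(F, p)

print(ss_reg / ss_total, np.corrcoef(X, Y)[0, 1] ** 2)  # both equal r squared
```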
Dividing the sums of squares by these df's yields the estimated variances:

s²y = s²ŷ + s²r  (s²r is also written as s²y.x.)

s²y = s²ŷ + s²y.x decomposes the total variance in Y into two parts: that which is
linearly related to X, s²ŷ, and that which cannot be accounted for by the linear function of
X, s²y.x. We learned earlier that r²xy represents the proportion of variance in Y that is linearly
associated with X. We can see now that the same interpretation applies to s²ŷ.
In s²y = s²ŷ + s²y.x, s²y is the sample estimate of the unconditional population variance,
or the total variability in Y. In our GRE example, this is the variance of all GRE scores. s²ŷ
is the sample estimate of the variability among the conditional population means. Because the
conditional population means can be predicted by the linear regression equation, s²ŷ is called
the variance due to regression. In the GRE example, this is an estimate of the variance among
the means of GRE at different levels of SAT.
example, this variance can give you some idea about the difference between the mean of
GRE with an SAT of 1000 and the mean of GRE with an SAT of 900. Finally, s²y.x is an estimate of
the conditional population variance, σ²(Y|X), or the variability among all the Y's in the
population at a given level of X. In the GRE example, not everyone having an SAT of 1000
would have the same GRE. The error variance of prediction describes the differences among
the GRE scores of these people who all scored 1000 on the SAT. Of course, according to our linear
regression prediction, all these people are predicted to have a common GRE, which is the
mean of their GRE scores. s²y.x shows how far off we are in our regression prediction.
Note. s²y, s²ŷ, and s²y.x are unbiased estimators of their respective population
parameters. For sampled data, these three terms do not add up because they are not divided
by a common denominator, or degrees of freedom. Thus, the equation s²y = s²ŷ + s²y.x does
not really hold. Precisely, E(s²y) ≠ E(s²ŷ) + E(s²y.x). We use the equation because it is
conceptually easy to understand.
Analysis of Variance
When three or more populations (and their means) are being compared, analysis of
variance (ANOVA) and the F-test are applied instead of the t-test, which is appropriate for
comparing two population means. Here we focus on ANOVA for the one-way design only.
Procedure of ANOVA
1. Hypothesis
In ANOVA, the null hypothesis is that all population means are identical. That is,
$$H_0: \mu_1 = \mu_2 = \cdots = \mu_k$$

where k is the number of populations (or groups) being compared.
2. Decomposition of the sum of squared deviations
Let Xij be the score of the jth subject in the ith group, X̄i be the mean of the ith group
with a total of ni subjects, and X̄ be the grand mean of all N = n1 + n2 + ⋯ + nk subjects.
The variation of a data set can be partitioned into two sources: one caused by the
differences between groups (variation between groups), the other coming from the differences
among subjects within each group (variation within groups). The variations are usually
measured by the sum of squared deviations (or the sum of squares).
Total sum of squares: $SS_T = \sum_i \sum_j (X_{ij} - \bar{X})^2$

Sum of squares between groups: $SS_B = \sum_i n_i (\bar{X}_i - \bar{X})^2$

Sum of squares within groups: $SS_W = \sum_i \sum_j (X_{ij} - \bar{X}_i)^2$

The above three sums of squares satisfy

$$SS_T = SS_B + SS_W$$
which decomposes the total sum of squares into two parts: the sum of squares between
groups and the sum of squares within groups.
3. Decomposition of the degrees of freedom and the mean squares
Without considering the degrees of freedom, the sum of squares between groups and the
sum of squares within groups are not comparable.
Total: dfT = N − 1
Between groups: dfB = k − 1
Within groups: dfW = N − k
Corresponding to the decomposition of the total sum of squares, the total degrees of freedom
can be decomposed as:
dfT = dfB + dfW
Dividing SSB and SSW by the corresponding df yields the mean of the sum of squares (mean squares):

mean squares between groups: MSB = SSB / dfB

mean squares within groups: MSW = SSW / dfW
4. F-test
An F statistic can be used to test whether the differences among groups are significant.
$$F = \frac{MS_B}{MS_W}$$

with degrees of freedom (dfB, dfW).
If there are small differences among groups, the mean squares between groups will be
close to the mean squares within groups, and the F value calculated from the sample, Fobs., will
be close to 1. On the other hand, if there are large differences among groups, Fobs. will be
substantially larger than 1. When Fobs. is larger than Fcri, we reject H0 and conclude that
not all groups have the same mean. Alternatively, if the probability corresponding to Fobs. is
less than α = 0.05, reject H0.
In application, an ANOVA table is often displayed, through which one can view the
results clearly.

ANOVA Table:

Source            SS      df        MS      F            P
Between groups    SSB     k − 1     MSB     MSB/MSW
Within groups     SSW     N − k     MSW
Total             SST     N − 1
Example of ANOVA
Three different teaching methods were applied in three classes, one method for each
class. After one year, a test was conducted using the same items in the three classes. Ten
scores from each class were randomly drawn (see table below). Are there significant
differences among the three classes?
Scores of samples

Class   Score                                       Mean    Variance
A       76, 78, 71, 68, 74, 67, 73, 80, 72, 70      72.9    17.7
B       83, 70, 76, 76, 69, 74, 72, 80, 79, 75      75.4    19.6
C       82, 88, 83, 85, 79, 77, 84, 82, 80, 75      81.5    14.9
The null hypothesis to be tested is

$$H_0: \mu_1 = \mu_2 = \mu_3$$
Since F = 11.2 and P = 0.000, we reject H0 and conclude that there are at least two classes
with significantly different mean scores. If all other conditions are similar, we can conclude
that the effect of teaching methods on achievement is significant.
ANOVA

Source            SS       df     MS       F       P
Between groups    391.4     2     195.7    11.2    0.000
Within groups     469.8    27      17.4
Total             861.2    29
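The table above can be reproduced from the raw scores with a short sketch; scipy's f_oneway gives the same F and its P value directly:

```python
import numpy as np
from scipy.stats import f_oneway

class_a = np.array([76, 78, 71, 68, 74, 67, 73, 80, 72, 70], dtype=float)
class_b = np.array([83, 70, 76, 76, 69, 74, 72, 80, 79, 75], dtype=float)
class_c = np.array([82, 88, 83, 85, 79, 77, 84, 82, 80, 75], dtype=float)
groups = [class_a, class_b, class_c]

all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()
k, N = len(groups), len(all_scores)

# Decompose the total sum of squares into between- and within-group parts
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (N - k)
F = ms_between / ms_within
print(ss_between, ss_within, F)              # about 391.4, 469.8, and 11.2

print(f_oneway(class_a, class_b, class_c))   # same F statistic with its exact P value
```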