6401 Lecture Notes Chang Lei

Dr. Chang
1
Lecture Notes for EDM6401, Quantitative Methods in Educational Research
Chang Lei, Ph.D., Professor
1. An Overview of Educational Research
Four Ways of Knowing
Method of Tenacity
Truth is held to be true because one believes it, even in the face of contradicting evidence.
Superstition.
Method of Authority
Truth is true because an authority says so. Religion.
Method of Intuition
Truth is true because it is logical. It derives from reasoning but does not bear empirical
support. Philosophy.
Method of Science
Science is a method of seeking truth. This method only accounts for solvable problems
that have empirical solutions based on observable events. Some major components of the
scientific method include:
empirical evidence vs. refusal of contradicting evidence
random sampling vs. isolated or selected events
countering rival explanations
replication and public inquiry
Truth is routinely challenged, tested, and retested by the public. There is no final truth,
only a temporary state in which a claim has not yet been disproved or rejected as untrue.
The "Low" Status of Social Sciences
Difficult to replicate
The more developed a discipline, the higher the probability of detection and replication.
Value judgement
Science is considered "pure" and free from value judgement.
Too "obvious"
People tend to be uninterested in questioning what is already "obvious" to them. Nor do
people want to be bothered with what they do not know.
Thorndike: "That is the fate of educational research. If it comes out the way people
thought it should, they ask, 'What is the point?' If startling conclusions emerge, people say, 'I
do not believe it'."
Norms of Science
Universal Standards
The quality of research is judged by universal standards regardless of the experience,
race, sex, affiliation, or the characteristics of the researcher. e.g., Blind review process.
Common Ownership of Information
Scientific information is not proprietary but is owned and freely shared by all;
publication is not only a right but an obligation of a researcher. e.g., Data are to be shared on
request. "Publish or perish" enforces this norm.
Integrity in Gathering and Interpreting Data
The researcher displays disinterestedness and an impersonal tone when gathering data or
presenting a point of view. e.g., citing one's own name in the third person or using "the author"
to minimize the personal tone.
Organized Scepticism
It is the responsibility of the community of scientists to be skeptical of each new
knowledge claim, to test it, to try to think of reasons the claim might be false, to think of
alternative explanations. This challenge to new knowledge is sought in science, e.g., conference
debate, validation study.
The Roles and Outcomes of Research
Exploratory
Discover new phenomena and relationships among phenomena that are missed by others.
Qualitative research plays an important role here.
e.g., A counselling psychologist wants to know what things make an effective
counsellor.
Explanatory
Develop new theories or use existing theories to account for the observations.
e.g., Dollard and Doob (1939) theorized that frustration leads to aggression from the
observation that a child strikes out when deprived of a toy.
According to social learning theory, a good role model is important for school
achievement.
Validation
Validating and replicating existing research and theory is an important part of science.
Using different samples, populations, research methods.
Three Components of Educational Research Methodology
1. MEASUREMENT (PSYCHOMETRICS)
Instrumentation, Reliability, Validity
2. RESEARCH DESIGN
Sampling, Designs, Internal and External Validity
Experimental, quasi-experimental, and non-experimental research.
3. DATA ANALYSIS (STATISTICS)
Descriptive statistics, Hypothesis testing, and Various
analytical techniques
The Process of Scientific Inquiry
1. Identification of a research problem.
(why)
2. Consult the literature for a solution.
(find out why)
3. Formulation of testable hypotheses on the basis of existing theory and/or experience.
(a tentative solution)
4. Design a study with efforts to minimize extraneous factors that may contribute to the same
phenomenon or relationship you hypothesized.
(design a study)
5. Data collection. When the behaviour of subjects is measured or observed, the measurements
or observations become empirical data.
(carry out the study)
6. Data analysis. Data are summarized in such a way that the summary bears on the research
questions and hypotheses. Statistics are used to generalize from sample to population.
(report the findings)
7. Interpretation of data, adding to the existing body of knowledge. (this is why)
Research Report:
* Title Page
* Abstract
* Introduction
Problem
Significance
Justifications
Hypotheses which are integrated in the literature review
* Method
population
sample
procedures
measurement
designs
* Results
Present the results in the order of the hypotheses
* Discussion
* References
* Tables and Figures
The Process of Research:
1. Identification of a research problem. (Why)
2. Consult the literature for a solution. (Find out why)
3. Formulation of testable hypotheses on the basis of existing theory and research.
(Here is a solution)
4. Design a study to minimize extraneous factors that affect the same phenomenon or
relationship you hypothesized. (A plan to test the solution)
5. Data collection. When the behaviours are experimentally manipulated or observed, the
outcomes become data. (Carry out the plan.)
6. Data analysis. Data are summarized in such a way that the summary bears on the research
questions and hypotheses. (Report it)
7. Interpretation of data, adding to the existing body of knowledge. (Why? This is why.)
These steps correspond to the sections of the research report:
1. INTRODUCTION SECTION
Objectives and significance of the study; literature review; research questions and
hypotheses; independent and dependent variables.
2. METHOD SECTION
Sample; design (experimental, quasi-experimental, non-experimental); procedure;
validity threats; measurements; reliability and validity.
3. RESULTS SECTION
ANOVA vs. regression framework; significance tests; confidence intervals.
4. DISCUSSION SECTION
Theory and explanation; limitations and future directions.
Ways to Locate a Research Problem
1. Identify broad areas that are closely related to your interests and professional goals and write
them down.
2. Then choose, among the areas that relate to your future career, an area or a research topic that
is feasible.
3. Collaborate with other people; join on-going projects.
4. Read textbooks, where rather comprehensive topics in a field are summarized and problems
and future research needs are identified; journal articles for the state of the art of the field and
the authors' recommendations; and review articles for both.
5. Test a theory.
6. Replication. Replicate a major milestone study. Replicate studies using different populations,
samples, and methods.
7. Observations. Observe carefully the existing practices in your area of interest.
8. Develop research ideas from advanced courses you take.
9. Get ideas from newspapers and popular magazines.
Variable and Constant
A variable is an attribute or characteristic of a person or object that varies from person
to person or object to object, e.g., student achievement, motivation, blood pressure. A constant
is an attribute that does not vary from person to person, e.g., pi; in relation to a particular
population, age is a constant for 6th graders, religion for parochial schools, gender for male
prisoners, etc. When you ask a research question, you ask about variables; you want to know
the relationship among variables. Why don't students learn? Is it because they are in poor
health, not motivated, distracted by family problems or by crime, or because the teachers are not
qualified? You end up with a question regarding the relationship among variables: Is there a
relationship between motivation and achievement?
Independent variable represents the research interest and is manipulated (experiment) or
measured (non-experiment) to see the effect of its change on the dependent variable.
Dependent variable is the observed outcome in response to the independent variable. It
is used to evaluate the independent variable.
Control variable is a variable that is either made into a constant or is included in the
study (even though it is not of interest) to control or neutralize factors extraneous to the research
question.
Operational Definitions
Assign meaning to a construct or a variable by specifying the activities or "operations"
necessary to MEASURE or MANIPULATE it.
Redefine a concept in terms of clearly observable operations that anyone can see and
repeat. These observable and replicable operations can take the form of an experiment or of a
measurement instrument.
As Cronbach and Meehl (1955) point out, it is rare for a construct to receive one
commonly endorsed operational definition. To some researchers, hunger is defined in an
animal experiment as "amount of time since last feeding." It may also be defined by others as
"amount of energy an animal would expend to seek food." Thus, it is important to be
operationally clear about a particular construct, so that other researchers understand, for
example, what the construct "hunger" is intended to mean.
Measured operational definition (more often used):
Intelligence is defined as scores on the Woodcock-Johnson Test of Cognitive Abilities.
Vagueness of lecturing is defined as using the following words: A couple, a few,
sometimes, all of this, something like that, pretty much.
School achievement is defined as one's GPA.
Socioeconomic status is defined by the number of years of education and the amount
of salary the head of a family receives.
Popularity is defined operationally by the number of friendship nominations a student
receives from his/her school mates.
Experimental operational definition:
Recall is defined by asking subjects to recite items shown to them from a stimulus list
and assigning a point for each item that matches one on the list.
Recognition is defined by showing subjects items and asking them to decide whether
they were part of the stimulus list.
Aggression is defined as the number of times a child hits a toy doll after watching a
violent TV show.
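An experimental operational definition like the one for recall above can be made concrete as a scoring rule. The sketch below is illustrative only; the word lists are hypothetical, not from the notes.

```python
# A toy sketch of scoring the recall operational definition: one point
# per recited item that matches an item on the stimulus list.
stimulus_list = ["apple", "chair", "river", "cloud", "piano"]
recited = ["chair", "apple", "moon", "piano"]  # hypothetical responses

# Score only unique matches against the stimulus list
recall_score = len(set(recited) & set(stimulus_list))
print(recall_score)  # 3 recited items appear on the list
```

The point is that anyone applying the same rule to the same responses gets the same score, which is what makes the definition operational and replicable.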
Distinguishing Between Two General Types of Literature Reviews
There are two general types of literature reviews, each possessing unique as well as
common characteristics. Making the distinction prior to embarking on the review is important
to both your own mental health and the quality of the product. The two types are:
1.A critical review of a literature
2.A review of literature relevant to a research proposal
Following are some "is" and "is not" statements for each type:
Critical Review of a Literature
Is a place where you may review the body of literature that bears on a problematic area -- or even
to examine all the research that relates to the specific question raised in a research proposal.
Is an activity the product of which is devoted to critical retrospective on scholarship -- and
publishable in such journals as the Psychological Bulletin, Psychological Review.
Is not encumbered with supporting the conceptual framework of a proposed study or of
justifying study design and methodology decisions.
Review of Literature Relevant to a Research Proposal
Is an obligation to place the question or hypothesis in the context of previous work in such a way
as to explain and justify the decisions made.
Is a product that reflects a step-by-step explanation of decisions, punctuated by references to
studies that support the conceptual framework and ongoing argument.
Is not a product to educate the reader concerning the state of science in the problem area; nor is
it to display the thoroughness with which the author pursued a comprehensive understanding of
the literature.
How to Write a Lit Review
Break up the review into several topic areas.
Organize all the findings under the various topics into a unified picture of the state of
knowledge in the area reviewed. The process of combining and interpreting the literature is
more difficult than merely reviewing what has been done.
Use two to three studies that are most pertinent and well done as foundations of your
review topics. Use similar studies as support.
Write the review as if you are expressing your own thoughts and developing and
building your own arguments and themes but not as if you are reporting others' work.
Don't do article by article listing of things.
Don't use the same format, e.g., Baker found...
Rather than citing everything in an article in one place, cite an article multiple times to
fit different themes of yours.
Write down your thoughts and paraphrase important points of the articles as you read.
It may not be a good idea to read all the articles and then write.
Look over the articles before copying them. Read several carefully before looking for
more.
2: Hypothesis Testing
1. The hypothesis should state an expected relationship between two or more variables.
2. The researcher should have definite reasons, based on either theory or evidence, for
considering the hypothesis worthy of testing.
3. The hypothesis should be testable. The relationship or difference that is stated in a hypothesis
should be such that measurement of the variables involved can be made and the necessary
statistical comparisons carried out in order to determine whether the hypothesis as stated is or is
not supported by the research.
4. The hypothesis should be as brief as possible.
There is a gender difference in the perception of body sensations.
Women and men use physiological cues (internal) and situational factors (external)
differently in defining bodily state.
Women, compared to men, make greater use of external cues in defining their body
sensations.
There is a relationship between information processing techniques and subsequent recall
of information.
Visual imagery has a greater enhancing effect on recall than verbal recitation.
People tend to apply dispositional attribution to account for behaviours of others and use
situational attribution to explain behaviours of themselves.
Induced self-consciousness enhances recall of personal information.
Teachers who use specific feedback during lectures obtain higher pupil achievement
gains than teachers who use general feedback.
High intimacy self-disclosing statements would be more effective in counselling than
low intimacy self-disclosing statements.
Hypothesis Testing
A hypothesis is always about a population.
Testing a hypothesis means drawing inference from a random sample to the population
where the sample is taken.
1. Research hypothesis reflecting your verbal reasoning. The wording often reflects the research
design.
e.g., There is a relationship between motivation to learn and math achievement.
Girls have higher math achievement than boys.
The effect of induced public self-consciousness is stronger among adolescents than among adults.
2. Statistical hypothesis reflecting the statistics used to summarize your observations.
e.g., ρ > 0: There is a positive correlation between motivation to learn and math
achievement. The statistic of correlation is used to summarize data.
μg > μb: Mean math achievement of girls is higher than the mean of boys. Mean is used
to summarize data.
3. Null hypothesis representing a way to test the statistical hypothesis.
μg = μb. The mean math achievement of girls is the same as the mean of boys.
ρ = 0. There is no correlation between motivation to learn and math achievement.
4. Statistical tests are conducted with the assumption that the null hypothesis is true.
What is the probability of finding a positive correlation when the truth is there is no
correlation?
What is the probability of finding a difference between the two means when there is no
difference?
Statistical Significance
The probability level at which you will reject the null hypothesis, or, at which you will
allow yourself the risk of wrongly rejecting the null hypothesis.
Type I Error
Significance level is also Type I error rate. It is the probability of rejecting the null
hypothesis when the null hypothesis is true. You make such an error only when the null is
rejected.
Type II Error
It is the probability of not rejecting the null hypothesis when the null hypothesis is false.
You make such an error only when you fail to reject the null hypothesis.
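The meaning of the Type I error rate can be shown by simulation. The sketch below (not from the notes; the sample sizes and critical value are illustrative) draws both groups from the same population, so the null is true by construction, and counts how often a two-tailed t-test nonetheless rejects it. The rejection rate should come out near α = .05.

```python
# Simulating the Type I error rate: when the null is true, about
# alpha of all tests still reject it, purely by sampling fluctuation.
import math
import random

random.seed(1)

def two_sample_t(x, y):
    """Pooled-variance two-sample t statistic."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    ssx = sum((v - mx) ** 2 for v in x)
    ssy = sum((v - my) ** 2 for v in y)
    pooled = (ssx + ssy) / (nx + ny - 2)
    return (mx - my) / math.sqrt(pooled * (1 / nx + 1 / ny))

reps, n, critical = 4000, 30, 2.0017  # two-tailed t(.975, df = 58)
rejections = 0
for _ in range(reps):
    # Both samples come from the SAME population, so the null is true.
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    if abs(two_sample_t(a, b)) > critical:
        rejections += 1  # a Type I error by construction

print(f"Empirical Type I error rate: {rejections / reps:.3f}")  # near .05
```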
Sampling distribution of means (or any statistic)
Is an imagined or theoretical distribution of an infinite number of means computed from
random samples of the same size. Because of the central limit theorem, this distribution is used
as a probability distribution to determine the probability of obtaining a mean larger than or as
large as (in absolute value) the one computed from your sample.
Central limit theorem
1. If repeated random samples of size n are drawn from a normally distributed
population, the distribution of the sample means is normal.
2. As the sample size increases, regardless of the shape of the population distribution, the
sampling distribution of means approximates normality.
3. The mean of the sampling distribution of means equals the population mean.
4. The standard deviation of the sampling distribution of means equals the population
standard deviation divided by the square root of the sample size. This is called the standard
error of the mean.
If population variance is not known, sample variance can be used as an estimate of
population variance in computing the standard error.
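Points 2-4 of the central limit theorem can be checked by simulation. The sketch below (illustrative only; the exponential population and sample size are my choices) draws many samples of n = 50 from a heavily skewed population with mean 1 and standard deviation 1, then verifies that the sample means center on the population mean with spread σ/√n.

```python
# Simulating the central limit theorem: means of samples from a skewed
# population pile up around the population mean with spread sigma/sqrt(n).
import math
import random
import statistics

random.seed(0)

n, reps = 50, 20000
# Exponential population: mean = 1, standard deviation = 1, heavily skewed
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(reps)
]

grand_mean = statistics.fmean(sample_means)
se_observed = statistics.stdev(sample_means)
se_theory = 1 / math.sqrt(n)  # population sd / sqrt(n)

print(f"mean of sample means: {grand_mean:.3f} (population mean: 1)")
print(f"observed SE: {se_observed:.3f}, theoretical SE: {se_theory:.3f}")
```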
Four steps in hypothesis testing (with an example):
1: State the null and alternative hypotheses.
Example: H0: μ1 - μ2 = 0; H1: μ1 - μ2 > 0
2: Set the level of statistical significance, which is the probability at which you'll reject the
null or at which you'll allow yourself to make the Type I error.
Example: α = .05; t(.05, 28) = 1.70
3: Compute the test statistic, which can be a t-test, z-test, F-test, chi-square, etc.
Example: t(28) = 2.85
4: Make a decision about the null. If you reject the null, you may make a Type I error, the
probability of which is set at step 2. If you do not reject the null, you are running the risk of
making a Type II error, the probability of which can be calculated if you know certain
parameters.
Example: 2.85 > 1.70, so reject the null and support your research (alternative) hypothesis.
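The four steps can be sketched numerically. The data below are hypothetical achievement scores invented for illustration (15 girls, 15 boys, so df = 28 as in the example above); the test is a standard pooled-variance two-sample t-test.

```python
# The four hypothesis-testing steps with a pooled-variance t-test.
import math

# Hypothetical achievement scores for two groups
girls = [82, 78, 85, 90, 75, 88, 84, 79, 86, 81, 77, 89, 83, 80, 87]
boys  = [76, 72, 80, 74, 78, 70, 75, 79, 73, 77, 71, 81, 74, 76, 78]

# Step 1: H0: mu_g - mu_b = 0;  H1: mu_g - mu_b > 0 (one-tailed)
# Step 2: alpha = .05; critical value t(.05, df = 28) is about 1.701
critical_t = 1.701
n1, n2 = len(girls), len(boys)
df = n1 + n2 - 2  # 28

# Step 3: compute the test statistic
mean1 = sum(girls) / n1
mean2 = sum(boys) / n2
var1 = sum((x - mean1) ** 2 for x in girls) / (n1 - 1)
var2 = sum((x - mean2) ** 2 for x in boys) / (n2 - 1)
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t = (mean1 - mean2) / se

# Step 4: decision about the null
reject_null = t > critical_t
print(f"t({df}) = {t:.2f}, reject H0: {reject_null}")
```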
An Example
Hypothesis:
People high in public self-consciousness are more conforming to perceived social norms
on gender roles (than those low in public self-consciousness).
There is a relationship between public self-consciousness and gender role conformity.
Independent variable is public self-consciousness.
Dependent variable is gender role conformity.
Operational definitions:
Public self-consciousness is measured by the Self-Consciousness Scale (Fenigstein,
Scheier, & Buss, 1975; Scheier & Carver, 1985).
Gender role conformity is defined by the following operations: Ten gender role attitudes
questions were used to first determine participants' own standings on these gender role
questions. The participants were then informed of the mean ratings of their peers on these
gender role questions and were asked to re-assess their attitudes toward these gender roles.
Conformity to social norms on gender roles is measured by the difference score between the two
self-assessments on the ten gender role questions.
Statistical hypothesis:
μhigh - μlow > 0 (high vs. low public self-consciousness groups)
This implies that the statistic, mean, is used to summarize sampled data that bear on the
hypothesis.
or
ρ > 0, implying that the statistic, correlation, is used to summarize data.
Null hypothesis:
μhigh - μlow = 0
or
ρ=0
Significance level: α = .05
Hypothesis testing rationale:
The hypothesis is about populations.
The null assumes that there is no mean difference between the two populations (groups).
My hypothesis assumes that there is a mean difference (in the direction hypothesized)
between the two populations.
The purpose of hypothesis testing is to make the qualitative decision regarding whether
my samples are taken from the populations defined by the null (decision: accept null and your
research hypothesis is not supported) or are taken from the populations defined by the
alternative (research) hypothesis (decision: reject null and your research hypothesis is
supported).
The hypothesis testing starts with the assumption that the null is true. Even though the
null is true, there is a good chance that, due to sampling fluctuation, you will find some small
magnitudes of difference in your samples. The chance for you to find large differences,
however, should be very small. In fact, such chance is so small that you should no longer
attribute the difference to sampling fluctuations but to the possibility that the null is not true.
This chance is the probability associated with your computed sample statistic. As this
probability gets smaller, you grow more doubtful about the truth of the null to the point that you
make the qualitative decision that the null is not true (reject the null). This point is your
significance level and your decision is associated with the possibility of a type I error.
Another example:
Induced state of public self-consciousness increases gender role conformity.
State of public self-consciousness is the independent variable. It is induced by video
taping the participants while they are assessing their gender role attitudes in relation to these
attitudes of their peers. The knowledge that they are on camera induces the state of public
self-consciousness.
3: Measurement and Testing
Reliability
Classical test theory, also known as true score theory, is mostly concerned with test reliability
or the reliability of observed scores of a test in measuring the underlying true abilities or true
scores. Reliability can be defined as the strength of the relationship between observed scores
and true scores.
If we were to administer a test to the same person under all different conditions at
different times using different items, there would be different observed scores. The mean of all
these observed scores is the person's true score, or true ability or personality.
In reality, we only give the person one test, and there is only one observed score. This
score can be seen as a random variable, or as a randomly sampled observation from a distribution
of all possible observed scores. The observed score can be seen as consisting of the mean of the
distribution (or the true score) and a deviation from the mean which is called error or error score.
Thus, X = T + E.
The extent to which an observed score represents the true score is reliability. We can use
the Pearson product moment correlation, ρ, to describe the strength of the relationship between
observed scores and true scores, i.e., reliability. Thus, ρxt is called the reliability index. (Note
that ρxt is not the reliability coefficient.) Of course, we do not know the true scores and thus
cannot solve for ρxt. But assumptions can be made that enable the solving of ρxt. These
assumptions, which are not discussed here, make up classical test theory.
With the assumptions, we can numerically estimate the reliability of a test without
knowing its true scores. First, we define, numerically, the square of the reliability index as the
ratio between true score variance and observed score variance. This is called the reliability
coefficient. (Note that this is ρxx or ρxx'.) Second, we can estimate the reliability coefficient by
simply correlating two parallel tests (or two forms of a test, two halves of a test, or two
administrations of a test). The result is a Pearson correlation coefficient, r, which is an estimate
of the reliability coefficient. Keep in mind that the meaning of a Pearson r, when used as an
estimate of the reliability coefficient, is really that of r², representing the proportion of the
observed score variance that is true score variance. r ranges from -1 to 1 whereas r² ranges
from 0 to 1. That is why r, as a reliability estimate, ranges from 0 to 1 and not from -1 to 1.
Depending on which kinds of two tests are being correlated to arrive at the reliability
estimates, these estimates are given different names as shown below:
Test-retest reliability (coefficient of stability)
Correlate two administrations of the same test.
Parallel form reliability (coefficient of equivalence)
Correlate two forms of the same test.
Split half reliability (Spearman-Brown prophecy formula)
Correlate two halves of the test.
Internal consistency reliability (Cronbach α)
Correlate every item with every other item.
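Of these estimates, internal consistency is the most common in practice. The sketch below computes Cronbach's α from its standard variance form, α = k/(k−1) · (1 − Σ item variances / variance of total scores); the item scores are hypothetical numbers invented for illustration.

```python
# A minimal sketch of Cronbach's alpha from hypothetical item scores.
import statistics

# Rows are persons, columns are the four items of a test
scores = [
    [1, 2, 1, 2],
    [2, 1, 3, 2],
    [3, 3, 2, 3],
    [4, 5, 4, 4],
    [5, 4, 6, 5],
    [6, 6, 5, 5],
]

k = len(scores[0])            # number of items
items = list(zip(*scores))    # one tuple of scores per item
item_vars = sum(statistics.variance(item) for item in items)
total_var = statistics.variance([sum(person) for person in scores])

# alpha = k/(k-1) * (1 - sum of item variances / variance of totals)
alpha = (k / (k - 1)) * (1 - item_vars / total_var)
print(f"Cronbach's alpha = {alpha:.3f}")
```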
When ρxx' = 1,
1. the measurement has been made without error (E = 0 for all examinees).
2. X = T for all examinees.
3. all observed score variance reflects true-score variance.
4. all differences between observed scores are true-score differences.
5. the correlation between observed scores and true scores is 1.
6. the correlation between observed scores and errors is zero.
When ρxx' = 0,
1. only random error is included in the measurement.
2. X = E for all examinees.
3. all observed score variance reflects error variance.
4. all differences between observed scores are errors of measurement.
5. the correlation between observed scores and true scores is 0.
6. the correlation between observed scores and errors is 1.
When ρxx' is between zero and 1,
1. the measurement includes some error and some truth.
2. X = T + E.
3. observed score variance includes true-score and error variance.
4. differences between scores reflect true-score differences and error.
5. the correlation between observed scores and true scores is the reliability index, the
square root of the reliability coefficient.
6. the correlation between observed scores and errors is the square root of 1 - reliability.
Validity
The validity of the use of a test refers to the extent to which the test truly measures what
it is expected to measure. For example, the use of a bathroom scale to measure weight is valid
whereas the use of a bathroom scale to measure height is invalid. Commonly discussed
validities include content, construct, and predictive validity.
Kinds of Validity Evidence
Content validity refers to the extent to which the items on a test are representative of a
specified domain content. For example, a test that is intended to measure the content of this
course should contain items about reliability, validity, intelligence and personality tests. If this
test is made up of items on calculus or matrix algebra, the test will have no content validity.
Achievement and aptitude (but not personality and attitude) tests are concerned with content
validity.
Construct validity refers to the extent to which items on a test are representative of the
underlying construct, e.g., personality or attribute. Personality and attitude tests are concerned
with construct validity. The process to establish construct validity is referred to as construct
validation. Construct validation is complicated, involving testing hypotheses concerning the
theories from which the test is derived. A common practice is to compare high scorers with low
scorers on the test with respect to some external behavior which is hypothesized to correlate
with the test.
Construct validity has often been narrowly interpreted as providing evidence for the
internal structure of a test. Another term for this narrow definition of construct validity is
factorial validity, because such validity evidence is gathered through factor analysis. The correct
definition of construct validity refers to gathering evidence for a broad nomological network of
relations.
In 1959, Campbell and Fiske published an important paper, which has become one of
the most cited papers in psychology. In this paper, they conceptualize validity issues in a
Multitrait-Multimethod (MTMM) correlation matrix. In this MTMM matrix, convergent
validity (CV) is the correlation between measures of the same trait obtained by different
methods. This correlation coefficient should be high. Discriminant validity (DV) concerns the
correlation between different traits obtained by the same method. A low value indicates
evidence of validity. The correlation between different traits obtained by different methods
(HH, heterotrait-heteromethod) should be the lowest in value. This MTMM concept has since
become an important part of construct validity, in terms of both its narrow definition and the
broad definition.
Criterion related validity, including predictive validity and concurrent validity, refers to
the extent to which a test correlates with future behaviors which the test is intended to predict.
For example, the HKALE is intended to select students who are capable of university studies.
For the HKALE to have predictive validity, its test scores should correlate with undergraduate
performance, such as the GPA. Predictive validity is simply the correlation between the test and
a criterion measure the test is intended to predict. The correlation coefficient is also called
validity coefficient. Sometimes, because of the lack of a criterion, one correlates the test with
another test that purports to measure the same thing. This is called concurrent validity.
Making Sense of Validity
Validation is an ongoing process where one keeps accumulating validity evidence. But
one is not capable of obtaining all the evidence at once. Certain evidence is never attempted.
For different uses of a test, the gathering of some evidence becomes more important than others,
giving rise to different validity concepts and procedures. Over time, this discriminating use of
evidence in relation to different types of tests becomes a tradition, so that a certain test and its use
are routinely associated with one kind of validity evidence but not with other kinds. For example,
most educational achievement tests are only concerned with content validity; i.e., whether the test
items are representative of what has been taught within a specified domain content. One
purpose of education is to create a competent work force and, thus, a valid test of such education
achievement should be correlated with future job performance. However, such criterion-related
validity evidence is seldom gathered for an education achievement test in part because there is
also a strong public mentality that students must be assessed for what they have learned. As
another example, the content validity of a personality test is never questioned. The validity
concern with a personality test lies in establishing the linkage between the test items and the
underlying trait structure as defined by a theory within which the test is conceptualized. Factor
analysis is often used to see if the items form clusters according to the theory and to see if the
items correlate with measures of other constructs according to specified patterns of relationships
defined by the theory. These efforts are referred to as construct validation.
Put simply, content validity is associated with an achievement test just as construct
validity is with a personality or attitude test. The former is concerned with the
representativeness of the items with respect to a specified domain content. The content
validation procedures are qualitative or judgmental. The latter is concerned with the
representativeness of the items with respect to a defined theoretical construct. The procedures
are referred to as construct validation which may involve different data collection techniques
and strategies, some of which could be thought of as a different validity procedure, e.g.,
criterion related validity. Finally, some tests do not need validity evidence. For example, a test
of typing which is administered to measure the status of an individual's typing skills is its own
criterion. Such "obviously valid" tests do not need additional validity evidence.
An important issue associated with criterion validity is what is referred to as the
restriction of range effect. Validity coefficient is a correlation coefficient the magnitude of
which depends on the ranges of scores for the predictor variable (test) and the criterion (a future
behavior to be predicted by the test). A correlation based on the full range of scores will always
be higher than that based on a restricted range of scores, independent from the true predictability
of the test. In reality, most of such tests are used to make selections (of students or employees
for example). The validity study is always conducted on those who passed the test and were,
thus, given the opportunity to demonstrate the future performance. Thus, the validity is based
on a restricted range of scores rather than the full range. The real predictive validity of a test
should always be higher than what is obtained from a validity study. There are, however,
statistical procedures to adjust the validity coefficient.
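The restriction of range effect is easy to demonstrate with a short simulation (hypothetical scores; a sketch, not a real validation study): the same test-criterion relationship yields a smaller validity coefficient once only those who "passed" the test are retained.

```python
import random
import statistics

def corr(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

random.seed(1)
# Test and criterion scores share a common ability factor.
ability = [random.gauss(0, 1) for _ in range(5000)]
test = [a + random.gauss(0, 0.5) for a in ability]
criterion = [a + random.gauss(0, 0.5) for a in ability]

full = corr(test, criterion)           # validity over the full range of applicants
cutoff = statistics.median(test)       # keep only those who "passed" the test
kept = [(t, c) for t, c in zip(test, criterion) if t > cutoff]
restricted = corr([t for t, _ in kept], [c for _, c in kept])
print(full, restricted)   # the restricted-range coefficient is noticeably smaller
```

Selecting on the test shrinks the variance of test scores among those observed on the criterion, which by itself lowers the correlation even though the underlying predictability is unchanged.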
Another important issue is the relationship between validity and reliability. Remember
that ρxx' = ρ²xt; the reliability coefficient is the squared correlation between the observed scores
and the true scores. If ρxx' = .81, then ρxt = .90 (the correlation between the observed scores and
true scores is .90). In general, ρxt > ρxx'. That is, an observed score will correlate more highly
with its own true score than with an observed score on a parallel test. Because a test cannot
correlate more highly with any other test or variable than with its own true score, the maximum
correlation between an observed score and any other variable is the square root of the reliability
coefficient, ρxt.
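The relation ρxt = √ρxx' can be checked with a small simulation of the classical true-score model (made-up numbers; two parallel forms share the same true scores and have equal error variance):

```python
import random
import statistics

def corr(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    den = (sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b)) ** 0.5
    return num / den

random.seed(0)
# Classical true-score model: observed = true + error.
true = [random.gauss(50, 10) for _ in range(20000)]
form1 = [t + random.gauss(0, 5) for t in true]   # observed scores
form2 = [t + random.gauss(0, 5) for t in true]   # parallel form, same error variance

reliability = corr(form1, form2)   # correlation between parallel forms
r_xt = corr(form1, true)           # correlation of observed with true scores
print(reliability, r_xt, reliability ** 0.5)   # r_xt is close to sqrt(reliability)
```

With true-score variance 100 and error variance 25, the expected reliability is 100/125 = .80, and the observed-true correlation comes out near √.80 ≈ .89, as the formula predicts.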
Sometimes, the distinction between reliability and validity is blurred. For instance,
Lindquist (1942) defined validity as the correlation between the fallible (the test to be validated)
and infallible measure of a trait. Assuming the infallible criterion to be perfectly reliable and the
(fallible) test to be perfectly representative of the criterion, the maximum validity of the (fallible)
test would have to be the correlation of the observed test scores with true scores on the test itself
which is the square root of the reliability of the fallible test. This is basically the criterion related
validity -- correlation between the observed scores of the test and observed scores of the
criterion.
4: Research Validity and Sampling Techniques
Research Validity
Research validity can be simply understood as the quality of a research study. There are
two kinds of quality issues that are referred to as internal validity and external validity. When we
try to pinpoint the cause of a phenomenon or behavior either by experimentally manipulating the
independent variable or by measuring or observing the independent variable, we are trying to
demonstrate that it is the independent variable that “causes” changes in the outcome variable or
in the dependent variable. Internal validity is about the extent to which we can make this causal
inference. Internal validity is the extent to which the outcomes of a study result from the
variables which were manipulated, measured, or selected in the study rather than from other
variables not systematically treated. e.g., there might not be a relationship between televised
violence and aggressive behavior; rather, children who watch violent programs are more
aggressive in the first place. Or, different brands of deodorants were tested on the left and right
arms, which provides unequal testing conditions. An expectancy effect will also make people feel
the advocated brand lasts longer. In other words, internal validity is about how confident we are
about the stated “causal” relationship between the independent variables and the dependent
variable. That is, the dependent variable is due to independent variable but not due to something
else. To improve internal validity and thus the quality of a research study, we need to be able to
rule out alternative “causes” that may have done the same thing to the dependent variable as does
our independent variable. Some of the commonly encountered alternative causes include:
History: Events take place during the study that might affect its outcome in the same
way that the independent variable is hypothesized to affect the outcome. e.g., a study examined
whether a certain leadership training program was effective in enhancing students’ sense of
competitiveness; in the three months of the experimental study, the TV show, Survivor, was on
which could have made the students more aware of competition and become more experienced
with some competition strategies.
Maturation: Especially for developmental studies where children grow with the passage
of time to become more mature in certain developmentally related abilities. e.g., a study
showing students’ vocabulary increase from a dialectical reading program may not have
internal validity due to maturation because children’s vocabulary increases with time
independent of the training program.
Testing: When people are measured repeatedly, e.g., pretest-posttest, they become
better not because of the independent variable but because they become test smarter. e.g., a
study showing math scores improved over the pretest due to a new teaching method might not
have internal validity because students work much faster or better because of their experience
with the pretest.
Instrumentation: The effect on the dependent variable is not due to the independent
variable but due to aspects of the instrument used in the study. Like the testing threat, this one
only operates in the pretest-posttest situation. e.g., observed change from pretest to posttest is
due not to a math program that is being experimented but rather to a change in the test that was
used. The posttest could simply be easier. Instrumentation threats are especially likely when the
"instrument" is a human observer. The observers may get tired over time or bored with the
observations and thus become more lenient or more stringent in their ratings. Conversely,
they might get better at making the observations as they practice more and become more
accurate than they were at the beginning. In either event, it is the change in
instrumentation, not the independent variable, that leads to the observation in the dependent
variable.
Regression towards the mean: Particularly problematic when subjects are chosen
because of extreme scores where high scoring individuals are more likely to score lower and
low scoring individuals are likely to score higher the next time they are tested merely due to
random measurement error. e.g., a study showing a teacher is effective in improving the scores
of students who are at the bottom in the first term may not have internal validity due to the
regression towards the mean artifact because by chance alone these students at the bottom will
tend to increase (there is a greater chance for top students to fall towards the mean; while
bottom students have a greater chance to rise towards the mean).
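Regression towards the mean can be demonstrated by simulation (hypothetical scores; no intervention of any kind is applied between the two testings):

```python
import random
import statistics

random.seed(2)
# Each student has a stable true score; each test adds independent random error.
true = [random.gauss(60, 8) for _ in range(2000)]
term1 = [t + random.gauss(0, 6) for t in true]
term2 = [t + random.gauss(0, 6) for t in true]

# Select the bottom 10% on the first test.
cut = sorted(term1)[len(term1) // 10]
bottom = [i for i, s in enumerate(term1) if s <= cut]
m1 = statistics.mean(term1[i] for i in bottom)
m2 = statistics.mean(term2[i] for i in bottom)
print(m1, m2)   # the group mean rises on retest purely by chance
```

Students selected for extremely low first-term scores were, on average, unlucky on that test, so their second-term mean moves back towards the population mean even without any teaching effect.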
Selection: Results are due to pre-existing differences between the treatment and control
groups rather than to the independent variable. e.g., you want to compare a new
coaching method against the existing method used in PE lessons and recruit volunteers to
participate in three days of training of the sport. You then compare the results with those of
some of the existing PE lessons. Volunteers are simply more motivated than those attending
regular PE class and thus may produce better training outcome (dependent variable) which has
little to do with training method (independent variable).
Mortality: Especially for longitudinal studies that last for an extended period of time,
attrition or dropping off from the study in a non-random manner may affect the outcome of the
study. That is, some participants no longer want to continue with the study and the results are
based on those who stayed in the study, who are different from those who dropped out in some
fundamental ways. e.g., a special intervention may be tried out in a school but the results
showed that their students’ average HKCE results to be much worse than those from regular
schools; this study may not have internal validity due to mortality because the weakest students
may have dropped out from the regular schools.
Diffusion or imitation of treatment: The control group or one of the treatment groups
somehow ends up receiving some of the same treatment as the other groups, resulting in few
differences among the treatments. e.g., a study shows that a new teaching method (tried out in
one class) does not lead to better achievement than the traditional method (in another class) in
part because students from the two classes constantly compare notes and exchange information.
Hawthorne effect refers to the fact that when participants receive unusual treatment in
a field experiment, they may temporarily change their behavior or performance not because of
the manipulation of the independent variable but because of the special attention they received
during the experimentation. The term gets its name from a factory called the Hawthorne Works,
where a series of experiments on factory workers was carried out between 1924 and 1932.
Among the many experiments, one that stands out and is often talked about is the illumination
experiment, where researchers came to the factory to change the lights all the time (and probably
made casual conversation with the workers). In comparison to the control group, which did not
experience lighting changes, productivity of the experimental group seemed to increase
independent of how the lighting was adjusted.
John Henry effect: Whereas the Hawthorne effect is due to some unusual performance of
the experimental group, sometimes the control group may also put up an extraordinary
performance to outperform the experimental group due to a sense of demoralization for not
being included in the "special treatment" experimental group. The term comes from the story
of a railroad worker, John Henry, who probably felt threatened when a spiking machine (to drive
spikes to stabilize the rails) was introduced to replace the manual work; he outperformed the
machine but later died of a heart attack.
External Validity
External validity is the extent to which the findings of a particular study can be
generalized to people or situations other than those observed in the study. Can findings from
the laboratory be applied to the real world, where many factors influencing behavior that were
controlled in the lab are free to operate? All the factors threatening internal validity can be
controlled in a lab, but the results may become less generalizable outside the lab. e.g., a study on a new
teaching method with a certain group of S.4 students does not have external validity if the
results cannot be generalized to other topics taught by the same new method, or to other S.4
students. Many threats to external validity can be understood in the form of an interaction:
Treatment-attribute interaction: Certain personality and other characteristics may
interact with the independent variable so that the effect of the independent variable may be
different on people having different personality characteristics. e.g., A study showing
democratic parenting style improves academic achievement may lack external validity if we
can argue and show that less parental demand and supervision only work with highly motivated
students but not other students.
Treatment-setting interaction: The independent variable may interact with other
external factors or contexts to result in different effects for different settings so that the effect
cannot be generalized to all settings. e.g., The positive effect of democratic parenting on school
achievement may lack external validity if democratic parenting (independent variable) works
only when schools provide clear structure and strict supervision regarding student learning.
Pretest sensitization: The effect of the independent variable may be due to the pretest
which serves to sensitize the subjects whereas in the population (the real world to which the
findings are to be generalized), there is no pretest and thus the treatment (independent variable)
may not work as it does in the study. A study showing that a math teaching method worked
because the post test improved over the pretest may lack external validity if the pretest helped
the teachers and students to identify learning difficulties or made them more aware of certain
weaknesses and the teaching method focused on those weaknesses.
Posttest sensitization: The effect of the independent variable is in part due to the
sensitization or exercising effect of the posttest, which is not available in the population to which
the results are to be generalized. e.g., in the above study on a new math teaching method, the
posttest served to reinforce the effect of the teaching method, and thus the students in the study
improved their math performance because they had both the new teaching method and the
posttest.
Sampling Techniques
An element is an object on which a measurement is taken. It is not the person or thing
but a particular measurement of the person or thing that is of interest. e.g., persons' height, class
size.
A population is all the elements in a defined set about which we wish to make an
inference. Examples of target population vs. experimentally accessible population are heights
of Chinese vs. heights of Shanghai residents.
Sampling units are non-overlapping collections of elements from the population. The
sampling unit can be the element.
Sampling frame is a list of sampling units.
A sample is a collection of sampling units drawn from a frame.
Simple random sample
A sample of size n drawn from a population of size N in such a way that every possible
sample of size n has the same chance of being selected is a simple random sample.
M = Σxi / n is the sample estimate of population mean, μ.
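As a quick illustration (hypothetical height data), a simple random sample and its estimate of μ:

```python
import random
import statistics

random.seed(3)
population = [random.gauss(170, 10) for _ in range(10000)]   # e.g., heights in cm
mu = statistics.mean(population)

sample = random.sample(population, 100)   # every size-100 sample is equally likely
M = sum(sample) / len(sample)             # M = Σxi / n
print(mu, M)                              # the sample mean estimates μ
```

`random.sample` draws without replacement, so each possible sample of 100 units has the same selection probability, which is exactly the definition above.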
Stratified random sampling
The population of N units is divided into subpopulations of N1, N2, ..., Nh units which are
non-overlapping so that N1 + N2 + ... + Nh = N. The subpopulations are called strata. A sample is
drawn from each stratum. The samples are denoted as n1, n2, ..., nh. If a simple random sample is
taken from each stratum, the whole procedure is called stratified random sampling.
When the population is heterogeneous, stratified sampling increases the precision in
estimating population parameters. Breaking up the population makes each stratum
homogeneous (measurement varies little among units) so that a small sample is needed to
estimate the population characteristics of the stratum. These strata estimates can be combined
into a precise estimate of the whole population.
Wh = Nh / N is the stratum weight.
fh = nh / Nh is the sampling fraction in the stratum.
Mst = Σ WhXh = Σ NhXh / N is the stratified sample estimate of the population mean, μ.
If in every stratum, nh / n = Nh / N, the sampling fraction is the same in all strata.
Such stratification is called stratification with proportional allocation of the nh. Under
proportional allocation, Mst = Σ nhXh / n.
NMst = N1X1 + N2X2 + ... + NhXh is the stratified sample estimate of the population total, τ.
Cluster sampling
A cluster sample is a simple random sample in which each sampling unit is a collection,
or cluster, of elements. It is used when 1) a good frame listing population elements either is
unavailable, unreliable, or costly; 2) the cost of obtaining observations increases as the distance
separating the elements increases. For example, when sampling in the field in agricultural
research, it is hard to do random sampling by running around. In quality inspection of light
bulbs contained in boxes, light bulbs are the elements and the boxes can be the sampling units.
Travelling within a city (to obtain a simple random sample of city residents) is more expensive
than travelling in a city block (to get a cluster sample of city blocks as sampling units). The
rationale for choosing the unit size for cluster sampling is a decision on the unit that gives smaller
sampling variance for a given cost or on the cost for a prescribed variance. As a general
rule, the number of elements within a cluster should be small relative to the population size, and
the number of clusters in the sample should be reasonably large.
Mcl = Σxi / Σmi is the cluster sample estimate of the population mean, μ, where xi is the total
of the observations (sum of elements) in the ith cluster, and Σxi indicates summing over the i = 1
to n sampled cluster totals; mi is the size of the ith cluster, and Σmi indicates summing over the
i = 1 to n sampled cluster sizes.
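A numerical sketch of the cluster estimator (hypothetical boxes of light bulbs as clusters):

```python
# Each sampled box (cluster) contributes its cluster total and its cluster size.
cluster_totals = [48, 60, 35, 52]   # xi: total of the observations in cluster i
cluster_sizes = [10, 12, 7, 11]     # mi: number of elements in cluster i

M_cl = sum(cluster_totals) / sum(cluster_sizes)   # Mcl = Σxi / Σmi
print(M_cl)   # 4.875
```

Note that the estimator divides the grand total of all sampled observations by the total number of elements sampled, not by the number of clusters.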
Systematic sampling
Randomly selecting one element from the first k elements in the frame and every kth
element thereafter is called a one-in-k systematic sample. It is useful in field studies, e.g.,
selecting every 10th tree, every 20th file, or every 15th shopper who passes by an aisle in a
supermarket, until a predetermined n is achieved. If different persons originally handled different sections of
the files, or each clerk deals with a surname, then systematic sampling provides more accurate
information than simple random sampling which could, by random chance, select n files filed
by one clerk. Parameter estimations are the same as those of the simple random sampling.
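A one-in-k systematic sample is straightforward to code; a small sketch (hypothetical frame of 100 numbered files):

```python
import random

def one_in_k_sample(frame, k, seed=None):
    """Draw a one-in-k systematic sample from a list of sampling units:
    pick a random unit among the first k, then every kth unit thereafter."""
    rng = random.Random(seed)
    start = rng.randrange(k)
    return frame[start::k]

frame = list(range(1, 101))              # e.g., 100 numbered files
sample = one_in_k_sample(frame, 10, seed=4)
print(sample)                            # 10 units, evenly spread through the frame
```

Because the selected units are spaced k apart, the sample automatically spreads across all sections of the frame, which is what protects against clerk-by-clerk clustering in the file example.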
Sample size
The method used to select the sample is of utmost importance in judging the validity of
the inference made from the sample to the population. The representativeness of the sample is
more important than the size of the sample. A representative sample of 100 may be preferable
to an unrepresentative sample of 100,000. The size of a sample can never compensate for a lack
of representativeness (bias).
Having established sample representativeness, using the right sample size becomes an
important economical decision. Each observation taken from the population contains a certain
amount of information about the population parameter or parameters of interest. Since
obtaining information is costly, one must decide how much information to sample. Too little
prevents good estimates; too much may be a waste given limited economic resources.
The quality of information obtained in a sample depends upon the number of elements sampled
(sample size) and the amount of variation in the information (population variance). Let's look
at an example.
What proportion of people are left-handed? How big a random sample is needed to
answer the question? Or, in other words, how big an error, or margin of error, is to be tolerated?
Let's set the margin of error at no bigger than 10%. If the sample estimate is 20%, you will at
least be confident that the population proportion is between 10% and 30%. But you cannot
guarantee that every sample, including the one you draw, will have this margin of error unless
the whole population is sampled. Sometimes, a sample may have sampling error higher than
10%. Sometimes, a sample may have sampling error lower than 10%. Of course, you would be
concerned only with having an error higher than 10%. Then the question becomes how unlikely
do you want this unlucky sample having a higher than 10% error to be? Assume you want the
unlikelihood to be 5 out of 100 samples so that you can be confident, 95% of the time, that a
sample does not exceed the specified sampling error of 10%. That is, Pr(|p − P| > 10%) = 5%, or
Pr(|p − P| ≤ 10%) = 95%. You can then use some of the basic statistics to estimate the margin of error
by first estimating the population variance (of proportion of left-handers) and standard deviation
which is a function of sample size so that you can estimate how big a sample you need so that
the chance of making an estimation error of bigger than 10% does not exceed 5%.
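The calculation sketched above can be carried out with the usual normal-approximation sample-size formula n = z²p(1−p)/E² (the formula is not given explicitly in the notes; z = 1.96 is the 95% critical value, and p = 0.5 is the conservative worst case for the variance of a proportion):

```python
import math

def sample_size_for_proportion(margin, z=1.96, p=0.5):
    """Smallest n so that a sample proportion stays within `margin` of the
    population proportion with the stated confidence (normal approximation).
    p = 0.5 is the worst case, i.e., the largest variance p(1-p)."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

n = sample_size_for_proportion(0.10)   # 10% margin of error, 95% confidence
print(n)   # 97
```

So for the left-handedness question, a random sample of roughly a hundred people keeps the chance of an estimation error bigger than 10% below 5%.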
5: Experimental, Quasi-Experimental and Non-Experimental Designs
To ensure research validity or internal validity, researchers develop various ways to
identify, isolate, or nullify variability among subjects in a dependent variable that is presumably
"caused" by one or more independent variables that are extraneous to the particular relation or
relations under study. Such a research effort is called control, control of variance, or control of
extraneous variables. The most powerful way of controlling extraneous variables is
experimentation where subjects are randomly assigned to experimental versus control groups.
Other things being equal, if random assignment has been used, the groups can be assumed to be
equal in all possible characteristics except that due to the manipulation of the independent
variable. In other words, variations among subjects due to anything other than the independent
variable are scattered or spread out evenly across the randomly assigned groups. The variability
due to the manipulation of the independent variable is called systematic variance. The purpose
of experimental research is to maximize this source of variance, minimize error variance, and
control extraneous variance. Other means of controlling extraneous variance include matching,
including the extraneous variable into the study, making the extraneous variable a constant, and
using statistical methods to decompose different sources of variance. It is the extent to which
extraneous variables are controlled that distinguishes research designs into experimental,
quasi-experimental, and non-experimental designs.
Experimental design
The single most important feature of experimental research is the manipulation of the
independent variable. Researchers create changes, called treatment conditions, in variables
being researched, called independent variables, to examine the impact of these manipulations
on some outcome behaviour or phenomena called the dependent variable. Another important
feature of experimental research is the ability to control extraneous variables so that subjects
receiving different manipulations of the independent variable are equal except for the
manipulation.
Experimental research in social science is developed from both physics research and
biological research models. In physics research the primary means of controlling extraneous
factors is through artificial controls, such as isolation, insulation, sterilization, strong steel
chamber walls, soundproofing, lead shielding, etc. These methods ensure the reproduction of
similar conditions and the consequent production of certain effects. As biological research
moved from the laboratory to the open field, the modern theory of experimental control through
randomized assignment to treatment emerged. Agricultural research compares yield per acre for
different crops or for different fertilizers, raking, or plowing methods. One of the greatest breakthroughs in
experimental design was the realization that random assignment provided a means of
comparing the outcomes of different treatments in a manner that ruled out most alternative
interpretations. Random assignment requires experimental units, which can be plots of land in
agriculture, individual persons in social psychology experiments, intact classrooms in education
studies, and neighborhoods in some criminal justice research. Treatments are then assigned to
these units by some equivalent of a coin toss.
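Random assignment itself is easy to implement; here is a minimal sketch (hypothetical class names as the experimental units) that shuffles the units and deals them into groups:

```python
import random

def randomly_assign(units, groups=("treatment", "control"), seed=None):
    """Shuffle the units and deal them round-robin into the groups --
    the software equivalent of a coin toss for each unit."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    return {g: shuffled[i::len(groups)] for i, g in enumerate(groups)}

classes = [f"class_{i}" for i in range(1, 9)]   # intact classrooms as units
assignment = randomly_assign(classes, seed=5)
print(assignment)
```

Because every unit has the same chance of landing in each group, pre-existing differences among the units are spread evenly across the groups, which is what rules out most alternative interpretations.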
Quasi-Experimental Design
Quasi-experiments have treatments, outcome measures, and experimental units, but do
not use random assignment to create the comparisons from which treatment-caused change is
inferred. Instead, the comparisons depend on nonequivalent groups that differ from each other
in many ways other than the presence of a treatment whose effects are being tested. The task is
one of separating the effects of a treatment from those due to the initial noncomparability
between the average units in each treatment group. In a sense, quasi-experiments require
making explicit the irrelevant causal forces hidden within the intact groups. The advantages of
experimental control for inferring causation have to be weighed against the disadvantages that
arise because we do not always want to learn about causation in controlled settings. Instead, we
would like to be able to generalize to causal relationships in complex field settings, and we
cannot easily assume that findings from the laboratory will hold in the field.
For lack of randomization, a pretest is an integral part of a quasi-experiment that enables
comparisons among the nonequivalent groups, whereas, in most experiments, a pretest is often
unnecessary or undesirable. When a pretest is not available, researchers may look for some
proxy variables to use in an attempt to equate the treatment and control groups. For instance,
previous academic records can be used as proxies to equate the intact classes where different
instructional methods are implemented and post-tests are used to evaluate or compare the
different teaching methods.
Non-Experimental or Ex Post Facto Research
Recall that in experimental research there is a temporal sequence in the order of the
independent variable (which occurs first) and the dependent variable (which is observed
subsequently) that allows a causal inference. Specifically, the researcher manipulates the
independent variable to "create" changes in the dependent variable. This expected change (in
the dependent variable as a consequence of the manipulation of the independent variable)
represents a causal relationship between the two variables. This causal relationship is stated in
the hypothesis. Thus, experimental research is guided by a hypothesis stated a priori.
In experimental and quasi-experimental research, inferences are made from the
independent variables (the causes) to the dependent variable (the effect). In non-experimental
research, also called "ex post facto research", inferences are generally made in the opposite
direction. That is, beginning with the observation of the dependent variable, attempts are made
to uncover, detect, or find the reasons (independent variable) for the existing variations. The
point is the variations are not the result of the manipulation of the independent variables but are
pre-existing. When an experimental researcher manipulates a variable (e.g., administers
different treatments), the researcher has some expectations regarding the effect of the
manipulation on the dependent variable. These expectations are expressed in the form of
hypotheses to be tested. In non-experimental research, a researcher would not have an
expectation of the independent variables or sometimes would not even know prior to data
collection what "independent" variables are tenable to explain the pre-existing variations on the
"dependent" variable. Often, there are no hypotheses associated with a non-experimental study
and researchers adopt the position of "letting the data speak for themselves."
This design is most vulnerable to internal validity threats. Two general strategies to
protect internal validity are using (1) large samples to compensate for the lack of random
assignment and (2) large numbers of "independent" variables to eliminate rival explanations.
The latter strategy is intended to overcome the weakness of having no experimental
manipulation of the independent variables.
Summary of Research Designs

                Manipulation of     Random        Sample    Variable
Design          Independent V.      Assignment    Size      Number
--------------------------------------------------------------------
Experimental    Yes                 Yes           Small     Small
Quasi-Exp       Yes                 No
Non-Exp         No                  No            Large     Large
--------------------------------------------------------------------

Non-Experimental Design: Causal Comparative Study
Non-experimental designs refer to efforts at causal inference based on measures taken
all at one time, with differential levels of both effects and exposures to presumed causes being
measured as they occur naturally, without any experimental intervention.
According to some authors, there are two kinds of non-experimental designs, causal
comparative and correlational studies. (Others including me do not think such a distinction is
necessary.) The difference lies in the measurement of the "independent" variable which can be
either categorical or continuous. The quotation mark indicates that an "independent" variable in
non-experimental research is not truly what the term stands for since it is not manipulated. For
this reason, a categorical or continuous "independent" variable is also called a grouping variable
in causal comparative studies and an exogenous variable in correlational studies.
Causal comparative design is different from experimental and quasi-experimental
designs in that there is no manipulation of the grouping variable and there is no random
assignment of subjects into different groups. The question inspiring gender studies is whether
males and females are different with respect to a particular behaviour or trait. In this example,
the independent or grouping variable is gender which is not and can not be manipulated and
there is no way to randomly assign some subjects into either of the two groups. Another
difference is that, as stated earlier, groups in causal comparative research are often formed on
the basis of the dependent variable. Researchers are often interested in finding out why, for
example, some children are less motivated to learn than others, do not achieve as well as others,
have more behavioral problems than others, are more aggressive than others, or turn out to be
criminals. The behaviours on which questions are raised are the results or outcomes or
dependent variables. Researchers try to find answers to the individual differences on these
variables by grouping subjects on these variables into, say, high versus low achievers, students
with and without behavioral problems or with and without criminal records, and then compare
these two groups of subjects on some suspected causes, e.g., parental supervision, peer
influence, TV viewing, etc. Such logical thinking and research processes are almost the
opposite of experimental research. In the above example, TV viewing as an
independent variable will be manipulated in experimental research and subjects will be
randomly assigned to groups with different amounts or different kinds of TV exposure and their
subsequent aggressive, antisocial, or criminal behaviours are observed. In this experimental
example, because of randomization, different experimental groups can be considered equal
except for the manipulation of the TV viewing. Thus different validity threats can be reasonably
ruled out and the observed difference among the groups on the dependent variable, aggressive
behaviour, can be attributed to the independent variable, TV viewing. In the non-experimental
example, however, one can not say that TV viewing causes aggressive behaviour even though
the two groups of aggressive versus non-aggressive children were found to differ on TV
viewing. The reason is that the two groups can not be assumed to be equal: They may differ in
a lot of things in addition to aggressiveness, e.g., different family background, hormonal levels,
personalities, etc. The same statistical analyses are used in causal comparative studies, such as
ANOVA. The interpretations of the results should be far more cautious in causal comparative
research.
Non-Experimental Design: Correlational Study
Partial correlation. The calculation is complicated but the idea of partial correlation is
simple. It is an estimate of the correlation between two variables in a population that is
homogeneous on the variable (or variables) that is being controlled, whose effects are being
removed, or variability on this variable is made into a constant. For example, a correlation
between height and intelligence computed from a sample that is heterogeneous in age, say,
ranging from 4 to 15, is a simple correlation which is high and positive. A partial correlation
would be an average correlation between height and intelligence within each age group where
age is a constant. This partial correlation which is likely zero more truly depicts the relationship
between height and intelligence.
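The idea can be sketched in code with the standard first-order partial correlation formula rxy·z = (rxy − rxz·ryz) / √((1 − rxz²)(1 − ryz²)) (the formula itself is not given in the notes); the data are simulated so that age drives both a physical and a mental variable:

```python
import random
import statistics

def corr(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    den = (sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b)) ** 0.5
    return num / den

def partial_corr(x, y, z):
    """Correlation between x and y with z held constant (first-order partial)."""
    rxy, rxz, ryz = corr(x, y), corr(x, z), corr(y, z)
    return (rxy - rxz * ryz) / ((1 - rxz ** 2) * (1 - ryz ** 2)) ** 0.5

random.seed(6)
age = [random.uniform(4, 15) for _ in range(5000)]                 # years
height = [80 + 6 * a + random.gauss(0, 5) for a in age]            # grows with age
vocab = [20 + 10 * a + random.gauss(0, 8) for a in age]            # grows with age

r_simple = corr(height, vocab)            # large positive simple correlation
r_partial = partial_corr(height, vocab, age)
print(r_simple, r_partial)                # the partial is near zero
```

Controlling for age removes its shared influence, and the seemingly strong height-"intelligence" relationship essentially vanishes, just as the text describes.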
Spurious effect. When two variables are correlated solely because they are both affected
by the same cause, the correlation between these two variables is spurious. For example, the
tobacco industry argues that the correlation between cigarette smoking and lung disease is not
causal but spurious in that both variables may be caused by a common third factor such as
stress or an unhappy mental state. Another example is the positive correlation between
height and intelligence often observed in children. Here, the correlation is again spurious
because both variables have the common cause of chronological age.
Mediating variable. The correlation between two variables can be the result of a
mediating variable. For example, a strong correlation between SES and academic achievement
is often observed and makes some people believe that there is a causal relationship between how
rich the parents are and how well the kids do in school. However, such a relationship is now
found to be mediated by a third variable, achievement motivation. That is, rich parents'
children are more motivated to study (by their parents' success), and this motivation leads to
good academic performance. This latter finding is obtained by correlating SES and
achievement while statistically partialling out motivation: the partial correlation is almost zero.
The important implication of this statistical insight is that the key lies in motivating the poor
kids (providing them with role models), whereas giving them material incentives may not make
them study.
Suppressor variable. A special case in which a partial correlation is larger than its
zero-order correlation is called a suppressor effect. A suppressor variable is a variable
that has a zero, or close to zero, correlation with the criterion or dependent variable but is
correlated with the predictor or independent variable. When such a suppressor variable is not
taken into consideration, the correlation between the independent and dependent variables may
be "suppressed" or reduced by this uncontrolled suppressor. For example, a paper-and-pencil
pilot test used as a predictor was found to predict little of the criterion, flying. The correlation
was suppressed by a third variable, verbal ability, which has little to do with flying but a lot to
do with test taking. When this suppressor variable was partialled out, the correlation between
the pilot test and piloting increased significantly. This is a real example from pilot training
during World War II.
Validity Issues
The only way to enhance the research validity of correlational studies, which are ex post
facto ("after the fact"), is through careful logical deduction and well-thought-out statistical
analyses. The first part requires a strong theory to map out relations among variables and
careful thinking to include all the possible extraneous variables that may contribute to the
observed post facto variations. The second part involves the use of rather complicated analytic
techniques, such as multiple regression, path analysis, and structural equation modeling. Partial
correlation is the basic idea behind these analyses.
Thus, correlational studies often involve more variables than experimental research.
The variables in a correlational study are not distinguished as independent and dependent.
First, since there is no experimental manipulation of the variable of research interest, there is
no independent variable. Second, the inference is not made from the manipulation of an
independent variable to the outcome of a dependent variable; on the contrary, the post facto
outcome is observed first and the research purpose is to account for it, so the order of what is
independent and what is dependent seems opposite to that in experiments. Third, unlike
experimental studies, where there is usually one dependent variable, there can be and usually is
more than one outcome variable in a complex pattern of associations. In correlational studies,
variables are instead distinguished as exogenous and endogenous. An exogenous variable is one
whose variability is assumed to be determined by causes outside the model or study under
consideration. It is not the interest of the study to explain the variability of an exogenous
variable or its causal relations with other exogenous variables. An endogenous variable is one
whose variation is to be explained by the exogenous and/or other endogenous variables in the
model. Exogenous and endogenous variables are like independent and dependent variables in
experimental studies, except that there is no manipulation of the exogenous variables and there
is usually more than one endogenous variable in a correlational study.
In correlational studies one needs to think hard to include as many relevant variables as
possible. Omission of relevant variables that are correlated with the exogenous or endogenous
variables in the model constitutes what is called a specification error, which leads to biased
estimates of the relations among the variables in the model. Specification errors are almost
unavoidable in correlational research. All one can do is attempt to minimize them by including
the major relevant variables in the design. Although it is hard to say how many relevant or
extraneous variables need to be included in a model, it is fairly certain that a correlational study
involving only one "independent" variable is bound to be misspecified in virtually any instance
that comes to mind.
An Evaluation Checklist for Quantitative Studies
Adapted from McMillan, J. H. (2004). Educational research: Fundamentals for the consumer
(4th ed.). Boston: Pearson.
1.0 Research Problem
1.1 What are the independent and dependent variables?
1.2 Is the problem researchable?
1.3 Is the problem significant? Will the results have practical or theoretical importance?
1.4 Is the problem stated clearly and succinctly?
1.5 Does the problem communicate whether the study is descriptive, relational, or experimental?
1.6 Does the problem indicate the population studied?
1.7 Does the problem indicate the variables in the study?
2.0 Review of Literature
2.1 Does the review of literature seem comprehensive? Are all important previous studies
included?
2.2 Are primary sources emphasized?
2.3 Is the review up to date?
2.4 Have studies been critically reviewed, and flaws noted, and have the results been
summarized? (I disagree with 2.1 and 2.4.)
2.5 Does the review emphasize studies directly related to the problem?
2.6 Does the review explicitly relate previous studies to the problem?
2.7 If appropriate, does the review establish a basis for research hypotheses?
2.8 Does the review establish a theoretical framework for the significance of the study?
2.9 Is the review well organized?
3.0 Research Hypothesis
3.1 Is the hypothesis stated in declarative form?
3.2 Does the hypothesis follow from the literature?
3.3 Does the hypothesis state expected relationships or differences?
3.4 Is the hypothesis testable?
3.5 Is the hypothesis clear and concise?
4.0 Selection of Participants
4.1 Are the participants clearly described?
4.2 Is the population clearly defined?
4.3 Is the method of sampling clearly described?
4.4 Is probability sampling used? If so, is it proportional or disproportional?
4.5 What is the return rate in a survey study?
4.6 Are volunteers used?
4.7 Is there an adequate number of participants?
5.0 Instrumentation
5.1 Is evidence for validity and reliability clearly stated and adequate? Is the instrument
appropriate for the participants?
5.2 Are the instruments clearly described? If an instrument is designed for a study by the
researchers, is there a description of its development?
5.3 Are the procedures for gathering data clearly described?
5.4 Are norms appropriate if norm-referenced tests are used?
5.5 Are standard setting procedures appropriate if criterion-referenced tests are used?
5.6 Do the scores distort the reality of the findings?
5.7 Does response set or faking influence the results?
5.8 Are observers and interviewers adequately trained?
5.9 Are there observer or interviewer effects?
6.0 Design
6.1 Descriptive and Correlational
6.1a If descriptive, are relationships inferred?
6.1b Do graphic presentations distort the findings?
6.1c If comparative, are criteria for identifying different groups clear?
6.1d Are causative conclusions reached from correlational findings?
6.1e Is the correlation affected by restriction in the range and reliability of the
instruments?
6.1f If predictions are made, are they based on a different sample?
6.1g Is the size of the correlation large enough?
6.1h If causal-comparative, has the causal condition already occurred?
How comparable are the participants in the groups being compared?
6.2 Experimental
6.2a Is there direct manipulation of an independent variable?
6.2b Is the design clearly described? Is random assignment used?
6.2c What extraneous variables are not controlled in the design?
6.2d Are the treatments very different from one another?
6.2e Is each replication of the treatment independent of other replications? Is the
number of participants equal to the number of treatment replications?
7.0 Results
7.1 Is there an appropriate descriptive statistical summary?
7.2 Is statistical significance confused with practical significance?
7.3 Is statistical significance confused with internal or external validity?
7.4 Are appropriate statistical tests used?
7.5 Are levels of significance interpreted correctly?
7.6 How clearly are the results presented?
7.7 Is there a sufficient number of participants to give valid statistical results?
7.8 Are data clearly and accurately presented in graphs and tables?
8.0 Discussion and Conclusions
8.1 Is interpretation of the results separate from reporting of the results?
8.2 Are the results discussed in relation to previous research, methodology, and the research
problem?
8.3 Do the conclusions address the research problem?
8.4 Do the conclusions follow from the interpretation of the results?
8.5 Are the conclusions appropriately limited by the nature of the participants, treatments,
and measures?
8.6 Is lack of statistical significance properly interpreted?
8.7 Are the limitations of the findings reasonable?
8.8 Are the recommendations and implications specific?
8.9 Are the conclusions consistent with what is known from previous research?
Quantitative versus Qualitative Research
The purpose of research is to draw some causal inference, A leads to B, A affects B, so
that intervention can be introduced. A particular study may or may not be able to draw a causal
conclusion, but the eventual goal of research in any discipline is to draw such conclusions.
There is a fundamental difference in interpreting a causal relationship that distinguishes
quantitative from qualitative research. (The chair example.) Quantitative research focuses on
the specific or most salient causal link, whereas qualitative research takes into consideration the
chain of events as contributing to a specific social process. As a matter of tradition, however,
relationship, process, and context are talked about in qualitative research, whereas "cause and
effect" is considered quantitative terminology. More formally, the two philosophies are
distinguished as follows:
Alternative conditions: sufficient but not necessary
The presence of the condition is associated with the presence of the outcome, but the
absence of the condition is not associated with the absence of the outcome. It is sufficient by
itself but not necessary. A flu virus is a sufficient but not necessary condition for headache.
Contingent conditions: necessary but not sufficient
The absence of the condition indicates the absence of the outcome, but the presence of
the condition does not indicate the presence of the outcome. It is necessary but not sufficient by
itself. The ability to discriminate letters is necessary but not sufficient for reading.
Conclusions drawn from the two philosophies:
Quantitative: To be able to infer causality, conditions have to be both sufficient and
necessary; i.e., the presence of the conditions is accompanied by the presence of the outcome
and the absence of the conditions is accompanied by the absence of the outcome. The only way
to test a causal relationship is an experiment, where the independent variable (the cause) is
manipulated to see the corresponding change in the dependent variable (the outcome). There are
different variations of the experiment to fit the constraints of social science research, e.g.,
quasi-experimental, causal comparative, and correlational designs, but the ideas behind them
are the same, derived from physical science research (physics).
Qualitative: A constellation of conditions that are individually insufficient but necessary
and jointly unnecessary but sufficient (INUS) to bring about the outcome. (Meehl's example.)
Any social phenomenon resembles the INUS situation. Each social factor by itself is not
sufficient, even though necessary, but together the factors are sufficient to bring about an effect,
even though that combination is not necessary. The emphasis is on different factors, angles,
points of view, culture, the big picture, the chain of events, which are necessary but not
sufficient. Since the combination of them, which is sufficient, is not necessary, and there can be
other combinations, there is not a pattern of relationships that is uniformly true. Here the
emphasis is on contexts and situations: they have to be taken into consideration, or findings are
context dependent, whereas quantitative research emphasizes generalization and statistical
inference from sample to population.
Qualitative researchers try to study a process (rather than an isolated event) by taking
into consideration different factors contributing to the process. Within the qualitative approach,
there are disciplinary emphases in educational research. An anthropological orientation
(ethnography) emphasizes the role of culture in influencing behaviour. Researchers with a
sociological bent tend to emphasize symbolic interaction. People are seen as acting according to
the meaning of things and persons to them; their reality is socially constructed. People act not
according to what the school is supposed to be, but according to how they see it.
Quan:
Philosophy: Isolated causal link.
Method: Experiment.
Effort: Control extraneous variables to isolate the particular linkage, e.g., attitude of subjects,
history, instrumentation, testing, maturation; standardize data collection. Following the physical
science tradition, study only that which is observable, measurable, and testable. Latent
constructs have to be operationalized. Many topics are not attempted for research.
Qual:
P: INUS, causal chain.
M: Field study. Consider combinations of factors. Look at context. Use different data collection
schemes to obtain all sorts of information. Social phenomena and human behaviour are not
directly observable; they include intentions, feelings, and aspirations influenced by norms,
culture, and values. Observable behaviours are no more real than internal phenomena.
Quan:
P: Following the physical science tradition, study the social phenomenon or human behaviour
as an objective and impartial observer.
M: Control internal validity threats, such as observer bias and observer characteristics. Keep the
subjects unaware of your research purpose. Hire data collectors and standardize the data
collection conditions by training them. Structured interviews. Let the data speak.
Terminology: Subject, researcher
Qual:
P: Take the perspective of the people being studied. See the world the way they see it.
M: Participation. The researcher is the only or major source of data collection. Subjects play a
role in data interpretation, e.g., having subjects read your report and modify it afterwards.
T: Informants, collaborators, teachers, vs. participants.
Quan:
P: Deductive reasoning: formulate theory from previous research and conduct specific empirical
tests.
M: Hypothesis testing. Ask questions before data are collected. Use standardized tests.
Confirmatory or explanatory studies.
Qual:
P: Inductive reasoning: theory grounded in observation. From pieces of specific events and
observations, develop an explanation.
M: Start from scratch. Extensive and prolonged observation. Go back and forth between data
and explanation until a theory is fully grounded in observation. No measurement; measurement
is not just asking questions but knowing what to ask. Exploratory or discovery-oriented
research.
Quan:
P: Generalization
M: Random sampling, hypothesis testing, inferential statistics.
Qual:
P: Context dependent; generalization with caution.
M: Purposive sampling to gather data from the most representative situations from which to
draw generalizations. Informants are selected for their willingness to talk, their sensitivity,
knowledge, and insights into a situation, and their ability and influence to gain access to new
situations. No intention to use statistical inference. Lengthy text reports. Text analysis.
The quan versus qual distinction is more of an approach and philosophical difference
than a methodological one. Techniques that were traditionally used more often by one approach
than the other are now adopted by both. A case study can be used to triangulate the findings
from large-scale surveys. Field notes compiled through participant observation and personal
interviews can be quantified and statistically analyzed to draw inferences about the population.
Some Details about Research Designs
One-Way ANOVA:
Randomly assign 3 subjects into each of the 3 treatments, Pu, Pr, and Control.

Treatment:  Pu         Pr         Control
            X1,X2,X3   X4,X5,X6   X7,X8,X9
Mean:       MPu        MPr        MC         MGrand
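The layout above can be turned into an F statistic by hand. The cell entries X1...X9 are placeholders, so the sketch below plugs in made-up scores; it partitions the total variability into between-groups and within-groups sums of squares, the core of one-way ANOVA.

```python
# Hypothetical scores for the three cells of the one-way layout above
# (the X1..X9 entries are placeholders, so these numbers are invented).
groups = {"Pu": [5, 6, 7], "Pr": [3, 4, 5], "Control": [2, 2, 3]}

scores = [x for g in groups.values() for x in g]
grand_mean = sum(scores) / len(scores)
means = {name: sum(g) / len(g) for name, g in groups.items()}

# Between-groups SS: spread of the cell means (MPu, MPr, MC) around MGrand
ss_between = sum(len(g) * (means[name] - grand_mean) ** 2
                 for name, g in groups.items())
# Within-groups SS: spread of scores around their own cell mean
ss_within = sum((x - means[name]) ** 2
                for name, g in groups.items() for x in g)

df_between = len(groups) - 1           # k - 1 = 2
df_within = len(scores) - len(groups)  # N - k = 6
F = (ss_between / df_between) / (ss_within / df_within)
print(round(F, 2))  # 13.0 for these made-up scores
```

A large F means the treatment means differ by more than the within-group noise would suggest; the same partition underlies the factorial and repeated-measures designs that follow.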
Factorial ANOVA:
Randomly assign 3 subjects from each gender into each of the 3 treatments, Pu, Pr, and Control.

                       Treatment
          Pu           Pr           Control       Mean
Male      X1,X2,X3     X4,X5,X6     X7,X8,X9
          MPuM         MPrM         MCM           MMale
Female    X10,X11,X12  X13,X14,X15  X16,X17,X18
          MPuF         MPrF         MCF           MFemale
Mean      MPu          MPr          MC            MGrand
RBANOVA:
Assign every subject into each of the 3 treatments, Pu, Pr, and Control, in a random order.

                  Treatment
Subject    Pu     Pr     Control    Mean
1          X1     X1     X1         M1
2          X2     X2     X2         M2
3          X3     X3     X3         M3
4          X4     X4     X4         M4
Mean       MPu    MPr    MC         MGrand
SPANOVA:
Assign every subject from each gender into each of the 3 treatments, Pu, Pr, and Control, in a
random order.

Sex or Between-              Treatment or Within-Subject Factor
Subject Factor    Subject    Pu      Pr      Control   Mean
Male              1          X1      X1      X1        M1
                  2          X2      X2      X2        M2
                  3          X3      X3      X3        M3
                  4          X4      X4      X4        M4
                             MPuM    MPrM    MCM       MMale
Female            5          X5      X5      X5        M5
                  6          X6      X6      X6        M6
                  7          X7      X7      X7        M7
                  8          X8      X8      X8        M8
                             MPuF    MPrF    MCF       MFemale
Mean                         MPu     MPr     MC        MGrand
Pu: Experimental manipulation of public self-consciousness is achieved by having the subjects
respond to Spence's Attitudes Toward Women Scale with the awareness that their answers will
be evaluated by other people.
Pr: Experimental manipulation of private self-consciousness is achieved by having the subjects
respond to Spence's Attitudes Toward Women Scale in front of a mirror.
Control: Subjects respond to Spence's Attitudes Toward Women Scale without experimental
manipulation.
Spence's Attitudes Toward Women Scale is scored so that higher values indicate more
stereotyped attitudes towards women.
Between Subject Design
Between-subject designs are used to draw inferences about treatment effects in several
populations. This statement does not mean that subjects are drawn from different populations
to start with; in fact, you use random assignment to create equal groups. The statement means
that the treatment (or experiment) is so effective that, after the treatment, the behavior of the
subjects represents that of a different population.
Although referred to as an experimental design, it is also used for non-experimental
comparisons of different population means. As a non-experimental design, subjects are
sampled from different populations to start with.
Commonly used between-subject designs are (1) the completely randomized design and
(2) the factorial design. The factorial design deals with more than one independent variable. In
the case of two independent variables, the design is called a two-way factorial. The two
variables can both be experimental variables. More often, one of the two variables is
experimental and the other is a measured variable. A special case is the aptitude-treatment
interaction design, where one of the two variables is a state variable, which is experimentally
manipulated, and the other is a trait variable, which is measured. All these are experimental
designs. Of course, when both variables are measured, e.g., gender by race, the design is
non-experimental. Three hypotheses are explored in a two-way factorial: (1) the main effect due
to one of the variables, (2) the main effect due to the other variable, and (3) the interaction
between the two variables. Factorials are also distinguished as "fixed" or "random" depending
on whether the categories of the independent variables are assumed to exhaust those in the
defined population (fixed design) or are a random sample of the population (random design).
Within Subject Design
In a between design, each subject is observed under one condition and inference is
drawn between subjects from different conditions. In a within design, each subject is observed
under more than one condition and inference is drawn within the same subject across different
conditions. There are usually three ways a within design can be carried out. All three are
generally referred to as randomized-blocks designs.
(1) Subjects are observed under all treatment conditions. The order of the conditions
assigned to each subject must be random. For example, each subject experiences three dosage
conditions of 1 mg, 5 mg, 10 mg. Some subjects have 5 mg first and some have 10 mg first. The
key is that the three conditions are assigned to each subject in a random order. As another
example, have the same teachers rate content-identical essays which bear names representing
immigration status and gender identities to see if there is discrimination in rating essays. In this
example, each teacher rates 4 essays (female-HK, male-HK, female-Mainland, male-Mainland),
the order of which is random.
(2) Assign matched subjects randomly into different treatments. K matched subjects
form a block. Within a block, subjects are randomly assigned to treatment conditions, with one
subject per condition. Subjects within a block are considered identical as far as the dependent
variable is concerned. The strength of the design lies in the fact that within-block variability is
smaller than between-block variability. In the earlier example, we can match subjects by, for
example, age, gender, and physical condition (all of which have to be related to the dependent
variable), and randomly assign each matched triplet into the three dosage conditions.
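The two assignment schemes are easy to operationalize. A minimal sketch with hypothetical subject labels, using the standard library's `random` module:

```python
import random

random.seed(2)
conditions = ["1 mg", "5 mg", "10 mg"]

# Scheme 1: every subject receives all conditions, each in a fresh random order.
orders = {s: random.sample(conditions, k=len(conditions)) for s in ("A", "B", "C")}
print(orders)

# Scheme 2: matched triplets form blocks; within each block, subjects are
# randomly assigned to conditions, one subject per condition.
blocks = [["S1", "S2", "S3"], ["S4", "S5", "S6"]]  # hypothetical matched triplets
assignments = [dict(zip(conditions, random.sample(b, k=len(b)))) for b in blocks]
print(assignments)
```

Each subject's order is a full permutation of the conditions, and each block maps every condition to exactly one of its members, which is precisely what the two schemes require.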
(3) A special case of the first design is the repeated measures design where each subject
undergoes repeated observations. For example, in a test-retest design, a subject is observed
twice, before and after the treatment. Repeated measures design can also be used in
non-experimental studies involving multiple observations.
Mixed Design
If you add a between-factor to the randomized blocks design, you have the split-plot
(SP) design, which is a mixed design. There are now two independent variables. One of them
varies within subjects and is thus called the within-factor. The other varies between subjects
and is called the between-factor. This basic mixed design can be extended to include more than
one within-factor and/or more than one between-factor.
All the requirements for the RB design apply to the SP design. All three situations which
make up an RB design apply to an SP design: 1. The same subjects are observed under all
treatment conditions, the order of which is random. 2. Matched subjects form a block, within
which subjects are considered equivalent and randomly assigned to treatments. 3. The same
subjects undergo repeated measures, where random order does not exist. The same three
situations make up the within-factor of an SP design.
Either of the two factors that make up an SP design can be an experimental or a measured
variable, creating four possible scenarios:
1. The between-factor is an experimental (state) variable and the within-factor is a
measured (trait) variable. For example, the between-factor is Pu (public self-consciousness
induced by the camera condition), Pr (private self-consciousness induced by the mirror
condition), and a control group. Subjects are randomly assigned to these three conditions. The
within-factor is pre-test (before the experimental conditions) and post-test (under the
experimental conditions). This is the typical pretest-posttest experimental design. (One typical
RB design is the test-retest design, which is not experimental because there is no control
group.) In this example, the within-factor represents Situation 3 in B above.
2. Between-factor is a measured and within-factor is an experimental variable. In the
above example, the Pu, Pr, and Ctl conditions can be made a within-factor by randomly
assigning these conditions either to the same subjects or to blocks of matched triplets. The
between-factor can be gender. In this example, the within-factor represents Situations 1 or 2 in
B.
3. Both factors are experimental. In our earlier Prozac example, the within-factor is the
three experimental conditions of 10 mg Prozac, 5 mg Prozac, and placebo. We can add a
between-factor of whether or not subjects receive therapy to cope with depression. The research
question is whether the combination of counselling and Prozac is more effective than either
counselling or Prozac alone. In this case, both the between- and within-factors are experimental
variables. The between-factor conditions are created by randomly assigning subjects into either
the counselling or no-counselling condition. The within-factor can be created in two ways.
Corresponding to Situation 1 in B, within each of the two between-factor conditions, subjects
undergo all three Prozac conditions in a random order. Corresponding to Situation 2 in B,
matched triplets are randomly assigned to the three Prozac conditions.
4. Finally, both factors can be measured in a non-experimental study. For example, I'm
obtaining repeated measures on withdrawn and aggressive behaviors of primary school children.
I'm going to give the children three questionnaires at the end of three consecutive semesters.
The within-factor is the three waves of questionnaire data. One between-factor could be gender.
Another between-factor could be popularity classification of these children. Through peer
nomination, children can be classified into popular, rejected, and neglected.
Analysis of Covariance (ANCOVA)
ANCOVA is ANOVA with a covariate or covariates. The covariate summarizes
individual differences on the dependent variable. These differences would otherwise be
allocated to error in an ANOVA. Naturally, the covariate should be correlated with the
dependent variable. By removing variation due to persons from the error term, ANCOVA
achieves the same goal as that of a within-subject design or a mixed design (RB and SP
ANOVA). Three statistical assumptions:
1. Linearity of regression. The relationship between the dependent variable and the
covariate is linear. Simply check the scatter plot to examine this assumption.
2. Homogeneity of regression. The regression of the dependent variable on the covariate
is the same across the different treatment groups. The most intuitive way to check this
assumption is to run separate regressions within the treatment groups and see whether the
regression coefficients are similar.
3. The covariate is independent of the treatment variable. The way to ensure this
assumption is to obtain the covariate before conducting the experiment, that is, to measure the
covariate before randomly assigning subjects into the different treatment conditions and
conducting the experiment.
The purpose of the covariate is to reduce within-group variance, or differences among
people within groups. If the covariate is highly related to the independent variable, the
difference between groups will be reduced by the part of the covariate that is related to the
independent variable. However, in non-experimental and quasi-experimental studies, e.g., when
teaching methods are compared using intact classes which could differ in ability and ability is
used as the covariate, the covariate is often correlated with the independent variable and
ANCOVA is still used. In this situation, a different question is pursued in non-experimental
research: how much of the effect is purely due to the grouping variable after the covariate is
accounted for. Here researchers may deliberately use a covariate that is related to a grouping
variable so that "pure" group differences can be identified after controlling for the covariate.
For example, in comparing gender role attitudes among several age groups, one may want to
control for education, which is related to age group. In this design, one can find out how much
is due to education and how much is due to age alone.