Psych Assessment C4 C5 Reviewer
Psychological Assessment (Far Eastern University)
PSYCHOLOGICAL ASSESSMENT – CHAPTER 4: RELIABILITY
History and Theory of Reliability
Conceptualization of Error


• Psychology researchers pursue complex traits such as intelligence or aggressiveness, which one can neither see nor touch.
• The concern with reliability has been a particular obsession for psychologists and provides evidence of the advanced scientific status of the field.
Spearman’s Early Studies





• Charles Spearman - advanced the development of reliability assessment
• Abraham De Moivre - introduced the basic notion of sampling error
• Karl Pearson - developed the product moment correlation
• Edward L. Thorndike - wrote An Introduction to the Theory of Mental and Social Measurements
• Sophisticated mathematical models have been developed to quantify "latent" variables based on multiple measures.


Item Response Theory



• The most important new development in reliability assessment.
• The computer is used to focus on the range of item difficulty that helps assess an individual's ability level.
• The method requires a bank of items that have been systematically evaluated for level of difficulty.
Models of Reliability

• Reliability coefficient - the ratio of the variance of the true scores on a test to the variance of the observed scores.
Basics of Test Score Theory

• Classical test score theory - assumes that each person has a true score that would be obtained if there were no errors in measurement.
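In symbols, the model the bullets above describe can be written in standard classical-test-theory notation, where X is the observed score, T the true score, and E random error:

```latex
X = T + E, \qquad r_{XX} = \frac{\sigma^2_T}{\sigma^2_X} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}
```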
Sources of Error





• Errors of measurement are random.
• Standard error of measurement - the basic measure of error; we usually assume that the distribution of random errors will be the same for all persons.
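The standard formula follows from the classical model above, where s_X is the standard deviation of observed scores and r_XX the reliability coefficient:

```latex
\text{SEM} = s_X \sqrt{1 - r_{XX}}
```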
The Domain Sampling Model


• Considers the problems created by using a limited number of items to represent a larger and more complicated construct (e.g., using a sample of words to test spelling ability).
• The greater the number of items, the higher the reliability.
• Reliability can be estimated from the correlation of the observed test score with the true score.

• An observed score may differ from a true score for many reasons (e.g., situational factors).
• Test-retest method - considers the consistency of the test results when the test is administered on different occasions.
• Parallel forms - evaluates the test across different forms of the test.
• Internal consistency - evaluates how people perform on similar subsets of items selected from the same form of the measure.
Time Sampling: The Test–Retest Method



• Used to evaluate the error associated with administering a test at two different times.
• Applies only to measures of stable traits.
• Carryover effect - occurs when the first testing session influences scores from the second session.
• The time interval between testing sessions must be selected and evaluated carefully.
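A minimal sketch of the computation: the test-retest coefficient is simply the correlation between the two administrations. The scores are invented and the use of SciPy's pearsonr is an assumption for illustration:

```python
# Test-retest reliability: correlate scores from two administrations
# of the same test on the same examinees (data invented).
from scipy.stats import pearsonr

time1 = [12, 15, 9, 20, 17, 11, 14, 18]   # hypothetical first session
time2 = [13, 14, 10, 19, 18, 10, 15, 17]  # same examinees, second session

r, _ = pearsonr(time1, time2)
print(f"test-retest reliability: r = {r:.2f}")
```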
Item Sampling: Parallel Forms Method



• Involves making sure that the test scores do not represent any one particular set of items or a subset of items from the entire domain.
• Parallel forms reliability - compares two equivalent forms of a test that measure the same attribute.
• Equivalent/parallel forms - when two forms of the test are available, one can compare performance on one form versus the other.
Split-Half Method





• A test is given and divided into halves that are scored separately.
• The results of one half of the test are then compared with the results of the other.
• Odd-even system - one subscore is obtained for the odd-numbered items in the test and another for the even-numbered items.
• Spearman-Brown formula - allows you to estimate what the correlation between the two halves would have been if each half had been the length of the whole test (see the sketch after this list).
• Cronbach's coefficient alpha (α) - used when the two halves of a test have unequal variances; remains the most commonly used reliability index.
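A small sketch of the Spearman-Brown estimate described above; the general form also predicts reliability when a test is lengthened by a factor n. Values are invented:

```python
def spearman_brown(r_half: float, n: float = 2.0) -> float:
    """Estimated reliability of a test n times as long as the one that
    produced correlation r_half (n = 2 for the split-half case)."""
    return (n * r_half) / (1 + (n - 1) * r_half)

# Split-half case: a half-test correlation of .65 implies roughly .79
# for the full-length test.
print(f"{spearman_brown(0.65):.2f}")
```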
Coefficient Alpha

• Cronbach developed a more general reliability estimate.
• The most general method of finding estimates of reliability through internal consistency.
• Factor analysis - one popular method for dealing with the situation in which a test apparently measures several different characteristics.
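A minimal sketch of coefficient alpha computed directly from its definition: k/(k-1) times one minus the ratio of summed item variances to total-score variance. The score matrix is invented:

```python
# Coefficient alpha from a persons-by-items score matrix
# (rows = examinees, columns = items; data invented).
import numpy as np

scores = np.array([
    [3, 4, 3, 5],
    [2, 2, 3, 2],
    [4, 5, 4, 4],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
])

k = scores.shape[1]                          # number of items
item_vars = scores.var(axis=0, ddof=1)       # variance of each item
total_var = scores.sum(axis=1).var(ddof=1)   # variance of the total score
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(f"coefficient alpha = {alpha:.2f}")
```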

Reliability of a Difference Score

• Difference score - obtained by subtracting one test score from another.
KR20 (Kuder-Richardson) Formula

• Simultaneously considers all possible ways of splitting the items.
• Items are dichotomous, scored 0 or 1 (usually for right or wrong).
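A hedged sketch of KR-20 on invented 0/1 item data, with the KR-21 shortcut (described next) shown for comparison; KR-21 needs only the mean and variance of the total scores rather than item-by-item p and q values:

```python
# KR-20 and KR-21 for dichotomously scored items (data invented).
import numpy as np

items = np.array([           # rows = examinees, columns = items (1 = correct)
    [1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 0, 1, 1, 1],
])

k = items.shape[1]
p = items.mean(axis=0)                     # proportion passing each item
q = 1 - p
total = items.sum(axis=1)
var_total = total.var(ddof=1)

kr20 = (k / (k - 1)) * (1 - (p * q).sum() / var_total)

mean_total = total.mean()                  # KR-21: mean and variance suffice
kr21 = (k / (k - 1)) * (1 - mean_total * (k - mean_total) / (k * var_total))

print(f"KR-20 = {kr20:.2f}, KR-21 = {kr21:.2f}")
```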
Formula 21 or KR21

• A special case of the reliability formula that does not require the calculation of the p's and q's for every item.
• Uses an approximation of the sum of the pq products—the mean test score.
• Assumes an average difficulty level of 50%.
o Difficulty - the percentage of test takers who pass the item.
Reliability in Behavioral Observation Studies

• Behavioral observation systems are frequently unreliable because of discrepancies between true scores and the scores recorded by the observer.
• Reliability estimates:
o Interrater
o Interscorer
o Interobserver
o Interjudge
Kappa Statistic
o Best method for assessing the level of agreement among several observers.
o Introduced by J. Cohen.
o A measure of agreement between two judges who each rate a set of objects using nominal scales.
o Kappa - the actual agreement as a proportion of the potential agreement following correction for chance agreement.
o Values vary between 1 (perfect agreement) and -1 (less agreement than can be expected on the basis of chance alone).
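A minimal sketch of computing kappa for two raters; scikit-learn's cohen_kappa_score is one readily available implementation (the ratings here are invented):

```python
# Cohen's kappa: chance-corrected agreement between two raters
# assigning nominal categories (ratings invented).
from sklearn.metrics import cohen_kappa_score

rater_a = ["aggressive", "passive", "aggressive", "neutral", "passive", "neutral"]
rater_b = ["aggressive", "passive", "neutral", "neutral", "passive", "passive"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")  # 1 = perfect agreement; <= 0 = no better than chance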

Connecting Sources of Error with Reliability Assessment Method

• Each approach to estimating reliability addresses a different source of error: test-retest targets time sampling, parallel forms and split-half/internal consistency target item sampling, and interrater estimates target differences among observers.

What to Do about Low Reliability
• Increase the Number of Items - the larger the sample of items, the more likely that the test will represent the true characteristic.
• Factor and Item Analysis - the reliability of a test depends on the extent to which all of the items measure one common characteristic.
o Tests are most reliable if they are unidimensional—one factor should account for considerably more of the variance than any other factor.
o Discriminability analysis - examines the correlation between each item and the total score for the test (see the sketch below).
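A hedged sketch of a discriminability (item-total) analysis: correlate each item with the total of the remaining items and flag items that track the rest of the test poorly. The data and the .30 cutoff are illustrative choices, not fixed rules:

```python
# Item-total (discriminability) analysis on an invented score matrix.
import numpy as np

scores = np.array([              # rows = examinees, columns = items
    [3, 4, 3, 1],
    [2, 2, 3, 4],
    [4, 5, 4, 2],
    [1, 2, 1, 5],
    [3, 3, 4, 1],
], dtype=float)

total = scores.sum(axis=1)
for i in range(scores.shape[1]):
    rest = total - scores[:, i]  # corrected total: exclude the item itself
    r = np.corrcoef(scores[:, i], rest)[0, 1]
    flag = "  <- candidate for removal" if r < 0.30 else ""
    print(f"item {i + 1}: item-total r = {r:5.2f}{flag}")
```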
• Correction for Attenuation - if a test is unreliable, information obtained with it is of little or no value. Thus, we say that potential correlations are attenuated, or diminished, by measurement error.
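The standard correction-for-attenuation formula, where r_xy is the observed correlation and r_xx, r_yy are the reliabilities of the two measures:

```latex
\hat{r}_{xy} = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}}
```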
Using Reliability Information
Standard Errors of Measurement and the Rubber
Yardstick


• The larger the standard error of measurement, the less certain we can be about the accuracy with which an attribute is measured.
• A small standard error of measurement tells us that an individual score is probably close to the measured value.
• An individual score is often reported with a confidence interval built from the standard error of measurement; the wider the interval, the lower the reliability of the score.
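A minimal sketch of the rubber-yardstick idea: report an observed score with a confidence interval built from the SEM. All numbers are invented for illustration:

```python
# Confidence interval around an observed score using the SEM.
import math

observed = 100.0        # hypothetical observed score
sd = 15.0               # standard deviation of test scores
reliability = 0.90      # reliability estimate for the test

sem = sd * math.sqrt(1 - reliability)   # standard error of measurement
z = 1.96                                # 95% confidence level
low, high = observed - z * sem, observed + z * sem
print(f"SEM = {sem:.2f}; 95% interval = ({low:.1f}, {high:.1f})")
```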
How Reliable Is Reliable?


• Reliability estimates in the range of .70 and .80 are good enough for most purposes in basic research.
• The most useful index of reliability for the interpretation of individual scores is the standard error of measurement.
PSYCHOLOGICAL ASSESSMENT – CHAPTER 5: VALIDITY
Validity




• The agreement between a test score or measure and the quality it is believed to measure.
• "Does the test measure what it is supposed to measure?"
• The evidence for inferences made about a test score.
• 3 types of evidence: (1) construct related, (2) criterion related, and (3) content related.

organized into 3 sections: foundations,
operations, and applications
considers how tests are designed and built
and how they are administered, scored, and
reported.
Aspects of Validity


Content-Related Evidence for Validity




• Considers the adequacy of representation of the conceptual domain the test is designed to cover.
• Content validity evidence has been of greatest concern in educational testing and, more recently, in tests developed for medical settings.
• Construct underrepresentation - the failure to capture important components of a construct.
• Construct-irrelevant variance - occurs when scores are influenced by factors irrelevant to the construct.
Face Validity

• The mere appearance that a measure has validity.
• It is crucial to have a test that "looks like" it is valid.
• These appearances can help motivate test takers because they can see that the test is relevant.
Criterion-Related Evidence for Validity

• Criterion validity evidence - tells how well a test corresponds with a particular criterion.
• Such evidence is provided by high correlations between a test and a well-defined criterion measure.
• Criterion - the standard against which the test is compared.

Predictive and Concurrent Evidence

• Predictive validity evidence - the forecasting function of tests.
o Ex. the SAT Critical Reading Test provides predictive validity evidence as a college admissions test.
o SAT = predictor variable
o GPA = criterion
• Concurrent validity evidence - applies when the test and the criterion can be measured at the same time.
Validity Coefficient
• The relationship between a test and a criterion is usually expressed as a correlation.
• Tells the extent to which the test is valid for making statements about the criterion.
• Coefficients of .30 to .40 are commonly considered adequate.
Evaluating Validity Coefficients
1. Look for Changes in the Cause of Relationships - be aware that the conditions of a validity study are never exactly reproduced.
2. What Does the Criterion Mean? - criterion-related validity studies mean nothing at all unless the criterion is valid and reliable.
3. Review the Subject Population in the Validity Study - the validity study might have been done on a population that does not represent the group to which inferences will be made.
4. Be Sure the Sample Size Was Adequate - a validity coefficient that is based on a small number of cases may be misleading.
5. Never Confuse the Criterion with the Predictor.
6. Check for Restricted Range on Both Predictor and Criterion - a variable has a "restricted range" if all scores for that variable fall very close together (see the simulation after this list).
7. Review Evidence for Validity Generalization - criterion-related validity evidence obtained in one situation may not be generalized to other similar situations.
o Generalizability - the evidence that the findings obtained in one situation can be generalized, i.e., applied to other situations.
8. Consider Differential Prediction - predictive relationships may not be the same for all demographic groups.
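A hedged simulation of point 6: when only a narrow band of predictor scores is available (e.g., only admitted students), the observed validity coefficient shrinks. All data are simulated:

```python
# Restriction of range lowers an observed validity coefficient.
import numpy as np

rng = np.random.default_rng(0)
predictor = rng.normal(500, 100, 2000)                 # e.g., an admissions test
criterion = 0.5 * predictor + rng.normal(0, 80, 2000)  # correlated criterion

r_full = np.corrcoef(predictor, criterion)[0, 1]

top = predictor > 600                                  # only high scorers retained
r_restricted = np.corrcoef(predictor[top], criterion[top])[0, 1]

print(f"full range r = {r_full:.2f}, restricted r = {r_restricted:.2f}")
```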
Construct-Related Evidence for Validity



• Construct - something built by mental synthesis.
• Construct validity evidence - established through a series of activities in which a researcher simultaneously defines some construct and develops the instrumentation to measure it.
• Construct validation - involves assembling evidence about what a test means; done by showing the relationship between a test and other tests and measures.
Convergent Evidence


• When a measure correlates well with other tests believed to measure the same construct.
• Measures of the same construct converge, or narrow in, on the same thing.
Discriminant Evidence / Divergent Validation



• Shows that a measure taps something other than the tests used in the convergent evidence studies.
• A demonstration of uniqueness.
• A test should have low correlations with measures of unrelated constructs; this is evidence for what the test does not measure.
Criterion-Referenced Tests


• Have items that are designed to match certain specific instructional objectives.
• Validity studies for criterion-referenced tests would compare scores on the test to scores on other measures that are believed to be related to the test.
Relationship between Reliability and Validity



• Attempting to define the validity of a test will be futile if the test is not reliable.
• We can have reliability without validity.
• It is logically impossible to demonstrate that an unreliable test is valid.