Ch 3 Jackson

this chapter discusses methods to estimate reliability and objectivity, and
factors that influence the value of these for a test score
on most occasions, a test can be administered several times a day
the test administrator must decide how many to administer
and which trials to use as the criterion score
after reading this chapter, you should be able to:
1. define and differentiate between reliability and objectivity for norm-referenced
test scores and outline the methods used to estimate these values
2. identify those factors that influence reliability and objectivity for norm-referenced
test scores
3. identify those factors that influence reliability for criterion-referenced test scores
4. select a reliable criterion score based on measurement theory
certain characteristics are essential to a measurement; without them we cannot believe in the
measurement and we can make little use of it
in this chapter, we will go over these characteristics
the most important characteristic of a measurement is validity
an instrument is valid only if it measures what it is supposed to measure
this is discussed in greater detail in chapter 4
the second most important characteristic is reliability
a reliable instrument measures whatever it measures consistently
before an instrument can be valid, it must first be reliable
objectivity is another important characteristic
sometimes called rater reliability because it is defined in terms of agreement of
judges about the value of the measurement
if two judges/raters cannot agree on a score, the measurement lacks objectivity
a lack of objectivity reduces both reliability and validity
majority of this chapter deals with reliability and objectivity of norm-referenced tests
reliability of criterion-referenced tests is presented at the end of the chapter
Mean Score Versus Best Score
- a criterion score is the measure used to indicate a person’s ability
unless a measure has perfect reliability, it is a better indicator of performance when
developed from more than one trial
multiple trials are common: skinfolds, strength testing, IQ, jumping, etc…
for multiple trials, the criterion score can be either the best score or the mean of the trial scores
best can represent optimal
mean can represent typical
which to use – it depends
more difficult to obtain a mean (calculations and administrations)
the more trials you have, the better the reliability
More Considerations
- if maximum ability is what you are seeking, then maybe you want best score
- maybe the mean of the two highest scores
- based on the above information, we can determine a criterion score in any of the
following ways
1. mean of all the trial scores
2. best score of all the trial scores
3. mean of selected trial scores based upon which trials the group scored best
4. mean of selected trial scores based upon which trials the individual scored best
in chapter 4 we will learn that for a score to be valid, it must be reliable
but reliability does not guarantee validity
in selecting a criterion score, we must consider what is the most valid and reliable
score, and not simply the most reliable score
reliability is traditionally estimated by one of two methods:
1. test-retest (stability) or
2. internal consistency
each yields a different coefficient, so it is imperative to use the most appropriate one
it is also important to note the types of reliability others have used to estimate their coefficients
also, a test may be reliable in one group but not in another
Stability Reliability
- when individual scores change little from one day to the next, they are stable
- when scores remain stable, they are considered reliable
- test-retest is used to obtain the stability reliability coefficient
for this, the same individual is measured with the same instrument on several days
(usually, and at least, 2)
the correlation between the sets of scores is the stability reliability coefficient
the closer this coefficient is to (+1), the more reliable the scores
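A minimal sketch of a test-retest coefficient, assuming a plain Pearson correlation between day 1 and day 2 scores (later in these notes the intraclass R from ANOVA is used instead). The function name `pearson_r` and the five pairs of scores are hypothetical, not from the text.

```python
from math import sqrt

def pearson_r(day1, day2):
    """Pearson correlation between two equally long lists of scores."""
    n = len(day1)
    mean1 = sum(day1) / n
    mean2 = sum(day2) / n
    # Sum of cross-products of deviations from each day's mean
    cross = sum((a - mean1) * (b - mean2) for a, b in zip(day1, day2))
    ss1 = sum((a - mean1) ** 2 for a in day1)
    ss2 = sum((b - mean2) ** 2 for b in day2)
    return cross / sqrt(ss1 * ss2)

# Hypothetical scores for five people tested on two different days
day1 = [10, 12, 15, 18, 20]
day2 = [11, 12, 16, 17, 21]
stability_r = pearson_r(day1, day2)
print(round(stability_r, 3))
```

Scores that keep people in nearly the same order on both days give a coefficient close to +1.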
3 factors that can contribute to low reliability are:
1. the people being tested may perform differently
2. measuring instrument may be operated or applied differently
3. the person administering the measurement may change
as a rule of thumb, test administrations are 1 to 3 days apart
for maximum fitness testing (e.g., VO2max) there should be seven days to allow for
complete physiological recovery
if interval between administrations is too long, scores can change due to practice,
memory, etc… - things that are not considered sources of measurement error
because of the time constraints, some people do not advocate test-retest reliability
probably, however, most appropriate method for determining reliability of physical
performance measures
and, not all subjects have to be retested – 25% to 50% of sample size should suffice
there is no set standard for an acceptable range of stability reliability coefficients
all situations must be evaluated on an individual basis
most physical performance measures, however, exhibit coefficients in the .80 - .95 range
Internal-Consistency Reliability
- used by many
- the advantage is that all measures are collected on the same day
- refers to a consistent rate of scoring by the individuals being tested throughout a test,
or when multiple trials are administered from trial to trial
at least two trials must be administered on the same day
significant changes in test scores indicate a lack of reliability
the correlation among the trial scores is the internal consistency reliability coefficient
Stability versus Internal Consistency
- not comparable
- internal consistency is not affected by day-to-day changes in performance
- the day-to-day changes in performance are major sources of measurement error in
stability reliability
- internal consistency is typically higher than stability reliability
- not unusual to have coefficients ranging from .85-.99 on performance tests
disciplines that rely heavily on paper and pencil tests primarily use internal
consistency reliability; whereas performance based disciplines rely heavily on
stability reliability
why? remember – the stability coefficient assumes that true ability has not changed
from one day to the next – paper and pencil tests typically cannot meet this
assumption – performance tests, on the other hand, can
reliability for norm-referenced tests may be better understood through its
mathematical foundation
reliability can be explained in terms of “observed scores”, “true scores”, and “error scores”
additionally, reliability assumes that any measurement on a continuous scale has an
inherent component of error, termed “measurement error”
any number of things can affect measurement error, the four most frequently
occurring are:
1. lack of agreement among scorers (objectivity)
2. lack of consistency by individuals being tested
3. lack of consistency in measuring instrument; and
4. inconsistency in following standardized testing procedures
from text:
o assume we measure height of 5 people all 68 inches tall
o if we say anyone is other than 68 inches tall, we have measurement error
o the variability in measured height versus actual height is “measurement error”
o variance, from chapter 2, is $s^2$
o if all measured scores are 68, $s$ is zero and $s^2$ is 0
o there is no measurement error
o if all people are not the same height, then $s^2$ is due to true differences in
height or differences due to “measurement error”
o in reliability theory, we are trying to determine what is true-score variance and
what is “measurement error”
in theory, observed score is the sum of the true score and the measurement error
equation is: X = t + e; where
o X is the observed score,
o t is the true score
o e is the error
for example, an individual is 70.25 inches and is measured at 70.5 inches, the
measurement error is 0.25 inches
o 70.5 (X) = 70.25(t) + 0.25(e)
the variance for a set of observed scores is equal to the variance of the true scores
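plus the variance of the error scores: $\sigma_x^2 = \sigma_t^2 + \sigma_e^2$
reliability then is the ratio of the true-score variance to the observed-score variance:
$$\text{Reliability} = \frac{\sigma_t^2}{\sigma_x^2} = \frac{\sigma_x^2 - \sigma_e^2}{\sigma_x^2} = 1 - \frac{\sigma_e^2}{\sigma_x^2}$$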
from the formula, we can observe that when measurement error is 0, reliability equals 1
as measurement error increases, reliability decreases
reliability is an indicator of the amount of measurement error in a set of scores
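A small sketch of the ratio just described, assuming true-score and error variances are known (in practice they are estimated, as the ANOVA sections show). The function name and the example variances are hypothetical.

```python
def reliability(true_var, error_var):
    """Reliability = true-score variance / observed-score variance,
    where observed variance = true variance + error variance."""
    observed_var = true_var + error_var
    if observed_var == 0:
        raise ValueError("no variability in the scores; the ratio is undefined")
    return true_var / observed_var

# Hypothetical variances: same true-score variance, growing error variance
for error_var in (0.0, 1.0, 9.0):
    print(error_var, reliability(9.0, error_var))
```

With no measurement error the ratio is 1, and it falls as error variance grows, which is the pattern stated above.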
reliability, then is dependent on two factors
1. reducing the variation attributable to measurement error, and
2. detecting individual differences (true score variation) within the group
reliability, then must be viewed in terms of its measurement error (error variance) and
its power to discriminate among different levels of ability within the group measured
(true-score variance)
remember: X (observed score) = t (true score) + e (error)
furthermore, the variance of observed scores equals the variance of true scores plus
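the variance of the error scores:
$$\sigma_x^2 = \sigma_t^2 + \sigma_e^2$$
reliability equals the true-score variance divided by the observed-score variance:
$$\text{Reliability} = \frac{\sigma_t^2}{\sigma_x^2} = 1 - \frac{\sigma_e^2}{\sigma_x^2}$$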
just as observed score can be divided into true and error scores, the total variability
(s2) for a set of scores can be divided into several parts
to divide, or partition, this variance – we use ANOVA
we use the output from the ANOVA to obtain the variance scores that we need
then, we can calculate an intraclass reliability coefficient
before ANOVA, two things we must discuss
1. you must estimate reliability prior to collecting a large amount of data
a. administer to small representative sample
b. estimate reliability
c. known as a pilot study
d. should use more administrations and trials with pilot study compared
to larger study
2. calculations are easier when done with the computer; however, you should
still practice by hand to understand how they are calculated
Intraclass R from One-Way ANOVA
using ANOVA, we replace the reliability formula, $\text{Reliability} = \sigma_t^2 / \sigma_x^2$, with
$$R = \frac{MS_A - MS_W}{MS_A}$$
where
o R is the intraclass correlation coefficient
o MSA = mean square among
o MSW = mean square within
   - the mean square values are provided by ANOVA
   - they are also variance scores/estimates
o (MSA – MSW) is an estimate of $\sigma_t^2$, and
o MSA is an estimate of $\sigma_x^2$
to calculate these values by hand, we need to define 6 values:
1. sum of squares total – SST
2. degrees of freedom total dfT
3. sum of squares among people SSA
4. sum of squares within people SSW
5. degrees of freedom among people dfA
6. degrees of freedom within people dfW
$$SS_T = \sum X^2 - \frac{(\sum X)^2}{nk}$$
$$SS_A = \frac{\sum T_i^2}{k} - \frac{(\sum X)^2}{nk}$$
$$SS_W = \sum X^2 - \frac{\sum T_i^2}{k}$$
$$df_T = nk - 1$$
$$df_A = n - 1$$
$$df_W = n(k - 1)$$
where
$\sum X^2$ is the sum of the squared scores
$\sum X$ is the sum of the scores of all people
n is the number of people
k is the number of scores for each person, and
$T_i$ is the sum of the scores for person i
Now, we can easily calculate the Mean Square values (among and within)
$$MS_A = \frac{SS_A}{df_A} = \frac{SS_A}{n - 1}$$
$$MS_W = \frac{SS_W}{df_W} = \frac{SS_W}{n(k - 1)}$$
in calculating MSA, SSA will be zero if all people have the same score
it will be greater than zero if people have different scores
should not anticipate it being zero, there will be different scores
MSA, then, should be interpreted as an estimate of observed-score variance $\sigma_x^2$
also, in calculating MSW, SSW will be zero if each person has all the same scores on
the different trials
it will be greater than zero if people have different scores on different trials
MSW, then, should be interpreted as an estimate of error score variance $\sigma_e^2$
Step 8 from text, page 86 – ANOVA source table:
Source           df          SS     MS
Among people     n - 1       SSA    MSA
Within people    n(k - 1)    SSW    MSW
go to handout for problem 3.1
we interpret this R as the reliability of a criterion score which is the sum or mean test
score for each person
when R = 0, there is no reliability
when R = 1, there is maximum reliability
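The hand calculation can be sketched in Python as follows. This assumes the one-way layout described above (n people, k scores each) with R = (MSA - MSW) / MSA; the function name and the small data set are hypothetical and are not the data from problem 3.1.

```python
def oneway_intraclass_r(scores):
    """scores: list of n rows (people), each holding k trial scores.
    Returns (MS among people, MS within people, intraclass R)."""
    n = len(scores)
    k = len(scores[0])
    sum_x = sum(x for row in scores for x in row)
    sum_x2 = sum(x * x for row in scores for x in row)
    sum_t2 = sum(sum(row) ** 2 for row in scores)  # squared person totals
    correction = sum_x ** 2 / (n * k)

    ss_among = sum_t2 / k - correction
    ss_within = sum_x2 - sum_t2 / k
    ms_among = ss_among / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    r = (ms_among - ms_within) / ms_among
    return ms_among, ms_within, r

# Hypothetical pilot data: 4 people, 2 trials each
scores = [[9, 10], [7, 7], [5, 6], [3, 3]]
ms_among, ms_within, r = oneway_intraclass_r(scores)
print(round(ms_among, 2), round(ms_within, 2), round(r, 3))
```

Here people differ a lot from one another and very little from trial to trial, so R comes out close to 1.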
SPSS handout
reliability we have from problem 3.1 is the reliability of a criterion score from the
mean of two scores (trials 1 and 2)
what if you want R for a single trial for a single test administration
we calculate R for a criterion score from a single score on one day using the following (formula 3.2):
$$R = \frac{MS_A - MS_W}{MS_A + \left(\frac{k}{k'} - 1\right) MS_W}$$
R = reliability for a criterion score composed of k’ scores
k is the number of scores per person in the pilot study
k’ is the number of scores per person in the actual measurement group
if we want to estimate reliability for a score collected on a single day, using the
values from problem 3.1, where k = 2 and k’ = 1
$$R = \frac{31.5 - 0.33}{31.5 + \left(\frac{2}{1} - 1\right)(0.33)} = \frac{31.17}{31.83} = 0.98$$
we use Step 9 when we want reliability of a criterion score from the mean or sum of
trial scores
we can use equation 3.2 if we want reliability of a best score
this reliability coefficient is only one of many
but is the simplest to illustrate
more advanced procedures may yield a more precise coefficient
Intraclass R from Two-Way ANOVA
- suppose that k scores were collected for each of n people
- which could have been collected over k trials or k days
- for discussion and illustration, think of k scores as trials
- an ANOVA source summary table would look like Table 3.2, page 87
think of it as an extension of the previous calculations to calculate reliability – we
have introduced more trials, so we have increased the confusion, just a little bit
to complete these calculations, we need the following calculations:
sum of squares total – SST – same as above
sum of squares among people – SSP – same as SSA from above
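sum of squares among trials – SSt: $$SS_t = \frac{\sum T_j^2}{n} - \frac{(\sum X)^2}{nk}$$
sum of squares interaction – SSI: $$SS_I = \sum X^2 + \frac{(\sum X)^2}{nk} - \frac{\sum T_i^2}{k} - \frac{\sum T_j^2}{n}$$
$$df_T = nk - 1; \quad df_P = n - 1; \quad df_t = k - 1; \quad df_I = (n - 1)(k - 1)$$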
where, Tj is the sum of the scores for Trial j, and the rest are defined above
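A sketch of these two-way calculations, assuming n people in rows and k trials in columns. The function name, the dictionary keys and the data are hypothetical (this is not the data from Problem 3.2).

```python
def twoway_anova(scores):
    """Sums of squares and mean squares for people, trials and interaction."""
    n = len(scores)
    k = len(scores[0])
    sum_x = sum(x for row in scores for x in row)
    sum_x2 = sum(x * x for row in scores for x in row)
    sum_ti2 = sum(sum(row) ** 2 for row in scores)  # squared person totals
    sum_tj2 = sum(sum(row[j] for row in scores) ** 2 for j in range(k))  # squared trial totals
    correction = sum_x ** 2 / (n * k)

    ss_people = sum_ti2 / k - correction
    ss_trials = sum_tj2 / n - correction
    ss_inter = sum_x2 + correction - sum_ti2 / k - sum_tj2 / n
    return {
        "SS_T": sum_x2 - correction,
        "SS_P": ss_people,
        "SS_t": ss_trials,
        "SS_I": ss_inter,
        "MS_P": ss_people / (n - 1),
        "MS_t": ss_trials / (k - 1),
        "MS_I": ss_inter / ((n - 1) * (k - 1)),
    }

# Hypothetical data: 4 people, 3 trials each
table = twoway_anova([[5, 6, 7], [3, 3, 4], [8, 8, 9], [4, 5, 4]])
print(table)
```

The total sum of squares equals the people, trials and interaction parts added together, which is a useful check on hand calculations.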
Selecting Criterion Score
- dealing with selecting the criterion score and ANOVA model for R
- review the earlier information (selecting a criterion score)
Criterion Score Is Mean
- if it will be the mean or sum of the trial scores and NOT THE BEST SCORE
- we need to examine if the trial means are different
- one method is to visually compare them
- in Problem 3.2, the means are 5.0, 5.2 and 5.4, not a big difference
another method of determining if the means are different is to use an F-test:
$$F = \frac{MS_t}{MS_I}$$; from Problem 3.2, $F = 0.29$
evaluate the F-test from additional statistical techniques at the end of chapter 2
if it is significant, there are real differences between the means
if not, then there is no difference in the means
once you determine the significance of the F-test, you can proceed in one of three
ways (see Figure 3-1, page 89).
1. If there is no significant difference among the trial means, the reliability of
the criterion score is calculated using the following:
$$R = \frac{MS_P - MS_w}{MS_P}$$ (formula 3.3); where
$$MS_w = \frac{SS_t + SS_I}{df_t + df_I}$$
Note that formula 3.3 is the same as formula 3.1
2. If there are significant differences among the trial means, we need to make
some decisions
a. We can discard the scores that are presenting the problem
b. See figure 3-2, page 90.
c. If we discard, then we do another set of ANOVA calculations
d. Then see if the means are different, if they are not, then we
proceed with step 1 above
e. The purpose of doing this is to remove differences in trials, and is
completely acceptable
3. Or, we can consider the changes in scores as attributable to learning or
practice and therefore not due to measurement error
a. Use this especially if you see increases in performance from 1 trial
to another
b. This technique can also be used when attempting to estimate the
objectivity of judges
c. The formula for estimating reliability in this case is:
$$R = \frac{MS_P - MS_I}{MS_P}$$ (formula 3.4)
measurement error is supposed to be random – significant differences indicate a lack
of randomness – there is a systematic change
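The two routes can be compared with a short sketch. Formula 3.3 is read here as R = (MS_P - MS_w) / MS_P with MS_w pooled from trials and interaction, and formula 3.4 as R = (MS_P - MS_I) / MS_P; the second reading is an assumption based on the surrounding definitions. Function names and the sums of squares are hypothetical.

```python
def r_formula_3_3(ss_people, ss_trials, ss_inter, n, k):
    """Trial means not different: pool trials and interaction as error."""
    ms_people = ss_people / (n - 1)
    df_trials = k - 1
    df_inter = (n - 1) * (k - 1)
    ms_within = (ss_trials + ss_inter) / (df_trials + df_inter)
    return (ms_people - ms_within) / ms_people

def r_formula_3_4(ss_people, ss_inter, n, k):
    """Trial differences treated as learning/practice, not as error."""
    ms_people = ss_people / (n - 1)
    ms_inter = ss_inter / ((n - 1) * (k - 1))
    return (ms_people - ms_inter) / ms_people

# Hypothetical sums of squares for 4 people and 3 trials
r_33 = r_formula_3_3(ss_people=43.0, ss_trials=2.0, ss_inter=2.0, n=4, k=3)
r_34 = r_formula_3_4(ss_people=43.0, ss_inter=2.0, n=4, k=3)
print(round(r_33, 3), round(r_34, 3))
```

In this example the second value is the larger one, because the trial-to-trial change is removed from the error term instead of being counted as measurement error.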
Criterion Score Is Single Score
earlier, provided example of selecting criterion score using One-Way
here provide information on estimating reliability from a single score
we use the following formula:
$$R = \frac{MS_P - MS_I}{MS_P + \left(\frac{k}{k'} - 1\right) MS_I}$$ (formula 3.5)
you should use a computer at all times, unless the dataset is small
the advantage of a two-way ANOVA over a one-way ANOVA is that trial means will
be provided, as well as a significance test for the trial means
you can calculate R using either equation 3.3 or 3.4 from this output
in SPSS, Reliability Analysis provides an option for the two-way ANOVA for
repeated measures (this option is not available in the Student Version)
o see Section 12 of the Windows Statistical Procedures in Appendix A, page
see Tables 3.3 and 3.4
in Table 3.4, “between people” and “within people” are used if a one-way ANOVA is used
“between people”, “between measures” and “residual” are used if a two-way ANOVA is used
within people from one-way ANOVA is composed of between measures and residual
from the two-way ANOVA
between measures and residual in Table 3.4 are among trials and interaction, respectively
formulae 3.2 and 3.5 are for estimating R for a criterion score that is a single trial or
single day score of a person where two or more trial or day scores were collected
remember that R is often estimated in a pilot study with more scores per person than
will actually be used in a regular study
presented next are common situations and formulae for calculating R using formulae
3.2 and 3.5
see handouts for situations 1 and 2
Sample Size for R
- the size of the estimated R is somewhat dependent on the sample
- to combat this, confidence intervals are used to provide a range of what the R may
be, for example 90% confidence intervals of R
when using confidence intervals, you are stating that you are X% confident that the
actual R will fall within this range
you want to have a high level of confidence with a small confidence interval
research has shown that confidence limits are the same when keeping sample size and
criterion score equal across either one-way or two-way ANOVAs.
additionally, as the sample size increases or the R increases, the width of the
confidence interval decreases
review Table 3.5 to see this relationship, note how the confidence interval changes as
R increases or the sample size increases
you can obtain confidence intervals through the Reliability Analysis program within SPSS
Acceptable Reliability
- what is an acceptable reliability coefficient
- it depends, and it depends on several factors, including
- the characteristics of the sample, R’s from previous and similar studies, the type of R,
study design, etc…
- in performance related tests, minimum estimates of R might be 0.70 or 0.80
- in behavioral research, estimates of R = 0.60 are sometimes acceptable
you can also use confidence intervals to set a minimum acceptable estimate of R
for example, with a 95% confidence interval, you might stipulate that the lower end
of the confidence interval must be greater than 0.70
some guidance in exercise science field might be the following
o 0.70 – 0.79 is below-average but acceptable
o 0.80 – 0.89 is average and acceptable
o 0.90 and greater is above average
Factors Affecting Reliability
- many factors can affect the reliability of a measurement
- some include: scoring accuracy, number of trials, test difficulty, test instructions,
testing environment and experience
- the length of test, since longer tests provide higher estimates of R
- Table 3.6 is a categorization of factors that influence test score reliability
beyond that, we can expect an acceptable degree of reliability when
1. the sample is heterogeneous in ability, motivated to do well, ready to be
tested, and informed about the nature of the test
2. the test discriminates among ability groups, and is long enough or repeated
sufficiently for each person to show his or her best performance
3. the testing environment and organization are favorable to good
performance, and
4. the person administering the test is competent
Coefficient Alpha
- often used to determine the reliability of dichotomous data (chapter 14)
- with ordinal data we may use coefficient alpha, in fact it provides the same answer if
the data are ratio using formula 3.4
- coefficient alpha is an estimate of the reliability of a criterion score that is the sum of
the trial scores in one day
- it is determined using the following: $$r_\alpha = \left(\frac{k}{k - 1}\right)\left(\frac{s_x^2 - \sum s_j^2}{s_x^2}\right)$$; where,
- $r_\alpha$ is coefficient alpha
- k is the number of trials
- $s_x^2$ is the variance for the criterion scores, and
- $\sum s_j^2$ is the sum of the variances for the trials
- if using statistical software or Excel, do not worry about n or n-1 in the denominator
- they cancel out, so the coefficient alpha will be the same
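A minimal sketch of coefficient alpha computed from raw trial scores, using the variance of the summed criterion score and the sum of the trial variances. The function names, the `ddof` argument (1 for n-1, 0 for n in the denominator) and the data are hypothetical, not from problem 3.3.

```python
def variance(values, ddof=1):
    """Variance with n - ddof in the denominator."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - ddof)

def coefficient_alpha(scores, ddof=1):
    """scores: n rows (people) by k columns (trials)."""
    k = len(scores[0])
    criterion = [sum(row) for row in scores]  # criterion score = sum of trials
    s2_x = variance(criterion, ddof)
    sum_s2_j = sum(variance([row[j] for row in scores], ddof) for j in range(k))
    return (k / (k - 1)) * ((s2_x - sum_s2_j) / s2_x)

# Hypothetical data: 4 people, 3 trials each
scores = [[5, 6, 7], [3, 3, 4], [8, 8, 9], [4, 5, 4]]
alpha_n_minus_1 = coefficient_alpha(scores, ddof=1)
alpha_n = coefficient_alpha(scores, ddof=0)
print(round(alpha_n_minus_1, 3), round(alpha_n, 3))
```

Both calls print the same value, which illustrates why the choice of n or n-1 in the denominator does not matter here.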
go to handout for problem 3.3
besides the needed variances, statistical software can also provide correlations
which can help in deciding which trial scores to use, if you don’t use all
Intraclass R in Summary
- sometimes the intraclass correlation coefficient will be lower than expected or wanted
- even though the test scores seem reliable
- this can happen when the sum of squares among people is small
- this would indicate a more homogeneous group as opposed to heterogeneous, which
increases the sum of squares among people
- one way to correct this is to increase the sensitivity of the test so that it discriminates
more among the homogeneous group
- another way is to increase the heterogeneity of the group
the Spearman-Brown prophecy formula is used to estimate reliability when the length of the test is increased
assumes that additional length (or new trial) is just as difficult as the original and is
neither mentally nor physically more tiring
an estimate of reliability of a criterion score from a mean or sum of trial scores
- $$r_{k,k} = \frac{k \, r_{1,1}}{1 + (k - 1) \, r_{1,1}}$$
where $r_{k,k}$ is the estimated reliability of a test increased in length k times
k is the number of times the test is increased in length, and
$r_{1,1}$ is the reliability of the present test
problem 3.4, six trial test R = 0.94, what if 18 trials were administered (k = 18/6 = 3)
$$r_{k,k} = \frac{3(0.94)}{1 + (3 - 1)(0.94)} = \frac{2.82}{1 + 1.88} = \frac{2.82}{2.88} = 0.98$$
what about the accuracy of this formula – the accuracy of the formula increases as the
value of k in the formula decreases – meaning that the coefficient it predicts is best treated
as the maximum reliability that can be expected
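The problem 3.4 arithmetic can be checked with a few lines, assuming the usual form of the formula, k·r / (1 + (k - 1)·r). The function name is hypothetical.

```python
def spearman_brown(r_present, k):
    """Estimated reliability when the test is made k times as long."""
    return (k * r_present) / (1 + (k - 1) * r_present)

# Problem 3.4: a six-trial test with R = 0.94, lengthened to 18 trials
k = 18 / 6
r_18_trials = spearman_brown(0.94, k)
print(round(r_18_trials, 2))
```

Setting k = 1 returns the present reliability unchanged, and values of k below 1 estimate the effect of shortening the test.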
sometimes useful to estimate the measurement error in each test score
if we administer several trials to each person, we can calculate a standard deviation
for each person
this standard deviation is the measurement error for the individual
if there is only one score the measurement error is estimated using the group scores
by calculating the standard error of measurement with the following:
- $$s_e = s_x \sqrt{1 - r_{x,x}}$$
where, $s_e$ is the standard error of measurement, $s_x$ is the standard deviation, and $r_{x,x}$ is
the reliability coefficient for the test scores
the standard error of measurement reflects the degree one may expect a test score to
vary due to measurement error
it acts like and can be interpreted in the same way using the normal curve as a
standard deviation for a score
it specifies the limits within which we can expect scores to vary due to measurement error
Problem 3.5, standard deviation of a test is 5, R = 0.91
$$s_e = 5\sqrt{1 - 0.91} = 5\sqrt{0.09} = 5(0.3) = 1.5$$
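A quick sketch of Problem 3.5 in Python, taking the standard error of measurement as the standard deviation times the square root of (1 - reliability). The function name and the observed score of 50 are hypothetical; the band uses the normal-curve reading described above, where about 68% of scores fall within one standard error.

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """Standard error of measurement from the test SD and its reliability."""
    return sd * sqrt(1 - reliability)

# Problem 3.5: standard deviation = 5, R = 0.91
sem = standard_error_of_measurement(5, 0.91)

# Hypothetical observed score of 50: band of plus or minus one standard error
observed = 50
low, high = observed - sem, observed + sem
print(round(sem, 2), round(low, 2), round(high, 2))
```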
objectivity, or rater reliability, is the close agreement between the scores assigned to each person by
two or more raters
for a test to be valid it must be both reliable and objective
Factors Affecting Objectivity
- dependent on two related factors
1. clarity of the scoring system
o certain tests have clear scoring methods: mile run, strength scores, sit-ups,
pushups, etc…
o open-ended tests don’t have a clear scoring method and, therefore, interject
more subjectivity
2. degree to which raters can assign scores accurately
o affected by familiarity with scoring mechanism – stopwatch-if rater doesn’t
know how to use it, less accurate scores can be assumed
a high degree of objectivity is essential when two or more people are administering a test
the lower the objectivity, the more dependent on a particular rater an individual’s
scores will be
high objectivity is also needed when the same rater/tester administers the same test
over several days
- we can calculate the degree of objectivity between two or more raters using an
intraclass correlation coefficient
- to do this, we consider the raters or judges scores as trials – we have two or more
scores for each person (the number of scores is determined by the number of raters)
- if all raters are supposed to be using the same standards, we can consider the
difference between judges as measurement error and calculate objectivity using the
following equation:
- Reliability: $$R = \frac{MS_A - MS_W}{MS_A}$$
(formula 3.1, not 3.2 as stated in book)
- if all judges are not expected to use the same standards, we would calculate
objectivity using either the alpha coefficient or the appropriate intraclass R formula –
formula 3.5
from chapter 1: based on a criterion-referenced standard, a person is classified as either
yes or no (proficient or non-proficient)
criterion-referenced reliability is defined differently than norm-referenced reliability
reliability is defined as consistency of classification
we use the 2 x 2 classification table to estimate reliability
                 Day 2
                 Pass   Fail
Day 1   Pass      A      B
        Fail      C      D
A is the number of people who passed both times (true-positives) and D is the number
who failed both times (true-negatives)
B and C are the people who were classified differently on the two occasions (passed one day, failed the other)
the objective is to maximize the numbers in the A and D boxes while minimizing the
numbers in the B and C boxes
proportion of agreement (P) is used to estimate reliability: $$P = \frac{A + D}{A + B + C + D}$$
Problem 3.6, determine proportion of agreement for data in Table 3.8
$$P = \frac{84 + 40}{84 + 21 + 5 + 40} = \frac{124}{150} = 0.83$$
the kappa coefficient allows for chance occurrences within the data (some got lucky
both times or unlucky both times)
- $$\kappa = \frac{P_a - P_c}{1 - P_c}$$; where $P_a$ is the proportion of agreement
$P_c$ is the proportion of agreement by chance:
- $$P_c = \frac{(A + B)(A + C) + (C + D)(B + D)}{(A + B + C + D)^2}$$
Problem 3.7, determine kappa coefficient for data in Table 3.8
Step 1, calculate Proportion of Agreement, already done from above
Step 2, Calculate Proportion of agreement by chance, Pc
$$P_c = \frac{(105)(89) + (45)(61)}{(84 + 21 + 5 + 40)^2} = \frac{9345 + 2745}{150^2} = \frac{12090}{22500} = 0.54$$
Step 3, Calculate kappa
$$\kappa = \frac{0.83 - 0.54}{1 - 0.54} = \frac{0.29}{0.46} = 0.63$$
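Problems 3.6 and 3.7 can be reproduced with a short sketch. The function names are hypothetical; the cell counts A = 84, B = 21, C = 5, D = 40 are the ones used in the worked steps above.

```python
def proportion_agreement(a, b, c, d):
    """Share of people classified the same way on both days."""
    return (a + d) / (a + b + c + d)

def chance_agreement(a, b, c, d):
    """Agreement expected by chance from the row and column totals."""
    total = a + b + c + d
    return ((a + b) * (a + c) + (c + d) * (b + d)) / total ** 2

def kappa(a, b, c, d):
    """Agreement corrected for chance."""
    pa = proportion_agreement(a, b, c, d)
    pc = chance_agreement(a, b, c, d)
    return (pa - pc) / (1 - pc)

p_a = proportion_agreement(84, 21, 5, 40)
p_c = chance_agreement(84, 21, 5, 40)
k_coef = kappa(84, 21, 5, 40)
print(round(p_a, 2), round(p_c, 2), round(k_coef, 2))
```

The code carries unrounded values through the calculation, so kappa is about 0.625 before rounding; it still rounds to the 0.63 obtained by hand from the rounded proportions.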
what are acceptable levels of P - .60 and lower should not even be considered
depending on the situation - .90 might be set as a minimum
difference scores are sometimes called change scores or improvement scores
when people start a program with markedly different scores it proves difficult in
evaluating change over time – particularly when some people start the program with
higher scores
a difference score (change from beginning to end) can be used to determine the
degree to which a person has changed
there are two problems using difference scores for comparison among individuals,
among groups and/or development of performance standards
1. people who score high at the beginning have little chance for improvement
2. difference scores are unreliable
the formula for estimating R of difference scores (X – Y) is as follows:
$$R_{dd} = \frac{R_{xx} s_x^2 + R_{yy} s_y^2 - 2 R_{xy} s_x s_y}{s_x^2 + s_y^2 - 2 R_{xy} s_x s_y}$$
where
X is the initial score
Y is the final score
Rxx and Ryy are the reliability coefficients for tests X and Y
Rxy is the correlation between tests X and Y
Rdd is the reliability of difference scores
sx and sy are the standard deviations for the initial and final scores
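A sketch of why difference scores tend to be unreliable, assuming the usual form of this formula with the reliabilities in the numerator. The function name and the input values are hypothetical.

```python
def difference_score_reliability(r_xx, r_yy, r_xy, s_x, s_y):
    """Reliability of difference scores from the two test reliabilities,
    the correlation between the tests, and their standard deviations."""
    shared = 2 * r_xy * s_x * s_y
    numerator = r_xx * s_x ** 2 + r_yy * s_y ** 2 - shared
    denominator = s_x ** 2 + s_y ** 2 - shared
    return numerator / denominator

# Hypothetical case: both tests have R = 0.90 and correlate 0.80 with each other
r_dd = difference_score_reliability(r_xx=0.90, r_yy=0.90, r_xy=0.80, s_x=1.0, s_y=1.0)
print(round(r_dd, 2))
```

Even though each test is quite reliable in this example, the difference score has a reliability of only about 0.50, because the initial and final scores are highly correlated.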
simple linear regression can also be used to predict final scores from initial scores
to do this you need to know what the correlation between the two tests is, and then
predict what the final score would be
- three characteristics to a sound measuring instrument: reliability, objectivity and validity
- a test is reliable when it consistently measures whatever it is supposed to measure
o there are two types of reliability – stability reliability and internal
consistency reliability
- objectivity is the degree to which different raters agree in scoring of the same