PPT 11

Chapter 11
Measuring Research Variables
Research Methods in Physical Activity
Validity
Validity (the degree to which a test or instrument measures what it purports to measure; can be
categorized as logical, content, criterion, or construct validity) refers to the soundness of
the interpretation of scores from a test, the most important consideration in
measurement. Because measures are used for different purposes, there are
different kinds of validity. The four basic types are logical, content,
criterion, and construct validity.
Logical validity - Degree to which a measure obviously involves the
performance being measured; also known as face validity.
Content validity - Degree to which a test (usually in educational settings)
adequately samples what was covered in the course.
Criterion validity - Degree to which scores on a test are related to some
recognized standard or criterion. The two main types of criterion validity are
concurrent validity and predictive validity.
Validity
Criterion validity (cont)
concurrent validity - Type of criterion validity in which a measuring
instrument is correlated with some criterion that is administered
concurrently or at about the same time.
predictive validity - Degree to which scores of predictor variables can
accurately predict criterion scores. Note that shrinkage can occur when
a prediction equation developed on one sample is applied to
another. Shrinkage is the reduction in predictive ability. It
can be addressed by cross-validating the prediction
equation from the original sample on the new sample. This becomes
important when the two samples differ demographically or when
the original prediction formula came from a small sample.
Cross-Validation: Technique to assess the accuracy of a prediction
formula in which the formula is applied to a sample not used in
developing the formula.
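A minimal sketch of cross-validation, assuming a one-predictor linear prediction equation and made-up scores (the data and function names are illustrative, not from the text). The equation is fitted on the original sample and then applied unchanged to a second sample; a larger prediction error in the second sample is one way to see shrinkage.

```python
from math import sqrt
from statistics import mean

def fit_prediction_equation(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def rmse(x, y, slope, intercept):
    """Root mean square error of the predicted criterion scores."""
    squared_errors = [(yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y)]
    return sqrt(mean(squared_errors))

# Hypothetical original sample: predictor scores and criterion scores.
x_original, y_original = [1, 2, 3, 4, 5], [2, 4, 5, 4, 5]
# Hypothetical second sample, not used in developing the equation.
x_new, y_new = [1, 2, 3, 4, 5], [3, 3, 6, 4, 7]

slope, intercept = fit_prediction_equation(x_original, y_original)
error_original = rmse(x_original, y_original, slope, intercept)
error_cross_validated = rmse(x_new, y_new, slope, intercept)

print(f"y = {intercept:.2f} + {slope:.2f}x")
print(f"error in original sample:        {error_original:.3f}")
print(f"error in cross-validation sample: {error_cross_validated:.3f}")
```

The equation is never refitted on the second sample; that is what makes the second error an honest estimate of how well the formula predicts for new people.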
Validity
Construct validity - Degree to which a test measures a hypothetical
construct; usually established by relating the test results to some behavior.
For example, certain behaviors are expected of someone with a high degree of
sportsmanship. Such a person might be expected to compliment the
opponent on shots made during a tennis match. For an indication of
construct validity, a test maker could compare the number of times a person
scoring high on a test of sportsmanship complimented the opponent with the
number of times a person scoring lower on the test did so.
The known groups difference method is sometimes used to establish
construct validity. In this method, the test scores of groups that should
differ on a trait or ability are compared. (Ex. If sprinters and jumpers score
significantly better on a test designed to measure anaerobic power than distance runners do,
this finding would provide some evidence that the test measures anaerobic power.)
An experimental approach is occasionally used in demonstrating construct validity. For
example, a test of cardiovascular fitness might be assumed to have construct validity if
it reflected gains in fitness following a conditioning program.
Reliability
An integral part of validity is reliability, which pertains to the
consistency, or repeatability, of a measure. A test cannot be
considered valid if it is not reliable. In other words, if the test is not
consistent—if you cannot depend on successive trials to yield the
same results—then the test cannot be trusted.
Test reliability is sometimes discussed in terms of observed score, true score,
and error score.
• The test score obtained by an individual is the observed score.
• An observed score theoretically consists of the person’s true score and error
score.
• Expressed in terms of score variance, the observed score variance consists
of true score variance plus error score variance. The goal of the tester is to
remove error to yield the true score.
• Because true score variance is never known, it is estimated by subtracting
error variance from observed score variance. Thus, the reliability coefficient
(discussed later) reflects the degree to which the measurement is free of error
variance. The coefficient of reliability is the ratio of true score variance to
observed score variance.
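The relationships above can be sketched in a few lines. This is only an illustration with made-up variances; the function name is hypothetical.

```python
def reliability_coefficient(observed_variance, error_variance):
    """Ratio of estimated true score variance to observed score variance."""
    if observed_variance <= 0:
        raise ValueError("observed variance must be positive")
    # True score variance is never known, so it is estimated by subtraction.
    true_variance = observed_variance - error_variance
    return true_variance / observed_variance

# Made-up example: observed variance 25, error variance 5.
example = reliability_coefficient(observed_variance=25.0, error_variance=5.0)
print(example)  # 20 / 25 = 0.8
```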
Reliability
Sources of Error
Measurement error can come from four sources: the participant, the
testing, the scoring, and the instrumentation.
Participant Error - Measurement error associated with the
participant includes many factors, including mood, motivation, fatigue,
health, fluctuations in memory and performance, previous practice,
specific knowledge, and familiarity with the test items.
Testing Error - Testing error is related to how clear and complete
the directions are, how rigidly the instructions are followed, and
whether supplementary directions or motivation is applied.
Reliability
Sources of Error
Scoring Error - Errors in scoring relate to the competence,
experience, and dedication of the scorers and to the nature of the
scoring itself. The extent to which the scorer is familiar with the
behavior being tested and the test items can greatly affect scoring
accuracy. Carelessness and inattention to detail produce
measurement error.
Instrumentation Error - Measurement error caused by
instrumentation includes such obvious causes as inaccuracy and lack
of calibration of mechanical and electronic equipment. It also refers
to the inadequacy of a test to discriminate between abilities and to
the difficulty of scoring some tests.
Reliability Coefficient
Expression of Reliability
The degree of reliability is expressed by a correlation coefficient, ranging from
0.00 to 1.00. The closer the coefficient is to 1.00, the less error variance it
reflects and the more the true score is assessed.
Interclass correlation - This coefficient (Pearson r) is a bivariate statistic, meaning that
it is used to correlate two different variables. Interclass correlation is therefore not
appropriate for establishing reliability, because in that case two values for the same
variable are being correlated. (When a test is given twice, the scores on the first test
are correlated with the scores on the second test to determine their degree of consistency.)
Intraclass correlation - The procedures leading to the calculation of
intraclass correlation (R) are the same as those of simple ANOVA with
repeated measures (see Table 11.2, p. 199). Note that the “F” statistic for
“trials” determines whether there was any significant difference among the
three trials of the same measure.
The intraclass correlation is calculated on p. 200. (Note that the best way to
increase the “R” value is to decrease the residual scores, that is, to remove unexplained
variance.)
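A rough sketch of how R can be obtained from the repeated-measures ANOVA sums of squares. It assumes the commonly used form R = (MS_subjects - MS_error) / MS_subjects; whether trial-to-trial variance is pooled into the error term is exposed here as a hypothetical flag, and the worked calculation on p. 200 remains the reference. The scores are made up.

```python
from statistics import mean

def intraclass_r(scores, pool_trials_into_error=True):
    """Intraclass R for a subjects x trials table of scores."""
    n = len(scores)        # number of subjects (rows)
    k = len(scores[0])     # number of trials (columns)
    grand_mean = mean([x for row in scores for x in row])

    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_subjects = k * sum((mean(row) - grand_mean) ** 2 for row in scores)
    ss_trials = n * sum((mean(col) - grand_mean) ** 2 for col in zip(*scores))
    ss_residual = ss_total - ss_subjects - ss_trials

    df_subjects, df_trials = n - 1, k - 1
    df_residual = df_subjects * df_trials

    ms_subjects = ss_subjects / df_subjects
    if pool_trials_into_error:
        # Treat trial-to-trial differences as measurement error.
        ms_error = (ss_trials + ss_residual) / (df_trials + df_residual)
    else:
        ms_error = ss_residual / df_residual
    return (ms_subjects - ms_error) / ms_subjects

# Made-up data: 4 subjects, 3 trials, every subject improves by 1 per trial.
scores = [[10, 11, 12], [8, 9, 10], [6, 7, 8], [4, 5, 6]]
r_pooled = intraclass_r(scores, pool_trials_into_error=True)
r_residual_only = intraclass_r(scores, pool_trials_into_error=False)
print(round(r_pooled, 3), round(r_residual_only, 3))
```

In this example the residual variance is zero, so R is perfect when only the residual is counted as error and lower when the systematic change across trials is counted as error too.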
Methods of Establishing Reliability
Stability - A coefficient of reliability measured by the test–retest
method on different days. In the test–retest method, the test is given
one day and then repeated a day or so later. Intraclass correlation
should be used to compute the coefficient of stability of the scores
on the two tests.
Alternate-forms method - Method of establishing reliability that involves the
construction of two tests that supposedly sample the same material.
This method is sometimes referred to as the parallel-form method or
the equivalence method. The two tests are given to the same
individuals. Ordinarily, some time elapses between the two
administrations. The scores on the two tests are then correlated to
obtain a reliability coefficient.
Methods of Establishing Reliability
Internal Consistency - An estimate of reliability that
represents the consistency of scores within a test.
Same-day test–retest method - Method of
establishing reliability in which a test is given twice to
the same participants on the same day.
Split-half technique - Method of testing reliability in
which the test is divided in two, usually by making the
odd-numbered items one half and the even numbered
items the other half. The two halves are then
correlated.
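A minimal sketch of the split-half technique on made-up right/wrong item scores (1 = correct, 0 = incorrect). The final step-up with the Spearman-Brown formula is not mentioned on this slide; it is included as an assumption because the correlation between two half-length tests is usually adjusted to estimate reliability for the full-length test.

```python
from statistics import mean

def pearson_r(a, b):
    """Pearson correlation between two lists of scores."""
    ma, mb = mean(a), mean(b)
    sab = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    saa = sum((ai - ma) ** 2 for ai in a)
    sbb = sum((bi - mb) ** 2 for bi in b)
    return sab / (saa * sbb) ** 0.5

# Hypothetical data: one row per participant, one column per item.
item_scores = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

# Odd-numbered items (1, 3, 5) form one half, even-numbered items the other.
odd_half = [sum(row[0::2]) for row in item_scores]
even_half = [sum(row[1::2]) for row in item_scores]

r_halves = pearson_r(odd_half, even_half)
# Spearman-Brown step-up to full test length (assumption, see lead-in).
r_full_length = 2 * r_halves / (1 + r_halves)
print(round(r_halves, 3), round(r_full_length, 3))
```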
Methods of Establishing Reliability
Internal Consistency (cont.)
Flanagan method - A process for estimating reliability
in which the test is split into two halves, and the
variances of the halves of the test are analyzed in
relation to the total variance of the test. (see example 11.3,
p. 202)
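A short sketch of the Flanagan method, assuming the commonly cited form r = 2 x (1 - (variance of half 1 + variance of half 2) / total variance). Example 11.3 in the text is the reference; the half-test scores here are made up.

```python
from statistics import pvariance

def flanagan_reliability(half_1, half_2):
    """Reliability from the variances of two test halves and of the total."""
    totals = [a + b for a, b in zip(half_1, half_2)]
    halves_variance = pvariance(half_1) + pvariance(half_2)
    return 2 * (1 - halves_variance / pvariance(totals))

# Hypothetical scores of five participants on each half of a test.
odd_half = [3, 3, 2, 1, 0]
even_half = [3, 2, 1, 1, 0]

r_flanagan = flanagan_reliability(odd_half, even_half)
print(round(r_flanagan, 3))
```

Population variances are used throughout; using sample variances for all three terms gives the same ratio.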
Kuder-Richardson (KR) method of rational
equivalence - Formulas developed for estimating
reliability of a test from a single test administration.
Only one test administration is required, and no correlation is
calculated. The resulting coefficient represents an average of all
possible split-half reliability coefficients.
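A minimal sketch of one Kuder-Richardson formula, KR-20, assuming right/wrong items scored 1 or 0: r = (k / (k - 1)) x (1 - sum of p x q / total score variance), where k is the number of items, p the proportion answering an item correctly, and q = 1 - p. The slide does not say which KR formula the text uses, so treat this choice and the made-up data as assumptions.

```python
from statistics import pvariance

def kr20(item_scores):
    """KR-20 from a participants x items table of 0/1 scores."""
    n_people = len(item_scores)
    k = len(item_scores[0])
    totals = [sum(person) for person in item_scores]

    sum_pq = 0.0
    for item in zip(*item_scores):       # one column per item
        p = sum(item) / n_people         # proportion answering correctly
        sum_pq += p * (1 - p)

    return (k / (k - 1)) * (1 - sum_pq / pvariance(totals))

# Hypothetical data: one row per participant, one column per item.
item_scores = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

r_kr20 = kr20(item_scores)
print(round(r_kr20, 3))
```

Only one administration is needed and no correlation is computed, which matches the description above.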
Methods of Establishing Reliability
Intertester Reliability (Objectivity) - The degree to which different testers can
achieve the same scores on the same subjects. The degree of objectivity
(intertester reliability) can be established by having more than one
tester gather data. The scores are then analyzed with intraclass
correlation techniques to obtain an intertester reliability coefficient.
This approach typically involves a coding instrument to construct
Interobserver Agreement (see formula 11.4, p. 203)
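Formula 11.4 is not reproduced on the slide. The sketch below assumes the widely used form IOA = agreements / (agreements + disagreements), applied to made-up codes from two observers who watched the same intervals.

```python
def interobserver_agreement(codes_a, codes_b):
    """Proportion of intervals on which two observers recorded the same code."""
    if len(codes_a) != len(codes_b):
        raise ValueError("both observers must code the same intervals")
    agreements = sum(1 for a, b in zip(codes_a, codes_b) if a == b)
    disagreements = len(codes_a) - agreements
    return agreements / (agreements + disagreements)

# Hypothetical interval codes ("on" = on task, "off" = off task).
observer_a = ["on", "on", "off", "on", "off", "on", "on", "off", "on", "on"]
observer_b = ["on", "on", "off", "off", "off", "on", "on", "on", "on", "on"]

ioa = interobserver_agreement(observer_a, observer_b)
print(ioa)  # 8 agreements out of 10 intervals
```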
Standard Scores To Compare Performance (also
refer to Table 2 in Appendix)
z scores (see p. 205 for an example)
The basic standard score is the z score. The z scale converts raw
scores to units of standard deviation in which the mean is zero and the
standard deviation is 1.0.
The formula is z = (X – M)/s, where X is the raw score, M is the mean,
and s is the standard deviation.
T scale (see p. 205, for example)
Type of standard score that sets the mean at 50 and
standard deviation at 10 to remove the decimal found in z
scores and to make all scores positive.
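Both conversions can be sketched briefly. The T score line uses T = 50 + 10z, which follows from setting the mean at 50 and the standard deviation at 10; the raw score, mean, and standard deviation are made up.

```python
def z_score(raw, mean, sd):
    """Raw score expressed in standard deviation units: z = (X - M) / s."""
    return (raw - mean) / sd

def t_score(raw, mean, sd):
    """Standard score with mean 50 and standard deviation 10."""
    return 50 + 10 * z_score(raw, mean, sd)

# Hypothetical test with mean 60 and standard deviation 5.
z = z_score(70, mean=60, sd=5)
t = t_score(70, mean=60, sd=5)
print(z, t)  # 2.0 and 70.0

# A below-average raw score gives a negative z but still a positive T.
print(z_score(55, 60, 5), t_score(55, 60, 5))  # -1.0 and 40.0
```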
Measuring Affective Behavior
To be continued. (Exam three will include information up to this
point. The remaining information from Chapter 11 on scales for
measuring affective behavior will be covered in class with the
information from Chapter 15 on Survey Research. The remainder of
the information in this chapter will be included in Exam four.)
Chapter 11 information continues on next slide.
Measuring Affective Behavior
Affective behavior includes attitudes, personality, anxiety, self-concept,
social behavior, and sportsmanship.
Scales for Measuring Affective Behavior
Likert-Type Data: Type of closed question that requires the
participant to respond by choosing one of several scaled responses;
the intervals between items are assumed to be equal.
Example:
I prefer quiet recreational activities such as chess, cards, or checkers rather
than activities such as running, tennis, or basketball.
Strongly agree
Agree
Undecided
Disagree
Strongly disagree
Measuring Affective Behavior
Benefits of Likert-Type Data
A principal advantage of scaled responses such as the Likert-type is
that they permit a wider choice of expression than responses such as
“always” or “never,” or “yes” or “no.” The five, seven, or more
intervals may help increase the reliability of the instrument.
Semantic Differential Scale: Scale used to measure affective behavior in
which the respondent is asked to make judgments about certain
concepts by choosing one of seven intervals between bipolar
adjectives (see example in text, p. 208).
Measuring Affective Behavior
Rating Scales: A measure of behavior that involves a subjective
evaluation based on a checklist of criteria. Raters are usually experts
on the criterion measure. When more than one judge is asked to rate
performances, some common standards must be set.
Rating Errors
Leniency - Tendency for observers to be overly generous in
rating.
Central tendency errors - Inclination of the rater to give an
inordinate number of ratings in the middle of the scale, avoiding
the extremes of the scale.
Halo effect - A threat to internal validity wherein raters allow
previous impressions or knowledge about a certain individual to
influence all ratings of that individual’s behaviors.
Measuring Affective Behavior
Rating Errors
Proximity error - Inclination of a rater to consider behaviors to
be more nearly the same when they are listed close together on
a scale than when they are separated by some distance. (For
example, if the qualities “active” and “friendly” are listed side by side on the
scale, proximity errors result if raters evaluate performers as more similar on
those characteristics than if the two qualities were listed several lines apart on
the rating scale.)
Observer bias error - Inclination of a rater to be influenced by
his or her own characteristics and prejudices. Observer bias
errors are directional because they produce errors that are
consistently too high or too low.
Measuring Affective Behavior
Rating Errors
Observer expectation error - Inclination of a rater to see
evidence of certain expected behaviors and interpret
observations in the “expected” direction. Such expectations
can contaminate the ratings.
(In the research setting, observer expectation errors are likely when
the observer knows what the experimental hypotheses are and is thus inclined
to watch for these outcomes more closely than if the observer were unaware
of the expected outcomes.)
Measuring Knowledge
Item Analysis
Item analysis - Process in analyzing knowledge tests in which
the suitability of test items and their ability to discriminate are
evaluated. Thus, the purpose of item analysis is to determine
which test items are suitable and which need to be rewritten or
discarded.
Two important parts of item analysis are:
1) To analyze the difficulty of the items on the test
2) To determine the degree of item discrimination
Measuring Knowledge
Item Analysis
Item Difficulty - Analysis of the difficulty of each test item in a
knowledge test, determined by dividing the number of people
who correctly answered the item by the total number of people
who responded to the item. (The more difficult the item is, the lower
its difficulty index is.)
Most test authorities recommend eliminating questions with
difficulty indices below .10 or above .90. The best questions are
those that have difficulty indices around .50.
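A minimal sketch of the difficulty index and the elimination rule described above. The counts are made up and the function names are hypothetical.

```python
def item_difficulty(n_correct, n_responded):
    """Proportion of respondents who answered the item correctly."""
    return n_correct / n_responded

def should_eliminate(difficulty_index, low=0.10, high=0.90):
    """Flag items that are too hard (below .10) or too easy (above .90)."""
    return difficulty_index < low or difficulty_index > high

# Hypothetical item: 30 of 40 respondents answered correctly.
index = item_difficulty(30, 40)
print(index, should_eliminate(index))  # 0.75, keep the item

# Hypothetical very easy item: 38 of 40 answered correctly.
easy_index = item_difficulty(38, 40)
print(easy_index, should_eliminate(easy_index))  # 0.95, flag the item
```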
Measuring Knowledge
Item Analysis
Item Discrimination - The degree to which a test item
discriminates between people who did well on the entire test
and those who did poorly; also called index of discrimination.
Item discrimination may be calculated by dividing the completed
tests into a high-scoring group and a low-scoring group and
then using the following formula:
Index of discrimination = (nH – nL)/n
where nH is the number of high scorers who answered the item
correctly, nL is the number of low scorers who answered the item
correctly, and n is the total number in either the high or the low
group. (Ex. if we have 30 in the high group and 30 in the low group and if 20 of
the high scorers answered an item correctly and 10 of the low scorers answered
it correctly, the index of discrimination would be (20 – 10)/30 = 10/30 = .33.)
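The formula and the worked example above can be checked with a short sketch (the function name is hypothetical):

```python
def index_of_discrimination(n_high_correct, n_low_correct, group_size):
    """(nH - nL) / n, where n is the size of either the high or the low group."""
    return (n_high_correct - n_low_correct) / group_size

# Example from the slide: 30 per group, 20 high scorers and 10 low scorers correct.
d = index_of_discrimination(20, 10, 30)
print(round(d, 2))  # 0.33

# An item answered correctly more often by low scorers gives a negative index.
print(round(index_of_discrimination(10, 20, 30), 2))  # -0.33
```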
End of Presentation