Measuring tool validity
Under Supervision of
Prof. Nagah Mahmoud
Prof. Nawal Fouad
Assist. Prof. Amel Shaban
Prepared by:
Walaa Hassan Ragab Arafat
Objectives:
At the end of this lecture each candidate will be able to:
• Discuss the different basic concepts of validity
• Distinguish between different types of validity
• Identify the factors that can lower validity
• Explain the item analysis procedure for norm- and criterion-referenced measures
Outline
• Definition of validity
• Types of validity:
Content validity
Face validity
Construct validity
Criterion-related validity (predictive and concurrent)
• Factors that can lower validity
• The item analysis procedure for norm- and criterion-referenced measures
(1) Validity
Validity is the extent to which an instrument measures what it is supposed to measure.
It is also the degree to which evidence and theory support the interpretations entailed by proposed uses of tests (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME]).
Is every valid measurement reliable?
What are the types of validity?
• Content validity
• Face validity
• Construct validity (contrasted groups, hypothesis testing, multitrait-multimethod, factor analysis)
• Criterion-related validity (predictive, concurrent)
1-Content validity
• It is the first type of validity that should be established, and it is the prerequisite for all other types of validity.
• It relates to how well the content of a test or measure matches the objectives to be measured or the domain specifications.
• It asks whether the items and questions cover the full range of the issues or problems being measured.
• To obtain evidence for content validity, the objectives are given to at least three experts in the area of content to be measured.
Role of experts
1. Link each objective with its respective item.
2. Assess the relevancy of the items to the content addressed by the objectives.
3. Judge whether they believe the items on the tool adequately represent the content.
• When more than two experts rate the items on a measure, the
alpha coefficient is employed as the index of content validity.
• The resulting alpha coefficient quantifies the extent to which
there is agreement between the experts’ ratings of the items.
• A coefficient of 0.00 indicates lack of agreement between the
experts and a coefficient of 1.00 indicates complete agreement.
Content validity index
When only two judges are employed, the content validity index (CVI) is used. Each judge rates the relevance of each item to the objective(s) using a 4-point rating scale:
(1) Not relevant
(2) Somewhat relevant
(3) Quite relevant
(4) Very relevant
Content validity index
• The CVI is defined as the proportion of items given a rating of quite or very relevant (3 or 4) by both raters involved.
• A CVI of 0.80 (80%) or higher denotes a high level of agreement.
• A CVI below 0.80 (80%) means the items on the instrument do not adequately address the domains being explored.
Content validity summary
• More than two experts → alpha coefficient: a coefficient of 0.00 indicates lack of agreement between the experts; a coefficient of 1.00 indicates complete agreement.
• Only two judges → content validity index (CVI): 0.80 (80%) or above denotes a high level of agreement; below 0.80, the items do not adequately address the domains.
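As a minimal sketch, the two-judge CVI can be computed as follows (the rating lists are hypothetical; scores of 3 and 4 count as relevant):

```python
# Two-judge CVI: each judge's ratings are a list of 1-4 relevance scores,
# one score per item (hypothetical data).
def cvi(judge_a, judge_b):
    """Proportion of items rated quite/very relevant (3 or 4) by both judges."""
    both = sum(1 for a, b in zip(judge_a, judge_b) if a >= 3 and b >= 3)
    return both / len(judge_a)

print(cvi([4, 3, 2, 4, 3], [3, 4, 3, 4, 2]))  # 3 of 5 items -> 0.60 (< 0.80)
```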
2-Face validity
The degree to which an assessment or test subjectively appears to measure the variable or construct that it is supposed to measure.
It is a subjective judgment on the operationalization of a construct.
Face validity cont.,
Face validity is determined by a review of the
items and not through the use of statistical analyses.
Unlike content validity, face validity is not
investigated through formal procedures.
Face validity cont.,
Anyone who looks over the test, including
examinees, may develop an informal opinion as to
whether or not the test is measuring what it is
supposed to measure.
3-Construct validity
Construct validity is the extent to which
relationships among items included in the
measure are consistent with the theory and
concepts as operationally defined.
Cont.
• For example:
A test of intelligence nowadays must
include measures of multiple intelligences, rather
than just logical-mathematical and linguistic
ability measures.
Construct validity: approaches
1. Contrasted groups
2. Hypothesis-testing approach
3. The multitrait-multimethod approach
4. Factor analysis
For example
• To examine the validity of a measure designed to quantify venous access, a nurse asks a group of clinical specialists on a given unit to identify a group of patients known to have good venous access and a group known to have very poor access.
• The nurse employs the measure with both groups, obtains a mean for each group, then compares the difference between the two means using a t-test or another appropriate statistic.
• If a significant difference is found between the mean scores of the two groups, there is some evidence for construct validity; that is, the instrument measures the attribute of interest.
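A minimal sketch of this contrasted-groups comparison, assuming hypothetical venous-access scores for the two groups and using SciPy's independent-samples t-test:

```python
# Contrasted-groups check: compare mean venous-access scores of the two
# clinician-identified groups (hypothetical data).
from scipy.stats import ttest_ind

good_access = [8, 9, 7, 9, 8, 10, 7, 9]
poor_access = [3, 4, 2, 5, 3, 4, 2, 3]

t_stat, p_value = ttest_ind(good_access, poor_access)
if p_value < 0.05:
    print("Significant group difference: some evidence for construct validity")
```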
What is the multitrait-multimethod approach?
3-The multitrait-multimethod approach
It is appropriately employed when the investigator can:
1- Measure two or more different constructs.
2- Use two or more different methodologies to measure each construct.
3- Administer all instruments to every subject at the same time.
4- Assume that performance on each instrument is independent of (not influenced or biased by) performance on any other instrument.
The multitrait-multimethod approach cont.,
The approach rests on two principles:
1- Different measures of the same construct should correlate highly with each other (the convergent validity principle).
2- Measures of different constructs should have low correlations with each other (the discriminant validity principle).
[Diagram] Example: two constructs, bonding (construct 1) and prenatal care (construct 2), each measured with both a rating scale and a checklist. Measures of the same construct obtained by different methods (e.g., the bonding rating scale and the bonding checklist) should show high correlation, while measures of different constructs should show low correlation.
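A minimal sketch of this convergent/discriminant pattern, using simulated (hypothetical) scores:

```python
# Simulate two constructs, each measured by two methods, then check the
# expected correlation pattern (hypothetical data).
import numpy as np

rng = np.random.default_rng(1)
bonding = rng.normal(size=100)          # latent construct 1
prenatal = rng.normal(size=100)         # latent construct 2
bonding_scale     = bonding + rng.normal(scale=0.4, size=100)
bonding_checklist = bonding + rng.normal(scale=0.4, size=100)
prenatal_scale    = prenatal + rng.normal(scale=0.4, size=100)

# Convergent: same construct, different methods -> high correlation
print(np.corrcoef(bonding_scale, bonding_checklist)[0, 1])
# Discriminant: different constructs -> low correlation
print(np.corrcoef(bonding_scale, prenatal_scale)[0, 1])
```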
Disadvantages of the multitrait-multimethod approach
• For subjects who must respond to multiple instruments at one time, it:
1. Decreases respondents' willingness to participate and lowers the response rate.
2. Introduces the potential for more errors of measurement as a result of respondent fatigue.
• There is also the cost in time and money necessary to employ the method.
4-Factor Analysis
• It is a useful approach to assessing construct validity when the investigator has designed, on the basis of a conceptual framework, a measure to assess various dimensions or subcomponents of a phenomenon of interest and wishes to justify these dimensions or factors empirically.
• It is a procedure that gives the researcher information about the extent to which a set of items measures the same underlying construct or dimension of a construct.
 Items designed to measure:
* the same dimension should load on the same factor;
* differing dimensions should load on different factors.
Factor analysis is commonly used in:
– Data reduction
– Scale development
– The assessment of the dimensionality of a set of
variables.
Steps in factor analysis “four steps”
– 1st Step: The correlation matrix for all variables is computed
– 2nd Step: Factor extraction
– 3rd Step: Factor rotation
– 4th Step: Make final decisions about the number of underlying
factors
Steps in factor analysis
• The investigator administers the tool to a large representative sample at one time.
• A parametric (Pearson correlation) or nonparametric (Spearman correlation) factor-analysis procedure is applied to the data.
• The result of this factoring process is a group of linear combinations of items called factors, each of which is independent of all other identified factors.
• Each factor is then correlated with each item to produce factor loadings.
• The next step is a rotation, in which the factors are repositioned in such a way as to give them more interpretability.
• Rotated factors are interpreted by examining the items loading on each, over and above a certain preset criterion (usually 0.30 is the minimum).
• If evidence for construct validity exists, the number of
factors resulting from the analysis should approximate
the number of dimensions or subcomponents assessed
by the measure, and the items with the highest factor
loadings defining each factor should correspond with
the items designed to measure each of the dimensions
of the measure.
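A minimal sketch of these steps, using scikit-learn's FactorAnalysis with varimax rotation on simulated data (the package choice, the data, and the placement of the 0.30 criterion are assumptions for illustration):

```python
# Extract and rotate factors, then interpret each factor by the items
# loading above the 0.30 criterion (simulated data: items 0-2 share one
# latent dimension, items 3-5 another).
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
f1, f2 = rng.normal(size=(2, 200))
X = np.column_stack([f + rng.normal(scale=0.5, size=200)
                     for f in (f1, f1, f1, f2, f2, f2)])

fa = FactorAnalysis(n_components=2, rotation="varimax").fit(X)
loadings = fa.components_.T            # rows = items, columns = factors

for j in range(loadings.shape[1]):
    items = np.where(np.abs(loadings[:, j]) >= 0.30)[0]
    print(f"Factor {j + 1}: items {items.tolist()}")
```

If evidence for construct validity exists, items 0-2 should define one factor and items 3-5 the other, mirroring the designed dimensions.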
4-Criterion-related validity
A-Predictive validity
B-Concurrent validity
A-Predictive validity
Predictive validity applies when the test is used to predict future performance.
For example
Entrance exam scores should correlate with later performance in professional college.
General procedure for predictive
validity
1) A large group of people take the test.
2)The scores for those people are held for a
predetermined period of time.
3) Once the time period elapses, a measure of
some behavior (i.e., the criterion) is taken.
General procedure for predictive
validity cont.,
4) The test scores are then correlated with the
criterion scores.
5) If the scores correlate, the test has predictive
validity.
6) The resulting correlation coefficient is called
the validity coefficient.
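A minimal sketch of steps 4-6, with hypothetical entrance-exam scores and later college grades as the criterion:

```python
# Correlate test scores with later criterion scores; the resulting r is
# the validity coefficient (hypothetical data).
from scipy.stats import pearsonr

entrance_scores = [62, 75, 81, 58, 90, 70, 66, 85]
college_gpa     = [2.8, 3.1, 3.6, 2.5, 3.9, 3.0, 2.9, 3.7]

validity_coefficient, p_value = pearsonr(entrance_scores, college_gpa)
print(round(validity_coefficient, 2))
```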
B-Concurrent validity
• The extent to which a measure may be used to estimate
an individual’s present standing on the criterion.
• Concurrent validity is the practical alternative to the
ideal predictive method.
Concurrent validity cont.,
• With concurrent validity you obtain, at roughly the same time, both test scores and criterion scores in some predetermined population.
• Once this is accomplished, you simply correlate the test scores with the criterion scores.
What are the factors that can lower validity?
• Unclear directions
• Difficult reading vocabulary and sentence structure
• Ambiguity in statements
• Inappropriate level of difficulty
• Identifiable patterns of answers.
• Tests that are too short.
• Inadequate time limits
• Improper arrangement of items (complex to easy).
• Poorly constructed test items
• Inadequate sample.
• Improper test administration.
• Scoring that is subjective.
• Test items inappropriate for the outcomes being
measured
(4) The item analysis procedure for norm- and criterion-referenced measures
Norm-referenced
• Item p value
• Discrimination index
• Item-response chart
• Differential item functioning
Criterion-referenced
• Item-objective congruence
• Empirical item analysis
• Item difficulty
• Item discrimination indices
1-Norm-Referenced Item-Analysis
1. Item p level / difficulty level
• It is the proportion of correct responses to that item.
• It is determined by counting the number of subjects
selecting the correct or desired response to a
particular item and then dividing this number by the
total number of subjects.
• The range of p levels may be from 0 to 1.00.
• The closer the value of p is to 1.00, the easier
the item;
• The closer the value of p is to zero, the more
difficult the item.
Note:
• When norm-referenced measures are employed, p levels between 0.30 and 0.70 are desirable.
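A minimal sketch of the p-level calculation, with hypothetical responses coded 1 = correct and 0 = incorrect:

```python
# Item p level: proportion of respondents answering the item correctly.
def p_level(item_responses):
    return sum(item_responses) / len(item_responses)

print(p_level([1, 1, 0, 1, 0, 1, 1, 0, 1, 1]))  # 0.7 -> a fairly easy item
```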
2. Discrimination index (D)
• The discrimination index assesses an item's ability to discriminate; it is a powerful indicator of test-item quality.
• If performance on a given item is a good predictor of performance on the overall measure, the item is said to be a good discriminator.
• D ranges from -1.00 to +1.00.
1. Rank all subjects’ performance on the measure by
using total scores from high to low.
2. Identify those individuals who ranked in the upper 25%.
3. Identify those individuals who ranked in the lower 25%.
4. Place the remaining scores aside.
5. Determine the proportion of respondents in the top 25
% who answered the item correctly (Pu).
6. Determine the proportion of respondents in the lower
25% who answered the item correctly (PL).
7. Calculate D by subtracting PL from Pu
(i.e., D= Pu – PL).
8. Repeat steps 5 through 7 for each item on the measure.
A positive D value is desirable and indicates that the item discriminates in the same manner as the total test.
• A negative D value suggests that the item is
not discriminating in the same way as the
total test; that is, respondents who obtain low
scores on the total measure tend to get the
item correct, while those who score high on
the measure tend to respond incorrectly.
• A negative D value indicates that the item is faulty and needs improvement.
• Possible explanations for a negative D value are that the item provides a clue that enables the lower-scoring subjects to guess the correct response, or that the item is misinterpreted by the high scorers.
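A minimal sketch of steps 1-7, with hypothetical total scores and 1/0 responses to a single item:

```python
# D = Pu - PL, computed from the upper and lower 25% of total scores.
def discrimination_index(total_scores, item_correct):
    n = len(total_scores)
    order = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    k = max(1, n // 4)                     # size of each 25% group
    upper, lower = order[:k], order[-k:]
    pu = sum(item_correct[i] for i in upper) / k
    pl = sum(item_correct[i] for i in lower) / k
    return pu - pl

totals = [95, 88, 82, 77, 70, 64, 55, 41]
item   = [1,  1,  1,  0,  1,  0,  0,  0]   # 1 = correct on this item
print(discrimination_index(totals, item))  # 1.0 -> a strong discriminator
```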
3. Item-response chart
• Like D, the item-response chart assesses an item's discriminatory power.
• In addition to its utility in analyzing true/false or MCQ items, it is useful in situations in which affective measures with more than two choices are employed.
• The respondents ranking in the upper and lower 25% are identified as in steps 1 through 4 for determining D.
• Construct a fourfold table using the two categories: high/low scores and correct/incorrect responses for a given item.
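A minimal sketch of the fourfold table, assuming 1/0 item responses for the already-identified upper and lower 25% groups:

```python
# Fourfold (2 x 2) item-response chart: high/low group x correct/incorrect.
def fourfold_table(upper, lower):
    return {
        "upper": {"correct": sum(upper), "incorrect": len(upper) - sum(upper)},
        "lower": {"correct": sum(lower), "incorrect": len(lower) - sum(lower)},
    }

print(fourfold_table(upper=[1, 1, 1, 0], lower=[0, 1, 0, 0]))
```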
4. Differential item functioning
• Differential item functioning (DIF) refers to “when examinees of the same ability but belonging to different groups have differing probabilities of success on an item”.
• When DIF is present, it is an indicator of potential item bias.
• A relatively simple approach to detecting DIF is to compare item discrimination indices (i.e., D, p level, and/or item-response charting) across different groups of respondents to determine if responses to the item(s) differ by group membership.
2-Criterion-Referenced Item-Analysis
1. Item-objective congruence
2. Empirical item analysis
3. Item difficulty (p level)
4. Item discrimination (D)
1- Item-Objective Congruence
It provides an index of the validity of an item based on the ratings of two or more content specialists.
In this method content specialists are directed to assign a value of +1, 0, or -1 for each item, depending upon the item's congruence with the measure's objective.
Item-Objective Congruence
• A value of +1: the item is judged to be a definite measure of the objective.
• A rating of 0: undecided about whether the item is a measure of the objective.
• A rating of -1: the item is judged not to be a measure of the objective.
• An index cut-off score should be set to separate valid items (to retain) from non-valid items (to revise or discard) within the test.
For example:
• Suppose the index cut-off score is 0.75. Then all items with an index of item-objective congruence below 0.75 are deemed non-valid, while those with an index of 0.75 or above are considered valid.
Formula (Martuza, 1977)
I_ik = [(M − 1)S_k − S'_k] / [2N(M − 1)]
where:
• I_ik = the index of item-objective congruence for item i and objective k
• M = the number of objectives
• N = the number of content specialists
• S_k = the sum of the ratings assigned to objective k
• S'_k = the sum of the ratings assigned to all objectives except objective k
Item-Objective Congruence Example (M = 4 objectives, N = 3 content specialists)

Specialist   Obj 1   Obj 2   Obj 3   Obj 4
A             +1      -1      -1      -1
B             +1      -1      -1      -1
C              0      -1       0      -1
S_k           +2      -3      -2      -3
For item 1, with M = 4, N = 3, and S_1 = +2:
• S'_1 = (-3) + (-2) + (-3) = -8
• I_11 = [(4 − 1)(+2) − (−8)] / [2(3)(4 − 1)] = (6 + 8) / 18 = 0.78
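A minimal sketch of the Martuza formula, reproducing the worked example above:

```python
# I_ik = [(M-1)S_k - S'_k] / [2N(M-1)]  (Martuza, 1977)
def item_objective_congruence(ratings, k):
    """ratings: one list of per-objective ratings (+1, 0, -1) per specialist."""
    n = len(ratings)                       # N content specialists
    m = len(ratings[0])                    # M objectives
    s_k = sum(r[k] for r in ratings)
    s_other = sum(sum(r) for r in ratings) - s_k
    return ((m - 1) * s_k - s_other) / (2 * n * (m - 1))

ratings = [[+1, -1, -1, -1],   # specialist A
           [+1, -1, -1, -1],   # specialist B
           [ 0, -1,  0, -1]]   # specialist C
print(round(item_objective_congruence(ratings, 0), 2))  # 0.78
```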
2-Empirical Item Analysis
• Empirical data are obtained from respondents
in order to evaluate the effectiveness of the
items of the measuring tool.
• Groups chosen for item analysis of criterion-
referenced measures are often referred to as
criterion groups.
Empirical item analysis
Two approaches are used for identifying criterion groups:
1. The criterion groups technique.
2. The pretreatment–posttreatment measurements approach.
The criterion groups technique
It involves the testing of two separate groups at the same time: one group that is known by independent means to possess more of the specified trait or attribute, and a second group known to possess less.
The subjects chosen for each of the
groups should be as similar as possible on
relevant characteristics, for example, social
class, cultures, and ages.
The only real difference between the
groups should be in terms of exposure to the
specified treatment or experience.
Cont.,
• For example, if the purpose of a criterion-referenced measure is to identify parents who have and who have not adjusted to parenthood after the birth of a first child, two groups of parents would be of interest: those who have adjusted to parenthood and those who have not yet had the opportunity to adjust to parenthood.
The pretreatment–posttreatment measurements approach
It involves testing one group of subjects twice: once before exposure to some specific treatment (pretreatment), and again after exposure to the treatment (posttreatment).
Cont.
• For example, in the case where instruction is the treatment, testing would occur before instruction (preinstruction) and after instruction (postinstruction).
Subjects are usually tested with the same set of items
on both occasions.
3-Item Difficulty (p level)
The item p levels for each item are compared
between groups to help determine if respondents
would have performed similarly on an item,
regardless of which group they are in.
Cont.
• The item p level should be higher for the group
that is known to possess more of a specified
trait or attribute than for the group known to
possess less.
4-Item Discrimination (D)
The focus of item-discrimination indices for criterion-referenced measures is on the measurement of performance changes (e.g., pretest–posttest) or differences (e.g., experienced vs. inexperienced parents) between the criterion groups.
1-Criterion groups difference index (CGDI)
The CGDI is the proportion of respondents in the group known to possess more of the trait who answered the item correctly, minus the proportion of respondents in the group known to have less of the trait who answered it correctly.
Calculating the CGDI:
• CGDI = the item p level for the group known to possess more of the attribute − the item p level for the group known to have less of the attribute.
2-Pretest–posttest difference index (PPDI)
Under the pretreatment–posttreatment measurements approach, the PPDI is the proportion of respondents who answered the item correctly on the posttest minus the proportion who responded to the item correctly on the pretest.
Calculating the PPDI:
• PPDI = the item p level on the posttest − the item p level on the pretest.
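A minimal sketch of both indices, with hypothetical 1/0 response lists:

```python
# p level = proportion answering the item correctly.
def p_level(item_responses):
    return sum(item_responses) / len(item_responses)

more_group = [1, 1, 1, 0, 1, 1]   # group known to possess more of the trait
less_group = [0, 1, 0, 0, 1, 0]   # group known to possess less
cgdi = p_level(more_group) - p_level(less_group)   # 0.83 - 0.33 = 0.50

pretest  = [0, 0, 1, 0, 1, 0]
posttest = [1, 1, 1, 0, 1, 1]
ppdi = p_level(posttest) - p_level(pretest)        # 0.83 - 0.33 = 0.50
print(round(cgdi, 2), round(ppdi, 2))
```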
Interpretation of results for CGDI
and PPDI
• The range of values for each of the indices
discussed previously is −1.00 to +1.00
• A high positive index for each of these item
discrimination indices is desirable, because
this would reflect the item’s ability to
discriminate between criterion groups.
• Items with high positive discrimination indices
improve the decision validity of a test.
Summary
1-Definition of validity
2-Types of validity
3-Factors that can lower validity
4-The item analysis procedure for norm- and criterion-referenced measures
Any Questions?
References
• Glen, S. (2020). Statistics How To. Available at https://www.statisticshowto.com.
• Northern Arizona University. (1998). Lesson 6-2-1. Available at jan.ucc.nau.edu › measurement › part2.
• Stephanie, G. (2016). Measurement Error (Observational Error). StatisticsHowTo.com: Elementary Statistics for the rest of us! Retrieved 22-11-2020 from https://www.statisticshowto.com/measurement-error/.
• Trochim, W.M.K. & Conjointly (2020). Measurement Error. Research Methods Knowledge Base. Retrieved 22-11-2020 from https://conjointly.com/kb/measurement-error/.
• Waltz, C.F., Strickland, O.L., & Lenz, E.R. (2010). Measurement in Nursing and Health Research (4th ed.). Springer Publishing Company. Pp. 145-162.