Measuring Tool Validity

Under the supervision of:
Prof. Nagah Mahmoud
Prof. Nawal Fouad
Assist. Prof. Amel Shaban

Prepared by: Walaa Hassan Ragab Arafat

Objectives
At the end of this lecture each candidate will be able to:
• Discuss the different basic concepts of validity
• Distinguish between the different types of validity
• Conclude the factors that can lower validity
• Explain the item-analysis procedure for norm- and criterion-referenced measures

Outline
• Definition of validity
• Types of validity: content validity, face validity, construct validity, and criterion-related validity (predictive validity and concurrent validity)
• Factors that can lower validity
• The item-analysis procedure for norm- and criterion-referenced measures

(1) Validity
• The extent to which an instrument measures what it is supposed to measure.
• The degree to which evidence and theory support the interpretations entailed by proposed uses of tests (American Educational Research Association [AERA], American Psychological Association [APA], and National Council on Measurement in Education [NCME]).
• Is every valid measurement reliable? Yes: a valid measure must be reliable, although a reliable measure is not necessarily valid.

(2) What are the types of validity?
• Content validity
• Face validity
• Construct validity: contrasted groups, hypothesis testing, the multitrait-multimethod approach, and factor analysis
• Criterion-related validity: predictive validity and concurrent validity

1-Content validity
• It is the first type of validity that should be established and is the prerequisite for all other types of validity.
• It relates to how well the content of a test or measure matches the objectives to be measured or the domain specifications.
• It asks whether the items and questions cover the full range of the issue or problem being measured.
• To obtain evidence for content validity, the objectives are given to at least three experts in the area of content to be measured.

Role of the experts
1. Link each objective with its respective item.
2. Assess the relevance of the items to the content addressed by the objectives.
3. Judge whether they believe the items on the tool adequately represent the content.

Alpha coefficient
• When more than two experts rate the items on a measure, the alpha coefficient is employed as the index of content validity.
• The resulting alpha coefficient quantifies the extent of agreement between the experts' ratings of the items.
• A coefficient of 0.00 indicates lack of agreement between the experts, and a coefficient of 1.00 indicates complete agreement.

Content validity index (CVI)
• When only two judges are employed, the content validity index (CVI) is used. The judges rate the relevance of each item to the objective(s) on a 4-point rating scale: (1) not relevant, (2) somewhat relevant, (3) quite relevant, (4) very relevant.
• The CVI is defined as the proportion of items rated quite relevant or very relevant by both raters.
• A CVI of 0.80 (80%) or higher denotes a high level of agreement.
• A CVI below 0.80 means the items on the instrument do not adequately address the domains being explored.

In summary, for content validity:
• More than two experts → alpha coefficient (0.00 indicates lack of agreement; 1.00 indicates complete agreement).
• Only two judges → content validity index (0.80 or higher denotes a high level of agreement; below 0.80 means the items do not adequately address the domains).
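As an illustration, here is a minimal Python sketch of the two-judge CVI computation; the ratings are hypothetical and serve only to show the rule of counting items rated 3 or 4 by both judges.

```python
# Content validity index (CVI) for two judges: the proportion of items
# rated "quite relevant" (3) or "very relevant" (4) by both raters.
# The ratings below are hypothetical, for illustration only.

rater_a = [4, 3, 4, 2, 4, 3, 1, 4, 3, 4]   # 4-point relevance ratings, items 1..10
rater_b = [3, 4, 4, 1, 4, 4, 2, 3, 4, 4]

relevant = [a >= 3 and b >= 3 for a, b in zip(rater_a, rater_b)]
cvi = sum(relevant) / len(relevant)

print(f"CVI = {cvi:.2f}")   # 8 of 10 items rated relevant by both -> 0.80
```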
2-Face validity
• The degree to which an assessment or test subjectively appears to measure the variable or construct that it is supposed to measure.
• It is a subjective judgment on the operationalization of a construct.
• Face validity is determined by a review of the items, not through the use of statistical analyses.
• Unlike content validity, face validity is not investigated through formal procedures: anyone who looks over the test, including examinees, may develop an informal opinion as to whether or not the test is measuring what it is supposed to measure.

3-Construct validity
• Construct validity is the extent to which the relationships among the items included in the measure are consistent with the theory and concepts as operationally defined.
• For example, a test of intelligence nowadays must include measures of multiple intelligences, rather than just logical-mathematical and linguistic ability measures.

Approaches to establishing construct validity:
1. Contrasted groups
2. Hypothesis-testing approach
3. The multitrait-multimethod approach
4. Factor analysis

1-Contrasted groups: an example
• To examine the validity of a measure designed to quantify venous access, a nurse asks a group of clinical specialists on a given unit to identify a group of patients known to have good venous access and a group known to have very poor access.
• The nurse employs the measure with both groups, obtains a mean for each group, then compares the difference between the two means using a t test or another appropriate statistic.
• If a significant difference is found between the mean scores of the two groups, there is some evidence for construct validity; that is, the instrument measures the attribute of interest.
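A minimal sketch of this contrasted-groups comparison in Python, assuming the scores of the two known groups have already been collected; the numbers are hypothetical.

```python
# Contrasted-groups evidence for construct validity: compare mean scores of a
# group known to have good venous access with a group known to have poor access.
# Scores are hypothetical, for illustration only.
from scipy import stats

good_access = [42, 45, 39, 47, 44, 41, 46, 43]   # measure scores, "good" group
poor_access = [28, 31, 25, 33, 29, 27, 30, 32]   # measure scores, "poor" group

t, p = stats.ttest_ind(good_access, poor_access)
print(f"t = {t:.2f}, p = {p:.4f}")

# A significant difference between the group means (e.g., p < 0.05) provides
# some evidence that the instrument measures the attribute of interest.
```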
What is the multitrait-multimethod approach?

3-The multitrait-multimethod approach
It is appropriately employed when the investigator can:
1. Measure two or more different constructs.
2. Use two or more different methodologies to measure each construct.
3. Administer all instruments to every subject at the same time.
4. Assume that performance on each instrument is independent of (not influenced or biased by) performance on any other instrument.

The approach rests on two principles:
1. Different measures of the same construct should correlate highly with each other (the convergent validity principle).
2. Measures of different constructs should have low correlations with each other (the discriminant validity principle).

For example, if bonding (construct 1) and prenatal care (construct 2) are each measured with both a rating scale and a checklist, the two bonding measures should correlate highly with each other, and the two prenatal-care measures should correlate highly with each other, while correlations between bonding measures and prenatal-care measures should be low.

Disadvantages of the multitrait-multimethod approach
• For subjects who must respond to multiple instruments at one time, it:
1. Decreases respondents' willingness to participate, decreasing the response rate.
2. Introduces the potential for more errors of measurement as a result of respondent fatigue.
• The cost in time and money necessary to employ the method.

4-Factor analysis
• It is a useful approach to assessing construct validity when the investigator has designed, on the basis of a conceptual framework, a measure to assess various dimensions or subcomponents of a phenomenon of interest and wishes to empirically justify these dimensions or factors.
• It is a procedure that gives the researcher information about the extent to which a set of items measures the same underlying construct or dimension of a construct.
• Items designed to measure the same dimension should load on the same factor; items designed to measure differing dimensions should load on different factors.
• Factor analysis is commonly used in data reduction, scale development, and the assessment of the dimensionality of a set of variables.

Steps in factor analysis (four steps)
1. The correlation matrix for all variables is computed.
2. Factor extraction.
3. Factor rotation.
4. Final decisions are made about the number of underlying factors.

In practice:
• The investigator administers the tool to a large representative sample at one time.
• The correlations among the items are computed with a parametric (Pearson) or nonparametric (Spearman) correlation coefficient, and a factor-analysis procedure is applied.
• The result of the factoring process is a group of linear combinations of items, called factors, each of which is independent of all other identified factors.
• Each factor is then correlated with each item to produce factor loadings.
• The next step is rotation, in which the factors are repositioned in such a way as to give them more interpretability.
• Rotated factors are interpreted by examining the items loading on each factor above a preset criterion (usually 0.30 is the minimum).
• If evidence for construct validity exists, the number of factors resulting from the analysis should approximate the number of dimensions or subcomponents assessed by the measure, and the items with the highest factor loadings on each factor should correspond to the items designed to measure each dimension.
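A small sketch of the factoring process using scikit-learn's FactorAnalysis on simulated data; the six simulated items and the two-dimension setup are illustrative assumptions, not a prescribed procedure.

```python
# Factor analysis sketch: simulate responses to six items designed to measure
# two dimensions, extract two factors, rotate, and inspect the loadings.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 300
dim1 = rng.normal(size=n)           # latent dimension 1
dim2 = rng.normal(size=n)           # latent dimension 2

# Items 1-3 are written to tap dimension 1, items 4-6 to tap dimension 2.
items = np.column_stack([
    dim1 + rng.normal(scale=0.5, size=n),
    dim1 + rng.normal(scale=0.5, size=n),
    dim1 + rng.normal(scale=0.5, size=n),
    dim2 + rng.normal(scale=0.5, size=n),
    dim2 + rng.normal(scale=0.5, size=n),
    dim2 + rng.normal(scale=0.5, size=n),
])

fa = FactorAnalysis(n_components=2, rotation="varimax").fit(items)

# Loadings above the usual 0.30 criterion define each factor; items 1-3 should
# load on one factor and items 4-6 on the other if construct validity holds.
print(np.round(fa.components_.T, 2))
```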
4-Criterion-related validity
A. Predictive validity
B. Concurrent validity

A-Predictive validity
• Established when the test is used to predict future performance.
• For example, entrance exam scores should correlate with later performance in the professional college.

General procedure for predictive validity
1. A large group of people take the test.
2. The scores for those people are held for a predetermined period of time.
3. Once the time period elapses, a measure of some behavior (i.e., the criterion) is taken.
4. The test scores are then correlated with the criterion scores.
5. If the scores correlate, the test has predictive validity.
6. The resulting correlation coefficient is called the validity coefficient.

B-Concurrent validity
• The extent to which a measure may be used to estimate an individual's present standing on the criterion.
• Concurrent validity is the practical alternative to the ideal predictive method.
• With concurrent validity, you obtain both test scores and criterion scores at roughly the same time in some predetermined population.
• Once this is accomplished, you simply correlate the test scores with the criterion scores.
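A minimal sketch of computing the validity coefficient for either design; the entrance-exam and GPA values are hypothetical. For predictive validity the criterion is collected later, whereas for concurrent validity both sets of scores are collected at roughly the same time.

```python
# Validity coefficient sketch: correlate test scores with criterion scores.
# Data are hypothetical, for illustration only.
from scipy import stats

entrance_exam = [78, 85, 62, 91, 70, 88, 57, 95, 66, 80]            # test scores
college_gpa   = [3.1, 3.4, 2.5, 3.8, 2.9, 3.5, 2.3, 3.9, 2.6, 3.2]  # criterion

r, p = stats.pearsonr(entrance_exam, college_gpa)
print(f"validity coefficient r = {r:.2f} (p = {p:.4f})")
```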
(3) What are the factors that can lower validity?
• Unclear directions
• Difficult reading vocabulary and sentence structure
• Ambiguity in statements
• Inappropriate level of difficulty
• Identifiable patterns of answers
• Tests that are too short
• Inadequate time limits
• Improper arrangement of items (e.g., from complex to easy)
• Poorly constructed test items
• Inadequate sampling
• Improper test administration
• Subjective scoring
• Test items inappropriate for the outcomes being measured

(4) The item-analysis procedure for norm- and criterion-referenced measures

Norm-referenced:
• Item p value (difficulty level)
• Discrimination index
• Item-response chart
• Differential item functioning

Criterion-referenced:
• Item-objective congruence
• Empirical item analysis
• Item difficulty
• Item discrimination indices

1-Norm-Referenced Item Analysis

1. Item p level (difficulty level)
• It is the proportion of correct responses to the item.
• It is determined by counting the number of subjects selecting the correct or desired response to a particular item and then dividing this number by the total number of subjects.
• The range of p levels is from 0 to 1.00: the closer the value of p is to 1.00, the easier the item; the closer the value of p is to zero, the more difficult the item.
• Note: when norm-referenced measures are employed, p levels between 0.30 and 0.70 are desirable.

2. Discrimination index (D)
• Assesses an item's ability to discriminate; it is a powerful indicator of test-item quality.
• If performance on a given item is a good predictor of performance on the overall measure, the item is said to be a good discriminator.
• D ranges from −1.00 to +1.00. (A short sketch of both the p level and D appears after the item-objective congruence example below.)

Calculating D:
1. Rank all subjects' performance on the measure using total scores, from high to low.
2. Identify the individuals who ranked in the upper 25%.
3. Identify the individuals who ranked in the lower 25%.
4. Set the remaining scores aside.
5. Determine the proportion of respondents in the top 25% who answered the item correctly (Pu).
6. Determine the proportion of respondents in the lower 25% who answered the item correctly (PL).
7. Calculate D by subtracting PL from Pu (i.e., D = Pu − PL).
8. Repeat steps 5 through 7 for each item on the measure.

Interpreting D:
• A positive D value is desirable and indicates that the item discriminates in the same manner as the total test.
• A negative D value suggests that the item is not discriminating in the same way as the total test; that is, respondents who obtain low scores on the total measure tend to get the item correct, while those who score high on the measure tend to respond incorrectly.
• A negative D value indicates that the item is faulty and needs improvement. Possible explanations for a negative D value are that the item provides a clue that enables the lower-scoring subjects to guess the correct response, or that the item is misinterpreted by the high scorers.

3. Item-response chart
• Like D, the item-response chart assesses an item's discriminatory power.
• In addition to its utility in analyzing true/false or multiple-choice items, it is useful with affective measures that have more than two response choices.
• The respondents ranking in the upper and lower 25% are identified as in steps 1 through 4 for determining D.
• A fourfold table is then constructed using the two categories high/low scores and correct/incorrect for a given item.

4. Differential item functioning (DIF)
• Differential item functioning (DIF) refers to the situation in which "examinees of the same ability but belonging to different groups have differing probabilities of success on an item."
• When DIF is present, it is an indicator of potential item bias.
• A relatively simple approach to detecting DIF is to compare item indices (i.e., D, p level, and/or item-response charting) across different groups of respondents to determine whether responses to the item(s) differ by group membership.

2-Criterion-Referenced Item Analysis
1. Item-objective congruence
2. Empirical item analysis
3. Item difficulty (p level)
4. Item discrimination (D)

1-Item-objective congruence
• It provides an index of the validity of an item based on the ratings of two or more content specialists.
• In this method, content specialists are directed to assign a value of +1, 0, or −1 to each item, depending on the item's congruence with the measure's objectives:
  +1: the item is judged to be a definite measure of the objective.
  0: undecided about whether the item is a measure of the objective.
  −1: the item is not a measure of the objective.
• An index cut-off score should be set to separate valid items (to be retained) from non-valid items (to be revised or discarded).
• For example, if the index cut-off score is 0.75, then all items with an index of item-objective congruence below 0.75 are deemed non-valid, while those with an index of 0.75 or above are considered valid.

Formula (Martuza, 1977)

I_ik = [(M − 1)S_k − S′_k] / [2N(M − 1)]

where:
• I_ik = the index of item-objective congruence for item i and objective k
• M = the number of objectives
• N = the number of content specialists
• S_k = the sum of the ratings assigned to objective k
• S′_k = the sum of the ratings assigned to all objectives except objective k

Example (M = 4 objectives, N = 3 content specialists), ratings for item 1:

Content specialist   Objective 1   Objective 2   Objective 3   Objective 4
A                    +1            −1            −1            −1
B                    +1            −1            −1            −1
C                     0            −1             0            −1
S_k                  +2            −3            −2            −3

• S_1 = +2 and S′_1 = (−3) + (−2) + (−3) = −8
• Hence I_11 = [(4 − 1)(+2) − (−8)] / [2(3)(4 − 1)] = (6 + 8)/18 = 0.78
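A short Python sketch that reproduces the worked example above; the function name item_objective_congruence is illustrative.

```python
# Index of item-objective congruence (Martuza, 1977), reproducing the worked
# example above: M = 4 objectives, N = 3 content specialists.

def item_objective_congruence(ratings, k, M):
    """ratings: one list of M ratings (+1/0/-1) per specialist;
    k: index of the objective the item was written for (0-based)."""
    N = len(ratings)
    s_k = sum(r[k] for r in ratings)                 # ratings on objective k
    s_other = sum(sum(r) for r in ratings) - s_k     # ratings on all other objectives
    return ((M - 1) * s_k - s_other) / (2 * N * (M - 1))

specialists = [
    [+1, -1, -1, -1],   # specialist A
    [+1, -1, -1, -1],   # specialist B
    [ 0, -1,  0, -1],   # specialist C
]

print(round(item_objective_congruence(specialists, k=0, M=4), 2))   # 0.78
```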
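And, as promised in the norm-referenced section, a minimal sketch of the item p level and discrimination index D for a single item; the scores and responses are hypothetical.

```python
# Item p level and discrimination index D for one test item.
# scores: each subject's total test score; correct: 1 if the subject answered
# the item correctly, else 0. Data are hypothetical.

scores  = [95, 90, 88, 85, 80, 72, 70, 65, 60, 55, 50, 40]
correct = [ 1,  1,  1,  0,  1,  1,  0,  1,  0,  0,  1,  0]

p_level = sum(correct) / len(correct)       # proportion answering correctly

ranked = sorted(zip(scores, correct), reverse=True)
quarter = len(ranked) // 4                  # size of the upper/lower 25% groups
p_upper = sum(c for _, c in ranked[:quarter]) / quarter    # Pu
p_lower = sum(c for _, c in ranked[-quarter:]) / quarter   # PL
d = p_upper - p_lower                       # D = Pu - PL

print(f"p = {p_level:.2f}, D = {d:+.2f}")
```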
2-Empirical item analysis
• Empirical data are obtained from respondents in order to evaluate the effectiveness of the items of the measuring tool.
• Groups chosen for item analysis of criterion-referenced measures are often referred to as criterion groups.
• Two approaches are used for identifying criterion groups: the criterion-groups technique and the pretreatment-posttreatment measurements approach.

The criterion-groups technique
• It involves testing two separate groups at the same time: one group that is known by independent means to possess more of the specified trait or attribute, and a second group known to possess less.
• The subjects chosen for each of the groups should be as similar as possible on relevant characteristics, for example, social class, culture, and age. The only real difference between the groups should be in terms of exposure to the specified treatment or experience.
• For example, if the purpose of a criterion-referenced measure is to identify parents who have and who have not adjusted to parenthood after the birth of a first child, two groups of parents would be of interest: those who have adjusted to parenthood and those who have not had a previous opportunity to adjust to parenthood.

The pretreatment-posttreatment measurements approach
• It involves testing one group of subjects twice: once before exposure to some specific treatment (pretreatment) and again after exposure to the treatment (posttreatment).
• For example, when instruction is the treatment, testing would occur before instruction (pre-instruction) and after instruction (post-instruction). Subjects are usually tested with the same set of items on both occasions.

3-Item difficulty (p level)
• The item p levels for each item are compared between groups to help determine whether respondents would have performed similarly on an item regardless of which group they are in.
• The item p level should be higher for the group that is known to possess more of the specified trait or attribute than for the group known to possess less.

4-Item discrimination indices
• The focus of item-discrimination indices for criterion-referenced measures is on the measurement of performance changes (e.g., pretest-posttest) or differences (e.g., experienced parents vs. inexperienced parents) between the criterion groups.

1. Criterion groups difference index (CGDI)
• The CGDI is the proportion of respondents in the group known to possess more of the trait who answered the item correctly minus the proportion of respondents in the group known to have less of the trait who answered it correctly:
CGDI = (item p level for the group known to possess more of the attribute) − (item p level for the group known to have less of the attribute)

2. Pretest-posttest difference index (PPDI)
• Used with the pretreatment-posttreatment measurements approach: it is the proportion of respondents who answered the item correctly on the posttest minus the proportion who answered it correctly on the pretest:
PPDI = (item p level on the posttest) − (item p level on the pretest)

Interpretation of results for CGDI and PPDI
• The range of values for each of these indices is −1.00 to +1.00.
• A high positive index is desirable, because it reflects the item's ability to discriminate between the criterion groups.
• Items with high positive discrimination indices improve the decision validity of a test.

Summary
• Definition and types of validity
• Factors that can lower validity
• The item-analysis procedure for norm- and criterion-referenced measures

Any questions?

References
• Glen, S. (2020). Statistics How To. Available at https://www.statisticshowto.com.
• Northern Arizona University. (1998). Lesson 6-2-1. Available at jan.ucc.nau.edu › measurement › part2.
• Stephanie, G. (2016). Measurement error (observational error). StatisticsHowTo.com: Elementary Statistics for the Rest of Us! Retrieved 22-11-2020 from https://www.statisticshowto.com/measurement-error/.
• Trochim, W.M.K. (2020). Measurement error. Research Methods Knowledge Base (Conjointly). Retrieved 22-11-2020 from https://conjointly.com/kb/measurement-error/.
• Waltz, C.F., Strickland, O.L., & Lenz, E.R. (2010). Measurement in nursing and health research (4th ed.). New York: Springer Publishing Company. Pp. 145-62.