PhD Research Seminar Series: Reliability and Validity in Tests and Measures
Dr. K. A. Korb, University of Jos

Outline
- Reliability
  - Theory of Reliability
  - Split-Half Reliability
  - Test-Retest Reliability
  - Alternate Forms Reliability
  - Inter-Rater Reliability
- Validity
  - Construct Validity
  - Criterion Validity
  - Content Validity
  - Face Validity

Overview
- Test developer: the person who created a test
- Test user: a person administering the test
- Test taker: a person taking the test

Reliability: Consistency of Results
[Figure: targets illustrating reliable versus unreliable patterns of scores]

Reliability Theory
- Actual score on test = True score + Error
- True score: the hypothetical actual score on the test
- The reliability coefficient indicates the ratio between the true score variance on the test and the total variance.
- In other words, as the error in testing decreases, the reliability increases.

Reliability: Sources of Error
- Error in test construction
  - Item sampling: error results from items that measure more than one construct in the same test. For example, a test that has items assessing both reading and math ability will have lower reliability than a test that assesses only reading.
- Error in test administration
  - Test environment: room temperature, amount of light, noise, etc.
  - Test-taker variables: illness, amount of sleep, test anxiety, etc.
  - Examiner-related variables: absence of the examiner, examiner's demeanor, etc.
- Error in test scoring
  - Scorer: with subjectively marked assessments, different scorers may give different scores to the same responses.

Reliability: Error due to Test Construction
- Measured by split-half reliability, which determines how consistently your measure assesses the construct of interest. A low split-half reliability indicates poor test construction.
- If your measure assesses multiple constructs, split-half reliability will be considerably lower. Separate the constructs you are measuring into different sections of the questionnaire and calculate the reliability separately for each construct.
- If you get a low reliability coefficient, your measure is probably measuring more constructs than it is designed to measure. Revise your measure to focus more directly on the construct of interest.
- When validating a measure, you will most likely calculate the split-half reliability of your instrument.

Calculating Split-Half Reliability
- If you have dichotomous items (e.g., right-wrong answers), as you would with multiple choice exams, calculate the KR-20.
- If you have a Likert scale, essays, or other types of items, use the Spearman-Brown formula.
- For a step-by-step example of calculating split-half reliability, see the associated presentation entitled Calculating Reliability of Quantitative Measures.

Reliability: Error due to Test Administration
- Test-retest reliability: determines how much error in a test score is due to problems with test administration.
- To calculate (see the sketch after this list):
  1. Administer the same test to the same participants on two different occasions, perhaps a week or two apart.
  2. Correlate the test scores of the two administrations using Pearson's Product-Moment Correlation.
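The calculations above can be scripted. Below is a minimal Python sketch that is not part of the original seminar: the examinee data, the helper names kr20 and split_half, and all numbers are hypothetical, and scipy is assumed to be available. It implements the standard KR-20 and Spearman-Brown split-half formulas and a test-retest correlation.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical data: one row per examinee, one column per item. A shared
# "ability" term makes the items positively correlated, as real items would be.
rng = np.random.default_rng(0)
ability = rng.normal(size=(30, 1))
dichotomous = (ability + rng.normal(size=(30, 20)) > 0).astype(int)  # right/wrong
likert = np.clip(np.round(3 + ability + rng.normal(size=(30, 10))), 1, 5)

def kr20(items):
    """KR-20 for dichotomous (0/1) items:
    (k / (k - 1)) * (1 - sum(p * q) / variance of total scores)."""
    k = items.shape[1]
    p = items.mean(axis=0)                      # proportion correct per item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - (p * (1 - p)).sum() / total_var)

def split_half(items):
    """Correlate odd-item and even-item half scores, then step the half-test
    correlation up to full length with Spearman-Brown:
    r_full = 2 * r_half / (1 + r_half)."""
    r_half, _ = pearsonr(items[:, 0::2].sum(axis=1), items[:, 1::2].sum(axis=1))
    return 2 * r_half / (1 + r_half)

print("KR-20 (dichotomous items):", round(kr20(dichotomous), 2))
print("Split-half with Spearman-Brown (Likert items):", round(split_half(likert), 2))

# Test-retest reliability: correlate total scores from two administrations of
# the same test to the same participants. The identical Pearson computation is
# reused below for parallel forms (two versions) and inter-rater (two raters).
time1 = likert.sum(axis=1)
time2 = time1 + rng.normal(0, 1.5, size=time1.shape)  # hypothetical retest
print("Test-retest:", round(pearsonr(time1, time2)[0], 2))
```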
Reliability: Error due to Test Construction with Two Forms of the Same Measure
- Parallel forms reliability (also called alternate forms reliability): determines the similarity of two different versions of the same measure.
- To calculate:
  1. Administer the two tests to the same participants within a short period of time.
  2. Correlate the test scores of the two tests using Pearson's Product-Moment Correlation.

Reliability: Error due to Test Scoring
- Inter-rater reliability: determines how closely two different raters mark the assessment.
- To calculate:
  1. Give the exact same test results from one test administration to two different raters.
  2. Correlate the two markings from the different raters using Pearson's Product-Moment Correlation.

Validity: Measuring What Is Supposed to Be Measured
[Figure: targets illustrating valid versus invalid measurement]

Validity
- Three types of validity:
  - Construct validity: measures the appropriate psychological construct
  - Criterion validity: predicts appropriate outcomes
  - Content validity: adequately samples the content
- Each type of validity should be established for all psychological tests.

Construct Validity
- Definition: the appropriateness of inferences drawn from test scores regarding an individual's status on the psychological construct of interest.
- For example, a test is developed to measure reading ability. Once the test is administered to students, does their score on the test accurately reflect their true reading ability?
- Two considerations: construct underrepresentation and construct-irrelevant variance.
- Construct underrepresentation: the test does not measure all of the important aspects of the construct. For example, a test of academic self-efficacy (perceived effectiveness in academics) might measure self-efficacy only in math and science, thus ignoring other important academic subjects.
- Construct-irrelevant variance: test scores are affected by other, unrelated processes. For example, a test of statistical knowledge that requires complex calculations is likely influenced by construct-irrelevant variance: in addition to measuring statistical knowledge, the test is also measuring calculation ability.

Sources of Construct Validity Evidence (see the sketch after this list)
- Homogeneity: the test measures a single construct. Evidence: high internal consistency, calculated by split-half reliability.
- Convergence: the test is related to other measures of the same construct and related constructs. Evidence: high correlations with other measures (the same evidence as criterion validity).
- Theory: the test behaves according to theoretical propositions about the construct.
  - Evidence from changes in test scores with age: scores on the measure should change with age as predicted by theory. For example, intelligence scores of one person should increase as that person gets older, because theories of intelligence predict increases with age.
  - Evidence from treatments: scores on the measure change between pretest and posttest as predicted by theory. For example, scores on a test of Knowledge of Nigerian Government should significantly increase after a course on the Nigerian Government.
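Gathering the evidence above reduces to a few familiar statistics. The following is a minimal sketch under entirely hypothetical data and names (none of the numbers come from the seminar): a split-half coefficient for homogeneity, a Pearson correlation with an established measure for convergence, and a paired-samples t-test for a theory-predicted pretest-posttest gain.

```python
import numpy as np
from scipy.stats import pearsonr, ttest_rel

rng = np.random.default_rng(1)
n = 40  # hypothetical sample of 40 participants

# Homogeneity: odd-even split-half correlation on the new measure's items,
# stepped up to full test length with Spearman-Brown.
items = rng.integers(1, 6, size=(n, 12))          # hypothetical Likert items
r_half, _ = pearsonr(items[:, 0::2].sum(axis=1), items[:, 1::2].sum(axis=1))
homogeneity = 2 * r_half / (1 + r_half)

# Convergence: correlation with an established measure of the same construct.
new_scores = items.sum(axis=1)
established = new_scores + rng.normal(0, 3, n)    # hypothetical criterion
convergence, _ = pearsonr(new_scores, established)

# Theory: scores should rise from pretest to posttest after a relevant course.
pretest = new_scores.astype(float)
posttest = pretest + rng.normal(4, 2, n)          # hypothetical gain
t_stat, p_value = ttest_rel(posttest, pretest)

print(f"Homogeneity (split-half): {homogeneity:.2f}")
print(f"Convergence (r with established measure): {convergence:.2f}")
print(f"Treatment effect: t = {t_stat:.2f}, p = {p_value:.4f}")
```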
Criterion Validity
- Definition: the correlation between the measure and a criterion.
- Criterion: other accepted measures of the construct, or measures of other constructs similar in nature. A criterion can consist of any standard with which your test should be related.
- Examples:
  - Behavior (e.g., misbehavior in class, teacher's interactions with students, days absent from school)
  - Other test scores (e.g., standardized test scores)
  - Ratings (e.g., teachers' ratings of helpfulness)
  - Psychiatric diagnosis (e.g., depression, schizophrenia)

Criterion Validity: Three Types
- Convergent validity: high correlations with measures of similar constructs taken at the same time.
- Divergent validity: low correlations with measures of different constructs taken at the same time.
- Predictive validity: high correlation with a criterion in the future.

Criterion Validity Example
- You developed an essay test of science reasoning to admit students into the science programme at the university.
- Convergent validity: your test should have high correlations with other science tests, particularly well-established science tests.
- Divergent validity: your test should have low correlations with measures of writing ability, because your test should only measure science reasoning, not writing ability.
- Predictive validity: your test should have high correlations with future grades in science courses, because the purpose of the test is to determine who will do well in the science programme at the university.

Criterion Validity Evidence for the New Science Reasoning Test
Correlations between the New Science Reasoning Test and other measures:

  WAEC Science Scores                           .83
  School Science Marks                          .75
  WAEC Writing Scores                           .34
  WAEC Reading Scores                           .24
  Future marks in university science courses    .65

- High correlations with other measures of science ability indicate good criterion validity.
- Low correlations with measures unrelated to science ability indicate good criterion validity.
- A high correlation with future measures of science ability indicates good criterion validity.

Content Validity
- Definition: the test samples the entire domain of the construct it was designed to measure.
- For example, suppose one chart represents the amount of class time spent on each maths topic and a second chart represents the proportion of test questions on each maths topic.
  [Pie charts: class coverage versus test coverage of Addition, Subtraction, Multiplication, and Division; the proportions do not match]
  This test does NOT demonstrate content validity because the proportion of test questions does not match the proportion of coverage in class.
- For academic tests, a test is considered content valid when the proportion of material covered by the test approximates the proportion of material covered in class.
  [Pie charts: class coverage versus test coverage of the same four topics; the proportions match]
  This maths test demonstrates good content validity because the proportion of test questions on each topic matches the proportion of class time spent on each topic.

Content Validity: Assessment
- Content validity tends to be an important consideration ONLY for achievement tests.
- To assess (see the sketch after this list):
  1. Gather a panel of judges.
  2. Give the judges a table of specifications of the amount of content covered in the domain.
  3. Give the judges the measure.
  4. The judges draw a conclusion as to whether the proportion of content covered on the test matches the proportion of content in the domain.
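The comparison the judges make in step 4 can be previewed numerically. Here is a minimal sketch comparing the proportion of class periods per topic with the proportion of test items per topic; the four topic names come from the example above, but every count is a hypothetical illustration.

```python
# Compare the proportion of class time per topic with the proportion of
# test items per topic. All counts below are hypothetical illustrations.
class_periods = {"Addition": 10, "Subtraction": 10, "Multiplication": 15, "Division": 5}
test_items = {"Addition": 12, "Subtraction": 11, "Multiplication": 18, "Division": 9}

total_periods = sum(class_periods.values())
total_items = sum(test_items.values())

print(f"{'Topic':<15}{'Class %':>10}{'Test %':>10}{'Gap':>8}")
for topic in class_periods:
    class_pct = class_periods[topic] / total_periods
    test_pct = test_items[topic] / total_items
    # A large gap on any topic signals weak content validity for that topic.
    print(f"{topic:<15}{class_pct:>10.0%}{test_pct:>10.0%}{test_pct - class_pct:>+8.0%}")
```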
Face Validity
- Face validity addresses whether the test appears to measure what it purports to measure.
- To assess: ask test users and test takers to evaluate whether the test appears to measure the construct of interest.
- Face validity is rarely of interest to test developers and test users. The only instance where face validity is of interest is to instill confidence in test takers that the test is worthwhile.
- Face validity is NOT a consideration for educational researchers, and it CANNOT be used to determine the actual interpretive validity of a test.

Concluding Advice
- The best way to ensure that the measures you use are both reliable and valid is to use a measure that another researcher has developed and validated.
- This will assist you in three ways:
  1. You can confidently report that you have accurately measured the variables you are studying.
  2. By using a measure that has been used before, your study is intimately tied to previous research in your field, an important consideration in determining the importance of your study.
  3. It saves you time and energy in developing your measure.

Finding Pre-Existing Measures
- Information on how to find pre-existing measures: http://www.apa.org/science/faq-findtests.html#printeddirec
- Online directory of pre-existing measures: http://www.ets.org/testcoll/
  - Type the construct you want to measure in the empty box and click the Search button.
  - Find the test that is most relevant for your purposes.
  - When you click on the measure name in blue, if a journal article is listed in the Availability category, the measure is published in that journal article.
  - Some tests can also be ordered from the ETS Test Collection for about N3000 and then downloaded to your computer.
  - You can also try googling the name of the test to determine whether somebody else has published the measure on the internet.

Websites for Pre-Existing Measures
- Personality variables: International Personality Item Pool, http://ipip.ori.org/ipip/
- Motivation constructs: Self-Determination Theory, http://www.psych.rochester.edu/SDT/
- Motivation constructs (students' goal orientations): http://www.umich.edu/~pals/