Chapter 11: Measuring Research Variables
Research Methods in Physical Activity

Validity

Validity (the degree to which a test or instrument measures what it purports to measure) refers to the soundness of the interpretation of scores from a test and is the most important consideration in measurement. Measures are used for different purposes, so there are different kinds of validity. The four basic types are logical, content, criterion, and construct.

Logical validity - Degree to which a measure obviously involves the performance being measured; also known as face validity.
Content validity - Degree to which a test (usually in educational settings) adequately samples what was covered in the course.
Criterion validity - Degree to which scores on a test are related to some recognized standard or criterion. The two main types of criterion validity are concurrent validity and predictive validity.
Concurrent validity - Type of criterion validity in which a measuring instrument is correlated with some criterion that is administered concurrently or at about the same time.
Predictive validity - Degree to which scores of predictor variables can accurately predict criterion scores.

Note that shrinkage can occur when a prediction equation developed on one sample is applied to another sample. Shrinkage is the reduction in predictive ability. It can be addressed by cross-validating the prediction equation from the original sample on the sample being tested. This becomes important when the two samples differ in participant demographics or when the original prediction formula was developed from a small sample.

Cross-validation - Technique to assess the accuracy of a prediction formula in which the formula is applied to a sample not used in developing the formula.
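The cross-validation idea above can be sketched in Python. This is a minimal sketch under stated assumptions, not the textbook's procedure: it uses a single predictor, fits the equation by ordinary least squares, and summarizes predictive ability as 1 - SSE/SST (an assumed accuracy measure chosen because it depends on the equation being tested). All function names are hypothetical.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares intercept and slope for predicting y from x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def variance_explained(x, y, intercept, slope):
    """1 - SSE/SST for a fixed equation applied to a sample (may be negative)."""
    my = mean(y)
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - my) ** 2 for yi in y)
    return 1 - sse / sst


def cross_validate(dev_x, dev_y, new_x, new_y):
    """Fit on the development sample, then apply the same equation to a new sample.

    Returns (accuracy in development sample, accuracy in new sample, shrinkage).
    """
    intercept, slope = fit_line(dev_x, dev_y)
    r2_dev = variance_explained(dev_x, dev_y, intercept, slope)
    r2_new = variance_explained(new_x, new_y, intercept, slope)
    return r2_dev, r2_new, r2_dev - r2_new
```

A positive third value indicates shrinkage: the equation predicts less well in the sample that was not used to develop it.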
Construct validity - Degree to which a test measures a hypothetical construct; usually established by relating the test results to some behavior. For example, certain behaviors are expected of someone with a high degree of sportsmanship. Such a person might be expected to compliment the opponent on shots made during a tennis match. For an indication of construct validity, a test maker could compare the number of times a person scoring high on a test of sportsmanship complimented the opponent with the number of times a person scoring lower on the test did so.

The known groups difference method is sometimes used to establish construct validity: the test scores of groups that should differ on a trait or ability are compared. (For example, if sprinters and jumpers score significantly better than distance runners on a test designed to measure anaerobic power, this finding would provide some evidence that the test measures anaerobic power.)

An experimental approach is occasionally used to demonstrate construct validity. For example, a test of cardiovascular fitness might be assumed to have construct validity if it reflected gains in fitness following a conditioning program.

Reliability

An integral part of validity is reliability, which pertains to the consistency, or repeatability, of a measure. A test cannot be considered valid if it is not reliable. In other words, if the test is not consistent (if you cannot depend on successive trials to yield the same results), then the test cannot be trusted.

Test reliability is sometimes discussed in terms of observed score, true score, and error score.
• The test score obtained by an individual is the observed score.
• An observed score theoretically consists of the person’s true score and error score.
• Expressed in terms of score variance, the observed score variance consists of true score variance plus error score variance. The goal of the tester is to remove error to yield the true score.
• Because true score variance is never known, it is estimated by subtracting error variance from observed score variance.

Thus, the reliability coefficient (discussed later) reflects the degree to which the measurement is free of error variance. The coefficient of reliability is the ratio of true score variance to observed score variance.

Reliability: Sources of Error

Measurement error can come from four sources: the participant, the testing, the scoring, and the instrumentation.

Participant error - Measurement error associated with the participant includes many factors, such as mood, motivation, fatigue, health, fluctuations in memory and performance, previous practice, specific knowledge, and familiarity with the test items.
Testing error - Testing error is related to how clear and complete the directions are, how rigidly the instructions are followed, and whether supplementary directions or motivation are applied.
Scoring error - Errors in scoring relate to the competence, experience, and dedication of the scorers and to the nature of the scoring itself. The extent to which the scorer is familiar with the behavior being tested and with the test items can greatly affect scoring accuracy. Carelessness and inattention to detail produce measurement error.
Instrumentation error - Measurement error because of instrumentation includes such obvious causes as inaccuracy and lack of calibration of mechanical and electronic equipment. It also refers to the inadequacy of a test to discriminate between abilities and to the difficulty of scoring some tests.
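The variance relationships stated before the sources of error can be written compactly. The symbols here are assumptions introduced for illustration (O for observed, T for true, E for error, and r_xx for the reliability coefficient); the slides state the relationships only in words.

```latex
\sigma^2_{O} = \sigma^2_{T} + \sigma^2_{E},
\qquad
r_{xx} = \frac{\sigma^2_{T}}{\sigma^2_{O}}
       = \frac{\sigma^2_{O} - \sigma^2_{E}}{\sigma^2_{O}}
```

For instance, with hypothetical values of 25 for observed score variance and 5 for error variance, the estimated true score variance is 20 and the reliability coefficient is 20/25 = .80.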
Reliability Coefficient: Expression of Reliability

The degree of reliability is expressed by a correlation coefficient, ranging from 0.00 to 1.00. The closer the coefficient is to 1.00, the less error variance it reflects and the more the true score is assessed.

Interclass correlation - This coefficient (Pearson r) is a bivariate statistic, meaning that it is used to correlate two different variables. It is not appropriate for establishing reliability, because in reliability testing two values of the same variable are correlated: when a test is given twice, the scores on the first test are correlated with the scores on the second test to determine their degree of consistency.
Intraclass correlation - The procedures leading to the calculation of intraclass correlation (R) are the same as those of simple ANOVA with repeated measures (see Table 11.2, p. 199). Note that the F statistic for "trials" indicates whether there was any significant difference among the three trials of the same measure. The intraclass correlation is calculated on p. 200. Note that the best way to increase R is to decrease the residual, that is, to remove unexplained variance.

Methods of Establishing Reliability

Stability - A coefficient of reliability measured by the test–retest method on different days. In the test–retest method, the test is given one day and then repeated a day or so later. Intraclass correlation should be used to compute the coefficient of stability of the scores on the two tests.
Alternate-forms method - Method of establishing reliability that involves constructing two tests that supposedly sample the same material; sometimes referred to as the parallel-form method or the equivalence method. The two tests are given to the same individuals. Ordinarily, some time elapses between the two administrations. The scores on the two tests are then correlated to obtain a reliability coefficient.
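The intraclass correlation recommended above for the coefficient of stability can be sketched in Python. This is a minimal sketch, not the worked computation on p. 200 of the text, which is not reproduced in these slides. It assumes one common ANOVA-based form, R = (MS_S - MS_E)/MS_S, where MS_S is the mean square for participants and MS_E is the residual mean square from a participants-by-trials repeated-measures ANOVA (so systematic differences between trials are not counted as error). The function name is hypothetical.

```python
def intraclass_r(scores):
    """Intraclass R from a table with one row per participant, one column per trial.

    Assumed form: R = (MS_subjects - MS_residual) / MS_subjects.
    """
    n = len(scores)                      # number of participants
    k = len(scores[0]) if n else 0       # number of trials
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need >= 2 participants and >= 2 trials, equal row lengths")

    grand = sum(sum(row) for row in scores) / (n * k)
    subj_means = [sum(row) / k for row in scores]
    trial_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Between-participants sum of squares
    ss_subjects = k * sum((m - grand) ** 2 for m in subj_means)
    # Residual: what is left after removing participant and trial effects
    ss_residual = sum(
        (scores[i][j] - subj_means[i] - trial_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )

    ms_subjects = ss_subjects / (n - 1)
    ms_residual = ss_residual / ((n - 1) * (k - 1))
    if ms_subjects == 0:
        raise ValueError("participants do not differ; R is undefined")
    return (ms_subjects - ms_residual) / ms_subjects
```

Consistent with the note on the slide, shrinking the residual mean square pushes R toward 1.00; when every participant changes by the same amount from trial to trial, the residual is zero and R equals 1.00.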
Internal consistency - An estimate of reliability that represents the consistency of scores within a test. Methods include the following:
Same-day test–retest method - Method of establishing reliability in which a test is given twice to the same participants on the same day.
Split-half technique - Method of testing reliability in which the test is divided in two, usually by making the odd-numbered items one half and the even-numbered items the other half. The two halves are then correlated.
Flanagan method - A process for estimating reliability in which the test is split into two halves, and the variances of the halves are analyzed in relation to the total variance of the test (see example 11.3, p. 202).
Kuder-Richardson (KR) method of rational equivalence - Formulas developed for estimating the reliability of a test from a single test administration. No correlation is calculated. The resulting coefficient represents an average of all possible split-half reliability coefficients.

Intertester reliability - The degree to which different testers can achieve the same scores on the same subjects; also called objectivity.
The degree of objectivity (intertester reliability) can be established by having more than one tester gather data. The scores are then analyzed with intraclass correlation techniques to obtain an intertester reliability coefficient. This approach typically involves a coding instrument to calculate interobserver agreement (see formula 11.4, p.
203)

Standard Scores to Compare Performance (also refer to Table 2 in the Appendix)

z scores (see p. 205 for an example) - The basic standard score is the z score. The z scale converts raw scores to units of standard deviation, with a mean of zero and a standard deviation of 1.0. The formula is z = (X - M)/s, where X is the raw score, M is the mean, and s is the standard deviation.
T scale (see p. 205 for an example) - Type of standard score that sets the mean at 50 and the standard deviation at 10 (T = 50 + 10z) to remove the decimal found in z scores and to make all scores positive.

(Exam three will include information up to this point. The remaining information from Chapter 11 on scales for measuring affective behavior will be covered in class with the information from Chapter 15 on survey research and will be included in Exam four.)

Measuring Affective Behavior

Affective behavior includes attitudes, personality, anxiety, self-concept, social behavior, and sportsmanship.

Scales for Measuring Affective Behavior

Likert-type data - Type of closed question that requires the participant to respond by choosing one of several scaled responses; the intervals between items are assumed to be equal.
Example: I prefer quiet recreational activities such as chess, cards, or checkers rather than activities such as running, tennis, or basketball.
Strongly agree    Agree    Undecided    Disagree    Strongly disagree

Benefits of Likert-type data - A principal advantage of scaled responses such as the Likert type is that they permit a wider choice of expression than responses such as "always" or "never," or "yes" or "no." The five, seven, or more intervals may help increase the reliability of the instrument.
Semantic differential scale - Scale used to measure affective behavior in which the respondent is asked to make judgments about certain concepts by choosing one of seven intervals between bipolar adjectives (see example in text, p. 208).

Rating scales - A measure of behavior that involves a subjective evaluation based on a checklist of criteria. Raters are usually experts on the criterion measure. When more than one judge is asked to rate performances, some common standards must be set.

Rating Errors

Leniency - Tendency for observers to be overly generous in rating.
Central tendency errors - Inclination of the rater to give an inordinate number of ratings in the middle of the scale, avoiding the extremes of the scale.
Halo effect - A threat to internal validity wherein raters allow previous impressions or knowledge about a certain individual to influence all ratings of that individual’s behaviors.
Proximity error - Inclination of a rater to consider behaviors to be more nearly the same when they are listed close together on a scale than when they are separated by some distance. (For example, if the qualities "active" and "friendly" are listed side by side on the scale, proximity errors result if raters evaluate performers as more similar on those characteristics than they would if the two qualities were listed several lines apart.)
Observer bias error - Inclination of a rater to be influenced by his or her own characteristics and prejudices. Observer bias errors are directional because they produce errors that are consistently too high or too low.
Observer expectation error - Inclination of a rater to see evidence of certain expected behaviors and interpret observations in the expected direction.
Observer expectations can therefore contaminate the ratings. In the research setting, observer expectation errors are likely when the observer knows the experimental hypotheses and is thus inclined to watch for these outcomes more closely than an observer who is unaware of the expected outcomes.

Measuring Knowledge: Item Analysis

Item analysis - Process in analyzing knowledge tests in which the suitability of test items and their ability to discriminate are evaluated. The purpose of item analysis is to determine which test items are suitable and which need to be rewritten or discarded.
Two important parts of item analysis are:
1) To analyze the difficulty of the items on the test
2) To determine the degree of item discrimination

Item difficulty - Analysis of the difficulty of each test item in a knowledge test, determined by dividing the number of people who correctly answered the item by the total number of people who responded to the item. (The more difficult the item is, the lower its difficulty index is.) Most test authorities recommend eliminating questions with difficulty indices below .10 or above .90. The best questions are those with difficulty indices around .50.
Item discrimination - The degree to which a test item discriminates between people who did well on the entire test and those who did poorly; also called index of discrimination.
Item discrimination may be calculated by dividing the completed tests into a high-scoring group and a low-scoring group and then using the following formula:

Index of discrimination = (nH - nL)/n

where nH is the number of high scorers who answered the item correctly, nL is the number of low scorers who answered the item correctly, and n is the total number in either the high or the low group. (For example, with 30 in the high group and 30 in the low group, if 20 of the high scorers and 10 of the low scorers answered an item correctly, the index of discrimination would be (20 - 10)/30 = 10/30 = .33.)
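The two item-analysis calculations described above can be sketched in Python. This is a minimal sketch: the function names are hypothetical, and the .10 and .90 cutoffs are simply the recommendation quoted in the slides, exposed as parameters.

```python
def item_difficulty(n_correct, n_responded):
    """Difficulty index: proportion of respondents who answered the item correctly."""
    if n_responded <= 0:
        raise ValueError("n_responded must be positive")
    return n_correct / n_responded


def discrimination_index(n_high_correct, n_low_correct, group_size):
    """Index of discrimination = (nH - nL) / n, with n the size of either group."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return (n_high_correct - n_low_correct) / group_size


def should_eliminate(difficulty, low=0.10, high=0.90):
    """True when the difficulty index falls outside the recommended range."""
    return difficulty < low or difficulty > high
```

With the numbers from the example above, `discrimination_index(20, 10, 30)` reproduces the hand calculation of 10/30. A negative index would mean that more low scorers than high scorers answered the item correctly.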