THE FOLLOWING LECTURE HAS BEEN APPROVED FOR ALL STUDENTS BY BIRMINGHAM CITY UNIVERSITY
This lecture may contain information, ideas, concepts and discursive anecdotes that may be thought-provoking and challenging.
Any issues raised in the lecture may require the viewer to engage in further thought, insight, reflection or critical evaluation.
health.bcu.ac.uk/craigjackson

Validity & Variability of Back Pain Assessments
Dr. Craig Jackson
Senior Lecturer in Psychology
School of Psychology
Faculty of Education, Law & Social Science
BCU

Who is Observing What?
The validity of any observation depends upon who is observing whom
Heisenberg’s uncertainty principle (1927)

Content
• Assessment Criteria
• Validity
• Reliability
• Low Back Pain Assessments
• Appropriateness & Feasibility
• Between-Observer Variability and Consistency
• The Future: Mathematical Models?
• Validity without Psychology?

Variability
Specificity of Defined Field + Repeatable Measurement = Valid Measures
S + R = V
Problem of between-observer variation remains:
• GP eliciting signs in respiratory disease
• Neurologist evaluating diagnosis of multiple sclerosis
• Geriatrician assessing stroke rehab.
• Anaesthetist determining fitness for operation
1. Judgements might be made differently by other observers
2. Judgements might be made differently by the same observer on repeated occasions

Between-Observer Variability
Variation between observers can seriously compromise research / clinical findings
Worst example:
• Patients with condition A - all examined by Dr X
• Patients with condition B - all examined by Dr Y
Could one observer examine all patients?
Not possible / practical

Examples of Between-Observer Variability
Diagnostic classification for multiple sclerosis for 149 patients by two clinicians (observers)

                 Neurologist A diagnostic class
Neurologist B    Certain  Probable  Possible  Doubtful  Total  Proportion
Certain             38        5         0         1       44      0.30
Probable            33       11         3         0       47      0.32
Possible            10       14         5         6       35      0.23
Doubtful             3        7         3        10       23      0.15
Total               84       37        11        17      149
Proportion        0.56     0.25      0.07      0.11

Examples of Between-Observer Variability
Circum-corneal hyperaemia (scored 0,1,2,3,4) by four ophthalmologists

           Ophthalmologist
Patient    A   B   C   D
1          3   3   4   3
2          2   3   4   3
3          2   2   3   1
4          2   2   3   2
5          2   3   4   2
6          2   2   4   2
7          1   2   2   1
8          2   2   3   3
9          2   2   3   3

• Systematic error by observer C - consistently higher
• Observer B sticks to mid-ranges
• No patient on whom there is total agreement

Examples of Between-Observer Variability
Iris hyperaemia (scored 0,1,2,3,4) by four ophthalmologists

           Ophthalmologist
Patient    A   B   C   D
1          1   1   4   3
2          0   0   0   3
3          0   1   0   1
4          0   4   0   1
5          0   1   0   2
6          3   1   4   2
7          1   1   0   1
8          2   1   4   2
9          0   4   9   2

• Observer C uses only extremes of scale
• Observer C introduces spurious code
• Observer D avoids extreme codes
• Only 2 cases with difference of 1 between highest and lowest scores

Reducing Between-Observer Variability
• Use expert panel / reference library – they evaluate all procedures
• Compare rival observation methods in small pilot studies
• Suspect observer at all times - how may s/he be biased?
• Train observers / assessors
• Standardised techniques & judgement criteria
• Consider severity of disagreements
• Randomise patients out to multiple observers / multiple observations
• Appoint external assessor

Assessment Criteria
With any assessment – observation, questionnaire or equipment – we ask:
• Utility: Is it useful?
• Reliability: Is it dependable?
• Validity: Does it do what it is supposed to?
• Sensitivity: Can it identify patients with a condition?
• Specificity: Can it identify those that do not have the condition?
• Responsiveness: Can it measure differences over time?
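
Returning to the multiple sclerosis table above, one common way to put a number on between-observer agreement is Cohen's kappa, which corrects the raw proportion of agreement for the agreement expected by chance. Kappa is not named in the slides; it is brought in here as an illustration, and the function name `cohens_kappa` is an assumption. The counts are those of the table, and kappa does not depend on which neurologist forms the rows.

```python
# Cell counts from the multiple sclerosis table (149 patients).
# Rows and columns are the two neurologists' classes, both in the order
# Certain, Probable, Possible, Doubtful.
counts = [
    [38, 5, 0, 1],
    [33, 11, 3, 0],
    [10, 14, 5, 6],
    [3, 7, 3, 10],
]


def cohens_kappa(table):
    """Chance-corrected agreement for a square table of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Observed agreement: proportion of patients on the diagonal.
    observed = sum(table[i][i] for i in range(len(table))) / n
    # Agreement expected by chance from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (observed - expected) / (1 - expected)


kappa = cohens_kappa(counts)
print(f"kappa = {kappa:.2f}")
```

The two neurologists put 64 of the 149 patients in the same class (about 43%), and kappa comes out at about 0.21, so most of that agreement is what chance alone would produce.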

Purpose of Assessment Instruments
• Is the purpose of the instrument clearly stated?
• Is it discriminative?
• Is it evaluative?
• Is it prognostic?
• Which population is it appropriate for?
  Clinical
  Working
  Research / Epidemiological

Reliability & Validity
Validity
• The degree to which an instrument measures what it is intended to measure
• Reliability is a necessary but insufficient condition for validity
• The approximate truth about inferences regarding causal relationships
Reliability
• A degree of consistency of a measure
• The degree to which a test is free of random error
• A measure that produces consistent results is said to have high reliability

Validity in Research
Poor repeatability of an examination implies poor validity
How repeatable are the results:
• On two (or more) occasions by the same observer? (Temporal Stability)
• On repeated occasions by different observers?
Applies equally to Clinical Practice and Research
A clinical sign carries no information if it is assessed differently when re-examined

Measures for Clinical Use
• Questionnaires
• General health status
• Pain
• Functional status
• Patient satisfaction
• Physiological outcomes
• Utilization measures
• Cost measures
• Mathematical Modelling

Face Validity
• Are items measured in a sensible way?
• How specific are the questions?
• Do questions have a specific time frame / frame of reference?
• Are questions performance related? (do you do it?)
• Are questions capacity related? (can you do it?)
• How is the index scored? Weighting of items?

Content Validity
Content validity is concerned with “representativeness”
• Are all relevant dimensions of functionality included?
• Subjective
• Was the method for choosing items appropriate?
• Draw an inference from test scores to a larger domain of functionality
  e.g. the abilities covered by the test items should be representative of the larger domain of abilities and function

Construct Validity
What is the bigger concept that the assessment is trying to measure?
“Theoretical Construct”
• Does the assessment perform satisfactorily when compared with other measures?
• Is that concept a real one? e.g. does specific local pain prevent general functioning?
• Measured by correlation between the intended independent variable (back health) and a proxy independent variable (specific test performance) that is actually used

Construct Validity
For example:
• A company physician wants to study the relationship between general back health and job performance
• However, the physician may not be able to administer a comprehensive back health test to every worker
• In this case, s/he can use a proxy variable such as “performance on a specific functional test” as an indirect indicator of back health
• Administer the proxy test AND the comprehensive back test to a portion of the workers
• If a strong correlation is found between general back health and the specific test, the proxy test can be used with the larger group because its construct validity is established

Criterion Validity
Drawing an inference from specific test scores to general performance
Criterion validity is about prediction rather than explanation
• Prediction is concerned with non-causal dependence
• Explanation pertains to causal or logical dependence
E.g. one can predict the weather from the height of mercury inside a barometer. However, one cannot use the behaviour of the mercury to explain why the weather changes.

Responsive to Change
Is the measure sensitive enough to detect clinically relevant change?
Essential for evaluative measurements

Examples: Pain Perception
Visual Analogue Scales
• Reliable and Valid (Jensen & Karoly 1993)
• Advantages over other pain assessment methods (Scott & Huskisson 1976, Price et al. 1994)
Quadruple Visual Analogue Scales – 4 specific factors – Von Korff et al. 1992
• CURRENT pain level
• AVERAGE or TYPICAL pain level
• Pain level at its BEST
• Pain level at its WORST
Ratings are averaged x 10 = TOTAL SCORE (Range 0 – 100)

Condition-Specific Assessment – Low Back Pain
40+ low back functional questionnaires exist
5 identified as “gold standard” (Kopec & Esdaile, 1995)
1. Sickness Impact Profile (Bergner et al. 1981)
2. Roland-Morris Disability Questionnaire (Roland and Morris, 1983)
3. Oswestry Low Back Pain Disability Questionnaire (Fairbank et al. 1980)
4. Million Visual Analogue Scale (Million et al. 1982)
5. Waddell Disability Index (Waddell, 1984)

Condition-Specific Assessment – Low Back Pain
2 of the “gold standards” (Kopec & Esdaile, 1995)
1. Roland-Morris Disability Questionnaire (Roland and Morris, 1983)
2. Oswestry Low Back Pain Disability Questionnaire (Fairbank et al. 1980)
+ Quebec Back Pain Disability Scale (Kopec et al. 1995)

Roland Morris Disability Questionnaire (RMQ)
Purpose:
Acute and Chronic population of low back pain sufferers
An evaluative measure in clinical trials
Face Validity:
+ 24 Yes / No questions
+ Moderate specificity
+ Today is the frame of reference
+ Performance related
+ Double negatives
+ “Yes response” scores – score out of 24
Content Validity:
Mobility, Work, Sleeping, Recreation, Dressing / grooming, Standing, Mood, Appetite

Roland Morris Disability Questionnaire (RMQ)
“The best single study of assessing short-term outcomes of primary care patients with low back pain” (Von Korff & Saunders, 1996)
Scores greater than 13 = significant disability associated with an unfavorable outcome (Von Korff & Saunders, 1996)
Any change of less than 4 points is both too small to matter and too small to be reliable (Stratford et al. 1996)

Oswestry Disability Questionnaire (revised)
Purpose:
Acute and Chronic population of low back pain sufferers
Discriminate between chronic and acute low back pain
An evaluative measure in clinical trials
Used to predict different rates of improvement
Face Validity:
+ Measured 0 – 5 by degree of difficulty
+ Very specific questions
+ No specific frame of reference
+ Capacity related questions
+ Score by summing all items = percentage score
Content Validity:
Pain intensity, Walking, Sleeping, Personal care, Lifting, Sitting, Standing, Sex / social life, Travelling

Oswestry Disability Questionnaire (revised)
Content Validity:
Omits: bending, twisting, emotional state, kneeling, turning, sudden movement
“Sex life” reduced response rates (Hudson-Cook et al. 1989)
Scoring issues:
11% is a cut-off score (Erhard et al. 1994)
Score bands (Stratford et al. 1988):
0 - 20%      Minimal Disability
20 - 40%     Moderate Disability
40 - 60%     Severe Disability
60 - 80%     Crippled
80 - 100%    Bed Bound or Exaggerating

Quebec Back Pain Disability Scale
Purpose:
Acute and Chronic population of low back pain sufferers
Assess level of functional disability
Designed as discriminative, evaluative and predictive
Face Validity:
+ Response on rating scale 0 - 5
+ Very specific questions
+ “Today” as frame of reference
+ Performance related questions
+ Score by summing all items = percentage score
Content Validity:
Mobility, Sitting, Lifting, Travelling, Standing, Bending, Sleeping, Running

Quebec Back Pain Disability Scale
Content Validity:
Omits: twisting, emotional state, sex life, turning, sudden movement

Reliability
• Has test-retest reliability been established?
• Is the measure reproducible on repeated use with a stable patient?
• Internal consistency?
• Do items correlate with others?
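
Internal consistency, the question of whether items correlate with one another, is usually summarised by Cronbach's alpha. The sketch below uses the standard formula, alpha = k/(k-1) x (1 - sum of item variances / variance of total scores), on a small set of hypothetical questionnaire responses. The data and the name `cronbach_alpha` are assumptions made for illustration and do not come from any of the instruments discussed here.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha; each row holds one respondent's item scores."""
    k = len(rows[0])  # number of items
    # Sample variance of each item across respondents.
    item_vars = [variance(item) for item in zip(*rows)]
    # Sample variance of the respondents' total scores.
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical data: 5 respondents answering 4 items scored 0 - 5.
scores = [
    [2, 3, 3, 2],
    [4, 4, 5, 4],
    [1, 2, 1, 1],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
]

print(f"alpha = {cronbach_alpha(scores):.2f}")
```

Alpha approaches 1 when respondents who score high on one item tend to score high on the others, as in this deliberately consistent made-up sample.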

Alpha (reliability score), lowest - highest
Roland-Morris Disability Questionnaire    0.89 - 0.93
Oswestry Disability Questionnaire         0.77
Waddell Disability Index                  0.76
Quebec Back Pain Disability Scale         0.93 - 0.95

Back Performance Scale (BPS)
5 tests of sagittal-plane mobility:
A) Sock test
B) Pick-up test
C) Fingertip-to-floor test
D) Roll-up test
E) Lift test
Sum scores to obtain a performance measure of mobility-related activities
Objectives:
• Develop a sum scale
• Discriminative ability
• Sensitive to change
Strand et al. 2002

Back Performance Scale (BPS) – Evaluation of . . .
Correlations among 5 tests of sagittal-plane mobility:
  Ranged from 0.27 – 0.50
Correlations among 5 tests and BPS total:
  Ranged from 0.63 – 0.73
Cronbach Alpha (reliability):
  Achieved 0.73
Sum Scores Discrimination:
  Higher scores in patients not returning to work
  Higher scores in patients with back pain rather than MSD
Responsiveness:
  Effect size high (1.33) for patients who returned to work
  Effect size low (0.31) for patients who had not returned to work

Back Performance Scale (BPS) – Evaluation of . . .
1. BPS sum more responsive than separate tests
2. Measures aspects of performance of clinical importance to back pain
3. Quick, simple and cheap to administer
4. No costly equipment
Future research:
• Could tests with lateral bending and twisting be added?
• Could twisting / lateral tests replace any of the sagittal bending tests?
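
The responsiveness figures quoted for the BPS are effect sizes. The slides do not say which definition Strand et al. used, so the sketch below assumes one common convention: mean change between baseline and follow-up divided by the standard deviation of the baseline scores. The scores are hypothetical, the name `effect_size` is an assumption, and a lower score is taken to mean better performance.

```python
from statistics import mean, stdev


def effect_size(baseline, follow_up):
    """Mean improvement divided by the SD of baseline scores.

    Assumes paired scores where lower is better, so improvement is
    baseline minus follow-up.
    """
    changes = [b - f for b, f in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)


# Hypothetical sum scores for five patients before and after treatment.
baseline = [8, 6, 9, 7, 5]
follow_up = [5, 4, 6, 5, 3]

print(f"effect size = {effect_size(baseline, follow_up):.2f}")
```

Under this convention a value well above 1, like the 1.33 reported for patients who returned to work, means the average change exceeded the spread of scores at baseline.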

Yellow Flags of Low Back Pain
Indicative of long-term chronicity and disability
• Negative attitude – back pain is harmful and disabling
• Fear avoidance
• Reduced activity
• Expects passive treatment to be better than active treatment
• Tendency to low morale, depression and social withdrawal
• Social / Financial problems
Should these psychosocial aspects be included in an assessment scale?
What validity does any scale have when omitting these constructs?

Appropriateness / Feasibility
• Is the administration format suitable?
• Is the time taken to complete the questionnaire appropriate?
• Are questions easy to understand?
• Are questions acceptable to the patient?
• Clinical relevance?

Mathematical Models
Leg length differences and MSDs
Two measurement methods:
1) Direct measurement / observation
2) Regression equations
• MRI
• Ultrasonics
• Early stages
• Not cost-effective
• Complementary at present
• Physiologically valid
• Requires physiological uniformity
• Not valid with clinical populations
Ashford & Marlbrook, 2003

Summary of Reliability & Validity
• There can be validity without reliability
• Reliability is an aspect of construct validity - as assessment becomes less standardized, distinctions between reliability and validity blur
• In many situations assessors are not trained to agree on a common set of criteria and standards
• Inconsistency in performance across tasks does not invalidate the assessment
• Rather it becomes an empirical puzzle to be solved by searching for a more comprehensive interpretation
• Initial disagreement does not invalidate any assessment - it provides impetus for dialog
Moss, 1994

Implications for Back Pain Assessments
• Development of 1 single valid universal test may be pointless
• No Grand Unifying Theory of measurements
• If something were easy to measure validly, it would’ve been done by now
• Functional assessments seem alive and well (for now)
• Functional assessments must develop and include psychosocial aspects