Appendix B

There are several sources and statistical methods that can be used to assess validity and reliability. For readers wishing to learn more, many academic resources are available.1–4,6 Traditionally, there are three types of validity (content validity, criterion validity, which comprises predictive and concurrent validity, and construct validity) and four methods of testing the reliability of a tool. In this paper, the following four definitions/descriptions from Messick5 are used when discussing validity, and from Downing3 when discussing rater reliability.

Face or content validity
"Content validity is evaluated by showing how well the content of the test samples the class of situations or subject matter about which conclusions are to be drawn." This subject relevance or representativeness is usually judged by experts and established through consensus. E.g. Does the checklist contain items related to what is performed during an ultrasound-guided regional anesthesia block?

Construct validity
"Construct validity is evaluated by investigating what qualities a test measures, that is, by determining the degree to which certain explanatory concepts or constructs account for performance on the test." The test does not have to define the construct completely, but rather be an approximate measure of it. It can be assessed by examining whether the tool discriminates between levels of performance. E.g. When this evaluation is used, do novices in this procedure perform poorly compared with experienced participants?

Concurrent validity
"Concurrent validity indicates the extent to which the tests estimate an individual's present standing on the criterion." E.g. How well does this evaluation (i.e. simulation model performance) reflect or correlate with another measure (i.e. patient block performance)?

Inter-rater reliability
"Reliability refers to the reproducibility of assessment data or scores, over time or occasions… All reliability estimates quantify some consistency of measurement and indicate the amount of random error associated with the measurement data." Intraclass correlation coefficients (ICCs) provide an estimate of the consistency of measurements made by different raters evaluating the same task while accounting for the components of variability inherent to the study design (i.e. the raters and the participant evaluations).3 There are six forms of ICC; which one is reported depends on the analysis of variance model used (one-way or two-way), the treatment of raters (random or fixed) and the unit of analysis (individual or mean ratings).6 In our paper, two-way average-measures intraclass correlations (ICCs) were calculated for absolute agreement, as all participants were rated by the same two raters and our interest was in quantifying the reliability of average ratings. A worked sketch of this calculation appears after the reference list.

References
1. Cook DA, Beckman TJ. Current concepts in validity and reliability for psychometric instruments: theory and application. Am J Med 2006; 119:166.e7–166.e16
2. Downing SM, Haladyna TM. Validity threats: overcoming interference with proposed interpretations of assessment data. Med Educ 2004; 38:327–333
3. Downing SM. Reliability: on the reproducibility of assessment data. Med Educ 2004; 38:1006–1012
4. Downing SM. Validity: on the meaningful interpretation of assessment data. Med Educ 2003; 37:830–837
5. Messick S. Validity of test interpretation and use. Educational Testing Service 1990; 1–33
6. Shrout P, Fleiss J. Intraclass correlations: uses in assessing rater reliability. Psychol Bull 1979; 86:420–428
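
For illustration only (this is not the analysis code used in the paper), the sketch below computes the two-way, absolute-agreement, average-measures ICC, i.e. ICC(2,k) in the notation of Shrout and Fleiss,6 from a participants-by-raters score matrix, assuming raters are treated as random effects and every participant is scored by the same raters. The function name and the toy ratings are hypothetical.

```python
import numpy as np

def icc_2k(scores: np.ndarray) -> float:
    """Two-way random-effects, absolute-agreement, average-measures ICC.

    scores: n_participants x k_raters matrix of ratings (every participant
    scored by the same raters, no missing values).
    """
    n, k = scores.shape
    grand_mean = scores.mean()
    target_means = scores.mean(axis=1)   # mean per participant (target)
    rater_means = scores.mean(axis=0)    # mean per rater (judge)

    # Two-way ANOVA mean squares (no replication)
    bms = k * np.sum((target_means - grand_mean) ** 2) / (n - 1)   # between targets
    jms = n * np.sum((rater_means - grand_mean) ** 2) / (k - 1)    # between raters
    resid = scores - target_means[:, None] - rater_means[None, :] + grand_mean
    ems = np.sum(resid ** 2) / ((n - 1) * (k - 1))                 # residual error

    # Shrout & Fleiss (1979), ICC(2,k)
    return (bms - ems) / (bms + (jms - ems) / n)

# Hypothetical checklist scores: 6 participants, each rated by the same 2 raters
ratings = np.array([
    [12.0, 14.0],
    [20.0, 19.0],
    [ 8.0, 10.0],
    [15.0, 15.0],
    [22.0, 21.0],
    [11.0, 13.0],
])
print(f"ICC(2,k) = {icc_2k(ratings):.3f}")
```

With missing ratings or a more complex design, a dedicated statistical package would normally be used rather than this hand-rolled ANOVA decomposition; the sketch is only meant to make the mean-square components behind the reported ICC explicit.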