Repeatability of a skin tent test for dehydration in working horses and donkeys

JC Pritchard†‡*, ARS Barr† and HR Whay†

† Department of Clinical Veterinary Science, University of Bristol, Langford House, BS40 5DU, UK
‡ Brooke Hospital for Animals, Broadmead House, 21 Panton St, London SW1Y 4DR, UK
* Contact for correspondence and request for reprints: joy.pritchard@bristol.ac.uk

Running title: Dehydration in working horses and donkeys

Abstract

Dehydration is a serious welfare issue for equines working in developing countries. Risk factors such as high ambient temperature, heavy workload and poor water availability are exacerbated by the traditional belief that the provision of water to working animals will reduce their strength or cause colic and muscle cramps. As part of the welfare assessment of 4889 working horses and donkeys during 2002/3, eight observers were trained to perform a standardised skin tent test. The prevalence of a prolonged duration of skin tenting was 50% in horses and 37% in donkeys. Two studies investigated the inter-observer repeatability of skin tent test techniques, using a total of 220 horses and donkeys in India and then Egypt: measures of agreement with a 'gold standard' observer varied from 40 to 99%. Simplifying the test by reducing the number of possible skin tent scores from three (immediate return of skin to normal position; delayed return of up to three seconds; delayed return of more than three seconds) to two (immediate return of skin to normal position; delayed return of any duration) did not improve the overall repeatability of the test. Potential reasons for not achieving high levels of agreement include variations in assessment method, assessors' previous experience, subjective demarcation between score categories and biological variability.

Keywords: animal welfare, dehydration, donkey, horse, repeatability, skin tent

Introduction

Assessment of welfare using direct, animal-based observation of health and behaviour outputs has increased in recent years. It is now used by the Soil Association for the monitoring of organic standards and by the European Union Welfare Quality Project to standardise welfare output criteria within the farming sector. Animal-based welfare assessment also has applications in other sectors and in 2002 was adopted by the UK-based charity Brooke Hospital for Animals as an effective way to inform and monitor welfare improvement strategies for equines working in developing countries. Although often assessed subjectively, animal-based observations should provide a more direct, and therefore more valid, assessment of welfare than resource measurements. A key consideration when developing an effective welfare assessment tool is ensuring the repeatability of results between different observers. Subjective assessment of health parameters, such as locomotion scoring of pigs, has been found to be highly repeatable between trained observers (Main et al 2000).

A welfare assessment was carried out on 4889 equines that pull carts or carry people or goods in urban and peri-urban areas of Afghanistan, Egypt, India, Jordan and Pakistan (Pritchard et al 2005). As part of the study, which took place during winter 2002 and spring 2003, eight observers were trained to perform a skin tent test, using a standard method and anatomical location described by detailed guidance notes and photographs.
Skin tent was scored in 4664 animals: the prevalence of Score 1 (some loss of skin elasticity) was 32% in horses and 28% in donkeys, and the prevalence of Score 2 (prominent tenting of skin) was 18% in horses and 9% in donkeys. This paper describes two tests of the inter-observer repeatability of the skin tent test and discusses potential reasons why the test may not be highly repeatable between observers.

Materials and methods

Study A

This investigation was carried out in Delhi, India, during August 2003. Eighty horses and 80 donkeys were recruited from the population working in the vicinity of Brooke's field clinics, in order to test inter-observer repeatability for each parameter in the animal-based welfare assessment described above, including the standardised skin tent test. The animal was positioned with its head up, in a natural position, facing straight ahead. The lateral edge of the observer's hand rested against the cranial margin of the animal's scapula and a vertical fold of skin overlying the m. brachiocephalicus was pinched and released immediately. Although the height of the skin pinch was not standardised, the fixed position of the observer's hand was designed to reduce variation whilst remaining practically applicable. Skin tent duration was scored as follows: 0) no loss of skin elasticity, with skin returning to its normal position immediately; 1) some loss of skin elasticity, with tented skin remaining visible, but not prominent, for up to three seconds; and 2) prominent tenting of skin, visible for more than three seconds.

Each animal was identified with a unique number on a hoof brand and harness tag. No prior information about the animals' health or behaviour was provided to the observers. Six observers carried out the standardised skin tent test on the same 80 animals, with an interval of approximately ten minutes between observations. Forty horses were assessed on each of days 1 and 2, and forty donkeys on each of days 3 and 4.

Study B

The results of the first repeatability study were used to modify the welfare assessment protocol, with the aim of increasing the repeatability of some measures. The skin tent test was simplified from three scores to two: animals were scored 0) (absent) if the skin returned to its normal position immediately after being pinched and released, or 1) (present) if there was any delay in the return of the tented skin to its normal position. To test the success of the modifications to the welfare assessment, a second repeatability study was undertaken in Cairo, Egypt, during April 2004, using the method described for Study A. Ten observers (including five who took part in Study A) assessed 30 working horses and 30 working donkeys over two days.

Statistical analysis

For both studies, data were analysed for the level of agreement between Observer 1 and each of the other observers, using Cohen's kappa coefficient and calculations of percentage agreement. Observer 1 was the same for both studies. Statistical analysis was carried out using SPSS v 12.0 (SPSS Inc). Significance is reported at the P < 0.05 level.
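Although all analyses reported here were performed in SPSS, both agreement statistics are simple to compute directly. The following Python sketch is for illustration only; the function names and the observer score lists are hypothetical, not part of the study protocol. It calculates percentage agreement and Cohen's kappa for one pair of observers scoring the same animals.

from collections import Counter

def percentage_agreement(scores_a, scores_b):
    # Percentage of animals given an identical score by both observers
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return 100.0 * matches / len(scores_a)

def cohens_kappa(scores_a, scores_b):
    # Observed proportion of agreement (p_o), and the proportion expected
    # by chance (p_e) from the observers' marginal score frequencies
    n = len(scores_a)
    p_o = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    freq_a, freq_b = Counter(scores_a), Counter(scores_b)
    p_e = sum((freq_a[k] / n) * (freq_b[k] / n)
              for k in set(freq_a) | set(freq_b))
    # Kappa is undefined when chance agreement is total, for example when
    # both observers allocated only a single, identical score
    return None if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)

# Hypothetical three-point skin tent scores (0, 1 and 2) for ten animals
observer_1 = [0, 1, 2, 1, 0, 2, 1, 0, 1, 2]
observer_2 = [0, 1, 1, 1, 0, 2, 2, 0, 1, 2]
print(percentage_agreement(observer_1, observer_2))  # 80.0
print(cohens_kappa(observer_1, observer_2))          # approximately 0.70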
Table 1

Results

Levels of agreement for the skin tent test between Observer 1 and each of the other observers are presented in Table 1. In Study A, two observers achieved greater than 75% agreement with Observer 1 for horses and three achieved greater than 75% agreement for donkeys. The kappa statistic could not be calculated for all observers in Study A because some observers did not use the full range of possible scores. In Study B, one observer achieved over 75% agreement with Observer 1 for horses (κ = 0.664, P < 0.01) and three achieved over 75% agreement for donkeys (κ = 0.529-0.667, P < 0.01). Some observers improved their percentage agreement with Observer 1 when the scoring system was simplified; others showed a lower percentage agreement. Overall, repeatability was above 60% for both systems, but the two-score system achieved a lower inter-observer repeatability (61 and 64% for horses and donkeys, respectively) than the three-score system (66 and 83%).

Discussion

Cohen's kappa coefficient is a measure of agreement that relates the actual agreement between observers to that which would have been obtained by chance. It is a quotient that can take any value between 0 (no agreement beyond chance) and 1 (perfect agreement); negative values, indicating less agreement than would be expected by chance, are possible but rarely encountered in practice. There are no objective criteria for judging kappa, although 0.4-0.5 is considered to represent moderate agreement beyond chance; levels below this indicate that a test has poor specificity and/or sensitivity (Martin et al 1987). Kappa also depends on the number of categories used in its calculation, with its value being greater if there are fewer categories (Petrie & Watson 1999).
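For reference, kappa is calculated as

\[ \kappa = \frac{p_o - p_e}{1 - p_e} \]

where \(p_o\) is the observed proportion of agreement between a pair of observers and \(p_e\) is the proportion of agreement expected by chance, obtained by summing, over the score categories, the products of the two observers' marginal proportions.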
However, because kappa is calculated from a square table of paired observations, it cannot generate a statistic for agreement between observers who did not use the full range of possible scores. In Study A, most observers did not allocate all possible scores within the group of animals, so they could not be compared with Observer 1 in this way. Calculating the percentage agreement between Observer 1 and each other observer therefore produced the most useful outcome in this case. It may be argued that each observer should be compared with the modal result rather than with a 'gold standard'; however, this assumes that all observers have an equal level of training, use an identical method and can produce absolutely standardised observations. All observers were trained by JCP (Observer 1), HRW or both, and were provided with a comprehensive set of guidance notes and photographs written by JCP, so for these studies it was decided to use JCP as the gold standard.

Potential reasons for not achieving good agreement between observers include:

• Variation between observers in the method of assessment used for some parameters, possibly attributable to guidance notes not being sufficiently specific about how a parameter should be assessed or how to define 'normal'.

• Assessors' previous experience: unfamiliarity with observing the parameter during previous field work, failure to recognise normal, or application of a skewed scale.

• Subjective demarcation between score categories; for example, in Study A, Scores 1 and 2 for skin tent were defined by the time taken for the skin to return to normal, but times were estimated rather than measured accurately.

• Biological variability in the parameter over the time taken to carry out the repeatability test; for example, changes in skin elasticity or hydration status.

• The test not being valid for the parameter under assessment.

The first three may be addressed by refining training and guidance notes, by exposing observers beforehand to animals demonstrating the full range of possible scores for the parameter in question, and by standardising methodology. For skin tent, this may include the introduction of an objective time measurement. Biological variability in the skin tent test is poorly understood; reducing the time taken to carry out the inter-observer repeatability test may improve agreement between observers, although it risks introducing variables relating to repeated handling of the skin within a short period. Many observer repeatability studies use photographs or video footage, rather than consecutive direct observations, in order to reduce this variability (Fuller et al 2006). Simplifying the score categories from three in Study A to two in Study B was intended to minimise errors relating to subjective demarcation of score categories, although this did not appear to have the desired effect.

Conclusions and animal welfare implications

This study concluded that a standardised skin tent test for dehydration was moderately repeatable and that some observers could achieve a high level of repeatability when compared with a 'gold standard' observer. Simplifying the scoring system did not result in better inter-observer repeatability overall. A repeatable and practical measure of hydration status is needed in order to develop and evaluate intervention programmes to reduce dehydration and thus improve the welfare of equines working in hot and humid environments.

Acknowledgements

This work was funded by the Brooke Hospital for Animals. The authors would like to thank all Brooke field staff overseas who participated in and supported these studies.

References

Fuller CJ, Bladon BM, Driver AJ and Barr ARS 2006 The intra- and inter-assessor reliability of measurement of functional outcome by lameness scoring in horses. The Veterinary Journal 171: 281-286
Main DCJ, Clegg J, Spatz A and Green LE 2000 Repeatability of a lameness scoring system for finishing pigs. Veterinary Record 147: 574-576
Martin SW, Meek AH and Willeberg P 1987 Measurement of disease frequency and production. In: Veterinary Epidemiology: Principles and Methods. Iowa State University Press: Ames, Iowa, USA
Petrie A and Watson P 1999 The kappa measure of agreement for a categorical variable. In: Statistics for Veterinary and Animal Science. Blackwell Science Ltd: Oxford, UK
Pritchard JC, Lindberg AC, Main DCJ and Whay HR 2005 Assessment of the welfare of working horses, mules and donkeys, using health and behaviour parameters. Preventive Veterinary Medicine 69: 265-283