Normality or not? Different distributions and their importance
Stats Club 3
Marnie Brennan

References
• Petrie and Sabin – Medical Statistics at a Glance: Chapters 7, 8, 9, 35 (Good)
• Petrie and Watson – Statistics for Veterinary and Animal Science: Chapter 3 (Good)
• Thrusfield – Veterinary Epidemiology: Chapter 12
• Kirkwood and Sterne – Essential Medical Statistics

What is a distribution?
• Empirical frequency distribution versus theoretical distribution
• Very easy!
  – Empirical frequency distribution is something that you actually measure and calculate
    • E.g. Coat colour in cats – Tabby, Ginger, Tortoiseshell, Seal-point
    • In a population, each one of these has a frequency e.g. 5 x Tabby, 9 x Ginger, 15 x Tortoiseshell, 8 x Seal-point

Theoretical distributions
• Theoretical distribution – is just that – theoretical!
• It is something we compare our data (empirical frequency distribution) against to see which distribution describes it the best
  – This helps to signpost us to what statistical analyses we do next, according to the distribution it ‘approximates’

Theoretical distributions and types of data
• Back to our flow charts in the back of Petrie and Sabin, and Petrie and Watson
• Relates to what type of variable you have
  – Continuous? E.g. Heights of Japanese men
  – Categorical or discrete? E.g. Coat colour in cats

Continuous distributions
• Normal distribution (our focus today)
  – The granddaddy of them all!
  – Also known as the Gaussian distribution (after Gauss, German mathematician)
  – e.g. heights of adult men in the UK
• t-distribution
  – Similar shape to Normal, but is more spread out with longer tails
  – Useful for calculating confidence intervals
• Chi-squared distribution
  – Right-skewed distribution
  – Useful for analysing categorical data
• F-distribution
  – Skewed to the right
  – Useful for comparing variances and more than 2 means (i.e. > 2 groups)
• DO NOT BE SCARED – THIS IS ANOTHER EXERCISE IN TERMINOLOGY!!
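
Going back to the coat colour example under "What is a distribution?", an empirical frequency distribution is simply a table of counts, usually shown alongside relative frequencies. A minimal Python sketch using the counts quoted on that slide (the function name `relative_frequencies` is an illustrative choice, not something from the references):

```python
# Observed counts from the coat colour example in the slides.
coat_counts = {"Tabby": 5, "Ginger": 9, "Tortoiseshell": 15, "Seal-point": 8}


def relative_frequencies(counts):
    """Convert raw counts into proportions of the total."""
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


total = sum(coat_counts.values())
rel = relative_frequencies(coat_counts)

# Print one row per coat colour: name, count, share of all cats.
for colour, n in coat_counts.items():
    print(f"{colour:<14}{n:>3}  {rel[colour]:.1%}")
print(f"{'Total':<14}{total:>3}")
```

This table is the "empirical" side; a theoretical distribution is what such a table would be compared against.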

Discrete distributions
• Binomial distribution
  – Could be skewed to the right or left (!)
  – Good for analysing proportion data, where each observation is either one thing or another, e.g. an animal either has a disease or does not have a disease
• Poisson distribution
  – Right skewed
  – Good for analysing count data, e.g. the number of hospital admissions per day, or the number of parasite eggs per gram of faecal sample
• Many of these distributions approximate the Normal distribution as your sample size increases
• A lot of this goes on behind the scenes on your computer when doing statistics; it is here to help explain some of the terminology and basic ideas only (don’t worry too much about it!)

The useful bit.......
• You have collected continuous data from your research e.g. the diameter of rabbit skulls in millimetres
• You would like to find out if this is normally distributed or not (as you know that this will affect what statistical tests you do)
• How do you assess whether this variable is normal or not?

4 steps to Normality!
• Plot your data
  – Create a histogram with frequencies and judge by eye
    • Does it look bell-shaped and symmetrical?
    • Does it look unimodal i.e. does it only have one peak?
  – Subjective assessment, but you should be doing this anyway!

4 steps (continued)
• How different are the mean and median?
  – Mean = the sum of all your values divided by the number of measurements
  – Median = the midpoint of your values when ranked in order, i.e. what is the ‘halfway’ value in your data?
    • If they are very different, the data are probably not normally distributed
    • If they are very similar, your data could be normally distributed
  – Another rule of thumb, so not always correct

4 steps (continued)
• Skewness and kurtosis
  – Skewness (how symmetrical the data is)
    • Normal – this value is 0
    • Right-skewed distribution – positive value
    • Left-skewed distribution – negative value
  – Kurtosis (the ‘peakedness’ of the data) – does your data have a pointy bit, or is it flat?
    • Normal – this value is 0 (packages such as Minitab and SPSS report kurtosis relative to the Normal distribution)
    • Sharply peaked data – positive value
    • Flat peaked data – negative value
  – Can measure these in Minitab or SPSS

4 steps (continued)
• Bespoke tests for normality
  – Shapiro-Wilk test (the Ryan-Joiner test in Minitab is similar)
  – Kolmogorov-Smirnov test
  – Anderson-Darling test
• Watch interpretation of p-values – if p < 0.05, reject the null hypothesis of normality (the data are unlikely to be normally distributed); if p ≥ 0.05, there is no evidence against normality, which is not proof that the data are normal
• The good news!
  – Computers do this for us so we don’t have to!

Next month
• Spread of your data – how do we measure this?
  – mean, standard deviation, variance
  – median, interquartile range
  – mode
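
Looking back at the 4 steps to Normality, the first three can be sketched with nothing more than the Python standard library. This is only an illustration under stated assumptions: the skull diameters are invented, the helper names are hypothetical, and the skewness and kurtosis use the simple moment formulas, so packages such as Minitab or SPSS (which often apply small-sample corrections) may give slightly different values.

```python
import statistics
from collections import Counter

# Hypothetical measurements: rabbit skull diameters in mm (invented for illustration).
skull_mm = [31.2, 32.5, 33.1, 33.8, 34.0, 34.4, 34.9, 35.0, 35.3,
            35.6, 35.9, 36.1, 36.4, 36.8, 37.2, 37.9, 38.5, 39.4]


def text_histogram(data, bin_width=1.0):
    """Step 1: one line per bin, one '#' per observation, to judge shape by eye."""
    counts = Counter(int(x // bin_width) for x in data)
    lines = []
    for b in range(min(counts), max(counts) + 1):
        low = b * bin_width
        lines.append(f"{low:5.1f} to {low + bin_width:5.1f} | {'#' * counts[b]}")
    return lines


def central_moment(data, k):
    """Average of (x - mean)**k over the sample."""
    mean = statistics.fmean(data)
    return sum((x - mean) ** k for x in data) / len(data)


def skewness(data):
    """Step 3: 0 if symmetrical, positive if right-skewed, negative if left-skewed."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def excess_kurtosis(data):
    """Step 3: kurtosis minus 3, so that a Normal distribution scores 0."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3.0


# Step 1: plot the data.
for line in text_histogram(skull_mm):
    print(line)

# Step 2: compare mean and median.
print("mean     =", round(statistics.fmean(skull_mm), 2))
print("median   =", round(statistics.median(skull_mm), 2))

# Step 3: skewness and kurtosis.
print("skewness =", round(skewness(skull_mm), 3))
print("kurtosis =", round(excess_kurtosis(skull_mm), 3))
```

Step 4 (a formal test such as Shapiro-Wilk) is not in the standard library and needs a statistics package. In Python, for instance, `scipy.stats.shapiro` returns a test statistic and a p-value, which is read as on the slide above.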