Descriptive Statistics
Epidemiology/Biostatistics

Kenneth Kwan Ho Chui, PhD, MPH
Department of Public Health and Community Medicine
kenneth.chui@tufts.edu
617.636.0853

Learning objectives in the syllabus
  Distinguish between types of data
  Know appropriate data presentation options for various data types
  Understand the strengths and limitations of various descriptive statistics
  Understand the concept of skewness and its implications for discrete and continuous distributions
  Appreciate the special aspects of the normal distribution
  Understand the calculation and application of z-scores

Population and sample
  Population parameter (unknown): the true mean BMI of Boston, Massachusetts
  Sample statistic (what the researcher observes): the mean BMI of a sample from Boston, Massachusetts

[Diagram labels: Population (parameter), Sample (sample statistics), Distribution of sample means]
Know how to interpret and calculate a confidence interval for statistical inference

[Outline: Types of data; How to summarize data (Central tendency, Variability)]

Types of data

Descriptive statistics
Tabulation:
  Attribute   15  16  17  18  19  20  21  22  23  24  25
  Frequency    3   4  12  13  16  22  15  10   4   0   1
Graphical visualization
Summary measures: Mean = 19.43; Median = 20.00; Standard deviation = 2.01

Types of data: Nominal
Data representing attributes that are:
  unordered
  mutually exclusive
  ideally exhaustive
Examples:
  Genders. Nominal variables with only two possible attributes are also called “dichotomous” or “binary”.
  Marital status (Census 2000, Long form)

Graph for showing nominal data: Pie chart
Source: Watson-Jones et al. Effect of Herpes Simplex Suppression on Incidence of HIV among Women in Tanzania. N Engl J Med 2008; 358:1560-71

Graph for showing nominal data: Bar chart
Horizontal axis is categorical
Source: Watson-Jones et al. Effect of Herpes Simplex Suppression on Incidence of HIV among Women in Tanzania. N Engl J Med 2008; 358:1560-71

Graph for showing nominal data: Grouped bar chart
Source: Watson-Jones et al. Effect of Herpes Simplex Suppression on Incidence of HIV among Women in Tanzania.
N Engl J Med 2008; 358:1560-71

Types of data: Ordinal
Data representing attributes that are:
  ordered
  possibly unequal in the difference between ranks
Examples:
  Language proficiency
  Number of rooms (pay attention to the last option)
Source: Census 2000, Long form

Graph for showing ordinal data: Bar chart
Source: US General Social Survey, 1991

Types of data: Discrete
Sometimes referred to as a “count variable”
Data representing attributes that are:
  ordered
  equal in the difference between ranks
  limited to a finite number of possible values, usually integers (0, 1, 2, 3, 4, 5, etc.)
Example:
  Frequency of cooking dinner at home (NHANES 2007-08, Consumer Behavior section)

Graph for showing discrete data: Histogram
  No space between bars
  Horizontal axis is a continuum
Source: US General Social Survey, 1991

Types of data: Continuous
Data representing attributes that are:
  ordered
  equal in the difference between ranks
  able to take an infinite number of possible values
Examples:
  Height. Consider a reported height of 165.5 cm. In reality it could be 165.4810550654211381380… cm, so fine that we can never measure it exactly.
  Blood pressure
  Age

Graph for showing continuous data: Histogram
Source: US General Social Survey, 1991

The “relationship diagram”
  Nominal and Ordinal: collectively referred to as “categorical data”. Ordinal is also called “rank”.
  Discrete and Continuous: share similar statistical properties. Discrete is also called “count”.
Techniques good for continuous data are often good for discrete data. At a fundamental level, it is safe to group them together (until you learn analyses that are specific to discrete data, which are outside our syllabus).

The hierarchy of data types
Once the data are collected, you can then aggregate them down, but you cannot go back up:
  Continuous
  Discrete
  Ordinal
  Nominal

The hierarchy of data types, cont.
  Continuous: Birthday
  Discrete: Age in years
  Ordinal: <20, 20-29, 30-39, 40-49, 50-59, ≥60
  Nominal: Below 21 vs.
above 21
Downward aggregation

If you ever end up designing a study or collecting your own data, always strive for the highest type in the hierarchy, within reason.

Central tendency

Central tendency
The tendency of quantitative data to cluster around some central value
Three major types:
  Mean (also called average)
  Median (also called 50th percentile)
  Mode

Central tendency: Mean
Consider a variable with data: 1, 2, 3, 3, 4, 4, 4, 5, 5, 6
The mean is the sum of the values divided by the number of values: (1 + 2 + 3 + 3 + 4 + 4 + 4 + 5 + 5 + 6)/10 = 37/10 = 3.7

Central tendency: Median
A median is the numeric value separating the higher half of a sample from the lower half
The median can be found by:
  1. Arranging all the observations in ascending or descending order
  2. Picking the middle number as the median
  3. If there is an even number of observations, taking the mean of the two middle values as the median
Consider a variable with data: 1, 2, 3, 3, 4, 4, 4, 5, 5, 6
Since we have an even number of cases, the median is the mean of the two middle values, which is (4+4)/2 = 4

Central tendency: Mode
A mode is the data value with the highest frequency compared to the other values’ frequencies
Consider a variable with data: 1, 2, 3, 3, 4, 4, 4, 5, 5, 6
If we compile a frequency table with the numbers, we get:
  Value      1  2  3  4  5  6
  Frequency  1  1  2  3  2  1
Because the value “4” has the highest frequency (3), “4” is the mode.
A variable can only have one mean and one median. However, it can have more than one mode.

Which one is the right central tendency?
              Mean   Median                 Mode
  Nominal     No     No                     Yes
  Ordinal     No     Yes                    Yes, but uncommon**
  Discrete    Yes    Yes, esp. if skewed*   Yes, but uncommon**
  Continuous  Yes    Yes, esp. if skewed*   No
* Skewness will be explained shortly in this lecture
** The number of possible responses in ordinal and discrete variables tends to be much larger than in nominal variables, which makes it inconvenient to report when there are many modes

Variability

Variability
The magnitude of dispersion of the data around their own central value
Four major expressions:
  Range
  Interquartile range (IQR)
  Variance
  Standard deviation

Variability: Range
A range is the difference between the smallest and the largest values in a variable
Consider a variable with data: 1, 2, 3, 3, 4, 4, 4, 5, 5, 6
The range is (6 – 1) = 5
There is no conventional pairing with any particular central tendency measure

Variability: Interquartile range
Quartiles are a set of three numbers that break the observations into four groups of equal sample size
Consider a variable with data: 1, 2, 3, 3, 4, 4, 4, 5, 5, 6
  The first one is 3. It is also called the lower quartile or 25th percentile
  The middle one is (4+4)/2 = 4. It is the median or 50th percentile
  The last one is 5. It is also called the upper quartile or 75th percentile
The interquartile range is simply: 75th percentile – 25th percentile = 5 – 3 = 2
Often paired with the median in data reporting

A little caveat about quartiles
The median is well defined, but there is no universal agreement on how the upper and lower quartiles should be derived.
Two examples, both using the data 1, 2, 3, 4 (median 2.5):
  Splitting the data into two halves: 1, 2 gives a lower quartile of 1.5; 3, 4 gives an upper quartile of 3.5
  Including the median in both halves: 1, 2, 2.5 gives a lower quartile of 2; 2.5, 3, 4 gives an upper quartile of 3

Graph for showing quartiles: Boxplot
[Figure: boxplot of a variable, e.g. height. Labels from top to bottom:]
  Outlier
  Highest data point within (75th percentile + 1.5 IQR)
  Upper quartile (75th percentile)
  Median
  Lower quartile (25th percentile)
  Lowest data point within (25th percentile – 1.5 IQR)
The box spans the IQR; each whisker reaches at most 1.5 IQR beyond the box.

Variability: Variance
Consider a variable with data: 1, 2, 3, 3, 4, 4, 4, 5, 5, 6
To get the variance:
  1. Compute the mean, which is 3.7 (we did this already)
  2. Subtract the mean from each value: -2.7, -1.7, -0.7, -0.7, 0.3, 0.3, 0.3, 1.3, 1.3, 2.3
  3. Square them: 7.29, 2.89, 0.49, 0.49, 0.09, 0.09, 0.09, 1.69, 1.69, 5.29
  4. Add them up: 20.1
  5. Divide the sum by (number of cases – 1): 20.1/(10 – 1) = 2.23
Fortunately, computers can now do all of this for us!

Variability: Standard deviation
Standard deviation is the square root of the variance
Consider a variable with data: 1, 2, 3, 3, 4, 4, 4, 5, 5, 6
We calculated the variance in the previous slide (2.23)
The standard deviation (SD) is then: √2.23 ≈ 1.49
Often paired with the mean in data reporting

Which one is the right variability?
              Range  IQR*                    Variance / SD*
  Nominal     No     No                      No
  Ordinal     Yes    Yes                     No
  Discrete    Yes    Yes, esp. if skewed**   Yes
  Continuous  Yes    Yes, esp. if skewed**   Yes
* IQR: Interquartile range; SD: Standard deviation
** Skewness will be explained shortly in this lecture

Normal distribution

Mean ± SD
You will see “Mean ± SD” a lot. Most continuous data are summarized as mean ± standard deviation (± is pronounced “plus-minus”)
E.g. in the Aspirin study*, the BMI data in Table 1 for the two groups are:
  Aspirin: 26.1 ± 5.1
  Placebo: 26.0 ± 5.0
For both groups to be comparable, both means and SDs have to be similar
If we are willing to make an assumption, we can infer even more about the data! This magical assumption is the “normal distribution”
* See course reading “A Randomized Trial of Low-Dose Aspirin in the Primary Prevention of Cardiovascular Disease in Women”

Normal distribution
Some variables, when plotted in the form of a histogram, look like this:
  Reasonably symmetric
  More values at the center
  Decreasing number of values towards the two ends
  Looks like a bell
When this happens, we can say a lot more with the mean and standard deviation!
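
The summary measures worked through earlier for the data 1, 2, 3, 3, 4, 4, 4, 5, 5, 6 can be checked with a few lines of code. The following is only a minimal sketch using Python's standard library; the variable names are illustrative and are not part of the lecture.

```python
import math
import statistics

# The running example used throughout the slides
data = [1, 2, 3, 3, 4, 4, 4, 5, 5, 6]

# Central tendency
mean = statistics.mean(data)          # sum of values / number of values
median = statistics.median(data)      # mean of the two middle values (even n)
modes = statistics.multimode(data)    # list, because more than one mode is possible

# Variability
value_range = max(data) - min(data)

# Variance, following the five steps on the slide
deviations = [x - mean for x in data]           # step 2: subtract the mean
squared = [d ** 2 for d in deviations]          # step 3: square
sum_of_squares = sum(squared)                   # step 4: add up (about 20.1)
variance = sum_of_squares / (len(data) - 1)     # step 5: divide by (n - 1)

# Standard deviation is the square root of the variance
sd = math.sqrt(variance)

print(f"mean={mean}, median={median}, modes={modes}, range={value_range}")
print(f"variance={variance:.2f}, SD={sd:.2f}")
```

The hand calculation divides by (number of cases – 1), which is also what `statistics.variance` and `statistics.stdev` do, so the two approaches should agree.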
Feature #1: The 68-95-99 rule
  About 68% of samples are within ± 1 SD of the mean
  About 95% of samples are within ± 2 SD of the mean
  About 99% of samples are within ± 3 SD of the mean (more precisely, about 99.7%)
[Figure: normal curve with approximate percentiles at each SD mark]
  # of SD:     –3     –2     –1    0     +1    +2      +3
  Percentile:  0.5th  2.5th  16th  50th  84th  97.5th  99.5th

Application of the 68-95-99 rule (I)
The mean (±SD) of the daily caloric intake of a certain group is 1200 ± 150 kcal
ASSUME THE VARIABLE DAILY CALORIC INTAKE IS NORMALLY DISTRIBUTED*, then:
  68% of the participants have caloric intakes ranging from 1050 to 1350 kcal (– 1 SD to 1 SD)
  95% of the participants have caloric intakes ranging from 900 to 1500 kcal (– 2 SD to 2 SD)
  99% of the participants have caloric intakes ranging from 750 to 1650 kcal (– 3 SD to 3 SD)
* This assumption is needed for the 68-95-99 rule to work. The distribution can be checked with a histogram or other statistics (not covered in this class)

Application of the 68-95-99 rule (II)
The mean (±SD) of the daily caloric intake of a certain group is 1200 ± 150 kcal
ASSUME THE VARIABLE DAILY CALORIC INTAKE IS NORMALLY DISTRIBUTED, then:
  The data point at the 84th percentile is about (1200 + 150) = 1350 kcal
  The data point at the 99.5th percentile is about (1200 + 450) = 1650 kcal
  A subject with kcal = 1200 is likely to be at the 50th percentile in this sample

Feature #2: Standardized comparison with z-score
Consider an imaginary sample:
  Height: Mean = 160 cm, SD = 15 cm
  Weight: Mean = 95 lb, SD = 10 lb
How is someone who is 180 cm tall and weighs 107 lb doing relative to the rest?
The different units impede direct comparison, but the z-score can help, i.e. z = (value – mean)/SD
A z-score is simply how many SDs a value is away from the mean
  z-score for the person’s height: (180 – 160)/15 = 1.33
  z-score for the person’s weight: (107 – 95)/10 = 1.20

Continuous/discrete variables are (mostly) not normal

Problems with skewed distributions
[Figure: distribution with a single marker labeled “Mean & Median”]

Problems with skewed distributions
Positively skewed/Right skewed [figure markers, left to right: Median, Mean]
  Median is more or less the same
  IQR is more or less the same
  Mean becomes larger
  SD is inflated
Negatively skewed/Left skewed [figure markers, left to right: Mean, Median]
  Median is more or less the same
  IQR is more or less the same
  Mean becomes smaller
  SD is inflated
For variables with a skewed distribution, the median and the interquartile range are better representations of the central tendency and variability, respectively

Tell-tale signs of skewness
  When the mean and median of the variable are very different
  When you try to reconstruct the normal curve for the variable from its mean and SD, and a good part of the curve falls into an illogical or biologically implausible domain:
    In a study with an entry criterion of age ≥ 45, the mean ± SD of age is 52.0 ± 7.0
    A study on eating out reported that an average family makes dinner at home on 5.2 ± 2.0 nights/week
  When the authors reported only the median, with or without quartiles, for the variable

So what if it’s skewed? (Advanced teaser)
Skewness distorts the means, and hence distorts analyses that rely heavily on the sample means
  Ask if the skewness is relevant: for some variables in statistical analysis, we don’t care as much whether they are skewed or not
  Ask if the skewness is serious: some analyses are robust enough to tolerate some skewness
  Check if the authors employed solutions such as:
    Transformation (e.g. logarithmic, square root, etc.)
    Aggregating down the data type hierarchy
    Using analyses that have relaxed requirements on the sample’s distribution (e.g. non-parametric procedures)
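
The second tell-tale sign of skewness, reconstructing the normal curve from the mean and SD and checking whether it spills into an implausible domain, can be sketched in code. This is only an illustration under stated assumptions: the function names are hypothetical, the choice of ± 2 SD (the range expected to hold about 95% of a normally distributed variable) is one reasonable cut-off rather than a rule from the lecture, and the plausible limits (age ≥ 45 from the entry criterion; 0 to 7 nights in a week) are supplied by hand.

```python
def implied_normal_range(mean, sd, k=2):
    """Return (low, high) = mean -/+ k*SD.

    With k=2 this is the range that would hold about 95% of the values
    if the variable were normally distributed (an assumption).
    """
    return mean - k * sd, mean + k * sd


def outside_plausible(mean, sd, lower=None, upper=None, k=2):
    """Flag whether the implied range crosses a plausible lower/upper limit.

    Returns a pair (below_lower, above_upper); a True value hints at skewness.
    """
    low, high = implied_normal_range(mean, sd, k)
    below_lower = lower is not None and low < lower
    above_upper = upper is not None and high > upper
    return below_lower, above_upper


# Age in a study with entry criterion age >= 45, reported as 52.0 +/- 7.0
age_flags = outside_plausible(52.0, 7.0, lower=45)

# Nights per week with dinner made at home, reported as 5.2 +/- 2.0;
# a week has 7 nights, so values above 7 are impossible
dinner_flags = outside_plausible(5.2, 2.0, lower=0, upper=7)

print("age:", implied_normal_range(52.0, 7.0), age_flags)
print("dinner:", implied_normal_range(5.2, 2.0), dinner_flags)
```

For age, the implied lower end (52.0 – 14.0 = 38.0) falls below the entry criterion of 45; for dinners at home, the implied upper end (5.2 + 4.0 = 9.2) exceeds the 7 nights in a week. Both results suggest the normal assumption does not fit and the data are likely skewed.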