Chapter 3: Summarizing Data

Graphical Methods - 1 Variable
• After data are collected, they are sorted into categories/ranges of values so that each individual observation falls in exactly one category/range
– Numeric Responses: Break the “range” of values into nonoverlapping bins and count the number of units in each bin
– Categorical Responses: List all possible categories (with “Other” if needed), and count the number of units in each
• Pie Chart: Displays percent in each category/range
• Bar Chart: Displays frequency/percent per category
• Histogram: Displays frequency/percent per “range”

Constructing Pie Charts
• Select a small number of categories (say 5 or 6 at most) to avoid many narrow “slivers”
• If possible, arrange categories in ascending or descending order for categorical variables

Monthly Philly Rainfall 1825-1869 (1/100 inches)

Category  Range     Count
1         <100      17
2         100-199   78
3         200-299   132
4         300-399   115
5         400-499   86
6         500-599   55
7         600-699   27
8         700-799   17
9         800-899   6
10        900-999   3
11        >1000     4

[Figures: Monthly Philly Rainfall 1825-1869 (1/100 inches), charts by category 1-11]

Constructing Bar Charts
• Put frequencies on one axis (typically vertical, unless there are many categories) and categories on the other
• Draw rectangles over categories with height = frequency
• Leave spaces between categories

Constructing Histograms
• Used for numeric variables, so Class Intervals are needed
– Let Range = Largest - Smallest Measurement
– Break the range into (say) 5-20 intervals, depending on sample size
– Make the width of the subintervals a convenient unit, and choose “break points” so that no observations fall on them
– Obtain Class Frequencies, the number of observations in each subinterval
– Obtain Relative Frequencies, the proportion of observations in each subinterval
• Construct Histogram
– Draw bars over each subinterval with height representing class frequency or relative frequency (the shape will be the same)
– Leave no space between bars, to imply adjacency of class intervals

[Figure: Histogram of rain100; horizontal axis: bin limits 100 to 1200 and “More”; vertical axis: Frequency, 0 to 140]

Interpreting Histograms
• Probability: Heights of bars over the class intervals are proportional to the “chances” that an individual chosen at random would fall in the interval
• Unimodal: A histogram with a single major peak
• Bimodal: A histogram with two distinct peaks (often evidence of two distinct groups of units)
• Uniform: Interval heights are approximately equal
• Symmetric: Right and left portions have the same shape
• Right-Skewed: Right-hand side extends further
• Left-Skewed: Left-hand side extends further

Stem-and-Leaf Plots
• Simple, crude approach to obtaining the shape of a distribution without losing individual measurements to class intervals. Procedure:
– Split each measurement into 2 sets of digits (stem and leaf)
– List stems from smallest to largest
– Line up the corresponding leaves beside each stem, ordered from smallest to largest
– If too cramped/narrow, break stems into two groups: low with leaves 0-4 and high with leaves 5-9
– When numbers have many digits, trim off the right-most (less significant) digits. Leaves should always be a single digit.

Time Series Plots
• Many datasets represent a single variable measured on a single unit at different time points
• When measurements are made at equally spaced time points, the goal is often to describe temporal variation
• Annual measurements can reveal long-term trends
• Sub-annual (weekly, monthly, quarterly) measurements can reveal long-term trends as well as seasonal fluctuations
• Plots generally have the measurement on the vertical axis and the time period on the horizontal axis.
• Some plots include bars around points to represent fluctuations within that time period

[Figure: Philly Rainfall 1/1825-12/1869; vertical axis: Rainfall (1/100th inches), 0 to 2000; horizontal axis: Month]

Numerical Descriptive Measures
• Numeric summaries of a set of measurements
• Measures of Central Tendency describe the “location” or center of a set of measurements
• Measures of Variability describe the “spread” or dispersion of a set of measurements
• Parameters: Numeric descriptive measures based on Populations of measurements
• Statistics: Numeric descriptive measures based on Samples of measurements

Measures of Central Tendency - I
• Mode: Most often occurring outcome (typically only of interest for variables taking on only “discrete” values)
• Median: Middle value when measurements are ordered from smallest to largest
• Mean: Sum of all measurements, divided by the total number of measurements (equal distribution of the total)

Population (N elements): $\mu = \frac{\sum_{i=1}^{N} y_i}{N}$
Sample (n elements): $\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$

In practice, we only observe the sample, and use $\bar{y}$ to estimate $\mu$.

Example - Philadelphia Rainfall
N = 540 months (treating as a population)
$\sum_{i=1}^{540} y_i = 198547 \quad\Rightarrow\quad \mu = \frac{198547}{540} = 367.68$
Ordered amounts: $y_{(270)} = 339$, $y_{(271)} = 341$ $\Rightarrow$ $M = 340$
Note: The mean is higher than the median because a few very large amounts were observed.

Measures of Central Tendency - II
• Outlier: Individual measurement(s) falling far away from the others. Can have a large effect on the mean, but not on the median
• Trimmed Mean (TM): Mean that is based on the center measurements (deleting extreme measurements).
• Mode: For continuous (smooth) distributions, the mode is the value corresponding to the peak of the frequency curve
• Skewness: Shape of the distribution:
– Mound-Shaped Distributions: Mode ≈ Median ≈ Mean ≈ TM
– Right-Skewed Distributions: Mode < Median < TM < Mean
– Left-Skewed Distributions: Mean < TM < Median < Mode

Measures of Variability - I
• Variability: Magnitude of dispersion in data.
• Range: Difference between the largest and smallest measurements in a set.
• pth-Percentile: Value that has at most p% of measurements below it and at most (100-p)% above it (0<p<100)
– Lower Quartile = 25th Percentile (Q1)
– Median = 50th Percentile (Q2)
– Upper Quartile = 75th Percentile (Q3)
• Interquartile Range: Difference between the upper and lower quartiles (measures the amount of spread in the middle 50% of ordered measurements). IQR = Q3-Q1

Quantile Plot
• Quantile: Q(u) ≡ Number that divides a dataset such that the fraction of observations below Q(u) is u and the fraction above Q(u) is 1-u
• Quantile plot: Plot of Q(u) on the vertical axis versus u on the horizontal axis
– Place a scale on the horizontal axis ranging over 0 to 1
– Order the data: y(1) ≤ y(2) ≤ … ≤ y(n), and scale the vertical axis to include the full range of y-values
– Plot y(i) versus ui = (i – 0.5)/n for i = 1,2,…,n

i      u_i        y_(i)
1      0.000926   19
2      0.002778   25
3      0.00463    26
4      0.006481   55
…      …          …
537    0.993519   1005
538    0.99537    1102
539    0.997222   1180
540    0.999074   1582

[Figure: Quantile Plot, Q(u) versus u; vertical axis: Q(u) = y_(i), 0 to 1800; horizontal axis: u, 0 to 1]

Measures of Variability - II
• Deviation: Distance between an individual measurement and the group mean: $y_i - \bar{y}$
• Variance: “Average” squared deviation
• Standard Deviation: Square root of the variance (in the data’s units)

Population (N elements): Variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}$   Std. Dev.: $\sigma = \sqrt{\sigma^2}$
Sample (n elements): Variance: $s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}$   Std. Dev.: $s = \sqrt{s^2}$

Empirical rule (measurements with mound-shaped histogram)
• Approximately 68% of measurements lie within 1 SD of the mean
• Approximately 95% of measurements lie within 2 SD of the mean
• Virtually all measurements lie within 3 SD of the mean

Example - Philadelphia Rainfall (Population)
25th Percentile: 232.75
75th Percentile: 468
Inter-Quartile Range: IQR = 468 - 232.75 = 235.25
$\sum_{i=1}^{540} (y_i - \mu)^2 = 19822752 \quad\Rightarrow\quad \sigma^2 = \frac{19822752}{540} = 36708.8 \quad\Rightarrow\quad \sigma = \sqrt{36708.8} = 191.6$
$\mu \pm \sigma$: 367.7 ± 191.6 = (176.1, 559.3)
$\mu \pm 2\sigma$: 367.7 ± 383.2 = (0*, 750.9)
* The computed lower limit (-15.5) is truncated at 0.
Note: 383 (71%) months lie within 1σ of μ and 518 (96%) within 2σ

Other Measures of Variation
• Median Absolute Deviation (MAD): Median of the absolute values of the differences between the observed data values and the sample median, divided by 0.6745 (due to properties of the normal distribution, this provides an estimate of σ)
• Coefficient of Variation (CV): Standard deviation as a fraction of the mean (assuming μ ≠ 0). Often reported as a percentage: $CV = 100 \left( \frac{s}{\bar{y}} \right) \%$

MAD = 177.9096   CV(%) = 52.15765

Boxplots
• Graph displaying the spread of a set of measurements, highlighting quartiles and outliers.
• Constructing a boxplot:
– Draw a box with top at Q3, bottom at Q1, and a line crossing at the median (Q2). Height of the box is IQR = Q3 - Q1
– Compute “lower inner fence” = Q1-1.5(IQR) = LIF
– Compute “upper inner fence” = Q3+1.5(IQR) = UIF
– Compute “lower outer fence” = Q1-3.0(IQR) = LOF
– Compute “upper outer fence” = Q3+3.0(IQR) = UOF
– Draw a line from Q3 to min(UIF, largest y value). Place ‘*’ for any y values between UIF and UOF, ‘o’ for any above UOF
– Draw a line from Q1 to max(LIF, smallest y value).
Place ‘*’ for any y values between LIF and LOF, ‘o’ for any below LOF

[Figure: BoxPlot of the rainfall data; axis from 0 to 2000; annotations: UIF = 468+1.5(235.25) = 820.875, UOF = 468+3(235.25) = 1173.75]

Summarizing Data of More than One Variable
• Contingency Table: Cross-tabulation of units based on measurements of two qualitative variables simultaneously
• Stacked Bar Graph: Bar chart with one variable represented on the horizontal axis, second variable as subcategories within bars
• Cluster Bar Graph: Bar chart with one variable forming “major groupings” on the horizontal axis, second variable used to make side-by-side comparisons within major groupings (displays all combinations in a factorial experiment)
• Scatterplot: Plot with quantitative variables y and x plotted against each other for each unit
• Side-by-Side Boxplot: Compares distributions by groups

Example - Ginkgo and Acetazolamide for Acute Mountain Sickness (AMS) Among Himalayan Trekkers

Contingency Table (Counts)
Outcome   Placebo  Acet   Ginkgo  Acc+Gi  Total
AMS       40       14     43      18      115
No AMS    79       104    81      108     372
Total     119      118    124     126     487

Percent Outcome by Treatment
Outcome   Placebo  Acet   Ginkgo  Acc+Gi
AMS       33.61    11.86  34.68   14.29
No AMS    66.39    88.14  65.32   85.71
Total     100      100    100     100

[Figure: Stacked Bar Graph of AMS Incidence (Percent); horizontal axis: Treatment (Placebo, Acet, Ginkgo, Acc+Gi); vertical axis: 0% to 100%; segments: AMS, No AMS]

[Figure: Cluster Bar Graph of AMS Incidence (Counts); horizontal axis: Treatment; vertical axis: Frequency, 0 to 120; bars: AMS, No AMS]

[Figure: 3-D Barchart of Incidence of AMS; axes: Treatment, Outcome (AMS, No AMS), Percent within Treatment (0 to 100)]

Scatterplots
• Identify the explanatory and response variables of interest, and label them as x and y
• Obtain a set of individuals and observe the pair (xi , yi) for each individual. There will be n pairs.
• Statistical convention has the response variable (y) placed on the vertical (up/down) axis and the explanatory variable (x) placed on the horizontal (left/right) axis.
(Note: economists reverse the axes in price/quantity demand plots)
• Plot the n pairs of points (x,y) on the graph

France August, 2003 Heat Wave Deaths
• Individuals: 13 cities in France
• Response: Excess Deaths (%), Aug 1-19, 2003 vs 1999-2002
• Explanatory Variable: Change in Mean Temp in period (°C)
• Data:

City        Dth03   Dth9902   %chng (y)   Degchg (x)
Lille       200     192.3     4           4.0
Marseilles  571     456.8     25          4.3
Grenoble    148     115.6     28          6.3
Rennes      156     114.7     36          5.6
Toulouse    315     231.6     36          6.6
Bordeaux    318     222.4     43          6.2
Strasbourg  253     167.5     51          5.9
Nice        341     222.9     53          4.3
Poitiers    184     102.8     79          7.3
Lyon        447     248.3     80          6.8
Le Mans     204     112.1     82          7.0
Dijon       168     87.0      93          7.4
Paris       1854    766.1     142         6.7

[Figure: 2003 France Heat Wave Mortality; scatterplot of Excess Mortality (%) (vertical axis, 0 to 160) versus Change in Mean Temp (Celsius) (horizontal axis, 3 to 8); one point is labeled “Possible Outlier”]

Example - Pharmacodynamics of LSD
• Response (y) - Math score (mean among 5 volunteers)
• Explanatory (x) - LSD tissue concentration (mean of 5 volunteers)
• Raw Data and scatterplot of Score vs LSD concentration:

LSD Conc (x)   Score (y)
1.17           78.93
2.97           58.20
3.26           67.47
4.69           37.47
5.83           45.65
6.00           32.92
6.41           29.97

[Figure: scatterplot of SCORE (vertical axis, 20 to 80) versus LSD_CONC (horizontal axis, 1 to 7)]

Source: Wagner, et al (1968)

Manufacturer Production/Cost Relation
X = Amount Produced   Y = Total Cost   n = 48 months (not in order)

Month  Prod   Cost     Month  Prod   Cost     Month  Prod   Cost
1      46.75  92.64    17     36.54  91.56    33     32.26  66.71
2      42.18  88.81    18     37.03  84.12    34     30.97  64.37
3      41.86  86.44    19     36.60  81.22    35     28.20  56.09
4      43.29  88.80    20     37.58  83.35    36     24.58  50.25
5      42.12  86.38    21     36.48  82.29    37     20.25  43.65
6      41.78  89.87    22     38.25  80.92    38     17.09  38.01
7      41.47  88.53    23     37.26  76.92    39     14.35  31.40
8      42.21  91.11    24     38.59  78.35    40     13.11  29.45
9      41.03  81.22    25     40.89  74.57    41     9.50   29.02
10     39.84  83.72    26     37.66  71.60    42     9.74   19.05
11     39.15  84.54    27     38.79  65.64    43     9.34   20.36
12     39.20  85.66    28     38.78  62.09    44     7.51   17.68
13     39.52  85.87    29     36.70  61.66    45     8.35   19.23
14     38.05  85.23    30     35.10  77.14    46     6.25   14.92
15     39.16  87.75    31     33.75  75.47    47     5.45   11.44
16     38.59  92.62    32     34.29  70.37    48     3.79   12.69

[Figure: Production (x) / Cost (y) Relation; scatterplot of Total Cost (vertical axis, 0 to 100) versus Total Production (horizontal axis, 0 to 50)]
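
Worked check - Philadelphia Rainfall summaries
The numerical summaries quoted for the Philadelphia rainfall data earlier in the chapter can be re-derived from the reported totals. The sketch below is only an illustration, not part of the original slides: the variable and function names are made up for this example, and the raw 540 monthly values are not available here, so only the published totals, quartiles, and class counts are used. One assumption is worth flagging: the slide's CV of 52.15765 appears to be reproduced only when the standard deviation uses the n - 1 divisor, so the code computes both versions.

```python
import math

# Values copied from the Philadelphia rainfall slides (units: 1/100 inch).
N = 540                    # months, treated as a population
TOTAL = 198547             # sum of all y_i
SUM_SQ_DEV = 19822752      # sum of (y_i - mu)^2
Q1, Q3 = 232.75, 468.0     # reported 25th and 75th percentiles
BIN_COUNTS = [17, 78, 132, 115, 86, 55, 27, 17, 6, 3, 4]  # class frequencies


def relative_frequencies(counts):
    """Proportion of observations falling in each class interval."""
    n = sum(counts)
    return [c / n for c in counts]


def quantile_positions(n):
    """Plotting positions u_i = (i - 0.5)/n used in a quantile plot."""
    return [(i - 0.5) / n for i in range(1, n + 1)]


def fences(q1, q3):
    """Inner and outer boxplot fences computed from the quartiles."""
    iqr = q3 - q1
    return {
        "LOF": q1 - 3.0 * iqr,
        "LIF": q1 - 1.5 * iqr,
        "UIF": q3 + 1.5 * iqr,
        "UOF": q3 + 3.0 * iqr,
    }


mu = TOTAL / N                         # population mean
sigma2 = SUM_SQ_DEV / N                # population variance (divisor N)
sigma = math.sqrt(sigma2)              # population standard deviation
s = math.sqrt(SUM_SQ_DEV / (N - 1))    # SD with divisor N - 1 (assumed for the CV)
cv_percent = 100 * s / mu              # coefficient of variation, in percent

within_1sd = (mu - sigma, mu + sigma)
# Rainfall cannot be negative, so the lower 2-SD limit is truncated at 0.
within_2sd = (max(0.0, mu - 2 * sigma), mu + 2 * sigma)

rel_freq = relative_frequencies(BIN_COUNTS)
u = quantile_positions(N)
f = fences(Q1, Q3)

if __name__ == "__main__":
    print(f"mean = {mu:.2f}, variance = {sigma2:.1f}, SD = {sigma:.1f}")
    print(f"CV = {cv_percent:.3f}%  IQR = {Q3 - Q1}")
    print(f"1 SD: {within_1sd}, 2 SD: {within_2sd}")
    print(f"fences: {f}")
    print(f"first and last u_i: {u[0]:.6f}, {u[-1]:.6f}")
    print("relative frequencies:", [round(r, 3) for r in rel_freq])
```

Because the class counts sum to 540, the relative frequencies sum to 1, and a histogram drawn from them has the same shape as one drawn from the raw counts.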