Descriptive Statistics • Tabular and Graphical Displays – Frequency Distribution - List of intervals of values for a variable, and the number of occurrences per interval – Relative Frequency - Proportion (often reported as a percentage) of observations falling in the interval – Histogram/Bar Chart - Graphical representation of a Relative Frequency distribution – Stem and Leaf Plot - Horizontal tabular display of data, based on 2 digits (stem/leaf) Constructing Pie Charts • Select a small number of categories (say 5 or 6 at most) to avoid many narrow “slivers” • If possible, arrange categories in ascending or descending order for categorical variables Monthly Philly Rainfall 1825-1869 (1/100 in) Philly Monthy Rainfall 1825-1869 (1/100 inches) Category 1 2 3 4 5 6 7 8 9 10 11 1 2 3 4 5 6 7 8 9 10 11 Range <100 100-199 200-299 300-399 400-499 500-599 600-699 700-799 800-899 900-999 >1000 Count 17 78 132 115 86 55 27 17 6 3 4 Constructing Bar Charts • Put frequencies on one axis (typically vertical, unless many categories) and categories on other • Draw rectangles over categories with height=frequency • Leave spaces between categories Constructing Histograms • Used for numeric variables, so need Class Intervals – Let Range = Largest - Smallest Measurement – Break range into (say) 5-20 intervals depending on sample size – Make the width of the subintervals a convenient unit, and make “break points” so that no observations fall on them – Obtain Class Frequencies, the number in each subinterval – Obtain Relative Frequencies, proportion in each subinterval • Construct Histogram – Draw bars over each subinterval with height representing class frequency or relative frequency (shape will be the same) – Leave no space between bars to imply adjacency of class intervals Histogram 140 100 80 60 40 20 rain100 e M or 00 11 0 90 0 70 0 50 0 30 0 0 10 Frequency 120 100 200 300 400 500 600 700 800 900 1000 1100 1200 More Interpreting Histograms • Probability: Heights of bars over the class intervals are proportional to the “chances” an individual chosen at random would fall in the interval • Unimodal: A histogram with a single major peak • Bimodal: Histogram with two distinct peaks (often evidence of two distinct groups of units) • Uniform: Interval heights are approximately equal • Symmetric: Right and Left portions are same shape • Right-Skewed: Right-hand side extends further • Left-Skewed: Left-hand side extends further Stem-and-Leaf Plots • Simple, crude approach to obtaining shape of distribution without losing individual measurements to class intervals. Procedure: – Split each measurement into 2 sets of digits (stem and leaf) – List stems from smallest to largest – Line corresponding leaves aside stems from smallest to largest – If too cramped/narrow, break stems into two groups: low with leaves 0-4 and high with leaves 5-9 – When numbers have many digits, trim off right-most (less significant) digits. Leaves should always be a single digit. Comparing Groups • • • • Side-by-side bar charts 3 dimensional histograms Back-to-back stem and leaf plots Goal: Compare 2 (or more) groups wrt variable(s) being measured • Do measurements tend to differ among groups? Summarizing Data of More than One Variable • Contingency Table: Cross-tabulation of units based on measurements of two qualitative variables simultaneously • Stacked Bar Graph: Bar chart with one variable represented on the horizontal axis, second variable as subcategories within bars • Cluster Bar Graph: Bar chart with one variable forming “major groupings” on horizontal axis, second variable used to make side-by-side comparisons within major groupings (displays all combinations in factorial expt) • Scatterplot: Plot with quantitaive variables y and x plotted against each other for each unit • Side-by-Side Boxplot: Compares distributions by groups Example - Ginkgo and Acetazolamide for Acute Mountain Syndrome Among Himalayan Trekkers Contingency Table (Counts) Percent Outcome by Treatment Placebo Acet Ginkgo Acc+Gi Total Placebo Acet Ginkgo Acc+Gi AMS 40 14 43 18 115 No AMS 79 104 81 108 372 Total 119 118 124 126 487 AMS 33.61 11.86 34.68 14.29 No AMS 66.39 88.14 65.32 85.71 Total 100 100 100 100 Stacked Bar Graph of AMS Incidence (Percent) 100% 90% 80% 70% 60% No AMS 50% AMS 40% 30% 20% 10% 0% Placebo Acet Ginkgo Treatment Acc+Gi Cluster Bar Graph of AMS Incidence (Counts) 120 100 Frequency 80 AMS 60 No AMS 40 20 0 Placebo Acet Ginkgo Treatment Acc+Gi 3-D Barchart of Incidence of AMS 100.00 90.00 80.00 70.00 60.00 Percent within Treatment 50.00 40.00 30.00 20.00 10.00 No AMS 0.00 Placebo AMS Acet Ginkgo Treatment Acc+Gi Outcome Sample & Population Distributions • Distributions of Samples and Populations- As samples get larger, the sample distribution gets smoother and looks more like the population distribution – U-shaped - Measurements tend to be large or small, fewer in middle range of values – Bell-shaped - Measurements tend to cluster around the middle with few extremes (symmetric) – Skewed Right - Few extreme large values – Skewed Left - Few extreme small values Measures of Central Tendency • Mean - Sum of all measurements divided by the number of observations (even distribution of outcomes among cases). Can be highly influenced by extreme values. • Notation: Sample Measurements labeled Y1,...,Yn Y1 Yn Yi Y n n Median, Percentiles, Mode • Median - Middle measurement after data have been ordered from smallest to largest. Appropriate for interval and ordinal scales • Pth percentile - Value where P% of measurements fall below and (100-P)% lie above. Lower quartile(25th), Median(50th), Upper quartile(75th) often reported • Mode - Most frequently occurring outcome. Typically reported for ordinal and nominal data. Measures of Variation • Measures of how similar or different individual’s measurements are – Range -- Largest-Smallest observation – Deviation -- Difference between ith individual’s outcome and the sample mean: Yi Y – Variance of n observations Y1,...,Yn is the “average” squared deviation: s2 2 ( Y Y ) i n 1 (Y1 Y ) 2 (Y2 Y ) 2 (Yn Y ) 2 n 1 Measures of Variation • Standard Deviation - Positive square root of the variance (measure in original units): s s2 2 ( Y Y ) i n 1 • Properties of the standard deviation: • s 0, and only equals 0 if all observations are equal • s increases with the amount of variation around the mean • Division by n-1 (not n) is due to technical reasons (later) • s depends on the units of the data (e.g. $1000s vs $) Empirical Rule • If the histogram of the data is approximately bell-shaped, then: – Approximately 68% of measurements lie within 1 standard deviation of the mean. – Approximately 95% of measurements lie within 2 standard deviations of the mean. – Virtually all of the measurements lie within 3 standard deviations of the mean. Other Measures and Plots • Interquartile Range (IQR)-- 75th%ile - 25th%ile (measures the spread in the middle 50% of data) • Box Plots - Display a box containing middle 50% of measurements with line at median and lines extending from box. Breaks data into four quartiles • Outliers - Observations falling more than 1.5IQR above (below) upper (lower) quartile Dependent and Independent Variables • Dependent variables are outcomes of interest to investigators. Also referred to as Responses or Endpoints • Independent variables are Factors that are often hypothesized to effect the outcomes (levels of dependent variables). Also referred to as Predictor or Explanatory Variables • Research ??? Does I.V. D.V. Example - Clinical Trials of Cialis • Clinical trials conducted worldwide to study efficacy and safety of Cialis (Tadalafil) for ED • Patients randomized to Placebo, 10mg, and 20mg • Co-Primary outcomes: – Change from baseline in erectile dysfunction domain if the International Index of Erectile Dysfunction (Numeric) – Response to: “Were you able to insert your P… into your partner’s V…?” (Nominal: Yes/No) – Response to: “Did your erection last long enough for you to have succesful intercourse?” (Nominal: Yes/No) Source: Carson, et al. (2004). Example - Clinical Trials of Cialis • Population: All adult males suffering from erectile dysfunction • Sample: 2102 men with mild-to-severe ED in 11 randomized clinical trials • Dependent Variable(s): Co-primary outcomes listed on previous slide • Independent Variable: Cialis Dose: (0, 10, 20 mg) • Research Questions: Does use of Cialis improve erectile function? Contingency Tables • Tables representing all combinations of levels of explanatory and response variables • Numbers in table represent Counts of the number of cases in each cell • Row and column totals are called Marginal counts 2x2 Tables - Notation Group 1 Outcome Present X1 Outcome Absent n1-X1 Group Total n1 Group 2 X2 n2-X2 n2 Outcome Total X1+X2 (n1+n2)(X1+X2) n1+n2 Example - Firm Type/Product Quality Not Integrated Vertically Integrated Outcome Total High Quality Low Quality Group Total 33 55 88 5 79 84 38 134 172 • Groups: Not Integrated (Weave only) vs Vertically integrated (Spin and Weave) Cotton Textile Producers • Outcomes: High Quality (High Count) vs Low Quality (Count) Source: Temin (1988) Scatterplots • Identify the explanatory and response variables of interest, and label them as x and y • Obtain a set of individuals and observe the pairs (xi , yi) for each pair. There will be n pairs. • Statistical convention has the response variable (y) placed on the vertical (up/down) axis and the explanatory variable (x) placed on the horizontal (left/right) axis. (Note: economists reverse axes in price/quantity demand plots) • Plot the n pairs of points (x,y) on the graph France August,2003 Heat Wave Deaths • • • • Individuals: 13 cities in France Response: Excess Deaths(%) Aug1/19,2003 vs 1999-2002 Explanatory Variable: Change in Mean Temp in period (C) Data: City Dth03 Dth9902 %chng (y) Degchg(x) Little Marseilles Grenoble Rennes Toulouse Bordeaux Strasbourg Nice Poitiers Lyon Le Mans Dijon Paris 200 571 148 156 315 318 253 341 184 447 204 168 1854 192.3 456.8 115.6 114.7 231.6 222.4 167.5 222.9 102.8 248.3 112.1 87 766.1 4 25 28 36 36 43 51 53 79 80 82 93 142 4 4.3 6.3 5.6 6.6 6.2 5.9 4.3 7.3 6.8 7 7.4 6.7 France August,2003 Heat Wave Deaths 2003 France Heat Wave Mortality 160 140 Excess Mortality (%) 120 100 80 60 40 20 0 3 3.5 4 4.5 5 5.5 6 Change in Mean Temp (Celsius) 6.5 7 7.5 8 Sample Statistics/Population Parameters • Sample Mean and Standard Deviations are most commonly reported summaries of sample data. They are random variables since they will change from one sample to another. • Population Mean (m) and Standard Deviation (s) computed from a population of measurements are fixed (unknown in practice) values called parameters. Example 1.3 - Grapefruit Juice Study crcl 38 66 74 99 80 64 80 120 To import an EXCEL file, click on: FILE OPEN DATA then change FILES OF TYPE to EXCEL (.xls) To import a TEXT or DATA file, click on: FILE OPEN DATA then change FILES OF TYPE to TEXT (.txt) or DATA (.dat) You will be prompted through a series of dialog boxes to import dataset Descriptive Statistics-Numeric Data • After Importing your dataset, and providing names to variables, click on: • ANALYZE DESCRIPTIVE STATISTICS DESCRIPTIVES • Choose any variables to be analyzed and place them in n box on right yi n i 1 • Options include: Mean : y Sum : yi n i 1 y y n Std. deviation : S S.E. Mean : S n i 1 2 i n 1 Variance : S 2 Example 1.3 - Grapefruit Juice Study e t d N e u i m a m i a t t t t t t t i E i i i i i i C 8 8 0 1 3 3 1 1 V 8 Descriptive Statistics-General Data • After Importing your dataset, and providing names to variables, click on: • ANALYZE DESCRIPTIVE STATISTICS FREQUENCIES • Choose any variables to be analyzed and place them in box on right • Options include (For Categorical Variables): – Frequency Tables – Pie Charts, Bar Charts • Options include (For Numeric Variables) – Frequency Tables (Useful for discrete data) – Measures of Central Tendency, Dispersion, Percentiles – Pie Charts, Histograms Example 1.4 - Smoking Status S u P r u c c V N 0 9 9 9 Q 3 3 3 2 Q 9 6 6 8 C 2 4 4 2 O 3 8 8 0 T 7 0 0 Vertical Bar Charts and Pie Charts • After Importing your dataset, and providing names to variables, click on: • GRAPHS BAR… SIMPLE (Summaries for Groups of Cases) DEFINE • Bars Represent N of Cases (or % of Cases) • Put the variable of interest as the CATEGORY AXIS • GRAPHS PIE… (Summaries for Groups of Cases) DEFINE • Slices Represent N of Cases (or % of Cases) • Put the variable of interest as the DEFINE SLICES BY Example 1.5 - Antibiotic Study 80 60 40 Count 20 5 4 0 1 OUTCOME 2 3 4 5 3 1 2 Histograms • After Importing your dataset, and providing names to variables, click on: • GRAPHS HISTOGRAM • Select Variable to be plotted • Click on DISPLAY NORMAL CURVE if you want a normal curve superimposed (see Chapter 4). Example 1.6 - Drug Approval Times 30 20 10 Std. Dev = 20.97 Mean = 32.1 N = 175.00 0 0 0. 12 0 0. 11 0 0. 10 .0 90 .0 80 .0 70 .0 60 .0 50 .0 40 .0 30 .0 20 .0 10 0 0. MONTHS Side-by-Side Bar Charts • After Importing your dataset, and providing names to variables, click on: • GRAPHS BAR… Clustered (Summaries for Groups of Cases) DEFINE • Bars Represent N of Cases (or % of Cases) • CATEGORY AXIS: Variable that represents groups to be compared (independent variable) • DEFINE CLUSTERS BY: Variable that represents outcomes of interest (dependent variable) Example 1.7 - Streptomycin Study 30 20 OUTCOME 1 2 10 3 Count 4 5 0 6 1 TRT 2