ORF 245 Fundamentals of Statistics
Chapter 2: Displaying and Summarizing a Single Variable

Robert Vanderbei
Fall 2015
Slides last edited on June 22, 2015
http://www.princeton.edu/~rvdb


Displaying and Summarizing Categorical Variables


Example: Mortality Data, 15 to 24 Year Olds

    Cause                       Count    Percentage
    Motor vehicle accidents     6,948      23.47%
    All other accidents         5,048      17.05%
    Intentional self-harm       4,688      15.84%
    Homicide                    4,508      15.23%
    Cancer                      1,609       5.43%
    Diseases of heart             948       3.20%
    Congenital malformations      429       1.45%
    All other causes            5,391      18.21%
    Total                      29,605     100.00%

Numbers in the Count column are called frequency data.
Numbers in the Percentage column are called relative frequency data.


Bar Chart

[Figure: bar chart of the eight counts; horizontal axis 1 to 8, vertical axis 0 to 7000.]

Data File: http://www.princeton.edu/~rvdb/245/data/Table2.2/data.txt

    "Motor vehicle accidents"     6948
    "All other accidents"         5048
    "Intentional self-harm"       4688
    "Homicide"                    4508
    "Cancer"                      1609
    "Diseases of heart"            948
    "Congenital malformations"     429
    "All other causes"            5391

Matlab:

    fileID = fopen('data.txt');
    C = textscan(fileID, '%q %f');   % %q reads a quoted string, %f a number
    fclose(fileID);                  % close the file once it has been read
    cause = C{1};
    count = C{2};
    bar(count);


Bar Chart

[Figure: the same bar chart with each bar labeled by its cause; vertical axis 0 to 7000.]

The data file is the same as on the previous slide.

Matlab:

    fileID = fopen('data.txt');
    C = textscan(fileID, '%q %f');
    fclose(fileID);
    cause = C{1};
    count = C{2};
    bar(1:8,diag(count),'stacked');
    set(gca, 'XTick', 1:8, 'XTickLabel', cause);


Pie Chart

[Figure: pie chart of the eight counts, each slice labeled by its cause.]

Matlab:

    fileID = fopen('data.txt');
    C = textscan(fileID, '%q %f');
    fclose(fileID);
    cause = C{1};
    count = C{2};
    pie(count,cause);


Which is Better: Pie or Bar?

[Figure: three pie charts labeled A, B, and C, each with slices 1 to 5, and three bar charts with categories 1 to 5 and a vertical axis from 0 to 25.]


Displaying and Summarizing Quantitative Variables


Example: Ozone, Mauna Loa, Hawaii

    STN  YEAR  MON  DAY  HR  O3(PPB)
    31   2014  01   01   00  35.88
    31   2014  01   01   01  34.84
    31   2014  01   01   02  34.57
    31   2014  01   01   03  33.40
    31   2014  01   01   04  33.14
    31   2014  01   01   05  38.34
    31   2014  01   01   06  39.96
    31   2014  01   01   07  37.17
    31   2014  01   01   08  40.31
    31   2014  01   01   09  39.30
    31   2014  01   01   10  35.05
    31   2014  01   01   11  36.02
    31   2014  01   01   12  35.19
    31   2014  01   01   13  37.17
    31   2014  01   01   14  42.61
    31   2014  01   01   15  43.34
    31   2014  01   01   16  42.29
    31   2014  01   01   17  42.63
    31   2014  01   01   18  42.09
    ...

http://www.princeton.edu/~rvdb/245/data/Figure2.4/mlo_O3_6m_hour_2014.dat

The file contains 8178 lines of data in total.


Histogram: Every Hour

[Figure: histogram; horizontal axis: Hourly Ozone Concentrations (ppb), 0 to 100; vertical axis: # of hours, 0 to 500.]

Matlab:

    load -ascii 'mlo_O3_6m_hour_2014.dat';       % anything after the "percent" is ignored by Matlab
    ozone = mlo_O3_6m_hour_2014(:,6);            % colon means all rows, 6 means the 6th column
    histogram(ozone);
    xlabel('Hourly Ozone Concentrations (ppb)'); % always wise to label the axes
    ylabel('# of hours');


Histogram: Daily at 6am

[Figure: histogram; horizontal axis: 6 AM Ozone Concentration (ppb), 0 to 90; vertical axis: # of days, 0 to 60.]

Unix:

    grep "06 " mlo_O3_6m_hour_2014.dat > mlo_O3_6m_6am_2014.dat

Matlab:

    load -ascii 'mlo_O3_6m_6am_2014.dat';
    ozone = mlo_O3_6m_6am_2014(:,6);
    histogram(ozone);
    xlabel('6 AM Ozone Concentration (ppb)');
    ylabel('# of days');


Histogram: Every 24th Line

[Figure: histogram; horizontal axis: Hourly Ozone Concentrations (ppb), 0 to 100; vertical axis: # of hours, 0 to 60.]

Matlab:

    load -ascii 'mlo_O3_6m_hour_2014.dat';
    ozone = mlo_O3_6m_hour_2014(7:24:end, 6);    % 7:24:end means rows 7, 31, 55, 79, ...
    histogram(ozone);
    xlabel('6 AM Ozone Concentration (ppb)');
    ylabel('# of days');

Why's it different?
Missing data!


Modality

[Figure: two histograms of Hourly Ozone Concentrations (ppb) against # of hours; the left panel is titled "Bimodal", the right panel "Unimodal".]

A distribution with three or more humps is called multimodal.


Skewness

[Figure: two histograms; the left panel is titled "Symmetric", the right panel "Skewed". Two arrows (⇑) below the horizontal axis are captioned "Outliers?".]


Centrality

[Figure: histogram; horizontal axis: 6 AM Ozone Concentration (ppb), 0 to 90; vertical axis: # of days, 0 to 14.]

    load -ascii 'mlo_O3_6m_6am_2014.dat';
    x = mlo_O3_6m_6am_2014(:,6);   % let's call our data vector x

Mode: top of the hump

    mode(round(x))

Answer: 49

Median: half to the left, half to the right

    % brute force
    x_sorted = sort(x);
    [n,m] = size(x);
    (x_sorted(floor((n+1)/2)) ...
        + x_sorted(ceil((n+1)/2)))/2
    % Matlab's builtin function
    median(x)

Answer: 39.94

Mean: average value, $\bar{x} = \frac{1}{n} \sum_{j=1}^{n} x_j$

    % brute force
    [n,m] = size(x);
    sum(x)/n
    % Matlab's builtin function
    mean(x)

Answer: 40.7631


Which is most sensitive to outliers?

Mean, median, or mode?


Spread

[Figure: histogram; horizontal axis: 6 AM Ozone Concentration (ppb), 0 to 90; vertical axis: # of days, 0 to 14.]

Range: max − min

    % brute force
    max(x)-min(x)
    % builtin function
    range(x)

Answer: 84.12

Inter-Quartile Range (IQR): Q3 − Q1

    % brute force
    x_sorted = sort(x);
    [n,m] = size(x);
    x_sorted(round(0.75*n)) ...
        - x_sorted(round(0.25*n))
    % Matlab's builtin function
    iqr(x)

Answer: 19.1725

Standard Deviation: $s = \sqrt{\frac{1}{n-1} \sum_{j=1}^{n} (x_j - \bar{x})^2}$

    % brute force
    [n,m] = size(x);
    sqrt(sum((x-mean(x)).^2)/(n-1))
    % Matlab's builtin function
    std(x)

Answer: 14.3677

Another option: $\frac{1}{n} \sum_{j=1}^{n} |x_j - \bar{x}| \implies 11.4826$


Bell Curve, aka Normal or Gaussian Distribution

[Figure: histogram with a bell-shaped curve drawn over it; horizontal axis: Hourly Ozone Concentrations (ppb), 0 to 100; vertical axis: # of hours, 0 to 60.]

The curve is called the Bell Curve (formula later).
The peak of the curve is at $\bar{x}$.
The inflection points are at $\bar{x} \pm s$.
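
The two claims about the bell curve (peak at $\bar{x}$, inflection points at $\bar{x} \pm s$) can be checked numerically. The following is only a sketch under stated assumptions: it uses the 19 hourly readings listed in the ozone table rather than the full data file, it writes out the standard normal density by hand (the slides defer the formula), and the names f, t, peak, and inflect are introduced here for illustration.

```matlab
% The 19 hourly readings shown in the ozone table (a small subset of the
% data, used here only so that the sketch runs without the data file).
x = [35.88 34.84 34.57 33.40 33.14 38.34 39.96 37.17 40.31 39.30 ...
     35.05 36.02 35.19 37.17 42.61 43.34 42.29 42.63 42.09]';

xbar = mean(x);   % sample mean
s    = std(x);    % sample standard deviation (n-1 in the denominator)

% Normal density with mean xbar and standard deviation s
% (assumed formula; the slides only say "formula later").
f = @(t) exp(-(t - xbar).^2 / (2*s^2)) / (s*sqrt(2*pi));

% Evaluate the density on a fine grid centered at xbar.
t = linspace(xbar - 4*s, xbar + 4*s, 801);
y = f(t);

% Peak: the grid point where the density is largest.
[~, k] = max(y);
peak = t(k);

% Inflection points: where the discrete second difference changes sign.
% d2(i) approximates the second derivative at t(i+1).
d2 = diff(y, 2);
i = find(d2(1:end-1) .* d2(2:end) < 0);
inflect = (t(i+1) + t(i+2)) / 2;

% Overlay the curve on the histogram. The density is scaled by
% (number of points) * (bin width) so that it is comparable to counts.
h = histogram(x);
hold on;
plot(t, numel(x) * h.BinWidth * y, 'LineWidth', 2);
plot(xbar + [-s s], numel(x) * h.BinWidth * f(xbar + [-s s]), 'o');
hold off;
xlabel('Hourly Ozone Concentrations (ppb)');
ylabel('# of hours');
```

Here peak should agree with xbar, and the two entries of inflect should agree with xbar - s and xbar + s, up to the grid spacing. With only 19 points the histogram is coarse; loading the full data file as on the earlier slides would give a picture closer to the one on the slide.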

Samples, Resampling and Bootstrap