ORF 245 Fundamentals of Statistics
Chapter 2
Displaying and Summarizing a Single Variable
Robert Vanderbei
Fall 2015
Slides last edited on June 22, 2015
http://www.princeton.edu/~rvdb
Displaying and Summarizing Categorical Variables
Example – Mortality Data – 15 to 24 Year Olds
Cause                       Count    Percentage
Motor vehicle accidents     6,948    23.47%
All other accidents         5,048    17.05%
Intentional self-harm       4,688    15.84%
Homicide                    4,508    15.23%
Cancer                      1,609     5.43%
Diseases of heart             948     3.20%
Congenital malformations      429     1.45%
All other causes            5,391    18.21%
Total                      29,605   100.00%
Numbers in the Count column are called frequency data.
Numbers in the Percentage column are called relative frequency data.
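Relative frequencies are the counts divided by the total. A minimal Matlab sketch of that arithmetic, assuming the counts are typed in by hand and using the total of 29,605 stated in the table (the variable names are illustrative):

```matlab
% Counts from the table, in the order listed (assumed typed in by hand)
count = [6948; 5048; 4688; 4508; 1609; 948; 429; 5391];
total = 29605;                      % total as stated in the table

% Relative frequency = count / total, expressed as a percentage
percentage = 100 * count / total;

% Print each percentage with two decimals, as in the table
fprintf('%6.2f%%\n', percentage);
```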
Bar Chart
[Figure: bar chart of the eight counts; bars numbered 1 to 8 on the horizontal axis, vertical axis from 0 to 7000]
Data File: http://www.princeton.edu/~rvdb/245/data/Table2.2/data.txt
"Motor vehicle accidents"    6948
"All other accidents"        5048
"Intentional self-harm"      4688
"Homicide"                   4508
"Cancer"                     1609
"Diseases of heart"          948
"Congenital malformations"   429
"All other causes"           5391
Matlab:
fileID = fopen('data.txt');
C = textscan(fileID, '%q %f');
cause = C{1};
count = C{2};
bar(count);
Bar Chart
[Figure: bar chart of the eight counts with each bar labeled by its cause on the horizontal axis, vertical axis from 0 to 7000]
Data File: http://www.princeton.edu/~rvdb/245/data/Table2.2/data.txt (the same file as for the previous chart)
Matlab:
fileID = fopen('data.txt');
C = textscan(fileID, '%q %f');
cause = C{1};
count = C{2};
bar(1:8,diag(count),'stacked');
set(gca, 'XTick', 1:8, 'XTickLabel', cause);
Pie Chart
[Figure: pie chart of the eight counts, one slice per cause]
Matlab:
fileID = fopen('data.txt');
C = textscan(fileID, '%q %f');
cause = C{1};
count = C{2};
pie(count,cause);
Which is Better: Pie or Bar?
[Figure: three data sets labeled A, B and C, each with five categories numbered 1 to 5, shown as pie charts and as bar charts with vertical axes from 0 to 25]
Displaying and Summarizing Quantitative Variables
Example – Ozone – Mauna Loa Hawaii
STN  YEAR  MON  DAY  HR  O3(PPB)
31   2014  01   01   00  35.88
31   2014  01   01   01  34.84
31   2014  01   01   02  34.57
31   2014  01   01   03  33.40
31   2014  01   01   04  33.14
31   2014  01   01   05  38.34
31   2014  01   01   06  39.96
31   2014  01   01   07  37.17
31   2014  01   01   08  40.31
31   2014  01   01   09  39.30
31   2014  01   01   10  35.05
31   2014  01   01   11  36.02
31   2014  01   01   12  35.19
31   2014  01   01   13  37.17
31   2014  01   01   14  42.61
31   2014  01   01   15  43.34
31   2014  01   01   16  42.29
31   2014  01   01   17  42.63
31   2014  01   01   18  42.09
 .    .    .    .    .    .
 .    .    .    .    .    .
http://www.princeton.edu/~rvdb/245/data/Figure2.4/mlo_O3_6m_hour_2014.dat
In total 8178 lines of data.
Histogram
Every Hour
[Figure: histogram; horizontal axis: Hourly Ozone Concentrations (ppb), 0 to 100; vertical axis: # of hours, 0 to 500]
Matlab:
load -ascii 'mlo_O3_6m_hour_2014.dat';        % anything after the "percent" is ignored by Matlab
ozone = mlo_O3_6m_hour_2014(:,6);             % colon means all rows, 6 means the 6th column
histogram(ozone);
xlabel('Hourly Ozone Concentrations (ppb)');  % always wise to label the axes
ylabel('# of hours');
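By default histogram chooses the bins automatically, and the apparent shape can depend on that choice. A minimal sketch of setting the bins explicitly, assuming 5 ppb wide bins (an arbitrary choice) and a short made-up vector standing in for the ozone column:

```matlab
% Made-up stand-in for the ozone column (hypothetical values, not the real data)
ozone = [33.1 35.9 38.3 40.3 42.6 43.3 47.0 51.2 55.8 61.4];

% Bin edges every 5 ppb from 0 to 100 (the 5 ppb width is an assumption)
edges = 0:5:100;

% histcounts returns the number of observations in each bin without plotting
counts = histcounts(ozone, edges);

% Draw the histogram with the same bins and label the axes
histogram(ozone, edges);
xlabel('Hourly Ozone Concentrations (ppb)');
ylabel('# of hours');
```

Changing the step in edges (say, to 2 or 10) and replotting is a quick way to see how much the picture depends on the bin width.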
Histogram
Daily at 6am
[Figure: histogram; horizontal axis: 6 AM Ozone Concentration (ppb), 0 to 90; vertical axis: # of days, 0 to 60]
Unix:
grep "06 " mlo_O3_6m_hour_2014.dat > mlo_O3_6m_6am_2014.dat
Matlab:
load -ascii 'mlo_O3_6m_6am_2014.dat';
ozone = mlo_O3_6m_6am_2014(:,6);
histogram(ozone);
xlabel('6 AM Ozone Concentration (ppb)');
ylabel('# of days');
Histogram
Every 24th line
[Figure: histogram; horizontal axis: Hourly Ozone Concentrations (ppb), 0 to 100; vertical axis: # of hours, 0 to 60]
Matlab:
load -ascii 'mlo_O3_6m_hour_2014.dat';
ozone = mlo_O3_6m_hour_2014( 7:24:end, 6);   % 7:24:end means rows 7, 31, 55, 79, ...
histogram(ozone);
xlabel('6 AM Ozone Concentration (ppb)');
ylabel('# of days');
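Why is it different from the 6 AM histogram? Missing data! Some hours are absent from the file, so every 24th line does not always land on a 6 AM reading.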
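One way to avoid relying on line position is to select rows by the value in the hour column. A minimal sketch on a made-up five-line data set with the same column layout as the ozone file (the values and the gap are hypothetical):

```matlab
% Tiny made-up stand-in for the data file, columns: STN YEAR MON DAY HR O3.
% Hour 01 of the second day is deliberately missing.
data = [31 2014 1 1 0 35.9;
        31 2014 1 1 1 34.8;
        31 2014 1 1 2 34.6;
        31 2014 1 2 0 40.3;
        31 2014 1 2 2 39.3];

hr = data(:,5);                      % the 5th column is the hour of the day

% Select rows by the value of the hour column instead of by line position
ozone_2am = data(hr == 2, 6);

% Count how many lines there are for each hour 0..23;
% unequal counts reveal missing hours
lines_per_hour = accumarray(hr + 1, 1, [24 1]);
```

Here data(3:3:end, 6) would pick line 3 only, while the logical index hr == 2 finds both 2 AM readings despite the gap.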
Modality
[Figure: two histograms of Hourly Ozone Concentrations (ppb) against # of hours, one labeled Bimodal and one labeled Unimodal]
A histogram with one hump is unimodal, one with two humps is bimodal, and one with three or more humps is called multimodal.
Skewness
[Figure: two histograms, one labeled Symmetric and one labeled Skewed. Two arrows point at bars captioned "Outliers?"]
Centrality
[Figure: histogram; horizontal axis: 6 AM Ozone Concentration (ppb), 0 to 90; vertical axis: # of days, 0 to 14]
load -ascii 'mlo_O3_6m_6am_2014.dat';
x = mlo_O3_6m_6am_2014(:,6);   % let's call our vector of variables x

Mode: top of the hump
mode(round(x))
Answer: 49

Median: half to the left, half to the right
% brute force
x_sorted = sort(x);
[n,m] = size(x);
(x_sorted(floor((n+1)/2)) ...
+ x_sorted(ceil((n+1)/2)))/2
% Matlab's builtin function
median(x)
Answer: 39.94

Mean: average value... $\bar{x} = \frac{1}{n} \sum_{j=1}^{n} x_j$
% brute force
[n,m] = size(x);
sum(x)/n
% Matlab's builtin function
mean(x)
Answer: 40.7631
Which is most sensitive to outliers: Mean, Median, or Mode?
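One way to explore the question is to add a single extreme value to a small sample and recompute all three summaries. The sketch below uses made-up numbers, not the ozone data:

```matlab
x     = [38 40 40 41 43];     % small made-up sample
x_out = [x 400];              % the same sample plus one extreme value

% Each row: summary without the outlier, summary with the outlier
means   = [mean(x),   mean(x_out)];
medians = [median(x), median(x_out)];
modes   = [mode(x),   mode(x_out)];

disp([means; medians; modes]);
```

In this sample the mean jumps from 40.4 to about 100.3, the median moves only from 40 to 40.5, and the mode stays at 40.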
Spread
[Figure: histogram; horizontal axis: 6 AM Ozone Concentration (ppb), 0 to 90; vertical axis: # of days, 0 to 14]

Range: max − min
% brute force
max(x)-min(x)
% builtin function
range(x)
Answer: 84.12

Inter-Quartile Range (IQR): Q3 − Q1
% brute force
x_sorted = sort(x);
[n,m] = size(x);
x_sorted(round(0.75*n)) ...
- x_sorted(round(0.25*n))
% Matlab's builtin function
iqr(x)
Answer: 19.1725

Standard Deviation: $s = \sqrt{\frac{1}{n-1} \sum_{j=1}^{n} (x_j - \bar{x})^2}$
% brute force
[n,m] = size(x);
sqrt(sum((x-mean(x)).^2)/(n-1))
% Matlab's builtin function
std(x)
Answer: 14.3677

Another option (mean absolute deviation): $\frac{1}{n} \sum_{j=1}^{n} |x_j - \bar{x}|$ ⟹ 11.4826
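The four measures of spread can be checked by hand on a small sample. A minimal sketch using only base Matlab functions and a made-up eight-point sample; the quartiles follow the rounding rule of the brute-force code above, which may differ slightly from the convention used by the builtin iqr:

```matlab
x = [2; 4; 4; 4; 5; 5; 7; 9];        % small made-up sample (column vector)
n = numel(x);

r = max(x) - min(x);                 % range: 9 - 2 = 7

x_sorted = sort(x);
% IQR with the rounding rule from the brute-force code: 5 - 4 = 1
q = x_sorted(round(0.75*n)) - x_sorted(round(0.25*n));

% Standard deviation (divides by n-1): sqrt(32/7)
s = sqrt(sum((x - mean(x)).^2)/(n - 1));

% Mean absolute deviation (divides by n): 12/8 = 1.5
mad_mean = sum(abs(x - mean(x)))/n;
```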
Bell Curve
aka Normal or Gaussian Distribution
[Figure: histogram with a bell curve drawn over it; horizontal axis: Hourly Ozone Concentrations (ppb), 0 to 100; vertical axis: # of hours, 0 to 60]
Curve is called the Bell Curve (formula later).
Peak of the curve is at x̄.
Inflection points are at x̄ ± s.
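The two claims about the peak and the inflection points can be checked numerically. The sketch below assumes the standard normal density formula, which these slides defer, and uses illustrative values for x̄ and s rather than the ones computed from the ozone data:

```matlab
xbar = 40;  s = 14;                  % illustrative values (assumed)
t = linspace(xbar - 4*s, xbar + 4*s, 2001);

% Normal (Gaussian) density with mean xbar and standard deviation s
f = exp(-(t - xbar).^2 / (2*s^2)) / (s*sqrt(2*pi));

% Peak location: the grid point where f is largest
[~, k] = max(f);
t_peak = t(k);

% Inflection points: where the second difference of f changes sign
d2 = diff(f, 2);
idx = find(d2(1:end-1) .* d2(2:end) < 0);
t_inflect = t(idx + 2);              % accurate to within about one grid step
```

With these values t_peak comes out at xbar and t_inflect lands within a grid step of xbar - s and xbar + s.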
Samples, Resampling and Bootstrap