Describing your Data

Chapter 4: Describing your Data
1.
a.
Quantitative variables involve values that come in meaningful (not arbitrary) numbers.
b.
Qualitative variables are variables whose values fall into some category, indicating a quality or
property of an object.
c.
Continuous variables are quantitative variables whose values follow a wide range of possible values
over a continuous spectrum.
d.
Ordinal variables are qualitative variables whose categories can be put into some natural order.
e.
Nominal variables are qualitative variables whose categories cannot be put into a natural order.
2.
A distribution is skewed if most of the values are clustered toward the left or right edge of the distribution.
A positively skewed distribution is a skewed distribution in which the values are clustered towards the left
– or lower end of the distribution. A negatively skewed distribution is a skewed distribution in which the
values are clustered towards the right – or upper end of the distribution.
3.
A stem and leaf plot is a graphical representation of a distribution in which the values of the distribution
are used to create the stems and leaves of the plot. One advantage of the stem and leaf plot over a histogram
is that the values of the distribution can be inferred from the plot. A disadvantage is that you do not have
control over the size and placement of the bins, and the plot can be difficult to manage for large data sets in
which the stems need to display a large number of leaves.
4.
The interquartile range is the range between the first quartile and the third quartile.
5.
False. The range of a distribution depends only on the distribution's maximum and minimum values. Those
values need not reflect the variability of the rest of the distribution, especially when outliers are present.
6.
An outlier is an unusually extreme value from a distribution. If the value is greater than the third quartile
plus 1.5 x the interquartile range (IQR) or less than the first quartile minus 1.5 x the IQR it is considered a
moderate outlier. If the outlier is greater than the third quartile plus 3 x IQR or less than the first quartile
minus 3 x IQR, it is considered an extreme outlier.
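The fence rules in this answer can be written compactly. This is only a restatement of the definition above; the symbols Q_1, Q_3, and x are notation chosen here and do not come from the text.

```latex
\begin{aligned}
\mathrm{IQR} &= Q_3 - Q_1 \\
\text{moderate outlier:}\quad & Q_3 + 1.5\,\mathrm{IQR} < x \le Q_3 + 3\,\mathrm{IQR}
  \quad\text{or}\quad Q_1 - 3\,\mathrm{IQR} \le x < Q_1 - 1.5\,\mathrm{IQR} \\
\text{extreme outlier:}\quad & x > Q_3 + 3\,\mathrm{IQR}
  \quad\text{or}\quad x < Q_1 - 3\,\mathrm{IQR}
\end{aligned}
```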
7.
False. Extreme values may be a natural part of the data. By removing those values, you may be removing
an important (and perhaps the most interesting) part of the distribution.
8.
A boxplot is a graphical representation of the distribution that displays the interquartile range, the
maximum and minimum values, and the existence of moderate or extreme outliers. One advantage is that
the boxplot displays several of the important summary statistics. A disadvantage is that the boxplot may
be difficult to interpret for the untrained eye.
9.
a.
Data Values
30
30
60
100
110
120
120
180
200
200
210
210
210
220
240
290
300
340
450
610
900
b.
Positively skewed.
c.
The mean is approximately 244.29. The median is approximately 210.
d.
Moderate outliers: 610. Extreme outliers: 900.
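The answers to parts (c) and (d) can be checked with a short script. This is a minimal sketch, assuming quartiles are computed with the "inclusive" interpolation rule of Python's statistics module; the text does not say which quartile rule it uses, and other rules can give slightly different fences.

```python
import statistics

# Data values from part (a).
data = [30, 30, 60, 100, 110, 120, 120, 180, 200, 200, 210,
        210, 210, 220, 240, 290, 300, 340, 450, 610, 900]

mean = statistics.mean(data)
median = statistics.median(data)

# Assumption: quartiles follow the "inclusive" interpolation rule.
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Inner fences sit 1.5 x IQR beyond the quartiles, outer fences 3 x IQR.
inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr

extreme = [x for x in data if x < outer_low or x > outer_high]
moderate = [x for x in data
            if (x < inner_low or x > inner_high) and x not in extreme]

print(round(mean, 2), median)
print("moderate:", moderate, "extreme:", extreme)
```

Under that quartile rule the first and third quartiles are 120 and 290, so the upper inner fence is 545 and the upper outer fence is 800.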
10.
a.
Positively skewed.
b.
11.
a.
Univariate Statistics    Square Feet
Count                    117
Sum                      193,501
Average                  1,653.85
Median                   1,549
Trimmed Mean (0.2)       1,594.33
Minimum                  837
Maximum                  3,750
Range                    2,913
Standard Deviation       523.723
Variance                 274,285.574
Standard Error           48.418
Skewness                 1.188
Kurtosis                 1.634
Smallest (2)             900
Largest (2)              2,931
1st Percentile           911
5th Percentile           1,029
10th Percentile          1,106
25th Percentile          1,280
50th Percentile          1,549
75th Percentile          1,894
90th Percentile          2,489
95th Percentile          2,680
99th Percentile          2,929
Interquartile Range      614
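Several rows of the Square Feet table are tied together by simple identities, which makes a useful sanity check on the output. The sketch below only re-derives values already in the table; the tolerances are assumptions chosen to absorb rounding in the printed figures.

```python
import math

# Values copied from the Square Feet table in part (a).
count = 117
total = 193_501
average = 1_653.85
minimum, maximum, value_range = 837, 3_750, 2_913
std_dev, variance, std_error = 523.723, 274_285.574, 48.418
q1, q3, iqr = 1_280, 1_894, 614

# Each entry compares a printed statistic with its defining formula.
checks = {
    "average = sum / count": math.isclose(total / count, average, abs_tol=0.01),
    "range = max - min": maximum - minimum == value_range,
    "variance = sd squared": math.isclose(std_dev ** 2, variance, rel_tol=1e-5),
    "std error = sd / sqrt(n)": math.isclose(
        std_dev / math.sqrt(count), std_error, abs_tol=0.001),
    "IQR = Q3 - Q1": q3 - q1 == iqr,
}

for name, ok in checks.items():
    print(f"{name}: {'ok' if ok else 'mismatch'}")
```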
b.
The smallest house is 837 square feet. The largest house is 3,750 square feet.
c.
85.3%
d.
The boxplot appears as:
[Figure: boxplot of Square Feet; vertical axis from 0 to 4,000]
e.
The house that is 3,750 square feet.
f.
The boxplot appears as:
[Figure: boxplots of Square Feet for Corner Lot = No and Corner Lot = Yes; vertical axis from 0 to 4,000]
g.
Houses on corner lots appear to be larger (in general) than houses that are not on corner lots, but
this is not always the case. There are some corner-lot houses that are small.
h.
The house that was an outlier when we considered all houses is not an outlier when compared only to
other corner-lot houses. Whether this observation is an outlier depends upon the context.
i.
Square Feet            Corner_Lot = "No"    Corner_Lot = "Yes"    Overall
Count                  90                   27                    117
Sum                    135,477              58,024                193,501
Average                1,505.30             2,149.04              1,653.85
Median                 1,447                2,116                 1,549
Trimmed Mean (0.2)     1,465.22             2,127.17              1,594.33
Minimum                837                  1,080                 837
Maximum                2,931                3,750                 3,750
Range                  2,094                2,670                 2,913
Standard Deviation     393.338              602.583               523.723
Variance               154,715.044          363,106.652           274,285.574
Standard Error         41.462               115.967               48.418
Skewness               1.156                0.467                 1.188
Kurtosis               2.052                0.359                 1.634
Smallest (2)           900                  1,348                 900
Largest (2)            2,774                2,921                 2,931
1st Percentile         893                  1,150                 911
5th Percentile         1,016                1,364                 1,029
10th Percentile        1,051                1,438                 1,106
25th Percentile        1,219                1,710                 1,280
50th Percentile        1,447                2,116                 1,549
75th Percentile        1,715                2,590                 1,894
90th Percentile        1,923                2,785                 2,489
95th Percentile        2,191                2,899                 2,680
99th Percentile        2,791                3,534                 2,929
Interquartile Range    496                  880                   614
j.
The new variable is a quantitative variable. The first few values are:
Price      Square Feet    PPSQF
87,400     1,236          70.71
110,900    1,740          63.74
95,000     1,715          55.39
87,000     1,273          68.34
73,900     970            76.19
77,000     900            85.56
133,000    1,850          71.89
116,000    1,720          67.44
102,000    1,606          63.51
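The PPSQF column is the price divided by the square footage. A minimal sketch that reproduces the listed values from the first two columns (the variable names are illustrative, not from the text):

```python
# (price, square feet) pairs copied from the rows listed above.
homes = [
    (87_400, 1_236), (110_900, 1_740), (95_000, 1_715),
    (87_000, 1_273), (73_900, 970), (77_000, 900),
    (133_000, 1_850), (116_000, 1_720), (102_000, 1_606),
]

# Price per square foot, rounded to cents as in the table.
ppsqf = [round(price / sqft, 2) for price, sqft in homes]

for (price, sqft), value in zip(homes, ppsqf):
    print(f"{price:>8,} {sqft:>6,} {value:>7.2f}")
```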
k.
The histogram appears as:
[Figure: histogram of PPSQF; horizontal axis labeled from 36.15 to 97.63, frequency axis from 0 to 25]
l.
The distribution is symmetric. There are two moderate outliers with values of 34.53 and 99.24.
12.
a.
                       Employees
Count                  50
Sum                    6,777
Average                135.54
Median                 52
Trimmed Mean (0.2)     75.28
Minimum                16
Maximum                1,400
Range                  1,384
Standard Deviation     256.084
Variance               65,578.825
Standard Error         36.216
Skewness               4.118
Kurtosis               17.522
Smallest (2)           19
Largest (2)            1,200
1st Percentile         17
5th Percentile         21
10th Percentile        25
25th Percentile        33
50th Percentile        52
75th Percentile        125
90th Percentile        215
95th Percentile        403
99th Percentile        1,302
Interquartile Range    93
b.
Average = 135.54. Median = 52. The median is probably a better measure of the size of a typical
business from the data set since it is less influenced by large values. In this case, the sample average
is even larger than the third quartile (125).
c.
[Figure: boxplot of Employees; vertical axis from -200 to 1,600]
The distribution is positively skewed with several extreme outliers.
d.
The first few values of the log(Employees) variable are:
Employees    LEmployees    Name
500          2.699         Jockey International, 2300 6th St., Kenosha 53140
1400         3.146         Astronautics, 4115 N. Teutonia Ave., Milwaukee 53209
280          2.447         Pleasant Company, 8400 Fairway Place, Middleton 53562
163          2.212         Dawes Transport, 9180 N. 107th St., Milwaukee 53212
285          2.455         East Capitol Drive Foods Inc., 709 E. Capitol Drive Milwaukee 53212
140          2.146         TSR Inc., 201 Sheridan Springs Road, Lake Geneva 53147
52           1.716         Ricom Electronics, 5139 W. Clinton Ave. Milwaukee 53233
46           1.663         Rollette Oil Co., 2025 Beloit Ave., Janesville 53548
125          2.097         Mainline Industrial Distributors, 3441 Highview Drive, P.O. Box 297, Appleton 54912
The boxplot of the log(Employees) variable appears as:
[Figure: boxplot of LEmp; vertical axis from 0 to 3.5]
The distribution is more symmetric than the raw employee counts. Also, there are few outliers and those
outliers are moderate in size, not extreme.
e.
The statistics are:
                       LEmployees
Count                  50.000
Sum                    91.974
Average                1.839
Median                 1.712
Trimmed Mean (0.2)     1.791
Minimum                1.204
Maximum                3.146
Range                  1.942
Standard Deviation     0.434
Variance               0.188
Standard Error         0.061
Skewness               1.106
Kurtosis               1.314
Smallest (2)           1.279
Largest (2)            3.079
1st Percentile         1.241
5th Percentile         1.328
10th Percentile        1.394
25th Percentile        1.512
50th Percentile        1.712
75th Percentile        2.097
90th Percentile        2.331
95th Percentile        2.589
99th Percentile        3.113
Interquartile Range    0.585
The skewness of the raw counts is 4.118 and for the log counts, the skewness is 1.106. The larger skewness
for the raw counts demonstrates that the distribution is more skewed than the distribution for the log
values. The kurtosis value for the raw counts is 17.522, while for the log counts the kurtosis is 1.314. This
indicates that the distribution of the raw counts shows heavier tails.
f.
The average of log(Employees) is 1.839. In terms of the number of employees, this is 10^1.839, or
69.10 when the unrounded average is used. This is equal to the geometric mean of the raw counts.
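The link between the mean of the logs and the geometric mean can be illustrated numerically. The sketch below uses only the nine employee counts listed in part (d), not the full data set of 50 businesses, so its result is not expected to match the 69.10 quoted above; it only shows that the two routes agree.

```python
import math
import statistics

# Employee counts for the nine companies listed in part (d).
# Assumption: this subset stands in for the full 50-business data set.
employees = [500, 1400, 280, 163, 285, 140, 52, 46, 125]

# Base-10 logs, as in the LEmployees column.
log_values = [math.log10(x) for x in employees]
mean_log = statistics.mean(log_values)

# Back-transforming the mean of the logs gives the geometric mean.
back_transformed = 10 ** mean_log
geometric_mean = statistics.geometric_mean(employees)

print(round(back_transformed, 2), round(geometric_mean, 2))
```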
g.
The difference in sales between the first two companies is $40,000 – however the one company is
5x the size of the other. The difference in sales between the second two companies is $50,000, but
the smaller company is 89% of the size of the larger one. Thus, even though the difference between
the second pair is greater in absolute numbers, the companies are more similar in terms of their
scale.
13.
a.
The statistics are:
                       High Income Minority    High Income White    Minority     White
Count                  20                      20                   20           20
Sum                    550.3                   226.0                737.63       312.5
Average                27.515                  11.300               36.8815      15.625
Median                 29.1                    9.8                  37.40        15.8
Mode                   #N/A                    #N/A                 #N/A         16.0
Trimmed Mean (0.2)     28.381                  10.519               36.7519      15.075
Minimum                5.8                     2.2                  10.60        3.7
Maximum                41.3                    26.8                 62.20        32.4
Range                  35.5                    24.6                 51.60        28.7
Standard Deviation     10.9220                 6.5164               13.05090     7.7958
Variance               119.2908                42.4632              170.32590    60.7746
Standard Error         2.4422                  1.4571               2.91827      1.7432
Skewness               -0.598                  0.994                0.004        0.415
Kurtosis               -0.616                  0.880                -0.373       -0.122
Smallest (2)           8.0                     3.6                  20.90        5.5
Largest (2)            41.1                    25.1                 55.90        29.7
1st Percentile         6.2                     2.5                  12.56        4.0
5th Percentile         7.9                     3.5                  20.39        5.4
10th Percentile        11.0                    4.1                  22.88        5.6
25th Percentile        21.3                    7.4                  26.43        9.2
50th Percentile        29.1                    9.8                  37.40        15.8
75th Percentile        37.0                    15.1                 45.25        20.2
90th Percentile        39.3                    17.0                 51.94        23.9
95th Percentile        41.1                    25.2                 56.22        29.8
99th Percentile        41.3                    26.5                 61.00        31.9
Interquartile Range    15.7                    7.7                  18.83        11.0
b.
The boxplots appear as:
[Figure: boxplots of High Income Minority, High Income White, Minority, and White; vertical axis from -10 to 70]
14.
a.
The statistics are:
                       Area = "North"    Area = "South"    Area = "West"     Overall
Count                  21                17                13                51
Sum                    820,651           622,717           544,099           1,987,467
Average                39,078.63         36,630.40         41,853.78         38,969.95
Median                 39,200            35,328            41,261            37,411
Mode                   43,472            #N/A              #N/A              (43472 ; 36230.4)
Trimmed Mean (0.2)     39,000.75         35,921.60         40,380.07         38,261.81
Minimum                28,952            29,509            33,550            28,952
Maximum                49,085            54,384            66,368            66,368
Range                  20,133            24,875            32,818            37,416
Standard Deviation     5,960.871         5,686.171         8,197.974         6,687.082
Variance               35,531,977.545    32,332,535.680    67,206,782.950    44,717,069.549
Standard Error         1,300.769         1,379.099         2,273.709         936.379
Skewness               0.029             2.044             2.439             1.583
Kurtosis               -1.164            5.557             7.368             4.549
Smallest (2)           31,333            31,261            35,746            29,509
Largest (2)            48,269            43,498            46,611            54,384
1st Percentile         29,428            29,789            33,814            29,230
5th Percentile         31,333            30,910            34,868            31,297
10th Percentile        32,421            32,146            35,791            32,520
25th Percentile        33,502            33,504            36,230            34,391
50th Percentile        39,200            35,328            41,261            37,411
75th Percentile        43,472            37,411            41,624            42,508
90th Percentile        47,152            41,553            46,001            46,611
95th Percentile        48,269            45,675            54,514            48,677
99th Percentile        48,922            52,642            63,997            60,376
Interquartile Range    9,970             3,907             5,394             8,117
b.
The boxplots appear as:
[Figure: "Salary Boxplot"; boxplots of Teacher Salaries for Area = North, Area = South, and Area = West; vertical axis from 20,000 to 70,000]
c.
The salary distribution for the Northern region is symmetric with no outliers. There is an extreme
outlier and a moderate outlier in the Southern region. The distribution of the Southern region is
positively skewed. There is an extreme outlier in the Western region. The outlier comes from the
state of Alaska. The cost of living in Alaska is very high compared to other states in the union;
therefore teacher salaries may be higher to compensate.
d.
The statistics are:
                       Area = "North"    Area = "South"    Area = "West"    Overall
Count                  21                17                13               51
Sum                    133.39            121.24            92.47            347.10
Average                6.3517            7.1317            7.1132           6.8058
Median                 6.25              7.34              6.86             6.77
Mode                   #N/A              #N/A              #N/A             #N/A
Trimmed Mean (0.2)     6.3203            7.1454            7.0707           6.7614
Minimum                4.91              5.45              4.97             4.91
Maximum                7.98              8.61              9.73             9.73
Range                  3.07              3.15              4.76             4.82
Standard Deviation     0.78332           0.83879           1.50144          1.07652
Variance               0.61359           0.70356           2.25431          1.15890
Standard Error         0.17093           0.20344           0.41642          0.15074
Skewness               0.383             -0.153            0.169            0.435
Kurtosis               -0.167            -0.188            -0.967           -0.085
Smallest (2)           5.37              5.96              5.00             4.97
Largest (2)            7.68              8.40              8.73             8.73
1st Percentile         5.00              5.53              4.97             4.94
5th Percentile         5.37              5.86              4.99             5.19
10th Percentile        5.44              6.14              5.14             5.45
25th Percentile        5.79              6.55              6.25             6.09
50th Percentile        6.25              7.34              6.86             6.77
75th Percentile        6.87              7.41              8.36             7.40
90th Percentile        7.50              8.16              8.73             8.36
95th Percentile        7.68              8.44              9.13             8.66
99th Percentile        7.92              8.57              9.61             9.23
Interquartile Range    1.09              0.86              2.10             1.31
e.
The boxplots appear as:
[Figure: "Salary Pupil Ratio"; boxplots for Area = North, Area = South, and Area = West; vertical axis from 4 to 10]
f.
Alaska is not an outlier for this variable. Alaska is at the 2nd percentile for salary/pupil ratio, but at
the 100th percentile for salary. Even though these teachers are paid the highest, in terms of the
salary/pupil ratio they are among the lowest paid.
15.
a.
The histogram appears as:
[Figure: histogram of Salary; horizontal axis from 0 to 2,500 in steps of 100, frequency axis from 0 to 50]
b.
The frequency table is:
Salary    Freq    Cum. Freq.    %         Cum. %
0         23      23            8.75%     8.75%
100       44      67            16.73%    25.48%
200       33      100           12.55%    38.02%
300       22      122           8.37%     46.39%
400       23      145           8.75%     55.13%
500       19      164           7.22%     62.36%
600       14      178           5.32%     67.68%
700       29      207           11.03%    78.71%
800       12      219           4.56%     83.27%
900       11      230           4.18%     87.45%
1,000     7       237           2.66%     90.11%
1,100     4       241           1.52%     91.63%
1,200     4       245           1.52%     93.16%
1,300     5       250           1.90%     95.06%
1,400     1       251           0.38%     95.44%
1,500     1       252           0.38%     95.82%
1,600     2       254           0.76%     96.58%
1,700     0       254           0.00%     96.58%
1,800     2       256           0.76%     97.34%
1,900     4       260           1.52%     98.86%
2,000     0       260           0.00%     98.86%
2,100     1       261           0.38%     99.24%
2,200     0       261           0.00%     99.24%
2,300     0       261           0.00%     99.24%
2,400     2       263           0.76%     100.00%
2,500     0       263           0.00%     100.00%
c.
The 10th percentile is 100. The 90th percentile is 1049.
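The cumulative columns of the frequency table follow directly from the Freq column. A minimal sketch is given below; the helper first_bin_reaching is a hypothetical name introduced here. A binned table can only narrow a percentile down to a bin, so the exact percentile values quoted above still require the raw salaries.

```python
from itertools import accumulate

# Frequencies for the salary bins 0, 100, ..., 2,500 from the table above.
freq = [23, 44, 33, 22, 23, 19, 14, 29, 12, 11, 7, 4, 4,
        5, 1, 1, 2, 0, 2, 4, 0, 1, 0, 0, 2, 0]
bins = list(range(0, 2600, 100))

total = sum(freq)
cum_freq = list(accumulate(freq))
cum_pct = [round(100 * c / total, 2) for c in cum_freq]


def first_bin_reaching(level):
    """Return the first bin label whose cumulative percent is >= level."""
    return next(b for b, p in zip(bins, cum_pct) if p >= level)


print(total, cum_freq[-1])
print(first_bin_reaching(90))
```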
d.
The players in the upper 10% of the salary range are:
First     Last          Salary
Darryl    Strawberry    1220
Don       Mattingly     1975
George    Bell          1175
Wade      Boggs         1600
Cal       Ripken        1350
Jesse     Barfield      1238
Kent      Hrbek         1310
Von       Hayes         1300
Leon      Durham        1183
Tony      Pena          1150
Kirk      Gibson        1300
Rickey    Henderson     1670
Carney    Lansford      1200
Ozzie     Smith         1940
Paul      Molitor       1260
Eddie     Murray        2460
Dale      Murphy        1900
Jack      Clark         1300
Andre     Thornton      1100
Gary      Carter        1926
Jim       Rice          2413
Keith     Hernandez     1800
Dave      Winfield      1861
George    Brett         1500
Mike      Schmidt       2127
Ron       Cey           1050
Steve     Garvey        1450
e.
Average = 541.48. Median = 425. Percentile = 59.3%
16.
a.
The histogram appears as:
[Figure: histogram of Salary; horizontal axis from 0.0 to 22.0, frequency axis from 0 to 200]
b.
The frequency table is:
Salary    Freq    Cum. Freq.    %         Cum. %
0.0       184     184           40.98%    40.98%
0.5       47      231           10.47%    51.45%
1.0       25      256           5.57%     57.02%
1.5       21      277           4.68%     61.69%
2.0       12      289           2.67%     64.37%
2.5       15      304           3.34%     67.71%
3.0       21      325           4.68%     72.38%
3.5       13      338           2.90%     75.28%
4.0       13      351           2.90%     78.17%
4.5       12      363           2.67%     80.85%
5.0       12      375           2.67%     83.52%
5.5       14      389           3.12%     86.64%
6.0       10      399           2.23%     88.86%
6.5       8       407           1.78%     90.65%
7.0       10      417           2.23%     92.87%
7.5       3       420           0.67%     93.54%
8.0       5       425           1.11%     94.65%
8.5       3       428           0.67%     95.32%
9.0       2       430           0.45%     95.77%
9.5       3       433           0.67%     96.44%
10.0      3       436           0.67%     97.10%
10.5      0       436           0.00%     97.10%
11.0      1       437           0.22%     97.33%
11.5      1       438           0.22%     97.55%
12.0      6       444           1.34%     98.89%
12.5      1       445           0.22%     99.11%
13.0      2       447           0.45%     99.55%
13.5      1       448           0.22%     99.78%
14.0      0       448           0.00%     99.78%
14.5      0       448           0.00%     99.78%
15.0      0       448           0.00%     99.78%
15.5      0       448           0.00%     99.78%
16.0      0       448           0.00%     99.78%
16.5      0       448           0.00%     99.78%
17.0      0       448           0.00%     99.78%
17.5      0       448           0.00%     99.78%
18.0      0       448           0.00%     99.78%
18.5      0       448           0.00%     99.78%
19.0      0       448           0.00%     99.78%
19.5      0       448           0.00%     99.78%
20.0      0       448           0.00%     99.78%
20.5      0       448           0.00%     99.78%
21.0      0       448           0.00%     99.78%
21.5      0       448           0.00%     99.78%
22.0      1       449           0.22%     100.00%
c.
The 10th percentile is 0.2148. The 90th percentile is 6.6672.
d.
The player list is:
Number    Player               Salary
3         Alex Rodriguez       22.000
31        Mike Piazza          13.571
42        Mo Vaughn            13.167
24        Manny Ramirez        13.033
2         Derek Jeter          12.600
51        Bernie Williams      12.357
25        Carlos Delgado       12.200
33        Larry Walker         12.167
15        Shawn Green          12.167
88        Albert Belle         12.049
21        Sammy Sosa           12.000
43        Raul Mondesi         11.500
25        Mark McGwire         11.000
10        Chipper Jones        10.333
25        Barry Bonds          10.300
22        Juan Gonzalez        10.000
35        Frank Thomas         9.927
10        Gary Sheffield       9.917
30        Ken Griffey          9.710
33        Brian Jordan         9.100
11        Barry Larkin         9.000
25        Rafael Palmeiro      8.621
7         Ivan Rodriguez       8.600
4         Robin Ventura        8.500
25        Andruw Jones         8.200
25        Jim Thome            8.175
16        Ray Lankford         8.100
33        Jay Bell             8.000
7         Kenny Lofton         8.000
8         Javy Lopez           7.750
7         Craig Biggio         7.750
7         Dean Palmer          7.500
23        Greg Vaughn          7.479
23        Eric Karros          7.375
12        Roberto Alomar       7.343
24        Brian Giles          7.333
2         Carl Everett         7.333
5         Nomar Garciaparra    7.250
19        Vinny Castilla       7.250
8         Johnny Damon         7.100
28        David Justice        7.000
19        Dante Bichette       7.000
9         Todd Zeile           6.833
30        Jose Offerman        6.750
5         John Olerud          6.700
e.
Average = 2.4545. Median = 0.9. The average salary is at the 64.4 percentile.
f.
The skewness of the 1985 salaries is 1.5717. The skewness of the 2002 salaries is 1.912. Salaries
are more skewed in 2002 than they were in 1985.
17.
a.
The boxplots are:
[Figure: four boxplot panels, "Bladder Cancer Rate" (axis 0 to 7), "Kidney Cancer Rates" (axis 0 to 5), "Leukemia Rates" (axis 0 to 9), and "Lung Cancer Rates" (axis 0 to 30), each grouped by Cig Use Category = 0, 1, and 2]
b.
The univariate statistics are:
Bladder                Cig_Use_Category = 0    Cig_Use_Category = 1    Cig_Use_Category = 2
Count                  14                      15                      15
Sum                    47.70                   61.48                   72.15
Average                3.4071                  4.0987                  4.8100
Median                 3.07                    4.09                    4.78
Mode                   2.90                    3.72                    4.46
Trimmed Mean (0.2)     3.3058                  4.1038                  4.8008
Minimum                2.89                    2.86                    3.20
Maximum                5.14                    5.27                    6.54
Range                  2.25                    2.41                    3.34
Standard Deviation     0.71710                 0.77775                 0.87097
Variance               0.51424                 0.60490                 0.75859
Standard Error         0.19165                 0.20081                 0.22488
Skewness               1.609                   -0.089                  0.031
Kurtosis               1.683                   -1.033                  0.377
Smallest (2)           2.90                    2.93                    3.46
Largest (2)            4.65                    5.21                    5.98
1st Percentile         2.89                    2.87                    3.24
5th Percentile         2.90                    2.91                    3.38
10th Percentile        2.90                    3.04                    3.69
25th Percentile        2.92                    3.62                    4.46
50th Percentile        3.07                    4.09                    4.78
75th Percentile        3.56                    4.71                    5.21
90th Percentile        4.47                    5.08                    5.83
95th Percentile        4.82                    5.23                    6.15
99th Percentile        5.08                    5.26                    6.46
Interquartile Range    0.64                    1.09                    0.75
Kidney                 Cig_Use_Category = 0    Cig_Use_Category = 1    Cig_Use_Category = 2
Count                  14                      15                      15
Sum                    33.82                   43.59                   45.55
Average                2.4157                  2.9060                  3.0367
Median                 2.32                    2.90                    3.03
Mode                   2.75, 2.66, #N/A (assignment to categories is unclear in the source)
Trimmed Mean (0.2)     2.3842                  2.9169                  2.9862
Minimum                1.59                    2.13                    2.41
Maximum                3.62                    3.54                    4.32
Range                  2.03                    1.41                    1.91
Standard Deviation     0.54092                 0.35958                 0.45492
Variance               0.29260                 0.12930                 0.20695
Standard Error         0.14457                 0.09284                 0.11746
Skewness               0.716                   -0.197                  1.491
Kurtosis               0.571                   0.640                   3.896
Smallest (2)           1.77                    2.45                    2.55
Largest (2)            3.11                    3.43                    3.36
1st Percentile         1.61                    2.17                    2.43
5th Percentile         1.71                    2.35                    2.51
10th Percentile        1.85                    2.55                    2.59
25th Percentile        2.08                    2.75                    2.75
50th Percentile        2.32                    2.90                    3.03
75th Percentile        2.72                    3.07                    3.18
90th Percentile        3.04                    3.37                    3.36
95th Percentile        3.29                    3.46                    3.65
99th Percentile        3.55                    3.52                    4.19
Interquartile Range    0.63                    0.32                    0.43
Leukemia               Cig_Use_Category = 0    Cig_Use_Category = 1    Cig_Use_Category = 2
Count                  14                      15                      15
Sum                    94.34                   107.04                  99.13
Average                6.7386                  7.1360                  6.6087
Median                 6.71                    7.00                    6.82
Mode                   6.71                    7.38                    #N/A
Trimmed Mean (0.2)     6.6975                  7.1038                  6.6892
Minimum                5.82                    6.41                    4.90
Maximum                8.15                    8.28                    7.27
Range                  2.33                    1.87                    2.37
Standard Deviation     0.64412                 0.51788                 0.66076
Variance               0.41489                 0.26820                 0.43660
Standard Error         0.17215                 0.13372                 0.17061
Skewness               0.604                   0.719                   -1.339
Kurtosis               0.353                   0.066                   1.842
Smallest (2)           5.95                    6.56                    5.78
Largest (2)            7.48                    7.80                    7.23
1st Percentile         5.84                    6.43                    5.02
5th Percentile         5.90                    6.52                    5.52
10th Percentile        5.99                    6.58                    5.90
25th Percentile        6.26                    6.82                    6.30
50th Percentile        6.71                    7.00                    6.82
75th Percentile        6.98                    7.42                    7.10
90th Percentile        7.46                    7.76                    7.22
95th Percentile        7.71                    7.94                    7.24
99th Percentile        8.06                    8.21                    7.26
Interquartile Range    0.72                    0.60                    0.81
Lung                   Cig_Use_Category = 0    Cig_Use_Category = 1    Cig_Use_Category = 2
Count                  14                      15                      15
Sum                    233.27                  284.94                  346.53
Average                16.6621                 18.9960                 23.1020
Median                 16.41                   19.50                   23.03
Mode                   #N/A                    #N/A                    #N/A
Trimmed Mean (0.2)     16.3175                 18.9500                 23.3338
Minimum                12.01                   12.11                   15.92
Maximum                25.45                   26.48                   27.27
Range                  13.44                   14.37                   11.35
Standard Deviation     3.63450                 3.62722                 2.70780
Variance               13.20959                13.15674                7.33219
Standard Error         0.97136                 0.93654                 0.69915
Skewness               1.005                   0.013                   -1.101
Kurtosis               1.365                   0.308                   2.670
Smallest (2)           12.12                   14.20                   20.94
Largest (2)            20.55                   22.72                   25.95
1st Percentile         12.02                   12.40                   16.62
5th Percentile         12.08                   13.57                   19.43
10th Percentile        12.56                   14.73                   20.96
25th Percentile        14.23                   16.65                   22.06
50th Percentile        16.41                   19.50                   23.03
75th Percentile        17.56                   20.98                   24.79
90th Percentile        20.49                   22.39                   25.92
95th Percentile        22.27                   23.85                   26.35
99th Percentile        24.81                   25.95                   27.09
Interquartile Range    3.33                    4.34                    2.73
c.
There does not appear to be any relationship between cigarette use and leukemia rate in the states.
There is some evidence for a relationship between cigarette use and kidney cancer, since the kidney
cancer rate increases with increased rate of cigarette use. The strongest evidence occurs between
cigarette use and bladder cancer or lung cancer. In most cases as the rate of cigarette use increases,
so does the cancer rate.
d.
Wyoming, with a lung cancer rate of 15.92 and a cigarette use category of "high".
18.
a.
The statistics are:
           Diff80     Ratio80
Count      14         14
Average    -16.600    0.730
Median     -10.9      0.507
b.
The boxplots are:
[Figure: boxplot of Diff80 (vertical axis from -120 to 20) and boxplot of Ratio80 (vertical axis from 0 to 2.5)]
The extreme outlier in the Diff80 boxplot comes from New York City.
c.
The statistics without New York City are:
           Diff80     Ratio80
Count      13         13
Average    -10.215    0.774
Median     -8.6       0.538
The boxplots are:
[Figure: boxplot of Diff80 (vertical axis from -35 to 15) and boxplot of Ratio80 (vertical axis from 0 to 2.5)]
d.
There is evidence that air quality has improved (whether we include or exclude the data from New
York City). Of the two statistics, the ratio is less affected by the New York City results, though it
does contain two moderate outliers from other cities.
19.
a.
The statistics are:
                       REACTION
Count                  106
Sum                    18.264
Average                0.17230
Median                 0.171
Mode                   (0.175 ; 0.167 ; 0.164 ; 0.162)
Trimmed Mean (0.2)     0.17107
Minimum                0.124
Maximum                0.224
Range                  0.100
Standard Deviation     0.020556
Variance               0.000423
Standard Error         0.001997
Skewness               0.495
Kurtosis               0.101
Smallest (2)           0.137
Largest (2)            0.224
1st Percentile         0.137
5th Percentile         0.143
10th Percentile        0.148
25th Percentile        0.158
50th Percentile        0.171
75th Percentile        0.186
90th Percentile        0.200
95th Percentile        0.214
99th Percentile        0.224
Interquartile Range    0.028
b.
The boxplot appears as:
[Figure: boxplot of REACTION; vertical axis from 0.1 to 0.24]
There are no outliers and the shape of the distribution appears to be symmetric.
c.
The stem and leaf plot appears as follows:
Stem x 0.01
REACTION
12 4
13 7
14 0011344567778
15 22455566677899
16 0011222233444455777789
17 00111222334444555667
18 22233455666777899
19 0122338
20 2238
21 3477
22 244
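Because a stem and leaf plot keeps the individual values, the plot above can be read back into a list of reaction times. The sketch below does this by treating each stem as hundredths of a second and each leaf as thousandths. Each leaf holds a single digit, so statistics recomputed from the plot may differ slightly from the table in part (a) if the raw times carry more digits.

```python
# Rows of the stem and leaf plot above: stem (x 0.01) followed by its leaves.
plot = """
12 4
13 7
14 0011344567778
15 22455566677899
16 0011222233444455777789
17 00111222334444555667
18 22233455666777899
19 0122338
20 2238
21 3477
22 244
"""

values = []
for row in plot.split("\n"):
    if not row.strip():
        continue  # skip the blank first and last lines
    stem, leaves = row.split()
    for leaf in leaves:
        # Stem 12 with leaf 4 represents 0.124 seconds.
        values.append((int(stem) * 10 + int(leaf)) / 1000)

print(len(values), min(values), max(values))
```

The recovered count, minimum, and maximum agree with the Count, Minimum, and Maximum rows in part (a).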
d.
The interquartile range is 0.0275. The lower inner fence is equal to 0.117. The lower outer fence is
equal to 0.0757. Therefore 0.1 seconds would be considered a moderate outlier.
e.
The reaction statistics by order of finish are:
              Count    Average    Median    Minimum    Maximum    Standard Deviation
FINISH = 1    12       0.16842    0.172     0.137      0.189      0.017154
FINISH = 2    12       0.16750    0.166     0.140      0.214      0.022134
FINISH = 3    12       0.17033    0.161     0.141      0.222      0.024570
FINISH = 4    12       0.16717    0.164     0.145      0.187      0.014509
FINISH = 5    12       0.17575    0.175     0.146      0.217      0.019410
FINISH = 6    12       0.17267    0.171     0.142      0.224      0.023631
FINISH = 7    12       0.17125    0.174     0.124      0.209      0.023038
FINISH = 8    12       0.17925    0.176     0.148      0.224      0.022519
FINISH = 9    10       0.17960    0.177     0.155      0.213      0.018234
In general, those who finish lower seem to have slower reaction times. But there is a great deal of
overlap in the data. For example, one of the first place finishers had a reaction time of 0.189 seconds, while
one of the last place finishers had a reaction time of 0.155 seconds. A slower reaction time need not mean
a lower finish.
f.
The boxplot is:
[Figure: "Reaction Times by Order of Finish"; boxplots of Reaction Times for FINISH = 1 through FINISH = 9; vertical axis from 0.1 to 0.24, horizontal axis Order of Finish]
g.
There is no strong evidence of a relationship between a runner's reaction time and the result of the
race.
h.
The statistics are:
REACTION         Count    Average    Median    Minimum    Maximum    Std. Dev.
Group = "1-3"    36       0.16875    0.166     0.137      0.222      0.020919
Group = "4-6"    36       0.17186    0.170     0.142      0.224      0.019314
Group = "7-9"    34       0.17653    0.175     0.124      0.224      0.021268
The boxplot is:
[Figure: "Reaction Times by Finish Group"; boxplots of Reaction Times for Group = "1-3", Group = "4-6", and Group = "7-9"; vertical axis from 0.1 to 0.24, horizontal axis Finish Group]
There appears to be a trend to slower reaction times for runners in the lower finish groups. Reaction time
may be related to finish order only in a very general way. For example, you cannot tell whether a runner
will finish first by having a fast reaction time, but it may be more likely that runners who finish in the top 3
have, on average, faster reaction times than runners who finish in the bottom group. However, given the
amount of overlap in the boxplots, even this very general conclusion may not be valid.
20.
a.
The statistics are:
                       Assoc       Asst        Full
Count                  50          50          50
Sum                    2,571.35    2,163.86    3,695.56
Average                51.4270     43.2772     73.9112
Median                 50.65       42.45       72.00
Trimmed Mean (0.2)     50.9388     42.8240     73.4815
Minimum                41.80       35.30       55.90
Maximum                70.00       56.40       96.50
Range                  28.20       21.10       40.60
Standard Deviation     5.70100     4.61775     10.16325
Variance               32.50135    21.32362    103.29158
Standard Error         0.80624     0.65305     1.43730
Skewness               1.012       1.060       0.419
Kurtosis               1.623       1.144       -0.468
Smallest (2)           42.00       35.90       56.10
Largest (2)            64.40       56.20       93.30
1st Percentile         41.90       35.59       56.00
5th Percentile         43.29       37.65       58.64
10th Percentile        45.97       38.79       62.94
25th Percentile        47.83       40.40       66.80
50th Percentile        50.65       42.45       72.00
75th Percentile        54.68       44.80       79.68
90th Percentile        57.32       50.09       90.23
95th Percentile        62.98       51.06       92.03
99th Percentile        67.26       56.30       94.93
Interquartile Range    6.85        4.40        12.88
The greatest variability is found in the salaries of full professors.
b.
In general all three groups have fairly symmetric distributions. There are no apparent outliers in the
stem and leaf plot.
Stem x 10    Assoc                  Asst                              Full
3
3                                   557888899
4            1224                   000000011111111222223333334444
4            5666667777888899999    6679
5            000001111223344        00011
5            55556777               66                                566
6            144                                                      12334
6                                                                     55566777889
7            0                                                        0000112233
7                                                                     555667899
8                                                                     233
8                                                                     567
9                                                                     00123
9                                                                     6
c.
The boxplot is:
[Figure: "Professor Salaries"; boxplots for Assoc, Asst, and Full; vertical axis from 0 to 120]
There is some overlap in the salaries. This is probably due to the fact that an experienced associate
professor may make as much as or more than a new full professor. Likewise, experienced assistant professors
may make salaries comparable to less experienced associate professors.
d.
The percentiles are:
                   Assoc    Asst     Full
25th Percentile    47.83    40.40    66.80
75th Percentile    54.68    44.80    79.68
e.
$92,030 since this represents the upper 5% of the salary range for full professors at the universities
in the sample.
21.
a.
The statistics are:
                      Year 68    Year 72
Count                 19         19
Average               0.4932     0.5268
Median                0.50       0.53
Minimum               0.34       0.35
Maximum               0.63       0.64
Standard Deviation    0.06799    0.07079
b.
1968 geometric mean = 0.4885. 1972 geometric mean = 0.522.
c.
The boxplot is:
[Figure: "Women in the Labor Force"; boxplots of % of Women for Year 68 and Year 72; vertical axis from 0.30 to 0.70, horizontal axis Year]
There is one outlier from 1972, coming from the city of St. Louis. The boxplot seems to indicate that the
percent of women in the workforce has increased during this period.
d.
The distributions are close to symmetric. The mean and median values are similar, so you could use
either the mean or the median.
22.
a.
There appears to be a slight downward trend in the draft numbers as the year progresses; but there
is a lot of overlap in the boxplots.
              Count    Average    Median    Standard Deviation
Month = 1     31       201.48     215       99.756
Month = 2     29       202.97     210       103.955
Month = 3     31       225.81     256       95.825
Month = 4     30       203.67     225       109.371
Month = 5     31       207.97     226       114.974
Month = 6     30       195.73     208       117.866
Month = 7     31       181.55     188       109.614
Month = 8     31       173.45     145       112.731
Month = 9     30       157.30     168       87.151
Month = 10    31       182.45     201       96.781
Month = 11    30       148.73     132       94.398
Month = 12    31       121.55     100       95.059
[Figure: "Numbers by Month"; boxplots of Draft Numbers for Month = 1 through Month = 12; vertical axis from -50 to 400, horizontal axis Month]
b.
The statistics and boxplot are:
               Count    Average    Median    Standard Deviation
Quarter = 1    91       210.24     216       99.332
Quarter = 2    91       202.52     219       112.974
Quarter = 3    92       170.91     168       103.310
Quarter = 4    92       150.93     134       97.677
[Figure: "Draft Numbers by Quarter"; boxplots of Draft Numbers for Quarter = 1 through Quarter = 4; vertical axis from -50 to 400, horizontal axis Quarter]
The first two quarters of the year appear to be similar, but there is a drop off in the third quarter and then
another drop off in the fourth quarter.
c.
The statistics and boxplot are:
            Count    Average    Median    Standard Deviation
Half = 1    182      206.38     218       106.149
Half = 2    184      160.92     155       100.757
[Figure: "Draft Numbers by Half"; boxplots of Draft Numbers for Half = 1 and Half = 2; vertical axis from -50 to 400, horizontal axis Half]
There appears to be a drop off in the draft numbers from the first half of the year to the second.
d.
On average, it would appear that persons born in the second half of the year received a lower draft
number and are therefore more likely to be selected in the draft. This effect is difficult to see until
you create groups which are large enough to compensate for the variability between individual
numbers.
One possible cause could be how the draft numbers are randomized. If the draft numbers are poured
into the drum with the early birth dates being poured in first and the later dates poured in last, the
later birth dates will be located near the top of the pile. Rotating the drum will randomize the order
somewhat, but there will still be a larger proportion of dates near the top of the drum, causing those
dates to receive lower draft numbers.
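The quarterly and half-year averages in parts (b) and (c) are size-weighted averages of the monthly figures in part (a), which is why the trend becomes clearer as groups are pooled. A minimal sketch of that pooling follows; pooled_average is a hypothetical helper name, and the inputs are the rounded monthly values from the table.

```python
# (count, average) per month, copied from the table in part (a).
monthly = [
    (31, 201.48), (29, 202.97), (31, 225.81),
    (30, 203.67), (31, 207.97), (30, 195.73),
    (31, 181.55), (31, 173.45), (30, 157.30),
    (31, 182.45), (30, 148.73), (31, 121.55),
]


def pooled_average(groups):
    """Weighted mean of group averages, weighted by group size."""
    n = sum(count for count, _ in groups)
    return sum(count * avg for count, avg in groups) / n


# Pool consecutive months into quarters (3 months) and halves (6 months).
quarters = [round(pooled_average(monthly[i:i + 3]), 2) for i in range(0, 12, 3)]
halves = [round(pooled_average(monthly[i:i + 6]), 2) for i in range(0, 12, 6)]

print(quarters)
print(halves)
```

Medians and standard deviations cannot be pooled this way; they need the individual draft numbers.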
23.
a.
The boxplot is:
[Figure: boxplots for Hedge Sparrow, Meadow Pipit, Pied Wagtail, Robin, Tree Pipit, and Wren; vertical axis from 19 to 26]
b.
The statistics are:
                 Count    Average    Median    Mode                       Minimum    Maximum    Std. Dev.    Variance
Hedge Sparrow    14       23.1214    23.05     23.05                      20.85      25.05      1.06874      1.14220
Meadow Pipit     45       22.2989    22.25     22.05                      19.65      24.45      0.92063      0.84756
Pied Wagtail     15       22.9033    23.05     (24.05 ; 21.85)            21.05      24.85      1.06762      1.13981
Robin            16       22.5750    22.55     23.05                      21.05      23.85      0.68459      0.46867
Tree Pipit       15       23.0900    23.25     (24.05 ; 23.25)            21.05      24.05      0.90143      0.81257
Wren             15       21.1300    21.05     (22.05 ; 21.05 ; 20.85)    19.85      22.25      0.74374      0.55314