Page | 1 CHAPTER 1 Descriptive Statistics 1.1 Introduction 1.2 Basic

advertisement
P a g e |1
CHAPTER 1
Descriptive Statistics
1.1
1.2
1.3
1.4
1.5
1.6
1.7
1.8
Introduction
Basic concepts
Sampling schemes
Graphical representation of data
Numerical description of data
Computers and statistics
Chapter summary
Computer examples
Projects for Chapter 1
Statistical software R is used for this book. All outputs and codes given are in R. R is a free
statistical software, and it can be downloaded from the website: http://www.r-project.org
P a g e |2
Exercises 1.2
1.2.1
The suggested solutions:
For qualitative data we can have color, sex, race, Zip code and so on. For quantitative data we
can have age, temperature, time, height, weight and so on. For cross section data we can have
school funding for each department in 2000. For time series data we can have the crude oil
price from 1995 to 2008.
1.2.3
The suggested questions can be:
1. What types of data the amounts are?
2. Do these Federal Agency receive the same amount of funding? If not, why?
3. Which Federal Agency should receive more funding? Why?
The suggested inferences we can make are:
1. These Federal Agency get different amount of money.
2. There are big differences between funding the Agencies receive.
Exercises 1.3
1.3.1
Simple Random Sample:
Say we have a population of 1,000 students, and we want a sample of 100 students.
Using software or a random table, we randomly select 100 out of the 1,000 students. We want
the selection probability for all the students to be equal. That is no student is more likely to be
selected than any other student.
Systematic Sample:
Again, we have a population of 1,000 students, and we want a sample of 100 students.
We need the sampling interval k = N/n = 10. Now, we need a random starting point between 1
and k. Let say, we randomly select 4. This gives us the sample: 4, 14, 24, ..., 74, 84, 94. This
sample of numbers will correspond to ordered list of students.
Stratified Sample:
Suppose we decide to sample 100 college students from the population of 1000 ( that is
10% of the population). We know these 1000 students come from three different major, Math,
Computer Science and Social Science. We have Math 200, CS 400 and SS 400 students. Then
we choose 10% of each of them Math 20, CS 40 and SS 40 by using simple random sample
within each major.
Cluster Sample:
Presume we have a population of 1,000 students clustered into 10 departments. For our
sample of students, we will randomly select a subset from the 10 departments. Let say we
P a g e |3
randomly select 3 out 10 departments. Now, all the students on those 3 department become
the sample from the population of students.
Exercises 1.4
1.4.1
(a) Bar graph
Bar graph for the percent of road mileage
35.00%
30.00%
25.00%
C2
20.00%
15.00%
10.00%
5.00%
0.00%
Poor
Mediocre
Fair
C1
Good
Very good
Pie chart of the percent of road mileage
C ategory
Poor
Very good
Good
Fair
Mediocre
(b) Pie chart
1.4.3
(a) Bar graph
Bar graph
40.00%
C2
30.00%
20.00%
10.00%
0.00%
Coal
Natural Gas Nyclear Electric Power
C1
Petrolium
Renewable Energy
P a g e |4
(b) Pareto graph
100
0.8
80
0.6
60
0.4
40
0.2
20
0.0
C1
Pe
Percentage
Percent
Cum %
m
li u
tro
ra
tu
Na
0.40
40.0
40.0
as
lG
al
Co
ec
El
ar
0.23
23.0
63.0
e
cl
Ny
0.22
22.0
85.0
ic
tr
w
Po
er
R
ew
en
le
ab
0.08
8.0
93.0
Percent
Percentage
Pareto graph
1.0
0
gy
er
En
0.07
7.0
100.0
Pie chart of species
species
Category
Coal
Natural Gas
Ny clear Electric Power
Petrolium
Renewable Energy
(c) Pie chart
1.4.5
bar graph
6
5
Count
4
3
2
1
0
A
B
C
C1
(a) Bar graph
D
F
Pie chart
species
Category
A
B
C
D
F
(b) Pie chart
P a g e |5
1.4.7
(a) Pie chart
Pie chart
species
Category
Mining
Construction
Manufacturing
Transportation
Wholesale
Retail
Finance
Services
(b) Bar graph
Bar graph
8000
7000
6000
C2
5000
4000
3000
2000
1000
0
g
in
in
M
n
g
n
io
in
tio
ct
ur
rta
ru
ct
st
fa
po
n
u
s
Co
an
an
M
Tr
W
le
sa
le
ho
Re
il
ta
e
nc
na
Fi
Se
ic e
rv
C1
1.4.9
Bar graph
Bar graph
80
70
60
C2
50
40
30
20
10
0
1900
1960
1980
C1
1990
2000
s
P a g e |6
1.4.11
(a) Bar graph
Bar graph
300
250
C2
200
150
100
50
0
Accidents Chronic
C
Cancer
Diabetes
C1
Heart
Kidney Pneumonia Strok e
Suicide
(b) Pareto graph
Pareto graph
700
100
600
80
60
400
300
40
200
20
100
0
C1
Percentage
Percent
Cum %
a
He
rt
C
268.0
38.7
38.7
r
ce
an
199.4
28.8
67.6
ke
ro
St
ia
on
C
58.5
8.5
76.0
s
es
nt
et
de
ab
ci
Di
Ac
m
eu
Pn
42.3
35.1
6.1
5.1
82.1
87.2
34.5
5.0
92.2
23.9
3.5
95.6
r
he
Ot
30.2
4.4
100.0
1.4.13
Histogram
9
8
7
Frequency
6
5
4
3
2
1
0
60
70
80
C1
90
0
Percent
Percentage
500
P a g e |7
1.4.15
( a ) Stem and leaf
Stem-and-leaf of SAT Mathematics scores N = 20
Leaf Unit = 10
1 4 7
3 4 99
8 5 00011
10 5 22
10 5 4455
6 5 6667
2 5 9
1 6 0
Histogram
5
Frequency
4
3
2
1
0
480
500
520
(b) Histogram
540
C1
560
580
Pie Chart
Category
470-490
490-510
510-530
530-550
550-570
570-590
590-610
(c) Pie chart
1.4.17
Pie Chart
Category
White or European American
Black or African American
Asian American
American Indian or Alaska Native
Native Hawaiian or other Pacific Islander
Some other race
Two or more races
Not Hispanic nor Latino
Non-Hispanic White or European American
Non-Hispanic Black or African American
Non-Hispanic Asian
Non-Hispanic American Indian or Alaska Native
Non-Hispanic Native Hawaiian or other Pacific Islander
Non-Hispanic Some Other Race
Non-Hispanic Two or more races
Hispanic or Latino
White or European American Hispanic
Black or African American Hispanic
American Indian or Alaska Native Hispanic
Asian Hispanic
Some Other Race Hispanic
Two or more races Hispanic
600
P a g e |8
Exercises 1.5
1.5.1
1
7
6

1
0
5

.
.
.

7
89
6
x


1
6
5
.
6
7
1
2
1
7
6

1
6
5
.
6
7
1
0
5

1
6
5
.
6
7
.
.
.

7
8

1
6
5
.
6
7
9
6

1
6
5
.
6
7








s
2
2
2
2
2
1
2

1
s
3
9
8
8
.
4
2
2
s
3
9
8
8
.
4
26

3
.
1
5
1.5.3
Given information: mean=6 , median = 4 , mode = 3
We know that the value 3 can only be in the data twice. If not the median would be different
than 4. This give us the following: 3, 3, x, y. Where x and y are the missing values. We
introduce a system of equation to solve for x and y.
39xy
6
4
xy2
46
xy1
8
y1
85
y1
3
,x
=
5
x3
4
2
x83
x5
Data: 3, 3, 5, 13
1
2
2
2
2
Var 36 36 56 136 

3
1
= 68
3
=22.667
Sd= Var
= 22.667
=4.76
1.5.5
Q1  80
Q3  115
(a)
M  95
IQR  115  80  35
LL  80 1.535  27.5
LL  115 1.535  167.5
P a g e |9
40
60
80
100
120
(b)
(c) There are no outliers.
1.5.7
l
l
l
f
(
m

x
)

f
(
m
)

f
(
x
)

n
x

n
x

0
(a) 


i
i
i
i
i
i

1
i

1
i

1
5
f
m

x

5
2

1
1
.
8

1
4
7

1
1
.
8

1
5
1
2
1
1
.
8

1
0
1
7
1
1
.
8

6
2
2

1
1
.
8











(b)
i i
i

1

5

9
.
8

1
4

4
.
8

1
5
.
2
1
0
5
.
2

6
1
0
.
2










4
9

6
7
.
2

3

5
2

6
1
.
2

0
1.5.9
3
2
x

i
1
0
5
9
.3
6
xi1 
3
3
.1
0
5
3
2
3
2
(a)
132
4
8
8
.3
3
2
2 5
s2 

1
7
7
.0
4
3
xi 33.105 
3
1i1
3
1
ra
n
g
e5
3
.5
0
5
.3
14
8
.1
9
P a g e | 10
24.75  25.44
 25.095
2
42.19  43.25
Q3 
 42.72
2
32  32
 32
(b) M 
2
IQR  42.72  25.095  17.625
Q1 
LL  25.095 1.5 17.625  1.3425
LL  42.72  1.5 17.625  69.1575
10
20
30
40
50
There are no outliers.
(c)
4
0
2
Frequency
6
8
Histogram of y
0
(d)
10
20
30
y
40
50
60
P a g e | 11
(e)
x 33.105
x
s
1
9
.8
0
,4
6
.4
1
21 data point (65.625%) fall within 1 SD, empirical rule = 68%
x
2
s
.4
9
,5
9
.7
2
6
31 data point (96.875%) fall within 2 SD, empirical rule = 95%
x

3
s
6
.8
1
,7
3
.0
2

32 data point (100%) fall within 3 SD, empirical rule = 99.7%
1.5.11
2


1
1
,
0
0
0 2 1
1
1
,
0
0
0
x
 
1
1
0
,
s
 
1
,
9
0
0
,
0
0
0
 

6
9
6
9

6
9
7
1
0
0
1
0
0

1
1
0
0
(a)


s

6
9
6
9
.
6
9
7

8
3
.
4
8
4
7
xs

.
6
8
(
4
0
0
,0
0
0
)
2
7
2
,0
0
0

2
s

.
9
5
(
4
0
0
,0
0
0
)
3
8
0
,0
0
0
(b) x
x

3
s

.
9
9
7
(
4
0
0
,0
0
0
)
3
9
8
,8
0
0
1.5.13
(a)
x 112.33.7433
x
30
i1 i
30
30
30
1
2
s2  
xi 3.7433 3.502
29i1
s 3.5021.871
(b) Frequency table
Class
Interval
1
0-1.6
2
1.7-3.3
3
3.4-5
4
5.1-6.7
5
6.8-8.4
(c) Grouped data:
Frequency
4
10
9
5
2
Mi
.8
2.5
4.2
5.9
7.6
Mi∙fi
3.2
25
37.8
29.5
15.2
4
(
.
8
)

1
0
(
2
.
5
)

9
(
4
.
2
)

5
(
5
.
9
)

2
(
7
.
6
)
x


3
.
6
9
3
0
2
2
2
2
2
21


s

4
0
.
8

3
.
6
9

1
0
2
.
5

3
.
6
9

9
4
.
2

3
.
6
9

5
5
.
9

3
.
6
9

2
7
.
6

3
.
6
93

.
6
2












2
9
s

3
.
6
2

1
.
9
0
The results from the grouped data are similar to the actual data.
P a g e | 12
1.5.15
L  25 fm 139615
w4
Fb 178859
n514661
w
M

L
(.
5
n

F
)

27
.
24822
b
f
m
1.5.17
3
8
1
0

3
1
2
9
.
5

5
9
4
9
.
5

4
5
6
9
.
5

7
8
9
.
5









x


4
4
.
2
7
2
1
8
0
2
2
2
2
2
21


s

3
8
1
0

4
4
.
2
7
2

3
1
2
9
.
5

4
4
.
2
7
2

5
9
4
9
.
5

4
4
.
2
7
2

4
5
6
9
.
5

4
4
.
2
7
2

7
8
9
.
5

4
4
.
2
7
2












4
9

5
3
6
.
1
4
6
s

5
3
6
.
1
4
6

2
3
.
1
5
5
(b) L  40 , f m  59 , w  19 , Fb  69
Download