Organizing Data

Organizing and describing Data
Techniques for continuous variables
Continuous variables are measurements that vary over a continuum (Weight, Blood Pressure, etc.), as opposed to categorical variables (Gender, Religion, Marital Status, etc.).
The Grouped frequency table:
The Histogram
To Construct
• A Grouped frequency table
• A Histogram
1. Find the maximum and minimum of the observations.
2. Choose non-overlapping intervals of equal width (the class intervals) that cover the range between the maximum and the minimum.
3. The endpoints of the intervals are called the class boundaries.
4. Count the number of observations in each interval (the cell frequency, f).
5. Calculate the relative frequency of each interval: relative frequency = f/N, where N is the total number of observations.
Data Set #3
The following table gives data on Verbal IQ, Math IQ, Initial Reading Achievement Score, and Final Reading Achievement Score for 23 students who have recently completed a reading improvement program.

Student  Verbal IQ  Math IQ  Initial Reading Achievement  Final Reading Achievement
1        86         94       1.1                          1.7
2        104        103      1.5                          1.7
3        86         92       1.5                          1.9
4        105        100      2.0                          2.0
5        118        115      1.9                          3.5
6        96         102      1.4                          2.4
7        90         87       1.5                          1.8
8        95         100      1.4                          2.0
9        105        96       1.7                          1.7
10       84         80       1.6                          1.7
11       94         87       1.6                          1.7
12       119        116      1.7                          3.1
13       82         91       1.2                          1.8
14       80         93       1.0                          1.7
15       109        124      1.8                          2.5
16       111        119      1.4                          3.0
17       89         94       1.6                          1.8
18       99         117      1.6                          2.6
19       94         93       1.4                          1.4
20       99         110      1.4                          2.0
21       95         97       1.5                          1.3
22       102        104      1.7                          3.1
23       102        93       1.6                          1.9
Class interval  Verbal IQ  Math IQ
70 to 80        1          1
80 to 90        6          2
90 to 100       7          11
100 to 110      6          4
110 to 120      3          4
120 to 130      0          1
In this example the upper endpoint is included in the
interval. The lower endpoint is not.
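The construction steps can be sketched in Python as follows. This is a minimal illustration, not the method used to prepare the slides: the function name `grouped_frequency_table` and its parameters are hypothetical, and the starting boundary (70), width (10) and number of classes (6) are taken from the table above. The upper endpoint of each interval is included and the lower endpoint is not.

```python
import math

def grouped_frequency_table(data, lower, width, n_classes):
    """Rows (lower boundary, upper boundary, f, f/N) for the classes
    (lower, lower + width], (lower + width, lower + 2*width], ..."""
    counts = [0] * n_classes
    for x in data:
        # ceil places a value lying on a boundary in the class that ends there
        k = math.ceil((x - lower) / width) - 1
        if not 0 <= k < n_classes:
            raise ValueError(f"{x} falls outside the class intervals")
        counts[k] += 1
    n = len(data)
    return [
        (lower + k * width, lower + (k + 1) * width, f, f / n)
        for k, f in enumerate(counts)
    ]

# Verbal IQ and Math IQ columns of Data Set #3
verbal_iq = [86, 104, 86, 105, 118, 96, 90, 95, 105, 84, 94, 119,
             82, 80, 109, 111, 89, 99, 94, 99, 95, 102, 102]
math_iq = [94, 103, 92, 100, 115, 102, 87, 100, 96, 80, 87, 116,
           91, 93, 124, 119, 94, 117, 93, 110, 97, 104, 93]

# Step 1: the extremes tell us which range the class intervals must cover
print(min(verbal_iq), max(verbal_iq), min(math_iq), max(math_iq))

verbal_table = grouped_frequency_table(verbal_iq, lower=70, width=10, n_classes=6)
math_table = grouped_frequency_table(math_iq, lower=70, width=10, n_classes=6)

for lo, hi, f, rel in verbal_table:
    print(f"{lo} to {hi}: f = {f}, f/N = {rel:.3f}")
```

The frequencies in `verbal_table` and `math_table` agree with the two columns of the table above.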
[Figure: Histogram – Verbal IQ; frequency against the class intervals 70 to 80 through 120 to 130]
[Figure: Histogram – Math IQ; frequency against the class intervals 70 to 80 through 120 to 130]
Example
• In this example we are comparing (for two
drugs A and B) the time to metabolize the
drug.
• 120 cases were given drug A.
• 120 cases were given drug B.
• Data on time to metabolize each drug is
given on the next two slides
Drug A
22.6
31.5
7.2
13.0
11.2
6.4
4.8
3.5
11.9
4.1
6.7
6.7
6.0
7.7
11.7
11.7
8.5
30.0
7.2
10.0
17.8
6.3
11.4
6.4
8.1
5.7
3.2
13.4
7.8
16.8
9.0
8.9
10.5
13.1
6.4
21.9
6.3
6.2
5.4
17.2
18.8
7.2
12.9
6.3
13.6
4.3
7.5
14.1
21.9
7.4
8.8
10.5
12.6
14.9
6.2
2.9
5.2
3.8
9.7
19.6
10.5
3.5
12.7
20.1
25.3
11.2
2.0
1.8
22.0
5.1
20.1
7.0
6.0
8.0
6.0
3.8
13.6
8.5
9.8
33.5
6.5
4.7
5.3
7.4
2.5
18.7
5.6
2.3
7.9
6.8
12.3
10.1
14.9
19.2
10.8
9.3
14.9
11.8
12.7
1.5
11.8
5.1
18.0
4.1
9.0
6.5
15.4
3.9
4.8
6.3
4.3
17.4
11.3
2.7
30.0
3.1
10.9
3.3
28.3
6.4
Drug B
4.2
10.4
8.2
13.4
10.5
5.6
19.0
4.5
25.9
3.2
5.5
7.8
6.0
5.3
4.8
10.8
5.4
25.2
6.6
5.1
12.8
5.4
6.0
4.3
6.0
7.3
5.9
10.2
10.4
2.7
4.6
3.5
2.9
3.0
4.6
13.4
8.3
2.9
15.1
4.0
3.2
5.0
4.9
2.7
14.3
9.6
10.6
2.8
12.9
4.2
2.7
5.4
4.4
5.7
7.7
5.8
4.1
11.5
12.3
5.1
7.8
5.1
5.9
10.3
12.4
4.7
6.3
9.4
4.5
3.3
7.5
12.6
4.1
3.0
4.8
5.3
9.3
8.8
10.9
7.4
3.2
5.1
17.0
20.9
8.1
4.8
9.3
24.1
2.6
13.7
5.1
8.8
5.0
9.7
4.1
7.7
8.3
5.9
6.0
16.0
8.8
14.1
2.5
15.3
5.2
7.8
11.4
9.2
10.6
3.7
5.0
8.5
12.1
8.5
6.9
12.1
8.0
4.1
2.3
2.8
Grouped frequency tables
Class interval  Drug A  Drug B
0 to 4          15      19
4 to 8          43      54
8 to 12         26      26
12 to 16        15      15
16 to 20        9       2
20 to 24        6       1
24 to 28        1       3
28 to 32        4       0
32 to 36        1       0
36 to 40        0       0
40 to 44        0       0
44 to 48        0       0
[Figure: Histogram – drug A (time to metabolize); frequency (0 to 60) against the class intervals 0 to 4 through 44 to 48]
[Figure: Histogram – drug B (time to metabolize); frequency (0 to 60) against the class intervals 0 to 4 through 44 to 48]
To draw - A Histogram
• Draw a vertical bar above each class interval, with height proportional to either the cell frequency (f) or the relative frequency (f/N).
[Figure: vertical axis: frequency (f) or relative frequency (f/N); horizontal axis: class interval]
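The drawing rule can be illustrated with a small text rendering. This is only a sketch: the helper `text_histogram` and its parameters are hypothetical, the bars are drawn horizontally with `#` characters rather than vertically, and the frequencies are the Verbal IQ column of the grouped frequency table.

```python
def text_histogram(labels, frequencies, relative=False, bar_width=40):
    """One text bar per class interval, with length proportional to f or to f/N."""
    n = sum(frequencies)
    heights = [f / n for f in frequencies] if relative else list(frequencies)
    tallest = max(heights)
    lines = []
    for label, h in zip(labels, heights):
        # scale so that the tallest bar is bar_width characters long
        bar = "#" * round(bar_width * h / tallest) if tallest else ""
        value = f"{h:.3f}" if relative else str(h)
        lines.append(f"{label:>10} | {bar} {value}")
    return lines

labels = ["70 to 80", "80 to 90", "90 to 100",
          "100 to 110", "110 to 120", "120 to 130"]
verbal_frequencies = [1, 6, 7, 6, 3, 0]

by_frequency = text_histogram(labels, verbal_frequencies)
by_relative_frequency = text_histogram(labels, verbal_frequencies, relative=True)
print("\n".join(by_frequency))
```

Plotting f or f/N changes only the scale of the axis; the bars keep the same relative heights, so the shape of the histogram is the same.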
Some comments about histograms
• The width of the class intervals should be chosen so that the number of intervals with a frequency less than 5 is small.
• This means that the width of the class intervals can decrease as the sample size increases.
• If the width of the class intervals is too small, the frequency in each interval will be either 0 or 1, and the histogram becomes a row of isolated bars of height 1.
• If the width of the class intervals is too large, one class interval will contain all of the observations, and the histogram becomes a single bar.
• Ideally one wants the histogram to appear as in the figure below.
• This will be achieved by making the width of the class intervals as small as possible while allowing only a few intervals to have a frequency less than 5.
[Figure: histogram with suitably chosen class intervals; vertical axis: frequency (0 to 80)]
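The effect of the class width can be explored numerically. The sketch below is an illustration under stated assumptions: the helper names `class_frequencies` and `sparse_classes` are hypothetical, the class boundaries are placed at multiples of the width, the upper endpoint of each class is included, and the data are the 23 Verbal IQ scores of Data Set #3.

```python
import math
from collections import Counter

def class_frequencies(data, width):
    """Frequencies of the classes (k*width, (k+1)*width] spanning the data,
    with empty classes between the smallest and largest value included."""
    index = Counter(math.ceil(x / width) - 1 for x in data)
    return [index.get(k, 0) for k in range(min(index), max(index) + 1)]

def sparse_classes(data, width, threshold=5):
    """Number of class intervals whose frequency is below the threshold."""
    return sum(1 for f in class_frequencies(data, width) if f < threshold)

verbal_iq = [86, 104, 86, 105, 118, 96, 90, 95, 105, 84, 94, 119,
             82, 80, 109, 111, 89, 99, 94, 99, 95, 102, 102]

for width in (1, 5, 10, 20, 200):
    freqs = class_frequencies(verbal_iq, width)
    print(f"width {width:>3}: {len(freqs)} classes, "
          f"{sparse_classes(verbal_iq, width)} with f < 5")
```

With a width of 1 most classes hold no observation or a single one, and with a width of 200 all 23 observations fall in one class. With so few observations no width avoids sparse classes entirely, which is why narrower intervals only become practical as the sample size grows.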
• As the sample size increases the histogram will
approach a smooth curve.
• This is the histogram of the population
[Figures: histograms for samples of size N = 25, N = 100, N = 500 and N = 2000, followed by the smooth population curve for N = ∞; horizontal axis from about 50 to 150]
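A small simulation can illustrate how the histogram settles toward the population curve as N grows. Everything specific here is an assumption made for the sketch: the population is taken to be normal with mean 100 and standard deviation 15, the class intervals have width 10 and run from 50 to 150, and the function names are hypothetical. Because the simulated values are continuous, it does not matter which endpoint of a class is included.

```python
import math
import random

def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal population."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def relative_frequencies(sample, lower, width, n_classes):
    """f/N for the classes starting at lower, each of the given width."""
    counts = [0] * n_classes
    for x in sample:
        k = math.floor((x - lower) / width)
        if 0 <= k < n_classes:          # values outside the range are not counted
            counts[k] += 1
    return [f / len(sample) for f in counts]

def max_gap(n, mean=100.0, sd=15.0, lower=50.0, width=10.0, n_classes=10, seed=0):
    """Largest difference between the sample's relative frequencies and the
    population proportions in the same class intervals."""
    rng = random.Random(seed)
    sample = [rng.gauss(mean, sd) for _ in range(n)]
    observed = relative_frequencies(sample, lower, width, n_classes)
    expected = [
        normal_cdf(lower + (k + 1) * width, mean, sd)
        - normal_cdf(lower + k * width, mean, sd)
        for k in range(n_classes)
    ]
    return max(abs(o - e) for o, e in zip(observed, expected))

for n in (25, 100, 500, 2000):
    print(f"N = {n:5d}: largest gap = {max_gap(n):.4f}")
```

The printed gap tends to shrink as N increases, although for any single random sample the decrease need not be strictly monotone.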
Comment: the proportion of area under a
histogram between two points estimates the
proportion of cases in the sample (and the
population) between those two values.
Example: The following histogram displays the birth weight (in kg) of n = 100 births
[Figure: histogram of birth weight with the frequency printed on each bar; class boundaries 0.085, 0.113, 0.142, 0.17, 0.198, 0.227, 0.255, 0.283, 0.312, 0.34, 0.369, 0.397, 0.425, 0.454, 0.482]
Find the proportion of births that have a
birthweight less than 0.34 kg.
Proportion = (1+1+3+10+11+19+17)/100 = 0.62
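The same arithmetic as a short Python sketch. The variable names are illustrative; the frequencies are the ones used in the calculation above for the class intervals lying below 0.34.

```python
from itertools import accumulate

# Frequencies of the class intervals below 0.34, as used in the calculation above
frequencies_below_cutoff = [1, 1, 3, 10, 11, 19, 17]
n = 100  # total number of births

# Running totals give the proportion below each successive class boundary
cumulative_proportions = [total / n for total in accumulate(frequencies_below_cutoff)]

proportion = sum(frequencies_below_cutoff) / n
print(proportion)  # 0.62
```

Because all class intervals have the same width, the share of the total area to the left of 0.34 equals this share of the total frequency.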
The Characteristics of a Histogram
• Central Location (average)
• Spread (Variability, Dispersion)
• Shape
[Figure: Central Location]
[Figure: Spread, Dispersion, Variability]
[Figure: Shape – Bell Shaped (Normal)]
[Figure: Shape – Positively skewed]
[Figure: Shape – Negatively skewed]
[Figure: Shape – Platykurtic]
[Figure: Shape – Leptokurtic]
[Figure: Shape – Bimodal]
The Stem-Leaf Plot
An alternative to the histogram
Each number in a data set can
be broken into two parts
– A stem
– A Leaf
Example
Verbal IQ = 84
– Stem = tens digit = 8
– Leaf = units digit = 4
Example
Verbal IQ = 104
– Stem = number of tens = 10
– Leaf = units digit = 4
To Construct a Stem-Leaf diagram
• Make a vertical list of “all” stems
• Then behind each stem make a horizontal list of each leaf
Example
The data on N = 23 students
Variables
• Verbal IQ
• Math IQ
• Initial Reading Achievement Score
• Final Reading Achievement Score
We now construct a stem-leaf diagram of Verbal IQ
A vertical list of the stems
8
9
10
11
12
We now list the leaves behind each stem
86 104 86 105 118 96 90 95 105 84
94 119 82 80 109 111 89 99 94 99
95 102 102
8   664209
9   60549495
10  455922
11  891
12
The leaves may be arranged in order
8   024669
9   04455699
10  224559
11  189
12
The stem-leaf diagram is equivalent to a histogram
Rotating the stem-leaf diagram we have
[Figure: the stem-leaf diagram rotated, with 80, 90, 100, 110, 120 along the horizontal axis]
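The construction of the ordered diagram can be sketched in Python. The function name `stem_leaf` and its `stems` parameter are hypothetical; the data are the Verbal IQ scores, and the stems 8 to 12 are passed explicitly so that the empty stem 12 appears as it does above.

```python
def stem_leaf(values, stems=None):
    """Map each stem to its units-digit leaves, sorted and joined into a string."""
    split = [divmod(v, 10) for v in values]   # 84 -> (8, 4), 104 -> (10, 4)
    if stems is None:
        # default: every stem from the smallest to the largest one observed
        stems = range(min(s for s, _ in split), max(s for s, _ in split) + 1)
    diagram = {stem: [] for stem in stems}
    for stem, leaf in split:
        diagram[stem].append(leaf)
    return {stem: "".join(str(leaf) for leaf in sorted(leaves))
            for stem, leaves in diagram.items()}

verbal_iq = [86, 104, 86, 105, 118, 96, 90, 95, 105, 84, 94, 119,
             82, 80, 109, 111, 89, 99, 94, 99, 95, 102, 102]

diagram = stem_leaf(verbal_iq, stems=range(8, 13))
for stem, leaves in diagram.items():
    print(f"{stem:>2}  {leaves}")

# Rotating the diagram: the number of leaves on each stem is a histogram frequency
frequencies = {10 * stem: len(leaves) for stem, leaves in diagram.items()}
print(frequencies)
```

The leaf counts correspond to the classes 80 to 89, 90 to 99 and so on, so they differ slightly from the earlier grouped frequency table, where the upper endpoint (for example 90) belonged to the lower class.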
The two part stem leaf diagram
Sometimes you want to break the stems into two parts
* for leaves 0,1,2,3,4
(no symbol) for leaves 5,6,7,8,9
Stem-leaf diagram for Initial Reading Achievement
1. 01234444455556666677789
2. 0
This diagram as it stands does not
give an accurate picture of the
distribution
We try breaking the stems into
two parts
1.* 012344444
1. 55556666677789
2.* 0
2.
The five-part stem-leaf diagram
If the two part stem-leaf diagram is not adequate you can break the stems into five parts
* for leaves 0,1
t for leaves 2,3
f for leaves 4,5
s for leaves 6,7
(no symbol) for leaves 8,9
We try breaking the stems into
five parts
1.* 01
1.t 23
1.f 444445555
1.s 66666777
1. 89
2.* 0
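Splitting the stems can be sketched with one function that takes the symbol for each leaf digit. This is an illustration only: the names `TWO_PART`, `FIVE_PART` and `split_stem_leaf` are hypothetical, rows are labelled in the style used above ("1.*", "1.t", "1."), parts with no leaves are simply skipped, and the data are the Final Reading Achievement scores of Data Set #3 rather than the Initial scores.

```python
# Symbol attached to the stem for each leaf digit 0..9 (empty string = no symbol)
TWO_PART = ["*", "*", "*", "*", "*", "", "", "", "", ""]
FIVE_PART = ["*", "*", "t", "t", "f", "f", "s", "s", "", ""]

def split_stem_leaf(values, symbols):
    """Stem-leaf rows for values with one decimal place; the row label is the
    stem, a decimal point, and the symbol of the part the leaf belongs to."""
    rows = {}
    for v in sorted(values):
        stem, leaf = divmod(round(v * 10), 10)   # 1.7 -> (1, 7)
        label = f"{stem}.{symbols[leaf]}"
        rows[label] = rows.get(label, "") + str(leaf)
    return rows

final_ra = [1.7, 1.7, 1.9, 2.0, 3.5, 2.4, 1.8, 2.0, 1.7, 1.7, 1.7, 3.1,
            1.8, 1.7, 2.5, 3.0, 1.8, 2.6, 1.4, 2.0, 1.3, 3.1, 1.9]

two_part = split_stem_leaf(final_ra, TWO_PART)
five_part = split_stem_leaf(final_ra, FIVE_PART)

for label, leaves in two_part.items():
    print(f"{label:<4}{leaves}")
```

Passing a different symbol list is enough to switch between the two-part and five-part diagrams, since only the grouping of leaf digits changes.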
Stem leaf Diagrams
Verbal IQ, Math IQ, Initial RA, Final RA
Some Conclusions
• Math IQ, Verbal IQ seem to have
approximately the same distribution
• “bell shaped” centered about 100
• Final RA seems to be larger than initial RA
and more spread out
• Improvement in RA
• Amount of improvement quite variable
Next Topic
• Numerical Measures - Location