Chapter 1 Uncertainty, Randomness and Data

Chapter 2
One Variable Data
Sec. 2.1
How Data Are Described
• Sorting: small to big
• Graphing: visualizing the data and looking for patterns
• Numerical summaries: measures of center and variation
Sec. 2.2
The Graphical Display of Data
1. Dot plots
Example. Given data set: {1, 2, 3, 2, 4, 5, 2, 3, 6, 5}, draw a dot-plot.
[Figure: dot plot with the data values on the x-axis and one dot per occurrence stacked along the y-axis.]
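A dot plot can also be sketched in plain text. The snippet below is a minimal illustration, not part of the original notes: it rotates the plot so that each row is a data value and each `*` is one occurrence. The function name `dot_plot_rows` is a hypothetical helper chosen for this example.

```python
from collections import Counter

def dot_plot_rows(data):
    """Return one text row per distinct value, with one '*' per occurrence."""
    counts = Counter(data)
    return [f"{value} | {'*' * counts[value]}" for value in sorted(counts)]

data = [1, 2, 3, 2, 4, 5, 2, 3, 6, 5]
for row in dot_plot_rows(data):
    print(row)
```

The value 2 occurs three times in the data set, so its row carries three marks.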
2. Stem-leaf
Example 1. Given a set of test scores: {78, 65, 96, 81, 53, 70, 82, 98,
58, 45, 64, 75, 88, 67, 66, 68, 56, 77, 65, 48, 34, 55, 61, 70,
60}, construct a stem-leaf display.
Stem | Leaf
3    | 4
4    | 5 8
5    | 3 5 6 8
6    | 0 1 4 5 5 6 7 8
7    | 0 0 5 7 8
8    | 1 2 8
9    | 6 8
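The construction of a stem-leaf display for two-digit data can be sketched in a few lines. This is an illustrative sketch that assumes the stem is the tens digit and the leaf is the units digit; `stem_leaf` is a hypothetical helper name.

```python
from collections import defaultdict

def stem_leaf(data):
    """Map each stem (tens digit) to its sorted list of leaves (units digits)."""
    stems = defaultdict(list)
    for value in sorted(data):          # sorting first keeps stems and leaves in order
        stems[value // 10].append(value % 10)
    return dict(stems)

scores = [78, 65, 96, 81, 53, 70, 82, 98, 58, 45, 64, 75, 88,
          67, 66, 68, 56, 77, 65, 48, 34, 55, 61, 70, 60]
table = stem_leaf(scores)
for stem, leaves in table.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Counting the leaves is a quick check that no score was dropped: the total should equal the number of scores, 25.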
Example 2. Given the stem-leaf display:
Stem | Leaf
40   | 6 6 8
41   | 0 5
42   | 1 3 4
What is the original data set?
Answer: {406, 406, 408, 410, 415, 421, 423, 424}
3. Histograms
(a). Frequency Table (Using data from example 1 above.)
Class     Frequency   Relative Frequency   Cumulative Relative Frequency
30 – 39   1           1/25                 1/25
40 – 49   2           2/25                 3/25
50 – 59   4           4/25                 7/25
60 – 69   8           8/25                 15/25
70 – 79   5           5/25                 20/25
80 – 89   3           3/25                 23/25
90 – 99   2           2/25                 1
Total     25          1
Lower class limits: 30, 40, …, 90. Upper class limits: 39, 49, …, 99.
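The frequency table can be rebuilt programmatically from the Example 1 scores. The sketch below is an illustration under the assumption of equal-width classes with integer limits; `frequency_table` is a hypothetical helper. It uses `fractions.Fraction`, which prints reduced fractions (for example 5/25 appears as 1/5).

```python
from fractions import Fraction

def frequency_table(data, start, stop, width):
    """Rows of (lower limit, upper limit, frequency, relative freq, cumulative relative freq)."""
    n = len(data)
    rows = []
    cumulative = Fraction(0)
    for lower in range(start, stop, width):
        upper = lower + width - 1
        freq = sum(lower <= x <= upper for x in data)   # count values in the class
        rel = Fraction(freq, n)
        cumulative += rel
        rows.append((lower, upper, freq, rel, cumulative))
    return rows

scores = [78, 65, 96, 81, 53, 70, 82, 98, 58, 45, 64, 75, 88,
          67, 66, 68, 56, 77, 65, 48, 34, 55, 61, 70, 60]
rows = frequency_table(scores, 30, 100, 10)
for lower, upper, freq, rel, cum in rows:
    print(f"{lower}-{upper}: {freq}  {rel}  {cum}")
```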
Bar Graph (There are gaps between bars.):
[Figure: bar graph of frequency (vertical axis, 0 to 9) against score class (horizontal axis: 30 - 39, 40 - 49, 50 - 59, 60 - 69, 70 - 79, 80 - 89, 90 - 99).]
(b). Histogram
[Figure: histogram of frequency (vertical axis, 0 to 9) against score class (horizontal axis: 30 - 39, 40 - 49, 50 - 59, 60 - 69, 70 - 79, 80 - 89, 90 - 99), with no gaps between bars.]
Note: To change the bar graph into a histogram, close the gaps between adjacent
bars: 39 and 40 merge at 39.5, 49 and 50 merge at 49.5 …
The vertical axis can also show relative frequency instead of frequency.
[Figure: histogram of relative frequency (vertical axis, 0 to 0.35) against score class (horizontal axis: 30 - 39, 40 - 49, 50 - 59, 60 - 69, 70 - 79, 80 - 89, 90 - 99).]
Section 2.3
Describe the Center of the Data
1. Arithmetic mean (Average)
Given data set $\{x_1, x_2, \ldots, x_n\}$
Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}$
Population mean: $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$
2. Median (the “middle value” of the data in ascending or descending order)
Steps:
(a). Order the data.
(b). If $n$ is even, then the median is $M = \frac{1}{2}\left(x_{n/2} + x_{(n/2)+1}\right)$.
(c). If $n$ is odd, then the median is $M = x_{(n+1)/2}$.
Examples: Find the median of the given data set.
(1). {1, 6, 11, 9, 2}
Order: $\{x_1, x_2, x_3, x_4, x_5\} = \{1, 2, 6, 9, 11\}$
$n = 5$, odd number $\Rightarrow M = x_{(n+1)/2} = x_3 = 6$
(2). {5, 7, 31, 19, 14, 29, 10, 25, 42, 18}
Order: 5, 7, 10, 14, 18, 19, 25, 29, 31, 42
$n = 10$, even number $\Rightarrow M = \frac{1}{2}\left(x_{n/2} + x_{(n/2)+1}\right) = \frac{1}{2}(x_5 + x_6) = \frac{1}{2}(18 + 19) = 18.5$
3. Mode: the most frequent data value in the data set.
Note: A data set may have more than one mode. If all data values have the
same frequency, the mode is not informative and need not be discussed.
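The two cases in the note (several modes, or no useful mode) can be made concrete with a short sketch. The helper `modes` is hypothetical; returning an empty list when every value is equally frequent is a convention adopted for this example, and empty input is not handled.

```python
from collections import Counter

def modes(data):
    """Return all most-frequent values, or [] if every distinct value is equally frequent."""
    counts = Counter(data)
    highest = max(counts.values())
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []  # all values tie, so the mode carries no information
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([1, 2, 3, 2, 4, 5, 2, 3, 6, 5]))  # one mode
print(modes([1, 1, 2, 2, 3]))                 # two modes
print(modes([1, 2, 3]))                       # no useful mode
```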
4. Midrange: $\frac{1}{2}(\text{smallest} + \text{largest})$
Section 2.4
Describe the Spread of the Data
1. Range of the data set
Range = largest – smallest
• Advantage: simple.
• Disadvantage: it cannot describe the spread of the data well if
there is an extreme value.
• Note the difference between the range of a data set and the range
of a random variable, which is the set of all possible values of
the random variable.
Example 1. Find the range of the given data set.
(a). {1, 3, 4, 6, 7, 8, 10, 11, 13}
(b). {1, 3, 4, 6, 7, 8, 10, 11, 101}
Solution: (a). range $= 13 - 1 = 12$
(b). range $= 101 - 1 = 100$
Note: the range of the second data set does not describe the spread
of the data well, because it is determined by the single extreme value 101.
2. Interquartile Range (IQR)
1st quartile $Q_1$: the number that 25% of the data fall under.
2nd quartile $Q_2$: the number that 50% of the data fall under.
Note: $Q_2 = M$, the median.
3rd quartile $Q_3$: the number that 75% of the data fall under.
Interquartile range: $\mathrm{IQR} = Q_3 - Q_1$
Example 2. Find the quartiles and the IQR.
(a). {1, 3, 4, 6, 8, 9, 11}
(b). {1, 3, 4, 6, 8, 9, 11, 100}
Solution:
(a). $Q_2 = 6$ (the median).
There are two ways to calculate $Q_1$ and $Q_3$, and hence the IQR:
Excluding 6: $Q_1 = 3$, the median of {1, 3, 4}.
$Q_3 = 9$, the median of {8, 9, 11}.
$\mathrm{IQR} = 9 - 3 = 6$.
Including 6: $Q_1 = \frac{1}{2}(3 + 4) = 3.5$, the median of {1, 3, 4, 6}.
$Q_3 = \frac{1}{2}(8 + 9) = 8.5$, the median of {6, 8, 9, 11}.
$\mathrm{IQR} = 8.5 - 3.5 = 5$.
(b). $Q_2 = \frac{1}{2}(6 + 8) = 7$, which is not a member of the data set.
$Q_1 = \frac{1}{2}(3 + 4) = 3.5$, the median of {1, 3, 4, 6}.
$Q_3 = \frac{1}{2}(9 + 11) = 10$, the median of {8, 9, 11, 100}.
$\mathrm{IQR} = 10 - 3.5 = 6.5$.
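Both ways of splitting the data at the median can be expressed in one function. The sketch below is an illustration of the median-of-halves approach used in the solution; the function names and the `include_median` flag are assumptions made for this example. Statistical software often uses other interpolation rules, so its quartiles may differ slightly.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(data, include_median=False):
    """Return (Q1, Q2, Q3) as medians of the lower half, the whole set, and the upper half."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    if n % 2 and include_median:
        # odd n: the middle value belongs to both halves
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        lower, upper = ordered[:half], ordered[n - half:]
    return median(lower), median(ordered), median(upper)

for data in ([1, 3, 4, 6, 8, 9, 11], [1, 3, 4, 6, 8, 9, 11, 100]):
    q1, q2, q3 = quartiles(data)
    print(data, "Q1 =", q1, "Q2 =", q2, "Q3 =", q3, "IQR =", q3 - q1)
```

Adding the extreme value 100 changes the IQR only from 6 to 6.5, whereas the range would jump from 10 to 99.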
3. Standard deviation
The standard deviation is a quantity that describes the spread of the data around
the mean.
Some ideas for describing how the data spread out from the mean:
• How far a value is from the mean, and on which side (left/right): the difference
between a data value and the mean.
• Combined effect: the sum of the differences.
Example 3. Find the sum of the differences from the mean.
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}
Solution: (Treat as a sample data set)
Mean: $\bar{x} = 6$ (verify)
Sum of the differences $= \sum_{i=1}^{11} (i - 6) = 0$ (Note: $x_i = i$ here.)
In fact, for any set $\{x_1, x_2, \ldots, x_n\}$, $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$.
Proof: $\sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \bar{x} = n\bar{x} - n\bar{x} = 0$.
So this calculation does not show the spread of the data from the mean.
How about $\sum_{i=1}^{n} |x_i - \bar{x}|$?
Mathematically, the absolute value is inconvenient to work with.
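A quick numerical check of Example 3 shows why the plain sum of differences is useless as a measure of spread, while the sum of absolute differences is not. This is only an illustrative sketch using the data set {1, 2, ..., 11}.

```python
data = list(range(1, 12))            # {1, 2, ..., 11}
mean = sum(data) / len(data)         # 6.0

deviations = [x - mean for x in data]
total = sum(deviations)                        # positive and negative differences cancel
abs_total = sum(abs(d) for d in deviations)    # absolute differences do not cancel

print(total, abs_total)
```

The first sum is 0 and the second is 30, since the absolute differences are 5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5.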
A better way to describe the deviation:
Definition of the standard deviation/variance:
• Sample standard deviation:
$s = \left[\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2\right]^{1/2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$
• Population standard deviation:
$\sigma = \left[\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2\right]^{1/2} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2}$
(Note: For the same data set, $\sigma \le s$.)
• Sample variance:
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
• Population variance:
$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2$
Example 4. Find the standard deviation of the given data set.
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}
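Solution:
Treat as a sample:
$s = \left[\frac{1}{11-1}\sum_{i=1}^{11} (i - 6)^2\right]^{1/2} = \sqrt{\frac{1}{10} \cdot 110} = \sqrt{11}$. (Note: $\bar{x} = 6$.)
Treat as a population:
$\sigma = \left[\frac{1}{11}\sum_{i=1}^{11} (i - 6)^2\right]^{1/2} = \sqrt{\frac{1}{11} \cdot 110} = \sqrt{10}$.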
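The two definitions differ only in the divisor, which a short sketch makes explicit. The functions `sample_std` and `population_std` are written out here for illustration; Python's standard library offers the same quantities as `statistics.stdev` and `statistics.pstdev`, used below as a cross-check.

```python
import math
import statistics

def sample_std(data):
    """Sample standard deviation: divide the sum of squared deviations by n - 1."""
    m = sum(data) / len(data)
    return math.sqrt(sum((x - m) ** 2 for x in data) / (len(data) - 1))

def population_std(data):
    """Population standard deviation: divide the sum of squared deviations by n."""
    m = sum(data) / len(data)
    return math.sqrt(sum((x - m) ** 2 for x in data) / len(data))

data = list(range(1, 12))                                # {1, 2, ..., 11}
print(sample_std(data), statistics.stdev(data))          # both approximate sqrt(11)
print(population_std(data), statistics.pstdev(data))     # both approximate sqrt(10)
```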
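Remarks: (The sample standard deviation $s$ is used here. The same remarks hold for $\sigma$.)
(a). $s \ge 0$, and $s = 0 \iff x_1 = x_2 = \cdots = x_n = \bar{x}$.
(b). In general, the more spread out the data values are, the larger $s$ is.
(c). $s$ is strongly influenced by extreme values.
(d). For most data sets, at least 90% of the values are in the interval
$(\bar{x} - 3s, \bar{x} + 3s)$.
The 68-95-99 rule, also called the empirical rule, applies to an approximately bell
shaped distribution: about 68% of the values lie in $(\bar{x} - s, \bar{x} + s)$, about 95% in
$(\bar{x} - 2s, \bar{x} + 2s)$, and about 99.7% (often rounded to 99%) in $(\bar{x} - 3s, \bar{x} + 3s)$.
[Figure: bell shaped curve centered at $\bar{x}$, with tick marks at $\bar{x} - 3s$, $\bar{x} - 2s$, $\bar{x} - s$, $\bar{x}$, $\bar{x} + s$, $\bar{x} + 2s$, $\bar{x} + 3s$ and the 68%, 95%, 99% regions marked.]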
Section 2.5
More Graphical Display: Box-Plots
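Min = the minimum value
Max = the maximum value
$Q_1$ = the 1st quartile or lower quartile
$Q_2$ = the 2nd quartile (the median)
$Q_3$ = the 3rd quartile or upper quartile
[Figure: box plot. Whiskers extend from Min to $Q_1$ and from $Q_3$ to Max; the box spans $Q_1$ to $Q_3$ (its length is the IQR) with a line at $Q_2$.]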
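The five numbers a box plot is drawn from can be computed directly. The sketch below is an illustration, with `five_number_summary` as a hypothetical helper; it assumes the quartiles are medians of the lower and upper halves, with the overall median excluded from both halves when the number of values is odd.

```python
import statistics

def five_number_summary(data):
    """Return (Min, Q1, Q2, Q3, Max) using medians of the lower and upper halves."""
    ordered = sorted(data)
    half = len(ordered) // 2
    lower, upper = ordered[:half], ordered[len(ordered) - half:]
    return (ordered[0], statistics.median(lower), statistics.median(ordered),
            statistics.median(upper), ordered[-1])

scores = [78, 65, 96, 81, 53, 70, 82, 98, 58, 45, 64, 75, 88,
          67, 66, 68, 56, 77, 65, 48, 34, 55, 61, 70, 60]
print(five_number_summary(scores))
```

For the 25 test scores this gives Min = 34, Q1 = 57, Q2 = 66, Q3 = 77.5 and Max = 98.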
Section 2.6
Data Transformation
• Linear transformation:
$y = ax + b$, where $a$ and $b$ are fixed real numbers.
Example 1. Transform the given sample data set. Discuss the change of the
mean and the standard deviation.
Sample: { 1, 2, 3, 4, 5}
Transformations:
(1). $y = x + 3$,
(2). $y = 2x$,
(3). $y = 2x + 1$
Solution: $\bar{x} = \frac{1}{5}(1 + 2 + 3 + 4 + 5) = 3$
$s_x = \left\{\frac{1}{4}\left[(1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2\right]\right\}^{1/2} = \sqrt{\frac{5}{2}} = \frac{\sqrt{10}}{2}$
(1). $y = x + 3$
$y$: {4, 5, 6, 7, 8}
$\bar{y} = \frac{1}{5}(4 + 5 + 6 + 7 + 8) = 6$
$s_y = \left\{\frac{1}{4}\left[(4-6)^2 + (5-6)^2 + (6-6)^2 + (7-6)^2 + (8-6)^2\right]\right\}^{1/2} = \frac{\sqrt{10}}{2}$
Note: $\bar{y} = \bar{x} + 3$, $s_y = s_x$.
(2). $y = 2x$
$y$: {2, 4, 6, 8, 10}
$\bar{y} = \frac{1}{5}(2 + 4 + 6 + 8 + 10) = 6$
$s_y = \left\{\frac{1}{4}\left[(2-6)^2 + (4-6)^2 + (6-6)^2 + (8-6)^2 + (10-6)^2\right]\right\}^{1/2} = \sqrt{10}$
Note: $\bar{y} = 2\bar{x}$, $s_y = 2 s_x$.
(3). $y = 2x + 1$
$y$: {3, 5, 7, 9, 11}
$\bar{y} = \frac{1}{5}(3 + 5 + 7 + 9 + 11) = 7$
$s_y = \left\{\frac{1}{4}\left[(3-7)^2 + (5-7)^2 + (7-7)^2 + (9-7)^2 + (11-7)^2\right]\right\}^{1/2} = \sqrt{10}$
Note: $\bar{y} = 2\bar{x} + 1$, $s_y = 2 s_x$.
In general, if the data $X$ are transformed to $Y$ by $y = ax + b$, then $\bar{y} = a\bar{x} + b$, $s_y = |a| s_x$.
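• Standard score ($z$-score)
Definition formula:
For sample: $z = \frac{x - \bar{x}}{s}$
For population: $z = \frac{x - \mu}{\sigma}$
The $z$-score is a special linear transformation:
Sample: $z = \frac{x - \bar{x}}{s} = \frac{1}{s}x - \frac{\bar{x}}{s} = ax + b$, where $a = \frac{1}{s}$, $b = -\frac{\bar{x}}{s}$
Population: $z = \frac{x - \mu}{\sigma} = \frac{1}{\sigma}x - \frac{\mu}{\sigma} = ax + b$, where $a = \frac{1}{\sigma}$, $b = -\frac{\mu}{\sigma}$
The $z$-score is a combination of two simple linear transformations:
Given data $x$, let $y = x - \bar{x} = 1 \cdot x + (-\bar{x})$ and $z = \frac{y}{s} = \frac{1}{s}y + 0$; then $z = \frac{x - \bar{x}}{s}$.
Note:
$\bar{y} = \bar{x} - \bar{x} = 0$, $s_y = s_x = s$.
$\bar{z} = \frac{1}{s}\bar{y} = 0$, $s_z = \frac{1}{s}s_y = \frac{1}{s} \cdot s = 1$.
Thus, $\{x_1, x_2, x_3, \ldots, x_n\} \xrightarrow{\; z_i = (x_i - \bar{x})/s \;} \{z_1, z_2, z_3, \ldots, z_n\}$.
The original data can have any mean and any positive standard deviation: $-\infty < \bar{x} < \infty$, $0 < s$.
However, the set of $z$-scores always has a mean of 0 and a
standard deviation of 1, i.e. $\bar{z} = 0$, $s_z = 1$.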
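The claim that $z$-scores always have mean 0 and standard deviation 1 is easy to verify numerically. The following is a minimal sketch using the sample formula; `z_scores` is a hypothetical helper and it assumes the data are not all equal (so that $s > 0$).

```python
import statistics

def z_scores(data):
    """Standardize data with the sample mean and sample standard deviation."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)      # sample standard deviation (divides by n - 1)
    return [(x - mean) / s for x in data]

z = z_scores([1, 2, 3, 4, 5])
print(z)
print(statistics.mean(z), statistics.stdev(z))   # approximately 0 and 1
```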
The meaning of the $z$-score: the $z$-score gives the number of standard
deviations that a data value $x$ is away from the mean $\bar{x}$.
For example, if $z = 2$, that is $\frac{x - \bar{x}}{s} = 2 \Rightarrow x = \bar{x} + 2s$, which means $x$ is 2
standard deviations to the right of the mean $\bar{x}$.
Example 2. Given the scores of a student’s exams in a class, which one is
his/her best performance relative to the class? Which one is the worst?
Exam   Score   Class average   Standard deviation
1      40      30              20
2      50      53              8
3      60      65              15
4      70      69              10
Solution:
To answer the question we need to compute the standard score of
each exam score.
Exam 1: $z_1 = \frac{40 - 30}{20} = 0.5$, i.e. 0.5 standard deviations to the right of 30.
Exam 2: $z_2 = \frac{50 - 53}{8} = -0.375$, i.e. 0.375 standard deviations to the left of 53.
Exam 3: $z_3 = \frac{60 - 65}{15} \approx -0.333$, i.e. 0.333 standard deviations to the left of 65.
Exam 4: $z_4 = \frac{70 - 69}{10} = 0.1$, i.e. 0.1 standard deviations to the right of 69.
Therefore, the score of exam 1 is the best and the score of exam 2 is the
worst.
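The comparison above can be reproduced in a few lines. This sketch is illustrative only; the dictionary layout (exam number mapped to score, class average, and standard deviation) is an assumption made for the example.

```python
# exam number -> (score, class average, class standard deviation)
exams = {1: (40, 30, 20), 2: (50, 53, 8), 3: (60, 65, 15), 4: (70, 69, 10)}

# standard score of each exam relative to its own class distribution
z = {exam: (score - avg) / sd for exam, (score, avg, sd) in exams.items()}

best = max(z, key=z.get)     # largest z-score
worst = min(z, key=z.get)    # smallest z-score
for exam, value in z.items():
    print(f"Exam {exam}: z = {value:.3f}")
print("best:", best, "worst:", worst)
```

Note that the highest raw score (70 on exam 4) is not the best relative performance, because it is only 0.1 standard deviations above the class average.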