Uploaded by Jennie Roque

EDA - Module 3 - Descriptive Statistics

BULACAN STATE UNIVERSITY
COLLEGE OF ENGINEERING
CIVIL ENGINEERING DEPARTMENT
ENGINEERING DATA ANALYSIS
MODULE 3
COMPILED BY:
MERRICRIS U. PANGILINAN &
JENNIE C. ROQUE
DESCRIPTIVE STATISTICS
1.1 DURATION FOR CHAPTER 3: 6 HRS
1.2 STUDENTS’ SKILLS ACQUISITION:
At the end of this lesson, you will be able to:
1. Calculate the mean, median, and mode for a set of data, and compare these
measures of center.
2. Identify the symbols and know the formulas for sample and population
means.
3. Describe the skewness and the peakedness of the graph.
4. Calculate the standard deviation for grouped and ungrouped data.
5. Calculate the weighted mean, percentiles, and quartiles for a data set.
1.3 WHAT YOU KNOW SO FAR?
The following questions assess what you already know about the subject. Answer as
many as you can. A link to the Google Form will be provided.
1.4. INTRODUCTION
Once data are collected, it is useful to summarize the data set by identifying a value
around which the data are centered. Three commonly used measures of center are
the mode, the median, and the mean.
1.5 LESSON PROPER
1.5.1. MEASURES OF THE CENTER OF THE DATA
1.5.1.A. Measures of Center for Ungrouped Data
The "center" of a data set is also a way of describing location. The two most widely
used measures of the "center" of the data are the mean (average) and the median.
To calculate the mean weight of 50 people, add the 50 weights together and divide
by 50. To find the median weight of the 50 people, order the data and find the
number that splits the data into two equal parts. The median is generally a better
measure of the center when there are extreme values or outliers because it is not
affected by the precise numerical values of the outliers. The mean is the most
common measure of the center.
When some values in the data set repeat, the mean can be calculated by
multiplying each distinct value by its frequency, adding the products, and dividing the
sum by the total number of data values. The symbol for the sample mean is $\bar{x}$
(pronounced "x bar"). The Greek letter $\mu$ (pronounced "mew") represents the
population mean. One of the requirements for the sample mean to be a good
estimate of the population mean is for the sample taken to be truly random.
To see that both ways of calculating the mean are the same, consider the sample:
1; 1; 1; 2; 2; 3; 4; 4; 4; 4; 4
$$\bar{x} = \frac{1+1+1+2+2+3+4+4+4+4+4}{11} = 2.7$$
You may instead consider the number of occurrences of each value:
$$\bar{x} = \frac{1(3) + 2(2) + 3(1) + 4(5)}{11} = 2.7$$
In the second calculation, the frequencies are in parentheses and multiply each
distinct value in the computation of the mean.
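The two calculations above can be checked with a short Python sketch. The variable names are illustrative only; `Counter` is used here simply to tally the frequency of each distinct value.

```python
from collections import Counter

sample = [1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4]

# Way 1: add every value and divide by the number of values.
mean_direct = sum(sample) / len(sample)

# Way 2: multiply each distinct value by its frequency, then divide by n.
freq = Counter(sample)  # value -> number of occurrences
mean_weighted = sum(value * count for value, count in freq.items()) / sum(freq.values())

print(round(mean_direct, 1), round(mean_weighted, 1))  # 2.7 2.7
```

Both routes add up the same total (30) and divide by the same count (11), which is why they agree.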
You can quickly find the location of the median by using the expression
$$\frac{n+1}{2}$$
The letter $n$ is the total number of data values in the sample. If $n$ is an odd number,
the median is the middle value of the ordered data (ordered smallest to largest). If
$n$ is an even number, the median is equal to the two middle values added together
and divided by two after the data have been ordered. For example, if the total number
of data values is 97, then
$$\frac{n+1}{2} = \frac{97+1}{2} = 49$$
The median is the 49th value in the ordered data. If the total number of data values
is 100, then
$$\frac{n+1}{2} = \frac{100+1}{2} = 50.5$$
The median occurs midway between the 50th and 51st values. The location of the
median and the value of the median are NOT the same.
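The difference between the location and the value of the median can be sketched in Python as follows. The function names are hypothetical and the two short lists are made-up illustrations, not data from this module.

```python
def median_location(n):
    """1-based position of the median in ordered data: (n + 1) / 2."""
    return (n + 1) / 2


def median(values):
    """Median value: the middle item, or the average of the two middle items."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the two middle values


print(median_location(97))   # 49.0
print(median_location(100))  # 50.5
print(median([3, 1, 2]))     # 2
print(median([4, 1, 3, 2]))  # 2.5
```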
Examples:
1. Score data for the first quiz in a 50-item Engineering Data Analysis quiz are
as follows (smallest to largest):
3; 4; 8; 8; 10; 11; 12; 13; 14; 15; 15; 16; 16; 17; 17; 18; 21; 22; 22; 24; 24; 25;
26; 26; 27; 27; 29; 29; 31; 32; 33; 33; 34; 34; 35; 37;
40; 44; 44; 47
Calculate the mean and the median.
Solution
The calculation for the mean is
$$\bar{x} = \frac{3 + 4 + 8(2) + 10 + 11 + 12 + 13 + 14 + 15(2) + 16(2) + 17(2) + \cdots + 35 + 37 + 40 + 44(2) + 47}{40} = 23.6$$
To find the median $\tilde{x}$, first find its location:
$$\frac{n+1}{2} = \frac{40+1}{2} = 20.5$$
Starting from the smallest value, the median lies midway between the 20th and 21st
values (the two 24s):
$$\tilde{x} = \frac{24 + 24}{2} = 24$$
2. The following data show the number of months graduates typically wait
before getting hired. The data are ordered from smallest to largest. Calculate
the mean and median.
3; 4; 5; 7; 7; 7; 7; 8; 8; 9; 9; 10; 10; 10; 10; 10; 11; 12; 12; 13; 14; 14; 15; 15;
17; 17; 18; 19; 19; 19; 21; 21; 22; 22; 23; 24; 24; 24;
24
Solution
Mean $\bar{x}$:
$$\bar{x} = \frac{3 + 4 + 5 + 7(4) + 8(2) + 9(2) + 10(5) + 11 + 12(2) + 13 + 14(2) + 15(2) + 17(2) + 18 + 19(3) + 21(2) + 22(2) + 23 + 24(4)}{39} = 13.95$$
Median $\tilde{x}$:
Starting at the smallest value, locate the
$$\frac{39+1}{2} = 20\text{th}$$
term. The 20th term is 13, so the median is $\tilde{x} = 13$.
Mean vs. Median
Both the mean and the median are important and widely used measures of center.
Consider the following example:
Suppose you got an 85 and a 93 on your first two statistics quizzes, but then you had a
really bad day and got a 14 on your next quiz!
The mean of your three grades would be 64. Which is a better measure of your
performance? As you can see, the middle number in the set (the median) is 85. That
middle value does not change whether the lowest grade is an 84 or a 14. However,
when you add the three numbers to find the mean, the sum will be much smaller if
the lowest grade is a 14.
Outliers and Resistance
The mean and the median are so different in the previous example because there is
one grade that is extremely different from the rest of the data. In statistics, we call
such extreme values outliers. The mean is affected by the presence of an outlier;
however, the median is not. A statistic that is not affected by outliers is called
resistant. We say that the median is a resistant measure of center, and the mean
is not resistant. In a sense, the median can resist the pull of a far away value, but
the mean is drawn to such values. It cannot resist the influence of outlier values. As
a result, when we have a data set that contains an outlier, it is often better to use the
median to describe the center, rather than the mean.
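The quiz example can be replayed with Python's standard `statistics` module to see the resistance of the median. The second list is the same set with the low grade changed to 84, as the text suggests; the variable names are illustrative.

```python
from statistics import mean, median

grades = [85, 93, 14]      # quiz grades from the example above
no_outlier = [85, 93, 84]  # same set with the lowest grade changed to 84

print(mean(grades), median(grades))                    # 64 85
print(round(mean(no_outlier), 1), median(no_outlier))  # 87.3 85
```

The mean moves by more than 23 points when the single low grade changes, while the median stays at 85.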
Another measure of the center is the mode. The mode $\hat{x}$ is the most frequent value.
There can be more than one mode in a data set if those values have the same
frequency and that frequency is the highest. A data set with two modes is called
bimodal. The mode can be calculated for qualitative data as well as for quantitative
data. For example, if the data set is: red, red, red, green, green, yellow, purple,
black, blue, the mode is red.
Examples:
1. Statistics exam scores for 20 students are as follows:
50; 53; 59; 59; 63; 63; 72; 72; 72; 72; 72; 76; 78; 81; 83; 84; 84; 84; 90; 93
Find the mode.
Solution
The most frequent score is 72, which occurs five times: $\hat{x} = 72$.
2. The numbers of books checked out from the library by 25 students are as
follows:
0; 0; 0; 1; 2; 3; 3; 4; 4; 5; 5; 7; 7; 7; 7; 8; 8; 8; 9; 10; 10; 11; 11; 12; 12
Find the mode.
Solution
The most frequent number of books is 7, which occurs four times: $\hat{x} = 7$.
3. Five real estate exam scores are 430, 430, 480, 480, 495. The data set is
bimodal because the scores 430 and 480 each occur twice.
When is the mode the best measure of the "center"? Consider a weight loss program
that advertises a mean weight loss of six pounds the first week of the program. The
mode might indicate that most people lose two pounds the first week, making the
program less appealing.
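The mode examples above can be reproduced with `statistics.multimode` (available in Python 3.8 and later), which returns every value tied for the highest frequency. This is only a sketch; the list names are illustrative.

```python
from statistics import multimode

exam_scores = [50, 53, 59, 59, 63, 63, 72, 72, 72, 72, 72,
               76, 78, 81, 83, 84, 84, 84, 90, 93]
real_estate = [430, 430, 480, 480, 495]
colors = ["red", "red", "red", "green", "green",
          "yellow", "purple", "black", "blue"]

print(multimode(exam_scores))  # [72]
print(multimode(real_estate))  # [430, 480]  (bimodal)
print(multimode(colors))       # ['red']     (qualitative data)
```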
Exercises:
The students in a statistics class were asked to report the number of children that
live in their house. The data are recorded below:
1, 3, 4, 3, 1, 2, 2, 2, 1, 2, 2, 3, 4, 5, 1, 2, 3, 2, 1, 2, 3, 6
Find the mean, median and mode.
The Law of Large Numbers and the Mean
The Law of Large Numbers says that if you take samples of larger and larger size
from any population, then the mean $\bar{x}$ of the sample is very likely to get closer and
closer to the population mean $\mu$.
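A small simulation can illustrate this behavior. The sketch below assumes a hypothetical population, the rolls of a fair six-sided die, whose population mean is 3.5; neither the die nor the sample sizes come from this module.

```python
import random


def sample_mean(n, rng):
    """Mean of n simulated rolls of a fair six-sided die (population mean 3.5)."""
    return sum(rng.randint(1, 6) for _ in range(n)) / n


rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (10, 100, 1_000, 100_000):
    # Larger samples tend to give means closer to 3.5.
    print(n, round(sample_mean(n, rng), 3))
```

Individual small samples can still land far from 3.5; the law describes the tendency as the sample size grows.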
1.5.1.B. Measures of Center for Grouped Data
Constructing a Frequency Distribution:
STEPS IN CONSTRUCTING A FREQUENCY DISTRIBUTION:
1. Determine the largest and smallest values in the data.
2. Determine the number of class intervals (k) desired.
Table 3.1. Recommended k values from Juran and Gryna:
Number of observations, n     Recommended no. of classes (k)
20 - 50                       5 or 6
51 - 100                      7
101 - 200                     8
201 - 500                     9
501 - 1000                    10
over 1000                     11 - 20
Sturges offers a mathematical formula:
$$k = 1 + 3.322 \log_{10} n$$
Or, as mentioned before, you may use $k = \sqrt{n}$ and then round to the nearest
whole number, if necessary.
3. Determine the approximate class size (c) [class size is also known as bin size
or class width]:
$$c = \frac{x_{max} - x_{min}}{k}$$
4. Determine the lower and upper limits of the class interval.
5. Write down the class intervals starting with the decided lower and upper class
limits of the first class interval. Add the class size to the lower and upper class
limits to obtain the next class interval, and so on.
6. Determine the number of observations falling under each class interval; that is,
find the class frequency.
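Steps 2 and 3 can be sketched in Python. The constant 3.322 used here is the usual Sturges coefficient (approximately 1 / log10 2); the function names are hypothetical, and the sample inputs (n = 30, largest value 97.6, smallest value 10.3) are taken from the capacitor example that follows.

```python
import math


def sturges_k(n):
    """Number of classes from Sturges' formula, k = 1 + 3.322 log10(n), rounded."""
    return round(1 + 3.322 * math.log10(n))


def sqrt_k(n):
    """Alternative rule of thumb: k = sqrt(n), rounded to a whole number."""
    return round(math.sqrt(n))


def class_size(x_max, x_min, k):
    """Approximate class size c = (x_max - x_min) / k."""
    return (x_max - x_min) / k


print(sturges_k(30), sqrt_k(30))            # 6 5
print(round(class_size(97.6, 10.3, 5), 2))  # 17.46
```

The rules need not agree; for n = 30 they suggest 6 and 5 classes, both within the 5 or 6 recommended in Table 3.1.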
Example:
A random sample of 30 capacitors was taken from the ECE laboratory and
measured. The following data represent the capacitances in $\mu F$.
65.6   83.4   73.3   76.2   97.6   72.1
35.6   63.6   33.2   28.6   41.0   65.4
52.5   56.4   10.3   36.0   64.5   78.5
74.7   52.5   74.7   64.7   83.4   80.1
73.0   49.2   52.7   45.8   45.9   50.2
Summarize the data above using a frequency distribution.
Solution:
A. For frequency distribution
Step 1. Determine the largest and smallest values in the data: $x_{max} = 97.6$ and
$x_{min} = 10.3$.
Step 2. Determine the number of class intervals (k) desired.
Since there are 30 observations (n = 30), let us use k = 5 from Juran’s
recommendation.
Step 3. Determine the approximate class size (c) [class size is also known as
bin size or class width]:
$$c = \frac{x_{max} - x_{min}}{k} = \frac{97.6 - 10.3}{5} = 17.46 \approx 17.5$$
Since the data have one decimal place, we use c with one decimal place.
Step 4 and 5. Determine the lower and upper limits of the class interval. Write
down the class intervals starting with the decided lower and upper class limits
of the first class interval. Add the class size of 17.5 to the lower and upper
class limits to obtain the next class interval, and so on.
Since the smallest value is 10.3, let us use 10.3 as the lower limit of the first class.
Adding the class size of 17.5 to 10.3 gives 27.8, which is the lower class limit of the
next interval (see Table 3.2). The upper limit of the first row must therefore be 27.7,
so that every value falls in exactly one interval.
Table 3.2.
Lower limit   Upper limit
10.3          27.7
27.8          45.2
45.3          62.7
62.8          80.2
80.3          97.7
Step 6. Determine the number of observations falling under each class interval;
that is, find the class frequency (see Table 3.3).
Table 3.3.
Lower limit   Upper limit   Frequency, f
10.3          27.7          1
27.8          45.2          5
45.3          62.7          8
62.8          80.2          13
80.3          97.7          3
                            n = 30
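Steps 4 to 6 can be sketched in Python as a check on the class frequencies. This is an illustration under the choices made above (k = 5, c = 17.5, first lower limit 10.3); the limits are built in integer tenths only to sidestep floating-point drift.

```python
data = [65.6, 83.4, 73.3, 76.2, 97.6, 72.1,
        35.6, 63.6, 33.2, 28.6, 41.0, 65.4,
        52.5, 56.4, 10.3, 36.0, 64.5, 78.5,
        74.7, 52.5, 74.7, 64.7, 83.4, 80.1,
        73.0, 49.2, 52.7, 45.8, 45.9, 50.2]

k, c = 5, 17.5
lower_start = min(data)  # 10.3

# Work in tenths: each class spans `width` tenths, and its upper limit is
# one tenth below the next class's lower limit.
start, width = round(lower_start * 10), round(c * 10)
classes = [((start + i * width) / 10, (start + (i + 1) * width - 1) / 10)
           for i in range(k)]

# Count the observations that fall inside each class interval.
frequencies = [sum(lo <= x <= hi for x in data) for lo, hi in classes]

for (lo, hi), f in zip(classes, frequencies):
    print(f"{lo:5.1f} - {hi:5.1f}   f = {f}")
```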
Class Boundaries
Class boundaries are the true class limits. They are stated to one half of a
measurement unit beyond the class limits, so that NO observed value can fall
exactly on a boundary. The upper class boundary of class $i$ is midway between its
upper limit and the lower limit of the next class:
$$UCB_i = \frac{UL_i + LL_{i+1}}{2}$$
where $i$ is the class or row.
For example, using the previous problem, the first upper class boundary is
$UCB_1 = \frac{27.7 + 27.8}{2} = 27.75$ (see Table 3.4).
Table 3.4.
Alternatively, as discussed in the previous chapter, since 10.3 is the smallest value
and the data have one decimal place, the first lower class boundary is
10.25 (10.3 - 0.05). To get the remaining boundaries, simply add the class size, as
seen in Table 3.5.
Table 3.5.
Lower limit   Upper limit   Lower CB   Upper CB
10.3          27.7          10.25      27.75
27.8          45.2          27.75      45.25
45.3          62.7          45.25      62.75
62.8          80.2          62.75      80.25
80.3          97.7          80.25      97.75
Note that the upper class boundary of a certain row is the same as the lower class boundary of the next row.
Mean, Median and Mode
When only grouped data are available, you do not know the individual data values (you
only know the intervals and interval frequencies); therefore, you cannot compute an
exact mean, median, or mode for the data set. What we must do is estimate the
actual central tendencies using a frequency table as shown above. A frequency table
is a data representation in which grouped data are displayed along with the
corresponding frequencies. We simply need to modify the definitions to fit within the
restrictions of a frequency table. Since we do not know the individual data values, we
can instead use the midpoint of each interval. The midpoint ($x_i$) is
$$x_i = \frac{\text{lower limit} + \text{upper limit}}{2} = \frac{\text{lower boundary} + \text{upper boundary}}{2}$$
We can now modify the mean definition to be
$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$$
where $f_i$ = class frequency,
$x_i$ = class mark or midpoint
the median definition to be
$$\tilde{x} = L_1 + \left[\frac{\frac{n}{2} - (\sum f)_1}{f_{med}}\right] c$$
Where:
$L_1$ = LCB of the median class (the class to which the $\frac{n}{2}$th item
belongs)
$n$ = total frequency
$f_{med}$ = median class frequency
$(\sum f)_1$ = sum of the frequencies of all classes lower than
the median class
$c$ = median class size
and the mode definition to be
$$\hat{x} = L_1 + \left[\frac{\Delta_1}{\Delta_1 + \Delta_2}\right] c$$
Where:
$L_1$ = LCB of the modal class (the class with the highest
frequency)
$\Delta_1$ = excess of the modal class frequency over the
frequency of the next lower class
$\Delta_2$ = excess of the modal class frequency over the
frequency of the next higher class
$c$ = modal class size
Example. Consider again the random sample of 30 capacitors that were
taken from the ECE laboratory and measured. The following data represent
the capacitances in $\mu F$.
65.6   83.4   73.3   76.2   97.6   72.1
35.6   63.6   33.2   28.6   41.0   65.4
52.5   56.4   10.3   36.0   64.5   78.5
74.7   52.5   74.7   64.7   83.4   80.1
73.0   49.2   52.7   45.8   45.9   50.2
Determine the mean, median, and mode.
Solution:
The frequency table, class boundaries, and midpoints ($x_i$) are computed as
discussed previously.
Class interval               Class boundary          Frequency, f   xi
Lower limit   Upper limit    Lower CB   Upper CB
10.3          27.7           10.25      27.75        1              19.0
27.8          45.2           27.75      45.25        5              36.5
45.3          62.7           45.25      62.75        8              54.0
62.8          80.2           62.75      80.25        13             71.5
80.3          97.7           80.25      97.75        3              89.0
                                                     n = 30
A. Mean
$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{(1)19 + (5)36.5 + (8)54 + (13)71.5 + (3)89}{30} = 61$$
B. Median
$$\tilde{x} = L_1 + \left[\frac{\frac{n}{2} - (\sum f)_1}{f_{med}}\right] c$$
First locate the $\frac{n}{2} = \frac{30}{2} = 15$th term. Adding the frequencies from the
first class, 1 + 5 + 8 = 14 and 1 + 5 + 8 + 13 = 27, so the 15th term is in the class
with boundaries 62.75 to 80.25. This is the median class.
$L_1$ = 62.75
$f_{med}$ = 13
$(\sum f)_1$ = 8 + 5 + 1 = 14
$c$ = 17.5 (as computed in the previous problem)
$$\tilde{x} = 62.75 + \left[\frac{15 - 14}{13}\right] 17.5 = 64.10$$
C. Mode
$$\hat{x} = L_1 + \left[\frac{\Delta_1}{\Delta_1 + \Delta_2}\right] c$$
The modal class is the class with the highest frequency (f = 13), with boundaries
62.75 to 80.25.
$L_1$ = 62.75
$\Delta_1$ = 13 - 8 = 5
$\Delta_2$ = 13 - 3 = 10
$c$ = 17.5
$$\hat{x} = 62.75 + \left[\frac{5}{5 + 10}\right] 17.5 = 68.58$$
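The three grouped-data formulas can be evaluated with a short Python sketch using the class boundaries and frequencies of the capacitor example. The function names are hypothetical. For the median, (15 - 14)/13 × 17.5 is about 1.35, so the sketch gives about 64.10.

```python
lower_cb = [10.25, 27.75, 45.25, 62.75, 80.25]  # lower class boundaries
freq = [1, 5, 8, 13, 3]                         # class frequencies
c = 17.5                                        # class size
n = sum(freq)
midpoints = [lcb + c / 2 for lcb in lower_cb]   # 19.0, 36.5, 54.0, 71.5, 89.0


def grouped_mean():
    return sum(f * x for f, x in zip(freq, midpoints)) / n


def grouped_median():
    cumulative = 0  # (sum f)_1: frequencies of the classes below the current one
    for lcb, f in zip(lower_cb, freq):
        if cumulative + f >= n / 2:  # first class that reaches the (n/2)th item
            return lcb + (n / 2 - cumulative) / f * c
        cumulative += f


def grouped_mode():
    i = freq.index(max(freq))  # modal class = highest frequency
    below = freq[i - 1] if i > 0 else 0
    above = freq[i + 1] if i < len(freq) - 1 else 0
    d1, d2 = freq[i] - below, freq[i] - above
    return lower_cb[i] + d1 / (d1 + d2) * c


print(round(grouped_mean(), 2), round(grouped_median(), 2), round(grouped_mode(), 2))
# 61.0 64.1 68.58
```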
1.5.2. SKEWNESS AND THE MEAN, MEDIAN AND MODE
The data in Figure 3.1 can be presented using a histogram. The histogram displays a
symmetrical distribution of data. A distribution is symmetrical if a vertical line can be
drawn at some point in the histogram such that the shape to the left and the right of
the vertical line are mirror images of each other. The mean, the median, and the
mode are each 8 for these data. In a perfectly symmetrical distribution, the
mean and the median are the same. This example has one mode (unimodal), and
the mode is the same as the mean and median. In a symmetrical distribution that
has two modes (bimodal), the two modes would be different from the mean and
median.
x    5    6    7    8    9    10   11
f    1    1    2    3    2    1    1
[Histogram of frequency f against x]
Figure 3.1
The histogram shown in Figure 3.2 for the data is not symmetrical. The right-hand
side seems "chopped off" compared to the left side. A distribution of this type is
called skewed to the left because it is pulled out to the left.
The mean is 8.46, the median is 9, and the mode is 10. Notice that the mean is
less than the median, and they are both less than the mode. The mean and the
median both reflect the skewing, but the mean reflects it more so.
x    5    6    7    8    9    10   11
f    1    1    2    2    2    4    1
[Histogram of frequency f against x]
Figure 3.2
The histogram shown in Figure 3.3 for the data is also not symmetrical. It is skewed
to the right. The mean is 7.53, the median is 7, and the mode is 6. Of the three
statistics, the mean is the largest, while the mode is the smallest. Again, the
mean reflects the skewing the most.
x    5    6    7    8    9    10   11
f    1    4    3    3    2    1    1
[Histogram of frequency f against x]
Figure 3.3
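Formula for measurement of skewness:
The third central moment about the mean determines the symmetry of a distribution:
$$a_3 = \frac{\sum f (x - \bar{x})^3}{(n-1)s^3}$$
A value of $a_3$ near zero indicates a symmetrical distribution, a negative value
indicates skewness to the left, and a positive value indicates skewness to the right.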
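As a rough check, the skewness formula can be applied to the frequencies read from Figures 3.1 to 3.3. The sketch below assumes those frequency tables (x from 5 to 11) and uses a hypothetical helper name; the sign of the result is what matters here.

```python
import math


def skewness(freq_table):
    """a3 = sum f (x - mean)^3 / ((n - 1) s^3) for a {value: frequency} table."""
    n = sum(freq_table.values())
    mean = sum(f * x for x, f in freq_table.items()) / n
    s = math.sqrt(sum(f * (x - mean) ** 2 for x, f in freq_table.items()) / (n - 1))
    return sum(f * (x - mean) ** 3 for x, f in freq_table.items()) / ((n - 1) * s ** 3)


fig_3_1 = {5: 1, 6: 1, 7: 2, 8: 3, 9: 2, 10: 1, 11: 1}  # symmetrical
fig_3_2 = {5: 1, 6: 1, 7: 2, 8: 2, 9: 2, 10: 4, 11: 1}  # skewed to the left
fig_3_3 = {5: 1, 6: 4, 7: 3, 8: 3, 9: 2, 10: 1, 11: 1}  # skewed to the right

for name, table in [("3.1", fig_3_1), ("3.2", fig_3_2), ("3.3", fig_3_3)]:
    print("Figure", name, "a3 =", round(skewness(table), 3))
```

The coefficient comes out zero for Figure 3.1, negative for Figure 3.2, and positive for Figure 3.3, matching the direction in which each histogram is pulled.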
Exercises:
Discuss the mean, median, and mode for each of the following problems. Is there a
pattern between the shape and measure of the center?
1.5.3. MEASURES OF KURTOSIS
Kurtosis is the degree of peakedness of a unimodal distribution. Peakedness is
a comparative measure of the height of the peak of a frequency distribution,
usually taken relative to a normal distribution (a symmetric distribution).
Mesokurtic (normal distribution) – distribution that is neither very peaked nor
flat-topped.
Leptokurtic – distribution having a relatively high peak.
Platykurtic – distribution that is flat-topped.
Moment coefficient of kurtosis:
$$a_4 = \frac{m_4}{s^4}$$
where $m_4$ is the fourth moment about the mean, equal to
$$m_4 = \frac{\sum f (x - \bar{x})^4}{n-1}$$
If $a_4 = 3$, the distribution is mesokurtic (normal peakedness)
$a_4 < 3$, the distribution is platykurtic (low peakedness)
$a_4 > 3$, the distribution is leptokurtic (high peakedness)
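The moment coefficient of kurtosis can be sketched in Python as follows. The two frequency tables are made-up illustrations (one flat-topped, one sharply peaked), not data from this module, and the function names are hypothetical.

```python
def kurtosis(freq_table):
    """a4 = m4 / s^4, with m4 = sum f (x - mean)^4 / (n - 1)."""
    n = sum(freq_table.values())
    mean = sum(f * x for x, f in freq_table.items()) / n
    s2 = sum(f * (x - mean) ** 2 for x, f in freq_table.items()) / (n - 1)
    m4 = sum(f * (x - mean) ** 4 for x, f in freq_table.items()) / (n - 1)
    return m4 / s2 ** 2  # s^4 is the variance squared


def classify(a4, tol=1e-9):
    if abs(a4 - 3) <= tol:
        return "mesokurtic"
    return "leptokurtic" if a4 > 3 else "platykurtic"


flat = {1: 5, 2: 5, 3: 5, 4: 5, 5: 5}      # hypothetical flat-topped data
peaked = {1: 1, 2: 1, 3: 16, 4: 1, 5: 1}   # hypothetical sharply peaked data

for name, table in [("flat", flat), ("peaked", peaked)]:
    a4 = kurtosis(table)
    print(name, round(a4, 2), classify(a4))
```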
1.5.4 MEASURES OF THE SPREAD OF DATA
An important characteristic of any set of data is the variation in the data. In some
data sets, the data values are concentrated closely near the mean; in other data
sets, the data values are more widely spread out from the mean. The most common
measure of variation, or spread, is the standard deviation. The standard deviation
is a number that measures how far data values are from their mean.
• The standard deviation provides a measure of the overall variation in a
data set.
The standard deviation is always positive or zero. The standard deviation is
small when the data are all concentrated close to the mean, exhibiting little
variation or spread. The standard deviation is larger when the data values are
more spread out from the mean, exhibiting more variation.
• The standard deviation can be used to determine whether a data value is
close to or far from the mean.
• Calculating the Standard Deviation
If $x$ is a number, then the difference "$x$ − mean" is called its deviation. In a
data set, there are as many deviations as there are items in the data set. The
deviations are used to calculate the standard deviation. If the numbers belong
to a population, in symbols a deviation is $x - \mu$. For sample data, in symbols
a deviation is $x - \bar{x}$.
The procedure to calculate the standard deviation depends on whether the
numbers are the entire population or are data from a sample. The calculations
are similar, but not identical. Therefore, the symbol used to represent the
standard deviation depends on whether it is calculated from a population or a
sample. The lowercase letter $s$ represents the sample standard deviation and
the Greek letter $\sigma$ (sigma, lowercase) represents the population standard
deviation. If the sample has the same characteristics as the population, then $s$
should be a good estimate of $\sigma$.
To calculate the standard deviation, we need to calculate the variance first.
The variance is the average of the squares of the deviations ($x - \bar{x}$ for a
sample or $x - \mu$ for a population). The symbol $\sigma^2$ represents the population
variance; the population standard deviation $\sigma$ is the square root of the
population variance. The symbol $s^2$ represents the sample variance; the
sample standard deviation $s$ is the square root of the sample variance. You
can think of the standard deviation as a special average of the deviations.
• Variance is defined as the square of the standard deviation.
$v = \sigma^2$ (population variance)
$v = s^2$ (sample variance)
•
Standard deviation of the sample is defined as follows.
$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}$$
For grouped data:
$$s = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{n-1}}$$
Where:
$s$ = standard deviation of the sample
$x_i$ = class mark or midpoint
$f_i$ = frequency
$\bar{x}$ = sample mean
$n$ = number of items in the sample
• Standard deviation of the population is defined as follows.
$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$$
For grouped data:
$$\sigma = \sqrt{\frac{\sum f_i (x_i - \mu)^2}{N}}$$
Where:
$\sigma$ = standard deviation of the population
$x_i$ = class mark or midpoint
$f_i$ = frequency
$\mu$ = population mean
$N$ = number of items in the population
• NOTE: $n - 1$ is used as the denominator so that $s^2$ is an unbiased estimate of the
population variance $\sigma^2$. As $n$ increases, the difference between dividing by $n$ and by
$n - 1$ becomes smaller. If the numbers come from a census of the entire population and not
a sample, when we calculate the average of the squared deviations to find the variance, we
divide by $N$, the number of items in the population. If the data are from a sample rather
than a population, when we calculate the average of the squared deviations, we divide by
$n - 1$, one less than the number of items in the sample.
Examples:
1. Consider the previous capacitor problem from the ECE laboratory, with the
computed tabular data as follows.
Determine the standard deviation of the sample.
Solution:
Determine the $x_i$ (midpoint) of each row as shown in the table, and $\bar{x}$ (mean)
using the formula $\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$. In this problem, $\bar{x} = 61$.
Class interval               Class boundary          Frequency, f   xi
Lower limit   Upper limit    Lower CB   Upper CB
10.3          27.7           10.25      27.75        1              19.0
27.8          45.2           27.75      45.25        5              36.5
45.3          62.7           45.25      62.75        8              54.0
62.8          80.2           62.75      80.25        13             71.5
80.3          97.7           80.25      97.75        3              89.0
                                        ∑            30
Determine the values for the column of $(x_i - \bar{x})^2$ for each row, as
well as the column for $f(x_i - \bar{x})^2$.
Lower CB   Upper CB   Frequency, f   xi     xi − x̄    (xi − x̄)^2   f(xi − x̄)^2
10.25      27.75      1              19.0   -42.0     1764.00      1764.00
27.75      45.25      5              36.5   -24.5     600.25       3001.25
45.25      62.75      8              54.0   -7.0      49.00        392.00
62.75      80.25      13             71.5   10.5      110.25       1433.25
80.25      97.75      3              89.0   28.0      784.00       2352.00
           ∑          30                                           8942.50
Using the formula for the standard deviation of a sample:
$$s = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{8942.5}{30 - 1}} = 17.5602411$$
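The table-based computation can be reproduced in a few lines of Python. This sketch uses the midpoints and frequencies from the table; the population formula is evaluated as well, purely for comparison of the two denominators.

```python
import math

midpoints = [19.0, 36.5, 54.0, 71.5, 89.0]
freq = [1, 5, 8, 13, 3]

n = sum(freq)
mean = sum(f * x for f, x in zip(freq, midpoints)) / n               # 61.0
sum_sq = sum(f * (x - mean) ** 2 for f, x in zip(freq, midpoints))   # 8942.5

s = math.sqrt(sum_sq / (n - 1))  # sample standard deviation (divide by n - 1)
sigma = math.sqrt(sum_sq / n)    # population formula (divide by N), for comparison

print(round(s, 2), round(sigma, 2))  # 17.56 17.27
```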
Explanation of the standard deviation calculation shown in the table
The deviations show how spread out the data are about the mean. The data value of
midpoint 89 is farther from the mean than is the midpoint data value 71.5 which is
indicated by the deviations 28 and 10.5. A positive deviation occurs when the data
value is greater than the mean, whereas a negative deviation occurs when the data
value is less than the mean. The deviation is –42 for the midpoint data value 19.
The standard deviation measures the spread in the same units as the data. The
standard deviation, $s$ or $\sigma$, is either zero or larger than zero. When the standard
deviation is zero, there is no spread; that is, all the data values are equal to each
other. The standard deviation is small when the data are all concentrated close to
the mean, and is larger when the data values show more variation from the mean.
When the standard deviation is a lot larger than zero, the data values are very
spread out about the mean; outliers can make standard deviation very large.
The standard deviation, when first presented, can seem unclear. By graphing your
data, you can get a better "feel" for the deviations and the standard deviation. You
will find that in symmetrical distributions, the standard deviation can be very helpful
but in skewed distributions, the standard deviation may not be much help. The
reason is that the two sides of a skewed distribution have different spreads. In a
skewed distribution, it is better to look at the first quartile, the median, the third
quartile, the smallest value, and the largest value. Because numbers can be
confusing, always graph your data. Display your data in a histogram or a box plot.
Comparing Values from Different Data Sets
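The standard deviation is useful when comparing data values ($x$) that come from
different data sets. If the data sets have different means and standard deviations,
comparing the raw values directly can be misleading. Instead, compare how many
standard deviations each value lies from its own mean.
Sample:
$$x = \bar{x} + zs \qquad z = \frac{x - \bar{x}}{s}$$
Population:
$$x = \mu + z\sigma \qquad z = \frac{x - \mu}{\sigma}$$
The $z$ values are the z-scores; each is the number of standard deviations a value lies
from its mean.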
Example:
Two students, Henry and Josh, from different high schools, wanted to find out who
had the highest GPA when compared to his school. Which student had the highest
GPA when compared to his school?
Student   GWA    School Mean GWA   School Standard Deviation
Henry     2.85   3                 0.7
Josh      77     80                10
Solution
For each student, determine how many standard deviations his GWA is away
from the average for his school (be careful with the sign).
$$z = \frac{x - \bar{x}}{s}$$
For Henry:
$$z = \frac{x - \bar{x}}{s} = \frac{2.85 - 3}{0.7} = -0.21$$
For Josh:
$$z = \frac{x - \bar{x}}{s} = \frac{77 - 80}{10} = -0.3$$
Henry's GPA is 0.21 standard deviations below his school's mean, while Josh's GPA
is 0.3 standard deviations below his school's mean. Henry's z-score of –0.21 is higher
than Josh's z-score of –0.3. For GPA, higher values are better, so we conclude that
Henry has the better GPA when compared to his school.
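Two swimmers, Eileen and Janeth, from different teams, wanted to find out who had
the fastest time for the 50-meter freestyle when compared to her team. Which
swimmer had the fastest time when compared to her team?
Swimmer   Time (s)   Team Mean Time   Team Standard Deviation
Eileen    26.2       27.2             0.8
Janeth    27.3       30.1             1.4
For Eileen:
$$z = \frac{x - \bar{x}}{s} = \frac{26.2 - 27.2}{0.8} = -1.25$$
For Janeth:
$$z = \frac{x - \bar{x}}{s} = \frac{27.3 - 30.1}{1.4} = -2.0$$
Janeth's time is 2.0 standard deviations below her team's mean, while Eileen's time
is 1.25 standard deviations below her team's mean. For swimming times, lower values
are faster, so Janeth had the fastest time when compared to her team.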
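Both comparisons can be sketched with one small helper. The function and dictionary names are illustrative; note that "better" means the larger z-score for grades but the smaller z-score for swimming times.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd


gpa = {"Henry": z_score(2.85, 3, 0.7), "Josh": z_score(77, 80, 10)}
swim = {"Eileen": z_score(26.2, 27.2, 0.8), "Janeth": z_score(27.3, 30.1, 1.4)}

best_gpa = max(gpa, key=gpa.get)   # higher z-score is better for grades
fastest = min(swim, key=swim.get)  # lower z-score is faster for times

print({name: round(z, 2) for name, z in gpa.items()})   # {'Henry': -0.21, 'Josh': -0.3}
print({name: round(z, 2) for name, z in swim.items()})  # {'Eileen': -1.25, 'Janeth': -2.0}
print(best_gpa, fastest)                                # Henry Janeth
```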
The following lists give a few facts that provide a little more insight into what the
standard deviation tells us about the distribution of the data.
• For ANY data set, no matter what the distribution of the data is:
At least 75% of the data is within two standard deviations of the mean.
At least 89% of the data is within three standard deviations of the mean.
At least 95% of the data is within 4.5 standard deviations of the mean.
This is known as Chebyshev's Rule (see Figure 3.5).
• For data having a distribution that is BELL-SHAPED and SYMMETRIC:
Approximately 68% of the data is within one standard deviation of the mean.
Approximately 95% of the data is within two standard deviations of the mean.
More than 99% of the data is within three standard deviations of the mean.
This is known as the Empirical Rule (see Figure 3.4).
It is important to note that this rule only applies when the shape of the
distribution of the data is bell-shaped and symmetric.
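The Chebyshev percentages quoted above follow from the standard statement of Chebyshev's inequality: at least 1 − 1/k² of any data set lies within k standard deviations of the mean, for k > 1. The bound itself is not written out in this module, so the sketch below is offered only as a check on the three figures; the function name is hypothetical.

```python
def chebyshev_min_fraction(k):
    """Lower bound 1 - 1/k^2 on the fraction of data within k standard deviations."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 1 - 1 / k ** 2


for k in (2, 3, 4.5):
    print(k, f"{chebyshev_min_fraction(k):.1%}")
# 2 75.0%
# 3 88.9%
# 4.5 95.1%
```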
1.5.5. MEASURES OF THE LOCATION OF DATA
The common measures of location are quartiles and percentiles; deciles are also
used. Quartiles are special percentiles. The first quartile, $Q_1$, is the same as
the 25th percentile, and the third quartile, $Q_3$, is the same as the 75th percentile. The
median, $\tilde{x}$, is the second quartile, the 50th percentile, and the 5th decile.
To calculate quartiles, deciles, and percentiles for ungrouped data, the data must
be ordered from smallest to largest. Quartiles divide ordered data into quarters.
Deciles divide ordered data into tenths. Percentiles divide ordered data into
hundredths. To score in the 90th percentile of an exam does not mean, necessarily,
that you received 90% on a test. It means that 90% of test scores are the same or
less than your score and 10% of the test scores are the same or greater than your
test score.
Percentiles are mostly used with very large populations. Therefore, if you were to
say that 90% of the test scores are less (and not the same or less) than your score, it
would be acceptable because removing one particular data value is not significant.
To determine percentiles, quartiles, and deciles for grouped data, we use the
formula for the median of grouped data, replacing $\frac{n}{2}$ with the position of the
desired measure: for quartiles, $\frac{1}{4}n$ to $\frac{3}{4}n$; for deciles,
$\frac{1}{10}n$ to $\frac{9}{10}n$; and for percentiles, $\frac{1}{100}n$ to $\frac{99}{100}n$.
$$x = L_1 + \left[\frac{\frac{n}{2} - (\sum f)_1}{f_{obs}}\right] c$$
Where:
$L_1$ = LCB of the observed class (the class to which the observed item belongs)
$n$ = total frequency
$f_{obs}$ = observed class frequency
$(\sum f)_1$ = sum of the frequencies of all classes lower than the observed class
$c$ = observed class size
Example:
Consider the previous capacitor problem from the ECE laboratory, with the computed
tabular data as follows.
Class interval               Class boundary          Frequency, f   xi
Lower limit   Upper limit    Lower CB   Upper CB
10.3          27.7           10.25      27.75        1              19.0
27.8          45.2           27.75      45.25        5              36.5
45.3          62.7           45.25      62.75        8              54.0
62.8          80.2           62.75      80.25        13             71.5
80.3          97.7           80.25      97.75        3              89.0
                                                     30
Determine $Q_1$, $P_{40}$, and $D_8$.
Solution:
For $Q_1$: Locate the $\frac{1}{4}n$ term of the data (this replaces $\frac{n}{2}$ in the formula):
$\frac{1}{4}n = \frac{1}{4}(30) = 7.5$, taken as the 8th term.
From the table above, 1 + 5 = 6 and 1 + 5 + 8 = 14, so the 8th term is in the class
with boundaries 45.25 to 62.75.
$$x_{Q_1} = L_1 + \left[\frac{\frac{1}{4}n - (\sum f)_1}{f_{obs}}\right] c = 45.25 + \left[\frac{8 - (5 + 1)}{8}\right] 17.5 = 49.625$$
For $P_{40}$: Locate the $\frac{40}{100}n$ term of the data (this replaces $\frac{n}{2}$ in the formula):
$\frac{40}{100}n = \frac{40}{100}(30) = 12$, the 12th term.
From the table above, 1 + 5 + 8 = 14, so the 12th term is also in the class with
boundaries 45.25 to 62.75.
$$x_{P_{40}} = L_1 + \left[\frac{\frac{40}{100}n - (\sum f)_1}{f_{obs}}\right] c = 45.25 + \left[\frac{12 - (5 + 1)}{8}\right] 17.5 = 58.375$$
For $D_8$: Locate the $\frac{8}{10}n$ term of the data (this replaces $\frac{n}{2}$ in the formula):
$\frac{8}{10}n = \frac{8}{10}(30) = 24$, the 24th term.
From the table above, 1 + 5 + 8 + 13 = 27, so the 24th term is in the class with
boundaries 62.75 to 80.25.
$$x_{D_8} = L_1 + \left[\frac{\frac{8}{10}n - (\sum f)_1}{f_{obs}}\right] c = 62.75 + \left[\frac{24 - (8 + 5 + 1)}{13}\right] 17.5 = 76.211$$
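The three results can be reproduced with one hypothetical helper that takes the position of the desired term and applies the grouped-data formula. The positions are passed in explicitly: 8 for $Q_1$ (the worked solution takes 7.5 as the 8th term), 12 for $P_{40}$, and 24 for $D_8$.

```python
lower_cb = [10.25, 27.75, 45.25, 62.75, 80.25]  # lower class boundaries
freq = [1, 5, 8, 13, 3]                         # class frequencies
c = 17.5                                        # class size
n = sum(freq)


def grouped_quantile(position):
    """x = L1 + [(position - (sum f)_1) / f_obs] * c for the class holding `position`."""
    cumulative = 0  # (sum f)_1: frequencies of the classes below the current one
    for lcb, f in zip(lower_cb, freq):
        if cumulative + f >= position:  # observed class contains this term
            return lcb + (position - cumulative) / f * c
        cumulative += f
    raise ValueError("position exceeds the total frequency")


q1 = grouped_quantile(8)              # 7.5 taken as the 8th term
p40 = grouped_quantile(40 * n / 100)  # 12th term
d8 = grouped_quantile(8 * n / 10)     # 24th term

print(round(q1, 3), round(p40, 3), round(d8, 3))  # 49.625 58.375 76.212
```

Passing the position as a parameter keeps the rounding decision visible to the caller; the printed $D_8$ is rounded to three decimals, so its last digit can differ from a truncated value.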
1.6. CLASS ASSIGNMENT:
1. Suppose that in a small town of 50 people, one person earns $5,000,000 per year
and the other 49 each earn $30,000. Which is the better measure of the "center": the
mean or the median?
2. Enrique has a 91, 87, and 95 for his statistics grades for the first three quarters.
His mean grade for the year must be a 93 for him to be exempted from taking the
final exam. Assuming grades are rounded following valid mathematical procedures,
what is the lowest whole number grade he can get for the 4th quarter and still be
exempt from taking the exam?
1.7. SUMMARY
The mean and the median can be calculated to help you find the "center" of a data
set. The mean is the best estimate for the actual data set, but the median is the best
measurement when a data set contains several outliers or extreme values. The
mode will tell you the most frequently occurring datum (or data) in your data set. The
mean, median, and mode are extremely helpful when you need to analyze your data,
but if your data set consists of ranges which lack specific values, the mean may
seem impossible to calculate. However, the mean can be approximated if you add
the lower boundary with the upper boundary and divide by two to find the midpoint of
each interval. Multiply each midpoint by the number of values found in the
corresponding range. Divide the sum of these values by the total number of data
values in the set.
Looking at the distribution of data can reveal a lot about the relationship between the
mean, the median, and the mode. There are three types of distributions:
A left (or negatively) skewed distribution has a shape like Figure 3.4a
A symmetrical or normal distribution looks like Figure 3.4b
A right (or positively) skewed distribution has a shape like Figure 3.4c
Figure 3.4
The standard deviation can help you calculate the spread of data. There are different
equations to use depending on whether you are calculating the standard deviation of a
sample or of a population.