Descriptive statistics

Descriptive Statistics
BAN 530
TERMINOLOGY
Definition: A population is the set of all elements of interest (note: the population size is usually
denoted by an upper case “N”.).
Definition: A sample is a subset of the population (note: the sample size is usually denoted by a
lower case “n”.).
Definition: A parameter is a characteristic (usually numeric) of the population.
Definition: A statistic is a characteristic (usually numeric) of the sample.
BASIC CONCEPT
The field of statistics can be divided into two related areas: descriptive statistics and inferential
statistics. Descriptive statistics consists of numerical and graphical procedures that allow for the
organization of information (data) and the extraction of key characteristics from the data.
Inferential statistics consists of techniques designed to use sample statistics to make estimates of
and decisions about population parameters. Probability is used to measure the error rates and/or
level of confidence associated with statistical decisions.
We use a sample because it is usually impossible or impractical to measure each and every
element of the population. The parameter(s) is (are) what we want to know. The only way to
know the value of the parameter is to measure each and every element of the population without
error. Thus, it is rare when we know the value of the parameter. We use the information in the
sample to describe and make decisions about the population. We use statistics to describe,
estimate, and make decisions about parameters. It is the parameter we want to know; it is the
statistic that we “settle for.” Statistics represent our “best guess” for the value of the parameter.
The value of the statistic is a function of the sample. Different researchers, taking separate
samples from the same population, are likely to obtain different values for the statistic. Statistics
are used as a decision-making tool. The use of statistics does not guarantee correct answers.
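The idea that separate samples from one population give different values of the statistic can be illustrated with a small simulation. This is only a sketch: the population (the integers 1 through 100), the sample size of 10, and the seed are hypothetical choices made for illustration and are not part of the course data.

```python
import random
from statistics import mean

# Hypothetical population: the integers 1..100 (illustrative only).
population = list(range(1, 101))
mu = mean(population)  # the parameter; known here only because we built the population

rng = random.Random(530)  # fixed seed so the run is repeatable

# Five "researchers" each draw their own sample of n = 10 without replacement
# and compute the sample mean (the statistic).
sample_means = [mean(rng.sample(population, 10)) for _ in range(5)]

print("population mean (parameter):", mu)
print("sample means (statistics):  ", sample_means)
```

Each sample mean is a "best guess" for the population mean of 50.5, and the guesses generally differ from one sample to the next.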
TYPES OF DATA
There are two basic types of data: quantitative and qualitative. Quantitative data categorizes a
response by a numeric attribute. Qualitative data categorizes a response by a non-numeric
attribute. Numeric refers to the “natural state” of the measurement. A number is not necessarily
numeric. Rather, a number can be a symbol representing a non-numeric characteristic. To
determine if a number is numeric, answer the following question. Can the number be replaced
with a letter, word, or symbol with no loss of information? If the answer is yes, the number is
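non-numeric; if the answer is no, the number is numeric. For example, the numbers on athletes'
jerseys could be replaced with letters with no loss of information, so they are non-numeric.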
A further breakdown of the types of data is the scale of measurement. There are four scales of
measurement. Listed in order of the least amount of structure to the most, the scales are: (1)
nominal, (2) ordinal, (3) interval, and (4) ratio. Nominal scaled data categorizes a response by a
non-numeric, non-ordered attribute (e.g., religious preference, ethnicity). Ordinal scaled data
categorizes a response by an ordered non-numeric attribute (e.g., military rank, Likert scale).
Interval scaled data categorizes a response by a numeric attribute with no natural “zero” point.
Thus, differences have meaning for interval scaled data but ratios do not (e.g., temperature in
degrees Celsius or Fahrenheit). Ratio scaled data is numeric and has a natural “zero” point.
Thus, ratios have meaning. Most numeric data is ratio scaled (e.g., time to complete a project,
number of defective items produced).
KEY PARAMETERS AND STATISTICS
There are a wide variety of parameters and statistics. In general, Greek letters (e.g., $\mu$, $\pi$, $\sigma^2$, $\sigma$)
will be used to represent parameters and Latin letters (e.g., $\bar{x}$, $p$, $s^2$, $s$) will be used to represent
statistics. In some cases a Latin letter will be used to represent a parameter and the same Latin
letter with a “hat” (^) over it will be used to represent the statistic (e.g., in some books the
population proportion (the parameter) is denoted by $p$ and the sample proportion (the statistic) is
denoted by $\hat{p}$).
MEASURES OF CENTRAL TENDENCY
Measures of central tendency attempt to describe the typical response (note that the assumption is
made the typical values fall in the “middle” region of a distribution). Two of the most commonly
used measures of central tendency are the mean and the median.
Mean
For notation purposes we use the Greek letter µ (“mu”) to represent the population mean and a
Latin letter with a “bar” over it to represent the sample mean (e.g., $\bar{x}$, $\bar{y}$, $\bar{q}$). The mean is also
referred to as the expected value, which is denoted as E(X) where X represents the characteristic
being measured. How the mean is computed is dependent on the nature of the data. The data can
be presented in one of two forms, raw observations or a table of frequencies.
If the data is a set of observations, the sample mean is computed as the sum of the observations
divided by the number of observations.
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{\sum x}{n}$$
Note that the indices were dropped from the summation notation in the last form of the formula.
While technically incorrect, it is this last form that will be used. It is assumed that we will use
the entire data set and, thus, we drop the subscript on “x” and the indices on the summation
notation.
© Mitchell J. Muehsam, Ph.D., January 2003
If the data is a table of frequencies (or relative frequencies) the sample mean is computed as:
$$E(X) = \sum x f(x) = \sum x P(x)$$
where f(x) is the relative frequency for the specific value of X and P(x) is the probability that the
specific value of X will occur. Another notation for P(x) is P(X = x), where the uppercase X
represents the characteristic being measured (e.g., number of defective products) and the
lowercase x represents specific values that may be obtained upon measuring the characteristic
(e.g., 0, 1, etc.).
Example: Suppose a sample of five students yielded the following number of credit hours each
student is taking in the 2003 spring semester.
12, 9, 16, 15, 12
The sample mean is:
$$\bar{x} = \frac{\sum x}{n} = \frac{12 + 9 + 16 + 15 + 12}{5} = \frac{64}{5} = 12.8$$
On average, the five sampled students are taking 12.8 credit hours during the 2003 spring
semester.
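The same computation can be sketched in a few lines of Python. The variable names are illustrative; the data are the five credit-hour values from the example.

```python
# Credit hours for the five sampled students (from the example above).
credit_hours = [12, 9, 16, 15, 12]

n = len(credit_hours)            # sample size, n = 5
x_bar = sum(credit_hours) / n    # sum of the observations divided by n

print(x_bar)
```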

Example: Suppose the table below represents the relative frequency of the number of defective
items produced per day for a sample of 25 days. Compute the expected number (mean) of defective
items produced per day.

| D, Number of Defective Items | f, Frequency | Relative Freq., f/n |
|---|---|---|
| 0 | 7 | 7/25 = 0.28 |
| 1 | 9 | 9/25 = 0.36 |
| 2 | 5 | 5/25 = 0.20 |
| 3 | 3 | 3/25 = 0.12 |
| 4 | 1 | 1/25 = 0.04 |

The expected number of defective items is:
$$E(D) = \sum d\, f(d) = 0(0.28) + 1(0.36) + 2(0.20) + 3(0.12) + 4(0.04) = 1.28$$
On average, 1.28 defective items are produced per day.
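As a rough check on the arithmetic, the expected value can be computed from the frequency table in Python. The dictionary layout and names below are illustrative choices; the counts are those given in the example.

```python
# Frequency table from the example: number of defective items -> number of days.
freq = {0: 7, 1: 9, 2: 5, 3: 3, 4: 1}

n_days = sum(freq.values())                           # 25 sampled days
rel_freq = {d: f / n_days for d, f in freq.items()}   # f(d) = f / n

# E(D) = sum of d * f(d) over all observed values d.
expected_d = sum(d * p for d, p in rel_freq.items())

print(round(expected_d, 2))
```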
Notes:
(1) The mean is valid only for numeric data.
(2) The mean physically represents the center of gravity.
(3) The mean is sensitive to extreme values and is pulled toward the extreme values.
Median
The sample median, denoted with an uppercase M or a Latin letter with a tilde (~) over it (e.g., $\tilde{y}$),
is the middle value in an ordered data set. The median can be found using the following
procedure:
(1) Order the data.
(2) Compute the location of the median (denoted by the letter “i”), where $i = \frac{n + 1}{2}$.
(3) If “i” is an integer, the median is the value located at the ith position.
If “i” is not an integer, the median is the mean of the two values surrounding the ith
position.

Example: Find the median for the following two data sets.
Data set 1: 12, 9, 16, 15, 13
Data set 2: 12, 9, 16, 15, 13, 8
Data set 1: 12, 9, 16, 15, 13
(1) Order the data: 9, 12, 13, 15, 16
(2) Compute the location of the median: $i = \frac{n + 1}{2} = \frac{5 + 1}{2} = 3$
(3) The median is the value at the 3rd ordered position, M = 13
Data set 2: 12, 9, 16, 15, 13, 8
(1) Order the data: 8, 9, 12, 13, 15, 16
(2) Compute the location of the median: $i = \frac{n + 1}{2} = \frac{6 + 1}{2} = 3.5$
(3) The median is the mean of the two values surrounding position 3.5 (the 3rd and 4th ordered
values), M = (12 + 13)/2 = 12.5
Notes:
(1) The median is valid for ordinal, interval, and ratio scaled data.
(2) The median is relatively stable in the presence of extreme values.
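The three-step median procedure can be sketched as a small Python function. The function name `median_by_position` is a hypothetical helper introduced here for illustration. The last lines use a made-up extreme value (60 in place of 16) to illustrate note (2); that modified data set is not from the examples above.

```python
def median_by_position(data):
    """Median via the location rule i = (n + 1) / 2 described in the notes."""
    ordered = sorted(data)            # step (1): order the data
    n = len(ordered)
    i = (n + 1) / 2                   # step (2): location of the median
    if i.is_integer():                # step (3): i is an integer
        return ordered[int(i) - 1]    # positions are 1-based, list indices 0-based
    lower = ordered[int(i) - 1]       # value just below position i
    upper = ordered[int(i)]           # value just above position i
    return (lower + upper) / 2


print(median_by_position([12, 9, 16, 15, 13]))      # data set 1
print(median_by_position([12, 9, 16, 15, 13, 8]))   # data set 2

# Hypothetical extreme value: replace 16 with 60 in data set 1.
original = [12, 9, 16, 15, 13]
with_extreme = [12, 9, 60, 15, 13]
print(sum(original) / 5, sum(with_extreme) / 5)      # the mean is pulled upward
print(median_by_position(original), median_by_position(with_extreme))  # the median is unchanged
```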
MEASURES OF DISPERSION
Statistics is a decision-making tool used in the face of uncertainty. Perhaps no concept is more
important in statistics than the measurement and understanding of variability. Measures of
dispersion (also called measures of spread or measures of variability) are attempts to describe the
spread or fluctuation in a data set. Three common measures of dispersion are: (1) the range, (2)
the mean absolute deviation, and (3) the standard deviation.
The range, R, is simply the difference between the maximum and minimum values in the data
set. In notation, the range is R = H – L, where H is the highest (maximum) value and L is the
lowest (minimum) value.
The mean absolute deviation, MAD, is the average of the absolute deviations from the mean.
$$MAD = \frac{\sum |y - \bar{y}|}{n}$$
Example: Suppose a sample of five students yielded the following number of credit hours each
student is taking in the 2003 spring semester.
12, 9, 16, 15, 12
The sample mean is:
$$\bar{x} = \frac{\sum x}{n} = \frac{12 + 9 + 16 + 15 + 12}{5} = \frac{64}{5} = 12.8$$
Find the mean absolute deviation.

| $x$ | $x - \bar{x}$ | $\lvert x - \bar{x} \rvert$ |
|---|---|---|
| 12 | -0.8 | 0.8 |
| 9 | -3.8 | 3.8 |
| 16 | 3.2 | 3.2 |
| 15 | 2.2 | 2.2 |
| 12 | -0.8 | 0.8 |
| Sum: 64 | 0.0 | 10.8 |

The sample mean absolute deviation is:
$$MAD = \frac{\sum |x - \bar{x}|}{n} = \frac{10.8}{5} = 2.16$$
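A short Python sketch of the range and the mean absolute deviation for the same five credit-hour values follows. The variable names are illustrative. The range value is computed here from R = H – L and is not worked out in the notes themselves.

```python
credit_hours = [12, 9, 16, 15, 12]
n = len(credit_hours)

# Range: R = H - L (maximum minus minimum).
data_range = max(credit_hours) - min(credit_hours)

# Mean absolute deviation: average of |x - x_bar|.
x_bar = sum(credit_hours) / n
abs_devs = [abs(x - x_bar) for x in credit_hours]
mad = sum(abs_devs) / n

print(data_range)
print(round(mad, 2))
```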

The standard deviation, s, is the square root of the “average” of the squared deviations from the
mean. Just as with the mean, the formula used to find the standard deviation is dependent on the
format of the data: raw observations or a frequency distribution.
With raw data the standard deviation is computed as:
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
With a frequency distribution the standard deviation is computed as:
$$s = \sqrt{E(x - \mu)^2} = \sqrt{\sum (x - \mu)^2 f(x)} = \sqrt{\sum x^2 f(x) - \mu^2}$$

Conceptual Motivation for the Standard Deviation
As a measure of spread consider, for each observation, the deviation from the mean.
$$x_1 - \bar{x}, \quad x_2 - \bar{x}, \quad \ldots, \quad x_n - \bar{x}$$
For each observation, the deviation from the mean is measuring how far the observation lies from
the mean. If the deviation is positive, the observation is greater than the mean; if the deviation is
negative, the observation is less than the mean; if the deviation is zero, the observation is equal to
the mean. If the data is “tightly grouped” most of the deviations from the mean should be small.
If the data is “spread out” more of the deviations from the mean should be large. An intuitive
measure of spread is the average of the deviations from the mean. Intuitively, if the data is
“tightly grouped” then on the average the observations should be close to the mean and the
average deviation from the mean will be small. Conversely, on an intuitive level, if the data is
“spread out” then on the average the observations will not be close to the mean and the average
deviation from the mean will be large. Good idea, but it does not work. The sum of the
deviations from the mean always equals zero. We need a way to ensure that the sum will not
equal zero.
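The claim that the deviations from the mean always sum to zero follows directly from the definition of the sample mean. A short derivation, using the same notation as the rest of these notes:

$$\sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - n\bar{x} = \sum_{i=1}^{n} x_i - n \cdot \frac{\sum_{i=1}^{n} x_i}{n} = 0$$

For the credit-hour data (12, 9, 16, 15, 12) with $\bar{x} = 12.8$, the deviations are $-0.8$, $-3.8$, $3.2$, $2.2$, and $-0.8$, and indeed $-0.8 - 3.8 + 3.2 + 2.2 - 0.8 = 0$.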
Try:
• absolute values of the deviations – used for the MAD.
• squares of the deviations – used for the standard deviation.
The use of absolute deviations is intuitively pleasing but it has two drawbacks:
• absolute values are arithmetically tedious to compute
• absolute values lack “nice” statistical properties
The use of squares eliminates these two drawbacks.
The sample variance is:
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
Problem: The sample variance is in squared units.
Solution: Take the square root.
The sample standard deviation is the positive square root of the sample variance. If the standard
deviation is “small” the data is tightly grouped and if the standard deviation is “large” the data is
spread out.
The notation for the population variance is $\sigma^2$ (“sigma squared”) and the notation for the sample
variance is $s^2$. When measuring a single, quantitative variable, the sample variance is defined as:
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{\sum x^2 - \frac{1}{n}\left(\sum x\right)^2}{n - 1}$$
The two formulas above provide the definitional and computational formulas for the sample
variance. The definitional formula provides insight into what the sample variance represents.
The sample variance is an “average” of the squared deviations from the sample mean. Note that
the variance measures deviations from a specified reference point. In the situation described
above, the reference point is the sample mean. As the physical reality changes, the reference
point used in the computation of the variance will change. However, the general idea is to
measure the spread around the reference point.
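To see why the definitional and computational formulas agree, expand the sum of squared deviations and substitute $\bar{x} = \frac{\sum x}{n}$. This derivation is added for clarity and uses only the notation already introduced:

$$\sum (x - \bar{x})^2 = \sum x^2 - 2\bar{x}\sum x + n\bar{x}^2 = \sum x^2 - \frac{2\left(\sum x\right)^2}{n} + \frac{\left(\sum x\right)^2}{n} = \sum x^2 - \frac{1}{n}\left(\sum x\right)^2$$

Dividing both sides by $n - 1$ gives the two forms of $s^2$.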
The general format for the sample variance is “sum of squares” divided by “degrees of freedom.”
Degrees of freedom, df, will be the total number of observations used in computing the sum of
squares, SS, minus the number of parameters estimated in the computation of the sum of squares.
$$s^2 = \frac{SS}{df}$$
Example: Suppose a sample of five students yielded the following number of credit hours each
student is taking in the 2003 spring semester.
12, 9, 16, 15, 12
The sample mean is:
$$\bar{x} = \frac{\sum x}{n} = \frac{12 + 9 + 16 + 15 + 12}{5} = \frac{64}{5} = 12.8$$
Find the sample variance and standard deviation using both the definitional and computational
formulas.
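Use of the definitional formula.

| $x$ | $x - \bar{x}$ | $(x - \bar{x})^2$ |
|---|---|---|
| 12 | -0.8 | 0.64 |
| 9 | -3.8 | 14.44 |
| 16 | 3.2 | 10.24 |
| 15 | 2.2 | 4.84 |
| 12 | -0.8 | 0.64 |
| Sum: 64 | 0.0 | 30.80 |

The sample variance is:
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{30.80}{5 - 1} = \frac{30.80}{4} = 7.7$$
The sample standard deviation is: $s = \sqrt{s^2} = \sqrt{7.7} = 2.77$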
Use of the computational formula.

| $x$ | $x^2$ |
|---|---|
| 12 | 144 |
| 9 | 81 |
| 16 | 256 |
| 15 | 225 |
| 12 | 144 |
| Sum: 64 | 850 |

The sample variance is:
$$s^2 = \frac{\sum x^2 - \frac{1}{n}\left(\sum x\right)^2}{n - 1} = \frac{850 - \frac{1}{5}(64)^2}{5 - 1} = \frac{850 - \frac{4096}{5}}{4} = \frac{850 - 819.2}{4} = \frac{30.80}{4} = 7.7$$
The sample standard deviation is: $s = \sqrt{s^2} = \sqrt{7.7} = 2.77$
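Both routes to the sample variance can be checked with a short Python sketch. The variable names are illustrative. The cross-check against `statistics.variance` relies on that standard-library function dividing by n - 1, which matches the sample variance defined in these notes.

```python
import math
import statistics

credit_hours = [12, 9, 16, 15, 12]
n = len(credit_hours)
x_bar = sum(credit_hours) / n

# Definitional formula: sum of squared deviations from the sample mean.
ss_definitional = sum((x - x_bar) ** 2 for x in credit_hours)

# Computational formula: sum of x^2 minus (sum of x)^2 / n.
ss_computational = sum(x ** 2 for x in credit_hours) - sum(credit_hours) ** 2 / n

df = n - 1                                  # degrees of freedom
var_definitional = ss_definitional / df
var_computational = ss_computational / df
s = math.sqrt(var_definitional)             # sample standard deviation

print(round(var_definitional, 2), round(var_computational, 2), round(s, 2))
print(statistics.variance(credit_hours))    # standard-library cross-check
```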
Example: Suppose the table below represents the frequency and relative frequency of the
number of defective items produced per day for a sample of 25 days. Compute the standard
deviation of the number of defective items produced per day.

| D, Number of Defective Items | f, Frequency | Relative Freq., f/n |
|---|---|---|
| 0 | 7 | 7/25 = 0.28 |
| 1 | 9 | 9/25 = 0.36 |
| 2 | 5 | 5/25 = 0.20 |
| 3 | 3 | 3/25 = 0.12 |
| 4 | 1 | 1/25 = 0.04 |

Recall that the expected number of defective items was 1.28.
$$s = \sqrt{E(x - \mu)^2} = \sqrt{\sum (x - \mu)^2 f(x)} = \sqrt{\sum x^2 f(x) - \mu^2}$$

Using $s = \sqrt{\sum (x - \mu)^2 f(x)}$:

| D | f | f(x) = f/n | $(x - \mu)^2$ | $(x - \mu)^2 f(x)$ |
|---|---|---|---|---|
| 0 | 7 | 7/25 = 0.28 | 1.6384 | 0.458752 |
| 1 | 9 | 9/25 = 0.36 | 0.0784 | 0.028224 |
| 2 | 5 | 5/25 = 0.20 | 0.5184 | 0.103680 |
| 3 | 3 | 3/25 = 0.12 | 2.9584 | 0.355008 |
| 4 | 1 | 1/25 = 0.04 | 7.3984 | 0.295936 |
| Sum | | | | 1.241600 |

$$s = \sqrt{\sum (x - \mu)^2 f(x)} = \sqrt{1.2416} = 1.114$$
Using $s = \sqrt{\sum x^2 f(x) - \mu^2}$:

| D | f | f(x) = f/n | $x^2$ | $x^2 f(x)$ |
|---|---|---|---|---|
| 0 | 7 | 7/25 = 0.28 | 0 | 0.00 |
| 1 | 9 | 9/25 = 0.36 | 1 | 0.36 |
| 2 | 5 | 5/25 = 0.20 | 4 | 0.80 |
| 3 | 3 | 3/25 = 0.12 | 9 | 1.08 |
| 4 | 1 | 1/25 = 0.04 | 16 | 0.64 |
| Sum | | | | 2.88 |

$$s = \sqrt{\sum x^2 f(x) - \mu^2} = \sqrt{2.88 - 1.28^2} = \sqrt{2.88 - 1.6384} = \sqrt{1.2416} = 1.114$$
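The frequency-distribution version of the standard deviation can be sketched in Python as follows. The names are illustrative. Following the formulas in these notes, the squared deviations are weighted by the relative frequencies f(x) rather than divided by n - 1.

```python
import math

# Frequency table: number of defective items -> number of days.
freq = {0: 7, 1: 9, 2: 5, 3: 3, 4: 1}
n_days = sum(freq.values())
rel_freq = {d: f / n_days for d, f in freq.items()}    # f(x) = f / n

mu = sum(d * p for d, p in rel_freq.items())            # expected value, 1.28

# Definitional form: sum of (x - mu)^2 * f(x).
var_definitional = sum((d - mu) ** 2 * p for d, p in rel_freq.items())

# Computational form: sum of x^2 * f(x), minus mu^2.
var_computational = sum(d ** 2 * p for d, p in rel_freq.items()) - mu ** 2

s = math.sqrt(var_definitional)

print(round(var_definitional, 4), round(var_computational, 4), round(s, 3))
```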

MEASURE OF LOCATION
The z-score is a commonly used measure of location. For an individual observation, x, the z-score indicates how many standard deviations the observation lies from the mean.
$$z = \frac{x - \mu}{\sigma}$$
The z-score is a standardized measure. It converts an observation from its original units (e.g.,
dollars, number of credit hours, time, etc.) to units of “number of standard deviations.” The term
“x - µ” computes the observation’s deviation from the population mean (i.e., in terms of the
original units, “x - µ” is how far the observation lies from the population mean). Dividing by the
population standard deviation converts the value of the deviation into units of “number of
standard deviations.” When computing a z-score, if the population mean or standard deviation is
unknown, use the sample mean or standard deviation as an estimate of the respective population
parameter.
Example: Suppose a sample of five students yielded the following number of credit hours each
student is taking in the 2003 spring semester.
12, 9, 16, 15, 12
Find and interpret the z-score for a student taking 16 credit hours in the 2003 spring semester.
From previous examples the sample mean was found to be $\bar{x} = 12.8$ and the sample standard
deviation was determined to be s = 2.77. Thus, the z-score for a student taking 16 credit hours in
the 2003 spring semester is
$$z = \frac{x - \mu}{\sigma} \approx \frac{x - \bar{x}}{s} = \frac{16 - 12.8}{2.77} = 1.16$$
A student taking 16 credit hours in the 2003 spring semester is 1.16 standard deviations above
the average course load.
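A minimal Python sketch of the z-score calculation is shown below. The function name `z_score` is a hypothetical helper. As in the example, the sample mean and sample standard deviation stand in for the unknown population values.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd


# Sample mean and standard deviation from the credit-hour example.
x_bar = 12.8
s = 2.77

z = z_score(16, x_bar, s)
print(round(z, 2))
```

A positive z-score means the observation lies above the mean, and a negative z-score means it lies below the mean.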