Chapter 3: Summary Statistics

E370 Statistical Analysis for Bus & Econ
Objectives:
• Be able to import data into Excel.
• Be able to generate useful summary statistics from the data using Excel. (Data Crunch)
• Be able to interpret summary statistics and use them to describe a dataset. (Value Added)
Why do we care?
These days the problem is not a lack of data but an overwhelming amount of it. We need to extract information from the datasets we have. To do that, we
• condense data into tables and graphs
• summarize data with descriptive statistics
Overview:
Summary of a dataset (3 dimensions)
• Center: Where are the data values concentrated? What seem to be typical or middle data values?
• Dispersion: The scattering or spread of data around its center. How much variation is there in the data?
• Shape: Are the data values distributed symmetrically? Skewed?
Measures of Center:
Mean
  Formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
  Excel Command: =AVERAGE(Array)
  Pros: uses all the information
  Cons: sensitive to extreme values
Median
  Formula: middle value in the sorted array
  Excel Command: =MEDIAN(Array)
  Pros: robust to extreme values
  Cons: does not use all the information
Mode
  Formula: most frequently occurring data value
  Excel Command: =MODE.MULT(Array)
  Pros: the only measure of center for nominal data
  Cons: unreliable (a dataset may have no mode, or several)
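The pros and cons above can be seen on a small example. The following is a minimal sketch in Python rather than Excel, using the standard library's `statistics` module as a stand-in for the Excel commands. The `calories` list is hypothetical data made up for illustration, not the course dataset.

```python
from statistics import mean, median, multimode

# Hypothetical data: calorie counts for seven items (not the course dataset).
calories = [90, 110, 140, 140, 150, 160, 190]

center = {
    "mean": mean(calories),        # counterpart of =AVERAGE(Array)
    "median": median(calories),    # counterpart of =MEDIAN(Array)
    "modes": multimode(calories),  # counterpart of =MODE.MULT(Array)
}
print(center)

# Replace the largest value with an extreme one to see which measures move.
with_outlier = calories[:-1] + [900]
center_with_outlier = {
    "mean": mean(with_outlier),
    "median": median(with_outlier),
    "modes": multimode(with_outlier),
}
print(center_with_outlier)
```

In this sketch the mean rises sharply once the extreme value is introduced, while the median and mode stay where they were, which is the sense in which the mean is "sensitive" and the median "robust".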
Measures of Dispersion:
Range
  Formula: $X_{\max} - X_{\min}$
  Excel Command: =MAX(Array)-MIN(Array)
Variance
  Population: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
  Excel Command: =VAR.P(Array)
  Sample: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
  Excel Command: =VAR.S(Array)
Standard Deviation
  Population: $\sigma = \sqrt{\sigma^2}$
  Excel Command: =STDEV.P(Array)
  Sample: $s = \sqrt{s^2}$
  Excel Command: =STDEV.S(Array)
Coefficient of Variation (CV)
  Population: $(\sigma / \mu) \times 100$
  Excel Command: =STDEV.P(Array)/AVERAGE(Array)*100
  Sample: $(s / \bar{x}) \times 100$
  Excel Command: =STDEV.S(Array)/AVERAGE(Array)*100
Measures of Dispersion:
1. Range: the distance between the largest and smallest values in the dataset.
2. Variance: the average squared distance of observations from their mean. Its "squared" units are difficult to interpret.
3. Standard Deviation: a type of average distance of observations from their mean, calculated by taking the square root of the variance.
4. Coefficient of Variation (CV): a measure of "relative" dispersion (unit-free). It is useful for comparing the dispersion of variables measured in different units or with different means.
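The four measures can be computed side by side to see how they relate. This is a rough sketch in Python, not Excel. The `prices_dollars` list is hypothetical, and the comments name the Excel command each call is meant to mirror. The second list holds the same prices in cents, to illustrate why the CV is described as unit-free.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

# Hypothetical data: eight prices in dollars (made up for illustration).
prices_dollars = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(prices_dollars) - min(prices_dollars)  # =MAX(Array)-MIN(Array)
pop_var = pvariance(prices_dollars)    # =VAR.P(Array): sum of squares 32 divided by N = 8
samp_var = variance(prices_dollars)    # =VAR.S(Array): sum of squares 32 divided by n - 1 = 7
pop_sd = pstdev(prices_dollars)        # =STDEV.P(Array)
samp_sd = stdev(prices_dollars)        # =STDEV.S(Array)
cv_dollars = pop_sd / mean(prices_dollars) * 100  # =STDEV.P(Array)/AVERAGE(Array)*100

# The same prices expressed in cents: the standard deviation changes, the CV does not.
prices_cents = [p * 100 for p in prices_dollars]
sd_cents = pstdev(prices_cents)
cv_cents = sd_cents / mean(prices_cents) * 100

print("range:", data_range)
print("population variance:", pop_var, "sample variance:", round(samp_var, 4))
print("population sd:", pop_sd, "sample sd:", round(samp_sd, 4))
print("sd in dollars:", pop_sd, "sd in cents:", sd_cents)
print("CV in dollars:", round(cv_dollars, 2), "CV in cents:", round(cv_cents, 2))
```

Changing the unit multiplies the standard deviation by 100 but leaves the CV unchanged, which is why the CV is the measure to reach for when variables are recorded in different units.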
Shape of Distribution:
By comparing the three measures of center:
a. Mean > Median (>Mode): positively or right-skewed
b. Mean = Median (=Mode): symmetric
c. Mean < Median (<Mode): negatively or left-skewed
The tail points to the direction of skewness.
Shape of Distribution (cont'd):
By using Pearson's Second Skewness Coefficient:
a. Pearson's Second Skewness Coefficient > 0: positively or right-skewed
b. Pearson's Second Skewness Coefficient = 0: symmetric
c. Pearson's Second Skewness Coefficient < 0: negatively or left-skewed
Pearson's Second Skewness Coefficient = 3 * (mean - median) / standard deviation
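The coefficient and its sign rule can be sketched in a few lines of Python. The function names `pearson_second_skewness` and `describe_shape` and the three datasets are hypothetical, introduced only for illustration. The formula above does not say which standard deviation to use, so the sample standard deviation is assumed here.

```python
from statistics import mean, median, stdev


def pearson_second_skewness(data):
    """Return 3 * (mean - median) / standard deviation for a list of numbers."""
    # Assumption: the sample standard deviation (like =STDEV.S) is used.
    return 3 * (mean(data) - median(data)) / stdev(data)


def describe_shape(coefficient):
    """Map the sign of the coefficient to the labels used in the list above."""
    if coefficient > 0:
        return "positively or right-skewed"
    if coefficient < 0:
        return "negatively or left-skewed"
    return "symmetric"


# Hypothetical datasets chosen to show each case.
right_tail = [1, 2, 2, 3, 3, 3, 4, 20]
balanced = [1, 2, 3, 4, 5]
left_tail = [1, 17, 18, 18, 19, 19, 19, 20]

for name, values in [("right_tail", right_tail),
                     ("balanced", balanced),
                     ("left_tail", left_tail)]:
    coefficient = pearson_second_skewness(values)
    print(name, round(coefficient, 3), describe_shape(coefficient))
```

Because the standard deviation is positive, the sign of the coefficient is simply the sign of mean minus median, so this classification always agrees with the direct comparison of mean and median.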
Summary Statistics
Calories
Mean                  146.1111
Standard Error        4.012704
Median                145
Mode                  140
Standard Deviation    29.48723
Sample Variance       869.4969
Kurtosis              -0.68913
Skewness              -0.1484
Range                 110
Minimum               90
Maximum               200
Sum                   7890
Count                 54
• $\text{Mean} = \frac{\text{Sum}}{\text{Count}}$
• The Standard Deviation and Sample Variance are sample statistics
• $\text{Standard Deviation} = \sqrt{\text{Sample Variance}}$
• $\text{Population Variance} = \text{Sample Variance} \times \frac{N-1}{N}$
• The Skewness value is NOT Pearson's skewness coefficient
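These relationships can be checked against the Calories output itself. The sketch below is Python, with the numbers copied from the Calories summary table. The standard error line relies on the usual definition (sample standard deviation divided by the square root of the count), which the notes do not state, so treat it as an assumption. The Pearson coefficient uses the formula 3 * (mean - median) / standard deviation.

```python
import math

# Values copied from the Calories summary table.
total = 7890
count = 54
mean_reported = 146.1111
standard_error_reported = 4.012704
median_reported = 145
sd_reported = 29.48723
sample_variance = 869.4969
skewness_reported = -0.1484

# Mean = Sum / Count
mean_check = total / count

# Standard Deviation = square root of Sample Variance
sd_check = math.sqrt(sample_variance)

# Population Variance = Sample Variance * (N - 1) / N
population_variance = sample_variance * (count - 1) / count

# Assumption: Standard Error = sample standard deviation / sqrt(Count)
standard_error_check = sd_reported / math.sqrt(count)

# Pearson's Second Skewness Coefficient from the reported values
pearson_second = 3 * (mean_reported - median_reported) / sd_reported

print("mean:", round(mean_check, 4))
print("standard deviation:", round(sd_check, 5))
print("population variance:", round(population_variance, 4))
print("standard error:", round(standard_error_check, 6))
print("Pearson's second coefficient:", round(pearson_second, 4))
print("Skewness in the table:", skewness_reported)
```

For this output the Pearson coefficient comes out slightly positive (the mean is above the median) while the Skewness in the table is slightly negative, so the two measures should not be read as interchangeable.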
Excel Commands:
Excel Command                                  Output
AVERAGE(Array)                                 Mean of the data
MEDIAN(Array)                                  Median of the data (the average of the two middle values when the count is even)
MODE.MULT(Array)                               Mode(s) of the data
VAR.P(Array)                                   Population variance
STDEV.P(Array)                                 Population standard deviation
VAR.S(Array)                                   Sample variance
STDEV.S(Array)                                 Sample standard deviation
MAX(Array)                                     Largest number in the data
MIN(Array)                                     Smallest number in the data
MAX(Array)-MIN(Array)                          Range of the data
Data/Data Analysis/Descriptive Statistics      Table of selected descriptive statistics
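For readers who want to cross-check Excel's output outside Excel, the following is a minimal Python sketch that builds a table similar to the Descriptive Statistics output. The function name `describe` and the `calories` list are hypothetical. Kurtosis and Skewness are left out on purpose, because Excel's exact formulas for them are not given in these notes. The Mode line uses `statistics.mode`, which returns a single value. If no value repeats, it simply returns the first value, which may differ from what Excel reports.

```python
import math
from statistics import mean, median, mode, stdev, variance


def describe(data):
    """Return selected descriptive statistics, labelled as in the Excel table."""
    n = len(data)
    sample_sd = stdev(data)
    return {
        "Mean": mean(data),
        "Standard Error": sample_sd / math.sqrt(n),  # assumption: s / sqrt(n)
        "Median": median(data),
        "Mode": mode(data),
        "Standard Deviation": sample_sd,    # sample statistic, like =STDEV.S
        "Sample Variance": variance(data),  # sample statistic, like =VAR.S
        "Range": max(data) - min(data),
        "Minimum": min(data),
        "Maximum": max(data),
        "Sum": sum(data),
        "Count": n,
    }


# Hypothetical data (not the course dataset).
calories = [90, 110, 140, 140, 150, 160, 190]
summary = describe(calories)

for label, value in summary.items():
    print(f"{label:<20}{value:>12.4f}")
```

A dictionary keeps the labels next to the values, so the printed table can be compared line by line with what the Data Analysis tool produces for the same column.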