Outline of Eco 251-descriptive stats

251descr1 1/22/07 (Open this document in 'Outline' view!)
ECONOMICS 251 COURSE OUTLINE
A. Introduction
1. Definitions
Define Statistics, Descriptive and Analytic Statistics, Induction and Deduction.
2. Uses of Statistics
B. Sources and Types of Data
1. Data
Define data sets, observation, unit of observation. Qualitative and quantitative data. Nominal, ordinal,
interval and ratio data. Discrete vs. continuous data. Data are discrete if the number of values they can take is
countable. Most typically a discrete variable can only be a whole number, like the number of times you can win
the lottery. If two numbers \( x_1 \) and \( x_2 \) are drawn from continuous data, any number between
them could also be part of the same data set. Temperature, weight and most other things that we measure are
continuous data.
a. Qualitative Data
(i) Nominal Data: There is no natural number scale - numbers are only used to define categories, so that
no operations like addition or multiplication are valid.
(ii) Ordinal Data: Numbers are used only to order things (e.g. first, second, third). Differences between
ranks do not always have the same meaning. Most mathematical operations are still not valid.
b. Quantitative Data
(i) Interval Data: Differences between values have consistent meaning, but, like Celsius temperature,
there is no natural origin, so that, although addition and subtraction can be used, multiplication and
division have no real meaning.
(ii) Ratio Data: there is a meaningful origin, so that multiplication and division are valid.
2. Sources
Define primary and secondary sources, internal and external data.
3. Cross Section and Time Series Data
a. Cross Section Data
b. Time Series Data.
i. Indices
ii. Real Values
iii. Rates of change
iv. Logarithms
C. Presentation of Data
1. Classification
Define collectively exhaustive and mutually exclusive classes. These are not the same thing. Collectively
exhaustive means that every item you are considering has a place in a class. Mutually exclusive means that
if an item belongs in any given class, it does not belong in another class as well.
2. Tables
Define parts of tables. See 251pttbl.
3. Charts and Graphs
Define parts of graphs
Line graph example http://www.epinet.org/issueguides/minwage/figure2.pdf
Pie chart example - National Priorities Project
Where Do Your Tax Dollars Go?
posted 2006
This publication shows how the federal government spent the average
household's 2004 income taxes in each state and 193 cities, towns and counties.
Component part line chart example 251GDP_DPI
D. Frequency Distributions and Populations.
1. Definitions
Meaning of Population, Frame, Census, Sample, Grouped Data, Frequency, Example of Frequency
Distribution, Relative Frequency. Width of a class interval:
\[ w = \frac{\text{largest} - \text{smallest}}{\text{number of classes}} \]
(Always round this result up!)
Example: Let us assume that we have a sample consisting of numbers between 905 and 8756, and that we
want to present the data in 5 classes. Our class interval will be at least \( \frac{8756 - 905}{5} = 1570.2 \). We will at
least round this up to 1571. If we want to use 1571, our first class will begin at 905, the next at 905 + 1571
= 2476 etc. In fact we might consider a class interval of 1600 and start our lowest group at 800. The classes
would be 800 to under 2400, 2400 to under 4000, 4000 to under 5600, 5600 to under 7200 and 7200 to under
8800. Just make sure that the classes cover the data and that there are few empty classes.
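The class-interval arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names `class_width` and `class_limits` are made up for this example and are not part of the course materials.

```python
import math

def class_width(smallest, largest, num_classes):
    """Minimum class interval: (largest - smallest) / number of classes, rounded up."""
    return math.ceil((largest - smallest) / num_classes)

def class_limits(start, width, num_classes):
    """Lower limit of each class, followed by the upper bound of the last class."""
    return [start + i * width for i in range(num_classes + 1)]

# Data between 905 and 8756, presented in 5 classes.
print(class_width(905, 8756, 5))      # 1571
# The more convenient choice from the example: width 1600, starting at 800.
print(class_limits(800, 1600, 5))     # [800, 2400, 4000, 5600, 7200, 8800]
```

The second call shows why the rounder interval is attractive: the limits are easy to read and still cover 905 through 8756.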
The most commonly observed rule for deciding on the number of classes is Sturges' rule. The formula can
be written as number of classes \( = 1 + 3.3 \log_{10} n \), where \( \log_{10} n \) is the log base 10 of the number of
observations. This rule should not be taken seriously. For more on this see
http://cnx.org/content/m10160/latest/.
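For a sense of what the rule suggests, here is a small Python sketch. Rounding the result up to a whole number of classes is an assumption made for this illustration (the outline does not say how to round), and the function name `sturges_classes` is hypothetical.

```python
import math

def sturges_classes(n):
    """Number of classes suggested by 1 + 3.3 * log10(n), rounded up (an assumption)."""
    return math.ceil(1 + 3.3 * math.log10(n))

for n in (15, 100, 1000):
    print(n, sturges_classes(n))
```

The suggested count grows very slowly with the sample size, which is one reason to treat it as a starting point at most.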
2. Graphs of the Frequency Distribution. See http://cnx.org/content/m10927/latest/
a. The Histogram
b. The Frequency Polygon
c. The Cumulative Frequency Distribution (Ogive). See
http://home.ched.coventry.ac.uk/Volume/vol0/ogive.htm
d. Relative Frequencies.
e. Smoothed Histograms
E. Sampling and Descriptive Statistics.
1. Sampling to Learn About a Population.
Infinite and finite populations, target and sampled populations, the Stability of Mass Data.
2. The Meaning of Random Sampling.
A simple random sample of n items taken from a population of N items must be selected in such a way
that all combinations of n items are equally likely.
3. Descriptive Statistics.
a. Measures of Central Tendency. (Where's the middle of the data?)
b. Measures of Dispersion. (How spread out are the data?)
c. Measures of Asymmetry etc. (What else can I say about the shape?)
F. Measures of Central Tendency.
1. The Arithmetic Mean of Ungrouped Data.
a. The Population Mean.
\[ \mu = \frac{\sum x}{N} \]
b. The Sample Mean.
\[ \bar{x} = \frac{\sum x}{n} \]
Example: Consider the following data set.
Row         x
1       10000
2       17000
3       23000
4       30000
5       80000
Total  160000
It makes no difference whether we call this a sample or a population, so let's say that this is a sample. We
can write \( \sum x = 160000 \), \( n = 5 \) so \( \bar{x} = \frac{160000}{5} = 32000 \). The alert observer will note that the mean has
been raised by the highest number so that it is actually above all the numbers but the highest one.
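A quick Python check of this example, written as a sketch with illustrative variable names:

```python
x = [10000, 17000, 23000, 30000, 80000]
n = len(x)

mean = sum(x) / n
print(mean)                      # 32000.0

# How many observations lie below the mean?
below = [v for v in x if v < mean]
print(len(below), "of", n)       # 4 of 5
```

Four of the five observations fall below the mean, which is the effect of the single large value noted above.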

2. The Arithmetic Mean of Grouped Data.
To make an ungrouped data formula into a grouped data formula, substitute \( \sum f \) for \( \sum \). For \( x \)
substitute the midpoint of the group. This is defined for our purposes as the arithmetic mean of the lower
limit of the group in question and the lower limit of the next group. In other words if we have the group 10
to 10.99, followed by 11 to 11.99 the midpoint of the first group is 10.50, not 10.495. The formula for a
population mean for grouped data is thus \( \mu = \frac{\sum fx}{N} \). The sample mean formula and the population mean
formula are essentially identical: \( \bar{x} = \frac{\sum fx}{n} \).
Example: It makes no difference whether we call this a sample or a population, so let's say that this is a
sample.
Row      x    f    fx
1       10    3    30
2       12    3    36
3       14    5    70
4       16    3    48
5       18    1    18
Total        15   202
Note that there is no reason to sum \( x \). We can write \( \sum fx = 202 \), \( n = \sum f = 15 \) so \( \bar{x} = \frac{202}{15} = 13.467 \).
Note also that if we use \( f_{rel} \) in place of \( f \), we do not have to divide by \( n \).
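The grouped-data calculation, including the remark about relative frequencies, can be sketched in Python as follows. The variable names are illustrative only.

```python
midpoints = [10, 12, 14, 16, 18]   # x: class midpoints
freqs = [3, 3, 5, 3, 1]            # f: class frequencies

n = sum(freqs)                                          # 15
sum_fx = sum(f * x for f, x in zip(freqs, midpoints))   # 202
grouped_mean = sum_fx / n
print(n, sum_fx, round(grouped_mean, 3))                # 15 202 13.467

# With relative frequencies f_rel = f / n, the sum of f_rel * x is already the mean.
f_rel = [f / n for f in freqs]
mean_from_rel = sum(fr * x for fr, x in zip(f_rel, midpoints))
print(round(mean_from_rel, 3))                          # 13.467
```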
3. The Weighted Arithmetic Mean.
\( \mu_w = \frac{\sum wx}{\sum w} \), \( \bar{x}_w = \frac{\sum wx}{\sum w} \). Example: We have three firms with profit rates of 10%, 12%, and 15%,
which would average 12.33%. If we want a rate of return on capital we might want to know that the assets
of the firms are respectively $2 billion, $1 billion and $1 billion. It is also common in a situation like this to
use relative weights found by dividing the original weights by the sum of the weights, in this case 4.
Row      x    w    wx   w_rel   w_rel x
1       10    2    20    .50     5
2       12    1    12    .25     3
3       15    1    15    .25     3.75
Total         4    47   1.00    11.75
So \( \sum w = 4 \), \( \sum wx = 47 \) and \( \bar{x}_w = \frac{47}{4} = 11.75 \). If we use relative weights, we can read the weighted
mean as the sum of the \( w_{rel}\,x \) column.
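As a sketch, the weighted-mean example for the three firms can be reproduced in Python. The variable names are chosen for this illustration.

```python
rates = [10, 12, 15]    # x: profit rates in percent
assets = [2, 1, 1]      # w: assets in billions of dollars

simple_mean = sum(rates) / len(rates)
weighted_mean = sum(w * x for w, x in zip(assets, rates)) / sum(assets)
print(round(simple_mean, 2), weighted_mean)     # 12.33 11.75

# Relative weights sum to 1, so no final division is needed.
w_rel = [w / sum(assets) for w in assets]
print(w_rel, sum(wr * x for wr, x in zip(w_rel, rates)))   # [0.5, 0.25, 0.25] 11.75
```

The weighted mean is below the simple mean because the largest firm has the lowest profit rate.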

4. The Median of Ungrouped Data.
Defined simply as the middle point when the data are in order. If there are two middle points, take their
arithmetic mean. In continuous data half the points will be above the median and half below it.
Consider the data set that we used for the mean.
Row         x
1       10000
2       17000
3       23000
4       30000
5       80000
Total  160000
Note that the middle number is the third number and that \( 3 = \frac{5+1}{2} \). In general the index
of the median is location \( = \frac{n+1}{2} \). If this is a sample, we can write \( x_{.50} = 23000 \). If this is a population,
the population median is also 23000. The alert observer will note that the median is not much affected by the highest number so that it
seems more typical than the mean. Now consider a second data set.
Row         x
1       10000
2       17000
3       23000
4       27000
5       30000
6       80000
Total  187000
Note that location \( = \frac{n+1}{2} = \frac{6+1}{2} = 3.5 \). This formula seems to be telling us that, since there is no one
middle number, we have to average the third and fourth number. If this is a sample, we can write
\( x_{.50} = \frac{23000 + 27000}{2} = 25000 \). If this is a population, the population median is also 25000.
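The location rule \( \frac{n+1}{2} \) can be sketched in Python for both data sets. The function name `median_by_location` is made up for this illustration; when the location is not a whole number, the items on either side of it are averaged.

```python
import math

def median_by_location(data):
    """Median via the 1-based location (n + 1) / 2."""
    xs = sorted(data)
    location = (len(xs) + 1) / 2
    lower = xs[math.floor(location) - 1]   # item at or just below the location
    upper = xs[math.ceil(location) - 1]    # item at or just above the location
    return (lower + upper) / 2

print(median_by_location([10000, 17000, 23000, 30000, 80000]))          # 23000.0
print(median_by_location([10000, 17000, 23000, 27000, 30000, 80000]))   # 25000.0
```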
5. The Median of Grouped Data.
This is a special case of the formulas for fractiles of grouped data below, where \( p = .5 \):
position \( = \frac{1}{2}(n + 1) \) and \( x_{1-p} = L_p + \left[\frac{pn - F}{f_p}\right] w \). For the formulas and the example used in class see
251median.
6. The Mode
The mode is simply the most common point, not very useful in discrete ungrouped data. For grouped data it
is defined as the midpoint of the largest group. Recall our example for grouped data:
Profit Rate    f    F   Midpoint
9-10.99%       3    3     10
11-12.99%      3    6     12
13-14.99%      5   11     14
15-16.99%      3   14     16
17-18.99%      1   15     18
Total         15
Since 13-14.99 is the largest group and its midpoint is 14, we can write \( mo = 14 \).
Note that a distribution can have two modes, which would make it bimodal. If it has only one mode it is
unimodal. Of all the measures of central tendency, the mode is the most resistant to a few very high
numbers and the mean is least resistant.
Populations made up of data like wealth and income almost always have a few outliers to the right of most
of the data. They tend to be cut off on the left by the fact that a minimum income is necessary to sustain life.
We say that a population of this type is skewed to the right. Typically for such a population
mode < median < mean. On the other hand a population that is skewed to the left would have
mean < median < mode. So what would you expect if a population is unimodal and
symmetrical?
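A short Python sketch of two points from this section: picking the mode of grouped data, and comparing median and mean for the income-style data used earlier. The variable names are illustrative.

```python
midpoints = [10, 12, 14, 16, 18]
freqs = [3, 3, 5, 3, 1]

# Mode of grouped data: midpoint of the class with the largest frequency.
largest = max(range(len(freqs)), key=lambda i: freqs[i])
mode = midpoints[largest]
print(mode)                       # 14

# Ungrouped data from the mean and median examples.
x = [10000, 17000, 23000, 30000, 80000]
mean = sum(x) / len(x)
median = sorted(x)[len(x) // 2]   # n is odd here, so this is the middle item
print(median, mean, median < mean)   # 23000 32000.0 True
```

The median falling below the mean is the pattern described above for data skewed to the right.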
7. Other Means.
a. The Geometric Mean.
\[ x_g = \left( x_1 \cdot x_2 \cdot x_3 \cdots x_n \right)^{\frac{1}{n}} = \sqrt[n]{\prod x} \quad \text{or} \quad \ln(x_g) = \frac{1}{n} \sum \ln(x) \]
Example 1: Find the geometric mean of 1, 2 and 3.
\( x_g = (1 \cdot 2 \cdot 3)^{\frac{1}{3}} = \sqrt[3]{6} = 6^{0.3333} = 1.817 \)
Or, using natural logarithms, \( \ln(x_g) = \frac{1}{3}(\ln 1 + \ln 2 + \ln 3) = \frac{1}{3}(0 + 0.693147 + 1.098612) = 0.597253 \). So
that \( x_g = e^{0.597253} = 1.817 \).
Or, using logarithms to the base 10, \( \log(x_g) = \frac{1}{3}(\log 1 + \log 2 + \log 3) \). So that \( x_g = 10^{\log(x_g)} = 1.817 \).
Example 2: A stock's value grows at 50% in period 1 and 5% in period 2. Find the average growth rate.
Add 1 to the growth rates and take a geometric mean. \( x_g = (1.50 \cdot 1.05)^{\frac{1}{2}} = \sqrt{1.575} = 1.575^{0.500} = 1.255 \).
So the average growth rate is 25.5%.
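Both geometric-mean examples can be checked with a short Python sketch that uses the logarithm form of the formula. The function name `geometric_mean` is chosen for this illustration.

```python
import math

def geometric_mean(values):
    """exp of the average natural log, i.e. ln(x_g) = (1/n) * sum(ln x)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Example 1: geometric mean of 1, 2 and 3.
print(round(geometric_mean([1, 2, 3]), 3))        # 1.817

# Example 2: growth of 50% then 5%; add 1 to each rate, then subtract 1 at the end.
average_growth = geometric_mean([1.50, 1.05]) - 1
print(round(average_growth, 3))                   # 0.255
```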
b. The Harmonic Mean.
\[ \frac{1}{x_h} = \frac{1}{n} \sum \frac{1}{x} \]
Example: Find the harmonic mean of 50 and 30.
\( \frac{1}{x_h} = \frac{1}{2}\left(\frac{1}{50} + \frac{1}{30}\right) = \frac{1}{2}(0.020000 + 0.033333) = 0.026667 \).
So \( x_h = \frac{1}{0.026667} = 37.50 \).
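c. The Root-Mean-Square.
\[ x_{rms}^2 = \frac{1}{n} \sum x^2 \quad \text{or} \quad x_{rms} = \sqrt{\frac{1}{n} \sum x^2} \]
Example: Find the rms of 1, 2 and 3.
\( x_{rms}^2 = \frac{1}{3}\left(1^2 + 2^2 + 3^2\right) = \frac{1}{3}(1 + 4 + 9) = 4.666667 \), so \( x_{rms} = \sqrt{4.666667} = 2.160 \).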
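d. What Formulas for Means Have in Common.
\[ f(\text{mean}) = \frac{1}{n} \sum f(x) \]
Here \( f(x) = x \) gives the arithmetic mean, \( f(x) = \ln x \) the geometric mean, \( f(x) = \frac{1}{x} \) the harmonic mean and \( f(x) = x^2 \) the root-mean-square.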
8. Measures of Position.
Percentiles, deciles, quintiles, quartiles and fractiles.
The two formulas below are two-step formulas. The first step is multiplying \( n + 1 \) (or \( N + 1 \))* by \( p \). \( p \)
represents the fractile of the data wanted, measured from the bottom. For example, if we want the 91st
percentile, \( p \) is .91. Note that the number you have found is called \( x_{1-p} = x_{1-.91} = x_{.09} \) (i.e. 9% from the
top!). If we want the third quartile, \( Q_3 = x_{.25} \), \( p \) is \( \frac{3}{4} \) or 0.75. If we want the first quartile, \( Q_1 = x_{.75} \), \( p \) is
\( \frac{1}{4} \) or 0.25. Of course, for the median \( p = .5 \). \( N \) or \( n \) represents the number of items in the population or
sample, not the number of groups or classes.
a. Finding a Fractile of Grouped Data.
To use this formula, we must first compute the cumulative distribution of the groups and determine in which
group the desired fractile is located with the calculation position \( = p(n + 1) \)*. Once we have found the
group that this is in, let \( f_p \) be the frequency of the chosen group, and let \( F \) be the cumulative frequency
up to but not including the chosen group. The formula here is \( x_{1-p} = L_p + \left[\frac{pn - F}{f_p}\right] w \). In this formula, \( w \)
is the class interval (the interval between the lower limit of the chosen group and the lower limit of the next
group) and \( L_p \) is the lower limit of the chosen group.
Example: Suppose that in the example below we must find the first quartile. Since the first quartile is the
.25 fractile, \( p \) is .25. To locate the group use position \( = p(n + 1) = 0.25(16) = 4 \).
Profit Rate    f    F
9-10.99%       3    3
11-12.99%      3    6
13-14.99%      5   11
15-16.99%      3   14
17-18.99%      1   15
Total         15
Using the cumulative distribution (\( F \)) column, we find the fourth item in the sample.
Since 4 is above 3 and below 6 in the \( F \) column, we pick the group 11-12.99%. \( n \) is
15, and for the group we have picked, \( w = 13 - 11 = 2 \), \( L_p = 11 \), \( F = 3 \), and \( f_p = 3 \).
If we put these numbers into the formula, we find that
\( x_{1-.25} = x_{.75} = 11 + \left[\frac{.25(15) - 3}{3}\right] 2 = 11.5 \).
Note: Sometimes \( \frac{pn - F}{f_p} \) is negative. In this case choose the group before the one you would ordinarily
have chosen. Example: If you want the 19th percentile of the data above, position \( = p(n + 1) = .19(16) =
3.04 \), which would normally take us into 11-12.99. But \( \frac{pn - F}{f_p} = \frac{.19(15) - 3}{3} = -0.05 \), so use the group
9-10.99 instead. But see c below.
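The two-step grouped-data procedure, including the fallback for a negative \( \frac{pn - F}{f_p} \), might be sketched in Python as follows. The function name `grouped_fractile` is hypothetical, and choosing the earlier group when the position lands exactly on a cumulative frequency is an assumption of this sketch, since the outline does not address ties.

```python
def grouped_fractile(p, lower_limits, freqs, width):
    """Fractile of grouped data: L_p + ((p*n - F) / f_p) * w."""
    n = sum(freqs)
    position = p * (n + 1)

    # Step 1: find the first group whose cumulative frequency reaches the position.
    F = 0                      # cumulative frequency before the chosen group
    chosen = len(freqs) - 1
    for i, f in enumerate(freqs):
        if position <= F + f:
            chosen = i
            break
        F += f

    # Rule from the note: if p*n - F is negative, step back one group.
    if p * n - F < 0 and chosen > 0:
        chosen -= 1
        F -= freqs[chosen]

    # Step 2: interpolate within the chosen group.
    return lower_limits[chosen] + (p * n - F) / freqs[chosen] * width

lower_limits = [9, 11, 13, 15, 17]
freqs = [3, 3, 5, 3, 1]

print(grouped_fractile(0.25, lower_limits, freqs, 2))            # 11.5
print(round(grouped_fractile(0.19, lower_limits, freqs, 2), 2))  # 10.9
```

The second call is the 19th percentile case: the position 3.04 points at 11-12.99, the bracket is negative, and the calculation falls back to 9-10.99.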
b. Finding a Fractile of Ungrouped Data.
This time when we compute position \( = p(n + 1) \), we divide it into an integer part, \( a \), and a
fractional part, \( .b \). We then use the formula \( x_{1-p} = x_a + .b(x_{a+1} - x_a) \) to find the actual value.
Example: If our set of numbers is 1, 5, 7, 9, 9, 11, 13, 14, 17, 19, \( n = 10 \), and we wish to find the first
quartile, \( p = 0.25 \), so that \( p(n + 1) = 0.25(11) = 2.75 \). Then \( a = 2 \), and \( .b = .75 \). Now find \( x_a \)
and \( x_{a+1} \), in this case \( x_2 \) and \( x_3 \), and use the formula \( x_{1-p} = x_a + .b(x_{a+1} - x_a) \).
In the list 1, 5, 7, 9, 9, 11, 13, 14, 17, 19, \( x_a = x_2 = 5 \) and \( x_{a+1} = x_3 = 7 \), so that
\( x_{1-p} = x_{.75} = 5 + 0.75(7 - 5) = 6.5 \).
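A Python sketch of the ungrouped formula is shown here. The function name `ungrouped_fractile` is made up for this illustration, and returning the smallest or largest observation when the position falls outside 1 to \( n \) is an assumption of the sketch, not something stated in the outline.

```python
def ungrouped_fractile(data, p):
    """Fractile via position p*(n+1), split into integer part a and fraction .b."""
    xs = sorted(data)
    n = len(xs)
    position = p * (n + 1)
    a = int(position)          # integer part (1-based index)
    b = position - a           # fractional part
    if a < 1:                  # assumption: clamp positions below the first item
        return xs[0]
    if a >= n:                 # assumption: clamp positions at or past the last item
        return xs[-1]
    return xs[a - 1] + b * (xs[a] - xs[a - 1])

data = [1, 5, 7, 9, 9, 11, 13, 14, 17, 19]
print(ungrouped_fractile(data, 0.25))   # 6.5
```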
c. Experimental formula (Don't read this unless you are really ready to ask
questions!) See 251dscr_A.
Document continues in 251descr2.
* Experimentation indicates that a better formula is position \( = 1 + p(n - 1) \). This is compatible with the
formula position \( = .5(n + 1) \) for the median and seems to work in more places.
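To see how the two position rules differ, the following Python sketch applies both to the ten-number example. The function name `fractile` and its `rule` argument are hypothetical, and the clamping at the ends of the data is an assumption of the sketch.

```python
def fractile(data, p, rule="n+1"):
    """Interpolated fractile using position p*(n+1) or 1 + p*(n-1)."""
    xs = sorted(data)
    n = len(xs)
    position = p * (n + 1) if rule == "n+1" else 1 + p * (n - 1)
    a = int(position)
    b = position - a
    if a < 1:                  # assumption: clamp to the smallest observation
        return xs[0]
    if a >= n:                 # assumption: clamp to the largest observation
        return xs[-1]
    return xs[a - 1] + b * (xs[a] - xs[a - 1])

data = [1, 5, 7, 9, 9, 11, 13, 14, 17, 19]
print(fractile(data, 0.25, "n+1"), fractile(data, 0.25, "n-1"))   # 6.5 7.5
print(fractile(data, 0.5, "n+1"), fractile(data, 0.5, "n-1"))     # 10.0 10.0
```

The two rules agree at the median, as the footnote says, and differ for the first quartile.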