Econ 5
Introduction to Statistics
Lectures on Chapter 2
Asatar Bair, Ph.D.
Department of Economics
City College of San Francisco
abair@ccsf.edu
Frequency Distribution
• A frequency distribution is a tabular summary of
a set of data showing the frequency (or number) of
items in each of several non-overlapping classes.
• The objective is to provide insights about the data
that cannot be quickly obtained by looking only at
the original data.
DESCRIPTIVE STATISTICS:
Summarizing Qualitative Data
• Frequency Distribution
• Relative Frequency
• Percent Frequency Distribution
• Bar Graph
• Pie Chart
Example: Marada Inn
Guests staying at Marada Inn were asked to rate the
quality of their accommodations. The ratings provided
by a sample of 20 guests are shown below.
Below Average
Above Average
Above Average
Average
Above Average
Average
Above Average
Average
Above Average
Below Average
Poor
Excellent
Above Average
Average
Above Average
Above Average
Below Average
Poor
Above Average
Average
Example: Marada Inn
Frequency Distribution
Quality rating    Frequency
Poor                   2
Below average          3
Average                5
Above average          9
Excellent              1
Total                 20
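The tally above can be reproduced with a short script. This is only a sketch: the list repeats the 20 ratings in the order they appear in these notes, and the names `ratings`, `classes`, and `frequency` are chosen here for illustration.

```python
from collections import Counter

# The 20 guest ratings from the Marada Inn sample, in the order listed in the notes.
ratings = [
    "Below Average", "Above Average", "Above Average", "Average",
    "Above Average", "Average", "Above Average", "Average",
    "Above Average", "Below Average", "Poor", "Excellent",
    "Above Average", "Average", "Above Average", "Above Average",
    "Below Average", "Poor", "Above Average", "Average",
]

# Non-overlapping classes, ordered from worst to best rating.
classes = ["Poor", "Below Average", "Average", "Above Average", "Excellent"]

# Count how many items fall in each class.
counts = Counter(ratings)
frequency = {c: counts[c] for c in classes}

for c in classes:
    print(f"{c:<15}{frequency[c]:>3}")
print(f"{'Total':<15}{sum(frequency.values()):>3}")
```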
Relative Frequency and
Percent Frequency Distributions
• The relative frequency of a class is the fraction or
proportion of the total number of data items
belonging to the class.
• A relative frequency distribution is a tabular
summary of a set of data showing the relative
frequency for each class.
• The percent frequency of a class is the relative
frequency multiplied by 100.
• A percent frequency distribution is a tabular
summary of a set of data showing the percent
frequency for each class.
Example: Marada Inn
Relative Frequency and Percent Frequency Distributions
Quality rating    Relative Frequency    Percent Frequency
Poor                    0.10                   10
Below average           0.15                   15
Average                 0.25                   25
Above average           0.45                   45
Excellent               0.05                    5
Total                   1.00                  100
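As a check on the two definitions, the relative and percent frequencies can be computed directly from the Marada Inn frequencies. A minimal sketch, with variable names chosen here for illustration:

```python
# Frequencies from the Marada Inn frequency distribution.
frequency = {
    "Poor": 2,
    "Below average": 3,
    "Average": 5,
    "Above average": 9,
    "Excellent": 1,
}

n = sum(frequency.values())  # total number of data items (20)

# Relative frequency: fraction of all items that belong to the class.
relative = {c: f / n for c, f in frequency.items()}

# Percent frequency: relative frequency multiplied by 100.
percent = {c: 100 * r for c, r in relative.items()}

for c in frequency:
    print(f"{c:<15}{relative[c]:>6.2f}{percent[c]:>6.0f}")
print(f"{'Total':<15}{sum(relative.values()):>6.2f}{sum(percent.values()):>6.0f}")
```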
Bar Graph
A bar graph is a graphical device for depicting
qualitative data that have been summarized in a
frequency, relative frequency, or percent frequency
distribution.
On the horizontal axis we specify the labels that
are used for each of the classes.
A frequency, relative frequency, or percent
frequency scale can be used for the vertical axis.
Using a bar of fixed width drawn above each class
label, we extend the height appropriately.
The bars are separated to emphasize the fact that
each class is a separate category.
Example: Marada Inn
[Figure: bar graph of the quality ratings (Poor, Below average, Average, Above average, Excellent) with Frequency on the vertical axis, drawn tall and thin. Tall thin bar graphs emphasize difference.]
[Figure: the same bar graph drawn wide and short. Wide short bar graphs emphasize similarity.]
Pie Chart
The pie chart is a commonly used graphical device for
presenting relative or percentage frequency
distributions for qualitative data.
First draw a circle; then use the relative or
percentage frequencies to subdivide the circle into
sectors that correspond to the relative frequency for
each class.
Since there are 360 degrees in a circle, a class with a
relative frequency of 0.25 would consume 0.25(360) =
90 degrees of the circle.
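The same arithmetic gives the sector angle for every class in the Marada Inn example. A small sketch using the relative frequencies from the table in these notes (the dictionary names are illustrative):

```python
# Relative frequencies from the Marada Inn relative frequency distribution.
relative = {
    "Poor": 0.10,
    "Below average": 0.15,
    "Average": 0.25,
    "Above average": 0.45,
    "Excellent": 0.05,
}

# Each class takes up (relative frequency) x 360 degrees of the circle.
degrees = {c: r * 360 for c, r in relative.items()}

for c, d in degrees.items():
    print(f"{c:<15}{d:>6.0f} degrees")
print(f"{'Total':<15}{sum(degrees.values()):>6.0f} degrees")
```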
Example: Marada Inn
[Figure: pie chart of the quality ratings. Poor 10%, Below average 15%, Average 25%, Above average 45%, Excellent 5%.]
Use of color in presenting pie charts
[Figure: the same pie chart, colored to highlight the positive features of this data.]
[Figure: the same pie chart, colored to highlight the negative features of this data.]
[Figure: the same pie chart, colored to highlight a certain slice of the pie.]
Use of flashy 3D pie charts
[Figure: 3D version of the pie chart.]
Exploded wedges also draw attention
[Figure: pie chart with an exploded wedge.]
Summarizing Quantitative Data
• Frequency Distribution
• Relative Frequency and Percent Frequency Distributions
• Dot Plot
• Histogram
• Cumulative Distribution
Example: Hudson Auto Repair
The manager of Hudson would like to get a better
picture of the distribution of costs for engine
tune-up parts. A sample of 50 customer invoices has
been taken and the costs of parts, rounded to the
nearest dollar, are listed below.
Frequency Distribution
Guidelines for Selecting Number of Classes
Use between 5 and 20 classes.
Data sets with a larger number of elements usually
require a larger number of classes.
Smaller data sets usually require fewer classes.
Use classes of equal width.
Approximate Class Width = (Largest Data Value - Smallest Data Value) / Number of Classes
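As a worked example of the class width calculation, here is a sketch for the Hudson Auto Repair costs. It assumes the usual rule (range divided by number of classes, rounded up to a convenient value). The smallest and largest costs (52 and 109) are read off the stem-and-leaf display in these notes, and 6 is the number of classes used in the Hudson frequency distribution.

```python
import math

smallest = 52           # smallest parts cost in the sample ($)
largest = 109           # largest parts cost in the sample ($)
number_of_classes = 6   # number of classes in the Hudson frequency distribution

# Approximate class width = (largest - smallest) / number of classes.
raw_width = (largest - smallest) / number_of_classes

# Round up to a convenient whole number so the classes cover all the data.
width = math.ceil(raw_width)

print(f"approximate width = {raw_width}, rounded up to {width}")
```

With a width of 10 and a first class starting at 50, the classes are 50-59, 60-69, and so on up to 100-109.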
Frequency Distribution
Cost ($)     Frequency
50-59             2
60-69            13
70-79            16
80-89             7
90-99             7
100-109           5
Total            50
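The class counts can be checked by tallying the 50 costs into classes of width 10. This is a sketch only: the raw invoice list is not reproduced in these notes, so the values are rebuilt from the stem-and-leaf display for the same data, and the names `STEM_AND_LEAF`, `costs`, and `frequency` are illustrative.

```python
from collections import Counter

# Parts costs rebuilt from the stem-and-leaf display in these notes
# (stem = leading digits, each character of the string = one leaf).
STEM_AND_LEAF = {
    5: "27",
    6: "2222567888999",
    7: "1122344555678999",
    8: "0023589",
    9: "1377789",
    10: "14559",
}
costs = [stem * 10 + int(leaf)
         for stem, leaves in STEM_AND_LEAF.items()
         for leaf in leaves]

LOWER = 50   # lower limit of the first class
WIDTH = 10   # equal class width

# Map each cost to the index of its class, then count per class.
tally = Counter((c - LOWER) // WIDTH for c in costs)

frequency = {}
for i in range(6):
    low = LOWER + i * WIDTH
    label = f"{low}-{low + WIDTH - 1}"
    frequency[label] = tally[i]

for label, f in frequency.items():
    print(f"{label:<10}{f:>3}")
print(f"{'Total':<10}{sum(frequency.values()):>3}")
```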
Relative and Percent Frequency Distribution
Cost ($)     Relative Frequency    Percent Frequency
50-59              0.04                    4
60-69              0.26                   26
70-79              0.32                   32
80-89              0.14                   14
90-99              0.14                   14
100-109            0.10                   10
Total              1.00                  100
Dot Plot
• One of the simplest graphical summaries of
data is a dot plot.
• A horizontal axis shows the range of data
values.
• Then each data value is represented by a dot
placed above the axis.
Dot Plot
[Figure: dot plot of the tune-up parts costs. Annotation: Most of the data is in this range.]
Histogram
• Another common graphical presentation of quantitative
data is a histogram.
• The variable of interest is placed on the horizontal axis
and the frequency, relative frequency, or percent
frequency is placed on the vertical axis.
• A rectangle is drawn above each class interval with its
height corresponding to the interval’s frequency, relative
frequency, or percent frequency.
• Unlike a bar graph, a histogram has no natural
separation between rectangles of adjacent classes.
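A rough text rendering can stand in for the drawn histogram. In this sketch each class gets one row and the length of the row plays the role of the rectangle's height. The frequencies are the Hudson Auto Repair class counts from these notes; the `#` symbol and the variable names are arbitrary choices.

```python
# Class intervals and frequencies from the Hudson Auto Repair distribution.
classes = ["50-59", "60-69", "70-79", "80-89", "90-99", "100-109"]
frequency = [2, 13, 16, 7, 7, 5]

# One bar per class; bar length corresponds to the class frequency.
bars = ["#" * f for f in frequency]

# Rows are printed with no blank line between them, mirroring the absence
# of gaps between rectangles of adjacent classes.
for label, bar, f in zip(classes, bars, frequency):
    print(f"{label:>7} | {bar} {f}")
```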
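Histogram
[Figure: frequency histogram of the parts costs. Vertical axis: frequency (0 to 20); horizontal axis: Cost ($), classes 50-59 through 100-109.]
Relative Frequency Histogram
[Figure: relative frequency histogram of the parts costs. Vertical axis: relative frequency (0 to 0.40); horizontal axis: Cost ($), classes 50-59 through 100-109.]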
Cumulative Distribution
• The cumulative frequency distribution shows the
number of items with values less than or equal to the
upper limit of each class.
• The cumulative relative frequency distribution shows
the proportion of items with values less than or equal
to the upper limit of each class.
• The cumulative percent frequency distribution shows
the percentage of items with values less than or equal
to the upper limit of each class.
Cost ($)    Cumulative Frequency    Cumulative Relative Frequency
≤ 59                  2                        0.04
≤ 69                 15                        0.30
≤ 79                 31                        0.62
≤ 89                 38                        0.76
≤ 99                 45                        0.90
≤ 109                50                        1.00
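The cumulative columns are running totals of the class frequencies. A minimal sketch using the Hudson Auto Repair frequencies (variable names are illustrative):

```python
from itertools import accumulate

# Upper class limits and class frequencies from the Hudson distribution.
upper_limits = [59, 69, 79, 89, 99, 109]
frequency = [2, 13, 16, 7, 7, 5]

n = sum(frequency)

# Running total: number of items less than or equal to each upper limit.
cumulative = list(accumulate(frequency))

# Divide by n for the cumulative relative frequency, times 100 for percent.
cumulative_relative = [c / n for c in cumulative]
cumulative_percent = [100 * r for r in cumulative_relative]

for limit, c, r, p in zip(upper_limits, cumulative,
                          cumulative_relative, cumulative_percent):
    print(f"<= {limit:<5}{c:>4}{r:>7.2f}{p:>6.0f}")
```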
Ogive
• An ogive is a graph of a cumulative
distribution.
• The data values are shown on the horizontal
axis.
• The vertical axis can be cumulative
frequencies, cumulative relative frequency,
or cumulative percent frequency.
[Figure: ogive for the parts costs. Vertical axis: cumulative frequency (0 to 50); horizontal axis: cost ($), marked from 49 to 109.]
Exploratory Data Analysis
Exploratory Data Analysis: techniques to
quickly summarize data
Stem-and-Leaf Display
This display shows both the rank order and
shape of the distribution of the data.
It’s similar to a histogram, but it has the
advantage of showing the actual data
values.
The first digit(s) of each data item are
arranged to the left of a vertical line.
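The construction can be sketched in a few lines: split each value into its leading digit(s) (the stem) and its last digit (the leaf), then list the leaves in rank order beside each stem. The function name `stem_and_leaf` is made up for this example. The data are the 50 Hudson Auto Repair parts costs as they appear in these notes, listed here in sorted order because the original invoice order is not given.

```python
from collections import defaultdict

# The 50 Hudson Auto Repair parts costs ($), in sorted order.
costs = [
    52, 57,
    62, 62, 62, 62, 65, 66, 67, 68, 68, 68, 69, 69, 69,
    71, 71, 72, 72, 73, 74, 74, 75, 75, 75, 76, 77, 78, 79, 79, 79,
    80, 80, 82, 83, 85, 88, 89,
    91, 93, 97, 97, 97, 98, 99,
    101, 104, 105, 105, 109,
]

def stem_and_leaf(values):
    """Group each value's last digit (leaf) under its leading digit(s) (stem)."""
    display = defaultdict(list)
    for v in sorted(values):          # sorting puts the leaves in rank order
        stem, leaf = divmod(v, 10)    # e.g. 105 -> stem 10, leaf 5
        display[stem].append(leaf)
    return dict(display)

display = stem_and_leaf(costs)
for stem, leaves in display.items():
    # Stem to the left of the vertical line, leaves to the right.
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in leaves)}")
```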
Hudson Auto Repair
Stem and Leaf Display for Cost of Parts
 5 | 2 7
 6 | 2 2 2 2 5 6 7 8 8 8 9 9 9
 7 | 1 1 2 2 3 4 4 5 5 5 6 7 8 9 9 9
 8 | 0 0 2 3 5 8 9
 9 | 1 3 7 7 7 8 9
10 | 1 4 5 5 9
Crosstabulations and Scatter Diagrams
Thus far we have focused on methods that are used to
summarize the data for one variable at a time.
Next we explore methods of understanding the
relationship between two variables.
Crosstabulation: The number of Finger Lakes
homes sold for each style and price for the past two
years is shown below.
                                Home Style
Price                  Colonial   Ranch   Split   A-Frame   Total
less than $100,000        18        6       19       12       55
$100,000+                 12       14       16        3       45
Total                     30       20       35       15      100
Problem with crosstabulation
Crosstabulation data are often combined to
form an aggregate crosstabulation;
this presents a possible danger;
relationships that appear in the aggregate
may be contradicted by the unaggregated
data;
this is called Simpson’s Paradox.
Crosstabulation: Simpson’s Paradox
Aggregate crosstabulation
                    Judge
Verdict     Luckett       Kendall       Total
Upheld      129 (86%)     110 (88%)     239
Reversed    21 (14%)      15 (12%)      36
Total       150           125           275
It looks like Kendall’s doing a better job
Judge Luckett
Verdict     Common Pleas   Municipal Court   Total
Upheld      29 (91%)       100 (85%)         129
Reversed    3 (9%)         18 (15%)          21
Total       32             118               150
Judge Kendall
Verdict     Common Pleas   Municipal Court   Total
Upheld      90 (90%)       20 (80%)          110
Reversed    10 (10%)       5 (20%)           15
Total       100            25                125
But Luckett actually has a better record in both courts.
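The reversal can be verified from the counts in the two judges' tables. The sketch below computes the percent of verdicts upheld per court and in aggregate. The structure `records` and the helper `pct_upheld` are illustrative names, not part of the lecture.

```python
# (upheld, reversed) counts for each judge in each court, from the tables.
records = {
    "Luckett": {"Common Pleas": (29, 3), "Municipal Court": (100, 18)},
    "Kendall": {"Common Pleas": (90, 10), "Municipal Court": (20, 5)},
}

def pct_upheld(upheld, reversed_):
    """Percent of verdicts upheld on appeal."""
    return 100 * upheld / (upheld + reversed_)

# Unaggregated view: one percentage per judge per court.
by_court = {
    judge: {court: pct_upheld(*counts) for court, counts in courts.items()}
    for judge, courts in records.items()
}

# Aggregated view: combine both courts for each judge.
overall = {
    judge: pct_upheld(sum(u for u, _ in courts.values()),
                      sum(r for _, r in courts.values()))
    for judge, courts in records.items()
}

for judge in records:
    per_court = ", ".join(f"{court} {p:.0f}%" for court, p in by_court[judge].items())
    print(f"{judge}: overall {overall[judge]:.0f}% ({per_court})")
```

Kendall's aggregate percentage is higher only because most of Kendall's cases were heard in Common Pleas, where both judges have higher upheld rates, while most of Luckett's were heard in Municipal Court.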
Scatter Diagram
Example: Panthers Football Team
The Panthers football team is interested in
investigating the relationship, if any, between
interceptions made and points scored.
Interceptions    Points scored
      1                14
      3                24
      2                18
      1                17
      3                27
Panthers Football Team
[Figure: scatter diagram with Interceptions (1 to 3) on the horizontal axis and Points scored (0 to 30) on the vertical axis.]
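As a numeric companion to the scatter diagram (not a replacement for it), the five Panthers observations can be grouped by number of interceptions and the points averaged. This grouping is an illustration added here, not a step from the lecture. It shows the same upward pattern the diagram displays.

```python
from collections import defaultdict
from statistics import mean

# (interceptions, points scored) for the five games in the table.
games = [(1, 14), (3, 24), (2, 18), (1, 17), (3, 27)]

# Collect the points scored for each interception count.
points_by_interceptions = defaultdict(list)
for interceptions, points in games:
    points_by_interceptions[interceptions].append(points)

# Average points at each interception count, in increasing order.
means = {k: mean(v) for k, v in sorted(points_by_interceptions.items())}

for interceptions, avg in means.items():
    print(f"{interceptions} interception(s): average {avg} points")
```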
Scatter diagram of weight and speed of NFL players
[Figure: scatter diagram with Weight (lb) on the horizontal axis (0 to 400) and Time in the 40 yard dash (sec) on the vertical axis (0 to 6).]
From the Excel data, Chapter 2, “NFL”.
Tabular and Graphical Procedures (p. 56)
Data
  Qualitative Data
    Tabular Methods: Frequency Distribution(s), Crosstabulation
    Graphical Methods: Bar graph, Pie chart
  Quantitative Data
    Tabular Methods: Frequency Distribution(s), Cumulative Frequency Distribution(s), Crosstabulation
    Graphical Methods: Dot plot, Stem-and-Leaf Display, Histogram, Ogive, Scatter diagram
Microsoft Excel
MS Excel (and other statistical /
spreadsheet programs like it) makes many
tasks in statistics much, much easier;
Appendix 2.2 describes how to perform some
operations in Excel
Histograms
You need the Analysis ToolPak;
Go to “Tools”, then “Add-ins”, then select “Analysis ToolPak”
and hit “OK”;
Then go to “Tools” and hit “Data Analysis”;
A list of options will come up; select “Histogram”.
In the Histogram dialog:
Input range: highlight all your data
Bin range: (more on this in a minute)
Output range: highlight where you want the frequency
distribution to go
Bin range
what you do here is to define the class widths
you want to use;
Excel can do this automatically, but it has
very bad judgement and the results will be
worthless;
look at the data and decide what the upper
bounds for each class should be
For this example, I’m using the “Norris”
data on the CD;
if you want your classes to look like this:
40-49, 50-59, 60-69, 70-79, 80-89, 90-99, 100-109, 110-119
you enter just the upper boundary in each
cell: 49, 59, 69, 79, 89, 99, 109, 119
then go to the bin range field and highlight
these cells
Frequency distribution
hit “OK”, and Excel gives you this
I like to rename the “Bin” fields
“40-49”, “50-59”, etc.
I also rename the “More” field “Total”
and add up the column above so the
whole thing looks like this
Histogram
Now go to “Insert” and select “Chart” or
hit the button
Select the “Clustered column”
enter the title for the x and y axes, and for
the whole chart, then hit OK;
to get rid of the gap between the bars,
double-click on one of the bars on the
finished chart, then go to “Options” and
enter zero under “Gap width”.
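To see what the upper-boundary bins do, here is a sketch of the same counting rule outside Excel: each value goes into the first bin whose upper bound is greater than or equal to it, and anything above the last bound lands in “More”. The “Norris” values are not reproduced in these notes, so the sketch uses the Hudson Auto Repair parts costs instead, rebuilt from the stem-and-leaf display; all names are illustrative.

```python
from bisect import bisect_left

# Hudson Auto Repair parts costs, rebuilt from the stem-and-leaf display
# (used here because the Norris data are not listed in these notes).
STEM_AND_LEAF = {
    5: "27",
    6: "2222567888999",
    7: "1122344555678999",
    8: "0023589",
    9: "1377789",
    10: "14559",
}
costs = [stem * 10 + int(leaf)
         for stem, leaves in STEM_AND_LEAF.items()
         for leaf in leaves]

# Upper boundary of each class, as you would type them into the bin cells.
upper_bounds = [59, 69, 79, 89, 99, 109]

# One counter per bin, plus a final "More" slot for values above the last bound.
counts = [0] * (len(upper_bounds) + 1)
for c in costs:
    # Index of the first upper bound that is >= c.
    counts[bisect_left(upper_bounds, c)] += 1

for bound, f in zip(upper_bounds, counts):
    print(f"{bound:>5}{f:>4}")
print(f"{'More':>5}{counts[-1]:>4}")
```

Because only upper bounds are given, the first bin collects every value up to 59, including any that might fall below 50. That is one reason to look at the data before choosing the bounds.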
[Figure: “Frequency Histogram: Norris Electronics”. Vertical axis: Frequency (0 to 70); horizontal axis: Hours until Burnout, classes 40 to 49 through 110 to 119.]
be sure to label the axes and give it a title;
another masterpiece of statistics!