CSC-323 Data Analysis and Statistical Software I Winter 04

Chapter 2
Summarizing and
Graphing Data
Recall: The 2 types of data variables are qualitative (categorical) and quantitative.
2.1 Graphs for qualitative variables
• Bar graphs (frequency and relative frequency)
• Pie charts
• Pareto charts

Graphs for qualitative variables


The values of a qualitative or categorical variable are
labels.
The distribution of a categorical variable lists the count
or percentage of individuals in each category.
Wireless surfers by age (a sample of 400 wireless internet users), shown as a bar chart and a pie chart:

Age group      Count    Percent
18-34           212       53%
35-54           168       42%
55 and over      20        5%

Wireless surfers by gender (bar chart):

Wireless internet users    Count    Percent
Male                        288       72%
Female                      112       28%
Total                       400      100%
• Frequency Distribution (or Frequency Table): lists each category of data and the number of occurrences for each category of data.
Frequency Distribution: Ages of Best Actresses
(Tables: Original Data and Frequency Distribution)
Lower Class Limits
are the smallest numbers that can actually
belong to the different classes
Upper Class Limits
are the largest numbers that can actually
belong to the different classes
Class Midpoints
can be found by adding the lower class limit to the
upper class limit and dividing the sum by two
Class midpoints: 25.5, 35.5, 45.5, 55.5, 65.5, 75.5
Class Width
is the difference between two consecutive
lower class limits or two consecutive
lower class boundaries
Class width: 10 for each of the six classes
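The midpoint and width definitions above can be checked with a short Python sketch. The class limits used here (21-30, 31-40, and so on) are an assumption: the original table did not survive on the slide, and these limits are simply the integer limits consistent with the listed midpoints and a class width of 10.

```python
# Class limits are ASSUMED: they are inferred from the slide's midpoints
# (25.5, 35.5, ...) and class width (10), not copied from the original table.
lower_limits = [21, 31, 41, 51, 61, 71]
upper_limits = [30, 40, 50, 60, 70, 80]

# Class midpoint = (lower class limit + upper class limit) / 2
midpoints = [(lo + hi) / 2 for lo, hi in zip(lower_limits, upper_limits)]

# Class width = difference between two consecutive lower class limits
widths = [b - a for a, b in zip(lower_limits, lower_limits[1:])]

print(midpoints)  # [25.5, 35.5, 45.5, 55.5, 65.5, 75.5]
print(widths)     # [10, 10, 10, 10, 10]
```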
EXAMPLE
Organizing Qualitative Data into a Frequency
Distribution
The data on the next slide represent the
color of M&Ms in a bag of plain M&Ms.
Construct a frequency distribution of the
color of plain M&Ms.
Frequency table
The relative frequency is the proportion (or percent) of
observations within a category and is found using the
formula:
relative frequency = frequency / (sum of all frequencies)
A relative frequency distribution lists the relative
frequency of each category of data.
EXAMPLE
Organizing Qualitative Data into a Relative
Frequency Distribution
Use the frequency distribution obtained in the
prior example to construct a relative frequency
distribution of the color of plain M&Ms.
Relative Frequency
First color: 12/45 = 0.2667
All six colors, in table order: 0.2667, 0.2222, 0.2, 0.1333, 0.0667, 0.1111
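As a rough illustration of the formula, the sketch below recomputes the relative frequencies in Python. Only the first count (12 out of 45) appears on the slide; the other counts are assumptions back-calculated from the listed relative frequencies, and the labels "color 1" to "color 6" are placeholders because the color names are not given here.

```python
# Counts are ASSUMED (back-calculated as relative frequency x 45);
# the category labels are hypothetical placeholders.
counts = {
    "color 1": 12,
    "color 2": 10,
    "color 3": 9,
    "color 4": 6,
    "color 5": 3,
    "color 6": 5,
}

total = sum(counts.values())  # sum of all frequencies

# relative frequency = frequency / sum of all frequencies
relative = {color: n / total for color, n in counts.items()}

for color, r in relative.items():
    print(f"{color}: {r:.4f}")
```

The relative frequencies always add up to 1 (or 100%), which is a quick check on the table.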
Bar Graphs
A bar graph is constructed by labeling each category of data on
either the horizontal or vertical axis and the frequency or relative
frequency of the category on the other axis.
EXAMPLE
Constructing a Frequency and
Relative Frequency Bar Graph
Use the M&M data to construct
(a) a frequency bar graph and
(b) a relative frequency bar graph.
Actresses example (total frequency = 76):
28/76 = 37%
30/76 = 39%
etc.
Frequency bar graph
• The horizontal scale represents the classes of data values
• The vertical scale represents the frequencies
Relative Frequency Graph
Has the same shape and horizontal scale as the bar graph, but the
vertical scale is marked with relative frequencies instead of actual
frequencies
Interpreting Frequency Distributions
In later chapters, there will be frequent reference to
data with a normal distribution. One key
characteristic of a normal distribution is that it has
a “bell” shape.
The frequencies start low, then increase to some
maximum frequency, then decrease to a low
frequency.
The distribution should be approximately
symmetric.
Example:
“bell” shape
EXAMPLE
Comparing Two Data Sets
The following data represent the marital status (in millions) of U.S. residents 18
years of age or older in 1990 and 2006. Draw a side-by-side relative frequency
bar graph of the data.
Marital Status     1990     2006
Never married      40.4     55.3
Married           112.6    127.7
Widowed            13.8     13.9
Divorced           15.1     22.8
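Because the two years have different totals, the bars are compared on relative frequencies rather than raw counts. A minimal Python sketch of that calculation, using the figures from the table (the function and variable names are illustrative only):

```python
status = ["Never married", "Married", "Widowed", "Divorced"]
millions_1990 = [40.4, 112.6, 13.8, 15.1]
millions_2006 = [55.3, 127.7, 13.9, 22.8]


def relative_frequencies(freqs):
    """Divide each frequency by the sum of all frequencies."""
    total = sum(freqs)
    return [f / total for f in freqs]


rel_1990 = relative_frequencies(millions_1990)
rel_2006 = relative_frequencies(millions_2006)

# One row per category: the two values are the heights of the paired bars.
print(f"{'Marital Status':<16}{'1990':>8}{'2006':>8}")
for s, a, b in zip(status, rel_1990, rel_2006):
    print(f"{s:<16}{a:>8.3f}{b:>8.3f}")
```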
(Figure: side-by-side bar graph "Marital Status in 1990 vs. 2006". Horizontal axis: Marital Status; vertical axis: Relative Frequency; one bar per year, 1990 and 2006, for each category.)
Another Example: On the morning of April 10, 1912, the Titanic sailed
from the port of Southampton (UK), bound for New York. Altogether there
were 2,201 passengers and crew members on board. The table below
classifies them by class, sex, and whether they survived.

                    Survived            Dead
                  Male   Female     Male   Female
First class         62      141      118        4
Second class        25       93      154       13
Third class         88       90      422      106
Crew members       192       20      670        3
Define the categorical variables
(Figure: bar chart representing the data in the table above, in percentages. Bars for First class, Second class, Third class, and Crew, grouped as Survived Male, Survived Female, Dead Male, Dead Female.)
A Pareto chart is a bar graph where the bars are drawn
in decreasing order of frequency or relative frequency.
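The only step that distinguishes a Pareto chart from an ordinary bar graph is the sort. A small Python sketch, borrowing the 2006 marital status frequencies (in millions) from the earlier example as input; the text bars are just a stand-in for a real plot:

```python
# 2006 marital status frequencies (millions), from the earlier example.
freqs = {
    "Never married": 55.3,
    "Married": 127.7,
    "Widowed": 13.9,
    "Divorced": 22.8,
}

# Pareto order: categories sorted by decreasing frequency.
pareto_order = sorted(freqs.items(), key=lambda item: item[1], reverse=True)

total = sum(freqs.values())
for category, f in pareto_order:
    bar = "#" * round(50 * f / total)  # bar length proportional to relative frequency
    print(f"{category:<14} {bar}")
```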
Pareto Chart
Pie Chart
A pie chart is a circle divided into sectors.
Each sector represents a category of data.
The area of each sector is proportional to
the frequency of the category.
EXAMPLE
Constructing a Pie Chart
The following data represent the marital status (in millions) of U.S. residents 18
years of age or older in 2006. Draw a pie chart of the data.
Marital Status     Frequency
Never married         55.3
Married              127.7
Widowed               13.9
Divorced              22.8
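Since each sector is proportional to the frequency of its category, its central angle is the relative frequency times 360 degrees. A short Python sketch of that arithmetic for the table above (variable names are illustrative):

```python
freqs_2006 = {
    "Never married": 55.3,
    "Married": 127.7,
    "Widowed": 13.9,
    "Divorced": 22.8,
}

total = sum(freqs_2006.values())

# Central angle of each sector = relative frequency x 360 degrees
angles = {status: 360 * f / total for status, f in freqs_2006.items()}

for status, angle in angles.items():
    print(f"{status:<14}{angle:7.1f} degrees")
```

The angles add up to 360 degrees, so the sectors fill the circle exactly.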
Other example:
A graph depicting qualitative data as slices of a pie
2.2 Graphs for quantitative variables:
• Histograms (discrete data and continuous data)
• Stem-and-leaf plots
• Time series
• Dot plots
• Distributions
Histogram:
Example: CEO salaries
Forbes magazine published data on the best small firms in 1993. These were firms with
annual sales of more than $5 million and less than $350 million. Firms were ranked by
five-year average return on investment. The data extracted are the age and annual salary of
the chief executive officer for the first 60 ranked firms. (Data at
http://lib.stat.cmu.edu/DASL/DataArchive.html )
Salary of chief executive officer (including
bonuses), in $thousands
145 621 262 208 362 424 339 736 291
58 498 643 390 332 750 368 659 234
396 300 343 536 543 217 298 1103 406
254 862 204 206 250 21 298 350 800
726 370 536 291 808 543 149 350 242
198 213 296 317 482 155 802 200 282
573 388 250 396 572
Drawing a histogram
1. Construct a distribution table:
   i. Define class intervals or bins (choose intervals of equal width!)
   ii. Count the percentage of observations in each interval
   iii. End-point convention: the left endpoint of the interval is included,
        and the right endpoint is excluded, i.e. [a, b)
2. Draw the horizontal axis.
3. Construct the blocks:
   Height of block = percentage!
   The total area under a histogram must be 100%
Class interval          Frequency    Percentage =
(use left end-point)                 (frequency/total) x 100
0-100                       2        2/59 x 100 = 3.39
100-200                     4        4/59 x 100 = 6.78
200-300                    18        30.50
300-400                    14        23.73
400-500                     4        6.78
500-600                     6        10.18
600-700                     3        5.08
700-800                     3        5.08
800-900                     4        6.78
900-1000                    0        0
1000-1100                   1        1.70
Total                      59        100%

(Figure: histogram of the salaries, with block labels 30.50%, 23.73%, 3.39%, and 1.70%.)
The area of each block represents the percentages of cases in the
corresponding class interval (or bin).
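The distribution table can be rebuilt with a few lines of Python. This is only a sketch under the left-endpoint convention [a, b) described above; the helper name `percent_table` is made up for illustration. Note that under a strict [a, b) rule the value 1103 falls in [1100, 1200), whereas the slide's table lists the largest salary in the 1000-1100 row, and percentages rounded to two decimals can differ from the slide in the last digit.

```python
from collections import Counter

# CEO salaries (in $thousands), as listed on the slide: 59 values.
salaries = [
    145, 621, 262, 208, 362, 424, 339, 736, 291,
    58, 498, 643, 390, 332, 750, 368, 659, 234,
    396, 300, 343, 536, 543, 217, 298, 1103, 406,
    254, 862, 204, 206, 250, 21, 298, 350, 800,
    726, 370, 536, 291, 808, 543, 149, 350, 242,
    198, 213, 296, 317, 482, 155, 802, 200, 282,
    573, 388, 250, 396, 572,
]


def percent_table(data, width):
    """Map each bin's left endpoint to (frequency, percentage).

    Left-endpoint convention: x is counted in [k*width, (k+1)*width).
    Empty bins are simply absent from the result.
    """
    counts = Counter((x // width) * width for x in data)
    n = len(data)
    return {left: (counts[left], 100 * counts[left] / n) for left in sorted(counts)}


table = percent_table(salaries, 100)
for left, (freq, pct) in table.items():
    print(f"[{left}, {left + 100}): {freq:2d}  {pct:5.2f}%")
```

Calling `percent_table(salaries, 50)` or `percent_table(salaries, 200)` is one way to try different binning levels and see how the shape of the histogram changes.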
Remarks
• A histogram represents percent by area. The area of each block represents
the percentages of cases in the corresponding class interval.
• The total area under a histogram is 100%
• There is no fixed choice for the number of classes in a histogram:
If class intervals are too small, the histogram will have spikes;
If class intervals are too large, some information will be missed.
Use your judgment!
• Typically statistical software will choose the class intervals for you, but
you can modify them.
• Let's try various binning levels.
Example: Smoking
In a Public Health Service study, a histogram was plotted showing the
number of cigarettes smoked per day by each subject (male current smokers),
as shown below. The density is marked in parentheses. The class intervals
include the left endpoint, but not the right.
1. The percentage who smoked less than two packs a day but at least a pack is around
   (note: there are 20 cigarettes in a pack):
   1.5%    15%    30%    50%
2. The percent who smoked at least a pack a day is around:
   1.5%    15%    30%    50%
3. The percent who smoked at least 3 packs a day is around:
   0.25 of 1%    0.5 of 1%    10%
4. The percent who smoked 20 cigarettes a day is around:
   0.35 of 1%    0.5 of 1%    1.5%    3.5%    10%
Answers:
1. The percentage who smoked less than two packs a day but at least a pack is
   given by the area of the third block (note: there are 20 cigarettes in a pack):
   1.5 x (40 - 20) = 1.5 x 20 = 30%
2. The percent who smoked at least a pack a day is given by the area of the third
   and fourth blocks: 30 + 0.5 x 40 = 50%
3. The percent who smoked at least 3 packs a day is the area of the block for
   60 or more cigarettes. This is half of the fourth block: 0.5 x 20 = 10%
4. The percent who smoked 20 cigarettes a day: by the left-endpoint convention,
   20 belongs to the third block, whose density is 1.5% per cigarette. An interval
   one cigarette wide therefore holds about 1.5 x 1 = 1.5%.
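All four answers use the same rule: percentage = density x width of the interval. A minimal Python sketch of that arithmetic, using only the densities and endpoints quoted in the answers (the helper name `block_area` is illustrative):

```python
def block_area(density, left, right):
    """Percentage of cases in [left, right) for a block of the given density
    (density is in percent per cigarette)."""
    return density * (right - left)


third_block = block_area(1.5, 20, 40)            # at least 1 pack, under 2 packs
fourth_block = block_area(0.5, 40, 80)           # at least 2 packs
at_least_one_pack = third_block + fourth_block
at_least_three_packs = block_area(0.5, 60, 80)   # half of the fourth block
exactly_20 = block_area(1.5, 20, 21)             # interval one cigarette wide

print(third_block)           # 30.0
print(at_least_one_pack)     # 50.0
print(at_least_three_packs)  # 10.0
print(exactly_20)            # 1.5
```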
Using histograms for comparisons
Fuel economy for model year 2001 compact and two-seater cars (Table 1.8, pg 38)
(Figure: histograms in two panels, City consumption and Highway consumption.)
Stemplot (or Stem-and-Leaf Plot)
Represents data by separating each value into two parts: the stem
(the leftmost digits) and the leaf (the rightmost digit).
Example: a data value of 147 would have 14 as the stem and 7 as the leaf.
To make a Stemplot:
Example:
Advantage of Stem-and-Leaf Diagrams over Histograms
Once a frequency distribution or histogram of
continuous data is created, the raw data are lost
(unless reported with the frequency distribution).
The raw data can, however, be retrieved from the
stem-and-leaf plot.
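The stem/leaf split and the point about recovering the raw data can both be shown in a few lines of Python. The data values below are hypothetical, chosen only to include the 147 from the example above, and `stem_and_leaf` is an illustrative helper rather than a library function.

```python
def stem_and_leaf(values):
    """Group non-negative integers by stem (all digits but the last);
    the leaf is the last digit."""
    rows = {}
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        rows.setdefault(stem, []).append(leaf)
    return rows


# HYPOTHETICAL data, for illustration only.
data = [147, 132, 151, 147, 128, 139, 150, 133]

rows = stem_and_leaf(data)
for stem in range(min(rows), max(rows) + 1):
    leaves = "".join(str(leaf) for leaf in rows.get(stem, []))
    print(f"{stem:>3} | {leaves}")

# Unlike a histogram, the plot still contains every original value.
recovered = [stem * 10 + leaf for stem, leaves in rows.items() for leaf in leaves]
print(recovered == sorted(data))  # True
```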
Dot plots
A dot plot is drawn by placing each observation
horizontally in increasing order and placing a dot above
the observation each time it is observed.
EXAMPLE Drawing a Dot Plot
The following data represent the number of available cars in
a household based on a random sample of 50 households.
Draw a dot plot of the data.
3 4 1 3 2 0 2 1 3 3
1 2 3 2 2 2 2 2 1 1
1 1 4 2 2 1 2 1 2 2
1 2 2 0 1 2 0 1 3 1
0 2 2 2 3 2 4 2 2 5
Data based on results reported by the United States Bureau of the Census.
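A dot plot only needs the count of each distinct value. The Python sketch below tallies the 50 observations and prints a rough text version, with one `*` standing in for each dot; it is an illustration, not a replacement for a proper plot.

```python
from collections import Counter

# Number of available cars in each of the 50 sampled households.
cars = [
    3, 4, 1, 3, 2, 0, 2, 1, 3, 3,
    1, 2, 3, 2, 2, 2, 2, 2, 1, 1,
    1, 1, 4, 2, 2, 1, 2, 1, 2, 2,
    1, 2, 2, 0, 1, 2, 0, 1, 3, 1,
    0, 2, 2, 2, 3, 2, 4, 2, 2, 5,
]

counts = Counter(cars)
values = range(min(cars), max(cars) + 1)

# Print from the tallest stack down, so the dots sit above each value.
for level in range(max(counts.values()), 0, -1):
    print(" ".join("*" if counts[v] >= level else " " for v in values))
print(" ".join(str(v) for v in values))
```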
Examining distributions



• Purpose of graph: to understand data better
• Histograms and Stemplots display the main features of a
  distribution similarly.
• Features to be observed:
  - Modes (how many?)
  - Symmetry vs skewness
  - Outliers
EXAMPLE
Identifying the Shape of the Distribution
Identify the shape of the following histogram which represents the time
between eruptions at Old Faithful.
Time-Series Graphs
Data that have been collected at different points in time
Example:
Time series graph:
Time series graph with seasonal variation:
Other types of graphs:

• Frequency Polygon
• Ogive (cumulative frequencies)
• Scatter Plot (to relate two variables)

Frequency polygons
The class midpoint is found by adding consecutive lower
class limits and dividing the result by 2. For example, the
midpoint of the class 670 – 679 is (670 + 680)/2 = 675.
A frequency polygon is drawn by plotting a point above
each class midpoint on a horizontal axis at a height equal to
the frequency of the class. After the points for each class
are plotted, draw straight lines between consecutive points.
Time between Eruptions    Class       Frequency    Relative
(seconds)                 Midpoint                 Frequency
670 – 679                 675             2        0.0444
680 – 689                 685             0        0
690 – 699                 695             7        0.1556
700 – 709                 705             9        0.2
710 – 719                 715             9        0.2
720 – 729                 725            11        0.2444
730 – 739                 735             7        0.1556
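The points of the frequency polygon are the (class midpoint, frequency) pairs from this table. A small Python sketch that derives them from the lower class limits, using the consecutive-lower-limits definition of the midpoint (variable names are illustrative):

```python
lower_limits = [670, 680, 690, 700, 710, 720, 730]
frequencies = [2, 0, 7, 9, 9, 11, 7]

# Class width = difference between consecutive lower class limits.
width = lower_limits[1] - lower_limits[0]

# Midpoint = (lower limit + next lower limit) / 2 = lower limit + width / 2
midpoints = [lo + width / 2 for lo in lower_limits]

total = sum(frequencies)
relative = [round(f / total, 4) for f in frequencies]

# Plot these points in order and join consecutive points with straight lines.
points = list(zip(midpoints, frequencies))
for (x, y), r in zip(points, relative):
    print(f"midpoint {x:.0f}: frequency {y}, relative frequency {r}")
```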
(Figure: frequency polygon "Time between Eruptions". Horizontal axis: Time (seconds), marked at the class midpoints from 665 to 735; vertical axis: Frequency, from 0 to 12.)
Practice
CO2 emission levels in the world:
Burning fuel in power plants or motor vehicles emits carbon
dioxide (CO2), which contributes to global warming. The
table in the next slide displays CO2 emissions per person
for countries with populations of at least 20 million.
Questions:
(a) Why do you think we choose to measure emissions per
person rather than total CO2 emissions for each country?
(b) Display the data of the table in a graph. Describe the
shape, center, and spread of the distribution. Which
countries are outliers?
    1. Make a Stemplot, then
    2. a Histogram.
Answer:
(a) Total emissions would almost certainly be higher for
very large countries; for example, we would expect that
even with great attempts to control emissions, China
(with over 1 billion people) would have higher total
emissions than the smallest countries in the data set.
Answer: (stemplot)
(b) Graph representation of the data:
1) Stemplot:
 0 | 001122223357899
 1 | 02478
 2 | 3558
 3 | 67899
 4 | 68
 5 | 1
 6 | 18
 7 | 36
 8 | 018
 9 | 017
10 | 02
11 |
12 |
13 |
14 |
15 |
16 |
17 |
18 |
19 | 9
(Key: 19 | 9 = 19.9 metric tons per person)
Answer: (histogram)
(b)-continued: Graph representation of the data:
2) Histogram: (For example, using Excel – Note: in Excel,
the convention is ‘right point belongs in bin, left point out’):
(Demo in class)
Summary of steps:
- Find min and max of data
- Choose binning
- From Menus: Tools, Data Analysis, Histograms
- Define: Input range, Bin range, Output range
- Check Chart output.
- Click OK.
- Adjust width between bars (right-click on bars,
format data series, options, set gap width to zero).
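The difference between the two endpoint conventions only matters for values that fall exactly on a bin edge. The Python sketch below contrasts them using `bisect` from the standard library. It assumes bin edges 0, 2, ..., 22 and a handful of sample values chosen for illustration; the function names are made up, and this mimics the counting rule rather than calling Excel.

```python
from bisect import bisect_left, bisect_right

edges = list(range(0, 24, 2))  # ASSUMED bin edges: 0, 2, 4, ..., 22


def right_closed_bin(x):
    """Excel-style rule: x is counted in the first bin value b with x <= b,
    i.e. in the interval (b - 2, b]."""
    return edges[bisect_left(edges, x)]


def left_closed_bin(x):
    """Lecture convention: x is counted in [a, a + 2); returns a."""
    return edges[bisect_right(edges, x) - 1]


# Illustrative values; 8.0 sits exactly on a bin edge.
for x in [0.0, 8.0, 8.1, 19.9]:
    b = right_closed_bin(x)
    a = left_closed_bin(x)
    print(f"{x:>4}: Excel-style bin {b:>2}, left-closed interval [{a}, {a + 2})")
```

A value such as 8.0 is counted in the bin labelled 8 under the Excel-style rule, that is in (6, 8], but in [8, 10) under the left-endpoint convention, so the two histograms can differ slightly.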
Answer: (histogram)
(b)-continued: Histogram:
min = 0, max = 19.9

Bin    Frequency
  0        2
  2       18
  4        9
  6        3
  8        5
 10        6
 12        2
 14        0
 16        1
 18        1
 20        1
 22        0

(Figure: Excel histogram of the table above. Horizontal axis: Bin; vertical axis: Frequency, from 0 to 20.)
Interpretation of graphs:
The graph is not symmetric. There is a strong right skew, with a high peak at low emissions (metric tons per person).
The three highest countries (the U.S., Canada, and Australia) appear to be outliers; apart from those
countries, the distribution is spread from 0 to 11 metric tons per person (see table).