lec_2_statistics_15 - School of Mathematics

Lecture on Statistics
By Dr. Brendan Browne
Introduction
Every day managers must make sense of the facts or data that businesses
accumulate through ongoing activities. Information is acquired by arranging,
summarizing or transforming these data in some logical manner. Information is extremely
important in business. When that information comes in the form of numerical data it is
said to be quantitative, and the methods needed to make sense of the data are quantitative
methods, e.g. coordinate geometry. An element of uncertainty or randomness is often
associated with quantitative data. The appropriate quantitative methods for such
situations are statistical methods. Information, in turn, forms the basis of rational decision
making.
Statistics is the science that processes and analyzes data in order to provide managers
with useful information to aid in decision making.
Descriptive statistics focus on the collection, summarization and characterization of a
set of data.
Inferential statistics estimates a characteristic of a set of data or helps uncover patterns in
data sets that are unlikely to occur by chance.
The mathematics of probability theory forms the foundation of inferential statistics.
Inferential methods work with samples, portions of an entire set of data, rather than the
complete set itself, which statisticians call the population. Inferential methods use the
sample data to calculate summary measures (called statistics) that decision-makers can
use to estimate the characteristics of the entire population (called parameters).
Today, technological advances in computer processing have made practical the application
of computationally complex inferential methods that were beyond the computational
capabilities available to early statistical researchers. Thus on this course we use the widely
used MINITAB statistical package to do our statistical calculations.
Picturing Distributions with Graphs
Statistics is a group of methods used to collect, analyze, present and interpret data and
make decisions.
The volume of data available to us is overwhelming. Each March, for example, the
United States Census Bureau collects economic and employment data from more than
200,000 people. From the bureau's Web site you can choose to examine more than 300
items of data for each person (and more for households): child care assistance, child care
support, hours worked, weekly earnings, and much more.
To make sense of such large volumes of data we must first organize this data in a
systematic manner. Before we give methods for organizing large volumes of data we
need some definitions.
Definition: Individuals or Observations are objects described by a set of data.
Individuals or observations may be people but they may be animals or things.
Any set of data contains information about some group of individuals or observations.
The information is organized in variables.
Definition: A variable is any characteristic of an individual or observation. A variable
can take different values for different individuals or observations.
Now in statistics there are in general two types of variables, namely categorical (or
qualitative) variables and quantitative variables.
Definition: A categorical variable places an individual or observation into one of
several groups or categories.
Definition: A quantitative variable takes numerical values for which arithmetic
operations such as adding and averaging make sense.
Definition: A distribution of a variable tells us what values it takes and how often
it takes these values.
Example Here is part of the data in which a professor records information about student
performance in a course.

A               B        C        D         E        F           G      H
Name            School   Major    HW total  Midterm  Final Exam  Total  Grade
Smith, John     Edu      EdPsych  95        80       88          263    A
Arthur, Brenda  Law      Psych    32        61       54          147    D
Fox, Des        Science  Biol     74        68       70          212    B
Boggs, Joan     Science  Math     86        75       94          255    A

The individuals described are the students. Each row records data on one individual. Each
column contains the values of one variable for all the individuals. In addition to the
student's name, there are 7 variables. School and major are categorical variables. Scores
on homework, the midterm, and the final exam and the total score are quantitative.
Grade is recorded as a category (A, B, and so on), but each grade also corresponds to a
quantitative score (A = 4, B = 3, and so on) that is used to calculate student grade point
averages.
Most data tables follow this format: each row is an individual, and each column is a
variable. This data set appears in a spreadsheet program that has rows and columns ready
for your use. Spreadsheets are commonly used to enter and transmit data and to do
simple calculations such as adding homework, midterm, and final scores to get total
points.
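The spreadsheet calculation mentioned above can be sketched in a few lines of Python. This is only an illustration: the variable names are our own, and the scores are copied from the four rows of the professor's table.

```python
# Rows copied from the table above: (name, homework total, midterm, final exam).
scores = [
    ("Smith, John", 95, 80, 88),
    ("Arthur, Brenda", 32, 61, 54),
    ("Fox, Des", 74, 68, 70),
    ("Boggs, Joan", 86, 75, 94),
]

# Total points = homework + midterm + final exam, as in column G.
totals = {name: hw + midterm + final for name, hw, midterm, final in scores}

for name, total in totals.items():
    print(f"{name:15s} {total}")
```

The totals agree with column G of the table (263, 147, 212 and 255).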
Example 1 Fuel economy. Here is a small part of a data set that describes the fuel
economy (miles per gallon) of 2002 model motor vehicles.

Make and model   Vehicle Type           Transmission Type  Number of cylinders  City MPG  Highway MPG
Acura NSX        Two-seater             Automatic          6                    17        24
Audi A4          Compact                Manual             4                    22        31
Buick Century    Midsize                Automatic          6                    20        29
Dodge Ram 1500   Standard pickup truck  Automatic          8                    15        20

(a) What are the individuals in this data set ?
(b) For each individual, what variables are given? Which of these variables
are categorical and which are quantitative?
Solution (a) The individuals are the 2002 model motor vehicles.
(b) Make and model, Vehicle Type and Transmission Type are categorical
variables while Number of cylinders, City MPG and Highway MPG are
quantitative variables.
Exercise 1 Which of the following variables are categorical (or qualitative) and which
are quantitative?
(i) The color of cars involved in several severe accidents.
(ii) The length of time required for rats to move through a maze.
(iii) The classification of police administration as city, county or state.
(iv) The ratings given to pizza in a taste test as poor, good or excellent.
(v) The number of times subjects in a sociological research study have been married.
Exploratory Data Analysis
Statistical tools and ideas help us to examine data in order to describe their main features.
This examination is called exploratory data analysis. The two basic strategies that help us
organize our exploration of data are
(i) Examine each variable by itself and, if there is more than one variable, study the
relationship among the variables.
(ii) Begin with a graph or graphs that describe the data. Then add numerical summaries
for a more complete description.
The proper choice of graph depends on the nature of the variable. We shall first study
categorical variables. The distribution of a categorical variable lists the categories and
gives either the count or the percent of individuals who fall in each category.
Categorical variables: pie charts , bar graphs.
The main graphs that we use for categorical variables are (i)bar graphs,
(ii) pie charts.
Definition: A bar graph is a graph made up of bars whose heights represent the
frequencies or percentages of respective categories.
Note: The bar graphs for relative frequencies and percentages of different
categories can be drawn simply by marking relative frequencies or percentages
instead of class frequencies of categories on the vertical axis.
Example 1: Consider the following example.
A sample was taken of 25 high school seniors who were planning to go to university.
Each of the students was asked which of the following majors he or she intended to
study: Business, Economics, Management Information Systems (MIS), Behavioural
Science (BS), Other. The responses of these students were as follows.

Economics  Business   BS         Other      Economics
MIS        Business   BS         Business   MIS
Economics  Other      MIS        MIS        Other
Business   Business   Other      Other      Other
MIS        Business   Other      Other      MIS

Construct a frequency distribution table and a relative frequency and percentage table for this
categorical data. Hence construct their corresponding bar graphs.
Step 1 Construct a tally and class frequency table for the given categorical data.

Major      Tally      Frequency
Business   ||||/ |    6
Economics  |||        3
MIS        ||||/ |    6
BS         ||         2
Other      ||||/ |||  8
                      Sum=25

Step 2 Construct a relative frequency and percentage table from above.
\[ \text{Relative frequency of a category} = \frac{\text{Frequency of that category}}{\text{Sum of all frequencies}} \]
\[ \text{Percentage of a category} = (\text{Relative frequency}) \times 100 \]

Major      Relative Frequency  Percentage %
Business   6/25 = .24          .24(100) = 24%
Economics  3/25 = .12          .12(100) = 12%
MIS        6/25 = .24          .24(100) = 24%
BS         2/25 = .08          .08(100) = 8%
Other      8/25 = .32          .32(100) = 32%
                               Sum=100%
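Steps 1 and 2 can be reproduced with a short Python sketch. The course itself uses MINITAB; the code below is only an illustration with names of our own choosing, using the 25 responses listed in the example.

```python
from collections import Counter

# The 25 responses from the example.
responses = [
    "Economics", "Business", "BS", "Other", "Economics",
    "MIS", "Business", "BS", "Business", "MIS",
    "Economics", "Other", "MIS", "MIS", "Other",
    "Business", "Business", "Other", "Other", "Other",
    "MIS", "Business", "Other", "Other", "MIS",
]

counts = Counter(responses)      # Step 1: frequency of each category
total = sum(counts.values())     # sum of all frequencies

# Step 2: relative frequency = frequency / total, percentage = relative frequency * 100.
for major in ["Business", "Economics", "MIS", "BS", "Other"]:
    freq = counts[major]
    print(f"{major:10s} {freq:2d}  {freq / total:.2f}  {100 * freq / total:.0f}%")
```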
From these tables we can draw the following bar graphs.
[Figure: bar graph "Student Choice of University Course"; vertical axis: Number of students; horizontal axis: Courses (BS, Business, Economics, MIS, Other).]
In decreasing order:
[Figure: bar graph "Student Choice of University Course"; vertical axis: Number of students; horizontal axis: Courses (Other, Business, MIS, Economics, BS).]
[Figure: bar graph "Relative Frequency Student Choice of University Course"; vertical axis: Relative Frequency; horizontal axis: Courses (Other, Business, MIS, Economics, BS).]
[Figure: bar graph "Percentage Student Choice of University Course"; vertical axis: Percentage; horizontal axis: Courses (Other, Business, MIS, Economics, BS).]
Pie Charts: A pie chart is more commonly used to display percentages, although it can
be used to display frequencies or relative frequencies. The whole pie (or circle)
represents the total sample or population. The pie is divided into different portions that
represent the percentages of the population or sample belonging to different categories.
Definition: Pie Chart: A circle divided into portions that represent the relative
frequencies or percentages of a population or sample belonging to different categories is
called a pie chart.
To construct a pie chart: A circle contains 360 degrees. To construct a pie chart we
multiply 360 by the relative frequency (the percentage divided by 100) of each category to
obtain the degree measure or size of the angle representing that particular category. For the
categorical data of student choice of university course above we show the calculation of angle
sizes for the various categories in the table below.

Major      Percentage %  Angle Size
Business   24            360 x .24 = 86.4
Economics  12            360 x .12 = 43.2
MIS        24            360 x .24 = 86.4
BS         8             360 x .08 = 28.8
Other      32            360 x .32 = 115.2
           Sum=100       Sum=360
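The angle calculation can be checked with a minimal Python sketch; the dictionary name and layout are our own, and the percentages are those of the table.

```python
# Percentages of each major, taken from the table above.
percentages = {"Business": 24, "Economics": 12, "MIS": 24, "BS": 8, "Other": 32}

# Angle of each slice in degrees = 360 * relative frequency = 360 * percentage / 100.
angles = {major: 360 * pct / 100 for major, pct in percentages.items()}

for major, angle in angles.items():
    print(f"{major:10s} {angle:6.1f} degrees")
print("Sum of angles:", round(sum(angles.values()), 1))
```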
The required pie chart is shown below.
[Figure: pie chart "Student Choice of University Course" with slices Economics (3, 12.0%), Business (6, 24.0%), MIS (6, 24.0%), BS (2, 8.0%), Other (8, 32.0%).]
Example 2 The breakdown of American municipal waste in 2000, in millions of tons,
is given by the following table.

Material                  Weight (millions of tons)
Food scraps               25.9
Glass                     12.8
Metals                    18.0
Paper, paperboard         86.7
Plastics                  24.7
Rubber, leather, textiles 15.8
Wood                      12.7
Yard trimmings            27.7
Other                     7.5
Total                     231.9

Note: The weights add to 231.8 and not 231.9 as given in the table, due to roundoff error.
Construct a percentage distribution table for the above categorical data and
draw (i) a bar chart and (ii) a pie chart for this percentage distribution.
Solution
We calculate each category's weight as a percentage of the total waste. The percentage
distribution is given below.

Material                  Weight (millions of tons)  Percentage of total
Food scraps               25.9                       11.2%
Glass                     12.8                       5.5%
Metals                    18.0                       7.8%
Paper, paperboard         86.7                       37.4%
Plastics                  24.7                       10.7%
Rubber, leather, textiles 15.8                       6.8%
Wood                      12.7                       5.5%
Yard trimmings            27.7                       11.9%
Other                     7.5                        3.2%
Total                     231.9                      100.0
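As a check on the percentage column, here is a small Python sketch. It assumes, as the table does, that each weight is divided by the reported total of 231.9; the variable names are our own.

```python
# Weights in millions of tons, copied from the table.
weights = {
    "Food scraps": 25.9, "Glass": 12.8, "Metals": 18.0,
    "Paper, paperboard": 86.7, "Plastics": 24.7,
    "Rubber, leather, textiles": 15.8, "Wood": 12.7,
    "Yard trimmings": 27.7, "Other": 7.5,
}
reported_total = 231.9  # total printed in the table

# Percentage of total for each material, rounded to one decimal place.
percent = {m: round(100 * w / reported_total, 1) for m, w in weights.items()}

for material, pct in percent.items():
    print(f"{material:26s} {pct:5.1f}%")

# The individual weights add to 231.8, not 231.9 (roundoff in the source data).
print("Sum of weights:", round(sum(weights.values()), 1))
```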
From this distribution table we can draw the bar graph of the weights and the percentage
bar graph as shown below.
[Figure: bar graph "Breakdown (in millions of tons) of American waste in 2000"; vertical axis: Weight of waste in millions of tons (0 to 90); horizontal axis: Waste (paper, yard, food, plastic, metals, rub, glass, wood, other).]
[Figure: bar graph "% of Breakdown (in millions of tons) of American waste in 2000"; vertical axis: Percentage weight of waste (0 to 40); horizontal axis: Waste (paper, yard, food, plastic, metals, rub, glass, wood, other).]
(ii) From the % frequency distribution table we can also construct the % pie chart
for the above data as outlined above. This pie chart is given below.
Categories are in the order given in the table in the pie chart.
[Figure: pie chart "% of American municipal waste in 2000" with slices food (11, 11.2%), glass (6, 5.5%), metals (8, 7.8%), paper (37, 37.4%), plastic (11, 10.7%), rub (7, 6.8%), wood (6, 5.5%), yard (12, 11.9%), other (3, 3.2%).]
Categories are in decreasing order anticlockwise in the pie chart.
[Figure: pie chart "% of American municipal waste in 2000" with slices paper (37, 37.4%), yard (12, 11.9%), food (11, 11.2%), plastic (11, 10.7%), metals (8, 7.8%), rub (7, 6.8%), glass (6, 5.5%), wood (6, 5.5%), other (3, 3.2%).]
Exercise 2: The areas of various continents of the world in millions of square kilometres
are presented in the table below.

Continent      Area   %
Africa         30.3   22.7
Asia           26.9   20.2
Europe         4.9    3.7
North America  24.3   18.2
Oceania        8.5    6.4
South America  17.9   13.4
U.S.S.R.       20.5   15.4
Total          133.3  100.0

Display this data using (i) a bar chart and (ii) a pie chart.
Exercise 3: The breakdown of total dollars spent on business trips in the United States is
estimated as follows: (a) 41% on air fares, (b) 22% on lodgings, (c) 12% on
meals, (d) 8% on car rentals and (e) the remainder on other expenses.
(i) Construct a pie chart to show this information.
(ii) Construct a bar chart to show this information.
Quantitative variables:
The two main tools for organizing and displaying quantitative data are
(i) histograms and (ii) stem-and-leaf displays.
Definition A histogram is a graph in which classes (groups of observations or individuals)
are marked on the horizontal axis and frequencies, relative frequencies or
percentages are marked on the vertical axis. The frequencies, relative
frequencies or percentages are represented by the heights of the bars. In a
histogram the bars are drawn adjacent to each other.
To draw a histogram we first have to construct a frequency distribution table. Data
presented in the form of a frequency distribution table are called grouped data. A graph
of the distribution is clearer if nearby values are grouped together. The most common
graph of the distribution of one quantitative variable is a histogram.
Definition A frequency distribution for quantitative data lists all the classes (or groups)
and the number of values that belong to each class.
To construct a frequency distribution table and hence a histogram we first have to
decide how many groups or classes to divide the given data set into. Usually the
number of classes varies from 5 to 20 depending on the number of data values in the data
set. It is preferable to have more classes as the size of the data set increases. Too few classes
will give a "skyscraper" graph, with all values in a few classes with tall bars. Too many
will produce a "pancake" graph, with most classes having one or no observations. Neither
choice will give a good picture of the shape of the distribution. You must use your
judgment in choosing classes to display the shape. Statistics software such as MINITAB
will choose the classes for you. The software's choice is usually a good one, but you can
change it if you want.
The decision about the number of classes is arbitrary and is made by the data organizer.
A rough guide is given below.
Rules for constructing a Frequency Distribution.
Rule 1 Class Intervals must be inclusive and non-overlapping. Each observation or
individual must belong to one and only one class and boundaries must not
overlap.
Rule 2 Number of Intervals
Rough Guide

Sample Size     Number of classes
Fewer than 50   5-6 classes
50-100          6-8 classes
Over 100        8-10 classes

Rule 3 Interval Width
The approximate class width is
\[ \text{width} \approx \frac{\text{largest data value} - \text{smallest data value}}{\text{number of groups}} \]
The interval width is often rounded to the most convenient integer.
Definition Class Boundaries are the end data values of the groups or classes that the
data are divided into.
Definition Class Width = Upper Boundary - Lower Boundary.
Definition
\[ \text{Midpoint} = \frac{\text{Lower Class Limit} + \text{Upper Class Limit of the same class}}{2} \]
We will illustrate all these definitions by preparing a frequency table and drawing the
histograms of the frequency distribution and % frequency distribution for the following example.
Example 1 Prepare a frequency table for the following data. Hence draw a histogram of
the frequency distribution and the % frequency distribution for this data.
The data below show the weights (to the nearest gram) of 40 bags of flour:
Data
501 500 502 503 501 507 496 499 499 499
498 498 499 494 505 503 502 511 501 500
490 513 505 503 502 507 500 505 499 498
493 494 499 501 505 496 488 503 499 507
Solution
The Range = 513 - 488 = 25. Taking 6 classes,
\[ \text{Class Width} = \frac{25}{6} \approx 4.17, \text{ which we round up to } 5. \]
Note: Any convenient number equal to or less than the smallest data value can be
used as the lower limit of the first class.
The required Frequency Distribution Table is given below.

Weights to nearest gram
Class Boundaries       Tally                    Class Width  Class Midpoint  Frequency
488 to less than 493   ||                       5            490.5           2
493 to less than 498   ||||/                    5            495.5           5
498 to less than 503   ||||/ ||||/ ||||/ ||||/  5            500.5           20
503 to less than 508   ||||/ ||||/ |            5            505.5           11
508 to less than 513   |                        5            510.5           1
513 to less than 518   |                        5            515.5           1
                                                                             Sum=40

Example: Class Width = Upper Class Boundary - Lower Class Boundary.
For the second class, Class Width = 498 - 493 = 5 and
\[ \text{Class Midpoint} = \frac{493 + 498}{2} = 495.5 \]
We need the equal class widths and class midpoints only to draw the histogram
or for use in a statistical software package such as MINITAB.
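The construction of the frequency table can be sketched in Python as follows. This is an illustration rather than the MINITAB procedure used on the course. Two choices are our own assumptions: the width is rounded up with `math.ceil`, and each class is treated as "lower boundary to less than upper boundary".

```python
import math

# Weights (to the nearest gram) of the 40 bags of flour.
weights = [
    501, 500, 502, 503, 501, 507, 496, 499, 499, 499,
    498, 498, 499, 494, 505, 503, 502, 511, 501, 500,
    490, 513, 505, 503, 502, 507, 500, 505, 499, 498,
    493, 494, 499, 501, 505, 496, 488, 503, 499, 507,
]

num_classes = 6                                  # Rule 2: fewer than 50 values
lowest, highest = min(weights), max(weights)
data_range = highest - lowest                    # 513 - 488 = 25
width = math.ceil(data_range / num_classes)      # 25 / 6 = 4.17, rounded up to 5

# Lower boundary of each class, starting at the smallest data value.
lower_bounds = [lowest + i * width for i in range(num_classes)]

# A value w belongs to a class when lower <= w < lower + width.
frequencies = [sum(1 for w in weights if lo <= w < lo + width) for lo in lower_bounds]
midpoints = [lo + width / 2 for lo in lower_bounds]

for lo, mid, freq in zip(lower_bounds, midpoints, frequencies):
    print(f"{lo} to less than {lo + width}   midpoint {mid}   frequency {freq}")
```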
From the above frequency distribution table we can draw the following histogram.
[Figure: histogram "Frequency Distribution of number of flour bags"; vertical axis: Number of flour bags (0 to 20); horizontal axis: Weights to nearest gram (class midpoints 490.5, 495.5, 500.5, 505.5, 510.5, 515.5).]
N.B. The shape of the histogram is single peaked and is approximately symmetrical.
A polygon is another device that can be used to present quantitative data in graphic form.
Definition: A frequency polygon is a graph formed by joining the midpoints of the tops
of successive bars in a frequency histogram with straight lines. Two extra classes are
added, one at each end, with their midpoints marked; they have zero frequency.
[Figure: histogram and frequency polygon "Frequency Distribution and Frequency Polygon of the weights of Flour Bags"; vertical axis: Number of flour bags (0 to 20); horizontal axis: Weights to nearest gram (class midpoints 490.5 to 515.5).]
We can also draw a histogram of the percentage (%) frequency and a % frequency polygon
by constructing a frequency table for these frequencies.
A complete frequency table showing (i) the frequency and (ii) the % frequency is given
below for the above data set.

Weights to nearest gram
Class Boundaries       Tally                    Class Midpoint  Frequency  Relative Frequency  Percentage % Frequency
488 to less than 493   ||                       490.5           2          2/40 = .05          5%
493 to less than 498   ||||/                    495.5           5          5/40 = .125         12.5%
498 to less than 503   ||||/ ||||/ ||||/ ||||/  500.5           20         20/40 = .5          50%
503 to less than 508   ||||/ ||||/ |            505.5           11         11/40 = .275        27.5%
508 to less than 513   |                        510.5           1          1/40 = .025         2.5%
513 to less than 518   |                        515.5           1          1/40 = .025         2.5%
                                                                Sum=40     Sum=1.0             Sum=100%

The last two columns are calculated from the fourth column (Frequency) according to the following notes.
Note 1
\[ \text{Relative frequency of a class} = \frac{\text{Frequency of that class}}{\text{Sum of all frequencies}} \]
Note 2
\[ \text{Percentage} = \text{Relative frequency} \times 100 \]
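Notes 1 and 2 translate directly into code. The following Python sketch, with names of our own, starts from the class frequencies in the table.

```python
# Class frequencies for the six weight classes of the flour-bag data.
frequencies = [2, 5, 20, 11, 1, 1]
total = sum(frequencies)                          # sum of all frequencies = 40

relative = [f / total for f in frequencies]       # Note 1
percent = [100 * f / total for f in frequencies]  # Note 2

for f, r, p in zip(frequencies, relative, percent):
    print(f"frequency {f:2d}   relative {r:.3f}   percentage {p:.1f}%")
```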
From this frequency distribution table we can draw the histogram for the % Frequency
Distribution and the % Frequency Polygon. This is shown below.
Definition : A percentage polygon is a graph formed by joining the midpoints of the top
of successive bars in a percentage % frequency histogram with straight lines. Two extra
classes are added, one at each end and their midpoints marked and they have zero
frequency.
[Figure: histogram and polygon "% Frequency Distribution and % Frequency Polygon of the weights of Flour Bags"; vertical axis: % Number of flour bags (0 to 50); horizontal axis: Weights to nearest gram (class midpoints 490.5 to 515.5).]
N.B. The shape of the histogram is single peaked and is approximately symmetrical.
Cumulative Frequency Distribution
Definition: A cumulative frequency distribution gives the total number of values that
fall below the upper boundary of each class. We will illustrate the concepts of (i)
Cumulative Frequency Distribution and (ii) Cumulative % Frequency Distribution using
the data for weights of the 40 flour bags above.
Example 1 Construct a (i) cumulative frequency distribution table and draw its
corresponding histogram for weights of the 40 flour bags above. Also construct a (ii)
cumulative % frequency distribution table and draw its corresponding histogram for
weights of the 40 flour bags above.
Solution
The complete frequency distribution table for this example was

Weights to nearest gram
Class Boundaries       Tally                    Class Midpoint  Frequency  Relative Frequency  Percentage % Frequency
488 to less than 493   ||                       490.5           2          2/40 = .05          5%
493 to less than 498   ||||/                    495.5           5          5/40 = .125         12.5%
498 to less than 503   ||||/ ||||/ ||||/ ||||/  500.5           20         20/40 = .5          50%
503 to less than 508   ||||/ ||||/ |            505.5           11         11/40 = .275        27.5%
508 to less than 513   |                        510.5           1          1/40 = .025         2.5%
513 to less than 518   |                        515.5           1          1/40 = .025         2.5%
                                                                Sum=40     Sum=1.0             Sum=100%

The cumulative frequency distribution table is

Weights to nearest gram
Class Boundaries       Class Midpoint  Frequency  Cumulative Frequency
488 to less than 493   490.5           2          2
493 to less than 498   495.5           5          2+5 = 7
498 to less than 503   500.5           20         2+5+20 = 27
503 to less than 508   505.5           11         2+5+20+11 = 38
508 to less than 513   510.5           1          2+5+20+11+1 = 39
513 to less than 518   515.5           1          2+5+20+11+1+1 = 40
                                       Sum=40

Note: The lower boundary of the first class, 488, is taken as the lower boundary of each
class in the cumulative frequency distribution. The upper boundaries of all classes are the
same as in the frequency distribution table. To obtain the cumulative frequency of a class just
add the frequency of that class to the frequencies of all the preceding classes. The cumulative
frequencies are recorded in the fourth column while the class boundaries are recorded in
the first column.
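The running sum described in the note can be sketched in Python with `itertools.accumulate`; the variable names are our own.

```python
from itertools import accumulate

# Class frequencies and upper class boundaries for the flour-bag data.
frequencies = [2, 5, 20, 11, 1, 1]
upper_boundaries = [493, 498, 503, 508, 513, 518]

# Cumulative frequency of a class = its frequency plus all preceding frequencies.
cumulative = list(accumulate(frequencies))

for upper, cum in zip(upper_boundaries, cumulative):
    print(f"488 to less than {upper}: {cum}")
```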
From this table we can draw the histogram of the cumulative frequency distribution
which is given below.
[Figure: histogram "Cumulative Frequency Distribution of number of flour bags"; vertical axis: Cumulative Number of flour bags (0 to 40); horizontal axis: Weights to nearest gram (class midpoints 490.5 to 515.5).]
The advantage of the cumulative frequency histogram or table is that it can answer the
following question:
"How many observations fall below the upper boundary of a class?"
Example: How many bags of flour weigh less than 503 grams? Answer: 27, the cumulative
frequency of the class 498 to less than 503.
Cumulative Relative Frequency and Cumulative Percentage
Cumulative Relative Frequency and Cumulative Percentage are easily obtained from the
cumulative frequency distribution using the following formulae.
\[ \text{Cumulative relative frequency} = \frac{\text{Cumulative frequency of a class}}{\text{Total observations in the data set}} \]
\[ \text{Cumulative percentage} = \text{Cumulative relative frequency} \times 100 \]
We will illustrate the Cumulative Relative Frequency and Cumulative Percentage using
the example above. The Cumulative Relative Frequency and Cumulative Percentage
distribution table for this data is

Weights to nearest gram
Class Boundaries       Class Midpoint  Cumulative Frequency  Cumulative Relative Frequency  Cumulative % Frequency
488 to less than 493   490.5           2                     2/40 = .05                     5%
493 to less than 498   495.5           2+5 = 7               7/40 = .175                    17.5%
498 to less than 503   500.5           2+5+20 = 27           27/40 = .675                   67.5%
503 to less than 508   505.5           2+5+20+11 = 38        38/40 = .95                    95%
508 to less than 513   510.5           2+5+20+11+1 = 39      39/40 = .975                   97.5%
513 to less than 518   515.5           2+5+20+11+1+1 = 40    40/40 = 1.0                    100%

Note: The Cumulative Relative Frequency and the Cumulative Percentage are really the
same except that the vertical axes have different units. Hence we will only plot the
Cumulative Percentage.
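The two formulae can be applied to the cumulative frequencies with a few lines of Python. This is a sketch with names of our own.

```python
# Cumulative frequencies of the six weight classes (flour-bag data).
cumulative = [2, 7, 27, 38, 39, 40]
total = cumulative[-1]                 # total observations in the data set = 40

# Cumulative relative frequency = cumulative frequency / total observations.
cum_relative = [c / total for c in cumulative]
# Cumulative percentage = cumulative relative frequency * 100.
cum_percent = [100 * c / total for c in cumulative]

for c, r, p in zip(cumulative, cum_relative, cum_percent):
    print(f"{c:2d}   {r:.3f}   {p:.1f}%")
```

The two derived columns differ only by the factor 100, which is why only the cumulative percentage is plotted.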
From this table we can draw the histogram for the Cumulative Percentage, which is given
below.
[Figure: histogram "Cumulative % Frequency Distribution of number of flour bags"; vertical axis: Cumulative % Number of flour bags (0 to 100); horizontal axis: Weights to nearest gram (class midpoints 490.5 to 515.5).]
Definition: An ogive (ojive) or cumulative frequency polygon is a curve drawn for the
cumulative frequency distribution by joining with straight lines the dots marked above
the upper boundaries of classes at heights equal to the cumulative frequencies of the
respective classes. The ogive starts at the lower boundary of the first class and ends
at the upper boundary of the last class.
[Figure: histogram and ogive "Cumulative Frequency Distribution and Cumulative Frequency Polygon or Ogive of the weights of Flour Bags"; vertical axis: Cumulative Number of flour bags (0 to 40); horizontal axis: Weights to nearest gram (class midpoints 490.5 to 515.5).]
One advantage of an ogive is that it can be used to approximate the cumulative
frequency at any weight, not only at the class boundaries. For example, reading the ogive
at 504 grams, between the points (503, 27) and (508, 38), gives approximately 29 bags
with weights below 504 grams.
Definition: A % ogive (ojive) or % cumulative frequency polygon is a curve drawn for the
% cumulative frequency distribution by joining with straight lines the dots marked
above the upper boundaries of classes at heights equal to the % cumulative
frequencies of the respective classes. The ogive starts at the lower boundary of the
first class and ends at the upper boundary of the last class.
[Figure: histogram and % ogive "% Cumulative Frequency Distribution and % Cumulative Frequency Polygon or % Ogive of the weights of Flour Bags"; vertical axis: % Cumulative Number of flour bags (0 to 100); horizontal axis: Weights to nearest gram (class midpoints 490.5 to 515.5).]
Note: From the % ogive, 50% of the bags have a weight of approximately 501 grams or less.
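Reading a value off an ogive amounts to linear interpolation between the dots at the class boundaries. The Python sketch below assumes the ogive is drawn exactly as in the definitions: it starts at (488, 0) and has a dot above each upper class boundary. The helper `interpolate` is our own and is not part of any package. Values read by eye from a drawn graph may differ somewhat from the interpolated ones.

```python
# Ogive points: class boundaries and the cumulative frequency reached at each.
boundaries = [488, 493, 498, 503, 508, 513, 518]
cumulative = [0, 2, 7, 27, 38, 39, 40]


def interpolate(x, xs, ys):
    """Linearly interpolate y at x along the polyline through (xs[i], ys[i])."""
    for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]):
        if x0 <= x <= x1:
            return y0 + (x - x0) * (y1 - y0) / (x1 - x0)
    raise ValueError("x lies outside the range of the ogive")


# Approximate number of bags weighing less than 504 grams.
bags_below_504 = interpolate(504, boundaries, cumulative)

# % ogive read the other way round: the weight below which 50% of the bags fall.
cum_percent = [100 * c / cumulative[-1] for c in cumulative]
weight_at_half = interpolate(50, cum_percent, boundaries)

print(round(bags_below_504, 1), weight_at_half)
```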
EXAMINING A DISTRIBUTION
Making a statistical graph is not an end in itself. The purpose of the graph is to help to
understand the data. After you make a graph, always ask, "What do I see?" Once you
have displayed a distribution, you can see its important features as follows.
In any graph of data, look for the overall pattern and for striking
deviations from that pattern. You can describe the overall pattern of a histogram by
its (i) Shape, (ii) Center and (iii) Spread.
An important kind of deviation is an outlier, an individual value that falls outside the
overall pattern.
We will learn how to describe center and spread numerically later. For now, we can
describe the center of a distribution by its midpoint or median, the value with roughly
half the observations taking smaller values and half taking larger values. We can
describe the spread of a distribution by giving the range, that is, the largest value minus
the smallest value.
Example: Examine the histogram which we obtained for the data on the weights of 40
flour bags which is reproduced below.
[Figure: histogram and polygon "% Frequency Distribution and % Frequency Polygon of the weights of Flour Bags"; vertical axis: % Number of flour bags (0 to 50); horizontal axis: Weights to nearest gram (class midpoints 490.5 to 515.5).]
(i) The shape of the histogram is single peaked and approximately symmetrical, with
no obvious outliers.
(ii) The center is given by the midpoint or median, which is approximately 500.5
from the graph.
(iii) The spread is given by the range = 513 - 488 = 25 (the class boundaries of the
histogram run from 488 to 518).
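The center and spread can also be computed directly from the 40 raw weights rather than read from the graph. The following Python sketch uses the standard `statistics` module; the variable names are our own.

```python
import statistics

# Weights (to the nearest gram) of the 40 bags of flour.
weights = [
    501, 500, 502, 503, 501, 507, 496, 499, 499, 499,
    498, 498, 499, 494, 505, 503, 502, 511, 501, 500,
    490, 513, 505, 503, 502, 507, 500, 505, 499, 498,
    493, 494, 499, 501, 505, 496, 488, 503, 499, 507,
]

center = statistics.median(weights)    # half the bags weigh less, half weigh more
spread = max(weights) - min(weights)   # range = largest value - smallest value

print("median:", center, "range:", spread)
```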
Example: Examine the histogram which we obtained for the data on the % of the
American population per state that were of Hispanic origin. The histogram is
reproduced below.
[Figure: histogram and polygon "Frequency Distribution and Frequency Polygon of the % of population of Hispanic origin"; vertical axis: Number of States (0 to 30); horizontal axis: % of population of Hispanic origin (class midpoints -2.5 to 47.5).]
(i) Shape: The histogram is single peaked and right-skewed. The single peak represents
states that are less than 5% Hispanic. Most states have no more than 10% Hispanics, but
some states have much higher percentages, so that the graph trails off to the right.
(ii) Center: The frequency distribution table and histogram show that about half the
states have less than 4.7% Hispanics among their residents and half have more. So the
midpoint of the distribution is close to 4.7%.
(iii) Spread: The spread is from about 0% to 42%, but only four states fall above 20%.
Outliers: Arizona, California, New Mexico, and Texas stand out. Whether they are
outliers or just part of the long right tail of the distribution is a matter of judgement.
There is no rule for calling an observation an outlier but we will give a rough rule later.
Once you have spotted possible outliers, look for an explanation. Some outliers are due
to mistakes, such as typing 4.2 as 42. Other outliers point to the special nature of some
observations. These four states are heavily Hispanic by history and location.
Exercise 1 The final marks of 80 students at a university are recorded below
68 84 75 82 68 90 62 88 76 93
73 79 88 73 60 93 71 59 85 75
61 65 75 87 74 62 95 78 63 72
66 78 82 75 94 77 69 74 68 60
96 78 89 61 75 95 60 79 83 71
79 62 67 97 78 85 76 65 71 75
65 80 73 57 88 78 62 76 53 74
86 67 73 81 72 63 76 75 85 77.
(a) Construct a Frequency Distribution, Relative Frequency Distribution and %
Frequency Distribution table for this data and draw the histograms for (i) the
Frequency Distribution and Frequency Polygon, and (ii) the % Frequency
Distribution and % Frequency Polygon.
Take 7 equal classes.
(b) Construct a Cumulative Frequency Distribution and Cumulative %
Frequency Distribution table and draw their respective histograms and
Ogives. Take 7 equal classes.
(c) Examine and describe the Frequency Distribution.
Exercise 2: The total payrolls (rounded to millions) for all 30 major league baseball
teams in the U.S.A. for 1999 are given in the table below.
Total Payrolls of Major League Baseball Teams for 1999

Team               Total Payroll           Team              Total Payroll
                   (millions of dollars)                     (millions of dollars)
Anaheim            51                      Milwaukee         43
Arizona            70                      Minnesota         16
Atlanta            79                      Montreal          15
Baltimore          75                      New York Mets     72
Boston             72                      New York Yankees  92
Chicago Cubs       55                      Oakland           25
Chicago White Sox  25                      Philadelphia      30
Cincinnati         38                      Pittsburgh        24
Cleveland          74                      St. Louis         46
Colorado           54                      San Diego         47
Detroit            37                      San Francisco     46
Florida            15                      Seattle           45
Houston            56                      Tampa Bay         38
Kansas City        17                      Texas             81
Los Angeles        77                      Toronto           49

(a) Construct a Frequency Distribution, Relative Frequency Distribution and %
Frequency Distribution table for this data and draw the histograms for (i) the
Frequency Distribution and Frequency Polygon, and (ii) the % Frequency
Distribution and % Frequency Polygon.
Take 5 equal classes.
(b) Construct a Cumulative Frequency Distribution and Cumulative %
Frequency Distribution table and draw their respective histograms and
Ogives. Take 7 equal classes.
(c) Examine and describe the Frequency Distribution.
Description and Interpretation of Histograms
A histogram is a pictorial representation of the frequency distribution of a data set.
Interpreting the histogram is very important as this interpretation helps us to understand
the data and what the data tell us about the process the data represent.
In examining a histogram you should note its most important features, which are
(i) its overall shape or pattern together with any deviations,
(ii) its centre,
(iii) its spread.
An important kind of deviation is an outlier, which is an individual data value that lies
outside the overall pattern of data values.
Firstly, as regards the shape, note whether it is irregular, multimodal (many peaked) or
unimodal (single peaked).
An example of an irregular histogram is the histogram displaying the costs for the 2002-2003
academic year of 56 four-year colleges in Massachusetts.
The overall pattern shows that it is neither symmetric nor skewed but that it is irregular
with two separate clusters of colleges, 11 colleges costing less than $16,000 (public
colleges) and the remaining 45 colleges costing more than $20,000 (private colleges).
A histogram which is multimodal (in fact bimodal) is shown below.
Such a bimodal histogram generally represents a mixture of two different types of data.
In this case half of the men are Irish and half are pygmies.
Most of the histograms that we study are unimodal ( single peaked) and the most
common shapes are
(i)
symmetric,
(ii)
skewed,
(iii) uniform or rectangular.
Graphs of these shapes together with their frequency curves are shown below.
(i) Histogram of the vocabulary scores of all 947 seventh-grade students in Gary, Indiana.
The smooth curve shows the overall shape of the distribution and is an approximation
of the frequency polygon for this large data set. This curve is a mathematical model for
the distribution. A mathematical model is an idealized description. It gives a compact
picture of the overall pattern of the data but ignores minor irregularities as well as any
outliers. It is the curve that the frequency polygon approaches when the data set becomes
very large. The histogram is symmetric and bell-shaped. This shape is very important and is
the most widely occurring histogram shape in statistics.
The table and histogram of the percentage of the population of Hispanic origin, by state,
in 2000 are given below.
State           Percentage    State           Percentage    State           Percentage
Alabama                1.5    Louisiana              2.4    Ohio                   1.9
Alaska                 4.5    Maine                  0.7    Oklahoma               5.2
Arizona               25.3    Maryland               4.3    Oregon                 8.0
Arkansas               2.8    Massachusetts          6.8    Pennsylvania           3.2
California            34.4    Michigan               3.3    Rhode Island           8.7
Colorado              17.1    Minnesota              2.9    South Carolina         2.4
Connecticut            9.4    Mississippi            1.3    South Dakota           1.4
Delaware               4.8    Missouri               2.1    Tennessee              2.0
Florida               16.8    Montana                2.0    Texas                 32.0
Georgia                5.3    Nebraska               5.5    Utah                   9.0
Hawaii                 7.2    Nevada                19.7    Vermont                0.9
Idaho                  7.9    New Hampshire          1.7    Virginia               4.7
Illinois              10.7    New Jersey            13.3    Washington             7.2
Indiana                3.5    New Mexico            42.1    West Virginia          0.7
Iowa                   2.8    New York               5.1    Wisconsin              3.6
Kansas                 7.0    North Carolina         4.7    Wyoming                6.4
Kentucky               1.5    North Dakota           1.2
Its histogram is given below.
[Figure: Histogram of the percentage of population of Hispanic origin by state. Horizontal
axis: Percentage Hispanic, with class midpoints 2.5, 7.5, 12.5, ..., 42.5. Vertical axis:
Number of states, from 0 to 30.]
Shape of Histogram: Its overall shape is unimodal and skewed to the right, i.e.
most states have no more than 10% Hispanics but some states have much higher
percentages, so that the graph tails off to the right.
Centre of Histogram: The centre as shown by the table is about 4.7%, i.e. about half the
states have less than 4.7% Hispanic among their residents and half have more. Hence the
midpoint of the distribution is close to 4.7%.
Spread of Histogram: The spread is from about 0% to 42%, with only 4 states above 20%.
Outliers of Histogram: Arizona, California, New Mexico, and Texas stand out.
Whether they are outliers or just part of the long right tail of the
distribution is a matter of judgement. There is no rule for
calling an observation an outlier. Once you have spotted
possible outliers, look for an explanation. Some outliers are due
to mistakes, such as typing 4.2 as 42. Other outliers point to the
special nature of some observations. These four states are
heavily Hispanic by history and location.
Note: When you describe a distribution, concentrate on the main features. Look for major
peaks, not for minor ups and downs in the bars of the histogram. Look for clear outliers,
not just for the smallest and largest observations. Look for rough symmetry or clear
skewness.
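The description above can be checked directly from the table. The following is a minimal Python sketch, not part of the MINITAB workflow used in this course. It assumes classes of width 5 starting at 0 (which matches the class midpoints 2.5, 7.5, ..., 42.5 on the histogram axis), and the names `hispanic_pct` and `class_counts` are illustrative only.

```python
from statistics import median

# Percentages transcribed from the table above, in table order (50 states).
hispanic_pct = [
    1.5, 4.5, 25.3, 2.8, 34.4, 17.1, 9.4, 4.8, 16.8, 5.3, 7.2, 7.9, 10.7,
    3.5, 2.8, 7.0, 1.5,                      # Alabama .. Kentucky
    2.4, 0.7, 4.3, 6.8, 3.3, 2.9, 1.3, 2.1, 2.0, 5.5, 19.7, 1.7, 13.3,
    42.1, 5.1, 4.7, 1.2,                     # Louisiana .. North Dakota
    1.9, 5.2, 8.0, 3.2, 8.7, 2.4, 1.4, 2.0, 32.0, 9.0, 0.9, 4.7, 7.2,
    0.7, 3.6, 6.4,                           # Ohio .. Wyoming
]

CLASS_WIDTH = 5   # assumption: classes 0-5, 5-10, ..., 40-45
N_CLASSES = 9


def class_counts(values, width=CLASS_WIDTH, n_classes=N_CLASSES):
    """Count how many values fall in each class [k*width, (k+1)*width)."""
    counts = [0] * n_classes
    for v in values:
        counts[int(v // width)] += 1
    return counts


counts = class_counts(hispanic_pct)                      # shape: heights of the bars
centre = median(hispanic_pct)                            # centre: midpoint of the data
above_20 = sum(1 for v in hispanic_pct if v > 20)        # states in the far right tail
at_most_10 = sum(1 for v in hispanic_pct if v <= 10)     # bulk of the distribution

for k, c in enumerate(counts):
    midpoint = k * CLASS_WIDTH + CLASS_WIDTH / 2
    print(f"midpoint {midpoint:5.1f}: {c:2d} states")
print("median:", centre, "| above 20%:", above_20, "| at most 10%:", at_most_10)
```

The tall first bars followed by a long run of short bars is the right skew described above, and the count of states above 20% identifies the four candidates for outliers.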
The most common shape for a distribution is a symmetric pattern, like the one shown below
for the heights of 1000 men, called a normal distribution.
The normal distribution arises so often that when we see a non-normal pattern it is worth
asking why it is not normal. The histogram below shows the heights of 1000 men with a
distribution that is not normal, and there is a reason for this. The left tail is missing, and
the distribution is called a truncated normal. The reason it is not normal is that the men are
all members of the police, which has a minimum height requirement for all recruits.
Describing Distributions with Numbers
Two common features of all the distributions which we have encountered are
(i) the data cluster about a central data value,
(ii) the data spread or vary about this central data value.
We would like to have numbers that measure these two characteristics of data
distributions, namely the center and the spread of the data.
The two main numerical measures of the center of data are (i) the mean and (ii) the
median.
The main numerical measures of spread or variability are (i) the range, (ii) the quartiles
and (iii) the standard deviation.
Measuring Center of Data: the Mean The most common measure of center of a set of
data is the arithmetic mean, usually called just the mean.
Definition: Mean The mean, denoted by \(\bar{x}\), of \(n\) data values \(x_1, x_2, x_3, \ldots, x_n\) is
\[ \bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i . \]
Example 1: The incomes of 15 people who have bachelor's degrees, chosen at random
by the U.S. Census Bureau in March 2002, were, to the nearest thousand dollars,
110 25 50 50 55 30 35 30 4 32 50 30 31 74 60.
Find the average income. Also find the average income of 14 of the same people,
excluding the person earning 110,000 dollars.
Solution
\[ \bar{x} = \frac{110+25+50+50+55+30+35+30+4+32+50+30+31+74+60}{15} = \frac{666}{15} = 44.4 \]
or $44,400.
The average of the 14 people is
\[ \bar{x} = \frac{25+50+50+55+30+35+30+4+32+50+30+31+74+60}{14} = \frac{556}{14} = 39.714286 \approx 39.7 \]
or $39,700.
Note: 110 was an outlier and its presence raised the mean from $39,700 to $44,400.
This illustrates the important fact that the mean as a measure of the center of data
is sensitive to the influence of a few extreme values. These may be outliers, but
a skewed distribution with no outliers will also pull the mean towards its long
tail. Because the mean cannot resist the influence of extreme data values we say
that the mean is not a resistant measure of the center.
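The effect of the single extreme value on the mean can be reproduced with a short Python sketch. This is only an illustration of the arithmetic in Example 1. The helper name `mean` and the variable names are assumptions, not part of the course software.

```python
# Incomes from Example 1, in thousands of dollars.
incomes = [110, 25, 50, 50, 55, 30, 35, 30, 4, 32, 50, 30, 31, 74, 60]


def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)


mean_all = mean(incomes)                             # all 15 people
without_outlier = [x for x in incomes if x != 110]   # drop the person earning 110
mean_14 = mean(without_outlier)                      # remaining 14 people
shift = mean_all - mean_14                           # how far one value moved the mean

print(f"mean of 15: {mean_all:.1f}")
print(f"mean of 14: {mean_14:.1f}")
print(f"shift caused by the outlier: {shift:.1f}")
```

Removing one observation out of fifteen moves the mean by several thousand dollars, which is what is meant by saying the mean is not resistant.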
Measuring center of data: Median Another important measure of the center of data is
the median M.
Definition: Median The median M is the midpoint of a distribution, such that half the
data values are smaller and half are larger.
To find the median of a distribution:
Step 1 Arrange all the data values in order of size from the smallest to the largest.
Step 2 If the number of data values n is odd, the median M is the center data value
in the ordered list. Find the location of the median by counting \((n+1)/2\)
values up from the bottom of the list.
Step 3 If the number of data values n is even, the median M is the mean of the two
center data values in the ordered list. The location of the median is again found by
counting \((n+1)/2\) values up from the bottom of the list.
Note: The formula \((n+1)/2\) does not give the median but just the location of the median in
the ordered list. The median requires little arithmetic to calculate and for a small data
set is easy to compute.
Example: To find the median when n is odd and when n is even.
The incomes of 15 people who have bachelor's degrees, chosen at random by the U.S.
Census Bureau in March 2002, were, to the nearest thousand dollars,
110 25 50 50 55 30 35 30 4 32 50 30 31 74 60.
Find the median income. Also find the median income of 14 of the same
people, excluding the person earning 110,000 dollars.
Solution The earnings of the 15 college graduates arranged in ascending order are
4 25 30 30 30 31 32 35 50 50 50 55 60 74 110
The median is situated at the \((n+1)/2 = 16/2 = 8\)th location in the data set and its value is
Median M = 35 or $35,000.
The earnings of the 14 college graduates arranged in ascending order are
4 25 30 30 30 31 32 35 50 50 50 55 60 74
The median is situated at the \((n+1)/2 = 15/2 = 7.5\)th location in the data set, that is halfway
between the 7th and 8th positions in the ordered list.
Thus the Median \(= \frac{32+35}{2} = \frac{67}{2} = 33.5\) or $33,500.
Note: The outlier 110 changes the median by $1500.
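The three steps for finding the median can be sketched in Python as follows. This is a hand-written illustration of the rule stated above, with illustrative function names. Python's standard library also offers `statistics.median`, which gives the same values for these data.

```python
def median_location(n):
    """Location (n + 1) / 2 of the median in the ordered list (1-based)."""
    return (n + 1) / 2


def median(values):
    """Median by the rule above: middle value if n is odd,
    mean of the two middle values if n is even."""
    ordered = sorted(values)            # Step 1: arrange in increasing order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                      # Step 2: n odd
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # Step 3: n even


incomes_15 = [110, 25, 50, 50, 55, 30, 35, 30, 4, 32, 50, 30, 31, 74, 60]
incomes_14 = [x for x in incomes_15 if x != 110]

m15 = median(incomes_15)
m14 = median(incomes_14)
print("n = 15: location", median_location(15), "median", m15)
print("n = 14: location", median_location(14), "median", m14)
print("change in median:", m15 - m14)
```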
Comparing the Mean and the Median
We see from the calculations above that the single outlier $110,000 changes the mean by
$4700 while it changes the median by only $1500. The median is thus considered a
resistant measure of the center of data while the mean is not a resistant measure of the
center of data.
More generally the mean and the median are close together in a symmetric distribution. In
fact if the distribution is exactly symmetrical the mean and the median will be the
same. In a skewed distribution the mean is further out in the long tail than the median.
Measuring the Spread or Variability of a Distribution of Data Two quantities that are
used to measure the spread or variability of the distribution of data are
(i) the range of the data and (ii) the quartiles of the data.
Definition: The Range of the data is the maximum data value minus the minimum data value.
This measure of spread or variability can be unreliable because the data may
contain outliers. A more reliable measure of the spread or variability of the distribution
of data is the Interquartile Range (IQR), which measures the spread or variability of
the middle half of the data.
Definition: Quartiles Q1 and Q3 With the data arranged in increasing order, the first
quartile Q1 lies one quarter of the way up the list of data. The third quartile
Q3 lies three quarters of the way up the list of data. In other words the first quartile Q1 is
larger than 25% of the data values and the third quartile Q3 is larger than 75% of the data values.
Note: The second quartile Q2 is the median M, which is larger than 50% of the data
values. Thus the quartiles divide the data into quarters.
Definition: The Interquartile Range (IQR) \(= Q_3 - Q_1\).
Note: The interquartile range measures the spread of the middle half of the data.
To find the quartiles Q1 and Q3:
Step 1 Arrange the data in increasing order, listing all the data including those that are
equal (always regard data that are equal as distinct), and locate the median M.
Step 2 The first quartile Q1 is the median of the data values in the ordered list of
data to the left of the location of the overall median M. In other words Q1 is
located in the \((n+1)/4\) position of the ordered list of data values.
Step 3 The third quartile Q3 is the median of the data values in the ordered list of
data to the right of the location of the overall median M. In other words Q3 is
located in the \(3(n+1)/4\) position of the ordered list of data values.
Step 4 Find the Interquartile Range (IQR) \(= Q_3 - Q_1\).
Note: Some software packages use slightly different rules to calculate the quartiles, so
computer results may differ slightly from the results calculated by the above
rules. The difference is usually small and can be ignored.
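The last note can be illustrated with a small Python sketch. It applies the median-of-each-half rule from Steps 1 to 3 to the 14 incomes used earlier and compares the result with `statistics.quantiles` from Python's standard library (Python 3.8 or later). Python is used here only as a stand-in for "some software package"; the function names `median_of` and `lecture_quartiles` are illustrative.

```python
from statistics import quantiles


def median_of(ordered):
    """Median of an already ordered list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def lecture_quartiles(values):
    """Q1 and Q3 as the medians of the values to the left and to the
    right of the location of the overall median (Steps 1 to 3)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                 # left of the median location
    upper = ordered[half + (n % 2):]       # right of the median location
    return median_of(lower), median_of(upper)


# The 14 incomes (thousands of dollars) with the outlier 110 removed.
incomes_14 = [25, 50, 50, 55, 30, 35, 30, 4, 32, 50, 30, 31, 74, 60]

q1_lecture, q3_lecture = lecture_quartiles(incomes_14)
q1_excl, _, q3_excl = quantiles(incomes_14, n=4, method="exclusive")
q1_incl, _, q3_incl = quantiles(incomes_14, n=4, method="inclusive")

print("lecture rule:      ", q1_lecture, q3_lecture)
print("library, exclusive:", q1_excl, q3_excl)
print("library, inclusive:", q1_incl, q3_incl)
```

For these data the three rules agree on Q1, while Q3 differs slightly between rules because the exclusive method interpolates between two neighbouring data values.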
Five-number summary and Box-and-Whisker plot (Boxplot)
The smallest and the largest data values tell us little about the data distribution as a
whole but they give us information about the tails of the distribution that is missing if we
know only Q1, M and Q3. To get a quick summary of both center and spread we
combine all five numbers into what is called the five-number summary.
Definition: The five-number summary of a distribution of data consists of the smallest
data value, the first quartile Q1, the median M, the third quartile Q3, and
the largest data value, written in order from the smallest to the largest. In
symbols the five-number summary is
Minimum Q1 M Q3 Maximum.
Note: These five numbers offer a reasonably complete description of the center and the
spread or variability of a data distribution. Of course
Minimum ≤ Q1 ≤ M ≤ Q3 ≤ Maximum,
with equality possible when data values are repeated.
Example 2: The incomes of 15 people who have bachelor's degrees, chosen at random
by the U.S. Census Bureau in March 2002, were, to the nearest thousand
dollars, 110 25 50 50 55 30 35 30 4 32 50 30 31 74 60.
Find (i) the range, (ii) the five-number summary and (iii) the interquartile range (IQR).
Solution
Note \(n = 15\).
Step 1 Arrange the data in ascending order
4 25 30 30 30 31 32 35 50 50 50 55 60 74 110
(i) The Range = 110 - 4 = 106.
(ii) The median M is at the \((n+1)/2 = 16/2 = 8\)th position in the list, i.e. M = 35.
The first quartile Q1 is the median of the data values in the ordered list of
data to the left of the location of the overall median M, that is the median of
4 25 30 30 30 31 32
which is at the \((n+1)/2 = 8/2 = 4\)th position (where \(n = 7\)), i.e. \(Q_1 = 30\).
Alternatively the first quartile Q1 is at the \((n+1)/4 = 16/4 = 4\)th
position of the overall ordered data list, i.e. \(Q_1 = 30\).
The third quartile Q3 is the median of the data values in the ordered list of
data to the right of the location of the overall median M, that is the median of
50 50 50 55 60 74 110
which is at the \((n+1)/2 = 8/2 = 4\)th position (where \(n = 7\)), i.e. \(Q_3 = 55\).
Alternatively the third quartile Q3 is at the \(3(n+1)/4 = (3 \times 16)/4 = 12\)th
position of the overall ordered data list, i.e. \(Q_3 = 55\).
Thus the five-number summary is
Minimum = 4, Q1 = 30, M = 35, Q3 = 55 and Maximum = 110.
(iii) The Interquartile Range (IQR) \(= Q_3 - Q_1 = 55 - 30 = 25\).
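The whole of Example 2 can be checked with a short Python sketch. It is a direct transcription of the hand rule used in the solution, with the function names `median_of` and `five_number_summary` chosen for illustration. It also confirms that the two routes to the quartiles (median of each half, and the positions (n+1)/4 and 3(n+1)/4) agree for n = 15.

```python
def median_of(ordered):
    """Median of an already ordered list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (Minimum, Q1, M, Q3, Maximum) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                # left of the median location
    upper = ordered[half + (n % 2):]      # right of the median location
    return (ordered[0], median_of(lower), median_of(ordered),
            median_of(upper), ordered[-1])


incomes = [110, 25, 50, 50, 55, 30, 35, 30, 4, 32, 50, 30, 31, 74, 60]

minimum, q1, m, q3, maximum = five_number_summary(incomes)
data_range = maximum - minimum
iqr = q3 - q1

# Alternative route: positions (n + 1)/4 and 3(n + 1)/4, which are whole
# numbers here because n + 1 = 16 is divisible by 4.
ordered = sorted(incomes)
n = len(ordered)
q1_by_position = ordered[(n + 1) // 4 - 1]       # 4th value (1-based)
q3_by_position = ordered[3 * (n + 1) // 4 - 1]   # 12th value (1-based)

print("five-number summary:", minimum, q1, m, q3, maximum)
print("range:", data_range, "IQR:", iqr)
```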
Minitab gives the following results
Descriptive Statistics: income
Variable     N     Mean   Median   TrMean    StDev   SE Mean
income      15    44.40    35.00    42.46    24.90      6.43
Variable     Minimum   Maximum       Q1       Q3
income          4.00    110.00    30.00    55.00
Stem-and-leaf of income   N = 15
Leaf Unit = 1.0
   1    0  4
   1    1
   2    2  5
  (6)   3  000125
   7    4
   7    5  0005
   3    6  0
   2    7  4
   1    8
   1    9
   1   10
   1   11  0
[Figure: Boxplot of income, with the axis marked at 0, 50 and 100.]
Note: the asterisk in the boxplot denotes the outlier 110.
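The stem and leaf columns of the display can be rebuilt with a few lines of Python. This is a sketch under stated assumptions, not MINITAB's own algorithm: each value is divided by the leaf unit and truncated, the last digit becomes the leaf and the remaining digits the stem. The depth column on the left of the MINITAB display is not reproduced, and the function name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict


def stem_and_leaf(values, leaf_unit=1):
    """Return a list of (stem, leaves) rows, one per stem, including
    empty stems between the smallest and largest value."""
    scaled = sorted(int(v // leaf_unit) for v in values)   # drop digits below the leaf unit
    leaves = defaultdict(list)
    for v in scaled:
        leaves[v // 10].append(v % 10)                     # stem = leading digits, leaf = last digit
    rows = []
    for stem in range(scaled[0] // 10, scaled[-1] // 10 + 1):
        rows.append((stem, "".join(str(d) for d in leaves.get(stem, []))))
    return rows


incomes = [110, 25, 50, 50, 55, 30, 35, 30, 4, 32, 50, 30, 31, 74, 60]

rows = stem_and_leaf(incomes, leaf_unit=1)
for stem, leaf_string in rows:
    print(f"{stem:3d} | {leaf_string}")
```

The long gap of empty stems between 7 and 11 makes the outlier 110 easy to see.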
Measuring the Spread or Variability of Data: The Standard Deviation
The five-number summary is not the most common numerical description of a
distribution. That distinction belongs to the combination of the mean to measure the
center of the distribution and the standard deviation to measure the spread or variability
of the data. The standard deviation measures the spread of data by measuring how far
data values are from the mean of the data.
Definition: Variance \(s^2\) and Standard Deviation \(s\): The variance \(s^2\) of a set of data
is the average of the squares of the deviations of the data from the mean, that is if we
have a set of \(n\) data values \(x_1, x_2, x_3, \ldots, x_n\) then
\[ s^2 = \frac{(x_1-\bar{x})^2 + (x_2-\bar{x})^2 + (x_3-\bar{x})^2 + \cdots + (x_n-\bar{x})^2}{n-1} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 , \]
where \(\bar{x}\) is the mean.
The standard deviation \(s\) is the square root of the variance \(s^2\), that is
\[ s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2} . \]
Example 3 A person's metabolic rate is the rate at which the body consumes energy.
Metabolic rate is important in studies of weight gain, dieting and exercise.
Below are the metabolic rates, measured in calories, of 7 men who took part in a
study of dieting
1792 1666 1362 1614 1460 1867 1439.
The mean is
\[ \bar{x} = \frac{1792+1666+1362+1614+1460+1867+1439}{7} = \frac{11200}{7} = 1600 \text{ calories.} \]
Data    Deviation               Squared Deviation
1792    1792-1600 =  192         192^2 = 36864
1666    1666-1600 =   66          66^2 =  4356
1362    1362-1600 = -238      (-238)^2 = 56644
1614    1614-1600 =   14          14^2 =   196
1460    1460-1600 = -140      (-140)^2 = 19600
1867    1867-1600 =  267         267^2 = 71289
1439    1439-1600 = -161      (-161)^2 = 25921
        ________________      ________________
        Sum = 0               Sum = 214,870
The variance is
\[ s^2 = \frac{(1792-1600)^2 + (1666-1600)^2 + (1362-1600)^2 + (1614-1600)^2 + (1460-1600)^2 + (1867-1600)^2 + (1439-1600)^2}{6} = \frac{214870}{6} = 35811.67 . \]
The standard deviation is \( s = \sqrt{s^2} = \sqrt{35811.67} = 189.24 \) calories.
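The table of deviations and the resulting variance can be reproduced step by step in Python. This sketch follows the definition with divisor n - 1 exactly as above. The variable names are illustrative, and `statistics.stdev` from the standard library would return the same standard deviation.

```python
from math import sqrt

# Metabolic rates (calories) of the 7 men in Example 3.
rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]

n = len(rates)
mean = sum(rates) / n                         # x-bar
deviations = [x - mean for x in rates]        # x_i - x-bar
squared = [d ** 2 for d in deviations]        # (x_i - x-bar)^2

variance = sum(squared) / (n - 1)             # s^2, dividing by n - 1
s = sqrt(variance)                            # s, in the units of the data

print(f"{'Data':>6} {'Deviation':>10} {'Squared':>10}")
for x, d, sq in zip(rates, deviations, squared):
    print(f"{x:6d} {d:10.0f} {sq:10.0f}")
print(f"sum of deviations = {sum(deviations):.0f}, sum of squares = {sum(squared):.0f}")
print(f"variance = {variance:.2f}, standard deviation = {s:.2f}")
```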
Note 1: The variance \(s^2\) does not have the same units as the data, which are in calories
(its units are calories squared). Hence we take the square root of \(s^2\) to get the
standard deviation \(s\), which has the same units as the data.
Note 2: We had to square the deviations because otherwise when we sum the deviations
we get zero, which would be useless.
Note 3: When averaging the sum of the squares of the deviations we divide by \(n-1\)
where you would expect to divide by \(n\), the number of squared deviations. The
reason why we divide by \(n-1\) and not \(n\) is as follows. The sum of the
deviations \(\sum_{i=1}^{n}(x_i-\bar{x})\) is always zero (because of the definition of the
mean, \(\sum_{i=1}^{n} x_i = n\bar{x}\)), so knowing \(n-1\) of them determines the last one.
Thus only \(n-1\) of the squared deviations can vary freely and so we average by dividing
by \(n-1\) rather than \(n\). The number \(n-1\) is called the degrees of freedom of the
variance or standard deviation.
Note 4: \(s\) measures the spread about the mean \(\bar{x}\) and should be used only when the mean
is chosen as the measure of the center of the distribution.
Note 5: \(s = 0\) only when there is no spread, i.e. when all the data values are the same;
otherwise \(s > 0\).
Note 6: \(s\) has the same units of measurement as the data and this is the reason why the
standard deviation \(s\) is chosen in preference to the variance \(s^2\).
Note 7: Like the mean \(\bar{x}\), the standard deviation \(s\) is not resistant. Strong skewness
or a few outliers can greatly increase \(s\). For example if 1439 is replaced by 1999
in the above data the new mean \(\bar{x}\) and new standard deviation \(s\) are \(\bar{x} = 1680\) and
\(s = 224.85\).
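The claim in Note 7 can be verified with Python's standard `statistics` module, whose `stdev` function uses the same n - 1 divisor as the definition above. This is only a check of the arithmetic; the variable names are illustrative.

```python
from statistics import mean, stdev

# Metabolic rates (calories) from Example 3.
original = [1792, 1666, 1362, 1614, 1460, 1867, 1439]

# Replace the single value 1439 by 1999, as in Note 7.
modified = [1999 if x == 1439 else x for x in original]

mean_before, s_before = mean(original), stdev(original)
mean_after, s_after = mean(modified), stdev(modified)

print(f"before: mean = {mean_before}, s = {s_before:.2f}")
print(f"after:  mean = {mean_after}, s = {s_after:.2f}")
```

Changing one value out of seven moves both the mean and the standard deviation noticeably, which is what "not resistant" means for these two measures.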
Choosing measures of Center and Spread of Data
The five-number summary is usually better than the mean \(\bar{x}\) and the standard
deviation \(s\) for describing a skewed distribution or a distribution with strong outliers.
Use the mean \(\bar{x}\) and the standard deviation \(s\) only for reasonably symmetric
distributions that are free of outliers.
Note: Most computer packages give both the five-number summary and the mean \(\bar{x}\) and
the standard deviation \(s\).
For example MINITAB gives the following for the above data, namely
1792 1666 1362 1614 1460 1867 1439
Data in ascending order is 1362 1439 1460 1614 1666 1792 1867.
Descriptive Statistics: C1
Variable     N     Mean   Median   TrMean    StDev   SE Mean
C1           7   1600.0   1614.0   1600.0    189.2      71.5
Variable     Minimum   Maximum       Q1       Q3
C1            1362.0    1867.0   1439.0   1792.0
Stem-and-Leaf Display: C1
Stem-and-leaf of C1   N = 7
Leaf Unit = 10
   1   13  6
   3   14  36
   3   15
  (2)  16  16
   2   17  9
   1   18  6
[Figure: Boxplot of C1, with the axis running from 1400 to 1900.]
Example 4 A new machine has been purchased to cut out drinking straws from lengths
of plastic tubing. The straws should be approximately 203 mm long. A
random sample of drinking straws, cut by the new machine, had the following
lengths in mm
204 202 204 204 201.
Find the range and show that their mean is 203 mm.
A random sample of drinking straws was also cut by the old machine, with the
following results in mm
207 200 204 197 207.
Find the range and median. Show that their mean is 203 mm. Why is the new machine
better? Hint: Calculate the standard deviations of both sets.
Solution New machine: put the data in ascending order 201 202 204 204 204
Range = 204 - 201 = 3. Median = 204.
\[ \text{Mean} = \bar{x} = \frac{201+202+204+204+204}{5} = \frac{1015}{5} = 203 . \]
Data    Deviations          Deviations squared
201     201-203 = -2        (-2)^2 = 4
202     202-203 = -1        (-1)^2 = 1
204     204-203 =  1           1^2 = 1
204     204-203 =  1           1^2 = 1
204     204-203 =  1           1^2 = 1
\[ \text{Variance } s^2 = \frac{4+1+1+1+1}{4} = \frac{8}{4} = 2 . \]
Standard Deviation \( s = \sqrt{s^2} = \sqrt{2} = 1.414213562 \approx 1.41 \).
Old machine: put the data in ascending order 197 200 204 207 207
Range = 207 - 197 = 10. Median = 204.
\[ \text{Mean} = \bar{x} = \frac{197+200+204+207+207}{5} = \frac{1015}{5} = 203 . \]
Data    Deviations          Deviations squared
197     197-203 = -6        (-6)^2 = 36
200     200-203 = -3        (-3)^2 =  9
204     204-203 =  1           1^2 =  1
207     207-203 =  4           4^2 = 16
207     207-203 =  4           4^2 = 16
        ____________        ____________
        Sum = 0             Sum = 78
\[ \text{Variance } s^2 = \frac{36+9+1+16+16}{4} = \frac{78}{4} = 19.5 . \]
Standard Deviation \( s = \sqrt{s^2} = \sqrt{19.5} = 4.415880433 \approx 4.42 \).
Comment: A typical drinking straw from the new machine is 203 mm long,
but typically the lengths differ from 203 mm by 1.41 mm. We can say that this is the
precision of the machine: it typically makes an error of 1.41 mm each time it cuts a straw.
Note that this is the typical error, not the maximum error made by the machine.
The old machine also cuts drinking straws that are typically 203 mm long, but their lengths
typically differ from 203 mm by 4.42 mm, so the precision of the old machine is 4.42 mm
each time it cuts a straw, compared with a typical error of 1.41 mm for the new machine.
Hence the new machine is better, as the straws from the new machine are
less variable in length, i.e. their lengths are in general closer to 203 mm.
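The comparison of the two machines can be summarised in a few lines of Python using the standard `statistics` module. This is a minimal sketch of the calculation in Example 4; the helper name `summarize` and the dictionary keys are illustrative.

```python
from statistics import mean, median, stdev

# Straw lengths in mm from Example 4.
new_machine = [204, 202, 204, 204, 201]
old_machine = [207, 200, 204, 197, 207]


def summarize(lengths):
    """Center and spread of one sample of straw lengths."""
    return {
        "mean": mean(lengths),
        "median": median(lengths),
        "range": max(lengths) - min(lengths),
        "stdev": stdev(lengths),          # divisor n - 1, as in the definition
    }


new_summary = summarize(new_machine)
old_summary = summarize(old_machine)

for name, summary in (("new", new_summary), ("old", old_summary)):
    print(f"{name} machine: mean = {summary['mean']}, median = {summary['median']}, "
          f"range = {summary['range']}, s = {summary['stdev']:.2f}")
```

Both samples have the same mean and median, so only the measures of spread (range and standard deviation) separate the two machines.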