Introduction - Department of Statistical Sciences

STAB22 Fall 2008 - Useful Information
Instructor: Hadas Moshonov.
E-mail: hadas@utstat.utoronto.ca
Web-page: www.utstat.toronto.edu/hadas/STAB22/stab22.html
Office:
Tel:
Office hours: Tuesdays 2-3 PM or by appointment.
NOTE: The lecture notes are based on the textbook and
can be downloaded from the course website.
(lecture section 1).
week1
1
The rise of statistics
• Statistics is the science of collecting, organizing and
interpreting data. The goal of statistics is to gain
understanding from data.
• Historically, the ideas and methods of statistics developed
gradually as society became interested in collecting and
using data for a variety of applications.
• The discipline of statistics took shape in the twentieth
century when methods for producing and understanding
data grew in number and sophistication.
Elements of Statistics - Introduction
• Data are numerical facts with context and we need to understand
the context if we are to make sense of the numbers.
• A set of data contains some information about a group of
individuals.
• Individuals are the objects upon which we collect data.
Individuals can be people, animals, plots of land and many other
things.
• A population is a set of individuals that we are interested in
studying.
• A variable is any characteristic of an individual.
• A sample is a subset of the individuals of a population.
Questions to ask when planning a statistical study
• Why? What purpose do the data have? Do we hope to
answer some specific questions? Do we want to draw
conclusions about individuals other than the ones we actually
have data for?
• Who? What individuals do the data describe? How many
individuals appear in the data?
• What? How many variables do the data contain? What are the
exact definitions of these variables? What are the units of
measurement in which each variable is recorded? Weights,
for example, might be recorded in pounds or in kilograms.
Collecting Data
• Generally, data can be obtained in four different ways:
  - Published source.
  - Designed experiment.
  - Survey.
  - Observational study.
Types of Variables
• A categorical variable places an individual into one of
several groups or categories, e.g. gender, college major.
• A quantitative variable takes numerical values for which
arithmetic operations are defined, e.g. height, weight.
• The distribution of a variable tells us what values it takes
and how often it takes these values.
• Examples 1.2, 1.3 pages 5-6 in IPS.
Displaying Distributions With Graphs
• Statistical tools and ideas help us examine data in order to
describe their main features. This examination is called
exploratory data analysis.
• Two basic strategies for exploring a data set:
  - Begin by examining each variable by itself. Then move
    on to study the relationships among the variables.
  - Begin with graphs. Then add numerical summaries of
    specific aspects of the data.
Graphs for categorical variables
• The values of a categorical variable are the labels for
the categories such as “male” and “female”.
• The distribution of a categorical variable lists the
categories and gives either the count or the percent of
individuals who fall in each category.
• Example: The distribution of marital status for all Americans age 18+.

Marital status     Count (millions)     Percent
Never married            43.9             22.9
Married                 116.7             60.9
Widowed                  13.4              7.0
Divorced                 17.6              9.2
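The percent column follows from the counts: each percent is the category count divided by the total count, times 100. A minimal Python sketch of that arithmetic, using the counts from the table (the function name `percent_distribution` is only illustrative):

```python
# Counts in millions, copied from the marital status table.
counts = {
    "Never married": 43.9,
    "Married": 116.7,
    "Widowed": 13.4,
    "Divorced": 17.6,
}

def percent_distribution(counts):
    """Convert category counts to percents of the total (1 decimal place)."""
    total = sum(counts.values())
    return {category: round(100 * count / total, 1)
            for category, count in counts.items()}

percents = percent_distribution(counts)

# Print the distribution: category, count, percent.
for category, pct in percents.items():
    print(f"{category:<15}{counts[category]:>8.1f}{pct:>8.1f}")
```

The percents should add up to 100, apart from rounding error.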
[Bar graph: Count (vertical axis, 0 to 120) for each marital status (horizontal axis): Divorced, Married, Never Married, Widowed]
MINITAB commands:
Graph>Chart
Pie chart for the distribution of marital status
[Pie chart slices: Married (117, 60.9%); Never Married (44, 22.9%); Divorced (18, 9.2%); Widowed (13, 7.0%)]
Measurement - example
We want to compare the “size” of several statistics books.
• Describe three possible numerical variables that
describe the “size” of a book.
• In what units would you measure each variable?
• What measuring instrument does each require?
• Describe a variable that is appropriate for estimating how
long it would take to read the book.
Describing Quantitative Data
• The pattern of variation of a variable is called its distribution.
• The distribution of a variable is best displayed graphically.
• There are three main graphical methods for describing,
summarizing, and detecting patterns in quantitative data:
  - Dot plot
  - Stemplot
  - Histogram
Stemplots
To make a stemplot:
1. Separate each observation into a stem consisting of
all but the final (rightmost) digit and a leaf, the final
digit.
2. Write the stems in a vertical column with the smallest
at the top, and draw a vertical line at the right of this
column.
3. Write each leaf in the row to the right of its stem, in
increasing order out from the stem.
• Examples 1.5, 1.6 pages 11-13 in IPS.
Example
• Here are the scores of a basketball player (say player A)
54 59 35 41 46 25 47 60 54 46 49 46 41 34 22
Make a stemplot of these data. Describe the main
features of the distribution.
• Solution : Min. = 22, Max = 60 …
• MINITAB command: Graph > Stem-and-Leaf
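The three stemplot steps can also be sketched in Python, here applied to player A's scores. This is only an illustration for non-negative whole numbers, not the MINITAB procedure; the function name `stemplot` is an assumption, and stems with no leaves are still printed so that gaps stay visible.

```python
from collections import defaultdict

scores = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]

def stemplot(values):
    """Return the rows of a stemplot for non-negative integers.

    Step 1: stem = all but the final digit, leaf = final digit.
    Step 2: stems are listed from smallest to largest.
    Step 3: leaves are written in increasing order out from the stem.
    """
    leaves = defaultdict(list)
    for value in sorted(values):          # sorting puts the leaves in order
        stem, leaf = divmod(value, 10)
        leaves[stem].append(leaf)

    rows = []
    for stem in range(min(leaves), max(leaves) + 1):
        row_leaves = "".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem:>2} | {row_leaves}".rstrip())
    return rows

rows = stemplot(scores)
print("\n".join(rows))
```

Reading the rows from top to bottom shows where the scores concentrate and whether any score stands apart from the rest.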
Examining a distribution
• In any graph of data, look for the overall pattern and for
striking deviations from that pattern.
• The overall pattern of a distribution can be described by its
shape, centre, and spread.
• An important kind of deviation is an outlier, an individual
value that falls outside the overall pattern.
• Some other things to look for in describing shape are:
  - Does the distribution have one or several major peaks,
    called modes? A distribution with one major peak is
    called unimodal.
  - Is it approximately symmetric or skewed in one direction?
Exercise
Describe the main features of the distribution of the data
set summarized by the following MINITAB stemplot.

Stem-and-leaf of Fuel Use   N = 15
Leaf Unit = 1.0
   1   0  5
   5   0  6677
  (4)  0  8899
   6   1  001
   3   1  23
   1   1
   1   1
   1   1
   1   2  1
Exercise
Describe the main features of the distribution of the data set
summarized by the following MINITAB stemplot.

Stem-and-leaf of weight   N = 12
Leaf Unit = 0.010
   3   107  479
  (4)  108  0779
   5   109
   5   110
   5   111  0
   4   112  56
   2   113  79
Exercise
Describe the shape of the distribution
summarized by the following stemplot.

Stem-and-leaf of sta220 marks   N = 42
Leaf Unit = 1.0
    1    6  7
    3    7  44
   11    7  77888999
  (11)   8  00011233444
   20    8  555556666778
    8    9  000001
    2    9  7
    1   10  0
Exercise
Describe the shape of the distribution summarized by the
following stemplot.

Stem-and-leaf of C1   N = 50
Leaf Unit = 0.10
   18   0  000111122233334444
  (17)  0  55555566667889999
   15   1  0011444
    8   1  5669
    4   2  03
    2   2
    2   3  1
    1   3
    1   4  2
Exam Question
Forty students wrote a Statistics examination having a maximum of 50 marks.
The mark distribution is given in the following stem-and-leaf plot:
Stem  Leaf
  0   28
  1   2245
  2   01333388888
  3   011156679
  4   22444466788
  5   000
State whether the following statements are true or false.
(a) The distribution is right skewed.
(b) The median of the distribution is 31.
(c) The median of the distribution is 28.
(d) The mode of the distribution is 48.
(e) More than 18% of the students scored 45 or more on the examination.
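Because a stemplot keeps every observation, the marks can be rebuilt from the plot and the statements checked by computation. A minimal Python sketch, assuming each stem is the tens digit and each leaf the units digit; `expand` is a hypothetical helper name, and the leaf strings are copied from the plot.

```python
from statistics import median, multimode

# Stem -> string of leaves, copied from the stem-and-leaf plot.
plot = {
    0: "28",
    1: "2245",
    2: "01333388888",
    3: "011156679",
    4: "22444466788",
    5: "000",
}

def expand(plot):
    """Rebuild the sorted observations from stems and leaf strings."""
    return sorted(10 * stem + int(leaf)
                  for stem, leaves in plot.items()
                  for leaf in leaves)

marks = expand(plot)

n = len(marks)                                # number of students
middle = median(marks)                        # for statements (b) and (c)
modes = multimode(marks)                      # for statement (d)
share_45_plus = sum(m >= 45 for m in marks) / n   # for statement (e)

print(n, middle, modes, share_45_plus)
```

Shape (statement (a)) is still best judged by looking at the plot itself.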
Histograms
• A histogram breaks the range of values of a variable into
intervals and displays only the count or percent of the
observations that fall into each interval.
• We can choose a convenient number of intervals.
• Histograms do not display the actual values observed, only
the counts in each interval.
• Example:
Here are some data on the number of days lost due to illness
for a group of employees:
47, 1, 55, 30, 1, 3, 7, 14, 7, 66, 34, 6, 10, 5, 12, 5, 3, 9, 18,
45, 5, 8, 44, 42, 46, 6, 4, 24, 24, 34, 11, 2, 3, 13, 5, 5, 3, 4,
4, 1
The main steps in constructing a histogram
1. Determine the range of the data (largest and smallest
values). In our example the data range from a minimum
of 1 day to a maximum of 66 days.
2. Decide on the number of intervals (or classes), and the
width of each class (usually equal).
3. Count the number of observations in each class. These
counts are called class frequencies.
4. Draw the histogram.
Class    No. of employees    Cumulative    Relative
         (Frequency)         frequency     frequency
0-10          23                 23          0.575
10-20          5                 28          0.125
20-30          3                 31          0.075
30-40          2                 33          0.050
40-50          5                 38          0.125
50-60          1                 39          0.025
60-70          1                 40          0.025
Total         40                             1.000
• A table with the first two columns above is called a
frequency table or frequency distribution.
• A table with the first column and the third column is called
a cumulative frequency distribution.
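The class frequencies, cumulative frequencies, and relative frequencies can be computed directly from the days-lost data. The Python sketch below is one way to do it. It assumes each class includes its upper endpoint (so a value of 10 is counted in the class 0-10), which is the convention that reproduces a frequency of 23 for the first class; the function name `frequency_table` and its parameters are illustrative.

```python
from itertools import accumulate

days_lost = [47, 1, 55, 30, 1, 3, 7, 14, 7, 66, 34, 6, 10, 5, 12, 5, 3, 9, 18,
             45, 5, 8, 44, 42, 46, 6, 4, 24, 24, 34, 11, 2, 3, 13, 5, 5, 3, 4,
             4, 1]

def frequency_table(data, width=10, n_classes=7):
    """Return rows of (class label, frequency, cumulative, relative).

    Assumption: class k covers lower < x <= upper, and the data are
    positive whole numbers no larger than width * n_classes.
    """
    freqs = [0] * n_classes
    for x in data:
        k = (x - 1) // width      # index of the class containing x
        freqs[k] += 1

    cumulative = list(accumulate(freqs))   # running total of frequencies
    n = len(data)
    return [(f"{k * width}-{(k + 1) * width}", f, c, f / n)
            for k, (f, c) in enumerate(zip(freqs, cumulative))]

table = frequency_table(days_lost)
for label, freq, cum, rel in table:
    print(f"{label:<8}{freq:>4}{cum:>6}{rel:>8.3f}")
```

Two quick checks on any such table: the last cumulative frequency equals the number of observations, and the relative frequencies sum to 1.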
[Histogram of days lost: Frequency on the vertical axis, days lost (0 to 70) on the horizontal axis]
• MINITAB command: Graph > Histogram
Comments
• The above histogram suggests that the distribution is
skewed to the right. No gaps or outliers.
• Since this data set is not very large (40 observations) we
can also use a dotplot or a stemplot to represent the data.
• MINITAB commands for dotplot:
Graph > Character Graphs > Dotplot
Some good examples are 1.7 and 1.8 on pages 14, 18 in IPS.
Stem-and-leaf of days lost   N = 40
Leaf Unit = 1.0
  (22)  0  1112333344455555667789
   18   1  012348
   12   2  44
   10   3  044
    7   4  24567
    2   5  5
    1   6  6

[Dotplot for Days: horizontal axis labelled Days, 0 to 60]
Dealing with outliers
• We can spot outliers by looking for observations that stand
apart from the overall pattern of a histogram or stemplot.
• Identifying outliers is a matter for judgment. Outliers are
points that are clearly apart from the body of the data, not
just the most extreme observations in a distribution.
• We should always search for an explanation for any
outliers. Sometimes outliers point to errors made in
recording the data. In other cases, the outlying observation
may be caused by equipment failure or other unusual
circumstances.
• Example 1.9 p18 in IPS.
Question
Which type of display
• uses the actual numbers as building blocks for the display?
• gives the most flexibility for setting the class width and
number of classes?
• is most convenient for very large data sets?
• is quickest to construct?
• keeps the most detail about the actual data?
Review Questions
1. The purpose of a frequency distribution is to ____.
a) present scores and their frequency of occurrence
b) present data in a more meaningful way than single
scores
c) provide more information than a graph
d) all of the above
e) a and b
2. Which of the following indicates the proportion of the
total number of scores which occurred in each interval?
a) Relative frequency distribution
b) Cumulative frequency distribution
c) Cumulative percentage distribution
d) None of the above
3. Which of the following indicates the number of scores
which fell below the upper limit of each interval?
a) Relative frequency distribution
b) Cumulative frequency distribution
c) Cumulative percentage distribution
d) None of the above
4. Which of the following is not a symmetrical
distribution?
a) A bell-shaped distribution
b) A J-shaped distribution
c) A U-shaped distribution
d) An inverted U-shaped distribution