Document

advertisement
MIREES
Academic year 2008/2009
STATISTICS
Silvia Cagnone
silvia.cagnone@unibo.it
Department of Statistics
University of Bologna
Readings

Mann P., “Introductory Statistics”, 6th edition, John Wiley & Sons, INC.,
2007. Chapters 1, 2, 3, 4.4, 4.5, 4.6, 11.3, 11.4.1, 13.1, 13.2.1-13.2.3,
13.4, 13.6

HyperStat Online Textbook (http://davidmlane.com/hyperstat/)

Handouts and exercises download from http://campus.cib.unibo.it/6033
Exam
Project work and oral exam
Office hours
by e-mail silvia.cagnone@unibo.it
Aim of the course
Basic concepts of the statistical method for the analysis
and the interpretation of economic and social data.
a.
Univariate statistical analysis (one character)
b.
Bivariate statistical analysis (two characters)
c.
Brief introduction to the statistical inference
Exercises in laboratory software Excel
INTRODUCTION TO
STATISTICS
What is statistics?
Two meanings:
1. Common usage
Numerical facts
(e.g the age of a student, the income of a family,
the starting salary of a typical college graduate,
etc.).
2. Field or discipline of study
Statistics is a group of methods used to
collect, analyze, present, and interpret
data and to make decisions.
What is statistics?
Statistics has two aspects:
1.
theoretical or mathematical statistics deals with the
development, derivation and proof of statistical
theorems, formulas, rules and laws;
2.
applied statistics involves the applications of those
theorems, formulas, rules and laws to solve real-world
problems.
Why statistics?



We use statistics when we need methods for
extracting information from observed or collected data
to obtain a deeper understanding from numbers about
the situations they represent.
Even professional statisticians have trouble
understanding a data set (= a collection of numerical
information) by merely looking at it.
Statistics and data analysis provide methods that can
help in the understanding of nearly every field of
human experience.
Why statistics?
Data set of 22.385 individuals of Cambridge.
Information on Sex, Height, Weight, Age and Smoking status.
What we can say about this data set by merely looking at it????
Types of statistics
 Descriptive statistics: consists of methods for organizing,
displaying, and describing data by using tables, graphs,
and summary measures.
 Inferential Statistics: consists of methods that use
sample results to help make decisions or predictions
about a population.
Population (target): the collection of all elements of interest
Sample: the selection of a few elements from this population
Probability acts as a link between descriptive and
inferential statistics.
Descriptive statistics: example
A sample of 30 employees from large companies was selected, and
these employees were asked how stressful their jobs were. The
responses of these employees are recorded next where Very
represents Very stressful, Somewhat means Somewhat stressful, and
None stands for Not stressful at all.
Inferential statistics: example
1.
2.
We may make some decisions about the political views
of a college and university students based on political
views of 1000 students selected from a few colleges
and universities.
We may want to find the starting salary of a typical
college graduate. To do so, we may select 2000 recent
college graduates, find their starting salaries, and make
a decision based on this information.
DESCRIPTIVE STATISTICS
Basic terms
1.
ELEMENT or MEMBER or UNIT
Specific subject or object (for example, a person, firm, item, state,
or country) about which the information is collected
2.
VARIABLE
Characteristic under study that assumes different values for
different elements. In contrast to a variable, the value of a
CONSTANT is fixed.
3.
OBSERVATION or MEASUREMENT
The value of a variable for an element.
Basic terms: example
Data set: 2001 Sales of Seven U.S. Companies
Element/
Member/Unit
Company
Wal-Mart Stores
IBM
General Motors
Dell Computer
Procter & Gamble
JC Penney
Home Depot
2001 Sales
Variable
(millions of dollars)
217,799
85,866
Observation/
177,260
Measurement
31,168
39,262
32,004
53,553
Types of variables
1.
QUANTITATIVE VARIABLE
A variable that can be measured numerically. The data collected on
a quantitative variable are called quantitative data.
Examples: Incomes, heights, prices of homes, etc.

DISCRETE VARIABLE
A variable whose values are countable. A discrete variable can
assume only certain values with no intermediate values. (e.g. nr
students in a class, nr components of a family, births in Forlì in 2007, etc.)

CONTINUOUS VARIABLE
A variable that can assume any numerical value over a certain
interval or intervals. (e.g. age, weight, height, time to get to the school,
etc.)
Types of variables
2.
QUALITATIVE or CATEGORICAL VARIABLE
A variable that cannot assume a numerical value but can be
classified into two or more nonnumeric categories. The data
collected on such a variable are called qualitative data.
Examples: hair color, gender, etc.
Variable
Quantitative
Discrete (e.g.,
number of
houses, cars,
accidents)
Continuous
(e.g., length,
age, height,
weight, time)
Qualitative or
categorical (e.g.,
make of a computer,
hair color, gender)
Summation notation
The summation operator  is a mathematical notation
used to denote the sum of values.
For example, suppose a sample consists of 5 books and their prices are
$25, $60, $37, $53 and $16. If we denote the variable price of a
book by X, we have:
x1 (price of the first book) = $25;
x2 (price of the second book) = $60;
…
x5 (price of the fifth book) = $16;
Now, suppose we want to add the prices of all five books:
x1 + x2 + x3 + x4 + x5 = 25 + 60 + 37 + 53 + 16 = $191 or, briefly,
5

xi =
xi x1 + x2 + x3 + x4 + x5 =25 + 60 + 37 +
53 + 16
i 1
= $191
Summation notation
Example
Annual salaries (in thousands of dollars) of four workers
are 75, 42, 125, and 61. Find
∑x
b) (∑x)²
c) ∑x²
a)
Solution
∑x = x1 + x2 + x3 + x4 = 75 + 42 + 125 + 61 = 303
b) (∑x)² = (75 + 42 + 125 + 61)² =(303)² = 91.809
c)∑x² = (75)² + (42)² + (125)² + (61)² = 5625 + 1764+
a)
+ 15,625 + 721
= 26.735
ORGANIZING DATA
Raw data
When data are collected, the information obtained from
each member of a population or a sample is recoded
in the sequence in which it becomes available. This
sequence of data is called raw data.
RAW DATA
Data recorded in the sequence in which they are collected.
These data are also called ungrouped data, because they
contain information on each member of a sample or
population individually.
Raw data: Examples
Ages of 50 students
21
18
25
22
25
19
20
19
28
23
24
19
31
21
18
25
22
19
20
37
29
19
23
22
27
34
19
18
22
23
26
25
23
21
21
27
22
19
20
25
37
25
23
19
21
33
23
26
21
24
M
F
M
M
M
F
F
M
M
M
M
F
F
M
F
F
M
M
M
F
F
M
F
M
M
F
M
M
F
M
F
F
M
M
M
Gender of 50 students
M
F
M
M
F
F
F
F
M
F
M
M
M
M
F
Organizing and Graphing
Qualitative Data
FREQUENCY DISTRIBUTION
A frequency distribution for qualitative data lists all
categories and the number of elements that belong to
each of the categories.
Example
Variable
Category
Frequency
Frequency distribution
How can we obtain a frequency distribution?
Example
A sample of 30 employees from large companies was selected, and
these employees were asked how stressful their jobs were. The
responses of these employees are recorded next where Very
represents Very stressful, Somewhat means Somewhat stressful, and
None stands for Not stressful at all.
Raw data
Frequency distribution
Stress on Job
Very
Somewhat
None
Tally
||||| |||||
||||| ||||| ||||
||||| |
Frequency (ni)
10
14
6
Sum =
ni
= 30
Relative frequency and
percentage distributions
Relative frequency of a category: it is obtained by
dividing the frequency of the category by the sum of all
frequencies
Frequency of that category
Relative frequency 
Sum of all frequencie s
fi 
ni
n
Percentage of a category: it is obtained by
multiplying the relative frequency of the category
by 100
Percentage = (Relative frequency) * 100 = fi * 100
Relative frequency and percentage
distributions:example
Stress on Job ( xi)
Frequency (ni)
Very
Somewhat
None
10
14
6
 ni = 30 = n
Stress on Job
( x)
i
Very
Somewhat
None
Relative Frequency (fi)
10/30 = .333
14/30 = .467
6/30 = .200
 fi = 1.00
Percentage
.333(100) = 33.3
.467(100) = 46.7
.200(100) = 20.0
Sum = 100
Graphical presentation of qualitative data
A graphic display can reveal at a glance the main
characteristics of a data set.
The bar graph and the pie chart are two types of
graphs used to display qualitative data.
Bar graph
A graph made of bars whose heights represent the frequencies of
respective categories.
Very
Somewhat
None
Frequency (ni)
10
14
6
16
14
12
Frequency
Stress on Job
10
8
6
4
2
 ni = 30 = N
0
Very
Somewhat
None
Strees on Job
The categories are on the horizontal axis and all these categories are
represented by intervals of the same width.
We mark the frequencies on the vertical axis and their heights
represent the frequency of the corresponding category.
We leave a small gap between adjacent bars.
Bar graph for percentages
The bar graphs for relative frequency and percentage distributions
can be drawn simply by marking the relative frequencies or
percentages, instead of the frequencies, on the vertical axis.
Percentage
Very
Somewhat
None
33.3
46.7
20.0
50
40
Percentage
Stress on
Job
30
20
10
0
Very
Somewhat
Stress on Job
None
Pie charts
A circle divided into portions that represent the relative frequencies or
percentages of a population or a sample belonging to different
categories.
20%
33%
Very
Somewhat
None
47%
Organizing and Graphing
Quantitative Data
Often for quantitative data with a large number of different values, it is
appropriated to prepare a frequency distribution based on classes.
Example: frequency distribution
Weekly earning of 120 employees of a large company
Variable
Third class
Lower limit of
the sixth class
Weekly Earnings
(dollars)
400 -| 600
600 -| 800
800 -| 1000
1000 -| 1200
1200 -| 1400
1400 -| 1600
Frequency
column
Number of Employees
9
22
39
15
9
Upper limit of the 6
sixth class
Frequency
of the third
class
Relative frequency and percentage
distribution
Weekly Earnings
(dollars)
Number of
Employees
n
Relative frequency
f
Percentage
400 -| 600
600 -| 800
800 -| 1000
1000 -| 1200
1200 -| 1400
1400 -| 1600
14
22
49
20
9
6
14/120 = 0.117
22/120 = 0.183
49/120 = 0.408
20/120 = 0.167
9/120 = 0.075
6/120 = 0.050
0.117*100 = 11.7
0.183*100 = 18.3
0.408*100 = 40.8
0.167*100 = 16.7
0.075*100 = 7.5
0.050*100 = 5.0
 n = 120
f=1
100
Cumulative frequency
A cumulative frequency is the total number of values that fall below a
certain value.
To obtain the cumulative frequency of a class, we add the frequency
of that class to the frequencies of all preceding classes.
Weekly Earnings
(dollars)
Number of
Employees
n
Cumulative frequency
400 -| 600
600 -| 800
800 -| 1000
1000 -| 1200
1200 -| 1400
1400 -| 1600
14
22
49
20
9
6
14
14 + 22 = 36
14 +22 + 49 = 86
14 + 22 + 49 + 20 = 105
14 + 22 + 49 + 20 + 9 = 114
14 + 22 + 49 + 20 + 9 + 6 =
120
 n = 120
Graphical presentation of quantitative data
Quantitative data can be displayed mainly in a histogram.
Actually, we can also draw a pie chart to display the
percentage distribution for a quantitative data set. The
procedure to construct a pie chart is similar to the one for
qualitative data.
Histogram
A graph in which classes are marked on the horizontal
axis and the frequencies, relative frequencies, or
percentages are marked on the vertical axis. The
frequencies, relative frequencies, or percentages are
represented by the heights of the bars. In a histogram,
the bars are drawn adjacent to each other, to underline
the continuity of the quantitative data.
Weekly Earnings
(dollars)
Number of
Employees
Percentage
400 -| 600
600 -| 800
800 -| 1000
1000 -| 1200
1200 -| 1400
1400 -| 1600
14
22
49
20
9
6
11.7
18.3
40.8
16.7
7.5
5.0
60
50
50
40
Percentage
Frequency
Example histogram
40
30
20
30
20
10
10
0
0
400-600
600-800
800-1000
1000-1200 1200-1400 1400-1600
Classes
400-600
600-800
800-1000
1000-1200 1200-1400 1400-1600
Classi
Download