STAT 551
PROBABILITY AND
STATISTICS I
INTRODUCTION
1
WHAT IS STATISTICS?
• Statistics is a science of collecting data,
organizing and describing it and drawing
conclusions from it. That is, statistics is
a way to get information from data. It is
the science of uncertainty.
2
WHAT IS STATISTICS?
• A pharmaceutical CEO wants to know whether a
new drug is superior to existing drugs, or what
its possible side effects are.
• How fuel efficient is a certain car model?
• Is there any relationship between your GPA
and employment opportunities?
• Actuaries want to determine “risky” customers
for insurance companies.
3
STEPS OF STATISTICAL
PRACTICE
• Preparation: Set clearly defined goals and
questions of interest for the investigation
• Data collection: Make a plan of which data to
collect and how to collect it
• Data analysis: Apply appropriate statistical
methods to extract information from the data
• Data interpretation: Interpret the information
and draw conclusions
4
STATISTICAL METHODS
• Descriptive statistics include the collection,
presentation and description of numerical data.
• Inferential statistics include making inferences and
decisions by applying appropriate statistical methods
to the collected data.
• Model building includes developing prediction
equations to understand a complex system.
5
BASIC DEFINITIONS
• POPULATION: The collection of all items of
interest in a particular study.
• SAMPLE: A set of data drawn from the population;
a subset of the population available for observation
• PARAMETER: A descriptive measure of the
population, e.g., mean
• STATISTIC: A descriptive measure of a sample
• VARIABLE: A characteristic of interest about each
element of a population or sample.
6
EXAMPLE
• Population: All students currently enrolled in school
– Unit: Student
– Sample: Any department
– Variables: GPA; hours of work per week
• Population: All books in library
– Unit: Book
– Sample: Statistics books
– Variables: Replacement cost; frequency of check out; repair needs
• Population: All campus fast food restaurants
– Unit: Restaurant
– Sample: Burger King
– Variables: Number of employees; seating capacity; hiring/not hiring
Note that some samples are not representative of the population and shouldn’t
be used to draw conclusions about the population. In the first example, some
students from all (or almost all) departments would constitute a better
sample.
7
How not to run a presidential poll
For the 1936 election, the Literary Digest picked
names at random out of telephone books in some
cities and sent these people some ballots,
attempting to predict the election results,
Roosevelt versus Landon, by the returns. Now,
even if 100% returned the ballots, even if all told
how they really felt, even if all would vote, even if
none would change their minds by election day,
still this method could be (and was) in trouble:
They estimated a conditional probability, used part
of the American population which had phones, that
part was not typical of the total population.
[Dudewicz & Mishra, 1988]
STATISTIC
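• A statistic (or estimator) is any function of the random
variables (r.v.s) in a random sample (r.s.) that does not
contain any unknown quantity. E.g.
– $\sum_{i=1}^{n} X_i$, $\prod_{i=1}^{n} X_i$, $\sum_{i=1}^{n} X_i / n$, $\min_i(X_i)$, $\max_i(X_i)$
are statistics.
– $\sum_{i=1}^{n} X_i - \mu$ and $\sum_{i=1}^{n} X_i / \sigma$ are NOT
statistics when $\mu$ and $\sigma$ are unknown.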
• Any observed or particular value of an
estimator is an estimate.
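The distinction between an estimator and an estimate can be illustrated with a short sketch. The sample values below are invented for illustration; the point is only that each quantity is computed from the data alone.

```python
import math

# Hypothetical observed sample x_1, ..., x_n (values invented for illustration).
sample = [2.0, 3.5, 1.5, 4.0, 3.0]
n = len(sample)

# Each quantity below depends only on the observed data, so each is a statistic.
# The numbers they return for this particular sample are estimates.
total = sum(sample)            # sum of the X_i
product = math.prod(sample)    # product of the X_i
mean = total / n               # sum of the X_i divided by n
smallest = min(sample)         # minimum of the X_i
largest = max(sample)          # maximum of the X_i

print(total, product, mean, smallest, largest)

# sum(X_i) - mu and sum(X_i) / sigma cannot be evaluated here:
# mu and sigma are unknown population quantities, not data.
```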
9
RANDOM VARIABLES
• Variables whose observed value is determined by
chance
• A random variable (r.v.) is a function defined on the sample space S
that associates a real number with each outcome in S.
• R.v.s are denoted by uppercase letters, and their
observed values by lowercase letters.
• Example: Consider the random variable X, the number of
brown-eyed children born to a couple heterozygous for eye
color (each with genes for both brown and blue eyes). If the
couple is assumed to have 2 children, X can assume any of
the values 0,1, or 2. The variable is random in that brown
eyes depend on the chance inheritance of a dominant gene at
conception. If for a particular couple there are two brown-eyed
children, we have x=2.
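The brown-eyed children example can be sketched by listing the sample space and applying X to each outcome. This assumes simple Mendelian inheritance (brown dominant, each parent passing on either allele with equal probability), which the slide implies but does not spell out; the names `alleles`, `brown_eyed` and `pmf` are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Assumption: brown (B) is dominant over blue (b), and each heterozygous
# parent passes on B or b with equal probability.
alleles = ["B", "b"]

# One child = (allele from parent 1, allele from parent 2): 4 equally likely outcomes.
child_outcomes = list(product(alleles, repeat=2))

# Sample space S for two children: 4 * 4 = 16 equally likely outcomes.
sample_space = list(product(child_outcomes, repeat=2))

def brown_eyed(child):
    """A child has brown eyes if at least one inherited allele is B."""
    return "B" in child

def X(outcome):
    """The random variable: number of brown-eyed children in the outcome."""
    return sum(brown_eyed(child) for child in outcome)

# Distribution of X obtained by counting outcomes in S.
counts = Counter(X(outcome) for outcome in sample_space)
pmf = {x: Fraction(counts[x], len(sample_space)) for x in sorted(counts)}
print(pmf)
```

Under these assumptions X takes the values 0, 1 and 2 with probabilities 1/16, 6/16 and 9/16.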
10
COLLECTING DATA
• Target Population: The population about
which we want to draw inferences.
• Sampled Population: The actual
population from which the sample has
been taken.
11
SAMPLING PLAN
• Simple Random Sample (SRS): Every possible sample
of a given size is equally likely to be selected, so every
member of the population has the same chance of inclusion.
• Stratified Sampling: The population is separated
into mutually exclusive sets (strata), and then a
simple random sample is drawn from each stratum.
• Convenience Sample: It is obtained by
selecting individuals or objects without
systematic randomization.
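The three sampling plans can be contrasted in a minimal sketch. The population of 100 students and the four department names are hypothetical, chosen only to give the strata something to split on.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100 students, each tagged with a department (stratum).
departments = ["Math", "Physics", "Biology", "History"]
population = [(student_id, departments[student_id % 4]) for student_id in range(100)]

# Simple random sample: every subset of size 12 is equally likely.
srs = random.sample(population, k=12)

# Stratified sample: split into strata, then take an SRS of size 3 from each stratum.
strata = {}
for unit in population:
    strata.setdefault(unit[1], []).append(unit)
stratified = [unit for members in strata.values() for unit in random.sample(members, k=3)]

# Convenience sample: no randomization, e.g. the first 12 units on the list.
convenience = population[:12]

print(len(srs), len(stratified), len(convenience))
```

All three samples have 12 units, but only the stratified one guarantees that every department is represented equally.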
12
13
EXAMPLE
• A politician who is running for the office of mayor of a city
with 25,000 registered voters runs a survey. In the
survey, 48% of the 200 registered voters interviewed say
they plan to vote for her.
• What is the population of interest?
The political choices of the 25,000 registered voters
• What is the sample?
The political choices of the 200 voters interviewed
• Is the value 48% a parameter or a statistic?
Statistic
14
EXAMPLE
• A manufacturer of computer chips claims that less than
10% of his products are defective. When 1000 chips
were drawn from a large production run, 7.5% were
found to be defective.
• What is the population of interest?
The complete production run for the computer chips
• What is the sample? 1000 chips
• What is the parameter? Proportion of all chips that are defective
• What is the statistic? Proportion of sample chips that are defective
• Does the value 10% refer to a parameter or a statistic?
Parameter
• Explain briefly how the statistic can be used to make
inferences about the parameter to test the claim.
Because the sample proportion (7.5%) is less than 10%, the data are
consistent with the claim, so we can conclude that the claim may be true.
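One way to make "may be true" more concrete is to ask how likely a sample this good would be if the true defect rate were exactly 10%. The sketch below is an assumption-laden illustration, not the method prescribed by the slides: it treats the 1000 draws as independent (reasonable for a large production run) and uses an exact binomial tail probability.

```python
from math import comb

n = 1000           # chips sampled
defective = 75     # 7.5% of 1000 were found defective
p_boundary = 0.10  # boundary value of the claim "less than 10% defective"

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Probability of seeing 75 or fewer defectives if the true rate were exactly 10%.
tail = binom_cdf(defective, n, p_boundary)
print(f"P(X <= {defective} | p = {p_boundary}) = {tail:.4f}")
```

A small tail probability means the observed 7.5% would be unusual if the true rate were 10% or more, which supports the manufacturer's claim.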
15
DESCRIPTIVE STATISTICS
• Descriptive statistics involves the
arrangement, summary, and presentation of
data, to enable meaningful interpretation, and to
support decision making.
• Descriptive statistics methods make use of
– graphical techniques
– numerical descriptive measures.
• The methods presented apply both to
– the entire population
– the sample
16
Types of data and information
• A variable - a characteristic of population or
sample that is of interest for us.
– Cereal choice
– Expenditure
– The waiting time for medical services
• Data - the observed values of variables
– Interval and ratio data are numerical observations (in
ratio data, the ratio of two observations is meaningful
and the value 0 means “none”. E.g.
of ratio data: weight; e.g. of interval data: temperature.)
– Nominal data are categorical observations
– Ordinal data are ordered categorical observations
17
Types of data – examples
• Quantitative
– Continuous: Blood pressure, height, weight, age
– Discrete: Number of children; number of attacks of asthma per week
• Categorical (Qualitative)
– Ordinal (ordered categories): Grade of breast cancer; better, same, worse; disagree, neutral, agree
– Nominal (unordered categories): Sex (male/female); alive or dead; blood group O, A, B, AB
18
Types of data – analysis
Knowing the type of data is necessary to properly
select the technique to be used when analyzing data.
Types of descriptive analysis allowed for each type of data:
• Numerical data – arithmetic calculations
• Nominal data – counting the number of observations in each
category
• Ordinal data – computations based on an ordering process
19
Types of data - examples
Numerical data

Age    Income
55     75000
42     68000
.      .
.      .

Weight gain
+10
+5
.
.

Nominal data

Person   Marital status
1        married
2        single
3        single
.        .
.        .

Computer   Brand
1          IBM
2          Dell
3          IBM
.          .
.          .
20
Types of data - examples
Numerical data

Age    Income
55     75000
42     68000
.      .
.      .

Weight gain
+10
+5
.
.

Nominal data
A descriptive statistic for nominal data is the proportion
of data that falls into each category.

Brand        IBM    Dell   Compaq   Other   Total
Count        25     11     8        6       50
Proportion   50%    22%    16%      12%
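The proportions for the computer brand data can be reproduced from the counts alone. The dictionary below simply restates the counts from the slide (IBM 25, Dell 11, Compaq 8, Other 6); the variable names are illustrative.

```python
# Brand counts restated from the slide (50 computers in total).
counts = {"IBM": 25, "Dell": 11, "Compaq": 8, "Other": 6}

total = sum(counts.values())

# Proportion of observations falling in each category.
proportions = {brand: count / total for brand, count in counts.items()}

for brand, share in proportions.items():
    print(f"{brand:<7} {counts[brand]:>3} {share:.0%}")
```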
21
Cross-Sectional/Time Series/Panel Data
• Cross-sectional data is collected at a certain
point in time
– Test score in a statistics course
– Starting salaries of an MBA program's graduates
• Time series data is collected over
successive points in time
– Weekly closing price of gold
– Amount of crude oil imported monthly
• Panel data is also collected over successive points
in time, but for many units (series) at once
22
Differences
• Change in time
– Cross-sectional: Cannot measure
– Time series: Can measure
– Panel: Can measure
• Properties of the series
– Cross-sectional: No series
– Time series: Long; usually just one or a few series
– Panel: Short; hundreds of series
• Measurement time
– Cross-sectional: Measurement only at one time point; even if more than one
time point, samples are independent from each other
– Time series: Usually at regular time points (all series are taken at the
same time points and time points are equally spaced)
– Panel: Varies
• Measurements
– Cross-sectional: Response(s); time-independent covariates
– Time series: Response(s); time; usually no covariate
– Panel: Response(s); time; time-dependent and time-independent covariates
23
GAMES OF CHANCE
24
COUNTING TECHNIQUES
• Methods to determine how many subsets
can be obtained from a set of objects are
called counting techniques.
FUNDAMENTAL THEOREM OF
COUNTING
If a job consists of k separate tasks,
the i-th of which can be done in $n_i$ ways,
$i = 1, 2, \ldots, k$, then the entire job can be done
in $n_1 \times n_2 \times \cdots \times n_k$ ways.
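The theorem can be checked by brute force on a small case. The three-task "job" below (choosing a starter, a main course and a dessert) is a hypothetical example, not one from the course.

```python
from itertools import product
from math import prod

# Hypothetical job with k = 3 tasks: choose a starter, a main course and a dessert.
tasks = [
    ["soup", "salad"],              # n1 = 2 ways
    ["fish", "chicken", "pasta"],   # n2 = 3 ways
    ["cake", "fruit"],              # n3 = 2 ways
]

# Enumerate every way of doing the whole job.
all_ways = list(product(*tasks))

# The theorem predicts n1 * n2 * n3 ways.
predicted = prod(len(options) for options in tasks)

print(len(all_ways), predicted)
```

Both counts equal 2 x 3 x 2 = 12.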
25
THE FACTORIAL
• n! is the number of ways in which n distinct objects
can be permuted.
$n! = n(n-1)(n-2) \cdots 2 \cdot 1$
$0! = 1$, $1! = 1$
Example: The possible permutations of {1,2,3}
are (1,2,3), (1,3,2), (3,1,2), (2,1,3), (2,3,1),
(3,2,1). So, there are 3! = 6 different
permutations.
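The count for {1,2,3} can be confirmed by listing the orderings directly, as a quick sketch using Python's standard library:

```python
from itertools import permutations
from math import factorial

items = [1, 2, 3]

# Every ordering of the three objects, as tuples.
orderings = list(permutations(items))

print(orderings)
print(len(orderings), factorial(len(items)))
```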
26
COUNTING
• Partition Rule: Suppose a single set
of N distinct elements is partitioned
into k sets, the first set
containing $n_1$ elements, …, the k-th set
containing $n_k$ elements. The number of
different partitions is
$$\frac{N!}{n_1! \, n_2! \cdots n_k!}, \quad \text{where } N = n_1 + n_2 + \cdots + n_k.$$
27
COUNTING
• Example: Let’s partition {1,2,3} into two
sets; first with 1 element, second with 2
elements.
• Solution:
Partition 1: {1} {2,3}
Partition 2: {2} {1,3}
Partition 3: {3} {1,2}
3!/(1! 2!)=3 different partitions
28
Example
• How many different arrangements can be
made of the letters “ISI”?
1st letter   2nd letter   3rd letter
I            S            I
I            I            S
S            I            I
N=3, n1=2, n2=1; 3!/(2!1!)=3
29
Example
• How many different arrangements can be
made of the letters “statistics”?
• N=10, n1=3 s, n2=3 t, n3=1 a, n4=2 i, n5=1 c
$$\frac{10!}{3! \, 3! \, 1! \, 2! \, 1!} = 50400$$
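The letter-arrangement counts follow directly from the partition rule. The helper `arrangements` below is an illustrative name, and the brute-force check is run only on the short word "ISI", where listing every ordering is cheap.

```python
from collections import Counter
from itertools import permutations
from math import factorial

def arrangements(word):
    """Number of distinct arrangements of the letters: N! / (n1! n2! ... nk!)."""
    result = factorial(len(word))
    for count in Counter(word).values():
        result //= factorial(count)
    return result

# Brute-force check on a short word: distinct orderings of "ISI".
brute_force = {"".join(p) for p in permutations("ISI")}

print(arrangements("ISI"), sorted(brute_force))
print(arrangements("statistics"))
```

The formula gives 3 for "ISI" (matching the three listed arrangements) and 50400 for "statistics".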
30
COUNTING
1. Ordered, without replacement (e.g. picking the first
3 winners of a competition)
2. Ordered, with replacement (e.g. tossing a coin and
observing a Head in the k-th toss)
3. Unordered, without replacement (e.g. 6/49 lottery)
4. Unordered, with replacement (e.g. picking up red
balls from an urn that has both red and green
balls & putting them back)
31
PERMUTATIONS
• Any ordered sequence of r objects taken
from a set of n distinct objects is called a
permutation of size r of the objects.
$$P_{r,n} = \frac{n!}{(n-r)!} = n(n-1) \cdots (n-r+1)$$
32
COMBINATION
• Given a set of n distinct objects, any
unordered subset of size r of the objects is
called a combination.
$$C_{r,n} = \binom{n}{r} = \frac{n!}{r! \, (n-r)!}$$
Properties
$$\binom{n}{0} = 1, \quad \binom{n}{n} = 1, \quad \binom{n}{r} = \binom{n}{n-r}$$
33
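The permutation and combination formulas, and the listed properties of the binomial coefficient, can be verified by enumeration. The values n = 5 and r = 3 are arbitrary choices for illustration; `math.perm` and `math.comb` require Python 3.8 or later.

```python
from itertools import combinations, permutations
from math import comb, perm

n, r = 5, 3
objects = range(n)

# Count by enumeration and compare with the closed-form values.
# math.perm(n, r) = n! / (n - r)!  and  math.comb(n, r) = n! / (r! (n - r)!).
n_permutations = len(list(permutations(objects, r)))
n_combinations = len(list(combinations(objects, r)))
print(n_permutations, perm(n, r))
print(n_combinations, comb(n, r))

# Properties of the binomial coefficient.
print(comb(n, 0), comb(n, n), comb(n, r) == comb(n, n - r))
```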
COUNTING
Number of possible arrangements of size r from n objects

             Without replacement      With replacement
Ordered      $\frac{n!}{(n-r)!}$      $n^r$
Unordered    $\binom{n}{r}$           $\binom{n+r-1}{r}$
34
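The four counting cases (ordered or unordered, with or without replacement) each have a matching iterator in Python's `itertools`, so the formulas can be checked by enumeration. The choice of n = 4 objects and r = 2 is arbitrary.

```python
from itertools import (combinations, combinations_with_replacement,
                       permutations, product)
from math import comb, perm

n, r = 4, 2
objects = "ABCD"

# Counts obtained by listing every arrangement.
enumerated = {
    ("ordered", "without"): len(list(permutations(objects, r))),
    ("ordered", "with"): len(list(product(objects, repeat=r))),
    ("unordered", "without"): len(list(combinations(objects, r))),
    ("unordered", "with"): len(list(combinations_with_replacement(objects, r))),
}

# Counts predicted by the four formulas.
formula = {
    ("ordered", "without"): perm(n, r),          # n! / (n - r)!
    ("ordered", "with"): n ** r,                 # n^r
    ("unordered", "without"): comb(n, r),        # C(n, r)
    ("unordered", "with"): comb(n + r - 1, r),   # C(n + r - 1, r)
}

for case in enumerated:
    print(case, enumerated[case], formula[case])
```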
EXAMPLE
• How many different ways can we arrange
3 books (A, B and C) on a shelf?
• Order is important; without replacement
• n=3, r=3; n!/(n-r)! = 3!/0! = 6, or
(possible number of books for 1st place on the shelf)
x (possible number of books for 2nd place on the shelf)
x (possible number of books for 3rd place on the shelf)
= 3 x 2 x 1 = 6
35
EXAMPLE, cont.
• How many different ways can we arrange
3 books (A, B and C) on a shelf?
1st book   2nd book   3rd book
A          B          C
A          C          B
B          A          C
B          C          A
C          A          B
C          B          A
36
EXAMPLE
• Lotto games: Suppose that you pick 6 numbers
out of 49
• What is the number of possible choices
– If the order does not matter and no repetition is
allowed?
$$\binom{49}{6} = 13,983,816 \approx 14 \text{ million}$$
– If the order matters and no repetition is allowed?
$$\frac{49!}{43!} = 49 \times 48 \times 47 \times 46 \times 45 \times 44 \approx 10^{10}$$
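Both lotto counts can be computed exactly with the standard library, as a quick check of the arithmetic:

```python
from math import comb, perm

# Unordered, no repetition: choose 6 numbers out of 49.
unordered = comb(49, 6)

# Ordered, no repetition: 49 x 48 x 47 x 46 x 45 x 44.
ordered = perm(49, 6)

print(f"{unordered:,}")
print(f"{ordered:,}")

# Each unordered choice corresponds to 6! = 720 orderings.
print(ordered // unordered)
```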
37