Fundamentals of Biostatistics 1

Fundamentals of Bio-Statistics
TEWACHEW G.
University of Gondar
Department of Statistics
E-Mail:t e w a c h e w 9 3 @gmail.com
Definition and Classification of Statistics
Definition
• Statistics can be defined in two senses:
• In its plural sense, statistics refers to a collection of numerical
information that describes aspects of social and economic phenomena. In
this sense, statistics are the raw data themselves, such as statistics of
births, deaths, imports, and exports.
• In its singular sense, statistics is the science of collecting,
organizing, presenting, analyzing, and interpreting data to assist in
making more effective decisions.
Types of Statistics
Descriptive statistics is concerned with methods of organizing,
summarizing, and presenting data in an informative way.
Inferential statistics comprises the methods used to determine something
about a population on the basis of a sample.
By Tewachew
Fundamentals of Biostatistics
1.2 Stages in statistical investigation
Collection of data: The process of measuring, gathering, and assembling
the raw data upon which the statistical investigation is to be based. Data
can be collected in a variety of ways.
Organization of data: Summarizing the data in some meaningful way.
Organization may involve editing, coding, and classifying the collected
data.
Presentation of data: In this stage the collected and organized data are
presented in a systematic order to facilitate statistical analysis, with
the help of tables, diagrams, and graphs.
Analysis of data: The process of extracting numerical descriptions of the
data, mainly through elementary mathematical operations (such as
computing the mean and standard deviation).
Interpretation of data: This involves giving meaning to the analyzed data
and drawing conclusions. Statistical techniques based on probability
theory are required.
1.3 Definitions of some terms
Population: The complete set of possible measurements about which
inferences are to be made.
Census: A complete enumeration of the population. In most real problems
a census is not feasible, so a sample is taken instead.
Sample: The set of measurements from a population that are actually
collected in the course of an investigation.
• It should be selected using a pre-defined sampling technique so that it
represents the population well.
Parameter: A characteristic or measure obtained from a population.
Statistic: A characteristic or measure obtained from a sample.
Sampling: The process or method of selecting a sample from the
population.
Sample size: The number of elements or observations to be included in
the sample.
1.4 Scales of Measurement
Variable: An attribute or characteristic that can assume different
values.
• Variables are divided into two types: qualitative and quantitative.
Qualitative variables are non-numeric and cannot be measured numerically.
Examples: gender, religious affiliation, and state of birth.
Quantitative variables are numeric and can be measured.
Examples: balance in a checking account, number of children in a
family.
• Quantitative variables are either discrete or continuous.
Discrete variable: Assumes a finite or countable number of possible
values. Examples: number of children in a family, number of cars at a
traffic light.
Continuous variable: Can assume any value within a defined range.
Examples: weight in kg, height, time, air pressure in a tire.
• A measurement scale describes the values assigned to data in terms of
three properties: order, distance, and fixed zero.
Nominal Scales: Measurement systems that possess none of the three
properties stated above.
• At the nominal level, the data are sorted into categories that have no
particular order.
Examples:
• Sex (male or female)
• Marital status (married, single, widowed, divorced)
• Country code
• Regional differentiation of Ethiopia
Ordinal Scales
• Ordinal scales are measurement systems that possess the property of
order, but not distance or a true zero point.
• This level of measurement classifies data into categories that can be
ranked; however, precise differences between the ranks do not exist.
Examples:
• Rating scales (excellent, very good, good, fair, poor), military rank.
Interval Scales
• Interval scales are measurement systems that possess the properties of
order and distance, but not the property of fixed zero.
Examples:
• Temperature in degrees Celsius (°C) or Fahrenheit (°F)
• A score on an individual intelligence test as a measure of
intelligence
A temperature of 0°C does not mean that there is no temperature.
Ratio Scales
• Ratio scales are measurement systems that possess all three properties:
order, distance, and fixed zero.
• The ratio level of measurement has all the characteristics of the
interval level, plus a true zero point, so the ratio of two values is
meaningful.
Examples: weight, height, number of students, age.
1.6 Introduction to Methods of Data Collection
• Statistical data may be classified into two categories, depending on
the source: (1) primary data and (2) secondary data.
• Primary data: Data collected by the investigator personally for the
purpose of a specific inquiry or study.
• The main methods of primary data collection are:
  - Observation
  - Interview (face to face or by telephone)
  - Questionnaire (mailed or self-administered)
  - Laboratory experiment
  - Life histories, case studies, etc.
• Secondary data: Data that have already been collected by others and
are then used by the investigator.
• Sources of secondary data include government documents, official
statistics, technical reports, scholarly journals, trade journals, review
articles, reference books, research universities, hospitals, libraries,
library search engines, computerized databases, and the World Wide Web.
CHAPTER-2 : Methods of Data Presentation
• Having collected and edited the data, the next important step is to
organize it.
• The presentation of data is broadly classified into the following three
categories:
  - Tabular presentation
  - Diagrammatic presentation
  - Graphic presentation
• Classification is the process of arranging data into classes or
categories according to their similarities.
• Raw data: Recorded information in its original collected form, whether
counts or measurements.
• Frequency: The number of values in a specific class of the distribution.
Frequency Distribution
• A frequency distribution is the organization of raw data in table
form, using classes and frequencies.
• There are three basic types of frequency distributions:
  - Categorical frequency distribution
  - Ungrouped frequency distribution
  - Grouped frequency distribution
Categorical Frequency Distribution:
• Used for data that can be placed in specific categories, such as
nominal or ordinal data (e.g. marital status).
• Steps for constructing a categorical frequency distribution:
1. Confirm that the data are on a nominal or ordinal scale of
measurement.
2. Make a table with columns for class (A), tally (B), frequency (C), and
percent.
3. Put the distinct values of the data set in column A.
4. Tally the data and place the result in column B.
5. Count the tallies and place the results in column C.
6. Find the percentage of values in each class using the formula
$\% = \frac{f}{n} \times 100$, where f is the frequency of the class and n
is the total number of values.
 Example 2.1: Twenty-five army inductees were given a blood test to
determine their blood type. The data set is given as follows:
A   B   B   AB  O
O   O   B   AB  B
B   B   O   A   O
A   O   O   O   AB
AB  A   O   B   A
 Construct a frequency distribution for the above data
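The tallying and percentage steps for Example 2.1 can be sketched in Python as follows. This is only an illustration: the function name `categorical_frequency` is made up for this sketch, and the data are read row by row from the example.

```python
from collections import Counter

# Blood types of the 25 inductees in Example 2.1, read row by row.
blood_types = (
    "A B B AB O "
    "O O B AB B "
    "B B O A O "
    "A O O O AB "
    "AB A O B A"
).split()


def categorical_frequency(values):
    """Return {class: (frequency, percent)} for categorical data."""
    counts = Counter(values)          # steps 4-5: tally and count
    n = len(values)
    # step 6: percent = f / n * 100
    return {cls: (f, 100 * f / n) for cls, f in counts.items()}


table = categorical_frequency(blood_types)
for cls in ("A", "B", "O", "AB"):
    f, pct = table[cls]
    print(f"{cls:<3}{f:>4}{pct:>6.0f}%")
```

The frequencies add up to 25 and the percentages to 100, which is a quick check that no observation was dropped.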
Ungrouped Frequency Distribution
• When the data are numerical rather than categorical, the range of the
data is small, and each class is only one unit wide, the distribution is
called an ungrouped frequency distribution.
• The major components of this type of frequency distribution are
class, tally, frequency, relative frequency, and cumulative frequency.
Example 2.2: The following data represent the marks of 20 students.
Construct an ungrouped frequency distribution.
Solution:
Step 1: Find the range: Range = Max - Min = 90 - 60 = 30.
Step 2: Make a table as shown.
Step 3: Tally the data.
Step 4: Compute the frequency.
 Each individual value is presented separately, that is why it is named
ungrouped frequency distribution.
Grouped Frequency Distribution
• When the range of the data is large, the data must be grouped into
classes that are more than one unit in width.
Definitions:
Grouped frequency distribution: A frequency distribution in which
several values are grouped into one class.
Class limits: Values that separate one class from another in a grouped
frequency distribution. There is a gap between the upper limit of one
class and the lower limit of the next.
Unit of measurement (U): The distance between two possible consecutive
measurements. It is usually taken as 1, 0.1, 0.01, 0.001, and so on.
Class boundaries: Values that separate one class from another with no gap
between the upper boundary of one class and the lower boundary of the
next.
• The lower class boundary is found by subtracting U/2 from the
corresponding lower class limit, and the upper class boundary is found
by adding U/2 to the corresponding upper class limit.
Class width: The difference between the upper and lower class boundaries
of any class.
Class mark (midpoint): The average of the lower and upper class limits,
or equivalently of the lower and upper class boundaries.
Cumulative frequency: The number of observations less than or equal to
(or greater than or equal to) a specific value.
Cumulative frequency above: The total frequency of all values greater
than or equal to the lower class boundary of a given class.
Cumulative frequency below: The total frequency of all values less than
or equal to the upper class boundary of a given class.
Cumulative frequency distribution (CFD): The tabular arrangement of class
intervals together with their corresponding cumulative frequencies.
Relative frequency (rf): The frequency divided by the total frequency.
Relative cumulative frequency (rcf): The cumulative frequency divided by
the total frequency.
Guidelines for classes:
1. There should be between 5 and 20 classes.
2. The classes must be mutually exclusive: no data value can fall into
two different classes.
3. The classes must be all-inclusive (exhaustive): every data value must
be included.
4. The classes must be continuous: there are no gaps in a frequency
distribution.
5. The classes must be equal in width. The exception is the first or last
class, which may be open-ended ("below ..." or "... and above"). This is
often used with ages.
Steps for constructing a grouped frequency distribution
1. Find the largest and smallest values.
2. Compute the range: R = Maximum - Minimum.
3. Select the number of classes desired using Sturges' rule,
$K = 1 + 3.32\log_{10}(n)$, where K is the number of classes and n is the
total number of observations.
4. Find the class width by dividing the range by the number of classes and
rounding up, not off: $W = R/K$.
5. Pick a suitable starting point less than or equal to the minimum value.
The starting point is called the lower limit of the first class. Continue
to add the class width to this lower limit to get the rest of the lower
limits.
6. To find the upper limit of the first class, subtract U from the lower
limit of the second class. Then continue to add the class width to this
upper limit to find the rest of the upper limits.
7. Find the boundaries by subtracting U/2 from the lower limits and
adding U/2 to the upper limits.
8. Tally the data.
9. Find the frequencies.
10. Find the cumulative frequencies.
11. If necessary, find the relative frequencies and/or relative cumulative
frequencies.
Example-2.3 : Consider the following set of data and construct the frequency
distribution.
11 29 6 33 14 21 18 17 22 38
31 22 27 19 22 23 26 39 34 27
Steps
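The construction steps can be traced on the Example 2.3 data with the short sketch below. It rests on some assumptions that the slides do not state: the unit of measurement is taken as U = 1 because the data are whole numbers, the Sturges value K is rounded up, and the first lower limit is the minimum value. The function name `grouped_frequency` is illustrative only.

```python
import math

data = [11, 29, 6, 33, 14, 21, 18, 17, 22, 38,
        31, 22, 27, 19, 22, 23, 26, 39, 34, 27]


def grouped_frequency(values, unit=1):
    """Build a grouped frequency table following steps 1-11."""
    n = len(values)
    smallest, largest = min(values), max(values)        # step 1
    data_range = largest - smallest                     # step 2
    # Step 3: Sturges' rule; rounding K up is an assumption of this sketch.
    k = math.ceil(1 + 3.32 * math.log10(n))
    width = math.ceil(data_range / k)                   # step 4: round up
    rows, cumulative = [], 0
    for i in range(k):
        lower = smallest + i * width                    # step 5
        upper = lower + width - unit                    # step 6
        f = sum(lower <= x <= upper for x in values)    # steps 8-9
        cumulative += f                                 # step 10
        rows.append({
            "limits": (lower, upper),
            "boundaries": (lower - unit / 2, upper + unit / 2),  # step 7
            "frequency": f,
            "cumulative": cumulative,
            "relative": f / n,                          # step 11
        })
    return rows


rows = grouped_frequency(data)
for r in rows:
    print(r["limits"], r["boundaries"], r["frequency"],
          r["cumulative"], f"{r['relative']:.2f}")
```

Under these assumptions the sketch yields six classes of width 6, starting at 6-11 and ending at 36-41. A different rounding of K or a different starting point would give a different but equally valid table.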
Diagrammatic and Graphical Presentation of Data
• Diagrams and graphs are among the most effective and interesting ways
of presenting statistical data.
• The three most commonly used diagrammatic presentations for discrete
and qualitative data are:
  - Pie chart
  - Pictogram
  - Bar chart
Pie Chart
• A pie chart is a circle that is divided into sections or wedges
according to the percentage of frequencies in each category of the
distribution. The angle of each sector is obtained using:
$\text{angle of sector} = \frac{\text{value of the part}}{\text{total of all parts}} \times 360^\circ$
Example 2.4: The following table gives the details of the monthly budget
of a family. Represent these figures by a suitable diagram.
 Solution: The necessary computations are given below:
Bar Charts
• A bar chart (simple, multiple, or component) uses vertical or
horizontal bars to represent the frequencies of a distribution.
• When drawing a bar chart, consider the following two points:
  - Make the bars the same width.
  - Make the units on the frequency axis equal in size.
Simple Bar Chart: Used to display data on one variable classified on a
spatial, quantitative, or temporal basis.
Example: Draw a simple bar diagram to represent the profits of a bank
for 5 years.
Multiple Bars
 When two or more interrelated series of data are depicted by a bar
diagram, then such a diagram is known as a multiple-bar diagram.
 Suppose we have export and import figures for a few years.
Fig 2.2 Multiple Bars
Component Bar Chart
• Used to represent data in which the total magnitude is divided into
different parts or components.
Example: The table below shows the quantity, in hundreds of kg, of wheat,
barley, and oats produced on a certain farm during the years 1991 to
1994. Draw a component (stratified) bar chart.
 Solution: To make the component bar chart, first of all we have to
take year wise total production.
 The required diagram is given below:
Graphical Presentation of Data
• The histogram, frequency polygon, and cumulative frequency graph
(ogive) are the most commonly used graphical representations of
continuous data.
Procedure for constructing statistical graphs:
  - Draw and label the X and Y axes.
  - Choose a suitable scale for the frequencies or cumulative frequencies
    and label it on the Y axis.
  - Represent the class boundaries (for the histogram or ogive) or the
    midpoints (for the frequency polygon) on the X axis.
  - Plot the points.
  - Draw the bars or lines to connect the points.
Histogram
• A graph that displays the data using vertical bars whose heights
represent the class frequencies. Class boundaries are placed along the
horizontal axis.
Example: Use the data of Example 2.3 above.
Frequency Polygon:
• A line graph in which the frequencies are placed along the vertical
axis and the class midpoints along the horizontal axis.
Ogive (cumulative frequency polygon)
• A graph showing the cumulative frequency (less-than or more-than type)
plotted against the upper or lower class boundaries, respectively.
• That is, class boundaries are plotted along the horizontal axis and the
corresponding cumulative frequencies along the vertical axis.
• The points are joined by a freehand curve.
Example: Draw a frequency polygon and ogive curves (less-than type and
more-than type) for the data in Example 2.3 above.
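Before drawing, the points to be plotted have to be computed. The sketch below does this for the Example 2.3 data, assuming six classes with limits 6-11, 12-17, ..., 36-41 (class boundaries 5.5, 11.5, ..., 41.5). These classes are one reasonable choice, not necessarily the ones used on the original slide.

```python
from itertools import accumulate

data = [11, 29, 6, 33, 14, 21, 18, 17, 22, 38,
        31, 22, 27, 19, 22, 23, 26, 39, 34, 27]

# Assumed class boundaries (limits 6-11, ..., 36-41 with U = 1).
boundaries = [5.5, 11.5, 17.5, 23.5, 29.5, 35.5, 41.5]
classes = list(zip(boundaries, boundaries[1:]))

freqs = [sum(lo < x < hi for x in data) for lo, hi in classes]
total = sum(freqs)

# Frequency polygon: frequency against class midpoint.
midpoints = [(lo + hi) / 2 for lo, hi in classes]
polygon_points = list(zip(midpoints, freqs))

# Less-than ogive: cumulative frequency against upper class boundary.
less_than = list(accumulate(freqs))
less_than_points = list(zip(boundaries[1:], less_than))

# More-than ogive: cumulative frequency against lower class boundary.
more_than = [total - below for below in [0] + less_than[:-1]]
more_than_points = list(zip(boundaries[:-1], more_than))

print(polygon_points)
print(less_than_points)
print(more_than_points)
```

The less-than ogive rises from 2 to 20 and the more-than ogive falls from 20 to 2. When both curves are drawn on the same axes, they cross at the median.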
CHAPTER-3 : Measures of Central Tendency (MCT)
• A measure of central tendency is a summary measure that attempts to
describe a whole set of data with a single value that represents the
middle or center of its distribution.
• This single value is called the average of the group. Averages are
also called measures of central tendency.
Objectives of Measures of Central Tendency
• To summarize a set of data by a single value
• To facilitate comparison among different data sets
• To use in further statistical analysis or manipulation
Summation Notation
 PROPERTIES OF SUMMATION
.
Example 3.1: considering the following data determine
find
Types of Measures of Central Tendency
• Statistics offers various measures of central tendency. The most
commonly used include:
  - Mean
  - Mode
  - Median
  - Quartiles
  - Deciles
  - Percentiles
Mean
• The mean is defined as the sum of the values of all observations in a
data set divided by the number of observations.
• The mean of $X_1, X_2, X_3, \dots, X_n$ is denoted by A.M., m, or
$\bar{X}$ and is given by
$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
• For a grouped frequency distribution with class marks $X_i$ and
frequencies $f_i$, $i = 1, 2, \dots, k$:
$\bar{X} = \frac{\sum_{i=1}^{k} f_i X_i}{\sum_{i=1}^{k} f_i}$
Example 3.2: Obtain the mean of the following numbers:
2, 7, 8, 2, 7, 3, 7
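As a quick check of Example 3.2, the sketch below computes the mean directly and again from the frequency table of the same seven numbers. The function names are illustrative only. Both routes give the same value, because the frequency-weighted formula is the plain mean with repeated values collected together.

```python
from collections import Counter


def mean(values):
    """Arithmetic mean: sum of the observations divided by their number."""
    return sum(values) / len(values)


def weighted_mean(marks, freqs):
    """Mean from a frequency distribution: sum(f * X) / sum(f)."""
    return sum(f * x for x, f in zip(marks, freqs)) / sum(freqs)


numbers = [2, 7, 8, 2, 7, 3, 7]
direct = mean(numbers)                      # 36 / 7

# Same data as a frequency table: value -> frequency.
table = Counter(numbers)                    # {2: 2, 7: 3, 8: 1, 3: 1}
from_table = weighted_mean(list(table.keys()), list(table.values()))

print(round(direct, 2), round(from_table, 2))
```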
 Example3.3 : calculate the mean for the following age distribution.
The Mode
 Mode is a value which occurs most frequently in a set of values
 The mode may not exist and even if it does exist, it may not be unique.
Examples 3.7
1. Find the mode of 5, 3, 5, 8, 9. Mode = 5.
2. Find the mode of 8, 9, 9, 7, 8, 2, and 5. The data are bimodal: the
modes are 8 and 9.
3. Find the mode of 4, 12, 3, 6, and 7. There is no mode for this data set.
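The three cases in Examples 3.7 (one mode, two modes, no mode) can be reproduced with a small helper. The name `modes` and the convention of returning an empty list when every value occurs only once are choices made for this sketch.

```python
from collections import Counter


def modes(values):
    """Return all most-frequent values, or [] when no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                 # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([5, 3, 5, 8, 9]))          # single mode
print(modes([8, 9, 9, 7, 8, 2, 5]))    # bimodal
print(modes([4, 12, 3, 6, 7]))         # no mode
```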
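• The mode of a set of numbers $X_1, X_2, \dots, X_n$ is usually denoted
by $\hat{X}$.
• If the data are given as a continuous (grouped) frequency distribution,
the mode is defined as:
$\hat{X} = L_{mo} + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) w$
where $L_{mo}$ is the lower class boundary of the modal class, $\Delta_1$
is the frequency of the modal class minus the frequency of the preceding
class, $\Delta_2$ is the frequency of the modal class minus the frequency
of the following class, and $w$ is the class width.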
Note: The modal class is a class with the highest frequency.
Example-3.8 : Following is the distribution of the size of certain farms
selected at random from a district. Calculate the mode of the distribution.
Solution = ?
The Median
• In a distribution, the median is the value of the variable that divides
it into two equal halves.
• In an ungrouped frequency distribution, if the n values are arranged
in ascending order of magnitude, the median is the middle value when n
is odd.
• When n is even, the median is the mean of the two middle values.
Example 3.9: Find the median of the following numbers.
a) 6, 5, 2, 8, 9, 4
b) 2, 1, 8, 3, 5
Solution = ?
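One way to check an answer to Example 3.9 is to code the odd/even rule directly, as in the sketch below. Python's standard library also offers `statistics.median`, which follows the same rule.

```python
def median(values):
    """Middle value for odd n; mean of the two middle values for even n."""
    ordered = sorted(values)          # arrange in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([6, 5, 2, 8, 9, 4]))   # n = 6 (even): mean of 5 and 6
print(median([2, 1, 8, 3, 5]))      # n = 5 (odd): the third ordered value
```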
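For grouped data:
$\tilde{X} = L_{med} + \left(\frac{n/2 - c}{f_{med}}\right) w$
where $L_{med}$ is the lower class boundary of the median class, $c$ is the
cumulative frequency (less-than type) of the class preceding the median
class, $f_{med}$ is the frequency of the median class, $w$ is the class
width, and $n$ is the total number of observations.
Remark: The median class is the class with the smallest cumulative
frequency (less-than type) greater than or equal to n/2.
Example 3.10: Find the median of the distribution in Example 3.8 above.
Solution = ?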
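Quartiles:
• Quartiles are measures that divide the frequency distribution into four
equal parts.
• After the data are arranged in increasing order, the three dividing
values are denoted by $Q_1$, $Q_2$, and $Q_3$, known respectively as the
first, second, and third quartiles.
For grouped data, we have the following formula:
$Q_i = L_{Q_i} + \left(\frac{in/4 - c}{f_{Q_i}}\right) w, \quad i = 1, 2, 3$
where $L_{Q_i}$ is the lower class boundary of the class containing $Q_i$,
$c$ is the cumulative frequency (less-than type) of the preceding class,
$f_{Q_i}$ is the frequency of the class containing $Q_i$, $w$ is the class
width, and $n$ is the total number of observations.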
Deciles
• Deciles are measures that divide ordered data into ten equal parts,
each containing an equal number of elements. The nine dividing points are
known as the 1st, 2nd, ..., 9th deciles and are denoted by
$D_1, D_2, \dots, D_9$ respectively.
• For ungrouped data, the $i$th decile is given by
$D_i = \left(\frac{i(n+1)}{10}\right)^{\text{th}}$ item, $i = 1, 2, \dots, 9$
• For grouped (continuous) data, deciles can be obtained using
$D_i = L_{D_i} + \left(\frac{in/10 - c}{f_{D_i}}\right) w$
Remark: The decile class (the class containing $D_i$) is the class with the
smallest cumulative frequency (less-than type) greater than or equal to
$in/10$.
Percentiles:
• Percentiles are measures that divide the frequency distribution into
one hundred equal parts.
• The values of the variable corresponding to these divisions are denoted
$P_1, P_2, \dots, P_{99}$, often called the first, the second, ..., the
ninety-ninth percentile respectively.
• For ungrouped data, the $i$th percentile is
$P_i = \left(\frac{i(n+1)}{100}\right)^{\text{th}}$ item, $i = 1, 2, \dots, 99$
• For grouped (continuous) data, percentiles can be obtained using
$P_i = L_{P_i} + \left(\frac{in/100 - c}{f_{P_i}}\right) w$
Remark: The percentile class (the class containing $P_i$) is the class with
the smallest cumulative frequency (less-than type) greater than or equal to
$in/100$.
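The grouped-data median, quartiles, deciles, and percentiles all follow one pattern: locate the class whose less-than cumulative frequency first reaches the target count, then interpolate within it. The sketch below implements that pattern under stated assumptions. The function `grouped_quantile` and its parameters are hypothetical names, and the frequency table is the Example 2.3 data grouped into assumed classes 6-11, ..., 36-41.

```python
from itertools import accumulate


def grouped_quantile(lower_boundaries, freqs, width, i, parts):
    """i-th of `parts` equal divisions: L + ((i*n/parts - c) / f) * w."""
    n = sum(freqs)
    target = i * n / parts                    # n/2, in/4, in/10 or in/100
    cumulative = list(accumulate(freqs))      # less-than cumulative freq.
    for idx, cf in enumerate(cumulative):
        if cf >= target:                      # smallest cf >= target
            c = cumulative[idx - 1] if idx > 0 else 0
            return lower_boundaries[idx] + (target - c) / freqs[idx] * width
    raise ValueError("target exceeds total frequency")


# Assumed table: classes 6-11, 12-17, ..., 36-41 (boundaries 5.5, 11.5, ...).
lower_boundaries = [5.5, 11.5, 17.5, 23.5, 29.5, 35.5]
freqs = [2, 2, 7, 4, 3, 2]
w = 6

median = grouped_quantile(lower_boundaries, freqs, w, 1, 2)
q1 = grouped_quantile(lower_boundaries, freqs, w, 1, 4)
q3 = grouped_quantile(lower_boundaries, freqs, w, 3, 4)
d9 = grouped_quantile(lower_boundaries, freqs, w, 9, 10)
p90 = grouped_quantile(lower_boundaries, freqs, w, 90, 100)
print(round(median, 2), round(q1, 2), q3, d9, p90)
```

By construction the second quartile, fifth decile, and fiftieth percentile all equal the median, and the ninth decile equals the ninetieth percentile.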
CHAPTER-4 : Measures of Dispersion
• The scatter or spread of the items of a distribution is known as
dispersion or variation.
• Measures of dispersion are statistical measures of the extent to which
data are dispersed or spread out.
Objectives of measuring variation:
• To judge the reliability of measures of central tendency
• To control variability itself
• To compare two or more groups of numbers in terms of their variability
• To make further statistical analysis
Types of Measures of Dispersion
• Various measures of dispersion are in use. The most commonly used are:
1. Range and relative range
2. Variance
3. Standard deviation
4. Coefficient of variation and standard score
The Variance
Population Variance
• If the sum of squared deviations from the mean (the variation) is
divided by the number of values in the population, the result is the
population variance:
$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$
• This variance is the "average squared deviation from the mean".
Sample Variance
• For a sample, the sum of the squared deviations is divided by one less
than the sample size:
$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
Standard Deviation
 The standard deviation is defined as the square root of the mean of the squared
deviations of individual values from their mean.
 Examples: Find the variance and standard deviation of the following sample
data
1. 5, 17, 12, 10.
2. The data is given in the form of frequency distribution.
Solution
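For the first data set (5, 17, 12, 10), the computation can be sketched as below. The data are treated as a sample, so the sum of squared deviations is divided by n - 1. The population version is shown only for contrast, and the function names are illustrative.

```python
import math


def sum_squared_deviations(values):
    """Sum of squared deviations of the values from their mean."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values)


def sample_variance(values):
    return sum_squared_deviations(values) / (len(values) - 1)


def population_variance(values):
    return sum_squared_deviations(values) / len(values)


data = [5, 17, 12, 10]                 # mean = 44 / 4 = 11
s2 = sample_variance(data)             # (36 + 36 + 1 + 1) / 3 = 74 / 3
s = math.sqrt(s2)                      # sample standard deviation
print(round(s2, 2), round(s, 2))
print(population_variance(data))       # 74 / 4, for comparison
```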
Special properties of Standard deviations
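Chebyshev's Theorem
• For any data set, the proportion of values that lie within k standard
deviations of the mean is at least $1 - \frac{1}{k^2}$, where k is any
number greater than 1.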
Example: Suppose a distribution has mean 50 and standard deviation 6.
What percent of the numbers are:
a) Between 38 and 62
b) Between 32 and 68
c) Less than 38 or more than 62
d) Less than 32 or more than 68
Solution
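Chebyshev's theorem gives bounds rather than exact percentages: at least $1 - 1/k^2$ of the values lie within k standard deviations of the mean, so at most $1/k^2$ lie outside. The sketch below applies this to the example; the function name is made up for illustration.

```python
def chebyshev_min_fraction(k):
    """Lower bound on the fraction of values within k standard deviations."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem needs k > 1")
    return 1 - 1 / k ** 2


mean, sd = 50, 6
bounds = {}
for low, high in [(38, 62), (32, 68)]:
    k = (high - mean) / sd                 # 62 -> k = 2, 68 -> k = 3
    inside = chebyshev_min_fraction(k)
    bounds[(low, high)] = inside
    print(f"between {low} and {high}: at least {inside:.1%}; "
          f"outside this interval: at most {1 - inside:.1%}")
```

So parts (a) and (b) are answered with "at least 75%" and "at least about 88.9%", and parts (c) and (d) with the complementary "at most" bounds.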
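Coefficient of Variation (C.V)
• The coefficient of variation expresses the standard deviation as a
percentage of the mean: $C.V = \frac{S}{\bar{X}} \times 100\%$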
Standard Scores (Z-scores)
If X is a measurement from a distribution with mean $\bar{X}$ and standard
deviation S, then its value in standard units is
$Z = \frac{X - \bar{X}}{S}$
• Z gives the deviation from the mean in units of standard deviation.
• Z gives the number of standard deviations by which a particular
observation lies above or below the mean.
• It is used to compare two observations coming from different groups.
Examples:
• Student A from section 1 scored 90 and student B from section 2 scored
95. Relatively speaking, who performed better?
CHAPTER-5 : Elementary Probability
• Probability is the chance of an outcome of an experiment.
• It is a measure of how likely an outcome is to occur.
Definitions of some probability terms
Experiment: Any process of observation or measurement, or any process
that generates well-defined outcomes.
Probability experiment: An experiment whose outcome is not known in
advance.
Outcome: The result of a single trial of a random experiment.
Sample space: The set of all possible outcomes of a probability experiment.
Event: A subset of the sample space; a statement about one or more
outcomes of a random experiment. Events are denoted by capital letters.
Mutually exclusive events: Two events are mutually exclusive if they
cannot both occur at the same time as the outcome of a single experiment.
That is, events $E_1$ and $E_2$ are mutually exclusive if they have no
sample point in common.
Equally Likely outcomes: outcomes which have the same chance of
occurring.
Independent Events: Two events A and B are said to be independent events
if the occurrence of event A has no influence on the occurrence of event B.
Dependent Events: Two events are dependent if the first event affects the
outcome or occurrence of the second event .
Fundamental Principles of Counting Techniques
 If the number of possible outcomes in an experiment is small, it is
relatively easy to list and count all possible events.
 When there are large numbers of possible outcomes an enumeration of
cases is often difficult, tedious, or both.
 Therefore, to overcome such problems one can use various counting
techniques or rules.
• Addition rule: Suppose that a procedure, designated by 1, can be
performed in $n_1$ ways, and that a second procedure, designated by 2, can
be performed in $n_2$ ways.
• Suppose furthermore that procedures 1 and 2 cannot both be performed
together.
• Then the number of ways in which we can perform procedure 1 or
procedure 2 is $n_1 + n_2$.
• This generalizes as follows: if there are k procedures and the $i$th
procedure can be performed in $n_i$ ways, i = 1, 2, ..., k, then the number
of ways in which we can perform procedure 1 or 2 or ... or k is
$n_1 + n_2 + \dots + n_k$, provided no two procedures can be performed
together.
• Example 5.1: Suppose that we are planning a trip and are deciding
between bus and train transportation. If there are 3 bus routes and 2
train routes from A to B, find the number of available routes for the
trip. There are 3 + 2 = 5 possible routes.
• The Multiplication Rule:
• If a choice consists of k steps, of which the first can be made in $n_1$
ways, the second in $n_2$ ways, ..., and the kth in $n_k$ ways, then the
whole choice can be made in $n_1 \times n_2 \times \dots \times n_k$ ways.
• Example 5.2: An airline has 6 flights from A to B and 7 flights from B
to C per day. If the flights are to be made on separate days, in how many
different ways can the airline offer a trip from A to C?
• Example 5.3: The digits 0, 1, 2, 3, and 4 are to be used in a 4-digit
identification card. How many different cards are possible if
a) Repetitions are permitted.
b) Repetitions are not permitted.
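The multiplication rule can be checked by brute-force enumeration, as in the sketch below. It assumes that an identification number may begin with 0, which the example does not say explicitly.

```python
from itertools import permutations, product

# Example 5.2: 6 choices for the first leg, 7 for the second.
flight_offers = 6 * 7

# Example 5.3: 4-digit cards from the digits 0-4
# (a leading 0 is allowed in this sketch).
digits = "01234"
with_repetition = sum(1 for _ in product(digits, repeat=4))      # 5*5*5*5
without_repetition = sum(1 for _ in permutations(digits, 4))     # 5*4*3*2

print(flight_offers, with_repetition, without_repetition)
```

Enumerating is practical only for small cases like these. The rule itself gives the same counts without listing anything.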
Permutation Rule:
• A permutation is an arrangement of all or part of a set of objects with
regard to order.
• Rule 1: The number of permutations of n distinct objects taken all
together (that is, n at a time) is
$_nP_n = n!$
• Rule 2: A permutation of n different objects taken r at a time is an
arrangement of r out of the n objects, with attention given to the order of
arrangement. The number of permutations of n objects taken r at a time is
denoted by $_nP_r$ or $P(n, r)$ and is given by
$_nP_r = \frac{n!}{(n-r)!}$
• Rule 3: The number of permutations of n objects taken all at a time, when
$n_1$ objects are alike of one kind, $n_2$ objects are alike of a second
kind, ..., and $n_k$ objects are alike of a kth kind, is given by:
$\frac{n!}{n_1!\, n_2! \cdots n_k!}$
Examples:
1. Suppose we have the letters A, B, C, D.
a) How many permutations are there taking all four at a time?
b) How many permutations are there taking two letters at a time?
2. Find the number of permutations of the letters of the word STATISTICS
taken all at a time.
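The three permutation rules can be tried on these examples with the standard library. `math.perm` (Python 3.8 or later) covers Rules 1 and 2, and the helper `multiset_permutations` is a name invented here for Rule 3.

```python
import math
from collections import Counter


def multiset_permutations(word):
    """Rule 3: n! divided by the factorial of each repeated letter count."""
    total = math.factorial(len(word))
    for count in Counter(word).values():
        total //= math.factorial(count)
    return total


all_four = math.perm(4, 4)       # Rule 1: 4! arrangements of A, B, C, D
two_at_a_time = math.perm(4, 2)  # Rule 2: 4! / (4 - 2)!

# STATISTICS has 10 letters: S x3, T x3, I x2, A x1, C x1.
statistics_arrangements = multiset_permutations("STATISTICS")

print(all_four, two_at_a_time, statistics_arrangements)
```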
Combinations Rule:
• A combination is a selection of objects without regard to the order of
arrangement.
• A combination of n different objects taken r at a time is a selection of
r out of the n objects, denoted by the symbol
$\binom{n}{r} = {}_nC_r = \frac{n!}{r!\,(n-r)!}$
Examples:
1. Suppose a box contains 3 red, 3 white, and 5 black balls of equal
size. We want to draw 3 balls at a time. How many ways do we have from
each type?
2. In how many ways can a committee of 5 people be chosen out of 9 people?
3. Among 15 clocks there are two defectives. In how many ways can an
inspector choose three of the clocks for inspection so that:
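For the committee example, and for the unrestricted count of ways to pick three of the 15 clocks, `math.comb` applies the combination rule directly. The sketch below also checks it against the factorial formula. The specific conditions of the clock example are not reproduced here, so only the total number of selections is computed.

```python
import math


def combinations_by_formula(n, r):
    """nCr = n! / (r! * (n - r)!)."""
    return math.factorial(n) // (math.factorial(r) * math.factorial(n - r))


committee = math.comb(9, 5)           # committees of 5 chosen from 9 people
any_three_clocks = math.comb(15, 3)   # any 3 of the 15 clocks, no condition

print(committee, any_three_clocks)
```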
Different Approaches to Probability
Classical or Mathematical Approach
• If a random experiment results in N exhaustive, mutually exclusive, and
equally likely outcomes, of which n are favorable to the occurrence of an
event A, then the probability of A, usually denoted by P(A), is given by:
$P(A) = \frac{n}{N}$
Examples:
1. A basket contains 3 yellow, 4 black, and 3 white balls. What is the
probability of selecting one black ball?
2. A box of 80 candles consists of 30 defective and 50 non-defective
candles. If 10 of these candles are selected at random, what is the
probability that:
a) All will be defective
b) 6 will be non-defective
c) All will be non-defective
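Both examples reduce to counting favorable and total outcomes. The sketch below uses exact fractions and reads part (b) as "exactly 6 non-defective (and therefore 4 defective)", which is an interpretation rather than something the slide states. The helper name `prob_non_defective` is illustrative.

```python
from fractions import Fraction
from math import comb

# Example 1: one ball drawn from 3 yellow, 4 black and 3 white.
p_black = Fraction(4, 3 + 4 + 3)


def prob_non_defective(k, good=50, bad=30, drawn=10):
    """P(exactly k non-defective candles among the ones drawn)."""
    favorable = comb(good, k) * comb(bad, drawn - k)
    return Fraction(favorable, comb(good + bad, drawn))


p_all_defective = prob_non_defective(0)     # a) C(30,10) / C(80,10)
p_six_good = prob_non_defective(6)          # b) C(50,6) C(30,4) / C(80,10)
p_all_good = prob_non_defective(10)         # c) C(50,10) / C(80,10)

print(p_black)
print(float(p_all_defective), float(p_six_good), float(p_all_good))
```

Summing `prob_non_defective(k)` over k = 0, ..., 10 gives exactly 1, a useful sanity check that the counting is consistent.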
Empirical or Frequency Approach
• This approach is based on the relative frequency of occurrence of the
event when the number of observations is very large.
• Definition: The probability of an event A is the proportion of outcomes
favorable to A in the long run when the experiment is repeated under the
same conditions.
Examples:
1. If 1000 tosses of a coin result in 529 heads, the relative frequency of
heads is 529/1000 = 0.529. If another 1000 tosses result in 493 heads,
the relative frequency in the total of 2000 tosses is
(529 + 493)/2000 = 0.511.
2. Records show that 60 out of 100,000 bulbs produced are defective.
What is the probability that a newly produced bulb is defective?
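Axiomatic Approach:
• Given the sample space S of a random experiment, the probability of the
occurrence of any event A is defined as a set function P(A) satisfying the
following axioms:
1. $P(A) \geq 0$ for every event A.
2. $P(S) = 1$.
3. If A and B are mutually exclusive events, then
$P(A \cup B) = P(A) + P(B)$.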
Subjective Approach
 A probability derived from an individual's personal judgment about
whether a specific outcome is likely to occur.
 Subjective probabilities contain no formal calculations and only reflect the
subject's opinions and past experience.
 Subjective probabilities differ from person to person. Because the
probability is subjective, it contains a high degree of personal bias.
Events as a set: If A and B are two events then
Conditional probability
 Conditional Events: If the occurrence of one event affects the probability of
occurrence of the other event, then the two events are conditional or
dependent events.
 Let there be two events A and B. Then the probability of event A given that
event B has occurred is given by
P(A/B) = P(A ∩ B) / P(B), provided P(B) > 0.
Example : 120 employees of a certain factory are given a performance test and
are divided into two groups, those with good performance (G) and those
with poor performance (P). The result is given below.
             Good performance (G)   Poor performance (P)   Total
Male (M)              60                     20              80
Female (F)            25                     15              40
Total                 85                     35             120
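The questions that accompanied this table did not survive extraction. As an illustration only, the sketch below computes two conditional probabilities from the table counts, P(M/G) and P(G/F); the choice of these two events is an assumption, not the original exercise.

```python
from fractions import Fraction

# Counts from the performance table: (sex, performance) -> number of employees
counts = {
    ("M", "G"): 60, ("M", "P"): 20,
    ("F", "G"): 25, ("F", "P"): 15,
}
total = sum(counts.values())  # 120 employees

def prob(event):
    """Probability that a randomly chosen employee satisfies `event`."""
    favorable = sum(n for key, n in counts.items() if event(*key))
    return Fraction(favorable, total)

def conditional(a, b):
    """P(A/B) = P(A and B) / P(B)."""
    return prob(lambda s, p: a(s, p) and b(s, p)) / prob(b)

is_male = lambda sex, perf: sex == "M"
is_female = lambda sex, perf: sex == "F"
is_good = lambda sex, perf: perf == "G"

p_male_given_good = conditional(is_male, is_good)      # 60 out of 85
p_good_given_female = conditional(is_good, is_female)  # 25 out of 40
print(p_male_given_good, p_good_given_female)
```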
Conditional probability…..cont’d
Example 2: If the probability that a research project will be well planned is 0.60
and the probability that it will be well planned and well executed is 0.54, what
is the probability that it will be well executed given that it is well planned?
Example 3: For a student enrolling as a freshman at a certain university, the
probability is 0.25 that he/she will get a scholarship and 0.75 that he/she will
graduate. The probability is 0.2 that he/she will get a scholarship and will also
graduate. What is the probability that a student who gets a scholarship
graduates?
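Both examples apply P(A/B) = P(A ∩ B) / P(B) directly. A minimal sketch, using exact fractions to avoid floating-point rounding (the function name is illustrative):

```python
from fractions import Fraction

def conditional(p_a_and_b, p_b):
    """P(A/B) = P(A and B) / P(B); inputs are decimal strings for exactness."""
    return Fraction(p_a_and_b) / Fraction(p_b)

# Example 2: well executed given well planned
p_executed_given_planned = conditional("0.54", "0.60")

# Example 3: graduates given scholarship
p_graduate_given_scholarship = conditional("0.2", "0.25")

print(float(p_executed_given_planned), float(p_graduate_given_scholarship))
```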
Probability of Independent Events
 Two events A and B are said to be independent if the occurrence of
event A in a probability experiment does not affect the probability of event B.
 In other words, events A and B are independent if the
conditional probability of A given B is the same as the unconditional probability
of A, i.e., P(A/B) = P(A) (B does not affect event A).
 This leads to a useful formula, which is also our definition of independence:
P(A ∩ B) = P(A) · P(B).
 Example: A box contains four black and six white balls. What is the probability
of getting two black balls in drawing one after the other under the following
conditions?
a) The first ball drawn is not replaced
b) The first ball drawn is replaced
solution ?
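One way to work the example: without replacement the second draw depends on the first, while with replacement the two draws are independent. A short sketch with exact fractions (variable names are illustrative):

```python
from fractions import Fraction

BLACK, WHITE = 4, 6
total = BLACK + WHITE

# a) Without replacement: P(B1) * P(B2 / B1)
p_without_replacement = Fraction(BLACK, total) * Fraction(BLACK - 1, total - 1)

# b) With replacement: the draws are independent, so P(B1) * P(B2)
p_with_replacement = Fraction(BLACK, total) * Fraction(BLACK, total)

print(p_without_replacement, p_with_replacement)
```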
CHAPTER -6
Sampling and Sampling Distribution
Sampling
Definitions:
Population – A group that includes all the cases (individuals, objects, or groups) in
which the researcher is interested.
Sample – A relatively small subset from a population.
 Parameter: Characteristic or measure obtained from a population.
 Statistic: Characteristic or measure obtained from a sample.
 Sampling: The process or method of sample selection from the population.
By Tewachew
Introduction to Statistics
67
Sampling Cont….
Sampling unit: the ultimate unit to be sampled or elements of the
population to be sampled
Examples:
Sampling frame: is the list of all elements in a population.
Examples:
cont….
Errors in sample survey:
There are two types of errors
a) Sampling error:
 Is the discrepancy between the population value and the sample value.
 May arise due to inappropriate sampling techniques.
b) Non-sampling errors: are errors due to procedural bias, such as:
 Incorrect responses
 Measurement errors
 Errors at different stages in processing the data.
Use of Sampling
 Reduced cost
 Greater speed
 Greater accuracy
 Greater scope
 Avoids destructive test
 The only option when the population is infinite
Sampling Techniques
 There are two types of sampling techniques.
 Random Sampling or probability sampling.
 Non Random Sampling or non probability sampling.
 Probability sampling methods : are those in which every item in the
population has a known chance, or probability, of being chosen for
sample.
 Non-probability sampling : it is defined as a sampling technique in
which the researcher selects samples based on the subjective judgment
of the researcher rather than random selection.
 Non-probability sampling is a method in which not all population
members have an equal chance of participating in the study, unlike
probability sampling.
Probability Sampling Methods
Simple Random Sampling
 Is a method of selecting items from a population such that every
possible sample of specific size has an equal chance of being
selected. In this case, sampling may be with or without
replacement.
 Simple random sampling can be done either using the lottery
method or table of random numbers.
Stratified Random Sampling:
 The population is divided into non-overlapping but exhaustive groups
called strata.
 Simple random samples are chosen from each stratum.
 Elements within the same stratum should be more or less homogeneous, while
elements in different strata should differ.
 It is applied if the population is heterogeneous.
 Some of the criteria for dividing a population into strata are: Sex (male,
female); Age (under 18, 18 to 28, 29 to 39); Occupation (blue-collar,
professional, other).
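The stratified procedure can be sketched as: group the units by stratum, then draw a simple random sample (without replacement) inside each group. The population below is hypothetical and the function name `stratified_sample` is illustrative; equal allocation per stratum is an assumption made only to keep the example short.

```python
import random

def stratified_sample(units, stratum_of, n_per_stratum, seed=None):
    """Draw a simple random sample of n_per_stratum units from every stratum."""
    rng = random.Random(seed)
    strata = {}
    for unit in units:
        strata.setdefault(stratum_of(unit), []).append(unit)
    return {
        name: rng.sample(members, min(n_per_stratum, len(members)))
        for name, members in strata.items()
    }

# Hypothetical population of 60 people, stratified by sex
population = [(i, "female" if i % 3 == 0 else "male") for i in range(60)]
sample = stratified_sample(population, lambda unit: unit[1], n_per_stratum=5, seed=1)

for stratum, chosen in sample.items():
    print(stratum, chosen)
```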
Cluster Sampling:
 The population is divided into non-overlapping groups called
clusters.
 A simple random sample of groups or clusters of elements is chosen,
and all the sampling units in the selected clusters are surveyed.
 Clusters are formed in a way that elements within a cluster are
heterogeneous.
 Cluster sampling is useful when it is difficult or costly to generate a
simple random sample.
Systematic Sampling
 The first element is selected randomly from a list or from sequential
files, and then every kth element is selected.
 The procedure starts by determining the first element to be included in
the sample.
 Then the technique is to take every kth item from the sampling frame.
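A minimal sketch of systematic sampling follows. It assumes the sampling interval is k = N // n and that the random start is one of the first k positions, which is the usual textbook convention but is not spelled out on the slide; the frame of 100 numbered units is hypothetical.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Pick a random start among the first k units, then take every kth unit."""
    k = len(frame) // n                       # sampling interval (assumed N // n)
    start = random.Random(seed).randrange(k)  # random start in positions 0..k-1
    return frame[start::k][:n]

frame = list(range(1, 101))  # hypothetical sampling frame of 100 numbered units
sample = systematic_sample(frame, n=10, seed=0)
print(sample)
```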
Non Random Sampling
Judgment/Purposive Sampling
 With this method, sampling is done based on previous ideas of
population composition and behavior.
 An expert with knowledge of the population decides which units in the
population should be sampled.
Convenience Sampling
 Samples are selected from the population only because they are
conveniently available to the researcher.
Quota Sampling
 The researcher selects sampling units based on some quota. In quota
sampling, the researcher makes sure that the final sample meets the quota
criteria.
Example: A researcher wants to survey individuals about what
smartphone brand they prefer to use. He/she considers a sample size of
500 respondents. Also, he/she is only interested in surveying ten states
in the US. Here’s how the researcher can divide the population by
quotas:
 Gender: 250 males and 250 females
 Age: 100 respondents in each of the age groups 16-20, 21-30, 31-40,
41-50, and 51+
 Employment status: 350 employed and 150 unemployed people
Sampling Distribution
 Sampling distributions are probability distributions used to
describe the variability of sample statistics.
 The sampling distribution of a statistic is the probability distribution of that statistic.
Sampling Distribution of the sample mean
 The sampling distribution of the sample mean is a theoretical probability distribution that
shows the functional relationship between the possible values of a given sample mean
based on samples of size n and the probability associated with each value, for all possible
samples of size n drawn from that particular population.
Sampling Distribution…..Cont’d
Steps for the construction of Sampling Distribution of the mean
1. From a finite population of size N, randomly draw all possible samples of size n.
2. Calculate the mean for each sample.
3. Summarize the means obtained in step 2 in terms of a frequency distribution or
relative frequency distribution.
Example:
 Take samples of size 2 with replacement and construct sampling distribution
of the sample mean.
Solution:
Sampling Distribution…..Cont’d
Sampling distribution of means
X̄       6     7     8     9     10    11    12    13    14
f(X̄)    1/25  2/25  3/25  4/25  5/25  4/25  3/25  2/25  1/25
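The three construction steps can be sketched in code. The population values for this example did not survive extraction; the sketch assumes the population {6, 8, 10, 12, 14}, chosen only because it reproduces the frequencies in the table above when samples of size 2 are drawn with replacement.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

population = [6, 8, 10, 12, 14]  # ASSUMED values, inferred from the table
n = 2

# Step 1: all possible ordered samples of size n, with replacement (5^2 = 25)
samples = list(product(population, repeat=n))

# Step 2: the mean of each sample
means = [Fraction(sum(sample), n) for sample in samples]

# Step 3: frequency distribution of the sample means
counts = Counter(means)
for mean in sorted(counts):
    print(mean, f"{counts[mean]}/{len(samples)}")
```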
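Sampling Distribution…..Cont’d
Solution:
$$\bar{X} \sim N\!\left(\mu, \frac{\sigma}{\sqrt{n}}\right) \;\Longrightarrow\; \bar{X} \sim N(5.7,\ 0.33)$$
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$
$$P(\bar{X} > 6) = P\!\left(Z > \frac{6 - 5.7}{0.33}\right) = P(Z > 0.91) = 0.5 - P(0 \le Z \le 0.91) = 0.1814$$
$$P(5 < \bar{X} < 6) = P\!\left(\frac{5 - 5.7}{0.33} < Z < \frac{6 - 5.7}{0.33}\right) = P(-2.12 < Z < 0.91) = P(0 \le Z \le 2.12) + P(0 \le Z \le 0.91) = 0.8016$$
$$P(\bar{X} < 5.2) = P\!\left(Z < \frac{5.2 - 5.7}{0.33}\right) = P(Z < -1.52) = 0.5 - P(0 \le Z \le 1.52) = 0.0643$$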
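These table look-ups can be cross-checked with Python's standard library. The sketch takes the mean 5.7 and standard error 0.33 from the solution above; because it does not round z to two decimals, its results differ from the table-based values in the third or fourth decimal place.

```python
from statistics import NormalDist

# Sampling distribution of the mean: centre 5.7, standard error 0.33
sampling_dist = NormalDist(mu=5.7, sigma=0.33)

p_above_6 = 1 - sampling_dist.cdf(6)
p_between_5_and_6 = sampling_dist.cdf(6) - sampling_dist.cdf(5)
p_below_5_2 = sampling_dist.cdf(5.2)

print(round(p_above_6, 4), round(p_between_5_and_6, 4), round(p_below_5_2, 4))
```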
7 : STATISTICAL INFERENCE
Inference is the process of drawing conclusions about the whole population
from sample data.
 In statistics there are two ways through which inference can be made.
 Statistical estimation
 Statistical hypothesis testing.
1. Statistical Estimation
This is one way of making inference about the population parameter where the
investigator does not have any prior notion about values of the population
parameter.
Point Estimation
 It is a procedure that results in a single value as an estimate for a
parameter.
Interval estimation
 It is the procedure that results in the interval of values as an estimate for
a parameter. It deals with identifying the upper and lower limits of a
parameter.
Definitions
Confidence Interval: An interval estimate with a specific level of confidence.
Confidence Level: The percentage of such intervals, over repeated sampling, that
contain the true value of the parameter.
Degrees of Freedom: The number of data values which are free to vary
once a statistic has been determined.
Estimator: A sample statistic which is used to estimate a population parameter.
A good estimator should be unbiased, consistent, and relatively efficient.
Estimate: A particular value that an estimator takes for a given sample.
Interval Estimate: A range of values used to estimate a parameter.
Point Estimate: A single value used to estimate a parameter.
Properties of best estimator
 Unbiased Estimator: An estimator whose expected value is the value of the
parameter being estimated.
 Consistent Estimator: An estimator which gets closer to the value of the
parameter as the sample size increases.
 Relatively Efficient Estimator: The estimator for a parameter with the smallest
variance.
Point and Interval estimation of the population mean: µ
Point Estimation
 Another term for a statistic is point estimate, since we are estimating the
parameter value.
 A point estimator is the mathematical way we compute the point estimate.
For instance, the sample mean X̄ is a point estimator of the population mean.
Confidence interval estimation of the population mean
 Although X̄ possesses nearly all the qualities of a good estimator, because of
sampling error we know that our sample statistic is unlikely to be exactly
equal to the population parameter; instead it will fall within an interval of
values around it.
 We will have to be satisfied knowing that the statistic is "close to" the
parameter.
 There are different cases to be considered to construct confidence intervals.
case-1 : If sample size is large or if the population is
normal with known variance
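The interval formula for this case appears to have been lost in extraction. Assuming the usual textbook form, with X̄ the sample mean, σ the population standard deviation, n the sample size and 1 − α the confidence level, it would read:

```latex
\bar{X} - Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}
\;<\; \mu \;<\;
\bar{X} + Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}
```

For a large sample with unknown σ, the sample standard deviation S is commonly substituted for σ.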
 The z values corresponding to the most commonly used confidence
levels are 1.645 for 90%, 1.96 for 95%, and 2.576 for 99%.
Case 2: If sample size is small and the population
variance is not known.
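The formula for this case is also missing from the extracted slide. Assuming the standard small-sample result for a normal population, with S the sample standard deviation and n − 1 degrees of freedom, the interval would be:

```latex
\bar{X} - t_{\alpha/2,\,n-1}\,\frac{S}{\sqrt{n}}
\;<\; \mu \;<\;
\bar{X} + t_{\alpha/2,\,n-1}\,\frac{S}{\sqrt{n}}
```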
Examples-2:
From a sample of size 25 drawn from a normal population, a mean of 32 was found. Given that the
population standard deviation is 4.2, find
a) A 95% confidence interval for the population mean.
b) A 99% confidence interval for the population mean.
Solution:
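The worked solution is not recoverable from the slide. Since the population standard deviation is given, the sketch below assumes the known-variance (z) interval applies; the function name `z_interval` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(sample_mean, sigma, n, confidence):
    """Confidence interval for the mean when sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z_{alpha/2}
    margin = z * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin

ci_95 = z_interval(32, 4.2, 25, 0.95)  # part (a)
ci_99 = z_interval(32, 4.2, 25, 0.99)  # part (b)
print([round(x, 2) for x in ci_95], [round(x, 2) for x in ci_99])
```

The 99% interval is wider than the 95% interval, as expected when more confidence is demanded from the same data.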
cont’d
Examples-2:
1. A drug company is testing a new drug which is supposed to reduce blood
pressure. From the six people who are used as subjects, it is found that
the average drop in blood pressure is 2.28 points, with a standard
deviation of 0.95 points. What is the 95% confidence interval for the mean
change in pressure?
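With n = 6 and only the sample standard deviation available, this is a small-sample (t) interval. The sketch assumes the drops in blood pressure are approximately normal and hard-codes the tabulated critical value t(0.025, 5) = 2.571, because the Python standard library has no t distribution.

```python
from math import sqrt

T_CRITICAL = 2.571  # t table value for alpha/2 = 0.025 and n - 1 = 5 degrees of freedom

n, mean_drop, sample_sd = 6, 2.28, 0.95
margin = T_CRITICAL * sample_sd / sqrt(n)
ci_95 = (mean_drop - margin, mean_drop + margin)

print(round(ci_95[0], 2), round(ci_95[1], 2))
```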
2 Hypothesis Testing
 This is also one way of making inference about population parameter,
where the investigator has prior notion about the value of the
parameter.
Definitions:
Statistical hypothesis: an assertion or statement about the population
whose plausibility is to be evaluated on the basis of the sample data.
Test statistic: a statistic whose value serves to determine whether to
reject or not reject the hypothesis being tested.
Statistical test: a procedure used to evaluate a statistical
hypothesis; its outcome depends on the sample data.
 There are two types of hypothesis:
1. Null Hypothesis (H0):- is a statistical hypothesis that states there is no
difference between a parameter and a specific value or hypothesized
value. H0 : 𝜇 = 𝜇0, where µ is the population mean and 𝜇0 is the
hypothesized mean.
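2. Alternative Hypothesis (H1):- is a statistical hypothesis that states
there exists a difference between a parameter and a specific value or
hypothesized value. It takes one of three forms: H1 : 𝜇 ≠ 𝜇0 (two-tailed),
H1 : 𝜇 > 𝜇0 (right-tailed), or H1 : 𝜇 < 𝜇0 (left-tailed).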
Types and size of errors:
Hypothesis testing is based on sample data, which may involve sampling and
non-sampling errors.
Type I error: Rejecting the null hypothesis when it is true.
Type II error: Failing to reject the null hypothesis when it is false.
NOTE:
These errors are present in any two-choice decision-making
problem.
There is always a possibility of committing one or the other error.
For a fixed sample size, the probabilities of a type I error (𝛼) and a type II error (𝛽) are inversely related and therefore
cannot both be minimized at the same time.
steps of hypothesis testing:
1. The first step in hypothesis testing is to specify the null hypothesis (H0) and
the alternative hypothesis (H1).
2. The next step is to select a significance level, 𝛼.
3. Identify the sampling distribution of the estimator.
4. Calculate the test statistic.
5. Identify the critical region.
6. Make a decision.
7. Summarize the result.
Hypothesis testing about the population mean, µ:
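Case 1: When sampling is from a normal distribution with
σ² known
 The relevant test statistic is Z = (X̄ − 𝜇0) / (σ/√n), which follows the standard normal distribution when H0 is true.
 After specifying 𝛼 we have the following regions (critical and acceptance)
on the standard normal distribution corresponding to the three forms of the
alternative hypothesis.
 Summary table for decision rule.
H1 : 𝜇 ≠ 𝜇0    Reject H0 if |Zcal| > Z(𝛼/2)
H1 : 𝜇 < 𝜇0    Reject H0 if Zcal < −Z(𝛼)
H1 : 𝜇 > 𝜇0    Reject H0 if Zcal > Z(𝛼)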
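The decision rule for Case 1 can be sketched as a small function. The function name `z_test` and the labels "two-sided", "greater" and "less" are illustrative choices; the numbers in the usage line are taken from the light-bulb exercise in this chapter (sample mean 1570, hypothesized mean 1600, σ = 120, n = 16, 𝛼 = 0.05, right-tailed alternative).

```python
from math import sqrt
from statistics import NormalDist

def z_test(sample_mean, mu0, sigma, n, alpha, alternative):
    """Return (z, reject_h0) for a test on the mean with known sigma."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    standard_normal = NormalDist()
    if alternative == "two-sided":   # H1: mu != mu0
        reject = abs(z) > standard_normal.inv_cdf(1 - alpha / 2)
    elif alternative == "greater":   # H1: mu > mu0
        reject = z > standard_normal.inv_cdf(1 - alpha)
    elif alternative == "less":      # H1: mu < mu0
        reject = z < standard_normal.inv_cdf(alpha)
    else:
        raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")
    return z, reject

z, reject_h0 = z_test(1570, 1600, 120, 16, 0.05, "greater")
print(z, reject_h0)
```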
Case 2: When sampling is from a normal distribution with
σ² unknown and small sample size
 The relevant test statistic is t = (X̄ − 𝜇0) / (S/√n), which follows the t distribution with n − 1 degrees of freedom when H0 is true.
 After specifying 𝛼 we have the following regions (critical and acceptance) on the
t distribution with n − 1 degrees of freedom corresponding to the three forms of the alternative hypothesis.
Examples:
1. Test the hypothesis that the average content of containers of a certain lubricant is 10 liters if the
contents of a random sample of 10 containers are 10.2, 9.7, 10.1, 10.3, 10.1, 9.8, 9.9, 10.4, 10.3, and
9.8 liters. Use the 𝛼 = 0.01 level of significance and assume that the distribution of contents is normal.
2. The mean lifetime of a sample of 16 fluorescent light bulbs produced by a company is computed to
be 1570 hours. The population standard deviation is 120 hours. Suppose the hypothesized value for the
population mean is 1600 hours. Can we conclude that the lifetime of light bulbs is increasing?
(Use 𝛼 = 0.05 and assume the normality of the population) (exercise!)
Cont’d
Step 7: Conclusion
At the 1% level of significance, we have no evidence to say that the average content of
containers of the given lubricant is different from 10 liters, based on the given sample data.
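The intermediate steps of the lubricant example did not survive extraction. A sketch of the calculation behind this conclusion is given below, assuming a two-tailed t test (Case 2) and the tabulated critical value t(0.005, 9) = 3.250, which is hard-coded because the standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, stdev

contents = [10.2, 9.7, 10.1, 10.3, 10.1, 9.8, 9.9, 10.4, 10.3, 9.8]
mu0 = 10            # hypothesized mean content in liters
T_CRITICAL = 3.250  # t table value for alpha/2 = 0.005 and 9 degrees of freedom

n = len(contents)
sample_mean = mean(contents)
sample_sd = stdev(contents)  # sample standard deviation (divisor n - 1)
t_calculated = (sample_mean - mu0) / (sample_sd / sqrt(n))

reject_h0 = abs(t_calculated) > T_CRITICAL
print(round(sample_mean, 2), round(sample_sd, 3), round(t_calculated, 3), reject_h0)
```

The calculated t is well inside the acceptance region, which agrees with the conclusion stated above.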
Chi-Square Test of Association
 Suppose attribute A has r mutually exclusive and exhaustive classes and
attribute B has c mutually exclusive and exhaustive classes.
 The entire set of data can be represented using an r × c contingency table.
 The chi-square test is used to test the hypothesis of
independence of two attributes.
 For instance, we may be interested in:
 Whether the presence or absence of
hypertension is independent of smoking habit or not.
 Whether the size of the family is independent of the level of education
attained by the mothers.
 Whether there is association between father and son regarding baldness.
Test of Association
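Test statistic: χ² = Σ (O − E)² / E, where O is an observed frequency and E = (row total × column total) / n is the corresponding expected frequency.
Decision Rule: Reject H0 (independence) if the calculated χ² exceeds the tabulated χ² value at level 𝛼 with (r − 1)(c − 1) degrees of freedom.
Example:
1. A geneticist took a random sample of 300 men to study whether there is
association between father and son regarding baldness. He obtained the
following results.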
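The table for this example did not survive extraction, so the procedure is illustrated on different data: the sex-by-education counts from the Summary Worksheet later in this chapter, used here purely as illustrative input. The function name `chi_square_statistic` is an assumption, and the critical value 5.991 (5% level, 2 degrees of freedom) is taken from a standard chi-square table.

```python
def chi_square_statistic(table):
    """Return (chi-square statistic, degrees of freedom) for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Rows: Elementary, Secondary, College; columns: Male, Female
observed = [[38, 45], [28, 50], [22, 17]]
chi2, df = chi_square_statistic(observed)

CRITICAL_VALUE = 5.991  # chi-square table, alpha = 0.05, df = 2
reject_independence = chi2 > CRITICAL_VALUE
print(round(chi2, 3), df, reject_independence)
```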
Test of Association
2. A random sample of 200 men, all retired, was classified according
to education and number of children, as shown below.
 Test the hypothesis that the size of the family is independent of
the level of education attained by fathers. (Use 5% level of
significance)
Summary Worksheet
1. A random sample of 200 adults is classified below by sex and level of
education attained.
Education     Male   Female
Elementary     38      45
Secondary      28      50
College        22      17
If a person is picked at random from this group, find the probability that
a) the person is a male given that the person has a secondary education
b) the person does not have a college education given that the person is a female.
Solution: Let A = the person has secondary education, M = the person is male, F =
the person is female, C = the person has college education and Cᶜ = the person does
not have college education.
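With the events defined as above, both answers follow from restricting the table to the conditioning row or column. A short sketch with exact fractions (variable names are illustrative):

```python
from fractions import Fraction

table = {
    "Elementary": {"M": 38, "F": 45},
    "Secondary": {"M": 28, "F": 50},
    "College": {"M": 22, "F": 17},
}

# a) P(M / A): males among those with secondary education
secondary_total = sum(table["Secondary"].values())
p_male_given_secondary = Fraction(table["Secondary"]["M"], secondary_total)

# b) P(not college / F): females without college education among all females
female_total = sum(row["F"] for row in table.values())
p_no_college_given_female = Fraction(female_total - table["College"]["F"], female_total)

print(p_male_given_secondary, p_no_college_given_female)
```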
2. Suppose we have a population of size N = 5, consisting of the ages of five
children (3, 6, 9, 12, 15), with population variance 18 and population
mean 9. If samples of size 3 are selected randomly without replacement, how many
different samples are possible? Construct the sampling distribution of the
sample mean.
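A sketch of one way to approach this problem: enumerate every unordered sample of size 3, compute each mean, and tabulate the frequencies. Treating samples as unordered combinations is an assumption consistent with sampling without replacement.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

ages = [3, 6, 9, 12, 15]
n = 3

samples = list(combinations(ages, n))  # all unordered samples without replacement
means = [Fraction(sum(sample), n) for sample in samples]
counts = Counter(means)

print("number of samples:", len(samples))
for mean in sorted(counts):
    print(mean, f"{counts[mean]}/{len(samples)}")
```

The mean of the sample means equals the population mean of 9, illustrating that the sample mean is an unbiased estimator.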
Summary Worksheet
3. A random sample of 200 men, all retired, was classified according to
education and number of children, as shown below. Test the hypothesis that the size
of the family is independent of the level of education attained by fathers.
(Use 5% level of significance)
Summary worksheet
1. A researcher reports that the average salary of veterinarians is
more than $42000. A sample of 30 veterinarians has a mean
salary of $43260. Test the report's claim. Assume the population
standard deviation is $5230.
2. A national magazine claims that the average college student
watches less television than the general public. The national
average is 29.4 hours per week, with a standard deviation of 2
hours. A random sample of 25 college students has a mean of 27
hours. Test the claim. Assume normality.
3. A merchant believes that the average age of customers who
purchase a certain brand of wear is 13 years. A random
sample of 35 customers had an average age of 15.6 years. At
α = 0.01, should this conjecture be rejected? The standard
deviation of the population is 1y