Uploaded by Syed Safiullah Shah

Probability & Statistics Basics Presentation

Probability & Statistics
Basics of Probability & Statistics
Chapter 1
Imran Ali
Textbook
Probability & Statistics for Engineers & Scientists, 8th Edition, by Ronald E. Walpole, Raymond H. Myers, Sharon L. Myers and Keying Ye
NOTE: The book has a wider scope than this course, so use the book with the course guideline in mind
What is meant by Probability?
- The term probability refers to the study of randomness and uncertainty
  - A 40% chance of showers does not mean that it will or will not rain
- Used when circumstances permit multiple outcomes
  - It may or may not rain
    - Two possible outcomes
  - Students may have a percentage >80%, >75%, 70% etc
    - Multiple outcomes possible
What is meant by Statistics?
- Deals with the collection and interpretation of data
  - Involves collecting, classifying, summarising, organising, analysing and finally interpreting data
- Statistics for runs scored by a batsman or wickets taken by a bowler
  - How would the definition apply here?
- Statistics for the number of students in MUET who passed all subjects
- Descriptive Statistics
- Inferential Statistics
Descriptive Statistics
- Collection of data leading to presenting its summary and describing its important features
  - For example, collecting data from the students of 10BM on whether they would prefer practical or theory classes in the morning. The results are displayed in the form of a graph
  - The techniques used to describe the data fall under the category of descriptive statistics
Inferential Statistics
- Now that the data is available, inferential statistics helps to translate the output of descriptive statistics into a conclusion (an inference) about the population
  - For example, if the students of 10BM prefer theory classes in the morning, and assuming that most students feel the same, we can conclude that all students studying BME prefer theory classes in the morning
  - Is this a good enough assumption?


- Consider the figure, which shows the statistics for the data collected during a survey to find the most popular sport in Pakistan (fictitious)
[Figure: bar chart of the number of respondents favoring Football, Cricket and Hockey]
- Descriptive statistics informs us that 13 people favored cricket, 6 favored football, 3 favored hockey. 19 people were involved in the survey
- Inferential statistics informs us that cricket is by far the most popular sport!
Why Probability?
- Without the use of probability, the true meaning of a statistical inference cannot be extracted
- Probability theory translates the outputs of descriptive statistics into inferential statistics
[Diagram: Population and Sample, linked by Probability and Inferential Statistics]
Why Statistics?
- Engineering systems are designed after a thorough understanding of the system requirements
- Statistical concepts and methods provide ways of gaining new insights into the behavior of many phenomena that are encountered in every field of engineering & science
- The field of statistics makes intelligent judgment and informed decision making possible
  - Collecting data from hospitals about their requirements for medical appliances can tell you which products they would like to have
  - Once supported by statistics, Bio-Medical Engineers can work on the design and development of the required products
Measures of Central Location
- Methods to identify quantitatively the central position, the central value, in a data set
- We use these measures in our daily lives
- Examples:
  - Average marks obtained by the class in a subject
  - Average distance covered by a vehicle per litre of petrol
  - Statements like, 8 out of 10 recommend this...
Experiment
- An activity or process whose outcome is subject to uncertainty
- Like an experiment you may perform in a chemistry lab, experiments for gathering data have to be planned properly too
- Tossing a coin, conducting a survey, teacher’s evaluation etc
Population
- An experiment will consist of a well-defined collection of objects constituting a population of interest
- The desired information is available from the population
- For example,
  - Students who graduated with BEng Bio-Medical
    - Feedback about the course
  - People who suffer from malaria in Hyderabad
    - Do they live near water bodies?
- When the desired information is available for all objects in the population, we have a census
- Due to time, money and other miscellaneous reasons, a subset of the population, i.e. a sample, is usually used instead
Sample
- A sample is a set of observations taken from a population
  - Subset of data of interest
  - Subset of the population
- For example, if we want to find out why some students underperform, then instead of every student we can use students with a 1st year academic percentage between 50 and 60
Sampling
- Sampling refers to the collection of data in a discrete manner, i.e. from part of the entire population
  - While checking for faulty products at a manufacturing plant, we check only samples drawn randomly
- For example, you take a bite of a piece of cake and judge how the rest of it might taste
  - You do not need to eat it all to know
- Can you think of more examples?
- Representative Sampling
  - One that accurately reflects the characteristics of its population
  - It is neutral
  - For example, you run a survey among students at MUET and ask them which flavour of ice cream they like
- Biased Sampling
  - Not neutral
  - For example, you run a survey among students at MUET and ask them which is the best department at the university
  - ES has a greater chance of being selected because it has more students, while CRP does not
Sample mean
- The sample mean is the numerical average of the values in a data set
- Suppose the set X has n elements, then the sample mean is
$$\bar{x} = \sum_{i=1}^{n} \frac{x_i}{n} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$$
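The formula above can be sketched in a few lines of Python. This is only an illustration: the function name `sample_mean` and the example marks are assumptions, not part of the slides.

```python
def sample_mean(values):
    """Return the sum of the values divided by the number of values n."""
    n = len(values)
    if n == 0:
        raise ValueError("the sample mean is undefined for an empty data set")
    return sum(values) / n


# Illustrative marks of five students (assumed example data).
marks = [70, 65, 75, 80, 70]
print(sample_mean(marks))  # 72.0
```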

- For example,
  - Average marks of students in a class
  - Average petrol consumption per kilometer for a certain vehicle
  - More examples where this kind of information is used?
Median
- The median is the value which lies at the centre of a data set
  - The number of elements with a value greater than the median equals the number of elements with a value smaller than the median
- In order to find the median of a data set, arrange the elements in ascending order of value
  - The median is the middle value
- Consider A = {5 1 2 7 3}
  - A1 = {1 2 3 5 7}
  - The median is 3
- The median remains unaffected by the extreme values of a data set
  - How?
- Consider B = {5 1 2 7 3 6}
  - B1 = {1 2 3 5 6 7}
  - In such a case the median is equal to one half of the sum of the “two” centre values
  - (3+5)/2 = 4
  - Verification: three elements have a value less than 4 while three have a value greater than 4
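A minimal sketch of the procedure described above (sort, then take the middle value or the average of the two middle values). The function name `median` is an assumption for illustration; the last call replaces 7 with an arbitrary extreme value to show that the median does not move.

```python
def median(values):
    """Return the middle value of the sorted data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median is undefined for an empty data set")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([5, 1, 2, 7, 3]))     # 3
print(median([5, 1, 2, 7, 3, 6]))  # 4.0
print(median([5, 1, 2, 700, 3]))   # 3, unchanged by the extreme value
```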

- So we note that
  - The median may not actually exist in the data set
  - It may not be possible. How?
  - Violations of the definition can also occur, e.g. a set of all ones
Mode
- The simplest measure of central tendency
- Least used in practice
- This measure provides information about the most frequently occurring value in a data set
- For example, the mode of the following data set A is 3
  - A = {1 2 3 4 5 6 3}
- It is also possible to have more than one mode
  - C = {1 2 3 4 5 6 3 4 9 9 10}
  - How many modes are there?
  - What are they?
- A data set with two modes is called bimodal; one with more than two modes, such as C, is called multimodal
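One way to answer the questions above is to count how often each value occurs and keep every value that reaches the highest count. The sketch below uses Python's standard `collections.Counter`; the function name `modes` is an assumption for illustration.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency, in ascending order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([1, 2, 3, 4, 5, 6, 3]))               # [3]
print(modes([1, 2, 3, 4, 5, 6, 3, 4, 9, 9, 10]))  # [3, 4, 9]
```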
Trimmed Mean
- A method of averaging that removes a small percentage of the largest and smallest values before calculating the mean
- The mean is quite sensitive to extreme values, while the median does not always take the extreme values into account
- The trimmed mean is a compromise between the two
- Suppose we have the following data set, A = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} and we want to calculate the 10% trimmed mean
- 10% trimming = remove 10% of the total elements of the data set from each extreme
  - i.e. remove 0.1*10 = 1 element from each end
  - B = {2, 3, 4, 5, 6, 7, 8, 9}
  - Now calculate the mean
Representing Data using a Dot-plot
- This is a statistical chart/plot used as part of descriptive statistics
- Illustrates the location of elements in a data set on a simple scale
- For example, marks secured by students in the exam can be illustrated as
[Figure: dot plot of student marks (x) on a scale from 0 to 100]
Examples
- Using the following data sets, find the mean, median, mode and the 5%, 10%, 20%, 40% trimmed mean
  - X = {2, 1, 9, 3, 8, 4, 6, 7, 0, 1, 3, 6, 5, 9, 1}
  - Y = {5, 0.6, 6, 0.1, 0.5}
More Examples-1
- Twenty adult males between the ages of 30 and 40 were involved in a study to evaluate the effect of a specific health regimen involving diet and exercise on the blood cholesterol.
- Ten were randomly selected to be a control group and ten others were assigned to take part in the regimen as the treatment group for a period of 6 months.
- The following data shows the reduction in cholesterol so far

Control Group:     7, 3, -4, 14, 2, 5, 22, -7, 9, 5
Treatment Group:  -6, 5, 9, 4, 4, 12, 37, 5, 3, 3

- Compute the mean, the median, the mode and the 10% trimmed mean
- Provide an explanation for the information provided by the statistics
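As a starting point for this exercise, the trimmed mean defined earlier can be sketched as follows. The function name `trimmed_mean` is an assumption, and so is the choice to round the number of trimmed elements down when the percentage does not give a whole number; the slides do not say how to handle that case.

```python
def trimmed_mean(values, percent):
    """Mean after removing `percent`% of the elements from EACH end of the sorted data."""
    ordered = sorted(values)
    # Number of elements dropped per end; rounding down is an assumption.
    k = int(len(ordered) * percent / 100)
    kept = ordered[k:len(ordered) - k]
    if not kept:
        raise ValueError("trimming removed every element")
    return sum(kept) / len(kept)


control = [7, 3, -4, 14, 2, 5, 22, -7, 9, 5]
treatment = [-6, 5, 9, 4, 4, 12, 37, 5, 3, 3]

for name, group in (("control", control), ("treatment", treatment)):
    print(name, sum(group) / len(group), trimmed_mean(group, 10))
# control 5.6 5.125
# treatment 7.6 5.625
```

The single large reduction of 37 pulls the treatment mean up to 7.6, while the 10% trimmed means of the two groups are much closer to each other.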
More Examples-2
- Blood pressure values are often reported to the nearest 5 mmHg (100, 105, 110 etc). Suppose the actual blood pressure values for nine randomly selected individuals are
  118.6, 127.4, 138.4, 130.0, 113.7, 122.0, 108.3, 131.5, 133.2
- What is the median of the reported blood pressure values?
- Suppose the blood pressure of the second individual is 127.6 rather than 127.4. How does this affect the median of the reported values?
- What if the median was calculated with rounded-off data? How would this affect the results?
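The questions above can be explored with a short sketch. It assumes that "reported" means rounded to the nearest 5 mmHg, as the examples 100, 105, 110 suggest; the helper name `report` is hypothetical. The median comes from Python's standard `statistics` module.

```python
from statistics import median

actual = [118.6, 127.4, 138.4, 130.0, 113.7, 122.0, 108.3, 131.5, 133.2]


def report(value, step=5):
    """Round a reading to the nearest multiple of `step` (assumed reporting rule)."""
    return step * round(value / step)


reported = [report(v) for v in actual]
print(median(actual))    # 127.4
print(median(reported))  # 125

# Second individual measured at 127.6 instead of 127.4.
changed = list(actual)
changed[1] = 127.6
print(median(changed))                       # 127.6
print(median([report(v) for v in changed]))  # 130
```

A change of 0.2 mmHg in one actual value moves the median of the reported values by 5 mmHg, because the reading crosses a rounding threshold.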
Measures of Variability
Introduction
- The measures of central location we discussed in the previous lecture provide only partial information about a data set
- It is possible that multiple data sets have identical measures of center yet they differ from one another
- For example, consider the marks achieved by students in two examinations

Exam 1:  70, 65, 75, 80, 70     (Mean = 72)
Exam 2:  40, 35, 95, 100, 90    (Mean = 72)

[Figure: dot plot of Exam 1 (x) and Exam 2 (o) marks on a scale from 0 to 100, with the mean marked]
Statistics for measuring variability
- Three key statistics
  - Range
  - Sample Variance and Population Variance
  - Sample Standard Deviation and Population Standard Deviation
Range
- It is the difference between the largest and smallest value in a data set
- For example,
  - The range of marks for Exam 1 is 15
  - The range of marks for Exam 2 is 65

Exam 1:  70, 65, 75, 80, 70     (Mean = 72)
Exam 2:  40, 35, 95, 100, 90    (Mean = 72)
- Range is, however, a poor measure of the variation of values in a data set because it is based on the two extreme values and disregards the position of the remaining samples

Exam 3:  20, 74, 75, 75, 75, 75, 76, 100    (Mean = 71.25)

[Figure: dot plot of Exam 3 marks (x) on a scale from 0 to 100]
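A small sketch makes the weakness of the range concrete, using the three exams from the slides. The function name `value_range` is an assumption for illustration.

```python
def value_range(values):
    """Difference between the largest and the smallest value."""
    return max(values) - min(values)


exam1 = [70, 65, 75, 80, 70]
exam2 = [40, 35, 95, 100, 90]
exam3 = [20, 74, 75, 75, 75, 75, 76, 100]

print(value_range(exam1))  # 15
print(value_range(exam2))  # 65
print(value_range(exam3))  # 80
```

Exam 3 has the largest range of the three even though six of its eight marks lie between 74 and 76.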
Deviation from the Mean value
- In this statistic we find the amount by which the value of each sample in the data set deviates from the mean value of the data set
- For a data set with values x1, x2, x3, …, xn the deviations from the mean can be calculated as
$$x_1 - \bar{x},\; x_2 - \bar{x},\; \dots,\; x_n - \bar{x}$$
- The deviation will be positive if the sample has a value greater than the mean and negative when the sample has a value smaller than the mean
- Calculate the deviations from the mean for the samples in the data set.

Exam 3:  20, 74, 75, 75, 75, 75, 76, 100
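The deviations for Exam 3 can be checked with a few lines of Python. This is only a sketch of the calculation; the variable names are assumptions.

```python
exam3 = [20, 74, 75, 75, 75, 75, 76, 100]

mean = sum(exam3) / len(exam3)          # 71.25
deviations = [x - mean for x in exam3]  # x_i minus the mean, for every sample

print(deviations)       # [-51.25, 2.75, 3.75, 3.75, 3.75, 3.75, 4.75, 28.75]
print(sum(deviations))  # 0.0
```

The positive and negative deviations cancel, which is why the variance squares them before averaging.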
- If all deviations are of small magnitude, then all the samples in the data set will be close to the mean
  - Small variability in the data set
  - Converse is also true
- Small/no variability is important
  - Soft drink manufacturers
  - Medicine manufacturers
  - Electronic equipment
Variance
- Comparing the variability of a data set with few samples is straightforward; with larger data sets, however, it becomes more cumbersome
- Variance is a statistic which combines the deviations of the individual samples within a data set
- Two types:
  - Population variance
  - Sample variance
Population Variance
- The population variance can be calculated as
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
- Where,
  - σ² = population variance
  - μ = population mean
Sample Variance
- The population variance is not always known
  - It is difficult to know the variance of a data set involving 200 million Pakistanis and their preference on a given matter
- So we estimate the population variance from the sample variance (inference)
- As the name suggests, sample variance is the variance of a data set which is a subset of the population
- Sample variance can be calculated as,
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
- Where,
  - s² = sample variance
  - x̄ = sample mean
- Thus variance is the average squared deviation
- If the unit of the sample is cm, then the unit of the variance is cm²
Population Standard Deviation
- It is the positive square root of the population variance
$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
- The unit for the standard deviation is the same as the samples in the data set
Sample Standard Deviation
- It is the positive square root of the sample variance
$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
- The unit for the standard deviation is the same as the samples in the data set
Examples
- Using the following data sets, find the variance and the standard deviation.
  - X = {2, 1, 9, 3, 8, 4}
  - Y = {5, 0.6, 6, 0.1, 0.5}
- Let us find the population mean and variance.
X = {2, 1, 9, 3, 8, 4}
We can assume that the data set is a population because we have not been explicitly told that the given data set represents a much larger data set
Size of dataset: N = 6
Population mean:
$$\mu = \frac{2 + 1 + 9 + 3 + 8 + 4}{6} = 4.5$$

i    x_i    x_i − μ            (x_i − μ)²
1    2      2 − 4.5 = −2.5     6.25
2    1      1 − 4.5 = −3.5     12.25
3    9      9 − 4.5 = 4.5      20.25
4    3      3 − 4.5 = −1.5     2.25
5    8      8 − 4.5 = 3.5      12.25
6    4      4 − 4.5 = −0.5     0.25

$$\sum_{i=1}^{N} (x_i - \mu)^2 = 6.25 + 12.25 + 20.25 + 2.25 + 12.25 + 0.25 = 53.5$$
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{53.5}{6} = 8.9167$$
$$\sigma = \sqrt{8.9167} = 2.9861$$
μ = 4.5
σ = 2.9861
[Figure: dot plot of X on a scale from 0 to 10, with the mean at 4.5 marked]
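The worked example can be reproduced with a short sketch that follows the two variance formulas directly. The function names are assumptions for illustration; the sample versions are included only to show the effect of dividing by n − 1 instead of N on the same data.

```python
import math


def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


X = [2, 1, 9, 3, 8, 4]
print(round(population_variance(X), 4))             # 8.9167
print(round(math.sqrt(population_variance(X)), 4))  # 2.9861
print(round(sample_variance(X), 4))                 # 10.7
print(round(math.sqrt(sample_variance(X)), 4))      # 3.2711
```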
Chebyshev’s theorem & z-scores
Chebyshev’s Theorem
- So far we have studied some important statistics for the description of a data set, including the mean, variance and the standard deviation
- We observed that when the standard deviation was large, there was greater variability in the data set and vice versa
- P. L. Chebyshev discovered that
  - The fraction (percentage) of measurements falling between any two values symmetric about the mean is related to the standard deviation
- At least the fraction 1 − (1/k²) of the measurements of any set of data must lie within k standard deviations of the mean
  - It means that more than 1 − (1/k²) can also lie within that interval
- Examples,
  - With k = 2, 1 − (1/2²) = 3/4 = 75% or more of the values must lie within 2 standard deviations on either side of the mean
    Population: (μ − 2σ, μ + 2σ)
    Sample: (x̄ − 2s, x̄ + 2s)
  - With k = 3, 1 − (1/3²) = 8/9 = 88.9% or more of the values lie within three standard deviations on either side of the mean
    Population: (μ − 3σ, μ + 3σ)
    Sample: (x̄ − 3s, x̄ + 3s)
- The theorem is not so helpful for k = 1
  - For k = 1, 1 − (1/1²) = 0, i.e. zero or more values must lie within 1 standard deviation on either side of the mean
  - Why is this not helpful?
[Figure: dot plot of X = {2, 1, 9, 3, 8, 4} with μ = 4.5 and σ = 2.9861, marking the intervals μ ± σ and μ ± 2σ]
- μ ± σ contains 3/6 = 50% of the values.
- μ ± 2σ contains 6/6 = 100% of the values.
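The guaranteed fraction can be compared with the fraction actually observed in a data set. The sketch below does this for X = {2, 1, 9, 3, 8, 4}, treating it as a population; the function names `chebyshev_bound` and `fraction_within` are assumptions for illustration.

```python
def chebyshev_bound(k):
    """Minimum fraction of values within k standard deviations of the mean."""
    if k <= 0:
        raise ValueError("k must be positive")
    return max(0.0, 1 - 1 / k ** 2)


def fraction_within(values, k):
    """Observed fraction of values within k population standard deviations of the mean."""
    n = len(values)
    mu = sum(values) / n
    sigma = (sum((x - mu) ** 2 for x in values) / n) ** 0.5
    inside = [x for x in values if abs(x - mu) <= k * sigma]
    return len(inside) / n


X = [2, 1, 9, 3, 8, 4]
for k in (1, 2, 3):
    print(k, round(chebyshev_bound(k), 3), fraction_within(X, k))
# 1 0.0 0.5
# 2 0.75 1.0
# 3 0.889 1.0
```

In every row the observed fraction is at least as large as the bound, as the theorem requires.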
Example: Using Chebyshev’s theorem, find the percentage of values that fall between 20 and 30 for a data set with sample mean 21 and a standard deviation of 2.
Solution:
The lower limit is
$$20 = \bar{x} - ks \;\Rightarrow\; ks = \bar{x} - 20 \;\Rightarrow\; k = \frac{\bar{x} - 20}{s}$$
$$k = \frac{21 - 20}{2} = 0.5$$
Chebyshev’s Theorem
If the IQs of a random sample of 1080 students at a large university have a mean score of 120 and a standard deviation of 8, use Chebyshev’s theorem to determine the interval containing at least 810 of the IQs in the sample.
Solution:
According to the question, we need to determine the interval containing at least 810 of the IQs in the sample.
So, we need to find the value of k for the interval from μ − kσ to μ + kσ
We know that,
$$1 - \frac{1}{k^2} = \frac{810}{1080} = \frac{3}{4}$$
$$\frac{1}{k^2} = 1 - \frac{3}{4} = \frac{1}{4}$$
$$k^2 = 4 \;\Rightarrow\; k = 2$$
Now, the interval can be calculated as,
Lower limit: μ − kσ = 120 − 2*8 = 120 − 16 = 104
Upper limit: μ + kσ = 120 + 2*8 = 120 + 16 = 136
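The same steps can be written as a small function that solves 1 − 1/k² = fraction for k and then builds the interval. The function name `chebyshev_interval` is an assumption for illustration.

```python
import math


def chebyshev_interval(mean, std, fraction):
    """Return (k, lower, upper) such that at least `fraction` of the values lie in [lower, upper]."""
    if not 0 <= fraction < 1:
        raise ValueError("fraction must be in [0, 1)")
    k = 1 / math.sqrt(1 - fraction)  # from 1 - 1/k^2 = fraction
    return k, mean - k * std, mean + k * std


k, lower, upper = chebyshev_interval(120, 8, 810 / 1080)
print(k, lower, upper)  # 2.0 104.0 136.0
```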
Concluding Remarks
- Chebyshev’s theorem holds for any distribution
- The value given by the theorem is a lower bound only
  - The actual value can be much greater than the lower bound
Z-scores
- Motivation
  - Suppose I want to compare the marks you score in two subjects
    - Applied Calculus (AC)
    - English
  - Suppose that a student scores 80 marks in AC and 90 marks in English
  - Does it mean that the student is better at English than Applied Calculus?
- We cannot say with certainty that the student is better at English than Applied Calculus
  - Rather, it would make more sense to compare the student’s performance in these two subjects relative to the performance of all other students in the class
- Why compare relative performance?
  - It is quite possible that the Applied Calculus examination is more difficult than English
- It is quite possible that the mean grade in English was 86 with a standard deviation of 5, while the mean in AC was 70 with a standard deviation of 8
- So, the objective is to compare two observations from two different populations in order to determine their relative rank
- One of the ways is to convert the observations into standard units known as z-scores or z-values
- Z-score:
  - An observation x from a population with mean µ and standard deviation σ has a z-score defined by
$$z = \frac{x - \mu}{\sigma}$$
  - Note that the units in the numerator and the denominator cancel, so the z-score is a unitless quantity.
  - This permits comparison of even two distinct observations
- Compare the z-scores for the two exams
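The comparison can be sketched with the marks, means and standard deviations quoted in the motivation above. The function name `z_score` is an assumption for illustration.

```python
def z_score(x, mean, std):
    """Number of standard deviations by which x lies above (+) or below (-) the mean."""
    return (x - mean) / std


z_ac = z_score(80, 70, 8)       # Applied Calculus: 80 marks, mean 70, std 8
z_english = z_score(90, 86, 5)  # English: 90 marks, mean 86, std 5

print(z_ac)       # 1.25
print(z_english)  # 0.8
```

Relative to the class, the student did better in Applied Calculus (1.25 standard deviations above the mean) than in English (0.8), even though the raw English mark is higher.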
Example: Different typing skills are required for secretaries depending on whether one is working in a law office, an accounting firm or for a mathematical research group at a university.
The data is gathered from three distinct testing methods. Determine the candidate who has the fastest typing speed.

Sample        Applicant’s Score
Law           141 seconds
Accounting    7 minutes
Scientific    33 minutes
Frequency distribution
- Frequency of a particular observation is the number of times the observation occurs in a data set
- Frequency distributions can be portrayed as
  - Frequency tables
  - Histograms
- Important characteristics of a large data set can be easily assessed by first grouping the data into different classes and then determining the number of observations that fall in each of the classes
- In tabular form we call this a frequency distribution
- The data presented in the form of a frequency distribution is called grouped data
- Data can be grouped according to intervals/classes (as in classification)

Table: Frequency distribution for Percentage secured by students of BME
Percentage    Number of students
0-49          1
50-59         2
60-69         15
70-79         20
80-89         4
90-100        1

- Grouping the data provides a better overall picture of the unknown population
  - What can you infer from the table?
- However, grouped data loses the identity of the individual observations
  - How?
- Note that the lower limit of the interval is called the lower class limit and the upper limit is called the upper class limit
  - e.g. 70 and 79 for the class 70-79%
- Moreover, 79.5% is the upper class boundary and 69.5% is the lower class boundary for the class 70-79%
- The number of observations falling in a particular class is called the class frequency
  - Denoted by “f”
- The numerical difference between the upper and lower class boundaries of a class interval is defined to be the class width
- The midpoint between the upper and the lower class boundaries is called the class mark or class midpoint.
Generating a Freq. Distribution
- Consider the following data
  2.2, 4.1, 3.5, 4.5, 2, 3.4, 1.6, 3.1, 3.3, 3.8, 2.5, 4.3, 3.4, 3.6, 2.9, 3.3, 3.1, 3.7, 4.4, 3.2, 4.9, 3.8, 3.2, 2.6, 3.9
- STEP 1: Decide on the number of class intervals required
  - The number must be smaller than the number of observations, otherwise we gain nothing from grouping
  - Too few classes will make the outcome too generalised
  - Look at the data and decide
    - Typically between 5 & 20
    - I’ll take 5 (can change later)
- STEP 2: Determine the range
  - From the data we find that the range is 4.9 − 1.6 = 3.3
- STEP 3: Divide the range by the number of classes in order to estimate the approximate width of the interval
  - 3.3/5 = 0.66
  - 3.3/7 = 0.47, approx. 0.5
- STEP 4: List the lower class limit of the bottom interval and then the lower class boundary
  - Add the class width to the lower class boundary to obtain the upper class boundary
  - Write down the upper class limit and complete the table

Class Interval    Class Boundary
1.5 – 1.9         1.45 – 1.95
2.0 – 2.4         1.95 – 2.45
2.5 – 2.9         2.45 – 2.95
3.0 – 3.4         2.95 – 3.45
3.5 – 3.9         3.45 – 3.95
4.0 – 4.4         3.95 – 4.45
4.5 – 4.9         4.45 – 4.95

- STEP 5: Determine the class marks by averaging the class limits
  - What are they?
- STEP 6: Write the frequencies

Class Interval    Class Boundary    Frequency
1.5 – 1.9         1.45 – 1.95       1
2.0 – 2.4         1.95 – 2.45       2
2.5 – 2.9         2.45 – 2.95       3
3.0 – 3.4         2.95 – 3.45       8
3.5 – 3.9         3.45 – 3.95       6
4.0 – 4.4         3.95 – 4.45       3
4.5 – 4.9         4.45 – 4.95       2
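The counting in STEP 6 can be automated. The sketch below is one possible way to do it, assuming the seven classes of width 0.5 starting at the lower class boundary 1.45 from STEP 4; the variable names are illustrative.

```python
data = [2.2, 4.1, 3.5, 4.5, 2.0, 3.4, 1.6, 3.1, 3.3, 3.8, 2.5, 4.3, 3.4,
        3.6, 2.9, 3.3, 3.1, 3.7, 4.4, 3.2, 4.9, 3.8, 3.2, 2.6, 3.9]

lower_boundary = 1.45  # lower class boundary of the bottom class
width = 0.5            # class width chosen in STEP 3
num_classes = 7

frequencies = [0] * num_classes
for x in data:
    # Index of the class whose boundaries enclose x.
    index = int((x - lower_boundary) / width)
    frequencies[index] += 1

for i, f in enumerate(frequencies):
    low = lower_boundary + i * width
    print(f"{low:.2f} - {low + width:.2f}: {f}")
```

Because the boundaries carry one more decimal place than the data, no observation can fall exactly on a boundary, so every value belongs to exactly one class.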
Graphical representations
How do we Graphically summarise data?
- We can summarise data in numerical and graphical forms
- A summary of data in numerical form is referred to as the statistics of the data
  - Mean, median, mode
  - Range, variance, standard deviation
- Before blindly going for statistical analysis, it is always good to look at the raw data
  - usually in graphical form
  - Helps in summarising the data into an easily interpretable format
- The types of graphical display most frequently used by biomedical engineers include
  - Dot plot
  - Time series
  - Histograms
  - Stem-and-Leaf
  - Boxplots
Time Series
- A time series is used to plot the changes in a variable as a function of time
- The variable is usually a physiological measure, such as electrical activation in the brain or hormone concentration in the blood stream, that changes with time
Histogram
- The histogram is a graphical representation of the frequency distribution
- On the x-axis, we have the sample value
- On the y-axis, we have the number of occurrences of samples
  - frequency
[Figure: histogram with frequency of occurrence on the y-axis; each bar is labelled with its class mark and lower class limit on the x-axis]

Class Interval    Class Boundary    Class mark    Frequency
1.5 – 1.9         1.45 – 1.95       1.7           1
2.0 – 2.4         1.95 – 2.45       2.2           2
2.5 – 2.9         2.45 – 2.95       2.7           3
3.0 – 3.4         2.95 – 3.45       3.2           8
3.5 – 3.9         3.45 – 3.95       3.7           6
4.0 – 4.4         3.95 – 4.45       4.2           3
4.5 – 4.9         4.45 – 4.95       4.7           2
Shapes for histograms
- A histogram can have a wide range of shapes
- If the histogram has a single peak then it is called a unimodal histogram
- A bimodal histogram has two distinct peaks
- A histogram is said to be symmetric if the left half is a mirror image of the right half
- A unimodal histogram is positively skewed if the right or upper tail is stretched out compared with the left or lower tail
- A unimodal histogram is negatively skewed if the left or lower tail is stretched out compared with the right or upper tail
Stem and Leaf plot
- Stem-and-Leaf plot is a graphical method for showing the frequency with which certain classes of values occur
- One can use a frequency distribution table or a histogram for the values or one can use a stem-and-leaf plot
- Frequency distribution and the histogram do not show the exact value of the elements of a sample or population
- Example: Consider the following data,
  {12, 13, 21, 27, 33, 34, 35, 37, 40, 40, 41}
Step 1: Draw a table where the first column is the stem and the second column is the leaf
Step 2: Select and list one or more leading digits for the stem values. The trailing digits become the leaves
Step 3: Record the leaf for each observation beside the corresponding stem value
Step 4: Indicate units for the stem and the leaves

stem    leaf
1       2 3
2       1 7
3       3 4 5 7
4       0 0 1
Key: stem = tens
leaf = units
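The four steps can be sketched in Python for two-digit data like the example above. This is a minimal illustration that assumes non-negative integers with the tens digit as the stem and the units digit as the leaf; the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group sorted values by stem (tens) and collect their leaves (units)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # tens digit, units digit
        plot[stem].append(leaf)
    return dict(plot)


data = [12, 13, 21, 27, 33, 34, 35, 37, 40, 40, 41]
for stem, leaves in stem_and_leaf(data).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Unlike a frequency table or histogram, the printed rows keep every original value recoverable from its stem and leaf.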
Interpreting Stemplots
- Example: Determine the range and the median for the data provided in the stem-and-leaf plot

stem    leaf
1       2 3
2       1 7
3       3 4 5 7
4       0 0 1

- Example 2: Consider the following data
  2.2, 4.1, 3.5, 4.5, 2, 3.4, 1.6, 3.1, 3.3, 3.8, 2.5, 4.3, 3.4, 3.6, 2.9, 3.3, 3.1, 3.7, 4.4, 3.2, 4.9, 3.8, 3.2, 2.6, 3.9
  - Draw the dotplot graph
  - Draw the frequency distribution table
  - Draw the histogram
  - Draw the stem-and-leaf plot
Quantiles, Boxplots
Introduction
- We have so far studied measures of central location and variation
  - Mean, median, mode
  - Range, variance, standard deviation
- Apart from these statistics there are several other measures of location that describe or locate the position of certain non-central pieces of data, relative to the data set
- These measures are referred to as fractiles or quantiles
- These are values below which a specific fraction or percentage of the observations in a given set must fall
  - Percentage of elements of a data set with a value less than some pre-defined value
Percentile
- Percentiles are values that divide a set of observations into 100 equal parts
- These values, denoted by P1, P2, …, P99, are such that 1% of the data falls below P1, 2% of the data falls below P2, …, and 99% of the data falls below P99
- Recall that the median separates the lower 50% of the values from the higher 50% of the values in a data set
- Percentiles divide the data set into 100 parts
  - There are 99 percentiles
- The 70th percentile means that 70% of the values lie below the value at P70 while 30% of the values lie above the value at P70
- Percentage and Percentile?
Calculating the kth Percentile
Step 1: Arrange the data in ascending order
Step 2: Compute the locator, L, using
$$L = \left(\frac{k}{100}\right) n$$
where n = number of values in the data set and k = percentile of the data
Step 3:
If L is an integer, the kth percentile, Pk, can be found by Pk = (Lth value + next value)/2
If L is not an integer, then we need to round it up to the next largest integer. The value of Pk is then the Lth value counting from the lowest
Example: Consider the data set
{1, 2, 2, 4, 4, 8, 9, 9, 9, 10}
Let us calculate P85.
Since we have 10 elements in the data set, we seek to find the value below which L = (85/100)*10 = 8.5 observations fall
L is not an integer, so we round it up to 9
Thus P85 is the 9th value: P85 = 9
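The three steps can be sketched as a function. This follows the locator rule stated on the slides (other textbooks and libraries use different interpolation rules); the function name `percentile` is an assumption, and k is taken to be a whole number so that the integer test on L is exact.

```python
def percentile(values, k):
    """kth percentile using the locator L = (k/100) * n described above (k an integer, 0 < k < 100)."""
    if not 0 < k < 100:
        raise ValueError("k must be between 1 and 99")
    ordered = sorted(values)              # Step 1: ascending order
    product = k * len(ordered)            # Step 2: L = product / 100
    if product % 100 == 0:                # Step 3a: L is an integer
        L = product // 100
        return (ordered[L - 1] + ordered[L]) / 2
    L = product // 100 + 1                # Step 3b: round L up
    return ordered[L - 1]


data = [1, 2, 2, 4, 4, 8, 9, 9, 9, 10]
print(percentile(data, 85))  # 9
print(percentile(data, 50))  # 6.0
```

For k = 50 the locator is exactly 5, so the result is the average of the 5th and 6th values, (4 + 8)/2 = 6, which agrees with the median.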
Quartiles
- Divide the data set into four equal parts,
  - Q1, Q2, Q3
- Quartiles can be related to percentiles as
  - Q1 = P25, Q2 = P50, Q3 = P75
Interquartile
- The interquartile range (IQR) is the difference between the 75th percentile and the 25th percentile scores in a distribution
Deciles
- Divide the data set into ten equal parts
  - D1, D2, …, D9
- Deciles can be related to percentiles as
  - D1 = P10, D2 = P20, …, D9 = P90

2.2, 4.1, 3.5, 4.5, 2, 3.4, 1.6, 3.1, 3.3, 3.8, 2.5, 4.3, 3.4, 3.6, 2.9, 3.3, 3.1, 3.7, 4.4, 3.2, 4.9, 3.8, 3.2, 2.6, 3.9
Boxplot
- Also known as a Box and Whisker plot
- A graphic representation of the distribution of scores on a variable that includes the range, the median and the inter-quartile range
- Provides information about 5 statistics of the data
  - Minimum value, lower quartile (Q1), median (Q2), upper quartile (Q3) and maximum value
[Figure: boxplot with Value on the y-axis and experiment number/name on the x-axis, marking the minimum value, the 25th, 50th and 75th percentiles and the maximum value]
Boxplot
- The thin line in the middle of the box represents the median of the distribution of scores
- The top line of the box represents the 75th percentile of the distribution
- The bottom line represents the 25th percentile
- In other words, 50% of the scores on this variable in this distribution are contained within the upper and lower lines of this box
Drawing the boxplot
Step 1: Find the minimum and the maximum value available in the data set
Step 2: Calculate P25, P50 (i.e. the median), P75
Step 3: Draw a box from P25 to P75
Step 4: Split the box with a line at the median
Step 5: Draw a line (whisker) from P75 up to the maximum value
Step 6: Draw another line from P25 down to the minimum value
[Figure: boxplot with Value on the y-axis and experiment number/name on the x-axis, with minimum value 1.6 and maximum value 4.9, marking the 25th, 50th and 75th percentiles]
- Plot the box plot for the following data set
  - Marks = {10, 20, 40, 50, 70, 75, 80, 80, 85, 90, 100}
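Before drawing, the five values needed for the box plot (Steps 1 and 2) can be computed. The sketch below reuses the locator rule for percentiles given earlier in these slides; the function names `percentile` and `five_number_summary` are assumptions, and plotting libraries may use a different quartile rule and so give slightly different boxes.

```python
def percentile(ordered, k):
    """kth percentile of already sorted data, using the locator L = (k/100) * n."""
    product = k * len(ordered)
    if product % 100 == 0:           # L is an integer: average the Lth and next value
        L = product // 100
        return (ordered[L - 1] + ordered[L]) / 2
    return ordered[product // 100]   # L rounded up, converted to a 0-based index


def five_number_summary(values):
    """Minimum, Q1 (P25), median (P50), Q3 (P75) and maximum."""
    ordered = sorted(values)
    return (ordered[0], percentile(ordered, 25), percentile(ordered, 50),
            percentile(ordered, 75), ordered[-1])


marks = [10, 20, 40, 50, 70, 75, 80, 80, 85, 90, 100]
minimum, q1, q2, q3, maximum = five_number_summary(marks)
print(minimum, q1, q2, q3, maximum)  # 10 40 75 85 100
print(q3 - q1)                       # interquartile range: 45
```

The box then runs from 40 to 85 with a line at 75, and the whiskers extend down to 10 and up to 100.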