ENG-7: Biostatistics
Introduction
Statistics: collect, organize, classify and summarize DATA in
order to perform analyses and interpretation, produce forecasts and
predictions, or make decisions and define policies.
Inferential statistics consists of the tools that are used to make
inferences about a population. For this we often use probabilistic
models, which are mathematical models to deal with uncertainty.
Data are collected to explore a population, the target of our
study. The collected data form a sample.
A random sample is a sample from a population taken in such a
way that every sample of size n has an equal probability of
selection.

Examples:
1. Who will win the next election?
2. Producing the best possible gasoline by changing the levels of a
given set of components.
3. Determining which of two toothpastes is more effective than the
other in preventing tooth decay.
4. Forecasting the amount of rainfall for the next winter.
5. Monitoring patients with heart disease and deciding which
factors affect their health.
6. Optimizing the performance of a computer by tuning the
parameters of the OS.
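
To make the definition of a random sample concrete, here is a minimal Python sketch. The five-unit population and its labels are hypothetical, chosen only for illustration. It draws a sample of size n = 2 without replacement and then checks empirically that each of the 10 possible samples of that size turns up about equally often.

```python
import random
from collections import Counter
from itertools import combinations

# Hypothetical population of N = 5 labelled units (not taken from the notes).
population = ["a", "b", "c", "d", "e"]
n = 2  # sample size

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# random.sample draws n distinct units without replacement,
# so every subset of size n has the same probability of selection.
sample = rng.sample(population, n)

# Empirical check: repeat the draw many times and count each subset.
draws = 20_000
counts = Counter(frozenset(rng.sample(population, n)) for _ in range(draws))

# Relative frequency of each of the C(5, 2) = 10 possible samples.
frequencies = {
    tuple(sorted(subset)): counts[frozenset(subset)] / draws
    for subset in combinations(population, n)
}

for subset, freq in sorted(frequencies.items()):
    print(subset, round(freq, 3))  # each value should be close to 1/10
```

The fixed seed is only there to make the run repeatable; in practice the sample would be drawn once from the real population.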
Types of data
• Quantitative data. Correspond to observations measured on
a numerical scale.
– Continuous variables. Can assume any value in some
interval of real numbers. They are usually related to
measurements of physical quantities.
– Discrete variables. Can assume only a countable number
of values. They are usually related to counts.
• Qualitative data. Correspond to observations classified in
groups or categories.
• Ranked (ordinal) data. Observations can be classified
within categories that have a natural ordering.

How data are collected
• Observational studies. Collect data with little or no control
over possible affecting factors.
• Designed experiments. Data are collected by means of an
experiment where most important factors are subject to
control.
• Survey samples. Data are collected from a finite population
carefully considering its structure.
Population parameters
The descriptive characteristics of a population can be summarized
by population parameters. These are numerical measures
that are typical of each population. Usual examples are the central
tendency of the population or its dispersion. Parameters are
usually unknown quantities.
Parameters are estimated using values from a sample. Sample
values are used to obtain a statistic. In fact, any function of a
sample is called a statistic. The observed value of a statistic
depends on how the sample is obtained, thus statistics are random
variables.

The mean
The population mean is defined as the sum of the values of the
variable under study divided by the number of objects in the
population. It is usually denoted by the letter µ.
If a population has five elements,
$X_1 = 0$, $X_2 = -1$, $X_3 = 5$, $X_4 = 2.4$, $X_5 = -0.7$, then the mean is
$$\mu = \frac{0 - 1 + 5 + 2.4 - 0.7}{5} = 1.14$$
In general we write
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$
To estimate µ we take a sample of size n, $Y_1, \ldots, Y_n$, and take its
average
$$\bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n}$$
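
The arithmetic above can be checked with a short Python sketch. It computes µ for the five-element population, then computes the sample mean for every possible sample of size n = 3. The sample size is an assumption made here for illustration. The spread of the resulting values shows why a statistic is a random variable: its observed value depends on which sample is drawn.

```python
from itertools import combinations

# The five-element population from the text.
population = [0, -1, 5, 2.4, -0.7]
N = len(population)

# Population mean: sum of the values divided by the population size.
mu = sum(population) / N
print(round(mu, 2))  # 1.14

# Sample mean for every possible sample of size n (n = 3 is an assumption).
n = 3
sample_means = [sum(sample) / n for sample in combinations(population, n)]

# Different samples give different estimates of mu.
print(round(min(sample_means), 3), round(max(sample_means), 3))
```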
The median
As a measure of the central tendency of a population, the mean can
be seriously affected by very large or very small observations. The
mean salary of the employees of a company may not give a good
idea of the income of the typical worker, since a small
number of very high salaries will pull the average up.
A more robust measure of central tendency is the median. It is
defined as follows:
1. Order the elements of the population.
2. Define the depth of Xi as its position relative to the nearest extreme.
3. When N is an odd number, the population median is the
observation of maximum depth.
4. If N is an even number, the median is the average of the two
observations with maximum depth.

Consider the circumferences at chest height (CCH) of 15 maple
trees. Then d = (15 + 1)/2 = 8 is the maximum depth, and the median
is the observation at that depth, 38.
CCH     18   21   22   29   29   36   37   38
Depth    1    2    3    4    5    6    7    8
CCH     56   59   66   70   88   93  120
Depth    7    6    5    4    3    2    1

For a population of 12 cypress trees we take the average of the two
observations of depth 6, (56 + 68)/2 = 62.
CCH     17   19   31   39   48   56
Depth    1    2    3    4    5    6
CCH     68   73   73   75   80  122
Depth    6    5    4    3    2    1
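
The depth-based definition of the median can be sketched in Python as follows. The function names `depths` and `median_by_depth` are illustrative, not part of any library. The sketch is run on the 12 cypress measurements (even N) and on the five-element population used for the mean (odd N).

```python
import statistics


def depths(N):
    """Depth of each position 1..N in an ordered list:
    the distance to the nearest extreme, counting the extreme as depth 1."""
    return [min(i, N + 1 - i) for i in range(1, N + 1)]


def median_by_depth(values):
    ordered = sorted(values)                 # step 1: order the elements
    d = depths(len(ordered))                 # step 2: depth of each element
    max_depth = max(d)
    deepest = [x for x, di in zip(ordered, d) if di == max_depth]
    # Steps 3 and 4: one deepest value if N is odd, two if N is even;
    # averaging covers both cases.
    return sum(deepest) / len(deepest)


cypress = [17, 19, 31, 39, 48, 56, 68, 73, 73, 75, 80, 122]
five_elements = [0, -1, 5, 2.4, -0.7]

print(depths(len(cypress)))            # [1, 2, 3, 4, 5, 6, 6, 5, 4, 3, 2, 1]
print(median_by_depth(cypress))        # 62.0
print(median_by_depth(five_elements))  # 0.0

# Cross-check against the standard library.
print(statistics.median(cypress))      # 62.0
```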
The range
Consider the two samples of weights (in kg) of albacore tuna in the table.
Sample 1    8.9   9.6  11.2   9.4   9.9  10.9  10.4  11.0   9.7
Sample 2    3.1  17.0   9.9   5.1  18.0   3.8  10.0   2.9  21.2
They have the same mean (10.11 kg) and the same median (9.9 kg),
but they are not identical since the observations are scattered in
different ways.
One measure of the dispersion of a population is given by the
range. The range is defined as the difference between the
maximum and the minimum. If we denote the ordered sample by
$Y_{(i)}$ then the sample range is
$$Y_{(n)} - Y_{(1)}$$
Sample 1 has a range of 2.3 kg; for sample 2 the range is 18.3 kg.

The variance
As a measure of dispersion we can consider the difference between
each observation and the mean. Let $z_i = Y_i - \bar{Y}$. These differences
are called deviates. Deviates have the following property:
$$\sum_{i=1}^{n} z_i = \sum_{i=1}^{n} (Y_i - \bar{Y}) = \sum_{i=1}^{n} Y_i - \sum_{i=1}^{n} \bar{Y} = \sum_{i=1}^{n} Y_i - n\bar{Y} = 0$$
A positive measure of dispersion can be obtained by considering
the squares of the deviates. The sample variance
$$s^2 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n-1} = \frac{\sum_{i=1}^{n} Y_i^2 - \frac{\left(\sum_{i=1}^{n} Y_i\right)^2}{n}}{n-1}$$
is an estimator of the population variance
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N},$$
the average squared deviation of the observations from the mean.
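
As a quick numerical check, the following Python sketch recomputes the range of both tuna samples and verifies that the deviates of a sample sum to zero, up to floating-point rounding. The helper names `sample_range` and `deviates` are illustrative only.

```python
# Weights (kg) of the two albacore tuna samples from the text.
sample1 = [8.9, 9.6, 11.2, 9.4, 9.9, 10.9, 10.4, 11.0, 9.7]
sample2 = [3.1, 17.0, 9.9, 5.1, 18.0, 3.8, 10.0, 2.9, 21.2]


def sample_range(y):
    """Largest observation minus smallest observation."""
    return max(y) - min(y)


def deviates(y):
    """Differences between each observation and the sample mean."""
    y_bar = sum(y) / len(y)
    return [yi - y_bar for yi in y]


print(round(sample_range(sample1), 1))  # 2.3
print(round(sample_range(sample2), 1))  # 18.3

# The deviates sum to zero, apart from floating-point rounding error.
print(abs(sum(deviates(sample2))) < 1e-9)  # True
```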
For the second sample of the previous table we have that
$$\sum Y_i = 91, \quad \sum Y_i^2 = 1{,}318.92, \quad n = 9$$
so
$$s^2 = \frac{1{,}318.92 - \frac{(91)^2}{9}}{9 - 1} = 49.851 \ \text{kg}^2$$
Notice that the units of dispersion in this example are kg² and not
kg.
To obtain a measure of dispersion in the scale of the original data
we compute the standard deviation as the square root of the
variance.
So, for our example, s = 7.06 kg.
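
The worked example can be reproduced with a few lines of Python. This sketch computes the sample variance of the second tuna sample in two ways: from the squared deviates and from the sums of Y and Y². It then takes the square root to get the standard deviation. The standard library function `statistics.variance`, which also divides by n − 1, is used only as a cross-check.

```python
import math
import statistics

# Second sample of albacore tuna weights (kg) from the text.
sample2 = [3.1, 17.0, 9.9, 5.1, 18.0, 3.8, 10.0, 2.9, 21.2]
n = len(sample2)

sum_y = sum(sample2)                   # about 91
sum_y2 = sum(y * y for y in sample2)   # about 1318.92
y_bar = sum_y / n

# Sample variance from the squared deviates.
s2_definition = sum((y - y_bar) ** 2 for y in sample2) / (n - 1)

# Sample variance from the sums, as in the worked example.
s2_shortcut = (sum_y2 - sum_y ** 2 / n) / (n - 1)

# Standard deviation: back on the original scale (kg).
s = math.sqrt(s2_shortcut)

print(round(s2_definition, 3), round(s2_shortcut, 3))  # 49.851 49.851
print(round(s, 2))                                     # 7.06
print(round(statistics.variance(sample2), 3))          # 49.851
```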