Statistics Introduction

Statistics
Introduction
1.) All measurements contain random error

results always have some uncertainty
2.) Uncertainties are used to determine whether two or more experimental
results are equivalent or different

Statistics is used to accomplish this task
Is the mutant (transgenic) mouse
significantly fatter than the normal
(wild-type) mouse?
Statistical Methods Provide Unbiased Means to Answer Such Questions.
Masuzaki, H., et al., Science (2001), 294(5549), 2166
Statistics
Gaussian Curve
1.) For a series of experimental results with only random error:
(i)
A large number of experiments done under identical conditions will yield a
distribution of results.
(ii)
Distribution of results is described by a Gaussian or Normal Error Curve
[Figure: Gaussian curve of Number of Occurrences versus Value, with a high population about the correct value and a low population far from the correct value.]
2.) Any set of data (and corresponding Gaussian curve) can be
characterized by two parameters:
(i)
Mean or Average Value ($\bar{x}$)
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
where:
n = number of data points
$x_i$ = value of data point number i
$\sum_{i=1}^{n} x_i = value_1 + value_2 + value_3 + \dots + value_n$
(ii)
Standard Deviation (s)
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
The smaller the standard deviation, the more precise the measurement.
3.) Other Terms Used to Describe a Data Set
(i)
Variance: Related to the standard deviation

Used to describe how “wide” or precise a distribution of results is
variance = $s^2$
where: s = standard deviation
(ii)
Range: difference between the highest and lowest values in a set of data

Example: measurements of 4 light bulb lifetimes
821, 783, 834, 855
High Value = 855 hours
Low Value = 783 hours
Range = High Value – Low Value
= 855 – 783
= 72 hours
(iii) Median: The value in a set of data which has an equal number of data
values above it and below it

For an odd number of data points, the median is the middle
value

For an even number of data points, the median is the value halfway
between the two middle values

Example:
Data Set: 1.19, 1.23, 1.25, 1.45, 1.51
mean ($\bar{x}$) = 1.33
median = 1.25 (the middle value)
Data Set: 1.19, 1.23, 1.25, 1.45
mean ($\bar{x}$) = 1.28
median = 1.24 (halfway between 1.23 and 1.25)
(iii) Example:
For the following bowling scores 116.0, 97.9, 114.2, 106.8 and 108.3, find
the mean, median, range and standard deviation.
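One way to check an answer to this exercise is with Python's standard `statistics` module. The sketch below is only an illustration; the variable names are arbitrary, and `statistics.stdev` is used because it divides by n - 1, matching the definition of s given earlier.

```python
import statistics

# Bowling scores from the example.
scores = [116.0, 97.9, 114.2, 106.8, 108.3]

mean = statistics.mean(scores)            # sum of values / n
median = statistics.median(scores)        # middle value of the sorted data
data_range = max(scores) - min(scores)    # high value - low value
s = statistics.stdev(scores)              # sample standard deviation (n - 1)
variance = statistics.variance(scores)    # s squared

print(f"mean = {mean:.1f}")
print(f"median = {median:.1f}")
print(f"range = {data_range:.1f}")
print(f"standard deviation = {s:.1f}")
print(f"variance = {variance:.1f}")
```

Rounded to one decimal place, the mean and standard deviation come out as 108.6 and 7.1.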
4.) Relating Terms Back to the Gaussian Curve
(i) Formula for a Gaussian curve
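$$y = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
where e = base of natural logarithm (2.71828…)
$\mu \approx \bar{x}$ (mean)
$\sigma \approx s$ (standard deviation)
[Figure: Gaussian curve centered on the mean, with ± one standard deviation marked. The entire area under the curve is normalized to one.]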
Statistics
Standard Deviation and Probability
1.) By knowing the standard deviation (s) and the mean ($\bar{x}$) of a set of
results (and the corresponding Gaussian curve):
(i)
The probability of the next result falling in any given range can be
calculated from:
$$z = \frac{x - \bar{x}}{s}$$
(ii)
The probability of a result falling in that portion of the Gaussian curve is
equal to the normalized area of the curve in that portion.
(iii) Example:
The probability of measuring a value in a certain range is equal to the
area of that range.

Range (in standard deviations, s)    Probability
±1s                                  68.3%
±2s                                  95.5%
±3s                                  99.7%
±4s                                  99.99%

68.3% of the area of a Gaussian curve occurs between
the values $\bar{x} - 1s$ and $\bar{x} + 1s$ ($\bar{x} \pm 1s$).
Thus, any new result has a 68.3% chance of falling
within this range.
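These areas can be reproduced numerically. The sketch below assumes an ideal Gaussian curve and uses the error function `math.erf` from Python's standard library; the helper name `fraction_within` is made up for this illustration.

```python
import math

def fraction_within(k: float) -> float:
    """Fraction of the area of a Gaussian curve within ±k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3, 4):
    print(f"±{k}s: {100 * fraction_within(k):.2f}%")
```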
x x
z
s
- Area under curve from mean value and result.
- Total ½ area is 0.5.
- Remaining area is 0.5 – Area.
- Example:
z = 1.3area from mean to 1.3 is 0.403
 area from infinity to 1.3 is 0.5 – 0.403 = 0.097
(iii) Example:
A bowler has a mean score of 108.6 and a standard deviation of 7.1.
What fraction of the bowler’s scores will be less than 80.2?
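A possible way to work this example is to convert 80.2 to a z value and take the area of the Gaussian curve below it. The sketch below assumes the scores follow an ideal Gaussian distribution; `normal_cdf` is a helper written for this illustration using `math.erf`, standing in for a printed z table.

```python
import math

def normal_cdf(z: float) -> float:
    """Area under the standard Gaussian curve from -infinity up to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

mean, s = 108.6, 7.1          # values given in the example
z = (80.2 - mean) / s         # 80.2 lies about 4 standard deviations below the mean
fraction_below = normal_cdf(z)

print(f"z = {z:.2f}")
print(f"fraction of scores below 80.2 = {fraction_below:.2e}")
```

Under this assumption only about 3 scores in 100,000 would be expected to fall below 80.2.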
2.) Knowing the standard deviation (s) of a data set indicates the
precision of a measurement
(i)
Common intervals used for expressing analytical results are shown
below:

Range                Percent of Results Expected in Range
$\bar{x} \pm 1s$     68.3% (31.7% outside)
$\bar{x} \pm 2s$     95.5% (4.5% outside)
$\bar{x} \pm 3s$     99.7% (0.3% outside)

(ii)
The precision of many analytical measurements is expressed as:
$\bar{x} \pm 2s$

There is only a ~5% chance (1 out of 20) that any given
measurement on the sample will be outside of this range
4.) The precision of a mean (average) result is expressed using a
confidence interval
(i)
The relationship between the true mean value ($\mu$) and the measured mean
($\bar{x}$) is given by:
$$\mu = \bar{x} \pm \frac{ts}{\sqrt{n}}$$
Confidence interval: $\bar{x} \pm \frac{ts}{\sqrt{n}}$
where:
s = standard deviation
n = number of measurements
t = Student’s t value
degrees of freedom = (n-1)
Note: As n increases, the confidence interval becomes smaller
($\mu$ becomes more precisely known)
(ii)
Student’s t

A statistical tool frequently used to express confidence intervals.
The degrees of freedom used to look up t come from the number of
measurements (n-1).
It is a probability distribution that addresses the problem of
estimating the mean of a normally distributed population
when the sample size is small.
The population standard deviation ($\sigma$) is unknown and has to be
estimated from the data using s.
(iii) The meaning of Confidence Interval

To determine the “true” mean, one would need to collect an infinite number of
data points.
- obviously not possible
A confidence interval tells us the probability that a range of
numbers contains the “true” mean.
50% confidence interval → range of numbers contains the true mean only 50% of the time
90% confidence interval → range of numbers contains the true mean 90% of the time

[Figure: confidence intervals from repeated data sets compared with the “true” mean. At 50% confidence, 50% of the data sets do not contain the true mean.]
(iii) Example:
For the following bowling scores 116.0, 97.9, 114.2, 106.8 and 108.3, a bowler
has a mean score of 108.6 and a standard deviation of 7.1.
What is the 90% confidence interval for the mean?
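A minimal sketch of this calculation follows. The Student's t value is an assumption: 2.132 is the two-sided 90% value for 4 degrees of freedom found in standard t tables, and it does not appear in the text above. The mean and standard deviation are recomputed from the raw scores, so results may differ slightly from a hand calculation that starts from the rounded 108.6 and 7.1.

```python
import math
import statistics

scores = [116.0, 97.9, 114.2, 106.8, 108.3]
n = len(scores)
mean = statistics.mean(scores)
s = statistics.stdev(scores)   # sample standard deviation (n - 1 in the denominator)

# Assumption: Student's t for 90% confidence and n - 1 = 4 degrees of freedom,
# copied from a standard t table.
T_90_DF4 = 2.132

half_width = T_90_DF4 * s / math.sqrt(n)   # the t*s/sqrt(n) term
ci_low, ci_high = mean - half_width, mean + half_width

print(f"90% confidence interval: {mean:.1f} ± {half_width:.1f}")
print(f"i.e. from {ci_low:.1f} to {ci_high:.1f}")
```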
5.) Comparison of Two Data Sets
(i)
To determine if two results obtained by the same method are
statistically the same, use the following formula to determine a
calculated t:
$$t_{calculated} = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}} \sqrt{\frac{n_1 n_2}{n_1 + n_2}}$$
where:
$\bar{x}_1, \bar{x}_2$ = mean results of samples 1 & 2
$n_1, n_2$ = number of measurements of samples 1 & 2
$s_{pooled}$ = “pooled” standard deviation
$$s_{pooled} = \sqrt{\frac{\sum_{\text{Set 1}} (x_i - \bar{x}_1)^2 + \sum_{\text{Set 2}} (x_j - \bar{x}_2)^2}{n_1 + n_2 - 2}}$$
This requires that the standard deviations of the two data sets be similar.
(ii)
Compare calculated t to the corresponding value in the Student’s t
probability table.

Use the desired %confidence level at the appropriate Degrees of
freedom

Degrees of Freedom = (n1 + n2 -2)
(For the two results to be considered the same, the calculated t needs to
be less than the table value.)
(iii) If calculated t is greater than the value in the Student’s t probability
table, then the two results are significantly different at the given %
confidence level.

Easier to achieve for lower %confidence level
(iv) Example:
The amount of 14CO2 in a plant sample is measured to be: 28, 32, 27, 39 & 40 counts/min
(mean = 33.2). The amount of radioactivity in a blank is found to be: 28, 21, 28, & 20 counts/min
(mean = 24.2). Are the mean values significantly different at a 95% confidence level?
$$s_{pooled} = \sqrt{\frac{\sum_{\text{Set 1}} (x_i - \bar{x}_1)^2 + \sum_{\text{Set 2}} (x_j - \bar{x}_2)^2}{n_1 + n_2 - 2}}$$
$$s_{pooled} = \sqrt{\frac{(28 - 33.2)^2 + (32 - 33.2)^2 + (27 - 33.2)^2 + (39 - 33.2)^2 + (40 - 33.2)^2 + (28 - 24.2)^2 + (21 - 24.2)^2 + (28 - 24.2)^2 + (20 - 24.2)^2}{5 + 4 - 2}}$$
$$s_{pooled} = 5.4$$
$$t_{calculated} = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}} \sqrt{\frac{n_1 n_2}{n_1 + n_2}} = \frac{33.2 - 24.2}{5.4} \sqrt{\frac{(5)(4)}{5 + 4}} = 2.48$$
Degrees of Freedom = (5 + 4 – 2) = 7
From Student’s t probability table, at 7 degrees of freedom and the 95% confidence level, t = 2.365.
Calculated t (2.48) > 2.365
The results are significantly different at a 95% confidence level,
but not at 98% or higher confidence levels
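The same comparison can be sketched in Python. The function name `pooled_t` and the use of `abs()` on the difference of means are choices made for this illustration; the table value 2.365 is the one quoted in the example. Because the code keeps the unrounded blank mean (24.25 rather than 24.2), it gives t ≈ 2.47 instead of the hand-calculated 2.48, which does not change the conclusion.

```python
import math
import statistics

plant = [28, 32, 27, 39, 40]   # counts/min, plant sample
blank = [28, 21, 28, 20]       # counts/min, blank

def pooled_t(set1, set2):
    """Return (s_pooled, t_calculated) for two sets of measurements."""
    n1, n2 = len(set1), len(set2)
    mean1, mean2 = statistics.mean(set1), statistics.mean(set2)
    # Sum of squared deviations of each set about its own mean.
    ss1 = sum((x - mean1) ** 2 for x in set1)
    ss2 = sum((x - mean2) ** 2 for x in set2)
    s_pooled = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))
    # abs() is an illustrative choice so the order of the sets does not matter.
    t = abs(mean1 - mean2) / s_pooled * math.sqrt(n1 * n2 / (n1 + n2))
    return s_pooled, t

s_pooled, t_calc = pooled_t(plant, blank)

T_TABLE_95_DF7 = 2.365   # Student's t, 95% confidence, 7 degrees of freedom (from the example)
significantly_different = t_calc > T_TABLE_95_DF7

print(f"s_pooled = {s_pooled:.2f}, t calculated = {t_calc:.2f}")
print("significantly different at 95%:", significantly_different)
```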
6.) Comparison of Two Methods
(i)
To determine if the results of two methods for the same sample are the
same, use the following formula to determine a calculated t:
$$t_{calculated} = \frac{\bar{d}}{s_d} \sqrt{n}$$
where:
$\bar{d}$ = difference in the mean values of the two methods
n = number of samples analyzed by each method
$s_d = \sqrt{\frac{\sum (d_i - \bar{d})^2}{n - 1}}$
(ii) Degrees of Freedom = (n - 1)
(iii) If calculated t is greater than the value in the Student’s t probability
table, then the two methods are significantly different at the given %
confidence level.
(iv) Example:
Two methods for measuring cholesterol in blood provide the following results:
Cholesterol content (g/L)

Plasma sample    Method A    Method B    Difference ($d_i$)
1                1.46        1.42         0.04
2                2.22        2.38        -0.16
3                2.84        2.67         0.17
4                1.97        1.80         0.17
5                1.13        1.09         0.04
6                2.35        2.25         0.10

$\bar{d}$ = +0.060
Are these methods significantly different at the 95% confidence level?
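$$s_d = \sqrt{\frac{\sum (d_i - \bar{d})^2}{n - 1}}$$
$$s_d = \sqrt{\frac{(0.04 - 0.060)^2 + (-0.16 - 0.060)^2 + (0.17 - 0.060)^2 + (0.17 - 0.060)^2 + (0.04 - 0.060)^2 + (0.10 - 0.060)^2}{6 - 1}}$$
$$s_d = 0.122$$
$$t_{calculated} = \frac{\bar{d}}{s_d} \sqrt{n} = \frac{0.060}{0.122} \sqrt{6} = 1.20$$
From Student’s t probability table, at 5 degrees of freedom (6 - 1 = 5) and the 95% confidence level, t = 2.571.
Calculated t (1.20) ≤ 2.571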
The results are not significantly different at a 95% confidence level.
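As a cross-check, the paired comparison can be sketched in Python from the raw cholesterol values. The variable names and the use of `abs()` on the mean difference are illustrative choices; the table value 2.571 is the one quoted in the example.

```python
import math
import statistics

# Cholesterol content (g/L) for the six plasma samples.
method_a = [1.46, 2.22, 2.84, 1.97, 1.13, 2.35]
method_b = [1.42, 2.38, 2.67, 1.80, 1.09, 2.25]

# Difference d_i for each sample (Method A - Method B).
differences = [a - b for a, b in zip(method_a, method_b)]
n = len(differences)

d_mean = statistics.mean(differences)   # mean difference
s_d = statistics.stdev(differences)     # standard deviation of the differences (n - 1)
t_calc = abs(d_mean) / s_d * math.sqrt(n)

T_TABLE_95_DF5 = 2.571   # Student's t, 95% confidence, 5 degrees of freedom (from the example)
significantly_different = t_calc > T_TABLE_95_DF5

print(f"mean difference = {d_mean:+.3f}, s_d = {s_d:.3f}, t calculated = {t_calc:.2f}")
print("significantly different at 95%:", significantly_different)
```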
Statistics
Dealing with Bad Data
1.) Q Test
(i)
Method used to decide whether or not to reject a “bad” data point.
(ii)
Procedure:
1. Arrange data in order of increasing value.
2. Determine the lowest and highest values and the total range of
values.
Example:
12.47, 12.48, 12.53, 12.56, 12.67 (questionable point)
Range = 12.67 – 12.47 = 0.20
3. Determine the difference (gap) between the “bad” data point and the
nearest value.
Gap = 12.67 – 12.56 = 0.11
- Calculate the “Q value”:
$$Q = \frac{\text{Gap}}{\text{Range}} = \frac{0.11}{0.20} = 0.55$$
4. Compare the calculated Q value to those in Tables at the same
value of n and the desired %confidence level.
- n: total number of values or observations
Values of Q for rejection of data

Number of Observations    Q (90% confidence)
3                         0.94
4                         0.76
5                         0.64
6                         0.56
7                         0.51
8                         0.47
9                         0.44
10                        0.41
- For example, at n = 5 and 90% confidence, the value of Q is 0.64
- Since:
Q (calculated) ≤ Q (table)
0.55 ≤ 0.64
- data point 12.67 cannot be rejected at the 90% confidence level
(iii) Although the Q-test is valuable in eliminating bad data, common sense
and repeating experiments with questionable results are usually more
helpful.
Example:
For the following bowling scores 116.0, 97.9, 114.2, 106.8 and 108.3, a
bowler has a mean score of 108.6 and a standard deviation of 7.1.
Using the Q test, decide whether the number 97.9 should be discarded.
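The Q test procedure can be sketched as a small Python function. The function name `q_test` and its return format are made up for this illustration; the critical values are the Q (90% confidence) entries from the table of values of Q for rejection of data.

```python
# Q (90% confidence) keyed by number of observations, from the Q table.
Q_TABLE_90 = {3: 0.94, 4: 0.76, 5: 0.64, 6: 0.56, 7: 0.51, 8: 0.47, 9: 0.44, 10: 0.41}

def q_test(values, suspect):
    """Return (Q calculated, Q table, reject?) for a suspect lowest or highest value."""
    ordered = sorted(values)                      # step 1: arrange in increasing order
    if suspect not in (ordered[0], ordered[-1]):
        raise ValueError("the suspect point must be the lowest or highest value")
    data_range = ordered[-1] - ordered[0]         # step 2: total range of values
    if suspect == ordered[0]:                     # step 3: gap to the nearest value
        gap = ordered[1] - ordered[0]
    else:
        gap = ordered[-1] - ordered[-2]
    q_calc = gap / data_range
    q_table = Q_TABLE_90[len(ordered)]            # step 4: compare with the table value
    return q_calc, q_table, q_calc > q_table

scores = [116.0, 97.9, 114.2, 106.8, 108.3]
q_calc, q_table, reject = q_test(scores, 97.9)
print(f"Q calculated = {q_calc:.2f}, Q table = {q_table}, reject 97.9: {reject}")
```

For the bowling scores the gap is 106.8 – 97.9 = 8.9 and the range is 18.1, so Q is about 0.49, which is below 0.64; on this basis 97.9 would be kept at the 90% confidence level.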