Handling Data and Figures of Merit
Data comes in different formats:
time plots
histograms
lists
But all of them can contain the same information about quality.
What is meant by quality? (Figures of merit)
Precision, separation (selectivity), limits of detection, linear range
My weight
day  weight (lbs)    day  weight (lbs)    day  weight (lbs)
1    140             31   143.9           61   144
2    140.1           32   144             62   144.2
3    139.8           33   142.5           63   144.5
4    140.6           34   142.9           64   144.2
5    140             35   142.8           65   143.9
6    139.8           36   143.9           66   144.2
7    139.6           37   144             67   144.5
8    140             38   144.8           68   144.3
9    140.8           39   143.9           69   144.2
10   139.7           40   144.5           70   144.9
11   140.2           41   143.9           71   144
12   141.7           42   144             72   143.8
13   141.9           43   144.2           73   144
14   141.4           44   143.8           74   143.8
15   142.3           45   143.5           75   144
16   142.3           46   143.8           76   144.5
17   141.9           47   143.2           77   143.7
18   142.1           48   143.5           78   143.9
19   142.5           49   143.6           79   144
20   142.3           50   143.4           80   144.2
21   142.1           51   143.9           81   144
22   142.5           52   143.6           82   144.4
23   143.5           53   144             83   143.8
24   143             54   143.8           84   144.1
25   143.2           55   143.6
26   143             56   143.8
27   143.4           57   144
28   143.5           58   144.2
29   142.7           59   144
30   143.7           60   143.9
Plot the data as a function of the time it was acquired:
[Figure: weight (lbs) vs. day, days 0 to 60, shown beside the data table]
Comments:
The background is white (less ink).
The font size is larger than the Excel default (use 14 or 16).
Do not use curved lines to connect data points: that assumes you know more about the relationship of the data than you really do.
Assume my weight is a single, random set of similar data.
Make a frequency chart (histogram) of the data.
A bin is the range of weights that are grouped together. It is like a grade curve that lists the number of students who scored between 95 and 100 points: 95-100 would be a bin.
[Figure: weight (lbs) vs. day, beside a histogram of # of observations vs. weight (lbs)]
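The binning step can be sketched in a few lines of Python. This is only an illustration: the 1 lb bin width and the function name `histogram` are assumptions, not choices stated on the slide, and the weights are copied from the "My weight" table.

```python
import math
from collections import Counter

# Daily weights (lbs) for days 1-84, copied from the "My weight" table.
weights = [
    140, 140.1, 139.8, 140.6, 140, 139.8, 139.6, 140, 140.8, 139.7,
    140.2, 141.7, 141.9, 141.4, 142.3, 142.3, 141.9, 142.1, 142.5, 142.3,
    142.1, 142.5, 143.5, 143, 143.2, 143, 143.4, 143.5, 142.7, 143.7,
    143.9, 144, 142.5, 142.9, 142.8, 143.9, 144, 144.8, 143.9, 144.5,
    143.9, 144, 144.2, 143.8, 143.5, 143.8, 143.2, 143.5, 143.6, 143.4,
    143.9, 143.6, 144, 143.8, 143.6, 143.8, 144, 144.2, 144, 143.9,
    144, 144.2, 144.5, 144.2, 143.9, 144.2, 144.5, 144.3, 144.2, 144.9,
    144, 143.8, 144, 143.8, 144, 144.5, 143.7, 143.9, 144, 144.2,
    144, 144.4, 143.8, 144.1,
]


def histogram(values, bin_width=1.0):
    """Count how many values fall in each bin [left, left + bin_width)."""
    counts = Counter(math.floor(v / bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Text version of the frequency chart: one '#' per observation in the bin.
for left_edge, count in histogram(weights).items():
    print(f"{left_edge:5.0f}-{left_edge + 1:<5.0f} {'#' * count}")
```

Changing `bin_width` changes how coarse the chart looks, which is the same choice you make when you set the bins in Excel.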
Create a “model” of my weight and determine the average weight and how consistent my weight is.
[Figure: histogram of # of observations vs. weight (lbs) with a normal curve; the average and the inflection point are marked]
average = 143.11 lbs
s = 1.4 lbs
s = standard deviation = a measure of the consistency, or similarity, of the weights
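As a check on the two numbers above, here is a minimal Python sketch that computes the average and the sample standard deviation from the tabulated weights. It assumes the sample (n - 1) form of the standard deviation, which is what `statistics.stdev` computes.

```python
import statistics

# Daily weights (lbs) for days 1-84, copied from the "My weight" table.
weights = [
    140, 140.1, 139.8, 140.6, 140, 139.8, 139.6, 140, 140.8, 139.7,
    140.2, 141.7, 141.9, 141.4, 142.3, 142.3, 141.9, 142.1, 142.5, 142.3,
    142.1, 142.5, 143.5, 143, 143.2, 143, 143.4, 143.5, 142.7, 143.7,
    143.9, 144, 142.5, 142.9, 142.8, 143.9, 144, 144.8, 143.9, 144.5,
    143.9, 144, 144.2, 143.8, 143.5, 143.8, 143.2, 143.5, 143.6, 143.4,
    143.9, 143.6, 144, 143.8, 143.6, 143.8, 144, 144.2, 144, 143.9,
    144, 144.2, 144.5, 144.2, 143.9, 144.2, 144.5, 144.3, 144.2, 144.9,
    144, 143.8, 144, 143.8, 144, 144.5, 143.7, 143.9, 144, 144.2,
    144, 144.4, 143.8, 144.1,
]

average = statistics.mean(weights)
s = statistics.stdev(weights)  # sample standard deviation (divides by n - 1)

print(f"average = {average:.2f} lbs, s = {s:.1f} lbs")
```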
Characteristics of the Model Population (Random, Normal)
$$f(x) = \frac{A}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$$
Related concepts:
Peak height, A
Peak location (mean or average), μ
Peak width, W, at baseline
Peak width at half height, W1/2
Standard deviation, s, estimates the variation in an infinite population, σ
[Figure: normal curve, amplitude vs. x in units of σ from -5 to 5, with W1/2 marked]
Width is measured at the inflection point = σ
Triangulated peak: base width is 2σ < W < 4σ
[Figure: normal curve, amplitude vs. x in units of σ from -5 to 5, with the areas under the curve marked]
Area within +/- 1σ = 68.3%
Area within +/- 2σ = 95.4%
Area within +/- 3σ = 99.74%
pp = peak to peak, the largest separation of measurements
pp ~ 6σ
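The areas quoted for the normal curve can be reproduced from the error function. The sketch below is a check, not part of the original slides; `area_within` is a name chosen here for illustration. The computed values agree with the slide's figures to within rounding in the last digit.

```python
import math


def area_within(k):
    """Fraction of a normal population lying within +/- k standard deviations."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"+/- {k} sigma: {100 * area_within(k):.2f} %")

# Peak width at half height (W1/2) of a normal curve, in units of sigma.
half_height_width = 2 * math.sqrt(2 * math.log(2))
print(f"W1/2 = {half_height_width:.3f} sigma")
```

Because nearly all of the population lies within +/- 3σ, the peak-to-peak spread of a reasonably sized data set is about 6σ, which is the basis of the pp ~ 6σ rule.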
Peak to peak is sometimes easier to “see” on the data vs. time plot.
pp ~ 6s (calculated s = 1.4)
[Figure: weight (lbs) vs. day with the peak-to-peak range marked from 139.5 to 144.9, beside the histogram of # of observations vs. weight (lbs)]
s ~ pp/6 = (144.9 - 139.5)/6 ~ 0.9
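The peak-to-peak shortcut is a one-line calculation. In this sketch the function name is an assumption made for illustration, and the two readings are the ones taken off the plot above.

```python
def stdev_from_peak_to_peak(largest, smallest):
    """Rough estimate of s from the peak-to-peak spread, using pp ~ 6s."""
    return (largest - smallest) / 6


s_estimate = stdev_from_peak_to_peak(144.9, 139.5)
print(f"s ~ {s_estimate:.1f} lbs")
```

The estimate (about 0.9 lbs) is smaller than the calculated s = 1.4 lbs, so treat pp/6 as a quick visual check rather than a replacement for the calculation.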
There are some other important characteristics of a normal (random) population.
[Figure: amplitude vs. x in units of σ from -5 to 5, showing the population curve with its 1st and 2nd derivatives]
Scale up the first derivative and second derivative to see them better.
[Figure: the population (0th derivative), 1st derivative, and 2nd derivative on an expanded amplitude scale]
1st derivative: the peak is at the inflection point of the population curve, which determines the std. dev.
2nd derivative: the peak is at the inflection point of the first derivative. It should be symmetrical for a normal population, and it goes to zero at the std. dev.
Asymmetry can be determined from principal component analysis.
A. F. (≠ Alanah Fitch) = asymmetry factor
Comparing TWO populations of measurements
[Figure: weight (lbs) vs. day, with regions labeled Baseline, Vacation, and School Begins]
Is there a difference between my “baseline” weight and school weight?
Can you “detect” a difference? Can you “quantitate” a difference?
The exact same information can be displayed differently, but now we divide the data into different measurement populations.
[Figure: histogram of # of observations vs. weight (lbs), with the baseline and school populations labeled]
Model of the data as two normal populations
[Figure: weight (lbs) vs. day beside the histogram modeled as two normal populations; the average baseline weight, the average school weight, and the standard deviation of each are marked]
[Figure: two histograms of # of observations vs. weight (lbs), one for each model]
We have two models to describe the population of measurements of my weight.
In one we assume that all measurements fall into a single population.
In the second we assume that the measurements have sampled two different populations.
Which is the better model?
How do we quantify “better”?
Compare how closely the measured data fit the model.
The red bars represent the difference between the two-population model and the data.
The purple lines represent the difference between the single-population model and the data.
Which model has the smaller summed differences?
Did I gain weight?
Normally we sum the squares of the differences so that both positive and negative differences count.
This process (summing the squares of the differences) is essentially what occurs in an ANOVA (analysis of variance).
In the bad old days you had to work out all the sums of squares.
In the good new days you can ask Excel to do it for you.
Anova: Single Factor (5% significance level)

SUMMARY
Groups    Count  Sum     Average  Variance
Column 1  12     277.41  23.1175  8.70360227
Column 2  12     345.72  28.81    6.50010909

ANOVA
Source of Variation  SS        df  MS        F           P-value   F crit
Between Groups       194.4273  1   194.4273  25.5762995  4.59E-05  4.300949
Within Groups        167.2408  22  7.601856
Total                361.6682  23

Test: is F < F critical?
If true, the hypothesis holds: the data can be explained by a single population.
If false, the hypothesis is rejected: the data cannot be explained by a single population at the 5% significance level.
Here F = 25.58 is larger than F crit = 4.30, so a single population does not explain these data.
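For readers who want to see what Excel is doing, the sums of squares in the ANOVA table can be rebuilt from the SUMMARY rows alone (count, average, variance). This is a sketch under that assumption; `one_way_anova` is a name invented here, and F crit is simply copied from the Excel output rather than computed.

```python
def one_way_anova(groups):
    """Single-factor ANOVA from summary statistics.

    groups: list of (count, average, variance) tuples, one per group.
    Returns (SS between groups, SS within groups, F).
    """
    n_total = sum(n for n, _, _ in groups)
    grand_mean = sum(n * mean for n, mean, _ in groups) / n_total
    ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups)
    ss_within = sum((n - 1) * var for n, _, var in groups)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, f_value


# Count, average, and variance of Column 1 and Column 2 from the SUMMARY table.
ss_between, ss_within, f_value = one_way_anova(
    [(12, 23.1175, 8.70360227), (12, 28.81, 6.50010909)]
)
f_crit = 4.300949  # copied from the Excel output, not computed here

print(f"SS between = {ss_between:.4f}, SS within = {ss_within:.4f}")
print(f"F = {f_value:.4f}; single population rejected: {f_value > f_crit}")
```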
[Figure: two panels of frequency vs. length (cm). Legend entries: Red, N=12, sum sq diff=0.11, stdev=3.27; White, N=12, sum sq diff=0.037, stdev=2.55; Red, N=40, sum sq diff=0.017, stdev=2.67; White, N=38, sum sq diff=0.028, stdev=2.15; N=24, sum sq diff=0.0449, stdev=3.96; N=78, sum sq diff=0.108, stdev=4.05]
In an analysis of variance you test the hypothesis that the sample is best described as a single population.
1. Create the expected frequency (Gaussian from the normal error curve)
2. Measure the deviation between the histogram point and the expected frequency
3. Square to remove signs
4. SS = sum of squares
5. Compare to the expected SS, which scales with population size
6. If larger than expected, the deviations cannot be explained by assuming a single population
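Steps 1 to 4 above can be sketched as follows. Everything named here is an assumption for illustration: the function names, the equal-width bins, the use of the bin midpoint for the expected frequency, and the ten example lengths, which are made up and are not the class data. Step 5 (comparing with the expected SS) is not implemented.

```python
import math
import statistics


def gaussian_pdf(x, mean, s):
    """Normal error curve with unit area."""
    return math.exp(-0.5 * ((x - mean) / s) ** 2) / (s * math.sqrt(2 * math.pi))


def sum_sq_diff(values, bin_edges):
    """Sum of squared differences between measured and expected bin frequencies."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    width = bin_edges[1] - bin_edges[0]  # assumes equally spaced bin edges
    total = 0.0
    for left in bin_edges[:-1]:
        # Step 2: measured frequency = fraction of the values in this bin.
        measured = sum(left <= v < left + width for v in values) / len(values)
        # Step 1: expected frequency from the Gaussian at the bin midpoint.
        expected = gaussian_pdf(left + width / 2, mean, s) * width
        # Steps 3 and 4: square the deviation and add it to the sum.
        total += (measured - expected) ** 2
    return total


# Hypothetical lengths (cm), for illustration only.
lengths = [21.0, 22.5, 23.0, 23.5, 24.0, 24.5, 25.0, 26.0, 27.5, 29.0]
ss = sum_sq_diff(lengths, [15, 20, 25, 30, 35])
print(f"sum sq diff = {ss:.4f}")
```

Running the same function on each sub-population and on the combined data gives the kind of "sum sq diff" numbers shown in the figure legends.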
[Figure: frequency vs. length (cm), showing the expected and measured frequencies and their squared differences]
The squared differences for the assumption of a single population are larger than for the assumption of two individual populations.
There are other measurements which describe the two populations.
Resolution of two peaks:
$$R = \frac{x_a - x_b}{\frac{W_a}{2} + \frac{W_b}{2}}$$
x_a, x_b = mean or average of each peak
W_a, W_b = baseline width of each peak
[Figure: signal vs. x for two peaks, with x_a, x_b, W_a/2, and W_b/2 marked]
In this example R > 1: $x_a - x_b > \frac{W_a}{2} + \frac{W_b}{2}$
Peaks are baseline resolved when R > 1
[Figure: signal vs. x for two peaks, with x_a, x_b, W_a/2, and W_b/2 marked]
In this example R = 1: $x_a - x_b = \frac{W_a}{2} + \frac{W_b}{2}$
Peaks are just baseline resolved when R = 1
[Figure: signal vs. x for two peaks, with x_a, x_b, W_a/2, and W_b/2 marked]
In this example R < 1: $x_a - x_b < \frac{W_a}{2} + \frac{W_b}{2}$
Peaks are not baseline resolved when R < 1
[Figure: 2008 Data, frequency vs. length (cm). White, N=12, sum sq diff=0.037; Red, N=12, sum sq diff=0.11]
What is the R for this data?
[Figure: two panels of % measured vs. verbal IQ, titled “Comparison of 1978 Low Lead to 1978 High Lead” and “Comparison of 1978 Low Lead to 1979 High Lead”; one panel is annotated “Visually less resolved” and the other “Visually better resolved”]
Anonymous 2009 student analysis of Needleman data:
$$\frac{W_a}{2} \approx 112 - 70 = 42$$
$$\frac{W_b}{2} \approx 130 - 95 = 35$$
$$x_a - x_b \approx 112 - 95 = 17$$
$$R = \frac{x_a - x_b}{\frac{W_a}{2} + \frac{W_b}{2}} \approx \frac{17}{42 + 35} \approx 0.22$$
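The student calculation above is easy to repeat in code. In this sketch the function name is an assumption, and taking the absolute value of the difference in means is a choice made here so that the result does not depend on which peak is called "a".

```python
def resolution(x_a, x_b, half_width_a, half_width_b):
    """R = (x_a - x_b) / (W_a/2 + W_b/2), using baseline half-widths."""
    return abs(x_a - x_b) / (half_width_a + half_width_b)


# Values read off the Needleman histograms in the student analysis.
r = resolution(x_a=112, x_b=95, half_width_a=112 - 70, half_width_b=130 - 95)
print(f"R ~ {r:.2f}")
print("baseline resolved" if r > 1 else "not baseline resolved")
```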
Other measures of the quality of separation of the peaks
1. Limit of detection
2. Limit of quantification
3. Signal to noise (S/N)
Limit of detection (LOD):
$$x_{LOD} = x_{blank} + 3 s_{blank}$$
[Figure: two normal curves, amplitude vs. x in units of σ: the blank, centered at x_blank, and the first detectable signal, centered 3σ higher at x_LOD]
99.74% of the observations of the blank lie within +/- 3σ of the blank mean, so nearly all of them lie below the mean of the first detectable signal (LOD).
Two peaks are visible when all the data is summed together.
[Figure: the summed curve, amplitude vs. x in units of σ, with the 3σ separation marked]
[Figure: weight (lbs) vs. day beside the histogram of # of observations vs. weight (lbs)]
Estimate the LOD (signal) of this data.
Limit of quantification (LOQ):
$$x_{LOQ} = x_{blank} + 9 s_{blank}$$
Your book suggests 10.
[Figure: two normal curves, amplitude vs. x in units of σ, with the 9σ separation marked]
Limit of quantification requires absolute certainty that no blank is part of the …
[Figure: weight (lbs) vs. day beside the histogram of # of observations vs. weight (lbs)]
Estimate the LOQ (signal) of this data.
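The LOD and LOQ signals are both "blank mean plus k standard deviations of the blank", so one small sketch covers them. The function names and the example blank readings are hypothetical; only the multipliers 3 and 9 (or 10, as the book suggests) come from the slides.

```python
def limit_of_detection(blank_mean, blank_s, k=3):
    """Signal at the limit of detection: x_blank + 3 * s_blank."""
    return blank_mean + k * blank_s


def limit_of_quantification(blank_mean, blank_s, k=9):
    """Signal at the limit of quantification: x_blank + 9 * s_blank.

    Pass k=10 to follow the book's suggestion instead.
    """
    return blank_mean + k * blank_s


# Hypothetical blank readings, for illustration only.
blank_mean, blank_s = 10.0, 0.5
lod = limit_of_detection(blank_mean, blank_s)
loq = limit_of_quantification(blank_mean, blank_s)
print(f"LOD signal = {lod}, LOQ signal = {loq}")
```

For the weight exercise, the baseline period plays the role of the blank.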
Signal to noise (S/N):
Signal = x_sample - x_blank
Noise = N = standard deviation, s
$$\frac{S}{N} = \frac{x_{sample} - x_{blank}}{s} \approx \frac{x_{sample} - x_{blank}}{pp/6}$$
(This assumes pp school ~ pp baseline)
[Figure: weight (lbs) vs. day, with regions labeled Baseline, Vacation, and School Begins, beside the histogram of # of observations vs. weight (lbs). The signal is marked up to the mean school weight; the peak-to-peak variation within the school data is ~ 6s, where s = N for noise]
Estimate the S/N of this data.
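Both forms of the S/N expression can be written as short functions. The names and the example numbers below are hypothetical and are not taken from the weight data.

```python
def signal_to_noise(sample_mean, blank_mean, noise_s):
    """S/N = (x_sample - x_blank) / s."""
    return (sample_mean - blank_mean) / noise_s


def signal_to_noise_from_pp(sample_mean, blank_mean, peak_to_peak):
    """Same ratio, with the noise estimated as s ~ pp / 6."""
    return (sample_mean - blank_mean) / (peak_to_peak / 6)


# Hypothetical readings, for illustration only.
sn_exact = signal_to_noise(sample_mean=14.0, blank_mean=10.0, noise_s=0.5)
sn_quick = signal_to_noise_from_pp(sample_mean=14.0, blank_mean=10.0, peak_to_peak=3.0)
print(sn_exact, sn_quick)
```

The two results agree only when the peak-to-peak spread really is about 6s, which is the assumption stated on the slide.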
[Figure: length (cm) vs. sample number, samples 0 to 30]
Can you “tell” where the switch between red and white potatoes begins?
What is the signal (length of white)?
What is the background (length of red)?
What is the S/N?
Effect of sample size on the measurement error curve
Peak height grows with the # of measurements.
+/- 1 s always contains the same proportion of the total number of measurements.
However, the actual value of s decreases as the number of measurements grows:
$$s_{sample} = \frac{s_{population}}{\sqrt{n_{sample}}}$$
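The square-root dependence is easy to tabulate. This sketch reads s_sample as the standard deviation of the sample mean (often called the standard error), which is an interpretation rather than something the slide states; the function name and the sample sizes are chosen here for illustration, and 1.4 lbs is the s calculated earlier for the weights.

```python
import math


def s_of_mean(s_population, n_sample):
    """s_sample = s_population / sqrt(n_sample)."""
    return s_population / math.sqrt(n_sample)


for n in (1, 4, 16, 64):
    print(f"n = {n:2d}: s_sample = {s_of_mean(1.4, n):.3f} lbs")
```

Quadrupling the number of measurements only halves s_sample, so each further improvement costs more data than the last.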
[Figure: 2008 Data, running average of red length (cm) and running stdev of red length vs. sample number, samples 0 to 14]
[Figure: stdev of red length (cm) vs. sqrt(number of samples), with trendline y = -0.8807x + 5.9303, R² = 0.9491]
[Figure: frequency vs. length (cm). Red, N=12, sum sq diff=0.11, stdev=3.27; White, N=12, sum sq diff=0.037, stdev=2.55; Red, N=40, sum sq diff=0.017, stdev=2.67; White, N=38, sum sq diff=0.028, stdev=2.15]
Calibration Curve
A calibration curve is based on a selected measurement that is linear in response to the concentration of the analyte:
$$y = a + bx$$
Or… a prediction of a measurement due to some change.
Can we predict my weight change if I had spent a longer time on vacation?
$$\text{fitch lbs} = a + b\,(\text{days on vacation})$$
[Figure: histogram of # of observations vs. weight (lbs), with a span of 5 days marked]
The calibration curve contains information about the sampling of the population.
[Figure: Fitch weight (lbs) vs. days on vacation, days 0 to 6, with trendline y = 0.3542x + 140.04, R² = 0.7425]
You can get this by using “trend line”. This is just a trendline from “format” data.
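What the Excel trendline computes is an ordinary least-squares line. Here is a minimal sketch of that calculation; `fit_line` is a name invented for this example, and the five (x, y) points are made up for illustration, not the vacation data.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x. Returns (a, b, r_squared)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx            # slope (sensitivity)
    a = y_bar - b * x_bar    # intercept
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return a, b, 1 - ss_res / ss_tot


# Hypothetical calibration points, for illustration only.
xs = [0, 1, 2, 3, 4]
ys = [1.0, 3.1, 4.9, 7.2, 8.8]
a, b, r_squared = fit_line(xs, ys)
print(f"y = {b:.4f}x + {a:.4f}, R^2 = {r_squared:.4f}")
```

The trendline gives only a, b, and R²; the regression tool in Excel also reports the errors associated with the slope and intercept.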
[Figure: stdev of red length (cm) vs. sqrt(number of samples), with trendline y = -0.8807x + 5.9303, R² = 0.9491]
Sample  sqrt(#samples)  stdev
1       1               #DIV/0!
2       1.414213562     2.036468
3       1.732050808     4.475727
4       2               4.31441
5       2.236067977     3.844045
6       2.449489743     3.844604
7       2.645751311     3.735124
8       2.828427125     3.458414
9       3               3.235055
10      3.16227766      3.093053
11      3.31662479      2.935944
12      3.464101615     2.950187
SUMMARY OUTPUT (using the analysis data pack)

Regression Statistics
Multiple R         0.296113395
R Square           0.087683143
Adjusted R Square  -0.013685397
Standard Error     0.703143388
Observations       11

ANOVA
            df  SS           MS        F         Significance F
Regression  1   0.427662048  0.427662  0.864994  0.376617
Residual    9   4.449695616  0.494411
Total       10  4.877357664

              Coefficients  Standard Error  t Stat    P-value   Lower 95%
Intercept     3.884015711   0.514960076     7.542363  3.53E-05  2.719094
X Variable 1  -0.06235252   0.067042092     -0.93005  0.376617  -0.21401

You get an error associated with the intercept.
In the best of all worlds you should have a series of blanks that determine the “noise” associated with the background:
$$x_{LOD} = x_{blank} + 3 s_{blank}$$
Sometimes you forget, so to fall back and punt, estimate the standard deviation of the “blank” from the linear regression. The extrapolation of the associated error can be obtained from the linear regression data.
But remember, in doing this you are acknowledging a failure to plan ahead in your analysis.
The signal LOD is related to the concentration LOD through the sensitivity (slope), b:
$$x_{LOD} = x_{blank} + b\,[\text{conc. LOD}]$$
Setting the two expressions for the signal LOD equal:
$$x_{blank} + 3 s_{blank} = x_{blank} + b\,[\text{conc. LOD}]$$
$$[\text{conc. LOD}] = \frac{3 s_{blank}}{b}$$
The concentration LOD depends on BOTH the stdev of the blank and the sensitivity.
!!Note!! Signal LOD ≠ Conc. LOD. We want Conc. LOD.
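The last step of the derivation is a one-line function. In this sketch the function name and the example numbers are hypothetical, and the absolute value of the slope is used here so that a negative calibration slope still gives a positive concentration.

```python
def concentration_lod(blank_s, slope, k=3):
    """[conc. LOD] = 3 * s_blank / b, where b is the calibration slope."""
    return k * blank_s / abs(slope)


# Hypothetical values: s_blank = 0.5 signal units, b = 2.0 signal units per concentration unit.
conc_lod = concentration_lod(blank_s=0.5, slope=2.0)
print(f"conc. LOD = {conc_lod}")
```

A noisier blank raises the concentration LOD and a steeper slope lowers it, which is why both numbers have to be reported.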
Selectivity
[Figure: mV vs. pH or pM, from 0 to 12, for the series labeled Pb2+ and H+, with trendlines y = -31.143x - 74.333 (R² = 0.9994) and y = -41x - 118.5 (R² = 0.9872); the limit of linearity is marked at 5% deviation]
Difference in slope is one measure of selectivity.
In a perfect method the sensing device would have zero slope for the interfering species.
Summary: Figures of Merit Thus Far
R = resolution
S/N
LOD = both signal and concentration
LOQ
LOL (limit of linearity)
These can be expressed in terms of signal, but the better expression is in terms of concentration.
Sensitivity (calibration curve slope)
Selectivity (essentially difference in slopes)
Tests: ANOVA
Why is the limit of detection important?
Why has the limit of detection changed so much in the last 20 years?
The End
[Figure: two histograms of % of measurements vs. verbal IQ (Needleman’s data)]
Which of these two data sets would be likely to have the better numerical value for the ability to distinguish between two different populations?
Height for a normalized bell curve is < 1.
[Figure: 2008 Data, frequency vs. length (cm). White, N=12, sum sq diff=0.037; Red, N=12, sum sq diff=0.11]
Which population is more variable?
How can you tell?
[Figure: frequency vs. length (cm). Red, N=12, sum sq diff=0.11, stdev=3.27; White, N=12, sum sq diff=0.037, stdev=2.55; Red, N=40, sum sq diff=0.017, stdev=2.67; White, N=38, sum sq diff=0.028, stdev=2.15]
Increasing the sample size decreases the std dev and increases the separation of the populations. Notice that the means also change, and will do so until we have a reasonable sample of the population.
[Figures: two histograms of % of measurements vs. verbal IQ]