Handling Data and Figures of Merit

Data comes in different formats:
time
histograms
lists
But all can contain the same information about quality.

What is meant by quality? (figures of merit)
Precision, separation (selectivity), limits of detection, linear range

My weight (lbs), by day
Days 1-10: 140, 140.1, 139.8, 140.6, 140, 139.8, 139.6, 140, 140.8, 139.7
Days 11-20: 140.2, 141.7, 141.9, 141.4, 142.3, 142.3, 141.9, 142.1, 142.5, 142.3
Days 21-30: 142.1, 142.5, 143.5, 143, 143.2, 143, 143.4, 143.5, 142.7, 143.7
Days 31-40: 143.9, 144, 142.5, 142.9, 142.8, 143.9, 144, 144.8, 143.9, 144.5
Days 41-50: 143.9, 144, 144.2, 143.8, 143.5, 143.8, 143.2, 143.5, 143.6, 143.4
Days 51-60: 143.9, 143.6, 144, 143.8, 143.6, 143.8, 144, 144.2, 144, 143.9
Days 61-70: 144, 144.2, 144.5, 144.2, 143.9, 144.2, 144.5, 144.3, 144.2, 144.9
Days 71-80: 144, 143.8, 144, 143.8, 144, 144.5, 143.7, 143.9, 144, 144.2
Days 81-84: 144, 144.4, 143.8, 144.1

Plot as a function of the time the data was acquired:
[Figure: weight (lbs) vs. day]
Comments: background is white (less ink); font size is larger than the Excel default (use 14 or 16).
Do not use curved lines to connect data points: that assumes you know more about the relationship of the data than you really do.
A bin is the range of weights that are grouped together. It is like a grade curve that lists the number of students who scored between 95 and 100 points: 95-100 would be a bin.

Assume my weight is a single, random set of similar data.
Make a frequency chart (histogram) of the data.
[Figure: histogram, # of observations vs. weight (lbs); inset: weight (lbs) vs. day]

Create a “model” of my weight and determine the average weight and how consistent my weight is.
[Figure: histogram with fitted normal curve; average = 143.11; inflection point marks s = 1.4 lbs]
s = standard deviation = measure of the consistency, or similarity, of weights

Characteristics of the Model Population (Random, Normal)
$f(x) = \frac{A}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$
Peak height, A
Peak location (mean or average), μ
Peak width, W, at baseline
Peak width at half height, W1/2
Standard deviation, s, estimates the variation in an infinite population, σ

Related concepts
[Figure: normal curve, amplitude vs. s]
Width is measured at the inflection point = s
W1/2 = width at half height
Triangulated peak: base width is 2s < W < 4s

pp = peak to peak, or the largest separation of measurements
[Figure: normal curve, amplitude vs. s]
Area within ±1s = 68.3%
Area within ±2s = 95.4%
Area within ±3s = 99.74%
pp ≈ 6s

Peak to peak is sometimes easier to “see” on the data vs. time plot.
pp ≈ 6s (calculated s = 1.4)
[Figure: weight (lbs) vs. day with the peak-to-peak range marked from 139.5 to 144.9; inset: histogram of weight]
s ≈ pp/6 = (144.9 - 139.5)/6 ≈ 0.9

There are some other important characteristics of a normal (random) population.
[Figure: normal curve with its 1st and 2nd derivatives, amplitude vs. s]
Scale up the first derivative and second derivative to see better.
[Figure: population (0th derivative), 1st derivative and 2nd derivative, amplitude vs. s]
2nd derivative: peak is at the inflection of the first derivative; should be symmetrical for a normal population; goes to zero at the std. dev.
1st derivative: peak is at the inflection; determines the std. dev.
Asymmetry can be determined from principal component analysis.
A.F. (≠ Alanah Fitch) = asymmetry factor

Comparing TWO populations of measurements
[Figure: weight (lbs) vs. day, with regions labeled Baseline, Vacation, School Begins]
Is there a difference between my “baseline” weight and school weight?
Can you “detect” a difference?
Can you “quantitate” a difference?

Exactly the same information displayed differently, but now we divide the data into different measurement populations.
[Figure: histogram, # of observations vs. weight (lbs), with school and baseline populations labeled]

Model of the data as two normal populations
[Figure: histogram with two normal curves, marking the average baseline weight, the average school weight, the standard deviation of the baseline weight and the standard deviation of the school weight; inset: weight (lbs) vs. day]

We have two models to describe the population of measurements of my weight. In one we assume that all measurements fall into a single population. In the second we assume that the measurements have sampled two different populations.
Which is the better model? How do we quantify “better”?
[Figures: histograms, # of observations vs. weight (lbs), with the single-population model and the two-population model]

Compare how closely the measured data fit each model.
The red bars represent the difference between the two-population model and the data.
The purple lines represent the difference between the single-population model and the data.
Which model has the smaller summed differences?
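The model comparison above can be tried numerically. The following is a minimal sketch, not the method used to make the slides: it bins the 84 weights into 1 lb bins from 138 to 147 lbs (an assumed bin choice), builds expected counts from a normal curve, and sums the squared differences for each model. The split between baseline and school populations is an assumption (`BASELINE_DAYS = 10`); the slides label the regions on the plot but do not give the day on which each begins.

```python
import math
from statistics import NormalDist, mean, stdev

# Daily weights (lbs), days 1-84, from the "My weight" table.
weights = [
    140.0, 140.1, 139.8, 140.6, 140.0, 139.8, 139.6, 140.0, 140.8, 139.7,
    140.2, 141.7, 141.9, 141.4, 142.3, 142.3, 141.9, 142.1, 142.5, 142.3,
    142.1, 142.5, 143.5, 143.0, 143.2, 143.0, 143.4, 143.5, 142.7, 143.7,
    143.9, 144.0, 142.5, 142.9, 142.8, 143.9, 144.0, 144.8, 143.9, 144.5,
    143.9, 144.0, 144.2, 143.8, 143.5, 143.8, 143.2, 143.5, 143.6, 143.4,
    143.9, 143.6, 144.0, 143.8, 143.6, 143.8, 144.0, 144.2, 144.0, 143.9,
    144.0, 144.2, 144.5, 144.2, 143.9, 144.2, 144.5, 144.3, 144.2, 144.9,
    144.0, 143.8, 144.0, 143.8, 144.0, 144.5, 143.7, 143.9, 144.0, 144.2,
    144.0, 144.4, 143.8, 144.1,
]

BASELINE_DAYS = 10   # ASSUMPTION: the first 10 days are the baseline population
LO, HI = 138, 147    # ASSUMPTION: 1 lb bins covering the histogram axis


def observed_counts(values):
    """Count the observations falling in each 1 lb bin [b, b + 1)."""
    counts = [0] * (HI - LO)
    for v in values:
        counts[int(math.floor(v)) - LO] += 1
    return counts


def expected_counts(values):
    """Expected count per bin if the values form one normal population."""
    dist = NormalDist(mean(values), stdev(values))
    n = len(values)
    return [n * (dist.cdf(b + 1) - dist.cdf(b)) for b in range(LO, HI)]


def sum_sq_diff(observed, expected):
    """Sum of squared differences between histogram and model."""
    return sum((o - e) ** 2 for o, e in zip(observed, expected))


observed = observed_counts(weights)

# Model 1: every measurement belongs to a single normal population.
ss_one = sum_sq_diff(observed, expected_counts(weights))

# Model 2: two normal populations (baseline and school), added bin by bin.
baseline, school = weights[:BASELINE_DAYS], weights[BASELINE_DAYS:]
two_pop = [b + s for b, s in zip(expected_counts(baseline), expected_counts(school))]
ss_two = sum_sq_diff(observed, two_pop)

print(f"single population: sum sq diff = {ss_one:.1f}")
print(f"two populations:   sum sq diff = {ss_two:.1f}")
```

With this assumed split the two-population sum should come out smaller, which is the sense in which that model is “better”. Changing `BASELINE_DAYS` or the bin width changes both sums, so the numbers are illustrative only.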
Did I gain weight?

Normally we sum the squares of the differences in order to account for both positive and negative differences.
This process (summing the squares of the differences) is essentially what occurs in an ANOVA (analysis of variance).
In the bad old days you had to work out all the sums of squares. In the good new days you can ask Excel to do it for you.

Anova: Single Factor (5% certainty)

SUMMARY
Groups      Count   Sum      Average   Variance
Column 1    12      277.41   23.1175   8.70360227
Column 2    12      345.72   28.81     6.50010909

ANOVA
Source of Variation   SS         df   MS         F            P-value    F crit
Between Groups        194.4273   1    194.4273   25.5762995   4.59E-05   4.300949
Within Groups         167.2408   22   7.601856
Total                 361.6682   23

Test: is F < F crit?
If true, the hypothesis is true: single population.
If false, the hypothesis is false: the data cannot be explained by a single population at the 5% certainty level.
Here F = 25.58 is larger than F crit = 4.30, so the test is false.

[Figures: frequency vs. length (cm), for small and large samples]
Red, N=12, Sum sq diff=0.11, stdev=3.27
White, N=12, Sum sq diff=0.037, stdev=2.55
Combined, N=24, Sum sq diff=0.0449, stdev=3.96
Red, N=40, Sum sq diff=0.017, stdev=2.67
White, N=38, Sum sq diff=0.028, stdev=2.15
Combined, N=78, Sum sq diff=0.108, stdev=4.05

In an analysis of variance you test the hypothesis that the sample is best described as a single population.
1. Create the expected frequency (Gaussian from the normal error curve)
2. Measure the deviation between each histogram point and the expected frequency
3. Square to remove signs
4. SS = sum of squares
5. Compare to the expected SS, which scales with population size
6. If larger than expected, the deviations cannot be explained by assuming a single population

[Figure: expected frequency, measured frequency and square difference vs. length (cm)]
The square differences for an assumption of a single population are larger than for the assumption of two individual populations.

There are other measurements which describe the two populations.

Resolution of two peaks
$R = \frac{\bar{x}_a - \bar{x}_b}{\frac{W_a}{2} + \frac{W_b}{2}}$
where $\bar{x}$ = mean or average and W = baseline width.

[Figure: signal vs. x, two well-separated peaks]
In this example R > 1: $\bar{x}_a - \bar{x}_b > \frac{W_a}{2} + \frac{W_b}{2}$
Peaks are baseline resolved when R > 1.

[Figure: signal vs. x, two peaks touching at the baseline]
In this example R = 1: $\bar{x}_a - \bar{x}_b = \frac{W_a}{2} + \frac{W_b}{2}$
Peaks are just baseline resolved when R = 1.

[Figure: signal vs. x, two overlapping peaks]
In this example R < 1: $\bar{x}_a - \bar{x}_b < \frac{W_a}{2} + \frac{W_b}{2}$
Peaks are not baseline resolved when R < 1.

2008 Data
[Figure: frequency vs. length (cm); White, N=12, Sum sq diff=0.037; Red, N=12, Sum sq diff=0.11]
What is the R for this data?
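The resolution formula above can be written as a small function. This is a sketch under stated assumptions: the function names are made up for illustration, the peak positions and widths in the example are hypothetical values rather than readings from any figure, and the absolute value is an added convenience so that the order of the two peaks does not matter.

```python
import math


def resolution(mean_a, mean_b, width_a, width_b):
    """R = |mean_a - mean_b| / (width_a/2 + width_b/2).

    Widths are the baseline widths of the two peaks. The absolute value is
    an assumption added here so that R is positive for either peak order.
    """
    return abs(mean_a - mean_b) / (width_a / 2 + width_b / 2)


def describe(r):
    """Translate R into the wording used for the three example cases."""
    if math.isclose(r, 1.0):
        return "just baseline resolved"
    return "baseline resolved" if r > 1 else "not baseline resolved"


# Hypothetical peaks centered at x = 2.0 and x = 3.0 with equal baseline widths.
for width in (0.5, 1.0, 2.0):
    r = resolution(2.0, 3.0, width, width)
    print(f"W = {width}: R = {r:.2f} ({describe(r)})")
```

Narrowing the peaks (smaller W) or moving the means apart both raise R, which is why resolution depends on the width of each population as well as on the difference between the averages.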
Anonymous 2009 student analysis of Needleman data
[Figures: % measured vs. verbal IQ, two panels titled “Comparison of 1978 Low Lead to 1978 High Lead” and “Comparison of 1978 Low Lead to 1979 High Lead”; one panel is annotated “Visually less resolved”, the other “Visually better resolved”]
$\frac{W_a}{2} \approx 112 - 70 = 42$
$\frac{W_b}{2} \approx 130 - 95 = 35$
$\bar{x}_a - \bar{x}_b \approx 112 - 95 = 17$
$R = \frac{\bar{x}_a - \bar{x}_b}{\frac{W_a}{2} + \frac{W_b}{2}} = \frac{17}{42 + 35} \approx 0.22$

Other measures of the quality of separation of the peaks
1. Limit of detection
2. Limit of quantification
3. Signal to noise (S/N)

Limit of detection
[Figure: two normal curves, amplitude vs. s, with the means of the blank and of the limit of detection separated by 3s]
$x_{LOD} = \bar{x}_{blank} + 3 s_{blank}$
99.74% of the observations of the blank lie within ±3s of the blank mean, so nearly all of them lie below the mean of the first detectable signal (LOD).
Two peaks are visible when all the data is summed together.
[Figure: sum of the two curves, amplitude vs. s]
[Figure: weight (lbs) vs. day; inset: histogram of weight]
Estimate the LOD (signal) of this data.

Limit of quantification
$x_{LOQ} = \bar{x}_{blank} + 9 s_{blank}$ (your book suggests 10)
[Figure: two normal curves, amplitude vs. s, separated by 9s]
Limit of quantification requires absolute certainty that no blank is part of the ...
[Figure: weight (lbs) vs. day; inset: histogram of weight]
Estimate the LOQ (signal) of this data.

Signal to noise (S/N)
Signal = $\bar{x}_{sample} - \bar{x}_{blank}$
Noise = N = standard deviation, s
$\frac{S}{N} = \frac{\bar{x}_{sample} - \bar{x}_{blank}}{s} = \frac{\bar{x}_{sample} - \bar{x}_{blank}}{pp/6}$
(This assumes pp school ≈ pp baseline.)
[Figure: weight (lbs) vs. day, with regions labeled Baseline, Vacation, School Begins and the signal marked; inset: histogram of weight]
Peak-to-peak variation about the school mean ≈ 6s, where s = N for noise.
Estimate the S/N of this data.

[Figure: length (cm) vs. sample number]
Can you “tell” where the switch between red and white potatoes begins?
What is the signal (length of white)?
What is the background (length of red)?
What is the S/N?

Effect of sample size on the measurement
Error curve: peak height grows with the number of measurements.
±1 s always contains the same proportion of the total number of measurements.
However, the actual value of s decreases as the population grows:
$s_{sample} = \frac{s_{population}}{\sqrt{n_{sample}}}$

2008 Data
[Figure: red running length average and red running stdev vs. sample number]
[Figure: stdev of red length (cm) vs. sqrt of number of samples; trend line y = -0.8807x + 5.9303, R² = 0.9491]

Calibration Curve
A calibration curve is based on a selected measurement that is linear in response to the concentration of the analyte.
$y = a + bx$
Or… a prediction of a measurement due to some change.
Can we predict my weight change if I had spent a longer time on vacation?
Fitch lbs = a + b (days on vacation)
[Figure: histogram, # of observations vs. weight (lbs), annotated “5 days”]

The calibration curve contains information about the sampling of the population.
[Figure: Fitch weight (lbs) vs. days on vacation; trend line y = 0.3542x + 140.04, R² = 0.7425]
Can get this by using “trend line”.

This is just a trendline from “format” data.
[Figure: stdev of red length (cm) vs. sqrt of number of samples; trend line y = -0.8807x + 5.9303, R² = 0.9491]

Sample   sqrt(#samples)   stdev
1        1                #DIV/0!
2        1.414213562      2.036468
3        1.732050808      4.475727
4        2                4.31441
5        2.236067977      3.844045
6        2.449489743      3.844604
7        2.645751311      3.735124
8        2.828427125      3.458414
9        3                3.235055
10       3.16227766       3.093053
11       3.31662479       2.935944
12       3.464101615      2.950187

Using the analysis data pack:

SUMMARY OUTPUT
Regression Statistics
Multiple R           0.296113395
R Square             0.087683143
Adjusted R Square    -0.013685397
Standard Error       0.703143388
Observations         11

ANOVA
             df   SS            MS         F          Significance F
Regression   1    0.427662048   0.427662   0.864994   0.376617
Residual     9    4.449695616   0.494411
Total        10   4.877357664

               Coefficients   Standard Error   t Stat     P-value    Lower 95%
Intercept      3.884015711    0.514960076      7.542363   3.53E-05   2.719094
X Variable 1   -0.06235252    0.067042092      -0.93005   0.376617   -0.21401

Get an error associated with the intercept.

In the best of all worlds you should have a series of blanks that determine the “noise” associated with the background:
$x_{LOD} = \bar{x}_{blank} + 3 s_{blank}$
Sometimes you forget, so to fall back and punt, estimate the standard deviation of the “blank” from the linear regression.
But remember, in doing this you are acknowledging a failure to plan ahead in your analysis.

Signal LOD and sensitivity (slope, b):
$x_{LOD} = \bar{x}_{blank} + b\,[\text{conc. LOD}]$
$x_{LOD} = \bar{x}_{blank} + 3 s_{blank}$
$\bar{x}_{blank} + 3 s_{blank} = \bar{x}_{blank} + b\,[\text{conc. LOD}]$
$[\text{conc. LOD}] = \frac{3 s_{blank}}{b}$
Extrapolation of the associated error can be obtained from the linear regression data.
The concentration LOD depends on BOTH the stdev of the blank and the sensitivity.
!!Note!! Signal LOD ≠ Conc. LOD
We want Conc. LOD.

Selectivity
[Figure: mV vs. pH or pM; Pb2+: y = -31.143x - 74.333, R² = 0.9994; H+: y = -41x - 118.5, R² = 0.9872]
Difference in slope is one measure of selectivity.
In a perfect method the sensing device would have zero slope for the interfering species.

Limit of linearity
5% deviation

Summary: Figures of Merit
Thus far:
R = resolution
S/N
LOD = both signal and concentration (can be expressed in terms of signal, but the better expression is in terms of concentration)
LOQ
LOL
Sensitivity (calibration curve slope)
Selectivity (essentially difference in slopes)
Tests: Anova

Why is the limit of detection important?
Why has the limit of detection changed so much in the last 20 years?

The End

Needleman’s data
[Figures: % of measurements vs. verbal IQ, two data sets]
Which of these two data sets would be likely to have the better numerical value for the ability to distinguish between two different populations?
Height for a normalized bell curve < 1

2008 Data
[Figure: frequency vs. length (cm); White, N=12, Sum sq diff=0.037; Red, N=12, Sum sq diff=0.11]
Which population is more variable? How can you tell?

[Figure: frequency vs. length (cm); Red, N=12, Sum sq diff=0.11, stdev=3.27; White, N=12, Sum sq diff=0.037, stdev=2.55; Red, N=40, Sum sq diff=0.017, stdev=2.67; White, N=38, Sum sq diff=0.028, stdev=2.15]
Increasing the sample size decreases the std dev and increases the separation of the populations. Notice that the means also change, and will do so until we have a reasonable sample of the population.
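The limit-of-detection relationships from the summary can be collected into one short calculation. This is a minimal sketch, assuming replicate blank readings are available: the function name, the blank readings and the slope below are hypothetical, and the multipliers follow the slides (3 s for LOD, 9 s for LOQ).

```python
from statistics import mean, stdev


def detection_limits(blank_signals, slope, k_lod=3, k_loq=9):
    """Return signal LOD, signal LOQ and concentration LOD.

    blank_signals: replicate measurements of the blank (at least two).
    slope: sensitivity b of the calibration curve y = a + b*x.
    k_lod, k_loq: multipliers of s_blank (3 and 9 as in the slides).
    """
    x_blank = mean(blank_signals)
    s_blank = stdev(blank_signals)  # sample standard deviation of the blank
    return {
        "signal_lod": x_blank + k_lod * s_blank,
        "signal_loq": x_blank + k_loq * s_blank,
        # Setting x_blank + k*s_blank equal to x_blank + b*[conc. LOD]:
        # abs() is an added assumption so a negative slope still gives a
        # positive concentration.
        "conc_lod": k_lod * s_blank / abs(slope),
    }


# Hypothetical blank readings (signal units) and calibration slope.
blanks = [1.0, 2.0, 3.0]
limits = detection_limits(blanks, slope=2.0)
for name, value in limits.items():
    print(f"{name}: {value:.2f}")
```

The signal LOD depends only on the blank, while the concentration LOD also divides by the slope, so a noisier blank or a less sensitive method both raise it.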