APPENDIX A: STATISTICAL MODEL EVALUATION

This author had some difficulty reproducing the Hanna statistical parameters presented in Hanna's Appendix D. The main problem stemmed from the definitions of geometric mean bias, MG, and geometric mean variance, VG. Although Hanna presents equations for these values, he does not state how the means are calculated.

Geometric Mean Bias, MG

By example, the geometric mean of three numbers, say 3.7, 4.8 and 5.3, is determined by first noting that there are 3 numbers in the set. That is, n = 3 and 1 / n = 1 / 3 = 0.333. The geometric mean is then:

    Geometric Mean = (3.7 x 4.8 x 5.3)^0.333 = (94.1)^0.333 = 4.55

This differs from the median, which is the middle value of an odd numbered set arranged in ascending order:

    3.7
    4.8   <<< Median
    5.3

For an even numbered set, say 3.7, 4.8, 5.3 and 3.2, arranged in ascending order:

    3.2
    3.7   << lower middle value
    4.8   << upper middle value
    5.3

The median of this even numbered set is then (3.7 + 4.8) / 2 = 4.25.

The geometric mean also differs from the simple average of the three number set, which is:

    Average = (3.7 + 4.8 + 5.3) / 3 = 4.6

In Hanna's study, MG for a single point is determined from his Equation 26 as:

    MG1 = exp [Ln(Xo1) - Ln(Xp1)]        Hanna Equation 26

which is simply the ratio Xo1 / Xp1. The geometric mean of the set of n pairs of Xp and Xo is then:

    MG set = (MG1 x MG2 x .... x MGn)^(1/n)

Geometric Mean Variance, VG

The above logic also applies to geometric mean variance, VG, where VG for a single point is determined from Hanna's Equation 28 as:

    VG = exp [{Ln(Xo) - Ln(Xp)}^2]        Hanna Equation 28

Thus, for a set of n pairs of Xp and Xo:

    VG set = (VG1 x VG2 x ..... x VGn)^(1/n)

Fractional Bias, FB

In Hanna's report, page 116, he presents Equation 25 for the Fractional Bias, FB, as:

    FB = (Xo - Xp) / [0.5 (Xo + Xp)]        Hanna Equation 25

However, the tables in Appendix D appear to present FB from a slightly different equation.
In Hanna's tables, FB appears to have been calculated by:

    FB = Ln (Xo / Xp)

Then, for a set of n pairs of Xo and Xp, FB is calculated by the simple average described above. That is:

    FB = (FB1 + FB2 + .... + FBn) / n

Normalized Mean Square Error, NMSE

Hanna's normalized mean square error is given by his Equation 27, which is:

    NMSE = (Xo - Xp)^2 / [(Xo)(Xp)]        Hanna Equation 27

For a set of n pairs of Xo and Xp, NMSE for the set is calculated as the simple average:

    NMSE set = (NMSE1 + NMSE2 + .... + NMSEn) / n

Fraction Of Data Within A Factor Of Two For Xp / Xo, FAC2

Hanna defines this factor by his Equation 30, which is:

    FAC2 = fraction of data for which 0.5 <= Xp / Xo <= 2.0        Hanna Equation 30

There appears to be a belief in the academic community that the best we can expect from a model is that the majority of data pairs will fall into the FAC2 range. That is, at a given distance the predicted concentration may be as low as 0.5 times the observed ppmv or as high as 2 times the observed ppmv. This has considerable impact on the dose term for a toxic chemical, where dose is defined as:

    Dose = (C^n) dt

where n is a constant, called a concentration probit constant, ranging anywhere from 0.6 to 3.0, and dt is a time increment. With n = 3, for example, a factor of two in concentration becomes a factor of eight in dose. This ± spread also has considerable impact when separation distances are fixed by some particular concentration criterion such as IDLH, ERPG1 or ERPG2. Rarely in dispersion presentations do we see data with the expected ranges, such as:

    Xp = 100 ppmv ± 100 ppmv    or    Distance = 500 ft. ± 250 ft.

After completing this exercise, I am now more inclined to report predicted concentrations and distances together with their expected uncertainty.

95% Confidence Limits

Hanna describes 95% confidence limits in his Equation 34 as:

    95% Confidence Limits = Mean ± (t 95)(sigma)[n/(n-1)]^0.5        Hanna Equation 34

This is an unfamiliar form of the Student “t” procedure.
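Before turning to the more usual form of the confidence limits, the set-level definitions of MG, VG, FB, NMSE and FAC2 given above can be sketched in a few lines of Python. This is only an illustration of the equations as this appendix reads them, not Hanna's own procedure: the function names `evaluate_pairs` and `fb_equation_25` and the four data pairs are hypothetical, chosen so that every prediction is off by exactly a factor of two.

```python
import math


def fb_equation_25(xo, xp):
    """Per-pair fractional bias as written in Hanna Equation 25."""
    return (xo - xp) / (0.5 * (xo + xp))


def evaluate_pairs(xo_values, xp_values):
    """Set-level statistics for paired observed (Xo) and predicted (Xp) values."""
    if len(xo_values) != len(xp_values) or not xo_values:
        raise ValueError("need equal-length, non-empty Xo and Xp lists")
    pairs = list(zip(xo_values, xp_values))
    n = len(pairs)
    # Ln(Xo) - Ln(Xp) for each pair; all values must be strictly positive.
    log_ratios = [math.log(xo) - math.log(xp) for xo, xp in pairs]
    return {
        # (MG1 x ... x MGn)^(1/n) equals exp of the average log ratio.
        "MG": math.exp(sum(log_ratios) / n),
        # (VG1 x ... x VGn)^(1/n) equals exp of the average squared log ratio.
        "VG": math.exp(sum(r * r for r in log_ratios) / n),
        # FB as it appears to be tabulated: simple average of Ln(Xo / Xp).
        "FB": sum(log_ratios) / n,
        # Simple average of the per-pair NMSE values.
        "NMSE": sum((xo - xp) ** 2 / (xo * xp) for xo, xp in pairs) / n,
        # Fraction of pairs with 0.5 <= Xp / Xo <= 2.0.
        "FAC2": sum(1 for xo, xp in pairs if 0.5 <= xp / xo <= 2.0) / n,
    }


# Hypothetical data: predictions alternate between 2 x Xo and 0.5 x Xo.
xo_demo = [10.0, 20.0, 40.0, 80.0]
xp_demo = [20.0, 10.0, 80.0, 40.0]
stats = evaluate_pairs(xo_demo, xp_demo)
for name, value in stats.items():
    print(f"{name} = {value:.4f}")
```

For a single pair with Xp = 2 Xo, Equation 25 gives FB = -0.667 while Ln(Xo / Xp) gives -0.693, which shows how close the two forms of FB are for a modest bias.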
The more usual form of the Student “t” procedure, found in many standard statistical textbooks, is:

    95% Confidence Limits = Mean ± (t 95)(sigma) / [n]^0.5

where t 95 is the Student “t” factor that bounds the central 95% of the t-distribution for the given degrees of freedom, and sigma is the standard deviation of the geometric mean bias, MG, data. Standard deviation is the usual definition given by:

    Standard Deviation = {[Sum (MG - MG avg)^2] / (n-1)}^0.5

For a 95% confidence limit, statistical tables generally give the “t” factor as t 0.025 for one side of the bell curve. For any data set of n numbers, t 0.025 is determined from the degrees of freedom, df, where df = n - 1. Taking both sides of the bell curve, that is 2.5% in the upper tail and 2.5% in the lower tail, a total of 5% of the points fall outside the t 0.025 limits. The fraction of points inside the limits is (1 - 0.05) = 0.95, which gives the term "95% confidence limits".

In examining most of the Hanna statistical MG, VG curves, my definition appears to give spreads similar to those shown for the various data sets. Hanna's Equation 34 does not come close to giving the spread shown on his curves. This suggests either a typographical error in Equation 34 or that Hanna has used some other equation for 95% confidence limits. Since Hanna has described the limits as the Student “t” procedure, I have presented the 95% confidence limits using my version of the equation.

A table of t 0.025 factors versus df = n - 1 has been included in this report for convenience. The table covers df = 1 to df = 140. For df from 140 to 1000, equations in terms of df are presented to determine t 0.025.

A Statistics Demo - A Short Primer

A statistics demo is presented to show how the Hanna statistical parameters are presented in this report. Hopefully, this demo also assists the reader in interpreting the statistics. To set the demo up, a set of dummy numbers has been inserted in the Xo column of the various spreadsheet formats.
Xp values are then set to multiples of Xo, such as 2 times Xo, to demonstrate how the statistical factors would appear.

Figure A-1: A Typical Working Table Used By This Author.

Fig. A-1 is a working table showing how the statistical factors are calculated for a set of Xp values that differ from Xo by exactly a factor of two. In this set, 20 points have been set at two times Xo and 20 points have been set at 0.5 times Xo. Thus we have a band, or "highway", that is within the FAC2 range described by Hanna. Fig. A-1 shows the Hanna equations used in each column. The t 0.025 factors have been extracted from the t-distribution tables mentioned above and found at the end of Appendix A. Note that MG ranges between a minimum of 0.5 and a maximum of 2.0.

Figure A-2: A Typical Statistical Summary Curve For A Series.

Fig. A-2 is a summary of the working table in Fig. A-1. Fig. A-2 is presented throughout this report as a summary of any given Series. It presents the Hanna performance curves shown typically on Hanna's page 124. Fig. A-2 also shows the Xp and Xo data and the Xo = Xp trend line. Data points above the trend line are under-predicted points. Points below the trend line are over-predicted. For this demo, the data points show a highway that just meets the FAC2 factor of two limit. This is what Hanna means when he states on his page 117 that "Geometric Mean Bias (MG) values of 0.5 and 2.0 can be thought of as 'factor of two'".

Note the Hanna Guidelines at MG = 0.5 and MG = 2.0, each rising vertically to VG = 200. Also note that the 95% confidence limits for this demo fall between MG = 0.7571 and MG = 1.2429, well inside the Hanna Guidelines. If a larger data set were used, e.g. n = 200, the 95% confidence limits would shrink even further, because n appears in the denominator of the Student “t” equation described above. This should serve to warn the reader that the guidelines have little value in the statistical chart. What is really important is the 95% confidence spread.
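The factor-of-two demo can be checked numerically. The sketch below is a minimal reconstruction and not the author's spreadsheet. It assumes that the limits are centred on the geometric mean of the MG values (1.0 for this symmetric set, as the quoted limits suggest), that sigma is the n - 1 standard deviation of the MG values about their simple average, and that t 0.025 = 2.0227 for df = 39, a value taken from a standard t-table rather than from the table in this report. The function name `demo_confidence_limits` and its parameters are hypothetical.

```python
import math


def demo_confidence_limits(factor, n_each=20, t_factor=2.0227):
    """95% limits on MG for a dummy set: n_each points at Xp = factor x Xo
    and n_each points at Xp = Xo / factor.

    t_factor is an assumed table value for t 0.025 at df = 2 * n_each - 1;
    the default applies only to n_each = 20 (df = 39).
    """
    # MG = Xo / Xp, so the set holds 1/factor and factor in equal numbers.
    mg_values = [1.0 / factor] * n_each + [factor] * n_each
    n = len(mg_values)
    # Geometric mean of MG, assumed here to be the centre of the limits.
    geo_mean = math.exp(sum(math.log(m) for m in mg_values) / n)
    # Standard deviation with (n - 1) in the denominator, about the average MG.
    mg_avg = sum(mg_values) / n
    sigma = math.sqrt(sum((m - mg_avg) ** 2 for m in mg_values) / (n - 1))
    # Usual Student t form: Mean +/- t * sigma / sqrt(n).
    half_width = t_factor * sigma / math.sqrt(n)
    return geo_mean - half_width, geo_mean + half_width


lower, upper = demo_confidence_limits(2.0)
print(f"95% limits on MG: {lower:.4f} to {upper:.4f}")
```

Under these assumptions the factor-of-two set returns limits of about 0.757 and 1.243, in line with the values read from Fig. A-2. A different set size would need its own t 0.025 factor passed in as `t_factor`.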
Also note in Fig. A-2 that the NMSE is 0.5 for this set of dummy numbers. If a set of field data varied as shown on Fig. A-2, we would note that VG was 1.62, as Hanna mentions on page 117. Hence, we can use the limit VG = 1.62 and the 95% spread of MG = 0.76 to 1.24 as a guide that the raw data is within the Factor of Two limits. For this reason, this author warns against giving too much weight to the Hanna guidelines.

Fig. A-3: Xp and Xo Vary By A Factor Of Four.

The exercise shown on Fig. A-1 and A-2 is repeated with the data varied by a factor of four. This would be representative of a very poor dispersion equation. Note that the Xp and Xo highway now ranges between a maximum MG of 4 and a minimum MG of 0.25. The 95% confidence limits now have an MG spread from 0.3926 to 1.6074 and a variance, VG, of 6.8333. The lower 95% MG value falls just outside the left-hand Hanna Guideline of MG = 0.5. This shows that even for some poor data, the points may still fall within the Factor of Two Guidelines and thereby mislead the reader at a glance. Note also that the NMSE has jumped to 2.25 and the standard deviation has jumped to 1.9.

Fig. A-4: Xp and Xo Vary By A Factor Of Three.

The exercise is repeated with a factor of three. Note that the 95% confidence limits are within the Hanna Guidelines. As before, the Guidelines are misleading.

Fig. A-5: Xp = Xo For A Perfect Fit.

If the dispersion model were perfect, the Xp values would equal the Xo values. All data points would fall on the trend line. The 95% confidence spread would shrink to the single value of 1. MG would equal 1.0 and VG would equal 1.0. This data point is shown on Fig. A-5 at the bottom of the parabola.

Fig. A-6: A Summary Curve For The Four Xp / Xo Ranges.

Fig. A-6 is a summary curve for the four Xp / Xo ranges discussed above. It shows how the quality of data, based on 95% confidence limits, varies over a considerable range but still appears to fall within the Factor Of Two criteria described by Hanna.
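The VG and NMSE values quoted for these dummy sets follow directly from the per-pair definitions. The short derivation below is a sketch under the assumption that half of the n points sit at Xp = f Xo and the other half at Xp = Xo / f; the symbol f for the factor is introduced here only for illustration.

```latex
\begin{align*}
MG_i &= \frac{X_o}{X_p} \in \left\{ \tfrac{1}{f},\; f \right\}
  &&\Rightarrow\quad MG_{\text{set}} = \left( f^{-n/2} \cdot f^{\,n/2} \right)^{1/n} = 1 \\
VG_i &= \exp\!\left[ \left( \ln X_o - \ln X_p \right)^2 \right] = \exp\!\left[ (\ln f)^2 \right]
  &&\Rightarrow\quad VG_{\text{set}} = \exp\!\left[ (\ln f)^2 \right] \\
NMSE_i &= \frac{(X_o - X_p)^2}{X_o X_p} = \frac{(f - 1)^2}{f}
  &&\Rightarrow\quad NMSE_{\text{set}} = \frac{(f - 1)^2}{f}
\end{align*}
```

With f = 2 this gives VG = 1.62 and NMSE = 0.5, and with f = 4 it gives VG = 6.83 and NMSE = 2.25, matching the values quoted for Fig. A-2 and Fig. A-3. With f = 1, the perfect fit, VG = 1 and NMSE = 0.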
Perhaps guidelines based on the 95% confidence limits will serve as a better indicator of the quality of data. In any event, the Hanna guidelines are retained only for comparison to the Hanna data.

Validation Of Statistical Parameters By Hanna - HEGADIS

The Burro Series was used to demonstrate that the statistical calculations performed in this study are similar to those in the Hanna report. The reader is directed to Hanna's Appendix D-1 and to the Burro data in Block 1 for the HEGADIS line of information. As seen on this author's summary of the Burro Series, the values of MG = 0.292, VG = 6.89, FAC2 = 0.19 and FB = -1.23 are identical to those of Hanna for this Series. The author does not know how Hanna's "sigma" and "mean" values were calculated, as Hanna gives no description of how these values were obtained.

Why Plot Data On Ln-Ln Plots?

As Hanna explains it, the data range from 1 ppmv to 1,000,000 ppmv. Hence a Ln-Ln plot will show the data better than standard Cartesian coordinates. This is demonstrated by showing the Series Statistics for the Prairie Grass, Set 5 Series using both Cartesian and Ln-Ln plots. Note how poorly the Cartesian coordinates display the data compared to the same information on a Ln-Ln plot.