APPENDIX A: STATISTICAL MODEL EVALUATION

This author had some difficulty reproducing the statistical parameters presented in Hanna's
Appendix D. The main problem stemmed from the definitions of geometric mean bias, MG,
and geometric mean variance, VG. Although Hanna presents equations for determining these
values, he does not state how the means are calculated.
Geometric Mean Bias, MG
For example, the geometric mean of three numbers, say 3.7, 4.8 and 5.3, is determined by first
noting that there are 3 numbers in the set. That is, n = 3 and 1 / n = 1 / 3 = 0.333. Next, the
geometric mean is determined by:
Geometric Mean = (3.7 x 4.8 x 5.3)^0.333 = (94.1)^0.333 = 4.55
This differs from the median, which is the middle value of an odd-numbered set
arranged in ascending order:
3.7
Median >>> = 4.8
5.3
For an even set of numbers, say 3.7, 4.8, 5.3 and 3.2, arranged in ascending order:
3.2
3.7 >> lower middle value
4.8 >> upper middle value
5.3
The median of this even-numbered set is then = (3.7 + 4.8) / 2 = 4.25
Geometric Mean also differs from the Simple Average (arithmetic mean) of the three-number set, which is:
Average = (3.7 + 4.8 + 5.3) / 3 = 4.6
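The three quantities above can be checked with a short Python sketch. The function name geometric_mean is illustrative only; it is not taken from Hanna's report.

```python
import math
import statistics


def geometric_mean(values):
    """n-th root of the product of the values, computed through logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


odd_set = [3.7, 4.8, 5.3]
even_set = [3.7, 4.8, 5.3, 3.2]

gm = geometric_mean(odd_set)                # (3.7 x 4.8 x 5.3)^(1/3)
median_odd = statistics.median(odd_set)     # middle value of the sorted set
median_even = statistics.median(even_set)   # average of the two middle values
average = sum(odd_set) / len(odd_set)       # simple average

print(f"geometric mean      = {gm:.2f}")
print(f"median (3 numbers)  = {median_odd:.2f}")
print(f"median (4 numbers)  = {median_even:.2f}")
print(f"simple average      = {average:.2f}")
```

Working through logarithms avoids overflow when the product of many concentrations becomes very large, and gives the same result as taking the n-th root of the product directly.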
In Hanna's study, MG for a single point is determined from his Equation 26 as:
MG1 = exp [Ln(Xo1) - Ln(Xp1) ]
Hanna Equation 26
The Geometric Mean of the set of n pairs of Xp and Xo is then:
MG set = (MG1 x MG2 x .... x MGn)^(1/n)
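A minimal Python sketch of this calculation follows. The concentration pairs are hypothetical values chosen for illustration, and the function names are this sketch's own. It also shows that the geometric mean of the point values is the same as taking exp of the average of Ln(Xo) - Ln(Xp).

```python
import math


def mg_point(xo, xp):
    """MG for one (observed, predicted) pair, per the single-point equation."""
    return math.exp(math.log(xo) - math.log(xp))


def mg_set(pairs):
    """Geometric mean of the point MG values for n pairs."""
    product = 1.0
    for xo, xp in pairs:
        product *= mg_point(xo, xp)
    return product ** (1.0 / len(pairs))


def mg_set_via_logs(pairs):
    """Equivalent form: exp of the average of Ln(Xo) - Ln(Xp)."""
    diffs = [math.log(xo) - math.log(xp) for xo, xp in pairs]
    return math.exp(sum(diffs) / len(diffs))


# Hypothetical (Xo, Xp) pairs in ppmv, for illustration only.
pairs = [(10.0, 20.0), (50.0, 25.0), (100.0, 100.0)]

print([round(mg_point(xo, xp), 3) for xo, xp in pairs])
print(round(mg_set(pairs), 3), round(mg_set_via_logs(pairs), 3))
```

In this hypothetical set, one over-prediction by a factor of two and one under-prediction by a factor of two cancel, so MG for the set is 1.0 even though only one point is a perfect match.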
Geometric Mean Variance
The above logic also applies to Geometric Mean Variance, VG, where VG is determined from
Hanna's Equation 28 as:
VG = exp [{Ln(Xo) - Ln(Xp)}^2 ]
Hanna Equation 28
Thus, for a set of n pairs of Xp and Xo:
VG = (VG1 x VG2 x ..... x VGn)^(1/n)
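The same approach can be sketched in Python for VG. The pairs are again hypothetical and the function names are illustrative. Because the log difference is squared, an over-prediction and an under-prediction by the same factor give the same point VG, so they do not cancel as they do for MG.

```python
import math


def vg_point(xo, xp):
    """VG for one (observed, predicted) pair."""
    return math.exp((math.log(xo) - math.log(xp)) ** 2)


def vg_set(pairs):
    """Geometric mean of the point VG values for n pairs."""
    squares = [(math.log(xo) - math.log(xp)) ** 2 for xo, xp in pairs]
    return math.exp(sum(squares) / len(squares))


# Hypothetical pairs: one over-predicted and one under-predicted by a factor of two.
over = (10.0, 20.0)
under = (20.0, 10.0)

print(round(vg_point(*over), 4), round(vg_point(*under), 4))
print(round(vg_set([over, under]), 4))
```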
Fractional Bias, FB
In Hanna's report, page 116, he presents Equation 25 for the Fractional Bias, FB, as:
FB = ( Xo - Xp ) / [ 0.5 (Xo + Xp) ]
Hanna Equation 25
However, the tables in Hanna's Appendix D present FB from a slightly different equation.
In those tables, FB appears to have been calculated by:
FB = Ln (Xo / Xp)
Then, for a set of n pairs of Xo and Xp, FB is calculated by the simple Average described above.
That is:
FB = (FB1 + FB2 + .... + FBn ) / n
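The two forms of FB can be compared with the Python sketch below. The pairs are hypothetical and the function names are this sketch's own. Averaging Ln(Xo / Xp) over a set is the same as taking Ln of the geometric mean bias for that set, so the logarithmic form implies FB = Ln(MG). The Burro values quoted later in this appendix (MG = 0.292, FB = -1.23) are consistent with that relation, although this does not prove how Hanna's tables were produced.

```python
import math


def fb_eq25(xo, xp):
    """Fractional bias for one pair, per Hanna Equation 25."""
    return (xo - xp) / (0.5 * (xo + xp))


def fb_log(xo, xp):
    """Alternate form that appears to match the Appendix D tables."""
    return math.log(xo / xp)


def fb_set(pairs, point_fn):
    """Simple average of the point FB values."""
    return sum(point_fn(xo, xp) for xo, xp in pairs) / len(pairs)


# Hypothetical (Xo, Xp) pairs in ppmv, for illustration only.
pairs = [(10.0, 20.0), (50.0, 40.0), (100.0, 300.0)]

print(round(fb_set(pairs, fb_eq25), 3))
print(round(fb_set(pairs, fb_log), 3))

# Consistency check against the Burro figures quoted in this appendix.
print(round(math.log(0.292), 2))
```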
Normalized Mean Square Error, NMSE
Hanna's normalized mean square error is given by his Equation 27, which is:
NMSE = (Xo - Xp)^2 / [(Xo)(Xp)]
For a set of n pairs of Xo and Xp, NMSE for the set is calculated by the Average for the set as:
NMSE set = (NMSE1 + NMSE2 + .... + NMSEn) / n
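A short Python sketch of the NMSE calculation as described here, using hypothetical pairs and illustrative function names:

```python
def nmse_point(xo, xp):
    """NMSE for one (observed, predicted) pair."""
    return (xo - xp) ** 2 / (xo * xp)


def nmse_set(pairs):
    """Simple average of the point NMSE values."""
    return sum(nmse_point(xo, xp) for xo, xp in pairs) / len(pairs)


# Hypothetical (Xo, Xp) pairs in ppmv, for illustration only.
pairs = [(10.0, 20.0), (50.0, 25.0), (100.0, 100.0)]

print([round(nmse_point(xo, xp), 3) for xo, xp in pairs])
print(round(nmse_set(pairs), 3))
```

A point that is over-predicted by a factor of two and a point that is under-predicted by a factor of two both give a point NMSE of 0.5, so NMSE, like VG, does not distinguish the direction of the error.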
Fraction Of Data Between a Factor Of Two for Xp / Xo, FAC2
Hanna defines this factor by his Equation 30, which is:
FAC2 = fraction of data for which 0.5 <= Xp / Xo <= 2.0
Hanna Equation 30
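Counting the fraction of pairs inside the factor-of-two band can be sketched in Python as follows. The pairs are hypothetical and the function name fac2 is illustrative.

```python
def fac2(pairs):
    """Fraction of (Xo, Xp) pairs for which 0.5 <= Xp / Xo <= 2.0."""
    inside = sum(1 for xo, xp in pairs if 0.5 <= xp / xo <= 2.0)
    return inside / len(pairs)


# Hypothetical (Xo, Xp) pairs in ppmv; the last pair is over-predicted by 3.5 times.
pairs = [(10.0, 20.0), (50.0, 25.0), (100.0, 100.0), (10.0, 35.0)]

print(fac2(pairs))
```

Points lying exactly on the limits (Xp / Xo equal to 0.5 or 2.0) are counted as inside the band, following the "<=" signs in the equation.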
There appears to be a belief in the academic community that the best we can expect from a model
is that the majority of data pairs will fall into the FAC2 range. That is, at a given distance the
predicted concentration may be as low as 0.5 times the observed ppmv or as high as 2 times the
observed ppmv. This has considerable impact on the dose term for a toxic chemical where dose
is defined as:
Dose = (C^n ) dt
where C is the concentration, n is a constant (called a concentration probit constant)
ranging anywhere from 0.6 to 3.0, and dt is a time increment.
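The effect of a factor-of-two concentration error on the dose term can be illustrated with a small Python sketch. It assumes, for simplicity, that the concentration is constant over the time increment, so that dt cancels when two doses are compared. The function name is illustrative.

```python
def dose_ratio(concentration_factor, n):
    """Ratio of predicted to observed dose when C is off by a constant factor.

    Assumes the same time increment dt for both doses, so dt cancels.
    """
    return concentration_factor ** n


for n in (0.6, 1.0, 2.0, 3.0):
    low = dose_ratio(0.5, n)    # concentration under-predicted by a factor of two
    high = dose_ratio(2.0, n)   # concentration over-predicted by a factor of two
    print(f"n = {n:3.1f}: dose ranges from {low:.3f} to {high:.3f} times the observed dose")
```

Under this assumption, a prediction that is within the FAC2 range on concentration can still be out by a factor of eight on dose when n = 3.0.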
This ± spread also has considerable impact when separation distances are fixed by some
particular concentration criteria such as: IDLH, ERPG1, ERPG2 etc.
Rarely in dispersion presentations do we see data with the expected ranges, such as:
Xp = 100 ppmv ± 100 ppmv
or
Distance = 500 ft. ± 250 ft.
After completing this exercise, I am now more inclined to predict concentrations and distances
with the expected uncertainty.
95% Confidence Limits
Hanna describes 95% confidence limits in his Equation 34 as:
95% Confidence Limits = Mean ± (t 95)(sigma)[n/(n-1)]^0.5
Hanna Equation 34
This is an unfamiliar form of the student “t” procedure. The more usual form, found in many
standard statistics textbooks, is:
95% Confidence Limits = Mean ± (t 95)(sigma) / [n]^0.5
where t 95 is the critical value of the t-distribution that encloses 95% of the points
between the two tails of the bell curve. Sigma is the standard deviation of the geometric mean
bias, MG, data. Standard deviation is the usual definition given by:
Standard Deviation = {[Sum (MG - MG avg)^2] / (n-1)}^0.5
For a 95% confidence limit, statistical tables generally give the “t” factor as t 0.025 for one side of
the bell curve. For any data set of n numbers, “t” 0.025 is determined from the degrees of
freedom, df, where df is defined as df = n - 1. Taking both sides of the bell, that is 2.5% in the
upper tail and 2.5% in the lower tail, a total of 5% of the points fall outside the “t” 0.025 limits.
The fraction of points inside the limits is then (1 - 0.05) = 0.95, which gives the term
"95% confidence limits".
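The difference between the two forms can be seen with a Python sketch. It uses 40 MG values, 20 at 0.5 and 20 at 2.0, which is the factor-of-two demo set described later in this appendix. Two assumptions are made here: sigma is taken as the (n - 1) standard deviation defined above, and t 0.025 for df = 39 is taken as 2.0227 from a standard t-table (the table in this report may carry a slightly different rounding). The function names are illustrative.

```python
import math

T_025_DF39 = 2.0227  # assumed t-table value for df = 39, two-sided 95%


def standard_deviation(values):
    """Standard deviation with n - 1 in the denominator."""
    avg = sum(values) / len(values)
    return math.sqrt(sum((v - avg) ** 2 for v in values) / (len(values) - 1))


def half_width_usual(values, t):
    """Usual student t form: t * sigma / sqrt(n)."""
    return t * standard_deviation(values) / math.sqrt(len(values))


def half_width_as_printed(values, t):
    """Form as printed in Hanna's report: t * sigma * sqrt(n / (n - 1))."""
    n = len(values)
    return t * standard_deviation(values) * math.sqrt(n / (n - 1))


mg_values = [0.5] * 20 + [2.0] * 20

usual = half_width_usual(mg_values, T_025_DF39)
as_printed = half_width_as_printed(mg_values, T_025_DF39)

print(f"usual form:      ± {usual:.4f}")
print(f"as-printed form: ± {as_printed:.4f}")
```

With these inputs the usual form gives a half-width of about 0.243, which matches the limits of 1.2429 and 0.7571 quoted for the factor-of-two demo when they are centred on MG = 1.0. The as-printed form gives a half-width more than six times larger.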
In examining most of the Hanna statistical MG, VG curves, my definition appears to give similar
spreads to the various data sets. Hanna's Equation 34, as printed, does not come close to giving
the spread shown on his curves. This suggests either a typographical error in the equation or that
Hanna has used some other equation for 95% confidence limits. Since Hanna has described the
limits as the student “t” procedure, I have presented the 95% confidence limits using the usual
form of the student “t” procedure given above.
A table of “t” 0.025 factors versus df = n-1 has been included in this report for convenience. The
table covers df = 1 to df = 140. For df from 140 to 1000, equations in terms of df are
presented to determine “t” 0.025.
A Statistics Demo - A Short Primer
A statistics demo is presented to show how the Hanna statistical parameters are presented in this
report. Hopefully, this demo also serves to assist the reader in interpreting the statistics.
To set the demo up, a set of dummy numbers has been inserted in the Xo column in the various
spreadsheet formats. Xp values are then set to fixed multiples of Xo, such as 2 times Xo, to
demonstrate how the statistical factors would appear.
Figure A-1:
A Typical Working Table used by this author.
For example, Fig. A-1 is a working table showing how the statistical factors are calculated for a
set of Xp values that differ from Xo by exactly a factor of two. In this set, 20 points have been set
at two times Xo and 20 points have been set at 0.5 times Xo. Thus we have a highway that is within
the FAC2 range described by Hanna.
Fig. A-1 shows the Hanna Equations used in each column. t 0.025 factors have been extracted
from the t-distribution tables mentioned above and found at the end of Appendix A. Note the
range of MG varies between a minimum of 0.5 and a maximum of 2.0.
Figure A-2:
A Typical Statistical Summary Curve For A Series.
Fig. A-2 is a summary of the working table in Fig. A-1. Fig. A-2 is presented throughout this
report as a summary of any given Series. It presents the Hanna performance curves shown
typically on Hanna's page 124.
Fig. A-2 also shows the Xp and Xo data and the Xo = Xp trend line. Data points above the trend
line are under-predicted points. Points that are below the trend line are over-predicted. For this
demo, the data points show a highway that just meets the FAC2 factor of two limit. This is what
Hanna means when he states on his page 117 that "Geometric Mean Bias (MG) values of 0.5 and
2.0 can be thought of as 'factor of two'".
Note the Hanna Guidelines at MG = 0.5 and MG = 2.0, each rising vertically to VG = 200. Also
note that the 95% confidence limits for this demo fall between MG = 1.2429 and 0.7571, or well
between the Hanna Guidelines. If a larger data set were used, e.g. n = 200, the 95% confidence
limits would shrink even further, as described above, because n appears in the denominator of the
student “t” equation.
This should serve to warn the reader that the guidelines have little value in the statistical chart.
What is really important is the 95% confidence spread.
Also note in Fig. A-2, the NMSE is 0.5 for this set of dummy numbers.
If a set of field data varied as shown on Fig. A-2, we would note that VG was 1.62, as Hanna
mentions on page 117. Hence, we can use the limit VG = 1.62 and the 95% spread of MG = 1.24
to 0.76 as a guide that the raw data is within the Factor of Two limits. This author therefore warns
against giving too much weight to the Hanna guidelines.
Fig. A-3:
Xp and Xo Vary By A factor of Four.
The exercise, shown on Fig. A-1 and A-2, is repeated with the data varied by a factor of four.
This would be demonstrative of a very poor dispersion equation.
Note: the Xp and Xo highway now ranges between a max MG of 4 and a min MG of 0.25.
Note: the 95% confidence limits now have an MG spread from 1.6074 to 0.3926 and a Variance,
VG, of 6.8333.
Note: the 95% lower MG value just falls outside the left-hand Hanna Guideline of MG = 0.5.
This shows that even for some lousy data, the points may still fall within the Factor of Two
Guidelines and thereby mislead the reader at a glance.
Note: the NMSE has jumped to 2.25 and the standard deviation has jumped to 1.9.
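The demo figures for the factor-of-two and factor-of-four cases can be reproduced approximately with the Python sketch below. Several assumptions are made, since the working table itself is not reproduced in this text: 20 points at Xp = factor x Xo and 20 points at Xp = Xo / factor; the 95% limits centred on the geometric mean MG; sigma taken as the (n - 1) standard deviation of the point MG values; and t 0.025 for df = 39 taken as 2.0227 from a standard t-table. The function name demo_summary and the dummy Xo value are illustrative.

```python
import math

T_025_DF39 = 2.0227  # assumed t-table value for df = 39, two-sided 95%


def demo_summary(factor, n_each=20, t=T_025_DF39):
    """Statistics for a dummy set with Xp = factor * Xo and Xp = Xo / factor."""
    xo = 100.0  # dummy value; every statistic below depends only on Xp / Xo
    pairs = [(xo, xo * factor)] * n_each + [(xo, xo / factor)] * n_each
    n = len(pairs)

    mg_points = [o / p for o, p in pairs]
    log_diffs = [math.log(o) - math.log(p) for o, p in pairs]

    mg = math.exp(sum(log_diffs) / n)
    vg = math.exp(sum(d * d for d in log_diffs) / n)
    nmse = sum((o - p) ** 2 / (o * p) for o, p in pairs) / n

    avg = sum(mg_points) / n
    sigma = math.sqrt(sum((m - avg) ** 2 for m in mg_points) / (n - 1))
    half_width = t * sigma / math.sqrt(n)

    return {
        "MG": mg, "VG": vg, "NMSE": nmse, "sigma": sigma,
        "MG min": min(mg_points), "MG max": max(mg_points),
        "lower 95%": mg - half_width, "upper 95%": mg + half_width,
    }


f2 = demo_summary(2.0)
f4 = demo_summary(4.0)

for name, result in (("factor of two", f2), ("factor of four", f4)):
    print(name)
    for key, value in result.items():
        print(f"  {key:10s} = {value:.4f}")
```

Under these assumptions the sketch agrees with the figures quoted in the text to within rounding: VG of about 1.62 and 6.83, NMSE of 0.5 and 2.25, a standard deviation of about 1.9 for the factor-of-four case, and 95% limits close to 1.2429 / 0.7571 and 1.6074 / 0.3926.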
Fig. A-4:
Xp and Xo Vary By A Factor Of Three
The exercise is repeated with a factor of three.
Note: the 95% confidence limits are within the Hanna Guidelines. As before, the Guidelines are
misleading.
Fig A-5:
Xp = Xo For A Perfect Fit
If the dispersion model were perfect, the Xp values would equal the Xo values. All data points
would fall on the trend line. The 95% confidence spread would shrink to the single value MG = 1.
MG would equal 1.0 and VG would equal 1.0. This data point is shown on Fig. A-5 at the bottom
of the parabola.
Fig A-6:
A Summary Curve for the Four Xp / Xo Ranges.
Fig. A-6 is a summary curve for the four Xp / Xo ranges discussed above. It serves to show how
the quality of data, based on 95% confidence limits, varies over a considerable range but still
appears to fall within the Factor Of Two criteria described by Hanna. Perhaps the 95% confidence
limits will serve as a better indicator of the quality of data. In any event, the Hanna guidelines are
retained only for comparison to the Hanna data.
Validation Of Statistical Parameters By Hanna - HEGADIS
The Burro Series was used to demonstrate that the statistical calculations performed in this study
are similar to those in the Hanna report. The reader is directed to Hanna's Appendix D-1 and to
the Burro data in Block 1 for the HEGADIS line of information. As seen on this author's
summary of the Burro Series, the values of MG = 0.292, VG = 6.89, FAC2 = 0.19 and
FB = -1.23 are identical to those of Hanna in this Series.
The author does not know how "sigma" and "mean" values were calculated as there are no
descriptions that define how Hanna obtained these values.
Why Plot Data On Ln-Ln Plots?
As Hanna explains it, the data range from 1 ppmv to 1,000,000 ppmv. Hence a Ln-Ln plot
will show the data better than standard Cartesian coordinates. This is demonstrated by
showing the Series Statistics for the Prairie Grass, Set 5 Series using both Cartesian and Ln-Ln
plots.
Note how poorly the Cartesian coordinates show off the data compared to the same information
on a Ln-Ln plot.
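The reason can be illustrated numerically with a Python sketch that computes where a concentration would sit along an axis spanning 1 ppmv to 1,000,000 ppmv, as a fraction of the axis length. The function names are illustrative, and the sample concentrations are chosen only to span the range; they are not Prairie Grass data.

```python
import math

LOW_PPMV = 1.0
HIGH_PPMV = 1_000_000.0


def linear_position(c):
    """Fractional position of c along a Cartesian axis from LOW to HIGH."""
    return (c - LOW_PPMV) / (HIGH_PPMV - LOW_PPMV)


def log_position(c):
    """Fractional position of c along a logarithmic axis from LOW to HIGH."""
    span = math.log(HIGH_PPMV) - math.log(LOW_PPMV)
    return (math.log(c) - math.log(LOW_PPMV)) / span


for c in (1.0, 10.0, 100.0, 1_000.0, 10_000.0, 100_000.0, 1_000_000.0):
    print(f"{c:>11,.0f} ppmv: Cartesian {linear_position(c):.6f}   Ln {log_position(c):.3f}")
```

On the Cartesian axis, every concentration up to 1,000 ppmv is squeezed into the first 0.1% of the axis, while on the logarithmic axis the same points occupy the first half.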