WATER RESOURCES RESEARCH, VOL. 18, NO. 4, PAGES 1081-1088, AUGUST 1982

A Comparison of Four Streamflow Record Extension Techniques

ROBERT M. HIRSCH
U.S. Geological Survey, Reston, Virginia 22092

One approach to developing time series of streamflow, which may be used for simulation and optimization studies of water resources development activities, is to extend an existing gage record in time by exploiting the interstation correlation between the station of interest and some nearby (long-term) base station. Four methods of extension are described, and their properties are explored. The methods are regression (REG), regression plus noise (RPN), and two new methods, maintenance of variance extension types 1 and 2 (MOVE.1, MOVE.2). MOVE.1 is equivalent to a method which is widely used in psychology, biometrics, and geomorphology and which has been called by various names, e.g., 'line of organic correlation,' 'reduced major axis,' 'unique solution,' and 'equivalence line.' The methods are examined for bias and standard error of estimate of moments and order statistics, and an empirical examination is made of the preservation of historic low-flow characteristics using 50-year-long monthly records from seven streams. The REG and RPN methods are shown to have serious deficiencies as record extension techniques. MOVE.2 is shown to be marginally better than MOVE.1, according to the various comparisons of bias and accuracy.

This paper is not subject to U.S. copyright. Published in 1982 by the American Geophysical Union. Paper number 2W0479.

INTRODUCTION

Current practice in many aspects of water resources planning and management involves the use of hydrologic time series to simulate the outcome of decisions. The decisions may include waste treatment plant designs, establishment of operating policies for water supply systems, hydropower production scheduling, entry into river compacts and intergovernmental agreements, and construction of water storage and withdrawal facilities. The kinds of outcomes that may be of interest include frequency and duration of unacceptable water quality conditions, frequency and duration of supply shortfalls for municipal, industrial, or agricultural water users, dependable rate of hydropower production during peak demand periods, frequency and severity of river compact violations, or, more abstractly, the expectation and variance of project benefits or costs.

There are several different methods for developing time series for use in simulation. The following is a list of the general categories of methods for developing such time series for streamflows at a single site.

1. Use the historic record of streamflows [Rippl, 1883; Hazen, 1914].
2. Use the historic record of streamflows and extend it in time by exploiting the correlation between flows at the site and concurrent flows at some nearby long-term gage (a base station) [Riggs, 1972; Matalas and Jacobs, 1964; Hirsch, 1979].
3. Reconstruct historic flow records by transposing records from a base station to the site of interest by using some function in which the coefficients are derived from a regional streamflow basin characteristics regression equation [Hirsch, 1979].
4. Generate multiple synthetic streamflow records [Fiering, 1967] where the parameters are based on historic flow values at the site or on regional streamflow characteristics [Benson and Matalas, 1967].
5. Develop and calibrate a conceptual model of the basin and use it to generate a streamflow record by using historical meteorological records as inputs [Crawford and Linsley, 1966].
6. Develop and calibrate a conceptual model of the basin and use it to generate multiple streamflow records by using synthetic meteorological records as inputs [Leclerc and Schaake, 1973].

Each of these categories, and the specific methods in each category, have certain advantages and disadvantages for various applications. The selection of an appropriate method would depend on the relevant time step of analysis (hours, days, weeks, seasons, years, or decades) and the benefits of increased accuracy in estimation of outcomes in comparison to the cost of applying a more complex method. The more historically based methods (categories 1, 2, 3, and 5) have their greatest applicability when the time scales are fine (for example, small storages, run of river withdrawals, or water quality analysis). The synthetic methods (categories 4 and 6) have their greatest applicability where the time scales are coarse (for example, large storages for control of multiyear droughts). An additional consideration in the selection of methods is the potential for analysis of errors. In general, this becomes more difficult as the number of parameters and model assumptions grows, and this number grows rapidly as the time step of analysis becomes finer. Another consideration in many cases is the adaptability of the method to changes in watershed characteristics. In general, only those methods that involve conceptual models have this capability.

This paper will focus on methods within the second category and comparison of these with category 1. It is the intent of this paper to evaluate the characteristics of some methods within category 2. This is not done out of a belief that this category is generally superior to the others, but because it may be superior for some uses and is easy to apply and well accepted by many practicing water resource engineers. The evaluation of these techniques will rely on measures of the accuracy of low-flow duration and severity estimates as indicators of the suitability of the techniques for use in water resource system simulations.

In the next section of the paper, four methods of record extension are defined and some of their properties are discussed.
A set of Monte Carlo trials is then carried out to evaluate the bias and error in estimating means, variances, and order statistics based on extended records where marginal distributions of flows are normal. Finally, the methods are applied to historical data sets and the results are examined for accuracy in reproducing certain historical order statistics.

PROBLEM DEFINITION

For the base station the flow values are denoted $x(i)$ where $i$ is an index of time. For the short-record station the flow values are denoted $y(i)$. The observed events for the two sequences are represented as

$$x(1), \ldots, x(N_1), x(N_1+1), \ldots, x(N_1+N_2)$$
$$y(1), \ldots, y(N_1)$$

where $N_1$ is the length of the shorter sequence and $(N_1 + N_2)$ is the length of the longer sequence. $N_1$ is also the length of concurrent record. It is not necessary for the two sequences to begin or end simultaneously, nor need the observations be consecutive, but there is no loss of generality if the two sequences are represented as above.

The estimates of the missing values are denoted $\hat{y}(i)$, $i = N_1+1, \ldots, N_1+N_2$, and the complete extended record is denoted $\tilde{y}(i)$, $i = 1, \ldots, N_1, N_1+1, \ldots, N_1+N_2$, where

$$\tilde{y}(i) = y(i), \qquad i = 1, \ldots, N_1$$
$$\tilde{y}(i) = \hat{y}(i), \qquad i = N_1+1, \ldots, N_1+N_2$$

Table 1 gives the naming conventions for the various sample statistics referred to in the report. Much of the notation and some of the derivations used in this paper were developed by Matalas and Jacobs [1964]. There is, however, a fundamental difference between the intent of that paper and the intent of the present one. Matalas and Jacobs were concerned with the quality (bias and variance) of the estimates $m(\tilde{y})$ and $S^2(\tilde{y})$, and it was not their intent to consider methods of producing an extended record $\tilde{y}(i)$. In this paper the goal is the development of this extended record. The properties of statistics of this record, such as $m(\tilde{y})$ and $S^2(\tilde{y})$, are used as measures (but not the only measures) of the quality of the extended record. In the next section, four methods will be presented, and in each case the expectation of $m(\tilde{y})$ and of $S^2(\tilde{y})$ will be evaluated under the assumption that concurrent observations of $x$ and $y$ are stationary, serially independent, and have a bivariate normal probability distribution with parameters $\mu_x$, $\mu_y$, $\sigma_x^2$, $\sigma_y^2$, and $\rho$, where $\mu_x$ and $\sigma_x^2$ denote the population mean and variance for $x$, and $\mu_y$ and $\sigma_y^2$, the population mean and variance, respectively, for $y$. The parameter $\rho$ is the population product moment correlation coefficient, and $\rho\sigma_y/\sigma_x$ is the population value of the slope of the linear regression of $y$ on $x$.

TABLE 1. Definitions of Sample Statistics

  Sample mean     Sample variance*    Computed from
  m(x_1)          S^2(x_1)            x(1), ..., x(N_1)
  m(x_2)          S^2(x_2)            x(N_1+1), ..., x(N_1+N_2)
  m(x)            S^2(x)              x(1), ..., x(N_1), x(N_1+1), ..., x(N_1+N_2)
  m(y_1)          S^2(y_1)            y(1), ..., y(N_1)
  m(y~)           S^2(y~)             y(1), ..., y(N_1), y^(N_1+1), ..., y^(N_1+N_2)

  r: product moment correlation coefficient of x(1), ..., x(N_1) and y(1), ..., y(N_1).
  Here y^ denotes an estimate and y~ the extended record.
  *Variance computed with sample size minus 1 as the divisor.

THE FOUR METHODS OF EXTENSION

Regression (REG)

The first method of extension is linear regression. The missing values are filled in by the equation

$$\hat{y}(i) = a + b\,x(i) \qquad (1)$$

where the parameters $a$ and $b$ are those values which minimize

$$Z = \sum_{i=1}^{N_1} \left(\hat{y}(i) - y(i)\right)^2$$

The solution for $a$ and $b$ is found by solving the normal equations [Draper and Smith, 1966, p. 59]. The optimal solution to (1), rearranged for convenience, is

$$\hat{y}(i) = m(y_1) + r\,\frac{S(y_1)}{S(x_1)}\left(x(i) - m(x_1)\right) \qquad (2)$$

Matalas and Jacobs [1964] show that $m(\tilde{y})$ is an unbiased estimate of $\mu_y$ but $S^2(\tilde{y})$ is a biased estimate of $\sigma_y^2$ for $\rho^2 < 1.0$. Specifically,

$$E[m(\tilde{y})] = \mu_y$$
$$E[S^2(\tilde{y})] = \sigma_y^2\left\{1 - \frac{(1-\rho^2)\,N_2\,(N_1-4)}{(N_1+N_2-1)(N_1-3)}\right\}$$

Table 2 gives some example values of $E[S^2(\tilde{y})]/\sigma_y^2$ for various combinations of $\rho$, $N_1$, and $N_2$. For example, with $N_1 = N_2 = 10$ and $\rho = 0.5$ the ratio is $1 - (0.75)(10)(6)/[(19)(7)] = 0.66$. Given that a common purpose of record extension is the evaluation of the severity and duration of hydrologic extremes, this consistent underestimation of variance is an alarming feature of REG. It is in fact the intent of each of the following three methods to eliminate (or at least minimize) this bias in the variance.

TABLE 2. Values of $E[S^2(\tilde{y})]/\sigma_y^2$ Using the Regression (REG) Method of Record Extension

                     rho
  N_1    N_2    0.5    0.7    0.9
  10     10     0.66   0.77   0.91
  20     20     0.64   0.75   0.91
  30     30     0.63   0.75   0.91
  50     10     0.88   0.92   0.97
  10     50     0.46   0.63   0.86

Regression Plus Independent Noise (RPN)

Matalas and Jacobs [1964] demonstrated that unbiased estimates of mean and variance are achieved if the following equation is used to calculate $\hat{y}(i)$:

$$\hat{y}(i) = m(y_1) + r\,\frac{S(y_1)}{S(x_1)}\left(x(i) - m(x_1)\right) + \alpha\,(1-r^2)^{1/2}\,S(y_1)\,e(i) \qquad (3)$$

where $e(i)$ is a normal independent random variable with zero mean and unit variance and

$$\alpha^2 = \frac{N_2\,(N_1-4)(N_1-1)}{(N_2-1)(N_1-3)(N_1-2)}$$

This procedure of adding independent noise to regression estimates [Matalas and Jacobs, 1964, p. E4] '... is not too appealing. Independent studies of the same sequence of x and y by several investigators lead to different values of [$m(\tilde{y})$, $S^2(\tilde{y})$ and $\hat{y}(i)$, $i = N_1+1, \ldots, N_1+N_2$], because the same sequence of pseudo-random numbers is unlikely to be used by the investigators.' The estimates $\hat{y}(i)$ are hybrids: a weighted sum of the historically observed random variable $x(i)$ and an unrelated, computer generated random variable, $e(i)$. One may feel uncomfortable making a management decision when only one realization of these series of random numbers is used, and yet the interpretation of results from multiple realizations would be, at best, ambiguous because the realizations would not be independent of each other. Nevertheless, the RPN method does have the desired properties of an unbiased mean and variance, and it has found use in practice [Beard et al., 1970].

Maintenance of Variance Extension, Type 1 (MOVE.1)

An alternative to the RPN approach is to specify that the extension equation must be of the form given in (1) but that $a$ and $b$ are to be set not to minimize squared errors, but rather to maintain the sample mean and variance. The idea which led to the development of MOVE.1 was to find some values of $a$ and $b$ in (1) which satisfy the following two equalities:

$$\sum_{i=1}^{N_1} \hat{y}(i) = \sum_{i=1}^{N_1} y(i)$$
$$\sum_{i=1}^{N_1} \left(\hat{y}(i) - m(y_1)\right)^2 = \sum_{i=1}^{N_1} \left(y(i) - m(y_1)\right)^2$$

One such solution is

$$\hat{y}(i) = m(y_1) + \frac{S(y_1)}{S(x_1)}\left(x(i) - m(x_1)\right) \qquad (4)$$

In spite of the obvious similarity between (2) and (4), it should be recognized that they arise from completely different motivations. For the former it is the minimization of squared errors of $\hat{y}(i)$; for the latter it is the desire for the sample mean and variance of the $\hat{y}(i)$ to equal the sample mean and variance of the $y(i)$ for $i = 1, \ldots, N_1$. When this equation is used for record extension it can be shown that

$$E[m(\tilde{y})] = \mu_y \qquad (5a)$$
$$E[S^2(\tilde{y})] = \frac{\sigma_y^2}{N_1+N_2-1}\left\{N_1 - 1 + N_2\left[\rho^2 + \frac{(N_1-1)(1-\rho^2)}{N_1-3}\right]\right\} \qquad (5b)$$

Table 3 gives values of the ratio $E[S^2(\tilde{y})]/\sigma_y^2$ for various combinations of $\rho$, $N_1$, and $N_2$. It is clear from Tables 2 and 3 that the magnitude of the bias is substantially lower for MOVE.1 than for REG, and that REG underestimates variance, while MOVE.1 overestimates it. Note that $S^2(\tilde{y})$ is an asymptotically unbiased estimator of $\sigma_y^2$ as $N_1 \to \infty$.

TABLE 3. Values of $E[S^2(\tilde{y})]/\sigma_y^2$ Using the Maintenance of Variance Type 1 (MOVE.1) Method of Record Extension

                     rho
  N_1    N_2    0.5    0.7    0.9
  10     10     1.19   1.08   1.03
  20     20     1.05   1.03   1.03
  30     30     1.03   1.02   1.01
  50     10     1.01   1.00   1.00
  10     50     1.18   1.12   1.05

It should be noted that the estimates $\hat{y}(i)$ in (4) lie between the estimates of $y(i)$ from a regression of $y$ on $x$ and the estimate of $y(i)$ from the inverse of a regression of $x$ on $y$. This property was noted and discussed in a hydrologic context by Kritskiy and Menkel [1968], who called (4) 'the unique solution.' Till [1973] refers to this line as the reduced major axis and gives the standard error of the slope and intercept estimates; he also gives some applications in geomorphology. Additional information on the mathematics is found in the work by Kruskal [1953]. The first known reference to this line is by Pearson [1901], and discussions of its applications can be found in works by Imbrie [1956] (biometrics) and Greenall [1949] (psychology). Kirby [1974] has proposed least normal squares, which also results in an equation for a line falling between the regression of $x$ on $y$ and $y$ on $x$. His equation coincides with (4) only where $r = 1.0$ and/or $S(x_1) = S(y_1)$. Neither the line represented by (4) nor Kirby's line coincides with the line which bisects the angle between these two regression lines except where $r = 1.0$ and/or $S(x_1) = S(y_1)$.

Maintenance of Variance Extension, Type 2 (MOVE.2)

In MOVE.1 the only four parameters are the means and variances of $x$ and $y$ estimated from the first $N_1$ observations. In MOVE.2 these same parameters are used, but their estimation is based on more information. The extension equation is

$$\hat{y}(i) = \hat{\mu}(y) + \frac{\hat{\sigma}(y)}{S(x)}\left[x(i) - m(x)\right] \qquad (6)$$

The mean and variance estimates for $x$ are based on all $N_1 + N_2$ observations, and the mean and variance estimates for $y$ are based on the historical values of $y$ and on information transfer from the $x$ sequence. The parameters $\hat{\mu}(y)$ and $\hat{\sigma}^2(y)$ were developed by Matalas and Jacobs [1964] and are themselves unbiased estimates of $\mu_y$ and $\sigma_y^2$:

$$\hat{\mu}(y) = m(y_1) + \frac{N_2}{N_1+N_2}\,r\,\frac{S(y_1)}{S(x_1)}\left(m(x_2) - m(x_1)\right) \qquad (7)$$

$$\hat{\sigma}^2(y) = \frac{1}{N_1+N_2-1}\left[(N_1-1)S^2(y_1) + (N_2-1)\,r^2\,\frac{S^2(y_1)}{S^2(x_1)}\,S^2(x_2) + (N_2-1)\,\alpha^2(1-r^2)\,S^2(y_1) + \frac{N_1 N_2}{N_1+N_2}\,r^2\,\frac{S^2(y_1)}{S^2(x_1)}\left(m(x_2) - m(x_1)\right)^2\right] \qquad (8)$$

where $\alpha^2$ is as defined for (3). The complexity of these expressions prevented the discovery of an analytical solution for the bias of the mean and variance. In Monte Carlo experiments of 2000 trials with each of 15 different combinations of $\rho$, $N_1$, and $N_2$, the hypothesis that $E[m(\tilde{y})] = \mu_y$ could not be rejected at the 0.05 significance level. The results for $S^2(\tilde{y})$ are shown in Table 4. These results suggest that for all practical purposes, MOVE.2 satisfies the need for an extension method which produces unbiased values of $m(\tilde{y})$ and $S^2(\tilde{y})$.

TABLE 4. Values of Sample Mean of $S^2(\tilde{y})/\sigma_y^2$ Using the Maintenance of Variance, Type 2 (MOVE.2) Method of Record Extension

                     rho
  N_1    N_2    0.5    0.7    0.9
  10     10     0.99   0.99   1.02*
  20     20     0.99   1.00   1.00
  30     30     0.99   0.99   1.00
  50     10     1.00   0.99   1.01
  10     50     0.99   1.01   1.00

  Based on 2000 Monte Carlo trials.
  *The hypothesis H_0: E[S^2(y~)]/sigma_y^2 = 1.00 is rejected at the 5% level.

MONTE CARLO EVALUATION OF ERRORS

In estimating frequency distributions of streamflows, the hydrologist will not only consider the statistical moments of the sample but also some of the extreme order statistics. If some method of record extension introduces a bias into the value of the more extreme order statistics, this will lead to bias in the estimates of the probability of exceedance of selected extreme values or, conversely, bias in the estimation of distribution quantiles.

Let $\omega$ represent a statistic of a (simulated) historic time series (such as a moment or order statistic) and $\tilde{\omega}$ represent the same statistic of a (simulated) extended record. The error in any realization of $\omega$ is $(\omega - E[\omega])$, and the error in any realization of $\tilde{\omega}$ is $(\tilde{\omega} - E[\omega])$, where $E[\omega]$ is the expectation of the statistic for the sample size and population being considered. The estimated bias of $\tilde{\omega}$, $B(\tilde{\omega})$, is the average error $(\tilde{\omega} - E[\omega])$ over a large number of Monte Carlo trials (2000 trials in this case). The root mean square error for statistic $\omega$ is denoted $R(\omega)$, and for the statistic $\tilde{\omega}$ the root mean square error is $R(\tilde{\omega})$.

The statistics $\omega$ considered here are $m(y)$, $S^2(y)$, $y_1$, $y_2$, and $y_5$, where $y_k$ is the $k$th order statistic of the series $y(i)$, $i = 1, \ldots, N_1+N_2$. The statistics $\tilde{\omega}$ are $m(\tilde{y})$, $S^2(\tilde{y})$, $\tilde{y}_1$, $\tilde{y}_2$, and $\tilde{y}_5$, where $\tilde{y}_k$ is the $k$th order statistic of the series $\tilde{y}(i)$, $i = 1, \ldots, N_1+N_2$. The Monte Carlo trials were carried out for $(N_1, N_2)$ values of (10, 50) and (20, 40) and for $\rho$ values of 0.5, 0.7, and 0.9. In all cases the flows are assumed to be bivariate normal with $\mu_x = \mu_y = 0$ and $\sigma_x^2 = \sigma_y^2 = 1$, with no autocorrelation.

The results of these trials are given in Table 5. For each statistic, the null hypothesis that $B(\omega)$ or $B(\tilde{\omega})$ equals zero (unbiased) was tested, and those cases found significant at the 5% level are marked by the asterisk. The $B(\omega)$ are unbiased by construction.

TABLE 5. Biases B( ) and Root Mean Square Error R( ) From 2000 Monte Carlo Trials

                                     B(omega) or B(omega~)                           R(omega) or R(omega~)
  N_1  N_2  rho  Method    m(y)     S^2(y)    y_5       y_2       y_1       m(y)   S^2(y)  y_5    y_2    y_1
  10   50   0.5  HIST      0.001    0.004    -0.002    -0.002    -0.006    0.131  0.183   0.239  0.332  0.459
  10   50   0.5  REG      -0.003   -0.557*    0.603*    0.674*    0.656*   0.294  0.658   0.784  0.871  0.910
  10   50   0.5  RPN      -0.005   -0.001     0.030*    0.067*    0.097*   0.306  0.517   0.487  0.616  0.734
  10   50   0.5  MOVE.1   -0.009    0.164*   -0.059*   -0.069*   -0.090*   0.334  0.783   0.593  0.758  0.930
  10   50   0.5  MOVE.2   -0.004   -0.017     0.044*    0.083*    0.102*   0.292  0.487   0.467  0.590  0.726

  10   50   0.7  HIST     -0.001    0.003    -0.000    -0.005    -0.010    0.128  0.184   0.233  0.323  0.442
  10   50   0.7  REG       0.004   -0.362*    0.358*    0.447*    0.476*   0.251  0.533   0.578  0.707  0.804
  10   50   0.7  RPN       0.003    0.007     0.021*    0.041*    0.057*   0.262  0.458   0.439  0.563  0.705
  10   50   0.7  MOVE.1    0.002    0.137*   -0.045*   -0.069*   -0.090*   0.271  0.670   0.501  0.641  0.803
  10   50   0.7  MOVE.2    0.004    0.004     0.031*    0.046*    0.058*   0.250  0.445   0.427  0.533  0.670

  10   50   0.9  HIST      0.004   -0.002     0.001     0.010     0.013    0.129  0.186   0.235  0.333  0.448
  10   50   0.9  REG       0.003   -0.148*    0.129*    0.176*    0.217*   0.186  0.352   0.367  0.491  0.610
  10   50   0.9  RPN       0.004   -0.013     0.026*    0.037*    0.050*   0.191  0.346   0.342  0.472  0.589
  10   50   0.9  MOVE.1    0.002    0.040*   -0.012    -0.018*   -0.017    0.188  0.382   0.348  0.473  0.603
  10   50   0.9  MOVE.2    0.003   -0.011     0.022*    0.033*    0.047*   0.186  0.333   0.340  0.456  0.577

  20   40   0.5  HIST      0.002   -0.000    -0.001     0.004     0.001    0.130  0.187   0.245  0.334  0.452
  20   40   0.5  REG       0.006   -0.484*    0.476*    0.470*    0.446*   0.208  0.533   0.582  0.624  0.687
  20   40   0.5  RPN       0.006   -0.006     0.023*    0.044*    0.067*   0.228  0.347   0.375  0.472  0.586
  20   40   0.5  MOVE.1    0.005    0.054*   -0.009    -0.016    -0.014    0.227  0.418   0.386  0.510  0.641
  20   40   0.5  MOVE.2    0.005   -0.013*    0.027*    0.049*    0.077*   0.208  0.325   0.341  0.443  0.556

  20   40   0.7  HIST     -0.004    0.003    -0.008    -0.003    -0.011    0.128  0.184   0.237  0.334  0.458
  20   40   0.7  REG      -0.007*  -0.314*    0.267*    0.327*    0.344*   0.185  0.412   0.423  0.531  0.624
  20   40   0.7  RPN      -0.007    0.010    -0.005     0.010     0.026*   0.199  0.330   0.343  0.447  0.560
  20   40   0.7  MOVE.1   -0.005    0.052*   -0.027*   -0.022*   -0.031*   0.194  0.375   0.355  0.470  0.604
  20   40   0.7  MOVE.2   -0.006    0.003    -0.002     0.022*    0.037*   0.186  0.313   0.328  0.427  0.544

  20   40   0.9  HIST      0.002   -0.003     0.006     0.015*    0.005    0.129  0.180   0.240  0.331  0.459
  20   40   0.9  REG       0.004   -0.124*    0.099*    0.138*    0.156*   0.151  0.268   0.296  0.395  0.521
  20   40   0.9  RPN       0.006*  -0.004     0.010     0.019*    0.019    0.158  0.259   0.285  0.389  0.535
  20   40   0.9  MOVE.1    0.004    0.014*   -0.007    -0.001    -0.012    0.152  0.257   0.283  0.383  0.524
  20   40   0.9  MOVE.2    0.004   -0.005     0.005     0.019*    0.017    0.151  0.245   0.280  0.375  0.510

  The rows marked HIST represent calculations based on (N_1 + N_2) observations and involving no record extension.
  E(omega) = 0.000, 1.000, -1.430, -1.935, -2.319 for the five statistics, respectively.
  *Bias significant at the 5% level.

For REG all of the statistics except $m(\tilde{y})$ appear to be biased for all of the cases considered. The absolute value of the bias decreases with an increase in $\rho$ or an increase in $N_1$. The biases for REG all show the regression towards the mean: the estimated variances are too low, and the low-order statistics are too high. For example, with $N_1 = 10$, $N_2 = 50$, and $\rho = 0.5$, the second order statistic is, on the average, -1.261 rather than -1.935, which is its expectation; thus $B(\tilde{y}_2) = (-1.261) - (-1.935) = 0.674$.

For the other three methods the absolute values of the biases for all $\tilde{\omega}$ except $m(\tilde{y})$ are in all cases substantially less than the bias with REG. For RPN there were no cases of significant bias for $S^2(\tilde{y})$, and for the order statistics the bias was either significant and positive or not significant. For MOVE.1 the bias for $S^2(\tilde{y})$ was in all cases positive and significant, and the bias for the order statistics was either negative and significant or not significant. In no case did the results show the bias to be significantly (at the 5% level) different from the value determined by (5b). For MOVE.2, in only one case was there significant bias in $S^2(\tilde{y})$ (negative bias for $N_1 = 20$, $N_2 = 40$, $\rho = 0.5$), and it was not significant at the 2.5% level. The biases in the order statistics were either positive and significant or not significant.

All biases observed in $m(\tilde{y})$ were very small. Two of them were significant, but even if all methods were in fact unbiased in $m(\tilde{y})$, it would not be unreasonable to find two significantly biased values (at the 5% level) out of 24 cases. Neither of these two cases is significant at the 2.5% level.

Thus, in summary, one can conclude that all four methods produce extended records that are unbiased in the mean, that REG substantially reduces variability, that MOVE.1 slightly increases variability, and that RPN and MOVE.2 preserve the variance but slightly decrease variability in the more extreme events.

The method with the lowest value of the root mean square error (RMSE) for all statistics except $m(\tilde{y})$ is MOVE.2. For $m(\tilde{y})$ there is little difference between the RMSE of REG and of MOVE.2. For all methods and all statistics the RMSE decreases with an increase in $N_1$ or an increase in $\rho$.

EMPIRICAL CHECK OF THE METHODS

In the previous sections of this paper, the random variable being considered had very simple and hydrologically unrealistic properties (normal, independent, and without a cyclic component). In this section some real data are used to explore the techniques under conditions of nonnormal distribution, serial dependence, and seasonal cycles. The data used are monthly volume data. Some decisions were necessary on the particulars of how to apply the four techniques to such data.

The first decision was whether to use only one extension equation for all months, 12 different ones for the 12 months, or to make some compromise such as two or four seasonal equations. The choice involves a trade-off between greater sample size for estimating the parameters versus the ability to preserve real month to month differences that may exist in the base station to short-record station relationship. The choice made here, based on some experiments with the data, was to use a single extension equation for all of the months in each of the four techniques. This problem has been explored by Alley and Burns [1981], who provide a means for selecting the best method for any given case.

The second decision was whether to do the extension with the real data or with their logarithms (that is, take logarithms of all of the data, do the extension, and then take antilogs of these extended records). This question has been dealt with by Hirsch [1979] and, in a somewhat different context, by Stedinger [1980].
In both cases the conclusion was to work with the logarithms, and this was the approach taken here. The consequences of this choice are that for any of the techniques, the sample mean of the extended record of the logarithms is an unbiased estimate of the mean of the logarithms, but the sample mean of the extended record of flows is not an unbiased estimate of the mean of the flows. However, the objective of the technique is to produce records with sample cumulative distribution functions (CDF's) which are close approximations of the CDF's of actual records (particularly in the tails). Where the station to station relationship more nearly approximates a bivariate lognormal distribution rather than a bivariate normal, transforming to logarithms will better achieve the desired result.

The evaluation described here will focus on the ability of the techniques to produce low-flow duration intensity characteristics like those of the actual records they are intended to represent. The example used concurrent, 50-year-long (commencing in 1928) U.S. Geological Survey monthly volume records from seven stream gaging stations in west central Virginia. Table 6 provides some information on the stations, their drainage basins, and the observed flow characteristics.

TABLE 6. Information on the Seven Gages

  Column key: Area = drainage area, mi^2; Elev = mean basin elevation, feet; Forest = forest cover, percent.
  Monthly volumes: Mean, 10^6 m^3; Cv = coefficient of variation; Cs = coefficient of skewness.
  Logarithms of monthly volumes: SD = standard deviation; Cs = coefficient of skewness; Ck = coefficient of kurtosis.

                                                                            Monthly volumes       Logarithms of monthly volumes
  Station   Location                                 Area  Elev  Forest    Mean   Cv     Cs       SD     Cs      Ck
  02018000  Craig Creek at Parr, Va.                 329   2150  89        28.3   0.92   1.26     1.04   -0.12   1.81
  02017500  Johns Creek at New Castle, Va.           104   2210  91         9.3   0.97   1.33     1.11   -0.12   1.80
  02055000  Roanoke River at Roanoke, Va.            395   1680  75        26.3   0.83   1.65     0.83   -0.02   2.12
  02013000  Dunlap Creek near Covington, Va.         164   2230  88        12.3   1.02   1.48     1.12   -0.00   1.76
  02016000  Cowpasture River near Clifton Forge, Va. 461   2030  82        38.6   0.88   1.33     0.93   -0.03   1.87
  03167000  Reed Creek at Grahams Forge, Va.         247   2500  44        19.7   0.78   1.59     0.74    0.21   2.01
  03170000  Little River at Graysonton, Va.          300   2470  47        27.0   0.57   1.54     0.54   -0.03   2.59

The experimental design is the following.

1. Each station is, in turn, considered to be the short-record station with only 10 years of data available. The historic values of the first, second, and fifth-order statistics in the annual series of minimum 1-month, 3-month, and 6-month volumes from the entire 50-year record are recorded. They are denoted $Q(r, d, is)$ where $r = 1, 2, 5$ (index of order or rank), $d = 1, 3, 6$ (duration in months), and $is = 1, 2, \ldots, 7$ (index of short-record station). The year is considered to be the climatic year commencing in April.
2. For each given short-record station, each of the other six stations is considered, in turn, to be the base station with the full 50-year record available.
3. For each short-record station-base station pair, the period of record overlap is considered, in turn, to be years 1-10, 11-20, 21-30, 31-40, and 41-50, respectively.
4. For each short-record station-base station-overlap period (there are 7 x 6 x 5 = 210 such 3-tuples), each of the four extension methods is applied to the logarithms of the streamflow volume data. From each of these extended flow records the same flow statistics are computed. They are denoted $\hat{Q}(r, d, is, ib, sq, m)$, where $ib = 1, 2, \ldots, 7$ (index of base station), $ib \neq is$, $sq = 1, 2, \ldots, 5$ (index of sequence), and $m$ = {REG, RPN, MOVE.1, MOVE.2}.
5. The measure of accuracy of each estimate is $U(r, d, is, ib, sq, m)$:

$$U(r, d, is, ib, sq, m) = \frac{\hat{Q}(r, d, is, ib, sq, m)}{Q(r, d, is)}$$

The correlation coefficient (of the log values) was also recorded for each of the 210 3-tuples. The median correlation coefficient is 0.90, the lower and upper quartiles are 0.85 and 0.93, and the minimum and maximum are 0.71 and 0.99.

The accuracy of the methods is summarized in Figures 1, 2, and 3.
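
The accuracy measure in step 5 can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure actually coded for the study: the function names (`annual_min_volumes`, `low_flow_stat`, `accuracy_ratio`) are hypothetical, the monthly series is assumed to start at the first month of a climatic year (April) and to contain whole years, and a d-month window is assumed not to cross a climatic-year boundary, a detail the text does not specify.

```python
def annual_min_volumes(monthly, d):
    """Minimum d-month total volume within each 12-month climatic year.

    Assumes `monthly` starts at the first month of a climatic year and
    holds whole years only; windows do not cross year boundaries.
    """
    years = [monthly[k:k + 12] for k in range(0, len(monthly), 12)]
    return [min(sum(yr[j:j + d]) for j in range(12 - d + 1)) for yr in years]


def low_flow_stat(monthly, r, d):
    """r-th order statistic (r = 1 is the lowest) of the annual minimum d-month volumes."""
    return sorted(annual_min_volumes(monthly, d))[r - 1]


def accuracy_ratio(extended, historic, r, d):
    """U = Q_hat / Q: statistic from the extended record over the historic one."""
    return low_flow_stat(extended, r, d) / low_flow_stat(historic, r, d)
```

With a 50-year monthly record, calling `accuracy_ratio(extended, historic, r, d)` for r in (1, 2, 5) and d in (1, 3, 6) would give the nine U values for one station, base station, overlap period, and method; a value above 1.0 means the extended record's low flow is higher than the historic one.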
In these figures the box plots represent the distribution of all 210 U values for a given rank, duration, and method. The accuracy of the method may be judged by the degree of dispersion in the box plot for that method, by the degree that the median approaches a value of 1.0, and by the symmetry of the box about a value of 1.0. The following are some observations about the results.

1. Both MOVE.1 and MOVE.2 have median values for U very close to 1.0 (in fact, between 0.980 and 1.007) for all flow statistics.

2. At all durations REG had median U values > 1.00 (the most extreme being 1.11). For durations of 1 and 3 months and for all three order statistics, approximately 75% of the computed values were higher than the historical values. This result was expected, of course, given the tendency for regression to reduce variance, resulting in low extremes which are 'too high.'

3. For RPN at a 1-month duration, the median U values as well as the upper quartile U values are < 1.00. This arises because of the kurtosis of the log volume values. For all seven streams, the kurtosis is significantly (α = 0.01) less than 3.0. For the three other methods, REG, MOVE.1, and MOVE.2, the kurtosis of the extended record will closely approximate the kurtosis from the base station. But in RPN the kurtosis of the extended record will be some weighted average (depending on r) of the base station kurtosis and the kurtosis of the noise term (a value of 3). Thus the kurtosis of the extended records using RPN will be higher than that of the historic records. Given the approximate preservation of variance and inflation of kurtosis, the extremes of the extended records will be more extreme than the historic. It is not known how common it is for kurtosis values for such time series to fall in the range 1.5-2.5 (as these seven records do). It is not, however, unexpected. If one postulates that log monthly volumes are the sum of a normal random component and a sine curve of a period of 1 year (with the random component standard deviation equal to about one third of the semiamplitude of the sine curve), the kurtosis of the sum is approximately 2.

4. For RPN at a duration of 6 months the median U values are > 1.00, and in fact, nearly 75% of all U values are > 1.00. Thus RPN extended records have too low low-flow values at a 1-month duration and too high low-flow values at a 6-month duration. The reason for these too high values at 6-month duration is that the noise in (3) is independent. Thus the estimated portion ($i = N_1 + 1, \ldots, N_1 + N_2$) of the extended record will have a serial correlation coefficient lower (in absolute value) than the base station, but for the other three methods it will match the base station exactly.

5. At a 3-month duration the U values for RPN are reasonably symmetrical about a value of 1.0. Presumably, the two contrary effects described above (3 and 4) are approximately in balance at this duration.

6. The box plots of U values for MOVE.1 and MOVE.2 are very similar in all cases. In general, the interquartile range and overall range for MOVE.2 are slightly smaller than for MOVE.1.

Fig. 1. Box plots of U values for 1-month-duration low flows. Box plots show median, upper and lower quartiles, and maximum and minimum values. The sample size is 210. [Panels: order statistic 1 (lowest in 50-year record), order statistic 2 (second lowest), order statistic 5 (fifth lowest); each panel compares REG, RPN, MOVE.1, and MOVE.2.]

Fig. 2. Box plots of U values for 3-month-duration low flows. Box plots show median, upper and lower quartiles, and maximum and minimum values. The sample size is 210. [Panels and methods as in Fig. 1.]

Fig. 3. Box plots of U values for 6-month-duration low flows. Box plots show median, upper and lower quartiles, and maximum and minimum values. The sample size is 210. [Panels and methods as in Fig. 1.]

An additional question considered here was the matter of selecting a base station from among a variety of possible long-record stations located nearby. The hypothesis considered was that using the base station with the highest correlation coefficient (of the log values) for the 10-year overlap period would result in a smaller dispersion of U values than was found using all possible base stations. This hypothesis was borne out by the results: The interquartile ranges of U values using only the best correlated base station were smaller for all nine order statistics considered. The average ratio of the interquartile range using these selected base stations to the interquartile range for all possible base stations was 0.63. Thus maximum sample correlation appears to be a reasonable selection criterion; obviously, others might exist which are better.

SUMMARY AND CONCLUSIONS

In using record extension, as applied in the preceding section, one is transferring the characteristics of distribution shape, serial correlation, and seasonality from the base station to the short-record station with adjustments of location and scale appropriate to the short-record station.
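The location-and-scale transfer described here can be illustrated with a minimal Python sketch. It is not the paper's code. The synthetic base series (a 1-year sine curve plus normal noise with standard deviation one third of the semiamplitude, as postulated in observation 3), the target mean and standard deviation, and the helper names `kurtosis`, `lag1_autocorr`, and `location_scale_transfer` are all assumptions made for illustration. The point is only that a positive-slope linear map leaves kurtosis and lag-1 serial correlation unchanged while matching the chosen location and scale.

```python
import numpy as np

def kurtosis(x):
    """Moment-based kurtosis (a normal distribution gives 3)."""
    d = np.asarray(x, dtype=float) - np.mean(x)
    return np.mean(d**4) / np.mean(d**2) ** 2

def lag1_autocorr(x):
    """Lag-1 serial correlation coefficient."""
    d = np.asarray(x, dtype=float) - np.mean(x)
    return np.sum(d[1:] * d[:-1]) / np.sum(d**2)

def location_scale_transfer(x, target_mean, target_std):
    """Hypothetical helper: rescale x to a target mean and standard deviation."""
    x = np.asarray(x, dtype=float)
    return target_mean + target_std * (x - x.mean()) / x.std(ddof=1)

rng = np.random.default_rng(0)
months = np.arange(12 * 50)          # 50 years of monthly values (assumed)
amplitude = 1.0                      # semiamplitude of the seasonal sine (assumed)
base = (amplitude * np.sin(2 * np.pi * months / 12)
        + rng.normal(0.0, amplitude / 3, months.size))

# Arbitrary target moments standing in for the short-record station.
extended = location_scale_transfer(base, target_mean=5.0, target_std=0.8)

print("kurtosis  base / extended:", kurtosis(base), kurtosis(extended))
print("lag-1 r   base / extended:", lag1_autocorr(base), lag1_autocorr(extended))
```

For this synthetic series the population kurtosis works out to 3 - (3/8)/(11/18)^2, or about 2.0, which agrees with the value quoted in observation 3; the sample value printed above will scatter around that figure.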
The analytical derivations, the Monte Carlo study of samples from bivariate normal populations, and the empirical analysis all show that REG and RPN fall substantially short of achieving the desired result of creating a realistic extended record. Specifically, REG cannot be expected to provide records with the appropriate variability and RPN cannot be expected to provide records with the appropriate distribution shape or serial correlation. MOVE.1 and MOVE.2 differ in the ways that they utilize the available data for parameter estimation (the latter uses more of the base station data than does the former). The differences in their performance, as seen in the evaluation of biases of moments and order statistics and in the empirical study, all show MOVE.2 to have slightly more desirable properties than MOVE.1 and rather substantially better properties than REG or RPN.

Of course, if the base station and short record station have substantial differences in terms of distribution shapes, serial correlation, or seasonality, then these methods cannot be expected to perform very well, because they will transfer these characteristics from the base station to the short record station.

The intent of record extension is to produce a time series which is relatively long and is also plausible; that is, one that possesses statistical characteristics believed to be like those of an actual record for the station. The reason for producing such a record is for use in simulation and optimizations related to potential water management decisions. The use of regression for these purposes is inappropriate. The aim of regression is to provide the 'best estimate' (minimum squared error) for each individual streamflow value and not to preserve any particular statistical characteristics of the record.
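The contrast drawn above between regression and the variance-preserving methods can be sketched numerically. The following Python example is a hedged illustration, not a reproduction of the paper's experiments: the AR(1) data generator, the record lengths, the parameter values `phi` and `rho`, and all variable names are assumptions. The RPN noise scale used here, sqrt(1 - r^2) times the sample standard deviation, is a simplified form that omits any small-sample correction, and the MOVE.1 line is written in its line-of-organic-correlation form (slope equal to the ratio of standard deviations, with the sign of r).

```python
import numpy as np

def ar1(n, phi, rng):
    """AR(1) series with unit marginal variance (illustrative generator)."""
    z = np.empty(n)
    z[0] = rng.normal()
    for t in range(1, n):
        z[t] = phi * z[t - 1] + np.sqrt(1 - phi**2) * rng.normal()
    return z

def lag1_autocorr(x):
    """Lag-1 serial correlation coefficient."""
    d = np.asarray(x, dtype=float) - np.mean(x)
    return np.sum(d[1:] * d[:-1]) / np.sum(d**2)

rng = np.random.default_rng(1)
n_total, n_overlap = 6000, 1200      # arbitrary lengths, long enough to limit sampling noise
phi, rho = 0.6, 0.8                  # assumed serial and cross correlation

x = ar1(n_total, phi, rng)                                     # base station
y = rho * x + np.sqrt(1 - rho**2) * ar1(n_total, phi, rng)     # short-record station

# Parameters are estimated from the overlap period only.
x1, y1 = x[:n_overlap], y[:n_overlap]
x2 = x[n_overlap:]                   # base-station values used for extension
mx, my = x1.mean(), y1.mean()
sx, sy = x1.std(ddof=1), y1.std(ddof=1)
r = np.corrcoef(x1, y1)[0, 1]

reg = my + r * (sy / sx) * (x2 - mx)                           # regression
rpn = reg + np.sqrt(1 - r**2) * sy * rng.normal(size=x2.size)  # plus independent noise
move1 = my + np.sign(r) * (sy / sx) * (x2 - mx)                # organic correlation line

for name, series in [("REG", reg), ("RPN", rpn), ("MOVE.1", move1)]:
    print(name, "std:", series.std(ddof=1), "lag-1 r:", lag1_autocorr(series))
```

In this setup the REG series has a standard deviation smaller than the MOVE.1 series by exactly the factor |r|, and the RPN series recovers the variance but shows a lower lag-1 serial correlation than the base station because its noise term is independent from month to month.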
The three other methods are attempts to produce series which do preserve certain desired characteristics. The results of this study suggest that MOVE.2 is the most effective of all four of the methods in terms of producing time series with properties (such as variance and extreme order statistics) most like the properties of the records they are intended to represent.

Acknowledgments. Ed Gilroy of the U.S. Geological Survey helped with the derivation of the bias of the variance for MOVE.1. Bill Kirby of the U.S. Geological Survey suggested the idea of MOVE.2 as a variant of MOVE.1.

REFERENCES

Aitchison, J., and J. A. C. Brown, The Log-normal Distribution, Cambridge University Press, London, 1957.
Alley, W. A., and A. W. Burns, Mixed-station extension of monthly streamflow records, report, U.S. Geol. Surv., Reston, Virginia, 1981.
Beard, L. R., A. J. Fredrich, and E. F. Hawkins, Estimating monthly streamflows within a region, Tech. Pap. 18, 14 pp., The Hydrol. Eng. Center, U.S. Army Corps of Eng., Davis, Calif., 1970.
Benson, M. A., and N. C. Matalas, Synthetic hydrology based on regional statistical parameters, Water Resour. Res., 3, 931-936, 1967.
Crawford, N. H., and R. K. Linsley, Jr., Digital simulation in hydrology: Stanford watershed model IV, Tech. Rep. 39, Dep. of Civ. Eng., Stanford Univ., Stanford, Calif., July 1966.
Draper, N. R., and H. Smith, Applied Regression Analysis, John Wiley, New York, 1966.
Fiering, M. B., Streamflow Synthesis, Harvard University Press, Cambridge, Massachusetts, 1967.
Greenall, P. D., The concept of equivalent scores in similar tests, Br. J. Stat. Psychol., 2, 30-40, 1949.
Hazen, A., Storage required for the regulation of streamflow, Trans. Am. Soc. Civ. Eng., 77, 1914.
Hirsch, R. M., An evaluation of some record reconstruction techniques, Water Resour. Res., 15, 1781-1790, 1979.
Imbrie, J., Biometrical methods in the study of invertebrate fossils, Bull. Am. Mus. Nat. Hist., 108, 215-52, 1956.
Kirby, W., Straight line fitting of an observation path by least normal squares, U.S. Geol. Surv. Open-File Rep., 74-187, 1974.
Kritskiy, S. N., and M. F. Menkel, Some statistical methods in the analysis of hydrologic series, Sov. Hydrol. Select. Pap., 7, 80-98, 1968.
Kruskal, W. H., On the uniqueness of the line of organic correlation, Biometrics, 9, 47-58, 1953.
Leclerc, G., and J. C. Schaake, Jr., Methodology for assessing the potential impact of urban development on urban runoff and the relative efficiency of runoff control alternatives, Rep. 167, Dep. of Civ. Eng., Mass. Inst. of Technol., Cambridge, Mass., 1973.
Matalas, N. C., Mathematical assessment of synthetic hydrology, Water Resour. Res., 3, 937-945, 1967.
Matalas, N. C., and B. A. Jacobs, A correlation procedure for augmenting hydrologic data, U.S. Geol. Surv. Prof. Pap., 434-E, 1964.
Pearson, K., On lines and planes of closest fit to systems of points in space, Philos. Mag., 2, 559-572, 1901.
Riggs, H. C., Low-flow investigations, U.S. Geol. Surv. Tech. Water Resour. Invest., 4, 1972.
Rippl, W., The capacity of storage reservoirs for water supply, Proc. Inst. Civ. Eng., 71, 1883.
Stedinger, J. R., Fitting log normal distributions to hydrologic data, Water Resour. Res., 16, 481-490, 1980.
Till, R., The use of linear regression in geomorphology, Area, 5, 303-308, 1973.

(Received October 29, 1981; revised March 9, 1982; accepted March 22, 1982.)