Supporting Information

1. Analysis of waterlevel observation errors

Hill and Tiedeman (2007) defined the observation error as “error related to any aspect of the observation not accounted for by the model considered, for which the expected value is zero”, and suggested that, in addition to measurement error, some types of model error should also be lumped into the waterlevel target error for accurate weighting during model calibration. For example, errors due to unmodeled spatial heterogeneity, temporal variability and vertical averaging over long well screen intervals were included when assessing the errors associated with head targets in the calibration of a groundwater flow model for Northeastern Illinois (Meyer, Roadcap, Lin, & Walker, 2009). When calibrating a model for the upper Klamath Basin in Oregon and California, Gannett et al. (2012) accounted for model errors by adding 10 ft. to the standard deviations of head measurement error. We argue that these model errors are better accommodated by our complementary modeling framework, in which they are modeled by the error-mapping DDMs. Consequently, the expected error of the DDM-updated waterlevel comprises solely the irreducible errors associated with the measuring devices and the measurement location. Three types of error are discussed in the following paragraphs. Each is assumed to be independent Gaussian noise with zero mean and its own variance.

Altitude uncertainty

The altitude uncertainty of each well was calculated from the altitude accuracy code given in the NWIS database. We follow the recommendation of Hill and Tiedeman (2007) and interpret the ± accuracy code as the 95% confidence interval. Further assuming that the error is normally distributed, the accuracy code equals 1.96 times the standard deviation. The standard deviation of the error associated with altitude uncertainty is therefore calculated as $\sigma_v$ = (altitude accuracy code)/1.96.
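The conversion from a ± accuracy code to a standard deviation can be sketched as follows. This is a minimal illustration under the stated assumptions (normally distributed error, code read as the half-width of a 95% confidence interval); the function name `sigma_from_accuracy` and the sample accuracy values are hypothetical and are not taken from the NWIS records used in the study.

```python
Z_95 = 1.96  # two-sided 95% quantile of the standard normal distribution


def sigma_from_accuracy(accuracy, z=Z_95):
    """Convert a +/- accuracy value, read as the half-width of a 95%
    confidence interval, into a standard deviation (same units)."""
    if accuracy < 0:
        raise ValueError("accuracy must be non-negative")
    return accuracy / z


# Illustrative accuracy values in feet (hypothetical inputs).
for accuracy_ft in (0.01, 10.0, 50.0):
    sigma_v = sigma_from_accuracy(accuracy_ft)
    print(f"+/-{accuracy_ft} ft -> sigma_v = {sigma_v:.3f} ft")
```

The same helper would apply to any other quantity whose reported accuracy is interpreted as a 95% interval.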
For the RRCA case study, $\sigma_v$ ranges from 0.005 to 25.5 ft. and is within 10.2 ft. for 99% of the measurements used in this study. For the SVRP case study, $\sigma_v$ ranges from 0.005 to 25.5 ft. and is within 5.1 ft. for 95% of the measurements used in this study.

Location uncertainty

This uncertainty is the part of the waterlevel observation error that is due to error in the measured well locations. Its standard deviation is calculated as $\sigma_h = J R \sigma_{lat/long}$, where

J - the gradient of waterlevel within the model domain.
R - the radius of the earth.
$\sigma_{lat/long}$ - the standard deviation of the latitude/longitude coordinate error, calculated from the lat/long coordinate accuracy code given in the NWIS dataset: $\sigma_{lat/long}$ = (lat/long coordinate accuracy code)/1.96.

For the RRCA case study, $\sigma_{lat/long}$ ranges from 0.0093 to 55.8 ft. and is within 9.3 ft. for 99% of the measurements. For the SVRP case study, $\sigma_{lat/long}$ ranges from 0.093 to 9.3 ft. and is within 0.5 ft. for 99% of the measurements in the dataset of this study.

Measurement error

The accuracy of a depth-to-water measurement depends on the method used. Unfortunately, for RRCA no information is available about the method used to measure the water level. Because the aquifer is shallow and most measurements were taken during the non-irrigation season, when disturbance due to pumping is low, steel and electric tapes are likely to have been the primary methods. According to Thornhill (1989), the maximum difference between independent measurements using the same tape should be within ± 0.1 foot at a depth of around 500 ft. Thus the standard deviation of measurement error $\sigma_m$ is assumed to be within 0.26 ft. For the SVRP case study, the depth-to-water measurement method for the September 2004 synoptic survey was recorded by Campbell (2005), while no such record is available for the other measurements.
As stated in the SVRP model documentation (Hsieh et al., 2007), the measurement error during the synoptic survey is assumed to be 5 ft. for waterlevels measured by the airline, calibrated airline and pressure gage methods, and 0.02 ft. for those measured by the steel and electric tape methods. The measurement error of all other measurements is assumed to be 0.02 ft. Therefore the standard deviation of measurement error $\sigma_m$ ranges from 0.01 ft. (0.02/1.96) to 2.55 ft. (5/1.96), and is 0.01 ft. for most measurements.

Total aleatoric error

In total, the variance of the aleatoric error of the waterlevel measurements is

$$\sigma_H^2 = \sigma_v^2 + \sigma_h^2 + \sigma_m^2.$$

For the RRCA case study, $\sigma_H$ ranges from 0.26 to 55.9 ft. and is within 11.2 ft. for 99% of the measurements. For the SVRP case study, $\sigma_H$ ranges from 0.095 to 27.2 ft. and is less than 5.1 ft. for 99% of the measurements in the dataset.

2. Use of cluster analysis

For the RRCA case study, cluster analysis was performed before constructing the DDMs because 1) local patterns were found in the error of the MODFLOW model, so localized DDMs adapted to each cluster are expected to perform better than a global DDM based on the whole dataset; and 2) partitioning the dataset into smaller clusters reduces the computational cost of training and cross-validating the DDMs. In the temporal prediction scenario, the wells were grouped into 10 clusters according to their spatial location and the first and second moments of the error associated with each well. The clusters are indicated by different colors in Figure S1. The clustering results are not shown for the spatial and spatiotemporal scenarios because the cluster analysis was carried out in a four-dimensional feature space and therefore cannot be visualized. No local pattern was found in the SVRP case, so cluster analysis was not performed there.

Figure S1. Wells grouped into 10 clusters, each represented by one color. Temporal prediction scenario for the RRCA case study.

3.
Use of average mutual information for input feature selection

The average mutual information (AMI) is widely used in practice to detect and quantify nonlinear relations between random variables (Chau & Wu, 2010). Given N pairs of realizations of two random variables X and Y, denoted $\{x_1, y_1\}, \ldots, \{x_N, y_N\}$, the AMI score is computed as

$$AMI(X, Y) = \sum_{x_i \in X} \sum_{y_j \in Y} f(x_i, y_j) \log_2 \frac{f(x_i, y_j)}{f(x_i) f(y_j)},$$

where $f(x_i)$, $f(y_j)$ and $f(x_i, y_j)$ are the marginal and joint pdfs at $(x_i, y_j)$ estimated from the samples. The AMI scores are used in this study to help determine whether to include the measurement time $t$ as an input feature of the DDMs. As shown in Table S1, the AMI score between the PBM error and $t$ is lower than the AMI scores between the PBM error and the other input features, indicating that the PBM error is less dependent on $t$ in both case studies.

Table S1. Average mutual information (AMI) scores between the PBM error and input features.

       x_w   y_w   t      ĥ
RRCA   0.30  0.13  0.067  0.19
SVRP   0.15  0.12  0.085  0.17

4. DDM parameters selected by CV

Table S2. DDM parameter values and RMSE before and after DDM updating for each cluster, temporal prediction scenario for the RRCA case study.

Cluster  MODFLOW     IBW                      SVR
         RMSE (ft.)  q       n    RMSE (ft.)  C       ε     γ    RMSE (ft.)
1        62.31       0.0005  5    2.96        119.22  0.15  16   3.192
2        53.19       0.0007  10   6.3         115.3   0.24  64   7.035
3        27.93       0.001   10   4.55        63.11   0.52  256  6.82
4        56.65       0.01    5    11.45       150.5   0.66  64   7.395
5        27.49       0.0003  10   4.75        50.62   0.2   256  5.238
6        26.6        0.0004  50   7.98        100.26  0.16  16   7.695
7        17.85       0.0002  50   5.86        49.84   0.07  256  3.365
8        16.89       0.0007  10   4.3         55.59   0.12  64   4.629
9        52.33       0.0003  10   3.12        100.6   0.26  256  3.625
10       17.43       0.001   5    3.17        46.43   0.16  256  3.379

Table S3. DDM parameter values and RMSE before and after DDM updating for each cluster, spatial prediction scenario for the RRCA case study.

Cluster  MODFLOW     IBW                      SVR
         RMSE (ft.)  p       n    RMSE (ft.)  C       ε     γ    RMSE (ft.)
1        40.49       2       16   11.76       121.43  1.42  11   12.49
2        17.3        2       256  12.96       48.48   0.93  0.4  12.16
3        15.03       2       512  12.72       49.58   1.32  6    12.69
4        33.58       3       64   14.06       98      0.7   0.4  13.32
5        29.71       5       64   17.3        103.44  0.94  13   11.76
6        26.39       4       16   11.86       124.85  0.8   1.6  12.85
7        45.4        0       16   12.38       84.59   1.06  4    11.1
8        14.21       3       32   8.73        50.59   0.59  2.9  7.33
9        58.63       0       4    17.56       178.47  2.13  3    12.52
10       28.85       3       32   16.96       95.34   1.07  2    22.63

Table S4. DDM parameter values and RMSE before and after DDM updating for each cluster, spatiotemporal scenario for the RRCA case study.

Cluster  MODFLOW     IBW                      SVR
         RMSE (ft.)  p       n    RMSE (ft.)  C       ε     γ    RMSE (ft.)
5        24.55       5       128  16.18       106.86  1.07  11   17.05
6        29.65       2       64   15.89       126.61  0.79  1.5  16.26
9        57.57       1       32   22.95       181.91  2.15  0.8  20.86

Cluster  DT                loess                 ANN
         |T|   RMSE (ft.)  d   δ     RMSE (ft.)  H   RMSE (ft.)
5        73    18.78       2   0.35  18.78       18  21.59
6        116   19.45       2   0.2   19.45       18  17.44
9        47    23.44       2   0.2   23.44       12  25.58

Table S5. DDM parameter values selected by five-fold cross validation, temporal prediction scenario of the SVRP case study.

           IBW        SVR
Parameter  p   n      C      ε     γ
Value      2   1172   46.13  0.62  250

5. Performance of DDMs

Table S6. Summary of DDM performance on the testing dataset in the spatiotemporal scenario of the RRCA case study.

            MODFLOW  IBW    loess  DT     ANN    SVR
ME (ft.)    -6.94    -0.1   -0.99  -1.84  0.29   1.59
RMSE (ft.)  33.95    17.5   19.28  19.85  21.38  17.6

6. Residual plots before and after DDM updating

Figure S2. Plots of residuals of the MODFLOW model (top), updated by IBW (middle) and updated by SVR (bottom) versus the head computed by the MODFLOW model. Temporal prediction scenario for the RRCA case study.

Figure S3. Plots of residuals of the MODFLOW model (top), updated by IBW (middle) and updated by SVR (bottom) versus the head computed by the MODFLOW model. Spatial prediction scenario for the RRCA case study.

Figure S4. Plots of residuals of the MODFLOW model (top), updated by IBW (middle) and updated by SVR (bottom) versus the head computed by the MODFLOW model. Spatiotemporal prediction scenario for the RRCA case study.

Figure S5.
Plots of residuals versus the head computed by the MODFLOW model. (a) Residual of the MODFLOW model; (b, c) residual after updating by IBW; (d) residual after updating by SVR. Note that the vertical axes of (a, b) and (c, d) have different scales. Temporal prediction scenario for the SVRP case study.

References

Campbell, A. M. (2005). Ground-water levels in the Spokane Valley-Rathdrum Prairie aquifer, Spokane County, Washington, and Bonner and Kootenai Counties, Idaho, September 2004. U.S. Geological Survey Scientific Investigations Map 2905.

Chau, K., & Wu, C. (2010). Hydrological predictions using data-driven models coupled with data preprocessing techniques. LAP Lambert Academic Publishing.

Hill, M. C., & Tiedeman, C. R. (2007). Effective groundwater model calibration: With analysis of data, sensitivities, predictions, and uncertainty. Wiley-Interscience.

Hsieh, P. A., Barber, M. E., Contor, B. A., Hossain, A., Johnson, G. S., Jones, J. L., et al. (2007). Ground-water flow model for the Spokane Valley-Rathdrum Prairie aquifer, Spokane County, Washington, and Bonner and Kootenai Counties, Idaho. U.S. Geological Survey Scientific Investigations Report 2007-5044.

Meyer, S. C., Roadcap, G. S., Lin, Y.-F., & Walker, D. D. (2009). Kane County water resources investigations: Simulation of groundwater flow in Kane County and Northeastern Illinois. Champaign, IL: Illinois State Water Survey Contract Report 2009-07.

Thornhill, J. T. (1989). Accuracy of depth to water measurements. Ada, OK: Superfund Technology Support Centers for Ground Water, RSKERL-Ada.