Environ Geol
DOI 10.1007/s00254-007-0753-3

ORIGINAL ARTICLE

Data integration and standardization in cross-border hydrogeological studies: a novel approach to hydrostratigraphic model development

Diana M. Allen · Nadine Schuurman · Aparna Deshpande · Jacek Scibek

Received: 7 November 2006 / Accepted: 3 April 2007
© Springer-Verlag 2007

Abstract

Data integration, or the merging of multiple source data sets, is central to hydrogeological studies. In cross-border situations, data heterogeneities are the source of most integration problems. Semantic integration of the subsurface geological terms is undertaken for the Abbotsford–Sumas aquifer, a cross-border (trans-national) aquifer, which is equally shared by British Columbia (Canada) and Washington State (US). Subsurface information is largely derived from water well information submitted to the respective governments. Use of this information is constrained due to inconsistent use of geological terms in water well reports. Lack of standardized methodology resulted in 6,000 unique geological descriptions for the aquifer alone. Semantic standardization of geological descriptions progressed from database interpretation to domain expert interpretation. Despite the poor quality of water well information, trends were observed that facilitated the development of a hydrostratigraphic model that honors the generalized early conceptual models of the aquifer, but provides a much higher degree of resolution in the stratigraphy necessary for groundwater flow modeling. The standardization protocols introduced support the model creation despite the constraint of poor quality data.

Keywords Hydrostratigraphic model · GIS · Semantic standardization · Aquifer heterogeneity · Data integration · Groundwater modelling

D. M. Allen (corresponding author) · J. Scibek
Department of Earth Sciences, Simon Fraser University, Burnaby, British Columbia, Canada V5A 1S6
e-mail: dallen@sfu.ca

N. Schuurman · A. Deshpande
Department of Geography, Simon Fraser University, Burnaby, British Columbia, Canada V5A 1S6

Introduction

The ability to store, manipulate and visualize data has made geographical information systems (GIS) a common tool in many groundwater investigations and groundwater management activities, particularly those involving large datasets. GIS is being used to assemble groundwater data, such as water quality levels, water table surfaces, geological data, etc., and to integrate these with various coverages (surface water courses, land use, etc.) for groundwater assessment and management activities. Activities such as aquifer vulnerability mapping and the construction of groundwater models depend on an understanding of the conceptual hydrostratigraphic model, and the development of such a model is dependent on the availability of subsurface data obtained from field investigations (i.e., drilling activities and/or geophysical surveys). For large regional studies, particularly those that are trans-jurisdictional in nature, the assembled datasets may be very large, very diverse, and very inconsistent. Specifically, water well information is often the chief source of the depth-specific subsurface data. The specific use of water well data for hydrostratigraphic model generation, however, is often constrained due to inconsistent geological descriptions (Russell et al. 1998).

The Abbotsford–Sumas aquifer (Fig. 1), which straddles the border between British Columbia (BC), Canada and Washington State (WA), USA, offers a unique opportunity to explore the semantic issues in building a conceptual hydrostratigraphic model. The aquifer has been exploited extensively on both sides of the border (Cox and Kahle 1999). It is one of the largest aquifers in the region, and supports the activities of approximately 200,000 people who live in this area. Groundwater is used not only for drinking purposes, but also supports industrial, farming and agricultural activities (Cox and Kahle 1999; Kohut 1987). Such activities have threatened the integrity of the aquifer (Ricketts 1999); agricultural practices and the poultry industry have led to widespread occurrence of nitrate contamination (Cox and Kahle 1999). While Canada is concerned with the excessive groundwater withdrawal south of the border (Kohut 1987), the US is concerned with groundwater contamination that may originate north of the border (Cox and Kahle 1999). As groundwater is the primary source of water for many inhabitants of the study area, there is a pressing need to develop groundwater management strategies. Groundwater management strategies, however, call for a thorough and detailed understanding of the hydrogeological framework as well as a numerical groundwater flow model, both of which rely on subsurface geological maps, which are lacking for this area.

Fig. 1 Extent of the Abbotsford–Sumas Aquifer. Inset map shows location of the aquifer within southwest British Columbia and northwest Washington State

Approaches for mapping subsurface geology have stemmed largely from the petroleum industry (e.g., LeRoy 1955; Tearpock and Bischke 2002; Walker and Cohen 2006). The petroleum industry has the distinct advantage of having actual core, petrophysical data, paleontological data, geophysical logs and seismic data, all of which are collected and interpreted by trained geologists or geophysicists. In the water well industry, by contrast, drillers have little to no geological training, and the overall quality of the data is generally very poor, largely because different drillers use different terminologies to represent the same geologic units and/or the level of detail varies widely. Also, geologic information, where collected, is based on a rudimentary description of cuttings such that well records typically provide only basic relative grain size information (e.g., sand or gravel).
In addition, geophysical datasets are few, and paleontological data are non-existent. Perhaps the greatest shortcoming of water well information relates to the positioning of wells. Wells are drilled based on lot development such that well location is random from the perspective of gaining insight into subsurface stratigraphy and structure (i.e., wells are not strategically placed to obtain good subsurface data). While strategic drilling programs are undertaken at specific sites (e.g., Superfund sites), they are rare in regional investigations. Consequently, environmental geoscientists are at a distinct disadvantage in their ability to make sense of subsurface information.

As a result of these problems, there have been numerous initiatives to help improve how water well information is collected and stored in databases. For example, in British Columbia, standardized geologic terms will be required for reporting of subsurface geologic information collected during water well drilling (new provincial Ground Water Protection Regulation). Other jurisdictions are also experimenting with reporting standards for water well information. Another North American initiative is the development of the North America data model (NADM), which was initiated between the Geological Survey of Canada (GSC) and the United States Geological Survey (USGS).

Despite all of the initiatives that are aimed at improving how data are collected and stored in databases, existing water well databases in Canada and the US store water well lithology data in the form in which they were originally collected. Thus, the standardization and classification of well log lithologic terms is a continuing problem in many jurisdictions in North America. When undertaking hydrogeologic studies, whether cross-jurisdictional or not, the hydrogeologist is faced with sifting through databases that comprise possibly several thousand well records, of variable quality.
Within the Abbotsford–Sumas study area, most depth-specific, subsurface information is dependent on water well information provided to provincial (BC Ministry of Environment), State (WA State Department of Ecology), and Federal (Environment Canada, GSC, USGS) government agencies. Lack of consistency in describing the geology has resulted in over 6,000 unique geological descriptions for the aquifer for more than 10,000 water wells in the region. Confounding the problem is the heterogeneous nature of the geology owing to the complex depositional history.

Herein, we describe a novel approach to developing a hydrostratigraphic model within a complex geologic setting. This paper begins with a review of data integration and standardization issues, and provides background on the geological history of the study area, which has resulted in a complex (heterogeneous) distribution of sediments. We then provide a simple methodology for standardizing and, hence, reconciling semantic issues where subsurface information is based on water well reports containing a large number of unique geological descriptions (in this case over 6,000). The standardized data, along with supporting geologic insight, are then used to construct a hydrostratigraphic model of the aquifer that is validated numerically in a groundwater flow model.

Materials and methods

Data integration and standardization

Data integration is at the core of GIS and is one of its defining properties (Vckovski 1998). Although data fuels the GIS industry, it is the source of most integration problems. Data heterogeneities inherent in the component databases have been identified as the source of data integration problems (Stock and Pullar 1999; Vckovski 1999; Bishr 1998; Sheth and Larson 1990).
These include (1) syntactic heterogeneity, which stems from the use of different data models to represent database elements (Bishr 1998); (2) schematic heterogeneity, which results from different classification schemes employed in the component databases or structuring of database elements in component databases (Kim and Seo 1991). For example, in this research the geological descriptions contained in the well logs are represented by a single attribute in the BC database and with three attributes in the WA State database. Schematic heterogeneities also result from different definitions of semantically similar entities, missing attributes, and different representations for equivalent data; (3) semantic heterogeneity, which occurs when there is a disagreement about the meaning, interpretation or intended use of the same or related data (Sheth and Larson 1990). This heterogeneity results from the different categorizations employed by individuals when conceptualizing real world objects. Such categorizations differ between individuals depending on education, experience and theoretical assumptions (Stock and Pullar 1999). An example is the often varied descriptions used by different drillers to represent the same lithologic unit. Such semantic heterogeneities have been identified as the main cause of data sharing problems and are the most difficult to reconcile (Bishr 1998; Vckovski 1998; Kottam 1999). A number of sophisticated approaches to reconciling semantics have been proposed (e.g., Ahlqvist 2003, 2004, 2005; Visser et al. 2002; Fonseca et al. 2000; Kashyap and Sheth 1996; Sheth and Larson 1990), but these are generally restricted to the academic domain and are still in the prototyping stage. Most government agencies and organizations, however, still maintain datasets in relational format (Schuurman 2002), which is an obstacle at this stage. Our method supports standardizing data from multiple sources in relational data format. 
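The schematic heterogeneity described above (one description attribute in the BC database, three in the WA State database) can be illustrated with a minimal sketch that maps both layouts onto a single relational-style record. The field names (`lithology`, `desc_1`, `desc_2`, `desc_3`, `top_ft`, and so on) and the `Layer` schema are hypothetical and are not taken from either agency's database. The sketch also assumes BC depths are already in metres, while WA depths are treated as feet, as the paper notes for the WA Ecology logs.

```python
from dataclasses import dataclass

FT_TO_M = 0.3048  # exact definition of the international foot


@dataclass
class Layer:
    """Common target schema for one litholog layer (hypothetical)."""
    well_id: str
    top_m: float
    bottom_m: float
    description: str
    source: str


def from_bc(rec: dict) -> Layer:
    """BC-style record: a single free-text attribute; depths assumed in metres."""
    return Layer(rec["well_id"], rec["top"], rec["bottom"],
                 rec["lithology"].strip(), "BC")


def from_wa(rec: dict) -> Layer:
    """WA-style record: three text attributes (names assumed); depths in feet."""
    parts = (rec.get(key, "") for key in ("desc_1", "desc_2", "desc_3"))
    # Join only the non-empty attributes into one description string.
    text = ", ".join(p.strip() for p in parts if p and p.strip())
    return Layer(rec["well_id"], rec["top_ft"] * FT_TO_M,
                 rec["bottom_ft"] * FT_TO_M, text, "WA")


bc_layer = from_bc({"well_id": "BC-1", "top": 0.0, "bottom": 6.1,
                    "lithology": " Silty sand and coarse gravel "})
wa_layer = from_wa({"well_id": "WA-1", "top_ft": 10.0, "bottom_ft": 20.0,
                    "desc_1": "sand", "desc_2": " gravel ", "desc_3": ""})
```

Resolving the schema this way only removes the structural mismatch; the free-text descriptions still carry the semantic heterogeneity that the rest of the method addresses.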
There are two closely related issues that bear on semantic data integration of geological data: classification and standardization. Classification is the process of allocating record names (e.g., borehole layer descriptions) to broader categories. The content of classification systems is all based on the category name, but the attributes of categories are discerned through implicit knowledge on the part of the user. There are strategies to deal with the problem of a lack of universal understanding of category meaning, including use of data dictionaries, which create equivalencies among different conventions. Standardization entails limiting the number of record names permitted, and re-assigning existing records to those categories. In this paper, we introduce a strategy for standardization of diverse nomenclature that accounts for local interpretations by using semi-automated, rather than fully automated, integration.

Aquifer hydrostratigraphy

Quaternary sediments in the Fraser–Whatcom Lowland

The Fraser–Whatcom Lowland, which hosts the Abbotsford–Sumas aquifer, consists of rolling hills of glacial drift, 60–120 m above broad valley floors. The floodplains are currently near sea level, and there are several prominent Tertiary-age bedrock outcrops, such as Sumas and Vedder Mountains, bordering Sumas Valley (Fig. 1). These sedimentary rocks underlie a thick (up to 600 m) Quaternary sediment fill (Clague 1994; Easterbrook 1969), consisting of complex sequences of diamictons and stratified drift, in various associations with marine and deltaic sediments, which provide the physical framework that controls the architecture of the aquifers, the Abbotsford–Sumas aquifer being the largest.

Our understanding of Quaternary lithostratigraphy has evolved over many years through mapping of surficial deposits and detailed stratigraphic studies.
Extension of this lithostratigraphic scheme into the subsurface is difficult, partly because of the remarkable complexity of the glacial stratigraphy. This complexity has resulted from interactions between sedimentation and erosion during advance and retreat of the ice sheets, the concomitant retreat and advance of the seas, and the isostatic effects of ice loading (subsidence) and unloading (uplift). The various stages of glaciation include (from the youngest to the oldest): Fraser Glaciation (20–10 ka); Olympia Interglaciation (60–20 ka); Possession Glaciation (80–60 ka); Whidbey Interglaciation (100–80 ka), and Double Bluff Glaciation (>100 ka) (Jones 1999). Therefore, there are many units of sufficiently high porosity and hydraulic conductivity to qualify as aquifers, particularly those that accumulated in close proximity to ice. Mapping has identified more than 200 aquifers in the region (Ricketts and Liebscher 1994).

The maximum glacial ice sheet advance corresponds to the Vashon Stade of the Fraser Glaciation, which deposited the Vashon Drift (Armstrong et al. 1965). The time of retreat of the Vashon ice is called the Everson Interstade (Armstrong et al. 1965), depositing glaciomarine sediments referred to in the Canadian part of the Fraser Lowland as the Capilano Sediments and Fort Langley Formation (Armstrong 1981). The same sediments were named Everson Glaciomarine Drift in the US (Easterbrook 1969), which also include some glaciofluvial sediments. The Everson Interstade ended when the ice re-advanced briefly into parts of the Fraser Lowland. This episode is called the Sumas Stade. Sumas Drift was deposited up to 120 m elevation; large outwash plains and kame terraces were created by glacial meltwaters (Easterbrook 1969; Armstrong 1981). Abbotsford outwash is part of Sumas Drift sediments, and forms the thickest and uppermost layer of the Abbotsford–Sumas Aquifer.
An outwash terrace slopes southward across the international boundary from a ridge of ice-contact deposits (Easterbrook 1969). The terrace ends at Lynden, WA, above the Nooksack River valley and floodplain. The glaciofluvial Abbotsford outwash is composed of stratified sandy gravel, gravel and sand, mostly horizontally bedded with some cross-bedding, scour and fill, and foreset bedding (Easterbrook 1969). The sediments fine to the south-west, grading from boulder-cobble gravel along the international boundary to pebble gravel, then to sand near Lynden. South of the Nooksack River valley, there is much recent alluvium and the Abbotsford outwash may or may not be present. The Lynden terrace is interrupted by the modern floodplain of the Nooksack River, but continues south of the river for several kilometers and terminates against highlands composed of Everson glaciomarine drift (Easterbrook 1969). A number of lakes and peat bogs occur in abandoned meltwater channels and kettles on the outwash terrace. The lakes include Abbotsford Lake, Laxton and Judson Lakes, Pangborn Lake and smaller ponds.

Table 1 shows a compilation of the geologic units in comparison to hydrostratigraphic units identified by Halstead (1986). The regional hydrostratigraphic framework is discussed in the following section.
Hydrostratigraphic framework

The original 3D mapping of the Abbotsford aquifer (extended only in BC north of the US–Canada boundary) must be credited to Halstead (1986), who produced a series of detailed fence diagrams and maps showing interpreted aquifer units and depths of water wells.

Table 1 Hydrostratigraphic units (Halstead 1986), comparing US and Canadian geologic units (compiled by Golder Associates 1995)

Columns: Hydrostratigraphic units (a) · Possible geologic equivalents: US geologic units (b), Canadian geologic units (c) · General geologic description

C1 Qt C2 Glaciofluvial sand and gravel deposited by meltwater streams, often occurring as raised deltas Qp peat Fraser River and Salish sediments Fluvial and floodplain deposits of silt, sand, gravel and peat; till, glaciofluvial, and ice-contact deposits; Qs till and ice-contact deposits Sumas drift Outwash sand and gravel Qal alluvial deposits Qsc silt and clay Qso outwash sand and gravel
A/B Qb Bellingham drift Qk Kulshan drift Fort Langley formation and Capilano sediments Glaciomarine deposits consisting of stony clays, and stony silt with marine shells
C3 Qd Deming sand Fort Langley formation and Capilano sediments Stratified, well sorted sand and gravel with some layers of clay, silt and gravel
D Qvt Vashon till Vashon drift Qve Esperance sand Quadra sand Bellingham drift, Capilano, Fort Langley, Cowichan Head formations Till and ice-contact deposits of poorly sorted gravel in matrix of silt, clay and sand; and glaciofluvial deposits of sand and gravel Clay and silt, with interbedded estuarine and fluvial deposits of fine sand and silt
E Kulshan drift Pre-Vashon marine deposits
C4 Pre-Vashon sediments Pre-Vashon sediments Fine to medium sand of fluvial or glaciofluvial origin
F TKc Tertiary bedrock Tertiary bedrock Tertiary-aged consolidated sedimentary deposits and interbedded volcanic deposits Th

(a) Halstead (1986)
(b) After Easterbrook (1976)
(c) After Armstrong (1981)
Halstead’s work preceded numerical flow modeling, and the products were paper maps and not digital database or CAD drawings. Halstead (1986) defined hydrostratigraphic units on the basis of lithology, permeability and porosity, and subordinate factors, such as origin (marine, fluvial), stratigraphic position, and to some extent aquifer type (e.g., water table aquifers); they are different from the formal lithostratigraphic units, which are defined primarily on mappability and degree of homogeneity. Halstead’s scheme is useful from a regional perspective, although detailed mapping reveals some ambiguities. Halstead’s Unit C (Table 1), for example, corresponds to the easily mapped, unconfined Sumas Drift aquifers. However, other hydrostratigraphic units, such as units A and B, correspond to the more heterogeneous Fort Langley and Capilano Formations, and not specifically to aquifers within these formations. Several aquifers mapped herein are confined by finer grained Fort Langley–Capilano deposits—some of the confined aquifers appear very similar in general sedimentological characteristics and map extent to the unconfined Unit C aquifers. Halstead (1986) grouped the sediments into six units of significance to groundwater, either acting as barriers to flow or as units that readily transmit ground water. These are summarized in Table 1, along with their lithostratigraphic equivalents. In 1993 the GSC began a series of hydrogeologic investigations and analyses in the Fraser Valley, with the main objective being the analysis of regional stratigraphic framework of aquifers, groundwater flow, discharge and recharge dynamics (e.g., Ricketts and Jackson 1994). Similar work has been carried out in Washington State, notably by Kahle (1991), Jones (1999), and Cox and Kahle (1999) at the USGS, and Tooley and Erickson (1996) at Washington State Department of Ecology (WA Ecology). 
The USGS studies are known as the LENS (Lynden–Everson–Nooksack–Sumas) hydrogeologic study and have been published as the Water Resources Investigations Area (WRIA) 1 report. There were also small hydrogeologic projects near various landfill sites, water supply well locations, and a proposed power plant site (e.g., Golder Associates 1995; Gibbons and Culhane 1994; Mitchell et al. 2000; Piteau Associates 1991, 2002).

Well litholog database

To develop a hydrostratigraphic model of the aquifer, water well data from various sources were acquired. Sources of lithology data include: WA Ecology (WRIA 1 of USGS regional groundwater study database) and NWIFC (Northwest Indian Fisheries Commission), BC Ministry of Environment (BC MoE) (WELLs database), BC Ministry of Energy and Mines (several deep exploration boreholes), GSC reports and papers, BC Ministry of Transportation (bridge construction sites), and Simon Fraser University (Cameron 1989, MSc thesis with lithologs of Sumas Valley).

The BC MoE has a very extensive digital database of well records. In BC, submission of well drilling reports is currently not mandatory, and there is no training of drillers in the preparation of well lithologs. As a result, the lithology information and location of wells contain many errors. Typically, litholog quality depends on the driller's experience and/or education in geology, the amount of detail recorded in the database, transcription errors, method of drilling, geologic setting, misplaced well records, and incorrect location of a well. Location coordinates were not available for many wells. In many instances, the given coordinates were not accurate, and the error was not known. For some of the wells, the coordinates were taken from address information and matched to addresses in Street Network files.
In the lithologic records, the problems included incorrectly formatted text output, truncated lithologs (e.g., maximum 24 layers per litholog), lack of ground elevation of top of well (very common), missing uppermost unit (assumed to be soil where thin), and problems with conversion of unit top and bottom depths to elevations above sea level (which requires the elevation of the top of the well).

Washington State Department of Ecology (WA Ecology) maintains a large and detailed digital database of well drill records, including lithologs. The lithologs, however, are mostly in .tif image format, from scanned images of paper forms. In some areas, local governments and organizations have entered the information from paper forms to database records and these can be queried directly. In northern Whatcom County, the south part of our study area along the Nooksack River and on Lynden terrace, the database has only images of litholog paper forms, which had to be entered into text and numeric digital format. In the WA Ecology logs the values are in feet and locations are in latitude/longitude. All were converted to meters and the UTM coordinate system.

Data from all sources were integrated to create a single database. Figure 2a shows the shallow wells (0.6–32 m depth) and Fig. 2b shows the deep wells (32 to >200 m depth). Data integration challenges in terms of data sources, quality, and access to information, schematic and syntactic heterogeneities can be found in Deshpande (2004). The semantic issues are discussed herein.

Fig. 2 Borehole locations and depths in central Fraser Valley (all data sets): a shallow wells (0.6–32 m depth), b deep wells (32 to >200 m depth)

As the data were obtained from multiple sources, the geological descriptions varied in style, classification and naming conventions. For example, the geological classification for the bridge construction reports was based on the Unified Soil Classification System, which is used for engineering purposes and is based on the particle size, liquid limit and plasticity index; the drill core record descriptions were based on the Wentworth Scale; GSC geological descriptions were based on the stratigraphy and environment of deposition, and the drillers' descriptions were based on experience or education. These semantic differences and lack of standardized descriptions for the study area resulted in 6,000 unique categories.

Lithologic data standardization

A borehole litholog is a record of geologic materials encountered at different depths during the drilling process. The level of precision in such records varies between wells, and probably within each well. Numerous contractors and hydrogeologists have contributed information to the well databases; the variables are quality of expertise, field conditions, drilling purpose, cost of drilling, well depth and size, litholog translation into the database, and database management. Typically, lithologs follow a format that identifies the top and bottom depth of each layer, and give a description of lithology encountered at each depth interval.

The choice of words varies slightly to significantly between different lithologs, even those that describe the same material type. For example, consider a layer of unconsolidated deposits consisting of sand (60% by volume) and gravel (30% by volume) with properties of fine to medium grain size in each, brown in color, and containing some silt (5% by volume). This lithology description could be worded according to i = 1 to n different sentences as shown in Table 2.

Table 2 Varieties of descriptions for potentially the same lithology
1  Brown fine to medium sand and gravel, and some silt
2  Silty sand and gravel, fine–medium, brown
3  Sand, fine to medium, brown and silty, gravel, fine to medium
4  Brn. fn./med. sand and gravel with silt
5  Silty sand and gravel
6  Sand with gravel
7  Sand
...
n
Each of the descriptions in Table 2 is unique, ambiguous to some extent, and typical of lithologs in the well databases. Some well reports describe more lithological details than others, and the degree of generalization varies as well. Furthermore, the complexity is increased by frequent non-standard abbreviations and word misspellings, grammatical ambiguities, and variable delimiters (comma, slash, space). When using these data some assumptions were made:

1. The descriptions are taken literally and describe the actual lithology of the site where the borehole was drilled. However, wells can be assigned litholog quality designations that can be used for weighting the data in further analysis. These are subjective criteria and may be based on the amount of detail written in a litholog; date of drilling, well size, depth, and purpose, noting that larger hydrogeologic studies usually involve professional hydrogeologists.
2. Each litholog can be successfully interpreted. Therefore, lithologs that are too ambiguous cannot be used.
3. The data are output correctly from the database. In each litholog, the sequence of layers has to be correct, and the depths of layers must be in the correct order.

Pre-processing of text files

A series of pre-processing steps was required prior to standardization and classification to deal with database structure issues. These steps were undertaken using a custom computer code. The first step was to parse and sort the data into several fields (e.g., a unique well identifier, UTM coordinates, layer top and bottom depths, layer lithology description, and additional well information, such as yield and screen depth). A multi-dimensional array of wells and their lithologs was then processed further. At this stage a litholog layer is the smallest unit of data aggregation. An array of litholog data for one well is shown in Table 3.

Table 3 Example of a litholog array for a single well
Layer  Top  Bottom  Description
1      0    20      Silty sand and coarse gravel
2      20   22      Sand and coarse gravel
3      22   28      Grey/brn med. sand few pebbles
4      28   48      Grey/brn med. sand-coarse gravel few pebbles up to 2″
5      48   84      Grey/brn med coarse sand some gravel few pebbles
6      84   136     Gry/brn med coarse sand, trace med gravel at 117′
7      136  156     Gry/brn fn-med sand trace silt

Word recognition process

A module was written for word recognition. For each litholog layer, the text is broken up into word groups as delineated by word separators in the original text (i.e., commas, slashes, or other characters). The word groups preserve the grammatical structure of the source text. Each word is read separately and compared to a custom dictionary of geological terms. This dictionary consists of lists of words and their alternative spellings for different categories of words, based on grammatical meaning. These lists were developed for words describing rock and unconsolidated sediment materials (e.g., "granite", "sand"), words specifying grain size, color, sedimentological structure or rock structure (e.g., "interlayered"), modifying words such as "sandy" or "wet", hydrogeologic terms, words describing technical aspects of well design and drilling process, and special words used to recognize grammatical relationships between words (e.g., "and", "to"). One list also links some modifying words to material types, such as "sandy" to "sand".

For each word in the dictionary there may be many alternative spellings, abbreviations, and synonyms. For example, the color "brown" is often written in lithologs as "brn" or "brwn". In extreme cases, a commonly used word such as "gravel" is spelled in all of the following forms in the database: "gravel", "grav", "grv", "gravels", "grvl", in a combination of lower and upper case letters. Therefore, each word is also converted to lower case as a default. Word recognition reaches practical limits where words are badly misspelled, joined together (missing separator), or totally ambiguous. The program also outputs a list of unrecognized words, which are checked by the user who then updates the appropriate word lists. The program is rerun for the entire database until all important words are recognized. For example, the text "fn. to med. gry. sand & grav. with coarse gravels" is recognized as "fine–medium grey sand and gravel and coarse gravel."

Material property assignment

The largest challenge concerned grammatical structures of litholog text. In that text there are descriptions of different materials (rock or unconsolidated deposit) and their properties. The materials are also arranged in order of importance, where usually the most abundant material is specified first, and all other subsequent materials are present in smaller amounts. There are exceptions, identified by such words as "and", which relate two materials as being equally abundant in a layer. For grain size ranges, the word "to" links size descriptors such as "fine" or "coarse", and the "–" dash character may also be used instead of "to". The modifying terms such as "silty", when combined with a material such as "sand", have a special meaning, from which two separate materials "sand" and "silt", the silt being the lesser amount, must be extracted to standardize this text. The complexities grow exponentially with poorly constructed sentences and ambiguous sentences. The goal is to extract all the materials and all separate properties, in standard form, from all lines in all lithologs. This task involves an iterative process of test-and-run to verify the results. It is most economical to train the program on about 5% of the cases, let the program handle about 80% of the cases, and verify the remaining 15% of cases by visual inspection without further modifications to attempt to improve the program.
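The word recognition and material extraction steps can be sketched roughly as follows. This is not the authors' program: the word lists are deliberately tiny and hypothetical, and the function names `normalize` and `extract_materials` are introduced here only for illustration. The sketch assumes that the first material named is a major constituent, that "and" links an equally abundant material, and that a modifier such as "silty" contributes a minor material, following the cues described in the text.

```python
import re

# Hypothetical, deliberately tiny word lists; a real dictionary would be far larger.
SYNONYMS = {
    "brn": "brown", "brwn": "brown", "gry": "grey", "gray": "grey",
    "fn": "fine", "med": "medium",
    "grav": "gravel", "grv": "gravel", "grvl": "gravel", "gravels": "gravel",
    "&": "and",
}
MATERIALS = {"sand", "gravel", "silt", "clay"}
MODIFIER_TO_MATERIAL = {"sandy": "sand", "silty": "silt", "clayey": "clay"}
OTHER_KNOWN = {"fine", "medium", "coarse", "brown", "grey",
               "and", "to", "with", "some", "trace"}
KNOWN = MATERIALS | set(MODIFIER_TO_MATERIAL) | OTHER_KNOWN


def normalize(text):
    """Lower-case, split on separators, map spelling variants to standard words.

    Returns (words, unknown), where unknown lists words needing user review.
    """
    tokens = re.split(r"[\s,/\-\u2013]+", text.lower())
    cleaned = [t.strip(".") for t in tokens]
    words = [SYNONYMS.get(t, t) for t in cleaned if t]
    unknown = [w for w in words if w not in KNOWN]
    return words, unknown


def extract_materials(words):
    """Map each material found to 'major' or 'minor' using simple word cues."""
    found = {}
    after_and = False   # True once "and" is seen, until the next material noun
    last_role = None    # role given to the most recent material noun
    for w in words:
        if w == "and":
            after_and = True
        elif w in MODIFIER_TO_MATERIAL:
            # "silty sand" implies silt in the lesser amount.
            found.setdefault(MODIFIER_TO_MATERIAL[w], "minor")
        elif w in MATERIALS:
            if "major" not in found.values():
                role = "major"            # first material named
            elif after_and and last_role == "major":
                role = "major"            # "sand and gravel": equally abundant
            else:
                role = "minor"
            if found.get(w) != "major":   # never downgrade a major material
                found[w] = role
            last_role, after_and = role, False
    return found


words, unknown = normalize("fn. to med. gry. sand & grav. with coarse gravels")
materials = extract_materials(words)
```

The `unknown` list plays the role of the unrecognized-word output: in an iterative workflow, a reviewer would inspect it, extend the word lists, and rerun. Grain size ranges, colors and sentence-level ambiguities are left unhandled here.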
Software that would successfully recognize >95% of the lithologs with proper grammatical relationships would be able to almost mimic a human being, and is thus impractical to develop because of complexity.

Standardized lithologs

Table 4 shows an example of a non-standardized litholog for one well and Fig. 3 shows the standardized litholog for the same well. The standardized forms can be queried in a database environment using SQL statements or other methods, and layers can be generalized for spatial and structural analysis. Once standardization was complete, the original records were compared to the standardized format for a sample of wells.

Table 4 Example of a non-standardized litholog
Coarse gravel and silt
Clean coarse sand and small w b gravel
Very coarse sand/coarse gravel and fine silt
Coarse sand and med sand
Med sand/thin clay layers and some boulders
Med and coarse sand/fine sand and silt
Coarse gravel with clay layers
Coarse sand and some gravel
Medium sand with pebbles
Gravel/some sand
Very coarse gravel/very little sand

Litholog data classification and interpretation

Rules were developed for litholog classification, which were used as guides for constructing the geologic cross-sections. In essence, the classified well logs aided the interpretation process, but ultimately did not replace geologic expertise.

Material classification was undertaken over a series of passes, each reducing the output to a lower number of material types. The classification is simple when only one material type is present because the classified material is the same as the constituent material. For mixtures, a series of rules was applied (Table 5). These rules are not ideal, may be modified, and attempt to capture the important hydrogeologic properties of the subsurface materials. Classification was aided by expert knowledge of the local geology and depositional environment.
In the Abbotsford uplands, gravel is the dominant material, followed by sand and larger boulders, and locally clay and silt. The clay occurs in lenses associated with tills, and silt content may vary in the sand matrix of those gravels. Small silt lenses may be present, but silt is more common as lacustrine deposits in the Sumas Valley. Clean gravels are rare except in lenses of fluvial or glaciofluvial deposits. Sand usually occurs with other materials, most commonly with gravel. Because no other information was available, the occurrence of sand as the dominant lithology in a litholog was interpreted as pure sand (fine to coarse, with unknown grain size distribution).

Fig. 3 Standardized litholog output from a custom standardization code

In the central Fraser Valley, clays are present in most deep lithologs at some intervals, and are present in most boreholes in the Langley uplands, where Fort Langley Formation glaciomarine stony clays outcrop at ground surface or lie beneath thin sands or gravels of Sumas drift or other coarse grained sediments. Clay is almost exclusively associated with clay-rich tills. Thus, intervals containing predominantly clay, or at least clay as a secondary material, were classified as clay material. Most “clay” intervals therefore inherently contain a mixture of sand, gravel, boulders, or silt. Furthermore, we expected that many lithologs confuse silt with clay. From a groundwater flow point of view, clay and silt both have low hydraulic conductivity relative to sand and gravel. Therefore, if clay is present as a major constituent, the interval was classified as clay. If clay contains silt, or is interbedded with silt, the interval was classified as clay. If clay is present as a minor constituent or trace amount, and it is a thin layer, then the clay was ignored.

Although small silt lenses may be present, silt is more common as a lacustrine deposit in the Sumas Valley, which experienced more lacustrine flooding and deposition.
If silt was present as a major constituent, then the material was classified as silt, whereas if silt was present as a minor constituent or a trace amount, and was a thin layer, then silt was ignored.

Sometimes bedrock fragments can be mixed with other materials near the bedrock surface. In such cases, the other material was considered dominant. Occasionally, bedrock was recorded between layers of unconsolidated materials. Here, it was assumed that the recorded bedrock was in fact a boulder.

Soil and fill were ignored. These are very local in extent and are usually in the unsaturated zone, so they do not play a major part in saturated groundwater flow, although it is recognized that soils impact recharge significantly. Soil was considered in recharge modelling (Scibek and Allen 2006).

Some thin layers represent local lenses of materials, while others are part of larger, but thinning, continuous layers. For interpolation purposes, some layers were aggregated into larger, more generalized layers, some thin layers were preserved, and others were ignored. During interpolation, these decisions may be changed to provide a better fit to the data. However, the borehole density is insufficient in some locations to resolve the detailed stratigraphy.

Table 5 Primary rules for classification of mixed sediments as recorded in lithologs from the Abbotsford–Sumas aquifer
Gravel
  Gravel + cobbles (or boulders) = gravel
  Gravel + sand = gravel (if gravel is the dominant material)
  Gravel + clay or silt = NOT gravel (see clay or silt classification)
Sand
  Sand + gravel = sand
  Sand + other material = sand
Clay
  Clay + silt = clay
  Other material + trace clay or thin clay layer = other material
  Other material + clay = clay
Silt
  Other material + trace silt or thin silt layer = other material
  Other material + silt = silt
Soil and fill
  Soil and fill were ignored
Bedrock
  Other material + bedrock = other material
  Other material underlying bedrock = bedrock
Thin layers
  Preserve all clay layers as these are important for groundwater flow
  Preserve silt layers if the thickness is significant (the threshold can be adjusted)
  If clay is interbedded with thinner layers of other materials, then generalize this group of layers as clay (same rules apply to thin layers of silt)

The rules for classification were implemented in Visual Basic (VB code) and run on an Excel spreadsheet containing the litholog database. The frequency of occurrence and average thickness of each material class (graphed in Fig. 4 for the second and third classification passes) are very helpful in selecting appropriate aggregation rules and material classes. Ultimately, the database was reduced to five material classes and these were used to map the regional trends in lithology. However, intermediate classification results aided in the interpretation at the local level.

Results

Early attempts at constructing a traditional layered hydrostratigraphic model for the Abbotsford–Sumas aquifer using the standardized well database were fraught with difficulty. Deshpande (2004) reported that it was practically impossible to fit any “surfaces” to the very heterogeneous Quaternary sediments in that area, despite having only five material classes. In this aquifer, the heterogeneity of sediments is such that a fit to any regional “geologic layers” or “hydrostratigraphic layered units” would reduce model resolution so greatly that the model could not be calibrated at a regional scale, let alone to local conditions. The approach used required the HUF package in MODFLOW 2000 (Waterloo Hydrogeologic Inc. 2000) to represent geology in a 3D grid, rather than assigning geologic layers (where possible) to MODFLOW layer surfaces. The primary reason for standardizing the lithology database was to allow pseudo-3D computer representation and manipulation of the borehole logs in geospatial databases and flow modeling software.
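For readers who want to reproduce the reduction to a few material classes, the mixture rules of Table 5 can be sketched in Python as follows. This is an illustrative reading of the rules, not the authors' Visual Basic code: the function name `classify`, the boolean "trace or thin" flag, and the precedence order (clay, then silt, then the first-listed coarse material) are assumptions, and the stratigraphic rules for bedrock and thin-layer aggregation are not covered.

```python
IGNORED = {"soil", "fill"}
FINES = ("clay", "silt")  # order matters: clay + silt = clay


def classify(constituents):
    """Classify one interval from (material, is_trace_or_thin) pairs.

    Constituents are listed most abundant first. Returns a material class,
    or None when nothing classifiable remains.
    """
    kept = [(m, minor) for m, minor in constituents if m not in IGNORED]

    # Clay, then silt, overrides coarser material unless it is only a
    # trace amount or a thin layer.
    for fine in FINES:
        if any(m == fine and not minor for m, minor in kept):
            return fine

    # Otherwise the first-listed (dominant) gravel or sand wins; cobbles,
    # boulders and bedrock fragments are absorbed into it.
    for m, _ in kept:
        if m in ("gravel", "sand"):
            return m

    if any(m == "bedrock" for m, _ in kept):
        return "bedrock"
    return None
```

Because each rule is a separate branch, the precedence can be changed, or the trace and thin-layer test replaced by a thickness threshold, without touching the rest of the function, which mirrors the statement that the rules may be modified.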
At first, the GMS 4.0 software (Brigham Young University 2002) was used to examine the information, but the very large quantity of data (>2000 good quality boreholes) slowed the software so much that it was impractical to use. The second solution involved the use of ArcGIS 8.3 (ESRI 2004) to display the boreholes in 3D, together with pre-defined MODFLOW surfaces (slices of the aquifer area without regard for geology, but thinning toward ground surface to increase the resolution of mapping), ground and bedrock surfaces, and surficial geology polygons (Fig. 5). The software (ArcScene module in ArcGIS 8.3) allows rotation and zooming in and out in 3D, and proved to be very fast and easy to use. Lithologic materials were color-coded for quick reference. The following colors were used to represent the different lithologies: clay (blue), silt (green), gravel (orange), sand (yellow). Surficial fill or soil units were not displayed, which simplified the materials to only four general types.

The goal of the mapping was to fill the 3D space of the model domain with geologic materials, classed into lithostratigraphic units, based on borehole lithologs. The lithostratigraphic units were also identified as hydrostratigraphic units by assigning appropriate hydraulic conductivity, porosity, and storativity values (as determined from pumping test results). Lithostratigraphic units could then be joined with others on the basis of their hydraulic properties.

The MODFLOW grid was then defined. Grid layers were created as slices (flat where possible), thickening downward. Near ground surface, the layers were thin (3 m for the first layer; 5–10 m for the second layer in the uplands and 1–3 m in the lowlands). MODFLOW requires continuous layers, and some judgment was required to create appropriate slice elevations. This was done using GIS, where elevation zones were created for each slice surface, then imported to MODFLOW as xyz surface elevation points.
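As a hedged illustration of the last step, the following Python sketch builds one slice surface that is flat wherever the ground allows and is pushed down where the ground surface would otherwise come too close, then writes it as space-delimited xyz points. The function names, the clamping rule, and the text format are assumptions for illustration; the paper does not describe the exact procedure or file layout used.

```python
def slice_surface(nodes, flat_elevation, min_depth):
    """Return (x, y, z) points for the bottom of one model slice.

    nodes is a list of (x, y, ground_elevation) tuples in metres. The slice
    is flat at flat_elevation, except where that would leave less than
    min_depth of material below ground; there it is lowered so the layer
    stays continuous.
    """
    return [(x, y, min(flat_elevation, ground - min_depth))
            for x, y, ground in nodes]


def to_xyz(points):
    """Format points as one 'x y z' line per node (assumed import format)."""
    return "\n".join(f"{x:.1f} {y:.1f} {z:.2f}" for x, y, z in points)
```

For example, `slice_surface(nodes, 45.0, 3.0)` uses the 3 m minimum thickness quoted for the first layer; deeper slices would be generated the same way with a lower `flat_elevation` and a larger `min_depth`.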
The surfaces were displayed in GIS during the mapping process. For example, to map geology in layer 4, a view of the bottom of layer 4 would be displayed, effectively truncating all deeper lithologs in the view; a view of the bottom of layer 3 could be switched on and off to constrain the mapped litholog intervals. Mapping was done city block by city block (the street network and drainage network were used as orientation guides), small area by small area, directly in Visual MODFLOW software (WHI 2004), by “painting” zones of geologic materials on the MODFLOW grid in each layer. In each small area, all boreholes were examined from many views, through all layers, checking against surficial geology, and also viewing row and column cross sections of the MODFLOW grid with defined (color-coded) material zones.

Fig. 4 AB–SUM aquifer litholog classification histograms of material occurrence and thickness. Left side: material classes in lithologs after the second aggregation-reclassification pass. Right side: material classes in lithologs after the third aggregation-reclassification pass

Fig. 5 Two litholog database in 3D intersected by MODFLOW surfaces (constructed in ArcScene, ESRI 2004)

This novel approach effectively bypassed the difficult steps of creating a 3D solid model (i.e., continuous geologic layers). The model is regionally consistent with the hydrogeologic fence diagrams developed by Halstead (1986) and the cross-sections developed by Cox and Kahle (1999), but provides the much greater degree of resolution necessary for numerical groundwater flow modeling. Figure 6 shows slices through the MODFLOW model, representing each model layer (1 through 8). The bottom layer of the model consists entirely of clay overlying impermeable bedrock.
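The layer-by-layer viewing described above was done interactively in ArcScene and Visual MODFLOW, but the underlying operation, restricting each litholog to the elevation window of one model layer, can be sketched in a few lines of Python. The function names and the thickness-weighted choice of a dominant material are assumptions for illustration, not part of the published workflow.

```python
def clip_to_layer(intervals, layer_top, layer_bottom):
    """Keep only the parts of a litholog that fall inside one model layer.

    intervals is a list of (top_elev, bottom_elev, material) tuples with
    elevations in metres and top_elev > bottom_elev.
    """
    clipped = []
    for top, bottom, material in intervals:
        new_top = min(top, layer_top)
        new_bottom = max(bottom, layer_bottom)
        if new_top > new_bottom:  # the interval overlaps the layer
            clipped.append((new_top, new_bottom, material))
    return clipped


def dominant_material(clipped):
    """Return the material with the greatest total thickness, or None."""
    thickness = {}
    for top, bottom, material in clipped:
        thickness[material] = thickness.get(material, 0.0) + (top - bottom)
    return max(thickness, key=thickness.get) if thickness else None
```

A thickness-weighted vote like this could only suggest a starting material for a grid cell; in the approach described here, the final zone assignment remained a matter of geologic judgment.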
After initial groundwater flow model calibration attempts (see Scibek and Allen 2005), there were areas with large residuals that did not respond to changes in hydraulic conductivity within a reasonable range for each mapped K-zone (hydrostratigraphic unit zone). In those areas, the geology was re-interpreted, again from borehole lithologs, this time with much more attention paid to possible interpretations, keeping in mind the model residuals and surficial geology, and looking at individual borehole records to verify the standardized lithologic units. In many areas, there are many possible interpretations of local geology due to the poor distribution of boreholes. The interpretation favoring lower model residuals was selected and the geology re-mapped in that area. The groundwater flow model was therefore used to guide interpretation of the subsurface geology in this area: the attempt to explain groundwater levels, flows, the existence of lakes and other features gives additional information to help interpolate the geology from poorly distributed boreholes.

Fig. 6 Hydrostratigraphic model of central Fraser valley fill by MODFLOW layer (nearly-horizontal slices of valley)

The groundwater flow model ultimately achieved a normalized root mean square (RMS) error of 7.15% using roughly 1,700 static water levels from drilled wells. The main source of error in model calibration stems from our inability to adequately distinguish highly conductive gravels and sands from less conductive “dirty” gravel and sand mixtures, which contain some finer grained materials that lower the hydraulic conductivity. Typically, the difference can be as much as one order of magnitude (e.g., 300 m/day for clean gravels and sands, and 20 m/day for dirty gravel). From most of the well lithology information, it is impossible to distinguish these high and low conductivity zones.
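For reference, the normalized RMS statistic quoted above can be computed as in the sketch below. It assumes the definition commonly reported by groundwater modeling software, namely the RMS of the head residuals divided by the range of observed heads and expressed as a percentage; the text does not state the formula, so this definition and the function name are assumptions.

```python
import math


def normalized_rms(observed, simulated):
    """Normalized RMS error (%) between observed and simulated heads.

    Assumes normalization by the range of the observed values.
    """
    residuals = [s - o for o, s in zip(observed, simulated)]
    rms = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return 100.0 * rms / (max(observed) - min(observed))
```

Because the statistic is scaled by the observed head range, a value such as 7.15% is only meaningful alongside that range, and it says nothing about where in the model domain the large residuals occur.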
It is expected that similar difficulties would be encountered in other heterogeneous aquifer systems, and that alternative (stochastic) approaches may provide a solution to representing heterogeneity (e.g., the T-PROGS module in GMS). However, such software similarly requires classification of material types, and the approach used herein could provide a means for achieving such a classification.

Conclusions

In the absence of field investigations, water well information can be an invaluable source of information. This was the case for the Abbotsford–Sumas aquifer, where water well reports were the chief source of depth-specific geological information. The extremely poor data quality and semantics of the geological descriptions, however, proved a hindrance in the development of the hydrostratigraphic model. Due to the lack of a consistent methodology for documenting geological descriptions, 6,000 unique geological descriptions were observed for the study area alone. Semantic inconsistencies in the geological descriptions were adequately resolved by reclassification. A further confounding issue was the high degree of heterogeneity resulting from a complex geological history, which prevented the development of a traditional hydrostratigraphic model based on a layered paradigm. The hydrostratigraphic model was constructed using ArcGIS to display the boreholes in 3D, together with pre-defined MODFLOW surfaces. The model is regionally consistent with the hydrogeologic fence diagrams developed by previous researchers, but provides the much greater degree of resolution necessary for numerical groundwater flow modeling.

References

Ahlqvist O (2003) Rough and fuzzy geographical data integration. Int J Geogr Inf Sci 17(3):223–234
Ahlqvist O (2004) A parameterized representation of uncertain conceptual spaces. Trans GIS 8:493–514
Ahlqvist O (2005) Using semantic similarity metrics to uncover category and land cover change. In: Rodriguez MA (ed) GeoS 2005, vol LNCS 3799. Springer, Heidelberg, pp 107–119
Armstrong JE (1981) Post-Vashon Wisconsin glaciation, Fraser Lowland, British Columbia. Geological Survey Bulletin 322, Geological Survey of Canada, Ottawa
Armstrong JE, Crandell DR, Easterbrook DJ, Noble JB (1965) Late Pleistocene stratigraphy and chronology in Southwestern British Columbia and Northwestern Washington. Geol Soc Am Bull 76:321–330
Bishr Y (1998) Overcoming the semantic and other barriers to GIS interoperability. Int J Geogr Inf Sci 12(4):299–314
Brigham Young University (2002) Groundwater modeling system (GMS) Version 4.0
Cameron VJ (1989) The Late Quaternary geomorphic history of the Sumas Valley. MSc, Department of Geography, Simon Fraser University, Burnaby, 154 pp
Clague JJ (1994) Quaternary stratigraphy and history of south-coastal British Columbia. In: Monger JWH (ed) Geology and geologic hazards of Vancouver Region. Southwestern British Columbia, Geological Survey of Canada, pp 181–192
Cox SE, Kahle SC (1999) Hydrogeology, ground-water quality, and sources of nitrate in Lowland glacial aquifers of Whatcom County, Washington, and British Columbia, Canada. In: US Geological Survey Water-Resources Investigations Report, US Geological Survey
Deshpande A (2004) Data interoperability across borders: a case study of the Abbotsford–Sumas aquifer (BC/Washington State). MSc, Department of Geography, Simon Fraser University, Burnaby, p 137
Easterbrook DJ (1969) Pleistocene chronology of the Puget Lowland and San Juan Islands, Washington. Geol Soc Am Bull 80:2273–2286
Easterbrook DJ (1976) Geologic map of western Whatcom County, Washington. United States Geological Survey Miscellaneous Investigations Series, Map I-854-B, 1:62500
ESRI (2004) ArcGIS 8.3 user manual and documentation. Environmental Systems Research Institute (ESRI)
Fonseca FT, Egenhofer MJ, Davis Jr CA, Borges KAV (2000) Ontologies and knowledge sharing in urban GIS. Comput Environ Urban Syst 24:251–271
Gibbons TD, Culhane T (1994) Basin study of Johnson Creek, Whatcom County hydraulic continuity investigations. In: Part 2, open file technical report OFTR 94–01. Washington State Department of Ecology, USA
Golder Associates Inc (1995) Blaine ground water management program. Golder Associates, Canada
Halstead EC (1986) Ground water supply: Fraser Lowland, British Columbia. In: NHRI Paper No. 26, IWD Scientific Series No. 145. National Hydrology Research Institute, Saskatoon
Jones MA (1999) Geologic framework for the Puget sound aquifer system, Washington State and British Columbia. Professional paper 1424–C, US Geological Survey, Reston, Virginia
Kahle SC (1991) Hydrostratigraphy and groundwater flow in the Sumas Area, Whatcom County, Washington. MSc, Western Washington University, Bellingham
Kashyap V, Sheth A (1996) Semantic and schematic similarities between database objects: a context-based approach. VLDB J 5:276–304
Kim W, Seo J (1991) Classifying schematic and data heterogeneities in multidatabase systems. IEEE 24(12):12–18
Kohut AP (1987) Groundwater supply capability Abbotsford Upland. BC Ministry of Environment and Parks, Water Management Branch, Victoria
Kottman CA (1999) The open GIS consortium and progress towards interoperability in GIS. In: Egenhofer M, Goodchild M, Fegeas R, Kottman C (eds) Interoperating geographic information systems. Kluwer Academic, Boston, pp 39–54
LeRoy LW (ed) (1955) Subsurface geologic methods, 2nd edn. Colorado School of Mines, Golden, p 1156
Mitchell R, Babcock S, Stasney D, Nanus L, Gelinas S, Boeser S, Matthews R, Vandersypen J (2000) Abbotsford–Sumas aquifer monitoring project, final report. Geology Department, Western Washington University
Piteau & Associates (1991) Hydrogeological assessment of Aldergrove aquifer, Aldergrove, BC. Report for the Corporation of the Township of Langley
Piteau & Associates (2002) 2001 annual water quality monitoring report, Jackman sanitary landfill, WMB Permit No PR-1841. Township of Langley File 4720–L01
Ricketts BD (1999) The Fraser Lowland hydrogeology project: an overview. Open file D3828. Geological Survey of Canada, Vancouver
Ricketts BD, Jackson LE Jr (1994) An overview of the Vancouver–Fraser valley hydrogeology project, Southern British Columbia. Cordilleran and Pacific margin: current research. Geological Survey of Canada, Vancouver, pp 201–206
Ricketts BD, Liebscher H (1994) The geological framework of groundwater in the Greater Vancouver area. In: Monger JWH (ed) Geology and geological hazards of the Vancouver region, Southwestern British Columbia. Geol Surv Canada Bull 481:287–298
Russell HAJ, Brennand TA, Logan C, Sharpe DR (1998) Standardization and assessment of geological descriptions from water well records: Greater Toronto and Oak Ridges Moraine Areas, Southern Ontario. Current Research 1998–E. Geological Survey of Canada, Ottawa
Schuurman N (2002) Flexible standardization: making interoperability accessible to agencies with limited resources. Cartogr Geogr Inf Sci 29(4):343–353
Scibek J, Allen DM (2005) Numerical groundwater flow model of the Abbotsford–Sumas aquifer, Central Fraser Lowland of BC, Canada, and Washington State, US. Report prepared for Environment Canada, Vancouver, p 203
Scibek J, Allen DM (2006) Comparing the responses of two high permeability, unconfined aquifers to predicted climate change. Glob Planet Change 50:50–62
Sheth A, Larson JA (1990) Federated database systems for managing distributed, heterogeneous, and autonomous databases. ACM Comput Surv 22(3):183–236
Stock K, Pullar D (1999) Identifying semantically similar elements in heterogeneous spatial databases using predicate logic. In: Vckovski A, Brassel K, Schek H (eds) Interoperating geographic information systems, second international conference, INTEROP’99 Zurich, Switzerland, 1999. Springer, Heidelberg, pp 231–252
Tearpock DJ, Bischke RE (2002) Applied subsurface geological mapping with structural methods, 2nd edn. Prentice Hall, NJ, p 822
Tooley J, Erickson D (1996) Nooksack watershed surficial aquifer characterization. Washington State Department of Ecology, Ecology Report, pp 96–311
Vckovski A (1998) Special issue: interoperability in GIS (Guest Editorial). Int J Geogr Inf Sci 12(4):297–298
Vckovski A (1999) Interoperability and spatial information theory. In: Egenhofer M, Goodchild M, Fegeas R, Kottman C (eds) Interoperating geographic information systems. Kluwer Academic, Boston, pp 31–37
Visser U, Stuckenschmidt H, Schuster G, Vogele T (2002) Ontologies for geographic information processing. Comput Geosci 28:103–117
Walker JD, Cohen HA (2006) Geoscience handbook: AGI data sheets, 4th edn. American Geological Institute, p 310
Waterloo Hydrogeologic Inc (2000) Visual MODFLOW v 3.0: user manual. Waterloo Hydrogeologic Inc, Waterloo