Aalto University

Data Quality Assurance of 3D Building Features in Data Integration Processes

Alpo Turunen

School of Engineering

Thesis submitted for examination for the degree of Master of Science in Technology.
Espoo 29.07.2022

Supervisor: Assistant Professor Henrikki Tenkanen
Advisor: D.Sc. (Tech.) Antti Jakobsson

Copyright © 2022 Alpo Turunen

Aalto University, P.O. BOX 11000, 00076 AALTO
www.aalto.fi

Abstract of the master’s thesis

Author: Alpo Turunen
Title: Data Quality Assurance of 3D Building Features in Data Integration Processes
Degree programme: Master’s Program in Geoinformatics
Major: Geoinformatics
Supervisor: Assistant Professor Henrikki Tenkanen
Advisor: D.Sc. (Tech.) Antti Jakobsson
Date: 29.07.2022
Number of pages: 65+1
Language: English

Abstract

During the last 30 years, spatial data quality has grown in importance due to the increasing number of data sources. In today’s data economy, it is profitable to invest in data quality, which has led to demand for common and simple quality assurance methods. Some standards and guidelines, e.g. ISO 19157 and ISO 8000, have contributed to this discourse, but they do not cover all situations. This thesis proposes an FME tool that assures the quality of 3D building data. The tool applies over 40 quality rules to CityJSON and CityGML data sets of LoD2 in order to find and report all violations in a tabular or geometrical format. In the example data sets, over 15 types of errors were found. Most of them were completeness or conceptual consistency errors, such as intersections and missing xlinks. Their repair and their impact on integration capabilities are discussed. By using the tool, users can automatically assess and improve their data quality as well as make their data more interoperable and harmonized. The tool is ready to use, but users might want to modify its quality rules and parameters to fit their preferences better. Most quality objectives are relative to the final purpose of data.
However, the use of data quality assurance tools will facilitate data integration processes and the overall data utilization of all stakeholders in data governance. Such tools are useful in all phases and roles of the Data Life Cycle (DLC), ranging from data creation to end use.

Keywords: Data Quality, Quality Assurance, Data Integration, 3D Buildings

Abstract (in Finnish)

Author: Alpo Turunen
Title: 3D rakennuskohteiden laadunvarmistus datan integrointiprosesseissa
Degree programme: Master’s Program in Geoinformatics
Major: Geoinformatics
Supervisor: Assistant Professor Henrikki Tenkanen
Advisor: D.Sc. (Tech.) Antti Jakobsson
Date: 29.07.2022
Number of pages: 65+1
Language: English

Abstract

Over the past couple of decades, the amount of geospatial data and the related quality awareness have grown. Today, data quality is a clear competitive advantage for organizations and companies, which has increased the need for suitable data quality assurance methods that are easy to adopt. Several standards and recommendations have tried to fill this gap by creating common practices, but they do not suit all situations because of the multidimensional nature of geospatial data. This thesis presents an FME-based tool for assuring the quality of geospatial data, especially 3D building features. The tool contains over 40 quality rules for LoD2 buildings in the CityGML and CityJSON formats. It finds the errors in a data set and reports them to the user in either tabular or geometric form. In the example data used in the study, the tool found 15 different types of errors, most of which concerned completeness or conceptual consistency, such as self-intersections or missing topologies. The thesis discusses how these errors can be repaired and how they affect integration processes in the different phases of the data life cycle. With the tool, users can automate their quality assurance processes and improve data quality. The tool is ready to use as such, although in most cases users should modify the quality rules and parameters to better match their preferences. Data quality requirements are almost always tied to the intended use of the data, so quality rules are always case-specific. An automatic quality assurance tool can be used in every phase and role of the data life cycle. It is especially useful in the data integration process, because integration requires compatible and harmonized data. With the tool, data producers can compare the quality of data with its criteria, and end users with its intended use.

Keywords: Quality assurance, Data quality, Data integration, 3D building features

Preface

First of all, I want to thank the National Land Survey of Finland for offering me this great opportunity to work in their organization. This research was part of the GeoE3 project, which is co-financed by the Connecting Europe Facility of the European Union, action number INEA/CEF/ICT/A2019/2063390.

I also want to thank my advisor, Doctor of Science (Tech) Antti Jakobsson, who introduced the world of geospatial data quality to me. In addition, my supervisor, Assistant Professor Henrikki Tenkanen, helped me a lot with academic writing, so great thanks to him as well. Without you, writing this thesis would have been more challenging, or even impossible.

Otaniemi, 29.07.2022
Alpo Turunen

Contents

Abstract
Abstract (in Finnish)
Preface
Contents
Symbols and abbreviations
1 Introduction
  1.1 Background
  1.2 Study Design
  1.3 Previous Research
2 Theoretical Background
  2.1 Data Quality
    2.1.1 Short Introduction to Data Quality
    2.1.2 From Data Quality to Better Financial Outcomes
    2.1.3 Data Quality Information Model
  2.2 Data Quality Management
    2.2.1 Definitions and Terms for Data Quality Management
    2.2.2 Data Quality Assurance
    2.2.3 Data Quality Assurance from the Perspective of Data Life Cycle
    2.2.4 Challenges Related to Data Quality Management
  2.3 Data Integration
    2.3.1 Spatial Data Infrastructures
    2.3.2 Data Integration in Spatial Data Infrastructures
    2.3.3 Data Integration in Different Phases of Data Life Cycle
  2.4 Automated Data Quality Assurance in Data Integration Processes
  2.5 The Role of Standards and Recommendations
3 Methods of Case Study
  3.1 Used Data Formats
    3.1.1 CityGML
    3.1.2 CityJSON
    3.1.3 Level of Details
  3.2 Used Data Integration Platform
  3.3 Created Data Quality Assurance Tool
4 Results
  4.1 Implemented Quality Rules
  4.2 Used Parameters
  4.3 Used Example Data
  4.4 Data Quality Results
  4.5 Comparison with Val3dity
5 Discussion
  5.1 Reliability of the Results
  5.2 Results in the Context of Data Integration
  5.3 Using the Data Quality Assurance Tool During the Data Life Cycle
  5.4 Future Possibilities
6 Conclusions

Symbols and abbreviations

Abbreviations

3D        Three dimensional
ADE       Application Domain Extension
API       Application Programming Interface
CityGML   City Geography Markup Language
CityJSON  JSON-based format for 3D city models
CRS       Coordinate Reference System
DIKW      Data, Information, Knowledge and Wisdom
DLC       Data Life Cycle
DQA       Data Quality Assurance
DQE       Data Quality Evaluation
DQM       Data Quality Management
DSMM      Data Stewardship Maturity Matrix
ELF       European Location Framework
ESDIN     European Spatial Data Infrastructure Network
ETL       Extraction, Transformation, Load
FAIR      Findable, Accessible, Interoperable, Reusable
FME       Feature Manipulation Engine
GML       Geography Markup Language
ILM       Information Life-Cycle Management
LoD       Level of Detail
NLS-FI    National Land Survey of Finland
OGC       Open Geospatial Consortium
SDI       Spatial Data Infrastructure
SIG 3D    Special Interest Group 3D
UoD       Universe of Discourse
VGI       Volunteered Geographic Information
WPS       Web Processing Service
XML       Extensible Markup Language

1 Introduction

1.1 Background

The amount of geospatial data has increased greatly in recent decades, which has led to an increasing number of applications. All of them rely on data quality either directly or indirectly. High-quality data is a cornerstone for data-driven organizations because data has to fulfil its requirements to be applicable. For that reason, more and more people and organizations have become interested in achieving higher data quality, which can be very profitable in the long term [1, 2].

According to the ISO 9000 family of standards [3], all coordinated activities of an organization that aim to manage quality fall under the concept of quality management.
Quality management establishes quality policies, processes and objectives, and it can be divided into four phases:

• Quality planning focuses on establishing quality goals and considering the resources and methods needed to fulfil them.
• Quality control focuses on fulfilling the quality requirements defined in the quality planning phase.
• Quality assurance ensures that quality control has achieved its objectives.
• Quality improvement aims to increase the ability to achieve the goals of the previous steps and tries to improve, for example, effectiveness, efficiency and traceability.

Similar quality definitions and phases can be applied in a data context, in which case the ISO 8000 standard for data quality [4] replaces the ISO 9000 standard for quality management systems [3]. The two standards are similar but take different perspectives. The ISO 8000 standard [4] focuses on data quality instead of the larger scope of quality management systems. Hence, it adds the word data before the word quality, so quality management is replaced by Data Quality Management (DQM).

The research field of data quality is almost as old as digital data itself. It has made significant advances and moved from quality awareness towards real-life solutions [5]. In addition to multiple standards (e.g. [6, 4, 3]), scientists and practitioners have published several scientific papers, conference proceedings and books that consider DQM. This literature contains quality perspectives from multiple fields, ranging from land administration (e.g. [7]) to disaster management (e.g. [8]). The wide range of scientific literature has led to more practical solutions, including definitions, measurements, practices, methods and tools [5].

Geospatial data quality has taken similar steps forward. It started from measuring accuracy errors of geospatial data and progressed toward a more mature and more diversified discipline [9].
The most recent standard, ISO 19157 [10], defines a widely used quality framework that can serve as a basis for assessing the quality of geospatial data. The framework helps users to assess the quality of their data and provides some methods for improving it. Despite its importance, it does not cover all the needed aspects of geospatial data quality. Ariza et al. [11] brought existing challenges to attention and suggested multiple improvements to the standard. For example, more emphasis should be placed on new geospatial data types (e.g. big data, LIDAR, BIM), life cycle management, metadata and interoperability with other standards [11], which is not an easy task due to the diversity of geospatial data [12]. A lot of research still needs to be done.

Besides being a value in its own right, quality has a wide impact on all elements of Spatial Data Infrastructures (SDIs) [13]. The most fundamental problem of SDIs is data fusion, i.e. the integration of heterogeneous and multi-source geospatial data [14]. Data integration requires harmonized and interoperable data sets and methods [15], the lack of which has been one of the main issues of geographic information science for over 30 years [9]. Since then, geospatial data interoperability has gained a lot of attention in the form of scientific projects and initiatives [15]. One key technology is the Application Programming Interface (API), which allows data and web services to flow easily within SDIs [16].

Common practices and standardized methods are especially relevant in today’s globalizing world because countries are moving from local and national SDIs towards regional or even global SDIs. That has created a need to adjust micro-regional data so that it corresponds to international standards [13].
To support this development, the European Union has issued the INSPIRE directive (Infrastructure for Spatial Information in Europe), which provides a standardized framework for countries to improve their geospatial data production, management, distribution and usage in the European Union [17]. As stated in the European Data Strategy [18], developing continental SDIs will have a huge impact on the whole European data economy.

1.2 Study Design

This thesis explores current DQM methods and challenges related to SDIs. The goal is to find and outline practices for automatically assuring the quality of geospatial data throughout its complete life cycle, from data creation to final use. In addition to the Data Life Cycle (DLC), this thesis considers Data Quality Assurance (DQA) from the different perspectives of data governance. That includes a discussion of the different roles and responsibilities related to the implementation of SDIs. In more detail, this thesis has three main research questions:

• What kind of data quality assurance methods have been developed for geospatial data and how can they be automated?
• How can an automated data quality assurance process be implemented for 3D building data?
• Can the automated data quality assurance process be implemented in various stages of the data life cycle in a data integration process?

These research questions will be answered by combining theoretical literature and an empirical case study (see Figure 1). The literature review covers geospatial DQM and DQA. It considers how these two are connected to general data management practices, and how the DLC and data integration are currently discussed. The empirical case study will demonstrate when and how to use automated DQA in real-life SDIs and data integration processes. The question of when will investigate different timing options for automated DQA during the DLC.
The question of how will investigate different alternatives for automating DQA from different user perspectives (e.g. data producer, data integrator and end user).

Figure 1: Research questions and workflow of the study

In this thesis, the empirical case study is a DQA tool, which is in practice an Extract, Transform, Load (ETL) process implemented in the FME Workbench application. The tool checks a given CityGML or CityJSON building data set against pre-defined quality rules and reports or fixes all violations. In that way, the quality assurance tool validates heterogeneous multi-source data sets and helps to transform them into an interoperable, high-quality form.

1.3 Previous Research

A lot of scientific literature has been published before this study. In 2011, the European Spatial Data Infrastructure Network (ESDIN) project [19] provided support for data providers, particularly national mapping and cadastral agencies, to prepare their data and metadata to comply with the INSPIRE directive and other related standards. That included guidelines and concepts for managing the quality of the data themes presented in Annex I of INSPIRE (see [17]). The outcomes of the ESDIN project can be summarized as two major achievements. First, its quality model can help data providers to understand data quality and to define objectives and user requirements for geospatial data. Second, the project proposed a semi-automatic service that can evaluate data quality or serve as a prototype for the future development of fully automated Data Quality Evaluation (DQE) services. Those practices and tools can help data producers to improve their data quality easily and quickly, which will eventually lead to multiple benefits and smaller operating costs.

From 2013 to 2016, these efforts were continued by the European Location Framework (ELF) project, which involved as many as forty private, public and academic organizations from all over Europe [20].
Their goal was to improve the availability of INSPIRE-compliant geospatial data sets by creating a framework for providing interoperable, reliable and current geospatial data across country borders [20, 21]. Because Europe is not a very uniform group of countries, a lot of harmonization was needed to achieve the required level of interoperability between data sets. The ELF project introduced a number of geo-tools to support data validation, transformation, edge-matching and generalization. Geocoding tools and tools for discovering and licensing web services were also provided. Most of them were based on the prototypes and principles of ESDIN [21, 20]. A good example is the master’s thesis of Nils Mesterton [22], which considered open source options for spatial data quality validation. He created a PostGIS database tool, which applied quality rules to spatial data in order to fulfil the quality requirements of the ELF project. His results indicate that PostGIS is excellent software for quality validation purposes.

The ELF project had a massive impact afterwards, and many of its outcomes have been developed further. For example, the National Land Survey of Finland (NLS-FI) implemented its own quality monitoring tool based on them. Most importantly, the project behind this thesis, the Geospatially Enabled Ecosystem for Europe (GeoE3) [23], is a continuation of the ELF project. GeoE3 started in 2020 and will last three years. It aims to provide dynamic integration of high-value data sets and digital services with existing geospatial data from national platforms. The goal is to ease the visualisation and analysis of cross-border data in particular. One example is combining pan-European meteorological data with national building data, which could open up multiple possibilities, such as solar power analyses.
Like the previous research projects, this project is financed by the European Union and deals with the interoperability, harmonization and accessibility of data.

In addition to the previous EU-funded projects, there are many similar studies related to DQA and data integration. For example, Mohammadi et al. [15] proposed a DQA tool for the most common formats of 2D geospatial data. They listed several current data integration challenges and tried to automatically find all violations in multi-source data with regard to relevant guidelines and instructions. The biggest difference between this thesis and their paper is that Mohammadi et al. did not evaluate geometry errors and focused only on data validation, which is usually seen as the first half of DQA. As a result, their harmonization tool only identifies the inconsistencies of 2D features but cannot fix them.

Another example is the article by Amin Mobasheri [24]. He considered the semi-automatic DQE of geospatial data through the data quality elements of the ISO 19000 series of standards. His main outcome was a list of data quality elements that can be utilized in the creation of an automated DQE tool. Hence, the article provided a good basis for further development, such as creating the tool of this thesis.

Thirdly, there are several studies that focus on the DQA of non-geospatial data, such as Dakrory et al. [25] and Alferes et al. [26]. The first proposed an ETL-based automatic quality validation framework for tabular data concerning employees. They concluded that automatic data validation can raise the level of confidence with respect to data quality measures, such as consistency and completeness [25]. The second study, by Alferes et al. [26], presented an automatic DQA tool for water data. They went further than the previous examples, as they completely assured the data, which included both data validation and editing phases.
First they measured data quality by identifying potential sensor faults, outliers and noise, and then they repaired the errors found by using multivariate analysis and predicted values [26]. Even though neither of these papers considered geographical data, most of the methods are applicable to geospatial data as well. Most importantly, both papers are good demonstrations of how automatic DQA procedures can bring real-life benefits.

2 Theoretical Background

2.1 Data Quality

2.1.1 Short Introduction to Data Quality

Originally, the word quality comes from the Latin words qualitatem and qualis, which can be translated as "of what kind" or "a quality, property; nature, state, condition" respectively [27]. Quality described a characteristic of an object that exists in and by itself [28]. In the 14th century, quality referred to "a degree of goodness or excellence" or "an inherent attribute" [27]. During those times, quality started to become a quantitative measure instead of a purely qualitative one [28]. A few hundred years later, from approximately the 1580s, the definition settled on "a distinguished and characteristic excellence" [27]. The excellence-based definitions are illustrative, but hard to measure and manage systematically [29].

Thereafter, the term quality has gained many new characteristics. In 1998, Paul Lillrank [30] described quality from four viewpoints: production, planning, customer and system centred perspectives. Similar perspectives can be applied to data quality, which is also a multidimensional concept and can be viewed from multiple perspectives [31]. The quality expert Joseph M. Juran [32] explains that data quality measures how well data is suited to serve its intended use. In 1999 he wrote that data is of high quality if it is suitable for its intended uses in decision making, planning and operations.
In that way, data should be error-free and contain all desired features, such as a proper level of detail and comprehensiveness [32]. Even the most accurate data might be useless if it does not serve its purpose [19]. Hence, this definition is widely referred to as fitness-for-use. A few years later, Thomas Redman [33] refined Juran’s definition into a more compact form by saying that data has high quality if its users say so. The definition takes into account that the quality of a data set is not absolute, as it is not the same for all users [34]. The problem is that users do not always know what sort of quality they should or could expect [35].

Most definitions of data quality contain similar characteristics. Mostly, data quality is related to the criteria and expectations set for data, and thus the level of data quality can be determined by comparing the actual data to its desired goals [31]. The ISO 8000 standard for data quality [4] agrees with this approach. It standardizes the definition by saying that data quality is "a degree to which a set of inherent characteristics of an object fulfils requirements". The definition does not take a stance on who defines the requirements, even though they may vary a lot between individuals, standards, laws, business policies and even data processing applications [31]. For example, data producers compare data sets to data models and criteria (i.e. they check whether the data contains errors), whereas end users compare them to their usability, the so-called fitness-for-use [10].

However, the large number of definitions implies that quality really has multiple dimensions [36]. One particularly notable dimension is geospatial data quality, which is an approximately 40-year-old concept [11]. The first studies of geospatial data quality began in 1982 under the American Congress on Surveying and Mapping [11], which led to the first proposal for geospatial data quality standards in 1987 [37].
Since then, the discourse and the standards have changed, but the main quality principles and elements have remained the same [11]. The most recent standard for geospatial data quality, ISO 19157 [10], uses the same principles and definitions as the ISO 8000 standard [4], but in a geographical context. It combines different definitions of quality by saying that data quality can be seen as the divergence between the present data instances and the universe of discourse (UoD), which describes the real or hypothetical world including everything of interest [10, 38]. Figure 2 illustrates the relationships in this definition. As can be seen, a data set of perfect quality is in complete accordance with its data model, which formally represents the real world through the UoD.

Figure 2: The relationship between the physical data instance, the UoD, and the data model. [39]

The geospatial data quality definition of ISO 19157 [10] is good because it takes into account the fact that geospatial data cannot completely capture the complexity of the real world. Data is only a measured and simplified model of reality that is always incomplete, inaccurate, imprecise or out of date. The level of imperfection depends, for instance, on the precision of measurements, algorithms and humans [34]. Thus, geospatial data can be perfect only in relation to a commonly agreed data model, not in relation to the real world. To represent the real world perfectly (and humorously), the scale of geospatial data should be at least 1:1 [34].

2.1.2 From Data Quality to Better Financial Outcomes

Although it is difficult to estimate the monetary value of data, it can be seen as an organizational asset [40, 41]. That is because most organizations use data and benefit from it. They store, control, maintain and accumulate data, and they pay for it and create value from it [40]. For those reasons, it is profitable to maximise the quality of data [1].
Just like in the retail business, higher quality increases the value of services and products and, on the other hand, saves costs [42]. From a broader perspective, higher data quality enhances the finances, confidence, satisfaction, productivity, risk management and compliance of organizations [40]. In contrast, poor data quality causes a multitude of negative consequences, such as errors, dissatisfied employees, disappointed customers, unnecessary work and poorer decision making. Forbes gave a good example of the cost of poor data quality by reporting that two thirds of data scientists’ time goes to data quality related tasks, such as data cleaning [43]. So, both high and poor data quality affect the overall outcome of organizations [44]. Especially in today’s data economy, data quality has a remarkable impact due to opportunity costs. This means that the winner usually takes it all [45].

But data is not the only concept to concentrate on. Although it is commonly said that "data is the new oil" (e.g. [46]), the real business outcomes come through information, knowledge and wisdom [47]. The terms are sometimes used interchangeably or to refer to the same thing. To clarify the differences and relations between them, they can be represented in the Data, Information, Knowledge and Wisdom (DIKW) pyramid, as in Figure 3.

Figure 3: Data, Information, Knowledge and Wisdom pyramid (DIKW) represents the connections between the terms. Modified after: [48].

The DIKW pyramid contains four connected levels with fuzzy boundaries. Wisdom is at the top and data at the bottom. The basic principle of the pyramid is based on a text by Russell Lincoln Ackoff [48], which was published in 1989. He says that data is a set of symbols representing the properties of objects or events, and that it has no meaning or use until it is processed into a more practical form. The processed data, i.e.
information, contains descriptions of the data and directly answers the questions of who, what, when, where and how many. The third level, knowledge, can be seen as a refined form of information that answers the questions of know-how and know-why. It makes it possible to operate and control a system efficiently. The top level, wisdom, can answer the question of why. With the help of wisdom, it is possible to see and evaluate long-term consequences rationally. It deals with values and involves judgements [48]. The pyramid implies that a higher level of understanding can be achieved by climbing from one stage to the next.

Although the pyramid is intuitive and widely known, it has been criticized. For example, Ilkka Tuomi [49] points out that the pyramid would also be useful for organizations when it is turned upside down. He argues that instead of raw data in the computer’s memory, the key to success is tacit, interpersonal and organizational knowledge. Data comes only after information and knowledge because it supports them. However, according to some views, an information product is just one type of data.

Despite its problems, the pyramid represents the importance of quality [47]. As Figure 4 implies, data quality is a prerequisite for information quality. Higher data quality leads to higher information quality and decision quality, and eventually to better business outcomes. Thus, the real value of data comes from its usability and fitness-for-use [40], or in other words, from its quality.

Figure 4: The impact of data quality represented with the DIKW-pyramid. Modified after: [47].

2.1.3 Data Quality Information Model

Assessing the quality of geospatial data is not straightforward. To help with this, the ISO 19157 standard provides the widely known Data Quality Information Model (DQIM) [10], whose main principles were originally adapted from the first geospatial quality standard proposal in 1987 [37].
In addition to metadata quality, it consists of five well-known quality measures: 1) completeness, 2) logical consistency, 3) positional accuracy, 4) thematic accuracy and 5) temporal quality [10]. Each of them has two or more sub-classes or data quality elements (see Figure 5), which can be used to evaluate and document geospatial data quality in smaller pieces [50, 10].

Figure 5: Geospatial data quality measures and elements according to the ISO 19157 standard. Modified from [10].

• Completeness indicates whether data has too few or too many features, attributes or relationships. It contains two data quality sub-elements, commission and omission, which describe excess and missing information respectively [10]. A data set is perfectly complete if it contains and covers all the same aspects as the real-world UoD [19]. These errors may occur, for example, due to misclassification [10]. If a residential building is accidentally classified as a business building, there are two completeness errors in total: an omission error on one side and a commission error on the other. Beare et al. [19] point out that completeness errors are easy to confuse with other quality measures. For example, if the data schema is missing information in relation to the UoD, the errors might be omission errors. But if the data is missing information in relation to its schema, the errors are conceptual errors instead, since the data does not correspond to its conceptual schema [19]. Similarly, some completeness errors, especially population completeness errors, can instead be coverage errors.
• Logical consistency measures the extent of adherence to logical rules of data attribution, structure and relationships [10]. It can be expressed with four sub-elements: topological consistency, domain consistency, conceptual consistency and format consistency [10]. All of them are discussed more specifically in Chapter 4.1.
This is because the elements of logical consistency are the most useful for automatic DQA: they are usually not negotiable or dependent on an agreed threshold, and they do not require reference data sets [19].

• Positional accuracy is usually the most obvious quality element as it describes the geographical accuracy of features [19]. It answers the question of how close the measured and the real positions are to each other [10] with regard to coordinates, addresses or other locality descriptions [24]. In the case of absolute location, this element measures the uncertainty of a location, because full exactness of real-life coordinates is impossible to achieve [51]. Thus checking positional accuracy reliably usually requires access to measured values through fieldwork [24] or a trustworthy reference data set for comparison [34]. Positional accuracy can be expressed in terms of absolute (or external) accuracy, relative (or internal) accuracy and gridded data positional accuracy [10].

• Temporal quality is similar to positional accuracy, but in the temporal dimension instead of the spatial one. It measures the quality related to the temporal dimension of features, including their temporal attributes or relationships [10], so it answers the question of whether objects are within defined time constraints [51]. Evaluating this data quality measure also requires some reference information for comparison. Temporal quality can be described using three sub-elements: accuracy of time measurements, temporal consistency and temporal validity. The last one, temporal validity, can also be expressed using timeliness, currentness, up-to-dateness or actuality as synonyms [10], and it can cause completeness errors if the data set lags behind the UoD [19].

• Thematic accuracy focuses more on attributes and semantics instead of geometry.
It considers the accuracy of quantitative and qualitative attributes as well as the classifications of features and their relationships [50], which also give the names for the sub-elements of thematic accuracy: thematic classification correctness, non-quantitative attribute correctness and quantitative attribute accuracy [10]. The first describes the correctness of the classifications of features and relationships in relation to the UoD. The second measures the correctness of non-quantitative attributes (e.g. whether the attribute is valid or not), and the third describes how close the quantitative values of the data set are to values that are true or accepted. Because the correctness and accuracy of values always depend on their semantic context, measuring these elements usually requires a reference data set containing labeled objects [24].

The DQIM of the ISO 19157 standard is very illustrative and useful for describing the general quality of geospatial data, but it has some deficiencies. For example, it deals with geospatial data in general, but does not consider the special characteristics of 3D data. Because 3D data has a more diverse structure, the most common metrics of the DQIM, like correctness or positional accuracy, are not sufficient alone to cover all aspects of higher-dimensional data [52]. Similarly, the model puts hardly any emphasis on new geospatial data types, such as big data, Light Detection and Ranging (LiDAR) data or Building Information Models (BIM) [12]. In addition to data types, the DQIM does not include any information about interoperability, documentation or life-cycle management of the data [12, 11]. Nevertheless, it leaves room for users to define new data quality elements and to modify the model in the direction they prefer [10]. The whole idea of the DQIM is to be modifiable, but not too much, because adding more quality measures could make the comparison of data quality more difficult [53]. In that case the model would no longer be interoperable.
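The misclassification example given for the completeness element can be made concrete with a small sketch. It assumes that a trusted reference classification is available and that features can be matched by a shared identifier; the function name `completeness_errors` and the sample identifiers are hypothetical and serve only as an illustration, not as a method prescribed by ISO 19157.

```python
from collections import Counter


def completeness_errors(reference, dataset):
    """Count omission and commission errors per class.

    Both arguments map a feature id to its class label. A feature that is
    missing from one side simply has no entry there.
    """
    omission = Counter()
    commission = Counter()
    for feature_id in reference.keys() | dataset.keys():
        ref_class = reference.get(feature_id)
        data_class = dataset.get(feature_id)
        if ref_class == data_class:
            continue  # the feature sits in the class where it belongs
        if ref_class is not None:
            omission[ref_class] += 1  # missing from the class it belongs to
        if data_class is not None:
            commission[data_class] += 1  # excess item in the class it was put in
    return omission, commission


# Assumed reference (stand-in for the UoD) and the data set under evaluation.
reference = {"b1": "residential", "b2": "business", "b3": "residential"}
dataset = {"b1": "business", "b2": "business"}

omission, commission = completeness_errors(reference, dataset)
print(dict(omission), dict(commission))
```

Here the misclassified building `b1` produces one omission error for the residential class and one commission error for the business class, while `b3`, which is absent from the data set altogether, adds a plain omission error.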
2.2 Data Quality Management

2.2.1 Definitions and Terms for Data Quality Management

The great importance of data quality is the reason why organizations want to manage it. That process is called Data Quality Management (DQM), which is a broad term covering several coordinated activities, methods and systems to plan, control, and improve an organization with regard to all aspects of data quality [54, 3].

One of the earliest ways to structure quality management was introduced by Joseph Juran in 1974. He developed a three-step model called the 'quality trilogy' [55], which contains three steps: quality planning, quality control and quality improvement. Usually these steps are represented in a cyclical and iterative form. In that way, good quality management requires proper planning of actions, controlling them and finally improving them based on the results.

Figure 6: Management process cycle by William Deming. Modified from [56].

Juran's model was influenced by a second well-known model of management systems. In 1950, William Deming introduced the Plan, Do, Study, Act (PDSA) cycle [56] in his lecture in Japan. The four-step model in Figure 6 includes the same aspects as Juran's model, but it contains one more step for checking results before the improvement phase. Even though neither model focuses directly on data or its quality, both have since been used and developed extensively in the data field. For example, the DQM model of the most recent data quality standard, ISO 8000, is influenced by the previous models as well as by the ISO 9000 standard [3]. All of them contain similar steps and roles, which are represented in Figure 7:

• Data Quality Planning (DQP): In the DQP phase, the data integrator or data producer defines the quality model, ideally together with customers.
The quality model contains quality requirements, promises and rules for the data [19], so it is basically a data model targeted at data quality. Its requirements are meant for data producers, and its quality promises are for customers. To keep customers satisfied, the quality requirements should be stricter than the promises [57].

• Data Quality Control (DQC): In the DQC phase, the data integrator or data producer creates the data according to the commonly agreed quality requirements and checks the quality of production using the quality rules [19]. The DQC phase may also be relevant in data integration or publishing processes. In addition, this phase can contain basic work of the organization, such as training and acquisition of software [57].

• Data Quality Assurance (DQA): In the DQA phase, the data producer, integrator, or customer assures that the data is at a conforming quality level [19]. Usually, completeness and logical consistency checks are applied in the DQA process [58, 19]. If any errors are found, they can be reconciled automatically in this phase [19]. The DQA is discussed in more detail in Chapter 2.2.2.

• Data Quality Improvement (DQI): In the DQI phase, the data producer rectifies and improves invalid data at its source, if that was not possible in the DQA phase. After that, the DQA process can be re-initiated. When the data passes all the rules, it is ready to be used or published. In both cases, it is important to deliver a quality report to customers, who can also give feedback about data quality and specify their forthcoming needs. The quality report can also be given to the data producer to assist the data production process in the future [19].

To understand the discourse of the DQM, it is good to understand the conceptual differences between the DQM and data governance. First of all, both are part of the overall management of the data.
Data governance does not focus directly on the management of data quality, as it has a wider perspective. It also considers organizational actions, roles, activities, responsibilities and decisions around the data [59, 60]. It puts in place all of the actions related to the usability, integrity, security and availability of the data [61]. Most researchers (e.g. [60, 41, 61]) stress the perspective of organizations when they define data governance. They say that data should be treated like the other assets of an organization, since it can also have real monetary value. To maximise its profitability, data governance aims to improve organizational policies, responsibilities and procedures related to data. For example, it defines data ownership and stewardship, auditing processes and data quality certifications for participants [40]. Data governance can therefore be viewed as an umbrella concept for everyone who is interested in data- or information-related matters in the organization.

Figure 7: DQM process according to the ISO 8000 standard [4] and the ESDIN Quality Final Report [19].

Another previously mentioned and important concept is data stewardship. It aims to take care of data from a similar perspective as data governance: both consider data a valuable digital asset. Stewardship preserves and improves the possibility of re-using and discovering data and metadata, for example by collecting, annotating, and archiving it [62, 63]. The difference is that data stewardship is a subset of data governance, as it concerns a smaller group of people. It focuses on the role of the steward, or so-called data coordinator, who usually does not own the data but has some responsibilities towards it. Usually those responsibilities are set by data governance [64]. It is beneficial to understand all the different roles and responsibilities in data governance, since all roles can have an influence on data quality [65, 61].
Not all of them are equally responsible for data quality, but they all have at least an indirect impact on it at some point of the DLC [65]. That is because higher (or lower) data quality is not only the result of the DQM, but also of communication, planning, teamwork, documentation, training, data calibration and analyses [66]. A high level in all these aspects can be difficult to achieve, especially in large, international or transdisciplinary teams [65].

2.2.2 Data Quality Assurance

As mentioned earlier, the DQA is usually the third step of the DQM cycle [32]. It is a set of actions that assure that a product (such as data) will meet its established objectives at an adequate level [67, 3]. Therefore, it tries to find possible problems and considers how to prevent errors and what to do if they occur [65]. In that way, the DQA process consists of two parts.

The first part is to compare data against its objectives, which is usually called data quality evaluation, assessment or validation [68, 32]. It aims to give an insight into the data and its quality [69]. That usually happens by categorizing all features based on their acceptability. If a value, feature or complete data set does not violate any criteria, it is acceptable. Otherwise, it can be categorized as non-acceptable, for example as erroneous or suspicious depending on the severity of the violation [67]. This decision procedure is based on a validation plan, which consists of quality objectives [70]. Quality objectives are informal criteria, which can be used to derive more formal quality rules.

Quality rules are basically a pre-defined set of criteria that describe what (geospatial) data should or should not be in the context where it is used [70]. They ensure that reality can be represented correctly enough within geographical information systems [58]. At its simplest, a quality rule can state that all lakes must be modelled as closed polygons.
Once the rule is created, the data set can be checked against it in order to find and report all lakes that are not polygons [69]. Usually single quality rules are very explicit and suited only for certain purposes. One quality rule cannot serve multiple quality elements at the same time [58, 19]. Quality rules for positional or thematic accuracy in particular might be challenging to create, because quality rules work well only with quantitative values and the encoding of geometry, whereas the correctness of data and qualitative attributes may pose more difficulties. Logical consistency is the only element of the geospatial data quality model that can be fully assured by using quality rules [58]. However, by applying quality rules to data, it is possible to harmonize data and improve its interoperability and quality.

It cannot be emphasized too much that quality rules always depend on the purpose for which the data is used. That is the reason why they should be defined together with the users and technical experts [70]. Cooperation makes it possible to exchange data easily, since both parties have commonly agreed quality rules for the data [71]. A good approach is to use existing standards as a basis for quality rules that describe the logic of optimal data [19]. It is good to remember that even if a feature passes all the rules, it still may not be perfect. The rules can only find invalid features based on the criteria, but they do not state that data or features are one hundred percent correct [67].

The second part of DQA is data editing, which happens in situations where the data is considered invalid or it does not reach the required threshold [19]. Normally, the consequences of a failure are defined in the logic of the quality rules. In most cases, there are two options: the rule can either mark invalid features and raise a warning, or it can try to repair the data as a part of the process and let the improved data continue [58, 19].
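The lake rule and the two failure options can be sketched in a few lines. This is only an illustration under simplifying assumptions: features are plain dictionaries with a hypothetical `ring` coordinate list, a polygon counts as closed when its first and last vertices coincide, and the only repair attempted is closing an open ring. The names `is_closed_polygon` and `apply_lake_rule` are made up for this example and do not come from the tool presented in this thesis.

```python
RULE_NAME = "lake_is_closed_polygon"  # hypothetical rule identifier


def is_closed_polygon(feature):
    """Return True if the feature is a polygon whose ring is closed."""
    ring = feature.get("ring", [])
    return (
        feature.get("type") == "Polygon"
        and len(ring) >= 4  # three distinct vertices plus the closing one
        and ring[0] == ring[-1]
    )


def apply_lake_rule(features, repair=False):
    """Validate lakes; optionally repair open rings, otherwise flag them."""
    accepted, report = [], []
    for feature in features:
        if is_closed_polygon(feature):
            accepted.append(feature)
            continue
        ring = feature.get("ring", [])
        repairable = (
            feature.get("type") == "Polygon"
            and len(ring) >= 3
            and ring[0] != ring[-1]
        )
        if repair and repairable:
            # Option 2: repair the feature and let the improved data continue.
            accepted.append({**feature, "ring": ring + [ring[0]]})
            action = "repaired"
        else:
            # Option 1: mark the invalid feature and warn via the report.
            action = "flagged"
        report.append({"id": feature["id"], "rule": RULE_NAME, "action": action})
    return accepted, report


lakes = [
    {"id": "lake1", "type": "Polygon", "ring": [(0, 0), (1, 0), (1, 1), (0, 0)]},
    {"id": "lake2", "type": "Polygon", "ring": [(0, 0), (2, 0), (2, 2)]},
    {"id": "lake3", "type": "LineString", "ring": [(0, 0), (5, 5)]},
]

accepted, report = apply_lake_rule(lakes, repair=True)
for entry in report:
    print(entry)
```

Every violation is written to the report with the feature identifier, so the report names the affected features instead of only counting them.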
If automatic repair is not possible, the rule can discard invalid features and send a list of errors to the data producers, who can fix the data at its source [67, 19]. After that, the data can be run through the validation part again [19]. According to Bogdahn and Coors, this kind of two-step process keeps the DQA operation clean and transparent. Users can see the quality of the original data, the list of found errors and the resulting data in which the problems are rectified. All modifications and errors are stored in the quality report that displays violations and modifications [72]. With the help of the quality report, users can understand the possibilities and limitations of the data [19].

2.2.3 Data Quality Assurance from the Perspective of Data Life Cycle

First of all, it should be clear that the Data Life Cycle (DLC) is not the same as the temporality of data. In the famous three-domain framework by May Yuan [73], temporality can be seen as parallel to the location and properties of spatio-temporal data. In that way, data has spatial, temporal, and semantic domains that are all linked together [73]. Similarly, all domain changes are linked together, as are the relationships of those changes. That means that when one domain changes, it also affects the others. Location changes cannot happen without changes in time (or properties), or vice versa [53].

The DLC is more related to data as a process in relation to time. According to David Loshin [40], it refers to the period in which the data is in the system. Usually, it is described using discrete and sequential stages from data creation to its destruction [40]. Data is produced or measured, stored, managed, processed, used, and finally deleted [74]. Some views hold that the DLC begins even before data creation (e.g. [74]), because planning and enabling data delivery, storage, control and capture can also be included in the definition. Thus, depending on the definition, the DLC starts when the data is initially conceived.
Identifying and managing the phases of the DLC is important since data can have different objectives in each phase [10]. It is also good to align the DQM practices with the DLC [61]. Antti Jakobsson [75] provided a good demonstration of how to manage data quality during the DLC. He considered the process from the perspective of data producers and divided it into three phases in the same way as the ISO 8000 standard [4] and Joseph Juran [55]. Those steps are: before production, during production, and after production.

1. Before production is the quality planning phase, in which the data producer and the customer jointly agree on quality objectives. The goal is to define a quality model to support these conformance objectives. It should contain the selected quality elements and their criteria. Without communication with the customer, it is difficult to support all dimensions of quality.

2. The second step is called quality control, and it happens during production. It ensures that all features comply with the previously agreed quality model.

3. The last phase is called DQA, and it happens after the data is produced. In that phase, data is evaluated against the quality requirements, and the resulting quality information is stored in the quality report or metadata. Finally, the data and quality information can be delivered to the customer [75].

Assessing and recognizing the stages of the DLC is important since data objectives can vary a lot between different stages [75]. Thus every stage requires its own unique DQA actions [76] and there is no single superior solution for all stages [40]. With the help of a decent DLC model, users can schedule their actions and ensure that they perform as expected at a specific moment of the DLC [77].

Figure 8: Importance of DQM during the DLC. Modified after [75] and [77].

The biggest problem according to the experience of Ariza-López et al.
[11] is that most users of the ISO 19157 standard are mainly concerned with the quality of the final data product instead of the quality throughout its complete life cycle. By doing so, their viewpoint remains quite narrow, because the overall data quality is closely related to the DLC [10, 78]. A better option would be to apply DQA more than once per data set [67, 10], or preferably every time data is updated and moves through the supply chain [19]. The DQA should start from the product specification and continue as long as the data is updated [10]. Figure 8 illustrates that quality should be managed continuously throughout the DLC [66, 77]. Only one principle is valid for all phases of the DLC: 'the sooner the better', which means that fixing errors in data is cheaper and easier when the data is closer to its source than to its final deposition [67].

2.2.4 Challenges Related to Data Quality Management

The most urgent challenge related to DQM is insufficient knowledge and understanding. The discourse of data quality is difficult to understand and access, especially for end-users, who usually are not GIS specialists [35, 79]. Sometimes customers do not even ask for quality information because they think it is irrelevant. This is true even in situations where data quality information is easily available [79]. However, their behaviour is logical at some level, because it is easier to produce analyses and maps when one can (mistakenly) assume that the data is of high quality.

But when users want to use quality information, it should be understandable and easy to access. It should contain all the necessary information about the errors found, but in a format that can be understood without profound quality knowledge. Hunter and his colleagues [35] clearly demonstrated examples of poor data quality reporting. Reporting only the number of erroneous features without identifying them provides hardly any information.
Similarly, stating only percentage error levels does not tell how many and which features were checked or discarded [35].

The second challenge is that DQA methods are not always scalable. According to Sarah McCord [65], in most organizations the DQM is executed only by a small team, which relies heavily on single employees and their experience. Because the process is not organized in a consistent way, it cannot always be scaled up to larger scopes. Similarly, the DQM approaches that are developed in large and networked organizations cannot always be utilized in smaller companies, because these do not always have dedicated and available staff [65]. She emphasizes that all actors related to data quality (e.g. data producers, managers, collectors, and students) should adopt common technological and cultural practices for the DQM [65].

The third relevant challenge is the lack of required tools and prerequisites. Since reliable reference data sets are needed for assessing most data quality elements [34], the DQA can be very expensive to carry out extensively, especially for non-organizational users [24]. Nowadays, however, there is a lot of free and open geospatial data that can be used as reference data. For example, Volunteered Geographic Information (VGI) can be a good solution [24].

2.3 Data Integration

2.3.1 Spatial Data Infrastructures

Nowadays, data is mostly gathered from multiple producers rather than collected and measured entirely by the user organization itself [80, 81, 82]. Most modern use cases, like city digital twins, combine data from multiple scales, authors, sources, and domains, which requires a lot of work for data processing, integration, and management [83]. This has created a demand for integrated data-sharing platforms containing harmonized data from various sources [80]. Connected systems like this are referred to as Spatial Data Infrastructures (SDIs). The SDI is a broad concept that has several definitions given by standards and researchers.
They vary from technical and data management perspectives to more human-centered views. For example, Phillips et al. [80] defined the SDI as a platform that comprises data sets, their interrelationships, management, distribution, and access networks. More recently, researchers have included more humanistic viewpoints in the definitions of the SDI. Fernández et al. [84] state that the definition should also contain organizational, financial, and people-based factors. In short, the SDI is a common platform that provides geospatial data to its users in a correct, current, usable, and on-demand form [14]. It could be compared to a search engine for geospatial data.

Phillips et al. provided an illustrative example [80]. They compared SDIs with other forms of real-life infrastructure, like transportation infrastructure containing roads, powerlines, and railways. Users simply take this kind of infrastructure for granted without caring how it works or who manages it. The infrastructure is available to all, but sometimes users have to pay a little for the permission to use it in the form of railway tickets or gasoline. Similarly, the SDI connects authors and enables stakeholders at different regional levels to exchange data that supports decision making, sustainable development, and governance [85]. A common platform for all parties prevents data duplicates and enables the same data to be used by a wide variety of users for a wide variety of purposes [13].

An important aspect of SDIs is to recognize their level of locality based on the regional area of the included data. By doing so, SDIs can be categorized into different political-administrative levels as shown in the pyramid below (see Figure 9). The pyramid illustrates how data producers at the local level should meet the national, regional, or international standards to be fully compatible with the upper levels [13].
Figure 9: Spatial data infrastructures from corporation and organization level to global level. The double-ended arrow represents the continuum between the levels and their relationships. [86]

It is important to notice that the SDI differs from the concepts of Geospatial Knowledge Infrastructure (GKI) and Geospatial Information Infrastructure (GII). Besides using the term 'geo', the latter concepts focus more on the derivatives of data, i.e. information and knowledge (see Figure 3). Figure 10 below illustrates the main differences. Some would say that there is an ongoing transition from SDI to GKI (e.g. [87]), and others would claim it to be normal development, only under a different name.

Figure 10: Comparison of spatial data infrastructures and geospatial knowledge infrastructures. [87]

2.3.2 Data Integration in Spatial Data Infrastructures

The most fundamental problem of SDIs is data integration [14, 88]. In the publications of the OGC, the term data integration is used as a synonym for data conflation. They define it as a process where one all-encompassing and integrated data set is produced by unifying multiple separate data sets [89]. Its goal is not only to display and use two data sets at the same time, but rather to gain knowledge from their relations that a single data set cannot provide alone [81, 90]. For example, integration makes it possible to add geometrical references to statistical data, enhance the accuracy of geometry, inspect one data set with the help of another, and enrich existing data with geometric and semantic information from another data set [81].

In order to succeed, data integration requires known quality, harmonization and interoperability of the data sets [15, 75]. Quality is defined above. Data harmonization aims to represent data in a uniform way by removing all heterogeneous aspects from it, like differing naming conventions, structures or data contents [91].
Interoperability, in turn, can be defined as the ability to integrate or work together as easily as possible [62]. As its name implies, interoperability is operability that exists between (= inter) things. In short, the cacophony of heterogeneity between data sets and data producers needs to be resolved to ensure successful integration. Thus the data integration process can take a lot of time, and it involves several challenges and usually two steps.

The first step is schema matching [90]. According to the Working Group on Data Integration [87], all identifiers, vocabularies, definitions, and ontologies should be commonly agreed among all parties. Even though data sets may appear to contain the same information, they can still differ. The same data object can give rise to multiple different information interpretations, and one information interpretation can be derived from different types of data objects. A good example was provided in the same article. A wood can have multiple semantic dimensions depending on the context. It can represent an avalanche barrier, part of an economic forest, or an outdoor area for leisure activities. The unsolvable problem is that it is impossible to define and standardize everything in an unambiguous way, because new dimensions always occur in new contexts. Hence, defining roles and responsibilities clearly is important. All stakeholders should have a common understanding of how to process data and how to take its life cycle into account [87].

More technical aspects, such as detail level, format, vertical topology, resolution, reference system, scale, metadata, and data model, must also be agreed on jointly [14, 81, 15]. Data sets should have high quality and simplified geometry, but they should still contain a sufficient amount of mandatory information [87]. The balance can be difficult to keep, because conversion and generalization tools usually cause a loss of data content [15]. Mohammadi et al.
[14] provided an illustrative example of technical issues when they discussed the integration of two road data sets. Firstly, both data sets should represent streets in a similar way, since it is difficult to integrate differing object classes and relations. Similarly, streets should be represented in only one of the following formats: polygons, lines, or segments of pixels. Secondly, the data sets should contain the same attribute information and data format. Other kinds of technical integration challenges are also easy to imagine. For example, how should two data sets with different vertical accuracy be combined?

When the schemas correspond to each other, the next step is data mapping [90], which is also known as linking, alignment, or reconciliation [92]. It is the task of finding all matching records and features that refer to the same real-world object [90]. Matching does not always succeed simply via common identifiers or identical geometries, so other similarity-based methods also exist, such as geometrical, contextual, semantic, attributive, hierarchical and topological methods [92]. Especially in the context of geospatial data, geometrical matching methods have a large number of variations, which have been discussed in multiple articles (e.g. [93, 87]).

Based on those methods and their matching correspondence, features can be combined, fitted, ignored, or removed in the next step, called data fusion. In the simplest case, matches are merged into a single entity, features without matches are ignored and passed forwards, and duplicate information is removed [90]. This is a critical step, because direct joining of geospatial objects may produce geometrical inconsistencies and errors in addition to the errors that already exist. For example, combining data from neighbouring areas may cause overlaps or gaps at the borders [94]. An interesting point of view, however, is that quality is not only a prerequisite for data integration but can also be its derivative.
So, it is possible to either impair or improve data quality by integrating and combining data from multiple sources [95].

One-to-one cardinalities of the kind described above are usually easy to handle successfully with existing methods. But there also exist other kinds of cardinalities and their variations. Sometimes one feature is combined with many features (1:N), or a group of features is combined with another group of features (N:M) [92, 93]. There is no method that is suitable for all types of cardinalities [93].

Both schema matching and data mapping are mainly technical steps that are possible to overcome in theory, but there is another side as well. Figure 11 illustrates the non-technical challenges of data integration, which could be seen as a partial cause of the technical challenges [14]. In other words, technical challenges could be avoided by changing organizational culture and information flow [15, 82]. Sumit Sen argues that the key to effective integration is not only technical interoperability but also interoperability between institutions, people and policies [96].

Figure 11: Non-technical challenges of data integration. [14]

When thinking about geospatial data integration, the importance of metadata interoperability should not be forgotten. According to the most relevant metadata standards, like ISO 19115 [97], metadata contains information about the data itself, like descriptions of its consistency, quality and temporal aspects. With the help of this information, it is easier to integrate data sets and avoid errors [15].

2.3.3 Data Integration in Different Phases of Data Life Cycle

An important question related to data integration is when to perform the integration, because data and its objectives are not constant in time. Mohammadi et al. [15] demonstrated the optimal timing of automated DQA tools in the data integration process. They suggest that data harmonization should happen immediately when data is received from data providers (see Figure 12).
In that way, data would always be compliant with the requirements and harmonized for users. Their demonstration also suggests that users can create more quality rules for the data validation and harmonization tool based on their preferences and field of application. In that way, data producers would harmonize data even before its publication.

Figure 12: Proposed process flow to use harmonization tool in data integration. Modified from [15].

Other authors agree with that method as well. It is recommended to use the DQA tool whenever a data integrator combines or processes data sets [19]. At a minimum, some DLC rules, like duplicate detection and version control rules, are necessary for data integration [90].

However, even a continuous DQA process does not automatically solve another problem: temporal quality. Integrating two data sets from different periods in time can cause quality problems because real-life objects tend to change over time. For example, buildings can burn down and other buildings can replace them. Therefore, data should always contain standardized information about its temporal and versioning history [98], as the CityGML standard does. It defines a few attributes, like creationDate and terminationDate, which help to manage versions and the life cycle [99].

Morel and Gesquière [100] managed temporal changes of 3D buildings in a broader way. Instead of just adding creation and termination dates, they added flags and tags to describe DLC information from a wider scope. The tags contained temporal information, like the dates of the city object, and the flags described the behavior of the objects. They also proposed a so-called dynamic flag that makes it possible to link more frequent sensor information to CityGML data. In that way, they were able to manage all kinds of temporal changes more efficiently than by just using the default CityGML specification and attributes.
Nevertheless, their approach was not unique, as similar ideas have also been presented earlier, such as in the article of Kristin M. Stock [53]. By using these kinds of attributes to describe the history, users can query data and inspect how it looked in a specific period of time [99]. In that way, it is easier to avoid integrating outdated and current data sets [15] because the currentness of the data is known, not only for the whole data set but also for its smaller parts. Conversely, users can also utilize DLC rules to check whether version information is accurate.

2.4 Automated Data Quality Assurance in Data Integration Processes

Multiple articles (e.g. [87, 15]) emphasize that the whole integration process should be automated as much as possible. That allows managing data updates and temporal dimensions efficiently, in addition to easier access and usage [87]. A standardized way to harmonize data and facilitate its interoperability will more likely lead to successful integration because it reduces the possibility of human error [15]. In that way, it decreases effort, cost, and time compared to manual methods [15]. It is especially useful for all people who use or integrate heterogeneous and multi-source data [15] of different quality levels.

The whole DQA process can be implemented automatically using Extract, Transform, Load (ETL) processes [69]. As the abbreviation implies, ETL is a set of processes that are responsible for Extracting data from several sources, Transforming them into a usable form and then Loading them into a data warehouse or database [101]. On a more practical level, ETL contains many sub-activities, including cleaning, deduplication, conversion, normalization and dealing with loading problems [102, 103]. Ideally, the ETL processes can be realized through web processing services (WPS) that allow them to be an embedded part of the SDIs [24].
When users request data from the SDIs, the server can automatically apply quality rules to the data and derive the quality report [24]. That makes the DQA process more effortless and less time-consuming than manual work. The risk of human error can be minimized and the frequency of data updates can be increased because automation allows methods to be reused. That enhances the efficiency of data management and improves data quality at the same time. In the long run, automated DQA processes produce more satisfied customers and lower costs. [19, 25]

If the data updating process does not work flawlessly or automatically, it may cause several shortcomings in overall data quality. For example, if the integration process lags behind urban change, some buildings or blocks might not be as error-free or up-to-date as others [104]. According to Mohammadi et al. [15], automatic DQA is also faster than querying, extracting, and investigating all data manually. Manual work could also cause other kinds of challenges, such as people interpreting rules differently. Even the definition of geographical extent may be understood differently [15]. And when thinking about quality, most errors are impossible to detect with the naked eye [94]. However, automation cannot always catch everything, so sometimes manual tests are needed [25]. Humans are especially useful for associating semantics "outside the box".

Nevertheless, Hunter et al. [35] point out that most automatic DQA or integration methods would be too complex or slow for real-time processing, especially for large data sets. For example, disaster management requires real-time data from several sources, so the integration process has to be efficient [88]. Even plain deduplication can take a surprising amount of time because each feature or record has to be compared with all other features. The computation time of such deduplication grows quadratically as the size of the data set increases.
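To make the cost of plain deduplication concrete, the following minimal sketch compares every record with every other record and counts the comparisons. For n records this requires n(n-1)/2 comparisons. The function name `naive_duplicates` and the tuple-based records are assumptions made only for this illustration; they are not taken from any of the cited tools.

```python
from itertools import combinations


def naive_duplicates(records):
    """Return (duplicate index pairs, number of comparisons) for a list of records."""
    duplicates = []
    comparisons = 0
    # Every unordered pair is compared exactly once: n * (n - 1) / 2 comparisons.
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        comparisons += 1
        if a == b:
            duplicates.append((i, j))
    return duplicates, comparisons


if __name__ == "__main__":
    # Hypothetical records: (identifier, x, y) tuples.
    sample = [("b1", 0.0, 0.0), ("b2", 1.0, 1.0), ("b1", 0.0, 0.0)]
    print(naive_duplicates(sample))
```

Doubling the number of records roughly quadruples the number of comparisons, which is why production tools typically rely on hashing or spatial indexing instead of this brute-force approach.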
In particular, some geostatistical methods for detecting errors, such as Monte Carlo uncertainty propagation analysis, require a lot of computational power and time due to multiple iteration rounds. Also, estimating the needed parameters for those analyses might be slow, especially in the era of big data [35].

Sometimes it is possible to work around the efficiency problem by sampling the data [19, 10]. Annex E of the ISO 19157 standard [10] introduces multiple sampling methods for geospatial data evaluation, and there also exist a few standards that focus specifically on sampling (e.g. the ISO 28590 standard of sampling procedures [105]), but from a non-spatial perspective. However, none of them is risk-free, so they should be used with deliberation [19]. For example, the sample area should be large and representative enough.

2.5 The Role of Standards and Recommendations

All kinds of interoperability would be difficult to achieve without the contribution of international standards or recommendations. They can help data stewards and other professionals to improve the effectiveness and efficiency of the DQM. Moreover, they help us communicate with each other by using common standardized language and definitions [74]. Antti Jakobsson and Jørgen Giversen pointed out some high-level reasons to implement quality standards in data production. As can be seen from Figure 13, many non-technical reasons exist as well.

The two most notable players have been the International Organization for Standardization (ISO) TC211 and the Open Geospatial Consortium (OGC). They both have published multiple standards, XML schemas, and object models to enhance geospatial data quality and interoperability [106]. For example, ISO published the first versions of the 19xxx series of standards in the year 2002. The series contained the previously mentioned 19157 standard, which created a great basis for this large endeavor [107].
The European Commission has been involved in this development since it established the Infrastructure for Spatial Information in Europe (INSPIRE) directive in the year 2007. Its main objectives are explained in its sixth paragraph: it intends to improve the efficiency of geospatial data production, management, distribution, and usage in Europe [17]. Therefore, it can facilitate the creation of integrated SDIs beyond national borders.

INSPIRE should have been fully implemented before 2019, but it was delayed by at least a few years [108]. De Jong [109] states that the biggest barrier that occurred was the different legislative perspectives of data protection in each country. Some countries consider geographical data as personal data, which restricts its usage and publication as open data. It is understandable because geographical data may enable the identification of a natural person [109]. Also, more general challenges were listed in the articles of Ian Williamson [110] and the Working Group of Data Integration [87].

Figure 13: Legislative, technological, production and marketing reasons to use quality standards in data production. [75]

However, INSPIRE has had a massive impact even before its publication date. According to Kleomenis et al. [111], many EU countries have shown remarkable progress in complying with and adapting to the standard. Several countries created searching, viewing, and downloading tools for geospatial data just one year after the publication of the standard. After that, the rest of the EU countries are also working toward a common goal, which will eventually lead to an integrated and common European SDI [111]. The process is not yet finished at the time of writing.

Also, in the context of SDIs, application programming interfaces (APIs) have a major role. APIs are a set of interfaces, tools and communication protocols that allow data and web services to move within SDIs and between software [16].
In simple terms, they allow software to connect to other software [112]. Thus, they make geospatial data more accessible and usable, especially for end-users, because accessing the data is easier through applications [16].

The development of geospatially enabled APIs has taken steps forward mainly thanks to the Open Geospatial Consortium (OGC). Most recently, they have published the framework called OGC API - Features [113], which consists of four separate standards, two of which were under construction at the time of writing. The most fundamental principles of APIs are included in the first two published standards. They describe basic capabilities to read, access, and share geospatial data through the web by using several Coordinate Reference Systems (CRS). That includes the ability to query the data based on area or attributes. On the other hand, more complex features, including richer queries and support for creating and modifying data, are not yet supported. Most likely they will be included in the third and fourth parts. Together, these four standards form a consistent framework containing recommendations and requirements for APIs, which allows them to work consistently with other standards.

According to Jirka et al. [114], OGC API Standards comprise the functionalities of current OGC standards such as Web Map Service (WMS), Web Feature Service (WFS), and Catalog Services for the Web (CSW). For that reason, and also due to a more general approach, the OGC API Standards might become more commonly used than the previous ones.

The third very relevant recommendation, but not a standard or specification, was produced by Mark D. Wilkinson et al. [62]. They published the commonly used FAIR principles in the year 2016. The principles were created with the goal of improving the Findability, Accessibility, Interoperability, and Reuse of data. Some views have added the term Quality in order to expand the definition from FAIR to Q-FAIR [115].
Each of the terms of the FAIR abbreviation contains three or four specific elements that clarify it. The principles are not a strict set of rules that everyone must follow, but more like recommended guidelines or inspiring concepts for data publishers and stewards [116, 62].

Their most distinctive feature is that the FAIR principles focus on machine-actionability with minimal (or no) human intervention [62], even though the overall process to find, access, process and reuse the data (and the metadata) is usually more intuitive for humans than for computers. When searching or using open data from the internet, humans can easily understand semantics, such as where, why, how, and whose the data is. For machines, the mission is more complicated, as data might have different sources, licenses, formats, et cetera [117]. In the long run, it is more profitable to automate the process, as computers are more efficient when dealing with high volume, velocity, or complexity of data [117].

Because the number of relevant standards and recommendations is high, it might be difficult to access and understand all the needed parts [35]. For example, standards contain difficult vocabulary, and they are expensive for private users. It is not easy to know where to start. To overcome this problem, Antti Jakobsson and Jørgen Giversen published guidelines for implementing geographic information quality standards [75]. The guidelines are aimed mainly at data producers and SDI programs of national mapping agencies, but they are useful for customers as well. With the help of the guidelines, users could, for example, utilize standards as a basis for DQA.

As said earlier, the current standards still do not cover all aspects of the quality of multidimensional geospatial data. There are no specific standards that would offer consistent support for quality rules, especially in the context of 3D data. Users have to interpret standards differently to derive methods for harmonizing data.
Hence, a lot of work needs to be done in order to really standardize the use of 3D geospatial data.

3 Methods of Case Study

The following chapter demonstrates how to automatically assure the data quality of 3D buildings. The goal was to find the best DQA practices for building data in the CityGML and CityJSON data formats. The problem was approached by developing a tool for the FME platform, which was tested by using Level of Detail 2 (LoD2) buildings from the city of Ranua, Finland. The data set is taken as given, so the DQA perspective is similar to the one customers would have.

3.1 Used Data Formats

3.1.1 CityGML

The most common international information model for 3D building data is the City Geography Markup Language (CityGML). It was originally developed by the Special Interest Group 3D (SIG 3D), and in the year 2008 it became an international OGC standard. Its data model is open source and directly based on the objectives of the ISO 191xx standards family. It is an application schema of the Geography Markup Language (GML), which in turn is a geographical application of the Extensible Markup Language (XML). [118, 99, 119]

The data model of CityGML covers multiple thematic modules, such as buildings, water bodies, vegetation, and bridges. Each of those includes information about the Level of Detail (LoD), predefined classes, spatial and thematic attributes, as well as their relations [120]. Those make it possible to represent cities as diverse urban information spaces instead of only geometrical or visual models [72]. The wide variety of thematic modules also enables a large number of other kinds of potential applications [121] as well as integration and extension capabilities [99, 120].

Figure 14: CityGML Building module and its semantic and geometrical models. [122]

CityGML is especially handy for modelling buildings because the building module is the most detailed semantic module.
As the CityGML standard [118] and Figure 14 imply, 3D buildings and other volumetric features should be modelled by using Solids [52], which are the basis of 3D geometry [123]. A Solid consists of the integrated and unified boundaries of a rigid 3D object, which together aim to form a watertight object. From the semantic side, these boundaries are called BoundarySurfaces, which is the parent concept of five surface elements: OuterCeiling-, OuterFloor-, Wall-, Roof- and GroundSurfaces [124]. These surfaces can be linked to buildings by using xlinks, which can reduce redundancy and maintain topological relations in an unequivocal way [118, 94].

Despite the recommendations, buildings are still sometimes modelled using MultiSurfaces instead of Solids and BoundarySurfaces [52]. MultiSurfaces merely represent a set of polygons that can be connected to each other without any restrictions [118], as in Figure 15. Arbitrary topology can cause extra effort when trying to achieve correctly modelled buildings. For example, it is impossible to calculate the volume of buildings if they are not watertight 3D objects.

Figure 15: The primitives of the CityGML [124]

Furthermore, each surface or polygon (and its holes) is constructed of planar Linear Rings, which are a finite series of more than two coordinate points [52, 71]. According to some scientists, linear rings can be considered the most fundamental piece of CityGML because all other pieces consist of them [121, 72]. All those primitives are based on the widely used Simple Features (see [125]), which means they are constructed of nodes, lines, and polygons without topology [122].

For all these reasons, CityGML is a very rich and complex format, which makes it heavy, hierarchical and impractical for exchanging information through the Internet. This is not only because of its large file size; parsing it into different applications is also very complicated [122].
For example, Even Rouault introduced 26 alternative ways to model a simple 2D square through CityGML [126]. The number of variations only increases in higher dimensional data because all Solids consist of polygons [122]. To demonstrate this, Ledoux and Detlev geometrically modelled a simple 3D building in 10 different ways without even considering all secrets of GML, such as xlinks [70]. This means that developers should take into account all possible variations for every geometry type [122]. Only a few open-source applications, such as GDAL and citygml4j, can completely handle such a format.

But that is not all. In addition to all thematic modules, the CityGML model contains GenericCityObjects and Generic Attributes, which together allow unlimited extensions of the data. That makes it possible to easily create new geometry and attribute objects which are not included in the predefined thematic classes [119, 118]. Also, the Application Domain Extensions (ADE) facilitate systematic extensions by adding extra XML schemas. With the help of these extensions, it is possible to add new properties to existing classes, such as more geometries or complex attributes. [122]

3.1.2 CityJSON

Due to the complexity of CityGML, Ledoux et al. [122] presented a new JavaScript Object Notation (JSON) based data exchange format called CityJSON, which is derived from CityGML. It can be seen as a subset or compression of CityGML. The data structure of CityJSON consists of simple dictionaries and arrays, which contain non-hierarchical CityObjects and the coordinates of the vertices, respectively. These values, in addition to attributes, can be stored by using basic data types, including Boolean values, strings, and numbers. [122]

The simple design of CityJSON makes it approximately six times lighter than CityGML [122]. Nevertheless, the two formats do not correspond with each other one-to-one.
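The dictionary-and-array structure described above can be illustrated with a small, hand-written example. The sketch below is a simplified CityJSON-like document, not an extract from the case study data; the object ID `building-1`, the attribute value and the helper `surface_coordinates` are assumptions made for illustration only. It shows how the geometry of a CityObject refers to the shared `vertices` array by index.

```python
import json

# A hand-written, hypothetical CityJSON-like document (not from the thesis data set).
# "CityObjects" is a dictionary keyed by object ID; "vertices" is one shared array
# of coordinates that geometries reference by index.
CITY_MODEL = json.loads("""
{
  "type": "CityJSON",
  "version": "1.0",
  "CityObjects": {
    "building-1": {
      "type": "Building",
      "attributes": {"measuredHeight": 3.0},
      "geometry": [
        {"type": "MultiSurface", "lod": 2,
         "boundaries": [[[0, 1, 2, 3]]]}
      ]
    }
  },
  "vertices": [[0, 0, 0], [4, 0, 0], [4, 5, 0], [0, 5, 0]]
}
""")


def surface_coordinates(model, object_id):
    """Resolve the vertex indices of each MultiSurface ring to coordinates."""
    vertices = model["vertices"]
    surfaces = []
    for geometry in model["CityObjects"][object_id]["geometry"]:
        if geometry["type"] != "MultiSurface":
            continue  # other geometry types nest their boundaries differently
        for surface in geometry["boundaries"]:
            # Each surface is a list of rings; each ring is a list of vertex indices.
            surfaces.append([[vertices[i] for i in ring] for ring in surface])
    return surfaces


if __name__ == "__main__":
    print(surface_coordinates(CITY_MODEL, "building-1"))
```

Because the whole model is plain JSON, it can be loaded with the standard library of most programming languages without a dedicated parser.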
CityJSON does not contain as many modules and features as CityGML does, but it can still represent the same information in a more general way. According to the inventors of CityJSON [122], it differs from CityGML in only a few main respects:

• CityJSON does not support LoD4-level buildings, at least in version 1.1.
• In CityJSON, the whole data set has the same CRS, whereas in CityGML it can differ between every building or even between parts of the same building.
• CityJSON supports only EPSG systems, whereas CityGML can use arbitrary coordinate systems.
• In CityJSON, only semantic surfaces and CityObjects have an ID, whereas in CityGML every object can have an ID value.

Due to its lighter encoding, the biggest advantage of CityJSON is that the format is easy to use with different sets of open-source applications. For example, Stelios Vitalis et al. published a plugin for loading CityJSON data sets into QGIS [127]. Also, many programming languages can handle it directly, in the same way as traditional JSON [122].

Both formats have a large number of applications, varying from pure geometrical network planning to more visual applications, such as in the tourism field [128]. To concretize the variety, Biljecki et al. found 29 use cases and over 100 applications for 3D city models. They also underlined that many potential use cases had not been invented yet (in the year 2015) [129].

3.1.3 Level of Details

Due to new applications of 3D building data, the need for more detailed models has increased in terms of geometry and semantic information [128]. The wide range of data models with different complexities hampers the data integration process, so some commonly agreed classifications are needed. For that purpose, the term Level of Details (LoD) was created by Köninger and Bartel in the year 1998 [130]. They categorized 3D buildings into levels 1 to 3 based on their complexity.
Later on, the Special Interest Group 3D (SIG 3D) added levels 0 and 4 to complement the previous categories.

Figure 16: Level of Details illustrated [120]

As can be seen from Figure 16, a higher LoD value indicates a more accurate, complex, and granular city model, whereas a lower LoD indicates a more simplified model. The coarsest level, LoD0, is just a 2.5D digital terrain model (DTM) without any solid buildings. The LoD1 model includes buildings that are blocks constructed from 2D polygons and the mean heights of the buildings. So on that level, buildings do not contain any roof structures or textures. In contrast, LoD2 buildings have separate roofs, stairs, balconies and other more complex structures. LoD3 buildings are basically architectural models, as they contain detailed visual structures, such as windows and doors. LoD4 is similar to LoD3 from the outside, but not from the inside, as LoD4 buildings contain interior structures, including rooms and furniture. [120]

In addition to granularity, the accuracies and minimal dimensions of objects are also related to the LoD classification. Maximum vertical accuracies from LoD1 to LoD4 are 5, 1, 0.5 and 0.2 meters, respectively. Horizontal or positional accuracies are correspondingly 5, 2, 0.5 and 0.5 meters. According to these values, the data should be more accurate when the LoD value is higher. Minimum object dimensions behave in the same way. For LoD1 to LoD4 they are 6 x 6, 4 x 4, 2 x 2 and 0.5 x 0.5 meters, respectively. [131]

The LoD and other classifications of data features make data sets more comparable and interoperable, and they ease the integration of multi-source data sets [131]. Especially in the context of this case study, the LoD classification is a good basis for the creation of quality rules.
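As an illustration of how the LoD classification could serve as a basis for quality rules, the sketch below stores the accuracy and minimum-dimension values quoted above in a lookup table and checks measured errors against them. The names `LOD_REQUIREMENTS`, `meets_lod` and `highest_supported_lod` are hypothetical, and turning the values into pass/fail thresholds is an assumption of this sketch rather than a description of the implemented tool.

```python
# Accuracy and minimum-dimension values per LoD as quoted in the text above (meters).
LOD_REQUIREMENTS = {
    1: {"vertical": 5.0, "horizontal": 5.0, "min_size": 6.0},
    2: {"vertical": 1.0, "horizontal": 2.0, "min_size": 4.0},
    3: {"vertical": 0.5, "horizontal": 0.5, "min_size": 2.0},
    4: {"vertical": 0.2, "horizontal": 0.5, "min_size": 0.5},
}


def meets_lod(lod, vertical_error, horizontal_error):
    """Return True if the measured errors are within the limits of the given LoD."""
    limits = LOD_REQUIREMENTS[lod]
    return (vertical_error <= limits["vertical"]
            and horizontal_error <= limits["horizontal"])


def highest_supported_lod(vertical_error, horizontal_error):
    """Return the highest LoD whose accuracy limits are met, or None if none is met."""
    supported = [lod for lod in LOD_REQUIREMENTS
                 if meets_lod(lod, vertical_error, horizontal_error)]
    return max(supported) if supported else None


if __name__ == "__main__":
    # A data set with 0.8 m vertical and 1.5 m horizontal error.
    print(highest_supported_lod(0.8, 1.5))
```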
3.2 Used Data Integration Platform

For this project, I created the DQA tool by using FME (Feature Manipulation Engine), which is one of the few ETL software products that support both the CityGML and CityJSON formats, including all their thematic modules, objects and attributes [98]. FME was originally released in 1993. Since then, it has become one of the most popular platforms for (geospatial) data integration, validation, conversion, and transformation, as well as application integration. FME is a very versatile platform, and it can process data from various sources and formats. The rich data model supports over 500 formats. [132]

A big part of FME's popularity comes from its usability. It is easy to use because all data transformation happens in the graphical user interface. It consists of the workbench canvas and three primary drag-and-drop type elements, which are readers, writers, and transformers. Readers are components that read given source data sets, and writers write data to a destination data set, so they are practically the input and output functions of workspaces. The transformers are key pieces of the FME functionality, as they act between readers and writers. They analyze, modify, edit and restructure source data after the data is read and before it is written. Each of them has a unique algorithm or function, and they can be stacked together to form a chain that can execute more complicated tasks [132]. At the moment of writing, the FME Workbench 2021.2 application had a total of 490 different built-in transformers, in addition to all stacked transformers that users have published online.

The most powerful transformer for DQA is the GeometryValidator (see Figure 17). Similarly to most transformers, it has input and output ports that allow connecting it to other transformers.

Figure 17: The GeometryValidator of the FME. (a) Input and output ports. (b) Parameters and rules of the transformer.
By utilizing these ports, it can detect selected issues from the input features and produce error information to the output ports. If "Attempt repair" is activated, the transformer tries to fix errors, if possible. Based on the found issues and corresponding actions, features are categorized into different output ports, e.g. Passed or Failed [133].

For example, the user can set the GeometryValidator to detect and repair all invalid boundaries and self-intersections. If either of them is found in the source data, the whole invalid feature is output to the Failed port, the exact point location of the error is output to the IssueLocations port, and the smallest part of the invalid feature is output to the InvalidParts port. If the transformer manages to repair the invalid geometry, it is output to the Repaired port, and if the feature passes all tests, it continues forward through the Passed port. [133]

Even though FME is an excellent tool for DQA, it has one major disadvantage: FME is not an open-source platform, so the processing principles of some transformers are a mystery. In most cases, the documentation describes only how to use the transformer, but not how it processes data step by step. For example, what happens concretely when the GeometryValidator fixes planarity errors?

3.3 Created Data Quality Assurance Tool

In this project, the basic procedure of the DQA tool is not too complicated, as can be seen from Figure 18. The final FME tool assures data quality in the following way:

Figure 18: Simplified workflow of the DQA tool.

Before the FME workspace runs, it asks users to define the user parameters. These parameters affect the workflow of the tool. The users can select which rules will be applied, determine thresholds and functionalities for them, and finally give names for the input and output files. In total, the tool contains 14 user parameters.
For example, the user can select that only the geometry rules will be checked and that their tolerance will be 0.05 meters at maximum.

After the definition of inputs, the tool reads the input data set(s) and ensures that they are readable. If any of the user's inputs are invalid or not in the allowed range, the translation cannot continue. If all inputs are OK, the tool starts applying quality rules. First, it removes all appearances, unused attributes, and duplicate geometries and identifiers in order to avoid unnecessary processing. The tool also removes all features and elements that cannot be checked. That includes features without 3D geometries or CRSs, as well as degenerate or corrupted geometries, infinities and duplicate consecutive points. Deletion of invalid features is important since otherwise they may cause trouble in the following transformers.

After the data cleaning phase, the tool applies the rest of the rules in the categories that the user selected in the first phase. If some feature or its part violates the rules, the tool formats an error message containing quality information and a reference to the invalid part. In some cases, the tool also fixes errors automatically because otherwise they would affect the rest of the rules. Also in this case, the tool creates a quality report that contains basic information about the error and the feature.

At the end of the translation, the users can choose how the results will be displayed. The tool can either aggregate results and calculate error frequencies, or keep the quality information more detailed. If the user wants to save the resulting quality information to a file, the tool can write the results to a Comma Separated Value (CSV) file in a tabular format, or to a CityGML/CityJSON file in a 3D geometrical format. In the latter case, errors are easy to inspect visually.

The basic workflow is similar for both data formats (CityGML and CityJSON), but in practice there are separate tools and sets of rules for the two data formats due to their different encoding.
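The two reporting modes described above, aggregated error frequencies and detailed per-error rows, can be sketched in a few lines. The following example is only an illustration of the idea outside FME: the error dictionaries, the column names and the functions `aggregate_errors` and `write_report` are assumptions, not the actual output schema of the tool.

```python
import csv
import io
from collections import Counter


def aggregate_errors(errors):
    """Count how many times each rule was violated."""
    return Counter(error["rule"] for error in errors)


def write_report(errors, aggregate=True):
    """Return a CSV report as text, either aggregated or one row per error."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    if aggregate:
        writer.writerow(["rule", "count"])
        for rule, count in sorted(aggregate_errors(errors).items()):
            writer.writerow([rule, count])
    else:
        writer.writerow(["rule", "feature_id", "message"])
        for error in errors:
            writer.writerow([error["rule"], error["feature_id"], error["message"]])
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical error records; the rule names follow the tables of this thesis.
    found = [
        {"rule": "DuplicateID", "feature_id": "b1", "message": "ID used twice"},
        {"rule": "MissingCRS", "feature_id": "b2", "message": "no CRS"},
        {"rule": "DuplicateID", "feature_id": "b3", "message": "ID used twice"},
    ]
    print(write_report(found))
```

The aggregated form gives a quick overview of the most frequent problems, whereas the detailed form keeps the reference to each invalid feature.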
There is no 1-to-1 mapping between their data models, although they can represent almost the same information [122]. First of all, both data formats require their own readers and writers, and an individual set of extraction, conversion and aggregation transformers between them. Secondly, some rules are not applicable to both formats.

The biggest reason for this is their different hierarchy. As the inventor of CityJSON says [122], the format is flattened out from the CityGML schema, so it contains only two kinds of City Objects: first-level-city-objects (e.g. Buildings) and second-level-city-objects (e.g. BuildingParts). The second-level-city-objects cannot exist without their parents, so applying hierarchy checks to them is not relevant. In addition, the City Objects of different levels are not hierarchically linked together by using xlinks, so checks for xlinks are useless as well. [122]

On the other hand, CityGML can have multiple and more complex variations of attributes [122] and geometrical structures [126], and it contains more mandatory attributes and a more diverse set of CRSs. In CityGML, every building and even its parts can have an individual CRS [118]. Thus, DQA of CityGML requires more quality rules than DQA of CityJSON.

4 Results

4.1 Implemented Quality Rules

Since both CityGML and CityJSON data sets can be very large and complex, the number of possible quality rules is theoretically unlimited. For this tool, approximately 40 unique rules were implemented, and some of them are basically a set of smaller criteria. This means the real number of rules is over 60. For example, the OGC Valid Compliant (alias Fails OGC Valid) rule of FME's GeometryValidator consists of 12 criteria, including different types of intersections, and nested and disconnected geometry parts [134].

Most of the rules were easy to implement directly by using the built-in GeometryValidator transformer of FME, in which case their algorithms are impenetrable.
The rules of the GeometryValidator process data "inside the black box" and can only be viewed in terms of their inputs and outputs. The names of these rules and errors are taken as given, and thus they contain spaces. Figure 17 displays all rules of the GeometryValidator.

Because the pre-built rules in FME are meant mainly for generic purposes, I had to create some rules from scratch. These rules consist of a chain of multiple different transformers that extract, convert, modify or process data in various ways. Thus, their algorithms are more transparent, but not completely in every case. Figure 19 shows a screenshot of the self-made rule named GroundOverlapping. It consists of 16 separate transformers and the connections between them, with the aim of finding all buildings that overlap more than the given threshold (see Figure 19).

Figure 19: The FME algorithm behind the self-created GroundOverlappingIn2D rule

Lastly, there are a lot of rules that were not implemented in the FME tool due to their low significance. This thesis focuses more on the geographical perspective, so attribute rules are excluded, for instance. In most cases, attributes are independent of the dimensionality of the data, and they are also easy to check via traditional methods outside of the geoinformatics field. For example, a data steward can ensure that all numerical values are positive and in integer format.

To clarify the purpose of the rules, all of them are categorized according to the elements of the DQIM of the ISO 19157 standard [10], which was introduced in Chapter 2.1.1. Completeness and the sub-elements of logical consistency were mainly used for the quality rules, because the rules of other quality elements cannot be translated directly into machine-readable form [58]. Most of them do not consider absolute or calculable values, or they would require a reference data set to describe the ground truth [24], which was not available in this case.
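The idea behind a ground-overlap rule like the self-made one described above can be sketched outside FME as follows. This is a deliberately simplified illustration and not a reproduction of the 16-transformer workflow: building footprints are approximated by axis-aligned rectangles instead of real polygons, the threshold is assumed to be an area in square meters, and the names `overlap_area` and `ground_overlaps` are hypothetical.

```python
from itertools import combinations


def overlap_area(a, b):
    """Overlap area of two axis-aligned footprints given as (min_x, min_y, max_x, max_y)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return width * height if width > 0 and height > 0 else 0.0


def ground_overlaps(footprints, threshold):
    """Return (id_a, id_b, area) for each pair of footprints overlapping more than threshold."""
    violations = []
    for (id_a, box_a), (id_b, box_b) in combinations(sorted(footprints.items()), 2):
        area = overlap_area(box_a, box_b)
        if area > threshold:
            violations.append((id_a, id_b, area))
    return violations


if __name__ == "__main__":
    # Hypothetical rectangular footprints in a projected CRS (meters).
    buildings = {
        "b1": (0, 0, 10, 10),
        "b2": (8, 8, 12, 12),
        "b3": (20, 20, 25, 25),
    }
    print(ground_overlaps(buildings, threshold=1.0))
```

A real implementation would intersect the actual footprint polygons, for example with a geometry library, and would use a spatial index to avoid comparing every pair of buildings.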
The following Tables 1 and 2 list most of the implemented rules. The tables contain only rules that are relevant in the context of this thesis or that can be compared with other DQA software. Also, all rules that were violated at least once are included. For these reasons, the tool also contains some rules that are not listed below. All implemented rules and their detailed descriptions are available in the GitHub repository [135]. The first columns of the tables list the rule names and the second columns their descriptions. The third columns refer to the corresponding errors in the Val3dity software. The fourth columns tell whether their implementation relies on FME's built-in transformers.

Format consistency rules validate the data format and field types [10]. They ensure that data is readable. For example, if the data format is not CityGML or CityJSON, the DQA tool cannot open the data and validate the rest of the rules. When it comes to the field type of a single item, format errors can also be categorized as domain consistency errors at the same time [10], as the MissingFmeType does.

In this tool, only one rule can be categorized into the format consistency quality element. This rule can be called readability. By default, FME's readers require data in specific formats. Their versions and encoding must be supported by FME in order to read the data. At the time of writing, FME can only read CityJSON data up to version v1.0.1, and CityGML data up to version 2.0. Because this rule is built into FME, users cannot modify it, and hence it is not included in the rule hierarchy (see Figure 20).

Completeness rules measure the absence of data or the presence of extra data, so-called omission and commission [10]. These rules are always dependent on the use case [24] and are usually meant for feature-level validation, describing the link between the data set and the UoD [10].
For example, a completeness error can occur if the area has 1000 residential buildings, but the data from the same area contains only 500 buildings. On the other hand, if some parts or attributes of the buildings are missing, it is rather a conceptual consistency error, as the data does not follow its conceptual schema [10]. In this tool, a total of six completeness rules were implemented, which are listed in Table 1 below:

Table 1: Implemented completeness rules

| Rule Name | Description | In Val3dity | Built-in |
|---|---|---|---|
| MissingFMEType | Check if the feature has geometry, which is referred to by the FME type. If not, the error will occur. | 609 | No |
| MissingCRS | Check if the feature has a coordinate reference system. If not, the error will occur. | N/A | Partially |
| LOD2SolidMissingXlinks | Check if the Solid has xlinks that refer to its SurfaceMembers. If not, the error will occur. | N/A | No |
| MissingXlink | Check if the SurfaceMember has xlinks that refer to the corresponding Buildings. If not, the error will occur. | N/A | No |
| Missing Measures and Elevation | Check if the feature has measures and elevations. If any vertex has a value for measure or elevation, then all other vertices of that geometry should also have values. If not, the error will occur. | N/A | Yes |
| Missing Vertex Normals | Check if every vertex of the feature contains vertex normal vectors. If not, the error will occur. | N/A | Yes |

Conceptual consistency rules compare data to its conceptual schema (or data model) [24], which defines the basic entities, attributes and relations of the data [119]. When the data is compliant with its schema, it can be easily reused between different applications [119]. However, not all rules are based on the conceptual schema, because applications can also impose limitations on data. For example, overlapping features are not always incorrect [24]. In this case, CityGML has a standardized conceptual schema (see [119]), but CityJSON does not yet [122].
Despite this, the rules for CityJSON can be derived from the same conceptual schema because the format itself is derived from CityGML. All relevant conceptual consistency rules are listed in Table 2 below:

Table 2: Implemented conceptual consistency rules

| Rule Name | Description | In Val3dity | Built-in |
|---|---|---|---|
| GeometryType Error | Check if the geometry type of the feature is either Solid or Surface. If not, the error will occur. | N/A | No |
| Mismatched Dimensions | Check if the feature has a mixture of 2D and 3D parts. | N/A | Yes |
| Degenerate or Corrupt Geometries | Check if the feature has degenerated or corrupted geometries. | N/A | Yes |
| BuildingWithoutAddress, BuildingWithoutBoundarySurface | Check if the building contains its mandatory parts, like Addresses or WallSurfaces. | N/A | No |
| DuplicateGeometry | Check if two features have identical geometry. | N/A | Yes |
| DuplicateID | Check if two features have an identical ID value. | N/A | Yes |
| Invalid Solid Boundaries | Check if the Solid’s outer boundaries are water-tight, non-self-intersecting and properly oriented. Consists of eight sub-criteria. | 301, 302, 303, 304, 305, 306, 307 | Yes |
| Invalid Solid Voids | Check if the inner boundaries of the solid reside completely within the outer boundary and do not intersect. Consists of four sub-criteria. | 401, 402, 403, 403 | Yes |
| Fails OGC Valid | Check if the geometry can be represented completely according to the geometry model defined by OGC. Consists of 12 sub-criteria. | N/A | Yes |
| Fails OGC Simple | Check if the geometry can be represented completely according to the geometry model defined by OGC (see [125]). A simpler version of Fails OGC Valid, because it consists of only three sub-criteria. | N/A | Yes |
| Faces Wrongly Oriented | Check if the face has the correct orientation, i.e. order of vertices. If not, the error will occur. | 208 | Yes |
| Not3D | Check if the feature is three-dimensional. If not, the error will occur. | N/A | Partially |
| Contains NaNs or Infinities | Check if the feature contains -0, NaNs or Infinities. | N/A | Yes |
| Duplicate Consecutive Points in 3D | Check if the feature has duplicate consecutive points in 3D. | 102 | Yes |
| Incorrect Surface Orientation | Check if the surface has the correct orientation, i.e. order of vertices. If not, the error will occur. | 307 | Yes |
| Incorrect Solid Orientation | Check if the solid has the correct orientation, i.e. order of vertices. If not, the error will occur. | 205 | Yes |
| Non-Planar Surfaces | Check if the surfaces are non-planar based on thickness and/or degrees. | 203, 204 | Yes |
| FeatureHasSpikes | Check if the feature contains spikes whose angle is bigger than the given threshold. | N/A | Partially |
| Self-Intersections in 2D | Check if the feature intersects itself in a 2D plane. | 104, 201, 202, 205, 206, 207 | Yes |
| MinGroundHigherThanMinRoof | Check if the minimum height of the ground is higher than the minimum height of the roof. | N/A | No |
| MaxGroundHigherThanMaxRoof | Check if the maximum height of the ground is higher than the maximum height of the roof. | N/A | No |
| SurfacesWronglyOriented | Check if every surface of the feature is logically oriented based on its type and vector normals. | N/A | No |

Domain consistency rules measure the degree to which values fall within an allowed range (i.e. domain). Both the data type of the field and the domain value should be taken into account [24]. For example, a value may be required to be an integer and greater than zero. Most of these rules can be derived from the conceptual schema in addition to domain-specific rules [10]. Domain consistency rules are usually ideal for pure attribute data, as it contains values, but they can be used for values derived from geometry as well. In this tool, a total of two rules were implemented that can be categorized into the domain consistency element: AreaTooSmall and VolumeTooSmall.
These rules calculate the areas of polygons and the volumes of solids, and check whether any of them is smaller than the given threshold. By doing so, unnecessarily small parts can be deleted. I built both of them partially myself.

Topological consistency rules measure topological characteristics of the geometric and spatial relationships of the data items [10]. Mostly these characteristics should be described in the conceptual schema, in which case they are reported as conceptual consistency errors [10]. If the conceptual schema does not contain some topologically erroneous characteristics, for example, due to simplification, transformations, displacement, or aggregation, these errors may occur and should be reported as topological consistency errors [24]. In this tool, two rules in total can be considered topological consistency rules: GroundOverlappingIn2D and Self-Intersections in 2D. The first rule, GroundOverlappingIn2D, was built completely by me. It checks whether any building overlaps with another building. The user can set the maximum overlap threshold, in which case all buildings that overlap by more than the given percentage are marked as erroneous. The other rule, Self-Intersections in 2D, finds all buildings and their parts that intersect with themselves.

As Annex F of the ISO 19157 standard [10] declares, the order of the rule categories is important because one error might affect multiple data quality elements. Hence, some rules can process input data only if it has been validated or repaired earlier by other rules [71]. The best method is to first apply format consistency rules to ensure that the data and all its parts are readable. After that, the remaining logical consistency rules make sure that the data is in accordance with its conceptual schema, and completeness rules check commission and omission. Lastly, some accuracy checks can measure the deviation between the data set and the UoD [10].
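The ordering of rule categories described above can be sketched in a few lines. This is only an illustration and not the FME workspace itself: the feature dictionary, the rule functions and the name "Unreadable format" are hypothetical, and stopping at the first failing stage is an assumption made to mimic the idea that later rules need input accepted by earlier ones.

```python
# Order of the rule categories as described in the text above.
STAGE_ORDER = [
    "format consistency",
    "logical consistency",
    "completeness",
    "accuracy",
]


def validate_in_order(feature, rules_by_stage):
    """Run the rule stages in order and stop at the first stage with errors.

    Returns (stage name, list of failed rule names), or (None, []) when the
    feature passes every stage.
    """
    for stage in STAGE_ORDER:
        failed = [
            name
            for name, passes in rules_by_stage.get(stage, [])
            if not passes(feature)
        ]
        if failed:
            return stage, failed
    return None, []


# Hypothetical rules: (error name, function returning True when the feature passes).
RULES = {
    "format consistency": [
        ("Unreadable format", lambda f: f.get("format") in {"CityGML", "CityJSON"}),
    ],
    "logical consistency": [
        ("Not3D", lambda f: f.get("dimension") == 3),
    ],
    "completeness": [
        ("MissingCRS", lambda f: f.get("crs") is not None),
    ],
}

# This feature is readable but two-dimensional and lacks a CRS.
feature = {"format": "CityGML", "dimension": 2, "crs": None}
stage, failed = validate_in_order(feature, RULES)
print(stage, failed)
```

In this sketch the missing CRS is not reported, because the feature already failed at an earlier stage; whether such masking is desirable depends on the chosen policy.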
In addition to the standard, the FME’s GeometryValidator transformer has several input and output dependencies that affect the order of workflow implementation (see [133]).

Figure 20: FlowchartCityGML

Similarly, this DQA tool avoids "cascading" errors by classifying rules based on their consequence. If a feature does not pass a rule, there are three options. First, the tool can try to fix the feature automatically. If that is not possible, the tool either discards the feature or notes down the error and lets the feature continue forward. Figure 20 illustrates the hierarchy and workflow of the implemented rules in the case of CityGML data.

4.2 Used Parameters

The pass rate of these rules depends on the given parameters, which define how strictly the rules are applied. In the FME, these parameters are called user parameters by default. For example, users can define that the maximum amount of overlap is 5 percent. Then all areas where the overlap is more than 5 percent are considered violations. Conversely, if the overlap is smaller than the given threshold, the error is not counted. Naturally, these parameters affect the final number of errors dramatically. The parameters used in the DQA tool are displayed in Figure 21.

Figure 21: User parameter prompt in the FME. More information about parameters is available in the GitHub repository [135].

To enable the comparison with the Val3dity software, the same values were used for Val3dity (see Table 3). That was not possible for every parameter due to the smaller number of rules and parameters in Val3dity. For example, Val3dity does not check spikes, minimum areas or volumes, or vertex normals. On the other hand, the parameter overlap_tol in Val3dity requires an input value in distance units, whereas the corresponding attribute in the FME uses percentage values.
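To make the effect of a percentage-based overlap threshold concrete, the following minimal sketch applies a 5 percent limit to a few footprints. It is not the GroundOverlappingIn2D rule itself: footprints are simplified to axis-aligned rectangles `(xmin, ymin, xmax, ymax)`, the overlap is measured relative to each building's own area (an assumption), and all names and coordinates are hypothetical.

```python
def rect_area(rect):
    """Area of an axis-aligned rectangle; zero if the rectangle is empty."""
    xmin, ymin, xmax, ymax = rect
    return max(0.0, xmax - xmin) * max(0.0, ymax - ymin)


def overlap_percent(a, b):
    """Percentage of rectangle a's area that is covered by rectangle b."""
    intersection = (
        max(a[0], b[0]),
        max(a[1], b[1]),
        min(a[2], b[2]),
        min(a[3], b[3]),
    )
    area_a = rect_area(a)
    if area_a == 0:
        return 0.0
    return 100.0 * rect_area(intersection) / area_a


def find_overlaps(footprints, max_overlap_percent=5.0):
    """Return (id, other id, percent) for overlaps above the threshold."""
    violations = []
    for id_a, a in footprints.items():
        for id_b, b in footprints.items():
            if id_a == id_b:
                continue
            percent = overlap_percent(a, b)
            if percent > max_overlap_percent:
                violations.append((id_a, id_b, round(percent, 1)))
    return violations


# Hypothetical footprints in metres.
footprints = {
    "A": (0, 0, 10, 10),
    "B": (9, 0, 19, 10),     # shares a 1 m strip with A
    "C": (19, 0, 29, 10),    # only touches B along an edge
    "D": (0, 9.8, 10, 20),   # shares a 0.2 m strip with A
}

violations = find_overlaps(footprints, max_overlap_percent=5.0)
for id_a, id_b, percent in violations:
    print(f"{id_a} overlaps {id_b} by {percent} % of its area")
```

With the 5 percent threshold only the pair A and B is reported, while the thin overlap between A and D is tolerated. A distance-based tolerance, such as the one Val3dity uses, would not be directly comparable to this percentage.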
Table 3: Parameters used in Val3dity

| Parameter | Description | Value |
|---|---|---|
| snap_tol | Tolerance for snapping vertices that are close to each other. | 0.001 |
| planarity_d2p_tol | Tolerance for planarity based on distance on plane. [124] | 0.01 |
| planarity_n_tol | Tolerance for planarity based on normal deviations (degrees). | 20 |
| overlap_tol | The maximum distance of overlapping in the units of the input. | 5 |

4.3 Used Example Data

The CityGML data set used in this validation is produced by NLS-FI. The data set contains 3D buildings from the city of Ranua (see Figure 22), which is located in the Lapland region of Finland. In total, there are 30 640 features in the data set. Inside the single CityModel, there are 7660 LoD2 buildings and 22979 BoundarySurfaces, such as Wall- or RoofSurfaces. The data set can be downloaded from the file download service of the NLS-FI (https://tiedostopalvelu.maanmittauslaitos.fi/tp/kartta?lang=en).

Figure 22: Screenshot of the 3D buildings from the area of Ranua.

The NLS-FI defines buildings as built and notable structures that are intended to serve some activity, such as accommodation or business. To be a building, a structure should be large enough, 10 square meters at minimum, and its construction must have required a significant amount of work, usually supervised by authorities. [136]

According to the metadata, all 3D buildings are derived automatically from the 2D buildings of the National Topographic Database. The building footprints are merged with laser scanning data in order to extrude 3D buildings. Thus, the location accuracy of the buildings is directly based on the accuracy of the 2D vectors and the point cloud, which contained 5 measurement points per square meter. Other data quality elements are discussed only narrowly in the metadata. It says that the data is geometrically complete and tries to model reality as accurately as possible, but the data is still not conformant to the regulation 1089/2010 of the European Commission.
One reason for this is that the data is still being developed. According to the production team of NLS-FI, the quality is assured by using the default quality checks of the TerraSolid software. The rules ensure that roof polygons do not intersect each other and that the buildings do not deviate from their footprint polygons by more than the given tolerance value. In addition, TerraSolid detects gaps and non-planar roof patches. [137]

The missing quality information was the first reason why I selected this data set to be checked. Another reason was that the data set is relatively small compared to other CityGML data sets, which enabled faster processing. For the CityJSON validation, the data set had to be converted from CityGML to CityJSON by using software named CityGML Tools. Due to the flatter hierarchical structure of CityJSON, the conversion aggregated all BoundarySurfaces of the CityGML into one CityJSON feature. Thus, the total number of features is 7660 in the CityJSON data format.

4.4 Data Quality Results

As Table 4 indicates, the example DQA tool found many errors in the CityGML data set. The two most common violations were Fails OGC Compliant and Fails OGC Simple, which are both based on the geometry model defined by OGC [125]. All features and their BoundarySurfaces were marked as invalid because they did not follow the criteria of the OGC [133]. The documentation of the GeometryValidator does not explain the direct reason for that error [133], but an employee of Safe [138] said that the most common reason is that the source data might contain arcs, which are not supported in the geometry model of OGC. For that reason, the OGC errors are not severe errors in every case, because they depend on the selected encoding policy. Other errors with high incidences are BuildingWithoutAddress and LOD2SolidMissingXlinks.
As their names imply, these rules refer to the absence of the building’s real-world address information and to the method by which Buildings and their BoundarySurfaces are linked together topologically.

When using the same set of rules and thresholds for the CityJSON format (see Figure 21), the results were quite similar. Only the numbers of duplicate consecutive points (108 -> 24) and self-intersections (116 -> 124) were different due to the conversion from CityGML to CityJSON. That was not a bug of the DQA tool, as the same effect was also visible in the results of the Val3dity software. Also, the occurrences of Fails OGC Compliant and Fails OGC Simple were fewer because in CityJSON all BoundarySurfaces were aggregated into one feature. Thus, the total number of features is smaller, which naturally led to a lower number of errors when all features are invalid. Lastly, the number of self-intersections depends on whether the duplicate consecutive points are repaired or removed earlier, because the FME’s GeometryValidator transformer reports duplicate points as intersections [133].

The tool can directly improve data quality and enhance interoperability, since most errors are easy to fix automatically by using the GeometryValidator transformer. In total, the transformer can detect 17 errors, but only 15 of them can be repaired automatically according to its documentation [133]. Neither of the OGC errors is repairable without chaining other transformers. However, it is good to understand that fixing some errors may produce other errors. For example, repairing intersections in the building boundaries can cause degenerated faces, which in turn prevent checking for building voids [133]. Even without new errors, improving data quality for one purpose might decrease the quality for secondary purposes, because the same data is usually used for several tasks.

Table 4: Errors found by the FME. All of them are conceptual consistency errors except LOD2SolidMissingXlinks, which can also be considered an omission.

| Error Name | Details | Count |
|---|---|---|
| Fails OGC Compliant | Feature does not comply with the geometry model of the OGC. This rule contains 12 sub-criteria, which were not specified by the FME. | 30640 |
| Fails OGC Simple | Feature does not comply with the geometry model of the OGC. This rule contains 3 sub-criteria, which were not specified directly by the FME. | 30640 |
| Building_Without_Address | Building or BuildingPart has no address feature | 7760 |
| LOD2Solid_Missing_Xlinks | LoD2Solid does not contain Xlinks | 7760 |
| FeatureHasSpikes | Feature has spikes that exceed the given tolerance | 424 |
| Self-Intersections in 2D | Feature intersects itself in 2D | 116 |
| Duplicate Consecutive Points in 3D | The feature has duplicate consecutive points | 108 |
| Invalid Solid Boundaries | Surface Not Closed | 90 |
| Incorrect Solid Orientation | The normals of outer shells do not face out or the normals of inner shells do not face in. | 65 |
| Invalid Solid Boundaries | Multiple Connected Components | 16 |
| MinGround_HigherThanMinRoof | The minimum height of the roof is smaller than the minimum height of the ground. | 7 |
| Ground_Overlapping_n2D | The geometry of two or more GroundSurfaces overlaps more than the given tolerance | 6 |
| NonPlanar Surfaces | Triangulated normal angles exceed tolerance | 5 |
| Invalid Solid Boundaries | Not Valid 2-Manifold | 3 |

Figure 23: Example picture of self-intersection.

4.5 Comparison with Val3dity

To get some reference results, I compared the results of my tool to the results of Val3dity. Val3dity is an open source quality validation tool for 3D city model data that checks whether data sets comply with the ISO 19107 standard (see [123]) [124]. Despite its wide usage, it was not the primary tool because it has only a limited number of rules, which focus only on geometry. It gives a good and trustworthy insight into data, but for more sophisticated purposes it is not as easily scalable as the FME.
Nevertheless, the two software packages have a lot in common. Originally, the first versions of Val3dity were used as a basis for the FME’s GeometryValidator transformer, so they contain partly the same logic [124]. For that reason, my tool can be compared to Val3dity. Despite a few uncertainty factors between Val3dity and the FME (which are discussed in Chapter 5.1), we can still suppose that neither the FME nor Val3dity produces untrustworthy results. That allows us to compare the results. For the comparison, the errors of the FME had to be aggregated per error type, because Val3dity does not consider errors that occur more than once per building or feature. It only displays the number of erroneous buildings per error instead of all errors that have been found.

Table 5: Errors that Val3dity found and the corresponding results of the FME. Results are aggregated as erroneous features per error. The table does not contain errors that only the FME found.

| Val3dity error ID | Val3dity count | FME error | FME count |
|---|---|---|---|
| 102 CONSECUTIVE_POINTS_SAME | 9 | Duplicate Consecutive Points in 3D | 3 |
| 104 RING_SELF_INTERSECTION | 11 | Self-Intersections in 2D | 33 |
| 302 SHELL_NOT_CLOSED | 17 | Invalid Solid Boundaries: Surface Not Closed | 18 |
| 303 NON_MANIFOLD_CASE | 3 | Invalid Solid Boundaries: Not Valid 2-Manifold | 3 |
| 305 MULTIPLE_CONNECTED_COMPONENTS | 16 | Invalid Solid Boundaries: Multiple Connected Components | 16 |
| 306 SHELL_SELF_INTERSECTIONS | 1 | Self-Intersections in 2D | 33 |

The results of using the same rules, parameters and data set for Val3dity can be seen in Table 5. The first observation is that Val3dity did not find as many errors as my tool. In addition to missing the errors of all self-made rules (which were not available in Val3dity), Val3dity did not find any planarity violations, even with the same thresholds. A part of the reason might be the different rule hierarchies. Val3dity does not check planarity for features that contain intersections [139], but my tool does.
At least all features are entered into the transformer that checks planarity. What happens inside the transformer remains unclear, since the documentation of the FME’s GeometryValidator does not tell how it is able to triangulate features that contain intersections [133], which is needed for planarity checks.

5 Discussion

5.1 Reliability of the Results

Comparison between the results of my DQA tool and Val3dity is challenging for many reasons. The first is the different hierarchy of the rules. Val3dity applies quality rules first to the LinearRings and Surfaces, then to the Shells, and finally to the Solids and CompositeSolids. Therefore, an error at the LinearRing level discards the whole building and prevents it from continuing forward to the Shell or Solid level checks. In that case, only the first error is noticed and the rest of the errors are ignored due to their higher geometry levels [124]. In contrast, my tool enters data into the transformers mostly as complete buildings, and the transformers do not distinguish the level on which the error exists.

Many rules rely directly on the GeometryValidator transformer of the FME, which has multiple input and output dependencies that have to be respected to get trustworthy results [133]. For example, the duplicate consecutive points have to be removed or fixed before the check for self-intersections. Similarly, repairing self-intersections in degenerated geometries may cause the loss of coordinates or change the geometry type. That may cause some differences between the software and hinders the comparison.

The different order and target level of the rules also affect the results in other ways. In some cases, the FME does not directly distinguish the level on which errors occur. Thus, one error in the FME can be equal to multiple errors in Val3dity. A good example is the built-in rule named Self-Intersections in 2D, which is equal to the rules 104, 201, 202, 205, 206 and 207 in Val3dity [140, 124, 133].
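The one-to-many correspondence described above matters when counts from the two tools are compared. The following minimal sketch shows how one FME error name can be looked up against several Val3dity codes, and how repeated errors can be reduced to one per feature in the way Val3dity reports them. The `(feature id, error name)` rows are a hypothetical structure, not the actual report format of the tool, and the mapping contains only correspondences stated in this thesis.

```python
from collections import defaultdict

# Correspondences stated in the text and in Table 5; not a complete mapping.
FME_TO_VAL3DITY = {
    "Self-Intersections in 2D": [104, 201, 202, 205, 206, 207],
    "Duplicate Consecutive Points in 3D": [102],
}


def erroneous_features_per_error(error_rows):
    """Count distinct features per error name, ignoring repeated errors."""
    features = defaultdict(set)
    for feature_id, error_name in error_rows:
        features[error_name].add(feature_id)
    return {name: len(ids) for name, ids in features.items()}


def val3dity_codes(error_name):
    """Return the Val3dity codes listed for an FME error name, if any."""
    return FME_TO_VAL3DITY.get(error_name, [])


# Hypothetical error rows: building_1 intersects itself in two places.
rows = [
    ("building_1", "Self-Intersections in 2D"),
    ("building_1", "Self-Intersections in 2D"),
    ("building_2", "Self-Intersections in 2D"),
    ("building_1", "Duplicate Consecutive Points in 3D"),
]

counts = erroneous_features_per_error(rows)
for name, count in counts.items():
    print(name, count, val3dity_codes(name))
```

Because a single FME error name covers several Val3dity codes, even the aggregated counts can only be compared group by group and not code by code.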
On the other hand, the FME contains many similar rules, especially in the case of intersections. The rules Self-intersections in 2D, Invalid Solid Voids - Shells Intersect, Invalid Solid Boundaries - Surface Self Intersect, Fails OGC Valid - Self-intersection and Fails OGC Simple - Self-intersections all validate the same problem of intersections, but from different perspectives or under different error names [134, 133]. The differences between them are sometimes difficult to recognize.

Another very important factor is the parameters that were used (see Chapter 4.2). In this case, they were more or less arbitrary, as they were copied from the default parameters of Val3dity as far as possible. These parameters have a huge impact on the results because they define the thresholds for the rules. Naturally, there are no correct values for the parameters, as they always depend on the properties and objectives of a data set.

When comparing errors between the FME and Val3dity, it is good to remember that many of the FME readers tend to clean data automatically, which hinders error detection. For example, in the case of the CityGML reader, the FME automatically closes unclosed polygons and linear rings without any notification or possibility to disable the behaviour. In Val3dity, these are reported as an error called 103 - polygon not closed.

Nevertheless, it is challenging to say which tool produces more trustworthy quality results, since both have their own way of handling such polymorphic formats. Both the FME and Val3dity try to process formats and interpret errors in a way that is preferable for them, but not for the other. A deeper comparison would require fundamental changes to the FME, which are mostly impossible to carry out as a user.

5.2 Results in the Context of Data Integration

The most interesting result is the readiness of the data for data integration.
As outlined earlier, the data sets have to be interoperable and harmonized, and their quality must be known, to ensure successful integration. Admittedly, it is problematic to consider the integration capabilities of a single data set without reference or counterpart data sets, but a comparison with the standards, recommendations and conceptual model is always possible.

The results show clearly that the DQA tool can assess the quality level of the data and improve its interoperability and harmonization. First of all, all features in the data set had a unique identifier and geometry. Only six buildings overlapped by more than 5 percent of their areas, which is possible to repair even manually. None of the features had duplicates, degenerated or corrupted geometries, infinities, or values that are not numbers (NaNs). All this good news indicates that data mapping might be possible to carry out.

Also, all buildings contained their mandatory BoundarySurfaces, which makes it possible to utilize semantic information. That is very beneficial in the data integration process for many reasons. For example, semantics can be used for clarifying interrelations and identities in the real world, defining tolerances and thresholds for modifications, and finding tie points to minimize ambiguities of modifications and updates [94]. The successful combination of semantics and geometry is the key to data integration, because the lack of semantics may lead to the misunderstanding of data.

Depending on the encoding policy in the data integration process, one of the biggest problems might be that the BoundarySurfaces were not connected by using xlinks. In theory, xlinks make it possible to efficiently re-use the same BoundarySurfaces multiple times, but they are not mandatory according to CityGML. That is because xlinks are one-directional and always need to be resolved [118]. Thus, using xlinks might be slow, especially for large files where xlinks point to external objects [124].
Because they depend on the situation, missing xlinks are a two-fold error: they are not a problem if no one uses them, but problems arise when two different topology referencing methods meet.

Some geometry errors might exist due to shortcomings in data production, since there are so many methods to construct 3D buildings from point clouds. Most of them do not directly affect the interoperability between data sets, but they may accumulate during the DLC or data integration processes. That is why it is not a good policy to combine data sets with different levels of quality. Combining two erroneous data sets might produce even more errors, such as intersections and overlapping features and attributes.

5.3 Using the Data Quality Assurance Tool During the Data Life Cycle

In this case study, the perspective was focused on the phases after the data is produced, so the process imitates the DQA perspectives of customers or data integrators more than those of data producers. The tool can assure the data quality of complete buildings, but a direct impact on data quality during or before the creation of the data is impossible. That is because the data set was taken as given, and the tool can only read CityGML or CityJSON data sets that are complete enough in terms of readability and encoding. Thus, the tool cannot be used to manage the quality of point clouds or other measurements without modifications.

However, thanks to the flexibility of the FME, similar tools could also be used for other phases of the DLC. To assure the data quality of 3D buildings in their creation phase, the FME provides a large number of transformers that can manage the quality of point clouds or 2D building footprints. By chaining those transformers, it is possible to create another tool to meet the needs of data production, before the buildings are complete.
Evaluating completeness and positional, temporal, and thematic accuracy can be especially useful, since they are usually considered the responsibility of data producers due to the requirement of ground truth [57]. The only constraint is that most data quality measures of the data production phase, especially positional and temporal accuracy, usually require a reference data set to which the real data can be compared (see Chapter 2.1.3).

In practice, it is challenging to create a single DQA tool that would be well suited for all phases of the DLC, since both data objectives and formats can change a lot during the DLC. For this reason, almost all researchers emphasize the importance of communication between data producers and customers. Only in that way can the data have high quality from all perspectives.

5.4 Future Possibilities

It is obvious that more development and research is necessary in order to overcome the variety and complexity of 3D geospatial data. Further development of this DQA tool is relatively easy, since in most cases it does not require any coding capability. In the FME, everything happens inside the graphical user interface and via drag-and-drop transformers. That is especially useful since the tool as it is now can be used mainly for generic purposes. The current set of rules does not cover all domain-specific peculiarities, so modifications, such as adding, editing or removing rules, are necessary. For example, rules related to life cycle, business or version management might be beneficial. Users might also want to expand the tool by adding more data types, formats, thematic modules, LoDs, or other types of add-ons. Just like the data itself, the DQA tool can be of high quality only if it fits its intended purposes.

One concrete change to consider is a different rule hierarchy of the FME tool, which might facilitate a better comparison with Val3dity.
A good alternative would be to set one GeometryValidator transformer for every level of geometry. By chaining transformers, the first GeometryValidator would apply rules to the LinearRings, the second one to Polygons, and the last transformers would validate Solids and CompositeSolids.

Another suggestion would be to improve the quality information reporting of the tool. As discussed in Chapter 2.2.4, the current discourse around data quality is very sophisticated and diverse. The DQA tool of this thesis reports the errors in a format that is not standardized or easy to interpret for everyone. One useful way to overcome this problem is to utilize the Data Quality Vocabulary (DQV) [141] framework, which aims to facilitate the reporting and accessing of data quality by providing a more common vocabulary to express it. The DQV does not only provide definitions for data quality, but also offers more standardized means for data quality measurements, dimensions, ontologies, management, annotations, standards, policies, provenance, et cetera [141]. That helps data producers to publish data quality information in such a way that potential users can make their own judgements about how good the data is for their purposes.

The clear problem of the DQV, however, is its abstractness. Its learning curve might be steep, especially for people who do not understand the syntax of the Resource Description Framework (RDF), on which the DQV is based. Fortunately, the DQV is machine-readable, so it can be integrated as an automatic part of other pipelines, such as data quality dashboards. For example, Heiko Figgemeier et al. [142] provided a DQV-based prototype of an interactive dashboard that displays the quality information in a user-friendly format. For most people, dashboards or maps might be easier to interpret than textual quality reports [79]. According to Devillers et al.
[79], another idea to overcome the accessibility problem of the results is to aggregate detailed quality information into more general higher-level categories, such as spatial accuracy. It would give an overall view of the quality at first glance, but users could also expand indicators to see more detailed information. Similar statistical ideas were presented in Annex G of the ISO 19157 standard [10]. The standard introduced three ways to aggregate Boolean-type (fail/pass) quality results: percentage indicators, weighted values, and minimum/maximum values. Regardless of the aggregation method used, it is important to remember that whenever information is summarized, some details will disappear. That is why aggregation should always be used with good reason, and it should be made clear to the users [10].

The third interesting option is to fully automate the tool. That can be achieved by adding some artificial intelligence functionalities to the tool, such as evolutionary algorithms, machine learning or semantic processing. They would help to determine the best set of algorithms, rules and their thresholds, which would reduce human effort and therefore the risk of errors even more. Only then would the DQA tool be fully automatic [24]. An ideal approach is to make the DQA process so simple and invisible that everyone could use it regardless of their experience [24]. In the long run, full automation would reduce costs and improve results, even though humans would always be needed [87, 15, 19].

It is also possible to automate the full usage of the tool, instead of only its workflow. That can be managed by using the FME through the command line or in a cloud environment, which makes it possible to embed the tool as an automatic part of other data pipelines, such as SDIs or data spaces.
Such embedding can rely on standard Web interfaces. For example, the OGC Web Processing Service (WPS) Interface Standard and the newer OGC API standards, such as OGC API - Features [113], are well-suited platforms for requesting and delivering geospatial data and processing services through the Web [24].

All these future possibilities should be considered at least in organizations that produce geospatial data, such as national mapping agencies. The case study of this thesis showed that the automatic DQA tool is beneficial for every role and phase of the DLC. Despite these known benefits, only a few organizations check their data against quality rules as a normal part of their data routines. For example, the NLS-FI uses almost no DQA methods, even though its data contains quality errors. Therefore, the primary recommendation for people working with data management is to develop and implement DQA tools in a way that makes them an integral part of daily routines.

6 Conclusions

This thesis considers how organizations can utilize DQA methods to facilitate integration processes of geospatial data in every phase of the DLC. The most common way to assess the current level of data quality is to check the data against a set of quality rules and then edit it with the help of the results. These steps are called quality validation and data editing, and both are part of DQA. The underlying problem is that most organizations do not know how, why, and when to apply DQA methods to geospatial data. The problem is made worse by the lack of common methods and instructions. In fact, there is only one standard, ISO 19157, that focuses specifically on geospatial data quality and its interoperability. It provides some useful methods to measure data quality and suggests some actions, but it is still not comprehensive enough to cover all aspects, objectives and formats of the diverse nature of geospatial data quality.
Quality always depends on the purpose for which the data is used, so every purpose requires its own quality objectives and rules.

To answer part of this demand, this thesis proposes a tool to assure the quality of 3D building data in different phases of the DLC and of data integration processes. In practice, the tool is an FME-based workspace that applies quality rules to LoD2 building data in the CityGML and CityJSON formats. It finds all violations of the rules and lists them in a quality report. In this way, the DQA tool can improve data quality, interoperability and harmonization in accordance with the most relevant standards and recommendations. In addition to elevated quality, the tool improves the level of readiness for data integration.

The results confirmed that the DQA tool can be beneficial for all roles during the DLC. Data producers can ensure that data follows its conceptual schema, and customers can find out whether the data is suitable for their use. From the example data set, the tool found several errors ranging from commission to conceptual consistency errors. Some of the errors were more severe than others, especially with regard to data integration. However, most of them could be repaired automatically, which improves data quality directly and can facilitate data integration.

Because the objectives of data change over time, it is vital to manage data quality and integration capabilities in all stages of the data lifecycle, from data creation until final disposal. Therefore, the best practice is to apply DQA methods every time someone updates, combines or modifies data. To do this, standardized and easy-to-use methods, such as the DQA tool introduced in this thesis, are very beneficial. The tools should be so simple that everyone can manage them regardless of their experience or organizational role. Ideally, the whole process would be as independent and automated as possible.
As the field of data quality is still crowded with unsolved issues, I urge all my colleagues and all organizations working with data to put more emphasis on data quality. The more attention we pay to data quality today, the sturdier the groundwork we lay for further study. Resolving data quality challenges is an investment in the future.

References

[1] Richard Y Wang. A product perspective on total data quality management. Communications of the ACM, 41(2):58–65, 1998.
[2] Thomas C Redman. The impact of poor data quality on the typical enterprise. Communications of the ACM, 41(2):79–82, 1998.
[3] ISO Central Secretary. Quality management systems — Fundamentals and vocabulary. Standard ISO 9000:2015, International Organization for Standardization, Geneva, CH, September 2015.
[4] ISO Central Secretary. Data quality — Part 1: Overview. Standard ISO 8000-1:2011, International Organization for Standardization, Geneva, CH, May 2011.
[5] Richard Y Wang, Mostapha Ziad, and Yang W Lee. Data quality, volume 23. Springer Science & Business Media, 2006.
[6] Software engineering — Software product Quality Requirements and Evaluation (SQuaRE) — Data quality model. Standard ISO/IEC 25012:2008, International Organization for Standardization, Geneva, CH, December 2008.
[7] Luz Angela Rocha, Jonathan Montoya, and Alvaro Ortiz. Quality assurance for spatial data collected in fit-for-purpose land administration approaches in Colombia. Land, 10(5):496, 2021.
[8] Šárka Hošková-Mayerová. Geospatial data reliability, their use in crisis situations. In International conference Knowledge-Based Organization, volume 21, pages 694–698, 2015.
[9] Rodolphe Devillers, Alfred Stein, Yvan Bédard, Nicholas Chrisman, Peter Fisher, and Wenzhong Shi. Thirty years of research on spatial data quality: achievements, failures, and opportunities. Transactions in GIS, 14(4):387–400, 2010.
[10] Geographic information — Data quality – Part 1: General requirements.
Standard ISO/CD 19157-1:2021(E), International Organization for Standardization, Geneva, CH, January 2021.
[11] Francisco Javier Ariza-López, Pablo Barreira González, Joan Masó Pau, Alaitz Zabala Torres, Antonio Federico Rodríguez Pascual, Gonzalo Moreno Vergara, and José Luis García Balboa. Geospatial data quality (ISO 19157-1): evolve or perish. Revista Cartográfica, (100):129–154, 2020.
[12] JE Stoter, GAK Arroyo Ohori, B Dukai, A Labetski, K Kavisha, S Vitalis, and H Ledoux. State of the Art in 3D City Modelling: Six Challenges Facing 3D Data as a Platform. GIM International: the worldwide magazine for geomatics, 34, 2020.
[13] Bashkim Idrizi. General Conditions of Spatial Data Infrastructure. International Journal on Natural and Engineering Sciences. Turkey, 2018.
[14] Hossein Mohammadi, Andrew Binns, Abbas Rajabifard, and Ian P Williamson. Spatial data integration. In Proceeding of the 17th UNRCC-AP Conference and 12th Meeting of the PCGIAP, Bangkok, Thailand, 2006.
[15] Hossein Mohammadi, Abbas Rajabifard, and Ian P Williamson. Development of an interoperable tool to facilitate spatial data integration in the context of SDI. International Journal of Geographical Information Science, 24(4):487–505, 2010.
[16] Martina Barbero, Monica Lopez Potes, Glenn Vancauwenberghe, Danny Vandenbroucke, and V Nunes de Lima. The role of spatial data infrastructures in the digital government transformation of public administrations. Publications Office of the European Union: Luxembourg, 2019.
[17] DIRECTIVE 2007/2/EC OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL of 14 March 2007 - Establishing an Infrastructure for Spatial Information in the European Community (INSPIRE). L, L 108/1:1–14, 2007-04-25.
[18] European Commission. COM(2020) 66 final - A European strategy for data. https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:52020DC0066&from=EN, Accessed: 22.6.2020.
[19] Matthew Beare, Riikka Henriksson, Antti Jakobsson, Jarmo Marttinen, Erling Onstein, Lysandros Tsoulos, Frederique Williams, Jaana Mäkelä, Lies De Meulanear, Inger Persson, and Ioannis Kavadas. ESDIN Quality Final Report. October 2010.
[20] E Pauknerova, P Sidlichovsky, S Urbanas, and M Med. The European Location Framework - From National to European. International Archives of the Photogrammetry, Remote Sensing & Spatial Information Sciences, 41, 2016.
[21] A Jakobsson, O Ostensen, D Lovell, A Hopfstock, R Mellum, D Kruse, C Portele, S Urbanas, J Hartnor, A Bray, et al. European location framework—one reference geo-information service for Europe. In Proceedings of the 26th international cartographic conference, Dresden, pages 25–30, 2013.
[22] Nils Mesterton. Paikkatietojen automaattinen laadunarviointi avoimen lähdekoodin ohjelmistoilla. Master’s thesis, Aalto University, Geomatiikka, 2015.
[23] Open Geospatial Consortium. Geospatially Enabled Ecosystem for Europe (GeoE3). https://www.ogc.org/projects/initiatives/geoe3, Accessed: 26.05.2022.
[24] Amin Mobasheri. Exploring the possibility of semi-automated quality evaluation of spatial datasets in spatial data infrastructure. Journal of ICT Research & Applications, 7(1), 2013.
[25] Sara B Dakrory, Tarek M Mahmoud, Abdelmgeid A Ali, et al. Automated ETL testing on the data quality of a data warehouse. International Journal of Computer Applications, 131(16):9–16, 2015.
[26] J Alferes, P Poirier, C Lamaire-Chad, Anitha Kumari Sharma, Peter Steen Mikkelsen, and PA Vanrolleghem. Data quality assurance in monitoring of wastewater quality: Univariate on-line and off-line methods. In Proceedings of the 11th IWA conference on instrumentation control and automation, pages 18–20, 2013.
[27] D. Harper. Etymology of quality, Online Etymology Dictionary. https://www.etymonline.com/word/quality, Accessed: 30.05.2022.
[28] Peter Antman. From Aristotle to Descartes - A Brief History of Quality.
https://smartbear.com/blog/from-aristotle-to-descartes-abrief-history-of-qua/, Accessed: 31.05.2022.
[29] Paul Lillrank. The quality of information. International Journal of Quality & Reliability Management, pages 691–703, 2003.
[30] Paul Lillrank. Laatuajattelu : laadun filosofia, tekniikka ja johtaminen tietoyhteiskunnassa. Otava, Helsingissä, 1998.
[31] Christian Fürber. Semantic Technologies. In Data Quality Management with Semantic Technologies, pages 56–68. Springer, 2016.
[32] Joseph M Juran, A Blanton Godfrey, Robert E Hoogstoel, and Edward G Schilling. Juran’s quality handbook 5th ed. McGraw Hill, 1999.
[33] Thomas C Redman. Data quality: the field guide. Digital press, Boston, 2001.
[34] Rodolphe Devillers and Robert Jeansoulin. Fundamentals of spatial data quality. ISTE Publishing Company, 2006.
[35] Gary J Hunter, Arnold K Bregt, Gerard Heuvelink, Sytze De Bruin, and Kirsi Virrantaus. Spatial data quality: problems and prospects. In Research trends in geographic information science, pages 101–121. Springer, 2009.
[36] Corinna Cichy and Stefan Rass. An overview of data quality frameworks. IEEE Access, 7:24634–24648, 2019.
[37] National Committee for Digital Cartographic Standards (US) and Harold Moellering. A draft proposed standard for digital cartographic data. National Committee for Digital Cartographic Data Standards Columbus, OH, USA, 1987.
[38] Geographic information — Reference model. Standard ISO 19101:2002, ISO/TC 211 Geographic information/Geomatics, Geneva, CH, July 2002.
[39] Tatjana Kutzner. Geospatial data modelling and model-driven transformation of geospatial data based on UML profiles. PhD thesis, Technische Universität München, 2016.
[40] David Loshin. The practitioner’s guide to data quality improvement. Elsevier, 2010.
[41] Zeljko Panian. Some practical experiences in data governance. World Academy of Science, Engineering and Technology, 62(1):939–946, 2010.
[42] Antti Jakobsson.
Data quality and quality management–examples of quality evaluation procedures and quality management in European national mapping agencies. Spatial data quality, pages 216–229, 2002.
[43] Gil Press. Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says. https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-datascience-task-survey-says/?sh=59ef94766f63, Accessed: 13.10.2021.
[44] Amir Parssian, Sumit Sarkar, and Varghese S Jacob. Assessing data quality for information products: impact of selection, projection, and Cartesian product. Management Science, 50(7):967–982, 2004.
[45] Patrick Barwise. Nine reasons why tech markets are winner-take-all. https://www.london.edu/think/nine-reasons-why-tech-marketsare-winner-take-all, Accessed: 16.05.2022.
[46] Economics. The world’s most valuable resource is no longer oil, but data. https://www.economist.com/leaders/2017/05/06/the-worlds-mostvaluable-resource-is-no-longer-oil-but-data/, Accessed: 06.05.2022.
[47] Profisee Master Data Management (MDM). Data Quality - What, Why, How, 10 Best Practices and More. https://profisee.com/data-quality-whatwhy-how-who/, Accessed: 11.05.2022.
[48] Russell L Ackoff. From data to wisdom. Journal of applied systems analysis, 16(1):3–9, 1989.
[49] Ilkka Tuomi. Data is more than knowledge: Implications of the reversed knowledge hierarchy for knowledge management and organizational memory. In Proceedings of the 32nd Annual Hawaii International Conference on Systems Sciences. 1999. HICSS-32. Abstracts and CD-ROM of Full Papers, pages 12–pp. IEEE, 1999.
[50] Geographic information — Quality principles. Standard ISO 19113:2002, International Organization for Standardization, Geneva, CH, December 2002.
[51] Michel Krämer, Jörg Haist, and Thorsten Reitz. Methods for Spatial Data Quality of 3D City Models. In Eurographics Italian chapter conference, pages 167–172, 2007.
[52] D Wagner, N Alam, M Wewetzer, M Pries, and V Coors. Methods for geometric data validation of 3D city models. International Archives of the Photogrammetry, Remote Sensing & Spatial Information Sciences, 40, 2015.
[53] Kristin M Stock. Spatio-temporal data management using object lifecycles: A case study of the Australian Capital Territory spatial data management system. Journal of spatial science, 51(1):43–58, 2006.
[54] Martin Ofner, Boris Otto, and Hubert Österle. A maturity model for enterprise data quality management. Enterprise Modelling and Information Systems Architectures-An International Journal: Vol. 8, Nr. 2, 2013.
[55] Joseph M Juran and Frank M Gryna. Quality control handbook. Number 658.562 Q-1q. McGraw Hill, 1974.
[56] Ronald Moen and Clifford Norman. Evolution of the PDCA cycle. In Proceedings of the 7th ANQ Congress, Tokyo, 2009.
[57] Maanmittauslaitos. Kansallinen maastotietokanta - laadunhallintajärjestelmä. https://www.maanmittauslaitos.fi/sites/maanmittauslaitos.fi/files/attachments/2020/03/KMTK%20laadunhallintajarjestelma.pdf, Accessed: 09.06.2022.
[58] Fei Wang. Handling data consistency through spatial data integrity rules in constraint decision tables. PhD thesis, Universitätsbibliothek der Universität der Bundeswehr München, 2008.
[59] The Data Governance Institute. Data Governance: The Basic Information. https://datagovernance.com/the-data-governance-basics/adgdata-governance-basics/, Accessed: 06.10.2021.
[60] Budi Laksono Putro, Kridanto Surendro, and Herbert. Leadership and culture of data governance for the achievement of higher education goals (Case study: Indonesia University of Education). In AIP Conference Proceedings, volume 1708, page 050002. AIP Publishing LLC, 2016.
[61] Ning Zhang and Q Yuan. An overview of data governance. Economics Paper, 2016.
[62] Mark D Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E Bourne, et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific data, 3(1):1–9, 2016.
[63] National Research Council. "Environmental Data Management at NOAA: Archiving, Stewardship, and Access". The National Academies Press, Washington, DC, 2007.
[64] The Data Governance Institute. Data Governance Glossary. https://datagovernance.com/the-data-governance-basics/datagovernance-glossary/, Accessed: 06.10.2021.
[65] Sarah E McCord, Nicholas P Webb, Justin W Van Zee, Sarah H Burnett, Erica M Christensen, Ericha M Courtright, Christine M Laney, Claire Lunch, Connie Maxwell, Jason W Karl, et al. Provoking a cultural shift in data quality. BioScience, 71(6):647–657, 2021.
[66] William K Michener. Quality assurance and quality control (QA/QC). In Ecological Informatics, pages 55–70. Springer, 2018.
[67] Marco Di Zio, Nadežda Fursova, Tjalling Gelsema, Sarah Gießing, Ugo Guarnera, Jūratė Petrauskienė, L Quensel-von Kalben, Mauro Scanu, KO ten Bosch, Mark van der Loo, et al. Methodology for data validation 1.0. Essnet Validat Foundation, 2016.
[68] Richard Y Wang, Veda C Storey, and Christopher P Firth. A framework for analysis of data quality research. IEEE transactions on knowledge and data engineering, 7(4):623–640, 1995.
[69] Jasna Rodic and Mirta Baranovic. Generating data quality rules and integration into ETL process. In Proceedings of the ACM twelfth international workshop on Data warehousing and OLAP, pages 65–72, 2009.
[70] Detlev Wagner and Hugo Ledoux. CityGML Quality Interoperability Experiment. OGC: Wayland, MA, USA, 2016.
[71] Detlev Wagner, Mark Wewetzer, Jürgen Bogdahn, Nazmul Alam, Margitta Pries, and Volker Coors. Geometric-semantical consistency validation of CityGML models.
In Progress and new trends in 3D geoinformation sciences, pages 171–192. Springer, 2013.
[72] Jürgen Bogdahn and Volker Coors. Towards an automated healing of 3D urban models. In Proceedings of international conference on 3D geoinformation. International archives of photogrammetry, remote sensing and spatial information science, volume 38, page 4. Citeseer, 2010.
[73] May Yuan. Use of a three-domain representation to enhance GIS support for complex spatiotemporal queries. Transactions in GIS, 3(2):137–159, 1999.
[74] DAMA. The DAMA Dictionary of Data Management. Technics Publications, Denver, Colorado, 2008.
[75] Antti Jakobsson and Jørgen Giversen. Guidelines for implementing the ISO 19100 geographic information quality standards in national mapping and cadastral agencies. Eurogeographics Expert Group on Quality, 2007.
[76] Sujith Kumar. What is Data Lifecycle Managements. https://stealthbits.com/blog/what-is-data-lifecycle-management, Accessed: 29.10.2021.
[77] John L Faundeen, Thomas E Burley, Jennifer Carlino, David L Govoni, Heather S Henkel, Sally Holl, Vivian B Hutchison, Elizabeth Martín, Ellyn T Montgomery, Cassandra C Ladino, et al. The United States geological survey science data lifecycle model. US Department of the Interior, US Geological Survey Reston, VA, USA, 2013.
[78] Hampapuram K Ramapriyan, Ge Peng, David Moroni, and Chung-Lin Shie. Ensuring and improving information quality for earth science data and products. D-Lib Magazine, 23, 2017.
[79] Rodolphe Devillers, Yvan Bédard, Robert Jeansoulin, and Bernard Moulin. Towards spatial data quality information analysis tools for experts assessing the fitness for use of spatial data. International Journal of Geographical Information Science, 21(3):261–282, 2007.
[80] Andrew Phillips, Ian Williamson, and Chukwudozie Ezigbalike. Spatial data infrastructure concepts. Australian Surveyor, 44(1):20–28, 1999.
[81] Matthias Butenuth, Guido v Gösseln, Michael Tiedge, Christian Heipke, Udo Lipeck, and Monika Sester. Integration of heterogeneous geospatial data in a federated database. ISPRS Journal of Photogrammetry and Remote Sensing, 62(5):328–346, 2007.
[82] Paul Miller. Interoperability: What is it and why should I want it? Ariadne, (24), 2000.
[83] Ehab Shahat, Chang T Hyun, and Chunho Yeom. City digital twin potentials: A review and research agenda. Sustainability, 13(6):3386, 2021.
[84] Tatiana Delgado Fernández, Kate Lance Cuba, and Buck Margaret. Assessing an SDI readiness index. In FIG Working Week, pages 16–21. Citeseer, 2005.
[85] Ljiljana Živković. Approach to Spatial Data Infrastructure Development for Spatial Planning in Serbia. Proceedings REAL CORP, 2013.
[86] Chinonye Cletus Onah. Spatial data infrastructures model for developing countries: A case study of Nigeria. Master’s Thesis in Universidad Jaime, Geospatial Technologies program, Castelló de la Plana, Spain, 2009.
[87] UN-GGIM: Europe | Working Group on Data Integration. Data Integration Methods - Analysis of future trends in geospatial data capture, creation, maintenance and management and recommendations for amplified use of good practices. 2021.
[88] S Hasani, A Sadeghi-Niaraki, and M Jelokhani-Niaraki. Spatial data integration using ontology-based approach. The International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, 40(1):293, 2015.
[89] OGC® OWS-9 Cross Community Interoperability (CCI) Conflation with Provenance Engineering Report. OGC® engineering report, Open Geospatial Consortium, February 2013.
[90] Peter Christen. Chapter 1: Introduction. In Data matching, pages 3–22. Springer, 2012.
[91] Timo Niemi, Turkka Näppilä, and Kalervo Järvelin. A relational data harmonization approach to XML. Journal of Information Science, 35(5):571–601, 2009.
[92] Rodrigo Smarzaro, Clodoveu A Davis, and José Alberto Quintanilha.
Creation of a multimodal urban transportation network through spatial data integration from authoritative and crowdsourced data. ISPRS International Journal of Geo-Information, 10(7):470, 2021.
[93] Ting Lei and Zhen Lei. Optimal spatial data matching for conflation: A network flow-based approach. Transactions in GIS, 23(5):1152–1176, 2019.
[94] Alexandra Stadler and Thomas H Kolbe. Spatio-semantic coherence in the integration of 3D city models. In Proceedings of the 5th International ISPRS Symposium on Spatial Data Quality ISSDQ 2007 in Enschede, The Netherlands, 13-15 June 2007, 2007.
[95] Ying Zhang, Chaopeng Li, Na Chen, Shaowen Liu, Liming Du, Zhuxiao Wang, and Miaomiao Ma. Semantic web and geospatial unique features based geospatial data integration. In Geospatial Intelligence: Concepts, Methodologies, Tools, and Applications, pages 230–253. IGI Global, 2019.
[96] Sumit Sen. Semantic interoperability of geographic information. GIS Development, 9:18–21, 2005.
[97] Geographic information — metadata — part 1: Fundamentals. Standard ISO 19115-1:2014, ISO/TC 211 Geographic information/Geomatics, Geneva, CH, April 2014.
[98] Francesca Noardo, Ken Arroyo Ohori, Filip Biljecki, Claire Ellul, Lars Harrie, Thomas Krijnen, Helen Eriksson, Jordi van Liempt, Maria Pla, Antonio Ruiz, et al. Reference study of CityGML software support: The GeoBIM benchmark 2019—Part II. Transactions in GIS, 25(2):842–868, 2021.
[99] Open Geospatial Consortium. CityGML Standard. https://www.ogc.org/standards/citygml, Accessed: 29.09.2021.
[100] Maxime Morel and Gilles Gesquière. Managing Temporal Change of Cities with CityGML. In UDMV, pages 37–42, 2014.
[101] Juan Trujillo and Sergio Luján-Mora. A UML based approach for modeling ETL processes in data warehouses. In International Conference on Conceptual Modeling, pages 307–320. Springer, 2003.
[102] Syed Muhammad Fawad Ali and Robert Wrembel.
From conceptual design to performance optimization of ETL workflows: current state of research and open problems. The VLDB Journal, 26(6):777–801, 2017.
[103] Panos Vassiliadis, Alkis Simitsis, and Spiros Skiadopoulos. Conceptual modeling for ETL processes. In Proceedings of the 5th ACM international workshop on Data Warehousing and OLAP, pages 14–21, 2002.
[104] Jürgen Döllner, Thomas H Kolbe, Falko Liecke, Takis Sgouros, and Karin Teichmann. The virtual 3D city model of Berlin - managing, integrating, and communicating complex urban information. In Proceedings of the 25th international symposium on urban data management UDMS 2006 in Aalborg, Denmark, 15-17 May 2006, 2006.
[105] Sampling procedures for inspection by attributes — Introduction to the ISO 2859 series of standards for sampling for inspection by attributes. Standard ISO 28590:2017(en), International Organization for Standardization, Geneva, CH, October 2017.
[106] Miguel-Ángel Manso, Monica Wachowicz, and Miguel-Ángel Bernabé. Towards an integrated model of interoperability for spatial data infrastructures. Transactions in GIS, 13(1):43–67, 2009.
[107] Sven Schade, Carlos Granell, Glenn Vancauwenberghe, Carsten Keßler, Danny Vandenbroucke, Ian Masser, and Michael Gould. Geospatial information infrastructures. In Manual of Digital Earth, pages 161–190. Springer, Singapore, 2020.
[108] INSPIRE Knowledge Base. INSPIRE Roadmap. https://inspire.ec.europa.eu/inspire-roadmap/61, Accessed: 21.09.2021.
[109] AJ De Jong. Geographic data as personal data in four EU member states. Master’s thesis, Delft University of Technology, 2015.
[110] Ian P Williamson. Building SDIs: The Challenges Ahead. Proceedings of the 7th International Conference: Global Spatial Data Infrastructure, pages 2–6, 2004.
[111] Kalogeropoulos Kleomenis, Stathopoulos Nikolaos, Tsatsaris Andreas, and Chalkias Christos.
A survey of the Geoinformatics use for census purposes and the INSPIRE maturity within Statistical Institutes of EU and EFTA countries. Annals of GIS, 25(2):167–178, 2019.
[112] Jean Paul Simon. APIs, the glue under the hood. Looking for the “API economy”. Digital Policy, Regulation and Governance, 2021.
[113] Open Geospatial Consortium. OGC API - Features. https://www.ogc.org/standards/ogcapi-features, Accessed: 28.10.2021.
[114] Simon Jirka, Christian Autermann, Jan Speckamp, and Matthes Rieke. SensorThings API and the OGC API family of standards: a new generation of interoperability standards for research data infrastructures to further improve the sharing of ocean observation data. In 9th EuroGOOS International conference, 2021.
[115] Callum Irving. “Byte-ing Back Better” - Introducing a Q-FAIR approach to Geospatial Data Improvement. Geospatial Commission, blog: https://geospatialcommission.blog.gov.uk/2021/06/25/byte-ing-back-betterintroducing-a-q-fair-approach-to-geospatial-data-improvement/, Accessed: 22.6.2022.
[116] RDA FAIR Data Maturity Model Working Group et al. FAIR Data Maturity Model: specification and guidelines. Res. Data Alliance, pages 2019–2020, 2020.
[117] GO FAIR. FAIR Principles. https://www.go-fair.org/fair-principles/, Accessed: 05.10.2021.
[118] Gröger Gerhard, Kolbe Thomas H., Nagel Claus, and Häfele Karl-Heinz. CityGML Encoding Standard. Standard OGC 12-019, Open Geospatial Consortium (OGC), April 2012.
[119] Open Geospatial Consortium. OGC City Geography Markup Language (CityGML) 3.0 Conceptual Model Users Guide. Open Geospatial Consortium (OGC), 2021.
[120] Thomas H Kolbe. Representing and exchanging 3D city models with CityGML. In 3D geo-information sciences, pages 15–31. Springer, 2009.
[121] Volker Coors and M Krämer. Integrating quality management into a 3D geospatial server. In UDMS 2011: 28th Urban Data Management Symposium, Delft, The Netherlands, September 28-30, 2011.
Urban Data Management Society; OTB Research Institute for the Built Environment; Delft University of Technology, 2011.
[122] Hugo Ledoux, Ken Arroyo Ohori, Kavisha Kumar, Balázs Dukai, Anna Labetski, and Stelios Vitalis. CityJSON: A compact and easy-to-use encoding of the CityGML data model. Open Geospatial Data, Software and Standards, 4(1):1–12, 2019.
[123] Geographic information — Spatial schema. Standard ISO 19107:2003, International Organization for Standardization, Geneva, CH, April 2003.
[124] Hugo Ledoux. Val3dity: Validation of 3D GIS Primitives According to the International Standards. Open Geospatial Data, Software and Standards, 3(1):1–12, 2018.
[125] OpenGIS® Implementation Standard for Geographic information - Simple feature access - Part 1: Common architecture. OpenGIS Implementation Standard OGC 06-103r4, Open Geospatial Consortium Inc, May 2011.
[126] Even Rouault. GML madness. https://erouault.blogspot.com/2014/04/gml-madness.html, Accessed: 11.01.2022.
[127] Stelios Vitalis, Ken Arroyo Ohori, and Jantien Stoter. CityJSON in QGIS: Development of an open-source plugin. Transactions in GIS, 24(5):1147–1164, 2020.
[128] Marcus Goetz and Alexander Zipf. Towards defining a framework for the automatic derivation of 3D CityGML models from volunteered geographic information. International Journal of 3-D Information Modeling (IJ3DIM), 1(2):1–16, 2012.
[129] Filip Biljecki, Jantien Stoter, Hugo Ledoux, Sisi Zlatanova, and Arzu Çöltekin. Applications of 3D city models: State of the art review. ISPRS International Journal of Geo-Information, 4(4):2842–2889, 2015.
[130] Alexander Köninger and Sigrid Bartel. 3D-GIS for urban purposes. Geoinformatica, 2(1):79–103, 1998.
[131] Thomas H Kolbe, Gerhard Gröger, and Lutz Plümer. CityGML: Interoperable access to 3D city models. In Geo-information for disaster management, pages 883–899. Springer, 2005.
[132] Safe Software. Introduction to FME Desktop, chapter 1.01. What is FME? https://safe-software.gitbooks.io/introduction-to-fme-desktop/content/1.getting-started/1.01.getting-started.html, Accessed: 22.6.2022.
[133] Safe Software. FME GeometryValidator Documentation. http://docs.safe.com/fme/html/FME_Desktop_Documentation/FME_Transformers/Transformers/geometryvalidator.htm, Accessed: 04.05.2022.
[134] Safe Software. GeometryValidator Issues Table. https://docs.safe.com/fme/html/FME_Desktop_Documentation/FME_Transformers/Transformers/geometryvalidator_issues.html, Accessed: 27.04.2022.
[135] Open Geospatial Consortium (OGC). GeoE3 - Github Repository. https://github.com/opengeospatial/GEOE3, Accessed: 13.06.2022.
[136] Maanmittauslaitos. KMTK Tietomalli - Rakennukset ja rakennelmat. https://www.maanmittauslaitos.fi/sites/maanmittauslaitos.fi/files/attachments/2022/01/2022-01-17_KMTK_Tietomalli_RR.pdf, Accessed: 17.06.2022.
[137] TerraSolid. TerraScan User Guide - Check Building Models. https://terrasolid.com/guides/tscan/toolcheckbuildingmodels.html, Accessed: 16.06.2022.
[138] Markatsafe. Forum Post of Safe Employee. https://community.safe.com/s/question/0D74Q000008t7NOSAY/detail, Accessed: 26.04.2022.
[139] Hugo Ledoux. On the validation of solids represented with the international standards for geographic information. Computer-Aided Civil and Infrastructure Engineering, 28(9):693–706, 2013.
[140] Volker Coors and Detlev Wagner. CityGML Quality Interoperability Experiment des OGC. DGPF Tagungsband. Publikationen der Deutschen Gesellschaft für Photogrammetrie, Fernerkundung und Geoinformation eV, 24:288–295, 2015.
[141] Riccardo Albertoni and Antoine Isaac. Introducing the data quality vocabulary (DQV). Semantic Web, 12(1):81–97, 2021.
[142] Heiko Figgemeier, Arne Rümmler, and Christin Henzen. A Geospatial Dashboard Prototype for Evaluating Spatial Datasets by using Semantic Data Concepts and Open Source Libraries. AGILE: GIScience Series, 3:1–3, 2022.