ESSnet Statistical Methodology Project on Integration of Survey and Administrative Data

Report of WP3. Software tools for integration methodologies

LIST OF CONTENTS

Preface
1. Software tools for record linkage (Monica Scannapieco – Istat)
1.1. Comparison criteria for record linkage software tools (Monica Scannapieco – Istat)
1.2. Probabilistic tools for record linkage
1.2.1. Automatch (Nicoletta Cibella – Istat)
1.2.2. Febrl (Miguel Guigo – INE)
1.2.3. GRLS (Nicoletta Cibella – Istat)
1.2.4. LinkageWiz (Monica Scannapieco – Istat)
1.2.5. RELAIS (Monica Scannapieco – Istat)
1.2.6. DataFlux (Monica Scannapieco – Istat)
1.2.7. Link King (Marco Fortini – Istat)
1.2.8. Trillium Software (Miguel Guigo – INE)
1.2.9. Link Plus (Tiziana Tuoto – Istat)
1.3. Summary tables and comparisons
1.3.1. General Features
1.3.2. Strengths and weaknesses
2. Software tools for statistical matching (Mauro Scanu – Istat)
2.1. Comparison criteria for statistical matching software tools (Mauro Scanu – Istat)
2.2. Statistical matching tools
2.2.1. SAMWIN (Mauro Scanu – Istat)
2.2.2. R codes (Marcello D’Orazio – Istat)
2.2.3. SAS codes (Mauro Scanu – Istat)
2.2.4. S-Plus codes (Marco Di Zio – Istat)
2.3. Comparison tables
3. Commercial software tools for data quality and record linkage in the process of microintegration (Jaroslav Kraus and Ondřej Vozár - CZSO)
3.1. Data quality standardization requirements
3.2. Data quality assessment
3.3. Summary tables Oracle, Netrics and SAS/Data Flux
3.3.1. Oracle
3.3.2. Netrics
3.3.3. SAS
4. Documentation, literature and references
4.1. Bibliography for Section 1
4.2. Bibliography for Section 2
4.3. Bibliography for Section 3

Preface

This document is the deliverable of the third work package (WP) of the Centre of Excellence on Statistical Methodology.
The objective of this WP is to review some existing software tools for the application of probabilistic record linkage and statistical matching methods.

The document is organized in three chapters. The first chapter is on software tools for record linkage. On the basis of the underlying research paradigm, three major categories of record linkage tools can be identified: i) tools for probabilistic record linkage, mostly based on the Fellegi and Sunter model (Fellegi and Sunter, 1969) [1]; ii) tools for empirical record linkage, which are mainly focused on performance issues and hence on reducing the search space of the record linkage problem by means of algorithmic techniques such as sorting, tree traversal, neighbour comparison, and pruning; iii) tools for knowledge-based linkage, in which domain knowledge is extracted from the files involved and reasoning strategies are applied to make the decision process more effective.

Among this variety of proposals, this document restricts its attention to record linkage tools that have the following characteristics: they have been explicitly developed for record linkage; they are based on a probabilistic paradigm.

Two sets of comparison criteria were used for comparing several probabilistic record linkage tools. The first one considers general characteristics of the software: cost of the software; domain specificity (i.e. whether the tool has been developed ad hoc for a specific type of data and applications); maturity (or level of adoption, i.e. frequency of usage, where available, and number of years the tool has been around). The second set considers which functionalities are performed by the tool: preprocessing/standardization; profiling; comparison functions; decision method.

Chapter 2 deals with software tools for statistical matching. Software solutions for statistical matching are not as widespread as in the case of record linkage, because statistical matching projects are still quite rare in practice.
Almost all the applications are conducted by means of ad hoc codes. Sometimes, when the objective is micro, it is possible to use general purpose imputation software tools. On the other hand, if the objective is macro, it is possible to adopt general statistical analysis tools which are able to deal with missing data. In this chapter, the available tools explicitly devoted to statistical matching purposes were reviewed. Only one of them (SAMWIN) can be used without any programming skills, while the others are software codes that can be used only by those with knowledge of the corresponding language (R, S-Plus, SAS) as well as a sound knowledge of statistical methodology.

The criteria used for comparing the software tools for statistical matching were slightly different from those for record linkage. The attention is restricted to costs, domain specificity and maturity of the software tool. As far as the software functionalities are concerned, the focus is on: i) the inclusion of pre-processing and standardization tools; ii) the capacity to create a complete and synthetic data set by the fusion of the two data sources to integrate; iii) the capacity to estimate parameters of the joint distribution of variables never jointly observed; iv) the assumptions on the model of the variables of interest under which the software tool works (the best known is the conditional independence assumption of the variables not jointly observed given the common variables in the two data sources); v) the presence of any quality assessment of the results. Furthermore, the software tools are compared according to the implemented methodologies. Strengths and weaknesses of each software tool are highlighted at the end.

[1] Fellegi I. P., Sunter A. B. (1969). “A theory for record linkage”. Journal of the American Statistical Association, 64, 1183-1210.
Chapter 3 focuses on commercial software tools for data quality and record linkage in the process of microintegration. The vendors in the data quality market are often classified according to their overall position in the IT business, where a focus on specific business knowledge and experience in a specific business domain plays an important role. The quality of vendors and of their products on the market is characterized by: i) product features and relevant services; ii) vendor characteristics, domain business understanding, business strategy, creativity, innovation; iii) sales characteristics, licensing, prices; iv) customer experience, reference projects; v) data quality tools and frameworks.

The software vendors of tools in the statistics oriented “data quality market” propose solutions addressing all the tasks in the entire life cycle of data oriented management programs and projects: data preparation, survey data collection, improvement of quality and integrity, setting up for reports and studies, etc.

According to the software/application category, the tools to perform or support data oriented record linkage projects in statistics should have several common characteristics: 1) portability, i.e. the ability to function with statistical researchers' current arrangement of computer systems and languages; 2) flexibility in handling different linkage strategies; and 3) low operational costs, both in terms of TCO (Total Cost of Ownership) and in terms of computing time and researchers' effort.

In this chapter the evaluation focused on three commercial software packages which, according to their data quality scoring position in Gartner reports (the so called “magic quadrants” available on the web page http://www.gartner.com), belong to important vendors in this area.
The three vendors are: Oracle (which represents the convergence of tools and services in the software market), SAS/DataFlux (a data quality, data integration and BI (Business Intelligence) player on the market), and Netrics (which offers advanced technology complementing the Oracle data quality and integration tools). The set of comparison tables was prepared according to the following structure: linkage methodology, data management, post-linkage function, standardization, costs and empirical testing, case studies and examples.

1 Software Tools for Record Linkage
Monica Scannapieco (Istat)

The state of the art of record linkage tools includes several proposals coming from private companies but also, in large part, from public organizations and from universities. Another interesting feature of such tools is that some record linkage activities are performed “within” other tools. For instance, there are several data cleaning tools that include record linkage (see Barateiro and Galhardas, 2005 for a survey), but they are mainly dedicated to standardization, consistency checks etc. A second example is provided by the recent efforts of major database management system vendors (like Microsoft and Oracle) to include record linkage functionalities for data stored in relational databases (Koudas et al., 2006).

On the basis of the underlying research paradigm, three major categories of tools for record linkage can be identified (Batini and Scannapieco, 2006):
1. Tools for probabilistic record linkage, mostly based on the Fellegi and Sunter model (Fellegi and Sunter, 1969).
2. Tools for empirical record linkage, which are mainly focused on performance issues and hence on reducing the search space of the record matching problem by means of algorithmic techniques such as sorting, tree traversal, neighbour comparison, and pruning.
3. Tools for knowledge-based linkage, in which domain knowledge is extracted from the files involved, and reasoning strategies are applied to make the decision process more effective.

Among this variety of proposals, in this document we concentrate on record linkage tools that have the following characteristics: they have been explicitly developed for record linkage; they are based on a probabilistic paradigm.

In the following, we first illustrate a set of comparison criteria (Section 1.1) that will be used for comparing several probabilistic record linkage tools. In Section 1.2, we provide a general description of the selected tools, while in Section 1.3 we present several comparison tables that show the most important features of each tool.

Let us first list the probabilistic record linkage tools that have been selected among the most well-known and widely adopted ones:
1. AutoMatch, developed at the US Bureau of the Census, now under the purview of IBM [Herzog et al. 2007, chap.19].
2. Febrl - Freely Extensible Biomedical Record Linkage, developed at the Australian National University [FEBRL].
3. Generalized Record Linkage System (GRLS), developed at Statistics Canada [Herzog et al. 2007, chap.19].
4. LinkageWiz, commercial software [LINKAGEWIZ].
5. RELAIS, developed at ISTAT [RELAIS].
6. DataFlux, commercialized by SAS [DATAFLUX].
7. The Link King, commercial software [LINKKING].
8. Trillium, commercial software [TRILLIUM].
9. Link Plus, developed at the U.S. Centers for Disease Control and Prevention (CDC), Cancer Division [LINKPLUS].

For each of the above cited tools, in the following Section 1.2 we provide a general description.

1.1 Comparison criteria for record linkage software tools
Monica Scannapieco (Istat)

In this section we describe the criteria used to compare the probabilistic record linkage tools with respect to several general features. Such criteria will be reported in some tables, whose detail is provided in Section 1.3.
A first table will take into account the following characteristics:

Free/Commercial refers to the possibility of having the tool for free or not. The set of possible answers is shown in Figure 1.

Domain Specificity refers to whether the tool has been developed ad hoc for a specific type of data and applications. For Domain Specificity, the set of answers is shown in Figure 2.

Maturity (Level of Adoption) is related to the frequency of usage (where available) and to the number of years the tool has been around. For Maturity, we use a HIGH/MEDIUM/LOW rating scale. In order to assign the ratings, we take into account the following factors: (i) frequency of usage (Shah et al.); (ii) number of years since the tool was first proposed.

Figure 1: Free/commercial possible answers
Free: Source Code Available; Source Code Not Available
Commercial: Cost less than $5000; Cost between $5000 and $9900; Cost more than $9900

Figure 2: Domain specificity possible answers
Only specific domain; No specific domain (generalized system); Mixed (specific domain + general features)

A second table will consider which functionalities are performed by the tool, on the basis of a reference set of functionalities, listed in the following.

Preprocessing/Standardization
Data can be recorded in different formats, and some items may be missing or affected by inconsistencies or errors. The key job of this functionality is to convert the input data into a well defined format, resolving the inconsistencies in order to reduce misclassification errors in the subsequent phases of the record linkage process. In this activity null strings are removed; abbreviations, punctuation marks, upper/lower case, etc. are cleaned; and any necessary transformation is carried out in order to standardize variables. Furthermore, spelling variations of common words are replaced with a standard spelling.
A parsing procedure that divides a free-form field into a set of strings can be applied, and a schema reconciliation can be performed to avoid possible conflicts (i.e. description, semantic and structural conflicts) among data source schemas in order to have standardized data fields. Geocoding, a standardization task especially conceived for name and address data, transforms data variables by assigning geographic identifiers or postal standards such as postal ZIP codes or official street addresses.

Profiling
An important phase of a record linkage process is the choice of appropriate matching variables, which have to be as suitable as possible for the linking process considered. The matching attributes are generally chosen by a domain expert, hence this phase is typically not automatic, but the choice can be supported by some further information that can be automatically computed. Such information is the result of a profiling activity that provides quality measures, metadata descriptions and simple statistics on the distribution of variables, which give hints on how to choose the set of matching variables.

Comparison Functions
Record linkage tools can provide support for different comparison functions. Some of the most common comparison functions are equality, edit distance, Jaro, Hamming distance, Smith-Waterman, TF-IDF, etc. (see Koudas et al., 2006 for a survey).

Search Space Reduction
In a linking process of two datasets, say A and B, the pairs that need to be classified as matches, non-matches and possible matches are those in the cross product A x B. When dealing with large datasets, comparing the matching variables over all these pairs is almost impracticable. Several techniques based on sorting, filtering, clustering and indexing may be used to reduce the search space. Blocking and sorted neighbourhood are among the most common ones.

Decision Method
The core of the record linkage process is the choice of the decision model.
A record linkage tool can provide several decision rules in order to decide the status of match, non-match, or possible match of records. For instance, it can provide support for the Fellegi and Sunter rule, but also for a deterministic (threshold based) rule.

In Section 1.3, two further tables, namely for estimation methods and for strengths and weaknesses of the tools, will be described.

1.2 Probabilistic tools for record linkage

1.2.1 AutoMatch
Nicoletta Cibella (Istat)

AutoMatch implements the Fellegi and Sunter record linkage theory for matching records. The software can be used for matching records both within a list and between different data sources. In 1998, Vality (www.vality.com) acquired the AutoMatch tool, which became part of INTEGRITY. The software includes many matching algorithms.

General characteristics - AutoMatch is commercial software, under the purview of IBM. The software was created by Matthew Jaro and today it is a collection of different algorithms aimed at performing probabilistic record linkage; some of these algorithms (NYSIIS, SOUNDEX) use codes suited to English words. The current version of the software is user-friendly and seems to follow the same strategy a human being would follow in matching records referring to the same entity (Herzog et al., 2007).

Tools included in the software - AutoMatch performs the record linkage procedure through different steps in which the matching variables, the thresholds and also the blocking variables can be changed. Firstly, the variables in the data files are processed: the pre-processing phase (standardization, parsing) transforms variables so as to obtain a standard format. Not only are phonetic codes (NYSIIS) used, but spelling variations and abbreviations in text strings are also considered. The software links records using the Fellegi and Sunter theory, enriched with a frequency analysis methodology in order to discriminate among weight score values.
Methodology - In AutoMatch, the u-probabilities are calculated from the number of occurrences of the values of each matching variable in the dataset being matched; the m-probabilities can be iteratively estimated from current linkages.

Strength and weakness of the software - The pre-processing phase in AutoMatch is very important and well developed: it uses several tools so as to bring variables into a standard format. The division of possible pairs into blocks is also implemented. The documentation provided is rich and the matching parameters are estimated automatically. However, some algorithms work only with English words, and the error rate estimation phase is still not performed.

1.2.2 Febrl
Miguel Guigò (INE - Spain)

The development of Freely Extensible Biomedical Record Linkage (Febrl) was funded in 2001 by the Australian National University (ANU) and the NSW Department of Health, with additional funding provided by the Australian Partnership for Advanced Computing (APAC). Febrl is a record linkage system which aims at supporting not only matching algorithms but also methods for large scale data cleansing, standardisation, blocking and geocoding, as well as a probabilistic data set generator for performing a wide variety of tests and empirical comparisons of record linkage procedures.

General characteristics - Febrl is written in Python, an object-oriented open source language (http://www.python.org), and it is available from the project web page (http://sourceforge.net/projects/febrl). Because of this, Febrl allows the easy implementation of additional and improved record linkage techniques or algorithms, as well as different comparison tests.

Tools included in the software - Febrl supports a special procedure for probabilistic data standardisation and cleansing with the aim of resolving inconsistencies in key variables such as names and addresses. This procedure is based on hidden Markov models (HMMs).
One HMM is used for names and one for addresses. The complete pre-processing task then consists of three steps: 1) the user input records are cleaned, converting all letters to lowercase, removing certain characters and replacing various substrings with their canonical form (based on user-specified and domain specific substitution tables); 2) the cleaned strings are split into a list of words, numbers and characters to which one or more tags are assigned; 3) the list of tags is given to an HMM (either name or address), where the most likely path is found by means of the Viterbi algorithm (see Rabiner, 1989). For some other key variables such as dates and telephone numbers, the package also contains rule-based standardisation methods.

The search space reduction process is carried out by means of blocking. Febrl (version 0.2) implements three different blocking methods: standard blocking, Sorted-Neighbourhood, and a special procedure for fuzzy blocking, known as Bigram Indexing (see the section on Blocking Procedures of the report of WP1).

A wide variety of comparison functions is available in order to obtain the corresponding vector of matching weights. Christen and Churches (2005a) includes a table with 11 field (or attribute) comparison functions for names, addresses, dates and localities that are available: Exact string, Truncated string, Approximate string, Encoded string, Keying difference, Numeric percentage, Numeric absolute, Date, Age, Time, and Distance. Version 0.3 has added two new approximate string comparison methods. Geocoding procedures have been implemented in Febrl version 0.3, with a system based on the Australian Geocoded National Address File (GNAF) database.

Methodology - Febrl performs probabilistic record linkage based on the Fellegi and Sunter approach to calculate the matching decision, and classifies record pairs as either a link, non-link, or possible link.
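The three-way Fellegi and Sunter classification mentioned above can be illustrated with a minimal sketch. It is not Febrl code: the function names, the m- and u-probabilities and the two thresholds are hypothetical values chosen for illustration, and the weights are simply summed across variables, which relies on the conditional independence assumption.

```python
import math

def field_weight(agrees, m, u):
    """Log2 weight of one matching variable.

    m = P(agreement | pair is a match), u = P(agreement | pair is a non-match).
    """
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))

def classify(agreements, m_probs, u_probs, lower, upper):
    """Sum the field weights and compare the total with two thresholds."""
    total = sum(
        field_weight(a, m, u) for a, m, u in zip(agreements, m_probs, u_probs)
    )
    if total >= upper:
        return "link", total
    if total <= lower:
        return "non-link", total
    return "possible link", total

# Hypothetical parameters for three matching variables
# (e.g. surname, year of birth, postcode); not taken from any real tool.
M = [0.95, 0.90, 0.85]
U = [0.01, 0.05, 0.10]
LOWER, UPPER = 0.0, 8.0  # illustrative thresholds

for pattern in ([True, True, True], [True, False, False], [False, False, False]):
    label, score = classify(pattern, M, U, LOWER, UPPER)
    print(pattern, label, round(score, 2))
```

In practice the thresholds would be derived from the acceptable error rates, and the m- and u-probabilities would be estimated from the data rather than fixed by hand.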
The package, however, adds a flexible classifier that allows different definitions and calculations of the final matching weight using several functions. This is done in order to alleviate the limitations due to the conditional independence assumption that is associated with the Fellegi-Sunter model. Febrl also uses the Auction algorithm (Bertsekas, 1992) to achieve the optimal one-to-one assignment of linked record pairs.

Strength and weakness of the software - An outstanding feature of Febrl is its software architecture based on open-source code. Both the Python language and the Febrl package are completely free of charge; moreover, now that additional routines for preprocessing and geocoding have been implemented, the latter can be used as an integrated application that covers the complete linkage process rather than as a tool designed strictly for the record linkage step. A comprehensive manual (Christen and Churches, 2005b) is also available for the 0.3 release. The package, however, does not offer the additional features that commercial software supports, such as a wide variety of profiling or enrichment tools, or the ability to work in languages other than English.

1.2.3 GRLS (Generalized Record Linkage System)
Nicoletta Cibella (Istat)

Generalized Record Linkage System (GRLS) is commercial software, developed at Statistics Canada, that performs probabilistic record linkage. The system was designed especially to solve health and business problems; it is part of a set of generalized systems and it is based on statistical decision theory. The software aims at linking records within a single file or between different data sources, particularly when there is a lack of unique identifiers. It was designed to be used with ORACLE databases.

General characteristics - GRLS is marketed by Statistics Canada and is now at its 4.0 release; it works in a client-server environment with a C compiler and ORACLE.
The NYSIIS (New York State Identification and Intelligence System) and Russell SOUNDEX phonetic codes, specific to the English language, are incorporated in GRLS. Statistics Canada usually organizes two-day training courses to present the software (with information and bilingual documentation) and to enable users to work with it, so as to facilitate the adoption of the software, even if software development still focuses on Statistics Canada users.

Tools included in the software - GRLS decomposes the whole record linkage procedure into three main passes (Fair, 2004): definition of the search space, i.e. creation of all linkable pairs on the basis of the initial criteria set by the user; application of a decision rule, by which the possible pairs are divided into the matched, possibly matched and unmatched sets; and the grouping phase, in which the linked and possible pairs involving the same entity are grouped together and the final groups are formed. The NYSIIS and SOUNDEX phonetic codes incorporated in GRLS, as stated above, and some other tools available within Statistics Canada (e.g. the postal code conversion file, PCCF) facilitate the pre-processing of the data files; strings such as names, surnames or addresses are parsed into their components and the free format is converted into a standard one. The system also supplies a suitable framework for testing the parameters of the linkage procedure.

Methodology - The core of GRLS, as probabilistic record linkage software, is the mathematical theory of Fellegi and Sunter. The software decomposes the whole matching process into its constituent phases.

Strength and weakness of the software - GRLS has many strengths compared to other record linkage software. The pre-processing phase can be considered well developed within the Statistics Canada system.
The documentation is rich and bilingual (both French and English versions are available), and the training course at Statistics Canada deserves a special mention. The software can run on a workstation or a PC supporting UNIX. On the other hand, the software is not free and the error estimation phase is not implemented.

1.2.4 LinkageWiz
Monica Scannapieco (Istat)

LinkageWiz is software dedicated to record linkage. It allows linking records from separate data sources or identifying duplicate records within a single data source. It is based on several probabilistic linkage techniques. Data can be imported from a wide range of desktop and corporate database systems. It offers a comprehensive range of data cleansing functions. LinkageWiz uses an intuitive graphical user interface and does not require a high level of expertise to operate. It is a standalone product and does not require the use of separate programs.

General characteristics - LinkageWiz has a quite low price, which is clearly stated on the software website (http://www.linkagewiz.com) for the different versions of the software that can be purchased. It supports name matching with some specific phonetic codes like NYSIIS (New York State Identification and Intelligence System) and SOUNDEX, which are specific to the English language. The software is currently at version 5.0, so it is quite stable. The level of adoption is medium: the software web site indicates about 20 clients.

Tools included in the software - LinkageWiz imports data from several formats. Furthermore, it allows the standardization of addresses and business names (English and French), various conversions of characters (e.g. accented characters for European languages), removal of spaces and unwanted punctuation, and a few other preprocessing functionalities. Therefore, the preprocessing phase is quite well supported by the tool.
No specific profiling functionality, instead, is mentioned in the software documentation. Comparison functions are also well supported, with specific attention dedicated to the matching of English names. A search space reduction method is not explicitly mentioned.

Methodology - As far as the supported methodology is concerned, there is no precise information about it, other than the statement that “sophisticated probabilistic techniques” are implemented for the matching decision.

Strength and weakness of the software - LinkageWiz has good support for preprocessing. However, in the tool’s documentation there is no explicit mention of the possibility of customizing preprocessing rules (which is instead allowed by DATAFLUX, see below). Performance seems to be a strength of the tool, which seems to work quite well even in a PC environment. Among the negative points, there is the limited number of functionalities covered by the tool (the profiling and search space reduction phases seem to be completely missing). Moreover, as already mentioned, no detail at all is provided on the implemented probabilistic method, which forces any user to trust the tool’s decisions from a “black box” perspective.

1.2.5 RELAIS (Record Linkage At Istat)
Monica Scannapieco (Istat)

RELAIS (REcord Linkage At IStat) (Tuoto et al 2007, RELAIS) is a toolkit that permits the construction of record linkage workflows. The inspiring principle is to allow combining the most convenient techniques for each of the record linkage phases and also to provide a library of patterns that could support the definition of the most appropriate workflow, in both cases taking into account the specific features of the data and the requirements of the current application. In this way, the toolkit not only provides a set of different techniques for each phase of the linkage problem, but can also be seen as a compass for solving the linkage problem as well as possible given the problem constraints.
In addition, RELAIS specifically aims at joining the statistical and computational essences of the record linkage problem.

General characteristics - One of the inspiring principles of RELAIS as a toolkit is to re-use the several solutions already available for record linkage in the scientific community and to benefit from the experience gained in different fields. According to this principle, RELAIS is being carried out as an open source project, with full availability of source code as well. All the algorithms have been developed to be domain independent. RELAIS started in 2006 and is at the 1.0 release, hence the level of adoption is quite low.

Tools included in the software - RELAIS provides just basic pre-processing functionalities. The data profiling activity includes the evaluation of the following metadata: variable completeness, identification power, accuracy, internal consistency, and consistency. All these metadata can be evaluated for each variable, and are merged together into a quality vector associated with the variable itself. A ranking of the quality vectors is performed in order to suggest which variable is more suitable for blocking or matching. RELAIS permits the choice of the comparison function to use from a set of predefined ones. The search space reduction phase can be performed by two methods, namely blocking and sorted neighbourhood. The first method partitions the search space according to a chosen blocking variable and allows linkage to be conducted on each block independently. The sorted neighbourhood method limits the actual comparison to those records that fall within a window sliding over an ordered list of the records to compare. With respect to the choice of decision model, RELAIS implements the Fellegi-Sunter probabilistic model, using the EM algorithm for the estimation of the model parameters. The output is a many-to-many linkage of the dataset records.
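The two search space reduction methods described above, blocking and sorted neighbourhood, can be sketched as follows. This is an illustrative sketch only and not RELAIS code: the function names, the record layout and the example data are hypothetical, and the sorted neighbourhood variant shown here merges the two files into one ordered list and keeps only cross-file pairs.

```python
from collections import defaultdict

def blocking_pairs(a, b, key):
    """Candidate pairs (i, j): record a[i] and b[j] share the same blocking key."""
    blocks = defaultdict(list)
    for j, rec in enumerate(b):
        blocks[key(rec)].append(j)
    return {(i, j) for i, rec in enumerate(a) for j in blocks.get(key(rec), [])}

def sorted_neighbourhood_pairs(a, b, key, window):
    """Candidate pairs (i, j) that fall within a window sliding over the
    merged list of records sorted by key; same-file pairs are discarded."""
    merged = [(key(r), "A", i) for i, r in enumerate(a)]
    merged += [(key(r), "B", j) for j, r in enumerate(b)]
    merged.sort()
    pairs = set()
    for pos, (_, src, idx) in enumerate(merged):
        for _, src2, idx2 in merged[pos + 1 : pos + window]:
            if src != src2:
                pairs.add((idx, idx2) if src == "A" else (idx2, idx))
    return pairs

# Hypothetical example files
file_a = [
    {"surname": "rossi", "zip": "00184"},
    {"surname": "bianchi", "zip": "20121"},
    {"surname": "verdi", "zip": "00184"},
]
file_b = [
    {"surname": "rosso", "zip": "00184"},
    {"surname": "bianchi", "zip": "20121"},
    {"surname": "neri", "zip": "10121"},
]

by_zip = blocking_pairs(file_a, file_b, key=lambda r: r["zip"])
by_name = sorted_neighbourhood_pairs(
    file_a, file_b, key=lambda r: r["surname"], window=2
)
print(len(file_a) * len(file_b), "pairs in the full cross product")
print(sorted(by_zip))
print(sorted(by_name))
```

Both functions return far fewer candidate pairs than the full cross product; the price is that true matches with a wrong blocking key, or sorted too far apart, are never compared.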
Starting from this many-to-many output, RELAIS allows the user to construct clusters of matches, non-matches and possible matches. Alternatively, a subsequent phase of reduction from many-to-many linkage to one-to-one linkage can be performed.

Methodology - As described, RELAIS implements the Fellegi-Sunter probabilistic model, and the estimation of the model’s parameters is carried out by means of the EM method. It assumes the conditional independence of the matching variables. A latent class model is hypothesized in which the latent variable is the matching status. A deterministic decision rule is currently under development.

Strength and weakness of the software - One strength of RELAIS is the fact that it is open and free of charge. It also provides good support for all the record linkage functionalities, as will be shown in Table 3. A further positive aspect is the flexibility offered by the possibility of combining different techniques to produce ad hoc record linkage workflows. This is perhaps the most characteristic aspect of the tool. RELAIS is however at quite an early stage of adoption, being only at its first release.

1.2.6 DataFlux
Monica Scannapieco (Istat)

DataFlux is a SAS company which has recently been classified as a leader among data quality providers by Gartner Group (Gartner, 2007). Their solution is a comprehensive solution for data quality including:
o data profiling, to determine discrepancies and inaccuracies in data;
o deduplication, enabled by record linkage;
o data cleansing, to eliminate or reduce data inconsistencies; and
o parsing, standardization and matching, which allow creating or enhancing the rules used to parse parts of names, addresses, email addresses, product codes or other business data.

General characteristics - DataFlux is commercial software. It is not easy to actually identify the price of the product, as it comes in suites with other modules.
Though it is quite difficult to isolate the specific offer for record linkage within the set of proposed data quality solutions, the basic price is not high. Moreover, the tool is not domain specific, but instead allows part of its functionalities to be customized to domain-specific features. For instance, the standardization rules can be customized to the specific type of business data at hand (for instance, addresses). As a further example, when matching two strings it is possible to specify which part of a data string weighs most heavily in the match. The level of adoption is high. Indeed, DataFlux is recognized as a leading provider with an established market presence, significant size and multi-national presence (Gartner, 2007).

Tools included in the software - DataFlux allows different types of preprocessing: breaking names into non-compound parts (e.g., first name, last name, middle name), removing prefixes, suffixes and titles, etc. The tool mainly uses parsing functions for this purpose. A profiling activity is present in the SAS-DQ solution. This activity provides an interface to determine areas of poor data quality and the amount of effort required to rectify them. However, the profiling activity does not appear to be integrated within the record linkage process; rather, it is a standalone activity. Different comparison functions can also be used. DataFlux performs approximate matching (called “fuzzy matching”) between strings corresponding to the same fields. The sensitivity of the match result depends on how many characters of the strings are used for approximate matching. The documentation does not explicitly mention a specific functionality dedicated to the search space reduction task. The decision rule applied by DataFlux is deterministic. When performing approximate matching, a matching code is determined as a degree of matching of the compared strings.
In the deterministic decision rules, the matching codes can be combined in different ways to obtain rules for declaring a match or a non-match at record level.

Methodology - As described before, a deterministic decision rule is implemented.

Strength and weakness of the software - Preprocessing in DataFlux is quite rich. Standardization for several business data types is well supported, including name parts, address parts, e-mail addresses and even free-form text values, thus ensuring a great level of flexibility. It is also possible to apply address standardizations based on local standards. The decision method, instead, is among the weak points of the solution: the proposed method is a rather trivial deterministic solution.

1.2.7 The Link King
Marco Fortini (Istat)

The Link King is a SAS/AF application for the linkage and deduplication of administrative datasets which incorporates both probabilistic and deterministic record linkage protocols. The probabilistic record linkage protocol was adapted from the algorithm developed by MEDSTAT for the Substance Abuse and Mental Health Services Administration’s (SAMHSA) Integrated Database Project. The deterministic record linkage protocols were developed at Washington State’s Division of Alcohol and Substance Abuse for use in a variety of evaluation and research projects. The Link King’s graphical user interface (GUI), equipped with easy-to-follow instructions, assists beginning and advanced users in record linkage and deduplication tasks. An artificial intelligence component helps in selecting the most appropriate linkage/deduplication protocol. The Link King requires a base SAS license but no SAS programming experience.

General characteristics - The Link King is a free SAS macro, which needs a SAS licence to be used. It is oriented toward epidemiological applications and works only with files referring to people.
Moreover, it is oriented to a US audience because the recommended key variables include the Social Security Number (SSN). It incorporates a variety of user-specified options for blocking and linkage decisions, and a powerful interface for manual review of “uncertain” linkages. It also implements “phonetic equivalence” or “spelling distance” as a means to identify misspelled names. Having reached its 6th release, it can be considered to have a high level of adoption.

Tools included in the software - The Link King can import data from the most popular formats but does not implement sophisticated preprocessing features. Artificial intelligence methods are implemented to ensure that appropriate linking protocols are used. Features for identifying missing or improper values have been developed in order to automatically recognize values that have scarce discriminating power. Various comparison functions are provided so as to properly evaluate names and surnames. Blocking features are implemented with heuristics for identifying the most suitable blocking scheme.

Methodology - Both probabilistic and deterministic schemes are implemented in The Link King, allowing for either dichotomous or partial agreement between key variables. The weights are calculated by means of an ad hoc iterative procedure, described in the technical documentation, which does not make use of the standard EM procedure. Though it seems quite reasonable, this procedure is, in our opinion, not clear enough in terms of its hypotheses and theoretical soundness.

Strength and weakness of the software - The Link King is fairly usable software with good compatibility with different file formats, flexibility of use and a graphical user interface that aims to help non-expert users conduct a linkage project. Moreover, since it has been developed on the basis of research projects, it is a well documented tool whose code is free.
Among its major drawbacks are a non-standard technique for estimating the m and u weights within the probabilistic framework and its applicability only to problems concerning the linkage of files of people. Another weakness is the need for a SAS licence.

1.2.8 Trillium Software
Miguel Guigò (INE - Spain)

Trillium Software is a commercial set of tools, developed by Harte-Hanks Inc. (http://www.hartehanks.com) for carrying out integrated database management functions, so it covers the overall data integration life cycle. It consists of a generalized system for data profiling, standardizing, linking, enriching and monitoring. Although this application relies heavily on the so-called Data Quality (DQ) improvement procedures, some of the tasks performed unmistakably belong to the complete record linkage process. The record linkage algorithm itself performs probabilistic matching, though based on a method different from the Fellegi-Sunter or Sorted-Neighbourhood algorithms (Herzog, 2004; Herzog et al., 2007; and Naumann, 2004). The general purpose of DQ is to ensure data consistency, completeness, validity, accuracy and timeliness, in order to guarantee that the data are fit for use in decision-making procedures; nevertheless, the statistical point of view does not always coincide with other approaches (see Karr et al., 2005), more specifically with those based on marketing strategies (which often establish the guidelines of this sort of program) or computer science, though it can overlap with them. Section 3 of this report deals more extensively with DQ issues.

General characteristics - The Trillium Software System (TSS) was first developed in summer 1989, using the experience previously acquired by Harte-Hanks in the realm of data management after the firm purchased a package for processing urban data in 1979.
The software was completed in 1992, and from the outset it was conceived for performing integrated mailing list functions, especially managing mailing address and name lists in the banking business. Several improvements have been made over the years, such as Unicode support (1998), Java GUI (1999), Com+ JNI interfaces (2000), CRM/ETL/ERP integration (2001), and the purchase of Avellino Technologies (2004). At present (2007), Trillium Software System version 11 has been completely developed.

Concerning the level of adoption: from 1992 on, it has been used by companies in the following sectors: financial services, insurance, hospitality and travel, retail, manufacturing, automotive and transport, health care and pharmacy, telecom, cable, and IT. Finally, with respect to the public sector, several US Government institutions use this package for data integration purposes. A case study of a computer system called Insight, based on the TSS linking software, can be found in Herzog (2004) for a world-renowned logistics and courier company (FedEx). The system allows business customers to go online to obtain up-to-date information on all of their cargo, including outgoing, incoming, and third-party shipments.

Concerning domain specificity, one of the most valuable features of the package, strongly emphasized by the vendors, is its ability to carry out data integration and DQ tasks for almost any language or country. These are classified into four different levels (basic, average, robust, and mature) depending on the complexity and degree of development of the TSS standardization rules, and on the level of knowledge and experience accumulated for the corresponding geographical area. Furthermore, the TS Quality module is supposed to be able to process all data types across all data domains (product, financial, asset, etc.), although it obviously refers to business activities, and it is clearly oriented to name and address parsing and de-duplication.
Tools included in the software - Trillium comprises the modules TS Discovery (focused on data profiling), TS Quality (which carries out most of the DQ tasks) and TS Enrichment, in addition to some further technologies that support integrating data quality into various environments.

TS Discovery provides an environment for profiling activities within and across systems, files and databases. Its main capability is to uncover previously untracked information in records that is hidden by data entry errors, omissions, inconsistencies, unrecognised formats, etc., in order to ensure that the data are correct in the sense that they will fulfil the requirements of the integration project. This application performs modelling functions including key integrity and key analyses, join analysis, physical data models, dependency models and Venn diagrams that identify outlier records and orphans.

TS Quality is the rule-based DQ engine that processes data to ensure that they meet established standards, as defined by an organization. This module performs cleansing and standardization, although some preliminary cleansing, repair and standardization can be performed directly from the profiling application, that is, the TS Discovery module itself. The DQ module also contains procedures to parse data such as dates, names and worldwide addresses, with some capability for metadata-based correction and validation against authoritative sources, in order to improve and refine the linkage process. Suspected duplicate records can thus be flagged in advance.

This application includes the record linkage and de-duplication engine, which identifies relationships among records through an automated, rule-based process. It is sold "out-of-the-box" (Day, 1997), in the sense that a set of ready-to-use rules is available, although users can also apply their own customized options and rules.
The engine finds the connections among records within single files, across different files, and against master databases or external or third-party sources.

Some special features of how the linked data set is built (see Section 2.10 of the WP2 report) are particularly noteworthy. Once the rules are defined, each step of the matching process is recorded in an audit trail, in order to generate reports of the actions and trace the changes made. Every change is therefore appended to the original data, maintaining the meaning of each value across datasets, even when they are shared by different departments with different purposes. In this way, important distinctions among variables in different contexts are not lost.

TS Enrichment is the module focused on data enhancement through the use of third-party vendors. The system can append geographical, census, corporate, and other information from 5000 different sources. This includes postal geocoding. For added safety, appended data do not overwrite source data, but are placed in new fields that can be shared instantly with other users and applications.

Methodology - Trillium performs probabilistic matching, though based on a method different from Fellegi-Sunter. No further details are available on the specific method implemented.

Strength and weakness of the software - A core feature is its specific design for managing mailing address and name lists. All the functions related to the integration (and DQ) process rely strongly on its power to carry out profiling and monitoring tasks and to provide out-of-the-box rules closely related to the firms' marketing experience, together with the knowledge of local experts for each country. A particularly notable strength of the application is its user-friendly environment, designed to be used not only by experts in linkage or statistics or by previously trained users. The Trillium Software System can also be run across a wide range of platforms.
On the other hand, the program has been conceived to manage datasets on customers and products (with identifiers and key variables such as brands, models, catalogue numbers, etc.), which rarely fit the purposes and structure of the datasets that producers of official statistics intend to use.

1.2.9 Link Plus
Tiziana Tuoto (Istat)

Link Plus is a probabilistic record linkage program recently developed at the U.S. Centers for Disease Control and Prevention (CDC), Cancer Division. Link Plus was written as a linkage tool for cancer registries, in support of CDC's National Program of Cancer Registries. It can be run in two modes: to detect duplicates in a cancer registry database, or to link a cancer registry file against external files. Although Link Plus has been designed with reference to cancer registry databases and records, the program can be used with any type of data in fixed-width or delimited format. Essentially, it performs probabilistic record linkage based on the theoretical framework developed by Fellegi and Sunter, and uses the EM algorithm (Dempster et al., 1977) to estimate the parameters of the model proposed by Fellegi and Sunter.

General characteristics - Link Plus is free software, but its source code is not available. It has been designed especially for cancer registry work; however, it can be used for linking any data referring to people. For instance, it deals in depth with variables such as names, surnames, dates, social security numbers and other characteristics typical of individuals in a hospital context; on the other hand, it does not provide specific functionalities for dealing with addresses or with characteristics outside the hospital context or typical of the enterprise framework. The level of adoption is high, particularly in public health organizations.

Tools included in the software - Link Plus does not provide any pre-processing step: it simply handles delimited and fixed-width files.
Regarding the profiling activity, explicit data profiling is not provided, but some data quality measurement can be included implicitly in the linkage process, since variable indicators, such as the number of different categories and the frequency of each category, can be taken into account in the estimation of the linking probabilities. Link Plus provides several comparison functions:
o exact;
o the Jaro-Winkler metric for names and surnames;
o a function specific to the Social Security Number, which incorporates partial matching to account for typographical errors and transposition of digits;
o a function specific to dates, which incorporates partial matching to account for missing month and/or day values and also checks for transposition;
o a Generic String method, incorporating partial matching to account for typographical errors, which uses an edit distance function (Levenshtein distance) to compute the similarity of two long strings.
Regarding the search space reduction phase, Link Plus provides a simple blocking (“OR blocking”) mechanism, indexing up to 5 blocking variables and comparing the pairs with identical values on at least one of those variables. As far as conventional “AND blocks” and multi-pass blocking systems are concerned, Link Plus runs a simplified version of multiple passes simultaneously. When blocking variables involve strings (e.g. names and surnames), Link Plus offers a choice of 2 phonetic coding systems (Soundex and NYSIIS) in order to compare strings based on how they are pronounced. As for the decision method, the Link Plus manual declares the implementation of the probabilistic decision model, referring to the Fellegi-Sunter theory.

Methodology - Regarding the estimation of the model probabilities, Link Plus provides two options. The first one, called the “Direct Method”, allows the use of the default M-probabilities or user-defined M-probabilities.
By default, the M-probabilities are derived from the frequencies of the matching variables in the first file at hand; however, Link Plus provides an option to use, for instance for names, the frequencies from 2000 US Census data or 2000 US National Death Index data. As the second option, Link Plus computes the M-probabilities from the data at hand using the EM algorithm.

The explanations given in the manual about the computation of the M-probabilities in the Direct Method leave some doubts about the probabilistic approach. The authors specify that “the M-probability measures the reliability of each data item. A Value of 0 means the data item is totally unreliable (0%) and a value of 1 means that the data item is completely reliable (100%). Reasonable values differ from 0.9 (90% reliable) to 0.9999 (99.99% reliable). To compute the default M-probabilities, Link Plus uses the data in File 1 to generate the frequencies of last names and first names and then computes the weights for last name and first name based on the frequencies of their values.” These procedures seem closer to the deterministic approach than to the probabilistic one, even though the authors state repeatedly that this method is not deterministic.

In any case, the authors recommend the Direct Method for initial linkage runs, because it is robust and consumes roughly half of the CPU time needed to have Link Plus compute the M-probabilities. However, they underline that using the EM algorithm may improve results, because the computed M-probabilities are likely to be more reflective of the true probabilities, since they are computed by capturing and utilizing the information dynamically from the actual data being linked, especially when the files are large and the selected matching variables provide sufficient information to identify potential linked pairs.
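To make the EM option more concrete, the following Python sketch estimates the m- and u-probabilities of the Fellegi-Sunter model from binary agreement vectors. It is not the Link Plus implementation, whose details are not documented: conditional independence of the matching variables is assumed here, and the function names, starting values and simulated data are hypothetical choices made for illustration.

```python
import math
import random


def em_fellegi_sunter(patterns, n_iter=50, p=0.1, m0=0.9, u0=0.1, eps=1e-6):
    """EM estimates of m, u and the match proportion p from 0/1 agreement vectors."""
    n, k = len(patterns), len(patterns[0])
    m, u = [m0] * k, [u0] * k
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        post = []
        for gamma in patterns:
            like_m, like_u = p, 1.0 - p
            for j, agree in enumerate(gamma):
                like_m *= m[j] if agree else 1.0 - m[j]
                like_u *= u[j] if agree else 1.0 - u[j]
            post.append(like_m / (like_m + like_u))
        # M-step: weighted agreement frequencies, clamped away from 0 and 1.
        total = sum(post)
        p = min(max(total / n, eps), 1.0 - eps)
        for j in range(k):
            agree_m = sum(g * gamma[j] for g, gamma in zip(post, patterns))
            agree_u = sum((1.0 - g) * gamma[j] for g, gamma in zip(post, patterns))
            m[j] = min(max(agree_m / total, eps), 1.0 - eps)
            u[j] = min(max(agree_u / (n - total), eps), 1.0 - eps)
    return m, u, p


def match_weight(gamma, m, u):
    """Composite weight of a pair: sum of log2 likelihood ratios over the variables."""
    return sum(
        math.log2(m[j] / u[j]) if agree else math.log2((1.0 - m[j]) / (1.0 - u[j]))
        for j, agree in enumerate(gamma)
    )


def simulate_patterns(n_pairs=3000, match_rate=0.1, true_m=0.95, true_u=0.05, k=4, seed=0):
    """Hypothetical test data: agreement vectors for matched and unmatched pairs."""
    rng = random.Random(seed)
    patterns = []
    for _ in range(n_pairs):
        prob = true_m if rng.random() < match_rate else true_u
        patterns.append(tuple(int(rng.random() < prob) for _ in range(k)))
    return patterns


patterns = simulate_patterns()
m, u, p = em_fellegi_sunter(patterns)
```

Pairs are then ranked by `match_weight` and compared with two thresholds to declare matches, possible matches and non-matches. Because the estimates come from the files being linked, they adapt to the actual error rates of each variable, which is the advantage the manual attributes to the EM option.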
Assumptions of the probability model, such as conditional independence, are not specified in the manual.

Strength and weakness of the software - The main strength of Link Plus is that it is free of charge. This is perhaps one of the reasons for its widespread use, in particular in public health organizations. A further positive aspect is the availability of a good user guide providing step-by-step instructions that make the software easy to use, even though it does not give in-depth explanations of some methodological points. Another advantage of Link Plus is its ability to handle large data files. On the other hand, Link Plus is specifically set up to work with cancer data, so some difficulties arise in using it for other applications. Moreover, due to the lack of pre-processing functionalities, it aborts when encountering certain nonprinting characters in input data.

1.3 Summary Tables and Comparisons

In the following Section 1.3.1, three comparison tables are presented and described, with the aim of summarizing and pointing out the principal features of each tool described so far. In Section 1.3.2, the strengths and weaknesses of each tool are analysed critically in detail.

1.3.1 General Features

In Table 1, we report the selected values for the characteristics specified above for each of the analyzed tools.
Table 1: Main features

Tool | Free/Commercial | Domain Specificity | Level of Adoption
AUTOMATCH | commercial | functionalities for English words | high
FEBRL | free/source code available | no specific domain | medium
GRLS | commercial (government) | functionalities for English words | medium
LINKAGEWIZ | commercial/less than $5000 | mixed/functionalities for English dataset | medium
RELAIS | free/source code available | no specific domain | low
DATAFLUX | commercial/less than $5000 | no specific domain | high
THE LINK KING | free/source code available (SAS licence is needed) | mixed/requires first and last names, date of birth | high
TRILLIUM | commercial | relatively general features specialized per language/country | medium
LINK PLUS | free/source code not available | mixed/general features | high

Table 2: Comparison of the functionalities of the record linkage tools

AUTOMATCH FEBRL GRLS LINKAGEWIZ DATAFLUX Preprocessing Profiling Comparison Functions Search Space Reduction Decision Model yes yes yes yes yes No yes no yes no yes yes yes yes yes yes yes yes yes yes yes not specified not specified yes yes yes yes yes yes yes yes yes yes no yes No yes yes yes yes No no yes yes yes RELAIS TRILLIUM THE LINK KING LINK PLUS

Tool | Decision Method
AUTOMATCH | Probabilistic
FEBRL | Probabilistic
GRLS | Probabilistic
LINKAGEWIZ | Probabilistic
DATAFLUX | Deterministic
RELAIS | Probabilistic + Deterministic (under development)
TRILLIUM | Probabilistic + Deterministic
THE LINK KING | Probabilistic + Deterministic
LINK PLUS | Probabilistic

In Table 3 we show the details of the specific method used for estimating the Fellegi and Sunter parameters, for those tools that implement the probabilistic Fellegi and Sunter rule (i.e. all the software tools but DataFlux and Trillium).
Table 3: Estimation methods implemented in the record linkage tools

Tool | Fellegi-Sunter Estimation Techniques
AUTOMATCH | Parameters estimation via frequency based matching
FEBRL | Parameters estimation via EM algorithm
GRLS | Parameters estimation under agreement/disagreement patterns
LINKAGEWIZ | No details are provided
RELAIS | EM method; conditional independence assumption of matching variables
THE LINK KING | Ad hoc weight estimation method; not very clear theoretical hypotheses
LINK PLUS | Default M-probabilities + user-defined M-probabilities; EM algorithm

1.3.2 Strengths and Weaknesses

In this section, we describe a fourth table (Table 4) with strengths and weaknesses of the identified tools.

Table 4: Strengths and Weaknesses of the record linkage software tools
Strengths Weaknesses No error rate estimation AUTOMATCH Good documentation User-friendly Specific for English language Preprocessing Not free Automatic matching parameter estimation Free and open-source Not comprehensive profiling tools FEBRL Preprocessing (standardization, Specific for English language geocoding) Good documentation Good documentation Not free GRLS Free training course at No error rate estimation Statistics Canada Specific for English language Preprocessing Performances The coverage of the RL functionalities is poor LINKAGEWIZ Preprocessing Performance Decision Method Free and Open Low Adoption RELAIS Good support of the record linkage functionalities Toolkit Flexibility Preprocessing Decision Method DATAFLUX Profiling Standardization Monitoring User friendly Non standard estimation of probability THE LINK Flexible weights KING Tools for manual reviews A SAS license is necessary Open codes Specific for people files usage Preprocessing (profiling, Not free TRILLIUM standardization, geocoding) Specific for managing mailing addresses and Data enrichment and names lists.
monitoring Algorithms for probabilistic RL are not wellUser-friendly interface defined Ability to work across datasets and systems Free availability Specifically set-up to work with cancer data, LINK PLUS High adoption some difficulties in using for other User friendly applications Good user guide with step-by- Sensitive to nonprinting characters in input step instruction data Ability to manipulate large data files

2 Software Tools for Statistical Matching
Mauro Scanu (Istat)

Software solutions for statistical matching are not as widespread as those for record linkage. Almost all applications are carried out by means of ad hoc code. Sometimes, when the objective is micro (the creation of a complete synthetic data set with all the variables of interest), it is possible to use general-purpose imputation software tools. On the other hand, if the objective is macro (the estimation of a parameter of the joint distribution of a pair of variables which are not observed jointly), it is possible to adopt general statistical analysis tools that are able to deal with data sets affected by missing items.

In this section we review the available tools explicitly devoted to statistical matching. Just one of them (SAMWIN) is software that can be used without any programming skills. The others are software codes that can be used only by those with knowledge of the corresponding language (R, S-Plus, SAS).

2.1 Comparison criteria for statistical matching software tools
Mauro Scanu (Istat)

The criteria used for comparing the software tools for statistical matching are slightly different from those for record linkage, because the applications are not as widespread as in the case of record linkage. First, general features of the software tools are reviewed. These include the following issues:
1. is the software free or commercial?
2. is the software built for a specific experiment or not (domain specificity)?
3. is the software mature (number of years of use)?
As a matter of fact, these issues are very similar to those considered for record linkage software tools. For the functionalities included in the software tools, the focus was on:
1. the inclusion of preprocessing and standardization tools;
2. the capacity to create a complete synthetic data set by fusing the two data sources to be integrated;
3. the capacity to estimate parameters of the joint distribution of variables never jointly observed (i.e. one variable observed in the first data source and the other in the second data source);
4. the assumptions on the model of the variables of interest under which the software tool works (the best known is the conditional independence of the variables not jointly observed given the variables common to the two data sources);
5. the presence of any quality assessment of the results.
Furthermore, the software tools are compared according to the implemented methodologies. The strengths and weaknesses of each software tool are highlighted at the end.

2.2 Statistical Matching Tools

2.2.1 SAMWIN
Mauro Scanu (Istat)

This software was produced by Giuseppe Sacco (Istat) for the statistical matching application of the social accounting matrix. The main purpose of this software tool is to create a complete synthetic data set using hot-deck imputation procedures. It is an executable file written in C++ and easy to use thanks to its Windows facilities.

Free/commercial: The software is free. To obtain it, write to Giuseppe Sacco (sacco@istat.it). It consists of an executable file. The source code is not available.
Domain specificity: although the software was built for a specific experiment (the construction of the social accounting matrix), it is not domain specific.
Maturity: low. It was used for specific experiments in the Italian Statistical Institute and abroad.
Frequency of usage: three main applications
Number of years of usage: 3

Functionalities
Preprocessing and standardization: the input files must already be harmonized. No standardization tools are included.
Creation of a complete synthetic data file: the output of the software is a complete synthetic data file. The role of the samples (i.e. recipient and donor files) must be decided in advance.
Estimation of specific parameters (e.g. regression coefficients, parameters of a contingency table) for the joint distribution of variables observed in distinct samples: the software does not estimate single parameters.
Model assumptions: this software can be used only under the conditional independence assumption. It can also be used if there is auxiliary information in the form of a third data set with all the variables of interest jointly observed.
Quality evaluation of the results: the software includes some quality indicators (number of items to impute in a file, number of possible donors for each record) of the data set to match. It does not include a quality evaluation of the output.
Other: it is possible to use this software for imputing a single data set affected by general patterns of missing values (either by hot-deck or cold-deck). It includes functionalities for simulations (useful for evaluating statistical matching procedures in a simulated framework). The instructions on how to use SAMWIN are available in Sacco (2008).

Methodologies implemented in the software
The software implements different hot-deck procedures, according to the nature of the matching variables (continuous, categorical or both). It is possible to specify clusters (strata) of units (this ensures equality on the stratification variables for both the donor and the recipient records). It is possible to use different distance functions (Euclidean, Manhattan, Mahalanobis, Chebyshev, Gower).
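As an illustration of distance hot-deck matching within strata, the following Python sketch attaches to each recipient record the Z value of its nearest donor. It is not SAMWIN code (SAMWIN is a C++ executable whose source is not available): the tuple layout, the function names and the toy data are assumptions, the matching is unconstrained (a donor may be used more than once), and only three of the listed distances are shown, since Mahalanobis and Gower need covariance or range information.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


def hot_deck_match(recipients, donors, distance=euclidean):
    """Fuse recipients (stratum, x, y) with donors (stratum, x, z) by nearest neighbour.

    Returns synthetic records (stratum, x, y, z), where z is imputed from the
    closest donor on the common variables x within the same stratum.
    """
    fused = []
    for stratum, x_rec, y in recipients:
        # Only donors in the same stratum are eligible; an empty stratum in
        # the donor file makes min() raise ValueError.
        eligible = [d for d in donors if d[0] == stratum]
        _, _, z = min(eligible, key=lambda d: distance(x_rec, d[1]))
        fused.append((stratum, x_rec, y, z))
    return fused


# Hypothetical toy files: x = (age, household size); y and z are never observed together.
recipients = [
    ("north", (30.0, 1.0), "y1"),
    ("north", (55.0, 3.0), "y2"),
    ("south", (31.0, 1.0), "y3"),
]
donors = [
    ("north", (29.0, 1.0), "z1"),
    ("north", (60.0, 2.0), "z2"),
    ("south", (50.0, 4.0), "z3"),
]
fused = hot_deck_match(recipients, donors, distance=euclidean)
```

The third recipient receives "z3" even though a closer donor exists in the other stratum, which shows how stratification forces agreement on the stratification variable. The joint distribution of Y and Z in the fused file is meaningful only under the conditional independence assumption stated above.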
2.2.2 R codes
Marcello D’Orazio (Istat)

This set of software codes, written in the R functional programming language, was produced by D’Orazio et al. (2006) and implements some hot deck SM procedures. Moreover, the Moriarity & Scheuren (2001) code is implemented and an extension of it is provided. Code to study SM uncertainty for categorical variables is also introduced (an extension of codes by Schafer for multiple imputation of missing values for categorical variables; Schafer's codes are available at http://www.stat.psu.edu/~jls/misoftwa.html).

Free/commercial: The code is published in the Appendix of D’Orazio, Di Zio and Scanu (2006); moreover, it can be downloaded from: http://www.wiley.com//legacy/wileychi/matching/supp/r_code_in_appendix_e.zip
Updates of the code are available on request by contacting Marcello D’Orazio (madorazi@istat.it).
Domain specificity: It is not domain specific.
Maturity: low. It was used for the experiments (simulations) performed by D’Orazio et al. (2006).
Frequency of usage: not available
Number of years of usage: 2

Functionalities
Preprocessing and standardization: the input files must already be harmonized. No standardization tool is included.
Output micro: the output of the software can be a complete synthetic data file.
Output macro: the codes can estimate parameters of a multivariate normal distribution (X, Y and Z must be univariate normal) or ranges of probabilities of events (for categorical variables).
Model assumptions: hot deck techniques assume conditional independence. The codes based on the work of Moriarity and Scheuren and its extension assume the normality of the data. No model assumptions are required for the evaluation of SM uncertainty in the case of categorical variables.
Quality evaluation of the results: the software does not include direct or indirect tools to evaluate the accuracy of the estimated parameters.
Methodologies implemented in the software
The software implements the method proposed by Moriarity and Scheuren (2001); moreover, an extension based on ML estimation methods is proposed, which makes it possible to overcome some limits of the Moriarity and Scheuren (2001) methodology. SM hot deck methods based on nearest neighbour distance are implemented; the matching can be constrained or unconstrained. Finally, code is presented for estimating the uncertainty concerning the probabilities of events in the SM setting. In this code it is possible to set linear constraints involving some probabilities of events. This code is based on the EM algorithm as implemented by Schafer (http://www.stat.psu.edu/~jls/misoftwa.html) to perform multiple imputation in the presence of missing values for categorical variables.

2.2.3 SAS codes
Mauro Scanu (Istat)

This set of software codes, written in the SAS language, was produced by Christopher Moriarity for his PhD thesis (available, not for free, on the website http://www.proquest.com). It includes just two statistical matching methods, those introduced by Kadane (1978) and by Rubin (1986).

Free/commercial: The codes are available only in a PhD thesis, which can be obtained on request, not for free. They are available only on paper.
Domain specificity: It is not domain specific.
Maturity: low. There is evidence of experiments (simulations) performed by Moriarity for his PhD thesis.
Frequency of usage: not available
Number of years of usage: 8

Functionalities
Preprocessing and standardization: the input files must already be harmonized. No standardization tool is included.
Output micro: the output of the software can be a complete synthetic data file.
Output macro: the codes can estimate single parameters of a multivariate normal distribution (X, Y and Z must be univariate normal).
Model assumptions: the codes can be used only under the assumption of normality of the data.
It has been built for evaluating the different models compatible with the available information from the two surveys (evaluation of the uncertainty for the parameters that cannot be estimated, in this case the correlation coefficient of the never jointly observed variables). As a byproduct it can reproduce estimates under the conditional independence assumption. Quality evaluation of the results: The software includes some quality evaluation of the estimated parameters (as jackknife variance estimates of the regression parameters). Methodologies implemented in the software The software implements the method proposed by Kadane (1978) that, given two samples that observe respectively the variables (X,Y) and (X,Z), establish the minimum and maximum values that the correlation coefficient of Y and Z can assume. The other software code implements the method proposed by Rubin (1986) that tackles the previous problem by means of a multiple imputation method. 2.2.4 S-Plus codes Marco Di Zio (Istat) This set of software codes written in the S-Plus language was produced by Susanne Raessler to compare different multiple imputation techniques in the statistical matching setting. Free/commercial: The codes are available only on Raessler (2002). They are available only on paper. Domain specificity: NO WP3 20 Maturity: low. There is evidence of experiments (simulations) performed by Raeesler (2002) for her papers. Frequency of usage: not available Number of years of usage: 6 Functionalities Preprocessing and standardization: the input files must be already harmonized. No standardization tool is included Output micro: YES Output macro: The codes can estimate single parameters of a multivariate normal distribution (X, Y and Z must be univariate normal) Model assumptions: the codes can be used only under the assumption of normality of the data. 
Quality evaluation of the results: NO

Methodologies implemented in the software
The software implements the methods proposed by Raessler (2002, 2003) for multiple imputation with statistical matching. The methods implemented in the codes are:
• NIBAS, the non-iterative multivariate Bayesian regression model based on multiple imputations proposed by Raessler; see Raessler (2002, 2003).
• RIEPS, the regression imputation technique discussed in Section 4.4 of Raessler (2002) and in Raessler (2003, Section 2).
• NORM, the data augmentation algorithm assuming the normal model, proposed by Schafer (1997, 1999) for the general purpose of doing inference in the presence of missing data. It requires the S-PLUS library NORM.
• MICE, an iterative univariate imputation method proposed by Van Buuren and Oudshoorn (1999, 2000). It requires the library MICE.
The software codes are used to evaluate uncertainty for Gaussian data. They estimate the region of acceptable values for the correlation coefficient of the variables not jointly observed, given the observed data of the samples that must be integrated.

2.3 Comparison tables
This section summarizes the content of the previous section, comparing the characteristics of the different software tools.

Table 1 – Comparison of the general characteristics of the software tools for statistical matching
SAMWIN
Free/commercial: free; source code not available
Domain specificity: no specific domain
Level of adoption: low
R codes
Free/commercial: free; source code available in D’Orazio et al. (2006); updates available on request
Domain specificity: no specific domain
Level of adoption: low
SAS codes
Free/commercial: source code available in a PhD thesis (available on the webpage www.proquest.com)
Domain specificity: no specific domain
Level of adoption: low
S-Plus codes
Free/commercial: source code available in Raessler (2002)
Domain specificity: no specific domain
Level of adoption: low

Table 2 – Comparison of the functionalities of the software tools for statistical matching (CIA = conditional independence assumption; AI = auxiliary information; Unc = uncertainty)
SAMWIN
Pre-processing: no
Output micro: yes
Output macro: no
Model: CIA: yes; AI: yes; Unc: no
Quality evaluation: yes for input files, no for output files
Other: imputation of a single sample; it performs simulations; instructions available
R code
Pre-processing: no
Output micro: yes
Output macro: yes
Model: CIA: yes; AI: yes; Unc: yes
Quality evaluation: no
Other: micro and macro approaches available; codes for categorical variables
SAS code
Pre-processing: no
Output micro: yes
Output macro: yes
Model: CIA: yes; AI: yes; Unc: yes
Quality evaluation: no
Other: only continuous (normal) variables
S-Plus codes
Pre-processing: no
Output micro: yes
Output macro: yes
Model: CIA: yes; AI: yes; Unc: yes
Quality evaluation: no
Other: only for continuous (normal) variables

Table 3 – Comparison of the methodologies implemented in the software tools for statistical matching
SAMWIN: Hot-deck methods (distance hot deck with different distance functions).
R code: Hot-deck methods (constrained and unconstrained distance hot deck with various distance functions; random hot deck). Estimates based on Moriarity and Scheuren methods and ML estimates. Uncertainty estimation for probabilities of events.
SAS codes: Estimates based on consistent (not maximum likelihood) methods. Application of a not proper multiple imputation procedure.
S-Plus codes: Multiple imputation methods.

Table 4 – Strengths and weaknesses of the software tools for statistical matching
SAMWIN
Strengths: free; easy to use; instructions
Weaknesses: low adoption; the output is only micro
R codes
Strengths: codes available on the internet; codes for categorical variables
Weaknesses: R package still missing
SAS codes
Strengths: written in SAS
Weaknesses: low adoption; not easy to use; limited results
S-Plus codes
Strengths: implementation of Bayesian and proper multiple imputation procedures for statistical matching
Weaknesses: low adoption; not easy to use; limited results

3 Commercial software tools for data quality and record linkage in the process of microintegration
Jaroslav Kraus (CZSO), Ondřej Vozár (CZSO)

3.1 Data Quality Standardization Requirements
It is not common for vendors of general and multipurpose tools to publish quality assessments or quality characteristics of the data integration process. In data cleansing, in batch or off-line processing, and across different application domains, the quality of the output tends to be expressed as functional parameters of the application, the speed of the matching, or the quality of the automatically learned intelligence. Assessing and comparing the real quality of a product through the “level of calculated truth” of the data is substantially problematic. Vendors in the data quality market are often classified according to their overall position in the IT business, where a focus on specific business knowledge and experience in a specific business domain plays an important role.
The position of vendors and the quality of their products on the market are characterized by:
• Product features and relevant services
• Vendor characteristics, domain business understanding, business strategy, creativity, innovation
• Sales characteristics, licensing, prices
• Customer experience, reference projects

Data Quality Tools and Frameworks
The record linkage characteristics of the products are the focus of this section.

3.2 Data Quality assessment
Record linkage (the process of linking together the subjects from different data sources, often a person, family, institution, etc.) is a substantial task within a wide set of practical methods for searching, comparing and grouping records. The software vendors in the statistics-oriented “data quality market” propose solutions addressing all the tasks in the entire life cycle of data-oriented management programs and projects, starting with data preparation and survey data collection, through improvement of quality and integrity, to the set-up of reports and studies.

Vendor proposals are usually classified according to the vendor's “position in the concrete segment of informatics” and its strength and competitive position in the IT business, where a focus on specific business knowledge and experience in a specific business domain plays an important role. The declared quality of vendors and their products on the market is characterized by:
• Product features and relevant services
• Vendor characteristics, domain business understanding, business strategy, creativity, innovation
• Sales characteristics, licensing, prices
• Customer experience, reference projects

It is not easy to publish comparable quality assessments and characteristics of the quality of a data integration process.
In “quality of data” management, whether in cleansing, large batch processing, interactive on-line or off-line work, or specific domains of data and applications, the meaning of the quality of a tool depends on the quality of the data at input and, similarly, at output. Quality is typically expressed as functional parameters of the application, describing what it can and cannot do. Non-functional parameters, such as the time performance of the matching, behavior, capacity and the automatically learned knowledge, are difficult to measure in comparable units. Even if the core application algorithms work only at the syntactic level (as exact matching does), the physical data storage, data access methods, technology and other programming choices crucially influence the measurable quality (speed, accuracy, etc.). With comprehensive semantics-driven operations, and even more with a pragmatic view of the data, the “intellectual” quality of the tools should influence the computing processes and parameters dramatically (National Statistics Code of Practice).

Comparing the real quality of software products therefore depends on the assessment of the data itself, on physically measurable parameters of the computing processes, and on belief and trust in the “level of calculated truth” and its correct interpretation in real-world practice. All of this must be related to a measurable estimate of the risk attached to an action or decision that depends on the quality of the data applied in a business process. The absence of structured descriptions of the quality of features and of the fulfillment of requirements makes it difficult to select the most appropriate software package.
Some authors, for example Franch et al. (2003), discuss methodologies for describing the quality and features of domain-specific software packages through evaluation frameworks (guides, questions and tables), using the ISO/IEC 9126-1 quality standard as a guiding framework. Given the software/application category, tools that perform or support data-oriented record linkage projects in statistics should have these common characteristics:
1. portability, in being able to function with statistical researchers' current arrangement of computer systems and languages,
2. flexibility in handling different linkage strategies,
3. low operational expenses, in terms of TCO (Total Cost of Ownership) and of both computing time and researchers' effort.

Record linkage packages (both universal and domain specific) are often described in marketing and commercial documentation as satisfying all of these criteria. Consequently, if we evaluate tools for both deterministic and probabilistic linkage, the results of a simple comparison are not consistent. It is similarly difficult to assess tools designed for non-European statistical systems, which rely on logic based, for example, on NYSIIS identification, Soundex algorithms, or address and business name standardization coming from the US environment. Likewise, methods that rely on the most advanced strategies, mixing deterministic and probabilistic linkage and applying graph algorithms or neural networks in the decision step, cannot be assessed against common comparable criteria; no single figure can represent a correct comparable value across such different disciplines. Quality tuning, testing and performance simulation are the only way of getting real performance figures for these modules. To assess data quality and other “production” parameters, testing and prototyping of the right project structure of the solution are often part of the proposal.
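The remark about US-oriented phonetic codes can be illustrated with a short sketch of American Soundex in Python. This is a plain implementation of the commonly published coding rules, not the routine used by any of the tools reviewed here; the function name and the handling of non-English letters (treated as uncoded, like vowels) are assumptions made for illustration.

```python
# Map each coded consonant to its Soundex digit.
_CODES = {}
for _digit, _letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                         "4": "l", "5": "mn", "6": "r"}.items():
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Return the four-character American Soundex code of a name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        digit = _CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        # Uncoded letters (vowels, y, letters outside a-z) reset 'previous',
        # so the same code on both sides of a vowel is kept twice.
        previous = digit
    return (code + "000")[:4]
```

With this sketch, "Robert" and "Rupert" receive the same code, which is the intended behaviour for English surnames. A letter with a diacritic such as "ř" has no entry in the English code table, so "Ondřej" and "Ondrej" receive different codes unless the names are first transliterated, which is one reason such comparisons need local adaptation.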
Because “each linkage project is different”, the modularity of the software should allow better control over programming and managing the processes in a linking project, and the development of unique strategies either in advance or flexibly in response to actual conditions during the linking process. Where the user (statistician or domain researcher) provides the weights, parameters, decision rules or his “expertise and behavioral know how”, the architect of the linkage solution can choose among different modules, interfaces and supplementary solutions. The advantage of open and flexible solutions is that simulations can modify data and parameters consistently in different steps of the project workflow, and that additional modules from other suppliers and IT resources can be developed or engaged to supplement the basic modules.

There are a number of commercial and non-commercial implementations and deployments of matching and record linkage algorithms, covering exact and probabilistic matching with more or less sophisticated strategies. They are often used in open source environments and are often developed and supported by independent vendors or university groups.

In this section we evaluate some of the commercial software packages which, according to their data quality scoring position in Gartner reports, belong to the most important vendors in this area. The Magic Quadrant graph was published by Gartner, Inc. as part of a larger research note and should be evaluated in the context of the entire report.

Figure: Magic Quadrant for Data Integration Tools, 2007

We focus on three vendors:
- Oracle, a partner for data quality, data integration and enterprise integration services, which represents the optimal convergence of tools and services in the software market.
- SAS/DataFlux, a data quality, data integration and BI (Business Intelligence) player on the market, with its applications and methodologies on an integrated platform.
- Netrics, which offers advanced technology complementing the Oracle data quality and integration tools.

IT specialists and statistical analysts evaluated the products, services and advanced scientific tools and methodologies, together with the suppliers' experience in providing statistical application intelligence. Real-life solutions and concrete statistical projects (e.g. the Swiss Federal Statistical Office in scanning and recognition) served as references for the application of the software in projects with high volumes of data, scanned from paper forms during Census 2001, from pilot projects for the next census, and from other statistical experience.

The bulleted questions below form a basic schema of software package characteristics, applying quality management models to software package selection. They should be answered with a deep understanding of the problems, features and practice in concrete projects, for example the Census 2011 project as mentioned above. The questions are answered in structured form in a set of tables (Section 3.3). The answers are published and authorized by the vendors and are often confirmed by concrete tests and experience. This structured format simplifies understanding of the evaluation and enables a basic comparison of the tools and features, based on available information resources and on practical experience.
The format of comparison tables

General
• Is the software a universal system or specific to a given application domain?
• Is the software:
o A complete system, ready to perform linkages "out of the box"
o A modular or development environment, requiring noticeable project/programming effort
o A framework to be applied by a statistician
• Which types of linkage does the software support?
o Linking two files
o Linking multiple files
o Deduplication
o Linking one or more files to a reference file
• The infrastructure and integration features
o Computing performance
o Capacities and storage
o Networking and security
o Databases
o Integration, middleware, SOA
• The linkage mode: interactive, in real time, in batch mode
• The characteristics of the vendor. Reliability: can the vendor provide adequate technical support?
• Software documentation and manuals
• The features the vendor plans to add in the near future (e.g., in the next version)
• References and practical experience
• Other software, such as database packages or editors, needed to use the system

Linkage Methodology
• Record linkage methods: is the linkage software-based or performed through manual interaction?
• Does the user have control over the linkage process? Is the system a "black box", or can the user set parameters to control the linkage process?
• Does the software require any parameter files?
• Does the user specify the linking variables and types of comparisons?
• Are comparison functions available for different types of variables? Do the methods give proportional weights (that is, allow degrees of agreement)?
o Character-for-character
o Phonetic code comparison (Soundex or NYSIIS variant)
o Advanced information-theoretic string comparison function
o Specialized numeric comparisons
o Distance comparisons
o Time/Date comparisons
o Ad hoc methods (e.g., allowing one or more characters different between strings)
o User-defined comparisons
o Conditional comparisons
o Can the user specify critical variables that must agree for a link to take place?
• Does the system handle missing values for linkage variables?
o Computes a weight or any other value
o Uses a median between agreement and disagreement weights
o Uses a low or zero weight
• What is the maximum number of linking variables?
• Does the software block records? How do users set blocking variables?
• Does the software contain or support routines for estimating linkage errors?
• Does the matching algorithm use contextual techniques that evaluate dependence between variables?

Data Management
• Data storage formats the software uses
o Flat files
o Database
o Caches, ODS
o Data limits

Post-linkage Function
• Does the software provide a utility for review of possible links?
• Can the decision work on record review be shared by several users simultaneously?
• Can records be "put aside" for later review?
• Does the software provide a utility for generating reports on the linked, unlinked, duplicate, and possible link records? Can the report format be customized?
• Does the software provide a utility for extracting files of linked and unlinked records? Can the user specify the format of such extracts?

Standardization
• Does the software provide a means of standardizing (parsing out the pieces of) name and address fields, localized for specific environments?
• Is name and address standardization customizable? Can different processes be used on different files?
• Does standardization change the original data fields, or does it append standardized fields to the original data record?

Costs
• Purchase and maintenance costs of the software itself, along with any needed additional software (e.g., database packages) and new or upgraded hardware.
• The cost of training personnel to use the system.
• The projected staffing associated with running the system.
• The cost of developing the system for the intended purposes using the software within the available budget.

Empirical Testing, case studies and examples
• Levels of false match and false non-match: can they be pre-set, are they known for the system, or are they tested on samples?
• Number of manual interventions in practical examples (e.g., possible match review)
• Timing of typical match projects with the system (examples or case studies)

3.3 Summary Tables of Oracle, Netrics and SAS/DataFlux

3.3.1 Oracle

3.3.1.1 Software/Tools/Capabilities
Oracle Warehouse Builder transparently integrates all relevant aspects of data quality into all phases of the data integration project life cycle. It provides functionality to support the following aspects of data quality:
• Creation of data profiles.
• Maintenance of data rules, derived from automatically collected statistics or created manually.
• Execution of data audits based on data rules.
• Identification and unification of multiple records.
• Matching and merging for the purpose of de-duplication.
• Matching and merging for the purpose of record linking (including householding).
• Name and address cleansing.

Oracle Data Integrator. The basic idea behind Oracle Data Integrator was to create a framework for the generation of data integration scripts. General integration patterns were derived from many integration projects from all around the globe. The techniques are the same and only the underlying technologies vary, so the development environment was divided accordingly. Topologies, data models, data quality checks and integration rules are modeled independently of technology. Technology-specific language templates (knowledge modules) are assigned in a second step, and the resulting code is generated automatically. This approach maximizes the flexibility and technological independence of the resulting solution.
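The two-step idea described above (a rule modeled independently of technology, then a technology-specific template that generates the code) can be sketched as follows. This is only an illustration in Python using the standard library `string.Template`; it is not the Oracle Data Integrator API or its knowledge module format, and the rule fields, template texts and the function `generate_check` are hypothetical.

```python
from string import Template

# Step 1: a data quality rule described independently of any technology
# (hypothetical rule: the column must not be empty).
rule = {"table": "person", "column": "birth_date"}

# Step 2: hypothetical per-technology templates, standing in for the
# "knowledge modules" of the text. Each one renders the same rule as a
# query that lists a limited number of violating rows.
TEMPLATES = {
    "oracle": Template(
        "SELECT * FROM $table WHERE $column IS NULL AND ROWNUM <= $limit"),
    "postgresql": Template(
        "SELECT * FROM $table WHERE $column IS NULL LIMIT $limit"),
}


def generate_check(rule, technology, limit=100):
    """Generate the technology-specific script for a technology-neutral rule."""
    return TEMPLATES[technology].substitute(rule, limit=limit)


oracle_sql = generate_check(rule, "oracle", limit=10)
postgres_sql = generate_check(rule, "postgresql", limit=10)
```

The rule is written once; supporting another technology means adding a template, not rewriting the rule, which is the kind of technological independence the paragraph above refers to.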
Oracle Data Integrator allows organizations to reduce the cost and complexity of their data integration initiatives, while improving the speed and accuracy of data transformation. Oracle Data Integrator supports three “rights”:
• The Right Data: The data must not only be appropriate for the intended use, but must also be accurate and reliable.
• The Right Place: The overall information ecosystem consists of multiple operational and analytical systems, and they all need to benefit from the data of the other systems, regardless of their location.
• The Right Time: Data can become stale quickly. A decision support system that does not get the data in time is useless. A shipping application that does not get order information before cutoff time is not efficient. Getting data at the right time, with a latency that is appropriate for the intended use of the data, is one of the most important challenges faced by businesses today.

Each Knowledge Module type refers to a specific integration task, e.g.:
• Reverse-engineering metadata from heterogeneous systems for Oracle Data Integrator.
• Handling Changed Data Capture on a given system.
• Loading data from one system to another.
• Integrating data in a target system, using specific strategies (insert/update, slowly changing dimensions, match/merge).
• Controlling data integrity on the data flow.
• Exposing data in the form of services.
Knowledge Modules are also fully extensible. Their code is open and can be edited through a graphical user interface.

Powerful deterministic and probabilistic matching is provided to identify unique consumers, businesses, households, or other entities for which common identifiers may not exist. Examples of match criteria include:
• Similarity scoring to account for typos and other anomalies in the data,
• Soundex to identify data that sounds alike,
• Abbreviation and acronym matching, etc.

Oracle Data Integrator provides audit information on data integrity.
For example, the following are four ways erroneous data might be handled:
• Automatically correct data: Oracle Data Integrator offers a set of tools to simplify the creation of data cleansing interfaces.
• Accept erroneous data (for the current project): in this case rules for filtering out erroneous data should be used.
• Correct the invalid records: in this situation, the invalid data is sent to application end users via various text formats or distribution modes, such as human workflow, e-mail, HTML, XML, flat text files, etc.
• Recycle data: erroneous data from an audit can be recycled into the integration process.

Built-in transformations are usually used as part of ETL processes. Oracle Data Integrator enables application designers and business analysts to define declarative rules for data integrity directly in the centralized Oracle Data Integrator metadata repository. These rules are applied to application data, in line with batch or real-time extract, transform, and load (ETL) jobs, to guarantee the overall integrity, consistency, and quality of enterprise information.

Match rules are specified and managed through intuitive wizards. Match/Merge can also be accessed through an extensive scripting language.
• Parsing into individual elements is used for improved correction and matching.
• Standardization, which involves modification of components to a standard version acceptable to a postal service or suitable for record matching.
• Validation and correction using a reference database.
• Manual addition of new data elements.

Data quality rules range from ensuring data integrity to sophisticated parsing, cleansing, standardization, matching and deduplication. Data can be repaired either statically in the original systems or as part of a data flow. Flow-based control minimizes disruption to existing systems and ensures that downstream analysis and processing works on reliable, trusted data.

Conditional match rules, weighted match rules.
User-defined algorithms, which can also reference one of the other match rule types to logically “and” or “or” them together to produce a unique rule type.

Oracle Data Integrator does not provide standard functionality for weighted matching, but provides an open interface to enable integration with any database functionality or custom functionality (in Java, PL/SQL, shell, etc.). Match results are intelligently consolidated into a unified view based on defined rules.

3.3.1.2 Integration options
Oracle Unified Method, Oracle Project Management Method
The integration approach is driven by the data model and by the model of business/matching rules. Business/matching rules are partially designed and created by business users. The data model is usually created (modeled) by technical users.

Computing Infrastructure. For name and address cleansing, an additional server for this functionality is required. The Oracle Data Integrator architecture is organized around a modular repository, which is accessed in client-server mode by components: graphical modules and execution agents. The architecture also includes a Web application, Metadata Navigator. The four graphical modules are Designer, Operator, Topology Manager and Security Manager. Oracle Data Integrator is a lightweight, legacy-free, state-of-the-art integration platform. All components can run independently on any Java-compliant system.

Supported DBMSs
Oracle database server: Oracle Database 8i, 9i, 10g, 11g

3.3.1.3 Performance/Throughput
The reference system contained 8.5 million unified physical persons and 24.5 million physical person instances, with approximately 5% of multiplicities among the physical person records. The matching score for physical persons was 98% (98% of physical person instances are linked to a unified physical person). In total, 33 million dirty addresses entered the address matching and cleansing mechanism.
The success rate of the cleansing process was 84%.

3.3.1.4 Business terms, conditions and pricing
Oracle Warehouse Builder
a/ Oracle Warehouse Builder as CORE ETL is free as part of Oracle DB from version 10g (edition SE-1, SE or EE); that is, the database is licensed on CPU or NU, and Oracle Warehouse Builder in the CORE edition is part of it.
b/ Oracle Warehouse Builder has three Options:
1/ Enterprise ETL: performance and development productivity (licensed as part of database EE by the database rules, NU or CPU)
2/ Data Quality: data quality and data profiling (licensed as part of database EE by the database rules, NU or CPU)
3/ Connectors: connectors to enterprise applications (licensed as part of database EE by the database rules, NU or CPU)
Oracle Data Integrator
a/ Oracle Data Integrator is licensed on CPU.
b/ Oracle Data Integrator has two Options:
1/ Data Quality (licensed only on CPU)
2/ Data Profiling (licensed only on NU)
Maintenance costs
For all Oracle products, maintenance is 22% of the price of the licenses.

3.3.2 Netrics

3.3.2.1 Software/Tools/Capabilities
The Netrics Data Matching Platform consists of two complementary components:
1. The Netrics Matching Engine, which provides error-tolerant matching against data on a field-by-field or multi-field basis;
2. The Netrics Decision Engine, which decides whether two or more records should be linked, are duplicates, or represent the same entity, learning from and modeling the same criteria that the best human experts use.

Matching accuracy out-of-the-box
The Netrics Matching Engine™ matches data with error tolerance that approximates human perception, with a level of speed, accuracy, and scalability that, according to the vendor, no other approach can provide. It identifies similarity across a wide range of “real-world” data variations and problem conditions. The Netrics Matching Engine finds matches even for incomplete or partial similarity.
Using Netrics’ patented mathematical approach, it finds similarities in data much as humans perceive them. As a result, it discovers matches even when there are errors in both the query string and the target data. It can therefore handle many of the issues that plague real-world data, from simple transpositions and typos up to and including data that has been entered into the wrong fields.

Decision making accuracy – ease of initial training and re-training if needed
The Netrics Decision Engine™ uses machine learning to eliminate the weaknesses of both deterministic and probabilistic rules-based algorithms. It uses an exclusive system that models human decision-making. It “learns” by automatically tracking and understanding the patterns of example decisions provided by business experts, and creates a computer model that accurately predicts them. Once trained, the Decision Engine’s model reliably makes the same decisions as the experts, at computer speed, without losing track of the rules and intelligence learned before, and with the same focus on the problem that the experts provided. Because creating a model is so easy (the expert just needs to provide examples), a custom model can be created for each critical business decision, tailored to the specifics of the data, market, and business requirements. If social security numbers, driver’s license numbers, emergency contact information, employer or insurance information are collected, any or all of these data elements can be used in the evaluation of a record. There is no need for programmers to devise the underlying business rules, which may not capture all possible criteria and may introduce inaccuracies, or to make up weights; the Netrics Decision Engine determines these by itself.
http://www.netrics.com/index.php/Decision-Engine/Decision-Engine.html

Able to deal with multiple varied errors in both the data records and query terms
The Netrics Matching Engine uses mathematical modeling techniques to determine similarity. This means that a high degree of accuracy is maintained even when there are numerous errors in both the incoming matching requests and the data being matched.

Fine grained control over each matching request
If needed, matching features can be customized incrementally, down to a per-match basis, including phonetic analysis, field selections, field weighting, multiple record weighting, record predicates, multi-table selection, and multi-query capabilities.

3.3.2.2 Integration options
Ease of integration into applications
Depending on the customer's start-up strategy, customers are able to work with the Netrics Matching Engine on test data within minutes. Netrics reports experience with customers who have implemented production instances within a few hours. All the matching capabilities are provided in a powerful yet easy to use API that can be called from all major languages, and also through SOAP requests in SOA implementations. Integration with or within standard ETL applications is the most commonly expected scenario.

Range of server platforms supported
The Netrics Matching Engine and Netrics Decision Engine have been ported across a wide range of 32 and 64 bit operating systems.

Client languages supported – ease of API usage
Client-side support is provided for Java, Python, .NET, C#, C and C++ and for the command line, as well as full WSDL/SOAP support enabling easy SOA implementation.

Supported DBMSs
Data can be loaded into the Netrics Matching Engine from any database that can extract data into a CSV file or other formats.
Specific loading and synchronization support, which keeps the matching data updated as changes happen to the underlying data, is provided for the major DBMSs: Oracle, SQL Server, MySQL and DB2. As the intermediate Netrics database provides services to the engines with cache-like access, deploying a simple, quick and secure database environment is preferable. In this way the Netrics decision environment can be set up as an enterprise-wide central data quality engine for ad hoc applications, in parallel with the BI typically driven from the DWH store.

3.3.2.3 Performance/Throughput

Support for large databases
Conventional "fuzzy matching" technology is computer-resource intensive: rule sets often drive multiple queries on databases, as developers compensate for the lack of sophisticated matching technology by using multiple "wildcard" searches or by testing the validity of error-prone algorithms such as Soundex. More sophisticated algorithms such as "Edit Distance" are highly computationally intensive. Computation increases exponentially with the size of the "edit window" and the number of records being matched. These limitations make "fuzzy matching" unusable in many applications. Customers must then accept serious constraints on the number of records tested, or on the matching window applied, and risk missing matches. Netrics' patented bi-partite graph technology is inherently efficient, and scales linearly with the number of records in the database. The technology has been proven in real-time applications for databases as large as 650 million records. Netrics' embeddable architecture operates independently of existing applications and DBMS. Many customers actually experience a decrease in the computer resources required when inexact matching is shifted from the DBMS (which is natively optimized for exact matching) onto the Netrics Matching Engine.
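To make the cost of conventional edit-distance matching concrete, the following minimal sketch implements the textbook Levenshtein distance by dynamic programming, together with a naive lookup that compares a query against every record. It is not the Netrics algorithm, which the source does not describe. The function names and sample records are hypothetical. In this textbook formulation each comparison costs time proportional to the product of the two string lengths, and the naive lookup repeats that work for every record in the file.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]                           # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b)  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

def fuzzy_lookup(query, records, max_distance=2):
    """Naive scan: compare the query with every record, ignoring case."""
    q = query.lower()
    return [r for r in records if edit_distance(q, r.lower()) <= max_distance]

# Hypothetical records; "Jonh Smith" contains a transposition typo.
print(fuzzy_lookup("Jonh Smith", ["John Smith", "Mary Jones"]))
```

The `max_distance` parameter plays the role of the matching window mentioned above: widening it finds more candidate matches but also admits more false ones.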
Multi core, multi-CPU servers and multi-server cluster exploitation
The Netrics Matching Engine is fully multi-threaded and stateless, and exploits multi-core CPUs, multi-CPU servers and clustered server configurations. Front-end load balancing and failover appliances such as F5 and Citrix NetScaler are supported out-of-the-box.

3.3.2.4 Business terms, conditions and pricing

Type of licensing (perpetual/lease/SaaS etc)
Pricing for the Netrics Matching Engine and Netrics Decision Engine is simple and straightforward. The one-time perpetual software license fee is based on the number of CPU cores needed to run the required data matching and decision making workload.

Maintenance costs
Maintenance costs are usually 20% of the total software license fee, charged annually.

3.3.3 SAS

3.3.3.1 Software/Tools/Capabilities

SAS/DataFlux Matching (default) engine
The SAS/DataFlux Matching (default) engine has been designed to enable the identification of duplicate records both within a single data source and across multiple sources. The rules-based matching engine uses a combination of parsing rules, standardization rules, phonetic matching, and token-based weighting to strip the ambiguity out of source information. After applying hundreds of thousands of rules to each and every field, the engine outputs a “match key”, an accurate representation of all versions of the same data generated at any point in time. A “sensitivity” setting allows the user to define the closeness of the match in order to support both high and low confidence match sets.

SAS Matching (optional)
The SAS Data Quality Solution also supports probabilistic methods for record linkage, such as the Fellegi and Sunter method.

Implementation characteristics
The SAS/DataFlux methodology supports the entire data integration life cycle through an integrated phased approach. These phases include data profiling, data quality, data integration, data enrichment, and data monitoring.
The methodology may be implemented as an ongoing process to control the quality of information being loaded into target information systems.

Matching characteristics
Deterministic (mainly DataFlux) or probabilistic (covered by SAS).

Manual Assistance
SAS/DataFlux supports both manual review and automatic consolidation modes. This gives the flexibility to manually review the consolidation rules for every group of duplicate records or to consolidate the records automatically without a review process.

Preprocessing/Standardization

Standardization
The DataFlux engine supports advanced standardization (mapping) routines that include element standardization, phrase standardization and pattern standardization. Each standardization approach supports the ability to eliminate semantic differences found in source data, including multiple spellings of the same information, multiple patterns of the same data, or the translation of inventory codes to product descriptions. The DataFlux engine can either replace the original field with the new value or append the standardized value directly to the source record. Alternatively, the standardized values can be written to a new database table or to text files.

Phrase Standardization
Phrase standardization is the process of modifying original source data from some value to a common value. For example, the company name “GM” is translated, based on standardization rules, from the original value to “General Motors”. Phrase standardization involves the creation of standardization rules that include both the original value and the desired value. These synonyms, or mapping rules, can be automatically derived by the data profiling engine, or they can be imported from other corporate information sources.

Pattern Standardization
Pattern standardization includes the decomposition of source data into atomic elements such as first name, middle name, and last name.
After identifying the atomic elements, pattern standardization reassembles the tokens into a common format.

Element Standardization
Element standardization involves the mapping of specific elements or words within a field to a new, normalized value. For example, the engine is able to isolate a specific word such as “1st” and then replace that single word with a standardized word such as “First”.

Automated Rule Building
The DataFlux solution is the only data quality solution that is able to automatically build standardization rules based on analysis of source data during the profiling phase. This analysis includes the ability to group similar fields based on content and then derive from the content the value to which all occurrences should be mapped. The DataFlux engine uses data quality matching algorithms during this process in order to identify inconsistent and non-standard data. In addition, the business user can dial the automated process up or down to meet specific domain needs.

3.3.3.2 Integration options

Computing Infrastructure
The SAS/DataFlux Data Quality solution is an integrated data management suite that allows users to inspect, correct, integrate, enhance and control data across the enterprise. The proposed solution architecture comprises the following components:
Data Quality Client: A user-friendly, intuitive interface that enables quick, easy and low-maintenance implementation.
Batch Data Quality Server: A high-performance, customisable engine that forms the core of the Data Quality product. This engine can process any type of data via native and direct data connections.
The data quality server is able to access all data sources via native data access engines to every major database and file structure, using the most powerful data transformation language available.
Quality Knowledge Base: All processing components of the SAS data quality solution utilise the same repository of pre-defined and user-defined business rules.
Real-time Data Quality Server: Processes real-time data quality tasks via the Application Program Interface (API), allowing SAS data quality technology to be embedded within the customer’s front-end operational systems.
Meta Integration Model Bridge: Enables access to non-CWM-compliant metadata by allowing users to import, share and exchange metadata from design tools.
Database, Files: Read and write requests are translated into the appropriate call for the specific DBMS or file structure.

All the industry standards in data access are supported. They include:
ODBC
OLE DB for OLAP
MDX
XML/A
This ensures that inter-system data access will be possible for all storage technologies in the future.

The SAS DQ product includes the ability to natively source data from, and output data to, various databases and data files. The supported formats include:
Flat files
Microsoft SQL Server (including Analysis Services via OLE DB)
Oracle
IBM DB2
Teradata

The SAS/ACCESS Interface to Teradata supports two means of integration:
Generation of the required SQL, which can be edited and tuned as required, or
User-created SQL that could include Teradata-specific functionality.

The standards supported include:
ODBC
JDBC
XML
OLE DB

Software employed
SAS, DataFlux dfPower Studio

3.3.3.3 Performance/Throughput

Performance Benchmarks
The following benchmarks were generated using a delimited text file as the input, and the jobs were executed on a machine with these specifications: Windows 2003 Server, Dual Pentium Xeon 3.0 GHz, 4.0 GB RAM.
Process                                   1MM Records   5MM Records
Generate Name Match Keys                  0:02:54       0:08:43
Generate Name Match Keys and Cluster      0:03:01       0:09:14
Profiling – 1 column (all metrics)        0:00:23       0:02:54
Profiling – 10 columns (all metrics)      0:03:06       0:10:15

3.3.3.4 Business terms, conditions and pricing

SAS Business model
The SAS Institute business model is based on a commitment to build long-term relationships with customers. SAS products are licensed on a one-year basis. License fees include software installation, maintenance, upgrades, documentation and phone technical support. SAS Institute already has a long-term relationship with many Statistical Offices in Europe, especially in data integration and analytical tools.

4 Documentation, literature and references

4.1 Bibliography for Section 1

Barateiro J., Galhardas H. (2005) A Survey of Data Quality Tools. Datenbank-Spektrum 14: 15-21.
Batini C., Scannapieco M. (2006) Data Quality: Concepts, Methods, and Techniques, Springer.
Bertsekas D.P. (1992) Auction Algorithms for Network Flow Problems: A Tutorial Introduction. Computational Optimization and Applications, vol. 1, pp. 7-66.
Christen P., Churches T. (2005a) A Probabilistic Deduplication, Record Linkage and Geocoding System. In Proceedings of the ARC Health Data Mining workshop, University of South Australia.
Christen P., Churches T. (2005b) Febrl: Freely extensible biomedical record linkage Manual. Release 0.3 edition, Computer Science Technical Report no. TR-CS-02-05, Department of Computer Science, FEIT, Australian National University, Canberra.
DATAFLUX: http://www.dataflux.com/
Day C. (1997) A checklist for evaluating record linkage software, in Alvey W. and Jamerson B. (eds.) (1997) Record Linkage Techniques, Washington, DC: Federal Committee on Statistical Methodology.
Dempster A.P., Laird N.M., Rubin D.B. (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, pp. 1-38.
FEBRL: http://sourceforge.net/projects/febrl
Fellegi I.P., Sunter A.B. (1969) A theory for record linkage. Journal of the American Statistical Association, Volume 64, pp. 1183-1210.
Gartner (2007) Magic Quadrant for Data Quality Tools 2007, Gartner RAS Core Research Note G00149359, June 2007.
Gu L., Baxter R., Vickers D., Rainsford C. (2003) Record linkage: Current practice and future directions. Technical Report 03/83, CSIRO Mathematical and Information Sciences, Canberra, Australia.
Herzog T.N. (2004) Playing With Matches: Applications of Record Linkage Techniques and Other Data Quality Procedures. SOA 2004 New York Annual Meeting - 18TS.
Herzog T.N., Scheuren F.J., Winkler W.E. (2007) Data Quality and Record Linkage Techniques. Springer Science+Business Media, New York.
Hernandez M., Stolfo S. (1995) The Merge/Purge Problem for Large Databases. In Proc. of 1995 ACM SIGMOD Conf., pages 127–138.
Karr A.F., Sanil A.P., Banks D.L. (2005) Data Quality: A Statistical Perspective. NISS Technical Report Number 151, March.
Koudas N., Sarawagi S., Srivastava D. (2006) Record linkage: similarity measures and algorithms. SIGMOD Conference 2006, pp. 802-803.
LINKAGEWIZ: http://www.linkagewiz.com
LINKKING: http://www.the-link-king.com
LINKPLUS: http://www.cdc.gov/cancer/npcr
Naumann F. (2004) Informationsintegration. Antrittsvorlesung am Tag der Informatik. University of Potsdam - Hasso-Plattner-Institut, Potsdam. Date Accessed: 02/5/2008, www.hpi.uni-potsdam.de/fileadmin/hpi/FG_Naumann/publications/Antrittsvorlesung.pdf
Rabiner L.R. (1989) A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, vol. 77, no. 2, Feb. 1989.
RELAIS: http://www.istat.it/strumenti/metodi/software/analisi_dati/relais/
Shah G., Fatima F., McBride S. A Critical Assessment of Record Linkage Software Used in Public Health.
Available at: http://nahdo.org/CS/files/folders/124/download.aspx
TRILLIUM: Trillium, Trillium Software System for data warehousing and ERP. Trillium Software Webpage. Date Accessed: 01/5/2008, http://www.trilliumsoft.com/products.htm, 2000.
Tuoto T., Cibella N., Fortini M., Scannapieco M., Tosco T. (2007) RELAIS: Don't Get Lost in a Record Linkage Project, Proc. of the Federal Committee on Statistical Methodologies (FCSM 2007) Research Conference, Arlington, VA, USA.

4.2 Bibliography for Section 2

D’Orazio M., Di Zio M., Scanu M., 2006. Statistical Matching: Theory and Practice. Wiley, Chichester.
Kadane J.B., 1978. Some statistical problems in merging data files. 1978 Compendium of Tax Research, U.S. Department of the Treasury, 159-171. Reprinted in the Journal of Official Statistics, 2001, Vol. 17, pp. 423-433.
Moriarity C., 2000. Doctoral dissertation submitted to The George Washington University, Washington DC. Available on the website http://www.proquest.com
Moriarity C., and Scheuren, F. (2001). Statistical matching: paradigm for assessing the uncertainty in the procedure. Journal of Official Statistics, 17, 407-422.
Raessler S. (2002). Statistical Matching: a Frequentist Theory, Practical Applications and Alternative Bayesian Approaches. Springer Verlag, New York.
Raessler S. (2003). A Non-Iterative Bayesian Approach to Statistical Matching. Statistica Neerlandica, Vol. 57, n.1, 58-74.
Rubin, D.B. (1986). Statistical matching using file concatenation with adjusted weights and multiple imputation. Journal of Business and Economic Statistics, Vol. 4, 87-94.
Sacco, G., 2008. SAMWIN: a software for statistical matching. Available on the CENEX-ISAD webpage (http://cenex-isad.istat.it), go to public area/documents/technical reports and documentation.
Schafer, J.L. (1997), Analysis of incomplete multivariate data, Chapman and Hall, London.
Schafer, J.L.
(1999), Multiple imputation under a normal model, Version 2, software for Windows 95/98/NT, available from http://www.stat.psu.edu/jls/misoftwa.html.
Van Buuren, S. and K. Oudshoorn (1999), Flexible multivariate imputation by MICE, TNO Report G/VGZ/99.054, Leiden.
Van Buuren, S. and C.G.M. Oudshoorn (2000), Multivariate imputation by chained equations, TNO Report PG/VGZ/00.038, Leiden.

4.3 Bibliography for Section 3

Cohen, Ravikumar, Fienberg (2003), A Comparison of String Metrics for Matching Names and Records, American Association for Artificial Intelligence (2003)
Franch X., Carvallo J.P. (2003) Using Quality Models in Software Package Selection, IEEE Software, vol. 20, no. 1, pp. 34-41.
Friedman, Ted, Bitterer Andreas, Gartner (2007). Magic Quadrant for Data Quality Tools 2007, Gartner RAS Core Research Note G00149359, June 2007.
Haworth, Marta F., Martin Jean, (2000), Delivering and Measuring Data Quality in UK National Statistics, Office for National Statistics, UK
Hoffmeyer-Zlotnik, Juergen H.P., Data Harmonization, Network of Economic & Social Science Infrastructure in Europe, Luxemburg 2004
Leicester Gill, Methods for Automatic Record Matching and Linkage and their Use in National Statistics, Oxford University, 2001
National Statistics Code of Practice, Protocol on Data Matching, 2004
Record Linkage Software, User Documentation, 2001
Russom, Philip, Complex Data: A new Challenge for Data Integration, (November 2007), TDWI Research Data Warehousing Institute
Schumacher, Scott, 2008 Data Management Review and SourceMedia, Inc. DM Review Special Report, January 2007, Probabilistic Versus Deterministic Data Matching: Making an Accurate Decision
U.S. BUREAU OF THE CENSUS
Winkler, William E., Overview of Record Linkage and Current Research Directions, Research Report Series (Statistics #2006-2), Statistical Research Division, U.S. Census Bureau, Washington DC