
ESSnet Statistical Methodology Project on Integration
of Survey and Administrative Data
Report of WP3. Software tools for integration methodologies
LIST OF CONTENTS
Preface ... III
1. Software tools for record linkage (Monica Scannapieco – Istat) ... 1
1.1. Comparison criteria for record linkage software tools (Monica Scannapieco – Istat) ... 2
1.2. Probabilistic tools for record linkage ... 4
1.2.1. AutoMatch (Nicoletta Cibella – Istat) ... 4
1.2.2. Febrl (Miguel Guigo – INE) ... 5
1.2.3. GRLS (Nicoletta Cibella – Istat) ... 6
1.2.4. LinkageWiz (Monica Scannapieco – Istat) ... 7
1.2.5. RELAIS (Monica Scannapieco – Istat) ... 7
1.2.6. DataFlux (Monica Scannapieco – Istat) ... 8
1.2.7. Link King (Marco Fortini – Istat) ... 10
1.2.8. Trillium Software (Miguel Guigo – INE) ... 10
1.2.9. Link Plus (Tiziana Tuoto – Istat) ... 12
1.3. Summary tables and comparisons ... 14
1.3.1. General Features ... 14
1.3.2. Strengths and weaknesses ... 16
2. Software tools for statistical matching (Mauro Scanu – Istat) ... 17
2.1. Comparison criteria for statistical matching software tools (Mauro Scanu – Istat) ... 17
2.2. Statistical matching tools ... 18
2.2.1. SAMWIN (Mauro Scanu – Istat) ... 19
2.2.2. R codes (Marcello D’Orazio – Istat) ... 19
2.2.3. SAS codes (Mauro Scanu – Istat) ... 20
2.2.4. S-Plus codes (Marco Di Zio – Istat) ... 20
2.3. Comparison tables ... 22
3. Commercial software tools for data quality and record linkage in the process of microintegration (Jaroslav Kraus and Ondřej Vozár – CZSO) ... 24
3.1. Data quality standardization requirements ... 24
3.2. Data quality assessment ... 24
3.3. Summary tables Oracle, Netrics and SAS/Data Flux ... 29
3.3.1. Oracle ... 29
3.3.2. Netrics ... 32
3.3.3. SAS ... 35
4. Documentation, literature and references ... 38
4.1. Bibliography for Section 1 ... 38
4.2. Bibliography for Section 2 ... 39
4.3. Bibliography for Section 3 ... 40
Preface
This document is the deliverable of the third work package (WP) of the Centre of Excellence on
Statistical Methodology. The objective of this WP is to review some existing software tools for the
application of probabilistic record linkage and statistical matching methods.
The document is organized in three chapters.
The first chapter is on software tools for record linkage. On the basis of the underlying research
paradigm, three major categories of record linkage tools can be identified:
 Tools for probabilistic record linkage, mostly based on the Fellegi and Sunter model (Fellegi I. P., Sunter A. B. (1969), “A theory for record linkage”, Journal of the American Statistical Association, 64, 1183-1210).
 Tools for empirical record linkage, which are mainly focused on performance issues and hence on reducing the search space of the record linkage problem by means of algorithmic techniques such as sorting, tree traversal, neighbour comparison, and pruning.
 Tools for knowledge-based linkage, in which domain knowledge is extracted from the files involved and reasoning strategies are applied to make the decision process more effective.
In such a variety of proposals, this document restricts its attention to the record linkage tools that
have the following characteristics:
 They have been explicitly developed for record linkage;
 They are based on a probabilistic paradigm.
Two sets of comparison criteria were used for comparing several probabilistic record linkage tools.
The first one considers general characteristics of the software: cost of the software; domain
specificity (i.e. the tool can be developed ad-hoc for a specific type of data and applications);
maturity (or level of adoption, i.e. frequency of usage - where available - and number of years the
tool is around). The second set considers which functionalities are performed by the tool:
preprocessing/standardization; profiling; comparison functions; decision method.
Chapter 2 deals with software tools for statistical matching. Software solutions for statistical
matching are not as widespread as in the case of record linkage, because statistical matching
projects are still quite rare in practice. Almost all the applications are conducted by means of ad hoc
codes. Sometimes, when the objective is micro, it is possible to use general purpose imputation
software tools. On the other hand, if the objective is macro, it is possible to adopt general statistical
analysis tools which are able to deal with missing data.
In this chapter, the available tools explicitly devoted to statistical matching purposes are reviewed.
Only one of them (SAMWIN) is a software package that can be used without any programming
skills, while the others are software codes that can be used only by those with knowledge of the
corresponding language (R, S-Plus, SAS) as well as a sound knowledge of statistical methodology.
The criteria used for comparing the software tools for statistical matching were slightly different
from those for record linkage. The attention is restricted to costs, domain specificity and maturity of
the software tool. As far as the software functionalities are concerned, the focus is on: i) the
inclusion of pre-processing and standardization tools; ii) the capacity to create a complete and
synthetic data set by the fusion of the two data sources to integrate; iii) the capacity to estimate
parameters on the joint distribution of variables never jointly observed; iv) the assumptions on the
model of the variables of interest under which the software tool works (the best known is the
conditional independence assumption of the variables not jointly observed given the common
variables in the two data sources); v) the presence of any quality assessment of the results.
Furthermore, the software tools are compared according to the implemented methodologies.
Strengths and weaknesses of each software tool are highlighted at the end.
Chapter 3 focuses on commercial software tools for data quality and record linkage in the process of
microintegration. The vendors in the data quality market are often classified according to their overall
position in the IT business, where focus on specific business knowledge and experience in a specific
business domain plays an important role. The quality of vendors and of their products on the market is
characterized by: i) product features and relevant services; ii) vendor characteristics, domain
business understanding, business strategy, creativity, innovation; iii) sales characteristics, licensing,
prices; iv) customer experience, reference projects; v) data quality tools and frameworks.
The software vendors of tools in the statistics-oriented “data quality market” propose solutions
addressing all the tasks in the entire life cycle of data-oriented management programs and
projects: data preparation, survey data collection, improvement of quality and integrity, setting up
reports and studies, etc.
According to the software/application category, the tools to perform or support data-oriented
record linkage projects in statistics should have several common characteristics:
1) portability, being able to function with statistical researchers' current arrangement of
computer systems and languages,
2) flexibility in handling different linkage strategies, and
3) low operational expenses in terms of TCO (Total Cost of Ownership) parameters, in both
computing time and researchers' effort.
In this chapter the evaluation focused on three commercial software packages, which according to
the data quality scoring position in Gartner reports (the so called “magic quadrants” available on the
web page http://www.gartner.com) belong to important vendors in this area. The three vendors are:
Oracle (representing the convergence of tools and services in the software market), SAS/DataFlux
(a data quality, data integration and BI (Business Intelligence) player on the market), and Netrics
(which offers advanced technology complementing the Oracle data quality and integration tools).
The set of comparison tables was prepared according to the following structure: linkage
methodology, data management, post-linkage function, standardization, costs and empirical testing,
case studies and examples.
1 Software Tools for Record Linkage
Monica Scannapieco (Istat)
The state of the art of record linkage tools includes several proposals coming from private
companies but also, in large part, from public organizations and from universities.
Another interesting feature of such tools is related to the fact that some record linkage activities are
performed “within” other tools. For instance, there are several data cleaning tools that include
record linkage (see Barateiro and Galhardas, 2005 for a survey), but they are mainly dedicated to
standardization, consistency checks, etc. A second example is provided by the recent efforts of
major database management systems’ vendors (like Microsoft and Oracle) that are going to include
record linkage functionalities for data stored in relational databases (Koudas et al., 2006).
On the basis of the underlying research paradigm, three major categories of tools for record linkage
can be identified (Batini and Scannapieco, 2006):
1. Tools for probabilistic record linkage, mostly based on the Fellegi and Sunter model (Fellegi and
Sunter, 1969).
2. Tools for empirical record linkage, which are mainly focused on performance issues and hence
on reducing the search space of the record matching problem by means of algorithmic techniques
such as sorting, tree traversal, neighbour comparison, and pruning.
3. Tools for knowledge-based linkage, in which domain knowledge is extracted from the files
involved, and reasoning strategies are applied to make the decision process more effective.
In such a variety of proposals, in this document we concentrate on record linkage tools that have the
following characteristics:
 they have been explicitly developed for record linkage
 they are based on a probabilistic paradigm.
In the following, we first illustrate a set of comparison criteria (Section 1.1) that will be used for
comparing several probabilistic record linkage tools. In Section 1.2, we provide a general
description of the selected tools, while in Section 1.3 we present several comparison tables that
show the most important features of each tool.
Let us first list the probabilistic record linkage tools that have been selected among the most well-known and adopted ones:
1. AutoMatch, developed at the US Bureau of the Census, now under the purview of IBM [Herzog
et al. 2007, chap.19].
2. Febrl - Freely Extensible Biomedical Record Linkage, developed at the Australian National
University [FEBRL].
3. Generalized Record Linkage System (GRLS), developed at Statistics Canada [Herzog et al.
2007, chap.19].
4. LinkageWiz, commercial software [LINKAGEWIZ].
5. RELAIS, developed at ISTAT [RELAIS].
6. DataFlux, commercialized by SAS [DATAFLUX].
7. The Link King, commercial software [LINKKING].
8. Trillium, commercial software [TRILLIUM].
9. Link Plus, developed at the U.S. Centers for Disease Control and Prevention (CDC), Cancer
Division [LINKPLUS].
For each of the above cited tools, in the following Section 1.2 we provide a general description.
1.1 Comparison criteria for record linkage software tools
Monica Scannapieco (Istat)
In this section we describe the criteria to compare the probabilistic record linkage tools with respect
to several general features. Such criteria will be reported in some tables, whose detail is provided in
Section 1.3.
A first table will take into account the following characteristics:
 Free/Commercial refers to the possibility of having the tool for free or not. The set of possible answers is shown in Figure 1.
 Domain Specificity refers to the fact that the tool can be developed ad-hoc for a specific type of data and applications. For Domain Specificity, the set of answers is shown in Figure 2.
 Maturity (Level of Adoption) is related to the frequency of usage (where available) and to the number of years the tool has been around. For Maturity, we use a HIGH/MEDIUM/LOW rating scale. In order to assign the rates, we take into account the following factors: (i) the frequency of usage (Shah et al.); (ii) the number of years since the tool was first proposed.
Figure 1: Free/commercial possible answers. Free: source code available; source code not available. Commercial: cost less than $5000; cost between $5000 and $9900; cost more than $9900.
Figure 2: Domain specificity possible answers. Only specific domain; no specific domain (generalized system); mixed (specific domain + general features).
A second table will consider which functionalities are performed by the tool, on the basis of a
reference set of functionalities, listed in the following.
Preprocessing/Standardization
Data can be recorded in different formats and some items may be missing or affected by inconsistencies or
errors. The key job of this functionality is to convert the input data into a well-defined format,
resolving the inconsistencies in order to reduce misclassification errors in the subsequent phases of
the record linkage process. In this activity null strings are removed; abbreviations, punctuation
marks, upper/lower cases, etc. are cleaned; and any necessary transformation is carried out in order
to standardize variables. Furthermore, spelling variations of common words are replaced with a
standard spelling. A parsing procedure that divides a free-form field into a set of strings could be
applied, and a schema reconciliation can be performed to avoid possible conflicts (i.e. description,
semantic and structural conflicts) among data source schemas in order to have standardized data
fields. Geocoding, a standardization task especially conceived for name and address data,
transforms data variables by assigning geographic identifiers or postal standards such as postal ZIP
codes or official street addresses.
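As a purely illustrative sketch of this kind of standardization step (not taken from any of the tools reviewed below), the following Python fragment lowercases a field, strips punctuation, expands a small invented substitution table of abbreviations and naively parses a free-form name; real systems use much richer, domain-specific rule sets.

import re

# Hypothetical substitution table mapping abbreviations and spelling variants
# to a standard form (illustrative entries only).
SUBSTITUTIONS = {"st": "street", "rd": "road", "jr": "junior", "wm": "william"}

def standardize(value: str) -> str:
    """Normalize case, strip punctuation and expand known abbreviations."""
    tokens = re.split(r"[\s,;]+", value.strip().lower())
    tokens = [re.sub(r"[^\w]", "", t) for t in tokens]        # drop punctuation
    tokens = [SUBSTITUTIONS.get(t, t) for t in tokens if t]   # expand abbreviations
    return " ".join(tokens)

def parse_name(free_form: str) -> dict:
    """Very naive parsing of a free-form name into first/last components."""
    parts = standardize(free_form).split()
    return {"first_name": parts[0] if parts else "",
            "last_name": parts[-1] if len(parts) > 1 else ""}

if __name__ == "__main__":
    print(standardize("  WM. Smith Jr, 12 Main St. "))  # william smith junior 12 main street
    print(parse_name("John  SMITH"))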
Profiling
An important phase of a record linkage process is the choice of appropriate matching variables that
have to be as suitable as possible for the linking process considered. The matching attributes are
generally chosen by a domain expert, hence this phase is typically not automatic but the choice can
be supported by some further information that can be automatically computed. Such information is
the result of a profiling activity that provides quality measures, metadata description and simple
statistics on the distribution of variables which give hints on how to choose the set of matching
variables.
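To make the idea of automatically computed profiling information more concrete, here is a minimal Python sketch that computes two illustrative indicators for a candidate matching variable: completeness and a rough identification power. The definitions are simplified assumptions made for the example and do not reproduce the metrics of any specific tool.

from typing import Optional, Sequence

def completeness(values: Sequence[Optional[str]]) -> float:
    """Share of records with a non-missing, non-empty value."""
    filled = sum(1 for v in values if v not in (None, ""))
    return filled / len(values) if values else 0.0

def identification_power(values: Sequence[Optional[str]]) -> float:
    """Share of distinct values among the non-missing ones: values close to 1
    suggest the variable discriminates well between units."""
    filled = [v for v in values if v not in (None, "")]
    return len(set(filled)) / len(filled) if filled else 0.0

if __name__ == "__main__":
    surnames = ["rossi", "bianchi", "rossi", None, "verdi"]
    print(f"completeness={completeness(surnames):.2f}, "
          f"identification power={identification_power(surnames):.2f}")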
WP3
3
Comparison Functions
Record linkage tools can provide support for different comparison functions. Some of the most
common comparison functions are equality, edit distance, Jaro, Hamming distance, Smith-Waterman, TF-IDF, etc. (see Koudas et al., 2006 for a survey).
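As a concrete illustration, the fragment below gives plain textbook Python implementations of two of the comparison functions mentioned above, the Hamming distance and the Levenshtein edit distance; it is only a sketch, not the code of any of the reviewed packages.

def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(previous[j] + 1,              # deletion
                               current[j - 1] + 1,           # insertion
                               previous[j - 1] + (x != y)))  # substitution
        previous = current
    return previous[-1]

if __name__ == "__main__":
    print(hamming("jones", "janes"))            # 1
    print(levenshtein("jonathan", "jonhatan"))  # 2 (a transposed pair counts as two edits)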
Search Space Reduction
In a linking process of two datasets, say A and B, the pairs that need to be classified as matches, non-matches and possible matches are those in the cross product A x B. When dealing with large
datasets, comparing the matching variables over all such pairs is almost impracticable. Several techniques
based on sorting, filtering, clustering and indexing may be used to reduce the search space.
Blocking and sorted neighbourhood are among the most common ones.
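The following minimal Python sketch illustrates standard blocking, the simplest of these techniques: candidate pairs are generated only for records sharing the same value of a blocking key (a postcode in this invented example), instead of the full cross product.

from collections import defaultdict

def block_candidates(file_a, file_b, key):
    """Standard blocking: compare only pairs that share the same blocking key.

    file_a and file_b are lists of dicts; key is a function mapping a record to
    its blocking value (e.g. a postcode or a phonetic code of the surname).
    """
    blocks_b = defaultdict(list)
    for rec in file_b:
        blocks_b[key(rec)].append(rec)
    for rec_a in file_a:
        for rec_b in blocks_b.get(key(rec_a), []):
            yield rec_a, rec_b

if __name__ == "__main__":
    # Hypothetical toy files with a 'zip' variable used as blocking key.
    A = [{"id": 1, "zip": "00100", "surname": "rossi"},
         {"id": 2, "zip": "20100", "surname": "bianchi"}]
    B = [{"id": 7, "zip": "00100", "surname": "rosi"},
         {"id": 9, "zip": "30100", "surname": "verdi"}]
    pairs = list(block_candidates(A, B, key=lambda r: r["zip"]))
    print(len(pairs), "candidate pairs instead of", len(A) * len(B))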
Decision Method
The core of the record linkage process is the choice of the decision model. A record linkage tool can
provide several decision rules in order to decide the status of match, non-match, or possible match of
record pairs. For instance, it can provide support for the Fellegi and Sunter rule, but also for a
deterministic (threshold-based) rule.
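As an illustration of a threshold-based decision rule in the Fellegi and Sunter style, the sketch below sums log-likelihood ratios over a comparison vector and classifies the pair; the m- and u-probabilities and the thresholds are invented values for the example, not parameters of any real application.

import math

# Illustrative m- and u-probabilities per matching variable (invented values):
# m = P(agreement | true match), u = P(agreement | non-match).
M_PROB = {"surname": 0.95, "first_name": 0.90, "birth_year": 0.85}
U_PROB = {"surname": 0.01, "first_name": 0.05, "birth_year": 0.10}

def composite_weight(agreement: dict) -> float:
    """Sum of log-likelihood ratios over the comparison vector."""
    weight = 0.0
    for var, agrees in agreement.items():
        m, u = M_PROB[var], U_PROB[var]
        weight += math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))
    return weight

def classify(agreement: dict, upper: float = 8.0, lower: float = 0.0) -> str:
    """Threshold-based decision: match, possible match or non-match."""
    w = composite_weight(agreement)
    if w >= upper:
        return "match"
    if w <= lower:
        return "non-match"
    return "possible match"

if __name__ == "__main__":
    print(classify({"surname": True, "first_name": True, "birth_year": False}))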
In Section 1.3, two further tables, namely for estimation methods and strengths and weaknesses of
the tools, will be described.
1.2 Probabilistic tools for record linkage
1.2.1 AutoMatch
Nicoletta Cibella (Istat)
AutoMatch implements the Fellegi Sunter record linkage theory for matching records. The software
can be used for matching records both within a list and between different data sources. In 1998,
Vality (www.vality.com) acquired the AutoMatch tool, which became part of INTEGRITY. The
software includes many matching algorithms.
General characteristics - AutoMatch is a commercial software package, under the purview of IBM. The
software was created by Matthew Jaro and today it is a collection of different algorithms aimed at
performing probabilistic record linkage; some of these algorithms (NYSIIS, SOUNDEX) use
codes suitable for English words. The current version of the software is user-friendly and seems
to follow the same strategy as a human being in matching records referring to the same entity
(Herzog et al., 2007).
Tools included in the software - AutoMatch performs the record linkage procedure through
different steps in which the matching variables, the thresholds and also the blocking variables can be
changed. Firstly, the variables in the data files are processed: the pre-processing phase
(standardization, parsing) transforms variables so as to obtain a standard format. Not only phonetic
codes (NYSIIS) are used, but also spelling variations and abbreviations in text strings are considered.
The software links records using the Fellegi and Sunter theory, enriched with a frequency
analysis methodology in order to discriminate weight score values.
Methodology - The u-probabilities, in AutoMatch, are calculated from the number of occurrences
of the values of each matching variable in the dataset being matched; the m-probabilities can
be iteratively estimated from current linkages.
Strength and weakness of the software - The pre-processing phase in AutoMatch is very
important and well developed: it uses several tools so as to have variables in a standard format; also,
a way to divide possible pairs into blocks is implemented. The documentation provided is rich
and the matching parameters are estimated automatically. However, some algorithms work only with English
words and the error rate estimation phase is still not performed.
1.2.2 Febrl
Miguel Guigò (INE - Spain)
The Freely Extensible Biomedical Record Linkage (Febrl) development was funded in 2001 by the
Australian National University (ANU) and the NSW Department of Health, with additional funding
provided by the Australian Partnership for Advanced Computing (APAC). Febrl is a record linkage
system which aims at supporting not only matching algorithms but also methods for large scale data
cleansing, standardisation, blocking and geocoding, as well as a probabilistic data set generator in
order to perform a wide variety of tests and empirical comparisons for record linkage procedures.
General characteristics - Febrl has been written in the object-oriented open source language
Python (http://www.python.org) and it is available from the project web
page (http://sourceforge.net/projects/febrl). Due to this, Febrl allows the easy
implementation of additional and improved record linkage techniques or algorithms, as well as
different comparison tests.
Tools included in the software - Febrl supports a special procedure for probabilistic data
standardisation and cleansing with the aim of solving inconsistencies in key variables such as names
and addresses. This procedure is based on hidden Markov models (HMMs). One HMM is used for
names and one for addresses. The complete pre-processing task then consists of three steps: 1) the
user input records are cleaned, converting all letters to lowercase, removing certain characters and
replacing various substrings with their canonical form (based on user-specified and domain specific
substitution tables); 2) the cleaned strings are split into a list of words, numbers and characters, to
which one or more tags are assigned; 3) the list of tags is given to an HMM (either name or
address), where the most likely path is found by means of the Viterbi algorithm (see Rabiner, 1989).
For some other key variables such as dates and telephone numbers, the package also contains rule-based standardisation methods.
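The third step above relies on the Viterbi algorithm to recover the most likely sequence of hidden states (name or address components) given the observed tag sequence. The generic sketch below is a self-contained toy illustration of that algorithm; the states, tags and probabilities are invented and are not Febrl's actual HMMs.

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely hidden state path for a sequence of observed tags."""
    # V[t][s] = probability of the best path ending in state s at position t.
    V = [{s: start_p[s] * emit_p[s].get(observations[0], 0.0) for s in states}]
    path = {s: [s] for s in states}
    for obs in observations[1:]:
        V.append({})
        new_path = {}
        for s in states:
            prob, prev = max(
                (V[-2][p] * trans_p[p].get(s, 0.0) * emit_p[s].get(obs, 0.0), p)
                for p in states)
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(states, key=lambda s: V[-1][s])
    return path[best]

if __name__ == "__main__":
    # Toy model: hidden states are name components, observations are the tags
    # produced by step 2 (all numbers invented for the example).
    states = ["givenname", "surname", "title"]
    start_p = {"givenname": 0.35, "surname": 0.05, "title": 0.60}
    trans_p = {"title": {"givenname": 0.9, "surname": 0.1},
               "givenname": {"surname": 0.9, "givenname": 0.1},
               "surname": {"surname": 1.0}}
    emit_p = {"title": {"TI": 0.95, "UN": 0.05},
              "givenname": {"GF": 0.7, "UN": 0.3},
              "surname": {"SN": 0.6, "UN": 0.4}}
    print(viterbi(["TI", "GF", "SN"], states, start_p, trans_p, emit_p))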
The search space reduction process is performed by means of blocking. Febrl (version 0.2) implements
three different blocking methods: standard blocking, sorted neighbourhood, and a special
procedure for fuzzy blocking, known as Bigram Indexing (see the section on blocking procedures in the
report of WP1).
A wide variety of comparison functions is available in order to obtain the corresponding vector of
matching weights. Christen and Churches (2005a) includes a table with 11 field (or attribute)
comparison functions for names, addresses, dates and localities that are available: Exact string,
Truncated string, Approximate string, Encoded string, Keying difference, Numeric percentage,
Numeric absolute, Date, Age, Time, and Distance. Version 0.3 has added two new approximate
string comparison methods.
Geocoding procedures have been implemented in Febrl version 0.3, with a system based on the
Australian Geocoded National Address File (GNAF) database.
Methodology - Febrl performs probabilistic record linkage based on the Fellegi and Sunter
approach to calculate the matching decision, and classifies record pairs as either a link, non-link, or
possible link. The package, however, adds a flexible classifier that allows a different definition and
calculation of the final matching weight using several functions. This is done in order to alleviate
the limitations due to the conditional independence assumption that is associated with the Fellegi-Sunter model. Finally, Febrl uses the Auction algorithm (Bertsekas, 1992) to achieve an optimal one-to-one assignment of linked record pairs.
Strength and weakness of the software - Febrl's outstanding feature is its software
architecture based on open-source code. Both the Python language and the Febrl package are completely
free of charge; also, the latter can be used as an integrated application that covers the complete
linkage process, rather than a tool designed strictly for the record linkage step, since additional
routines for preprocessing and geocoding have been implemented. A comprehensive manual
(Christen and Churches, 2005b) is available for the 0.3 release. The package, however, does
not offer the same additional features that other commercial software supports, such as a wide
variety of profiling or enrichment tools, or the ability to work in languages other than English.
1.2.3 GRLS (Generalized Record Linkage System)
Nicoletta Cibella (Istat)
Generalized Record Linkage System (GRLS) is a commercial software package, developed at Statistics
Canada, performing probabilistic record linkage. The system was designed especially to solve health
and business problems; it is part of a set of generalized systems and it is based on statistical decision
theory. The software aims at linking records within a file itself or between different data sources,
particularly when there is a lack of unique identifiers. It was designed to be used with ORACLE
databases.
General characteristics - GRLS is marketed by Statistics Canada and it is now at its 4.0 release; it
works in a client-server environment with a C compiler and ORACLE. The NYSIIS (New York
State Identification and Intelligent Systems) and Russell SOUNDEX phonetic codes, specific to the
English language, are incorporated in GRLS. Statistics Canada usually organizes two-day training
courses to present the software (with information and bilingual documentation) and to enable users
to use it, so as to facilitate the software's level of adoption, even if the software development still
focuses on Statistics Canada users.
Tools included in the software - GRLS decomposes the whole record linkage procedure into three
main passes (Fair, 2004):
 definition of the search space: creation of all linkable pairs on the basis of the initial criteria
set by the user;
 application of a decision rule: the possible pairs are divided into the matched, possibly
matched and unmatched sets;
 the grouping phase, in which the linked and possible pairs involving the same entity are
grouped together and the final groups are formed.
The NYSIIS (New York State Identification and Intelligent Systems) and SOUNDEX phonetic
codes incorporated in GRLS, as stated above, and some other tools (e.g. the postal code conversion
file, PCCF) available within Statistics Canada facilitate the pre-processing of the data files;
strings like names and surnames or addresses are parsed into their components and the free format is
converted into a standard one. The system also supplies a suitable framework to test the
parameters of the linkage procedure.
Methodology - The core of GRLS, as probabilistic record linkage software, is the mathematical theory
of Fellegi and Sunter. The software decomposes the whole matching process into its constituent
phases.
Strength and weakness of the software - GRLS has many strengths compared to other record
linkage software. The pre-processing phase can be considered well developed within the Statistics
Canada system. The documentation is rich and bilingual: both French and English versions are
available, and a special mention is deserved by the training course at Statistics Canada. The software
can run on a workstation or a PC, supporting UNIX. On the other hand, the software is not free and
the error estimation phase is not implemented.
1.2.4 LinkageWiz
Monica Scannapieco (Istat)
LinkageWiz is a software package dedicated to record linkage. It allows linking records from separate data
sources or identifying duplicate records within a single data source. It is based on several
probabilistic linkage techniques. Data can be imported from a wide range of desktop and corporate
database systems. It offers a comprehensive range of data cleansing functions.
LinkageWiz uses an intuitive graphical user interface and does not require a high level of expertise
to operate. It is a standalone product and does not require the use of separate programs.
General characteristics - LinkageWiz has a quite low price, which is clearly stated on the software
website (http://www.linkagewiz.com) for the different versions of the software that can be
purchased. It has support for name matching with some specific phonetic codes like NYSIIS
(New York State Identification and Intelligent Systems) and SOUNDEX, which are specific to the
English language. The software is currently at version 5.0, so it is quite stable. The level of
adoption is medium: the software web site indicates about 20 clients.
Tools included in the software - LinkageWiz imports data from several formats. Furthermore, it
allows the standardization of addresses and business names (English and French), various conversions of
characters (e.g. accented characters for European languages), removal of spaces and unwanted
punctuation, and a few other preprocessing functionalities. Therefore, the preprocessing phase is
quite well supported by the tool. No specific profiling functionality is mentioned in the
software documentation. Comparison functions are also well supported, with specific attention
dedicated to the matching of English names. A search space reduction method is not explicitly
mentioned.
Methodology – As far as the supported methodology is concerned, there is no precise information about it,
beyond the fact that “sophisticated probabilistic techniques” are implemented for the matching decision.
Strength and weakness of the software - LinkageWiz has good support for preprocessing.
However, in the tool's documentation there is no explicit mention of the possibility of customizing
preprocessing rules (which is instead allowed by DataFlux, see below). Performance seems
to be a strength of the tool, which seems to work quite well even in a PC environment. Among the
negative points, there is the limited number of functionalities covered by the tool
(the profiling and search space reduction phases seem to be completely missing). Moreover, as already
mentioned, no detail at all is provided on the implemented probabilistic method, which forces
any user to trust the tool's decisions from a “black box” perspective.
1.2.5 RELAIS (Record Linkage At Istat)
Monica Scannapieco (Istat)
RELAIS (REcord Linkage At IStat) (Tuoto et al. 2007, RELAIS) is a toolkit that permits the
construction of record linkage workflows. The inspiring principle is to allow combining the most
convenient techniques for each of the record linkage phases and also to provide a library of patterns
that could support the definition of the most appropriate workflow, in both cases taking into account
the specific features of the data and the requirements of the current application. In such a way, the
toolkit not only provides a set of different techniques for each phase of the linkage problem, but it
can also be seen as a compass to solve the linkage problem as well as possible given the problem
constraints. In addition, RELAIS aims at joining the statistical and computational
essences of the record linkage problem.
General characteristics - One of the inspiring principles of RELAIS as a toolkit is to re-use the
solutions already available for record linkage in the scientific community and to capitalize on the
experience gained in different fields. According to this principle, RELAIS is being carried out as
an open source project, with full availability of the source code as well.
All the algorithms have been developed to be domain independent.
RELAIS started in 2006 and is at the 1.0 release, hence the level of adoption is quite low.
Tools included in the software - RELAIS provides just basic pre-processing functionalities. The
data profiling activity includes the evaluation of the following metadata: variable completeness,
identification power, accuracy, internal consistency, and consistency. All these metadata can be
evaluated for each variable, and are merged together into a quality vector associated with the variable
itself. A ranking of the quality vectors is performed in order to suggest which variable is more
suitable for blocking or matching. RELAIS permits the choice of the comparison function to use
within a set of predefined ones.
The search space reduction phase can be performed by two methods, namely blocking and sorted
neighbourhood. The first method partitions the search space according to a chosen blocking variable
and allows conducting the linkage on each block independently. The sorted neighbourhood method
limits the actual comparisons to those records that fall within a window sliding over an ordered list of
the records to compare.
With respect to the decision model choice, RELAIS implements the Fellegi-Sunter probabilistic
model, using the EM algorithm for the estimation of the model parameters. The output is a many-to-many linkage of the dataset records. Starting from this output, RELAIS allows the user to
construct clusters of matches, non-matches and possible matches. Alternatively, a subsequent
phase of reduction from the many-to-many linkage to a one-to-one linkage can be performed.
Methodology – As described, RELAIS implements the Fellegi-Sunter probabilistic model and the
estimation of the model’s parameters is realized by means of the EM method. It assumes the
conditional independence of the matching variables. A latent class model is hypothesized in which
the latent variable is the matching status. A deterministic decision rule is currently under
development.
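A minimal sketch of this kind of EM estimation is shown below: starting from arbitrary values, the E-step computes the posterior probability that each compared pair is a match, and the M-step updates the match proportion and the m- and u-probabilities under conditional independence. The starting values, the fixed number of iterations and the toy data are illustrative assumptions, not RELAIS code.

import math

def em_fellegi_sunter(gamma, p=0.1, m=None, u=None, n_iter=50):
    """EM estimation of m- and u-probabilities under conditional independence.

    gamma: list of comparison vectors, each a tuple of 0/1 agreement indicators.
    p: starting value for the overall match proportion.
    Returns estimated (p, m, u), where m[k] = P(agreement on variable k | match)
    and u[k] = P(agreement on variable k | non-match).  The starting values and
    the stopping rule are simplistic; real implementations add convergence checks.
    """
    K = len(gamma[0])
    m = list(m) if m else [0.9] * K
    u = list(u) if u else [0.1] * K
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        g = []
        for vec in gamma:
            pm = p * math.prod(m[k] if vec[k] else 1 - m[k] for k in range(K))
            pu = (1 - p) * math.prod(u[k] if vec[k] else 1 - u[k] for k in range(K))
            g.append(pm / (pm + pu))
        # M-step: update p, m and u from the expected match indicators.
        total = sum(g)
        p = total / len(gamma)
        for k in range(K):
            m[k] = sum(gi for gi, vec in zip(g, gamma) if vec[k]) / total
            u[k] = sum(1 - gi for gi, vec in zip(g, gamma) if vec[k]) / (len(gamma) - total)
    return p, m, u

if __name__ == "__main__":
    # Tiny artificial set of comparison vectors (surname, first name, birth year).
    comparisons = [(1, 1, 1)] * 5 + [(1, 1, 0)] * 2 + [(1, 0, 0)] * 13 + [(0, 0, 0)] * 80
    p_hat, m_hat, u_hat = em_fellegi_sunter(comparisons)
    print(round(p_hat, 3), [round(x, 2) for x in m_hat], [round(x, 2) for x in u_hat])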
Strength and weakness of the software - One strength of RELAIS is the fact that it is open and free
of charge. It also provides good support for all the record linkage functionalities, as shown in
Table 2. A further positive aspect is the flexibility offered by the possibility of combining
different techniques to produce ad-hoc record linkage workflows. This is perhaps the most
distinctive aspect of the tool.
RELAIS is, however, at quite an early stage of adoption, being only at its first release.
1.2.6 DataFlux
Monica Scannapieco (Istat)
DataFlux is a SAS company which has recently been classified as a leader among data
quality providers by the Gartner Group (Gartner, 2007). Their offering is a comprehensive solution for
data quality including:
o data profiling, to determine discrepancies and inaccuracies in data;
o deduplication enabled by record linkage;
o data cleansing, to eliminate or reduce data inconsistencies; and
o parsing, standardization and matching, which allow creating or enhancing the rules used to parse
parts of names, addresses, email addresses, product codes or other business data.
General characteristics - DataFlux is a commercial software product. It is not easy to identify the
actual price of the product, as it comes in suites with several other modules. Though it is quite difficult to
isolate the specific offer for record linkage in the set of proposed data quality solutions, the
basic price is not high. Moreover, the tool is not domain specific but gives instead the possibility of
customizing part of its functionalities to domain-specific features. For instance, the standardization
rules can be customized to the specific type of business data at hand (like, for instance, addresses).
As a further example, in matching two strings it is possible to specify which part of a data string
weighs the heaviest in the match. The level of adoption is high. Indeed, DataFlux is recognized as a
leading provider with an established market presence, significant size and multi-national presence
(Gartner, 2007).
Tools included in the software - DataFlux allows different types of preprocessing: breaking compound names into parts (e.g., first name, last name, middle name), removing prefixes, suffixes and titles, etc.
The tool principally makes use of parsing functions for this purpose.
A profiling activity is present in the SAS-DQ solution. This activity provides an interface to
determine areas of poor data quality and the amount of effort required to rectify them. However, the
profiling activity does not appear to be integrated within the record linkage process, but is rather a
standalone activity.
The usage of different comparison functions is also possible. DataFlux performs approximate
matching (called “fuzzy matching”) between strings corresponding to the same fields. The sensitivity of
the string matching result is computed on the basis of how many characters are used for the
approximate matching.
A specific functionality dedicated to the search space reduction task is not explicitly mentioned in
the documentation.
The decision rule applied by DataFlux is deterministic. When performing approximate matching, a
matching code is determined as the degree of matching of the compared strings. In the deterministic
decision rules, the matching codes can be combined in different ways in order to have rules for match
or non-match at the record level.
Methodology - As described before, a deterministic decision rule is implemented.
Strength and weakness of the software - Preprocessing in DataFlux is quite rich. Standardization
for several business data types is well supported, including name parts, address parts, e-mail
addresses and even free-form text values, thus ensuring a great level of flexibility. There is also the
possibility of applying address standardizations based on local standards. The decision method in
DataFlux is instead among the weak points of the solution. In particular, the proposed method is a
quite trivial deterministic solution.
1.2.7 The Link King
Marco Fortini (Istat)
The Link King is a SAS/AF application for use in the linkage and deduplication of administrative
datasets which incorporates both probabilistic and deterministic record linkage protocols.
The record linkage protocol was adapted from the algorithm developed by MEDSTAT for the
Substance Abuse and Mental Health Services Administration’s (SAMHSA) Integrated Database
Project. The deterministic record linkage protocols were developed at Washington State’s Division
of Alcohol and Substance Abuse for use in a variety of evaluation and research projects.
The Link King’s graphical user interface (GUI), equipped with easy-to-follow instructions, assists
beginning and advanced users in record linkage and deduplication tasks. An artificial intelligence
helps in the selection of the most appropriate linkage/deduplication protocol. The Link King
requires a base SAS license but no SAS programming experience.
General characteristics - The Link King is a free macro for SAS which needs a SAS licence to be
used. It is oriented toward epidemiologic applications and it works only with files referring to
people. Moreover, it is oriented to a US audience because the Social Security Number (SSN) is
included among the recommended key variables. It incorporates a variety of user-specified options
for blocking and linkage decisions, and a powerful interface for manual review of “uncertain”
linkages. It also implements “phonetic equivalence” or “spelling distance” as a means to identify
misspelled names. Having arrived at its 6th release, it can be considered to have a high level of adoption.
Tools included in the software - The Link King can import data from the most popular formats but
does not implement sophisticated preprocessing features. Artificial intelligence methods are
implemented to ensure that appropriate linking protocols are used. Missing or improper value
identification features are provided in order to automatically recognize values that have
little discriminating power. Various comparison functions are implemented so as to properly evaluate
names and surnames. Blocking features are implemented with heuristics for identifying the most
suitable blocking scheme.
Methodology - Both probabilistic and deterministic schemes are implemented in The Link King,
allowing for either dichotomous or partial agreement between key variables. The weight
calculations are achieved by means of an ad hoc iterative procedure, described in the technical
documentation, which does not make use of the standard EM procedure. Though it seems quite
reasonable, this procedure is, in our opinion, not sufficiently clear in terms of its hypotheses and
theoretical justification.
Strength and weakness of the software - The Link King is a fairly usable software tool with good
compatibility with different file formats, flexibility of usage and a graphical user interface that
aims to help non-expert users conduct a linkage project. Moreover, since it has been developed
on the basis of research projects, it is a well-documented and free-code tool. Among its major
drawbacks are a non-standard estimation technique for the m and u weights within the
probabilistic framework and its specific applicability to problems that regard the linkage of
files of people. Another weakness is the need for a SAS licence.
1.2.8 Trillium Software
Miguel Guigò (INE - Spain)
Trillium Software is a commercial set of tools, developed by Harte-Hanks Inc. (http://www.harte-hanks.com) for carrying out integrated database management functions, so it covers the overall data
integration life cycle. It consists of a generalized system for data profiling, standardizing, linking,
enriching and monitoring. Although this application relies heavily on the so-called Data Quality
(DQ) improvement procedures, some of the tasks performed unmistakably belong to the
complete record linkage process. The record linkage algorithm itself performs probabilistic
matching, though based on a method different from the Fellegi-Sunter or Sorted-Neighbourhood
algorithms (Herzog, 2004; Herzog et al., 2007; Naumann, 2004).
The general purpose of DQ is to ensure data consistency, completeness, validity, accuracy and
timeliness, in order to guarantee that the data are fit for their use in decision-making procedures;
nevertheless, the statistical point of view does not always coincide with other approaches (see Karr et
al., 2005), more specifically with those based on marketing strategies (which often establish the
guidelines of this sort of programme) or on computer science, though it can overlap with them. Section 3
of this report deals more extensively with DQ issues.
General characteristics - The Trillium Software System (TSS) was first developed in the summer of 1989,
using the experience previously acquired by Harte-Hanks in the realm of data management after a
package for processing urban data was purchased by the firm in 1979. The software was completed
in 1992 and from the first moment it was conceived for performing integrated mailing list functions,
especially managing mailing address and name lists in the banking business. Several improvements
have been made over the years, such as Unicode support (1998), a Java GUI (1999), Com+ JNI interfaces
(2000), CRM/ETL/ERP integration (2001), and the purchase of Avellino Technologies (2004). At
present (2007), Trillium Software System version 11 has been completely developed.
Concerning the level of adoption: from 1992 on, it has been used by companies in the following
sectors: financial services, insurance, hospitality and travel, retail, manufacturing, automotive and
transport, health care and pharmacy, telecom, cable, and IT. Finally, with respect to the public
sector, several US Government institutions use this package for data integration purposes. A case
study on a computer system called Insight, based on TSS linking software, can be found in
Herzog (2004) for a worldwide-known logistics and courier company (FedEx). The system permits
business customers to go online to obtain up-to-date information on all of their cargo,
including outgoing, incoming, and third-party shipments.
Concerning domain specificity, one of the most valuable package features, strongly emphasized
by the vendors, is its ability to carry out data integration and DQ tasks for almost any
language or country. These are classified into four different levels (basic, average, robust, and
mature) depending on the complexity and degree of development assessed by the standardization rules
of the TSS, and on the level of knowledge and experience added for the corresponding geographical
area. Furthermore, the TS Quality module is supposed to be able to process all data types across all
data domains (product, financial, asset, etc.), although it obviously refers to business activities; and
it is clearly oriented to name and address parsing and de-duplication.
Tools included in the software - Trillium comprises the modules called TS Discovery (the module
focused on data profiling), TS Quality (which carries out most of the DQ tasks), and
TS Enrichment, in addition to some further technologies that support integrating data quality
into various environments.
TS Discovery provides an environment for profiling activities within and across systems, files and
databases. Its main skill is to uncover previously untracked information in records that is latent due
to data entry errors, omissions, inconsistencies, unrecognised formats, etc, in order to ensure the
data are correct in the sense that they will fulfil the requirements of the integration project. This
application performs modelling functions including key integrity and key analyses, join analysis,
physical data models, dependency models and Venn diagrams that identify outlier records and
orphans.
TS Quality is the rule-based DQ engine that processes data to ensure that it meets established
standards, as defined by an organization. This module performs cleansing and standardization,
although some preliminary cleansing, repair and standardization can be carried out directly from the
profiling application, that is, the TS Discovery module itself. The DQ module also contains
procedures to parse data such as dates, names and worldwide addresses, with some capability for
metadata-based correction and validation against authoritative sources, in order to improve and
refine the linkage process. Suspected duplicate records can then be flagged in advance.
This application includes the record linkage and de-duplication engine, which identifies
relationships among records through an automated and rule-based process. This is sold “out-of-the-box” (Day, 1997), in the sense that a set of ready-to-use rules is available, although users can also
apply their own customized options and rules. The engine finds the connections among records
within single files, across different files, and against master databases or external or third-party
sources.
Some special features on how the linked data set is built (see Section 2.10 of WP2 report) are highly
remarkable. Each step of the matching process, once the rules are defined, is recorded through an
audit trail, in order to generate reports of the actions and trace changes made. Therefore, every
change is appended to the original data, maintaining the meaning of each value across datasets,
even as they are shared by different departments with different purposes. In this way, important
distinctions among variables in different contexts are not lost.
TS Enrichment is the name of the module focused on data enhancement through the use of third-party vendors. The system can append geographical, census, corporate, and other information from
5000 different sources. This includes postal geocoding. For added safety, appended data does not
overwrite source data, but is placed in new fields that can be shared instantly with other users and
applications.
Methodology – Trillium performs probabilistic matching, though based on a method different from
Fellegi-Sunter. No further details are available on the specific method implemented.
Strength and weakness of the software - A core feature is its specific design for managing
mailing address and name lists. All the functions related to the integration (and DQ) process
strongly rely on its power to carry out profiling and monitoring tasks and to apply out-of-the-box
rules closely related to the firm's marketing experience, together with the knowledge of local
experts depending on the country.
A highly remarkable strength of the application is its user-friendly environment, designed to be usable
even by users who are not experts in linkage or statistics and have not been previously trained. The Trillium Software
System is also prepared to run across a wide range of platforms.
On the other hand, the program has been conceived to manage customer and product
datasets (with identifiers and key variables such as brands, models, catalogue numbers,
etc.), which rarely fit the purposes and structure of the datasets that producers of official statistics
intend to use.
1.2.9 Link Plus
Tiziana Tuoto (Istat)
Link Plus is a probabilistic record linkage program recently developed at the U.S. Centers for
Disease Control and Prevention (CDC), Cancer Division. Link Plus was written as a linkage tool for
cancer registries, in support of CDC's National Program of Cancer Registries. It can be run in two
modes: to detect duplicates in a cancer registry database, or to link a cancer registry file against
external files. Although Link Plus has been designed with reference to cancer registry databases and
records, the program can be used with any type of data in fixed-width or delimited format.
Essentially it performs probabilistic record linkage based on the theoretical framework developed
by Fellegi and Sunter and uses the EM algorithm (Dempster et al., 1977) to estimate the parameters of
the model proposed by Fellegi and Sunter.
General characteristics - Link Plus is free software, but its source code is not available. It has
been designed especially for cancer registry work; however, it can be used for linking any data
referring to people. For instance, it allows dealing in depth with variables like names, surnames, dates,
social security numbers and other characteristics typical of individuals in a hospital context;
on the other hand, it does not provide specific functionalities for dealing with addresses or with
characteristics outside the hospital context or typical of the enterprise framework. The level of
adoption is high, particularly in public health organizations.
Tools included in the software - Link Plus does not provide any pre-processing step: it simply allows
dealing with delimited and fixed-width files. Regarding the profiling activity, explicit data
profiling is not provided, but some data quality measurements can be included implicitly in the
linkage process, thanks to the fact that variable indicators, like the number of different categories
and the frequency of each category, can be taken into account in the estimation of the linking
probabilities.
Link Plus provides several comparison functions: exact matching; the Jaro-Winkler metric for names and
surnames; a function specific to the Social Security Number, which incorporates partial matching to
account for typographical errors and transposition of digits; a function specific to dates, which
incorporates partial matching to account for missing month and/or day values and also
checks for transpositions; and a Generic String method, incorporating partial matching to account for
typographical errors, which uses an edit distance function (the Levenshtein distance) to compute the
similarity of two long strings.
Regarding the Search Space Reduction phase, Link Plus provides a simple blocking mechanism (“OR
blocking”): up to 5 variables can be indexed for blocking, and the pairs with identical values on at
least one of those variables are compared. As far as conventional “AND blocks” and multi-pass
blocking systems are concerned, Link Plus runs a simplified version of multiple passes
simultaneously. When blocking variables involve strings (e.g. names and surnames), Link Plus
offers a choice of two phonetic coding systems (Soundex and NYSIIS) in order to compare
strings based on how they are pronounced.
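A minimal sketch of this “OR blocking” idea (not Link Plus's actual implementation) is shown below: candidate pairs are the union of the pairs agreeing on at least one of the chosen blocking variables, with duplicates removed.

from collections import defaultdict

def or_blocking_pairs(file_a, file_b, blocking_vars):
    """Candidate pairs that agree on at least one of the blocking variables."""
    candidates = set()
    for var in blocking_vars:
        index_b = defaultdict(list)
        for j, rec in enumerate(file_b):
            if rec.get(var):
                index_b[rec[var]].append(j)
        for i, rec in enumerate(file_a):
            for j in index_b.get(rec.get(var), []):
                candidates.add((i, j))  # the set removes duplicates across variables
    return candidates

if __name__ == "__main__":
    A = [{"soundex": "S530", "birth_year": "1960"},
         {"soundex": "B652", "birth_year": "1981"}]
    B = [{"soundex": "S530", "birth_year": "1981"},
         {"soundex": "V300", "birth_year": "1975"}]
    print(sorted(or_blocking_pairs(A, B, ["soundex", "birth_year"])))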
As far as the Decision Method is concerned, the Link Plus manual declares the implementation of the
probabilistic decision model, referring to the Fellegi-Sunter theory.
Methodology - Regarding the model probability estimation, Link Plus provides two options: the
first one, called the “Direct Method”, allows using the default M-probabilities or user-defined
M-probabilities. By default, the M-probabilities are derived from the frequencies of the matching
variables in the first file at hand; however, Link Plus provides an option that allows using the
frequencies from the 2000 US Census data or the 2000 US National Death Index data, for instance for
names. As the second option, Link Plus computes the M-probabilities from the data at hand using the
EM algorithm.
The explanations given in the manual about the computation of the M-probabilities in the Direct
Method leave some perplexity about the probabilistic approach. The authors specify that “the M-probability measures the reliability of each data item. A value of 0 means the data item is totally
unreliable (0%) and a value of 1 means that the data item is completely reliable (100%).
Reasonable values differ from 0.9 (90% reliable) to 0.9999 (99.99% reliable). To compute the
default M-probabilities, Link Plus uses the data in File 1 to generate the frequencies of last names
and first names and then computes the weights for last name and first name based on the
frequencies of their values.” These procedures seem closer to the deterministic approach than to the
probabilistic one, even if the authors clearly specify that this method is not deterministic.
Anyway, the authors recommend the Direct Method for initial linkage runs because the Direct
Method is robust and it consumes roughly half of the CPU time needed to have Link Plus compute
the M-probabilities. However, they underline that using the EM algorithm may improve results,
because the computed M-probabilities are likely to be more reflective of the true probabilities, since
they are computed by capturing and utilizing information dynamically from the actual data
being linked, especially when the files are large and the selected matching variables provide
sufficient information to identify potential linked pairs.
The assumptions of the probability model, like conditional independence or others, are not specified in
the manual.
Strength and weakness of the software - The main strength of Link Plus is the fact that it is free of
charge. This is maybe one of the reasons for its widespread use, in particular in public health
organizations. A further positive aspect is the availability of a good user guide that provides step-by-step instructions making the software easy to use, even if it does not give deep explanations of some
methodological points. Another advantage of Link Plus is its ability to manipulate
large data files.
On the other hand, Link Plus is specifically set up to work with cancer data, so that some difficulties
arise in using it for other applications. Moreover, due to the lack of pre-processing functionalities, it
aborts when encountering certain nonprinting characters in the input data.
1.3 Summary Tables and Comparisons
In the following Section 1.3.1, three comparison tables are presented and described with the aim of
summarizing and pointing out the principal features of each tool so far described.
In Section 1.3.2, a critical analysis of strengths and weaknesses of each tool is described in detail.
1.3.1 General Features
In Table 1, we report the selected values for the characteristics specified above for each of the
analyzed tools.
Table 1: Main features
Tool | Free/Commercial | Domain Specificity | Level of Adoption
AUTOMATCH | commercial | functionalities for English words | high
FEBRL | free / source code available | no specific domain | medium
GRLS | commercial (government) | functionalities for English words | medium
LINKAGEWIZ | commercial / less than $5000 | mixed / functionalities for English dataset | medium
RELAIS | free / source code available | no specific domain | low
DATAFLUX | commercial / less than $5000 | no specific domain | high
THE LINK KING | free / source code available (SAS licence is needed) | mixed / requires first and last names, date of birth | high
TRILLIUM | commercial | relatively general features, specialized per language/country | medium
LINK PLUS | free / source code not available | mixed / general features | high
Table 2: Comparison of the functionalities of the record linkage tools
Tool | Preprocessing | Profiling | Comparison Functions | Search Space Reduction | Decision Method
AUTOMATCH | yes | not specified | yes | yes | Probabilistic
FEBRL | yes | yes | yes | yes | Probabilistic
GRLS | yes | not specified | yes | yes | Probabilistic
LINKAGEWIZ | yes | no | yes | not specified | Probabilistic
DATAFLUX | yes | yes | yes | not specified | Deterministic
RELAIS | yes | yes | yes | yes | Probabilistic + Deterministic (under development)
TRILLIUM | yes | yes | yes | not specified | Probabilistic + Deterministic
THE LINK KING | no | yes | yes | yes | Probabilistic + Deterministic
LINK PLUS | no | no | yes | yes | Probabilistic
In Table 3 we show the details on the specific method used for the estimation of the Fellegi and
Sunter parameters, for those tools that implement the probabilistic Fellegi and Sunter rule (i.e. all
the software tools but DataFlux and Trillium).
Table 3: Estimation methods implemented in the record linkage tools
Tool | Fellegi-Sunter estimation technique
AUTOMATCH | Parameter estimation via frequency-based matching
FEBRL | Parameter estimation via the EM algorithm
GRLS | Parameter estimation under agreement/disagreement patterns
LINKAGEWIZ | No details are provided
RELAIS | EM method; conditional independence assumption of the matching variables
THE LINK KING | Ad hoc weight estimation method; theoretical hypotheses not very clear
LINK PLUS | Default or user-defined M-probabilities, or EM algorithm
1.3.2 Strengths and Weaknesses
In this section, we describe a fourth table (Table 4) with strengths and weaknesses of the identified
tools.
Table 4: Strengths and Weaknesses of the record linkage software tools
AUTOMATCH
Strengths: good documentation; user-friendly; preprocessing; automatic matching parameter estimation.
Weaknesses: no error rate estimation; specific for the English language; not free.

FEBRL
Strengths: free and open-source; preprocessing (standardization, geocoding); good documentation.
Weaknesses: not comprehensive profiling tools; specific for the English language.

GRLS
Strengths: good documentation; free training course at Statistics Canada; preprocessing; performance.
Weaknesses: not free; no error rate estimation; specific for the English language.

LINKAGEWIZ
Strengths: preprocessing; performance.
Weaknesses: poor coverage of the record linkage functionalities; decision method not documented.

RELAIS
Strengths: free and open; good support of the record linkage functionalities; toolkit flexibility.
Weaknesses: low adoption.

DATAFLUX
Strengths: preprocessing; profiling; standardization; monitoring.
Weaknesses: decision method.

THE LINK KING
Strengths: user friendly; flexible; tools for manual reviews; open code.
Weaknesses: non-standard estimation of the probability weights; a SAS license is necessary; specific to the linkage of files of people.

TRILLIUM
Strengths: preprocessing (profiling, standardization, geocoding); data enrichment and monitoring; user-friendly interface; ability to work across datasets and systems.
Weaknesses: not free; specific for managing mailing address and name lists; algorithms for probabilistic record linkage are not well defined.

LINK PLUS
Strengths: free availability; high adoption; user friendly; good user guide with step-by-step instructions; ability to manipulate large data files.
Weaknesses: specifically set up to work with cancer data, with some difficulties in using it for other applications; sensitive to nonprinting characters in input data.
2 Software Tools for Statistical Matching
Mauro Scanu (Istat)
Software solutions for statistical matching are not as widespread as in the case of record linkage.
Almost all the applications are conducted by means of ad hoc codes. Sometimes, when the objective
is micro (the creation of a complete synthetic data set with all the variables of interest) it is possible
to use general purpose imputation software tools. On the other hand, if the objective is macro (the
estimation of a parameter on the joint distribution of a couple of variables which are not observed
jointly), it is possible to adopt general statistical analysis tools which are able to deal with data sets
affected by missing items.
In this section we review the available tools explicitly devoted to statistical matching purposes. Just one of them (SAMWIN) is a software tool that can be used without any programming skills. The others are software codes that can be used only by those with knowledge of the corresponding language (R, S-Plus, SAS).
2.1 Comparison criteria for statistical matching software tools
Mauro Scanu (Istat)
The criteria used for comparing the software tools for statistical matching are slightly different from
those for record linkage. In fact, the applications are not as widespread as in the case of record
linkage.
At the beginning, general features of the software tools are reviewed. These include the following
issues.
1. Is the software free or commercial?
2. Is the software built for a specific experiment or not (domain specificity)?
3. Is the software mature (number of years of use)?
As a matter of fact, these issues are very similar to those asked for the record linkage software tools.
For the functionalities installed in the software tools, the focus was on
1. the inclusion of preprocessing and standardization tools
2. the capacity to create a complete and synthetic data set by the fusion of the two data sources
to integrate
3. the capacity to estimate parameters on the joint distribution of variables never jointly
observed (i.e. one variable observed in the first data source and the second variable on the
second data source)
4. the assumptions on the model of the variables of interest under which the software tool works (the best known is the conditional independence assumption of the variables not jointly observed given the common variables in the two data sources)
5. the presence of any quality assessment of the results
Furthermore, the software tools are compared according to the implemented methodologies.
Strengths and weaknesses of each software tool are highlighted at the end.
2.2 Statistical Matching Tools
2.2.1 SAMWIN
Mauro Scanu (Istat)
This software was produced by Giuseppe Sacco (Istat) for the statistical matching application of the
social accounting matrix. The main purpose of this software tool is to create a complete synthetic
data set using hot-deck imputation procedures. It is an executable file written in C++ and easy to use with Windows facilities.
Free/commercial: The software is free. To obtain it, write to Giuseppe Sacco (sacco@istat.it). It consists of an executable file. The source code is not available.
Domain specificity: although the software was built for a specific experiment (the construction of
the social accounting matrix), it is not domain specific.
Maturity: low. It was used for specific experiments in the Italian Statistical Institute and abroad.
Frequency of usage: three main applications
Number of years of usage: 3
Functionalities
- Preprocessing and standardization: the input files must be already harmonized. No standardization tools are included.
- Creation of a complete synthetic data file: the output of the software is a complete synthetic data file. The roles of the samples (i.e. recipient and donor files) must be decided in advance.
- Estimation of specific parameters (e.g. regression coefficients, parameters of a contingency table) for the joint distribution of variables observed in distinct samples: the software does not estimate single parameters.
- Model assumptions: this software can be used only under the conditional independence assumption. It can also be used if there is auxiliary information in terms of a third data set with all the variables of interest jointly observed.
- Quality evaluation of the results: the software includes some quality indicators (number of items to impute in a file, number of possible donors for each record) of the data sets to match. It does not include a quality evaluation of the output.
- Other: it is possible to use this software for imputing a single data set affected by general patterns of missing values (either by hot deck or cold deck).
- It includes functionalities for simulations (useful for the evaluation of statistical matching procedures in a simulated framework).
The instructions on how to use SAMWIN are available in Sacco (2008).
Methodologies implemented in the software
The software implements different hot-deck procedures, according to the nature of the matching
variables (continuous, categorical or both).
It is possible to specify clusters (strata) of units (this ensures equality on the stratification variables for both the donor and the recipient records).
It is possible to use different distance functions (Euclidean, Manhattan, Mahalanobis, Chebyshev, Gower).
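As an illustration of the distance hot deck idea, and explicitly not SAMWIN itself (which is distributed only as a compiled C++ executable), the following R sketch imputes the variables observed only in the donor file by taking, within each stratum, the closest donor according to the Euclidean distance on the matching variables; the function and argument names are hypothetical.

```r
## Sketch of a nearest-neighbour distance hot deck within strata (illustrative).
## 'rec' and 'don' are data frames sharing the numeric matching variables
## 'match_vars' and the stratification variable 'stratum'; the variables in
## 'z_vars' are observed only in the donor file 'don'.
distance_hot_deck <- function(rec, don, match_vars, z_vars, stratum) {
  rec[z_vars] <- NA
  for (s in unique(rec[[stratum]])) {
    r_idx <- which(rec[[stratum]] == s)
    d_idx <- which(don[[stratum]] == s)
    if (length(d_idx) == 0) next                 # no donors in this stratum
    for (i in r_idx) {
      # Euclidean distance between recipient i and every donor of the stratum
      d <- sqrt(colSums((t(as.matrix(don[d_idx, match_vars])) -
                         unlist(rec[i, match_vars]))^2))
      rec[i, z_vars] <- don[d_idx[which.min(d)], z_vars]   # take the nearest donor
    }
  }
  rec
}
```

Constrained matching (each donor used at most a fixed number of times) and the other distance functions mentioned above would require only local changes to the distance and donor-selection steps.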
2.2.2 R codes
Marcello D’Orazio (Istat)
This set of software codes written in the R functional programming language was produced by
D’Orazio et al. (2006) and implements some hot deck SM procedures. Moreover, the Moriarity & Scheuren (2001) code is implemented and an extension of it is provided. Code to study SM
uncertainty for categorical variables is introduced (extension of codes from Schafer for multiple
imputation of missing values for categorical variables; Schafer codes are available at
http://www.stat.psu.edu/~jls/misoftwa.html).
Free/commercial: The code is published in the Appendix of D’Orazio, Di Zio and Scanu (2006);
moreover it can be downloaded at site:
http://www.wiley.com//legacy/wileychi/matching/supp/r_code_in_appendix_e.zip
Updates of the code are available on request by contacting Marcello D’Orazio (madorazi@istat.it).
Domain specificity: It is not domain specific.
Maturity: low. It was used for the experiments (simulations) performed by D’Orazio et al. (2006).
Frequency of usage: not available
Number of years of usage: 2
Functionalities
- Preprocessing and standardization: the input files must be already harmonized. No standardization tool is included.
- Output micro: the output of the software can be a complete synthetic data file.
- Output macro: the codes can estimate parameters of a multivariate normal distribution (X, Y and Z must be univariate normal) or ranges of probabilities of events (for categorical variables).
- Model assumptions: hot deck techniques assume the conditional independence assumption. The codes based on the Moriarity and Scheuren work and its extension assume the normality of the data. No model assumptions are required for the evaluation of SM uncertainty in the case of categorical variables.
- Quality evaluation of the results: the software does not include direct or indirect tools to evaluate the accuracy of the estimated parameters.
Methodologies implemented in the software
The software implements the method proposed by Moriarity and Scheuren (2001); moreover, an extension based on ML estimation methods is proposed, which overcomes some limits of the Moriarity and Scheuren (2001) methodology. SM hot deck methods based on nearest-neighbour distances are implemented, and the matching can be constrained or unconstrained. Finally, code for the estimation of uncertainty concerning probabilities of events in the SM situation is presented; in this code it is possible to set linear constraints involving some probabilities of events. This code is based on the EM algorithm as implemented by Schafer (http://www.stat.psu.edu/~jls/misoftwa.html) in order to perform multiple imputation in the presence of missing values for categorical variables.
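To give a feel for what "uncertainty concerning probabilities of events" means, the sketch below computes conditional Fréchet-type bounds for the joint distribution of two categorical variables Y and Z never observed together, given a common variable X. This is only an illustration of the uncertainty idea under assumed inputs; the codes described above rely on the EM algorithm with linear constraints, and the function name here is invented.

```r
## Conditional Frechet bounds for P(Y = y, Z = z) when Y and Z are never
## jointly observed but both are observed with X (illustrative sketch).
## p_x: vector of P(X = x); p_y_given_x and p_z_given_x: matrices with one
## row per category of X and one column per category of Y (resp. Z).
frechet_bounds <- function(p_x, p_y_given_x, p_z_given_x) {
  ny <- ncol(p_y_given_x); nz <- ncol(p_z_given_x)
  low <- up <- matrix(0, ny, nz)
  for (y in seq_len(ny)) {
    for (z in seq_len(nz)) {
      low[y, z] <- sum(p_x * pmax(0, p_y_given_x[, y] + p_z_given_x[, z] - 1))
      up[y, z]  <- sum(p_x * pmin(p_y_given_x[, y], p_z_given_x[, z]))
    }
  }
  list(lower = low, upper = up)   # interval of plausible values for each cell
}
```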
2.2.3 SAS codes
Mauro Scanu (Istat)
This set of software codes written in the SAS language was produced by Christopher Moriarity for
his PhD thesis (available, not for free, on the website http://www.proquest.com). It includes just
two statistical matching methods, those introduced by Kadane (1978) and by Rubin (1986).
Free/commercial: The codes are available only in a PhD thesis, which can be obtained upon request, not for free. They are available only on paper.
Domain specificity: It is not domain specific.
Maturity: low. There is evidence of experiments (simulations) performed by Moriarity for his PhD thesis.
Frequency of usage: not available
Number of years of usage: 8
Functionalities
- Preprocessing and standardization: the input files must be already harmonized. No standardization tool is included.
- Output micro: the output of the software can be a complete synthetic data file.
- Output macro: the codes can estimate single parameters of a multivariate normal distribution (X, Y and Z must be univariate normal).
- Model assumptions: the codes can be used only under the assumption of normality of the data. They have been built for evaluating the different models compatible with the available information from the two surveys (evaluation of the uncertainty for the parameters that cannot be estimated, in this case the correlation coefficient of the never jointly observed variables). As a byproduct they can reproduce estimates under the conditional independence assumption.
- Quality evaluation of the results: the software includes some quality evaluation of the estimated parameters (such as jackknife variance estimates of the regression parameters).
Methodologies implemented in the software
The software implements the method proposed by Kadane (1978) which, given two samples that observe respectively the variables (X,Y) and (X,Z), establishes the minimum and maximum values that the correlation coefficient of Y and Z can assume. The other software code implements the method proposed by Rubin (1986), which tackles the previous problem by means of a multiple imputation method.
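For the simplest case of univariate X, Y and Z, the range evaluated by this kind of method follows from the requirement that the correlation matrix of (X, Y, Z) be positive semi-definite. The lines below are a hedged R illustration of that bound, not Moriarity's SAS code.

```r
## Admissible range for cor(Y, Z) given the estimable cor(X, Y) and cor(X, Z)
## (positive semi-definiteness of the 3x3 correlation matrix); illustrative only.
cor_yz_bounds <- function(r_xy, r_xz) {
  half_width <- sqrt((1 - r_xy^2) * (1 - r_xz^2))
  c(lower = r_xy * r_xz - half_width, upper = r_xy * r_xz + half_width)
}
cor_yz_bounds(0.7, 0.5)   # cor(Y, Z) must lie roughly in [-0.27, 0.97]
```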
2.2.4 S-Plus codes
Marco Di Zio (Istat)
This set of software codes written in the S-Plus language was produced by Susanne Raessler to
compare different multiple imputation techniques in the statistical matching setting.
Free/commercial: The codes are available only in Raessler (2002). They are available only on paper.
Domain specificity: It is not domain specific.
Maturity: low. There is evidence of experiments (simulations) performed by Raessler (2002) for her papers.
Frequency of usage: not available
Number of years of usage: 6
Functionalities
- Preprocessing and standardization: the input files must be already harmonized. No standardization tool is included.
- Output micro: yes.
- Output macro: the codes can estimate single parameters of a multivariate normal distribution (X, Y and Z must be univariate normal).
- Model assumptions: the codes can be used only under the assumption of normality of the data.
- Quality evaluation of the results: no.
Methodologies implemented in the software
The software implements the method proposed by Raessler (2002, 2003) for multiple imputation
with statistical matching.
The methods implemented in the codes are:
• NIBAS, the non-iterative multivariate Bayesian regression model based on multiple imputations
proposed by Raessler, see Raessler (2002, 2003).
• RIEPS is the regression imputation technique discussed in Section 4.4 of Raessler (2002) and
Raessler (2003, Section 2).
• NORM. The data augmentation algorithm assuming the normal model as proposed by Schafer
(1997, 1999) introduced for the general purpose of doing inference in the presence of missing data.
It requires the S-PLUS library NORM.
• MICE. An iterative univariate imputation method proposed by Van Buuren and Oudshoorn (1999,
2000). It requires the library MICE.
The software codes are used to evaluate uncertainty for Gaussian data. They estimate the region of acceptable values for the correlation coefficient of the variables not jointly observed, given the observed data of the samples that must be integrated.
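As a rough present-day illustration of this multiple-imputation route, one can concatenate the two files and inspect how the correlation of the never jointly observed variables varies across completed data sets. The sketch uses the CRAN package mice (a descendant of the MICE library cited above) rather than the original S-Plus code, and the data are simulated for the example.

```r
## Multiple imputation by chained equations on the concatenated files
## (illustrative sketch; not Raessler's original S-Plus code).
library(mice)
set.seed(1)
n <- 500
A <- data.frame(x = rnorm(n)); A$y <- 0.7 * A$x + rnorm(n); A$z <- NA  # file A observes (x, y)
B <- data.frame(x = rnorm(n)); B$z <- 0.5 * B$x + rnorm(n); B$y <- NA  # file B observes (x, z)
ab <- rbind(A[, c("x", "y", "z")], B[, c("x", "y", "z")])

imp <- mice(ab, m = 10, method = "norm", printFlag = FALSE)

# The spread of cor(y, z) across the completed data sets reflects the
# uncertainty about the never jointly observed pair under the normal model.
cor_yz <- sapply(1:10, function(i) { d <- complete(imp, i); cor(d$y, d$z) })
range(cor_yz)
```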
2.3 Comparison tables
This section summarizes the content of the previous section, comparing the characteristics of the
different software tools.
Table 1 – Comparison of the general characteristics of the software tools for statistical matching

Tool | Free/Commercial | Domain Specificity | Level of Adoption
SAMWIN | free; source code not available | no specific domain | low
R codes | free; source code available in D'Orazio et al. (2006); updates available on request | no specific domain | low
SAS codes | source code available in a PhD thesis (available on the webpage www.proquest.com) | no specific domain | low
S-Plus codes | source code available in Raessler (2002) | no specific domain | low
Table 2 – Comparison of the functionalities of the software tools for statistical matching (CIA = conditional independence assumption; AI = auxiliary information; Unc = uncertainty)

Tool | Pre-processing | Output: micro | Output: macro | Model | Quality evaluation | Other
SAMWIN | no | yes | no | CIA: yes; AI: yes; Unc: no | yes for input files, no for output files | imputation of a single sample; it performs simulations; instructions available
R code | no | yes | yes | CIA: yes; AI: yes; Unc: yes | no | micro and macro approaches available; codes for categorical variables
SAS code | no | yes | yes | CIA: yes; AI: yes; Unc: yes | no | only continuous (normal) variables
S-Plus codes | no | yes | yes | CIA: yes; AI: yes; Unc: yes | no | only continuous (normal) variables
Table 3 – Comparison of the methodologies implemented in the software tools for statistical matching

Tool | Implemented methodologies
SAMWIN | Hot-deck methods (distance hot deck with different distance functions)
R code | Hot-deck methods (constrained and unconstrained distance hot deck with various distance functions; random hot deck); estimates based on Moriarity and Scheuren methods and ML estimates; uncertainty estimation for probabilities of events
SAS codes | Estimates based on consistent (not maximum likelihood) methods; application of a not proper multiple imputation procedure
S-Plus codes | Multiple imputation methods
Table 4 – Strengths and weaknesses of the software tools for statistical matching

Tool | Strengths | Weaknesses
SAMWIN | free; easy to use; instructions | low adoption; the output is only micro
R codes | codes available on the internet; codes for categorical variables | R package still missing
SAS codes | written in SAS | low adoption; not easy to use; limited results
S-Plus codes | implementation of Bayesian and proper multiple imputation procedures for statistical matching | low adoption; not easy to use; limited results
3 Commercial software tools for data quality and record
linkage in the process of microintegration
Jaroslav Kraus (CZSO), Ondřej Vozár (CZSO)
3.1 Data Quality Standardization Requirements
Vendors of general and multipurpose tools do not usually publish quality assessments or quality characteristics of the data integration process. In data cleansing, in batch or off-line processing, and in different application domains, the "quality of the output" tends to mean the functional parameters of the application, the speed of the matching, or the quality of the automatically learned intelligence. Assessing and comparing the real quality of a product via the "level of calculated truth" of the data is substantially problematic.
Vendors in the data quality market are often classified according to their overall position in the IT business, where focus on specific business knowledge and experience in a specific business domain plays an important role. The position of vendors and the quality of their products on the market are characterized by:
- Product features and relevant services
- Vendor characteristics, domain business understanding, business strategy, creativity, innovation
- Sales characteristics, licensing, prices
- Customer experience, reference projects
- Data Quality Tools and Frameworks
The record linking characteristics of the products are the focus of this section.
3.2 Data Quality Assessment
Record linkage (the process of assigning/linking together records referring to the same subject, often a person, family, institution, etc., from different data sources) is a substantial task in a wide set of practical methods for searching, comparing and grouping records.
The software vendors in the statistics-oriented "data quality market" propose solutions addressing all the tasks in the entire life cycle of data-oriented management programmes and projects, starting with data preparation and survey data collection, through improving quality and integrity, to setting up reports and studies.
The vendor proposals are usually classified according to their position in a concrete segment of informatics and according to their strength and advantage in the IT business, where the focus on specific business knowledge and experience in a specific business domain plays an important role.
The declared quality of vendors and their products on the market is characterized by:
- Product features and relevant services
- Vendor characteristics, domain business understanding, business strategy, creativity, innovation
- Sales characteristics, licensing, prices
- Customer experience, reference projects
It is not easy to publish comparable quality assessments and characteristics of the quality of a data integration process. In data quality management, in cleansing, in large batch processing, in interactive on-line or off-line use, and in different specific domains of data and applications, the meaning of the quality of a tool depends on the data quality of the input and, similarly, of the output. In practice, the "value of the quality" sounds more like the functional parameters of the application, often describing what it can do and what it cannot do.
Non-functional parameters, such as the time performance of the matching, its behaviour and capacity, as well as the automatically learned knowledge, are difficult to measure in comparable units. Even if the core application algorithms work only at the syntactic level of the programmed quality evaluation (as, for example, exact matching does), the physical data storage, data access methods, technology and other programming inventions crucially influence the measurable quality of the results (speed, accuracy, etc.). With comprehensive semantics-driven operations, and even more with a pragmatic attitude to the observation of the data, the "intellectual" quality of the tools influences the computing processes and parameters dramatically (National Statistics Code of Practice).
Comparing the real quality of software products therefore depends on the assessment of the data itself, on physically measurable parameters of the computing processes, and on the belief and trust in the "level of calculated truth" and its correct interpretation in real-world practice, all in relation to a measurable estimation of the risk attached to actions or decisions that depend on the quality of the data applied in a business process.
The absence of structured descriptions of the quality of the features and of the fulfillment of requirements makes the selection of the most appropriate software package difficult.
Some authors, for example Franch et al. (2003), discuss methodologies for describing the quality and features of domain-specific software packages in evaluation frameworks (guides, questions and tables), comprehensively utilizing the ISO/IEC 9126-1 quality standard as a guiding framework.
According to the software/application category, the tools to perform or support the data oriented
projects in record linkage in statistics should have these common characteristics:
1. portability in being able to function with statistic researchers' current arrangement of
computer systems and languages,
2. flexibility in handling different linkage strategies,
3. low operational expenses or low Total Cost of Ownership (TCO), in terms of both computing time and researchers' efforts.
The record linkage packages (both universal and domain-specific) are often described in marketing and commercial documentation as satisfying all of these criteria. In practice, if we evaluate tools for both deterministic and probabilistic linkage, the results of a simple comparison are obviously not consistent. It is similarly difficult to assess tools built outside European statistical systems, which rely, for example, on NYSIIS identification, SOUNDEX algorithms, or address and business name standardizations coming from the US environment. Likewise, methods that rely on the most advanced strategies, mixing deterministic and probabilistic linkage and adapting graph algorithms and neural-network-driven computing in the decision step, fail to fit comparable common criteria: a single figure cannot represent a correct comparable value across such different disciplines. Quality tuning, testing and performance simulation are the only ways of getting real performance figures for these modules. For assessing data quality and other "production" parameters, testing and prototyping of the right project structure of the solution are often part of the proposal.
Because "each linkage project is different", the modularity of the software should allow better control over programming and managing the processes in a linkage project, and over the development of unique strategies, either in advance or flexibly in response to the actual linking conditions during the linking process.
Where the user (a statistician or domain researcher) provides the weights, parameters, decision rules or his "expertise and behavioural know-how", the architect of the linkage solution can choose among different modules, interfaces and supplementary solutions. The advantage of open and flexible solutions is that they enable simulations that modify data and parameters consistently in the different steps of the project workflow, and/or the development or engagement of additional modules from other suppliers and IT resources to supplement the basic modules optimally.
There are a number of commercial and non-commercial implementations and deployments of matching and record linkage algorithms in the areas of exact and probabilistic matching, with more or less sophisticated strategies. They are often utilized in open-source environments and are often developed and supported by independent vendors or universities.
In this section we evaluate some of the commercial software packages which, according to their data quality scoring position in Gartner reports, are among the most important vendors in this area. The Magic Quadrant graph below was published by Gartner, Inc. as part of a larger research note and should be evaluated in the context of the entire report.
[Figure: Magic Quadrant for Data Integration Tools, 2007 (Gartner)]
We focus on three vendors:
- Oracle, a partner for data quality, data integration and enterprise integration services, representing the convergence of tools and services in the software market;
- SAS/DataFlux, a data quality, data integration and BI (Business Intelligence) player on the market, with its applications and methodologies on an integrated platform;
- Netrics, which offers advanced technology complementing the Oracle data quality and integration tools.
IT specialists and statistical analysts evaluate the products, services and advanced scientific tools and methodologies, together with the substantial experience of suppliers in providing statistical application intelligence. Real-life solutions and concrete statistical projects (e.g. scanning and recognition at the Swiss Federal Statistical Office) served as references for the application of the software in projects with high volumes of data scanned from paper forms during the 2001 Census, in pilot projects for the next census, and in other statistical work.
The lists of bulleted questions below represent a principal schema of the characteristics of a software package, applying models of quality management to software package selection. These questions should be answered with a deep understanding of the problems, features and practice in concrete projects, for example within the Census 2011 project mentioned above. They are answered in a structured form in a set of tables (Section 3.3). The answers are published and authorized by the vendors and often proven in concrete tests and experience. This structured format simplifies the understanding of the evaluation and enables a basic comparison of the tools and features collected from the available information resources and from practical experience.
The format of the comparison tables
General
- Is the software a universal system or domain-specific to a given application?
- Is the software:
o a complete system, ready to perform linkages "out of the box";
o a modular or development environment, requiring an obvious project/programming effort;
o a framework to be applied by a statistical scientist?
- Which types of linkages does the software support?
o Linking two files
o Linking multiple files
o Deduplication
o Linking one or more files to a reference file
- Infrastructure and integration features:
o Computing performance
o Capacities and storage
o Networking and security
o Databases
o Integration, middleware, SOA
- Linkage modes: interactive, in real time, in batch mode
- Characteristics of the vendor: reliability; can the vendor provide adequate technical support?
- Software documentation and manuals
- Features the vendor plans to add in the near future (e.g., in the next version)
- References and practical experience
- Other software, such as database packages or editors, needed to use the system
Linkage Methodology
- Record linkage methods: is the software automated or does it require manual interaction?
- Does the user have control over the linkage process? Is the system a "black box", or can the user set parameters to control the linkage process?
- Does the software require any parameter files?
- Can the user specify the linking variables and the types of comparisons?
- Which comparison functions are available for different types of variables? Do the methods give proportional weights (that is, allow degrees of agreement)?
o Character-for-character comparison
o Phonetic code comparison (Soundex or NYSIIS variant)
o Advanced (information-theoretic) string comparison functions
o Specialized numeric comparisons
o Distance comparisons
o Time/date comparisons
o Ad hoc methods (e.g., allowing one or more characters to differ between strings)
o User-defined comparisons
o Conditional comparisons
o Can the user specify critical variables that must agree for a link to take place?
- Does the system handle missing values for linkage variables?
o Computes a weight or any other value
o Uses a median between agreement and disagreement weights
o Uses a low or a zero weight
- What is the maximum number of linking variables?
- Does the software block records? How do users set blocking variables?
- Does the software contain or support routines for estimating linkage errors?
- Does the matching algorithm use contextual techniques that evaluate dependence between variables?
Data Management
- Which data storage formats does the software use?
o Flat files
o Database
o Caches, ODS
o Data limits
Post-linkage Functions
- Does the software provide a utility for reviewing possible links?
- Can the decision work on the record review be shared simultaneously?
- Can records be "put aside" for later review?
- Does the software provide a utility for generating reports on the linked, unlinked, duplicate and possible-link records? Can the report format be customized?
- Does the software provide a utility for extracting files of linked and unlinked records? Can the user specify the format of such extracts?
Standardization
- Does the software provide a means of standardizing (parsing out the pieces of) name and address fields, localized for concrete environments?
- Is name and address standardization customizable? Can different processes be used on different files?
- Does standardization change the original data fields, or does it append standardized fields to the original data record?
Costs
- Purchase and maintenance costs of the software itself, along with any needed additional software (e.g., database packages) and new or upgraded hardware
- The cost of training personnel to use the system
- The projected personnel (staff) costs usually associated with running the system
- The cost of developing the system for the intended purposes using the software within the available budget
Empirical Testing, case studies and examples
- Can levels of false match and false non-match be pre-set, expected with the system, or tested on samples?
- Number of manual interventions in practical examples (e.g., possible-match review)
- Timing of typical match projects with the system (examples or case studies)
3.3 Summary Tables of Oracle, Netrics and SAS/DataFlux
3.3.1 Oracle
3.3.1.1 Software/Tools/Capabilities
Oracle Warehouse Builder transparently integrates all relevant aspects of the data quality area into all phases of the data integration project life cycle.
Oracle Warehouse Builder provides functionality to support the following aspects of the data quality area:
- Creation of data profiles.
- Maintenance for the derivation of data rules, based on automatically collected statistics or on their manual creation.
- Execution of data audits based on data rules.
- Identification and unification of multiple records.
- Matching and merging for the purpose of de-duplication.
- Matching and merging for the purpose of record linking (including householding).
- Name and address cleansing.
Oracle Data Integrator. The very basic idea behind Oracle Data Integrator was to create a
framework for generation of data integration scripts. General integration patterns were derived
from many integration projects from all around the globe. The techniques are the same and only the
underlying technologies vary, so the development environment was divided accordingly.
Topologies, data models, data quality checks, integration rules are modeled independently of
technology. Specific technology language templates (knowledge modules) are assigned in the
second step and the resulting code is automatically generated. This approach maximizes the
flexibility and technological independence of the resulting solution.
Oracle Data Integrator allows organizations to reduce the cost and complexity of their data
integration initiatives, while improving the speed and accuracy of data transformation. Oracle
Data Integrator supports THREE "RIGHTS":
- The Right Data: the data must not only be appropriate for the intended use, but must also be accurate and reliable.
- The Right Place: the overall information ecosystem consists of multiple operational and analytical systems, and they all need to benefit from the data of the other systems, regardless of their locale.
- The Right Time: data can become stale quickly. A decision support system that does not get the data in time is useless. A shipping application that does not get order information before the cutoff time is not efficient. Getting data at the right time, with a latency that is appropriate for the intended use of this data, is one of the most important challenges faced by businesses today.
Each Knowledge Module type refers to a specific integration task, e.g.:
- Reverse-engineering metadata from the heterogeneous systems for Oracle Data Integrator.
- Handling Changed Data Capture on a given.
- Loading data from one system to another.
- Integrating data in a target system, using specific strategies (insert/update, slowly changing dimensions, match/merge).
- Controlling data integrity on the data flow.
- Exposing data in the form of services.
Knowledge Modules are also fully extensible. Their code is open and can be edited through a graphical user interface.
Powerful deterministic and probabilistic matching is provided to identify unique consumers,
businesses, households, or other entities for which common identifiers may not exist.
Examples of match criteria include:
Similarity scoring to account for typos and other anomalies in the data, Soundex to identify data
that sounds alike, Abbreviation and Acronym matching, etc.
Oracle Data Integrator provides audit information on data integrity. For example, the following are four ways erroneous data might be handled:
- Automatically correct data: Oracle Data Integrator offers a set of tools to simplify the creation of data cleansing interfaces.
- Accept erroneous data (for the current project): in this case rules for filtering out erroneous data should be used.
- Correct the invalid records: in this situation, the invalid data is sent to application end users via various text formats or distribution modes, such as human workflow, e-mail, HTML, XML, flat text files, etc.
- Recycle data: erroneous data from an audit can be recycled into the integration process.
Built-in transformations are usually used as part of ETL processes.
Oracle Data Integrator enables application designers and business analysts to define
declarative rules for data integrity directly in the centralized Oracle Data
Integrator metadata repository. These rules are applied to application data—inline with batch or
real-time extract, transform, and load (ETL) jobs—to guarantee the overall integrity,
consistency, and quality of enterprise information.
Match rules are specified and managed through intuitive wizards. Match/Merge can also be
accessed through an extensive scripting language.
- Parsing into individual elements is used for improved correction and matching.
- Standardization, which involves modification of components to a standard version acceptable to a postal service or suitable for record matching.
- Validation and correction by using a referential database.
- Manual adding of new data elements.
Data quality rules range from ensuring data integrity to sophisticated parsing, cleansing,
standardization, matching and deduplication.
Data can be repaired either statically in the original systems or as part of a data flow. Flow-based control minimizes disruption to existing systems and ensures that downstream analysis and processing works on reliable, trusted data.
Conditional match rules and weighted match rules are supported, as well as user-defined algorithms, which can also reference one of the other match rule types to logically "and" or "or" them together to produce a unique rule type. Oracle Data Integrator does not provide standard functionality for weighted matching but provides an open interface to enable integration with any database functionality or custom functionality (in Java, PL/SQL, shell, etc.). Match results are intelligently consolidated into a unified view based on defined rules.
3.3.1.2 Integration options
Oracle Unified Method, Oracle Project Management Method
The integration approach is driven by the data model and by the model of business/matching rules. Business/matching rules are partially designed and created by business users. The data model is usually created (modeled) by technical users.
Computing Infrastructure
In the case of name & address cleansing, an additional server for this functionality is required.
The Oracle Data Integrator architecture is organized around a modular repository, which is
accessed in client-server mode by component-graphical modules and execution agents. The
architecture also includes a Web application, Metadata Navigator.
The four graphical modules are Designer, Operator, Topology Manager and
Security Manager.
Oracle Data Integrator is a lightweight, legacy-free, state-of-the-art integration platform. All
components can run independently on any Java-compliant system.
Supported DBMSs
Oracle database server.
Oracle Database 8i, 9i, 10g, 11g
3.3.1.3 Performance/Throughput
A system with 8.5 million unified physical persons and 24.5 million physical person instances was processed; the physical person records contained approximately 5% multiplicities. The matching score for physical persons was 98% (98% of physical person instances are linked to a unified physical person).
In total, 33 million dirty addresses entered the address matching and cleansing mechanism. The success rate of the cleansing process was 84%.
3.3.1.4 Business terms, conditions and pricing
Oracle Warehouse Builder
a/ Oracle Warehouse Builder as CORE ETL is free as part of the Oracle DB from version 10g (edition SE-1, SE or EE); this means that the database is licensed per CPU or NU and Oracle Warehouse Builder in the CORE edition is part of it.
b/ Oracle Warehouse Builder has three Options
1/ Enterprise ETL - performance and development productivity (licensed as part of
database EE by the database rules NU or CPU)
2/ Data Quality – data quality and data profiling (licensed as part of database EE by
the database rules NU or CPU)
3/ Connectors – connectors to enterprise applications (licensed as part of database EE
by the database rules NU or CPU)
Oracle Data Integrator
a/ Oracle Data Integrator is licensed on CPU
b/ Oracle Data Integrator has two Options
1/ Data Quality (licensed only on CPU)
2/ Data Profiling (licensed only on NU)
Maintenance costs
For all Oracle products, maintenance is 22% of the license price.
3.3.2 Netrics
3.3.2.1 Software/Tools/Capabilities
The Netrics Data Matching Platform consists of two complementary components:
1. The Netrics Matching Engine — which provides error tolerant matching against data on
a field-by-field, or multi-field basis;
2. The Netrics Decision Engine — which decides whether two or more records should
be linked, are duplicates, or represent the same entity — learning from, and modeling,
the same criteria that best human experts use.
Matching accuracy out-of-the-box - Netrics Matching Engine™ - matches data with
error tolerance that approximates human perception, with a level of speed, accuracy, and
scalability that no other approach can provide. It provides unparalleled error tolerance,
identifying similarity across a wide range of “real-world” data variations and problem conditions.
The Netrics Matching Engine finds matches even for incomplete or partial similarity. Using Netrics' patented mathematical approach, it finds similarities in data much as humans perceive them. As a result, it discovers matches even when there are errors in both the query string and the target data. It can therefore handle many of the issues that plague real-world data, from simple transpositions and typos up to and including data that has been entered into the wrong fields.
Decision making accuracy – ease of initial training and re-training if needed
The Netrics Decision Engine™ uses machine learning to eliminate the weaknesses of both deterministic and probabilistic rules-based algorithms. It uses an exclusive system that models human decision-making. It "learns" by automatically tracking and understanding the patterns of example decisions provided by business experts, and creates a computer model that accurately predicts them. Once trained, the Decision Engine's model reliably makes the same decisions as the experts, at computer speed, without ever losing track of the rules and intelligence learned before, and with the same concentration on the problem that the experts provided before.
Because creating a model is so easy (the expert just needs to provide examples), a custom
model can be created for each critical business decision, tailored to the specifics of the data, market, and business requirements. If social security numbers, driver's license numbers,
emergency contact information, employers or insurance information is collected, any or all of
these data elements can be used in evaluation of a record. There is no need for programmers to
try to devise the underlying business rules, which may not capture all possible criteria, and may
introduce inaccuracies, or to make up weights – the Netrics Decision Engine figures this out by
itself.
http://www.netrics.com/index.php/Decision-Engine/Decision-Engine.html
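The "learning from example decisions" idea can be pictured with a deliberately simple stand-in: a logistic regression fitted to expert-labelled comparison vectors. This is in no way Netrics' patented model; it is a generic sketch with simulated data, included only to make the concept concrete.

```r
## Generic sketch of learning a link/no-link decision from labelled examples
## (not Netrics' proprietary technology; data and threshold are invented).
set.seed(42)
train <- data.frame(name_sim  = runif(200),          # string similarity of names
                    dob_agree = rbinom(200, 1, 0.5), # agreement on date of birth
                    addr_sim  = runif(200))          # similarity of addresses
train$is_match <- rbinom(200, 1, plogis(-4 + 5 * train$name_sim + 2 * train$dob_agree))

fit <- glm(is_match ~ name_sim + dob_agree + addr_sim, family = binomial, data = train)

# Score new candidate pairs and apply a decision threshold
new_pairs <- data.frame(name_sim = c(0.95, 0.30), dob_agree = c(1, 0), addr_sim = c(0.8, 0.2))
predict(fit, new_pairs, type = "response") > 0.9
```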
Able to deal with multiple varied errors in both the data records and query terms
The Netrics Matching Engine uses mathematical modeling techniques to determine similarity.
This deployment of theory in practice means that a high degree of accuracy is maintained even
when there are numerous errors in both the incoming matching requests and the data being
matched.
Fine-grained control over each matching request
If needed, matching features are incrementally customizable, down to a per-match basis, including phonetic analysis, field selections, field weighting, multiple record weighting, record predicates, multi-table selection, and multi-query capabilities.
3.3.2.2 Integration options
Ease of integration into applications
Depending on the customer's strategy for an easy start-up, customers are able to work with the Netrics Matching Engine within minutes on test data; Netrics has experience of customers who have implemented production instances within a few hours. All the matching capabilities are provided in a very powerful yet easy-to-use API that can be called from all major languages and also by using SOAP requests included in SOA implementations. Integration with or within standard ETL applications is mostly expected.
Range of server platforms supported
The Netrics Matching Engine and Netrics Decision Engine have been ported across a wide
range of 32 and 64 bit operating systems.
Client languages supported – ease of API usage
Client-side support for Java, Python, .NET, C#, C and C++ and command lines is provided, as well as full WSDL/SOAP support enabling easy SOA implementation.
Supported DBMSs
Data can be loaded into the Netrics Matching Engine from any database that can extract data into a CSV file or other formats. Specific loading and synchronization support, to keep the matching data updated as changes happen to the underlying data, is provided for the major DBMSs: Oracle, SQL Server, MySQL and DB2. As the intermediate Netrics database provides services to the engines with cache-like access, the deployment of a simple, quick and secure database environment is preferable. In this way the Netrics decision environment can be created as an enterprise-wide central data quality engine for ad hoc applications, in parallel with the BI typically run on a DWH store.
3.3.2.3 Performance/Throughput
Support for large databases
Conventional "fuzzy matching" technology is computer-resource intensive: rule sets often drive
multiple queries on databases, as developers compensate for the lack of sophisticated matching
technology using multiple "wildcard" searches, or testing validity of error-prone algorithms
such as Soundex.
More sophisticated algorithms such as "Edit Distance" are highly computationally intensive.
Computation increases exponentially with the size of the "edit window" and the number of
records being matched. These limitations make "fuzzy matching" unusable in many
applications. Customers must accept serious constraints on the number of records tested, or on the matching window applied, and risk missing matches.
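For concreteness, the "edit distance" referred to here is the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another; base R exposes a generalized version through adist().

```r
## Levenshtein (edit) distance examples using base R's adist()
adist("JAROSLAV", "JEROSLAW")          # 2: one A->E and one V->W substitution
adist(c("SMITH", "SMYTHE"), "SMITH")   # 0 for the exact match, 2 for the variant
```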
Netrics' patented bi-partite graph technology is inherently efficient, and scales linearly with
the number of records in the database. The technology has been proven in real-time
applications for databases as large as 650 million records. Netrics' embeddable architecture
operates independently of existing applications and DBMSs. Many customers actually experience a decrease in the computer resources required when inexact matching is shifted from the DBMS (which is natively optimized for exact matching) onto the Netrics Matching Engine.
Multi core, multi-CPU servers and multi-server cluster exploitation
The Netrics Matching Engine is fully multi-threaded, stateless and exploits multi-core CPUs,
multi-CPU servers and clustered server configurations. Front-end load balancing and failover
appliances such as F5 and Citrix NetScaler are supported out-of-the-box.
3.3.2.4 Business terms, conditions and pricing
Type of licensing (perpetual/lease/SaaS etc)
Pricing for the Netrics Matching Engine and Netrics Decision Engine is simple and
straightforward. Our one-time perpetual software license fee is based on the number of CPU
cores needed to run the required data matching and decision making workload.
Maintenance costs
Maintenance costs are usually 20% of the total software license fee, charged annually.
3.3.3 SAS
3.3.3.1 Software/Tools/Capabilities
SAS/DataFlux matching (default) engine
The SAS/DataFlux matching (default) engine has been designed to enable the identification of duplicate records both within a single data source and across multiple sources. The rules-based matching engine uses a combination of parsing rules, standardization rules, phonetic matching,
and token-based weighting to strip the ambiguity out of source information. After applying
hundreds of thousands of rules to each and every field, the engine outputs a “match key” – an
accurate representation of all versions of the same data generated at any point in time.
“Sensitivity” allows the user to define the closeness of the match in order to support both high
and low confidence match sets.
SAS matching (optional): the SAS Data Quality Solution also supports probabilistic methods for record linking, such as the Fellegi and Sunter method, among others.
Implementation characteristics.
The SAS/DataFlux methodology supports the entire data integration life cycle through an
integrated phased approach. These phases include data profiling, data quality, data integration,
data enrichment, and data monitoring. The methodology may be implemented as an ongoing
process to control the quality of information being loaded into target information systems.
Matching characteristics
Deterministic (mainly DataFlux) or probabilistic (covered by SAS).
Manual Assistance
SAS/DataFlux supports both manual review and automatic consolidation modes. This gives
flexibility to manually review the consolidation rules for every group of duplicate records or to automatically consolidate the records without a review process.
Preprocessing/Standardization
Standardization
The DataFlux engine supports advanced standardization (mapping) routines that include
element standardization, phrase standardization and pattern standardization. Each
standardization approach supports the ability to eliminate semantic differences found in source
data, including multiple spellings of the same information, multiple patterns of the same data, or
the translation of inventory codes to product descriptions.
The DataFlux engine includes the ability to replace the original field with the new value or to
append the standardized value directly on to the source record. Alternatively, the standardized
values can be written to a new database table, or to text files.
Phrase Standardization
Phrase standardization describes the process of modifying original source data from some value to a common value. For example, based on standardization rules, the company name "GM" is translated from the original value to "General Motors". Phrase standardization
involves the creation of standardization rules that include both the original value and the desired
value. These synonyms, or mapping rules, can be automatically derived by the data profiling
engine, or they can be imported from other corporate information sources.
Pattern Standardization
Pattern Standardization includes the decomposition of source data into atomic elements such as
first name, middle name, and last name. After identifying the atomic elements, pattern
standardization reassembles each token into a common format.
Element Standardization
Element standardization involves the mapping of specific elements or words within a field to a new, normalized value. For example, the engine is able to isolate a specific word such as "1st" and then modify that single word into a standardized word such as "First".
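A toy R sketch of how phrase and element standardization can feed a coarse "match key", so that variants of the same value collide on the same key; the real DataFlux rule base is far richer and is not reproduced here, and the mapping rules below are invented for the example.

```r
## Toy match-key construction from a few standardization rules (illustrative only)
make_match_key <- function(x) {
  key <- toupper(x)
  key <- gsub("[^A-Z0-9 ]", "", key)                       # drop punctuation
  key <- gsub("\\b1ST\\b", "FIRST", key)                   # element standardization
  key <- gsub("\\bINC\\b|\\bCORP\\b|\\bLTD\\b", "", key)   # drop legal forms
  key <- gsub("\\bGENERAL MOTORS\\b|\\bGM\\b", "GENERALMOTORS", key)  # phrase rule
  gsub(" +", "", key)                                      # collapse spaces
}
make_match_key(c("G.M. Corp", "General Motors, Inc."))     # both give "GENERALMOTORS"
make_match_key("1st Avenue")                               # "FIRSTAVENUE"
```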
Automated Rule Building
The DataFlux solution is the only data quality solution that is able to automatically build
standardization rules based on analysis of source data during the profiling phase. This analysis
includes the ability to group similar fields based on content and then derive from the content the
value to which all occurrences should be mapped. The DataFlux engine uses data quality
matching algorithms during this process in order to identify the inconsistent and non-standard
data. In addition, the business user has the ability to dial-up or dial-down the automated process
to meet specific domain needs.
3.3.3.2 Integration options
Computing Infrastructure
The SAS/DataFlux Data Quality solution is an integrated data management suite that allows users to inspect, correct, integrate, enhance and control data across the enterprise. The proposed solution architecture comprises the following components:
- Data Quality Client: a user-friendly, intuitive interface that enables users to implement quickly, easily and with low maintenance.
- Batch Data Quality Server: a high-performance, customisable engine that forms the core of the Data Quality product. This engine can process any type of data via native and direct data connections. The data quality server is able to access all data sources via native data access engines to every major database and file structure, using the most powerful data transformation language available.
- Quality Knowledge Base: all processing components of the SAS data quality solution utilise the same repository of pre-defined and user-defined business rules.
- Real-time Data Quality Server: processes real-time data quality tasks via the Application Program Interface (API), allowing SAS data quality technology to be embedded within the customer's front-end operational systems.
Meta Integration Model Bridge
Enables access to non-CWM-compliant metadata by enabling users to import, share and exchange metadata from design tools.
Database, Files
Read and write requests are translated into the appropriate call for the specific DBMS or file structure. All the industry standards in data access are supported. They include:
- ODBC
- OLE DB for OLAP
- MDX
- XML/A
This ensures that inter-system data access will be possible for all storage technologies in the future.
The SAS DQ product includes the ability to natively source data from and output to various different databases/data files. The supported file formats include:
- Flat files
- Microsoft SQL Server (including Analysis Services via OLE DB)
- Oracle
- IBM DB2
- Teradata
The SAS/ACCESS Interface to Teradata supports two means of integration: generation of the required SQL, which can be edited and tuned as required, or user-created SQL that can include Teradata-specific functionality.
The standards supported include:
- ODBC
- JDBC
- XML
- OLE DB
Software employed
SAS, DataFlux dfPower Studio
3.3.3.3 Performance/Throughput
Performance Benchmarks
The following benchmarks were generated using a delimited text file as the input; the jobs were executed on a machine with these specifications:
- Windows 2003 Server
- Dual Pentium Xeon 3.0 GHz
- 4.0 GB RAM

Process | 1MM Records | 5MM Records
Generate Name Match Keys | 0:02:54 | 0:08:43
Generate Name Match Keys and Cluster | 0:03:01 | 0:09:14
Profiling – 1 column (all metrics) | 0:00:23 | 0:02:54
Profiling – 10 columns (all metrics) | 0:03:06 | 0:10:15
3.3.3.4 Business terms, conditions and pricing
SAS Business model
The SAS Institute business model is based on a commitment to building long-term relationships with customers. SAS products are licensed on a one-year basis. License fees include software installation, maintenance, upgrades, documentation and phone technical support.
SAS Institute already has long-term relationships with many statistical offices in Europe, especially in data integration and analytical tools.
4 Documentation, literature and references
4.1 Bibliography for Section 1
Barateiro J., Galhardas H. (2005) A Survey of Data Quality Tools. Datenbank-Spektrum 14: 15-21.
Batini C., Scannapieco M. (2006) Data Quality: Concepts, Methods, and Techniques, Springer.
Bertsekas D.P.(1992) Auction Algorithms for Network Flow Problems: A Tutorial Introduction.
Computational Optimization and Applications, vol. 1, pp. 7-66.
Christen P., Churches T. (2005a) A Probabilistic Deduplication, Record Linkage and Geocoding
System. In Proceedings of the ARC Health Data Mining workshop, University of South Australia.
Christen P., Churches T. (2005b) Febrl: Freely extensible biomedical record linkage Manual.
release 0.3 edition, Technical Report Computer Science Technical Reports no.TR-CS-02-05,
Department of Computer Science, FEIT, Australian National University, Canberra.
DATAFLUX: http://www.dataflux.com/
Day C. (1997) A checklist for evaluating record linkage software, in Alvey W. and Jamerson B.
(eds.) (1997) Record Linkage Techniques, Washington, DC: Federal Committee on Statistical
Methodology.
Dempster A.P., Laird N.M., Rubin D.B. (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, pp. 1-38.
FEBRL: http://sourceforge.net/projects/febrl
Fellegi I.P., Sunter A.B. (1969) A theory for record linkage. Journal of the American Statistical
Association, Volume 64, pp. 1183-1210.
Gartner (2007). Magic Quadrant for Data Quality Tools 2007, Gartner RAS Core Research Note
G00149359, June 2007.
Gu L., Baxter R., Vickers D., Rainsford. C. (2003) Record linkage: Current practice and future
directions. Technical Report 03/83, CSIRO Mathematical and Information Sciences, Canberra,
Australia.
Herzog T.N. (2004) Playing With Matches: Applications of Record Linkage Techniques and Other
Data Quality Procedures. SOA 2004 New York Annual Meeting - 18TS.
Herzog T.N., Scheuren F.J., Winkler, W.E. (2007) Data Quality and Record Linkage Techniques.
Springer Science+Business Media, New York.
Hernandez M., Stolfo S. (1995) The Merge/Purge Problem for Large Databases. In Proc. of the 1995 ACM SIGMOD Conf., pp. 127-138.
Karr A.F., Sanil A.P., Banks D.L. (2005). Data Quality: A Statistical Perspective. NISS Technical
Report Number 151 March.
Koudas N., Sarawagi S., Srivastava D. (2006) Record linkage: similarity measures and algorithms.
SIGMOD Conference 2006, pp. 802-803.
LINKAGEWIZ: http://www.linkagewiz.com
LINKKING: http://www.the-link-king.com
LINKPLUS: http://www.cdc.gov/cancer/npcr
Naumann F. (2004) Informationsintegration. Antrittsvorlesung am Tag der Informatik. University of Potsdam - Hasso-Plattner-Institut, Potsdam. Date Accessed: 02/5/2008,
www.hpi.uni-potsdam.de/fileadmin/hpi/FG_Naumann/publications/Antrittsvorlesung.pdf
Rabiner L.R.(1989) A Tutorial on Hidden Markov Models and Selected Applications in
Speech Recognition. Proceedings of the IEEE, vol. 77, no. 2, Feb. 1989.
RELAIS: http://www.istat.it/strumenti/metodi/software/analisi_dati/relais/
Shah G., Fatima F., McBride S. A Critical Assessment of Record Linkage Software Used in Public
Health. Available at: http://nahdo.org/CS/files/folders/124/download.aspx
TRILLIUM: Trillium, Trillium Software System for data warehousing and ERP. Trillium Software
Webpage. Date Accessed: 01/5/2008, http://www.trilliumsoft.com/products.htm, 2000.
Tuoto T., Cibella N., Fortini M., Scannapieco M., Tosco T. (2007) RELAIS: Don't Get Lost in a
Record Linkage Project, Proc. of the Federal Committee on Statistical Methodologies (FCSM 2007)
Research Conference, Arlington, VA, USA.
4.2 Bibliography for Section 2
D’Orazio M, Di Zio M, Scanu M, 2006. Statistical Matching: Theory and Practice. Wiley,
Chichester.
Kadane J.B., 1978. Some statistical problems in merging data files. 1978 Compendium of Tax
Research, U.S. Department of the Treasury, 159-171. Reprinted on the Journal of Official Statistics,
2001, Vol. 17, pp. 423-433.
Moriarity C., 2000. Doctoral dissertation submitted to The George Washington University,
Washington DC. Available on the website http://www.proquest.com
Moriarity C., and Scheuren, F. (2001). Statistical matching: paradigm for assessing the uncertainty
in the procedure. Journal of Official Statistics, 17, 407-422
Raessler S. (2002). Statistical Matching: a Frequentist Theory, Practical Applications and
Alternative Bayesian Approaches. Springer Verlag, New York.
Raessler S. (2003). A Non-Iterative Bayesian Approach to Statistical Matching. Statistica
Neerlandica, Vol. 57, n.1, 58-74.
Rubin, D.B. (1986). Statistical matching using file concatenation with adjusted weights and
multiple imputation. Journal of Business and Economic Statistics, Vol. 4, 87-94
Sacco, G., 2008. SAMWIN: a software for statistical matching. Available on the CENEX-ISAD
webpage (http://cenex-isad.istat.it), go to public area/documents/technical reports and
documentation.
Schafer, J.L. (1997), Analysis of incomplete multivariate data, Chapman and Hall, London.
Schafer, J.L. (1999), Multiple imputation under a normal model, Version 2, software for Windows
95/98/NT, available from http://www.stat.psu.edu/jls/misoftwa.html.
Van Buuren, S. and K. Oudshoorn (1999), Flexible multivariate imputation by MICE, TNO Report
G/VGZ/99.054, Leiden.
Van Buuren, S. and C.G.M. Oudshoorn (2000), Multivariate imputation by chained equations, TNO
Report PG/VGZ/00.038, Leiden.
4.3 Bibliography for Section 3
Cohen, Ravikumar, Fienberg (2003), A Comparison of String Metrics for Matching Names and
Records, American Association for Artificial Intelligence (2003)
Franch X., Carvallo J.P. (2003) Using Quality Models in Software Package Selection, IEEE
Software, vol. 20, no. 1, pp. 34-41.
Friedman, Ted, Bitterer Andreas, Gartner (2007). Magic Quadrant for Data Quality Tools 2007,
Gartner RAS Core Research Note G00149359, June 2007.
Haworth, Marta F., Martin Jean, (2000), Delivering and Measuring Data Quality in UK National
Statistics, Office for National Statistics, UK
Hoffmeyer-Zlotnik, Juergen H.P. (2004), Data Harmonization, Network of Economic & Social Science Infrastructure in Europe, Luxemburg.
Leicester Gill, Methods for Automatic Record Matching and Linkage and their Use in National
Statistics, Oxford University, 2001
National Statistics Code of Practice, Protocol on Data Matching, 2004
Record Linkage Software, User Documentation, 2001
Russom, Philip, Complex Data: A new Challenge for Data Integration, (November 2007), TDWI
Research Data Warehousing Institute
Schumacher, Scott (2007), Probabilistic Versus Deterministic Data Matching: Making an Accurate Decision, DM Review Special Report, Data Management Review and SourceMedia, Inc., January 2007.
Winkler, William E., Overview of Record Linkage and Current research Directions, RESEARCH
REPORT SERIES (Statistics #2006-2), Statistical Research Division, U.S. Census Bureau,
Washington DC