
White Paper: Managing the Organization's Reference Data
1. Putting Reference Data in Context
Reference data is the name for a special class of data found in all business databases, and all
organizations. We define it as follows:
Reference data is any kind of data that is used solely to categorize other data
found in a database, or solely for relating data in a database to information beyond
the boundaries of the enterprise.
Many Information Technology (IT) professionals recognize reference data, sometimes using
different names, like lookup tables or constants. This is based on the fact that reference data is
usually contained in small database tables often consisting of two columns, a code and a
description. These tables typically contain relatively few rows. Unfortunately, the simple
structure and small size of reference data tables often lead to them being overlooked by IT staff,
even though it is recognized that the data which they contain must be free from defects.
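The two-column structure described above can be sketched with Python's standard-library sqlite3 module. This is only an illustration: the table and column names (country, customer_order, country_code) are assumptions, not taken from any particular system. It shows a code-and-description table constraining the values that a transaction table may hold.

```python
import sqlite3

def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with one reference table (hypothetical names)."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    # The reference table: a code and a description, and nothing else.
    conn.execute(
        "CREATE TABLE country ("
        " country_code TEXT PRIMARY KEY,"
        " description TEXT NOT NULL)"
    )
    # A transaction table whose rows are categorized by the reference table.
    conn.execute(
        "CREATE TABLE customer_order ("
        " order_id INTEGER PRIMARY KEY,"
        " country_code TEXT NOT NULL REFERENCES country(country_code))"
    )
    conn.executemany(
        "INSERT INTO country VALUES (?, ?)",
        [("CA", "Canada"), ("FR", "France")],
    )
    return conn

def try_insert_order(conn: sqlite3.Connection, order_id: int, country_code: str) -> bool:
    """Return True if the order is accepted, False if its code is not in the reference table."""
    try:
        conn.execute(
            "INSERT INTO customer_order VALUES (?, ?)", (order_id, country_code)
        )
        return True
    except sqlite3.IntegrityError:
        return False

conn = build_demo_db()
print(try_insert_order(conn, 1, "CA"))  # accepted: "CA" is a known code
print(try_insert_order(conn, 2, "XX"))  # rejected: "XX" is not in the reference table
```

Even in this small sketch, a single defective row in the reference table would affect every order that refers to it.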
Sometimes reference data is confused with other kinds of data. If we look at a database we can
classify the different layers of data that are found in it based on function, as follows:
Metadata
Reference Data
Enterprise Structure Data
Transaction Structure Data
Transaction Activity Data
Transaction Audit Data

Metadata is, roughly speaking, data about data. Most commonly it is the business and
technical definitions of tables and columns, though it may sometimes include a wealth of
other information.

Reference data also has a wide scope, as it is used to categorize, classify, or otherwise
qualify or constrain transaction data.

Enterprise structure data is data that describes the structure of the enterprise, e.g. the
organizational structure or the chart of accounts. This information is used to track business
activities by responsibility.

Transaction structure data represents data that is required to create a framework within
which transactions occur. For instance, an application designed to sell something would
need to record Product and Customer information before an Order could be processed.

Transaction activity data is what an operational system is built to record. This is what is
often referred to more simply as the “transaction” data – without clearly distinguishing
structure and activity. E.g. for an order processing system it will be the orders themselves,
and payment and accounting information.
© Copyright, 2004. www.refdataportal.com

Transaction audit data records audit information about individual transactions that are
performed in an application. Very often this data is not recorded in a database, or a database
management system takes care of it by automated logging.
Some IT staff include enterprise structure data and transaction structure data as part of reference
data, but this is not helpful because the two classes of data behave very differently.
The Different Kinds of Reference Data
Reference data may be a distinct class of data, but it is still very diverse. Some major kinds of
reference data are:
Things that are not involved with the enterprise: Countries, currencies, and time zones are
examples of things that exist, but are not parties to the transactions that an enterprise carries out.
Yet the enterprise still needs to use the information about these things to process and report on
its transaction data.
Classification schemes: Human beings can classify transaction information in an infinite
number of ways, depending on what they view as important about it. Classification schemes may
be broadly accepted, like industry codes, or highly personal and transient.
Constant values: "Things" are typically described by codes and descriptions in reference data
tables. Yet, some reference data covers specific properties of these things. Tax rates, economic
indicators and currency exchange rates are good examples. They are non-key attributes of
reference data tables, and typically change their values over time.
Type codes: These define entity subtypes in a database. For instance, Employee Position may
be Administrative or Professional. Type codes have values that are known when a database is
designed.
Status codes: These control the life cycles of entities in a database. An order may be placed,
filled, shipped, received, have payment received, and have payment cleared. Each of these
states has a different description, and is usually represented by a code in a physical database
table. Again, these codes have values that are known at database design time.
There are other kinds of reference data, but these are the major classes. It can be seen that
although they often have similar structures - tables consisting of codes and descriptions - they
have quite different functions and behaviors.
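The difference in behavior can be illustrated with a short Python sketch. It contrasts status codes, whose values are fixed at design time, with a constant value that changes over time and so needs an effective date. All names, code values, and rates here are invented for illustration, and the order life cycle is a simplified assumption.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import List, Optional

class OrderStatus(Enum):
    """Status codes: the full set of values is known when the database is designed."""
    PLACED = "PL"
    FILLED = "FI"
    SHIPPED = "SH"
    RECEIVED = "RC"

# Allowed life-cycle steps (an assumed, simplified order life cycle).
NEXT_STATUS = {
    OrderStatus.PLACED: OrderStatus.FILLED,
    OrderStatus.FILLED: OrderStatus.SHIPPED,
    OrderStatus.SHIPPED: OrderStatus.RECEIVED,
}

def can_advance(current: OrderStatus, target: OrderStatus) -> bool:
    """True only if target is the next state after current."""
    return NEXT_STATUS.get(current) == target

@dataclass(frozen=True)
class RateRow:
    """Constant value: a non-key attribute that is re-stated whenever it changes."""
    code: str
    rate: float
    effective_from: date

def rate_on(rows: List[RateRow], code: str, as_of: date) -> Optional[float]:
    """Return the rate in force on as_of: the latest row starting on or before that date."""
    candidates = [r for r in rows if r.code == code and r.effective_from <= as_of]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.effective_from).rate

# Hypothetical rates, not real tax data.
rates = [
    RateRow("SALES_TAX", 0.07, date(2003, 1, 1)),
    RateRow("SALES_TAX", 0.08, date(2004, 7, 1)),
]

print(can_advance(OrderStatus.PLACED, OrderStatus.SHIPPED))
print(rate_on(rates, "SALES_TAX", date(2004, 1, 15)))
```

Both structures are still "code plus something", but the first never gains rows in normal operation, while the second gains a row every time the value changes.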
2. Issues with Reference Data
Reference data, although quite diverse, comes with its own unique set of issues. Any
organization that wants to manage its reference data more efficiently needs to understand them.
Only then is it possible to formulate a strategy to deal with reference data.
Data quality: Reference data typically has wide scope. Individual values can be used widely
within a single information system, across different information systems of an enterprise, and are
often involved when data is exchanged with organizations that are external to the enterprise.
This wide scope of the use of individual reference data values means they must be free from
defects of any kind.
External standards: Some reference data is represented by coding schemes created by
standard-setting organizations in both the public and private sectors, e.g. the International
Organization for Standardization (ISO) and Standard & Poor's. Yet many organizations are
simply unaware of these standards and go about "reinventing the wheel" when setting up their
own reference data. This is very expensive, dooms them to repeating the same mistakes over
and over again, and makes it difficult to exchange data with other organizations - or even within
the same organization.
Choosing values for codes: Most reference data consists of tables with codes and descriptions.
The question arises as to what values should be used for codes. Sometimes the things being
described have well-known acronyms. How should these be dealt with? Should sequence
numbers with no meaning be used instead? The trade-offs that affect the design decision are
actually quite complex, and need careful consideration.
Duplication of maintenance functionality: Many organizations still treat systems and
databases as "islands", and have no overall information architecture. This often results in the
creation of functionality to update, control, and report on the same reference data tables in the
different databases and systems. This is expensive for IT staff to create and maintain. It also
makes the users of these systems devote time to keeping these tables up to date, often forcing
them to research the same data issues over and over again.
Duplication of data: With the "island" approach, not only is the maintenance functionality and
effort duplicated, but so is the data. This is really dangerous. There is a very high probability that
the data will diverge over time across the different databases. If users are responsible for the
update of this data, they may well begin to use reference data tables for purposes for which they
were not intended. This can add another dimension to a "Tower of Babel" within an enterprise's
information systems.
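Divergence of this kind can at least be detected mechanically. The following Python sketch compares two copies of the same code-and-description table and reports where they differ. The two "systems" (billing and shipping) and their contents are hypothetical.

```python
from typing import Dict, List

def diff_reference_tables(a: Dict[str, str], b: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two code -> description mappings and report how they have diverged."""
    return {
        # Codes that exist in one copy but not the other.
        "only_in_a": sorted(a.keys() - b.keys()),
        "only_in_b": sorted(b.keys() - a.keys()),
        # Codes present in both copies whose descriptions no longer agree.
        "description_differs": sorted(
            code for code in a.keys() & b.keys() if a[code] != b[code]
        ),
    }

# Two hypothetical copies of a country table maintained separately.
billing = {"CA": "Canada", "FR": "France", "UK": "United Kingdom"}
shipping = {"CA": "Canada", "FR": "French Republic", "GB": "United Kingdom"}

report = diff_reference_tables(billing, shipping)
print(report)
```

A report like this only shows that the copies disagree. Deciding which copy is right still requires someone who knows what each value is supposed to mean.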
Hard-coding: Programmers have a tendency to do everything in program code. This can
include "hard-coding" the definition of reference data. When this happens, reference data values
are never found in database tables, but exist only in different locations in program code. This can
create a maintenance nightmare. Reasons cited for this approach are usually "saving time" or
"because it is more efficient". The real reasons are bad design and poor management.
Business rules: Unlike other kinds of data in a database, actual values of reference data can
participate in business rules. E.g. "If Customer Credit Rating Exceeds 6 then do the following…".
This makes changing or deleting reference data values something that can cause unanticipated
problems. Business rules are usually hidden in program code. Even the modern tools that are
starting to address them do not deal with the problems of reference data values in the rules.
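One way to make this dependency visible is to have each rule declare the reference data values it mentions, so that the impact of changing or deleting a value can be checked beforehand. The Python sketch below is a hypothetical illustration of that idea. The rule names, table names, and the registry itself are assumptions, not a description of any existing tool.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Tuple

@dataclass(frozen=True)
class BusinessRule:
    name: str
    # (reference table, value) pairs that the rule's logic mentions explicitly.
    depends_on: FrozenSet[Tuple[str, str]]

def rules_affected_by(rules: Iterable[BusinessRule], table: str, value: str) -> List[str]:
    """List the rules that would be affected if this reference value changed or was deleted."""
    return sorted(r.name for r in rules if (table, value) in r.depends_on)

# Hypothetical rule registry.
rules = [
    BusinessRule("credit_rating_exceeds_6", frozenset({("credit_rating", "6")})),
    BusinessRule("block_closed_accounts", frozenset({("account_status", "CLOSED")})),
]

print(rules_affected_by(rules, "credit_rating", "6"))
print(rules_affected_by(rules, "credit_rating", "7"))
```

The sketch only works if the declarations are kept in step with the rules themselves, which is exactly what is hard when rules are buried in program code.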
Meaning: Unlike other kinds of data, individual reference data values have their own definitions.
E.g. in a table of countries "China" may exclude "Hong Kong". Classification schemes are
particularly prone to this issue. People cannot classify things if they do not know the exact
meaning of the different entries in the scheme. It is important to have this kind of information
readily available.
Data sharing: This is perhaps the most important issue that has to be dealt with in the
management of reference data. Since the same reference data is used within different systems of the same
enterprise, and even across different enterprises, there is a considerable demand to share it. The
other issues described previously all have to be resolved before data sharing can be addressed.
Even then, mechanisms for data sharing and management of this sharing have to be put in place.
3. Managing Reference Data
If an organization is serious about managing its reference data, it needs to put a strategy in place
to do this. Such a strategy should have the following facets:
Central Maintenance: The data quality issues to which reference data is prone mean that it
should be maintained centrally. This may not apply to some personal and transient
classification schemes, but it will apply to most reference data.
Data stewardship: If there is to be central maintenance of reference data, some persons will
have to be responsible for it. These persons do not own the data, but they maintain it for the
organization, and are known as data stewards. They must have the authority to match their
mandate and the resources to carry it out.
Repository: Central maintenance also implies one location where the reference data is stored.
Not only must the reference data be stored there, but also its associated metadata, e.g. meanings
of values, sources from where it was taken.
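As a rough sketch of what a repository entry might hold, the Python classes below keep each value's meaning and source alongside its code and description. The field names and the steward label are assumptions for illustration. The sample entry follows the China and Hong Kong example given earlier in this paper.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass(frozen=True)
class ReferenceValue:
    code: str
    description: str
    definition: str  # the exact meaning of this individual value
    source: str      # where the value was taken from

@dataclass
class ReferenceTable:
    name: str
    steward: str  # who maintains the table on behalf of the organization
    values: Dict[str, ReferenceValue] = field(default_factory=dict)

    def add(self, value: ReferenceValue) -> None:
        """Add a value, refusing to overwrite an existing code silently."""
        if value.code in self.values:
            raise ValueError(f"duplicate code {value.code!r} in {self.name}")
        self.values[value.code] = value

    def define(self, code: str) -> Optional[str]:
        """Return the recorded meaning of a code, or None if the code is unknown."""
        value = self.values.get(code)
        return value.definition if value else None

# Hypothetical table and steward.
countries = ReferenceTable(name="country", steward="data-stewardship-team")
countries.add(
    ReferenceValue(
        code="CN",
        description="China",
        definition="Excludes Hong Kong, which has its own entry.",
        source="ISO 3166-1",
    )
)
print(countries.define("CN"))
```

Keeping the definition next to the value means that anyone reviewing the repository can see what "China" covers without having to ask the steward.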
Access mechanism: A means of accessing the central repository of reference data needs to be
put in place. At a minimum this should allow users to review what is in the repository, and to
understand the reference data kept there. Obviously, the Internet is an ideal way of doing this. It
may also be possible to have remote databases access the repository to obtain reference data
values.
Distribution mechanism: Even if remote systems can access the central repository to obtain
reference data, performance issues may require these systems to have their own copies of
reference data. There then needs to be an efficient and secure mechanism for distributing
reference data from the central repository to the remote system. It is possible to use the Internet
to perform this distribution task.
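One possible shape for such a mechanism is sketched below in Python: the central repository publishes a snapshot of a table together with a digest, and each remote system verifies the digest and replaces its local copy only when the content has changed. The class and function names are hypothetical. Note also that a digest on its own only detects corruption. Real security would additionally need an authenticated channel or signed snapshots, which this sketch does not attempt.

```python
import hashlib
import json
from typing import Dict

def snapshot_digest(table: Dict[str, str]) -> str:
    """SHA-256 of a canonical JSON form of the table, independent of key order."""
    payload = json.dumps(table, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

class LocalCopy:
    """A remote system's own copy of one reference table (hypothetical)."""

    def __init__(self) -> None:
        self.table: Dict[str, str] = {}
        self.digest = snapshot_digest(self.table)

    def apply(self, table: Dict[str, str], published_digest: str) -> bool:
        """Install a published snapshot. Returns True if the local copy changed."""
        if snapshot_digest(table) != published_digest:
            raise ValueError("snapshot does not match its published digest")
        if published_digest == self.digest:
            return False  # already up to date, nothing to do
        self.table = dict(table)
        self.digest = published_digest
        return True

# Hypothetical snapshot published by the central repository.
central = {"USD": "US Dollar", "EUR": "Euro"}
published = snapshot_digest(central)

local = LocalCopy()
print(local.apply(central, published))  # first delivery changes the local copy
print(local.apply(central, published))  # a repeat delivery is a no-op
```

Replacing the whole table is the simplest design and suits reference data because the tables are small. Sending only changed rows would be an optimization that adds complexity.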
Special interfaces: One of the objectives of central maintenance of reference data should be to
rationalize the design of all reference data tables in all systems and databases across the
enterprise. Yet after decades of systems being built as islands this is really not practical. If the
access mechanism and distribution mechanism described above are to work, legacy systems will
probably need special interfaces. Everything should be done to minimize the amount of work
needed to build, maintain and operate these special interfaces.
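In its simplest form, a special interface of this kind is a translation between the codes held in the central repository and the codes a legacy system already uses. The Python sketch below assumes a one-to-one mapping, which will not hold for every legacy system. The class name and the sample codes are invented for illustration.

```python
from typing import Dict

class LegacyCodeMap:
    """Translate between central repository codes and one legacy system's codes."""

    def __init__(self, central_to_legacy: Dict[str, str]) -> None:
        # Require a one-to-one mapping so translation is unambiguous in both directions.
        if len(set(central_to_legacy.values())) != len(central_to_legacy):
            raise ValueError("two central codes map to the same legacy code")
        self._to_legacy = dict(central_to_legacy)
        self._to_central = {legacy: central for central, legacy in central_to_legacy.items()}

    def to_legacy(self, central_code: str) -> str:
        """Central code -> legacy code. Raises KeyError if the code is unmapped."""
        return self._to_legacy[central_code]

    def to_central(self, legacy_code: str) -> str:
        """Legacy code -> central code. Raises KeyError if the code is unmapped."""
        return self._to_central[legacy_code]

# Hypothetical mapping for a legacy system that uses numeric country codes.
country_map = LegacyCodeMap({"CA": "001", "FR": "002"})
print(country_map.to_legacy("FR"))
print(country_map.to_central("001"))
```

Raising an error for an unmapped code, rather than passing it through, keeps gaps in the mapping visible to the data stewards instead of hiding them in the legacy system.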
New systems: The data stewards responsible for reference data should have a role on every
project that creates or implements a new system or database. After all, reference data is found in
all business databases. They should attempt to utilize the functionality and data content of the
repository which they administer to satisfy the reference data requirement of the new system or
database. They should not permit new standalone reference data maintenance functionality to be
built.
Sustainability: The management of reference data as part of an enterprise's information
architecture can yield great savings, reduce exposure to risk, and improve data quality. If an
enterprise implements such a strategy it will need senior management commitment because
reference data has no natural community of users. Even after the strategy is implemented, senior
management will need to retain a commitment to it. There can be a fast payback, but there are
always pressures to fall back to poor practices when it comes to reference data. Senior
management should insist on quantifying the resources spent on reference data management,
and understanding what this is buying them. The data stewards have the obligation to provide
this information if they are to retain the support of senior management.
4. Conclusion
This paper has briefly reviewed what reference data is, how it differs from other classes of data
found in a database, what it consists of, what special issues pertain to it, and how it can be
managed. It should only be seen as a broad introduction, because there are many specific
problems that can arise when the details of an individual enterprise's situation are considered.
Furthermore, design decisions are always a balance between alternatives, and while there are
some choices that are always right, others depend on a given set of circumstances and the goals
that need to be met.
Any enterprise that wishes to take control of its reference data must first make the necessary
commitment. Such a decision should not be taken lightly because the commitment involved is
long-term; it is not for a standalone project, but rather to create an additional piece of
infrastructure in the Data Administration function of the enterprise.