White Paper: Managing the Organization's Reference Data

1. Putting Reference Data in Context

Reference data is the name for a special class of data found in all business databases and all organizations. We define it as follows:

Reference data is any kind of data that is used solely to categorize other data found in a database, or solely for relating data in a database to information beyond the boundaries of the enterprise.

Many Information Technology (IT) professionals recognize reference data, sometimes under different names such as lookup tables or constants. These names reflect the fact that reference data is usually held in small database tables, often consisting of just two columns, a code and a description. These tables typically contain relatively few rows. Unfortunately, the simple structure and small size of reference data tables often lead to their being overlooked by IT staff, even though it is recognized that the data they contain must be free from defects.

Sometimes reference data is confused with other kinds of data. If we look at a database, we can classify the layers of data found in it by function, as follows:

Metadata
Reference Data
Enterprise Structure Data
Transaction Structure Data
Transaction Activity Data
Transaction Audit Data

Metadata is, roughly speaking, data about data. Most commonly it is the business and technical definitions of tables and columns, though it may sometimes include a wealth of other information.

Reference data also has a wide scope, as it is used to categorize, classify, or otherwise qualify or constrain transaction data.

Enterprise structure data describes the structure of the enterprise, e.g. the organizational structure or the chart of accounts. This information is used to track business activities by responsibility.

Transaction structure data represents data that is required to create a framework within which transactions occur.
For instance, an application designed to sell something would need to record Product and Customer information before an Order could be processed.

Transaction activity data is what an operational system is built to record. It is often referred to simply as the "transaction" data, without a clear distinction between structure and activity. For an order processing system, for example, it is the orders themselves, together with payment and accounting information.

© Copyright, 2004. www.refdataportal.com

Transaction audit data records audit information about the individual transactions performed in an application. Very often this data is not recorded in a database, or the database management system takes care of it through automated logging.

Some IT staff include enterprise structure data and transaction structure data as part of reference data, but this is not helpful because these classes of data behave very differently from reference data.

The Different Kinds of Reference Data

Reference data may be a distinct class of data, but it is still very diverse. Some major kinds of reference data are:

Things that are not involved with the enterprise: Countries, currencies, and time zones are examples of things that exist but are not parties to the transactions that an enterprise carries out. Yet the enterprise still needs information about these things to process and report on its transaction data.

Classification schemes: Human beings can classify transaction information in a practically unlimited number of ways, depending on what they view as important about it. Classification schemes may be broadly accepted, like industry codes, or highly personal and transient.

Constant values: "Things" are typically described by codes and descriptions in reference data tables. Yet some reference data covers specific properties of these things. Tax rates, economic indicators, and currency exchange rates are good examples.
They are non-key attributes of reference data tables, and their values typically change over time.

Type codes: These define entity subtypes in a database. For instance, Employee Position may be Administrative or Professional. Type codes have values that are known when a database is designed.

Status codes: These control the life cycles of entities in a database. An order may be placed, filled, shipped, and received, and may then have its payment received and cleared. Each of these states has a different description and is usually represented by a code in a physical database table. Again, these codes have values that are known at database design time.

There are other kinds of reference data, but these are the major classes. Although they often have similar structures (tables consisting of codes and descriptions), they have quite different functions and behaviors.

2. Issues with Reference Data

Reference data, although quite diverse, comes with its own unique set of issues. Any organization that wants to manage its reference data more efficiently needs to understand them. Only then is it possible to formulate a strategy for dealing with reference data.

Data quality: Reference data typically has wide scope. Individual values can be used widely within a single information system and across the different information systems of an enterprise, and they are often involved when data is exchanged with organizations external to the enterprise. Because individual values are used so widely, they must be free from defects of any kind.

External standards: Some reference data is represented by coding schemes created by standard-setting organizations in both the public and private sectors, e.g. the International Organization for Standardization (ISO) and Standard & Poor's. Yet many organizations are simply unaware of these standards and go about "reinventing the wheel" when setting up their own reference data.
This is very expensive, dooms them to repeating the same mistakes over and over again, and makes it difficult to exchange data with other organizations, or even within the same organization.

Choosing values for codes: Most reference data consists of tables with codes and descriptions. The question arises as to what values should be used for the codes. Sometimes the things being described have well-known acronyms. How should these be dealt with? Should meaningless sequence numbers be used instead? The trade-offs that affect this design decision are actually quite complex and need careful consideration.

Duplication of maintenance functionality: Many organizations still treat systems and databases as "islands" and have no overall information architecture. This often results in functionality to update, control, and report on the same reference data tables being built separately in different databases and systems. This is expensive for IT staff to create and maintain. It also makes the users of these systems devote time to keeping the tables up to date, often forcing them to research the same data issues over and over again.

Duplication of data: With the "island" approach, not only are the maintenance functionality and effort duplicated, but so is the data. This is really dangerous. There is a very high probability that the data will diverge over time across the different databases. If users are responsible for updating this data, they may well begin to use reference data tables for purposes for which they were not intended. This can add another dimension to a "Tower of Babel" within an enterprise's information systems.

Hard-coding: Programmers have a tendency to do everything in program code. This can include "hard-coding" the definition of reference data. When this happens, reference data values are never found in database tables, but exist only in different locations in program code.
This can create a maintenance nightmare. The reasons cited for this approach are usually "saving time" or "because it is more efficient". The real reasons are bad design and poor management.

Business rules: Unlike other kinds of data in a database, actual values of reference data can participate in business rules, for example: "If Customer Credit Rating Exceeds 6 then do the following…". As a result, changing or deleting reference data values can cause unanticipated problems. Business rules are usually hidden in program code. Even the modern tools that are starting to address business rules do not deal with the problems of reference data values in the rules.

Meaning: Unlike other kinds of data, individual reference data values have their own definitions. For example, in a table of countries "China" may exclude "Hong Kong". Classification schemes are particularly prone to this issue. People cannot classify things if they do not know the exact meaning of the different entries in the scheme. It is important to have this kind of information readily available.

Data sharing: This is perhaps the most important issue that has to be dealt with in the management of reference data. Since the same reference data is used within different systems of the same enterprise, and even across different enterprises, there is considerable demand to share it. The other issues described previously all have to be resolved before data sharing can be addressed. Even then, mechanisms for sharing the data and for managing that sharing have to be put in place.

3. Managing Reference Data

If an organization is serious about managing its reference data, it needs to put a strategy in place to do so. Such a strategy should have the following facets:

Central maintenance: The data quality issues to which reference data is prone mean that it should be maintained centrally. This may not apply to some personal and transient classification schemes, but it will apply to most reference data.

Data stewardship: If there is to be central maintenance of reference data, some persons will have to be responsible for it. These persons do not own the data, but they maintain it for the organization, and are known as data stewards. They must have the authority to match their mandate and the resources to carry it out.

Repository: Central maintenance also implies one location where the reference data is stored. Not only must the reference data be stored there, but also its associated metadata, e.g. the meanings of values and the sources from which they were taken.

Access mechanism: A means of accessing the central repository of reference data needs to be put in place. At a minimum this should allow users to review what is in the repository and to understand the reference data kept there. The Internet is an obvious way of doing this. It may also be possible to have remote databases access the repository to obtain reference data values.

Distribution mechanism: Even if remote systems can access the central repository to obtain reference data, performance issues may require these systems to have their own copies of reference data. There then needs to be an efficient and secure mechanism for distributing reference data from the central repository to the remote systems. It is possible to use the Internet to perform this distribution task.

Special interfaces: One of the objectives of central maintenance of reference data should be to rationalize the design of all reference data tables in all systems and databases across the enterprise. Yet after decades of systems being built as islands, this is really not practical. If the access mechanism and distribution mechanism described above are to work, legacy systems will probably need special interfaces. Everything should be done to minimize the amount of work needed to build, maintain, and operate these special interfaces.
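
The repository, distribution, and special-interface facets described above can be pictured with a small sketch. The Python below is illustrative only and is not taken from any particular product: the class names (`ReferenceValue`, `Repository`, `LegacyInterface`), the `country` table, the meaning texts, and the legacy code `"UK"` are all hypothetical. It assumes a repository that stores each value together with its meaning and source, hands out copies for remote systems, and lets a legacy system keep its own codes through a mapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceValue:
    """One row of a reference data table plus its metadata (hypothetical layout)."""
    code: str
    description: str
    meaning: str   # definition of the individual value
    source: str    # where the value was taken from


class Repository:
    """Single central store of reference data tables (illustrative only)."""

    def __init__(self):
        self._tables = {}

    def put(self, table, value):
        # Central maintenance: every update goes through this one place.
        self._tables.setdefault(table, {})[value.code] = value

    def get(self, table, code):
        return self._tables[table][code]

    def snapshot(self, table):
        # Distribution: hand a remote system its own copy of a table.
        return dict(self._tables[table])


class LegacyInterface:
    """Special interface: translates a legacy system's codes to central ones."""

    def __init__(self, repository, table, legacy_to_central):
        self._repository = repository
        self._table = table
        self._legacy_to_central = legacy_to_central

    def lookup(self, legacy_code):
        central_code = self._legacy_to_central[legacy_code]
        return self._repository.get(self._table, central_code)


repo = Repository()
repo.put("country", ReferenceValue(
    "GB", "United Kingdom", "Assumed definition, for illustration", "ISO 3166-1 alpha-2"))
repo.put("country", ReferenceValue(
    "CN", "China", "Assumed here to exclude Hong Kong", "ISO 3166-1 alpha-2"))

# A remote system takes its own copy; a legacy system keeps using "UK".
local_copy = repo.snapshot("country")
legacy = LegacyInterface(repo, "country", {"UK": "GB"})
```

In this sketch `legacy.lookup("UK")` resolves to the central `GB` entry, and `local_copy` is not affected by later updates to the repository, which is why a real distribution mechanism would also need some way of refreshing remote copies.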

New systems: The data stewards responsible for reference data should have a role on every project that creates or implements a new system or database. After all, reference data is found in all business databases. They should try to use the functionality and data content of the repository they administer to satisfy the reference data requirements of the new system or database. They should not permit new standalone reference data maintenance functionality to be built.

Sustainability: The management of reference data as part of an enterprise's information architecture can yield great savings, reduce exposure to risk, and improve data quality. An enterprise that implements such a strategy will need senior management commitment, because reference data has no natural community of users. Even after the strategy is implemented, senior management will need to retain its commitment. There can be a fast payback, but there are always pressures to fall back into poor practices when it comes to reference data. Senior management should insist on quantifying the resources spent on reference data management and on understanding what those resources are buying. The data stewards have an obligation to provide this information if they are to retain the support of senior management.

4. Conclusion

This paper has briefly reviewed what reference data is, how it differs from other classes of data found in a database, what it consists of, what special issues pertain to it, and how it can be managed. It should be seen only as a broad introduction, because many specific problems can arise when the details of an individual enterprise's situation are considered. Furthermore, design decisions are always a balance between alternatives, and while some choices are always right, others depend on a given set of circumstances and the goals that need to be met.

Any enterprise that wishes to take control of its reference data must first make the necessary commitment. Such a decision should not be taken lightly, because the commitment involved is long-term; it is not to a standalone project, but to the creation of an additional piece of infrastructure in the Data Administration function of the enterprise.