Enterprise Data Architecture

October 28, 2004
Revision: 1.0
Status: Draft

Prepared by
Ralph C. Alderson
Senior Consultant
Third Coast Software Foundry

Status:
Working indicates the document may not be complete or reliable
Draft indicates the author considers the document accurate and complete, still under coordination
Recommended indicates the author and core team consider the document ready for approval
Published indicates the document is reviewed, approved and ready for use

What is Data Architecture?

Data architecture is where the rubber meets the sky.
– Neil Snodgrass, Data Architecture Consultant, Answerthink

Even among IT practitioners, there is a general misunderstanding (or perhaps more accurately, a lack of understanding) of what Data Architecture is, and what it provides. In general, Data Architecture is a master plan of the enterprise data locations, data flows, and data availability. It is a conceptual infrastructure to support data quality, data stewardship, data integration, data migration, and system collaboration. This infrastructure embodies a set of guidelines and standards which ensure that the data assets are managed appropriately, and that they conform to sanctioned principles for stewardship and quality.

Data Architecture is the discipline of designing, creating, and maintaining this infrastructure. It must accommodate the data and information needs of the company and do so in a manner which promotes high reliability and easy data integration among applications and data repositories.

The most visible and tangible product of effective Data Architecture is a reporting environment that 1) provides a single version of the corporate “truth,” 2) allows business analysts to discover new insights, and 3) allows business executives and corporate decision makers to derive corporate strategies and actionable tactics from their data.
Such a reporting environment usually entails one or more data warehouses, and one or more departmental or “competency” data marts [1]. The architecture describes how data flows from corporate transactions, through the various layers of transformation and integration, through operational data stores, all the way to the decision-support applications. It is an infrastructure that, when properly implemented (i.e., when it follows the architecture and conforms to the corporation’s suite of “best practices”), guarantees the three benefits of the reporting environment described above.

As the humorous quote at the beginning of this paper indicates, Data Architecture often seems somewhat nebulous, as it has no physical manifestation (in the way an executable program manifests programming code, or a relational database manifests an entity relationship data model). Data Architecture has no programmatic instantiation and exists only as standards, policies, and corporate “best practices.” It resides only in the artifacts (text documents and pictorial diagrams) which describe it, and in the “tribal knowledge” of the enterprise. The artifacts which describe it are the blueprint of the architecture, and serve a similar function for building reliable systems as a building architect’s blueprint serves for building a house.

[1] A data warehouse can be built without Enterprise Data Architecture, but it is highly inadvisable. Likewise, a data architecture can exist for an enterprise that is not doing any data warehousing, but it provides the optimal benefit to the corporation when it establishes the blueprint for integrating enterprise data into a data warehouse.
A corporation’s Data Architecture is a mirror of the data and information generated and captured by the enterprise in order to do its business. It describes the business rules and the concepts which are critical for the enterprise to operate efficiently. It offers a “seal of approval” on the reliability of the data, and guarantees that corporate decision makers can make well-informed, fact-based decisions on policies and strategies. It provides a sanctioned plan for stewardship of the data assets of the corporation, and determines the rules on how data gets created, how it moves through the enterprise, and how it gets consumed.

Indeed, Data Architecture influences everything in the enterprise which “touches” the data. It motivates data policies, influences corporate goals, enables strategies for achieving those goals, and validates the tactics which implement those strategies. It encompasses all systems and programs in which data originates, in which data is transformed and/or cleansed, and in which data is migrated to, or integrated with, other systems.

By standardizing data definitions, data formats, and the acceptable storage, integration, and usage of the data, the architecture prepares the environment for data management, and it is by invigorating these standards that the powerful benefits of the Data Architecture (high data quality and unquestionable data reliability) are enabled. Also, by dictating how data gets integrated, migrated, cleansed, and transformed, Data Architecture provides a plug-and-play framework for data warehousing.
Figure 1. A Typical Data Architecture Environment

What are the artifacts and deliverables of Data Architecture?

Since Data Architecture is a conceptual and abstract discipline, it has no simple representation that one can point to and say, “That’s Data Architecture.” It encompasses everything a company captures and maintains in the realm of data and information (see Figure 1). Having such a broad scope and impact, and such a high level of abstraction, it requires some imagination to conceive and understand what it is all about.

The one artifact that comes closest to capturing the essence of Data Architecture is a high-level data-flow diagram (Figure 2). But data flow is only one aspect of a complete architecture. There must be rules about how data flows or migrates through the information systems, and there must be a crystal clear understanding throughout the IT realm of which subject areas and concepts are important to the company’s business model. In addition, there must be an enterprise-wide agreement as to the semantics of those concepts in all possible contexts (within the business model).

Figure 2. Data Flow Diagram

A fundamental goal of the architecture is to have absolutely unquestionable data quality and reliability.
Semantic clarity is the first step, but disciplined stewardship of the data, the concepts, and the business rules is the only way to move forward, past that first step, to achieve a robust and effective architecture. In order to complete the picture, and implement the type of data environment which an ideal Data Architecture provides, there must be:

- Inspired analysis and design of the overall architecture
- Corporate sanction of the architecture’s goals
- Enforced compliance with the architecture’s rules

The following deliverables and artifacts of the Data Architecture are designed to ensure that these three principles are delivered to the information systems which are destined to utilize the architecture. This is not a mandatory or an all-inclusive list. It is simply a recommended methodology, and does not preempt a different approach utilizing other documents and principles to achieve the desired environment.

Business Concept Definitions

Having corporate-sanctioned definitions for the concepts which animate a company’s business model is the single most important element of Data Architecture. None of the major benefits of the architecture will accrue without them. Yet business concept definitions are often overlooked (or worse, purposely ignored) because, to many IT practitioners, they seem painfully like “documentation for documentation’s sake.” Nothing within the realm of enterprise data could be further from the truth.

Semantic clarity is mandatory for getting the full utility and all of the collateral benefits of enterprise Data Architecture. Unless all systems and programs agree on a single definition for each and every critical business concept, there cannot be any reliable data migration, data integration, data cleansing, or data warehousing. Analysts and executives who query the data warehouse(s) would have little or no reason for confidence in the accuracy of the information which is presented to them.
Data Stewardship Agreements

Stewardship is a vital element of any Data Architecture. Data stewards ensure the quality, accessibility, and protection of the data, and define the data standards (data definitions, concept definitions, data formats, and data domains). They are the guardians and maintainers of the Data Architecture. They ensure that there is a single data store of record (DSOR) for the vertical stripe of data which they are stewarding, and they prevent non-conforming data silos from participating in the architecture.

Stewardship agreements are corporate documents that grant stewardship responsibilities to a person, initiative, or department, and need the advice and consent of the CIO or a CIO designate. Stewards are typically positioned at a high level of corporate responsibility, e.g., V.P. or Director.

Data Sharing Agreements

Data sharing agreements are corporate documents that describe the data, where it is located, who protects it, and who can access it. Most data should be freely available throughout the enterprise. But some sensitive data needs to be restricted. The data sharing agreement, signed by all interested parties, describes who can access the restricted data, when it is available, and how the access is accomplished.

Even data that is not sensitive needs to be certified as “sharable.” Entities within the enterprise that want access to the DSOR for a concept need to be certified as conforming to the standards maintained for that concept (see Data Standards, below).
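The access rules that a data sharing agreement records can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the class name, its fields, and the function `may_access` are hypothetical and are not part of any methodology or tool named in this paper. It assumes that a party may reach the DSOR only if the data is certified as sharable and the party is certified as conforming, and that restricted data additionally requires the party to be named in the agreement.

```python
from dataclasses import dataclass, field


@dataclass
class SharingAgreement:
    """Hypothetical record of one data sharing agreement for one concept."""
    concept: str               # business concept the agreement covers
    location: str              # where the data store of record resides
    steward: str               # who protects the data
    sharable: bool = False     # data has been certified as "sharable"
    restricted: bool = False   # sensitive data with limited access
    authorized: set = field(default_factory=set)  # parties named for restricted data
    certified: set = field(default_factory=set)   # parties conforming to the standards


def may_access(agreement: SharingAgreement, party: str) -> bool:
    """Return True when `party` may access the DSOR under this agreement."""
    # Even non-sensitive data must first be certified as sharable.
    if not agreement.sharable:
        return False
    # Every consumer must be certified as conforming to the concept's standards.
    if party not in agreement.certified:
        return False
    # Restricted data is open only to parties named in the agreement.
    if agreement.restricted and party not in agreement.authorized:
        return False
    return True
```

With an agreement for a restricted concept that names only one department as authorized, a second department that is certified but not named would be refused, which mirrors the distinction the text draws between certification and restricted access.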
Data Usage Models (Stewardship Matrix)

Anyone who has been in Information Systems very long has heard of, and probably used, a diagram known as a CRUD matrix. CRUD stands for (C)reate, (R)ead, (U)pdate, and (D)elete, and the matrix details the data usage for an application, a system, or an initiative. The Data Usage Model (sometimes called a Stewardship Matrix) extends the old-fashioned CRUD matrix so that one can see, at a glance, not only how each application interacts with a given concept, but also which application data store is the data store of record (DSOR) for each concept. The system which has the DSOR for a concept inherits the stewardship responsibilities for that business concept, and is obliged to:

1. Get enterprise-wide agreement on a definition for that concept
2. Document all of the business rules that pertain to the concept
3. Determine who (which systems and employee types) can see and use that data (via the Data Sharing Agreements discussed above), and
4. Maintain the integrity of the concept (by setting enterprise-wide data definitions, data formats, and data domains for the concept).

Data Standards (Definition, Format, and Domain)

Data definitions are often captured in modeling tools like ERwin, and then propagated to the physical database in the form of comments on tables, columns, and relationships. They can quite frequently come directly from the Business Concept Definition document (see above). The DSOR for a concept contains the sanctioned definitions which relate to the concept and its attributes. Similarly, the DSOR should be considered the source of the sanctioned format for the data attributes of a concept, and of the valid domain values for that concept.

An important criterion in data sharing is that all parties which want to use the data define that data in exactly the same way: in entity and attribute definitions, in format, and in domain values. This is crucial to having certifiably correct reports and a high level of data quality.
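The Data Usage Model and the conformance requirement described above can be sketched together in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the system names, the sample concept, the sample standards, and the helpers `dsor_for` and `conformance_gaps` are all hypothetical. It assumes each concept has exactly one DSOR and that a consuming system conforms only when its definition, format, and domain all equal those of the DSOR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    """One cell of the stewardship matrix: CRUD letters plus a DSOR flag."""
    crud: str            # any subset of "CRUD", e.g. "R" or "CRUD"
    dsor: bool = False   # True when this system holds the data store of record


@dataclass(frozen=True)
class Standard:
    """A concept's data standard as one system implements it."""
    definition: str
    data_format: str
    domain: frozenset


# Hypothetical matrix: concept -> system -> usage.
USAGE_MODEL = {
    "Customer": {
        "CRM": Usage("CRUD", dsor=True),
        "Billing": Usage("RU"),
        "Warehouse": Usage("R"),
    },
}

# Hypothetical standards, keyed by (concept, system).
STANDARDS = {
    ("Customer", "CRM"): Standard(
        "A party that has purchased a product", "CHAR(10)",
        frozenset({"ACTIVE", "INACTIVE"})),
    ("Customer", "Billing"): Standard(
        "A party that has purchased a product", "CHAR(12)",
        frozenset({"ACTIVE", "INACTIVE"})),
}


def dsor_for(concept: str) -> str:
    """Return the single system flagged as DSOR for `concept`."""
    owners = [s for s, usage in USAGE_MODEL[concept].items() if usage.dsor]
    if len(owners) != 1:
        raise ValueError(f"{concept!r} needs exactly one DSOR, found {len(owners)}")
    return owners[0]


def conformance_gaps(concept: str, system: str) -> list:
    """List the facets on which `system` differs from the DSOR's standard."""
    sanctioned = STANDARDS[(concept, dsor_for(concept))]
    local = STANDARDS[(concept, system)]
    facets = ("definition", "data_format", "domain")
    return [f for f in facets if getattr(local, f) != getattr(sanctioned, f)]
```

In this sample data, `conformance_gaps("Customer", "Billing")` returns `["data_format"]`, because only the column format differs from the DSOR. An empty list would indicate that the system defines the concept exactly as the DSOR does.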
Where definitions, formats, or domains are different, it is hard to be sure that both sides of the data sharing are, indeed, talking about the same concept. Before a sharing agreement can be executed and sanctioned by the enterprise (with signatures of appropriate parties), one side must change and conform to the other, or both sides must change and use a negotiated settlement to remediate the differences.

Data Warehouse Artifacts

Data warehouses have many artifacts and deliverables. All of the artifacts and deliverables mentioned here for Data Architecture will be utilized in building a data warehouse.

Data Flow Diagrams

Many in Information Systems think of data flow diagrams (DFDs) as being equivalent to Data Architecture, as being The Architecture. DFDs are a vital tool for conveying the scope and boundaries of the architecture, but (as we hope we have demonstrated in this white paper) they are only a tool, and only one of many. DFDs describe how data flows throughout the enterprise: from creation of the data, through various layers of refinement, cleansing, and transformation, to the consumption of the data on reports, executive dashboards, or display screens. They are a key to documenting the overall architecture, and are a very useful starting place for the data mapping used by cleansing initiatives or for the ETLs which load the data warehouse.

Conceptual Models

Conceptual models are diagrams that summarize all of the critical and interesting concepts which are inherent in the business, and the relationships among them.
A very high-level conceptual model diagrammatically details only the subject areas (e.g., Finance, Human Resources, Products) of interest, and the relationships between subject areas and concepts. This type of model is called, naturally, a Subject Area Model.

The next lower level of detail is captured by a concept model (sometimes called a data planning model) which depicts each interesting concept and the relationships among the concepts. One method of portraying this model is with an un-attributed entity relationship (ER) model. Indeed, most (if not all) of the business concepts will end up being fully attributed entities in one or more logical models which support one or more transactional systems. The relationships between concepts in this type of model conform naturally enough to the concept of relationships in ER modeling.

Another very effective technique for conceptual modeling is a formal modeling notation known as Object-Role Modeling (ORM). Object-Role Modeling was designed for this purpose, and allows useful insights into the concepts and relationships which might be overlooked using the traditional ER modeling notation. ORM is sometimes eschewed as being too tedious, but this is due mostly to a lack of good graphical tools designed to support the technique.

Logical Models

If you have undertaken the discipline of creating conceptual models, you will find that the logical models evolve from the conceptual ones quite naturally. The major concepts become entities, and many of the minor ones become attributes for those entities.

Physical Models

Physical models are dependent on the choice of DBMS used, and are in the domain of the DBAs. Whereas the physical representation is definitely an artifact of the architecture, its main purpose is to document where each concept resides (which DBMS and which database) and how the concepts and entities had to be modified (if at all) in order to become columns in tables.
The physical residence of business concepts is an important piece of information for Data Sharing Agreements.

Metadata Standards and Maintenance

Metadata is the sum of all of the corporate knowledge about the corporation’s business processes and the data that qualifies and quantifies them. There are two types of metadata: technical and business.

Technical metadata is used by Information Technology practitioners to standardize, categorize, and define the data structures used to capture information in databases. Technical metadata describes the physical properties of the data, how it relates to other data, and the mappings between sources and destinations of data that is moving through the system(s). It is invaluable for standardizing the data formats, definitions, and domains across systems.

Business metadata is used to guide the system users (data consumers) through the data and the problems they are trying to solve with it. It provides, on a fundamental level, basic descriptive information for the data fields. At a more robust level, it provides the foundation for understanding the content and source of the information. The business metadata provides a conceptual context for the technical metadata, and is often undocumented, remaining only as “tribal knowledge.” Accurately capturing and standardizing business metadata is always an important challenge for Data Architecture.

Does my company need Data Architecture?

It’s hard to imagine a company that wouldn’t benefit from a well-designed and robust Data Architecture, but for some companies it is absolutely critical.
Here are typical circumstances under which a formal Data Architecture is mandated for a company:

- It is building (or anticipates building) a data warehouse (or data marts) – mandated to remediate potential data quality and data reliability issues.
- It is building (or anticipates building) an operational data store (ODS) – potential data integration and data quality issues.
- It is pursuing a Six Sigma strategy – data quality and reliability issues.
- It is pursuing ISO certification – data quality and reliability issues.
- Enterprise data is used to analyze operations:
    o To discover marketplace opportunities – data quality and reliability issues.
    o To create marketing strategies – data quality and reliability issues.
    o To fine tune and optimize operations – data quality and reliability issues.
- Enterprise data is used as the basis for high-level decision-making – data quality and reliability issues.
- It believes that enterprise data is a corporate asset that needs to be leveraged and protected – data stewardship issues.
- It wants to get “the best and the most from its enterprise data” © – data quality, reliability, integration, and stewardship issues.

What are the benefits that Data Architecture provides?

At the very least, Data Architecture provides a high-level map of the data topology for an enterprise. It describes how the data originates, where it resides, where it migrates, what transformations are applied to it to cleanse and standardize it, and what it means (the semantics).
At its best, it goes way beyond this simple documentation, and becomes an active principle that lives within the data, energizing and leveraging it in a multitude of ways. The data becomes an organic corporate asset that invigorates the enterprise and provides a clear path to the realization of the corporate vision, goals, and strategies.

To someone who has never experienced a robust and inspired Data Architecture in action, this may sound a little like poetic license or hyperbole. But it truly is not. Metaphors aside, corporate personnel who discover the synergistic benefits of Data Architecture for the first time are often amazed that they ever functioned without it. Data that once was suspect or needed “tweaking” in order to balance the books becomes as reliable as “Old Faithful.” Analysts who once complained that the unreliability of the data made their analysis contrived and incomplete become ardent converts, clamoring for more bandwidth to allow their heuristics to discover all of the exciting possibilities that are contained in their newly invigorated data warehouse. Data warehouse developers who previously spent many hours of overtime trying to shoe-horn data from legacy systems into the warehouse happily discover that ETLs and data maps become self-revealing, and the data warehouse is found to be the software equivalent of “plug-and-play.” Executives who had struggled to find meaning in their daily, weekly, and monthly reports now discover nuggets of information which inspire new visions, and blaze new trails to outsmart and outmaneuver the competition.

Because of guaranteed data reliability and the framework which enables death-defying data transformations, Data Architecture can have a positive impact on virtually every operational function, every department, and every profit center. The artifacts describe how this should happen: Data Stewards enable semantic clarity and enforce the standards.
Data analysts and planners set the policies and discover the vision. Program and project managers instantiate the ideals. Data integrators become empowered to fold all data into a single vocabulary, whether they are dealing with existing disparate systems, new system development, or third-party packaged systems. And everyone throughout the enterprise finds a new appreciation and respect for the data that pulses through the architecture’s veins.

OK, it sounds like we could benefit from formal Data Architecture. How do we proceed?

An experienced data architect can analyze and document the current data environment, determine which aspects need refinement, enhancement, or extension, and devise a road map for achieving a best-practice Data Architecture which will allow the enterprise to get the best and most from its data ©. If no such architect exists in-house, there are many qualified consultants who can provide the experience, acumen, and level of expertise to design a plan for achieving the desired benefits. A data architect may be utilized to analyze the current environment only, or to provide a complete architecture implementation, from capturing metadata, to defining concepts, to implementing a data warehouse.

Optimal Data Architectures are flexible and can be implemented in stages. The key is to have a high-level plan which accounts for the goals and aspirations of the enterprise. Once that is in place, the benefits of Data Architecture can be prioritized and implemented in a seamless, phased-in approach that accommodates the specific needs of any organization.

Ralph C. Alderson is a Senior Consultant with Third Coast Software Foundry, Austin, Texas, who specializes in Data Architecture and data-related issues.