Data Cleansing Strategy

Table of Contents
1. Executive Summary
   Scope of data cleansing strategy
   Inputs for developing the data cleansing strategy
2. Data Lifecycle Management
3. Data Quality Framework
4. Define Scope
   Business Relevancy Logic
5. Assess Data Quality: Profiling Master Data
6. Assess Data Quality: Analyze
   Duplicate Records
7. Execute Data Cleansing
   Cleansing Governance
   General Cleansing Guidelines
8. Monitor Quality
9. Data Cleansing Team
   Roles and Responsibilities
10. Summary
11. Appendix
   Assumptions
   Potential Key Risks

Document control

Document version history:
Changed by | Date | Reason for change | Version
Carolyn Connolly | 9/26/2022 | Initial draft creation | 1.0
Carolyn Connolly | 9/30/2022 | Working Session 1 with JoeZ, MihirR, and Team | 2.0

Reference documents:
Document Name | Description
Data Conversion Strategy | Document detailing the Data Conversion Approach for RPC
Historical / Archival Data Strategy | Document detailing the Historical Data / Archival Strategy to ensure reporting and operational business requirements are met

Document approvals:
Reviewed by | Role | Date | Signature
Joe Zimolong | RPC Data Lead | |
Mihir Rajopadhye | RPC Data Lead | |

1. Executive Summary

Background
To drive business process optimization and best practices, and to support ongoing growth initiatives, RPC has decided to implement an SAP Central Finance (CFIN) S/4 HANA system. The intent is to consolidate all finance businesses, including the SAP and non-SAP ERP systems used in some countries, onto a single Central Finance S/4HANA Private Cloud instance. This will align business processes while also supporting the unique requirements of RPC's diverse businesses.

To successfully implement the SAP CFIN S/4 HANA system, the data required to support both the business process requirements and the informational requirements must be accurate and available in SAP S/4 HANA. This includes master data, reference data, transactional data, and historical data (as necessary). RPC currently operates various instances of ECC 6.0, MS D365, MS AX, and a handful of other non-SAP systems around the globe. It has been recognized that conformity and consistency with global standards and business best practices is lacking across the various source systems.

As part of RPC's SAP Central Finance implementation, Data Cleansing is crucial: it evaluates the data required for conversion to the SAP CFIN S/4 HANA and MDG systems and makes the changes necessary to ensure the data meets the quality, design, and target requirements defined by the business. Cleansing will be performed within the Data Quality Framework, which manages the data quality/integrity requirements of the SAP CFIN S/4 HANA project. The purpose of Data Lifecycle Management is to ensure the quality, consistency, and completeness of all data types for their ultimate consumption in the generation of business transactions and reporting.

The purpose of this document is to outline the Data Cleansing Strategy only for objects in scope for Release 1.

The pyramid below shows the build-up from Data, through Operations Execution and Planning, to Executive Planning and Reporting. Data Quality is the foundational requirement for all the higher levels to function. This Data Cleansing Strategy document focuses on the process to create the foundational layer of Data Quality that supports Operational Performance.
Scope of data cleansing strategy
The scope of the Data Cleansing Strategy is to address the process of cleansing the master data in scope, such as Customer and Material, for the SAP CFIN Galileo Project. Data cleansing is an ongoing process, as data is constantly created. To best maintain cleansing requirements, data governance rules aid in keeping data clean and in line with defined standards.

Reporting information is not within the scope of this Data Cleansing Strategy. That is addressed in BR#8: Financial, Management & Statutory Reporting Strategy.
Data governance information is not within the scope of this Data Cleansing Strategy. That is addressed in BR#10: Master Data Governance Strategy.
System architecture and interactions across systems once the project is live are not within the scope of this Data Cleansing Strategy. That is addressed in BR#17: Data Solution Architecture.

Inputs for developing the data cleansing strategy
During the Big Rocks Phase of the CFIN project, the following topics are being incorporated as part of the overall Data Cleansing Strategy:
● Current legacy ERP source systems
● Current RPC MDM systems (if any variance from above)
● Clarkston analysis / findings
● Initial list of master data objects for the target SAP CFIN S/4 HANA system implementation
● Overview of current RPC initiatives around data storage / architecture

2. Data Lifecycle Management
Data Lifecycle Management looks at data migration from a holistic approach, not just the ETL process. It is the process by which the data conversion strategy and its activities in support of the SAP CFIN S/4 HANA project will be conducted and managed. It is tailored to meet the specific needs of RPC, but focuses on leading practices for quality, accuracy, and complete transformation of the source data across all the data types. For the purposes of this document, Profiling/Cleansing will be highlighted, whereas the remaining areas are captured in depth in the Data Conversion Strategy deliverable. Below is a high-level description for reference.

Data Lifecycle Management
The following are the components of the Data Lifecycle Management methodology and approach:
● Source – The source systems, external databases, and spreadsheets that will provide the input data for the target system (this includes identifying the relevance/extraction logic of the source data)
● Requirements – The specific data elements, fields, records, transformation rules, security/encryption requirements, and validations that are needed based on the functional requirements and detail design of CFIN S/4 HANA
● Profiling – The activity of profiling the source master data records to determine where there are requirements for cleansing and/or transformation in order to ensure compliance with the CFIN S/4 HANA instance configuration, enforced through data quality rules, cleansing, and transformations (i.e., record distribution, field usage, min/max values, etc.)
● Cleansing – Changing, standardizing, formatting, and consolidating the data to comply with the data standards defined as part of design and the CFIN S/4 HANA system requirements (e.g., duplicates, address standardization; a simple illustration follows this list)
● Mapping – Source-to-target field mapping per data object, capturing the transformation logic of the new target CFIN S/4 HANA system (i.e., direct mapping, default, blank, transformation logic, etc.)
● Extraction, Transformation, & Load (ETL) – Automated/repeatable queries to extract, transform, and load the data, following the source-to-target mapping requirements and incorporating data cleansing, validations, and reconciliations through the process to fit the target CFIN S/4 HANA system
● Data Verification and Validation – Process performed before/during/after each Mock Cycle to confirm and verify that the data loaded into the SAP CFIN S/4 HANA system complies with the defined requirements (pre/post load validations and reconciliation)
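As a simple illustration of the Cleansing component above, the sketch below standardizes two customer text fields and flags potential duplicate records. It is a minimal example only; the column names, match key, and normalization rules are assumptions for illustration, not the agreed RPC cleansing standards, which will follow the data standards defined during design.

```python
# Minimal, illustrative cleansing sketch.
# Assumption: column names and rules are hypothetical, not RPC's agreed standards.
import pandas as pd

def standardize(value: str) -> str:
    """Trim, collapse internal whitespace, and upper-case a text field."""
    return " ".join(str(value).split()).upper()

def flag_duplicates(customers: pd.DataFrame) -> pd.DataFrame:
    """Standardize name/city and flag records sharing the same match key."""
    out = customers.copy()
    out["name_std"] = out["name"].map(standardize)
    out["city_std"] = out["city"].map(standardize)
    match_key = out["name_std"] + "|" + out["city_std"]
    out["potential_duplicate"] = match_key.duplicated(keep=False)
    return out

# Hypothetical usage with a tiny customer master extract
customers = pd.DataFrame({
    "customer_id": ["1001", "1002", "1003"],
    "name": ["Acme  Corp", "ACME CORP", "Globex Ltd"],
    "city": ["Dallas", "dallas ", "Berlin"],
})
print(flag_duplicates(customers)[["customer_id", "name_std", "potential_duplicate"]])
```

In practice, equivalent standardization and match rules would be implemented in the cleansing/ETL tooling (for example, SAP Data Services) rather than in stand-alone scripts.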
Below is an overview of the E2E Migration Process. It is captured as part of the Data Conversion Strategy and is shown here for reference to highlight how cleansing is integrated into the overall process.

3. Data Quality Framework
The objective of the Data Cleansing Strategy is to describe the Data Quality Framework and to define and cleanse the RPC source data that is required in the SAP Central Finance instance. The Data Quality Framework includes the following steps:
● Define Scope – Determine the scope of data to be cleansed, including the relevancy logic applied to that data.
● Assess Data Quality – Profile, analyze, review, and assess the current level of data quality in order to define the cleansing required.
● Execute Cleansing – With the defined business rules / cleansing logic captured, assemble a cleansing team and/or process to execute the data cleansing.
● Monitor Quality – Continuously monitor the cleansing activities to ensure that (1) data already cleansed stays clean, (2) the business is not continuing to generate dirty data (stop it at the source), and (3) newly created data is either validated as clean according to the governance standards or included in the cleansing process.

Below is an overview of the execution steps across the Data Quality Framework.

4. Define Scope
Cleansing is a continuous activity, as data is constantly created and business needs change. As part of Release 1, this document will focus on the data objects that are known to be in scope. Below are the data objects that form the initial list of master data to be considered for cleansing during Release 1:
● Material Master (Basic View only)
● Customer Master (Basic View only)

Business Relevancy Logic
Business relevancy logic dictates that only required data is cleansed. RPC, along with the Data Team and with support from the Functional Process Teams, will define the extraction rules that determine the data in scope for cleansing. These rules will also be applied throughout the project as part of the Migration Cycles, and this logic should be validated by RPC each time to ensure the accurate scope of records is included. The activity to collect and define the business relevancy logic will be performed during the Design Workshops for each object in scope and documented within the associated Functional Specification Document (FSD). In addition, application of the business relevancy rules will reduce the master data set to "active" records, thereby increasing focus for subsequent data profiling and cleansing efforts and reducing the data set for conversion to the target system.
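To make the business relevancy logic concrete, below is a minimal sketch of how extraction rules might reduce a customer master extract to "active", in-scope records. The field names, the "X" deletion-flag convention, and the 24-month activity window are hypothetical assumptions for illustration; the actual rules will be defined per object during the Design Workshops and documented in the FSDs.

```python
# Minimal, illustrative relevancy filter.
# Assumption: field names, the "X" deletion flag, and the 24-month window are hypothetical.
import pandas as pd

def apply_relevancy_rules(customers: pd.DataFrame,
                          postings: pd.DataFrame,
                          months_active: int = 24) -> pd.DataFrame:
    """Reduce a customer master extract to 'active', in-scope records."""
    cutoff = pd.Timestamp.today() - pd.DateOffset(months=months_active)

    # Rule 1: exclude records flagged for deletion in the source system
    not_deleted = customers[customers["deletion_flag"] != "X"]

    # Rule 2: keep only customers with at least one posting after the cutoff
    recent = postings[pd.to_datetime(postings["posting_date"]) >= cutoff]
    active_ids = set(recent["customer_id"])

    return not_deleted[not_deleted["customer_id"].isin(active_ids)]

# Hypothetical usage with small in-memory extracts
customers = pd.DataFrame({
    "customer_id": ["1001", "1002", "1003"],
    "deletion_flag": ["", "X", ""],
})
postings = pd.DataFrame({
    "customer_id": ["1001", "1003"],
    "posting_date": ["2022-06-15", "2015-01-10"],
})
in_scope = apply_relevancy_rules(customers, postings)
print(f"{len(in_scope)} of {len(customers)} customer records are in scope for cleansing")
```

In the project itself, equivalent filtering criteria would be applied by SAP Data Services during extraction, as described in the profiling section that follows.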
5. Assess Data Quality: Profiling Master Data
Master Data Profiling is the process of assessing the source master data to identify any data cleansing, gaps, or data transformations required to meet the SAP CFIN S/4 HANA system / design requirements. Data Profiling is not needed on every table from the source systems; it should be focused on higher-priority master data objects and on data that has known cleansing / harmonization requirements that impact business operations.

Leading practices for Data Profiling are listed below:
● Profiling executed on Master Data Objects only
● Completed at the table level, by field
● If relevancy is known, only profile known in-scope records
● Use of a robust tool to automate profiling
● Requires business involvement to accurately analyze profiling results

There are essentially three levels of profiling that will be required to support the cleansing of the data that will eventually be used for testing. The primary levels of profiling planned for use at RPC are unit and integration (the latter if transactional data is required as part of the relevancy logic).
● Unit Level – The simplest level of profiling and cleansing, for single functionality using small subsets of data (targeted to individual fields)
● String Level – Somewhat more complicated, as it strings together multiple functions; cleansing starts to take on the issues of data context as the data is shared across the functions from end to end (targeted to multiple fields in the same data domain)
● Integration Level – The most complex, as it involves all the data types and functions strung together in a complete response to a business event (targeted to multiple fields across different data domains)

Master Data Profiling can occur either with Production databases as its source or with a recent copy of Production. Because data is ever changing, it is best to execute profiling against the most recent data, so teams are not cleansing outdated data.

SAP tools to execute master data profiling:

SAP Data Services
● SAP Data Services has limited profiling capabilities and should primarily be used for ETL; its profiling functionality can support developers as part of their build activities
● SAP Data Services will be the engine used to extract the data directly from the source DB / ERP systems and to apply the relevancy logic / filtering criteria that reduce the data to in-scope records, so that the data can be ingested for in-depth profiling

SAP Information Steward
● SAP Information Steward is the preferred tool for executing the master data profiling
● The data/cleansing teams should use Information Steward to analyze the profiling results
● The master data relevancy rules will be used not only for profiling but will also be incorporated as part of the data conversion routines

SAP Information Steward has four functionalities; however, for the purposes of profiling, only Data Insights is used. Below are the various profiling results and their definitions:

Value Options
Min – The row that has the smallest number in a particular column
Max – The row that has the largest number in a particular column
Average – The value that represents the mean; the sum of all values in the column divided by the number of values
Median – The middle value of a given number of rows

String Length Options
Min – The row that has the smallest number of characters in a particular column
Max – The row that has the largest number of characters in a particular column
Average – The value that represents the mean; the sum of all string lengths in the column divided by the number of values
Median – The middle value of a given number of rows

Completeness Options
Nulls – The number of rows that are empty or have a null value for a particular column
Blanks – The number of rows that are empty or have a blank value for a particular column
Zeros – The number of rows that have a value of zero for a particular column

Distribution Options
Data – The number of unique values. For example, if your data includes USA and United Kingdom addresses, you would see 2 in the Country data distribution column
Pattern – The number of unique patterns. For example, date formats may be yyyy/mm/dd, mm-dd-yy, or yy.mm.dd, and so on
Word – The number of unique words. For example, if your data lists the country as the value United Kingdom, each instance of "United" and "Kingdom" is counted separately, whereas in Data distribution "United Kingdom" is counted as one instance
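To clarify how the profiling results defined above are derived, below is a minimal sketch that approximates the value, string-length, completeness, and distribution measures for a single column. It is illustrative only: the column and sample data are hypothetical, and in the project these metrics will come from SAP Information Steward rather than custom code.

```python
# Minimal, illustrative profiling sketch.
# Assumption: the column and sample data are hypothetical; SAP Information Steward
# produces these measures in the actual project.
import re
import pandas as pd

def profile_column(series: pd.Series) -> dict:
    """Approximate the profiling measures described above for a single column."""
    non_null = series.dropna()
    as_text = non_null.astype(str)
    numeric = pd.to_numeric(non_null, errors="coerce").dropna()
    lengths = as_text.str.len()

    def pattern(value: str) -> str:
        # Replace digits with 9 and letters with A to derive the value pattern
        return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))

    return {
        # Value options (meaningful for numeric columns)
        "value_min": numeric.min() if not numeric.empty else None,
        "value_max": numeric.max() if not numeric.empty else None,
        "value_average": numeric.mean() if not numeric.empty else None,
        "value_median": numeric.median() if not numeric.empty else None,
        # String length options
        "length_min": int(lengths.min()) if not lengths.empty else 0,
        "length_max": int(lengths.max()) if not lengths.empty else 0,
        "length_average": float(lengths.mean()) if not lengths.empty else 0.0,
        "length_median": float(lengths.median()) if not lengths.empty else 0.0,
        # Completeness options
        "nulls": int(series.isna().sum()),
        "blanks": int((as_text.str.strip() == "").sum()),
        "zeros": int((numeric == 0).sum()),
        # Distribution options
        "data": int(as_text.nunique()),
        "pattern": int(as_text.map(pattern).nunique()),
        "word": int(as_text.str.split().explode().nunique()),
    }

# Hypothetical usage against the country column of a customer master extract
countries = pd.Series(["USA", "United Kingdom", "USA", None, ""])
print(profile_column(countries))
```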
Below is a screenshot from SAP Information Steward, from the Data Insights tab, showing Basic Profiling at the table level. This view shows the profiling results across dimensions, while also highlighting the drill-down capability to review the field-level distribution of values and the records associated with them.

6. Assess Data Quality: Analyze
As part of the Clarkston Review, there were many findings across RPC's current landscape with regard to data quality, process, and governance. Below is a summarized view from their analysis: