Data Cleansing Strategy
Table of Contents
1. Executive Summary
   Scope of data cleansing strategy
   Inputs for developing the data cleansing strategy
2. Data Lifecycle Management
3. Data Quality Framework
4. Define Scope
   Business Relevancy Logic
5. Assess Data Quality: Profiling Master Data
6. Assess Data Quality: Analyze
   Duplicate Records
7. Execute Data Cleansing
   Cleansing Governance
   General Cleansing Guidelines
8. Monitor Quality
9. Data Cleansing Team
   Roles and Responsibilities
10. Summary
11. Appendix
   Assumptions
   Potential Key risks
Document control

Document version history:

Changed by         Date        Reason for change                               Version
Carolyn Connolly   9/26/2022   Initial draft creation                          1.0
Carolyn Connolly   9/30/2022   Working Session 1 with JoeZ, MihirR, and Team   2.0
Reference Documents:
● Data Conversion Strategy – Document detailing the Data Conversion Approach for RPC
● Historical / Archival Data Strategy – Document detailing the Historical Data / Archival Strategy to ensure Reporting and operational business requirements are met
Document approvals:

Reviewed by         Role            Date        Signature
Joe Zimolong        RPC Data Lead
Mihir Rajopadhye    RPC Data Lead
1. Executive Summary
Background
To drive business process optimization and best practices, and to support ongoing growth initiatives, RPC has decided to implement an SAP Central Finance (CFIN) S/4HANA system. The intent is to consolidate all finance businesses, including the SAP and non-SAP ERP systems used in some countries, onto a single Central Finance S/4HANA Private Cloud instance. This will align business processes while also supporting the unique requirements of RPC's diverse businesses.
To successfully implement the SAP CFIN S/4HANA system, the data required to support both the business process requirements and the informational requirements must be accurate and available in SAP S/4HANA. This includes master data, reference data, transactional data, and historical data (as necessary).
RPC currently operates various instances of ECC 6.0, MS D365, MS AX, and a handful of other non-SAP systems around the globe. There is a recognized lack of conformity and consistency with global standards and business best practices across these source systems. As part of RPC's SAP Central Finance implementation, Data Cleansing is crucial for evaluating the data required for conversion to the SAP CFIN S/4HANA and MDG systems and for making the changes necessary to ensure the data meets the quality, design, and target requirements defined by the business. Cleansing will be performed within the Data Quality Framework, which manages the data quality and integrity requirements of the SAP CFIN S/4HANA project.
The purpose of Data Lifecycle Management is to ensure the quality, consistency, and completeness of all data types for their ultimate consumption in the generation of business transactions and reporting. The purpose of this document is to outline the Data Cleansing Strategy only for the objects in scope for Release 1.
The pyramid below shows the build-up from Data, through Operations Execution and Planning, to Executive Planning and Reporting. Data Quality is the foundational requirement for all the higher levels to function. This Data Cleansing Strategy document focuses on the process to create the foundational layer of Data Quality that supports Operational Performance.
Scope of data cleansing strategy
The scope of the Data Cleansing Strategy is to address the process of cleansing the master data in scope, such as Customer and Material, for the SAP CFIN Galileo Project.
Data cleansing is an ongoing process, as data is constantly created. To best maintain cleansing requirements, data governance rules aid in keeping data clean and in line with defined standards.
Reporting information is not within the scope of the Data Cleansing Strategy. That is addressed in BR#8: Financial, Management & Statutory Reporting Strategy.
Data Governance information is not within the scope of the Data Cleansing Strategy. That is addressed in BR#10: Master Data Governance Strategy.
System architecture and interactions across systems once the project is live are not within the scope of the Data Cleansing Strategy. That is addressed in BR#17: Data Solution Architecture.
Inputs for developing the data cleansing strategy
During the Big Rocks Phase of the CFIN project, the following topics are being incorporated as part of the overall Data Cleansing Strategy:
● Current Legacy ERP Source Systems
● Current RPC MDM Systems (if any variance from above)
● Clarkston Analysis / Findings
● Initial list of master data objects for the target SAP CFIN S/4HANA system implementation
● Overview of current RPC initiatives around Data Storage / Architecture
2. Data Lifecycle Management
Data Lifecycle Management approaches data migration holistically, not just as an ETL process. It is the process by which the data conversion strategy and its activities in support of the SAP CFIN S/4HANA project will be conducted and managed. It is tailored to meet the specific needs of RPC while following leading practices for the quality, accuracy, and completeness of the transformation of the source data across all the data types. For the purposes of this document, Profiling/Cleansing will be highlighted, whereas the remaining areas are captured in depth in the Data Conversion Strategy deliverable. Below is a high-level description for reference.
The following are the components of the Data Lifecycle Management methodology and approach:
● Source – The source systems, external databases, and spreadsheets that will provide the input data for the target system (this includes identifying the relevancy/extraction logic of the source data)
● Requirements – The specific data elements, fields, records, transformation rules, security/encryption requirements, and validations that are needed based on the functional requirements and detailed design of CFIN S/4HANA
● Profiling – The activity of profiling the source master data records to determine where cleansing and/or transformation is required in order to ensure compliance with the CFIN S/4HANA instance configuration enforced through data quality rules, cleansing, and transformations (i.e., record distribution, field usage, min/max values, etc.)
● Cleansing – Changing, standardizing, formatting, and consolidating the data to comply with the data standards defined as part of design and the CFIN S/4HANA system requirements (i.e., duplicates, address standardization, etc.); a minimal illustrative sketch follows this list
● Mapping – Source-to-target field mapping per data object, capturing the transformation logic of the new target CFIN S/4HANA system (i.e., direct mapping, default, blank, transformation logic, etc.)
● Extraction, Transformation, & Load (ETL) – Automated/repeatable queries to extract, transform, and load the data, following the source-to-target mapping requirements, incorporating data cleansing, validations, and reconciliations throughout the process to fit the target CFIN S/4HANA system
● Data Verification and Validation – Process performed before/during/after each Mock Cycle to confirm and verify that the data loaded into the SAP CFIN S/4HANA system complies with the defined requirements (pre/post load validations and reconciliation)
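To make the Cleansing component concrete, below is a minimal, illustrative sketch in Python/pandas of the kind of standardization and duplicate flagging described above. The field names (CUSTOMER_ID, NAME, CITY, POSTAL_CODE) are hypothetical placeholders rather than RPC's actual source schema, and in practice these steps would run in SAP Data Services / Information Steward rather than in ad hoc scripts.

```python
import pandas as pd

# Illustrative customer extract; all column names are hypothetical.
customers = pd.DataFrame({
    "CUSTOMER_ID": ["C001", "C002", "C003"],
    "NAME": ["acme corp", "ACME CORP ", "Globex Ltd"],
    "CITY": ["  new york", "New York", "London"],
    "POSTAL_CODE": ["10001", "10001", "SW1A 1AA"],
})

# Standardize: trim whitespace and normalize casing so that equivalent
# values compare as equal (formatting/standardization steps named above).
for col in ["NAME", "CITY"]:
    customers[col] = customers[col].str.strip().str.upper()

# Flag potential duplicates on a simple composite key. A production match
# would use fuzzier rules (address standardization, match scores, etc.).
customers["POTENTIAL_DUPLICATE"] = customers.duplicated(
    subset=["NAME", "CITY", "POSTAL_CODE"], keep=False
)

print(customers)
```

Here C001 and C002 collapse to the same standardized name, city, and postal code and are flagged for review, mirroring how duplicate candidates would be routed to the business for confirmation rather than deleted automatically.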
Below is an overview of the E2E Migration Process. It is captured as part of the Data Conversion Strategy but is shown here for reference, highlighting how cleansing is integrated into the overall process.
3. Data Quality Framework
The objective of the Data Cleansing Strategy is to describe the Data Quality Framework and to define and cleanse the Client source data that is required in the SAP Central Finance instance. The Data Quality Framework includes the following steps:
● Define Scope – Determine the scope of data to be cleansed, including the relevancy logic applied to that data.
● Assess Data Quality – Profile, analyze, review, and assess the current level of data quality in order to define and determine the cleansing required.
● Execute Cleansing – With the defined business rules / cleansing logic captured, compile a cleansing team and/or process to execute the data cleansing.
● Monitor Quality – Continuously monitor the cleansing activities to ensure that (1) data that has been cleansed stays clean, (2) the business is not continuing to generate/create dirty data (stop it at the source), and (3) newly created data is validated as clean per the governance standards, or is included in the cleansing process. A minimal monitoring sketch follows this list.
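As a hedged illustration of the Monitor Quality step, the sketch below scores a data set against two sample validation rules and reports a pass rate per rule. The rules and field names are invented for illustration; at RPC these checks would typically be built as SAP Information Steward validation rules and scorecards rather than custom code.

```python
import pandas as pd

# Sample records; fields and rules are invented examples only.
materials = pd.DataFrame({
    "MATERIAL_ID": ["M1", "M2", "M3", "M4"],
    "BASE_UOM": ["EA", "", "KG", "EA"],
    "MATERIAL_TYPE": ["FERT", "ROH", None, "FERT"],
})

# Each rule returns True where a record passes the quality check.
rules = {
    "Base UoM populated": materials["BASE_UOM"].notna() & materials["BASE_UOM"].ne(""),
    "Material type populated": materials["MATERIAL_TYPE"].notna(),
}

# A recurring job could publish these pass rates to a quality dashboard so
# that newly created dirty data is caught at the source.
for rule_name, passed in rules.items():
    print(f"{rule_name}: {passed.mean() * 100:.0f}% pass")
```

Tracking these pass rates over time shows whether cleansed data stays clean and whether newly created records follow the governance standards.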
Below is an overview of the execution steps entailed across the Data Quality Framework.
4. Define Scope
Cleansing is a continuous activity as data is constantly created and business needs change. As part
of Release 1, this document will focus on the data objects that are known to be in scope. Below are
the Data Objects that form the initial list of master data to be considered for cleansing during
Release 1.
● Material Master (Basic View only)
● Customer Master (Basic View only)
Business Relevancy Logic
Business Relevancy Logic dictates that only required data is cleansed. RPC, along with the Data Team and with support from the Functional Process Teams, will define the extraction rules that properly define the data in scope for cleansing. These rules will also be applied throughout the project as part of the Migration Cycles, and this logic should be validated by RPC each time to ensure that the accurate scope of records is included. The activity to collect and define the business relevancy logic will be performed during the Design Workshops for each object in scope and documented within the associated Functional Specification Document (FSD). In addition, applying the business relevancy rules will reduce the master data set to "active" records, thereby increasing focus for subsequent data profiling and cleansing efforts and reducing the data set for conversion to the target system.
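As a minimal sketch of how business relevancy rules reduce a master data set to "active" records, the filter below keeps customers that are not flagged for deletion and that have transacted within a recent window. The field names and the 24-month window are assumptions for illustration only; the actual rules are defined in the Design Workshops and documented in each object's FSD.

```python
import pandas as pd

# Hypothetical extract; field names and the 24-month window are assumptions.
customers = pd.DataFrame({
    "CUSTOMER_ID": ["C001", "C002", "C003"],
    "DELETION_FLAG": ["", "X", ""],
    "LAST_TRANSACTION_DATE": pd.to_datetime(
        ["2022-08-15", "2019-01-02", "2022-06-30"]
    ),
})

cutoff = pd.Timestamp("2022-09-01") - pd.DateOffset(months=24)

# Keep only "active" records: not flagged for deletion and transacted
# within the relevancy window defined by the business.
in_scope = customers[
    (customers["DELETION_FLAG"] != "X")
    & (customers["LAST_TRANSACTION_DATE"] >= cutoff)
]

print(in_scope)
```

The same filter logic would be reapplied in every Migration Cycle and revalidated by RPC, so the in-scope record set stays current as the source data changes.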
5. Assess Data Quality: Profiling Master Data
Master Data Profiling is the process of assessing the source master data to identify any data cleansing, gaps, or data transformations required to meet the SAP CFIN S/4HANA system / design requirements. Data Profiling is not needed on every table from the source systems; it should be focused on higher-priority master data objects and on data with known cleansing / harmonization requirements that impact business operations. Leading practices for Data Profiling are listed below:
● Profiling executed on Master Data Objects only
● Completed at the table level, by field
● If relevancy is known, only profile known in-scope records
● Use of a robust tool to automate profiling
● Requires business involvement to accurately analyze profiling results
There are essentially three levels of profiling that will be required to support the cleansing of the data that will eventually be used for testing. The primary levels of profiling planned to be used at RPC are unit and integration (if transactional data is required as part of the relevancy logic); the sketch after this list illustrates all three.
● Unit Level – The simplest level of profiling and cleansing, covering single functionality using small subsets of data (targeted to individual fields)
● String Level – Somewhat more complicated, as it strings together multiple functions; cleansing starts to take on the issues of data context as the data is shared across the functions from end to end (targeted to multiple fields in the same data domain)
● Integration Level – The most complex, as it involves all the data types and functions that are strung together in a complete response to a business event (targeted to multiple fields across different data domains)
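The sketch below illustrates the three levels with invented example checks in Python/pandas: a single-field check (unit), a cross-field check within one data domain (string), and a cross-domain referential check (integration). All field names and rules are hypothetical.

```python
import pandas as pd

# Hypothetical data; field names and rules are invented for illustration.
customers = pd.DataFrame({
    "CUSTOMER_ID": ["C001", "C002"],
    "COUNTRY": ["US", "GB"],
    "POSTAL_CODE": ["10001", "SW1A 1AA"],
})
orders = pd.DataFrame({"ORDER_ID": [1, 2], "CUSTOMER_ID": ["C001", "C999"]})

# Unit level: one field profiled in isolation (valid country codes).
unit_ok = customers["COUNTRY"].isin(["US", "GB", "DE"])

# String level: multiple fields in the same domain checked together
# (e.g., US postal codes must be exactly 5 digits).
string_ok = (customers["COUNTRY"] != "US") | customers["POSTAL_CODE"].str.fullmatch(r"\d{5}")

# Integration level: fields across data domains checked together
# (every order must reference an existing customer).
integration_ok = orders["CUSTOMER_ID"].isin(customers["CUSTOMER_ID"])

print(unit_ok.all(), string_ok.all(), integration_ok.all())
```

In this toy data the integration check fails (order 2 references a customer that does not exist), which is exactly the class of issue that only surfaces once data domains are profiled together.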
Master Data Profiling can use either Production databases or a recent copy of Production as its source. Because data is ever-changing, it is best to execute profiling against the most recent data so that teams are not cleansing outdated records.
SAP Tools to execute master data profiling:
SAP Data Services
● SAP Data Services has limited profiling capabilities and should primarily be used for ETL,
whereas the profiling functionality can support developers as part of their build activities
● SAP Data Services will be the engine used to extract the data directly from the source DB / ERP systems and to apply the relevancy logic / filtering criteria that reduce the data to in-scope records, so that the data can be ingested for in-depth profiling
SAP Information Steward
● It is preferred to use SAP Information Steward to execute the master data profiling
● The data/cleansing teams should use Information Steward to analyze the profiling results
● The master data relevancy rules will be used not only for profiling, but also incorporated as
part of the data conversion routines
SAP Information Steward has four functionalities; however, for the purposes of profiling, only Data Insight is used. The following defines the various profiling results:
Value Options
● Min – The row that has the smallest number in a particular column
● Max – The row that has the largest number in a particular column
● Average – The value that represents the mean; it is the sum of all values in the column divided by the number of values
● Median – The middle value of a given number of rows

String Length Options
● Min – The row that has the smallest number of characters in a particular column
● Max – The row that has the largest number of characters in a particular column
● Average – The value that represents the mean; it is the sum of the character lengths of all values in the column divided by the number of values
● Median – The middle value of a given number of rows

Completeness Options
● Nulls – The number of rows that are empty or have a null value for a particular column
● Blanks – The number of rows that are empty or have a blank value for a particular column
● Zeros – The number of rows that have a value of zero for a particular column

Distribution Options
● Data – The number of unique values. For example, if your data includes USA and United Kingdom addresses, you would see 2 in the Country data distribution column
● Pattern – The number of unique patterns. For example, date formats may be yyyy/mm/dd, mm-dd-yy, or yy.mm.dd, and so on
● Word – The number of unique words. For example, if your data lists the country as the value United Kingdom, each instance of "United" and "Kingdom" is counted separately, whereas in Data distribution, "United Kingdom" is counted as one instance
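To illustrate what a few of these dimensions compute, the sketch below derives null, blank, distribution, word, and string-length results for a single column in Python/pandas. This approximates the kind of results Data Insight produces and is not its actual implementation; the sample values are invented.

```python
import pandas as pd

# Sample column; values chosen only to exercise the profiling dimensions.
country = pd.Series(["USA", "United Kingdom", "", None, "USA"])

# Exclude nulls and blanks when profiling values and string lengths.
non_blank = country[country.notna() & (country != "")]

profile = {
    "Nulls": int(country.isna().sum()),
    "Blanks": int((country == "").sum()),
    "Data distribution (unique values)": non_blank.nunique(),
    "Word distribution (unique words)": non_blank.str.split().explode().nunique(),
    "String length min": int(non_blank.str.len().min()),
    "String length max": int(non_blank.str.len().max()),
}

print(profile)
```

Note the Data vs. Word distinction from the definitions above: "United Kingdom" counts once as a value (unique values = 2) but contributes two words, so the unique word count is 3.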
Below is a screenshot from SAP Information Steward, from the Data Insight tab, showing Basic Profiling at the table level. This view shows the profiling results across dimensions, while also highlighting the drill-down capability to review the field-level distribution of values and the associated records.
6. Assess Data Quality: Analyze
As part of the Clarkston Review, there were many findings across RPC's current landscape with regard to data quality, process, and governance. Below is a summarized view of their analysis: