SvalTech Two Neglected Data Management Technologies Data Profiling & Database Archiving DAMA Phoenix Chapter 9 September, 2010 Jack E. Olson jack.olson@SvalTech.com www.svaltech.com “Data Quality: The Accuracy Dimension”, Morgan Kaufmann, 2003 “Database Archiving: How to Keep Lots of Data for a Long Time", Morgan Kaufmann, 2008 Copyright SvalTech, Inc., 2010 SvalTech Why Two Topics • Jack knows a lot about each of them – – – – Wrote a book on data profiling Wrote a book on database archiving Built data profiling product: AXIO at Evoke Built database archiving product: TITAN at NESI • Both are post-operational data management technologies • Neither is required for operational purposes • Both have huge potential for return on investment – Cost savings – Operational performance improvements – Risk reduction • Both are liked by compliance/governance folks • Both are under-utilized in IT shops Copyright SvalTech, Inc., 2010 2 SvalTech What I’m Going to Talk About For each Technology, • • • • • • Define Technology Identify Where It is Used Show How it Works Basic Solution Architectures Show Business Case Basics Why it is Neglected Copyright SvalTech, Inc., 2010 3 SvalTech Data Profiling Copyright SvalTech, Inc., 2010 4 SvalTech Books • Data Quality: The Accuracy Dimension, Jack E. Olson, Morgan Kaufmann, 2003 • Executing Data Quality Projects” Ten Steps to Data Quality and Trusted Information (tm), Danette McGilray, Morgan Kaufmann, 2008 • Three Dimensional Analysis – Data Profiling Techniques, Ed Lindsey, Data Profiling LLC, 2008 • Data Quality Assessment, Maydanchik Arkady, Technics Publications, LLC, 2007 Copyright SvalTech, Inc., 2010 5 SvalTech Definition Data Profiling is the application of data analysis techniques to existing data for the purpose of determining the actual content, structure, and quality of the data. metadata accurate & inaccurate Data accurate & inaccurate Copyright SvalTech, Inc., 2010 accurate metadata Data Profiling facts about inaccurate data data quality issues 6 SvalTech Where is Data Profiling Used? • Enterprise Data Quality Improvement Program – – – • Support Consolidation of Databases after mergers and acquisitions – – • Dramatically reduce cost and time to complete projects Improve quality of data in resulting system Support application renovation projects – – • Traditional Six Sigma Like Program Recursive application of data quality assessment Based on historical success of companies who have used it Dramatically reduce time and cost to complete Improve quality of data in resulting system Support data integration functions for data warehousing/ business intelligence data stores – – Develop processes to cleanse data in transit Improve quality of data in information intelligence stores Copyright SvalTech, Inc., 2010 7 SvalTech Data Quality Program Data Quality Correction Data Quality Prevention Data Profiling Name & Address Cleansing De-duplication Data Quality Monitoring Incident Investigation Re-verification Business Process Improvements Data Quality Issue Formulation & Tracking Application Software Improvements Data Quality Discovery Copyright SvalTech, Inc., 2010 8 SvalTech Data Quality Discovery Data Profiling profile DATA review profiler Incident Investigation Copyright SvalTech, Inc., 2010 Suspicion Of DQ problem DQ issues process review DQ issues DATA review 9 SvalTech Data Profiling Functions • Column Examination – – – – – – • Row Examination – – – • Find all primary key candidates (single or multi-column) Find intra-row column dependencies (find denormalization instances) Find multi-column value relationships • Value ordering rules • NULL value dependencies Multi-table Examination – – – – • List all values in column with frequency of occurrence Show high and low values Determine true data type Determine degree of uniqueness Determine encoding patterns used, frequency of each pattern Compute values: AVG, SUM, MEDIAN, STD DEVIATION Find matching columns across tables • Match by column name, data type • Match by values Find primary/foreign key pairs (single and multi-column) Determine 1-1, 1-M, 1-0, M-1, M-M, 0-1 rules Find primary values not found in secondary tables Test User provided data rules Copyright SvalTech, Inc., 2010 10 SvalTech Potential for Discovery of Value Problems missing & invalid values values recorded in a column of a table errors that can be fixed without re-verification valid but wrong values errors that can be found through analytical techniques correct values Copyright SvalTech, Inc., 2010 11 SvalTech What Types of Problems can be identified for values in a column? Invalid Values - missing values when should not be missing - values out of range or not in domain of expected values - value in one column not possible when combined with values in one or more other columns - obviously wrong when looked at •Name = Donald Duck •Address = 1600 Pennsylvania Avenue Valid Values - distribution of values unexpected too many of one or more values too few of one or more values - value is incompatible with other values in other columns - multiple values mean same thing Copyright SvalTech, Inc., 2010 12 SvalTech What Types of Problems can be Fixed? • Synonyms (multiple representations for same value) • Multiple ways to express same value •Date formats •Number formats •Use of case on character values • Cases where the value of a column can be determined from the values in one or more other columns explicitly: through a rule correlation against known combinations Copyright SvalTech, Inc., 2010 13 SvalTech Role of Metadata and Data Rules • must define what constitutes correct • traditional metadata is only part of definition • traditional metadata is often inaccurate and/or incomplete • data rules exist whether known or not • data rules exist whether enforced or not • profiling can validate metadata and known data rules • can be used to correct or enhance metadata and data rules • can be used to discover additional data rules • can test adherence to data rules Copyright SvalTech, Inc., 2010 14 SvalTech Data Profiling Solution Architectures metadata data rules gather metadata & data rules review update metadata and data rules profile profiler data extract samples Identify and quantify problems determine impact on business create DQ problem tickets Copyright SvalTech, Inc., 2010 15 SvalTech Solution Architectures Home Grown Inefficient Directed SQL; not generic Easy Algorithms Few Surprises uncovered Vendor Tools More sophisticated functions third normal form discovery foreign key discovery multiple pattern recognition More management of results Copyright SvalTech, Inc., 2010 16 SvalTech State of Vendor Support • Fragmented – – Different products call themselves data profiling but do different things Some non-profiling products call themselves profiling • Buried in multiple data management functions • Marketed as compliance product • Not a lot of advances in technology over last few years – – – – – – More complex array of profiling functions Targeted profiling functions Menu of functions to pick from SaaS Profiling of metadata Monitoring operational systems for profiling expectations Copyright SvalTech, Inc., 2010 17 SvalTech Business Case Basics • Dollar savings – – • Improved Operations – – • Fewer Operational Mistakes • Reduced rework • Reduced customer returns Improved Analytic Results • Better decision making from Analytics Data Quality problems can cost a company 15-25% of bottom line profit. Fewer operational glitches due to wrong data values Reduced time to complete projects Risk Reduction – – Reduced exposure to catastrophic events Reduced exposure to legal scrutiny of data Copyright SvalTech, Inc., 2010 It’s all indirect: you cannot determine what value you will get until after you have profiled data, studied results and used it to some advantage. 18 SvalTech How to Get Started • Need to establish as part of quality improvement/ compliance/ governance program – – • Start with a pilot data quality assessment of a critical data resource – – – • 6 Sigma like Multi-department interest Small Targeted: high value consequences of poor data “see what we find” Possibly start with a renovation/ consolidation/ integration project – Process to establish source level metadata and understand data content • Need to Quantify Results • Use Success to Expand Use of Technology Copyright SvalTech, Inc., 2010 19 SvalTech Why is it Neglected? • Problems are not on Top Ten List of IT Concerns • Cannot predict value: cannot build business case in advance – – Don’t know value of a data quality improvements until after you discover the problems Value is hard to quantify: does not appear until quality problems acted upon • Perception that you don’t need it • Exposes corporate/ IT weaknesses • No Clear Owner of Data Quality • Lack of education on value, cost to implement – Management education – Technical Staff education Copyright SvalTech, Inc., 2010 20 SvalTech Database Archiving Copyright SvalTech, Inc., 2010 21 SvalTech Books • Database Archiving: How to Keep Lots of Data for a Very Long Time, Jack E. Olson, Morgan Kaufmann, 2009 Copyright SvalTech, Inc., 2010 22 SvalTech Definition The process of removing selected data items from operational databases that are not expected to be referenced again and storing them in an archive database where they can be retrieved if needed. Physical Documents application forms mortgage papers prescriptions File Archiving structured files source code reports Copyright SvalTech, Inc., 2010 Document Archiving Multi-media files word pictures pdf sound excel telemetry XML Email Archiving outlook lotus notes Database Archiving DB2 IMS ORACLE SAP PEOPLESOFT 23 SvalTech Business Records: the Archive Unit You don’t archive databases; you archive data from databases. A Business Record is the data captured and maintained for a single business event or to describe a single real world object. Databases are collections of Business Records. Database Archiving is Records Retention. customer employee stock trade purchase order deposit loan payment Copyright SvalTech, Inc., 2010 24 SvalTech Data Retention The requirement to keep data for a business object for a specified period of time. The object cannot be destroyed until after the time for all such requirements applicable to it has past. Business Requirements Regulatory Requirements The Data Retention requirement is the longest of all requirement lines. Copyright SvalTech, Inc., 2010 25 SvalTech Data Retention • Retention requirements vary by business object type • Retention requirements from regulations are exceeding business requirements • Retention requirements will vary by country • Retention requirements imply the obligation to maintain the authenticity of the data throughout the retention period • Retention requirements imply the requirement to faithfully render the data on demand in a common business form understandable to the requestor • The most important business objects tend to have the longest retention periods • The data with the longest retention periods tends to be accumulate the largest number of instances • Retention requirements often exceed 10 years. more years for some applications Copyright SvalTech, Inc., 2010 Requirements exist for 25, 50, 70 and 26 SvalTech Where Database Archiving Is Used • Overloaded Operational Databases • Retired Applications • Application Renovation Projects Copyright SvalTech, Inc., 2010 27 SvalTech Overloaded Operational Databases • Transaction data • Lots of data – Hundreds of millions of rows – High daily transaction rate • • • • 24/7 operational availability requirement Long retention period (15 years or more) Short useful active life (less than 2 years) Low access requirements during the inactive period – Very low access frequency – Response time not critical – Access requirements are simple, easily satisfied with ad hoc tools Copyright SvalTech, Inc., 2010 28 SvalTech Retired Applications • Merger of companies results in an operational application being duplicated • Data Structures are not compatible – One keeps data elements not in other – One encodes data elements differently – One designed for different OS/DBMS than other • Decision is made to use one system and abandon the other one • Meets all characteristics of an operational application Copyright SvalTech, Inc., 2010 29 SvalTech Application Renovation Projects • Application is undergoing major change – – – – – Replaced with packaged application Legacy modernization Legacy termination Rewritten to be web-centric Need to satisfy new requirements • Old data structures are out of date – Legacy DBMS – Legacy file system • Data meets all other requirements for archiving operational application Copyright SvalTech, Inc., 2010 30 SvalTech Architecture of Database Archiving Operational System Application program Archive Extractor Archive Administrator Archive Designer Archive Data Manager Archive Access Manager OP DB Archive extractor Archive Server archive catalog archive storage Copyright SvalTech, Inc., 2010 31 SvalTech Archive Staff • Database Archive Specialist – – – – • Database Archive Administrator – – • Received education on database archive design and implementation Knows tools available Experienced Full time job Received education on database archiving administration Full time job Supporting Roles – – – – – – – – – Storage Administrators Database Administrators Data Stewards Security Administrators Compliance staff IT management Business Unit Management Legal Records Management Copyright SvalTech, Inc., 2010 32 SvalTech Archive Designer Component • Metadata – – – – • Data – – – – – • Capture current metadata Validate it Enhance it Design archive storage format Define business records to be archived Define source of data Define data structures within operational system Define reference data needed to include with it Define archive format of data Policies – – – – Define extract policy (when a record becomes inactive) Define operational disposal policy (when to remove from operational database) Define storage policy (how to protect data in archive) Define discard policy (when to remove from archive) Copyright SvalTech, Inc., 2010 33 SvalTech Archive Extractor Components • Extractor process – – – – – – – – – • Verify consistency with design metadata Extract data as defined in designer Mark or delete from operational database as defined in designer Pass data to archive data manager Keep audit records on everything done Do not impact operational performance Support interruptions with transaction level recovery Support restart Finish scans within acceptable time periods Scheduling – Establish periodic executions – Find non-disruptive periods – Be consistent Copyright SvalTech, Inc., 2010 34 Archive Extractors SvalTech Physical vs. Application Extractors Archive Extractor Operational System Application program OP DB Archive extractor Physical Extractor Application Extractor Gets/deletes data directly from the database tables, rows, columns Gets/deletes data from an application API virtual tables, rows, columns application program Copyright SvalTech, Inc., 2010 35 SvalTech Archive Data Manager Component • Put data away – – – – – • Execute Storage policies – – – – – • Return to accessing programs Fetch data on request – – • Encryption/ signatures Backup copies created and stored Geographic dispersion of backups Register archive files with archive catalog Enter audit trail information Fetch metadata on request – • Receive data from extractors Format into archive segment files Determine metadata version affinity Format and store metadata files if new Build or update segment indexes both internal and external Scan archive segments Search through indexes Execute Archive Discard Process – – – Periodic scheduling Delete qualifying business records Update archive catalog Copyright SvalTech, Inc., 2010 36 SvalTech Archive Access Component • Query Capability – – – – • Determine applicability based on archive segment versions of metadata SQL based is best, if possible Employ external indexes to determine which archive segments to look into Employ internal indexes to avoid reading all of an archive segment Support standard access tools – – – Report generation (such as Crystal Reports) Generic query tools JDBC interface • Support metadata version browsing • Support generation of load files based on query results • Support generation of load files based on original data source based on query results Copyright SvalTech, Inc., 2010 37 SvalTech Archive Administration Component • Manage Archive Catalog – – – • Manage Archive Storage Systems – – • Application archive designs Audit trails Results logs Ensure periodic readability checks Maintain access audit trails Manage Archive Access – – – Authorizations for users Authorizations for specific events • Unloads Ensure audit records are created for all access • Manage e-Discovery requests • Ensure Extract and Discard processes are run when they are supposed to • Manage Metadata Change Process Copyright SvalTech, Inc., 2010 38 SvalTech Solution Comparisons Home-Grown vs. Vendor Home-Grown Solutions: Use Parallel DB Use Database Partitions Put in UNLOAD files Save Image Copies of DB Copyright SvalTech, Inc., 2010 Vendor Solutions: More Complete Solutions Support Long Term Administration Put data in XML files Put data in reformatted files Exploit strengths of storage subsystems Direct access support 39 SvalTech Home-Grown Solutions • Solve Operational Problems, BUT: – – – – Create downstream problems Fail to achieve cost savings Render archive data inaccessible • Either completely or, • Expensive in time and cost to query Lose data authenticity • Common Omissions – – – – – Copyright SvalTech, Inc., 2010 No handling or improvement of metadata No change process for structure changes No long term storage management Fail to achieve application/system independence No administration platform 40 SvalTech Vendor Solutions • Not a Lot of Vendors – – – Only 7 I know of 3 large companies • Through acquisition Gartner pre-recession characterization • Is a new technology • $100M in 2008 • 40% per year growth rate • Early adopter stage • Solutions not complete – – Copyright SvalTech, Inc., 2010 Need growth in function and maturity Common weak spots • Design modeling • Extractor technology • Not pervasive across data sources • Storage structure • Storage management 41 SvalTech Business Case Basics Drivers Longer Data Retention requirements Expanded Business overloaded operational databases Operational problems Mergers and Acquisitions Cost of Keeping Old Systems Difficulty in Making Application Changes Data Governance e-Records Retention e-Discovery Readiness concerns Copyright SvalTech, Inc., 2010 42 SvalTech Reason for Archiving All data in operational db Inactive data in archive db most expensive system most expensive storage most expensive software least expensive system least expensive storage least expensive software In a typical op db 60-80% of data is inactive Size Today This percentage is growing Operational Copyright SvalTech, Inc., 2010 operational archive 43 SvalTech Cost Saving Elements Look for and compute difference in storage costs front-line vs archive storage byte counts differences between operational and archive Look for and compute difference in system costs operational vs archive systems are operational system upgrades avoided are software upgrades avoided can systems be eliminated for application can software be eliminated for application Look for savings on people costs can people be eliminated or redirected for retired applications Potential savings on changes/ application renovations simplification of design elimination of data conversions Copyright SvalTech, Inc., 2010 44 SvalTech Operational Efficiency Impacts Will operational performance be enhanced with less data Will utility time periods be reduced (backup, reorganization) fewer occurrences needed less data to process each time Will recovery times be reduced and what is that worth interruption recoveries disaster recoveries Will implementation of data structure changes be improved avoided reduced amount of data to unload/modify/reload Copyright SvalTech, Inc., 2010 45 SvalTech Risk Factors Will the saved data have better authenticity not changed in archive shielded from updates or damage traceable back to original form Will e-Discovery benefit from archiving can locate and process data outside of operational environment can easily create legal-hold archive units Will exposure of data reduced fewer authorized users against the archive complete audit trails of all access Copyright SvalTech, Inc., 2010 46 SvalTech Business Case Summary • Database Archiving solutions generally provide for lower cost software, can use lower cost storage more efficiently, and run on smaller machines. • Each business case is different Many factors can be used in building business case Seen an application justified on storage costs alone Seen an application justified on disaster recovery time alone Seen an application justified on better data security alone • Each organization will have many potential applications • Having a database archiving practice can create synergies across many applications thus adding more value Copyright SvalTech, Inc., 2010 47 SvalTech Why is it Neglected? • Belief that work arounds are good enough • Technology is too new • Technology is incomplete or flawed • Inability to produce complete business case to justify • Lack of experience in using technology • Lack of education on value, cost to implement – Management education – Technical Staff education Copyright SvalTech, Inc., 2010 48 SvalTech Final Thoughts Are these two technologies related? Enterprises who consider data a critical and valuable asset and who have a goal of collecting, protecting, preserving and deploying data to the best advantage of the enterprise WILL have robust data management and data governance practices. These will include professional management of: - data and metadata architectural control - active and ongoing data quality improvement program - active and ongoing data archiving program (records management) - data security control - access authorization and auditing - data encryption - data protection program - backups - disaster recovery - protection from unauthorized access/ changes - data distribution control Copyright SvalTech, Inc., 2010 49