IBM Software Group

IBM Information Server
Cleanse - QualityStage

© IBM Corporation

IBM Information Server
Delivering information you can trust
• Discover, model, and govern information structure and content
• Standardize, merge, and correct information
• Combine and restructure information for new uses
• Synchronize, virtualize and move information for in-line delivery

The IBM Solution: IBM Information Server
Delivering information you can trust
IBM Information Server
• Unified Deployment
• Unified Metadata Management
WebSphere QualityStage
• Data cleansing, standardization, matching, and survivorship for enhancing data quality and creating coherent business views

Need for Data Quality
Data Sources, Data Values:
  Kentucky Fried Chicken
  KFC
  227G CB&NAT STICK P QUE/MOZZ WRAPP.
  Molly Talber DBA KFC
  Kent Fried Chick
  Kentucky Fried
  Mrs. M. Talber
  227G CB&NATURAL STICK MOZZ WRAPPER
  John & Molly Talber
  Talber, KFC, ATIMA
Critical Problems
• Need to create & maintain 360 degree views of customers, suppliers, products, locations, events
• Need to leverage data - make reliable decisions, comply with regulations, meet service agreements
Why?
• No common standards across organization
• Unexpected values stored in fields
• Required information buried in free-form fields
• Fields evolve - used for multiple purposes
• No reliable keys for consolidated views
• Operational data degrades 2% per month
Alternative Approaches
• Denial - problem misunderstood and ignored until too late; load and explode
• Hand-coding - clerical exception processing; very time consuming and resource intensive
• Simplistic cleansing apps - evolved from direct marketing & list hygiene, lack flexibility

Why Should I Care About Cleansing Information?

Lack of information standards
Different formats & structures across different systems
  Kate A. Roberts     416 Columbus Ave #2, Boston, Mass 02116
  Catherine Roberts   Four sixteen Columbus APT2, Boston, MA 02116
  Mrs. K. Roberts     416 Columbus Suite #2, Suffolk County 02116

Data surprises in individual fields
Data misplaced in the database
  Name                      Tax ID         Telephone
  J Smith DBA Lime Cons.    228-02-1975    6173380300
  Williams & Co. C/O Bill   025-37-1888    415-392-2000
  1st Natl Provident        34-2671434     3380321
  HP 15 State St.           508-466-1200   Orlando

Information buried in free-form fields
  WING ASSY DRILL 4 HOLE USE 5J868A HEXBOLT 1/4 INCH
  WING ASSEMBY, USE 5J868-A HEX BOLT .25” - DRILL FOUR HOLES
  USE 4 5J868A BOLTS (HEX .25) - DRILL HOLES FOR EA ON WING ASSEM
  RUDER, TAP 6 WHOLES, SECURE W/KL2301 RIVETS (10 CM)

Data myopia
Lack of consistent identifiers inhibits a single view
  19-84-103 RS232 Cable 6' M-F CandS
  CS-89641 6 ft. Cable Male-F, RS232 #87951
  C&SUCH6 Male/Female 25 PIN 6 Foot Cable

The redundancy nightmare
Duplicate records with a lack of standards
  90328574   IBM                     187 N.Pk. Str. Salem NH 01456
  90328575   I.B.M. Inc.             187 N.Pk. St. Salem NH 01456
  90238495   Int. Bus. Machines      187 No. Park St Salem NH 04156
  90233479   International Bus. M.   187 Park Ave Salem NH 04156
  90233489   Inter-Nation Consults   15 Main Street Andover MA 02341
  90345672   I.B. Manufacturing      Park Blvd. Bostno MA 04106

Importance of Data Quality
Low data quality impacts an organization in several ways
• Poor data quality leads to misguided marketing promotions
• Cross-sell opportunities may be missed because the same customer appears several times in slightly different ways
• Valued customers may not be recognized during support calls or other important touchpoints
• Data mining is difficult because related items are not detected as related
What is good data quality?
• Two percent of “bad” data doesn’t sound that bad?
• Two percent of 10M rows means that you have 200K errors
• 200K errors add up to a big problem for analytics/operations/anything!

Enterprise initiatives… …to satisfy critical business requirements.
• Supply chain collaboration & item synchronization
• Inventory consolidation
• Single view of a customer or supplier
• Compliance
• ERP Implementations
• Business to Business Standards
• ERP instance consolidation
• IT System renovation
• Consolidation resulting from M&A activity
…need high quality data…
• Risk Management
• Reduce Costs & Increase Productivity
• Enterprise Data Warehouse
• Increase Revenue / CRM Payoff
• Compliance & Regulatory projects (SOX, HIPAA, ACCORD, etc.)
• Business Intelligence Payoff

IBM WebSphere QualityStage
• Shared design environment with DataStage increases functionality and reduces development time
• Visual match rule interface simplifies match tuning
• Service orientation provides ‘continuous’ quality & delivers confidence in your data
• Parallel architecture shortens execution time

How will you get an accurate, consolidated view of your business?
Sources:
• Customers
• Products / Materials
• Transactions
• Vendors / Suppliers
WebSphere QualityStage Process:
1. Free Form Investigation
2. Data Standardization
3. Data Matching
4. Data Survivorship
Target Database with Consolidated Views

Why Investigate
• Discover trends and potential anomalies in the data
• 100% visibility of single domain and free-form fields
• Identify invalid and default values
• Reveal undocumented business rules and common terminology
• Verify the reliability of the data in the fields to be used as matching criteria
• Gain complete understanding of data within context

Investigation - Free Form
Input: 123 St. Virginia St.

Parsing: separating multi-valued fields into individual pieces
  123 | St. | Virginia | St.

Lexical analysis: determining business significance of individual pieces
  123    | St.         | Virginia | St.
  number | street type | state    | street type

Context Sensitive: identifying various data structures and content
  123          | St. Virginia | St.
  House Number | Street Name  | Street Type

“The instructions for handling the data are inherent within the data itself.”

Rule Sets
• Pre-defined rules for parsing and standardizing:
  • Name
  • Address
  • Area (City, State and Zip)
  • Multi-national address processing
• Validate structure:
  • Tax ID
  • US Phone
  • Date
  • Email
• Append ISO country codes
• Pre-process or filter name, address and area
• Rule sets are stored in the common repository

Standardization - Example
Input File:
  Address Line 1                      Address Line 2
  639 N MILLS AVENUE                  ORLANDO, FLA 32803
  306 W MAIN STR, CUMMING, GA 30130
  3142 WEST CENTRAL AV                TOLEDO OH 43606
  843 HEARD AVE                       AUGUSTA-GA-30904
  1139 GREENE ST ACCT #1234           AUGUSTA GEORGIA 30901
  4275 OWENS ROAD SUITE 536 EVANS     GA 30809

Result File:
  House #  Dir  Str. Name  Type  Unit  No.  NYSIIS   City     SOUNDEX  State  Zip    ACCT#
  639      N    MILLS      AVE              MAL      ORLANDO  O645     FL     32803
  306      W    MAIN       ST               MAN      CUMMING  C552     GA     30130
  3142     W    CENTRAL    AVE              CANTRAL  TOLEDO   T430     OH     43606
  843           HEARD      AVE              HAD      AUGUSTA  A223     GA     30904
  1139          GREENE     ST               GRAN     AUGUSTA  A223     GA     30901  1234
  4275          OWENS      RD    STE   536  ON       EVANS    E152     GA     30809

Why Match
• Identify duplicate entities within one or more files
• Perform householding
• Create consolidated view of customer
• Establish cross-reference linkage
• Enrich existing data with new attributes from external sources

Two Methods to Decide a Match
Are these two records a match?
  Record 1:  WILLIAM  J     KAZANGIAN  128  MAIN   ST   02111  12/8/62
  Record 2:  WILLAIM  JOHN  KAZANGIAN  128  MAINE  AVE  02110  12/8/62
  Grade:     B        B     A          A    B      D    B      A        = BBAABDBA
  Weight:    +5       +2    +20        +3   +4     -1   +7     +9       = +49

Deterministic Decision Tables:
• Fields are compared
• Letter grade assigned
• Combined letter grades are compared to a vendor delivered file
• Result: Match; Fail; Suspect

Probabilistic Record Linkage:
• Fields are evaluated for degree-of-match
• Weight assigned: represents the “information content” by value
• Weights are summed to derive a total score
• Result: Statistical probability of a match

Why Survive
• Provide consolidated view of data
• Provide consolidated view containing the “best-of-breed” data
• Resolve conflicting values and fill missing values
• Cross-populate best available data
• Implement business and mapping rules
• Create cross-reference keys

Survivorship - Example
Survivorship Input (Match Output):
  Group  Legacy  First   Middle  Last     No.   Dir.  Str. Name   Type  Unit  No.
  1      D150    Bob             Dixon    1500  SE    ROSS CLARK  CIR
  1      A1367   Robert          Dickson  1500        ROSS CLARK  CIR
  23     D689    Ernest  A       Obrian   5901  SW    74TH        ST    STE   202
  23     A436    Ernie   Alex    O’Brian  5901  SW    74TH        ST    #     202
  23     D352    Ernie           Obrian   5901        74          ST

Consolidated Output:
  Group  Legacy
  1      D150
  1      A1367
  23     D689
  23     A436
  23     D352

  Group  First   Middle  Last     No.   Dir.  Str. Name   Type  Unit  No.
  1      Robert          Dickson  1500  SE    ROSS CLARK  CIR
  23     Ernie   Alex    O’Brian  5901  SW    74TH        ST    STE   202

How Does WebSphere QualityStage Integrate
Database: DB2, Oracle, Sybase, Onyx, IDMS, etc.
Data Extraction and Load Routines
QualityStage:
1. Investigation
2. Standardization
3. Integration
4. Survivorship
Target: DB2, Oracle, Sybase, Onyx, IDMS, etc.

WebSphere DataStage and WebSphere QualityStage: Fully Integrated!
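
As a rough illustration of the probabilistic method from the "Two Methods to Decide a Match" slide, the sketch below sums the per-field weights shown for the two KAZANGIAN records and compares the total against two cutoffs. The field names, the cutoff values, and the three-way match / suspect / non-match outcome are assumptions made for this example; they are not QualityStage settings or its API.

```python
# Per-field weights copied from the slide's example.
# The field names are illustrative labels, not QualityStage column names.
FIELD_WEIGHTS = {
    "first_name": 5,
    "middle_name": 2,
    "last_name": 20,
    "house_number": 3,
    "street_name": 4,
    "street_type": -1,
    "zip": 7,
    "birth_date": 9,
}


def composite_weight(weights):
    """Sum the per-field weights into a single score for the record pair."""
    return sum(weights.values())


def classify(score, match_cutoff, clerical_cutoff):
    """Map a composite score to an outcome using two hypothetical cutoffs."""
    if score >= match_cutoff:
        return "match"
    if score >= clerical_cutoff:
        return "suspect"
    return "non-match"


score = composite_weight(FIELD_WEIGHTS)
print(score)  # 49, the total shown on the slide

# The cutoffs 40 and 25 are made up for illustration only.
print(classify(score, match_cutoff=40, clerical_cutoff=25))
```

Because each field contributes its own weight, a strong agreement on a rare last name (+20) can outweigh a disagreement on street type (-1), which a fixed letter-grade lookup cannot express as directly.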
QualityStage: Data Quality Extensions
• IBM WebSphere QualityStage GeoLocator
• IBM WebSphere QualityStage Postal Verification Products
  • WAVES (WorldWide): IBM WebSphere Worldwide Address Verification Solution
• IBM WebSphere QualityStage Postal Certification Products
  • CASS (United States)
  • SERP (Canada)
  • DPID (Australia)
• IBM Information Server Data Quality Module for SAP
• IBM WebSphere QualityStage for Siebel

Key Strengths for IBM QualityStage
• Intuitive, “Design as you think” User Interface
• Simple rule design & fine tuning
• Seamless Data Flow integration
• Intuitive rule design & fine tuning
• Defining the technology standard with SOA
• Industry leading probabilistic matching engine

Thank You