IBM Information Server Training

IBM Software Group
IBM Information Server
Cleanse - QualityStage
©IBM Corporation
IBM Information Server
Delivering information you can trust
Discover, model, and govern information structure and content
Standardize, merge, and correct information
Combine and restructure information for new uses
Synchronize, virtualize, and move information for in-line delivery
The IBM Solution: IBM Information Server
Delivering information you can trust
IBM Information Server
Unified Deployment
Unified Metadata Management
WebSphere QualityStage
Data cleansing, standardization, matching,
and survivorship for enhancing data quality
and creating coherent business views
Need for Data Quality
Data Sources
Data Values
Kentucky Fried Chicken
KFC
227G CB&NAT STICK P
QUE/MOZZ WRAPP.
Molly Talber DBA KFC
Kent Fried Chick
Kentucky Fried
Mrs. M. Talber
227G CB&NATURAL STICK
MOZZ WRAPPER
John & Molly Talber
Talber, KFC, ATIMA
Critical Problems
 Need to create & maintain 360 degree views of
customers, suppliers, products, locations, events
 Need to leverage data - make reliable decisions,
comply with regulations, meet service agreements
Why?
 No common standards across organization
 Unexpected values stored in fields
 Required information buried in free-form fields
 Fields evolve - used for multiple purposes
 No reliable keys for consolidated views
 Operational data degrades 2% per month
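To get a feel for the 2% monthly degradation figure, the sketch below projects it over longer periods. It assumes the rate applies each month to the records that are still accurate (compounding). The slide does not say how the rate accumulates, so both that assumption and the function name `degraded_fraction` are illustrative only.

```python
def degraded_fraction(monthly_rate: float, months: int) -> float:
    """Fraction of records degraded after `months`.

    Assumption (not stated on the slide): each month the rate applies
    to the records that are still good, so the decay compounds.
    """
    still_good = (1.0 - monthly_rate) ** months
    return 1.0 - still_good


if __name__ == "__main__":
    for months in (1, 6, 12, 24):
        share = degraded_fraction(0.02, months)
        print(f"{months:>2} months: {share:.1%} of records degraded")
```

Under that assumption, a bit more than a fifth of the operational data would be degraded after one year without any cleansing.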
Alternative Approaches
 Denial – problem misunderstood and ignored until
too late; load and explode
 Hand-coding - clerical exception processing; very
time consuming and resource intensive
 Simplistic cleansing apps - evolved from direct
marketing & list hygiene, lack flexibility
Why Should I Care About Cleansing Information?
 Lack of information standards
 Different formats & structures
across different systems
 Data surprises in individual
fields
 Data misplaced in the database
 Information buried in free-form
fields
Kate A. Roberts | 416 Columbus Ave #2, Boston, Mass 02116
Catherine Roberts | Four sixteen Columbus APT2, Boston, MA 02116
Mrs. K. Roberts | 416 Columbus Suite #2, Suffolk County 02116
Name | Tax ID | Telephone
J Smith DBA Lime Cons. | 228-02-1975 | 6173380300
Williams & Co. C/O Bill | 025-37-1888 | 415-392-2000
1st Natl Provident | 34-2671434 | 3380321
HP 15 State St. | 508-466-1200 | Orlando
WING ASSY DRILL 4 HOLE USE 5J868A HEXBOLT 1/4 INCH
WING ASSEMBY, USE 5J868-A HEX BOLT .25” - DRILL FOUR HOLES
USE 4 5J868A BOLTS (HEX .25) - DRILL HOLES FOR EA ON WING ASSEM
RUDER, TAP 6 WHOLES, SECURE W/KL2301 RIVETS (10 CM)
 Data myopia
 Lack of consistent identifiers inhibits
a single view
 The redundancy nightmare
 Duplicate records with a lack of
standards
19-84-103 | RS232 Cable 6' M-F CandS
CS-89641 | 6 ft. Cable Male-F, RS232 #87951
C&SUCH6 | Male/Female 25 PIN 6 Foot Cable
90328574 | IBM | 187 N.Pk. Str. Salem NH 01456
90328575 | I.B.M. Inc. | 187 N.Pk. St. Salem NH 01456
90238495 | Int. Bus. Machines | 187 No. Park St Salem NH 04156
90233479 | International Bus. M. | 187 Park Ave Salem NH 04156
90233489 | Inter-Nation Consults | 15 Main Street Andover MA 02341
90345672 | I.B. Manufacturing | Park Blvd. Bostno MA 04106
Importance of Data Quality
 Low data quality impacts an organization in several ways
 Poor data quality leads to misguided marketing promotions
 Cross-sell opportunities may be missed because the same customer appears several
times in slightly different ways
 Valued customers may not be recognized during support calls or other important
touchpoints
 Data mining is difficult because related items are not detected as related
 What is good data quality?
 Two percent of “bad” data doesn’t sound that bad?
 Two percent of 10M rows means that you have 200K errors
 200K errors add up to a big problem for analytics, operations, or anything else!
Enterprise initiatives…
 Supply chain collaboration & item synchronization
 Inventory consolidation
 Single view of a customer or supplier
 Compliance
 ERP Implementations
 Business to Business Standards
 ERP instance consolidation
 IT System renovation
 Consolidation resulting from M&A activity
…need high quality data…
…to satisfy critical business requirements.
 Risk Management
 Reduce Costs & Increase Productivity
 Enterprise Data Warehouse
 Increase Revenue / CRM Payoff
 Compliance & Regulatory projects (SOX, HIPAA, ACCORD, etc.)
 Business Intelligence Payoff
IBM WebSphere QualityStage
 Shared design environment with
DataStage increases
functionality and reduces
development time
 Visual match rule interface
simplifies match tuning
 Service orientation provides
‘continuous’ quality & delivers
confidence in your data
 Parallel architecture shortens
execution time
How will you get an accurate, consolidated view of your
business?
Sources: Customers; Products / Materials; Transactions; Vendors / Suppliers
WebSphere QualityStage Process:
1. Free Form Investigation
2. Data Standardization
3. Data Matching
4. Data Survivorship
Target: Database with Consolidated Views
Why Investigate
 Discover trends and potential anomalies in the data
 100% visibility of single domain and free-form fields
 Identify invalid and default values
 Reveal undocumented business rules and common terminology
 Verify the reliability of the data in the fields to be used as matching
criteria
 Gain complete understanding of data within context
 Investigation - Free Form
Input: 123 St. Virginia St.
Parsing: separating multi-valued fields into individual pieces
123 | St. | Virginia | St.
Lexical analysis: determining business significance of individual pieces
123 | St. | Virginia | St.
number | street type | state | street type
Context Sensitive: identifying various data structures and content
123 | St. Virginia | St.
House Number | Street Name | Street Type
“The instructions for handling the data are inherent within the data itself.”
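The three investigation steps can be illustrated with a small sketch. This is not how QualityStage implements them: the lookup tables `STREET_TYPES` and `STATES`, the function names, and the single positional rule in `interpret` are all hypothetical, chosen only to reproduce the "123 St. Virginia St." example.

```python
# Hypothetical lookup tables; a real rule set would be far larger.
STREET_TYPES = {"ST", "AVE", "RD", "BLVD"}
STATES = {"VIRGINIA", "GEORGIA", "OHIO"}


def parse(text):
    """Parsing: split a free-form field into individual tokens."""
    return text.split()


def classify(token):
    """Lexical analysis: label one token in isolation, ignoring context."""
    key = token.rstrip(".").upper()
    if key.isdigit():
        return "number"
    if key in STREET_TYPES:
        return "street type"
    if key in STATES:
        return "state"
    return "unknown"


def interpret(tokens):
    """Context-sensitive step: use token positions to assign address roles.

    Illustrative rule: a leading number is the house number, a trailing
    street type is the street type, and everything between is the name.
    """
    labels = [classify(t) for t in tokens]
    if len(tokens) >= 3 and labels[0] == "number" and labels[-1] == "street type":
        return {
            "House Number": tokens[0],
            "Street Name": " ".join(tokens[1:-1]),
            "Street Type": tokens[-1],
        }
    return {"Unhandled": " ".join(tokens)}


tokens = parse("123 St. Virginia St.")
print(tokens)
print([classify(t) for t in tokens])
print(interpret(tokens))
```

Looked at in isolation, the first "St." is labeled a street type and "Virginia" a state. Only the positional rule recognizes "St. Virginia" as a street name.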
Rule Sets
 Pre-defined rules for parsing and
standardizing:
Name
Address
Area (City, State and Zip)
 Multi-national address processing
 Validate structure:
Tax ID
US Phone
Date
Email
 Append ISO country codes
 Pre-process or filter name, address
and area
 Rule sets are stored in the common
repository
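As a rough illustration of what "validate structure" means for fields such as Tax ID and US Phone, the sketch below checks values against simple patterns. The regular expressions and the helper `has_structure` are assumptions made for this example and are not the shipped rule sets. The sample values come from the Name / Tax ID / Telephone example earlier in the deck.

```python
import re

# Hypothetical structure patterns (not the product's rule sets):
# a Tax ID shaped like an SSN (ddd-dd-dddd) or an EIN (dd-ddddddd),
# and a 10-digit US phone number with optional separators.
TAX_ID = re.compile(r"(\d{3}-\d{2}-\d{4}|\d{2}-\d{7})")
US_PHONE = re.compile(r"\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}")


def has_structure(pattern, value):
    """Return True when the whole value fits the expected structure."""
    return pattern.fullmatch(value.strip()) is not None


tax_ids = ["228-02-1975", "025-37-1888", "34-2671434", "508-466-1200"]
phones = ["6173380300", "415-392-2000", "3380321", "Orlando"]

for value in tax_ids:
    print("Tax ID   ", value, has_structure(TAX_ID, value))
for value in phones:
    print("Telephone", value, has_structure(US_PHONE, value))
```

A check like this flags "508-466-1200" as an invalid Tax ID even though the same value has a valid phone structure, which is the kind of misplaced data the earlier slide describes.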
 Standardization - Example
Input File:
Address Line 1 | Address Line 2
639 N MILLS AVENUE | ORLANDO, FLA 32803
306 W MAIN STR, CUMMING, GA 30130 |
3142 WEST CENTRAL AV | TOLEDO OH 43606
843 HEARD AVE | AUGUSTA-GA-30904
1139 GREENE ST ACCT #1234 | AUGUSTA GEORGIA 30901
4275 OWENS ROAD SUITE 536 EVANS | GA 30809
Result File:
House # | Dir | Str. Name | Type | Unit | No. | NYSIIS | City | SOUNDEX | State | Zip | ACCT#
639 | N | MILLS | AVE | | | MAL | ORLANDO | O645 | FL | 32803 |
306 | W | MAIN | ST | | | MAN | CUMMING | C552 | GA | 30130 |
3142 | W | CENTRAL | AVE | | | CANTRAL | TOLEDO | T430 | OH | 43606 |
843 | | HEARD | AVE | | | HAD | AUGUSTA | A223 | GA | 30904 |
1139 | | GREENE | ST | | | GRAN | AUGUSTA | A223 | GA | 30901 | 1234
4275 | | OWENS | RD | STE | 536 | ON | EVANS | E152 | GA | 30809 |
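The SOUNDEX values in the result file (O645, C552, T430, A223, E152) are consistent with the standard American Soundex code of the city name. The sketch below is a minimal implementation of that standard algorithm. Whether QualityStage uses exactly this variant is an assumption, and the function name `soundex` is illustrative.

```python
# Digit assigned to each consonant group in American Soundex.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """Return the 4-character American Soundex code for a word."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it is kept
    return (result + "000")[:4]


for city in ("ORLANDO", "CUMMING", "TOLEDO", "AUGUSTA", "EVANS"):
    print(city, soundex(city))
```

Phonetic codes like these let later matching steps treat differently spelled values that sound alike as candidates for the same entity.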
Why Match
 Identify duplicate entities within one or more files
 Perform householding
 Create consolidated view of customer
 Establish cross-reference linkage
 Enrich existing data with new attributes from external
sources
Two Methods to Decide a Match
Are these two records a match?
WILLIAM | J | KAZANGIAN | 128 | MAIN | ST | 02111 | 12/8/62
WILLAIM | JOHN | KAZANGIAN | 128 | MAINE | AVE | 02110 | 12/8/62
Letter grades: B B A A B D B A = BBAABDBA
Weights: +5 +2 +20 +3 +4 -1 +7 +9 = +49
Deterministic Decision Tables:
• Fields are compared
• Letter grade assigned
• Combined letter grades are compared to a vendor delivered file
• Result: Match; Fail; Suspect
Probabilistic Record Linkage:
• Fields are evaluated for degree-of-match
• Weight assigned: represents the “information content” by value
• Weights are summed to derive a total score
• Result: Statistical probability of a match
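The two decision methods can be contrasted in a few lines of code, using the letter grades and weights from the slide. The field names, the contents of `DECISION_TABLE`, and the two cutoff values are hypothetical. The slide gives only the grades, the weights, and their combined results (BBAABDBA and +49).

```python
# Assumed labels for the eight compared fields; the slide shows only values.
FIELDS = ["first", "middle", "last", "house_no", "street",
          "street_type", "zip", "dob"]
GRADES = ["B", "B", "A", "A", "B", "D", "B", "A"]   # from the slide
WEIGHTS = [5, 2, 20, 3, 4, -1, 7, 9]                # from the slide

# Per-field view of both methods, e.g. PER_FIELD["last"] == ("A", 20).
PER_FIELD = dict(zip(FIELDS, zip(GRADES, WEIGHTS)))

# Hypothetical decision table and cutoffs, for illustration only.
DECISION_TABLE = {"AAAAAAAA": "Match", "BBAABDBA": "Suspect"}
MATCH_CUTOFF = 40
CLERICAL_CUTOFF = 25


def deterministic(grades):
    """Concatenate letter grades and look the pattern up in a table."""
    pattern = "".join(grades)
    return pattern, DECISION_TABLE.get(pattern, "Fail")


def probabilistic(weights):
    """Sum field weights and compare the total score to cutoffs."""
    score = sum(weights)
    if score >= MATCH_CUTOFF:
        outcome = "Match"
    elif score >= CLERICAL_CUTOFF:
        outcome = "Suspect"
    else:
        outcome = "Fail"
    return score, outcome


print(deterministic(GRADES))
print(probabilistic(WEIGHTS))
```

The table lookup can only recognize grade patterns that someone listed in advance. The summed score lets a strong agreement on one field, such as the rare surname worth +20, outweigh a minor disagreement on another.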
Why Survive
 Provide consolidated view of data
 Provide consolidated view containing the “best-of-breed”
data
 Resolve conflicting values and fill missing values
 Cross-populate best available data
 Implement business and mapping rules
 Create cross-reference keys
 Survivorship - Example
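Survivorship Input (Match Output)
Columns: Group | Legacy | First | Middle | Last | No. | Dir. | Str. Name | Type | Unit | No.
Group 1 (Legacy D150, A1367)
First: Bob, Robert
Last: Dixon, Dickson
No.: 1500, 1500
Dir.: SE
Str. Name and Type: ROSS CLARK CIR, ROSS CLARK CIR
Group 23 (Legacy D689, A436, D352)
First: Ernest, Ernie, Ernie
Middle: A, Alex
Last: Obrian, O’Brian, Obrian
No. and Dir.: 5901 SW, 5901 SW, 5901
Str. Name: 74TH, 74TH, 74
Type: ST, ST, ST
Unit and No.: STE 202, # 202
Consolidated Output
Cross-reference (Group | Legacy): 1 | D150; 1 | A1367; 23 | D689; 23 | A436; 23 | D352
Group | First | Middle | Last | No. | Dir. | Str. Name | Type | Unit | No.
1 | Robert | | Dickson | 1500 | SE | ROSS CLARK | CIR | |
23 | Ernie | Alex | O’Brian | 5901 | SW | 74TH | ST | STE | 202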
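The sketch below shows one way column-level survivorship rules could produce the consolidated names in this example. The slide does not state which rules were used. The choice of "most frequent value" for the first name and "longest value" for the middle and last names is an assumption that happens to reproduce the slide's output, and the pairing of middle names to records is also assumed.

```python
from collections import Counter


def longest(values):
    """Survive the longest non-empty value (first seen wins ties)."""
    candidates = [v for v in values if v]
    return max(candidates, key=len) if candidates else ""


def most_frequent(values):
    """Survive the most common non-empty value; break ties by length."""
    candidates = [v for v in values if v]
    if not candidates:
        return ""
    counts = Counter(candidates)
    return max(counts, key=lambda v: (counts[v], len(v)))


# Hypothetical per-column rules; not taken from the slide.
RULES = {"first": most_frequent, "middle": longest, "last": longest}


def survive(group):
    """Build one consolidated record from all records in a match group."""
    return {field: rule([rec.get(field, "") for rec in group])
            for field, rule in RULES.items()}


group_1 = [
    {"first": "Bob", "last": "Dixon"},
    {"first": "Robert", "last": "Dickson"},
]
group_23 = [
    {"first": "Ernest", "middle": "A", "last": "Obrian"},
    {"first": "Ernie", "middle": "Alex", "last": "O’Brian"},
    {"first": "Ernie", "last": "Obrian"},
]

print(survive(group_1))
print(survive(group_23))
```

Because each rule works on one column at a time, the surviving record can combine values from different source records, which is the "best-of-breed" behavior described above.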
How Does WebSphere QualityStage Integrate
Source Database: DB2, Oracle, Sybase, Onyx, IDMS, etc.
Data Extraction and Load Routines
QualityStage:
1. Investigation
2. Standardization
3. Integration
4. Survivorship
Target: DB2, Oracle, Sybase, Onyx, IDMS, etc.
WebSphere DataStage and
WebSphere QualityStage: Fully Integrated!
QualityStage: Data Quality Extensions
 IBM WebSphere QualityStage GeoLocator
 IBM WebSphere QualityStage Postal Verification
Products
WAVES (WorldWide)
IBM WebSphere Worldwide Address Verification Solution
 IBM WebSphere QualityStage Postal Certification
Products
CASS (United States)
SERP (Canada)
DPID (Australia)
 IBM Information Server Data Quality Module for SAP
 IBM WebSphere QualityStage for Siebel
Key Strengths for IBM QualityStage
 Intuitive, “Design as you think” User Interface
Simple rule design & fine tuning
 Seamless Data Flow integration
 Intuitive rule design & fine tuning
 Defining the technology standard with SOA
 Industry leading probabilistic matching engine
Thank You