Fuzzy Duplicates Analysis with ACL
Prepared by: Kevin Legere
Date: April 3rd, 2013
ACL | Transforming Audit and Risk
© 2012 ACL Services Ltd.

Agenda
– Overview
– Example
– FUZZYDUP command
– OMIT() Function
– Script Editor and RECOFFSET
– Q&A

Overview
What is a "Fuzzy Duplicate"?
– A match based on criteria where the values are not exact but very close
  » EX: "ACL Services" and "ACL Service"
Typically used for:
  » Keyword matching
  » Invoice Number matching
  » Vendor Name matching*
  » Employee Name matching
Can be simple or complex
  » Completely depends on your approach and desired accuracy
* focus for this presentation

Simple Match Examples:
– Exact or 100% match
  » "ACL" = "ACL"
– Force Upper or Lower case
  » "ACL" = UPPER("acl")
  » "acl" = LOWER("ACL")
– Removal of special characters
  » "ACL" = EXCLUDE("*ACL." "!@#$%^&*().")
– Only compare numbers or letters
  » "ACL" = INCLUDE(UPPER("ACL123") "ABCDEFGHIJKLMNOPQRSTUVWXYZ")
  » "123" = INCLUDE("ACL123" "1234567890")

Complex Match Examples:
– Removal of company type indicators (LLC, INC, LTD, etc.)
  » "ACL Services Ltd." = "ACL Services"
– Percent of word match, AKA letter by letter
  » "ACL Services" compared with "ACL Service"
    • 11/12 character match, or about 91.7% match
– Word by Word*
  » "ACL Services" compared with "ACL Champions"
    • "ACL" matches "ACL"
    • "Services" does not match "Champions"
    • 1 of 2 words match = 50% match
– Levenshtein distance
– Sounds like
– NYSIIS
* Most used by ACL Consultants

Vendor Master Analysis
Fuzzy Duplicates on Vendor Name
– Possible Risk
  » Payments are being sent to more than one vendor
– May not involve risk. The goal may simply be to normalize the vendor master list so that duplicates do not exist.
  » Ideally, one unique vendor should exist in your vendor master list, with one or more address records in your vendor address table

Sample file contains 75 vendors
– Only Vendor Code and Vendor Name
Where do you start for Vendor Name matching?
– Look for exact duplicates
– Focus on Simple matching
– Sort or Summarize!

Step 1: Summarize your Vendor Master File
  » Choose Vendor Name as your key field
  » Add Vendor Code as the Other Fields for Summarizing
  » Be sure to check "Presort"

Step 2: Quickly comb over the data to identify a common trend.
  » We will focus on this issue, in the sample data:
  » Create a computed field that corrects the trend (or cleans the data).

Functions used in Default Value text box:
INCLUDE(UPPER(ALLTRIM(Vendor_Name)) 'ABCDEFGHIJKLMNOPQRSTUVWXYZ')
Within ACL, the computed field will return the following:

Step 3: Perform a Duplicates Command on the computed field
Results are as follows:

FUZZYDUP command
ACL 9.3 has new features that make Fuzzy Duplicate analysis easier
– FUZZYDUP command
– OMIT() function
– ISFUZZYDUP() function
– LEVDIST() function
Important parameters to understand
– Levenshtein Distance
– Difference Percentage
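The three vendor master steps above (summarize, clean the name with a computed field, then run duplicates on the cleaned field) can be illustrated outside ACL. The following is a minimal Python sketch, not ACL script: `clean_name` imitates the effect of INCLUDE(UPPER(ALLTRIM(Vendor_Name)) 'A-Z'), and `duplicates_on_clean_name` stands in for the Duplicates command. Both function names and the sample rows are hypothetical and are not taken from the presentation's 75-vendor file.

```python
from collections import defaultdict

LETTERS = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ")


def clean_name(vendor_name: str) -> str:
    """Trim, upper-case, and keep only the letters A-Z (hypothetical helper)."""
    return "".join(ch for ch in vendor_name.strip().upper() if ch in LETTERS)


def duplicates_on_clean_name(vendors):
    """Group (vendor_code, vendor_name) rows by cleaned name.

    Returns only the groups that contain more than one row, which is the
    same idea as running a duplicates test on the computed field.
    """
    groups = defaultdict(list)
    for code, name in vendors:
        groups[clean_name(name)].append((code, name))
    return {key: rows for key, rows in groups.items() if len(rows) > 1}


# Hypothetical sample rows, made up for illustration only.
vendors = [
    ("V001", "ACL Services"),
    ("V002", "A.C.L. Services "),
    ("V003", "acl-services"),
    ("V004", "Acme Supply"),
]

for key, rows in duplicates_on_clean_name(vendors).items():
    print(key, rows)
```

In this sketch the first three rows collapse to the same cleaned key, so they are reported as one duplicate group, while the fourth row is left alone. Stripping everything except letters is deliberately aggressive: it catches punctuation and spacing differences but cannot catch misspellings, which is where the fuzzy techniques come in.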
FUZZYDUP command
Syntax
– FUZZYDUP ON {key_field} <OTHER fields> {LEVDISTANCE value} <DIFFPCT value> <RESULTSIZE value> <EXACT> TO table_name
Example
– FUZZYDUP ON Vendor_Name OTHER ALL LEVDISTANCE 2 DIFFPCT 50 TO My_Results
Levenshtein Distance (LEVDISTANCE)
  » The number of single-character edits (insertions, deletions, or substitutions) required to make the strings equal
    • EX: "Smith" and "Smythe" have a Levenshtein Distance of 2
Difference Percentage (DIFFPCT)
  » The threshold for percentage difference between two strings
    • EX: "Smith" and "Smythe" have a Percentage Difference of 40%
    • (2/5) * 100% = 40%

OMIT() Function
When do I use OMIT()?
– When you want to refine fuzzy duplicate analysis
– Look for repeating strings you want to remove from your Vendor Name field
Syntax
– OMIT(string1, string2 <,case_sensitive>)
– Specify T to make substrings specified for removal case-sensitive, or F to ignore case
Example
– OMIT(Vendor_Name " Ltd, Inc, Corp, Corporation" F)

Script Editor and RECOFFSET

Contact Information
Kevin Legere
Implementation Consultant
ACL Services Ltd.
1550 Alberni Street, Vancouver, BC, Canada V6G 1A5
kevin_legere@acl.com | @aclkevin
www.acl.com/linkedin | www.acl.com/twitter | www.acl.com/facebook
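For readers who want to see the LEVDISTANCE, DIFFPCT, and OMIT() ideas from the FUZZYDUP slides worked through by hand, here is a minimal Python sketch. It is not ACL's implementation: the function names `levenshtein`, `diff_pct`, `omit`, and `fuzzy_pairs` are hypothetical, the difference percentage is assumed to divide by the length of the shorter string (which is what the (2/5) example implies), and the pairwise loop is a brute-force comparison chosen for clarity.

```python
import re
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete from a
                            curr[j - 1] + 1,       # insert into a
                            prev[j - 1] + cost))   # substitute (or match)
        prev = curr
    return prev[-1]


def diff_pct(a: str, b: str) -> float:
    """Distance as a percentage of the shorter string's length (assumption)."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0 if a == b else float("inf")
    return 100.0 * levenshtein(a, b) / shorter


def omit(text: str, terms, case_sensitive: bool = True) -> str:
    """Remove each term from text; longer terms go first in this sketch
    so that removing " Corp" cannot leave "oration" behind."""
    flags = 0 if case_sensitive else re.IGNORECASE
    for term in sorted(terms, key=len, reverse=True):
        text = re.sub(re.escape(term), "", text, flags=flags)
    return text


def fuzzy_pairs(names, lev_distance=2, diff_pct_max=50.0):
    """Return (a, b, distance) for every pair within both thresholds."""
    pairs = []
    for a, b in combinations(names, 2):
        d = levenshtein(a, b)
        if d <= lev_distance and diff_pct(a, b) <= diff_pct_max:
            pairs.append((a, b, d))
    return pairs


print(levenshtein("Smith", "Smythe"), diff_pct("Smith", "Smythe"))
print(omit("Acme Supply Ltd", [" Ltd", " Inc", " Corp", " Corporation"], False))
print(fuzzy_pairs(["Smith", "Smythe", "Jones"]))
```

The defaults of 2 and 50 in `fuzzy_pairs` mirror the LEVDISTANCE 2 DIFFPCT 50 example. Applying `omit` to vendor names before calling `fuzzy_pairs` follows the same order suggested on the OMIT() slide: strip the repeating company-type strings first, then compare what remains.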