4. Data Cleansing, Quality & Enrichment

advertisement
Data Cleansing,
Quality&Enrichment
Cleanupyourdataactor
face theconsequences
Introduction
www.dataxpander.com
Call Us: 1-281-407-7582
info@dataxpander.com
Poor data quality is the primary reason for 40% of all business initiatives failing to achieve their targetedbenefits (according to
Gartner). It has been found that data quality problems cost 10% of the total revenue. The staff of an organization spends 25% of
its time in handling customer complaints caused by erratic data, fixing incorrect data, finding missing data, and clarifying data
that doesn’t make anysense.
BADDATA–BADIMPACT!
Respondentsinas urveysaidthatpoordataqualitycauses
www.dataxpander.com
Call Us: 1-281-407-7582
info@dataxpander.com
With the evolution of information age, there is enormous amount of data which has been made readily available t hr ou g h
t he vi rt u al m edi aa nd s o p his t ic at e d c ommunic ations net work. Thereis a wides pread conceptual application of
internet in supply chains, data mining, knowledge management and many other related concepts. There has been a
considerable inc reas e in independent distributed database servers that directly provide online info -retrieval servic es to end
users Thishas facilitatedinformationmanipulationofmulti -sourc edat atoa great extent. The resulting integration of dat a
frommultiple independent sources, results in certain incompatibilities. Most of the users neglect the importance of data in
computers but data acts as the real fuel in the information technology engine. Incorrect and futile data s hould be
removedasitresultsinfaultyanalysis.
Inaccurate data leads users to make faulty decisions. The impact of data errors is felt by everyone at one time or another,
no matter to whic h function
of work the user belongs – management, internet applications, financial applications, marketing
87%
or information servic es. Mostly, poordat aqualityresultsinlossoftime,moneyandtheeffort behindcriticalbusinessdecisions.
27%
mappingfunctionsfordatacleansinginaccordancewit hthe Meta
data should be specified and be reusable for not just
otherdatasourcesbutalsoforqueryprocessing.Especially
for
WhatisDataCleansing?
71%
49%
Data cleansing, also called data scrubbing, deals with
detecting and removing errors and inconsistencies from
data in order to improve the quality of data. Data quality
data warehouses, a workflow infrastructure should be
supported to execute all data transformation steps for
multiplesourcesandlargedat asetsinanefficientmanner.
problemsarepres entinsingledatacollections,suchas files
and databases, e.g., due to misspellings during data entry,
missinginformationorot herinvaliddata.Whenmultipledata
sources71%-Financialreportingproblems
need to be integrat ed, e. g., in data warehous es,
federat49%
ed -dat
Operations
abase systems
problemsor
from
global
bad data
web based
information
systems, the need for data cleansingincreases
87%-Excesscostsforbusinessoperations
substantially. This is because the sourc es often cont ain
27%-Strategicplanningproblems
redundant dat a in different representations. In order to
pro vide acc ess to acc urat e and cons ist ent dat a,
Sourcecons olidation of different data repres entations and
http://www.mytechanalyst.net/2013/10/07/
eliminationofduplicateinformationbecomenecessary.
baddata-is-a-social-disease-part-1/
Report
Enrich
Profile
Cleanse
Merge
Transform
Match
A data cleansing approach should satisfyseveral
requirements:
1. Detect and remove all major errors and inconsistencies
bothinindividualdatasourcesandwhenintegratingmultiple
sources. Theapproachshouldbesupportedbytoolstolimit
manual ins pection and programming effort and be
extensibletoeasilycoveradditionalsources.
2. Data cleansing should not be performed in isolation but
should be based on comprehensive metadata. Required
www.dataxpander.com
Call Us: 1-281-407-7582
info@dataxpander.com
WhyareDataCleansing,Qualityan
d Enrichmentimportant?
According to a Market report, the greatest barriers t o
B2B lead generation are predominantly the result of
bad data quality. With such huge repercussions that
the business needs to face due to poor data;
Cleansing, Quality and Enric hment gain importance.
Most functions in business are dat a cent ric from
accounting to highly creative department marketing.
With
such
emphasis
on
good
dat a
leadingtogreatbusinessperformanceitiscriticaltoensure
thedat acomplies withtheindustrystandards.
TRADITION AL APPROACH
Data collected f or specifieduse
User is the datasubject
Individual provides legalconsent
but is not trulyengaged.
Economic v alue andinnov ation
come from combining data sets
and subsequentuses.
User can be the data subject,the
data controller, and/or data
processor
NEWPERSPECTIVE
Data actively collected withuser
awareness
Most data f rom machine tomachine
transactions& passive collection difficult to notifyindiv iduals
Definition of personal datais
predetermined andbinary
Definition of personal datais
contextual and dependent on
socialnorms.
Policy framework f ocuseson
minimizing risks to theindividual.
Policy focuses onbalancing
protection with innovation and
economicgrowth.
Individuals engage andunderstand
how data is used and how value is
created
With so many sources for the collation of dat a, the
probability of making errors like duplication and
missingof fieldsishigh.Alsowiththeemergenceofcloudbaseddata collection techniques the reasons and the
approaches for data collection have also changed.
Now,
data
collection
is
notdonejustwiththeintentionofimplementingapa rticular
process but also helps in making strategic business
decisions. These decisions det ermine whether or not
a process is worth t he effort. From pre -planning t o the
execution and further to the result; data drives the
whole cycle.
www.dataxpander.com
Call Us: 1-281-407-7582
info@dataxpander.com
Phases in DataCleansing
Data Standardization:
Ingeneral,datacleansinginvolvesseveralphases
Data
Analy sis
Quality
Check
Data
Cleansing
Data
Normalization
Datastandardizationisthefirststeptoens uret hatyourdat a
isreadytobesharedacrosstheent erprise. Thisestablishes
trustwort hinessforthedatatobeusedbyotherapplications
intheorganization.Ideally,suchstandardizationshouldbe
Data
De-duplication
performed during data entry. If, for some reason this is not
possible, a comprehensive back-end process isnecessary
to eliminate any inconsistencies in the data. Data
Data
St a n d a rdi za ti o n
standardization is usually done under name heads like
Name,Address,Product,Business,Financialetc.
Data
Standardi zation
Datanormalization:
It is the process of organizing the fields and tables of any
relationaldatabasetominimizeredundancy.Normalization
usually involves splitting of larger tables into smaller (and
less redundant ) tables and mapping relationshipsbetween
Dataanalysis:
In order to detect which kinds of errors andinc onsistencies
are to be removed, a detailed data analysis is required. In
additiontoamanualinspectionofthedataordat asamples,
analysis programs should be used to gain metadata about
thedat apropertiesanddetectdataqualityproblems.
them. The objective is to isolate data so that additions,
deletions, and modifications of a field can be made in just
one table and then propagated through the rest of the
databaseusingthemappingscreated.
QualityCheck:
Datade-duplication:
It is a specialized data compression technique for
eliminating duplicat e copies of repeating data. This
techniqueisusedt oimprovestorageutilizationandcanalso
beappliedtonetworkdatatransferstoreducet henumberof
bytes that must be sent. In the de-duplication process,
unique chunks of data, or byte patterns, are identified and
stored during a process of analysis. As the analysis
continues, other chunks are compared to the stored copy
and whenever a match occurs, the redundant chunk is
replaced wit h a small referenc e that points to the stored
chunk.Giventhatthesamebytepatternmayoccurdozens,
hundreds,oreventhousandsoftimes(thematchfrequency is
dependent on the chunk size), the amount of data that
This should be performed at every level of Data cleansing.
But for obvious reasons, like the source and timing of the
data receipt might vary largely, making it difficult to check
and assure the quality at the very beginning. Hence an
assigned stage of quality check becomes imperative. A
typical outline of the timeline at which a quality check
processisassignedisillustratedbelow–
Data
generation
• Differentsources
• Different times of datacollection
Different
fileformats
• Excel files
• Textfiles
mustbestoredortransferred.
Application
program
functionalities
www.dataxpander.com
Call Us: 1-281-407-7582
• Importing data from differentfiles
• Filling datagaps
• QualityCheck
• Derivedparameters
info@dataxpander.com
BenefitsofDataCleansing
The5keyissueswithdataquality
There are plenty of financial benefits to cleansing your
databases, not only will such activities stop you wasting
ISSUE#1:InconsistentData
money on storing old or obsolete dat a, but the improved
accuracycanresultingreaterreturnoninvestmentforyour
marketingactivities.
Data storage at more than one place results in data
inconsistency.Thusonsomeoccasions
changesmaynot bemadeatallplaceswhic hresultsinlack
ofdataint egrity.
Assists in your target profiling, meaning t hat your
marketingcampaigns willbemorefocused,relevantand
s ubs eque nt ly more lik ely t o be s uc c es s ful.
Understandably, using obsolete data for your targeting
activities willonlyleadtomisdirectedcampaigns
The average duplication ratio in vendor names is more
frequentthanthat ofsoftwareproductnameswhichis10:1 than
20:1 in the latter. This is often observed bec ause the
consistency across names is difficult to assign. Most
software product names are trademark or registered
entities.
Storing old information can actually harm your brand
imageandcustomerperception.Forinstance,ifyouare
consist ently sending mark eting lit erat ure to old
addresses,customersthatmayhavediedoroptedout of
receiving promotional material, your efforts are actually
helping t o create a wider negative image, rather than
returnoninvestment
IncompliancetotheDataP rotectionActwithregards to
the storage of pers onal information, cleansing
activities are necessary to eliminate any chances of
violationoftheactandsubsequentpenalties
Finally, improving the quality of your data results in
improved customer insight and adds considerable
valuetoyourmarketingeffortsbyimprovingROI
ISSUE#2:DuplicateorConflictingData
Most of the relational databases are a combination of data
from multiple sources; they often contain duplicate data. If
the database is to be the authoritative system of record
supporting IT decisions and processes, it needs to resolve
thoseconflictsandfilteroutduplicates. Theproblemandthe
impact can be huge. Different inventory and asset
management tools may insert duplicate data, for many
reasons.Whentheproblemscalesup,searchingandfixing
theduplicates/conflictscanbeahugetask.
An unbiased research of the industry conducted by BDNA
shows that 40 percent of dat a collected across different
sourcesisrepeat ed.
These are just some of the reasons to undert ake dat a
cleansing, however, as a business, it should be
rememberedthattheprocessisongoing. Databasesrequire
ISSUE#3:IrrelevantData
practically constant management and cleansing to ensure
theycontain,uptodate,accurateandvaluableinformation.
Inshort,improvethequalityofthedata.
data can reduce t he data footprint significantly. For
example, whenlookingatdataforanaudit,youdonotneed
to
look at files that are individual components of a larger
bundle.Ignoringthosefileshelpstoeliminatethefrequency,
Report
sothatyoucanfoc usontheremainingperc entofdatathatis
relevant.
Pr ofile
Enrich
Another issue is data relevance. Removing the irrelevant
Transform
AresearchdonebyBDNAshowsthat95%ofdatagathered
fromvariousdiscoverysourcesisirrelevant.
Cleanse
Merge
Match
www.dataxpander.com
Call Us: 1-281-407-7582
info@dataxpander.com
ISSUE#4:IncompleteData
After you have addressed issues like inconsistency and
duplication, the database needs to be looked in formissing
datathatyouneedt omakedecisions. Thes ourceofthedata
has to be authentic for the data to be of the highest quality.
Along with the source, the collection systems also largely
determinethequalityandcompletenessofthedat abas e.IT
systemslackcriticalinformationy ouneedtoknowaboutall
of your assets, such as end-of-life or end-of-support date,
licensing/packaging options, current version, etc.These
externaldatapoints–market data–areessentialformany dayto-dayprocesses.
There are various c ategories of data fields that might be
incomplet e. The probability of missing data should be
assessed prior to the database design. For example, a
respondentsurveyaboutincomeofanindividualhasavery
highchanceofbeingleftincomplete.
Intrinsic – The data should be accurate. Accuracy covers
most of the aspects of duplication, relevanc e and
authenticity. The sources of the data collated should be
reliable and reputable. A clear objective needs to be
specifiedforthedatatobeacceptableashighquality.
Cont extual–Therightthingattherighttimeisthekeytothe
context of a database. Timely and relevant value added to
thedat abaseensuresthec omplet enessofthedata.Also,a
largedatabas edoesnotmeanadatabasewhichhasfields
thatareofnomonetary orstrategicrelevance.
Representational – The database needs to be easy to
understandandus erfriendly.Acomplicateddataonlyuses
alotofmoretimefortheusertoaccessandmakegooduse
ofthesame.
Accessibility–Thesecurityandavailabilityofthedataover
networks and at the required time of access is essential.
Dataavailableattherighttimeenablesrightdecisions.
ISSUE#5:IncompleteData
It’smandatorythatsomeofthedat ainyourentiredatabas e
is
now outdated. There is a continuous inflow of some
informationwiththeITsystemsinplacecurrently.Mostdata
collected and then collated are more a mundane process
thanarequirement.Withsuchcollectionsystems,weneed
todetermineathresholdafterwhic hthedatawhichisnotof
anyrelevantuses houldbedestroyed. This wouldeliminate
theproblemofoutdateddataatlarge.
Dimension
Characteristics
Intrinsic
Believ able, Accurate
Objectiv e,Reputable
Contextual
Value-Ad ded,Relev ant
Timely , Complete
Appropriateamo unt
Representational
Interpretable, Easy to understand
Consistent ,Concise
Accessibility
Av ailable,Secure
25 percent of the Fortune 500 use software that is past its
supportdate.
Featuresthatdescribedataquality
We had a look at the issues due to which the quality of the
databasesgetshampered.Itisimportant forustoknowthe
essential features that help us in t he assessment of the
qualityofanylargedata. Thefourc ornerstonesofdataofthe
highestqualityare
www.dataxpander.com
Call Us: 1-281-407-7582
info@dataxpander.com
Data Enrichment–
Howisitdifferentfromdataquality?
The
fourth
step
in
an
effective
dataqualitymethodologyinvolves c reating as complete a
picture as possible to support strategic decision-making.
Data enrichment can range from something as simple as
adding a missing postcode, to augmenting records with
demographic or geographic data based on a name or
address match. It's a simple process, with the value
gained far outweighing the effortrequired.
ListManagement
Email Lists
ListBuilding
Data Management
Data Cleansing
Data Profiling
EnhanceandEnrichexamples–
DataVerification
Geographic:suchaspostcode,countyname,longitudeandlati
tude,andpoliticaldistrict
DataAppending
Email Appending
Phone Appending
Behavioral:
including
purchases,
riskandpreferredcommunicationchannels
credit
Cont actAppending
SocialMediaProfileA ppending
Demographic: such as income, marital status, education,
ageandnumberofchildre
Conclusion
Psychographic: ranging from hobbies and interests to
politicalaffiliation
The maint enance of data isn’t something that occurs only
Cens us:householdandcommunitydata
before you use it for marketing purposes. It encapsulates
While most data quality tools focus on name and address
data, the data enrichment processes should support any
numberofnon-nameandnon-addressdatatypes,including
dates, telephone numbers, account numbers and e-mail
the processes before use of the dat a and after use of the
data.Proc essesneedtobeinplacewhereallchangesand
updatesarefedbackintothedatabaset oensureaccuracy and
consistency. Networks are changing all the time, with
information, to name just a few. The process should also
enablefullcustomizationofthedataqualityrulesengine,so you
can modify existing rules, tweak dat a typesandcreat eyour
own data types. For example, If you want to parse,
newdevicesandsoftware.Andthevendorsare
constantly changing, with new software versions, patch
standardizeandmatchproductnames,yousimplycreatea
"product name" data type that is specific to the type ofdat a
youhave.
of updates each week. The industry keeps changing,
updates, product names, mergers and acquisitions, and
support changes. A large organization may havehundreds
issuing new releases, while you’re gathering the data.
Organizations often spend a lot of time and money in the
initial setup of the collection of the data and setting up the
AboutUs
system to collate. But then they find that the updates are
quiet frequent. Any failure in maintaining andupdating
Dataxpander offers following Data Solutions to keep your
data
up-to-date
to get optimum
res ults
and
improveROI. Translatedataintoactionableinsightswithour
solutions.Weprovidethefollowingservices:
wouldleadtooutdateddatasystemsandfields.Duetothe
various issues that impact data quality, only a very small
percentage of the overall data is really clean. Most of the
remainingdat ashouldbediscarded.Thecleandatashould
then be enriched with market information to give you the
datathatreallymatters.
www.dataxpander.com
Call Us: 1-281-407-7582
info@dataxpander.com
References
1. Data mining and knowledge discovery handbook,
Chapter 2, Data cleansing – A prelude to knowledge
15. Data Quality Strategy: A Step-by-Step Approach http://www.meritalk.com/uploads_legacy/whitepapers/WP
discovery;Jonathan.I.Maletic,Kentstateuniversity.
3131_A_DQ_Step_Approach.pdf
2. DataMiningandKnowledgeDiscovery,2,9–37(1998)©
1998KluwerAcademicPublishers,Boston
16.
.
Ma n ag e me nt
3. Data Cleaning: Problems and Current Approaches;
http://dbs.uni-leipzig.de
E s s ent ia l
D at a
Q ual it y
-
http://dma.org.uk/sites/default/files/tookit_files/wp_essent i
al_data_quality_management.pdf
4. Think you have clean data in your CMDB? Think again!
www.bdna.com
5. What does Data Cleansing & E nhancement mean and
howc anIjustifyit?
www.celsiusinternational.com
6. DataQuality-AproblemandA nApproach–Whitepaper,
Authors-JavedBegandShadabHussain
7.
How Data Cleansing always saves money for Mailing
Campaigns-http://www.data-8.co.uk
8. Enterprise Data Quality – Viewpoint by Sivaprak asam
S.R.
Dataxpander is a leading provider of digital marketing and
data-driven services. The brand's forte lies in its data
intelligence, which holds the largest intellectual mapping
available in the industry. As an expert B2B marketing
solutions provider, Dataxpander specializes in customized
services using the latest business models in online
marketing, search marketing, and innovative data
strategies. It is the world's only social verified and email
verified data provider today. With nearly a decade's
expertise in digital mark eting, its business intelligence
enables companies to utilize the intellectual online
marketing strategies along with dat a insights, market
reports, and IT support services. Cons ulting, Marketing, or
Outsourcing solutions —Dat axpander is the most
preferredchoice.
9. Data Quality Management –Maximize Your CRM
InvestmentReturn-www.activeprime.com
10. TrendsInDataQualityAndBusinessProcessAlignment
–Aforresterresearchreport,November2011
11. The Cost of Dirty Data – An educational report
https://www.vha.com/Solutions/Analytics/Pages/DataLYNX
.aspx
12. The Cost of Poor Data Quality - How to make better
business decisions and positively affect your bottom line ©
TheD&BCompaniesofCanadaLtd
13. A cas efor
www.imaltd.com
14.
Dat aCleans ing
–
W hit epaper
-
Data Quality: A Critical Component of B usiness
Assurancehttp://www.datacons ulting.co.uk/Files/wp_dat a_quality.pdf
www.dataxpander.com
Call Us: 1-281-407-7582
info@dataxpander.com
Download