Techniques for Collecting and Harmonizing External Data

By Neil Hepburn
Speaker Bio
 Data Architect for Call Genie Inc.
 Has worked in telecom for the past 7 years, and in IS/IT for the past 15 years
 Is GM of marketing for TUN3R.com, an Internet radio aggregator
 Does part time consulting, including workshops on external data collection for consultancy Hepburn Data
Inc.
Neil Hepburn
Overview
 Fact-based decision making has begun to capture the popular imagination – beyond the
traditional data warehouse. The demand for fact-based approaches is real
• The first goal of this presentation is to trace the lineage of the fact-based management
approach, and point out the pitfalls when used the wrong way
• Members of IRMAC and DAMA live in a fact-based world. However, the data is mainly
confined to data produced internally.
– External data is collected and managed inconsistently.
 Internal BI, and external market data are the respective "peanut butter" and "chocolate" of
market analytics, but the greatest value comes when they are blended together
• My second goal is to explain why the DW/BI group should play a central role, and form
closer bonds with marketing
 The techniques external marketing firms employ can be more easily adopted internally than
vice versa
• My third goal is to provide pragmatic and concrete guidance which can be actioned by
marketing and a BI/DW group working in collaboration
• The bulk of this presentation will be about this
Presentation Roadmap
 Modern history of analytics-driven enterprises
 Rationale for internalizing external data collection
 What consumer marketing departments usually look for:
• Competitive Intelligence explained
• Location Intelligence explained
• Householding explained
 The ethics of external data collection
 First steps: Integrating the Canadian Census and Elections Canada data
 Locating and evaluating data sets for purchase
 Primary Field Research
 Web Harvesting
 Integrating external data sets to derive new insights
 First and Last Name analysis
 Lessons learned from The Whiz Kids
Background & History Part 1
 The current wave of “Cultures of Analytics” has begun to capture
the popular imagination. In the last three years we
have seen the following books released:
• The Numerati (by Stephen Baker)
• Competing on Analytics (by Thomas Davenport & Jeanne
Harris)
• Super Crunchers (by Ian Ayres)
• Data Driven: Profiting from your most Important Asset (by
Thomas C. Redman)
 Much of the inspiration behind these books originates from
“Moneyball: The Art of Winning an Unfair Game” (by Michael
M. Lewis), which documents the success of the Oakland A’s
through “Sabermetrics”: taking an analytical approach to
team picks and real time game strategy
 It’s all good stuff, but really nothing new…
Background & History Part 2: The Whiz Kids
 In the early forties ‘Tex’ Thornton
recruited Jack Reith, Robert
McNamara and 7 other “Whiz Kids”
from the Harvard Business School to
form a new department within the US
Air Force.
 This department was called Statistical
Control and its mandate was to base
all management decisions on numbers
eschewing emotion and intuition
 The Whiz Kids quickly saved the Air
Force billions of dollars.
 They moved on as a group to work for
The Ford Motor Co. with mixed results
 We will return to this last point at the
end of this presentation…
Why, What, and Who? External Data for Marketing
 External data is collected primarily to support Sales & Marketing decisions in the following subject areas:
• Advertising
– What communities should I advertise in?
– What media should I use (e.g. television, Internet, print, radio, direct mail, out-of-home, etc.) and how
should I be targeting within each media?
- Properly managed external data sets can significantly cut down on “spray and pray” approaches to
advertising
• Pricing
– How much are my competitors charging?
– How have their prices changed over time?
– How do I assign value to individual product attributes?
• Competitive Positions
– How do I stack up against my competitor across…
- Product Pricing (as discussed already)
- Retail/Branch/Kiosk locations
- Customer Service (e.g. average pick-up time, IVR complexity, problem resolution times, etc.)
• Market Insights. For example:
– Where are my early adopters?
– How am I performing across gender/age/ethnic segments?
• Store openings and closings (i.e. retail network optimization)
 This presentation is primarily focused on data to support Sales & Marketing initiatives
Why, What, and Who? External Data for Risk Awareness
 External data is heavily used in finance and insurance for risk awareness. The most common examples
include:
• Mortgage Limits, Approvals, Property Assessments
• Loan Limits, Approvals, and Rates
• Insurance Premiums
 Other common forms of risk awareness include:
• Background checks (e.g. for new hires)
• Store/branch location risk
– Risk of natural disaster. Typically flooding, but also hurricanes, avalanches, and earthquakes
– Risk of crime and theft, including warehouse or retail inventory shrinkage
 Since these external data and usages are mature within the finance and insurance industry, this presentation
does not spend as much time on risk awareness
Why, What, and Who? Practically Anybody
 Many companies rely on upstream data sources as part of their business. Most
organizations (with the exception of the government) are not transparent about their data
collection methods and operations
• Evaluating upstream data quality is not difficult and will often provide results that may
surprise (and even explain quite a lot)
 Decision makers are apt to make decisions with the information they have, and make
“reasonable assumptions” where there are gaps.
• The more information you can access to close off these assumptions, the better. The old
saw about “assuming” is as true as ever
Rationale for Internalizing External Data Collection Part 1
 There exists a chasm between internal domain knowledge and external data set knowledge
• Marketers may realize this, but are not equipped to deal with it.
• They need to support decisions with market facts, and will not wait for IT to validate those facts
• Warning warning! Danger danger!
 External vendors selling raw or packaged data sets are geared towards selling to marketing departments
• The average marketer does not know how to evaluate data quality
• Many data products are not transparent, and come with many “gotchas”, especially when analyzing
micro-markets.
 It is easier to comprehend both internal and external data sets when going from the internal to the external than the other way
around.
• You probably have a better understanding of StatCan data than StatCan has of your data.
 The juiciest stories come from when we derive new facts through the “JOINing” of existing facts
• Derived facts often provide the most valuable insights.
– E.g. Identify customer characteristics with respect to regional income levels and property values. Very
interesting stories will quickly surface.
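As a rough illustration of the “JOINing” idea above, the sketch below joins internal customer records to a regional profile using an in-memory SQLite database. The table names, column names, and all figures are hypothetical and are not real customer or StatCan data.

```python
import sqlite3

# All tables and values below are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, fsa TEXT, annual_spend REAL);
    CREATE TABLE region_profile (fsa TEXT PRIMARY KEY, median_income REAL);
    INSERT INTO customer VALUES
        (1, 'M5V', 1200.0), (2, 'M5V', 800.0), (3, 'K1A', 300.0), (4, 'X0X', 50.0);
    INSERT INTO region_profile VALUES ('M5V', 90000.0), ('K1A', 60000.0);
""")

# LEFT JOIN keeps customers whose region has no external profile,
# so gaps in the external data stay visible instead of silently vanishing.
rows = conn.execute("""
    SELECT c.fsa,
           COUNT(*)            AS customers,
           AVG(c.annual_spend) AS avg_spend,
           r.median_income
    FROM customer AS c
    LEFT JOIN region_profile AS r ON r.fsa = c.fsa
    GROUP BY c.fsa, r.median_income
    ORDER BY c.fsa
""").fetchall()

for row in rows:
    print(row)
```

The derived fact here is average spend per region set beside regional income; neither source holds that on its own.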
Rationale for Internalizing External Data Collection Part 2
 By creating internal programmes you can be more agile at collecting data, especially during
crucial times of the year and crucial events.
• You can react instantaneously to business information needs
 You are in a better position to quantify and measure Competitive Intelligence (as opposed to
ad hoc reports)
 It is often cheaper to collect the data on your own
 It is a nice break from routine for many employees
• Going outside to collect information can be surprisingly enjoyable
• Many people perceive external data collection as being risky, when in fact it is not.
Naturally:
– Some people will resist external data collection, often due to perceived ethical concerns
– Other people will appreciate the learning opportunity
• Recommended to check with HR first to understand insurance and WSIB risks
Competitive Intelligence explained
 Competitive Intelligence (CI) is the legal and ethical practice of obtaining
public domain information about one’s competitor
 Competitive Intelligence can be either qualitative or quantitative
 Qualitative
• The most common form of CI, but its value degrades quickly over time
• E.g. press releases, Google Alerts
 Quantitative
• Methodically collected and managed, time seriesed data increases in
value, but is less common
• E.g. Competitor pricing across product lines
Location Intelligence explained
 Location Intelligence (LI) is the discipline of managing regional attributes
 Similar to geomatics, but is different in that LI is not solely focused on
physical attributes, but rather all attributes.
 The most granular unit of LI tends to be a single parcel of land (i.e. an
address), but can be as high as province or country
 Most LI tends to be focussed on the address level (householding),
but these other units are commonly used:
• postal code (recommended)
• Forward Sortation Area [FSA] (the first three characters of the postal code)
• Dissemination Area [DA] (the most granular level of the Census where
qualitative attributes are revealed)
• Block (the most granular level of the Census, where only population
figures are revealed)
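Since the FSA is simply the first three characters of the postal code, rolling addresses up to that level is mostly a cleaning exercise. The sketch below is one minimal way to do it; the function names are hypothetical, and the pattern checks only the letter-digit-letter digit-letter-digit shape, not whether the code actually exists.

```python
import re

# Shape check only: letter-digit-letter digit-letter-digit.
POSTAL_CODE = re.compile(r"^[A-Z]\d[A-Z]\d[A-Z]\d$")


def normalize_postal_code(raw):
    """Return the code as 'A1A 1A1', or None if it does not fit the pattern."""
    compact = re.sub(r"[\s-]", "", raw).upper()
    if not POSTAL_CODE.match(compact):
        return None
    return compact[:3] + " " + compact[3:]


def fsa(raw):
    """Return the Forward Sortation Area (first three characters), or None."""
    code = normalize_postal_code(raw)
    return code[:3] if code else None


print(normalize_postal_code("m5v3l9"))  # M5V 3L9
print(fsa(" k1a-0b1 "))                 # K1A
print(normalize_postal_code("12345"))   # None
```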
Householding
 The most granular level of LI data, and the most valuable
 Most householding initiatives are focussed on data quality.
• E.g. cleaning up addresses, removing duplicates, and confirming occupants’ identities (e.g. Trillium
Software, IBM InfoSphere QualityStage, etc.)
 Specific household details can be had, but be careful…
The ethics of data collection
 General rules of thumb:
• Only utilize public domain information
• Ask yourself if you would accept your competitor doing the same
 For web sites, read the terms of service. Most web sites do not prohibit web harvesting
 Most call centres record your call already
• State to the person or computer that you are recording the call too.
 StatCan policies (be aware of these rules if you ever publish or externally exchange
information):
• Rule of three
• Round by fives
 Some useful resources
• Personal Information Protection and Electronic Documents Act (PIPEDA)
• Society of Competitive Intelligence Professionals (SCIP)
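The two StatCan policies named above are not spelled out on this slide. The sketch below assumes that “rule of three” means counts below three are suppressed and that “round by fives” means published counts are rounded to a multiple of five. Both readings are assumptions, and the rounding here is a simple deterministic nearest-multiple rule, so the exact requirements should be confirmed against StatCan's own documentation before anything is published.

```python
def protect_count(count, minimum=3, base=5):
    """Suppress small counts; otherwise round to the nearest multiple of `base`.

    Assumed interpretation of the "rule of three" and "round by fives".
    """
    if count < minimum:
        return None  # suppressed: too few records to publish safely
    return base * ((count + base // 2) // base)


for n in (2, 3, 7, 8, 12, 13):
    print(n, "->", protect_count(n))
```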
First steps: Integrating the Canadian Census and Elections Canada data
 The Canadian Census is probably the most valuable external data set you are likely to find for Customer
Segmentation. It provides
• Population figures
• Income levels, and earning population
• Age, Gender, and Ethnic population counts
 Elections Canada provides party voting counts by riding, as well as party contributions
 The Census and Elections Canada data each have a Postal Code Conversion File (PCCF), which can be
purchased from StatCan
 Once you have integrated the Census, other StatCan data sets (e.g. Uniform Crime Reporting Survey or
General Social Survey) can be integrated using the same Census <-> Postal Code mapping table
 StatCan data is transparent, as they document: sourcing methodology, the original questionnaire, gaps in
data, sources of error, and other data quality indicators
• First Nations data (i.e. data pertaining to reserves) is of poor quality; this is currently being addressed
through the First Nations Statistical Institute
 Many vendors sell data sets derived largely from the Census. Beware, as these vendors tend not
to be as transparent as StatCan, and often you would be better off with the original Census
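To show how a Census <-> Postal Code mapping table ties internal records to Census attributes, here is a minimal sketch using plain dictionaries. The postal codes, Dissemination Area identifiers, and population figures are all hypothetical, and a real conversion file is more involved (for example, one postal code can map to more than one area).

```python
from collections import Counter

# Hypothetical extract of a postal code <-> Dissemination Area conversion table.
postal_to_da = {"M5V 3L9": "DA-001", "M5V 2T6": "DA-001", "K1A 0B1": "DA-002"}
# Hypothetical Census population figures keyed by Dissemination Area.
da_population = {"DA-001": 600, "DA-002": 450}
# Hypothetical postal codes taken from internal customer records.
customer_postal_codes = ["M5V 3L9", "M5V 2T6", "M5V 3L9", "K1A 0B1", "Z9Z 9Z9"]

customers_per_da = Counter()
unmatched = []
for code in customer_postal_codes:
    da = postal_to_da.get(code)
    if da is None:
        unmatched.append(code)  # no mapping found: review rather than discard
    else:
        customers_per_da[da] += 1

# Customers as a share of the Census population in each area.
penetration = {da: customers_per_da[da] / da_population[da] for da in customers_per_da}

print(penetration)
print("unmatched:", unmatched)
```

Once this mapping step exists, any other data set keyed by the same Census geography can be attached in the same way.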
Locating and Evaluating Data Sets for Purchase
 When attempting to locate a new data set, ask yourself the question: “Who is in a position to acquire these
data?”
 Many data sets cannot be purchased, but rather obtained through sharing. Some loyalty programs have
been known to do this.
 Some data sets can only be obtained by certain types of businesses (e.g. Ontario Land Registry data)
 Questions to ask about new data:
• How and when was the data sourced?
• What are the data definitions?
• What are the sources of error?
• What are the gaps in data?
• What distribution model does the data align to? E.g. gamma-Poisson distribution.
 Evaluating new data:
• Always obtain sample data
• If possible, validate data against your internal data
• If possible, obtain a sample that is large enough to be representative of the whole within +/- 5%
nineteen times out of twenty
 You may want to tap a statistician or actuary to help you here
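For a sense of scale, “within +/- 5% nineteen times out of twenty” corresponds to a 5% margin of error at 95% confidence. The sketch below uses the standard sample size formula for estimating a proportion, n = z² p(1 − p) / e², with an optional finite population correction. It assumes simple random sampling and the worst case p = 0.5; the function name and defaults are illustrative, and a statistician may well recommend a different design.

```python
import math
from statistics import NormalDist


def sample_size(margin=0.05, confidence=0.95, proportion=0.5, population=None):
    """Sample size for estimating a proportion under simple random sampling."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    n0 = z * z * proportion * (1 - proportion) / (margin * margin)
    if population is not None:
        n0 = n0 / (1 + (n0 - 1) / population)  # finite population correction
    return math.ceil(n0)


print(sample_size())                 # 385 for a very large data set
print(sample_size(population=2000))  # 323 when the whole data set has 2,000 rows
```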
Primary Field Research
 Can be qualitative (i.e. focus groups). This can be tricky as it’s often difficult and
expensive to build the right focus group, and requires some experience to run and
interpret results.
• Often best to outsource Focus Group research, since finding the right mix of
candidates requires maintaining a network of contacts with the right mix of
demographic attributes (e.g. a middle income teenager, an upper income
woman, etc.)
 Quantitative approaches are easier to run, and require less experience. Plus the
data can be time seriesed for trend analysis
 Areas of focus often include:
• Survey Polling. Either in person, over the phone, or through the web (e.g.
Facebook polls)
• Retail trade area traffic counts
• Call Centre responses
 Very little practice is required, but once trained up, you will have a newfound
agility when obtaining concrete answers to tough questions
Web Harvesting
 Fastest growing area of external data collection. A virtual cottage industry already exists…
• You can easily outsource this to a company like fetch.com, but there are drawbacks to this
• Doing it in-house gives you greater control over data quality, and the data model, and costs
go down over time
• Many tools already exist to greatly assist in harvesting
– Like any software development platform, the “quick and dirty” use of the tool is
unsustainable, so you’re best to use something that can be controlled by a programming
language (e.g. a COM component)
• Some newer sites can be challenging due to reliance on the Document Object Model [DOM]
(as opposed to explicit HTML), and may require more sophisticated tools to interrogate the
DOM.
 However, the real challenge is figuring out what to harvest. Here are some suggestions to
consider:
• Competitor pricing
• Competitor branch/store locations
• Competitor hiring
• Competitor press releases
 Data grows in value if it is well structured (normalized), and time seriesed
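As a small illustration of harvesting competitor pricing in a structured, time seriesed form, the sketch below parses a static HTML snippet with Python's standard `html.parser` module and stamps each observation with a collection date. The page markup, the `data-product` attribute, the product names, and the prices are all hypothetical. No network access is involved, and pages that build their content through the DOM would need a browser-driven tool instead.

```python
from datetime import date
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collect (product, price) pairs from rows shaped like the sample page below."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._product = None
        self._in_price = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr" and "data-product" in attrs:
            self._product = attrs["data-product"]
        elif tag == "td" and attrs.get("class") == "price":
            self._in_price = True

    def handle_data(self, data):
        text = data.strip()
        if self._in_price and self._product and text:
            self.rows.append((self._product, float(text.lstrip("$"))))

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_price = False
        elif tag == "tr":
            self._product = None


# Hypothetical snapshot of a competitor's price page.
PAGE = """
<table>
  <tr data-product="basic-plan"><td>Basic</td><td class="price">$19.99</td></tr>
  <tr data-product="premium-plan"><td>Premium</td><td class="price">$49.00</td></tr>
</table>
"""


def harvest(page, observed_on):
    """Return (date, product, price) rows ready to append to a time series table."""
    parser = PriceParser()
    parser.feed(page)
    return [(observed_on.isoformat(), product, price) for product, price in parser.rows]


observations = harvest(PAGE, date(2009, 6, 1))
print(observations)
```

Appending one such batch per collection run, rather than overwriting the previous one, is what turns a price list into a price history.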
Integrating External Data Sets
 For Location Intelligence, your best bet is to align data to a postal code or FSA level
 It is highly recommended to purchase the postal code <-> census conversion file and the postal code
<-> riding conversion file. Be aware that there are imperfections in these files, so if you
are working at a municipal level, these files should be scrutinized
 For householding, data can be aligned by either:
• Civic address
• E-mail address
• Telephone number
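Aligning household records on any of these three attributes usually requires reducing each value to a comparable key first. The sketch below shows one crude way to do that; the function names and rules are assumptions for illustration only, and real civic address matching generally needs dedicated data quality tooling.

```python
import re


def email_key(email):
    """Lower-case and trim an e-mail address."""
    return email.strip().lower()


def phone_key(phone):
    """Keep digits only and drop a leading North American country code '1'."""
    digits = re.sub(r"\D", "", phone)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits if len(digits) == 10 else None


def address_key(civic_address, postal_code):
    """Crude key: upper-cased address without punctuation, plus compact postal code."""
    street = re.sub(r"[^\w\s]", "", civic_address).upper()
    street = re.sub(r"\s+", " ", street).strip()
    return street + "|" + re.sub(r"\s", "", postal_code).upper()


print(email_key("  Jane.Doe@Example.com "))             # jane.doe@example.com
print(phone_key("+1 (416) 555-0100"))                   # 4165550100
print(address_key("123  Main St., Apt. 4", "m5v 3l9"))  # 123 MAIN ST APT 4|M5V3L9
```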
First and Last Name proxy analysis

If you have a large enough customer base, it is possible to achieve sensitive insights into your business through name
analysis.

Certain first name, last name, and first name last name combinations are highly correlated with the following:
• Age group
• Gender
• Ethnicity, including the distinction of
– Immigrant
– Born in Canada
• Religion
• Language

It is possible to approach householding using a combination of name analysis and other attributes (e.g. census profile of
belonging region).

For householding if you are going to take this approach, be sure that a hit has a positive impact, and a miss has little or no
impact.
• E.g. attaching the appropriate language(s) to a bill or statement insert is an excellent opportunity to connect with non-native English speakers

Another way to extract value from name analysis is to take the most highly correlated names and use them as proxies for
customer segments. With a large enough customer base this can yield some very powerful insights
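A minimal sketch of the proxy idea, using the language example from this slide: learn which first names are strongly associated with a known attribute, then apply the association only when it is strong, so that a miss costs little. The records, thresholds, and function names are hypothetical; a real analysis would need far more data and a review against the ethics guidance given earlier.

```python
from collections import Counter, defaultdict


def name_profiles(records, min_support=3):
    """Map each first name to (most common segment, share), given enough observations."""
    counts = defaultdict(Counter)
    for first_name, segment in records:
        counts[first_name.strip().upper()][segment] += 1
    profiles = {}
    for name, segments in counts.items():
        total = sum(segments.values())
        if total >= min_support:
            segment, hits = segments.most_common(1)[0]
            profiles[name] = (segment, hits / total)
    return profiles


def suggest_segment(first_name, profiles, min_share=0.8):
    """Return a segment only when the name is a strong proxy; otherwise None."""
    segment, share = profiles.get(first_name.strip().upper(), (None, 0.0))
    return segment if share >= min_share else None


# Hypothetical customers whose language preference is already known.
records = (
    [("Jean", "fr")] * 4 + [("Jean", "en")]
    + [("Taylor", "en")] * 2 + [("Taylor", "fr")]
    + [("Zoe", "en")]
)
profiles = name_profiles(records)

print(suggest_segment("jean", profiles))    # fr
print(suggest_segment("Taylor", profiles))  # None (association too weak)
print(suggest_segment("Zoe", profiles))     # None (too few observations)
```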
Lessons Learned from The Whiz Kids

The Whiz Kids’ careers ended on many sour notes

At Ford, they were fixated on driving costs down, and neglected to invest in production and
innovation.
• While not a direct attack, Theodore Levitt’s famous “Marketing Myopia”, is as true now as it was
back in 1960 when it was first published in the Harvard Business Review.

Robert McNamara went on to “architect” the Vietnam war. While he knew (from his own numbers)
that it was militarily unwinnable as early as 1965, McNamara put too much focus on kill ratios and
neglected to learn about the social dynamics in Asia, which resulted in the war dragging on for nearly a
decade longer.
• He also reported to Lyndon B. Johnson who made it clear that he was not going to be the first US
president to surrender in a war.

Tex Thornton had a falling out with Henry Ford II.
• Again, politics and relationships can trump results

Jack Reith became a huge advocate of Mergers and Acquisitions and coined the term “synergies”
• Beyond the consolidation of HR departments, he failed to achieve the “synergies” he had hoped for
• He saw in the numbers what he wanted to see, and took huge risks which failed
• He was also behind a couple of failed Ford cars (i.e. Mercury Comet and Ford Edsel)

The morals of the story, from Neil’s perspective:
• Numbers are better than no-numbers, and we should try to get as many as we can to support key
decisions
• Facts live in history. How facts are interpreted to predict the future is highly subjective
• Knowing what questions to ask should always be central to decision making.
• There is no substitute for the judgement skills that come with experience.
• Do not underestimate the power of politics