Data Mining at IMS America
How We Turned a Mountain of Data into a Few Information-rich Molehills
Jerry Kagan
Kagan Associates, Inc.
323 Landsende Road
Devon, PA 19333
email: JerryKagan@msn.com
Paul Kallukaran
Manager, Statistical Services,
IMS America [Division of Cognizant],
660 W Germantown Pike
Plymouth Meeting, PA 19462
email: kallup@imsint.com
ABSTRACT
IMS America, a division of Cognizant Corporation, is the principal source of information used in
marketing and sales management by health care organizations throughout the United States. Of
the various potential applications of neural networks, pattern recognition is considered one of
major importance. This paper presents the results of using neural networks to classify time-series
data into several trend pattern classifications [e.g., Increasing Trend, Decreasing Trend, Shift Up,
Shift Down, Spike Up, Spike Down, and No Pattern], and the information generated from the
classifier is used to detect various marketing related phenomena [e.g., Brand Switching, Brand
Loyalty, and Product Trends].
The data used for the test consisted of prescription data for 12 months, for 600,000 prescribers
writing four drugs in the Anti-ulcer market. Using the neural network classifier and brand
switching algorithm, the system was able to detect 2500 prescribers who were changing their
prescribing behavior. The model is a promising approach for analyzing time-series information
from extremely large databases, and presenting the user with only information relevant for
decision making. This data-mining system, designed as part of the IMS Xplorer product, uses
SAS System components for data retrieval, data preparation, graphical user interface, and data
visualization. The IMS Xplorer product is a sales and marketing decision support system based
on commercially-available, client-server technology created for pharmaceutical companies in their
effort to fully utilize the mission critical information found in the Xplorer data warehouse.
KEY WORDS: Time-Series Data, Neural Networks, and Data Mining.
1.0 Introduction
During the past decade there has been a significant increase in the use of artificial
intelligence technology for process control, predictive modeling, data analysis, pattern
recognition, and signal processing (Nibset, McLaughlin, and Mulgrew 1991). Artificial
neural-network architectures have been shown to be particularly powerful tools for such
applications (Nelson and Illingworth 1991). Thus, neural networks have become viable,
competitive alternatives to the more traditional time-series analysis (de Groot and Wuertz
1991), linear and nonlinear model fitting (Cooper, Hayes, Whalen 1993), and cluster and
discriminant analysis techniques.

In fact, there is research to indicate that back-propagation neural networks may produce
predictive results superior to traditional statistical methods in the area of forecasting
(Sharda and Patil 1992). The use of artificial neural networks is well established in applied
fields, where neural networks are recognized as flexible and powerful tools for solving
prediction and pattern recognition problems.

This paper outlines a neural network approach to classifying time-series data into various
groups [e.g., Increasing Trend, Decreasing Trend, Level Shift Up, Level Shift Down, No Pattern,
or Spike Up]. The results from this neural network model are being used to identify various
trend relationships of pharmaceutical products and are also used at IMS to detect prescribers
that have changed their prescribing behaviors. Other statistical techniques [Multinomial Logit,
CART, etc.] can then use the classifications from the neural network model as a dependent
variable to help us understand the influence of factors that caused these pattern changes.

2.0 Database Background
With the advent of computer technology and electronic communication, databases are growing at
an alarming rate. As these databases are used more and more for critical strategic marketing
decisions, market segmentation, and sales promotion effectiveness, data-mining techniques gain
significant importance.

IMS America collects data from over 175,000 sites across the United States. Physicians,
pharmacists, veterinarians, drugstores, hospitals, distributors, and retailers provide data in
various forms including computer tape, microfilm, purchase invoices, and surveys. Xponent, the
first true physician-level database for the health care industry, is used in making a variety of
sales and marketing decisions. Each month, Xponent delivers the most precise estimates
available for individual prescribing activity by using a customized projection factor for each
prescriber.

Xplorer is a customized decision support data delivery infrastructure that allows IMS clients
to integrate Xponent data with their own data sources. Custom and third party best-of-breed
software tools allow clients to access the data, conduct a broad range of analyses, and produce
easy-to-read on-line reports and graphs. Xplorer operates in a client/server environment,
combining the capabilities of mainframes to handle the large volume of data with the ease of
use and graphic capabilities of PCs.

The Xponent database has grown extremely large [terabytes] in the last couple of years, and
currently maintains prescription information by prescriber, product, and payment type [cash,
Medicaid, and HMO]. As this database grows, it becomes extremely difficult to identify
physicians changing their prescribing behaviors. The neural-network model provides a method of
analyzing time-series data and identifying physicians that have changed their prescribing
behavior over time. The method provides a tool for the sales force to use in identifying
physicians to target when making sales calls. Research has shown that winning just one more
prescription per week from each prescriber yields an annual gain of $52 million in sales. So,
if you’re not targeting with the utmost precision, you could be throwing away a fortune.
3.0 Applications of the Data Mining Technique for Targeting
The massive amounts of data make it impossible for human analysts to visually examine all of
the time-series data and to understand the various trends. This neural-network model is
designed to classify time-series data into the various trend types. A time series, as its name
implies, consists of statistical data collected over successive time intervals. The data can be
product volumes for a particular market or prescription data for individual physicians. A
time-series model is applicable regardless of the data measure or magnitude. Marketing-research
analysts, economists, behavioral scientists, security analysts, and others study time series to
gain an understanding of general market trends and to take subsequent action based on the
trends.

At IMS, neural networks have been developed to classify time-series data into various trend
patterns and to detect various marketing related phenomena like brand switching, brand loyalty,
and brand performance. For our data-mining test, the model was used to detect physicians that
switched from prescribing product A to product B, C, or D over a 12 month period. The data used
for the test consists of prescription data for twelve months, for 600,000 prescribers, writing
four drugs in the Anti-ulcer market. The purpose of the test was to detect physicians who have
changed their prescribing behaviors by switching brands. The results of the test provided a
list of prescribers who have switched from the product of interest to the competitors’
products.

Using the neural network model, the system detected 2,500 physicians who were changing their
prescribing behavior. The model running on a UNIX server was able to process 600,000 physicians
in approximately 15 CPU minutes and present the results to the user in a graphical format [see
Figure One as an example]. Here, the report shows a physician that was previously loyal to
drug A, and in the last 5 months switched to prescribing drug B. This report provides a useful
tool for the pharmaceutical company’s sales force to use when targeting the right prescribers
for sales calls. The ability of the model to detect these trends and report relevant
information back to the user in a quick, automated fashion is the next step in extracting
information from extremely large databases.

4.0 Implementation of a Data-Mining Solution Using the SAS System
At IMS we explored several options and tools that could be used to implement a data-mining
solution as part of the Xplorer decision support system. The tool had to have the ability to
operate in a client/server environment, to have access to relational databases, to provide
graphical-user-interface (GUI) capabilities, to have the necessary data transformation and
statistical functions, and to furnish the user with several graphical and text reports to make
the required business decisions. The SAS System was the only software tool that provided all
the functionality that was required to implement the data-mining solution. Table One lists the
SAS tools that were used to implement the system. A process flow diagram in Appendix A
illustrates the series of steps that the system performs to deliver the switching information.
Table One: SAS System Components Used in the Project.
SAS COMPONENT         FUNCTION
1. SAS/AF ®           All the graphical user screens, to select the type of analysis and data
                      elements and to view the final results.
2. SAS/CONNECT ®      Connectivity to the UNIX server and interface to the Oracle
                      database from the client [Windows 95].
3. SAS/BASE ®         All data preparation, transformation, and statistics.
4. SAS/GRAPH ®        All the graphical reports for data visualization.
Table Two: List of Information Provided by the Client to Run the Model.
• Time Periods:          March 1995 - February 1996
• Doctor Specialty:      Cardiology, Internal Medicine
• Payment Type:          Prescription paid for by Cash
• Distribution Channel:  Retail Pharmacies
• Geography:             Client defined geographic area
• Products:              A, B, C, D
For a typical analysis, the user may specify the information shown in Table Two through the
GUI.

After the criteria are selected for the data-mining run, the user can initiate the analysis
from the client machine. All of the subsequent steps involved in the analysis are shielded from
the end user. First, the analysis uses SAS to build SQL queries to extract data from the Oracle
database residing on the server, and transforms and manipulates the data set to provide the
correct input formats for the neural-network model. Then, the results from the neural-net model
are used to identify the physicians who have reduced the prescribing of one medication over the
12 month period and have increased the prescribing of another.

The final results from the analysis are saved in a SAS data set. The complete analysis is
performed on the server, which usually involves processing approximately 600,000 records and
producing a final SAS data set about 2,500 records in length.

The results are viewed with a GUI, and the user is required to download the final SAS data set
from the UNIX server. The report details are then viewed using custom SAS graphs and reports.
The entire process is seamlessly integrated with the SAS System, giving the user complete
flexibility in running the process without knowing any SAS.
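
The query-building step described in this section can be illustrated with a short sketch. It is
written in Python rather than SAS, and it is not the production code: the table name
rx_monthly, every column name, and the "YYYY-MM" month encoding are hypothetical placeholders
chosen for illustration. It only shows how selections like those in Table Two might become a
parameterized WHERE clause. The client-defined geography filter is left out because the paper
does not say how it is specified.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SelectionCriteria:
    """User selections corresponding to rows of Table Two (geography omitted)."""
    start_month: str          # assumed "YYYY-MM" encoding, e.g. "1995-03"
    end_month: str            # e.g. "1996-02"
    specialties: List[str]
    payment_type: str
    channel: str
    products: List[str]


def build_extract_query(c: SelectionCriteria) -> Tuple[str, list]:
    """Return a parameterized SELECT statement and its bind values.

    The table rx_monthly and all column names are hypothetical.
    """
    spec_marks = ", ".join("?" for _ in c.specialties)
    prod_marks = ", ".join("?" for _ in c.products)
    sql = (
        "SELECT prescriber_id, product, rx_month, rx_count "
        "FROM rx_monthly "
        "WHERE rx_month BETWEEN ? AND ? "
        f"AND specialty IN ({spec_marks}) "
        "AND payment_type = ? "
        "AND channel = ? "
        f"AND product IN ({prod_marks})"
    )
    # Bind values are listed in the same order as the placeholders above.
    params = [c.start_month, c.end_month, *c.specialties,
              c.payment_type, c.channel, *c.products]
    return sql, params


criteria = SelectionCriteria(
    start_month="1995-03",
    end_month="1996-02",
    specialties=["Cardiology", "Internal Medicine"],
    payment_type="Cash",
    channel="Retail Pharmacies",
    products=["A", "B", "C", "D"],
)
sql, params = build_extract_query(criteria)
```

Binding the values as parameters, rather than pasting them into the SQL text, keeps the GUI
selections separate from the statement itself.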
5.0 Conclusion
Using a classical subjective approach to the examination and analysis of 600,000 time series
would take weeks of work. By using the data-mining solution, IMS can pinpoint prescribers who
are switching from one medication to another. A salesperson can use this model to target
doctors who have switched from the drug they are selling and to devise a specific message to
counter that switching behavior.

The implementation of a data-mining solution comprises numerous steps, such as interacting
with the data warehouse, providing the data preprocessing capabilities, and offering the
statistical functionality and graphical visualization. SAS provides all the tools required to
implement such a system. To help the user understand the factors that caused these time-series
changes, various statistical techniques [Multinomial Logit, CART, etc.] can use the pattern
classifications from the neural-network model as a response variable. In the future, IMS plans
to implement additional data-mining solutions and statistical models with the Xplorer decision
support system, to help the business user discover the “diamonds” lying hidden within the
mountains of captured data.

6.0 Biography
Paul Kallukaran graduated with a Bachelors in Industrial and Production Engineering and a
Masters of Science in Operations Research from Illinois Institute of Technology in Chicago.
Since graduation, he has worked in the marketing research industry using his expertise in
statistics, operations research, and artificial intelligence. Currently, he is a Manager of
Research & Development in the Statistical Services Department at IMS America, a division of the
Cognizant Corporation, and is pursuing a Master of Software Engineering at the Pennsylvania
State University.

Jerry Kagan is an independent consultant specializing in SAS/AF development. He has a Bachelors
degree in Management Science and an Associates degree in Computer Science from The Pennsylvania
State University and has been using the SAS System for the past eight years. Working primarily
in the pharmaceutical and health care industries, Jerry has worked with companies including
Wyeth-Ayerst Research, SmithKline Beecham, CoreStates Bank and The Prudential. He won a best
paper award at SUGI 17 and was an invited speaker at SUGI 18.

7.0 References
deGroot, C., and Wuertz, D. (1991), “Analysis of Univariate Time Series with Connectionist
Nets: A Case Study of Two Classical Examples,” Neurocomputing, 3(4), 177-192.

Lapedes, A., and Farber, R. (1987), “Nonlinear Signal Processing Using Neural Networks:
Prediction and Systems Modeling,” Los Alamos National Laboratory Technical Report,
No. LA-U-87-2662.

Lykins, S., and Chance, D. (1992), “Comparing Artificial Neural Networks and Multiple
Regression for Predictive Application,” Proceedings of the Eighth Annual Conference on Applied
Mathematics, Edmond, OK, 155-169.

Nelson, M., and Illingworth, W. T. (1991), A Practical Guide to Neural Nets, New York, NY:
Addison Wesley Publishing Company, Inc.

NeuralWare, Inc. (1991), Reference Guide: NeuralWorks Professional II Plus and NeuralWorks
Explorer, Pittsburgh.

Zurada, J. M. (1992), Introduction to Artificial Neural Systems, Los Angeles, CA: West
Publishing Company.

Cooper, Hayes, Michael N., and Whalen (1992), “Neural Networks As an Alternative to Statistical
Modeling of Radio Wave Propagation,” American Statistical Association Winter Conference, Fort
Lauderdale, January 1993.

8.0 Acknowledgments
SAS is a registered trademark of SAS Institute Inc. in the USA and other countries. ® indicates
USA registration.

Other brand and product names are registered trademarks of their respective companies.
Figure One: Physician Targeting Report.

Doctor Targeting Report
Market: Anti-Depressant
Company: Lily
Product: Prozac
Doctor: John Smith
DEA NUMBER: AA0382856
Speciality: OBSTETRICS
Address: 150 Main Street, Chester, PA
Number of Samples: 5
Number of Details: 2
Product Switched: ZOLOFT
PLANS ASSOCIATED: US HEALTHCARE, FHP, BLUECROSS
Reasons for Switch: Less Drug Interactions, More Cost effective

September 1994 - August 1995

            Market Share          Rx Volume
Month       PROZAC   ZOLOFT       PROZAC   ZOLOFT
Sep-94        67       26           25       10
Oct-94        75       16           39        8
Nov-94        81       11           47        6
Dec-94        74       22           38       11
Jan-95        70       19           28        8
Feb-95        66       25           52       20
Mar-95        54       29           30       16
Apr-95        39       48           33       42
May-95        48       44           33       30
Jun-95        40       51           33       43
Jul-95        43       47           35       39
Aug-95        36       52           28       42

[Charts: “Market Share By Product/Doctor” (Share by Months) and “Rx Volume By Product/Doctor”
(Rx Volume by Months), each plotting PROZAC and ZOLOFT from Sep-94 through Aug-95.]
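
The switching pattern in Figure One can be reproduced with a small sketch. It is illustrative
only: the paper does not describe the neural network's architecture or the brand switching
algorithm, so an ordinary least-squares slope stands in for the trained classifier here, and
the threshold of one share point per month is an assumption rather than a value from the paper.
The share series are the PROZAC and ZOLOFT market shares reported in Figure One.

```python
from typing import List, Optional

MONTHS = ["Sep-94", "Oct-94", "Nov-94", "Dec-94", "Jan-95", "Feb-95",
          "Mar-95", "Apr-95", "May-95", "Jun-95", "Jul-95", "Aug-95"]
# Market share values taken from Figure One.
PROZAC_SHARE = [67, 75, 81, 74, 70, 66, 54, 39, 48, 40, 43, 36]
ZOLOFT_SHARE = [26, 16, 11, 22, 19, 25, 29, 48, 44, 51, 47, 52]


def slope(values: List[float]) -> float:
    """Ordinary least-squares slope of the values against month index 0..n-1."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def label(values: List[float], threshold: float = 1.0) -> str:
    """Stand-in for the neural-network classifier: label a series by its slope.

    The threshold (share points per month) is an assumed value.
    """
    s = slope(values)
    if s > threshold:
        return "Increasing Trend"
    if s < -threshold:
        return "Decreasing Trend"
    return "No Pattern"


def is_switch(old: List[float], new: List[float], threshold: float = 1.0) -> bool:
    """Flag a switch when the old product trends down and the new one trends up."""
    return (label(old, threshold) == "Decreasing Trend"
            and label(new, threshold) == "Increasing Trend")


def first_crossover(old: List[float], new: List[float]) -> Optional[int]:
    """Index of the first month in which the new product's share exceeds the old's."""
    for i, (o, n) in enumerate(zip(old, new)):
        if n > o:
            return i
    return None


switched = is_switch(PROZAC_SHARE, ZOLOFT_SHARE)
crossover = first_crossover(PROZAC_SHARE, ZOLOFT_SHARE)
crossover_month = MONTHS[crossover] if crossover is not None else None
```

A slope test of this kind cannot distinguish a gradual trend from a level shift or a spike,
which is one reason a trained pattern classifier may be preferred for the full set of trend
types.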
Appendix A
TREND ANALYZER/SAS INTERFACE TOOL
PROTOTYPE USING SAS/FRAME
PROCESS FLOW DIAGRAM

[Process flow diagram spanning the SERVER and the PC. The connecting arrows are not recoverable
from the extracted text, and the grouping of some labels is approximate. Box labels, in the
order they appear:]
• User Input: Which View (Where Stmt)
• Oracle
• Proc SQL
• Extract Files
• C Program Translates
• Market Share
• Translated Extract Files
• Transposed & Market Share Calculated
• Trend Analyzer Output
• Trend Analyzer C
• Read Into SAS
• Report Datasets
• Oracle
• SAS Dataset
• Create Datasets for Reports
• Download Report Data to Client PC
• Generate Reports
• User Input (Selection Criteria)
• Download
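
The “Transposed & Market Share Calculated” step in the flow diagram can be sketched as follows.
This is an assumption-laden illustration in Python, not the C program or SAS code used at IMS:
the record layout (prescriber, product, month index, Rx count) is hypothetical, and share is
computed against the total of the listed products only, which may differ from the market
definition behind the figures in the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical extract record: (prescriber_id, product, month_index, rx_count)
Record = Tuple[str, str, int, int]


def market_share_series(records: Iterable[Record],
                        products: List[str],
                        n_months: int) -> Dict[str, Dict[str, List[float]]]:
    """Transpose long-format records into one monthly share series per product.

    Returns {prescriber: {product: [share percent for each month]}}.
    A month with no prescriptions for any listed product gets a share of 0.0.
    """
    # Accumulate Rx volume into a months-wide row per prescriber and product.
    volume = defaultdict(lambda: {p: [0] * n_months for p in products})
    for prescriber, product, month, rx in records:
        if product not in products:
            continue  # ignore products outside the selected market
        volume[prescriber][product][month] += rx

    shares: Dict[str, Dict[str, List[float]]] = {}
    for prescriber, by_product in volume.items():
        totals = [sum(by_product[p][m] for p in products) for m in range(n_months)]
        shares[prescriber] = {
            p: [100.0 * by_product[p][m] / totals[m] if totals[m] else 0.0
                for m in range(n_months)]
            for p in products
        }
    return shares


example_records = [
    ("dr1", "A", 0, 30), ("dr1", "B", 0, 10),
    ("dr1", "A", 1, 10), ("dr1", "B", 1, 30),
]
example_shares = market_share_series(example_records, ["A", "B"], n_months=2)
```

Each resulting series has one value per month, which is the fixed-length input shape a
time-series pattern classifier would expect.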