Data Mining at IMS America
How We Turned a Mountain of Data into a Few Information-rich Molehills

Jerry Kagan
Kagan Associates, Inc.
323 Landsende Road
Devon, PA 19333
email: JerryKagan@msn.com

Paul Kallukaran
Manager, Statistical Services, IMS America [Division of Cognizant]
660 W Germantown Pike
Plymouth Meeting, PA 19462
email: kallup@imsint.com

ABSTRACT

IMS America, a division of Cognizant Corporation, is the principal source of information used in marketing and sales management by health care organizations throughout the United States. Of the various potential applications of neural networks, pattern recognition is considered one of major importance. This paper presents the results of using neural networks to classify time-series data into several trend pattern classifications [e.g., Increasing Trend, Decreasing Trend, Shift Up, Shift Down, Spike Up, Spike Down, and No Pattern]. The information generated by the classifier is used to detect various marketing-related phenomena [e.g., Brand Switching, Brand Loyalty, and Product Trends]. The data used for the test consisted of prescription data for 12 months, for 600,000 prescribers writing four drugs in the Anti-ulcer market. Using the neural network classifier and brand switching algorithm, the system was able to detect 2,500 prescribers who were changing their prescribing behavior. The model is a promising approach for analyzing time-series information from extremely large databases and presenting the user with only the information relevant for decision making.

This data-mining system, designed as part of the IMS Xplorer product, uses SAS System components for data retrieval, data preparation, graphical user interface, and data visualization. The IMS Xplorer product is a sales and marketing decision support system based on commercially-available, client-server technology, created for pharmaceutical companies in their effort to fully utilize the mission-critical information found in the Xplorer data warehouse.
KEY WORDS: Time-Series Data, Neural Networks, and Data Mining.

1.0 Introduction

During the past decade there has been a significant increase in the use of artificial intelligence technology for process control, predictive modeling, data analysis, pattern recognition and signal processing (Nibset, McLaughlin, and Mulgrew 1991). Artificial neural-network architectures have been shown to be particularly powerful tools for such applications (Nelson and Illingworth 1991). Thus, neural networks have become viable, competitive alternatives to the more traditional time-series analysis (de Groot and Wuertz 1991), linear and nonlinear model fitting (Cooper, Hayes, Whalen 1993), and cluster and discriminant analysis techniques.

In fact, there is research to indicate that back propagation neural networks may produce predictive results superior to traditional statistical methods in the areas of forecasting (Sharda and Patil 1992). The use of artificial neural networks is well established in applied fields, where neural networks are recognized as flexible and powerful tools for solving prediction and pattern recognition problems.

This paper outlines a neural network approach to classifying time-series data into various groups [e.g., Increasing Trend, Decreasing Trend, Level Shift Up, Level Shift Down, No Pattern, or Spike Up]. The results from this neural network model are being used to identify various trend relationships of pharmaceutical products and are also used at IMS to detect prescribers that have changed their prescribing behaviors. Other statistical techniques [Multinomial Logit, CART etc.] can then use the classifications from the neural network model as a dependent variable to help us understand the influence of the factors that caused these pattern changes.

2.0 Database Background

With the advent of computer technology and electronic communication, databases are growing at an alarming rate. As these databases are used more and more for critical strategic marketing decisions, market segmentation, and sales promotion effectiveness, data-mining techniques gain significant importance.

IMS America collects data from over 175,000 sites across the United States. Physicians, pharmacists, veterinarians, drugstores, hospitals, distributors, and retailers provide data in various forms including computer tape, microfilm, purchase invoices and surveys. Xponent, the first true physician level database for the healthcare industry, is used in making a variety of sales and marketing decisions. Each month, Xponent delivers the most precise estimates available for individual prescribing activity by using a customized projection factor for each prescriber.

Xplorer is a customized decision support data delivery infrastructure that allows IMS clients to integrate Xponent data with their own data sources. Custom and third party best-of-breed software tools allow clients to access the data, conduct a broad range of analyses, and produce easy-to-read on-line reports and graphs. Xplorer operates in a client/server environment, combining the capabilities of mainframes to handle the large volume of data with the ease of use and graphic capabilities of PCs.

The Xponent database has, in the last couple of years, grown extremely large [terabytes], and currently maintains prescription information by prescriber, product and payment type [cash, Medicaid and HMO]. As this database grows, it becomes extremely difficult to identify physicians changing their prescribing behaviors. The neural-network model provides a method of analyzing time-series data and identifying physicians that have changed their prescribing behavior over time. The method provides a tool for the sales force to use in identifying physicians to target when making sales calls. Research has shown that winning just one more prescription per week from each prescriber yields an annual gain of $52 million in sales. So, if you're not targeting with the utmost precision, you could be throwing away a fortune.

3.0 Applications of the Data Mining Technique for Targeting

The massive amounts of data make it impossible for human analysts to visually examine all of the time-series data and to understand the various trends. This neural-network model is designed to classify time-series data into the various trend types. A time series, as its name implies, consists of statistical data collected over successive time intervals. The data can be product volumes for a particular market or prescription data for individual physicians. A time-series model is applicable regardless of the data measure or magnitude. Marketing-research analysts, economists, behavioral scientists, security analysts, and others study time series to gain an understanding of general market trends and to take subsequent action based on the trends.

At IMS, neural networks have been developed to classify time-series data into various trend patterns and to detect various marketing-related phenomena like brand switching, brand loyalty, and brand performance. For our data-mining test, the model was used to detect physicians that switched from prescribing product A to product B, C, or D over a 12 month period. The data used for the test consists of prescription data for twelve months, for 600,000 prescribers, writing four drugs in the Anti-ulcer market. The purpose of the test was to detect physicians who have changed their prescribing behaviors by switching brands. The results of the test provided a list of prescribers who have switched from the product of interest to the competitors' products.

Using the neural network model, the system detected 2,500 physicians who were changing their prescribing behavior. The model, running on a UNIX server, was able to process 600,000 physicians in approximately 15 CPU minutes and present the results to the user in a graphical format [see Figure One as an example]. Here, the report shows a physician who was previously loyal to drug A and in the last 5 months switched to prescribing drug B. This report provides a useful tool for the pharmaceutical company's sales force to use when targeting the right prescribers for sales calls. The ability of the model to detect these trends and report relevant information back to the user in a quick, automated fashion is the next step in extracting information from extremely large databases.

4.0 Implementation of a Data-Mining Solution Using the SAS System

At IMS we explored several options and tools that could be used to implement a data-mining solution as part of the Xplorer decision support system. The tool had to operate in a client/server environment, have access to relational databases, provide graphical user interface (GUI) capabilities, have the necessary data transformation and statistical functions, and furnish the user with several graphical and text reports to make the required business decisions. The SAS System was the only software tool that provided all the functionality that was required to implement the data-mining solution. Table One lists the SAS tools that were used to implement the system. A process flow diagram in Appendix A illustrates the series of steps that the system must undergo to deliver the switching information.

Table One: SAS System Components Used in the Project.

SAS COMPONENT       FUNCTION
1. SAS/AF®          All the graphical user screens, used to select the type of analysis and data elements and to view the final results.
2. SAS/CONNECT®     Connectivity to the UNIX server and interface to the Oracle database from the client [Windows 95].
3. SAS/BASE®        All data preparation, transformation and statistics.
4. SAS/GRAPH®       All the graphical reports for data visualization.

Table Two: List of Information Provided by the Client to Run the Model.
• Time Periods: March 1995 - February 1996
• Doctor Specialty: Cardiology, Internal Medicine
• Payment Type: Prescription paid for by Cash
• Distribution Channel: Retail Pharmacies
• Geography: Client defined geographic area
• Products: A, B, C, D

For a typical analysis, the user may specify the information shown in Table Two through the GUI. The final results from the analysis are saved in a SAS data set. The complete analysis is performed on the server, which usually involves processing approximately 600,000 records and producing a final SAS data set about 2,500 records in length.

After the criteria are selected for the data-mining run, the user can initiate the analysis from the client machine. All of the subsequent steps involved in the analysis are shielded from the end user. First, the analysis uses SAS to build SQL queries to extract data from the Oracle database residing on the server, and transforms and manipulates the data set to provide the correct input formats for the neural-network model. Then, the results from the neural-net model are used to identify the physicians who have reduced the prescribing of one medication over the 12 month period and have increased the prescribing of another. The results are viewed with a GUI, and the user is required to download the final SAS data set from the UNIX server. The report details are then viewed using custom SAS graphs and reports. The entire process is seamlessly integrated with the SAS System, giving the user complete flexibility in running the process without knowing any SAS.

5.0 Conclusion

Using a classical subjective approach to the examination and analysis of 600,000 time series would take weeks of work. By using the data-mining solution, IMS can pinpoint prescribers who are switching from one medication to another. A sales person can use this model to target doctors who have switched from the drug they are selling and to devise a specific message to counter that switching behavior.

The implementation of a data-mining solution comprises numerous steps, such as interacting with the data warehouse, providing the data preprocessing capabilities, and offering the statistical functionality and graphical visualization. SAS provides all the tools required to implement such a system. To help the user understand the factors that caused these time-series changes, various statistical techniques [Multinomial Logit, CART etc.] can use the pattern classifications from the neural-network model as a response variable. In the future, IMS plans to implement additional data-mining solutions and statistical models with the Xplorer decision support system, to help the business user discover the "diamonds" lying hidden within the mountains of captured data.

6.0 Biography

Paul Kallukaran graduated with a Bachelors in Industrial and Production Engineering and a Masters of Science in Operations Research from Illinois Institute of Technology in Chicago. Since graduation, he has worked in the marketing research industry using his expertise in statistics, operations research, and artificial intelligence. Currently, he is a Manager of Research & Development in the Statistical Services Department at IMS America, a division of the Cognizant Corporation, and is pursuing a Master of Software Engineering at the Pennsylvania State University.

Jerry Kagan is an independent consultant specializing in SAS/AF development. He has a Bachelors degree in Management Science and an Associates degree in Computer Science from The Pennsylvania State University and has been using the SAS System for the past eight years. Working primarily in the pharmaceutical and health care industries, Jerry has worked with companies including Wyeth-Ayerst Research, SmithKline Beecham, CoreStates Bank and The Prudential. He won a best paper award at SUGI 17 and was an invited speaker at SUGI 18.

7.0 References

de Groot, C., and Wuertz, D. (1991), "Analysis of Univariate Time Series with Connectionist Nets: A Case Study of Two Classical Examples," Neurocomputing, 3(4), 177-192.

Lapedes, A., and Farber, R. (1987), "Nonlinear Signal Processing Using Neural Networks: Prediction and Systems Modeling," Los Alamos National Laboratory Technical Report, No. LA-UR-87-2662.

Lykins, S., and Chance, D. (1992), "Comparing Artificial Neural Networks and Multiple Regression for Predictive Application," Proceedings of the Eighth Annual Conference on Applied Mathematics, Edmond, OK, 155-169.

Nelson, M., and Illingworth, W. T. (1991), A Practical Guide to Neural Nets, New York, NY: Addison Wesley Publishing Company, Inc.

NeuralWare, Inc. (1991), Reference Guide: NeuralWorks Professional II Plus and NeuralWorks Explorer, Pittsburgh.

Zurada, J. M. (1992), Introduction to Artificial Neural Systems, Los Angeles, CA: West Publishing Company.

Cooper, Hayes, Michael N., and Whalen (1992), "Neural Networks As an Alternative to Statistical Modeling of Radio Wave Propagation," American Statistical Association Winter Conference, Fort Lauderdale, January 1993.

8.0 Acknowledgments

SAS is a registered trademark of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are registered trademarks of their respective companies.

Figure One: Physician Targeting Report.
Doctor Targeting Report

Market: Anti-Depressant
Company: Lily
Product: Prozac
Doctor: John Smith
DEA NUMBER: AA0382856
Speciality: OBSTETRICS
Address: 150 Main Street, Chester, PA
Number of Samples: 5
Number of Details: 2
Product Switched: ZOLOFT
PLANS ASSOCIATED: US HEALTHCARE, FHP, BLUECROSS
Reasons for Switch: Less Drug Interactions, More Cost effective

September 1994 - August 1995

Market Share
Month     PROZAC   ZOLOFT
Sep-94    67       26
Oct-94    75       16
Nov-94    81       11
Dec-94    74       22
Jan-95    70       19
Feb-95    66       25
Mar-95    54       29
Apr-95    39       48
May-95    48       44
Jun-95    40       51
Jul-95    43       47
Aug-95    36       52

Rx Volume
Month     PROZAC   ZOLOFT
Sep-94    25       10
Oct-94    39       8
Nov-94    47       6
Dec-94    38       11
Jan-95    28       8
Feb-95    52       20
Mar-95    30       16
Apr-95    33       42
May-95    33       30
Jun-95    33       43
Jul-95    35       39
Aug-95    28       42

[Chart: Market Share By Product/Doctor. Series: PROZAC, ZOLOFT. X-axis: Months (Sep-94 to Aug-95). Y-axis: Share.]
[Chart: Rx Volume By Product/Doctor. Series: PROZAC, ZOLOFT. X-axis: Months (Sep-94 to Aug-95). Y-axis: Rx Volume.]

Appendix A

TREND ANALYZER/SAS INTERFACE TOOL PROTOTYPE USING SAS/FRAME
PROCESS FLOW DIAGRAM

[Process flow diagram with two columns, SERVER and PC. Box labels, in the order they were extracted: User Input Which View (Where Stmt); Oracle; Proc SQL; Extract Files; C Program Translates Market Share; Translated Extract Files; Transposed & Market Share Calculated; Trend Analyzer Output; Trend Analyzer C; Read Into SAS; Report Datasets; Oracle; SAS Dataset; Create Datasets for Reports; Download Report Data to Client PC; Generate Reports; User Input (Selection Criteria).]
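
The "Transposed & Market Share Calculated" step in Appendix A, together with the switching rule described in Section 4.0 (a prescriber who has reduced the prescribing of one medication and increased the prescribing of another), can be illustrated with a minimal Python sketch. It is not the neural-network classifier used at IMS: the function names, the first-half versus second-half comparison, and the 15-point threshold are all assumptions made for illustration. The input is the Rx Volume data from Figure One, and shares are computed over the two listed products only, so they differ from the market share values printed in the figure.

```python
from statistics import mean

# Monthly Rx volumes for one prescriber, Sep-94 through Aug-95 (from Figure One).
RX_VOLUME = {
    "PROZAC": [25, 39, 47, 38, 28, 52, 30, 33, 33, 33, 35, 28],
    "ZOLOFT": [10, 8, 6, 11, 8, 20, 16, 42, 30, 43, 39, 42],
}


def market_share(volumes):
    """Convert per-product monthly volumes into percentage shares per month."""
    months = len(next(iter(volumes.values())))
    totals = [sum(series[m] for series in volumes.values()) for m in range(months)]
    return {
        product: [100.0 * series[m] / totals[m] if totals[m] else 0.0
                  for m in range(months)]
        for product, series in volumes.items()
    }


def detect_switch(shares, from_product, min_change=15.0):
    """Hypothetical rule, not the paper's classifier.

    Returns the product whose mean share rose most between the first and
    second half of the period, provided from_product fell by at least
    min_change share points and the other product rose by at least as much.
    Returns None when no switch is detected.
    """
    half = len(shares[from_product]) // 2

    def shift(series):
        # Change in mean share from the first half to the second half.
        return mean(series[half:]) - mean(series[:half])

    if shift(shares[from_product]) > -min_change:
        return None
    gains = {p: shift(s) for p, s in shares.items() if p != from_product}
    if not gains:
        return None
    winner = max(gains, key=gains.get)
    return winner if gains[winner] >= min_change else None


shares = market_share(RX_VOLUME)
print(detect_switch(shares, "PROZAC"))  # prints ZOLOFT
```

With the Figure One volumes, the mean PROZAC share falls by roughly 30 points between the two halves of the period, so the sketch reports ZOLOFT as the product switched to, matching the report. A fixed threshold like this cannot separate a gradual trend from a level shift or a one-month spike, which is the distinction the trend classifier described in the paper is meant to draw.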