Data Fusion: A Way to Provide More Data to Mine in?

Peter van der Putten (a, b)

(a) Sentient Machine Research, Baarsjesweg 224, 1058 Amsterdam, The Netherlands, pvdputten@smr.nl
(b) The author is also affiliated with the Leiden Institute of Advanced Computer Science (LIACS), P.O. Box 9512, 2300 RA Leiden, The Netherlands

Abstract

In everyday data mining practice, the availability of data is often a serious problem. For instance, in database marketing elementary customer information resides in customer databases, but market survey data is only available for a subset or even a different sample of customers. Data fusion provides a way out by combining information from different sources for each customer. We present a simple data fusion procedure based on a nearest neighbor algorithm. We suggest different measures to evaluate the quality of the fusion process. An experiment on real world data is described to illustrate the added value of our approach.

1. Introduction and motivation

In marketing, direct forms of communication are becoming more popular. Instead of broadcasting a single message to all customers through traditional mass media such as television and print, the most promising potential customers receive personalized offers at the most appropriate time and through the most appropriate channels. For this it becomes more important to gather information about media consumption, attitudes, product propensity etc. at an individual level. The amount of data that is collected about customers is generally growing very fast; however, it is often scattered among a large number of sources. For instance, elementary customer information resides in customer databases, but market survey data depicting a richer view of the customer is only available for a small sample. Simply collecting all this information for the whole customer database in a single source survey is far too expensive.

[Figure 1: Data Fusion in a nutshell. Recipient (customer database: 1x10^6 customers, 50 variables, 25 commons) + donor (market survey: 1000 survey respondents, 1500 variables, 25 commons) = fused data (virtual survey with each customer: 1x10^6 customers, 1525 variables, 25 commons).]

A widely accepted alternative within database marketing is to buy external socio-demographic data that has been collected at a regional level. All customers living in a single region, for instance in the same zip code area, receive equal values. However, the kind of information that can be acquired is relatively limited. Furthermore, the underlying assumption that all customers within a region are equal is questionable, to say the least.

Data fusion techniques can provide a way out. Information from different sources is combined by matching customers on variables that are available in both data sources. The resulting enriched data set can be used for all kinds of data mining and database marketing analyses.

In this paper we will give a practical introduction to the application of data fusion for data mining in a database marketing context (section 2). This will be illustrated with some preliminary empirical results on a real world data set (section 3). Rather than presenting a complete solution, we argue that data fusion might be a valuable tool in everyday data mining practice. Furthermore, we aim to demonstrate that the data fusion problem is far from simple and contains a lot of interesting topics for future algorithmic and methodological data mining research (section 4).

2. Data fusion

Data fusion is not new. In the 1980s this subject was quite popular, particularly in the fields of media research [1,2,4,5,7,16] and micro-economic analysis [3,10,11]. To this day, data fusion is used to reduce the required number of respondents or questions in a survey.
For instance, for the Belgian National Readership Survey, questions regarding media and questions regarding products are collected from two separate groups of 10,000 respondents and fused into a single survey, thus reducing costs and the time a respondent needs to complete the survey [9].

In our research we are ultimately aiming at fusing entire customer databases with surveys instead of merging surveys with surveys. This implies new ways to exploit existing survey data. Furthermore, the single source alternative (asking all the questions to all the customers) might be an option for merging surveys, but in most cases it will not even be a possibility when merging large customer databases with survey data.

2.1 Core data fusion concepts

The core data fusion concept is illustrated in figure 1. Assume a company has one million customers. For each customer, 50 variables are stored in the customer database. Furthermore, there exists a market survey with 1000 respondents, not necessarily customers of the company, who were asked questions corresponding to 1500 variables. In this example 25 variables occur in both the database and the survey: these variables are called common variables. Now assume that we want to transfer the information from the market survey, the donor, to the customer database, the recipient. For each record in the customer database the data fusion procedure is used to predict the most probable answers to the market survey questions, thus creating a virtual survey with each customer. The variables to be predicted are called fusion variables.

The most straightforward procedure to perform this kind of data fusion is statistical matching, which can be compared to k-nearest neighbor classification. For each recipient record, the k donor records are selected that have the smallest distance to the recipient record, with the distance measure defined over the common variables. Based on this set of k donors the values of the fusion variables are estimated, e.g. by taking the average for ordinals or the mode for nominals. Sometimes separate fusions are carried out for groups for which 'mistakes' in the predictions are unacceptable, e.g. predicting 'pregnant last year' for men. In this case the gender variable will become a so-called cell variable; the match between recipient and donor must be 100% on the cell variable, otherwise they will not be matched at all.

2.2 Data fusion evaluation

An important issue in data fusion is measuring the quality of the fusion; this is far from straightforward. The bottom line evaluation is the external evaluation. Assume for instance that we want to improve the response on mailings for a certain set of products, and that this was the reason why the fusion variables were added in the first place. In this case, one way to evaluate the external quality is to check whether an improved mail response prediction model can be built when fused data is included in the input. However, one must take into account that socio-demographic and other external variables often add only limited value for purely predictive data mining. These variables have more value for descriptive data mining, e.g. discovering why people are interested in these products [13,14].

The internal evaluation of the data fusion procedure is simply the a priori evaluation before external evaluation has taken place. We distinguish between evaluating representativeness and evaluating predictiveness, although making this distinction formal is an interesting problem in its own right. One challenge for both the fusion procedure and the evaluation of representativeness of the fused data is that the donor and the recipient might be samples from different populations, e.g. a customer database from a bank versus a national media survey. If both donor and recipient are samples from the same population, penalty factors can be used to 'punish' winning donors and ensure that donors are used evenly [11].
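As a minimal sketch of the statistical matching procedure from section 2.1, the Python function below selects, for each recipient, the k donors nearest on the commons within the same cell and averages their fusion variables. The function name fuse, the dictionary-based record layout, the Euclidean distance and the additive form of the donor penalty are illustrative assumptions, not details given in this paper or in [11]. Only numeric (ordinal or binary) fusion variables are handled, so taking the mode for nominals is left out.

```python
from collections import Counter, defaultdict
import math


def fuse(recipients, donors, commons, fusion_vars, cell_var, k=3, penalty=0.0):
    """Return one dict of estimated fusion values per recipient record.

    Hypothetical sketch: Euclidean distance over the commons, plus an
    assumed additive penalty for donors that have already been used.
    """
    usage = Counter()  # how often each donor index has been selected so far
    by_cell = defaultdict(list)
    for idx, donor in enumerate(donors):
        by_cell[donor[cell_var]].append(idx)

    fused = []
    for rec in recipients:
        # Only donors with an identical cell value are eligible (100% match).
        candidates = by_cell.get(rec[cell_var], [])

        def cost(idx):
            dist = math.sqrt(sum((rec[c] - donors[idx][c]) ** 2 for c in commons))
            return dist + penalty * usage[idx]

        nearest = sorted(candidates, key=cost)[:k]
        usage.update(nearest)
        if not nearest:
            # No donor in this cell: leave the fusion variables unknown.
            fused.append({v: None for v in fusion_vars})
            continue
        fused.append({v: sum(donors[i][v] for i in nearest) / len(nearest)
                      for v in fusion_vars})
    return fused


# Toy data, invented for illustration only.
donors = [
    {"gender": "f", "age": 30, "income": 2, "tv_hours": 1.0},
    {"gender": "f", "age": 50, "income": 3, "tv_hours": 3.0},
    {"gender": "m", "age": 31, "income": 2, "tv_hours": 5.0},
]
recipients = [{"gender": "f", "age": 32, "income": 2}]
print(fuse(recipients, donors, ["age", "income"], ["tv_hours"], "gender", k=1))
```

In practice the commons would probably need to be scaled to comparable ranges before distances are computed, since otherwise a variable such as age dominates the match. With a positive penalty, a donor that has already won becomes less attractive for later recipients, which spreads usage over the donors at the cost of slightly worse matches.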
Standard statistical tests can be used to check whether there are significant deviations in frequency distributions for variable values in the fused data set. An interesting problem when testing predictiveness is that in general there are no target values available for the recipient, so measures like root mean squared error and classification error can generally only be calculated for the donor.

3. Experiments & results

In this section we will describe some preliminary experiments and results with a standard statistical matching data fusion procedure. We assume the following hypothetical business case. A bank wants to learn more about its credit card customers and expand the market for this product. Unfortunately, there is no survey data available that includes credit cardholdership; this variable is known only for actual customers. Data fusion is used to enrich a customer database with survey data. The resulting data set serves as a starting point for further descriptive and predictive data mining analysis.

3.1 The data sets and fusion methodology used

In this experiment we did not use separate donors, but chose to split up an existing real world market survey into a donor and a recipient. The recipient contained 2000 records with a cell variable for gender and commons for age, marital status, region, number of persons in the household and income. Furthermore, the recipient contained a unique variable for credit card ownership, the target variable to model. The donor contained 4880 records, with 36 variables for which we expected that there might be a relation to credit cardholdership: general household demographics, holiday and spare time activities, financial product usage and personal attitudes. The original survey contains over a thousand variables and over 5000 possible variable values. We fused the donor and the recipient using 4-fold cross validation on the donor to determine the optimal k.
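The selection of k can be sketched as follows, under stated assumptions: each donor record is predicted from the other folds by averaging its k nearest neighbors on the commons, and the k with the lowest root mean squared error wins. The function names, the interleaved fold assignment, the candidate values for k and the toy data are illustrative choices and are not taken from the experiment.

```python
import math


def knn_predict(train, record, commons, target, k):
    """Average the target over the k training records closest on the commons."""
    def dist(row):
        return math.sqrt(sum((row[c] - record[c]) ** 2 for c in commons))
    nearest = sorted(train, key=dist)[:k]
    return sum(row[target] for row in nearest) / len(nearest)


def cv_rmse(donors, commons, target, k, folds=4):
    """Root mean squared error of k-NN prediction under cross validation."""
    squared_errors = []
    for f in range(folds):
        # Records whose index is congruent to f modulo `folds` are held out.
        test = donors[f::folds]
        train = [row for i, row in enumerate(donors) if i % folds != f]
        for row in test:
            pred = knn_predict(train, row, commons, target, k)
            squared_errors.append((pred - row[target]) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def best_k(donors, commons, target, candidates=(1, 3, 5, 10), folds=4):
    """Return the candidate k with the lowest cross-validated RMSE."""
    return min(candidates, key=lambda k: cv_rmse(donors, commons, target, k, folds))


# Toy donor set, invented for illustration: one common, one fusion variable.
toy_donors = [{"age": a, "tv_hours": a / 10} for a in range(20, 60)]
print(best_k(toy_donors, ["age"], "tv_hours"))
```

With several fusion variables, one option would be to average the error over all of them before choosing k, so that a single k serves the whole fusion.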
Only ordinal and binary fusion variables were included, so we restricted ourselves to predicting averages. Standard root mean squared error was used as a measure for predictive quality.

3.2 Internal evaluation: representativeness

Apart from the root mean squared error cross validation procedure we restricted ourselves to representativeness evaluation. First we compared averages for all variables for the donor and the recipient. As could be expected from the donor and recipient sizes and the fact that both sets were generated from the same source, there were not many significant differences between donor and recipient for the common variables. Within the recipient 'not married' was over represented (30.0% instead of 26.6%), 'married and living together' was under represented (56.1% versus 60.0%) and the countryside and larger families were slightly over represented.

More surprisingly (and reassuringly) the average fusion variable values were very well preserved in the recipient survey compared to the donor survey. Only the averages of "Way Of Spending The Night during Summer Holiday" and "Number Of Savings Accounts" differed significantly, by 2.6% and 1.5% respectively.

Apart from general statistics we wanted to evaluate the preservation of relations between variables, for which we used the following weak measures. For each common variable we listed its correlations with all fusion variables, once using the real values in the donor and once using the predicted values in the recipient. We then computed the correlation between these two lists and calculated the average over these correlations. The result was an average correlation of common-fusion relationships between recipient and donor of 0.9 ± 0.028. The mean difference between common-fusion correlations in the donor versus the recipient was 0.12 ± 0.028. In other words, these correlations were very well preserved. A similar procedure could be carried out for the fusion variables with respect to each other.
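The first of these measures can be sketched in Python as below. This is one reading of the description above rather than the original implementation: Pearson correlation is assumed throughout, records are assumed to be dictionaries of numeric values, and all function names and the toy data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; assumes both lists are non-constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def column(rows, name):
    return [row[name] for row in rows]


def correlation_profile(rows, common, fusion_vars):
    """Correlations of one common variable with every fusion variable."""
    return [pearson(column(rows, common), column(rows, f)) for f in fusion_vars]


def preservation(donors, fused_recipients, commons, fusion_vars):
    """Average, over the commons, of the correlation between the donor
    profile (real values) and the recipient profile (predicted values)."""
    scores = []
    for c in commons:
        donor_profile = correlation_profile(donors, c, fusion_vars)
        recipient_profile = correlation_profile(fused_recipients, c, fusion_vars)
        scores.append(pearson(donor_profile, recipient_profile))
    return sum(scores) / len(scores)


# Toy data, invented for illustration; the recipient is a copy of the donor,
# so the common-fusion relations are preserved perfectly.
toy = [
    {"age": 1, "f1": 1, "f2": 4, "f3": 1},
    {"age": 2, "f1": 2, "f2": 3, "f3": 3},
    {"age": 3, "f1": 3, "f2": 2, "f3": 2},
    {"age": 4, "f1": 4, "f2": 1, "f3": 4},
]
print(preservation(toy, toy, ["age"], ["f1", "f2", "f3"]))
```

A value close to 1 indicates that the pattern of common-fusion correlations in the recipient follows the pattern in the donor. Such a measure is insensitive to a uniform weakening of all correlations, which may be one reason to report the mean difference between the correlations alongside it.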
Further work should also be done on the application of penalty factors to improve representativeness. However, our preliminary experiments have demonstrated that penalties have a negative effect on the prediction quality (measured in RMSE).

3.3 External value of fused data for prediction tasks

To experiment with the added value of data fusion for further analysis (external evaluation) we first performed some descriptive data mining to discover relations between the target variable, credit cardholdership, and the fusion variables using straightforward univariate techniques. First we selected the top 10 fusion variables with the highest absolute correlations with the target (see Table I). Note that, in contrast to standard practice, it is perfectly legal to include dependent fusion variables such as 'frequency usage credit card' in the set of input variables for prediction. Smaller effects included "Need for cognition" (average 1.05 times higher) and fewer "housewives" (0.9 times lower). These results can already offer a lot of insight to a marketer.

The descriptive results were also used to guide the predictive data mining modelling process. In this case we wanted to investigate whether different computational learning methods would be able to exploit the added information in the fusion variables. We included naive Bayes, neural networks, linear regression and a version of naive Bayes adapted for ordinals (naive Bayes Gaussian). We report results over 10 runs with train and test sets of equal size. The quality of the models was measured by the so-called c-index, a rank based test related to Kendall's Tau [12], which measures the concordance between the ordered lists of real and predicted cardholders (see [15] for details on the algorithms and the c-index).
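To make the c-index concrete, the sketch below assumes the common pairwise definition: the fraction of (cardholder, non-cardholder) pairs in which the cardholder receives the higher score, with ties counted as half. The exact variant described in [15] may differ in details such as tie handling; the function name and the toy scores are illustrative.

```python
def c_index(labels, scores):
    """Fraction of (cardholder, non-cardholder) pairs ranked in the right order.

    labels: 1 for a real cardholder, 0 otherwise.
    scores: model scores, higher meaning more likely to be a cardholder.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    concordant = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5  # assumed convention: a tie counts as half
    return concordant / (len(positives) * len(negatives))


# Toy example: three of the four cardholder/non-cardholder pairs are
# ordered correctly.
print(c_index([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1]))  # 0.75
```

Under this definition a model that ranks every cardholder above every non-cardholder scores 1, while a model that assigns the same score to everyone scores 0.5.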
We compared models which were trained on commons only, for which no fusion was actually needed, and models on commons plus either all or a selection of correlated fusion variables (see Table II; c=0.5 means random prediction, c=1 means perfect prediction). These results indicate that for this data set the models that include the highly correlated fusion variables outperform the models which were built using commons only. For linear regression these differences were most significant. Significance was tested by a one-sided two-sample t-test on the 'fusion' runs versus the 'only commons' runs.

In figure 2 cumulative response curves are drawn for the linear regression models. The test recipients are ordered from high score to low score on the x-axis. The data points correspond to the actual proportion of cardholders up to that percentile. Random selection of customers results in an average proportion of 32.5% cardholders. We can see from this figure that credit cardholdership can be predicted quite well. The top 10% of cardholder prospects according to the prediction models contain around 50-65% cardholders. The added logarithmic trend lines indicate that the models which include fusion variables are better in 'creaming the crop', i.e. selecting the top prospects.

Table I: Fusion variables in recipient strongly correlated with credit card ownership

  Welfare class
  Income household above average
  Is a manager
  Manages which number of people
  Time per day of watching television
  Eating out (privately): money per person
  Frequency usage credit card
  Frequency usage regular customer card
  Statement current income
  Spend more money on investments

[Figure 2: Lift chart linear regression models for predicting credit card ownership (7 randomly selected runs). Series: commons & correlated fusion vbls; commons only.]

Table II: C-indexes

                                    SCG Neural Network   Linear regression   Naïve Bayes       Naïve Bayes Gaussian
  Only commons                      C=0.692 ± 0.012      C=0.692 ± 0.014     C=0.707 ± 0.015   C=0.701 ± 0.015
  Commons & correlated fusion vbls  C=0.703 ± 0.015      C=0.724 ± 0.012     C=0.712 ± 0.011   C=0.720 ± 0.012
                                    p=0.041              p=2.1e-5            p=0.20            p=0.0034
  Commons & all fusion vbls         C=0.694 ± 0.019      C=0.713 ± 0.013     C=0.704 ± 0.009   C=0.719 ± 0.012
                                    p=0.38               p=0.0017            p=0.72            p=0.0049

4. Discussion and future research

One could argue that in theory by applying data fusion no information is added to the recipient survey, because this information is derived directly from the commons. However, in practice data fusion can still be a valuable tool. For descriptive data mining tasks, the fusion variables and the patterns derived from these variables can be more understandable and easier to interpret for an end user than patterns derived solely from commons. Furthermore it is a well known practical fact that it often makes sense to include derived variables to improve prediction quality. In this case, fusion can make it easier for 'imperfect' algorithms such as linear regression to discover complex nonlinear relations between commons and target variables, by exploiting the information in the fusion variables. It is highly recommended to use appropriate variable selection techniques to remove the noise that is added by 'irrelevant' fusion variables (to counter the 'curse of dimensionality').

It goes without saying that evaluating the quality of data fusion is crucial for acceptance. We hope to have demonstrated that this is not straightforward. A lot of interesting research can be done in this area, especially in the field of evaluating the recipient fusion variable predictions, for which no targets are available.
Even a relatively simple question such as determining the optimal set of commons has interesting research dimensions. To structure all these choices we have started to build a data fusion process model, analogous to the CRISP-DM model for data mining [6].

Also, the core fusion algorithms provide a lot of room for research and improvement. There is no fundamental reason why the fusion algorithm should be based on k-nearest neighbor prediction instead of clustering methods, regression, the expectation-maximization (EM) algorithm or other statistical and machine learning algorithms (see e.g. [8]). By shifting from fusing surveys to fusing customer databases with surveys an extra challenge must be faced: scoring millions of customer database records instead of thousands of surveys.

All these efforts work towards a single vision: keeping all knowledge about a customer up to date, including soft information such as predictions based on measurements from different sources.

5. Conclusion

The promise of data fusion is indeed attractive: gaining insight into individual customers at a fraction of the price it would have cost to collect all this information in a single source survey. The application of data fusion will increase the value of data mining, because there is more integrated data to mine in. However, there is still a lot of interesting research to be done to evaluate data fusion quality and improve the still rather straightforward data fusion algorithms.

Acknowledgements

We would like to thank Michel de Ruiter, Martijn Ramaekers, Evelien Langendoen, Michiel van Wezel and Joost Kok for their comments. Part of this work has been performed within "The Fusion Factory" project, which is supported by the Dutch Ministry of Economic Affairs, through the KREDO stimulation initiative for development of electronic services.

References

[1] Antoine, J. 1985. A Case Study Illustrating the Objectives and Perspectives of Fusion Techniques. Proceedings of the Salzburg Readership Symposium.
[2] Baker, K., Harris, P. and O'Brien, J. 1989. Data Fusion: An Appraisal and Experimental Evaluation. Journal of the Market Research Society, 31 (2), 152-212.
[3] Barry, J.T. 1988. An investigation of statistical matching. Journal of Applied Statistics.
[4] O'Brien, S. 1990. The role of the data fusion in actionable media targeting in the 1990's. ESOMAR, pp 531-548.
[5] Bronner, A.E. 1989. Einde van de fusie fobie in Nederland? In: Jaarboek van de Nederlandse vereniging van marktonderzoekers 1988/1989, 9-18.
[6] Chapman, P., Clinton, J., Khabaza, T., Reinartz, T. and Wirth, R. 1999. The CRISP-DM Process Model. Draft Discussion paper, Crisp Consortium, March 1999. http://www.crisp-dm.org/.
[7] Harris, P. and Baker, K. 1998. Data Fusion. Admap, June 1998.
[8] Kamakura, W.A. and Wedel, M. 1996. Statistical Data-Fusion For Cross-Tabulation. Research Report SOM Institute, Groningen University, The Netherlands.
[9] Lokker, R. 1998. Bereikstudies Pers, Bioscoop en PMP. Centrum voor Informatie over de Media, Brussel, Belgium.
[10] van Noordwijk, A.J. 1983. Technical Notes on a Statistical Matching Experiment. Chapter 8 in: Koppelling van Databestanden, Sociaal en Cultureel Planbureau, Rijswijk, the Netherlands.
[11] Paass, G. 1986. Statistical Match: Evaluation of Existing Procedures and Improvements by Using Additional Information. In: Microanalytic Simulation Models to Support Social and Financial Policy. Orcutt, G.H. and Merz, K. (eds). Elsevier Science Publishers BV, North Holland.
[12] Press, W.H., Teukolsky, S.A., Vetterling, W.T. and Flannery, B.P. 1992. Numerical Recipes in C. The Art of Scientific Computing. Cambridge University Press, Cambridge MA, 2nd edition.
[13] van der Putten, P. 1999. Datamining in Direct Marketing Databases. In: Baets, W. (ed.). A Collection of Essays on Complexity and Management. World Scientific, Singapore.
[14] van der Putten, P. 1999. A Datamining Scenario for Stimulating Credit Card Usage by Mining Transaction Data. Proceedings of Benelearn-99.
[15] de Ruiter, M. 1999. Bayesian classification in data mining: theory and practice. MSc. Thesis, BWI, Free University of Amsterdam, The Netherlands.
[16] Schieler, H.E. and Wiegand, J. 1985. A Report on Experiments in Fusion in the Official German Media Research. Proceedings of the Salzburg Readership Symposium.