Intelligent Mailer

Bon Sy
Department of Computer Science, Queens College and The University Graduate Center/CUNY
Flushing, NY 11367
bon@bunny.cs.qc.edu

Anand Dinakar
Department of Computer Science, Queens College and The University Graduate Center/CUNY
Flushing, NY 11367
adinaka1@qc.edu

Jing Zou
Department of Computer Science, The University Graduate Center/CUNY
New York, NY 10016
jzou@gc.cuny.edu

Abstract

Intelligent Mailer (IM) is a system that predicts a person’s interest in an upcoming event, using data mining and pattern discovery techniques. Our approach takes into account three factors: (1) the history of an individual’s participation in events, (2) the hobbies of an individual, and (3) the interests in event categories as indicated by the individual. This system ties into an existing web-based framework for collecting and maintaining event information, event signup rosters, and personal choices of hobbies and interests. The likelihood of a user being interested in an event is obtained by combining two approaches: statistical analysis of the person’s indicated interests and signup history, and measurement of the correlation between the person’s hobbies and the upcoming event’s subject description.

1. Introduction

This paper reports our research in the development of an intelligent mailer (IM). The intelligent mailer operates in an electronic community bulletin board environment in which each community member can make event announcements. Instead of relying on each community member to visit the bulletin board frequently to check for new events, the IM proactively evaluates the potential match between the nature of an event and the interest of each individual as indicated in his/her personal profile. If the IM concludes that a new event matches the interest of an individual, it automatically sends an email alerting the individual to the new event.

This research is focused on a model-based pattern discovery technique [1, 3] for realizing the “intelligence” of the IM. The basic idea is to derive a probability model for each community member to encapsulate the statistical information about (1) what previous events an individual has (not) participated in, (2) what category each historical event belongs to, and whether the individual has expressed interest in that category, and (3) for each historical event, whether the nature of the event (as described in the subject header of an event announcement) matches the hobbies of the individual. When there is a new event announcement, a probabilistic inference is conducted on the probability model of each individual to determine the match between the new event and the interest of that individual.

There are two specific challenges in enabling the proposed model-based pattern discovery technique. First, although a new event announcement includes the category it belongs to, as specified by the announcer, there is no explicit information on how it is related to the hobby list of individuals. Second, the aforementioned three pieces of information encapsulated in a probability model may not provide the predictive power required for determining a possible match between the nature of the event and the interest of an individual. For example, if the historical data reveals that an individual is equally likely to participate or not participate in events falling into categories that the individual has expressed interest in, the category information provides no useful probabilistic information for predicting the individual’s participation in an event. Another example is when there is simply insufficient data to construct the probability model for inferring the likelihood of (not) attending a new event.
In section 2 we describe the information available to the system. In section 3 we detail the approach for handling the two challenges just mentioned. In section 4 we describe the implementation of the proposed model-based pattern discovery technique, followed by the preliminary results in section 5. In the conclusion section we describe future work for optimizing and improving the performance of IM.

2. Available Information for the System

The user provides two sets of information to IM. The first is the user’s choice of hobbies, selected from a list of predefined hobbies. The second is a list of event categories that the user is interested in, again chosen from a predefined list. Each event has a subject line briefly describing the nature of the event. A sample user profile is shown in Appendix 1.

A history of past events and the users who signed up for them is maintained in an Oracle® database. Each user and each event is assigned a unique numeric ID. The information is available in four tables: (1) the list of events and their particulars, (2) the list of each user’s hobbies, (3) the list of categories that each user is interested in, and (4) the list of users who signed up for each event. An example is shown in Appendix 1.

3. Measuring Interest

IM uses a combination of two distinct pieces of information to make a better decision. This scheme can be expanded to include other information as well. Data is collected and tabulated on a per-user basis. Each row of the table contains information pertaining to one past event. Each cell contains a binary value, as shown below:

Table 1. An example user history

EventID   A        B        C
1         {0, 1}   {0, 1}   {0, 1}
2         .        .        .
3         .        .        .
4         .        .        .
5         .        .        .

A cell in the first column (A) stores a 1 if the user signed up for this event, 0 otherwise. A cell in the second column (B) stores a 1 if the user has indicated interest in that row’s event category, 0 otherwise. A cell in the third column (C) stores a 1 if there is a strong relationship between the event’s subject description and any one of the user’s hobbies, 0 otherwise. Section 3.2 describes how to determine the extent of the relationship between the event’s subject description and a hobby.

3.1. Sign up list and Event’s Category

Columns A and B help us gauge whether the user is in fact interested in the event categories in which he or she has indicated interest. If the user has signed up for every event that belongs to an event category of interest, then in all likelihood he or she will sign up for the next event belonging to one of those categories. If the user signs up very rarely or not at all, then the indicated interest in a category carries little or no weight. In such a case, data reflecting the signup patterns of other users who share similar hobbies or exhibit associated attendance patterns could be used instead for this particular user.

3.2. Hobby list and Event’s Description

The hobby list and the event’s description are two strings consisting of a few words each. Recognizing a relationship between the two can be challenging. For our purposes, we compare each hobby against the event description string and produce a measure of how closely the two are related. This measure is compared against a threshold (determined empirically) and converted to a binary value. The binary values for all hobbies are then ORed together to produce the final value for the cell in the third column of the tabulation.

To compare the two strings and produce a numeric value, we developed a method based on Google®’s Web API™ [2] for searching directory categories. With the Google® Web API™, we can generate a list of classifications that correspond to a string. Each classification that the Google® Web API™ returns contains a hierarchy of directory terms.
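The hierarchy comparison of this section can be sketched in Python as follows. This is only an illustrative sketch, not the authors' Java program: it assumes the directory classifications have already been retrieved and are supplied as plain strings of the form "Arts > Visual Arts > Resources", it counts matching levels from the top of the hierarchy (one possible reading of "hierarchy levels that match"), and all function names are hypothetical. No call to the Google® Web API™ is made.

```python
def split_hierarchy(path):
    """Split 'Arts > Visual Arts > Resources' into its directory levels."""
    return [level.strip() for level in path.split(">") if level.strip()]


def matching_levels(path_a, path_b):
    """Count the leading hierarchy levels shared by two directory paths."""
    count = 0
    for a, b in zip(split_hierarchy(path_a), split_hierarchy(path_b)):
        if a.lower() != b.lower():
            break
        count += 1
    return count


def hobby_matches_event(hobby_paths, event_paths, threshold=3):
    """Return 1 if any hobby/event classification pair shares at least
    `threshold` levels (assumed default: the paper's empirical value of 3)."""
    for h in hobby_paths:
        for e in event_paths:
            if matching_levels(h, e) >= threshold:
                return 1
    return 0


def column_c(hobbies_to_paths, event_paths, threshold=3):
    """OR the per-hobby binary values to obtain the column C cell."""
    return int(any(hobby_matches_event(paths, event_paths, threshold)
                   for paths in hobbies_to_paths.values()))


if __name__ == "__main__":
    # Classifications quoted in section 5.1 of the paper.
    event = ["Arts > Visual Arts > Resources > Publications"]
    hobby = ["Arts > Visual Arts >Resources"]
    print(matching_levels(event[0], hobby[0]))   # prints 3
    print(hobby_matches_event(hobby, event))     # prints 1
```

The threshold is left as a parameter because the paper determines it empirically; a different directory source would likely need a different value.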
We then compare the hierarchy of directory terms obtained for the hobby string with that obtained for the event description string. The larger the number of hierarchy levels that match, the closer the topics are. The larger the number of such close matches, the more tightly related the strings are. We consider a match strong if the number of matching hierarchy levels is three or more. This threshold is determined empirically.

3.3. Arriving at a Result

After the table has been fully populated, joint and marginal probability values are computed. The probability Pr(A | B C) gives us the predictive power to estimate the likelihood of participation in an event given the information about the matching interest defined by event category and user hobbies (columns B and C). The estimated probability is then compared to a threshold to determine whether or not to send an email apprising the user of that event. The threshold value can be increased to make the system more conservative or decreased to make it more aggressive.

In the case of a user whose data is incomplete or unreliable, we can predict behavior by using a model derived from another user who has sufficient data. The model discovery techniques and tools discussed in [1] can be used to build a model from users with reliable data sets. The model can then be queried with the data available about users with incomplete data sets to predict whether they will be interested in an upcoming event.

4. Implementation

The implementation was divided into three parts, and each part was implemented in the language most appropriate to the task.

4.1. Fetching the Latest Input from the Online Resource (Perl Scripts)

Perl scripts are used to download the HTML web page, parse the tables into data files, and store them as local files accessible to the S-Plus program.

4.2. Preparing the Tabulation and Computing Results (S-Plus)

The S-Plus program reads the data tables from the input files and takes as input the user ID and the new event’s description and category. It then builds the data structures used to populate the table. After populating the table, the S-Plus program performs the statistical computation needed to determine whether the user will be interested in the new event.

4.3. Interacting with Google® Web API™ to Recover Comparison Results (Java)

For each row, to obtain the value for the third column, the S-Plus program invokes the Java program with the event description string and the list of hobbies as arguments. The Java program uses the Google® Web API™ to compute the similarity and returns either 0 or 1, which is placed in the third column.

5. Experiments and Results

In this section we first present an example of how a result is obtained. The result of a preliminary study is then presented.

5.1. Relating a Hobby to an Event Description

Let us consider an event “Picture gallery” posted on the bulletin board. A Google® Web API™ directory search returns “Arts > Visual Arts > Resources > Publications” among its classifications. The user hobby “Art Drawing Painting” produces “Arts > Visual Arts > Resources” among its classifications. Our comparison heuristic recognizes a three-level match between the two and signals a strong relationship between the hobby and the event by returning a ‘1’.

5.2. Combination of Methods

Let us consider a situation where the data for columns B and C for a user are incomplete or inconclusive. We know Pr(A1 B1 C1) for a user U1 with reliable data. We also know Pr(A2) for the user U2 with incomplete data for columns B and C. If some combinations of (A1 A2) exhibit significant statistical association patterns as discussed in chapter 8 of Sy & Gupta [1], we can fuse these probabilities and derive Pr(A1 B1 C1 A2) using the pattern utility discussed in chapter 9 of Sy & Gupta [1]. This pattern utility operates on the constraints that preserve the statistical significance of the patterns pertinent to the instantiations of (A1 A2), and on the principle of minimum biased information, to discover the optimal probability model Pr(A1 B1 C1 A2). The optimality of the probability model is defined by the minimization of biased information. Information theory may also be used in conjunction with statistical analysis to define optimality [4, 5]. It has been discussed in [1] that minimizing biased information is equivalent to maximizing the entropy of the probability model consisting of the parameters (A1 B1 C1 A2).

Let us use two examples to illustrate the combination method for fusing two probability models. First, suppose U1 and U2 are friends who always attend (or skip) events together. We would then expect patterns such as (A1=1 A2=1) and (A1=0 A2=0) to exhibit significant statistical association. Likewise, if U1 and U2 are foes who never attend events together, we would expect patterns such as (A1=1 A2=0) and (A1=0 A2=1) to exhibit significant statistical association. In these two cases, we could predict whether the user with incomplete data will sign up for the event based simply on the patterns found. On the other hand, there could be significant statistical association patterns that do not reveal specific information for prediction; for example, (A1=1 A2=1) and (A1=1 A2=0). In this case, although a prediction based simply on the patterns found will not be sufficient, we could still make a prediction based on Pr(A2 | B1 C1).
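A query of the form Pr(A2 | B1 C1) against a fused joint model can be sketched in Python as follows. The sketch assumes the joint model Pr(A1 B1 C1 A2) is already available as a dictionary; the values used are those of the optimal model listed in the concrete illustration of this section. The function names are hypothetical, and the model discovery step itself (the pattern utility of [1]) is not reproduced here.

```python
# Joint model Pr(A1, B1, C1, A2) taken from the illustration in section 5.2.
# Keys are (a1, b1, c1, a2); combinations not listed have probability 0.
JOINT = {
    (0, 0, 0, 1): 0.10,
    (0, 0, 1, 0): 0.01,
    (0, 1, 0, 0): 0.02,
    (0, 1, 1, 0): 0.02,
    (1, 0, 0, 0): 0.10,
    (1, 0, 1, 0): 0.15,
    (1, 0, 1, 1): 0.05,
    (1, 1, 0, 1): 0.20,
    (1, 1, 1, 1): 0.35,
}


def conditional_a2(joint, b1, c1):
    """Return {a2: Pr(A2=a2 | B1=b1, C1=c1)}, marginalizing over A1."""
    weights = {0: 0.0, 1: 0.0}
    for (_a1, b, c, a2), p in joint.items():
        if b == b1 and c == c1:
            weights[a2] += p
    total = sum(weights.values())
    if total == 0:
        raise ValueError("evidence has zero probability under the model")
    return {a2: w / total for a2, w in weights.items()}


def predict_a2(joint, b1, c1):
    """Return the most likely value of A2 given the evidence on B1 and C1."""
    cond = conditional_a2(joint, b1, c1)
    return max(cond, key=cond.get)


if __name__ == "__main__":
    print(predict_a2(JOINT, 1, 1))  # prints 1
```

With B1=1 and C1=1 only two listed terms contribute, 0.35 for A2=1 and 0.02 for A2=0, so the query favors A2=1 with probability 0.35/0.37.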
Below is a concrete illustration of the combination method. Let us assume the association patterns (A1=1 A2=1) and (A1=1 A2=0) are found to be statistically significant, with the following information:

Pr(A1=1 A2=1) = 0.6
Pr(A1=1 A2=0) = 0.25
Pr(A2=1) = 0.7

Let us further assume the following reliable data about U1 is available:

Pr(A1=0 B1=0 C1=0) = 0.1
Pr(A1=0 B1=0 C1=1) = 0.01
Pr(A1=0 B1=1 C1=0) = 0.02
Pr(A1=0 B1=1 C1=1) = 0.02
Pr(A1=1 B1=0 C1=0) = 0.1
Pr(A1=1 B1=0 C1=1) = 0.2
Pr(A1=1 B1=1 C1=0) = 0.2
Pr(A1=1 B1=1 C1=1) = 0.35

Using the pattern utility for model discovery discussed in chapter 9 of [1], the following optimal model preserving the probability information shown above is found:

Pr(A1=0 B1=0 C1=0 A2=1) = 0.1
Pr(A1=0 B1=0 C1=1 A2=0) = 0.01
Pr(A1=0 B1=1 C1=0 A2=0) = 0.02
Pr(A1=0 B1=1 C1=1 A2=0) = 0.02
Pr(A1=1 B1=0 C1=0 A2=0) = 0.1
Pr(A1=1 B1=0 C1=1 A2=0) = 0.15
Pr(A1=1 B1=0 C1=1 A2=1) = 0.05
Pr(A1=1 B1=1 C1=0 A2=1) = 0.2
Pr(A1=1 B1=1 C1=1 A2=1) = 0.35

where Pr(A1 B1 C1 A2) = 0 for the remaining terms that are not listed.

With the above probability model, suppose a new event belongs to a category in which U1 has expressed interest (i.e., B1=1), and there is also a match between the subject description of the new event and the hobby list of U1 (i.e., C1=1). Since Pr(B1=1 C1=1 A2=1) = 0.35 and Pr(B1=1 C1=1 A2=0) = 0.02, we can now infer whether the user U2 (with incomplete data) will sign up for the event based on ArgMaxA2[Pr(A2 | B1=1 C1=1)] = 1.

5.3. Preliminary Experimental Study

We have conducted a preliminary experimental study to evaluate the potential of the proposed approach. Five individuals, labeled U1, U2, … U5, participated in the preliminary study over a period of 7 days. Each individual set up a personal profile reflecting his or her hobby list and interest in the event categories. Prior to the experimental study, we created 105 events covering all the categories listed in Appendix 1. The subject header of each event is analyzed using the Google API as described in sections 3.2 and 4.3 to determine its match (and relation) to the hobby list of each individual (column C in table 1 shown in section 3). The category of each event is also recorded (column B in table 1). Each individual was asked to go through all 105 events and to indicate their interest (column A in table 1) through an online sign-up process during the 7-day period of the experimental study. This completes the data collection process.

Upon the completion of data collection, we have five tables, one for each individual, each similar to table 1. The size of each table is 105 rows by 3 columns, where each row corresponds to one event. Each table is then divided into two: a table ref consisting of 90 rows by 3 columns for deriving the prediction model, and a table test consisting of 15 rows by 3 columns for testing.

5.4. Experimental Results

The frequency distribution of the data in each of the five ref tables is derived to construct a probability model for each individual. These probability models are shown in Table 2 below:

Table 2. Probability model for prediction

A B C        PrU1     PrU2      PrU3     PrU4     PrU5
0 0 0        0.271    0.68      0.76     0.4      0.844
0 0 1        0        0.045     0.01     0.044    0.056
0 1 0        0.541    0.24      0.19     0.49     0.078
0 1 1        0.033    0         0        0.011    0
1 0 0        0.044    0         0.01     0.044    0.022
1 0 1        0.011    0         0        0        0
1 1 0        0.089    0.023     0.03     0.011    0
1 1 1        0.011    0.012     0        0        0
Chi-square   4.1866   25.1753   7.1076   4.8953   0.7799

In Table 2, the first column (excluding the last row) lists the value combinations of the historical information, where A is the sign-up, B indicates whether the category of an event appears in the user’s event category list, and C is the strength of the match between subject description and user hobby as returned by the Google API. Columns 2 through 6 each show the probability model Pr(A B C) of one individual. The last row shows the Chi-square value of the probability model under the assumption of independence.
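The construction of a probability model from a ref table, and a Chi-square value against independence, can be sketched in Python as follows. This is a sketch under stated assumptions: the paper does not give the exact formula behind the last row of Table 2, so the standard Pearson statistic against full independence of A, B and C is used here, and it is not claimed to reproduce the Table 2 values. The function names are hypothetical.

```python
from collections import Counter
from itertools import product


def joint_model(rows):
    """Estimate Pr(A, B, C) as relative frequencies of the binary rows."""
    counts = Counter(tuple(r) for r in rows)
    n = len(rows)
    return {abc: counts.get(abc, 0) / n for abc in product((0, 1), repeat=3)}


def marginal(joint, index):
    """Return Pr(X=1) for the variable at position index (0=A, 1=B, 2=C)."""
    return sum(p for abc, p in joint.items() if abc[index] == 1)


def chi_square_independence(joint, n):
    """Pearson chi-square of the joint model against the product of its
    marginals (assumed form; n is the number of rows in the ref table)."""
    p_one = [marginal(joint, i) for i in range(3)]
    stat = 0.0
    for abc, p in joint.items():
        expected = 1.0
        for i, v in enumerate(abc):
            expected *= p_one[i] if v == 1 else 1.0 - p_one[i]
        if expected > 0:  # cells with zero expectation are also empty
            stat += n * (p - expected) ** 2 / expected
    return stat
```

A value near zero means the observed table is close to what independent columns would produce, which is the situation the paper flags as lacking predictive power.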
It is noted that the Chi-square value of the model for the individual U5 is significantly lower than the others and is close to zero. The implication is that inferring the interest of this individual based on Pr(A|BC) may not be reliable, because the high level of independence among (A B C) leaves little of the association that provides predictive power. Consequently, we will identify a “buddy” for U5 and derive a prediction model for U5 based on the information of that buddy.

To determine the “buddy” of U5, we first extract the first column from the ref table of each individual and construct a data table consisting of 90 rows by 5 columns. This new table essentially contains the data about each user’s interest as reflected in the sign-up history. We then apply the technique discussed in chapter 8 of Sy & Gupta [1] for discovering second-order significant association patterns. These patterns reveal whether two individuals tend to sign up (or not sign up) for the same event, and that this behavior is not a coincidence. The following second-order association patterns are found: (Ui:0 Uj:0), (Ui:1 Uj:1) for i, j ≠ 4, and (Ui:1 U4:0), (Ui:0 U4:1) for i = 1, 2, 3, 5. In other words, any pair of individuals not involving U4 exhibits patterns of (not) signing up together, while any pair involving U4 exhibits patterns in which one signs up and the other does not. After the association pattern discovery, a simple Pearson correlation analysis was employed to determine a possible buddy for U5. U3 is identified as the “best buddy” for U5 based on their correlated behavior in the sign-up history.

5.5. Model Discovery and Prediction

The models PrU1 .. PrU4 in table 2 form a basis for inferring whether a new event should be “up-sold” to the individuals U1 ... U4.
Since the model for U5 bears a high level of independence as revealed by the Chi-square value, we derive a prediction model for U5 based on the information of U3, the buddy of U5. The prediction model for U5 is derived with the method described in section 5.2, using the information from PrU3 and the following constraints that preserve the significance of the degree of association [3] exhibited in the patterns (U3:0 U5:0), (U3:1 U5:1):

Pr(AU3:0, AU5:0) = 0.96
Pr(AU3:1, AU5:1) = 0.022
Pr(AU5:0) = 0.978

The following model for U5 that maximizes Shannon entropy was found:

Table 3. U5 new model

AU5   BU3   CU3   Pr(AU5 BU3 CU3)
0     0     0     0.76
0     0     1     0.01
0     1     0     0.01
0     1     1     0
1     0     0     0.208
1     0     1     0.012
1     1     0     0
1     1     1     0

Based on PrU1 .. PrU4 and the model above for U5, we determine the value instantiation for A based on ArgMaxA[Pr(A|B C)] for each one of the 15 events in the test table for each of the five individuals. We then compare the inferred value instantiation of A with that in the test table.

5.6. Evaluating Preliminary Results

In this preliminary study, we calculate two kinds of error: false negatives and false positives. A false negative results if an event in a test table has a value of 1 for A while the inferred value for A is 0 (based on ArgMaxA[Pr(A|B C)]). Likewise, a false positive results if an event in a test table has a value of 0 for A while the inferred value of A is 1. The following is the result of this preliminary study:

Table 4. Experimental results

Individuals            U1   U2   U3   U4   U5
# of training events   90   90   90   90   90
# of test events       15   15   15   15   15
False positive         0    0    0    0    0
False negative         1    0    0    0    2

While the results shown in table 4 seem encouraging, we cannot draw strong conclusions because the sign-up rate is relatively low. As a consequence, the probability distribution of a prediction model is skewed towards events that are of little interest. In other words, we cannot be certain whether the low false positive and false negative error rates are really due to the effectiveness of the proposed approach or due to the lack of event instances that can (dis)validate the proposed approach. Nonetheless, due to the statistical nature of the proposed approach, we do expect the accuracy of the system to improve over time as more events take place and more data becomes available, assuming the data set used for this preliminary study is consistent and reflects the distribution of the event activities in the system.

6. Conclusion

In this paper, we employ a multi-faceted approach to predict a user’s interest. We employed an information-statistical approach to expose hidden significant association patterns. Many optimizations are possible at different stages of this scheme. Some that will result in drastic improvements in speed are:

1. Cache comparison results of event descriptions and hobby lists. This will improve runtime speed significantly because many of the comparisons are repetitive, given a limited number of hobbies and a limited number of event categories.

2. Cache the tabular column for determining interest. Every time a new event expires, it can simply be appended to the table, thus limiting the number of computations.

7. Acknowledgement

This work is supported in part by a NSF DUE CCLI grant #0088778, and a PSC-CUNY Research Award.

8. References

[1]. Sy B.K., Gupta A.K., “Information-Statistical Data Mining”, Kluwer, 1st Edition, 2004.
[2]. Google® Web API™, http://www.google.com/apis/
[3]. Sy B.K., “Probability Model Selection Using Information-Theoretic Optimization Criterion,” Journal of Statistical Computing and Simulation, Gordon and Breach Publishing Group, NJ, 69(3), 2001.
[4]. Grenander U., Elements of Pattern Theory, The Johns Hopkins University Press, 1996, ISBN 0-8018-5187-4.
[5]. Chen J. and Gupta A.K., “Information Criterion and Change Point Problem for Regular Models,” Technical Report No.
9805, Department of Math. and Stat., Bowling Green State U., Ohio.

APPENDIX 1

User profile and categories to choose from

Screen-shot of a user’s profile in the online bulletin board

Hobbies: Art/Drawing/Painting Crafts Music Reading Sports Games Outdoors Photography Cooking/Wine Tasting Gardening Movies Wireless/RF/RC

Event Categories: Study group Cultural / Community activities Seminars Town Hall/Political/Religious/Community Film screening/premiere Field Trip Exhibit Poetry

Sample data illustrating the information available to the system

Event category table

Category ID   Event ID   Subject                      Date        Time    Location
5             15         WLAN radio project meeting   05-Nov-03   12:00   NSB A207A
10            16         Shakespeare                  03-Nov-03   13:00   NSB A202

Event signup table

Event ID   Signed up User ID
15         2
22         94
22         2

User category table

User ID   Category ID
2         1
2         2
2         3

User hobby table

User ID   Hobby ID
2         2
104       1
2         8