International Conference for Knowledge Engineering 2004, Las Vegas

Intelligent Mailer
Bon Sy
Department of Computer Science, Queens College and The University Graduate Center/CUNY
Flushing, NY 11367
bon@bunny.cs.qc.edu
Anand Dinakar
Department of Computer Science, Queens College and The University Graduate Center/CUNY
Flushing, NY 11367
adinaka1@qc.edu
Abstract
Intelligent Mailer (IM) is a system that predicts a
person’s interest in an upcoming event, using data mining
and pattern discovery techniques. Our approach takes
into account three factors: (1) the history of an
individual’s participation in events, (2) the hobbies of an
individual, and (3) the interests in event categories as
indicated by the individual. This system ties into an
existing web-based framework for collecting and
maintaining event information, event signup rosters and
personal choice of hobbies and interests. The likelihood of
a user being interested in an event is estimated by
combining two approaches: statistical analysis of the
person’s indication of interest and signup history, and
measurement of any correlation between the person’s
hobbies and the upcoming event’s subject description.
1. Introduction
This paper reports our research in the development of
an intelligent mailer (IM). The intelligent mailer is
operated in an electronic community bulletin board
environment in which each community member could
make event announcements. Instead of relying on each
community member to frequently visit the community
bulletin board to check on new events, the goal of the IM
is to proactively evaluate the potential match between the
nature of an event and the interest of each individual as
indicated in his/her personal profile. If the IM
reaches a positive conclusion on the potential match
between a new event and the interest of an individual, the
IM will automatically send an email to alert the individual
about the new event.
Jing Zou
Department of Computer Science, The University Graduate Center/CUNY
New York, NY 10016
jzou@gc.cuny.edu
This research is focused on a model-based pattern
discovery technique [1, 3] for realizing the “intelligence”
of the IM. The basic idea is to derive a probability model
for each community member to encapsulate the statistical
information about (1) what previous events an individual
has (not) participated in, (2) what category each historical
event belongs to, and whether the individual has
expressed interests in that category, and (3) for each
historical event whether the nature of the event (as
described in the subject header of an event announcement)
has a match with the hobbies of the individual. When
there is a new event announcement, a probabilistic
inference will be conducted based upon the probability
model of each individual to determine the match between
the new event and the interest of an individual.
There are two specific challenges in enabling the
proposed model-based pattern discovery technique. First,
although a new event announcement will include
information about what category it belongs to as specified
by the announcer, there is no explicit information on how
it is related to the hobby list of individuals. Second, the
aforementioned three pieces of information encapsulated
in a probability model may not provide the predictive
power required for determining a possible match between
the nature of the event and the interest of an individual.
For example, if the historical data reveals equal likelihood
for whether an individual does (not) participate in events
falling into categories that an individual has expressed
interest in, there is no useful probabilistic information for
predicting the likelihood of an individual’s participation
in an event based on the category information. Yet
another example is that there is simply insufficient data to
construct the probability model for inferring the
likelihood of (not) attending a new event.
In section 2 we will describe the information available
to the system. In section 3 we will detail the approach for
handling the two challenges just mentioned. In section 4
we will describe the implementation of the proposed
model-based pattern discovery technique, followed by the
preliminary results in section 5. In the conclusion section
we will describe future work for optimizing and
improving the performance of IM.
2. Available Information for the System
The user provides two sets of information to IM. The
first is the user’s choice of hobbies. For our purposes, the
user chooses from a list of predefined hobbies. The
second piece of information is a list of event categories
that the user is interested in. Again, this is chosen from a
predefined list of event categories. Each event has a
subject line briefly describing the nature of the event. A
sample user profile is shown in Appendix 1. A history of
past events and the users who signed up for them is
maintained in an Oracle® database.
Each user and event is designated a unique numeric
ID. The information is available in four tables: (1) list of
events and the particulars, (2) list of user’s hobbies, (3)
list of categories that a user is interested in and (4) list of
users who signed up for each event. An example is shown
in Appendix 1.
3. Measuring Interest
IM uses a combination of two distinct pieces of
information to make a better decision. This scheme can be
expanded to include other information as well. Data is
collected on a per-user basis and is tabulated. Each row of
the table contains information pertaining to one event
from the past. Each cell contains a binary value, as shown
below:
Table 1. An example user history
EventID   A        B        C
1         {0, 1}   {0, 1}   {0, 1}
2         .        .        .
3         .        .        .
4         .        .        .
5         .        .        .
A cell in column A stores a 1 if the user has
previously signed up for this event, 0 otherwise. A cell in
column B stores a 1 if the user has indicated
interest in that row’s event category, 0 otherwise. A cell
in column C stores a 1 if there is a strong
relationship between the event’s subject description and
any one of the user’s hobbies, 0 otherwise. Section 3.2
describes how to determine the extent of the relationship
between the event’s subject description and any one of
the user’s hobbies.
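The tabulation described above can be sketched as follows. This is a minimal illustration, not the S-Plus implementation: the function name build_history, the data layout, and the toy matcher word_overlap (a stand-in for the comparison of section 3.2) are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Set, Tuple


def build_history(
    events: Dict[int, Tuple[int, str]],   # event ID -> (category ID, subject line)
    signed_up: Set[int],                  # IDs of events the user signed up for
    liked_categories: Set[int],           # category IDs the user marked as interesting
    hobbies: List[str],                   # the user's hobbies
    hobby_matches: Callable[[str, str], bool],  # hypothetical hobby/subject comparison
) -> Dict[int, Tuple[int, int, int]]:
    """Return one (A, B, C) row of binary values per past event."""
    table = {}
    for event_id, (category_id, subject) in sorted(events.items()):
        a = int(event_id in signed_up)             # column A: signed up?
        b = int(category_id in liked_categories)   # column B: category of interest?
        c = int(any(hobby_matches(h, subject) for h in hobbies))  # column C: OR over hobbies
        table[event_id] = (a, b, c)
    return table


def word_overlap(hobby: str, subject: str) -> bool:
    """Toy matcher (assumption): true if the hobby appears as a word of the subject."""
    return hobby.lower() in subject.lower().split()


# Illustrative data only; the user profile here is made up.
events = {15: (5, "WLAN radio project meeting"), 16: (10, "Shakespeare")}
history = build_history(events, {15}, {5}, ["radio"], word_overlap)
```

Each row depends only on one event and one user profile, so the table can be rebuilt or extended one event at a time.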
3.1. Sign up list and Event’s Category
Columns A and B help us gauge whether the user is in fact
interested in the event categories for which he or she has
indicated interest. If the user has signed up for every event
that belongs to a category of interest, then in
all likelihood he or she will sign up for the next event
belonging to one of those categories. If the user signs
up very rarely, or does not sign up for any event, then the
interest indicated for a category carries
little or no weight. In such a case, the signup patterns
of other users who share similar hobbies or exhibit
attendance association patterns could be used instead for
this particular user.
3.2. Hobby list and Event’s Description
The hobby list and the event’s description are two
strings consisting of a few words each. Recognizing a
relationship between the two could be challenging. For
our purposes, we compare each hobby against the event
description string and come up with a measure of how
closely the two are related. This measure is compared
against a threshold (determined empirically) and is
converted to a binary value. The binary values for each
hobby are then ORed together to produce the final value
for the cell in the third column of the tabulation.
To compare the two strings and produce a numeric
value, we developed a method based on Google®’s Web
API™ [2] for searching directory categories. With the
Google® Web API, we can generate a list of
classifications that correspond to a string. Each
classification that Google® Web API™ returns contains a
hierarchy of directory terms. We then compare the
hierarchy of directory terms obtained for the hobby string
with that obtained for the event description string. The
larger the number of hierarchy levels that match, the
closer the topics are. The larger the number of such
close matches, the more tightly related the strings are.
We consider a match strong if the number of matching
hierarchy levels is three or more. This threshold was
determined empirically.
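The comparison heuristic can be sketched as below. The sketch assumes that the directory classifications have already been retrieved as strings of the form "Arts > Visual Arts > Resources" (as in section 5.1), and that matching levels are counted from the top of the hierarchy downward; the text does not spell out either detail. The function names are illustrative only, and no call to the Google Web API is made.

```python
from typing import Iterable, List

STRONG_MATCH_LEVELS = 3  # empirical threshold stated in the text


def split_path(path: str) -> List[str]:
    """Split 'Arts > Visual Arts >Resources' into its hierarchy levels."""
    return [part.strip() for part in path.split(">") if part.strip()]


def matching_levels(path_a: str, path_b: str) -> int:
    """Count how many leading hierarchy levels two classifications share."""
    count = 0
    for level_a, level_b in zip(split_path(path_a), split_path(path_b)):
        if level_a != level_b:
            break
        count += 1
    return count


def strong_match(hobby_paths: Iterable[str], event_paths: Iterable[str]) -> int:
    """Return 1 if any hobby/event classification pair shares enough levels, else 0."""
    event_list = list(event_paths)
    return int(any(
        matching_levels(h, e) >= STRONG_MATCH_LEVELS
        for h in hobby_paths
        for e in event_list
    ))
```

Applied to the classifications quoted in section 5.1, matching_levels finds three shared levels, so strong_match returns 1.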
3.3. Arriving at a Result
After the table has been fully populated, joint and
marginal probability values are computed. The probability
value of Pr(A | B C) gives us the predictive power to
estimate the likelihood of an event participation given the
information about the matching interest defined by event
category and user hobbies (columns B and C). The
estimated probability is then compared to a threshold to
determine whether or not to send out an email apprising
the user of that event. The threshold value can be
increased to make the system more conservative or
decreased to make it more aggressive.
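As a rough illustration of this step, the conditional probability Pr(A=1 | B C) can be estimated by relative frequency from the rows of the user history. The function names and the default threshold of 0.5 are assumptions made for this sketch; the paper does not state the threshold value used.

```python
from typing import Iterable, Optional, Tuple

Row = Tuple[int, int, int]  # one history row: (A, B, C)


def prob_signup(history: Iterable[Row], b: int, c: int) -> Optional[float]:
    """Estimate Pr(A=1 | B=b, C=c); None if no past event has this (B, C) pair."""
    outcomes = [a for (a, hb, hc) in history if hb == b and hc == c]
    if not outcomes:
        return None  # no data for this condition
    return sum(outcomes) / len(outcomes)


def should_email(history: Iterable[Row], b: int, c: int, threshold: float = 0.5) -> bool:
    """Send an alert only if the estimated probability reaches the threshold."""
    p = prob_signup(list(history), b, c)
    return p is not None and p >= threshold


# Illustrative history: three events with B=1, C=1 (two attended), one with B=0, C=0.
example_history = [(1, 1, 1), (1, 1, 1), (0, 1, 1), (0, 0, 0)]
```

Raising the threshold argument makes the mailer more conservative, and lowering it makes the mailer more aggressive, as described above.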
In the case of a user whose data is incomplete or
unreliable, we can predict behavior by using a model
derived from another user who has sufficient data. Model
discovery techniques and tools discussed in [1] can be
used to build a model from users with reliable data sets.
The model can then be queried with data available about
users with incomplete data sets, to predict if they will be
interested in an upcoming event.
4. Implementation
The implementation was divided into three parts and
each part was implemented in a language that was most
appropriate to the task.
4.1. Fetching the Latest Input from the Online
Resource (Perl Scripts)
Perl scripts are used to download the HTML web page,
parse the tables into data files and store them as local
files, accessible to the S-Plus program.
4.2. Preparing the Tabulation and Computing
Results (S-Plus)
The S-Plus program reads the data tables from the
input files, and takes as input the user ID, the new event’s
description and category. Then, it builds the data
structures needed to populate the tabulation described in
section 3. After populating the table, the S-Plus program
performs the statistical computation needed to determine
whether the user will be interested in the new event.
4.3. Interacting with Google® Web API™ to
Recover Comparison Results (Java)
For each row, to obtain the value for column C,
the S-Plus program invokes the Java program with the
event description string and the list of hobbies as
arguments. The Java program queries the Google® Web
API™, computes the similarity, and returns either 0 or 1,
which is stored in column C.
5. Experiments and Results
In this section we first present an example of how a
result is obtained. The results of a preliminary study are
then presented.
5.1. Relating a Hobby to an Event Description
Let us consider an event “Picture gallery” posted in the
bulletin board. A Google® Web API™ directory search
returns “Arts > Visual Arts > Resources > Publications”
among its classifications. The user hobby “Art Drawing
Painting” would produce “Arts > Visual Arts > Resources”
among its classifications. Our comparison heuristic would
recognize a three-level match between the two and signal
a strong relationship between the hobby and the event
description by returning a ‘1’.
5.2. Combination of Methods
Let us consider a situation where the data for columns
B and C for a user are incomplete or inconclusive. We
know Pr (A1 B1 C1) for a user U1 with reliable data. We
also know Pr(A2) for the user U2 with incomplete data
for columns B and C. If some combinations of (A1 A2)
exhibit significant statistical association patterns as
discussed in chapter 8 of Sy & Gupta [1], we can fuse
these probabilities and derive Pr(A1 B1 C1 A2) using the
pattern utility discussed in chapter 9 of Sy & Gupta [1].
This pattern utility operates on the constraints that
preserve the statistical significance of the patterns
pertinent to the instantiations of (A1 A2), and the
principle of minimum biased information, to discover the
optimal probability model Pr(A1 B1 C1 A2). The
optimality of the probability model is defined by the
minimization of biased information. Information theory
may also be used in conjunction with statistical analysis to
define optimality [4, 5]. It has been discussed in [1] that
minimizing biased information is equivalent to
maximizing the entropy for the probability model
consisting of the parameters (A1 B1 C1 A2).
Let us use two examples to illustrate the idea of the
combination method for fusing two probability models.
First, suppose U1 and U2 are friends who always attend
(or skip) events together. We would then expect
patterns such as (A1=1 A2=1) and (A1=0 A2=0)
to exhibit significant statistical association. Likewise, if
U1 and U2 are foes who never attend events together,
we would expect patterns such as (A1=1 A2=0)
and (A1=0 A2=1) to exhibit significant statistical
association. In these two cases, we could predict whether the
user with incomplete data will sign up for the event
based simply on the patterns found.
On the other hand, there could be significant statistical
association patterns that do not reveal
specific information for prediction; for example, (A1=1
A2=1) and (A1=1 A2=0). In this case, although a
prediction based simply on the patterns found
is not sufficient, we could still make a prediction
based on Pr(A2 | B1 C1). Below is a concrete illustration
of the combination method:
Let’s assume the association patterns (A1=1 A2=1)
and (A1=1 A2=0) are found to be statistically significant
with the following information:
Pr(A1=1 A2=1) = 0.6
Pr(A1=1 A2=0) = 0.25
Pr(A2=1) = 0.7
Let’s further assume the following reliable data about
U1 is available:
Pr(A1=0 B1=0 C1=0) = 0.1
Pr(A1=0 B1=0 C1=1) = 0.01
Pr(A1=0 B1=1 C1=0) = 0.02
Pr(A1=0 B1=1 C1=1) = 0.02
Pr(A1=1 B1=0 C1=0) = 0.1
Pr(A1=1 B1=0 C1=1) = 0.2
Pr(A1=1 B1=1 C1=0) = 0.2
Pr(A1=1 B1=1 C1=1) = 0.35
Using the pattern utility for model discovery
discussed in chapter 9 of [1], the following
optimal model preserving the probability information
shown above is found:
Pr(A1=0 B1=0 C1=0 A2=1) = 0.1
Pr(A1=0 B1=0 C1=1 A2=0) = 0.01
Pr(A1=0 B1=1 C1=0 A2=0) = 0.02
Pr(A1=0 B1=1 C1=1 A2=0) = 0.02
Pr(A1=1 B1=0 C1=0 A2=0) = 0.1
Pr(A1=1 B1=0 C1=1 A2=0) = 0.15
Pr(A1=1 B1=0 C1=1 A2=1) = 0.05
Pr(A1=1 B1=1 C1=0 A2=1) = 0.2
Pr(A1=1 B1=1 C1=1 A2=1) = 0.35
where Pr(A1 B1 C1 A2) = 0 for the remaining terms that
are not listed.
With the above probability model, suppose a
new event belongs to a category in which U1 has
expressed interest (i.e., B1=1), and there is also a match
between the subject description of the new event and the
hobby list of U1 (i.e., C1=1). We can then infer whether the
user U2 (with incomplete data) will sign up for the event
based on ArgMaxA2[Pr(A2|B1=1 C1=1)] = 1.
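The arithmetic of this example can be checked with a short sketch. It copies the nine non-zero entries of the fused model listed above, confirms that the model reproduces the stated constraints, and computes the posterior of A2 given B1=1 and C1=1. The helper names marginal and posterior_a2 are illustrative; the sketch does not perform the model discovery itself, which relies on the pattern utility of [1].

```python
from collections import defaultdict

# (A1, B1, C1, A2) -> probability, copied from the model above; unlisted terms are 0.
model = {
    (0, 0, 0, 1): 0.10,
    (0, 0, 1, 0): 0.01,
    (0, 1, 0, 0): 0.02,
    (0, 1, 1, 0): 0.02,
    (1, 0, 0, 0): 0.10,
    (1, 0, 1, 0): 0.15,
    (1, 0, 1, 1): 0.05,
    (1, 1, 0, 1): 0.20,
    (1, 1, 1, 1): 0.35,
}


def marginal(joint, keep):
    """Sum the joint distribution down to the variable positions listed in keep."""
    out = defaultdict(float)
    for key, p in joint.items():
        out[tuple(key[i] for i in keep)] += p
    return dict(out)


def posterior_a2(joint, b1, c1):
    """Return Pr(A2 | B1=b1, C1=c1) as a dict {0: ..., 1: ...}."""
    weights = {0: 0.0, 1: 0.0}
    for (_a1, b, c, a2), p in joint.items():
        if b == b1 and c == c1:
            weights[a2] += p
    total = sum(weights.values())  # assumed non-zero for the queried condition
    return {a2: w / total for a2, w in weights.items()}


pair = marginal(model, (0, 3))        # joint of (A1, A2): 0.6 at (1, 1), 0.25 at (1, 0)
post = posterior_a2(model, 1, 1)      # Pr(A2=1 | B1=1, C1=1) = 0.35 / 0.37, about 0.946
prediction = max(post, key=post.get)  # 1, i.e. U2 is predicted to sign up
```

Only two entries of the model have B1=1 and C1=1, with probabilities 0.02 (A2=0) and 0.35 (A2=1), which is why the prediction for U2 is 1.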
5.3. Preliminary Experimental Study
We have conducted a preliminary experimental study
to evaluate the potential of the proposed approach. Five
individuals, labeled as U1, U2, … U5, participated in the
preliminary study over a period of 7 days. Each individual
has set up a personal profile reflecting one’s hobby list
and one’s interest in the event categories.
Prior to the experimental study, we created 105 events
covering all the categories listed in Appendix 1. The
subject header of each event is analyzed using the Google
API described in sections 3.2 and 4.3 to determine its
match (and relation) to the hobby list of each individual
(column C in table 1 shown in section 3). The category of
each event is also recorded (column B in table 1). Each
individual was asked to go through all 105 events and to
indicate their interest (column A in table 1) through an
online sign-up process during the 7-day period of the
experimental study. This completes the data collection
process.
Upon the completion of data collection, we have five
tables, one for each individual, similar to Table 1.
The size of each table is 105 rows by 3 columns, where
each row corresponds to one event. Each table is then
divided into two: a table ref consisting of 90 rows by 3
columns for deriving the prediction model, and a table test
consisting of 15 rows by 3 columns for testing.
5.4. Experimental Results
The frequency distribution of the data in each of the
five ref tables is derived to construct a probability model
for each individual. Such probability models are shown in
Table 2 below:
Table 2. Probability model for prediction
A B C        PrU1     PrU2      PrU3     PrU4     PrU5
0 0 0        0.271    0.68      0.76     0.4      0.844
0 0 1        0        0.045     0.01     0.044    0.056
0 1 0        0.541    0.24      0.19     0.49     0.078
0 1 1        0.033    0         0        0.011    0
1 0 0        0.044    0         0.01     0.044    0.022
1 0 1        0.011    0         0        0        0
1 1 0        0.089    0.023     0.03     0.011    0
1 1 1        0.011    0.012     0        0        0
Chi-square   4.1866   25.1753   7.1076   4.8953   0.7799
In Table 2, the first column (excluding the last row)
lists the value instantiations of A: sign-up, B: whether
the category of an event appears in a user’s event category
list, and C: whether there is a strong match between subject
description and user hobby as returned by the Google API.
Columns 2 through 6 each show the probability
model Pr(A B C) of an individual. The last row shows the
Chi-square value of the probability model under the
assumption of independence. Note that the Chi-square
value of the model for the individual U5 is
significantly lower than the others and is close to zero.
This implies that A, B and C are nearly independent for
U5, so Pr(A|B C) carries little of the association needed
to reliably infer the interest of this individual.
Consequently, we will identify a “buddy” for U5
and derive a prediction model for U5 based on the
information of the buddy of U5.
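The paper does not spell out how the Chi-square value is computed. The sketch below assumes it is Pearson's chi-square statistic of the observed cell counts against the counts expected if A, B and C were mutually independent. The counts for U5 are also an assumption: they are reconstructed by multiplying the PrU5 column of Table 2 by the 90 reference events and rounding. Under these assumptions the sketch gives approximately 0.78, in line with the value reported for U5.

```python
from itertools import product
from typing import Dict, Tuple

Cell = Tuple[int, int, int]  # one (A, B, C) value instantiation


def chi_square_independence(counts: Dict[Cell, int]) -> float:
    """Pearson chi-square of observed counts against mutual independence of A, B, C."""
    n = sum(counts.values())
    # marg[v][x] = number of events in which variable v takes value x
    marg = [[0, 0] for _ in range(3)]
    for cell, k in counts.items():
        for var, value in enumerate(cell):
            marg[var][value] += k
    chi2 = 0.0
    for cell in product((0, 1), repeat=3):
        expected = float(n)
        for var, value in enumerate(cell):
            expected *= marg[var][value] / n  # product of marginal probabilities
        if expected > 0:  # skip cells that are impossible under the marginals
            observed = counts.get(cell, 0)
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# Counts for U5 reconstructed from Table 2 (assumption): 0.844, 0.056, 0.078, 0.022 times 90.
u5_counts = {(0, 0, 0): 76, (0, 0, 1): 5, (0, 1, 0): 7, (1, 0, 0): 2}
u5_chi2 = chi_square_independence(u5_counts)
```

A value this close to zero means the observed counts are almost exactly what independence predicts, which is the reason given above for not relying on Pr(A|B C) for U5.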
To determine the “buddy” of U5, we first extract the
first column (column A) from the ref table of each
individual and construct a data table consisting of 90 rows
by 5 columns. This new table essentially contains the data
about each user’s interest as reflected in the sign-up
history. We then apply the technique discussed in
chapter 8 of Sy & Gupta [1] for discovering second-order
significant association patterns. These patterns reveal
whether two individuals tend to sign up (or not) for the
same event, and that this tendency is not a
coincidence. The following second-order
association patterns are found:
(Ui:0 Uj:0), (Ui:1 Uj:1) for i and j ≠ 4, and (Ui:1
U4:0), (Ui:0 U4:1) for i = 1,2,3,5. In other words, any pair
of individuals excluding U4 exhibits patterns of (not)
signing up together, while any pair involving U4 exhibits
patterns in which one signs up while the other does not.
After the association pattern discovery, a simple
Pearson correlation analysis was employed to determine a
possible buddy for U5. U3 is identified as the “best
buddy” for U5 based on their correlated behavior on the
sign-up history.
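A buddy selection of this kind can be sketched as follows. The sign-up columns in the example are made-up illustrative data, not the study's data, and the rule of picking the user with the highest Pearson correlation is an assumption about how the "best buddy" was chosen.

```python
from math import sqrt
from typing import Dict, List, Optional


def pearson(x: List[int], y: List[int]) -> Optional[float]:
    """Pearson correlation of two equal-length columns; None if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined for a constant column
    return sxy / sqrt(sxx * syy)


def best_buddy(target: str, signups: Dict[str, List[int]]) -> Optional[str]:
    """Return the other user whose sign-up column correlates most with the target's."""
    best_user, best_r = None, None
    for user, column in signups.items():
        if user == target:
            continue
        r = pearson(signups[target], column)
        if r is not None and (best_r is None or r > best_r):
            best_user, best_r = user, r
    return best_user


# Hypothetical sign-up columns (column A) over eight events.
toy_signups = {
    "U3": [0, 0, 1, 0, 0, 1, 0, 0],
    "U4": [1, 1, 0, 1, 1, 0, 1, 0],
    "U5": [0, 0, 1, 0, 0, 0, 0, 0],
}
```

With this toy data, best_buddy("U5", toy_signups) selects U3, because U5 is positively correlated with U3 and negatively correlated with U4.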
5.5. Model Discovery and Prediction
The models PrU1 .. PrU4 in Table 2 form a basis for
inferring whether a new event should be “up-sold” to the
individuals U1 ... U4. Since the model for U5 bears a high
level of independence as revealed in the Chi-square value,
we will derive a prediction model for U5 based on the
information of U3, the buddy of U5.
The process of deriving a prediction model for U5 is
based on the method described in section 5.2, using the
information from PrU3 and the following constraints that
preserve the significance of the degree of association [3]
exhibited in the patterns (U3:0 U5:0), (U3:1 U5:1):
Pr(AU3:0, AU5:0) = 0.96
Pr(AU3:1, AU5:1) = 0.022
Pr(AU5:0) = 0.978
The model for U5 that maximizes Shannon entropy under
these constraints is given in Table 3. For each of the five
individuals, we then evaluate ArgMaxA[Pr(A|B C)] for each
of the 15 events in the test table, using PrU1 .. PrU4 and
the Table 3 model for U5, and compare the inferred value
instantiation of A with that in the test table.
5.6. Evaluating Preliminary Results
In this preliminary study, we calculate two kinds of
error: false negatives and false positives. A false
negative results if an event in a test table has a value
of 1 for A while the inferred value for A is 0 (based on
ArgMaxA[Pr(A|B C)]). Likewise, a false positive
results if an event in a test table has a value of 0 for A
while the inferred value of A is 1. The following is the
result of this preliminary study:
Table 4. Experimental results
Individuals            U1   U2   U3   U4   U5
# of training events   90   90   90   90   90
# of test events       15   15   15   15   15
False positive         0    0    0    0    0
False negative         1    0    0    0    2
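The two error counts defined in this section can be tallied as in the following sketch. The function name and the example vectors are illustrative; they are not the study's test data.

```python
from typing import Sequence, Tuple


def error_counts(actual: Sequence[int], predicted: Sequence[int]) -> Tuple[int, int]:
    """Return (false positives, false negatives) for paired binary values of A."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    # false positive: the user did not sign up (0) but the model inferred 1
    false_pos = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    # false negative: the user signed up (1) but the model inferred 0
    false_neg = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return false_pos, false_neg
```

For example, error_counts([1, 0, 0, 1, 0], [0, 0, 0, 1, 1]) yields one false positive and one false negative.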
While the results shown in Table 4 seem encouraging,
we cannot draw strong conclusions from them, because
the sign-up rate is relatively low. As a
consequence, the probability distribution of a prediction
model is skewed towards events that are of little interest.
In other words, we cannot be certain whether the low
false positive and false negative error rates are really due
to the effectiveness of the proposed approach or due to the
lack of event instances that can (in)validate it.
Nonetheless, due to the statistical nature of the
proposed approach, we do expect the accuracy of the
system to improve over time as more events take place
and more data become available, assuming the current data
set used for this preliminary study is consistent and
reflects the distribution of the event activities in the
system.
Table 3. U5 new model
AU5 BU3 CU3   Pr(AU5 BU3 CU3)
0   0   0     0.76
0   0   1     0.01
0   1   0     0.01
0   1   1     0
1   0   0     0.208
1   0   1     0.012
1   1   0     0
1   1   1     0
Based on PrU1 .. PrU4 and the model above for U5, we
determine the value instantiation for A based on
ArgMaxA[Pr(A|B C)].
6. Conclusion
In this paper, we employed a multi-faceted approach to
predict a user’s interest, using an information-statistical
approach to expose hidden significant
association patterns. Many optimizations are possible at
different stages of this scheme. Two that should
substantially improve speed are:
1. Cache comparison results of event descriptions and
hobby lists. This will improve runtime speed significantly
because many of the comparisons are repetitive, given a
limited number of hobbies and a limited number of event
categories.
2. Cache the per-user tabulation used for determining
interest. Each time an event expires, its row can simply be
appended to the table, thus limiting the amount of
recomputation.
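The first optimization could be realized with a simple memoization layer, sketched below. The function expensive_compare is a hypothetical stand-in for the directory-based comparison of section 3.2 (here replaced by a trivial substring test so the sketch runs on its own); only functools.lru_cache is a real library facility.

```python
from functools import lru_cache


def expensive_compare(hobby: str, description: str) -> int:
    """Hypothetical stand-in for the slow hobby/description comparison."""
    return int(hobby.lower() in description.lower())


@lru_cache(maxsize=None)
def cached_compare(hobby: str, description: str) -> int:
    """Same result as expensive_compare, computed once per distinct argument pair."""
    return expensive_compare(hobby, description)
```

Repeated calls with the same hobby and description are then answered from the cache, and cached_compare.cache_info() reports the number of hits and misses.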
7. Acknowledgement
This work is supported in part by an NSF DUE CCLI
grant #0088778, and a PSC-CUNY Research Award.
8. References
[1]. Sy B.K., Gupta A.K., “Information-Statistical Data
Mining”, Kluwer, 1st Edition, 2004.
[2]. Google® Web API™, Http://www.google.com/apis/
[3]. Sy B.K., “Probability Model Selection Using
Information-Theoretic Optimization Criterion,” Journal of Statistical
Computing and Simulation, Gordon and Breach
Publishing Group, NJ, 69(3), 2001.
[4]. Grenander U., Elements of Pattern Theory, The Johns
Hopkins University Press, 1996, ISBN 0-8018-5187-4.
[5]. Chen J. and Gupta A.K., “Information Criterion and Change
Point Problem for Regular Models,” Technical Report No. 9805,
Department of Math. and Stat., Bowling Green State U.,
Ohio.
APPENDIX 1
User profile and categories to choose from
Screen-shot of a user’s profile in the online bulletin board
Hobbies
Art/Drawing/Painting
Crafts
Music
Reading
Sports
Games
Outdoors
Photography
Cooking/Wine Tasting
Gardening
Movies
Wireless/RF/RC
Event Categories
Study group
Cultural / Community activities
Seminars
Town Hall/Political/Religious/Community
Film screening/premiere
Field Trip
Exhibit
Poetry
Sample data illustrating the information available to the
system
Event category table
Category ID   Event ID   Subject                      Date        Time    Location
5             15         WLAN radio project meeting   05-Nov-03   12:00   NSB A207A
10            16         Shakespeare                  03-Nov-03   13:00   NSB A202
Event signup table
Event ID   Signed up User ID
15         2
22         94
22         2
User category table
User ID   Category ID
2         1
2         2
2         3
User hobby table
User ID   Hobby ID
2         2
104       1
2         8