GeoWord - College of Computing

• Fahad Al-Emam
• Bachelors of CSE from MSU (04)
• Masters student in the College of
Computing specializing in Software
Engineering
• Graduating this Fall!
• Humans perceive a location in verbal form,
such as an address, a landmark, or another well-known
term.
• Most user interfaces of location-based services
provide input methods based on those verbal
forms because they are the most intuitive and
the easiest for human users.
• This verbal form is referred to as geo-word.
• The paper focuses on using geo-words in
association rule mining.
• Association rule mining finds all the rules
existing in the database that satisfy a minimum
support and confidence constraint.
• Let I = { i1, i2, i3, i4, …, in } be a set of items
Let T = { t1, t2, …, tm } be a set of transactions
A rule is defined as an implication of the form
X → Y where X, Y ⊂ I and X ∩ Y = ∅
• Support of an itemset is the proportion of
transactions that contain it.
• Confidence of a rule X → Y is sup(X ∪ Y) / sup(X)
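The two definitions above can be sketched in a few lines of Python. This is only an illustration: the function names and the toy transactions are made up here, and the support of a rule X → Y is read as the support of the combined itemset X ∪ Y.

```python
def support(itemset, transactions):
    """Proportion of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(x, y, transactions):
    """Confidence of the rule X -> Y: sup(X ∪ Y) / sup(X)."""
    x, y = frozenset(x), frozenset(y)
    return support(x | y, transactions) / support(x, transactions)


# Toy transactions, invented for illustration only.
transactions = [
    frozenset({"Shibuya", "hotel", "parking"}),
    frozenset({"Shibuya", "hotel"}),
    frozenset({"Shibuya", "ramen"}),
    frozenset({"Kyoto", "hotel"}),
]

print(support({"Shibuya"}, transactions))                # 0.75
print(confidence({"Shibuya"}, {"hotel"}, transactions))  # about 0.667
```

A rule is kept only if both values reach the chosen minimum support and minimum confidence. The sketch assumes X occurs at least once; otherwise the division in `confidence` fails.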
• Geo-word: a proper noun that represents a
location-related, human-understandable concept.
An address is a typical example of a geo-word, as
is the name of a landmark.
• A common noun such as "mountain" is not
considered a geo-word. Neither is an IP address or
a latitude/longitude pair.
• A geo-word tuple is a set of items that contains
at least one geo-word:
TG = ( g, ik, …, in )    (1)
where each i can be a keyword, an area status, or
even a temperature.
• A geo-word centric association rule:
RG : Xg → Yg
• Geo-word association rule:
The rule explicitly contains at least one geo-word, either on the
left side X or on the right side Y of the rule, as formulated
(2)
• Co-location association rule:
The rule does not contain a geo-word, but it is generated from geo-word
tuples and the geo-word becomes a constraint on the rule. All of the
items considered in the rule come from geo-word tuples of the same
geo-word. In other words, when the geo-word tuples satisfy (1), then
the itemsets are related to a specific geo-word as formulated
(3)
• Overall process of geo-word centric association
rule mining: geo-word specific steps
include geo-word cleaning, geo-word scaling, and
geo-word specific interestingness measures. The
association rule mining itself can use any
available classic mining algorithm.
• There are three types of geo-word tuple processing, which result in
intermediate data representations that become itemsets for
association rule mining:
• Geo-word tuple: This basic methodology only considers the items in
the geo-word tuples. This processing method generates geo-word
association rules whose support is the number of tuples that contain
the itemset.
• Session: When tuples contain a user ID and a timestamp, they can be
grouped into sessions based on the unique users and a specified
time interval. This processing method also generates geo-word
association rules, but the support of a rule is the number of sessions
instead of the number of tuples.
• Co-location tuple: A co-location tuple contains only the items that
co-occur with a geo-word in the tuple. Since co-location tuples do not
necessarily contain any geo-word, the mining results can be regarded
as co-location association rules.
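The three processing types above can be sketched as follows. Everything here is an assumption made for illustration: the `Request` record layout, the field names, the sample log, and the reading of "a specified time interval" as a maximum gap between consecutive requests of the same user. The slides do not say how the paper implements these steps.

```python
from collections import defaultdict
from typing import NamedTuple


class Request(NamedTuple):
    """Hypothetical log record; the field names are illustrative."""
    user_id: str
    timestamp: int      # seconds since some reference point
    geo_word: str
    freewords: tuple    # free words entered with the geo-word


def geo_word_tuples(requests):
    """One itemset per request: the geo-word plus its free words."""
    return [frozenset((r.geo_word,) + r.freewords) for r in requests]


def sessions(requests, gap):
    """Merge a user's requests into one itemset while consecutive
    requests are at most `gap` seconds apart."""
    by_user = defaultdict(list)
    for r in sorted(requests, key=lambda r: (r.user_id, r.timestamp)):
        by_user[r.user_id].append(r)
    out = []
    for reqs in by_user.values():
        current, last = set(), None
        for r in reqs:
            if last is not None and r.timestamp - last > gap:
                out.append(frozenset(current))
                current = set()
            current.update((r.geo_word,) + r.freewords)
            last = r.timestamp
        out.append(frozenset(current))
    return out


def co_location_tuples(requests):
    """Group free-word itemsets by geo-word; the geo-word itself is
    dropped from the itemsets and kept only as the grouping key."""
    by_geo = defaultdict(list)
    for r in requests:
        by_geo[r.geo_word].append(frozenset(r.freewords))
    return dict(by_geo)


# Invented sample log.
log = [
    Request("u1", 0, "Shibuya", ("hotel",)),
    Request("u1", 60, "Shibuya", ("parking",)),
    Request("u1", 4000, "Kyoto", ("temple",)),
    Request("u2", 30, "Shibuya", ("hotel", "spa")),
]

print(len(geo_word_tuples(log)))        # 4 itemsets, one per request
print(len(sessions(log, gap=1800)))     # 3 sessions
print(co_location_tuples(log)["Shibuya"])
```

Each function returns itemsets that could be handed to any classic association rule miner. Only the unit that is counted for support differs: requests, sessions, or the tuples of one geo-word.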
• Here is an example of the processing performed on an access log
• Geo-word tuples are grouped
• Sessions are grouped by time and user ID
• Co-location tuples are grouped by the geo-word that relates to them
• Target Log Data: The access log data used for mining comes from an online
commercial location-based search service (Japanese yellow pages).
Each line of the log is called a request and includes a geo-word, free-words, and a timestamp. The data was collected over 3 months.
• Data Preprocessing: Each log request is reduced to only the
items required for mining. After abnormal requests are removed,
the log data is divided into sessions.
• Statistics of the Target Data: The original raw data consisted of 500
million accesses and was 150 gigabytes in size. After
preprocessing, it was reduced to 14 million accesses and 800
megabytes of data.
• Association rules: Generated from 14 million requests using the
method described on the Processing slide. The minimum support for the
rules is 0.000002. Each rule includes geo-words and free-words.
• Observation: In some cases, too many rules were generated
for a given word. For example, our generated geo-word association
rules include more than 1000 kinds of rules with “hotel” on the
left-hand side. Therefore, we need a metric that can
identify useful and interesting rules; one known metric is called “lift”.
Lift measures how interesting or unique a rule
is, and the lift of a rule R : X → Y is sup(X ∪ Y) / (sup(X) · sup(Y))
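As a hedged illustration of the lift metric, the sketch below computes it on a few invented transactions. The helper names and data are assumptions, and the support of the rule is taken as the support of X ∪ Y.

```python
def support(itemset, transactions):
    """Proportion of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def lift(x, y, transactions):
    """Lift of X -> Y: sup(X ∪ Y) / (sup(X) * sup(Y))."""
    x, y = frozenset(x), frozenset(y)
    return support(x | y, transactions) / (
        support(x, transactions) * support(y, transactions)
    )


# Invented transactions for illustration.
transactions = [
    frozenset({"Shibuya", "hotel"}),
    frozenset({"Shibuya", "hotel"}),
    frozenset({"Shibuya", "ramen"}),
    frozenset({"Kyoto", "hotel"}),
    frozenset({"Kyoto", "temple"}),
    frozenset({"Kyoto", "temple"}),
]

print(lift({"Kyoto"}, {"temple"}, transactions))    # about 2.0
print(lift({"Shibuya"}, {"hotel"}, transactions))   # about 1.33
```

A lift above 1 means X and Y appear together more often than they would if they were independent, so ranking rules by lift is one way to thin out a long list such as the "hotel" rules.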
• The geo-word for a requested word can have different significance
based on the notions of “local specialty” and “local commodity”. If a
user requests the word “super market”, he/she may wish to know
how to find one in the vicinity, since one can be found in many places.
However, if the word is less common, such as “kabuki theater”,
he/she may want to know the location regardless of its
proximity.
• Entropy: A metric that can identify the degree of local specialty of a
given word. If we regard a word w as an information source over g ∈ G,
then entropy H(w) = −Σ N(g,w) log(N(g,w) / N(w)), summed over g ∈ G
• When a free-word w is requested, if the accompanying geo-word g is
always the same, then H(w) = 0. If g differs every time, then
H(w) is at its maximum
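The two extreme cases can be checked with a small sketch. It assumes the standard normalized Shannon form with p(g|w) = N(g,w) / N(w) and a base-2 logarithm, where N(g,w) is taken to be the number of requests pairing w with g and N(w) the number of requests for w. The slide's expression weights each term by N(g,w) instead, which is N(w) times this value, so both forms agree on which words score zero and which score highest. The function name and sample geo-words are invented.

```python
import math
from collections import Counter


def entropy(geo_words):
    """Shannon entropy (in bits) of the geo-words seen with one free-word.

    `geo_words` lists the geo-word of every request for that free-word.
    """
    counts = Counter(geo_words)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# The geo-word is always the same: entropy is zero.
print(entropy(["Ginza", "Ginza", "Ginza", "Ginza"]) == 0)   # True

# The geo-word differs every time: entropy is log2(4) = 2 bits.
print(entropy(["Ginza", "Shibuya", "Ueno", "Kyoto"]))        # 2.0
```

Under this reading, a low value suggests a local specialty (a word tied to one place) and a high value suggests a local commodity (a word requested with many different places).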
• Paper was extremely difficult to read.
• Discussion needs more popular examples.
• The paper didn’t mention any actual mining
algorithms used in the research or in the
experiment.
• All citations were completed by the end of
the 3rd page!
• Math presented without discussion.
Q & A