• Fahad Al-Emam
• Bachelor's in CSE from MSU (04)
• Master's student in the College of Computing, specializing in Software Engineering
• Graduating this Fall!
• Humans perceive a location in verbal form, such as an address, a landmark, or another well-known term.
• Most user interfaces of location-based services provide input methods based on those verbal forms because they are the most intuitive and the easiest for human users.
• This verbal form is referred to as a geo-word.
• The main focus of the paper is the use of geo-words in association rule mining.
• Association rule mining finds all the rules in the database that satisfy minimum support and confidence constraints.
• Let $I = \{i_1, i_2, \dots, i_n\}$ be a set of items and let $T = \{t_1, t_2, \dots, t_m\}$ be a set of transactions. A rule is defined as an implication of the form $X \rightarrow Y$, where $X, Y \subset I$ and $X \cap Y = \emptyset$.
• The support of an itemset is the proportion of transactions that contain it.
• The confidence of a rule $X \rightarrow Y$ is $\mathrm{sup}(X \cup Y) / \mathrm{sup}(X)$.
• Geo-word: a proper noun that represents a location-related, human-understandable concept. An address is a typical example of a geo-word, as is the name of a landmark.
• A common noun such as "mountain" is not considered a geo-word. Neither is an IP address or a latitude/longitude pair.
• A geo-word tuple is a set of items that contains at least one geo-word.
• $T_G = (g, i_k, \dots, i_n)$, where each $i$ can be a keyword, an area status, or even a temperature. (1)
• A geo-word centric association rule: $R_G : X_g \rightarrow Y_g$
• Geo-word association rule: The rule explicitly contains at least one geo-word, on either the left side $X$ or the right side $Y$, as formulated in (2).
• Co-location association rule: The rule does not contain a geo-word, but it is generated from geo-word tuples, and the geo-word becomes a constraint on the rule. All of the items considered in the rule come from geo-word tuples of the same geo-word.
In other words, when the geo-word tuples satisfy (1), the itemsets are related to a specific geo-word, as formulated in (3).
• Overall process of geo-word centric association rule mining. Geo-word specific steps include geo-word cleaning, geo-word scaling, and geo-word specific interestingness measures. The association rule mining itself can use any available classic mining algorithm.
• There are three types of geo-word tuple processing, each producing an intermediate data representation that becomes the itemsets for association rule mining:
• Geo-word tuple: This basic method considers only the items in the geo-word tuples. It generates geo-word association rules whose support is counted as the number of tuples containing the itemset.
• Session: When tuples contain a user ID and a timestamp, they can be grouped into sessions based on the unique users and a specified time interval. This method also generates geo-word association rules, but the support of a rule is the number of sessions instead of the number of tuples.
• Co-location tuple: A co-location tuple contains only the items that co-occur with a geo-word in the tuple. Since co-location tuples do not necessarily contain any geo-word, the mining results can be regarded as co-location association rules.
• Here is an example of the processing performed on an access log:
• Geo-word tuples are grouped.
• Sessions are grouped by time and user ID.
• Co-location tuples are grouped by the geo-word they relate to.
• Target Log Data: The access log data used for mining comes from an online commercial location-based search service (Japanese yellow pages). Each line of the log is called a request and includes a geo-word, free-words, and a timestamp. The data was collected over 3 months.
• Data Preprocessing: Each request in the log data is abstracted by keeping only the items required for mining. After abnormal requests are removed, the log data is divided into sessions.
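The session and co-location processing described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the request layout `(user_id, timestamp, geo_word, free_words)`, the function names, the 30-minute session gap, and the sample place names are all assumptions made for the example.

```python
from collections import defaultdict


def build_sessions(requests, gap_seconds=1800):
    """Group requests into per-user sessions (gap_seconds is an assumed cutoff).

    Each request is a hypothetical tuple: (user_id, timestamp, geo_word, free_words).
    A new session starts when two consecutive requests from the same user
    are more than gap_seconds apart. Each session is returned as a set of items.
    """
    by_user = defaultdict(list)
    for user_id, ts, geo_word, free_words in requests:
        by_user[user_id].append((ts, geo_word, free_words))

    sessions = []
    for user_id in sorted(by_user):
        current, last_ts = set(), None
        for ts, geo_word, free_words in sorted(by_user[user_id], key=lambda r: r[0]):
            if last_ts is not None and ts - last_ts > gap_seconds:
                sessions.append(current)
                current = set()
            current.add(("geo", geo_word))
            current.update(("word", w) for w in free_words)
            last_ts = ts
        if current:
            sessions.append(current)
    return sessions


def co_location_tuples(requests):
    """Group free-words by the geo-word they were requested with.

    The geo-word is used only as the grouping key and is dropped from
    the tuples themselves, so rules mined from them contain no geo-word.
    """
    grouped = defaultdict(list)
    for _user_id, _ts, geo_word, free_words in requests:
        if free_words:
            grouped[geo_word].append(frozenset(free_words))
    return dict(grouped)


# Made-up sample requests for illustration only.
sample_requests = [
    ("u1", 0, "Shibuya", ["hotel"]),
    ("u1", 600, "Shibuya", ["ramen"]),
    ("u1", 10000, "Ginza", ["kabuki theater"]),
    ("u2", 100, "Shibuya", ["hotel", "parking"]),
]
sample_sessions = build_sessions(sample_requests)
sample_co_location = co_location_tuples(sample_requests)
```

With sessions as the unit, support would be counted over `sample_sessions`; with co-location tuples, mining would run separately on each list in `sample_co_location`.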
• Statistics of the Target Data: The original raw data consisted of 500 million accesses and was 150 gigabytes in size. After preprocessing, it was reduced to 14 million accesses and 800 megabytes of data.
• Association rules: Generated from the 14 million requests using the method described on the Processing slide. The minimum support for the rules is 0.000002. Each rule includes geo-words and free-words.
• Observation: In some cases, too many rules were generated for a given word. For example, the generated geo-word association rules include more than 1000 distinct rules with “hotel” on the left-hand side. Therefore, we need a metric that can identify useful and interesting rules; one known metric is called “lift”. Lift measures how interesting or unique a rule is, and the lift of a rule $R : X \rightarrow Y$ is $\mathrm{sup}(X \cup Y) / (\mathrm{sup}(X)\,\mathrm{sup}(Y))$.
• The geo-word for a requested word can have different significance based on the notions of “local specialty” and “local commodity”. If a user requests the word “supermarket”, he/she may wish to find one in the vicinity, since supermarkets can be found in many places. However, if the word is less common, such as “kabuki theater”, he/she may want to know the location regardless of proximity.
• Entropy: A metric that can identify the degree of local specialty of a given word. If we regard a word $w$ as an information source for $g \in G$, then the entropy is $H(w) = -\sum_{g \in G} \frac{N(g,w)}{N(w)} \log \frac{N(g,w)}{N(w)}$, where $N(g,w)$ is the number of requests in which $w$ is accompanied by $g$ and $N(w)$ is the number of requests containing $w$.
• When a free-word $w$ is requested, if the accompanying geo-word $g$ is always the same, then $H(w) = 0$. If $g$ differs every time, then $H(w)$ is at its maximum.
• The paper was extremely difficult to read.
• The discussion needs more popular examples.
• It didn't mention any actual mining algorithms used in the research or in the experiment.
• By the end of the 3rd page, all citations were completed!
• Math was presented without discussion.
Q & A
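For reference, the support, confidence, lift, and entropy measures from the slides can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's code: transactions are assumed to be Python sets of items, `pairs` is an assumed list of `(geo_word, free_word)` requests, the function names are hypothetical, and the logarithm is taken base 2 because the slides do not specify a base.

```python
import math
from collections import Counter


def support(itemset, transactions):
    """Proportion of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(x, y, transactions):
    """sup(X ∪ Y) / sup(X) for the rule X -> Y."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


def lift(x, y, transactions):
    """sup(X ∪ Y) / (sup(X) * sup(Y)); values above 1 suggest a positive association."""
    joint = support(set(x) | set(y), transactions)
    return joint / (support(x, transactions) * support(y, transactions))


def entropy(word, pairs):
    """Entropy of the geo-words that accompany a free-word (base 2 is an assumption).

    pairs is a hypothetical list of (geo_word, free_word) requests.
    Returns 0 when the word always appears with the same geo-word.
    """
    counts = Counter(g for g, w in pairs if w == word)
    total = sum(counts.values())
    # p * log2(1/p) is equivalent to -p * log2(p) and avoids returning -0.0
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Made-up transactions and requests for illustration only.
sample_transactions = [
    {"Shibuya", "hotel"},
    {"Shibuya", "hotel"},
    {"Shibuya", "ramen"},
    {"Ginza", "kabuki theater"},
]
sample_pairs = [
    ("Ginza", "kabuki theater"),
    ("Ginza", "kabuki theater"),
    ("Shibuya", "supermarket"),
    ("Ginza", "supermarket"),
    ("Ueno", "supermarket"),
    ("Asakusa", "supermarket"),
]
```

In this toy data, a word that is always requested with one geo-word gets an entropy of zero (a "local specialty"), while a word spread evenly over several geo-words gets a higher value (a "local commodity").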