MINING RELATED QUERIES FROM SEARCH ENGINE QUERY LOGS
Xiaodong Shi and Christopher C. Yang
• Overview:
  • Web search engines have become the most popular means of finding information relevant to a topic on the web.
  • However, search engine users often have difficulty organizing and expressing their information needs as simple queries.
  • Finding related queries can help with:
    • Search query suggestion
    • Query expansion
    • Indexing/caching optimization
  • We propose to segment user query sessions into query transactions, within which queries are considered related, and then to find statistically associated queries using a modified association rule mining model.
• Definitions:
  • Query Record: the submission of one query from a user to the search engine at a certain time.
  • Query Transaction: a search process 1) whose search interest focuses on the same topic or strongly related topics, 2) that occupies a bounded and consecutive period, and 3) that is issued by the same user. It is represented as a series of query records in temporal order.
  • User Session: the history of all query records that belong to the same user in a given period. It can also be represented as a series of query records in temporal order.
• Levenshtein Distance Similarity:
  • Search engine users often reformulate their input queries by adding, deleting or changing some words of the original query string.
  • Hence we use Levenshtein distance, a special type of edit distance, to measure the degree of matching between query strings. It defines a set of edit operations, such as insertion or deletion of a word, together with a cost for each operation. The distance between two query strings is then defined as the sum of the costs in the cheapest chain of edit operations transforming one query string into the other.
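A minimal sketch of the word-level edit distance described above. It assumes a unit cost for inserting, deleting, or substituting one word, and whitespace tokenization; the poster does not state the actual costs, and the function name is illustrative only.

```python
def word_edit_distance(q1: str, q2: str) -> int:
    """Word-level Levenshtein distance between two query strings.

    Assumption: every insertion, deletion, or substitution of a word costs 1.
    """
    a, b = q1.split(), q2.split()
    # prev[j] is the distance between the words of `a` seen so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]  # turning i words into an empty query takes i deletions
        for j, word_b in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if word_a == word_b else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


print(word_edit_distance("adobe photoshop", "photoshop"))
```

For the pair “adobe photoshop” and “photoshop”, one word deletion is enough, which matches the distance of 1 quoted in the poster's example.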
  • The Levenshtein Distance Similarity between two query strings is:
    $$\mathrm{similarity}_{\mathrm{Levenshtein}}(q_1, q_2) = 1 - \frac{\mathrm{Levenshtein\_distance}(q_1, q_2)}{\max(wn(q_1), wn(q_2))}$$
    where wn(.) is the number of words (or characters in Chinese) in a query.
  • Example: the Levenshtein distance between “adobe photoshop” and “photoshop” is 1, the longer query has 2 words, and so their Levenshtein Distance Similarity is 1 - 1/2 = 0.5.
• Segmentation Algorithm:
  • Our model is based on the traditional association rule mining model.
  • The quality of segmenting user sessions into query transactions is critical for mining association rules of related queries.
  • A dynamic sliding window segmentation algorithm is proposed, which adopts three time interval constraints:
    • the maximum interval allowed between adjacent query records in the same query transaction (α);
    • the maximum period during which the user is allowed to be inactive (β);
    • the maximum length of the time window that a query transaction is allowed to span (γ), with α ≤ γ ≤ β.
  • It also sets a lower bound θ on the Levenshtein distance similarity between adjacent queries, which is used to decide the boundaries of query transactions.
• Mining Related Queries:
  • Our model is a modified-confidence version of the traditional association rule mining approach in data mining.
  • Given the set of queries Q = {q1, q2, …, qn}, an association rule is redefined as an implication qi ⇒ qk, where qi ∈ Q, qk ∈ Q and i ≠ k.
  • Mining related queries is simplified to finding the statistically strong associations between the input query qi and any other query qk:
    • Support: qi ⇒ qk has a support factor of s if s% of the transactions in the set of query transactions T contain both {qi} and {qk}, notated as qi ⇒ qk | s.
    • Raw Confidence: qi ⇒ qk has a raw confidence factor of rc if rc% of the transactions in T’ contain {qk}, where T’ is the set of all transactions in T that contain {qi}; this is notated as qi ⇒ qk | rc.
    • Confidence: the raw confidence factor is combined with the Levenshtein distance similarity between qi and qk to get the confidence factor, notated as qi ⇒ qk | c:
      $$(q_i \Rightarrow q_k \mid c) = (q_i \Rightarrow q_k \mid rc) \times e^{\mathrm{similarity}_{\mathrm{Levenshtein}}(q_i, q_k)}$$
  • Assuming the input query is qi, we calculate the support factor qi ⇒ qk | s and the confidence factor qi ⇒ qk | c of every hypothesized association rule qi ⇒ qk (qk ∈ Q, i ≠ k).
  • We first apply a threshold min_support to the support factors to filter out weak association rules. Next we rank the remaining rules by their confidence factors. Finally we select the top K rules and extract the related queries.
  Figure: A sample showing how our proposed technique (ARM_LDS) promotes the highly related queries in the ranking list without penalizing other related queries. The numbers in brackets indicate the confidence factors (or the Levenshtein Distance Similarities for LDS).
  Figure: Dynamic Sliding Window Segmentation Algorithm. The complexity of this algorithm is O(n). We empirically set α, β, γ and θ to 5 minutes, 24 hours, 60 minutes and 0.4, respectively, in our experiments.
  Figure: A sample of how to segment a user session into query transactions. The algorithm works much like a decision tree with four decision factors: α, β, γ and θ.
• Experiments:
  • The temporal correlation model, proposed by Chien & Immorlica, is selected as the baseline.
  • Our proposed technique is decomposed into two models, each tested separately against rival models:
    • Dynamic Sliding Window Segmentation Algorithm (DSW SA).
    • Association Rule Mining Model with Levenshtein Distance Similarity (ARM_LDS).
  Figure: The precision rates of our experiment results at different levels of selected top K queries.
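The segmentation step can be sketched as a single pass over one user's session. The poster names the four decision factors (α, β, γ, θ) but its flowchart is not reproduced here, so the decision rule below is only one plausible reading and not the authors' algorithm. It assumes a record stays in the current transaction when the inactivity gap is at most β and either the timing constraints hold (gap ≤ α and window span ≤ γ) or the adjacent queries have similarity ≥ θ. `QueryRecord`, `similarity` and `segment` are hypothetical names, and edit costs are assumed to be 1 per word.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class QueryRecord:
    query: str
    time: datetime


def word_edit_distance(q1: str, q2: str) -> int:
    """Word-level Levenshtein distance, assuming unit edit costs."""
    a, b = q1.split(), q2.split()
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        curr = [i]
        for j, wb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (wa != wb)))
        prev = curr
    return prev[-1]


def similarity(q1: str, q2: str) -> float:
    """1 - distance / max word count; two empty queries count as identical."""
    longest = max(len(q1.split()), len(q2.split()))
    return 1.0 if longest == 0 else 1.0 - word_edit_distance(q1, q2) / longest


def segment(session: List[QueryRecord],
            alpha: timedelta = timedelta(minutes=5),
            beta: timedelta = timedelta(hours=24),
            gamma: timedelta = timedelta(minutes=60),
            theta: float = 0.4) -> List[List[QueryRecord]]:
    """Split a time-ordered session into transactions (assumed decision rule)."""
    transactions: List[List[QueryRecord]] = []
    for rec in session:
        if transactions:
            current = transactions[-1]
            gap = rec.time - current[-1].time   # time since the previous query
            span = rec.time - current[0].time   # window length if rec is added
            within_time = gap <= alpha and span <= gamma
            similar = similarity(current[-1].query, rec.query) >= theta
            if gap <= beta and (within_time or similar):
                current.append(rec)
                continue
        transactions.append([rec])  # open a new transaction
    return transactions


t0 = datetime(2006, 1, 1, 9, 0)
demo = [
    QueryRecord("adobe photoshop", t0),
    QueryRecord("photoshop", t0 + timedelta(minutes=2)),
    QueryRecord("photoshop tutorial", t0 + timedelta(minutes=30)),
    QueryRecord("cheap flights", t0 + timedelta(minutes=40)),
    QueryRecord("cheap flights paris", t0 + timedelta(minutes=42)),
    QueryRecord("cheap flights paris", t0 + timedelta(days=3)),
]
for transaction in segment(demo):
    print([r.query for r in transaction])
```

The default parameter values are the ones the poster reports using (5 minutes, 24 hours, 60 minutes and 0.4). Each record is examined once, so the pass is linear in the number of records when query length is bounded.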
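The mining steps (filter by support, rank by confidence, keep the top K) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: transactions are modelled as sets of query strings, support and raw confidence are fractions rather than percentages, the confidence factor is taken to be the raw confidence multiplied by e raised to the Levenshtein distance similarity, and edit costs are 1 per word. The function names are hypothetical.

```python
import math
from typing import List, Set, Tuple


def word_edit_distance(q1: str, q2: str) -> int:
    """Word-level Levenshtein distance, assuming unit edit costs."""
    a, b = q1.split(), q2.split()
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        curr = [i]
        for j, wb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (wa != wb)))
        prev = curr
    return prev[-1]


def similarity(q1: str, q2: str) -> float:
    """1 - distance / max word count; two empty queries count as identical."""
    longest = max(len(q1.split()), len(q2.split()))
    return 1.0 if longest == 0 else 1.0 - word_edit_distance(q1, q2) / longest


def related_queries(transactions: List[Set[str]], qi: str,
                    min_support: float, top_k: int) -> List[Tuple[str, float]]:
    """Return up to top_k (query, confidence) pairs for rules qi => qk."""
    total = len(transactions)
    with_qi = [t for t in transactions if qi in t]  # T' in the poster's notation
    if total == 0 or not with_qi:
        return []
    # Only queries that co-occur with qi can have non-zero support.
    candidates = set().union(*with_qi) - {qi}
    rules = []
    for qk in candidates:
        both = sum(1 for t in with_qi if qk in t)
        support = both / total
        if support < min_support:
            continue  # weak rule, filtered out
        raw_confidence = both / len(with_qi)
        confidence = raw_confidence * math.exp(similarity(qi, qk))
        rules.append((qk, confidence))
    # Highest confidence first; ties broken alphabetically for determinism.
    rules.sort(key=lambda rule: (-rule[1], rule[0]))
    return rules[:top_k]


demo_transactions = [
    {"photoshop", "adobe photoshop"},
    {"photoshop", "adobe photoshop", "gimp"},
    {"photoshop", "gimp"},
    {"cheap flights"},
]
for query, conf in related_queries(demo_transactions, "photoshop", 0.25, 5):
    print(query, round(conf, 3))
```

In this toy data both candidates have the same support and raw confidence, so the similarity term alone decides the order: the lexically similar reformulation is promoted, while the dissimilar query keeps its raw confidence because the multiplier is never below 1.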