WWW06-poster-query_correlation

MINING RELATED QUERIES FROM SEARCH ENGINE QUERY LOGS
Xiaodong Shi and Christopher C. Yang
• Overview:
• Web search engines have become the most popular way to find information
relevant to a topic on the web.
• However, search engine users often have difficulty organizing and
expressing their information needs as simple queries.
• Finding related queries can help with:
• Search query suggestion
• Query expansion
• Indexing/caching optimization
• We propose to segment user query sessions into query transactions in which
queries are considered related and then to find statistically associated queries
using a modified association rule mining model.
• Definitions:
• Query Record: A query record represents the submission of one query from a
user to the search engine at a certain time.
• Query Transaction: A query transaction is the search process 1) with the
search interest focusing on the same topic or strongly related topics, 2) in a
bounded and consecutive period, and 3) issued by the same user. It is
represented as a series of query records in temporal order.
• User Session: A user session contains the history of all query records that
belong to the same user, in a given period. It can also be represented as a
series of query records in temporal order.
• Levenshtein Distance Similarity:
• Search engine users often reformulate their input queries by adding, deleting or
changing some words of the original query string.
• Hence we use Levenshtein distance, a special type of edit distance, to measure
the degree of matching between query strings. It defines a set of edit
operations, such as insertion or deletion of a word, together with a cost for each
operation. The distance between two query strings then is defined to be the
sum of the costs in the cheapest chain of edit operations transforming one
query string into the other.
• The Levenshtein Distance Similarity between two query strings is:
$$\mathrm{similarity}_{\mathrm{Levenshtein}}(q_1, q_2) = 1 - \frac{\mathrm{Levenshtein\_distance}(q_1, q_2)}{\max(wn(q_1), wn(q_2))}$$
where wn(.) is the number of words (or characters in Chinese) in a query.
Example: the Levenshtein Distance between “adobe photoshop” and “photoshop” is 1 and
their Levenshtein Distance Similarity is 0.5.
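The similarity above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes unit cost for every word insertion, deletion and substitution, whitespace tokenization, and no case normalization, and the function names `word_levenshtein` and `levenshtein_similarity` are hypothetical. Character-level handling for Chinese queries is not covered.

```python
def word_levenshtein(a: str, b: str) -> int:
    """Word-level Levenshtein distance (assumed unit cost per edit)."""
    x, y = a.split(), b.split()
    prev = list(range(len(y) + 1))  # distance from the empty prefix of x
    for i, wx in enumerate(x, start=1):
        curr = [i]
        for j, wy in enumerate(y, start=1):
            cost = 0 if wx == wy else 1
            curr.append(min(prev[j] + 1,           # delete wx
                            curr[j - 1] + 1,       # insert wy
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(q1: str, q2: str) -> float:
    """1 - distance / max(word count), as in the formula above."""
    longest = max(len(q1.split()), len(q2.split()))
    if longest == 0:
        return 1.0  # assumption: two empty queries count as identical
    return 1.0 - word_levenshtein(q1, q2) / longest


if __name__ == "__main__":
    print(word_levenshtein("adobe photoshop", "photoshop"))
    print(levenshtein_similarity("adobe photoshop", "photoshop"))
```

Run on the poster's example, the two calls reproduce the distance of 1 and the similarity of 0.5.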
• Segmentation Algorithm:
• Our model is based on the traditional association rule mining model.
• The quality of segmenting user sessions into query transactions is critical for
mining association rules of related queries.
• We propose a dynamic sliding window segmentation algorithm, which adopts
three time interval constraints:
• the maximum interval allowed between adjacent query records in the
same query transaction (α);
• the maximum period during which the user is allowed to be inactive (β);
• the maximum length of the time window that a query transaction is
allowed to span (γ), with α ≤ γ ≤ β.
• It also sets a lower bound θ on the Levenshtein distance similarity between
adjacent queries to determine the boundaries of query transactions.
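The poster does not spell out the exact decision rules, so the following Python sketch shows only one plausible reading of how α, β, γ and θ could interact in a single O(n) pass. The rules themselves are assumptions: a gap above β or a span above γ starts a new transaction, a gap of at most α keeps the query in the current transaction, and anything in between is decided by the similarity threshold θ. The names `segment_session` and `similarity` are hypothetical, and the defaults are the values reported in the experiments.

```python
from datetime import datetime, timedelta
from typing import Callable, List, Tuple

QueryRecord = Tuple[datetime, str]  # (submission time, query string)


def segment_session(
    records: List[QueryRecord],
    similarity: Callable[[str, str], float],
    alpha: timedelta = timedelta(minutes=5),
    beta: timedelta = timedelta(hours=24),
    gamma: timedelta = timedelta(minutes=60),
    theta: float = 0.4,
) -> List[List[QueryRecord]]:
    """Split one user session (sorted by time) into query transactions."""
    transactions: List[List[QueryRecord]] = []
    current: List[QueryRecord] = []
    for record in records:
        if not current:
            current = [record]
            continue
        time, query = record
        prev_time, prev_query = current[-1]
        gap = time - prev_time       # interval to the previous record
        span = time - current[0][0]  # window length if this record is added
        if gap > beta or span > gamma:
            # Assumed rule: user inactive too long, or window too wide.
            # With gamma <= beta the beta test is implied by the span test;
            # it is kept explicit to mirror the three constraints.
            same = False
        elif gap <= alpha:
            same = True              # assumed rule: close in time
        else:
            same = similarity(prev_query, query) >= theta
        if same:
            current.append(record)
        else:
            transactions.append(current)
            current = [record]
    if current:
        transactions.append(current)
    return transactions
```

Passing the similarity measure in as a function keeps the sketch independent of any particular distance implementation.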
• Mining Related Queries:
• Our model is a modified-confidence version of the traditional approach of
mining association rules in data mining.
• Given the set of queries Q = {q1, q2, …, qn}, the association rule is redefined as
an implication qi ⇒ qk, where qi ∈ Q, qk ∈ Q and i ≠ k.
• Mining related queries reduces to finding the statistically strong
associations between the input query qi and any other query qk:
• Support: qi ⇒ qk has a support factor of s if s% of the transactions in T
contain both {qi} and {qk}, denoted qi ⇒ qk | s.
• Raw Confidence: qi ⇒ qk has a raw confidence factor of rc if rc% of the
transactions in T’ contain {qk}, where T’ is the set of all transactions
in T that contain {qi}; this is denoted qi ⇒ qk | rc.
• Confidence: the raw confidence factor is combined with the Levenshtein
distance similarity between qi and qk to get the confidence factor:
$$(q_i \Rightarrow q_k \mid c) = (q_i \Rightarrow q_k \mid rc) \times e^{\mathrm{similarity}_{\mathrm{Levenshtein}}(q_i, q_k)}$$
Figure: Dynamic Sliding Window Segmentation Algorithm. The complexity of this
algorithm is O(n). We empirically set α, β, γ and θ to 5 minutes, 24 hours,
60 minutes and 0.4, respectively, in our experiments.
Figure: A sample of how to segment a user session into query transactions. The
procedure resembles a decision tree with four decision factors: α, β, γ and θ.
• Assuming the input query is qi, we calculate the support factor qi ⇒ qk | s
and the confidence factor qi ⇒ qk | c of every hypothesized association rule
qi ⇒ qk (qk ∈ Q, i ≠ k).
• We first set a threshold min_support on the support factors to filter out
weak association rules. Next, we rank the remaining rules by their
confidence factors. Finally, we select the top K rules and extract the
related queries.
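The filter-then-rank procedure can be illustrated with a short Python sketch. It rests on several assumptions: each transaction is reduced to the set of its query strings, support and raw confidence are expressed as fractions rather than percentages, and the confidence factor is taken to be the raw confidence multiplied by e raised to the Levenshtein distance similarity. The names `related_queries`, `similarity`, `min_support` and `top_k` are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Iterable, List, Tuple


def related_queries(
    qi: str,
    transactions: Iterable[Iterable[str]],
    similarity: Callable[[str, str], float],
    min_support: float = 0.0,
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Rank queries qk by the confidence of the rule qi => qk."""
    tx = [set(t) for t in transactions]       # T, as sets of query strings
    with_qi = [t for t in tx if qi in t]      # T', transactions containing qi
    if not tx or not with_qi:
        return []
    # Number of transactions in which each qk co-occurs with qi.
    co_counts = Counter(q for t in with_qi for q in t if q != qi)
    ranked = []
    for qk, n in co_counts.items():
        support = n / len(tx)                 # fraction of T with both queries
        if support < min_support:
            continue                          # drop weak association rules
        raw_conf = n / len(with_qi)           # fraction of T' containing qk
        conf = raw_conf * math.exp(similarity(qi, qk))
        ranked.append((qk, conf))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]
```

Because e raised to a similarity of 0 equals 1, a dissimilar query keeps its raw confidence unchanged while a similar one is boosted, which matches the stated goal of promoting highly related queries without penalizing the others.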
Figure: A sample showing how our proposed technique (ARM_LDS) promotes highly
related queries in the ranked list without penalizing other related queries.
The numbers in brackets are the confidence factors (or Levenshtein Distance
Similarities for LDS).
• Experiments:
• The temporal correlation model, proposed by Chien & Immorlica, is
selected as the baseline.
• Our proposed technique is decomposed into two models and tested
separately against rival models:
• Dynamic Sliding Window Segmentation Algorithm (DSW SA).
• Association Rule Mining Model with Levenshtein Distance Similarity
(ARM_LDS).
Precision rates of our experimental results at different levels of selected
top K queries.