Creating Concept Hierarchies in a Customer Self-Help System
Bob Wall
CS 535
04/29/05

Outline
- Introduction / motivation
- Background
- Algorithm
  - Feature selection / feature vector generation
  - Hierarchical agglomerative clustering (HAC)
  - Tree partitioning
- Results / conclusions

Introduction
- Application: a customer self-help (FAQ) system
  - RightNow Technologies' Customer Service module
  - Need ways to organize the Knowledge Base (KB)
  - The system already organizes documents (answers) using clustering
  - It is desirable to also organize user queries
- Goals
  - Create a concept hierarchy from user queries
    - Domain-specific
    - Self-guided (no human intervention or guidance required)
  - Present the hierarchy to help guide users in navigating the KB
  - Demonstrate the types of queries the system can answer
  - Automatically augment searches with related terms

Background
- Problem: clustering short text segments
  - Queries contain too little information to provide context for clustering
  - Some external source of context is needed
- Possible solution: use the Web as the source of context
  - Cilibrasi and Vitanyi proposed a mechanism for extracting the meaning of words using Google searches
  - Chuang and Chien presented a more detailed algorithm for clustering short segments using the text snippets returned by a search engine

Algorithm
- Use each text segment as an input query to a search engine
- Process the resulting text snippets with stemming and stop word lists to extract related terms (keywords)
- Select a set of keywords and build feature vectors
- Cluster using Hierarchical Agglomerative Clustering (HAC)
- Compact the tree using min-max partitioning

KB-Specific Version: HAC-KB
- Choose a set of user queries and the corresponding answers
- Find the list of keywords corresponding to those answers
- Trim the list down to a reasonable length
- Generate feature vectors
- HAC clustering
- Min-max partitioning
- (Each stage is sketched in illustrative code following the Choosing the Best Cut slide)

Available Data
- Answers: the documents forming the KB; actually a question and an answer, plus keywords and other information such as product and category associations
- Ans_phrases: extracted from answers using stop word lists and stemming
  - One-, two-, and three-word phrases
  - Counts of occurrences in different parts of an answer
- Keyword_searches: the list of user queries, also filtered by stop word lists and stemmed
  - The list of answers matching each query

Feature Selection
- Select the N most frequent user queries
- Select the set of all answers matching those queries
- Select the set of all keywords found in those answers
- Reduce to a list of K keywords
  - Avoid removing all keywords associated with a query (that would generate an empty feature vector)
  - Try to eliminate keywords that provide little discrimination (those associated with many queries)
  - Also eliminate keywords that map to only a single query

Feature Vector Generation
- Generate a map from queries to keywords and the inverse map from keywords to queries
- Use the TF-IDF (term frequency / inverse document frequency) metric for weighting:

  $v_{i,j} = \log_2(1 + tf_{i,j}) \cdot \log_2(N / n_j)$

  - $v_{i,j}$ is the weight of the jth keyword for the ith query
  - $tf_{i,j}$ is the number of times keyword j occurred in the list of answers associated with query i
  - $n_j$ is the number of queries associated with keyword j
- The result is an N x K feature matrix

Standard HAC Algorithm
- Initialize clusters: one cluster per query
- Initialize the similarity matrix, using the average-linkage similarity metric with the cosine similarity measure:

  $sim_{AL}(C_i, C_j) = \frac{1}{|C_i| |C_j|} \sum_{v_a \in C_i} \sum_{v_b \in C_j} sim(v_a, v_b)$

  $sim(v_a, v_b) = \frac{\sum_{t_j \in T} v_{a,j} v_{b,j}}{\sqrt{\sum_{t_j \in T} v_{a,j}^2} \sqrt{\sum_{t_j \in T} v_{b,j}^2}}$

- The matrix is upper-triangular

HAC (cont.)
- For N - 1 iterations:
  - Pick the two root-node clusters with the largest similarity
  - Combine them into a new root-node cluster
  - Add the new cluster to the similarity matrix: compute its similarity with all other root-level clusters
- This generates a tall binary tree of clusters
  - 2N - 1 nodes
  - Not particularly usable by humans

Min-Max Partitioning
- Need to combine nodes in the cluster tree to produce a shallow, bushy multi-way tree
- Recursive partitioning algorithm: MinMaxPartition(cluster sub-tree)
  - For each possible cut level in the tree, compute the quality of the cut
  - Choose the best-quality cut level
  - Recursively process each subtree that was cut off
  - Stop at a maximum depth or maximum cluster size

Cut Levels in Tree
(figure: candidate cut levels in the binary cluster tree)

Choosing the Best Cut
- Goal: maximize intra-cluster similarity, minimize inter-cluster similarity
- Quality = Q(C) / N(C)
- Cluster set quality (smaller is better):

  $Q(C) = \frac{1}{|C|} \sum_{C_i \in C} \frac{\sum_{k \neq i} sim_{AL}(C_i, C_k)}{sim_{AL}(C_i, C_i)}$

- Cluster size preference (gamma distribution):

  $N(C) = f(|C|), \quad f(x) = \frac{1}{(\alpha - 1)! \, \beta^{\alpha}} x^{\alpha - 1} e^{-x/\beta}$
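The sketches below walk through the HAC-KB stages in Python. They are illustrative only: every identifier (query_counts, query_to_answers, answer_to_keywords, and the parameter defaults) is a hypothetical stand-in for the KB tables on the Available Data slide, not a name from the actual implementation. First, the feature-selection step:

```python
from collections import Counter

def select_features(query_counts, query_to_answers, answer_to_keywords,
                    n_queries=200, max_keywords=500):
    """Pick the N most frequent queries, gather the keywords of their
    matching answers, then trim the keyword list (hypothetical inputs)."""
    # The N most frequent user queries
    queries = [q for q, _ in Counter(query_counts).most_common(n_queries)]

    # Map each query to the keywords found in its matching answers
    query_kw = {
        q: {kw for a in query_to_answers[q] for kw in answer_to_keywords[a]}
        for q in queries
    }

    # How many queries each keyword is associated with
    kw_df = Counter(kw for kws in query_kw.values() for kw in kws)

    # Drop keywords that map to only a single query
    keywords = {kw for kw, df in kw_df.items() if df > 1}

    # Drop the least discriminating keywords (associated with the most
    # queries) until under budget, but never empty a query's keyword set
    for kw in sorted(keywords, key=lambda k: -kw_df[k]):
        if len(keywords) <= max_keywords:
            break
        if all((kws & keywords) - {kw}
               for kws in query_kw.values() if kw in kws):
            keywords.discard(kw)

    return queries, sorted(keywords), query_kw
```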
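A sketch of feature-vector generation using the weighting formula from the Feature Vector Generation slide; the raw count matrix tf is assumed to have been built from the query-to-keyword maps above:

```python
import math

def build_feature_matrix(tf):
    """Compute the N x K feature matrix from the slide's formula:
    v[i][j] = log2(1 + tf[i][j]) * log2(N / n[j]), where tf[i][j]
    (hypothetical) counts keyword j in the answers matched by query i."""
    N, K = len(tf), len(tf[0])

    # n[j]: the number of queries associated with keyword j
    n = [sum(1 for i in range(N) if tf[i][j] > 0) for j in range(K)]

    V = [[0.0] * K for _ in range(N)]
    for i in range(N):
        for j in range(K):
            if tf[i][j] > 0:  # guarantees n[j] >= 1
                V[i][j] = math.log2(1 + tf[i][j]) * math.log2(N / n[j])
    return V
```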
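A sketch of the standard HAC loop with average-linkage similarity over cosine similarity. For brevity it recomputes pair similarities naively (roughly O(n^3)) instead of maintaining the upper-triangular similarity matrix incrementally as the slides describe:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def sim_al(ci, cj, vectors):
    """Average-linkage similarity: the mean pairwise cosine similarity
    between the members of two cluster nodes."""
    total = sum(cosine(vectors[a], vectors[b]) for a in ci[0] for b in cj[0])
    return total / (len(ci[0]) * len(cj[0]))

def hac(vectors):
    """One cluster per query, then N - 1 greedy merges of the most
    similar pair of roots. A node is (member indices, left, right)."""
    clusters = [([i], None, None) for i in range(len(vectors))]
    while len(clusters) > 1:
        # Pick the two root-node clusters with the largest similarity
        i, j = max(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda p: sim_al(clusters[p[0]], clusters[p[1]], vectors),
        )
        # Combine them into a new root-node cluster
        merged = (clusters[i][0] + clusters[j][0], clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0]  # root of a binary tree with 2N - 1 nodes
```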
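Finally, a sketch of min-max partitioning, reusing sim_al() from the HAC sketch. The cut enumeration here (repeatedly splitting the largest piece) is a simplification of the per-level cuts in the talk, and the gamma parameters alpha and beta are assumed values for illustration, not ones taken from the talk:

```python
import math

def min_max_partition(node, vectors, depth=0, max_depth=3, max_size=10,
                      alpha=3, beta=4):
    """Flatten the binary HAC tree into a shallow multi-way tree,
    returned as nested lists of member indices."""
    members, left, right = node
    if depth >= max_depth or len(members) <= max_size or left is None:
        return members  # keep this subtree as a single cluster

    def cut(k):
        # Undo top-level merges until the tree is cut into (up to) k
        # subtrees, always splitting the largest piece with children
        parts = [node]
        while len(parts) < k:
            splittable = [p for p in parts if p[1] is not None]
            if not splittable:
                break
            big = max(splittable, key=lambda p: len(p[0]))
            parts.remove(big)
            parts += [big[1], big[2]]
        return parts

    def score(parts):
        # Quality = Q(C) / N(C); lower is better
        q = sum(
            sum(sim_al(ci, cj, vectors) for cj in parts if cj is not ci)
            / max(sim_al(ci, ci, vectors), 1e-9)
            for ci in parts
        ) / len(parts)
        # Gamma-distribution preference for the number of clusters
        n_pref = (len(parts) ** (alpha - 1) * math.exp(-len(parts) / beta)
                  / (math.factorial(alpha - 1) * beta ** alpha))
        return q / n_pref

    # Evaluate each candidate cut level and keep the best-scoring one
    best = min((cut(k) for k in range(2, 9)), key=score)
    return [min_max_partition(p, vectors, depth + 1, max_depth,
                              max_size, alpha, beta)
            for p in best]
```

Chaining the sketches, min_max_partition(hac(V), V) turns the feature matrix V from the TF-IDF sketch into a shallow multi-way hierarchy of query indices.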
Issues / Further Work
- Resolve issues with the data / implementation
- Outstanding problem: generating meaningful labels for the clusters in the hierarchy
- Need a means of measuring performance
- Incorporate other KB data, such as the relevance scores of search results and product/category associations
- Better feature selection
- Fuzzy clustering: a query can belong to multiple clusters (Frigui & Nasraoui)

References
S.-L. Chuang and L.-F. Chien, "Towards Automatic Generation of Query Taxonomy: A Hierarchical Query Clustering Approach," Proceedings of ICDM'02, Maebashi City, Japan, Dec. 9-12, 2002, pp. 75-82.
S.-L. Chuang and L.-F. Chien, "A Practical Web-based Approach to Generating Topic Hierarchy for Text Segments," Proceedings of CIKM'04, Washington, DC, Nov. 2004, pp. 127-136.
R. Cilibrasi and P. Vitanyi, "Automatic Meaning Discovery Using Google," published on the Web, available at http://arxiv.org/abs/cs/0412098.
H. Frigui and O. Nasraoui, "Simultaneous Clustering and Dynamic Keyword Weighting for Text Documents," in Survey of Text Mining: Clustering, Classification, and Retrieval, Michael W. Berry, ed., Springer-Verlag, New York, 2004, pp. 45-72.