MASS COLLABORATION AND DATA MINING
Raghu Ramakrishnan
Founder and CTO, QUIQ
Professor, University of Wisconsin-Madison
Keynote Talk, KDD 2001, San Francisco

DATA MINING
Extracting actionable intelligence from large datasets
• Is it a creative process requiring a unique combination of tools for each application?
• Or is there a set of operations that can be composed using well-understood principles to solve most target problems?
• Or perhaps there is a framework for addressing large classes of problems that allows us to systematically leverage the results of mining?

“MINING” APPLICATION CONTEXT
• Scalability is important.
– But when is a 2x speed-up or scale-up important? When is 10x unimportant?
• What is the appropriate measure, model?
– Recall, precision
– MT for search vs. MT for content conversion
Answers to these questions come from the context of the application.

TALK OUTLINE
• A New Approach to Customer Support
– Mass Collaboration
• Technical challenges
– A framework and infrastructure for P2P knowledge capture and delivery
• Role of data mining
– Confluence of DB, IR, and mining

TYPICAL CUSTOMER SUPPORT
[Diagram: customers reach both a web support knowledge base and a customer support center]

TRADITIONAL KNOWLEDGE MANAGEMENT
[Diagram: experts turn questions into answers in a knowledge base consumed by consumers]
Knowledge created and structured by trained experts using a rigorous process.
MASS COLLABORATION
[Diagram: a question flows to mass collaboration (experts, partners, customers, employees): people using the web to share knowledge and help each other find solutions; the answer is added to a self-service knowledge base to power self-service]

TIMELY ANSWERS
77% of answers are provided within 24h
• Of 6,845 questions, 74% were answered; of those answers:
– 40% (2,057) provided within 3h
– 65% (3,247) provided within 12h
– 77% (3,862) provided within 24h
– 86% (4,328) provided within 48h
• No effort to answer each question
• No added experts
• No monetary incentives for enthusiasts

MASS CONTRIBUTION
Users who on average provide only 2 answers provide 50% of all answers
• Of 6,718 answers, the top 7% of contributing users (120) provide 50%
• The remaining 93% of contributing users (1,503) provide the other 50% (3,329)

POWER OF KNOWLEDGE CREATION
[Chart: two “support shields” in front of agent cases]
• Shield 1, self-service: deflects 85% of support incidents
• Shield 2, knowledge creation via customer mass collaboration: deflects a further 64%
• Only about 5% of support incidents become agent cases
*) Averages from QUIQ implementations

TYPICAL SERVICE CHAIN
• 40%: FAQ / self-service knowledge base
• 50%: auto email / manual email ($)
• 10%: call center / chat ($$), 2nd-tier support ($$$)

QUIQ SERVICE CHAIN
• 80%: QUIQ mass collaboration / QUIQ self-service
• 15%: manual email ($)
• 5%: chat / call center ($$), 2nd-tier support ($$$)

CASE STUDIES: COMPAQ
“In newsgroups, conversations disappear and you have to ask the same question over and over again. The thing that makes the real difference is the ability for customers to collaborate and have information be persistent. That’s how we found QUIQ.
It’s exactly the philosophy we’re looking for.”
“Tech support people can’t keep up with generating content and are not experts on how to effectively utilize the product … Mass Collaboration is the next step in Customer Service.”
– Steve Young, VP of Customer Care, Compaq

ASP 2001 “TOP TEN SUPPORT SITE”
“Austin-based National Instruments deployed … a Network to capture the specialized knowledge of its clients and take the burden off its costly support engineers, and is pleased with the results. QUIQ increased customers’ participation, flattened call volume and continues to do the work of 50 support engineers.”
– David Daniels, Jupiter Media Metrix

MASS COLLABORATION
Internet-scale P2P knowledge sharing + service workflows + knowledge management
[Diagram: communities (many experts) and support newsgroups (few experts) are interaction-oriented; call-center support and knowledge bases are solution-oriented; mass collaboration combines all four quadrants]

CORPORATE MEMORY
Untapped knowledge in the extended business community
[Diagram: customers, partners, suppliers, and employees all feed the knowledge base]

STRUCTURED USER FORUM
[Diagram: a self-organizing forum supporting user-to-user, user-to-enthusiast, and user-to-expert exchange; surrounding concerns include incentive to participate, user acquisition, and areas of interest]

GOALS & ISSUES
• Interactions must be structured to encourage creation of “solutions”
– Resolve issues; escalate if necessary
– Capture knowledge from interactions
– Encourage participation
• Sociology
– Privacy, security
– Credibility, authority, history
– Accountability, incentives

REQUIRED CAPABILITIES
• Roles: credibility, administration
– Moderators, experts, editors, enthusiasts
• Groups: privacy, security, entitlements
– Departments, gold customers
• Workflow: QoS, validation, escalation

TECHNICAL CHALLENGES
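The role, group, and entitlement capabilities above can be sketched as a minimal data model. This is an illustrative assumption, not QUIQ's implementation: the names (`User`, `Item`, `can_view`) and the specific rule (editors and moderators see everything; others see only approved items in their visibility groups) are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class User:
    name: str
    roles: set = field(default_factory=set)    # e.g. {"editor", "enthusiast"}
    groups: set = field(default_factory=set)   # e.g. {"gold_customers"}

@dataclass
class Item:
    subject: str
    approved: bool = False
    visibility_groups: set = field(default_factory=set)  # empty set = public

def can_view(user: User, item: Item) -> bool:
    """Hypothetical entitlement rule: editors/moderators see everything;
    everyone else sees only approved items whose visibility group
    (if any) they belong to."""
    if user.roles & {"editor", "moderator"}:
        return True
    if not item.approved:
        return False
    return not item.visibility_groups or bool(user.groups & item.visibility_groups)
```

A real system would layer service levels and per-role administration on top of such a check, but the core idea is the same: every search and notification is filtered through entitlements.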
SEARCHING “PEOPLE-BASES”
[Diagram: a query goes to search; if no answer exists, routing and notification bring it to someone who can answer]
“If it’s not there, find someone who knows” - and get “it” there (knowledge creation)!

QUIQ, THE “BEST IN CLASS” SUPPORT CHANNEL
[Chart: share of support incidents that become agent cases under successive support channels]
• Email support and call center alone: 100% of support incidents become agent cases
• Automated emails: -20% 1)
• Web self-service: -42% 2)
• QUIQ self-service: -85%
• Knowledge creation via customer mass collaboration: a further -64%, leaving about 5% agent cases
1) Source: QUIQ Client Information
2) Source: Association of Support Professionals

SEARCH AND INDEXING
• User types in “How can I configure the IP address on my Presario?”
– Need to find the most relevant content that is of high quality, is approved for external viewing, and that this user is entitled to see based on her roles, groups, and service levels.
• User decides to post the question because no good answer was found in the KB.
– Search controls when experts and other users will see this new question; need to make this real-time.
– Concurrency, recovery issues!

SEARCH AND INDEXING
• Data is organized into tabular channels
– Questions, responses, users, …
• Each item has several fields, e.g., for a question:
– Author id, author status, service level, item popularity metrics, rating metrics, answer status, approval status, visibility group, update timestamp, notification timestamp, usage signature, category, relevant products, relevant problems, subject, body, responses
Which 5 items should be returned?
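The “find someone who knows” idea above (route a new question to likely answerers) can be sketched naively from the fields already listed, such as category. The scoring rule here, counting a user's past answers in the question's categories, is an assumption for illustration, not QUIQ's routing algorithm.

```python
from collections import Counter

def route_question(question_categories, answer_log, k=3):
    """Rank potential answerers by how many of their past answers
    fall in the question's categories (a naive expertise proxy).
    answer_log is a list of (user, category) pairs."""
    scores = Counter()
    for user, category in answer_log:
        if category in question_categories:
            scores[user] += 1
    return [user for user, _ in scores.most_common(k)]
```

In practice the routing signal would also weight answer ratings, recency, and the user's notification preferences, but even this crude count already supports the people-base metaphor: the index covers people as well as documents.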
RUNTIME ARCHITECTURE
[Diagram: web servers and email feed real-time indexing, caching, and alerts via a Hive Manager; the indexer and alert components sit over files/logs, a warehouse, and a DBMS on RAID storage]

LEARNING FROM ACTIVITY: DATA TO KNOWLEDGE
[Diagram: a periodic offline miner performs large reads/writes against files/logs, the warehouse, and the DBMS, and feeds the indexer, which needs only small reads]

SEARCH AND INDEXING
Which 5 items should be returned?
• Question text, user attributes, system policies
• IR-style ranked output
• Search constraints:
– Show matches; subject match twice as important
– Show only approved answers to non-editors
– Give preference to category Laptop
– Give preference to recent solutions
– Weight quality of solution

VECTOR SPACE MODEL
• Documents and queries are vectors in term space
• Vector distance from the query is used to rank retrieved documents
Q1 = (w11, w12, …, w1t)
D2 = (w21, w22, …, w2t)
sim(Q1, D2) = sum_{i=1..t} w1i * w2i   (unnormalized)
The i’th term in the summation can be seen as the “relevance contribution” of term i.

TF-IDF DOCUMENT VECTOR
wik = tfik * log(N / nk)
where:
– Tk = term k in document Di
– tfik = frequency of term Tk in document Di
– N = total number of documents in the collection C
– nk = number of documents in C that contain Tk
– idfk = inverse document frequency of term Tk in C: idfk = log(N / nk)

A HYBRID DB-IR SYSTEM
• Searches are queries with three parts:
– Filter: DB-style yes/no criteria
– Match: TF-IDF relevance based on a combination of fields
– Quality: relevance “boost” based on a policy

A HYBRID DB-IR SYSTEM
• A query is built up from atomic constraints using Boolean operators.
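A minimal, runnable sketch of the TF-IDF weights and the unnormalized dot-product similarity defined above. Documents are pre-tokenized term lists; tokenization, field weighting, and normalization are omitted.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Compute w_ik = tf_ik * log(N / n_k) for each document,
    as in the TF-IDF slide. docs is a list of term lists."""
    N = len(docs)
    df = Counter()                       # n_k: documents containing term k
    for doc in docs:
        df.update(set(doc))
    return [{t: tf * math.log(N / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

def sim(q, d):
    """Unnormalized sim(Q, D) = sum_i w_qi * w_di over shared terms;
    each addend is the 'relevance contribution' of one term."""
    return sum(w * d.get(t, 0.0) for t, w in q.items())
```

Note that a term appearing in every document gets idf = log(1) = 0 and so contributes nothing, which is exactly the discriminative behavior the slide's formula intends.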
• Atomic constraint:
– [ value op term, constraint-type ]
– Terms are drawn from discrete domains and are of two types: hierarchy and scalar
– Constraint-type is exact or approximate

A HYBRID DB-IR SYSTEM
• Applying an atomic constraint to a set of items returns a tagged result set:
– The result inherits the constraint-type
– Each result item has a (TF-IDF) relevance score; 0 for exact
• Combining two tagged item sets using Boolean operators yields a tagged set:
– The result type is exact if both inputs are exact, and approximate otherwise
– The result contains the intersection of the input item sets if either input is exact; the union otherwise
– Each result item is tagged with a combined relevance

A HYBRID DB-IR SYSTEM
• The semantics of Boolean expressions over constraints is associative and commutative
• Evaluating exact constraints and approximate constraints separately (in the DB and IR subsystems) is a special case. Additionally:
– Uniform handling of the relevance contributions of categories, popularity metrics, recency, etc.
• Absolute and relative relevance modifiers can be introduced for greater flexibility.

CONCURRENCY, RECOVERY, PARALLELISM
• Concurrency
– Index is updated in real time
– Automatic partitioning and a two-step locking protocol result in very low overhead
– Relies upon post-processing to address some anomalies
• Recovery
– Partitioning is again the key
– Leverages the recovery guarantees of the DBMS
– The approach also supports efficient refresh of global statistics
• Parallelism
– Hash-based partitioning

NOTIFICATION
• Extension of search: each user can define one or more “standing searches” and request instant or periodic notification.
– Boolean combinations of atomic constraints.
• Major challenges:
– Scaling with the number of standing searches.
• Requires multiple timestamps, indexing the searches.
– Exactly-once delivery property.
• Many subtleties center around the “notifiability” of updates!

ROLE OF DATA MINING

DATA MINING TASKS
• There is a lot of insight to be gained by analyzing the data.
– What will help the user with her problem?
– Who does a given user trust?
– Characteristic metrics for high-quality content.
– Identify helpful content in similar, past queries.
– Summarize content.
– Who can answer this question?

LEVERAGING DATA MINING
• How do we get at the data?
– Relevant information is distributed across several sources, not just the DBMS.
– It is aggregated in a warehouse.
• How do we incorporate the insights obtained by mining into the search phase?
– Need to constantly update info about every piece of content (questions, answers, users, …)

LEVERAGING DATA MINING
• Three-step approach:
– Off-line analysis to gather new insight
– Periodic refresh of the indexes
– Use the insight (from the KB/index) to improve search via the extended DB/IR query framework
Use mining to create useful metadata

SOME UNIQUE TWISTS
• Identify the kinds of feedback that would be helpful in refining a search.
– I.e., not just specific terms, but the types of concepts that would be useful discriminators (e.g., a good hierarchy of feedback concepts)
• Metrics of quality
– Link analysis is a good example, but what are the “links” here?
• Self-tuning searches
– The more the knobs, the more the choices
– Next step: self-personalizing searches?

CONCLUSIONS

CONFLUENCES
[Diagram: SEARCH as a confluence of IR and “?”]
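As a closing sketch, the tagged result-set combination rules from the hybrid DB-IR slides can be written down directly: the result is exact only if both inputs are exact; it keeps the intersection of the item sets if either input is exact, the union otherwise; and each item carries a combined relevance. The slides leave the relevance combiner unspecified, so summing the per-input scores is an assumption made here for illustration.

```python
from dataclasses import dataclass

@dataclass
class Tagged:
    """A tagged result set: constraint-type plus item -> relevance
    (relevance is 0.0 for items matched by an exact constraint)."""
    exact: bool
    items: dict

def combine(a: Tagged, b: Tagged) -> Tagged:
    """Combine two tagged sets per the stated rules. Summing the
    relevances is an assumed combiner, not the system's actual one."""
    if a.exact or b.exact:
        keys = a.items.keys() & b.items.keys()   # either exact: intersect
    else:
        keys = a.items.keys() | b.items.keys()   # both approximate: union
    return Tagged(
        exact=a.exact and b.exact,
        items={k: a.items.get(k, 0.0) + b.items.get(k, 0.0) for k in keys},
    )
```

Because set intersection/union and the assumed additive combiner are themselves associative and commutative, this little model also exhibits the algebraic property the slides claim for Boolean expressions over constraints.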