MASS COLLABORATION AND DATA MINING
Raghu Ramakrishnan
Founder and CTO, QUIQ
Professor, University of Wisconsin-Madison
Keynote Talk, KDD 2001, San Francisco
DATA MINING
Extracting actionable intelligence from large datasets
• Is it a creative process requiring a unique
combination of tools for each application?
• Or is there a set of operations that can be
composed using well-understood principles to
solve most target problems?
• Or perhaps there is a framework for addressing
large classes of problems that allows us to
systematically leverage the results of mining.
University of Wisconsin-Madison
Page 2
“MINING” APPLICATION CONTEXT
• Scalability is important.
– But when is 2x speed-up or scale-up important?
When is 10x unimportant?
• What is the appropriate measure, model?
– Recall, precision
– MT for search vs. MT for content conversion
Answers to these questions
come from the context of the application.
TALK OUTLINE
• A New Approach to Customer Support
– Mass Collaboration
• Technical challenges
– A framework and infrastructure for P2P
knowledge capture and delivery
• Role of data mining
– Confluence of DB, IR, and mining
TYPICAL CUSTOMER SUPPORT
Web
Support KB
Customer
Support
Center
TRADITIONAL KNOWLEDGE MANAGEMENT
[Figure labels: Question, Answer, Knowledge Base, Experts, Consumers]
Knowledge created and structured by trained experts using a rigorous process.
MASS COLLABORATION
People using the web to share knowledge and help each other find solutions.
[Figure labels: Question; Self-Service Knowledge Base; Answer; Mass Collaboration (experts, partners, customers, employees)]
Answer added to power self service.
TIMELY ANSWERS
77% of answers are provided within 24h
• No effort to answer each question
• No added experts
• No monetary incentives for enthusiasts
[Chart: 6,845 questions; 74% answered]
• Answers provided in 3h: 40% (2,057)
• Answers provided in 12h: 65% (3,247)
• Answers provided in 24h: 77% (3,862)
• Answers provided in 48h: 86% (4,328)
MASS CONTRIBUTION
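Users who on average provide only 2 answers provide 50% of all answers.
[Chart: share of contributing users vs. share of answers]
• All answers: 100% (6,718)
• Contributed by the mass of users: 50% (3,329) of answers, from 93% (1,503) of contributing users
• Top users: 7% (120) of contributing users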
POWER OF KNOWLEDGE CREATION
[Figure: support incidents pass through two "shields" before becoming agent cases]
• Shield 1, Self-Service *): -85%
• Shield 2, Customer Mass Collaboration (knowledge creation) *): -64%
• Agent cases: 5% of support incidents
*) Averages from QUIQ implementations
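TYPICAL SERVICE CHAIN
[Figure: support channels grouped by cost tier; percentage labels on the slide: 40%, 50%, 10%]
• $: FAQ, self-service knowledge base, auto email, manual email
• $$: Call center, chat
• $$$: 2nd-tier support
QUIQ SERVICE CHAIN
[Figure: same layout; percentage labels on the slide: 80%, 15%, 5%]
• $: QUIQ Mass Collaboration, QUIQ Self Service
• $$: Manual email, chat
• $$$: Call center, 2nd-tier support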
CASE STUDIES: COMPAQ
“In newsgroups, conversations disappear and you
have to ask the same question over and over again.
The thing that makes the real difference is the ability
for customers to collaborate and have information be
persistent. That’s how we found QUIQ. It’s exactly the
philosophy we’re looking for.”
“Tech support people can’t keep up with generating
content and are not experts on how to effectively
utilize the product … Mass Collaboration is the next
step in Customer Service.”
– Steve Young, VP of Customer Care, Compaq
ASP 2001 “Top Ten Support Site”
“Austin-based National Instruments deployed … a
Network to capture the specialized knowledge of
its clients and take the burden off its costly
support engineers, and is pleased with the
results. QUIQ increased customers’ participation,
flattened call volume and continues to do the
work of 50 support engineers.”
– David Daniels, Jupiter Media Metrix
MASS COLLABORATION
Internet-scale P2P knowledge sharing
[Figure: quadrant diagram. Axis labels: Few Experts, Many Experts (Communities), Interactions, Solutions. Quadrant labels: Call Center, Support Knowledge Base, Support Newsgroups, Mass Collaboration. Arrow labels: "+ Service Workflows", "+ Knowledge Management".]
CORPORATE MEMORY
Untapped Knowledge in Extended Business Community
Customers
Partners
Suppliers
Knowledgebase
Employees
User-to-User
Exchange
Structured
User Forum
Self-Organizing
User-to-Enthusiast
User-to-Expert
Incentive to
Participate
User
Acquisition
Areas of
Interest
GOALS & ISSUES
• Interactions must be structured to
encourage creation of “solutions”
– Resolve issue; escalate if necessary
– Capture knowledge from interactions
– Encourage participation
• Sociology
– Privacy, security
– Credibility, authority, history
– Accountability, incentives
REQUIRED CAPABILITIES
• Roles: Credibility, administration
– Moderators, experts, editors, enthusiasts
• Groups: Privacy, security, entitlements
– Departments, gold customers
• Workflow: QoS, validation, escalation
TECHNICAL CHALLENGES
SEARCHING “PEOPLE-BASES”
ROUTING,
NOTIFICATION
?
SEARCH
“If it’s not there, find someone who knows”
- And get “it” there (knowledge creation)!
QUIQ, the “Best in Class” Support Channel
[Figure: share of support incidents that become agent cases, by support channel; values as labeled on the slide]
• Email support, call center: 100%
• Automated emails 1): -20%, 80%
• Web self-service 2): -42%, 68%
• Mass collaboration: self-service -85%, knowledge creation (customer mass collaboration) -64%, 5%
1) Source: QUIQ Client Information
2) Source: Association of Support Professionals
SEARCH AND INDEXING
• User types in “How can I configure the IP
address on my Presario?”
– Need to find most relevant content that is of high
quality and is approved for external viewing, and
that this user is entitled to see based on her roles,
groups, and service levels.
• User decides to post question because no
good answer was found in the KB.
– Search controls when experts and other users will
see this new question; need to make this real-time.
– Concurrency, recovery issues!
SEARCH AND INDEXING
• Data is organized into tabular channels
– Questions, responses, users, …
• Each item has several fields, e.g., a question:
– Author id, author status, service level, item
popularity metrics, rating metrics, answer status,
approval status, visibility group, update timestamp,
notification timestamp, usage signature, category,
relevant products, relevant problems, subject, body,
responses
Which 5 items should be returned?
RUNTIME ARCHITECTURE
[Figure: architecture diagram. Component labels: Web server (two instances), Cache, Hive Manager, Indexer, Alerts, Email, Files and Logs, Warehouse, DBMS, RAID storage. Annotation: real-time indexing, caching, alerts.]
LEARNING FROM ACTIVITY
DATA TO KNOWLEDGE
[Figure: architecture diagram. Component labels: Miner, Indexer, Files and Logs, Warehouse, DBMS, RAID storage. Annotations: periodic offline activity; large R/W; small reads.]
SEARCH AND INDEXING
Which 5 items should be returned?
• Question text, user attributes, system policies
• IR-style ranked output
• Search constraints:
– Show matches; subject match twice as important
– Show only approved answers to non-editors
– Give preference to category Laptop
– Give preference to recent solutions
– Weight quality of solution
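The five constraints above can be sketched as a small ranking function. This is only an illustration under stated assumptions, not QUIQ's implementation: the `Item` fields, the boost constants, and the use of raw term counts in place of TF-IDF are all hypothetical choices made to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class Item:
    # Hypothetical subset of the per-item fields listed on the earlier slide.
    subject: str
    body: str
    approved: bool
    category: str
    age_days: float
    quality: float  # assumed to be a rating normalized to 0..1


def score(item, query_terms, viewer_is_editor, preferred_category="Laptop"):
    """Return a relevance score, or None if the item must not be shown."""
    # Filter: only editors may see unapproved content.
    if not item.approved and not viewer_is_editor:
        return None
    subject = item.subject.lower().split()
    body = item.body.lower().split()
    # Match: a subject hit counts twice as much as a body hit.
    match = sum(2.0 * subject.count(t) + body.count(t) for t in query_terms)
    if match == 0:
        return None
    # Quality: policy-driven boosts (all constants are illustrative assumptions).
    boost = 1.0
    if item.category == preferred_category:
        boost += 0.5                              # category preference
    boost += 0.5 / (1.0 + item.age_days / 30.0)   # recency preference
    boost += item.quality                         # solution quality
    return match * boost


def top_k(items, query_terms, viewer_is_editor=False, k=5):
    """Return the k highest-scoring visible items, best first."""
    scored = [(score(i, query_terms, viewer_is_editor), i) for i in items]
    scored = [(s, i) for s, i in scored if s is not None]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


a = Item("configure ip address", "steps for presario", True, "Laptop", 10, 0.8)
b = Item("network help", "configure ip on desktop", True, "Desktop", 10, 0.8)
c = Item("configure ip", "draft", False, "Laptop", 1, 0.9)
query = ["configure", "ip"]
for_customer = top_k([a, b, c], query)
for_editor = top_k([a, b, c], query, viewer_is_editor=True)
```

In this toy data, the unapproved item `c` is hidden from the non-editor and the subject match in `a` outranks the body-only match in `b`.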
VECTOR SPACE MODEL
• Documents, queries are vectors in term space
• Vector distance from the query is used to rank
retrieved documents
$Q_1 = (w_{11}, w_{12}, \ldots, w_{1t})$
$D_2 = (w_{21}, w_{22}, \ldots, w_{2t})$
$\mathrm{sim}(Q_1, D_2) = \sum_{i=1}^{t} w_{1i} \, w_{2i}$ (unnormalized)
i’th term in summation can be seen as the
“relevance contribution” of term i
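A minimal sketch of the unnormalized similarity, assuming the query and the document are given as equal-length lists of term weights. The function names `sim` and `contributions` and the sample weights are illustrative, not taken from the talk.

```python
def contributions(query, doc):
    """Per-term relevance contributions w1i * w2i."""
    if len(query) != len(doc):
        raise ValueError("query and document must share the same term space")
    return [q * d for q, d in zip(query, doc)]


def sim(query, doc):
    """Unnormalized similarity: the sum of the per-term contributions."""
    return sum(contributions(query, doc))


q1 = [1.0, 0.0, 2.0]   # hypothetical query weights over three terms
d2 = [0.5, 3.0, 1.0]   # hypothetical document weights over the same terms
per_term = contributions(q1, d2)
total = sim(q1, d2)
```

Here the second term contributes nothing because the query gives it zero weight, even though the document weights it heavily.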
TF-IDF DOCUMENT VECTOR
$w_{ik} = tf_{ik} \cdot \log(N / n_k)$
$T_k$ = term $k$ in document $D_i$
$tf_{ik}$ = frequency of term $T_k$ in document $D_i$
$idf_k$ = inverse document frequency of term $T_k$ in $C$, where $idf_k = \log(N / n_k)$
$N$ = total number of documents in the collection $C$
$n_k$ = the number of documents in $C$ that contain $T_k$
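The weighting above can be sketched in a few lines, assuming documents arrive as pre-tokenized lists of terms and that raw counts serve as the term frequency. The function name `tfidf_vectors` and the sample documents are made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / n_k)."""
    n_docs = len(docs)
    # n_k: number of documents that contain term k at least once.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)  # raw term frequency within this document
        vectors.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return vectors


docs = [
    ["configure", "ip", "address", "presario"],
    ["presario", "battery", "presario"],
    ["configure", "printer"],
]
vecs = tfidf_vectors(docs)
```

A term that occurs in every document gets weight `log(1) = 0`, which is the intended effect: it cannot discriminate between documents.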
A HYBRID DB-IR SYSTEM
• Searches are queries with three parts:
– Filter
• DB-style yes/no criteria
– Match
• TF-IDF relevance based on a combination of fields
– Quality
• Relevance “boost” based on a policy
A HYBRID DB-IR SYSTEM
• A query is built up from atomic constraints
using Boolean operators.
• Atomic constraint:
– [ value op term, constraint-type ]
– Terms are drawn from discrete domains and
are of two types: hierarchy and scalar
– Constraint-type is exact or approximate
A HYBRID DB-IR SYSTEM
• Applying an atomic constraint to a set of items
returns a tagged result set:
– The result inherits the constraint-type
– Each result item has a (TF-IDF) relevance score; 0 for exact
• Combining two tagged item sets using Boolean
operators yields a tagged set:
– The result type is exact if both inputs are exact, and
approximate otherwise
– Result contains intersection of input item sets if either input
is exact; union otherwise
– Each result item is tagged with a combined relevance
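The combination rules above can be sketched as follows. The slide does not say how the combined relevance is computed, so this sketch assumes a simple sum; the `Tagged` class and `combine` function are hypothetical names, and the sketch implements only the three listed rules without trying to reproduce any further properties of the real system.

```python
from dataclasses import dataclass


@dataclass
class Tagged:
    exact: bool    # True for an exact (DB-style) result, False for approximate
    scores: dict   # item id -> relevance score (0 for exact matches)


def combine(a, b):
    """Combine two tagged result sets following the three rules on the slide."""
    # Rule 1: the result is exact only if both inputs are exact.
    exact = a.exact and b.exact
    # Rule 2: intersect if either input is exact; otherwise take the union.
    if a.exact or b.exact:
        ids = a.scores.keys() & b.scores.keys()
    else:
        ids = a.scores.keys() | b.scores.keys()
    # Rule 3: combined relevance (ASSUMPTION: the sum of the input scores).
    scores = {i: a.scores.get(i, 0.0) + b.scores.get(i, 0.0) for i in ids}
    return Tagged(exact, scores)


approved = Tagged(True, {1: 0.0, 2: 0.0, 3: 0.0})   # exact filter
subject = Tagged(False, {2: 1.0, 4: 0.75})          # approximate match
body = Tagged(False, {2: 0.5, 3: 0.25})             # approximate match

match = combine(subject, body)       # union of the two approximate sets
result = combine(approved, match)    # restricted to approved items
```

In this example item 4 matches the subject but is dropped from `result` because it fails the exact "approved" constraint.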
A HYBRID DB-IR SYSTEM
• Semantics of Boolean expressions over
constraints is associative and commutative
• Evaluating exact constraints and approximate
constraints separately (in DB and IR
subsystems) is a special case. Additionally:
– Uniform handling of relevance contributions of
categories, popularity metrics, recency, etc.
• Absolute and relative relevance modifiers can
be introduced for greater flexibility.
CONCURRENCY, RECOVERY, PARALLELISM
• Concurrency
– Index is updated in real-time
– Automatic partitioning, two-step locking protocol result
in very low overhead
– Relies upon post-processing to address some
anomalies
• Recovery
– Partitioning is again the key
– Leverages recovery guarantees of DBMS
– Approach also supports efficient refresh of global
statistics
• Parallelism
– Hash based partitioning
NOTIFICATION
• Extension of search: Each user can define
one or more “standing searches”, and
request instant or periodic notification.
– Boolean combinations of atomic constraints.
• Major challenges:
– Scaling with number of standing searches.
• Requires multiple timestamps, indexing searches.
– Exactly-once delivery property.
• Many subtleties center around “notifiability” of updates!
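A toy sketch of standing searches with a per-search timestamp, to make the exactly-once concern concrete. Everything here is an assumption for illustration: the `StandingSearch` class, the conjunction-of-terms matching, and the single high-water mark are far simpler than the multiple timestamps and indexed searches the slide mentions.

```python
from dataclasses import dataclass


@dataclass
class StandingSearch:
    user: str
    terms: frozenset     # ASSUMPTION: a plain AND over terms stands in for Boolean constraints
    last_seen: int = 0   # newest update timestamp already delivered to this user


def notify(searches, items):
    """items: list of (item_id, update_ts, text). Returns [(user, item_id)] to deliver."""
    deliveries = []
    for search in searches:
        newest = search.last_seen
        for item_id, update_ts, text in items:
            is_new = update_ts > search.last_seen
            if is_new and search.terms <= set(text.lower().split()):
                deliveries.append((search.user, item_id))
                newest = max(newest, update_ts)
        # Advance the mark so a later pass does not deliver the same update again.
        search.last_seen = newest
    return deliveries


searches = [StandingSearch("ana", frozenset({"presario", "ip"}))]
items = [(1, 10, "Configure IP on Presario"), (2, 11, "Printer driver")]
first = notify(searches, items)    # delivers item 1
second = notify(searches, items)   # nothing new, so nothing is delivered
items.append((3, 12, "presario ip reset"))
third = notify(searches, items)    # delivers only the new item 3
```

The sketch also hints at why "notifiability" is subtle: an edited item gets a new timestamp and would be delivered again, and an item that arrives late with an old timestamp would be missed.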
ROLE OF DATA MINING
DATA MINING TASKS
• There is a lot of insight to be gained by
analyzing the data.
– What will help the user with her problem?
– Who does a given user trust?
– Characteristic metrics for high-quality content.
– Identify helpful content in similar, past queries.
– Summarize content.
– Who can answer this question?
LEVERAGING DATA MINING
• How do we get at the data?
– Relevant information is distributed across
several sources, not just the DBMS.
– Aggregated in a warehouse.
• How do we incorporate the insights
obtained by mining into the search
phase?
– Need to constantly update info about every
piece of content (Qs, As, users …)
LEVERAGING DATA MINING
• Three-step approach:
– Off-line analysis to gather new insight
– Periodically refresh indexes
– Use insight (from KB/index) to improve
search using the extended DB/IR
query framework
Use mining to create useful metadata
SOME UNIQUE TWISTS
• Identify the kinds of feedback that would be
helpful in refining a search.
– I.e., not just specific terms, but the types of
concepts that would be useful discriminators
(e.g., a good hierarchy of feedback concepts)
• Metrics of quality
– Link-analysis is a good example, but what are
the “links” here?
• Self-tuning searches
– The more the knobs, the more the choices
– Next step: self-personalizing searches?
CONCLUSIONS
CONFLUENCES
IR SEARCH
?