PPT

advertisement
• Session 1
IR in a Nutshell:
Applications, Research, and Challenges
Tamer Elsayed
Feb 21st 2013
Roadmap
 What
is Information Retrieval (IR)?
● Overview and applications
 Overview of
my research interests
● Large-scale problems
● MapReduce Extensions
● Twitter Analysis
 The
future of IR research
● SWIRL 2012
IR in a Nutshell: Applications, Research, and Challenges
2
WHAT IS IR?
OVERVIEW & APPLICATIONS/RESEARCH TOPICS
IR in a Nutshell: Applications, Research, and Challenges
3
Information Retrieval (IR) …
information
need
Query
Unstructured
Hits
IR in a Nutshell: Applications, Research, and Challenges
4
Who and Where?
*Source: Matt Lease (IR Course at UTexes)
IR is not just
“Web Page” Ranking
or Document
or Retrieval
6
Web Search: Google
Search suggestions
Vertical search
Sponsored search
Query-biased
summarization
Search shortcuts
Vertical
search
(news,
blog,
image)
Web Search: Google II
Spelling correction
Personalized search / social ranking
Vertical search (local)
Cross-Lingual IR
1/3 of the Web is in non-English
 About 50% of Web users do not use English as their
primary language


Many (maybe most) search applications have to deal
with multiple languages
● monolingual search: search in one language, but with many
possible languages
● cross-language search: search in multiple languages at the
same time
Routing / Filtering

Given standing query, analyze new information as it
arrives
● Input: all email, RSS feed or listserv, …
● Typically classification rather than ranking
● Simple example: Ham vs. spam
*Source: Matt Lease (IR Course at UTexes)
Content-based Music Search
*Source: Matt Lease (IR Course at UTexes)
Speech Retrieval
*Source: Matt Lease (IR Course at UTexes)
Entity Search
*Source: Matt Lease (IR Course at UTexes)
Question Answering & Focused Retrieval
*Source: Matt Lease (IR Course at UTexes)
Expert Search
*Source: Matt Lease (IR Course at U Texes)
Blog Search
*Source: Matt Lease (IR Course at UTexes)
μ-Blog Search (e.g. Twitter)
*Source: Matt Lease (IR Course at UTexes)
e-Discovery
*Source: Matt Lease (IR Course at Utexes)
Book Search





Find books or more focused results
Detect / generate / link table of contents
Classification: detect genre (e.g. for browsing)
Detect related books, revised editions
Challenges: Variable scan quality, OCR accuracy, Copyright,
etc.
Other Visual Interfaces
*Source: Matt Lease (IR Course at Utexes)
MY RESEARCH
IR in a Nutshell: Applications, Research, and Challenges
21
My Research …
emails
Text
+
Enron
~500,000
Large-Scale
Processing
Identity
Resolution
web pages
CLuE
Web
~1,000,000,000
User Application
Web
Search
22
Back in 2009 …

Before 2009, small text collections are available
● Largest: ~ 1M documents

ClueWeb09
● Crawled by CMU in 2009
● ~ 1B documents !
● need to move to cluster environments

MapReduce/Hadoop seems like promising framework
23
MapReduce Framework
(b) Shuffle
(a) Map
(k1, v1)
input
input
input
input
(c) Reduce
[k2, v2]
map
map
map
(k2, [v2])
Shuffling
group values
by: [keys]
[(k3, v3)]
reduce
output
reduce
output
reduce
output
map
Framework handles “everything else” !
24
Ivory
http://ivory.cc




E2E Search Toolkit using MapReduce
Completely designed for the Hadoop environment
Experimental Platform for research
Supports common text collections
● + ClueWeb09


Open source release
Implements state-of-the-art retrieval models
25
(1) Pairwise Similarity in Large Collections
0.20
0.30
0.54
~~~~~~~~~~~~
0.21
~~~~~~~~~~~~
0.00
~~~~~~~~~~~~
~~~~ 0.34
0.34
0.13
0.74
0.20
0.30
~~~~~~~~~~~~
0.54
~~~~~~~~~~~~
0.21
~~~~~~~~~~~~
0.00
~~~~
0.34
0.34
0.13
0.74
0.20
0.30
~~~~~~~~~~~~
0.54
~~~~~~~~~~~~
0.21
~~~~~~~~~~~~
0.00
~~~~
0.34
0.34
0.13
0.74
0.20
~~~~~~~~~~~~
0.30
~~~~~~~~~~~~
0.54
~~~~~~~~~~~~
0.21
~~~~
0.00
0.34
0.34
0.13
0.74

0.20
~~~~~~~~~~~~
0.30
~~~~~~~~~~~~
0.54
~~~~~~~~~~~~
0.21
~~~~0.00
0.34
0.34
0.13
0.74
Applications:


Clustering
“more-like-that” queries
26
Decomposition
Each term contributes only if appears in
reduce
map
27
(2) Cross-Lingual Pairwise Similarity

Find similar document pairs in different languages
More difficult than monolingual!

Multilingual text mining, Machine Translation

Application: automatic generation of potential
“interwiki” language links

Locality-sensitive Hashing
Vectors close to each other
are likely to have similar signatures
28
Solution Overview
Nf
German
articles
Ne
English
articles
CLIR
projection
Preprocess
Ne+Nf
English
document
vectors
<nobel=0.324,
prize=0.227,
book=0.01, …>
Signature
generation
Random Projection/
Minhash/Simhash
Similar
article
pairs
11100001010
01110000101
Ne+Nf
Signatures
Sliding
window
algorithm
(3) Approximate Positional Indexes
“Learning to Rank”
models
Approximate
Term
positions
Large
index
X
Smaller
index
√
Proximity
features
Learn
effective ranking
functions
Slow query
evaluation
Faster query
evaluation
√
X
√
Close Enough is Good Enough?
30
Fixed-Width Buckets

Buckets of length W
d1
1
2
3
4
5
………...........….
………...........….
………...........….
………...........….
………...........….
………...........….
………...........….
………...........….
………...........….
………...........….
d2
………...........….
………...........….
………...........….
………...........….
………...........….
1
2
3
31
(4) Pseudo Training Data for Web Rankers
Documents, queries, and relevance judgments
 Important driving force behind IR innovation

In industry, easy to get
 In academia, hard and really expensive

Web Graph
P3
P1
web search
SIGIR 2012
P6
P2
web search
P4
web search
web search
P5
P7
web search
Google
Queries and Judgments?
anchor text lines ≈ pseudo queries
target pages ≈ relevant candidates
P3
P1
SIGIR 2012
P4
web search
P2
P6
P5
Google
Bing
P7
noise reduction ?
(5) Extending MapReduce Framework

Iterative Computations (iHadoop)
Concurrent Jobs with shared data
 m maps - r reduces instead of 1 map-1 reduce

IR in a Nutshell: Applications, Research, and Challenges
35
(6) Twitter Analysis

Real-time search in Twitter
● TREC 2011 (6th out of 59 teams)
● TREC 2013?

Answering Real-time Questions from Arabic Social
Media
● NPRP-submitted
IR in a Nutshell: Applications, Research, and Challenges
36
FUTURE RESEARCH DIRECTIONS
IR in a Nutshell: Applications, Research, and Challenges
37
SWIRL 2012
Goal of Report
Inspire researchers and graduate students to address
the questions raised
 Provide funding agencies data to focus and coordinate
support for information retrieval research.


Participants were asked to focus on efforts that could be
handled in an academic setting, without the
requirement of large-scale commercial data.
Key Themes (across Topics)

Not just a ranked list
● move beyond the classic “single adhoc query and ranked list” approach

Help for users
● support users more broadly, including ways to bring IR to inexperienced,
illiterate, and disabled users.

Capturing context
● Treats people using search systems, their context, and their information
needs as critical aspects needing exploration.

Information, not documents
● beyond document retrieval and into more complex types of data and more
complicated results

New Domains
● data with restricted access, collections of “apps,” and richly connected
workplace data

Evaluation
● suggest new techniques for evaluation
“Most Interesting” Topics
IR in a Nutshell: Applications, Research, and Challenges
41
[1] Conversational Answer Retrieval
IR: provides ranked lists of documents in response to a
wide range of keyword queries
 QA: provides more specific answers to a very limited
range of natural language questions.


Goal: combine the advantages of both to provide
effective retrieval of appropriate answers to a wide
range of questions expressed in natural language, with
rich user-system dialogue
Proposed Research



Questions: open-domain, natural language text questions
Answers: Develop more general approaches to identifying as
many constraints as possible on the answers for questions
Dialogue would be initiated by the searcher and proactively
by the system, for:
● refining the understanding of questions
● improving the quality of answers

Answers: short answers, text passages, clustered groups of
passages, documents, or even groups of documents may be
appropriate answers. Even tables, figures, images, or videos
IR in a Nutshell: Applications, Research, and Challenges
43
Challenges
Definitions of question and answer for open domain
searching
 Techniques for representing questions and answers
 Techniques for reasoning about and ranking answers
 Techniques for representing a mixed-initiative CAR
dialogue
 Effective dialogue actions for improving question
understanding
 Effective dialogue actions for refining answers

IR in a Nutshell: Applications, Research, and Challenges
44
[2] Finding What You Need with Zero Query
Terms (or Less)





Function without an explicit query, depending on context and
personalization in order to understand user needs
Anticipate user needs and respond with information appropriate
to the current context without the user having to enter a query
(zero query terms) or even initiate an interaction with the system
(or less).
In a mobile context: take the form of an app that recommends
interesting places and activities based on the user’s location,
personal preferences, past history, and environmental factors
such as weather and time.
In a traditional desktop environment: might monitor ongoing
activities and suggest related information, or track news, blogs,
and social media for interesting updates.
Imagine a system that automatically gathers information related
to an upcoming task.
Proposed Research
New representations of information and user needs,
along with methods for matching the two
 Modeling person, task, and context;
 Methods for finding “objects of interest”, including
content, people, objects and actions
 Methods for determining what, how and when to show
material of interest.

IR in a Nutshell: Applications, Research, and Challenges
46
Challenges
Time- and geo-sensitivity; trust, transparency, privacy;
determining interruptibility; summarization
 Power management in mobile contexts
 Evaluation

IR in a Nutshell: Applications, Research, and Challenges
47
[3] Mobile Information Retrieval Analytics
(MIRA)




No company or researcher has an understanding of mobile
information access across a variety of tasks, modes of
interaction, or software applications.
For example, a search service provider might know that a
query was issued, but not know whether the results it
provided resulted in consequent action.
The identification of common types of web search queries
led to query classification and algorithms tuned for different
purposes, which improved web search accuracy. A similar
understanding for mobile information seeking would focus
research on the problems of highest value to mobile users.
study what information, what kind of information, and
what granularity of information to deliver for different
tasks and contexts
Proposed Research





Methodology and tools for doing large-scale collection of
data about mobile information access.
Research on incentive mechanisms is required to
understand situations in which people are willing to allow
their behavior to be monitored.
Research on privacy is required to understand what can be
protected by dataset licenses alone, what must be
anonymized, and tradeoffs between anonymization and
data utility.
Development of well-defined information seeking tasks
Support quantitative evaluation in well-defined evaluation
frameworks that lead to repeatable scientific research
IR in a Nutshell: Applications, Research, and Challenges
49
Challenges
Developing incentive mechanisms
 Developing data collections that are sufficiently
detailed to be useful while still protecting people’s
privacy.
 Collection of data in a manner that university internal
review boards will consider acceptable ethically.
 Collection of data in a manner that does not violate the
Terms of Use restrictions of commercial service
providers.

IR in a Nutshell: Applications, Research, and Challenges
50
[4] Empowering Users to Search and Learn
Search engines are currently optimized for look-up
tasks and not tasks that require more sustained
interactions with information
 People have been conditioned by current search
engines to interact in particular ways that prevent them
from achieving higher levels of learning.


We seek to empower users to be more proactive and
critical thinkers during the information search process.
[5] The Structure Dimension
Better integration of structured and unstructured
information to seamlessly meet a user’s information
needs is a promising, but underdeveloped area of
exploration.
 Named entities, user profiles, contextual annotations,
as well as (typed) links between information objects
ranging from web pages to social media messages.

[6] Understanding People in Order to Improve
Information (Retrieval) Systems

Development of a research resource for the IR community:
1. from which hypotheses about how to support people in
information interactions can be developed
2. in which IR system designs can be appropriately evaluated.

Conducting studies of people
● before, during, and after engagement with information systems,
● at a variety of levels,
● using a variety of methods.
•
•
•
•
ethnography
in situ observation
controlled observation
large-scale logging
IR in a Nutshell: Applications, Research, and Challenges
54
Thank You!
IR in a Nutshell: Applications, Research, and Challenges
55
Download