XML Keyword Search Refinement

郭青松
Outline
 - Introduction
 - Query Refinement in Traditional IR
 - XML Keyword Query Refinement
 - My work
Why do we need query refinement?
Users express their query intention through keywords, but they often do not know how to formulate a good query:
 - Lack of experience
 - Too many ways to express the same need
 - Unfamiliarity with the system
 - No knowledge of the underlying data
Query refinement
 - Refine the query so that it returns good results
What is Query Refinement?
Also called query expansion or query reformulation.
Given an ill-formed query from the user, the system refines the query to help the user retrieve documents more effectively.
The goal is to improve precision and/or recall.
Example:
 - "cars" → "car", "automobile", "auto"
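The "cars" example can be sketched as a lookup in a synonym table. This is a minimal illustration, not a method from the slides: the `SYNONYMS` table and the function name `expand_query` are hypothetical, and a real system would draw its candidates from a thesaurus or from corpus statistics.

```python
# Hypothetical synonym table; only the "cars" entry comes from the slide.
SYNONYMS = {"cars": ["car", "automobile", "auto"]}


def expand_query(terms, synonyms=SYNONYMS):
    """Return the query terms followed by their listed synonyms, without duplicates."""
    expanded = []
    for term in terms:
        for candidate in [term] + synonyms.get(term.lower(), []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query(["cars"]))
```

Terms without an entry in the table pass through unchanged, so the expansion can only add candidates and never drops what the user typed.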
XML Search
Tag + keyword search
 - book: xml
Path expression + keyword search (content-and-structure, CAS, queries)
 - /book[./title about "xml db"]
Structure query
 - XPath, XQuery
Keyword search (content-only, CO, queries)
 - "xml"
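To make the "tag + keyword" form concrete, here is a small sketch using Python's standard `xml.etree.ElementTree`. The sample document, the function name `tag_keyword_search`, and the choice of case-insensitive substring matching are all assumptions made for illustration; the slides do not define the matching semantics.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, not taken from the slides.
DOC = """
<lib>
  <book><title>XML db</title></book>
  <book><title>Cooking</title></book>
  <article><title>xml</title></article>
</lib>
"""


def tag_keyword_search(root, tag, keyword):
    """Return elements named `tag` whose text content contains `keyword`."""
    needle = keyword.lower()
    return [el for el in root.iter(tag)
            if needle in "".join(el.itertext()).lower()]


root = ET.fromstring(DOC)
for book in tag_keyword_search(root, "book", "xml"):
    print(book.find("title").text)
```

The query "book: xml" restricts the keyword match to `book` elements, so the `article` that also mentions "xml" is not returned.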
XML Keyword Search vs. IR
IR
 - Flat HTML pages
 - Whole page returned
XML
 - Data model (tree, graph)
 - Structured (semi-structured) data
 - Semantics-based queries (LCA: lowest common ancestor, SLCA: smallest LCA, …)
 - Information fragments returned
Need for XML Keyword Query Refinement
It is hard to know the XML content
 - Especially for large XML documents
Results are information fragments (LCA/SLCA)
 - Small changes to the query easily affect the results (precision)
 - Query results can differ hugely
IR-style refinement methods are not suitable for XML
 - They consider only content
 - Structure information is needed to form a good query
Tasks
Spelling Correction
Word Splitting/Word Merging
Phrase Segmentation
Word Stemming
Acronym Expansion
Add/Delete Terms
Substitution
Classes of Query Refinement
Relevance feedback
 - Users mark documents (relevant, nonrelevant)
 - Reweight the terms in the query
Automatic query refinement
 - The system analyzes the relevance between documents and the query and produces a refined query automatically
 - Global analysis
 - Local analysis
Relevance Feedback
Began in the 1960s
Improves recall and precision
Basic process:
1. The user issues an initial query.
2. The system returns an initial result set.
3. The user marks some returned documents as relevant or nonrelevant.
4. The system re-weights the terms and refines the query results.
Relevance Feedback Models
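Boolean
 - A document is relevant if the query terms appear in it
Vector space
 - q = (t1, t2, …, tn), d = (w1, w2, …, wn); q_i and d_i below denote the i-th components of q and d
$$
\mathrm{sim}(q, d) = \frac{\sum_{i=1}^{n} q_i d_i}{\sqrt{\sum_{i=1}^{n} (q_i)^2}\,\sqrt{\sum_{i=1}^{n} (d_i)^2}}
$$
Probabilistic
 - The relevance of a document to a query is estimated as a probability
 - Probability ranking principle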
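The vector-space similarity is the cosine of the angle between the query and document vectors. A minimal sketch in plain Python follows; the function name `cosine_similarity` and the convention of returning 0.0 for an all-zero vector are choices made here, not part of the slides.

```python
import math


def cosine_similarity(q, d):
    """Cosine similarity between two equal-length weight vectors."""
    if len(q) != len(d):
        raise ValueError("q and d must have the same number of components")
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # assumed convention: an empty vector matches nothing
    return dot / (norm_q * norm_d)


print(cosine_similarity([1, 0], [0, 1]))
```

Because the score is normalized by both vector lengths, a long document is not favored merely for containing more terms.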
Rocchio algorithm for the vector-space model
qm: refined query vector
q0: original query vector
Dr: set of relevant documents; Dnr: set of nonrelevant documents
α, β, γ: weights attached to the original query vector, the average relevant document vector, and the average nonrelevant document vector, respectively
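The formula itself did not survive in this text. The standard Rocchio update, written with the symbols defined above, is

$$
q_m = \alpha\, q_0 + \beta\, \frac{1}{|D_r|} \sum_{d_j \in D_r} d_j \;-\; \gamma\, \frac{1}{|D_{nr}|} \sum_{d_j \in D_{nr}} d_j
$$

The sketch below implements that update in plain Python. The default weights and the clipping of negative weights to zero are common textbook choices assumed here, not values given in the slides; the helper names are hypothetical.

```python
def centroid(vectors, dim):
    """Average of a list of vectors; the zero vector if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward relevant documents and away from nonrelevant ones."""
    dim = len(q0)
    avg_rel = centroid(relevant, dim)
    avg_nonrel = centroid(nonrelevant, dim)
    qm = [alpha * q0[i] + beta * avg_rel[i] - gamma * avg_nonrel[i]
          for i in range(dim)]
    return [max(0.0, w) for w in qm]  # assumed: negative term weights are clipped


print(rocchio([1, 0], [[0, 2], [0, 4]], [[2, 0]], alpha=1, beta=0.5, gamma=0.5))
```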
Global analysis (1)
Use all documents to compute the similarity between the query q and the terms in the documents
Similarity-thesaurus based
$$
K(D) =
\begin{array}{c|cccc}
 & d_1 & d_2 & \cdots & d_n \\ \hline
t_1 & w_{11} & w_{12} & \cdots & w_{1n} \\
t_2 & w_{21} & w_{22} & \cdots & w_{2n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
t_n & w_{n1} & w_{n2} & \cdots & w_{nn}
\end{array}
$$
t_i: term vector (row i of K(D))
q = {k_1, k_2, ..., k_m}, k_s ∈ {t}
w_ij: weight of term t_i in document d_j
Global analysis (2)
Similarity of terms
$$
c_{u,v} = \vec{t}_u \cdot \vec{t}_v = \sum_{d_j} w_{u,j} \times w_{v,j}
$$
Query vector
$$
\vec{q} = \sum_{t_i \in q} w_{i,q}\, \vec{t}_i
$$
Similarity of query and terms
$$
\mathrm{sim}(q, k_j) = \vec{q} \cdot \vec{k}_j = \sum_{t_i \in q} w_{i,q}\, \vec{t}_i \cdot \vec{t}_j = \sum_{t_i \in q} w_{i,q}\, c_{i,j}
$$
Select the r terms with the highest sim values and add them to the initial query to form the new query
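The global-analysis steps can be sketched as follows: each term is a row of weights over the documents, term-term similarity is a dot product of rows, and the candidate terms are ranked by their weighted similarity to the query terms. The dictionary layout, the function names, and the toy weights are assumptions made for this illustration.

```python
def dot(a, b):
    """Dot product of two term vectors (rows of the term-document matrix)."""
    return sum(x * y for x, y in zip(a, b))


def expansion_terms(K, query_weights, r):
    """Rank non-query terms by sim(q, k_j) and return the top r with all scores.

    K: dict mapping each term to its list of document weights.
    query_weights: dict mapping each query term to its weight w_{i,q}.
    """
    scores = {}
    for kj, tj in K.items():
        if kj in query_weights:
            continue  # only terms not already in the query are candidates
        scores[kj] = sum(w * dot(K[ti], tj) for ti, w in query_weights.items())
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:r], scores


# Toy term-document weights (hypothetical).
K = {"car": [1, 1, 0], "auto": [1, 1, 0], "engine": [0, 1, 0], "banana": [0, 0, 1]}
top, scores = expansion_terms(K, {"car": 1.0}, r=2)
print(top, scores)
```

Because every document contributes to the term vectors, this ranking is computed once over the whole collection, which is what distinguishes global from local analysis.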
Local analysis
Local analysis: use the results of the initial query (especially the top-ranked documents, the "local documents") to refine the query
Local clustering
 - Cluster the terms of the local documents
 - Refine the query with the relevant cluster
 - Based on the similarity between query terms and terms in the documents
Local context analysis (LCA)
 - Expand the query q with the terms in the local documents that are most similar to it
 - Based on the similarity between q and the terms in the documents
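A rough sketch of the local-analysis idea: look only at the top-ranked documents of the initial result list and add the terms that co-occur most often with the query terms. Raw co-occurrence counting is a crude stand-in for the term-similarity measures named above; it is neither local clustering nor local context analysis proper. The function name, the whitespace tokenization, and the parameters `top_k` and `r` are assumptions.

```python
from collections import Counter


def local_expansion(query_terms, ranked_docs, top_k=2, r=2):
    """Expand the query with the r terms that co-occur most with it in the top_k documents."""
    query = set(query_terms)
    counts = Counter()
    for doc in ranked_docs[:top_k]:
        tokens = doc.lower().split()
        if query & set(tokens):
            counts.update(t for t in tokens if t not in query)
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return list(query_terms) + ranked[:r]


docs = ["xml keyword search", "xml search engine", "cooking recipes"]
print(local_expansion(["xml"], docs))
```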
XML Refinement Manner (1)
Form of the refined query
 - Keyword query → new keyword query
 • Treated as a traditional IR problem
 • IR with XML keyword search semantics
 - Keyword query → structural query
User participation
 - Manual (user-interactive)
 • Structural feedback
 - Automatic
XML Refinement Manner (2)
Manual refinement to a new keyword query
 - IR (considering the structure of XML)
Manual transformation to a structural query
 - Relevance feedback
Automatic refinement to a new keyword query
 - Lu jiaheng:
Automatic transformation to a structural query
 - NLP
Automatic Refinement to a New Keyword Query (1)
Query → refined query
 - Rule based
Operations
 - Term merging: k1, k2, ..., kn → k
 - Term splitting: k → k1, k2, ..., kn
 - Term substitution: k1, k2, ..., kn → k1', k2', ..., kn'
 - Term deletion
Examples (original query → refined query)
 - IR, 2003, Mike → Information Retrieval, 2003, Mike
 - Mike, publication → Mike, publications
 - Database, paper → Database, in-proceedings
 - XML, John, 2003 → XML, John
 - machin, learn → machine, learning
 - Hobby, news, paper → Hobby, newspaper
 - On, line, data, base → Online, database
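The four refinement operations can be written as small list transformations. This sketch only applies an operation at a position chosen by the caller; deciding which operation to apply, and where, is the job of the refinement rules and is not modeled here. All function names are hypothetical.

```python
def merge_terms(query, i, n):
    """Term merging: replace the n terms starting at position i by their concatenation."""
    return query[:i] + ["".join(query[i:i + n])] + query[i + n:]


def split_term(query, i, parts):
    """Term splitting: replace the term at position i by the given parts."""
    if "".join(parts) != query[i]:
        raise ValueError("parts must concatenate to the original term")
    return query[:i] + list(parts) + query[i + 1:]


def substitute_term(query, i, new_term):
    """Term substitution: replace the term at position i."""
    return query[:i] + [new_term] + query[i + 1:]


def delete_term(query, i):
    """Term deletion: drop the term at position i."""
    return query[:i] + query[i + 1:]


q = ["on", "line", "data", "base"]
q = merge_terms(q, 0, 2)   # merge "on" + "line"
q = merge_terms(q, 1, 2)   # merge "data" + "base"
print(q)
```

Each function returns a new list, so a refinement sequence can be explored without mutating the original query.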
Automatic Refinement to a New Keyword Query (2)
Ranking the set of refined query candidates S(RQ)
Refinement cost
 - Cost: the number of operations ("op") needed to get from Q to RQ
 - Computed by dynamic programming
Efficient refinement algorithms
 - Avoid scanning the inverted lists multiple times
 - Stack-based and short-list-eager approaches
RQ candidates may have the same refinement cost
 - Q = {XML, Jim, 2001} → {XML, 2001}, {Jim, 2001} or {XML, Jim}
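The tie in the example can be reproduced with a deliberately simplified cost model: only term deletion is considered, and each deletion costs 1. This is an assumption made for illustration; it is not the dynamic-programming cost computation mentioned above, which covers all operations.

```python
from itertools import combinations


def deletion_candidates(query, max_cost):
    """List (cost, refined_query) pairs, where cost is the number of deleted terms."""
    candidates = []
    for cost in range(1, max_cost + 1):
        for kept in combinations(query, len(query) - cost):
            candidates.append((cost, list(kept)))
    return candidates


for cost, rq in deletion_candidates(["XML", "Jim", "2001"], max_cost=1):
    print(cost, rq)
```

All three candidates have cost 1, so cost alone cannot order them and a further ranking criterion is needed.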
NLPX
Natural language query (NLQ) → NEXI
NEXI (Narrowed Extended XPath I)
 - //A[about(//B,C)]
 - A: path expression
 - B: path expression relative to A
 - C: content requirement
 - Each "about" clause represents an individual information request
NLPX: Lexical and Semantic Tagging
Structural words (XST): name the structural requirement (path expression), e.g. "sections", "articles"
Boundary words (XBD): separate the structure from the content requirement, e.g. "about"
Instruction words (XIN): e.g. "Find"
 - R: return request, S: support request
Example: Find sections about compression in articles about information retrieval
Tagged: Find/XIN sections/XST about/XBD compression/NN in/IN articles/XST about/XBD information/NN retrieval/NN
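The tagging step can be imitated with a lookup table. This is only a toy: the `LEXICON` below is hypothetical and covers just the words of the example, every unknown word is tagged `NN`, and a real system would use a part-of-speech tagger for the non-special words.

```python
# Hypothetical lexicon covering only the example query.
LEXICON = {
    "find": "XIN",       # instruction word
    "sections": "XST",   # structural word
    "articles": "XST",   # structural word
    "about": "XBD",      # boundary word
    "in": "IN",          # preposition
}


def tag(query):
    """Attach a tag to every word; unknown words default to NN (noun)."""
    return [f"{word}/{LEXICON.get(word.lower(), 'NN')}" for word in query.split()]


print(" ".join(tag("Find sections about compression in articles about information retrieval")))
```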
NLPX: Template Matching
Most queries correspond to a small set of patterns
Grammar templates are formulated from these patterns
Grammar templates
 Query: Request+
 Request: CO_Request | CAS_Request
 CO_Request: NounPhrase+
 CAS_Request: SupportRequest | ReturnRequest
 SupportRequest: Structure [Bound] NounPhrase+
 ReturnRequest: Instruction Structure [Bound] NounPhrase+
Information requests
 Attribute   | Request 1    | Request 2
 Structural  | /article/sec | /article
 Content     | compression  | information retrieval
 Instruction | R            | S
NLPX: NEXI Query Production
Merge the information requests into a NEXI query.
A[about(.,C)]
 - A: the structural attribute of the request
 - C: the content attribute of the request
//article[about(.,information retrieval)]//sec[about(.,compression)]
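One way to picture the merge step: order the information requests from the outermost path to the innermost, and emit one `about` clause per request using the path relative to the previous one. This sketch assumes the requests nest along a single path, which holds for the example but not in general; the dictionary keys and the function name `to_nexi` are hypothetical.

```python
def to_nexi(requests):
    """Merge information requests into one NEXI query string.

    Each request is a dict with 'structure' (absolute path) and 'content'.
    Assumes every path extends the previous, shallower one.
    """
    ordered = sorted(requests, key=lambda r: r["structure"].count("/"))
    query, prefix = "", ""
    for req in ordered:
        path = req["structure"]
        if not path.startswith(prefix):
            raise ValueError("this sketch only handles requests nested along one path")
        relative = path[len(prefix):]  # e.g. "/sec" below "/article"
        query += "/" + relative + f"[about(.,{req['content']})]"
        prefix = path
    return query


requests = [
    {"structure": "/article/sec", "content": "compression", "instruction": "R"},
    {"structure": "/article", "content": "information retrieval", "instruction": "S"},
]
print(to_nexi(requests))
```

With this ordering the deepest path ends up last in the query, which in the example is also the return request (R).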
Query generation process
Create target components
 - Break up the query into units
Generate initial target sets
 - Combinations of the input target components
Generate queries
 - Modify a target component
 - Combine two components
Initialization
Break up the input query into terms
 - Structure terms (XML tags or attributes)
 - Content terms (refer to text)
Create components
 - Structure term → unbound target
 - Content term → bound target
Probability enumeration
Target components and target sets
Query: Papers by jennifer widom
Terms: "papers", "jennifer widom"
Target components
 Term           | Target component              | Probability
 papers         | {//article}                   | 0.5000
 papers         | {//inproceedings}             | 0.5000
 jennifer widom | {//author[~'jennifer widom']} | 0.6842
 jennifer widom | {//editor[~'jennifer widom']} | 0.3150
 jennifer widom | {//title[~'jennifer widom']}  | 0.0004
Target sets
 Target set                                        | Probability
 {//article} {//author[~'jennifer widom']}         | 0.3421
 {//inproceedings} {//author[~'jennifer widom']}   | 0.3421
 {//inproceedings} {//editor[~'jennifer widom']}   | 0.1577
 {//article} {//editor[~'jennifer widom']}         | 0.1577
 {//inproceedings} {//title[~'jennifer widom']}    | 0.0002
 {//article} {//title[~'jennifer widom']}          | 0.0002
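The target-set probabilities are consistent, up to rounding in the fourth decimal, with multiplying the probabilities of the chosen components (for example 0.5000 × 0.6842 = 0.3421). Assuming that product rule, the enumeration can be sketched as below; the data layout and the function name are hypothetical, and the component probabilities are the rounded figures from the slide.

```python
from itertools import product

# Component probabilities as listed on the slide (rounded figures).
COMPONENTS = {
    "papers": [("{//article}", 0.5000), ("{//inproceedings}", 0.5000)],
    "jennifer widom": [
        ("{//author[~'jennifer widom']}", 0.6842),
        ("{//editor[~'jennifer widom']}", 0.3150),
        ("{//title[~'jennifer widom']}", 0.0004),
    ],
}


def enumerate_target_sets(components):
    """Pick one target per term, score each set by the product of probabilities, sort descending."""
    sets = []
    for choice in product(*components.values()):
        probability = 1.0
        for _, p in choice:
            probability *= p
        sets.append(([target for target, _ in choice], probability))
    return sorted(sets, key=lambda item: -item[1])


for targets, probability in enumerate_target_sets(COMPONENTS):
    print(" ".join(targets), round(probability, 4))
```

Recomputing from the rounded inputs gives 0.1575 for the editor sets rather than the 0.1577 shown, presumably because the slide's values were derived from unrounded probabilities.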
Transformation Operators (1)
Aggregation: merge targets with the same tag
 - {//a}, {//a[~'x']} → {//a[~'x']}
 - {//a[~'x']}, {//a[~'y']} → {//a[~'x y']}
Prefix expansion: add an ancestor condition
 - {//b} → {//a//b}
 - {//b[~'x']} → {//a//b[~'x']}
Ordering: combine targets
 - {//a}, {//b} → {//a//b} or {//a[//b]}
 - {//a}, {//b[~'x']} → {//a//b[~'x']} or {//a[//b[~'x']]}
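The three operators can be sketched on a tiny target representation: a path plus an optional content condition. The `Target` class and the function names are hypothetical, and `order` assumes the outer target is unbound, as in the examples listed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Target:
    path: str                      # e.g. "//a" or "//a//b"
    content: Optional[str] = None  # None means an unbound target

    def condition(self):
        return f"[~'{self.content}']" if self.content else ""

    def __str__(self):
        return "{" + self.path + self.condition() + "}"


def aggregate(t1, t2):
    """Aggregation: merge two targets that share the same path."""
    if t1.path != t2.path:
        raise ValueError("aggregation needs targets with the same tag")
    words = " ".join(c for c in (t1.content, t2.content) if c)
    return Target(t1.path, words or None)


def prefix_expand(target, ancestor_tag):
    """Prefix expansion: require an ancestor element above the target."""
    return Target(f"//{ancestor_tag}{target.path}", target.content)


def order(outer, inner):
    """Ordering: return both combinations, as a descendant path and as a predicate."""
    as_path = "{" + outer.path + inner.path + inner.condition() + "}"
    as_predicate = "{" + outer.path + "[" + inner.path + inner.condition() + "]}"
    return as_path, as_predicate


print(aggregate(Target("//a", "x"), Target("//a", "y")))
print(prefix_expand(Target("//b", "x"), "a"))
print(order(Target("//a"), Target("//b", "x")))
```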
Conclusion
Two strong assumptions
 - The keyword query is unambiguous
 - An XML thesaurus is available
Accuracy:
 - Term classification does not consider the specific XML context
Time cost:
 - Term classification
 - Creating targets requires scanning the XML documents