View/Download - BYU Data Extraction Research Group

Data Frame Augmentation of Free Form Queries for Constraint Based Document Filtering
Andrew Zitzelberger
Problem
Constraint Based Queries
Queries
Test Queries
1) Find me a Wii game.
2) Find me a Honda for under 15 thousand dollars.
3) Roller Coaster more than 150 feet high
4) mountains at least 15K feet
5) games under $25
6) mountains less than 4 km
7) ps games < $40
8) coasters longer than 1000 feet
9) car for under 5 grand newer than 1990 with less than 115K miles
10) more than 15K miles under 5 grand newer than 2004
Keywords + Semantics
• Semantic queries are computationally expensive
• Keyword queries are fast and simple
o People are used to keyword queries
• Synergistic solution:
o extract numerical constraints from the query
o use keywords to quickly narrow the search space
o use constraints as a filter
Data Frames
Price
internal representation: Double
external representation: \$[1-9]\d{0,2}(,\d{3})*|...
...
right units: (K)?\s*(cents|dollars|[Gg]rand|...)
canonicalization method: toUSDollars
comparison methods:
LessThan(p1: Price, p2: Price) returns (Boolean)
external representation: (less than|<|under|...)\s*{p2}|...
...
end
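As a rough illustration of how such a data frame might be expressed in code, the sketch below covers only the patterns visible on the slide; the elided alternatives ("...") are deliberately not filled in. The names `PRICE_VALUE`, `PRICE_UNITS`, `to_us_dollars`, and `less_than` are hypothetical stand-ins for the external representation, right units, canonicalization method, and comparison method, and the handling of "K", "grand", and "cents" is an assumption about what toUSDollars does.

```python
import re

# Hypothetical simplification of the Price data frame. Only the patterns
# shown on the slide are covered; elided alternatives are not guessed.
PRICE_VALUE = re.compile(r"\$(?P<num>[1-9]\d{0,2}(?:,\d{3})*)")
PRICE_UNITS = re.compile(
    r"(?P<num>\d+(?:\.\d+)?)\s*(?P<k>K)?\s*(?P<unit>cents|dollars|[Gg]rand)"
)

def to_us_dollars(text):
    """Canonicalize a recognized price string to a float in US dollars.

    Returns None when neither pattern matches. The scaling rules are
    assumptions made for this sketch.
    """
    m = PRICE_VALUE.search(text)
    if m:
        return float(m.group("num").replace(",", ""))
    m = PRICE_UNITS.search(text)
    if m:
        value = float(m.group("num"))
        if m.group("k"):
            value *= 1000          # "15K dollars" -> 15000
        unit = m.group("unit").lower()
        if unit == "grand":
            value *= 1000          # "6 grand" -> 6000
        elif unit == "cents":
            value /= 100           # "50 cents" -> 0.5
        return value
    return None

def less_than(p1, p2):
    """Comparison method: both arguments are canonical dollar amounts."""
    return p1 < p2
```

For example, `less_than(to_us_dollars("$4,500"), to_us_dollars("6 grand"))` compares two values that were written in different surface forms.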
Data Frame Library
Free Form Query
• Car under 6 grand newer than 1990 with less than 115K miles
Step 1: Condition Extraction
• Car under 6 grand newer than 1990 with less than 115K miles
• Extracted Conditions
o (Price < 6000)
o (Year > 1990)
o (Distance < 115000)
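A minimal sketch of condition extraction for this one query is shown below. It assumes a tiny, hypothetical recognizer table (`RECOGNIZERS`) with one comparison phrase per concept; the real data frame library is richer and its patterns are not reproduced here.

```python
import re

# Hypothetical recognizers: (concept, operator, pattern, scale factor).
# Each pattern pairs one comparison phrase with a numeric value.
NUM = r"(?P<num>\d+(?:\.\d+)?)\s*(?P<k>K)?"
RECOGNIZERS = [
    ("Price", "<", re.compile(r"under\s+" + NUM + r"\s*grand"), 1000),
    ("Year", ">", re.compile(r"newer than\s+(?P<num>(?:19|20)\d{2})"), 1),
    ("Distance", "<", re.compile(r"less than\s+" + NUM + r"\s*miles"), 1),
]

def extract_conditions(query):
    """Return a list of (concept, operator, canonical value) tuples."""
    conditions = []
    for concept, op, pattern, scale in RECOGNIZERS:
        for m in pattern.finditer(query):
            value = float(m.group("num")) * scale
            if m.groupdict().get("k"):   # a trailing "K" means thousands
                value *= 1000
            conditions.append((concept, op, value))
    return conditions

query = "Car under 6 grand newer than 1990 with less than 115K miles"
for concept, op, value in extract_conditions(query):
    print(f"({concept} {op} {value:g})")
```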
Step 2: Remove Condition Values
• Car under newer than with less than
Step 3: Remove Stopwords
• Car
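Steps 2 and 3 can be sketched as two small string transformations. The value pattern `VALUE` and the stopword set `STOPWORDS` below are illustrative assumptions chosen to reproduce the slide's example; they are not the lists used in the actual system.

```python
import re

# Hypothetical pattern for a condition value with an optional unit.
VALUE = re.compile(r"\$?\d[\d,.]*\s*K?\s*(?:grand|dollars|miles|feet|km)?")

# Hypothetical stopword list; it includes the comparison words left behind
# once the values are removed.
STOPWORDS = {
    "a", "an", "the", "me", "for", "with", "find", "at", "least",
    "under", "over", "less", "more", "than", "newer", "older",
}

def remove_condition_values(query):
    """Step 2: delete recognized values and normalize whitespace."""
    return " ".join(VALUE.sub(" ", query).split())

def remove_stopwords(text):
    """Step 3: keep only the words that are not stopwords."""
    return [w for w in text.split() if w.lower() not in STOPWORDS]

query = "Car under 6 grand newer than 1990 with less than 115K miles"
reduced = remove_condition_values(query)
print(reduced)                    # Car under newer than with less than
print(remove_stopwords(reduced))  # ['Car']
```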
Step 4: Perform Keyword Search
Step 5: Filter Document on Constraints
• Keep page if every constraint is satisfied by at least one extracted value
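The filtering rule can be written as an all/any check. In this sketch the function name `keep_page` and the data shapes (constraints as tuples, extracted values as a dictionary keyed by concept) are assumptions made for illustration.

```python
import operator

# Map operator symbols to comparison functions.
OPS = {"<": operator.lt, ">": operator.gt, "<=": operator.le, ">=": operator.ge}

def keep_page(constraints, extracted):
    """Keep a page if every constraint is satisfied by at least one value.

    constraints: list of (concept, operator symbol, bound)
    extracted:   dict mapping concept -> list of values found on the page
    """
    return all(
        any(OPS[op](value, bound) for value in extracted.get(concept, []))
        for concept, op, bound in constraints
    )

constraints = [("Price", "<", 6000), ("Year", ">", 1990), ("Distance", "<", 115000)]
page = {"Price": [4500.0], "Year": [1998], "Distance": [98000]}
print(keep_page(constraints, page))  # True
```

Because only one satisfying value per constraint is required, any stray number on a page that is recognized as, say, a distance can satisfy the constraint. A query with no constraints keeps every page, so the result falls back to the plain keyword ranking.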
Experimental Setup
• 300 web documents
o 100 car+trucks pages from http://provo.craigslist.org
o 100 video gaming pages from http://provo.craigslist.org
o 50 mountain pages from http://en.wikipedia.org
o 50 roller coaster pages from http://en.wikipedia.org
• 10 queries
o 8 with usable conditions
• 2 data sets
o test-development
o blind test
Results Summary
• Precision increase for 56% of queries
o 75% for test-dev, 50% for blind-test
• Precision never worse than keyword query
• Most effective for short, focused documents
Precision@3 / Query Type   Keyword Queries   Reduced Queries   Data Frame Augmented Queries
Dev-Test Queries           33%               40%               60%
Blind-Test Queries         50%               46%               63%
Overall                    42%               43%               62%
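For reference, Precision@3 is assumed here to carry its usual meaning: the fraction of the top three returned documents that are relevant. The sketch below uses a hypothetical helper, `precision_at_k`, and then averages the per-query Data Frame Augmentation scores reported for the test-dev set to show how a summary figure of this kind can be derived.

```python
def precision_at_k(ranked_relevance, k=3):
    """Fraction of the top-k results that are relevant.

    ranked_relevance: list of booleans in rank order. Dividing by k even
    when fewer than k results are returned is an assumed convention.
    """
    return sum(ranked_relevance[:k]) / k

# Two of the top three results relevant.
print(round(precision_at_k([True, False, True]), 2))  # 0.67

# Per-query Data Frame Augmentation scores from the test-dev results.
test_dev = [0.33, 1.00, 0.67, 1.00, 0.67, 0.33, 0.33, 1.00, 0.67, 0.00]
print(round(sum(test_dev) / len(test_dev), 2))        # 0.6
```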
Discussion
• Issues:
1. inadequate narrowing or ranking of search space
2. noise caused by other numbers
Distance < 115000
Future Work
• Scalability
o Indexing data frame extracted terms
• Precision vs Recall trade-offs
• Pay-as-you-go search construction
Related Work
• Question-Answering Systems
• Keyword search over databases and semantic stores
Questions?
Results (Test-Dev Set)
Query                                                            Keyword   Condition Removed Keyword   Data Frame Augmentation
Find me a Wii game.                                              0.33      0.33                        0.33
Find me a Honda for under 15 thousand dollars.                   0.67      1.00                        1.00
roller coaster more than 150ft high                              0.33      0.33                        0.67
mountains at least 15K ft                                        1.00      0.67                        1.00
games under $25                                                  0.00      0.33                        0.67
mountains less than 4 km                                         0.00      0.00                        0.33
ps games < 40 bucks                                              0.33      0.00                        0.33
coasters longer than 1000 feet                                   0.33      1.00                        1.00
car for under 6 grand newer than 1990 with less than 115K miles  0.33      0.33                        0.67
more than 15K miles under 10 grand newer than 2000               0.00      0.00                        0.00
Results (Blind Test Set)
Query                                                            Keyword   Condition Removed Keyword   Data Frame Augmentation
Find me a Wii game.                                              0.67      0.67                        0.67
Find me a Honda for under 15 thousand dollars.                   0.67      1.00                        1.00
roller coaster more than 150ft high                              0.67      0.67                        0.67
mountains at least 5K ft                                         0.33      0.33                        0.67
games under $25                                                  0.67      0.67                        1.00
mountains less than 4 km                                         0.00      0.00                        0.00
ps games < 40 bucks                                              0.33      0.33                        0.33
coasters longer than 1000 feet                                   0.67      0.67                        0.67
car for under 6 grand newer than 1990 with less than 115K miles  0.67      0.00                        1.00
more than 15K miles under 10 grand newer than 2000               0.33      0.33                        0.33