20 May 2010
LREC 2010
Walid Magdy, Jinming Min, Johannes Leveling, Gareth Jones
School of Computing, Dublin City University, Ireland
CNGL
Objective
Data collection preparation and overview
IR test collection design
Baseline Experiments
Summary
Centre for Next Generation Localisation (CNGL)
4 Universities: DCU, TCD, UCD, and UL
Team: 120 PhD students, PostDocs, and PIs
Supported by Science Foundation Ireland (SFI)
9 Industrial Partners: IBM, Microsoft, Symantec, …
Objective: Automation of the localisation process
Technologies: MT, AH, IR, NLP, Speech, and Dev.
Create a collection of data that is:
Suitable for IR tasks
Suitable for other research fields (AH, NLP)
Large enough to produce conclusive results
Associated with defined evaluation strategies
Prepare the collection from freely available data
YouTube
Domain specific (Basketball)
Build standard IR test collection (document set + topics set + relevance assessment)
[Figure: annotated YouTube video page showing the metadata extracted for each document: video URL, video title, posting user, posting date, tags, description, category, length, number of views, number of ratings, number of times favorited, comments, related videos, responded videos]
50 NBA related queries used to search YouTube
First 700 results per query crawled with related videos
Crawled pages parsed and metadata extracted
Extracted data represented in XML format
Non-sport category results filtered out
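The parse, represent-as-XML, and filter steps above could be sketched as follows. This is a minimal illustration only: the element names, the dictionary keys, and the "Sports" category label are assumptions, since the slides list the metadata fields but not the XML schema or the exact filter used.

```python
import xml.etree.ElementTree as ET

# Hypothetical field names; the actual schema of the collection is not given.
def video_to_xml(video: dict) -> ET.Element:
    """Represent one video's extracted metadata as an XML element."""
    doc = ET.Element("video")
    for field in ("url", "title", "category", "description"):
        ET.SubElement(doc, field).text = video.get(field, "")
    tags = ET.SubElement(doc, "tags")
    for tag in video.get("tags", []):
        ET.SubElement(tags, "tag").text = tag
    return doc

def keep_sports(videos, category="Sports"):
    """Drop videos whose category is not the (assumed) sport category."""
    return [v for v in videos if v.get("category") == category]

# Toy records, invented for illustration.
videos = [
    {"url": "http://example.org/v1", "title": "Jordan dunks",
     "category": "Sports", "tags": ["NBA", "dunk"]},
    {"url": "http://example.org/v2", "title": "Music clip",
     "category": "Music", "tags": ["NBA"]},
]
kept = keep_sports(videos)
xml_docs = [ET.tostring(video_to_xml(v), encoding="unicode") for v in kept]
```

Filtering before serialisation keeps the non-sport pages out of the XML document set entirely.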
Used Queries:
NBA, NBA Highlights, NBA All Stars, NBA fights
The 15 top-ranked NBA players in 2008, plus Jordan and Shaq
29 NBA teams
Crawled video pages: 61,340 pages
Max crawled related/responded video pages: 20
Max crawled comments for a given video page: 500
Comments associated with contributing user’s ID
Crawled user profiles ≈ 250k
40 topics (queries) created
Specific topics related to NBA
TREC topic = query (title) + description + narrative
<title> Michael Jordan best dunks </title>
<description> Find the best dunks through the career of Michael Jordan in NBA. It can be a collection of dunks in matches, or dunk contest he participated in. </description>
<narrative> A relevant video should contain at least one dunk for Jordan. Videos of dunks for other players are not relevant. And other plays for Jordan other than dunks are not relevant as well </narrative>
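A topic in this title/description/narrative layout can be split into its three fields with a few lines of code. The sketch below assumes each field appears once and is delimited by the tags shown in the example; `parse_topic` is a hypothetical helper, not part of the released scripts.

```python
import re

TOPIC = """<title> Michael Jordan best dunks </title>
<description> Find the best dunks through the career of Michael
Jordan in NBA. It can be a collection of dunks in matches, or dunk contest he participated in.
</description>
<narrative> A relevant video should contain at least one dunk for
Jordan. Videos of dunks for other players are not relevant. And other plays for Jordan other than dunks are not relevant as well </narrative>"""

def parse_topic(text: str) -> dict:
    """Extract title, description and narrative, collapsing whitespace."""
    fields = {}
    for name in ("title", "description", "narrative"):
        match = re.search(rf"<{name}>(.*?)</{name}>", text, flags=re.DOTALL)
        fields[name] = " ".join(match.group(1).split()) if match else ""
    return fields

topic = parse_topic(TOPIC)
query = topic["title"]  # the title is the part used as the query
```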
4 indexes created:
Title
Title + Tags
Title + Tags + Description
Title + Tags + Description + Related videos titles
5 different retrieval models used
20 result lists (4 indexes × 5 retrieval models), each containing 60 documents
Result lists merged and randomly reordered
122 to 466 documents assessed per topic
1 to 125 relevant documents per topic (avg. = 23)
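The pooling step for a single topic could look roughly like this. It is a sketch under assumptions: the function name, the fixed seed, and the toy document IDs are illustrative, and the slides do not say how duplicates across lists were handled (here they are simply merged).

```python
import random

def build_pool(result_lists, depth=60, seed=0):
    """Merge the top `depth` documents of each ranked list into one
    duplicate-free pool, then shuffle it so the order shown for
    assessment does not reveal which run retrieved a document."""
    pool = set()
    for ranked in result_lists:
        pool.update(ranked[:depth])
    pooled = sorted(pool)               # fixed order before shuffling
    random.Random(seed).shuffle(pooled)
    return pooled

# Toy input: 20 overlapping ranked lists of 100 document IDs each.
result_lists = [[f"doc{(i * 7 + j) % 300}" for j in range(100)]
                for i in range(20)]
pool = build_pool(result_lists)
```

Because the lists overlap, the pool is much smaller than 20 × 60 = 1,200 documents, which is consistent with the 122 to 466 documents assessed per topic.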
Search 4 different indexes:
Title
Title + Tags
Title + Tags + Description
Title + Tags + Description + Related videos titles
Indri retrieval model used to rank results
1000 results retrieved for each search
Mean average precision (MAP) used to compare the results
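For reference, MAP can be computed from ranked result lists and relevance assessments as sketched below. This uses the standard definition (average precision per topic, averaged over topics); the data structures and toy values are assumptions for illustration, not the evaluation script used in the experiments.

```python
def average_precision(ranked, relevant):
    """Average of precision@k taken at each rank k where a relevant
    document appears, divided by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(runs, qrels):
    """Mean of per-topic average precision over all assessed topics."""
    scores = [average_precision(runs.get(topic, []), relevant)
              for topic, relevant in qrels.items()]
    return sum(scores) / len(scores)

# Toy data: two topics with invented document IDs.
runs = {"t1": ["a", "b", "c", "d"], "t2": ["x", "y", "z"]}
qrels = {"t1": {"a", "c"}, "t2": {"y"}}
map_score = mean_average_precision(runs, qrels)
```

In the toy data, topic t1 scores (1/1 + 2/3) / 2 and topic t2 scores 1/2, so the mean is 2/3.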
[Figure: MAP (axis from 0.00 to 0.45) for each index: Title, Title+Tags, Title+Tags+Desc, All text fields]
[Figure: overview of the collection and the research it can support]
Collection: 61,340 XML docs (videos), 250,000 user profiles, 40 topics + rel. assess.
Fields: Metadata, Tags, Comments, Ratings, # Views
Uses: NER, IR test set, AH/Personalisation, Reranking using ML, Multimedia processing
Top bigrams in “Tags” field:
Kobe Bryant, NBA Basketball, Lebron James, Michael Jordan, Los Angeles, All Star, Chicago Bulls, Boston Celtics, Angeles Lakers, Slam Dunk, Basketball NBA, Vince Carter, Kevin Garnett, Toronto Raptors, Houston Rockets, Miami Heat, O’Neal, Phoenix Suns, Detroit Pistons, Yao Ming, Amazing Highlights, Pau Gasol, Cleveland Cavaliers, NBA Amazing
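A list of top tag bigrams like this one could be produced by counting adjacent token pairs in each video's tag field. The sketch below assumes tags are whitespace-separated tokens, which would explain why both "Los Angeles" and "Angeles Lakers" appear; the function name and toy tag strings are illustrative.

```python
from collections import Counter

def top_bigrams(tag_fields, n=5):
    """Count adjacent token pairs within each tag field and return
    the n most frequent as (bigram, count) pairs."""
    counts = Counter()
    for tags in tag_fields:
        tokens = tags.split()
        counts.update(zip(tokens, tokens[1:]))
    return [(" ".join(pair), count) for pair, count in counts.most_common(n)]

# Toy tag fields, invented for illustration.
tag_fields = [
    "Kobe Bryant Los Angeles Lakers",
    "Kobe Bryant NBA Basketball",
    "Lebron James NBA Basketball Kobe Bryant",
]
top = top_bigrams(tag_fields, n=2)
```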
Q: Is this collection available for free?
A: No
Q: Nothing could be provided?
A: Scripts + Topics + Rel. assess. (needs updating)
Q: Any other questions?
A: …