20 May 2010

LREC 2010

Building a Domain-Specific Document Collection for Evaluating Metadata Effects on Information Retrieval

Walid Magdy, Jinming Min, Johannes Leveling, Gareth Jones

School of Computing, Dublin City University, Ireland

Outline

CNGL

Objective

Data collection preparation and overview

IR test collection design

Baseline Experiments

Summary

CNGL

Centre for Next Generation Localisation (CNGL)

4 Universities: DCU, TCD, UCD, and UL

Team: 120 PhD students, PostDocs, and PIs

Supported by Science Foundation Ireland (SFI)

9 Industrial Partners: IBM, Microsoft, Symantec, …

Objective: Automation of the localisation process

Technologies: MT, AH, IR, NLP, Speech, and Dev.

Objective

1. Create a collection of data that is:

   - Suitable for IR tasks

   - Suitable for other research fields (AH, NLP)

   - Large enough to produce conclusive results

   - Associated with defined evaluation strategies

2. Prepare the collection from freely available data

   - YouTube

3. Domain specific (Basketball)

4. Build standard IR test collection (document set + topic set + relevance assessments)

YouTube Videos Features

Document

- Video URL

- Video Title

Posting user

Tags

Description

Posting date

Responded videos

Related videos

Length

Number of favorited

Category

Number of ratings

Comments

Number of views

Methodology for Crawling Data

50 NBA related queries used to search YouTube

First 700 results per query crawled with related videos

Crawled pages parsed and metadata extracted.

Extracted data represented in XML format

Non-sport category results filtered out

Used Queries:

NBA - NBA Highlights - NBA All Stars - NBA fights

Top ranked 15 NBA players in 2008 + Jordan + Shaq

29 NBA teams
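The filtering and XML steps above can be sketched as follows. This is only an illustration, assuming each parsed page is already a Python dictionary; the element names (`doc`, `url`, `title`, `tags`, ...) and the category label `Sports` are hypothetical and are not taken from the collection's actual schema.

```python
import xml.etree.ElementTree as ET

def is_sports(record):
    """Keep only records in the sport category (label 'Sports' is an assumption)."""
    return record.get("category", "").strip().lower() == "sports"

def to_xml(record):
    """Serialise one video record as an XML string (hypothetical element names)."""
    doc = ET.Element("doc")
    for field in ("url", "title", "category", "description"):
        ET.SubElement(doc, field).text = record.get(field, "")
    tags = ET.SubElement(doc, "tags")
    for tag in record.get("tags", []):
        ET.SubElement(tags, "tag").text = tag
    return ET.tostring(doc, encoding="unicode")

# Two made-up records standing in for parsed video pages.
records = [
    {"url": "http://example.com/v1", "title": "NBA Highlights",
     "category": "Sports", "description": "Top plays", "tags": ["NBA", "dunk"]},
    {"url": "http://example.com/v2", "title": "NBA song",
     "category": "Music", "description": "", "tags": []},
]

# Non-sport results are dropped before serialisation.
xml_docs = [to_xml(r) for r in records if is_sports(r)]
```

Filtering before serialisation keeps the XML collection limited to in-domain documents.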

Data Collection Overview

Crawled video pages: 61,340 pages

Max crawled related/responded video pages: 20

Max crawled comments for a given video page: 500

Comments associated with contributing user’s ID

Crawled user profiles ≈ 250k

XML sample

Topics Creation

40 topics (queries) created

Specific topics related to NBA

TREC topic = query (title) + description + narrative

<title> Michael Jordan best dunks </title>

<description> Find the best dunks through the career of Michael Jordan in NBA. It can be a collection of dunks in matches, or dunk contest he participated in. </description>

<narrative> A relevant video should contain at least one dunk for Jordan. Videos of dunks for other players are not relevant. And other plays for Jordan other than dunks are not relevant as well </narrative>
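A topic in this title/description/narrative layout can be read with a few lines of code. The sketch below assumes the three fields appear as simple, non-nested tags exactly as in the sample; `parse_topic` is an illustrative helper, not part of the released scripts.

```python
import re

SAMPLE_TOPIC = """
<title> Michael Jordan best dunks </title>
<description> Find the best dunks through the career of Michael
Jordan in NBA. It can be a collection of dunks in matches, or dunk
contest he participated in. </description>
<narrative> A relevant video should contain at least one dunk for
Jordan. </narrative>
"""

def parse_topic(text):
    """Extract the three topic fields and normalise their whitespace."""
    fields = {}
    for name in ("title", "description", "narrative"):
        match = re.search(rf"<{name}>(.*?)</{name}>", text, flags=re.DOTALL)
        fields[name] = " ".join(match.group(1).split()) if match else ""
    return fields

topic = parse_topic(SAMPLE_TOPIC)
```

Only the `title` field would normally be issued as the query; the description and narrative guide the assessors.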

Relevance Assessment

4 indexes created:

Title

Title + Tags

Title + Tags + Description

Title + Tags + Description + Related videos titles

5 different retrieval models used

20 different result lists (4 indexes × 5 retrieval models), each containing 60 documents

Result lists merged with random ranking

122 to 466 documents assessed per topic

1 to 125 relevant documents per topic (avg. = 23)
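The pooling procedure above (take the top 60 documents from each of the 20 result lists, merge them, and present them in random order) could look like the sketch below. The function name, the fixed seed, and the synthetic document IDs are assumptions made for illustration. With 20 lists of 60 documents the pool can hold at most 1,200 documents, so the reported 122 to 466 assessed documents per topic is consistent with heavy overlap between lists.

```python
import random

def build_pool(result_lists, depth=60, seed=0):
    """Merge the top `depth` documents of every list and shuffle the union."""
    pool = set()
    for ranked in result_lists:
        pool.update(ranked[:depth])
    merged = sorted(pool)                 # fixed order so the shuffle is reproducible
    random.Random(seed).shuffle(merged)   # hide the original ranks from assessors
    return merged

# Synthetic example: 20 overlapping ranked lists of 100 document IDs each.
lists = [[f"d{i + j}" for j in range(100)] for i in range(20)]
pool = build_pool(lists)
```

Shuffling removes rank information, so assessors cannot tell which index or retrieval model contributed a document.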

Baseline Experiments

Search 4 different indexes:

Title

Title + Tags

Title + Tags + Description

Title + Tags + Description + Related videos titles

Indri retrieval model used to rank results

1000 results retrieved for each search

Mean average precision (MAP) used to compare the results
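For reference, MAP can be computed as in the sketch below, which uses the standard definition: average precision per topic is the sum of the precision values at each relevant document's rank, divided by the number of relevant documents, and MAP is the mean over topics. The data structures (`runs`, `qrels`) and the toy IDs are illustrative assumptions, not the actual evaluation tooling used in the experiments.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant IDs."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank      # precision at this relevant document
    return total / len(relevant)

def mean_average_precision(runs, qrels):
    """Mean of per-topic average precision over all topics in `qrels`."""
    scores = [average_precision(runs.get(topic, []), rel)
              for topic, rel in qrels.items()]
    return sum(scores) / len(scores)

# Toy data: two topics with made-up document IDs.
runs = {"t1": ["a", "b", "c", "d"], "t2": ["x", "y"]}
qrels = {"t1": {"a", "c"}, "t2": {"y"}}
map_score = mean_average_precision(runs, qrels)
```

Relevant documents that are never retrieved still count in the denominator, which is why retrieving 1000 results per search matters for the score.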

Results

[Chart: MAP (y-axis, 0.00 to 0.45) for each index: Title, Title+Tags, Title+Tags+Desc, All text fields]

Summary (new language resource)

61,340 XML docs

250,000 user profiles

40 topics + rel. assess.

Metadata

Tags

Comments

Ratings

# Views

Videos

IR test set

NER

AH/Personalisation

Reranking using ML

Multimedia processing

Top bigrams in “Tags” field:

Kobe Bryant, NBA Basketball, Lebron James, Michael Jordan, Los Angeles, All Star, Chicago Bulls, Boston Celtics, Angeles Lakers, Slam Dunk, Basketball NBA, Vince Carter, Kevin Garnett, Toronto Raptors, Houston Rockets, Miami Heat, O’Neal, Phoenix Suns, Detroit Pistons, Yao Ming, Amazing Highlights, Pau Gasol, Cleveland Cavaliers, NBA Amazing
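A list of top bigrams in the “Tags” field can be produced by counting adjacent token pairs. The sketch below assumes tags are whitespace-separated tokens, which would explain why both "Los Angeles" and "Angeles Lakers" show up as separate bigrams; the function name and the sample tag strings are illustrative only.

```python
from collections import Counter

def top_tag_bigrams(tag_fields, n=3):
    """Count adjacent token pairs across all tag fields; return the n most frequent."""
    counts = Counter()
    for field in tag_fields:
        tokens = field.split()
        counts.update(zip(tokens, tokens[1:]))
    return [(" ".join(bigram), count) for bigram, count in counts.most_common(n)]

# Made-up tag fields standing in for the "Tags" metadata of three videos.
tag_fields = [
    "Kobe Bryant Los Angeles Lakers",
    "Kobe Bryant NBA Basketball",
    "Los Angeles Lakers",
]
top = top_tag_bigrams(tag_fields)
```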

Questions & Answers

Q: Is this collection available for free?

A: No

Q: Nothing could be provided?

A: Scripts + Topics + Rel. assess. (needs updating)

Q: Any other questions?

A: …

Thank you
