Domain-specific Web Corpora and their Applications

advertisement
Domain-specific Web Corpora
and their Applications
Gregor Erbach
Saarland University
Project COLLATE
(funding: BMBF 01 IN A01 B)
12 July 2002
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
Outline
Part I: Web Corpora
Part II: Applications of Web Corpora
Part III: LT-World Web Corpus
Part IV: Research in COLLATE
12 July 2002
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
Part I: Web Corpora
1.
2.
3.
4.
12 July 2002
Formal Properties of the Web
Web Corpus
Document and Hyperlink Database
TREC web track
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
Formal Properties of the Web
•
•
•
•
•
Hypertext/Hypermedia
Directed graph with cycles
Edges = hyperlinks
Nodes = documents ???
Nodes often have internal tree structure (HTML,
XML)
12 July 2002
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
Web Corpus
A web corpus consists of
• a database of documents
• a database of hyperlinks
12 July 2002
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
Document Database
Information for each document:
– URL/URN
– Full Text (possibly with linguistic annotation such as
POS, named entities, phrases)
– Full Text Index
– Metadata
• Author, Language, Date, MIME type … (Dublin Core)
• Category, Abstract, Keywords, Type of Page …
12 July 2002
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
Fields of Hyperlink Database
• source anchor URL
• source anchor position on web page (percentage)
• source anchor position in document structure (HTML
element path)
• source anchor type (text or image)
• source anchor text and context
• target anchor URL
• target anchor position on web page
• target anchor MIME type
12 July 2002
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
Derived Properties of Hyperlinks
•
•
•
•
•
•
Same document?
Same server?
Same 2nd/3rd level domain?
Ascending of descending in directory structure
Source is within a list of links
Navigation link (up, previous, next …)
12 July 2002
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
TREC web track
• Construction of a web corpus (WT10g) according to the
following criteria:
–
–
–
–
–
–
Broadly representative of web data in general
Many inter-server links
Contains all available pages from a set of servers
Contains an interesting set of meta-data
Contains few binary, non-English or duplicate documents
Size: 10 GB
P. Bailey, N. Craswell and D. Hawking. Engineering a multi-purpose test collection for Web retrieval experiments. IP&M, to appear.
12 July 2002
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
Part II: Applications of Web Corpora
1.
2.
3.
4.
5.
6.
7.
8.
12 July 2002
Web Mining
Information Retrieval
Clustering and Categorisation
Summarisation
Discovery of Relations
Terminology Extraction
Information Extraction
Ontology Learning
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
Useful Methods
•
•
•
•
•
Machine Learning and Data Mining
Natural Language Processing
Information Retrieval
Ontologies and Semantic Web
Bibliometrics (citation analysis ~ link analysis)
12 July 2002
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
Web Mining
• Web Content Mining
– Discovery of terminology, acronyms, concepts
• Web Structure Mining
– Discovery of relations, communities …
• Web Usage Mining
– Discovery of navigation patterns
12 July 2002
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
Information Retrieval
• Usage of hyperlinks for determining popularity of
web pages
• Hub and authority pages
• Widely used: Google PageRank
• Mixed results in TREC web track
Jon M. Kleinberg (1997) Authoritative Sources in a Hyperlinked Environment. Journal of the ACM
Sergey Brin, Lawrence Page (1998) The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks
and ISDN Systems
12 July 2002
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
Clustering
• Standard clustering algorithms form clusters by
iteratively grouping documents/clusters, according
to a distance measure
• Content-based methods measure distance by
counting terms/concepts (often TF/IDF)
• Connectivity-based distance measures make use of
hyperlinks
12 July 2002
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
Categorisation
• Categorisation algorithms determine the
membership of a document in a pre-defined
thematic category
• Content-based categorisation methods measure
distance from a representative of the category
• Connectivity-based distance measures are based
on the assumption that certain types of hyperlinks
lead to documents of the same category
12 July 2002
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
Summarisation / Keyword Extraction
• Source anchor text has been used to generate short
summaries of target web pages.
12 July 2002
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
Discovery of Relations
• Hyperlink structure reflects relations between web
resources (e.g. between personal homepage,
project page, organisation page)
• Relations can be discovered by content-based
methods and by connectivity-based methods
12 July 2002
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
Terminology Extraction
• Content-based: extraction of domain terminology
by statistical analysis (TF/IDF …) and/or phrasal
chunking
• Applicability of connectivity-based methods?
12 July 2002
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
Information Extraction
• Automatic extraction of meta-data
• Extraction of named entities for concept-based
indexing
• Extraction of templates/relations for relation-based
indexing, and question answering
12 July 2002
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
Ontology Learning
• Extraction of candidates by frequency of
occurrence in similar contexts
• Usage of textual clues (“such as”, “sogar” …)
• Applicability of connectivity-based methods?
• Definition and acronym mining
12 July 2002
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
Part III: LT-World Web Corpus
1.
2.
3.
4.
12 July 2002
Content of LT World
Ontology
Hyperlinking within LT World
Construction of the corpus
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
LT World: Idea and Context
• The virtual information center is a comprehensive WWWbased information and knowledge service for the entire
area of language technology.
• LT World is a “virtual” center in the sense that most
information will physically remain with their creators or
with other service providers.
• The virtual information center has been online since
October 2001 under the name „LT World“ for „Language
Technology World“ (www.lt-world.org)
12 July 2002
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
12 July 2002
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
Virtual Information Center - LT World
• Information and Knowledge
– Technical and Scientific Information
• Players and Teams
– Persons, Projects, Organisations
• Resources and Results
– Research Systems, Commercial Products
• Communication and Events
– News, Conferences
12 July 2002
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
LT World Ontology
Layer 3: Ontology for CL & LT
Products
Publi
cations
Projects
People
Corpora
Layer 2: Specific Ontologies
Layer 1: Dublin Core
12 July 2002
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
etc.
LT World Ontology
• Dimensions
–
–
–
–
–
–
–
Linguality (monolingual, multilingual, cross-language)
Application
Computational/mathematical methods
Linguistic Models / Theories
Level of linguistic description/processing
Technologies
Language(s)
• Ontology is modelled in RDF with Protégé 2000
12 July 2002
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
LT World: Coverage
•
•
•
•
•
99 topic nodes
300 NLP tools and products
1800 people
850 organisations
500 projects
12 July 2002
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
Data Acquisition Process
• Manual collection, categorization and annotation
of URLs by students and staff
• Sources: conference proceedings and journals,
lists of links on the web,
• Self-registration and correction of data by users of
the service
• Technical/scientific information in topic nodes has
been provided by domain experts
12 July 2002
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
LT World: Topic Nodes
Topic nodes are the main information unit of the Area
“Knowledge and Information”. They are organized in a
shallow slightly multidimensional hierarchy following the
chapter plan of the second edition of the Language
Technology Survey.
Example of the shallow hierarchy:
Information Extraction
• Named Entity Recognition
• Terminology Extraction
• Relation Extraction
• Answer Extraction
12 July 2002
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
Information for each Topic
• Name
• Acronyms
• aka‘s, Term Translations
• Short Definition
• Overview Article (from HLT Survey)
• Topic Websites
• R&D Prototypes/Products
• Projects
• People
• Literature
12 July 2002
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
Hyperlinking between Sections
12 July 2002
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
Corpus Construction
• Start from URLs in LT-World collection
• Expand document set by recursively following outgoing
hyperlinks using a webspider (e.g., GNU wget)
• Expand document set by following incoming hyperlinks
(“link” query to search engine)
• Expand document set by search engine queries with
domain terminology
• Construct document database and link database
• (Filter out irrelevant documents)
• Publish Corpus
12 July 2002
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
Part IV: Research Directions
Categorisation / Information Extraction
Discovery of Relations for Hyperlinking
Other
12 July 2002
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
Categorisation and Information Extraction
• Research objectives
– find method for categorising documents according to
LT-World ontology
– find method for extraction of meta-information
• Compare and combine content-based and
connectivity-based methods
• If successful, it will contribute to semi-automatic
extension of the coverage of LT-World
12 July 2002
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
Discovery of Relations
• Objective: develop method for finding pairs of
related documents, e.g. personal page –
organisation page.
• Content-based and connectivity-based methods are
applicable
• If successful, it will enable a significant
improvement of LT-World (resource discovery,
resource annotation)
12 July 2002
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
Other
•
•
Objective: compare and combine content-based
and connectivity-based clustering methods
Applications:
1.
2.
3.
4.
5.
12 July 2002
Information Retrieval
Clustering
Summarisation
Terminology Extraction
Ontology Learning
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
Conclusion
• Main research interest: comparison and
combination of content-based and connectivitybased methods
• Main application impact: going from a set of
“seed” web pages to a domain-specific
information system
12 July 2002
Colloquim on Applications of Natural
Langauge Corpora, Saarland University
Download