Information Extraction
Referatsthemen (Presentation Topics)
CIS, LMU München
Winter Semester 2013-2014
Dr. Alexander Fraser, CIS
Information Extraction – Reminder
• Vorlesung (lecture)
• Learn the basics of Information Extraction (IE)
• Klausur (exam) – only on the Vorlesung!
• Seminar
• Deeper understanding of IE topics
• Each student who wants a Schein (course certificate) will have to give a presentation on an IE topic
• 25 minutes (PowerPoint, LaTeX, Mac)
• If two students work together, 40 minutes (each student speaks for 20 minutes)
• Hausarbeit (term paper)
• 6 pages, written out, due 3 weeks after the Referat
• Separate for each student!
Topics
• Topics will be presented in roughly the same order as the related topics are discussed in the Vorlesung
• Most of the topics require you to do a literature search
• There will usually be one article (or maybe two) that you identify as the key source
• If appropriate, please turn in PDF files of the key article and a few other important articles
• There are a few projects involving programming
• These are particularly suitable to be done by two students
Referat
• 25 minutes for one student
• 40 minutes for two
• Start with what the problem is, and why it is interesting to solve it (motivation!)
• It is often useful to present an example and refer to it several times
• Then go into the details
• If appropriate for your topic, do an analysis
• Don't forget to address the disadvantages of the approach as well as the advantages (advantages tend to be what the authors focus on)
• List references and recommend further reading
• Have a conclusion slide!
Information Extraction
Information Extraction (IE) is the process
of extracting structured information
from unstructured machine-readable documents
[Figure: the IE pipeline – Source Selection → Tokenization & Normalization (e.g. 05/01/67 → 1967-05-01) → Named Entity Recognition → Instance Extraction (e.g. Elvis Presley: Singer, Angela Merkel: Politician) → Fact Extraction (e.g. "...married Elvis on 1967-05-01") → and beyond: Ontological Information Extraction]
History of IE
• TOPIC: IE at the Message Understanding Conferences
(MUC)
• These conferences focused on tasks like extracting merger events
from unstructured text
• Discuss problems solved, motivations and techniques
• Survey the literature
Source Selection
• TOPIC: Focused web crawling
• Why use focused web crawling?
• How do focused web crawlers work?
• What are the benefits and disadvantages of focused web
crawling?
• Python: scrapy (see the sketch below)
• Perl: WWW::Mechanize
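To make the crawling side concrete, here is a minimal sketch of a focused crawler using scrapy. The spider name, seed URL and keyword list are invented for illustration, a real focused crawler would replace the keyword test with a trained relevance classifier, and the getall/follow calls assume a reasonably recent Scrapy version.

    # Run with:  scrapy runspider focused_spider.py   (filename is arbitrary)
    import scrapy

    # Keyword list standing in for a learned topic-relevance model
    TOPIC_KEYWORDS = ["information extraction", "named entity", "relation extraction"]

    def looks_relevant(text):
        """Crude relevance test: does the page mention any topic keyword?"""
        text = text.lower()
        return any(kw in text for kw in TOPIC_KEYWORDS)

    class FocusedSpider(scrapy.Spider):
        name = "focused_ie_spider"            # hypothetical spider name
        start_urls = ["http://example.org/"]  # placeholder seed page

        def parse(self, response):
            page_text = " ".join(response.css("::text").getall())
            if looks_relevant(page_text):
                # Keep the page, and only then expand its outgoing links,
                # so the crawl stays close to the topic of interest
                yield {"url": response.url}
                for href in response.css("a::attr(href)").getall():
                    yield response.follow(href, callback=self.parse)

The design question to discuss is exactly this relevance test: how it is obtained and how aggressively it prunes the crawl frontier.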
Source Selection
• TOPIC: Language Identification
• Compare the approaches used
• Such as the compression, dictionary and character histogram approaches (a toy character-histogram sketch follows below)
• Discuss the issue of 8-bit encodings and dealing with code pages
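A toy sketch of the character-histogram approach, assuming single-character frequencies and two made-up training snippets; real systems train on large samples and usually use character n-grams rather than single characters.

    from collections import Counter

    def char_profile(text):
        """Relative character frequencies of a text (its 'character histogram')."""
        counts = Counter(text.lower())
        total = sum(counts.values())
        return {ch: n / total for ch, n in counts.items()}

    def overlap(p, q):
        """Histogram intersection between two character profiles."""
        return sum(min(p.get(ch, 0.0), q.get(ch, 0.0)) for ch in set(p) | set(q))

    # Tiny illustrative training samples – far too small to be realistic
    profiles = {
        "en": char_profile("the quick brown fox jumps over the lazy dog"),
        "de": char_profile("der schnelle braune fuchs springt ueber den faulen hund"),
    }

    def identify(text):
        p = char_profile(text)
        return max(profiles, key=lambda lang: overlap(p, profiles[lang]))

    print(identify("der ausschuss diskutierte den bericht"))  # hopefully 'de'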
Source Selection
• TOPIC: Wrappers
• Wrappers are used to extract tuples (database entries) from
structured web sites
• Discuss the different ways to create wrappers (a toy hand-written wrapper is sketched below)
• Advantages and disadvantages
• How do wrappers deal with changing websites?
• Give some examples of different wrapper creation software
packages and discuss their pros and cons
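A toy hand-written wrapper, to make the idea (and its fragility) concrete. The HTML template below is invented; the extraction pattern is tied to that exact layout and breaks as soon as the site changes it.

    import re

    # Invented page layout: one <tr> per product with name and price cells
    html = """
    <tr><td class="name">USB cable</td><td class="price">3.99</td></tr>
    <tr><td class="name">Keyboard</td><td class="price">17.50</td></tr>
    """

    # The 'wrapper': a pattern tied to this exact template
    row = re.compile(
        r'<td class="name">(?P<name>[^<]+)</td>\s*<td class="price">(?P<price>[^<]+)</td>'
    )

    tuples = [(m.group("name"), float(m.group("price"))) for m in row.finditer(html)]
    print(tuples)  # [('USB cable', 3.99), ('Keyboard', 17.5)]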
Rule-based Named Entity Recognition:
Regular Sets
• TOPIC: Regular Sets in the Proceedings of the European
Parliament (Europarl corpus)
• Annotate regular classes in the Europarl corpora in both English and
German
• Interesting regular classes: report-IDs and dates. Others could include
nationalities/countries or monetary amounts.
• Programming intensive: this will involve writing a lot of regular expressions, and analysis (example patterns are sketched below)
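Two illustrative patterns of the kind this topic calls for. The report-ID shape (something like A5-0123/2000) is an assumption and must be checked against the actual corpus; the date pattern only covers one English format.

    import re

    # Assumed report-ID shape, e.g. "A5-0123/2000" – verify against the corpus!
    REPORT_ID = re.compile(r'\b[A-C]\d-\d{4}/\d{4}\b')

    # English dates such as "15 January 2001"
    MONTH = r'(?:January|February|March|April|May|June|July|August|September|October|November|December)'
    DATE_EN = re.compile(r'\b\d{1,2} ' + MONTH + r' \d{4}\b')

    line = "the report (A5-0123/2000) was adopted on 15 January 2001"
    print(REPORT_ID.findall(line))  # ['A5-0123/2000']
    print(DATE_EN.findall(line))    # ['15 January 2001']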
Named Entity Recognition – Entity Classes
• TOPIC: fine-grained open classes of named entities
• Survey the proposed schemes of fine-grained open classes, such as BBN's
classes used for question answering
• Discuss the advantages and disadvantages of the schemes
• Discuss also the difficulty of human annotation – can humans annotate
these classes reliably?
Named Entity Recognition – Training Data
• TOPIC: Crowd-sourcing with Amazon Mechanical Turk (AMT)
• AMT's motto: artificial artificial intelligence
• Using human annotators to get quick (but low quality) annotations
• What are the pros and cons of this approach? How well do NER systems
perform when trained on this data?
Rule-Based Extraction
• TOPIC: Using Local Grammars for Citation Parsing
• Discuss how to define regular expressions for the different fields in a citation (a small sketch follows below)
• Discuss how these regular expressions are combined in the local grammar
approach
• Present the advantages and disadvantages of this approach to citation
parsing
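A small sketch of the field-regex idea for one invented citation style (Author, Initial. (Year). Title. Journal.). A real local-grammar approach would have many alternative patterns per field and rules for combining them.

    import re

    # Field-level patterns (illustrative, tuned to one citation style only)
    AUTHOR  = r'(?P<author>[A-Z][a-z]+, [A-Z]\.)'
    YEAR    = r'\((?P<year>\d{4})\)'
    TITLE   = r'(?P<title>[^.]+)'
    JOURNAL = r'(?P<journal>[^.]+)'

    # The 'local grammar': fields glued together with punctuation
    CITATION = re.compile(AUTHOR + r' ' + YEAR + r'\. ' + TITLE + r'\. ' + JOURNAL + r'\.')

    m = CITATION.match("Smith, J. (2009). Parsing citations with local grammars. Journal of Examples.")
    if m:
        print(m.groupdict())
        # {'author': 'Smith, J.', 'year': '2009',
        #  'title': 'Parsing citations with local grammars', 'journal': 'Journal of Examples'}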
Learning Rules for Named Entity Recognition
• TOPIC: Learning Rules for Named Entity Recognition
• Discuss how rules can be learned given annotated corpora
• Present the basic algorithms
• Discuss the advantages and disadvantages of these approaches
Named Entity Recognition – Statistical
Model
• TOPIC: Hidden Markov Models for Named Entity Recognition
• Discuss how to formulate named entity recognition in the HMM framework (a toy sketch follows below)
• Discuss the transition model and the emission model – which features are
useful?
• One possible emphasis: what we want the models to be able to do, what
problems they have versus other statistical approaches
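A toy sketch of the HMM formulation with hand-set probabilities: the tag set, vocabulary and numbers are invented, and a real model would estimate transition and emission probabilities from an annotated corpus and decode over the full tag set.

    # Toy HMM for NER: tags O (outside) and PER (person), probabilities invented.
    # P(tags, words) = prod_i P(tag_i | tag_{i-1}) * P(word_i | tag_i)
    START = {"O": 0.8, "PER": 0.2}
    TRANS = {("O", "O"): 0.8, ("O", "PER"): 0.2,
             ("PER", "O"): 0.6, ("PER", "PER"): 0.4}
    EMIT  = {("O", "met"): 0.3, ("O", "elvis"): 0.01, ("O", "yesterday"): 0.3,
             ("PER", "met"): 0.01, ("PER", "elvis"): 0.5, ("PER", "yesterday"): 0.01}
    TAGS = ["O", "PER"]

    def viterbi(words):
        """Most probable tag sequence under the toy HMM."""
        # best[i][t] = (probability, backpointer) of the best path ending in tag t at position i
        best = [{t: (START[t] * EMIT.get((t, words[0]), 1e-6), None) for t in TAGS}]
        for i in range(1, len(words)):
            col = {}
            for t in TAGS:
                p, prev = max(
                    (best[i - 1][s][0] * TRANS[(s, t)] * EMIT.get((t, words[i]), 1e-6), s)
                    for s in TAGS)
                col[t] = (p, prev)
            best.append(col)
        # Trace back from the best final tag
        tag = max(TAGS, key=lambda t: best[-1][t][0])
        path = [tag]
        for i in range(len(words) - 1, 0, -1):
            tag = best[i][tag][1]
            path.append(tag)
        return list(reversed(path))

    print(viterbi(["met", "elvis", "yesterday"]))  # ['O', 'PER', 'O']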
Named Entity Recognition – Statistical
Model
• TOPIC: Structured Perceptron for Named Entity Recognition
• Discuss how to formulate named entity recognition in the structured perceptron framework (a sketch of the update follows below)
• This requires previous exposure to machine learning!
• What does the structured perceptron framework allow you to do versus
the HMM?
• Presentation of important features
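A sketch of the structured perceptron update on one invented example. For readability the argmax is brute force over all tag sequences; a real implementation decodes with Viterbi and uses much richer features (capitalization, suffixes, gazetteers, ...).

    from itertools import product
    from collections import defaultdict

    TAGS = ["O", "PER"]

    def features(words, tags):
        """Global feature vector phi(x, y): emission and transition indicator counts."""
        phi = defaultdict(float)
        prev = "<s>"
        for w, t in zip(words, tags):
            phi[("emit", t, w.lower())] += 1
            phi[("trans", prev, t)] += 1
            prev = t
        return phi

    def decode(words, weights):
        """Brute-force argmax over tag sequences (stand-in for Viterbi)."""
        def score(tags):
            return sum(weights[f] * v for f, v in features(words, tags).items())
        return max(product(TAGS, repeat=len(words)), key=score)

    # One invented training example and a few perceptron epochs
    words, gold = ["yesterday", "elvis", "sang"], ("O", "PER", "O")
    weights = defaultdict(float)
    for epoch in range(5):
        pred = decode(words, weights)
        if pred != gold:
            # Structured perceptron update: reward gold features, penalize predicted ones
            for f, v in features(words, gold).items():
                weights[f] += v
            for f, v in features(words, pred).items():
                weights[f] -= v

    print(decode(words, weights))  # should now reproduce ('O', 'PER', 'O')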
Named Entity Recognition - Supervision
• TOPIC: Lightly Supervised Named Entity Recognition
• Starting from a few examples ("seed examples"), how do you automatically build a named entity classifier?
• This is sometimes referred to as "bootstrapping" (a schematic sketch of the loop follows below)
• What are the problems with this approach – how do you block the process from generalizing too much?
• Analyze the pros and cons of this approach
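A schematic sketch of the bootstrapping loop on a tiny invented corpus. The point is the structure (seeds → contexts → new seeds) and the frequency threshold that tries to keep the process from over-generalizing; real systems score patterns and candidates much more carefully.

    # Tiny invented corpus; in reality this is a large unannotated text collection
    corpus = [
        "singer Elvis Presley performed",
        "singer Whitney Houston performed",
        "chancellor Angela Merkel spoke",
        "singer Elvis Presley spoke",
    ]

    seeds = {"Elvis Presley"}      # seed examples of the target class
    MIN_CONTEXT_COUNT = 2          # crude guard against over-generalization

    for iteration in range(2):
        # 1) Collect left-context words around the current seed names
        contexts = {}
        for sent in corpus:
            words = sent.split()
            for seed in seeds:
                s = seed.split()
                for i in range(1, len(words) - len(s) + 1):
                    if words[i:i + len(s)] == s:
                        left = words[i - 1]
                        contexts[left] = contexts.get(left, 0) + 1
        # 2) Keep only contexts seen often enough with known seeds
        good = {c for c, n in contexts.items() if n >= MIN_CONTEXT_COUNT}
        # 3) Harvest new two-word candidates appearing after a good context word
        for sent in corpus:
            words = sent.split()
            for i, w in enumerate(words[:-2]):
                if w in good:
                    seeds.add(" ".join(words[i + 1:i + 3]))

    print(seeds)   # e.g. {'Elvis Presley', 'Whitney Houston'}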
Named Entity Recognition - Supervision
• TOPIC: Distant supervision for NER
• Related to the bootstrapping idea – but here we are using
information annotated for a different purpose
• How can distant supervision solve the knowledge bottleneck for
NER?
• What are the advantages and disadvantages of this approach?
NER – Toolkit
• TOPIC: Stanford NER Toolkit applied to Europarl
• Apply the Stanford NER Toolkit to the Europarl corpus, and compare the output on English and German (one way to call the toolkit from Python is sketched below)
• How does the model work?
• What are the differences between the English and German annotations of parallel sentences, and where do the models fail?
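One possible way to drive the toolkit from Python is NLTK's Stanford tagger wrapper, sketched below. The paths and model file name are placeholders that depend on the downloaded Stanford NER release (a corresponding German model, if included in the release, can be loaded the same way for the comparison), and the toolkit can of course also be run from its own command line.

    from nltk.tag import StanfordNERTagger

    # Placeholder paths – adjust to wherever the Stanford NER release is unpacked
    st = StanfordNERTagger(
        "stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz",
        "stanford-ner/stanford-ner.jar")

    tokens = "Angela Merkel met Tony Blair in Berlin .".split()
    print(st.tag(tokens))
    # e.g. [('Angela', 'PERSON'), ('Merkel', 'PERSON'), ('met', 'O'), ...]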
NER – Domain Adaptation
• TOPIC: Domain adaptation and failure to adapt
• What is the problem of domain adaptation?
• How is it addressed in statistical classification approaches to NER?
• How well does it work?
NER – Twitter
• TOPIC: Named Entity Recognition of Entities in Twitter
• There has recently been a lot of interest in annotating Twitter
• Which set of classes is annotated? What is used as supervised
training material, how is it adapted from non-Twitter training
sets?
• What are the peculiarities of working on 140-character tweets rather than longer articles?
NER – BIO Domain
• TOPIC: Named Entity Recognition of Biological Entities
• Present a specific named entity recognition problem from the
biology domain
• Which set of classes is annotated? What is used as supervised
training material?
• What are the difficulties of this domain vs. problems like
extraction of company mergers which have been studied longer?
Instance Extraction
• TOPIC: Applying the Stanford Coreference Pipeline to Europarl
• Apply the Stanford Coreference Pipeline to English Europarl data
• How does it work?
• What entities does it annotate well, and less well?
• Can this information be used to translate English "it" to German?
IE for multilingual applications
• TOPIC: Transliteration Mining
• Transliteration mining is the mining of names which are transliterated
from one language to another from a list of word pairs
• Present the task of transliteration mining as done in the NEWS
conferences (here is the 2010 URL)
• http://translit.i2r.a-star.edu.sg/news2010/
• What approaches work well for transliteration mining?
• (Some basic statistical modelling background will be necessary here)
IE for multilingual applications
• TOPIC: Bilingual Terminology Mining
• The problem of bilingual terminology mining from comparable corpora is
the task of finding terms which are translations of each other given their
context
• How is this done using the vector space model vs. the pattern-based approach? (the vector-space idea is sketched below)
• Some familiarity with basic information retrieval will be helpful here
• What are the critical sources of knowledge for this approach?
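A bare-bones sketch of the vector-space (context-vector) idea: represent a term by its co-occurring words, translate that vector word-by-word with a small seed dictionary, and rank target-language candidates by cosine similarity. The corpora, dictionary and terms below are invented toy data.

    from collections import Counter
    from math import sqrt

    # Toy 'comparable corpora' and a tiny seed dictionary (all invented)
    en_corpus = ["the liver produces bile", "the liver filters blood"]
    de_corpus = ["die Leber produziert Galle", "die Leber filtert Blut",
                 "die Lunge atmet Luft"]
    seed_dict = {"produces": "produziert", "filters": "filtert",
                 "bile": "Galle", "blood": "Blut"}

    def context_vector(term, corpus):
        """Bag of co-occurring words in sentences containing the term."""
        vec = Counter()
        for sent in corpus:
            words = sent.split()
            if term in words:
                vec.update(w for w in words if w != term)
        return vec

    def cosine(u, v):
        dot = sum(u[w] * v[w] for w in u)
        norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
        return dot / norm if norm else 0.0

    # Translate the English context vector word-by-word via the seed dictionary
    en_vec = context_vector("liver", en_corpus)
    translated = Counter({seed_dict[w]: c for w, c in en_vec.items() if w in seed_dict})

    # Rank German candidate terms by similarity of their context vectors
    for cand in ["Leber", "Lunge"]:
        print(cand, cosine(translated, context_vector(cand, de_corpus)))
    # 'Leber' should score clearly higher than 'Lunge'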
Choosing a topic
• Any questions?
• I will put these slides on the seminar page later today
• Please email me with your choice of topic
• I'd also be open to using the Wiki, but I don't see how to make that work
• Check the seminar page first to see if your chosen topic is already taken!