ASV - Leipzig eHumanities

Introduction of the ASV Subproject
Report on recent state of work and other activities
eTRACES Sponsor Meeting
Leipzig, 2012/05/07
Marco Büchler
Natural Language Processing Group
Department of Computer Science
University of Leipzig
What do you do with a million books?
We do not have any native speakers for ancient languages like ancient Greek
and Latin ...
Agenda

Scope of ULEI's subproject

Who is involved?

ACID for the eHumanities as a paradigm for successful projects
Basics for ULEI's subproject
A fundamental problem:
How to find relevant information in massive data?
Two initially associated documents
Documents are linked with a direction
Documents are linked with a direction: such as web links
Documents are linked in both directions: A loop
Detecting relevance: a document can be linked from more than one document
Detecting relevance on an entire digital library
Computing relevance weights (by reliability) on an entire digital library
Source: http://en.wikipedia.org/wiki/PageRank
This strategy is known as Google's PageRank algorithm.
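The idea behind these relevance weights can be sketched in a few lines of Python. This is a minimal power-iteration sketch, not Google's production algorithm: the function name `pagerank`, the damping factor of 0.85, the iteration count, and the four-document toy graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Return a relevance weight per document.

    links maps each document to the list of documents it links to
    (hypothetical input format chosen for this sketch).
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every document keeps a small base weight.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # A document passes its weight on to the documents it links to.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A document without outgoing links spreads its weight evenly.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy library: A, B and D all link to C; C links back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for doc in sorted(ranks, key=ranks.get, reverse=True):
    print(doc, round(ranks[doc], 3))
```

In this toy graph, C is linked from three documents and ends up with the highest weight; A is linked only once, but by the highly weighted C, so it ranks above B and D.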
Some aspects of Google's PageRank algorithm
• Ranking is done by relevance weights (weighted links to a page)
• Benefit for humanities applications:
– Ranking does not necessarily need term weights, as tf-idf does
• e. g. Shakespeare's „to be, or not to be“
In humanities data, however, we do not have a link structure like the one in
web-based HTML files.
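One reading of the Shakespeare example is that term weighting gives almost no weight to a quotation made up only of very common words. The sketch below illustrates this with a plain tf-idf computation; the three-sentence corpus and the helper name `tf_idf` are hypothetical and only serve the illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute tf-idf weights per document (whitespace tokens, lowercased)."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents does a term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append(
            {t: (c / len(tokens)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return weights


# Hypothetical toy corpus: "to", "be", "or" and "not" occur in every document.
docs = [
    "to be or not to be that is the question",
    "the king is not to be seen or heard",
    "to sleep or not to be awake",
]
weights = tf_idf(docs)
for term in ["to", "be", "or", "not", "question"]:
    print(term, round(weights[0][term], 3))
```

Because the words of the quotation occur in every document of this corpus, their idf is log(1) = 0, so the quotation itself carries no term weight, while a rarer word such as "question" does.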
A similar problem: Two initial documents with text re-use
Given e. g. dating information: text re-use with direction I
Our assumption:
A quotation always implies that the quoting author considers the quoted
author relevant, whether in a positive or a negative way.
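How dating information turns undirected text re-use into directed links can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the function name `orient_reuse`, the work labels, and the years are hypothetical, and counting incoming links is only the crudest possible relevance measure.

```python
from collections import Counter


def orient_reuse(pairs, dates):
    """Turn undirected re-use pairs into (quoting, quoted) edges.

    The later work is assumed to quote the earlier one. Pairs with a
    missing or identical date are skipped because no direction follows.
    """
    edges = []
    for a, b in pairs:
        if a not in dates or b not in dates or dates[a] == dates[b]:
            continue
        quoting, quoted = (a, b) if dates[a] > dates[b] else (b, a)
        edges.append((quoting, quoted))
    return edges


# Hypothetical works with made-up years (negative = BCE).
dates = {"work_A": -400, "work_B": 100, "work_C": 350}
pairs = [("work_A", "work_B"), ("work_C", "work_A"), ("work_B", "work_C")]

edges = orient_reuse(pairs, dates)
incoming = Counter(quoted for _, quoted in edges)
print(edges)
print(incoming.most_common())
```

The resulting directed edges play the role that hyperlinks play on the web, so a link-based ranking can be run on them.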
Given e. g. dating information: text re-use with direction II
Given e. g. dating information: text re-use with direction III
An old discipline: Text re-use in traditional humanities
Manually produced record of text re-use.
Some research objectives
In addition to Google's PageRank:
– Differentiate by
• Text re-use temperature
• Text re-use coverage
– Relevance by
• high score
• low score
Some answers to the initial questions/statements
What do you do with a million books?
– Cultural heritage of textual re-uses
– Text re-use graphs for something like a „Cultural Heritage aware PageRanking“
We do not have any native speakers for ancient languages like ancient Greek and Latin ...
- Crowdsourcing provides qualitative results on historical texts; humanists, however, are not native speakers
- The „Cultural Heritage aware PageRanking“ approach aims to capture the relevance assigned by native speakers, even though they are no longer available today
Who is involved?
Active collaborators
• eTRACES/ULEI (Prof. Dr. Gerhard Heyer): 'The Team'
• Interface:projects (Dr. Uwe Crenze): 'The business partner'
• Fragmentary texts (Dr. Monica Berti): 'The Humanist'
• Perseus Digital Library (Prof. Dr. Gregory Crane): 'The Content Provider'
„ACID for the eHumanities“
A new paradigm for successful eHumanities projects
• The million dollar question:
How to manage an eHumanities project successfully?
• After 4 years of activities in the eHumanities, you need just four questions:
Acceptance: How do you get humanists' acceptance for your techniques?
Complexity: Understand the complexity of the necessary subtasks! e. g.: What is the archetypus?
Interoperability: How can components or data interact with each other?
Diversity: Understand your data! e. g.: What does text re-use mean for your digital library?
The ACID paradigm for the eHumanities
„ACID for the eHumanities“: Interoperability
„ACID for the eHumanities“: (Data) Interoperability I
Perseus DdbDP (XML) vs. Epiduke (XML)
Source: Pansch, D. (2010). Data Integration Methods for Structural Heterogeneous Data in an eHumanities' Context. Bachelor thesis.
„ACID for the eHumanities“: (Data) Interoperability II
Source: Pansch, D. (2010). Data Integration Methods for Structural Heterogeneous Data in an eHumanities' Context. Bachelor thesis.
„ACID for the eHumanities“: (Data) Interoperability III
• Several kinds of interoperability issues on
– Horizontal:
• Data level
• Algorithm level
• Tool/application level
– Vertical:
• e. g. between data and algorithm
„ACID for the eHumanities“: Diversity
„ACID for the eHumanities“: (Node) Diversity
Understand your data:
Understand the re-used text chunks.
(a knowledge thing)
„ACID for the eHumanities“: (Relation) Diversity
Understand your data:
Understand how text is re-used in your data.
(an experience thing)
„ACID for the eHumanities“: Diversity - 6 levels of text re-use
Text re-use is about unsupervised quotation detection in textual data.
- Level 1: Pre-processing (Cleaned and prepared data)
- Level 2: Featuring (Digital fingerprint of a re-use unit)
- Level 3: Feature selection (Signature of a digital fingerprint)
- Level 4: Linking (Match of re-use units that have at least one feature in common)
- Level 5: Scoring (Weighting of linked re-use units)
- Level 6: Post-processing (e. g. post selection or views that depend on research questions)
Implemented in TRACER (http://etraces.e-humanities.net/TRACER):
- Tool available in 2013
- Teaching courses (full week) are planned for 2013
- More than one million combinations of implementations of the 6 levels are possible (as of 05/2012)
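The six levels above can be sketched as a small pipeline. This is a minimal illustration and not TRACER's implementation: every concrete choice here (word bigrams as features, dropping features that occur in more than 80% of units, Jaccard overlap as score, a score threshold of 0.2, all function names, and the three sample sentences) is an assumption made for the example.

```python
import re
from collections import defaultdict
from itertools import combinations


def preprocess(text):
    """Level 1: lowercase, strip punctuation, split into tokens."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def featurize(tokens):
    """Level 2: fingerprint of a re-use unit, here its set of word bigrams."""
    return {" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)}


def select(features, max_share=0.8):
    """Level 3: drop features that occur in too many units to be distinctive."""
    n = len(features)
    df = defaultdict(int)
    for feats in features.values():
        for f in feats:
            df[f] += 1
    return {u: {f for f in feats if df[f] / n <= max_share}
            for u, feats in features.items()}


def link(features):
    """Level 4: pair up units that share at least one feature."""
    index = defaultdict(set)
    for unit, feats in features.items():
        for f in feats:
            index[f].add(unit)
    pairs = set()
    for units in index.values():
        pairs.update(combinations(sorted(units), 2))
    return pairs


def score(pair, features):
    """Level 5: weight a link by the Jaccard overlap of the two fingerprints."""
    a, b = (features[u] for u in pair)
    return len(a & b) / len(a | b)


def find_reuse(units, min_score=0.2):
    features = select({u: featurize(preprocess(t)) for u, t in units.items()})
    scored = {pair: score(pair, features) for pair in link(features)}
    # Level 6: post-selection, keep only links above the threshold.
    return {pair: s for pair, s in scored.items() if s >= min_score}


units = {
    "u1": "Like will to like, as the old saying goes.",
    "u2": "As the proverb has it: like will to like.",
    "u3": "The swineherd led the beggar to the town.",
}
result = find_reuse(units)
print(result)
```

Each level is a separate function so that an implementation can be swapped without touching the others, which is how a large number of combinations across the six levels can arise.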
„ACID for the eHumanities“: Acceptance
Interdisciplinary collaborations: The problem!
Computer Scientists: Change your view for understanding humanists
How can we gain humanists' acceptance if text mining is a black box
that cannot be looked into?
What we need!
Transparency: How to provide user-friendly insights into
complex mining techniques and machine learning?
Jumping into the mining process: Level 0 – Initial request
Jumping into the mining process: Level 1 - Preprocessing
Jumping into the mining process: Level 2 - Featuring
Jumping into the mining process: Level 3 - Selection
Jumping into the mining process: Level 4 - Linking
Jumping into the mining process: Level 5 - Scoring
„ACID for the eHumanities“: Complexity
„ACID for the eHumanities“: Complexity I
• Archetypus detection means identifying the origin of a thought or a chunk of text
(or at least the earliest occurrence).
• Sentiment (Acceptance) detection means determining whether a text passage is
re-used in a „positive“ or „negative“ way.
An example:
• German: „Gleich und gleich gesellt sich gern.“
• English: „Like will to like.“
„Birds of a feather flock together.“
(„to bring like and like together“)
Question:
With what sentiment would you, or do you, use this phrase in your daily life?
„ACID for the eHumanities“: Complexity II
Hom. Od. 17.215-219:
As he saw them, he spoke and addressed them, and reviled them in terrible and
unseemly words, and stirred the heart of Odysseus: “Lo, now, in very truth the vile leads the
vile. As ever, the god is bringing like and like together. Whither, pray, art thou leading this
filthy wretch, thou miserable swineherd, ...
„ACID for the eHumanities“: Complexity III
• German phrase:
„jemandem aufs Dach steigen“
• English (literally translated):
„to climb onto someone's roof“
• English (semantically translated): „to put someone down“,
„to tell someone off“
• Understanding the example:
– Goes back to a German tradition between the 7th and 12th centuries
– Young men climbed onto the roof of someone who was not following the rules
of the community in order to remove it.
– Happened especially during (German) carnival and on Shrove Tuesday
– There was no legal rule about it ...
– ... in the early Middle Ages, however, this became a fundamental part of early
adaptations of constitutions
„ACID for the eHumanities“: Complexity III
The home is inviolable.
Article 13 of the current German constitution
Focus here: the task of tracing constitutional evolution in different
societies.
„ACID for the eHumanities“: Complexity IV
Article 13: The home is inviolable.
vs.
the court ruling on online surveillance by federal institutions in the context of
terrorism
(translated from the German) ... The interest protected by this fundamental right is
the spatial sphere in which private life unfolds [...]. In addition to private homes,
business and office premises also fall within the scope of protection of Art. 13 GG [...].
The protection of this fundamental right is not limited to warding off physical
intrusion into the home. Measures by which state authorities use special aids to
gain insight into events inside the home that are not accessible to natural
perception from outside the protected area must also be regarded as an interference
with Art. 13 GG. These include not only acoustic or optical surveillance of the
home [...], but also, for example, the measurement of electromagnetic emissions by
which the use of an information technology system inside the home can be monitored.
This can also concern a system that works offline. ...
Decision on online surveillance by the German government
Source: http://www.bundesverfassungsgericht.de/entscheidungen/rs20080227_1bvr037007.html
„ACID for the eHumanities“: Complexity of text re-use research
Complex tasks strongly depend on collaboration!
Google group for Historical Text Re-use:
http://groups.google.com/group/historical-text-re-use
Summary

Scope of ULEI's subproject

Who is involved?

ACID for the eHumanities as a paradigm for successful projects

From mission to vision
eTRACES/ASV: 'The team'
Gerhard Heyer
Marco Büchler
Maria Moritz
Petra Gamrath
Christian Kötteritzsch
Frederik Baumgardt
Thomas Efer