DASISH WEB-ANNOTATOR

TLA
This document specifies a browser extension for annotating web-documents. We present
the class structure of the implementation, describe the functionality from the user
perspective and define the REST API.
Document version: 1.0
Date: 9 February 2016
Authors: Olha Shkaravska, Przemek Lenkiewicz, Menzo Windhouwer, Twan Goosen, Daan Broeder
Max Planck Institute for Psycholinguistics Nijmegen
Table of Contents
Technical Summary
Terminology and Notations
Class Schema
Initial Annotation-Body Types
User Interface prototype
    Main window view
    Context menu
REST API
    Annotations
        api/annotations
        api/annotations/<aid>
    Sources
        api/sources
Appendix I
    Text Note (to be implemented in the first prototype)
    Color
    Tags: unary relations R(A), where A is the annotated fragment of a source
    Typed tags: unary relations RA(B), where A is a typed tag and B is the
    annotated source
    Binary relations R(A, B), where A and B are the annotated fragments of a
    source
Technical Summary.
The aim of this document is to specify a web-annotation tool to be developed
within the DASISH project. The tool is a browser extension that allows users to
annotate fragments of web documents with tags, colors and text notes. The
annotatable fragments may be texts and, at later stages of development,
graphical objects as well.
Initially the tool will support annotating web pages only. Later we plan to
extend it to web documents generated by linguistic software, e.g. EAF files
created by ELAN (MPI Nijmegen) or lexical entries created by LEXUS (MPI
Nijmegen). We do not want to limit annotatable objects to those generated by
DASISH participants, and we plan to include external linguistic software in our
case study.
The heart of the class schema of the project is the class “Annotation”. An
object of class Annotation is in the “target” relation with one or more objects
of class “Source”. The semantics of an Annotation object is defined in its
attribute “Body”. There are several types of annotation bodies, which cover a
variety of ways to annotate documents, from marking fragments with simple text
tags or colors to attaching arbitrary text notes.
Terminology and Notations
User        A person, a group, or “everyone” (public).
<uid>       Principal identifier (a principal is either a user or a group).
<aid>       Annotation identifier.
<aoid>      URI of an annotatable object outside the DB.
<vid>       Version identifier, which may be a number, a time stamp or both,
            depending on the origin of the document.
<fid>       A string that describes a fragment within a given document.
            Examples: <xpath> for XML documents, coordinates for graphics.
<sid>       The identifier of an annotated source: <aoid>@<vid>#<fid>.
            If <fid> is empty then <sid> refers to the whole document given by
            <aoid>@<vid>. The default <vid> (when omitted) corresponds to the
            latest version. The abbreviation <sid> stands for “source
            identifier”.
<datetime>  Date and time, including time zone, as defined in
            http://www.w3.org/TR/xmlschema-2/#dateTime
<cid>       Cached Representation identifier.
<URI>       URI, as defined in http://tools.ietf.org/html/rfc3986
<prefix>    The prefix of a namespace.
<text>      Some text.
An example of <sid> is given by the URI
http://tla.mpi.nl/#xpointer(//div[id='post-1157']/p/substring(.,33,3))
Here the part http://tla.mpi.nl/ is an <aoid> and the part
xpointer(//div[id='post-1157']/p/substring(.,33,3)) is a <fid>. Since <vid> is not
given, the <sid> refers to the latest version of the resource located at
http://tla.mpi.nl/ .
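The decomposition of a <sid> described above can be illustrated with a minimal Python sketch. The names SourceId and parse_sid are hypothetical and not part of the specification. The sketch assumes that the first “#” starts the <fid> and that the <aoid> itself contains no “@”; URIs with a user-info part would need more careful handling.

```python
from typing import NamedTuple, Optional


class SourceId(NamedTuple):
    """Hypothetical container for the three parts of a <sid>."""
    aoid: str            # URI of the annotatable object
    vid: Optional[str]   # version identifier; None means "latest version"
    fid: str             # fragment descriptor; "" means the whole document


def parse_sid(sid: str) -> SourceId:
    """Split a <sid> of the form <aoid>@<vid>#<fid> into its parts.

    Assumptions (not stated in the specification): the first "#" starts
    the <fid>, and the <aoid> contains no "@" character.
    """
    # Everything after the first "#" is the fragment descriptor.
    head, _, fid = sid.partition("#")
    # The last "@" in the remaining part separates <aoid> from <vid>.
    aoid, sep, vid = head.rpartition("@")
    if not sep:
        # No "@": the version is omitted, so the latest version is meant.
        return SourceId(aoid=head, vid=None, fid=fid)
    return SourceId(aoid=aoid, vid=vid, fid=fid)
```

Applied to the example above, parse_sid returns http://tla.mpi.nl/ as the <aoid>, no <vid>, and the xpointer expression as the <fid>.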
<uid> is not mentioned explicitly below as a parameter in the description of the
REST service, because it is known from the session via the “Shibboleth”
identification procedure.
An owner is either the principal who has created the annotation or a principal to
whom the ownership has been assigned.
Class Schema




The schema is based on the following interfaces and classes:
• class Source represents (a specific fragment of) a specific version of an
  annotatable object; it contains information about this version, such as a
  time stamp and the list of references to cached representations;
• class Annotation contains the reference to the annotation’s body (which
  contains the list of sources that it annotates), as well as the name of the
  owner and the lists of “readers” and “writers”;
• interface Cached representation is a generic interface to be implemented
  by the different representations of annotatable resources, such as
  serialized ones (e.g. as XML), media files and screenshots;
• interface Body (of annotation) can be text, “like”, color, relation, etc.;
  it contains the reference to the annotation.
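The class structure listed above can be sketched in Python as follows. This is only an illustration under assumptions: all field and method names (mime_type, describe, targets and so on) are hypothetical, the list of target sources is kept directly on the annotation for brevity, and the back-reference from a body to its annotation is omitted.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List, Optional


class CachedRepresentation(ABC):
    """Generic interface for stored copies of an annotatable resource."""

    @abstractmethod
    def mime_type(self) -> str:
        ...


class Body(ABC):
    """Generic interface for annotation bodies (note, color, relation, ...)."""

    @abstractmethod
    def describe(self) -> str:
        ...


@dataclass
class Source:
    """(A fragment of) a specific version of an annotatable object."""
    aoid: str                        # URI of the annotatable object
    fid: str = ""                    # empty: the whole document
    vid: Optional[str] = None        # None: the latest version
    timestamp: Optional[str] = None  # time stamp of this version
    cached: List[CachedRepresentation] = field(default_factory=list)


@dataclass
class Note(Body):
    """The only body type planned for the first prototype."""
    text: str

    def describe(self) -> str:
        return self.text


@dataclass
class Annotation:
    owner: str
    body: Body
    targets: List[Source]            # simplification: kept on the annotation
    readers: List[str] = field(default_factory=list)
    writers: List[str] = field(default_factory=list)
```

A single-target text note would then be built as Annotation(owner="alice", body=Note("typo here"), targets=[Source(aoid="http://tla.mpi.nl/")]), where "alice" is a made-up principal.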
We propose the following XML-serializations.
Source
Note that the MIME type for MHTML is taken from Wikipedia [1], but there seems
to be some discussion about this approach [2].
[1] http://en.wikipedia.org/wiki/MHTML
[2] http://stackoverflow.com/questions/31250/content-type-for-mht-files
An annotation whose body is a binary relation
(in this example “implies”)
The intended meaning of the following example is that source1 implies source2.
An annotation whose body is “Note”
(see the section about the types of annotations)
Note that “full” XML representations as above may be returned by the
corresponding GET methods. When we POST a new annotation, we know less about
it: for instance, it does not have an assigned identifier yet. We propose the
following serialization of a new annotation:
Initial Annotation-Body Types
In the first prototype we plan to implement only single-target annotations with
the body type “Note”. From the user’s perspective they are just text notes about
fragments of the document, à la comments in Word documents, but displayed only
in a list or as a tooltip (as Wired-Marker currently does). Balloon display as
in MS Word can be implemented at a later stage.
In general we plan to implement the body types following the class diagram
above. Recall that these body types, besides “Note”, are: color, tag (a unary
relation), labeled tag (a unary relation with parameters) and binary relation.
Below we present a series of instances of these body types. Implementing these
instances within our tool will have a two-fold effect:
• first, it will serve the user’s convenience by providing a drop-down menu
  of annotations once a fragment to be annotated is selected;
• second, it will show that reasonable types of annotations can be created
  within the proposed class schema.
To create an annotation, the user highlights the text and right-clicks the
mouse. The creation menu should appear near the highlighted text (or on the
right sub-panel of the whole panel). There the user can select the type of
annotation and add other parameters when necessary. It may be possible to
highlight the second fragment for binary relations using Shift(s).
For existing annotations, a left mouse click on the highlighted text triggers a
“callout” (or a rectangular box connected to the text fragment) with a short
annotation description. This applies to tags and relations (see below). A right
mouse click on the highlighted text triggers the context menu, which contains
the complete information about the annotation: its author, date and URI.
User Interface prototype
Main window view:
Context menu:
REST API
Remark on document versioning. Web documents exist in time, that is,
different versions of a document may exist under the same URI (<aoid>) at
different moments in time. In the first prototype we implement only the simplest
necessary handling of the versions of a web document: we omit REST requests
concerning versions and rely on local caching of old versions of annotated
sources (which already exists as a feature in Wired-Marker).
All information necessary to fulfill a PUT, POST or DELETE request, such as the
URI of an annotated object, is given serialized in the request body, not as
request parameters in the request’s URI. If a POST (PUT, DELETE) request
succeeds, it returns serialized information about the added (resp. updated,
removed) resource together with a standard HTTP response code. The information
includes the resource ID, the owner’s ID, a time stamp and (possibly) the list
of the <sid>s of the target sources. For the full information the user can GET
the just created or updated annotation, since its ID is now known. In the case
of failure the corresponding error message and error status are returned, e.g.
401 Unauthorized. Only the “owner” has DELETE rights.
Annotations
api/annotations
Resource:
    GET api/annotations?source=<URI>
        &text=<text>
        &access=[read, write]
        &ns=<prefix>:<ns>
        &xpath=<xpath>
        &owner=<uid>
        &after=<datetime1>
        &before=<datetime2>
Description:
    Returns the list of <aid>s of the annotations of the annotated object
    located at <URI> to which the logged-in <uid> has “read” (resp. “write”)
    access and whose bodies contain the text <text>. Moreover, these
    annotations were created between <datetime1> and <datetime2>.
    If the parameter “source” is omitted, all annotated objects to which <uid>
    has “read”/“write” access are considered. The parameter xpath allows
    searching over parts of the annotation body, e.g. <xpath> may be
    body[@type='relation']/relation='contradiction'.
    For this one needs the URIs of the namespaces <ns> represented by the
    prefixes <prefix>.
    The default <xpath> is empty and implies no limitation. The default
    <datetime1> can be 01 Jan 1970, 00:00. The default <datetime2> is today.

Resource:
    POST api/annotations
Description:
    Adds a new annotation, taking its XML serialization from the request body.
    The XML serialization should include the URIs of the annotated objects and
    the annotation body (e.g. text).
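As an illustration of the GET request above, the following Python sketch assembles the request URI from the optional search parameters. The function name annotations_query and the base URL used in the usage note are hypothetical; the parameter names are the ones listed in the table. Omitted parameters are simply left out, so the server-side defaults apply.

```python
from typing import Optional
from urllib.parse import urlencode


def annotations_query(base_url: str,
                      source: Optional[str] = None,
                      text: Optional[str] = None,
                      access: Optional[str] = None,
                      ns: Optional[str] = None,
                      xpath: Optional[str] = None,
                      owner: Optional[str] = None,
                      after: Optional[str] = None,
                      before: Optional[str] = None) -> str:
    """Build the URI of a GET api/annotations request (hypothetical helper)."""
    candidates = [
        ("source", source), ("text", text), ("access", access),
        ("ns", ns), ("xpath", xpath), ("owner", owner),
        ("after", after), ("before", before),
    ]
    # Keep only the parameters that were actually supplied.
    params = [(name, value) for name, value in candidates if value is not None]
    url = base_url.rstrip("/") + "/api/annotations"
    # urlencode percent-encodes reserved characters such as ":" and "/".
    return url + "?" + urlencode(params) if params else url
```

For example, annotations_query("https://annotator.example", source="http://tla.mpi.nl/", access="read") asks for all annotations of that page that the logged-in user may read; the host name is made up.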
api/annotations/<aid>
It is assumed that if the logged-in user <uid> has no “read” access to <aid>,
then GET methods on URIs of the form api/annotations/<aid>[/…] return the error
status 401 Unauthorized, or similar. The same happens if the logged-in user
<uid> has no “write” access to <aid> and uses PUT, POST or DELETE methods on
URIs of the form api/annotations/<aid>[/…].
The table below describes the behavior of each pair (method, URI) when user
<uid> has authorized access to <aid>. Here “authorized access” means that
<uid> has “read” access for GET methods, and “write” access for PUT, POST and
DELETE methods.
Resource:
    GET api/annotations/<aid>
Description:
    Returns the serialized annotation that has this <aid>.

Resource:
    DELETE api/annotations/<aid>
Description:
    Removes <aid> and all its target sources from the database. Returns the
    serialized representation of the removed <aid> with the message “the
    following annotation has been removed” or similar.

Resource:
    PUT api/annotations/<aid>
Description:
    Updates the annotation with <aid>. E.g. it is used when <uid> wants to
    correct typos in the annotation body AND change annotated fragments. (See
    PUT api/annotations/<aid>/body for correcting the body only.) The
    serialized representation of the updated annotation is given in the
    request body.

Resource:
    GET api/annotations/<aid>/body
Description:
    Returns the body of the <aid>.

Resource:
    PUT api/annotations/<aid>/body
Description:
    Updates the body of the annotation <aid>. Used e.g. for correcting typos
    in the text part. The updated annotation’s body is given in the body of
    the request.

Resource:
    GET api/annotations/<aid>/sources
Description:
    Returns the list of the <sid>s of all the target sources of <aid>.
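The access rule stated before the table (GET needs “read” access, PUT, POST and DELETE need “write” access, otherwise status 401 is returned) can be sketched as a small check. The function names are hypothetical, and the sketch assumes that “read” and “write” are granted independently, which the specification does not state.

```python
from typing import Set

READ_METHODS = frozenset({"GET"})
WRITE_METHODS = frozenset({"PUT", "POST", "DELETE"})


def required_access(method: str) -> str:
    """Return the access level that a request method needs on an <aid>."""
    name = method.upper()
    if name in READ_METHODS:
        return "read"
    if name in WRITE_METHODS:
        return "write"
    raise ValueError("unsupported method: " + method)


def response_status(method: str, granted: Set[str]) -> int:
    """Return 200 when the user may proceed, 401 otherwise.

    `granted` is the set of access levels the logged-in <uid> holds on
    the annotation. Assumption: "write" does not imply "read".
    """
    return 200 if required_access(method) in granted else 401
```

A user holding only “read” access would thus get 200 for GET api/annotations/<aid> and 401 for PUT on the same URI.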
Sources
A source represents (a specific fragment of) a specific version of an annotatable
object. For instance, if an annotatable object is a web page that has 3 versions
and users have annotated versions 1 and 3, then there are 2 sources in the
database that correspond to the web page. Naturally, these sources represent
versions 1 and 3.
Note that access to the whole document with <aoid> is possible via its
<sid>=<aoid>#, with an empty fragment descriptor.
Adding sources to the database and removing them is the responsibility of the
database management system. In fact, adding a source is a “side effect” of
creating an annotation on a certain URI. Moreover, if the source with
<sid>=<aoid>@<vid>#XXX is added to the DB, then the source
<sid>=<aoid>@<vid># must be added as well, unless it is already in the DB.
If all the annotations that refer to a certain source are deleted, then the
database management system deletes this source from the DB. A read-only REST
API for inspecting sources (incl. fragments) is needed.
Cached representations are managed by the client, therefore a creation and
deletion API is necessary. It is possible to store the cached representation not
only of the fragment precisely corresponding to an annotation target source, but
also of a larger fragment and even of the entire annotatable object.
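The bookkeeping rules for sources described above can be sketched with a toy in-memory registry. SourceRegistry and its methods are hypothetical names, not part of the specification. One point is an assumption of this sketch: the text does not say when a whole-document source that was added as a side effect should be removed, so here it is dropped once no annotation and no fragment source of that document version remains.

```python
from typing import Dict, Iterable, Set


def whole_document_sid(sid: str) -> str:
    """Return <aoid>@<vid># for a given <aoid>@<vid>#<fid>."""
    return sid.split("#", 1)[0] + "#"


class SourceRegistry:
    """Toy stand-in for the database's source bookkeeping."""

    def __init__(self) -> None:
        # Maps each <sid> to the <aid>s of the annotations that target it.
        self._annotations_by_sid: Dict[str, Set[str]] = {}

    def add_annotation(self, aid: str, sids: Iterable[str]) -> None:
        for sid in sids:
            # Adding a source is a side effect of creating an annotation.
            self._annotations_by_sid.setdefault(sid, set()).add(aid)
            # The whole-document source must be present as well.
            self._annotations_by_sid.setdefault(whole_document_sid(sid), set())

    def remove_annotation(self, aid: str) -> None:
        for refs in self._annotations_by_sid.values():
            refs.discard(aid)
        # Delete fragment sources that no annotation refers to any more.
        for sid in [s for s, refs in self._annotations_by_sid.items()
                    if not refs and s != whole_document_sid(s)]:
            del self._annotations_by_sid[sid]
        # Assumption: an unreferenced whole-document source is kept only
        # while some fragment source of the same document version exists.
        in_use = {whole_document_sid(s) for s in self._annotations_by_sid
                  if s != whole_document_sid(s)}
        for sid in [s for s, refs in self._annotations_by_sid.items()
                    if not refs and s not in in_use]:
            del self._annotations_by_sid[sid]

    def sids(self) -> Set[str]:
        return set(self._annotations_by_sid)
```

Annotating the fragment http://x/@1#p1 registers both that source and http://x/@1#; removing the last annotation on the page removes both again.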
api/sources
Resource:
    GET api/sources?uri=<aoid>
        &maxSources=<number>
Description:
    Returns the list of the <sid>s of all the sources referring to <aoid>,
    that is the sources with <sid>s of the form <aoid>@XXX#YYY.
    The length of the list is bounded by <number>. A default length
    (maxSources value) must be provided. Alternatively or additionally, one
    may use paging to list the sources.
    Instead of ?uri=<aoid> it may be possible to use other ways of scoping the
    request GET api/sources, for instance ?uriprefix=URI.

Resource:
    GET api/sources/<sid>/versions
    (not to be implemented in the first prototype)
Description:
    Returns the list of the <sid>s (URIs) of all the “sibling” versions of
    <sid>=<aoid>@XXX#YYY, that is the list of <sid>s of the form
    <sid>=<aoid>@ZZZ#YYY.

Resource:
    GET api/sources/<sid>/cached
Description:
    Returns the list of meta-information of all the cached representations of
    <sid>. The meta-information of a cached representation includes: <cid>,
    MIME type, subtype (e.g. “screenshot”), size, and the ID of the tool which
    opens the representation.

Resource:
    GET api/sources/<sid>/cached/<cid>
Description:
    Returns the file that is the cached representation with <cid>, if it
    exists.

Resource:
    POST api/sources/<sid>/cached
Description:
    Adds a new cached representation of <sid>, taking the cached
    representation from the request body.
    It is a multipart POST, with the request body consisting of a description
    containing the metadata specified by the Cached Representation realization
    class, e.g. screenshot, and a single file (multiple files must be
    archived). The description has the following form:
    <cachedrepresentation-description>
    <mime>multipart/related</mime>
    <tool>ToolID01</tool>
    <type>MHTML</type>
    </cachedrepresentation-description>

Resource:
    DELETE api/sources/<sid>/cached/<cid>
Description:
    Removes the cached representation <cid> given in the body of the request
    from the list of cached representations of the <sid>. It is removed from
    the database as well if there are no more references to this
    representation.
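The description part of the multipart POST to api/sources/<sid>/cached could be produced as in the following Python sketch. The element names are those of the example given in the table; the function name and its parameters are hypothetical, and the multipart packaging of the description together with the file is not shown.

```python
import xml.etree.ElementTree as ET


def cached_representation_description(mime: str, tool: str, kind: str) -> str:
    """Serialize the metadata of a cached representation (hypothetical helper).

    `kind` fills the <type> element, e.g. "MHTML".
    """
    root = ET.Element("cachedrepresentation-description")
    ET.SubElement(root, "mime").text = mime
    ET.SubElement(root, "tool").text = tool
    ET.SubElement(root, "type").text = kind
    # encoding="unicode" yields a str without an XML declaration.
    return ET.tostring(root, encoding="unicode")
```

Calling cached_representation_description("multipart/related", "ToolID01", "MHTML") reproduces the example description on a single line.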
Appendix I
This appendix lists foreseen annotation types that may appear in the final
version of the tool. They are still under discussion and will have to be checked
against example use cases.
Text Note (to be implemented in the first prototype)
It is an arbitrary text with which a user can annotate the fragment(s) of the document,
for instance:
[…] In Europe if the climate in a country is nice then the government is
a disaster.[...]
[…] that the economical situation in Europe will stabilize by the end of
2013 […]
Color
From the user’s perspective this seems to be the simplest type of annotation.
One simply highlights the fragment(s) of a text and marks them with a color
chosen from the color menu. For instance:
[…] the conclusions from this book […]
Tags: unary relations R(A), where A is the annotated fragment of a source
Each tag is listed with its name, its logical meaning (if applicable) and an
example: a sketch for the user interface with callouts appearing with a left
mouse click on the highlighted annotated text.

Name: False statement
Logical meaning: Not A
Example: […] Some people think that the climate of the Netherlands is
Mediterranean but this is an illusion […]

Name: True statement
Logical meaning: A
Example: […] He is not sure that Claude Monet is a French impressionist. […]

Name: I disagree
Logical meaning: I ⇒ Not A
Example: […] Some people still claim that Nazi Germany lost WWII because of
cold winters in Russia. […]

Name: I agree
Logical meaning: I ⇒ A
Example: […] In Europe if the climate in a country is nice then the government
is a disaster. […]

Name: Doubtful
Logical meaning: ?
Example: […] that the economical situation in Europe will stabilize by the end
of 2013 […]
Typed tags: unary relations RA(B),
where A is a typed tag and B is the annotated source
A typical example of such annotation is given by tagging the citation by its
author. Here the citation is a source, author (e.g. Karl Marx) is a tag of type
“Author”. The full list of typed tags to be implemented is given in the table
below.
Each typed tag is listed with its name, its logical meaning (if applicable) and
an example: a sketch for the user interface with callouts appearing with a left
mouse click on the highlighted annotated text.

Name: A is an author
Logical meaning: A is the author of B
Example: […] when “The Capital” is written […]

Name: A is a reference
Logical meaning: A is the referred book (article, text, etc.) B
Example: […] the conclusions from this book […]

Logical meaning: A is the book (article, etc.), where B is published
Example: […] The history of all hitherto existing society is the history of
class struggles […]

Name: A is an URI
Logical meaning: A is an URI where the source (book, article, text, etc.) of B
can be found
Example: […] The history of all hitherto existing society is the history of
class struggles […]

Name: Discuss with the person A
Logical meaning: B must be discussed with A
Example: […] the conclusions from this book […]
Binary relations R(A, B) ,
where A and B are the annotated fragments of a source
Each relation is listed with its name, its logical meaning (if applicable) and
an example: a sketch for the user interface, where a callout (or rectangular
box) points to BOTH arguments.

Name: Implies
Logical meaning: A ⇒ B
Example: […] The manager thinks that we will finish development by the end of
May. […] There will be a conference in June where we could make a
presentation. […]

Name: Equivalent
Logical meaning: A == B
Example: […] The manager thinks that we will finish development by the end of
May. […] that we fix all the bugs listed as tickets by the end of May. […]

Name: Implies the opposite
Logical meaning: A ⇒ Not B
Example: […] The manager thinks that we will finish development by the end of
May. […] The customer expects that we are done by the begin of May […]

Name: Contradicts
Logical meaning: A == Not B
Example: […] Another manager thinks that we will not finish development by the
end of May. […] We fix all the bugs listed as
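The logical meanings of the four binary relations can be made concrete with a small truth-function sketch in Python. The dictionary name RELATIONS is hypothetical. Reading “Implies” as material implication and “Equivalent” as equality of truth values is our assumption; the specification only gives the formulas informally.

```python
from typing import Callable, Dict

# Each relation maps the truth values of fragments A and B to the truth
# value of the relation itself (assumed reading of the table).
RELATIONS: Dict[str, Callable[[bool, bool], bool]] = {
    "Implies": lambda a, b: (not a) or b,                       # A => B
    "Equivalent": lambda a, b: a == b,                          # A == B
    "Implies the opposite": lambda a, b: (not a) or (not b),    # A => Not B
    "Contradicts": lambda a, b: a == (not b),                   # A == Not B
}
```

For instance, RELATIONS["Contradicts"](True, False) holds, while RELATIONS["Implies"](True, False) does not.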