Research Challenge 3 - EDBT Summer School 2015

EDBT Summer School 2015
Research Challenge 3
Building a knowledge base for a specific domain
Speaker: Joan Guisado-Gámez
Context and motivation:
A knowledge base is a source of structured or unstructured information used by computer
systems. Many systems have used Wikipedia because it offers a wide source of knowledge.
This is a good solution in the scenario of a general system that may need data from many
different domains. However, for a domain-specific system it may not be the best solution, both in
terms of result quality and computation time.
Knowledge bases are composed of different entities, each representing a different and unique
concept within the knowledge base. Each entity contains structured or unstructured data that
describe that concept. However, knowledge bases are useful not only because of the data that they
contain, but also because of the maze of relations among the different entities. This makes graph
structures the natural way of representing them.
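As a minimal sketch of this graph view, the following Python snippet stores entities as nodes and relations as directed, labeled edges. The class name KnowledgeGraph, its methods, and the relation label "links_to" are illustrative assumptions, not part of the challenge material; a graph library such as NetworkX could take its place.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Toy knowledge base: entities are nodes, relations are directed labeled edges."""

    def __init__(self):
        self.entities = {}                 # entity_id -> dict of attributes
        self.out_edges = defaultdict(set)  # entity_id -> {(relation, target_id)}

    def add_entity(self, entity_id, **attributes):
        # Create the entity if needed and merge in the given attributes.
        self.entities.setdefault(entity_id, {}).update(attributes)

    def add_relation(self, source_id, relation, target_id):
        # Require both endpoints so that every edge connects known entities.
        if source_id not in self.entities or target_id not in self.entities:
            raise KeyError("both endpoints must be added as entities first")
        self.out_edges[source_id].add((relation, target_id))

    def neighbors(self, entity_id, relation=None):
        # Targets reachable in one step, optionally filtered by relation label.
        edges = self.out_edges.get(entity_id, ())
        return sorted(t for r, t in edges if relation is None or r == relation)


# Usage with made-up entities (hypothetical ids and titles).
kg = KnowledgeGraph()
kg.add_entity(1, title="Contract")
kg.add_entity(2, title="Tort")
kg.add_relation(1, "links_to", 2)
print(kg.neighbors(1))  # [2]
```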
Research/Design challenges
In this challenge we propose to design and develop, either individually or in teams, a method
that, based on Wikipedia, is capable of building a specific knowledge base for a given domain.
In particular, we propose to build a knowledge base for Legal and Legislative Resources.
This challenge entails, not necessarily in this order: a) selecting the Wikipedia entities that are
important for the specific domain (contraction), and b) identifying new entities that do not exist
in Wikipedia but arise from the domain-specific dataset (expansion).
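One possible way to approach the two steps, sketched under strong assumptions: contraction keeps the articles filed under a seed category or one of its subcategories up to a fixed depth, and expansion flags domain terms that match no Wikipedia title. The function names, the max_depth cut-off, the parent-to-children category mapping, and the case-insensitive title match are all hypothetical choices made for illustration, not the method prescribed by the challenge.

```python
from collections import deque


def domain_categories(seed_ids, subcategories, max_depth=3):
    """Breadth-first walk from the seed categories, at most max_depth levels down.

    subcategories maps a category id to the set of its direct subcategories
    (assumed direction: parent -> children).
    """
    seen = set(seed_ids)
    queue = deque((cat, 0) for cat in seed_ids)
    while queue:
        cat, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not descend below the depth limit
        for child in subcategories.get(cat, ()):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return seen


def contract(article_categories, kept_categories):
    """Contraction: keep articles that belong to at least one kept category."""
    return {a for a, cats in article_categories.items() if cats & kept_categories}


def expand(domain_terms, wikipedia_titles):
    """Expansion: domain terms with no case-insensitive match among the titles."""
    known = {title.casefold() for title in wikipedia_titles}
    return {term for term in domain_terms if term.casefold() not in known}


# Toy data (made up): category 1 plays the role of the domain seed.
subcats = {1: {2, 3}, 2: {4}, 4: {5}}
kept = domain_categories([1], subcats, max_depth=2)
articles = contract({10: {2}, 11: {9}, 12: {4, 9}}, kept)
new_terms = expand({"Habeas corpus", "Example clause"}, ["Habeas Corpus", "Tort"])
```

The depth limit matters because Wikipedia's category structure is not a clean tree: walking it without a bound tends to drift into unrelated topics, so the cut-off is a parameter worth tuning and reporting.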
Linked Data solutions may be a good starting point. Linked Data is a method of publishing
structured datasets in a way that they can be interlinked. Thus, it allows data from different
sources to be connected and queried. This is not exactly the situation described here, and yet it
may be useful.
To evaluate this exercise we will take into account:
• The design of a prototype that builds a law-specific knowledge base.
• A good proof of functionality that shows a) new law-specific entities that do not
exist in Wikipedia, b) the relations among those entities and c) the relations between
the new entities and those that already existed in Wikipedia. Statistics on excluded
entities, new entities and new relations are useful.
• The efficiency, in terms of computation time.
• The originality of the proposed method.
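The statistics mentioned in the evaluation criteria can be obtained with plain set differences. The sketch below assumes that entities are represented by ids and relations by (source, target) pairs; the function name build_statistics and the dictionary keys are hypothetical.

```python
def build_statistics(wiki_entities, wiki_relations, kb_entities, kb_relations):
    """Compare the Wikipedia graph with the domain knowledge base built from it."""
    return {
        # Wikipedia entities dropped during contraction.
        "excluded_entities": len(wiki_entities - kb_entities),
        # Entities present only in the domain knowledge base.
        "new_entities": len(kb_entities - wiki_entities),
        # Relations present only in the domain knowledge base.
        "new_relations": len(kb_relations - wiki_relations),
    }


# Toy example with made-up ids.
stats = build_statistics(
    wiki_entities={1, 2, 3},
    wiki_relations={(1, 2), (2, 3)},
    kb_entities={1, 2, "n1"},
    kb_relations={(1, 2), (1, "n1")},
)
```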
Technical prerequisites of participants:
• Basic programming skills in any language the student is familiar with, e.g. Python or Java.
• Recommended:
o Familiarity with a graph library such as Sparksee, Neo4j, Python NetworkX or SNAP.
o Familiarity with an entity linking library such as Apache Stanbol.
Technical support provided to participants:
• English Wikipedia data: (www.sparsitytechnologies.com/downloads/WikipediaDump)
o articles_ids.csv: File (csv): article_id,article_title
o articles_links.csv: File (csv): article_id_from; article_id_to
o articles_body.csv: File (csv): article_id;article_body
o articles_redirects.csv: File (csv): article_id_from; article_id_to
o categories_ids.csv: File (csv): category_id, category_name
o article_category.csv: File (csv): article_id_from,category_id_to
o categories_relations.csv: File (csv): category_id_from, category_id_to
• Legal Knowledge Bases / Ontologies – Not mandatory to use them:
o MetaLex Ontology (in Dutch). CEN MetaLex standardizes the way in which
sources of law and references to sources of law are to be represented in
XML.
Dumps (http://doc.metalex.eu/):
  • RDF serialization of all regulations as Linked Data. This dataset does
  not contain the text of the regulations.
  (http://doc.metalex.eu/static/rdf_daily_dump.tgz)
  • XML representations of all regulations as CEN MetaLex. This dataset
  does not contain metadata of the regulations.
  (http://doc.metalex.eu/static/xml_daily_dump.tgz)
o EuroVoc is a multilingual, multidisciplinary thesaurus covering the activities
of the EU, the European Parliament in particular. It contains terms in 22 EU
languages. EuroVoc is managed by the EU Publications Office, which has moved
to ontology-based thesaurus management.
  • http://datahub.io/dataset/eurovoc-in-skos
  (There are no resolvable URIs or dumps, as the Publications Office reserves the
  exclusive right to machine-readable representations of the thesaurus.)
o Any other source of legal information you may find on the Internet.
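To get started with the Wikipedia CSV files listed above, a small loader such as the following sketch may help. The listing shows both "," and ";" as separators, so the delimiter is passed explicitly. Whether the files contain header rows or quoted fields is not stated; this sketch assumes they do not, assumes UTF-8 encoding, and splits each line only at the first delimiter so that titles or bodies containing the separator stay intact. The function names read_pairs and load_adjacency are hypothetical.

```python
from collections import defaultdict


def read_pairs(path, delimiter):
    """Yield (first_field, rest_of_line) pairs from a two-column text file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            left, sep, right = line.rstrip("\r\n").partition(delimiter)
            if sep:  # skip blank or malformed lines that lack the delimiter
                yield left.strip(), right.strip()


def load_adjacency(path, delimiter):
    """Build source_id -> set of target_ids from a file of id pairs."""
    adjacency = defaultdict(set)
    for source, target in read_pairs(path, delimiter):
        adjacency[source].add(target)
    return adjacency
```

With the files downloaded locally, a call such as load_adjacency("articles_links.csv", ";") would give the article link graph, and dict(read_pairs("articles_ids.csv", ",")) would give an id-to-title mapping. Ids are kept as strings here; converting them to integers is a possible memory optimization.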