EDBT Summer School 2015
Research Challenge 3
Building a knowledge base for a specific domain
Speaker: Joan Guisado-Gámez

Context and motivation:
A knowledge base is a source of structured or unstructured information used by computer systems. Many systems have used Wikipedia because it offers a wide source of knowledge. This is a good solution for a general system that may need data from many different domains. However, for a domain-specific system it may not be the best solution, in terms of both result quality and computation time.

Knowledge bases are composed of different entities, each representing a different and unique concept within the knowledge base. Each entity contains structured or unstructured data that describe that concept. However, knowledge bases are useful not only for the data they contain, but also for the maze of relations among the different entities. This makes graph structures the natural way of representing them.

Research/Design challenges
In this challenge we propose to design and develop, either individually or in teams, a method that, based on Wikipedia, is capable of building a specific knowledge base for a given domain. In particular, we propose to build a knowledge base for Legal and Legislative Resources. This challenge entails, not necessarily in this order:
a) selecting the important entities of Wikipedia for the specific domain (contraction);
b) identifying new entities that do not exist in Wikipedia but arise from the domain-specific dataset (expansion).

Linked Data solutions may be a good starting point. Linked Data is a method of publishing structured datasets in a way that they can be interlinked. Thus, it allows data from different sources to be connected and queried. This is not exactly the situation described here, and yet it may be useful.

To evaluate this exercise we will take into account:
- The design of a prototype that builds a law-specific knowledge base.
- A good proof of functionality that shows a) new law-specific entities that do not exist in Wikipedia, b) the relations among those entities and c) the relations among the new entities and those that already existed in Wikipedia. Statistics on excluded entities, new entities and new relations are useful.
- The efficiency, in terms of computation time.
- The originality of the proposed method.

Technical prerequisites of participants:
- Basic programming skills in any language the student is familiar with, e.g. Python, Java, etc.
- Recommended:
  o Familiarity with a graph library such as Sparksee, Neo4j, NetworkX (Python) or SNAP.
  o Familiarity with an entity linking library such as Apache Stanbol.

Technical support provided to participants:
- English Wikipedia data (www.sparsitytechnologies.com/downloads/WikipediaDump):
  o articles_ids.csv: File (csv): article_id,article_title
  o articles_links.csv: File (csv): article_id_from; article_id_to
  o articles_body.csv: File (csv): article_id;article_body
  o articles_redirects.csv: File (csv): article_id_from; article_id_to
  o categories_ids.csv: File (csv): category_id, category_name
  o article_category.csv: File (csv): article_id_from,category_id_to
  o categories_relations.csv: File (csv): category_id_from, category_id_to
- Legal Knowledge Bases / Ontologies (not mandatory to use them):
  o MetaLex Ontology (in Dutch). CEN MetaLex standardizes the way in which sources of law and references to sources of law are to be represented in XML.
    Dumps (http://doc.metalex.eu/):
    - RDF serialization of all regulations as Linked Data. This dataset does not contain the text of the regulations. (http://doc.metalex.eu/static/rdf_daily_dump.tgz)
    - XML representations of all regulations as CEN MetaLex. This dataset does not contain metadata of the regulations. (http://doc.metalex.eu/static/xml_daily_dump.tgz)
  o EuroVoc is a multilingual, multidisciplinary thesaurus covering the activities of the EU, the European Parliament in particular.
    It contains terms in 22 EU languages. EuroVoc is managed by the EU Publications Office, which has moved to ontology-based thesaurus management. http://datahub.io/dataset/eurovoc-in-skos (there are no resolvable URIs or dumps, as the Publications Office reserves the exclusive right to machine-readable representations of the thesaurus.)
  o Any other source of legal information you may find on the Internet.
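
As an illustration only, one possible starting point for the contraction step is to walk the Wikipedia category hierarchy downwards from one or more legal seed categories and keep the articles filed under the categories reached. The sketch below is not the expected solution. It assumes that each row of categories_relations.csv reads (subcategory, parent category), which the file description does not confirm, and it works on in-memory lists of ID pairs rather than on the files themselves, since the delimiters differ between files. All IDs, the function names and the max_depth parameter are hypothetical. A depth limit is used because category graphs can contain cycles and tend to drift away from the seed topic.

```python
from collections import defaultdict, deque


def descendant_categories(category_edges, seed_ids, max_depth=3):
    """Breadth-first walk from the seed categories down to their subcategories.

    category_edges: iterable of (category_id_from, category_id_to) pairs.
    Assumption: category_id_from is a subcategory of category_id_to.
    """
    children = defaultdict(set)
    for child, parent in category_edges:
        children[parent].add(child)

    depth = {seed: 0 for seed in seed_ids}  # also serves as the visited set
    queue = deque(seed_ids)
    while queue:
        cat = queue.popleft()
        if depth[cat] == max_depth:
            continue  # do not expand beyond the depth limit
        for sub in children[cat]:
            if sub not in depth:  # guards against cycles
                depth[sub] = depth[cat] + 1
                queue.append(sub)
    return set(depth)


def contract(article_categories, article_links, domain_categories):
    """Keep articles filed under a domain category and the links among them."""
    kept = {art for art, cat in article_categories if cat in domain_categories}
    kept_links = [(src, dst) for src, dst in article_links
                  if src in kept and dst in kept]
    return kept, kept_links


# Toy data with hypothetical IDs; category 1 plays the role of a "Law" seed.
category_edges = [(2, 1), (3, 2), (4, 3), (9, 8)]
article_categories = [(100, 1), (101, 3), (102, 9), (103, 4)]
article_links = [(100, 101), (101, 102), (103, 100)]

domain = descendant_categories(category_edges, [1], max_depth=2)
kept, kept_links = contract(article_categories, article_links, domain)

# Statistics of the kind the evaluation criteria ask for.
all_articles = {art for art, _ in article_categories}
print("kept articles:", len(kept))
print("excluded articles:", len(all_articles - kept))
print("kept links:", len(kept_links))
```

Category membership alone is a coarse filter: it can both miss relevant articles and admit unrelated ones, so a real prototype would probably combine it with other signals such as link structure or article text. The expansion step is not covered by this sketch.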