Research Problems in Semantic Web Search
____________________________
Varish Mulwad

Agenda
____________________________
• Introduction
• Swoogle
• Swoogle’s Competition
    • Sindice
    • Semantic Web Search Engine (SWSE)
    • Watson
    • Falcon
• Research Problems and Issues with Swoogle
• References

Introduction
____________________________
[Figure labels: Your Agent, Web, Dr. Finin’s FOAF Profile]
Possible because data is in machine-understandable form, such as RDF and OWL.
But how will an agent find all this data? Search engines?

[Figure: Traditional Search Engine Results vs. Semantic Web Search Engine Results]

Swoogle
____________________________
• Swoogle is a crawler-based indexing and retrieval system for the Semantic Web
• Swoogle crawls and discovers documents written in RDF and OWL
• Swoogle classifies a Semantic Web Document (SWD) as one of:
    • Semantic Web Ontology (SWO): defines new terms
    • Semantic Web Database (SWDB): makes assertions about individuals

Swoogle demo

Swoogle Architecture
____________________________
SWD Discovery Component
• Google crawler using the Google web service
    • Finds files with the extensions “.rdf”, “.owl”, “.n3”
    • Google returns at most 1000 results per query
• A focussed crawler
    • Crawls documents within a given website
    • Extension and focus constraints
• A Swoogle crawler
    • Jena-based crawler
    • Explores semantic links between SWDs

Metadata Creation
• Basic metadata
    • Encoding: “RDF/XML”, “N-Triple”, “N3”
    • Language: RDF, RDFS, OWL, DAML+OIL
    • OWL species: OWL-LITE, OWL-DL, OWL-FULL
• Relations among SWDs
    • Reference relationships among SWDs
    • Inter-ontology relationships

Data Analysis Component
• Classification of an SWD as SWO or SWDB
• Computation of the rank of an SWD

Web-Based Interface
• Human user interface: http://swoogle.umbc.edu
• Web Services using REST
  interface
• Agent Service

Sindice
____________________________
• Created at the Digital Enterprise Research Institute (DERI)
• Key features of Sindice include:
    • Sindice collects SWDs and indexes them on resource URIs, Inverse Functional Properties (IFPs), and keywords
    • Sindice uses the Hadoop parallel architecture

Inverse Functional Property (IFP): an OWL global cardinality constraint on a property; a given value identifies at most one subject (e.g. foaf:mbox)

Sindice uses three indexes:
• URI index
• IFP index
• Keyword index
Benefit: faster retrieval of data

The Hadoop architecture is used as follows:
• Sindice employs Hadoop/Nutch to distribute the crawling job across multiple machines
• Collected data is stored in the HBase distributed column-based store
• Large datasets are handled efficiently across the cluster using a MapReduce implementation

Sindice demo

SWSE
____________________________
• The Semantic Web Search Engine (SWSE) is another Semantic Web search engine created at DERI
• SWSE uses a “Multicrawler”: a pipelined architecture for crawling

Watson
____________________________
• Created at the Knowledge Media Institute (KMi) at the Open University, UK
• Major design principles:
    • Considers explicit and implicit relations between ontologies
    • Ranks ontologies with a focus on quality over popularity

Watson demo

Falcon
____________________________
• Falcon is a Semantic Web search engine created at the Institute of Web Science in China
• Falcon allows keyword-based queries on:
    • Objects
    • Concepts
    • Documents
• Falcon performs class subsumption reasoning

Falcon demo

Summary
____________________________
Swoogle
• Keyword-based search
• Searches ontologies and instance data

Others
• Sindice
    • Indexes on URI, IFP, keywords
    • Use of Hadoop architecture
• SWSE
    • Pipelined architecture for
      Crawling
• Watson
    • Implicit relations between SWDs
• Falcon
    • Class subsumption reasoning

Issues
____________________________
Crawling
• Swoogle’s crawler runs as a single thread on one machine
• This limits the number of SWDs discovered and revisited

Possible solutions
• Use of the Hadoop architecture
• Use of Grub

Other Issues
____________________________
• Crawling large structured datasets like DBpedia
• More reasoning
• More services

References
____________________________
• Li Ding et al., “Swoogle: A Search and Metadata Engine for the Semantic Web”, Proceedings of the Thirteenth ACM Conference on Information and Knowledge Management, November 2004.
• P. Mika, G. Tummarello, “Web Semantics in the Clouds”, IEEE Intelligent Systems, 23(5), September 2008.
• E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, G. Tummarello, “Sindice.com: A Document-Oriented Lookup Index for Open Linked Data”, International Journal of Metadata, Semantics and Ontologies, 3(1), 2008.
• Mathieu d’Aquin et al., “Watson: A Gateway for the Semantic Web”, Poster session of the European Semantic Web Conference (ESWC), 2007.
• Gong Cheng, Weiyi Ge, Honghan Wu, Yuzhong Qu, “Searching Semantic Web Objects Based on Class Hierarchies”, WWW 2008 Workshop on Linked Data on the Web, 2008.

Questions?
____________________________
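
Returning to the crawling issue: the sketch below illustrates, under stated assumptions, what moving from a single-threaded crawler to parallel fetching might look like. It is not Swoogle's crawler and it does not use Hadoop or Grub; it swaps in a plain thread pool on one machine as a stand-in for distributing the work. The function names (`looks_like_swd`, `crawl`), the `fetch_links` callback, and the example.org URLs are all hypothetical; only the three file extensions come from the SWD discovery slide.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Set

# Extensions listed on the SWD discovery slide.
SWD_EXTENSIONS = (".rdf", ".owl", ".n3")


def looks_like_swd(url: str) -> bool:
    """Extension constraint: keep only URLs ending in a known SWD extension."""
    return url.lower().endswith(SWD_EXTENSIONS)


def crawl(seeds: Iterable[str],
          fetch_links: Callable[[str], List[str]],
          workers: int = 4,
          max_rounds: int = 10) -> Set[str]:
    """Breadth-first crawl that fetches each frontier in parallel.

    fetch_links is a hypothetical callback returning the URLs linked from a
    document; a real crawler would download and parse the document here.
    """
    seen: Set[str] = {u for u in seeds if looks_like_swd(u)}
    frontier = sorted(seen)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(max_rounds):
            if not frontier:
                break
            # pool.map runs fetch_links concurrently and preserves input order.
            results = pool.map(fetch_links, frontier)
            next_frontier = []
            for links in results:
                for link in links:
                    if looks_like_swd(link) and link not in seen:
                        seen.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return seen


# Toy in-memory link graph standing in for the Web (made-up URLs).
TOY_WEB: Dict[str, List[str]] = {
    "http://example.org/a.rdf": ["http://example.org/b.owl",
                                 "http://example.org/page.html"],
    "http://example.org/b.owl": ["http://example.org/c.n3",
                                 "http://example.org/a.rdf"],
    "http://example.org/c.n3": [],
}

found = crawl(["http://example.org/a.rdf"], lambda u: TOY_WEB.get(u, []))
print(sorted(found))
```

The set of seen URLs is only updated in the main thread, after each round of fetches returns, so the worker threads need no locking. A cluster-based approach such as the Hadoop/Nutch setup described for Sindice would partition the frontier across machines instead of threads.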