2020 International Conference on Computer Communication and Informatics (ICCCI -2020), Jan. 22-24, 2020, Coimbatore, INDIA A Comparative Study of Semantic Search Systems Dr. Chetana Gavankar chetana.gavankar@pilani.bits-pilani.ac.in Taniya Bhosale taniya.bhosale0@gmail.com Anindita Chavan dhanashreegunda98@gmail.com Shafa Hassan aninditachavan@gmail.com shafa.hassan112@gmail.com individuals. Semantic web, instead of word matching, will be able to show related items thus showing new relationships. Todays search involves fetching documents that match the given words and phrases. But with semantic web, instead of performing word matching, it will be able to show how different things relate to each other. We have come across two types of semantic search systems so far, semantic search engine using semantic resources to improve search results and search systems that search across semantic resources [2], [3], [4], [5], [6]. Of these, the semantic search engines are the ones that seek to understand the query given by the user and gives relevant results. Example- Hakia, Duckduckgo, Lexxe. They go beyond keyword matching and try to find meaning in the queries. Whereas searching across semantic web resources is the kind of technique where the query is searched across ontologies in OWL and RDF formats. Example- Swoogle, Watson, Bioportal, Falcons. In this paper, Section II explains today’s web and it’s drawbacks. Section IIIgives a brief understanding of existing Semantic search engines and search engines based on Semantic Data. It explains the working methodologies, features and few of the drawbacks. Section IV concludes the review on the various search engines. Abstract— Today’s internet consists of mostly unstructured data, most of it being unusable for average users. With increase in the number of smart devices that are getting access to the web, we have a large set of unlinked data that is not able to communicate. Indirectly, it can be said that the Web is broken. Semantic Web focuses on making the meaning explicit instead of fetching results with the help of word matching. Semantic Web is an extension to the current Web that provides an easier way to find, share, reuse and combine information. In this paper, we are presenting an analysis of the different approaches taken by various semantic web search engines and the comparison between them, thus identifying the advantages and limitation of each search engine. I. INTRODUCTION The Semantic Web is an enhancement to the existing web which focuses on giving a well defined meaning to the information, helping computers and people to work together in liaison. One of the major challenges of presenting information on the web was that web applications were not able to provide context to the data and therefore could not differentiate between relevant and irrelevant information. According to Tim Berners Lee, [1] the Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation. The Semantic Web will bring structure to the meaningful content of Web pages, creating an environment where software agents roaming from page to page can readily carry out sophisticated tasks for users. Semantic Web involves two terms: Semantic Markup and Web Services. A web service is a software system which is designed to support interaction between computers over the internet. Semantic markup refers to the communication gap between web users and computer-ized applications. Semantic web will involve a collaboration of both semantic markup as well as web services thus giving various applications a prospective to communicate with other applications and perform broader searches for information through simpler interfaces. The present web can be characterized as the syntactic web where data fetching and presentation is done by the machines, and filtering and identifying important information is designated to the 978-1-7281-4514-3/20/$31.00 c 2020 IEEE Dhanashree Gunda II. TODAY’S WEB AND IT’S DRAWBACKS Today, the Web provides us a platform through which information can be shared easily and everyone has the ability to write websites. HTML is used for programming the information or structuring of a Web page and connecting to other Web pages or resources with the help of hyper-links. A combined result of all of this is that the Web is expanding at a very fast pace. However, majority of the web pages are designed in such a way that they can be interpreted by the humans alone and cannot be processed by the machines. While fetching data, computers perform the job of site scraping which is decoding the colour schema and links that are encoded within the Web pages. The search engines lack the ability to actually interpret the result and then present it to the user. With increase of data on the Web,this situation is getting worse day by day. Most users only go through the top results provided by the search engines thus, neglecting the later ones. As the size of search results goes on increasing progressively, it has become hard for humans to interpret such huge amounts of information making the task of finding relevant information on the Web more difficult than desired. 1 Authorized licensed use limited to: BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE. Downloaded on April 20,2022 at 07:10:33 UTC from IEEE Xplore. Restrictions apply. 2020 International Conference on Computer Communication and Informatics (ICCCI -2020), Jan. 22-24, 2020, Coimbatore, INDIA The conclusion is that the Web has developed as a platform for information exchange between the people instead of the machines. The meaning, i.e. the semantic content of the Web page is encoded in such a way that it is useful by human intervention and interpretation only. used in search engine optimization as it uses a tool to sift through all the HTML code and various scripts of a page in order to determine all the links present and whether they are active or dead. This information allows the analyst to determine whether the search engine has the ability to find and index a particular website. In OntoRank, two concepts are considered as reference relationship if and only if a relationship exists between the two classes in a relation set. The reference relations are directional and transitive. With more transitive steps, the reference relationship is weakened gradually. The link strength between two ontologies depends on the number of interreference concepts among ontology sets and reference strength. The main challenge behind using link analyze method to compute ontology importance is the lack of explicit links among ontologies, which results in the difference in ontology and web page, and demands the searching of implicit relationships between ontologies and hence has interoperability issues. Thus, the drawback of OntoRank algorithm is that it evaluates the rank of an ontology statistically and does not take into consideration the user query as an effective factor for ranking the results.[10] An example of Swoogle search is shown below in Figure 1. A. Limitations of Today’s Web 1) The huge amount of results that the search engines list out have a very low recall accuracy. For example, if the user has to search for the web pages where ”nuclear” and ”science” occur, the resulting information would be of very little use and the user will be overwhelmed with huge amount of results. Also this may not be even relevant to the search request. 2) The results that are fetched by the search engines are vocabulary sensitive. For example, a particular user wants to search for ”TCP/IP protocol”; and there are some useful web pages that have the word ’standard’ instead of protocol. Therefore, the Web pages that have been listed as a result of the search performed will not be the best match for a search that has been made with the use of keyword protocol. 3) The search results yields us a list of Web pages containing relevant information. It often happens that multiple entries are present for the same Web Page. Thus, if relevant information is getting distributed in more that one result, it would be very difficult to determine the set of all relevant entries. These shortcomings of todays Web poses the need of a Web where more focus is given on the semantics of the data. This would help establish new relationships among entities and yield precise results for queries that are even more complex that the ones we use today. III. SEARCH ENGINES A. Search across Semantic Resources 1) SWOOGLE:1Swoogle[7] is a semantic web search en-gine that searches for Web documents in RDF and OWL.[8] Swoogle performs indexing using a web crawler with which the search engine crawls through one page at a time, until all the pages have been indexed. This helps in collecting information about a web page like it’s metadata using which the search engine computes relationships between documents. Web Crawler also gives Swoogle the ability to keep track of URLs which have already been downloaded to avoid downloading the same page again. Swoogle not only identifies ontologies according to user specific search queries but also uses OntoRank algorithm to rank these ontologies on the basis of their popularity.[9] Identical to the PageRank algorithm, OntoRank also makes use of link analysis for evaluation of ontologies. Link analysis is a popular tech-nique used in data analysis to evaluate the relationships or connections between various types of objects. It is a kind of knowledge discovery which helps in better analysis especially in context of links. Link analysis is commonly 1 Fig. 1. SWOOGLE 2) BIOPORTAL:Bioportal2 is an ontology search engine [2] which has been developed by the National Center for Bio-medical Ontology. It serves as a repository for Biomedical Ontologies. Bioportal defines relationships among different domains of existing ontologies and also between the on-tologies. The Open Biomedical Resources(OBR) component automatically indexes the online biomedical data sets. The data sets are indexed on the basis of metadata annotations. It links the data sets to the terms in the ontologies. This helps in establishing semantic relationships among the entities and mapping of ontologies. Ontologies in Bioportal are repre-sented in different semantic web languages like Web Ontol-ogy Language(OWL), Open Biological and Biomedical On-tologies(OBO) and Resourse Description Framework(RDF). Mayo Clinics Lex Grid system is used to store ontologies in OBO Format and to access standard biomedical termi-nologies. Protege frame language is the back end for OWL http://swoogle.umbc.edu/2006/ 2 https://bioportal.bioontology.org/ 2 Authorized licensed use limited to: BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE. Downloaded on April 20,2022 at 07:10:33 UTC from IEEE Xplore. Restrictions apply. 2020 International Conference on Computer Communication and Informatics (ICCCI -2020), Jan. 22-24, 2020, Coimbatore, INDIA and RDF ontologies. One of the key features of Bioportal is that the users can browse, search, download and update the existing ontologies.[2] They can also upload their own ontologies, add notes and edit ontologies according to the comments and also make suggestions to the ontology devel-opers. Users can also browse, create and upload mappings between the ontologies. In this way they can actively con-tribute to bioportal which increases its value and this feature distinguishes bioportal from other ontology repositories. At present, Bioportal has around 773 ontologies and 9,118,651 classes. Bioportal displays the ontology class hierarchy in a tree structure and also has different visualization methods for showing links between different classes. A snapshot of heart failure ontology is shown below in Figure 2. the functionality of searching in between and among the ontologies and semantic documents, retrieving metadata and metrics on ontologies, entities and reuse of ontologies. The architecture of Watson is shown in Figure 3. Fig. 3. WATSON ARCHITECTURE 4) FALCONS: Falcons [11] is used for semantic search over the web. It has been developed mainly for concept sharing and ontology reusing. It is an ontology search based on keywords that returns the concepts/ontologies whose textual description matches with the terms given in the keyword query. It ranks the results based on the relevance of the query and popularity of the concepts. The popularity of a concept is measured depending on a large set of data which is collected from the Semantic Web. In Falcons, each concept Fig. 2. BIOPORTAL 3) WATSON:Watson[3] is a search engine which collects, analyses and gives access to ontologies. It is a search engine which works on specific type of documents. It provides numerous functionalities to applications with a set of API’s. It finds, explores and locates semantic documents. Watson performs three functionalities:[3] 1) Collects the semantic data on the Web. 2) Implements different query methods to access the data. 3) Analyses the data to extract useful metadata and in-dexes. The semantic documents are located through a tracking and crawling component called Heritrix. The validation and analysis component indexes the documents using the Apache Lucene Indexing technology. The crawler also explores new repositories to locate documents written in ontology lan-guages. The collected semantic documents are recrawled in order to find evolutions of known semantic data or new elements. They are then filtered and only those documents whose content characterizes the semantic web are kept. The documents which cannot be parsed by the crawler are eliminated by Jena. All the documents in the form of RDF are extracted except the documents in RSS. Watson also ex-tracts metadata from the collected ontologies. Two ontologies which are different in nature can have the same URI. To re-solve this issue, Watson uses internal identifiers which differ from the URI’s of collected semantic documents. Watson can be used for querying complex keywords using SPARQL. It summarizes the entire description of the ontologies so that it is easily understandable by the users. Watson provides Fig. 4. Falcons Search Engine returned as a result is associated with a query relevant snippet giving an idea of how the concept is matched with the given question and also briefly gives its meaning. Also, the system recommends to the user, numerous other query related popular ontologies that can be used by the user to restrict the results to a specific ontology. Thus, Falcons is a search engine parsing through the semantic data and giving results in the form of concepts along with their context as structured snippets. Along with that, Falcons also provides a detailed RDF description of each concept and a summary of each ontology on demand. As shown in the figure 4, the query ’student university’ will return results in the form of ontologies. The lower area of the page gives the different concepts returned. For every concept the first line gives its name and type. It also gives an option for clicking on the name to browse its RDF description in detail. After that we can see a structured snippet that shows which part of the 3 Authorized licensed use limited to: BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE. Downloaded on April 20,2022 at 07:10:33 UTC from IEEE Xplore. Restrictions apply. 2020 International Conference on Computer Communication and Informatics (ICCCI -2020), Jan. 22-24, 2020, Coimbatore, INDIA RDF description of the concept matches with the terms in the keyword query. After the snippet, the URI is given followed by a number that signifies in how many RDF documents this concept is mentioned and with links to further browse these documents if needed. The upper part of the result page gives ontologies related to the query and the user can choose to further restrict the search to these ontologies by selecting them. , the engine understands how the words are formed as phrases so that the relevant information is extracted. The technologies used include Part-of-speech Tagging, Parsing and word sense disambiguation. Each semantic key acts as a category for a set of keywords. For example, the semantic key ”Ferrari colour” refers to any number of colours. The ability to use semantic keys is of very little use when the user does not know the exact terms he is looking for. In the Ferrari example, Lexxe provides statistics of the most popular colours of Ferrari along with their percentages. By replacing the semantic key ”colour” with ”price” provides user with the information about the cost estimation. Lexxe is currently not in use. The snapshot of Lexxe search engine is shown below in Figure 6. B. Semantic Search Engines 1) HAKIA: Hakia 3 is a search engine [12] that provides results based on meaning matching. It uses Natural Language processing for returning the search results. Hakia searches within structured text. It is a private search engine that is designed to provide search results based on the meaning of the content rather than on the basis of page popularity. Hakia uses QDEX for query detection and extracting inverted text index. Ontosem is Hakia’s linguistic database repository in which words are categorized based on different meanings. It makes use of natural language processing to access full meaning of the text it handles. It is composed of a set of language independent ontologies consisting of thousands of interrelated concepts. A snapshot of Hakia search is shown in Figure 5. Fig. 6. LEXXE 3) SENSEBOT:5Sensebot[14] is a new kind of search engine in which a search query gives out results in the form of a text summary instead of a collection of web links. This summary works as a condensation on the topic of your search query, merging together the most important and pertinent aspects of the search results. The parsing of Web pages is done using text mining that identifies the key concepts inside the web page. This is followed by a multi-document summarizing on the content to produce a coherent summary. For instance, take the example of the term dot-com-bubble. As you can see in the figure 7, We type ”dot-com bubble” into Google and receive plenty of links with some context around them. But looking at all those links does not give a clear idea as to what exactly dot-com-bubble is. And so we start by clicking on each link one by one. However, some pages assume that we know the basics of the topic while others are too shallow, biased or just links to other resources instead of giving an accurate answer to our question. By the time we comprehend what ’dot-com bubble’ is, we have tapped on an excessive number of links and filtered through an excessive amount of repetitive data. SenseBot generates a short abstract of the top pages returned by the engine. Reading this summary will give us a good idea as to what ’dot-com bubble’ is and understand all the basics of it, and then we can decide if we want to further go into the details of it. SenseBot will skip the sources that are not extremely helpful on the topic applicable to our subject, regardless of Fig. 5. HAKIA 2) LEXXE:4On the basis of the requirement space, Semantic Search Engines are classified into four categories : Search Environment, Intrinsic Problems and Iterative and Exploratory, Search requirement and Query Type. Lexxe[13] comes under the Query Type category. It is a search engine that uses natural language processing on its semantic search technology. It was designed to answer short queries by gathering content from the unstructured data on the internet. The user can specify different parameters in the form of questions and then Lexxe automatically extracts expected results from the web pages. Hence, Lexxe can answer wide range of questions. Lexxe does not work like a traditional search engine which presents us the answers from manually prepared databases. Instead, it categorizes the results into three parts: answers, clusters and web page snippets. The fetching part uses computational linguistics to eliminate ir-relevant content. With the help of phrase recognition method 3 http://www.hakia.com/ 4 http://www.lexxe.com/ 5 http://www.sensebot.com/ 4 Authorized licensed use limited to: BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE. Downloaded on April 20,2022 at 07:10:33 UTC from IEEE Xplore. Restrictions apply. 2020 International Conference on Computer Communication and Informatics (ICCCI -2020), Jan. 22-24, 2020, Coimbatore, INDIA how they were ranked by the search engines. Usually, any kind of search where the client is trying to comprehend a concept or a specific area of learning; or get a digest on an individual, an occasion, or an activity; or rapidly recognize the best source of data on the topic, will profit greatly from SenseBot. Greater the diversity of the views on the subject , higher would be the advantage of seeing their summary instead of concentrating on each view independently. 3) 4) 5) 6) Calculation detection. Does not save IP addresses of the users. It fetches the same result irrespective of the user. Keeps user’s search keywords private from the web-sites that are visited as a result of the search. 7) It operates on the data compiled from more than 400 sources such as Bing, Yahoo and also it’s own Web crawler known as DuckDuckBot. 8) Avails the use of cookies when required. 9) Makes use of a variety of open source technologies to provide personalized search results. 5) EXALEAD: 7Exalead offers functionalities such as searching within the results, proximity search, regular ex-pressions and phonetic search. The search engine provides access to over 8 billion Web pages and 1 billion online images. Exalead also offers the functionality of image search-ing which is useful for comparing and classifying those images. It helps users to limit the image search results by providing a combination of LTU software and Exalead. On each keyword input, Exalead fetches a set of images corresponding to the input. The face refinement functionality offered by LTU technologies restricts the search results to only those images that represent faces or portraits. Exalead image search is comprised of two modules: Image DNA Generator and Semantic Description Generator. The image DNA generator module creates a numerical vector called as DNA that encodes information related to the image such as colour, shape, texture, scale, object translation, image quality. The semantic description generator classifies the DNA image on the basis of pattern recognition as against the knowledge base that uses state of art techniques modeled on behavior of human subjects. The analyzer and the describer have the capability of learning thus providing an enhanced search experience. Exalead also helps the user to narrow down the image search by allowing the user to search on the basis of image size. The developers have now added a new feature that is video search. An example of Exalead search for Elvis Presley is shown below in Figure 9. Fig. 7. SENSEBOT 4) DUCKDUCKGO:6DuckDuckGo[15] is a new search engine focused on relevant results and respecting user pri-vacy. DuckDuckGo is a Semantic Web search search engine that is characterized by its feature rich semantic search. If a search is made for a word having multiple meanings, Duck-DuckGo gives it’s users the functionality of choosing from various options with its disambiguation results. Example: If we search for the term Apple then the search engine will list down all the possible contexts of the word Apple thus giving the user an option to search according to his preference. Duckduckgo search for the word ’bank’ is shown in Figure 8. Fig. 8. DUCKDUCKGO Fig. 9. EXALEAD Following are the advantages of DuckDuckGo: 1) Keyboard shortcuts. 2) Customization. 6 7 https://duckduckgo.com/ http://www.exalead.com/search/ 5 Authorized licensed use limited to: BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE. Downloaded on April 20,2022 at 07:10:33 UTC from IEEE Xplore. Restrictions apply. 2020 International Conference on Computer Communication and Informatics (ICCCI -2020), Jan. 22-24, 2020, Coimbatore, INDIA [15] TABLE I COMPARATIVE STUDY OF SEARCH ENGINES Name of Engine Hakia Search Methodology Pure analysis of contents Features Working logic Advantages Drawbacks Excellent resumes, CMR, Semantic rank algorithm, Related searches. Swoogle Indexes documents using RDF. Uses REST Interface to provide different services. Searches structured text with the help of QDEX, Ontosem and semantic rank algorithm. Gives Semantic web search results in RDF, OWL using web ontology. Gathers relevant information for the given query from various credible sites easily. Finds appropriate ontologies, instance data structure of the semantic web. Lexxe Uses Semantic key technology. Users can query with a conceptual keyword. Uses text mining, summarizes multiple records. A meta search engine that collects information from different search engines. Classification and Categorization. Parsing, Word sense disambiguation,Part-ofSpeech tagging. Selects link from Subset Cluster, Natural Language Processing Blends together the significant and relevant aspects of search results. Focuses on privacy, doesn’t track user’s personal information, uses its own web crawler as well as other search engines. Image recognition using DNA generator and Semantic description generator. Parses through the links to give a summary of the relevant data. It gives results in the form of summary and uses local search. Uses computational linguistics to preclude irrelevant content. Clustering of results yields options on various contexts. Gives summary of results instead of providing links related to the query. Results are gathered from different sources like Yahoo, Wikipedia etc. Hence more reliable. Cannot operate on unstructured data, still ambiguous with some of the search queries. Not multilingual. Extending Swoogle to index and effectively query large amounts of instance data is still a challenge. Does not work well with long queries. Sensebot Duckduckgo Exalead Easily narrows down image search results with integration of software between LTU and Exalead. IV. CONCLUSION This paper provides a brief overview of the existing literature regarding intelligent semantic search technologies. We have reviewed their characteristics respectively and the conclusion derived is that the existing techniques have a few drawbacks particularly in terms of time response, accuracy of results, importance of results and relevancy of results. An efficient semantic web search engines should meet these challenges efficiently and be compatible with global stan-dards of web technology. REFERENCES [1] T. Berners-Lee, J. Hendler, and O. Lassila, “The semantic web,” Scientific American, vol. 284, pp. 34–43, May 2001. [2] N. F. Noy, N. H. Shah, P. L. Whetzel, B. Dai, M. Dorf, N. Griffith, C. Jonquet, D. L. Rubin, M.-A. Storey, C. G. Chute, and M. a. Musen, “Bioportal: ontologies and integrated data resources at the click of a mouse.,” Nucleic acids research, vol. 37, pp. W170–3, July 2009. [3] M. d’Aquin and E. Motta, “Watson, more than a semantic web search engine,” Semant. web, vol. 2, pp. 55–63, Jan. 2011. [4] C. Gavankar, Y.-F. Li, and G. Ramakrishnan, “Context-driven concept search across web ontologies using keyword queries,” in Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015, (New York, NY, USA), pp. 20:1–20:4, ACM, 2015. Contains everything a mature search engine needs along with some additional features too. Almost makes it better than Google. Doesn’t give the end result. Gives intermediate result. Scaling challenges and difficult to integrate plugins. Does not index everything .Needs other search engines too. Indexing all the content on the web hasn’t been done yet. Not popular enough. [8] L. Ding, T. Finin, A. Joshi, R. Pan, S. R. Cost, Y. Peng, P. Reddivari, V. Doshi, and J. Sachs, “Swoogle: a search and metadata engine for the semantic web,” pp. 652–659, 2004. [9] C. Dwork, R. Kumar, M. Naor, and D. Sivakumar, “Rank aggregation methods for the web,” pp. 613–622, 2001. [10] Z. Ding and Z. Duan, “Improved ontology ranking algorithm based on semantic web,” pp. 103–107, July 2010. [11] Y. Qu and G. Cheng, “Falcons concept search: A practical search engine for web ontologies,” IEEE Trans. Systems, Man, and Cybernetics, Part A, vol. 41, no. 4, pp. 810–816, 2011. [12] R. Jain, N. Duhan, and A. Sharma, “Comparative study on semantic search engines,” International Journal of Computer Applications, vol. 131, pp. 4–11, 12 2015. [13] J. A. Khan, D. Sangroha, M. Ahmad, and M. T. Rahman, “A performance evaluation of semantic based search engines and keyword based search engines,” pp. 168–173, Nov 2014. [14] R. Unadkat, “Survey paper on semantic web,” Int. J. Adv. Pervasive Ubiquitous Comput., vol. 7, pp. 13–17, Oct. 2015. [15] J. Rashid and M. Nisar, “A study on semantic searching, semantic search engines and technologies used for semantic search engines,” International Journal of Information Technology and Computer Sci-ence(IJITCS)International Journal of Information Technology and Computer Science(IJITCS), vol. 10, pp. 82–89, 10 2016. [5] C. Gavankar, V. Kumar, Y. Li, and G. Ramakrishnan, “Enriching concept search across semantic web ontologies,” in Proceedings of the ISWC 2013 Posters & Demonstrations Track, Sydney, Australia, October 23, 2013, pp. 93–96, 2013. [6] C. Gavankar, Y. Li, and G. Ramakrishnan, “Explicit query interpretation and diversification for context-driven concept search across ontologies,” in The Semantic Web - ISWC 2016 - 15th International Semantic Web Conference, Kobe, Japan, October 17-21, 2016, Proceedings, Part I, pp. 271–288, 2016. [7] T. Finin, L. Ding, R. Pan, A. Joshi, P. Kolari, A. Java, and Y. Peng, “Swoogle: Searching for knowledge on the semantic web,” 2005. Authorized licensed use limited to: BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE. Downloaded on April 20,2022 at 07:10:33 UTC from IEEE Xplore. Restrictions apply. 6 Authorized licensed use limited to: BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE. Downloaded on April 20,2022 at 07:10:33 UTC from IEEE Xplore. Restrictions apply.