Ferret: 360° Search

Prafulla Mahindrakar, Aniket Patil, Ketan Umare
Advisor: Dr. Ling Liu
CS8803: Advanced Internet Application Development, Group Project, Spring 09

Table of Contents

1. Motivation and Objectives
2. Related Work
   2.1 Googling
   2.2 Socially relevant search
   2.3 Categorization of search results
3. Proposed Work
   3.1 System Architecture
   3.2 Components
       3.2.1 OpenID Authentication
       3.2.2 Social Network Information Grabber APP
       3.2.3 Query Parsing
       3.2.4 Socially relevant Search
       3.2.5 External Search Interface
       3.2.6 Categorization of Search Results
       3.2.7 Display
4. Plan of action
   4.1 Resources
       4.1.1 Software Tools
       4.1.2 Hardware
       4.1.3 Operating system
   4.2 Schedule
5. Evaluation and Testing Method
   5.1 Deliverable
   5.2 Evaluation Strategy
   5.3 Testing Methodology
       5.3.1 Unit Testing
       5.3.2 Functional Testing
6. Bibliography

1. Motivation and Objectives

fer·ret (v) (\ˈfer-ət\): to find and bring to light by searching

Imagine trying to find a pair of the latest Ray-Ban glasses in the Lenox Square Mall. It is not an easy task! Now think about doing the same across the World Wide Web. Feeling dizzy?

The World Wide Web, with its astronomical amount of information, presents an enormous challenge for resource discovery. Precise navigation becomes impractical as the collection of hyperlinks that users must traverse keeps growing. Commercial search engines like Google and Yahoo have solved the problem at a fundamental level by making available a hypertext-based index of pages across the web. Web users can query the index for documents about a specific topic to find the desired document. While search engines have become quite popular and are helping to redefine how people access information scattered across the wide-area network, they are not well suited to the case where users do not know exactly what they are looking for. In such a situation, using one of the popular search engines can be a messy, frustrating experience.

What do you do when you don't know where to start? Give Ferret a try! For any topic in the universe, Ferret provides a neatly organized view of the web. Our category guides bring together meaningful and relevant information that makes browsing a topic fast. Instead of the messy back-and-forth clicking through search results, we do the processing so that you can learn, explore and discover the things that matter to you.
Ferret offers you a new way to discover the Web: it's the place you should be when you want to browse and discover everything the Web has to offer. Come to Ferret when you want to learn about a topic or explore what's happening now on the Web. We'll show you content that you might never have discovered otherwise, and we'll give you an at-a-glance look at everything related to the query. Think of Ferret as your guide for exploring the Web.

For instance, consider the search term 'Transformers'. A Google search returns a serially arranged list whose first page covers the movie 'Transformers' and electrical transformers. However, a user who is interested in the class 'Transformer' in Java or in the Transformers comics needs to browse several pages before such results turn up. Our system graphically arranges and classifies results into categories such as text, multimedia, entertainment, discussions, blogs and more. A user needs only a single click to get a 360-degree view of the content associated with the query term.

Ferret. Your guide to the world!

2. Related Work

Search is a constantly evolving and continuously researched topic. There have been great success stories and even greater debacles in this industry. Web search has become such an important part of our lives that it has, in some cases, contributed to our vocabulary. The following are some of the most distinctive systems currently available online, from which we draw our inspiration.

Figure 1: Taxonomy of Existing Search Technologies

2.1 Googling

In their seminal work [1], the authors described a new way of ranking web documents based on the idea of citation. The search engine instantly became a hit and overtook all of its competitors. The webpage [2] is the most highly visited page online, and everyone knows "The Google Story". Google uses a simple keyword-based search, but the most important point is the ranking of content.
Thus Google successfully demonstrates that content alone is not enough; how it is ranked and presented matters just as much. Google has continued to add new features, but it still has a long way to go.

2.2 Socially relevant search

Social search, or a social search engine, is a type of web search method that determines the relevance of search results by considering the interactions or contributions of users [3]. Delver [4] builds on this simple idea, using the social network of a user to come up with better recommendations. It enables you to find, experience and benefit from the wealth of information created and referenced by your social world. Socially relevant search can really benefit users, since what matters to them is usually what matters to their peers. Paper [5] discusses the benefits of integrating web search with social search and quantifies them with strong results. It also delineates the challenges in doing so.

2.3 Categorization of search results

Categorization is another important way to present search results. Take the word 'Transformers' as an example. The same word can refer to an electrical device, a movie, the cartoon series or a toy; a result could also be a review of the movie, or news about the invention of a new, more efficient transformer. So how do you show these results? Which is more important? These questions are almost impossible to answer. Papers [6-9] present a variety of ways to classify web search results and quantify them with interesting results. Kosmix [10] is one of the most promising sites to have built on this idea. It uses the search provided by Google and wraps it in its own classification system. It has been voted one of the best new startups [11], which underlines the importance of classifying results.

3. Proposed Work

The following sections outline the system architecture and briefly describe the important components.

3.1 System Architecture

Figure 2: System Architecture

3.2 Components

This section explains the various modules that constitute the Ferret search engine.

3.2.1 OpenID Authentication

We use the OpenID authentication engine to log in to our system. OpenID also helps us obtain user profiles from social networking websites such as Facebook [12] and Orkut that support OpenID authentication.

3.2.2 Social Network Information Grabber APP

This module lets us grab a user's social network from websites such as Facebook [12] and Orkut. The network is stored locally in our system database along with each user's query history and preferences, which assist us in socially relevant search.

3.2.3 Query Parsing

The query parsing engine uses the Stanford tagger [13, 14] to extract keywords. The socially relevant search module then uses these keywords to look for previously fruitful searches within the user's social network.

3.2.4 Socially relevant Search

This module uses the parsed query to search through the user's social network and find out whether any relevant searches were made earlier.

3.2.5 External Search Interface

This engine is multithreaded. It accepts the raw query and dispatches it to worker threads, which collect search results from a variety of search engines such as Google [2], A9 [15] and IMDB [16]. The worker threads use WSDL to communicate with the various search engines. The external interface is extensible, since collecting results from a new search engine simply requires implementing a WSDL interface. This allows our system to be augmented with additional search results from Yahoo, Windows Live or any other search engine.
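As a rough illustration of the dispatch pattern described in 3.2.5, the sketch below sends one raw query to several engines on separate worker threads and merges the answers. It is only a minimal sketch under stated assumptions: the names ExternalSearchDispatcher, SearchEngineAdapter and StubAdapter are hypothetical and not taken from the Ferret code, and the stub adapters fabricate result strings in place of real WSDL calls so that the example runs offline.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of the external search interface; names are illustrative only.
class ExternalSearchDispatcher {

    // One adapter per search engine. A real adapter would wrap that
    // engine's WSDL (or REST) client; adding an engine means adding an adapter.
    interface SearchEngineAdapter {
        List<String> search(String query) throws Exception;
    }

    // Stub adapter that fabricates result titles instead of calling a service.
    static class StubAdapter implements SearchEngineAdapter {
        private final String engineName;
        private final int resultCount;

        StubAdapter(String engineName, int resultCount) {
            this.engineName = engineName;
            this.resultCount = resultCount;
        }

        public List<String> search(String query) {
            List<String> results = new ArrayList<String>();
            for (int i = 1; i <= resultCount; i++) {
                results.add(engineName + ": result " + i + " for '" + query + "'");
            }
            return results;
        }
    }

    private final List<SearchEngineAdapter> adapters;

    ExternalSearchDispatcher(List<SearchEngineAdapter> adapters) {
        this.adapters = new ArrayList<SearchEngineAdapter>(adapters);
    }

    // Runs every adapter on its own worker thread and merges the results
    // in adapter order. An engine that throws is skipped, not fatal.
    List<String> dispatch(final String query) {
        List<String> merged = new ArrayList<String>();
        if (adapters.isEmpty()) {
            return merged;
        }
        ExecutorService pool = Executors.newFixedThreadPool(adapters.size());
        try {
            List<Callable<List<String>>> tasks = new ArrayList<Callable<List<String>>>();
            for (final SearchEngineAdapter adapter : adapters) {
                tasks.add(new Callable<List<String>>() {
                    public List<String> call() throws Exception {
                        return adapter.search(query);
                    }
                });
            }
            // invokeAll waits for all tasks and returns futures in task order.
            for (Future<List<String>> future : pool.invokeAll(tasks)) {
                try {
                    merged.addAll(future.get());
                } catch (ExecutionException e) {
                    // This engine failed; keep the results from the others.
                }
            }
        } catch (InterruptedException e) {
            // Preserve the interrupt status and return what was collected so far.
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
        return merged;
    }

    public static void main(String[] args) {
        List<SearchEngineAdapter> adapters = new ArrayList<SearchEngineAdapter>();
        adapters.add(new StubAdapter("Google", 2));
        adapters.add(new StubAdapter("IMDB", 1));
        for (String hit : new ExternalSearchDispatcher(adapters).dispatch("Transformers")) {
            System.out.println(hit);
        }
    }
}
```

Keeping each engine behind a small adapter interface is one way to get the extensibility the section describes: the dispatcher itself would not need to change when a new engine is added.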
3.2.6 Categorization of Search Results

The results collected from the various websites are categorized using k-means [9] and seed-list-based clustering, and then grouped into different categories.

3.2.7 Display

The clustered results and the socially relevant search results are then shown to the end user in a tabbed format, which allows users to find the content they want easily.

4. Plan of action

The following sections outline the resources needed to develop the Ferret system and the schedule until completion.

4.1 Resources

We will develop a Java-based application with the usual Java EE components. The following sections list the requirements in detail. The operating system section also lists all the compatible platforms on which we will develop and test the Ferret system.

4.1.1 Software Tools

• Java 1.5
• Eclipse IDE
• J2EE 1.4
• Apache Tomcat 5.5
• MySQL 5.0
• Clustering algorithms (developed by us)
• Simile
• MySQL JDBC Connector
• JUnit 4.4
• WSDLs and open source web/REST APIs for Google, IMDB, Facebook, etc.

4.1.2 Hardware

We need only simple commodity hardware, since this will be a proof of concept rather than a live system. For now, a desktop PC with a browser and internet connectivity will suffice. We will primarily develop on our laptops.

4.1.3 Operating system

The primary development and test platforms will be:

• Windows 98/XP/Vista
• Mac OS X 10.5.5 (Leopard)

Most of the technologies we are using are completely portable, however, so we should be able to run on most systems that support Java.

4.2 Schedule

Week No.   Dates               Scheduled Work
1          Feb 23 - Feb 27     Installation of J2EE, MySQL, Tomcat
2          Mar 2 - Mar 6       Study of web services
3          Mar 9 - Mar 13      Design and implementation of external agent interface
4          Mar 16 - Mar 20     Connectivity to Google OpenID for authentication
5          Mar 23 - Mar 27     Study and implementation of connectivity with Google Search and Amazon A9
6          Mar 30 - Apr 3      Study and implementation of connectivity with Facebook social network
7          Apr 6 - Apr 10      Study and implementation of clustering algorithms
8          Apr 13 - Apr 17     Implementation of user interface
9          Apr 20 - Apr 24     Testing with JUnit and documentation

5. Evaluation and Testing Method

The following sections outline the deliverable at the end of the project, the evaluation strategy we will use to assess the results of the system, and the testing methods.

5.1 Deliverable

We envision a fully interactive, highly intuitive frontend. The system will consist of a working query manager and interface manager, with at least a few interfaces already built in.

5.2 Evaluation Strategy

We will compare the quality of results at various levels. The most important criterion is the comparison against Google, and we will carry out some user acceptance testing, because what matters most is how the user perceives the results. We will also compare the results to Kosmix [10] and Delver [4]. We will try to create a small survey and ask CoC students to compare the results. This will help us study the potential of such a system.

5.3 Testing Methodology

We will carry out testing at two levels, as explained below.

5.3.1 Unit Testing

Each piece will first be unit tested using JUnit. This will ensure that individual units of source code are working properly. We will also test the simple functionality of each unit.

5.3.2 Functional Testing

This is an important step to ensure correctness. The main difficulty here is obtaining a sizeable amount of social network data.
Thus we will mostly have to manufacture data and simulate the social network.

6. Bibliography

1. Brin, S. and L. Page, The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 1998. 30(1-7): p. 107-117.
2. Page, L. and S. Brin. Google. Available from: http://www.google.com.
3. Wikipedia, The Free Encyclopedia. Available from: http://www.wikipedia.com.
4. Liad Agmon, A.y., Sagie Davidovitch (co-founders). Delver.
5. Mislove, A., K. Gummadi, and P. Druschel. Exploiting Social Networks for Internet Search. 2006.
6. Chen, H. and S. Dumais. Bringing order to the Web: automatically categorizing search results. 2000: ACM Press New York, NY, USA.
7. Thet, T., J. Na, and C. Khoo, Automatic Classification of Web Search Results: Product Review vs. Non-review Documents. Lecture Notes in Computer Science, 2007. 4822: p. 65.
8. Vogel, D., et al., Classifying search engine queries using the web as background knowledge. SIGKDD Explor. Newsl., 2005. 7(2): p. 117-122.
9. Yeung, A., N. Gibbins, and N. Shadbolt, A k-Nearest-Neighbour Method for Classifying Web Search Results with Data in Folksonomies. 2008.
10. Venky Harinarayan and Anand Rajaraman (co-founders). Kosmix. Available from: http://www.kosmix.com.
11. Read Write Web. Top 10 Alternative Search Engines of 2008. Available from: http://www.readwriteweb.com/archives/top_10_alternative_search_engi.ph ps.
12. Zuckerberg, M. Facebook - the Social Networking site. Available from: http://www.facebook.com.
13. Toutanova, K. and C. Manning. Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. 2000.
14. Toutanova, K., et al. Feature-rich part-of-speech tagging with a cyclic dependency network. 2003: Association for Computational Linguistics Morristown, NJ, USA.
15. Bezos, J. A9 - Amazon's Search Engine. Available from: http://www.a9.com.
16. Needham, C. Internet Movie Database. Available from: http://www.imdb.com.