Ferret: 360° Search

Prafulla Mahindrakar, Aniket Patil, Ketan Umare
Advisor: Dr. Ling Liu
CS8803: Advanced Internet Application Development, Group Project
Spring 09

Table of Contents

1. Motivation and Objectives
2. Related Work
    2.1 Googling
    2.2 Socially Relevant Search
    2.2 Categorization of Search Results
3. Architecture
    3.1 System Architecture Diagram
    3.2 Pattern Oriented Architecture
        3.2.1 Design Patterns
        3.2.2 Design Patterns Used in Ferret
    3.3 High Performance
        3.3.1 Threading
        3.3.2 Caching
    3.4 Database Schema
        3.4.1 ER Diagram
        3.4.2 User Table
        3.4.3 Page Keyword Table
        3.4.4 User Session Table
4. Components
    4.1 Authentication
    4.2 Stanford Tagger
    4.2 Web Search
    4.3 Media Search
    4.4 Product Search
        4.4.1 Clustering of Results
    4.5 Social Search
        4.5.1 Sessions
        4.5.2 Listen to User Clicks
        4.5.3 Heartbeat Messages
    4.6 Categorization Engine
        4.6.1 Why Document Clustering
        4.6.2 Approaches
        4.6.3 Building Blocks
        4.6.4 Lingo Algorithm
    4.7 Presentation Engine
        4.7.1 Search Results Tab Creator
    4.8 View
5. Evaluation Framework
    5.1 Social Network Simulation
        5.1.1 ER Diagram
    5.2 JMeter
        5.2.1 Test Cases
    5.3 Comparison to Other Search Engines
6. Testing and Results
    6.1 Prototype System
        6.1.1 Software
        6.1.2 Hardware
        6.1.3 Operating System
    6.2 Results
        6.2.1 Without Memcached
        6.2.2 With Memcached
        6.2.3 Load Response with Memcached
7. Future Work
8. Conclusions
9. Bibliography
9. Appendix

1. Motivation and Objectives

fer·ret (v) (\ˈfer-ət\): to find and bring to light by searching

Imagine trying to find a pair of the latest Ray-Ban glasses in the Lenox Square Mall. It is not an easy task! Now think about doing the same across the World Wide Web. Feeling dizzy?
The World Wide Web, with its astronomical amount of information, presents an enormous challenge for resource discovery. Precise navigation is impossible given the increasingly large collection of hyperlinks that users must traverse. Commercial search engines like Google and Yahoo have solved the problem at a fundamental level by making available a hypertext-based index of pages across the web. Web users can query the index for documents about a specific topic to find the desired document. While search engines have become quite popular and are helping to redefine how people access information scattered across the wide-area network, they are not well suited to the case where users do not know exactly what they are looking for. In such a situation, using one of the popular search engines can be a messy, frustrating experience.

What do you do when you don't know where to start? Give Ferret a try! For any topic in the universe, Ferret provides a neatly organized view of the web. Our category guides bring together meaningful and relevant information that makes browsing a topic fast. Rather than leaving you to click back and forth through search results, we do the processing so that you can learn, explore and discover the things that matter to you.

Ferret offers you a new way to discover the Web: it is the place to be when you want to browse and discover everything the Web has to offer. Come to Ferret when you want to learn about a topic or explore what is happening now on the Web. We will show you content that you might never have discovered otherwise, and we will give you an at-a-glance look at everything related to the query. Think of Ferret as your guide for exploring the Web.

For instance, consider the search term 'Transformers'. A Google search returns a serially arranged list that covers the movie 'Transformers' and electrical transformers on the first page.
However, a user who is interested in the class 'Transformer' in Java, or in the Transformers comics, needs to browse several pages before such results turn up. Our system graphically arranges and classifies results into categories such as text, multimedia, entertainment, discussions, blogs and more. A single click gives the user a 360-degree view of the content associated with the query term.

Ferret. Your guide to the world!

2. Related Work

Search is a constantly evolving and continuously researched topic. There have been great success stories and even greater debacles in this industry. Web search has become such an important part of our lives that it has even contributed to our vocabulary. The following are some of the most distinctive systems currently available online, from which we draw our inspiration.

Figure 1: Taxonomy of Existing Search Technologies

2.1 Googling

In their seminal work [1], the authors described a new way of ranking web documents, based on the idea of citation. The search engine instantly became a hit and overtook all of its competitors. The webpage [2] is among the most visited pages online, and everyone knows "The Google Story". Google uses a simple keyword-based search, but the most important point is the ranking of content. Google thus demonstrates that content alone is not enough: the way it is presented matters just as much. Google has continued to innovate and add new features, but it still has a long way to go.

2.2 Socially relevant search

Social search, or a social search engine, is a type of web search method that determines the relevance of search results by considering the interactions or contributions of users [3]. Delver [4] is based on this simple idea: it uses the social network of a user to come up with better recommendations.
It enables you to find, experience and benefit from the wealth of information created and referenced by your social world. Socially relevant search can really benefit a user, since what matters to a user is usually what matters to his peers. Paper [5] discusses the benefits of integrating web search and social search and quantifies them with strong results. It also delineates the challenges in doing so.

2.2 Categorization of search results

Categorization is another important way to present search results. Take the word "Transformers" as an example. The same word can refer to several things: an electrical device, a movie, the cartoon series, a toy, a review of the movie, or news about the invention of a new, more efficient transformer. So how do you show these results? Which is more important? These questions are almost impossible to answer. Papers [6-9] show a variety of ways to classify web search results and quantify them with interesting results. Kosmix [10] is one of the most promising sites to have built on this idea. It uses the search provided by Google and wraps it in its own classification system. It has been voted one of the best new startups [11], which underlines the importance of classifying results.

3. Architecture

The following sections outline the system architecture and briefly describe the important components.

3.1 System Architecture Diagram

Figure 2: System Architecture

Ferret can operate in two modes: logged-in mode and private mode. Each mode is described in detail in later sections. In logged-in mode, along with the typical web results, Ferret also provides socially relevant search results, using the user's profile from one of the major social network databases, for example Facebook or Twitter.
The typical web results are divided into three broad categories, namely Web Search, Media Search and Product Search. Each category is further subdivided using our clustering algorithm.

3.2 Pattern Oriented Architecture

Our aim while developing Ferret was to keep it flexible enough that new features can be added with relative ease. Performance was also a major concern, so each component was built with a large-scale system in mind. A pattern-oriented architecture makes both goals easier to achieve. The following sections describe the various patterns used in Ferret.

3.2.1 Design Patterns

A design pattern describes a particular recurring design problem that arises in specific design contexts, and presents a well-proven generic scheme for its solution. The solution scheme is specified by describing its constituent components, their responsibilities and relationships, and the ways in which they collaborate [12, 13].

3.2.2 Design Patterns used in Ferret

Ferret uses the following design patterns.

3.2.2.1 Front Controller

The Front Controller pattern enables centralized request processing. This makes changes to the layers below transparent. Even communication and threading can be abstracted easily from the presentation layer.

3.2.2.2 Abstract Factories

The Abstract Factory is a creational pattern that separates the creation of objects from the place where they are used. This makes it easy to add modules.

3.2.2.3 Strategy

The Strategy pattern allows Ferret to change clustering algorithms easily, so that new algorithms can be plugged in with relative ease. This was especially vital while testing various algorithms.

3.2.2.4 Adapter

The Adapter pattern is used to separate the search/fetch/cluster logic from the presentation generator. The generator can therefore be modified independently of changes to the earlier stages.
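As an illustration of the Strategy pattern from Section 3.2.2.3, the following minimal Java sketch shows how a clustering algorithm could be swapped without touching the caller. All names here (ClusteringStrategy, FirstWordClustering, SingleClusterStrategy, CategorizationEngine) are hypothetical and are not taken from the Ferret code base; the two toy strategies merely stand in for real algorithms such as Lingo.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical strategy interface: maps result titles to labeled groups.
interface ClusteringStrategy {
    Map<String, List<String>> cluster(List<String> titles);
}

// Toy strategy: groups titles by their lower-cased first word.
class FirstWordClustering implements ClusteringStrategy {
    public Map<String, List<String>> cluster(List<String> titles) {
        Map<String, List<String>> groups = new LinkedHashMap<String, List<String>>();
        for (String title : titles) {
            String label = title.trim().split("\\s+")[0].toLowerCase();
            List<String> members = groups.get(label);
            if (members == null) {
                members = new ArrayList<String>();
                groups.put(label, members);
            }
            members.add(title);
        }
        return groups;
    }
}

// Toy strategy: puts every title into a single group named "all".
class SingleClusterStrategy implements ClusteringStrategy {
    public Map<String, List<String>> cluster(List<String> titles) {
        Map<String, List<String>> groups = new LinkedHashMap<String, List<String>>();
        groups.put("all", new ArrayList<String>(titles));
        return groups;
    }
}

// Context class: depends only on the interface, so the algorithm can be replaced at run time.
class CategorizationEngine {
    private ClusteringStrategy strategy;

    CategorizationEngine(ClusteringStrategy strategy) {
        this.strategy = strategy;
    }

    void setStrategy(ClusteringStrategy strategy) {
        this.strategy = strategy;
    }

    Map<String, List<String>> categorize(List<String> titles) {
        return strategy.cluster(titles);
    }
}

class StrategyDemo {
    public static void main(String[] args) {
        List<String> titles = Arrays.asList(
                "Transformers movie review",
                "Transformers toys",
                "Electrical transformer basics");
        CategorizationEngine engine = new CategorizationEngine(new FirstWordClustering());
        System.out.println(engine.categorize(titles).keySet());
        engine.setStrategy(new SingleClusterStrategy()); // swap the algorithm, caller unchanged
        System.out.println(engine.categorize(titles).keySet());
    }
}
```

Because the engine holds only a reference to the interface, trying out a different algorithm during testing amounts to passing a different implementation to setStrategy.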
3.2.2.5 Singleton

Several components need only a single shared instance. We hold the thread controllers in singletons to reduce thread-creation overhead. The tagger library is likewise loaded only once, which avoids the cost of re-reading it.

3.2.2.6 Spring (OS) Doors

As in the Spring system developed at Sun Labs, Ferret has a Controller that abstracts data access from the presentation layer. This allows us to deploy individual subsystems remotely, which could be used in the future for large-scale distributed computing.

3.3 High Performance

For a search engine, performance is critical. Ferret achieves performance through large-scale threading, distributed caching, and a design that makes it easy to move modules onto separate physical hosts.

3.3.1 Threading

Ferret uses pre-spawned thread pools to offset the overhead of thread creation. It also uses threads to perform searches across various domains in parallel.

3.3.2 Caching

Ferret uses memcached [14, 15] to cache recent results. To keep the results fresh, each cached entry is associated with an expiry value. Currently the expiry time is fixed arbitrarily, but future work would aim to derive this number using a learning algorithm. For example, if it were known that Google does not refresh its index for at least n hours, we could cache results until the index is next updated.

3.4 Database Schema

The database used by Ferret is minimal, which is essential for performance. The following section describes the schema in detail.
3.4.1 ER Diagram

Figure 3: Database Model for Social Search

Table usr_user:

    Column Name    Description
    Uid            Auto-generated primary key for usr_user table
    Username       Login name for the user
    Password       User's password
    Name           User's name
    Profession     User's profession
    ImageUrl       Pathname for the user image

Table puk_pagekeyword:

    Column Name    Description
    pageid         Auto-generated primary key for puk_pagekeyword table
    page           URL of the page
    keyword        Processed query term for which page was retrieved
    title          Title for page

Table uss_usersession:

    Column Name    Description
    uid            Auto-generated primary key for uss_usersession table
    pageid         Refers to puk_pagekeyword.pageid
    historycount   Frequency of usage of search results
    timestamp      Time at which user selected a page for reading
    sessionid      Server-generated session id for user

3.4.2 User Table

The user table maintains the user's login information in case the Google authentication system is not used. It also stores the uids, which could come directly from Facebook but would still be needed if multiple networks are supported.

3.4.3 Page Keyword Table

This table maintains a list of popular keyword and page combinations accessed by users. Based on freshness criteria, this table should be cleaned every x days.

3.4.4 User Session table

This table is used to track the user and his favorite links. It is essential for implementing the Good page / Bad page algorithm.

4. Components

This section explains the various modules that constitute the Ferret search engine.

4.1 Authentication

Ferret uses its own database to authenticate the user. It would be easy to use the Google OpenID system for authentication instead. When the user does not log in, the system treats the user as a guest and does not track his activities. This enables private browsing.

4.2 Stanford Tagger

The Stanford Tagger used by Ferret is a Part-Of-Speech Tagger.
It is a piece of software that reads text in some language and assigns a part of speech, such as noun, verb or adjective, to each word (and other tokens). The tagger is used to identify relevant keywords in a query and store them in the database. It is used in the following components:

• Dictionary Search: The tagger identifies nouns (personal, common, both singular and plural) as keywords to be sent to WordNet as a query.
• Social Search: The tagger identifies nouns (personal, common, both singular and plural), verbs and adverbs in the user's query.

4.2 Web Search

This engine is multithreaded. It accepts the raw query and dispatches it to worker threads, which collect search results from a variety of search engines such as Google [2], A9 [16] and IMDB [17]. The worker threads use WSDL to communicate with the various search engines. The external interface is extensible, since collecting results from a new search engine simply requires implementing a WSDL interface. This allows our system to be augmented with additional search results from Yahoo, Windows Live or any other search engine.

4.3 Media Search

This engine is multithreaded. It accepts the raw query and dispatches it to worker threads, which collect search results from a variety of search engines such as Google [2], A9 [16] and IMDB [17]. The worker threads use WSDL to communicate with the various search engines. The external interface is extensible, since collecting results from a new search engine simply requires implementing a WSDL interface. This allows our system to be augmented with additional search results from Yahoo, Windows Live or any other search engine.

4.4 Product Search

Ferret product search uses the Amazon E-Commerce API to retrieve product information. The API exposes Amazon's product data and e-commerce functionality. This allows Ferret to leverage the data that Amazon uses to power its own business.
Ferret is able to retrieve product results across a huge range of categories. For every product, Ferret retrieves the product name, the product cost as listed on Amazon, and a product image. All searches are performed for the US locale. In the future, it may be possible to detect the geographical region from which the query originates and adjust the locale accordingly.

4.4.1 Clustering of Results

The search results are clustered dynamically on the basis of the categories retrieved for the query term. All products belonging to a single category are arranged together using seed-list-based clustering. Any Amazon product can be classified into one of the following categories:

Apparel, Automotive, Baby, Beauty, Blended, Books, Classical, Digital Music, DVD, Electronics, Foreign Books, Gourmet Food, Health Personal Care, Hobbies, Home Garden, Jewelry, Kitchen, Magazines, Merchants, Miscellaneous, Music, Musical Instruments, Music Tracks, Office Products, Outdoor Living, PC Hardware, Pet Supplies, Photo, Restaurants, Software, Software Video Games, Sporting Goods, Tools, Toys, VHS, Video, Video Games, Wireless, Wireless Accessories

Figure 4: List of product categories in Amazon

We use the above categories as a seed list and use the retrieved product information to detect the category and cluster the product accordingly. Because the product search component is extensible, we can easily obtain results from other e-commerce providers such as Buy.com and eBay. We also plan to add functionality to sort results by cost and social relevance.

4.5 Social Search

Ferret adds a new spin to search: social networking. One of the most innovative features of Ferret is the ability to retrieve search results that are relevant to the user's social network. The feature allows the user to leverage searches performed by the user's friends. Social search recommends the best pages found by people in the user's network that are relevant to the user's query.
Ferret's social search tries to match the user's query term against the searches of people in the user's social network who are looking for the same things. The results are grouped by friend name and listed serially. Each result contains the page name and the page URL, which can be clicked to view the page. The feature is opt-in: no one can see what the user is searching for unless the user logs in. This ensures user privacy.

Social Search is implemented using the following primitives:

• Sessions
• Listen to user clicks
• Heartbeat messages

4.5.1 Sessions

In Ferret, a session stores the state of communication between the server and the user, enabling the server to identify that user across multiple page requests or visits to the site. A session is created when a user logs in with his username and password. The session for a user stores the following attributes:

• User ID: The primary key generated by the database for the user
• Query: The query term currently being searched by the user
• Page URL: The page currently being viewed by the user
• Timestamp: The time at which the user clicked on the page URL

4.5.2 Listen to user clicks

A request is sent to the server each time the user clicks on a page for browsing. This is used to associate the query term processed by the Stanford Tagger (the keywords) with the user id previously stored in the session, for future processing. The algorithm for user clicks is as follows:

    1. If session is invalid
    2.     Return
    3. Else if no timestamp exists in session
    4.     Insert URL into session
    5.     Insert Keyword into session
    6.     Insert Timestamp into session
    7.     Return

Figure 5: User-Click Algorithm

4.5.3 Heartbeat Messages

A heartbeat message is an event-driven message that is sent to the server when the search results page is reloaded. This message is used to detect whether the user likes the page he has just viewed.
We use heuristics to differentiate such a page from one the user does not like. The heuristic Ferret uses is as follows: if the user spends more time on a certain page, we assume he does so because he likes the page. If the user returns from a page "quickly", we assume he does not like it. Currently, we use a threshold of 30 seconds to differentiate a good page from a bad page. If the user spends 30 seconds or more on a specific page, the system records the page as a good page and stores it in the database. If the user returns from the page in less than 30 seconds, the page is not associated with the user. The algorithm for the heartbeat process can be summarized as follows:

    1. If session is invalid
    2.     Return
    3. Else if no timestamp exists in session
    4.     Return
    5. Else if page is liked by user
    6.     Associate page-keyword combination with userid
    7.     Return
    8. Else if page is disliked by user
    9.     Remove page-keyword association with user
    10.    Return

Figure 6: Heartbeat Algorithm

In the future, social search can be improved by deducing the "meaning" of the query using natural language processing techniques and using that meaning to retrieve search results. For instance, if the user is searching for "what drug treats a headache", Ferret could process the semantic relationships between words and deduce that someone searching for "what medicine relieves migraines" is a match. In addition, it may be possible to rank the set of results retrieved for a specific friend by freshness or relevance to the query.

4.6 Categorization Engine

The results collected from the various websites are categorized using Lingo clustering and then grouped into different categories.

4.6.1 Why Document Clustering

With the enormous growth of the Internet, it has become very difficult for users to find relevant documents.
In response to the user's query, currently available search engines return a ranked list of documents along with their partial content (snippets). If the query is general, it is extremely difficult to identify the specific document the user is interested in, and users are forced to sift through a long list of off-topic documents. Moreover, internal relationships among the documents in the search result are rarely presented and are left for the user to work out. One approach is to automatically group search results into thematic groups (clusters), which helps the user see the various perspectives of the same query grouped into categories.

4.6.2 Approaches

Clustering of web search results was first introduced in the Scatter-Gather system. Several algorithms followed. Suffix Tree Clustering (STC), implemented in the Grouper system, pioneered the use of recurring phrases as the basis for deriving conclusions about the similarity of documents. MSEEC and SHOC also made explicit use of word proximity in the input documents. Apart from phrases, graph-partitioning methods have been used in clustering search results.

All the above approaches follow a scheme in which cluster content discovery is performed first and the labels are then determined from the content. Very often, however, intricate measures of similarity among documents do not correspond well with a plain human understanding of what a cluster's "glue" element is. To avoid such problems, the Lingo algorithm reverses this process: it first attempts to ensure that it can create a human-perceivable cluster label, and only then assigns documents to it. This is the approach we have followed in our implementation of clustering web results.

4.6.3 Building Blocks

The following section describes the building blocks of the clustering algorithm used in Ferret.
4.6.3.1 Vector Space model

The Vector Space Model (VSM) [18] is an information retrieval technique that transforms the problem of comparing textual data into a problem of comparing algebraic vectors in a multidimensional space. Once the transformation is done, linear algebra operations are used to calculate similarities among the original documents. Every unique term (word) in the collection of analyzed documents forms a separate dimension in the VSM, and each document is represented by a vector spanning all these dimensions. For example, if vector $v$ represents document $j$ in a $k$-dimensional space, then component $t$ of vector $v$, where $t \in 1 \ldots k$, represents the degree of the relationship between document $j$ and the term corresponding to dimension $t$. This relationship is best expressed as a $t \times d$ matrix $A$, usually called a term-document matrix, where $t$ is the number of unique terms and $d$ is the number of documents. Element $a_{ij}$ of matrix $A$ is therefore a numerical representation of the relationship between term $i$ and document $j$. There are many methods for calculating $a_{ij}$, commonly referred to as term weighting methods.

4.6.3.2 Calculating Relevance

We use the tf-idf method for calculating the term weights. The tf-idf weight (term frequency, inverse document frequency) is a weight often used in information retrieval and text mining. It is a statistical measure of how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus.

4.6.3.3 Suffix Arrays

Let $A = a_1 a_2 a_3 \ldots a_n$ be a sequence of objects. Let us denote by $A_i$ the suffix of $A$ starting at position $i \in 1 \ldots n$, such that $A_i = a_i a_{i+1} a_{i+2} \ldots a_n$. An empty suffix is also defined for every $A$ as $A_{n+1} = \emptyset$. A suffix array [19] is an ordered array of all suffixes of $A$.
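To make the definition concrete, the following Java sketch builds a suffix array over a sequence of terms and uses binary search to test whether a phrase occurs in that sequence. It is only an illustration: the class and method names are hypothetical, the array stores suffix start positions rather than the suffixes themselves, the empty suffix is left out, and the construction uses a plain comparison sort, so it makes no attempt to reach the best known construction or lookup bounds.

```java
import java.util.Arrays;
import java.util.Comparator;

class SuffixArraySketch {

    // Returns the start positions of all non-empty suffixes of `terms`,
    // ordered lexicographically by the suffix they start.
    static Integer[] build(final String[] terms) {
        Integer[] order = new Integer[terms.length];
        for (int i = 0; i < terms.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, new Comparator<Integer>() {
            public int compare(Integer a, Integer b) {
                return compareSuffixes(terms, a, b);
            }
        });
        return order;
    }

    // Term-by-term comparison of the suffixes starting at positions a and b.
    static int compareSuffixes(String[] terms, int a, int b) {
        while (a < terms.length && b < terms.length) {
            int c = terms[a].compareTo(terms[b]);
            if (c != 0) {
                return c;
            }
            a++;
            b++;
        }
        // The suffix that ran out first is a prefix of the other and sorts first.
        return b - a;
    }

    // Binary search: true if `pattern` occurs as a contiguous run inside `terms`.
    static boolean contains(String[] terms, Integer[] order, String[] pattern) {
        int lo = 0;
        int hi = order.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int c = comparePrefix(terms, order[mid], pattern);
            if (c == 0) {
                return true;
            }
            if (c < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return false;
    }

    // Compares only the first pattern.length terms of a suffix with the pattern.
    static int comparePrefix(String[] terms, int start, String[] pattern) {
        for (int k = 0; k < pattern.length; k++) {
            if (start + k >= terms.length) {
                return -1; // suffix is shorter than the pattern, so it sorts before it
            }
            int c = terms[start + k].compareTo(pattern[k]);
            if (c != 0) {
                return c;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        String[] terms = {"to", "be", "or", "not", "to", "be"};
        Integer[] order = build(terms);
        System.out.println(Arrays.toString(order));
        System.out.println(contains(terms, order, new String[] {"to", "be"}));
        System.out.println(contains(terms, order, new String[] {"be", "not"}));
    }
}
```

Working on whole terms rather than characters matches the way phrases are treated later in the clustering pipeline, where a phrase is an ordered sequence of terms.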
Suffix arrays are used as an efficient data structure for verifying whether a sequence of objects $B$ is a substring of $A$. The complexity of this operation is $O(P + \log N)$, where $P$ is the length of $B$ and $N$ is the length of $A$; a suffix array can be built in $O(N \log N)$.

4.6.3.5 Singular value Decomposition

An algebraic method of matrix decomposition called Singular Value Decomposition (SVD) [20] is used for discovering the orthogonal basis of the original term-document matrix. This basis consists of orthogonal vectors that, at least hypothetically, correspond to topics present in the original term-document matrix. SVD breaks a $t \times d$ matrix $A$ into three matrices $U$, $\Sigma$ and $V$, such that $A = U \Sigma V^T$. $U$ is a $t \times t$ orthogonal matrix whose column vectors are called the left singular vectors of $A$, $V$ is a $d \times d$ orthogonal matrix whose column vectors are called the right singular vectors of $A$, and $\Sigma$ is a $t \times d$ diagonal matrix with the singular values of $A$ ordered decreasingly along its diagonal. The rank $r_A$ of matrix $A$ is equal to the number of its non-zero singular values. The first $r_A$ columns of $U$ form an orthogonal basis for the column space of $A$, an essential fact used by Lingo.

4.6.4 Lingo Algorithm

At a very high level, Lingo [21] first finds frequent phrases in the input documents, hoping they are the most informative source of human-readable topic descriptions. Next, by reducing the original term-document matrix using SVD, it tries to discover any existing latent structure of diverse topics in the search result. Finally, it matches group descriptions with the extracted topics and assigns relevant documents to them.

4.6.4.1 Preprocessing

The aim of the preprocessing phase is to prune from the input all characters and terms that could affect the quality of group descriptions. Two steps are performed. First, text filtering removes HTML tags, entities and non-letter characters except for sentence boundaries. Next, stemming and stop-word removal complete the preprocessing phase.
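The text-filtering and stop-word steps above can be sketched in Java as follows. This is a simplified illustration and not Ferret's actual preprocessing code: the regular expressions, the tiny stop-word list and the class name are all assumptions, and stemming is omitted because it would normally come from a dedicated stemmer library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PreprocessSketch {

    // Tiny illustrative stop-word list; a real system would use a much larger one.
    static final Set<String> STOP_WORDS = new HashSet<String>(
            Arrays.asList("a", "an", "the", "of", "and", "is", "in", "to"));

    // Returns lower-cased tokens, with "." kept as a sentence-boundary marker.
    static List<String> preprocess(String snippet) {
        String text = snippet.replaceAll("<[^>]*>", " ");   // drop HTML tags
        text = text.replaceAll("&[#A-Za-z0-9]+;", " ");     // drop HTML entities
        text = text.replaceAll("[!?]", ".");                // normalize sentence boundaries
        text = text.replaceAll("[^A-Za-z.]", " ");          // drop other non-letter characters
        text = text.replace(".", " . ");                    // make each boundary its own token

        List<String> tokens = new ArrayList<String>();
        for (String token : text.toLowerCase().trim().split("\\s+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue; // skip empty fragments and stop words
            }
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(preprocess("<b>The Transformers</b> movie &amp; toys! Reviews of 2007."));
    }
}
```

Keeping the boundary marker in the token stream is one simple way to let a later phrase-extraction step refuse phrases that cross sentence boundaries.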
4.6.4. Phrase Extraction

We define frequent phrases as recurring ordered sequences of terms appearing in the input documents. Intuitively, when writing about something, we usually repeat the subject-related keywords to keep the reader's attention. Of course, good writing style also uses synonyms and pronouns to avoid annoying repetition. To be a candidate for a cluster label, a frequent phrase or a single term must:

• appear in the input documents at least a certain number of times (the term frequency threshold),
• not cross sentence boundaries,
• be a complete phrase,
• neither begin nor end with a stop word.

We use suffix arrays to find such complete phrases.

4.6.4.2 Cluster Label Induction

Once the frequent phrases (and single frequent terms) that exceed the term frequency thresholds are known, they are used for cluster label induction. There are three steps to this: term-document matrix building, abstract concept discovery, and phrase matching and label pruning.

The term-document matrix is constructed from the single terms that exceed a predefined term frequency threshold. The weight of each term is calculated using the standard term frequency, inverse document frequency (tf-idf) formula; terms appearing in document titles are additionally scaled by a constant factor.

In abstract concept discovery, the Singular Value Decomposition method is applied to the term-document matrix to find its orthogonal basis. The vectors of this basis (SVD's $U$ matrix) represent the abstract concepts appearing in the input documents.

The phrase matching and label pruning step, where group descriptions are discovered, relies on an important observation: both abstract concepts and frequent phrases are expressed in the same vector space, the column space of the original term-document matrix $A$. The classic cosine distance is used to calculate how "close" a phrase or a single term is to an abstract concept.
Let us denote by $P$ a matrix of size $t \times (p + t)$, where $t$ is the number of frequent terms and $p$ is the number of frequent phrases. Given the $P$ matrix and the $i$-th column vector of the SVD's $U$ matrix, a vector $m_i$ of cosines of the angles between the $i$-th abstract concept vector and the phrase vectors can be calculated: $m_i = U_i^T P$. The phrase that corresponds to the maximum component of the $m_i$ vector is selected as the human-readable description of the $i$-th abstract concept.

4.6.4.2 Cluster Content Discovery

In the cluster content discovery phase, the classic Vector Space Model is used to assign the input documents to the cluster labels induced in the previous phase. In a way, we re-query the input document set with all induced cluster labels. The assignment process resembles document retrieval based on the VSM. Let us define matrix $Q$, in which each cluster label is represented as a column vector. Let $C = Q^T A$, where $A$ is the original term-document matrix for the input documents. Element $c_{ij}$ of the $C$ matrix then indicates the strength of membership of the $j$-th document in the $i$-th cluster. A document is added to a cluster if $c_{ij}$ exceeds some threshold, which is yet another control parameter of the algorithm. Documents not assigned to any cluster end up in an artificial cluster called Others.

4.7 Presentation Engine

This module is responsible for rendering the results for the user's browser. It uses the Adapter pattern to separate the search part from the display part.

4.7.1 Search Results Tab creator

This interface creates a tab, and each type of tab can be separated into a different class. The most important functions are written in the base class, and whenever a tab needs to behave differently, a simple subclass can easily be written.

4.8 View

The clustered results and the socially relevant search results are then shown to the end user in a tabbed format, which allows the user to easily find the content he is looking for.
The view uses Mootools [22], an open-source JavaScript framework, which makes it browser agnostic. The following chart compares the performance of Mootools with other JavaScript frameworks. Its performance, along with its ease of use, makes it one of the preferred choices.

Figure 7: Performance comparison of various JavaScript frameworks (source: Blog)

5. Evaluation Framework

Ferret combines many old ideas and some new ones into an exciting new product. Evaluating such a system is therefore critical. The evaluation falls under three broad categories:

• Social network based relevance
• Performance
• Comparison to other contemporary search engines

5.1 Social Network Simulation

Ferret needs the social network to provide information about a user and his friends so that it can compute and maintain socially relevant search results. Though Ferret has a Facebook engine ready, the Facebook authentication system requires a static IP or a URL to work with. Because of this limitation, it became essential to simulate the social network. The following section describes a simple social network simulation.

5.1.1 ER Diagram

Ferret presently simulates a social network to implement Social Search. The database model used is as follows:

Figure 8: Database Model for Social Network

Table ufl_userfriends:

    Column Name    Description
    Uid            Refers to usr_user.uid
    fid            Refers to usr_user.uid

Table uhb_userhobbies:

    Column Name    Description
    Uid            Refers to usr_user.uid
    hobbies        User hobby name

Table Usk_usersearchpage:

    Column Name    Description
    Uid            Refers to usr_user.uid
    pageid         Refers to puk_pagekeyword.pageid

Ferret does not currently use the uhb_userhobbies table in the simulation. It would be possible to consider the hobbies of the user's friends when recommending social search results to the user.
5.2 JMeter
The performance of a search engine is critical. JMeter is an open-source tool that can simulate multiple clients sending POST requests [23, 24], and it can also load-test the application. Ferret was tested using JMeter, and various performance statistics were collected. This section provides details on the test cases.

5.2.1 Test Cases
The first screenshot shows the JMeter test plan setup screen. The test plan is called Ferret Testplan. Screenshot 2 shows the parameter to be passed, namely the search query, and the type of HTTP request to be sent, for example POST or GET. Screenshot 3 shows the expected load (number of users), the number of times each query is executed, and the gap between consecutive queries.

5.3 Comparison to Other Search Engines
An important component of a search engine's performance is the quality of its results for a particular query. Such an evaluation is very subjective. To compare the results of Ferret with those of contemporary search engines, a survey was used.

6. Testing and Results

6.1 Prototype System
We have built a prototype system for the demo using the hardware and software listed in the following sections.

6.1.1 Software
• Java 1.6
• Eclipse IDE
• J2EE 1.4
• Apache Tomcat 5.5
• MySQL 5.0
• Clustering algorithms (developed by us)
• MooTools
• Multibox
• MySQL JDBC Connector
• JUnit 4.4
• Open-source Web / REST APIs for Google, IMDB, Facebook, etc.

6.1.2 Hardware
We need only simple commodity hardware, since this is a proof of concept rather than a live system. Currently, a desktop PC with a browser and Internet connectivity suffices. We primarily develop on our laptops.

6.1.3 Operating System
The primary development and test platforms are:
• Windows 98/XP/Vista
• Mac OS X 10.5.5 (Leopard)
Most of the technologies we use are completely portable, so Ferret should run on most systems that support Java.
6.2 Results
We conducted our tests using memcached and Tomcat. For every search engine, response times are very important. Since we use Google as our search provider, our response times can never be better than Google's. Each tab is built on a separate thread, so the page is created in parallel.

6.2.1 Without Memcached
Figure: Response Times without Memcached (x-axis: number of times the same query is fired, representative; y-axis: time in seconds)

Without memcached, the same query takes an approximately constant response time, because the entire result set is constructed all over again for every request.

6.2.2 With Memcached
Figure: Response Times with Memcached, logged-in mode (x-axis: number of times the same query is fired, representative; y-axis: time in seconds)

Memcached improves performance, but a small amount of time is still spent on the social results, which are never cached. Because these are stored locally in Ferret's own database, however, the bottleneck lies in the remote servers and the clustering system.

Figure: Response Times with Memcached, not-logged-in mode (x-axis: number of times the same query is fired, representative; y-axis: time in seconds)

When the user is not logged in, the page is constructed entirely from cached results. The thread pools are not interrupted, and thus the performance is very high.

6.2.3 Load Response with Memcached
Figure: Response Times with Memcached and concurrent users (x-axis: concurrent users * 4; y-axis: time in seconds)

The graph above shows the response time of the Ferret system when multiple concurrent users search for the same query. It is evident that we need a server or a host of servers to handle multiple concurrent users.

7. Future Work
There are a lot of changes that we dream of, and we have a long way to go.
This serves as a good demo tool, but it is not a final product. The following are some of the things we have planned for Ferret:
o Using our summer vacation to build on it
o Introducing a notion of Social Rank
o Adding blogs, forums, reservations, and email search to the search results
o Using the Digg interface to re-rank sites
o Learning better categories
o And the list goes on...

8. Conclusions
This was a very good learning experience. One of the most important things we learnt was how to develop an idea and turn it into a working prototype. From our perspective, there are two navigation paradigms on the Web: Search and Browse. Search lets you find specific bits of information quickly or navigate to sites you already know. Browse gives you a more immersive way to explore a topic so that you can learn more about something or discover something new. Ferret is about reinventing Browse just as Google reinvented Search.

9. Bibliography
1. Brin, S. and L. Page, The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 1998. 30(1-7): p. 107-117.
2. Page, L. and S. Brin. Google. Available from: http://www.google.com.
3. Wikipedia - The free Encyclopedia. Available from: http://www.wikipedia.com.
4. Liad Agmon, A.y., Sagie Davidovitch (co-founders), Delver.
5. Mislove, A., K. Gummadi, and P. Druschel. Exploiting Social Networks for Internet Search. 2006.
6. Chen, H. and S. Dumais. Bringing order to the Web: automatically categorizing search results. 2000: ACM Press New York, NY, USA.
7. Thet, T., J. Na, and C. Khoo, Automatic Classification of Web Search Results: Product Review vs. Non-review Documents. Lecture Notes in Computer Science, 2007. 4822: p. 65.
8. Vogel, D., et al., Classifying search engine queries using the web as background knowledge. SIGKDD Explor. Newsl., 2005. 7(2): p. 117-122.
9. Yeung, A., N. Gibbins, and N. Shadbolt, A k-Nearest-Neighbour Method for Classifying Web Search Results with Data in Folksonomies. 2008.
10. Venky Harinarayan, A.R.C.-f. Kosmix.
Available from: http://www.kosmix.com.
11. Read Write Web - Top 10 Alternative Search Engines of 2008. Available from: http://www.readwriteweb.com/archives/top_10_alternative_search_engi.ph ps.
12. Buschmann, F., Pattern-oriented software architecture: a system of patterns. 2002: Wiley.
13. Buschmann, F., K. Henney, and D. Schmidt, Pattern-oriented software architecture. 1996: Wiley New York.
14. Fitzpatrick, B., Distributed caching with memcached. Linux Journal, 2004. 2004(124).
15. Interactive, D., Memcached. 2006.
16. Bezos, J. A9 - Amazon's Search Engine. Available from: http://www.a9.com.
17. Needham, C. Internet Movie Database. Available from: http://www.imdb.com.
18. Wong, S., W. Ziarko, and P. Wong. Generalized vector spaces model in information retrieval. 1985: ACM New York, NY, USA.
19. Manber, U. and G. Myers. Suffix arrays: A new method for on-line string searches. 1990: Society for Industrial and Applied Mathematics Philadelphia, PA, USA.
20. Golub, G. and C. Reinsch, Singular value decomposition and least squares solutions. Numerische Mathematik, 1970. 14(5): p. 403-420.
21. Osinski, S., J. Stefanowski, and D. Weiss. Lingo: Search results clustering algorithm based on singular value decomposition. 2004: Springer.
22. Proietti, V., MooTools - the compact javascript framework.
23. Foundation, A., Apache JMeter.
24. Hansen, K., Load Testing your Applications with Apache JMeter. Java Boutique Internet, http://javaboutique.internet.com/tutorials/JMeter/, as viewed November, 2004.

9. Appendix

Figure 9: Home Page

Searching for ‘metallica’ in Ferret…
Figure 10: Search Query

Search (Web) results for ‘metallica’ in Ferret.
Figure 11: Search Results Page > Web Search Tab

Media results for ‘metallica’ in Ferret.
Results are clustered by the STC and Lingo algorithms.
Figure 12: Search Results Page > Media Results Tab

Playing ‘Nothing Else Matters’ video.
Figure 13: Search Results Page > Media Results Tab > Media Player

Product results for ‘metallica’ in Ferret. Results are clustered by seed-list based clustering.
Figure 14: Search Results Page > Product Search

Social results for ‘metallica’ in Ferret. No results are shown since the user is not logged in.
Figure 15: Search Results Page > Social Search (Not Logged in)

User logs into Ferret to see social search results.
Figure 16: Ajax Login Option

User’s friend has already searched for ‘metallica’, and his favorite ‘metallica’ pages are displayed.
Figure 17: Search Results Page > Social Results Tab (Logged in)

Search (Web) results for ‘ipl’ in Ferret. User ‘ketan’ is logged in and clicks on a URL.
Figure 18: Search Results Page > Web Search > On Clicking a Query

User ‘ketan’ logs out, and ‘praful’ logs in and searches for ‘ipl’ again.
Figure 19: When a Friend Logs in!

Social results for ‘ipl’ display the URL ketan had liked when he searched for ‘ipl’.
Figure 20: The Socially Relevant Query turns up on Friends Page

Search (Web) results for ‘yoyo’ in Ferret. Dictionary Search is able to get a definition for ‘yoyo’.
Figure 21: Wikipedia, WordNet Dictionary search, Images from Yahoo and Google