Advanced Information Retrieval
Computer Engineering Department
Chapter 13 – Web IR

Searching the web
• Introduction (history, definitions, aims, statistics)
• Tasks of search engines
  – Gathering
  – Indexing
  – "Searching" (querying and ranking algorithms)
  – Document and query management
• Metasearch
• Browsing and web directories
• Users and web search
• Research issues

Web Searching and Classical IR
Classical IR research (TREC) developed through the 1970s, 1980s and 1990s; then came the web, and web searching in the 2000s.

Terminology and Definitions
• A web page corresponds to a document in traditional IR
• Web pages differ in their size and in the types of files that constitute them – text, graphics, sound, video, GIF, JPEG, ASCII, PDF, etc
• IR on the web considers as its collection of documents the part of the web that is publicly indexable
  – excludes pages that cannot be indexed (authorisation required, dynamic pages)
• Location of web pages by navigation
• Location of web pages by searching (IR on the web)

Challenges for web search
• Problems with the data
  – Distributed data
  – High percentage of volatile data
  – Large volume
  – Unstructured data
  – Redundant data
  – Quality of data
  – Heterogeneous data
• Problems faced by users
  – How to specify a query?
  – How to interpret answers?

Search engines
• Portals – added services:
  – identification of real names
  – links to books from Amazon.com
  – sending electronic postcards
  – translation to other languages
  – search of other media (metadata)
  – language-specific searches
  – weather, stock prices, traffic
• Business model
  – targeted advertising with revenue from clickthroughs
  – word of mouth (since there is no industry-standard evaluation)
  – fast response (< 1 s), 24 hours / 7 days a week
  – filtering out spam

Examples
Search engine    URL
AltaVista        www.altavista.com
Excite           www.excite.com
Google           www.google.com
Infoseek         www.infoseek.com
Lycos            www.lycos.com
NorthernLight    www.nlsearch.com
• Only 25–55% of the web is covered!
• Some search engines are powered by the same IR engine
• Most are based in the US and in English

Other search engines
• Specialised in different countries/languages (e.g., http://www.iol.it/)
• Specific
  – Ranking according to popularity (e.g., DirectHit)
  – Topic oriented (e.g., SearchBroker)
  – Personal or institutional pages, electronic mail addresses, images, software applets

Tasks of a web search engine
• Document gathering
  – select the documents to be indexed
• Document indexing
  – represent the content of the selected documents
  – often 2 indices are maintained (a full one + a small one for frequent queries)
• Searching
  – represent the user information need as a query
  – retrieval process (search algorithms, ranking of web pages)
• Document and query management
  – display the results
  – virtual collection (documents discarded after indexing) vs. physical collection (documents kept after indexing)

Document gathering
• Document gathering = crawling the web
• Crawler
  – Also called robot, spider, wanderer, walker, knowbot, web search agent
  – A program that traverses the web to send new or updated pages to be indexed
  – Runs on a local server and sends requests to remote servers

Crawling the web (1)
• Crawling process
  – Start with a set of URLs
    • Submitted by users or companies
    • Popular URLs
  – Breadth-first or depth-first traversal
  – Extract further URLs from the visited pages
• Up to 10 million pages per day
• Several crawlers
  – Problem of redundancy
  – Partition the web – one robot per partition

Crawling the web (2)
• Up to date?
  – Non-submitted pages may need up to 2 months to be indexed
  – The search engine learns the change frequency of pages
  – Popular pages (having many links to them) are crawled more frequently
• Guidelines for robot behaviour
  – A file placed at the root of the web server (robots.txt)
  – Indicates web pages that should not be indexed
  – Avoids overloading servers/sites

Typical anatomy of a large-scale crawler.
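The crawling process above can be sketched as a breadth-first traversal of the link graph. This is a minimal sketch: `fetch` and `extract_links` are hypothetical helpers standing in for HTTP retrieval and HTML link extraction, which a real crawler would implement (and distribute across machines).

```python
from collections import deque

def crawl(seed_urls, fetch, extract_links, max_pages=100):
    """Breadth-first crawl starting from a set of seed URLs.

    `fetch(url)` and `extract_links(content)` are assumed helpers:
    fetch returns the page content (or None if unreachable or
    disallowed), extract_links returns the URLs it references.
    """
    frontier = deque(seed_urls)          # URLs waiting to be visited
    seen = set(seed_urls)                # redundancy check: never re-fetch
    pages = {}                           # url -> content, handed to the indexer
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()         # FIFO queue => breadth-first
        content = fetch(url)
        if content is None:              # unreachable or excluded page
            continue
        pages[url] = content
        for link in extract_links(content):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```

Swapping the deque for a stack (`pop()` instead of `popleft()`) gives the depth-first variant mentioned above.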
Document indexing
• Document indexing = building the indices
• Indices are variants of inverted files
  – meta-tag analysis
  – stop-word removal + stemming
  – position data (for phrase searches)
  – weights
    • tf × idf
    • downweight long URLs (not an important page)
    • upweight terms appearing at the top of the document, or emphasised terms
  – use of de-spamming techniques
• Hyperlink information
  – count link popularity
  – anchor text from source links
  – hub and authority values of a page

Maintaining indices over dynamic collections.

Stop-press index
• Collection of documents in flux
  – Model document modification as a deletion followed by an insertion
  – Documents in flux are represented by a signed record (d, t, s)
  – "s" specifies whether "d" has been deleted or inserted
• Getting the final answer to a query
  – The main index returns a document set D0
  – The stop-press index returns two document sets
    • D+: documents matching the query that are not yet indexed in D0
    • D–: documents matching the query that were removed from the collection since D0 was constructed
  – Final answer: (D0 ∪ D+) minus D–
• When the stop-press index gets too large
  – Rebuild the main index
    • signed (d, t, s) records are sorted in (t, d, s) order and merge-purged into the master (t, d) records
  – The stop-press index can then be emptied out

Searching
• Querying
  – 1 word, or all words must be in the retrieved pages
  – normalisation (stop-word removal, stemming, etc)
  – complex queries (date, structure, region, etc)
  – Boolean expressions (advanced search)
  – metadata
• Ranking algorithms
  – use of web links
  – web page authority analysis
    • HITS (Hyperlink Induced Topic Search)
    • PageRank (Google)

Meta-search systems
• Take the search engine to the document
  – Forward queries to many geographically distributed repositories
    • Each has its own search service
  – Consolidate their responses
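Consolidating responses can be sketched as merging the engines' ranked lists. Since the services do not return comparable scores, this hypothetical sketch aggregates rank positions (a Borda-count-style merge) and eliminates duplicates across lists; real metasearchers combine richer evidence.

```python
def merge_results(result_lists):
    """Consolidate ranked result lists from several search engines.

    Each list is ordered best-first. A URL earns a score from each
    list it appears in, proportional to how highly it was ranked;
    appearing in only one list still counts, and duplicates are
    collapsed into a single entry.
    """
    scores = {}
    for results in result_lists:
        n = len(results)
        for rank, url in enumerate(results):
            # earlier (better-ranked) documents get larger increments
            scores[url] = scores.get(url, 0) + (n - rank)
    # sort by aggregated score, best first
    return sorted(scores, key=scores.get, reverse=True)
```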
• Advantages
  – Can perform non-trivial query rewriting
    • adapt a single user query to many search engines with different query syntaxes
  – Surprisingly small overlap between crawls
• Consolidating responses
  – The function goes beyond just eliminating duplicates
  – Search services do not provide standard ranks that can be combined meaningfully

Similarity search
• Cluster hypothesis
  – Documents similar to relevant documents are also likely to be relevant
• Handling "find similar" queries
  – Replication or duplication of pages
  – Mirroring of sites

Document similarity
• Jaccard coefficient of similarity between documents d1 and d2
  – T(d) = set of tokens in document d
  – r(d1, d2) = |T(d1) ∩ T(d2)| / |T(d1) ∪ T(d2)|
  – Symmetric, reflexive, but not a metric
  – Forgives any number of occurrences and any permutation of the terms
• r'(d1, d2) = 1 − r(d1, d2) is a metric

Finding near-duplicates algorithm

Use of web links
• A web link represents a relationship between the connected pages
• The main difference between standard IR algorithms and web IR algorithms is the massive presence of web links
  – web links are a source of evidence but also a source of noise
  – classical IR analogue: citation-based IR
  – web track in TREC, 2000, TREC-9: Small Web task (2 GB of web data); Large Web task (100 GB of web data, 18.5 million documents)

Use of anchor text
• represents the referenced document
  – why?
    • it provides a more accurate and concise description than the page itself
    • it (probably) contains more significant terms than the page itself
  – used by 'WWW Worm', one of the first search engines (1994)
  – representation of images, programs, …
• generate page descriptions from anchor text

Algorithms
• Query-independent page quality – global analysis
  – PageRank (Google): simulates a random walk across the web and computes the "score" of a page as the probability of reaching that page
• Query-dependent page quality – local analysis
  – HITS (Hyperlink Induced Topic Search): focuses on broad-topic queries that are likely to be answered with too many pages
• Intuition: the more a page is pointed to by other pages, the more popular the page; popular pages are more likely to contain relevant information than non-popular pages

PageRank (1)
• Designed by Brin and Page at Stanford University
• Used to implement Google
• Main idea:
  – a page has a high rank if the sum of the ranks of its in-links is high
    • in-link of page p: a link from some page to page p
    • out-link of page p: a link from page p to some page
  – a high-PageRank page has many in-links or a few highly ranked in-links
• Retrieval: use the cosine product (content, features, term weights) combined with the PageRank value

PageRank (2)
• Random surfer model: the user navigates randomly
  – Initially the surfer is at a random page
  – At each step the surfer proceeds
    • to a randomly chosen web page with probability d, called the "damping factor" (e.g. probability of a random jump = 0.2)
    • to a randomly chosen page linked from the current page with probability 1 − d (e.g. probability of following a random outlink = 0.8)
• The process is modelled by a Markov chain
  – PageRank PR of a page a = probability that the surfer is at page a at a given time
    PR(a) = K·d + K·(1 − d) · Σ_{i=1..n} PR(a_i)/C(a_i)
  – d is set by the system
  – K is a normalisation factor
  – a_1, …, a_n are the pages pointing to a
  – C(a_i) = number of outlinks of a_i

HITS: Hypertext Induced Topic Search
• Originated with Kleinberg, 1997
• Also referred to as "the connectivity analysis approach"
• Broad-topic queries produce large sets of retrieved results
  – abundance problem: too many relevant documents
  – a new type of quality measure is needed to distinguish the most "authoritative" pages – a high-quality response to a broad query
• HITS: for a certain topic, it identifies
  – good authorities
    • pages that contain relevant information (good sources of content)
  – good hubs
    • pages that point to useful pages (good sources of links)

HITS (2)
• Intuition
  – authority comes from in-links; being a good hub comes from out-links
  – better authority comes from in-links from good hubs; being a better hub comes from out-links to good authorities
• Mutual reinforcement between hubs and authorities
  – a good authority page is pointed to by many hub pages
  – a good hub page points to many authority pages

HITS (3)
• Set of pages S that are retrieved – typically the k (e.g. k = 200) top-ranked pages
• Set of pages T that point to, or are pointed to by, the retrieved set of pages S
  – ranking pages by in_degree (number of in-links) alone is not effective
• In the set of pages T:
  – authoritative pages relevant to the query should have a large in_degree
  – there is considerable overlap in the sets of pages that point to them – the hubs

Algorithm for HITS (general principle)
• Computation of the hub and authority values of a page through iterative propagation of "authority weight" and "hub weight"
• Initially all values are equal to 1
• Authority weight of page p: x(p)
  – if p is pointed to by many pages with large y-values, then it should receive a large x-value
  – x(p) = Σ_{q → p} y(q)
• Hub weight of page p: y(p)
  – if p points to many pages with large x-values, then it should receive a large y-value
  – y(p) = Σ_{p → q} x(q)
• After each computation (iteration), the weights are normalised

Topic distillation
• The process of finding 'quality' (authority – hub) web documents related to a query topic, given an initial user information need
• Extensions of HITS
  – ARC (Automatic Resource Compilation)
    • distance-2 neighbourhood graph
    • anchor (& surrounding) text used in the computation of hub and authority values
  – SALSA, etc.
• Problems with HITS
  – mutually reinforcing relationships between hosts
  – automatically generated links
  – non-relevant highly connected pages
  – topic drift: generalisation of the query topic

Difference between PageRank and HITS
• PageRank is computed for all web pages stored in the database, prior to the query; HITS is performed on the set of retrieved web pages, for each query
• HITS computes authorities and hubs; PageRank computes authorities only
• PageRank: non-trivial to compute; HITS: easy to compute, but real-time execution is hard
• Implementation details of PageRank have been reported

Document and query management
• Results
  – Usually screens of 10 pages
  – Clustering
  – URL, size, date, abstract, etc
  – Various sorting options
  – "Most similar documents" options
  – Query refinement
• Virtual collection vs. physical collection
  – a document can change over time
  – a document may be different from the one that was indexed
  – broken links

Metasearch (1)
• Problems of web search engines:
  – limited coverage of the publicly indexable web
  – they index different, overlapping sections of the web
  – being based on different IR models, they give different results for the same query
  – users do not have the time or knowledge to select the most appropriate search engines for their information need
• Possible solution: metasearch engines
  – A web server that sends the query to
    • several search engines, web directories, databases
  – Collects the results
  – Unifies (merges) them – data fusion
• Aim: better coverage, increased effectiveness

Metasearch (2)
• Divided into phases
  – search engine selection
    • topic-dependent, past queries, network traffic, …
  – document selection
    • how many documents from each search engine?
  – merging algorithm
    • utilise rank positions, document retrieval scores, …

Metasearcher    URL used               Sources
MetaCrawler     www.metacrawler.com    13
Dogpile         www.dogpile.com        25
SavvySearch     www.search.com         > 1000

MetaCrawler

Browsing
• Web directory
  – Catalogue, yellow pages, subject categories
  – Many standard search engines now provide subject categories
• A hierarchical taxonomy that classifies human knowledge
  Arts & Humanities, Automotive, Business & Economy, Computers & Internet, Education, Employment, Entertainment & Leisure, Games, Government, Investing, Kids & Family, Life & Style, News, People, Philosophy & Religion, Politics, Science & Technologies, Social Science, …

Yahoo!
• Around one million classified pages
• More than 14 country-specific directories (Chinese, Danish, French, …)
• Manual classification (has its problems)
• Acyclic graph structure
• Search within the classified pages/taxonomy
• Problem of coverage

Users and the web (1)
• Queries on the web: average values

Measure                     Average value    Range
Number of words             2.35             0 – 393
Number of operators         0.41             0 – 958
Repetitions of queries      3.97 million     1 – 1.5
Queries per user session    2.02             1 – 173,325
Screens per query           1.39             1 – 78,496

Users and the web (2)
• Some statistics
  – Main purposes: research, leisure, business, education
    • products and services (e-commerce)
    • people and company names and home pages
    • factoids (from any one of a number of documents)
    • entire, broad documents
    • mp3, images, video, audio
  – 80% do not modify the query
  – 85% look at the first screen only
  – 64% of queries are unique
  – 25% of users use single keywords (a problem for polysemous words and synonyms)
  – 10% of queries are empty

Users and the web (3)
• The results of search engines are identical, independent of
  – the user
  – the context in which the user made the request
• Adding context information can improve search results by focusing directly on the user's need
  – explicit context
    • query + category
  – implicit context
    • based on documents edited or viewed by the user
  – personalised search
    • previous requests and interests, user profile

Motivation
• Problem 1: a query word can be ambiguous
  – e.g. the query "star" retrieves documents about astronomy, plants, animals, etc.
  – Solution: visualisation
    • cluster the document responses to queries along the lines of different topics
• Problem 2: manual construction of topic hierarchies and taxonomies
  – Solution:
    • preliminary clustering of large samples of web documents
• Problem 3: speeding up similarity search
  – Solution:
    • restrict the search for documents similar to a query to the most representative cluster(s)
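Returning to the link-analysis slides: the PageRank recurrence PR(a) = K·d + K·(1 − d)·Σ PR(a_i)/C(a_i) can be sketched as a power iteration. This is a minimal sketch under stated assumptions: the normalisation factor K is taken as 1/N and applied to the random-jump term, and every page is assumed to have at least one out-link.

```python
def pagerank(links, d=0.2, iterations=50):
    """Random-surfer PageRank: with probability d jump to a random
    page, with probability 1 - d follow a random out-link.

    `links` maps each page to the list of pages it links to
    (every page is assumed to have at least one out-link).
    """
    n = len(links)
    pr = {p: 1.0 / n for p in links}          # uniform starting distribution
    for _ in range(iterations):
        new = {p: d / n for p in links}       # random-jump contribution (K*d)
        for p, outs in links.items():
            # p distributes (1-d) of its rank evenly over its C(p) out-links
            share = (1 - d) * pr[p] / len(outs)
            for q in outs:
                new[q] += share               # p is an in-link of q
        pr = new
    return pr
```

Because each page passes on all of its rank mass, the values remain a probability distribution over pages, matching the Markov-chain reading above.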
Example
Scatter/Gather, a text clustering system, can separate salient topics in the response to keyword queries. (Image courtesy of Hearst)

Clustering
• Task: evolve measures of similarity to cluster a collection of documents/terms into groups, so that similarity within a cluster is larger than across clusters
• Cluster hypothesis: given a 'suitable' clustering of a collection, if the user is interested in document/term d/t, he is likely to be interested in other members of the cluster to which d/t belongs
• Similarity measures
  – represent documents by TFIDF vectors
  – distance between document vectors
  – cosine of the angle between document vectors

Clustering: parameters
• Similarity measure s(d1, d2) (e.g. cosine similarity)
• Distance measure δ(d1, d2) (e.g. Euclidean distance)
• Number "k" of clusters
• Issues
  – large number of noisy dimensions
  – the notion of noise is application dependent

Clustering: formal specification
• Partitioning approaches
  – bottom-up clustering
  – top-down clustering
• Geometric embedding approaches
  – self-organising map
  – multidimensional scaling
  – latent semantic indexing
• Generative models and probabilistic approaches
  – single topic per document
  – documents correspond to mixtures of multiple topics

Partitioning approaches
• Partition the document collection into k clusters {D1, D2, …, Dk}
• Choices:
  – minimise the intra-cluster distance Σ_i Σ_{d1,d2 ∈ Di} δ(d1, d2)
  – maximise the intra-cluster semblance Σ_i Σ_{d1,d2 ∈ Di} s(d1, d2)
• If cluster representations D̄i are available
  – minimise Σ_i Σ_{d ∈ Di} δ(d, D̄i)
  – maximise Σ_i Σ_{d ∈ Di} s(d, D̄i)
• Soft clustering
  – d is assigned to Di with 'confidence' z_{d,i}
  – find the z_{d,i} so as to minimise Σ_i Σ_d z_{d,i} δ(d, D̄i) (or maximise Σ_i Σ_d z_{d,i} s(d, D̄i))
• Two ways to get partitions: bottom-up clustering and top-down clustering

Bottom-up clustering (HAC)
• Initially G is a collection of singleton groups, each containing one document d
• Repeat
  – find Γ, Δ in G with the maximum similarity measure s(Γ ∪ Δ)
  – merge group Γ with group Δ
• For each Γ keep track of the best Δ
• Use this information to plot the hierarchical merging process (dendrogram)
• To get the desired number of clusters: cut across any level of the dendrogram

Dendrogram
A dendrogram presents the progressive, hierarchy-forming merging process pictorially.

Similarity measure
• Typically s(Γ) decreases with an increasing number of merges
• Self-similarity
  – average pairwise similarity between documents in Γ:
    s(Γ) = (1 / (|Γ| choose 2)) · Σ_{d1,d2 ∈ Γ} s(d1, d2)
  – s(d1, d2) = inter-document similarity measure (say, the cosine of TFIDF vectors)
  – other criteria: maximum/minimum pairwise similarity between documents in the clusters

Switch to top-down
• Bottom-up
  – requires quadratic time and space
• Top-down, or move-to-nearest
  – internal representation for documents as well as clusters
  – partition documents into k clusters
  – 2 variants
    • "hard" (0/1) assignment of documents to clusters
    • "soft": documents belong to clusters with fractional scores
  – termination
    • when the assignment of documents to clusters ceases to change much, OR
    • when cluster centroids move negligibly over successive iterations

Top-down clustering
• Hard k-means:
  – choose k arbitrary 'centroids'
  – repeat:
    • assign each document to the nearest centroid
    • recompute the centroids
• Soft k-means:
  – don't break close ties between document assignments to clusters
  – don't make documents contribute to a single cluster that wins narrowly

Research issues
• Modelling
• Querying
• Distributed architecture
• Ranking
• Indexing
• Dynamic pages
• Browsing
• User interface
• Duplicated data
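The hard k-means procedure above can be sketched as follows. This is a minimal sketch assuming dense equal-length vectors (e.g. small TFIDF vectors) and squared Euclidean distance as δ; the function names and parameters are illustrative.

```python
import random

def kmeans(vectors, k, iterations=20, seed=0):
    """Hard k-means: pick k arbitrary centroids, then repeatedly
    assign each document vector to its nearest centroid and
    recompute the centroids, until assignments stabilise.
    """
    rnd = random.Random(seed)
    centroids = [list(v) for v in rnd.sample(vectors, k)]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # assignment step: nearest centroid by squared Euclidean distance
        new_assignment = [
            min(range(k), key=lambda c: sum((x - y) ** 2
                for x, y in zip(v, centroids[c])))
            for v in vectors
        ]
        if new_assignment == assignment:
            break                      # assignments ceased to change
        assignment = new_assignment
        # update step: each centroid becomes the mean of its cluster
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return assignment, centroids
```

The soft variant would replace the 0/1 assignment with fractional scores z_{d,i}, so narrowly winning clusters do not absorb a document's whole contribution.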