Search and the ‘Net at 2004
Trends, Challenges and Cutting-Edge Developments in Internet Search Services

Michael Hunter
Reference Librarian
Hobart and William Smith Colleges

for Rochester Regional Library Council Member Libraries’ Staff
Sponsored by the Rochester Regional Library Council
Supported by Library Services and Technology Act (LSTA) and/or Regional Bibliographic Databases and Resources Sharing (RBDB) funds granted by the New York State Library
2003

For Today ...
• State of the ‘Net and its Users
• Search Industry Overview
• Recent Developments in Established Services
• New Services
• The Deep Web at 2004
• Tracking the Living Web: Weblogs and RSS
• Cutting-edge Developments
• Trends and Challenges to Today’s Search Services

The Internet and its Users at 2004

How large is the Web?
• What do you mean by the Web?
• The totality of all Web sites
• Sounds simple ... BUT IS IT?

UC Berkeley’s How Much Information Project
http://www.sims.berkeley.edu/research/projects/how-much-info2003/internet.htm
NOTE: 10 terabytes = total print collections of the Library of Congress

Internet Use Worldwide

Internet Use in the US
http://www.pewinternet.org

“Top Ten” things our users do online
http://www.pewinternet.org

ACTIVITY                                   %
E-mail                                     92
Use search engine for specific question    83
Consumer info.                             88
Get a map                                  79
Hobby info.                                76
Leisure info.                              73
Weather                                    75
Get news                                   69
Instant message                            67
Health/medical info.                       66

Undergraduates and Search Engines
Colaric, S. “Instruction for Web Searching: An Empirical Study.” College and Research Libraries 64 (2), March 2003, p. 111-116

QUESTION                         %YES   %NO   %Don’t Know
All work the same way            67     10    23
Engines look at all sites        64     18    18
Term(s) need to match index      58     19    23
Gathers sites using a crawler    15     9     96
OR retrieves more than AND       62     18    20

The Internet Search Industry:
• Consolidation
• Performance Measures
• Popularity

The Shrinking Search Industry
• Editorial control of search is shared among few
  • Yahoo owns AlltheWeb, Altavista, Inktomi, Overture (paid listings)
  • Google
  • MSN
  • AskJeeves owns Teoma
  • LookSmart owns Wisenut
  • Gigablast
• NOTE: Ownership is different from database affiliation

Google Database Affiliates
• Google
• AOL
• Netscape
• Yahoo
• Openfind

Database Freshness
http://www.searchengineshowdown.com/stats/freshness.shtml
• Based on a series of 6 current topic searches
• Pages that are updated daily AND report that date on the page
• Queries submitted May 17, 2003
• Most have some results indexed in the last few days
• The bulk of most of the databases is about 1 month old
• Some pages may not have been reindexed for much longer

Popularity: Searches per day
Self-reported data, as of 2/28/03
http://searchenginewatch.com/reports/article.php/2156461

SERVICE      Searches, in Millions
Google       250
Overture     167
Inktomi      80
LookSmart    45
FindWhat     33
AskJeeves    20
AltaVista    18
AlltheWeb    12

Recent Developments among Established Services

Google
• Froogle
• Phonebook
• Wildcard Words
• Info:
• Synonym feature
• Supplemental Index
• Search by location
• News Advanced Search and News Alerts
• ???
Froogle
• Locates information about products for sale online
• Gives URLs of sites offering the item
• Provides links to the exact page in the site where you can make the purchase
• Ranking follows normal Google ranking processes
• Paid placements always clearly marked
• Price range limits available
• Access at http://froogle.google.com or via Google Advanced Search

Phonebook Command Search
• Searches US residential (rphonebook:) and business (bphonebook:) listings of Yahoo, MapQuest and other services
• rphonebook:
  • MUST INCLUDE: last name; city and/or state
  • MAY INCLUDE: first name
• bphonebook:
  • MUST INCLUDE: business name (min. 1 word); city and/or state
  • MAY INCLUDE: full business name

Wildcard Words
• Google offers a word-sized asterisk to function as a wildcard
• Stands for a whole word
• Cannot be used for part of a word
  • “three * mice” = 22,000 results
  • “three bl* mice” = 0 results
• Several * can be used together
  • milosevic “International * * Hague”
  • Retrieves military tribunal OR military court OR war tribunal OR military tribunal

info:
• Not exactly hidden, but not well-known
• Searches for any information Google has about a site
• Convenient way to monitor linkage
• Typing a URL in the search box will give the same results

Synonym Feature
• Place a tilde ~ immediately before a term to retrieve synonyms or related terms from the Google index
• Eliminate the original term by placing a minus sign before it:
  ~hiking -hiking

Google’s Supplemental Index
• For obscure or unusual searches
• Queried when Google fails to find good matches within its main web index
• Live 9/9/03
• Sample queries:
  • “St. Andrews United Methodist Church” Homewood IL
  • “nalanda residential junior college” alumni
  • “illegal access error” jdk 1.2b4
  • supercilious supernovas

Search by Location (beta)
http://labs.google.com/location
• U.S. only
• Keyword(s) combined with address, city, state or zip
• Search results appear on a map

News Advanced Search and News Alerts
• Advanced News Search added this Fall
• News Alerts
  • Requires a (free) account
  • One query per alert; limit of 50 alerts per email address
  • Alerts contain links to news containing your alert keywords
  • Cannot edit a query; delete and create a new one instead
  • Alerts sent once a day or “as it happens”

More about Google ...
• Google World http://indicateur.com
  • Maintained by a French search engine site and listed under Guides
  • Use Google translator (see Language Tools) to translate the site
• Google Lab http://labs.google.com
  • Place for cutting-edge developments, many in beta awaiting user feedback and testing

Beyond Google:

AskJeeves
• Simpler, cleaner interface
• Teoma crawler-based results blended with AJ “answers”
• Improved image database
• “Smart Answers”: popular queries mapped to news, image and other sources “appropriate to the query”

ATW (FAST)
http://alltheweb.com
• Continued commitment to a large database (2nd to Google)
• Powerful, new advanced search capabilities
• Extensive page customization options
• Results clustered by topic (“Folders”)
• Both HTML and multimedia given, when available
• NOTE: Folders are located at the BOTTOM of each results screen

Altavista
• Simpler interface
• More language options
• Expanded image and multimedia collections
• Results labeled “Refreshed in last 48 hours”
• Includes PDF files
• “US” and “Local” search options
• “Prisma” query refinement

Altavista Prisma Query Refinement
• Offers a maximum of 12 terms having the strongest associations with the original query term(s)
• Selected from the top 50 results of the original query
• NOTE: Clicking on a “Prisma” term adds it to your original query, creating a new set of Prisma terms.
• Similar to Refine (1997) but less graphic

Teoma
• Ranking: includes a site’s relationship to other sites with similar content
• Results: ranked database results, with “Related Pages”
• Refine: clustering of your results and other related sites based on term relationships and web community linkages derived from your original results
• Resources: “Link Collections from experts and enthusiasts” (subject metasites)

Hotbot
• Searches Hotbot (Inktomi) OR Google OR Lycos OR AskJeeves
• Not a true metaengine
• Advanced features operable only if supported by source engines

Metacrawler
• Along with Dogpile and Webcrawler, owned by Infospace
• Simpler interface
• Offers the following customizations:
  • Selection of sources searched
  • Total number of results retrieved
  • Length of search (“time-out period”)
• Offers a wide range of vertical searches: Images, MP3, Shopping, Subject Directory, Multimedia, News, Message Boards

New Services Attracting Attention

Gigablast
• Launched April, 2002
• Smaller database than others
  • Over 200 million on 10/4/03
  • Sample query pope canterbury: Google 83,200 results; Gigablast 24,919 results
• Created and maintained by Matt Wells (alone)
• Only search engine “continuously updated with index refreshed in real time” (site submissions are immediately searchable)
• Ranking depends less on linkage than Google’s ranking, to avoid penalizing newer pages
• No advertising (to date)

Gigablast Search Features
• Basic search: full Boolean
• Advanced Search: full Boolean and 2 (!) phrase boxes
• Limit by site
• Limit by domain (URL)
• Links to a page available
• Most “generic” html metatags indexed, searched and made available for display (unique to Gigablast!!!)
• Field searches include title, IP address and non-html filetypes: PDF, Word, Excel, PPT, PostScript, ASCII text
• Results from one site clustered
• Cached version available
• Results include date indexed and last modified (!!)
• Linking to Gigablast improves ranking there

KillerInfo
http://www.killerinfo.com
• Metaengine searching Google, AOL, Lycos, Gigablast, MSN, Altavista, LookSmart and Open Directory
• 9 topical Deep Web channels offered
• Boolean and phrase search
• No other Advanced Search features
• Results clustering (a la Vivisimo)
• Number of results not given
• Adult content filter

Surfwax
http://surfwax.com
• Demo site for federated search software
• Simultaneous search of Deep Web, intranets, Web and more
• Metaengine searches Wisenut, AOL, MSN, Yahoo, Encarta, CNN, LookSmart
• FOCUS search refinement feature: online thesaurus of related terms and definitions
• Site SNAP of a result offers
  • Author summary (from metatags)
  • Related sites
  • Site’s FOCUS words
  • Key Points (query-related sections)
• Results ranking options: Relevance, Alpha and Source
• Preferences and Advanced Features require a (free) account; more options available to fee-based accounts

Nutch
http://nutch.org
• Project to implement an open source web search engine
• Why open source?
  • With open source, search results processing is transparent, not hidden. Bias (if any) can be examined by anyone.
  • Open source applications are free and available for use, modification or for-profit use. Users are asked to contribute their innovations back to the code base.
• Nutch is seeking volunteer developers and donations

The Deep Web at 2004

The Topography of the Internet or The Layers of the Web
• Mapping the web is challenging
  • Unregulated in nature
  • Influences from all over the globe
  • Fulfills many purposes, from personal to commercial
  • Changes rapidly and unexpectedly
• Divisions and terminology are inherently ambiguous, e.g. “Deep” vs “Invisible” Web
• May I suggest a biological, nautical metaphor, perhaps the ocean?
SURFACE WEB
SHALLOW WEB
OPAQUE WEB
DEEP WEB
DARK WEB

Surface Web
• Static html documents
• Crawler-accessible

Shallow Web
• Static html documents loaded on servers that use ColdFusion, Lotus Domino or other similar software
• A different URL for the same page is created each time it is served. Crawlers skip these to avoid multiple copies of the same page in their database.
• Technically human-accessible via directories, Deep Web gateways or links from other sites

Opaque Web
• Static html documents
• Technically crawler-accessible
• 2 types:
  • Downloaded and indexed by crawler
    • Buried in search results you never look at
    • A casualty of “relevance” ranking
  • Not downloaded or indexed by crawler, due to programmed download limits
    • Document buried deep in the site
    • Part of a large document that did not get downloaded (typical crawl per page is 110 K or less)
    • Document added since last crawler visit (even the best revisit on average every 2 weeks, depending on the amount of change at a site)

Access to the Opaque Web
• Specialized search engines
• General and specialized directories
• Subject metasites
• These services typically index more thoroughly and more often than large, general search engines

Deep Web
• Technically inaccessible to crawlers
  • Dynamically created pages
  • Databases
  • Non-textual files
  • Password-protected sites
  • Sites prohibiting crawlers
• Technically accessible to crawlers
  • Textual files in non-html formats

Dark Web
http://research.arbornetworks.com
• Up to 5% of the web is completely unreachable due to
  • Misconfigured routers
  • Contractual disputes between ISPs
  • Broadband users with personal or corporate firewalls
  • US Military sites

UC Berkeley’s How Much Information Project
http://www.sims.berkeley.edu/research/projects/how-much-info2003/internet.htm
NOTE: 10 terabytes = total print collections of the Library of Congress
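
Example: how a crawler learns that a site prohibits it

One of the Deep Web categories above is “sites prohibiting crawlers.” A common way for a site to do this is a robots.txt file that well-behaved crawlers check before downloading pages. The sketch below is illustrative only: the robots.txt rules, the example.org URLs and the crawler name ExampleCrawler are assumptions made up for this example and do not come from any service named in this talk. It uses Python’s standard urllib.robotparser module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site (an assumption for illustration).
ROBOTS_TXT = """\
User-agent: *
Disallow: /catalog/
Disallow: /members/
"""


def allowed_urls(robots_txt, user_agent, urls):
    """Return the URLs that the robots.txt rules permit user_agent to fetch."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical pages on the imaginary site.
CANDIDATE_URLS = [
    "http://www.example.org/index.html",
    "http://www.example.org/catalog/search?q=dvd",
    "http://www.example.org/members/profile.html",
]

if __name__ == "__main__":
    for url in allowed_urls(ROBOTS_TXT, "ExampleCrawler", CANDIDATE_URLS):
        print("crawlable:", url)
```

In this sketch only the home page is reported as crawlable; the catalog and member pages stay out of a compliant crawler’s index even though a person can reach them in a browser.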
Reducing the Deep Web: mod_rewrite
• Making dynamic pages available to crawlers
• mod_rewrite software is loaded onto a web server containing dynamic pages (databases, etc.)
• Crawler follows a link to a stable URL on the server: www.mydomain.com/dvdplayers.html
• mod_rewrite translates that stable URL into the request for the matching dynamic content (here, dvdplayers), so each dynamic page is served at a stable URL
• These pages are linked to each other, creating a stream of virtual pages that can be crawled by any of the search engines
• Search engines often check the stream for spam or duplicate pages

Mining the Deep Web: Directed Query Engines or Intelligent Agents
• Designed to access distributed Deep Web resources
• Some can be configured to search specific URLs
  • Databases
  • Subject metasites
  • Report collections
  • Dynamic pages
  • Online newsletters

Directed Query Engines for purchase
• Simultaneous search of Deep Web and other resources with many additional features
• Lexibot http://www.lexibot.com
  • If you complete the survey: $189, upgrades $15
  • If you don’t: $289, upgrades $50
• BullsEye http://info.intelliseek.com
  • BullsEye Pro: $199 with free upgrades for 6 months

Hunter’s Maxim for the Deep Web
• Plan to first locate the category of information you want, then browse.
• Don’t be too specific in your searches. Cast a wide net.

TRACKING THE LIVING WEB: WEBLOGS AND RSS FEEDS

Blogs: What are they?
• Online diaries or journals, usually by one person, though many invite “comments”
• First developed in 1997
• Within the same blog, tone can range from personal musings to discussion of recent issues in technology and research
• High link-to-word ratio
• Often link to other weblogs of similar content

Blogs: What are they?
• Can contain rumor, inside information, speculation and blatant errors, as well as
  • Breaking news: political and technical/research
  • Commentary on new software or websites
  • Consumer reaction to products or services
• Blog authoring tools are basic content management software, useful in ways other than online diaries
• Typify the spirit of information sharing that has fueled the Internet since its beginnings

How large is the blogosphere?
• 2.4 to 2.9 million active blogs (est.)

Who’s blogging? (Jupiter Research)
• 2% of Internet users have created a blog
• About 50% women, 50% men
• Over 50% are in English; remaining languages, in order of prevalence: Portuguese, Polish, Farsi, French, Spanish, German, Italian, Dutch and Icelandic

More ...
• About 4% of Internet users read blogs: 60% men, 40% women
• On average, blogs are updated every 3 days
• About 4% of online Americans have gone to blogs for information about the Iraq War
• LiveJournal (large blog host) was the 650th most popular site on the Internet (May, 2003)
  • 184,000 readers every 10 days
  • Readers spend an average of 22 minutes at the site

Creating a Blog
• Blogger http://new.blogger.com
  • Free, automated Web publishing tool
  • Requires no new software
  • Send posts to an existing website or create a free blog at Blogger
  • Provide a site template and where you want the postings to appear
  • To update, create a posting and submit the permission form; Blogger will send it by FTP
  • Advanced options available

Locating Blogs
• Blog Hosting Sites
  • www.livejournal.com
  • diaryland.com
  • radio.userland.com ($39.95 with added features)
• Blog metasites
  • www.lights.com (library-related, world-wide)
  • www.blogrunner.com
  • www.llrx.com/columns/notes46.htm
  • portal.eatonweb.com/
• Subject Directories
  • dmoz.org/Computers/Internet/On_the_Web
• General Search Engines
  • Search: blog keyword(s) or URL(bloghost) keyword(s)
• Professional Association homepages
• Subject Metasites
  • Use Teoma.com “Resources”

Searching Blog Content
• Blog hosting sites
  • www.livejournal.com
• Blog Search Engines
  • Feedster.com (also includes RSS feeds)
  • Daypop.com (current events)
  • Blogdex.media.mit
  • www.technorati.com
  • blogging-news.info
• Topical Blog Search Engines
  • Detod (blawgs.detod.com): exclusively legal weblogs

Blogs and General Search Engines
• Blog-rich sites are increasingly visited by major crawler-based search services
• HOWEVER, ANY rapidly-changing content can easily be missed by crawlers

Obstacles to Crawling and Indexing Blog Content
• Only the most recent postings appear on the blog homepage (older ones are archived, and inaccessible to crawlers)
• Many bloggers post dozens of times a day
• Frequent postings may contain information critical to time-sensitive topics
• Even a daily crawl would miss these postings (a typical crawl is about once every 3 weeks)

Obstacles to Crawling and Indexing Blog Content: Page Design
• Several postings usually appear on the blog homepage
• Postings are NOT indexed separately, as the crawler indexes the page as a whole
• Retrieval of an individual posting on a topic is unreliable

Blogs and Libraries
• Blogs can offer an opportunity to post content on the Web quickly, with no delay for FTP uploading or submission to a webmaster
  • “What’s New”
  • “Favorite Books”
  • “Recent Acquisitions”
  • “Program Changes due to the Weather”
• Get more people involved in posting content on the Library (or library-sponsored) website
  • No knowledge of html, RSS or XML needed
  • Log onto the blog hosting website, create content, and update the page
• Current awareness without the annoyance of unwanted e-mails
  • Choose when YOU want that information by visiting your blogs of choice

Blogs and Libraries: Metasites
• Blogs and Libraries: A Bibliography (online)
  http://www.etches-johnson.com/nolibrary/bib.html
• Library Weblog Directory
  http://www.libdex.com/weblogs.html
• Blogs at the University of Minnesota Libraries
  http://www.lib.umn.edu/san/mt/
• Fichter, D. (2003). Why and how to use blogs to promote your library's services. Marketing Library Services 17(6).
  http://www.infotoday.com/mls/nov03/fichter.shtml

RSS
• “Rich Site Summaries”
• “Really Simple Syndication”
• “Really Stops Spam”

Before RSS: Tracking latest news and site updates
• Software packages that monitored and reported changes at sites of your choosing
• News alert services, free and fee
• Manual checking of your bookmarks
• “Hit or miss” Listserv and Usenet postings

RSS: What is it?
• XML filetype with content that is
  • Structured (tags, standard and/or author-defined)
  • Re-useable (can be integrated into web, e-mail, multimedia and many other formats)
• Originally developed by Netscape as a content management tool for personalizing home pages
  • “My News”
  • “My Sports”
  • “My Weather”
• RSS in detail: http://blogs.law.harvard.edu/tech/rss

RSS: What can it do?
• Creates a broadcast version of frequently updated content from a website, blog, news page or other source
• Authors can
  • Summarize new content
  • Broadcast new content, e.g. online newsletters
• Can be used as a way to distribute content to subscribers (syndication) independent of e-mail. Subscribers log on or access via aggregators.

How do I access them?
• As RSS is in XML, it may require downloading reader software (older versions of browsers cannot read XML).
• Sources for reader software include
  • www.lights.com
  • blogspace.com
• Sites with RSS feeds display a small icon (usually orange) labeled RSS or XML
• General search engines (limited, but worth a try)
  • filetype:xml keyword(s)

RSS Directories and Search Engines
• Syndic8 syndic8.com
  • Directory of available syndicated news feeds
  • Provides no reading area
  • Uses Open Directory classification
• Feedster www.feedster.com
  • The best search engine for blogs and RSS feeds
• Yahoo news.yahoo.com/rss
• Canadian Government tinyurl.com/vrh7
• Often found in Blog Directories and Engines

RSS aggregators
• Receive general or topical RSS feeds and blog postings
• Many are focused on news only
• Present content in compact form
• Combine multiple sources in one interface
• Provide links to full content
• In personal desktop versions or online

Personal desktop aggregators
• Let you specify any feeds you want access to
• AmphetaDesk www.disobey.com/amphetadesk/
• Radio UserLand radio.userland.com ($$)
• Feedreader feedreader.com

Online aggregators
• Selection of feeds may be limited
• NewsIsFree NewsIsFree.com
  • 7379 sources grouped into 16 channels
  • Create custom pages
  • $$ offers more “Premium options”
• Many RSS sites include links to other aggregators

Authoring and Producing RSS
• Lockergnome rss.lockergnome.com
  • Documents, tools, developers, aggregators, free feed generator for your site
• RSS Primer for Publishers www.eevl.ac.uk/rss_primer/
  • Producing RSS feeds
  • Technical information
  • Feed promotion
• Feedster www.feedster.com

Blogs and RSS
• Blogs may offer some or all of their content as RSS feeds, or not
• Blogs can exist as “pure” html documents, updated frequently
• Making content available in RSS increases a blog’s access and exposure via aggregators and other RSS-based search services

The Living Web
• What can blogs and RSS feeds tell us about an author’s point of view?
  • Which ones does an author list on their blog/homepage?
  • Which ones does an author visit/subscribe to?
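
Example: what an aggregator does with an RSS feed

To make the RSS slides above more concrete, here is a minimal sketch of the first step an aggregator or reader performs: parse the XML and list each item’s title and link. It assumes the common RSS 2.0 layout (a channel element containing item elements, each with title and link). The feed text, the library name and the example.org URLs are hypothetical, invented for illustration. The code uses only Python’s standard xml.etree.ElementTree module and is not how any particular product named above is implemented.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed for an imaginary library "What's New" page.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Library News</title>
    <link>http://www.example.org/</link>
    <description>Hypothetical feed used for illustration</description>
    <item>
      <title>Recent Acquisitions</title>
      <link>http://www.example.org/new-books.html</link>
      <description>New titles added this week.</description>
    </item>
    <item>
      <title>Program Changes due to the Weather</title>
      <link>http://www.example.org/closings.html</link>
      <description>Story hour is cancelled today.</description>
    </item>
  </channel>
</rss>"""


def read_feed(xml_text):
    """Return (channel title, [(item title, item link), ...]) for an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    channel = root.find("channel")
    if channel is None:
        raise ValueError("not an RSS 2.0 feed: no <channel> element")
    items = [
        (item.findtext("title", default=""), item.findtext("link", default=""))
        for item in channel.findall("item")
    ]
    return channel.findtext("title", default=""), items


if __name__ == "__main__":
    feed_title, entries = read_feed(SAMPLE_FEED)
    print(feed_title)
    for title, link in entries:
        print(" -", title, "->", link)
```

A real aggregator would also download the feed on a schedule and remember which items it has already shown; those steps are left out to keep the sketch short.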
• Sometimes I want to know what the world thinks: GOOGLE
• Sometimes I want to know what I think: MY WEBLOG
• Sometimes I want to know what those I respect think: BLOGS AND FEEDS I READ

Beyond today’s (free) search engines: Cutting-edge developments

Including Context in System Design
• Context matters (!!??!)
  • Textual context
  • Query context: “Who is asking and why?”
• Traditional approaches to retrieval have been deductive
  • Data organized and mapped to anticipated query terms (controlled vocabularies, taxonomies)
  • Human created and maintained
  • Too slow for rapid data streams

Bayesian approaches
• Use statistical inference based on Bayes’ Theorem of Probability (Thomas Bayes, 1702-1761)
• Inductive approach (adaptive processing)
  • Take the user’s information environment
  • Infer structures, relationships, likely queries
• Inferred structures and relationships can then be mapped to a human-created classification scheme
• Currently used in corporate intranet and fee-based content management software
• Will be used more in general information systems of the future

Adaptive Processing
• Learning the searcher’s interests
  • What term(s) did you search?
  • What did you select?
  • How long did you look at it?
  • What is its source?
  • How old was it?
• Direct input from searcher
  • Rank the sources
  • Rate individual results
  • Eliminate certain sources, sites

Inquirus
http://inquirus.nj.nec.com
• Query interface research project
• Attempts to improve precision of results
• Monitors user’s search behavior to infer intent of queries
• Re-formulates queries to increase likelihood of desired answers
  • USER: “How do you make salsa?”
  • SYSTEM: salsa and (recipe or ingredients or food)
  • Eliminates pages on salsa dancing
• Ranking relies heavily on proximity of query terms and system-provided cognates to each other in the document

Vector-Space Model
• 3-dimensional retrieval
• A way of ordering documents by word frequency/context in a “term space” and matching them to queries
• Documents are assigned coordinates
• One document may be in many “term spaces” or vectors
• Queries that fall within a given vector are likely to be answered by documents located in that vector

A Multi-dimensional Boolean
• Boolean: limited to term matches
  • terrier female puppy
• Vector-space model
  • More complex relationships can be mapped
  • “Degrees of relatedness” of document to query
  • Query and document “weights” based on length and direction of their vectors

Documents in Vector Space
“What do you have on movie stars’ diets?”
[Diagram: documents plotted against the axes STAR and DIET: doc about movie stars, doc about astronomy, doc about mammal behavior]

Phibot
http://phibot.org
• Project of the Univ. of Mainz and German Institute of Artificial Intelligence
• Crawls science, medicine and news web sites
  • 200 million general science sites
  • 70 million medical sites
• Traditional: Google-like processing
• Vector-Space Optimization: greater vector-space processing

Digital Video Search
• Searches actual visual content
• Project of Dublin City University http://www.cdvp.dcu.ie
• Determines the “structure” of the video by identifying shots with the greatest degree of change (keyframes)
• Uses these to create a structure, and allows the user to refine the query based on these
• Needed by journalists, governments and airport security

Current Trends in and Challenges to Today’s Search Industry

User Interface Trends
• Toolbars, toolbars, everywhere
  • Review site: searchenginewatch.com/links/article.php/2156381
• Search by Location: major engines with local search options and local specialized ones
  • Makes the haystack smaller; important in e-commerce
• “P2P” networks (peer-to-peer)
  • File-sharing networks, a la Napster
  • KaZaA: most popular download EVER!
  • Shares any filetype
  • 90% of files shared are audio-visual in nature
• Application Program Interface (API)
  • Published set of programming “hooks” that lets you interact directly with a company’s open servers
  • You can mine the company’s databases for free
  • WHY? To attract more traffic to the site
  • Example: http://www.googlerace.com
  • Enter 1 or 2 terms/phrases and see how Bush and Democratic candidates stack up!
  • Created by Tara Calishain

Search in Corporate Settings Drives Search Engine R&D
• Uniform, seamless access to all information: internal & external, data & content
• XML
• More natural language processing
• Hybrid systems to search structured AND unstructured data
• Adaptive processing (Bayesian)
• Use of intelligent agent software
• Easier user interfaces
• Personalization

Industry-wide Trends
• Distributed crawling
  • Volunteer your PC when not in use
  • Grub.com, Looksmart
• Search continues to be driven by advertising and revenue
• Fewer services maintain their own crawler-created database
• Increased crawling of non-html filetypes

Challenges to the Industry
• Revenue
  • E-content providers have cut into search software sales with their proprietary engines
• Fighting fraud
  • Cloaking, ranking manipulation
• Scalability
  • Size of surface Web increases
  • Over 300 million queries a day to all Web search engines
• Freshness
  • Competitive edge demands recent crawls
• Deep Web
  • Embedded databases
  • Non-html filetypes
• Real-time information
  • Growing importance of the Living Web
• Ambiguous query refinement
  • Not very hopeful among general search engines
  • User group too large
  • User profiling difficult
• Indexing the smaller, newer sites
  • Google’s link-based PageRank penalizes these sites

The Biggest Challenge: “Just what are you looking for?”
• A known needle in a known haystack
• A known needle in an unknown haystack
• An unknown needle in an unknown haystack
• Any needle in a haystack
• The sharpest needle in a haystack
• Most of the sharpest needles in a haystack
• All the needles in a haystack
• Affirmation of no needles in the haystack
• Things like needles in any haystack
• Let me know if any new needles show up
• Where are the haystacks?
• Needles, haystacks, ... whatever

Thank You and Happy Holidays!

Michael Hunter
Reference Librarian
Hobart and William Smith Colleges
Geneva, NY 14456
(315) 781-3552
hunter@hws.edu