Search and the ‘Net @ 2006
Trends, Challenges and Cutting-Edge Developments in Internet Search

Michael Hunter
Reference Librarian
Hobart and William Smith Colleges

For Rochester Regional Library Council Member Libraries’ Staff
Sponsored by the Rochester Regional Library Council
Supported by Regional Bibliographic Databases and Resources Sharing (RBDB) funds granted by the New York State Library, 2005

For Today …
The current “state of web search”
What’s new among established services
Services launched recently
Cutting Edge in Search
– Natural Language Processing
– Text mining
The Latest from the Living Web
– Weblogs
– RSS feeds
– Podcasts
Current trends and future possibilities
Linklist for today’s session: people.hws.edu/hunter/search06links.htm

Web Search @ 2006

Who’s crawling the Web?
Yahoo
– Owns AlltheWeb, Altavista, Inktomi, Overture
Google
MSN
AskJeeves
– Owns Teoma
Gigablast
NOTE: Ownership is different from database affiliation

Google Database Affiliates
Google
AOL
Excite
Netscape

Most popular services
Google 48%
Yahoo 29% (up 20% from last year)
MSN 8% (up 30% from last year)
All others 15% (AOL, AJ, Net, Gig)
Study by Harris Interactive (must purchase) – www.harrisinteractive.com

Database Size
Google – ca. 10 billion web pages (???)
Yahoo – 20 billion “web objects”
MSN – 6 billion (est.)
Teoma – 3 billion (est.)
Gigablast – 1.5 billion (est.)

Search Engine Overlap
Results compared from 12,500 random queries from the largest engines
85% were unique to one engine
11% were shared by any two
3% were shared by any three
1% were shared by all
Study by Dogpile, U Pittsburgh and Penn State – CompareSearchEngines.dogpile.com/OverlapAnalysis

Recent Developments Among Established Services

2005: A lot to Yahoo!
about
No longer just a subject directory
New features and an estimated 20% increase in users
Vertical Search Engines
– Music, health, finance, shopping and over 20 more
Personalization
– My Yahoo and Yahoo 360
– Creates an online identity with photos, restaurant reviews, personal histories and personal blog

2005: A lot to Yahoo! about
RSS feeds
– Offered as part of My Yahoo
– User-friendly Reader/Aggregator provided; limited to 250,000 Yahoo-selected feeds
– Yahoo content as RSS: News, Ask Yahoo, Buzz Index (popular searches), News Groups
Video search (beta) video.search.yahoo.com
– Advanced search features: keyword, format, file size, length, content filter
Creative Commons search.yahoo.com/cc
– Content that is free to share or modify

2005: A lot to Yahoo! about
Contextual Searching - Y!Q
Selected web pages or highlighted sections analyzed for word frequency and “concept extraction” and used as basis for a search
Results give basis for query in “context selection box”
Refinements include removing unwanted terms/phrases and “more like this” link
Requires download of free toolbar: toolbar.yahoo.com

2005: A lot to Yahoo! about
Open Content Alliance (10/3/05)
Large scale E-text initiative
Members include Yahoo, Internet Archive, National Archives (UK), RLG, LC, 8 US and 6 Canadian Universities
Over 25,000 digitized copies of public domain AND copyrighted works
Works under copyright only available if permission granted by owner
Yahoo plans to include the content in its database or subject directory

2005: A lot to Yahoo!
about
Yahoo/OCLC toolbar
Searchers may restrict their results to the Open WorldCat database, currently at 57 million records
Displays library holdings in the searcher’s vicinity
Download (free) at www.oclc.org/toolbar

AOL search.aol.com
Results from Google
Personalization (with free account)
Results clustering a la Vivisimo
“Smartbox” query refinement
– Offers suggestions BEFORE search button is clicked
“Snapshots”
– Human-created answers
Local Search, Maps, Vertical Engines

Gigablast www.gigablast.com
“Related pages”
– Relevant search results which may not contain original search terms
Database now at 1.5 billion (up 50%)
One to keep your eye on

Ixquick Metaengine www.ixquick.com
Repetitive results removed
Results marked as irrelevant by user used to delete other similar pages in real time
International price comparison covering over 5,000 merchants
International phone directory, residential and business

Google Personalization
Re-orders search results based on user’s past searches and click tracks
Ranking will change, depending on user profiles
Requires setting up a (free) account
Personalized home page (G. as portal?)
Complex profiles are problematic
– e.g. “Movies, computer hardware, the Internet, general news, astronomy”
– SEARCH: cars
– Which categories take precedence over others?

Google Personalization
Search records personally associated with a user are deleted if service is dropped
Search log data for all Google searches kept (via cookies)
Google’s privacy policy: www.google.com/privacy.html
Bookmark entire web pages

Google
Google Earth earth.google.com
Geographic search application
Originally Keyhole 3D, now a free Google download
Images taken by satellites and aircraft “sometime in the last 3 years”
“Fly to” accepts an address or coordinates, returns a view from 3,000 ft.
above, with zoom capabilities

Google Local for Mobile google.com/glm
Free download
Unique ID associated with your phone
Simplified version of the web-based Local Search
Emphasis on maps and directions
Point-to-point directions limited to a certain area
Business listings offer address and phone number only
Does not support all mobile phones

Google Video Search video.google.com
Index of closed captioning and text descriptions from selected TV and other video content after Dec. 2004
Results include thumbnail, description, source, date, duration and hyperlink
Currently the hyperlink leads to more description, not to the video itself

Q&A Service
Ready reference service providing answers to fact-based queries

Google Print’s 2 divisions: Publisher Program and Library Project
Publisher Program
Publishers authorize G. to scan and make searchable the full text of their books
Users see only the full page containing their search terms
Link to purchase copy

Google Print’s 2 divisions: Publisher Program and Library Project
Library Project
Scan and make searchable 15 million books, in and out of copyright, from Harvard, Stanford, Oxford, U. Michigan and NYPL
For works in copyright, users see only a few sentences around search terms
Users may browse full text of public domain works
NOTE: Not possible to print ANY material from either Google Print project

Library Project in 2005
June – Assoc. of American Publishers questions legality of Library Project
August 15 – G. “temporarily halts” scanning in-copyright works; continues scanning public domain works
September 20 – Authors Guild files a formal complaint against G.
in NY Federal District Court alleging “massive copyright infringement”

Services Launched Recently

Icerocket www.icerocket.com
Results Enhancements
– Thumbnails of home page
– Archived version (Internet Archive)
– Quick View
Full Boolean
Includes Web, Blogs, Multimedia and News, with unique advanced features
“Blog Trends” tool
MAY be using Google
May become www.blogscour.com

Brainboost www.brainboost.com
A natural language “answer engine”
Results include “Related Questions” as well as responses to your query

Queryster queryster.com
Interface that provides quick scanning of results from up to 10 engines
– Yahoo, Google, MSN, AJ, WNut, Teoma, AV, Amazon, Ebay, A9
Executes your search as you click on the engine
Batch search – executes multiple queries in each engine
Fresh Google – uses daterange search (not reliable)

RedLightGreen www.redlightgreen.com
120 million titles from the Research Libraries Group union catalog
Search options
– Boolean, Phrase, Author, Title
– Keyword (Title, LCSH), Subject (LC)
– Limits by language and date
Results refined by Related Subjects, Authors and Language
Reviews of books linked to record
5 citation outputs available

The Cutting Edge in Search: Natural Language Processing

Beyond Searching the Full Text: Natural Language Processing (aka Text mining, Data mining)
How can we manage unstructured information?
Current web search engines match query terms from the full text of downloaded documents (“bag of words”)
Term frequency, position, page linkage and popularity and other factors are used to create the final selection and ranking of results.

Enter Natural Language Processing (NLP)
With NLP software unstructured text and data can be processed to reveal degrees of meaning by
– Extracting terms identified as significant
– Summarizing content
– Discovering relationships among terms and groups of terms
– HOW???
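One way to picture the first step, extracting significant terms, is to match a fixed list of known terms against a body of text and record where each one turns up. The Python sketch below is only an illustration under that assumption: the function name `extract_terms`, the sample vocabulary and the two sample articles are all made up for this example, and real NLP software does considerably more than plain string matching.

```python
import re
from collections import defaultdict


def extract_terms(corpus, vocabulary):
    """Map each vocabulary term found to a list of (document, character offset) hits.

    corpus: dict of document name -> text (hypothetical sample data below).
    vocabulary: list of known terms to look for.
    """
    hits = defaultdict(list)
    for term in vocabulary:
        # Whole-word, case-insensitive match for the term.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for doc_name, text in corpus.items():
            for match in pattern.finditer(text):
                hits[term.lower()].append((doc_name, match.start()))
    return dict(hits)


# Made-up sample data, for illustration only.
corpus = {
    "article-1": "Penicillin treatment is not without risks. Aspirin is widely used.",
    "article-2": "The infection responds well to penicillin.",
}
vocabulary = ["penicillin", "tetracycline", "aspirin"]

hits = extract_terms(corpus, vocabulary)
for term, locations in sorted(hits.items()):
    # Occurrence count, the term, then where it was found.
    print(f"{len(locations)}>{term} {locations}")
```

Each hit keeps the document name and character position, which is the kind of record that lets a results screen link back to locations in the text. Terms that never occur simply do not appear in the result.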
NLP Extraction
Take all articles from a group of pharmaceutical journals published in one year (the “corpus”)
Extraction
– Run a relevant controlled vocabulary (list of all known drugs) against the corpus

NLP Extraction
Drugs found, number of occurrences and location in the corpus, plus a list of possible drugs not in the controlled vocabulary
86 > penicillin (click for locations)
124 > tetracycline (click for locations)
213 > aspirin (click for locations)
Are these also drugs? XXX, XXX, XXX

NLP Summarization
Retain phrases surrounding the extracted term(s) with links to locations in the corpus (KWIC Index)
rare uses of penicillin
Often penicillin is contraindicated when
responds well to penicillin

NLP Summarization
Tag all words in the corpus with their grammatical function and search for noun – verb – noun and other syntactic patterns
(drug A) treats (disease B)
(drug C) causes (disease B)
(drug D) is contraindicated in (disease B)

NLP Term Relationship
Queries answered by tracking references across sentences
Can penicillin cause shock?
“Penicillin treatment is not without risks. In certain cases it can trigger anaphylactic shock.”

NLP can do even more …
Word disambiguation
– bank (river)
– bank (finances)
– bank (verb)
Retrieval of alternative word forms
Retrieval of variants in capitalization and spelling
Topic detection and tracking
– Following different themes in a changing RSS feed
Machine translation

NLP and Real Life
Early recognition of emerging market trends and/or competitors
Monitoring content from bio-medical and other journal literature that grows faster than the ability of researchers to read it
Improve relevancy in searches of content from libraries, publishers and the Web

The Latest from the Living Web
Weblogs - RSS Feeds - Podcasts

Blogs: What are they?
Online diaries or journals, usually by one person, though many invite “comments”
First developed in 1997
Within the same blog, tone can range from personal musings to discussion of recent issues in technology and research
High link-to-word ratio
Often link to other weblogs of similar content

Blogs: What are they?
Can contain rumor, inside information, speculation, blatant errors as well as
– Breaking news: political and technical/research
– Commentary on new software or websites
– Consumer reaction to products or services
Blog authoring tools are basic content management software, useful in ways other than online diaries
– Typify the spirit of information sharing that has fueled the Internet since its beginnings

Today’s Blogosphere
The blogosphere is now over 30 times as big as it was 3 years ago, with no signs of letup in growth
As of October 2005, Technorati is tracking 19.6 million weblogs
The total number of weblogs tracked continues to double about every 5 months
About one new weblog is created each second

Today’s Blogosphere
2% - 8% of new weblogs per day are fake or spam weblogs
Between 700,000 and 1.3 million posts are made each day
http://www.problogger.net/archives/2005/10/17/state-of-the-blogosphere-october2005/

Blogs and Search: Google
blogsearch.google.com and search.blogger.com
First major engine to offer a blog-specific search (Sept. ’05)
Defines blogs as “sites which use RSS and other structured feeds and update content on a regular basis”
Advanced Search features
– Blog title, Author of post, Date range
– Language limit, Safe Search option

Blogs and Search: Clusty clusty.com
Formerly Vivisimo
Metasearch engine with a blog search capability
Source engines for blog search
– Blogdigger
– Feedster
– Blogpulse
– Technorati
– Daypop

Blogs and Search: Clusty clusty.com
Results clustered in topical folders
Source engine given for each result
Date and time of each posting given
Accepts natural language queries
Full Boolean capabilities
Phrase search (“ ”)
Limits
include
– Domain, Host
– Number of results, Source Engine, Length of search (timeout)

RSS: What is it?
A broadcast version of current content from a website, blog, news page or other source (aka “RSS Feed”)
A live, constantly updated table of contents with links to the full text, e.g. a feed from NYTimes.com

How do I access RSS feeds?
Sites with RSS feeds display a small icon (usually orange) labeled RSS or XML or Atom
As RSS is in XML, reading it may require downloading reader software (older versions of browsers cannot read XML). Sources for reader software include
– www.download.com (search rss reader)
Aggregators allow for reading and organizing feeds of your choosing

RSS: Crossing into the Mainstream
Study of 4,000 respondents by Yahoo! and Ipsos Insight, August 2005
Who is using RSS?
– 12% were aware of RSS
– 4% had knowingly used it
– 27% unknowingly use RSS via personalized start pages, e.g. My Yahoo
Why do they use RSS?
– Ease of use
– Choice of content
– Instant updating capability (only 7% !!!)

RSS: Crossing into the Mainstream
What feeds are they using? (in order of popularity)
– World news
– National news
– Entertainment
– Science and technology
– Weather
– Local news
http://publisher.yahoo.com/rss/RSS_WhitePaper1004.pdf

MY Yahoo!
Ticker yahoo.com
RSS reader and aggregator
Click on Downloads
Click Deskbar for MS Windows
Choose among 250,000 Yahoo-selected RSS feeds
News and Stocks Server Options allow filtering by a list of topics

RSS at Google www.google.com/reader
Requires setting up a (free) account
Subscribe to any feed of your choosing
Keyword search available for feeds in Google’s database
RSS feeds available for Google News
Folders (labels) available for grouping feeds of similar content
Sort feed items by date and relevance

Podcasts 101
iPod + broadcast = podcast
Downloadable audio or video files which can be played on many devices
– PC, home systems
– Mobile (iPods, cars, MP3)
“Broadcast” by means of RSS
Not limited to Apple’s iPod or MP3 format
Often embedded in weblogs with RSS feeds
As with any Living Web (RSS) content, podcasts can go offline; may or may not be archived

Podcasts 101
Development
– Ease of publication (cheap storage, MP3 format)
– Ease of subscription (RSS 2.0)
– Ease of use (iPod, other mobile audio devices)
Create files (audio parameters apply!)
Publish files
– From iTunes.com
– From any web site capable of supplying content via RSS (Most blogs do)

Podcasts 101
Subscribe to files via the URL for the RSS Podcast feed (Red RSS or XML button)
Podcatcher
– Freeware that receives and organizes podcasts (an RSS aggregator for podcasts). Available at podcatcher.rubyforge.org

Podcasts in Higher Education
Drexel (Chemistry)
– Lectures podcasted; class time used for problem solving
Duke (Computer Science)
– Students required to listen to podcasts on related topics not covered in class
U.
of Hawaii (Computer Science)
– Intro class of 600; lectures podcasted; “Listen to them when you have the time”

Video podcasting “Vodcasts”
June 2005 - Apple iTunes begins to support video podcasting
Can provide supplemental multimedia content as part of a course, or public relations initiative
With Web cams, DV cameras and vodcasting, we may be headed toward the democratization of video content
60 GB Video iPod now available
Arstechnica.com/news.ars/post/20050915-5313.html

Podcasting and Search
Many podcasts are embedded in blogs
Google blog search tool blogsearch.google.com
– (subject) podcast
Main Google search still text-based; rock filetype:mp3 = 124 results on 11/28/05
Blog-based search engine with media search: www.blogdigger.com/media/

Podcasting and Search
Podcast Directories and Catalogs
www.podcast.net
– Directory of over 15,000 podcast feeds
– Searchable by Title & Description, keyword, Host (Author), Location and Episode
www.odeo.com
– Searchable catalog of mp3 podcasts
– Updates every 3 hours
– Offers text snippet of the latest podcast from each feed
www.podcastdirectory.com
www.podcastshuffle.com

Trends and Future Possibilities

Search Today …
“Mass Media” as “My Media”
– Podcasting, iTunes, Blogs, RSS
“Search is no longer about a text-based web index. It’s about a person’s interface to the world” -- SEO executive
Enhancing search through context and user personal profiles
– My Yahoo!
– Google Personalized Search

Search Today …
Federated search (single-point access, enterprise applications)
The Desktop “without walls”
– Unstructured and structured data
– Internal, personal sources and WWW
XML makes this possible
– “Middleware” layer with modules that acquire, manage, retrieve and rank text, data and multimedia from a variety of sources and formats

Search tomorrow ???
Search will become more
– Sophisticated
– Individualized
– Portable
– Specialized (vertical, subject-specific services)
Voice recognition, GPS and mobile, local search will grow
– “Where can I find the best bargains on this in the area?”
– “Where is the nearest pizza parlor and how do I get there from here?”

Thank You and Good Luck!!
Michael Hunter
Reference Librarian
Hobart and William Smith Colleges
Geneva, NY 14456
(315) 781-3552
hunter@hws.edu