Bibliographic Essay
Unit 5
Christine Cox & Graham Sherriff

Searching: Using technology to locate information

1. The World Wide Web

The World Wide Web has opened up a whole new world of information to the public, but having access to information is not necessarily the same as being able to locate it, understand it or use it. Librarians have a key role in helping searchers do these things and in teaching them how to do them independently.

Size and structure

Searching for specific information has often been likened to searching for the proverbial needle in a haystack. How big is the internet? Google – widely considered the most comprehensive search engine – claimed in 2004 to have an index of 8 billion pages, while CEO Eric Schmidt stated in 2005 that, based on his company’s data, the Internet comprised approximately 5 million terabytes, of which Google had indexed only 170 terabytes. In other words, search engines index only a small fraction of the web (Sullivan, 2006). To put the size of existing information into perspective, Schmidt also stated that, based on Google’s data for 1998-2005, it would take 300 years to index the existing 5 million terabytes, notwithstanding any new content (Mills, 2005).

Unlike a library, the web has no comprehensive catalog. This lack of structure plays an important part in understanding how searching works. Search engines can be unhelpful if searchers do not know how to use them properly or have not fully developed their question. At this point, the experience and training of librarians are crucial (Builder.com, 2003).

2. Searching with directories

Web directories specialize in linking to other websites and organizing the links into categories, providing a catalog-style structure for the web content they cover. Browsing a directory means searching by subject headings rather than keywords, so the user must be familiar not only with the subject being researched, but also with the directory and the categories it offers. When an inexperienced user seeks information from a directory, a librarian’s expertise is an important factor not only in locating the information, but also in identifying the appropriate subject areas to browse.

3. Searching with engines

Search engines

The development of search engines is a landmark achievement in the field of information location and retrieval. Where a directory works like a table of contents, a search engine functions like the index of a book (Sherman & Price, 2001). This is especially useful because, during the first years of the World Wide Web, creators and administrators of online content made little effort to facilitate its retrieval. Consequently, improvements have had to be made mostly on the “demand side”, in the form of searching.
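To make the book-index analogy concrete, here is a minimal sketch in Python of an inverted index, the data structure at the heart of this approach. It is an illustration only, not a description of any particular engine’s implementation, and the sample pages are invented.

    # A toy inverted index: each term maps to the set of page IDs that
    # contain it, much as a book's index maps words to page numbers.
    pages = {
        1: "library catalogs organize print collections",
        2: "search engines index web pages",
        3: "librarians teach searchers to evaluate web pages",
    }

    index = {}
    for page_id, text in pages.items():
        for term in text.split():
            index.setdefault(term, set()).add(page_id)

    # Answering a query is now a lookup, not a scan of every page.
    print(index["web"])      # {2, 3}
    print(index["library"])  # {1}

Real engines add crawling, ranking and enormous scale on top of this structure, but the principle is the same: an index rather than a catalog.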
The appearance of search engines prompted fears that they would replace the librarian. They are now normally the first (and often only) means of searching for information, especially in secondary and higher education (Clyde, 2002). Moreover, most searchers use one of a small number of engines: approximately 92% of web searches are handled by Google, Yahoo!, Microsoft and Ask.com (Reuters, 2007). Yet search engines vary greatly in how they index the web and how they retrieve search results from their indexes.

Limitations

Relying on a small number of search engines might not be such a concern if they were reliably effective. A key concern for librarians, however, is to recognize the relative advantages and disadvantages of regular search engines.

The “Big Four” have the largest (and fastest-growing) indexes, but they also have their limitations, principally their use of algorithms that prioritize popularity. This suits searchers who want to reference popular sites, but for many searchers the greatest need is relevancy, timeliness or credibility. Search engines are also limited by their for-profit orientation and their commercial rivalry: they have a market-driven incentive to perform extensively and efficiently, but commercialization provides a rationale for manipulating results in the form of paid-for placements or arbitrarily filtered results. In January 2007, Google acknowledged amending its results in order to protect its own corporate image (Cutts et al., 2007). Commercialization also makes engines susceptible to fads: in recent years, some have neglected the development of searching in favor of personalization and portalization (Hock, 2001).

Librarians need to keep in mind that regular search engines cover only part of the web. Indexing will remain a work in progress as long as the web continues to grow and renew itself. Estimates of regular search engines’ coverage range from 50% of web content to as little as 0.2% (Bergman, 2001).

The “deep web”

It is not only unpopular websites that fall under the radar of search engines’ crawlers. Sherman & Price (2001) identify four areas of the “deep web”:

1. The “opaque web”: text and pages from a website that exceed an engine’s programmed limits; unlinked pages.
2. The “private web”: content barred to engines by the Robot Exclusion Protocol, passwords or “noindex” tags (see the sketch after this list).
3. The “proprietary web”: access regulated by membership, dependent on registration, subscription or fees.
4. The “truly invisible web”: non-HTML formats such as PDF and Flash; compressed files; and content created dynamically by relational databases.

(Another area is streamed data, such as news ticker-tapes (Clyde, 2002).)
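The second category, the “private web”, can be illustrated with Python’s standard library, which includes a parser for the robots.txt files at the center of the Robot Exclusion Protocol. The sketch below is a minimal illustration with an invented robots.txt policy; it shows how a compliant crawler decides which pages it may fetch, and thus how excluded content never reaches an engine’s index.

    # A minimal sketch of the Robot Exclusion Protocol, using Python's
    # standard library. The robots.txt rules below are invented.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.parse([
        "User-agent: *",
        "Disallow: /private/",
    ])

    # A compliant crawler checks before fetching. Disallowed pages are
    # never indexed, so they fall into the "private web".
    print(rp.can_fetch("AnyCrawler", "http://example.com/private/report.html"))  # False
    print(rp.can_fetch("AnyCrawler", "http://example.com/welcome.html"))         # True

A “noindex” meta tag works at the level of the individual page, telling the crawler it may visit the page but should not add it to the index.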
Thus there is a wealth of unindexed web content. This content is also broad, with no category lacking significant representation, and it is widely thought to offer greater relevancy. A 2001 survey estimated the deep web’s “quality yield” at three times that of the “surface web” (12.3% vs. 4.7%) (Bergman, 2001). Perhaps most importantly, the deep web contains information that simply cannot be found elsewhere. This might not matter when searching for highly popular information such as celebrity gossip or sports news, but specialized research will yield better results in the deep web than in Google.

Regular search engines have already taken steps to provide access to the deep web. Google and MSN can now search for PDF and Word files; others specialize in retrieving audio and video files. Access to academic resources is expanding rapidly as engines index library catalogs and collections and incorporate databases, effectively shifting deep-web content onto the surface web. Examples include Google Book Search, Google Scholar and Windows Live Academic. This trend of engines embracing deep-web content is likely to continue (Cohen, 2001, 2006).

Metasearch engines

When trying to maximize recall, librarians might be tempted to use metasearch engines, which submit enquiry terms to several search engines simultaneously and combine or aggregate the results (the sketch at the end of this section illustrates this aggregation step). They therefore offer greater breadth than the average regular search engine, which is particularly useful when seeking comprehensive recall or trying to find obscure information. However, they also have their drawbacks. A SearchEngineWatch.com survey found that, compared to regular search engines, their results feature a higher proportion of paid-for placements, and that these placements are less clearly distinguished from unpaid results; for example, the survey indicated that 86% of Dogpile results are paid-for placements (Sullivan, 2000). Metasearch engines are also limited by ceilings on retrievals per search engine, time-outs, engines’ differing interpretations of enquiry terms, and the fact that none searches a combination of the largest regular engines (Hock, 2001).

Federated search engines

Federated search engines are similar to metasearch engines, but include authentication services that let users search restricted-access sources. They can therefore return results from private or fee-based websites, and aim to be a “one-stop shop” for searchers. Examples include MuseGlobal, WebFeat and Ex Libris; Yahoo! added a subscription-based search facility in 2005. However, these are normally commercial enterprises that cover access costs by charging search fees. In addition, the authentication set-up can be time-consuming, and there are outstanding copyright issues (Fryer, 2004).
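Both metasearch and federated engines rest on the same basic step: send the enquiry to several sources at once, then merge the ranked result lists into one. The sketch below illustrates only that aggregation step, under stated assumptions: the two “engines” are invented stand-ins that return canned results, a real implementation would query engines over HTTP and contend with the time-outs and per-engine ceilings noted above, and no authentication is shown.

    # A minimal metasearch sketch: query several engines in parallel,
    # then interleave their ranked results, dropping duplicate URLs.
    from concurrent.futures import ThreadPoolExecutor

    # Invented stand-ins for real search engines.
    def engine_a(query):
        return ["http://example.org/a", "http://example.org/b"]

    def engine_b(query):
        return ["http://example.org/b", "http://example.org/c"]

    def metasearch(query, engines):
        with ThreadPoolExecutor() as pool:
            ranked_lists = list(pool.map(lambda engine: engine(query), engines))
        merged, seen = [], set()
        for rank in range(max(len(r) for r in ranked_lists)):
            for results in ranked_lists:
                if rank < len(results) and results[rank] not in seen:
                    seen.add(results[rank])
                    merged.append(results[rank])
        return merged

    print(metasearch("deep web", [engine_a, engine_b]))
    # ['http://example.org/a', 'http://example.org/b', 'http://example.org/c']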
4. Searching techniques

Asking the question: Precision vs. recall

An important consideration when searching is precision versus recall. A search with a high precision rate indicates that most of the results will be relevant. A search with a high recall rate indicates that most of the available records have been retrieved, though these might also include irrelevant ones (Dugan, 2006). “The goal of information retrieval scientists is to provide the most precise or relevant documents in the midst of the recalled search results” (Lager, 1996).

Asking the question: Advanced Query Operators

Many search engines offer search functions that use advanced query operators (AQOs), such as the Boolean operators AND, OR and NOT. Only 10% of web searchers use these operators (Eastman & Jansen, 2003). Moreover, using AQOs does not have a significant impact on search results when applied by the average user: “Approximately 66% of the top ten results on average will be the same regardless of how the query is entered” (Jansen, 2003). Librarians, with expert searching skills, may get better results from AQOs. According to Bernard Jansen and Caroline Eastman: “The amount of training and practice that would be needed to enable most users to correctly formulate and use advanced operators, along with the apparent need to understand the particular IR (information retrieval) system, is not justified by the relatively small potential improvement in results. So, training and experience in more sophisticated searching techniques and strategies could reasonably be limited to information professionals who might be expected to have a use for them, are knowledgeable on a particular system or set of systems, and engage in intricate searching tasks” (2003).
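The Boolean operators discussed above map directly onto operations on sets of results, and precision and recall each reduce to a simple ratio. The following sketch works through both with invented result sets; note how AND narrows the result set (favoring precision) while OR widens it (favoring recall).

    # Invented result sets: the pages matching each of two query terms,
    # and the set of pages actually relevant to the searcher.
    results_librarians = {1, 2, 3, 5, 8}
    results_searching = {2, 3, 4, 8, 9}
    relevant = {2, 3, 7}

    # Boolean operators correspond to set operations.
    print(results_librarians & results_searching)  # AND -> {2, 3, 8}
    print(results_librarians | results_searching)  # OR  -> {1, 2, 3, 4, 5, 8, 9}
    print(results_librarians - results_searching)  # NOT -> {1, 5}

    retrieved = results_librarians & results_searching
    precision = len(retrieved & relevant) / len(retrieved)  # relevant share of what was retrieved
    recall = len(retrieved & relevant) / len(relevant)      # retrieved share of what was relevant
    print(f"precision={precision:.2f}, recall={recall:.2f}")  # 0.67 each here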
Getting the answer: Evaluating results

The credibility of search results is a major concern. Elizabeth Kirk, a librarian at Johns Hopkins University, underlines the importance of skeptical evaluation: “When you use a research or academic library, the books, journals and other resources have already been evaluated by scholars, publishers and librarians. Every resource you find has been evaluated in one way or another before you ever see it. When you are using the World Wide Web, none of this applies. There are no filters. Because anyone can write a Web page, documents of the widest range of quality, written by authors of the widest range of authority, are available on an even playing field. Excellent resources reside alongside the most dubious” (Kirk, 1996).

Kapoun, a librarian at Southwest State University, bases his web evaluation on five criteria: accuracy, authority, objectivity, coverage and currency (Kapoun, 1998). Currency is a particularly important consideration for web content. It is often said that search engines “search the Internet”, but this is not true. “Each one searches a database of the full text of web pages selected from the billions of web pages out there residing on servers. When you search the web using a search engine, you are always searching a somewhat stale copy of the real web page” (Barker, 2006). How “stale” is the copy of the page that you have searched? Ellen Chamberlain, Library Director at the University of South Carolina’s Beaufort Campus, states: “There is no way to freeze a web page in time. Unlike the print world with its publication dates, editions, ISBN numbers, etc., web pages are fluid. [...] The page you cite today may be altered or revised tomorrow, or it might disappear completely” (2006).

Librarians are ideally suited to evaluating the quality of information: it has always been an integral aspect of their professional responsibilities. According to Chris Sherman, the executive editor of SearchEngineWatch.com, “There’s a problem with information illiteracy among people. People find information online and don’t question whether it’s valid or not... I think that’s where librarians are extremely important” (Mills, 2006).

5. Current trends and future developments

Recall

The web is growing. The number of sites might be stabilizing, but the amount of data is expanding, and with greater access to deep-web resources, the amount of retrievable data is soaring. Librarians and patrons have to navigate more and more information to find exactly what they are looking for. Search engines are becoming more efficient and more powerful, enabling them to index deeper and more frequently. Technology is making possible huge gains in recall. Sherman & Price (2001) even suggest hypertext queries that treat the web as a single database, enabling truly comprehensive searches. Increased recall is also being made possible by specialized search engines, from A (agview.com for agriculture) to Z (fishbase.org for zoology). Specialized metasearchers have also appeared, such as familyfriendlysearch.com for child-friendly search engines (Hock, 2001; Mostafa, 2005).

Precision

As recall increases, the next major challenge will be precision. One approach is to use technology to refine results. Mooter is a search engine that presents results in sub-groups leading to further sub-groups. Kartoo presents results on a “map”, organized horizontally rather than ranked vertically. But the greatest effort is being put into eliminating unwanted retrievals – in other words, increasing precision while maintaining recall. PowerScout and Watson customize the searching process based on the search histories of the user and of other users with similar interests. In this way, the engine builds up a profile of what the searcher is likely to be looking for (Mostafa, 2005).

Another promising path for development is a “supply side” solution: web content is tagged with data (“metadata”) that can be read by index crawlers. For example, a tag might state that the “Georgia” in a web article’s title is the American Georgia, not the ex-Soviet Georgia. Search engines can then scan this data instead of superficial content that can be misleading, and ignore non-relevant pages, such as “false drops” (O’Neill et al., 2003). An example of a metadata-reading search engine is the rapidly expanding OAIster.
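As an illustration of how an index crawler might read such tags, the sketch below uses Python’s standard html.parser module to collect the name/content pairs from a page’s meta tags. The page and its tags are invented, and real metadata schemes (Dublin Core, the OAI records harvested by OAIster, and so on) are far richer than this.

    # A minimal sketch of metadata reading: collect <meta name="..."
    # content="..."> pairs the way an index crawler might.
    from html.parser import HTMLParser

    class MetaTagReader(HTMLParser):
        def __init__(self):
            super().__init__()
            self.metadata = {}

        def handle_starttag(self, tag, attrs):
            if tag == "meta":
                attrs = dict(attrs)
                if "name" in attrs and "content" in attrs:
                    self.metadata[attrs["name"]] = attrs["content"]

    page = ('<html><head>'
            '<meta name="subject" content="Georgia (U.S. state)">'
            '<meta name="robots" content="noindex">'
            '</head><body>Visiting Georgia on a budget...</body></html>')

    reader = MetaTagReader()
    reader.feed(page)
    print(reader.metadata)
    # {'subject': 'Georgia (U.S. state)', 'robots': 'noindex'}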
There are drawbacks to metadata. It is vulnerable to spamming; there are legal issues over the use of trademarked terms, as in the legal saga of Terri Welles vs. Playboy (Sullivan, 2004); and it requires a great deal of (human) data entry. The advantages, however, would be substantial: large gains in searching precision; the ability to certify content in different ways, helping searchers to assess its provenance and credibility; and possibly a basis for the next evolutionary stage of the web, the semantic web. Whereas metadata is currently processed statistically (a crawler calculates the frequency of terms, their proximity, etc.), a semantic search engine would process metadata logically, combining one search function with another, or with other data-processing programs (Berners-Lee et al., 2001).

Librarians and search engines

No matter what advances are made in search technology, librarians’ expertise in locating and evaluating information will remain a valuable part of the searching process. Search engines are simultaneously complex and limited, and librarians who master searching technology are fulfilling their traditional responsibility of guiding patrons to the desired information. The article “Librarians Versus the Search Giants” details a panel conversation between librarians and representatives from Google and Microsoft: “One issue not in doubt during the conversation was the fact that the world will continue to need librarians. Indexes of knowledge - and Google, MSN, Yahoo!, etc. are indexes, not libraries - will still require someone to make sense of all the information” (Krozser, 2006).

Librarians’ research skills and experience also prove useful when the internet fails; their knowledge of information resources goes far beyond the web. Joe Janes, an associate professor in the Information School at the University of Washington in Seattle, comments: “When Google doesn't work, most people don't have a plan B. Librarians have lots of plan B's. We know when to go to a book, when to call someone, even when to go to Google” (Selingo, 2004).

Bibliography

Information taxonomy plays a critical role in Web site design and search processes. (2003, July 1). Builder.com. Retrieved February 12, 2007, from http://builder.com.com/5100-6374_14-5054221.html?tag=search

Barker, J. (2006). Finding Information on the Internet: A Tutorial. Retrieved February 10, 2007, from http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/FindInfo.html

Bergman, M.K. (2001, August). The Deep Web: Surfacing Hidden Value [Electronic version]. The Journal of Electronic Publishing, 7 (1).

Berners-Lee, T., Hendler, J. & Lassila, O. (2001, May 17). The Semantic Web. ScientificAmerican.com. Retrieved February 12, 2007, from http://www.sciam.com/article.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21&catID=2

Bopp, R.E., & Smith, L.C. (2001). Reference and Information Services: An Introduction. Englewood, CO: Libraries Unlimited.

Chamberlain, E. (2006). Bare Bones 101: A Basic Tutorial on Searching the Web. Retrieved February 9, 2007, from http://www.sc.edu/beaufort/library/pages/bones/bones.shtml

Clyde, A. (2002, April). The Invisible Web [Electronic version]. Teacher Librarian, 29 (4). Retrieved February 12, 2007, from WilsonWeb database.

Cohen, L. (2001, December 24). The Deep Web. Retrieved February 7, 2007, from http://www.internettutorials.net/deepweb.html

Cohen, L. (2006, December). The Future of the Deep Web. Library 2.0. Retrieved February 11, 2007, from http://liblogs.albany.edu/library20/2006/11/the_future_of_the_deep_web.html

Cutts, M., Moulton, R. & Carattini, K. (2007, January 25). A Quick Word about Googlebombs. Official Google Webmaster Central Blog. Retrieved February 13, 2007, from http://googlewebmastercentral.blogspot.com

Dugan, J. (2006). Choosing the Right Tool for Internet Searching: Search Engines vs. Directories. Perspectives: Teaching Legal Research and Writing, 14 (2).

Eastman, C.M., & Jansen, B.J. (2003). Coverage, Relevance, and Ranking: The Impact of Query Operators on Web Search Engine Results. ACM Transactions on Information Systems, 21 (4).

Fryer, D. (2004, March/April). Federated Search Engines. Online, 28 (2). Retrieved February 13, 2007, from EBSCOhost database.

Google, Yahoo gain share in U.S. Web search market. (2007, January 15). Reuters. Retrieved February 13, 2007, from http://news.yahoo.com/s/nm/20070115/wr_nm/google_search_dc_1

Hock, R. (2001). The Extreme Searcher’s Guide to Web Search Engines (2nd edition). Medford, NJ: CyberAge Books.

Jansen, B.J. (2003, May 18-21). Operators Not Needed? The Impact of Query Structure on Web Searching Results. Information Resource Management Association International Conference, Philadelphia, PA.

Kapoun, J. (1998). Teaching undergrads WEB evaluation: A guide for library instruction. College & Research Libraries News, 59 (7).

Kirk, E.E. (1996). Evaluating Information Found on the Internet. Retrieved February 12, 2007, from http://www.library.jhu.edu/researchhelp/general/evaluating

Krozser, K. (2006, March 12). Librarians Versus the Search Giants. Medialoper.com. Retrieved February 9, 2007, from http://www.medialoper.com

Lager, M. (1996). Spinning a Web Search. Retrieved February 11, 2007, from http://www.library.ucsb.edu/untangle/lager.html

Mills, E. (2005, October 8). Google ETA? 300 years to index the world’s info. CNET News.com. Retrieved February 8, 2007, from http://news.com.com/Google+ETA+300+years+to+index+the+worlds+info/2100-1024_3-5891779.html

Mills, E. (2006, September 29). Most reliable search tool could be your librarian. CNET News.com. Retrieved February 9, 2007, from http://news.com.com/Most+reliable+search+tool+could+be+your+librarian/2100-1032_3-6120778.html

Mostafa, J. (2005, January 24). Seeking Better Web Searches. ScientificAmerican.com. Retrieved February 12, 2007, from http://www.sciam.com/article.cfm?articleID=0006304A-37F4-11E8-B7F483414B7F0000

O’Neill, E.T., Lavoie, B.F. & Bennett, R. (2003, April). Trends in the Evolution of the Public Web 1998-2002. D-Lib Magazine, 9 (4). Retrieved February 7, 2007, from http://www.dlib.org/dlib/april03/lavoie/04lavoie.html

Selingo, J. (2004, February 5). When a Search Engine Isn’t Enough, Call a Librarian. NYTimes.com. Retrieved February 11, 2007, from http://www.nytimes.com/2004/02/05/technology/circuits/05libr.html?ex=1391403600&en=26c6c8a9c0c4212f&ei=5007&partner=USERLAND

Sherman, C. & Price, G. (2001). The Invisible Web: Uncovering Information Sources Search Engines Can’t See. Medford, NJ: CyberAge Books.

Sullivan, D. (2000, August 2). Invisible Web Gets Deeper. SearchEngineWatch.com. Retrieved February 11, 2007, from http://searchenginewatch.com/showPage.html?page=2162871
Sullivan, D. (2004, April 21). Meta Tag Lawsuits. SearchEngineWatch.com. Retrieved February 11, 2007, from http://searchenginewatch.com/showPage.html?page=2156551

Sullivan, D. (2006, August 22). Nielsen NetRatings Search Engine Ratings. SearchEngineWatch.com. Retrieved February 10, 2007, from http://searchenginewatch.com/showPage.html?page=2156451