INFO 522 Lecture 9 Missy Harvey Winter 2011

Search Engines and the Invisible Web

Many libraries have created tutorials to help people learn how to search on the Web. I will point you to some of the better tutorials later on. I have taught workshops and a course on searching the Web in the School of Computer Science at Carnegie Mellon. So I will list some websites I’ve created to bring together helpful tools for you. You will see that our Dialog and LexisNexis knowledge is a good foundation for retrieval with Web-based search tools.

Search engines are a big part of life here at Carnegie Mellon. One of the earliest Web search engines, Lycos, was created here by the scientist Fuzzy Mauldin. Carnegie Mellon provided funding for the project and, lo and behold, the stock offerings from Lycos paid for a new building called Newell-Simon, named in honor of two of the co-founders of artificial intelligence who taught here: http://www.cmu.edu/tour/ (click on Walk Your Own Path and then click on bldg. #4; FYI, I work in #7, Wean Hall).

The most recent ventures that started here are Vivísimo and Yippy (formerly known as Clusty). The co-founders knew the value of the skills of librarians. So they consulted with many of us librarians at Carnegie Mellon numerous times while developing the product, which is now only about 10 years old! Finally, because of this history and ongoing collaborative work at Carnegie Mellon, Google opened up an office in Pittsburgh.

Before proceeding, review the file Week9.ppt and walk through the slides.

Search Tool Comments

One of the important distinctions most tutorials make is between “search engines” on the one hand and classification schemes (Subject Categories and Directories) or specialized collections (Virtual Libraries) on the other.

With search engines, you enter words (perhaps with logical operators) describing your interest and hope for a match. Google, Bing and Excite are only a few examples.
To do searches in Alta Vista (like you have done on Dialog and LexisNexis), visit the Alta Vista home page (http://www.altavista.com/), as well as the Advanced Search site (http://www.altavista.com/help/search/help_adv). The latter will show you that Alta Vista can handle Boolean and proximity operators, phrase searching, truncation, and its own varieties of field searching. Many of the fields it handles are unique to the Web. For example, you can restrict a search to terms that occur in the URL field. You can ask for all the pages that link to a given page such as: http://www.cs.cmu.edu/~missy/.

Google Advanced Search provides a form to enable you to take advantage of various searching capabilities: http://www.google.com/advanced_search. Look over it closely for the many features that are offered.

Virtual libraries are collections of annotated bibliographies on particular subjects. The annotations generally contain clickable links to Web resources. The WWW Virtual Library is an example of this: http://vlib.org/. Sometimes libraries may refer to these as resource pages.

With Subject Categories and Directories, you generally navigate through hierarchical term displays, clicking on terms that catch your interest. Often you will move from general terms to more specific terms as you go. Yahoo is an example of a directory system, a hierarchical subject classification scheme. Yahoo provides a search engine, but its main claim to fame is this directory scheme: http://dir.yahoo.com/. It functions much like the Dewey Decimal System and is arguably no better (some would say it is not as good). Web pages are assigned to subject categories of the scheme by human beings, not software agents. But there are no logical principles for creating or subdividing categories; they are simply slapped together to cover whatever turns up on the Web.
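
To make the distinction between the two kinds of tool more concrete, here is a minimal sketch in Python. It is only an illustration, not how any real search engine or directory is built: the three "pages", the category names, and the helper functions (match_and, match_or, match_phrase, browse) are all made up for this example. The first part shows how a Boolean AND, a Boolean OR, and a phrase search can return different results for the same words; the second part shows navigating a directory from a general term to a more specific one.

```python
# Toy collection of "pages" (hypothetical text, for illustration only).
PAGES = {
    "page1": "digital library services for engineering students",
    "page2": "the public library has a digital catalog",
    "page3": "engineering handbooks and digital design",
}


def words(text):
    """Split text into lowercase words."""
    return text.lower().split()


def match_and(terms, text):
    """Boolean AND: every term must occur somewhere in the text."""
    page_words = words(text)
    return all(term in page_words for term in terms)


def match_or(terms, text):
    """Boolean OR: at least one term must occur in the text."""
    page_words = words(text)
    return any(term in page_words for term in terms)


def match_phrase(phrase, text):
    """Phrase search: the words must occur side by side, in order."""
    page_words = words(text)
    phrase_words = words(phrase)
    n = len(phrase_words)
    return any(page_words[i:i + n] == phrase_words
               for i in range(len(page_words) - n + 1))


def search(matcher, query):
    """Return the names of all pages that the matcher accepts."""
    return sorted(name for name, text in PAGES.items() if matcher(query, text))


# Toy directory: categories assigned by a person, not by software.
# The category names are invented for this sketch.
DIRECTORY = {
    "Science": {"Engineering": {"Computer Engineering": ["page3"]}},
    "Reference": {"Libraries": ["page1", "page2"]},
}


def browse(directory, path):
    """Follow a path of category names from general to specific."""
    node = directory
    for term in path:
        node = node[term]
    return node


if __name__ == "__main__":
    print(search(match_and, ["digital", "library"]))
    print(search(match_or, ["digital", "library"]))
    print(search(match_phrase, "digital library"))
    print(browse(DIRECTORY, ["Science", "Engineering", "Computer Engineering"]))
```

In this toy collection the phrase search is the narrowest of the three and the OR search is the broadest, which is why it matters how a given tool interprets the words you type.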

To develop and maintain expertise in searching the Web, probably the best single site for librarians to monitor is Search Engine Watch (http://searchenginewatch.com/). When you visit the site, you will see that there are many search engines. Take my word for it that they are NOT all equally good. They vary in coverage (the number of websites they give access to), capabilities, and basic algorithms. They vary in their principles for ranking retrieved documents by relevance. You can use Search Engine Watch to find comparative feature charts and evaluations for the many competing engines on today’s market.

A good principle to live by in doing Web page retrieval is to use more than one search engine/directory. Most experts tend to include Google and Yahoo.

What about the meta-search engines? They can search multiple search engines all at once, sort of like a Dialog OneSearch. Is it a good idea to use them? Well, I tend to use them when I want to confirm what, if anything, I may have missed in my search. In other words, I’m checking for thoroughness in my search results. In such cases, the leading meta-search engine is now Yippy, created by Vivísimo.

It is not necessarily a good thing to use meta-search engines that enter your search phrase into many different engines for simultaneous retrievals. For one thing, inferior search engines are often included. For another, search engines may differ in their requirements and so what is a valid input for one may not be valid for another. For instance, you have probably learned to put quotation marks around two or more words that you want a search engine to treat as a phrase. But not all search engines will recognize that as a phrase search. Instead, they will simply treat the words in the phrase as if they were ORed together, with highly unsatisfactory results.

Google Scholar

In 2004, an invention arrived that had a significant impact on the use of search engines AND on libraries. It’s called Google Scholar.
This tool enables you to search specifically for scholarly literature. Google states that the tool will find articles from a variety of academic publishers, professional societies, preprint repositories and universities, as well as scholarly articles available across the Web. Your search results include peer-reviewed papers, theses, books, preprints, and technical reports from all broad areas of research.

Google Scholar analyzes and extracts citations and presents them as separate results, even if the documents they refer to are not available online. This means your search results may include citations of older works and seminal articles that appear only in books or other offline publications.

Long before the announcement of Google Scholar, Google had been working with various publishers to deliver search results from certain databases when people search using Google. Of course, the hitch (which Google does NOT make self-evident) is that people can only see the full text of their search results IF their home institution (e.g., Drexel’s Hagerty Library) subscribes to those particular services and/or databases. The three key examples of databases that have made these arrangements with Google are: ACM Digital Library, IEEE Xplore, and OCLC WorldCat.

My key concern is that students, especially undergraduates, have not learned in their classes that this access does NOT come for free. Places like Drexel pay about $120,000 per YEAR for access to IEEE Xplore, just one of numerous databases they subscribe to. So most Drexel people do NOT realize that when they get Google results linking to hits from the above three databases, they can view the articles only because the Drexel libraries subscribe to those services.

Jay Bhatt, the Engineering Librarian at Drexel, has been a critic of Google Scholar. He has made postings to listservs where all of us librarians have discussed this tool.
He pointed out his concerns about engineering and science students relying too heavily on Google Scholar because (I place my comments in blue):

It does not index e-books and handbooks such as those from EngNetBase and Knovel, etc. (These are two leading engineering e-book services.) Also, Books24x7 and Safari are important e-book services but not yet available in Google Scholar.

Google Scholar does not provide information on what is being covered, what journals are indexed, what other databases are covered.

So just relying on Google Scholar may not be helpful. In other words, you are not necessarily retrieving the best possible hits for your research. Worse yet, students may be misled to think that if they found little, they’ve exhausted all of their possibilities.

Let me explain this a different way. If a librarian were helping a student find journal articles on a computer engineering topic, they would want to ensure that the student searched INSPEC (some of the full-text articles found in INSPEC are available in IEEE Xplore), Compendex, and Web of Science. We could throw in ACM Digital Library for good measure. If a student did not search ALL of these, then they would be potentially missing out on important citations that could prove crucial to their research.

So I encourage you to take a look at Google Scholar (http://scholar.google.com/) and use it to find a FEW of the citations you will list in your final project. Note that I say a FEW because in the assignment I also caution you to use Web resources with care. There are many excellent items on the Web. But there is also a lot of junk. Look for a respectable level of scholarship if you decide to include a few Web-based materials and apply the principles for evaluating websites that are contained in one of the readings below.

In our readings, I will offer a couple of articles for you to learn more about this tool.
Then you can decide for yourself if you think it’s worthwhile, a lot of hype, or a combination of both.

WolframAlpha

In 2009, a new model of searching was launched called WolframAlpha. If you’re unfamiliar with the name, Stephen Wolfram is the founder of Wolfram Research. He is a British physicist and mathematician who was once attached to the famous Institute for Advanced Study in Princeton, NJ. That’s the same institute where Einstein once worked. In 1988 he launched a computer algebra system called Mathematica that has significantly impacted the worlds of math and computer science.

In 2002, Wolfram published a rather contentious book, A New Kind of Science (NKS), which presents an empirical study of very simple computational systems. He argues that these types of systems, rather than traditional mathematics, are needed to model and understand complexity in nature. His conclusion is that the universe is digital in its nature, and runs on fundamental laws which can be described as simple programs: cellular automata. He predicts that a realization of this within the scientific community will have a major and revolutionary influence on physics, chemistry, biology, and most scientific fields in general, which is the reason for the book’s title.

Wolfram|Alpha is being touted as a computational knowledge engine with a new approach to knowledge extraction. The engine is based on natural language processing, a large library of algorithms and an NKS approach to answering questions. The new engine differs from traditional search engines because it does not simply return a list of results based on a query; instead, it computes an answer. Some industry observers, such as Nova Spivack, claim that “it could be as important as Google.”

Readings

REQUIRED

Hagerty Library, Drexel University. (2004). Evaluating information on the Web. http://www.library.drexel.edu/documents/tutorials/webeval/intro.html

Meola, M. (2004).
Chucking the checklist: A contextual approach to teaching undergraduates web-site evaluation. Portal: Libraries and the Academy, 4(3), 331-344. (Available through Summon at Hagerty Library.)

Tenopir, C. (2005). Google in the academic library. Library Journal. http://www.libraryjournal.com/article/CA498868.html (very slow to open up)

Kolowich, S. (2010). Searching for better research habits. Inside Higher Ed. http://www.insidehighered.com/news/2010/09/29/search
This article touches on the challenges faced by librarians today, especially academic librarians. After reading the article, make a point of looking at the comments at the end as well.

Price, G. (2009). Great day for power searchers: Google adds new search options. ResourceShelf. http://www.resourceshelf.com/2009/10/01/google-search-adds-newfeatures/

University at Albany, University Libraries. (2010). The Deep Web. http://www.internettutorials.net/deepweb.asp

UC Berkeley. (2010). The Invisible Web: What is it, why it exists, how to find it, and its inherent ambiguity. http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/InvisibleWeb.html

Sutter, J.D. (2009). New search engines aspire to supplement Google. CNN.com. http://www.cnn.com/2009/TECH/05/12/future.search.engine/index.html?iref=newssearch

RECOMMENDED

Price, G. (2003). What Google teaches us that has nothing to do with searching. Searcher, 11(10). http://www.infotoday.com/searcher/nov03/price.shtml

Abilock, D. (2010). Choose the best search engine for your information need. http://www.noodletools.com/debbie/literacies/information/5locate/adviceengine.html

University at Albany, University Libraries. (2010). How to choose a search tool. http://www.internettutorials.net/choose.asp

Assignment and Discussion Board

For this week, I am placing practice exercises AND a new assignment in the assignment folder. Only turn in the assignment, NOT the practice exercises. Submit your assignment by Midnight, Sunday, March 6th (Eastern time).

You are required to make some kind of a discussion board posting for the week. Please go to the Discussion Boards to respond to the topic for this week. Complete your discussions by Midnight, Sunday, March 6th (Eastern time).

Submit questions or comments about the challenges and issues you face as you try to search using various Web tools.