It’s Everywhere… It’s Everywhere…
“The Computer as an Educational Tool: Productivity and Problem Solving”
©Richard C. Forcier and Don E. Descy

Today
* Why is it important?
* Searching, searching, searching
* The searchable Web
* The invisible Web (deep Web)
  – What is it?
  – Why is it?
  – How do we get around in it?
* Resources/References

Why is this important?
* You are going to want information…
  – Reports, papers, presentations
  – Medical, family, jobs, personal
* Your students are going to want information…
  – Reports, papers, presentations, personal
* Most of what you want can’t be found using regular search techniques!

The Question
* How do you find information that is available… but isn’t?
* How do you find your exit on the “Information Superhighway” if mapping that exit can’t be done?

The Invisible Web
* Web sites that are hidden or are unable to be found or cataloged by regular search engines.
* “Public information on the deep Web is currently 400 to 550 times larger than the commonly defined World Wide Web.” (BrightPlanet, 2003)
* “A full ninety-five per cent of the deep Web is publicly accessible information – not subject to fees or subscriptions.” (BrightPlanet, 2003)

The Invisible Web Facts
* 200,000+ Web sites
* 550 billion individual documents compared to the three billion of the surface Web
* Contains 7,500 terabytes of information compared to nineteen terabytes in the surface Web
* Total quality content is 1,000 to 2,000 times greater than that of the surface Web.

The Invisible Web Facts (2)
* Sixty of the largest sites collectively contain over 750 terabytes of information. They exceed the size of the surface Web forty times.
* Fastest growing category of new information on the Internet
* Fifty percent greater monthly traffic than surface sites

Invisible Web Facts (3)
* More highly linked to than surface sites
* Narrower, with deeper content, than conventional surface sites
* More than half of the content resides in topic-specific databases
* Content is highly relevant to every information need, market, and domain.

Invisible Web Facts (4)
* Not well known to the Internet-searching public

Searching, Searching, Searching
* Usually carried out using a “directory” or “search engine”
* Fast and efficient
* Misses most of what is out there
* 70% of searchers start from three sites (Nielsen, 2003): Google, Yahoo, and MSN.

Searching Tools
* Directories
* Search engines

Directories
* Hand selected, evaluated, annotated
* Broad topics work best.
* Quality over quantity
* Location on list: May be paid

How Directories Work
[Diagram: Web/Internet → Directory Staff (find site, evaluate, catalog and add) → Directory Index/Information → Directory Server → User (searching, browsing)]

Directory Problems
* Done by humans
* Takes time
* No universal categories or cataloging system
* Misses most of the information/sites

General Subject Directories
* “Yahoo”
* Biggest and most famous
* Often useful
* Information… jobs… travel… shopping… to…
* Yahoo.com

Search Engines
* Computer generated
* Narrower topics
* Quantity over quality
* Uses newer retrieval technologies
* Location on list: May be paid
* Google, Hotbot, Northern Light, AltaVista, etc.

How Search Engines Work
[Diagram: Spiders/robots comb the Web/Internet → Database stores URL and content; User inputs request → Search engine matches request to content → User]

Search Engine Problems
* Spiders/robots don’t think.
* More likely to index sites with more links to them (popularity)
* More likely to index U.S. sites
* More likely to index commercial sites
* Sites pay for indexing/position.
  – At one time showed actual bid!
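
The “How Search Engines Work” flow (spiders comb the Web, a database stores URL and content, the engine matches a request to content) can be sketched in a few lines. This is only a toy illustration, not how any real search engine is built: the `TOY_WEB` pages, the `a.example` addresses, and the `crawl` and `search` functions are all hypothetical names made up for this example. It also hints at why a page that no other page links to never reaches the database.

```python
from collections import deque

# Hypothetical miniature "Web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a.example/home": ("welcome to the ship museum", ["a.example/ships"]),
    "a.example/ships": ("ship models and ship history", []),
    # Produced only when a user types into a search form, so nothing links to it.
    "a.example/search?title=plane": ("plane records from the database", []),
}

def crawl(seed, web):
    """Spider: follow links outward from the seed, storing URL -> content."""
    database = {}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if url in database or url not in web:
            continue  # already stored, or a dead link
        text, links = web[url]
        database[url] = text
        queue.extend(links)
    return database

def search(request, database):
    """Match the user's request to stored content; return the matching URLs."""
    terms = request.lower().split()
    return sorted(
        url for url, text in database.items()
        if all(term in text.split() for term in terms)
    )

database = crawl("a.example/home", TOY_WEB)
print(search("ship", database))   # pages the spider reached by following links
print(search("plane", database))  # empty: the form-generated page was never crawled
```

The “plane” page exists in the toy Web, but because the spider only follows links, it never stores that page, so no search can return it.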
Finding Good Search Engines
* UC-Berkeley: Recommended Search Engines: http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/SearchEngines.html
* UC-Berkeley: The Best Search Engines (9/2003):
  – #1 Google
  – #2 Teoma
  – #3 Vivisimo
  – #4 AllTheWeb

What do we miss?
* Library of Congress: 30 million+ documents
* ERIC databases
* Most daily newspapers
* Health and medical databases
* Museum and library collections
* The information you need?

Why are pages invisible? (1)
1. Searchable databases:
  – Typing is required.
  – Selection of option combination is required.
  – Pages are not available until asked for (e.g., Library of Congress).
  – Pages are not static but dynamic (may not exist until requested).

Why are pages invisible? (2)
* Search engines can’t handle “dynamic pages.”
* Search engines can’t handle “input boxes.”

Why are pages invisible? (3)
2. Password or login required: (Spiders do not know passwords or login IDs.)
3. Non-HTML pages:
  – PDF, Word, Shockwave, Flash...
  – Some search engines may find them: e.g., Google, AltaVista

Why are pages invisible? (4)
4. Script-based (computer generated) pages:
  – Create all or part of a Web page
  – Contain “?” in URL
  – Spiders programmed to back off
  – http://calver.org/search/file/ship (yes!)
  – http://calver.org/search?title=plane (no)

Sites to Check

Finding Invisible Information (1)
* “Librarians’ Index”
* Compiled by librarians in the “information supply business”
* Highest-quality sites only
* Reliable, annotated
* www.lii.org

Finding Invisible Information (2)
* “About”
* 2,400,000+ resources
* Wide variety of subjects: Teens, religion, spirituality, shopping
* About.com

Finding Invisible Information (3)
* “direct search”
* “Data not easily or entirely searchable/accessible from general search tools.”
* www.freepint.com/gary/direct.htm

Finding Invisible Information (4)
* “The Invisible Web Catalog”
* 10,000+ searchable databases
* Quick search, “Hot List”
* Sort alphabetically or by score (relevance)
* www.profusion.com

Finding Invisible Information (5)
* www.invisible-web.net

Finding Invisible Information (6)
* “IncyWincy”
* Over 100,000 databases
* Many links to other search engines
* www.incywincy.com

Finding Invisible Information (7)
* “CompletePlanet”
* 103,000+ databases and specialty search engines
* Some “surface” searching
* www.completeplanet.com

Finding Invisible Information (8)
* Some are research oriented.
* “Infomine”: Infomine.ucr.edu/
* “Academic Info”: www.academicinfo.net

So… What To Do...
* Search several sites
* Use the “Advanced Search” feature
* Search using the term “Invisible Web” for IW search sites
* Search several “Invisible Web” sites

Questions?
PowerPoint available at descy.net
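
Supplementary sketch: the rule of thumb from “Why are pages invisible? (4)” (script-based pages contain “?” in the URL, and spiders are programmed to back off) can be written as a one-line check. The function name `spider_would_index` is made up for this illustration, and the rule is the simplified one from the slide, not the behavior of any particular search engine. The two URLs are the examples given on that slide.

```python
def spider_would_index(url):
    """Simplified slide rule: a spider backs off from any URL containing "?"."""
    return "?" not in url

# The two example URLs from the slide.
for url in ("http://calver.org/search/file/ship",
            "http://calver.org/search?title=plane"):
    verdict = "yes" if spider_would_index(url) else "no"
    print(url, "->", verdict)
```

Under this rule the first URL is treated as an ordinary static page, while the second is treated as script-generated and skipped.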