Exploring the Invisible Web

Kevin R. Morgan, Ed.D.
Professor/Designer
St. Petersburg College

Introduction: The Visible and Invisible Web

“I readily believe there are more invisible than visible things in the universe.” (Burnett, 1692), motto in Coleridge’s The Rime of the Ancient Mariner

There are vast amounts of information available online, but even more exists beyond the grasp of general-purpose search engines. A much larger universe of invisible information sits in databases and directories that general-purpose search engines can’t access, yet it is online, free, and of the highest academic standards.

The Information Age and the WWW

The Internet: Network to Knowledge
• A 21st-century Library of Alexandria
• A content-rich, interactive information network
• Internet information resources for teaching and learning

Utilizing the WWW for research and learning

The World Wide Web is estimated to contain over 3 billion documents (Barker, 2003). The Invisible Web is estimated to be 2-50 times bigger than the visible Web.

What is the Invisible Web?

The “Invisible Web” is a metaphor for the vast domain of information that lies beyond the visibility of our tools for gathering information. It is not really invisible, just passed over or missed.

The Invisible Web includes the following and more:
• content that has been excluded from general-purpose search engines and Web directories.
• databases from universities, libraries, organizations, businesses, and government agencies.
• a substantial part of the total Internet.

How Big is the Invisible Web?

One study, conducted by the search company BrightPlanet, estimated that the inaccessible part of the Web is about 500 times larger than the part search engines provide access to. It estimated that about 500 billion pages of information were available on the Web and that only 1/500 of that information could be reached via general search engines.
(Sullivan, 2000) More conservative estimates place the Invisible Web at 2-50 times bigger than the visible Web (Barker, 2003). Even using the most conservative estimates, the Invisible Web represents a considerable quantity of information that lies beneath the surface of the Web.

It is deeper than we thought!

Figure: Visible Web (3 billion documents; captured in general-purpose search engines) and Invisible Web (100-500% larger than the Visible Web; 6 billion-30 billion documents?). Invisible Web labels: Government Directories, Educational Research, Library of Congress, Institutional Directories, ERIC, Scientific Research, Colleges and Universities, Specialized Search Engines and Directories, Organizations (Public/Private).

The Web: Visible and Invisible Information

The visible Web is made up of HTML Web pages that search engines have chosen to include in their indices. Google, AltaVista, LookSmart, and other general-purpose search engines all cover the surface of the Web but are limited in reaching the deeper regions of cyberspace. An even greater amount of invisible information sits in databases that general-purpose search engines can’t access directly, but it is nevertheless online and freely available to the savvy searcher.

Search Engines: Robots, Knowbots and Spiders

Search engines do not really search the Web directly. Computer robot programs, sometimes referred to as "crawlers," "knowledge-bots," or "knowbots," are used by search engines to roam the World Wide Web. Most large search engines operate several robots or spiders all the time. Even so, the Web is so enormous that it can take six months for spiders to cover it, resulting in a degree of "out-of-datedness" in results (Barker, 2003). Spiders or crawlers are programmed to retrieve general information while avoiding unfriendly or dangerous URLs, known as spider traps, that can trap them in endless loops of information.
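
The idea of a spider avoiding endless loops can be illustrated with a minimal Python sketch. It does not describe how any particular search engine works. It assumes a tiny in-memory link graph (the hypothetical `LINKS` dictionary) instead of real Web requests, and it uses two common safeguards, a set of already-visited pages and a depth limit, so that the crawl ends even when pages link to each other in a circle.

```python
from collections import deque

# Hypothetical link graph standing in for the Web: page -> pages it links to.
# "calendar/1" and "calendar/2" link to each other, forming a loop.
LINKS = {
    "home": ["about", "calendar/1"],
    "about": ["home"],
    "calendar/1": ["calendar/2"],
    "calendar/2": ["calendar/1", "deep/page"],
    "deep/page": [],
}


def crawl(start, links, max_depth=2):
    """Breadth-first crawl that never revisits a page and stops at max_depth."""
    visited = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        if depth == max_depth:
            continue  # do not follow links any deeper than the limit
        for target in links.get(page, []):
            if target not in visited:  # loop guard: skip pages already seen
                visited.add(target)
                queue.append((target, depth + 1))
    return order


if __name__ == "__main__":
    print(crawl("home", LINKS))
```

With the default depth limit, the page `deep/page` is never reached: a small-scale picture of how content can stay outside a crawler's index even though it is online.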
Reasons for Invisibility of Some Pages

There are certain types of pages that search engine companies routinely exclude by policy to save time and money. Some pages present technical barriers to Web crawlers and are passed over by general-purpose search engines for the sake of time and efficiency. For example, a spider or crawler will back off when it encounters a question mark (?) in a URL. To save time and money, spiders are programmed to avoid or exclude many sites, including educational, governmental, and organizational databases.

Visibility and Invisibility

Visible Web                                    Invisible Web
Educator’s Reference Desk                      ERIC Database
The Library of Congress                        Special Collections
URLs ending in edu, org, gov                   A page that has a ? in its URL
General Search Engines and Subject Gateways    Internal directories of institutions and organizations

It is very difficult to predict which sites will or won't be part of the Invisible Web. As search engines change their policies, what is invisible today can become visible tomorrow. Many sites are already hybrid, with both visible and invisible components.

The Value in Using the Invisible Web

Invisible Web resources offer the highest level of authority, because educational institutions and government organizations maintain a high level of quality control over their information. Specialized search interfaces provide more control over search input and output, with increased precision. Comprehensive resources allow searchers to perform exhaustive searches within a specific subject area and to stay current; such a search can yield exhaustive results of timely content. Invisible Web databases are updated often, so they hold the most current information available online.

Understanding the Invisible Web

The data found in the Invisible Web cannot be accessed easily via general-purpose search engines. The Invisible Web is not the sole solution to all of one’s information needs. It should be used in conjunction with other information sources, including general searches.
Invisible Web resources clearly identify who is providing the information, making it easy to judge the authority of the content and its provider. Targeted crawlers offer more comprehensive coverage of their subjects than general-purpose search engines.

Finding the Subject Databases and Directories

Much of the Invisible Web is made up of the contents of thousands of specialized databases accessible online. Have a clear subject in mind to find the best specialized databases for your subject of study or field of research. Many databases can be found by adding the word "database" after a subject term, such as “humanities database” or “history database.” Another tip is to search using the words "web directory" followed by your topic. If a directory Web page refers to itself using the words "web directory," you will locate it.

Searching Tip: Use Subject Gateways

Searching through subject databases and Web directories may be unfruitful for the novice searcher or student. Many of these independent searches can end in blocked access.

Problem: Many of the databases are password protected.

Solution: An easier and more fruitful method for finding databases relating to a specific subject area is to use gateway sites that have already been organized by subject and content. These subject gateways are organized from general to specific, enabling students, educators, and researchers to find valuable visible and invisible sources on the Internet.
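
The two search-phrase tips from "Finding the Subject Databases and Directories" can be written down as a small Python helper. This is only an illustrative sketch: the function name `build_queries` is made up here, and it merely formats the phrases one would type into a search engine. It does not contact any search engine.

```python
def build_queries(subject):
    """Return the two suggested search phrases for locating subject databases."""
    subject = subject.strip()
    if not subject:
        raise ValueError("subject must not be empty")
    return [
        f"{subject} database",       # subject term followed by "database"
        f"web directory {subject}",  # "web directory" followed by the topic
    ]


if __name__ == "__main__":
    for query in build_queries("humanities"):
        print(query)
```

For the subject "humanities", the helper produces the phrases "humanities database" and "web directory humanities", matching the examples in the tips.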
Educational Gateways

Infomine provides a gateway to scholarly Internet resource collections: http://infomine.ucr.edu/
Academic Info also provides an educational subject directory and subject gateways: http://www.academicinfo.net/
The Educator’s Reference Desk has become the new access gateway to the ERIC databases: http://www.eduref.org/
The Alliance for Life-Long Learning offers online classes from Stanford, Yale, and Oxford Universities and provides a library of online resources, through its Academic Subject directories, that meet the highest academic standards: http://www.alllearn.org/er/directories.cgi

General Purpose Subject Gateways

Use the Invisible Web Directory from Sherman and Price’s companion site to The Invisible Web: http://www.invisible-web.net/
See this multi-subject guide to specialized search engines: http://www.searchability.com/
Explore CompletePlanet to link to over 103,000 searchable databases and specialty search engines: http://www.completeplanet.com/

Evaluating Invisible Web Resources

The Librarians' Index to the Internet provides an annotated directory with cross-reference links to both visible and invisible content: http://lii.org/
ResearchBuzz provides daily updates on search engines, new software, browser technology, Web directories, and databases: http://researchbuzz.com
The Scout Report provides academics, researchers, librarians, and the K-12 community with valuable online information: http://scout.cs.wisc.edu/index.php
The Internet Resources Newsletter is a monthly newsletter for academics, students, scientists, and social scientists: http://www.hw.ac.uk/libWWW/irn/irn.html

References

Barker (2003). “Recommended Search Engines: Table of Features.” UC Berkeley. http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/SearchEngines.html
Sherman & Price (2001). The Invisible Web: Uncovering Information Sources Search Engines Can’t Find. CyberAge.
Sullivan (2000). “Invisible Web Gets Deeper.” The Search Engine Report, August 2000.
http://searchenginewatch.com/sereport/article.php/2162871

Exploring the Invisible Web

Contact Information
Dr. Kevin R. Morgan
St. Petersburg College: eCampus
Seminole, Florida
morgank@spcollege.edu