The Invisible Web

It’s Everywhere… It’s Everywhere…
“The Computer as an Educational Tool:
Productivity and Problem Solving”
©Richard C. Forcier and
Don E. Descy
Today
Why is it important?
Searching, searching, searching
The searchable Web
The invisible Web (deep Web)
What is it?
Why is it?
How do we get around in it?
Resources/References
Why is this important?
You are going to want information…
*Reports, papers, presentations
*Medical, family, jobs, personal
Your students are going to want
information…
*Reports, papers, presentations,
personal
Most of what you want can’t be found
using regular search techniques!
The Question
How do you find information
that is available…
but isn’t?
How do you find your exit on the
“Information Superhighway” if
mapping that exit can’t be done?
The Invisible Web
Web sites that are hidden or are
unable to be found or cataloged
by regular search engines.
“Public information on the deep Web is
currently 400 to 550 times larger
than the commonly defined
World Wide Web.”
(BrightPlanet, 2003)
“A full ninety-five per cent of the deep
Web is publicly accessible information
— not subject to fees or
subscriptions.”
(BrightPlanet, 2003)
The Invisible Web Facts
200,000+ Web sites
550 billion individual documents
compared to the three billion of the
surface Web
Contains 7,500 terabytes of
information compared to nineteen
terabytes in the surface Web
Total quality content is 1,000 to 2,000
times greater than that of the surface
Web.
The Invisible Web Facts (2)
Sixty of the largest sites collectively
contain over 750 terabytes of
information — They exceed the size
of the surface Web forty times.
Fastest growing category of new
information on the Internet
Fifty percent greater monthly traffic
than surface sites
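A quick arithmetic check of the figures quoted above, as a minimal sketch. The numbers are taken from these slides (BrightPlanet, 2003); the variable names are illustrative only. The size ratio comes out just under 400, in line with the low end of the quoted "400 to 550 times," and the sixty largest sites come out at roughly forty times the surface Web.

```python
# Figures quoted on the slides (BrightPlanet, 2003), in terabytes.
deep_web_tb = 7500
surface_web_tb = 19
top_sixty_sites_tb = 750

# Document counts quoted on the slides.
deep_web_docs = 550e9
surface_web_docs = 3e9

size_ratio = deep_web_tb / surface_web_tb
top_sixty_ratio = top_sixty_sites_tb / surface_web_tb
doc_ratio = deep_web_docs / surface_web_docs

print(f"Deep Web vs. surface Web by size: about {size_ratio:.0f}x")
print(f"Sixty largest sites vs. surface Web: about {top_sixty_ratio:.0f}x")
print(f"Deep Web vs. surface Web by document count: about {doc_ratio:.0f}x")
```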
Invisible Web Facts (3)
More highly linked to than surface
sites
Narrower, with deeper content, than
conventional surface sites
More than half of the content resides
in topic-specific databases
Content is highly relevant to every
information need, market, and
domain.
Invisible Web Facts (4)
Not well known to the Internet-searching public
Searching, Searching, Searching
Usually carried out using a
“directory” or “search engine”
Fast and efficient
Misses most of what is out there
70% of searchers start from three
sites (Nielson, 2003): Google, Yahoo,
and MSN.
Searching Tools
Directories
Search engines
Directories
Hand selected, evaluated, annotated
Broad topics work best.
Quality over quantity
Location on list: May be paid
How Directories Work
Directory staff find sites on the Web/Internet
Staff evaluate, catalog, and add them to the directory index
Directory server holds the index/information
User searches or browses the directory
Directory Problems
Done by humans
Takes time
No universal categories or
cataloging system
Misses most information/sites
General Subject Directories
“Yahoo”
Biggest and most famous
Often useful
Information… jobs… travel…
shopping…
to…
Yahoo.com
Search Engines
Computer generated
Narrower topics
Quantity over quality
Uses newer retrieval technologies
Location on list: May be paid
Google, Hotbot, Northern Light,
AltaVista, etc.
How Search Engines Work
Spiders/robots comb the Web/Internet
Database stores URL and content
User inputs request
Search engine matches request to content
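The store-and-match steps on this slide can be illustrated with a toy index. This is a minimal sketch, not how any particular search engine works: the page URLs and text are made up, the function names `build_index` and `search` are hypothetical, and real engines add ranking, stemming, and much more.

```python
from collections import defaultdict

# Hypothetical pages a spider might have fetched: URL -> page text.
crawled_pages = {
    "http://example.com/ships": "sailing ships and cargo ships",
    "http://example.com/planes": "cargo planes and passenger planes",
}

def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, request):
    """Return the URLs that contain every word of the user's request."""
    words = request.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

index = build_index(crawled_pages)
print(sorted(search(index, "cargo ships")))
```

Only pages the spider actually fetched appear in the index, so anything the spider cannot reach can never be returned for a request.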
Search Engine Problems
Spiders/robots don’t think.
More likely to index sites with more
links to them (popularity)
More likely to index U.S. sites
More likely to index commercial sites
Sites pay for indexing/position.
At one time showed actual bid!
Finding Good Search Engines
UC-Berkeley: Recommended Search
Engines:
http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/SearchEngines.html
UC-Berkeley: The Best Search
Engines (9/2003):
#1 Google
#2 Teoma
#3 Vivisimo
#4 AllTheWeb
What do we miss?
Library of Congress: 30 million+
documents
ERIC databases
Most daily newspapers
Health and medical databases
Museum and library collections
The information you need?
Why are pages invisible? (1)
1. Searchable databases:
Typing is required.
Selection of option combination
is required.
**Pages are not available until asked
for (e.g., Library of Congress).
**Pages are not static but dynamic
(may not exist until requested).
Why are pages invisible? (2)
Search engines can’t handle
“dynamic pages.”
Search engines can’t handle “input
boxes.”
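To see why an input box stops a spider, here is a minimal sketch using Python's standard `html.parser`. The sample page and the `SpiderView` class are hypothetical. The spider can collect the static link, but the form only tells it where a request would be sent, not what to type into it.

```python
from html.parser import HTMLParser

class SpiderView(HTMLParser):
    """Collect what a simple spider can follow (links) and what it cannot (forms)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.forms = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "form":
            # The results behind this form do not exist until someone submits it.
            self.forms.append(attrs.get("action", ""))

# Hypothetical database front page: one static link plus a search form.
page = """
<a href="/about.html">About this catalog</a>
<form action="/search" method="get">
  <input type="text" name="title">
  <input type="submit" value="Search">
</form>
"""

spider = SpiderView()
spider.feed(page)
print("Can follow:", spider.links)
print("Stops at form:", spider.forms)
```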
Why are pages invisible? (3)
2. Password or login required:
(Spiders do not know passwords or
login IDs.)
3. Non-HTML pages:
– PDF, Word, Shockwave, Flash...
– Some search engines may find them:
e.g., Google, AltaVista
Why are pages invisible? (4)
4. Script-based (computer generated)
pages:
– Create all or part of a Web page
– Contain “?” in URL
– Spiders programmed to back off
– http://calver.org/search/file/ship (yes!)
– http://calver.org/search?title=plane (no)
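The "back off" rule can be sketched as a simple URL check. This is an assumption about how such a rule might look, not the logic of any specific spider; the function name `spider_would_skip` is hypothetical, and the two URLs are the examples from this slide.

```python
from urllib.parse import urlsplit

def spider_would_skip(url):
    """Return True if the URL carries a query string (a "?" followed by parameters)."""
    return urlsplit(url).query != ""

for url in ("http://calver.org/search/file/ship",
            "http://calver.org/search?title=plane"):
    verdict = "skip" if spider_would_skip(url) else "index"
    print(url, "->", verdict)
```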
Sites to Check
Finding Invisible Information (1)
“Librarians’ Index”
Compiled by librarians in the
“information supply business”
Highest-quality sites only
Reliable, annotated
www.lii.org
Finding Invisible Information (2)
“About”
2,400,000+ resources
Wide variety of subjects: Teens,
religion, spirituality, shopping
About.com
Finding Invisible Information (3)
“direct search”
“Data not easily or entirely
searchable/accessible from general
search tools.”
www.freepint.com/gary/direct.htm
Finding Invisible Information (4)
“The Invisible Web Catalog”
10,000+ searchable databases
Quick search, “Hot List”
Sort alphabetically or by score
(relevance)
www.profusion.com
Finding Invisible Information (5)
www.invisible-web.net
Finding Invisible Information (6)
“IncyWincy”
Over 100,000 databases
Many links to other search engines
www.incywincy.com
Finding Invisible Information (7)
“CompletePlanet”
103,000+ databases and specialty
search engines
Some “surface” searching
www.completeplanet.com
Finding Invisible Information (8)
Some are research oriented.
“Infomine”
Infomine.ucr.edu/
“Academic Info”
www.academicinfo.net
So… What To Do...
Search several sites
Use the “Advanced Search” feature
Search using the term “Invisible
Web” for IW search sites
Search several “Invisible Web” sites
Questions?
PowerPoint available at
descy.net