Computer Science 1000
Information Searching I
Redistribution of these slides without permission is strictly prohibited

World Wide Web – The Basics


our next topic examines how to find information on the web
we consider a few basic terms here (which you’re probably familiar with):

page/web page
link/hyperlink
site/web site

later in the semester, we will revisit web technologies in much more detail

World Wide Web



a system of linked documents accessed via the
internet
often simply referred to as the web
sometimes used interchangeably with the internet,
but this isn’t exactly correct


the internet is the global network of interconnected
devices (computers, routers, etc.) that exchange data
the web refers to the documents being stored, the
software that serves and receives them, and the
protocols used for transmission

Web Page




a document stored and accessed on the web
identified by a unique URL (Uniform Resource Locator)
often referred to simply as a page
today’s web pages are very rich in content




text
images
hyperlinks
videos

Web Site



a collection of related webpages on the internet
typically belonging to a common organization or event
example

all pages served by the University of Lethbridge make up
its website

Hyperlink



a part of a web page that refers to a different
location
often just called a link
hyperlinks can reference:



another place on the same page
another webpage
hypertext: text containing hyperlinks

The Age of Information


the computer, internet, and web have changed how
we interact with information
information storage


the amount of available information is significantly greater
than even a generation ago, and is growing rapidly
information transmission

large amounts of information are available with a single
mouse click, and transfer almost immediately

Information Age – Rapid Onset


the situation has transformed tremendously in your
lifetimes
consider the global information capacity:





in 1986: 2.6 exabytes (< 1 CD per person)
in 1993: 15.8 exabytes
in 2000: 54.5 exabytes
in 2007: 295 exabytes (61 CDs per person)
how does one successfully navigate such a
mountain of digital content?
Hilbert and López. The World’s Technological Capacity to Store, Communicate, and Compute Information. Science 332(6025), 2011
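
To put the figures on this slide in perspective, here is a small sketch that computes the overall growth factor and an implied average yearly growth rate from the listed data points. The function and variable names are illustrative, and the smooth compound-growth model is a simplifying assumption made here, not a claim taken from the cited study.

```python
# Global information storage capacity in exabytes, as listed on the slide.
capacity_eb = {1986: 2.6, 1993: 15.8, 2000: 54.5, 2007: 295.0}


def growth_factor(data, start, end):
    """How many times larger the capacity is in `end` than in `start`."""
    return data[end] / data[start]


def average_annual_growth(data, start, end):
    """Average yearly growth rate, assuming smooth compound growth."""
    years = end - start
    return growth_factor(data, start, end) ** (1 / years) - 1


factor = growth_factor(capacity_eb, 1986, 2007)
rate = average_annual_growth(capacity_eb, 1986, 2007)
print(f"1986 -> 2007: about {factor:.0f} times more capacity")
print(f"average growth: about {rate:.0%} per year")
```

Under these assumptions, capacity grew by more than a hundredfold over the 21 years covered.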

Information Access

even in pre-internet days, there was a
wealth of information




large-scale: library
medium-scale: encyclopaedia set
small-scale: newspaper
strategies were developed to manage
information



categories
hierarchies
indices

Classification
"systematic arrangement in groups or categories according to established criteria" – Merriam-Webster
in other words, the information is categorized according to relevant features
consider our course notes:

terminology (4 sets of slides)
information searching (2-3 sets of slides)
etc.
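
As a rough illustration of categorizing items by a relevant feature, the sketch below groups a few slide sets by topic. The slide-set titles and the `classify` helper are hypothetical and exist only for this example.

```python
from collections import defaultdict

# Hypothetical slide sets, each tagged with the feature we classify by (its topic).
slide_sets = [
    ("Terminology I", "terminology"),
    ("Terminology II", "terminology"),
    ("Information Searching I", "information searching"),
    ("Information Searching II", "information searching"),
]


def classify(items):
    """Group item names by their category."""
    groups = defaultdict(list)
    for name, category in items:
        groups[category].append(name)
    return dict(groups)


groups = classify(slide_sets)
for category, names in groups.items():
    print(category, "->", names)
```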


Classification
classification is not specific to digital
information
 library classification:

Dewey Decimal Classification
Library of Congress Classification

 newspaper classification


Classification
classification level of detail leads to
tradeoffs
 consider a coarse level of detail


e.g. taxonomy of living organisms



classify organisms according to Domain
(Archaea, Bacteria, Eukarya)
advantage: small number of groups
disadvantage: each group is massive

 consider a fine level of detail


e.g. taxonomy of living organisms




classify organisms according to Genus
(Canis, Felis)
advantage: each group reasonably small
disadvantage: massive number of
groups
solution: hierarchy

Hierarchy
a decomposition of classifications according
to detail
hierarchies contain levels

at the top (root) level, there is typically a small number of broad categories
each category is decomposed into smaller categories
a classification group is defined by its categorization at each level
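
One way to picture levels in a hierarchy is as nested dictionaries, where choosing a category at each level leads to the smaller categories beneath it. The sketch below uses a deliberately tiny slice of the organism taxonomy; the `subcategories` helper is a hypothetical name, and the data is illustrative rather than complete.

```python
# A tiny slice of the organism taxonomy as nested dictionaries.
# Only a few levels and entries are included; this is far from complete.
taxonomy = {
    "Archaea": {},                    # Domain
    "Bacteria": {},                   # Domain
    "Eukarya": {                      # Domain
        "Animalia": {                 # Kingdom
            "Chordata": {             # Phylum
                "Mammalia": {},       # Class
            },
        },
        "Fungi": {},
        "Plantae": {},
        "Protista": {},
    },
}


def subcategories(hierarchy, path):
    """Follow one category choice per level and list what lies directly below."""
    node = hierarchy
    for category in path:
        node = node[category]
    return sorted(node)


print(subcategories(taxonomy, []))
print(subcategories(taxonomy, ["Eukarya"]))
print(subcategories(taxonomy, ["Eukarya", "Animalia", "Chordata"]))
```

The path `["Eukarya", "Animalia", "Chordata"]` is one classification group: it is defined by the category chosen at each level.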


Hierarchy

organism taxonomy hierarchy:

each Domain is categorized into Kingdoms
Domain: Eukarya
Kingdom: Animalia, Fungi, Plantae, Protista

Hierarchy

organism taxonomy hierarchy:



each Kingdom is classified into Phyla
each Phylum is classified into Classes
and so on ...
http://ag.arizona.edu/pubs/garden/mg/entomology/intro.html

Hierarchy

an object is still categorized, but by multiple
levels (instead of one)
http://schoolworkhelper.net/scientific-taxonomy/

Hierarchy

facilitates efficient searching through exclusion

example (text):






suppose you have a collection of a million items
these items are organized into 10 equal-sized groups
each top-level group is also organized into 10 equal
subgroups
choosing the first category eliminates 900,000 items
choosing the second category eliminates 90,000 of the remaining 100,000 items
and so on …
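
The arithmetic in this example can be sketched as a short loop. The function name is hypothetical, and the sketch assumes that every level keeps splitting into 10 equal-sized groups, which is what "and so on" suggests but the slide does not state outright.

```python
def search_by_exclusion(total_items, groups_per_level):
    """Yield (level, eliminated, remaining) as one category is chosen per level."""
    if groups_per_level < 2:
        raise ValueError("each level needs at least two groups")
    remaining = total_items
    level = 0
    while remaining > 1:
        level += 1
        kept = remaining // groups_per_level   # size of the chosen group
        yield level, remaining - kept, kept
        remaining = kept


steps = list(search_by_exclusion(1_000_000, 10))
for level, eliminated, remaining in steps:
    print(f"level {level}: eliminates {eliminated}, leaves {remaining}")
```

Under that assumption, six category choices are enough to narrow a million items down to one.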

Hierarchy


hierarchies are very popular
consider our previous examples:

Library of Congress Classification


Newspaper

Index


a detailed list of words, phrases, and/or topics indicating place of occurrence
in essence, it maps keywords of interest to their location

e.g. a page number

a bottom-up approach to information organization

as opposed to the top-down structure of a hierarchy

particularly popular in printed material

books, magazines, volumes, etc.
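
The idea of mapping keywords to their locations can be sketched as a dictionary from keyword to page numbers. The page contents, keyword list, and `build_index` helper below are all made up for illustration, and the matching is a simple exact-word comparison.

```python
from collections import defaultdict

# Hypothetical book: page number -> text on that page.
pages = {
    1: "the web is a system of linked documents",
    2: "a hyperlink refers to a different location",
    3: "a web page is a document on the web",
}


def build_index(pages, keywords):
    """Map each keyword to the sorted list of pages on which it occurs."""
    index = defaultdict(list)
    for number in sorted(pages):
        words = set(pages[number].split())     # exact-word matching only
        for keyword in keywords:
            if keyword in words:
                index[keyword].append(number)
    return dict(index)


index = build_index(pages, ["web", "hyperlink", "document"])
for keyword in sorted(index):                  # alphabetical, as in a printed index
    print(keyword, index[keyword])
```

The index is built bottom-up from the content itself, in contrast to a hierarchy, which is imposed from the top down.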

Index - Example

Index

typically used on small-scale collections

books and volumes vs. libraries

made efficient through an organizational scheme

alphabetical is very common

some overlap with hierarchies

e.g. subtopics

Finding Information – The Web
as discussed, the amount of information on the web is immense
many of the discussed techniques for information finding also apply digitally

classification/hierarchies
indexing


Classification

many commercial websites have a
classification structure

navigation bars

Hierarchies

many websites, especially large ones, will
also arrange their categories in hierarchical
fashion

Partition

a hierarchy where every object occurs only once


some hierarchies are necessarily partitions

organism taxonomy – every species appears only once
e.g. a particular book will only occur at one point in a library classification

however, in some cases a partition is not natural

an object might have an inherent fit in more than one classification

Partitions

digital content is often stored using overlapping
hierarchies (non-partition)



potentially more intuitive
with hyperlinking, it’s easy to accomplish (two links to the
same page)
example (text):

Three Books for Frugal Fashionistas was stored on NPR’s
website under:


Home > Arts & Life > Books > Three Books for Frugal Fashionistas
Home > Listen > Latest Program > Three Books for Frugal Fashionistas
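
The NPR example can be sketched as a check for whether a set of category paths forms a partition. The two paths are the ones listed above; the `locations` and `is_partition` helpers are hypothetical names introduced for this illustration.

```python
# Each path lists the category chosen at each level, ending with the page title.
paths = [
    ("Home", "Arts & Life", "Books", "Three Books for Frugal Fashionistas"),
    ("Home", "Listen", "Latest Program", "Three Books for Frugal Fashionistas"),
]


def locations(paths):
    """Map each page title to every category path that leads to it."""
    found = {}
    for *categories, title in paths:
        found.setdefault(title, []).append(" > ".join(categories))
    return found


def is_partition(paths):
    """True if every page appears at exactly one place in the hierarchy."""
    return all(len(places) == 1 for places in locations(paths).values())


for title, places in locations(paths).items():
    print(title, "is filed under:", places)
print("partition?", is_partition(paths))
```

Because the same page is reachable through two category paths, this hierarchy overlaps and is not a partition.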

Indexes for the Web

unlike hierarchies, indexes are much less common
on individual websites


site maps might be considered an index of sorts
however, there are analogous technologies to
indexes that pertain to the web as a whole

Search Engines!