Lecture02 Structure.ppt

Types & structures of
information resources
What is out there for searching ?
What’s under the hood?
-essential knowledge for searchers
tefkos@rutgers.edu; http://comminfo.rutgers.edu/~tefko/
Tefko Saracevic
Central ideas
As a searcher you start with knowing:
Information resources
• What is out there available
for searching
• And there is a LOT!
• In this lecture & course we
will explore a sample only
– to illustrate
• from which you can
generalize
• and explore later more fully
in other courses or
professionally
Content
Their organization
• How structured, prepared
– indexed, classified, tagged,
labeled, abstracted, full text
treated … … …
– stored
– made accessible
• All in laying the ground for
searching
• Knowing what is under the
hood
Structure
ToC
1. Definitions & terminology
2. Examples of vendors
3. Structure of records in databases
4. Indexes – as used in searching
5. Conclusion
1. Definitions & terminology
A few concepts that we are familiar
with, but still worth revisiting
Definitions
Resource:
source of help: somebody or
something that is a source of help
or information
Information resource:
Generic: A broad range of sources of
information in a variety of
formats
The data and information assets of
an organization , incl. a library
Databases, files, systems containing
organized information records
Database ( from Webopedia)
A collection of information
organized in such a way that
a computer program can
quickly select desired pieces
of data. You can think of a
database as an electronic
filing system.
– Dialog, Google are inf. resources
Definitions (cont.)
From Webopedia again:
Traditional databases are
organized by fields,
records, and files.
• A field is a single piece
of information
• A record is one
complete set of fields
• And a file is a collection
of records.
• E.g. a telephone book is
analogous to a file. It
contains a list of records,
each of which consists of
three fields: name, address,
and telephone number
• A catalog is a file. It contains
a list of records (catalog
entries) describing books in
a library . Each record has
fields, such as author, title,
publisher, date, subject
headings ….
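The field / record / file hierarchy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name CatalogRecord, the helper search_field, and the two sample entries are invented for this sketch and do not come from any real catalog system.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogRecord:
    """One record = one complete set of fields (hypothetical field names)."""
    author: str
    title: str
    publisher: str
    date: int
    subject_headings: list = field(default_factory=list)


# A file is simply a collection of records. These entries are invented.
catalog_file = [
    CatalogRecord("Doe, Jane", "Searching Online", "Example Press", 2005,
                  ["Online searching"]),
    CatalogRecord("Roe, Richard", "Digital Libraries", "Sample Books", 2008,
                  ["Digital libraries", "Information retrieval"]),
]


def search_field(records, field_name, term):
    """Return records whose named field contains the term (case-insensitive)."""
    term = term.lower()
    hits = []
    for rec in records:
        value = getattr(rec, field_name)
        # A field may hold a single value or a list (e.g. subject headings).
        values = value if isinstance(value, list) else [str(value)]
        if any(term in v.lower() for v in values):
            hits.append(rec)
    return hits


title_hits = search_field(catalog_file, "title", "digital")
subject_hits = search_field(catalog_file, "subject_headings", "online")
print([r.title for r in title_hits])
print([r.title for r in subject_hits])
```

The point of the sketch is that the same term gives different results depending on which field is searched, which is why knowing the fields matters.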
On fields for searching
• Records (documents , objects)
used in information resources
are always organized in fields
– but different resources may and
do use different sets of fields
– metadata provides information
ABOUT a record; used for
instance in Web records; always
organized in fields
• Fields serve to guide, point
out, or otherwise facilitate
searching
• Searching is automatically
always done by fields, even
if one does not know that or
has no idea of fields
• But more about fields later
• Indexes used in searching are
organized, divided by fields
Who provides inf. resources for searching?
• Terminology as to who & what can be confusing &
not consistent - so beware & do your own translation
– Provider: a producer of databases; there are great many
providers covering many fields
• e.g. Dept. of Education produces ERIC – a database of abstracts &
indexes of educational materials (articles, reports)
– Vendors or aggregators: organizations or companies that
get databases from providers or set of sources like journals
from publishers & organize them for searching; there is a
large number of vendors
– some providers are their own vendors:
• e.g. Chemical Abstracts Service runs STN (Scientific & Technical Information Network)
2. Examples of vendors
Illustrates the ever changing
information industry
Example of a vendor:
• Dialog is the oldest on the market
– started in 1972
• Acquires databases from
information providers
– it has over 900 databases
• Organizes content according
to uniform structures
• Describes the content
– done in Bluesheets
• a most important search tool for
you!
• Provides uniform & complex
searching capabilities
– geared toward professionals
• you have to master them for
effective searching
• Creates some files of its own
– e.g. super indexes such as Dialindex
• Access
– mostly through libraries &
companies as subscribers
– RUL does not have it, but in class
you have free access
Story of Dialog
illustrative of turbulence in the inf. industry
1964 Roger Summit started Information
Sciences Laboratory at Lockheed
Missile & Space Company
– in the 1960’s developed Recon – online
system for NASA (government contract)
1972 Summit convinced Lockheed of
online commercial potential & it
went public as Dialog
– advent of online information industry
1981 became subsidiary of Lockheed
– moved to Palo Alto, CA
1989, the company was sold to Knight-Ridder, which had other inf. resources
– incorporated DataStar, a European
online company with 350 mostly
European oriented databases - still there
1997 Dialog was bought by the U.K.-based M.A.I.D. Corp.
– moved to Cary, N.C. – still there
2000 The Thomson Corporation (now
Thomson Reuters) acquired Dialog
– in 1992 Thomson bought ISI with
citation indexes that became Web of
Knowledge incl. Web of Science
2008 Dialog was bought by ProQuest
– ProQuest was Bell & Howell, also
UMI, also University Microfilms …
– has many inf. products & services
– among them CSA, another online
vendor with over 100 databases
Still in business!
BTW – why do we still teach Dialog?
• Dialog is a legacy database –
granddaddy
– some call it a dinosaur
• So why do we use Dialog for
exercises?
• Several reasons:
• oldest and largest surviving
vendor
• has by far the most
comprehensive set of
databases
• has a well developed
instructional program
But most importantly:
– serves as a good test bed to
develop searching skills that are
generalizable
• learning what is under the hood
of all databases
– what you will systematically
learn from using Dialog can
be translated to all searching
• & you get an insight into
problems with searching
Newest large database:
• Scopus started in 2004 by
Elsevier – a HUGE publisher
• Very different from Dialog
– integrates over 17,000 journals &
other materials (has no separate
databases, but could be searched by
broad fields, type of materials, etc.)
• Indexes all (or takes existing
indexing for some)
• Elsevier also has
– Scirus – free science search engine
– ScienceDirect – journals’ full texts,
available on RUL, Indexes and databases
• Provides intuitive searching
– geared toward end user
– also provides various other
capabilities e.g. citation tracking
• Most subscribers are libraries &
companies
– but through them access to
end users
• RUL subscribed, but dropped it
• in class you have free access
• Major competition to Web of
Science (RUL has it)
Types of information databases
• Many types are available:
– Bibliographic
– Numeric
– Full text
– Directory
– Image
• still, film, video
– Sound
• spoken word, music
– Multimedia
– Real time
• Some that are in Dialog are
also available elsewhere or on
their own
• Some vendors have exclusive
right to some databases
• Many you find in RUL
Other vendors/aggregators
sample from RUL 275 databases; links require RUL login
Various disciplines or areas
Agricola
America: History and Life
Business and Industry Database
Dissertations and Theses
Education Index/Abstracts/Full Text
Factiva
Hispanic-American Periodicals Index
LexisNexis Academic
Medline
Oceanic Abstracts
Pollution Abstracts
Women's Studies International
Particularly related to LIS
ACM Digital Library
ASIST Digital Library
Computing Reviews
IEEE Xplore
Library, Information Science &
Technology Abstracts (LISTA)
Library and Information Science
Abstracts (LISA)
Library Literature and Information
Science
Professional Development Collection
Resources for College Libraries (RCL)
a BIG,
BIG problem
• In Dialog & some other vendors you can search a
number of databases at the same time
– so called federated searching
• in Dialog using file 411, Dialindex (get it: 411 … )
• In Scopus you search the whole thing – if you wish
• BUT in RUL & elsewhere there is no federated
searching
– you have to search each database separately
– at RUL through Searchlight you can search 8 databases
• others you have to search one at a time
– someday there will be federated searching, but at present
do not hold your breath
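The idea of federated searching, one query sent to several databases with the hits merged, can be sketched as follows. The three "databases" and their titles are invented, and the functions search_one and federated_search are hypothetical names for this sketch; real vendors implement this in their own ways.

```python
# Hypothetical databases: name -> list of (record id, title). Titles are invented.
DATABASES = {
    "db_a": [(1, "Digital libraries in education"), (2, "Pollution trends")],
    "db_b": [(7, "Evaluation of digital libraries"), (8, "Ocean currents")],
    "db_c": [(3, "Women in computing")],
}


def search_one(db_name, term):
    """Search a single database: return (database, id, title) for matching titles."""
    term = term.lower()
    return [(db_name, rec_id, title)
            for rec_id, title in DATABASES[db_name]
            if term in title.lower()]


def federated_search(term, db_names=None):
    """Run the same query against several databases and merge the hits."""
    if db_names is None:
        db_names = sorted(DATABASES)
    merged = []
    for name in db_names:
        merged.extend(search_one(name, term))
    return merged


hits = federated_search("digital")
# A per-database hit count shows which databases are worth searching in depth.
hits_per_db = {name: len(search_one(name, "digital")) for name in sorted(DATABASES)}
print(hits)
print(hits_per_db)
```

Without federated searching, the searcher has to run search_one by hand, database by database, and do the merging mentally.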
3. Structure of records in databases
Describing & organizing nature of
content
Now on to structures –
getting under the hood
• Databases structure their own records – documents, objects …
– why? to describe various parts of content for computers to
recognize – these are fields, as mentioned
• you can recognize that a section of a document is a title,
but a computer has to be told that a title is a title
– so that it can (among others) search for terms in a title when you
request so
• Fields in records are labeled as to content or function
– most fields in databases indicate the same content
• e.g. title, author, index terms, abstract, text parts, source, …
– but various databases do it in their own way
• in whatever convoluted way they do it, it is not that hard to decipher
Labeling schemes
• Many structure schemes were developed that prescribed
what to label & what to call the label – meta languages
– by providers, vendors, organizations, authorities
– in different subjects, domains
– for different types of objects
• Meta tags are used on the web – to describe & index
– semantic web is in development, to further enable description
of and searching for meaning
• MARC is a form of meta language
• To use these schemes for effective searching you have no
choice but to get familiar
Transparency of structures
• In some databases description of structure is readily
available
– even though it may look forbidding, complicated
• good example: Bluesheets in Dialog
…
• search fields in Scopus
• In others, structure is there but has to be discovered by
surmising
– even in
and particularly in
• But clever, appropriate use of structure in
searching is key to effective searching
Example: Dialog file 438 Bluesheet
Describes the
content of the file
file 438 record & fields- each field is searchable
e.g. /TI=title; AU=author; SO=source; JN=Journal; …
Indicates
field &
abbreviation
Organization of indexes in Dialog
it has two kinds of indexes
• Dialog has a Basic Index –
searched by default
• Entering a command s (or
select) digital and libraries
– finds all documents that have
the term digital and the term
libraries anywhere in the
document
– s digital and libraries/TI finds
documents that have these
terms in the title
• Dialog has also Additional
Indexes
– these are for Authors (AU),
Sources (SO) , Publication
Years (PY) … & many more
– searched as s (or select)
digital and libraries and
AU=Saracevic
All other databases have similar index arrangements; they are not as
clearly visible as in Dialog, but they are searchable through selections
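The split between a Basic Index (searched by default) and Additional Indexes (searched only with a field prefix) can be sketched roughly as below. This is not Dialog's implementation: the three records are invented, the choice of TI and AB as Basic Index fields is an assumption for the sketch, and select_basic and select_additional are hypothetical helper names.

```python
# Hypothetical records; the labels TI, AB, AU, PY follow the style used on the slide.
RECORDS = {
    1: {"TI": "Digital libraries", "AB": "Evaluation of use",
        "AU": "Saracevic", "PY": "2004"},
    2: {"TI": "Online searching", "AB": "Searching digital collections in libraries",
        "AU": "Smith", "PY": "2001"},
    3: {"TI": "Libraries and the web", "AB": "Digital services",
        "AU": "Saracevic", "PY": "1999"},
}
BASIC_FIELDS = ("TI", "AB")        # assumed subject fields, searched by default
ADDITIONAL_FIELDS = ("AU", "PY")   # searched only with a prefix such as AU=

basic_index = {}       # word -> {field -> set of record numbers}
additional_index = {}  # (field, value) -> set of record numbers
for num, rec in RECORDS.items():
    for f in BASIC_FIELDS:
        for word in rec[f].lower().split():
            basic_index.setdefault(word, {}).setdefault(f, set()).add(num)
    for f in ADDITIONAL_FIELDS:
        additional_index.setdefault((f, rec[f].lower()), set()).add(num)


def select_basic(word, field=None):
    """Look a word up in the basic index, optionally limited to one field."""
    by_field = basic_index.get(word.lower(), {})
    if field is not None:
        return set(by_field.get(field, set()))
    return set().union(*by_field.values())


def select_additional(field, value):
    """Look a value up in an additional index, e.g. AU=Saracevic."""
    return set(additional_index.get((field, value.lower()), set()))


# s digital and libraries        -> both terms anywhere in the basic index
anywhere = select_basic("digital") & select_basic("libraries")
# s digital and libraries/TI     -> both terms in the title (as the slide describes it)
title_only = select_basic("digital", "TI") & select_basic("libraries", "TI")
# s digital and libraries and AU=Saracevic
by_author = anywhere & select_additional("AU", "Saracevic")
print(anywhere, title_only, by_author)
```

Boolean AND is simply set intersection on record numbers; restricting to a field or adding an author prefix narrows the set further.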
file 438: searching in Basic Index – it is searched by default
Examples how to search in basic index by
words & other fields
S means select command; W means with –
terms next to each other in that order
file 438: fields in
Additional Indexes
Additional index is searched by
indicating the field to be searched –
examples how to search them
Neat trick:
If you want to search
the latest update only,
add to search
UD=9999
file 438: fields in
Limit
Searches can be limited to cover
documents with given attributes –
examples how to limit searches
S2 means set 2 as retrieved previously
file 438: additional
uses of structure
Results can be sorted or ranked by
given fields –
examples how to sort or rank results
file 438: options in
displaying of results
Results can be displayed & then
printed in a number of ways –
examples of available formats
But watch out!
In real life some
formats are free
others cost $$$$!
Economics – tail that wags the whole dog
• In class Dialog searching is free
– & you can use it for class exercises & learning
• In real life Dialog (as every other vendor) has an
elaborate economic structure
– different files have different price tags for use
– time of use is calculated in DialUnits
• a Byzantine structure of charges - it is beyond understanding
– in different files different formats have different price
attached
• full formats in some files are really hefty!
Where to find all about structure?
• In Dialog in BlueSheets (file 415)
– consult often! and again! and again! and again!
– files have similarities and differences in structure –
BlueSheets show that
• For other vendors:
– some have descriptions similar to BlueSheets
– some indicate fields that can be searched
• it shows structure
– in some revelation comes from checking what is
available in advanced searching or in tips for searching
– in some structure has to be surmised
Structure in search engines &
databases
• Mostly not readily apparent
– but all have capabilities to be used in searching
• Again: revelation comes from checking what
is available in Advanced Search, Search
Features, Search Tips, Help, & the like
• Most users do NOT take advantage of using
available structures in searching
– professional searchers do
• part of their tool kit & competencies
Example: structure from Advanced Search
Records
are
structured
& can be
searched
by these
fields &
topics
More fields available
Example of structure from Scopus (features)
Records are
structured &
can be
searched by
additional 10
or so pull
down fields
Example of structure from
Library Literature & Information Science Full Text (at RUL)
Records are
structured &
can be
searched by
additional 20
or so pull
down fields
Similarities & differences
• All vendors & search
engines have basic &
advanced Boolean-type
search capabilities
– but how it is done & bells and
whistles differ
– once you master concepts
you can then do an AHA!
when you encounter a
variation & then translate
• Many vendors & search
engines have advanced
search features
– many above & beyond Boolean
• All vendors rank output
results
– but how it is done differs
– by default most (Dialog, Scopus
& most others) use LIFO – Last in
First Out
– but also allow for a number of
other ways, e.g. by source
• Search engines use ranking by
relevance, clustering,
PageRank & other criteria
• proprietary – they do not tell you
about it - not easy to discern
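The difference between the default LIFO display and sorting by a chosen field can be shown with a tiny sketch. The result set is invented, and treating a higher accession number as "added later" is an assumption made only for this illustration.

```python
# Hypothetical result set: (accession number, source, title).
results = [
    (1001, "Journal B", "Older paper"),
    (1003, "Journal A", "Newest paper"),
    (1002, "Journal C", "Middle paper"),
]

# LIFO: the record that entered the database last is displayed first.
# Assumption: a higher accession number means the record was added later.
lifo = sorted(results, key=lambda r: r[0], reverse=True)

# Alternative: sort by a chosen field, here the source, alphabetically.
by_source = sorted(results, key=lambda r: r[1])

print([r[0] for r in lifo])
print([r[0] for r in by_source])
```

Neither ordering says anything about relevance, which is the main contrast with how search engines rank.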
Similarities & differences …
• Most users
– do not know or care about
structure
– do not search beyond default
capabilities
– do not look beyond one or
two pages of results
– miss many potentially
relevant results
– do not know what is under
the hood
– can’t do advanced – more
sophisticated – searching
• Professional searchers
– know that structure is very
much connected to searching
– learn about & use available
structures
– understand defaults & use
advanced capabilities as
necessary
– know “tricks” for not missing
stuff or not getting too much
or too much junk
– explore in order to learn what
is under the hood
4. Indexes
As used in searching
We all know what an index is
but to refresh
An index is a list of words and
associated pointers to
where those words can be
found in a document
Search engine indexing
collects, parses, and stores
data to facilitate fast and
accurate information
retrieval
- example of automatic indexing
• Many kinds of indexes e.g.
– back of the book index,
alphabetical , subject,
classified, faceted, …
• As to creation:
– manual, automatic,
– today trend is toward
automatic creation of indexes
• by means of computer
algorithms to select words or
phrases to identify content
Here we deal with index structures & in next
lecture we deal with indexing vocabularies
Inverted indexes
• All databases have some
kind of inverted index
– searching is done through
them
Inverted index:
An index containing terms, as
keys, mapped to references
to the documents they
appear in. The index is sorted
by its keys. “Inverted” means
that the documents are found
by matching on terms, rather
than the other way around.
From Apple Glossary
• End of the book index is an
inverted index
• First inverted indexes were
made in the 13th century
– concordance of the Bible
• a concordance is an
alphabetical list of the principal
words used in a book or body
of work, with their position
indicated as immediate context
• In contrast, sequential index
is a full index for each
document – one by one
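The contrast between a sequential index (document by document) and an inverted index (term by term) can be made concrete with a minimal sketch. The document numbers and index terms are invented, and invert is a hypothetical helper name.

```python
# Sequential (document-by-document) index: document number -> its index terms.
# Documents and terms are invented for illustration.
sequential_index = {
    101: ["digital", "libraries", "evaluation"],
    102: ["online", "searching"],
    103: ["digital", "searching"],
}


def invert(seq_index):
    """Turn doc -> terms into term -> sorted list of docs, with terms in sorted order."""
    inverted = {}
    for doc_no, terms in seq_index.items():
        for term in terms:
            inverted.setdefault(term, set()).add(doc_no)
    return {term: sorted(docs) for term, docs in sorted(inverted.items())}


inverted_index = invert(sequential_index)
for term, docs in inverted_index.items():
    print(term, docs)
```

With the sequential index, finding "digital" means scanning every document; with the inverted index it is a single lookup on the term.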
Making & searching of inverted indexes
• Inverted indexes can be
made from regular
sequential indexes for every
document
• But also from regular texts
– abstracts and full texts
• Automatic indexes are made
from texts – now easily
– following given algorithms
– omitting “stop” words
• Dialog has 9: AN, FOR, THE,
AND, FROM, TO, BY, OF, WITH
• Searching is then done on
the inverted index
– so it is useful to understand the
structure
• for a document every word is
identified as where it appears
in text
• search looks for appearance
e.g. if “digital” is in position 8 in sentence
10 & “library” is in position 9 in
sentence 10, then in a search for
“digital library” the algorithm looks for
positions where the terms “digital” &
“library” are next to each other in the same
sentence, finds them & retrieves the document
as a hit
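The position matching described above can be sketched as follows. This is a simplified illustration, not any vendor's algorithm: the sample text is invented, sentence splitting is a crude regular expression, stop words are not removed here, and positions and phrase_hits are hypothetical helper names.

```python
import re


def positions(text):
    """Map each word to a list of (sentence number, word position), both 1-based."""
    index = {}
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    for s_no, sentence in enumerate(sentences, start=1):
        words = re.findall(r"[a-z0-9]+", sentence.lower())
        for w_no, word in enumerate(words, start=1):
            index.setdefault(word, []).append((s_no, w_no))
    return index


def phrase_hits(index, first, second):
    """Return places where `second` directly follows `first` in the same sentence."""
    second_places = set(index.get(second, []))
    return [(s, p) for (s, p) in index.get(first, []) if (s, p + 1) in second_places]


text = "Libraries change. A digital library is a library with digital content."
idx = positions(text)
hits = phrase_hits(idx, "digital", "library")
print(idx["digital"], idx["library"], hits)
```

Both terms occur twice in the second sentence, but only one occurrence of "digital" is immediately followed by "library", so only that position counts as a phrase hit.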
Inverted indexes
Useful to know how they function to understand
search & retrieval. Steps:
1. Each document is indexed
– every word in a document is taken as index term
with exception of stop words
– position in text is noted
2. Indexes for all documents are merged
• index terms are arranged alphabetically in the
bowels of the system
• under each index term are document numbers in
which it appears & position in text for that document
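The two steps above can be sketched in Python. The stop-word list is the nine words the lecture gives for Dialog; everything else is an assumption for illustration: the two document texts are invented, index_document and merge_indexes are hypothetical names, and counting a stop word's position even though it is not indexed is a design choice of this sketch.

```python
# The nine stop words listed for Dialog in this lecture.
STOP_WORDS = {"an", "for", "the", "and", "from", "to", "by", "of", "with"}


def index_document(text):
    """Step 1: index one document -> list of (term, position), skipping stop words.

    Assumption: a stop word still occupies a position; it is just not indexed.
    """
    entries = []
    for position, word in enumerate(text.lower().split(), start=1):
        if word not in STOP_WORDS:
            entries.append((word, position))
    return entries


def merge_indexes(per_document):
    """Step 2: merge per-document indexes into one alphabetical inverted index."""
    merged = {}
    for doc_no, entries in per_document.items():
        for term, position in entries:
            merged.setdefault(term, []).append((doc_no, position))
    return {term: sorted(postings) for term, postings in sorted(merged.items())}


documents = {  # invented texts
    101: "evaluation of digital libraries",
    102: "searching the digital library",
}
per_document = {doc_no: index_document(text) for doc_no, text in documents.items()}
inverted = merge_indexes(per_document)
for term, postings in inverted.items():
    print(term, postings)
```

Each index term ends up with a list of (document number, position) pairs, which is the information a phrase or proximity search needs.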
Example on creating an inverted index (from Walker & Janes, 1999)
Four documents: 101, 102, 103, 104
Fields: TI=Bold; AB=text; DE=descriptor
Terms for each document – after stop words eliminated
Inverted index – a few last terms after letter R are missing, no space on page
Columns: Terms, Doc no., Field, Position
In conclusion
Searching is
more art than
science,
but an art that
needs a lot of
knowledge of what
is behind it
You can do it!
Try!
just start moving your
mouse on the empty
page serving as canvas