vector model and ProQuest

advertisement
LIS618 lecture 6
Vector Model and ProQuest
Thomas Krichel
2011-11-01
advantages of Boolean model
• supposedly easy to grasp by the user
• precise semantics of queries
• implemented in the majority of commercial
systems
problems of Boolean model
• sharp distinction between relevant and
irrelevant documents
• no ranking possible
• users find it difficult to formulate Boolean
queries
• users find it difficult to resolve Boolean
queries
vector model
• associates weights with each index term
appearing in the query and in each database
document.
• relevance can be calculated as the cosine
between the two vectors, i.e. their cross
product divided be the square roots of the
squares of each vector. This measure varies
between 0 and 1.
tf/idf
• stands for term frequency / inverse document
frequency
• This refers to a technique that gives term a
high rank in a document if
– the term appears frequently in a document
– the term does not appear frequently in other
documents
• We will look at each component one at time.
absolute & maximum term frequency
• Let F_t_d be the number of times term t
appears in the document d. This is its absolute
term frequency in the document.
• Let m_d be the maximum absolute term
frequency achieved by any term in document
d. Examples
– Document 1: a b a a b c c d
m_1 = 3, because "a" appears 3 times
– Document 2: a b a f f f e d f a a
m_2 = 4, because "a" or "f" appears 4 times
relative document term frequency
• The relative term frequency f_t_d, is given by
f_t_d = F_t_d / m_d
that is the absolute term frequency of term t in
document d divided by the maximum absolute
term frequency of document d.
• This completes the "term frequency" part of
the tf/idf formula.
• Let us look at this part through an example.
main example, part I
• Consider three documents
– 1:
– 2:
– 3:
abcafonlpoftyx
amoeeennnanpl
raeefnliffffxl
• First, look at the maximum frequency achieved
by any term in a given document.
m_1 = 2
m_2 = 4
m_3 = 5
("a", "f" and "o" are there twice)
("n" is there four times)
("f" is there five times)
main example part II
• Now look at some example of absolute term
frequency
F_a_1 = 2
F_e_2 = 3
F_x_3 = 1
• and some examples of relative term frequency
f_a_1 = F_a_1 / m_1 = 2 / 2 = 1
f_e_2 = F_e_2 / m_2 = 3 / 4 = 0.75
f_x_3 = F_x_3 / m_3 = 1 / 5 = 0.2
inverse document frequency
• Let N be the number of documents in the
datebase. N=3 in our example.
• Let n_t be the number of documents where
the term t appears. In our example
n_a = 3
n_e = 2
n_x = 2
• N/n_t is an indication of inverse document
frequency of a term. It is larger the less a term
appears across documents in the database.
intermezzo: the logarithm
• The logarithm, written log() is a mathematical
function. You should know that
– log() is an increasing function, i.e. the bigger is x,
the bigger is log(x).
– log(1) = 0
– log(x) > 0
if
x>1
• Your calculator will tell you what the logarithm
of a number is.
tf/idf formula
• Term frequency and inverse document
frequency have to be combined.
• The final formula for the weight combines the
terms as follows
w_t_d = f_t_d * log( N / n_t )
main example part III
N=3
w_a_1 = 1 * log(3/3) = log(1) = 0
!
w_e_2 = 0.75 * log(3/2)
w_x_3 = 0.2 * log(3/2)
where log(3/2) = 0.176, approximately
practical operation
• The computer will search the documents for the
query term and return the documents where the
weight of term in the index for that document is
strictly positive, by order of weights, highest to
lowest.
• If there are several query terms the computer will
perform a more complicated operation that we will
not further study here, so we limit ourselves to the
case of one query term.
practical tests
• You ask the computer to query the term "a" in
our example. What documents are being
returned?
– Compare with the result of the Boolean model.
• You ask the computer to query the term "e".
What documents are being returned, and in
what order?
advantages of vector model
•
•
•
•
term weighting improves performance
sorting is possible
easy to compute, therefore fast
results are difficult to improve without
– query expansion
– user feedback circle
ProQuest search targets
• ProQuest searches “citations” and
“documents”.
• “citations” are description of documents such
as author names, titles, journal etc.
• “documents” contain the full-text of
documents.
• Target differences imply different behavior of
an expression when matched against a
candidate.
ProQuest search
• If you enter two search terms, they will be
used as one phrase.
• If you use three term, they are searched to be
appearing in proximity.
• You can force phrase interpretation by placing
the search expressions into double quotes.
terms
• A search term is something you type and that
has a meaning on its own.
• For example: house, or krichel.
• Terms have a regular expression
interpretation.
regular expressions
• ‘*’ is used as a right-handed truncation
character only; it will find all forms of a word.
For example, searching for “econom*”.
• ‘?’ is used to replace any single character,
either inside the word or the right end of the
word. For example, searching for “wom?n”
• ‘?’ cannot be used to begin a word.
operators: and
• AND Find the words.
• When searching for keywords in "Citation and
Document Text," AND finds documents in
which the words occur in the same paragraph
(within approx. 1000 characters) or the words
appear in any citation field.
operator: and not, or
• “and not” is the same as “not” in Dialog.
• “or” is a normal Boolean or.
proximity operators
• W/number Find documents where these
words are within some number number of
words apart (either before or after). Use when
searching for keywords within "Citation and
Document Text" or "Document Text."
Example: computer W/3 careers
• NOT W/number does the opposite.
proximity operators
• W/PARA Finds documents where these words
are within the same paragraph (within approx.
1000 characters). Use when searching for
keywords within "Document Text."
Example: internet W/PARA web
proximity operators
• W/DOC Find documents where all the words
appear within the document text. Use W/DOC
in place of AND when searching for keywords
within "Citation and Document Text" or
"Document Text" to retrieve more
comprehensive results.
Example: Internet W/DOC education
proximity operators
• PRE/number Find documents where the first
word appears some number number of words
before the second word.
• Use when searching for keywords within
"Citation and Document Text" or "Document
Text."
Example: world pre/3 web
field syntax
• It is possible to limit a search for a term to a
field.
• This is done by writing field( term)
abstract
• ABS() search article abstracts for your terms.
• Examples:
ABS(customer delight)
ABS(ozone)
appendix
• APX() searches the appendix of a document.
The appendix usually comes at the end of the
document, identified by a header
• Use Keywords to search this field.
• Example: APX(Michigan)
author
• AU() is used to find articles written by an
author or reviewer.
• Example
AU(Thomas Krichel)
Classification code (ABI)
• Use Classification Codes when searching
business topics. Classification Codes are a fast
way to precisely target a search by topic,
industry or market, geographical area, or
article type.
• Examples:
CC(1120) for Economic Policy & Planning
• This only applies to a subset of data from ABI
inform, which has these codes.
Coden
• This is use to search the coden index. A coden
is an alphanumeric code used for
shelving/ordering books and journals in
libraries, often based on a publication’s title.
• Example:
CODEN(EDUSBI)
Column / Document Column Head
• The title of a column in a periodical or
newspaper, such as “The Week in Review”.
This search field finds all articles where the
search words are in the column head.
• Examples:
COL(futures)
COL("The Week In Review")
•
company / organization
• CO() searches for an organization featured
prominently in an article,
– Associations and cooperatives
– Companies and their divisions
– Governmental organizations and olitical parties
– sports teams, music bands and churches
– native american tribes
• Comes with LCO({}) option for full matches.
publication date
• PDN() searches the publication date in
numeric format (mm/dd/yyyy).
• You can use the < and > signs to indicate dates
before and after a date, or between specific
dates.
• For example, PDN(>1/1/2002) AND
PDN(<1/5/2002) will find results from
publications with numeric dates between
January 1 2002 and January 5 2002.
dateline
• DLN() searches article Datelines. The dateline
occurs frequently in newspapers, just after the
article title, giving the date and place of the
articles origin. You can use Boolean, proximity
and truncation operators.
• DLN(lebanon pre/1 ohio)
document features
• SF() is used to search document features, such
as an index or auxiliary materials, that may be
included in or accompany a document.
• The document features indexed are:
– Graphs and Illustrations
– Maps
– References
– Tables
search by proquest handle
• ID() Searches the unique database ID for
articles and documents in ProQuest.
• Examples:
ID(356894)
document language
• LA() is used to search Language index. This
field contains the language in which the
document was published originally.
• Examples:
LA(french)
LN(french or english)
document text
• Searches only the full text of articles for your
search terms. Article abstracts are not
included in this search. AND, OR, and other
search operators are treated as such unless
enclosed in quotes.
• Examples:
TEXT(Kofi Annan)
TEXT("North Sea oil")
title searches
• TI() searches the title of a document, such as
“Seigniorage, Taxation and Myopia in EMU”
document type
• DT() is used to look for search words or
phrases in documents of a certain type.
• Examples
DT(commentary)
DT(editorial cartoon)
DT(review)
DT(arts/exhibits review)
DT(television review-no opinion)
company number
• DUNS() searches Dunn and Bradstreet trading
partner identification number. These numbers
provide a universal system for computer
identification of companies.
• Examples:
DUNS(00 695 7856)
DUN(03 575 3920)
footnote
• FOOT() searches the article footnotes for your
terms.
• Examples:
FOOT(326 U.S. 465)
volume
• Volume() searches the volume.
• Examples:
VO(100)
word count
• WC() restricts the number of words in the
article text. Use this search field to locate
articles under (<) or over (>) a certain length.
• Examples:
– WC(<1000)
– WC(>500)
– WC(>750 AND <1000)
year
• Year searches the publication year
• Examples:
YR(1986)
YR(1986-1987)
YR(>1998)
YR(<1998)
location
• GEO() is used this search field to look for
articles in which a geographical area or
location figures prominently in the text.
• Examples:
GEO(Midwest)
GN(UK)
GEO(New South Wales)
GN(Black Forest)
• Comes with LGEO({})
headnote
• HEAD() looks for words that occur in the
headnotes of an article. Headnotes are short
introductions, explanations, or comments at
the beginning of an article. They are different
from abstracts in that they do not attempt to
summarize the content of the article.
• Examples:
HEAD(escalator accidents)
HDN(digital tv)
HEAD(Global Economy)
caption texts
• CAP() This search field looks for occurrences of
search words in the caption text
accompanying article illustrations, graphs, and
photographs.
• Examples:
CAP(Chart)
(additional) index
• INDEX() locates all occurrences of search
words in any searchable index field. It does
not find occurrences in the text of the articles.
• Examples:
INDEX(starcore)
ISSN
• ISSN() looks for the eight-digit International
Standard Serials Number (ISSN), where
available. Hyphens are optional.
• Examples:
ISSN(0011-4664)
SN(00916358)
issue()
• ISSUE() is used to search Issue Number.
• Valid Forms:
ISSUE
IS
• Examples:
IS(10)
NAICIS / SIC
• NAICS() or SIC() searches for industry codes.
The NAICS/SIC code defines the economic
activity of a business as defined by the US
Census Bureau.
• Examples:
SIC(4911)
SIC(514210)
start page
• PAGE() is used for specific pages of a
publication. Useful for finding front page
articles.
• Example:
PAG(A.1) AND PUB(wall street journal) AND
PDN(1/10/2003)
person
• NAME() finds articles about a person. When
the Personal Name field is displayed in an
article citation, the life spans of historical
figures follow their names.
• You can enter the name in any format.
Searching for NA(John A Smith) will return the
same results as NA(Smith, John A).
product name
• PROD() finds articles about a specific product.
• Examples:
PROD(TiVo)
PR(harley-davidson)
journal
• JN() is used to search by a specific publication
or publications.
• Examples:
JN(Forbes)
JN(New York Times or Washington Post)
JN(computing) — retrieves all periodicals
with "computing" in their titles
section
• SECTION() finds articles that appear in a
specific section of a publication. Use the
SOURCE search field to specify a publication.
You must specify the section name exactly as
it appears in the publication.
• Examples:
SOURCE(New York Times) AND
SECTION(editorial) AND AU(Gore Vidal)
SEC(sports) AND NA(Florence Griffith
Joyner)
source type
• STYPE() is used include or exclude the
following source types from your search:
dissertations, newspapers, periodicals and
wire feeds.
• Examples:
NA(Winston Churchill) AND
STYPE(periodical)
GEO(Japan) AND STYPE(wire feed)
subject terms
• SU() is used to look for articles about a specific
subject. When searching Hoover's, this
contains information on company type.
• Examples:
SU(Music)
SU(venture capital companies)
SU(Health Care)
SU(nonprofit)
• Comes with LSU({}) facility
combined search
• When you select “Citations and abstracts”
from the drop-down menu, ProQuest searches
the following fields: AU(), NAME(), ABS() PN(),
TI(), SU(), CO(), SO(), GEO()
http://openlib.org/home/krichel
Please shutdown the computers when
you are done.
Thank you for your attention!
Download