Google and Amazon

LIS618 lecture 5
Thomas Krichel
2003-10-26
Structure
• Theory on query languages
• Web information retrieval
– Google “theory”, see essay by Brin and Page.
It used to be at
http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm
– Google query language, from Calishain and
Dornfest's book "Google hacks"
– Google Images, groups, odp
simple queries
• single-word queries
– one word only
– Hopefully some word combinations are
understood as one word, e.g. on-line
• Context queries
– phrase queries (be aware of stop words)
– proximity queries, generalize phrase queries
• Boolean queries
simple pattern queries
• prefix queries (e.g. "anal" for analogy)
• suffix queries (e.g. "oral" for choral)
• substring queries (e.g. "al" for talk)
• ranges (e.g. from "held" to "hero")
• terms within a distance of the query term,
usually the Levenshtein distance (i.e. the
minimum number of insertions, deletions,
and replacements)
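The Levenshtein distance named above can be computed with dynamic programming. The following is a minimal sketch; the function names and the sample vocabulary in the usage note are illustrative and not part of the lecture.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and replacements turning a into b."""
    # previous[j] is the distance between the characters of a seen so far
    # (before the current one) and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # replace ca, or keep it
        previous = current
    return previous[-1]


def within_distance(term: str, vocabulary: list[str], k: int) -> list[str]:
    """Return the vocabulary words at most k edits away from term."""
    return [word for word in vocabulary if levenshtein(term, word) <= k]
```

For instance, within_distance("hero", ["held", "hero", "herd", "zero", "heroes"], 1) keeps "hero", "herd" and "zero", since "held" and "heroes" each need two edits.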
• come from UNIX computing
• built from strings where certain characters are
metacharacters.
• example: "pro(blem|tein)s?" matches problem,
problems, protein and proteins.
• example: New .*y matches "New Jersey" and
"New York City", and "New Delhy".
• great variety of dialects, usually very powerful.
• Extremely important in digital libraries.
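The two examples on this slide can be tried out with Python's re module, which is one of the many dialects. This is a small sketch: the first pattern is written with the alternatives grouped as (blem|tein), and the word lists are made up for illustration.

```python
import re

# Grouping the alternatives keeps "pro" and the optional "s" outside the choice.
# Without the grouping, "|" would split the whole pattern in two.
word = re.compile(r"pro(blem|tein)s?")
matches = [w for w in ["problem", "problems", "protein", "proteins", "program"]
           if word.fullmatch(w)]

# ".*" stands for any run of characters, so the pattern only fixes the
# beginning ("New ") and the final letter ("y").
place = re.compile(r"New .*y")
places = [p for p in ["New Jersey", "New York City", "New Delhy", "New Delhi"]
          if place.fullmatch(p)]

print(matches)
print(places)
```

fullmatch is used so that the whole string has to fit the pattern; "program" and "New Delhi" are rejected.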
structured queries
• make use of document structures
• simplest example: when the documents
are database records, we can search for
terms in a certain field only.
• if there is sufficient structure to field
contents, the field can be interpreted as
meaning something different than the word
it contains. example: dates
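A rough sketch of both ideas on this slide: a search restricted to one field, and a field whose contents are interpreted as dates rather than as words. The records, field names and function names are hypothetical.

```python
from datetime import date

# Hypothetical records; the field names and values are made up for illustration.
RECORDS = [
    {"author": "May", "title": "Gardening notes", "issued": date(2001, 3, 15)},
    {"author": "Smith", "title": "What may go wrong", "issued": date(2003, 5, 2)},
    {"author": "Jones", "title": "Query languages", "issued": date(2003, 10, 26)},
]


def search_field(records, field, term):
    """Return the records whose text field contains the term as a word."""
    term = term.lower()
    return [r for r in records if term in r[field].lower().split()]


def search_date_range(records, field, start, end):
    """Return the records whose date field lies between start and end."""
    return [r for r in records if start <= r[field] <= end]
```

Searching for "may" in the author field finds only the first record, while the same term in the title field finds only the second. The date range query compares dates, so the word "may" plays no role in it.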
query protocols
• There are some standard languages
– Z39.50 queries
– CCL, "common command language" is a
development of Z39.50
– CD-RDx "compact disk read only data
exchange" is supported by US government
agencies such as CIA and NASA
– SFQL "structured full text query language" built
on SQL
web information retrieval
• We can think of the web as a pile of
documents called pages.
• Some "pages" are hard to index
– PDF documents
– Pictures
– Sound files
• But a majority of pages are written in
HTML
– easy to index
– have a loose structure
Google uses the structure of HTML
• Google finds the title of the page, i.e. the
contents of the <title> element.
• Google analyzes headings and large font
sizes and gives priority weight to terms
found there.
• Most importantly, Google uses the link
structure of the web to find important
pages.
classic IR and the web
• In classic information retrieval, every
document has the same importance. They
differ as to their relevance to a query.
• In classic information retrieval, a document
d is relevant if the query terms appear
relatively frequently in d rather than in
other documents.
• If a web page contains the words "Bill
Clinton sucks" and a picture, it is not a
relevant hit for "Bill Clinton".
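The lecture does not name a formula for "relatively frequently". One common way to make the idea precise is a tf-idf weight, sketched below as an assumption: the frequency of a term in a document is multiplied by the logarithm of how rare the term is in the collection. The function name and the three sample documents are made up.

```python
import math
from collections import Counter


def rank(query: str, documents: dict[str, str]) -> list[tuple[str, float]]:
    """Rank documents by a simple tf-idf score for the query terms."""
    tokens = {name: text.lower().split() for name, text in documents.items()}
    n = len(tokens)
    scores = {}
    for name, words in tokens.items():
        counts = Counter(words)
        score = 0.0
        for term in query.lower().split():
            # number of documents that contain the term at all
            df = sum(1 for other in tokens.values() if term in other)
            if df == 0 or not words:
                continue
            tf = counts[term] / len(words)  # frequency within this document
            idf = math.log(n / df)          # rarity across the collection
            score += tf * idf
        scores[name] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


DOCS = {
    "d1": "bill clinton gave a speech and clinton answered questions",
    "d2": "the senate passed a bill on farm subsidies",
    "d3": "weather report for long island",
}
ranking = rank("bill clinton", DOCS)
```

Here d1 comes first because both query terms are frequent in it and "clinton" is rare elsewhere; d3 contains neither term and scores zero. Every document is treated as equally important, which is the point the slide makes.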
Google finds important pages
• The idea is that the documents on the web
have different degrees of "importance".
• Google will show the most important
pages first.
• The idea is that more important pages
are likely to be more relevant to any query
than non-important pages.
Google's monkey
• Imagine that the web has P pages. Each
page has its own address (URL).
• Imagine a monkey who sits at a terminal.
He follows links at random, but on rare
occasions he gets bored and types in an
address of a random page out of those P.
• Will the monkey visit all pages with equal
probability?
PageRank
• The Google page rank of a page is the
probability that Google's monkey will
visit the page.
– The monkey will come frequently to pages
that have a lot of links to them.
– Once he is there, he will likely go to a page
that the important page links to.
• The structure of all the links on the entire
web reveals the importance of the page.
many PageRanks
• There is an infinite number of ways to
calculate the page rank, depending on
– how likely the monkey is to get bored.
– the probability with which the bored monkey
picks each page.
• Potentially, there is a page rank for each
user of the web. Google tries to observe
users and may be assigning personal
page ranks.
Notation
• Assume that a monkey gets bored with
probability d. If bored, it will visit page p
with probability π_p.
• For any page p, let o_p be the number of
outgoing links.
• Let l(p',p) be the number of links from
page p' to page p.
Page rank
• The page rank for a page p is
r_p = π_p d + (1-d) ∑ l(p',p) r_p' / o_p'
• In words, it is the likelihood that, if bored, the
monkey goes to the page p, plus the
likelihood that, if not bored, he gets there from
another page p'. The likelihood of getting there from
p' is the likelihood of being there, times the
number of links from p' to p, divided
by the number of outgoing links on p'.
example
• Let there be a web of four pages A, B, C, D.
• A links to B.
• B links to C.
• C links to A and D.
• D links to A.
• Let the probability to get bored be ¼ and
there be a ¼ chance to move to any page
when bored.
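The monkey's walk on this four-page web can be simulated directly. The sketch below counts how often each page is visited; the number of steps, the random seed and all names are arbitrary choices for illustration.

```python
import random

# The example web: each page maps to the pages it links to.
LINKS = {"A": ["B"], "B": ["C"], "C": ["A", "D"], "D": ["A"]}
BORED = 0.25  # probability that the monkey gets bored


def simulate(steps: int, seed: int = 0) -> dict[str, float]:
    """Return the share of visits each page receives during a random walk."""
    rng = random.Random(seed)
    pages = sorted(LINKS)
    visits = dict.fromkeys(pages, 0)
    page = rng.choice(pages)
    for _ in range(steps):
        if rng.random() < BORED:
            page = rng.choice(pages)        # bored: type in a random address
        else:
            page = rng.choice(LINKS[page])  # otherwise follow a random link
        visits[page] += 1
    return {p: count / steps for p, count in visits.items()}


frequencies = simulate(200_000)
print(frequencies)
```

The pages are not visited equally often: A, which has two incoming links, is visited most, and D, which is reached only through one of C's two links, is visited least.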
page ranks
• The following system calculates the ranks
r_A = ¼ · ¼ + ¾ (r_C / 2 + r_D)
r_B = ¼ · ¼ + ¾ r_A
r_C = ¼ · ¼ + ¾ r_B
r_D = ¼ · ¼ + ¾ r_C / 2
• Since this is fairly complicated, Google
uses an iterative approximation to
calculate the rank.
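An iterative approximation for the four-page example can be sketched as follows. This is not Google's actual implementation; it simply applies the page rank formula repeatedly, starting from equal ranks, and it assumes that every page has at least one outgoing link. The function name and the number of iterations are illustrative.

```python
# The example web: each page maps to the pages it links to.
LINKS = {"A": ["B"], "B": ["C"], "C": ["A", "D"], "D": ["A"]}


def page_rank(links, d=0.25, iterations=100):
    """Apply r_p = pi_p d + (1-d) sum l(p',p) r_p' / o_p' repeatedly."""
    pages = sorted(links)
    pi = {p: 1 / len(pages) for p in pages}  # bored monkey picks uniformly
    rank = dict(pi)                          # start from equal ranks
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # links[q].count(p) is l(q,p); len(links[q]) is o_q
            incoming = sum(rank[q] * links[q].count(p) / len(links[q])
                           for q in pages)
            new_rank[p] = pi[p] * d + (1 - d) * incoming
        rank = new_rank
    return rank


ranks = page_rank(LINKS)
print(ranks)
```

The values settle at roughly r_A ≈ 0.287, r_B ≈ 0.278, r_C ≈ 0.271 and r_D ≈ 0.164, which add up to 1 and satisfy the four equations above.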
interfaces
• The simple interface has command-driven
features that make it more powerful than
the advanced interface.
• The advanced interface is a form interface
to the query language available on the simple
interface.
• There are extensive language settings
– preferences for finding pages in a certain
language
– preferences for the language of the interface
query language I
• default Boolean AND between terms
• case insensitive
• terms can be ORed with "OR" or "|"
• adjacent terms have to be put in double
quotes
• Boolean NOT can be expressed with -
• Example: "krichel -thomas"
query language II
• * is a wildcard for any word
• +stopword requires the presence of the
stop word stopword. But the list of stop
words has not been published.
• In fact, it varies from query to query.
• There is a limit of 10 words, but a * does
not count towards the limit
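A tiny sketch of the counting rule above. How Google itself splits a query into words is not documented here, so splitting on whitespace is an assumption, and the function names are made up.

```python
LIMIT = 10  # word limit stated on this slide


def counted_words(query: str) -> int:
    """Count the words of a query, skipping the * wildcard."""
    return sum(1 for word in query.split() if word != "*")


def fits_limit(query: str) -> bool:
    """Tell whether the query stays within the word limit."""
    return counted_words(query) <= LIMIT
```

For example, the query "Copyright * The New York Times" "George Bush" counts as seven words, because the * is skipped.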
query treatment
• Google prefers pages that have the search
terms
– in close proximity
– in the same order as in the query
• Repeating a query term once adds weight
to it
• repeating it twice has no further effect
special syntax I
• intitle: find in title only, "intitle: google"
• intext: find in text only. This will exclude
occurrences of the search term in anchor or
title data. "intext: html"
• inanchor: This option requests pages, for
which there is another page that links to them
with the anchor text in the query. example:
inanchor:"list of my courses" finds my courses
page because it has a link with that text from
my homepage.
special syntax II
• cache: pages that are in the google cache,
useful if query result has nothing to do with
the query terms
cache:openlib.org/home/krichel will show
the cached version of the page.
• If you add further terms, they will be
highlighted.
site: and inurl: special syntax
• inurl: find in URL only, "inurl: help"
– can use the * as a wildcard, like in inurl:
"*.openlib.org"
• site: domain of page, "site: liu.edu"
– breaks down if a path is included
– cannot be used on its own, only with other
query expressions
daterange: special syntax
• limits the search to pages indexed
between a range of dates. Changed pages
are reindexed, unchanged pages are not
reindexed when the crawler visits a page.
• dates are expressed in the Julian period,
i.e. the number of days since noon UTC on
1 January 4713 BC of the Julian calendar.
Today (2003-10-26) is 2452939.
• example: daterange: 2452640-2452939
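The day numbers for daterange: can be computed from ordinary calendar dates. The sketch below uses the ordinal day count of Python's datetime.date and a fixed offset; the function names and the constant name are illustrative.

```python
from datetime import date

# Offset between Python's ordinal day count (day 1 is 0001-01-01 in the
# proleptic Gregorian calendar) and the Julian day number of the same date.
JDN_OFFSET = 1721425


def julian_day(day: date) -> int:
    """Return the Julian day number of a calendar date."""
    return day.toordinal() + JDN_OFFSET


def from_julian_day(number: int) -> date:
    """Return the calendar date for a Julian day number."""
    return date.fromordinal(number - JDN_OFFSET)


print(julian_day(date(2003, 10, 26)))
```

julian_day(date(2003, 10, 26)) gives 2452939, the value quoted for the day of this lecture, and from_julian_day(2452640) gives 2002-12-31, the start of the range in the example.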
mixing special syntax expressions
• The link: syntax does not mix with others.
• Other bad ideas:
– "site:openlib.org -inurl:openlib"
– "site:edu site:com"
• Things that work well
– intitle:search
– intitle:biology inurl:help
Examples
• George Bush site:nytimes.com
• "Copyright * The New York Times"
"George Bush"
• intitle:"directory * * trees"
• Botany intitle:"directory of" site:edu
• "powered by blogger" OR site:blogspot.com
• "classical music" (inurl:mailman |
inurl:listserv)
phonebook: special syntax
• A location seems to be required, e.g.
phonebook: long island university ny
• no
– wildcards
– exclusions
– or
• there is also
– rphonebook for residential
– bphonebook for businesses
stocks on google
• stocks: ticker will look up a ticker symbol
ticker at http://finance.yahoo.com
• you can find ticker symbols there
• ticker symbols are useful to find financial
information about publicly traded
companies.
google images
• it has the following special syntaxes
– intitle: searches for images on a page with a
given title, "intitle: long island university"
– inurl: searches for images in pages that have
a certain url, inurl:liu.edu
– site: restricts the search to a certain site,
should be combined with a search term like
"site:liu.edu koenig"
Google interfaces to 3rd party data
• Google groups are an interface to usenet
news
• Google directory is an interface to the
Open Directory Project.
• In both cases Google is dependent on the
quality of these underlying data sources.
usenet news
• Usenet is a collection of user-submitted notes on
various subjects that are posted to servers on a
worldwide network. Each subject collection of
posted notes is known as a newsgroup.
• A newsgroup is a discussion about a particular
subject consisting of notes written to a
networked site and distributed through Usenet.
• Newsgroups are hierarchical. Hierarchical
levels are separated by dots, for example
comp.text.tex
• alt stands for anarchists, lunatics and terrorists.
usenet history
• The idea of network news was born in 1979
when two graduate students, Tom Truscott and
Jim Ellis, thought of using UUCP to connect
machines for the purpose of information
exchange among users. They set up a small
network of three machines in North Carolina.
• UUCP is ``UNIX to UNIX copy'', a protocol that is
used to copy files between machines running
some flavor of UNIX, without the need for the IP
protocol. Usenet is older than the Internet.
decline of usenet
• essentially open to all (peer-to-peer
system)
• used by spammers for
– posting
– gathering addresses
• steady decline of quality of contributions
• steady decline of quantity of contributions
usenet worth checking out
• independent reviews of products, often
written by experts.
• Example: interpretation of Beethoven
sonatas by Wilhelm Kempff.
• Sorting by date reveals that the
newsgroup rec.music.classical.recordings
is still active. On a good day, you will find
no finer guide to records.
special syntax for usenet
• group: limits the search to postings in a certain group
• title: limits to titles of postings
• author: searches for author name or email
address
• Mixing syntaxes works well
the open directory project
• "The Open Directory Project is the largest, most
comprehensive human-edited directory of the
Web. It is constructed and maintained by a vast,
global community of volunteer editors."
• They claim a historical precedent in the
Oxford English Dictionary.
• Formerly known as ``GnuHoo'', then ``NewHoo'',
then acquired by Netscape, and called ``dmoz''.
dmoz.org
• dmoz is maintained by volunteer ``net-citizens''.
No special qualifications are required, but the
volunteers are claimed to be experts.
• There are about 30,000 volunteers (they claim).
• Powers the core directory services for the
Web's largest and most popular search
engines and portals
– Netscape Search
– Google
– HotBot
– AOL Search
– Lycos
– DirectHit
• Headquarters run by Netscape
http://openlib.org/home/krichel
Thank you for your attention!