ITIS 1210 Introduction to Web

advertisement

ITIS 1210

Introduction to Web-Based

Information Systems

Internet Research Two

How Search Engines Rank Pages &

Constructing Complex Searches

How do Search Engines Crawl?

Gathering data from the Web is like browsing:

1. Visit a page.

2. Record all the words on the page

3.

Choose a link you haven’t seen/recorded

4. Click on the link.

Repeat 8 billion times.

Crawling the Web

One person with a Web browser, following one link per second.

How long does it take to browse the surface Web (8 billion pages)?

8 billion seconds = 133 million minutes

= 2 million hours

= 93 thousand days

= 256 years

Crawling the Web

How many people would it take to crawl the surface Web in a week? If each person follows one link per second (with no sleep):

One week = six hundred thousand seconds

Six hundred thousand / eight billion = thirteen thousand

Challenges:

 Remembering where you’ve been

 Remembering where you haven’t been

Storing all the data

A (small) Server Farm

The Deep Web

Not all pages get crawled:

 Private pages on Intranets (company networks)

 Pages that people don’t want crawled

 Dynamic content pages (from databases)

Dynamic content pages make the size of the

Internet infinite!

Dynamic Content Example

 zillow.com

 Won’t be indexed

Identifying High Quality Web

Pages

Google has ranked billions of Web pages by "quality".

You enter your search terms:

UNC Charlotte HCI

Google finds the highest quality page associated with these search terms.

Google Pagerank

Pretend you're surfing the Web randomly.

To move from page to page you could:

1) type in an address ( www.

sis.uncc.edu

) includes using a bookmark

OR

2) follow a link.

Pagerank measures how likely you are to reach a particular page through random surfing (either 1 or 2).

The main idea is that links to your page from important web pages indicate that your page is important.

Computing Pagerank

(what’s the probability of getting to this page?)

A, B, C, ...

L(A), L(B), L(C),...

= Pages pointing to Q

= number of links on each page

Pagerank of Q :

R(Q) = ( 1-d ) + d · (R(A)/L(A) + R(B)/L(B) + ...) d represents the relative chance of following a link to page Q and 1-d represents the relative chance of going directly to page Q (via typing in the address or using a bookmark):

Usually these are: d = 0.9 ( 1-d ) = 0.1

Computing Pagerank

Pretend the Web has only four pages:

W X Y Z

Links:

W

X Y

W Y

Z Z

W

L(W) =1 L(X) =0 L(Y) =2 L(Z) =1

Which page has the highest “quality”?

Computing Pagerank

Links: W

X Y

W Y

Z Z

W

L(W) =1 L(X) =0 L(Y) =2 L(Z) =1

R(W) = (1-d) + d * (R(Y)/L(Y) + R(Z)/L(Z))

= 0.1 + 0.9 * (R(Y)/ 2 + R(Z)/ 1 ))

R(X) = 0.1 + 0.9 * R(W)

R(Y) = 0.1

R(Z) = 0.1 + 0.9 * (R(Y)/ 2 )

Now, solve for:

R(W), R(X), R(Y),

R(Z)

Computing Values for R(W), R(X), R(Y) and R(Z)

We could use algebra to find the values, in the same way we could solve for x and y in: x = 1 + 2 x + y y = 2 + x + 3 y

Algebraic Solution w = R(W) x = R(X) y = R(Y) z = R(Z) w = 0.1 + 0.45

y + 0.9

z x = 0.1 + 0.9

w y = 0.1

w = 0.2775

x = 0.34795

y = 0.1

z = 0.1 + 0.45

y z = 0.145

But solving for eight billion variables is hard.

Instead, we'll use fixed point iteration .

Solution by Fixed-Point Iteration

Start with initial estimates of PageRank for each page:

R(W) = 1.0 R(X) = 1.0

R(Y) = 1.0

R(Z) = 1.0

Apply equations to compute new estimates: new R(W) = 0.1 + 0.9 * (R(Y) /2 + R(Z))

= 0.1 + 0.9 * ( 1.0

/2 + 1.0

)

= 1.45

new R(X) = 0.1 + 0.9 * R(W) = 0.1 + 0.9 * 1.0

= 1.0

new R (Y) = 0.1

new R(Z) = 0.1 + 0.9 * (R(Y)/ 2 ) = 0.1 + 0.9 * ( 1.0

/2) = 0.55

Solution by Fixed-Point Iteration

Start with updated estimates:

R(W) = 1.45 R(X) = 1.0

R(Y) = 0.1

R(Z) = 0.55

Apply equations to compute new estimates: new R(W) = 0.1 + 0.9 * (R(Y) /2 + R(Z))

= 0.1 + 0.9 * ( 0.1

/2 + 0.55

)

= 0.64

new R(X) = 0.1 + 0.9 * R(W) = 0.1 + 0.9 * 1.45

= 1.405

new R (Y) = 0.1

new R(Z) = 0.1 + 0.9 * (R(Y)/ 2 ) = 0.1 + 0.9 * ( 0.1

/2) = 0.145

Solution by Iteration iteration R(W) R(X) R(Y) R(Z)

0 1.00000 1.00000 1.00000 1.00000

1 1.45000 1.00000 0.10000 0.55000

2 0.64000 1.40500 0.10000 0.14500

3 0.27550 0.67600 0.10000 0.14500

4 0.27550 0.34795 0.10000 0.14500

5 0.27550 0.34795 0.10000 0.14500

... ... ... ...

Compute new estimates from the old until the estimates stop changing. Note that this is the same answer as the traditional algebraic approach, but this way scales better.

Final Pageranks highest page X R(X) = 0.34795

page W R(W) = 0.2755

.

.

.

page Z R(Z) = 0.14500

lowest page Y R(Y) = 0.10000

Final Pageranks

0.14500

0.10000

Y

2

Z

1

W X

1 0

0.27550

0.34795

How does Google Use

Pagerank?

 You enter search terms, such as “UNC

Charlotte HCI”

Google finds all the pages that have all those words on them

Of all those pages, Google will list the ones with the highest page rank first, but…

 …other ‘magic ingredients’ are used by

Google: trade secrets of their algorithms.

Introduction

Basic queries are somewhat limited

 One or two keywords

 Simple relationships

 Limited syntax

Complex queries provide more power

 Keywords & phrase can be connected to form more complex relationships

 Search filters can be employed to limit results

Understanding Boolean Operators

Syntax

 Rules for combining simple words to form complex sentences

Search engine syntax implemented by applying Boolean logic

George Boole

 1815-1864

Understanding Boolean Operators

Boolean logic

 Keywords act as nouns

 Boolean operators act as conjunctions

They define the connections between keywords

 Illustrated with Venn diagrams

John Venn

1834-1923

Understanding Boolean Operators

All web pages containing the word cats cats

W W W

Understanding Boolean Operators

All web pages containing the word dogs dogs

W W W

Understanding Boolean Operators

All web pages containing the words cats and dogs

Searches containing both words cats dogs

Intersection of the two sets

W W W

Understanding Boolean Operators

All web pages containing the words cats or dogs

Searches containing either word cats dogs

Union of the two sets

W W W

Understanding Boolean Operators

All web pages containing the words cats and not dogs

Searches containing one word but not the other cats dogs

Exclusion of the dogs set

W W W

Understanding Boolean Operators

All web pages containing the words dogs and not cats

Searches containing one word but not the other cats dogs

Exclusion of the cats set

W W W

Understanding Boolean Operators

Boolean operators

 AND

 OR

 NOT

Instruct the engine on how to combine keywords to produce results

Always use capital letters to avoid confusion with and, or, not as keywords

Understanding Boolean Operators

AND

 All these keywords must be on the Web page

OR

 These keywords may or may not be on the

Web page

 At least one of them must be

NOT

 None of these keywords can be on the Web page

Understanding Boolean Operators

Default operator

 Some engines have a default Boolean operator

 Usually AND

 Might be OR

Some engines may search for multiple words as phrases

Understanding Boolean Operators

Boolean operators may be

 Allowed on main page

 Confined to Advanced search pages

Some engines use symbols instead

 + for AND

 - for NOT

 No space between sign and word:

+solar +energy -windmill

Narrowing Searches with AND

AND

 Limits results

 Forces inclusion of a stop word

Indicates that all keywords must be found on Web page

Adding more ANDed keywords limits search more

Results should be more relevant because the keyword list has expanded

Narrowing Searches with AND

Example:

 “solar energy association” AND Portland

Solar energy association

Portland

W W W

Narrowing Searches with AND

Example:

 Henry +I same as “Henry I”

Henry I

W W W

Expanding Searches with OR

OR expands results

 Useful if you didn’t get enough returns from your first search

 The more keywords you add, the more results you should get

Every page returned must have at least one of the keywords on it

 Good to use when you have synonyms

Expanding Searches with OR

Example:

 oregon OR northwest oregon northwest

W W W

Restricting Queries with AND NOT

AND NOT excludes the keyword that follows NOT

Limits your search

 Produces fewer results

Useful if first search returns irrelevant results

 Use AND NOT to get rid of those results

Restricting Queries with AND NOT

Equivalent forms:

 cats AND NOT dogs

 cats AND-NOT dogs

 cats NOT dogs

 cats –dogs

Restricting Queries with AND NOT

Example:

 “solar energy association” AND portland

AND NOT maine portland

Solar energy association maine

Multiple Boolean Operators

Boolean operators allow you to focus a search

Any logical combination of operators is allowed

If it makes sense when spoken like a sentence it’s probably OK to use

Order of operations is usually left to right

 Use parentheses to organize terms

Multiple Boolean Operators

Bad example:

 constitution +american OR “united states” constitution american

“united states”

Multiple Boolean Operators

Good example:

 constitution +(american OR “united states”) constitution american

“united states”

Download