MS Word - Department of Computer Science

advertisement
TITLE:
ON EXERCISING EVERY LINK OF A WEB-SITE
WITH MINIMAL EFFORT
AUTHOR:
VAIDYANATHAN SIVARAMAN
AFFILIATION:
SCHOOL OF COMPUTER SCIENCE AND
ENGINEERING, ANNA UNIVERSITY
CONTACT ADDRESS:
NO.49,VI BLOCK,CEG HOSTELS,
ANNA UNIVERSITY,
CHENNAI – 600025.
E-MAIL ID:
vaidy_comp@yahoo.co.uk
Copied from
http://www.hipc.org/hipc2002/2002Posters/WebAccess.doc
by H Thimbleby, April 5, 2003
ON EXERCISING EVERY LINK OF A WEB-SITE WITH MINIMAL EFFORT
ABSTRACT
In this paper, I consider the problem of traversing every link of a web-site. Since
the number of pages and links runs to hundreds and thousands (respectively) in
important web-sites, random traversal yields very poor results. I model the
problem as a Directed Chinese Postman Problem and I propose an efficient
polynomial time algorithm for solving the same. I also indicate numerous
situations in which the concept comes into picture.
1.
INTRODUCTION
A user of a web site, browsing it, may want to find the quickest route to explore
an entire site; more seriously, an employee may be required to check every link
of a site for correctness, say if it provides medical information. Since links are
often descriptive, and require an understanding of their purpose, they must be
checked manually to see whether they link to appropriate pages. However, on
following a link, the human checker is now on another page.
2.
MAPPING THE LINK-EXPLORATION PROBLEM TO DIRECTED
CHINESE POSTMAN PROBLEM
This problem can be modeled as a directed Chinese postman problem. A vertex
in the graph corresponds to page and a directed edge represents a link from one
page to another. The edges are directed because a link has a direction
associated with it. Suppose there is a link from page i to page j. This means that
by following the link from page i the browser will go to page j. But the web-site
may not have a link from page j to page i.
We construct a weighted digraph G where each edge represents a link, each
vertex represents a page and the weight assigned to each edge represents the
time it takes to follow the link. The aim is to find the optimal tour from a set of
directed tours in which all arcs participate at least once .
3.
ALGORITHM FOR SOLVING DIRECTED CHINESE POSTMAN
PROBLEM
Here two cases are considered. The given digraph may be Eulerian or nonEulerian. If the given graph is Eulerian the task is relatively simple. Otherwise, we
have to convert the given non-Eulerian digraph into an Eulerian one in an optimal
way. First the Eulerian case is considered.
CHARACTERISATION OF AN EULER DIGRAPH :
If all the vertices are balanced i.e in-degree = out-degree for all vertices, the
digraph is Euler.
a)
ALGORITHM FOR FINDING DIRECTED EULER TOUR IN AN EULER
DIGRAPH :
Consider a digraph G in which all the vertices are balanced. Now we construct a
walk starting at an arbitrary vertex v and going through the arcs of G such that no
arc is traversed more than once. We continue tracing as far as possible. Since
every vertex is balanced, we can exit from every vertex we enter; the tracing
cannot stop at any vertex but v. And since v is also balanced, we shall eventually
reach v when tracing comes to an end. If this closed walk h we just traced
includes all the arcs of G, then h is the directed Euler tour we are looking at. If h
does not include all the arcs of G, we remove from G all the arcs in h and obtain
a subgraph h’ of G formed by the remaining arcs. Since both G and h have all
their vertices in balanced condition, the vertices in h’ are also balanced.
Moreover, h’ must touch h at least at one vertex a, because G is connected.
Starting from a, we can again construct a new walk in graph h’. Since all the
vertices of h’ are balanced, this walk in h’ must terminate at vertex a; but this
walk in h’ can be combined with h to form a new walk, which starts and ends at
vertex v and has more arcs than h. This process can be repeated until we obtain
a closed walk that traverses all the arcs of G.
b)
FINDING THE OPTIMAL TOUR IN NON-EULERIAN SITUATIONS :
Suppose the given directed graph is not Eulerian. We duplicate some of the arcs
so that the resulting super-digraph is Eulerian.
DUPLICATION OF ARCS
When we duplicate an arc going from vertex i to vertex j , we will include an
additional arc from vertex i to vertex j and having the same weight as the original
one. In the WWW context, it means that we are going through the link the second
time.
OPTIMAL DUPLICATION OF ARCS
Of course, there are many ways in the arcs of a digraph can be duplicated so
that the resulting super-digraph is Eulerian. But we are interested only in the
optimal duplication of edges – duplicating edges in such a way that the sum of
the weights of the arcs that have been duplicated is minimum.
ALGORITHM
STEP I
Find out the vertices that are unbalanced i.e for which in-degree is not equal to
its out-degree. We will have two categories A and B. Those vertices which have
in-degree greater than their out-degree are placed in category A . Those vertices
which have their out-degree greater than their in-degree are placed in category
B. Note that A and B are collections because if a vertex v has in-degree = 4 and
out-degree = 2, then it will be placed in category A two times. Formally, a vertex
is placed mod( in-degree – out-degree ) times in its appropriate category.
STEP II
Using Floyds algorithm we will have to find out the shortest directed path
between every pair of vertices a and b such that a is in A and b is in B.
STEP III
Now assign the vertices in A to the vertices in B in an optimal way. The cost of
assigning a vertex a in A to a vertex b in B is the length of the shortest directed
path from a to b (which has been computed in the previous step for all such pairs
). The assignment problem can be solved by using the classic Hungarian
Method.
STEP IV
Suppose in the optimal assignment , a is assigned to b. All the arcs in the
shortest path from a to b are duplicated in the given digraph G. This procedure is
repeated for all the assignments. The result is an Euler super-digraph S of the
given digraph G.
Since the assignment is made optimally, optimal duplication of the arcs takes
place.
STEP V
In the resulting digraph S (which is Euler) we find the Euler tour using the method
used for Euler digraphs.
4.
SIGNIFICANCE OF THE ALGORITHM IN THE WWW CONTEXT
Consider a website having hundreds of pages and thousands of links. When
there is an urgent need to check whether all the links are working properly,
random choosing of links may not result in checking all the links. For instance,
the web site for Benjamin Franklin’s House [40] had 66 pages and 1191 links at
the time of writing. Its optimal CPT of 2248 links is excessive for a human to
follow unaided; without following a CPT users would do even more work and be
unable to guarantee thorough checking. Hence there is a need for an efficient
algorithm.
The above algorithm makes sure we follow links second time only when the
corresponding digraph is non-Eulerian. Also solving the resulting assignment
problem makes sure that the checking is done in minimal time. The organized
approach presented above makes sure that we do not check a link many times
unnecessarily.
Suppose that an employee wants to check every link of a web-site for
correctness because it provides some important medical information. It will be
very difficult for him to do so in the absence of a disciplined approach. The above
algorithm can be used in this context because it helps in finding the optimal way
in which the checking can be done.
5.
CONCLUSION
In the WWW context, there occur several instances in which there is a need to
explore all the links of a website. The link-exploration problem has been carefully
mapped on to directed chinese postman problem. This paper has developed a
polynomial time algorithm for solving the directed Chinese postman problem. The
proposed algorithm can be used to find out the order in which links are to be
traversed. If a disciplined approach is not adopted, more work is expended and
thorough checking is not guaranteed.
REFERENCES
1. R. K. Ahuja, K. Mehlhorn, J. B. Orlin & R. E. Tarjan, “Faster Algorithms for the
Shortest Path
Problem,” Journal of the ACM, 37(2):213–223, 1990.
2. R. K. Ahuja, T. L. Magnanti & J. B. Orlin, Network Flows: Theory, Algorithms,
and Applications,
Prentice-Hall (Simon and Schuster), 1993.
3. S. Albers & M. R. Henzinger, Exploring Unknown Environments, Digital
Systems Research Center,
SRC Technical Note 1997-014, 1997.
4. J. M. Aldous & R. J. Wilson, Graphs and Applications, The Open University,
Springer Verlag, 2000.
5. D. Applegate, R. Bixby, V. Chvatal & W. Cook, “On the Solution of Traveling
Salesman
Problems,” Documenta Mathematica, Extra Volume ICM III:645–656, 1998.
6. K. Arnold & J. Gosling, The JavaTM Programming Language Second Edition,
2nd. ed.,
Addison-Wesley, 1998.
7. A. A. Assad & B. L. Golden, “Arc Routing Methods and Applications,” in
Network Routing, M. O.
Ball, T. L. Magnanti, C. L. Monma & G. L. Nemhauser, eds., 375–483, North
Holland, 1995.
Download