FABLE: Tracking Down Web Pages Whose URLs Have Changed
Jingyuan Zhu, Jiangchen Zhu, Vaspol Ruamviboonsuk, and Harsha V. Madhyastha
University of Michigan
Abstract— User experience on the web crucially relies on
users being able to follow links included on any page to visit
other pages. However, when the URL with which one must refer to a specific web page changes (e.g., when the domain name
of the site hosting the page changes or the page is moved to a different subdirectory on that site), many links to this page across
the web continue to use its old URL. Users’ attempts to access
this page by following these links will often result in erroneous
responses even though the page still exists.
To address this problem, we present FABLE. Given a URL
which no longer works, FABLE attempts to identify an alternative functional URL to the same page: an alias of the broken
URL. Our high level insight is that archived copies of web pages
provide a rich set of information to find alias links. However,
not all pages are archived; even for those which are, straightforward approaches for using archived copies to find aliases result
in both poor coverage and efficiency. By employing a variety of
techniques to address these challenges, FABLE finds aliases for
30% of links in a diverse corpus of 1,500 broken URLs.
1 Introduction

The web is a vast, highly interconnected repository of information and services. Rather than every web page being self-sufficient on its own, pages routinely link to other pages, both on the same site and on other sites. The links on any page point users to information and services that are closely related to what the page offers.

A problem with this hyper-connected structure of the web is that it is common for links on the web to become dysfunctional over time. The URL pointing to a specific page may not work a few years after it was created, either because the page was moved to a different URL (when the site hosting it was reorganized) or because the page no longer exists. When users visit pages that link to such broken URLs, they end up lacking appropriate context or are robbed of relevant associated services; e.g., the background material linked from a news article or the reviews for a product may be inaccessible.

Existing solutions to this problem take one of two approaches; both have fundamental shortcomings.
• Proposals for a content-centric Internet [22, 41] argue that every page ought to be referred to by a more robust identifier than its URL, e.g., a cryptographic hash of the page's content. However, since these proposals not only require all legacy pages to be rewritten but also mandate changes to DNS, routing, etc., they have seen little adoption.
• Alternatively, web archival systems (such as the Internet Archive's Wayback Machine [5]) enable users to use the archived copy of a page when the link to the page no longer works [2, 8, 20]. But, when the page still exists, this solution is less than desirable for users; the page's content might have been updated since it was last archived, and services offered on the page, such as add-to-cart, will not function on the archived copy. Sites hosting such pages also lose the opportunity to show ads and attract users to their site. Furthermore, since it is an expensive (and practically infeasible) proposition to crawl and save the entire web, many pages are never archived.
In this paper, we pursue a different tack: if a URL no longer works but the page it was pointing to still exists, why not try to find and use the new URL at which the page is now available? Being able to identify such aliases of broken URLs is useful in a variety of contexts: a web page's author can replace broken links included on her page, a user can do the same to fix her bookmarks which no longer work, or a website's provider can add redirections from the old URLs for its pages to the corresponding aliases. However, in all of these cases, the challenge in finding new URLs to pages stems from the lack of information and scale; e.g., a user looking to fix her bookmarks may only have every page's title saved, whereas the provider of a large site may have too many broken links to patch across all of its pages.

Therefore, to automate the identification of functional aliases for dysfunctional URLs, we present FABLE (Finding Aliases for Broken Links Efficiently). Our key contribution lies in demonstrating how to leverage web archives not to serve requests for broken URLs, but to find aliases for them. FABLE uses archived page content in three ways.

First, in cases where the archived copy of a page exists, we show how to efficiently search for that page on the web. Finding the archived copy of a page is itself not straightforward, as web archival systems might have stored the page under a different URL (e.g., with a different set of query parameters) than the broken URL for which we are attempting to find an alias. Once found, picking the right archived copy is important: older copies for a URL are likely to have stale content, and newer copies might have been gathered after the URL was broken. Lastly, we show how to use a page's archived content to formulate relevant search queries and the order in which to issue these queries, so as to find as many aliases as feasible with the least amount of work.

Second, to account for the incompleteness of web archives and search indices, we leverage the graph structure of the web. We observe that, when the old URL for a page no longer works, other pages on the same site have often been rewritten to link to the page's new URL. Therefore, we efficiently search within web archives to identify pages which previously linked to a target broken URL, and then determine which (if any) of the links that exist today on these pages is an alias for the URL.
Third, we exploit the fact that, when a site is reorganized,
there is often a pattern in how old URLs map to their corresponding new locations. Using the aliases found for URLs
on a particular site using our first two techniques, we apply
programming by example [18] to infer potential aliases for
the remaining broken URLs on that site.
Our evaluation on a large, diverse corpus of broken links
shows that FABLE is able to find aliases for 30% of these
URLs, with a precision of 81%. Moreover, our analysis of
three university websites shows that, among all the requests
to these sites which are for URLs that were previously functional but no longer are, the aliases found by FABLE could
help fix 48% of these requests to broken URLs.
2 Background and Motivation

A URL, which originally pointed to a particular web page, may not work now for a variety of reasons. In this section, we first list these various reasons and describe how we detect them. We then present numbers highlighting the prevalence of broken links on the web and their impact. Finally, we describe our observations about the characteristics of broken links which motivate our work.

2.1 Detecting Broken URLs

Given the URL for a web page, the first step in loading the page is to fetch the page's HTML. In this paper, we focus on those URLs where this first step fails. Even if the HTML for a page exists, other resources on the page (images, CSS stylesheets, etc.) may be missing, but that is a separate problem outside the scope of this paper.

Fetching a page's HTML involves several steps: a DNS lookup to map the hostname in the page's URL to an IP address, a TCP (+ TLS) connection setup to that IP address, and an HTTP request followed by a response. When a URL is no longer functional, any of these steps may fail. Failures of the DNS lookup or connection setup are easy to detect. Detecting the failure of the last step is, however, tricky, since web servers do not always return an error (i.e., a 404 'Not Found' response) when the requested page does not exist. Often, web servers instead redirect the client to an unrelated page (e.g., the site's home page). Many servers even return an error page with a 200 'OK' status code; e.g., when a website that is no longer being paid for is accessed, the domain registrar may serve ads in response.

To detect when HTTP redirections and responses with a 200 status code are indicative of an error (which we refer to as a "soft-404"), we adapt a technique from prior work [11]. Given a URL u to test, we obtain a new URL u′ which is identical to u, except that the suffix in u following the last occurrence of the delimiter '/' is replaced by a randomly generated string of the same length; e.g., given u of the form https://foo.com/path/file10, u′ could be https://foo.com/path/ckvm92. Since u′ is a randomly generated URL (and hence, invalid), we can infer that u is dysfunctional if requests for u and u′ elicit the same response. We tweak this technique by Bar-Yossef et al. [11] in two ways: 1) for URLs with a numeric string in the middle (e.g., article IDs in the URLs for news articles), we also consider a u′ in which we replace this token, since this string may dictate the server's response, and 2) if the response to u contains a canonical URL [9] within the HTML, we find that to almost always indicate a non-erroneous response. While our detection of soft-404s is not perfect (e.g., the above method does not work when the URL to a site's homepage is broken), we ensure that we only err in marking potentially broken URLs as not broken, not the other way around.
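To make the soft-404 probe concrete, the following minimal Python sketch illustrates the technique described above. It is our own illustration, not FABLE's implementation: the similarity argument stands in for the page-comparison test of Section 5, and the 0.8 threshold mirrors the one reported there.

    import random
    import string

    import requests

    def random_sibling_url(url: str) -> str:
        """Replace the token after the last '/' with a random string of equal length."""
        head, _, tail = url.rpartition("/")
        junk = "".join(random.choices(string.ascii_lowercase + string.digits,
                                      k=max(len(tail), 6)))
        return f"{head}/{junk}"

    def looks_like_soft404(url: str, similarity, threshold: float = 0.8) -> bool:
        """Probe for a soft-404: a URL whose response matches that of a
        deliberately invalid sibling URL is likely serving an error page."""
        real = requests.get(url, timeout=30)
        probe = requests.get(random_sibling_url(url), timeout=30)
        # Landing on the same final URL (e.g., both redirected to the site's
        # home page) is a strong signal that the original URL is dysfunctional.
        if real.url == probe.url:
            return True
        # Otherwise, compare the two response bodies for near-identical content.
        return similarity(real.text, probe.text) >= threshold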
2.2 Prevalence and Impact of Broken URLs

To demonstrate how commonplace it is for links from the
past to no longer be functional, we compile a corpus of URLs
archived over time by the Wayback Machine [5]. We consider 5 years—1999, 2004, 2009, 2014, and 2019—and in
each year, we add to our corpus 100,000 URLs at random
out of all those which were archived by Wayback Machine
for the first time in that year; we use the year in which a URL
was first archived as an approximation of the year in which
that URL was created. We crawl the HTML for each of these
URLs to test whether they are broken. We remove from our
dataset all pages that either require authentication (ones that
return a 403 ‘Forbidden’ status code), or are forbidden to
be crawled by the site’s robots.txt file. We also crawl
only one URL every 15 seconds to minimize the likelihood
of triggering any site’s rate limits.
Many URLs go defunct over time. In each of the 5 years, Table 1 shows the fraction of URLs from that year which no longer work and the type of error we observe. As the numbers show, a sizeable fraction of previously functional URLs are now broken, and the chances of a URL becoming dysfunctional grow over time.

                         1999   2004   2009   2014   2019
    % DNS error          21.9   19.2   10.9    6.4    0.7
    % Connection error   11.0    9.6    9.4    5.6    2.7
    % 40x/50x error      40.7   39.0   36.4   25.4    4.5
    % Soft-404           19.5   17.5   17.1   13.8   11.7
    % Total broken       93.1   85.3   73.8   51.2   19.6

Table 1: Out of 100,000 URLs in each year, % which do not work now, broken down by the reason.
Many pages link to broken URLs. A URL being rendered dysfunctional is by itself problematic, e.g., when this URL has been bookmarked by a user. But the impact of a broken URL is further compounded by the fact that many web pages may have linked to this URL. Visitors of each such page will be robbed of the appropriate context or follow-up information/services that the page's author meant to point them to. Figure 1 shows an example; visitors of the home page of https://jewishweek.timesofisrael.com/ will find that the link to the "Letters to the Editor" is broken.

Figure 1: An example broken outgoing link.
We quantify this outsized impact of broken URLs as follows. Out of all the still functional URLs from the corpus
spanning 20 years described above, we pick 2,500 at random from each of the five years considered. We crawl the
HTML for each of these 12,500 pages and identify all the
URLs that are linked from the HTML. We focus on the <a>
tags included in the page’s main HTML, thereby ignoring
links to ads and other content linked from iframes embedded
in the page. Some of the links are to URLs that were never
functional (e.g., misspelt URLs or absolute links mistakenly
included as relative links); we find these to be rare, but do
our best to detect and exclude them from our data.
Figure 2 shows that, irrespective of the year, a large fraction of pages – more than 40% in every year – have at least one broken outgoing link. Of course, the farther back we go in time, the larger the fraction of outgoing links that are typically broken. Significantly, many outlinks are broken even on pages from a year ago.

Figure 2: For pages in each year, fraction of the links included on every page which no longer work.
Users issue requests to broken links. As evidence that
users are significantly impacted by broken links on the web,
we consider the server side logs from three university websites. Since many of the error-inducing requests in these logs
correspond to hackers probing the servers for vulnerabilities,
we focus on that subset of requests corresponding to URLs
which were successfully archived by Wayback Machine at
least once, i.e., these URLs were valid at some point in time;
we refer to these as “once-valid URLs.”
            Duration   Once-Valid URLs       Requests for these URLs
            of trace   Total #   % Broken    Total #   % Broken
    Site1   20 days    5363      14.3        77.9K     10.6
    Site2   7 months   5350       7.9        1.37M      0.8
    Site3   5 days     3231      45.6        40.3K     17.0

Table 2: Summary of logs from three university websites.

Table 2 presents a summary of these datasets. On Site2, we see that a very small fraction (0.8%) of the requests to once-valid URLs elicit an error from the server. However, on the other two sites, 10% and 17% of the requests to once-valid URLs now result in an error or redirect to an incorrect page, even though all of these URLs did work previously. Based on the logs available to us, we are unable to determine how users end up accessing these URLs (e.g., via their bookmarks or via links on other pages). Nonetheless, this data illustrates the impact of broken links on users.

2.3 Coping with Broken Links
To address the problem of dysfunctional URLs, we make two
observations about them.
Broken URLs, not defunct sites. First, for most of the domains in which we find at least one broken URL, we observe
that there are typically other URLs in the same domain that
still work. In each of the five years we consider, among all
the domains from which we have at least 100 URLs from that
year in our corpus, we pick 1000 domains at random. Figure 3 plots the distribution across these domains of the fraction of URLs in that domain which are broken. Domains in
which all URLs no longer work are in the minority in every
year, particularly in the last decade. So, when a URL pointing to a particular page is no longer functional, it is rarely the
case that the site hosting that page does not exist.

Figure 3: In each year's corpus, for 1000 domains in which we have at least 100 URLs, fraction of URLs that are broken.
Pages that broken URLs were pointing to often still exist. Second, for a sizeable fraction of broken URLs (e.g.,
many of the broken URLs on the three university websites
we studied above), we observe that the page the URL was
originally pointing to still exists at an alternate URL, i.e.,
the broken URL is merely due to a reorganization of the site
hosting it, not the result of the page being deleted. Website
providers often restructure the URLs for their pages (e.g.,
3
No. of broken URLs
Not found alias
70
Found alias
Search engine,
Archive,
Live web
60
50
40
Fetch & Crawl
30
20
URLs from
Server log,
Broken URLs
Bookmarks, Identify
etc…
broken URLs
10
smartsheet.com
planetc1.com
onlinepolicy.org
namm.org
mobilemarketingmagazine.com
knect365.com
jfku.edu
imageworksllc.com
gamespot.com
filecart.com
eclipse.org
ebates.com
decathlon.co.uk
commonsensemedia.org
activenetwork.com
aaascreensavers.com
0
(Alias, Confidence)
pairs
Figure 5: Overview of FABLE.
is one of scale; applying the manual process that we used in
the previous section is not an option.
Therefore, with FABLE, we seek to automate the identification of functional aliases for broken web links at scale. As
shown in Figure 5, FABLE takes a set of dysfunctional URLs
as input and outputs an alternate URL for those inputs for
which it is able to locate the page on the web. Since some
of the aliases identified by FABLE can be incorrect (for reasons outlined later), for each alias it identifies, FABLE also
outputs its confidence about this alias being correct.
Our design of FABLE is guided by three objectives.
• Coverage. We seek to find aliases for as many of the input
URLs as feasible. Ideally, if the page that a URL was
previously pointing to still exists, FABLE should be able
to find the page’s new URL.
• Precision. We strive to ensure that the aliases found by
FABLE are correct. If many of the identified aliases are
suspect, they will need to be manually inspected before
they can be used, which is hard to do at scale.
• Efficiency. Lastly, we aim to minimize the amount of
work – the number of pages crawled, search queries issued, etc. – that FABLE must do to locate alias links.
Figure 4: For 16 domains, number of broken URLs in this domain for which we do or do not find a functional alias.
when a site switches to a different domain or begins using
a content management system such as Wordpress or Drupal). In doing so, they break many of the previously functioning URLs to pages on their site. For example, in Figure 1, the “Letters to the Editor” link is broken because the
path /letters-to-the-editor/ on the site has been
renamed to /topic/letters-to-the-editor/.
To illustrate our second observation, we present our manual analysis of 16 domains chosen at random from the ones
in which a) we have at least one broken URL in our corpus,
and b) not all URLs are broken. For each dysfunctional URL
in these domains, we manually attempted to find the page
originally pointed to by this URL. We used Wayback Machine’s archived copies of these pages and attempted to discover these pages via web search and via exploration of these
16 sites. Figure 4 shows the subset of broken URLs on each
of these sites for which we were able to find an alias. Cumulatively, we were able to find functional aliases for 32% of
the broken URLs from these sites. As we show later, this is
only a lower bound as the aliases uncovered by our manual
investigation are not exhaustive.
3
FABLE
4
Design
Our high-level approach in building FABLE is to leverage
web archives such as the Wayback Machine to find aliases
for broken URLs. For every URL they have archived, web
archives provide a rich set of information: what was the
content at this URL before it stopped working, which other
URLs did this URL redirect to in the past, which pages
linked to this URL previously, etc. For every broken URL
given to it as input, FABLE leverages this historical context
to identify a corresponding alias which is functional today.
Overview
Our results from the previous section show that, though broken URLs are common on the web, the pages that these
URLs were pointing to often still exist. As discussed earlier
in the introduction, we envision that the existence of aliases
for broken URLs can be leveraged in a variety of settings.
However, two factors make it challenging to find aliases
for broken URLs. First, there is often limited information
available about the page that the URL was previously pointing to, e.g., only a page’s URL and title are stored in a user’s
bookmarks and a site’s provider can only rely on the anchor text associated with broken URLs linked from its pages.
Second, even in cases where one has extraneous information
about a page for which they are looking to identify the new
URL (e.g., a web developer who previously linked to this
page or the admin of the site hosting the page), the challenge
4.1
Challenges
A straightforward approach to use archived page content to
find aliases could work as follows. Given a broken URL,
we can look up the archived copy of this URL and use the
content in that copy (e.g., the page’s title and first paragraph)
to formulate a query to submit to a web search engine such
as Google or Bing. We can then crawl each of the search
results which are in the same domain as the broken URL, and
compare the content currently at that page with the archived content. If any of these prove to be a close match, we succeed in finding the new URL for the target page.
However, this strawman approach overlooks a number of
practical realities. First, as mentioned earlier, web archives
are understandably incomplete. So, what if there is no
archived content for the broken URL for which we are attempting to find an alias (as is the case for the “Letters to the
Editor” page linked from the example in Figure 1)? Second,
even if the URL is archived, which of the archived copies for
it to use, given that systems like Wayback Machine repeatedly recrawl URLs over time? Lastly, finding an alias (if it
exists) is not a given even if the URL is archived and we pick
the right archived copy for it: (1) indices maintained by even
popular web search engines are far from complete [39], (2)
even the latest archived copy of a page may significantly differ from what is currently on the page (e.g., the last archived
copy for the broken URL in Figure 1 would be missing all
the letters posted after the URL stopped working), and (3)
many web pages have little textual content or have titles that
are shared with other pages on the site, making them not
amenable for discovery via search engines.
In the rest of this section, we describe how our design of FABLE addresses these challenges. Table 3 summarizes the various techniques used.

Web search:
• If a URL is not archived, identify query arguments which do not impact page content, to find if the page was archived under an alternate URL (§4.2.1)
• For any URL, use the last archived copy with neither redirections nor an error code response, as that is likely the latest archived content before the URL was broken (§4.2.2)
• To search the web efficiently, use the title and content in the archived copy judiciously to formulate search queries, and order the queries in decreasing likelihood of finding a match (§4.2.3)

Match backlinks:
• To efficiently find pages which previously linked to the input URL, perform a guided crawl within the web archive, starting from archived copies of the URL and of the homepage of its site (§4.3.1)
• Correlate historical and current backlinks to find a functional alias (§4.3.2)

Pattern-based inference:
• Apply programming by example on the aliases found for broken URLs on a site to predict aliases for other broken URLs on that site (§4.4)

Table 3: Summary of the techniques used in FABLE to find aliases using three different approaches.
4.2 Using Archives to Search

We begin by improving upon the strawman described above in using web search engines to find aliases. Our first two techniques improve the number of aliases we can find with this approach, and the third increases efficiency in doing so.

4.2.1 Finding Archived Copies

First, we observe that, even if there exists no archived content for a specific broken URL given as input to FABLE, the same page might have been archived under a different URL. This is because a page can often be referred to via a number of URLs. For example, a URL can include a query string comprising several arguments in the form of (key, value) pairs, and the server hosting this page may return the same content irrespective of the values for some of these arguments; even completely excluding some or all of these arguments often has no impact on page content (e.g., affiliate IDs included in the URLs for product pages on Amazon).

Therefore, for every input URL which has a query string and has not been archived, FABLE attempts to identify which of the arguments in the URL impact the page's content. For every argument arg in the URL, FABLE looks at all URLs with archived copies which a) include this argument, and b) are identical to the input URL except for the query string. It determines that the value of arg does not matter if the archived content for any of these similar URLs matches the archived content for a different similar URL which either a) has a different value for arg, or b) does not include that argument. We defer to Section 5 the details of how we compare any two pages and determine if their content matches.

After it tests the significance of each of the arguments in an input URL, FABLE discards those arguments which it finds do not impact page content. It then uses the archived content for any URL which contains the remaining portion of the input URL and whose additional arguments are all deemed to be inconsequential to page content.
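A sketch of this argument-significance test follows. The helpers archived_variants (which yields the query arguments and archived HTML of URLs identical to the input except for the query string) and content_matches (the page-comparison test of Section 5) are hypothetical stand-ins for FABLE's internals:

    from urllib.parse import urlsplit, parse_qsl

    def significant_args(url, archived_variants, content_matches):
        """Return the query arguments of `url` that appear to affect page content."""
        args = dict(parse_qsl(urlsplit(url).query))
        variants = list(archived_variants(url))  # [(query_args_dict, archived_html), ...]
        significant = set(args)
        for arg in args:
            for q1, html1 in variants:
                if arg not in q1:
                    continue
                for q2, html2 in variants:
                    # A variant that drops `arg` or uses a different value, yet
                    # serves matching content, implies `arg` does not matter.
                    if (arg not in q2 or q2[arg] != q1[arg]) and content_matches(html1, html2):
                        significant.discard(arg)
                        break
        return significant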
4.2.2 Picking the Right Archived Copy

Next, FABLE determines which of a URL's archived copies to use; web archival systems repeatedly recrawl a page in order to capture how that page's content changes over time.

In FABLE, for any URL, we use the latest archived copy in which a 200 status code response was observed. We make this choice as a balance between two considerations: the content on a page is likely to have significantly changed since when it was first archived, whereas the latest archived copy for any broken URL is likely to reflect the same erroneous response we obtain when visiting that URL today. As a result, neither the first nor the last archived copy of a page is likely to help FABLE in formulating a search query with which it can locate the page. By using the last archived copy with a 200 status code response, FABLE attempts to use the latest copy which is likely to be non-erroneous.

Note that the chosen archived copy might still be significantly stale, and the URL might already have stopped working by the time this copy was collected; recall from Section 2.1 that some sites serve erroneous responses with a 200 status code. However, as we show later in Section 6, our strategy for picking which archived copy of a broken URL to use significantly increases the odds of finding its alias, despite being limited by the rate at which archival systems recrawl pages and by how long it has been since the URL became dysfunctional, both of which are outside our control.
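One concrete way to implement this selection is via the Wayback Machine's public CDX index, which lists every capture of a URL along with its recorded status code. The sketch below is our own illustration of the policy described above, not FABLE's code:

    import requests

    CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"

    def last_good_copy(url: str):
        """Return the Wayback Machine URL of the newest capture of `url` whose
        recorded status code is 200 (i.e., neither a redirect nor an error)."""
        rows = requests.get(
            CDX_ENDPOINT,
            params={"url": url, "output": "json", "filter": "statuscode:200"},
            timeout=60,
        ).json()
        if len(rows) < 2:  # the first row of the JSON output is the header
            return None
        header, captures = rows[0], rows[1:]
        ts = header.index("timestamp")
        latest = max(captures, key=lambda row: row[ts])
        # The 'id_' flag requests the raw capture, without the Wayback toolbar.
        return f"https://web.archive.org/web/{latest[ts]}id_/{url}"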
4.2.3 Formulating Search Queries
There are many ways to formulate a search query from a
page’s archived content. The challenge is to maximize efficiency without sacrificing coverage, i.e., minimize the number of search queries we issue and pages we crawl while finding most of the aliases discoverable via web search.
We address this tradeoff in FABLE based on two characteristics of web pages. First, prior work [35] has observed that a
few carefully chosen words from a page’s content can serve
as a robust “lexical signature” of the page. Second, though a
page’s title can change over time, we find that this happens
less often than changes to the page’s content.
Given these properties, FABLE issues three search queries
to find the alias for any particular broken URL; in all cases,
it limits the search results to be in that domain which users
are redirected to when visiting the homepage of the URL’s
domain now (e.g., for URLs in espn.go.com, we look for
results in espn.com because an HTTP request for the former redirects to the latter). First, we search using the page's
title in the archived copy, but without requiring the match to
be exact (i.e., we do not include the title within quotes). This
allows for minor changes to the page’s title between when
the archived copy we use was collected and now. If the first
query fails to result in a match, FABLE then issues a second
query using the most significant words on the page; we describe how we pick these words in Section 5. If an alias has
still not been found, FABLE searches using the page’s title
again, but requiring an exact match this time; given that we
look at a limited number of search results for efficiency, this
last query helps find an alias in cases where the first query
returns too many results. Later, in Section 6, we show that
this combination of queries issued in this order offers a good
balance between efficiency and coverage.
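The resulting query schedule fits in a few lines. In this sketch, site_search(query, domain) is a hypothetical wrapper around a search API that returns candidate result URLs restricted to domain, and matches(candidate) is the archived-vs-current content comparison of Section 5:

    def find_alias_via_search(archived_title, top_words, domain, site_search, matches):
        """Issue FABLE's three queries in decreasing likelihood of a match."""
        queries = [
            archived_title,         # 1) title, inexact match tolerated
            " ".join(top_words),    # 2) most significant words (top TF-IDF terms, Section 5)
            f'"{archived_title}"',  # 3) title as an exact phrase
        ]
        for query in queries:
            for candidate in site_search(query, domain):
                if matches(candidate):
                    return candidate
        return None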
4.3 Correlating Backlinks

The techniques we have described thus far improve coverage and efficiency compared to the strawman approach. However, any approach that uses archived content to search the web suffers from a common set of limitations: the page may not have been archived under any of its URLs, limited textual content on the page may hinder the formulation of relevant search queries, and search engines may not have indexed the page.

To sidestep these downsides, our second approach finds aliases by locating backlinks which have been rewritten. On any page that previously linked to a URL which is now broken, if that URL has now been replaced, we can map the broken URL to its alias by comparing the archived and current copies of that page. Figure 6 shows an example.

Figure 6: An example where FABLE is able to find the alias for a broken URL by finding a page that previously linked to this URL and by comparing the current and archived versions of that page. (In the archive, http://www.filecart.com/flash-slideshowdesigner links to the target http://www.filecart.com/download/professional-slideshow-maker; on the live web, that page redirects to https://filecart.com/get/flash-slideshowdesigner, which links to the alias https://filecart.com/tag/professional-slideshow-maker.)
4.3.1 Finding Backlinks Efficiently
Determining which URLs are linked from a given page is
easy: parse the page’s content; the opposite – finding which
pages link to a given URL – is not. Moreover, it is not useful
for us to find which pages link to a particular URL today.
Instead, we need to find a broken URL’s backlinks from the
past, in order to check if the URL has been replaced by its
alias on any of those pages. Since the site in which a broken
URL is hosted is more likely to have replaced this URL on
its pages, we focus on finding internal backlinks.
Crawling within web archive. Given a broken URL, FABLE attempts to find its historical backlinks in two ways.
First, we leverage the fact that recursively following links,
starting from the outlinks on a page, often help find a link
back to that page, e.g., a professor’s web page may link to
the homepage of their department’s website, which in turn
links to a listing of all faculty, which contains a link to the
original page. So, in cases where archived copies of a broken
URL exist (and yet, we were unable to find its alias using
web search), FABLE performs a crawl within the web archive
starting from this URL attempting to find a loop back to the
URL, like in the above example. Second, if the previous step
fails to find a backlink or if no archived copies of the broken
URL exist, FABLE performs a crawl within the web archive
starting from the homepage of the URL’s domain.
With either strategy, we crawl once using the first archived
copy of every page and again using the last 200 status code
copy of every page. Among the broken URLs we consider in
our evaluation for which we find a backlink, crawls starting
from the URL and from the homepage of its site respectively
find historical backlinks for 79% and 60% of these URLs.
Guided crawling for efficiency. A key consideration in
performing these crawls is to minimize the number of pages
we need to fetch and inspect. As we show later in Section 6,
naively performing a depth-first or breadth-first crawl results
in needing to visit a large number of pages, often without being able to find a backlink within a limited crawling budget.
To be efficient, FABLE’s crawls are guided by its goal:
finding a backlink for a specific URL. Specifically, when
searching for backlinks for a target URL t, among all outlinks seen on all pages crawled so far, we pick which link u
to preferentially pursue based on a combination of two factors. The first factor is the edit distance between the two
URLs t and u; a page in a nearby directory on the site is
more likely to contain a link to t. The second factor is the
similarity between the anchor text associated with the link
to u (i.e., the text underlying the link) and the representative
text for t (which is either extracted from the archived copy
for t, if available, or the tokens from the URL t). The latter
simulates human behavior of pursuing links which appear
most closely related to the target page.
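The guided crawl amounts to a best-first search over archived pages. The sketch below is a simplified rendition: the additive combination of the two factors, the helper functions (archived_outlinks, edit_distance, text_similarity), and the default budget are illustrative assumptions rather than FABLE's tuned logic:

    import heapq

    def guided_backlink_search(start_url, target_url, target_text,
                               archived_outlinks, edit_distance, text_similarity,
                               budget=50):
        """Best-first crawl within the web archive, looking for a page that
        links to `target_url`. Links in nearby directories (low URL edit
        distance) and with anchor text similar to the target's representative
        text are pursued first."""
        def score(url, anchor_text):
            # Lower is better: nearby pages with related anchors rank first.
            return edit_distance(url, target_url) - text_similarity(anchor_text, target_text)

        frontier = [(0.0, start_url)]
        seen = {start_url}
        while frontier and budget > 0:
            _, page = heapq.heappop(frontier)
            budget -= 1
            links = archived_outlinks(page)  # [(url, anchor_text), ...] from the archived copy
            if any(url == target_url for url, _ in links):
                return page                  # found a historical backlink
            for url, anchor in links:
                if url not in seen:
                    seen.add(url)
                    heapq.heappush(frontier, (score(url, anchor), url))
        return None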
Note that, in cases where FABLE finds a backlink from the
web archive, but the URL of the backlink page is itself not
functional today, FABLE recursively attempts to find the alias
for that backlink before matching links as described above.
4.3.2 Matching Links Accurately

Within a limited crawling budget, FABLE finds as many backlinks for a broken URL as it can. On any page which previously linked to the URL, we need to determine which, if any, of the links currently on the page is the alias. We could crawl the pages at each of those links and compare them with the archived content for the broken URL. However, recall that our use of backlinks to find aliases is driven by the absence of archived copies for many URLs.

Given a page p which previously linked to URL u, which is now broken, we declare a URL a now linked from that page as u's alias in two cases. When the anchor texts associated with u and a are unique (i.e., no other link had the same anchor text as u in the archived copy of p, and no other link on p has the same anchor text as a today), we declare a to be u's alias if the two anchor texts are similar (§5). If there is no anchor text (e.g., if u and a are linked from images) or if the anchor texts are not unique on page p, we compare the text surrounding these links; specifically, we compare the text in the DOM node closest to the respective <a> tags.
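In code, the anchor-text case of this matching rule might look as follows; similar is the text-similarity test of Section 5, and the DOM-based fallback for missing or non-unique anchors is only indicated in a comment:

    def find_rewritten_links(old_links, new_links, similar):
        """Map a broken URL's anchor on an archived page to today's alias.

        `old_links`: {url: anchor_text} parsed from the archived copy of page p.
        `new_links`: {url: anchor_text} parsed from the current copy of page p.
        Returns (broken_url, alias) pairs matched via unique, similar anchors."""
        def unique_anchors(links):
            # Keep only links whose anchor text appears exactly once on the page.
            counts = {}
            for text in links.values():
                counts[text] = counts.get(text, 0) + 1
            return {url: text for url, text in links.items() if counts[text] == 1}

        matches = []
        for old_url, old_anchor in unique_anchors(old_links).items():
            for new_url, new_anchor in unique_anchors(new_links).items():
                if new_url != old_url and similar(old_anchor, new_anchor):
                    matches.append((old_url, new_url))
                    break
        # Links with missing or non-unique anchors would instead be compared
        # via the text in the DOM node closest to each <a> tag (not shown).
        return matches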
4.4 Leveraging Patterns in URL Changes
Our last strategy for finding a broken URL’s alias exploits the
aliases found by FABLE using the web search and backlink
based strategies. Our high level insight here is that, on any
site, typically a large number of URLs are rendered dysfunctional over time, not just one. When this occurs, many of
the broken URLs on a site are often similar, and so are their
aliases; Table 4 presents examples from a couple of sites.

cars.com:
  Old: www.cars.com/gmc/acadia/2014/  →  New: www.cars.com/research/gmc-acadia-2014
  Old: www.cars.com/ford/expedition/2014/  →  New: www.cars.com/research/ford-expedition-2014/
sbnation.com:
  Old: www.sbnation.com/ncaa-football/teams/Arizona (Title: Arizona Wildcats Schedule, ...)  →  New: www.sbnation.com/college-football/teams/arizona-wildcats
  Old: www.sbnation.com/ncaa-football/teams/Oregon (Title: Oregon Ducks Schedule, ...)  →  New: www.sbnation.com/college-football/teams/oregon-ducks

Table 4: On a couple of sites, examples showing the patterns in how broken URLs map to their aliases.
Therefore, for any given site, we seek to learn the patterns
underlying the mapping from the old to the new URLs for
its pages. To do this, we pursue programming by example
(PBE) [18], instantiated using Microsoft Excel’s Flash Fill
capability [7]. The PBE approach entails providing a number
of input to output examples, using which the output can be
predicted for a new input.
In our setting, it is insufficient to simply provide all broken
URLs on a site and their known aliases as input. First, since
there are often multiple patterns in mapping broken URLs to
their aliases on a given site, we need to cluster similar broken
URLs and invoke Flash Fill on each cluster separately. Second, as seen in the examples in Table 4, we need to provide
a bunch of auxiliary information about each page as input,
because new URLs often include the page’s title, date of creation [6], etc. We extract this information from the archived
copies of the broken URL, when available.
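Flash Fill itself is invoked as an external PBE engine, so what we can sketch here is the clustering step that precedes it. The grouping key below (a path template with digits and the trailing slug collapsed) is our own illustrative heuristic, not FABLE's actual clustering criterion:

    import re
    from collections import defaultdict
    from urllib.parse import urlsplit

    def cluster_broken_urls(url_alias_pairs):
        """Group (broken URL, known alias or None) pairs into clusters that are
        likely to share one rewrite pattern, so a PBE engine (e.g., Flash Fill)
        can be invoked once per cluster with that cluster's solved examples."""
        clusters = defaultdict(list)
        for url, alias in url_alias_pairs:
            path = urlsplit(url).path
            # Template the path: collapse digits and page-specific slugs so
            # URLs that differ only in those tokens land in the same cluster.
            template = re.sub(r"\d+", "<num>", path)
            template = re.sub(r"[^/]+$", "<slug>", template)
            clusters[template].append((url, alias))
        return clusters

    # Each cluster's solved (input, output) examples, plus auxiliary page data
    # (title, creation date), would then be handed to the PBE engine, which
    # predicts aliases for the cluster's unsolved URLs.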
4.5 Confidence in Results
In all of the techniques employed by FABLE to find aliases,
it needs to use some thresholds to determine similarity, e.g.,
when comparing the current and archived version of a page
and when comparing the anchor text across different versions
of a page. Given that page content is prone to change over
time, focusing only on exact matches would greatly handicap
FABLE. Therefore, in setting the thresholds of similarity, we
have to live with the chance that some of the aliases found
by FABLE will be false positives.
To help minimize the number of aliases found by FABLE
that users of the system need to manually inspect, we have it
compute and output its confidence in each alias that it identifies. The intuition underlying our model is that, for any broken URL, the rate of flux in the page's content and how FABLE finds the URL's alias are indicative of the likelihood that the alias is correct. For example, the more often a page's content changes, the more likely it is that the last archived copy for that page's old URL is significantly stale. Similarly, a match between the archived copy and the current version of a page based on the page's content is a stronger indicator than a match based on the page's title.
In FABLE, we train a Naive Bayes classifier to compute the confidence in the alias for a given broken URL.
This classifier takes five features as input: 1) the error type
(DNS/connection error, 40x/50x status code, or soft-404), 2)
how often the page is updated, approximated by the number of archived copies on Wayback Machine, 3) how the
alias was discovered (by Search, Backlink Matching, or Inference), 4) how the alias was matched to the URL (by title,
content, or backlink), and 5) whether the broken URL is not
a homepage but the alias is. In our evaluation, we find that
the confidence computed by this model strongly correlates
with the correctness of FABLE’s output, and that the second
and fourth features prove to be the most important.
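As a minimal sketch of such a model with scikit-learn, assuming the five features have been encoded as small categorical integers (the encodings and the tiny training set here are placeholders, not FABLE's data):

    import numpy as np
    from sklearn.naive_bayes import CategoricalNB

    # Feature encoding (illustrative): error type, update-frequency bucket
    # (from the number of archived copies), discovery method, match method,
    # and the homepage flag from Section 4.5. Labels come from manually
    # checked aliases (correct = 1, incorrect = 0).
    X_train = np.array([
        # err, updates, discovered, matched, homepage_flag
        [0, 2, 0, 1, 0],
        [1, 0, 1, 2, 0],
        [2, 1, 2, 0, 1],
        [0, 1, 0, 0, 0],
    ])
    y_train = np.array([1, 1, 0, 1])

    model = CategoricalNB()
    model.fit(X_train, y_train)

    # The predicted probability of class 1 serves as the confidence in an alias.
    confidence = model.predict_proba(np.array([[1, 2, 1, 1, 0]]))[0, 1]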
5 Implementation
Search APIs. We use the APIs for both Google custom
search [3] and Bing web search [10]. For any query, the
former limits the number of results initially returned to 10
and the latter limits to 50. Though both APIs allow us to ask
for additional results, we find negligible utility in doing so;
examining the top 10 results suffices to find almost all aliases
discoverable via web search.
Removal of boilerplate content. Modern web pages usually contain boilerplate [27] (e.g., ads, recommended stories,
and site-specific templates), which has little to do with the
core content and changes very frequently. Therefore, before
comparing the archived content of a page and the current
content on a potential alias, we remove the boilerplate from
both copies. To do so, we use Chrome’s DOM Distiller [1],
which powers Chrome’s reader mode.
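DOM Distiller ships inside Chrome rather than as a standalone library; for experimentation outside the browser, a comparable boilerplate-stripping step can be approximated with the readability-lxml package (a different extractor than the one used in the paper):

    from readability import Document  # pip install readability-lxml
    import lxml.html

    def strip_boilerplate(html: str) -> str:
        """Keep only the main readable content of a page, dropping ads,
        recommendation widgets, and site templates before comparison."""
        main_html = Document(html).summary()
        return lxml.html.fromstring(main_html).text_content()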
Text similarity and representative words. FABLE has to
compute the similarity between two text fragments to compare page content, to compare page titles, and to compare anchor texts of two links. In all of these cases, we use TF-IDF
(term frequency-inverse document frequency) [36], which is
a common metric to summarize a piece of text. When comparing two TF-IDF vectors, we find that a cosine similarity
of 0.8 or higher typically results in a good match.
We use the TF-IDF model to also choose the most representative words from a page’s content that FABLE uses to
issue a web search query; we use the 7 words with the highest TF-IDF weights [23].
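Both uses of TF-IDF can be illustrated with scikit-learn; the 0.8 threshold and the 7-word count come from the text above, while the background corpus used for IDF estimation is a placeholder:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def similar(text_a: str, text_b: str, threshold: float = 0.8) -> bool:
        """TF-IDF cosine similarity, as used to compare titles, content, and anchors."""
        tfidf = TfidfVectorizer().fit_transform([text_a, text_b])
        return cosine_similarity(tfidf[0], tfidf[1])[0, 0] >= threshold

    def representative_words(page_text: str, corpus: list[str], k: int = 7) -> list[str]:
        """The k words with the highest TF-IDF weight, used to build search queries."""
        v = TfidfVectorizer()
        v.fit(corpus + [page_text])  # IDF estimated over a background corpus
        weights = v.transform([page_text]).toarray()[0]
        vocab = v.get_feature_names_out()
        return [vocab[i] for i in weights.argsort()[::-1][:k]]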
6 Evaluation

We evaluate FABLE from three perspectives: 1) its coverage in finding aliases for broken URLs and the accuracy of these aliases as a function of confidence, 2) its efficiency in finding aliases, and 3) the utility of the aliases it finds. Our evaluation shows that:
• Across 1.5K randomly sampled broken URLs spread across years and exhibiting varied error types, FABLE finds 80% more aliases than feasible with the best strawman approach that utilizes web archives.
• Among aliases for which FABLE's confidence value is 90% or more, over 90% of these aliases are correct.
• For the same crawling budget, FABLE is able to find 35% more aliases by matching backlinks compared to breadth-first/depth-first style crawling.
• Across 48 domains which each have many broken URLs, FABLE is able to infer 41% of aliases without needing to crawl any pages or perform any web searches.
• On two of the three sites for which we have server-side logs, FABLE finds aliases for 25–45% of broken URLs, which can help prevent 48% of the erroneous responses.

6.1 Datasets

We use three datasets to evaluate FABLE: the manually analyzed URLs from 16 domains discussed in Section 2, a corpus of URLs which we use to test FABLE's web search and backlink based alias discovery, and another corpus of URLs for evaluating its inferred aliases. We describe the latter two datasets in this section.

Dataset for Search and Backlink. From all the broken outgoing links in Figure 2, we compile a corpus of 1.5K URLs by sampling 500 URLs at random for each of the following error types: 1) "DNS&Other" includes URLs that prevent the client from issuing an HTTP request (e.g., DNS error, TCP/TLS connection failure), 2) "4/5xx" includes URLs which result in an HTTP response with a 4xx or 5xx status code, and 3) "Soft-404" includes all other URLs (which redirect to an unrelated page or return an error with a 200 status code response). To prune out abandoned domains in which there is no possibility of finding an alias, we only consider URLs from those domains in which we have seen at least one working URL. We consider at most 10 URLs per domain to increase the diversity of our corpus.

Dataset for Inference. FABLE's PBE-based inference of aliases works best on sites which have many broken URLs. Therefore, in contrast to our previous corpus, which limits the number of broken URLs per domain to 10, we assemble a separate corpus of broken URLs to evaluate FABLE's ability to infer aliases. This dataset comprises 1.4K URLs from 48 domains, where each of these domains is one in which FABLE is able to find an alias for at least one broken URL using its search and backlink based methods.

6.2 Strawman Approaches

We evaluate FABLE's coverage by comparing the number of URLs for which it is able to find an alias against that for two strawman approaches. We mimic how one might try to find the alias upon encountering a broken link. The two approaches we consider are for scenarios with and without access to web archives.
Without archive. In the absence of web archives, the only
metadata available for a URL that is now broken is either the
page’s title (e.g., if the URL was bookmarked) or the anchor
text on a page which links to that URL. Since the title is more
representative than the anchor text, the best one can do is to
use the title to query web search engines for potential aliases,
restricting the results to the site to which the URL’s domain
now redirects. Just as one would only have access to the
page’s title as it was when the page was bookmarked, for
every URL in our corpus, we use the title from that archived
copy of the URL which is closest to the year in which we
first see a link to this URL in Wayback Machine. To find an
alias from web search results, we use the same technique to
match page titles as used in FABLE (§ 5).
With archive. When one does have access to archived copies of web pages, we assume that one can use web search to find aliases similar to FABLE. The only differences we consider are that the strawman uses the first archived copy of any broken URL, and that it looks up the Wayback Machine only using the URL for which it is attempting to find an alias.
6.3 Coverage
Manually analyzed URLs. For the data shown in Figure 4,
we had manually found aliases for 194 broken URLs. FABLE
is able to find 158 (i.e., 81%) of these. FABLE’s high coverage rate on known aliases suggests that it can find functional
aliases for most broken URLs that have one. FABLE also
finds 33 additional aliases which were missed by our manual
analysis, highlighting the utility of automating the discovery
of aliases. The automation does come with the risk of false
positives: FABLE finds 39 incorrect aliases.
Dataset for Search and Backlink. For the 1,500 broken URLs in this dataset, Figure 7 shows the number of broken URLs of each of the three error types for which aliases are discovered using the different approaches: FABLE's web search and backlink based methods, as well as the two strawman approaches. The strawman without access to web archives, which only finds matches based on page titles, uncovers the fewest aliases. Even the strawman which utilizes archived page content finds 18% fewer aliases than FABLE finds using only web search ('Search'); FABLE is able to use search engines better by uncovering cases where the page corresponding to a broken URL has been archived under a different URL, and by utilizing fresher information about a page than that available in the page's first archived copy. FABLE discovers a similar number of aliases using backlinks ('Backlink') as it does using web search, but the union of the two methods ('Search+Backlink') is significantly larger. Backlink-based alias discovery can find aliases for pages which have not been archived, and it outdoes Search in finding aliases for pages whose content keeps changing over time. In all, Search and Backlink together find aliases for 454 of the 1,500 broken URLs, 80% more than the best strawman approach.

Figure 7: For 1,500 broken URLs, number of URLs for which each approach finds an alias, broken down by the error types that the URLs exhibit.

For FABLE's discovery of aliases using web search and backlinks, Table 5 breaks down the different reasons why it is unable to find an alias. The dominant bottleneck for Search is the absence of any archived copy in the Wayback Machine. Even when a page has been archived, the lack of a match in the search results is another big hindrance; this occurs because the title and most significant words are too generic, the page is updated frequently and the archived copy is no longer representative, or the page is now absent from the search index of both Google and Bing. For Backlink, the inability to find backlinks is the primary limiting factor. When FABLE does find a backlink, the URL of that backlink too is often broken, a consequence of the same site reorganization which rendered the URL for which we are trying to find an alias dysfunctional.

    Search               #URLs     Backlink                #URLs
    No archived copy       594     No backlink found         696
    No search results       56     Backlink URLs broken      291
    No match on results    539     No match on backlinks     109
    Alias found            311     Matched link broken       170
                                   Alias found               306

Table 5: Breakdown of reasons for FABLE's inability to find aliases using search and backlink based methods.

Dataset for Inference. In this dataset comprising 1,391 URLs across 48 domains, Figure 8 plots the number of aliases on each site that FABLE is initially able to discover using the combination of its search and backlink based methods, as well as the additional ones that it uncovers thereafter via inference. The benefit of FABLE's PBE-based inference varies across sites, but overall, it increases the fraction of broken URLs in this dataset for which we are able to find aliases from 36% to 43%. As we show later, many of the aliases that we discover here using Search and Backlink can also be inferred.

Figure 8: On 48 sites with many broken URLs, aliases found via search and backlinks and additional aliases inferred thereafter.
6.4 Precision
We evaluate FABLE’s precision from two aspects: the overall accuracy of aliases found by FABLE, and the relation between its accuracy and the confidence values that it outputs.
First, to measure the overall precision, we sample 100
URLs for which FABLE claims to have found an alias and
manually check their correctness. We find 81 out of 100 to
be correct. False positives are mainly due to matching titles
and link anchors, where the text is very short, yet contains
uncommon words. Better approaches for comparing short
snippets of text, compared to the TF-IDF based comparison we currently employ in FABLE, could eliminate many
of these false positives.
Second, we find that the confidence values output by FABLE are strongly correlated with its accuracy. We use data
from the 100 URLs that we manually examined above to
train FABLE’s confidence model. Then, we manually label
the aliases found for another 100 URLs, for which Figure 9
plots true/false positive counts under different ranges of confidence values. The higher FABLE’s confidence is in an alias
it found, the more likely it is to be correct, e.g., over 90% of
the aliases with a confidence in the range [90, 100] are correct. Therefore, confidence values output by FABLE can help
its users determine which of the aliases it finds deserve closer
examination and which ones do not.

Figure 9: Counts for true/false positive aliases on the test set under different confidence ranges output by the confidence model.
6.5 Efficiency

We attempt to minimize the amount of work that FABLE does in three ways: 1) the order in which it executes search queries, 2) the order in which it crawls the archived pages on a site in order to find a backlink, and 3) its use of programming-by-example to infer aliases. We evaluate the utility of each of these design decisions.

Web search. As mentioned in Section 4, FABLE issues three web search queries to find alias candidates: using the page's title without an exact match, using the most significant words on the page, and using the page's title with an exact match. Of all the aliases found using web search in our Dataset for Search and Backlink, these three forms of querying find 68%, 50%, and 28% of the aliases, respectively, justifying the order in which FABLE issues these queries. Moreover, each of the three queries uncovers some aliases that the other two do not, making it necessary for FABLE to use them in combination.

Backlink discovery. We compare the order in which FABLE crawls archived pages in order to search for a backlink with breadth-first and depth-first crawls (BFS and DFS). Among all aliases that FABLE finds via backlinks, we consider the ones where FABLE finds a backlink by crawling starting from the archived copy of the broken URL for which it is attempting to find an alias. Searching for a backlink by starting from the archived copy of the homepage in the broken URL's domain is likely to show similar results.

For each URL, we limit the number of pages crawled to the maximum number that FABLE visits during its backlink search for any of these URLs. Compared to FABLE's crawling strategy, Figure 10 shows that BFS and DFS, which do not differentiate among the outgoing links on a page, exceed the crawling limit for 35% and 41% of the URLs, respectively.

Figure 10: For aliases found using backlinks, number of pages crawled to search for backlinks before finding an alias.

Inferring aliases. Earlier, we showed that FABLE is able to infer many aliases that cannot be found via search and backlinks (Figure 8). However, for many of the aliases discoverable by the latter two methods, it is unnecessary for FABLE to spend effort searching the web and crawling web archives. Once some of these aliases have been uncovered, the remaining ones can be inferred based on this subset.

We evaluate the utility of FABLE's inference of aliases in improving efficiency by iteratively considering the URLs in the Dataset for Inference. Whenever we consider a URL, we first attempt to find an alias via web search and matching backlinks. If we find an alias, we check if all the aliases found so far can help infer an alias for any of the remaining URLs for which we have not yet found an alias. Figure 11 shows that, on many sites, FABLE's inference is able to uncover more aliases without any additional work. In aggregate, 41% of all the aliases that FABLE discovers on the 48 sites are inferred.

Figure 11: For sites with many broken URLs, aliases found using Search and Backlink and the increase enabled by Inference, when FABLE attempts to infer aliases after the discovery of each new alias.
6.6
6.6.2
Earlier in Section 6.3, we evaluated the fraction of broken
URLs for which FABLE is able to find aliases. Now, using
the server-side logs for the three university sites we described
earlier in Section 2, we examine the utility of FABLE in helping fix requests to broken URLs.
If all requests to broken URLs on these sites were to be
redirected to the corresponding aliases discovered by FA BLE , Table 7 presents the expected impact. Like we did
earlier, of all the URLs accessed on these sites, we focus
on the ones which we can confirm were “once-valid” (i.e.,
URLs which were successfully archived by Wayback Machine in the past). We exclude requests to each site’s home
page since the home page accounts for a sizeable fraction of
requests, but the URL to this page is unlikely to be broken.
First, we see that, while FABLE discovers aliases for 17–
45% of broken URLs, the fraction of requests to broken
URLs that these aliases can help fix is greater (25–48%).
This suggests either that FABLE is more likely to find aliases
for broken URLs that are more frequently accessed, or that
aliases are more likely to exist for popular broken URLs.
Second, the broken URLs for which FABLE finds aliases account for a sizeable fraction of the requests to all once-valid
URLs – 5% on Site1 and 8% on Site3 – not just the ones
to broken URLs; recall from Section 2 that Site2 has few
broken URLs to begin with. Therefore, redirecting users to
aliases found by FABLE when they visit broken URLs can
Utility of Aliases
In the last part of our evaluation, we demonstrate the utility
of aliases discovered by FABLE from two perspectives.
6.6.1
Fixing Broken Requests
Alias vs. Archive
When a user visits a broken URL, serving an archived copy
of that URL is undesirable for several reasons if that page is
still available at an alternative URL. First, the URL may not
have been archived before it was broken. Even if it was, the
content on the page may have significantly changed since
it was last archived or the page may include services (e.g.,
login, add to cart) that are unusable via the archived copy.
Moreover, the page’s provider might lose out on ad revenue
and the ability to attract the user to their site by recommending other relevant pages.
Out of all the URLs for which FABLE found correct
aliases, we pick 100 at random and manually compare Wayback Machine’s archived copies of these pages to the page
found today at the alias URL. Table 6 shows that, for almost
all of these 100 broken URLs, reliance on archived copies
would prove undesirable for at least one of the above reasons; per Table 6, "Service not usable" is the most common.
Reason                      # URLs
No archived copy                10
Stale content                   41
Service not usable              78
Loss of recommendations         22
Loss of ad revenue              20
Total                           94

Table 6: Across 100 broken URLs for which FABLE finds an alias, breakdown of reasons why reliance on web archives to serve requests for these URLs would be undesirable.

To check that these findings are not specific to our dataset, we also examined the archives of emails sent to the Interesting People mailing list [4] and picked 100 at random out of all the URLs linked from these emails which are now broken. We find that the Wayback Machine lacks archived copies for 20 of these 100 URLs, higher than the 10% seen in our dataset.
7 Discussion
Our work in developing FABLE leads to a number of interesting directions for future work.
Why do URLs break? When the domain name for a site
changes, clients looking to access the old URLs on this site
may not even succeed in performing a DNS lookup. But,
when websites are reorganized – e.g., when they switch from
one content management system to another, such as WordPress to Drupal – it is unclear why redirections cannot automatically be added from the old to the new URL for every
page. This warrants further investigation.
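As one illustration of what such automation could look like, a site that records each page's old and new URL while migrating could mechanically emit HTTP redirections. The sketch below is hypothetical (including the example URL patterns) and merely shows that the bookkeeping required is modest.

    # Hypothetical sketch: emit one HTTP 301 redirection per page from a
    # mapping recorded during a site reorganization (example URLs invented).
    RECORDED_MOVES = [
        ("/2019/05/10/launch-event/", "/node/42"),   # e.g., WordPress -> Drupal
        ("/category/news/", "/taxonomy/term/7"),
    ]
    REDIRECTS = dict(RECORDED_MOVES)

    def handle_request(path):
        """Return (status, location) for a requested path."""
        if path in REDIRECTS:
            return 301, REDIRECTS[path]   # permanent redirect to new URL
        return 404, None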
Persistent URLs. To account for the brittleness of URLs as identifiers of content, we envision a service that works as follows. For any developer who wishes to link to a URL u on their page, we could have them link instead to, say, https://persist.url/u. If u is still functional, the service at persist.url can simply redirect to u. If not, and if FABLE is able to identify an alias u′ for u, it can redirect to u′ instead. If no alias is known but the URL u has been archived, it can redirect to the last non-erroneous archived copy of u. In the case where none of the previous checks match, the service can still redirect to u and have the user deal with the erroneous response returned by u's domain.
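A minimal sketch of the resolution logic such a service could run is below; is_functional, lookup_alias, and last_good_archived_copy are hypothetical helpers wrapping a liveness check, FABLE, and the Wayback Machine, respectively.

    # Sketch of persist.url's resolution logic; the three helpers are
    # hypothetical wrappers around a liveness check, FABLE, and the
    # Wayback Machine.
    def resolve(u, is_functional, lookup_alias, last_good_archived_copy):
        """Return the URL that https://persist.url/<u> should redirect to."""
        if is_functional(u):
            return u                           # u still works: pass through
        alias = lookup_alias(u)                # alias u' found by FABLE, if any
        if alias is not None:
            return alias
        archived = last_good_archived_copy(u)  # last non-erroneous snapshot
        if archived is not None:
            return archived
        return u                               # fall back; user sees u's error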
Impact of broken URLs on the web's availability. Our
datasets from three university websites showed that, at least
in two of them, requests to URLs which were once valid but
are now broken account for a non-negligible fraction of the
site’s traffic. An interesting question for future exploration
is: what is the corresponding impact of broken URLs on the
web’s availability in aggregate? Given the variety of ways
in which a URL can be rendered dysfunctional (§ 2.1), one
would need to analyze logs collected from a representative
sample of web clients in order to answer this question.
8 Related Work
Detecting and studying broken URLs. Many tools and algorithms have been developed to detect web decay. FABLE
utilizes and improves upon Bar-Yossef et al.’s approach [11]
to detect soft-404 pages. Parking sensors [40] identifies broken URLs by detecting parking domains.
Prior work has also extensively studied the evolution of
the web graph and web pages, both with respect to the persistence of links [11, 26, 28, 37] and stability of page content [13, 16, 26, 32]. These studies confirm our observation
that a sizeable portion of URLs from the past are now invalid.
These prior efforts have also observed that flux in page content is common, which is why it is often insufficient to serve
archived copies in response to requests for broken URLs.
Fixing broken links. To address the breakage of URLs due to the reorganization of websites, many efforts have focused on extracting a "lexical signature" from a page's content as a robust hyperlink [23–25, 34, 35] for the page. When a URL no longer works, its lexical signature can be used to find the page. We do the same in FABLE, but additionally rely on the page's title to search for a page when its last archived copy is stale. FABLE also augments this reliance on web search to discover aliases with techniques for matching backlinks and inferring new aliases.

Another line of solutions, such as Opal [20] and Wayback Machine's browser extension [8], serve the archived copies of pages to clients when they encounter broken links. As we show in Section 6.6, such solutions are not ideal for users and website providers alike.

Duplicate page detection. Detection of a web page's duplicates has been widely researched [15, 21, 29, 38]. The focus has primarily been on defining whether two pages/documents are near-duplicates, in order to detect plagiarism and phishing, save storage by de-duplication, etc. Applying prior techniques from this line of work to find duplicates of archived copies is insufficient to find aliases for broken URLs because not all URLs are archived and archived copies are often stale.

NDN and CCN. Named data networking [12, 30, 41, 42] and content centric networking [14, 22] are alternate network architectures where data is retrieved by specifying the information it contains, not the host or IP address on which it is stored. If this vision were to become reality, web pages would have more robust identifiers than URLs. However, the adoption of these proposals requires modifications to all layers of the network stack. In comparison, FABLE provides a solution that is compatible with the legacy web and Internet.

Text-editing by example. FABLE utilizes MS Excel's Flash Fill functionality [7], which is based on a string programming language [17]. Many others have used programming by example [19, 31, 33] to synthesize programs which automate text transformation given input-output example pairs. As far as we know, FABLE is the first to apply this functionality to URL transformations.
In contrast to all of this prior work, our contributions lie
in systematically finding alternate URLs at which the pages
originally pointed to by broken URLs now reside.
9 Conclusion
Since it is typically assumed that broken URLs on the web
point to pages which no longer exist, users look up the
archived copy when they encounter such a URL. In contrast
to this common wisdom, we found that a sizeable fraction
of broken URLs are a consequence of website reorganization and that the pages they were pointing to still exist. To
help users and web providers take advantage of this observation, we presented FABLE to accurately and efficiently find
the new URL for any page whose old URL no longer works.
Based on the promise shown by our analysis of three websites, we believe the aliases found by FABLE can help significantly improve the web’s availability at large.
References
[1] chromium/dom-distiller: Distills the DOM. https://github.com/chromium/dom-distiller.
[2] Cloudflare and the Wayback Machine, joining forces for a more reliable web. https://blog.archive.org/2020/09/17/internet-archive-partners-with-cloudflare-to-help-make-the-web-more-useful-and-reliable/.
[3] Custom Search JSON API — Programmable Search Engine. https://developers.google.com/custom-search/v1/overview.
[4] Interesting People mailing list archives. https://ip.topicbox.com/groups/ip.
[5] Internet Archive: Wayback Machine. https://archive.org/web/.
[6] Newspaper3k: Article scraping & curation — newspaper 0.0.2 documentation. https://newspaper.readthedocs.io/en/latest/.
[7] Using Flash Fill in Excel. https://support.microsoft.com/en-us/office/using-flash-fill-in-excel-3f9bcf1e-db93-4890-94a0-1578341f73f7.
[8] Wayback Machine - Chrome Web Store. https://chrome.google.com/webstore/detail/wayback-machine/fpnmgdkabkmnadcjpehmlllkndpkmiak.
[9] What is a canonical URL? How to master the rel=canonical tag. https://blog.alexa.com/canonical-url/.
[10] What is the Bing Web Search API? — Azure Cognitive Services, Microsoft Docs. https://docs.microsoft.com/en-us/azure/cognitive-services/bing-web-search/overview.
[11] Z. Bar-Yossef, A. Z. Broder, R. Kumar, and A. Tomkins. Sic transit gloria telae: Towards an understanding of the web's decay. In Proceedings of the 13th International Conference on World Wide Web, pages 328–337. ACM, 2004.
[12] S. H. Bouk, S. H. Ahmed, D. Kim, and H. Song. Named-data-networking-based ITS for smart cities. IEEE Communications Magazine, 55(1):105–111, 2017.
[13] J. Cho and H. Garcia-Molina. The evolution of the web and implications for an incremental crawler. Technical report, Stanford University, 1999.
[14] A. Detti, N. Blefari Melazzi, S. Salsano, and M. Pomposini. CONET: A content centric inter-networking architecture. In Proceedings of the ACM SIGCOMM Workshop on Information-Centric Networking, pages 50–55, 2011.
[15] D. Fetterly, M. Manasse, and M. Najork. On the evolution of clusters of near-duplicate web pages. In Proceedings of the First Latin American Web Congress (LA-WEB 2003), pages 37–45. IEEE, 2003.
[16] D. Fetterly, M. Manasse, M. Najork, and J. L. Wiener. A large-scale study of the evolution of web pages. Software: Practice and Experience, 34(2):213–237, 2004.
[17] S. Gulwani. Automating string processing in spreadsheets using input-output examples. ACM SIGPLAN Notices, 46(1):317–330, 2011.
[18] D. C. Halbert. Programming by Example. PhD thesis, University of California, Berkeley, 1984.
[19] W. R. Harris and S. Gulwani. Spreadsheet table transformations from examples. ACM SIGPLAN Notices, 46(6):317–328, 2011.
[20] T. L. Harrison and M. L. Nelson. Just-in-time recovery of missing web pages. In Proceedings of the Seventeenth Conference on Hypertext and Hypermedia, pages 145–156, 2006.
[21] M. Henzinger. Finding near-duplicate web pages: A large-scale evaluation of algorithms. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 284–291, 2006.
[22] V. Jacobson, M. Mosko, D. Smetters, and J. Garcia-Luna-Aceves. Content-centric networking. Whitepaper, Palo Alto Research Center, pages 2–4, 2007.
[23] M. Klein and M. L. Nelson. Revisiting lexical signatures to (re-)discover web pages. In International Conference on Theory and Practice of Digital Libraries, pages 371–382. Springer, 2008.
[24] M. Klein, J. Shipman, and M. L. Nelson. Is this a good title? In Proceedings of the 21st ACM Conference on Hypertext and Hypermedia, pages 3–12, 2010.
[25] M. Klein, J. Ware, and M. L. Nelson. Rediscovering missing web pages using link neighborhood lexical signatures. In Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, pages 137–140, 2011.
[26] W. Koehler. Web page change and persistence—a four-year longitudinal study. Journal of the American Society for Information Science and Technology, 53(2):162–171, 2002.
[27] C. Kohlschütter, P. Fankhauser, and W. Nejdl. Boilerplate detection using shallow text features. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, pages 441–450, 2010.
[28] S. Lawrence, F. Coetzee, E. Glover, G. Flake, D. Pennock, B. Krovetz, F. Nielsen, A. Kruger, and L. Giles. Persistence of information on the web: Analyzing citations contained in research articles. In Proceedings of the Ninth International Conference on Information and Knowledge Management, pages 235–242, 2000.
[29] G. S. Manku, A. Jain, and A. Das Sarma. Detecting near-duplicates for web crawling. In Proceedings of the 16th International Conference on World Wide Web, pages 141–150, 2007.
[30] S. Mastorakis, A. Afanasyev, and L. Zhang. On the evolution of ndnSIM: An open-source simulator for NDN experimentation. ACM SIGCOMM Computer Communication Review, 47(3):19–33, 2017.
[31] A. Miltner, K. Fisher, B. C. Pierce, D. Walker, and S. Zdancewic. Synthesizing bijective lenses. Proceedings of the ACM on Programming Languages, 2(POPL):1–30, 2017.
[32] A. Ntoulas, J. Cho, and C. Olston. What's new on the web? The evolution of the web from a search engine perspective. In Proceedings of the 13th International Conference on World Wide Web, pages 1–12, 2004.
[33] P.-M. Osera and S. Zdancewic. Type-and-example-directed program synthesis. ACM SIGPLAN Notices, 50(6):619–630, 2015.
[34] S.-T. Park, D. M. Pennock, C. L. Giles, and R. Krovetz. Analysis of lexical signatures for improving information persistence on the World Wide Web. ACM Transactions on Information Systems (TOIS), 22(4):540–572, 2004.
[35] T. A. Phelps and R. Wilensky. Robust hyperlinks cost just five words each. Technical report, University of California, Berkeley, Computer Science Division, 2000.
[36] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523, 1988.
[37] D. Spinellis. The decay and failures of web references. Communications of the ACM, 46(1):71–77, 2003.
[38] M. Theobald, J. Siddharth, and A. Paepcke. SpotSigs: Robust and efficient near duplicate detection in large web collections. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 563–570, 2008.
[39] A. Van den Bosch, T. Bogers, and M. De Kunder. Estimating search engine index size variability: A 9-year longitudinal study. Scientometrics, 107(2):839–856, 2016.
[40] T. Vissers, W. Joosen, and N. Nikiforakis. Parking sensors: Analyzing and detecting parked domains. In Proceedings of the 22nd Network and Distributed System Security Symposium (NDSS 2015). Internet Society, 2015.
[41] L. Zhang, A. Afanasyev, J. Burke, V. Jacobson, K. Claffy, P. Crowley, C. Papadopoulos, L. Wang, and B. Zhang. Named data networking. ACM SIGCOMM Computer Communication Review, 44(3):66–73, 2014.
[42] M. Zhang, V. Lehman, and L. Wang. Scalable name-based data synchronization for named data networking. In IEEE INFOCOM 2017 - IEEE Conference on Computer Communications, pages 1–9. IEEE, 2017.