FABLE: Tracking Down Web Pages Whose URLs Have Changed

Jingyuan Zhu, Jiangchen Zhu, Vaspol Ruamviboonsuk, and Harsha V. Madhyastha
University of Michigan

Abstract— User experience on the web crucially relies on users being able to follow links included on any page to visit other pages. However, when the URL with which one must refer to a specific web page changes (e.g., when the domain name of the site hosting the page changes or the page is moved to a different subdirectory on that site), many links to this page across the web continue to use its old URL. Users' attempts to access this page by following these links will often result in erroneous responses even though the page still exists. To address this problem, we present FABLE. Given a URL which no longer works, FABLE attempts to identify an alternative functional URL to the same page: an alias of the broken URL. Our high-level insight is that archived copies of web pages provide a rich set of information to find alias links. However, not all pages are archived; even for those which are, straightforward approaches for using archived copies to find aliases result in both poor coverage and efficiency. By employing a variety of techniques to address these challenges, FABLE finds aliases for 30% of links in a diverse corpus of 1,500 broken URLs.

1 Introduction

The web is a vast, highly interconnected repository of information and services. Rather than every web page being self-sufficient on its own, pages routinely link to other pages, both on the same site and on other sites. The links on any page point users to information and services that are closely related to what the page offers.

A problem with this hyper-connected structure of the web is that it is common for links on the web to become dysfunctional over time. The URL pointing to a specific page may not work a few years after it was created, either because the page was moved to a different URL (when the site hosting it was reorganized) or because the page no longer exists. When users visit pages that link to such broken URLs, they end up lacking appropriate context or are robbed of relevant associated services, e.g., the background material linked from a news article or the reviews for a product may be inaccessible.

Existing solutions to this problem take one of two approaches; both have fundamental shortcomings.
• Proposals for a content-centric Internet [22, 41] argue that every page ought to be referred to by a more robust identifier than its URL, e.g., a cryptographic hash of the page's content. However, since these proposals not only require all legacy pages to be rewritten but also mandate changes to DNS, routing, etc., they have seen little adoption.
• Alternatively, web archival systems (such as the Internet Archive's Wayback Machine [5]) enable users to use the archived copy of a page when the link to the page no longer works [2, 8, 20]. But, when the page still exists, this solution is less than desirable for users; the page's content might have been updated since it was last archived, and services offered on the page, such as add-to-cart, will not function on the archived copy. Sites hosting such pages also lose the opportunity to show ads and attract users to their site. Furthermore, since it is an expensive (and practically infeasible) proposition to crawl and save the entire web, many pages are never archived.

In this paper, we pursue a different tack: if a URL no longer works but the page it was pointing to still exists, why not try to find and use the new URL at which the page is now available? Being able to identify such aliases of broken URLs is useful in a variety of contexts: a web page's author can replace broken links included on her page, a user can do the same to fix her bookmarks which no longer work, or a website's provider can add redirections from the old URLs for its pages to the corresponding aliases. However, in all of these cases, the challenge in finding new URLs to pages stems from the lack of information and scale, e.g., a user looking to fix her bookmarks may only have every page's title saved, whereas the provider of a large site may have too many broken links to patch across all of its pages.

Therefore, to automate the identification of functional aliases for dysfunctional URLs, we present FABLE (Finding Aliases for Broken Links Efficiently). Our key contribution lies in demonstrating how to leverage web archives not to serve requests for broken URLs, but to find aliases for them. FABLE uses archived page content in three ways.

First, in cases where the archived copy of a page exists, we show how to efficiently search for that page on the web. Finding the archived copy of a page is itself not straightforward, as web archival systems might have stored the page under a different URL (e.g., with a different set of query parameters) than the broken URL for which we are attempting to find an alias. Once found, picking the right archived copy is important: older copies for a URL are likely to have stale content, and newer copies might have been gathered after the URL was broken. Lastly, we show how to use a page's archived content to formulate relevant search queries, and the order in which to issue these queries, so as to find as many aliases as feasible with the least amount of work.

Second, to account for the incompleteness of web archives and search indices, we leverage the graph structure of the web. We observe that, when the old URL for a page no longer works, other pages on the same site have often been rewritten to link to the page's new URL. Therefore, we efficiently search within web archives to identify pages which previously linked to a target broken URL, and then determine which (if any) of the links that exist today on these pages is an alias for the URL.

Third, we exploit the fact that, when a site is reorganized, there is often a pattern in how old URLs map to their corresponding new locations. Using the aliases found for URLs on a particular site using our first two techniques, we apply programming by example [18] to infer potential aliases for the remaining broken URLs on that site.

Our evaluation on a large, diverse corpus of broken links shows that FABLE is able to find aliases for 30% of these URLs, with a precision of 81%. Moreover, our analysis of three university websites shows that, among all the requests to these sites which are for URLs that were previously functional but no longer are, the aliases found by FABLE could help fix 48% of these requests to broken URLs.
2 Background and Motivation

A URL which originally pointed to a particular web page may not work now for a variety of reasons. In this section, we first list these various reasons and describe how we detect them. We then present numbers highlighting the prevalence of broken links on the web and their impact. Finally, we describe our observations about the characteristics of broken links which motivate our work.

2.1 Detecting Broken URLs

Given the URL for a web page, the first step in loading the page is to fetch the page's HTML. In this paper, we focus on those URLs where this first step fails. Even if the HTML for a page exists, other resources on the page (images, CSS stylesheets, etc.) may be missing, which is a separate problem outside the scope of this paper.

Fetching a page's HTML involves several steps: a DNS lookup to map the hostname in the page's URL to an IP address, a TCP (+ TLS) connection setup to that IP address, and an HTTP request followed by a response. When a URL is no longer functional, any of these steps may fail. Failures of the DNS lookup or connection setup are easy to detect.

Detecting the failure of the last step is, however, tricky, since web servers do not always return an error (i.e., a 404 'Not Found' response) when the requested page does not exist. Often, web servers instead redirect the client to an unrelated page (e.g., the site's home page). Many servers even return an error page with a 200 'OK' status code, e.g., when a website that is no longer being paid for is accessed, the domain registrar may serve ads in response.

To detect when HTTP redirections and responses with a 200 status code are indicative of an error (which we refer to as a "soft-404"), we adapt a technique from prior work [11]. Given a URL u to test, we obtain a new URL u′ which is identical to u except that the suffix in u following the last occurrence of the delimiter '/' is replaced by a randomly generated string of the same length, e.g., given u of the form https://foo.com/path/file10, u′ could be https://foo.com/path/ckvm92. Since u′ is a randomly generated URL (and hence, invalid), we can infer that u is dysfunctional if requests for u and u′ elicit the same response. We tweak this technique by Bar-Yossef et al. [11] in two ways: 1) for URLs with a numeric string in the middle (e.g., article IDs in the URLs for news articles), we also consider a u′ in which we replace this token, since this string may dictate the server's response, and 2) if the response to u contains a canonical URL [9] within the HTML, we find that to almost always indicate a non-erroneous response. While our detection of soft-404s is not perfect (e.g., the above method does not work when the URL to a site's homepage is broken), we ensure that we only err in marking potentially broken URLs as not broken, not the other way around.
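To make the soft-404 check concrete, the following is a minimal sketch of how the core comparison could be implemented; the helper names, the simplistic canonical-URL test, and the response-similarity threshold are our own illustrative choices, not part of FABLE's implementation.

```python
import random
import string
import requests
from difflib import SequenceMatcher

def random_suffix_url(url: str) -> str:
    """Replace the suffix after the last '/' with a random string of equal length."""
    prefix, _, suffix = url.rpartition("/")
    rand = "".join(random.choices(string.ascii_lowercase + string.digits,
                                  k=max(len(suffix), 6)))
    return f"{prefix}/{rand}"

def looks_like_soft_404(url: str, sim_threshold: float = 0.9) -> bool:
    """Heuristic soft-404 check: a valid URL and a random sibling URL
    should not elicit near-identical responses."""
    real = requests.get(url, timeout=30)
    probe = requests.get(random_suffix_url(url), timeout=30)
    # A canonical URL in the HTML almost always indicates a non-erroneous
    # response (this substring test is a simplification).
    if '<link rel="canonical"' in real.text:
        return False
    # Both URLs landing on the same final page (e.g., the homepage) is suspicious.
    if real.url == probe.url:
        return True
    # Near-identical bodies suggest a templated error page served for any path.
    similarity = SequenceMatcher(None, real.text, probe.text).ratio()
    return similarity >= sim_threshold
```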
2.2 Prevalence and Impact of Broken URLs

To demonstrate how commonplace it is for links from the past to no longer be functional, we compile a corpus of URLs archived over time by the Wayback Machine [5]. We consider 5 years—1999, 2004, 2009, 2014, and 2019—and in each year, we add to our corpus 100,000 URLs at random out of all those which were archived by the Wayback Machine for the first time in that year; we use the year in which a URL was first archived as an approximation of the year in which that URL was created. We crawl the HTML for each of these URLs to test whether they are broken. We remove from our dataset all pages that either require authentication (ones that return a 403 'Forbidden' status code) or are forbidden to be crawled by the site's robots.txt file. We also crawl only one URL every 15 seconds to minimize the likelihood of triggering any site's rate limits.

Many URLs go defunct over time. For each of the 5 years, Table 1 shows the fraction of URLs from that year which no longer work and the type of error we observe. As the numbers show, a sizeable fraction of previously functional URLs are now broken, and the chances of a URL becoming dysfunctional grow over time.

                       1999   2004   2009   2014   2019
 % DNS error           21.9   19.2   10.9    6.4    0.7
 % Connection error    11.0    9.6    9.4    5.6    2.7
 % 40x/50x error       40.7   39.0   36.4   25.4    4.5
 % Soft-404            19.5   17.5   17.1   13.8   11.7
 % Total broken        93.1   85.3   73.8   51.2   19.6
Table 1: Out of 100,000 URLs in each year, % which do not work now, broken down by the reason.

Many pages link to broken URLs. A URL being rendered dysfunctional is by itself problematic, e.g., when this URL has been bookmarked by a user. But the impact of a broken URL is further compounded by the fact that many web pages may have linked to this URL. Visitors of each such page will be robbed of the appropriate context or follow-up information/services that the page's author meant to point them to. Figure 1 shows an example; visitors of the home page of https://jewishweek.timesofisrael.com/ will find that the link to the "Letters to the Editor" is broken.

Figure 1: An example broken outgoing link.

We quantify this outsized impact of broken URLs as follows. Out of all the still functional URLs from the corpus spanning 20 years described above, we pick 2,500 at random from each of the five years considered. We crawl the HTML for each of these 12,500 pages and identify all the URLs that are linked from the HTML. We focus on the <a> tags included in the page's main HTML, thereby ignoring links to ads and other content linked from iframes embedded in the page. Some of the links are to URLs that were never functional (e.g., misspelt URLs or absolute links mistakenly included as relative links); we find these to be rare, but do our best to detect and exclude them from our data.

Figure 2 shows that, irrespective of the year, a large fraction of pages – more than 40% in every year – have at least one broken outgoing link. Of course, the farther back we go in time, the larger the fraction of outgoing links that are typically broken. Significantly, many outlinks are broken even on pages from a year ago.

Figure 2: For pages in each year, fraction of the links included on every page which no longer work (CDF across pages).

Users issue requests to broken links. As evidence that users are significantly impacted by broken links on the web, we consider the server-side logs from three university websites. Since many of the error-inducing requests in these logs correspond to hackers probing the servers for vulnerabilities, we focus on the subset of requests corresponding to URLs which were successfully archived by the Wayback Machine at least once, i.e., these URLs were valid at some point in time; we refer to these as "once-valid URLs."

        Duration    Once-Valid URLs       Requests for these URLs
        of trace    Total #   % Broken    Total #   % Broken
 Site1  20 days     5363      14.3        77.9K     10.6
 Site2  7 months    5350      7.9         1.37M     0.8
 Site3  5 days      3231      45.6        40.3K     17.0
Table 2: Summary of logs from three university websites.

Table 2 presents a summary of these datasets. On Site2, we see that a very small fraction (0.8%) of the requests to once-valid URLs elicit an error from the server. However, on the other two sites, 10% and 17% of the requests to once-valid URLs now result in an error or redirect to an incorrect page, even though all of these URLs did work previously. Based on the logs available to us, we are unable to determine how users end up accessing these URLs (e.g., via their bookmarks or via links on other pages). Nonetheless, this data illustrates the impact of broken links on users.

2.3 Coping with Broken Links

To address the problem of dysfunctional URLs, we make two observations about them.

Broken URLs, not defunct sites. First, for most of the domains in which we find at least one broken URL, we observe that there are typically other URLs in the same domain that still work. In each of the five years we consider, among all the domains from which we have at least 100 URLs from that year in our corpus, we pick 1000 domains at random. Figure 3 plots the distribution across these domains of the fraction of URLs in that domain which are broken. Domains in which all URLs no longer work are in the minority in every year, particularly in the last decade. So, when a URL pointing to a particular page is no longer functional, it is rarely the case that the site hosting that page does not exist.

Figure 3: In each year's corpus, for 1000 domains in which we have at least 100 URLs, fraction of URLs that are broken (CDF across sites).

Pages that broken URLs were pointing to often still exist. Second, for a sizeable fraction of broken URLs (e.g., many of the broken URLs on the three university websites we studied above), we observe that the page the URL was originally pointing to still exists at an alternate URL, i.e., the broken URL is merely due to a reorganization of the site hosting it, not the result of the page being deleted. Website providers often restructure the URLs for their pages (e.g., when a site switches to a different domain or begins using a content management system such as Wordpress or Drupal). In doing so, they break many of the previously functioning URLs to pages on their site. For example, in Figure 1, the "Letters to the Editor" link is broken because the path /letters-to-the-editor/ on the site has been renamed to /topic/letters-to-the-editor/.

To illustrate our second observation, we present our manual analysis of 16 domains chosen at random from the ones in which a) we have at least one broken URL in our corpus, and b) not all URLs are broken. For each dysfunctional URL in these domains, we manually attempted to find the page originally pointed to by this URL. We used the Wayback Machine's archived copies of these pages and attempted to discover these pages via web search and via exploration of these 16 sites. Figure 4 shows the subset of broken URLs on each of these sites for which we were able to find an alias. Cumulatively, we were able to find functional aliases for 32% of the broken URLs from these sites. As we show later, this is only a lower bound, as the aliases uncovered by our manual investigation are not exhaustive.

Figure 4: For 16 domains (aaascreensavers.com, activenetwork.com, commonsensemedia.org, decathlon.co.uk, ebates.com, eclipse.org, filecart.com, gamespot.com, imageworksllc.com, jfku.edu, knect365.com, mobilemarketingmagazine.com, namm.org, onlinepolicy.org, planetc1.com, and smartsheet.com), number of broken URLs in each domain for which we do or do not find a functional alias.
3 FABLE Overview

Our results from the previous section show that, though broken URLs are common on the web, the pages that these URLs were pointing to often still exist. As discussed earlier in the introduction, we envision that the existence of aliases for broken URLs can be leveraged in a variety of settings.

However, two factors make it challenging to find aliases for broken URLs. First, there is often limited information available about the page that the URL was previously pointing to, e.g., only a page's URL and title are stored in a user's bookmarks, and a site's provider can only rely on the anchor text associated with broken URLs linked from its pages. Second, even in cases where one has extraneous information about a page for which they are looking to identify the new URL (e.g., a web developer who previously linked to this page or the admin of the site hosting the page), the challenge is one of scale; applying the manual process that we used in the previous section is not an option.

Therefore, with FABLE, we seek to automate the identification of functional aliases for broken web links at scale. As shown in Figure 5, FABLE takes a set of dysfunctional URLs as input and outputs an alternate URL for those inputs for which it is able to locate the page on the web. Since some of the aliases identified by FABLE can be incorrect (for reasons outlined later), for each alias it identifies, FABLE also outputs its confidence about this alias being correct.

Figure 5: Overview of FABLE. URLs from server logs, bookmarks, etc. are first filtered to identify broken URLs; FABLE then fetches and crawls the search engine, archive, and live web to output (alias, confidence) pairs.

Our design of FABLE is guided by three objectives.
• Coverage. We seek to find aliases for as many of the input URLs as feasible. Ideally, if the page that a URL was previously pointing to still exists, FABLE should be able to find the page's new URL.
• Precision. We strive to ensure that the aliases found by FABLE are correct. If many of the identified aliases are suspect, they will need to be manually inspected before they can be used, which is hard to do at scale.
• Efficiency. Lastly, we aim to minimize the amount of work – the number of pages crawled, search queries issued, etc. – that FABLE must do to locate alias links.
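As a concrete rendering of this interface, the sketch below wires the three approaches described in Section 4 into the (alias, confidence) output of Figure 5. All helper functions (via_search, via_backlinks, infer, score) are hypothetical stand-ins for components described later, not FABLE's actual API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AliasResult:
    broken_url: str
    alias: Optional[str]   # None if no alias was found
    confidence: float      # belief that the alias is correct (Section 4.5)
    method: str            # "search", "backlink", or "inference"

def fable(broken_urls, via_search, via_backlinks, infer, score):
    """Top-level flow: try search, then backlinks; infer the rest from
    whatever was found; attach a confidence to every alias."""
    found, interim = [], []
    for url in broken_urls:
        alias, method = via_search(url), "search"
        if alias is None:
            alias, method = via_backlinks(url), "backlink"
        if alias is not None:
            found.append((url, alias))
        interim.append((url, alias, method))
    # Use per-site patterns in the aliases found so far for the remainder.
    inferred = infer(found, [u for u, a, _ in interim if a is None])
    results = []
    for url, alias, method in interim:
        if alias is None and url in inferred:
            alias, method = inferred[url], "inference"
        conf = score(url, alias, method) if alias else 0.0
        results.append(AliasResult(url, alias, conf, method))
    return results
```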
4 Design

Our high-level approach in building FABLE is to leverage web archives such as the Wayback Machine to find aliases for broken URLs. For every URL they have archived, web archives provide a rich set of information: what was the content at this URL before it stopped working, which other URLs did this URL redirect to in the past, which pages linked to this URL previously, etc. For every broken URL given to it as input, FABLE leverages this historical context to identify a corresponding alias which is functional today. Table 3 summarizes the various techniques used.

 Approach                  Technique                                                           Section
 Web search                If URL not archived, identify query arguments which do not          §4.2.1
                           impact page content to find if page was archived under an
                           alternate URL
                           For any URL, use the last archived copy with neither redirections   §4.2.2
                           nor an error code response, as that is likely the latest archived
                           content before the URL was broken
                           To search the web efficiently, use the title and content in the     §4.2.3
                           archived copy judiciously to formulate search queries, and order
                           the queries in decreasing likelihood of finding a match
 Match backlinks           To efficiently find pages which previously linked to the input      §4.3.1
                           URL, perform a guided crawl within the web archive starting from
                           archived copies of the URL and of the homepage of its site
                           Correlate historical and current backlinks to find a functional     §4.3.2
                           alias
 Pattern-based inference   Apply programming by example on the aliases found for broken        §4.4
                           URLs on a site to predict aliases for other broken URLs on that
                           site
Table 3: Summary of the techniques used in FABLE to find aliases using three different approaches.

4.1 Challenges

A straightforward approach to using archived page content to find aliases could work as follows. Given a broken URL, we can look up the archived copy of this URL and use the content in that copy (e.g., the page's title and first paragraph) to formulate a query to submit to a web search engine such as Google or Bing. We can then crawl each of the search results which are in the same domain as the broken URL, and compare the content currently at that page with the archived content. If any of these prove to be a close match, we succeed in finding the new URL for the target page.

However, this strawman approach overlooks a number of practical realities. First, as mentioned earlier, web archives are understandably incomplete. So, what if there is no archived content for the broken URL for which we are attempting to find an alias (as is the case for the "Letters to the Editor" page linked from the example in Figure 1)? Second, even if the URL is archived, which of its archived copies should we use, given that systems like the Wayback Machine repeatedly recrawl URLs over time? Lastly, finding an alias (if it exists) is not a given even if the URL is archived and we pick the right archived copy for it: (1) indices maintained by even popular web search engines are far from complete [39], (2) even the latest archived copy of a page may significantly differ from what is currently on the page (e.g., the last archived copy for the broken URL in Figure 1 would be missing all the letters posted after the URL stopped working), and (3) many web pages have little textual content or have titles that are shared with other pages on the site, making them not amenable to discovery via search engines.

In the rest of this section, we describe how our design of FABLE addresses these challenges.
4.2 Using Archives to Search

We begin by improving upon the strawman described above in using web search engines to find aliases. Our first two techniques improve the number of aliases we can find with this approach, and the third increases efficiency in doing so.

4.2.1 Finding Archived Copies

First, we observe that, even if there exists no archived content for a specific broken URL given as input to FABLE, the same page might have been archived under a different URL. This is because a page can often be referred to via a number of URLs. For example, a URL can include a query string comprising several arguments in the form of (key, value) pairs, and the server hosting this page may return the same content irrespective of the values for some of these arguments; even completely excluding some or all of these arguments often has no impact on page content (e.g., affiliate IDs included in the URLs for product pages on Amazon).

Therefore, for every input URL which has a query string and has not been archived, FABLE attempts to identify which of the arguments in the URL impact the page's content. For every argument arg in the URL, FABLE looks at all URLs with archived copies which a) include this argument, and b) are identical to the input URL except for the query string. It determines that the value of arg does not matter if the archived content for any of these similar URLs matches the archived content for a different similar URL which either a) has a different value for arg, or b) does not include that argument. We defer to Section 5 details about how we compare any two pages and determine if their content matches.

After it tests the significance of each of the arguments in an input URL, FABLE discards those arguments which it finds do not impact page content. It then uses the archived content for any URL which contains the remaining portion of the input URL and whose additional arguments are all deemed to be inconsequential to page content.
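As a sketch of the argument-significance test above: assuming hypothetical helpers fetch_archived_copies() (returning (url, content) pairs from the archive for URLs sharing a base URL) and content_matches() (the page-comparison routine of Section 5), the logic could look as follows.

```python
from urllib.parse import urlparse, parse_qsl

def insignificant_args(input_url, fetch_archived_copies, content_matches):
    """Return the set of query arguments in input_url whose values do not
    appear to affect page content, judged from archived copies of similar
    URLs (URLs identical to the input except for the query string)."""
    base = urlparse(input_url)._replace(query="").geturl()
    input_args = dict(parse_qsl(urlparse(input_url).query))
    # Hypothetical helper: archived (url, content) pairs sharing this base URL.
    similar = [(dict(parse_qsl(urlparse(u).query)), html)
               for u, html in fetch_archived_copies(base)]
    insignificant = set()
    for arg in input_args:
        candidates = [(q, html) for q, html in similar if arg in q]
        for q1, html1 in candidates:
            for q2, html2 in similar:
                # A copy with a different value for arg, or without arg at
                # all, whose content still matches, implies arg is
                # inconsequential to page content.
                if (q1 is not q2 and q2.get(arg) != q1[arg]
                        and content_matches(html1, html2)):
                    insignificant.add(arg)
                    break
            if arg in insignificant:
                break
    return insignificant
```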
4.2.2 Picking the Right Archived Copy

Next, FABLE determines which of a URL's archived copies to use; web archival systems repeatedly recrawl a page in order to capture how that page's content changes over time. In FABLE, for any URL, we use the latest archived copy in which a 200 status code response was observed. We make this choice as a balance between two considerations: the content on a page is likely to have significantly changed since it was first archived, whereas the latest archived copy for any broken URL is likely to reflect the same erroneous response we obtain when visiting that URL today. As a result, neither the first nor the last archived copy of a page is likely to help FABLE in formulating a search query with which it can locate the page. By using the last archived copy with a 200 status code response, FABLE attempts to use the latest copy which is likely to be non-erroneous.

Note that the chosen archived copy might still be significantly stale, and the URL might already have stopped working by the time this copy was collected; recall from Section 2.1 that some sites serve erroneous responses with a 200 status code. However, as we show later in Section 6, our strategy for picking which archived copy of a broken URL to use significantly increases the odds of finding its alias, despite being limited by the rate at which archival systems recrawl pages and how long it has been since the URL became dysfunctional, both of which are outside our control.
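With the Wayback Machine, for instance, this choice could be made via its public CDX index API; the following is a minimal sketch. The query parameters shown are part of the CDX API as we understand it, but treat the exact filtering behavior as an assumption to verify against the API documentation.

```python
import requests

CDX_API = "http://web.archive.org/cdx/search/cdx"

def last_good_snapshot(url):
    """Return (timestamp, archived_url) of the latest 200-status capture
    of `url` in the Wayback Machine, or None if it was never archived."""
    params = {
        "url": url,
        "output": "json",
        "fl": "timestamp,statuscode,original",
        "filter": "statuscode:200",   # keep only non-redirect, non-error captures
    }
    rows = requests.get(CDX_API, params=params, timeout=60).json()
    if len(rows) <= 1:                # first row is the field-name header
        return None
    ts, _, original = rows[-1]        # captures are in ascending timestamp order
    return ts, f"http://web.archive.org/web/{ts}/{original}"
```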
4.2.3 Formulating Search Queries

There are many ways to formulate a search query from a page's archived content. The challenge is to maximize efficiency without sacrificing coverage, i.e., minimize the number of search queries we issue and pages we crawl while finding most of the aliases discoverable via web search. We address this tradeoff in FABLE based on two characteristics of web pages. First, prior work [35] has observed that a few carefully chosen words from a page's content can serve as a robust "lexical signature" of the page. Second, though a page's title can change over time, we find that this happens less often than changes to the page's content.

Given these properties, FABLE issues three search queries to find the alias for any particular broken URL; in all cases, it limits the search results to the domain to which users are redirected when visiting the homepage of the URL's domain now (e.g., for URLs in espn.go.com, we look for results in espn.com because an HTTP request for the former redirects to the latter). First, we search using the page's title in the archived copy, but without requiring the match to be exact (i.e., we do not include the title within quotes). This allows for minor changes to the page's title between when the archived copy we use was collected and now. If the first query fails to result in a match, FABLE then issues a second query using the most significant words on the page; we describe how we pick these words in Section 5. If an alias has still not been found, FABLE searches using the page's title again, but requiring an exact match this time; given that we look at a limited number of search results for efficiency, this last query helps find an alias in cases where the first query returns too many results. Later, in Section 6, we show that this combination of queries issued in this order offers a good balance between efficiency and coverage.
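The three-query cascade can be expressed compactly. In this sketch, search() stands in for a web search API restricted to a site (e.g., via a site: operator), and find_match() for the page-comparison step of Section 5 — both hypothetical names.

```python
def find_alias_via_search(title, top_words, site, search, find_match):
    """Issue FABLE's three queries in decreasing likelihood of a match,
    stopping as soon as one of them yields an alias."""
    queries = [
        title,                    # 1) title, inexact match allowed
        " ".join(top_words),      # 2) most significant words (lexical signature)
        f'"{title}"',             # 3) title again, exact match required
    ]
    for q in queries:
        results = search(q, site=site)   # hypothetical search-API wrapper
        alias = find_match(results)      # compare candidates against archived copy
        if alias is not None:
            return alias
    return None
```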
4.3 Correlating Backlinks

The techniques we have described thus far improve coverage and efficiency compared to the strawman approach. However, any approach that uses archived content to search on the web suffers from a common set of limitations: the page may not have been archived under any of its URLs, limited textual content on the page may hinder the formulation of relevant search queries, and search engines may not have indexed the page.

To sidestep these downsides, our second approach finds aliases by locating backlinks which have been rewritten. On any page that previously linked to a URL which is now broken, if that URL has now been replaced, we can map the broken URL to its alias by comparing the archived and current copies of that page. Figure 6 shows an example.

Figure 6: An example where FABLE is able to find the alias for a broken URL (http://www.filecart.com/download/professional-slideshow-maker, whose alias target is https://filecart.com/tag/professional-slideshow-maker) by finding a page that previously linked to this URL (the archived http://www.filecart.com/flash-slideshow-designer, which on the live web redirects to https://filecart.com/get/flash-slideshow-designer) and by comparing the current and archived versions of that page.

4.3.1 Finding Backlinks Efficiently

Determining which URLs are linked from a given page is easy: parse the page's content. The opposite – finding which pages link to a given URL – is not. Moreover, it is not useful for us to find which pages link to a particular URL today. Instead, we need to find a broken URL's backlinks from the past, in order to check if the URL has been replaced by its alias on any of those pages. Since the site on which a broken URL is hosted is more likely to have replaced this URL on its pages, we focus on finding internal backlinks.

Crawling within the web archive. Given a broken URL, FABLE attempts to find its historical backlinks in two ways. First, we leverage the fact that recursively following links, starting from the outlinks on a page, often helps find a link back to that page, e.g., a professor's web page may link to the homepage of their department's website, which in turn links to a listing of all faculty, which contains a link to the original page. So, in cases where archived copies of a broken URL exist (and yet we were unable to find its alias using web search), FABLE performs a crawl within the web archive starting from this URL, attempting to find a loop back to the URL, as in the above example. Second, if the previous step fails to find a backlink or if no archived copies of the broken URL exist, FABLE performs a crawl within the web archive starting from the homepage of the URL's domain. With either strategy, we crawl once using the first archived copy of every page and again using the last 200-status-code copy of every page. Among the broken URLs we consider in our evaluation for which we find a backlink, crawls starting from the URL and from the homepage of its site respectively find historical backlinks for 79% and 60% of these URLs.

Guided crawling for efficiency. A key consideration in performing these crawls is to minimize the number of pages we need to fetch and inspect. As we show later in Section 6, naively performing a depth-first or breadth-first crawl results in needing to visit a large number of pages, often without being able to find a backlink within a limited crawling budget. To be efficient, FABLE's crawls are guided by its goal: finding a backlink for a specific URL. Specifically, when searching for backlinks for a target URL t, among all outlinks seen on all pages crawled so far, we pick which link u to preferentially pursue based on a combination of two factors. The first factor is the edit distance between the two URLs t and u; a page in a nearby directory on the site is more likely to contain a link to t. The second factor is the similarity between the anchor text associated with the link to u (i.e., the text underlying the link) and the representative text for t (which is either extracted from the archived copy for t, if available, or the tokens from the URL t). The latter simulates the human behavior of pursuing links which appear most closely related to the target page.

Note that, in cases where FABLE finds a backlink from the web archive, but the URL of the backlink page is itself not functional today, FABLE recursively attempts to find the alias for that backlink before matching links as described below.
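A best-first crawl driven by these two factors could be sketched as follows; difflib's similarity ratio stands in for the paper's edit-distance and text-similarity measures, the equal weighting is an illustrative choice, and fetch_outlinks() is a hypothetical helper that parses an archived copy.

```python
import heapq
from difflib import SequenceMatcher

def guided_backlink_crawl(target_url, target_text, start_url,
                          fetch_outlinks, budget=50):
    """Best-first crawl of archived pages, preferring outlinks whose URL is
    close to target_url and whose anchor text resembles the target's
    representative text. fetch_outlinks(url) -> [(link_url, anchor_text)].
    Returns archived pages that linked to target_url."""
    def score(url, anchor):
        url_sim = SequenceMatcher(None, url, target_url).ratio()
        text_sim = SequenceMatcher(None, anchor.lower(),
                                   target_text.lower()).ratio()
        return url_sim + text_sim            # equal weights, illustrative

    frontier = [(-1.0, start_url)]           # max-heap via negated scores
    seen, backlinks = {start_url}, []
    while frontier and budget > 0:
        _, page = heapq.heappop(frontier)
        budget -= 1                          # one page fetched and inspected
        for link, anchor in fetch_outlinks(page):
            if link == target_url:
                backlinks.append(page)       # `page` historically linked to t
            elif link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link, anchor), link))
    return backlinks
```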
4.3.2 Matching Links Accurately

Within a limited crawling budget, FABLE finds as many backlinks for a broken URL as it can. On any page which previously linked to the URL, we need to determine which, if any, of the links currently on the page is the alias. We could crawl the pages at each of those links and compare them with the archived content for the broken URL. However, recall that our use of backlinks to find aliases is driven by the absence of archived copies for many URLs.

Given a page p which previously linked to URL u, which is now broken, we declare a URL a now linked from that page as u's alias in two cases. When the anchor texts associated with u and a are unique (i.e., no other link had the same anchor text as u in the archived copy of p, and no other link on p has the same anchor text as a today), we declare a to be u's alias if the two anchor texts are similar (§5). If there is no anchor text (e.g., if u and a are linked from images) or if the anchor texts are not unique on page p, we compare the text surrounding these links; specifically, we compare the text in the DOM node closest to the respective <a> tags.

4.4 Leveraging Patterns in URL Changes

Our last strategy for finding a broken URL's alias exploits the aliases found by FABLE using the web search and backlink based strategies. Our high-level insight here is that, on any site, typically a large number of URLs are rendered dysfunctional over time, not just one. When this occurs, many of the broken URLs on a site are often similar, and so are their aliases; Table 4 presents examples from a couple of sites. Therefore, for any given site, we seek to learn the patterns underlying the mapping from the old to the new URLs for its pages. To do this, we pursue programming by example (PBE) [18], instantiated using Microsoft Excel's Flash Fill capability [7].

 Domain        Old URL                                        New URL
 cars.com      www.cars.com/gmc/acadia/2014/                  www.cars.com/research/gmc-acadia-2014
               www.cars.com/ford/expedition/2014/             www.cars.com/research/ford-expedition-2014/
 sbnation.com  www.sbnation.com/ncaa-football/teams/Arizona   www.sbnation.com/college-football/teams/arizona-wildcats
               (Title: Arizona Wildcats Schedule, ...)
               www.sbnation.com/ncaa-football/teams/Oregon    www.sbnation.com/college-football/teams/oregon-ducks
               (Title: Oregon Ducks Schedule, ...)
Table 4: On a couple of sites, examples showing the patterns in how broken URLs map to their aliases.

The PBE approach entails providing a number of input-to-output examples, using which the output can be predicted for a new input. In our setting, it is insufficient to simply provide all broken URLs on a site and their known aliases as input. First, since there are often multiple patterns in the mapping from broken URLs to their aliases on a given site, we need to cluster similar broken URLs and invoke Flash Fill on each cluster separately. Second, as seen in the examples in Table 4, we need to provide auxiliary information about each page as input, because new URLs often include the page's title, date of creation [6], etc. We extract this information from the archived copies of the broken URL, when available.
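The orchestration around the PBE engine could be sketched as follows. Flash Fill itself is a black box, so synthesize() is a hypothetical stand-in for any PBE synthesizer that maps example (input, output) pairs to a transformation function; the coarse clustering by URL "shape" is likewise our own illustrative simplification.

```python
from collections import defaultdict
from urllib.parse import urlparse

def url_shape(url):
    """Coarse template of a URL's path, used to cluster similar broken URLs
    (e.g., /gmc/acadia/2014/ and /ford/expedition/2014/ share a shape)."""
    parts = urlparse(url).path.strip("/").split("/")
    return tuple("<num>" if p.isdigit() else "<str>" for p in parts)

def infer_aliases(known_pairs, unresolved, synthesize):
    """known_pairs: (broken_url, alias, extras) triples already found via
    search/backlinks, where extras holds auxiliary info (title, date, ...).
    Synthesize one transformation per cluster and apply it to the
    unresolved (broken_url, extras) pairs on the same site."""
    clusters = defaultdict(list)
    for broken, alias, extras in known_pairs:
        clusters[url_shape(broken)].append((broken, alias, extras))
    inferred = {}
    for url, extras in unresolved:
        examples = clusters.get(url_shape(url))
        if not examples:
            continue
        program = synthesize([((b, e), a) for b, a, e in examples])
        if program is not None:
            inferred[url] = program((url, extras))   # predicted alias
    return inferred
```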
4.5 Confidence in Results

In all of the techniques employed by FABLE to find aliases, it needs to use some thresholds to determine similarity, e.g., when comparing the current and archived versions of a page and when comparing the anchor text across different versions of a page. Given that page content is prone to change over time, focusing only on exact matches would greatly handicap FABLE. Therefore, in setting the thresholds of similarity, we have to live with the chance that some of the aliases found by FABLE will be false positives.

To help minimize the number of aliases found by FABLE that users of the system need to manually inspect, we have it compute and output its confidence in each alias that it identifies. The intuition underlying our model is that, for any broken URL, the rate of flux in the page's content and how FABLE finds the URL's alias are indicative of the likelihood that the alias is correct. For example, the more often a page's content changes, the more likely it is that the last archived copy for that page's old URL is significantly stale. Likewise, a match between the archived copy and current version of a page based on the page's content is a stronger indicator than a match based on the page's title.

In FABLE, we train a Naive Bayes classifier to compute the confidence in the alias for a given broken URL. This classifier takes five features as input: 1) the error type (DNS/connection error, 40x/50x status code, or soft-404), 2) how often the page is updated, approximated by the number of archived copies on the Wayback Machine, 3) how the alias was discovered (by Search, Backlink Matching, or Inference), 4) how the alias was matched to the URL (by title, content, or backlink), and 5) whether the broken URL is not a homepage but the alias is. In our evaluation, we find that the confidence computed by this model strongly correlates with the correctness of FABLE's output, and that the second and fourth features prove to be the most important.
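Such a classifier is straightforward to train with scikit-learn, as sketched below. Bucketizing the archive-copy count so that all five features are categorical is our simplification, not necessarily FABLE's; the feature values shown are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

def train_confidence_model(rows, labels):
    """rows: one feature list per labeled alias, e.g.
       ["soft-404", "many-copies", "search", "content", "not-homepage-mismatch"]
       (error type, bucketized archive-copy count, discovery method,
        match method, homepage mismatch); labels: 1 if the alias is correct."""
    enc = OrdinalEncoder()
    X = enc.fit_transform(np.array(rows, dtype=object))
    clf = CategoricalNB().fit(X, labels)

    def confidence(row):
        x = enc.transform(np.array([row], dtype=object))
        return clf.predict_proba(x)[0, 1]   # P(alias correct)

    return confidence
```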
5 Implementation

Search APIs. We use the APIs for both Google Custom Search [3] and Bing Web Search [10]. For any query, the former limits the number of results initially returned to 10 and the latter to 50. Though both APIs allow us to ask for additional results, we find negligible utility in doing so; examining the top 10 results suffices to find almost all aliases discoverable via web search.

Removal of boilerplate content. Modern web pages usually contain boilerplate [27] (e.g., ads, recommended stories, and site-specific templates), which has little to do with the core content and changes very frequently. Therefore, before comparing the archived content of a page and the current content on a potential alias, we remove the boilerplate from both copies. To do so, we use Chrome's DOM Distiller [1], which powers Chrome's reader mode.

Text similarity and representative words. FABLE has to compute the similarity between two text fragments to compare page content, to compare page titles, and to compare the anchor texts of two links. In all of these cases, we use TF-IDF (term frequency-inverse document frequency) [36], which is a common metric to summarize a piece of text. When comparing two TF-IDF vectors, we find that a cosine similarity of 0.8 or higher typically results in a good match. We use the TF-IDF model to also choose the most representative words from a page's content that FABLE uses to issue a web search query; we use the 7 words with the highest TF-IDF weights [23].
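Both TF-IDF uses — the 0.8 cosine-similarity match and the 7-word signature — can be sketched with scikit-learn. Fitting the IDF weights on a background corpus of pages (e.g., pages crawled from the same site) is an assumption on our part about where the document frequencies come from.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_tools(background_corpus, threshold=0.8):
    """Fit IDF weights on a background corpus and return two closures:
    is_match() for text similarity and top_words() for search keywords."""
    vec = TfidfVectorizer().fit(background_corpus)

    def is_match(text_a, text_b):
        m = vec.transform([text_a, text_b])
        return cosine_similarity(m[0], m[1])[0, 0] >= threshold

    def top_words(text, k=7):
        row = vec.transform([text]).tocoo()   # sparse TF-IDF weights of `text`
        vocab = vec.get_feature_names_out()
        best = sorted(zip(row.data, row.col), reverse=True)[:k]
        return [vocab[j] for _, j in best]

    return is_match, top_words
```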
6 Evaluation

We evaluate FABLE from three perspectives: 1) its coverage in finding aliases for broken URLs and the accuracy of these aliases as a function of confidence, 2) its efficiency in finding aliases, and 3) the utility of the aliases it finds. Our evaluation shows that:
• Across 1.5K randomly sampled broken URLs spread across years and exhibiting varied error types, FABLE finds 80% more aliases than is feasible with the best strawman approach that utilizes web archives.
• Among aliases for which FABLE's confidence value is 90% or more, over 90% of these aliases are correct.
• For the same crawling budget, FABLE is able to find 35% more aliases by matching backlinks compared to breadth-first/depth-first style crawling.
• Across 48 domains which each have many broken URLs, FABLE is able to infer 41% of aliases without needing to crawl any pages or perform any web searches.
• On two of the three sites for which we have server-side logs, FABLE finds aliases for 25–45% of broken URLs, which can help prevent 48% of the erroneous responses.

6.1 Datasets

We use three datasets to evaluate FABLE: the manually analyzed URLs from 16 domains discussed in Section 2, a corpus of URLs which we use to test FABLE's web search and backlink based alias discovery, and another corpus of URLs for evaluating its inferred aliases. We describe the latter two datasets in this section.

Dataset for Search and Backlink. From all the broken outgoing links in Figure 2, we compile a corpus of 1.5K URLs by sampling 500 URLs at random for each of the following error types: 1) "DNS&Other" includes URLs that prevent the client from issuing an HTTP request (e.g., DNS error, TCP/TLS connection failure), 2) "4/5xx" includes URLs which result in an HTTP response with a 4xx or 5xx status code, and 3) "Soft-404" includes all other URLs (which redirect to an unrelated page or return an error with a 200 status code response). To prune out abandoned domains in which there is no possibility of finding an alias, we only consider URLs from those domains in which we have seen at least one working URL. We consider at most 10 URLs per domain to increase the diversity of our corpus.

Dataset for Inference. FABLE's PBE-based inference of aliases works best on sites which have many broken URLs. Therefore, in contrast to our previous corpus, which limits the number of broken URLs per domain to 10, we assemble a separate corpus of broken URLs to evaluate FABLE's ability to infer aliases. This dataset comprises 1.4K URLs from 48 domains, where each of these domains is one in which FABLE is able to find an alias for at least one broken URL using its search and backlink based methods.
6.2 Strawman Approaches

We evaluate FABLE's coverage by comparing the number of URLs for which it is able to find an alias against two strawman approaches. We mimic how one might try to find the alias upon encountering a broken link. The two approaches we consider are for scenarios with and without access to web archives.

Without archive. In the absence of web archives, the only metadata available for a URL that is now broken is either the page's title (e.g., if the URL was bookmarked) or the anchor text on a page which links to that URL. Since the title is more representative than the anchor text, the best one can do is to use the title to query web search engines for potential aliases, restricting the results to the site to which the URL's domain now redirects. Mirroring how one would only have access to the page's title as it was when the page was bookmarked, for every URL in our corpus, we use the title from the archived copy of the URL which is closest to the year in which we first see a link to this URL in the Wayback Machine. To find an alias from web search results, we use the same technique to match page titles as used in FABLE (§5).

With archive. When one does have access to archived copies of web pages, we assume that one can use web search to find aliases similarly to FABLE. The only differences we consider are that the strawman uses the first archived copy of any broken URL and that it looks up the Wayback Machine only using the URL for which it is attempting to find an alias.

6.3 Coverage

Manually analyzed URLs. For the data shown in Figure 4, we had manually found aliases for 194 broken URLs. FABLE is able to find 158 (i.e., 81%) of these. FABLE's high coverage rate on known aliases suggests that it can find functional aliases for most broken URLs that have one. FABLE also finds 33 additional aliases which were missed by our manual analysis, highlighting the utility of automating the discovery of aliases. The automation does come with the risk of false positives: FABLE finds 39 incorrect aliases.

Dataset for Search and Backlink. For the 1,500 broken URLs in this dataset, Figure 7 shows the number of broken URLs of each of the three error types for which aliases are discovered using the different approaches: FABLE's web search and backlink based methods as well as the two strawman approaches.

Figure 7: For 1,500 broken URLs, number of URLs for which each approach finds an alias (Strawman w/o archive, Strawman w/ archive, Search, Backlink, Search+Backlink), broken down by the error types that the URLs exhibit (Soft-404, DNS&Other, 4/5xx).

The strawman without access to web archives, which only finds matches based on page titles, uncovers the fewest aliases. Even the strawman which utilizes archived page content finds 18% fewer aliases than FABLE finds using only web search ('Search'); FABLE is able to use search engines better by uncovering cases where the page corresponding to a broken URL has been archived under a different URL and by utilizing fresher information about a page than that available in the page's first archived copy. FABLE discovers a similar number of aliases using backlinks ('Backlink') as it does using web search, but the union of the two methods ('Search+Backlink') is significantly larger. Backlink-based alias discovery can find aliases for pages which have not been archived, and outdoes Search in finding aliases for pages whose content keeps changing over time. In all, Search and Backlink together find aliases for 454 of the 1,500 broken URLs, 80% more than the best strawman approach.

For FABLE's discovery of aliases using web search and backlinks, Table 5 breaks down the different reasons why it is unable to find an alias.

 Search                 # URLs    Backlink                # URLs
 No archived copy       594       No backlink found       696
 No search results      56        Backlink URLs broken    291
 No match on results    539       No match on backlinks   109
 Alias found            311       Matched link broken     170
                                  Alias found             306
Table 5: Breakdown of reasons for FABLE's inability to find aliases using search and backlink based methods.

The dominant bottleneck for Search is the absence of any archived copy in the Wayback Machine. Even when a page has been archived, no match in the search results is another big hindrance; this occurs because the title and most significant words are too generic, the page is updated frequently and the archived copy is no longer representative, or the page is now absent from the search indices of both Google and Bing. For Backlink, the inability to find backlinks is the primary limiting factor. When FABLE does find a backlink, the URL for that backlink too is often broken, a consequence of the same site reorganization which rendered the URL for which we are trying to find an alias dysfunctional.

Dataset for Inference. In this dataset comprising 1,391 URLs across 48 domains, Figure 8 plots the number of aliases on each site that FABLE is initially able to discover using the combination of its search and backlink based methods, as well as the additional ones that it uncovers thereafter via inference. The benefit of FABLE's PBE-based inference varies across sites, but overall, it increases the fraction of broken URLs in this dataset for which we are able to find aliases from 36% to 43%. As we show later, many of the aliases that we discover here using Search and Backlink can also be inferred.

Figure 8: On 48 sites with many broken URLs, aliases found via search and backlinks and additional aliases inferred thereafter.
6.4 Precision

We evaluate FABLE's precision from two aspects: the overall accuracy of aliases found by FABLE, and the relation between its accuracy and the confidence values that it outputs.

First, to measure the overall precision, we sample 100 URLs for which FABLE claims to have found an alias and manually check their correctness. We find 81 out of 100 to be correct. False positives are mainly due to matching titles and link anchors where the text is very short yet contains uncommon words. Better approaches for comparing short snippets of text, compared to the TF-IDF based comparison we currently employ in FABLE, could eliminate many of these false positives.

Second, we find that the confidence values output by FABLE are strongly correlated with its accuracy. We use data from the 100 URLs that we manually examined above to train FABLE's confidence model. Then, we manually label the aliases found for another 100 URLs, for which Figure 9 plots true/false positive counts under different ranges of confidence values. The higher FABLE's confidence is in an alias it found, the more likely it is to be correct, e.g., over 90% of the aliases with a confidence in the range [90, 100] are correct. Therefore, confidence values output by FABLE can help its users determine which of the aliases it finds deserve closer examination and which ones do not.

Figure 9: Counts of true/false positive aliases on the test set under different confidence ranges output by the confidence model.

6.5 Efficiency

We attempt to minimize the amount of work that FABLE does in three ways: 1) the order in which it executes search queries, 2) the order in which it crawls the archived pages on a site in order to find a backlink, and 3) its use of programming-by-example to infer aliases. We evaluate the utility of each of these design decisions.

Web search. As mentioned in Section 4, FABLE issues three web search queries to find alias candidates: using the page's title without an exact match, using the most significant words on the page, and using the page's title with an exact match. Of all the aliases found using web search in our Dataset for Search and Backlink, these three forms of querying find 68%, 50%, and 28% of the aliases, respectively, justifying the order in which FABLE issues these queries. Moreover, each of the three queries uncovers some aliases that the other two do not, making it necessary for FABLE to use them in combination.

Backlink discovery. We compare the order in which FABLE crawls archived pages in order to search for a backlink with breadth-first and depth-first crawls (BFS and DFS). Among all aliases that FABLE finds via backlinks, we consider the ones where FABLE finds a backlink by crawling starting from the archived copy of the broken URL for which it is attempting to find an alias. Searching for a backlink by starting from the archived copy of the homepage in the broken URL's domain is likely to show similar results. For each URL, we limit the number of pages crawled to the maximum number that FABLE visits during its backlink search for any of these URLs. Compared to FABLE's crawling strategy, Figure 10 shows that BFS and DFS, which do not differentiate between the outgoing links on any page, exceed the crawling limit for 35% and 41% of the URLs, respectively.

Figure 10: For aliases found using backlinks, number of pages crawled to search for backlinks before finding an alias (CDF across URLs, for FABLE, BFS, and DFS).

Inferring aliases. Earlier, we showed that FABLE is able to infer many aliases that cannot be found via search and backlinks (Figure 8). However, for many of the aliases discoverable by the latter two methods, it is unnecessary for FABLE to spend effort searching the web and crawling web archives. Once some of these aliases have been uncovered, the remaining ones can be inferred based on this subset. We evaluate the utility of FABLE's inference of aliases in improving efficiency by iteratively considering the URLs in the Dataset for Inference. Whenever we consider a URL, we first attempt to find an alias via web search and matching backlinks. If we find an alias, we check if all the aliases found so far can help infer an alias for any of the remaining URLs for which we have not yet found an alias. Figure 11 shows that, on many sites, FABLE's inference is able to uncover more aliases without any additional work. In aggregate, 41% of all the aliases that FABLE discovers on the 48 sites are inferred.

Figure 11: For sites with many broken URLs, aliases found using Search and Backlink and the increase enabled by Inference, when FABLE attempts to infer aliases after the discovery of each new alias.
7 Discussion

Our work in developing FABLE leads to a number of interesting directions for future work.

Why do URLs break? When the domain name for a site changes, clients looking to access the old URLs on this site may not even succeed in performing a DNS lookup. But when websites are reorganized – e.g., when they switch from one content management system to another, such as from WordPress to Drupal – it is unclear why redirections cannot automatically be added from the old URL to the new URL for every page. This warrants further investigation.

Persistent URLs. To account for the brittleness of URLs as identifiers of content, we envision a service that works as follows. Any developer who wishes to link to a URL u on their page could instead link to, say, https://persist.url/u. If u is still functional, the service at persist.url simply redirects to u. If not, and if FABLE is able to identify an alias u′ for u, it redirects to u′ instead. If no alias is known but u has been archived, it redirects to the last non-erroneous archived copy of u. If none of these checks succeed, the service still redirects to u and lets the user deal with the erroneous response returned by u's domain.
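Since the service's behavior is just a fixed resolution order, it can be stated precisely in a few lines. The following is a minimal sketch, assuming three hypothetical helpers: is_alive (a liveness probe for u), lookup_alias (a FABLE lookup), and lookup_archived_copy (a Wayback Machine lookup).

    # Minimal sketch of persist.url's resolution order; the three helper
    # callables are hypothetical stand-ins for the components named above.
    def resolve(u, is_alive, lookup_alias, lookup_archived_copy):
        """Return the URL that https://persist.url/u should redirect to."""
        if is_alive(u):                        # u still works: pass through
            return u
        alias = lookup_alias(u)                # FABLE-discovered alias, if any
        if alias is not None:
            return alias
        archived = lookup_archived_copy(u)     # last non-erroneous archived copy
        if archived is not None:
            return archived
        return u                               # fall back; user sees u's error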
Impact of broken URLs on the web's availability. Our datasets from three university websites showed that, in at least two of them, requests to URLs which were once valid but are now broken account for a non-negligible fraction of the site's traffic. An interesting question for future exploration is: what is the corresponding impact of broken URLs on the web's availability in aggregate? Given the variety of ways in which a URL can be rendered dysfunctional (§2.1), answering this question would require analyzing logs collected from a representative sample of web clients.

8 Related Work

Detecting and studying broken URLs. Many tools and algorithms have been developed to detect web decay. FABLE utilizes and improves upon Bar-Yossef et al.'s approach [11] for detecting soft-404 pages, and Parking Sensors [40] identifies broken URLs by detecting parking domains. Prior work has also extensively studied the evolution of the web graph and of web pages, both with respect to the persistence of links [11, 26, 28, 37] and the stability of page content [13, 16, 26, 32]. These studies confirm our observation that a sizeable portion of URLs from the past are now invalid. They have also observed that flux in page content is common, which is why it is often insufficient to serve archived copies in response to requests for broken URLs. In contrast to all of this prior work, our contributions lie in systematically finding the alternate URLs at which the pages originally pointed to by broken URLs now reside.
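The core of soft-404 detection in [11] is a random-probe comparison. The sketch below illustrates that idea only; the similarity measure and threshold are our own illustrative assumptions, not the exact procedure of [11] or of FABLE's improved variant.

    # Minimal sketch of random-probe soft-404 detection in the spirit of [11].
    import secrets
    from difflib import SequenceMatcher
    from urllib.error import HTTPError
    from urllib.parse import urljoin
    from urllib.request import urlopen

    def looks_like_soft_404(url, threshold=0.9):
        # Request a sibling URL with a random (hence almost surely invalid) name.
        probe = urljoin(url, secrets.token_hex(8))
        try:
            junk = urlopen(probe).read().decode(errors="replace")
        except HTTPError:
            return False   # the server returns genuine errors for bogus URLs
        page = urlopen(url).read().decode(errors="replace")
        # Near-identical responses to url and to the bogus probe suggest that
        # the server masks missing pages behind a "soft 404".
        return SequenceMatcher(None, page, junk).ratio() >= threshold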
Fixing broken links. To address the breakage of URLs caused by the reorganization of websites, many efforts have focused on extracting a "lexical signature" from a page's content to serve as a robust hyperlink [23–25, 34, 35] for the page. When a URL no longer works, its lexical signature can be used to find the page via web search. FABLE does the same, but additionally relies on the page's title to search for a page when its last archived copy is stale. FABLE also augments this reliance on web search with techniques for matching backlinks and for inferring new aliases from those already found. Another line of solutions, such as Opal [20] and Wayback Machine's browser extension [8], serves archived copies of pages to clients when they encounter broken links. As we show in Section 6.6, such solutions are not ideal for either users or website providers.

Duplicate page detection. Detection of a web page's duplicates has been widely researched [15, 21, 29, 38]. The focus has primarily been on determining whether two pages/documents are near-duplicates, in order to detect plagiarism and phishing, save storage via de-duplication, etc. Applying prior techniques from this line of work to find duplicates of archived copies is insufficient for finding aliases of broken URLs, because not all URLs are archived and archived copies are often stale.

NDN and CCN. Named data networking [12, 30, 41, 42] and content-centric networking [14, 22] are alternative network architectures in which data is retrieved by specifying the information it contains rather than the host or IP address on which it is stored. If this vision were to become reality, web pages would have more robust identifiers than URLs. However, adopting these proposals requires modifications to all layers of the network stack. In comparison, FABLE provides a solution that is compatible with the legacy web and Internet.

Text-editing by example. FABLE utilizes MS Excel's Flash Fill functionality [7], which is based on a string programming language [17]. Many others have used programming by example [19, 31, 33] to synthesize programs which automate text transformations given input-output example pairs. To the best of our knowledge, FABLE is the first to apply this approach to URL transformations.
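To convey the flavor of example-driven URL transformation (though not Flash Fill's actual algorithm [7, 17]), the sketch below learns a single old-prefix to new-prefix rule from alias pairs and applies it to another broken URL on the same site; the URLs and the restriction to prefix substitutions are illustrative assumptions.

    # Minimal sketch: infer an old-prefix -> new-prefix URL rewrite from
    # (broken URL, alias) examples; real PBE systems learn richer programs.
    def infer_prefix_rule(examples):
        rules = set()
        for old, new in examples:
            # Strip the longest common suffix; the remainders are the prefixes.
            i = 0
            while i < min(len(old), len(new)) and old[-1 - i] == new[-1 - i]:
                i += 1
            rules.add((old[: len(old) - i], new[: len(new) - i]))
        # A usable rule must be consistent across all examples.
        return rules.pop() if len(rules) == 1 else None

    def apply_rule(rule, broken_url):
        if rule is None:
            return None
        old_prefix, new_prefix = rule
        if broken_url.startswith(old_prefix):
            return new_prefix + broken_url[len(old_prefix):]
        return None

    # Hypothetical aliases already confirmed on one site:
    examples = [
        ("http://foo.edu/news/2019/story1", "https://foo.edu/archive/2019/story1"),
        ("http://foo.edu/news/2018/story7", "https://foo.edu/archive/2018/story7"),
    ]
    rule = infer_prefix_rule(examples)
    print(apply_rule(rule, "http://foo.edu/news/2017/story3"))
    # prints: https://foo.edu/archive/2017/story3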
9 Conclusion

Since it is typically assumed that broken URLs on the web point to pages which no longer exist, users look up the archived copy when they encounter such a URL. In contrast to this common wisdom, we found that a sizeable fraction of broken URLs are a consequence of website reorganization, and that the pages they were pointing to still exist. To help users and web providers take advantage of this observation, we presented FABLE, which accurately and efficiently finds the new URL for a page whose old URL no longer works. Based on the promise shown by our analysis of three websites, we believe the aliases found by FABLE can help significantly improve the web's availability at large.

References

[1] chromium/dom-distiller: Distills the DOM. https://github.com/chromium/dom-distiller.
[2] Cloudflare and the Wayback Machine, joining forces for a more reliable Web. https://blog.archive.org/2020/09/17/internet-archive-partners-with-cloudflare-to-help-make-the-web-more-useful-and-reliable/.
[3] Custom Search JSON API – Programmable Search Engine. https://developers.google.com/custom-search/v1/overview.
[4] Interesting People mailing list archives. https://ip.topicbox.com/groups/ip.
[5] Internet Archive: Wayback Machine. https://archive.org/web/.
[6] Newspaper3k: Article scraping & curation. https://newspaper.readthedocs.io/en/latest/.
[7] Using Flash Fill in Excel. https://support.microsoft.com/en-us/office/using-flash-fill-in-excel-3f9bcf1e-db93-4890-94a0-1578341f73f7.
[8] Wayback Machine – Chrome Web Store. https://chrome.google.com/webstore/detail/wayback-machine/fpnmgdkabkmnadcjpehmlllkndpkmiak.
[9] What is a canonical URL? How to master the rel=canonical tag. https://blog.alexa.com/canonical-url/.
[10] What is the Bing Web Search API? – Azure Cognitive Services, Microsoft Docs. https://docs.microsoft.com/en-us/azure/cognitive-services/bing-web-search/overview.
[11] Z. Bar-Yossef, A. Z. Broder, R. Kumar, and A. Tomkins. Sic transit gloria telae: Towards an understanding of the web's decay. In Proceedings of the 13th International Conference on World Wide Web, pages 328–337. ACM, 2004.
[12] S. H. Bouk, S. H. Ahmed, D. Kim, and H. Song. Named-data-networking-based ITS for smart cities. IEEE Communications Magazine, 55(1):105–111, 2017.
[13] J. Cho and H. Garcia-Molina. The evolution of the web and implications for an incremental crawler. Technical report, Stanford University, 1999.
[14] A. Detti, N. Blefari Melazzi, S. Salsano, and M. Pomposini. CONET: A content centric inter-networking architecture. In Proceedings of the ACM SIGCOMM Workshop on Information-Centric Networking, pages 50–55, 2011.
[15] D. Fetterly, M. Manasse, and M. Najork. On the evolution of clusters of near-duplicate web pages. In Proceedings of the First Latin American Web Congress (LA-WEB 2003), pages 37–45. IEEE, 2003.
[16] D. Fetterly, M. Manasse, M. Najork, and J. L. Wiener. A large-scale study of the evolution of web pages. Software: Practice and Experience, 34(2):213–237, 2004.
[17] S. Gulwani. Automating string processing in spreadsheets using input-output examples. ACM SIGPLAN Notices, 46(1):317–330, 2011.
[18] D. C. Halbert. Programming by example. PhD thesis, University of California, Berkeley, 1984.
[19] W. R. Harris and S. Gulwani. Spreadsheet table transformations from examples. ACM SIGPLAN Notices, 46(6):317–328, 2011.
[20] T. L. Harrison and M. L. Nelson. Just-in-time recovery of missing web pages. In Proceedings of the Seventeenth Conference on Hypertext and Hypermedia, pages 145–156, 2006.
[21] M. Henzinger. Finding near-duplicate web pages: A large-scale evaluation of algorithms. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 284–291, 2006.
[22] V. Jacobson, M. Mosko, D. Smetters, and J. Garcia-Luna-Aceves. Content-centric networking. White paper, Palo Alto Research Center, 2007.
[23] M. Klein and M. L. Nelson. Revisiting lexical signatures to (re-)discover web pages. In International Conference on Theory and Practice of Digital Libraries, pages 371–382. Springer, 2008.
[24] M. Klein, J. Shipman, and M. L. Nelson. Is this a good title? In Proceedings of the 21st ACM Conference on Hypertext and Hypermedia, pages 3–12, 2010.
[25] M. Klein, J. Ware, and M. L. Nelson. Rediscovering missing web pages using link neighborhood lexical signatures. In Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, pages 137–140, 2011.
[26] W. Koehler. Web page change and persistence: A four-year longitudinal study. Journal of the American Society for Information Science and Technology, 53(2):162–171, 2002.
[27] C. Kohlschütter, P. Fankhauser, and W. Nejdl. Boilerplate detection using shallow text features. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, pages 441–450, 2010.
[28] S. Lawrence, F. Coetzee, E. Glover, G. Flake, D. Pennock, B. Krovetz, F. Nielsen, A. Kruger, and L. Giles. Persistence of information on the web: Analyzing citations contained in research articles. In Proceedings of the Ninth International Conference on Information and Knowledge Management, pages 235–242, 2000.
[29] G. S. Manku, A. Jain, and A. Das Sarma. Detecting near-duplicates for web crawling. In Proceedings of the 16th International Conference on World Wide Web, pages 141–150, 2007.
[30] S. Mastorakis, A. Afanasyev, and L. Zhang. On the evolution of ndnSIM: An open-source simulator for NDN experimentation. ACM SIGCOMM Computer Communication Review, 47(3):19–33, 2017.
[31] A. Miltner, K. Fisher, B. C. Pierce, D. Walker, and S. Zdancewic. Synthesizing bijective lenses. Proceedings of the ACM on Programming Languages, 2(POPL):1–30, 2017.
[32] A. Ntoulas, J. Cho, and C. Olston. What's new on the web? The evolution of the web from a search engine perspective. In Proceedings of the 13th International Conference on World Wide Web, pages 1–12, 2004.
[33] P.-M. Osera and S. Zdancewic. Type-and-example-directed program synthesis. ACM SIGPLAN Notices, 50(6):619–630, 2015.
[34] S.-T. Park, D. M. Pennock, C. L. Giles, and R. Krovetz. Analysis of lexical signatures for improving information persistence on the World Wide Web. ACM Transactions on Information Systems (TOIS), 22(4):540–572, 2004.
[35] T. A. Phelps and R. Wilensky. Robust hyperlinks cost just five words each. Technical report, University of California, Berkeley, Computer Science Division, 2000.
[36] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523, 1988.
[37] D. Spinellis. The decay and failures of web references. Communications of the ACM, 46(1):71–77, 2003.
[38] M. Theobald, J. Siddharth, and A. Paepcke. SpotSigs: Robust and efficient near duplicate detection in large web collections. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 563–570, 2008.
[39] A. Van den Bosch, T. Bogers, and M. De Kunder. Estimating search engine index size variability: A 9-year longitudinal study. Scientometrics, 107(2):839–856, 2016.
[40] T. Vissers, W. Joosen, and N. Nikiforakis. Parking sensors: Analyzing and detecting parked domains. In Proceedings of the 22nd Network and Distributed System Security Symposium (NDSS 2015). Internet Society, 2015.
[41] L. Zhang, A. Afanasyev, J. Burke, V. Jacobson, K. Claffy, P. Crowley, C. Papadopoulos, L. Wang, and B. Zhang. Named data networking. ACM SIGCOMM Computer Communication Review, 44(3):66–73, 2014.
[42] M. Zhang, V. Lehman, and L. Wang. Scalable name-based data synchronization for named data networking. In IEEE INFOCOM 2017 – IEEE Conference on Computer Communications, pages 1–9. IEEE, 2017.