The Web as a Privacy Lab

Richard Chow, Ji Fang, Philippe Golle and Jessica Staddon
PARC
{rchow,fang,pgolle,staddon}@parc.com

Abstract

The privacy dangers of data proliferation on the Web are well known. Information on the Web has facilitated the de-anonymization of anonymous bloggers, the de-sanitization of government records and the identification of individuals based on search engine queries. What has received less attention is Web mining in support of privacy. In this position paper we argue that the very ability of Web data to breach privacy demonstrates its value as a laboratory for detecting privacy breaches before they happen. In addition, we argue that privacy-invasive services may become privacy-respecting by mining publicly available Web data, with little decrease in performance and efficiency.

Introduction

Privacy breaches resulting from data leaks involve the release of information that was previously unknown, at least on a wide scale. This information may be explicit in the released data or inferable from it. An example of the former is the release of a person's name together with an HIV diagnosis. In contrast, a leak that involves inference might consist of the same diagnosis without the individual's name, but with their date of birth, gender and zip code, a combination of attributes that is known to be identifying for many individuals (Golle 2006; Sweeney 2000). At the root of either scenario is a knowledge discovery question: what knowledge will the recipient of this information likely gain? For privacy purposes, the veracity of this knowledge may matter less than the knowledge itself. For example, the appearance of being HIV positive may for some people (e.g., political figures) be almost as much of a concern as actually having the condition.
Anticipating knowledge discovery requires a model of the recipient's reference materials when ingesting information: the recipient's experiences, memories and any resources they are likely to draw on. We propose that the Web, as a reflection of human knowledge, can approximate these references and thus allow privacy-breaching inferences to be identified before the associated content is released. In addition, we argue that Web mining has strong potential to reduce the need for behavioral tracking of users (i.e., their purchases, browsing habits, etc.). In particular, it may be possible to provide accurate product recommendations to an online shopper based only on the products they are viewing in their current visit, rather than requiring the customer to log into the site so that they can be linked with all previous purchases.

We support our argument in the "Opportunities" section with two examples of Web-based data loss prevention techniques and one example of privacy-enhanced targeted advertising. We also discuss the challenges (and research opportunities) that come with mining the Web for privacy.

Related Work

While this paper argues that the public Web can enable the detection of a broad range of sensitive data, allowing far more powerful data loss prevention (DLP), existing work has demonstrated the usefulness of Web mining for particular privacy problems. In particular, (He, Chu, and Liu 2006; Becker and Chen 2009; Lindamood et al.) demonstrate how mining social network profiles (which may or may not be public) enables the identification of user attributes. In addition, Latanya Sweeney has developed a tool for finding lists of people on the Web, each of which may represent a privacy risk (Sweeney). We also emphasize that our vision is inspired by the use of the Web, as a representation of knowledge, to facilitate the understanding of text. In particular, the natural language processing community has had strong success using the Web as a representation of human language (Nakov and Hearst 2005) and of knowledge (Schoenmackers, Etzioni, and Weld 2008; Popescu and Etzioni 2005).

Opportunities

Sensitive topic detection

The data loss prevention (DLP) market provides tools that monitor content stored in an organization, as well as content about to leave the organization (e.g., through email), and flag any content deemed sensitive. The sensitivity determination typically relies on a manually constructed template. For example, a hospital might construct a template for a HIPAA-protected disease like HIV consisting of medications, treatments and symptoms closely associated with HIV. Such templates can quickly become out of date as, for example, new medications and treatments arise. In addition, it is difficult for a human to efficiently recognize the terms that are likely to suggest the sensitive topic to a content recipient.

The Web provides an environment, or laboratory, for experimenting with a sensitive topic and identifying the terms that are likely to allow the topic to be inferred (Chow, Golle, and Staddon 2008). As a motivating example, consider the recently published autobiography of Valerie Plame (Wilson 2007), a former CIA secret agent whose identity was leaked during the Bush administration. The CIA redacted much of her biography, including the location of her first tour of duty as well as much of the text of the associated chapter (see Figure 1). However, the CIA left unredacted seemingly innocuous details such as the fact that her tour was in Europe, in a chaotic city with many outdoor cafes, heavy traffic and a hot summer climate. At the time of the screenshot in Figure 1, a Google query with these details yielded two top hits about Athens, Greece, which was the location of her tour.

[Figure 1: The left image is a screenshot of the chapter of (Wilson 2007) in which Valerie Plame discusses her first tour of duty with the CIA. The CIA redacted the location of her tour from the chapter title, but seemingly innocuous text extracted from the chapter ("Europe", "chaotic", "outdoor café", "traffic", "summer heat") suggests the actual location of her tour (Athens, Greece) when entered into Google.]
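The term-level test illustrated by this example is easy to prototype against a search interface. The following is a minimal sketch in Python; search_hit_count is a hypothetical wrapper around a Web search API (real APIs differ and return only estimated counts), and the threshold is an illustrative assumption rather than a tuned value.

# Sketch: estimate which unredacted terms could let a reader infer a
# sensitive topic, in the spirit of (Chow, Golle, and Staddon 2008).
# search_hit_count() is a HYPOTHETICAL wrapper around a Web search
# API's estimated result count; it is not a real library call.

def search_hit_count(query: str) -> int:
    """Return the (estimated) number of Web pages matching `query`."""
    raise NotImplementedError("plug in a real search API here")

def topic_affinity(term: str, topic: str) -> float:
    """Fraction of pages containing `term` that also mention `topic`.
    A high value means `term` is strong evidence for `topic` and is
    a candidate for redaction along with the topic itself."""
    term_hits = search_hit_count(f'"{term}"')
    if term_hits == 0:
        return 0.0
    return search_hit_count(f'"{term}" "{topic}"') / term_hits

def revealing_terms(terms, topic, threshold=0.05):
    """Flag terms whose Web association with `topic` exceeds the
    (illustrative) threshold."""
    return [t for t in terms if topic_affinity(t, topic) > threshold]

# In the Plame example, a redactor could test the unredacted details
# against the sensitive topic (the tour location):
details = ["outdoor cafes", "chaotic", "summer heat", "Europe"]
# revealing_terms(details, "Athens, Greece") returns the details most
# likely to let a reader recover the redacted location.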
Sensitive association detection

While DLP technology focuses on protecting information about certain sensitive topics, sensitive information may also consist of the combination, or linkage, of interesting but individually nonsensitive topics. As examples, consider:

• Google Leak: In March of 2006, Google released slides prepared for a presentation to analysts. The notes portion of the slides mentioned likely reductions in AdSense margins going forward. This information constituted an unintended financial projection. Google immediately withdrew the slides and filed corrective paperwork with the SEC.

• Microsoft-Yahoo! Merger: On May 4, 2007, Microsoft's interest in acquiring Yahoo! became public. While merger planning had been in process for many months at that point, it was carefully kept secret in order not to jeopardize the negotiations.

• "Deep Throat" Identification: On May 31, 2005, former Deputy Director of the FBI, William Felt, revealed himself to be the "Deep Throat" informant who leaked information regarding the Nixon administration's involvement in the first Watergate break-in. Felt had carefully guarded his close association with Watergate for many years, partly out of fear of the political and legal consequences.

In each of these incidents, the sensitivity of the event is due to the pairing of two interesting topics that were previously not known to be related. Specifically, the pairings are AdSense with falling margins, Microsoft with Yahoo!, and Deep Throat with William Felt. We argue that such sensitive pairings may be efficiently detected by measuring the Web support of the pairing and comparing it with the Web support of the individual topics. In particular, each topic on its own should have high Web support (reflecting high interest), while the pairing should have low support (reflecting its novelty). Of course, not all such pairings are sensitive. For example, the hypothetical news that Obama has taken up knitting meets the above criteria (both Obama and knitting have high Web support on their own, but not together) and may be interesting, but it is unlikely to be sensitive. However, such a test, with the correct parsing of the content, is simple to implement and may discover a collection of potentially sensitive events that can then be reviewed by a human and culled as necessary.

To illustrate this idea in the context of the three events above, Table 1 shows how the support for the individual topics and for their pairing varies over time. We use Google Groups (an archive of Usenet postings) (GGr) because queryable instances of the Web as it existed in the past are not available, while Google Groups supports keyword search over any time period from January 1, 1981 to the present. To the extent that Google Groups support is correlated with Web support over time, this provides an approximation to Web hit counts in the same time period.

Term 1         Postings with Term 1     Term 2      Postings with Term 2     Postings with Both       Postings with Both in
               (1/1/81 to event date)               (1/1/81 to event date)   (1/1/81 to event date)   Year Following Event
AdSense        45,200                   margin      1,580,000                48                       1,230
Microsoft      11,900,000               Yahoo!      11,700,000               420                      223,000
William Felt   24,800                   Watergate   148,000                  495                      659

Table 1: Each of the three events is characterized by popular terms that have low support in conjunction. The table gives hit counts on Google Groups (GGr) from 1/1/1981 until the day of the event for the individual terms and their conjunction, and contrasts the conjunction hit count with the hit count for the same conjunction in the year following the event. In all cases the conjunction's support started very low and grew enormously in the single year following the event, indicating the events were both interesting and novel, and thus potentially sensitive.
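This hit-count test is mechanical enough to sketch directly. Below is a minimal illustration; hit_count is a hypothetical wrapper for a date-restricted search service in the style of Google Groups, and the thresholds are illustrative assumptions, not values from our experiments.

# Sketch: flag a topic pairing as potentially sensitive when each
# topic has high historical support but the pairing has very low
# support. hit_count() is a HYPOTHETICAL date-restricted search
# wrapper (Google Groups-style); thresholds are illustrative only.

from datetime import date

EPOCH = date(1981, 1, 1)  # earliest date searchable on Google Groups

def hit_count(query: str, start: date, end: date) -> int:
    """Return the (estimated) number of postings matching `query`
    between `start` and `end`."""
    raise NotImplementedError("plug in a date-restricted search API")

def potentially_sensitive(term1: str, term2: str, as_of: date,
                          min_support: int = 10_000,
                          max_pair_support: int = 500) -> bool:
    """True if both terms are popular on their own (high interest)
    but rarely seen together (novelty) as of the given date."""
    t1 = hit_count(term1, EPOCH, as_of)
    t2 = hit_count(term2, EPOCH, as_of)
    both = hit_count(f"{term1} {term2}", EPOCH, as_of)
    return (t1 >= min_support and t2 >= min_support
            and both <= max_pair_support)

# With the Table 1 numbers, ("AdSense", "margin") has supports of
# 45,200 and 1,580,000 but only 48 joint postings before the leak,
# so the pairing would be flagged for human review.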
Preference Prediction

Targeted advertising based on user behavior (e.g., Web browsing, purchase history) is increasingly prevalent and brings with it obvious privacy concerns (for a recent example, see (Opsahl 2009)). While behavior tracking is clearly a sufficient input for predicting user preferences, we argue it is not necessary. Rather, the information publicly available on the Web can potentially afford advertisers the same precision in targeting with far less user behavior data, and hence better privacy.

This is not an entirely new idea. Social media analysis providers (e.g., Attentio, BuzzLogic, Andiamo Systems) have long monitored the Web to gauge the success of advertising campaigns and public reaction to events. However, these companies generally treat the monitoring process as largely manual, and make it efficient by identifying a handful of influential content sources (e.g., bloggers with strong followings in their customers' demographic) to track by hand. We argue that advances in data mining and natural language processing, together with the continued growth of the Web, enable a far more automated approach to preference prediction. In particular, we conjecture that related products are likely to co-occur frequently on the Web, and that data mining can detect this relation. Such related products are good candidates for positive or negative recommendations: the products may be mentioned together either because consumers of one product tend to enjoy the other or, conversely, because consumers prefer one product over the other. Natural language processing, specifically sentiment analysis, can potentially analyze public opinions of the related products and thus distinguish between the two cases. As an illustrative example, consider the two movies "The Virgin Suicides" and "Godfather III". These movies are frequently mentioned together; in particular, at the time of writing, 13% of the Web documents mentioning "The Virgin Suicides" also mentioned "Godfather III", according to Google. However, as the Web excerpts in Figure 2 indicate, the relation is not a positive one.

[Figure 2: Three snippets of Web pages comparing the movies "The Virgin Suicides" and "Godfather III". In each case the author expresses a negative view of "Godfather III" and a positive view of "The Virgin Suicides". Sentiment analysis can potentially determine that there is a negative association between the two movies and hence that "Godfather III" should not be recommended to fans of "The Virgin Suicides" despite their strong association.]
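As a rough illustration of how co-occurrence mining and sentiment analysis might be combined, consider the following sketch. Both fetch_cooccurring_snippets and the lexicon-based polarity scorer are hypothetical stand-ins: a real system would use a crawler or search API and a proper sentiment classifier.

# Sketch: decide whether two frequently co-mentioned products form a
# positive or a negative recommendation pair.
# fetch_cooccurring_snippets() is a HYPOTHETICAL stand-in for a
# crawler or search wrapper, and the lexicon-based polarity scorer
# is a toy stand-in for a real sentiment classifier.

POSITIVE = {"great", "love", "loved", "brilliant", "masterpiece"}
NEGATIVE = {"awful", "hate", "disappointing", "worst", "boring"}

def fetch_cooccurring_snippets(item_a: str, item_b: str) -> list:
    """Return Web text snippets mentioning both items (hypothetical)."""
    raise NotImplementedError("plug in a search/crawl backend")

def snippet_polarity(snippet: str) -> int:
    """Toy polarity score: positive minus negative lexicon hits."""
    words = [w.strip('.,!?";:') for w in snippet.lower().split()]
    return (sum(w in POSITIVE for w in words)
            - sum(w in NEGATIVE for w in words))

def recommend_pair(item_a: str, item_b: str) -> bool:
    """True if pages co-mentioning the items skew positive overall."""
    snippets = fetch_cooccurring_snippets(item_a, item_b)
    return sum(snippet_polarity(s) for s in snippets) > 0

# recommend_pair("The Virgin Suicides", "Godfather III") should come
# out False if, as in Figure 2, the co-mentions are largely negative.

Note that a real system would need to attribute sentiment to each product separately (a page may praise one movie while panning the other), which the toy whole-snippet scorer above cannot do.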
Challenges

Before our vision for Web-enabled privacy can become a reality, a number of technical, business and security challenges must be addressed.

The Web is an imperfect proxy for human knowledge. We propose to use information publicly available on the Web as a proxy for human knowledge. This proxy, however, is often sparse in coverage and sometimes incorrect. Also, when modeling privacy, the conclusions drawn by an adversary may depend on the adversary's private knowledge. For instance, a family member may be able to make more inferences from a person's data than a stranger can. To model this sort of private knowledge, one might consider using non-public information, such as data collected by restricted social networks like Facebook or MySpace, private individual information (email, phone or medical records), or the purchase records collected by online retailers. All such information, however, is currently entirely off-limits to search engines.

Search engines offer imperfect access to Web content. Search engines excel at indexing textual data, but they remain much less successful at indexing non-textual data such as images, videos or geographic location traces. This non-textual data constitutes a growing fraction of Web content that is of critical importance to privacy, yet it is poorly indexed by search engines and thus largely inaccessible to our privacy algorithms. Even the textual indices offered by search engines raise problems: they include various optimizations (such as spelling corrections) that facilitate search but may introduce noise into our privacy algorithms. In addition, while search engines support Boolean queries (e.g., a query for any pages containing term 1 or term 2, and at least one of term 3 or term 4), the results are often inconsistent. For example, a query for "Term1 Term2" sometimes produces more hits than a query for "Term1" alone, even though the conjunction should match a subset of those pages. The results may also be sensitive to term order, the use of quotation marks, capitalization, and so on. These inconsistencies are difficult to detect and correct for efficiently because search engine algorithms are largely opaque.
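Some of these inconsistencies can at least be detected, and partially damped, mechanically. The sketch below reuses the hypothetical search_hit_count wrapper from the first sketch; it checks the monotonicity property that adding a conjunct should never increase the hit count, and takes a median over equivalent query variants to reduce sensitivity to ordering and quoting.

# Sketch: detect hit-count inconsistencies before trusting them.
# Adding a conjunct can only shrink the true result set, so a larger
# count for the conjunction signals estimator noise.
# search_hit_count() is the same HYPOTHETICAL wrapper as above.

def search_hit_count(query: str) -> int:
    raise NotImplementedError("plug in a real search API here")

def consistent_conjunction(term1: str, term2: str) -> bool:
    """True if hits(term1 AND term2) <= min(hits(term1), hits(term2))."""
    h1 = search_hit_count(f'"{term1}"')
    h2 = search_hit_count(f'"{term2}"')
    both = search_hit_count(f'"{term1}" "{term2}"')
    return both <= min(h1, h2)

def robust_hits(query: str, variants: list) -> int:
    """Median count over equivalent query variants (reordered terms,
    quoting changes) to damp the sensitivity described above."""
    counts = sorted(search_hit_count(v) for v in [query] + variants)
    return counts[len(counts) // 2]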
Limitations of keyword search. Our privacy algorithms access Web content via the keyword search interface offered by search engines. Keyword search is a powerful tool for exploring simple relationships between a pair (or small number) of items based on co-occurrences, but more complex relationships among multiple items (such as inferences conditioned on multiple conditions) cannot be represented efficiently as keyword queries. Another problem is that keywords can be ambiguous; our experiments suggest that additional context helps disambiguate related keywords. Finally, keyword search is limited as a content retrieval method. Natural language processing technology would help gain a better understanding of content, but at significant computational cost.

Business challenges. Different applications have different performance requirements in terms of precision and recall. For example, a marketing application can probably tolerate much higher false positive and false negative rates than an application of our privacy technology to data loss prevention (DLP). An open business question is to determine the applications for which our technology offers sufficiently high precision and recall. Another business concern is that our technology relies on a large number of queries to search engines; the technology may therefore need to be developed in partnership with a search engine.

Security and privacy challenges. The queries our algorithms issue to search engines in the course of determining the sensitivity of some piece of information may reveal to the search engine what information is under investigation. Besides this privacy concern, another security challenge is that our technology might be open to manipulation by malicious publishers of Web content, who could attempt to influence the results of our privacy analysis in much the same way that search engine optimizers (SEOs) attempt to influence the rankings of search engines.

Conclusion

We have argued that the abundance of data on the Web that gives rise to privacy concerns can also be used to support privacy, by enabling the detection of potential privacy breaches before they happen and by making targeted recommendations possible with limited personal user information. Achieving this vision comes with challenges, many of which center on the difficulty of mining the Web through a search engine that has different goals (namely, query interpretation) and limited features. That said, we are strong proponents of Web mining for privacy goals, and we anticipate that with additional research and development on the search engine side, valuable privacy technologies are possible.

References

Becker, J., and Chen, H. 2009. Measuring privacy risk in online social networks. In Proceedings of IEEE Web 2.0 Security and Privacy.

Chow, R.; Golle, P.; and Staddon, J. 2008. Detecting privacy leaks using corpus-based association rules. In KDD '08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 893–901. New York, NY, USA: ACM.

Golle, P. 2006. Revisiting the uniqueness of simple demographics in the US population. In WPES '06: Proceedings of the 5th ACM Workshop on Privacy in the Electronic Society, 77–80. New York, NY, USA: ACM.

Google Groups. http://groups.google.com.

He, J.; Chu, W.; and Liu, Z. 2006. Inferring privacy information from social networks. In Intelligence and Security Informatics.

Lindamood, J.; Heatherly, R.; Kantarcioglu, M.; and Thuraisingham, B. Inferring private information using social network data.

Nakov, P., and Hearst, M. 2005. Using the web as an implicit training set: application to structural ambiguity resolution. In HLT '05: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, 835–842. Morristown, NJ, USA: Association for Computational Linguistics.

Opsahl, K. 2009. Google begins behavioral targeting ad program. World Wide Web electronic publication.

Popescu, A.-M., and Etzioni, O. 2005. Extracting product features and opinions from reviews. In HLT/EMNLP.

Schoenmackers, S.; Etzioni, O.; and Weld, D. S. 2008. Scaling textual inference to the web. In EMNLP, 79–88.

Sweeney, L. Finding lists of people on the web.

Sweeney, L. 2000. Uniqueness of simple demographics in the U.S. population.

Wilson, V. 2007. Fair Game: My Life as a Spy, My Betrayal by the White House. Simon and Schuster.