The Web as a Privacy Lab

Richard Chow and Ji Fang and Philippe Golle and Jessica Staddon
PARC
{rchow,fang,pgolle,staddon}@parc.com
Abstract
The privacy dangers of data proliferation on the Web are well known. Information on the Web has facilitated the de-anonymization of anonymous bloggers, the de-sanitization of government records and the identification of individuals based on search engine queries. What has received less attention is Web mining in support of privacy. In this position paper we argue that the very ability of Web data to breach privacy demonstrates its value as a laboratory for detecting privacy breaches before they happen. In addition, we argue that privacy-invasive services may become privacy-respecting by mining publicly available Web data, with little decrease in performance and efficiency.
Introduction
Privacy breaches as a result of data leaks involve the release of information that was previously unknown, at least on a wide scale. This information may be explicit in the released content or inferable from it. An example of the former is the release of a person's name together with an HIV diagnosis. In contrast, a leak that involves inference might consist of the same diagnosis without the individual's name, but with their date of birth, gender and zip code, a collection of attributes that is known to be identifying for many individuals (Golle 2006; Sweeney 2000).
At the root of either scenario is a knowledge discovery question: what knowledge will the recipient of this information likely gain? For privacy considerations, the veracity of this knowledge may be less important than the knowledge itself. For example, the appearance of being HIV-positive may, for some people (e.g., political figures), be almost as much of a concern as actually having the condition.
Anticipating knowledge discovery requires a model of the reference materials the recipient brings to bear when ingesting information: for example, the recipient's experiences, memories and any resources they are likely to draw on. We propose that the Web, as a reflection of human knowledge, can approximate these references and thus allow privacy-breaching inferences to be identified before the associated content is released. In addition, we argue that Web mining has a strong potential to reduce the need for behavioral tracking of users (i.e., their purchases, browsing habits, etc.). In particular, it may be possible to provide accurate product recommendations to an online shopper based only on the products they are viewing in their current visit, as opposed to requiring the customer to log into the site so that they can be linked with all previous purchases.

We support our argument in the “Opportunities” section with two examples of Web-based data loss prevention techniques and one example of privacy-enhanced targeted advertising. We also discuss the challenges (and research opportunities) that come with mining the Web for privacy.

Related Work

While this paper argues that the public Web can enable the detection of a broad range of sensitive data, allowing far more powerful data loss prevention (DLP), existing work has demonstrated the usefulness of Web mining for particular privacy problems. In particular, (He, Chu, and Liu 2006; Becker and Chen 2009; Lindamood et al. 2009) demonstrate how mining social network profiles (which may or may not be public) enables the identification of user attributes. In addition, Latanya Sweeney has developed a tool for finding lists of people, each of which may represent a privacy risk (Sweeney).

We also emphasize that our vision is inspired by the use of the Web, as a representation of knowledge, to facilitate the understanding of text. In particular, the Natural Language Processing community has had strong success using the Web as a representation of human language (Nakov and Hearst 2005) and knowledge (Schoenmackers, Etzioni, and Weld 2008; Popescu and Etzioni 2005).
Opportunities
Sensitive topic detection
The data loss prevention (DLP) market provides tools that monitor content stored in an organization, as well as content that is about to leave the organization (e.g., through email), and flag any content deemed sensitive. Sensitivity is typically determined by a manually constructed template. For example, a hospital might construct a template for a HIPAA-protected disease like HIV consisting of medications, treatments and symptoms closely associated with HIV.
Such templates can quickly become outdated as, for example, new medications and treatments for a disease like HIV emerge. In addition, it is difficult for a human to efficiently recognize the terms that are likely to suggest the sensitive topic to a content recipient. The Web provides an environment, or laboratory, for experimenting with a sensitive topic and identifying the terms that are likely to allow the topic to be inferred (Chow, Golle, and Staddon 2008). As a motivating example, consider the recently published autobiography of Valerie Plame (Wilson 2007), a former CIA secret agent whose identity was leaked during the Bush administration. The CIA redacted much of her autobiography, including the location of her first tour of duty as well as much of the text of the associated chapter (see the excerpt in Figure 1). However, the CIA left unredacted seemingly innocuous details, such as the fact that her tour was in Europe, in a chaotic city with a lot of outdoor cafes, traffic and a hot summer climate. At the time of the screenshot in Figure 1, a Google query with these details yielded two top hits about Athens, Greece, which was the location of her tour.

Figure 1: The left image is a screenshot of the chapter of (Wilson 2007) in which Valerie Plame discusses her first tour of duty with the CIA. The CIA redacted the location of her tour from the chapter title, but the right-hand side shows how seemingly innocuous text extracted from the chapter ("Europe", "chaotic", "outdoor café", "traffic", "summer heat") suggests the actual location of her tour (Athens, Greece) when entered into Google.
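This test is easy to mechanize. The sketch below, in the spirit of the corpus-based association rules of (Chow, Golle, and Staddon 2008), estimates how strongly a set of unredacted terms suggests a redacted topic from hit counts alone; the `hits` oracle stands in for whatever search API is available, and the threshold is an illustrative assumption, not a value from the original system.

```python
from typing import Callable, List

def inference_confidence(hits: Callable[[str], int],
                         terms: List[str], topic: str) -> float:
    """Confidence of the association rule terms -> topic: the fraction
    of documents matching every term that also mention the topic."""
    base = hits(" ".join(f'"{t}"' for t in terms))
    if base == 0:
        return 0.0
    joint = hits(" ".join(f'"{t}"' for t in terms + [topic]))
    return joint / base

def redaction_leaks(hits: Callable[[str], int],
                    terms: List[str], topic: str,
                    threshold: float = 0.1) -> bool:
    """Flag a redaction as ineffective when the surviving terms still
    point to the redacted topic; the threshold is illustrative."""
    return inference_confidence(hits, terms, topic) >= threshold
```

For the Plame chapter, `terms` would be the unredacted details ("Europe", "chaotic", "outdoor cafe", ...) and `topic` a candidate location such as "Athens"; a high confidence signals that redacting the chapter title alone is ineffective.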
Sensitive association detection

While DLP technology focuses on protecting information about certain sensitive topics, sensitive information may also consist of the combination, or linkage, between interesting but individually nonsensitive topics. As examples, consider:

• Google Leak: In March of 2006, Google released slides prepared for a presentation to analysts. The notes portion of the slides mentioned likely reductions in AdSense margins going forward. This information constituted an unintended financial projection. Google immediately withdrew the slides and filed corrective paperwork with the SEC.

• Microsoft-Yahoo! Merger: On May 4, 2007, Microsoft's interest in acquiring Yahoo! became public. While merger planning had been in process for many months at that point, it was carefully kept secret in order not to jeopardize the negotiations.

• “Deep Throat” Identification: On May 31, 2005, the former Deputy Director of the FBI, William Felt, revealed himself to be the “Deep Throat” informant who leaked information about the Nixon administration's involvement in the first Watergate break-in. Felt had carefully guarded his close association with Watergate for many years, partly out of fear of the political and legal consequences.
In each of these incidents, the sensitivity of the event is due to the pairing of two interesting topics that were previously not known to be related. Specifically, the pairings are AdSense with falling margins, Microsoft with Yahoo!, and Deep Throat with William Felt. We argue that such sensitive pairings may be efficiently detected by measuring the Web support of the pairing and comparing it with the Web support of the individual topics. In particular, each topic should have high Web support (reflecting high interest), while the topic pairing should have low support (reflecting its novelty). Of course, not all such pairings are sensitive. For example, the news that Obama has taken up knitting meets the above criteria (both Obama and knitting have high Web support on their own, but not together) and may be interesting, but it is unlikely to be sensitive. However, we argue that such a test, with the correct parsing of the content, is simple to implement and may discover a collection of potentially sensitive events that can be reviewed by a human and culled as necessary.
To illustrate this idea in the context of the three events mentioned above, we show in Table 1 how the support for the individual topics and for the pairs varies over time. We use Google Groups, an archive of Usenet postings (GGr), because queryable instances of the Web as it existed in the past are not available, while Google Groups supports keyword search over any time period from January 1, 1981 to the present. To the extent that Google Groups support is correlated with Web support over time, it provides an approximation to Web hit counts in the same time period.
Term 1          Term 2       Postings w/   Postings w/   Postings w/   Postings w/ both,
                             term 1        term 2        both          year after event
AdSense         margin           45,200     1,580,000            48              1,230
Microsoft       Yahoo!       11,900,000    11,700,000           420            223,000
William Felt    Watergate        24,800       148,000           495                659

Table 1: Each of the three events is characterized by popular terms that have low support in conjunction. The table reports hit counts on Google Groups (GGr) from 1/1/1981 until the day of the event for the individual terms and their conjunction, and contrasts the conjunction's hit count with the hit count for the same conjunction in the year following the event. In all cases the conjunction's support started very low and grew enormously in the single year following the event, indicating the events were both interesting and novel, and thus potentially sensitive.
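A minimal sketch of the pairing test, run on the before-event hit counts from Table 1, follows; the rarity score and its interpretation are our illustrative rendering of the test, not a calibrated tool.

```python
# Pairing test sketch: two individually popular topics whose
# conjunction is rare are candidates for sensitive associations.
# The hit counts below are the before-event Google Groups counts
# reported in Table 1 (term 1, term 2, conjunction).

before_event = {
    ("AdSense", "margin"):         (45_200, 1_580_000, 48),
    ("Microsoft", "Yahoo!"):       (11_900_000, 11_700_000, 420),
    ("William Felt", "Watergate"): (24_800, 148_000, 495),
}

def pair_rarity(h1: int, h2: int, both: int) -> float:
    """Joint support relative to the rarer of the two topics; a small
    value means two individually popular topics rarely co-occur."""
    return both / min(h1, h2)

for (t1, t2), (h1, h2, both) in before_event.items():
    # A low rarity score together with high individual counts marks
    # the pairing as novel and interesting, hence worth human review.
    print(f"{t1} + {t2}: rarity = {pair_rarity(h1, h2, both):.2e}")
```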
Preference Prediction
Targeted advertising based on user behavior (e.g., Web browsing, purchase history, etc.) is increasingly prevalent and brings with it obvious privacy concerns (for a recent example, see (Opsahl 2009)). While behavior tracking is clearly a sufficient input for predicting user preferences, we argue it is not necessary. Rather, the publicly available information on the Web can potentially afford advertisers the same precision in targeting with far less user behavior data, and hence better privacy. This is not an entirely new idea. Social media analysis providers (e.g., Attentio, BuzzLogic, Andiamo Systems) have long monitored the Web to gauge the success of advertising campaigns and public reaction to events. However, these companies generally expect the monitoring process to be largely manual, and instead make it efficient by identifying a handful of influential content sources (e.g., bloggers with strong followings in their customer's demographic) to track by hand. We argue that advances in data mining and natural language processing, together with the continued growth of the Web, enable a far more automated approach to preference prediction. In particular, we conjecture that related products are likely to co-occur frequently on the Web, and that data mining can detect this relation. Such related products are good candidates for positive or negative recommendations. That is, the products may be mentioned together either because consumers of one product tend to enjoy the other, or conversely because consumers prefer one product over the other. Natural language processing, specifically sentiment analysis, can potentially analyze public opinion regarding the related products and thus distinguish between the two cases. As an illustrative example, consider the two movies "The Virgin Suicides" and "Godfather III". These two movies are frequently mentioned together; in particular, at the time of writing, 13% of the Web documents mentioning "The Virgin Suicides" also mentioned "Godfather III", according to Google. However, as the Web excerpts in Figure 2 indicate, the relation is not a positive one.

Figure 2: Three snippets of Web pages comparing the movies "The Virgin Suicides" and "Godfather III". In each case the author expresses a negative view of "Godfather III" and a positive view of "The Virgin Suicides". Sentiment analysis can potentially determine that there is a negative association between the two movies, and hence that "Godfather III" should not be recommended to fans of "The Virgin Suicides" despite their strong association.
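The sketch below illustrates this two-step pipeline: a co-occurrence rate gates whether any recommendation is made, and a crude sentiment score over retrieved snippets picks its sign. The tiny word-list lexicon, the snippet source and the thresholds are placeholders we introduce for illustration; a real system would use a trained sentiment analyzer and a live search back end.

```python
from typing import List

# Purely illustrative sentiment lexicon; a real system would use a
# trained sentiment analyzer instead of word lists.
POSITIVE = {"beautiful", "great", "love", "masterpiece", "wonderful"}
NEGATIVE = {"awful", "boring", "disappointment", "hate", "worst"}

def snippet_polarity(snippet: str) -> int:
    """Positive-minus-negative word count for one Web snippet."""
    words = set(snippet.lower().split())
    return len(words & POSITIVE) - len(words & NEGATIVE)

def recommendation(product_a: str, product_b: str, co_rate: float,
                   snippets: List[str], min_co_rate: float = 0.05) -> str:
    """co_rate: fraction of documents mentioning product_a that also
    mention product_b (0.13 for the two movies above). snippets: text
    excerpts retrieved around the co-mentions."""
    if co_rate < min_co_rate:          # weak association: stay silent
        return "no recommendation"
    polarity = sum(snippet_polarity(s) for s in snippets)
    if polarity > 0:
        return f"fans of {product_a} may enjoy {product_b}"
    return f"do not recommend {product_b} to fans of {product_a}"
```

For the example above, snippets like those in Figure 2 would drive the polarity negative, so "Godfather III" would be withheld from fans of "The Virgin Suicides" even though the two movies are strongly associated.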
Challenges

Before our vision for Web-enabled privacy can become a reality, we must consider a number of technical, business and security challenges.

The Web is an imperfect proxy for human knowledge. We propose to use information publicly available on the Web as a proxy for human knowledge. This proxy, however, is often sparse in coverage and sometimes incorrect. Also, when modelling privacy, the conclusions drawn by an adversary may depend on the adversary's private knowledge. For instance, a family member may be able to make more inferences from a person's data than a stranger. To model this sort of private knowledge, one might consider using non-public information, such as data collected by restricted social networks like Facebook or MySpace, private individual information (email, phone, or medical records), or the purchase records collected by online retailers. All such information, however, is currently entirely off-limits to search engines.

Search engines offer imperfect access to Web content. Search engines excel at indexing textual data, but they remain much less successful at indexing non-textual data, such as images, videos or geographic location traces. This non-textual data constitutes a growing fraction of Web content that is of critical importance to privacy, yet it is poorly indexed by search engines and thus largely inaccessible to our privacy algorithms. Even the textual indices offered by search engines raise problems: they include various optimizations (such as spelling corrections) that facilitate search but may introduce noise into our privacy algorithms. In addition, while search engines support Boolean queries (e.g., a query for any page containing term 1 or term 2, and at least one of term 3 or term 4), the results are often inconsistent. For example, a query for "Term1 Term2" sometimes produces more hits than a query for "Term1" alone. The results may also be sensitive to term order, the use of quotation marks, capitalization, etc. These inconsistencies are difficult to detect and correct for efficiently because search engine algorithms are largely opaque.
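One cheap mitigation is to sanity-check the counts a search engine reports before feeding them to a privacy algorithm. The sketch below checks the basic containment property that a conjunctive query can match no more pages than either of its terms; the `hits` oracle is the same assumed search-API wrapper as in the earlier sketches.

```python
from typing import Callable

def counts_consistent(hits: Callable[[str], int], t1: str, t2: str) -> bool:
    """True when reported hit counts respect set containment: pages
    matching both terms cannot outnumber pages matching either one."""
    both = hits(f'"{t1}" "{t2}"')
    return both <= hits(f'"{t1}"') and both <= hits(f'"{t2}"')
```

Counts that fail this check are estimates at best; a privacy analysis should either re-query with stricter syntax or widen its error margins accordingly.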
Limitations of keyword search. Our privacy algorithms access Web content via the keyword search interface offered by search engines. Keyword search is a powerful tool for exploring simple relationships between a pair (or small number) of items based on co-occurrences. More complex relationships between multiple items (such as inferences conditioned on multiple conditions) cannot be represented efficiently as keyword queries. Another problem is that keywords can be ambiguous: our experiments suggest that additional context helps disambiguate related keywords. Finally, keyword search is limited as a content retrieval method. Natural language processing technology would help gain a better understanding of content, but at a significant computational cost.

Business challenges. Different applications have different performance requirements in terms of precision and recall. For example, a marketing application can probably tolerate much higher false positive and false negative rates than an application of our privacy technology to data loss prevention (DLP). An open business question is to determine the applications for which our technology offers sufficiently high precision and recall. Another business concern is that our technology relies on a large number of queries to search engines, so it may need to be deployed in partnership with a search engine.

Security and privacy challenges. The queries our algorithms issue to search engines in the course of determining the sensitivity of some piece of information may reveal to the search engine what information is under investigation. Besides this privacy concern, another security challenge is that our technology might be open to manipulation by malicious publishers of Web content, who could attempt to influence the results of our privacy analysis in much the same way that search engine optimizers (SEOs) attempt to influence the rankings of search engines.
Conclusion

We have argued that the abundance of data on the Web that leads to privacy concerns can also be used to support privacy, by enabling the detection of potential privacy breaches before they happen and by making targeted recommendations possible with limited personal user information. Achieving this vision comes with challenges, many of which center on the difficulty of mining the Web through a search engine that has different goals (namely, query interpretation) and limited features. That said, we are strong proponents of Web mining for privacy goals, and we anticipate that, with additional research and development on the search engine side, valuable privacy technologies are possible.
References
Becker, J., and Chen, H. 2009. Measuring privacy risk in online social networks. In Proceedings of the IEEE Workshop on Web 2.0 Security and Privacy (W2SP).
Chow, R.; Golle, P.; and Staddon, J. 2008. Detecting privacy leaks using corpus-based association rules. In KDD '08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 893–901. New York, NY, USA: ACM.
GGr. Google Groups. http://groups.google.com.
Golle, P. 2006. Revisiting the uniqueness of simple demographics in the U.S. population. In WPES '06: Proceedings
of the 5th ACM workshop on Privacy in electronic society,
77–80. New York, NY, USA: ACM.
He, J.; Chu, W.; and Liu, Z. 2006. Inferring privacy information from social networks. In Intelligence and Security Informatics.
Lindamood, J.; Heatherly, R.; Kantarcioglu, M.; and Thuraisingham, B. 2009. Inferring private information using social network data. In WWW '09.
Nakov, P., and Hearst, M. 2005. Using the web as an implicit training set: application to structural ambiguity resolution. In HLT ’05: Proceedings of the conference on
Human Language Technology and Empirical Methods in
Natural Language Processing, 835–842. Morristown, NJ,
USA: Association for Computational Linguistics.
Opsahl, K. 2009. Google begins behavioral targeting ad
program. World Wide Web electronic publication.
Popescu, A.-M., and Etzioni, O. 2005. Extracting product
features and opinions from reviews. In HLT/EMNLP.
Schoenmackers, S.; Etzioni, O.; and Weld, D. S. 2008.
Scaling textual inference to the web. In EMNLP, 79–88.
Sweeney, L. Finding lists of people on the web.
Sweeney, L. 2000. Uniqueness of simple demographics in the U.S. population. Technical report, Carnegie Mellon University, Laboratory for International Data Privacy.
Wilson, V. 2007. Fair Game: My Life as a Spy, My Betrayal by the White House. Simon and Schuster.