Googling me, googling you: is there something we can do?
(Controlling the Googled view - a technological approach)
Eerke Boiten, Guy Banister, Ashish Bhati
School of Computing, University of Kent, Canterbury, UK
Abstract
The Right To Be Forgotten as a feature of the new EU Data Protection Directive raises many legal and technological questions, to the extent that it
seems unlikely that its objectives will be achieved in full. One particular context where such a right would be useful is “social recruitment”, prospective
employers using social networks and search engines to find out more about
potential employees. In the absence of legal avenues for controlling what
such employers might find out, we investigate technological alternatives.
This paper reports on experiments using CVs and Google search, leading to methods for taking some control of the image of an individual that is generated through online search.
1.1 Introduction
The “Right To Be Forgotten” was proposed by European Commissioner Viviane Reding in 2010 [1] and included as part of the European Commission’s revised draft Data Protection Directive in 2012 [2]. It is defined as an extension of the principle of data minimisation which already existed in the 1995 Data Protection Directive [3]. The Right is introduced as follows [4]:
Any person should have the right to have personal data concerning
them rectified and a 'right to be forgotten' where the retention of
such data is not in compliance with this Regulation. In particular, data subjects should have the right that their personal data are erased
and no longer processed, where the data are no longer necessary in
relation to the purposes for which the data are collected or otherwise
processed, where data subjects have withdrawn their consent for
processing or where they object to the processing of personal data
concerning them or where the processing of their personal data otherwise does not comply with this Regulation.
The legal aspects of this new right have been debated extensively ever since the idea was first mooted, and a thorough technological analysis of the proposals has been given in an ENISA report by Druschel, Backes and Tirtea [5]. Without wishing to trivialise any of the arguments in support of the feasibility of this new right, we conclude from the current state of the legal and technological debate that it is unlikely that its higher-level objectives will be fully achieved in the near future, if ever. Thus, there remains a need for individuals to protect themselves in areas where data protection legislation, certainly for the time being, does not. Our research considers what methods are available to individuals for this purpose, concentrating on the particular context of “social recruitment”.

[1] V. Reding, “Citizenship Privacy matters – Why the EU needs new personal data protection rules”, The European Data Protection and Privacy Conference, Brussels, 30 Nov 2010, http://europa.eu/rapid/press-release_SPEECH-10-700_en.htm (last accessed 4/10/2013).

[2] Commission Proposals on the Data Protection Reform: Regulation, 25 Jan 2012, http://ec.europa.eu/justice/data-protection/document/review2012/com_2012_11_en.pdf (last accessed 2/10/2013).

[3] Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data, http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=CELEX:31995L0046:EN:HTML (last accessed 2/10/2013). Article 6 includes “[data kept should be …] not excessive in relation to the purposes for which they are collected and/or further processed [… and] kept in a form which permits identification of data subjects for no longer than is necessary for the purposes”.

[4] Data Protection Reform, preamble (53). Subsequent preambles add further requirements.

[5] P. Druschel, M. Backes, and R. Tirtea, “The right to be forgotten – between expectations and practice”, European Network and Information Security Agency, November 2012, http://www.enisa.europa.eu/activities/identity-and-trust/library/deliverables/the-right-to-be-forgotten (last accessed 2/10/2013).
The rest of this paper is set out as follows. We first describe the setting in
more detail. Then we introduce a taxonomy of personal data and online resources, and describe the problem of data minimisation and the right to be
forgotten in the different scenarios arising from this taxonomy. Subsequently, we consider the fundamental role of search engines as an access method,
recall empirical data on search engine use, and discuss general strategies for
reducing the impact of “adverse” search results. We then describe our experiments in more detail, and present the outcomes of the experiments. Finally
we draw some conclusions about methods for improving control over one’s
searchable online presence.
1.2 The setting: “Social recruitment”
At the time of publication of the new draft Data Protection Directive, a motivating factor for the right to be forgotten was given [6] as young people having no way of deleting embarrassing online material when applying for jobs. “Online profile management” appears to be a growing industry, with fees out of most young people’s reach, but there is no evidence that such companies have a “silver bullet”: they come up against most of the legal and technological obstacles that also hinder a full implementation of the right to be forgotten. We have found no reports on this industry sector that suggest they possess techniques beyond the ones also identified in our research [7].

[6] European Commission spokesman Matthew Newman, quoted in BBC news article, 23 Jan 2012, http://www.bbc.co.uk/news/technology-16677370 (last accessed 2/10/2013).

[7] E.g., T. Dowling, “Search me: online reputation management”, The Guardian, 24 May 2013, http://www.theguardian.com/technology/2013/may/24/search-me-online-reputation-management (last accessed 2/10/2013).

In order to guide our thoughts and experiments, we have concentrated on a particular scenario in social recruitment:
- an individual applies for a job with an employer, and in doing so makes available a copy of their CV;
- as part of the process of selecting candidates for interview, a representative of the employer, holding a copy of the individual’s CV, enters the individual’s name (only) into a search engine, e.g. Google;
- they then spend a limited amount of time investigating the search results, visiting the linked pages but not necessarily following any further links, according to known common patterns of searching behaviour, though not using additional more refined search queries;
- they note any potentially “interesting” item that is likely to be concerned with this individual (the fact that none such may exist by itself justifies putting a limit on the amount of time spent);
- the search is “with memory”: deductions made during the investigation of earlier search results may impact on subsequent conclusions.

1.3 A taxonomy of personal information and online resources
For personal data on online resources (websites, social networks, etc.), the
strategies and issues around privacy and in particular whether and how such
information can be “forgotten” vary, depending on the nature of the online
resource and the type of personal information concerned. This is evident
from discussions on the Right to be Forgotten, e.g. by P. Bernal [8] and Google researcher P. Fleischer [9], and is worked out systematically here in order to structure and focus our discussion in the rest of the paper. We propose a simple two-dimensional analysis, leading to nine different scenarios.

[8] P. Bernal, “The EU, the US, and the Right to be Forgotten”, in: Gutwirth, S., Leenes, R.E., De Hert, P. (eds.), Computers, privacy and data protection – reloading data protection, Dordrecht etc.: Springer 2014.

[9] P. Fleischer, “Foggy thinking about the Right to Oblivion”, 9 Mar 2011, http://peterfleischer.blogspot.co.uk/2011/03/foggy-thinking-about-right-to-oblivion.html (last accessed 13/10/2013).
The first dimension characterises the type of online resource by how it relates to the individual. The three types we distinguish are as follows.
- ME: the individual is the subject and owner of the resource. A typical example of this is the pre-social-network concept of a “personal webpage”, hosted on private web space or on some generic service that neither prescribes the nature of the information provided nor expects any linkage between the contents of different individuals. Blogs and the contents of emails can also fall into this category.
- US: the individual is a participant in the resource, and has probably agreed to terms and conditions in order to receive a service that is being provided. There is likely to be personal information provided confidentially to the resource (for identification, billing, etc.), as well as personal information provided by the resource to other participants, third parties, or the public. (There may be no personal information being published more widely at all, e.g. on basic e-commerce websites, which makes the resource of less interest in our privacy discussion.) The structure in which the information is provided is mostly determined by the resource owner. A typical example of this is a social network or “community website”, or the “comments” section of a website. There may be a disparity between the cooperative way the resource presents itself to its participants and the reality of its business model.
- THEM: a resource that carries personal information about the individual where no relationship between them is established in advance; the individual is merely an “object” of interest. A typical example of this would be the “contents” section of a newspaper website, with stories about those individuals that are considered to be of public interest.
The privacy considerations also change depending on what kind of personal data is involved. We distinguish three kinds of personal data:
- ATTributes: this is the kind of data that is traditionally stored in databases: items like name, date of birth, address, names of children – each of them occurring once or a fixed number of times per individual, and often existing in a unique canonical abstract form. Pictures do not belong in this category, although passport pictures, by removing degrees of freedom or through their biometric measurements, come closer to it. Information given to, but not published by, “US” resources is typically in this category. On a social network like Facebook, this kind of information, when published, is typically in a “profile” or “About” section.
- “STOries”: this is all other explicitly generated personal information, such as medical history, pictures and other media, status updates on social networks, comments and posts on blogs and websites, and the contents of emails. Users can ask Facebook to provide “the full set of information” that Facebook holds about them – the information returned will include these first two categories only, but exclude the final one [10].
- BEHaviour: this is all the implicitly generated personal information, such as location history as kept by smart phones and networks, metadata about email communication and web browsing, or purchase history information as collected by store loyalty cards. Facebook, for example, maintains its own search history, which is evident from typing a letter or two into its search window – but this information is not returned to customers on request, nor can it be reset or directly controlled in any way by users.
There is no doubt that all three of these categories constitute personal data in a Data Protection context. For example, location information has recently been shown to be highly useful for identifying individuals [11]: four data points were shown to suffice for identifying 95% of people, and the combination of work location and home location (at block level) usually identifies a person uniquely [12]. The boundaries between the three categories may not always be sharp, and the same type of information can occur in each category: for example, a location may be an attribute if it is someone’s home address; a story if it is a check-in on Twitter; and behaviour if it is quietly recorded by a smart phone.

[10] See http://europe-v-facebook.org/ which reports on an on-going attempt to make Facebook comply with European data protection legislation: of the 80+ categories of data recorded by Facebook on each of their members, only about 20 are made available to data subjects on request.

[11] Y.-A. de Montjoye, C.A. Hidalgo, M. Verleysen and V.D. Blondel, “Unique in the Crowd: The privacy bounds of human mobility”, Scientific Reports 3:1376, doi 10.1038/srep01376, March 2013, http://www.nature.com/srep/2013/130325/srep01376/full/srep01376.html (last accessed 2/10/2013).

[12] P. Golle and K. Partridge, “On the Anonymity of Home/Work Location Pairs”, in H. Tokuda, M. Beigl, A. Friday, A.J. Bernheim Brush, Y. Tobe (eds): Pervasive Computing, 7th International Conference, LNCS 5538, pp 390-397, Springer 2009. doi 10.1007/978-3-642-01516-8_26.
1.4 Data minimality and the right to be forgotten: 9 scenarios
The three types of online resources and the three kinds of data together lead
to nine different combinations, which are summarised in the table below and
discussed in some detail after that, proceeding column by column, and identifying the entries by the bold letters representing row and column.
| data ↓ \ resource → | ME (subject) | US (participant) | THEM (object) |
| ATTributes | control, but authentication effect to other contexts | traditional DPD context | traditional privacy: attributes in public domain, sensitive? |
| “STOries” | full control | typical social network content etc. | freedom of expression; cyberbullying |
| BEHaviour | some leakage (e.g. dynamics) | metadata, browsing, hidden data and $$$ | 1984, data mining, de-anonymisation |
Firstly, in the ME column it looks like the data subject has full control over each of the types of data. However, even here we cannot currently claim that they are able to exercise a full right to be forgotten (or: erasure [13][14]), due to historical web archives such as the WayBack Machine [15] and more short-term archiving such as Google caches. It is possible to remove and withhold information from search engines, e.g. by asking Google for removal using special request forms, and by using meta-tags to prevent indexing [16]. We will come back to this in a further discussion on indexing and obfuscation.
ME-ATT: Although the individual has full discretion about which attributes to
publish here, fully hiding them defeats the purpose of having (something
like) a personal website. Displaying at least a name here serves as a kind of
claim to authenticity and identification that this resource does indeed concern the data subject. Using a pseudonym might be an alternative, particularly in the context of one individual maintaining multiple online identities.
ME-STO: The stories are likely the “content” that is being published to the world in this scenario, and thus the raison d’être for the resource. With the proviso made in general for this column, the data subject is fully able to ensure that the current contents of the resource reflect positively on them, even if historical content provision might have been less well judged.
[13] Druschel et al, ENISA report, section 3.3.

[14] Also characterised as a “right to delete” in P. Bernal, “The EU, the US, and the Right to be Forgotten”, in: Gutwirth, S., Leenes, R.E., De Hert, P. (eds.), Computers, privacy and data protection – reloading data protection, Dordrecht etc.: Springer 2014.

[15] “WayBack Machine”, web.archive.org (last accessed 2/10/2013).

[16] “How to Ungoogle yourself”, http://www.wikihow.com/Ungoogle-Yourself (last accessed 2/10/2013).
ME-BEH: There is some behavioural information leaking in this scenario, from the changes over time in what content is being provided. Mostly this will be inconsequential, and not worthwhile for third parties to track, but the individual cannot exercise control over it. Leaked information may be similar in nature to, e.g., Facebook’s helpful notifications concerning changes in relationship status – which in our experience have always reflected a significant change in the personal circumstances of the data subject.
US-ATT: This is the traditional scenario for the application of data protection
legislation: a service that makes use of a database with personal information.
There is a clearly defined and explicitly agreed relationship between data
subject and data controller. The data subject knows what data is being requested, and can insist on data minimisation: that data is relevant for the
purpose. The data subject may even be able to withdraw all their data from
the resource by cancelling the service, unless the relationship is a legally required one, e.g. with the tax office. We do not believe that any proposed
right to be forgotten would add significant “consumer power” to this scenario. For our specific context, it is more important which personal data the resource can publish than which data it actually holds; however, this is likely to
be clear also through agreed terms and conditions or legal requirements.
US-STO: This describes the typical social network content, and seems to have been a main area of concern that the Right To Be Forgotten was aimed to address. As in the rest of this column, individuals may withdraw their data by cancelling the relationship with the resource owners, although the latter may be reluctant to allow this [17]. Visibility of personal data to other participants, to the public (including through search engines), and to third-party businesses is controlled through terms and conditions and “privacy controls”. Due to the large variety of data and the complex structure of social networks, it is notoriously difficult to provide privacy controls which are both effective and user-friendly. Other types of online information in this category include “newsgroups” (the main discussion mechanism on the internet before the WWW), which were subject to implicit social contracts.

[17] M. Aspan, “How Sticky Is Membership on Facebook? Just Try Breaking Free”, New York Times, 11 Feb 2008, http://www.nytimes.com/2008/02/11/technology/11facebook.html (last accessed 3/10/2013, requires free registration). See also http://europe-v-facebook.org/ for related and more recent issues.
US-BEH: Many services in the US column are free to use for their participants, but this entry explains why they are nevertheless successful businesses. Data in this entry consists of browsing behaviour, metadata of all types of communications, etc., to be aggregated or used for targeted advertising. Although it was shown in the previous decade that many interesting data sets can be successfully de-anonymised [18][19], some of the on-going discussion on the new European Data Protection Directive still hinges on anonymisation as a sufficient protection for personal data [20]. Typical social networks do not provide much privacy control over this data, nor do they make it explicitly visible to users on request.
THEM-ATT: Where US-ATT was the traditional positive data protection scenario, this entry includes one of its more problematic aspects. How is an individual ever to know which third parties hold data on them – for example telemarketing companies and debt collectors, let alone identity thieves? When an individual’s attribute information is not only held, but also ends up being published, e.g. a celebrity’s details in a newspaper, was this assumed to be in the public domain, or was it sensitive personal information? This area thus involves classic privacy dilemmas, including the tension with freedom of expression, which comes into full force in the next category.
[18] A. Narayanan, V. Shmatikov, “Robust de-anonymization of large sparse datasets”, IEEE Symposium on Security and Privacy, IEEE, 111-125, 2008, doi 10.1109/SP.2008.33. Presents successful de-anonymisation of Netflix data.

[19] M. Barbaro, T. Zeller Jr, “A Face Is Exposed for AOL Searcher No. 4417749”, The New York Times, 9 Aug 2006, http://select.nytimes.com/gst/abstract.html?res=F10612FC345B0C7A8CDDA10894DE404482 (last accessed 3/10/2013, requires free registration). Presents successful de-anonymisation of AOL search data.

[20] C. Doctorow, “Data protection in the EU: the certainty of uncertainty”, Guardian Technology Blog, 5 Jun 2013, http://www.theguardian.com/technology/blog/2013/jun/05/data-protection-eu-anonymous (last accessed 3/10/2013).
THEM-STO: In this scenario, stories are published about an individual without their consent or involvement. Social networks are careful to steer away from this scenario when the story subject is another participant, by ensuring that “tagging” (i.e., mentioning a participant in a way that leads on to their other information) is essentially consensual, through seeking consent or allowing fine-grained privacy control. This may well reflect a lesson learnt from their having to deal with cyber-bullying. Looking beyond social networks, this is where the Right To Be Forgotten clashes, in the debates, with freedom of expression. The press, blogging individuals, and factual websites want to publish stories about other people – giving the subjects of these stories the right to have them withdrawn would be a clear attack on freedom of expression, and in some cases would “allow the re-writing of history”. It is not clear to us what a right to be forgotten as applied to published information could sensibly achieve in this context beyond what is already attainable through existing limits on freedom of expression such as libel laws. However, it will still be relevant for situations where third parties hold personal data for purposes other than publication. One might consider the NSA/GCHQ euphemistically and indiscriminately “hoovering” up the contents of emails to fall into this category, for example …
THEM-BEH: The gathering of behavioural personal data by third parties is really the “1984” scenario of a surveillance society, which has been alluded to in connection with various recent news stories: on a small scale, there were the waste bins in London which collected data from passing smart phones [21], and, more seriously, the mass collection of communications metadata by the NSA and GCHQ. The company doing the waste bin phone tracking claimed they were only recording anonymous data (through MAC monitoring), and although the experiment has been stopped, at the time of writing the Information Commissioner is still to provide a verdict. More effective data protection in this area is clearly welcome.

[21] “City of London calls halt to smartphone tracking bins”, BBC News, 12 Aug 2013, http://www.bbc.co.uk/news/technology-23665490 (last accessed 4/10/2013).
The scenario in the rest of this paper – using a search engine on someone’s name with a CV in hand, looking for possibly problematic search outcomes – will find its results mostly in the US-STO and THEM-STO categories. Personal data in the ATT row is less likely to be interpreted as reflecting badly on someone (although some sensitive personal data may hit the searcher’s prejudices, which points to an ethical risk of “social recruitment” [22]). Where ATT data conflicts with the known CV information, we will interpret such search outcomes as referring to a different person, so we are not looking for “errors” in the CVs. Data in the BEH row is used in many ways that individuals might hold strong views on, but publication is not the main threat there. (Although Google’s recent move of publishing individuals’ endorsements might be placed in this category.)

[22] M. Mihelich, “Special Report: A Check on Background Checks”, 12 Sep 2013, http://www.workforce.com/articles/9353-check-on-background-checks (last accessed 4/10/2013).
1.5 The role of search engines, and taking control
The WWW as an information system is highly distributed, fairly unstructured,
dynamic, and of variable quality. Thus, if we are looking for information in it
– in our scenario: information about a particular person – we need to look in
many places, there is not much structure to such a search, the system cannot
help us by finding the information once and for all, and even then: what we
find may not be valuable or relevant. In an ideal information system, we
would enter a unique identifier for the relevant information (a “key” or index), and get back an organised (thematically sorted into categories and subcategories?) overview of all the available information on the provided key.
The compromise adopted in practice for the WWW is to use search engines.
They remember what webpages exist, they remember some of the relevant
information on them, even caching some of it, web crawlers will continuously update them on structure and information, and they will attempt to rank
the outcomes by relevance. Google has recently introduced a gentle subdivision of outcomes by categories for image queries, but normally the outcome is a simple list of results ordered by perceived relevance, split across multiple pages.
Judging the relevance of a link to a query is not a simple problem, as relevance is often subjective and based on the user’s existing knowledge, which is not accessible to the search engine. Each search engine judges the relevance of a link using its own criteria. The AltaVista search engine, popular in the mid-to-late nineties, used criteria such as the frequency of the query terms in the pages of the website and whether they appeared in the first few lines of a page. Google determines the ranking of the pages displayed in its search results using its own algorithm, PageRank. The same principle applies – the higher a page is ranked in the results, the more relevant it is thought to be to the user’s query – but Google’s method attempts to take into account whether other internet users think the website relevant.
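To make the link-based ranking idea concrete, the following is a minimal sketch of the classic PageRank computation by power iteration over a toy link graph. The graph, the page names and the damping factor of 0.85 (a commonly cited default) are our own illustration, not Google’s actual, far more elaborate, implementation.

```python
# Minimal PageRank sketch: power iteration over a toy link graph.
# Assumes every link target is also a key of the graph dictionary.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}           # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:                      # dangling page: spread rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:           # each link passes on a share of rank
                    new_rank[target] += share
        rank = new_rank
    return rank

# A page linked to by many others ("home") ends up ranked highest.
toy_web = {
    "home":   ["about", "blog"],
    "about":  ["home"],
    "blog":   ["home", "about"],
    "orphan": ["home"],
}
for page, score in sorted(pagerank(toy_web).items(), key=lambda kv: -kv[1]):
    print(f"{page:7s} {score:.3f}")
```

The key design point, relative to AltaVista-style keyword frequency, is that a page’s rank depends on the ranks of the pages linking to it, which is what the “other internet users think the website relevant” remark above refers to.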
According to one study, of all the clicks made on the first two Google search result pages, the second page receives only about 5% [23]. This has given rise to a competitive industry known as Search Engine Optimisation (SEO), based on maximising the probability that a site will be ranked highly by Google. Companies devote significant portions of their advertising budgets to SEO in a bid to attract the largest number of visitors.

[23] “The Value of Google Result Positioning”, Chitika Online Advertising Network, June 2013 updated version, http://chitika.com/google-positioning-value (last accessed 4/10/2013).
Our scenario, of a job applicant’s name being Googled, allows two main avenues of attack: the first one is to influence the list of search results being
returned by Google, i.e. attacking Google’s interpretation of the WWW. This
could potentially be done in a number of ways:
- by ensuring search results get removed. This could be through Google request forms, or by using META tags that ask “robots” to stay out, for pages in the ME category [24] (see also the configuration sketch in the Conclusions). Also, for results from social networks or other US resources, it is likely that fine-tuning privacy settings will remove the majority of them. These known solutions were not explored further in our experiments.
- by influencing the order of search results. This might be called “Search Engine Pessimisation”, and would have to be applied to the potentially harmful results. However, unless these are in the ME category, the individual would hardly be in a position to apply the usual SEO techniques in reverse; and in that case, they could equally ensure the page gets removed from search results. So we have not explored “pessimisation” any further.
- by adding extra search results. This could push damaging search results to a lower position or even onto the next page of results – which would halve the likelihood of the result being noticed [25]. However, in order to “flood” Google like this, the newly added results would have to rank highly. Part of our experiment is to find out what kind of resources would satisfy this criterion.

[24] “How to Ungoogle yourself”, http://www.wikihow.com/Ungoogle-Yourself (last accessed 2/10/2013).

[25] “The Value of Google Result Positioning”, http://chitika.com/google-positioning-value: “The traffic dropped by 140% going from 10th to 11th position and 86% going from 20th to 21st position.”
The second line of attack aims at the other imprecise link in the WWW viewed as an information system: the searching individual’s interpretation of each search result. Recall that for a perfect search, a uniquely identifying “key” is necessary. Using an individual’s name does not satisfy that criterion: the name will generally be shared with many others on the WWW. In combination with other information, such as date of birth or location, the name key could be made unique – however, searching with such an extended key would de-emphasise many of the possible web pages which simply do not mention all that information. With an imperfect key, for each search result the searcher needs to analyse whether this item is, or is not, about the targeted individual. We refer to this process as disambiguation. The second aspect of our experiments aims to find strategies that reduce the searcher’s chance of success at disambiguation. In particular, it looks at which types of information tend to be the most helpful in identifying individuals.
1.6 Experiments
We performed essentially the same experiment in two rounds: first in an exploratory way, and then in a more thorough way with a more uniform selection of data subjects and provision of information about them. In analogy with the recruitment scenario described above, for each of the data subjects,

- the researcher enters the data subject’s name into Google;
- for each search result on the first two pages, they record its provenance (originating website or service), and their view of whether the item actually relates to the data subject. They spend only a limited amount of time trying to establish this, using the available information (including possibly deductions from search results investigated earlier), visiting the linked pages but not necessarily following any further links, without using additional more refined search queries.
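To illustrate the bookkeeping involved in recording provenance, here is a minimal sketch that reduces a list of result URLs to their originating sites and tallies them. The URLs and the name are invented placeholders; in the experiments themselves, the searches and all judgements were performed by hand.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical top results for one (fictional) data subject; in the experiment
# these came from manually entering the name into Google.
results = [
    "https://www.linkedin.com/in/jane-example",
    "https://www.facebook.com/jane.example",
    "https://twitter.com/jane_example",
    "https://www.example-news.com/story/jane",
]

def provenance(url):
    """Reduce a result URL to its originating site (host, 'www.' stripped)."""
    host = urlparse(url).netloc.lower()
    return host.removeprefix("www.")   # requires Python 3.9+

counts = Counter(provenance(u) for u in results)
print(counts.most_common())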
In the first round of the experiment, data subjects were chosen from the
researcher’s contacts; the available information was the researcher’s personal knowledge plus requested CVs or LinkedIn profiles where available. For
the second round of the experiment, data subjects were students in the department who had sent in their CVs in response to a general request for CVs,
with a clarification in advance of the experiment in particular on how the CV
data would be handled. The CVs provided the initial information for the researcher in this round of the experiment.
1.7 First Experiment
In this experiment, we first looked at the prevalence of social network sites in Google search results. Sixteen individuals from the researcher’s mobile phone contact list were selected, such that their names would be relatively unique. We looked at the first 10 outcomes resulting from entering their full name as a Google search query. For each of these, we recorded its provenance. For slightly more than half, this turned out to be a social network – with results from Facebook and LinkedIn each accounting for about a quarter of these social network outcomes. We also looked at the relative ranking of social network results among the search results – LinkedIn results tended to outrank Facebook ones, with both ranking in the top five on average, and social network results ranking higher on average than others.
Next, we selected a new set of sixteen data subjects from the same mobile phone contact list, randomly but restricted to those where we could either obtain a CV or, failing that, a LinkedIn profile. Of the available LinkedIn profiles, we only considered the publicly visible parts. We looked at the first 10 text results of entering their name as a search query into Google. For each of those, we set out to determine by following the result link:

- whether this result did indeed refer to the data subject;
- what items of information led to this conclusion – these might be from the CV/profile, or from previously investigated links.
The disambiguating items of information were categorised as locale (a location tied to the person at some point in their life, including work place or for
education), education (non-geographical information about education), profession, age (including birth date etc.), name (information relating to the
individual’s full name), and media (any media from which the individual is
recognisable). Conclusions on each of the 160 search results were labelled as
positive or negative (regarding whether the item refers to the individual),
and direct or indirect (conclusion drawn from CV info, or relying on information from previous search result analysis). About 1 in 3 of the search results contained no real information (e.g. leading to a directory website
search with the same search query) and had to be discarded.
Of the search results that led to a meaningful conclusion, 1 in 4 were positive, and 3 in 4 were negative. Of the positive conclusions, 1 in 5 was indirect, all of these with a photograph linking the result to a previously disambiguated outcome. The majority of the direct positive conclusions were on the basis of locale and profession. For the negative conclusions, 1 in 4 was indirect, with a majority of these based on photographs. For the direct negative conclusions, locale was the decisive factor for a majority, and profession was also significant. Particularly for negative decisions, the relevant factor would often stand out already on the Google results page, before following the link – for example, the URL suffix could be an indication of a “wrong” locale.
Although these outcomes are interesting, they were biased by the data subject names being selected from the contacts of one particular 24-year-old male university student, which suggests possible bias in gender, age, locale and education. In our next experiment, we set out to remove some of that bias by aiming for a larger and more arbitrary group of data subjects. Attributing each conclusion to a single factor is also slightly restricting, as it may often be a combination of factors that facilitates the decision. The selection of data subjects, and the type of information typically available on a CV, will have influenced the outcomes to make locale and profession dominant factors.
1.8 Second Experiment
In the second experiment, we asked students for their CVs, with a clear indication of the intentions of the experiment and how we would deal with their
personal data. Most of the students asked would already have been advised
previously to “Google themselves” and make sure prospective employers
receive a positive impression from “social recruitment” practices.
A total of 23 students agreed to let us use their CVs for this experiment. For
all these students, we again investigated prevalence of social network outcomes on the search results, and whether returned search results did indeed
refer to these individuals and what led us to that conclusion. In this case, we
explored the top 15 search results per individual. We report the results of
this experiment using the terminology defined above.
Social network links accounted for 56% of the search results. Of these,
LinkedIn accounted for 38%, Facebook for 21%, and Twitter for 14% with the
rest spread thinly among other social networks. (The higher prevalence of
LinkedIn results over Facebook is the main discrepancy between the first and
second experiment.) LinkedIn tended to be the highest ranked on search
results (in 16 of 23 cases), but Facebook and Twitter also ranked highest for a
few students. For 16 of the 23 students, social network site information
helped in the next phase of disambiguating the search results.
On a total of 345 search results, 230 (2 out of 3) led to a meaningful conclusion. Most of the others were links leading to pages containing no relevant
information. Almost one in four were positive conclusions, more than three
in four were negative.
Of the positive conclusions, three in four were direct, and one in four was
indirect. Locale proved to be the identifying factor in half the cases, profession in 30% of the cases; education, age, and media turned out to be less
important. However, for the 2 out of 23 CVs that did include a photograph,
this was a decisive factor for a few search results. Occasionally pages contained unique identifiers that allowed easy definite positives, such as email
addresses. One in three positive decisions was based on detailed duplicated
data between the CV and the web page.
Of the negative conclusions, one in six was indirect, and five in six were direct. Seven out of the 23 students turned out to share their name with a relatively famous person who dominated the search outcomes for that name.
For nearly half the negative conclusions, locale was the decisive factor. Profession accounted for a third, education for 12%, age and media were less
influential. Both for positive and negative conclusions, locale appeared indirectly also through area and country codes in phone numbers.
This second experiment, of course, also had its clear limitations and biases. All of the respondents were students in the same department of the same university, introducing an age bias and a limited range of locale and education. The majority of them had been contacted through the departmental placement office, which would already have advised them on the content of CVs as well as on the risks of social recruitment. The nature of the department also implied that they would have been reasonably aware of technological and other issues around the WWW, and it resulted in a gender bias as well.
1.9 Conclusions
The intention of the experiments was to find advice on how individuals could
control their online presence in the absence of sufficient legal avenues for
removing web content. The outcomes of the experiments support a number
of recommendations, which we have listed below, including also some advice
that does not directly follow from these experiments.
0. Regularly review your online presence by entering your name into different search engines, also searching with extra disambiguating information, especially if your name is common or shared with a celebrity.
1. Create a LinkedIn profile, and make sure it contains information that presents you in a positive light. The perceived relevance of such profiles and the SEO efforts of LinkedIn make it likely that this page will rank highly among your search results. You may want to avoid information that is often omitted from CVs, such as photographs and dates of birth, as well as other information that is not strictly necessary but helpful in identifying you.
(Some of this may be for reasons of security as well as privacy – some people
already lie consistently about their town of birth in “security questions”.)
For academics and other groups, getting profile pages on specialised sites
such as Google Scholar, academia.edu and ResearchGate will have a similar
effect of creating high-ranking pages with positive information.
2. Think carefully about privacy settings on your social network accounts – in particular the ones concerning availability of information through Internet search, but also the ones that determine what is publicly visible inside the network. If you actively use social networks, some identifying information needs to be there to allow “old friends” to find you, but be cautious about providing information that may subsequently unlock other information elsewhere, such as photographs, year of birth, locale, and friend lists. If you do not have (e.g.) a Facebook account, creating one with minimal information and minimal visibility of that information may still create a useful high-ranking Google search result.
3. For web content entirely under your own control, consider asking search engines to stay out through the use of the robots.txt file and HTML META tags (see the sketch below). Think twice before including the key linking categories of information such as photographs and locale. Landline numbers betray more about locale than mobile phone numbers. Having different pseudonyms for different contexts, rather than using your real name, may be preferable. The existence of Google image search means that you should avoid using duplicate pictures for different presences, even if they are not photographs of you.
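As an illustration of these standard robots exclusion mechanisms: the robots.txt file sits at the root of the website, and the META tag goes in the head of an individual page. The /personal/ path is our own placeholder; note that well-behaved crawlers honour these directives, but they are requests, not guarantees.

```
# robots.txt at the root of your own website
# (the /personal/ path is an illustrative placeholder)
User-agent: *
Disallow: /personal/
```

```html
<!-- per page, inside <head>: ask robots not to index this page or follow its links -->
<meta name="robots" content="noindex, nofollow">
```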
4. Think twice about providing any information on the internet that could be potentially embarrassing. This obvious advice could have been included in item 0, but also take into account that what is available to search engines (etc.) now may only be a subset of what will become available in the future. Newsgroup articles (Usenet was a popular discussion medium on the internet before the WWW) used to expire on “every” server after a few weeks, but every newsgroup post with international distribution ever made became available to search engines a few years ago, through Google putting an archive of them online. Comments made pseudonymously on webpages may be de-anonymised in the future if a comment hosting service like Disqus is bought up by a company like Google or Microsoft that has extensive records of individuals’ IP address use over time.
Given how valuable personal information is to all kinds of third parties, it
seems likely that research will be initiated to defeat the “tricks” we encourage here. We can imagine programs that “post-process” search engine output to concentrate the results on more interesting ones. The process of disambiguation of search results, performed “by hand” by us in these experiments, is one that might lend itself to some degree of automation, possibly
using machine learning techniques. Recent events have made it abundantly
clear that there is a real “arms race” between individuals wanting to protect
their privacy and increasingly sophisticated and computationally powerful
internet firms and others.
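As an indication of how such automation might begin, here is a minimal rule-based sketch in the spirit of our manual process, deciding for one search result whether it plausibly concerns the CV’s owner. The categories follow the ones used in our experiments; the data, the normalisation into sets of strings, and the decision rule (any conflict rules the page out, echoing how we treated conflicting ATT data) are illustrative assumptions rather than a tested system.

```python
# Minimal rule-based disambiguation sketch (illustrative only): compare
# CV-derived attributes with attributes extracted from one search result.

CATEGORIES = ["locale", "education", "profession", "age", "name", "media"]

def disambiguate(cv, result):
    """Return ('positive'|'negative'|'inconclusive', evidence) for one result.

    cv and result map category names to sets of normalised values;
    an empty set means the page offered no information in that category.
    """
    evidence = {}
    for cat in CATEGORIES:
        known, found = cv.get(cat, set()), result.get(cat, set())
        if not found:
            continue                        # page silent on this category
        evidence[cat] = "match" if (known & found) else "conflict"
    if not evidence:
        return "inconclusive", evidence
    if any(v == "conflict" for v in evidence.values()):
        return "negative", evidence         # e.g. a "wrong" locale rules the page out
    return "positive", evidence

cv = {"locale": {"canterbury"}, "profession": {"software developer"}}
page = {"locale": {"canterbury"}, "age": set()}
print(disambiguate(cv, page))               # ('positive', {'locale': 'match'})
```

A machine-learning version would replace the hard match/conflict rule with weights learnt from hand-labelled decisions such as the ones produced in our experiments.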
Bibliography
1. M. Aspan, “How Sticky Is Membership on Facebook? Just Try Breaking Free”, New York
Times, 11 Feb 2008. http://www.nytimes.com/2008/02/11/technology/11facebook.html
(last accessed 3/10/2013, requires free registration).
2. M. Barbaro, T. Zeller Jr, "A Face Is Exposed for AOL Searcher No. 4417749", The New York Times, 9 Aug 2006. http://select.nytimes.com/gst/abstract.html?res=F10612FC345B0C7A8CDDA10894DE404482 (last accessed 3/10/2013, requires free registration).
3. BBC news, “EU proposes 'right to be forgotten' by internet firms”, 23 Jan 2012,
http://www.bbc.co.uk/news/technology-16677370 (last accessed 2/10/2013).
4. BBC News, “City of London calls halt to smartphone tracking bins”, 12 Aug 2013, http://www.bbc.co.uk/news/technology-23665490 (last accessed 4/10/2013).
5. P. Bernal, “The EU, the US, and the Right to be Forgotten”, in: Gutwirth, S., Leenes, R.E.,
De Hert, P. (eds.), Computers, privacy and data protection – reloading data protection,
Dordrecht etc.: Springer 2014.
6. Chitika Online Advertising Network, “The Value of Google Result Positioning,” June 2013
updated version. http://chitika.com/google-positioning-value, (last accessed 4/10/2013).
7. C. Doctorow, “Data protection in the EU: the certainty of uncertainty”, Guardian Technology Blog, 5 Jun 2013. http://www.theguardian.com/technology/blog/2013/jun/05/data-protection-eu-anonymous (last accessed 3/10/2013).
8. T. Dowling, “Search me: online reputation management”, The Guardian, 24 May 2013. http://www.theguardian.com/technology/2013/may/24/search-me-online-reputation-management (last accessed 2/10/2013).
9. P. Druschel, M. Backes, and R. Tirtea, “The right to be forgotten – between expectations and practice”, European Network and Information Security Agency, November 2012, http://www.enisa.europa.eu/activities/identity-and-trust/library/deliverables/the-right-to-be-forgotten (last accessed 2/10/2013).
10. “Commission Proposals on the Data Protection Reform: Regulation”, 25 Jan 2012, http://ec.europa.eu/justice/data-protection/document/review2012/com_2012_11_en.pdf (last accessed 2/10/2013).
11. “Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data”, http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=CELEX:31995L0046:EN:HTML (last accessed 2/10/2013).
12. Europe vs. Facebook, http://europe-v-facebook.org/.
13. P. Fleischer, “Foggy thinking about the Right to Oblivion”, 9 Mar 2011,
http://peterfleischer.blogspot.co.uk/2011/03/foggy-thinking-about-right-to-oblivion.html
(last accessed 13/10/2013).
14. P. Golle and K. Partridge, “On the Anonymity of Home/Work Location Pairs”, in H. Tokuda, M. Beigl, A. Friday, A.J. Bernheim Brush, Y. Tobe (eds): Pervasive Computing, 7th International Conference, LNCS 5538, pp 390-397, Springer 2009. doi 10.1007/978-3-642-01516-8_26.
15. M. Mihelich, “Special Report: A Check on Background Checks”, 12 Sep 2013, http://www.workforce.com/articles/9353-check-on-background-checks (last accessed 4/10/2013).
16. Y.-A. de Montjoye, C.A. Hidalgo, M. Verleysen and V.D. Blondel, “Unique in the Crowd:
The privacy bounds of human mobility”, Scientific Reports 3:1376, doi
10.1038/srep01376, March 2013.
http://www.nature.com/srep/2013/130325/srep01376/full/srep01376.html (last accessed 2/10/2013)
17. A. Narayanan, V. Shmatikov, “Robust de-anonymization of large sparse datasets”, IEEE Symposium on Security and Privacy, IEEE, 111-125, 2008. doi 10.1109/SP.2008.33. Presents successful de-anonymisation of Netflix data.
18. V. Reding, “Citizenship Privacy matters – Why the EU needs new personal data protection
rules”, The European Data Protection and Privacy Conference Brussels, 30 Nov 2010,
http://europa.eu/rapid/press-release_SPEECH-10-700_en.htm (last accessed 4/10/2013).
19. “WayBack Machine”, web.archive.org (last accessed 2/10/2013).
20. Wikihow, “How to Ungoogle yourself”, http://www.wikihow.com/Ungoogle-Yourself (last
accessed 2/10/2013)