1 Googling me, googling you: is there something we can do?
(Controlling the Googled view - a technological approach)

Eerke Boiten, Guy Banister, Ashish Bhati
School of Computing, University of Kent, Canterbury, UK

Abstract. The Right To Be Forgotten as a feature of the new EU Data Protection Directive raises many legal and technological questions, to the extent that it seems unlikely that its objectives will be achieved in full. One particular context where such a right would be useful is "social recruitment": prospective employers using social networks and search engines to find out more about potential employees. In the absence of legal avenues for controlling what such employers might find out, we investigate technological alternatives. This paper reports on experiments using CVs and Google search, leading to methods for taking some control of the image generated of an individual through the view of online search.

1.1 Introduction

The "Right To Be Forgotten" was proposed by European Commissioner Viviane Reding in 2010 [1] and included as part of the European Commission's revised draft Data Protection Directive in 2012 [2]. It is defined as an extension of the principle of data minimisation which already existed in the 1995 Data Protection Directive [3]. The Right is introduced as follows [4]:

    Any person should have the right to have personal data concerning them rectified and a 'right to be forgotten' where the retention of such data is not in compliance with this Regulation. In particular, data subjects should have the right that their personal data are erased and no longer processed, where the data are no longer necessary in relation to the purposes for which the data are collected or otherwise processed, where data subjects have withdrawn their consent for processing or where they object to the processing of personal data concerning them or where the processing of their personal data otherwise does not comply with this Regulation.

The legal aspects of this new right have been debated extensively ever since the idea was first mooted, and a thorough technological analysis of the proposals has been given in an ENISA report by Druschel, Backes and Tirtea [5]. Without wishing to trivialise any of the arguments in support of the feasibility of this new right, we conclude from the current state of the legal and technological debate that it is unlikely that its higher-level objectives will be fully achieved in the near future, if ever. Thus, there remains a need for individuals to protect themselves in areas where data protection legislation, certainly for the time being, does not. Our research considers what methods are available to individuals for this purpose, concentrating on the particular context of "social recruitment".

[1] V. Reding, "Citizenship Privacy matters – Why the EU needs new personal data protection rules", The European Data Protection and Privacy Conference, Brussels, 30 Nov 2010, http://europa.eu/rapid/press-release_SPEECH-10-700_en.htm (last accessed 4/10/2013).
[2] Commission Proposals on the Data Protection Reform: Regulation, 25 Jan 2012, http://ec.europa.eu/justice/data-protection/document/review2012/com_2012_11_en.pdf (last accessed 2/10/2013).
[3] Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data, http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=CELEX:31995L0046:EN:HTML (last accessed 2/10/2013). Article 6 includes "[data kept should be ...] not excessive in relation to the purposes for which they are collected and/or further processed [... and] kept in a form which permits identification of data subjects for no longer than is necessary for the purposes".
[4] Data Protection Reform, preamble (53). Subsequent preambles add further requirements.
[5] P. Druschel, M. Backes, and R. Tirtea, "The right to be forgotten – between expectations and practice", European Network and Information Security Agency, November 2012, http://www.enisa.europa.eu/activities/identity-and-trust/library/deliverables/the-right-to-be-forgotten (last accessed 2/10/2013).
The rest of this paper is set out as follows. We first describe the setting in more detail. Then we introduce a taxonomy of personal data and online resources, and describe the problem of data minimisation and the right to be forgotten in the different scenarios arising from this taxonomy. Subsequently, we consider the fundamental role of search engines as an access method, recall empirical data on search engine use, and discuss general strategies for reducing the impact of "adverse" search results. We then describe our experiments in more detail, and present their outcomes. Finally, we draw some conclusions about methods for improving control over one's searchable online presence.

1.2 The setting: "social recruitment"

At the time of publication of the new draft Data Protection Directive, a motivating factor for the right to be forgotten was given [6] as young people having no way of deleting embarrassing online material when applying for jobs. "Online profile management" appears to be a growing industry, with fees out of most young people's reach, but there is no evidence that such companies have a "silver bullet": they come up against most of the legal and technological obstacles that also hinder a full implementation of the right to be forgotten. We have found no reports on this industry sector suggesting that it possesses techniques beyond the ones also identified in our research [7].

[6] European Commission spokesman Matthew Newman, quoted in BBC news article, 23 Jan 2012, http://www.bbc.co.uk/news/technology-16677370 (last accessed 2/10/2013).
[7] E.g., T. Dowling, "Search me: online reputation management", The Guardian, 24 May 2013, http://www.theguardian.com/technology/2013/may/24/search-me-online-reputation-management (last accessed 2/10/2013).

In order to guide our thoughts and experiments, we have concentrated on a particular scenario in social recruitment:

- an individual applies for a job with an employer, and in doing so makes available a copy of their CV;
- as part of the process of selecting candidates for interview, a representative of the employer, holding a copy of the individual's CV, enters the individual's name (only) into a search engine, e.g. Google;
- they then spend a limited amount of time investigating the search results, visiting the linked pages but not necessarily following any further links, according to known common patterns of searching behaviour, though not using additional more refined search queries;
- they note any potentially "interesting" item that is likely to be concerned with this individual (the fact that none may exist by itself justifies putting a limit on the amount of time spent);
- the search is "with memory": deductions made during the investigation of earlier search results may affect subsequent conclusions.
1.3 A taxonomy of personal information and online resources

For personal data on online resources (websites, social networks, etc.), the strategies and issues around privacy, and in particular whether and how such information can be "forgotten", vary depending on the nature of the online resource and the type of personal information concerned. This is evident from discussions of the Right to be Forgotten, e.g. by P. Bernal [8] and Google researcher P. Fleischer [9], and is worked out systematically here in order to structure and focus our discussion in the rest of the paper. We propose a simple two-dimensional analysis, leading to nine different scenarios.

[8] P. Bernal, "The EU, the US, and the Right to be Forgotten", in: Gutwirth, S., Leenes, R.E., De Hert, P. (eds.), Computers, privacy and data protection – reloading data protection, Dordrecht etc.: Springer 2014.
[9] P. Fleischer, "Foggy thinking about the Right to Oblivion", 9 Mar 2011, http://peterfleischer.blogspot.co.uk/2011/03/foggy-thinking-about-right-to-oblivion.html (last accessed 13/10/2013).

The first dimension characterises the type of online resource by how it relates to the individual. The three types we distinguish are as follows.

- ME: the individual is the subject and owner of the resource. A typical example of this is the pre-social-network concept of a "personal webpage", hosted on private web space or on some generic service that neither prescribes the nature of the information provided nor expects any linkage between the contents of different individuals. Blogs and the contents of emails can also fall into this category.

- US: the individual is a participant in the resource, and has probably agreed to terms and conditions in order to receive a service that is being provided. There is likely to be personal information provided confidentially to the resource (for identification, billing, etc.), as well as personal information provided by the resource to other participants, third parties, or the public. (There may be no personal information being published more widely at all, e.g. on basic e-commerce websites, which makes such a resource of less interest in our privacy discussion.) The structure in which the information is provided is mostly determined by the resource owner. A typical example of this is a social network or "community website", or the "comments" section of a website. There may be a disparity between the cooperative way the resource presents itself to its participants and the reality of its business model.

- THEM: a resource that carries personal information about the individual where no relationship between them is established in advance; the individual is merely an "object" of interest. A typical example of this would be the "contents" section of a newspaper website, with stories about those individuals that are considered to be of public interest.
The privacy considerations also change depending on what kind of personal data is involved. We distinguish three kinds of personal data:

- ATTributes: this is the kind of data that is traditionally stored in databases. Items like name, date of birth, address, names of children – each of them occurring once or a fixed number of times per individual, and often existing in a unique canonical abstract form. Pictures do not belong in this category, although passport pictures, by removing degrees of freedom or through their biometric measurements, come closer to it. Information given to, but not published by, "US" resources is typically in this category. On a social network like Facebook, this kind of information, when published, is typically in a "profile" or "About" section.

- "STOries": this is all other explicitly generated personal information, such as medical history, pictures and other media, status updates on social networks, comments and posts on blogs and websites, and the contents of emails. Users can ask Facebook to provide "the full set of information" that Facebook holds about them – the information returned will include these first two categories only, but exclude the final one [10].

- BEHaviour: this is all the implicitly generated personal information, such as location history as kept by smart phones and networks, metadata about email communication and web browsing, or purchase history information as collected by store loyalty cards. Facebook, for example, maintains its own search history, which is evident when typing a letter or two into its search window – but this information is not returned to customers on request, nor can it be reset or directly controlled in any way by users.

[10] See http://europe-v-facebook.org/, which reports on an on-going attempt to make Facebook comply with European data protection legislation: of the 80+ categories of data recorded by Facebook on each of their members, only about 20 are made available to data subjects on request.

There is no doubt that all three categories constitute personal data in a Data Protection context. For example, location information has recently been shown to be highly useful for identifying individuals [11]: four data points were shown to suffice for identifying 95% of people, and the combination of work location and home location (up to block level) usually identifies an individual uniquely [12]. The boundaries between the three categories may not always be sharp, and the same type of information can occur in each category: for example, a location may be an attribute if it is someone's home address; a story if it is a check-in on Twitter; and behaviour if it is quietly recorded by a smart phone.

[11] Y.-A. de Montjoye, C.A. Hidalgo, M. Verleysen and V.D. Blondel, "Unique in the Crowd: The privacy bounds of human mobility", Scientific Reports 3:1376, doi 10.1038/srep01376, March 2013. http://www.nature.com/srep/2013/130325/srep01376/full/srep01376.html (last accessed 2/10/2013).
[12] P. Golle and K. Partridge, "On the Anonymity of Home/Work Location Pairs", in: H. Tokuda, M. Beigl, A. Friday, A.J. Bernheim Brush, Y. Tobe (eds.), Pervasive Computing, 7th International Conference, LNCS 5538, pp. 390-397, Springer 2009. doi 10.1007/978-3-642-01516-8_26.

1.4 Data minimality and the right to be forgotten: 9 scenarios

The three types of online resources and the three kinds of data together lead to nine different combinations, which are summarised in the table below and discussed in some detail after that, proceeding column by column, and identifying the entries by the bold letters representing row and column.

data ↓ / resource → | ME (subject) | US (participant) | THEM (object)
ATTributes | control, but authentication effect to other contexts | traditional DPD context | traditional privacy: attributes in public domain, sensitive?
"STOries" | full control | typical social network content etc. | freedom of expression; cyberbullying
BEHaviour | some leakage (e.g. dynamics) | metadata, browsing, hidden data and $$$ | 1984, data mining, de-anonymisation
Firstly, in the ME column it looks like the data subject has full control over each of the types of data. However, even here we cannot currently claim that they are able to exercise a full right to be forgotten (or: erasure [13] [14]), due to historical web archives such as the WayBack Machine [15], or more short-term archiving such as in Google caches. It is possible to remove and withhold information from search engines, e.g. by asking Google for removal using special request forms, and by using meta-tags to prevent indexing [16]. We will come back to this in a further discussion on indexing and obfuscation.

[13] Druschel et al., ENISA report, section 3.3.
[14] Also characterised as a "right to delete" in P. Bernal, "The EU, the US, and the Right to be Forgotten", in: Gutwirth, S., Leenes, R.E., De Hert, P. (eds.), Computers, privacy and data protection – reloading data protection, Dordrecht etc.: Springer 2014.
[15] "WayBack Machine", web.archive.org (last accessed 2/10/2013).
[16] "How to Ungoogle yourself", http://www.wikihow.com/Ungoogle-Yourself (last accessed 2/10/2013).

ME-ATT: Although the individual has full discretion about which attributes to publish here, fully hiding them defeats the purpose of having (something like) a personal website. Displaying at least a name here serves as a kind of claim to authenticity and identification: that this resource does indeed concern the data subject. Using a pseudonym might be an alternative, particularly in the context of one individual maintaining multiple online identities.

ME-STO: The stories are likely the "content" that is being published to the world in this scenario, and thus the raison d'être for the resource. With the proviso made in general for this column, the data subject is fully able to ensure that the current contents of the resource reflect positively on them, even if historical content provision might have been less well judged.

ME-BEH: There is some behavioural information leaking in this scenario, from the changes over time in what content is being provided. Mostly this will be inconsequential, and not worthwhile for third parties to track, but the individual cannot exercise control over it. Leaked information may be similar in nature to Facebook's helpful notifications concerning changes in relationship status, for example – which in our experience have always reflected significant changes in the personal circumstances of the data subject.

US-ATT: This is the traditional scenario for the application of data protection legislation: a service that makes use of a database with personal information. There is a clearly defined and explicitly agreed relationship between data subject and data controller. The data subject knows what data is being requested, and can insist on data minimisation: that data is relevant for the purpose. The data subject may even be able to withdraw all their data from the resource by cancelling the service, unless the relationship is a legally required one, e.g. with the tax office.
We do not believe that any proposed right to be forgotten would add significant "consumer power" to this scenario. For our specific context, it is more important which personal data the resource can publish than which data it actually holds; however, this is likely to be clear also through agreed terms and conditions or legal requirements.

US-STO: This describes the typical social network content, and seems to have been a main area of concern that the Right To Be Forgotten was aimed to address. As in the rest of this column, individuals may withdraw their data by cancelling the relationship with the resource owners, although the latter may be reluctant to allow this [17]. Visibility of personal data to other participants, to the public (including through search engines), and to third party businesses is controlled through terms and conditions and "privacy controls". Due to the large variety of data and the complex structure of social networks, it is notoriously difficult to provide privacy controls which are both effective and user-friendly. Other types of online information in this category include "newsgroups" (the main discussion mechanism on the internet before the WWW), which were subject to implicit social contracts.

[17] M. Aspan, "How Sticky Is Membership on Facebook? Just Try Breaking Free", New York Times, 11 Feb 2008. http://www.nytimes.com/2008/02/11/technology/11facebook.html (last accessed 3/10/2013, requires free registration). See also http://europe-v-facebook.org/ for related and more recent issues.

US-BEH: Many services in the US column are free to use for their participants, but this entry explains why they are nevertheless successful businesses. Data in this entry consists of browsing behaviour, metadata of all types of communications, etc., to be aggregated or used for targeted advertising. Although it was shown in the previous decade that many interesting data sets can be successfully de-anonymised [18] [19], some of the on-going discussion on the new European Data Protection Directive still hinges on anonymisation as a sufficient protection for personal data [20]. Typical social networks do not provide much privacy control over this data, nor do they make it explicitly visible to users on request.

[18] A. Narayanan, V. Shmatikov, "Robust de-anonymization of large sparse datasets", IEEE Symposium on Security and Privacy, IEEE, pp. 111-125, 2008. doi: 10.1109/SP.2008.33. Presents successful de-anonymisation of Netflix data.
[19] M. Barbaro, T. Zeller Jr, "A Face Is Exposed for AOL Searcher No. 4417749", The New York Times, 9 Aug 2006. http://select.nytimes.com/gst/abstract.html?res=F10612FC345B0C7A8CDDA10894DE404482 (last accessed 3/10/2013, requires free registration). Presents successful de-anonymisation of AOL search data.
[20] C. Doctorow, "Data protection in the EU: the certainty of uncertainty", Guardian Technology Blog, 5 Jun 2013. http://www.theguardian.com/technology/blog/2013/jun/05/data-protection-eu-anonymous (last accessed 3/10/2013).

THEM-ATT: Where US-ATT was the traditional positive data protection scenario, this includes one of its more problematic aspects. How is an individual ever to know which third parties hold data on them – for example telemarketing companies and debt collectors, let alone identity thieves? And when an individual's attribute information is not only held but also ends up being published, e.g. a celebrity's details in a newspaper: was it assumed to be in the public domain, or was it sensitive personal information? This area thus involves classic privacy dilemmas, including the tension with freedom of expression, which comes into full force in the next category.
THEM-STO: In this scenario, stories are published about an individual without their consent or involvement. Social networks are careful to steer away from this scenario when the story subject is another participant, by ensuring that "tagging" (i.e., mentioning a participant in a way that leads on to their other information) is essentially consensual: by seeking consent or by allowing fine-grained privacy control. This may well reflect a lesson learnt from their having to deal with cyber-bullying. Looking beyond social networks, this is where the Right To Be Forgotten clashes in the debates with freedom of expression. The press, blogging individuals, and factual websites want to publish stories about other people – giving the subjects of these stories the right to have them withdrawn would be a clear attack on freedom of expression, and in some cases would amount to "allowing the re-writing of history". It is not clear to us what a right to be forgotten as applied to published information could sensibly achieve in this context beyond what is already attainable through existing limits on freedom of expression, such as libel laws. However, it will still be relevant for situations where third parties hold personal data for purposes other than publication. One might consider the NSA/GCHQ euphemistically and indiscriminately "hoovering" up the contents of emails to fall into this category, for example.

THEM-BEH: The gathering of behavioural personal data by third parties is really the "1984" scenario of a surveillance society, which has been alluded to in connection with various recent news stories: on a small scale, there were the waste bins in London which collected data from passing smart phones [21], and more seriously, the mass collection of communications metadata by the NSA and GCHQ. The company doing the waste bin phone tracking claimed it was only recording anonymous data (through MAC address monitoring), and although the experiment has been stopped, at the time of writing the Information Commissioner is still to provide a verdict. More effective data protection in this area is clearly welcome.

[21] "City of London calls halt to smartphone tracking bins", BBC News, 12 Aug 2013, http://www.bbc.co.uk/news/technology-23665490 (last accessed 4/10/2013).

The scenario in the rest of this paper, of using a search engine on someone's name with a CV in hand, looking for possibly problematic search outcomes, will find its results mostly in the US-STO and THEM-STO categories. Personal data in the ATT row is less likely to be interpreted as reflecting badly on someone (although some sensitive personal data may hit the searcher's prejudices, which points to an ethical risk of "social recruitment" [22]). Where ATT data conflicts with the known CV information, we will interpret such search outcomes as referring to a different person, so we are not looking for "errors" in the CVs. Data in the BEH row is used in many ways that individuals might hold strong views on, but publication is not the main threat there. (Although Google's recent move of publishing individuals' endorsements might be placed in this category.)

[22] M. Mihelich, "Special Report: A Check on Background Checks", 12 Sep 2013, http://www.workforce.com/articles/9353-check-on-background-checks (last accessed 4/10/2013).

1.5 The role of search engines, and taking control

The WWW as an information system is highly distributed, fairly unstructured, dynamic, and of variable quality.
Thus, if we are looking for information in it – in our scenario, information about a particular person – we need to look in many places; there is not much structure to such a search; the system cannot help us by finding the information once and for all; and even then, what we find may not be valuable or relevant.

In an ideal information system, we would enter a unique identifier for the relevant information (a "key" or index), and get back an organised (thematically sorted into categories and subcategories?) overview of all the available information on the provided key. The compromise adopted in practice for the WWW is to use search engines. They remember what webpages exist, they remember some of the relevant information on them, even caching some of it; web crawlers continuously update them on structure and information; and they attempt to rank the outcomes by relevance. Google has recently introduced a gentle subdivision of outcomes by categories for image queries, but normally the outcome is a simple list of results ordered by perceived relevance, split across multiple pages.

Judging the relevance of a link to a query is not a simple problem, as relevance is often subjective and based on the user's existing knowledge, which is not accessible to the search engine. Each search engine judges the relevance of a link using its own criteria. The AltaVista search engine, popular in the mid-to-late nineties, used criteria such as the frequency of the query terms in the pages of the website and whether the query appeared in the first few lines of a page. Google determines the ranking of the pages displayed in its search results using its own algorithm, PageRank. The same theory applies: the higher the rank of a page in the results, the more relevant it is thought to be to the user's query; but Google's method attempts to take into account whether other internet users think the website relevant.
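To make the ranking discussion concrete, the following is a minimal sketch of the link-analysis idea behind PageRank: a page's rank is the stationary probability that a "random surfer", who mostly follows links and occasionally jumps to a random page, is visiting it. This is only an illustration of the published core algorithm; Google's production ranking combines it with many other signals, and the damping factor and toy link graph below are our own assumptions.

```python
# Sketch of the PageRank idea: iterate the "random surfer" update until
# the rank vector stabilises. Not Google's production ranking.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:            # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Toy web: page A is linked to by both B and C, so it ends up ranked highest.
web = {"A": ["B"], "B": ["A", "C"], "C": ["A"]}
print(sorted(pagerank(web).items(), key=lambda kv: -kv[1]))
```

The relevant intuition for what follows is that rank flows along incoming links: a page linked to by many well-ranked pages ranks highly, which is why heavily linked profile sites do well in name searches.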
According to a study, of all the clicks made on the first two Google search result pages, the second page receives only about 5% [23]. This has given rise to a competitive industry known as Search Engine Optimisation (SEO), based on maximising the probability that a site will be ranked highly by Google. Companies devote significant portions of their advertising budgets to SEO in a bid to attract the largest number of visitors.

[23] "The Value of Google Result Positioning", Chitika Online Advertising Network, June 2013 updated version. http://chitika.com/google-positioning-value (last accessed 4/10/2013).

Our scenario, of a job applicant's name being Googled, allows two main avenues of attack. The first is to influence the list of search results being returned by Google, i.e. attacking Google's interpretation of the WWW. This could potentially be done in a number of ways:

- by ensuring search results get removed. This could be through Google request forms, or by using META tags that ask "robots" to stay out, for pages in the ME category [24]. Also, for results from social networks or other US resources, it is likely that fine-tuning privacy settings will remove the majority of them. These known solutions were not explored further in our experiments.
- by influencing the order of search results. This might be called "Search Engine Pessimisation", and would need to be applied to the potentially harmful results. However, unless these are in the ME category, the individual would hardly be in a position to apply the usual SEO techniques in reverse; and in that case, they could equally ensure the page gets removed from search results. So we have not explored "pessimisation" any further.
- by adding extra search results. This could push damaging search results to a lower position or even onto the next page of results – which would at least halve the likelihood of the result being noticed [25]. However, in order to "flood" Google like this, newly added results would have to rank highly. Part of our experiment is to find out what kind of resources would satisfy this criterion.

[24] "How to Ungoogle yourself", http://www.wikihow.com/Ungoogle-Yourself.
[25] "The Value of Google Result Positioning", http://chitika.com/google-positioning-value: "The traffic dropped by 140% going from 10th to 11th position and 86% going from 20th to 21st position."

The second line of attack aims at the other imprecise link in the WWW viewed as an information system: the searching individual's interpretation of each search result. Recall that for a perfect search, a uniquely identifying "key" is necessary. Using an individual's name does not satisfy that criterion: the name will generally be shared with many others on the WWW. In combination with other information, such as date of birth or location, the name key could be made unique – however, searching with such an extended key would de-emphasise many of the possible web pages, which simply do not mention all that information. With an imperfect key, for each search result the searcher needs to analyse whether the item is, or is not, about the targeted individual. We refer to this process as disambiguation. The second aspect of our experiments aims to find strategies that reduce the searcher's chance of success at disambiguation. In particular, it looks at which types of information tend to be the most helpful in identifying individuals.

1.6 Experiments

We performed essentially the same experiment in two rounds: first in an exploratory way, and then in a more thorough way, with a more uniform selection of data subjects and provision of information about them. In analogy with the recruitment scenario described above, for each of the data subjects:

- the researcher enters the data subject's name into Google;
- for each search result on the first two pages, they record its provenance (originating website or service), and their view of whether the item actually relates to the data subject. They spend only a limited amount of time trying to establish this, using the available information (including possibly deductions from search results investigated earlier), visiting the linked pages but not necessarily following any further links, and without using additional more refined search queries.

In the first round of the experiment, data subjects were chosen from the researcher's contacts; the available information was the researcher's personal knowledge plus requested CVs or LinkedIn profiles where available. For the second round, data subjects were students in the department who had sent in their CVs in response to a general request, with a clarification in advance of the experiment, in particular on how the CV data would be handled. The CVs provided the initial information for the researcher in this round of the experiment.
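It may help to fix the bookkeeping behind this procedure before reporting results. The sketch below shows one plausible way of recording and tallying the per-result observations; the field names and example values are our own illustration, not an artefact of the original study, and the conclusion labels anticipate the terminology defined in the next section.

```python
# Hypothetical bookkeeping for the experiments: one record per search
# result, tallied into the proportions reported in the next sections.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Observation:
    provenance: str   # originating website or service, e.g. "linkedin.com"
    conclusion: str   # "positive", "negative", or "none" (no real information)
    mode: str         # "direct" (from CV info) or "indirect" (from earlier results)
    factor: str       # main disambiguating factor, e.g. "locale" or "profession"

def tally(observations):
    meaningful = [o for o in observations if o.conclusion != "none"]
    return {
        "meaningful": len(meaningful),
        "positive": sum(o.conclusion == "positive" for o in meaningful),
        "indirect": sum(o.mode == "indirect" for o in meaningful),
        "factors": Counter(o.factor for o in meaningful),
    }

results = [
    Observation("linkedin.com", "positive", "direct", "profession"),
    Observation("facebook.com", "negative", "direct", "locale"),
    Observation("192.com", "none", "direct", "name"),
]
print(tally(results))
```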
1.7 First Experiment

In this experiment, we first looked at the prevalence of social network sites in Google search results. Sixteen individuals from the researcher's mobile phone contact list were selected, such that their names would be relatively unique. We looked at the first 10 outcomes resulting from entering their full name as a Google search query. For each of these, we recorded its provenance. For slightly more than half, this turned out to be a social network, with results from Facebook and LinkedIn each accounting for about a quarter of these social network outcomes. We also looked at the relative ranking of social network results among the search results: LinkedIn results tended to outrank Facebook ones, with both ranking in the top five on average, and social network results ranking higher on average than others.

Next, we selected a new set of sixteen data subjects from the same mobile phone contact list, randomly but restricted to those for whom we could either obtain a CV or, failing that, a LinkedIn profile. Of the available LinkedIn profiles, we only considered the publicly visible parts. We looked at the first 10 text results of entering their name as a search query into Google. For each of those, we set out to determine, by following the result link:

- whether this result did indeed refer to the data subject;
- what items of information led to this conclusion – these might be from the CV/profile, or from previously investigated links.

The disambiguating items of information were categorised as locale (a location tied to the person at some point in their life, including for work or education), education (non-geographical information about education), profession, age (including birth date etc.), name (information relating to the individual's full name), and media (any media from which the individual is recognisable). Conclusions on each of the 160 search results were labelled as positive or negative (regarding whether the item refers to the individual), and direct or indirect (conclusion drawn from CV information, or relying on information from previous search result analysis).

About 1 in 3 of the search results contained no real information (e.g. leading to a directory website search with the same search query) and had to be discarded. Of the search results that led to a meaningful conclusion, 1 in 4 were positive, and 3 in 4 were negative. Of the positive conclusions, 1 in 5 was indirect, all of these with a photograph linking the result to a previously disambiguated outcome. The majority of the direct positive conclusions were on the basis of locale and profession. For the negative conclusions, 1 in 4 was indirect, with a majority of these based on photographs. For the direct negative conclusions, locale was the decisive factor for a majority, and profession was also significant. Particularly for negative decisions, the relevant factor would often stand out already on the Google results page, before following the link – for example, the URL suffix could be an indication of a "wrong" locale.

Although these are interesting outcomes, they were biased by the data subjects' names being selected from the contacts of one particular 24-year-old male university student, which suggests possible bias in gender, age, locale and education. In our next experiment, we set out to remove some of that bias by aiming for a larger and more arbitrary group of data subjects. Attributing each conclusion to a single factor is also slightly restricting, as it may often be a combination of factors that facilitates the decision. The selection of data subjects, and the type of information typically available on a CV, will have influenced the outcomes to make locale and profession dominant factors.
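The disambiguation step above was performed by hand, but its core comparison can be stated mechanically. The sketch below is a hypothetical scoring rule over the factor categories just defined; the weights and thresholds are illustrative assumptions that merely echo the finding that locale and profession decide most cases, not values measured in our experiments.

```python
# Hypothetical scoring rule for disambiguation. "name" is omitted because
# it is the search key itself and matches by construction.
WEIGHTS = {"locale": 3, "profession": 2, "education": 1, "age": 1, "media": 3}

def disambiguate(cv, page):
    """cv, page: dicts mapping factor names to sets of normalised values.
    Returns (verdict, evidence), verdict in {positive, negative, unknown}."""
    score, evidence = 0, []
    for factor, weight in WEIGHTS.items():
        a, b = cv.get(factor, set()), page.get(factor, set())
        if not a or not b:
            continue          # factor absent on one side: no signal either way
        if a & b:
            score += weight
            evidence.append(("match", factor))
        else:
            score -= weight
            evidence.append(("conflict", factor))
    if score >= 3:
        return "positive", evidence
    if score <= -3:
        return "negative", evidence
    return "unknown", evidence

cv = {"locale": {"canterbury", "kent"}, "profession": {"computing", "student"}}
page = {"locale": {"canterbury"}, "profession": {"computing"}}
print(disambiguate(cv, page))   # expected: positive, via locale and profession
```

Framed this way, the defensive strategies discussed later amount to starving such a rule of matchable factor values.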
1.8 Second Experiment

In the second experiment, we asked students for their CVs, with a clear indication of the intentions of the experiment and of how we would deal with their personal data. Most of the students asked would already have been advised previously to "Google themselves" and make sure prospective employers receive a positive impression from "social recruitment" practices. A total of 23 students agreed to let us use their CVs for this experiment. For all these students, we again investigated the prevalence of social network outcomes in the search results, whether returned search results did indeed refer to these individuals, and what led us to that conclusion. In this case, we explored the top 15 search results per individual. We report the results of this experiment using the terminology defined above.

Social network links accounted for 56% of the search results. Of these, LinkedIn accounted for 38%, Facebook for 21%, and Twitter for 14%, with the rest spread thinly among other social networks. (The higher prevalence of LinkedIn results over Facebook ones is the main discrepancy between the first and second experiments.) LinkedIn tended to be the highest ranked among search results (in 16 of 23 cases), but Facebook and Twitter also ranked highest for a few students. For 16 of the 23 students, social network site information helped in the next phase of disambiguating the search results.

Of a total of 345 search results, 230 (2 out of 3) led to a meaningful conclusion. Most of the others were links leading to pages containing no relevant information. Almost one in four were positive conclusions; more than three in four were negative. Of the positive conclusions, three in four were direct, and one in four was indirect. Locale proved to be the identifying factor in half the cases, and profession in 30%; education, age, and media turned out to be less important. However, for the 2 out of 23 CVs that did include a photograph, this was a decisive factor for a few search results. Occasionally pages contained unique identifiers, such as email addresses, that allowed easy definite positives. One in three positive decisions was based on detailed data duplicated between the CV and the web page.

Of the negative conclusions, one in six was indirect, and five in six were direct. Seven of the 23 students turned out to share their name with a relatively famous person who dominated the search outcomes for that name. For nearly half the negative conclusions, locale was the decisive factor. Profession accounted for a third, and education for 12%; age and media were less influential. For both positive and negative conclusions, locale also appeared indirectly, through area and country codes in phone numbers.

This second experiment of course also had its clear limitations and biases. All of the respondents were students at the same department of the same university, introducing an age bias and a limited range of locale and education. The majority of them had been contacted through the departmental placement office, which would already have advised them on the content of CVs as well as on the risks of social recruitment.
The kind of department this is also implied that they would have been reasonably aware of technological and other issues around the WWW, and it would also result in a gender bias.

1.9 Conclusions

The intention of the experiments was to find advice on how individuals could control their online presence in the absence of sufficient legal avenues for removing web content. The outcomes of the experiments support a number of recommendations, which we have listed below, including also some advice that does not directly follow from these experiments.

0. Regularly review your online presence by entering your name into different search engines, also searching with extra disambiguating information, especially if your name is common or shared with a celebrity.

1. Create a LinkedIn profile, and make sure it contains information that presents you in a positive light. The perceived relevance of such profiles and the SEO efforts of LinkedIn make it likely that this page will rank highly among your search results. You may want to avoid information that is often omitted from CVs, such as photographs and dates of birth, as well as other information that is not strictly necessary but helpful in identifying you. (Some of this may be for reasons of security as well as privacy – some people already lie consistently about their town of birth in "security questions".) For academics and other groups, getting profile pages on specialised sites such as Google Scholar, academia.edu and ResearchGate will have a similar effect of creating high-ranking pages with positive information.

2. Think carefully about privacy settings on your social network accounts – in particular the ones concerning availability of information through internet search, but also the ones that determine what is publicly visible inside the network. If you actively use social networks, some identifying information needs to be there to allow "old friends" to find you, but be cautious about providing information that may subsequently unlock other information elsewhere, such as photographs, year of birth, locale, and friend lists. If you do not have (e.g.) a Facebook account, getting one with minimal information and minimal visibility of that information may still create a useful high-ranking Google search result.

3. For web content entirely under your own control, consider asking search engines to stay out of it through the use of the robots.txt file and HTML META tags (see the sketch after this list). Think twice before including the key linking categories of information, such as photographs and locale. Landline numbers betray more about locale than mobile phone numbers. Having different pseudonyms for different contexts, rather than using your real name, may be preferable. The existence of Google image search means that you should avoid using duplicate pictures for different presences, even if they are not photographs of you.

4. Think twice about providing any information on the internet that could be potentially embarrassing. This obvious advice could have been included in item 0, but also take into account that what is available to search engines (etc.) now may only be a subset of what will become available in the future. Newsgroup (Usenet, a popular discussion medium on the internet before the WWW) articles used to expire on "every" server after a few weeks, but every newsgroup post with international distribution ever made later became available to search engines, through Google putting an archive of them online. Comments made pseudonymously on webpages may be de-anonymised in the future if a comment hosting service like Disqus is bought by a company like Google or Microsoft that has extensive records of individuals' IP address use over time.
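As a concrete illustration of recommendation 3, the sketch below shows the two standard crawler opt-out mechanisms: a robots.txt file served at the site root, and a robots META tag placed in a page's head. Both are voluntary conventions that well-behaved crawlers honour; neither guarantees removal from search results, and the path /private/ is only an example.

```python
# Recommendation 3 in practice: generate the two standard opt-out
# mechanisms. These are requests to crawlers, not guarantees of removal.

# robots.txt asks crawlers not to fetch the listed paths at all.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# A robots META tag asks engines that do fetch the page not to index it
# and not to follow its links.
META_TAG = '<meta name="robots" content="noindex, nofollow">'

with open("robots.txt", "w") as f:   # deploy at the web root of your site
    f.write(ROBOTS_TXT)
print(META_TAG)                      # paste into the page's <head> section
```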
Given how valuable personal information is to all kinds of third parties, it seems likely that research will be initiated to defeat the "tricks" we encourage here. We can imagine programs that "post-process" search engine output to concentrate the results on more interesting ones. The process of disambiguating search results, performed "by hand" by us in these experiments, is one that might lend itself to some degree of automation, possibly using machine learning techniques (a toy version is sketched below). Recent events have made it abundantly clear that there is a real "arms race" between individuals wanting to protect their privacy and increasingly sophisticated and computationally powerful internet firms and others.
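To illustrate the kind of automation we have in mind, the sketch below trains a toy classifier on per-factor match signals, assuming a labelled set of (CV, search result) pairs existed. The features, the toy data, and the use of scikit-learn's LogisticRegression are entirely our own assumptions about how such a tool might be built; any real system would need far richer features.

```python
# Speculative sketch of automated disambiguation. Each (CV, result) pair
# becomes a vector of per-factor signals: +1 match, -1 conflict, 0 no signal.
from sklearn.linear_model import LogisticRegression

FACTORS = ["locale", "profession", "education", "age", "media"]

def featurise(cv, page):
    row = []
    for f in FACTORS:
        a, b = cv.get(f, set()), page.get(f, set())
        row.append(0 if not (a and b) else (1 if a & b else -1))
    return row

# Toy training data: two matching pairs and two non-matching pairs.
X = [[1, 1, 0, 0, 0], [1, 0, 1, 0, 0], [-1, -1, 0, 0, 0], [-1, 0, 0, -1, 0]]
y = [1, 1, 0, 0]
model = LogisticRegression().fit(X, y)

cv = {"locale": {"canterbury"}, "profession": {"computing"}}
page = {"locale": {"canterbury"}, "profession": {"computing"}}
print(model.predict_proba([featurise(cv, page)])[0][1])  # P(same person)
```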
Bibliography

1. M. Aspan, "How Sticky Is Membership on Facebook? Just Try Breaking Free", New York Times, 11 Feb 2008. http://www.nytimes.com/2008/02/11/technology/11facebook.html (last accessed 3/10/2013, requires free registration).
2. M. Barbaro, T. Zeller Jr, "A Face Is Exposed for AOL Searcher No. 4417749", The New York Times, 9 Aug 2006. http://select.nytimes.com/gst/abstract.html?res=F10612FC345B0C7A8CDDA10894DE404482 (last accessed 3/10/2013, requires free registration).
3. BBC News, "EU proposes 'right to be forgotten' by internet firms", 23 Jan 2012. http://www.bbc.co.uk/news/technology-16677370 (last accessed 2/10/2013).
4. BBC News, "City of London calls halt to smartphone tracking bins", 12 Aug 2013. http://www.bbc.co.uk/news/technology-23665490 (last accessed 4/10/2013).
5. P. Bernal, "The EU, the US, and the Right to be Forgotten", in: Gutwirth, S., Leenes, R.E., De Hert, P. (eds.), Computers, privacy and data protection – reloading data protection, Dordrecht etc.: Springer 2014.
6. Chitika Online Advertising Network, "The Value of Google Result Positioning", June 2013 updated version. http://chitika.com/google-positioning-value (last accessed 4/10/2013).
7. C. Doctorow, "Data protection in the EU: the certainty of uncertainty", Guardian Technology Blog, 5 Jun 2013. http://www.theguardian.com/technology/blog/2013/jun/05/data-protection-eu-anonymous (last accessed 3/10/2013).
8. T. Dowling, "Search me: online reputation management", The Guardian, 24 May 2013. http://www.theguardian.com/technology/2013/may/24/search-me-online-reputation-management (last accessed 2/10/2013).
9. P. Druschel, M. Backes, and R. Tirtea, "The right to be forgotten – between expectations and practice", European Network and Information Security Agency, November 2012. http://www.enisa.europa.eu/activities/identity-and-trust/library/deliverables/the-right-to-be-forgotten (last accessed 2/10/2013).
10. "Commission Proposals on the Data Protection Reform: Regulation", 25 Jan 2012. http://ec.europa.eu/justice/data-protection/document/review2012/com_2012_11_en.pdf (last accessed 2/10/2013).
11. "Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data". http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=CELEX:31995L0046:EN:HTML (last accessed 2/10/2013).
12. Europe vs. Facebook, http://europe-v-facebook.org/.
13. P. Fleischer, "Foggy thinking about the Right to Oblivion", 9 Mar 2011. http://peterfleischer.blogspot.co.uk/2011/03/foggy-thinking-about-right-to-oblivion.html (last accessed 13/10/2013).
14. P. Golle and K. Partridge, "On the Anonymity of Home/Work Location Pairs", in: H. Tokuda, M. Beigl, A. Friday, A.J. Bernheim Brush, Y. Tobe (eds.), Pervasive Computing, 7th International Conference, LNCS 5538, pp. 390-397, Springer 2009. doi 10.1007/978-3-642-01516-8_26.
15. M. Mihelich, "Special Report: A Check on Background Checks", 12 Sep 2013. http://www.workforce.com/articles/9353-check-on-background-checks (last accessed 4/10/2013).
16. Y.-A. de Montjoye, C.A. Hidalgo, M. Verleysen and V.D. Blondel, "Unique in the Crowd: The privacy bounds of human mobility", Scientific Reports 3:1376, doi 10.1038/srep01376, March 2013. http://www.nature.com/srep/2013/130325/srep01376/full/srep01376.html (last accessed 2/10/2013).
17. A. Narayanan, V. Shmatikov, "Robust de-anonymization of large sparse datasets", IEEE Symposium on Security and Privacy, IEEE, pp. 111-125, 2008. doi: 10.1109/SP.2008.33. Presents successful de-anonymisation of Netflix data.
18. V. Reding, "Citizenship Privacy matters – Why the EU needs new personal data protection rules", The European Data Protection and Privacy Conference, Brussels, 30 Nov 2010. http://europa.eu/rapid/press-release_SPEECH-10-700_en.htm (last accessed 4/10/2013).
19. "WayBack Machine", web.archive.org (last accessed 2/10/2013).
20. Wikihow, "How to Ungoogle yourself", http://www.wikihow.com/Ungoogle-Yourself (last accessed 2/10/2013).