IS6600-11 Big Data, Intelligence & Surveillance

Hype, Reality or …?

Purpose
• The purpose of this class is to introduce the concept of Big Data and to examine its potential value for organisations and governments, as well as its downside effects on privacy
• There is also a lot of hype about Big Data
• I also hope to stimulate your own thinking about Big Data – and how it affects you

Basics
• Big Data refers to the vast quantities of data that businesses and governments gather
• This data is *believed* to contain useful, actionable intelligence that could lead to
  – Process efficiencies
  – Lower costs
  – Higher profits
  – Identification of terrorism threats/plans
• What is needed is the will and expertise to perform the relevant analysis.

How Big is Big?
• It depends on how quickly you can access and process data (with normal database management tools)
• For a small company, hundreds of gigabytes could be big. For a larger company, hundreds of terabytes
  – 1 terabyte = 1000 gigabytes
  – 1 petabyte = 1000 terabytes
  – 1 exabyte = 1000 petabytes
• Beyond that: Zettabyte, Yottabyte

Size Contexts
• Some areas of science generate huge amounts of data:
  – Meteorology (weather forecasting) & Remote Sensing
  – Genomics (genome sequencing)
  – Physics, e.g. CERN
    • 150 million sensors each deliver data 40 million times per second
    • Even working with only 0.001% of the data collected, CERN still keeps 25 petabytes a year
    • If all the data were used, it would amount to 500 exabytes a day – 200 times more than all other global data sources combined
  – Social data, RFID data
  – Surveillance – NSA & GCHQ

The History
• Big Data is not a new topic
  – Data has been getting bigger continually ever since the first byte was created
  – It is related to storage capacity and processing power – which also keep growing continually
• Over the last 25 years, many governments have attempted to consolidate data holdings into single databases controlled by single parties
  – National ID Schemes
  – National Health Records Management

Corporate Examples
• Amazon handles millions of back-end operations every day, as well as queries from more than half a million third-party sellers.
• Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes (2500 terabytes) of data
• Facebook handles 50 billion photos.
• TaoBao & Alibaba – again, billions of transactions
• Consumer profile databases, Loyalty Cards, Octopus
• Park’n Shop’s Money Back Card is the same kind of thing

Ford
• http://www.datanami.com/datanami/2013-03-16/how_ford_is_putting_hadoop_pedal_to_the_metal.html
• Ford’s modern hybrid Fusion model generates up to 25 gigabytes of data per hour
  – Data that is a potential goldmine for Ford, as long as it can find the right analytical tools for the job.
• The data can be used to
  – understand driving behaviors and reduce accidents
  – understand wear and tear
  – identify issues early and so lower maintenance costs
  – avoid collisions
• But who should own the data? Ford? The car owner?
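
To put the figures on these slides on one scale, the unit definitions from the "How Big is Big?" slide can be written out as a short calculation. This is only an illustrative sketch: the function names are made up for this example, and the assumption that a car is driven one hour a day is hypothetical, not a figure from Ford. Note that under the decimal definitions used here, 2.5 petabytes is 2500 terabytes; a binary convention (1 petabyte = 1024 terabytes) would give 2560.

```python
# Decimal storage units, as defined on the "How Big is Big?" slide,
# expressed in gigabytes.
UNITS_IN_GB = {
    "gigabyte": 1,
    "terabyte": 1000,
    "petabyte": 1000 ** 2,
    "exabyte": 1000 ** 3,
}


def convert(amount, from_unit, to_unit):
    """Convert an amount of data between two of the units above."""
    amount_in_gb = amount * UNITS_IN_GB[from_unit]
    return amount_in_gb / UNITS_IN_GB[to_unit]


# Walmart's databases: 2.5 petabytes expressed in terabytes.
walmart_tb = convert(2.5, "petabyte", "terabyte")

# Ford Fusion: up to 25 gigabytes per hour of driving (from the slide).
# ASSUMPTION (hypothetical): one car driven one hour a day for a year.
HOURS_DRIVEN_PER_YEAR = 365
ford_tb_per_car_year = convert(25 * HOURS_DRIVEN_PER_YEAR, "gigabyte", "terabyte")

print(f"Walmart: {walmart_tb} TB")
print(f"One Ford Fusion, one year: {ford_tb_per_car_year} TB")
```

Even under that modest driving assumption, a single car would produce several terabytes a year, which is why the question of who owns the data matters.
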

Needles & Haystacks
• The volume of data is huge, beyond imagination, and the consultants and software firms want us to believe that somewhere in it, if you can find them, there may be some needles – pieces of actionable intelligence

Who is Pushing Big Data?
• IBM!
  – Because they want to sell you their software that (they claim) will help you to analyse the data and find the needles
• Consultants stand to make millions by panicking their clients into spending on software solutions
• Globally, this is a US$100 billion industry, growing 10% a year

Is Everyone Happy?
• The consultants suggest not. Accenture:
  – 22% of companies are very satisfied
  – 35% are quite satisfied
  – 34% are dissatisfied
  – 39% say that they have data that is relevant to their business strategy
• Big data can be useful – if you know what to look for and how to get that ‘intelligence’ to the people who can use it

Consultant Perspectives
• Companies have lots of data, but “most organisations measure too many things that don’t matter and don’t put sufficient focus onto the things that do” (Accenture).
• “Companies are buried in information” and are struggling to use it (McKinsey)
• The more data they have, the less they seem to know!
  – The more you know, the more you don’t know?!

Then What Should the Companies Do?
• Spend more money (say the consultants)
  – “a large investment in new data capabilities”
    • McKinsey
  – “embed analytics into business processes”
    • Accenture
• Alternatively
  – Go and ask people what they think is happening!
  – Ask your lost customers why they got lost!
    • A survey or big data analytics won’t tell you why.

Gartner’s Hype Cycle

Big Data and Intelligence
• One of the highest-impact news stories since June 2013 has concerned the secret surveillance activities of the NSA and GCHQ – as revealed by Edward Snowden
• These surveillance activities are fundamentally about big data and analytics, just as they are also about privacy and security, espionage and politics

Key Terms
• NSA – National Security Agency (US) (www.nsa.gov)
• GCHQ – Government Communications Headquarters (UK) (www.gchq.gov.uk)
• Prism, Tempora, XKeyscore, Bullrun
  – Systems that store, retrieve and analyze the data
• The Guardian (http://www.theguardian.com/international)
  – UK newspaper that first published the stories
• Patriot Act
  – US Act for Homeland Security passed after 11 September 2001 (9/11)
  – http://en.wikipedia.org/wiki/Patriot_Act

The Government’s Perspective
• Looking for needles in the metadata
  – Phone numbers, call duration & frequency
  – Global patterns that may involve terrorism
  – If a bombing in India can be matched to a sudden increase of calls in another country, that might be of interest
  – To be effective, they need as much data as possible – in short, everything.

The Surveillance Picture
• Edward Snowden has leaked a LOT of information
• The stories are still coming. We have learned a LOT about what governments do – with their own citizens’ data, and with data from other countries
• You may recall stories about data being captured in Hong Kong and China from the Chinese University and Tsinghua University Internet hubs
  – http://www.reuters.com/article/2013/06/24/us-usa-security-tsinghua-idUSBRE95N0M220130624
• This is a series of events of global proportions
• We should not be surprised at anything any more
  – If they want to collect something, anything, then they can and will.
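
The "Government’s Perspective" slide talks about spotting a sudden increase in calls within call metadata. The sketch below shows one simple way such a spike could be flagged: compare each day's call count against the mean and standard deviation of the preceding week. It is a toy illustration only. The function name, the window and threshold values, and the call counts are all hypothetical, and nothing here describes how any agency actually analyses its data.

```python
from statistics import mean, stdev


def flag_call_spikes(daily_calls, window=7, threshold=3.0):
    """Return the indices of days whose call count is far above the
    baseline formed by the previous `window` days.

    A day is flagged when it lies more than `threshold` standard
    deviations above the baseline mean.
    """
    flagged = []
    for day in range(window, len(daily_calls)):
        baseline = daily_calls[day - window:day]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: treat any increase as a spike.
            if daily_calls[day] > mu:
                flagged.append(day)
            continue
        if (daily_calls[day] - mu) / sigma > threshold:
            flagged.append(day)
    return flagged


# Hypothetical daily call counts between two countries (made-up numbers).
calls = [100, 104, 98, 101, 97, 103, 99, 102, 100, 240, 101]
print(flag_call_spikes(calls))
```

With these made-up numbers only the day with 240 calls stands out. Real metadata is far noisier, which is why the slide's point about needing "everything" and the later point about needles and haystacks pull in opposite directions.
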

Selected Events
• Publication of a top-secret court order against Verizon mandating it to hand over the call records of all its customers
  – http://www.theguardian.com/world/2013/jul/19/nsa-extended-verizon-trawl-through-court-order
• Orders for all other telecoms firms also existed
• Large-scale collection of data without individual warrants – Prism
  – http://en.wikipedia.org/wiki/PRISM_(surveillance_program)

Prism
• A system that gives the NSA access to the personal information of non-US people from US Internet companies
  – Apple, Facebook, Google, Microsoft, Skype, Yahoo,…
• These companies always claimed that they protected individual privacy, but … it seems that this was not the case
• However, they were legally required to say nothing – the court orders prohibited them from saying anything about their data sharing with the NSA
• Data obtained by cable tapping
  – Metadata & content from 4 US telecoms providers’ cables

Facebook
• During Jan-June 2013, governments requested info on 38,000 Facebook users
  – 11,000+ from the US (79% compliance)
  – 4000+ from India (50% compliance)
  – 170 from Turkey (47% compliance)
  – 11 from Egypt (0% compliance)
  – http://www.theguardian.com/technology/2013/aug/27/facebook-government-user-requests

XKeyscore
• This is the data retrieval system used to collect, process and search the data
  – http://en.wikipedia.org/wiki/XKeyscore
• It allows an NSA analyst to query “nearly everything a typical user does on the Internet” in near-real time, including:
  – Email content
  – Websites visited and searches
  – Metadata
• In theory these systems were designed to analyse data about foreigners, but many Americans were also included in the databases

GCHQ
• This is the UK government department that deals with telecommunications and signals intelligence
  – http://www.gchq.gov.uk
  – http://en.wikipedia.org/wiki/Government_Communications_Headquarters
• Access to Prism since 2010
• Operates Tempora, similar to Prism, for collecting data from the Internet and telecoms.

GCHQ
• In 2009, GCHQ spied on foreign politicians visiting the UK for a G20 summit
  – Eavesdropping on phone calls and emails
  – Monitoring computers
  – Installing keyloggers and then tracking activities post-summit
  – Turkish Finance Minister (Simsek)
  – Russian leader (Medvedev)
• Purpose – Economic/Political Intelligence

Tempora
• Much of the data is harvested from Internet cables that enter the UK (GBs-TBs per second)
  – 300 GCHQ and 250 NSA analysts are involved
• Telephone calls, email messages, Facebook entries, personal Internet history, IM chats, passwords
  – Cooperation with private telecoms companies
  – Data held for 3 days, metadata for 30 days
• http://en.wikipedia.org/wiki/Tempora
• http://www.theguardian.com/uk/2013/jun/21/gchq-cables-secret-world-communications-nsa

Bullrun
• NSA and GCHQ spend millions developing programmes that can break Internet security (cryptography) protocols like HTTPS, SSL, etc.
• They also work directly with the telecom providers to ensure that they have backdoors that help them to access data that clients think is private/secret (AT&T and the UN)
• There are no Secrets!
  – http://www.theguardian.com/world/2013/sep/05/nsa-gchq-encryption-codes-security

Collusion or Legal Obligation?
• One defence offered by the private companies that hold the data is that they are required to obey the law of the countries in which they operate
  – They have no choice – they must hand over the data, or cooperate with the security agencies
  – Also, they cannot reveal that they are cooperating – they are gagged from revealing the existence of the Prism/Tempora/Bullrun systems

Payouts
• GCHQ and NSA are working with each other, sharing each other’s data
• NSA subsidizes GCHQ’s costs by millions of GBP annually
  – http://www.theguardian.com/uk-news/2013/aug/01/nsa-paid-gchq-spying-edward-snowden
• NSA benefits from GCHQ operating under less strict operating & oversight rules
• NSA expects returns… reports, intelligence.

Problems
• Big data is HUGE – there is simply too much data to collect and analyse
  – GCHQ may collect up to 20% of the actual data flow
• Big data is getting bigger
  – Cables that carry hundreds of GBs/second make that task harder still
• As always, 99.999% of the data is not useful.
  – Can you find the 0.001% that might be?

Reactions
• There have been attempts to stop media organizations from reporting on the surveillance programmes
• Computers owned by the Guardian newspaper were physically destroyed in an attempt to remove the data & prevent further publication
  – Additional copies are held in Brazil and the US
  – http://www.wired.com/threatlevel/2013/08/guardian-snowden-files-destroyed/

Implications for Individuals
• Is your data being harvested?
  – It seems likely.
• Are your private communications, including online purchases, secure? Private?
  – Not very.
• Are you protected by data privacy laws?
  – Not against governments.
  – Perhaps against private companies.
• http://www.pcpd.org.hk/

Questions
• What kind of data is being collected?
  – Where, by whom, for what purposes?
  – Can we see/find (some of) the data anywhere?
  – Are you personally at risk?
    • That depends on who you are, what you do, who you talk to and what about.
  – Should we be concerned?
• Is there anything we can do as individuals, as decision makers, as companies?
  – http://www.theguardian.com/world/2013/sep/05/nsa-how-to-remain-secure-surveillance
• Or is it more sensible just to get on with our lives?
• Do some Internet research now and try to answer some of these questions.