Seminar 11 - Department of Information Systems

IS6600-11
Big Data, Intelligence & Surveillance
1
Hype, Reality or …?
2
Purpose
• The purpose of this class is to
introduce the concept of Big
Data, examine its potential
and value for organisations and governments,
and consider its downside effects on privacy
• There is also a lot of hype about Big Data
• I also hope to stimulate your own thinking
about Big Data – and how it affects you
3
Basics
• Big Data refers to the vast quantities of data that
businesses and governments gather
• This data is *believed* to contain useful,
actionable intelligence that could lead to
– Process efficiencies
– Lower costs
– Higher profits
– Identification of terrorism threats/plans
• What is needed is the will and expertise to
perform the relevant analysis.
4
How Big is Big?
• It depends on how quickly you can access and
process data (with normal database management
tools)
• For a small company, hundreds of gigabytes could be
big. For a larger company, hundreds of terabytes
– 1 terabyte = 1000 gigabytes
– 1 petabyte = 1000 terabytes
– 1 exabyte = 1000 petabytes
• Beyond that: zettabyte (1000 exabytes) and yottabyte (1000 zettabytes)
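The unit ladder above can be turned into a small conversion helper. This is only an illustrative sketch: the function name `convert` and the list `UNITS` are made up for this example, and it assumes the decimal (1000-based) steps used on this slide rather than binary (1024-based) ones.

```python
# Units in ascending order; each step is a factor of 1000 (decimal convention).
UNITS = ["gigabyte", "terabyte", "petabyte", "exabyte", "zettabyte", "yottabyte"]


def convert(value: float, from_unit: str, to_unit: str) -> float:
    """Convert a quantity between the units listed above."""
    steps = UNITS.index(from_unit) - UNITS.index(to_unit)
    if steps >= 0:
        # Moving to a smaller unit: the number gets bigger.
        return value * (1000 ** steps)
    # Moving to a larger unit: the number gets smaller.
    return value / (1000 ** -steps)


print(convert(1, "petabyte", "gigabyte"))    # one petabyte expressed in gigabytes
print(convert(500, "gigabyte", "terabyte"))  # a small company's "big" data in terabytes
```

With this helper, "hundreds of gigabytes" for a small company and "hundreds of terabytes" for a larger one sit exactly three orders of magnitude apart.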
5
Size Contexts
• Some areas of science generate huge amounts of
data:
– Meteorology (weather forecasting) & Remote Sensing
– Genomics (genome sequencing)
– Physics, e.g. CERN
• 150 million sensors each deliver data 40 million times per second
• Working with only 0.001% of the data collected, still 25 petabytes
a year is collected
• If all data was used, it would be 500 exabytes a day – 200 times
more than all other global data sources combined
– Social data, RFID data,
– Surveillance – NSA & GCHQ
6
The History
• Big Data is not a new topic
– Data has been getting bigger continually ever since the
first byte was created
– It is related to storage capacity and processing power –
which also keep growing continually
• Over the last 25 years, many governments have
attempted to consolidate data holdings into single
databases controlled by single parties
– National ID Schemes
– National Health Records Management
7
Corporate Examples
• Amazon handles millions of back-end operations
every day, as well as queries from more than half a
million third-party sellers.
• Walmart handles more than 1 million customer
transactions every hour, which are imported into
databases estimated to contain more than 2.5
petabytes (2,500 terabytes) of data
• Facebook handles 50 billion photos.
• TaoBao & Alibaba – again, billions of transactions
• Consumer profile databases, Loyalty Cards, Octopus
• Park’n Shop’s Money Back Card is the same thing
8
Ford
• http://www.datanami.com/datanami/2013-03-16/how_ford_is_putting_hadoop_pedal_to_the_metal.html
• Ford’s modern hybrid Fusion model generates up to
25 gigabytes of data per hour
– Data that is a potential goldmine for Ford, as long as it
can find the right analytical tools for the job.
• The data can be used to
– understand driving behaviors and reduce accidents,
– understand wear and tear
– identify issues that lower maintenance costs,
– avoid collisions
• But who should own the data? Ford? The car owner?
9
Needles & Haystacks
• The volume of data is
huge, beyond imagination,
and the consultants and
software firms want us to
believe that somewhere, if
you can find them, there
may be some needles –
pieces of actionable
intelligence
10
Who is Pushing Big Data?
• IBM!
– Because they want to sell you their software that
(they claim) will help you to analyse the data and
find the needles
• Consultants stand to make millions, by
panicking their clients into spending on
software solutions
• Globally, this is a US$100 billion industry,
growing 10% a year
11
Is Everyone Happy?
• The consultants suggest not. Accenture:
– 22% of companies are very satisfied
– 35% are quite satisfied
– 34% are dissatisfied
– 39% say that they have data that is relevant to
their business strategy
• Big data can be useful – if you know what to
look for and how to get that ‘intelligence’ to
the people who can use it
12
Consultant Perspectives
• Companies have lots of data, but “most
organisations measure too many things that
don’t matter and don’t put sufficient focus
onto the things that do” (Accenture).
• “Companies are buried in information” and
are struggling to use it (McKinsey)
• The more data they have, the less they seem
to know!
– The more you know, the more you don’t know?!
13
Then What Should the Companies Do?
• Spend more money (say the consultants)
– “a large investment in new data capabilities”
• McKinsey
– “embed analytics into business processes”
• Accenture
• Alternatively
– Go and ask people what they think is happening!
– Ask your lost customers why they got lost!
• A survey or big data analytics won’t tell you why.
14
Gartner’s Hype Cycle
15
Big Data and Intelligence
• One of the highest impact news stories since
June 2013 has concerned the secret
surveillance activities of the NSA and GCHQ
agencies – as revealed by Edward Snowden
• These surveillance activities are fundamentally
about big data and analytics, just as they are
also about privacy and security, espionage and
politics
16
Key Terms
• NSA – National Security Agency (US) (www.nsa.gov)
• GCHQ – Government Communications Headquarters (UK)
(www.gchq.gov.uk)
• Prism, Tempora, XKeyscore, Bullrun
– Systems that store, retrieve and analyze the data
• The Guardian
(http://www.theguardian.com/international)
– UK newspaper that first published the stories
• Patriot Act
– US Act for homeland security passed after the attacks of 11 September 2001
http://en.wikipedia.org/wiki/Patriot_Act
17
The Government’s Perspective
• Looking for needles in the metadata
– Phone numbers, call duration & frequency
– Global patterns that may involve terrorism
– If a bombing in India can be matched to a sudden
increase of calls in another country, that might be
of interest
– To be effective, they need as much data as
possible – in short, everything.
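To make the "sudden increase of calls" idea concrete, here is a minimal sketch of spike detection on call metadata. Everything in it is an assumption for illustration: the daily call counts are invented, the function name `find_spikes` is hypothetical, and the simple z-score rule stands in for whatever methods the agencies actually use, which these slides do not describe.

```python
from statistics import mean, stdev


def find_spikes(daily_calls, threshold=3.0, window=7):
    """Return indices of days whose call count is far above the recent baseline.

    A day counts as a spike when it exceeds the mean of the previous
    `window` days by more than `threshold` sample standard deviations.
    """
    spikes = []
    for i in range(window, len(daily_calls)):
        baseline = daily_calls[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        # Skip perfectly flat baselines to avoid dividing by zero.
        if sigma > 0 and (daily_calls[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes


# Invented daily call counts between two countries; day 8 is the anomaly.
calls = [100, 104, 98, 101, 97, 103, 99, 102, 240, 100]
print(find_spikes(calls))
```

Note that the rule only flags a pattern. Deciding whether a flagged day has anything to do with a real-world event still needs other data and human judgement, which is one reason the agencies argue they need "everything".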
18
The Surveillance Picture
• Edward Snowden has leaked a LOT of information
• The stories are still coming. We have learned a LOT
about what governments do – with their own
citizens’ data, and with data from other countries
• You may recall stories about data being captured in
Hong Kong and China from the Chinese University
and Tsinghua University Internet hubs
– http://www.reuters.com/article/2013/06/24/us-usa-security-tsinghua-idUSBRE95N0M220130624
• This is a series of events of global proportion
• We should not be surprised at anything any more
– If they want to collect something, anything, then they can and will.
19
Selected Events
• Publication of a top-secret court order against
Verizon mandating it to hand over the call records of
all its customers
• http://www.theguardian.com/world/2013/jul/19/nsa-extended-verizon-trawl-through-court-order
• Orders for all other telecoms firms also existed
• Large-scale collection of data without individual
warrants
– Prism
• http://en.wikipedia.org/wiki/PRISM_(surveillance_program)
20
Prism
• A system that gives the NSA access to the personal
information of non-US people from US Internet
companies
– Apple, Facebook, Google, Microsoft, Skype, Yahoo,…
• These companies always claimed that they protected
individual privacy, but … it seems that this was not
the case
• However, they were legally required to say nothing –
the court orders prohibited them saying anything
about their data sharing with the NSA
• Data obtained by cable tapping
– Metadata & content from 4 US telecoms providers’ cables
21
Facebook
• During Jan-June 2013, governments requested
info on 38,000 Facebook users
– 11,000+ from the US (79% compliance)
– 4,000+ from India (50% compliance)
– 170 from Turkey (47% compliance)
– 11 from Egypt (0% compliance)
– http://www.theguardian.com/technology/2013/aug/27/facebook-government-user-requests
22
XKeyscore
• This is the data retrieval system used to collect,
process and search the data
• http://en.wikipedia.org/wiki/XKeyscore
• It allows an NSA analyst to query “nearly everything
a typical user does on the Internet” in near-real time,
including:
– Email content
– Websites visited and searches
– Metadata
• In theory these systems were designed to analyse
data about foreigners, but many Americans were also
included in the databases
23
GCHQ
• This is the UK government agency that
deals with signals intelligence, i.e. intercepting
and analysing telecommunications
• http://www.gchq.gov.uk
• http://en.wikipedia.org/wiki/Government_Communications_Headquarters
• Access to Prism since 2010
• Operates Tempora, similar to Prism, for
collecting data from the Internet and
telecoms networks
24
GCHQ
• In 2009, GCHQ spied on foreign politicians
visiting the UK for a G20 summit
– Eavesdropping on phone calls and emails
– Monitoring computers
– Installing keyloggers and then tracking activities
post-summit
– Turkish Finance Minister (Simsek)
– Russian leader (Medvedev)
• Purpose – Economic/Political Intelligence
25
Tempora
• Much of the data is harvested from Internet
cables that enter the UK (GBs-TBs per second)
– 300 GCHQ and 250 NSA analysts are involved
• Telephone calls, Email messages, Facebook entries,
Personal Internet history, IM chats, passwords
– Cooperation with private telecoms companies
– Content held for 3 days, metadata for 30 days
• http://en.wikipedia.org/wiki/Tempora
• http://www.theguardian.com/uk/2013/jun/21/gchq-cables-secret-world-communications-nsa
26
Bullrun
• NSA and GCHQ spend millions developing
programmes that can break Internet security
(cryptography) protocols such as HTTPS and SSL
• They also work directly with the telecom
providers to ensure that they have backdoors
that help them to access data that clients
think is private/secret (AT&T and the UN)
• There are no Secrets!
– http://www.theguardian.com/world/2013/sep/05/nsa-gchq-encryption-codes-security
27
Collusion or Legal Obligation?
• One defence offered by the private companies
that hold the data is that they are required to
obey the law of the countries in which they
operate
– They have no choice – they must hand over the
data, or cooperate with the security agencies
– Also, they cannot reveal that they are cooperating
– they are gagged from revealing the existence of
the Prism/Tempora/Bullrun systems
28
Payouts
• GCHQ and NSA are working with each other,
sharing each other’s data
• The NSA subsidizes GCHQ’s costs by millions of
pounds (GBP) annually
• http://www.theguardian.com/uk-news/2013/aug/01/nsa-paid-gchq-spying-edward-snowden
• The NSA benefits because GCHQ operates under
less strict operational & oversight rules
• NSA expects returns… reports, intelligence.
29
Problems
• Big data is HUGE – there is simply too much
data to collect and analyse
– GCHQ may collect up to 20% of the actual data
flow
• Big data is getting bigger
– Cables that carry hundreds of GBs/second make
that task harder still
• As always, 99.999% of the data is not useful.
– Can you find the 0.001% that might be?
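A rough back-of-envelope calculation shows how the percentages on this slide combine. This is a sketch under stated assumptions: the 100 GB/second cable rate is just one illustrative value for "hundreds of GBs/second", the 20% and 0.001% shares are taken from the bullets above, and the function name `useful_gb_per_day` is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def useful_gb_per_day(cable_gb_per_s, collected_share=0.20, useful_share=0.00001):
    """Estimate how many gigabytes of 'needles' one cable might yield per day.

    collected_share: fraction of the flow that is collected (20% on the slide).
    useful_share: fraction of collected data that is useful (0.001% = 0.00001).
    """
    daily_flow_gb = cable_gb_per_s * SECONDS_PER_DAY
    collected_gb = daily_flow_gb * collected_share
    return collected_gb * useful_share


# Assumed cable rate of 100 GB/second, purely for illustration.
print(round(useful_gb_per_day(100), 2))
```

Under these assumptions a single cable carries several petabytes a day, of which only around 17 gigabytes would be "needles". The hard part is not the arithmetic but knowing which 17 gigabytes.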
30
Reactions
• There have been attempts to stop media
organizations from reporting on the
surveillance programmes
• Computers owned by the Guardian newspaper
were physically destroyed in an attempt to
remove the data & prevent further publication
– Additional copies are held in Brazil and the US
– http://www.wired.com/threatlevel/2013/08/guardian-snowden-files-destroyed/
31
Implications for Individuals
• Is your data being harvested?
– It seems likely.
• Are your private communications, including
online purchases, secure? Private?
– Not very.
• Are you protected by data privacy laws?
– Not against governments.
– Perhaps against private companies.
• http://www.pcpd.org.hk/
32
Questions
• What kind of data is being collected?
– Where, by whom, and for what purposes?
– Can we see/find (some of) the data anywhere?
– Are you personally at risk?
• That depends on who you are, what you do, who you
talk to and what about.
– Should we be concerned?
• Is there anything we can do as individuals, as decision
makers, as companies?
– http://www.theguardian.com/world/2013/sep/05/nsa-how-to-remain-secure-surveillance
• Or is it more sensible just to get on with our lives?
• Do some Internet research now and try to
answer some of these questions.
33