Statistical Challenges in Big Data
Niall Adams¹,²
¹ Department of Mathematics, Imperial College London
² Heilbronn Institute for Mathematical Research, University of Bristol
January 2015
Contents
1. “Big Data” – some general comments.
2. Exemplar application: network cyber-security.
3. Statistical Challenges.
4. Conclusion.

Disclaimers
▶ My personal view, not looking to tread on any toes.
▶ Is it “Big Data”, “Big data”, “big data”, “Big Data”?
1. General Comments
What’s hot in Data Science?
Adapted from: http://www.crowdflower.com/blog/data-science-2015-whats-hot-whats-not
Depending on your point of view, this list is either:
▶ A reassuring, socially conscious, inclusive vision for where data science is going.
▶ Marketing flannel.
Either way, these points do not cover technical aspects at all. Sigh.
Let’s first ask . . .
What is “Big Data”?
From the wiki (emphasis mine):
“Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process them using traditional data processing applications. The challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and privacy violations. ...”
▶ Green: primarily CS
▶ Blue: primarily statistics and machine learning
▶ Red: an issue for everyone?
More Cynically...
“Data Science” and “Big Data” are simply a rebranding of Data Mining, with the ambition of bringing together ever more diverse data sources and problems.
Much is promised: “if you can obtain this data, do X, profit will accrue”.
Where X = ? Statistics? Machine learning? Voodoo? (This is based on several interesting consultancy experiences.)
The Five V’s
Big data is usually characterised as:
▶ Volume - data size
▶ Velocity - data rate
▶ Variety - diverse data sources, cf. data fusion
▶ Veracity - data quality
▶ Value - a commercial consideration?
Each of these presents challenges, which we will return to.
Data mining was originally conceived as a secondary processing activity, with the tag line “discovering nuggets in data”.
Secondary here means that the data was collected for a primary purpose and is inspected for other things. For example:
▶ bank account records – customer relationship management,
▶ supermarket shopping records – market basket analysis,
▶ telecomms records – fraud detection.
Such data was generally collected for accounting and control purposes.
Big data seems to want to take this further, and collect any and all data together.
2. Cyber-security
Let’s think about big data in network cyber-security, an important problem:
[Slide: screenshots of BBC News stories, December 2014 – “Hack attack causes ‘massive damage’ at steel works”; “North Korea partially back online after internet collapse”; “US insists North Korea must take Sony hack blame”; “Sony hack: US mulls putting N Korea back on terror list”; “Sony hack: China tells US it opposes cyber attacks”; “Ukraine conflict: Hackers take sides in virtual war”.]
allegations
I
There are many types of cyber-attacker and many types of
victim.
I
Our focus is on defending a corporate network, using traffic
flow data.
I
Primarily developing statistical streaming and network analysis
methodology as a filtering tool to support, not supplant
network analysts.
I
Particularly interested in local (node, edge, neighbourhood)
approaches for this application.
This is required as a complementary tool because
I
The network is too big and complicated for routine DPI-style
forensic analysis
I
Packet capture impractical at corporate scale
I
Privacy concerns
I
Discovery versus diagnosis
11/26
Example
Sophisticated network intrusion, Los Alamos National Lab, reproduced from Neil et al.² Binned Netflow data.
² Neil et al. (2014), ‘Statistical detection of intruders within computer networks using scan statistics’, in Adams, N. and Heard, N. (eds), Data Analysis for Network Cyber-Security, Imperial College Press.
Our example is NETFLOW data collected at Imperial College London. This
▶ has ∼ 40K computers,
▶ generates ∼ 12Tb of flow data per month, or ∼ 15Gb per hour,
▶ experience suggests no smoking gun in NETFLOW; prefer to focus on weak signals and combining evidence.
Some interests and idiosyncrasies of Imperial:
▶ Particularly concerned about: illegal transfer of copyright material, protecting IP.
▶ Few constraints on network usage (academic freedom, halls of residence, . . . ).
We are developing a HADOOP-based system for statistical querying of bulk NETFLOW.
2.1. Data
A NETFLOW record is a summary of a connection between two network devices, collected as the connection traverses a router.
Example NETFLOW data (anonymised). This consists of two flow records from the same source address to the same destination address, on destination port 80. The two events started within 2 seconds of each other.
Date flow start          Duration  Proto  Src IP Addr   Dst IP Addr     Src Pt  Dst Pt  Packets  Bytes
2009-04-22 12:04:44.664  13.632    TCP    126.253.5.69  124.195.12.246  49882   80      1507     2.1 M
2009-04-22 12:04:42.613  16.768    TCP    126.253.5.69  124.195.12.246  49881   80      2179     3.1 M
Numerous challenges with NETFLOW data:
▶ quality: duplication, direction, timing, etc.
▶ scale, human change, machine change, anonymity, etc.
▶ data analysis focus: event, node, edge, neighbourhood, . . .
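To make the record format concrete, here is a minimal sketch of parsing such flow records and computing a per-edge byte summary – the kind of local reduction a statistical querying system over bulk NETFLOW might perform. The whitespace-separated layout, field names, and exact byte counts are assumptions for illustration, not the output format of any particular collector.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical whitespace-separated flow records, mimicking the table above
# (exact byte counts invented; the table only shows rounded "2.1 M" / "3.1 M").
RAW_FLOWS = """\
2009-04-22 12:04:44.664 13.632 TCP 126.253.5.69 124.195.12.246 49882 80 1507 2100000
2009-04-22 12:04:42.613 16.768 TCP 126.253.5.69 124.195.12.246 49881 80 2179 3100000
"""

def parse_flow(line):
    """Parse one flow record into a dict (field layout assumed for illustration)."""
    f = line.split()
    return {
        "start": datetime.strptime(f[0] + " " + f[1], "%Y-%m-%d %H:%M:%S.%f"),
        "duration": float(f[2]),
        "proto": f[3],
        "src": f[4], "dst": f[5],
        "sport": int(f[6]), "dport": int(f[7]),
        "packets": int(f[8]), "bytes": int(f[9]),
    }

# Aggregate total bytes per directed edge (src, dst) -- a typical local summary.
edge_bytes = defaultdict(int)
for line in RAW_FLOWS.strip().splitlines():
    flow = parse_flow(line)
    edge_bytes[(flow["src"], flow["dst"])] += flow["bytes"]

for (src, dst), total in edge_bytes.items():
    print(f"{src} -> {dst}: {total / 1e6:.1f} MB")
```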
3. Challenges
Want to comment on:
▶ Volume,
▶ Velocity,
and link to the cyber-security application.
CS versus Statistics
There are clear differences between computing research and statistics/ML research, broadly:
▶ CS: improved hardware, software infrastructure. Efficient algorithms, mathematical guarantees. Database and search operations. Languages.
▶ Statistics: new inferential methodology, handling uncertainty, mathematical properties of tools. Applied statistics.
It is in applying statistical methods to big data that the two areas meet (collide?).
Optimising infrastructure is CS research, and may not be of particular interest for inference problems, where we need a stable and easy-to-use platform. So, optimisation for one may not help with optimisation for the other.
3.1. Volume
The fundamental problem has always been that data is too big to fit in memory.
With modern cloud systems, this is further exacerbated by the distributed nature of the data, and the infrastructure for accessing it. For example, HADOOP is best for querying – complicated data analysis procedures are difficult to craft in it (a caricature follows the list below).
A basic challenge is to adapt data analysis procedures to the constraints of the infrastructure.
It is convenient to distinguish:
▶ model building: large-scale description of the data, for summary or prediction (e.g. classification),
▶ pattern detection: finding small local structures in the data (e.g. association rules, anomaly detection, mode hunting).
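As that caricature, here is a plain-Python map/reduce over partitioned data, standing in for a real HADOOP job (the data and partitioning are invented): a one-pass aggregation decomposes cleanly across partitions, whereas an iterative fitting procedure would need a fresh round of jobs per iteration.

```python
from collections import Counter
from functools import reduce

# Three "partitions" of (dst_port, bytes) pairs, as if spread across data nodes.
partitions = [
    [(80, 2_100_000), (443, 512_000), (80, 3_100_000)],
    [(22, 4_096), (80, 1_024_000)],
    [(443, 88_000), (53, 512)],
]

def map_partition(part):
    """Map step: local byte totals per destination port."""
    totals = Counter()
    for port, nbytes in part:
        totals[port] += nbytes
    return totals

def merge(a, b):
    """Reduce step: merge two partial results."""
    a.update(b)  # Counter.update adds counts
    return a

grand_totals = reduce(merge, map(map_partition, partitions))
print(grand_totals.most_common(2))  # [(80, 6224000), (443, 600000)]
```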
If we can treat the data as a giant IID sample (unlikely, see later):
▶ Most conventional (frequentist) statistical hypothesis testing procedures become irrelevant with big data - they will always tend to imply significance. A new paradigm might be required.
▶ Permutation and resampling procedures may be preferable, but then computational burden becomes an issue. (On this aspect, the work of Gandy & Rubin-Delanchy is particularly relevant.)
▶ If we are interested in building models of the whole data, we do NOT need to use all the data. Sampling is sufficient - and the computation can be better spent on model selection and so on. (See the sketch after this list.)
▶ On the other hand, if we are concerned with anomaly detection and local structures, sampling is wont to miss the structures we seek.
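A minimal simulation of the first and third points (my own illustration, assuming IID Gaussian data): with a million observations, a practically negligible mean shift is declared overwhelmingly “significant”, while a modest random subsample recovers essentially the same model estimates.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# "Big" sample: a mean shift of 0.005 sd -- practically negligible.
n = 1_000_000
x = rng.normal(loc=0.005, scale=1.0, size=n)

# One-sample t-test: with n this large, even a trivial effect is "significant".
t, p = stats.ttest_1samp(x, popmean=0.0)
print(f"full data: mean={x.mean():+.4f}, t={t:.1f}, p={p:.1e}")

# For model building (here just estimating mean and sd), a small random
# subsample gives essentially the same answer at a fraction of the cost.
sub = rng.choice(x, size=10_000, replace=False)
print(f"subsample: mean={sub.mean():+.4f}, sd={sub.std():.4f}")
print(f"full data: mean={x.mean():+.4f}, sd={x.std():.4f}")
```

The p-value rewards sample size, not effect size. For anomaly detection, by contrast, the interesting rows are exactly the ones a uniform subsample is likely to drop.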
▶ With complicated problems (such as cyber), it is hard to motivate a convincing generative model for the data, and completely impractical to attempt to compute such a model.
▶ Much interesting work on scaling up exact inference procedures (e.g. SMC). But is exact inference really needed? Depends on the specifics of the application. On Friday, Dan Lawson will talk about a new kind of approximate inference procedure for big data.
In the cyber-security example, we are interested in monitoring and local anomalies. Options:
▶ Computers
▶ Edges
▶ Small neighbourhoods
▶ Time
Time is a critical issue in a number of ways:
▶ Temporal aspects can suggest a breach,
▶ The world is changing, and this needs to be accounted for and handled. So, when can we treat our data as IID? How do we handle periodicities? Drift? Abrupt change?
Many big data problems have this character: a massive number of manageable, inter-connected problems.
Graphs
Relational data, producing networks, has been a big driver of big data. For example, social media networks (sigh). There are good big data tools for some types of big data graph analysis (e.g. GraphLab).
In our cyber-security example, there is a dynamic network structure. In that application:
▶ the large-scale structure of the graph may not be of interest,
▶ hubs tend to be created by automated traffic (a toy hub-finding sketch follows).
We desperately need better models for graphs, particularly dynamic graph structures. The work of Wolfe & Olhede on “graphons” is particularly interesting in this regard.
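As a concrete, entirely illustrative example of a local graph summary: the sketch below builds a directed graph from (src, dst) flow pairs and flags high-out-degree nodes as candidate hubs – on a real network these often turn out to be servers, scanners, or other automated traffic. The addresses and threshold are invented; in practice the threshold would come from the degree distribution.

```python
from collections import defaultdict

# Hypothetical (src, dst) pairs harvested from flow records.
edges = [
    ("10.0.0.1", "10.0.0.9"), ("10.0.0.1", "10.0.0.7"),
    ("10.0.0.1", "10.0.0.3"), ("10.0.0.1", "10.0.0.5"),
    ("10.0.0.2", "10.0.0.9"), ("10.0.0.4", "10.0.0.9"),
]

# Distinct destinations contacted by each source node.
neighbours = defaultdict(set)
for src, dst in edges:
    neighbours[src].add(dst)

# Flag nodes with unusually many distinct destinations -- a crude hub test.
HUB_THRESHOLD = 3  # illustrative only
hubs = {node: len(ns) for node, ns in neighbours.items() if len(ns) >= HUB_THRESHOLD}
print("candidate hubs:", hubs)  # {'10.0.0.1': 4}
```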
Velocity
High-frequency data is often handled with streaming analysis. This seeks to
▶ touch each data point only once,
▶ automatically handle temporal variation.
One area of particular interest to me is extending conventional statistical procedures to the stream by incorporating a forgetting factor (a minimal sketch follows).
Often, streaming analytics operate in processing pipelines that seek to reduce the high-frequency data down to a manageable quantity.
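To illustrate the forgetting-factor idea, here is a minimal sketch of an exponentially weighted running mean and variance: each point is touched once, and the past is down-weighted geometrically. A fixed factor lam is assumed here; choosing or adapting lam is exactly where the interesting statistics lies.

```python
class ForgettingMeanVar:
    """Running mean/variance with forgetting factor lam in (0, 1].

    lam = 1.0 recovers the ordinary running estimates (Welford's method);
    smaller lam forgets the past faster. The decayed weight sum w acts as
    an effective sample size, so estimates are sensible from the start.
    """

    def __init__(self, lam=0.99):
        self.lam = lam
        self.w = 0.0      # effective sample size: sum of decayed weights
        self.mean = 0.0
        self.s = 0.0      # decayed sum of squared deviations

    def update(self, x):
        self.w = self.lam * self.w + 1.0
        delta = x - self.mean
        self.mean += delta / self.w
        self.s = self.lam * self.s + delta * (x - self.mean)

    @property
    def var(self):
        return self.s / self.w if self.w > 0 else float("nan")


# Usage: the estimate tracks a mean shift instead of averaging it away.
import random
random.seed(1)
est = ForgettingMeanVar(lam=0.95)
stream = [random.gauss(0, 1) for _ in range(500)] + \
         [random.gauss(3, 1) for _ in range(500)]
for x in stream:
    est.update(x)
print(f"final mean estimate: {est.mean:.2f} (recent true mean is 3)")
```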
Some challenges:
▶ Efficiency: updating and temporal adaptation must be very fast.
▶ Self-monitoring: how do we give a model the capability to monitor itself? Dire consequences if a model generates gibberish.
Change detection on the stream
Context:
▶ unending sequence of data,
▶ changes of unpredictable size, at unpredictable times,
▶ no opportunity to intervene – parameter setting?
Such contexts arise in modern applications, such as high-frequency finance and network monitoring. We call this continuous monitoring³.
Standard SPC approaches do NOT adapt well to this scenario. We prefer an adaptive estimation framework, incorporating a decision rule, to provide a parametric continuous monitoring device.
Headline: provides comparable performance with fewer control parameters → automation is important.
³ Bodenham, D.A. and Adams, N.M. (2014), “Continuous changepoint monitoring of data streams using adaptive estimation”, submitted.
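To show the overall pattern (a crude sketch of my own, not the Bodenham–Adams method, in which the forgetting factor itself adapts over time): combine a forgetting-factor estimator with a decision rule that flags points far from the current estimate, then restarts estimation.

```python
import random

def monitor(stream, lam=0.95, k=4.0, burn_in=30):
    """Yield indices where x deviates more than k estimated sds from the
    exponentially weighted mean; restart estimation after each flag."""
    w = mean = s = 0.0
    n_since_reset = 0
    for i, x in enumerate(stream):
        if n_since_reset >= burn_in:
            sd = (s / w) ** 0.5
            if abs(x - mean) > k * sd:
                yield i                    # flag a changepoint
                w = mean = s = 0.0         # forget everything and restart
                n_since_reset = 0
        w = lam * w + 1.0
        delta = x - mean
        mean += delta / w
        s = lam * s + delta * (x - mean)
        n_since_reset += 1

random.seed(2)
stream = [random.gauss(0, 1) for _ in range(300)] + \
         [random.gauss(5, 1) for _ in range(300)]
print("flagged at:", list(monitor(stream)))  # expect a flag near index 300
```

Note the control parameters (lam, k, burn_in): the appeal of the adaptive framework is precisely that it reduces how many of these must be hand-tuned.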
Application Example
Context: use multivariate adaptive change detection for continuous monitoring of destination ports on a single router⁴. 14 days of data, 100-minute bins.
Left: data and flagged changes. Right: active nodes in flagged bin.
⁴ Bodenham, D.A. and Adams, N.M. (2013), “Continuous monitoring of a computer network using multivariate adaptive estimation”, in Data Mining Workshops (ICDMW), 2013 IEEE 13th International Conference on, Dec 2013, pp. 311–318.
4. Conclusion
▶ A key message about big data reminds me of how I apologise for my inadequate sexual performance: it is not how big it is, it is what you do with it⁵.
▶ Claims should be based on “what insight was extracted”, NOT “how much data was used”.
The statistical challenges of big data have two fundamental sources:
▶ How to reason about giant data sets in the abstract.
▶ How to implement this reasoning given the specifics of the collection, storage, processing, reporting infrastructure.
All these are very exciting challenges.
⁵ Yes, I know that this is never a convincing claim. Sigh.
Thank you!
Questions?