Statistical Challenges in Big Data

Niall Adams

Department of Mathematics, Imperial College London
Heilbronn Institute for Mathematical Research, University of Bristol

January 2015

Contents

1. "Big Data" – some general comments.
2. Exemplar application: network cyber-security.
3. Statistical challenges.
4. Conclusion.

Disclaimers

- My personal view; not looking to tread on any toes.
- Is it "Big Data", "Big data", "big data", "Big Data"?

1. General Comments

What's hot in Data Science?

[List of what's hot and what's not in data science, adapted from the CrowdFlower blog [1]; shown as an image in the original slide.]

[1] http://www.crowdflower.com/blog/data-science-2015-whats-hot-whats-not

Depending on your point of view, this list is either:
- a reassuring, socially conscious, inclusive vision for where data science is going, or
- marketing flannel.

Either way, these points do not cover technical aspects at all. Sigh. Let's first ask...

What is "Big Data"?

From the wiki (emphasis mine):

"Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process them using traditional data processing applications. The challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and privacy violations. ..."

The highlighted terms were colour-coded on the original slide:
- Green: primarily CS.
- Blue: primarily statistics and machine learning.
- Red: an issue for everyone?

More Cynically...

"Data Science" and "Big Data" are simply a rebranding of data mining, with the ambition of bringing together ever more diverse data sources and problems. Much is promised: "if you can obtain this data, do X, profit will accrue". Where X = ? Statistics? Machine learning? Voodoo? (This is based on several interesting consultancy experiences.)

The Five V's

Big data is usually characterised as:
- Volume – data size
- Velocity – data rate
- Variety – diverse data sources, cf. data fusion
- Veracity – data quality
- Value – a commercial consideration?

Each of these presents challenges, which we will return to.

Data mining was originally conceived as a secondary processing activity, with the tag line "discovering nuggets in data". Secondary here means that the data was collected for a primary purpose and is inspected for other things. For example:
- bank account records – customer relationship management,
- supermarket shopping records – market basket analysis,
- telecoms records – fraud detection.

Such data was generally collected for accounting and control purposes. Big data seems to want to take this further, and collect any and all data together.

2. Cyber-security

Let's think about big data in network cyber-security, an important problem:

[Montage of BBC News headlines, December 2014: "Hack attack causes 'massive damage' at steel works"; "North Korea partially back online after internet collapse"; "US insists North Korea must take Sony hack blame"; "Sony hack: US mulls putting N Korea back on terror list"; "Sony hack: China tells US it opposes cyber attacks"; "Ukraine conflict: Hackers take sides in virtual war".]
- There are many types of cyber-attacker and many types of victim.
- Our focus is on defending a corporate network, using traffic flow data.
- We are primarily developing statistical streaming and network analysis methodology as a filtering tool to support, not supplant, network analysts.
- We are particularly interested in local (node, edge, neighbourhood) approaches for this application.

This is required as a complementary tool because:
- the network is too big and complicated for routine DPI-style forensic analysis,
- packet capture is impractical at corporate scale,
- privacy concerns,
- discovery versus diagnosis.

Example

[Figure: sophisticated network intrusion at Los Alamos National Laboratory, reproduced from Neil et al. [2]. Binned NetFlow data.]

[2] Neil, J. et al. (2014), "Statistical detection of intruders within computer networks using scan statistics", in Adams, N. and Heard, N. (eds), Data Analysis for Network Cyber-Security, Imperial College Press.

Our example is NETFLOW data collected at Imperial College London. This network:
- has ~40K computers,
- generates ~12Tb of flow data per month, or ~15Gb per hour.
- Experience suggests there is no smoking gun in NETFLOW; we prefer to focus on weak signals and on combining evidence.

Some interests and idiosyncrasies of Imperial:
- particularly concerned about illegal transfer of copyright material and protecting intellectual property,
- few constraints on network usage (academic freedom, halls of residence, ...).

We are developing a HADOOP-based system for statistical querying of bulk NETFLOW; a sketch of this style of query follows.
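To make the querying style concrete, here is a minimal Hadoop Streaming sketch in Python – an illustration only, not the system just described. It totals bytes per destination port across bulk flow records; the tab-separated layout, with the destination port at field index 7 and the byte count at index 9, is an assumption for the example.

```python
#!/usr/bin/env python
# mapper.py -- emit (dst_port, bytes) for every flow record on stdin.
# Assumes tab-separated fields with dst_port at index 7 and bytes at
# index 9; this layout is hypothetical and must match the real export.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    try:
        print(f"{fields[7]}\t{int(fields[9])}")
    except (IndexError, ValueError):
        continue  # skip malformed records rather than fail the job
```

```python
#!/usr/bin/env python
# reducer.py -- sum bytes per dst_port; Hadoop presents keys sorted,
# so a single pass with one running total per key suffices.
import sys

port, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != port:
        if port is not None:
            print(f"{port}\t{total}")
        port, total = key, 0
    total += int(value)
if port is not None:
    print(f"{port}\t{total}")
```

Such a job would be submitted with the standard streaming jar, e.g. hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input flows/ -output port_bytes/ (paths illustrative). The slide's point stands: this style is fine for querying, but awkward for iterative statistical procedures.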
2.1. Data

A NETFLOW record is a summary of a connection between two network devices, collected as the connection traverses a router.

Example NETFLOW data (anonymised). This consists of two flow records from the same source address to the same destination address, on destination port 80. The two events started within 2 seconds of each other.

Date flow start          Duration  Proto  Src IP Addr   Dst IP Addr     Src Pt  Dst Pt  Packets  Bytes
2009-04-22 12:04:44.664  13.632    TCP    126.253.5.69  124.195.12.246  49882   80      1507     2.1 M
2009-04-22 12:04:42.613  16.768    TCP    126.253.5.69  124.195.12.246  49881   80      2179     3.1 M

There are numerous challenges with NETFLOW data (a parsing sketch follows this list):
- quality: duplication, direction, timing, etc.,
- scale, human change, machine change, anonymity, etc.,
- data analysis focus: event, node, edge, neighbourhood, ...
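As a small illustration of handling such records, the sketch below parses one whitespace-separated line into a typed record. Two assumptions are made for the example: the column order shown in the table above, and K/M/G suffixes on human-readable byte counts; real NetFlow exports vary. Because the record type is frozen, and hence hashable, exact duplicates – one of the quality issues just listed – can be dropped by passing flows through a set.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Flow:
    """One NetFlow record, mirroring the columns in the example above."""
    start: datetime
    duration: float
    proto: str
    src_ip: str
    dst_ip: str
    src_pt: int
    dst_pt: int
    packets: int
    bytes: float

_SUFFIX = {"K": 1e3, "M": 1e6, "G": 1e9}  # assumed suffix convention

def parse_flow(line: str) -> Flow:
    # Assumes whitespace-separated fields in the column order shown above;
    # the byte count may arrive raw ("1507") or suffixed ("2.1 M").
    f = line.split()
    nbytes = float(f[9]) * _SUFFIX.get(f[10], 1.0) if len(f) > 10 else float(f[9])
    return Flow(
        start=datetime.strptime(f[0] + " " + f[1], "%Y-%m-%d %H:%M:%S.%f"),
        duration=float(f[2]),
        proto=f[3],
        src_ip=f[4],
        dst_ip=f[5],
        src_pt=int(f[6]),
        dst_pt=int(f[7]),
        packets=int(f[8]),
        bytes=nbytes,
    )
```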
3. Challenges

I want to comment on:
- Volume, and
- Velocity,
and link them to the cyber-security application.

CS versus Statistics

There are clear differences between computing research and statistics/ML research. Broadly:
- CS: improved hardware and software infrastructure; efficient algorithms and mathematical guarantees; database and search operations; languages.
- Statistics: new inferential methodology; handling uncertainty; mathematical properties of tools; applied statistics.

It is in applying statistical methods to big data that the two areas meet (collide?). Optimising infrastructure is CS research, and may not be of particular interest for inference problems, where we need a stable and easy-to-use platform. So optimisation for one may not help with optimisation for the other.

3.1. Volume

The fundamental problem has always been data too big to fit in memory. With modern cloud systems, this is further exacerbated by the distributed nature of the data and the infrastructure for accessing it. For example, HADOOP is best for querying – complicated data analysis procedures are difficult to craft in it.

A basic challenge is to adapt data analysis procedures to the constraints of the infrastructure. It is convenient to distinguish:
- model building: large-scale description of the data, for summary or prediction (e.g. classification),
- pattern detection: finding small, local structures in the data (e.g. association rules, anomaly detection, mode hunting).

If we can treat the data as a giant IID sample (unlikely, see later):
- Most conventional (frequentist) statistical hypothesis testing procedures become irrelevant with big data – with enough data, they tend always to imply significance. A new paradigm might be required.
- Permutation and resampling procedures may be preferable, but then computational burden becomes an issue; on this aspect, the work of Gandy & Rubin-Delanchy is particularly relevant. (A minimal sketch follows this list.)
- If we are interested in building models of the whole data, we do NOT need to use all the data. Sampling is sufficient – and the computation saved is better spent on model selection and so on.
- On the other hand, if we are concerned with anomaly detection and local structures, sampling is wont to miss the structures we seek.
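To make that computational burden concrete: a vanilla Monte Carlo permutation test touches the pooled data once per permutation. The sketch below (names and defaults are illustrative, not from the talk) estimates a two-sided p-value for a two-sample difference in means.

```python
import numpy as np

def perm_test_mean_diff(x, y, n_perm=9999, rng=None):
    """Monte Carlo permutation test for a two-sample difference in means.

    Returns an estimated two-sided p-value; each call costs
    O(n_perm * (len(x) + len(y))) -- the burden noted above.
    """
    rng = np.random.default_rng() if rng is None else rng
    x, y = np.asarray(x, float), np.asarray(y, float)
    pooled = np.concatenate([x, y])
    observed = abs(x.mean() - y.mean())
    n_x = len(x)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                      # one full pass per permutation
        diff = abs(pooled[:n_x].mean() - pooled[n_x:].mean())
        hits += diff >= observed
    return (hits + 1) / (n_perm + 1)             # add-one avoids a zero p-value
```

At big-data scale the resampling loop dominates; sequential schemes that stop resampling early once the decision is already clear attack exactly this cost, which is the flavour of the Gandy & Rubin-Delanchy work mentioned above.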
- With complicated problems (such as cyber), it is hard to motivate a convincing generative model for the data, and completely impractical to attempt to compute such a model.
- There is much interesting work on scaling up exact inference procedures (e.g. SMC). But is exact inference really needed? That depends on the specifics of the application. On Friday, Dan Lawson will talk about a new kind of approximate inference procedure for big data.

In the cyber-security example, we are interested in monitoring and local anomalies. Options:
- computers,
- edges,
- small neighbourhoods,
- time.

Time is a critical issue in a number of ways:
- temporal aspects can suggest a breach,
- the world is changing, and this needs to be accounted for and handled.

So, when can we treat our data as IID? How do we handle periodicities? Drift? Abrupt change?

Many big data problems have this character: a massive number of manageable, inter-connected problems.

Graphs

Relational data, producing networks, has been a big driver of big data – for example, social media networks (sigh). There are good big data tools for some types of graph analysis (e.g. GraphLab). In our cyber-security example, there is a dynamic network structure. In that application:
- the large-scale structure of the graph may not be of interest,
- hubs tend to be created by automated traffic.

We desperately need better models for graphs, particularly dynamic graph structures. The work of Wolfe & Olhede on "graphons" is particularly interesting in this regard.

Velocity

High-frequency data is often handled with streaming analysis. This seeks to:
- touch each data point only once,
- automatically handle temporal variation.

One area of particular interest to me is extending conventional statistical procedures to the stream by incorporating a forgetting factor; a minimal sketch follows below.

Often, streaming analytics operate in processing pipelines that reduce the high-frequency data to a manageable quantity. Some challenges:
- efficiency: updating and temporal adaptation must be very fast,
- self-monitoring: how do we give a model the capability to monitor itself? Dire consequences if a model generates gibberish.
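Here is a minimal sketch of the forgetting-factor idea – illustrative, and not the method of the Bodenham & Adams papers cited on the next slides. Each arriving point gets weight 1 and all earlier weights are multiplied by a factor lam ≤ 1, so the estimates track drift while each point is touched exactly once.

```python
class ForgettingMean:
    """Streaming mean/variance with forgetting factor lam (0 < lam <= 1).

    lam = 1 recovers the ordinary running mean; smaller lam
    down-weights the past geometrically, adapting to drift.
    """

    def __init__(self, lam: float = 0.99):
        self.lam = lam
        self.w = 0.0       # effective sample size: sum of lam**k
        self.mean = 0.0
        self.s = 0.0       # decayed sum of squared deviations

    def update(self, x: float) -> None:
        # One pass per point: O(1) time and memory.
        self.w = self.lam * self.w + 1.0
        delta = x - self.mean
        self.mean += delta / self.w
        self.s = self.lam * self.s + delta * (x - self.mean)

    def variance(self) -> float:
        return self.s / self.w if self.w > 0 else float("nan")
```

A continuous-monitoring rule can then flag points lying more than a few estimated standard deviations from the current mean; choosing lam and that threshold is precisely the control-parameter burden discussed next.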
Change detection on the stream

Context:
- an unending sequence of data,
- changes of unpredictable size, at unpredictable times,
- no opportunity to intervene – parameter setting?

Such contexts arise in modern applications, such as high-frequency finance and network monitoring. We call this continuous monitoring [3]. Standard SPC approaches do NOT adapt well to this scenario. We prefer an adaptive estimation framework, incorporating a decision rule, to provide a parametric continuous monitoring device.

Headline: this provides comparable performance with fewer control parameters → automation is important.

[3] Bodenham, D.A. and Adams, N.M., "Continuous changepoint monitoring of data streams using adaptive estimation", submitted (2014).

Application Example

Context: multivariate adaptive change detection for continuous monitoring of destination ports on a single router [4]. 14 days of data, 100-minute bins.

[Figure – left: data and flagged changes; right: active nodes in the flagged bin.]

[4] Bodenham, D.A. and Adams, N.M., "Continuous monitoring of a computer network using multivariate adaptive estimation", in Data Mining Workshops (ICDMW), 2013 IEEE 13th International Conference on, Dec 2013, pp. 311–318.

4. Conclusion

- A key message about big data reminds me of how I apologise for my inadequate sexual performance: it is not how big it is, it is what you do with it [5].
- Claims should be based on "what insight was extracted", NOT "how much data was used".

The statistical challenges of big data have two fundamental sources:
- how to reason about giant data sets in the abstract,
- how to implement this reasoning given the specifics of the collection, storage, processing and reporting infrastructure.

All of these are very exciting challenges.

[5] Yes, I know that this is never a convincing claim. Sigh.

Thank you! Questions?