Big Data: Big Challenges for Computer Science
Henri Bal, Vrije Universiteit Amsterdam

Multiple types of data explosions
● High-volume data: 10-100 x global internet traffic per year (by 2018)
● Complex data

Graphics Processing Units (GPUs)

Differences between CPUs and GPUs
● CPU: minimize the latency of a single activity (thread)
  ● Must be good at everything
  ● Big on-chip caches
  ● Sophisticated control logic
  [diagram: CPU die with control logic, a cache, and a few ALUs]
● GPU: maximize the throughput of all threads using large-scale parallelism
● Example: NVIDIA Maxwell
  ● 16 independent streaming multiprocessors
  ● 2048 compute cores

Ongoing GPU work at VU
● Applications
  ● Multimedia data
  ● Digital forensics data
  ● Climate modelling
  ● Radio astronomy data
● Methodologies
  ● COMMIT/: Hadoop on accelerators
  ● Programming methods for accelerators
● Teaching GPUs (with UvA)
● National ICT research infrastructure

Complex data
● Still smaller in volume than astronomy etc.
● Much more complicated, semantically rich data
● Growing fast

Semantic web
● Make the Web smarter by injecting meaning, so that machines can reason about it
● Initial idea by Tim Berners-Lee in 2001
● Has now attracted the interest of big IT companies

WebPIE: a Web-scale Parallel Inference Engine
● Web-scale parallel reasoner doing full materialization
● Orders of magnitude faster than previous work, thanks to smart parallel algorithms
● Jacopo Urbani + Frank van Harmelen (VU)
● Christiaan Huygens nomination for Urbani's PhD thesis

Reasoning on changing data
● WebPIE must recompute everything when the data changes
  ● Takes on the order of 1 day on a 64-node compute cluster
● Challenge: real-time incremental reasoning, combining new (streaming) data with historic data
  ● Nanopublications (http://nanopub.org)
  ● Handling 2 million news articles per day (Piek Vossen, VU)
  ● Data streams from (health) sensors & smart phones
● Exploit massive parallel computing and GPUs

Other work on complex data
● Use the semantic web to describe and reason about computer infrastructure (Cees de Laat, UvA)
● Machine learning using GPUs (Hadoop); joint work with Max Welling (UvA)
● Business applications, with Frans Feldberg (VU, Economics)

Discussion
● We can process peta-scale (10^15, LHC) simple data with cluster and grid technology
● Exascale (10^18, SKA) may be feasible with GPUs, but requires new parallel programming methodologies
● Processing complex data is vastly more complicated, even at smaller scales
● Complex data is also escalating in size
● Dynamic (streaming) data will be next
● Processing exa-scale dynamic complex data?
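The "full materialization" that WebPIE performs can be illustrated with a toy fixpoint computation. This is only a single-machine sketch of the idea, not WebPIE's actual MapReduce implementation; the triples and the two RDFS-style rules are illustrative assumptions.

```python
# Toy full materialization: repeatedly apply inference rules to a set of
# RDF-like triples until no new triples can be derived (a fixpoint).
# WebPIE does this at Web scale with parallel MapReduce jobs; this sketch
# shows only the semantics, on a handful of hand-picked triples.

def materialize(triples):
    """Compute the closure of `triples` under two RDFS-style rules."""
    closed = set(triples)
    while True:
        new = set()
        for (a, p1, b) in closed:
            for (b2, p2, c) in closed:
                if b != b2:
                    continue
                # (A subClassOf B), (B subClassOf C) => (A subClassOf C)
                if p1 == p2 == "rdfs:subClassOf":
                    new.add((a, "rdfs:subClassOf", c))
                # (x type A), (A subClassOf B) => (x type B)
                if p1 == "rdf:type" and p2 == "rdfs:subClassOf":
                    new.add((a, "rdf:type", c))
        if new <= closed:       # fixpoint reached: nothing new derived
            return closed
        closed |= new

# Illustrative input data (assumed, not from WebPIE's datasets)
facts = {
    ("Dog", "rdfs:subClassOf", "Mammal"),
    ("Mammal", "rdfs:subClassOf", "Animal"),
    ("rex", "rdf:type", "Dog"),
}
closure = materialize(facts)
```

The closure adds the three implied triples (Dog is an Animal, rex is a Mammal, rex is an Animal); the hard part WebPIE solves is doing such joins over billions of triples in parallel without the quadratic nested loop used here.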
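The incremental-reasoning challenge above (avoiding a full 1-day recomputation when data changes) can be sketched as a delta-driven step: derive only the consequences that involve at least one newly arrived triple, joining the delta against the already-materialized historic data. The rule, the sensor names, and the single-rule/single-step simplification are assumptions for illustration, not the VU system's actual design.

```python
# Delta-driven (semi-naive) inference step: given an already-closed set of
# historic triples and a small delta of new streaming triples, derive only
# the *new* consequences, instead of rerunning full materialization.

def incremental_step(closed, delta):
    """New consequences of rule (x type A), (A subClassOf B) => (x type B)
    that use at least one triple from `delta`."""
    universe = closed | delta
    derived = set()
    for (s, p, o) in delta:            # every derivation must touch the delta
        if p == "rdf:type":
            for (a, p2, b) in universe:
                if p2 == "rdfs:subClassOf" and a == o:
                    derived.add((s, "rdf:type", b))
        elif p == "rdfs:subClassOf":
            for (x, p2, a) in universe:
                if p2 == "rdf:type" and a == s:
                    derived.add((x, "rdf:type", o))
    return derived - universe          # report only genuinely new triples

# Historic data, already materialized (names are illustrative assumptions)
historic = {
    ("Sensor42", "rdf:type", "HeartRateSensor"),
    ("HeartRateSensor", "rdfs:subClassOf", "HealthSensor"),
    ("Sensor42", "rdf:type", "HealthSensor"),   # old consequence, already known
}
stream = {("Sensor99", "rdf:type", "HeartRateSensor")}  # new streaming triple
new_facts = incremental_step(historic, stream)
```

Only the one fresh consequence (Sensor99 is a HealthSensor) is produced; nothing about Sensor42 is recomputed. A real system would iterate such steps to a fixpoint and parallelize the joins, which is where the massive parallelism and GPUs mentioned above come in.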