
DSRC: Data Science Research Center

High Performance Distributed Computing

Henri Bal

Vrije Universiteit Amsterdam


Outline

1. Development of the field

2. Highlights VU-HPDC group

3. Links to data science cycle

4. Conclusions


Developments

• Multiple types of data explosions:

– Big data: huge processing/transportation demands

– Complex heterogeneous data

Figure labels: LOFAR: ~15 PB/year; SKA: >300 PB/year, exascale processing; Complex data


Developments

• Infrastructure explosion

– High complexity: heterogeneous systems with diversity of processors, systems, networks


VU-HPDC group

• Bridge the gap between demanding applications and complex infrastructure

• Distributed programming systems for

– Clusters, grids, clouds

– Accelerators (GPUs)

– Heterogeneous systems ("Jungles")

– Clouds & mobile devices

• Applications: multimedia, semantic web, model checking, games, astronomy, astrophysics, climate modeling, …


Highlights VU-HPDC group

• Solved Awari (2002)

• AAAI-VC 2007

• DACH 2008 - BS

• DACH 2008 - FT

• 3rd Prize: ISWC 2008

• 1st Prize: SCALE 2008

• 1st Prize: SCALE 2010

• EYR 2011: Sustainability award


Links to data science cycle

Figure: the data science cycle, with three phases: Understand and decide; Store and process; Analyze and model.

Topics shown in the figure: Decision Theory, Visual Analytics, Perception, Cognition, Distributed Processing, Distributed reasoning, Reasoning, Knowledge representation, Large Scale Databases, Software Eng., System / Network Eng., Multimedia Retrieval, Information Retrieval, Machine Learning, Modeling and simulation.


Reasoning – Semantic Web

• Make the Web smarter by injecting meaning so that machines can “understand” it.

– Initial idea by Tim Berners-Lee in 2001

• Has now attracted the interest of big IT companies


Google Example


Distributed Reasoning

• WebPIE: web-scale distributed reasoner doing full materialization

• QueryPIE: distributed reasoning with backward-chaining + pre-materialization of schema-triples

• DynamiTE: maintains materialization after updates (additions & removals)

• Challenge: real-time incremental reasoning on web scale, combining new (streaming) data & existing historic data

With: Jacopo Urbani, Alessandro Margara, Frank van Harmelen
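
The difference between full materialization and backward-chaining can be illustrated on a single machine. The sketch below is only a toy: the triples, the two RDFS-style rules (subclass transitivity and type inheritance), and the function names `materialize` and `has_type` are assumptions chosen for illustration, not the rule set or code of WebPIE or QueryPIE.

```python
# Toy single-machine sketch; not the WebPIE/QueryPIE implementation.
# The predicates and example triples below are illustrative assumptions.
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def materialize(triples):
    """Forward chaining: derive triples until a fixpoint is reached."""
    closure = set(triples)
    while True:
        derived = set()
        for (s, p, o) in closure:
            for (s2, p2, o2) in closure:
                # Both rules join on "o is a subclass of o2".
                if o != s2 or p2 != SUBCLASS:
                    continue
                if p == SUBCLASS:      # subclass transitivity
                    derived.add((s, SUBCLASS, o2))
                elif p == TYPE:        # type inheritance
                    derived.add((s, TYPE, o2))
        new = derived - closure
        if not new:
            return closure
        closure |= new


def has_type(triples, instance, cls):
    """Backward chaining: answer one query by following links on demand."""
    frontier = [o for (s, p, o) in triples if s == instance and p == TYPE]
    seen = set()
    while frontier:
        current = frontier.pop()
        if current == cls:
            return True
        if current in seen:
            continue
        seen.add(current)
        frontier.extend(o for (s, p, o) in triples
                        if s == current and p == SUBCLASS)
    return False


facts = {
    ("ex:Dog", SUBCLASS, "ex:Mammal"),
    ("ex:Mammal", SUBCLASS, "ex:Animal"),
    ("ex:rex", TYPE, "ex:Dog"),
}
closure = materialize(facts)
```

Materialization pays the derivation cost once and makes every later lookup a plain set membership test; backward chaining stores only the explicit facts and does the work per query.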

COMMIT/


Glasswing: MapReduce on Accelerators

• Use accelerators as a mainstream feature

• Massive out-of-core data sets

• Scale vertically & horizontally

• Code portability using OpenCL

• Maintain MapReduce abstraction

With: Ismail El Helw, Rutger Hofman
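
For readers unfamiliar with the MapReduce abstraction that Glasswing maintains, a minimal sequential sketch follows. It is plain Python and shows only the programming model; the helper names (`map_reduce`, `count_words_map`, `count_words_reduce`) are hypothetical and do not correspond to Glasswing's OpenCL-based API.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Sequential model of MapReduce: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        # map_fn emits zero or more (key, value) pairs per record.
        for key, value in map_fn(record):
            groups[key].append(value)
    # reduce_fn folds all values of one key into a single result.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def count_words_map(line):
    for word in line.split():
        yield word.lower(), 1


def count_words_reduce(word, counts):
    return sum(counts)


word_counts = map_reduce(
    ["big data", "Big compute"], count_words_map, count_words_reduce
)
```

Because the user supplies only the map and reduce functions, a runtime is free to decide where they execute (CPU, GPU, many nodes), which is what makes the abstraction attractive for accelerators.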


Glasswing Pipeline

• Overlaps computation, communication & disk access

• Supports multiple buffering levels
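
The idea of overlapping stages through buffering can be sketched with threads connected by bounded queues. This is an assumption-laden illustration, not Glasswing's pipeline: the stage functions (`read_chunk`, `compute`, `to_record`) and the `buffer_depth` parameter are hypothetical, and CPython threads mainly overlap I/O rather than computation.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream


def stage(fn, inbox, outbox):
    """Run one pipeline stage: take an item, process it, pass it on."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)
            return
        outbox.put(fn(item))


def run_pipeline(chunks, stage_fns, buffer_depth=2):
    """Connect stages with bounded queues; buffer_depth=2 is double buffering."""
    queues = [queue.Queue(maxsize=buffer_depth)
              for _ in range(len(stage_fns) + 1)]

    def feed():
        for chunk in chunks:
            queues[0].put(chunk)  # blocks while the first buffer is full
        queues[0].put(SENTINEL)

    workers = [threading.Thread(target=feed)]
    workers += [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stage_fns)
    ]
    for worker in workers:
        worker.start()

    results = []
    while True:  # the caller drains the last queue
        item = queues[-1].get()
        if item is SENTINEL:
            break
        results.append(item)
    for worker in workers:
        worker.join()
    return results


def read_chunk(n):        # stands in for disk access
    return list(range(n))


def compute(values):      # stands in for the kernel
    return sum(v * v for v in values)


def to_record(total):     # stands in for output/communication
    return f"sum={total}"


records = run_pipeline([1, 2, 3, 4, 5], [read_chunk, compute, to_record])
```

While one chunk is being computed, the next can already be read and the previous one written out; the bounded queues keep a fast stage from running arbitrarily far ahead of a slow one.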


Evaluation of Glasswing

• Glasswing uses CPU, memory & disk resources more efficiently than Hadoop

• Compute-bound applications benefit dramatically from GPUs

• Better scalability than Hadoop

• Runs on a variety of accelerators

• E.g. k-means clustering:

– 8.5× (1 node) vs. 15.5× (64 nodes) vs. 107× (GPU node)
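
As background on why k-means is a natural compute-bound benchmark for MapReduce, one iteration can be written as a map phase (assign each point to its nearest centroid) and a reduce phase (average the points of each cluster). The sketch below is plain sequential Python with made-up data and hypothetical function names; it is not the benchmark code used in the evaluation.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid with the smallest squared distance to point."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def kmeans_iteration(points, centroids):
    # Map phase: emit (centroid index, point) for every point.
    groups = defaultdict(list)
    for point in points:
        groups[nearest(point, centroids)].append(point)
    # Reduce phase: the new centroid is the mean of its assigned points.
    new_centroids = list(centroids)  # empty clusters keep their centroid
    for index, members in groups.items():
        new_centroids[index] = tuple(
            sum(dim) / len(members) for dim in zip(*members)
        )
    return new_centroids


# Illustrative data only.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (12.0, 10.0)]
centroids = [(1.0, 1.0), (9.0, 9.0)]
for _ in range(5):
    centroids = kmeans_iteration(points, centroids)
```

The map phase computes a distance from every point to every centroid, so its cost grows with points × centroids × dimensions, while the data passed to the reduce phase stays comparatively small.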
