The Large Scale Data Management and Analysis
Project (LSDMA)
Dr. Andreas Heiss, SCC, KIT
Steinbuch Centre for Computing (SCC)
KIT – University of the State of Baden-Wuerttemberg and
National Research Center of the Helmholtz Association
www.kit.edu
Overview
Introducing KIT and SCC
Big Data
Infrastructures at KIT: GridKa and the Large Scale Data Facility (LSDF)
Large Scale Data Management and Analysis (LSDMA)
Summary and Outlook
September 12, 2013 Dr. Andreas Heiss
Introducing KIT
KIT is both
state university with research and teaching and
research center of the Helmholtz Association with
program-oriented provident research
Objectives:
research
teaching
innovation
Numbers
24,000 students
9,400 employees
3,200 PhD researchers
370 professors
790 million EUR annual budget in 2012
Introducing the Steinbuch Centre for Computing
Provisioning and development of IT services for
KIT and beyond
R&D
High Performance Computing
Grids and Clouds
Big Data
~ 200 employees in total
50% scientists
50% technicians, administrative personnel and
student assistants
named after Karl Steinbuch, professor at
Karlsruhe University, creator of the term
“Informatik” (German term for computer science)
Big Data
Comparing Google Trends (chart, time axis labeled 2010 and 2013) for the search terms:
Cloud computing
Big Data
Grid Computing
Big Data 2000 years ago
“In those days Caesar Augustus issued a decree that a census should
be taken of the entire Roman world.” (Luke 2:1)
clearly defined purpose for collecting data: tax lists of all taxpayers
data collection
distributed
analog
time-consuming
distributed storage of data
tedious data aggregation
Big Data today
One buzzword, various challenges!
Industry
- Data mining
- Business intelligence
- Get additional information from
(often) already existing data.
- Data aggregation
- Typically O(10) or O(100) TBs
New field to make money!
- Products
- Services
- Market shared between some
‘big players’ and many startups / spin-offs!
Science
- Handling huge amounts of data
- Petabytes
- Distributed data sources and/or
storage
- (Global) data management
- High Throughput
- Data preservation
Definition of Data Science
Venn diagram by Drew Conway (IA Ventures)
Big Data in science: LHC at CERN
Goals
search for the origin of
mass
understanding the early
state of the universe
LHC
went live in 2008
four detectors
main discovery until now:
a Higgs boson
2012: 25 PB of data taken
Goal for 2015: 500 Hz@L3
O(1000) physicists distributed worldwide
Worldwide LHC Computing Grid –
Hierarchical Tier Structure
Hierarchy of services, response times and availability:
1 Tier-0 center at CERN
- copy of all raw data (tape)
- first pass reconstruction
11 Tier-1 centers worldwide
- 2 to 3 distributed copies of raw data
- large-scale data reprocessing
- storage of simulated data from Tier-2 centers
- tape storage
~150 Tier-2 centers worldwide
- user analysis
- simulations
Hierarchical model relaxed: from hierarchy to mesh (figure courtesy of Ian Bird, CERN)
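The tier structure above can be written down as a small lookup table. This is only an illustrative sketch: the names `Tier`, `WLCG_TIERS` and `tiers_for` are hypothetical, the role strings paraphrase the slide, and the center counts are the approximate figures quoted there, not an official WLCG data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    level: int               # 0, 1 or 2
    centers: int             # number of centers as quoted on the slide (Tier-2 is approximate)
    roles: tuple[str, ...]   # responsibilities listed on the slide, paraphrased


# Hypothetical table built only from the figures on the slide.
WLCG_TIERS = (
    Tier(0, 1, ("raw data copy on tape", "first pass reconstruction")),
    Tier(1, 11, ("distributed raw data copies",
                 "large-scale data reprocessing",
                 "storage of simulated data",
                 "tape storage")),
    Tier(2, 150, ("user analysis", "simulations")),
)


def tiers_for(keyword: str) -> list[int]:
    """Return the tier levels whose listed roles mention the keyword."""
    return [t.level for t in WLCG_TIERS if any(keyword in r for r in t.roles)]


print(tiers_for("tape"))      # [0, 1]
print(tiers_for("analysis"))  # [2]
```

The strict hierarchy shown here is the original model; as the slide notes, it has since been relaxed towards a mesh, which a flat table like this does not capture.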
Big Data in science: DNA sequencing
[Figure: chart with a data-size scale labeled in MB and GB]
Big Data in science: synchrotron light sources
Source: Wikipedia
ANKA @ KIT
Big Data in science: synchrotron light sources
Dectris Pilatus 6M
2463 x 2527 pixels
7 MB images
25 frames/s
175 MB/s
Several TB/day
Data no longer fits on a USB drive
Users are usually not affiliated with the synchrotron lab
Users from physics, biology, chemistry, material sciences, …
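The detector figures above follow from simple arithmetic, sketched here in Python. The function names and the `duty_cycle` parameter are assumptions introduced for illustration; the slide only gives the image size, the frame rate and "several TB/day", and decimal units (1 TB = 10^6 MB) are assumed.

```python
def sustained_rate_mb_s(image_mb: float, frames_per_s: float) -> float:
    """Data rate in MB/s while the detector is recording."""
    return image_mb * frames_per_s


def daily_volume_tb(rate_mb_s: float, duty_cycle: float = 1.0) -> float:
    """Data volume per day in TB (decimal units, 1 TB = 10**6 MB).

    duty_cycle is the assumed fraction of the day spent recording.
    """
    seconds_per_day = 24 * 60 * 60
    return rate_mb_s * seconds_per_day * duty_cycle / 1_000_000


rate = sustained_rate_mb_s(image_mb=7, frames_per_s=25)
print(rate)                                     # 175 MB/s, as on the slide
print(daily_volume_tb(rate))                    # upper bound for nonstop recording
print(daily_volume_tb(rate, duty_cycle=0.25))   # assumed 25% duty cycle
```

Recording around the clock would give roughly 15 TB/day, so "several TB/day" corresponds to the detector running for only part of the day.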
Big Data in science: high throughput imaging
Imaging machines / microscope
1 – 100 frames/s => up to 800 MByte/s => O(10) TBytes/day
Reconstruction of zebrafish early embryonic development
Big Data in science
Data growth is very fast in many research areas
Biology, chemistry, earth sciences, …
Data sets have become too big to take home
Data rates require dedicated IT infrastructures to record and store
Data analysis requires farms and clusters. Single PCs are not sufficient.
Collaborations require distributed infrastructures and networks
Data management becomes a challenge
Fewer people with IT experience and IT interest than, e.g., in physics
Definition of Data Science
Physicist
Biologist, chemist, …
Venn diagram by Drew Conway (IA Ventures)
KIT infrastructures: GridKa
German WLCG Tier-1 Center
Supports all LHC experiments + Belle II + several
small communities and older experiments
>10,000 cores
Disk space: 12 PB, tape space: 17 PB
6x10 Gbit/s network connectivity
~ 15% of LHC data permanently stored at GridKa
Services: file transfer, workload management, file
catalog, …
Global Grid User Support (GGUS): service
development and operation of the trouble ticket
system for the world-wide LHC Grid
Annual international GridKa School
2013: ~140 participants from 19 countries
GridKa Experiences
evolving demands and usage patterns
no common workflows
hardware is a commodity, software is not
hierarchical storage with tape is challenging
data access and I/O is the central issue
Different users / user communities have different data access methods
and access patterns!
on-site experiment representation highly useful
KIT infrastructure: Large Scale Data Facility
Main goals
provision of storage for multiple research
groups at KIT and U-Heidelberg
support of research groups in data analysis
Resources and access
6 PB of on-line storage
6 PB of archival storage
100 GbE connection between LSDF@KIT
and U-Heidelberg
analysis cluster of 58*8 cores
variety of storage protocols
jointly funded by Helmholtz Association
and state of Baden-Württemberg
LSDF experiences
high demand for storage, analysis and archival
research groups vary in
research topics (from genetic sequencing to geophysics)
size
IT expertise
need for services and protocols
Important needs common to many user groups
sharing data with other groups
data security and preservation
‘consulting’
many small groups depend on LSDF
The Large Scale Data Management and
Analysis (LSDMA) project: facts and figures
Helmholtz portfolio extension
initial project duration: 2012-2016
partners:
project coordinator: Achim Streit (KIT)
sustainability: inclusion of activities into the respective Helmholtz program-oriented funding in 2015
next annual international symposium:
September 24th at KIT
Scientific Data Life Cycle
LSDMA: Dual Approach
Data Life Cycle Labs (DLCLs)
Joint R&D with scientific user communities
- optimization of the data life cycle
- community-specific data analysis tools and services
Data Services Integration Team (DSIT)
Generic methods R&D
- data analysis tools and services common to several DLCLs
- interface between federated data infrastructures and DLCLs/communities
Selected LSDMA activities (I)
DLCL Energy (KIT, U-Ulm)
using Hadoop to analyze stereoscopic satellite images for
estimating the efficiency of solar energy
privacy policies for personal energy data
DLCL Key Technologies (KIT, U-Heidelberg, U-Dresden)
optimization of tomographic reconstruction using
data-intensive computing
visualization for high throughput microscopy
DLCL Health (FZJ)
workflow support for data-intensive parameter studies
efficient metadata administration and indexing
Selected LSDMA activities (II)
DLCL Earth&Environment (KIT, DKRZ)
MongoDB for data and metadata of meteorological satellite
data
Data replication within the European EUDAT project
using iRODS
DLCL Structure of Matter (DESY, GSI, HTW)
Development of a portal for PETRA-III data
Determining the computing requirements for FAIR data
analysis
DSIT (all partners)
Federated identity management
Archive
Federated storage (e.g. dCache)
…
LSDMA Challenges
Communities differ in
- previous knowledge
- level of specification of the data life cycle
- tools and services used
Within communities
- focus on data analysis
- high turnover of computing experts
- running tools and services
Needs driven by
- increasing amount of data
- cooperation between groups
- policies: open access/data, long-term preservation
Lessons learned
- interoperable AAI crucial
- data privacy very challenging, both legally and technically
- communities need evolution, not revolution
- needs can be very specific
Summary and Outlook
data facilities and R&D very important for KIT
extensive experience at GridKa and LSDF
wide variety of user communities
often very specific needs
Interoperable AAI and privacy crucial topics
Today, data is important to basically all research
topics
more projects on state, national and international
levels to come
LSDMA: research on generic data methods,
workflows and services and community specific
support and R&D.