
Project Summary
MRI: The Development of Data-Scope – A Multi-Petabyte Generic Data Analysis Environment for Science
PI: Alexander Szalay, Co-Is: Kenneth Church, Charles Meneveau, Andreas Terzis, Scott Zeger
The Data-Scope is a new scientific instrument, capable of ‘observing’ immense volumes of data from
various scientific domains such as astronomy, fluid mechanics, and bioinformatics.
Intellectual Merit: The nature of science is changing – new discoveries will emerge from the analysis of
large amounts of complex data generated by our high-throughput instruments: this is Jim Gray’s “Fourth
Paradigm” of scientific discovery. Virtual instruments (i.e., computers) generate equally large volumes of
data – the sizes of the largest numerical simulations of nature today are on par with the experimental data
sets. This data deluge is not simply a computational problem, but rather requires a new and holistic
approach. We need to combine scalable algorithms and statistical methods with novel hardware and
software mechanisms, such as deep integration of GPU computing with database indexing and fast
spatial search capabilities. Today, scientists can easily tackle data-intensive problems at the 5-10TB
scale: one can perform these analyses at a typical departmental computing facility. However, 50-100 TB
data sets are considerably more difficult to deal with, and perhaps only ten universities in the world can
analyze them. Moving to the petabyte scale, there are fewer than a handful of places anywhere in the
world that can address the challenge. At the same time there are many projects that are crossing over
the 100TB boundary today. Astrophysics, High Energy Physics, Environmental Science, Computational
Fluid Dynamics, Genomics, and Bioinformatics are all encountering data challenges in the several
hundred TB range and beyond – even within the Johns Hopkins University. The large data sets are here,
but we lack an integrated software and hardware infrastructure to analyze them!
We propose to develop the Data-Scope, an instrument specifically designed to enable data analysis tasks
that are simply not possible today. The instrument’s unprecedented capabilities combine approximately
five Petabytes of storage with a sequential IO bandwidth close to 500GBytes/sec, and 600 Teraflops of
GPU computing. The need to keep acquisition costs and power consumption low, while maintaining high
performance and storage capacity introduces difficult tradeoffs. The Data-Scope will provide extreme data
analysis performance over PB-scale datasets at the expense of generic features such as fault tolerance
and ease of management. This is however acceptable since the Data-Scope is a research instrument
rather than a traditional computational facility. Over the last decade we have demonstrated the ability to
develop and operate data-intensive systems of increasing scale.
Broader Impact: The data-intensive nature of science is becoming increasingly important. Today we face
a vacuum in our ability to handle large data sets similar to the one from which the concept of the Beowulf
cluster emerged in the 1990s, a development that eventually democratized high-performance computing. Many
universities and scientific disciplines are now looking for a new template that will enable them to address
PB-scale data analysis problems. In developing the Data-Scope, we can substantially strengthen the
Nation's expertise in data-intensive science. In order to accelerate the acceptance of the proposed approach, we will collaborate with researchers across multiple disciplines and institutions nationwide (Los
Alamos, Oak Ridge, UCSC, UW, STScI and UIC). The proposed instrument will also host public services
on some of the largest data sets in astronomy and fluid mechanics, both observational and simulated.
Students and postdoctoral fellows who will be involved in data-intensive research using the Data-Scope
will acquire skills that will serve them well in their careers as 21st-century scientists.
Partnerships: We have a strong industrial involvement and interest. We have been working with
Microsoft Research and the SQL Server team for over a decade exploring ways to bring data-intensive
computations as close to the data as possible. Microsoft has provided substantial funding to build the
GrayWulf facility, the forerunner to the Data-Scope. We will continue to collaborate with Microsoft to
advance innovations in data-intensive computing. NVIDIA is extremely interested in using GPUs in
data-intensive computations and in building data-balanced architectures, and has recently awarded JHU
CUDA Research Center status. JHU is an active partner in the Open Cloud Consortium (OCC). The Data-Scope
will use the 10Gbps OCC connectivity to move large external data sets into the instrument and will
be linked to the rest of the OCC infrastructure.
Project Description
(a) Instrument Location
The Data-Scope will be located in Room 448 at the Bloomberg Center for Physics and Astronomy. This
room has adequate electricity and cooling to support 180kVA. The equipment currently residing there will
be moved to a new location during the summer of 2010. The room is also instrumented with
environmental sensors monitoring the temperature of each rack and the electricity of each circuit through
a wireless sensor network connected to a real-time monitoring system.
(b) Research Activities to be enabled
Astrophysics, Physical Sciences
Alex Szalay, Rosemary Wyse, Mark Robbins, Morris Swartz, Tamas Budavari, Ani Thakar, Brice Menard,
Mark Neyrinck (JHU Physics and Astronomy), Robert Hanisch (Space Telescope Science Institute), Piero
Madau, Joel Primack (UC Santa Cruz), Andrew Connolly (Univ. Washington), Salman Habib (Los
Alamos), Robert Grossman (UIC)
In 2010 JHU became the long-term host of the Sloan Digital Sky Survey (SDSS) Archive [1] (Szalay,
Thakar). This archive is one of astronomy's most used facilities today. On the Data-Scope we can lay out
the SDSS data multiple times and achieve orders of magnitude better performance for the whole
astronomy community. The long-term curation of the SDSS archive is one of the main thrusts of the
NSF-funded Data Conservancy (Choudhury). The Data-Scope will enable new functionality over the SDSS
data, performing complex computations over arbitrary subsets in a matter of seconds (Menard).
The Virtual Astronomical Observatory [2] (VAO, directed by R. Hanisch) provides the framework for
discovering and accessing astronomical data from observatories and data centers. Perhaps the single
most important VAO tool in the research community is its cross-matching service, OpenSkyQuery [3],
built at JHU. Survey catalogs today contain 10^8 or more objects, while the next-generation telescopes,
like LSST [4], Pan-STARRS [5] will soon produce orders of magnitude larger time-domain catalogs. Data
sets this large can only be explored with a parallel computational instrument optimized for both I/O and
CPU. OpenSkyQuery, running on the Data-Scope (Budavari, Wilton), will improve performance for
astronomical cross-matching by a factor of 10-100, helping the whole community.
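As an illustration of the core operation such a service performs, the sketch below pairs sources from two catalogs by angular separation. It is a deliberately simplified, brute-force stand-in for the indexed, in-database algorithms OpenSkyQuery actually uses; the catalog arrays, the (RA, Dec in degrees) layout, and the 1-arcsecond tolerance are illustrative assumptions, not details from the text.

```python
import numpy as np

def angular_separation(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between (RA, Dec) positions given in degrees."""
    ra1, dec1, ra2, dec2 = map(np.radians, (ra1, dec1, ra2, dec2))
    # Haversine formula, numerically stable for the small separations typical of cross-matching
    d = 2 * np.arcsin(np.sqrt(
        np.sin((dec2 - dec1) / 2) ** 2 +
        np.cos(dec1) * np.cos(dec2) * np.sin((ra2 - ra1) / 2) ** 2))
    return np.degrees(d)

def crossmatch(cat_a, cat_b, tolerance_deg=1.0 / 3600.0):
    """Return (i, j) pairs where cat_a[i] and cat_b[j] lie within the tolerance.

    cat_a and cat_b are (N, 2) arrays of RA, Dec in degrees. This is O(N*M);
    the production service replaces the scan with spatial indexing inside the database.
    """
    matches = []
    for i, (ra, dec) in enumerate(cat_a):
        sep = angular_separation(ra, dec, cat_b[:, 0], cat_b[:, 1])
        j = int(np.argmin(sep))
        if sep[j] <= tolerance_deg:
            matches.append((i, j))
    return matches

if __name__ == "__main__":
    a = np.array([[180.0000, 0.0000], [10.5000, -5.2000]])
    b = np.array([[180.0001, 0.0001], [200.0000, 30.0000]])
    print(crossmatch(a, b))   # -> [(0, 0)]: only the first source has a counterpart within 1"
```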
Access to large simulations has always been limited and awkward. The Data-Scope will host several of
the world’s largest cosmological simulations [6,7,8] (Primack, Habib) and make them publicly accessible
through interactive web services (as with the turbulence data). In particular, we will also combine observational data
[9] and simulations of unprecedented size (300TB+) about the Milky Way (Wyse, Madau, Szalay). Our
postdocs (Neyrinck) and graduate students will generate a suite of 500 realizations of a 1Gpc
cosmological simulation, for multiple analyses of large scale structure. We will explore introducing arrays
as native data types into databases to accelerate the analysis of structured data, in collaboration with
Jose Blakeley (Microsoft), one of the Chief Architects of the SQL Server database. The total volume of
the astrophysical simulation effort can easily exceed 500TB.
JHU has been a member of the Open Cloud Consortium [10], led by R. Grossman (UIC). The OCC has a
unique infrastructure, using its own high-speed networking protocols to transfer large data sets (20 min
for a TB from Chicago to JHU). This framework enables us to import these large simulations directly from
the National Supercomputer Centers (ORNL, NERSC), and also transfer the results to other institutions
and NSF facilities (UCSD, UIUC, TeraGrid). The Data-Scope will also be available through the OCC.
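The quoted transfer time implies an effective rate close to the capacity of the 10Gbps link; a quick check of that arithmetic (assuming the decimal terabyte and the 20-minute figure from the text):

```python
# Effective throughput implied by "20 min for a TB from Chicago to JHU"
terabyte_bits = 1e12 * 8          # 1 TB in bits (decimal convention assumed)
seconds = 20 * 60
rate_gbps = terabyte_bits / seconds / 1e9
print(f"{rate_gbps:.1f} Gbps effective")   # ~6.7 Gbps, i.e. about 2/3 of the 10 Gbps OCC link
```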
Detailed image simulations for LSST have been carried out [12] on the Google/IBM MR cluster (Connolly),
but that system is reaching its resource limits: more storage and processing power are needed than are
available. The Data-Scope system can offer both, and the simulations can be expanded to 100TB.
State of the art simulations of molecular dynamics on the largest HPC facilities use billion particle
systems, representing about 1000 particles in each spatial direction, or about 300nm for atomic systems.
The NSF MRI funded Graphics Processor Laboratory (NSF MRI-0923018, Acquisition of 100TF
Graphics Processor Laboratory for Multiscale/Multiphysics Modeling, PI: Robbins, CoIs: Burns,
Graham-Brady, Szalay, Dalrymple) is enabling us to port the relevant codes to the GPU architecture.
Integrating this system with the Data-Scope and using it to analyze thousands of configurations for billion
particle simulations would enable dramatically better understanding of the mechanics of disordered
systems and scaling behavior of avalanches as materials fail (Falk and Robbins). Saving the positions,
velocities and accelerations of two such simulations requires 144TB, which is consistent with the size of
the planned instrument, and would result in a dramatic breakthrough in the state of the art in this field.
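The 144TB figure is consistent with straightforward bookkeeping: a billion particles with positions, velocities, and accelerations for on the order of a thousand saved configurations per simulation, times two simulations. A sketch of that estimate (the double-precision storage and the exact snapshot count are assumptions, not values taken from the text):

```python
# Back-of-the-envelope size of the molecular-dynamics output described above
particles = 1e9                  # billion-particle simulation
values_per_particle = 9          # position, velocity, acceleration: 3 components each
bytes_per_value = 8              # double precision (assumed)
snapshots = 1000                 # "thousands of configurations" (assumed ~1000 here)
simulations = 2

total_bytes = particles * values_per_particle * bytes_per_value * snapshots * simulations
print(f"{total_bytes / 1e12:.0f} TB")   # -> 144 TB, matching the estimate in the text
```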
Data is already flowing in from the Large Hadron Collider (LHC). JHU has a 20TB local facility to analyze
the data and produce simulated data (Swartz), which has already been outgrown. Having access to a
200TB revolving storage and fast network transfers from FNAL would dramatically impact the quality of
the JHU contribution to the LHC effort.
Turbulence, computational fluid mechanics
Charles Meneveau (ME), Shiyi Chen (ME), Greg Eyink (AMS), Omar Knio (ME), Rajat Mittal (ME), Randal
Burns (CS), Tony Dalrymple (CE)
The proposed Data-Scope will enable path-breaking research in fluid mechanics and turbulence.
Constructing, handling, and utilizing large datasets in fluid dynamics has been identified at two NSF
workshops [12,13] as a major pacing item for continued progress in the field. Scientific computing of
multi-scale physical phenomena covers increasingly wide ranges of spatial and temporal scales.
Discretization and integration forward in time of the underlying partial differential equations constitute
massive simulations that describe the evolution of physical variables (velocity, pressure, concentration
fields) as functions of time and location over the entire domain of interest. The prevailing approach has been
that individual researchers perform large simulations that are analyzed during the computation, and only a
small subset of time-steps are stored for subsequent, and by necessity more “static”, analysis. The
majority of the time evolution is discarded. As a result, much of the computational effort is not utilized as
well as it could be. In fact, the same simulations must often be repeated when new questions arise that were
not initially obvious. But many (or even most) breakthrough concepts cannot be anticipated in advance,
as they will be motivated in part by output data and must then be tested against it. Thus a new paradigm
is emerging that aims to create large and easily accessible databases that contain the full space-time
history of simulated flows. The Data-Scope will provide the required storage and access for the analysis.
The proposed effort in fluid mechanics and turbulence builds upon the accomplishments of two prior
related projects: ITR-AST-0428325, Exploring the Lagrangian Structure of Complex Flows with 100
Terabyte Datasets (09/04 to 08/10; PIs: A. Szalay, E. Vishniac, R. Burns, S. Chen & G. Eyink), and the
MRI-funded cluster it used (MRI-320907 DLMS: Acquisition of Instrumentation for a
Digital Laboratory for Multi-Scale Science, PIs: Chen et al.). The cluster consists of a compute layer
and a database layer, the JHU public turbulence database. It houses a 27 TB database that contains the
entire time history of a 1024^3 mesh point pseudo-spectral DNS of forced isotropic turbulence [14,15].
1024 time-steps have been stored, covering a full “large-eddy” turnover time. A database web service
fulfills user requests for velocities, pressure, various space-derivatives of velocity and pressure, and
interpolation functions (see http://turbulence.pha.jhu.edu). The 1024^4 isotropic turbulence database has
been in operation continuously since it went online in 2008 and has been used extensively. Usage
monitoring has shown that it has been accessed from over 160 distinct IP addresses, and to date the total
number of points queried exceeds 3.6 x 10^10. Our prior research results using, or motivated by, the database are
described in various publications [16-34]. For sample results from non-JHU users see e.g. papers [35,36].
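The stored volume is consistent with the raw field size: 1024^3 grid points, 1024 time steps, and four values per point (three velocity components plus pressure) give roughly 18 TB, with the remainder of the 27 TB plausibly going to indexes and ancillary structures. A sketch of that estimate (the single-precision assumption and the attribution of the overhead are ours, not stated in the text):

```python
# Raw size of the archived isotropic-turbulence space-time history
grid_points = 1024 ** 3          # spatial resolution of the DNS
time_steps = 1024                # one full large-eddy turnover time
values_per_point = 4             # u, v, w velocity components + pressure
bytes_per_value = 4              # single precision (assumed)

raw_bytes = grid_points * time_steps * values_per_point * bytes_per_value
print(f"{raw_bytes / 1e12:.1f} TB raw")   # ~17.6 TB; the 27 TB database adds indexes etc.
```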
The Data-Scope will enable us to build, analyze and provide publicly accessible datasets for other types
of flows besides isotropic turbulence. We anticipate that over the next 2-4 years there will be about 5-7
new datasets created from very large CFD and turbulence simulations. The scale of the overall data will
be in the 350-500 TB range. Anticipated topics include magneto-hydrodynamic turbulence, channel flow,
atmospheric turbulence, combustion, compressible turbulence [37], cardiovascular and animal motion
flows [38], free-surface flows, and propagation and quantification of uncertainty in model data [39,40].
Once these DBs are made accessible online, based on our experience with the isotropic turbulence DB,
we anticipate that about 10 non-JHU research groups will regularly access each of the datasets. Hence,
we expect about 50 external research groups (US and international) to profit from the instrument.
Name and affiliation | Science area | Related project / agency | TB
A. Szalay, A. Thakar, S. Choudhury | Astronomy, data mgmt | SDSS archive services (NSF, Sloan Foundation) | 70
B. Menard + postdocs | Astrophysics | SDSS data analysis | 30
R. Hanisch (STScI), T. Budavari, A. Thakar | Virtual Astronomical Observatory | OpenSkyQuery, VO Spectrum, VO Footprint (NSF+NASA) | 40
P. Madau (UCSC), R. Wyse, A. Szalay, 2 postdocs, 2 PhD students | Astrophysics | Via Lactea/The Milky Way Laboratory (DOE, NSF) | 300+
S. Habib (Los Alamos), J. Primack (UCSC) | Astrophysics | Public cosmology simulations (DOE, NASA) | 100
A. Connolly (UW) + postdocs | Astrophysics | LSST imaging simulations | ...
M. Neyrinck, A. Szalay, 2 PhD students | Astrophysics | Public access for 500 Gadget2 cosmological simulations | ...
M. Robbins, M. Falk | Physics, Mat sci | Multiscale sims of avalanche | 144
M. Swartz | High Energy Phys. | Analysis of LHC data | ...
G. Eyink, E. Vishniac, R. Burns, 1 postdoc, 1 PhD student | Turbulence, CFD, Astrophysics | MHD database (NSF CDI) | ...
C. Meneveau, S. Chen, R. Burns, 1 PhD student | Turbulence, Mech. Eng. | Channel flow DB (NSF CDI) | 80
O. Knio, C. Meneveau, R. Burns, 3 SANDIA, 2 PhD students | Combustion, Energy, Mech Eng | Chemically reacting flows with complex chemistry (DOE BES) | ...
C. Meneveau + 1 PhD student | Env. Eng., atmospheric flow, wind energy | Daily cycle of atmospheric boundary layer (NSF CBET) | ...
S. Chen, C. Meneveau + 1 PhD student | Turbulence, Aerospace, propulsion | Compressible turbulence (AFOSR, in planning) | ...
R. Mittal | Biomedical and biological fluid mech | Cardiovascular & animal motion flows | ...
D. Valle + faculty + postdocs | Genomics/Genotyping | Multiple genome/whole exome | ...
M. Ochs, Postdocs: E. Fertig and A. Favorov | Systems biology, networks | Epigenetic biomarkers, pathway anal. (NCI, NIDCR, NLM) | ...
H. Lehmann | Biomedical informatics | Improving communication in the ICU | 9/yr
S. Yegnasubramanian, A. DeMarzo, W. Nelson, 4 postdocs | High throughput sequencing | Ultra-high dim genomics correlations of very large datasets (DoD, NCI, MD StemCell Fund) | ...
G. Steven Bova | Correlation across high-dimensional genomics datasets | Distinctive genomic profiles of aggressive prostate cancer (DoD, pending) | ...
S. Wheelan, Postdoc: L. Mularoni | High throughput sequencing | Correlation of multidimensional biological data (NSF+pending) | ...
T. Haine | Env. Sci./Oceanography | Petascale Ocean Expt (NSF) | ...
D. Waugh | Env. Sci./Atmosphere | Chemistry-Climate Coupling (NSF, NASA) | ...
B. Zaitchik | Env. Sci./Hydrology | Regional climate analysis (NASA) | ...
Hugh Ellis | Env. Sci./Air Quality | Air Quality and Public Health (EPA, NSF) | ...

Table 1. Tabular representation of the major users of the Data-Scope Instrument and their data sizes; individual project volumes range from under 1 TB/yr to several hundred TB.
Bioinformatics
Michael Ochs (Oncology, Biostatistics), Harold Lehmann (Public Health), Srinivasan Yegnasubramanian,
Angelo DeMarzo, William Nelson (Oncology), G. Steven Bova (Pathology, Urology), Sarah Wheelan
(Oncology, Biostatistics)
Research activities that require or will be significantly accelerated by the Data-Scope are diverse and
include high-dimensional biology, high-throughput sequencing, and algorithmic development to improve
inference of biological function from multidimensional data.
Harold Lehmann and colleagues have examined ICU activities and are striving for better integration of the
massive amount of information produced daily in the unit. In general, they have been hampered by the
inability to perform statistical analyses that use machine learning to find patterns, gaps, and warnings in
patient courses and in team behavior. The volume of digital signals available from the many monitors
applied to patients amounts to approximately 9TB per year in a modest-sized ICU. The proposed
infrastructure would enable in-depth analyses that tackle quality and safety issues in healthcare, such as
predicting patient courses with and without interventions, and that include team activity in the mix. The researchers
would include ICU and patient-safety researchers, computer scientists, statisticians, and their students.
Srinivasan Yegnasubramanian, Angelo DeMarzo, and William Nelson jointly direct work on large,
dynamic and high-dimensional datasets. In particular, they are working to characterize the genetics and
epigenetics of normal hematopoietic and chronic myeloid leukemia stem cells as well as simultaneous
genome-wide characterization of somatic alterations in imprinting, DNA methylation, and genomic copy
number in prostate cancer.
Dr. Yegnasubramanian, along with co-director Sarah J. Wheelan, M.D., Ph.D., also directs the Next
Generation Sequencing Center, which is a shared resource for the entire genomics research community
at the Sidney Kimmel Comprehensive Cancer Center as well as the JHU Schools of Medicine and Public
Health at large. The center is expected to generate on the order of 100TB to 1PB of data per year using 4
ultra-high throughput Next Generation Sequencing Instruments. The capability to analyze these data in
the Data-Scope, allowing integration and cross-cutting analyses of independently generated datasets, will
be of central importance.
Michael Ochs directs several projects whose goal is to identify markers of cancer and cancer progression.
The Data-Scope will allow Dr. Ochs to implement sophisticated probabilistic models, using known biology,
data from many different measurement platforms, and ongoing experiments that will enable better
prediction and management of disease. G. Steven Bova directs a large-scale effort in prostate cancer;
many of the algorithms needed already exist but cannot operate as needed on his extremely large and
diverse dataset without an instrument such as the Data-Scope.
Sarah Wheelan is working to develop methods for creating biologically relevant hypotheses from high
throughput sequencing datasets. Such datasets typically comprise hundreds of millions of observations, and new algorithms are required to help biological investigators create the relevant and
sophisticated queries that are possible with these data. By cross linking many different types of
experiments across many different systems and using straightforward statistical techniques to survey
correlations, new relationships can be uncovered.
Climate and Environmental Science
Hugh Ellis (DOGEE), Darryn Waugh, Tom Haine, Ben Zaitchik, Katalin Szlavecz (EPS)
Several JHU faculty are performing high-resolution modeling and data assimilation of Earth's atmosphere,
oceans, and climate system. The presence of the Data-Scope will allow multiple JHU users to access
climate reanalysis data, satellite-derived datasets, and stored model output, and significantly accelerate
the scientific analysis by permitting data exploration and visualization that would otherwise not be
possible. The Data-Scope will also facilitate storage and analysis of large ensemble integrations,
contributing to our ability to characterize uncertainty of the simulations and predictions.
One example of the relevant research is the Petascale Arctic-Atlantic-Antarctic Experiment (PAVE)
project (Haine). Under this project kilometer-resolution, planetary-scale, ocean and sea-ice simulations
are being developed that will contain 20 billion grid cells, will require 10-40TB of memory, and will exploit
petascale resources with between 100,000 and 1,000,000 processor cores. Local access to PAVE
solutions, stored on the Data-Scope, will significantly accelerate the scientific analysis.
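The quoted memory footprint corresponds to a few hundred bytes to a couple of kilobytes of state per grid cell, i.e. on the order of a hundred or more field variables per cell; a quick check of that arithmetic (the 8-byte-per-variable figure is an assumption, not from the text):

```python
# Memory per grid cell implied by the PAVE configuration quoted above
grid_cells = 20e9                      # 20 billion grid cells
for memory_tb in (10, 40):             # quoted memory range
    bytes_per_cell = memory_tb * 1e12 / grid_cells
    variables = bytes_per_cell / 8     # assuming 8-byte (double precision) variables
    print(f"{memory_tb} TB -> {bytes_per_cell:.0f} B/cell (~{variables:.0f} doubles)")
# -> 500 B/cell (~62 doubles) at 10 TB, and 2000 B/cell (~250 doubles) at 40 TB
```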
Other examples include projects examining regional climate modeling with data assimilation (Zaitchik);
the impact of air quality on public health in the US (Ellis); coupling between stratospheric chemistry and
climate (Waugh); downscaled meteorological simulations of climate change-related health effects (Ellis).
The simulations in these projects again produce large volumes of output, not only because of the required
number of simulations and the need for fairly high spatial and temporal resolution, but also because of the
large number of variables (e.g., large numbers of chemical species in air quality modeling together with
numerous physical properties in meteorological simulation). The projects involving data assimilation also
require run-time ingestion of large gridded datasets.
In addition to these computer simulations the JHU wireless sensor networks (Szlavecz, Terzis) are
providing in-situ monitoring of the soil’s contribution to the carbon cycle, and generate smaller (100 million
records collected thus far) but quite complex data sets which need to be correlated and integrated with
the large scale climate models and biological survey data.
(c) Description of the Research Instrumentation and Needs
Rationale for the Data-Scope
The availability of large experimental datasets coupled with the potential to analyze them computationally
is changing the way we do science [41,42]. In many cases, however, our ability to acquire experimental
data outpaces our ability to process them, leading to the so-called data deluge [43]. This data deluge is
the outcome of three converging trends: the recent availability of high throughput instruments (e.g.,
telescopes, high-energy particle accelerators, gene sequencing machines), increasingly larger disks to
store the measurements, and ever faster CPUs to process them.
It is not only experimental data that are growing at a rapid pace; the volume of data produced by computer
simulations, used in virtually all scientific disciplines today, is increasing at an even faster rate. The
reason is that intermediate simulation steps must also be preserved for future reuse as they represent
substantial computational investments. The sheer volume of these datasets is only one of the challenges
that scientists must confront. Data analyses in other disciplines (e.g., environmental sciences) must span
thousands of distinct datasets with incompatible formats and inconsistent metadata. Overall, dataset sizes
follow a power law distribution and challenges abound at both extremes of this distribution.
While improvements in computer hardware have enabled this data explosion, the performance of different
architecture components increases at different rates. CPU performance has been doubling every 18
months, following Moore's Law [44]. The capacity of disk drives is doubling at a similar rate, somewhat
slower than the original Kryder's Law prediction [45], driven by higher-density platters. On the other hand,
the disks' rotational speed has changed little over the last ten years. The result of this divergence is that
while sequential IO speeds increase with density, random IO speeds have changed only moderately. Due
to the growing gap between the sequential and random IO speeds of our disks, only sequential
disk access remains practical: if a 100TB computational problem requires mostly random access patterns, it
cannot be done. Finally, network speeds, even in the data center, are unable to keep up with the doubling
of the data sizes [46]. Said differently, with petabytes of data we cannot move the data where the
computing is–instead we must bring the computing to the data. JHU has been one of the pioneers in
recognizing this trend and designing systems around this principle [47].
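The practical consequence of that divergence is easy to quantify: scanning a 1 TB disk sequentially at the ~150 MB/s typical of the drives discussed later takes under two hours, while touching the same terabyte through small random reads takes weeks. A sketch of that comparison (the ~150 IOPS random-read rate for a 7200 rpm SATA drive is an assumed figure, not from the text):

```python
# Sequential scan vs. random access over a single 1 TB disk
disk_bytes = 1e12
seq_mb_per_s = 150                  # sequential rate of the 1 TB drives used in the performance layer
random_iops = 150                   # assumed random 4 kB IOPS for a 7200 rpm SATA drive
io_size = 4096                      # bytes per random read

seq_hours = disk_bytes / (seq_mb_per_s * 1e6) / 3600
rand_days = disk_bytes / (random_iops * io_size) / 86400
print(f"sequential scan: {seq_hours:.1f} h, random reads: {rand_days:.0f} days")
# -> roughly 1.9 hours vs. ~19 days; random-dominated access patterns simply do not scale
```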
The typical analysis pipeline of a data-intensive scientific problem starts with a low level data access
pattern during which outliers are filtered out, aggregates are collected, or a subset of the data is selected
based on custom criteria. The more CPU-intensive parts of the analysis happen during subsequent
passes. Such analyses are currently implemented on academic Beowulf clusters that combine
compute-intensive but storage-poor servers with network-attached storage. These clusters can handle problems of
a few tens of terabytes, but they do not scale above a hundred terabytes, constrained by the very high
costs of PB-scale enterprise storage systems. Furthermore, as we grow these traditional systems to meet
our data needs, we are hitting a “power wall” [48], where the power and space requirements for these
systems exceed what is available to individual PIs and small research groups. Existing supercomputers
are not well suited for data intensive computations either; they maximize CPU cycles, but lack IO
bandwidth to the mass storage layer. Moreover, most supercomputers lack disk space adequate to store
PB-size datasets over multi-month periods. Finally, commercial cloud computing platforms are not the
answer, at least today. The data movement and access fees are excessive compared to purchasing
physical disks, the IO performance they offer is substantially lower (~20MBps), and the amount of
provided disk space is woefully inadequate (e.g. ~10GB per Azure instance).
Based on these observations, we posit that there is a vacuum today in data-intensive scientific
computations, similar to the one that led to the development of the Beowulf cluster: an inexpensive yet
efficient template for data intensive computing in academic environments based on commodity
components. The proposed Data-Scope aims to fill this gap.
The Design Concept
We propose to develop the Data-Scope, an instrument optimized for analyzing petabytes of data in an
academic setting where cost and performance considerations dominate ease of management and
security. The Data-Scope will form a template for other institutions facing similar challenges.
The following requirements guide the Data-Scope’s design:
(a) Provide at least 5 petabytes of storage, with safe redundancy built in.
(b) Keep the ratio of total system cost to raw disk cost as low as possible.
(c) Provide maximal sequential throughput, approaching the aggregate disk speed.
(d) Allow streaming data analyses on par with the data throughput (i.e., 100s of TFlops).
(e) Keep total power requirements as low as possible.
This ordered list maps well onto the wish list of most academic institutions. The tradeoffs are in some
aspects of fault tolerance, in the level of automation in data movement and recovery, and in a certain loss
of programming convenience, since the high stream-processing throughput at low power is achieved by
using GPUs. These tradeoffs, combined with maximal use of state-of-the-art commodity components, will
allow us to build a unique system that can perform large data analysis tasks simply not otherwise possible.
The Data-Scope will enable JHU scientists and their collaborators to:
• Bring their 100TB+ data sets to the instrument, analyze them for several months at phenomenal data rates, and take their results 'home';
• Create several long-term, robust, high-performance services around data sets in the 10-200TB range, and turn them into major public resources;
• Explore new kinds of collaborative research in which even the shared, temporary resources can be in the hundreds of terabytes and kept alive for several months;
• Explore new data-intensive computational and data analysis paradigms enabled by the intersection of several technologies (HPC, Hadoop, GPU) and toolkits like CUDA-SQL and MPI-DB.
In the paragraphs that follow we describe the Data-Scope’s hardware and software designs.
The Hardware Design
The driving goal behind the Data-Scope design is to maximize stream-processing throughput over TB-size datasets while using commodity components to keep acquisition and maintenance costs low.
Performing the first pass over the data directly on the servers’ PCIe backplane is significantly faster than
serving the data from a shared network file server to multiple compute servers. This first pass commonly
reduces the data significantly, allowing one to share the results over the network without losing
performance. Furthermore, providing substantial GPU capabilities on the same server enables us to avoid
moving too much data across the network as it would be done if the GPUs were in a separate cluster.
Since the Data-Scope's aim is to provide large amounts of cheap, fast storage, its design must begin
with the choice of hard disks. No single disk satisfies all three criteria (capacity, speed, and low cost). In
order to balance these three requirements we decided to divide the instrument into two layers: performance
and storage. Each layer satisfies two of the criteria while compromising on the third.
Performance Servers will have high-speed and inexpensive SATA drives, but compromise on capacity:
Samsung Spinpoint HD103SJ 1TB, 150MB/s (see [49], verified by our own measurements). The Storage
Servers will have larger yet cheaper SATA disks but with lower throughput: Samsung Spinpoint HD203WI
2TB, 110MB/s. The storage layer has 1.5x more disk space to allow for data staging and replication to
and from the performance layer.
The rest of the design focuses on maintaining the advantages from these two choices. In the
performance layer we will ensure that the achievable aggregate data throughput remains close to the
theoretical maximum, which is equal to the aggregate sequential IO speed of all the disks. As said before,
we achieve this level of performance by transferring data from the disks over the servers’ local PCIe
interconnects rather than slower network connections. Furthermore, each disk is connected to a separate
controller port and we use only 8-port controllers to avoid saturating the controller. We will use the new
LSI 9200-series disk controllers, which provide 6Gbps SATA ports and very high throughput (we have
measured the saturation throughput of the LSI 9211-8i to be 1346 MB/s). Each performance server will
also have four high-speed solid-state disks (OCZ-Vertex2 120GB, 250MB/s read, 190MB/s write) to be
used as an intermediate storage tier for temporary storage and caching for random access patterns [48].
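Summing the disk and SSD speeds reproduces the per-server sequential IO figure that appears later in Table 3, and shows that eight 150 MB/s drives per 8-port controller stay below the measured ~1.35 GB/s controller saturation point; a quick sketch of that arithmetic:

```python
# Per-performance-server sequential IO budget
per_controller = 8 * 0.150     # 8 drives per LSI 9211-8i at 150 MB/s -> 1.2 GB/s,
                               # safely below the measured ~1.35 GB/s saturation point
hdd_rate = 3 * per_controller  # 24 x 1 TB drives spread across three controllers
ssd_rate = 4 * 0.250           # 4 x OCZ Vertex2 SSDs at 250 MB/s read
print(f"{per_controller:.2f} GB/s per controller; "
      f"{hdd_rate + ssd_rate:.1f} GB/s per performance server")
# -> 1.20 GB/s per controller and 4.6 GB/s per server, the figure used later in Table 3
```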
Performance server (single unit):
Component | Part | Unit price | Qty | Total
Motherboard | SM X8DAH+F | $469 | 1 | $469
Memory | 18GB | $621 | 1 | $621
CPU | Intel E5630 | $600 | 2 | $1,200
Enclosure | SM SC846A | $1,200 | 1 | $1,200
Disk ctrl ext | N/A | - | - | -
Disk ctrl int | LSI 9211-8i | $233 | 3 | $699
Hard disk | Samsung 1TB | $65 | 24 | $1,560
SSD | OCZ Vertex2 | $300 | 4 | $1,200
NIC 10GbE | Chelsio N310E | $459 | 1 | $459
Cables | | $100 | 1 | $100
GPU card | GTX480 | $500 | 2 | $1,000
Total price | | | | $8,508

Storage server (single unit):
Component | Part | Unit price | Qty | Total
Motherboard | SM X8DAH+F-O | $469 | 1 | $469
Memory | 24GB | $828 | 1 | $828
CPU | Intel E5630 | $600 | 2 | $1,200
Enclosure | SM SC847 | $2,000 | 3 | $6,000
Disk ctrl ext | LSI 9200-8e | $338 | 2 | $676
Disk ctrl int | LSI 9211-8i | $233 | 1 | $233
Hard disk | Samsung 2TB | $100 | 126 | $12,600
SSD | N/A | - | - | -
NIC 10GbE | Chelsio | $540 | 1 | $540
Cables | | $100 | 3 | $300
GPU card | N/A | - | 0 | $0
Total price | | | | $22,846
Table 2. The projected cost and configuration for a single unit of each server type
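Design criterion (b), keeping the ratio of total system cost to raw disk cost low, can be read directly off Table 2; a quick check (counting both the spinning disks and the SSDs as raw storage in the performance server):

```python
# Ratio of total server cost to raw disk cost, from Table 2
performance = {"total": 8508, "disks": 24 * 65 + 4 * 300}    # HDDs + SSDs
storage = {"total": 22846, "disks": 126 * 100}
for name, cost in (("performance", performance), ("storage", storage)):
    print(f"{name}: {cost['total'] / cost['disks']:.1f}x raw disk cost")
# -> roughly 3.1x for a performance server and 1.8x for a storage server
```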
The performance server will use a SuperMicro SC846A chassis, with 24 hot-swap disk bays, four internal
SSDs, and two GTX480 Fermi-based NVIDIA graphics cards with 500 GPU cores each, offering excellent
price/performance for floating point operations at an estimated 3 teraflops per card. The Fermi-based
Tesla 2050 has not been announced yet; we will reconsider it if it provides better price/performance when
the project begins. We have built a prototype system according to these specifications and it performs as
expected.
Figure 1. The network diagram of the Data-Scope.

In the storage layer we maximize capacity while keeping acquisition costs low. To do so we amortize the
motherboard and disk controllers among as many disks as possible, using backplanes with SATA
expanders while still retaining enough disk bandwidth per server for efficient data replication and recovery
tasks. We will use locally attached disks, thus keeping both performance and costs reasonable. All disks
are hot-swappable, making replacements simple. A storage node will consist of 3 SuperMicro SC847
chassis, one holding the motherboard and 36 disks, with the other two holding 45 disks each, for a total of
126 drives with a total storage capacity of 252TB. On the storage servers we will use one LSI 9211-8i
controller to drive the backplane of the 36 disks, connecting 2x4 SATA ports to the 36 drives through the
backplane's port multiplier. The two external disk boxes are connected to a pair of LSI 9200-8e controllers,
with 2x4 ports each, but the cards and boxes are cross-wired (one 4-port cable from each card to each box)
for redundancy in case of a controller failure, as the split backplanes automatically revert to the good
controller. The IO will be limited by the saturation point of the controllers and the backplanes, estimated to
be approximately 3.6GB/s.
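For the storage node the capacity and bandwidth figures follow the same kind of bookkeeping: 126 two-terabyte drives give 252TB, and although their aggregate platter speed would be nearly 14 GB/s, the node is limited by the controllers and port-multiplied backplanes to the ~3.6-3.8 GB/s quoted here and in Table 3. A sketch of that arithmetic:

```python
# Storage-node capacity and sequential IO
drives = 36 + 45 + 45            # head chassis plus two external disk boxes
capacity_tb = drives * 2         # 2 TB drives
raw_gb_s = drives * 0.110        # 110 MB/s per drive, if nothing else limited throughput
quoted_ceiling = 3.6             # controller/backplane saturation estimate from the text
print(f"{drives} drives, {capacity_tb} TB, raw {raw_gb_s:.1f} GB/s, "
      f"limited to ~{quoted_ceiling} GB/s by controllers and backplanes")
# -> 126 drives and 252 TB; the backplane, not the platters, sets the node's IO ceiling
```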
Both servers use the same dual-socket SuperMicro IPMI motherboard (X8DAH+F-O) with 7 PCIe Gen2
slots. The CPU is the cheapest 4-core Westmere, but we will be able to upgrade to faster dual 6-core
processors in the future, as prices drop. In our prototype we tried to saturate this motherboard: we
exceeded a sequential throughput of 5GB/s with no saturation seen.
                | 1P   | 1S    | 90P  | 12S  | Full
servers         | 1    | 1     | 90   | 12   | 102
rack units      | 4.0  | 12.0  | 360  | 144  | 504
capacity (TB)   | 24.0 | 252.0 | 2160 | 3024 | 5184
price ($K)      | 8.5  | 22.8  | 766  | 274  | 1040
power (kW)      | 1.0  | 1.9   | 94   | 23   | 116
GPU (TF)        | 6.0  | 0.0   | 540  | 0    | 540
seq IO (GBps)   | 4.6  | 3.8   | 414  | 45   | 459
netwk bw (Gbps) | 10.0 | 20.0  | 900  | 240  | 1140
Table 3. Summary of the Data-Scope properties for single servers and for the whole system consisting of Performance (P) and Storage (S) servers.

The network interconnect is 10GbE. Three 7148S switches from Arista Networks are used at the 'Top of
the Rack' (TOR), and a high-performance 7148SX switch is used for the 'core' and the storage servers.
The TOR switches each have four links aggregated to the core for 40Gbps of throughput. We deploy
Chelsio NICs, single port on the performance servers and dual port on the storage side.
Hardware Capabilities
The Data-Scope will consist of 90 performance and 12 storage servers. Table 3 shows the aggregate
properties of the full instrument. The total disk capacity will exceed 5PB, with 3PB in the storage and
2.2PB in the performance layer. The peak aggregate sequential IO performance is projected to be
459GB/s, and the peak GPU floating point performance will be 540TF. This compares rather
favorably with other HPC systems. For example, the Oak Ridge Jaguar system, the world’s fastest
scientific computer, has 240GB/s peak IO on its 5PB Spider file system [50].
The total power consumption is only 116kW, a fraction of typical HPC systems, and a factor of 3 better
per PB of storage capacity than its predecessor, the GrayWulf. The total cost of the parts is $1.04M. The
projected cost of the assembly + racks is $46K, and the whole networking setup is $114K, for a total
projected hardware cost of about $1.2M. In the budget we reserved $200K for contingency and spare
parts. The system is expected to fit into 12 racks.
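The aggregate numbers follow directly from the per-server figures in Table 3; a quick cross-check of the headline capacity, IO, GPU and parts-cost numbers:

```python
# Full-system aggregates from the per-server figures in Table 3
perf, stor = 90, 12
capacity = perf * 24 + stor * 252          # TB
seq_io = perf * 4.6 + stor * 3.8           # GB/s
gpu_tflops = perf * 6                      # storage servers carry no GPUs
parts_cost = perf * 8.5 + stor * 22.8      # $K, parts only (racks and networking excluded)
print(capacity, round(seq_io), gpu_tflops, round(parts_cost))
# -> 5184 TB, ~460 GB/s, 540 TF, ~$1.04M in parts, matching the totals quoted above
```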
Data Ingestion and Recovery Strategies
The storage servers are designed for two purposes: (i) Data Replication and Recovery, i.e., incremental
and full dataset copies and restores (large and small); and (ii) Import/Export of Large Datasets, where
users show up with a couple of boxes of disks, should be able to start experiments within hours, and can
keep their data online over the lifetime of the experiment (e.g., months).
Individual disk failures at the expected standard rate of about 3%/yr are not expected to cause much of a
problem for the performance servers; this amounts to one failure every 6 days. On our 1PB GrayWulf
server we have experienced a much lower disk failure rate (~1%) so far. These failures can be dealt with
fairly easily: simple media errors are reformatted automatically with data recovery, and failed disks are
replaced manually.
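The one-failure-every-six-days figure follows from the drive count and the assumed 3%/yr failure rate; a quick check, together with the same arithmetic at the ~1% rate observed on the GrayWulf:

```python
# Expected disk-failure cadence in the performance layer
drives = 90 * 24                      # performance servers x disks per server
for annual_rate in (0.03, 0.01):      # quoted planning figure vs. GrayWulf experience
    failures_per_year = drives * annual_rate
    print(f"{annual_rate:.0%}: one failure every {365 / failures_per_year:.1f} days")
# -> 3%: roughly every 5.6 days ("every 6 days"); 1%: roughly every 17 days
```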
The bigger challenge is that most of the time the storage servers do not need much bandwidth (e.g.,
during incremental copies), but there is occasionally a need for considerably more bandwidth for a large
restore. Our solution is to design the network for the routine scenarios (i.e., incremental backups and
small restores). Both the performance servers and the storage servers are configured with hot-swappable
disks, so atypical large restores can be performed by physically connecting disks to the servers (i.e.,
sneakernet [51]). Given that movable media (disks) are improving faster than networks, sneakernet will
inevitably become the low-cost solution for large ad hoc restores, e.g., 10-1000TB.
The hot-swap disks are also useful for importing and exporting large datasets (~100TB). The Data-Scope
is intended to encourage users to visit the facility and bring their own data. For practical reasons, the data
set should be small enough to fit in a few 50-pound boxes (~100TB). With the hot-swappable feature, users
could plug in their disks and have their data copied to the performance servers in a few hours. When
visitors leave after a few weeks or months, their data could be swapped out and stored on a bookshelf,
where it could easily be swapped back in if the visitor needs to perform a follow-up experiment remotely.
Both the performance servers and especially the storage servers could be configured with a few spare
disk slots so that users can swap in their data without having to swap out someone else's.
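The "few hours" claim and the case for sneakernet both come down to simple bandwidth arithmetic: a 100TB dataset saturating a 10Gbps link for the entire transfer still needs about a day, whereas fifty user-supplied 2TB disks copied in parallel finish in roughly six hours. A sketch of that comparison (the ~100 MB/s per-disk copy rate is an assumption):

```python
# Moving a 100 TB dataset: network vs. hot-swapped disks
dataset = 100e12                                # bytes
net_hours = dataset * 8 / 10e9 / 3600           # dedicated 10 Gbps link, ideal conditions
disks = 50                                      # 2 TB disks ("a few 50-pound boxes")
disk_hours = dataset / (disks * 100e6) / 3600   # ~100 MB/s per disk, copied in parallel
print(f"network: {net_hours:.0f} h, disks: {disk_hours:.1f} h")
# -> ~22 hours over the wire vs. ~5.6 hours from the visitors' own disks
```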
Remote users can transfer data using the fast Open Cloud Consortium (OCC) network [10] – currently a
dedicated 10GbE to MAX and Chicago and soon much higher. OCC has also dedicated network links to
several Internet NAPs. Finally, the JHU internal backbone is already running at 10Gbps and in the next
few months the high throughput genomics facilities at JHMI will be connected to this network.
Usage Scenarios
We envisage about 20-25 simultaneous applications, which can use the Data-Scope in four different
ways. One can run stable, high-availability public web services, allowing remote users to perform
processing operations on long-lived data sets. These would typically be built on several tens of TB of data
and would store data in a redundant fashion for both safety and performance. Examples of such services
might be the VO cross-match services in astronomy, or the JHU Turbulence database services. Other
applications can load their data into a set of large distributed shared databases, with aggregate sizes
in tens to a few hundred TB. The users can run data intensive batch queries against these data sets and
store the intermediate and final results in a shared database and file system space. We have developed a
parallel workflow system for database ingest of data sets in the 100TB range for the Pan-STARRS project [52]. This can be turned into a more generic utility with a moderate amount of work.
Hadoop is an open source implementation of Google's MapReduce [53], which provides good load
balancing and an elegant data-parallel programming paradigm. Part of the instrument will run Hadoop
over a multitude of data sets. We will experiment with running the most compute-intensive processing
stages (bioinformatics, ray-tracing for image simulations in astronomy) on the GPUs using CUDA code.
Finally, when all else is inadequate, certain users can request access to the “bare metal”, running their
own code end-to-end on the performance servers.
User Application Toolkits and Interfaces
We will provide users with several general-purpose programming toolkits and libraries to maximize application performance. We have already developed some of the key software components. For example, the
custom collaborative environment built for SDSS, CasJobs [54,55], has been in use for seven years by
more than 2,500 scientists. The component we need to add is a shared 100TB intermediate-term storage
facility, which has been designed but remains to be built.
We have designed and implemented a generic SQL/CUDA interface that enables users to write their own
user-defined functions that execute on the GPUs but are called from the database. Since all the data flow
stays on the backplane of the same server, one can achieve stunning performance. This was
demonstrated in our entry for the SC-09 Data Challenge [56].
We have implemented our own S3/Dropbox lookalike, which has been connected to various open source
S3 bindings downloaded from SourceForge. This interface is simple, scalable, well-documented, and will
provide a convenient way for users to up- and download their smaller data sets.
On the applications side we have already ported several key applications to CUDA, but most of the development work will materialize as users start to migrate their applications to the Data-Scope. The
expectation is that we will need to customize these applications for the Data-Scope and integrate them
with each other. Other components, such as the integration between MPI and the SQL database, have been prototyped
but will need to be fully developed.
In summary, the components that we have developed allow novel high-throughput data analyses. For
example, using the SQL-CUDA integration, users can access powerful analysis patterns such as FFTs
from within a database query. Likewise, Linux MPI applications can read and write data from and to
databases using the MPI-DB API. During the initial stages of the Data-Scope's development we will use some of the performance
and storage servers for software development and testing.
Data Lifecycles
We envisage three different lifecycle types for data in the instrument. The first would be persistent data,
over which permanent public services will be built for a wide community, like OpenSkyQuery, the
Turbulence database or the Milky Way Laboratory. The main reason to use the Data-Scope in this case is
the massive performance gain from the speed of the hardware and from parallel execution. These data
sets will range from several tens to possibly a few hundred TB.
The second type of processing will enable truly massive data processing pipelines that require both
high bandwidth and fast floating point operations. These pipelines will process hundreds of TB, including
reprocessing large images from high throughput genomic sequencers, for reduced error rates, and
massive image processing tasks for astronomy or cross correlations of large environmental data sets.
Data will be copied physically by attaching 2TB disks to the Data-Scope, while results will be extracted
using the same method. These datasets will be active on the system for one to a few weeks. Another
typical user of this pattern would be the LHC data analysis group.
The third type of usage will be community analysis of very large data sets. Such datasets will be in
the 200-500TB range. We will keep the media after the dataset has been copied into the instrument and
use them to restore the input data in the case of a disk failure. Once such a massive data set arrives, its
partitioning and indexing will be a massive endeavor; it therefore only makes sense if the data stay active
for an extended period (3-12 months). Intermediate, derived data sets could also reach tens or even
100TB. Examples of such datasets include a massive set of simulations (500 cosmological simulations
with a high temporal resolution) coupled with an analysis campaign by a broad community.
System Monitoring, Data Locality
There will be a data locality server monitoring the file systems, so that the system is aware of what is
stored where without depending on users to update tables manually when disks are swapped in and out.
There will be a special file on every disk that tells the system what is on the disk, and a bar code on the
outside of each disk to make it easy to locate. As for software, we plan to use open source backup software
(such as Amanda), as much as possible. Both the operating system software and applications (including
Hadoop, SQL Server etc) will be deployed using a fully automated environment, already in use for several
years at JHU. Since the performance servers are identical, it is very easy to dynamically change the node
allocations between the different usage scenarios.
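A minimal sketch of how such a locality service could work, assuming a hypothetical manifest file (here called MANIFEST.json) at the root of every data disk; the actual file name, format, and mount layout are not specified in the text:

```python
import json
from pathlib import Path

def scan_disks(mount_root="/datascope"):
    """Build a {dataset name: [mount points]} map from per-disk manifest files.

    Each data disk is assumed to carry a MANIFEST.json at its root listing the
    datasets stored on it, so the locality map stays correct as disks are
    swapped in and out without anyone editing tables by hand.
    """
    root = Path(mount_root)
    locality = {}
    if not root.is_dir():
        return locality
    for mount in root.iterdir():
        manifest = mount / "MANIFEST.json"
        if not manifest.is_file():
            continue                          # non-data disk or freshly blanked drive
        for dataset in json.loads(manifest.read_text()).get("datasets", []):
            locality.setdefault(dataset, []).append(str(mount))
    return locality

if __name__ == "__main__":
    print(scan_disks())
```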
Development Strategy and Methods
Our Track Record
The JHU group has been systematically working on building high performance data-intensive computing
systems (both hardware and software) over the last decade, originally started with Jim Gray. We also
have more than a decade of experience in building and operating complex scientific data centers. The
100TB Sloan Digital Sky Survey (SDSS) archive is arguably one of the most used astronomy facilities in
the world. Some of the most used web services (OpenSkyQuery) for the Virtual Observatory have also
been built and run out of JHU. Our group also hosted the GalaxyZoo project [57] during its first two
years, until we moved it to the Amazon Cloud.
In 2008 we built the 1PB GrayWulf facility (named after our friend Jim Gray), the winner of the Data
Challenge at SuperComputing-08, supported by Microsoft and the Gordon and Betty Moore Foundation.
The experience gained during the GrayWulf construction was extremely useful in the Data-Scope design.
We will rely heavily on software developed for the GrayWulf [52], and will also add Hadoop to the
software provided. It is also clear that the GrayWulf will be maxed out by early Fall 2010, and we need an
expansion with ‘greener’ properties, to avoid the power wall. In a project funded by NSF’s HECURA
program, we have built low-power experimental systems for data intensive computing, using solid state
disks[48]. The lessons learned have been incorporated into the Data-Scope design. JHU was part of the
Open Cloud Consortium team[10] winning the Network Challenge at SC-08.
Prototypes, Metrics
Over the years we have built and customized a suite of performance tools designed to measure and
project server performance in data-intensive workloads. We also have access to many years of real
workload logs from the SDSS and NVO servers for comparisons. For the Data-Scope proposal we have
already built a prototype performance server, which has exceeded the original expectations. The numbers
quoted in the proposal are on the conservative side.
(d) Impact on the Research and Training Infrastructure
Impact on Research
There is already a critical mass of researchers at JHU who will use the Data-Scope (see Section (b)).
These research problems represent some of the grand challenges in scientific fields ranging from
astrophysics and physical sciences, to genome and cancer research, and climate science. What these
challenges have in common is that their space and processing requirements surpass the capabilities of
the computing infrastructure that the university owns or can lease (i.e., cloud computing).
While it is becoming increasingly feasible to amass PB-scale datasets, it is still difficult to share such
datasets over wide area networks. The Data-Scope will provide scientists with an environment where they
can persistently store their large datasets and allow their collaborators to process these data either
remotely (through Web Services) or locally (by submitting jobs through the batch processing system). In
this way the Data-Scope will organically become a gathering point for the scientific community.
Furthermore, by being a shared environment among multiple projects the Data-Scope will promote
sharing of code and best practices that is currently hindered by disparate, vertically integrated efforts. We
will seed this community centered at the Data-Scope via our collaborators in other academic institutions
and national labs (see letters of collaboration).
The availability of compelling datasets at the Data-Scope will have secondary benefits. Just as the public
availability of few-TB datasets (e.g., SDSS) helped nurture research in distributed databases, data
mining, and visualization, the ability to broadly access and process multi-PB datasets will create a
renaissance in data-intensive computational techniques.
Impact on Training
The analysis of PB-scale datasets lies at the center of the Fourth Paradigm in Science [58]. Likewise,
dealing with the data deluge is of paramount importance to the Nation’s security. Therefore, future
generations of scientists and engineers must develop the data analysis skills and computational thinking
necessary to compete globally and tackle grand challenges in bioinformatics, clean energy, and climate
change. By providing the ability to store, analyze, and share large and compelling datasets from multiple
scientific disciplines the Data-Scope will become a focal point in all education levels. The paragraphs that
follow describe our interdisciplinary education plans focused around the proposed instrument.
The JHU Center for Computational Genomics offers workshops, short courses, a seminar series, and an
annual symposium, all of which bring together undergraduate and graduate students, fellows, and faculty
(including clinicians) from disciplines as diverse as computer science, molecular biology, oncology,
biostatistics, and mathematics. In these courses, participants are encouraged to not only attend courses
but also to develop and teach courses to the other “students.” Recent short courses and seminars include
discussions of synthetic biology and the database architecture needed for a large-scale synthetic genome
project; introduction to the R programming language; biological sequence alignment algorithms; overview
of experimental and analytical methods in high-throughput biology, and more. Having such a wide range
of expertise and academic experience brings new opportunities for students and professionals in diverse
fields and at all levels of training to get involved, and fosters better communication among scientists who
are part of multi-disciplinary teams. Senior Personnel S. Wheelan will integrate the Data-Scope into
future iterations of the workshops offered by the JHU Center for Computational Genomics.
Along the same lines, CoPI Church will leverage the NSF-Funded Center for Language and Speech
Processing (CLSP) Summer Workshop and the Summer School in Human Language Technologies
(HLT), funded by the HLT Center of Excellence, to introduce the Data-Scope to the students in the NLP
field. Since its inception in 1998, about 100 graduate students and 60 undergraduates have participated
in the workshop. More than 50 additional students supported by the North American Chapter of the
Association for Computational Linguistics, and over 20 students from local colleges and universities have
attended the two-week intensive summer school portion, leading to the education and training of over 230
young researchers. The workshop provides extended and substantive intellectual interactions that have
led indirectly to many continuing relationships. The opportunity to collaborate so closely with top
international scientists, both one-on-one and as a team, offers a truly exceptional and probably
unprecedented research education environment.
Finally, multiple JHU faculty members participating in this project are part of an NSF-funded team on
“Modeling Complex Systems: The Scientific Basis of Coupling Multi-Physics Models at Different Scales.”
We will leverage this IGERT program to introduce graduate students to the topics of large-scale
simulations and data-intensive science using the Data-Scope.
Outreach to under-represented communities. CoPI Meneveau will use his existing network to continue
recruiting talented Hispanic US graduate students through his contact with Puerto Rico universities and
ongoing collaborations with Prof. L. Castillo of RPI and Universidad del Turabo (PR). Over the past 3
years, there have been 3 visiting Hispanic PhD students, 1 MS visiting student from UPR Mayaguez, and
4 REU undergraduate students from Puerto Rico working with Dr. Meneveau.
Women are better represented in computational linguistics, environmental sciences, and biology than in
computer science and the physical sciences. We will capitalize on this observation to recruit more female
students in the sciences that face this gender imbalance. Half of the undergraduate participants in the
previously mentioned workshops are women. They are talented students selected through a highly
competitive search. These students are already involved in interdisciplinary research in data-intensive
science that makes them exceptionally well qualified to enter our PhD programs. We will also leverage
the JHU Women in Science and Engineering program (WISE, funded by NSF, housed in the Whiting
School of Engineering, and currently linked to Baltimore County’s Garrison Forest School) to build into
our program a permanent flow of new students at both the high school and undergraduate levels. The
greater Baltimore area has several historically black colleges and universities from which we will actively
attempt to recruit students, both at the undergraduate level (for summer research) and as prospective
applicants to our program. Our target colleges will include Morgan State University, Coppin State College,
Bowie State University, Howard University, and the University of D.C.
General outreach. PI Szalay was a member of the NSF Taskforce on Cyberlearning, and is a coauthor of
the report “Fostering Learning in the Networked World: The Cyberlearning Opportunity and Challenge”. PI
Szalay also led the effort of building a major outreach program around the tens of terabytes of SDSS
data. This program has delivered more than 50,000 hours classroom education in astronomy and data
analysis for high school students, through a set of student laboratory exercises appropriate for students
from elementary through high school. The sdss.org website recently won the prestigious SPORE Award
for the best educational websites in science, issued by AAAS and Science magazine.
Recently the GalaxyZoo (GZ), served from JHU, became one of the most successful examples of `citizen
science’ [57]. In this, members of the public were asked to visually classify images of a million galaxies
from the SDSS data. More than 100,000 people signed up, did the training course, and performed over
40 million galaxy classifications. GalaxyZoo has been featured by every major news organization in the
world (CNN, BBC, the Times of London, NYT, Nature, The Economist, Science News) as an example of
how science can attract a large, involved non-expert population if presented in the right manner.
As part of the NSF-funded projects in Meneveau’s laboratory, two Baltimore Polytechnic Institute (BPI)
high school students conducted research activities in his lab. “Baltimore Poly”, as it is affectionately called
in this area, has a long tradition of excellence in science and serves a majority African-American student
body. CoPI Terzis has collaborated with BPI teachers to develop projects that encourage K-12 students
to pursue science and engineering careers through interactive participation in cutting-edge research. This
proposed NSF project with the novel capabilities offered by the Data-Scope will enable continuation of the
link to Poly’s Ingenuity Project in Baltimore.
(e) Management Plan
The Work Breakdown Structure (WBS) for the project is found in
Table 4. The actual WBS has been developed down to the third level,
but has been partially collapsed (bold), in order to fit on a single page.
The instrument development will be led by the group of five PIs, also
selected as representatives of the different schools of JHU. They will
be responsible for the high level management of the project. The
management and operations of the instrument (including the
sustained operations) will be carried out within the Institute of Data
Intensive Engineering and Science (IDIES) at JHU.
The system design and architecture will be developed by three of the PIs (Alex Szalay, Andreas Terzis and Ken Church) together with Jan Vandenberg, Head of Computer Operations at the Dept. of Physics and Astronomy. There is a full-time Project Manager, Alainna White, who has already been involved in planning the schedule and the WBS. She will lead the construction of the instrument hardware and supervise the system software. The design and implementation of the system software will be led by Richard Wilton (with 20 years of experience in software development), advised by Prof. Randal Burns of JHU Computer Science.
Figure 2. The organization of the Data-Scope development and operations team (an organization chart linking the PIs, Design/Architecture, the Project Manager, Construction, System Software, Application Software, Operations, and the Operations Council).
There will be an Operations Council, consisting of 5 faculty members representing different disciplines
and a member representing the external users of the instrument. This group will make recommendations
during the design, construction and commissioning of the instrument, and make decisions about resource
allocation and application support development after the commissioning period. The Operations Council
will be chaired by Randal Burns. Additional members include Mark Robbins, Charles Meneveau, Sarah
Wheelan, Darryn Waugh and Robert Hanisch (STScI, Director of the VAO, external representative).
Operations will be led by Ani Thakar, who has been responsible for the development and operation of the SDSS Archive (now hosted by JHU) and who also coordinated the development of the Pan-STARRS data management system. The operations team will consist of a full-time System Administrator and a half-time Database Administrator (DBA), shared with IDIES. Alainna White will move into the System Administrator position after the commissioning, and she will also have access to additional support personnel through IDIES in high-pressure situations. Extrapolating from our current computing facilities, this is adequate support in an academic environment.
The parts list and the detailed system configuration have been presented in Section (c), Table 2, since these are closely tied to the goals and performance of the instrument. The total cost of the hardware parts is $1,040K, networking $114K, and racks and assembly $46K. There is a $200K contingency to accommodate price fluctuations and spare parts. A large fraction of this will be funded by JHU's cost sharing ($980K).
The first month of the project will be spent on Hardware Configuration (Task 1): reevaluating the original design, accommodating newly emerging hardware components with better price/performance, updating the in-house prototypes, and reconfirming the performance metrics. There is minimal risk in this phase, since we have already built and validated the performance server, and the storage server is very straightforward in all respects. The only real risk is a sudden escalation of CPU or memory prices, in which case we will cut back on purchasing the full memory. We also plan to delay the purchase of half the GPU cards, since this is the area with the fastest pace of evolution: every three months new, better and cheaper cards appear on the market. With the recent release of the first NVIDIA Fermi cards, the market is expected to consolidate by the end of 2010.
The System Software (Task 2) is a major part of the system. Most of the important pieces have been largely developed or designed for the SDSS, GrayWulf, Pan-STARRS and the turbulence archives. Some have been in use by thousands of people for more than 6 years (CasJobs). This helps to mitigate the risks involved. Nevertheless, considerable additional development will still be needed to modify, homogenize and generalize these tools. These modification tasks will be carried out by the same individuals who developed the original software, so the learning curve will be relatively mild. Some of this effort can be started right away, on the prototype boxes, while the rest of the system is assembled. The main categories of the system software have not been collapsed in the WBS, so that they can be seen individually. The total system software development comes to 24 man-months, all during the first year.
The Construction schedule (Task 3) is rather aggressive. The main risk is that some of the components may be out of stock (at a good price) given the quantities involved; we encountered this in the past with SSDs. This is why we budgeted 28 days for assembly and preparation. We can absorb another two weeks of lead time without a major crisis. The preparation of the instrument room and the networking will begin as soon as the design is consolidated. We allocated 14 days for validation and testing of the different hardware components. Over the years we have built many automated diagnostic and test tools to validate the performance of the different system components, so the risks here are minimal. Since we will have spare parts, faulty boxes can be repaired on site. Disks will be shipped directly to JHU and inserted into the servers after they are rack-mounted, to minimize disk damage during shipping. We will hire undergraduates for the disk tray assembly, bar coding and insertion.
During the Commissioning phase (Task 4) we will install the user interfaces and create the environments for the test users, including the loading of several large data sets. We will also simulate various system failures and recoveries. End-user documentation will be prepared and revised during this period. We have budgeted an additional 12 months in year 2 to interface system components with the evolving applications, since it is clear that the different applications will make some unanticipated requests with implications for the system software. After a decade of developing data-intensive computing hardware and software systems, we feel that the task at hand (with the 12 months in year 2) can be realistically accomplished as part of the commissioning phase, given the reuse of our existing software base.
Application Support (Task 5) will be a major part of our development: a total of 64 months of effort will be spent on enhancing existing applications to make maximal use of the Data-Scope, such as porting application kernels to CUDA, building custom data partitioning tools, Hadoop conversion, and customizing the DB loading workflows. There will be dedicated software engineers, whose development time will be allocated in approximately three-month blocks by the Operations Council and supervised by the Head of Operations.
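To illustrate what porting an application kernel to CUDA involves at its simplest, the sketch below offloads a toy per-record computation (the distance of each record to a query point) onto the GPU. The kernel, array names and problem size are our own illustrative assumptions, not actual Data-Scope application code.

// Minimal illustrative CUDA port of a toy per-record computation (assumed names and
// sizes; not actual Data-Scope application code). One GPU thread handles one record.
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <cuda_runtime.h>

__global__ void distanceKernel(const float *x, const float *y,
                               float qx, float qy, float *dist, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per record
    if (i < n) {
        float ddx = x[i] - qx;
        float ddy = y[i] - qy;
        dist[i] = sqrtf(ddx * ddx + ddy * ddy);      // distance to the query point
    }
}

int main()
{
    const int n = 1 << 20;                 // one million records (illustrative size)
    const size_t bytes = n * sizeof(float);

    // Host-side data; in a real port these would come from files or the database.
    float *hx = (float *)malloc(bytes), *hy = (float *)malloc(bytes);
    float *hd = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { hx[i] = i * 1e-6f; hy[i] = 1.0f - hx[i]; }

    // Copy the flattened arrays to the GPU, run the kernel, copy the result back.
    float *dx, *dy, *dd;
    cudaMalloc(&dx, bytes); cudaMalloc(&dy, bytes); cudaMalloc(&dd, bytes);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    distanceKernel<<<blocks, threads>>>(dx, dy, 0.5f, 0.5f, dd, n);
    cudaMemcpy(hd, dd, bytes, cudaMemcpyDeviceToHost);

    printf("dist[0] = %f\n", hd[0]);
    cudaFree(dx); cudaFree(dy); cudaFree(dd);
    free(hx); free(hy); free(hd);
    return 0;
}

Real application ports follow the same pattern – flatten the records into arrays, copy them to device memory, and launch one thread (or a small group of threads) per record – with the data partitioning and database hooks listed above layered on top.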
In the Operations Phase (Task 6, not shown due to space constraints), starting in January 2012, the Operations Council will reevaluate every six months how the resources are allocated among the different projects, and the lifecycles of the different data sets. The reallocation may include moving some of the machines from a Windows DB role to Linux – this is easily done since the hardware is identical and the software install is an automatic push. We expect requests to come in the form of a short (1 page) proposal. In exceptional cases the 6-month cycle can be accelerated. Users can either use the network to copy their data sets or, as described in Section (c), bring their own disks, which is still the most efficient way to move large amounts of data.
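A rough, illustrative estimate (our own numbers: a 100 TB data set and a fully saturated 10 Gbps wide-area link) shows why: transferring 100 TB, i.e. 8x10^14 bits, at 10^10 bits/s takes about 8x10^4 seconds, or roughly 22 hours, and sustained wide-area transfers rarely reach line rate; a box of disks shipped overnight arrives in about a day whether it holds 100 TB or 500 TB.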
JHU is committed to a 5-year Sustained Operation of the Data-Scope beyond the end of the MRI grant (see support letter). In this phase, JHU will provide the facility, electricity and cooling, the salary of the System Administrator, the 50% DBA, and the MAX connection fees. As hardware upgrades become necessary, we envisage first upgrading the disks (as new projects bring their own disks, they will leave them there). CPU and memory upgrades are incremental and easy, and they will be financed by the projects needing the enhanced capabilities. We expect that the 5PB capacity of the system will be maxed out in about 2-3 years. At that point we hope that the system will have become so indispensable to many of the involved research projects that they will use their own funds to incrementally add to the system. In the 'condominium model' that IDIES has been using to sustain HPC at JHU, users can buy blocks of new machines, which gives them guaranteed access to a time-share of the whole instrument. The configuration of these blocks is standardized, to benefit from the economies of scale in automated system management.
All software developed is and will be Open Source. Our GrayWulf system has been cloned at several places, and the CasJobs environment is running at more than 10 locations worldwide. Our data access and web services applications (and our data sets) have been in use at over 20 institutions worldwide. We will create a detailed white paper on the hardware design and the performance metrics, and we will create scaled-down reference designs containing concrete recipes, in the Beowulf spirit, for installations smaller than the Data-Scope, to be used by other institutions.
Table 4. The Work Breakdown Structure (WBS) of the project, with task durations, start and finish dates. (The monthly Gantt-chart bars of the original figure, spanning 1/11 through 12/12, are omitted here.)

WBS   Task Name                                               Duration  Start    Finish
1     Hardware Configuration                                  26 days   1/3/11   2/7/11
1.1   Reevaluate components (price and performance)           10 days   1/3/11   1/14/11
1.2   Reevaluate detailed design                              10 days   1/12/11  1/25/11
1.3   Update prototype performance box                        10 days   1/17/11  1/28/11
1.4   Update prototype storage box                            10 days   1/17/11  1/28/11
1.5   Update and Verify Performance Metrics                   5 days    1/31/11  2/4/11
1.6   Update Network architecture                             5 days    1/10/11  1/14/11
1.7   Update requirements for operating environment           3 days    2/3/11   2/7/11
1.8   Design detailed recovery plan                           7 days    1/17/11  1/25/11
2     System Software Development                             180 days  1/3/11   9/9/11
2.1   Create a data map application/service                   60 days   1/3/11   3/25/11
2.2   Create MPI-DB                                           180 days  1/3/11   9/9/11
2.3   Create SQL-CUDA Framework                               27 days   4/8/11   5/16/11
2.4   Create recovery software                                60 days   1/3/11   3/25/11
2.5   Customize system monitoring software                    85 days   1/11/11  5/9/11
2.6   Customize sensor monitoring software                    60 days   1/3/11   3/25/11
2.7   Create S3/Dropbox service                               60 days   5/2/11   7/22/11
2.8   Create customized interface for Dropbox                 35 days   7/25/11  9/9/11
2.9   Customize MyDB / CASJobs                                90 days   1/3/11   5/6/11
2.10  Create monitoring database application and interface    60 days   1/11/11  4/4/11
2.11  Loading Framework                                       60 days   1/3/11   3/25/11
2.12  Disk Tracking Software                                  30 days   1/3/11   2/11/11
2.13  Design automated OS/SW installation environment         20 days   1/28/11  2/24/11
2.14  Create deployment plan                                  8 days    1/24/11  2/2/11
3     Construction                                            84 days   2/7/11   6/2/11
3.1   Select vendor for off-site assembly                     10 days   2/7/11   2/18/11
3.2   Order all components                                    14 days   2/21/11  3/10/11
3.3   Assembly and Preparation                                28 days   3/11/11  4/19/11
3.4   System Integration                                      18 days   4/20/11  5/13/11
3.5   Initial system testing and debug                        14 days   5/16/11  6/2/11
4     Commissioning                                           412 days  6/3/11   12/31/12
4.1   Establish WAN Connections                               12 days   6/3/11   6/20/11
4.2   Ingest of Datasets                                      60 days   6/3/11   8/25/11
4.3   Install User Interaction Layer                          30 days   6/3/11   7/14/11
4.4   Create user test environment                            150 days  6/3/11   12/29/11
4.5   Simulate failures and recoveries                        60 days   6/3/11   8/25/11
4.6   Unanticipated software build requests                   260 days  1/3/12   12/31/12
5     Application Software Support                            475 days  3/1/11   12/24/12
5.1   Data partitioning                                       475 days  3/1/11   12/24/12
5.2   Loading workflow customization                          475 days  3/1/11   12/24/12
5.3   Application DB conversion                               475 days  3/1/11   12/24/12
5.4   Hadoop Conversion                                       475 days  3/1/11   12/24/12
5.5   Application kernel porting to CUDA                      475 days  3/1/11   12/24/12
Existing Facilities, Equipment and Other Resources
Our collaboration has a substantial amount of existing hardware and facilities, which will enhance the
possible uses of the Data-Scope. These facilities have also served as a way to gradually understand the
real needs of data-intensive computing, and as prototypes for our current, much larger instrument.
The DLMS Cluster
The DLMS cluster was supported by an NSF MRI grant (MRI-320907, DLMS: Acquisition of Instrumentation for a Digital Laboratory for Multi-Scale Science, PIs: Chen et al.). The cluster consists of a compute layer and a database layer, the JHU public turbulence database. It houses a 27 TB database that contains the entire time history of a 1024³ mesh point pseudo-spectral DNS of forced isotropic turbulence, connected to a 128-node Beowulf cluster through a Myrinet fabric. This was our first experiment in integrating traditional HPC with databases. The resulting public database has been in wide use for many years (see Section (b) of the Project Description), and has recently been migrated to the GrayWulf for improved performance.
The GrayWulf Cluster
GrayWulf is a distributed database cluster at JHU consisting
of 50 database nodes with 22TB and an 8-core server each,
for a total of 1.1PB. The cluster was purchased on funds from
the Gordon and Betty Moore Foundation, the Pan-STARRS
project and Microsoft Research. The cluster already hosts
several large data sets (Pan-STARRS, turbulence, SDSS,
various Virtual Observatory catalogs and services, environmental sensor data, computer security data sets, network
traffic analysis data, etc). Currently about 500TB is already
utilized. The cluster has an IO performance exceeding many
supercomputers: the aggregate sequential read speed is
more than 70 Gbytes/sec. The GrayWulf is a direct
predecessor for the Data-Scope. One of its weaknesses is
that there is no low-cost storage layer; backups were made to tape, a solution that is not scalable in the long run.
The HHPC Cluster
The same computer room hosts an 1800-core Beowulf cluster, a computational facility shared among
several JHU faculty. The HHPC and the GrayWulf share a common 288-port DDR Infiniband switch for an
extremely high-speed interconnect. There is an MPI interface under development that will enable very
fast peer-to-peer data transfers between the compute nodes and the database nodes. The Deans of the
JHU Schools provide funds to cover the management and operational costs of the two connected clusters
as part of IDIES.
The HECURA/Amdahl Blade Cluster
As part of our ongoing HECURA grant (OCI-09104556) we have built a
36-node cluster combining low power motherboards, GPUs and solid
state disks. The cluster, while consuming only 1.2 kW, has an IO speed of 18 Gbytes/sec, and using the GPU/SQL integration it was able to compute 6.4 billion tree traversals for an astronomical regression problem. NVIDIA has also provided substantial support (a hardware donation and a research grant) for this system.
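To make the tree-traversal workload concrete, here is a simplified sketch (our own flat-array tree layout and names, not the actual HECURA GPU/SQL code) of how a GPU kernel can evaluate one regression-tree prediction per thread over a batch of records. In the GPU/SQL integration such kernels are driven from inside the database engine, so the records do not have to be exported before being handed to the GPU.

// Illustrative batched regression-tree traversal on the GPU (assumed layout and
// names). Internal nodes test one feature against a threshold; negative child
// indices mark leaves, whose predictions are stored in leafValue.
#include <cstdio>
#include <cuda_runtime.h>

struct DeviceTree {
    const int   *feature;    // feature index tested at each internal node
    const float *threshold;  // split threshold at each internal node
    const int   *left;       // left child index, or -(leafId + 1) for a leaf
    const int   *right;      // right child index, or -(leafId + 1) for a leaf
    const float *leafValue;  // prediction stored at each leaf
};

__global__ void treePredict(DeviceTree t, const float *records, int nFeatures,
                            float *out, int nRecords)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per record
    if (r >= nRecords) return;
    const float *rec = records + (size_t)r * nFeatures;
    int node = 0;                                    // start at the root
    while (node >= 0)
        node = (rec[t.feature[node]] < t.threshold[node]) ? t.left[node] : t.right[node];
    out[r] = t.leafValue[-node - 1];                 // decode the leaf index
}

int main()
{
    // A toy one-split tree: if feature 0 < 0.5 predict 10, otherwise predict 20.
    int hFeat[] = {0};   float hThr[] = {0.5f};
    int hLeft[] = {-1};  int hRight[] = {-2};        // both children are leaves
    float hLeaf[] = {10.0f, 20.0f};
    const int nRecords = 4, nFeatures = 2;
    float hRec[nRecords * nFeatures] = {0.1f, 0.f, 0.9f, 0.f, 0.4f, 1.f, 0.6f, 1.f};
    float hOut[nRecords];

    DeviceTree t;  int *dF, *dL, *dR;  float *dT, *dV, *dRec, *dOut;
    cudaMalloc(&dF, sizeof(hFeat));  cudaMemcpy(dF, hFeat,  sizeof(hFeat),  cudaMemcpyHostToDevice);
    cudaMalloc(&dT, sizeof(hThr));   cudaMemcpy(dT, hThr,   sizeof(hThr),   cudaMemcpyHostToDevice);
    cudaMalloc(&dL, sizeof(hLeft));  cudaMemcpy(dL, hLeft,  sizeof(hLeft),  cudaMemcpyHostToDevice);
    cudaMalloc(&dR, sizeof(hRight)); cudaMemcpy(dR, hRight, sizeof(hRight), cudaMemcpyHostToDevice);
    cudaMalloc(&dV, sizeof(hLeaf));  cudaMemcpy(dV, hLeaf,  sizeof(hLeaf),  cudaMemcpyHostToDevice);
    cudaMalloc(&dRec, sizeof(hRec)); cudaMemcpy(dRec, hRec, sizeof(hRec),   cudaMemcpyHostToDevice);
    cudaMalloc(&dOut, sizeof(hOut));
    t.feature = dF; t.threshold = dT; t.left = dL; t.right = dR; t.leafValue = dV;

    treePredict<<<1, 32>>>(t, dRec, nFeatures, dOut, nRecords);
    cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);
    for (int i = 0; i < nRecords; i++) printf("record %d -> %.1f\n", i, hOut[i]);
    return 0;
}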
NSF-MRI NVIDIA cluster
JHU has recently been awarded an NSF MRI grant (CMMI-0923018) to purchase a large GPU cluster. We are in the process of architecting and ordering this system based on the next-generation Fermi architecture (as soon as the first Fermi-based TESLA cards are available). The cluster will also have high-speed disk IO. An additional significance of the cluster is that it will help to introduce many students to this technology. There is a natural synergy between the Data-Scope and the GPU cluster.
The Open Cloud Consortium Testbed
The OCC has so far placed one rack of a Beowulf cluster at JHU, and Yahoo is in the process of deploying two more racks of the cluster there. There will be five more racks (42 nodes each, with 8 cores and 3TB) in
Chicago, for a total of almost 900TB of disk space. The servers are connected to a 48-port 1Gbps switch,
with two 10Gbps uplink ports. This system is part of the Open Science Cloud Testbed, and can be used
for the Hadoop-based transform and load workflows distributed over geographic distances. The purpose
of the cluster is to explore petascale distributed computing and data mining challenges where the nodes
are separated across the continental US. There is a dedicated 10Gbps lambda connection from the JHU
cluster to UIC, via the Mid-Atlantic Crossroads (MAX) and McLean, VA. The Infiniband switch of the HHPC-GrayWulf also has a 10Gbit module that is connected to the outgoing 10Gbps line. This enables any one of the GrayWulf machines to be accessible from any of the OCC servers. The core switch of the Data-Scope will also be directly linked to the 10Gbps backbone.
JHU Internal Connections
The internal JHU backbone has recently been upgraded to 10Gbps. The Data-Scope will be able to
connect directly to the 10Gbps router. The different partners within JHU will all be connected to the fast
backbone during the first two years of the project.
References
[1] Sloan Digital Sky Survey Archive, http://skyserver.sdss.org/
[2] Virtual Astronomical Observatory, http://us-vo.org/
[3] Budavari,T., Szalay,A.S, Malik,T., Thakar,A., O'Mullane,W., Williams,R., Gray,J., Mann,R.,
Yasuda,N.: Open SkyQuery -- VO Compliant Dynamic Federation of Astronomical Archives,
Proc. ADASS XIII, ASP Conference Series, eds: F.Ochsenbein, M.Allen and D.Egret, 314,
177 (2004).
[4] Large Synoptic Survey Telescope, http://lsst.org/
[5] Panoramic Survey Telescope and Rapid Response System, http://pan-starrs.ifa.hawaii.edu/
[6] Klypin, A., Trujillo-Gomez, S, Primack, J., 2010, arXiv:1002.3660.
[7] Diemand, J., Kuhlen, M., Madau, P., Zemp, M., Moore, B., Potter,D., & Stadel, J. 2008,
Nature, 454, 735
[8] Heitmann, K., White, M., Wagner, C., Habib, S., Higdon, D., 2008, arXiv:0812.1052v1
[9] Steinmetz, M. et al. 2006, AJ, 132, 1645
[10] Open Cloud Consortium, http://opencloudconsortium.org/
[11] Sarah Loebman, Dylan Nunley, YongChul Kwon, Bill Howe, Magdalena Balazinska, and
Jeffrey P. Gardner, 2009, Proc. IASDS.
[12] Yeung, P.K., R.D. Moser, M.W. Plesniak, C. Meneveau, S. Elgobashi and C.K. Aidun, Report
on NSF Workshop on Cyber-Fluid Dynamics: New Frontiers in Research and Education,
2008.
[13] Moser, R.D., K. Schulz, L. Smits & M. Shephard, “A Workshop on the Development of Fluid
Mechanics Community Software and Data Resources”, Report on NSF Workshop, in
preparation, 2010.
[14] Perlman, E. R. Burns, Y. Li & C. Meneveau, Data exploration of turbulence simulations using
a database cluster, In Proceedings of the Supercomputing Conference (SC’07), 2007.
[15] Li, Y., E. Perlman, M. Wan, Y. Yang, C. Meneveau, R. Burns, S. Chen, G. Eyink & A. Szalay,
A public turbulence database and applications to study Lagrangian evolution of velocity
increments in turbulence, J. Turbulence 9, N 31, 2008.
[16] Chevillard, L. and C. Meneveau, Lagrangian dynamics and statistical geometric structure of
turbulence, Phys. Rev. Lett. 97, 174501, 2006.
[17] Chevillard, L. and C. Meneveau, Intermittency and universality in a Lagragian model of
velocity gradients in three-dimensional turbulence, C.R. Mecanique 335, 187-193, 2007.
[18] Li, Y. and C. Meneveau, Origin of non-Gaussian statistics in hydrodynamic turbulence, Phys.
Rev. Lett., 95, 164502, 2005.
[19] Li, Y. and C. Meneveau, Intermittency trends and Lagrangian evolution of non-gaussian
statistics in turbulent flow and scalar transport, J. Fluid Mech., 558, 133-142, 2006.
[20] Li, Y., C. Meneveau, G. Eyink and S. Chen, The subgrid-scale modeling of helicity and
energy dissipation in helical turbulence, Phys. Rev. E 74, 026310, 2006.
[21] Biferale, L., L. Chevillard, C. Meneveau & F. Toschi, Multi-scale model of gradient evolution
in turbulent flows, Phys. Rev. Letts. 98, 213401, 2007.
[22] Chevillard, L., C. Meneveau, L. Biferale and F. Toschi, Modeling the pressure Hessian and
viscous Laplacian in turbulence: comparisons with DNS and implications on velocity gradient
dynamics, Phys. Fluids 20, 101504, 2008.
[23] Wan, M., S. Chen, C. Meneveau, G. L. Eyink and Z. Xiao, Evidence supporting the turbulent
Lagrangian energy cascade, submitted to Phys. Fluids, 2010.
[24] Chen, S. Y., G. L. Eyink, Z. Xiao, and M. Wan, Is the Kelvin Theorem valid for high-Reynolds-number turbulence? Phys. Rev. Lett. 97, 144505, 2006.
[25] Meneveau, C. Lagrangian dynamics and models of the velocity gradient tensor in turbulent
flows. Annu. Rev. Fluid Mech. 43, in press, 2010.
[26] Yu H. & C. Meneveau. Lagrangian refined Kolmogorov similarity hypothesis for gradient time
evolution and correlation in turbulent flows. Phys. Rev. Lett. 104, 084502, 2010.
[27] Eyink, G. L., Locality of turbulent cascades, Physica D 207, 91-116, 2005.
[28] Eyink, G. L., Turbulent cascade of circulations, Comptes Rendus Physique 7 (3-4), 449-455,
2006a.
[29] Eyink, G. L., Multi-scale gradient expansion of the turbulent stress tensor, J. Fluid Mech. 549
159-190, 2006b.
[30] Eyink, G. L. Turbulent diffusion of lines and circulations, Phys. Lett. A 368 486–490, 2007.
[31] Eyink, G. L. and H. Aluie, The breakdown of Alfven's theorem in ideal plasma flows:
Necessary conditions and physical conjectures, Physica D, 223, 82–92, 2006.
[32] Eyink, G. L., Turbulent flow in pipes and channels as cross-stream “inverse cascades” of
vorticity, Phys. Fluids 20, 125101, 2008.
[33] Eyink, G. L., Stochastic line motion and stochastic conservation laws for nonideal
hydromagnetic models, J. Math. Phys., 50 083102, 2009.
[34] Eyink, G. L. The small-scale turbulent kinematic dynamo, in preparation, 2010.
[35] Luethi, B., M. Holzer, & A. Tsinober. Expanding the QR space to three dimensions. J. Fluid
Mech. 641, 497-507, 2010.
[36] Gungor A.G. & S. Menon. A new two-scale model for large eddy simulation of wall-bounded
flows. Progr. Aerospace Sci. 46, 28-45, 2010.
[37] Wang, J., L.-P. Wang, Z. Xiao, Y. Shi & S. Chen, A hybrid approach for numerical simulation
of isotropic compressible turbulence, J. Comp. Physics, in press, 2010.
[38] Mittal, R., Dong, H., Bozkurttas, M., Najjar, F.M., Vargas, A. and von Loebbecke, A., "A versatile
sharp interface immersed boundary method for incompressible flows with complex
boundaries", J. Comp. Phys. 227, 4825-4852, 2008.
[39] Le Maître, O.P. L. Mathelin, O.M. Knio & M.Y. Hussaini, Asynchronous time integration for
Polynomial Chaos expansion of uncertain periodic dynamics, Discrete and Continuous
Dynamical Systems 28, 199-226, 2010.
[40] Le Maître O.P. & O.M. Knio, Spectral Methods for Uncertainty Quantification – With
Application to Computational Fluid Dynamics, Springer, 2010.
[41] Szalay,A.S., Gray,J., 2006, Science in an Exponential World, Nature, 440, 413.
[42] Bell, G., Gray, J. & Szalay, A.S. 2006, “Petascale Computational Systems: Balanced Cyber-Infrastructure in a Data-Centric World”, IEEE Computer, 39, pp 110-113.
[43] Bell, G., Hey, A., Szalay, A.S. 2009, “Beyond the Data Deluge”, Science, 323, 1297
[44] Moore’s Law: http://en.wikipedia.org/wiki/Moore%27s_law
[45] Walter, C., 2005, Kryder’s Law. Scientific American. August 2005.
[46] Nielsen’s Law: http://www.useit.com/alertbox/980405.html
[47] Szalay,A.S., Gordon Bell, Jan Vandenberg, Alainna Wonders, Randal Burns, Dan Fay, Jim
Heasley, Tony Hey, Maria Nieto-SantiSteban, Ani Thakar, Catherine Van Ingen, Richard
Wilton: GrayWulf: Scalable Clustered Architecture for Data Intensive Computing,
Proceedings of the HICSS-42 Conference, Mini-Track on Data Intensive Computing, ed. Ian
Gorton (2009).
[48] Szalay, A., Bell, G., Huang, H., Terzis, A., White, A. Low-Power Amdahl-Balanced Blades for
Data Intensive Computing. In the Proceedings of the 2nd Workshop on Power Aware
Computing and Systems (HotPower '09). October 10, 2009.
[49] Schmid, P., Ross, A., 2009, Sequential Transfer Tests of 3.5” hard drives,
http://www.tomshardware.com/reviews/wd-4k-sector,2554-6.html
[50] Spider: http://www.nccs.gov/computing-resources/jaguar/file-systems/
[51] Church, K., Hamilton, J., 2009, Sneakernet: Clouds with Mobility,
http://idies.jhu.edu/meetings/idies09/KChurch.pdf
[52] Yogesh Simmhan, Y., Roger Barga, Catharine van Ingen, Maria Nieto-Santisteban, Laszlo
Dobos, Nolan Li, Michael Shipway, Alexander S. Szalay, Sue Werner, Jim Heasley, 2009,
GrayWulf: Scalable Software Architecture for Data Intensive Computing, Proceedings of the
HICSS-42 Conference, Mini-Track on Data Intensive Computing, ed. Ian Gorton.
[53] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, 6th
Symposium on Operating System Design and Implementation, San Francisco, 2004.
[54] W. O'Mullane, N. Li, M.A. Nieto-Santisteban, A. Thakar, A.S. Szalay, J. Gray, 2005, “Batch is back: CasJobs, serving multi-TB data on the Web”, Microsoft Technical Report, MSR-TR-2005-19.
[55] Li, N., Szalay, A.S., 2009, “Simple Provenance in Scientific Databases”, Proc. Microsoft eScience Workshop, Pittsburgh, eds: T. Hey and S. Tansley.
[56] Szalay,A.S. et al, 2009: Entry in Supercomputing-09 Data Challenge.
[57] Raddick J., Lintott C., Bamford S., Land K., Locksmith D., Murray P., Nichol B., Schawinski
K., Slosar A., Szalay A., Thomas D., Vandenberg J., Andreescu D., 2008, “Galaxy Zoo:
Motivations of Citizen Scientists”, Bulletin of the American Astronomical Society, 40, 240
[58] The Fourth Paradigm: Data-Intensive Scientific Discovery, 2009, eds: T. Hey, S. Tansley, K.
Tolle, Microsoft Research Press.
MRI Coordinator
The National Science Foundation
4201 Wilson Boulevard,
Arlington, VA 22230
Charlottesville, April 20, 2010
Dear Sirs,
We would like to express NVIDIA’s interest in the proposal entitled “Data-Scope – a Multi-Petabyte Generic
Analysis Environment for Science”, submitted to the NSF MRI program by the Johns Hopkins University, PI:
Alexander Szalay.
We are very excited about the possibility of building a new, balanced architecture for the emerging data-intensive computational challenges. These problems do not map very well onto traditional architectures.
The unique combination of extreme I/O capabilities with the right mix of GPUs represents a novel approach we have not seen anywhere else, and it has a very good chance of resulting in major breakthroughs in how we think about future extreme-scale computing. Furthermore, the integration of CUDA with the SQL Server
engine and the ability to run GPU code inside User Defined Functions will give database queries unprecedented
performance.
NVIDIA is happy to support this effort, and in recognition of his work, JHU became one of the inaugural
members of our recently formed CUDA Research Centers. We will provide JHU with advance technical
information on technologies and on our emerging new products, send evaluation boards, and we will also
extend major discounts as NVIDIA equipment is acquired.
We believe that the design of the Data-Scope instrument described in the proposal will find broad applicability, and we will see it replicated in many universities around the world.
Sincerely,
David Luebke, Ph.D.
NVIDIA Distinguished Inventor
Director of Research
NVIDIA Corporation