
GrayWulf: Scalable Software Architecture for Data Intensive Computing
Yogesh Simmhan, Roger Barga, Catharine van Ingen
Microsoft Research
{yoges, barga, vaningen}@microsoft.com

Maria Nieto-Santisteban, Laszlo Dobos, Nolan Li, Michael Shipway, Alexander S. Szalay, Sue Werner
Johns Hopkins University
{nieto, nli, mshipway, szalay, swerner}@pha.jhu.edu

Jim Heasley
University of Hawaii
heasley@ifa.hawaii.edu
Abstract
Big data presents new challenges to both cluster
infrastructure software and parallel application
design. We present a set of software services and
design principles for data intensive computing with
petabyte data sets, named GrayWulf†. These services
are intended for deployment on a cluster of
commodity servers similar to the well-known Beowulf
clusters. We use the Pan-STARRS system currently under development as an example of the architecture and principles in action.

† The GrayWulf name pays tribute to Jim Gray, who has been actively involved in the design principles.
1. Introduction
The nature of high performance computing is
changing. While a few years ago much of high-end
computing involved maximizing CPU cycles per
second allocated for a given problem, today it
revolves more and more around performing
computations over large data sets. The computing
must be moved to the data as, in the extreme, the data
cannot be moved [1].
The combination of inexpensive sensors,
satellites, and internet data accessibility is creating a
tsunami of data. Today, data sets are often in the tens
of terabytes and growing to petabytes. Satellite
imagery data sets have been in the petabyte range for
over 10 years [2]. High Energy Physics crossed the
petabyte boundary in about 2002 [3]. Astronomy will
cross that boundary in the next twelve months [4].
This data explosion is changing the science process itself, leading to eScience by blending in computer science and informatics [5].
For data intensive applications the concept of “cloud computing” is emerging, where data and computing are co-located at a large centralized facility and accessed as well-defined services. The data presented to the users of these services is often an intermediate data product containing science-ready variables (“Level 2+”) and commonly used summary statistics of those variables. The science-ready variables are derived by complex algorithms or models requiring expert science knowledge and computing skills not always shared by the users of these services [6]. This offers many advantages over the grid-based model, but it is still not clear how willing scientists will be to use such remote clouds. Recently Google and IBM have made such a facility available to the academic community [23].
Data intensive cloud computing poses new
challenges for the cluster infrastructure software and
the application programmers for the intermediate data
products. There is no easy way to manage and analyze these data today, nor is there a simple system design and deployment template. The requirements include (i) scalability, (ii) the ability to grow and evolve the system over a long period, (iii) performance, (iv) end-user simplicity, and (v) operational simplicity. All of these requirements must be met with sufficient system availability and reliability at a relatively low cost, not only at initial deployment but throughout the data lifetime.
2. Design Tenets for Data Intensive
Systems
A high level systems view of a GrayWulf cluster
is shown in Figure 1. We separate the computing
farm into shared compute resources (the traditional
Beowulf [7]) and shared queryable data stores (the
Gray database addition).
Those resources are accessed by three different
groups:
• Users who perform analyses on the shared database.
• Data valets who maintain the contents of the shared databases on behalf of the users.
• Operators who maintain the compute and storage resources on behalf of both the users and data valets.

Figure 1. GrayWulf system overview.
These groups have different rights and
responsibilities and the actions they perform on the
system have different consequences.
Users access the GrayWulf in the cloud from
their desktops. Data intensive operations happen in
the cloud; the (significantly smaller) results of those
operations can be retrieved from the cloud or shared with other cloud computing collaborators, much as on popular photo sharing services. Specific challenges for a
GrayWulf beyond a Beowulf include:
• Present a simple query model. Scientists program as a means to an end rather than an end in itself and are often not proficient programmers. Our GrayWulf uses relational databases, although other models such as MapReduce [8] could be chosen.
• Minimize data movement. This is especially the case over often poor internet connections to the science desktop. GrayWulf systems include user storage – a “MyDB” – which is collocated with the larger shared databases. The MyDB or a subset of it may be downloaded to the user’s desktop.
• Varying job sizes. Accommodate not only analyses that run for periods of days but also analyses that run on the order of a few minutes. The workflow system and scheduler that handle the user analyses must support both.
• Collaboration and privacy. Enable both data privacy and data sharing between users. Having the MyDB in the cloud enables users to collaborate.
A powerful abstraction for the user cloud service
is a domain-specific workbench integrating the data,
common analysis tools and visualizations, but
running on a standard GrayWulf hardware and software platform.
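As an illustration of this query model, the sketch below shows the kind of simple SQL a scientist might submit against a shared object catalog. The table and column names (Object, ra, decl, magnitude) are illustrative assumptions rather than any deployed schema; the point is that the scan of the large table runs in the cloud and only a small summary travels back to the desktop.

    -- Illustrative sketch only; table and column names are assumed.
    -- The heavy scan stays next to the data; only the aggregate rows return.
    SELECT FLOOR(ra)      AS raBin,
           FLOOR(decl)    AS declBin,
           COUNT(*)       AS nObjects,
           AVG(magnitude) AS meanMagnitude
    FROM   Object                        -- shared, cloud-resident catalog
    WHERE  magnitude BETWEEN 18 AND 20   -- illustrative selection
    GROUP BY FLOOR(ra), FLOOR(decl);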
The data valets operate behind the scenes to
present the shared databases to the users. They are
responsible for creating and maintaining the shared
database that contains the intermediate data product.
The data valets assemble the initial data, provide all
algorithms used to convert that data to science-ready
variables, design the database schema and storage
layout, and implement all necessary data update and
grooming activities. There are three key elements to
the GrayWulf approach:
• “20 queries”. This is a set of English language questions that captures the science interest and is used as the requirements for the deployed database [9, 10]. An example query might be “find all quasars within a region and compute their distance to surrounding galaxies”; a SQL sketch of such a query appears after this list. The 20 queries are the first driver for all database and algorithm design.
• “Divide and conquer”. Operating at petabyte scales requires partitioning, which enables parallelization. Many scientific analyses are driven by space or time. Partitioning along one or both of those axes enables the data to be distributed across physical resources. Once partitioned, queries and/or data updates can be run in parallel on each partition, with or without a reduction step [11]. The divide and conquer approach is the second driver for all efficient database and algorithm implementation.
• “Faults happen”. Fault handling is designed into all data valet workflows rather than being added on afterwards. Success is simply a special case of rapid fault repair, and faults are detected at the highest possible level in the system. Data availability and data integrity are a concern at petabyte scale, particularly when dealing with commodity hardware. Traditional backup-restore breaks down. Two copies protect against catastrophic data loss; the surviving copy can be used to generate the second copy. Three copies are needed to protect against data corruption; the surviving two copies can be compared against known checksums, which greatly reduces the probability that both are corrupt.
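To make the “20 queries” tenet concrete, the following sketch shows one way the quasar example could be phrased in SQL using a zone-style spatial join [17]. Everything here is an assumption for illustration: the Object table, its columns (objID, type, ra, decl, zoneID), the type codes, and the flat-sky distance approximation stand in for the real schema and the production cross-match algorithm.

    -- Sketch only: schema, type codes, and zone layout are assumed.
    DECLARE @r float;
    SET @r = 0.05;   -- search radius in degrees (illustrative)
    SELECT q.objID AS quasarID,
           g.objID AS galaxyID,
           SQRT(SQUARE(q.ra - g.ra) + SQUARE(q.decl - g.decl)) AS approxDistDeg
    FROM   Object AS q
    JOIN   Object AS g
           ON  g.zoneID BETWEEN q.zoneID - 1 AND q.zoneID + 1  -- zone pruning (assumes zone height >= @r)
           AND g.decl   BETWEEN q.decl - @r  AND q.decl + @r
           AND g.ra     BETWEEN q.ra   - @r  AND q.ra   + @r
    WHERE  q.type = 'QSO' AND g.type = 'GALAXY'
      AND  q.ra BETWEEN 180 AND 190 AND q.decl BETWEEN 0 AND 10;  -- region of interest

Because each partition holds a contiguous range of zones, a query of this shape can run independently on every partition, which is exactly the property the divide and conquer tenet asks for.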
The GrayWulf system services used by the data
valets have many of the same abstractions as the user
services. The difference is that the data valet services must operate at petabyte scale with a much higher need for logging and provenance tracking for problem diagnosis.
Mediating between the users, data valets and operators are the GrayWulf cluster management services, which provide configuration management, fault management, and performance monitoring. These services are often thin veneers on existing services, or extend those services to become database-centric. For example, the GrayWulf configuration manager encapsulates all the knowledge needed to deploy a new database from a known database template onto a target server.
To illustrate how the GrayWulf system architecture and approach work in practice, we use the Pan-STARRS system, which is currently under development and scheduled for deployment in early 2009. We also explain why we believe this system generalizes, as well as the specific research challenges still ahead.
3. The Pan-STARRS GrayWulf
3.1. A Telescope as a Data Generator
The Pan-STARRS project [12] will use a special
telescope in Hawaii with a 1.4 gigapixel camera to
sample the sky over a period of 3.5 years. The large
field of view and the relatively short exposures will
enable the telescope to cover three quarters of the sky
4 times per year, in 5 optical colors. This will result
in more than a petabyte of images per year. The
current telescope (PS1) is only a prototype of a much
larger system (PS4) consisting of 4 identical replicas
of PS1. PS1 saw first light in 2007 and is operating out of the Haleakala Observatory on Maui, Hawaii.
The images are processed through an image
processing pipeline (IPP) that removes instrument
signatures from the raw data, extracts the individual
detections in each image, and matches them against
an existing catalog of objects. The IPP processes each
3 GB image in approximately 30 seconds. The
objects found in these images can exhibit a wide
range of variability: they can move on the sky (solar
system objects and nearby stars), or even if they are
stationary they can have transients in their light
(supernovae and novae) or can have regular
variations (variable stars). The IPP also computes the
more complicated statistical properties of each object.
Each night, the IPP processes approximately 2-3
terabytes of imagery to extract 100 million detections
on a LINUX compute farm of over 300 processors.
The detections are loaded into a shared database
for analysis. While the initial system and database
design was based on the experience gained from the
Sloan Digital Sky Survey (SDSS) SkyServer
experience [13], the Pan-STARRS system presents
two new challenges:
 The data size is too large for a monolithic
database. The SDSS database after 5 years now
contains roughly 290 million objects and is 4
terabytes in size [14]. The Pan-STARRS
database will contain over 5 billion objects and,
at the end of the first year, will contain 70 billion
detections in 30 terabytes. At the end of PS1
project, the database will grow to 100 terabytes.
 The updated data will become available to the
astronomers each week. SDSS has released
survey data on the order of every six months.
New detections will be loaded into the Pan-STARRS database nightly and become available
to astronomers weekly. The object properties and
the detection-object associations are also updated
regularly.
At a high level, there are two different database
access patterns: queries that touch only a few objects
or detections and crawlers that perform long
sequential accesses to objects or detections. Crawlers
are typically partitioned into buckets to both return
data quickly and bound the redo time in the event of a
failure. User crawlers are read-only against the shared database; data valet crawlers are used to update object properties or detection-object associations.
3.2. User Access – MyDB and CASJobs
User access to the shared database is through a
server-side workbench. Each user gets their own
database (MyDB) on the servers to store intermediate
results. Data can be analyzed or cross-correlated not
only inside MyDB but also against the shared
databases. Because both the MyDBs and the shared
databases are collocated in the cloud, the bandwidth
across machines can be high and the cross-machine
database queries efficient.
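As a sketch of this usage pattern (not the exact CASJobs syntax), a user query can materialize a cross-matched subset of the shared catalog directly into a MyDB table, so that only the final, much smaller product ever leaves the cloud. The database, table, and column names below are illustrative assumptions.

    -- Sketch only: names are assumed, not the deployed Pan-STARRS schema.
    SELECT o.objID, o.ra, o.decl, COUNT(d.detectID) AS nDetections
    INTO   MyDB.dbo.BrightVariables            -- result table in the user's MyDB
    FROM   PS1.dbo.Object      AS o            -- shared head database (name assumed)
    JOIN   PS1.dbo.P2Detection AS d ON d.objID = o.objID
    WHERE  o.magnitude < 19                    -- illustrative cut
    GROUP BY o.objID, o.ra, o.decl
    HAVING COUNT(d.detectID) > 50;

Because MyDB and the shared databases sit on the same cluster, this join never crosses the wide-area network; the resulting table can later be shared with collaborators or downloaded to the desktop.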
Users have full control over their own MyDBs
and have the full expressivity of SQL. Users may
create custom schemas to hold results or bind
additional code within their MyDB for specific
analyses. Data may be uploaded to or downloaded
from a MyDB and tables can be shared with other
users, thus creating a collaborative environment
where users can form groups to share results.
Access to MyDBs and the shared databases is through the Catalog Archive Server Jobs System (CASJobs), initially built for SDSS [15]. The CASJobs web service authenticates users, schedules queries to ensure fairness to both long and short running queries, and includes tools for database access such as a schema browser and a query execution plan viewer.
CASJobs logs each user action, and once the final
correct result is achieved, it is possible to recreate a
clean path that leads directly to the result, removing
all the blind alleys. This is particularly useful as many scientific analyses begin in an exploratory fashion and are only then repeated at scale. The clean path is often obscured by the exploration.
The combination of MyDB and CASJobs has
proved to be incredibly successful [16]. Today 1,600
astronomers – approximately 10 percent of the
world’s professional astronomy population – are
daily users. Some of these users are “power users” who exercise the full capabilities of SQL. Others use “baby SQL” much like a scripting language, starting from commonly used sample queries. The system supports both ad hoc and production analyses and balances small queries as well as long, bucketed crawlers.
3.3. The PS1 Database
The PS1 database includes three main types of tables: Detections, Objects, and Metadata. Detection rows correspond to astronomical sources detected in either single or stacked images. Object rows correspond to unique astronomical sources (stars and galaxies) and summarize statistics from both single and stack detections, which are stored in separate sets of columns. Metadata refers mainly to telescope and image information used to track the provenance of the detections. Figure 2 shows the actual tables and their relationships.
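A heavily simplified sketch of the three table families is shown below; the column lists are assumptions for illustration only, and the real schema in Figure 2 is far richer.

    -- Sketch only: illustrative columns, not the actual PS1 schema.
    CREATE TABLE Object (
        objID   bigint PRIMARY KEY,    -- unique astronomical source
        ra      float  NOT NULL,       -- position (degrees)
        decl    float  NOT NULL,
        zoneID  int    NOT NULL        -- spatial partitioning key (zones [17])
        -- plus summary statistics from single and stack detections
    );
    CREATE TABLE P2Detection (
        detectID bigint PRIMARY KEY,   -- one single-image detection
        objID    bigint REFERENCES Object(objID),
        frameID  bigint NOT NULL       -- provenance link to image metadata
        -- plus the per-detection measurements
    );
    CREATE TABLE FrameMeta (           -- hypothetical metadata table name
        frameID  bigint PRIMARY KEY,
        obsTime  datetime NOT NULL,    -- when the image was taken
        filterID tinyint  NOT NULL     -- which of the 5 optical colors
    );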
While the number of objects stays mostly constant over the duration of the whole survey, about 14 billion new stack detections are inserted every six months and about 120 million single image detections every day. The gathered data generates an 83 terabyte monolithic database by the end of the survey (Table 1). These numbers clearly show that to reach a scalable solution we need to distribute the data across a database cluster handled by the Gray part of the Wulf.

Figure 2. Pan-STARRS PS1 database schema.

Table 1. Pan-STARRS PS1 database estimated size and yearly growth [terabytes].

Table               Y1     Y2     Y3     Y3.5
Object              2.03   2.03   2.03   2.03
P2Detection         8.02   16.03  24.05  28.06
StackDetection      6.78   13.56  20.34  23.73
StackApFlx          0.62   1.24   1.86   2.17
StackModelFits      1.22   2.44   3.66   4.27
StackHighSigDelta   1.76   3.51   5.27   6.15
Other tables        1.78   2.07   2.37   2.52
Indexes (+20%)      4.44   8.18   11.20  13.78
Total               26.65  49.07  71.50  82.71
3.4. Layering and Partitioning for
Scalability
The Pan-STARRS database architecture was
informed by the GrayWulf design tenets.
• The 20 queries suggest that many of the user queries operate only on the objects without touching the detections, and that there are relatively few joins between the objects and detections. Making a significantly smaller unified copy of just the objects and dedicating servers to only those queries would allow the server configuration to be optimized for those queries and, if necessary, additional copies could be deployed.
• Dividing the detections spatially enables horizontal partitioning.
• Separating the load associated with the data updates from that due to the user queries enables simpler balancing and configuration. Servers can be optimized for one or the other role.
• Treating the weekly data update as a special case of a database or server loss designs the fault handling in.
As shown in Figure 3, the database architecture is hierarchical. The lower data layer, the slice databases, contains spatial partitions of the objects, detections and metadata. The data are divided spatially using zones [17] and different pieces of the sky are assigned to each slice database. Each slice contains roughly the same number of objects along with their detections and metadata. The top data layer, the head databases, presents a unified view of the objects, detections, and metadata to the users. Head databases may be frozen or virtual. Frozen head databases contain only the objects and their properties copied from the slices. Virtual head databases “glue” the slice contents together using Distributed Partitioned Views (DPVs) across the slice databases.
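A minimal sketch of such a virtual head view follows, assuming three slice databases reachable through linked servers and a CHECK constraint on zoneID in each slice; the server, database, and range names are illustrative, not the production layout.

    -- Sketch only: linked server names, database names, and zone ranges assumed.
    -- Each member table carries a CHECK constraint on zoneID, so the optimizer
    -- can route a zone-constrained query to only the relevant slice(s).
    CREATE VIEW dbo.ObjectAll
    AS
    SELECT * FROM SliceSrv01.Slice01.dbo.Object   -- CHECK (zoneID BETWEEN    0 AND  999)
    UNION ALL
    SELECT * FROM SliceSrv02.Slice02.dbo.Object   -- CHECK (zoneID BETWEEN 1000 AND 1999)
    UNION ALL
    SELECT * FROM SliceSrv03.Slice03.dbo.Object;  -- CHECK (zoneID BETWEEN 2000 AND 2999)

A query against ObjectAll with a zoneID predicate therefore touches only the slices whose ranges overlap the predicate, which is what allows queries and updates to run partition-local and in parallel.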
Slice databases are one of hot, warm, or cold. The
hot and warm slices are used exclusively for
CASJobs user queries. The cold slices are used for all
data loading, validation, merging, indexing, and
reorganization. This division lets us optimize disk
and server configuration for the two different loads.
The three copies also provide fault tolerance and load
balancing. We can lose any slice at any time and
recover.
Figure 3. Shared data store architecture.

The head and slice databases are distributed across head and slice servers. Head servers hold hot and warm head databases; slice servers hold hot and warm slice databases. Each slice server hosts four databases – two hot and two warm. Each slice database is about 2.5 terabytes, to accommodate the commodity hardware. The architecture scales very nicely by adding more slice blocks as the data grow. The hot and warm databases are distributed such that no server holds both the hot and warm copy of a given partition; if one database or server fails, another server can pick up the workload while the appropriate recovery processes take place.
The loader/merger servers hold the cold slice
databases and the load databases. Every night, new
detections from the IPP are ingested into load
databases where the data are validated and staged.
Every week, load databases are merged into the cold
databases. Once the merge has been successfully
completed, the cold databases are copied to an
appropriate slice server. The resulting “waiting warm” databases are then flipped in to replace the old warm databases. This approach does require copying each (large) cold slice database twice across the internal network every week. That challenge seems smaller than reproducing the load and merge on the hot and warm slices while still providing good CASJobs user query performance as well as fault tolerance and data consistency. We are taking the “Xerox” path: do the work once and copy.
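One plausible realization of the weekly flip is sketched below, under the assumption that it amounts to re-pointing the head's partitioned views at the freshly copied “waiting warm” databases; the production workflow also coordinates locks with the CASJobs scheduler and the Configuration Manager, and all names here are illustrative.

    -- Sketch only: assumes versioned warm database names and a view-level switch.
    BEGIN TRANSACTION;
    ALTER VIEW dbo.ObjectAll
    AS
    SELECT * FROM SliceSrv01.Slice01_wk42.dbo.Object   -- newly copied warm database
    UNION ALL
    SELECT * FROM SliceSrv02.Slice02_wk42.dbo.Object
    UNION ALL
    SELECT * FROM SliceSrv03.Slice03_wk42.dbo.Object;
    COMMIT;
    -- The superseded warm databases can then be detached and their space reclaimed.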
3.5. Data Releases: How to keep half your
users happy and the other half not
confused
PS1 will be the first astronomical survey providing access to time series data in a timely manner. We expect a weekly delay between when the IPP ships the new detections and when scientists can query them. However, once the new detections are in the system, the question remains how often we should update the object statistics. There are two issues to consider regarding updates: cost and repeatability.
Object updates can be expensive in terms of computation and I/O resources. Some statistics, such as refined astrometry (better positional precision), require retrieving all the detections associated with an object and performing complex computations over them. Celestial objects are not visible to the Pan-STARRS telescope all year, so objects are “done for this year” at predictable and known times. Thus, object updates can be computed in a rolling manner following the telescope. This distributes the processing over time as well as presenting the new statistics in a timely manner. The updates can be done as system crawlers.
Since the statistics update can be a complex process, we plan to divide the workload into small tasks and use system crawlers that scan the data continuously and compute new statistics. Although the current Pan-STARRS design does not include the “Beo” side of the Wulf, we may well call on the force of a Beowulf if the computational complexity of the statistics grows. Additionally, updates may affect indexes and produce costly logging activity that takes time and disk space. While the first issue can be addressed by throwing more or better hardware at the load/merge machines, the repeatability issue is more subtle and difficult to address.
Science relies on repeatability. Once the data has
been published, it must not change so results can be
reproduced. While half our users are eager to get to
the latest detections and updated statistics, the other
half prefers resubmitting the query from last week or
last month and getting the exact same results. Since
we cannot afford to have full copies of different
releases we need a more creative solution.
In the prior section we mentioned that the head
database presents a single logical view to the user
using distributed partitioned views or, in other words,
the head database is a virtual database with views on
the contents of the slice databases which are
distributed across slice servers. The virtual head
database has no physical content other than some
light weight metadata. When a new release is
published, we create a frozen instance of the object
statistics and define distributed partitioned views on
the detections using a release constraint. The release overheads will make the database grow to over 100 terabytes by the end of the PS1 survey (3.5 years of observation).
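A sketch of the idea follows, assuming each detection row carries a releaseID and each release freezes only the (much smaller) object statistics; the view, column, and server names are assumptions.

    -- Sketch only: releaseID column and view naming are assumed.
    -- Detections are stored once; a release is a constrained view over them
    -- plus a frozen snapshot of the object statistics.
    CREATE VIEW dbo.P2Detection_R3
    AS
    SELECT * FROM SliceSrv01.Slice01.dbo.P2Detection WHERE releaseID <= 3
    UNION ALL
    SELECT * FROM SliceSrv02.Slice02.dbo.P2Detection WHERE releaseID <= 3;

Re-running last month's query against the release-3 views then returns the same rows even after newer detections have been loaded, at a storage cost far below keeping full copies of every release.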
3.6. Cluster Management Services
Architecturally, GrayWulf is a collection of
services distributed across logical resources which
reside on abstracted physical resources. Systems management is a well-trodden path, but the data-intensive nature and scale of a GrayWulf pose additional challenges for simple, touch-free, yet robust operations. The GrayWulf approach is to co-exist
with other deployed management applications by
leveraging and layering.
The Cluster Configuration Manager acts as the
central repository and arbiter for GrayWulf services
and resources. Operator-recorded changes and changes made by data valet workflows are tracked directly; out-of-band changes are discovered by survey.
CASJobs and all data valet workflows interact with
the Configuration Manager. The Configuration
Manager also allocates resources between the data
valets and the users; resources may be used by only
valets, both valets and users, or only users at any
time.
The Configuration Manager captures the
following cluster entities and relations between them:
• Services: Role definitions. Each machine role is related to a set of infrastructure services that run on it. These include the servers deployed on the machines and the server instances of those servers that are running. Database server instances are defined in the logical view as specializations of server instances.
• Database instances: Templates and maps.
Each unique deployed database is a database
instance. Each instance is defined by a
description of the data it stores (such as the slice
zone), a database schema template, and the
mapping to the storage on which it resides.
• Storage: Database to disk. Each database
instance is mapped to files. Files are located in
directories on software volumes. Software
volumes are constructed from RAID volume
extents; software volumes can span and/or subset
RAID volumes. RAID volumes are constructed
from locally attached physical disks.
• Machines: Roles for processors, networks,
and disks. The cluster is composed of a number
of machines, and each machine contains physical
disks and network interconnects. Machines are
assigned specific roles, such as a web server or a
hot/warm database server or load/merge database
server. Roles help determine the software to be
installed or updated on a machine as well as
targeting data valet workflows. The knowledge
of the physical disks and networks is necessary
when balancing the applied load or repairing
faults.
The Configuration Manager exposes APIs for
performing consistent operations on the cluster. An example is the creation of a slice database instance on a specific machine, using the current version of that database's schema template and storage mapping; the APIs update the configuration database as the operation proceeds. All data valet and operator actions use these APIs. The operators have a web interface that uses the client library to view or modify the cluster configuration, such as when adding a new machine.
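A highly simplified sketch of the mapping the Configuration Manager maintains is given below; the table and column names are assumptions, and the real model also covers services, server instances, and the RAID/volume layers.

    -- Sketch only: illustrative machine -> database -> file mapping.
    CREATE TABLE Machine (
        machineID int PRIMARY KEY,
        hostName  nvarchar(128) NOT NULL,
        role      nvarchar(32)  NOT NULL   -- e.g. 'HeadServer', 'SliceServer', 'LoadMerge'
    );
    CREATE TABLE DatabaseInstance (
        dbInstanceID int PRIMARY KEY,
        machineID    int REFERENCES Machine(machineID),
        templateName nvarchar(64) NOT NULL, -- schema template the instance was deployed from
        zoneLow      int NULL,              -- slice extent, where applicable
        zoneHigh     int NULL,
        state        nvarchar(16) NOT NULL  -- 'hot' | 'warm' | 'cold' | 'load'
    );
    CREATE TABLE DatabaseFile (
        fileID       int PRIMARY KEY,
        dbInstanceID int REFERENCES DatabaseInstance(dbInstanceID),
        volumePath   nvarchar(256) NOT NULL -- software volume built on RAID extents
    );

Under this picture, a deployment API call is little more than a transaction that inserts the DatabaseInstance and DatabaseFile rows and runs the corresponding database creation from the schema template.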
GrayWulf factors in fault handling because faults
happen. New database deployment is treated as a case
of a failed database; loss of a database instance is
treated as a special case of the load/merge flip. The
fault recovery model is simple yet effective: reboot and replace. Failed disks that do not fail the RAID volume are repaired by replacing the disk; optionally, the query load on the affected databases can be throttled. Soft machine failures such as service crashes or hangs are repaired by reboot. Any other machine failure causes the database instances on that machine to be migrated to other machines. In other words, when a machine holding a hot slice database is lost, another copy of that hot slice database is reconstituted on another machine from the corresponding cold slice database. Replacing the failed machine at a future time is the sole task requiring operator attention.
Health and fault monitoring tools detect failures
and launch fault recovery workflows. Whenever
possible, the monitoring tools operate at the highest
level first. In other words, if the monitor can query a
database instance, there is no need to ping the
machine. There are also survey validations that
periodically compare the configuration database to
the deployed resources.
Performance can be monitored in three ways.
CASJobs includes a user query log that can be
mined. A performance watchdog gathers non-intrusive, low-granularity performance data as a general operational check. An impact analysis
sandbox is provided to launch a suite of queries or
jobs on a machine and record detailed performance
logs.
3.7. Data Valet Workflow Support
Services
Workflows are commonly used by eScience researchers to model scientific experiments and applications. GrayWulf uses this same methodology for the data valet actions. As shown in Figure 4, the data valet workflow infrastructure includes a set of services for visual workflow composition, automatic provenance capture, scheduling, monitoring, and fault handling. Our implementation uses the Trident open source scientific workflow system [18], built on top of the Windows Workflow Foundation core runtime [19].
Each data valet action becomes a workflow and is
composed of a set of activities. Workflows are
scheduled with knowledge of cluster configuration
information from the Configuration Manager and
policy information from the workflow registry.
During execution, a provenance service intercepts
each event to create an operation history.
Each activity performs a single action on a single
database or server. This divide and conquer approach
simplifies the subsequent fault handling and
monitoring as the state of the system and the steps to
recover can be simply determined from the
information captured by the provenance service.
Known activities and completed workflows are
tracked in a registry. This registry enables reuse of
activities or workflows in different workflows. For
example, in Pan-STARRS, the workflow for making a copy of a slice database on a remote machine is used both by the workflow that copies updated slice databases to the head machines and to make a new replica of a slice database in case of a disk failure. This ability to reuse activities and workflows encourages data valets to divide and conquer when implementing workflows. A graphical interface enables composing new workflows, adding activities, and monitoring workflow execution.
Figure 4. Trident scientific workflow system modules.

The scheduling service instantiates all data valet workflows, including load balancing and throttling. The scheduler maintains a list of workflows that can be executed on servers with specific roles; this provides simple runtime protections, such as avoiding a copy of a cold slice database to a head server. The scheduling service interacts with the Configuration Manager to track database movements and server configuration changes. At present, the scheduling service is limited to scheduling one and only one complete workflow on a specific node for simplicity; we are evaluating methods to enable more parallelism.

Workflow provenance refers to the derivation history of a result, and is vital for the understanding, verification, and reproduction of the results created by a workflow [20]. As a workflow executes, the provenance service intercepts each event to create a log. Each record represents a change in operational state and forms an accurate history of state changes up to and including the “present” state of execution. These records are sent to a provenance service that “weaves” them into a provenance trace describing how the final result was created. The provenance service also tracks the evolution of a workflow as it changes from version to version. For data valet workflows, this tracking should be invaluable for complex system debugging.

The workflow infrastructure also maintains a resource registry that records metadata about activities, services, data products and workflows, and their relationships. A resource may even be a data valet or operator. This allows composition of workflows that require a human step, such as replacing a disk or visual validation of a result. The resource registry is also used to manage workflow versioning and the curation of workflows, services, and data products.

3.8. Workflows as Data Management Tools
data intensive and there is constant data movement
and churning taking place. Data is being streamed,
loaded, moved, load balanced, replicated, recovered
and crawled at any point in time by the data valet
tasks. Workflows provide the support required to manage the complexity of the cluster configuration and the round-the-clock data updates in an autonomic manner: reliably, without administrator presence. If a machine or a disk fails, the system recovers onto existing machines by reconfiguring itself; the administrator's primary task is to eventually roll out the failed machine, plug in a replacement, and let the system rebalance the load.
There are several types of data valet workflows applicable to GrayWulf applications; they are, to some extent, analogous to the Extract, Transform, Load (ETL) steps performed in data warehouses.
• Load workflow. Data streaming from sensors and instruments needs to be assimilated into existing databases. The load workflow does the initial conversion from the incoming data format into the database schema. This may involve data translation, verifying data structure consistency, checking data quality, mapping the incoming data structure to the database schema, determining optimized placement of the “load databases”, performing the actual load, and running post-load validations. For example, Pan-STARRS load workflows take in the incoming CSV data files from the IPP, validate them, and load them into load databases that are optimally co-located with slice databases.
• Merge workflow. The merge workflow ingests the incoming data in the load databases into the cold databases, and reorganizes the incrementally grown database in an efficient manner. If multiple databases are to be loaded, as in the case of Pan-STARRS, additional scheduling may be required to do the updates concurrently and expose a consistent data release version to the users.
• Data replication workflow. The data replication
workflow creates copies of databases over the
network in an efficient manner. This is necessary
to ensure availability of adequate copies of the
databases, replicating databases upon updates or
failures. The workflow coordinates with the
CASJobs scheduler to acquire locks on hot and
warm databases, determines data placement,
makes optimal use of I/O and network
bandwidth and ensures the databases are not
corrupted during copy. In Pan-STARRS, these
workflows are typically run after a merge
workflow updates an offline copy of the database
to propagate the results, or if a fault causes the
loss of a database copy.
• Crawler workflow. Complete scans of the available data are required to check for silent data corruption or, in Pan-STARRS, to update object data statistics. These workflows need to be designed as minimally intrusive background tasks on live systems. A common pattern for these crawlers is to “bucket” portions of the large data space and apply the validation or statistics computation to one bucket at a time; a sketch of this pattern follows this list. This allows the tasks to be performed incrementally, to recover from faults at the last processed bucket, to balance the load on the machine, and to enable parallelization.
• Fault-recovery workflow. Recovering from hardware or software faults is a key task. Fault recovery requires complex coordination between several software components and must be fully automated. Faults and recovery can be modeled as state diagrams that are implemented naturally as workflows. These workflows may be associated as fault handlers for other workflows or triggered by the health monitor. Occasionally, a fault can be recovered from in a shorter time by allowing existing tasks to proceed to completion; hence look-ahead planning is required. In Pan-STARRS, many of the fault workflows deal with data loss and coordinate with the schedulers to pause existing workflows, launch data replication workflows, and ensure the system gracefully degrades and recovers while remaining in a deterministic state.
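The bucketing pattern used by the crawler workflows might look like the sketch below, where each activity processes one zone bucket and records a checkpoint so a restarted crawler resumes at the last completed bucket. The zone ranges, the statistic being refreshed, and the CrawlerCheckpoint table are all illustrative assumptions.

    -- Sketch only: one crawler step over a single zone bucket; names assumed.
    DECLARE @zoneLow int, @zoneHigh int;
    SET @zoneLow = 1200;  SET @zoneHigh = 1209;   -- this bucket's piece of sky

    -- Refresh one summary statistic (detection count) for the objects in the bucket.
    UPDATE o
    SET    o.nDetections = s.n
    FROM   Object AS o
    JOIN  (SELECT d.objID, COUNT(*) AS n
           FROM   P2Detection AS d
           JOIN   Object AS ob ON ob.objID = d.objID
           WHERE  ob.zoneID BETWEEN @zoneLow AND @zoneHigh
           GROUP BY d.objID) AS s ON s.objID = o.objID
    WHERE  o.zoneID BETWEEN @zoneLow AND @zoneHigh;

    -- Record progress so a restarted crawler resumes at the next bucket.
    INSERT INTO CrawlerCheckpoint (crawlerName, zoneHighDone, finishedAt)
    VALUES ('ObjectStatsCrawler', @zoneHigh, GETUTCDATE());

Keeping each bucket small bounds both the I/O impact on live user queries and the amount of work lost to a fault.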
Data valet workflows are different from traditional eScience workflows, which are scientist facing. Data valet workflows require a greater degree of fault resilience due to their critical nature, a security model that takes into account writes performed on system databases, and provenance collection that tracks attribution and database states.
4. Future Visions and Conclusions
Data-intensive science has rapidly emerged as
the fourth pillar of science. The hardware and
software tools required for effective eScience need to
evolve beyond the high performance computational
science that is supported by Beowulf. GrayWulf is a
big step in this direction.
Big data cannot reside on the user’s desktop due
to the cost and capacity limitations of desktop storage
and networking. The alternative is to store data in the cloud and move the computations to the data. Our GrayWulf uses SQL and databases to simply transport user analyses to the data, store results near the data, and retrieve summary results to the desktop. Enabling users to share results in the cloud also promotes the sharing and collaboration that are often central to multidisciplinary science. Cloud data storage also
simplifies the preparation and grooming of shared
data products by controlling the location and access
to those data products – the location of the “truth” is
known at all times and provenance can be preserved.
The infrastructure services present in GrayWulf
provide essential tools for deploying a shared data
product in the tens of terabytes to petabyte scale. The
services support the initial configuration, fault
handling, and rolling upgrade of the cluster. Beyond
Beowulf, the GrayWulf approach supports
continuous data updates without sacrificing data set
query performance or availability. These tools are designed to assist both the operators of the system in routine system maintenance and the data valets who are responsible for data ingestion and other grooming.
GrayWulf is designed to be extensible and future
ready. This is aptly captured by its application to
Pan-STARRS which will see its science data size
grow linearly by 30 terabytes per year during the first
3.5 years and four times that when the telescope is
fully deployed. The infrastructure services are
modular to ensure longevity. While the relational data access and programming model is used today, we can equally well accommodate patterns like Hadoop [21], MapReduce [8] or DryadLINQ [22]. The use of workflows for the data valet operations makes those tasks simpler and easier to evolve. While we have focused on the Pan-STARRS system here, we are also working on other GrayWulf applications such as a turbulence study on gridded data and cross-correlations for large astronomical catalogs.
The scalable and extensible software architecture
discussed in this paper addresses the unique
challenges posed by very large scientific data sets.
The architecture is intended to be deployed on the
GrayWulf hardware architecture based on commodity
servers with local storage and high bandwidth
networking. We believe the combination provides an affordable and easily deployable approach to systems for data intensive science analyses.
5. Acknowledgements
The authors would like to thank Jim Gray for
many years of intense collaboration and friendship.
Financial support for the GrayWulf cluster hardware was provided by the Gordon and Betty Moore Foundation, Microsoft Research and the Pan-STARRS project. We thank Microsoft's SQL Server group, in particular Michael Thomassy, Lubor Kollar and Jose Blakeley, for their enormous help in optimizing the throughput of the database engine. We would like to thank Tony Hey, Dan Fay and David Heckerman for their encouragement and support throughout the development of this project.
6. References
[1] P. Watson, “Databases in Grid Applications: Locality
and Distribution”, Database: Enterprise, Skills and
Innovation. British National Conference on Databases, UK,
July 5-7, 2005. LNCS Vol. 3567 pp. 1-16.
[2] P. M. Caulk, “The Design of a Petabyte Archive and
Distribution System for the NASA ECS Project", Fourth
NASA Goddard Conference on Mass Storage Systems and
Technologies, College Park, Maryland, March 1995.
[3] J. Becla and D. Wang, “Lessons Learned from
Managing a Petabyte”, CIDR 2005 Conference, Asilomar,
2005.
[4] Panoramic Survey Telescope and Rapid Response
System, http://pan-starrs.ifa.hawaii.edu/
[5] A. S. Szalay and J. Gray, “Science in an Exponential
World”, Nature, 440, pp 23-24, 2006.
[6] J. Gray, D. Liu, M. Nieto-Santisteban, A. S. Szalay, D.
DeWitt and G. Heber, “Scientific data management in the
coming decade”, ACM SIGMOD, 34(4), pp 34-41, 2005.
[7] T. Sterling, J. Salmon, D. J. Becker and D. F. Savarese,
“How to Build a Beowulf: A Guide to the Implementation
and Application of PC Clusters”, MIT Press, Cambridge,
MA, 1999, also http://beowulf.org/.
[8] J. Dean and S. Ghemawat, “MapReduce: Simplified
Data Processing on Large Clusters”, 6th Symposium on
Operating System Design and Implementation, San
Francisco, 2004.
[9] J. Gray and A. S. Szalay, “Where the Rubber Meets the
Sky: Bridging the Gap between Databases and Science”,
MSR-TR-2004-110, October 2004, IEEE Data Engineering
Bulletin, December 2004, Vol. 27.4, pp. 3-11.
[10] M. Nieto-Santisteban, T. Scholl, A. Szalay and A. Kemper, “20 Spatial Queries for an Astronomer’s Bench(mark)”, Proceedings of Astronomical Data Analysis Software and Systems XVII, London, UK, 23-26 September 2007.
[11] M. A. Nieto-Santisteban, J. Gray, A. S. Szalay, J. Annis, A. R. Thakar and W. J. O’Mullane, “When Database Systems Meet the Grid”, Proceedings of ACM CIDR 2005, Asilomar, CA, January 2005.
[12] D. Jewitt, “Project Pan-STARRS and the Outer Solar
System”, Earth, Moon and Planets 92: 465–476, 2003.
[13] A. R. Thakar, A. S. Szalay, P. Z. Kunszt and J. Gray,
“The Sloan Digital Sky Survey Science Archive: Migrating
a Multi-Terabyte Astronomical Archive from Object to
Relational DBMS”, Computing in Science and
Engineering, V5.5, Sept 2003, IEEE Press. pp. 16-29, 2003.
[14] J. Adelman-McCarthy et al., “The Sixth Data Release
of the Sloan Digital Sky Survey”, The Astrophysical
Journal Supplement Series, Volume 175, Issue 2, pp. 297-313, 2007.
[15] W. O’Mullane, N. Li, M. Nieto-Santisteban, A. S.
Szalay and A. R. Thakar, “Batch is Back: CASJobs,
Serving Multi-TB Data on the Web”, International
conference on Web Services 2005: 33-40.
[16] V. Singh, J. Gray, A. R. Thakar, A. S. Szalay, J.
Raddick, W. Boroski, S. Lebedeva and B. Yanny,
“SkyServer Traffic Report – The First Five Years”,
Microsoft Technical Report, MSR-TR-2006-190, 2006.
[17] M. A. Nieto-Santisteban, A. Thakar, A. Szalay and J.
Gray, “Large-Scale Query and XMatch, Entering the
Parallel Zone,” Astronomical Data Analysis Software and
Systems XV ASP Conference Series, Vol 351, Proceedings
of the Conference Held 2-5 October 2005 in San Lorenzo
de El Escorial, Spain. Edited by Carlos Gabriel, Christophe
Arviset, Daniel Ponz, and Enrique Solano. San Francisco:
Astronomical Society of the Pacific, 2006. p.493.
[18] R. S. Barga, J. Jackson, N. Araujo, D. Guo, N. Gautam, K. Grochow and E. Lazowska, “Trident: Scientific Workflow Workbench for Oceanography”, International Workshop on Scientific Workflows (SWF 2008).
[19] P. Andrews, “Presenting Windows Workflow
Foundation”, Sams Publishing, 2005.
[20] Y. Simmhan, B. Plale and D. Gannon, “A Survey of
Data Provenance for eScience”. SIGMOD Record, Vol. 34,
No. 3, 2005.
[21] Apache Hadoop, http://hadoop.apache.org/
[22] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly,
“Dryad: Distributed Data-Parallel Programs from
Sequential Building Blocks”, EuroSys, 2007.
[23] S. Lohr, “Google and I.B.M. Join in ‘Cloud
Computing’ Research”, New York Times article, October
8, 2007.