Distributed Computing: An Introduction
By: Leon Erlanger
You can define distributed computing in many different ways. Various vendors have created and marketed distributed computing systems for years, and have developed numerous initiatives and architectures to permit distributed processing of data and objects across a network of connected systems.
One flavor of distributed computing has received a lot of attention lately, and it will be a primary focus of this story--an environment where you can harness the idle CPU cycles and storage space of tens, hundreds, or thousands of networked systems to work together on a particularly processing-intensive problem. The growth of such processing models has been limited, however, by a lack of compelling applications and by bandwidth bottlenecks, combined with significant security, management, and standardization challenges. But the last year has seen new interest in the idea as the technology has ridden the coattails of the peer-to-peer craze started by Napster. A number of new vendors have appeared to take advantage of the nascent market, including heavy hitters like Intel, Microsoft, Sun, and Compaq that have validated the importance of the concept. Also, an innovative worldwide distributed computing project whose goal is to find intelligent life in the universe--SETI@Home--has captured the imaginations, and the desktop processing cycles, of millions of users.
Increasing desktop CPU power and communications bandwidth have also helped to make distributed computing a more practical idea. The number of real applications is still somewhat limited, and the challenges--particularly standardization--are still significant. But there's a new energy in the market, as well as some actual paying customers, so it's about time to take a look at where distributed processing fits and how it works.
Distributed vs Grid Computing
There are actually two similar trends moving in tandem--distributed computing and grid computing. Depending on how
you look at the market, the two either overlap, or distributed computing is a subset of grid computing. Grid Computing
got its name because it strives for an ideal scenario in which the CPU cycles and storage of millions of systems across
a worldwide network function as a flexible, readily accessible pool that could be harnessed by anyone who needs it,
similar to the way power companies and their users share the electrical grid.
Sun defines a computational grid as "a hardware and software infrastructure that provides dependable, consistent,
pervasive, and inexpensive access to computational capabilities." Grid computing can encompass desktop PCs, but
more often than not its focus is on more powerful workstations, servers, and even mainframes and supercomputers
working on problems involving huge datasets that can run for days. And grid computing leans more toward dedicated systems than toward systems primarily used for other tasks.
Large-scale distributed computing of the variety we are covering usually refers to a similar concept, but is more geared
to pooling the resources of hundreds or thousands of networked end-user PCs, which individually are more limited in
their memory and processing power, and whose primary purpose is not distributed computing, but rather serving their
user. As we mentioned above, there are various levels and types of distributed computing architectures, and neither grid nor distributed computing has to be implemented on a massive scale. They can be limited to the CPUs of a group of users, a department, several departments inside a corporate firewall, or a few trusted partners across the firewall.
How It Works
In most cases today, a distributed computing architecture consists of very lightweight software agents installed on a
number of client systems, and one or more dedicated distributed computing management servers. There may also be
requesting clients with software that allows them to submit jobs along with lists of their required resources.
An agent running on a processing client detects when the system is idle, notifies the management server that the
system is available for processing, and usually requests an application package. The client then receives an application
package from the server and runs the software when it has spare CPU cycles, and sends the results back to the server.
The application may run as a screen saver, or simply in the background, without impacting normal use of the computer.
If the user of the client system needs to run his own applications at any time, control is immediately returned, and
processing of the distributed application package ends. This must be essentially instantaneous, as any delay in
returning control will probably be unacceptable to the user.
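The control flow of such an agent might look roughly like the following Python sketch. It is purely illustrative--the idle threshold, the stubbed-out idle check, and the package format are assumptions, not any vendor's actual implementation, and a real agent would talk to the management server over the network rather than through these local stand-ins.

    import random
    import time

    IDLE_THRESHOLD_SECS = 60          # assumed policy: volunteer after a minute without user input

    def seconds_since_last_input():
        # Stand-in for an OS-specific idle query; a real agent hooks keyboard/mouse activity.
        return random.choice([5, 300])

    def fetch_work_package():
        # Stand-in for "tell the server we're idle and download an application package".
        return {"task_id": 42, "numbers": list(range(1_000_000))}

    def run_package(package):
        # The real package is vendor-supplied code; here we just burn some spare cycles.
        return sum(package["numbers"])

    def send_results(task_id, result):
        # Stand-in for uploading the finished results to the management server.
        print(f"task {task_id} -> {result}")

    def agent_loop(cycles=3):
        for _ in range(cycles):
            if seconds_since_last_input() < IDLE_THRESHOLD_SECS:
                time.sleep(1)         # the user is active: yield immediately, do no work
                continue
            package = fetch_work_package()
            send_results(package["task_id"], run_package(package))

    if __name__ == "__main__":
        agent_loop()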
[Figure: Client-server interactions]
Distributed Computing Management Server
The servers have several roles. They take distributed computing requests and divide their large processing tasks into smaller tasks that can run on individual desktop systems (though sometimes this is done by a requesting system). They send application packages and some client management software to the idle client machines that request them. They monitor the status of the jobs being run by the clients. And after the client machines run those packages, the servers assemble the results sent back by the clients and structure them for presentation, usually with the help of a database.
If the server doesn't hear from a processing client for a certain period of time, possibly because the user has
disconnected his system and gone on a business trip, or simply because he's using his system heavily for long periods,
it may send the same application package to another idle system. Alternatively, it may have already sent out the
package to several systems at once, assuming that one or more sets of results will be returned quickly. The server is
also likely to manage any security, policy, or other management functions as necessary, including handling dialup users
whose connections and IP addresses are inconsistent.
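A toy version of that server-side bookkeeping--job splitting, redundant dispatch, and result assembly--might look like the following Python sketch. The chunk size, redundancy factor, and class layout are assumptions made for illustration only; commercial servers layer security, policy, and database-backed result handling on top of this.

    from collections import defaultdict

    def split_job(data, chunk_size):
        # Divide one large job into smaller tasks that individual desktops can handle.
        return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    class ManagementServer:
        REDUNDANCY = 2                        # assumed: send each task to up to two clients

        def __init__(self, job_data, chunk_size=1000):
            self.tasks = dict(enumerate(split_job(job_data, chunk_size)))
            self.results = {}                 # task_id -> first result received
            self.assignments = defaultdict(list)

        def assign(self, client_id):
            # Hand an unfinished task to an idle client; a task already out to one client
            # can be re-issued so a disconnected laptop doesn't stall the whole job.
            for task_id, chunk in self.tasks.items():
                if task_id not in self.results and len(self.assignments[task_id]) < self.REDUNDANCY:
                    self.assignments[task_id].append(client_id)
                    return task_id, chunk
            return None

        def accept(self, task_id, result):
            self.results.setdefault(task_id, result)      # keep whichever answer arrives first

        def assemble(self):
            # Structure the returned pieces for presentation (a real server would use a database).
            return [self.results[t] for t in sorted(self.results)]

    server = ManagementServer(job_data=list(range(5000)))
    task_id, chunk = server.assign("desktop-17")
    server.accept(task_id, sum(chunk))        # the client computes and reports back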
Obviously the complexity of a distributed computing architecture increases with the size and type of environment. A
larger environment that includes multiple departments, partners, or participants across the Web requires complex
resource identification, policy management, authentication, encryption, and secure sandboxing functionality. Resource
identification is necessary to define the level of processing power, memory, and storage each system can contribute.
Policy management is used to varying degrees in different types of distributed computing environments. Administrators
or others with rights can define which jobs and users get access to which systems, and who gets priority in various
situations based on rank, deadlines, and the perceived importance of each project. Obviously, robust authentication, encryption, and sandboxing are necessary to prevent distributed tasks from gaining unauthorized access to systems and data that are meant to stay off-limits.
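As a simple illustration of what resource identification and policy management might track, consider the hypothetical records below. The field names and values are invented for this sketch and do not reflect any particular vendor's schema.

    # Hypothetical resource profile reported by a client during resource identification.
    resource_profile = {
        "host": "desktop-17",
        "cpu_mhz": 1400,
        "free_memory_mb": 256,
        "free_disk_gb": 10,
        "inside_firewall": True,
    }

    # Hypothetical policy set by an administrator: which projects may use this machine, and how.
    policy = {
        "allowed_projects": ["interest-rate-simulation", "drug-screening"],
        "priority": {"interest-rate-simulation": 1, "drug-screening": 2},   # lower number wins
        "business_hours_cpu_share": 0.25,     # leave most cycles for the desktop's own user
    }

    def eligible(resource, job):
        # Admit a job only if policy allows the project and the machine meets its minimums.
        return (job["project"] in policy["allowed_projects"]
                and resource["free_memory_mb"] >= job["min_memory_mb"])

    print(eligible(resource_profile, {"project": "drug-screening", "min_memory_mb": 128}))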
If you take the ideal of a distributed worldwide grid to the extreme, it requires standards and protocols for dynamic
discovery and interaction of resources in diverse network environments and among different distributed computing
architectures. Most distributed computing solutions also include toolkits, libraries, and APIs for porting third-party applications to work with their platform, or for creating distributed computing applications from scratch.
What About Peer-to-Peer Features?
Though distributed computing has recently been subsumed by the peer-to-peer craze, the structure described above is
not really one of peer-to-peer communication, as the clients don't necessarily talk to each other. Current vendors of
distributed computing solutions include Entropia, Data Synapse, Sun, Parabon, Avaki, and United Devices. Sun's open
source GridEngine platform is more geared to larger systems, while the others are focusing on PCs, with Data Synapse
somewhere in the middle. In the case of the SETI@home project, Entropia, and most other vendors, the structure is a
typical hub and spoke with the server at the hub. Data is delivered back to the server by each client as a batch job. In
the case of DataSynapse's LiveCluster, however, client PCs can work in parallel with other client PCs and share results
with each other in 20-millisecond bursts. The advantage of LiveCluster's architecture is that applications can be divided into tasks that have mutual dependencies and require interprocess communications, while those running on Entropia cannot. But while Entropia and other platforms can work very well across an Internet of modem-connected PCs, DataSynapse's LiveCluster makes more sense on a corporate network or among broadband users across the Net.
[Figure: DataSynapse's LiveCluster architecture]
The Poor Man's Supercomputer
The advantages of this type of architecture for the right kinds of applications are impressive. The most obvious is the
ability to provide access to supercomputer level processing power or better for a fraction of the cost of a typical
supercomputer. SETI@Home's Web site FAQ points out that the most powerful computer, IBM's ASCI White, is rated at 12 teraflops and costs $110 million, while SETI@home currently gets about 15 teraflops and has cost about $500K so far. Further savings come from the fact that distributed computing doesn't require all the pricey electrical power, environmental controls, and extra infrastructure that a supercomputer requires. And while supercomputing applications are written in specialized languages like mpC, distributed applications can be written in C, C++, etc.
The performance improvement over typical enterprise servers for appropriate applications can be phenomenal. In a
case study that Intel did of a commercial and retail banking organization running Data Synapse's LiveCluster platform,
computation time for a series of complex interest rate swap modeling tasks was reduced from 15 hours on a dedicated
cluster of four workstations to 30 minutes on a grid of around 100 desktop computers. Processing 200 trades on a
dedicated system took 44 minutes, but only 33 seconds on a grid of 100 PCs. According to the company using the
technology, the performance improvement running various simulations allowed them to react much more swiftly to
market fluctuations (but we have to wonder if that's a good thing…).
[Figure: Code distribution and process model]
[Figure: Performance gains]
Scalability is also a great advantage of distributed computing. Though they provide massive processing power, supercomputers are typically not very scalable once they're installed. A distributed computing installation is infinitely scalable--simply add more systems to the environment. In a corporate distributed computing setting, systems might be added within or beyond the corporate firewall.
A byproduct of distributed computing is more efficient use of existing system resources. Estimates by various analysts
have indicated that up to 90 percent of the CPU cycles on a company's client systems are not used. Even servers and
other systems spread across multiple departments are typically used inefficiently, with some applications starved for
server power while elsewhere in the organization server power is grossly underutilized. And server and workstation
obsolescence can be staved off considerably longer by allocating certain applications to a grid of client machines or
servers. This leads to the inevitable Total Cost of Ownership, Total Benefit of Ownership, and ROI discussions. As another byproduct, instead of throwing away obsolete desktop PCs and servers, an organization can dedicate them to distributed computing tasks.
[Table: Advantages of distributed computing]
Distributed Computing Application Characteristics
Obviously not all applications are suitable for distributed computing. The closer an application gets to running in real time, the less appropriate it is. Even processing tasks that normally take an hour or two may not derive much benefit if the communications among distributed systems and the constantly changing availability of processing clients become a bottleneck. Instead you should think in terms of tasks that take hours, days, weeks, and months. Generally the most appropriate applications, according to Entropia, consist of "loosely coupled, non-sequential tasks in batch processes with a high compute-to-data ratio." The high compute-to-data ratio goes hand-in-hand with a high compute-to-communications ratio, as you don't want to bog down the network by sending large amounts of data to each client, though in some cases you can do so during off hours. Programs with large databases that can be easily parsed for distribution are very appropriate.
Clearly, any application with individual tasks that need access to huge data sets will be more appropriate for larger
systems than individual PCs. If terabytes of data are involved, a supercomputer makes sense as communications can
take place across the system's very high speed backplane without bogging down the network. Server and other
dedicated system clusters will be more appropriate for other slightly less data intensive applications. For a distributed
application using numerous PCs, the required data should fit very comfortably in the PC's memory, with lots of room to
spare.
Taking this further, United Devices recommends that the application should have the capability to fully exploit "coarse-grained parallelism," meaning it should be possible to partition the application into independent tasks or processes that can be computed concurrently. For most solutions there should not be any need for communication between the tasks except at task boundaries, though Data Synapse allows some interprocess communications. The tasks and small blocks of data should be such that they can be processed effectively on a modern PC and report results that, when combined with other PCs' results, produce coherent output. And the individual tasks should be small enough to produce a result on these systems within a few hours to a few days.
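That flavor of coarse-grained parallelism can be sketched with Python's standard multiprocessing module: the work is split into independent ranges with a high compute-to-data ratio, the tasks never communicate until their results are combined, and each range could just as easily be shipped to a separate desktop. The prime-counting workload is only a stand-in for a real scientific or financial kernel.

    from multiprocessing import Pool

    def count_primes(bounds):
        # One self-contained task: lots of computation, almost no input or output data.
        lo, hi = bounds
        def is_prime(n):
            return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))
        return sum(1 for n in range(lo, hi) if is_prime(n))

    if __name__ == "__main__":
        # Partition the problem into independent ranges; no task depends on another.
        tasks = [(i, i + 100_000) for i in range(2, 1_000_002, 100_000)]
        with Pool() as pool:
            partial_counts = pool.map(count_primes, tasks)
        print(sum(partial_counts))        # combining the results is a trivial final step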
Types of Distributed Computing Applications
Beyond the very popular poster child SETI@Home application, the following scenarios are examples of other types of
application tasks that can be set up to take advantage of distributed computing.

A query search against a huge database that can be split across lots of desktops, with the submitted query
running concurrently against each fragment on each desktop.

Complex modeling and simulation techniques that increase the accuracy of results by increasing the number of random trials would also be appropriate, as trials could be run concurrently on many desktops and combined to achieve greater statistical significance (this is a common method used in various types of financial risk analysis; see the sketch after this list).

Exhaustive search techniques that require searching through a huge number of results to find solutions to a
problem also make sense. Drug screening is a prime example.

Many of today's vendors, particularly Entropia and United Devices, are aiming squarely at the life sciences
market, which has a sudden need for massive computing power. As a result of sequencing the human genome,
the number of identifiable biological targets for today's drugs is expected to increase from about 500 to about
10,000. Pharmaceutical firms have repositories of millions of different molecules and compounds, some of
which may have characteristics that make them appropriate for inhibiting newly found proteins. The process of
matching all these "ligands" to their appropriate targets is an ideal task for distributed computing, and the
quicker it's done, the quicker and greater the benefits will be. Another related application is the recent trend of
generating new types of drugs solely on computers.

Complex financial modeling, weather forecasting, and geophysical exploration are on the radar screens of these vendors, as are car crash and other complex simulations.
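As a concrete sketch of the simulation scenario above, the snippet below runs batches of independent random trials (a toy Monte Carlo estimate of pi standing in for a financial risk model) and then combines the per-batch estimates. In a real deployment each seed would be a separate work package dispatched to a separate PC; the batch size and seed scheme here are assumptions.

    import random
    import statistics

    def run_trials(seed, n_trials=100_000):
        # One desktop's batch of independent random trials.
        rng = random.Random(seed)
        hits = sum(rng.random() ** 2 + rng.random() ** 2 <= 1.0 for _ in range(n_trials))
        return 4.0 * hits / n_trials

    # Combining many machines' estimates narrows the confidence interval.
    estimates = [run_trials(seed) for seed in range(20)]
    print(statistics.mean(estimates), statistics.stdev(estimates))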
To enhance their public relations efforts and demonstrate the effectiveness of their platforms, most of the distributed
computing vendors have set up philanthropic computing projects that recruit CPU cycles across the Internet. Parabon's
Compute-Against-Cancer harnesses an army of systems to track patient responses to chemotherapy, while Entropia's
FightAidsAtHome project evaluates prospective targets for drug discovery. And of course, the SETI@home project has
attracted millions of PCs to work on analyzing data from the Arecibo radio telescope for signatures that indicate
extraterrestrial intelligence. There are also higher-end grid projects, including those run by the US National Science Foundation and NASA, as well as the European Data Grid, the Particle Physics Data Grid, the Network for Earthquake Engineering Simulation Grid, and the Grid Physics Network, all of which plan to aid their research communities. And IBM has announced that it will help to create a life sciences grid in North Carolina to be used for genomic research.
Porting Applications
The major distributed computing platforms generally have two methods of porting applications, depending on the level
of integration needed by the user, and whether the user has access to the source code of the application that needs to
be distributed. Most of the vendors have software development kits (SDKs) that can be used to wrap existing applications with their platform without cracking the existing .exe file. The only other task is determining the complexity of pre- and post-processing functions. Entropia in particular boasts that it offers "binary integration," which can integrate applications into the platform without the user having to access the source code.
Other vendors, including Data Synapse and United Devices, offer APIs of varying complexity that require access to the source code but provide tight integration and give the application access to all the security, management, and other features of the platforms. Most of these vendors offer several libraries of proven distributed computing paradigms. Data Synapse comes with C++ and Java software developer kit support. United Devices uses a POSIX-compliant C/C++ API. Integrating the application can take anywhere from half a day to months depending on how much optimization is needed. Some vendors also allow access to their own in-house grids for testing by application developers.
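The SDK-wrapping approach can be pictured roughly as follows: the platform's agent calls a thin wrapper that handles the pre- and post-processing around an unmodified executable. Everything here--the executable name, the file-based interface, and the helper names--is hypothetical, meant only to illustrate the idea of integrating an application without touching its source code.

    import subprocess
    import tempfile
    from pathlib import Path

    def preprocess(chunk, workdir):
        # Write this task's slice of input data where the unmodified executable expects it.
        infile = Path(workdir) / "input.dat"
        infile.write_text("\n".join(map(str, chunk)))
        return infile

    def postprocess(outfile):
        # Parse whatever the executable produced into a result the server can assemble.
        return outfile.read_text().strip()

    def run_wrapped_task(chunk, exe_path="legacy_app.exe"):   # placeholder executable name
        with tempfile.TemporaryDirectory() as workdir:
            infile = preprocess(chunk, workdir)
            outfile = Path(workdir) / "output.dat"
            subprocess.run([exe_path, str(infile), str(outfile)], check=True)
            return postprocess(outfile)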
[Figure: LiveCluster API]
Security and Standards Challenges
The major challenges come with increasing scale. As soon as you move outside of a corporate firewall, security and
standardization challenges become quite significant. Most of today's vendors currently specialize in applications that
stop at the corporate firewall, though Avaki, in particular, is staking out the global grid territory. Beyond spanning firewalls with a single platform lies the challenge of spanning multiple firewalls and platforms, which means standards.
Most of the current platforms offer high level encryption such as Triple DES. The application packages that are sent to
PCs are digitally signed, to make sure a rogue application does not infiltrate a system. Avaki comes with its own PKI
(public key infrastructure), for example. Identical application packages are typically sent to multiple PCs and the results
of each are compared. Any set of results that differs from the rest becomes a security suspect. Even with encryption, data can still be snooped while a process is running in the client's memory, so most platforms create application data chunks that are small enough that snooping them is unlikely to provide useful information. Avaki claims that it integrates easily with different existing security infrastructures and can facilitate the communications among them, but this is obviously a challenge for global distributed computing.
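The redundant-results check described above amounts to a simple vote: accept the answer that most clients agree on and flag the outliers for review. The sketch below is a toy version of that idea, not any vendor's actual verification scheme, and the client names and result digests are made up.

    from collections import Counter

    def vote_on_results(results_by_client):
        # Accept the result the majority of clients returned; everything else is suspect.
        counts = Counter(results_by_client.values())
        accepted, _ = counts.most_common(1)[0]
        suspects = [client for client, result in results_by_client.items() if result != accepted]
        return accepted, suspects

    accepted, suspects = vote_on_results(
        {"pc-01": "3f9a", "pc-02": "3f9a", "pc-03": "deadbeef"}   # hypothetical result digests
    )
    print(accepted, suspects)             # -> 3f9a ['pc-03']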
Working out standards for communications among platforms is part of the typical chaos that occurs early in any
relatively new technology. In the generalized peer-to-peer realm lies the Peer-to-Peer Working Group, started by Intel,
which is looking to devise standards for communications among many different types of peer-to-peer platforms,
including those that are used for edge services and collaboration.
The Global Grid Forum is a collection of about 200 companies looking to devise grid computing standards. Then you
have vendor-specific efforts such as Sun's Open Source JXTA platform, which provides a collection of protocols and
services that allows peers to advertise themselves to and communicate with each other securely. JXTA has a lot in common with JINI, but is not Java specific (though the first version is Java based).
Intel recently released its own peer-to-peer middleware, the Intel Peer-to-Peer Accelerator Kit for Microsoft .NET, also designed for discovery, and based on the Microsoft .NET platform.
For Grid projects there's the Globus Toolkit (www.globus.org), developed by a group at the Argonne National Laboratory and another team at the University of Southern California's Information Science Institute. It bills itself as an open source infrastructure that provides most of the security, resource discovery, data access, and management services needed to construct Grid applications. A large number of today's grid applications are based on the Globus Toolkit. IBM is offering a version of the Globus Toolkit for its servers running on Linux and AIX. And Entropia has announced that it intends to integrate its platform with Globus, as an early start towards communications among platforms and applications.
For today, however, the specific promise of distributed computing lies mostly in harnessing the system resources that
lie within the firewall. It will take years before the systems on the Net share compute resources as effortlessly as they share information.
Finally, we realize that distributed computing technologies and research cover a vast amount of territory, and we've
only touched the surface so far. Please join our forums and discuss some of your experiences with distributed
computing, and feel free to make suggestions of technologies you'd like to see us explore in more detail.
Companies and Organizations to Watch
Avaki Corporation
Cambridge, MA Corporate Headquarters
One Memorial Drive
Cambridge, MA 02142
617-374-2500
www.Avaki.com
Makes Avaki 2.0, grid computing software for mixed platform environments and global grid configurations. Includes a
PKI based security infrastructure for grids spanning multiple companies, locations, and domains.
The DataGrid
Dissemination Office:
CNR-CED
Piazzale Aldo Moro 7
00145 Roma (Italy)
+39 06 49933205
www.eu-datagrid.org
A project funded by the European Union and led by CERN and five other partners whose goal is to set up computing
grids that can analyse data from scientific exploration across the continent. The project hopes to develop scalable
software solutions and testbeds that can handle thousands of users and tens of thousands of grid connected systems
from multiple research institutions.
DataSynapse Inc.
632 Broadway
5th Floor
New York, NY 10012-2614
212-842-8842
www.datasynapse.com
Makes LiveCluster, distributed computing software middleware aimed at the financial services and energy markets.
Currently mostly for use inside the firewall. Includes the ability for interprocess communications among distributed
application packages.
Distributed.Net
www.distributed.net
Founded in 1997, Distributed.Net was one of the first non-profit distributed computing organizations and the first to
create a distributed computing network on the Internet. Distributed.net was highly successful in using distributed
computing to take on cryptographic challenges sponsored by RSA Labs and CS Communication & Systems.
Entropia, Inc.
10145 Pacific Heights Blvd., Suite 800
San Diego, CA 92121 USA
858-623-2840
www.entropia.com
Makes the Entropia distributed computing platform aimed at the life sciences market. Currently mostly for use inside the
firewall. Boasts binary integration, which lets you integrate your applications using any language without having to
access the application's source code. Recently integrated its software with The Globus Toolkit.
Global Grid Forum
www.gridforum.org
A standards organization composed of over 200 companies working to devise and promote standards, best practices,
and integrated platforms for grid computing.
The Globus Project
www.globus.org
A research and development project consisting of members of the Argonne National Laboratory, the University of
Southern California's Information Science Institute, NASA, and others focused on enabling the application of Grid
concepts to scientific and engineering computing. The team has produced the Globus Toolkit, an open source set of
middleware services and software libraries for constructing grids and grid applications. The ToolKit includes software
for security, information infrastructure, resource management, data management, communication, fault detection, and
portability.
Grid Physics Network
www.griphyn.org
The Grid Physics Network (GriPhyN) is a team of experimental physicists and IT researchers from the University of
Florida, University of Chicago, Argonne National Laboratory and about a dozen other research centers working to
implement the first worldwide Petabyte-scale computational and data grid for physics and other scientific research. The
project is funded by the National Science Foundation.
IBM
International Business Machines Corporation
New Orchard Road
Armonk, NY 10504.
914-499-1900
IBM is heavily involved in setting up over 50 computational grids across the planet using IBM infrastructure for cancer
research and other initiatives. IBM was selected in August by a consortium of four U.S. research centers to help create
the "world's most powerful grid," which when completed in 2003 will supposedly be capable of processing 13.6 trillion
calculations per second. IBM also markets the IBM Globus ToolKit, a version of the ToolKit for its servers running AIX
and Linux.
Intel Corporation
2200 Mission College Blvd.
Santa Clara, California 95052-8119
408-765-8080
www.intel.com
Intel is the principal founder of the Peer-To-Peer Working Group and recently announced the Peer-to-Peer Accelerator
Kit for Microsoft.NET, middleware based on the Microsoft.NET platform that provides building blocks for the
development of peer-to-peer applications and includes support for location independence, encryption and availability.
The technology, source code, demo applications, and documentation will be made available on Microsoft's gotdotnet
(www.gotdotnet.com) website. The download will be free. The target release date is early December. Also partners with
United Devices on the Intel-United Devices Cancer Research Project, which enlists Internet users in a distributed
computing grid for cancer research.
NASA Advanced SuperComputing Division (NAS)
NAS Systems Division Office
NASA Ames Research Center
Moffett Field, CA 94035
650-604-4502
www.nas.nasa.gov
NASA's NAS Division is leading a joint effort among leaders within government, academia, and industry to build and
test NASA's Information Power Grid (IPG), a grid of high performance computers, data storage devices, scientific
instruments, and advanced user interfaces that will help NASA scientists collaborate with these other institutions to
"solve important problems facing the world in the 21st century."
Network for Earthquake Engineering Simulation Grid (NEESgrid)
www.neesgrid.org/
In August 2001, the National Science Foundation awarded $10 million to a consortium of institutions led by the National
Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign to build the
NEESgrid, which will link earthquake engineering research sites across the country in a national grid, provide data
storage facilities and repositories, and offer remote access to research tools.
Parabon Computation
3930 Walnut Street, Suite 100
Fairfax, VA 22030-4738
703-460-4100
www.parabon.com
Makes Frontier server software and Pioneer client software, a distributed computing platform that supposedly can span
enterprises or the Internet. Also runs Compute Against Cancer, a distributed computing grid for non-profit cancer organizations.
Particle Physics Data Grid (PPDG)
www.ppdg.net
A collaboration of the Argonne National Laboratory, Brookhaven National Laboratory, Caltech, and others to develop,
acquire and deliver the tools for a national computing grid for current and future high-energy and nuclear physics
experiments.
Peer-to-Peer Working Group
5440 SW Westgate Drive, Suite 217
Portland, OR 97221
503-291-2572
www.peer-to-peerwg.org
A standards group founded by Intel and composed of over 30 companies with the goal of developing best practices that
enable interoperability among peer-to-peer applications.
Platform Computing
3760 14th Ave
Markham, Ontario L3R 3T7
Canada
905-948-8448
www.platform.com
Makes a number of enterprise distributed and grid computing products, including Platform LSF Active Cluster for
distributed computing across Windows desktops, Platform LSF for distributed computing across mixed environments of
UNIX, Linux, Macintosh and Windows servers, desktops, supercomputers, and clusters. Also offers a number of
products for distributed computing management and analysis, and its own commercial distribution of the Globus Toolkit.
Targets computer and industrial manufacturing, life sciences, government, and financial services markets.
SETI@Home
http://setiathome.ssl.berkeley.edu/
A worldwide distributed computing grid based at the University of California at Berkeley that allows users connected to
the Internet to donate their PC's spare CPU cycles to the exploration of extraterrestrial life in the universe. Its task is to
sort through the 1.4 billion potential signals picked up by the Arecibo telescope to find signals that repeat. Users receive
approximately 350K of data at a time, and the client software runs as a screensaver.
Sun Microsystems Inc.
901 San Antonio Road
Palo Alto, CA 94303
USA
650-960-1300
www.sun.com
Sun is involved in several grid and peer-to-peer products and initiatives including its open source Grid Engine platform
for setting up departmental and campus computing grids (with the eventual goal of a global grid platform) and its JXTA
(short for juxtapose) set of protocols and building blocks for developing peer-to-peer applications.
United Devices, Inc.
12675 Research, Bldg A
Austin, Texas 78759
512-331-6016
www.ud.com
Makes the MetaProcessor distributed computing platform aimed at life sciences, geosciences, and industrial design
and engineering markets and currently focused inside the firewall. Also partners with Intel on the Intel-United Devices
Cancer Research Project, which enlists Internet users in a distributed computing grid for cancer research.
Copyright (c) 2003 Ziff Davis Media Inc. All Rights Reserved.