Distributed Computing: An Introduction
By Leon Erlanger

You can define distributed computing many different ways. Various vendors have created and marketed distributed computing systems for years, and have developed numerous initiatives and architectures to permit distributed processing of data and objects across a network of connected systems.

One flavor of distributed computing has received a lot of attention lately, and it will be a primary focus of this story: an environment where you can harness the idle CPU cycles and storage space of tens, hundreds, or thousands of networked systems to work together on a particularly processing-intensive problem. The growth of such processing models has been limited, however, by a lack of compelling applications, by bandwidth bottlenecks, and by significant security, management, and standardization challenges. But the last year has seen new interest in the idea as the technology has ridden the coattails of the peer-to-peer craze started by Napster. A number of new vendors have appeared to take advantage of the nascent market, and heavy hitters like Intel, Microsoft, Sun, and Compaq have validated the importance of the concept. In addition, an innovative worldwide distributed computing project whose goal is to find intelligent life in the universe--SETI@home--has captured the imaginations, and the spare processing cycles, of millions of users. Increasing desktop CPU power and communications bandwidth have also helped to make distributed computing a more practical idea.

The number of real applications is still somewhat limited, and the challenges--particularly standardization--are still significant. But there's a new energy in the market, as well as some actual paying customers, so it's about time to take a look at where distributed processing fits and how it works.

Distributed vs. Grid Computing

There are actually two similar trends moving in tandem: distributed computing and grid computing. Depending on how you look at the market, the two either overlap, or distributed computing is a subset of grid computing. Grid computing got its name because it strives for an ideal scenario in which the CPU cycles and storage of millions of systems across a worldwide network function as a flexible, readily accessible pool that can be harnessed by anyone who needs it, similar to the way power companies and their users share the electrical grid. Sun defines a computational grid as "a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to computational capabilities." Grid computing can encompass desktop PCs, but more often than not its focus is on more powerful workstations, servers, and even mainframes and supercomputers working on problems involving huge datasets that can run for days. Grid computing also leans more toward dedicated systems than toward systems primarily used for other tasks.

Large-scale distributed computing of the variety we are covering here refers to a similar concept, but it is geared to pooling the resources of hundreds or thousands of networked end-user PCs, which individually are more limited in their memory and processing power, and whose primary purpose is not distributed computing but rather serving their users. As we mentioned above, there are various levels and types of distributed computing architectures, and neither grid nor distributed computing has to be implemented on a massive scale.
They can be limited to the CPUs of a group of users, a department, several departments inside a corporate firewall, or a few trusted partners across the firewall.

How It Works

In most cases today, a distributed computing architecture consists of very lightweight software agents installed on a number of client systems, and one or more dedicated distributed computing management servers. There may also be requesting clients with software that allows them to submit jobs along with lists of their required resources.

An agent running on a processing client detects when the system is idle, notifies the management server that the system is available for processing, and usually requests an application package. The client then receives an application package from the server, runs the software when it has spare CPU cycles, and sends the results back to the server. The application may run as a screen saver, or simply in the background, without impacting normal use of the computer. If the user of the client system needs to run his own applications at any time, control is immediately returned and processing of the distributed application package ends. This must be essentially instantaneous, as any delay in returning control will probably be unacceptable to the user.

[Figure: Client-Server Interactions]

Distributed Computing Management Server

The servers have several roles. They take distributed computing requests and divide their large processing tasks into smaller tasks that can run on individual desktop systems (though sometimes this is done by a requesting system). They send application packages and some client management software to the idle client machines that request them. They monitor the status of the jobs being run by the clients. And after the client machines run those packages, they assemble the results sent back by the clients and structure them for presentation, usually with the help of a database.

If the server doesn't hear from a processing client for a certain period of time, possibly because the user has disconnected his system and gone on a business trip, or simply because he's using his system heavily for long periods, it may send the same application package to another idle system. Alternatively, it may have already sent out the package to several systems at once, assuming that one or more sets of results will be returned quickly. The server is also likely to manage any security, policy, or other management functions as necessary, including handling dialup users whose connections and IP addresses are inconsistent.

Obviously, the complexity of a distributed computing architecture increases with the size and type of environment. A larger environment that includes multiple departments, partners, or participants across the Web requires complex resource identification, policy management, authentication, encryption, and secure sandboxing functionality. Resource identification is necessary to define the level of processing power, memory, and storage each system can contribute. Policy management is used to varying degrees in different types of distributed computing environments; administrators or others with rights can define which jobs and users get access to which systems, and who gets priority in various situations based on rank, deadlines, and the perceived importance of each project. And robust authentication, encryption, and sandboxing are necessary to prevent unauthorized access to systems and data that are meant to be off-limits.
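To pull together the agent/server interaction described above, here is a minimal sketch in Python. It illustrates only the general pattern, not any vendor's actual software; the names (ManagementServer, agent_loop, and so on), the timeout policy, and the in-memory queue are all assumptions of the sketch.

import time
from collections import deque

class ManagementServer:
    """Toy management server: splits a job into small work units, hands them
    to idle clients, and reassembles the results. All names are illustrative."""

    def __init__(self, job_data, unit_size, timeout=60.0):
        # Divide the large task into units small enough for a desktop PC.
        chunks = (job_data[i:i + unit_size]
                  for i in range(0, len(job_data), unit_size))
        self.pending = deque(enumerate(chunks))   # (unit_id, unit)
        self.in_flight = {}                       # unit_id -> (unit, time sent)
        self.results = {}                         # unit_id -> result
        self.timeout = timeout

    def request_work(self):
        """Called by an agent when its host machine goes idle."""
        self._reclaim_stale_units()
        if not self.pending:
            return None
        unit_id, unit = self.pending.popleft()
        self.in_flight[unit_id] = (unit, time.time())
        return unit_id, unit

    def submit_result(self, unit_id, result):
        """Called by an agent when a unit finishes; if the same unit was also
        sent to a second client, the first answer to arrive wins."""
        self.in_flight.pop(unit_id, None)
        self.results.setdefault(unit_id, result)

    def _reclaim_stale_units(self):
        # If a client disappears (laptop unplugged, user busy for hours),
        # put its unit back in the queue for another idle system.
        now = time.time()
        for unit_id, (unit, sent) in list(self.in_flight.items()):
            if now - sent > self.timeout:
                del self.in_flight[unit_id]
                self.pending.append((unit_id, unit))


def agent_loop(server, is_idle, compute):
    """Toy client agent: when the host is idle, fetch a unit, crunch it, and
    send the result back; yield whenever the user needs the machine."""
    while True:
        if not is_idle():
            time.sleep(1)              # give the machine back to its user
            continue
        work = server.request_work()
        if work is None:
            break                      # nothing left to do
        unit_id, unit = work
        server.submit_result(unit_id, compute(unit))


# Example: sum a large list in 100-element units using one simulated agent.
server = ManagementServer(job_data=list(range(1000)), unit_size=100, timeout=5)
agent_loop(server, is_idle=lambda: True, compute=sum)
print(sum(server.results.values()))    # 499500, same as summing directly

In a real deployment the queue, the results, and the application packages would of course live behind a database and a network protocol rather than in a single process, but the division of labor between agent and management server is the same.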
If you take the ideal of a distributed worldwide grid to the extreme, it requires standards and protocols for dynamic discovery and interaction of resources in diverse network environments and among different distributed computing architectures. Most distributed computing solutions also include toolkits, libraries, and APIs for porting third-party applications to work with their platform, or for creating distributed computing applications from scratch.

What About Peer-to-Peer Features?

Though distributed computing has recently been subsumed by the peer-to-peer craze, the structure described above is not really one of peer-to-peer communication, as the clients don't necessarily talk to each other. Current vendors of distributed computing solutions include Entropia, DataSynapse, Sun, Parabon, Avaki, and United Devices. Sun's open source Grid Engine platform is more geared to larger systems, while the others are focusing on PCs, with DataSynapse somewhere in the middle. In the case of the SETI@home project, Entropia, and most other vendors, the structure is a typical hub and spoke with the server at the hub, and data is delivered back to the server by each client as a batch job. In the case of DataSynapse's LiveCluster, however, client PCs can work in parallel with other client PCs and share results with each other in 20-millisecond bursts. The advantage of LiveCluster's architecture is that applications can be divided into tasks that have mutual dependencies and require interprocess communication, while applications running on Entropia cannot. But while Entropia and other platforms can work very well across an Internet of modem-connected PCs, DataSynapse's LiveCluster makes more sense on a corporate network or among broadband users across the Net.

[Figure: DataSynapse's LiveCluster]

The Poor Man's Supercomputer

The advantages of this type of architecture for the right kinds of applications are impressive. The most obvious is the ability to provide access to supercomputer-level processing power or better for a fraction of the cost of a typical supercomputer. SETI@home's Web site FAQ points out that the most powerful computer, IBM's ASCI White, is rated at 12 TeraFLOPS and costs $110 million, while SETI@home currently gets about 15 TeraFLOPS and has cost about $500K so far. Further savings come from the fact that distributed computing doesn't require all the pricey electrical power, environmental controls, and extra infrastructure that a supercomputer requires. And while supercomputing applications are written in specialized languages like mpC, distributed applications can be written in C, C++, and other mainstream languages.

The performance improvement over typical enterprise servers for appropriate applications can be phenomenal. In a case study that Intel did of a commercial and retail banking organization running DataSynapse's LiveCluster platform, computation time for a series of complex interest rate swap modeling tasks was reduced from 15 hours on a dedicated cluster of four workstations to 30 minutes on a grid of around 100 desktop computers. Processing 200 trades took 44 minutes on a dedicated system but only 33 seconds on a grid of 100 PCs. According to the company using the technology, the performance improvement running various simulations allowed it to react much more swiftly to market fluctuations (though we have to wonder if that's a good thing...).

[Figure: Code Distribution and Process Model]
[Figure: Performance Gains Graph]
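The banking workload above is a classic example of this kind of parallelism: each trade can be valued independently, with very little input data per task, so the work spreads naturally across many desktops. The Python sketch below is a rough illustration of that idea, using a local process pool as a stand-in for a grid of client PCs; the pricing function is a made-up toy, not the bank's actual model or DataSynapse's API.

from multiprocessing import Pool
import random

def value_swap(seed):
    """Toy stand-in for a compute-heavy interest rate swap valuation:
    a small Monte Carlo average over simulated rate paths."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(50_000):                 # many paths, almost no input data
        rate = 0.05 + rng.gauss(0, 0.01)    # crude simulated rate
        total += max(rate - 0.045, 0.0)     # payoff of a toy contract
    return total / 50_000

if __name__ == "__main__":
    trades = range(200)                     # 200 independent trades, as in the case study
    with Pool() as pool:                    # local CPUs stand in for grid clients
        values = pool.map(value_swap, trades)
    print(f"portfolio value (toy units): {sum(values):.4f}")

On a real platform the pool.map call would be replaced by work units handed out by the management server, but the shape of the problem, many independent and compute-heavy tasks, is what makes the dramatic speedups possible.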
Scalability is also a great advantage of distributed computing. Though they provide massive processing power, supercomputers are typically not very scalable once they're installed. A distributed computing installation, by contrast, keeps scaling: simply add more systems to the environment. In a corporate distributed computing setting, systems might be added within or beyond the corporate firewall.

A byproduct of distributed computing is more efficient use of existing system resources. Estimates by various analysts have indicated that up to 90 percent of the CPU cycles on a company's client systems go unused. Even servers and other systems spread across multiple departments are typically used inefficiently, with some applications starved for server power while elsewhere in the organization server power is grossly underutilized. And server and workstation obsolescence can be staved off considerably longer by allocating certain applications to a grid of client machines or servers. This leads to the inevitable Total Cost of Ownership, Total Benefit of Ownership, and ROI discussions. Another byproduct: instead of throwing away obsolete desktop PCs and servers, an organization can dedicate them to distributed computing tasks.

[Table: Advantages of Distributed Computing]

Distributed Computing Application Characteristics

Obviously, not all applications are suitable for distributed computing. The closer an application gets to running in real time, the less appropriate it is. Even processing tasks that normally take an hour or two may not derive much benefit if the communications among distributed systems and the constantly changing availability of processing clients become bottlenecks. Instead, you should think in terms of tasks that take hours, days, weeks, and months. Generally the most appropriate applications, according to Entropia, consist of "loosely coupled, non-sequential tasks in batch processes with a high compute-to-data ratio." A high compute-to-data ratio goes hand in hand with a high compute-to-communications ratio, as you don't want to bog down the network by sending large amounts of data to each client, though in some cases you can do so during off hours. Programs with large databases that can be easily partitioned for distribution are very appropriate.

Clearly, any application with individual tasks that need access to huge data sets will be more appropriate for larger systems than for individual PCs. If terabytes of data are involved, a supercomputer makes sense, as communications can take place across the system's very high speed backplane without bogging down the network. Server and other dedicated system clusters will be more appropriate for other, slightly less data-intensive applications. For a distributed application using numerous PCs, the required data should fit very comfortably in the PC's memory, with lots of room to spare.

Taking this further, United Devices recommends that the application should have the capability to fully exploit "coarse-grained parallelism," meaning it should be possible to partition the application into independent tasks or processes that can be computed concurrently. For most solutions there should not be any need for communication between the tasks except at task boundaries, though DataSynapse allows some interprocess communication. The tasks and small blocks of data should be such that they can be processed effectively on a modern PC and report results that, when combined with other PCs' results, produce coherent output. And the individual tasks should be small enough to produce a result on these systems within a few hours to a few days.
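One quick way to reason about the compute-to-data ratio is to compare the CPU time a work unit needs with the time it takes to ship that unit's data to a client. The helper below is just a back-of-the-envelope rule of thumb of our own, not a formula from any of the vendors, and the threshold is arbitrary.

def worth_distributing(cpu_seconds_per_unit, bytes_per_unit,
                       link_bytes_per_second, min_ratio=10.0):
    """Rough rule of thumb (ours, not a vendor's): a work unit is a good
    grid candidate when the time spent computing it dwarfs the time spent
    shipping its data to the client."""
    transfer_seconds = bytes_per_unit / link_bytes_per_second
    return cpu_seconds_per_unit / transfer_seconds >= min_ratio

# A unit that crunches for 10 minutes on 1 MB over a 10 Mbps LAN: fine.
print(worth_distributing(600, 1_000_000, 10_000_000 / 8))      # True
# A unit that needs 1 GB of data for the same 10 minutes of CPU: the network loses.
print(worth_distributing(600, 1_000_000_000, 10_000_000 / 8))  # False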
Types of Distributed Computing Applications

Beyond the very popular poster child, the SETI@home application, the following scenarios are examples of other types of application tasks that can be set up to take advantage of distributed computing. One is a query search against a huge database that can be split across lots of desktops, with the submitted query running concurrently against each fragment on each desktop. Complex modeling and simulation techniques that increase the accuracy of results by increasing the number of random trials are also appropriate, as trials can be run concurrently on many desktops and combined to achieve greater statistical significance (a common method in various types of financial risk analysis). Exhaustive search techniques that require evaluating a huge number of candidates to find solutions to a problem also make sense.

Drug screening is a prime example. Many of today's vendors, particularly Entropia and United Devices, are aiming squarely at the life sciences market, which has a sudden need for massive computing power. As a result of the sequencing of the human genome, the number of identifiable biological targets for today's drugs is expected to increase from about 500 to about 10,000. Pharmaceutical firms have repositories of millions of different molecules and compounds, some of which may have characteristics that make them appropriate for inhibiting newly found proteins. The process of matching all these "ligands" to their appropriate targets is an ideal task for distributed computing, and the quicker it's done, the quicker and greater the benefits will be. Another related application is the recent trend of designing new types of drugs entirely on computers. Complex financial modeling, weather forecasting, and geophysical exploration are on the radar screens of these vendors as well, along with car crash and other complex simulations.

To enhance their public relations efforts and demonstrate the effectiveness of their platforms, most of the distributed computing vendors have set up philanthropic computing projects that recruit CPU cycles across the Internet. Parabon's Compute Against Cancer harnesses an army of systems to track patient responses to chemotherapy, while Entropia's FightAIDS@Home project evaluates prospective targets for drug discovery. And of course, the SETI@home project has attracted millions of PCs to work on analyzing data from the Arecibo radio telescope for signatures that indicate extraterrestrial intelligence. There are also higher-end grid projects, including those run by the US National Science Foundation and NASA, as well as the European DataGrid, the Particle Physics Data Grid, the Network for Earthquake Engineering Simulation Grid, and the Grid Physics Network, that plan to aid their research communities. And IBM has announced that it will help to create a life sciences grid in North Carolina to be used for genomic research.

Porting Applications

The major distributed computing platforms generally have two methods of porting applications, depending on the level of integration needed by the user and whether the user has access to the source code of the application that needs to be distributed. Most of the vendors have software development kits (SDKs) that can be used to wrap existing applications with their platform without cracking the existing .exe file; the only other task is determining the complexity of the pre- and post-processing functions, as the sketch below suggests.
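For illustration only, here is what such a wrapper might look like in Python. The executable name, the command line, and the function signature are hypothetical; each vendor's SDK has its own interfaces, and this sketch only shows the general shape of pre-processing, invoking an untouched binary, and post-processing.

import subprocess
from pathlib import Path

def run_work_unit(unit_file: Path, exe: str = "./legacy_model.exe") -> bytes:
    """Hypothetical wrapper around an unmodified executable: the grid agent
    calls this with a work-unit file, and the wrapper handles the pre- and
    post-processing around the untouched .exe."""
    # Pre-processing: turn the work unit into the input format the
    # existing program already expects (here, just a plain file path).
    input_path = unit_file

    # Run the unmodified binary; no source code or recompilation needed.
    completed = subprocess.run([exe, str(input_path)],
                               capture_output=True, check=True)

    # Post-processing: package stdout as the result blob that gets sent
    # back to the management server.
    return completed.stdout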
Entropia in particular boasts that it offers "binary integration," which can integrate applications into the platform without the user having to access the source code. Other vendors, including DataSynapse and United Devices, offer APIs of varying complexity that require access to the source code but provide tight integration and give the application access to all the security, management, and other features of the platforms. Most of these vendors offer several libraries of proven distributed computing paradigms. DataSynapse provides C++ and Java software development kit support, while United Devices uses a POSIX-compliant C/C++ API. Integrating an application can take anywhere from half a day to months, depending on how much optimization is needed. Some vendors also allow application developers access to their own in-house grids for testing.

[Figure: LiveCluster API]

Security and Standards Challenges

The major challenges come with increasing scale. As soon as you move outside of a corporate firewall, security and standardization challenges become quite significant. Most of today's vendors currently specialize in applications that stop at the corporate firewall, though Avaki, in particular, is staking out the global grid territory. Beyond spanning firewalls with a single platform lies the challenge of spanning multiple firewalls and platforms, which means standards.

Most of the current platforms offer high-level encryption such as Triple DES. The application packages that are sent to PCs are digitally signed to make sure a rogue application does not infiltrate a system; Avaki, for example, comes with its own PKI (public key infrastructure). Identical application packages are typically sent to multiple PCs and the results of each are compared, and any set of results that differs from the rest becomes suspect. Even with encryption, data can still be snooped while the process is running in the client's memory, so most platforms break application data into chunks small enough that snooping any one of them is unlikely to yield useful information. Avaki claims that it integrates easily with different existing security infrastructures and can facilitate the communications among them, but this is obviously a challenge for global distributed computing.

Working out standards for communications among platforms is part of the typical chaos that occurs early in any relatively new technology. In the generalized peer-to-peer realm lies the Peer-to-Peer Working Group, started by Intel, which is looking to devise standards for communications among many different types of peer-to-peer platforms, including those used for edge services and collaboration. The Global Grid Forum is a collection of about 200 companies looking to devise grid computing standards. Then you have vendor-specific efforts such as Sun's open source JXTA platform, which provides a collection of protocols and services that allows peers to advertise themselves to and communicate with each other securely. JXTA has a lot in common with Jini but is not Java specific (though the first version is Java based). Intel recently released its own peer-to-peer middleware, the Intel Peer-to-Peer Accelerator Kit for Microsoft .NET, also designed for discovery and based on the Microsoft .NET platform.
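Two of the safeguards described above, signed application packages and redundant execution with result comparison, are easy to sketch in a few lines of Python. The code below is a simplified illustration only: real platforms use full digital signatures rather than a bare hash, and their consensus rules are more sophisticated than a simple majority vote.

import hashlib
from collections import Counter

def package_is_authentic(package_bytes: bytes, expected_digest: str) -> bool:
    """Toy stand-in for the signature check on an application package.
    Real platforms use digital signatures (e.g. RSA), not a bare hash, but
    the idea is the same: refuse to run anything the management server
    did not publish."""
    return hashlib.sha256(package_bytes).hexdigest() == expected_digest

def consensus_result(results):
    """Redundancy check: the same work unit was sent to several clients;
    accept the answer the majority agrees on and flag the dissenters as
    suspect (possibly tampered with, possibly just a flaky machine)."""
    winner, _ = Counter(results).most_common(1)[0]
    suspects = [i for i, r in enumerate(results) if r != winner]
    return winner, suspects

# Three clients ran the same unit; one disagrees and gets flagged.
print(consensus_result(["42.0017", "42.0017", "41.9999"]))  # ('42.0017', [2])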
For grid projects there's the Globus Toolkit (www.globus.org), developed by a group at Argonne National Laboratory and another team at the University of Southern California's Information Sciences Institute. It bills itself as an open source infrastructure that provides most of the security, resource discovery, data access, and management services needed to construct grid applications, and a large number of today's grid applications are based on it. IBM is offering a version of the Globus Toolkit for its servers running Linux and AIX. And Entropia has announced that it intends to integrate its platform with Globus, an early step toward communications among platforms and applications.

For today, however, the specific promise of distributed computing lies mostly in harnessing the system resources that lie within the firewall. It will take years before the systems on the Net will be sharing compute resources as effortlessly as they can share information.

Finally, we realize that distributed computing technologies and research cover a vast amount of territory, and we've only touched the surface so far. Please join our forums and discuss some of your experiences with distributed computing, and feel free to make suggestions of technologies you'd like to see us explore in more detail.

Companies and Organizations to Watch

Avaki Corporation
Corporate Headquarters, One Memorial Drive, Cambridge, MA 02142; 617-374-2500; www.avaki.com
Makes Avaki 2.0, grid computing software for mixed-platform environments and global grid configurations. Includes a PKI-based security infrastructure for grids spanning multiple companies, locations, and domains.

The DataGrid
Dissemination Office: CNR-CED, Piazzale Aldo Moro 7, 00145 Roma, Italy; +39 06 49933205; www.eu-datagrid.org
A project funded by the European Union and led by CERN and five other partners whose goal is to set up computing grids that can analyze data from scientific exploration across the continent. The project hopes to develop scalable software solutions and testbeds that can handle thousands of users and tens of thousands of grid-connected systems from multiple research institutions.

DataSynapse Inc.
632 Broadway, 5th Floor, New York, NY 10012-2614; 212-842-8842; www.datasynapse.com
Makes LiveCluster, distributed computing middleware aimed at the financial services and energy markets, currently mostly for use inside the firewall. Includes the ability for interprocess communications among distributed application packages.

Distributed.net
www.distributed.net
Founded in 1997, Distributed.net was one of the first non-profit distributed computing organizations and the first to create a distributed computing network on the Internet. It was highly successful in using distributed computing to take on cryptographic challenges sponsored by RSA Labs and CS Communication & Systems.

Entropia, Inc.
10145 Pacific Heights Blvd., Suite 800, San Diego, CA 92121; 858-623-2840; www.entropia.com
Makes the Entropia distributed computing platform aimed at the life sciences market, currently mostly for use inside the firewall. Boasts binary integration, which lets you integrate your applications using any language without having to access the application's source code. Recently integrated its software with the Globus Toolkit.

Global Grid Forum
www.gridforum.org
A standards organization composed of over 200 companies working to devise and promote standards, best practices, and integrated platforms for grid computing.
The Globus Project
www.globus.org
A research and development project consisting of members of Argonne National Laboratory, the University of Southern California's Information Sciences Institute, NASA, and others focused on enabling the application of grid concepts to scientific and engineering computing. The team has produced the Globus Toolkit, an open source set of middleware services and software libraries for constructing grids and grid applications. The Toolkit includes software for security, information infrastructure, resource management, data management, communication, fault detection, and portability.

Grid Physics Network (GriPhyN)
www.griphyn.org
A team of experimental physicists and IT researchers from the University of Florida, the University of Chicago, Argonne National Laboratory, and about a dozen other research centers working to implement the first worldwide petabyte-scale computational and data grid for physics and other scientific research. The project is funded by the National Science Foundation.

IBM
International Business Machines Corporation, New Orchard Road, Armonk, NY 10504; 914-499-1900
IBM is heavily involved in setting up over 50 computational grids across the planet using IBM infrastructure for cancer research and other initiatives. IBM was selected in August by a consortium of four U.S. research centers to help create the "world's most powerful grid," which when completed in 2003 will supposedly be capable of processing 13.6 trillion calculations per second. IBM also markets the IBM Globus Toolkit, a version of the Toolkit for its servers running AIX and Linux.

Intel Corporation
2200 Mission College Blvd., Santa Clara, CA 95052-8119; 408-765-8080; www.intel.com
Intel is the principal founder of the Peer-to-Peer Working Group and recently announced the Peer-to-Peer Accelerator Kit for Microsoft .NET, middleware based on the Microsoft .NET platform that provides building blocks for the development of peer-to-peer applications and includes support for location independence, encryption, and availability. The technology, source code, demo applications, and documentation will be made available for free on Microsoft's gotdotnet website (www.gotdotnet.com); the target release date is early December. Intel also partners with United Devices on the Intel-United Devices Cancer Research Project, which enlists Internet users in a distributed computing grid for cancer research.

NASA Advanced Supercomputing Division (NAS)
NAS Systems Division Office, NASA Ames Research Center, Moffett Field, CA 94035; 650-604-4502; www.nas.nasa.gov
NASA's NAS Division is leading a joint effort among leaders in government, academia, and industry to build and test NASA's Information Power Grid (IPG), a grid of high-performance computers, data storage devices, scientific instruments, and advanced user interfaces that will help NASA scientists collaborate with these other institutions to "solve important problems facing the world in the 21st century."

Network for Earthquake Engineering Simulation Grid (NEESgrid)
www.neesgrid.org
In August 2001, the National Science Foundation awarded $10 million to a consortium of institutions led by the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign to build the NEESgrid, which will link earthquake engineering research sites across the country in a national grid, provide data storage facilities and repositories, and offer remote access to research tools.
Parabon Computation
3930 Walnut Street, Suite 100, Fairfax, VA 22030-4738; 703-460-4100; www.parabon.com
Makes Frontier server software and Pioneer client software, a distributed computing platform that supposedly can span enterprises or the Internet. Also runs Compute Against Cancer, a distributed computing grid for non-profit cancer organizations.

Particle Physics Data Grid (PPDG)
www.ppdg.net
A collaboration of Argonne National Laboratory, Brookhaven National Laboratory, Caltech, and others to develop, acquire, and deliver the tools for a national computing grid for current and future high-energy and nuclear physics experiments.

Peer-to-Peer Working Group
5440 SW Westgate Drive, Suite 217, Portland, OR 97221; 503-291-2572; www.peer-to-peerwg.org
A standards group founded by Intel and composed of over 30 companies with the goal of developing best practices that enable interoperability among peer-to-peer applications.

Platform Computing
3760 14th Ave, Markham, Ontario L3R 3T7, Canada; 905-948-8448; www.platform.com
Makes a number of enterprise distributed and grid computing products, including Platform LSF Active Cluster for distributed computing across Windows desktops and Platform LSF for distributed computing across mixed environments of UNIX, Linux, Macintosh, and Windows servers, desktops, supercomputers, and clusters. Also offers a number of products for distributed computing management and analysis, as well as its own commercial distribution of the Globus Toolkit. Targets the computer and industrial manufacturing, life sciences, government, and financial services markets.

SETI@home
http://setiathome.ssl.berkeley.edu/
A worldwide distributed computing grid based at the University of California at Berkeley that allows users connected to the Internet to donate their PCs' spare CPU cycles to the search for intelligent life in the universe. Its task is to sort through the 1.4 billion potential signals picked up by the Arecibo telescope to find signals that repeat. Users receive approximately 350K of data at a time, and the client software runs as a screensaver.

Sun Microsystems Inc.
901 San Antonio Road, Palo Alto, CA 94303; 650-960-1300; www.sun.com
Sun is involved in several grid and peer-to-peer products and initiatives, including its open source Grid Engine platform for setting up departmental and campus computing grids (with the eventual goal of a global grid platform) and its JXTA (short for juxtapose) set of protocols and building blocks for developing peer-to-peer applications.

United Devices, Inc.
12675 Research, Bldg A, Austin, TX 78759; 512-331-6016; www.ud.com
Makes the MetaProcessor distributed computing platform aimed at the life sciences, geosciences, and industrial design and engineering markets, currently focused inside the firewall. Also partners with Intel on the Intel-United Devices Cancer Research Project, which enlists Internet users in a distributed computing grid for cancer research.

Copyright (c) 2003 Ziff Davis Media Inc. All Rights Reserved.