Data Grids: A New Computational Infrastructure for Data Intensive Science

Paul Avery
Oct. 21, 2001, Version 1

1 Introduction

Twenty-first century scientific and engineering enterprises are increasingly characterized by their geographic dispersion and their reliance on large data archives. These characteristics bring with them unique challenges. First, the increasing size and complexity of modern data collections require significant investments in information technologies to store, retrieve and analyze them. Second, the increased distribution of people and resources in these projects has made resource sharing and collaboration across significant geographic and organizational boundaries critical to their success.

Infrastructures known as “Grids” [1] are being developed to address the problem of resource sharing. An excellent introduction to Grids can be found in the article “The Anatomy of the Grid” [2], which provides the following description:

“The real and specific problem that underlies the Grid concept is coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations. The sharing that we are concerned with is not primarily file exchange but rather direct access to computers, software, data, and other resources, as is required by a range of collaborative problem-solving and resource-brokering strategies emerging in industry, science, and engineering. This sharing is, necessarily, highly controlled, with resource providers and consumers defining clearly and carefully just what is shared, who is allowed to share, and the conditions under which sharing occurs. A set of individuals and/or institutions defined by such sharing rules form what we call a virtual organization (VO).”

The existence of very large distributed data collections adds a significant new dimension to enterprise-wide resource sharing, and has led to a substantial research and development effort on “Data Grid” infrastructures capable of supporting this more complex collaborative environment. This work has taken on more urgency for new scientific collaborations, which in some cases will reach global proportions and share data archives with sizes measured in dozens or even hundreds of Petabytes within a decade. These collaborations have recognized the strategic importance of Data Grids for realizing the scientific potential of their experiments, and have begun working with computer scientists, members of other scientific and engineering fields and industry to research and develop this new technology and create production-scale computational environments. Figure 1 below shows a U.S.-based Data Grid consisting of a number of heterogeneous resources.

Figure 1: A transcontinental Data Grid composed of computational and storage resources of different types linked by high-speed networks.

My aim in this paper is to review Data Grid technologies and how they can benefit data intensive sciences. Developments in industry are not included here, but since most Data Grid work is presently carried out to address the urgent data needs of advanced scientific experiments, the omission is not a serious one. (The solutions developed for these experiments will in any case be of enormous benefit to industry before long.)
Furthermore, I will concentrate on those projects that are developing Data Grid infrastructures for a variety of disciplines, rather than “vertically integrated” projects that benefit a single experiment or discipline, and explain the specific challenges faced by those disciplines.

2 Data Intensive Activities

The number and diversity of data intensive projects are expanding rapidly. The following survey of projects, while incomplete, shows the scope of data intensive methods and the immense interest in applying them to scientific problems.

Physics and space sciences: High energy and nuclear physics experiments at accelerator laboratories at Fermilab, Brookhaven and SLAC already generate dozens to hundreds of Terabytes of colliding beam data per year that is distributed to and analyzed by hundreds of physicists around the world to search for subtle new interactions. Upgrades to these experiments and new experiments planned for the Large Hadron Collider at CERN will increase data rates to Petabytes per year. Gravitational wave searches at LIGO, VIRGO and GEO will accumulate yearly samples of approximately 100 Terabytes of mostly environmental and calibration data that must be correlated and filtered to search for rare gravitational events. New multiwavelength all-sky surveys utilizing telescopes instrumented with gigapixel CCD arrays will soon drive yearly data collection rates from Terabytes to Petabytes. Similarly, remote-sensing satellites operating at multiple wavelengths will generate several Petabytes of spatial-temporal data that can be studied by researchers to accurately measure changes in our planet’s support systems.

Biology and medicine: Biology and medicine are rapidly increasing their dependence on data intensive methods. Experiments at new generation light sources have the potential to generate massive amounts of data while recording the changes in shape of individual protein molecules. Organism genomes are being sequenced by new generations of sequencing engines, stored in databases, and compared using new statistical methods that require massive computational power. Proteomics, the study of protein structure and function, is expected to generate enormous amounts of data, easily dwarfing the data samples obtained from genome studies. In medicine, a single three-dimensional brain scan can generate a significant fraction of a Terabyte of data, while systematic adoption of digitized radiology scans will produce dozens of Petabytes of data that can be quickly accessed and searched for breast cancer and other diseases. Exploratory studies have shown the value of converting patient records to electronic form and attaching digital CAT scans, X-ray charts and other instrument data, but systematic use of such methods would generate databases many Petabytes in size. Medical data pose additional ethical and technical challenges, stemming from exacting security restrictions on data access and patient identification.

Computer simulations: Advances in information technology in recent years have given scientists and engineers the ability to develop sophisticated simulation and modeling techniques for improved understanding of the behavior of complex systems.
When coupled to the huge processing power and storage resources available in supercomputers or large computer clusters, these advanced simulation and modeling methods become tools of rare power, permitting detailed and rapid studies of physical processes while sharply reducing the need to conduct lengthy and costly experiments or to build expensive prototypes. The following examples provide a hint of the potential of modern simulation methods. High energy and nuclear physics experiments routinely generate simulated datasets whose size (in the multi-Terabyte range) is comparable to, and sometimes exceeds, that of the raw data collected by the same experiment. Supercomputers generate enormous databases from long-term simulations of climate systems with different parameters that can be compared with one another and with remote satellite sensing data. Environmental modeling of bays and estuaries using fine-scale fluid dynamics calculations generates massive datasets that permit the calculation of pollutant dispersal scenarios under different assumptions that can be compared with measurements. These projects also have geographically distributed user communities who must access and manipulate these databases.

Physics at the Large Hadron Collider: Although I briefly discussed high energy physics earlier, the requirements for experiments at CERN’s Large Hadron Collider (LHC), due to start operations in 2006, are so extreme as to merit a separate treatment. LHC experiments face computing challenges of unprecedented scale in terms of data volume and complexity, processing requirements, and the complexity and distributed nature of the analysis and simulation tasks carried out by thousands of scientists worldwide. Every second, each of the two general-purpose detectors will filter one billion collisions and record one hundred of them to mass storage, generating data rates of 100 Mbytes per second and several Petabytes per year of raw, processed and simulated data in the early years of operation. The data storage rate is expected to grow in response to the pressures of increased beam intensity, additional physics processes that must be recorded and better storage capabilities, leading to LHC data collections totaling approximately 100 Petabytes by the end of this decade, and up to an Exabyte (1000 Petabytes) by the middle of the following decade. The challenge facing the LHC laboratory and scientific community is how to build an information technology infrastructure that will provide these computational and storage resources while enabling their effective and efficient use by a scientific community of thousands of physicists spread across the globe.

3 Data Grids and Data Intensive Sciences

To develop the argument that Data Grids offer a comprehensive solution to data intensive activities, I first summarize some general features of Grid technologies. These technologies comprise a mixture of protocols, services, and tools that are collectively called “middleware”, reflecting the fact that they are accessed by “higher level” applications or application tools while they in turn invoke processing, storage, network and other services at “lower” software and hardware levels.
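To make this layering concrete, the short sketch below follows a single data request from an application down through hypothetical middleware to the underlying resources. It uses the layer names (Fabric, Resource, Connectivity, Collective, Application) of the Grid architecture described in the next paragraph, but it is a conceptual illustration only: the Python classes, functions and site names are invented for this paper and do not correspond to any actual Globus interface.

# Conceptual sketch of Grid middleware layering (hypothetical code, not a real API).
from dataclasses import dataclass

# Fabric layer: the shared resources themselves (here, storage sites).
@dataclass
class StorageSite:
    name: str
    holdings: set        # logical file names physically present at this site

# Resource layer: uniform access to a single resource.
def site_has(site: StorageSite, lfn: str) -> bool:
    """Ask one site whether it holds the logical file `lfn`."""
    return lfn in site.holdings

# Connectivity layer: authenticated communication and transport (stubbed out).
def transfer(lfn: str, source: StorageSite, destination: str) -> str:
    """Pretend to move a file over the network and report where it landed."""
    return f"{destination}/{lfn} (copied from {source.name})"

# Collective layer: coordinate many resources on behalf of applications.
def fetch(lfn: str, sites: list, destination: str) -> str:
    """Locate any site holding `lfn` and stage the file to `destination`."""
    for site in sites:
        if site_has(site, lfn):
            return transfer(lfn, site, destination)
    raise LookupError(f"{lfn} not found at any site")

# Application layer: a user task that neither knows nor cares where data lives.
if __name__ == "__main__":
    grid = [StorageSite("tape-archive", {"run42.dat"}),
            StorageSite("disk-cache", {"calib.dat"})]
    print(fetch("calib.dat", grid, "/local/scratch"))

The point of the layering is that the application sees only the collective-level call; authentication, site-by-site queries and transport details are hidden in the layers below it.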
Grid middleware includes security and policy mechanisms that work across multiple institutions; resource management tools that support access to remote information resources and simultaneous allocation (“co-allocation”) of multiple resources; general information protocols and services that provide important status information about hardware and software resources, site configurations, and services; and data management tools that locate and transport datasets between storage systems and applications.

The diagram in Figure 2 outlines in a simple way the roles played by various Grid technologies. The lowest level, the Fabric, contains shared resources such as computer clusters, data storage systems, catalogs, networks, etc. that Grid tools must access and manipulate. The Resource and Connectivity layers provide, respectively, access to individual resources and the communication and authentication tools needed to communicate with them. Coordinated use of multiple resources – possibly at different sites – is handled by Collective protocols, APIs, and services. Applications and application toolkits utilize these Grid services in myriad ways to provide “Grid-aware” services for members of a particular virtual organization. A much more detailed explication of Grid architecture can be found in reference [2].

Figure 2: Diagram showing how Grid services in different levels (Application, Collective, Resource, Connectivity, Fabric) provide applications access to resources.

While standard Grid infrastructures provide distributed scientific communities the ability to collaborate and share resources, additional capabilities are needed to cope with the specific challenges associated with scientists accessing and manipulating very large distributed data collections. These collections, ranging in size from Terabytes to Petabytes, comprise raw (measured) data and many levels of processed or refined data, as well as comprehensive metadata describing, for example, under what conditions the data was generated or collected, how large it is, and so on. New protocols and services must facilitate access to significant tertiary (e.g., tape) and secondary (disk) storage repositories to allow efficient and rapid access to primary data stores, while taking advantage of disk caches that buffer very large data flows between sites. They also must make efficient use of high-performance networks that are critically important for the timely completion of these transfers. Thus, transporting 10 Terabytes of data to a computational resource in a single day requires a 1 Gigabit per second network operated at essentially 100% utilization (10 Terabytes is 8 × 10^13 bits; spread over the 86,400 seconds in a day, this is roughly 0.9 Gigabits per second). Efficient use of these extremely high network bandwidths also requires special software interfaces and programs that in most cases have yet to be developed.

The computational and data management problems encountered in these experiments include the following challenging aspects:

Computation-intensive as well as data-intensive: Analysis tasks are compute-intensive and data-intensive and can involve thousands of computer, data handling, and network resources. The central problem is coordinated management of computation and data, not just data curation and movement.

Need for large-scale coordination without centralized control: Rigorous performance goals require coordinated management of numerous resources, yet these resources are, for both technical and strategic reasons, highly distributed and not always amenable to tight centralized control.
Large dynamic range in user demands and resource capabilities: It must be possible to support and arbitrate among a complex task mix of experiment-wide, group-oriented, and (perhaps thousands of) individual activities—using I/O channels, local area networks, and wide area networks that span several distance scales.

Data and resource sharing: Large dynamic communities would like to benefit from the advantages of intra- and inter-community sharing of data products and the resources needed to produce and store them.

The “Data Grid” has been introduced as a unifying concept to describe the new technologies required to support such next-generation data-intensive applications—technologies that will be critical to future data-intensive computing in the many areas of science and commerce in which sophisticated software must harness large amounts of computing, communication and storage resources to extract information from data. Data Grids are typically characterized by the following elements: (1) they have large extent (national and even global) and scale (many resources and users); (2) they layer sophisticated new services on top of existing local mechanisms and interfaces, facilitating coordinated sharing of remote resources; and (3) they provide a new dimension of transparency in how computational and data processing are integrated to provide data products to user applications. This transparency is vitally important for sharing heterogeneous distributed resources in a manageable way, a point to which I will return in the next section.

4 Major Data Grid Efforts Today

I describe in this section the major projects that have been undertaken to develop Data Grids. Without exception, these efforts have been driven by the extreme needs of current and planned scientific experiments, and have led to fruitful collaborations between application scientists and computer scientists. High energy physics has been at the forefront of this activity, a fact that can be attributed to its historical standing as both a highly data intensive and highly collaborative discipline. It is widely recognized, however, that existing high energy physics computing infrastructures will not scale to the upgraded experiments at SLAC, Fermilab and Brookhaven, nor to the new experiments at the LHC, which will generate Petabytes of data per year and be analyzed by global collaborations. Participants in other planned experiments in nuclear physics, gravitational research, large digital sky surveys, and virtual astronomical observatories, fields which also face challenges associated with massive distributed data collections and a dispersed user community, have also decided to adopt Data Grid computing infrastructures. Scientists from these and other disciplines are exploiting new initiatives and partnering with computer scientists and each other to develop production-scale Data Grids.

A tightly coordinated set of projects has been established that together are developing and applying Data Grid concepts to problems of tremendous scientific importance, in such areas as high energy physics, nuclear physics, astronomy, bioinformatics and climate science.
These projects include (1) the Particle Physics Data Grid (PPDG [3]), which is focused on the application of Data Grid concepts to the needs of a number of U.S.-based high energy and nuclear physics experiments; (2) the Earth System Grid project [4], which is exploring applications in climate and specific technical problems relating to request management; (3) the GriPhyN [5] project, which plans to conduct extensive computer science research on Data Grid problems and develop general tools that will support the automatic generation and management of derived, or “virtual”, data for four leading experiments in high energy physics, gravitational wave searches and astronomy; (4) the European Data Grid (EDG) project, which aims to develop an operational Data Grid infrastructure supporting high energy physics, bioinformatics, and satellite sensing; (5) the TeraGrid [6], which will provide a massive distributed computing and data resource connected by ultra-high-speed optical networks; and (6) the International Virtual Data Grid Laboratory (iVDGL [7]), which will provide a worldwide set of resources for Data Grid tests by a variety of disciplines. These projects have adopted the Globus [8] toolkit for their basic Grid infrastructure to speed the development of Data Grids. The Globus directors have played a leadership role in establishing a broad national—and indeed international—consensus on the importance of Data Grid concepts and on specifics of a Data Grid architecture. I will discuss these projects in varying levels of detail in the rest of this section while leaving coordination and common architecture issues for the following section.

4.1 Globus-Based Infrastructure

The Globus Project [8], a joint research and development effort of Argonne National Laboratory, the Information Sciences Institute of the University of Southern California, and the University of Chicago, has over the past several years developed the most comprehensive and widely used Grid framework available today. The widespread adoption of Globus technologies is due in large measure to the fact that its members work closely with a variety of scientific and engineering projects while maintaining compatibility with other computer science toolkits used by these projects. Globus tools are being studied, developed, and enhanced at institutions worldwide to create new Grids and services, and to conduct computing research.

Globus components provide the capabilities to create “Grids” of computing resources and users; track the capabilities of resources within a Grid; specify the resource needs of users’ computing tasks; mutually authenticate both users and resources; and deliver data to and from remotely executed computing tasks. Globus is distributed in a modular and open “toolkit” form, facilitating the incorporation of additional services into scientific environments and applications. For example, Globus has been integrated with technologies such as the Condor high-throughput computing environment [9,10] and the Portable Batch System (PBS) job scheduler [11]. Both of these integrations demonstrate the power and value of the open protocol toolkit approach. Projects such as the NSF’s National Technology Grid [12], NASA’s Information Power Grid [13] and the DOE ASCI Distributed Resource Management Project [14] have provided considerable experience with the creation of production infrastructures.
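To give a flavor of how these capabilities appear to an end user, the sketch below drives three Globus command-line clients (grid-proxy-init for authentication, globus-job-run for remote execution, and globus-url-copy for GridFTP data movement) from a small Python script. The tool names are real, but the host names, file paths and overall workflow are hypothetical illustrations, and a production application would more likely call the toolkit’s APIs directly rather than shell out to the clients.

# Minimal sketch of a user-level Grid workflow driven through Globus
# command-line clients. Host names and file paths are invented for
# illustration; only the tool names come from the Globus toolkit.
import subprocess

def run(cmd):
    # Echo the command, run it, and fail loudly on a non-zero exit status.
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Obtain a short-lived proxy credential from the user's Grid certificate,
#    so that the requests below are mutually authenticated.
run(["grid-proxy-init"])

# 2. Run a trivial task on a (hypothetical) remote compute resource via its
#    Globus gatekeeper.
run(["globus-job-run", "gatekeeper.example.edu", "/bin/hostname"])

# 3. Stage a dataset from a (hypothetical) GridFTP server to local disk.
run(["globus-url-copy",
     "gsiftp://storage.example.edu/data/run42/events.dat",
     "file:///tmp/events.dat"])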
4.2 Earth System Grid

4.3 The TeraGrid

The TeraGrid Project [6] was recently funded by the National Science Foundation for $53M over three years to construct a distributed supercomputing facility at four sites: the National Center for Supercomputing Applications [15] in Illinois, the San Diego Supercomputer Center, Caltech’s Center for Advanced Computational Research and Argonne National Laboratory. The project aims to build and deploy the world’s largest, fastest and most comprehensive distributed infrastructure for open scientific research. When completed, the TeraGrid will include 13.6 teraflops of Linux cluster computing power distributed at the four TeraGrid sites, facilities capable of managing and storing more than 450 terabytes of data, high-resolution visualization environments, and toolkits for Grid computing. These components will be tightly integrated and connected through an optical network that will initially operate at 40 gigabits per second and later be upgraded to 50-80 gigabits per second—an order of magnitude beyond today’s fastest research network. TeraGrid aims to partner with other Grid projects and has active plans to help several discipline sciences deploy their applications, including dark matter calculations, weather forecasting, biomolecular electrostatics, and quantum molecular calculations [16].

4.4 Particle Physics Data Grid

The Particle Physics Data Grid (PPDG) is a collaboration of computer scientists and physicists from six experiments who plan to develop, evaluate and deliver Grid-enabled tools for data-intensive collaboration in particle and nuclear physics. The project has been funded by the U.S. Department of Energy since 1999 and recently received over US$3.1M for 2001 (funding is expected to continue at a similar level for 2002-2003) to establish a “collaboratory pilot” to pursue these goals. The new three-year program will exploit the strong driving force provided by currently running high energy and nuclear physics experiments at SLAC, Fermilab and Brookhaven together with recent advances in Grid middleware. Novel mechanisms and policies will be vertically integrated with Grid middleware and experiment-specific applications and computing resources to form effective end-to-end capabilities. The project’s goals and plans are guided by the immediate, medium-term and longer-term needs and perspectives of the LHC experiments ATLAS [17] and CMS [18], which will run for at least a decade from late 2005, and by the research and development agenda of other Grid-oriented efforts. PPDG exploits the immediate needs of running experiments – the BaBar [19], D0 [20], STAR [21] and JLab [22] experiments – to stress-test both concepts and software in return for significant medium-term benefits. PPDG is actively involved in establishing the necessary coordination between potentially complementary data-grid initiatives in the US, Europe and beyond.

The BaBar experiment faces the challenge of data volumes and analysis needs that are planned to grow by more than a factor of 20 by 2005. During 2001, the CNRS-funded CCIN2P3 computer center in Lyon, France will join SLAC in contributing data analysis facilities to the fabric of the collaboration. The STAR experiment at RHIC has already acquired its first data and has identified Grid services as the most effective way to couple the facilities at Brookhaven with its second major center for data analysis at LBNL.
An important component of the D0 fabric is the SAM [23] distributed data management system at Fermilab, which is to be linked to applications at major US and international sites. The LHC collaborations have identified data-intensive collaboratories as a vital component of their plan to analyze tens of petabytes of data in the second half of this decade. US CMS is developing a prototype worldwide distributed data production system for detector and physics studies.

4.5 GriPhyN

GriPhyN is a large collaboration of computer scientists, experimental physicists and astronomers who aim to provide the information technology (IT) advances required for Petabyte-scale data intensive sciences. Funded by the National Science Foundation at US$11.9M for 2000-2005, the project is driven by the requirements of four forefront experiments: the ATLAS [17] and CMS [18] experiments at the LHC, the Laser Interferometer Gravitational Wave Observatory (LIGO [24]) and the Sloan Digital Sky Survey (SDSS [25]). These requirements, however, easily generalize to other sciences as well as 21st century commerce, so the GriPhyN team is pursuing IT advances centered on the creation of Petascale Virtual Data Grids (PVDG) that meet the data-intensive computational needs of a diverse community of thousands of scientists spread across the globe.

GriPhyN has adopted the concept of virtual data as a unifying theme for its investigations of Data Grid concepts and technologies. This term is used to refer to two related concepts: transparency with respect to location, as a means of improving access performance with respect to speed and/or reliability, and transparency with respect to materialization, as a means of facilitating the definition, sharing, and use of data derivation mechanisms. These characteristics combine to enable the definition and delivery of a potentially unlimited virtual space of data products derived from other data. In this virtual space, requests can be satisfied via direct retrieval of materialized products and/or computation, with local and global resource management, policy, and security constraints determining the strategy used. The concept of virtual data recognizes that all except irreproducible raw experimental data need ‘exist’ physically only as the specification for how they may be derived. The grid may instantiate zero, one, or many copies of derivable data depending on probable demand and the relative costs of computation, storage, and transport. On a much smaller scale, this dynamic processing, construction, and delivery of data is precisely the strategy used to generate much, if not most, of the web content delivered in response to queries today.

Figure 3: Virtual data in action. In this example, a data request is satisfied by data from a major archive facility, data from one regional center and computation on data at a second regional center, plus both data and computation from the local site and neighbor.

Figure 3 illustrates what the virtual data grid concept means in practice. Consider an astronomer using SDSS to investigate correlations in galaxy orientation due to lensing effects by intergalactic dark matter [26,27,28]. A large number of galaxies—some 10^7—must be analyzed to get good statistics, with careful filtering to avoid bias. For each galaxy, the astronomer must first obtain an image, a few pixels on a side; process it in a computationally intensive analysis; and store the results. A schematic sketch of the replicate-or-recompute decision behind such requests appears below.
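The choice at the core of virtual data can be stated compactly: satisfy a request from an existing replica when one is cheap to reach, and otherwise re-derive the product from its recorded transformation, resolving that transformation’s inputs in the same way. The sketch below is a deliberately simplified illustration of this logic; the catalog structure, cost model and all names are hypothetical and are not the interfaces of GriPhyN’s Virtual Data Toolkit.

# Simplified sketch of virtual-data request planning (hypothetical code).
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Derivation:
    transform: Callable            # program that can materialize the product
    inputs: List[str]              # logical names of its input datasets
    compute_cost: float            # estimated cost of re-running the transform

@dataclass
class VirtualDataCatalog:
    replicas: Dict[str, List[str]] = field(default_factory=dict)      # lfn -> sites holding a copy
    derivations: Dict[str, Derivation] = field(default_factory=dict)  # lfn -> how to re-create it

    def request(self, lfn: str, transfer_cost: Callable[[str], float]) -> str:
        """Return a plan (as text) for obtaining the logical file `lfn`."""
        sites = self.replicas.get(lfn, [])
        best_copy = min(sites, key=transfer_cost, default=None)
        recipe = self.derivations.get(lfn)

        # Use an existing replica if it is the cheaper (or only) option.
        if best_copy is not None and (recipe is None or
                                      transfer_cost(best_copy) <= recipe.compute_cost):
            return f"transfer {lfn} from {best_copy}"

        # Otherwise re-derive the product, resolving each input recursively.
        if recipe is not None:
            steps = [self.request(name, transfer_cost) for name in recipe.inputs]
            return "; ".join(steps + [f"run {recipe.transform.__name__} -> {lfn}"])

        raise LookupError(f"{lfn} is neither replicated nor derivable")

def measure_shapes(raw_images):
    return raw_images              # placeholder transformation

catalog = VirtualDataCatalog(
    replicas={"raw_images": ["archive.example.org"]},
    derivations={"galaxy_shapes": Derivation(measure_shapes, ["raw_images"], compute_cost=5.0)},
)
# "galaxy_shapes" exists nowhere as a file, so the plan fetches the raw images
# and re-runs the transformation:
print(catalog.request("galaxy_shapes", transfer_cost=lambda site: 2.0))

In a production Data Grid the catalog, planner and executor are of course far more elaborate, and policy, security and resource availability all enter the cost comparison; the walk-through of the astronomer’s request that follows gives a sense of what is involved.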
Execution of this request involves virtual data catalog accesses to determine whether the required analyses have been previously constructed. If they have not, the catalog must be accessed again to locate the applications needed to perform the transformation and to determine whether the required raw data is located in network cache, remote disk systems, or deep archive. Appropriate computer, network, and storage resources must be located and applied to access and transfer raw data and images, produce the missing images, and construct the desired result. The execution of this single request may involve thousands of processors and the movement of terabytes of data among archives, disk caches, and computer systems nationwide.

Virtual data grid technologies will be of immediate benefit to numerous other scientific and engineering application areas. For example, NSF and NIH fund scores of X-ray crystallography labs that together are generating Petabytes of molecular structure data each year. Only a small fraction of this data is being shared via existing publication mechanisms. Similar observations can be made concerning long-term seismic data generated by geologists, data synthesized from studies of the human genome database, brain imaging data, output from long-duration, high-resolution climate model simulations, and data produced by NASA’s Earth Observing System.

To realize these concepts, GriPhyN is conducting research into virtual data cataloging, execution planning, execution management, and performance analysis issues (see Figure 4). The results of this research, and other relevant technologies, are developed and integrated to form a Virtual Data Toolkit (VDT). Successive VDT releases will be applied and evaluated in the context of the four partner experiments. VDT 1.0 was released in October 2001 and the next release is expected early in 2002.

Figure 4: A Petascale Virtual Data Grid, showing different kinds of users (production teams, individual investigators and others) accessing distributed resources (code, storage, computers, and network) through interactive user tools, request planning and scheduling tools, virtual data tools, and request execution management tools, supported by resource management, security and policy, and other Grid services.

4.6 European Data Grid

4.7 International Virtual Data Grid Laboratory

5 Common Infrastructure

Given the international nature of the experiments participating in these projects (some of them, like the LHC experiments, are participating in several projects), there is widespread recognition by scientists in these projects of the importance of developing common protocols and tools to enable inter-Grid operation and avoid costly duplication. Scientists from several of these projects are

6 Summary

Data Grid technologies embody entirely new approaches to the analysis of large data collections, in which the resources of an entire scientific community are brought to bear on the analysis and discovery process, and data products are made available to all community members, regardless of location. Large interdisciplinary efforts recently funded in the U.S. and EU have begun research and development of the basic technologies required to create working Data Grids. Over the coming years, they will deploy, evaluate, and optimize Data Grid technologies on a production scale, and integrate them into production applications.
The experience gained with these new information infrastructures, providing transparent managed access to massive distributed data collections, will be applicable to large-scale data-intensive problems in a wide spectrum of scientific and engineering disciplines, and eventually in industry and commerce. Such systems will be needed in the coming decades as a central element of our information-based society.

References

[1] The term “Grid”, formerly “Computational Grid”, reflects the fact that the resources to be shared may be quite heterogeneous and have little to do with computing per se.
[2] I. Foster, C. Kesselman, S. Tuecke, “The Anatomy of the Grid: Enabling Scalable Virtual Organizations”, International Journal of High Performance Computing Applications, 15(3), 200-222, 2001, http://www.globus.org/anatomy.pdf.
[3] PPDG home page, http://www.ppdg.net/.
[4] Earth System Grid home page, http://www.earthsystemgrid.org/.
[5] GriPhyN Project home page, http://www.griphyn.org/.
[6] TeraGrid home page, http://www.teragrid.org/.
[7] International Virtual Data Grid Laboratory home page, http://www.ivdgl.org/.
[8] Globus home page, http://www.globus.org/.
[9] Livny, M., High-Throughput Resource Management. In Foster, I. and Kesselman, C., eds., The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann, 1999, 311-337.
[10] Moore, R., Baru, C., Marciano, R., Rajasekar, A. and Wan, M., Data-Intensive Computing. In Foster, I. and Kesselman, C., eds., The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann, 1999, 105-129.
[11] Johnston, W.E., Gannon, D. and Nitzberg, B., Grids as Production Computing Environments: The Engineering Aspects of NASA’s Information Power Grid. In Proc. 8th IEEE Symposium on High Performance Distributed Computing, 1999, IEEE Press.
[12] Stevens, R., Woodward, P., DeFanti, T. and Catlett, C., From the I-WAY to the National Technology Grid. Communications of the ACM, 40(11):50-61, 1997.
[13] Information Power Grid home page, http://www.ipg.nasa.gov/.
[14] Beiriger, J., Johnson, W., Bivens, H., Humphreys, S. and Rhea, R., Constructing the ASCI Grid. In Proc. 9th IEEE Symposium on High Performance Distributed Computing, 2000, IEEE Press.
[15] NCSA home page, http://www.ncsa.edu/.
[16] These applications are described more fully at http://www.teragrid.org/about_faq.html.
[17] The ATLAS Experiment, A Toroidal LHC ApparatuS, http://atlasinfo.cern.ch/Atlas/Welcome.html.
[18] The CMS Experiment, A Compact Muon Solenoid, http://cmsinfo.cern.ch/Welcome.html.
[19] BaBar, http://www.slac.stanford.edu/BFROOT.
[20] D0, http://www-d0.fnal.gov/.
[21] STAR, http://www.star.bnl.gov/.
[22] JLab experiments, http://www.jlab.org/.
[23] SAM, http://d0db.fnal.gov/sam/.
[24] LIGO home page, http://www.ligo.caltech.edu/.
[25] SDSS home page, http://www.sdss.org/.
[26] Fischer, P., McKay, T.A., Sheldon, E., Connolly, A., Stebbins, A. and the SDSS collaboration, Weak Lensing with SDSS Commissioning Data: The Galaxy-Mass Correlation Function to 1/h Mpc, Astron. J., in press, 2000.
[27] Luppino, G.A. and Kaiser, N., Ap. J. 475, 20, 1997.
[28] Tyson, J.A., Kochanek, C. and Dell’Antonio, I.P., Ap. J. 498, L107, 1998.