Conclusion of the PostDoctoral Program on SSI-OSCAR

September 2005

Geoffroy Vallée

Introduction

From March 2004 to September 2005, EDF R&D, INRIA and ORNL collaborated on the subject of software infrastructure for clustering. The main goal of this collaboration was the integration of the Kerrighed Single System Image (SSI), developed by INRIA and EDF R&D, into the OSCAR cluster toolkit. The collaboration funded a postdoc position, located at IRISA, France, for the first five months and then at ORNL, Oak Ridge, USA.

Section 1 presents the initial objectives. Section 2 presents the studies and prototypes done during the postdoc. Section 3 presents the scientific papers and talks given by the postdoc.

Initial Objectives

EDF R&D, INRIA (Institut National de Recherche en Informatique et en Automatique), more specifically the INRIA research unit located in Rennes at IRISA (Institut de Recherche en Informatique et Systèmes Aléatoires), and Oak Ridge National Laboratory (ORNL) initiated a two-year research collaboration on the subject of cluster computing. This collaboration focuses on:

Three strategic objectives:
SO1: the evaluation of the OSCAR consortium and the OSCAR toolbox by EDF R&D and INRIA.
SO2: the preparation of the decision by INRIA and EDF R&D on whether to participate in the OSCAR consortium. If the decision is positive, define with the consortium the role INRIA and EDF R&D will play.
SO3: the initiation of a more general, long-term collaboration between INRIA, EDF R&D and ORNL on the subject of High Performance Computing.

Two technical objectives:
TO1: the evaluation of the Kerrighed system by ORNL.
TO2: the integration of the Kerrighed system into the OSCAR toolbox.

One industrial objective:
IO1: adapt the OSCAR toolbox to industrial needs and make it available on commercial clusters.

A two-year postdoc position, funded by INRIA and EDF R&D, was created in the framework of this collaboration.
Occasional exchanges of researchers between the three organizations were also organized. Initially, the postdoc, Geoffroy Vallée, was supposed to join the team of Stephen Scott at ORNL during the first year and the team of Christine Morin at IRISA/INRIA during the second year. The first three months of the collaboration were supposed to be dedicated to the first two objectives. The two technical objectives were supposed to be achieved by the end of the first year. The industrial objective was supposed to be achieved by the end of the second year, joining the efforts of the postdoc, researchers of ORNL and INRIA, and engineers of EDF R&D. This collaboration was also intended as an opportunity to work toward a more general, long-term collaboration between INRIA, EDF R&D and ORNL on the subject of High Performance Computing.

Work Done

Schedule

Because of visa issues, the organization of the postdoc was modified. Geoffroy Vallée spent the first five months at IRISA, waiting for the completion of the visa paperwork. He arrived at ORNL in August 2004 and stayed there until the end of the postdoc program in September 2005. He spent a month (December 2004 to January 2005) working with IRISA researchers and EDF engineers.

SSI-OSCAR

The postdoc program led to the creation of the SSI-OSCAR software. SSI-OSCAR aims at providing an easy way to use clusters, through a Single System Image (SSI), and an easy way to administer them, through OSCAR, a distribution for high performance computing on clusters. To that end, the Kerrighed SSI has been integrated into the OSCAR distribution. The first version of SSI-OSCAR was a complete spin-off suite because it required significant modifications of the OSCAR suite. OSCAR needed to be modified because of limitations of both the OSCAR suite and the Kerrighed SSI.
For example, OSCAR could not easily change the kernel installed on compute nodes; since Kerrighed is an extension of the Linux kernel, it could not be deployed on compute nodes without modifying OSCAR.

SSI-OSCAR 1.0 was released in November 2004 and announced during SuperComputing'04. This version was based on OSCAR 3.0 for RedHat 9.0 and on Kerrighed 1.0 release candidate 8. It was provided as an alternative OSCAR suite, significant modifications having been made to support the Kerrighed kernel.

SSI-OSCAR 2.0 was released in March 2005. This version was based on Kerrighed 1.0.0 and on OSCAR 4.0 for RedHat 9.0 and Fedora Core 2. It was still released as an alternative OSCAR suite; the new features were the support of a new OSCAR version and of a new Kerrighed version.

SSI-OSCAR 3.0 was released in May 2005. This version introduced a new architecture: it is available as a "spin-off" OSCAR package which can be downloaded and installed with the OSCAR Package Downloader (OPD). This version provides Kerrighed 1.0.0 (the Kerrighed kernel package was modified to include a large set of drivers and to support initrd images) and is based on OSCAR 4.1 for RedHat 9 and Fedora Core 2. It was announced at the OSCAR Symposium.

SSI-OSCAR 3.1 was released in May 2005. It includes Kerrighed 1.0.2 and integrates the Kerrighed tests into the OSCAR tests. With this version, the Kerrighed installation can be tested from the OSCAR GUI, and the Kerrighed system is automatically launched at the end of the cluster installation. This version is based on OSCAR 4.1 for RedHat 9 and Fedora Core 2.

OSCAR Package

OSCAR is based on binary packages; therefore, the first step in integrating Kerrighed into OSCAR is to create binary packages for Kerrighed.
RedHat 9 and Fedora Core 2 being the two popular Linux distributions supported by OSCAR, packages were created for both. These two distributions are based on RPM packages, so some information is common to both sets of packages. Nevertheless, each distribution needs specific parameters to create packages (e.g. the compiler version to use). Therefore, to ease package creation, a framework was developed to automatically create binary packages for Kerrighed from the Linux sources and the Kerrighed sources. This framework, named Kpackager, centralizes the components common to the binary packages and eases the management of distribution-specific parameters. In the end, Kpackager lets the package maintainer create Kerrighed packages using a simple make command. The advantages of this solution are to:
● simplify the management of the files needed to create a package. The creation of binary packages can quickly become complex because of the large set of files (e.g. patches, configuration) and information (e.g. sources location) to manage.
● ease the creation of packages for a new Kerrighed version. The package maintainer just has to update the information for the new version (such as the Kerrighed patch).
● ease the support of new distributions based on the same binary package format. The files for one Linux distribution can be used as a starting point to support a new Linux distribution.

The current Kpackager version creates packages of Kerrighed 1.0.2 for RedHat 9 and Fedora Core 2. Each distribution has its own configuration files to specialize the kernel for that distribution (e.g. use of a different compiler, different configuration of kernel options). Kpackager was integrated in the CVS repository of the SSI-OSCAR project (the CVS server on lievre.irisa.fr, project kpackager) to ease the technical transfer to the next team that will manage the project.
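The idea behind Kpackager, shared components plus a small set of per-distribution parameters, can be illustrated with a minimal Python sketch. This is not Kpackager's actual implementation (the report only states that it is driven by a make command): the dictionary layout, the function names, the compiler labels and the configuration file names below are all hypothetical and chosen for illustration only.

```python
# Hypothetical sketch: settings shared by every package, taken from the report
# where possible (Kerrighed 1.0.2, RPM format).
COMMON = {
    "kerrighed_version": "1.0.2",
    "package_format": "rpm",
}

# Per-distribution parameters. The compiler labels and config file names are
# placeholders; the report does not give the real values.
PER_DISTRIBUTION = {
    "redhat-9": {
        "compiler": "compiler-for-redhat-9",
        "kernel_config": "config-redhat-9",
    },
    "fedora-core-2": {
        "compiler": "compiler-for-fedora-core-2",
        "kernel_config": "config-fedora-core-2",
    },
}


def build_parameters(distribution):
    """Merge the shared settings with the overrides of one distribution."""
    if distribution not in PER_DISTRIBUTION:
        raise KeyError(f"unsupported distribution: {distribution}")
    params = dict(COMMON)                            # copy, keep COMMON intact
    params.update(PER_DISTRIBUTION[distribution])    # distribution overrides
    params["distribution"] = distribution
    return params


def add_distribution(name, template, **overrides):
    """Register a new distribution, starting from the files of an existing one."""
    base = dict(PER_DISTRIBUTION[template])
    base.update(overrides)
    PER_DISTRIBUTION[name] = base


if __name__ == "__main__":
    print(build_parameters("redhat-9"))
    # Support a new RPM-based distribution by reusing Fedora Core 2 settings.
    add_distribution("example-distro", template="fedora-core-2",
                     kernel_config="config-example-distro")
    print(build_parameters("example-distro"))
```

Under these assumptions, supporting a new Kerrighed version means changing one entry in `COMMON`, and supporting a new distribution means one call to `add_distribution`, which mirrors the three advantages listed above.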
Future Work

The current versions of Kpackager and SSI-OSCAR are based on RedHat 9 and Fedora Core 2. These two Linux distributions are no longer the most popular Linux distributions supported by OSCAR. The next versions of Kpackager and SSI-OSCAR may be based on CentOS 4 (a clone of the professional RedHat distribution) and Fedora Core 3. The management of Kerrighed patches in Kpackager can also be improved, since patches are currently duplicated for each Linux distribution.

Kerrighed Port on the 2.4.29 Kernel

At the beginning of this program, Kerrighed was based on the 2.4.24 kernel. This version being old compared to the kernels supported by Linux distributions, some issues appeared during the creation of the SSI-OSCAR software. Therefore, Kerrighed was ported to the 2.4.29 kernel, the most recent 2.4 kernel available at the time of the port. This port made it possible to:
● fix some kernel issues due to the differences between the kernels supported by Linux distributions and the Kerrighed kernel.
● update Kerrighed to the most recent 2.4 kernel.
● initiate the x86_64 port, this architecture not being well supported by the 2.4.24 kernel.

Initiation of the x86_64 Port

The Kerrighed port to x86_64 machines was initiated thanks to the port to the 2.4.29 kernel, the x86_64 support in the 2.4.29 kernel being better than in the 2.4.24 kernel initially supported by Kerrighed. With this port, Kerrighed can be compiled on x86_64 machines, but no validation has been done.

OSCARonDebian

EDF clusters are based on the Debian Linux distribution. OSCAR 3.0 (the official stable version of OSCAR at the beginning of the collaboration program) supported only RPM-based Linux distributions, and only a basic framework was available to ease the extension to new binary package formats. A first port to Debian was created and announced during the OSCAR Symposium (May 2005). This experimental version made it possible to obtain a summer student funded by Google through the Summer of Code program.
This contribution fixed issues in the initial version. The current version is mature enough for its code to be integrated into the development repository of the OSCAR project, and its integration in OSCAR 5 is planned.

SSI related works at ORNL

ORNL is involved in two FastOS projects: PetascaleSSI and Molar. PetascaleSSI aims at creating a petascale SSI. Molar (MOdular Linux and Adaptive Runtime support for HEC OS/R research) aims at adaptive, reliable, and efficient operating and runtime system solutions for ultra-scale high-end scientific computing on the next generation of supercomputers. SSI features (e.g. process checkpoint/restart and process migration) were studied in this context. SSI-OSCAR may be used for some specific points of these projects.

In the context of these studies of large-scale systems, I studied a solution for cluster virtualization for kernel research and testing. A paper has been written based on these studies.

Conferences and Talks

● The COSET-1 workshop, June 2004: presentation of the SSI-OSCAR project and chairman of the session Cluster Single System Image Operating Systems.
● The Cluster 2004 IEEE International Conference: presentation of the paper:

@InProceedings{morlotvalgalmarbersch04cluster,
  Author = {Christine Morin and Renaud Lottiaux and Geoffroy Vallée and Pascal Gallard and David Margery and Jean-Yves Berthou and Isaac D. Scherson},
  Title = {Kerrighed and Data Parallelism: Cluster Computing on Single System Image Operating Systems},
  Booktitle = {The 2004 IEEE International Conference on Cluster Computing},
  Address = {San Diego, California, USA},
  Pages = {20--23},
  Month = sep,
  Year = 2004
}

● Presentation at ETSU (Johnson City, Tennessee, USA), September 2004: The Kerrighed Operating System: a Single System Image for Cluster.
● Presentation at the University of Tennessee (Knoxville, Tennessee, USA), September 2004: The Kerrighed Operating System: a Single System Image for Cluster.
● The HAPCW workshop, October 2004: presentation of the paper:

@InProceedings{valbergallotmarmor04hapcw,
  Author = {Geoffroy Vallée and Jean-Yves Berthou and Pascal Gallard and Renaud Lottiaux and David Margery and Christine Morin},
  Title = {Kerrighed: a Single System Image Providing High Availability Capabilities to Applications},
  Booktitle = {HAPCW'04: High Availability and Performance Computing Workshop},
  Organization = {Held in conjunction with LACSI 2004},
  Month = oct,
  Year = 2004
}

● Talk at ORNL, October 2004: Kerrighed: a Single System Image for Clusters.
● SuperComputing OSCAR BOF, November 2004: presentation of the SSI-OSCAR project.
● The CCGRID 2005 IEEE International Conference: presentation of the paper:

@InProceedings{lotboigalvalmor05ccgrid,
  Author = {Renaud Lottiaux and Benoit Boissinot and Pascal Gallard and Geoffroy Vallée and Christine Morin},
  Title = {OpenMosix, OpenSSI and Kerrighed: A Comparative Study},
  Booktitle = {Cluster Computing and Grid 2005 (CCGRID 2005)},
  Address = {Cardiff, England},
  Month = may,
  Year = 2005
}

● The DSM 2005 workshop (held in conjunction with CCGRID 2005): session co-chairman.
● OSCAR Symposium, May 2005: member of the program committee and presentation of the papers:

@InProceedings{valberprilep05oscar,
  Author = {Geoffroy Vallée and Jean-Yves Berthou and Hugues Prisker and Daniel Leprince},
  Title = {OSCAR on Debian: the EDF Experience},
  Booktitle = {The 3rd Annual OSCAR Symposium},
  Organization = {Held in conjunction with the 19th International Symposium on High Performance Computing Systems and Applications (HPCS 2005)},
  Address = {University of Guelph, Guelph, Ontario, Canada},
  Month = may,
  Year = 2005
}

@InProceedings{valscomorberpri05oscar,
  Author = {Geoffroy Vallée and Stephen L. Scott and Christine Morin and Jean-Yves Berthou and Hugues Prisker},
  Title = {SSI-OSCAR: a Cluster Distribution for High Performance Computing Using a Single System Image},
  Booktitle = {The 3rd Annual OSCAR Symposium},
  Organization = {Held in conjunction with the 19th International Symposium on High Performance Computing Systems and Applications (HPCS 2005)},
  Address = {University of Guelph, Guelph, Ontario, Canada},
  Month = may,
  Year = 2005
}

● The COSET-2 workshop, June 2005: member of the program committee.
● The ISPDC 2005 IEEE International Conference, July 2005: presentation of the paper:

@InProceedings{vallotmarmorber05ispdc,
  author = {Geoffroy Vallée and Renaud Lottiaux and David Margery and Christine Morin and Jean-Yves Berthou},
  title = {Ghost Process: a Sound Basis to Implement Process Duplication, Migration and Checkpoint/Restart in Linux Clusters},
  booktitle = {The 4th International Symposium on Parallel and Distributed Computing},
  year = {2005},
  address = {Lille, France},
  month = {July}
}