Complexity of System Configuration Management

A Dissertation submitted by
Yizhan Sun

In partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Computer Science

TUFTS UNIVERSITY
August 2006

© Yizhan Sun, June 2006

ADVISOR: Alva L. Couch

Abstract

System administration has often been considered to be a "practice" with no theoretical underpinnings. In this thesis, we begin to define a theory of system administration, based upon two activities of system administrators: configuration management and dependency analysis. We formalize and explore the complexity of these activities, and demonstrate that they are intractable in the general case. We define the concepts of system behavior, kinds of configuration operations, a model of configuration management, and a model of reproducibility, and give proofs that several parts of the process are NP-complete or NP-hard. We also explore how system administrators keep these tasks tractable in practice. This is a first step toward a theory of system administration and a common language for discussing the theoretical underpinnings of the practice.

Acknowledgements

This thesis is the result of four years of work, during which I have been accompanied and supported by many people. It is a pleasure to now have the opportunity to express my gratitude to all of them.

The first person I would like to thank is my advisor Alva L. Couch. I have been working with him since 2001, when I started my Master's project. His enthusiasm, his integral view of research, and his humor when things get tough have made a deep impact upon me. He patiently guided me and supported me throughout the course of my study.

I would like to thank Professor Kofi Laing and Professor Lenore Cowen for their help on computation theory and for serving as committee members. I also thank Professor Ricardo Pucella for reviewing my work. I am grateful to my fellow students Ning Wu, Marc Chiarini, Hengky Susanto, Josh Danziger, and Bill Bogstad for their insightful comments and discussions of this work. I would like to thank the Tufts University Computer Science Department for giving me financial support to finish my studies.

I owe a great deal of gratitude to my parents and my parents-in-law. My parents stayed with us for three years in the U.S. to help me with my two children. Without their help, I could not even have started this thesis. My parents-in-law supported us financially out of their limited resources. They have shown me what unconditional love is. I am very grateful to my husband for his love and encouragement throughout my Ph.D. studies. I thank my sons Samuel and David for their smiles and countless precious joyful moments in our life.

DEDICATION

To my parents and my parents-in-law, for their endless love.

Contents

1 Introduction
2 Landscape of System Administration
   2.1 Definition of System Administration
   2.2 Taxonomy of System Administration
   2.3 Some System Administration Tasks
      2.3.1 Backup and Restore
      2.3.2 User Management
      2.3.3 Service Management
      2.3.4 Security
      2.3.5 Testing and Quality Assurance
3 Introduction to Configuration Management
   3.1 Software Configuration Management
   3.2 System Configuration Management
   3.3 How Configuration Controls Behavior
4 Challenges of Configuration Management
   4.1 Change
   4.2 Scale
   4.3 Interdependence of Software and Hardware
   4.4 Heterogeneity
   4.5 Contingency
   4.6 Diverse Users
   4.7 Ineffective Collaboration of Multiple System Administrators
   4.8 Mobile Environments
   4.9 Service Guarantees
5 Automation and Autonomic Computing
   5.1 History of Automation in Configuration Management
   5.2 Current Strategies of System Configuration
      5.2.1 Manual Configuration
      5.2.2 Custom Scripting
      5.2.3 Structured Scripting
      5.2.4 File Distribution
      5.2.5 Declarative Syntax
6 The Configuration Process
   6.1 Documentation
   6.2 Experience
   6.3 The Configuration Process
7 A Model of Configuration Management
   7.1 Closed- vs. Open-world Models of Systems
   7.2 Observed Behavior
   7.3 Actual State and Observed State
   7.4 Configuration Operations
   7.5 Two Configuration Management Automata
8 Reproducibility
   8.1 Local Reproducibility
      8.1.1 Properties of Locally Reproducible Operations
      8.1.2 Constructing Locally Reproducible Operations
   8.2 Population Reproducibility
9 Limits on Configuration Operations
   9.1 Limits on Configuration Operations
   9.2 Relationship Between Limits
10 Complexity of Configuration Composition
   10.1 Composability
   10.2 Complexity of Operation Composability
      10.2.1 Component Selection
      10.2.2 General Composability
      10.2.3 Atomic Operation Composability
      10.2.4 Composability of Partially Ordered Operations
      10.2.5 Composability of Convergent Operations
      10.2.6 Summary of The Proofs
   10.3 Discussion
11 Dependency Analysis
   11.1 Dependency Analysis Techniques
      11.1.1 Instrumentation
      11.1.2 Perturbation
      11.1.3 Data Mining
      11.1.4 Requirements Analysis
      11.1.5 Dependency Control
   11.2 Basic Concepts
      11.2.1 White Box
      11.2.2 Black Box
      11.2.3 Functional Expectations and Tests
      11.2.4 State and Behavior
   11.3 Dependence Definition
      11.3.1 Dependency in a Closed-world System
      11.3.2 Dependency in an Open-world System
   11.4 Complexity of Dependency Analysis
      11.4.1 Black Box Dependency Analysis
      11.4.2 White Box Dependency Analysis
   11.5 Discussion
12 Configuration Management Made Tractable in Practice
   12.1 Experience and Documentation
   12.2 Reduction of State Space
   12.3 Abstraction
   12.4 Re-baselining
   12.5 Orthogonality
   12.6 Closure
   12.7 Current Strategies Make Configuration Management Tractable
13 Ways to Reduce Complexity
   13.1 Choosing Easy Instances to Solve
      13.1.1 Reduction of State Space
      13.1.2 Using Simple Operations
      13.1.3 Forming Hierarchies and Relationships
   13.2 Approximation
   13.3 Dynamic Programming
14 Conclusions
   14.1 Review
   14.2 Future Work

List of Figures

6.1 The configuration stages
6.2 The configuration process
6.3 Asynchronous interactions between system administrators and the environment
9.1 Stateless operations can have if statements
9.2 An example of sets that are sequence idempotent but not stateless
10.1 Summary of proofs

Chapter 1

Introduction

System administration has traditionally been viewed as a practice with no theoretical underpinnings. In this thesis, we take the first steps toward developing a theory of system administration by defining and analyzing models of system administration practice. This theory guides us in looking at system administration in new ways, lends understanding to current practices, and suggests new practices for the future.

System administration is in a critical transition period similar to the historical transition from alchemy to chemistry[19]. Practitioners are pioneering the scientific study of the common methods in use today. A large number of experiments have been performed, and various ways of tuning practice have been explored and observed. But the field lacks a mature theoretical foundation and systematic experimental data, which are critical to future development. This work contributes to the construction of a complete theory of system administration and intends to inspire more theoretical research within the field.

As the complexity of computing systems and the demands of their roles increase rapidly, system administration becomes difficult. Within a network, there are tens, hundreds, or thousands of machines; each might have a different architecture, hardware devices, operating system, and installed software applications. Software and hardware are developed by independent and often competing vendors and developers. A large amount of implementation detail is needed to configure these components to accomplish common goals. Users who interact with the system have diverse needs and requirements that are constantly in flux.

The life cycles of new technologies and tools have become shorter and shorter, and this has increased the difficulty of integrating new technologies with legacy infrastructure. Thousands of scripts with complex dependency relationships are embedded within the systems. And these systems are required to support business and industrial processes which are continually reconstructed and reorganized to meet changing users' demands. Nevertheless, service guarantees for modern systems are becoming commonplace. A minute of down-time might cause thousands of dollars of lost business.
This complexity grows beyond human management capacity, especially for large scale systems. We approach this problem by studying configuration management: the activities of initially configuring or reconfiguring the system components (other than users) so that they collaborate in an organized way to satisfy users' requirements or requirement changes.

Automation in configuration management is the key to freeing humans from being overwhelmed by implementation details. Many strategies have been explored and many tools have been developed. Based on their strategies, we group tools into different categories: custom scripting, structured scripting, file distribution, and declarative syntax. We will introduce these strategies in greater detail later. All of these tools try to hide implementation details to some degree from system administrators. Ideally, humans are only required to make policies that declare the high-level requirements or objectives of the system, such as "I want two web servers in my network satisfying service and security requirements", and some autonomic system can take those declarations and translate them into low-level implementation details. This is the vision of autonomic computing (self-managing) systems. In such ideal environments, humans delegate configuration management to autonomic computing systems.

Intuitively, everyone believes that building such an autonomic system is "difficult" or even impossible because configuration management is difficult. But no one has mathematically analyzed why it is difficult, where the difficulties come from, and how to reduce the complexity. This work intends to fill in these details. We believe that fundamental theories are necessary to manage the complexity of configuration management. So far, our contributions in building fundamental theories have included:

• an understanding of the nature of the configuration process;
• a theoretical model of configuration management;
• a theory of reproducibility of configuration operations;
• formal definitions of various limits on configuration operations, and a discussion of their impact on the complexity of configuration management;
• formal definitions of composability of configuration operations with or without various limits;
• a formal definition of dependencies between components of a system;
• proof that composability of configuration operations is an NP-hard problem and proof that a complete dependency analysis is intractable, so that automation of configuration management is intractable in the general case;
• a summary of techniques used in practice that make configuration management tractable; and
• solutions to tractability that arise from and are suggested by computation theory.

This thesis describes all of the above issues in detail. Chapter 2 presents a general landscape of system administration. Chapter 3 is an introduction to configuration management. Chapter 4 summarizes challenges that arise in configuration management. Chapter 5 discusses various strategies used to cope with the difficulty of configuration management and the vision and promise of autonomic computing. Chapter 6 contextualizes the configuration process; this discussion serves as the link between configuration practice and our theoretical model and discussions. Chapters 7 through 13 describe theoretical models of configuration management, reproducibility, composability, and dependency analysis, and discuss their computational complexity and the ways in which system administrators keep these tasks tractable.
Finally, we draw some conclusions and lessons for the future in Chapter 14.

Chapter 2

Landscape of System Administration

2.1 Definition of System Administration

System administration is an emerging field in computer science and computer engineering that concerns the operational management of human-computer systems. It is a body of knowledge and technologies that enables an administrator to initialize a system to states that satisfy users' needs to produce work, and to keep the system in those desired states even as interactions with users tend to cause it to drift away from them. System administration concerns every possible action involving the system and every level in the hierarchy of the system, from machine to user.

System administration is distinguished from system management and autonomic management by being a human-centered activity, in which users and system managers are considered on an equal basis with computer systems. This leads to an understanding of "human-computer communities"[17] in which human and non-human entities interact to achieve some common goal. System administration as a practice requires a broad range of knowledge and skills, including an understanding of system dynamics and administrative techniques as well as an understanding of human psychology and people skills.

System administration focuses upon: the real-world goals for, services provided by, and constraints on computing systems; the policy and specification of system structure and behavior, and the implementation of these policies and specifications; the activities required in order to develop an assurance that the specifications and real-world goals have been met; and the evolution of such systems over time. It is also concerned with the processes, methods, and tools for managing computing systems in a cost-effective and timely manner.

System administration as a practice has existed since the first computer was invented in the 1940s. However, it has not been formally recognized as a branch of computer science, and it is relatively young compared to other traditional computer science disciplines such as software engineering, artificial intelligence, and theory. Good starting points for understanding system administration include the textbooks[17, 19] by M. Burgess. The Large Installation Systems Administration Conference (LISA)[93] is the flagship conference of the system administration community; the proceedings of this conference reflect a broad view of practice across the field of system administration. Additionally, the book Selected Papers in System and Network Administration[3] offers a roadmap of the development of system administration through samples of LISA papers. Other research literature on system administration can be found in the proceedings of the USENIX Security Symposium, the Conference on Integrated Network Management (INM), and the Network Operations and Management Symposium (NOMS). SysAdmin[2] is a periodical that affords a good overview of state-of-the-art technologies in system administration. Large scale system administration discussions are distributed through community-wide mailing lists[1] and through professional service sites such as http://www.lopsa.org and Sun Microsystems' http://bigadmin.sun.com.

2.2 Taxonomy of System Administration

E. Anderson and D. Patterson described an initial taxonomy of system administration in [5]. We base our taxonomy on theirs with some modifications.
System administration can be categorized as interactions between two entities: management and tasks. The relationship between these two entities is that almost every task is associated with management activities, such as policy making, configuration management, maintenance, and training. For example, the task of managing a particular service includes making policy decisions, configuring the service to achieve the goals listed in these policies, maintaining the desired configuration and behavior, and training other system administrators to manage it and users to use it.

On the management side, system administration can be divided into four categories:

• Policy Making – deciding high-level goals for the system to achieve;
• Configuration Management – initial configuration or reconfiguration of the network of computers according to policies or policy changes;
• Maintenance – bringing the system back to a desired state when the system is degraded either by failure of system components or by poor performance;
• Training – improving administrators' skills and training users to take better advantage of the system. System scale increases, new technologies emerge, and system administrators and users leave and arrive; these factors make training an essential process for system administration.

Maintenance as an ongoing activity is driven by two factors: system entropy and the need for changes in policy. The entropy of the system, defined as the extent to which the actual configuration or behavior of the system is statistically unknown or unpredictable, tends to grow when inconsistent configuration operations are applied to a collection of systems, resources are consumed by users, software malfunctions, viruses attack or spam arrives, and malicious operations break programs. Maintenance is the process of restoring order out of this disorder.

System administration can also be grouped into separate categories based upon commonly performed tasks[5]. Categories based on tasks include:

• Services – internal services provided to the system and external services provided to users; examples include backup, mail, printing, NFS, DNS, Web, and database services;
• Software Installation – operating system installation, application installation, software packaging, and user customization;
• Monitoring – helping administrators determine what is happening in the system and network; this includes system and network monitoring, resource accounting, data display, benchmarking, configuration discovery, and performance tuning;
• Management – site configuration, host configuration, site moves, fault tolerance, etc.;
• User Management – management of user accounts, documentation, policy, user interaction, etc.;
• Improvement – training administrators, software design, models, and system self-improvement; and
• Miscellaneous – trouble tickets, secure root access, general tools, security, file synchronization, remote access, file migration, resource cleanup, etc.

However, the boundaries between these categories are often blurred. Many current configuration tools deal with more than one kind of task. For example, configuration management and maintenance are combined when using Cfengine[15, 16], a "convergent" tool designed to bring the system into conformance with system requirements. Each requirement can be the original one or a modified version of the original requirement; Cfengine does not know the difference between requirements that have been modified and those that remain the same as before.
Other examples are LCFG[6, 8] and ISConf[90, 91]; both manage system configuration and software installation.

2.3 Some System Administration Tasks

We now examine some important tasks performed by system administrators in more detail. Software configuration and system configuration are omitted here because we will discuss them in later chapters.

2.3.1 Backup and Restore

This section is a simplified version of the backup discussion in Selected Papers in System and Network Administration[3]. Backup is about exploiting storage redundancy to increase robustness and data integrity in order to cope with errors and natural disasters[17]. A copy of data is made so that information can be restored in case the original is destroyed. Backup seems simple to accomplish until one includes requirements such as high availability of file-systems, the size of the data, ease and speed of backup and recovery, media and version management, scalability, and other site-specific needs.

Backups were traditionally performed using primitive system tools such as "dump" and "restore", but the subtleties of backup and contingency planning justified the creation of more complex tools to manage backup and restore schedules. The goal of creating a backup schedule is to be able to restore lost files in a reasonable amount of time without interfering with daily use of the system. Several tools have been developed for heterogeneous environments[69, 72], for use while systems remain online[82], and for backing up data from one system to spare disks on another system[89]. However, few of these have withstood the test of time like the Amanda backup system[83].

Backup methods are affected by constant technological improvements in backup device speed, density, robustness, cost, and interoperability with vendor-supplied backup software. When balancing backup speed against the cost of media, one finds different optimal solutions for small, medium, and large sites. Traditionally, backups have been recorded to tape because tape is relatively inexpensive. However, tapes are bulky and time-consuming to set up manually, so there is a point at which the cost of hiring someone to change tapes exceeds the cost of mirroring disks or of trying robotic solutions such as robotic tape libraries. Disk mirroring is an affordable solution for smaller sites, but it becomes expensive for sites with terabytes of data. Many sites utilize a mix of disk mirroring and tape or CD backup, for example utilizing the disk as a cache for the tape backup process[89].

2.3.2 User Management

User management consists of controlling the "interface" between users and computers. Several issues are involved: management of user accounts, policies related to computer use, resource management, and user help and support.

User Accounts

The tasks of managing user accounts include choosing authentication and authorization methods for determining account access and privileges; account creation and retirement; the scope, authorization, and privileges associated with each account; resource management, including quotas and sizes of user directories; and managing the login environments associated with particular accounts and privileges. These tasks are trivial if there are only a few users to manage. Scaling up the number of users, system privileges, account changes, or environments creates challenges for each task. Account management tools were first created with the goal of simplifying the account creation process.
Scripts were designed to automate the steps of accumulating appropriate information about users, managing passwords and other forms of user authentication, creating user file storage and directories, and changing the location of user files to match user needs[36]. Sites with thousands of accounts, such as schools, need to create large numbers of accounts quickly, because large populations appear all at once at the beginning of each term. They must also be able to retire accounts efficiently because of high turnover in the user population. In most account management systems, a central repository stores account information, while daemons or agents extract information from the database and create local accounts on the systems the user must access[85].

User Support

With large numbers of users to manage, providing effective user support is both difficult and crucial. Approaches to user support include helping users directly or in person, utilizing electronic communications, training users to make them self-sufficient, and documenting the answers to frequently asked questions.

One important part of user support is managing problem reports (also called "trouble tickets") from users. System administrators could not accomplish deployment of new services and changes in architecture if they also had to respond to users' problem reports in real time. It is much more efficient and common to utilize a "triage" approach in which junior system administrators interact with users directly and protect senior system administrators from too much direct interaction with users. A trouble ticket tool coordinates and documents interactions between system administrators and users. Early trouble ticket tools were email-only submission tools with centralized queues of requests[63]. Later, these systems were extended so that users could query the status of each problem report, and tickets could be assigned to particular administrators[81]. Systems were improved to support multiple submission methods, such as phone and GUI, and to support multiple request queues and request priorities[80].

Even with sophisticated tools for managing requests electronically, there are circumstances where direct user assistance is needed. Direct in-person support becomes difficult if the user who needs support is located at a remote site. The Virtual Network Computing model[10] is a way to allow an administrator to log onto an existing user session and guide remote users through resolving difficulties online. Various tools are available to help with this, including remote desktop utilities for Windows.

User Policies

There is a direct relationship between an organization's policies concerning users and the expense of maintaining computing infrastructure. A good policy should clearly state:

1. rules about what users are allowed or not allowed to do, such as whether they can install new software;
2. specifications of what mandatory enforcement users can expect, e.g., whether certain kinds of files are periodically deleted; and
3. other regulations and promises, e.g., policies on privacy, record keeping, retirement of accounts, user privileges, etc.

Since policy-making is closely related to the requirements and business model of the local environment, few system administration researchers have studied this topic in the context of system administration, though there is much research on the role of policy in business that is not known to the system administration community.
Zwicky et al.[98] made the important point that system policy lies at the root of system consistency and integrity. The effects of policy upon system administration and cost are at best poorly understood, and there is little information on the range of policies that are currently in use or effective.

Resource Management

Resource management is closely related to policy-making. For example, there are ongoing debates on how to manage users' storage space. One approach is to enforce disk quotas, which strictly limit the amount of disk space users can access. One problem with quotas is that they are so restrictive that users cannot create large temporary files, especially during program development and debugging. Another approach is to utilize a "tidying policy"[47, 97] in which certain kinds of files (with limited usefulness) are deleted regularly. For example, all core files can be deleted periodically, on the grounds that a core file is only useful for debugging an application immediately after a bug is encountered. Other uses of tidying include removing files that are not related to job function, e.g., movies and mp3 songs. However, this approach exerts less control over disk resources and does not avoid all resource problems (including running out of disk space).

2.3.3 Service Management

Services are specialized tasks performed on behalf of programs or users, e.g., mail, printing, NFS (Network File System), Web, DNS (Domain Name Service), and database services. Services are provided by system processes called daemons, which run on servers and are reachable at a specific IP address and TCP or UDP port. A service has a listening socket (an IP address and TCP or UDP port) that responds to client requests; each accepted request is handled over a temporary communication socket that is closed when the service completes.

Mail

Electronic mail is the application most used by end-users on a regular basis. Very early research in mail targeted interoperability between the wide variety of independently developed mail systems. This research, together with the reduction in variety over time and the adoption of the Simple Mail Transfer Protocol (SMTP) as a standard mail interchange protocol, solved the interoperability problem. Research then turned to flexible delivery and automating mailing lists[1]. There was then a brief pause in the research. However, as the Internet continued to grow, research on scaling delivery of mail both locally and in mailing lists[64] was needed. At the same time, commercialization caused spam to become a problem[54].

Printing

Printing covers the problems of getting print jobs from users to printers, allowing users to select printers, and getting errors and acknowledgements from printers back to users. Early research in printing merged together the various printing systems that had evolved[42]. Once the printing systems were interoperable, printing research turned to improving the resulting systems, making them easier to debug, configure, and extend[77]. As sites continued to grow, scaling the printing system became a concern, and recent papers have looked into what happens when there are thousands of printers[95].

NFS

NFS, the abbreviation for Network File System, is a network service that allows network users to access shared files stored on computers of different types. NFS provides access to shared files through an interface called the Virtual File System (VFS) that runs on top of TCP/IP. Users can manipulate shared files as if they were stored locally on the user's own hard disk.
DNS

DNS, the abbreviation for Domain Name Service, is a service that translates Internet domain names into numerical IP addresses. Because domain names are alphabetic, they are easier to remember. The Internet, however, is actually based upon IP addresses. Every time one uses a domain name, therefore, a DNS service must translate the name into the corresponding IP address. For example, a lookup of the domain name eecs.tufts.edu returns the numerical IP address of the host that serves that name.

Web

A Web Service is an Internet service that is described via WSDL (Web Services Description Language) and is capable of being accessed via standard network protocols such as (but not limited to) SOAP over HTTP. Apache has been the most popular web server on the Internet since 1996. Apache is an open-source HTTP server for modern operating systems including UNIX and Windows NT. It is a secure, efficient, and extensible server that provides HTTP services in sync with the current HTTP standards.

Database

The database service is a service used to manage and query a database. Database service is provided by a database management system (DBMS), a suite of computer programs designed to manage a database and perform operations on the data as requested by perhaps numerous clients.

2.3.4 Security

A network is secure if data is protected from harm, resources are protected from misuse, data is kept confidential, and systems remain available for their intended purpose. Assuring the security of networks involves systematic strategies and approaches to protect critical data and resources. Security practices include assurance of:

• data integrity – validity of data;
• service or application integrity – service availability and conformance to specifications;
• data confidentiality; and
• authentication and authorization – assuring that the proper people are allowed to perform particular tasks or access particular data.

Security was neglected in the early days of computing systems, since all users and hosts were assumed to be trustworthy. The spread of the Internet Worm in 1988 challenged the naive trust model and redefined the notion of security of computer systems. Since then, security has evolved into a pursuit in its own right with its own conferences and intellectual traditions.

Security cannot exist without a security policy: a clear definition of what is to be protected and why. A clearly defined policy is used to create a security plan and architecture, based upon the possible threats and the risks associated with each threat. A system can be compromised by[17]:

• physical threats: weather, natural disaster, bombs, power failures, etc.;
• human threats: cracking, stealing, trickery, bribery, spying, sabotage, accidents; and
• software threats: viruses, Trojan horses, logic bombs, denial of service.

One implements a security policy by methods that include access control, intrusion detection, firewalling and filtering, and backup and restore (discussed in the previous section).

Access Control

Access control determines what specific users can do according to security policy. Access control has two parts: authentication and authorization. Authentication is any process by which one verifies that someone is who they claim to be. Traditionally, authentication has been based on shared secrets (e.g., passwords) used in conjunction with cryptographic algorithms. When authentication is based on cryptography, an attacker listening to the network gains no information that would enable it to falsely claim another's identity.
There are two main approaches to the use of encryption: shared-key encryption algorithms and public-key encryption algorithms. Kerberos[88], a network authentication protocol, is the most commonly used example of a shared-key algorithm; it is designed to provide strong authentication for client/server applications. Public-key algorithms[65], with less overhead than shared-key algorithms, are widely used in authentication for their great convenience and flexibility. The difficulty with public-key algorithms is that they require a global registry of public keys. With the explosive growth of the Internet, new authentication schemes integrating shared-key and public-key methods appeared[84] to address the scalability of network security infrastructures, which must manage millions of geographically distributed transactions within a single realm of trust. Recently, related techniques such as smart cards (used in mobile phones) and biometrics (fingerprints and iris scans)[22] have also been tried as authentication methods.

Authorization is the process of granting or denying access to a network resource. This is usually determined by finding out whether a person, once identified, is part of a particular group of users. The most common authorization mechanism is the access control list (ACL), which is a list of digital identities along with a set of actions that they may perform on a resource (also known as permissions). Security groups simplify management because an ACL can have a few entries specifying which groups have a specific level of access to a resource. With careful group design, the ACL should be relatively static. One can change the authorization policy for resources by manipulating the members of a group maintained by a centralized authority, such as a directory. Nesting groups within each other increases the flexibility of the group model for managing authorization.

Monitoring and Intrusion Detection

Monitoring is usually done in non-intrusive ways and can be applied to production systems without impacting performance. Monitoring comprises four components:

1. data collection and generation;
2. data logging or storage;
3. analysis; and
4. reporting.

There has been a lot of work on gathering data from specific sources, from file and directory state[79] to OC3 network links[9]. The collected data is usually logged in a fairly basic way, often through syslog or some flat file. Using a relational database to log the raw data and convert it to a standard form for queries was explored in [41]. Later, generic monitoring infrastructure[4, 53] was developed. Data analysis has not received nearly the attention it deserves. Data collection techniques are only useful if the data can be used to identify problems. Swatch[52] can send email to or page system administrators when things seem to go wrong. To assist analysis of large segments of monitoring data, visual and audio approaches have been used[45, 46].

Intrusion detection refers to the process of detecting and perhaps correcting security breaches such as cracking, invasion, corruption, or exposure of private data. It can be achieved by monitoring the system's state, including requests for service, the filesystem, or the contents of system logs. There are many monitoring tools available. Network Flight Recorder utilizes a scripting language to allow customization for site needs. TCP wrappers[94], which act as proxies for services, are also used to reject undesirable requests according to access rules.
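As a concrete illustration of the log-based monitoring style of intrusion detection described above, the following minimal sketch scans a system log for suspicious patterns and collects alerts. The log path and the patterns are illustrative assumptions only; real tools such as Swatch[52] are far more capable and configurable.

```python
import re

# Illustrative assumptions: the log location and the "suspicious" patterns vary by site.
LOG_FILE = "/var/log/auth.log"
SUSPICIOUS = [
    re.compile(r"Failed password for (\S+)"),   # repeated login failures
    re.compile(r"Accepted password for root"),  # direct root logins
]

def scan_log(path=LOG_FILE):
    """Return log lines that match any suspicious pattern."""
    alerts = []
    with open(path, errors="replace") as log:
        for line in log:
            if any(pattern.search(line) for pattern in SUSPICIOUS):
                alerts.append(line.rstrip())
    return alerts

if __name__ == "__main__":
    for alert in scan_log():
        # A real monitor would page or e-mail an administrator here.
        print("ALERT:", alert)
```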
Another strategy for intrusion detection is to monitor the filesystems of target servers for the effects of intrusions. The Tripwire[92] tools, along with many other integrity checkers, allow one to create a "signature" for a filesystem based upon declarations of the dynamic properties of files and directories within the filesystem. Based on this declaration, Tripwire maintains a database of cryptographic signatures for whole filesystems and reports any deviations from the declared file and directory properties.

Firewalling and Filtering

A firewall is a system designed to prevent unauthorized access to or from a private network. Firewalls can be implemented in hardware, in software, or in a combination of both. A network firewall filters both inbound and outbound traffic. It can also manage public access to private networked resources such as host applications. It can be used to log all attempts to enter the private network and can trigger alarms when hostile or unauthorized entry is attempted. Firewalls can filter packets based on their source and destination addresses and port numbers; this is called address filtering. Firewalls can also filter specific types of network traffic; this is known as protocol filtering because the decision to forward or reject traffic depends upon the protocol used, for example HTTP, FTP, or TELNET. Sophisticated firewalls can also filter traffic by packet attributes, content, or connection state.

Firewalls cannot prevent damaging incidents carried out by insiders. Firewalls also have a significant disadvantage in that they restrict how users can use the Internet; in many places, these restrictions are simply unrealistic and unacceptable. A firewall can also be a single point of failure for a network and can constitute a traffic bottleneck; if the firewall itself is compromised, the whole network is at risk. Also, if a firewall rule set contains a mistake, it may not appropriately protect the network. Since these rule sets are complex to craft and validate, mistakes are common. The firewall is only one component of a secure system; it must be used in combination with other access control methods.

2.3.5 Testing and Quality Assurance

Testing that systems conform to human desires and requirements is an important process that supports quality assurance. What is the cost of not testing? For business- or security-critical systems, such as online banking, any failure of the system might cause customer dissatisfaction, lost transactions, lost productivity, lost revenue, lost customers, penalties, or threats to the organization that owns and operates the system. Even where a system does not provide critical services, failures might still have a serious impact upon users.

Testing in system administration determines whether the system meets its requirements and ensures that it does not violate policy rules. Positive testing is used to validate the correctness of certain desired behaviors. Negative testing ensures that a system does not do what it is not supposed to do. Monitoring can be considered one kind of testing that gathers information about the behaviors of the system and issues warnings if it finds potential problems. Testing is related to almost every subarea of system administration, especially performance, security, system integrity, and change management. Unfortunately, research on testing in system administration is limited if not absent.
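To make the distinction between positive and negative testing concrete, the following minimal sketch probes a host's network behavior: a positive test checks that a service that should be offered (here, a web server on port 80) accepts connections, and a negative test checks that a service forbidden by policy (here, telnet on port 23) is refused. The host name, ports, and policy are hypothetical choices made only for illustration.

```python
import socket

HOST = "www.example.org"   # hypothetical host under test

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def positive_test():
    # Policy says a web server must answer on port 80.
    return port_open(HOST, 80)

def negative_test():
    # Policy says telnet (port 23) must not be offered.
    return not port_open(HOST, 23)

if __name__ == "__main__":
    print("web server reachable:", positive_test())
    print("telnet disabled as required:", negative_test())
```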
In practice, administrators' confidence in the system mostly comes from experience gained by collecting and documenting previous system failures and the workarounds that addressed them. Unlike software engineering, where testing is intensively studied and performed in a systematic way, system administration suffers from simple, ad-hoc testing methods that make no attempt to assure complete system function. In fact, studies indicate that testing consumes more than fifty percent of the cost of software development[38]. In contrast, system administration, which is so crucial to modern organizations, pays much less attention to testing than necessary.

The difficulty of performing testing is that the system under test must always be "alive and online", available to provide services and resources. Herein lies the key challenge: on one hand, testing is crucial to provide quality assurance; on the other hand, the requirement that the system constantly serve its mission does not allow for rigorous testing of the production machines. Techniques for testing the system while it remains online include "trace-driven parallel execution", in which a system under test is subjected to the same load as a production system and the results are compared[66]. Normally, testing is done before the system or application is deployed, so the development of benchmark suites is essential. The development of tools to provide support and automation of testing is also vital.

Chapter 3

Introduction to Configuration Management

Configuration management is the ongoing process of maintaining the behavior of computer systems, networks, or software systems, and assuring that they serve the missions of the human organizations that utilize them. Clearly, configuration and maintenance are overlapping issues. Maintenance is a phase of configuration that deals with creeping decay and changing requirements. All systems tend to decay into chaos with time due to management changes and somewhat unpredictable interactions with users[25]. Theoretically, in any closed system, the entropy (a measure of disorder) tends to increase with time unless activities from outside the system restore order.

Configuration management comprises two closely related parts: software configuration management and system configuration management. These parts are similar in that they control behavior through data in files or databases. They differ in what they control. System configuration management concerns the configuration of the whole system and network, while software configuration management concerns the configuration of one or more software packages on one or more hosts. Many system configuration tools deal with both system configuration management and software configuration management. The focus of this thesis is system configuration management. However, many mechanisms and technologies used in system configuration management are applicable to software configuration management. In the following sections we introduce software configuration management and system configuration management in more detail.

3.1 Software Configuration Management

Software configuration management covers the problems of managing the software installed on computers. There are two types of software: operating systems (OS) and applications. An OS is installed either by copying an image to the local hard drive or by booting the new machine off some other media (e.g., floppy disk, CD).
Installation of an OS is often destructive; everything on the disk can be deleted during the installation process (except for Windows systems). System administrators must plan on restoring the information if the reinstallation was in error. Since the OS is bottom-level software, any non-OS applications at higher levels may be affected by changes to the OS. The problem is that the subroutines and system calls that software must utilize to communicate with the outside world are part of the OS. That is why operating system upgrades, or migration from one type of OS to another, are often a problem and consume a large amount of effort[7].

Nowadays, application software is usually contained in packages, which are collections of related files[78]. When installed, a software package unpacks into one or more directories. There are two kinds of software applications: free and proprietary software. The former is typically available in source form as well as binary form; the latter may only be available in binary form. Thus software installation might be accomplished in two different ways: installation from source code or installation from binaries. Free and open source software is usually distributed in source form and must be compiled. Commercial software is usually installed from a CD by running an installation program.

Software package management is a complex issue, mainly because different software packages are designed by different developers and tested in different environments. A software package might require one type of operating system, sufficient disk space and memory, the proper version and location of shared libraries, the existence and appropriate versions of other software packages, etc.

One widely adopted approach for software distribution is the Depot model[68]. In the Depot scheme, separate directories of executable software packages are maintained for different machine architectures under a single file tree. Software packages installed under the Depot tree are made available within the filesystems of client hosts via symbolic links.

The RedHat Package Manager (RPM) is an open packaging system, available for anyone to use, which runs on Red Hat Linux as well as other Linux and UNIX systems[21]. RPM installs, updates, uninstalls, verifies, and queries software. RPM is the baseline package format of the Linux Standard Base. The "RPM database" consists of a doubly-linked list that contains all information for all installed packages. The database keeps track of all files that are changed and created when a user installs a program and can therefore very easily remove the same files. However, the database suffers from contradictory package version dependencies and incomplete and outdated documentation[55].

3.2 System Configuration Management

Network and system configuration management is the process of maintaining the function of computer networks in alignment with some previously determined policy. A policy is a list of high-level goals for system or network behavior. This high-level "policy", describing how systems should behave, is translated into a low-level "configuration", informally defined as the contents of a number of files within the system that affect system behavior[35].
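As a minimal sketch of this policy-to-configuration translation, and assuming an invented policy format and an invented configuration file syntax (real tools differ widely), a set of high-level policy decisions might be rendered into low-level configuration contents as follows.

```python
# Hedged sketch: translating high-level policy into low-level configuration.
# The policy dictionary and the generated file format are invented for illustration;
# they do not correspond to any particular tool's syntax.

policy = {
    "web_server": True,    # "this host should serve web pages"
    "telnet": False,       # "remote logins via telnet are forbidden"
    "mail_relay": False,
}

def render_configuration(policy):
    """Translate high-level policy decisions into lines of a low-level config file."""
    lines = []
    for service, enabled in sorted(policy.items()):
        state = "enable" if enabled else "disable"
        lines.append(f"{service} = {state}")
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    # In practice a tool would write this text to a file that a daemon reads at startup.
    print(render_configuration(policy))
```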
3.3 How Configuration Controls Behavior

Configuration controls the behavior of a system by a variety of methods:

• Configuration text files. This method controls the behavior of computer systems by specifying and controlling the contents of files stored in some form of non-volatile storage such as disk or flash memory. For example, system services such as telnet, rlogin, and finger are controlled by a line in the configuration file /etc/inetd.conf or via files in the directory xinetd.d. These files specify in detail how each host should behave. The mechanism behind configuration files is that certain computer programs, called daemons, read the contents of the files and provide or prevent specific behaviors. These files are called configuration files and, by convention, are referred to collectively as the configuration of the system. The contents of configuration files are not controlled by regular users and do not change due to the actions of non-administrators. In this case, configuration management is the process of specifying, modifying, and otherwise managing the contents of these configuration files.

• Databases. In Windows systems, many host parameters are configured in a database called the system registry. Many configuration tools maintain a central database that specifies configurations for a network of hosts. The configuration data in the central database is pushed to or pulled by agents installed on those hosts. The agents then change the contents of configuration files on each host according to the central database in order to produce the specified behaviors.

• Transmitted protocols. The Simple Network Management Protocol (SNMP) is a protocol designed to monitor the performance of network hardware. It is suited to non-interactive devices like printers and static network devices like routers. The management console can read and modify the variables stored on devices, and devices can issue notifications of special events. In spite of its limitations, for example weak security, SNMP remains the protocol of choice for the management of most network hardware, and many tools have been written to query and manage SNMP-enabled devices.

System administrators oversee and take part in the configuration process in a variety of ways, by performing manual configuration changes on one computing system at a time, or perhaps by invoking computer programs that accomplish similar changes for a single system or for the network as a whole.

Chapter 4

Challenges of Configuration Management

Nowadays, more and more organizations involve computers and computer networks in their daily work. As computer systems become players in complex communities (such as the stock market), the expected results of governing these communities become more and more challenging to assure. The complexity of system administration arises both from the complexity of the machines and from the demands of their roles in human-computer communities. The unique challenges of configuration management as a practice include frequent changes in policy and technologies, large scale, assemblies of imperfect software and hardware, heterogeneity, contingencies, diverse users, ineffective collaboration of multiple system administrators, service guarantees, and mobile environments.

4.1 Change

A fundamental characteristic of modern computing systems is that they need to be delivered and modified rapidly in tight time frames, while the requirements for such systems are constantly changing.
Changing requirements are the source of complexity behind many major issues in system administration. Given new requirements, system administrators must plan what to change in order to meet those requirements. Any change involves risk. They must have an idea of what major things will be affected by such a change; dependency analysis is needed to show the consequences of actions. And in many situations, changes must be made without disruption to mission-critical services, for example an online banking service. To accomplish a change, system administrators might need to install and configure new hardware and software, upgrade old software, reconfigure the system and network, and repair the bugs caused by the change or be able to roll back to a previous state if the change has major problems.

4.2 Scale

The scale of systems is a large part of the challenge. A typical operating system contains several thousand files; a large number of other "packages" (drawn from a cast of thousands) may be added over time. The number of software packages installed in a network of computers can be unbelievably large. For example, at the site of the Computer Science Department at Tufts University (which is a typical departmental site), our study in the year 2001 found around one thousand software systems comprising about ten thousand programs installed on our network. Within a network, there are tens, hundreds, or thousands of machines, each of which might have a different architecture, hardware devices, operating system, and installed software applications. Fully installed and configured systems thus tend to be both unique and complex, unless steps are taken to limit uniqueness and/or complexity.

Scale poses several unique problems for system administrators. Tasks that were formerly accomplished by hand for small numbers of stations become impractical when system administrators must configure thousands of stations, e.g., for a bank or trading company. In a large enough network, it is impossible to assure that changes are actually made on hosts when requested; a host could be powered down, or external effects could counteract the change. This leads to strategies for ensuring that changes are committed properly and are not altered inappropriately by other external means, e.g., by hand-editing a managed configuration file.

4.3 Interdependence of Software and Hardware

Hardware and software in a system need to have all required elements, qualities, and characteristics, and they can change in character over time as hardware ages and software is revised. Certain physical requirements must be met for hardware to function, e.g., temperature, humidity, and physical connections to power or other equipment. Hardware normally comes from different manufacturers; different hardware components are not necessarily compatible and are not guaranteed to work together when combined or connected. The type of hardware limits the kinds of software that can execute on it.

Software is provided by many developers and vendors. Different pieces of software often place differing, and even contradictory, requirements on the system. These multiple software systems share the same resource space, so resource competition happens constantly. The developers of software cannot completely foresee the various complex environments in which their software will be executed. Thus assuring the correct environment to support the function of one kind of software can make another kind of software work poorly or fail.
Operating systems and programs often contain bugs and emergent features that were not planned or designed for. System administrators must balance performance and cost and be vigilant of changes in performance, configuration, or requirements that might cause failures of software or hardware. 4.4 Heterogeneity A typical managed network is composed of computers each with thousands of pieces of different hardware and software. These components have complex dependency relationships and share and compete for a common set of resources. One factor that increases management complexity is heterogeneity: variations in architecture or configuration among populations of machines. Each machine is potentially different from all others and they may also carry out different roles, i.e., as servers and clients. They might have different architectures, hardware devices, operating systems, or installed software applications. The most common form of heterogeneity arises when a set of machines vary in architecture (e.g. SPARC versus x86). However, heterogeneity appears in other and more subtle ways, including in software environments. For example, hosts that are in different configuration states can be viewed as having “heterogeneous configurations”. Hosts that must behave differently than others exhibit “heterogeneous behavior”. Heterogeneity increases the cost of management, since system administrators must take into account differences between hosts. A highly heterogeneous network can make it difficult to deploy system-wide configuration changes, since the actions to make a change (or even the nature of the change) might differ on each station whose behavior should change. Unintentional heterogeneity arises from utilizing different solutions to the same problem for no 24 defensible reason. E.g., a group of administrators might all choose different paths in which to install new software, for no particularly good reason. Unintentional heterogeneity is a management problem. However, intentional/controlled heterogeneity with a defensible reason is used to make the system more robust and secure. A heterogeneous network is less vulnerable to a single type of failure or security exploit[44]. System administrators must balance the needs of uniformity and heterogeneity and avoid unintentional heterogeneity. 4.5 Contingency A contingency is an event whose occurrence causes system problems, and which may or may not occur depending upon conditions and other factors. For example, the hard disk of a system may fail. This is a contingency because if it does occur, steps must be taken to address the failure. A configuration can be modified by many sources, including package installation, manual overrides, or security breaches. Unintended changes in configuration represent one form of contingency. These contingencies often violate assumptions required in order for a particular strategy to produce proper results. One problem that plagues configuration management tools and strategies is the creation of latent preconditions. Like any software program, every configuration management operation requires some pre-existing conditions to assure an appropriate behavioral result for the operation. A host requirement necessary to assure a desired effect for an operation is called a precondition of the operation. 
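The notion of a precondition can be illustrated with a small sketch. The following Python fragment is a hypothetical example rather than a real installation tool: the disk-space threshold, the required path, and the package name are invented. Any requirement that is omitted from such an explicit check remains unknown to the operation and can surface only as a failure after the operation is applied.

import os
import shutil

# Hypothetical preconditions for an "install a package" operation.
def preconditions_hold(free_bytes_needed, required_paths):
    if shutil.disk_usage("/").free < free_bytes_needed:
        return False                              # not enough disk space
    return all(os.path.exists(p) for p in required_paths)

def install_package(name):
    print(f"installing {name} ...")               # the real work is elided

if preconditions_hold(50 * 2**20, ["/usr/bin/tar"]):
    install_package("openssh")
else:
    print("preconditions not met; refusing to modify the host")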
A latent precondition of an operation is a precondition that is not known by the administrator beforehand, but whose absence causes a behavioral problem after the operation is applied, for some subset of hosts within a population. Due to software and hardware heterogeneity, often an individual system possesses a software or hardware property whose presence cannot be easily detected except through failure of the system as a result of a configuration change. Latent preconditions are due to lack of complete knowledge of dependencies between different components of system. For example, very commonly, the act of replacing a dynamic library to repair one application might break another application. 25 4.6 Diverse Users Without users, system administration is trivial but meaningless. Users are both the reason for computers to exist and their greatest threat. Each user has different background, views and demands for the services of systems and networks. For example, some user might want to use older versions of a dynamic library because his/her application depends on it; in the mean time, some other user might request a newer version of the library to have more functions or faster speed. A system administrator must balance all kinds of needs and in the mean time ensure the stability and security of the system. For the common benefit of the whole community, policies must be planned and enforced. The group of users is not held constant, either. For example, in a university environment, large groups of students enroll and graduate each year. System administrators need to periodically create and delete user accounts and adjust storage space accordingly. 4.7 Ineffective Collaboration of Multiple System Administrators Large scale systems often require a team of system administrators to work collaboratively so that the day-to-day system administration work remains manageable. The ideal situation for a system administrator is that his or her domain of change matches his or her domain of responsibility: what is controlled matches precisely what one is responsible for controlling. This ideal, however, is never achieved in practice; the typical system administrator has power over subsystems for which he or she has no responsibility, and vice versa. Since there is no practical way to control privileges so that an administrator’s domain of charge and control matches a corresponding domain of responsibility, conflicts can arise when more than one person works on configuring or changing the same aspects of a system. Since different system administrators have different backgrounds, skill levels, and computing language preferences, good team disciplines including documentation, effective communications and appropriate delegation of tasks are essential. 26 4.8 Mobile Environments Additionally, mobile computing devices are becoming more and more pervasive: employees need to communicate with their companies while they are not in their office. They do so by using laptops, PDAs, or mobile phones with diverse forms of wireless technologies to access their companies’ data. Mobile devices and users add enormous challenge to configuration management for ensuring security policies. 4.9 Service Guarantees The requirement for service guarantees of modern systems has never been in such high demand. For many companies, the system must be “alive and online” all the time. A minute of down time might cause thousands of dollars of lost in business. 
System administrators must be able to react effectively, efficiently, and unintrusively to accomplish desired changes, without unacceptable downtime. They must also be able to quickly analyze problems and correct configuration mistakes. In the mean time, high performance expectations do not allow system administrators to perform intrusive experiments to search for or optimize solutions. 27 Chapter 5 Automation and Autonomic Computing A general problem of modern computing systems is that their complexity is increasingly becoming the limiting factor in their further development. Large companies and institutions are employing large-scale computer networks for communication and computation. Distributed applications running on these computer networks are diverse and deal with many different tasks, ranging from internal control processes to presenting web content and to customer support. Automation is the use of control systems such as computers to control processes, replacing human operators. System administrators use automation to manage the implementation-level complexity. For example, replacing a manual configuration procedure with a computer program such as a script is a widely used technique in system administration. Autonomic computing is an industry-wide initiative started by IBM in 2001 and aimed at creating self-managing computer systems to overcome their rapidly growing complexity and to enable their further growth[57]. It is inspired by the autonomic nervous system of the human body. This nervous system controls important bodily functions (e.g. respiration, heart rate, and blood pressure) without any conscious intervention. Four functional areas are defined for autonomic systems: • self-configuration: automatic configuration of components; • self-healing: automatic discovery, and correction of faults; • self-optimization: automatic monitoring and control of resources to ensure the optimal functioning with respect to the defined requirements; 28 • self-protection: proactive identification and protection from arbitrary attacks. In autonomic systems, system administrators play a new role: they do not control the system directly; instead, they define general policies and rules that serve as an input for the self-management process. Autonomic computing is an advanced form of automation. There is a large gap between autonomic computing and current automation approaches such as scripting. Currently, system administrators still function at the implementation level; they are the “translators” from high-level goals to low-level machine configurations. Current tools only enable them to manage the low-level system with ease. In autonomic computing, system administrators are instead asked to interact with systems at the management level of making policies, leaving the configuration process completely to the autonomic systems (or configuring a baseline environment in which autonomic systems can take over and manage configuration thereafter). Our study benefits the development of automation of configuration management by describing some of its theoretical boundaries. Our study shows that configuration management, including dependency analysis and composition of configuration operations, is an intractable process in general without further constraints. Constraints which make configuration management tractable are also discussed in this thesis. In the following two sections, we give brief summaries of the history of configuration management and current configuration strategies. 
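As an illustration of this division of labor, the following toy Python loop sketches an autonomic-style agent: the administrator supplies only a policy (here, a latency target), and the agent repeatedly monitors the system, decides whether the policy is violated, and adjusts itself. The "system", the measurement, and the corrective action are all invented for illustration and do not correspond to any particular autonomic product.

import random

policy = {"max_latency_ms": 200}     # high-level goal set by the administrator
system = {"workers": 1}              # toy stand-in for a managed service

def monitor():
    # pretend measurement: more workers means lower latency
    return 400 / system["workers"] + random.uniform(-10, 10)

def plan(latency):
    if latency > policy["max_latency_ms"]:
        return {"workers": system["workers"] + 1}   # self-optimization
    return {}

for step in range(5):                # a real agent would loop indefinitely
    latency = monitor()
    system.update(plan(latency))
    print(f"step {step}: latency={latency:.0f}ms, workers={system['workers']}")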
5.1 History of Automation in Configuration Management The context of system administration is changing. When the first electronic digital computers were produced in the 1940s, compilation, linking, and loading of programs were entirely performed by human operators. Administration of computers was inevitably accomplished manually. The original administrators loaded programs and managed batch jobs on the predecessors of time-sharing machines. Very few people had the privilege to use a computer. The concept of maintaining a particular standard of operation was not present. All operations were “best-effort”. Thus major issues of current system administration, e.g. configuration management, user management, security, trouble shooting, testing, were very different tasks compared to their present form in modern computing systems. Later multi-user multi-privilege operating systems such as UNIX and VMS became more preva- 29 lent than batch operating systems. The separation of system management from normal user functions and the ability to manage unprivileged users from a privileged shell made it possible for a group of users to share computers and networks concurrently. The role of system administrator evolved from running batch jobs to managing multi-processing, interactive computing systems. Early Unix systems were often administered by volunteer computer users. Over time, as the complexity of computing systems increased, these volunteers were in no position to guarantee quality of services. Modern demands for service guarantees require dedicated professionals rather than volunteers. Early attempts at configuration management were all direct interactions between the administrator and the system. The early systems were administered solely by running specific commands and editing specific files. As these commands and files became increasingly complex, human error became a significant factor. Administrators responded by “scripting”; placing useful commands into files that could be replayed when needed. Scripts are not “programs” in the traditional sense of computer science and system administrators need not be programmers. Instead, they are lists of commands that can be replayed. Scripts can be used for scheduling services, retrieving data from large log files, installation and configuration of systems and applications, etc. General-purpose scripting languages used are sh, csh, Perl, Python, Tcl/Tk. As system administration tasks grow more and more complex, so do the scripts to automate those tasks. System administrators are “wired” into the system by the scripts since a substantial knowledge of the implementation details of those scripts are required. Nowadays, systems are rarely constructed from scratch; most system administration tasks involve managing pre-existing systems and integration with “legacy” infrastructure. Thousands of scripts with their preconditions and postconditions and other implementation details, e.g., complex dependency relationships, are embedded within the systems. And these systems are required to support business and industrial processes which are continually reconstructed and reorganized to meet changing users’ demands. This complexity grows beyond human management capacity, especially for large scale systems. Declarative management represents a paradigm shift from scripted management. 
The idea of declarative management is to let system administrators concentrate on making policies and rules for the system, and allow some robotic agents translate those policies and rules into implementationlevel instructions and carry out the tasks. The principal difference between agent-based and scripted management is that programmers write the agents, while system administrators traditionally wrote the scripts. The first attempt of declarative management was Site[51], site configu- 30 ration language, which allowed site configuration to be specified via a centralized configuration file. Cfengine[15, 16, 18] introduced the idea of convergence that repeated execution of Cfengine scripts would bring the system in conformance with its desired state specified in the configuration file. However, the specifications and declarations of Cfengine configuration files remain low-level specifications concerning system and network contents instead of describing behaviors and restrictions. Full-fledged policy based management has not come into wide use due to the cost of deployment. Research on policy based management becomes one of the most vigorous areas of system administration to ease the burden of managing complex human-computing systems. Closure[30] is a new model of configuration management based upon a hierarchy of simple communicating autonomous agents. Each of these agents is responsible for a “closure”: a domain of semantic predictability in which declarative commands to the agent have a simple, persistent, portable, and documented effect upon subsequent observable behavior. Closures are built bottom-up to form a management hierarchy based upon the pre-existing dependencies between subsystems in a complex system. Closure agents decompose configuration management via modularity of effect and behavior that promises to eventually lead to self-organizing systems driven entirely by behavioral specifications, where a systems configuration is free of details that have no observable effect upon system behavior. Autonomic computing[57] is a top-down version of closure, although all current implementations provide bottom-up features. IBM’s vision of autonomic computing[60] includes a vast and tangled hierarchy of self-governing systems that in turn comprise large numbers of interacting, autonomous, self-governing components at the next level down. Those autonomic systems and subsystems bear the characteristics of self-configuration, self-optimization, self-healing, and self-protection. They expect autonomic computing to involve enormous range in scale, from individual devices to the entire Internet. The difference between “closures” and “autonomic systems” lies in what they assume about the outside world. Autonomic systems often assume that the changes that can occur to a system are known in advance. Thus they assume a “closed world” surrounding the managed system. Closures assume that there is an “open world” of unpredictable events and are designed around dealing with the unexpected. They are also different in their original design mechanisms. The idea of closures came from the observation that the complexity of system administration is closely related to flexibility of system configuration and its relationship to the likelihood of human error. To reduce complexity, one must limit one’s environment and options to assure homogeneity and predictability, thus leading to simplicity of management. 
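The convergence idea can be illustrated with a small sketch. The following Python fragment is not Cfengine syntax; the file name and the desired line are invented. The operation declares a desired state and acts only when the system differs from it, so repeated execution is harmless and drives the host toward conformance.

import os

def ensure_line_present(path, line):
    """Convergent edit: add the line only if it is missing."""
    existing = []
    if os.path.exists(path):
        with open(path) as f:
            existing = f.read().splitlines()
    if line in existing:
        return "already compliant"        # nothing to do: convergence reached
    with open(path, "a") as f:            # creates a demo file in the working directory
        f.write(line + "\n")
    return "repaired"

# Running the operation twice performs the edit once, then reports compliance.
print(ensure_line_present("demo_sshd_config", "PermitRootLogin no"))
print(ensure_line_present("demo_sshd_config", "PermitRootLogin no"))

Restricting an agent to narrowly scoped, repeatable edits of this kind is one way of limiting options in order to gain predictability.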
On the contrary, autonomic computing intends to add 31 “intelligence” to the system so that it can manage complex situations. These two research groups have made progress in constructing components of their systems, i.e., websphere[58] from IBM and HTTP, IP closures[31, 96] from Tufts University. Both of these approaches encounter problems in dealing with legacy infrastructures. 5.2 Current Strategies of System Configuration Configuration management is a complex issue related to theory, practice and policy. There are several existing strategies of conducting system configuration, namely manual configuration, custom scripting, structured scripting, file distribution, and declarative syntax. We will give a brief introduction on these strategies in the following paragraphs. 5.2.1 Manual Configuration In manual configuration, configurations are made entirely by hand. Manual configuration is only cost-effective for small-sized systems with a few machines and a few users, but is often utilized even for large networks in which few changes in function are expected over time. Manual configuration has the advantage that system behavior is closely monitored during each step of configuration. Errors can be easily corrected. However, manual configuration is not feasible for large networks with frequent changes. 5.2.2 Custom Scripting In custom scripting, manual procedures are encoded into repeatable automatic procedures using a high-level language such as a shell, Perl, Python or domain-specific languages. Custom scripting is a first attempt at configuration automation. The main weakness of custom scripting is the difficulty of addressing and reacting to pre-existing conditions of a host. Scripts are often crafted in haste for one-time use and then employed over a long time-scale[27]. Each script requires preconditions that are often poorly documented if at all. Applying a script to a host that does not satisfy its preconditions leads to unpredictable results and variation in actual configuration. 5.2.3 Structured Scripting Structured scripting is an enhancement of custom scripting that allows one to create scripts that are reusable on a larger network by providing a framework that assures repeatable preconditions 32 and manages portability between heterogeneous sets of hosts. There are two basic approaches to structured scripting: execution management[90, 91] and variable instantiation[73]. Execution Management ISConf[59, 90, 91] structures configuration and installation scripts into “stanzas”: installation and scripts whose execution order is held constant across a population of hosts. On each host, maintenance of host state determines which stanzas to execute. The postconditions of the operations already completed are treated as the preconditions of the next (whether or not this is actually true), and these are always the same for a particular host. ISConf uses time stamp files to remember which stanzas have been executed on each host, and assures that hosts that have missed a cycle of configuration due to downtime are eventually brought up to date by running the scripts that were missed. The strength of ISConf is that when changes are few and the environment is relatively homogeneous, it produces repeatable results for each host in the network. This means that if one host is configured with ISConf and exhibits correct behavior, it is likely that other hosts will exhibit the same behavior as well. The fact that “order matters”[90] is a daunting limitation of the strategy. 
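The execution-management discipline can be sketched as follows; the state-file format and the stanzas themselves are invented and do not reproduce ISConf's actual input language. Stanzas run in a fixed global order, and a per-host record of progress lets a host that missed a cycle catch up by running exactly the stanzas it missed.

import json, os

STATE_FILE = "stanza_state.json"   # hypothetical per-host progress record

stanzas = [
    ("1-add-admin-group", lambda: print("groupadd admins")),
    ("2-install-ssh",     lambda: print("install openssh")),
    ("3-harden-sshd",     lambda: print("edit sshd_config")),
]

def completed():
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)["done"]
    return 0

def run_pending():
    done = completed()
    for name, action in stanzas[done:]:    # never reorder, never skip
        print(f"running stanza {name}")
        action()
        done += 1
        with open(STATE_FILE, "w") as f:
            json.dump({"done": done}, f)   # remember progress after each stanza

run_pending()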
Any new script can only be added at the end of all the stanzas or the stanza needs a complete rebuild. When using ISConf, it is impractical to delete a pre-existing stanza or change stanza order. This would make it possible to violate the preconditions of a stanza. Erroneous stanzas that misconfigure a host cannot be safely deleted; they must instead be undone by later explicit stanzas. Thus the size and complexity of the input file grows in proportion to changes and states, and quickly becomes difficult to understand if changes and/or errors are frequent. In this case, one must start over, re-engineering the entire script sequence from scratch and testing each stanza individually. A similar process of starting over is required if one wishes to apply an ISConf input file for one architecture to a different architecture; ISConf input files are not guaranteed to be portable. Variable Instantiation Another approach to structured scripting is to code the locations of common system settings as variables so that scripts can be written portably across operating systems. In Problem Informant/Killer Tool (PIKT)[73], common files are located through variables whose contents change to reflect operating system values, via a mechanism similar to that in Imakefiles. A well-engineered PIKT script 33 can be utilized on many different hosts. One shortcoming of PIKT is that a script can be written once but must be verified and validated on every kind of target platform. 5.2.4 File Distribution Practitioners were quick to understand the limits of custom scripting and have struggled for decades to design a more robust method of making configuration changes. The first attempts to replace custom scripting employed file distribution. In this strategy, one maintains master copies of crucial configuration files in a repository, and periodically automatically distributes these copies to managed hosts[23, 26]. This largely avoids the problems of sequencing encountered in custom scripts, but replaces these with problems of execution scalability and several capability limitations. File distribution schemes such as RDIST[23] rely upon a single master server that runs local commands on clients to force them into compliance. This is an inherently serial process that takes a very long time in large networks[26]. As well, the process is plagued by the fact that all knowledge of the variations in platforms has to be codified and stored on the central server, a daunting and error-prone manual task. RDIST also suffers from lack of ability to express precedence between related file copying operations, as well as excessive repository sizes as variations are required. One copy of each version of each file must be stored. Initial strategies for combating this version explosion include replacing simple distribution with remote execution of post-install scripts[26] that deal with portability issues. Proscriptive configuration generation is a technique for managing the combinatorial explosion that occurs in using file distribution for configuration management when networks are large and heterogeneous. A description of appropriate network behavior is translated into the precise configuration file contents that assure that behavior[6, 8]. The translation is either accomplished centrally and then transmitted to clients or generated by the clients themselves through use of a distributed agent. The agents installed on the hosts read a host configuration generated from that data. 
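A minimal sketch of this generation technique follows; the host descriptions and the resolv.conf example are invented and stand in for whatever database, XML, or plaintext description a site actually maintains. Exact file contents are derived from the description, either centrally or by an agent on each host.

# Sketch of proscriptive configuration generation: behavior descriptions in,
# exact configuration-file contents out. Hostnames and addresses are invented.
hosts = {
    "web01": {"domain": "cs.example.edu", "nameservers": ["10.0.0.2", "10.0.0.3"]},
    "db01":  {"domain": "cs.example.edu", "nameservers": ["10.0.0.2"]},
}

def generate_resolv_conf(desc):
    lines = [f"domain {desc['domain']}"]
    lines += [f"nameserver {ns}" for ns in desc["nameservers"]]
    return "\n".join(lines) + "\n"

for name, desc in hosts.items():
    print(f"--- generated for {name} ---")
    print(generate_resolv_conf(desc), end="")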
Ongoing administrative tasks include the description and the agent; as requirements change, new kinds of files must be generated. The description may be maintained in many ways, including databases, XML, or plaintext. Generating files from a master template minimizes the problems of unintended consequences encountered via the other methods, but changes the problem of debugging scripts to that of writing appropriate generators. 34 5.2.5 Declarative Syntax Declarative syntax is a configuration management strategy wherein custom scripts are replaced by an autonomous agent[15, 16, 18, 25]. This agent interprets a declarative configuration file that describes the ideal state of a host, then proceeds to make changes that bring the host somehow “nearer” to that ideal state. The main configuration management agent in contemporary use is Cfengine, whose declarations and operation bears a strong resemblance to logic programming[29]. It first determines a set of facts about the local system, then proceeds to correct any facts that are not in compliance with its idea of system health. The main benefit of declarative syntax over scripting is that we avoid forever the problem of writing and maintaining fragile custom software. Unlike ISConf, in which order and content must be preserved, Cfengine must instead preserve management of objects once they are managed in any form. A second benefit of Cfengine over all prior forms of configuration management is that the agent for a particular host has distributed authority about its own needs. This means that no central repository must be kept of data about individual hosts; they can be easily customized without maintaining a global snapshot of desirable state. The weakness of declarative syntax is “incremental burden of management”: once touched, a file must remain managed forever. For example, suppose one configures Cfengine to edit /etc/inetd.conf and then applies that script to most of the hosts in an environment, skipping those that are currently powered down. Thus heterogeneity (a difference) among hosts is constructed. Unless the system administrator accounts for the possibility of the change in all further use of Cfengine, it is possible that some hosts that were initially powered down will never receive the change, so that hosts that do not get the change might respond differently from hosts that do get the change. The current declarative syntax tools like Cfengine operate at the file contents level, i.e., they only declare what the contents of a file should be without validating meaning of the contents. This requires system administrators to manage the meaning of file contents. Few if any of the current configuration tools possesses the “intelligence” to make configuration a completely automated procedure. 35 Chapter 6 The Configuration Process In the theoretical study of this thesis, we show that, in worst case, configuration management is intractable. It is important to understand how these theories fit into the real world and how system administrators make configuration management tractable in practice. We make the link between our theory and practice by first observing how system administrators operate in the real world in this chapter and summarizing techniques used by system administrators that reduce the complexity of configuration management in Chapter 12. Documentation and experience are two important factors that we must discuss first before we move to the configuration process. 
6.1 Documentation Documentation and experience are two key factors that keep the system manageable. Documentation is the organized collection of records that describe the purpose, structure, requirements, operations, functional specifications, history of previous changes, and maintenance for a computing system or a system component such as a computer program or a hardware device. In our discussion, documentation is not limited to documents provided by the developers of a system or system component. Rather, documentation includes documents written down by system administrators at the same site or other sites (and perhaps posted on the Internet or published in papers and books) and documents provided by the developers of system components. Documentation plays a critical role in system administration. The ultimate goal of system administration is to keep the system in alignment with system requirements, rather than to develop 36 a full understanding of the system. With appropriate documentation, system administrators can bypass the analysis of complex implementation details of the system. However the analysis cannot be completely avoided because documentation is not always 100% accurate, and might not consider the (often complex) current environment, in which system administrators must work. Trust of documentation can be risky due to the limitations of these documents. A principal problem with documentation is that systems are extremely complex and documentation is usually incomplete. The reasons for this incompleteness include that the system can assume many more physical states than can be covered explicitly in the documentation, and these states perhaps correspond to physical behaviors too varied to document. Documentation can be incomplete because: • It does not document or describe the current state of the system. • It does not foresee the effects of particular sequences of changes. • It does not cover cases in which other modules interact with a given one. • It does not cover cases in which components are faulty and do not function as documented. The accuracy and feasibility of documentation must be judged by the system administrators based on their past experience, experiments and observation, or even intuition. Measured by sheer volume, many systems and system components are superbly documented. We have a wealth of online and printed documentation. Unfortunately, sheer volume is not everything or even (really) enough. If administrators, programmers, or users cannot find the information they want, in a reasonable period of time, the documentation fails its purpose. 6.2 Experience The experience of a system administrator determines the ability to choose correctly from multiple options for assuring a specific behavior for a system[56]. “Experienced” system administrators choose correctly from options more frequently than ”inexperienced” system administrators. “Experienced” system administrators know where to look in documentation for details of specific features, while ”inexperienced” system administrators may face a long search. The experience of a system administrator includes: • memory of solutions, • mental maps of documentation and where to locate specific facts and procedures, and 37 • mental maps of the expertise of peers, and who to contact for advice about specific problems. The experience of system administrators is very important to their overall performance in assuring correct system behavior. 
Hellerstein[56] pointed out that the complexity of a configuration task can be measured by how much expertise is needed to complete it. With a complex task, there is a large difference between the time taken by an experienced person versus the time taken by a novice. As modern computing systems become more and more complex, no system administrator can always configure the system from memory without consulting its documentation. The issue of how to find specific information is crucial. Thus experience of system administrators become more precious since experienced system administrators can efficiently and correctly decide which expert they should consult, which documents they should read, and what tests they should perform to validate the system. 6.3 The Configuration Process The configuration process can be characterized into three stages: learning, planning, and deploying. Each stage involves loops of activities. Figure 6.1 describes some qualities of the process and gives a high level picture; Figure 6.2 shows more details. During the learning stage, system administrators learn about the policy, the system, and the connections between them. They gather information in order to motivate the design and construction of solutions. System administrators normally first refer to their own memory to compose necessary actions to accomplish system requirements. At this stage, they may perform some tests of the system to verify and validate their past experience and their knowledge of the system. They may also consult other system administrators or refer to various documentation including the experience written down by other system administrators and documents provided by the system developers and vendors. The planning stage cannot be separated from the learning stage. System administrators plan what to do while they learn. But the emphasis of the planning stage is to come up with a sequence of operations that can transform the system to a desired state. Dependency analysis is the process of discovering interactions and relationships between entities of a system. The role of dependency analysis in system administration is difficult to describe, because such analysis does not arise as a separable process from learning, planning, and doing. Average system administrators avoid doing any kind of dependency analysis or perhaps choose to 38 do a very simple procedural dependency analysis: determining the order of configuration procedures that produces a specific result. For example, to install an ssh service, one must first install a network card (1), bring up the network connection (2), then download the ssh package (3), and last install the package (4). There is a dependence/order between different procedures. One must perform procedure 1 before procedure 2; otherwise procedure 2 will fail. Procedural dependencies are more abstract than static and dynamic dependencies (the static and dynamic interactions and relationships among entities of a system) in the sense that there are semantic reasons for the order of procedures. However, system administrators do not usually care about the semantic grounding of their actions, provided that the actions perform and produce results as documented. The result of planning is a sequence of operations that “should” be able to transform the system to a desired state. Normally, the sequence is chosen from “best practices” - shared practices, agreed upon by a group of system administrators for a site. 
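The procedural dependency analysis described above can be sketched directly: the ssh-installation procedures and their "must happen before" relations form a small directed acyclic graph, and any topological ordering of that graph is an acceptable execution order. The following Python fragment (which requires Python 3.9 or later for graphlib) is illustrative only; the procedure names are taken from the example above.

from graphlib import TopologicalSorter

# Each procedure maps to the set of procedures that must run before it.
depends_on = {
    "install network card": set(),
    "bring up network":     {"install network card"},
    "download ssh":         {"bring up network"},
    "install ssh":          {"download ssh"},
}

order = list(TopologicalSorter(depends_on).static_order())
print(order)
# e.g. ['install network card', 'bring up network', 'download ssh', 'install ssh']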
By doing this, system administrators bypass another difficult problem: composition of configuration operations, which we have shown to be another intractable problem besides dependency analysis[87]. The deployment of configuration operations can be done manually, through scripting, or by delegation. If it is done manually, system administrators work in a tight loop of deployment and testing. By scripting, system administrators put a series of configuration operations in one program and execute the program. By delegation, system administrators delegate the task to someone else. After the deployment, if the system satisfies requirements, system administrators terminate the process by updating the documentation if needed. If the system fails to demonstrate the desired behavior, system administrators loop back to the learning and planning stages. If they think the system is at some unmanaged state, they may choose to re-baseline the system and start from scratch with an initial configuration whose properties are well known. There are several characters that demonstrate the uniqueness of the configuration process: asynchrony of inputs, high possibility of partial completion of each step, nondeterminism of transition between two steps, looping and iteration among steps, and dynamism of people (user, management, and system administrators) and things (requirements, the documentation, and the system) involved in the process. • Asynchrony of inputs During the configuration process, system administrators interact with many people and things (Refer to Figure 6.3). Asynchrony means that system administrators decide what to do at particular time differently and depending upon inputs from other people at arbitrary times. 39 Figure 6.1: The configuration stages 40 Figure 6.2: The configuration process 41 Figure 6.3: Asynchronous interactions between system administrators and the environment For example, a system administrator starts to do something according to his or her original plan; he or she receives a asynchronous call from the management requesting something else, then users might request a third thing. He or she then needs to do some conflict resolution to orchestrate all these things. Even for a single thread, the configuration process can be asynchronous too. For example, a system administrator gets a request to add a user to a group. Based on his/her experience, he/she starts with editing /etc/groups, but it does not work. He/she then sends an email to an expert for advice. In the mean time, he/she seeks other options. He/she might try “yp”, a directory service. Before he/she makes the observation that yp does not work, he/she might get an asynchronous message from the expert suggesting LDAP. He/she then drops yp totally and tries LDAP and it works. The control flow of the process highly depends upon the inputs from other people that occur asynchronously such that a general control flow diagram is not feasible. • Partial completion of steps System administrators do not finish one step completely before moving to the next. They often only partially finish one step and move back and forth between two unfinished steps. For example, when they use documentation, they do not sit down and read the entire document; as soon as they find something helpful, they integrate it with their own experience and come up with a sequence of operations to accomplish the requirements, or they may perform tests to further validate the system and the documentation. 
42 • Non-determinism of paths System administrators do not strictly follow predefined rules when configuring the system. The configuration process is similarly to Markov transitions[61]. At each step/state, there might be several paths that system administrators can take. There is a probability distribution describing the likelihood of different paths. For example, upon a request, a system administrator might consult other system administrators or they might search for documentation. The pattern of what system administrators decide to do at each step is different for different people, different tasks, and different circumstances. However, the realistic process is not Markovian, as memory plays a role in which steps will be taken next[34]. • Looping and iteration The practice of system administration in seeking performance and policy goals can be modeled as a series of loops involving seeking knowledge and testing understanding. The administrator alternates between observing behavior, reading about options, planning strategies, and deploying solutions. This is not a typical ”flow chart” of the sort that would describe a computer program; instead the activities of reading, planning, observing, and deploying are typically intermixed and interdependent rather than separate and distinct. • Dynamism The requirements, experience, documentation, and system itself are not static; they change dynamically with time. During the configuration process, requirements can be revised by management and users. So the objectives for configuration can be a “moving” target, which increases the difficulty and cost of the configuration. The experience of system administrators is changing every day. They learn about the system through successes and failures and forget some knowledge as a natural process. Documentation regarding the system is constantly updated by system administrators and system developers. The system itself changes with time as well. In summary, the current configuration process is “human-centered” in the sense that the experience of system administrators has a dominant effect within the process. The configuration process is not a well-defined control flow diagram due to five factors: (1) the inputs from other people are asynchronous; (2) the process involves loops and iteration of steps; (3) it is not necessary to completely finish one step before move to the next; (4) the transaction between steps is not well 43 determined and is subject to the current situation; and (5) people, requirements, and systems in the process are changing. 44 Chapter 7 A Model of Configuration Management A rigorous language for discussing the issue of configuration management is currently lacking. To this end, we develop a simple state-machine model of configuration management. Configurations or observed behaviors comprise the state of a system and configuration processes accomplish state transitions. Our theoretical discussion is based upon this model of configuration management, which we abstract from practice. The word “system” is perhaps too vague a word to describe what the system administrator manages. The word “system” refers to different things in different chapters. It refers to a host in the theory of reproducibility, a host or a site in the theory of configuration operations and their composition, and any system with interacting subsystems in the theory of dependency analysis. In most chapters, we do not emphasize the interactions within a system except in the chapter on dependency analysis. 
We assume each “system” to be a closed-world system; we explain this concept in the next section. 7.1 Closed- vs. Open-world Models of Systems Theoretically, a closed-world system is a system that has no interactions with anything outside of the system. In practice, a system is effectively closed if the behavior that can be affected by outside forces is not described in the functional specification. For example, one may change the color of a web server, but this change does not affect whether this web server can achieve its functional 45 expectations to provide web service, so the web server can be considered to be in a closed-world system even if the color is determined by outside forces, e.g., an interior decorator. A system is also effectively closed if there are dependencies upon constant properties of the outside world that cannot be violated. For example, in normal circumstances, we treat a web server and a DNS server as a closed-world system without considering whether electric power will fail, since it is nearly always available. By contrast, in an open-world system, unknown entities from outside the boundaries of the system affect the behavior of entities of the system in such a way that whether functional expectations can be satisfied is not solely determined by the states of the system. The behavior of the system is thus unpredictable. In some cases, this can be addressed by making the system boundaries larger. E.g., one can draw a boundary around both DNS and web services in order to view the pair as a closed system; the configuration of DNS affects the behavior of web service too much for the two entities to be considered independent. 7.2 Observed Behavior In the previous section, we noted that the difference between “closed” and “open” systems depends upon what we choose to observe or not to observe about the system. We next must define what we mean by “observed behavior”. This requires identifying the questions we might ask whether they are true or not in determining whether a particular behavior is present. Suppose that we are given a finite set of tests T that we can apply to a configuration1 . Tests are statements that can be either “TRUE” or “FALSE” by some testing method. For example tests might include: • The system has a 5 gb hard drive. • The system has at least 128 MB of RAM. • The system has a network card. • Port 69 is listed in /etc/services. • TCP port 80 answers web service requests. • UDP port 69 rejects tftp requests. 1 The set of “all” possible tests is not finite, but in practice, the set of tests that we focus upon must be finite. 46 Some of these facts concern hardware configuration, while others concern behavior, both internal and external. A numeric measurement is represented by a set of facts, one per possible value. For example, if there can be between one and five server processes on a machine, this would be represented by the five tests as to whether that number was 1, 2, 3, 4, or 5. In practice, all the parameters can take finitely many values. Thus T is a finite set. Also, Axiom 1 The observed behavior of a system is a function of its actual configuration. Here we do not consider dependencies upon other systems. We assume that the system under study is in a closed-world system so that its behavior cannot be affected by other systems. Also, tests are chosen to represent static properties of configuration, not resource bounds, and presume the presence of adequate resources to perform the test. 
The observed behavior of a system is a subset ψ of T, where t ∈ ψ exactly when the system passes the test t. We represent user requirements as a subset R ⊆ T. For each test t that should succeed, we add it to R; for each test t ∈ T that should fail, we add its complement to R instead. Any test not mentioned in R is one for which we are not concerned about the outcome. A system meets its requirements if ψ ⊇ R.

7.3 Actual State and Observed State

The actual state of a system is defined as its configuration information recorded on the machine. This is distinguished from its "observed state", which describes the set of behaviors in which we are interested. From Axiom 1, we eliminate possible behavioral effects caused by inadequate resources. The actual state of a system is its configuration, denoted as s ∈ S where S is the set of all possible configurations of a system. |S| can be arbitrarily large, although we assume that S is finite, since in practice, even for large-scale systems, the total number of possible states is finite. We also assume that the behavior of a system is always in synchronization with its actual state. We do not consider the cases where the configuration files are modified but server daemons do not reread these files or act accordingly.

The observed state of a system is defined as the observed behavior of a system, i.e., ψ ⊆ T. We define Ψ to be the set of all possible observed states of a system. Note that |Ψ| = 2^|T|. A test function σ can be defined to map from an actual state to an observed state, i.e., σ : S → power(T).

7.4 Configuration Operations

Configuration operations act on the actual state of a system, not the observed state.

Definition 1 A configuration operation p takes as input a state s and produces a modified state s′ = p(s).

Configuration operations might include things like:
• replace /etc/inetd.conf with the one at foo:/bar/repo/inetd.conf.
• delete all udp protocol lines in /etc/services.
• cd into /usr/src/ssh and run make install.

An operation can be automated or manual, accomplished by a computer program or by a human administrator. Operations can be combined through function composition:

Definition 2 For two operations p and q, the operation q ◦ p is defined as applying p, then q, to s: (q ◦ p)(s) = q(p(s)).

In this thesis, sequences of operations are read from right to left, not left to right, to conform to conventions of algebra.

Axiom 2 For any system, there is a baseline operation b that – when applied to any configuration of the host – transforms it into a baseline configuration with a predictable and repeatable actual state, a "baseline state" B.

There are no other restrictions on b; for example, b could be "reformat the hard disk". Note that since observed behavior is a function of actual state, the baseline state corresponds to a repeatable observed state as well.

Definition 3 Given a set of m configuration operations P = {p_1, p_2, · · · , p_m}, let P∗ represent the set of all possible results of composing finite sequences of operations, i.e.,

P∗ = {p̃ | p̃ = p_α(1) ◦ p_α(2) ◦ · · · ◦ p_α(k−1) ◦ p_α(k), where k ≥ 1 is an integer, α : [k] → [m], and p_α(i) ∈ P} ∪ {ε}

where p̃ is a sequence of operations and ε represents the empty operation ("do nothing"). P∗ is the set of all possible results of composing finite sequences of operations from P. Sometimes, we will loosely refer to the sequence as the composition of the elements of the sequence.
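The following Python sketch is one way to make Definitions 1-3 and Axiom 2 concrete; the states are toy dictionaries and the parameter names are invented. Operations are functions on actual states, composition is applied from right to left, and the baseline operation maps every state to the same baseline state, so a configuration is determined by the sequence of operations applied after the last baseline.

# Toy actual states: dictionaries of configuration parameters (names invented).
BASELINE = {"pkg_ssh": False, "root_login": True}

def b(state):                      # baseline operation: forget prior history
    return dict(BASELINE)

def p1(state):                     # install the ssh package
    return {**state, "pkg_ssh": True}

def p2(state):                     # disable root logins
    return {**state, "root_login": False}

def compose(*ops):
    # compose(q, p) applies p first, then q, matching the right-to-left convention
    def composed(state):
        for op in reversed(ops):
            state = op(state)
        return state
    return composed

s = compose(p2, p1, b)({"anything": "at all"})   # the sequence after b determines s
print(s)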
Definition 4 The set of reachable states of a system with respect to a baseline state B and operations P is S = {s = p̃(B) | p̃ ∈ P∗}.

The set of reachable states is a subset of all possible actual states. There might be ways to change an actual state other than configuration operations. In this thesis, we do not distinguish these two concepts and assume that only configuration operations can change the actual state of a system. This assumption is reasonable because usually the set of actual states is not arbitrary, but is instead the result of applying configuration operations to some known initial system state.

The direct consequence of Axiom 2 is that an actual state s constructed by starting at the baseline configuration and applying a series of operations from P is completely determined by the sequence of operations applied since the last baseline operation b, i.e., all operations prior to the last baseline can be ignored. Thus there is a one-to-one correspondence between sequences of operations since the last baseline operation b and configurations s ∈ S. Thus for the rest of the thesis, we represent s as a sequence of operations after the baseline operation.

In practice, many sequences of operations can have exactly the same effect; thus there is a (perhaps poorly understood) equivalence relation E on P∗, where p̃_i ≡ p̃_j whenever applying either sequence to the same state produces the same final state. There are two kinds of equivalence: observed and actual. Two machines are observably equivalent if their results agree for all tests in a test suite T. They are equivalent in actuality if their configurations are identical. In practice, the latter is impossible; two machines that are identical in configuration cannot even share the same network; they must have differing internet addresses, and thus their configurations must differ in some crucial ways. The exact nature of equivalence between machines is a central issue in many tools, including radmind[50] and ISConf[90, 91]. In this thesis, we define equivalence in terms of observed state to avoid ambiguity.

7.5 Two Configuration Management Automata

Two configuration automata can be defined: the first is based upon actual states of the system and the second is based upon observed states of the system:

M1 : (S, P, W1)
M2 : (Ψ, P, W2)

where S is the set of all possible actual states (configurations) of the system, P is the set of configuration operations applicable to the system, and W1 and W2 are transition rules, defined as follows. W1 is the set of all triples (p, s, s′) where s ∈ S is the actual state of the system before an input configuration operation p ∈ P and s′ ∈ S is the resulting actual state of the system. W2 is the set of all triples (p, ψ, ψ′) such that s ∈ A(ψ), s′ ∈ A(ψ′), and s′ = p(s) for some actual states s and s′.

Note that M1 can be arbitrarily large since |S| is arbitrarily large; M2 is bounded by 2^|T|. M1 is deterministic by Axiom 1 and Axiom 2, since the actual state of a system can be represented by its configuration from Axiom 1, and its configuration can be represented by a sequence of operations applied to its baseline state from Axiom 2. M2 is possibly non-deterministic. Before each operation, the observed state ψ corresponds to a subset of possible actual states A(ψ) ⊂ S. Without further constraints, a typical operation p seems non-deterministic, because the actual states that can result from applying p are the set p(A(ψ)) = {p(s) | s ∈ A(ψ)}.
Thus it is possible that p, when applied to a system in observed state ψ, can produce one of several observed states in σ(A(ψ)) as a result. Without further information, one cannot limit p(A(ψ)) to the particular configuration s′ ∈ p(A(ψ)) that is actually in effect after applying p. This uncertainty leads to apparent non-determinism when applying configuration operations.

Chapter 8 Reproducibility

The goal of any configuration management strategy is to achieve reproducibility of effect: repeating the same configuration operation on different hosts in a large network produces the same behavior on each host. Configuration operations should cause deterministic state transitions from one behavioral state to another. Reproducibility is difficult to achieve due to the difference between the actual state of a host and the observed state that humans and operations can practically observe. In making a configuration change, it is not practical to examine the whole state of the hard disk beforehand. Much of the actual state of a host is not observed. Latent preconditions arise in parts of host state that humans or operations are not currently considering when making changes.

The majority of the discussion of reproducibility theory is described in [33]; we reiterate that discussion here for completeness. In this chapter, we show that for one host in isolation and for some configuration processes, reproducibility of observed effect for a configuration process is a statically verifiable property of the process. However, reproducibility for populations of hosts can only be verified by explicit testing. Using configuration processes verified to be locally reproducible, we can identify latent preconditions that affect behavior among a population of hosts. Constructing configuration management tools with statically verifiable observed behaviors thus reduces the lifecycle cost of configuration management.

8.1 Local Reproducibility

First we develop the concept of reproducibility for a single host h. Even though each configuration operation itself is deterministic, a change in actual state from s to s′ = p(s) may not effect a change in the observed state ψ; the observed state is an unspecified function of the actual one. This results in situations where the same configuration operation, applied to two configurations in the same observed state but differing actual states, leads to two different observed states as a result. This can occur if prior configuration operations left the two configurations in differing actual states that are observed as identical, but for which further operations expose differences. Local reproducibility is formally defined as follows:

Definition 5 Suppose we have a system h, a set of actual states S for h, a set of candidate operations P appropriate to h, a set of tests T that one can perform on h, a test function σ : S → power(T), and a map φ from each observed state ψ ∈ Ψ to a subset φ(ψ) ⊂ P that it is appropriate to apply to h when it is in the observed state ψ. Then the formal system (h, S, T, P, φ) exhibits observed local reproducibility (or simply local reproducibility) if the following conditions hold:
1. For every actual state s ∈ S and every p ∈ φ(σ(s)), we have p(s) ∈ S.
2. For every pair of actual states s, s′ ∈ S with σ(s) = σ(s′) and for every p ∈ φ(σ(s)), we have σ(p(s)) = σ(p(s′)).

The intuition behind the definition is that local reproducibility means that, for a single host, two actual states that have the same behavior will continue to have the same behavior if the same operation is applied to both.
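Definition 5 suggests a brute-force check, sketched below for a toy system; the actual states (tuples of configuration-file lines), the single test, and the two operations are invented. The check asks whether each operation sends observably equal actual states to observably equal results; an operation that depends on unobserved detail fails.

TEST_LINE = "PermitRootLogin no"

def sigma(lines):                              # observed state: which tests pass
    return frozenset({"root login disabled"} if TEST_LINE in lines else set())

def append_line(lines):                        # convergent-looking edit
    return lines + (TEST_LINE,)

def drop_last_line(lines):                     # depends on unobserved file layout
    return lines[:-1]

reachable = [
    ("Port 22",),
    ("Port 22", TEST_LINE),
    (TEST_LINE, "Port 22"),                    # same observation, different actual state
]

def locally_reproducible(op):
    outcome = {}                               # observed input -> observed output
    for s in reachable:
        before, after = sigma(s), sigma(op(s))
        if outcome.setdefault(before, after) != after:
            return False                       # same observation, different result
    return True

for op in (append_line, drop_last_line):
    print(op.__name__, "locally reproducible?", locally_reproducible(op))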
Another possible definition of local reproducibility is that the configuration automaton based upon observed behavior for a single host is deterministic.

Proposition 1 The formal system (h, S, T, P, φ) exhibits observed local reproducibility exactly when the state machine (Ψ, P, W2) with states Ψ = σ(S), operations P, and transition rules

W2 = {(p, σ(s), σ(p(s))) | s ∈ S, p ∈ φ(σ(s))}     (8.1)

is deterministic.

Proof: Suppose we have a formal system exhibiting observed local reproducibility as in Definition 5, and construct the state machine of the proposition. By hypothesis, the only allowable operations change one configuration in S into another in S, so that this state machine can be in one of a limited number of states σ(S) ⊂ Ψ. Now start in a state ψ ∈ σ(S) and apply an operation p ∈ φ(ψ). Let s, s′ be two actual states such that σ(s) = σ(s′) = ψ. Then by hypothesis, σ(p(s)) = σ(p(s′)), so that the result of p is invariant of the choice of s. Thus the state machine is deterministic. The converse is similar. □

Some configuration management strategies achieve local reproducibility by strictly utilizing a set of operations in a particular sequence[90, 91].

Proposition 2 Suppose that P = {b, p_1, . . . , p_n}, and that

S = {b, p_1 ◦ b, p_2 ◦ p_1 ◦ b, . . . , p_n ◦ · · · ◦ p_1 ◦ b}.     (8.2)

Suppose that φ(σ(b)) = {p_1} and let

φ(σ(p_k ◦ · · · ◦ p_1 ◦ b)) = {p_{k+1}}     (8.3)

for 1 ≤ k < n. Then the formal system (h, S, T, P, φ) exhibits observed local reproducibility.

Proof: Starting at baseline b, we form the configurations b, p_1 ◦ b, p_2 ◦ p_1 ◦ b, . . . , p_n ◦ · · · ◦ p_1 ◦ b. As b creates an actual state, and each operation is deterministic, the sequence of operations uniquely determines an actual state. As observed tests are deterministic, the observed state corresponding to this actual state is uniquely determined as well. □

This proposition is part of the theoretical grounding of ISConf[59, 90, 91]. Unconstrained deterministic operations, when applied in a specific order, appear deterministic to any observer utilizing deterministic tests as a mechanism for observing. However, Proposition 2 is extremely limiting. The reachable states are attained by applying prefixes of the sequence of configuration operations to the baseline configuration. This results in a sequence of configurations s_1, . . . , s_{n+1}, where going forward from s_i to s_{i+1} requires operation p_i, while going backward requires starting over from the baseline state[59]. Since re-baselining a host is currently a matter of erasing all of the host's contents and starting over, the machine is unavailable for use during this process. This can lead to hidden costs from lost productivity due to machine downtime[74]. A generally applicable configuration management strategy should – within limits – be able to change a host from any state to any other without having to rebuild the entire host from scratch.

The polar opposite of the strategy of Proposition 2 is to consider all operations to be applicable at all times, and instead constrain the nature of operations to provide observed local reproducibility. In this strategy, we allow application of all possible sequences of operations in P.

Definition 6 Suppose we have a system containing a host h, a baseline state b, a set of operations P, and a set of tests T. For all observed states ψ ∈ σ(S), let φ(ψ) = P, so that all operations apply to all observed states.
Then we say the system (h, b, T , P ) exhibits observed local reproducibility (or simply local reproducibility) whenever the system (h, P ∗ (b), T , P, φ) exhibits observed local reproducibility according to Definition 5. This simpler notion of local reproducibility makes it possible to unambiguously discuss the local reproducibility of a particular operation p ∈ P . Definition 7 With respect to the formal system (h, b, T , P ), an operation p ∈ P exhibits observed local reproducibility (or simply local reproducibility) if for every s, s0 ∈ P ∗ (b) with σ(s) = σ(s0 ), σ(p(s)) = σ(p(s0 )). In this case, we say that p is a locally reproducible operation. Proposition 3 The system (h, b, T , P ) exhibits observed local reproducibility exactly when every operation p ∈ P exhibits observed local reproducibility with respect to the system (h, b, T , P ). Proof: Suppose the system (h, b, T , P ) exhibits observed local reproducibility. Then by Definition 6, the system (h, P ∗ (b), T , P, φ) exhibits observed local reproducibility, where φ(ψ) = P for all observed states ψ ∈ σ(P ∗ (b)). Then for each operation p ∈ P , the second condition in Definition 5 is true, and for every s, s0 ∈ P ∗ (b) with σ(s) = σ(s0 ), σ(p(s)) = σ(p(s0 )). Thus the operation p exhibits observed local reproducibility. Conversely, suppose that for all p ∈ P , p exhibits observed local reproducibility. Then by the same argument as above, the second condition of Definition 5 is true. Since S = P ∗ (b), every s ∈ S can be expressed as pα(1) ◦ pα(2) ◦ · · · ◦ pα(k−1) ◦ pα(k) ◦ b, where k ≥ 0, α : [k] → [m], pα(i) ∈ P . Thus p(s) = p ◦ pα(1) ◦ pα(2) ◦ · · · ◦ pα(k−1) ◦ pα(k) ◦ b ∈ P ∗ (b) = S and the first condition of Definition 5 is true. Thus the formal system (h, P ∗ (b), T , P, φ) exhibits observed local reproducibility according to Definition 5, so that by Definition 6, the system (h, b, T , P ) does so as well. 2 In other words, a set of operations exhibits observed local reproducibility with respect to a baseline b if each operation has a reproducible observed effect on each reachable configuration s ∈ P ∗ (b). This is the definition of reproducibility that best models the operation of Cfengine[15, 16, 18] and related convergent agents. 54 Note that local reproducibility of P trivially implies local reproducibility of P ∗ . Although P ∗ is infinite, P ∗ (b) is a subset of a finite (though large) set of configurations, as the number of possible configurations is finite. 8.1.1 Properties of Locally Reproducible Operations Several relatively straightforward propositions demonstrate the properties of locally reproducible operations in more detail. In the following propositions, to ease notation, we will presume the existence of a host h, a baseline b, a set of possible operations P , and a set of tests T . We will presume that S = P ∗ (b) is the set of reachable configurations. All claims of local reproducibility of an operation refer to Definition 7 and are made in the context of the formal system (h, b, T , P ). Proposition 4 The set of operations P is locally reproducible if and only if for each operation p ∈ P and each actual state s ∈ S = P ∗ (b), σ(p(s)) is a function τp of σ(s). Then we can express the observed state of a configuration after p as σ(p(s)) = τp (σ(s)). Proof: This is a direct and obvious consequence of the definition of observed local reproducibility. 
A set of operations exhibits observed local reproducibility if and only if the resulting observed state after each operation is a function of the observed state before the operation; τp makes this functional relationship explicit. 2 Because locally reproducible operations p correspond with state functions τp , they also exhibit the typical properties of functions, notably, that a composition of functions is also a function: Proposition 5 A composition of operations that each exhibits observed local reproducibility also exhibits observed local reproducibility. Proof: Let T be a set of tests and S = P ∗ (b) represent a set of configurations. Let s ∈ S. Consider locally reproducible operations p and q on S. Since p is locally reproducible, for any particular observed state σ(s) of s, σ(p(s)) = τp (σ(s)) is a constant. Likewise for q, for any observed state σ(p(s)), σ(q(p(s))) = τq (τp (σ(s))) is a constant. Thus for any observed state σ(s), σ(q(p(s))) = σ(q ◦ p(s)) is a constant and q ◦ p exhibits observed local reproducibility. 2 As composing operations on a configuration is the same as applying them in order, this means that an arbitrary sequence of locally reproducible operations is locally reproducible as well. However, the above does not yet tell us how to implement local reproducibility for the operations p that we might compose. In particular, some counter-intuitive results arise straightforwardly from the model. 55 Proposition 6 Configuration operations containing linear code (with no branches) do not necessarily exhibit observed local reproducibility. Proof: As a counterexample, we construct an operation p1 whose outcome is not a function of prior observed state. Let X and Y be two configurable parameters of our host. Let operation p1 be “X := Y ”, let operation p2 be “Y := 1”, and let operation p3 be “Y := 2”. Let T consist of one test t1 : “X > 1”. Then p1 is not locally reproducible; it has two outcomes depending upon unobserved pre-existing conditions. There are two reachable latent states Y == 1 and Y == 2 that are not measured by the tests T , which determine the observed outcome. These latent states are constructed by applying operations p2 or p3 , respectively. 2 Reproducibility or non-reproducibility arise from properties of both the domain and range of an operation. Note that if Y was a constant, p1 would exhibit observed local reproducibility; its non-reproducibility arises from the fact that Y ’s value is unpredictable. This situation occurs often in practice, such as when changing file protection modes. Suppose there is a file “foo” that we wish to make executable and configuration operations p1 , p2 , and p3 , where p1 is “chmod ugo+X foo”, p2 is “chmod 744 foo”, and p3 is “chmod 644 foo”. Suppose T consists of one test “test -x foo” where the user running the test is not the owner or in the file’s group; this tests whether the file is executable to world. Then the observed states of applying p2 and p3 are indistinguishable, because neither p2 nor p3 makes the file executable to the user performing the test. But performing p1 after p2 makes the file world-executable (protection 755, because ”X”, the conditional execution flag, makes it fully executable if any execute bit is set), while performing p1 after p3 leaves it completely unexecutable (protection 644). Similar situations occur during stream editing of files. 
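The file-mode example can also be checked mechanically. The Python sketch below simulates the three chmod operations on a numeric permission mode and the single world-execute test; the mode arithmetic is a simplification of chmod's behavior for regular files, no real filesystem is touched, and the function names are invented for illustration.

# Simulate the chmod example: modes are octal permission triples.
def chmod_744(mode): return 0o744            # p2
def chmod_644(mode): return 0o644            # p3
def chmod_ugo_plus_X(mode):                  # p1: "chmod ugo+X"
    # Conditional execute: add execute for all only if some execute bit is set.
    return mode | 0o111 if mode & 0o111 else mode

def world_executable(mode):                  # the single test: "test -x" run by
    return bool(mode & 0o001)                # an unrelated user (world execute bit)

start = 0o600
for prior in (chmod_744, chmod_644):
    mode = prior(start)
    print("after prior op:", oct(mode), "observed:", world_executable(mode))
    mode = chmod_ugo_plus_X(mode)
    print("after ugo+X:   ", oct(mode), "observed:", world_executable(mode))

Both prior operations leave the file in the same observed state (not world-executable), yet applying ugo+X afterwards produces different observed outcomes; this is exactly the failure of local reproducibility described above.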
Likewise, conditional statements based upon unobserved data pose serious problems: Proposition 7 A conditional statement if (F ) then X := G need not produce a locally reproducible outcome if F is not observed, even if G is observed. Proof: As a counterexample, consider two boolean variables X and Y , where X is observed and Y is not. Consider the code: X := FALSE; if (Y ) then X := TRUE; This is equivalent to X := Y , which makes X unobserved. 2 56 (8.4) 8.1.2 Constructing Locally Reproducible Operations It is easy to construct a locally reproducible configuration operation. Each locally reproducible configuration operation p corresponds to a function from initial states ψ to final states ψ 0 , in the context of a set of reachable states S. This operation must thus depend upon the value of ψ and avoid conditioning its effects on the values of other variant properties of the host or network. It can, however, depend upon host properties γ that do not vary as a result of any configuration operation p. Proposition 8 Let p be a configuration operation. Let ψ be an observed state of a configuration s measured before applying p. Let γ represent the attributes of a host that remain constant during configuration. If p consists solely of setting a configuration parameter X to a value that is a function only of ψ, γ, and constants, then p is locally reproducible. Proof: Let ψ be the state of the configuration before the operation p. By hypothesis, p has the form X := F (ψ, γ), where F is a function only of observed state ψ, constants, and invariants of a particular host. We must show that the resulting observed state ψ 0 after applying p is a function of the previous observed state. Because F is a function, there is one and only one outcome for F (ψ, γ) for each state ψ, so that the resulting value of X in the actual configuration changes as a function of observed state whether it is observed or not. As the resulting observed state is a function of the resulting configuration p(s) by Axiom 1, it must change predictably and repeatably as well. 2 The above result is easily generalized. Corollary 1 Suppose an operation p consists only of a sequence of assignments X := F (ψ, γ), where X is a configuration parameter, ψ and γ are held constant throughout, and F is a function of ψ, γ, and constants. Then p is locally reproducible. Proof: A sequence of assignments is the same as a composition of the operations that perform the assignments. By Proposition 8, each one of these is locally reproducible. By Proposition 10, a composition of any two is locally reproducible. By induction on the number of assignments, the result easily follows. 2. Conditional statements pose no difficulties (although they contain more than one possible execution path) because repeatability is guaranteed by the constancy of ψ and γ. Proposition 9 Suppose a configuration operation p has the form if (F (ψ, γ)) then X := G(ψ, γ) 57 (8.5) where X is a configuration parameter and F and G are functions whose values solely depend upon the current observed state ψ, the host invariants γ, and constants. Then p is locally reproducible. Proof: We must show that the observed outcome is a function of ψ and γ. There are only two possible outcomes. If F (ψ, γ) is false, nothing happens, so the result of all such operations is locally reproducible, having the same observed state as the original. If F (ψ, γ) is true, the assignment X := G(ψ, γ) occurs, and is locally reproducible because of Proposition 8. 
Since taking the branch is itself a function of observed state, the whole branch is locally reproducible as well. 2 Corollary 2 A sequence of conditional assignments of the form if (F (ψ, γ)) then X := G(ψ, γ) (8.6) where F and G are functions only of ψ, γ, and constants, is locally reproducible. Proof: Repeat the argument of Corollary 1 with these conditional statements. The proof is trivial given the constancy of ψ, γ, F (ψ, γ), and G(ψ, γ). With the above results in mind, we are ready to relate reproducibility to the structure of configuration operations as programs. Proposition 10 Let p be a configuration operation and s a configuration. Let ψ be the observed state of s and let γ represent the set of host invariants that do not change during configuration. Suppose that when p is applied to s, it: 1. Sets parameters to values that are functions of ψ, γ, and constants. 2. Takes program branches (conditionals) depending only upon the values of functions of ψ, γ and constants. where ψ and γ are held constant throughout p. Then if the operation p terminates successfully on every configuration s in its domain, p exhibits observed local reproducibility. Proof: Assume that p is an operation conforming to the above hypotheses. Let G = (V, E) be the program graph of the operation p as a procedure. Construct the graph to have nodes v ∈ V for each parameter assignment statement and conditional. Since p obeys the rules above, this graph takes branches based solely upon functions of a static observed state ψ that does not change during the execution of p. This means that any loop taken during execution of p will never terminate, because its branch condition cannot change during the execution of p. Since p terminates, loops 58 are not present and the program takes a predetermined finite path of vertices v1 , . . . , vk through the program graph for each particular choice of ψ. Along that path, it executes a predetermined sequence of assignment statements, so that Corollary 1 applies and the result is locally reproducible. 2 It is rather important, however, that the operation utilize only static values of ψ during execution of p, and not dynamic re-measurements of tests during p. One can get away with limited dynamic measurements of ψ, provided one is careful not to use two differing measurements of ψ simultaneously: Proposition 11 Suppose that p as described in Proposition 10 is also allowed to re-measure the whole observed state ψ at any time during its execution, as well as setting parameters and branching based upon functions of the observed state ψ and host invariants γ. Only one measurement of ψ is available at a time and any setting must be a function of the most recent measurement. If p terminates, the result is locally reproducible, but not all processes are guaranteed to terminate. Proof: Let p be an operation that conforms to the hypotheses of the proposition. Repeat the construction of the program graph G = (V, E) from the proof of Proposition 10 with one change: include re-measurement operations for observed state in the program graph. It is now possible for loops to occur during execution, but if the operation terminates, we claim that it still produces a locally reproducible state. First, any terminating computation p will have executed a finite sequence of operations v1 , . . . , vn within its program graph, where each vi is either a parameter assignment statement or remeasurement of entire state. 
Without loss of generality, we can express this sequence as an alternating sequence m1 , a1 , m2 , a2 , . . . , mk , ak where each mi measures state and each ai represents a series of assignment statements relative to that state. Now consider what happens during this sequence to two configurations s and s0 with the same observed state. We must show that for both s and s0 , the branches taken are identical, leading to identical paths with identical effects. Since s and s0 have the same observed state, the results of m1 are the same in each case; hence the operations a1 done between m1 and m2 are identical in purport. These statements are locally reproducible, so the resulting observed state m2 is the same in both cases. Proceeding by induction for m2 . . . mk , it is easy to demonstrate that the exact same assignment statements and branches are taken overall, so that the result of p is independent of the actual configuration s or s0 . Thus p is locally reproducible. 2 59 8.2 Population Reproducibility We next turn our attention to assuring reproducibility of operations over a population of hosts. Such reproducibility is extremely important as a cost-saving measure. If we must validate the behavior of each host of a population separately, it becomes very expensive to build large networks. Ideally, we should be able to test one particular host in order to understand the behavior of a population. Definition 8 If, for any subset of a population of hosts whose configurations are in the same observed state, an operation p results in identical observed final states over the subset, then p exhibits observed population reproducibility (or simply population reproducibility). Local reproducibility does not imply population reproducibility. Proposition 12 An operation that is locally reproducible on every host to which it applies can fail to exhibit population reproducibility. Proof: As a counterexample, consider an operation applied to hosts with differing operating systems. Suppose that p is the operation of copying a constant xinetd.conf into /etc/. The population reproducibility of this operation has no dependence upon its implementation; it is a property of the operating system. If the operating system does not support the xinetd abstraction, the operation does nothing to behavior. Consider also an operation that exposes a bug in one OS that is not present in another. There are many other such latent variables that cause the same operation on different hosts to yield different outcomes. 2 While local reproducibility is relatively easy to achieve, population reproducibility remains a pressing problem, and there is no currently available tool that addresses it sufficiently. There is one overarching observation that motivates our approach to population reproducibility: Proposition 13 If all configuration processes are verified as locally reproducible, and one observes population differences in behavior in applying these processes, then the variation must be due to latent preconditions in the population rather than artifacts in the processes. Proof: These processes have effects that are observably locally reproducible. If their effects differ for two hosts in a population, then since they are locally reproducible in isolation, they will always differ in the same way regardless of when the operations are applied. Hence some other factor is causing the difference, and the only other variants are host identity and preconditions. 
2 Thus a locally reproducible process that differs in effect over hosts in a population can be used to test for heterogeneity of latent variables within the population. One way of assuring population reproducibility is to utilize locally reproducible actions that fail to be population reproducible in order to expose each latent variable in the population. With latent variables exposed, we can form equivalence classes of hosts with the same latent structure, and condition further configuration changes by that structure. We show a simple algorithm for constructing "synthetic" population reproducibility, intended to motivate further thought on population reproducibility. Given a set of hosts H, a set of locally reproducible operations P on hosts in H, a set of tests T , and an initially empty set of additional tests T 0 : 1. Run an operation sequence p̃ on the population of hosts to which it should apply. 2. Perform all tests in T ∪ T 0 on each host h to compute an observed state ψh . (a) If results are the same for all hosts, then p̃ exhibits population reproducibility over the hosts to which it applies. (b) If results are different, then form a set H̃ of equivalence classes of hosts, where hosts h, h0 belong to the same class h̃ ∈ H̃ exactly when ψh = ψh0 . For each equivalence class h̃ and a configuration c, add the test "h(c) ∈ h̃" to T 0 . The above construction is impractical: there are |P ∗ | iterations at step 1, and each iteration requires O(2|T | ) operations to carry it out. The point is that it is possible to deal with population non-reproducibility incrementally, one deviation at a time, adding tests to T 0 one by one. In this way, our state of knowledge grows with the operations we apply and observe. Chapter 9 Limits on Configuration Operations There has recently been much attention paid to the limits imposed upon configuration operations and how they affect the usability of a set of operations [32]. Some authors maintain that operations should be constructed to be repeatable without consequence once a desirable state has been achieved [15, 16, 18, 20]. Other authors maintain that operations must, to provide a consistent outcome, be based upon imperative order [90, 91] or upon generating the whole configuration as a monolithic entity [6, 8]. In this chapter, we precisely define limits upon operations with the intent of discussing how those limits affect composability of operations. 9.1 Limits on Configuration Operations Definition 9 A set of operations P is observably idempotent if for any operation p ∈ P , repeating the operation twice in sequence has the same effect as doing it once, i.e., σ(p ◦ p ◦ q̃) = σ(p ◦ q̃) where p ∈ P and q̃ ∈ P ∗ . The use of q̃ in this definition asserts that the idempotence occurs when applied to any actual state of the system reachable from B. We could say, equivalently, that σ(p ◦ p) = σ(p) when starting at any state in S. In continuous spaces, an operation is convergent if repeating the operation many times can cause the system to move to some target state, i.e., limn→∞ pn = target state. By contrast, in system administration convergence takes on the discrete character of the configuration space. Definition 10 A set of operations P is observably convergent if for any operation p ∈ P and any q̃ in P ∗ , there is an integer n > 0 such that p is observably idempotent when applied to pn ◦ q̃, i.e., σ(p ◦ pn ◦ q̃) = σ(p ◦ p ◦ pn ◦ q̃). For example, consider the task of creating a user in a directory service. Classical procedural operations would be: 1. create user record in LDAP. 2.
wait for LDAP directory service to sync up and serve the directory record. 3. create user’s home directory and associate data. Time must pass between steps 1 and 3. An equivalent convergent operation to accomplish this task might be an operation c with the following pseudo-code: if (user does not exist) put user in LDAP else if (user’s home directory does not exist) create an appropriate home directory While c by itself does not accomplish the task, repeating c several times (while time passes) guarantees that the task will be completed. After this time, repeating c does not change the system behavior. The value of n in the definition of convergence thus depends upon how fast the LDAP update occurs. The key issue is that rather than thinking of system configuration as one large change, a convergent operation makes one small change at a time. Through repetition of the operation, its preconditions are fulfilled by previous runs or other operations, or simply by the passage of time. Thus idempotence is a special case of convergence where n = 1. In dynamic situations, when we do not know the nature of the best solution, we can allow convergent operations to “discover” the quickest solution to a problem. One kind of convergent operator for which this is true employs “simulated annealing”[62]. This approach presumes that the“best” solution lives in a solution space that is not convex[12]; a particular locally best configuration is not necessarily globally best. This corresponds to a potential “surface” in which the 63 best solution is one of many peaks of the objective function. The simulated annealing approach is to choose at each step whether to move to or away from a peak, in order to allow one chance to switch to a better peak over several trials. For example, in web server response optimization, the behavior of the operator depends upon an external parameter ρ, the “temperature” of the system. Simulated annealing occurs when at each application of the operator, one makes a decision how to proceed based upon the probability ρ: with probability ρ move files to a server that seems slower (as observed before the move); with probability 1 − ρ, move files to a server that seems faster. The annealing process is to let ρ → 0 over repeated operation; this is also called a “cooling schedule”. The result of simulated annealing is often a near-optimal response time[62]. In the initial configuration phase, behavior of the host is completely determined by configuration operations; while in the maintenance phase, especially during resource optimization, the influence of users becomes nontrivial. The effectiveness of convergence then increases as in the above example. Convergent operations are not necessarily consistent; they might even oppose one another to achieve equilibrium in the overall observed state of the system. The inter-relationship of convergent operations can be grouped as: 1. Orthogonal - working on different parts of the system 2. Collaborative - helping each other 3. Conflicting - opposing one another Convergence needs careful design. Poorly designed convergent operations will result in a mess. When we expect them to work independently, they might interfere each other; when we want them to work collaboratively, they might compromise each other. 
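To show how the repeated application in Definition 10 plays out in code, the sketch below models the user-creation example with a toy directory whose replication delay is simulated by a tick counter; the class, field names, and delay mechanism are invented for illustration and stand in for a real LDAP deployment.

# Toy model of the convergent user-creation operation c from the example.
# The "directory" serves an LDAP entry only after some time has passed,
# modelled here as a tick counter rather than a real LDAP server.
class ToySystem:
    def __init__(self, replication_delay=2):
        self.ldap_entry = False       # user record written to LDAP
        self.visible = False          # record served by the directory
        self.home_dir = False         # home directory created
        self._ticks_left = replication_delay

    def tick(self):                   # time passing between runs of c
        if self.ldap_entry and self._ticks_left > 0:
            self._ticks_left -= 1
        if self.ldap_entry and self._ticks_left == 0:
            self.visible = True

def c(sys_):
    """One small, repeatable step: do whichever precondition-satisfied piece
    of the task is still missing, and nothing else."""
    if not sys_.ldap_entry:
        sys_.ldap_entry = True
    elif sys_.visible and not sys_.home_dir:
        sys_.home_dir = True

def observed(sys_):                   # the tests we care about
    return {"user_in_ldap": sys_.ldap_entry, "home_dir": sys_.home_dir}

s = ToySystem()
for i in range(5):
    c(s)
    s.tick()
    print(i, observed(s))
# After enough repetitions the observed state stops changing: the fixed
# point of Definition 10 has been reached, and further runs of c are no-ops.

Each run makes at most one small change; the task completes only through repetition and the passage of (simulated) time, and once the fixed point is reached further runs change nothing.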
Definition 11 A set of operations P is observably sequence idempotent if for any sequence p̃ of elements from the set, repeating the sequence of operations has the exact same effect as doing it once, i.e., σ(p̃ ◦ p̃ ◦ r̃) = σ(p̃ ◦ r̃) where p̃, r̃ ∈ P ∗ Definition 12 A set of operations P is observably stateless if repeating an operation (where the repetition is not necessarily adjacent) will accomplish the same result, i.e., σ(p◦ q̃ ◦p◦ r̃) = σ(p◦ q̃ ◦ r̃), where p ∈ P and q̃, r̃ ∈ P ∗ . Unconditional commands that set parameter values are stateless. For example set M=0. But carefully crafted, stateless operations can also contain conditionals. For example: baseline state: M = 0. 64 Figure 9.1: Stateless operations can have if statements P = {p, q}. p: if M == 0, then set M = 3; if M == 1, then set M = 2; if M == 2, then do nothing; if M == 3, then do nothing; q: set M = 1; Please refer to Figure 9.1. Note that the statelessness of p and q is not only dependent upon the contents of p and q, but also upon the baseline state. Definition 13 A set of operations P is observably commutative if for any two operations p, q ∈ P, σ(p ◦ q ◦ r̃) = σ(q ◦ p ◦ r̃), where p, q ∈ P and r̃ ∈ P ∗ . Note that commutative operations can make conflicting changes to the system. For example, p : M = M + 1 and q : M = M − 1. Definition 14 A set of operations P is observably consistent or homogeneous if operations never undo the changes made by others, i.e., for p ∈ P and q̃ ∈ P ∗ , σ(p ◦ q̃) ⊇ σ(q̃). Definition 15 An operation is atomic if it has two result states. Either 1. it has all required preconditions and it asserts known postconditions or 2. it lacks some precondition and does nothing at all. 65 Definition 16 An operation is aware if the operation knows whether it succeeded or not in enforcing its requirements. In practice, this means that the operation can return a boolean value that is true if it succeeded and false if not. 9.2 Relationship Between Limits Each of the concepts of idempotence, convergence, sequence idempotence, and statelessness refer to a condition true of an operation or set of operations. We can understand these concepts by outlining how each one limits an operation or operations. The “strength” of a condition refers to the amount of limitation imposed; a stronger condition means that less operations meet the conditions. Proposition 14 Idempotence is a stronger condition upon operations than convergence. Proof: Idempotence is a special case of convergence by restricting n = 1. 2 Proposition 15 Sequence idempotence is a stronger condition upon operations than idempotence. Proof: A single operation can be viewed as a sequence with size 1. A sequence idempotent operation set (which requests that any sequence of operations is idempotent) is always an idempotent set (which only request individual operation to be idempotent). There are sets that are idempotent but not sequence idempotent. For example: suppose that there are two configuration parameters M and N , where the baseline state is that M = N = 0. Let p be the operation if (M == 1) then N = 2, and let q be M = 1. Clearly repeating either p or q has no effect, so they are idempotent in isolation. Also q ◦ p just sets M to one, but (q ◦ p) ◦ (q ◦ p) sets N to 2 as well. 2 Proposition 16 Statelessness is a stronger condition upon operations than sequence idempotence. Proof: Given a stateless set, we prove it is also sequence idempotent. Assume without loss of generality a sequence p̃ = p1 ◦ p2 ◦ · · · ◦ pk . 
Then σ(p̃ ◦ p̃) = σ((p1 ◦ p2 ◦ · · · ◦ pk−1 ◦ pk ) ◦ (p1 ◦ p2 ◦ · · · ◦ pk−1 ◦ pk )) = σ((p1 ◦ p2 ◦ · · · ◦ pk−1 ) ◦ (pk ◦ p1 ◦ p2 ◦ · · · ◦ pk−1 ◦ pk )) = σ((p1 ◦ p2 ◦ · · · ◦ pk−1 ) ◦ (pk ◦ p1 ◦ p2 ◦ · · · ◦ pk−1 )) (dropping the repeated pk by statelessness) = · · · = σ(p1 ◦ p2 ◦ · · · ◦ pk ) = σ(p̃). Figure 9.2: An example of a set of operations that is sequence idempotent but not stateless. There are sets that are sequence idempotent but not stateless. For example, suppose M is the configuration parameter and p and q are operations. Baseline state: M = 0. p: if M == 0, then set M = 1; if M == 1, then do nothing; if M == 2, then set M = 4; if M == 3, then set M = 1; if M == 4, then do nothing; q: if M == 0, then set M = 2; if M == 1, then set M = 3; if M == 2, then do nothing; if M == 3, then do nothing; if M == 4, then set M = 2; Referring to Figure 9.2, the reader can verify that this set of operations is sequence idempotent. Sequence p ◦ q ◦ p sets M = 1 while sequence p ◦ q sets M = 4, so σ(p ◦ q ◦ p) ≠ σ(p ◦ q). Therefore the set of operations is not stateless. 2 Proposition 17 If a set of operations is both idempotent and commutative, it has to be stateless and consistent. Proof: Commutativity & Idempotence ⇒ Statelessness: σ(p ◦ q̃ ◦ p) = σ(p ◦ p ◦ q̃) (Commutativity) = σ(p ◦ q̃) (Idempotence). Commutativity & Idempotence ⇒ Consistency: Suppose there are two commutative and idempotent operations p and q that are not consistent, i.e., p makes some test t ∈ T true and q makes this test false. Let us consider the two sequences p ◦ q ◦ p and q ◦ p. Sequence p ◦ q ◦ p will make the test first true, then false, and finally true. Sequence q ◦ p will make the test true, and then false. So these two sequences do not have equivalent observed effects. However, σ(p ◦ q ◦ p) = σ(q ◦ p ◦ p) (Commutativity) = σ(q ◦ p) (Idempotence). By contradiction, the above case cannot exist, so commutative and idempotent operations must be consistent. 2 The enemy of tractability of composability seems to be a lack of knowledge and control over the combinatorial behaviors of configuration operations, except through explicit testing. While we usually know exactly what an operation does when applied to a baseline system, we seldom know precisely what will happen when two operations are composed, i.e., when the second operation is applied to a non-baseline system. Worse, poorly engineered scripts of the kind employed for configuration management are especially prone to errors when applied to systems in unpredictable states. The limits on operations we discussed previously all add some form of control over combinatorial behaviors. Convergence, idempotence, sequence idempotence, and statelessness add a structure of equivalence relations between sequences of configuration operations. Commutativity limits the potential results of a set of scripts so that different permutations of the same set are equivalent. Consistency rules out conflicting behaviors. The limits of atomicity and awareness make operations tighter, more robust, and more secure, and they are used with other limits to enhance their functionality. However, in the next chapter we show that these limits alone are not enough to make composability tractable. Chapter 10 Complexity of Configuration Composition 10.1 Composability Composability of configuration operations refers to the ability to find a finite sequence of configuration operations that reliably transforms a given initial state of a system to an observed state that satisfies some given user requirements. There are two types of composability: syntactic and semantic[75].
Syntactic composability refers to implementation details that enable several operations to be combined, e.g., interface specifications. A set of operations is syntactically composable if each operation employs the proper interfaces to call other operations and the underlying operating environment. Syntactic composability has been studied in detail in the context of software engineering, and proven to be achievable by establishing a common framework, e.g., Component Object Model Plus (COM+)[70], Common Object Request Broker Architecture (CORBA)[49], and Enterprise JavaBeans (EJB)[71]. By contrast, semantic composability refers to the ability to achieve a specific function or goal by composing operations from a set. Semantic composability emphasizes the meaning of the composition. Our studies address the theoretical aspects of semantic composability as it applies to system administration. 10.2 Complexity of Operation Composability Composability is a desirable feature in configuration management. A composable approach can greatly reduce the costs of configuration and of training new administrative staff, and encourage collaboration among system administrators. Composability is far from being achieved by current tools, which avoid the issue by either limiting the ways in which operations are composed, or using operations that by nature can be composed. In this section we show mathematically that composability is an NP-hard problem. The composability problem with no limits imposed upon configuration operations is called the GENERAL COMPOSABILITY (GC) problem. We start with a simplified version of GC called COMPONENT SELECTION (CS). We first prove that CS is NP-hard, hence GC is NP-hard. Then we prove that the composability problems with various limits (including COMPOSABILITY OF ATOMIC OPERATIONS, COMPOSABILITY OF PARTIALLY ORDERED SETS, and COMPOSABILITY OF CONVERGENT OPERATIONS) are all NP-hard problems. 10.2.1 Component Selection Definition 17 COMPONENT SELECTION (CS) is defined as follows: INSTANCE: Set P = {p1 , p2 , · · · , pm } of m components, set T = {t1 , t2 , · · · , tn } of n tests, subset R ⊆ T of desired outcomes, a test process function σ : power(P ) → power(T ) (that describes the outcome σ(Q) for each Q ⊆ P ) whose computation takes time O(|T |), and a positive integer K ≤ |P |. QUESTION: Does P contain a subset Q ⊆ P with |Q| ≤ K, such that R ⊆ σ(Q)? The above definition allows us to find an optimal solution in an easy way. The integer K in the instance is the maximum size of composition allowed. If the answer to the question for some integer K is "yes", i.e., we can find a set of components of size at most K that meets the requirements, we can always lower K to K − 1 and ask the question again until we get the answer "no". Thus we can find the smallest set of components that meets the requirements. In CS, if we are not required to search for an optimal solution, i.e., any selection that satisfies the requirements suffices, then the problem is trivial: simply fix K to be equal to |P |. However, in composability, which is defined later, a polynomial algorithm that searches for even a non-optimal solution is still difficult to construct. The optimization and near-optimization of composability are discussed in Sections 10.2.2 and 13.2. Theorem 1 CS is NP-hard. Proof: We must show that CS is at least as difficult as a known NP-complete problem. Restrict CS to RESTRICTED COMPONENT SELECTION (RCS) by allowing only instances having R = T and σ(Q) = ∪pi ∈Q σ({pi }).
We show in the following proof that RCS is NP-complete, thus CS is NP-hard. 2 Note that making R = T is trivial, since tests are artifacts of human choices; it simply restricts T to the tests relevant to the user's requirements. However, the restriction σ(Q) = ∪pi ∈Q σ({pi }) is significant. It imposes structure on the test process function σ. It eliminates the case where a set of user requirements satisfied by a combination of components cannot be satisfied by any of the components alone. It requires that the operations have the property of compositional predictability, i.e., that the observed behavior of a composition of operations can be predicted without testing by considering the observed behaviors of the individual operations in the composition. This is a very strong requirement; it is a stronger condition than consistency. Definition 18 The definition of RESTRICTED COMPONENT SELECTION (RCS) is as follows: INSTANCE: Set P = {p1 , p2 , · · · , pm } of m components, set R = {t1 , t2 , · · · , tn } of n desired outcomes, a test process function σ : power(P ) → power(R) whose computation takes time O(|R|) and satisfies σ(Q) = ∪pi ∈Q σ({pi }), and a positive integer K ≤ |P |. QUESTION: Does P contain a subset Q ⊆ P with |Q| ≤ K, such that R = σ(Q)? Theorem 2 RCS is NP-complete. Proof: A proof is contained in [76], but it contains problems that we correct here. The proof of NP-completeness uses the MINIMUM COVER (MC) problem, known to be NP-complete. MC is defined as follows: Definition 19 MINIMUM COVER: INSTANCE: Collection C of m subsets of a finite set S, where |S| = n, and a positive integer K ≤ |C|. QUESTION: Does C contain a cover of S of size K or less, i.e., a subset C 0 ⊆ C with |C 0 | ≤ K such that ∪ci ∈C 0 ci = S? First it must be shown that RCS is in NP. Given a subset Q of P , determining whether R = σ(Q) can be done by searching in σ(Q) for each element of R. Since σ(Q) has at most n elements and R has n elements, a simple algorithm requires O(n2 ) time, which is polynomial in the length of the instance; thus RCS is in NP. The transformation function f from any instance ω of MC to an instance f (ω) of RCS is defined as follows: 1. Let every ci ∈ C correspond to a component pi ∈ P . 2. Let R = S. 3. For every ci = {si,1 , si,2 , · · ·} ∈ C, let σ({pi }) = ci = {si,1 , si,2 , · · ·} = {ti,1 , ti,2 , · · ·}. 4. Let K in RCS (KRCS ) equal K in MC (KMC ). Step 1 requires time O(m); step 2 requires O(n); step 3 requires time O(mn); step 4 requires O(1). So f is computable in time polynomial in the size of the input. Now we show: ω ∈ MC ⇐⇒ f (ω) ∈ RCS. ←−: Assume f (ω) ∈ RCS. Then there exists a subset Q ⊆ P with |Q| ≤ KRCS such that R = σ(Q). Let C 0 = {ci | ci = σ({pi }), pi ∈ Q}. Then ∪ci ∈C 0 ci = ∪pi ∈Q σ({pi }) = ∪pi ∈Q {ti,1 , ti,2 , · · ·} = ∪pi ∈Q {si,1 , si,2 , · · ·} = σ(Q) = R = S, because R = S under f . Since KMC = KRCS and |C 0 | = |Q| ≤ KMC , the collection C 0 is a cover of S of size at most KMC . Therefore ω ∈ MC. −→: Assume ω ∈ MC. Then there exists a subset C 0 ⊆ C with |C 0 | ≤ KMC such that S = ∪ci ∈C 0 ci . Let Q = {pi ∈ P | ci ∈ C 0 }. Then σ(Q) = ∪pi ∈Q σ({pi }) = ∪pi ∈Q {si,1 , si,2 , · · ·} = ∪pi ∈Q {ti,1 , ti,2 , · · ·} = ∪ci ∈C 0 ci = S = R, because R = S under f . Since KRCS = KMC and |Q| = |C 0 | ≤ KRCS , the subset Q satisfies the RCS question. Therefore f (ω) ∈ RCS. 2 10.2.2 General Composability Unfortunately, our problem is somewhat more difficult than CS, because in our problem the order of operations does matter. We are most interested in what test outcomes result from applying operations from a set of operations P .
Operations differ from components in the above theorem and definition, because the order of application of operations does matter, and repeating an operation is possible, with potentially different results than performing it once. We assume that applying an operation pi to the system takes only one step and that we can test the system in linear time. We also assume that the system begins in a predictable and reproducible baseline state B to which the sequence of operations is applied. Definition 20 GENERAL COMPOSABILITY (GC) is defined as follows: INSTANCE: Set P = {p1 , p2 , · · · , pm } of m configuration operations, set T = {t1 , t2 , · · · , tn } of n tests, set of user requirements R ⊆ T , a test process function σ : P ∗ → power(T ), where for p̃ ∈ P ∗ , σ(p̃) represents the tests in T that succeed after applying p̃ to a given repeatable baseline state (suppose the computation time of σ is a linear function of |T |), and a positive integer K ≤ |F|, where F is a polynomial function of |P |. QUESTION: Does P ∗ contain a sequence q̃ of length less than or equal to K, such that R ⊆ σ(q̃)? For q̃ a sequence of operations, σ(q̃) is the result of testing the application of the sequence q̃ to a given baseline state. This should at least satisfy the requirements given in R, but may satisfy more requirements. The reason that we require the integer K to be less than or equal to a polynomial function of the size of the set of operations is that in practice no one implements a sequence of operations of arbitrary length. In practice, there is a finite upper bound on the number of operations, and people seldom repeat an operation more than a few times. As we pointed out in the previous section on the CS problem, the integer K makes composability an optimization problem. In many cases, configuration management does not need to be optimal; any sequence of operations that accomplishes the result will do. However, even when a non-optimal solution is acceptable, a polynomial algorithm that can compute such a solution is still difficult to construct due to the lack of structure of the test process function σ. In other words, even if we lower our standard to a non-optimal solution, the computational cost is not greatly reduced. We will discuss optimization and near-optimization in Section 13.2. Proposition 18 If operations in P are commutative and idempotent, then CS is reducible to GC. Proof: The difference between the two problems is that in GC, each option for operations is a sequence, whereas in CS each option is a set. If all operations in a sequence are commutative and idempotent, then there is a 1-1 correspondence between equivalent subsets of sequences and sets, and the reduction simply utilizes this correspondence. 2 Theorem 3 GC is NP-hard. Proof: We must show that GC is at least as difficult as a known NP-hard problem. Restrict GC to CS by allowing only instances in which the operations are commutative and idempotent. 2 We know from the proof in Section 9.2 that commutative and idempotent operations must be stateless and consistent, so the above proof shows that even with the limits of idempotence, statelessness, consistency, and commutativity, composability is still an NP-hard problem. We need to point out that once we have the subset of P that satisfies the user's requirements, which means we have somehow solved CS, order does not matter if we know the precedence of operations or the operations are consistent, stateless, and aware[28].
If we are sure of the appropriate sequences, the order in which we write down the elements does not matter; we can re-sort them into an appropriate order later. Also, in Maelstrom[28], Couch claims that if the operations meet the requirements of consistency, convergence (which for Couch is indistinguishable from statelessness) and awareness, a specific sequence of operations with length O(n2 ) will try out all the permutations of n operations. The success of Maelstrom depends on the axiomatic existence of consistency, statelessness and awareness; lack of these properties for any operation causes Maelstrom to fail. Cfengine employs Maelstrom in a very simple form; its operations are simple enough to comply with Maelstrom's conditions. 10.2.3 Atomic Operation Composability A common illusion is that composability of atomic operations is tractable. Atomic operations have known preconditions and postconditions. The preconditions of an atomic operation p ∈ P are a set of tests O ⊆ T that must be true before applying p. The postconditions of an operation p are a set of tests R ⊆ T that will be true after p has been applied, given that the preconditions have been met; otherwise, p will do nothing. Definition 21 ATOMIC OPERATION COMPOSABILITY (AOC) is defined as follows: INSTANCE: Set P = {p1 , p2 , · · · , pm } of m atomic configuration operations, set T = {t1 , t2 , · · · , tn } of n tests, set of user requirements R ⊆ T , and a positive integer K ≤ |F|, where F is a polynomial function of |P |. Each p ∈ P is associated with a set O ⊆ T and a non-empty set R ⊆ T ; if the tests in O are all true, then after p is applied the tests in R will all be true, else p will do nothing. A test process function σ : P ∗ → power(T ) is given, where for p̃ ∈ P ∗ , σ(p̃) represents the tests in T that succeed after applying p̃ to a given repeatable baseline state; suppose the computation time of σ is a linear function of |T |. QUESTION: Does P ∗ contain a sequence q̃ of length less than or equal to K, such that R ⊆ σ(q̃)? Note that the postcondition set R assures certain behaviors after the execution of an atomic operation. It must be non-empty; otherwise an atomic operation degrades to a general operation without limits on its behavior. Theorem 4 AOC is NP-hard. Proof: In the following we prove that AOC is at least as difficult as RCS by showing that the operations in RCS are atomic operations with an empty precondition set. Recall that we restrict GC to CS by idempotence and commutativity so that we have a set rather than a sequence. We further restrict CS to RCS by requiring compositional predictability, i.e., σ(Q) = ∪pi ∈Q σ({pi }) for Q ⊆ P . Suppose the set of operations P is idempotent, commutative, and compositionally predictable; then the observed state of the sequence p ◦ q̃ is σ(p ◦ q̃) = σ({p, q̃}) = σ(p) ∪ σ(q̃), thus the postcondition set σ(p) is a subset of σ(p ◦ q̃); in other words, the observed behavior of p is always assured after the operation is applied to any actual state. Therefore, operations with the limits of idempotence, commutativity, and compositional predictability are special atomic operations with an empty precondition set. We have shown that RCS, the composability problem for operations with the limits of idempotence, commutativity, and compositional predictability, is NP-complete. Thus AOC is NP-hard. 2 10.2.4 Composability of Partially Ordered Operations What if P is a partially ordered set? Will precedence ease composability?
Definition 22 COMPOSABILITY OF PARTIALLY ORDERED OPERATION SETS (CPOOS) is defined as follows: INSTANCE: Set P = {p1 , p2 , · · · , pm } of m configuration operations, set T = {t1 , t2 , · · · , tn } of n tests, set of user requirements R ⊆ T , a positive integer K ≤ |F|, where F is a polynomial function of |P |, a partial order S = {(pi , pj )|pi , pj ∈ P }, where (pi , pj ) ∈ S exactly when pi must precede pj , set P ⊆ P ∗ conforms to the ordering in S, test process function σ : P → power(T ), where for p̃ ∈ P , σ(p̃) represents the tests in T that succeed after applying p̃ to a given repeatable baseline state. Suppose the computation of σ is linear in |T |, i.e., O(n). QUESTION: Does P contain a sequence q̃ of length less than or equal to K, such that R ⊆ σ(q̃)? Theorem 5 CPOOS is NP-hard. Proof: 75 In the following we prove that CPOOS is restricted to CS by allowing only instances with idempotent operations. Order of operations does not matter if we know the precedence of those operations. In other words, if we are sure of the appropriate sequences, the order of writing down the elements is trivial; we can resort them into an appropriate order later using an algorithm like topological sort which takes O(|P | + |S|) [24]. Thus a set of operations with known precedence can be treated as the same way as a commutative set. We have shown that GC collapses to CS with restriction of commutativity and idempotence. Thus we can restrict CPOOS to CS by allowing only instances with idempotent operations. 2 10.2.5 Composability of Convergent Operations Can convergence help to reduce complexity? As we discussed before, a convergent operation is an operation whose repeated application has a fixed point, after achieving that fixed point additional execution of the operation does not add anything to the system, i.e., ∃n0 , when n ≥ n0 , σ(p◦pn ◦ q̃) = σ(pn ◦ q̃). Definition 23 CONVERGENT OPERATION COMPOSABILITY(COC) is defined as follows: INSTANCE: Set P = {p1 , p2 , · · · , pm } of m convergent configuration operations. Let P ⊆ P ∗ be the set of sequences that show stabilized behaviors, i.e., where for p̃ ∈ P , repeating any operations in sequence p̃ after execution of p̃ will not change the observed state of the system. Set T = {t1 , t2 , · · · , tn } of n tests, set of user requirements R ⊆ T , and a positive integer K ≤ |F|, where F is a polynomial function of |P |, a test process function σ : P → power(T ), where for p̃ ∈ P , σ(p̃) represents the tests in T that succeed after applying p̃ to a given repeatable baseline state. Suppose computation of σ takes time O(|T |). QUESTION: Does P contain a sequence q̃ of length less than or equal to K, such that R ⊆ σ(q̃)? Theorem 6 COC is NP-hard. Proof: COS is restricted to CS by allowing only instances that the operations are idempotent (idempotence is a special case of convergence) and commute. Since this easier problem is NP-hard, so is the larger problem of composability for unrestricted operations. 2 76 10.2.6 Summary of The Proofs In this section we have shown mathematically that composability is an NP-hard problem. The GENERAL COMPOSABILITY (GC) problem is restricted to COMPONENT SELECTION (CS) with idempotence and commutativity. CS is restricted to RESTRICTED COMPONENT SELECTION (RCS) with compositional predictability. RCS is transformed from MINIMUM COVER which is a known NP-complete problem. Then we prove that ATOMIC OPERATION COMPOSABILITY, COMPOSABILITY of PARTIALLY ORDERED SETS and CONVERGENT OPERATION COMPOSABILITY are all NP-hard problems. 
We have shown that give a list of requirements and a configuration operation repository with well defined behavior, the process of finding the optimal sequence of operations to meet the requirement is an NP-hard problem and the problem remains NP-hard regardless of whether operations are: • idempotent/stateless/convergent • consistent/commutative/atomic • orderable via a known partial order on the operations (known dependency between operations) Refer to Figure 10.1 for summary of proofs. 10.3 Discussion Practitioners of system administration argue about limits upon configuration operations. Some maintain that operations should be constructed to be repeatable without consequence once a desirable state has been achieved. Others maintain that operations must be based upon imperative order or upon generating the whole configuration as a monolithic entity. Our work proves that these arguments are not based upon mathematical fact. The problem remains hard no matter how one limits operations. We will discuss how current tools get around composability problem in Chapter 12. 77 Figure 10.1: Summary of proofs 78 Chapter 11 Dependency Analysis Dependency analysis is the process of discovering interactions and relationships between entities of a complex system. It is widely used in industries that design and develop products such as software systems and hardware devices. Dependency analysis is a difficult process in the context of a system with complex interactions. The “system” in the world of system administration is a network of hundreds or thousands of complex subsystems such as computers, software applications, and hardware devices. Fortunately, the goal of system administrators is not a full understanding of the interactions and relationships among entities of a system as long as following the instructions of some documentation of the system results in desired system behavior. By following a documented procedure, system administrators bypass the “hardness” of dependency analysis. However, dependency analysis cannot be fully avoided due to the following reasons: 1. Documentation is not always 100% accurate. This is due to software bugs, human errors, and constraints of development time and expense on the developers’ side; and dynamic use of the system, software upgrades/patches, and execution of scripts that have global effects on the system administrators’ side. 2. Documentation does not address all possible working environments. Modern systems are constructed with hundreds and thousands of components from independent developers and vendors. Even if all of the components are fully tested and verified by their developers for some ideal environment, it is possible that they fail to function as specified the current environment. Dependency analysis is used in many areas of system administration, e.g., 79 • Root cause analysis - determining the cause of a failure. • Impact analysis - determining which entities of a system or customers will be affected by a problem. • Change analysis - determining the consequences of changes made to a system. • Requirements analysis - determining the requirements necessary to provide a service. We first introduce currently used techniques in dependency analysis in Section 11.1. In Section 11.2 , we introduce basic concepts. In Section 11.3, we formally define dependencies. In Section 11.4, we analyze the complexities of black and white box analysis. 
11.1 Dependency Analysis Techniques In this section, we introduce some common techniques used in dependency analysis: instrumentation, perturbation, data mining, requirements analysis, and dependency control. Requirements analysis and dependency control are white box approaches (based upon contents of the system) and the rest are black box approaches (based upon the behavior of the system). 11.1.1 Instrumentation Instrumentation is a black box technique in which an “instrument” or ”probe” is placed within one or more entities of the system. Dependencies are calculated by correlating transactions recorded for various entities. Code instrumentation is widely used in circumstances where source code is available. For example, in the Linux operating system, it is easy to instrument kernel functions to track processes and files. A key problem of instrumentation is its intrusiveness. It is possible that the instrumentation changes too much about the computation for it to be usable. Another limitation of instrumentation is that it may be unusable in situations where instrumentation code cannot be inserted into the system due to security requirements, licensing, or other technical constraints, e.g., in commercial software. 11.1.2 Perturbation One uses perturbation/fault-injection[13, 11] in cases where the behavior one seeks to understand is too rare to occur in practice, or when it is so infrequent that we need to artificially create circumstances where it will happen. Perturbation is the process of explicitly perturbing system 80 entities while monitoring the system’s response. Any behavioral change caused by perturbation infers some dependency relation. Fault injection is used as a perturbation tool. One arranges for one component to “fail” and then observe the results. This is especially useful if the component failures model realistic situations. A fault can be modeled as a simple bit flip, locking of a file, a disk filling up, or overloading of a service[14]. 11.1.3 Data Mining Data mining is the process of extracting knowledge hidden in large volumes of raw data. It is used in many areas such as financial markets analysis, image recognition and bioinformatics. Facing a large amount of statistical data describing system states, researchers in system administration begin to show appreciation of the technique of data mining. Strider[40] is an administrative tool for Microsoft Windows that uses state differencing to identify potential causes of differing program behaviors. Windows registry entries are used to describe system state. Strider attempts to identify regions where a change might cause a problem. It does this by correlating changes, determining what changes are “normal”, and filtering them out. The user must then analyze the resulting registry map, which consists of changes that might cause a problem. 11.1.4 Requirements Analysis One of the keys of white box dependency analysis is to find the requirements (or preconditions) of an entity. It is crucial for this approach to find the requirements that can affect the entity’s behavior which concerns us. Sowhat[86] is a system administrative tool that performs global impact analysis of dynamic library dependencies of Solaris and Linux systems. Dependencies include requested library names or paths that are hard coded in executable programs for use by a dynamic linker. The diagnostic program ldd provided by the operating system exposes executable programs’ dependencies on dynamic libraries. 
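As a rough illustration of how such requirements can be extracted automatically, the following Python sketch shells out to ldd and collects the library names an executable requests; the output parsing is an approximation of ldd's actual format and is not how Sowhat itself is implemented.

import subprocess

def dynamic_deps(path):
    """Return the set of shared-library names an executable requests,
    as reported by ldd.  Lines typically look like
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x...)
    but the exact format varies by platform, so this parse is approximate."""
    out = subprocess.run(["ldd", path], capture_output=True, text=True, check=True)
    deps = set()
    for line in out.stdout.splitlines():
        parts = line.strip().split()
        if parts and (parts[0].startswith("lib") or "=>" in line):
            deps.add(parts[0])
    return deps

if __name__ == "__main__":
    print(sorted(dynamic_deps("/bin/ls")))

A tool like Sowhat aggregates this kind of per-executable information across a network in order to answer global impact questions; the sketch shows only the extraction step for a single binary.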
11.1.5 Dependency Control Another approach of dependency analysis is to proactively and strictly define critical dependencies. The Linux Standard Base (LSB) project[48] seeks to provide a dynamic linking environment within Linux in which vendor-provided software is guaranteed to execute properly. The goal of LSB is to identify a set of core standards that must be shared among distributions in order to guarantee that a product that works properly in one of them will work in all compliant distributions. These standards 81 include requirements for the content of dynamic libraries, as well as standards for locations of system files used by library functions. With these standards in hand, the LSB provides tools with which one can certify both environments and programs to be compliant with the standard. Linux distributions can be examined by an automatic certification utility that checks link order, versions of libraries, and locations of relevant system files. A distribution may have more libraries than the standard specifies, but the libraries specified in the standard must be first to be scanned during linking and must contain the appropriate versions of library subroutines. Another certification utility checks that the binary code for Linux applications only calls library functions protected by the standard. Since the LSB tools solely analyze the contents of binary files, they can check closed-source executables for compliance. 11.2 Basic Concepts To be as general as possible, a system is composed of entities that collaborate and communicate with each other to achieve some common goals. Each entity in this community has its own mission or duty to accomplish, which we call functional expectations (We will define them formally later). A dependency exists when an entity cannot achieve its functional expectations by itself but only with cooperation from others. An entity can be anything that concerns us, e.g., software, hardware, object, parameter, or a subsystem of entities. Within a system, there are two kinds of dependency relationships: vertical and horizontal dependencies. • Vertical dependencies are dependencies where a change of one entity may directly result in a change of behavior of another entity. For example, one entity is a subpart/subroutine of another entity; or one entity’s output feeds in the input of another entity. Vertical dependencies form a hierarchy of relationships among entities. • Horizontal dependencies are dependencies that two or more entities are peers and they must coordinate to achieve some common goals. A change of one entity may not directly change the behavior of another but functional expectations are no longer met. In order to meet functional expectations, other peer entities must change their states accordingly. There are two approaches to dependency analysis: “white box” and “black box”. 82 11.2.1 White Box The white box approach is to analytically study and understand the internal structure of the system (e.g., source code) and derive dependencies from that structure. This method relies on a human or a static analysis program to analyze system configuration, installation data, and application code to compute dependencies. White box dependency analysis is content-oriented, not behavior-oriented. It attempts to discover why there is a dependency, i.e., analyzes the cause and effect relationships between an entity and its environment. It is possible that some dependencies are buried deep in the system and do not affect external behaviors at all. 
Thus a dependency found by a white box method might not have an impact on the behavior of the system. For example, when we download a file from a remote server, normally it does not matter to us how the packets are routed, so the dependencies of the routing mechanism upon other entities do not concern us.

Further, some dependencies found by white box dependency analysis might not be useful. For example, when a process looks up a user name, it first searches for the user name in the /etc/passwd file, then in LDAP. At many sites, /etc/passwd is empty and all users are in LDAP (or NIS, NIS+), so the reference to /etc/passwd never has any effect upon the outcome. From white box dependency analysis, we conclude that the process depends upon /etc/passwd, but this dependency does not affect the behavior of the process and thus has no real value. Thus it is critical for white box analysis to identify which dependencies affect behavior and need to be tracked.

When the system is simple and its internal operation is well understood, the white box approach can suffice. However, this approach breaks down when the system is complex and implementation details are unknown or too complex to analyze.

11.2.2 Black Box

A black box approach is based on system behavior, using tests and inference. Each system entity is treated as a black box with some configuration parameters outside the box to control its behavior. The internal operations are unknown to the analyzer. To help in understanding, a black box is like a toy to a child: it has some external buttons and/or handles to control its behavior, but the internal nuts and bolts are completely hidden from the child.

Black box dependency analysis helps answer how a dependency can affect system behavior. The shortcoming of black box dependency analysis is that it cannot discover the cause and effect of a dependency. For example, one sees a rooster crow and the sun rise. One who uses black box dependency analysis could mistakenly conclude that sunrise depends upon a rooster’s crow. Further, black box dependency analysis can never conclude independence (lack of dependence); it can only conclude lack of observation of dependence.

11.2.3 Functional Expectations and Tests

The concept of functional expectations is crucial in dependency analysis. Without some concept of how systems should behave, dependency analysis is meaningless. It may seem that vertical dependencies have nothing to do with functional expectations; the dependency is there regardless of whether there are functional expectations. However, dependencies irrelevant to functional expectations are also irrelevant to our considerations. For example, we may paint our server different colors, so the color of our server has a dependency upon the painting process. However, if the functional expectations of our server do not include a requirement on the color, then this dependency has no value in our management.

Functional expectations consist of behaviors we would like to observe, in the form of tests that should be true. It does not matter if more tests are true than we desire; all that matters is that the requirements we have are met. For example, the functional expectations of a web server system might include:

• it has a valid IP address;
• it responds to one or more domain names and ports;
• it provides correct contents for each URL request;
• it calls CGI scripts appropriately;
• it interacts appropriately with database servers; and
• it meets site-specific security requirements.
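Such expectations can be encoded directly as Boolean tests of observed behavior; the sketch below is our own illustration (host names and URLs are hypothetical placeholders) rather than part of the formal model.

```python
import socket
import urllib.request

def resolves(name):
    """Test: the server responds to a domain name (DNS resolution succeeds)."""
    try:
        socket.gethostbyname(name)
        return True
    except OSError:
        return False

def serves_url(url, expected_fragment):
    """Test: the server provides correct contents for a URL request."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return expected_fragment in response.read().decode(errors="replace")
    except OSError:
        return False

# R_webserver: a test set encoding (part of) the web server's functional expectations.
R_webserver = [
    lambda: resolves("www.example.com"),
    lambda: serves_url("http://www.example.com/", "Welcome"),
]

if __name__ == "__main__":
    print(all(test() for test in R_webserver))   # True only if all expectations are met
```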
In practice, most of the items in the list are expressed as a group of tests. This simple example provides the basic idea of using tests to express functional expectations.

Definition 24 The functional expectations for an entity E are a test set RE ⊆ T (where T is the set of possible tests that can be performed on the system) that encodes expectations, including the desired behavior of the entity or the desired results expected from the existence or execution of the entity.

Note that functional expectation tests are not necessarily performed on the entity itself; they may instead concern its environment or surroundings or interactions with other systems. For example, the functional expectation for the code validator of LSB is that if an application passes the validator, it should work on any LSB compliant run-time environment. Thus testing this functional requirement requires moving the code to another compliant system and verifying its function there.

Further, functional expectations for an entity can be more than its basic functions. Consider, for example, a DNS server. Obviously, its functional expectations should include converting host names into IP addresses and vice versa. However, those are not all of its requirements. A typical DNS server must also serve appropriate information in a collaborative way so that other servers, e.g., a web server and a DHCP server, can produce a network that meets functional expectations.

11.2.4 State and Behavior

Dependencies only exist in systems where there are multiple, semi-independent entities. We refer to these entities as {Ei}.

Definition 25 The state e of an entity E is described by the contents of the entity that are used to control its behavior. We use ei to denote a state of the ith entity of a system.

In configuration management, the state of an entity is its configuration. This is typically the state of one or more files associated with the entity. The distinctions between entities are often imprecise, and it may well be that one file is part of the configurations of two distinct entities.

Definition 26 Every entity E has a set of possible states, denoted as E. Again we use Ei to denote the set of possible states of the ith entity of a system. The set of possible states of a system of n entities is S = {s = (e1, e2, · · · , en) | ei ∈ Ei} ⊆ E1 × E2 × · · · × En.

Due to the overlaps between configuration parameters, the state space of a system of entities is not necessarily the product state space of the individual entities; only a subset of the product space is meaningful, and all states may not even be achievable.

Definition 27 Given a set of Boolean tests T, the observed behavior of a state of a system or entity is a subset U ⊆ T where t ∈ U exactly when test t returns TRUE.

Definition 28 V(S) is a relation describing the outcomes of tests upon states in S. V(S) = {(s, U) | s ∈ S, U ⊆ T is an observed behavior of state s}.

For a state s ∈ S, V(s) = ∩{Ui | (s, Ui) ∈ V(S)}, the intersection of all behaviors Ui observed for s. This is the set of observed behaviors that remain true regardless of other (perhaps external) influences. Unlike the relation V(S), V(s) is always a function from S into subsets of T. V(s) is the set of observed tests that remain true for the state s of an entity or a group of entities, regardless of what is happening in the outside world around the entity or the group of entities.
For example, if our configuration s for a DNS server is that 10.2.3.4 maps to the name foo.com, then the test that this map works in both directions will remain true regardless of what is happening in the outside world. In other words, the contents of V(s) are the observed tests that this entity or group of entities controls, taken from the set of all observed tests.

Definition 29 The function Q : power(T) → power(S) is defined by Q(U) = {s ∈ S | V(s) ⊇ U}. This is the set of system states that satisfy the tests in U. Here we require that the states in Q(U) be system states.

11.3 Dependence Definition

Dependency is lack of independence. One entity cannot achieve its functional expectations without requiring that some other entity or entities be in particular states.

11.3.1 Dependency in a Closed-world System

Strictly speaking, dependency is difficult to define precisely in an open-world system. In an open-world system, unknown outside forces may affect the behavior of system states. When the system fails to achieve functional expectations, we are not sure whether a dependency is broken within the system or some outside influence has affected the behavior of the system. For example, consider a system of a printer and a web server in an open world where electric power is considered an outside influence that can change. Suppose the printer changes state and at the same moment the electric power is shut off (but this is unknown because electric power is considered an element outside of the system). We might then draw the incorrect conclusion that there is a dependency between the web server and the printer, because we see a change of state in the printer and a failure in the web server.

Dependence is a relationship between entities in which one entity cannot carry out its mission (functional expectations) by itself without the cooperation of other entities. Formally, the definition of dependency is as follows:

Definition 30 Given a closed-world system of n (n ≥ 2) entities {Ei} and a test set T of m tests, where each entity has functional expectations Ri ⊆ T, Ei depends upon Ej if there exist two system states s and s′ ∈ S such that:

• the content difference between s and s′ is caused only by a change of entity Ej’s state, and
• s can satisfy Ei’s functional expectations but s′ cannot, i.e., V(s) ⊇ Ri and ∃t ∈ Ri such that t ∉ V(s′).

The meaning of this definition is that Ei depends upon Ej if the choice of Ej’s state can possibly change whether the functional expectations for Ei are met. In an open-world system, it is possible that the behavior of a system state varies over time; it can sometimes satisfy a set of functional expectations and sometimes not. We will discuss dependencies in an open-world system in the next section.

The “critical set” of a dependency is a set of tests that specify which functional expectations of the dependent will be affected if the dependency is not satisfied.

Definition 31 For entities Ei and Ej, the functional expectation t ∈ Ri is a critical condition if the results of this test differ depending upon the choice of state for Ej. Rij ⊆ Ri is the set of all critical conditions of Ei with respect to Ej, taken over all pairs s, s′ as described above.

Some dependencies might be more critical than other dependencies. For example, the functional expectation of correctness of a server might be more important than the functional expectation of its performance.
Thus dependencies that compromise correctness if not satisfied are more important than those that compromise performance if not satisfied.

Strength is a metric describing how “heavily” one entity depends upon other entities.

Definition 32 For entities Ei and Ej, let W(Ei, Ej) be the set of states in Q(Ri) for which there exists an s′ as described above. The strength of the dependency is |W(Ei, Ej)|/|Q(Ri)|.

Note that if the strength of a dependency is less than 1, the state of Ej can make Ei fail to conform to expectations in some system states but not in others. For example, suppose that a Java application can be configured to get a required class either locally or remotely (via, e.g., a web server or a shared filesystem). In this case there are two states conforming to the functional expectations of this application: one is to get the required class locally and the other is to get the required class remotely. The application depends upon the remote copy of the class at a strength of 1/2 < 1, which means there is some alternative way to achieve the functional expectations for this application even if the network is not available. Dependencies with strength 1 indicate that the system is inflexible and that some condition is absolutely required. One can exploit redundancy to weaken dependencies.

Let us consider a system of an Apache server, a dynamic library libz.so, and a database mysql. Apache requires libz.so and mysql. If libz.so is deleted, none of the functional expectations of Apache can be met. If mysql is not available, Apache fails to provide one of its functions: search in a database. Thus, according to our definition, Apache depends upon libz.so with criticality RApache; Apache depends upon mysql with criticality R = {“search in a database”} ⊂ RApache, where “search in a database” is a single functional test.

11.3.2 Dependency in an Open-world System

In an open-world system, dependency cannot be defined in absolute terms, because literally anything whatsoever can happen to affect behavior. In the absence of external effects, however, there is a weak form of dependency based upon the assumption that external effects are absent or at least quiescent.

Dependency in an open-world system can only be studied in a temporal domain. We can only say that at some specific time, this state depends upon that state. For example, in an open-world system with a web server and a DNS server, the claim that the states of the web server depend upon the states of the DNS server is temporal, depending upon whether there is a backup DNS server. We can only say that at some specific time the states of the web server depend upon the states of the DNS server, and that at some other time, when the backup server is present, the functional state of the web server does not depend upon the states of the DNS server, since the failure of the DNS server cannot then affect the service of the web server.

Observation 1 In an open-world system, if there are intervals of time in which the external influences on functional expectations of the system do not vary, dependencies can be analyzed during these times as if the system were a closed-world system.

Discussion: In a closed-world system, external influences either affect only behaviors that are not part of the system’s functional expectations, or else remain constant if they can affect behaviors that are part of the functional expectations.
During these intervals, there is no variation of external influences on the system’s functional expectations, so we can analyze the system as if it were a closed-world system.

11.4 Complexity of Dependency Analysis

11.4.1 Black Box Dependency Analysis

To simplify the problem, we assume that there is no overlap of configuration contents of different entities. Further, we assume that each entity can only take a limited number of possible states. This is a practical limitation of any realistic system. From the definition of dependency, the problem of dependency analysis can be defined as:

INSTANCE: A closed-world system of n (n ≥ 2) entities {Ei} and a test set T of m tests; each entity has a functional expectations set Ri ⊆ T; each entity has at most d possible states; a system state s ∈ S is a vector state composed of the states of individual entities; a function R : S → power(T) takes linear time (proportional to the number of tests) to compute the behavior of a system state.

QUESTION: Does Ei depend upon Ej, i.e., ∃ s, s′ ∈ S, where s′ ≠ s, such that:

• the only difference between s and s′ is the state of Ej, i.e., s = (e1, e2, · · · , ej, · · · , en) and s′ = (e1, e2, · · · , e′j, · · · , en) where ej ≠ e′j;
• s satisfies the functional expectations for Ei, i.e., R(s) ⊇ Ri;
• s′ does not satisfy the functional expectations for Ei, i.e., there is at least one t ∈ Ri such that t ∉ R(s′).

This problem is not in NP because, no matter how fast we can perform these tests, the test function forms an exponential-size table for looking up the behaviors of different system states (if each entity can have d possible states, the total number of system states in the worst case is d^n). By the definition of black box dependency analysis, one cannot know the result of testing by analyzing the contents of system states, but only through testing the external behaviors. In other words, black box dependency analysis has no internal structure to exploit. Thus the problem remains in EXPTIME if no further assumptions are made.

However, if we have some prior knowledge of which system states can achieve the functional expectations for Ei, which in many cases we do, and the number of those system states is O(n), dependency analysis collapses into P, because we only need to change the state of Ej and test the system at most d times for each working system state.

To summarize, black box dependency analysis is tractable if the following two conditions are true:

1. each entity can only take a limited number of states;
2. we have some prior knowledge of which system states can achieve the functional expectations.

The violation of condition 1 makes the problem unbounded in general; the violation of condition 2 places the problem in EXPTIME, because we need to test every possible system state in a state space of exponential size.
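To make the tractable case concrete, the following sketch (our illustration, under the assumptions of the INSTANCE above) checks whether Ei depends upon Ej by starting from system states already known to satisfy Ei’s expectations, varying only Ej’s state, and re-running the tests; run_tests is a hypothetical stand-in for the behavior function R.

```python
def depends_upon(working_states, j, states_of_Ej, run_tests, R_i):
    """Black-box check of Definition 30 under the assumptions of the INSTANCE above.

    working_states: system states (tuples of entity states) already known to
                    satisfy E_i's expectations, i.e. R_i <= run_tests(s).
    j:              index of entity E_j within a system state tuple.
    states_of_Ej:   the (at most d) possible states of E_j.
    run_tests:      hypothetical stand-in for R: state -> set of passing tests.
    R_i:            E_i's functional expectations, a set of tests.
    """
    for s in working_states:                      # O(n) working states by assumption
        for e_j in states_of_Ej:                  # at most d alternatives per state
            if e_j == s[j]:
                continue
            s_prime = s[:j] + (e_j,) + s[j + 1:]  # differs from s only in E_j's state
            if not R_i <= run_tests(s_prime):     # some expectation of E_i now fails
                return True
    return False
```

With O(n) known working states and at most d states per entity, this sketch performs at most O(nd) test runs, which is the collapse into P described above.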
11.4.2 White Box Dependency Analysis

Unlike the black box approach, in which the external behavior of each entity is all that is available, in the white box approach some representation of the content of an entity is available for analysis. In black box analysis, we may change the values of some configuration parameters to observe the entity’s behavior after the change, but we cannot analyze the internal logic of how these configuration parameters affect the entity’s behavior. In an abstract sense, many kinds of entities, especially software components, can be viewed as programs running on a Turing machine. Any process aiming to understand the internal logic of an entity is a white box analysis.

White box analysis includes control flow analysis and data flow analysis. Control flow analysis is the process of determining the instructions that a particular execution reaches or does not reach. Likewise, data flow analysis is the process of determining, for a specific computed quantity, which inputs affected its value. For example, in data flow analysis, a dependency is established if the value of one variable influences another; in control flow analysis, one entity depends upon another if the execution of the second depends upon whether the first was executed.

In general, both control flow analysis and data flow analysis are intractable. Given an arbitrary program, the process of finding whether its preconditions hold or analyzing its postconditions is not necessarily decidable. To illustrate the idea, consider the following simple example:

Precondition: I am penniless.
start: Work to earn a dollar
       Buy a lottery ticket
       If ticket is not a jackpot winner, GO TO start
       If ticket is a jackpot winner, celebrate.

What are the postconditions of the program? If this program ever finishes, then you will be rich; the problem is that there is nothing to stop it from going round and round in circles forever! Here we encounter the same halting problem as in program correctness: given an arbitrary program, a procedure for deciding its preconditions and postconditions might not halt.

11.5 Discussion

From the analysis above, we conclude that neither black box nor white box dependency analysis is tractable in the worst case. In black box analysis, since we do not have any idea of the internal structure of the box, we have to perform every possible test to discover dependencies. In white box analysis, the complexity of the contents of entities makes the problem intractable. Further, since we do not know the behavior of the system, we have to analyze every dependency in the system to guarantee behavior.

In practice, a gray box approach, a mixture of black box and white box dependency analysis, is used. In the gray box approach, the system is not “completely black” to the analyzer. One opens the box to a depth necessary for the analysis, but never intends to do a complete white box analysis of the whole system. In this way, even with limited knowledge of internal structure, we can choose what to test next, thus reducing the testing state space significantly; and with the test results, we are able to identify critical regions for white box analysis, thus eliminating much unnecessary cost spent on dependencies that are irrelevant to the behavior that concerns us. The cost of gray box analysis varies on a case-by-case basis. Theoretically, it is still intractable in the worst case, in which neither the black box nor the white box ingredient of the approach gains us anything. However, in practice, it normally reduces the complexity of the problem.

The way people make dependency analysis tractable is by refining their state model only in the presence of non-determinism. They adopt naive models that seem deterministic, and add states only when that seeming determinism is violated. Dependency analysis can be avoided if system administrators have enough experience and/or the necessary knowledge to configure the system can be gained by reading documentation.
This suggests the following conditions: precise documentation, efficient search mechanisms through documentation, and a homogeneous environment (which maximizes the value of system administrators’ experience).

In summary, we gave formal definitions of white box and black box dependency analysis in system administration and examined the computational complexity of each approach. We showed that both white box and black box dependency analysis are intractable.

Chapter 12
Configuration Management Made Tractable in Practice

In this chapter, we discuss some general guidelines used by system administrators to keep configuration management tractable and examine current strategies of configuration management as examples. Configuration management is intractable in the general case because complete understanding of the system and its behavior is intractable. People manage complexity in many ways. There is no perfect solution that makes configuration management tractable without making some sacrifice of accuracy, flexibility, or convenience. The intrinsic complexity cannot be destroyed but can only be hidden. In the following paragraphs, we summarize the mechanisms used by system administrators to reduce the complexity of configuration management. Note that these mechanisms are often used in combination.

12.1 Experience and Documentation

System administrators use their experience of past solutions and documentation to avoid dependency analysis and composing operations. After validating that it is proper to use past experience or documentation in the current environment, they simply repeat those actions or follow the instructions of the documentation without doing dependency analysis and composition.

For example, suppose a system administrator needs to install a network card. The installation guide instructs one to install the driver first, then plug the card into the machine. The system administrator can just follow these instructions without bothering to analyze the interactions and relationships between the software and hardware of the network card. It is not important to know that if the instructions are not followed, the card will not work; this fact is not even considered. Often, this leads to cases where procedures are overly constrained for no particularly good reason, simply to avoid a deeper analysis of dependencies.

The drawback of this approach is that experience and documentation are not always 100% accurate and might not be appropriate for the current environment in which system administrators are working. In a very dynamic environment, it is possible to misconfigure the system by simply following instructions from documentation or experience that do not apply to the system’s current state.

12.2 Reduction of State Space

Reduction of state space is a matter of limiting one’s choices during configuration. With a set of n operations, theoretically, one can compose an infinite number of possible sequences. Thus there exist a large number of possible actual states of the system. However, there are only a few distinct behaviors that we might want the resulting system to exhibit. Limiting the states of the system to match the number of distinct desirable behaviors simplifies the configuration process. If we impose a limit so that only a few sequences can be chosen, composition of operation sequences becomes tractable simply by choosing from a limited number of sequences. In the extreme, composition entails just the selection of the one operation available, which is obviously tractable.
This is a strong argument for “overconstraining” the documentation so that the options available do not overwhelm the administrator. Following instructions from documentation is a case of reduction of state space; only one sequence of operations is available to choose. Acting according to shared practices, agreed upon by a group of system administrators for a site, can also efficiently reduce the state space of the system.

By restricting the heterogeneity of the network (making machines alike), one can increase the predictability of the system. Solutions on one machine are likely to work on another. Instructions in documentation become more feasible to apply across a large network. Homogeneity can be achieved by always following the same order of procedures or by cloning machines.

Standardization is an effective method to proactively and strictly define critical dependencies. The Linux Standard Base (LSB) project[48] seeks to provide a dynamic linking environment within Linux in which vendor-provided software is guaranteed to execute properly. LSB achieves compatibility among systems and applications by enforcing restrictions on critical dependencies, including the locations of system files and the names and contents of dynamic libraries.

The drawback of reduction of state space is its inflexibility during configuration. Strategically, variety in configuration can pay off in terms of productivity. Moreover, a varied system is less vulnerable to a single type of failure. System administrators must find an appropriate balance between homogeneity and variety.

12.3 Abstraction

Abstraction means representing complex dependency relationships by simpler mechanisms in order to hide complexity. System administrators have a simple model of the system. They operate at a higher level of abstraction than the interaction details of different subsystems.

In many software systems, interactions and relationships among packages within the system are abstracted as strings of fields under “requires” and “provides”. This dependency information does not come from system administrators’ analysis of packages but from the white box knowledge of the developers of the system. The true dependency relationship is hidden by this abstraction. For example, RedHat Package Manager (RPM) files and their equivalents use this mechanism to ease the task of adding software to a Linux system. In the package header, each package is declared to “provide” zero or more services. These are just strings with no real semantic meaning. A package that needs a service then “requires” it. This dependency information can be considered as “documentation” embedded in the system. Just like other documentation, this abstraction can be in error[55].

Another abstraction is using dependencies between procedures instead of dependencies between system components. The dependencies between entities of the system are represented by the order in which procedures are performed. For example, in order to install a wireless card, one must install its software before plugging in the card. The complex interactions between the hardware and software of the card are hidden by the order of these two installation procedures.

The drawback of abstraction is that the representation may not correctly reflect the real dependency, or it may fail to deal with problems that occur at lower levels than the abstraction. For example, if the package names listed in the “requires” field of a package do not include all packages that this package depends on, then a problem occurs.
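For concreteness, the “requires” and “provides” strings recorded for an installed package can be listed with the standard rpm query options; the following sketch (ours, with a hypothetical package name) simply prints both lists, emphasizing that they are plain strings with no deeper semantics.

```python
import subprocess

def rpm_declared_dependencies(package):
    """Print the 'requires' and 'provides' strings that RPM records for an installed package."""
    for tag in ("--requires", "--provides"):
        out = subprocess.run(["rpm", "-q", tag, package],
                             capture_output=True, text=True, check=True).stdout
        print(f"{package} {tag}:")
        for line in out.splitlines():
            print("   ", line.strip())

if __name__ == "__main__":
    rpm_declared_dependencies("httpd")   # hypothetical package name
```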
By looking only at these abstract string names, one cannot solve the problem; some analysis of the implementation details of the package is needed.

12.4 Re-baselining

Re-baselining the system to one of its repeatable baseline states is the strategy used when the complexity has grown to an unmanageable level. The drawback of this approach is that it might be time consuming to rebuild a system from scratch. One must consider the cost of downtime when re-baselining the system.

12.5 Orthogonality

Using orthogonality means separating the configuration parameter space into independent subsets. Component Selection is in P if the requirements that can be satisfied by each operation are disjoint subsets. The drawback is that this is not always feasible, due to interactions and dependencies between subsystems.

12.6 Closure

The techniques used in closure are closely related to documentation and reduction of state space. A “closure” is a “domain of semantic predictability”, a structure in which configuration commands or parameter settings have a documented, predictable, and persistent effect upon the external behavior of software and hardware managed by the closure[30]. Using closure systems is like delegating configuration tasks to someone else; either the closure system or the developer of the closure system is responsible for performing the necessary operations in order to accomplish system goals.

RedHat Enterprise Linux[78] can be considered a closure. It is an operating system that accommodates a wide range of third-party applications. It is self-patching and self-updated by its vendor. However, suppose that one has to add a foreign package to the system managed by the closure. Then one must determine, through dependency analysis, which parts of RedHat Enterprise Linux will be violated, so that they can be managed differently after the insertion. For example, it may be necessary to turn off automatic software updating in order to install certain foreign packages. If the operation of one closure violates the integrity of another closure, unexpected behavior occurs. A consistent closure system of hierarchies of sub-closures is needed to totally avoid dependency analysis.

The drawback of a closure system is its design complexity[31]. A closure cannot exist by itself but can only exist within a system of consistent closures. The design of closures must conquer or smartly avoid the complexities of dependency analysis and composition of operations.

In summary, system administrators utilize the above strategies (except closures) in combination to make configuration tractable. Closure is at an early stage of development. Each strategy has drawbacks, and system administrators must make sacrifices in order to reduce the complexity of configuration. In the following paragraphs we will discuss in detail some ways that several existing strategies reduce complexity.

12.7 Current Strategies Make Configuration Management Tractable

In section 5.2, we introduced several current configuration strategies. No strategy can perfectly make configuration management tractable. Every strategy sacrifices something, including convenience, flexibility, adaptability, efficiency, etc. Most of the guidelines mentioned in the previous section are used in these strategies. However, each strategy has its own special way to implement these guidelines.

In manual configuration, configurations are made entirely by hand.
Manual configuration is only cost-effective for small-sized systems with a few machines and a few users, but it is often utilized even for large networks in which few changes in function are expected over time. Composing configuration operations is done by humans using experience or knowledge from other administrators. Manual configuration essentially requires a tight loop of configuration and testing. System administrators invoke one operation at a time; thus a linear O(n) search (where n is the number of operations), or less, suffices to figure out what to invoke next. After applying an operation, one normally tests whether the operation achieves the intended goal. Latent preconditions still exist, since the system is partially observed, but they are not as significant as in scripting, since system administrators test the system at each step.

In custom scripting, manual procedures are encoded into repeatable automatic procedures using a high-level language such as a shell, Perl, or Python. Composition is still planned and accomplished by humans, though it can be automated if scripts satisfy certain very restrictive conditions [28]. Scripts are often crafted in haste for one-time use [27]. Applying poorly engineered scripts to hosts in an unknown state leads to network rot: variation in actual state (latent preconditions) that becomes exponential with the number of executed configuration operations. Careless use of custom scripting can actually increase the size of the state space. Thus custom scripting is not recommended for large networks.

Structured scripting is an enhancement of custom scripting that allows one to create scripts that are reusable on a larger network. ISConf [59, 90, 91] structures configuration and installation scripts into “stanzas” whose execution order is held constant across a population of hosts. On each host, probes of host state determine which stanzas to execute. The postconditions of the operations already completed are treated as the preconditions of the next (whether or not this is actually true), and these are always the same for a particular host. Thus composability of n operations becomes O(n), since there are only n actual observed states, namely p1(B), p2 ◦ p1(B), · · · , pn ◦ · · · ◦ p2 ◦ p1(B). Note that while composability is thus tractable for an individual host, creating a set of operations that will configure a heterogeneous network of hosts remains intractable.

The main technique for simplifying configuration management in file distribution is reducing the configuration state space. A description of appropriate network behavior is translated into the precise configuration file contents that assure that behavior. Configuration data are stored in a central file repository or database. The agents installed on the hosts read a host configuration generated from that data. The strategy of proscriptive configuration generation limits existing preconditions to a very small number. There is exactly one configuration for each kind of behavior; there is no “unintentional heterogeneity” caused by, e.g., editing the file at different times. Thus there is a bijective map between behaviors and configurations.

Declarative syntax is a configuration management strategy wherein custom scripts are replaced by an autonomous agent [15, 16, 18, 25]. This agent interprets a declarative configuration file that describes the ideal state of a host, then proceeds to make changes that bring the host somehow “nearer” to that ideal state.
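As an illustration of a convergent operation of this flavor (a sketch of our own, not the syntax of Cfengine or any other particular tool), the following function asserts that a configuration file contains a given line and edits the file only when the assertion does not already hold, so repeated application is harmless.

```python
import os

def ensure_line_in_file(path, line):
    """Convergent operation: make sure `path` contains `line`; do nothing if it already does.

    Repeated application yields the same result, and the operation asserts a specific
    content state without guaranteeing any particular behavior of the host.
    """
    existing = []
    if os.path.exists(path):
        with open(path, "r") as f:
            existing = f.read().splitlines()
    if line in existing:
        return False                       # already converged; no change made
    with open(path, "a") as f:
        f.write(line + "\n")
    return True                            # a change was made toward the ideal state

if __name__ == "__main__":
    # Hypothetical example: assert a single parameter setting in a config file.
    changed = ensure_line_in_file("/tmp/example.conf", "MaxClients 150")
    print("changed" if changed else "already converged")
```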
The key to the tractability of convergent operations is that they accomplish a specific change, but do not assert that this change actually guarantees a particular behavior. Since the tool does not consider behavior, the human has to assure it. In addition, the functional overlap of configuration operations remains rather low or non-existent; most operations are orthogonal to one another. Thus, to achieve a specific objective, the agent has no freedom to choose among different operations to accomplish a specific result. The actual state space of the system is reduced to a manageable level.

Another way that declarative syntax simplifies the composability problem is by keeping objectives simple and independent. In Cfengine, e.g., objectives are relatively simple, such as “make this file contain these contents” or “change that parameter to be equal to this”. The issue of how these contents affect behavior is not addressed. If one avoids the meaning of content, and specifies only content, the problem is fundamentally simpler; orthogonality of operations is assured. Thus, the human system administrators are required to translate behavior requirements into file content requirements or parameter setting requirements.

In using many currently available tools, composability is actually assured by system administrators. System administrators may need a long time to gain the experience needed to make appropriate decisions when making changes to satisfy a behavioral requirement, and they must take the totality of any existing configuration into account when making changes. That is partly why the complexity of configuration is always a significant barrier to understanding by new staff.

Chapter 13
Ways to Reduce Complexity

In this chapter, we explore how intractable problems are managed in general, based upon computation theory. Even though many problems are intractable, humans have lived with them for centuries. Trains need to be scheduled, salesmen must plan their trips for marketing, and thieves have to decide in very limited time which items to pick up, even though TRAIN SCHEDULING, TRAVELING SALESMAN, and INTEGER KNAPSACK are all NP-complete problems. The methods by which intractable problems are solved in real life can give some insight into how our composability problem might be solved. There are three approaches to getting around intractability:

• choosing easy instances to solve
• approximation: trading optimality for computability
• memoization and dynamic programming

In the following sections we will discuss some of these strategies as they apply to configuration management.

13.1 Choosing Easy Instances to Solve

NP-completeness refers to worst-case complexity. Even if a problem is NP-complete, many subsets of instances can be solved in polynomial time. We suggest three different ways to reduce complexity in configuration management: reduction of state space, using simple operations, and forming hierarchies and relationships among operations.

13.1.1 Reduction of State Space

Composability can be made relatively easy by use of a reduced-size state space, as we have seen in the current configuration management tools. For details, see Section 12.7.

13.1.2 Using Simple Operations

MINIMUM COVER is solvable in polynomial time by matching techniques if all of the candidate sets c in the cover set C have |c| ≤ 2 [43]. This implies that relatively simple operations that only address one or two user requirements can be composed efficiently.
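The connection can be made concrete by viewing composition as covering a set of requirements with the requirement sets of available operations. The brute-force sketch below (ours, with toy requirement names) is exponential in general; when every operation addresses at most two requirements, the same question can instead be answered in polynomial time by the matching techniques cited above.

```python
from itertools import combinations

def smallest_composition(requirements, operations):
    """Find a smallest set of operations whose requirement sets cover `requirements`.

    requirements: set of requirement labels to satisfy.
    operations:   dict mapping operation name -> set of requirements it satisfies.
    Brute force and exponential in general; tractable special cases are discussed in the text.
    """
    names = list(operations)
    for k in range(1, len(names) + 1):
        for chosen in combinations(names, k):
            covered = set().union(*(operations[op] for op in chosen))
            if requirements <= covered:
                return list(chosen)
    return None   # the requirements cannot be met with the available operations

if __name__ == "__main__":
    ops = {   # toy operations, each addressing one or two requirements
        "configure_dns": {"name resolution"},
        "configure_web": {"serve pages"},
        "configure_web_and_ssl": {"serve pages", "encrypt traffic"},
    }
    print(smallest_composition({"serve pages", "encrypt traffic"}, ops))
```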
That composability is tractable in this case can be understood by considering the components of graphical user interfaces. In these systems, typical components are widgets with very limited functions. The dependencies between widgets are obvious. The semantics of widgets are constrained for easy reuse when a widget is used in a different context. For example, the semantics of a button is to trigger an event when it is clicked or released. There are no underlying assumptions or dependencies that are not obvious from the context. The semantics of widgets and the limited domains in which the widgets are used are simple enough that solutions for syntactic composability are also solutions for semantic composability.

Configuration management is also a somewhat “constrained” domain. Configuration of a site typically includes:

1. Editing the bootstrap scripts.
2. Configuring internal services, typically DNS, LDAP, NFS, Web, etc.
3. Installing and configuring software packages.

We can consider building smaller operations that are designed to work together, engineering them to a common framework, and then scaling up.

13.1.3 Forming Hierarchies and Relationships

MINIMUM COVER is in P if each element of C except the largest element is laminarly “contained” in some other element, i.e., ci ⊆ cj if |ci| ≤ |cj|. In our composability problem, nesting means that operations are sorted in increasing order of the number of objectives that they can achieve; each operation can accomplish at least the objectives of the previous one. Composition of operations is then trivial: always select the least upper bound of the objectives. Also, MINIMUM COVER can be reduced to INTEGER KNAPSACK via surrogate relaxation[67]. The INTEGER KNAPSACK problem is still NP-complete; however, it is polynomially solvable if the values of the items are in increasing order and each value is greater than the sum of the values before it. In this case we can solve the problem quickly by a simple greedy algorithm. In addition, the complexity of MINIMUM COVER can be reduced using a divide-and-conquer method if the subsets are organized into disjoint sequences of subsets. These simplifications of NP-complete problems suggest that we can reduce complexity by forming hierarchies and relationships within the set of operations.

In the following, we propose a system managed by closures[30]. In the theory of closures, configuration parameters are grouped into two distinct sets: exterior parameters and interior parameters. The exterior parameters of a closure completely determine the behavior of the closure. Interior parameters are those that cannot be observed through behavior. For example, the port number of a web server is exterior; the name of its root directory may not be. The former is necessary to pass any behavioral test; the latter may change without affecting external behavior at all. One closure contains another through parameter dominance, i.e., the exterior parameters of the dominated closure must be present in the parameter space of the dominant closure. Thus the dominant closure controls the behavior of the dominated closure. The parameter space of a closure is either disjoint from that of all other closures or it is dominated by some other closure. The system is composed of several disjoint sets of closures; each set is a chain of nesting closures.
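A minimal sketch (ours, with hypothetical parameter names) of how parameter dominance might be represented: a closure records its exterior parameters, and one closure dominates another exactly when the dominated closure’s exterior parameters appear in the dominant closure’s parameter space.

```python
class Closure:
    """A closure described only by its parameter space and its exterior parameters."""

    def __init__(self, name, exterior, interior=()):
        self.name = name
        self.exterior = set(exterior)               # parameters that determine external behavior
        self.parameters = self.exterior | set(interior)

    def dominates(self, other):
        """Parameter dominance: the dominated closure's exterior parameters
        must be present in this closure's parameter space."""
        return other.exterior <= self.parameters

if __name__ == "__main__":
    # Hypothetical example: a web-service closure dominating a web-server closure.
    web_server = Closure("web_server", exterior={"port", "document_root_url"},
                         interior={"root_directory_name"})
    web_service = Closure("web_service", exterior={"site_name"},
                          interior={"port", "document_root_url", "cache_size"})
    print(web_service.dominates(web_server))   # True: web_service controls web_server's behavior
```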
A system can be divided into relatively large, independent (orthogonal) subsystems, and the configuration can be constructed with a few large closures. These closures “contain” smaller closures through parameter dominance. Smaller closures “contain” even smaller closures, and at the bottom are the simple operations we suggested in the previous subsection.

13.2 Approximation

In practice, if a tractable algorithm cannot be found to solve a problem exactly, a near-optimal solution is used instead; this is called approximation.

In many cases, configuration management does not need to be optimal. Any sequence of operations that will accomplish the result will do. One criterion of increasing importance is cost. It is not worthwhile to apply much effort in searching for the optimal solution when a near-optimal solution is “good enough” with low cost, especially when we are composing mostly simple and orthogonal operations. The reason for this is that one can embody a near-optimal sequence – as long as the length of the sequence is polynomial in the number of available operations – into a script that can be executed efficiently. In other words, scripting obviates the need for optimality. Moreover, to avoid latent preconditions, a repeatable composition is more important than an optimal composition.

However, even if we can accept non-optimal solutions, a general polynomial algorithm for finding any sequence of operations that satisfies the requirements can be difficult to construct. In addition, most scripts are complex entities, and the cost of managing and maintaining them is sometimes significant; keeping the number of scripts to a minimum might save cost in the long run. Moreover, since composition of operations is often repeated in similar situations, it makes sense to spend more effort to find the optimal solution once and amortize the cost over all repetitions.

13.3 Dynamic Programming

INTEGER KNAPSACK can be solved in pseudo-polynomial time with dynamic programming. In dynamic programming, we maintain memory of subproblem solutions, trading space for time. The two key ingredients that make dynamic programming applicable are optimal substructure (a globally optimal solution contains within it optimal solutions to subproblems) and overlapping subproblems (subproblems are revisited by the algorithm over and over again).

System configuration can often be composed from configurations of relatively independent subsystems. For example, configuration of a typical departmental site includes a series of configurations of server subsystems, i.e., web service, file system service, domain service, mail service, etc. The optimal solution for configuration of the whole system is often a union of optimal solutions for configurations of subsystems. This implies optimal substructure. One might argue that when resource competition between different subsystems occurs, optimal substructure does not exist, since the interests of the subsystems are not consistent. This argument mainly affects the performance of dynamic programming. In our composability problem, performance or security requirements can be encoded within the requirements. We are searching for the shortest sequence of configuration operations that achieves the requirements. The shortest sequence for the whole is composed of the shortest sequences for the subsystems. Subproblems such as deploying a specific service are repeated constantly in configuring a large network.
For example, file editing is used almost everywhere in configuration, since services are typically carried out by various daemons which read configuration files for instructions. As another example, disk partitioning for a group of hosts is repeated on every individual host. Thus we have the ingredient of overlapping subproblems.

The idea of dynamic programming is to memoize previously computed solutions of subproblems and save them for future use. If we generalize this idea, complicated configuration management can be simplified by forming a set of “best practices”, i.e., a set of solutions that have been used and proven to be effective. If a community of administrators agrees to utilize the same initial baseline state for each kind of host and the same set of operations for configuring hosts at multiple sites, system administrators can then amortize the cost of forming the practices by aggregating effective practices in a global knowledge base[37, 39]. The salient feature of best practices is that they keep latent states to a minimum and thus reduce the amount that a system administrator has to remember in maintaining a large network. Without some form of consistent practice, routine problem-solving causes a state explosion in which randomly selected modifications are applied to hosts for each kind of problem. For example, we could decide to solve the problem of large logfiles by backing them up to several different locations, depending upon the host. Then finding a specific logfile takes much longer, as one must search over all possibilities, rather than codifying and utilizing one consistent place to store them. However, one should always use “best practices” cautiously, by asking questions such as: in what sense is a practice best? When and for whom?

Chapter 14
Conclusions

The research of this thesis was motivated by an apparent lack of fundamental theories of system administration. System administration, as an emerging field in computer science and computer engineering, has often been considered to be a “practice” with no theoretical underpinnings. In this thesis, we began to define a theory of system administration, based upon two activities of the system administrator: configuration management and dependency analysis. In this chapter, we summarize the contributions of this thesis to the theory of system administration and make suggestions for future work.

14.1 Review

System administration concerns every aspect of the operational management of human-computer systems. Our research concentrates on one of its subparts, configuration management: the activities of initial configuration or reconfiguration of a network of computers according to policies or policy changes. As the complexity of computing systems and the demands of their roles increase rapidly, configuration management becomes increasingly difficult.

In the course of this thesis we have developed a theoretical framework to study and examine the complexity of configuration management. Two kinds of automata were constructed: one is based upon the actual configuration and the other is based upon observed behavior. The first is deterministic but can be arbitrarily large; the second is non-deterministic with a size bound of 2^|T|, where T is the test set. The non-determinism of operations on observed states is one of the most challenging problems of configuration management. If the effect of an operation is not predictable and reproducible, it is difficult to maintain it for repeatable use or for large-scale networks.
In the discussion of reproducibility of configuration operations, we have shown that for one host in isolation and for some configuration processes, reproducibility of observed effect for a configuration process is a statically verifiable property of the process. However, reproducibility across populations of hosts can only be verified by explicit testing. Using configuration processes verified to be locally reproducible, we can identify latent preconditions that affect behavior among a population of hosts.

Much attention has been paid to the limits imposed upon configuration operations and how they affect the usability of a set of operations. Based upon our theoretical framework, we formally defined many limits of configuration operations, including idempotence, statelessness, convergence, commutativity, consistency, awareness, and atomicity. Their role in reducing the complexity of configuration management was also discussed. They all add some form of control over combinatorial behavior. Convergence, idempotence, sequence idempotence, and statelessness add a structure of equivalence relations between sequences of configuration operations. Commutativity limits the potential results of a set of scripts so that the behaviors of different permutations of the same set are equivalent. Consistency rules out conflicting behaviors. The limits of atomicity and awareness make operations tighter, more robust, and more secure, and they are used with other limits to enhance their functionality. However, we showed in our composability theory that these limits alone are not enough to make composability tractable. Using commutativity and idempotence we reduced a known NP-complete problem to composability of configuration operations. Thus the general composability problem without any limit is NP-hard. We also studied how other limits affect the complexity of composition. The conclusion was that composition remains NP-hard regardless of whether the operations have those limits.

Dependency analysis is an important process in configuration management and other parts of system administration. It is used in root cause analysis, impact analysis, change analysis, and requirements analysis. We formally defined dependence using our theoretical model and studied its complexity based on two approaches: white box and black box. Our conclusion was that dependency analysis is intractable in general. System administrators get around the complexity by doing only a partial analysis of the system.

Through the contextualization of the configuration process and a review of current configuration strategies, we summarized how configuration management is made tractable in practice. These mechanisms include the use of experience and documentation, reduction of state space, abstraction, re-baselining, the use of orthogonality, and the use of closures. For each mechanism, we discussed its drawbacks. We also made observations on how the current configuration strategies simplify configuration management.

Many ways of reducing the complexity of a problem have been explored in computation theory. We made connections between computation theory and configuration management. We suggested ways to apply various techniques of reducing complexity to configuration management, including choosing easy instances, approximation, and dynamic programming.

14.2 Future Work

We share the view of many researchers that increasing system complexity is quickly reaching a level beyond human ability to manage and secure.
We need a revolutionary change in the way we manage our systems. Such change or a series of changes need strong theoretical support. Following the research described in this thesis, a number of projects could be undertaken: • To explore approximation algorithms for intractable problems to reduce complexity Our previous research proved that self-configuration, an important part of autonomic computing, is intractable in the general case without further limitations. We wish to continue studying applications of approximation algorithms to achieve tractability. • To build a theory of policy in system administration. There is a direct relationship between an organization’s policies and the expense of maintaining computing infrastructure. There is much research on the role of policy in business that is unknown to the system administration community. We intend to systematically study the effect of policy on maintenance cost and system integrity. • To address the complexity of reconfiguration and optimization of solution with a cost model. Reconfiguration is much more difficult than initial configuration because it deals with a potential diversity of states rather than a simple repeatable initial state. One recurrent problem is that it is often not obvious whether it is less expensive to change an existing configuration, or to start over and configure the machine from scratch: what practitioners call a “bare metal rebuild”. We desire to incorporate a cost model to enable optimization of various choices. 106 Bibliography [1] Discussion of large scale system configuration issues. http://lists.inf.ed.ac.uk/mailman/listinfo/lssconfdiscuss. [2] Sys admin - the journal for unix and linux system administrators. http://www.samag.com/. [3] E. Anderson, M. Burgess, and A. Couch. Selected Papers in Network and System Administration. J. Wiley & Sons, Chichester, 2001. [4] E. Anderson and D. Patterson. Extensible, scalable monitoring for clusters of computers. Proceedings of the Eleventh Large Installation System Administration Conference (LISA XI) (USENIX Association: Berkeley, CA), page 9, 1997. [5] E. Anderson and D. Patterson. A retrospective on twelve years of lisa proceedings. Proceedings of the Thirteenth Large Installation System Administration Conference (LISA XIII) (USENIX Association: Berkeley, CA), page 95, 1999. [6] P. Anderson. Towards a high level machine configuration system. Proceedings of the Eighth Large Installation System Administration Conference (LISA VIII) (USENIX Association: Berkeley, CA), page 19, 1994. [7] P. Anderson, G. Beckett, K. Kavoussanakis, G. Mecheneau, J. Paterson, and P. Toft. Experiences and challenges of large-scale system configuration. 2003. [8] P. Anderson, P. Goldsack, and J. Patterson. Smartfrog meets lcfg: autonomous reconfiguration with central policy control. Proceedings of the Seventeenth Large Installation System Administration Conference (LISA XVII) (USENIX Association: San Diego, CA), 2003. [9] J. Apisdort, K. Claffy, K. Thompson, and R. Wilder. Oc3mon: flexible, affordable, high performance statistics collection. Proceedings of the Tenth Large Installation System Administration Conference (LISA X) (USENIX Association: Berkeley, CA), page 97, 1996. 107 [10] AT&T. Virtual network computing. http://www.uk.research.att.com/vnc. [11] S. Bgchi, G. Kar, and J. Hellerstein. Dependency analysis in distributed systems using fault injection: application to problem determination in an e-commerce environment. 