Complexity of System Configuration Management
A Dissertation
submitted by
Yizhan Sun
In partial fulfillment of the requirements
for the degree of
Doctor of Philosophy
in
Computer Science
TUFTS UNIVERSITY
August 2006
© Yizhan Sun, June 2006
ADVISOR: Alva L. Couch
Abstract
System administration has often been considered to be a “practice” with no theoretical underpinnings. In this thesis, we begin to define a theory of system administration, based upon two activities
of system administrators: configuration management and dependency analysis. We formalize and
explore the complexity of these activities, and demonstrate that they are intractable in the general
case. We define concepts of system behavior and kinds of configuration operations, develop a model of
configuration management and a model of reproducibility, and prove that several parts of the process
are NP-complete or NP-hard. We also explore how system administrators keep these tasks tractable
in practice. This is a first step toward a theory of system administration and a common language
for discussing the theoretical underpinnings of the practice.
Acknowledgements
This thesis is the result of four years of work whereby I have been accompanied and supported by
many people. It is a pleasure that I now have the opportunity to express my gratitude to
all of them.
The first person I would like to thank is my advisor Alva L. Couch. I have been working with
him since 2001 when I started my Master’s project. His enthusiasm and integral view of research
and his humor when things get tough have made a deep impact upon me. He patiently guided me
and supported me throughout the course of my study.
I would like to thank Professor Kofi Laing and Professor Lenore Cowen for their help on computation theory and for being committee members. I also thank Professor Ricardo Pucella for
reviewing my work.
I am grateful to my fellow students Ning Wu, Marc Chiarini, Hengky Susanto, Josh Danziger
and Bill Bogstad for their insightful comments and discussions of this work.
I would like to thank the Tufts University Computer Science Department for giving me financial
support to finish the study.
I owe a lot of gratitude to my parents and my parents-in-law. My parents stayed with us for
three years in the U.S. to help me with my two children. Without their help, I could not even have started
this thesis. My parents-in-law supported us financially out of their limited resources. They have
shown me what unconditional love is.
I am very grateful to my husband for his love and encouragement during the Ph.D. period. I
thank my sons Samuel and David for their smiles and countless precious joyful moments in our life.
DEDICATION
To my parents and my parents-in-law,
for their endless love.
Contents

1 Introduction

2 Landscape of System Administration
  2.1 Definition of System Administration
  2.2 Taxonomy of System Administration
  2.3 Some System Administration Tasks
    2.3.1 Backup and Restore
    2.3.2 User Management
    2.3.3 Service Management
    2.3.4 Security
    2.3.5 Testing and Quality Assurance

3 Introduction to Configuration Management
  3.1 Software Configuration Management
  3.2 System Configuration Management
  3.3 How Configuration Controls Behavior

4 Challenges of Configuration Management
  4.1 Change
  4.2 Scale
  4.3 Interdependence of Software and Hardware
  4.4 Heterogeneity
  4.5 Contingency
  4.6 Diverse Users
  4.7 Ineffective Collaboration of Multiple System Administrators
  4.8 Mobile Environments
  4.9 Service Guarantees

5 Automation and Autonomic Computing
  5.1 History of Automation in Configuration Management
  5.2 Current Strategies of System Configuration
    5.2.1 Manual Configuration
    5.2.2 Custom Scripting
    5.2.3 Structured Scripting
    5.2.4 File Distribution
    5.2.5 Declarative Syntax

6 The Configuration Process
  6.1 Documentation
  6.2 Experience
  6.3 The Configuration Process

7 A Model of Configuration Management
  7.1 Closed- vs. Open-world Models of Systems
  7.2 Observed Behavior
  7.3 Actual State and Observed State
  7.4 Configuration Operations
  7.5 Two Configuration Management Automata

8 Reproducibility
  8.1 Local Reproducibility
    8.1.1 Properties of Locally Reproducible Operations
    8.1.2 Constructing Locally Reproducible Operations
  8.2 Population Reproducibility

9 Limits on Configuration Operations
  9.1 Limits on Configuration Operations
  9.2 Relationship Between Limits

10 Complexity of Configuration Composition
  10.1 Composability
  10.2 Complexity of Operation Composability
    10.2.1 Component Selection
    10.2.2 General Composability
    10.2.3 Atomic Operation Composability
    10.2.4 Composability of Partially Ordered Operations
    10.2.5 Composability of Convergent Operations
    10.2.6 Summary of The Proofs
  10.3 Discussion

11 Dependency Analysis
  11.1 Dependency Analysis Techniques
    11.1.1 Instrumentation
    11.1.2 Perturbation
    11.1.3 Data Mining
    11.1.4 Requirements Analysis
    11.1.5 Dependency Control
  11.2 Basic Concepts
    11.2.1 White Box
    11.2.2 Black Box
    11.2.3 Functional Expectations and Tests
    11.2.4 State and Behavior
  11.3 Dependence Definition
    11.3.1 Dependency in a Closed-world System
    11.3.2 Dependency in an Open-world System
  11.4 Complexity of Dependency Analysis
    11.4.1 Black Box Dependency Analysis
    11.4.2 White Box Dependency Analysis
  11.5 Discussion

12 Configuration Management Made Tractable in Practice
  12.1 Experience and Documentation
  12.2 Reduction of State Space
  12.3 Abstraction
  12.4 Re-baselining
  12.5 Orthogonality
  12.6 Closure
  12.7 Current Strategies Make Configuration Management Tractable

13 Ways to Reduce Complexity
  13.1 Choosing Easy Instances to Solve
    13.1.1 Reduction of State Space
    13.1.2 Using Simple Operations
    13.1.3 Forming Hierarchies and Relationships
  13.2 Approximation
  13.3 Dynamic Programming

14 Conclusions
  14.1 Review
  14.2 Future Work
List of Figures

6.1 The configuration stages
6.2 The configuration process
6.3 Asynchronous interactions between system administrators and the environment
9.1 Stateless operations can have if statements
9.2 An example of sets that are sequence idempotent but not stateless
10.1 Summary of proofs
Chapter 1
Introduction
System administration has traditionally been viewed as a practice with no theoretical underpinnings.
In this thesis, we take the first steps toward developing a theory of system administration, by
defining and analyzing models of system administration practice. This theory guides us in looking
at system administration in new ways, lends understanding to current practices, and suggests
new practices for the future.
System administration is in a critical transition period similar to the historical transition from
alchemy to chemistry[19]. Practitioners are pioneering the scientific study of the common methods
in use today. A large number of experiments have been performed, and various ways of tuning
practice have been explored and observed. But the field lacks a mature theoretical foundation
and systematic experimental data, which are critical to future development. This work contributes
to the construction of a complete theory for system administration and intends to inspire more
theoretical research within the field.
As the complexity of computing systems and the demands of their roles increase rapidly, system
administration becomes difficult. Within a network, there are tens, hundreds, or thousands of machines; each might have different architectures, hardware devices, operating systems, and installed
software applications. Software and hardware are developed by independent and often competing
vendors and developers. A large amount of implementation detail is needed to configure these components to accomplish common goals. Users who interact with the system have diverse needs and
requirements that are constantly in flux.
The life cycles of new technologies and tools have become shorter and shorter and this has
increased the difficulties of integrating new technologies with legacy infrastructure. Thousands of
scripts with complex dependency relationships are embedded within the systems. And these systems are required to support business and industrial processes which are continually reconstructed
and reorganized to meet changing users’ demands. Nevertheless, service guarantees for modern
systems are becoming commonplace. A minute of down-time might cause thousands of dollars of
lost business. This complexity grows beyond human management capacity, especially for large scale
systems.
We approach this problem by studying configuration management - the activity of initially
configuring or reconfiguring the system components other than users so that they collaborate in an
organized way to satisfy users’ requirements or requirement changes.
Automation in configuration management is the key to freeing humans from being overwhelmed
by implementation details. Many strategies have been explored and many tools have been developed.
Based on their strategies, we group tools into different categories: custom scripting, structured
scripting, file distribution, and declarative syntax. We will introduce these strategies in greater
detail later. All of these tools try to hide implementation details to some degree from system
administrators. Ideally, humans are only required to make policies that declare the high level
requirements or objectives of the system, such as “I want two web servers in my network satisfying
service and security requirements”, and some autonomic system can take those declarations and
translate them into low level implementation details. This is the vision of autonomic computing
(self-managing) systems. In such ideal environments, humans delegate configuration management
to autonomic computing systems. Intuitively, everyone believes that building such an autonomic
system is “difficult” or even impossible because configuration management is difficult. But no one
has mathematically analyzed why it is difficult, where the difficulties come from, and how to reduce
the complexity. This work intends to fill in these details.
We believe that fundamental theories are necessary to manage the complexity of configuration
management. So far, our contributions in building fundamental theories have included:
• an understanding of the nature of the configuration process;
• a theoretical model of configuration management;
• a theory of reproducibility of configuration operations;
• formal definitions of various limits on configuration operations and a discussion of their impact on
the complexity of configuration management;
• formal definitions of composability of configuration operations with or without various limits;
• a formal definition of dependencies between components of a system;
• proof that composability of configuration operations is an NP-hard problem and proof that a
complete dependency analysis is intractable, so that automation of configuration management
is intractable in the general case;
• a summary of techniques used in practice that make configuration management tractable; and
• solutions to tractability that arise from and are suggested by computation theory.
This thesis describes all of the above issues in detail. Chapter 2 presents a general landscape
of system administration. Chapter 3 is an introduction to configuration management. Chapter 4
summarizes challenges that arise in configuration management. Chapter 5 discusses various strategies used to cope with the difficulty of configuration management and the vision and promise of
autonomic computing. Chapter 6 contextualizes the configuration process; this discussion serves
as the link between configuration practice and our theoretical model and discussions. Chapters 7
through 13 describe theoretical models of configuration management, reproducibility,
composability, and dependency analysis, and discuss their computational complexity and the ways
in which system administrators keep these tasks tractable. Finally we draw some conclusions and
lessons for the future in Chapter 14.
Chapter 2
Landscape of System Administration
2.1 Definition of System Administration
System administration is an emerging field in computer science and computer engineering that
concerns the operational management of human-computer systems. It is a body of knowledge
and technologies that enables an administrator to initialize a system to states that satisfy users’
needs to produce work, and to keep the system in those desired states, while interactions with
users tend to cause the system to drift away from those states. System administration concerns
every possible action involving the system and every level in the hierarchy of the system, from
machine to user. System administration is distinguished from system management and autonomic
management by being a human-centered activity, in which users and system managers are considered
on an equal basis with computer systems. This leads to an understanding of “human-computer
communities”[17] in which human and non-human entities interact to achieve some common goal.
System administration as a practice requires a broad range of knowledge and skills including an
understanding of system dynamics and administrative techniques as well as an understanding of
human psychology and people-skills.
System administration focuses upon: the real-world goals for, services provided by, and constraints on computing systems; the policy and specification of system structure and behavior, and
the implementation of these policies and specifications; the activities required in order to develop an
assurance that the specifications and real-world goals have been met; and the evolution of such systems over time. It is also concerned with the processes, methods and tools for managing computing
systems in a cost-effective and timely manner.
System administration as a practice has existed since the first computer was invented in the
1940s. However, as a formally recognized branch of computer science, system administration is relatively young compared to other, more traditional computer science disciplines, such as software
engineering, artificial intelligence, and theory. Good starting points for understanding system administration include the textbooks[17, 19] by M. Burgess. The Large Installation System Administration
Conference (LISA)[93] is the flagship conference of the system administration community; its proceedings reflect a broad view of practice across the field of system administration.
Additionally, the book, Selected Papers in System and Network Administration[3] offers a roadmap
of the development of system administration through samples of LISA papers. Other research
literature on system administration can be found in proceedings of the USENIX Security Symposium, the Conference on Integrated Network Management (INM), and the Network Operations and
Management Symposium (NOMS). SysAdmin[2] is a periodical that affords a good overview of the
state-of-the-art technologies in system administration. Large-scale system administration discussions are distributed through community-wide mailing lists[1] and through professional service sites
such as http://www.lopsa.org and Sun Microsystems’ http://bigadmin.sun.com.
2.2 Taxonomy of System Administration
E. Anderson and D. Patterson described an initial taxonomy of system administration in [5]. We
base our taxonomy on theirs with some modifications.
System administration can be categorized as interactions between two entities: management
and tasks. The relationship between these two entities is that almost every task is associated
with management activities, such as policy making, configuration management, maintenance, and
training. For example, the task of managing a particular service includes making policy decisions,
configuring it to achieve the goals listed in these policies, maintaining desired configuration and
behavior, and training other system administrators to manage it and users to use it.
System administration can be divided into four categories:
• Policy Making – deciding high level goals for the system to achieve;
• Configuration Management – initial configuration or reconfiguration of the network of computers according to policies or policy changes;
• Maintenance – bringing the system back to a desired state when the system is degraded either
by failure of system components or poor performance;
• Training – improving administrators’ skills and training users to take better advantage of the
system.
System scale increases; new technologies emerge; system administrators and users leave and
arrive; these factors make training an essential process for system administration. Maintenance as
an ongoing activity is driven by two factors: system entropy and the need for changes in policy.
The entropy of the system, defined as the extent to which the actual configuration or behavior of
the system is statistically unknown or unpredictable, tends to grow when inconsistent configuration
operations are applied to a collection of systems, resources are consumed by users, software malfunctions, viruses attack or spam arrives, and malicious operations break programs, etc. Maintenance
is the process of restoring order out of disorder.
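Although we do not rely on a formal measure of entropy at this point, one hedged way to make the informal definition above quantitative is Shannon entropy over the possible configurations:

    H(C) = - \sum_{c \in C} p(c) \log_2 p(c)

where C is the set of configurations the system might actually be in and p(c) is the probability that the system is in configuration c. H(C) is zero when the configuration is known exactly and grows as the administrator's uncertainty grows. This formula is offered only as an illustration of the informal definition above.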
System administration can also be grouped into separate categories based upon commonly performed tasks[5]. Categories based on the tasks include:
• Services – include internal services provided to the system and external services provided to
the users. Examples include backup, mail, printing, NFS, DNS, Web, database, etc;
• Software Installation – includes operating system installation, application installation, software packaging, and user customization;
• Monitoring – helps administrators determine what is happening in the system and network.
This includes system and network monitoring, resource accounting, data display, benchmarking, configuration discovery, and performance tuning;
• Management – includes site configuration, host configuration, site move, fault tolerance, etc;
• User Management - includes management of user accounts, documentation, policy, user interaction, etc;
• Improvement – includes training administrators, software design, models, system self improvement; and
• Miscellaneous – includes trouble tickets, secure root access, general tool, security, file synchronization, remote access, file migration, resource cleanup, etc.
However, the boundaries between these categories are often blurred. Many current configuration
tools deal with more than one kind of task. For example, configuration management and maintenance are combined when using Cfengine[15, 16], a “convergent” tool designed to bring the system
into conformance with system requirements. Each requirement can be the original one or a modified
version of the original requirement; Cfengine does not know the difference between requirements
that have been modified and those that remain the same as before. Another example is LCFG[6, 8]
and ISConf[90, 91]; both manage system configuration and software installation.
2.3 Some System Administration Tasks
We now examine some important tasks performed by system administrators in more detail. Software
configuration and system configuration are omitted here because we will discuss them in later
chapters.
2.3.1 Backup and Restore
This section is a simplified version of the backup discussion in Selected Papers in System and
Network Administration[3]. Backup is about exploiting storage redundancy to increase robustness
and data integrity in order to cope with errors and natural disasters[17]. A copy of data is made
to restore information in case the original is destroyed. Backup seems simple to accomplish until
one includes requirements such as high availability of file-systems, size of data, ease and speed of
backup and recovery, media and version management, scalability, and other site-specific needs.
Backups were traditionally performed using primitive system tools such as “dump” and “restore”, but the subtleties of backup and contingency planning justified the creation of more complex
tools to manage backup and restore schedules. The goal of creating a backup schedule is to be able
to restore files that are lost in a reasonable amount of time and not to interfere with daily use
of the system. Several tools have been developed for heterogeneous environments[69, 72], and for
use while systems remain online[82], and for backing up data from one system to spare disks on
another system[89]. However, few of these have withstood the test of time like the Amanda backup
system[83].
Backup methods are affected by constant technological improvements in backup device speed,
density, robustness, cost, and interoperability with vendor-supplied backup software. When balancing backup speed against cost of media, one finds different optimal solutions for small, medium and
large sites. Traditionally, backups have been recorded to tape because tape is relatively inexpensive.
However, tapes are bulky and time-consuming to set up manually, so there is a point at which the
cost of hiring someone to change tapes is more than the cost of mirroring disks or trying robotic
solutions such as robotic tape libraries. Disk mirroring is an affordable solution for smaller sites,
but it becomes expensive for sites with terabytes of data. Many sites utilize a mix of disk mirroring
and tape or CD backup such as utilizing the disk as a cache for the tape backup process[89].
2.3.2 User Management
User management consists of controlling the “interface” between users and computers. Several issues
are involved: management of user accounts, policies related to computer use, resource management,
and user help and support.
User Accounts
The tasks of managing user accounts include choosing authentication and authorization methods for
determining account access and privileges; account creation and retirement; scope, authorization,
and privilege associated with each account; resource management including quotas and sizes of user
directories, and managing login environments associated with particular accounts and privileges.
These tasks are trivial if there are only a few users to manage. Scaling up the number of users,
system privileges, account changes, or environments creates challenges for each task.
Account management tools were first created with the goal of simplifying the account creation
process. Scripts were designed to automate the steps of accumulating appropriate information about
users, managing passwords and other forms of user authentication, creating user file storage and
directories, and changing the location of user files to match user needs[36]. Sites with thousands of
accounts, such as schools, need to create large numbers of accounts quickly because large populations appear all at once at the beginning of each term. They must also be able to retire accounts
efficiently because of high turnover in the user population. In most account management systems,
a central repository stores account information, while daemons or agents extract information from
the database and create local accounts on systems the user must access[85].
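As a hedged illustration of this repository-plus-agent pattern, the sketch below models the central repository as a simple roster file; the file layout, field names, and the create_local_account placeholder are our own assumptions, not features of any cited tool.

    import csv

    def create_local_account(username, full_name, home_dir):
        # Placeholder for the host-specific step (e.g., calling the platform's
        # account-creation utility); details vary by operating system.
        print(f"would create {username} ({full_name}) with home {home_dir}")

    def provision_from_repository(roster_path):
        # The central repository is modeled as a CSV roster with one row per
        # user; real deployments use a directory service or database instead.
        with open(roster_path, newline="") as f:
            for row in csv.DictReader(f):
                home = "/home/" + row["username"]
                create_local_account(row["username"], row["full_name"], home)

    # Example use: provision_from_repository("roster.csv")

A daemon or periodic job on each host could run such a script to keep local accounts synchronized with the repository.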
User Support
With large numbers of users to manage, providing effective user support is both difficult and crucial.
Approaches to user support include helping users directly or in person, utilizing electronic communications, training users to make them self-sufficient, and documenting the answers to frequently
asked questions.
One important part of user support is managing problem reports (also called “trouble tickets”)
from users. System administrators could not accomplish deployment of new services and changes
in architecture if they also had to respond to users’ problem reports in real time. It is much more
efficient and common to utilize a “triage” approach in which junior system administrators interact
with users directly and protect senior system administrators from too much direct interaction with
users. A trouble ticket tool coordinates and documents interactions between system administrators
and users. Early trouble ticket tools were email-only submission tools with centralized queues of
requests[63]. Later, these systems were extended so that users could query the status of each problem
report, and tickets could be assigned to particular administrators[81]. Systems were improved to
support multiple submission methods such as phone and GUI, and to support multiple request
queues and request priorities[80].
Even with sophisticated tools for managing requests electronically, there are circumstances where
direct user assistance is needed. Direct in-person support becomes difficult if the user who needs
support is located at a remote site. The Virtual Network Computing model[10] is a way to allow
an administrator to log onto an existing user session and guide remote users through resolving
difficulties online. Various tools are available to help with this, including remote desktop utilities
for Windows.
User Policies
There is a direct relationship between an organization’s policies concerning users and the expense
of maintaining computing infrastructure. A good policy should clearly state:
1. rules about what users are allowed/not allowed to do, such as whether they can install new
software;
2. specifications of what mandatory enforcement users can expect, e.g. whether certain kinds of
files are periodically deleted.
3. other regulations and promises, e.g. policies on privacy, record keeping, retirement of accounts,
user privileges, etc.
Since policy-making is closely related to the requirements and business model of the local environment, few system administration researchers have studied this topic in the context of system
administration, though there is much research on the role of policy in business that is not known
to the system administration community. Zwicky et al[98] made the important point that system
policy lies at the root of system consistency and integrity. The effects of policy upon system administration and cost are at best poorly understood, and there is little information on the range of
policies that are currently in use or effective.
Resource Management
Resource management is closely related to policy-making. For example, there are ongoing debates
on how to manage users’ storage space. One approach is to enforce disk quotas, which strictly
limit the amount of disk space users can have access to. One problem with quotas is that they are
too restrictive to allow users to create large temporary files, especially during program development
and debugging. Another approach is to utilize a “tidying policy”[47, 97] in which certain kinds
of files (with limited usefulness) are deleted regularly. For example, all core files can be deleted
periodically, on the grounds that a core file is only useful for debugging an application immediately
after a bug is encountered. Other uses of tidying include removing files that are not related to
job function, e.g., movies and mp3 songs. However, this approach exerts less control over the disk
resources and does not avoid all resource problems (including running out of disk space).
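As a concrete, hedged sketch of such a tidying policy, the fragment below deletes stale core files; the root directory, file-name pattern, and age threshold are illustrative assumptions, and the actual thresholds would be set by site policy.

    import os, time

    def tidy_core_files(root="/home", max_age_days=7):
        # Remove files named 'core' (or 'core.<pid>') older than the threshold.
        cutoff = time.time() - max_age_days * 86400
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if name == "core" or name.startswith("core."):
                    path = os.path.join(dirpath, name)
                    try:
                        if os.path.getmtime(path) < cutoff:
                            os.remove(path)
                    except OSError:
                        pass  # file vanished or is unreadable; skip it

    # Typically invoked periodically, e.g., from cron: tidy_core_files()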
2.3.3 Service Management
Services are specialized tasks performed on the behalf of programs or users, e.g., mail, printing,
NFS (Network File System), Web, DNS (Domain Name Service), and database services. Services
are achieved by system processes called daemons. Daemons reside at servers with a specific IP
address and TCP or UDP port address. A service has a listening socket (IP address and TCP or
UDP port) which responds to client requests by opening a new, temporary per-connection socket
that is closed when the service completes.
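A minimal, hedged sketch of this listening-socket pattern follows; the port number is arbitrary, and a real daemon would add concurrency, logging, and access control.

    import socket

    def run_echo_daemon(host="127.0.0.1", port=7000):
        # The listening socket waits at a well-known address and port.
        listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        listener.bind((host, port))
        listener.listen()
        while True:
            # Each accepted request gets its own temporary connection socket,
            # which is closed when the exchange finishes.
            conn, _addr = listener.accept()
            with conn:
                data = conn.recv(1024)
                conn.sendall(data)

    # run_echo_daemon()  # blocks; run only in a test environment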
Mail
Electronic mail is the application most used by end-users on a regular basis. Very early research in
mail targeted interoperability between the wide variety of independently developed mail systems.
This research and the reduction in variety over time, combined with Simple Mail Transfer Protocol
(SMTP) as a standard mail interchange protocol, solved the interoperability problem. Research
then turned to flexible delivery and automating mailing lists[1]. There was then a brief pause in
the research. However, as the Internet continued to grow, research on scaling delivery of mail both
locally and in mailing lists[64] was needed. At the same time, commercialization caused SPAM to
become a problem[54].
Printing
Printing covers the problems of getting print jobs from users to printers, allowing users to select
printers, and getting errors and acknowledgements from printers to users. Early research in printing
merged together the various printing systems that had evolved[42]. Once the printing systems were
interoperable, printing research turned to improving the resulting systems, making them easier to
debug, configure, and extend[77]. As sites continued to grow, scaling the printing system became a
concern, and recent papers have looked into what happens when there are thousands of printers[95].
NFS
NFS, the abbreviation for Network File System, is a network service that allows all network users
to access shared files stored on computers of different types. NFS provides access to shared files
through an interface called the Virtual File System (VFS) that runs on top of TCP/IP. Users can
manipulate shared files as if they were stored locally on the user’s own hard disk.
DNS
DNS, the abbreviation for Domain Name Service, is a service that translates Internet domain names
into numerical IP addresses. Because domain names are alphabetic, they are easier to remember.
The Internet however, is actually based upon IP addresses. Every time one uses a domain name,
therefore, a DNS service must translate the name into the corresponding IP address. For example,
the domain name eecs.tufts.edu is resolved to the numerical IP address of the corresponding host.
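A hedged one-line illustration of this translation, using a scripting language's interface to the system resolver (the address printed depends entirely on the live DNS data):

    import socket

    # Ask the system resolver to translate a host name into an IPv4 address.
    print(socket.gethostbyname("eecs.tufts.edu"))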
Web
A Web Service is an internet service that is described via WSDL (Web Services Description Language) and is capable of being accessed via standard network protocols such as (but not limited
to) SOAP over HTTP. Apache has been the most popular web server on the Internet since 1996.
Apache is an open-source HTTP server for modern operating systems including UNIX and Windows NT. It is a secure, efficient and extensible server that provides HTTP services in sync with
the current HTTP standards.
Database
A database service manages and queries a database. It is provided by a database management
system (DBMS), a suite of computer programs designed to manage a database and perform
operations on the data as requested by possibly numerous clients.
2.3.4 Security
A network is secure if data is protected from harm, resources are protected from misuse, data
is kept confidential, and systems remain available for their intended purpose. Assuring security
of networks involves systematic strategies and approaches to protect critical data and resources.
Security practices include assurance of:
• data integrity - validity of data;
• service or application integrity - service availability and conformance to specifications;
• data confidentiality; and
• authentication and authorization - assuring that the proper people are allowed to perform
particular tasks or access particular data.
Security was neglected in the early days of computing systems since all users and hosts were assumed
to be trustworthy. The spread of the Internet Worm in 1988 challenged this naive trust model and
redefined the notion of security of computer systems. Since then, security has evolved into a pursuit
in its own right with its own conferences and intellectual traditions.
Security cannot exist without a security policy: a clear definition of what is to be protected and
why. A clearly defined policy is used to create a security plan and architecture, based upon the
possible threats and risks associated with each threat. A system can be compromised by[17]:
• physical threats: weather, natural disaster, bombs, power failures etc;
• human threats: cracking, stealing, trickery, bribery, spying, sabotage, accidents; and
• software threats: viruses, Trojan horses, logic bombs, denial of service.
One implements a security policy by methods that include access control, intrusion detection, firewalling and filtering, and backup and restore (discussed in the previous section).
Access Control
Access control determines what specific users can do according to security policy. Access control
has two parts: authentication and authorization.
Authentication is any process by which one verifies that someone is who they claim they are.
Traditionally authentication has been based on shared secrets (e.g. passwords) used in conjunction
with cryptographic algorithms. When using authentication based on cryptography, an attacker
listening to the network gains no information that would enable it to falsely claim another’s identity. There are two main approaches to the use of encryption: shared-key encryption algorithms
and public-key encryption algorithms. Kerberos[88], a network authentication protocol, is the most
commonly used example of a shared-key algorithm. It is designed to provide strong authentication for client/server applications. Public key algorithms[65], with less key-distribution overhead than shared-key
algorithms, are widely used in authentication for their great convenience and flexibility. The difficulty with public key algorithms is that they require a global registry of public keys. With the
explosive growth of the Internet, new authentication schemes integrating shared-key and public-key
appeared[84] to address the scalability of network security infrastructures that can manage millions
of transactions, geographically distributed, within a single realm of trust. Recently, related techniques such as smart cards (used in mobile phones) and biometrics (fingerprints and iris scans)[22]
have also been tried as authentication methods.
Authorization is the process of granting or denying access to a network resource. This is usually
determined by finding out if a person, once identified, is a part of a particular group of users. The
most common authorization mechanism is known as an access control list (ACL), which is a list
of digital identities along with a set of actions that they may perform on a resource (also known
as permissions). Security groups simplify management because an ACL can have a few entries
specifying which groups have a specific level of access to a resource. With careful group design, the
ACL should be relatively static. One can change authorization policy for resources by manipulating
the members of a group maintained by a centralized authority, such as a directory. Nesting groups
within each other increases the flexibility of the group model for managing authorization.
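The group-based ACL idea can be sketched as follows; the group names, resources, and permissions are hypothetical.

    # Group membership is maintained centrally; the ACL itself stays small.
    groups = {
        "webmasters": {"alice", "bob"},
        "staff": {"alice", "bob", "carol"},
    }

    # ACL: resource -> list of (group, permitted actions)
    acl = {
        "/var/www": [("webmasters", {"read", "write"}), ("staff", {"read"})],
    }

    def is_authorized(user, resource, action):
        for group, actions in acl.get(resource, []):
            if user in groups.get(group, set()) and action in actions:
                return True
        return False

    print(is_authorized("carol", "/var/www", "read"))   # True
    print(is_authorized("carol", "/var/www", "write"))  # False

Changing who may write to /var/www then requires editing only the membership of the webmasters group, not the ACL itself.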
Monitoring and Intrusion Detection
Monitoring is usually done in non-intrusive ways and can be applied to production systems without
impacting performance. Monitoring contains four components:
1. data collection and generation;
2. data logging or storage;
3. analysis; and
4. reporting.
There has been a lot of work on gathering data from specific sources, from file and directory state[79]
to OC3 network links[9]. The collected data is usually logged in a fairly basic way, often through
syslog or some flat file. Using a relational database to log the raw data and converting it to a
standard form for inquiries was explored in [41]. Later, generic monitoring infrastructure [4, 53]
was developed. Data analysis has not received nearly the attention it deserves. Data collection
techniques are only useful if the data can be used to identify problems. Swatch[52] can send email
or page system administrators when things seem to go wrong. To assist analysis of large segments
of monitoring data, visual and audio approaches have been used[45, 46].
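A minimal, hedged sketch of this style of log watching follows; the log path and alert patterns are placeholders rather than any tool's actual configuration, and a production monitor would tail the log continuously and notify by mail or pager.

    import re

    ALERT_PATTERNS = [re.compile(p) for p in
                      (r"authentication failure", r"disk (full|failure)")]

    def scan_log(path="/var/log/syslog"):
        # Report any line that matches one of the alert patterns.
        with open(path, errors="replace") as f:
            for line in f:
                if any(p.search(line) for p in ALERT_PATTERNS):
                    print("ALERT:", line.rstrip())

    # scan_log()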
Intrusion detection refers to the process of detecting and perhaps correcting security breaches
such as cracking, invasion, corruption, or exposure of private data. It can be achieved by monitoring
the system’s state, including requests for service, filesystem, or contents of system logs. There are
many monitoring tools available. Network Flight Recorder utilizes a scripting language to allow
customizability for site needs. TCP wrappers[94], which act as proxies for services, are also used to
reject undesirable requests according to access rules. Another strategy for intrusion detection is to
monitor the filesystems of target servers for the effects of intrusions. The Tripwire[92] tools, along
with many other integrity checkers, allow one to create a “signature” for a filesystem based upon
declarations of the dynamic properties of files and directories within the filesystem. Based on this
declaration, Tripwire maintains a database of cryptographic signatures for whole filesystems and
reports any deviations from declared file and directory properties.
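The signature-database idea can be sketched as follows; this is not Tripwire's actual format or policy language, only a minimal illustration that records one cryptographic hash per file and reports changes.

    import hashlib, os

    def snapshot(root):
        # Map each file path under root to the SHA-256 hash of its contents.
        sigs = {}
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                with open(path, "rb") as f:
                    sigs[path] = hashlib.sha256(f.read()).hexdigest()
        return sigs

    def report_deviations(baseline, current):
        for path in sorted(set(baseline) | set(current)):
            if baseline.get(path) != current.get(path):
                print("changed, added, or removed:", path)

    # baseline = snapshot("/etc")
    # ... later ...
    # report_deviations(baseline, snapshot("/etc"))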
Firewalling and Filtering
A firewall is a system designed to prevent unauthorized access to or from a private network. Firewalls
can be implemented in hardware, in software, or in a combination of the two. A network
firewall filters both inbound and outbound traffic. It can also manage public access to private
networked resources such as host applications. It can be used to log all attempts to enter the
private network and can trigger alarms when hostile or unauthorized entry is attempted. Firewalls
can filter packets based on their source, destination addresses and port numbers. This is called
address filtering. Firewalls can also filter specific types of network traffic. This is also known as
protocol filtering because the decision to forward or reject traffic depends upon the protocol used, for
example HTTP, FTP or TELNET. Sophisticated firewalls can also filter traffic by packet attributes,
content, or connection state.
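Address and protocol filtering can be sketched abstractly as an ordered rule list; the rules and field names below are illustrative and are not any particular firewall's syntax.

    # Each rule matches on (protocol, destination port) and carries a verdict.
    RULES = [
        ("tcp", 80, "accept"),   # allow HTTP
        ("tcp", 23, "reject"),   # block TELNET
    ]
    DEFAULT_VERDICT = "reject"

    def filter_packet(protocol, dst_port):
        for proto, port, verdict in RULES:
            if protocol == proto and dst_port == port:
                return verdict
        return DEFAULT_VERDICT

    print(filter_packet("tcp", 80))  # accept
    print(filter_packet("udp", 53))  # reject (falls through to the default)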
Firewalls cannot prevent damaging incidents carried out by insiders. Firewalls also have a
significant disadvantage in that they restrict how users can use the Internet. In many places, these
restrictions are simply unrealistic and unacceptable. Firewalls can also be a single point of failure
for a network and can constitute a traffic bottleneck; if the firewall itself is compromised, the whole
network is at risk. Also, if a firewall rule set contains a mistake, it may not appropriately protect
the network. Since these rule sets are complex to craft and validate, mistakes are common. The
firewall is only one component of a secure system; it must be used in combination with other access
control methods.
2.3.5 Testing and Quality Assurance
Testing that systems conform to human desires and requirements is an important process that
supports quality assurance. What is the cost of not testing? For business or security critical
systems, such as online banking, any failure of the system might cause customer dissatisfaction, lost
transactions, lost productivity, lost revenue, lost customers, penalties, or threat to the organization
that owns and operates the system. Even where a system does not provide critical services, failures
might still have serious impact upon users.
Testing in system administration determines whether the system meets its requirements and
ensures that it does not violate policy rules. Positive testing is used to validate the correctness
of certain desired behaviors. Negative testing ensures that a system does not do what it is not
supposed to do. Monitoring can be considered to be one kind of testing that gathers information
about the behaviors of the system and issues warnings if it finds potential problems.
Testing is related to almost every subarea of system administration, especially performance,
security, system integrity, and change management. Unfortunately, research on testing in system
administration is limited if not absent. In practice, administrators’ confidence in the system mostly
comes from the experiences learned from collecting and documenting previous system failures and
the workarounds that addressed them. Unlike software engineering, where testing is intensively
studied and performed in a systematic way, system administration suffers from simple, ad-hoc
testing methods that make no attempt to assure complete system function. In fact, studies indicate
that testing consumes more than fifty percent of the cost of software development[38]. In contrast,
system administration, which is so crucial to modern organizations, pays much less attention to
testing than necessary.
The difficulty of performing testing is that the system under test must always be “alive and
online” - available to provide services and resources. This is the key challenge. On one
hand, testing is crucial to provide quality assurance; on the other hand, the requirement for the
system to constantly serve its mission does not allow for rigorous testing of the production machines.
Techniques for testing the system while it remains online include a “trace-driven parallel execution”,
in which a system under test is subjected to the same load as a production system and the results
compared[66]. Normally, testing is done before the system or application is deployed, so development
of benchmark suites is essential. The development of tools to provide support and automation of
testing is also vital.
Chapter 3
Introduction to Configuration Management
Configuration management is the ongoing process of maintaining the behavior of computer systems and networks or software systems, and assuring that they serve the missions of the human
organizations that utilize them.
Clearly, configuration and maintenance are overlapping issues. Maintenance is a phase of configuration that deals with creeping decay and changing requirements. All systems tend to decay
into chaos with time due to management changes and somewhat unpredictable interactions with
users[25]. Theoretically, in any closed system, the entropy (measure of disorder) tends to increase
with time unless activities from outside the system restore the order.
Configuration management comprises two closely related parts: software configuration
management and system configuration management. These parts are similar in that they control
behavior through some data in files or databases. They differ in what they control. System configuration management concerns the configuration of the whole system and network while software
configuration management concerns the configuration of one or more software packages on one or
more hosts. Many system configuration tools deal with both system configuration management
and software configuration management. The focus of this thesis is system configuration management. However, many mechanisms and technologies used in system configuration management are
applicable to software configuration management.
In the following sections we introduce software configuration management and system configuration management in more detail.
3.1 Software Configuration Management
Software configuration management covers the problems of managing software installed on computers. There are two types of software: operating systems (OS) and applications.
An OS is installed either by copying an image to the local hard drive or by booting the new machine off some other media (e.g., floppy disk, CD). Installation of an OS is often destructive; everything
on the disk can be deleted during the installation process (except for Windows systems). System
administrators must plan on restoring the information if a reinstallation was done in error. Since the OS is
bottom-level software, any non-OS applications at higher levels may be affected by changes to the OS.
The problem is that the subroutines and system calls that software must utilize to communicate
with the outside world are part of the OS. That is why operating system upgrades or migration
from one type of OS to another are often a problem and consume a large amount of effort[7].
Nowadays, application software is usually contained in packages which are collections of related
files[78]. When installed, a software package unpacks into one or more directories. There are two
kinds of software applications: free and proprietary software. The former is typically available
in source form as well as binary form. The latter may only be available in binary form. Thus
software installation might be accomplished in two different ways: installation from source code or
installation from binaries. Free software and open source software is usually in source form and
must be compiled. Commercial software is usually installed from a CD by running an installation
program.
Software package management is a complex issue, mainly because different software packages
are designed by different developers and tested in different environments. A software package might
require one type of operating system, sufficient disk space and memory, the proper version and the
proper location of shared libraries, the existence and appropriate version of other software packages,
etc.
One widely adopted approach for software distribution is the Depot model[68]. In the Depot
scheme, separate directories of executable software packages are maintained for different machine
architectures under a single file tree. Software packages installed under the Depot tree are made
available within the filesystems of client hosts via symbolic links.
The RedHat Package Manager (RPM) is an open packaging system, available for anyone to use,
which runs on Red Hat Linux as well as other Linux and UNIX systems[21]. RPM installs, updates,
uninstalls, verifies and queries software. RPM is the baseline package format of the Linux Standard
Base.
The “RPM database” consists of a doubly-linked list that contains all information for all installed
packages. The database keeps track of all files that are changed and created when a user installs a
program and can therefore very easily remove the same files. However, the database suffers from
contradictory package version dependencies and incomplete and outdated documentation[55].
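The version-dependency problem mentioned above can be sketched as a simple check; the package names and version tuples are hypothetical.

    # installed: package -> version; requires: package -> [(dependency, minimum version)]
    installed = {"libfoo": (1, 2), "editor": (3, 0)}
    requires = {"editor": [("libfoo", (1, 4))]}

    def unmet_dependencies(pkg):
        problems = []
        for dep, min_version in requires.get(pkg, []):
            have = installed.get(dep)
            if have is None or have < min_version:
                problems.append((dep, min_version, have))
        return problems

    print(unmet_dependencies("editor"))  # [('libfoo', (1, 4), (1, 2))]

Real package managers must perform such checks across an entire network of interdependent packages at once.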
3.2 System Configuration Management
Network and system configuration management is the process of maintaining the function of computer networks in alignment with some previously determined policy. A policy is a list of high-level
goals for system or network behavior. This high-level “policy”, describing how systems should behave, is translated into a low-level “configuration”, informally defined as the contents of a number
of files contained within the system whose contents affect system behavior[35].
3.3 How Configuration Controls Behavior
Configuration controls the behavior of a system by a variety of methods:
• Configuration text file
This method controls the behavior of computer systems by specifying
and controlling the contents of files stored in some form of non-volatile storage such as disk or
flash memory. For example, system services, like telnet, rlogin and finger are controlled
with a line in the configuration file /etc/inetd.conf or via files in the directory xinetd.d.
These files specify in detail how each host should behave. The mechanism behind configuration
files is that some computer programs, called daemons, read the content of the files and provide
or prevent specific behaviors; a simplified sketch of this mechanism appears after this list. These files are called configuration files and are referred to
collectively as the configuration of the system by convention.
The contents of configuration files are not controlled by regular users and do not change due
to actions of non-administrators. In this case, configuration management is the process of
specifying, modifying, and otherwise managing the contents of these configuration files.
• Database
In Windows systems, many host parameters are configured in a database called the system
registry. Many configuration tools maintain a central database that specifies configurations for
a network of hosts. The configuration data of the central database will be pushed to or pulled
by agents installed in those hosts. The agents will then change the contents of configuration
files on each host according to the central database in order to have specified behaviors.
• Transmitted protocol
Simple Network Management Protocol (SNMP) is a protocol designed to monitor the performance of network hardware. It is suited to non-interactive devices like printers and static
network devices like routers. The management console can read and modify the variables
stored on devices and issue notifications of special events. In spite of its limitations, for example weak security, SNMP remains the protocol of choice for the management of most network
hardware, and many tools have been written to query and manage SNMP enabled devices.
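Returning to the first of these methods, the sketch below shows the daemon side of a configuration text file: a program reads an inetd-style table and decides which services to enable. The sample entries and field layout are simplified for illustration and are not the exact inetd.conf format.

    # Simplified inetd-style configuration: one service per line,
    # "service protocol handler", with '#' marking a disabled entry.
    SAMPLE_CONFIG = [
        "telnet  tcp  /usr/sbin/in.telnetd",
        "#finger tcp  /usr/sbin/in.fingerd",
    ]

    def enabled_services(config_lines):
        # In practice the lines would be read from a file such as /etc/inetd.conf.
        services = []
        for line in config_lines:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # commented-out lines disable the service
            name, protocol, handler = line.split()
            services.append((name, protocol, handler))
        return services

    print(enabled_services(SAMPLE_CONFIG))
    # [('telnet', 'tcp', '/usr/sbin/in.telnetd')]

Changing the system's behavior then amounts to editing this file and letting the daemon reread it, which is exactly the kind of step that configuration management tools automate.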
System administrators oversee and take part in the configuration process in a variety of ways,
by performing manual configuration changes on one computing system at a time, or perhaps by
invoking computer programs that accomplish similar changes for a single system or the network as
a whole.
Chapter 4
Challenges of Configuration Management
Nowadays, more and more organizations involve computers and computer networks in their daily
work. As computer systems become players in complex communities (such as the stock market),
the expected results of governing these communities become more and more challenging to assure.
The complexity of system administration arises both from the complexity of the machines and the
demands of their roles in human-computer communities.
The unique challenges of configuration management as a practice include frequent changes in
policy and technologies, large scale of application, assemblies of imperfect software and hardware,
heterogeneity, contingencies, diverse users, ineffective collaborations of multiple system administrators, service guarantees, and the mobile environment.
4.1 Change
A fundamental characteristic of modern computing systems is that they need to be delivered and
modified rapidly in tight time frames, while the requirements for such systems are constantly changing.
Change of requirements is the source of complexity of many major issues in system administration. Given new requirements, system administrators must plan what to change to meet those
requirements. Any change involves risk. They must have an idea of what major things will be
affected by such a change. Dependency analysis is needed to show the consequences of actions.
And in many situations, changes must be made without disruption to mission critical services, for
example, an online banking service. To accomplish a change, system administrators might need to
install and configure new hardware and software, upgrade old software, reconfigure the system and
network, and repair the bugs caused by the change or be able to roll back to a previous state if the
change has major problems.
4.2 Scale
The scale of systems is a large part of the challenge. A typical operating system contains several
thousand files; a large number of other “packages” (drawn from a cast of thousands) may be added
over time. The number of software packages installed in a network of computers can be unbelievably
large. For example, at the site of the Computer Science Department at Tufts University (which is
a typical departmental site), in our study in the year 2001, we had around one thousand software
systems comprising about ten thousand programs installed on our network. Within a network, there are tens, hundreds, or thousands of machines, each of which might have a different architecture, hardware devices, operating system, and installed software applications. Fully installed and configured
systems thus tend to be both unique and complex, unless steps are taken to limit uniqueness and/or
complexity.
Scale poses several unique problems for system administrators. Tasks that were formerly accomplished by hand for small numbers of stations become impractical when system administrators must
configure thousands of stations, e.g., for a bank or trading company. In a large-enough network, it
is impossible to assure that changes are actually made on hosts when requested; the host could be
powered down, or external effects could counteract the change. This leads to strategies for ensuring
that changes are committed properly and are not changed inappropriately by other external means,
e.g., by hand-editing a managed configuration file.
4.3 Interdependence of Software and Hardware
Hardware and software in a system need to have all required elements, qualities, or characteristics, and can change in character over time as hardware ages and software is revised. Certain physical requirements must be met for hardware to function, e.g., temperature, humidity, and physical connections to power or other equipment. Hardware components normally come from different manufacturers; they are not necessarily compatible and are not guaranteed to work together when combined or connected.
The type of hardware limits the kinds of software that can execute on it. Software is provided by many developers and vendors. Software requirements for the system often differ, sometimes in contradictory ways. These multiple software systems share the same resource space, so resource competition occurs constantly.
The developers of software cannot completely foresee the various complex environments where
their software will be executed. Thus assuring the correct environment to support the function of
one kind of software can make another kind of software work poorly or fail.
Operating systems and programs often contain bugs and emergent features that were not planned
or designed for. System administrators must balance performance and cost and be vigilant of
changes in performance, configuration, or requirements that might cause failures of software or
hardware.
4.4 Heterogeneity
A typical managed network is composed of computers each with thousands of pieces of different
hardware and software. These components have complex dependency relationships and share and
compete for a common set of resources.
One factor that increases management complexity is heterogeneity: variations in architecture or
configuration among populations of machines. Each machine is potentially different from all others and may also carry out a different role, e.g., as a server or a client. They might have different
architectures, hardware devices, operating systems, or installed software applications.
The most common form of heterogeneity arises when a set of machines vary in architecture (e.g.
SPARC versus x86). However, heterogeneity appears in other and more subtle ways, including in
software environments. For example, hosts that are in different configuration states can be viewed
as having “heterogeneous configurations”. Hosts that must behave differently than others exhibit
“heterogeneous behavior”.
Heterogeneity increases the cost of management, since system administrators must take into
account differences between hosts. A highly heterogeneous network can make it difficult to deploy
system-wide configuration changes, since the actions to make a change (or even the nature of the
change) might differ on each station whose behavior should change.
Unintentional heterogeneity arises from utilizing different solutions to the same problem for no defensible reason. For example, a group of administrators might each choose a different path in which to install new software. Unintentional heterogeneity is a management problem.
However, intentional/controlled heterogeneity with a defensible reason is used to make the system
more robust and secure. A heterogeneous network is less vulnerable to a single type of failure or
security exploit[44].
System administrators must balance the needs of uniformity and heterogeneity and avoid unintentional heterogeneity.
4.5 Contingency
A contingency is an event whose occurrence causes system problems, and which may or may not
occur depending upon conditions and other factors. For example, the hard disk of a system may
fail. This is a contingency because if it does occur, steps must be taken to address the failure. A
configuration can be modified by many sources, including package installation, manual overrides, or
security breaches. Unintended changes in configuration represent one form of contingency. These
contingencies often violate assumptions required in order for a particular strategy to produce proper
results.
One problem that plagues configuration management tools and strategies is the creation of latent
preconditions. Like any software program, every configuration management operation requires some
pre-existing conditions to assure an appropriate behavioral result for the operation. A host requirement necessary to assure a desired effect for an operation is called a precondition of the operation.
A latent precondition of an operation is a precondition that is not known by the administrator
beforehand, but whose absence causes a behavioral problem after the operation is applied, for some
subset of hosts within a population. Due to software and hardware heterogeneity, often an individual system possesses a software or hardware property whose presence cannot be easily detected
except through failure of the system as a result of a configuration change. Latent preconditions are
due to a lack of complete knowledge of dependencies between different components of the system. For
example, very commonly, the act of replacing a dynamic library to repair one application might
break another application.
4.6 Diverse Users
Without users, system administration is trivial but meaningless. Users are both the reason for
computers to exist and their greatest threat. Each user has a different background and different views of and demands upon the services of systems and networks. For example, one user might want to keep an older version of a dynamic library because his or her application depends on it, while another user might request a newer version of the library for more features or better performance. A system administrator must balance all of these needs while ensuring the stability and security
of the system. For the common benefit of the whole community, policies must be planned and
enforced.
The group of users is not held constant, either. For example, in a university environment, large
groups of students enroll and graduate each year. System administrators need to periodically create
and delete user accounts and adjust storage space accordingly.
4.7 Ineffective Collaboration of Multiple System Administrators
Large scale systems often require a team of system administrators to work collaboratively so that
the day-to-day system administration work remains manageable.
The ideal situation for a system administrator is that his or her domain of change matches his
or her domain of responsibility: what is controlled matches precisely what one is responsible for
controlling. This ideal, however, is never achieved in practice; the typical system administrator has
power over subsystems for which he or she has no responsibility, and vice versa.
Since there is no practical way to control privileges so that an administrator’s domain of charge
and control matches a corresponding domain of responsibility, conflicts can arise when more than
one person works on configuring or changing the same aspects of a system. Since different system
administrators have different backgrounds, skill levels, and computing language preferences, good team discipline, including documentation, effective communication, and appropriate delegation of tasks, is essential.
4.8 Mobile Environments
Additionally, mobile computing devices are becoming more and more pervasive: employees need to
communicate with their companies while they are not in their office. They do so by using laptops,
PDAs, or mobile phones with diverse forms of wireless technologies to access their companies’ data.
Mobile devices and users add enormous challenges to configuration management, particularly for enforcing security policies.
4.9 Service Guarantees
Service guarantees for modern systems have never been in such high demand. For many companies, the system must be “alive and online” all the time. A minute of downtime might cause thousands of dollars in lost business.
System administrators must be able to react effectively, efficiently, and unintrusively to accomplish desired changes, without unacceptable downtime. They must also be able to quickly analyze
problems and correct configuration mistakes. At the same time, high performance expectations do
not allow system administrators to perform intrusive experiments to search for or optimize solutions.
Chapter 5
Automation and Autonomic Computing
A general problem of modern computing systems is that their complexity is increasingly becoming
the limiting factor in their further development. Large companies and institutions are employing large-scale computer networks for communication and computation. Distributed applications
running on these computer networks are diverse and deal with many different tasks, ranging from
internal control processes to presenting web content and providing customer support.
Automation is the use of control systems such as computers to control processes, replacing human
operators. System administrators use automation to manage the implementation-level complexity.
For example, replacing a manual configuration procedure with a computer program such as a script
is a widely used technique in system administration.
Autonomic computing is an industry-wide initiative started by IBM in 2001 and aimed at creating
self-managing computer systems to overcome their rapidly growing complexity and to enable their
further growth[57]. It is inspired by the autonomic nervous system of the human body. This nervous
system controls important bodily functions (e.g. respiration, heart rate, and blood pressure) without
any conscious intervention. Four functional areas are defined for autonomic systems:
• self-configuration: automatic configuration of components;
• self-healing: automatic discovery and correction of faults;
• self-optimization: automatic monitoring and control of resources to ensure the optimal functioning with respect to the defined requirements;
• self-protection: proactive identification of and protection from arbitrary attacks.
In autonomic systems, system administrators play a new role: they do not control the system
directly; instead, they define general policies and rules that serve as an input for the self-management
process.
Autonomic computing is an advanced form of automation. There is a large gap between autonomic computing and current automation approaches such as scripting. Currently, system administrators still function at the implementation level; they are the “translators” from high-level goals
to low-level machine configurations. Current tools only enable them to manage the low-level system with ease. In autonomic computing, system administrators are instead asked to interact with
systems at the management level of making policies, leaving the configuration process completely
to the autonomic systems (or configuring a baseline environment in which autonomic systems can
take over and manage configuration thereafter).
Our study benefits the development of automation of configuration management by describing
some of its theoretical boundaries. Our study shows that configuration management, including
dependency analysis and composition of configuration operations, is an intractable process in general
without further constraints. Constraints which make configuration management tractable are also
discussed in this thesis.
In the following two sections, we give brief summaries of the history of configuration management
and current configuration strategies.
5.1 History of Automation in Configuration Management
The context of system administration is changing. When the first electronic digital computers were
produced in the 1940s, compilation, linking, and loading of programs were entirely performed by
human operators. Administration of computers was inevitably accomplished manually. The original administrators loaded programs and managed batch jobs on the predecessors of time-sharing
machines. Very few people had the privilege to use a computer. The concept of maintaining a
particular standard of operation was not present. All operations were “best-effort”. Thus major
issues of current system administration, e.g. configuration management, user management, security, troubleshooting, and testing, were very different tasks compared to their present form in modern
computing systems.
Later multi-user multi-privilege operating systems such as UNIX and VMS became more prevalent than batch operating systems. The separation of system management from normal user functions and the ability to manage unprivileged users from a privileged shell made it possible for a group
of users to share computers and networks concurrently. The role of system administrator evolved
from running batch jobs to managing multi-processing, interactive computing systems. Early Unix
systems were often administered by volunteer computer users. Over time, as the complexity of
computing systems increased, these volunteers were in no position to guarantee quality of services.
Modern demands for service guarantees require dedicated professionals rather than volunteers.
Early attempts at configuration management were all direct interactions between the administrator and the system. The early systems were administered solely by running specific commands
and editing specific files. As these commands and files became increasingly complex, human error
became a significant factor. Administrators responded by “scripting”: placing useful commands
into files that could be replayed when needed. Scripts are not “programs” in the traditional sense
of computer science and system administrators need not be programmers. Instead, they are lists of
commands that can be replayed. Scripts can be used for scheduling services, retrieving data from
large log files, installation and configuration of systems and applications, etc. General-purpose
scripting languages in use include sh, csh, Perl, Python, and Tcl/Tk. As system administration tasks grow
more and more complex, so do the scripts to automate those tasks. System administrators are
“wired” into the system by the scripts since a substantial knowledge of the implementation details
of those scripts is required.
Nowadays, systems are rarely constructed from scratch; most system administration tasks involve managing pre-existing systems and integration with “legacy” infrastructure. Thousands of
scripts with their preconditions and postconditions and other implementation details, e.g., complex dependency relationships, are embedded within the systems. And these systems are required
to support business and industrial processes which are continually reconstructed and reorganized
to meet changing users’ demands. This complexity grows beyond human management capacity,
especially for large scale systems.
Declarative management represents a paradigm shift from scripted management. The idea of
declarative management is to let system administrators concentrate on making policies and rules for
the system, and to allow robotic agents to translate those policies and rules into implementation-level instructions and carry out the tasks. The principal difference between agent-based and scripted management is that programmers write the agents, while system administrators traditionally wrote the scripts. The first attempt at declarative management was Site[51], a site configuration language, which allowed site configuration to be specified via a centralized configuration file.
Cfengine[15, 16, 18] introduced the idea of convergence: that repeated execution of Cfengine scripts would bring the system into conformance with the desired state specified in the configuration file. However, the specifications and declarations of Cfengine configuration files remain low-level specifications concerning system and network contents instead of describing behaviors and restrictions. Full-fledged policy-based management has not come into wide use due to the cost of deployment.
Research on policy-based management has become one of the most vigorous areas of system administration, aiming to ease the burden of managing complex human-computing systems. Closure[30] is a new
model of configuration management based upon a hierarchy of simple communicating autonomous
agents. Each of these agents is responsible for a “closure”: a domain of semantic predictability in
which declarative commands to the agent have a simple, persistent, portable, and documented effect
upon subsequent observable behavior. Closures are built bottom-up to form a management hierarchy based upon the pre-existing dependencies between subsystems in a complex system. Closure
agents decompose configuration management via modularity of effect and behavior that promises
to eventually lead to self-organizing systems driven entirely by behavioral specifications, where a
system's configuration is free of details that have no observable effect upon system behavior.
Autonomic computing[57] is a top-down version of closure, although all current implementations
provide bottom-up features. IBM’s vision of autonomic computing[60] includes a vast and tangled
hierarchy of self-governing systems that in turn comprise large numbers of interacting, autonomous,
self-governing components at the next level down. Those autonomic systems and subsystems bear
the characteristics of self-configuration, self-optimization, self-healing, and self-protection. They
expect autonomic computing to involve an enormous range in scale, from individual devices to the
entire Internet.
The difference between “closures” and “autonomic systems” lies in what they assume about the
outside world. Autonomic systems often assume that the changes that can occur to a system are
known in advance. Thus they assume a “closed world” surrounding the managed system. Closures
assume that there is an “open world” of unpredictable events and are designed around dealing
with the unexpected. They are also different in their original design mechanisms. The idea of
closures came from the observation that the complexity of system administration is closely related
to flexibility of system configuration and its relationship to the likelihood of human error. To reduce
complexity, one must limit one’s environment and options to assure homogeneity and predictability,
thus leading to simplicity of management. By contrast, autonomic computing intends to add
“intelligence” to the system so that it can manage complex situations. These two research groups
have made progress in constructing components of their systems, e.g., WebSphere[58] from IBM and HTTP and IP closures[31, 96] from Tufts University. Both of these approaches encounter problems in
dealing with legacy infrastructures.
5.2 Current Strategies of System Configuration
Configuration management is a complex issue related to theory, practice, and policy. There are several existing strategies for conducting system configuration, namely manual configuration, custom scripting, structured scripting, file distribution, and declarative syntax. We give a brief introduction to these strategies in the following paragraphs.
5.2.1 Manual Configuration
In manual configuration, configurations are made entirely by hand. Manual configuration is only
cost-effective for small-sized systems with a few machines and a few users, but is often utilized even
for large networks in which few changes in function are expected over time. Manual configuration
has the advantage that system behavior is closely monitored during each step of configuration.
Errors can be easily corrected. However, manual configuration is not feasible for large networks
with frequent changes.
5.2.2 Custom Scripting
In custom scripting, manual procedures are encoded into repeatable automatic procedures using a
high-level language such as a shell, Perl, Python, or a domain-specific language. Custom scripting is
a first attempt at configuration automation. The main weakness of custom scripting is the difficulty
of addressing and reacting to pre-existing conditions of a host. Scripts are often crafted in haste
for one-time use and then employed over a long time-scale[27]. Each script requires preconditions
that are often poorly documented, if documented at all. Applying a script to a host that does not satisfy its
preconditions leads to unpredictable results and variation in actual configuration.
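As an illustration (a hypothetical sketch in Python, not taken from any particular tool or site), a custom script can at least make its preconditions explicit; the file paths here are invented for the example:

import os
import shutil
import sys

REQUIRED_FILE = "/etc/inetd.conf"     # hypothetical precondition: file must already exist
REPOSITORY_COPY = "/repo/inetd.conf"  # hypothetical master copy

def preconditions_hold():
    # This script assumes the host already has an inetd-style configuration
    # file; applying it to a host without one would yield an unpredictable result.
    return os.path.exists(REQUIRED_FILE) and os.path.exists(REPOSITORY_COPY)

def apply_change():
    # Replace the local configuration file with the repository copy.
    shutil.copyfile(REPOSITORY_COPY, REQUIRED_FILE)

if __name__ == "__main__":
    if not preconditions_hold():
        sys.exit("precondition not met; refusing to modify " + REQUIRED_FILE)
    apply_change()

Most custom scripts, as noted above, leave such checks implicit, which is exactly what makes them fragile when reused.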
5.2.3 Structured Scripting
Structured scripting is an enhancement of custom scripting that allows one to create scripts that
are reusable on a larger network by providing a framework that assures repeatable preconditions
and manages portability between heterogeneous sets of hosts. There are two basic approaches to
structured scripting: execution management[90, 91] and variable instantiation[73].
Execution Management
ISConf[59, 90, 91] structures configuration and installation scripts into “stanzas”:
scripts whose execution order is held constant across a population of hosts. On each host, maintenance of host state determines which stanzas to execute. The postconditions of the operations
already completed are treated as the preconditions of the next (whether or not this is actually
true), and these are always the same for a particular host. ISConf uses time stamp files to remember which stanzas have been executed on each host, and assures that hosts that have missed a cycle
of configuration due to downtime are eventually brought up to date by running the scripts that
were missed.
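The mechanism can be sketched roughly as follows (a hypothetical Python sketch, not ISConf's actual implementation): stanzas are kept in a fixed, never-reordered sequence, and a per-host state file (a simple counter standing in for ISConf's time stamp files) records how far the host has progressed, so a host that missed a cycle resumes where it left off.

import subprocess

# Stanzas in a fixed order; the script names are hypothetical.
STANZAS = [
    "install_network_card.sh",
    "bring_up_network.sh",
    "download_ssh.sh",
    "install_ssh.sh",
]

STATE_FILE = "/var/lib/stanza_state"  # hypothetical per-host progress record

def read_state():
    try:
        with open(STATE_FILE) as f:
            return int(f.read().strip())
    except FileNotFoundError:
        return 0

def write_state(n):
    with open(STATE_FILE, "w") as f:
        f.write(str(n))

def bring_up_to_date():
    done = read_state()
    # Run only the stanzas this host has not yet executed, in order.
    for i in range(done, len(STANZAS)):
        subprocess.run(["/bin/sh", STANZAS[i]], check=True)
        write_state(i + 1)

if __name__ == "__main__":
    bring_up_to_date()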
The strength of ISConf is that when changes are few and the environment is relatively homogeneous, it produces repeatable results for each host in the network. This means that if one host is
configured with ISConf and exhibits correct behavior, it is likely that other hosts will exhibit the
same behavior as well.
The fact that “order matters”[90] is a daunting limitation of the strategy. A new script can only be added at the end of the existing stanzas; otherwise the whole sequence needs a complete rebuild. When using ISConf, it is impractical to delete a pre-existing stanza or change stanza order, since doing so could violate the preconditions of later stanzas. Erroneous stanzas that misconfigure a host
cannot be safely deleted; they must instead be undone by later explicit stanzas. Thus the size and
complexity of the input file grows in proportion to changes and states, and quickly becomes difficult
to understand if changes and/or errors are frequent. In this case, one must start over, re-engineering
the entire script sequence from scratch and testing each stanza individually. A similar process of
starting over is required if one wishes to apply an ISConf input file for one architecture to a different
architecture; ISConf input files are not guaranteed to be portable.
Variable Instantiation
Another approach to structured scripting is to code the locations of common system settings as variables so that scripts can be written portably across operating systems. In Problem Informant/Killer
Tool (PIKT)[73], common files are located through variables whose contents change to reflect operating system values, via a mechanism similar to that in Imakefiles. A well-engineered PIKT script
can be utilized on many different hosts.
One shortcoming of PIKT is that a script can be written once but must be verified and validated
on every kind of target platform.
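The idea can be illustrated with a small sketch (hypothetical bindings in Python, not PIKT syntax): platform-specific locations are bound to variables in one place, and the script body refers only to the variables.

import platform

# Platform-specific locations are bound to variables once, in one place
# (the paths below are illustrative, not authoritative).
LOCATIONS = {
    "Linux": {"services_file": "/etc/services"},
    "SunOS": {"services_file": "/etc/inet/services"},
}

def instantiate():
    # Choose the variable bindings for the local operating system,
    # falling back to the Linux bindings if the platform is unknown.
    return LOCATIONS.get(platform.system(), LOCATIONS["Linux"])

def count_udp_services(bindings):
    # The script body refers only to variables, so the same logic is
    # portable across the operating systems listed above.
    with open(bindings["services_file"]) as f:
        return sum(1 for line in f if "/udp" in line)

if __name__ == "__main__":
    print(count_udp_services(instantiate()))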
5.2.4 File Distribution
Practitioners were quick to understand the limits of custom scripting and have struggled for decades
to design a more robust method of making configuration changes. The first attempts to replace
custom scripting employed file distribution. In this strategy, one maintains master copies of crucial
configuration files in a repository, and periodically automatically distributes these copies to managed
hosts[23, 26]. This largely avoids the problems of sequencing encountered in custom scripts, but
replaces these with problems of execution scalability and several capability limitations.
File distribution schemes such as RDIST[23] rely upon a single master server that runs local
commands on clients to force them into compliance. This is an inherently serial process that takes a
very long time in large networks[26]. As well, the process is plagued by the fact that all knowledge
of the variations in platforms has to be codified and stored on the central server, a daunting and
error-prone manual task. RDIST also suffers from lack of ability to express precedence between
related file copying operations, as well as excessive repository sizes as variations are required. One
copy of each version of each file must be stored. Initial strategies for combating this version explosion include replacing simple distribution with remote execution of post-install scripts[26] that
deal with portability issues. Proscriptive configuration generation is a technique for managing the
combinatorial explosion that occurs in using file distribution for configuration management when
networks are large and heterogeneous. A description of appropriate network behavior is translated
into the precise configuration file contents that assure that behavior[6, 8]. The translation is either accomplished centrally and then transmitted to clients or generated by the clients themselves
through use of a distributed agent. The agents installed on the hosts read a host configuration
generated from that data. Ongoing administrative tasks include the description and the agent; as
requirements change, new kinds of files must be generated. The description may be maintained in
many ways, including databases, XML, or plaintext. Generating files from a master template minimizes the problems of unintended consequences encountered via the other methods, but changes
the problem of debugging scripts to that of writing appropriate generators.
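As a hedged illustration of configuration generation (hypothetical host names and parameters, in Python), a central description of desired behavior can be translated into per-host file contents by a small generator:

# Central description of desired behavior for each managed host
# (hosts and parameters are invented for the example).
DESCRIPTION = {
    "www1.example.com": {"http_port": 80},
    "www2.example.com": {"http_port": 8080},
}

TEMPLATE = "Listen {http_port}\nServerName {name}\n"

def generate(host):
    # Translate the behavioral description for one host into the precise
    # configuration file contents intended to assure that behavior.
    params = DESCRIPTION[host]
    return TEMPLATE.format(http_port=params["http_port"], name=host)

if __name__ == "__main__":
    for host in DESCRIPTION:
        print("# ---- " + host + " ----")
        print(generate(host))

As the text notes, the ongoing work then shifts from debugging scripts to maintaining the description and the generator.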
5.2.5 Declarative Syntax
Declarative syntax is a configuration management strategy wherein custom scripts are replaced by
an autonomous agent[15, 16, 18, 25]. This agent interprets a declarative configuration file that
describes the ideal state of a host, then proceeds to make changes that bring the host somehow
“nearer” to that ideal state.
The main configuration management agent in contemporary use is Cfengine, whose declarations
and operation bear a strong resemblance to logic programming[29]. It first determines a set of
facts about the local system, then proceeds to correct any facts that are not in compliance with its
idea of system health.
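The convergent style of operation can be sketched roughly as follows (an illustrative Python sketch, not Cfengine's language; the checks and corrective actions are hypothetical):

import os
import stat

def file_mode_is(path, mode):
    # Fact-gathering: test one property of the local system.
    return stat.S_IMODE(os.stat(path).st_mode) == mode

def set_file_mode(path, mode):
    # Corrective action that brings the fact into compliance.
    os.chmod(path, mode)

# Each entry declares an ideal state as a (check, fix) pair.
PROMISES = [
    (lambda: file_mode_is("/etc/passwd", 0o644),
     lambda: set_file_mode("/etc/passwd", 0o644)),
]

def converge():
    # Determine facts about the local system, then correct any facts
    # that do not comply with the declared ideal state.
    for check, fix in PROMISES:
        if not check():
            fix()

if __name__ == "__main__":
    converge()

Repeated runs of such an agent are idempotent: once the system complies with the declarations, further runs change nothing.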
The main benefit of declarative syntax over scripting is that we avoid forever the problem of
writing and maintaining fragile custom software. Unlike ISConf, in which order and content must
be preserved, Cfengine must instead preserve management of objects once they are managed in
any form. A second benefit of Cfengine over all prior forms of configuration management is that
the agent for a particular host has distributed authority about its own needs. This means that
no central repository must be kept of data about individual hosts; they can be easily customized
without maintaining a global snapshot of desirable state.
The weakness of declarative syntax is “incremental burden of management”: once touched,
a file must remain managed forever.
For example, suppose one configures Cfengine to edit
/etc/inetd.conf and then applies that script to most of the hosts in an environment, skipping
those that are currently powered down. This creates heterogeneity (a difference) among hosts. Unless the system administrator accounts for the possibility of the change in all further
use of Cfengine, it is possible that some hosts that were initially powered down will never receive
the change, so that hosts that do not get the change might respond differently from hosts that do
get the change.
The current declarative syntax tools like Cfengine operate at the file contents level, i.e., they
only declare what the contents of a file should be, without validating the meaning of those contents. This requires system administrators to manage the meaning of file contents. Few if any of the current configuration tools possess the “intelligence” to make configuration a completely automated
procedure.
Chapter 6
The Configuration Process
In the theoretical study of this thesis, we show that, in the worst case, configuration management is
intractable. It is important to understand how these theories fit into the real world and how system
administrators make configuration management tractable in practice.
We make the link between our theory and practice by first observing how system administrators
operate in the real world in this chapter and summarizing techniques used by system administrators
that reduce the complexity of configuration management in Chapter 12.
Documentation and experience are two important factors that we must discuss first before we
move to the configuration process.
6.1 Documentation
Documentation and experience are two key factors that keep the system manageable.
Documentation is the organized collection of records that describe the purpose, structure, requirements, operations, functional specifications, history of previous changes, and maintenance for
a computing system or a system component such as a computer program or a hardware device.
In our discussion, documentation is not limited to documents provided by the developers of a
system or system component. Rather, documentation includes documents written down by system
administrators at the same site or other sites (and perhaps posted on the Internet or published in
papers and books) and documents provided by the developers of system components.
Documentation plays a critical role in system administration. The ultimate goal of system
administration is to keep the system in alignment with system requirements, rather than to develop
a full understanding of the system. With appropriate documentation, system administrators can
bypass the analysis of complex implementation details of the system. However, the analysis cannot be completely avoided because documentation is not always 100% accurate, and might not consider the (often complex) current environment in which system administrators must work.
Trusting documentation can be risky due to the limitations of these documents. A principal
problem with documentation is that systems are extremely complex and documentation is usually
incomplete. The reasons for this incompleteness include that the system can assume many more
physical states than can be covered explicitly in the documentation, and these states perhaps correspond to physical behaviors too varied to document. Documentation can be incomplete because:
• It does not document or describe the current state of the system.
• It does not foresee the effects of particular sequences of changes.
• It does not cover cases in which other modules interact with a given one.
• It does not cover cases in which components are faulty and do not function as documented.
The accuracy and feasibility of documentation must be judged by the system administrators based
on their past experience, experiments and observation, or even intuition.
Measured by sheer volume, many systems and system components are superbly documented. We
have a wealth of online and printed documentation. Unfortunately, sheer volume is not everything
or even (really) enough. If administrators, programmers, or users cannot find the information they
want in a reasonable period of time, the documentation fails its purpose.
6.2 Experience
The experience of a system administrator determines the ability to choose correctly from multiple options for assuring a specific behavior for a system[56]. “Experienced” system administrators
choose correctly from options more frequently than “inexperienced” system administrators. “Experienced” system administrators know where to look in documentation for details of specific features, while “inexperienced” system administrators may face a long search. The experience of a system
administrator includes:
• memory of solutions,
• mental maps of documentation and where to locate specific facts and procedures, and
• mental maps of the expertise of peers, and who to contact for advice about specific problems.
The experience of system administrators is very important to their overall performance in assuring correct system behavior. Hellerstein[56] pointed out that the complexity of a configuration
task can be measured by how much expertise is needed to complete it. With a complex task, there
is a large difference between the time taken by an experienced person versus the time taken by a
novice. As modern computing systems become more and more complex, no system administrator
can always configure the system from memory without consulting its documentation. The issue of
how to find specific information is crucial. Thus the experience of system administrators becomes more
precious since experienced system administrators can efficiently and correctly decide which expert
they should consult, which documents they should read, and what tests they should perform to
validate the system.
6.3 The Configuration Process
The configuration process can be divided into three stages: learning, planning, and deploying. Each stage involves loops of activities. Figure 6.1 describes some qualities of the process and gives a high-level picture; Figure 6.2 shows more details.
During the learning stage, system administrators learn about the policy, the system, and the connections between them. They gather information in order to motivate the design and construction
of solutions. System administrators normally first refer to their own memory to compose necessary
actions to accomplish system requirements. At this stage, they may perform some tests of the
system to verify and validate their past experience and their knowledge of the system. They may
also consult other system administrators or refer to various documentation including the experience
written down by other system administrators and documents provided by the system developers
and vendors.
The planning stage cannot be separated from the learning stage. System administrators plan
what to do while they learn. But the emphasis of the planning stage is to come up with a sequence
of operations that can transform the system to a desired state.
Dependency analysis is the process of discovering interactions and relationships between entities
of a system. The role of dependency analysis in system administration is difficult to describe,
because such analysis does not arise as a separable process from learning, planning, and doing.
Average system administrators avoid doing any kind of dependency analysis or perhaps choose to
do a very simple procedural dependency analysis: determining the order of configuration procedures
that produces a specific result. For example, to install an ssh service, one must first install a
network card (1), bring up the network connection (2), then download the ssh package (3), and
last install the package (4). There is a dependence/order between different procedures. One must
perform procedure 1 before procedure 2; otherwise procedure 2 will fail. Procedural dependencies
are more abstract than static and dynamic dependencies (the static and dynamic interactions and
relationships among entities of a system) in the sense that there are semantic reasons for the order
of procedures. However, system administrators do not usually care about the semantic grounding
of their actions, provided that the actions perform and produce results as documented.
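To make the ssh example concrete, such procedural dependencies can be recorded as ordering constraints and a planned sequence checked against them; the following Python sketch is hypothetical and only illustrates the idea.

# "a depends on b" means procedure b must be performed before procedure a.
DEPENDS_ON = {
    "bring_up_network": ["install_network_card"],
    "download_ssh":     ["bring_up_network"],
    "install_ssh":      ["download_ssh"],
}

def respects_order(plan):
    # Check that every procedure appears only after its prerequisites.
    done = set()
    for step in plan:
        if any(dep not in done for dep in DEPENDS_ON.get(step, [])):
            return False
        done.add(step)
    return True

if __name__ == "__main__":
    good = ["install_network_card", "bring_up_network", "download_ssh", "install_ssh"]
    bad = ["bring_up_network", "install_network_card", "download_ssh", "install_ssh"]
    print(respects_order(good))  # True
    print(respects_order(bad))   # False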
The result of planning is a sequence of operations that “should” be able to transform the system
to a desired state. Normally, the sequence is chosen from “best practices”: shared practices agreed
upon by a group of system administrators for a site. By doing this, system administrators bypass
another difficult problem: composition of configuration operations, which we have shown to be
another intractable problem besides dependency analysis[87].
The deployment of configuration operations can be done manually, through scripting, or by
delegation. If it is done manually, system administrators work in a tight loop of deployment and
testing. By scripting, system administrators put a series of configuration operations in one program
and execute the program. By delegation, system administrators delegate the task to someone else.
After the deployment, if the system satisfies requirements, system administrators terminate the
process by updating the documentation if needed. If the system fails to demonstrate the desired
behavior, system administrators loop back to the learning and planning stages. If they think the
system is at some unmanaged state, they may choose to re-baseline the system and start from
scratch with an initial configuration whose properties are well known.
There are several characteristics that demonstrate the uniqueness of the configuration process: asynchrony of inputs, high possibility of partial completion of each step, non-determinism of transitions between steps, looping and iteration among steps, and dynamism of the people (users, management,
and system administrators) and things (requirements, the documentation, and the system) involved
in the process.
• Asynchrony of inputs
During the configuration process, system administrators interact with many people and things
(refer to Figure 6.3). Asynchrony means that what system administrators decide to do at a particular time depends upon inputs from other people that arrive at arbitrary times.
Figure 6.1: The configuration stages
Figure 6.2: The configuration process
Figure 6.3: Asynchronous interactions between system administrators and the environment
For example, a system administrator starts to do something according to his or her original
plan; he or she receives an asynchronous call from the management requesting something else,
then users might request a third thing. He or she then needs to do some conflict resolution
to orchestrate all these things.
Even for a single thread, the configuration process can be asynchronous too. For example, a
system administrator gets a request to add a user to a group. Based on his/her experience,
he/she starts with editing /etc/groups, but it does not work. He/she then sends an email to
an expert for advice. In the meantime, he/she seeks other options. He/she might try “yp”, a
directory service. Before he/she makes the observation that yp does not work, he/she might
get an asynchronous message from the expert suggesting LDAP. He/she then drops yp totally
and tries LDAP and it works. The control flow of the process highly depends upon the inputs
from other people that occur asynchronously such that a general control flow diagram is not
feasible.
• Partial completion of steps
System administrators do not finish one step completely before moving to the next. They
often only partially finish one step and move back and forth between two unfinished steps. For
example, when they use documentation, they do not sit down and read the entire document;
as soon as they find something helpful, they integrate it with their own experience and come
up with a sequence of operations to accomplish the requirements, or they may perform tests
to further validate the system and the documentation.
• Non-determinism of paths
System administrators do not strictly follow predefined rules when configuring the system.
The configuration process is similar to Markov transitions[61]. At each step/state, there
might be several paths that system administrators can take. There is a probability distribution describing the likelihood of different paths. For example, upon a request, a system
administrator might consult other system administrators or they might search for documentation. The pattern of what system administrators decide to do at each step is different for
different people, different tasks, and different circumstances. However, the realistic process is
not Markovian, as memory plays a role in which steps will be taken next[34].
• Looping and iteration
The practice of system administration in seeking performance and policy goals can be modeled
as a series of loops involving seeking knowledge and testing understanding. The administrator alternates between observing behavior, reading about options, planning strategies, and
deploying solutions. This is not a typical “flow chart” of the sort that would describe a
computer program; instead the activities of reading, planning, observing, and deploying are
typically intermixed and interdependent rather than separate and distinct.
• Dynamism
The requirements, experience, documentation, and system itself are not static; they change
dynamically with time. During the configuration process, requirements can be revised by
management and users. So the objectives for configuration can be a “moving” target, which
increases the difficulty and cost of the configuration. The experience of system administrators
is changing every day. They learn about the system through successes and failures and forget
some knowledge as a natural process. Documentation regarding the system is constantly
updated by system administrators and system developers. The system itself changes with
time as well.
In summary, the current configuration process is “human-centered” in the sense that the experience of system administrators has a dominant effect within the process. The configuration process
is not a well-defined control flow diagram due to five factors: (1) the inputs from other people
are asynchronous; (2) the process involves loops and iteration of steps; (3) it is not necessary to
completely finish one step before move to the next; (4) the transaction between steps is not well
determined and is subject to the current situation; and (5) people, requirements, and systems in
the process are changing.
Chapter 7
A Model of Configuration Management
A rigorous language for discussing the issue of configuration management is currently lacking. To
this end, we develop a simple state-machine model of configuration management. Configurations
or observed behaviors comprise the state of a system and configuration processes accomplish state
transitions. Our theoretical discussion is based upon this model of configuration management, which
we abstract from practice.
The word “system” is perhaps too vague to describe what the system administrator
manages. The word “system” refers to different things in different chapters. It refers to a host in
the theory of reproducibility, a host or a site in the theory of configuration operations and their
composition, and any system with interacting subsystems in the theory of dependency analysis.
In most chapters, we do not emphasize the interactions within a system except in the chapter
on dependency analysis. We assume each “system” to be a closed-world system; we explain this
concept in the next section.
7.1 Closed- vs. Open-world Models of Systems
Theoretically, a closed-world system is a system that has no interactions with anything outside of
the system. In practice, a system is effectively closed if the behavior that can be affected by outside
forces is not described in the functional specification. For example, one may change the color of
a web server, but this change does not affect whether this web server can achieve its functional
expectations to provide web service, so the web server can be considered to be in a closed-world
system even if the color is determined by outside forces, e.g., an interior decorator.
A system is also effectively closed if there are dependencies upon constant properties of the
outside world that cannot be violated. For example, in normal circumstances, we treat a web server
and a DNS server as a closed-world system without considering whether electric power will fail,
since it is nearly always available.
By contrast, in an open-world system, unknown entities from outside the boundaries of the system affect the behavior of entities of the system in such a way that whether functional expectations
can be satisfied is not solely determined by the states of the system. The behavior of the system is
thus unpredictable. In some cases, this can be addressed by making the system boundaries larger.
For example, one can draw a boundary around both DNS and web services in order to view the pair as a
closed system; the configuration of DNS affects the behavior of web service too much for the two
entities to be considered independent.
7.2 Observed Behavior
In the previous section, we noted that the difference between “closed” and “open” systems depends
upon what we choose to observe or not to observe about the system. We next must define what we
mean by “observed behavior”. This requires identifying the statements whose truth or falsehood we might check in order to determine whether a particular behavior is present.
Suppose that we are given a finite set of tests T that we can apply to a configuration.¹ Tests are statements that can be determined to be either “TRUE” or “FALSE” by some testing method. For example, tests
might include:
• The system has a 5 GB hard drive.
• The system has at least 128 MB of RAM.
• The system has a network card.
• Port 69 is listed in /etc/services.
• TCP port 80 answers web service requests.
• UDP port 69 rejects tftp requests.
¹ The set of “all” possible tests is not finite, but in practice, the set of tests that we focus upon must be finite.
Some of these facts concern hardware configuration, while others concern behavior, both internal
and external. A numeric measurement is represented by a set of facts, one per possible value.
For example, if there can be between one and five server processes on a machine, this would be
represented by the five tests as to whether that number was 1, 2, 3, 4, or 5. In practice, all the
parameters can take finitely many values. Thus T is a finite set. Also,
Axiom 1 The observed behavior of a system is a function of its actual configuration.
Here we do not consider dependencies upon other systems. We assume that the system under study
is in a closed-world system so that its behavior cannot be affected by other systems. Also, tests
are chosen to represent static properties of configuration, not resource bounds, and presume the
presence of adequate resources to perform the test.
The observed behavior of a system is a subset ψ of T , where t ∈ ψ exactly when the system passes
the test t. We represent user requirements as a subset R ⊆ T . For each test t that should succeed,
we add it to R; for each test t ∈ T that should return false, we instead add its complement to R. Any test not mentioned in R is one for which we are not concerned about the outcome. A
system meets its requirements if ψ ⊇ R.
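A small illustrative sketch (with hypothetical test names, in Python) models observed behavior and requirements directly as sets of tests:

# ψ and R as subsets of a test set T (test names are invented for the example).
T = {"has_network_card", "port_80_answers_http", "port_69_rejects_tftp"}

# Observed behavior: the tests the system currently passes.
psi = {"has_network_card", "port_80_answers_http"}

# Requirements: the tests that should pass.
R = {"port_80_answers_http"}

def meets_requirements(psi, R):
    # A system meets its requirements exactly when ψ ⊇ R.
    return psi >= R

if __name__ == "__main__":
    print(meets_requirements(psi, R))  # True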
7.3 Actual State and Observed State
The actual state of a system is defined as its configuration information recorded on the machine.
This is distinguished from its “observed state”, which describes the set of behaviors in which we
are interested.
From Axiom 1, we eliminate possible behavioral effects caused by inadequate resources. The
actual state of a system is its configuration, denoted as s ∈ S where S is the set of all possible
configurations of a system. |S| can be arbitrarily large.²
We also assume that the behavior of a system is always in synchronization with its actual state.
We do not consider the cases where the configuration files are modified but server daemons do not
reread these files or act accordingly.
The observed state of a system is defined as the observed behavior of a system, i.e., ψ ⊆ T . We
define Ψ to be the set of all possible observed states of a system. Note that |Ψ| = 2^{|T|}.
A test function σ can be defined to map from an actual state to an observed state, i.e., σ : S →
power(T ).
² However, we assume the size of S is finite since in practice, even though we are dealing with large scale systems, the total number of possible states is still finite.
7.4 Configuration Operations
Configuration operations act on the actual state of a system, not the observed state.
Definition 1 A configuration operation p takes as input a state s and produces a modified state
s′ = p(s).
Configuration operations might include things like:
• replace /etc/inetd.conf with the one at foo:/bar/repo/inetd.conf.
• delete all udp protocol lines in /etc/services.
• cd into /usr/src/ssh and run make install.
An operation can be automated or manual, accomplished by a computer program or by a human
administrator. Operations can be combined through function composition:
Definition 2 For two operations p and q, the operation q ◦ p is defined as applying p, then q, to s:
(q ◦ p)(s) = q(p(s)).
In this thesis, sequences of operations are read from right to left, not left to right, to conform to
conventions of algebra.
Axiom 2 For any system, there is a baseline operation b that – when applied to any configuration
of the host – transforms it into a baseline configuration with a predictable and repeatable actual
state, a “baseline state” B.
There are no other restrictions on b, for example b could be “reformat the hard disk”. Note
that since observed behavior is a function of actual state, the baseline state also corresponds to a
repeatable observed state as well.
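These definitions can be mirrored in a minimal sketch (a toy state and hypothetical operations, in Python), with the baseline operation b discarding whatever came before and composition read right to left:

# Toy model of configuration operations and composition
# (hypothetical operations; the "state" is a dictionary of settings).

def compose(q, p):
    # (q ∘ p)(s) = q(p(s)): apply p first, then q.
    return lambda s: q(p(s))

def b(s):
    # Baseline operation: yields the same baseline state B regardless of s.
    return {"packages": frozenset(), "services": frozenset()}

def install_ssh(s):
    return {"packages": s["packages"] | {"ssh"}, "services": s["services"]}

def enable_sshd(s):
    return {"packages": s["packages"], "services": s["services"] | {"sshd"}}

if __name__ == "__main__":
    p_tilde = compose(enable_sshd, compose(install_ssh, b))
    # Applying p̃ to any starting state yields the same result, because b
    # erases the history: the result is determined by the sequence after b.
    print(p_tilde({"packages": frozenset({"junk"}), "services": frozenset()}))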
Definition 3 Given a set of m configuration operations P = {p1, p2, · · · , pm}, let P∗ represent the set of all possible results of composing finite sequences of operations, i.e., P∗ = {p̃ | p̃ = pα(1) ◦ pα(2) ◦ · · · ◦ pα(k−1) ◦ pα(k), where k ≥ 1 is an integer, α : [k] → [m], and pα(i) ∈ P} ∪ {ε}, where p̃ is a sequence of operations and ε represents the empty operation (“do nothing”).
P ∗ is the set of all possible results of composing finite sequences of operations from P . Sometimes,
we will loosely refer to the sequence as the composition of the elements of the sequence.
Definition 4 The set of reachable states of a system with respect to a baseline state B and operations P is S = {s = p̃(B), p̃ ∈ P ∗ }.
The set of reachable states is a subset of all possible actual states. There might be other ways
to change an actual state other than configuration operations. In this thesis, we do not distinguish
these two concepts and assume that only configuration operations can change the actual state of a
system. This assumption is reasonable because usually the set of actual states is not arbitrary, but
is instead the result of applying configuration operations to some known initial system state.
The direct consequence of Axiom 2 is that an actual state s constructed by starting at the
baseline configuration and applying a series of operations from P is completely determined by the
sequence of operations applied since the last baseline operator b, i.e., all operations prior to the last
baseline can be ignored. Thus there is a one-to-one correspondence between sequences of operations
since the last baseline operation b and configurations s ∈ S. Thus for the rest of the thesis, we
represent s as a sequence of operations after the baseline operation.
In practice, many sequences of operations can have exactly the same effect; thus there is a
(perhaps poorly understood) equivalence relation E on P∗, where p̃i ≡ p̃j whenever applying either sequence to the same starting state produces the same final state. There are two kinds of equivalence:
observed and actual. Two machines are observably equivalent if their results agree for all tests in a
test suite T . They are equivalent in actuality if their configurations are identical. In practice, the
latter is impossible; two machines that are identical in configuration cannot even share the same
network; they must have differing internet addresses, and thus their configurations must be different
in some crucial ways. The exact nature of equivalence between machines is a central issue in many
tools, including radmind[50] and ISConf[90, 91]. In this thesis, we define equivalence in terms of
observed state to avoid ambiguity.
7.5 Two Configuration Management Automata
Two configuration automata can be defined: the first is based upon actual states of the system
and the second is based upon observed states of the system.
M1 : (S, P, W1 )
M2 : (Ψ, P, W2 )
where S is the set of all possible actual states (configurations) of the system, P is the set of
configuration operations applicable to the system, and W1 and W2 are transition rules and defined
as follows.
W1 is the set of all triples (p, s, s′) where s ∈ S is the actual state of the system before an input configuration operation p ∈ P and s′ ∈ S is the resulting actual state of the system.
W2 is the set of all triples (p, ψ, ψ′) such that there exist s ∈ A(ψ) and s′ ∈ A(ψ′) with s′ = p(s).
Note that M1 can be arbitrarily large since |S| is arbitrarily large; M2 is bounded by 2^{|T|}.
M1 is deterministic by Axiom 1 and Axiom 2 since the actual state of a system can be represented by its configuration from Axiom 1 and its configuration can be represented by a sequence
of operations applied to its baseline state from Axiom 2. M2 is possibly non-deterministic. Before
each operation, the observed state ψ corresponds to a subset of possible actual states A(ψ) ⊂ S.
Without further constraints, a typical operation p seems non-deterministic, because the actual states
that can result from applying p are the set p(A(ψ)) = {p(s) | s ∈ A(ψ)}. Thus it is possible that p,
when applied to a system in observed state ψ, can produce one of several observed states in σ(p(A(ψ))) as a result. Without further information, one cannot limit p(A(ψ)) to the particular configuration s′ ∈ p(A(ψ)) that is actually in effect after applying p. This uncertainty leads to apparent
non-determinism when applying configuration operations.
Chapter 8
Reproducibility
The goal of any configuration management strategy is to achieve reproducibility of effect: repeating
the same configuration operation on different hosts in a large network produces the same behavior on
each host. Configuration operations should cause deterministic state transitions from one behavioral
state to another.
Reproducibility is difficult to achieve due to the difference between the actual state of a host and
the observed state that humans and operations can practically observe. In making a configuration
change, it is not practical to examine the whole state of the hard disk beforehand. Much of the
actual state of a host is not observed. Latent preconditions arise in parts of host state that humans
or operations are not currently considering when making changes.
The majority of the discussion of reproducibility theory is described in [33]; we reiterate
that discussion here for completeness. In this chapter, we show that for one host in isolation and
for some configuration processes, reproducibility of observed effect for a configuration process is a
statically verifiable property of the process. However, reproducibility of populations of hosts can
only be verified by explicit testing. Using configuration processes verified to be locally reproducible,
we can identify latent preconditions that affect behavior among a population of hosts. Constructing configuration management tools with statically verifiable observed behaviors thus reduces the
lifecycle cost of configuration management.
8.1 Local Reproducibility
First we develop the concept of reproducibility for a single host h. Even though each configuration
operation itself is deterministic, a change in actual state from s to s′ = p(s) may not effect a change
in the observed state ψ; the observed state is an unspecified function of the actual one. This results
in situations where the same configuration operation, applied to two configurations in the same
observed state but differing actual states, leads to two different observed states as a result. This
can occur if prior configuration operations left the two configurations in differing actual states that
are observed as identical, but for which further operations expose differences. Local reproducibility
is formally defined as follows:
Definition 5 Suppose we have a system h, a set of actual states S for h, a set of candidate operations P appropriate to h, a set of tests T that one can perform on h, a test function σ :
S → power(T ), and a map φ from each observed state ψ ∈ Ψ to a subset φ(ψ) ⊂ P that it is
appropriate to apply to h when it is in the observed state ψ. Then the formal system (h, S, T , P, φ)
exhibits observed local reproducibility (or simply local reproducibility) if the following condition
holds:
For every pair of actual states s, s′ ∈ S with σ(s) = σ(s′) and for every p ∈ φ(σ(s)), we have σ(p(s)) = σ(p(s′)).
The intuition behind the definition is that local reproducibility means that, for a single host, two actual states that have the same behavior will continue to have the same behavior if the same operation is applied to both.
Another possible definition of local reproducibility is that the configuration automaton based
upon observed behavior for a single host is deterministic.
Proposition 1 The formal system (h, S, T , P, φ) exhibits observed local reproducibility exactly when
the state machine (Ψ, P, W2 ) with states Ψ = σ(S), operations P , and transition rules
W2 = {(p, σ(s), σ(p(s))) | s ∈ S, p ∈ φ(σ(s))}    (8.1)
is deterministic.
Proof: Suppose we have a formal system exhibiting observed local reproducibility as in Definition 5,
and construct the state machine of the proposition. By hypothesis, the only allowable operations
change one configuration in S into another in S, so that this state machine can be in one of a limited
number of states σ(S) ⊂ Ψ. Now start in a state ψ ∈ σ(S) and apply an operation p ∈ φ(ψ). Let
s, s0 be two actual states such that σ(s) = σ(s0 ) = ψ. Then by hypothesis, σ(p(s)) = σ(p(s0 )), so
that the result of p is invariant of choice of s. Thus the state machine is deterministic. The converse
is similar. 2
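As an illustration of Proposition 1, the following Python sketch (hypothetical states and tests) checks determinism of the observed-state machine directly: it records, for each pair of observed pre-state and operation, the observed post-state, and reports a failure if two actual states with the same observed state disagree.

def sigma(s, tests):
    """Observed state of actual state s: the subset of tests that pass."""
    return frozenset(name for name, t in tests.items() if t(s))

def is_locally_reproducible(S, P, tests):
    """Brute-force check of Definition 5 with phi(psi) = P for all psi."""
    outcomes = {}                                   # (observed pre-state, op name) -> observed post-state
    for s in S:
        pre = sigma(s, tests)
        for name, op in P.items():
            post = sigma(op(s), tests)
            if outcomes.setdefault((pre, name), post) != post:
                return False                        # same (psi, p) led to two observed results
    return True

# A trivially reproducible example: the operation writes a constant.
tests = {"X>1": lambda s: s["X"] > 1}
P = {"set_X_2": lambda s: {**s, "X": 2}}
S = [{"X": 0, "Y": 1}, {"X": 0, "Y": 2}]
print(is_locally_reproducible(S, P, tests))         # True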
Some configuration management strategies achieve local reproducibility by strictly utilizing a
set of operations in a particular sequence[90, 91].
Proposition 2 Suppose that P = {b, p1 , . . . , pn }, and that
S = {b, p1 ◦ b, p2 ◦ p1 ◦ b, . . . , pn ◦ · · · ◦ p1 ◦ b}.    (8.2)
Suppose that φ(σ(b)) = {p1 } and let
φ(σ(pk ◦ · · · ◦ p1 ◦ b)) = {pk+1 }    (8.3)
for 1 ≤ k < n. Then the formal system (h, S, T , P, φ) exhibits observed local reproducibility.
Proof: Starting at baseline b, we form the configurations b, p1 ◦ b, p2 ◦ p1 ◦ b, . . ., pn ◦ · · · ◦ p1 ◦ b. As
b creates an actual state, and each operation is deterministic, the sequence of operations uniquely
determines an actual state. As observed tests are deterministic, the observed state corresponding
to this actual state is uniquely determined as well. 2
This proposition is part of the theoretical grounding of ISConf[59, 90, 91]. Unconstrained deterministic operations, when applied in a specific order, appear deterministic to any observer utilizing
deterministic tests as a mechanism for observing.
However, Proposition 2 is extremely limiting. The reachable states are attained by applying
prefixes of the sequence of configuration operations to the baseline configuration. This results in a
sequence of configurations s1 , . . . , sn+1 , where going forward from si to si+1 requires operation pi ,
while going backward requires starting over from the baseline state[59]. Since re-baselining a host is
currently a matter of erasing all of the host’s contents and starting over, the machine is unavailable
for use during this process. This can lead to hidden costs from lost productivity due to machine
downtime[74]. A generally applicable configuration management strategy should – within limits –
be able to change a host from any state to any other without having to rebuild the entire host from
scratch.
The polar opposite of the strategy of Proposition 2 is to consider all operations to be applicable
at all times, and instead constrain the nature of operations to provide observed local reproducibility.
In this strategy, we allow application of all possible sequences of operations in P .
Definition 6 Suppose we have a system containing a host h, a baseline state b, a set of operations
P , and a set of tests T . For all observed states ψ ∈ σ(S), let φ(ψ) = P , so that all operations
apply to all observed states. Then we say the system (h, b, T , P ) exhibits observed local reproducibility (or simply local reproducibility) whenever the system (h, P ∗ (b), T , P, φ) exhibits observed local
reproducibility according to Definition 5.
This simpler notion of local reproducibility makes it possible to unambiguously discuss the local
reproducibility of a particular operation p ∈ P .
Definition 7 With respect to the formal system (h, b, T , P ), an operation p ∈ P exhibits observed
local reproducibility (or simply local reproducibility) if for every s, s′ ∈ P ∗ (b) with σ(s) = σ(s′), σ(p(s)) = σ(p(s′)). In this case, we say that p is a locally reproducible operation.
Proposition 3 The system (h, b, T , P ) exhibits observed local reproducibility exactly when every
operation p ∈ P exhibits observed local reproducibility with respect to the system (h, b, T , P ).
Proof: Suppose the system (h, b, T , P ) exhibits observed local reproducibility. Then by Definition 6, the system (h, P ∗ (b), T , P, φ) exhibits observed local reproducibility, where φ(ψ) = P for all
observed states ψ ∈ σ(P ∗ (b)). Then for each operation p ∈ P , the second condition in Definition 5
is true, and for every s, s0 ∈ P ∗ (b) with σ(s) = σ(s0 ), σ(p(s)) = σ(p(s0 )). Thus the operation p
exhibits observed local reproducibility.
Conversely, suppose that for all p ∈ P , p exhibits observed local reproducibility. Then by the
same argument as above, the second condition of Definition 5 is true. Since S = P ∗ (b), every s ∈ S
can be expressed as pα(1) ◦ pα(2) ◦ · · · ◦ pα(k−1) ◦ pα(k) ◦ b, where k ≥ 0, α : [k] → [m], pα(i) ∈ P . Thus
p(s) = p ◦ pα(1) ◦ pα(2) ◦ · · · ◦ pα(k−1) ◦ pα(k) ◦ b ∈ P ∗ (b) = S and the first condition of Definition 5
is true. Thus the formal system (h, P ∗ (b), T , P, φ) exhibits observed local reproducibility according
to Definition 5, so that by Definition 6, the system (h, b, T , P ) does so as well. 2
In other words, a set of operations exhibits observed local reproducibility with respect to a
baseline b if each operation has a reproducible observed effect on each reachable configuration s ∈
P ∗ (b). This is the definition of reproducibility that best models the operation of Cfengine[15, 16, 18]
and related convergent agents.
Note that local reproducibility of P trivially implies local reproducibility of P ∗ . Although P ∗ is
infinite, P ∗ (b) is a subset of a finite (though large) set of configurations, as the number of possible
configurations is finite.
8.1.1 Properties of Locally Reproducible Operations
Several relatively straightforward propositions demonstrate the properties of locally reproducible
operations in more detail. In the following propositions, to ease notation, we will presume the
existence of a host h, a baseline b, a set of possible operations P , and a set of tests T . We will
presume that S = P ∗ (b) is the set of reachable configurations. All claims of local reproducibility of
an operation refer to Definition 7 and are made in the context of the formal system (h, b, T , P ).
Proposition 4 The set of operations P is locally reproducible if and only if for each operation
p ∈ P and each actual state s ∈ S = P ∗ (b), σ(p(s)) is a function τp of σ(s). Then we can express
the observed state of a configuration after p as σ(p(s)) = τp (σ(s)).
Proof: This is a direct and obvious consequence of the definition of observed local reproducibility.
A set of operations exhibits observed local reproducibility if and only if the resulting observed state
after each operation is a function of the observed state before the operation; τp makes this functional
relationship explicit. 2
Because locally reproducible operations p correspond with state functions τp , they also exhibit
the typical properties of functions, notably, that a composition of functions is also a function:
Proposition 5 A composition of operations that each exhibits observed local reproducibility also
exhibits observed local reproducibility.
Proof: Let T be a set of tests and S = P ∗ (b) represent a set of configurations. Let s ∈ S.
Consider locally reproducible operations p and q on S. Since p is locally reproducible, for any
particular observed state σ(s) of s, σ(p(s)) = τp (σ(s)) is a constant. Likewise for q, for any
observed state σ(p(s)), σ(q(p(s))) = τq (τp (σ(s))) is a constant. Thus for any observed state σ(s),
σ(q(p(s))) = σ(q ◦ p(s)) is a constant and q ◦ p exhibits observed local reproducibility. 2
As composing operations on a configuration is the same as applying them in order, this means
that an arbitrary sequence of locally reproducible operations is locally reproducible as well. However,
the above does not yet tell us how to implement local reproducibility for the operations p that we
might compose.
In particular, some counter-intuitive results arise straightforwardly from the model.
Proposition 6 Configuration operations containing linear code (with no branches) do not necessarily exhibit observed local reproducibility.
Proof: As a counterexample, we construct an operation p1 whose outcome is not a function of
prior observed state. Let X and Y be two configurable parameters of our host. Let operation p1
be “X := Y ”, let operation p2 be “Y := 1”, and let operation p3 be “Y := 2”. Let T consist of
one test t1 : “X > 1”. Then p1 is not locally reproducible; it has two outcomes depending upon
unobserved pre-existing conditions. There are two reachable latent states, Y == 1 and Y == 2, that are not measured by the tests in T but that determine the observed outcome. These latent states are
constructed by applying operations p2 or p3 , respectively. 2
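The counterexample can be rendered directly in Python (variable names as in the proof); the two actual states left by p2 and p3 are observationally identical, yet p1 distinguishes them.

# p1 is "X := Y", p2 is "Y := 1", p3 is "Y := 2"; the only test is X > 1.
baseline = {"X": 0, "Y": 0}
p1 = lambda s: {**s, "X": s["Y"]}
p2 = lambda s: {**s, "Y": 1}
p3 = lambda s: {**s, "Y": 2}
observe = lambda s: {"X>1": s["X"] > 1}

s2, s3 = p2(baseline), p3(baseline)          # two reachable actual states
print(observe(s2) == observe(s3))            # True: they look the same to T
print(observe(p1(s2)), observe(p1(s3)))      # {'X>1': False} {'X>1': True}: p1 is not locally reproducible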
Reproducibility or non-reproducibility arises from properties of both the domain and range of an operation. Note that if Y were a constant, p1 would exhibit observed local reproducibility; its
non-reproducibility arises from the fact that Y ’s value is unpredictable.
This situation occurs often in practice, such as when changing file protection modes. Suppose
there is a file “foo” that we wish to make executable and configuration operations p1 , p2 , and p3 ,
where p1 is “chmod ugo+X foo”, p2 is “chmod 744 foo”, and p3 is “chmod 644 foo”. Suppose
T consists of one test “test -x foo” where the user running the test is not the owner or in the
file’s group; this tests whether the file is executable to world. Then the observed states of applying
p2 and p3 are indistinguishable, because neither p2 nor p3 makes the file executable to the user
performing the test. But performing p1 after p2 makes the file world-executable (protection 755,
because "X", the conditional execution flag, makes it fully executable if any execute bit is set), while
performing p1 after p3 leaves it completely unexecutable (protection 644). Similar situations occur
during stream editing of files.
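The mode arithmetic in this example can be checked with a small simulation (it models the relevant chmod semantics rather than invoking the real command):

# "ugo+X" adds execute permission for user, group, and other on a plain file
# only when some execute bit is already set; "test -x" run by an unrelated
# user checks the world-execute bit.

def chmod_ugo_big_x(mode):
    return mode | 0o111 if mode & 0o111 else mode

def world_executable(mode):
    return bool(mode & 0o001)

after_p2, after_p3 = 0o744, 0o644
print(world_executable(after_p2), world_executable(after_p3))          # False False: observationally identical
print(oct(chmod_ugo_big_x(after_p2)), oct(chmod_ugo_big_x(after_p3)))  # 0o755 vs 0o644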
Likewise, conditional statements based upon unobserved data pose serious problems:
Proposition 7 A conditional statement if (F ) then X := G need not produce a locally reproducible
outcome if F is not observed, even if G is observed.
Proof: As a counterexample, consider two boolean variables X and Y , where X is observed and
Y is not. Consider the code:
X := FALSE; if (Y ) then X := TRUE;    (8.4)
This is equivalent to X := Y , which makes X unobserved. 2
8.1.2 Constructing Locally Reproducible Operations
It is easy to construct a locally reproducible configuration operation. Each locally reproducible
configuration operation p corresponds to a function from initial states ψ to final states ψ 0 , in the
context of a set of reachable states S. This operation must thus depend upon the value of ψ and
avoid conditioning its effects on the values of other variant properties of the host or network. It can,
however, depend upon host properties γ that do not vary as a result of any configuration operation
p.
Proposition 8 Let p be a configuration operation. Let ψ be an observed state of a configuration
s measured before applying p. Let γ represent the attributes of a host that remain constant during
configuration. If p consists solely of setting a configuration parameter X to a value that is a function
only of ψ, γ, and constants, then p is locally reproducible.
Proof: Let ψ be the state of the configuration before the operation p. By hypothesis, p has the
form X := F (ψ, γ), where F is a function only of observed state ψ, constants, and invariants of a
particular host. We must show that the resulting observed state ψ 0 after applying p is a function of
the previous observed state. Because F is a function, there is one and only one outcome for F (ψ, γ)
for each state ψ, so that the resulting value of X in the actual configuration changes as a function
of observed state whether it is observed or not. As the resulting observed state is a function of the
resulting configuration p(s) by Axiom 1, it must change predictably and repeatably as well. 2
The above result is easily generalized.
Corollary 1 Suppose an operation p consists only of a sequence of assignments X := F (ψ, γ),
where X is a configuration parameter, ψ and γ are held constant throughout, and F is a function
of ψ, γ, and constants. Then p is locally reproducible.
Proof: A sequence of assignments is the same as a composition of the operations that perform
the assignments. By Proposition 8, each one of these is locally reproducible. By Proposition 5, a composition of any two is locally reproducible. By induction on the number of assignments, the result easily follows. 2
Conditional statements pose no difficulties (although they contain more than one possible execution path) because repeatability is guaranteed by the constancy of ψ and γ.
Proposition 9 Suppose a configuration operation p has the form
if (F (ψ, γ)) then X := G(ψ, γ)    (8.5)
where X is a configuration parameter and F and G are functions whose values solely depend upon
the current observed state ψ, the host invariants γ, and constants. Then p is locally reproducible.
Proof: We must show that the observed outcome is a function of ψ and γ. There are only two
possible outcomes. If F (ψ, γ) is false, nothing happens, so the result of all such operations is locally
reproducible, having the same observed state as the original. If F (ψ, γ) is true, the assignment
X := G(ψ, γ) occurs, and is locally reproducible because of Proposition 8. Since taking the branch
is itself a function of observed state, the whole branch is locally reproducible as well. 2
Corollary 2 A sequence of conditional assignments of the form
if (F (ψ, γ)) then X := G(ψ, γ)    (8.6)
where F and G are functions only of ψ, γ, and constants, is locally reproducible.
Proof: Repeat the argument of Corollary 1 with these conditional statements. The proof is trivial
given the constancy of ψ, γ, F (ψ, γ), and G(ψ, γ).
With the above results in mind, we are ready to relate reproducibility to the structure of configuration operations as programs.
Proposition 10 Let p be a configuration operation and s a configuration. Let ψ be the observed
state of s and let γ represent the set of host invariants that do not change during configuration.
Suppose that when p is applied to s, it:
1. Sets parameters to values that are functions of ψ, γ, and constants.
2. Takes program branches (conditionals) depending only upon the values of functions of ψ, γ
and constants.
where ψ and γ are held constant throughout p. Then if the operation p terminates successfully on
every configuration s in its domain, p exhibits observed local reproducibility.
Proof: Assume that p is an operation conforming to the above hypotheses. Let G = (V, E) be the
program graph of the operation p as a procedure. Construct the graph to have nodes v ∈ V for
each parameter assignment statement and conditional. Since p obeys the rules above, this graph
takes branches based solely upon functions of a static observed state ψ that does not change during
the execution of p. This means that any loop taken during execution of p will never terminate,
because its branch condition cannot change during the execution of p. Since p terminates, loops
are not present and the program takes a predetermined finite path of vertices v1 , . . . , vk through
the program graph for each particular choice of ψ. Along that path, it executes a predetermined
sequence of assignment statements, so that Corollary 1 applies and the result is locally reproducible.
2
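The discipline of Proposition 10 is easy to follow in script form. The sketch below (all parameter names and invariants are hypothetical) measures the observed state once, then assigns and branches only on that snapshot and on host invariants γ:

def measure(state, tests):
    """Take a single static snapshot psi of the observed state."""
    return {name: t(state) for name, t in tests.items()}

def locally_reproducible_op(state, tests, gamma):
    psi = measure(state, tests)                 # measured once, held constant afterwards
    new = dict(state)
    if psi["web_enabled"]:                      # branch only on psi and gamma
        new["port"] = 443 if gamma["has_tls"] else 80
    new["arch_tag"] = gamma["arch"]             # assignment from host invariants and constants
    return new

tests = {"web_enabled": lambda s: s.get("web", False)}
gamma = {"has_tls": True, "arch": "x86_64"}     # properties that no operation changes
print(locally_reproducible_op({"web": True, "port": 8080}, tests, gamma))

By contrast, an operation that consulted an unobserved parameter, or re-measured ψ between assignments, would fall outside the hypotheses of the proposition.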
It is rather important, however, that the operation utilize only static values of ψ during execution of p, and not dynamic re-measurements of tests during p. One can get away with limited
dynamic measurements of ψ, provided one is careful not to use two differing measurements of ψ
simultaneously:
Proposition 11 Suppose that p as described in Proposition 10 is also allowed to re-measure the
whole observed state ψ at any time during its execution, as well as setting parameters and branching
based upon functions of the observed state ψ and host invariants γ. Only one measurement of ψ
is available at a time and any setting must be a function of the most recent measurement. If p
terminates, the result is locally reproducible, but not all processes are guaranteed to terminate.
Proof: Let p be an operation that conforms to the hypotheses of the proposition. Repeat the
construction of the program graph G = (V, E) from the proof of Proposition 10 with one change:
include re-measurement operations for observed state in the program graph. It is now possible for
loops to occur during execution, but if the operation terminates, we claim that it still produces a
locally reproducible state.
First, any terminating computation p will have executed a finite sequence of operations v1 , . . . , vn
within its program graph, where each vi is either a parameter assignment statement or remeasurement of entire state. Without loss of generality, we can express this sequence as an alternating sequence m1 , a1 , m2 , a2 , . . . , mk , ak where each mi measures state and each ai represents
a series of assignment statements relative to that state.
Now consider what happens during this sequence to two configurations s and s0 with the same
observed state. We must show that for both s and s0 , the branches taken are identical, leading to
identical paths with identical effects. Since s and s0 have the same observed state, the results of
m1 are the same in each case; hence the operations a1 done between m1 and m2 are identical in
purport. These statements are locally reproducible, so the resulting observed state m2 is the same
in both cases. Proceeding by induction for m2 . . . mk , it is easy to demonstrate that the exact same
assignment statements and branches are taken overall, so that the result of p is independent of the
actual configuration s or s0 . Thus p is locally reproducible. 2
8.2 Population Reproducibility
We next turn our attention to assuring reproducibility of operations over a population of hosts. Such
reproducibility is extremely important as a cost-saving measure. If we must validate the behavior
of each host of a population separately, it becomes very expensive to build large networks. Ideally,
we should be able to test one particular host in order to understand the behavior of a population.
Definition 8 If, for any subset of a population of hosts whose configurations are in the same
observed state, an operation p results in identical observed final states over the subset, then p exhibits
observed population reproducibility (or simply population reproducibility).
Local reproducibility does not imply population reproducibility.
Proposition 12 An operation that is locally reproducible on every host to which it applies can fail
to exhibit population reproducibility.
Proof: As a counterexample, consider an operation applied to hosts with differing operating systems. Suppose that p is the operation of copying a constant xinetd.conf into /etc/. The population reproducibility of this operation has no dependence upon its implementation; it is a property
of the operating system. If the operating system does not support the xinetd abstraction, the
operation does nothing to behavior. Consider also an operation that exposes a bug in one OS that
is not present in another. There are many other such latent variables that cause the same operation
on different hosts to yield different outcomes. 2
While local reproducibility is relatively easy to achieve, population reproducibility remains a
pressing problem, and there is no currently available tool that addresses it sufficiently. There is one
overarching observation that motivates our approach to population reproducibility:
Proposition 13 If all configuration processes are verified as locally reproducible, and one observes
population differences in behavior in applying these processes, then the variation must be due to
latent preconditions in the population rather than artifacts in the processes.
Proof: These processes have effects that are observably locally reproducible. If their effects differ
for two hosts in a population, then since they are locally reproducible in isolation, they will always
differ in the same way regardless of when the operations are applied. Hence some other factor is
causing the difference, and the only other variants are host identity and preconditions. 2
Thus a locally reproducible process that differs in effect over hosts in a population can be used
to test for heterogeneity of latent variables within the population.
One way of assuring population reproducibility is to utilize locally reproducible actions that
fail to be population reproducible to expose each latent variable in the population. With latent
variables exposed, we can form equivalence classes of hosts with the same latent structure, and
condition further configuration changes by that structure.
We sketch a simple algorithm for constructing "synthetic" population reproducibility, intended to motivate further thinking about the problem rather than as a practical procedure; a code sketch follows the enumerated steps below.
Given a set of hosts H, a set of locally reproducible operations P on hosts in H, a set of tests T, and an initially empty set of additional tests T′:
1. Run an operation sequence p̃ on the population of hosts to which it should apply.
2. Perform all tests in T ∪ T′ on each host h to compute an observed state ψh.
(a) If results are the same for all hosts, then p̃ exhibits population reproducibility over the hosts to which it applies.
(b) If results are different, then form equivalence classes of hosts H̃, where for h, h′ ∈ h̃ ∈ H̃, ψh = ψh′. For each equivalence class h̃ and a configuration c, add the test "h(c) ∈ h̃" to T′.
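A Python rendering of this construction is shown below (hosts are modeled as parameter dictionaries and tests as predicates; all names are illustrative). It performs one pass of steps 1 and 2, and on divergence adds class-membership tests to T′:

def observed_state(host, tests):
    return frozenset(name for name, t in tests.items() if t(host))

def refine(hosts, sequence, tests, extra_tests):
    """One pass of the construction: returns True if the sequence was population
    reproducible over `hosts`; otherwise adds class-membership tests to extra_tests."""
    for op in sequence:                              # step 1: run the sequence on every host
        hosts = [op(h) for h in hosts]
    all_tests = {**tests, **extra_tests}
    classes = {}                                     # step 2: group hosts by observed state
    for h in hosts:
        classes.setdefault(observed_state(h, all_tests), []).append(h)
    if len(classes) == 1:
        return True                                  # 2(a): identical observed results everywhere
    for i, members in enumerate(classes.values()):   # 2(b): expose the latent split as new tests
        snapshot = {frozenset(m.items()) for m in members}
        extra_tests[f"class_{i}"] = lambda h, snap=snapshot: frozenset(h.items()) in snap
    return False

# Two hosts differ in a latent parameter that the single test cannot see.
tests = {"X>1": lambda h: h["X"] > 1}
hosts = [{"X": 0, "Y": 1}, {"X": 0, "Y": 2}]
extra = {}
print(refine(hosts, [lambda h: {**h, "X": h["Y"]}], tests, extra))   # False: classes recorded in extra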
The above construction is impractical. There are |P ∗ | iterations at step 1, and each iteration requires O(2^|T|) operations to carry it out. The point is that it is possible to deal with population
non-reproducibility incrementally, one deviation at a time, adding tests to T 0 one by one. In this
way, our state of knowledge grows with the operations we apply and observe.
Chapter 9
Limits on Configuration Operations
There has recently been much attention to the limits imposed upon configuration operations and
how they affect usability of a set of operations [32]. Some authors maintain that operations should
be constructed to be repeatable without consequence once a desirable state has been achieved
[15, 16, 18, 20]. Other authors maintain that operations must, to provide a consistent outcome, be
based upon imperative order [90, 91] or upon generating the whole configuration as a monolithic
entity [6, 8]. In this chapter, we precisely define limits upon operations with the intent of discussing
how those limits affect composability of operations.
9.1 Limits on Configuration Operations
Definition 9 A set of operations P is observably idempotent if for any operation p ∈ P , repeating
the operation twice in sequence has the same effect as doing it once, i.e., σ(p ◦ p ◦ q̃) = σ(p ◦ q̃) where
p ∈ P and q̃ ∈ P ∗
The use of q̃ in this definition asserts that the idempotence occurs when applied to any actual
state of the system from B. We could say, equivalently, that σ(p ◦ p) = σ(p) when starting at any
state in S.
In continuous spaces, an operation is convergent if repeating the operation many times can cause the system to move to some target state, i.e., lim_{n→∞} p^n = target state.
By contrast, in system administration idempotent operations take on the discrete character of
configuration space.
Definition 10 A set of operations P is observably convergent if for any operation p ∈ P and any
q̃ in P ∗ , there is an integer n > 0 such that p is observably idempotent when applied to p^n ◦ q̃, i.e., σ(p ◦ p^n ◦ q̃) = σ(p ◦ p ◦ p^n ◦ q̃).
For example, consider the task of creating a user in a directory service.
Classical procedural operations would be:
1. create user record in LDAP.
2. wait for LDAP directory service to sync up and serve the directory record.
3. create user’s home directory and associate data.
Time must pass between steps 1 and 3.
An equivalent convergent operation to accomplish this task might be an operation c with the
following pseudo-code:
if (user does not exist)
put user in LDAP
else if (user’s home directory does not exist)
create an appropriate home directory
While c by itself does not accomplish the task, repeating c several times (while time passes) guarantees that the task will be completed. After this time, repeating c does not change the system
behavior. The value of n in the definition of convergence thus depends upon how fast the LDAP update occurs. The key issue is that rather than thinking of system configuration as one large change,
a convergent operation makes one small change at a time. Through repetition of the operation, its
preconditions are fulfilled by previous runs or other operations, or simply by the passage of time.
Thus idempotence is a special case of convergence where n = 1.
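The LDAP example can be simulated with a toy directory whose synchronization happens one "tick" at a time (all class and method names are invented for illustration):

class FakeDirectory:
    def __init__(self):
        self.pending, self.served, self.homes = [], set(), set()
    def tick(self):                      # time passing: one pending record syncs
        if self.pending:
            self.served.add(self.pending.pop(0))

def c(d, user):
    """One small convergent step; repeat until nothing changes."""
    if user not in d.served and user not in d.pending:
        d.pending.append(user)           # put user in LDAP
    elif user in d.served and user not in d.homes:
        d.homes.add(user)                # create the home directory

d = FakeDirectory()
for _ in range(3):                       # repeating c while time passes
    c(d, "alice")
    d.tick()
print("alice" in d.served and "alice" in d.homes)   # True: the fixed point has been reached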
In dynamic situations, when we do not know the nature of the best solution, we can allow
convergent operations to “discover” the quickest solution to a problem. One kind of convergent
operator for which this is true employs "simulated annealing"[62]. This approach presumes that the "best" solution lives in a solution space that is not convex[12]; a particular locally best configuration is not necessarily globally best. This corresponds to a potential "surface" in which the
best solution is one of many peaks of the objective function. The simulated annealing approach is
to choose at each step whether to move to or away from a peak, in order to allow one chance to
switch to a better peak over several trials. For example, in web server response optimization, the
behavior of the operator depends upon an external parameter ρ, the “temperature” of the system.
Simulated annealing occurs when at each application of the operator, one makes a decision how to
proceed based upon the probability ρ: with probability ρ move files to a server that seems slower
(as observed before the move); with probability 1 − ρ, move files to a server that seems faster. The
annealing process is to let ρ → 0 over repeated operation; this is also called a “cooling schedule”.
The result of simulated annealing is often a near-optimal response time[62].
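A toy version of such an annealing operator, for two hypothetical web servers and a made-up latency measurement, might look like this; the point is only the structure of the probabilistic choice and the cooling schedule:

import random

def observed_latency(load):                  # stand-in for a response-time measurement
    return load + random.uniform(0.0, 0.1)

def anneal_step(loads, rho):
    latency = {server: observed_latency(load) for server, load in loads.items()}
    slower = max(latency, key=latency.get)
    faster = min(latency, key=latency.get)
    target = slower if random.random() < rho else faster
    loads[target] += 1                       # "move a file" onto the chosen server
    return loads

loads, rho = {"web1": 10, "web2": 3}, 1.0
for _ in range(50):
    anneal_step(loads, rho)
    rho *= 0.9                               # cooling schedule: rho -> 0
print(loads)                                 # loads tend toward balance as rho cools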
In the initial configuration phase, behavior of the host is completely determined by configuration
operations; while in the maintenance phase, especially during resource optimization, the influence of
users becomes nontrivial. The effectiveness of convergence then increases as in the above example.
Convergent operations are not necessarily consistent; they might even oppose one another to
achieve equilibrium in the overall observed state of the system. The inter-relationship of convergent
operations can be grouped as:
1. Orthogonal - working on different parts of the system
2. Collaborative - helping each other
3. Conflicting - opposing one another
Convergence requires careful design. Poorly designed convergent operations can interact badly: when we expect them to work independently, they may interfere with each other; when we want them to work collaboratively, they may undermine each other.
Definition 11 A set of operations P is observably sequence idempotent if for any sequence p̃ of
elements from the set, repeating the sequence of operations has the exact same effect as doing it
once, i.e., σ(p̃ ◦ p̃ ◦ r̃) = σ(p̃ ◦ r̃) where p̃, r̃ ∈ P ∗
Definition 12 A set of operations P is observably stateless if repeating an operation (where the
repetition is not necessarily adjacent) will accomplish the same result, i.e., σ(p◦ q̃ ◦p◦ r̃) = σ(p◦ q̃ ◦ r̃),
where p ∈ P and q̃, r̃ ∈ P ∗ .
Unconditional commands that set parameter values, such as "set M = 0", are stateless. But carefully crafted stateless operations can also contain conditionals. For example:
baseline state: M = 0.
Figure 9.1: Stateless operations can have if statements
P = {p, q}.
p: if M == 0, then set M = 3;
if M == 1, then set M = 2;
if M == 2, then do nothing;
if M == 3, then do nothing;
q: set M = 1;
Please refer to Figure 9.1. Note that the statelessness of p and q is not only dependent upon
the contents of p and q, but also upon the baseline state.
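The claim can be checked mechanically. The brute-force sketch below enumerates short sequences over the p and q defined above (starting from the baseline M = 0) and tests the statelessness condition σ(p ◦ q̃ ◦ p ◦ r̃) = σ(p ◦ q̃ ◦ r̃); the same style of check applies to the example of Figure 9.2.

from itertools import product

def p(m): return {0: 3, 1: 2, 2: 2, 3: 3}[m]     # the conditional operation above
def q(m): return 1                               # q: set M = 1

def run(seq, m=0):                               # apply a sequence to the baseline M = 0
    for op in seq:
        m = op(m)
    return m

# All sequences of length 0..2 stand in for the arbitrary r-tilde and q-tilde.
seqs = [list(s) for n in range(3) for s in product((p, q), repeat=n)]
stateless = all(
    run(r + [op] + mid + [op]) == run(r + mid + [op])
    for op in (p, q) for r in seqs for mid in seqs
)
print(stateless)    # True for these operations and this baseline (a sanity check, not a proof)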
Definition 13 A set of operations P is observably commutative if for any two operations p, q ∈ P,
σ(p ◦ q ◦ r̃) = σ(q ◦ p ◦ r̃), where p, q ∈ P and r̃ ∈ P ∗ .
Note that commutative operations can make conflicting changes to the system. For example, p :
M = M + 1 and q : M = M − 1.
Definition 14 A set of operations P is observably consistent or homogeneous if operations never
undo the changes made by others, i.e., for p ∈ P and q̃ ∈ P ∗ , σ(p ◦ q̃) ⊇ σ(q̃).
Definition 15 An operation is atomic if it has two result states. Either
1. it has all required preconditions and it asserts known postconditions or
2. it lacks some precondition and does nothing at all.
Definition 16 An operation is aware if the operation knows whether it succeeded or not in enforcing
its requirements. In practice, this means that the operation can return a boolean value that is true
if it succeeded and false if not.
9.2 Relationship Between Limits
Each of the concepts of idempotence, convergence, sequence idempotence, and statelessness refers to a condition true of an operation or set of operations. We can understand these concepts by outlining how each one limits an operation or operations. The "strength" of a condition refers to the amount of limitation imposed; a stronger condition means that fewer operations meet the condition.
Proposition 14 Idempotence is a stronger condition upon operations than convergence.
Proof: Idempotence is a special case of convergence by restricting n = 1. 2
Proposition 15 Sequence idempotence is a stronger condition upon operations than idempotence.
Proof: A single operation can be viewed as a sequence of length 1. A sequence idempotent operation set (which requires that any sequence of operations be idempotent) is therefore always an idempotent set (which only requires individual operations to be idempotent). There are sets that are idempotent
but not sequence idempotent. For example: suppose that there are two configuration parameters
M and N , where the baseline state is that M = N = 0. Let p be the operation if (M == 1) then
N = 2, and let q be M = 1. Clearly repeating either p or q has no effect, so they are idempotent
in isolation. Also q ◦ p just sets M to one, but (q ◦ p) ◦ (q ◦ p) sets N to 2 as well. 2
Proposition 16 Statelessness is a stronger condition upon operations than sequence idempotence.
Proof: Given a stateless set, we prove it is also sequence idempotent. Assume without loss of
generality a sequence p̃ = p1 ◦ p2 ◦ · · · ◦ pk .
σ(p̃ ◦ p̃)
= σ((p1 ◦ p2 ◦ · · · ◦ pk−1 ◦ pk ) ◦ (p1 ◦ p2 ◦ · · · ◦ pk−1 ◦ pk ))
= σ((p1 ◦ p2 ◦ · · · ◦ pk−1 ) ◦ (pk ◦ p1 ◦ p2 ◦ · · · ◦ pk−1 ◦ pk ))
= σ((p1 ◦ p2 ◦ · · · ◦ pk−1 ) ◦ (pk ◦ p1 ◦ p2 ◦ · · · ◦ pk−1 ))
= ···
= σ(p1 ◦ p2 ◦ · · · ◦ pk )
= σ(p̃).
Figure 9.2: An example of sets that are sequence idempotent but not stateless
There are sets that are sequence idempotent but not stateless. For example:
Suppose M is the configuration parameter. p and q are operations.
Baseline state: M = 0
p: if M == 0, then set M = 1;
if M == 1, then do nothing;
if M == 2, then set M = 4;
if M == 3, then set M = 1;
if M == 4, then do nothing;
q: if M == 0, then set M = 2;
if M == 1, then set M = 3;
if M == 2, then do nothing;
if M == 3, then do nothing;
if M == 4, then set M = 2;
Referring to Figure 9.2, the reader can verify that this set of operations is sequence idempotent. Sequence p ◦ q ◦ p sets M = 1 while sequence p ◦ q sets M = 4. So σ(p ◦ q ◦ p) ≠ σ(p ◦ q). Therefore
the set of operations is not stateless. 2
Proposition 17 If a set of operations is both idempotent and commutative, it has to be stateless
and consistent.
Proof: Commutativity & Idempotence ⇒ Statelessness:
σ(p ◦ q̃ ◦ p) = σ(p ◦ p ◦ q̃) (commutativity) = σ(p ◦ q̃) (idempotence).
Commutativity & Idempotence ⇒ Consistency:
Suppose there are two commutative and idempotent operations p and q that are not consistent, i.e., p makes some test t ∈ T true and q makes this test false. Consider the two sequences p ◦ q ◦ p and q ◦ p. Sequence p ◦ q ◦ p will make the test first true, then false, and finally true. Sequence q ◦ p will make the test true, and then false. So these two sequences do not have equivalent observed effects. However, σ(p ◦ q ◦ p) = σ(q ◦ p ◦ p) (commutativity) = σ(q ◦ p) (idempotence). By contradiction, the above case cannot exist. So commutative and idempotent operations must be consistent. 2
The chief obstacle to tractable composability seems to be lack of knowledge of, and control over, the combinatorial behavior of configuration operations, except through explicit testing. While we usually know exactly what an operation does when applied to a baseline system, we seldom know precisely what will happen when two operations are composed, i.e., when the second operation is applied to a non-baseline system. Worse, poorly engineered scripts of the kind employed for configuration management are especially prone to errors when applied to systems in unpredictable states. The limits on operations discussed above all add some form of control over combinatorial behavior. Convergence, idempotence, sequence idempotence, and statelessness add a structure of equivalence relations between sequences of configuration operations. Commutativity limits potential results of a set of scripts so that different permutations of the same set are equivalent. Consistency rules out conflicting behaviors. The limits of atomicity and awareness make operations tighter, more robust, and more secure, and they are used together with the other limits to enhance their effect. However, in the next chapter we show that these limits alone are not enough to make composability tractable.
Chapter 10
Complexity of Configuration Composition
10.1 Composability
Composability of configuration operations refers to the ability to find a finite sequence of configuration operations that reliably transforms a given initial state of a system to an observed state
that satisfies some given user requirements. There are two types of composability: syntactic and
semantic[75].
Syntactic composability refers to implementation details that enable several operations to be
combined, e.g., interface specifications. A set of operations is syntactically composable if each
operation employs the proper interfaces to call other operations and the underlying operating environment. Syntactic composability has been studied in detail in the context of software engineering,
and proven to be achievable by establishing a common framework, e.g., Component Object Model
Plus (COM+)[70], Common Object Request Broker Architecture (CORBA)[49], and Enterprise
JavaBeans (EJB)[71].
By contrast, semantic composability refers to the ability to achieve a specific function or goal by
composing operations from a set. Semantic composability emphasizes the meaning of the composition. Our studies address the theoretical aspects of semantic composability as it applies to system
administration.
10.2 Complexity of Operation Composability
Composability is a desirable feature in configuration management. A composable approach can
greatly reduce the costs of configuration and of training new administrative staff, and encourage collaboration among system administrators. Composability is far from being achieved by current tools, which avoid the issue either by limiting the ways in which operations are composed or by using operations that by their nature can be composed.
In this section we show mathematically that composability is an NP-hard problem. The composability problem with no limits imposed upon configuration operations is called the GENERAL
COMPOSABILITY (GC) problem. We start with a simplified version of GC called COMPONENT
SELECTION (CS). We first prove that CS is NP-hard, hence GC is NP-hard. Then we prove
that the composability problems with various limits (including COMPOSABILITY OF ATOMIC
OPERATIONS, COMPOSABILITY OF PARTIALLY ORDERED SETS and COMPOSABILITY
OF CONVERGENT OPERATIONS) are all NP-hard problems.
10.2.1 Component Selection
Definition 17 COMPONENT SELECTION(CS) is defined as follows:
INSTANCE: Set P = {p1 , p2 , · · · , pm } of m components, set T = {t1 , t2 , · · · , tn } of n tests,
subset R ⊆ T of desired outcomes, test process function σ : power(P ) → power(T ) (that describes
the outcome σ(Q) for each Q ⊆ P ) which takes time O(|T |) and a positive integer K ≤ |P |.
QUESTION: Does P contain a subset Q ⊆ P with |Q| ≤ K, such that R ⊆ σ(Q)?
The above definition allows us to search for an optimal solution in a simple way. The integer K in the instance is the maximum size of composition allowed. If the answer to the question for some integer K is "yes", i.e., we can find a set of components that meets the requirements and has size at most K, we can lower K to K − 1 and ask the question again, repeating until we get the answer "no". Thus we can find the smallest set of components that meets the requirements. In CS, if we are not required to search for an optimal solution, any selection that satisfies the requirements suffices, and the problem becomes trivial: just fix K equal to |P |. However, in composability, which is defined later, a polynomial algorithm that searches even for a non-optimal solution is still difficult to construct. The optimization and near-optimization of composability are discussed in Sections 10.2.2 and 13.2.
Theorem 1 CS is NP-hard.
Proof:
We must show that CS is at least as difficult as a known NP-complete problem. Restrict CS to RESTRICTED COMPONENT SELECTION (RCS) by allowing only instances having R = T and σ(Q) = ⋃_{pi ∈ Q} σ({pi }). We show in the following proof that RCS is NP-complete; thus CS is NP-hard. 2
Note that making R = T is trivial since tests are artifacts of human choices: it simply restricts T to the tests relevant to the user's requirements. However, the restriction σ(Q) = ⋃_{pi ∈ Q} σ({pi }) is significant. It imposes structure on the test process function σ. It eliminates the case in which a user requirement satisfied by a combination of components is not satisfied by any of the components alone. It requires that the operations have the property of compositional predictability, i.e., that the observed behavior of a composition of operations can be predicted without testing by considering the observed behaviors of the individual operations in the composition. This is a very strong requirement; it is a stronger condition than consistency.
Definition 18 The definition of RESTRICTED COMPONENT SELECTION (RCS) is as follows:
INSTANCE: Set P = {p1 , p2 , · · · , pm } of m components, set R = {t1 , t2 , · · · , tn } of n desired outcomes, test process function σ : power(P ) → power(R) whose computation takes time O(|R|) and satisfies σ(Q) = ⋃_{pi ∈ Q} σ({pi }), and a positive integer K ≤ |P |.
QUESTION: Does P contain a subset Q ⊆ P with |Q| ≤ K, such that R = σ(Q)?
Theorem 2 RCS is NP-complete.
Proof: A proof is contained in [76], but contains problems that we will correct here. The proof of
NP-completeness will use the MINIMUM COVER (MC) problem, known to be NP-complete. MC
is defined as follows:
Definition 19 MINIMUM COVER:
INSTANCE: Collection C of m subsets of a finite set S, where |S| = n, and a positive integer K ≤ |C|.
QUESTION: Does C contain a cover of S of size K or less, i.e., a subset C′ ⊆ C with |C′| ≤ K such that ⋃_{ci ∈ C′} ci = S?
First it must be shown that RCS is in NP. Given a subset Q of P , determining if R = σ(Q) can be done by searching in σ(Q) for each element of R. Since σ(Q) has at most n elements and R has n elements, a simple algorithm requires O(n^2) time, which is polynomial in the length of the instance; thus RCS is in NP.
The transformation function f from any instance ω of MC to an instance f (ω) of RCS is defined
as follows:
1. Let every ci ∈ C correspond to an operator pi ∈ P .
2. Let S be R.
3. For every ci = {si,1 , si,2 , · · ·} ∈ C, let σ(ci ) = σ({pi }) = {si,1 , si,2 , · · ·} = {ti,1 , ti,2 , · · ·}.
4. Let K in MC (K_MC) equal K in RCS (K_RCS).
Step 1 requires time O(m); step 2 requires O(n); step 3 requires time O(mn); step 4 requires
O(1). So f is a polynomial function of input.
Now we show:
ω ∈ M C ⇐⇒ f (ω) ∈ RCS
⇐: Assume f(ω) ∈ RCS. Then there exists a subset Q ⊆ P with |Q| ≤ K_RCS such that R = σ(Q). Let C′ = {ci | ci = σ({pi }), pi ∈ Q}. Then ⋃_{ci ∈ C′} ci = ⋃_{pi ∈ Q} σ({pi }) = ⋃_{pi ∈ Q} {ti,1 , ti,2 , · · ·} = ⋃_{pi ∈ Q} {si,1 , si,2 , · · ·} = σ(Q) by f. Because S = R, we have S = σ(Q) = ⋃_{ci ∈ C′} ci, and because K_MC = K_RCS and |C′| = |Q| ≤ K_MC, C′ is a cover of size at most K_MC. Therefore ω ∈ MC.
⇒: Assume ω ∈ MC. Then there exists a subset C′ ⊆ C with |C′| ≤ K_MC such that S = ⋃_{ci ∈ C′} ci. Let Q = {pi ∈ P | ci ∈ C′}. Then σ(Q) = ⋃_{pi ∈ Q} σ({pi }) = ⋃_{pi ∈ Q} {si,1 , si,2 , · · ·} = ⋃_{pi ∈ Q} {ti,1 , ti,2 , · · ·} = ⋃_{ci ∈ C′} ci. Because R = S by f, we have R = S = ⋃_{ci ∈ C′} ci = σ(Q), and because K_RCS = K_MC and |Q| = |C′| ≤ K_RCS, Q satisfies the RCS question. Therefore f(ω) ∈ RCS. 2
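The transformation f can be made concrete on a toy MINIMUM COVER instance (sets and K chosen arbitrarily); the brute-force check below only illustrates the correspondence between the two answers, since the whole point of the theorem is that no polynomial-time shortcut is known.

from itertools import combinations

S = {1, 2, 3, 4}                        # the finite set to cover
C = [{1, 2}, {2, 3}, {3, 4}, {1, 4}]    # the collection of subsets
K = 2

# Under f, each subset c_i becomes a component p_i with sigma({p_i}) = c_i,
# the requirements are R = S, and sigma(Q) is the union of individual outcomes.
def sigma(Q):
    return set().union(*(C[i] for i in Q)) if Q else set()

def mc_answer():
    return any(set().union(*combo) == S
               for k in range(1, K + 1) for combo in combinations(C, k))

def rcs_answer():
    return any(sigma(set(combo)) == S
               for k in range(1, K + 1) for combo in combinations(range(len(C)), k))

print(mc_answer(), rcs_answer())        # True True: {1,2} and {3,4} cover S, and the matching Q works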
10.2.2 General Composability
Unfortunately, our problem is somewhat more difficult than CS, because in our problem, order
of operations does matter. We are most interested in what test outcomes result from applying
operations from a set of operations P . Operations differ from components in the above theorem and
definition, because the order of application of operations does matter, and repeating an operation may produce different results than performing it once. We assume that to apply an operation
pi to the system takes only one step and we can test the system in linear time. We also assume
that the system begins in a predictable and reproducible baseline state B to which the sequence of
operations is applied.
Definition 20 GENERAL COMPOSABILITY(GC) is defined as follows:
INSTANCE: Set P = {p1 , p2 , · · · , pm } of m configuration operations, set T = {t1 , t2 , · · · , tn }
of n tests, set of user requirements R ⊆ T , test process function σ : P ∗ → power(T ), where for
p̃ ∈ P ∗ , σ(p̃) represents the tests in T that succeed after applying p̃ to a given repeatable baseline
state, suppose the computation time of σ is a linear function of |T |, a positive integer K ≤ |F|,
where F is a polynomial function of |P |.
QUESTION: Does P ∗ contain a sequence q̃ of length less than or equal to K, such that R ⊆ σ(q̃)?
For q̃ a sequence of operations, σ(q̃) is the result of testing the application of the sequence q̃ to a
given baseline state. This should at least satisfy the requirements given in R, but may satisfy more
requirements.
The reason that we constrain the integer K to be less than or equal to a polynomial function of the size of the set of operations is that in practice no one implements a sequence of operations of arbitrary length. In practice, there is a finite upper bound on the number of operations and people
seldom repeat an operation more than a few times.
As we pointed out in the previous section on the CS problem, the integer K makes composability an optimization problem. In many cases, configuration management does not need to be optimal. Any
sequence of operations that will accomplish the result will do. However, even when a non-optimal
solution is acceptable, a polynomial algorithm that can compute this solution is still difficult to
construct due to the lack of structure of test process function σ. In other words, even if we lower
our standard to a non-optimal solution, the computational cost is not greatly reduced. We will
discuss optimization and near-optimal solutions in Section 13.2.
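The obvious algorithm for the GC question simply enumerates sequences up to length K, which is exponential in K; the sketch below (hypothetical operations and tests) makes that blow-up explicit.

from itertools import product

def gc_decision(P, tests, R, baseline, K):
    """Brute force: does some sequence of length <= K over P satisfy R from the baseline?"""
    ops = list(P.values())
    for k in range(K + 1):
        for seq in product(ops, repeat=k):          # |P|**k candidate sequences
            state = dict(baseline)
            for op in seq:
                state = op(state)
            passed = {name for name, t in tests.items() if t(state)}
            if R <= passed:
                return True
    return False

P = {"set_a": lambda s: {**s, "a": 1}, "copy_a_to_b": lambda s: {**s, "b": s["a"]}}
tests = {"a_set": lambda s: s["a"] == 1, "b_set": lambda s: s["b"] == 1}
print(gc_decision(P, tests, {"a_set", "b_set"}, {"a": 0, "b": 0}, K=2))   # True: set_a then copy_a_to_b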
Proposition 18 If operations in P are commutative and idempotent, then CS is reducible to GC.
Proof: The difference between the two problems is that in GC, each option for operations is a
sequence, whereas in CS each option is a set. If all operations in a sequence are commutative and
idempotent, then there is a 1-1 correspondence between equivalent subsets of sequences and sets,
and the reduction simply utilizes this correspondence. 2.
Theorem 3 GC is NP-hard.
Proof:
We must show that GC is at least as difficult as another NP-hard problem. We restrict GC to CS by allowing only instances in which the operations are commutative and idempotent. 2
We know from the proof from Section 9.2 that commutative and idempotent operations must be
stateless and consistent, so the above proof shows that even with limits of idempotence, statelessness,
consistency and commutativity, composability is still an NP-hard problem.
We need to point out that once we have a subset of P that satisfies the user's requirements, which means we have somehow solved CS, order does not matter if we know the precedence of operations or the operations are consistent, stateless, and aware[28]. If we are sure of the appropriate sequences, the order of writing down the elements does not matter; we can re-sort them into an appropriate order later. Also, in Maelstrom[28], Couch claims that if the operations meet the requirements of
consistency, convergence (which for Couch is indistinguishable from statelessness) and awareness, a
specific sequence of operations with length O(n^2) will try out all the permutations of n operations.
The success of Maelstrom depends on the axiomatic existence of consistency, statelessness and
awareness. Lack of these properties for any operation causes Maelstrom to fail. Cfengine employs
Maelstrom in a very simple form; its operations are simple enough to comply with Maelstrom’s
conditions.
10.2.3 Atomic Operation Composability
A common illusion is that composability of atomic operations is tractable. Atomic operations have
known preconditions and postconditions. The preconditions of an atomic operation p ∈ P are a set
of tests O ⊆ T that must be true before applying p. The postconditions of an operation p are a set
of tests R ⊆ T that will be true after p has been applied, given that preconditions have been met;
else, p will do nothing.
Definition 21 ATOMIC OPERATION COMPOSABILITY(AOC) is defined as follows:
INSTANCE: Set P = {p1 , p2 , · · · , pm } of m atomic configuration operations, set T = {t1 , t2 , · · · , tn } of n tests, set of user requirements R ⊆ T , and a positive integer K ≤ |F|, where F is a polynomial function of |P |. Each p ∈ P is associated with a set O ⊆ T and a non-empty set R ⊆ T : if the tests in O are all true, then after p is applied the tests in R will all be true; else p will do nothing. There is also a test process function σ : P ∗ → power(T ), where for p̃ ∈ P ∗ , σ(p̃) represents the tests in T that succeed after applying p̃ to a given repeatable baseline state; suppose the computation time of σ is a linear function of |T |.
QUESTION: Does P ∗ contain a sequence q̃ of length less than or equal to K, such that R ⊆ σ(q̃)?
Note that the postcondition set R assures certain behaviors after the execution of an atomic operation. It must be non-empty; otherwise an atomic operation degenerates into a general operation without limits on its behavior.
Theorem 4 AOC is NP-hard.
Proof:
In the following we prove that AOC is at least as difficult as RCS by showing that the operations in RCS are atomic operations with an empty precondition set.
Recall that we restricted GC to CS via idempotence and commutativity, so that we have a set rather than a sequence. We further restricted CS to RCS by requiring compositional predictability, i.e., σ(Q) = ⋃_{pi ∈ Q} σ({pi }). Suppose the set of operations P is idempotent, commutative, and compositionally predictable. Then the observed state after the sequence p ◦ q̃ is σ(p ◦ q̃) = σ({p, q̃}) = σ(p) ∪ σ(q̃); thus the postcondition set σ(p) is a subset of σ(p ◦ q̃). In other words, the observed behavior of p is always assured after the operation is applied to any actual state. Therefore, operations with the limits of idempotence, commutativity, and compositional predictability are special atomic operations with an empty precondition set.
We have shown in RCS that the composability problem for operations with the limits of idempotence, commutativity, and compositional predictability is NP-complete. Thus AOC is NP-hard. 2
10.2.4 Composability of Partially Ordered Operations
What if P is a partially ordered set? Will precedence ease composability?
Definition 22 COMPOSABILITY OF PARTIALLY ORDERED OPERATION SETS (CPOOS)
is defined as follows:
INSTANCE: Set P = {p1 , p2 , · · · , pm } of m configuration operations, set T = {t1 , t2 , · · · , tn }
of n tests, set of user requirements R ⊆ T , a positive integer K ≤ |F|, where F is a polynomial
function of |P |, a partial order S = {(pi , pj )|pi , pj ∈ P }, where (pi , pj ) ∈ S exactly when pi must
precede pj , set P ⊆ P ∗ conforms to the ordering in S, test process function σ : P → power(T ),
where for p̃ ∈ P , σ(p̃) represents the tests in T that succeed after applying p̃ to a given repeatable
baseline state. Suppose the computation of σ is linear in |T |, i.e., O(n).
QUESTION: Does P contain a sequence q̃ of length less than or equal to K, such that R ⊆ σ(q̃)?
Theorem 5 CPOOS is NP-hard.
Proof:
In the following we restrict CPOOS to CS by allowing only instances with idempotent operations.
Order of operations does not matter if we know the precedence of those operations. In other
words, if we are sure of the appropriate sequences, the order of writing down the elements is trivial;
we can re-sort them into an appropriate order later using an algorithm like topological sort, which takes O(|P | + |S|) time [24]. Thus a set of operations with known precedence can be treated in the same way as a commutative set. We have shown that GC collapses to CS with the restrictions of commutativity and idempotence. Thus we can restrict CPOOS to CS by allowing only instances with idempotent
operations. 2
10.2.5 Composability of Convergent Operations
Can convergence help to reduce complexity? As we discussed before, a convergent operation is an operation whose repeated application has a fixed point; after that fixed point is reached, additional executions of the operation do not change the system, i.e., there exists n0 such that for n ≥ n0 , σ(p ◦ p^n ◦ q̃) = σ(p^n ◦ q̃).
Definition 23 CONVERGENT OPERATION COMPOSABILITY(COC) is defined as follows:
INSTANCE: Set P = {p1 , p2 , · · · , pm } of m convergent configuration operations. Let P ⊆ P ∗ be
the set of sequences that show stabilized behaviors, i.e., where for p̃ ∈ P , repeating any operations
in sequence p̃ after execution of p̃ will not change the observed state of the system. Set T =
{t1 , t2 , · · · , tn } of n tests, set of user requirements R ⊆ T , and a positive integer K ≤ |F|, where F
is a polynomial function of |P |, a test process function σ : P → power(T ), where for p̃ ∈ P , σ(p̃)
represents the tests in T that succeed after applying p̃ to a given repeatable baseline state. Suppose
computation of σ takes time O(|T |).
QUESTION: Does P contain a sequence q̃ of length less than or equal to K, such that R ⊆ σ(q̃)?
Theorem 6 COC is NP-hard.
Proof:
COC is restricted to CS by allowing only instances in which the operations are idempotent (idempotence is a special case of convergence) and commutative. Since this restricted, easier problem is NP-hard, so is the larger problem COC. 2
10.2.6 Summary of The Proofs
In this section we have shown mathematically that composability is an NP-hard problem. The GENERAL COMPOSABILITY (GC) problem is restricted to COMPONENT SELECTION (CS) with
idempotence and commutativity. CS is restricted to RESTRICTED COMPONENT SELECTION
(RCS) with compositional predictability. RCS is transformed from MINIMUM COVER which is
a known NP-complete problem. Then we prove that ATOMIC OPERATION COMPOSABILITY,
COMPOSABILITY of PARTIALLY ORDERED SETS and CONVERGENT OPERATION COMPOSABILITY are all NP-hard problems. We have shown that give a list of requirements and a
configuration operation repository with well defined behavior, the process of finding the optimal
sequence of operations to meet the requirement is an NP-hard problem and the problem remains
NP-hard regardless of whether operations are:
• idempotent/stateless/convergent
• consistent/commutative/atomic
• orderable via a known partial order on the operations (known dependency between operations)
Refer to Figure 10.1 for summary of proofs.
10.3 Discussion
Practitioners of system administration argue about limits upon configuration operations. Some
maintain that operations should be constructed to be repeatable without consequence once a desirable state has been achieved. Others maintain that operations must be based upon imperative
order or upon generating the whole configuration as a monolithic entity. Our work proves that these
arguments are not based upon mathematical fact. The problem remains hard no matter how one
limits operations.
We will discuss how current tools get around the composability problem in Chapter 12.
Figure 10.1: Summary of proofs
Chapter 11
Dependency Analysis
Dependency analysis is the process of discovering interactions and relationships between entities
of a complex system. It is widely used in industries that design and develop products such as
software systems and hardware devices. Dependency analysis is a difficult process in the context
of a system with complex interactions. The “system” in the world of system administration is a
network of hundreds or thousands of complex subsystems such as computers, software applications,
and hardware devices. Fortunately, system administrators do not need a full understanding of the interactions and relationships among entities of a system, as long as following the system's documentation results in the desired system behavior. By following a documented procedure, system administrators bypass the "hardness" of dependency analysis. However,
dependency analysis cannot be fully avoided due to the following reasons:
1. Documentation is not always 100% accurate.
This is due to software bugs, human errors, and constraints of development time and expense
on the developers’ side; and dynamic use of the system, software upgrades/patches, and
execution of scripts that have global effects on the system administrators’ side.
2. Documentation does not address all possible working environments.
Modern systems are constructed from hundreds or thousands of components from independent developers and vendors. Even if all of the components are fully tested and verified by their developers for some ideal environment, it is possible that they fail to function as specified in the current environment.
Dependency analysis is used in many areas of system administration, e.g.,
• Root cause analysis - determining the cause of a failure.
• Impact analysis - determining which entities of a system or customers will be affected by a
problem.
• Change analysis - determining the consequences of changes made to a system.
• Requirements analysis - determining the requirements necessary to provide a service.
We first introduce currently used techniques in dependency analysis in Section 11.1. In Section
11.2 , we introduce basic concepts. In Section 11.3, we formally define dependencies. In Section
11.4, we analyze the complexities of black and white box analysis.
11.1 Dependency Analysis Techniques
In this section, we introduce some common techniques used in dependency analysis: instrumentation, perturbation, data mining, requirements analysis, and dependency control. Requirements
analysis and dependency control are white box approaches (based upon contents of the system) and
the rest are black box approaches (based upon the behavior of the system).
11.1.1 Instrumentation
Instrumentation is a black box technique in which an “instrument” or “probe” is placed within one
or more entities of the system. Dependencies are calculated by correlating transactions recorded
for various entities. Code instrumentation is widely used in circumstances where source code is
available. For example, in the Linux operating system, it is easy to instrument kernel functions to
track processes and files. A key problem of instrumentation is its intrusiveness. It is possible that
the instrumentation changes too much about the computation for it to be usable. Another limitation
of instrumentation is that it may be unusable in situations where instrumentation code cannot be
inserted into the system due to security requirements, licensing, or other technical constraints, e.g.,
in commercial software.
11.1.2 Perturbation
One uses perturbation/fault-injection[13, 11] in cases where the behavior one seeks to understand is too rare to occur in practice on its own, so that we need to artificially create circumstances in which it will happen. Perturbation is the process of explicitly perturbing system entities while monitoring the system’s response. Any behavioral change caused by the perturbation implies some dependency relation. Fault injection is used as a perturbation tool. One arranges for one component to “fail” and then observes the results. This is especially useful if the component
failures model realistic situations. A fault can be modeled as a simple bit flip, locking of a file, a
disk filling up, or overloading of a service[14].
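The following is a minimal Python sketch of the perturbation loop just described, assuming hypothetical inject_fault and restore hooks and a dictionary of named Boolean tests supplied by a real harness:

# Minimal perturbation/fault-injection loop (illustrative only).
# `entities` is an iterable of entity identifiers; `tests` maps test
# names to zero-argument Boolean functions; `inject_fault` and `restore`
# are hypothetical hooks supplied by a real harness.

def perturbation_analysis(entities, tests, inject_fault, restore):
    """Return {entity: set of tests whose outcome changed under the fault}."""
    baseline = {name: test() for name, test in tests.items()}
    suspected = {}
    for entity in entities:
        inject_fault(entity)              # e.g., lock a file or fill a disk
        changed = {name for name, test in tests.items()
                   if test() != baseline[name]}
        restore(entity)                   # undo the injected fault
        if changed:
            suspected[entity] = changed   # evidence of a dependency
    return suspected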
11.1.3 Data Mining
Data mining is the process of extracting knowledge hidden in large volumes of raw data. It is used
in many areas such as financial market analysis, image recognition, and bioinformatics. Faced with large amounts of statistical data describing system states, researchers in system administration have begun to apply data mining techniques. Strider[40] is an administrative tool
for Microsoft Windows that uses state differencing to identify potential causes of differing program
behaviors. Windows registry entries are used to describe system state. Strider attempts to identify
regions where a change might cause a problem. It does this by correlating changes, determining
what changes are “normal”, and filtering them out. The user must then analyze the resulting
registry map, which consists of changes that might cause a problem.
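A minimal sketch of the state-differencing idea (not Strider’s actual algorithm) might compare snapshots represented as dictionaries and filter out keys that also change between known-good snapshots:

# Illustrative state-differencing sketch; snapshots are dictionaries
# mapping registry-like keys to values.

def changed_keys(before, after):
    keys = set(before) | set(after)
    return {k for k in keys if before.get(k) != after.get(k)}

def suspect_changes(good_snapshot, bad_snapshot, normal_snapshots):
    """Diff a good and a bad state, then filter out "normal" churn."""
    suspects = changed_keys(good_snapshot, bad_snapshot)
    noise = set()
    for a, b in zip(normal_snapshots, normal_snapshots[1:]):
        noise |= changed_keys(a, b)       # keys that change during normal use
    return suspects - noise               # what remains needs human analysis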
11.1.4 Requirements Analysis
One of the keys of white box dependency analysis is to find the requirements (or preconditions)
of an entity. It is crucial for this approach to find the requirements that can affect the entity’s behavior that concerns us. Sowhat[86] is a system administration tool that performs global impact analysis of dynamic library dependencies on Solaris and Linux systems. Dependencies include
requested library names or paths that are hard coded in executable programs for use by a dynamic
linker. The diagnostic program ldd provided by the operating system exposes executable programs’
dependencies on dynamic libraries.
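For example, a rough sketch of harvesting such dependencies by parsing ldd output (the output format varies across platforms, so the parsing below is approximate, and the example path is illustrative) is:

import subprocess

def library_dependencies(executable):
    """Return {library name: resolved path or None} as reported by ldd."""
    out = subprocess.run(["ldd", executable], capture_output=True,
                         text=True, check=True).stdout
    deps = {}
    for line in out.splitlines():
        line = line.strip()
        if "=>" in line:                  # e.g. "libz.so.1 => /lib/libz.so.1 (0x...)"
            name, _, rest = line.partition("=>")
            path = rest.strip().split(" ")[0]
            deps[name.strip()] = None if path == "not" else path
    return deps

# Example (path is illustrative):
# print(library_dependencies("/bin/ls"))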
11.1.5 Dependency Control
Another approach to dependency analysis is to proactively and strictly define critical dependencies.
The Linux Standard Base (LSB) project[48] seeks to provide a dynamic linking environment within
Linux in which vendor-provided software is guaranteed to execute properly. The goal of LSB is to
identify a set of core standards that must be shared among distributions in order to guarantee that a
product that works properly in one of them will work in all compliant distributions. These standards
include requirements for the content of dynamic libraries, as well as standards for locations of system
files used by library functions. With these standards in hand, the LSB provides tools with which
one can certify both environments and programs to be compliant with the standard.
Linux distributions can be examined by an automatic certification utility that checks link order,
versions of libraries, and locations of relevant system files. A distribution may have more libraries
than the standard specifies, but the libraries specified in the standard must be first to be scanned
during linking and must contain the appropriate versions of library subroutines. Another certification utility checks that the binary code for Linux applications only calls library functions protected
by the standard. Since the LSB tools solely analyze the contents of binary files, they can check
closed-source executables for compliance.
11.2 Basic Concepts
To be as general as possible, a system is composed of entities that collaborate and communicate
with each other to achieve some common goals. Each entity in this community has its own mission
or duty to accomplish, which we call its functional expectations (we will define these formally later).
A dependency exists when an entity cannot achieve its functional expectations by itself but only
with cooperation from others. An entity can be anything that concerns us, e.g., software, hardware,
object, parameter, or a subsystem of entities.
Within a system, there are two kinds of dependency relationships: vertical and horizontal dependencies.
• Vertical dependencies are dependencies where a change of one entity may directly result in a
change of behavior of another entity. For example, one entity may be a subpart/subroutine of another entity, or one entity’s output may feed into the input of another entity. Vertical dependencies
form a hierarchy of relationships among entities.
• Horizontal dependencies are dependencies in which two or more entities are peers that must coordinate to achieve some common goals. A change of one entity may not directly change the
behavior of another but functional expectations are no longer met. In order to meet functional
expectations, other peer entities must change their states accordingly.
There are two approaches to dependency analysis: “white box” and “black box”.
11.2.1 White Box
The white box approach is to analytically study and understand the internal structure of the system
(e.g., source code) and derive dependencies from that structure. This method relies on a human or
a static analysis program to analyze system configuration, installation data, and application code
to compute dependencies.
White box dependency analysis is content-oriented, not behavior-oriented. It attempts to discover why there is a dependency, i.e., analyzes the cause and effect relationships between an entity
and its environment. It is possible that some dependencies are buried deep in the system and do not
affect external behaviors at all. Thus a dependency found by a white box method might not have
an impact on behavior of the system. For example, when we download a file from a remote server,
normally it does not matter to us how the packets are routed, so the dependencies of the routing mechanism upon other entities do not concern us. Further, some dependencies found by white box dependency
analysis might not be useful. For example, when a process looks up a user name, it first searches
for the user name in the /etc/passwd file, then in LDAP. At many sites, /etc/passwd is empty and all users are in LDAP (or NIS, NIS+), so the reference to /etc/passwd never has any effect upon the outcome. From white box dependency analysis, we conclude that the process depends upon /etc/passwd, but this dependency does not affect the behavior of the process and
thus has no real value. Thus it is critical for white box analysis to identify what dependencies affect
behavior and need to be tracked.
When the system is simple and its internal operation is well understood, the white box approach
can suffice. However, this approach breaks down when the system is complex and implementation
details are unknown or too complex to analyze.
11.2.2 Black Box
A black box approach is based on system behavior using tests and inference. Each system entity is
treated as a black box with some configuration parameters outside the box to control its behavior.
The internal operations are unknown to the analyzer. As an analogy, a black box is like a toy to a child: it has some external buttons and/or handles that control its behavior, but the internal nuts and bolts are completely hidden from the child. Black box dependency analysis helps answer how a dependency can affect system behavior.
The shortcoming of black box dependency analysis is that it cannot discover the cause and effect
of a dependency. For example, one sees a rooster cry and the sun rises. One who uses black box
dependency analysis could mistakenly conclude that sunrise depends upon a rooster’s cry. Further,
black box dependency analysis can never establish independence (lack of dependence); it can only conclude that no dependence has been observed.
11.2.3 Functional Expectations and Tests
The concept of functional expectations is crucial in dependency analysis. Without some concept
of how systems should behave, dependency analysis is meaningless. It may seem that vertical dependencies have nothing to do with functional expectations. The dependency is there regardless of
whether there are functional expectations. However, dependencies irrelevant to functional expectations are also irrelevant to our considerations. For example, we may paint our server different colors.
So the color of our server has a dependency upon the painting process. However, if the functional
expectations of our server do not include a requirement of the color, then this dependency has no
value in our management.
Functional expectations consist of behaviors we would like to observe, in the form of tests that
should be true. It does not matter if more tests are true than we desire; all that matters is that
the requirements we have are met. For example, the functional expectations of a web server system
might include:
• it has a valid IP address;
• it responds to one or more domain names and ports;
• it provides correct contents for each URL request;
• it calls CGI scripts appropriately;
• it interacts appropriately with database servers; and
• it meets site-specific security requirements.
In practice, most of the items in the list are expressed as a group of tests. This simple example
provides the basic idea of using tests to express functional expectations.
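As a minimal sketch, such expectations might be encoded as Boolean test functions; the host name, port, and URL below are hypothetical placeholders:

import socket
import urllib.request

HOST, PORT, URL = "www.example.org", 80, "http://www.example.org/index.html"

def resolves_to_valid_ip():
    try:
        socket.gethostbyname(HOST)        # "it has a valid IP address"
        return True
    except OSError:
        return False

def answers_on_port():
    try:                                  # "it responds to ... domain names and ports"
        with socket.create_connection((HOST, PORT), timeout=5):
            return True
    except OSError:
        return False

def serves_expected_content():
    try:                                  # rough stand-in for "provides correct contents"
        with urllib.request.urlopen(URL, timeout=5) as response:
            return response.status == 200
    except OSError:
        return False

# The functional expectations R_E are simply the set of such tests.
EXPECTATIONS = {resolves_to_valid_ip, answers_on_port, serves_expected_content}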
Definition 24 The functional expectations for an entity E are a test set RE ⊆ T (where T is the set of possible tests that can be performed for the system) that encodes expectations, including the desired behavior of the entity or the desired results expected from the existence or execution of the entity.
Note that functional expectation tests are not necessarily performed on the entity itself; they
may instead concern its environment or surroundings or interactions with other systems. For example, the functional expectation for the code validator of LSB is that if an application passes the
validator, it should work on any LSB compliant run-time environment. Thus testing this functional
requirement requires moving the code to another compliant system and checking out its function
there.
Further, the functional expectations for an entity can be more than its basic functions. Consider, for example, a DNS server. Obviously, its functional expectations should include converting host names into IP
addresses and vice versa. However, those are not all of its requirements. A typical DNS server must
also serve appropriate information in a collaborative way so that other servers, e.g., a web server
and a DHCP server, can produce a network that meets functional expectations.
11.2.4 State and Behavior
Dependencies only exist in systems where there are multiple, semi-independent entities. We refer
to these entities as {Ei }.
Definition 25 The state e of an entity E is described by the contents of the entity that are used to control its behavior. We use ei to denote a state of the ith entity of a system.
In configuration management, the state of an entity is its configuration. This is typically the state
of one or more files associated with the entity. The distinctions between entities are often imprecise,
and it may well be that one file is part of the configurations of two distinct entities.
Definition 26 Every entity E has a set of possible states, denoted E. Again we use Ei to denote the set of possible states of the ith entity of a system. The set of possible states of a system of n entities is S = {s = (e1, e2, · · · , en) | ei ∈ Ei} ⊆ E1 × E2 × · · · × En.
Due to overlaps between configuration parameters, the state space of a system of entities is not necessarily the product state space of the individual entities; only a subset of the product space is meaningful, and some states may not even be achievable.
Definition 27 Given a set of Boolean tests T , the observed behavior of a state of a system or
entity is a subset U ⊆ T where t ∈ U exactly when test t returns TRUE.
Definition 28 V(S) is a relation describing the outcomes of tests upon states in S: V(S) = {(s, U) | s ∈ S, U ⊆ T is an observed behavior of state s}.
For a state s ∈ S, V(s) = ∩{Ui | (s, Ui) ∈ V(S)}, the intersection of the observed behaviors Ui recorded for s. This is the set of observed behaviors that remain true regardless of other (perhaps external) influences. Unlike the relation V(S), V(s) is always a function from S into subsets of T.
V (s) is the set of observed tests that remain true for the state s of an entity or a group of entities,
regardless of what is happening in the outside world around the entity or the group of entities. For
example, if our configuration s for a DNS server is that 10.2.3.4 maps to the name foo.com, then
the test that this map works in both directions will remain true regardless of what is happening in
the outside world. In other words, the contents of V (s) are the observed tests that this entity or
group of entities controls, taken from the set of all observed tests.
Definition 29 The function Q : power(T) → power(S) is defined by Q(U) = {s ∈ S | V(s) ⊇ U}. This is the set of system states that satisfy all tests in U.
Here we require that the states in Q(U) be system states.
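A toy rendering of Definitions 27–29, with an invented relation V(S), illustrates how V(s) and Q(U) are computed:

# Toy rendering of Definitions 27-29; the data below is invented.
from functools import reduce

TESTS = {"t1", "t2", "t3"}                # the test set T

# Each state may be observed several times with different behaviors because
# of outside influences; V(s) keeps only the tests that always hold.
V_S = [
    ("s1", {"t1", "t2"}), ("s1", {"t1", "t2", "t3"}),
    ("s2", {"t1"}),       ("s2", {"t1", "t3"}),
]

def V(s):
    observed = [U for (state, U) in V_S if state == s]
    return reduce(set.intersection, observed) if observed else set()

def Q(U):
    states = {state for (state, _) in V_S}
    return {s for s in states if V(s) >= set(U)}

print(V("s1"))          # {'t1', 't2'}
print(Q({"t1"}))        # {'s1', 's2'}
print(Q({"t2"}))        # {'s1'}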
11.3 Dependence Definition
Dependency is lack of independence. One entity cannot achieve its functional expectations without
requiring that some other entity or entities be in particular states.
11.3.1 Dependency in a Closed-world System
Strictly speaking, dependency is difficult to define precisely in an open-world system. In an open-world system, unknown outside forces may affect the behavior of system states. When the system
fails to achieve functional expectations, we are not sure whether a dependency is broken within the
system or some outside influences affect the behavior of the system. For example, consider a system
of a printer and a web server in an open world where electric power is considered as an outside
influence that can change. Suppose the printer changes state and at the same moment the electric
power is shut off (but this is unknown because electric power is considered an element outside of the
system). We then draw an incorrect conclusion that there is dependency between the web server
and the printer because we see a change of state in the printer and a failure in the web server.
Dependence is a relationship between entities in which one entity cannot carry out its mission
(functional expectations) by itself without the cooperation of other entities.
Formally, the definition of dependency is as follows:
Definition 30 Given a closed-world system of n (n ≥ 2) entities {Ei} and a test set T of m tests, where each entity has functional expectations Ri ⊆ T, Ei depends upon Ej if there exist two system states s and s′ ∈ S such that:
• the content difference between s and s′ is caused only by a change of entity Ej’s state, and
• s satisfies Ei’s functional expectations but s′ does not, i.e., V(s) ⊇ Ri and ∃t ∈ Ri such that t ∉ V(s′).
The meaning of this definition is that Ei depends upon Ej if the choice of Ej’s state can possibly change whether the functional expectations for Ei can be met.
In an open-world system, it is possible that the behavior of a system state varies in time; it can
sometimes satisfy a set of functional expectations and sometimes not. We will discuss dependencies
in an open-world system in the next section.
The “critical set” of a dependency is a set of tests that specify which functional expectations of
the dependent will be affected if the dependency is not satisfied.
Definition 31 For entities Ei and Ej, a functional expectation t ∈ Ri is a critical condition if the result of this test differs depending upon the choice of state for Ej. Rij ⊆ Ri is the set of all critical conditions of Ei with respect to Ej, taken over all s′ as described above.
Some dependencies might be more critical than other dependencies. For example, the functional
expectation of correctness of a server might be more important than the functional expectation of
its performance. Thus dependencies that compromise correctness if not satisfied are more important
than the ones that compromise performance if not satisfied.
Strength is a metric describing how “heavily” one entity depends upon other entities.
Definition 32 For entities Ei and Ej, let W(Ei, Ej) be the set of states in Q(Ri) for which there exists an s′ as described above. The strength of the dependency is |W(Ei, Ej)|/|Q(Ri)|.
Note that if the strength of a dependency is less than 1, the state of Ej can make Ei fail to conform to expectations in some system states but cannot make this happen in others. For example, suppose that a Java application can be configured to get a required class either locally or remotely (via, e.g., a web server or a shared file system). In this case there are two states conforming to the functional expectations of this application: one gets the required class locally and the other gets it remotely. The application depends upon the remote copy of the class with strength 1/2 < 1, which means there is some alternative way to achieve the functional expectations for this application even if the network is not available.
Dependencies with strength 1 indicate that the system is inflexible and that some condition is absolutely required. One can introduce redundancy to weaken such dependencies.
Let us consider a system of an Apache server, a dynamic library libz.so and a database mysql.
Apache requires libz.so and mysql. If libz.so is deleted, none of the functional expectations of
Apache can be met. If mysql is not available, Apache fails to provide one of its functions: search
in a database. Thus, according to our definition, Apache depends upon libz.so with criticality RApache; Apache depends upon mysql with criticality R = {“search in a database”} ⊂ RApache, where “search in a database” is a single functional test.
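As a brute-force sketch of Definitions 30–32 on an invented two-entity system (the entity states and behavior function below are placeholders, not a real configuration), dependence, criticality, and strength can be computed by enumeration:

# Brute-force rendering of Definitions 30-32 on a tiny closed-world system.
from itertools import product

ENTITY_STATES = {"web": {"w0", "w1"}, "dns": {"d0", "d1"}}
R = {"web": {"serves_pages"}}                      # functional expectations of "web"

def behavior(state):                               # plays the role of V(s)
    # The web server only serves pages when it is up (w1) and DNS is up (d1).
    return {"serves_pages"} if state == ("w1", "d1") else set()

ENTITIES = list(ENTITY_STATES)
STATES = list(product(*(ENTITY_STATES[e] for e in ENTITIES)))

def depends_at(s, i, j):
    """Can changing only E_j's state in s break some expectation of E_i?"""
    for s2 in STATES:
        differs = [k for k in range(len(ENTITIES)) if s[k] != s2[k]]
        if differs == [ENTITIES.index(j)] and R[i] - behavior(s2):
            return True
    return False

def criticality(i, j):
    critical = set()
    for s in STATES:
        if not behavior(s) >= R[i]:
            continue                               # s must satisfy R_i
        for s2 in STATES:                          # s2 differs only in entity j
            differs = [k for k in range(len(ENTITIES)) if s[k] != s2[k]]
            if differs == [ENTITIES.index(j)]:
                critical |= R[i] - behavior(s2)    # tests broken by changing j
    return critical

def strength(i, j):
    working = [s for s in STATES if behavior(s) >= R[i]]
    breakable = [s for s in working if depends_at(s, i, j)]
    return len(breakable) / len(working) if working else 0.0

print(criticality("web", "dns"))   # {'serves_pages'}, so "web" depends on "dns"
print(strength("web", "dns"))      # 1.0 (no redundant way to meet R_web)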
11.3.2 Dependency in an Open-world System
In an open-world system, dependency cannot be defined in absolute terms, because literally anything
whatsoever can happen to affect behavior. In the absence of external effects, however, there is a
weak form of dependency based upon the assumption that external effects are absent or at least
quiescent.
Dependency in an open-world system can only be studied in a temporal domain. We can only
say that at some specific time, this state depends upon that state. For example, in an open-world
system with a web server and a DNS server, the claim that the states of the web server depend upon the states of the DNS server is temporal: it depends upon whether a backup DNS server is present.
We can only say, at some specific time, the states of the web server depend upon the states of the
DNS server and at some other time, when the backup server is present, the functional state of the
web server does not depend upon the states of the DNS server since the failure of the DNS server
cannot affect the service of the web server.
Observation 1 In an open-world system, if there are intervals of time in which the external influences on functional expectations of the system do not vary, dependencies can be analyzed during
these times as if the system is a closed-world system.
Discussion: In a closed-world system, external influences either affect only behaviors that are not part of the system’s functional expectations, or, if they can affect behaviors that are part of the functional expectations, they are constant. During the intervals described above, there is no variation of external influences on the system’s functional expectations, so we can analyze the system as if it were
a closed-world system.
11.4 Complexity of Dependency Analysis
11.4.1 Black Box Dependency Analysis
To simplify the problem, we assume that there is no overlap of configuration contents of different
entities.
Further, we assume that each entity can only take a limited number of possible states. This is
a practical limitation of any realistic system.
From the definition of dependency, the problem of dependency analysis can be defined as:
INSTANCE: A closed-world system of n (n ≥ 2) entities {Ei } and a test set T of m tests, each
entity has a functional expectations set Ri ⊆ T , each entity has at most d possible states, a system
state s ∈ S is a vector state composed of states of individual entities; a function R : S → power(T )
takes linear (proportional to number of tests) time to compute the behavior of a system state.
QUESTION: Does Ei depend upon Ej, i.e., ∃s, s′ ∈ S with s′ ≠ s such that:
• the only difference between s and s′ is the state of Ej, i.e., s = (e1, e2, · · · , ej, · · · , en) and s′ = (e1, e2, · · · , e′j, · · · , en) where ej ≠ e′j;
• s satisfies the functional expectations for Ei, i.e., R(s) ⊇ Ri;
• s′ does not satisfy the functional expectations for Ei, i.e., there is at least one t ∈ Ri such that t ∉ R(s′).
This problem is not in NP because, no matter how fast we can perform these tests, the test function forms an exponential-size table for looking up the behaviors of different system states (if each entity can have d possible states, the total number of system states in the worst case is d^n). By the definition of black box dependency analysis, one cannot learn the result of a test by analyzing the contents of system states, but only by testing the external behaviors. In other words, black box dependency analysis has no internal structure to exploit. Thus the problem remains in EXPTIME if no further assumptions are made.
However, if we have some prior knowledge of which system states can achieve the functional expectations for Ei, which in many cases we do, and the number of those system states is O(n), dependency analysis collapses into P, because we only need to change the state of Ej and test the system at most d times for each working system state.
To summarize, black box dependency analysis is tractable if the following two conditions are
true:
1. each entity can only take a limited number of states;
2. we have some prior knowledge of which system states can achieve the functional expectations.
The violation of condition 1 makes the problem unbounded in general; the violation of condition 2 makes the problem EXPTIME, because we need to test every possible system state in a state space of exponential size.
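A minimal sketch of the shortcut under condition 2, assuming a hypothetical black-box test driver run_tests that returns the set of passing tests for a given system state:

# Sketch of the polynomial shortcut: only E_j's coordinate is varied in
# each of the known working states. `run_tests` is a hypothetical driver.

def depends_with_prior_knowledge(working_states, j_index, possible_j_states,
                                 expectations, run_tests):
    """Test whether E_i depends upon E_j using only the known working states."""
    for s in working_states:                       # O(n) known working states
        for ej in possible_j_states:               # at most d alternatives
            if ej == s[j_index]:
                continue
            s2 = s[:j_index] + (ej,) + s[j_index + 1:]
            if not expectations <= run_tests(s2):  # some expected test now fails
                return True
    return False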
11.4.2 White Box Dependency Analysis
Unlike the black box approach, in which external behavior of each entity is all that is available, in
the white box approach, some representation of the content of an entity is available for analysis.
In black box analysis, we may change the values of some configuration parameters to observe the
entity’s behavior after the change, but we cannot analyze the internal logic of how these configuration
parameters affect the entity’s behavior.
In an abstract sense, many kinds of entities, especially software components, can be viewed as
programs running on a Turing machine. Any process aiming to understand the internal logic of
an entity is a white box analysis. White box analysis includes control flow analysis and data flow
analysis.
Control flow analysis is the process of determining the instructions that a particular execution
reaches or does not reach. Likewise, data flow analysis is the process of determining, for a specific
computed quantity, which inputs affected its value. For example, in data flow analysis, a dependency
is established if the value of one variable influences another; in control flow analysis, one entity
depends upon another if the execution of the second depends upon whether the first was executed.
In general, both control flow analysis and data flow analysis are intractable. Given an arbitrary
program, the process of finding whether its preconditions hold or analyzing its postconditions is not
necessarily decidable. To illustrate the idea, consider the following simple example:
Precondition: I am penniless.
start: Work to earn a dollar
Buy a lottery ticket
If ticket is not a jackpot winner, GO TO start
If ticket is a jackpot winner, celebrate.
What are the postconditions of the program? If this program ever finishes, then you will be rich; the problem is that there is nothing to stop it going round and round in circles forever! Here we encounter the same halting problem as in program correctness: given an arbitrary program, a procedure for deciding its preconditions and postconditions might not necessarily halt.
11.5 Discussion
From the analysis above, we conclude that neither black box nor white box dependency analysis is
tractable in the worst case. In black box analysis, since we do not have any idea of the internal
structure of the box, we have to perform every possible test to discover dependencies. In white box
analysis, the complexity of contents of entities makes the problem intractable. Further, since we
do not know the behavior of the system, we have to analyze every dependency in the system to
guarantee behavior.
In practice, a gray box approach, a mixture of black box and white box dependency analysis, is used. In the gray box approach, the system is not “completely black” to the analyzer. One opens the box to the depth necessary for the analysis, but never intends to do a complete white box analysis of the whole system. In this way, even with limited knowledge of the internal structure, we can choose what to test next, and thus reduce the testing state space significantly; with the test results, we are able to further identify critical regions for the white box analysis, and thus eliminate much unnecessary cost spent on dependencies that are irrelevant to the behavior that concerns us.
The cost of gray box analysis varies on a case-by-case basis. Theoretically, it is still intractable in the worst case, where neither the black box nor the white box ingredient of the approach gains us anything. In practice, however, it normally reduces the complexity of the problem.
The way people make dependency analysis tractable is by refining their state model only in the presence of non-determinism. They adopt naive models that seem deterministic, and add states
only when that seeming determinism is violated.
Dependency analysis can be avoided if the system administrators have enough experience and/or the knowledge necessary to configure the system can be gained by reading documentation. This suggests the following conditions: precise documentation, efficient mechanisms for searching documentation, and a homogeneous environment (which maximizes the value of system administrators’ experience).
In summary, we gave formal definitions of white box and black box dependency analysis in
system administration and examined the computational complexity of each approach.
We showed that both white box and black box dependency analysis are intractable.
Chapter 12
Configuration Management Made
Tractable in Practice
In this chapter, we discuss some general guidelines used by system administrators to keep configuration management tractable and examine current strategies of configuration management as
examples.
Configuration management is intractable in the general case because the complete understanding
of the system and its behavior is intractable. People manage complexity in many ways. There is
no perfect solution that makes configuration management tractable without sacrificing something, such as accuracy, flexibility, or convenience. The intrinsic complexity cannot be destroyed, only hidden. In the following paragraphs, we summarize the mechanisms used by system administrators to reduce the complexity of configuration management. Note that these mechanisms
are often used in combination.
12.1 Experience and Documentation
System administrators use their experience of past solutions and documentation to avoid dependency
analysis and composing operations. After validating that it is proper to use past experience or
documentation in the current environment, they simply repeat those actions or follow instructions
of the documentation without doing dependency analysis and composition.
For example, suppose a system administrator needs to install a network card. The installation
guide instructs one to install the driver first, then plug the card in the machine. The system
administrator can just follow these instructions without bothering to analyze the interactions and
relationships between software and hardware of the network card. It is not important to know that
if the instructions are not followed, the card will not work; this fact is not even considered. Often,
this leads to cases where procedures are overly constrained for no particularly good reason, simply
to avoid a deeper analysis of dependencies.
The drawback of this approach is that experience and documentation are not always 100% accurate and might not be appropriate for the environment the system administrator is currently working with. In a very dynamic environment, it is possible to misconfigure the system by simply following
instructions from documentation or experience that do not apply to the system’s current state.
12.2 Reduction of State Space
Reduction of state space is a matter of limiting one’s choices during configuration. With a set of
n operations, theoretically, one can compose an infinite number of possible sequences. Thus there
exist a large number of possible actual states of the system. However, there are only a few distinct
behaviors that we might want the resulting system to exhibit. Limiting the states of the system to
match the number of distinct desirable behaviors simplifies the configuration process. If we limit ourselves so that only a few sequences can be chosen, composition of operation sequences becomes tractable simply by choosing from a limited number of sequences. Composability then entails just the selection
of the one operation available, which is obviously tractable. This is a strong argument for “overconstraining” the documentation so that the options available do not overwhelm the administrator.
Following instructions from documentation is a case of reduction of state space: only one sequence of operations is available to choose. Acting according to shared practices, agreed upon by a group of system administrators for a site, can also efficiently reduce the state space of the system.
By restricting heterogeneity of the network (making machines alike), one can increase the predictability of the system. Solutions on one machine are likely to work on another. Instructions in
documentation become more feasible to apply for a large network. Homogeneity can be achieved
by always following the same order of procedures or cloning machines.
Standardization is an effective method to proactively and strictly define critical dependencies.
The Linux Standard Base (LSB) project[48] seeks to provide a dynamic linking environment within
Linux in which vendor-provided software is guaranteed to execute properly. LSB achieves compatibility among systems and applications by enforcing restrictions on critical dependencies, including the locations of system files and the names and contents of dynamic libraries.
The drawback of reduction of state space is its inflexibility during configuration. Strategically,
variety in configuration can pay off in terms of productivity. Moreover, a varied system is less
vulnerable to a single type of failure. System administrators must find an appropriate balance
between homogeneity and variety.
12.3 Abstraction
Abstraction means representing complex dependency relationships by simpler mechanisms in order to hide complexity. System administrators keep a simple model of the system. They operate at a higher level of abstraction than the interaction details of the different subsystems.
In many software systems, interactions and relationships among packages within the system are
abstracted as strings of fields under “requires” and “provides”. This dependency information does
not come from system administrators’ analysis of packages but from the white box knowledge of
developers of the system. The true dependency relationship is hidden by this abstraction. For
example, RedHat Package Manager (RPM) files and their equivalents use this mechanism to ease the task of adding software to a Linux system. In the package header, each package is declared
to “provide” zero or more services. These are just strings with no real semantic meaning. A
package that needs a service then “requires” it. This dependency information can be considered as
“documentation” embedded in the system. Just as other documentation, this abstraction can be in
error[55].
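A minimal sketch of resolving “requires” strings against “provides” strings (simplified, and not RPM’s actual logic; the package data below is invented):

# Illustrative "provides"/"requires" resolution over plain strings.

PACKAGES = {
    "httpd": {"provides": {"webserver"}, "requires": {"libz", "mysql-client"}},
    "zlib":  {"provides": {"libz"}, "requires": set()},
    "mysql": {"provides": {"mysql-client"}, "requires": {"libz"}},
}

def unmet_requirements(installed):
    """Map each installed package to the services it requires but nothing provides."""
    provided = set().union(*(PACKAGES[p]["provides"] for p in installed))
    return {p: PACKAGES[p]["requires"] - provided
            for p in installed
            if PACKAGES[p]["requires"] - provided}

print(unmet_requirements({"httpd", "zlib"}))   # {'httpd': {'mysql-client'}}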
Another abstraction is to use dependencies between procedures instead of dependencies between system components. The dependencies between entities of the system are represented by the order in which procedures are performed. For example, in order to install a wireless card, one must install its software before plugging in the card. The complex interactions between the hardware and software of the card are hidden by the order of these two installation procedures.
The drawback of abstraction is that the representation may not correctly reflect the real dependencies, or may fail to deal with problems that occur at levels lower than the abstraction. For example,
if package names listed in the “required” field of a package do not include all packages that this
package depends on, then a problem occurs. By only looking at the abstracted level of these string
names, one cannot solve the problem. Some analysis of implementation details of the package is
needed to solve the problem.
12.4 Re-baselining
Re-baselining the system to one of its repeatable baseline states is the strategy used when complexity has grown to an unmanageable level. The drawback of this approach is that it might be
time consuming to rebuild a system from scratch. One must consider the cost of downtime when
rebaselining the system.
12.5 Orthogonality
Using orthogonality means separating the configuration parameter space into independent subsets. Component Selection is in P if the sets of requirements that can be satisfied by each operation are disjoint. The drawback is that this is not always feasible, due to interactions and dependencies between subsystems.
12.6 Closure
The techniques used in closure are closely related to documentation and reduction of state space. A
“closure” is a “domain of semantic predictability”, a structure in which configuration commands or
parameter settings have a documented, predictable, and persistent effect upon the external behavior
of software and hardware managed by the closure[30]. Using closure systems is like delegating
configuration tasks to someone else; either the closure system or the developer of the closure system
is responsible for performing the necessary operations in order to accomplish system goals.
RedHat Enterprise Linux[78] can be considered a closure. It is an operating system that accommodates a wide range of third-party applications. It is self-patching and is updated by its vendor.
However, suppose that one has to add a foreign package to the system managed by the closure.
Then one must determine, through dependency analysis, what parts of RedHat Enterprise Linux will be violated, so that they can be managed differently after the insertion. For example, it may be
necessary to turn off automatic software updating in order to install certain foreign packages.
If the operation of one closure violates the integrity of another closure, unexpected behavior occurs. A consistent closure system of hierarchies of sub-closures is needed to avoid dependency analysis entirely.
The drawback of a closure system is its design complexity[31]. A closure cannot exist by itself
but can only exist within a system of consistent closures. The design of closures must conquer or
smartly avoid the complexities of dependency analysis and composition of operations.
In summary, system administrators utilize the above strategies (except closures) in combination to make configuration tractable; closures are at an early stage of development. Each strategy has a drawback: system administrators must sacrifice something in order to reduce the complexity of configuration.
In the following paragraphs we will discuss in detail some ways that several existing strategies
reduce complexity.
12.7 Current Strategies Make Configuration Management Tractable
In Section 5.2, we introduced several current configuration strategies. No strategy can perfectly make
configuration management tractable. Every strategy sacrifices something including convenience,
flexibility, adaptability, efficiency, etc.
Most of the guidelines mentioned in the previous section are used in these strategies. However,
each strategy has its special way to implement these guidelines.
In manual configuration, configurations are made entirely by hand. Manual configuration is only
cost-effective for small-sized systems with a few machines and a few users, but is often utilized even
for large networks in which few changes in function are expected over time. Composing configuration
operations is done by humans using experience or knowledge from other administrators. Manual
configuration all but requires a tight loop of configuration and testing. System administrators invoke one operation at a time; thus a linear O(n) search (where n is the number of operations), or less, suffices to figure out what to invoke next. After applying an operation, one will normally test whether the operation achieves the intended goal. Latent preconditions still exist, since the system is only partially observed, but they are not as significant as in scripting, since system administrators test the system at each step.
In custom scripting, manual procedures are encoded into repeatable automatic procedures using
a high-level language such as a shell, Perl, or Python. Composition is still planned and accomplished
by humans, though it can be automated if scripts satisfy certain very restrictive conditions [28].
Scripts are often crafted in haste for one-time use [27]. Applying poorly engineered scripts to hosts
in an unknown state leads to network rot: variation in actual state (latent preconditions) that
becomes exponential with the number of executed configuration operations. Careless use of custom
scripting can actually increase the size of the state space. Thus custom scripting is not recommended
for large networks.
Structured scripting is an enhancement of custom scripting that allows one to create scripts
that are reusable on a larger network. ISConf [59, 90, 91] structures configuration and installation
scripts into “stanzas” whose execution order is held constant across a population of hosts. On each
host, probes of host state determine which stanzas to execute. The postconditions of the operations
already completed are treated as the preconditions of the next (whether or not this is actually true),
and these are always the same for a particular host. Thus composability of n operations becomes
O(n) since there are only n actual observed states, namely p1(B), p2 ◦ p1(B), · · · , pn ◦ · · · ◦ p2 ◦ p1(B).
Note that while composability is thus tractable for an individual host, creating a set of operations
that will configure a heterogeneous network of hosts remains intractable.
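A minimal sketch of fixed-order stanza execution in this spirit (not ISConf’s actual implementation; the stanza names, actions, and values are invented, and the completed set stands in for ISConf’s probes of host state):

# Every host replays the same globally ordered stanzas; a local record of
# completed stanzas decides which ones still need to run.

STANZAS = [
    ("install_base", lambda host: host.setdefault("packages", set()).add("base")),
    ("configure_dns", lambda host: host.update(resolver="10.0.0.53")),
    ("install_web",  lambda host: host["packages"].add("httpd")),
]

def apply_stanzas(host_state, completed):
    """Apply stanzas in the global order, skipping those already completed."""
    for name, action in STANZAS:
        if name in completed:
            continue                      # probe says this stanza already ran
        action(host_state)
        completed.add(name)               # record postcondition for next time
    return host_state

host = {}
print(apply_stanzas(host, completed=set()))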
The main technique to simplify configuration management in file distribution is reducing configuration state space. A description of appropriate network behavior is translated into the precise
configuration file contents that assure that behavior. Configuration data are stored in a central file
repository or database. The agents installed on the hosts will read a host configuration generated
from that data. The strategy of proscriptive configuration generation limits existing preconditions
to a very small number. There is exactly one configuration for each kind of behavior; there is no
“unintentional heterogeneity” caused by, e.g., editing the file at different times. Thus there is a
bijective map between behaviors and configurations.
Declarative syntax is a configuration management strategy wherein custom scripts are replaced
by an autonomous agent [15, 16, 18, 25]. This agent interprets a declarative configuration file that
describes the ideal state of a host, then proceeds to make changes that bring the host somehow
“nearer” to that ideal state. The key to tractability of convergent operations is that they accomplish
a specific change in behavior, but do not assert that this change actually guarantees a particular
behavior. Since the tool does not consider behavior, the human has to assure it. Besides, the functional overlap of configuration operations remains low or non-existent; most operations are orthogonal to one another. Thus the agent has no freedom to choose among different operations to accomplish a specific result. The actual state
space of the system is reduced to a manageable level.
Another way that declarative syntax simplifies the composability problem is by keeping objectives simple and independent. In Cfengine, e.g., objectives are relatively simple, such as “make this
file contain these contents” or “change that parameter to be equal to this”. The issue of how these
contents affect behavior is not addressed. If one avoids the meaning of content, and specifies only
content, the problem is fundamentally simpler; orthogonality of operations is assured. Thus, the
human system administrators are required to translate behavior requirements to file content requirements or parameter setting requirements. In using many currently available tools, composability is
actually assured by system administrators. System administrators may need a long period to gain
the experience of making the appropriate decisions when making changes to satisfy a behavioral
requirement, and must take the totality of any existing configuration into account when making
changes. This is partly why the complexity of configuration remains a significant barrier to understanding by new staff.
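A minimal sketch of a convergent operation of the “make this file contain these contents” kind (not Cfengine’s code; the path and contents are placeholders):

# Repeated runs after success change nothing, so the operation is idempotent
# and converges toward the declared content.
from pathlib import Path

def ensure_file_contents(path, desired):
    """Bring the file to the desired contents; do nothing if already there."""
    p = Path(path)
    if p.exists() and p.read_text() == desired:
        return False                      # already converged, no change made
    p.write_text(desired)                 # otherwise move toward the ideal state
    return True

# Example (placeholder path and content):
# ensure_file_contents("/tmp/resolv.conf.example", "nameserver 10.0.0.53\n")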
Chapter 13
Ways to Reduce Complexity
In this chapter, we explore how intractable problems are managed in general based upon computation theory.
Even though many problems are intractable, humans have lived with them for centuries. Trains
need to be scheduled, salesmen must plan their sales trips, and thieves have to decide in very limited time which items to pick up, even though TRAIN SCHEDULING, TRAVELING SALESMAN, and INTEGER KNAPSACK are all NP-complete problems. The methods by which
intractable problems are solved in real life can give some insight into how our composability problem
might be solved.
There are three approaches to getting around intractability:
• choosing easy instances to solve
• approximation: trading optimality for computability
• memoization and dynamic programming
In the following sections we will discuss some of these strategies that apply to configuration
management.
13.1 Choosing Easy Instances to Solve
NP-completeness refers to worst case complexity. Even if a problem is NP-complete, many subsets of
instances can be solved in polynomial time. We suggest three different ways to reduce complexity in
configuration management: reduction of state space, using simple operations and forming hierarchies
and relationships among operations.
13.1.1 Reduction of State Space
Composability can be made relatively easy by use of a reduced-size state space, as we have seen in
the current configuration management tools. For details, see Section 12.7.
13.1.2 Using Simple Operations
MINIMUM COVER is solvable in polynomial time by matching techniques if all of the candidate
sets c in the cover set C have |c| ≤ 2 [43]. This implies that relatively simple operations that only
address one or two user requirements can be composed efficiently.
That composability is tractable in this case can be understood by considering the components
of graphical user interfaces. In these systems, typical components are widgets with very limited
functions. The dependencies between widgets are obvious. The semantics of widgets are constrained
for easy reuse when the widget is used in a different context. For example, the semantics for a
button is to trigger an event when it is clicked or released. There are no underlying assumptions
or dependencies that are not obvious from the context. The semantics of widgets and the limited
domains in which the widgets are used are simple enough that solutions for syntactic composability
are also solutions for semantic composability.
Configuration management is also a somewhat “constrained” domain. Configuration of a site
typically includes:
1. Editing the bootstrap scripts.
2. Configuring internal services, typically DNS, LDAP, NFS, Web, etc.
3. Installing and configuring software packages.
We can consider building smaller operations that are designed to work together, engineering
them to a common framework, and then scaling up.
13.1.3 Forming Hierarchies and Relationships
MINIMUM COVER is in P if each element of C except the largest element is laminarily “contained” in some other element, i.e., ci ⊆ cj if |ci| ≤ |cj|. In our composability problem, nesting means operations are sorted in increasing order of the number of objectives that they can
achieve; each operation can accomplish at least the objectives of the previous one. Composability of
operations then becomes trivial: always select the least upper bound of the objectives. Also, MINIMUM COVER can be reduced to INTEGER KNAPSACK via surrogate relaxation[67]. The INTEGER KNAPSACK problem is still NP-complete; however, it is polynomially solvable if the values of the items are in increasing order and each value is greater than the sum of the values before it. In this case we can solve the problem quickly with a simple greedy algorithm. In addition, the complexity of MINIMUM COVER can be reduced using a divide-and-conquer method if the subsets are organized into
disjoint sequences of subsets.
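As a concrete sketch of the greedy case, consider the 0/1 subset-sum variant with a superincreasing value sequence (each value greater than the sum of those before it), where a single scan from the largest value downward decides exact representability:

def superincreasing_subset(values, target):
    """Return values summing exactly to target, or None if impossible.

    Assumes `values` is sorted ascending and is superincreasing."""
    chosen = []
    remaining = target
    for v in reversed(values):            # scan from the largest value down
        if v <= remaining:
            chosen.append(v)
            remaining -= v
    return chosen if remaining == 0 else None

print(superincreasing_subset([1, 2, 4, 9, 20, 38], 34))   # [20, 9, 4, 1]
print(superincreasing_subset([1, 2, 4, 9, 20, 38], 17))   # None (17 is not representable)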
These simplifications of NP-completeness suggest that we can reduce complexity by forming
hierarchies and relationships within the set of operations. In the following, we propose a system
managed by closures[30].
In the theory of closure, configuration parameters are grouped into two distinct sets: exterior
parameters and interior parameters. Exterior parameters of a closure completely determine the
behavior of the closure. Interior parameters are those that cannot be observed through behavior. For example, the port number of a web server is exterior; the name of its root directory
may not be. The former is necessary to pass any behavioral test; the latter may change without
affecting external behavior at all. One closure contains another through parameter dominance, i.e.,
the exterior parameters of the dominated closure must be present in the parameter space of the
dominant closure. Thus the dominant closure controls the behavior of dominated closure. The
parameter space of a closure is either disjoint from that of all other closures or it is dominated by
some other closure. The system is composed of several disjoint sets of closures; each set is a chain of nested closures.
A system can be divided into relatively large, independent (orthogonal) subsystems, and its configuration constructed with a few large closures. These closures “contain” smaller closures through parameter dominance. Smaller closures “contain” even smaller closures, and at the bottom are the simple operations we suggested in the previous subsection.
13.2 Approximation
In practice, if a tractable algorithm cannot be found to solve a problem exactly, a near-optimal
solution is used instead; this is called approximation.
In many cases, configuration management does not need to be optimal. Any sequence of operations that will accomplish the result will do. One criterion of increasing importance is cost. It is not
worthwhile to apply much effort in searching for the optimal solution when a near-optimal solution
is “good enough” with low cost, especially when we are composing mostly simple and orthogonal
operations. The reason for this is that one can embody a near-optimal sequence, as long as the length of the sequence is polynomial in the number of available operations, into a script that can be executed efficiently. In other words, scripting precludes the need for optimality. Moreover, to avoid latent preconditions, a repeatable composition is more important than an optimal composition. However, even if we can accept non-optimal solutions, a general polynomial algorithm for finding any sequence of operations that satisfies the requirements can be difficult to construct.
In addition, most scripts are complex entities, and the cost of managing and maintaining them is sometimes significant; keeping the number of scripts to a minimum may pay off in the long run. Moreover, since composition of operations is often repeated in similar situations, it can make sense to spend more effort to find the optimal solution once and amortize the cost over all repetitions.
13.3 Dynamic Programming
INTEGER KNAPSACK can be solved in pseudo-polynomial time with dynamic programming. In
dynamic programming, we maintain memory of subproblem solutions, trading space for time. The
two key ingredients that make dynamic programming applicable are optimal substructure (a global
optimal solution contains within it optimal solutions to subproblems) and overlapping subproblems
(subproblems are revisited by the algorithm over and over again).
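A minimal sketch of the pseudo-polynomial dynamic program for the INTEGER KNAPSACK decision problem (can some multiset of item sizes sum exactly to the capacity?):

# Runs in O(capacity * len(sizes)) time and space by memoizing which
# smaller totals are reachable, trading space for time.

def knapsack_exact(sizes, capacity):
    reachable = [False] * (capacity + 1)
    reachable[0] = True                    # the empty multiset sums to 0
    for total in range(1, capacity + 1):
        reachable[total] = any(s <= total and reachable[total - s]
                               for s in sizes)
    return reachable[capacity]

print(knapsack_exact([6, 10, 15], 61))     # True  (6 + 10 + 15 + 15 + 15)
print(knapsack_exact([6, 10, 15], 13))     # False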
System configuration can often be composed from configurations of relatively independent subsystems. For example, configuration of a typical departmental site includes a series of configurations
of server subsystems, i.e., web service, file system service, domain service, mail service, etc. The
optimal solution for configuration of the whole system is often a union of optimal solutions for configurations of subsystems. This implies optimal substructure. One might argue that when resource
competition between different subsystems occurs, optimal substructure does not exist, since the
interests of subsystems are not consistent. This argument mainly affects performance of dynamic
programming. In our composability problem, performance or security requirements can be encoded within the requirements. We are searching for the shortest sequence of configuration operations that achieves the requirements. The shortest sequence for the whole system is composed of the shortest sequences for its subsystems.
Subproblems such as deploying a specific service are repeated constantly in configuring a large
network. For example, file editing is used almost everywhere in configuration since services are
typically carried out by various daemons which read configuration files for instructions. As another example, disk partitioning of a group of hosts is repeated on every individual host. Thus we have the ingredient of overlapping subproblems.
The idea of dynamic programming is to memoize previously computed solutions of subproblems and save them for future use. If we generalize this idea, complicated configuration management can be simplified by forming a set of “best practices”, i.e., a set of solutions that have been used
and proven to be effective. If a community of administrators agrees to utilize the same initial
baseline state for each kind of host and the same set of operations for configuring hosts at multiple
sites, system administrators can then amortize the cost of forming the practices by aggregating
effective practices in a global knowledge base[37, 39]. The salient feature of best practices is that
they keep latent states to a minimum and thus reduce the amount that a system administrator has
to remember in maintaining a large network. Without some form of consistent practice, routine
problem-solving causes a state explosion in which randomly selected modifications are applied to
hosts for each kind of problem. For example, we could decide to solve the problem of large logfiles
by backing them up to several different locations, depending upon host. Then, finding a specific
logfile takes much longer as one must search over all possibilities, rather than codifying and utilizing
one consistent place to store them. However, one should always use “best practices” cautiously, asking questions such as: in what sense is a practice best? When, and for whom?
Chapter 14
Conclusions
The research of this thesis was motivated by an apparent lack of fundamental theories of system
administration. System administration as an emerging field in computer science and computer
engineering has often been considered to be a “practice” with no theoretical underpinnings. In this
thesis, we began to define a theory of system administration, based upon two activities of the system
administrator: configuration management and dependency analysis. In this chapter, we summarize
the contributions of this thesis to the theory of system administration and make suggestions for
future work.
14.1 Review
System administration concerns every aspect of the operational management of human-computer systems. Our research concentrates on one of its subparts, configuration management: the activity of initially configuring or reconfiguring a network of computers according to policies or policy
changes. As the complexity of computing systems and the demands of their roles increase rapidly,
configuration management becomes increasingly difficult.
In the course of this thesis we have developed a theoretical framework to study and examine
the complexity of configuration management. Two kinds of automata were constructed: one based upon the actual configuration and the other based upon observed behavior. The first is deterministic but can be arbitrarily large; the second is non-deterministic, with its size bounded by 2^|T|, where T is the test set.
The nondeterminism of operations on observed states is one of the most challenging problems
of configuration management. If the effect of an operation is not predictable and reproducible,
it is difficult to maintain it for repeated use or for large-scale networks. In the discussion of
reproducibility of configuration operations, we have shown that for one host in isolation and for
some configuration processes, reproducibility of observed effect for a configuration process is a
statically verifiable property of the process. However, reproducibility of populations of hosts can
only be verified by explicit testing. Using configuration processes verified to be locally reproducible,
we can identify latent preconditions that affect behavior among a population of hosts.
Much attention has been paid to the limits imposed upon configuration operations and how
they affect the usability of a set of operations. Based upon our theoretical framework, we formally
defined many limits of configuration operations including: idempotence, statelessness, convergence,
commutativity, consistency, awareness and atomicity. Their role of reducing the complexity of
configuration management was also discussed. They all add some forms of control over combinatorial behaviors. Convergence, idempotence, sequence idempotence, statelessness add a structure
of equivalence relations between sequences of configuration operations. Commutativity limits potential results of a set of scripts so that the behaviors of different permutations of the same set
are equivalent. Consistency rules out conflicting behaviors. The limits of atomicity and awareness
make operations more tight, robust, and secure. And they are used with other limits to enhance
their functionality. However, we showed in our composability theory that only these limits are not
enough to make composability tractable.
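These limits can be checked mechanically on small finite models. The sketch below is illustrative only: the operations and the four-element observed-state space are invented, and the exhaustive checks shown here are not a substitute for the formal definitions, but they show how idempotence and commutativity constrain operation behavior.

from itertools import product

# A small observed-state space: subsets of two hypothetical attributes.
states = [frozenset(), frozenset({"a"}), frozenset({"b"}), frozenset({"a", "b"})]
ops = {
    "set_a": lambda s: s | {"a"},
    "set_b": lambda s: s | {"b"},
    "toggle_a": lambda s: s ^ {"a"},  # deliberately not idempotent
}

def idempotent(op):
    # An operation is idempotent if applying it twice equals applying it once.
    return all(op(op(s)) == op(s) for s in states)

def commute(p, q):
    # Two operations commute if their order of application never matters.
    return all(p(q(s)) == q(p(s)) for s in states)

print({name: idempotent(op) for name, op in ops.items()})
# e.g. {'set_a': True, 'set_b': True, 'toggle_a': False}
print([(a, b) for a, b in product(ops, repeat=2) if not commute(ops[a], ops[b])])
# e.g. only the pairs involving 'set_a' and 'toggle_a' fail to commute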
Using commutativity and idempotence, we reduced a known NP-complete problem to the composability of configuration operations; thus the general composability problem, without any limits, is NP-hard. We also studied how the other limits affect the complexity of composition, and concluded that composition remains NP-hard regardless of whether the operations obey those limits.
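The following brute-force sketch (a hypothetical illustration, not the reduction used in the thesis) shows why composition is expensive when no structure can be exploited: deciding whether some sequence of operations reaches a goal observed state may require examining every ordering of every subset of the operations, which grows exponentially with the number of operations.

from itertools import permutations

def composable(ops, start, goal):
    """Search every ordering of every subset of ops for one reaching `goal`."""
    for r in range(len(ops) + 1):
        for names in permutations(ops, r):
            state = start
            for name in names:
                state = ops[name](state)
            if state == goal:
                return list(names)
    return None  # no composition of the given operations achieves the goal

# Hypothetical operations acting on observed states (sets of behaviors).
ops = {
    "enable_web": lambda s: s | {"web"},
    "enable_db": lambda s: s | {"db"},
    "disable_web": lambda s: s - {"web"},
}
print(composable(ops, frozenset(), frozenset({"web", "db"})))
# e.g. ['enable_web', 'enable_db']

Limits such as commutativity shrink the number of distinct orderings that must be distinguished, but, as argued above, they do not by themselves make the problem tractable.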
Dependency analysis is an important process in configuration management and in other parts of system administration. It is used in root-cause analysis, impact analysis, change analysis, and requirement analysis. We formally defined dependence using our theoretical model and studied its complexity under two approaches: white-box and black-box analysis. Our conclusion was that dependency analysis is intractable in general. System administrators get around this complexity by performing only a partial analysis of the system.
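As an informal illustration of the black-box approach (not the thesis's procedure; the components, tests, and simulated system below are invented), one can probe dependencies by disabling one component at a time and recording which previously passing tests fail.

def probe_dependencies(components, run_tests, disable, enable):
    """Return a map from test name to the components it appears to depend on."""
    baseline = run_tests()
    deps = {t: set() for t in baseline}
    for c in components:
        disable(c)
        after = run_tests()
        for t in baseline:
            if baseline[t] and not after.get(t, False):
                deps[t].add(c)
        enable(c)  # restore the component before the next probe
    return deps

# A tiny simulated system for the example.
state = {"httpd": True, "mysql": True}
def run_tests():
    return {"web_up": state["httpd"],
            "app_works": state["httpd"] and state["mysql"]}
def disable(c): state[c] = False
def enable(c): state[c] = True

print(probe_dependencies(["httpd", "mysql"], run_tests, disable, enable))
# e.g. {'web_up': {'httpd'}, 'app_works': {'httpd', 'mysql'}}

Even this one-at-a-time probing costs one full test run per component, and dependencies that appear only for combinations of components would require exponentially many probes, which is one informal way to see why black-box dependency analysis does not scale.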
By contextualizing the configuration process and reviewing current configuration strategies, we summarized how configuration management is made tractable in practice. These mechanisms include the use of experience and documentation, reduction of the state space, abstraction, re-baselining, orthogonality, and closures. For each mechanism, we discussed its drawbacks. We also made observations on how current configuration strategies simplify configuration management.
Many ways of reducing the complexity of a problem have been explored in computation theory. We made connections between computation theory and configuration management, and suggested ways to apply several complexity-reduction techniques to configuration management, including choosing easy instances, approximation, and dynamic programming.
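As one concrete example of applying approximation (an illustration only: the algorithm is the standard greedy set-cover heuristic, not a method developed in this thesis, and the mapping from operations to the requirements they satisfy is invented), one might select a small set of configuration operations whose combined effects cover a set of required behaviors.

def greedy_cover(requirements, effects):
    """Greedy set-cover heuristic: repeatedly pick the operation that
    satisfies the most still-unmet requirements (logarithmic approximation)."""
    uncovered = set(requirements)
    chosen = []
    while uncovered:
        best = max(effects, key=lambda op: len(effects[op] & uncovered))
        if not effects[best] & uncovered:
            raise ValueError("some requirements cannot be satisfied")
        chosen.append(best)
        uncovered -= effects[best]
    return chosen

# Hypothetical operations and the requirements each one satisfies.
effects = {
    "install_web_stack": {"httpd", "php"},
    "install_db": {"mysql"},
    "harden_host": {"firewall", "ssh_policy"},
    "install_lamp": {"httpd", "php", "mysql"},
}
print(greedy_cover({"httpd", "php", "mysql", "firewall", "ssh_policy"}, effects))
# e.g. ['install_lamp', 'harden_host']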
14.2 Future Work
We share the view of many researchers that increasing system complexity is quickly reaching a level beyond human ability to manage and secure. We need a revolutionary change in the way we manage our systems, and such a change, or series of changes, needs strong theoretical support.
Following the research described in this thesis, a number of projects could be undertaken:
• To explore approximation algorithms for intractable problems in order to reduce complexity.
Our previous research proved that self-configuration, an important part of autonomic computing, is intractable in the general case without further limitations. We wish to continue
studying applications of approximation algorithms to achieve tractability.
• To build a theory of policy in system administration.
There is a direct relationship between an organization’s policies and the expense of maintaining
computing infrastructure. There is much research on the role of policy in business that is
unknown to the system administration community. We intend to systematically study the
effect of policy on maintenance cost and system integrity.
• To address the complexity of reconfiguration and the optimization of solutions with a cost model.
Reconfiguration is much more difficult than initial configuration because it deals with a potential diversity of states rather than a single, repeatable initial state. One recurrent problem is that it is often not obvious whether it is less expensive to change an existing configuration or to start over and configure the machine from scratch: what practitioners call a “bare metal rebuild”. We plan to incorporate a cost model to enable optimization among these choices.
Bibliography
[1] Discussion of large scale system configuration issues. http://lists.inf.ed.ac.uk/mailman/listinfo/lssconfdiscuss.
[2] Sys admin - the journal for unix and linux system administrators. http://www.samag.com/.
[3] E. Anderson, M. Burgess, and A. Couch. Selected Papers in Network and System Administration. J. Wiley & Sons, Chichester, 2001.
[4] E. Anderson and D. Patterson. Extensible, scalable monitoring for clusters of computers.
Proceedings of the Eleventh Large Installation System Administration Conference (LISA XI)
(USENIX Association: Berkeley, CA), page 9, 1997.
[5] E. Anderson and D. Patterson. A retrospective on twelve years of lisa proceedings. Proceedings
of the Thirteenth Large Installation System Administration Conference (LISA XIII) (USENIX
Association: Berkeley, CA), page 95, 1999.
[6] P. Anderson. Towards a high level machine configuration system. Proceedings of the Eighth
Large Installation System Administration Conference (LISA VIII) (USENIX Association:
Berkeley, CA), page 19, 1994.
[7] P. Anderson, G. Beckett, K. Kavoussanakis, G. Mecheneau, J. Paterson, and P. Toft. Experiences and challenges of large-scale system configuration. 2003.
[8] P. Anderson, P. Goldsack, and J. Patterson. Smartfrog meets lcfg: autonomous reconfiguration with central policy control. Proceedings of the Seventeenth Large Installation System
Administration Conference (LISA XVII) (USENIX Association: San Diego, CA), 2003.
[9] J. Apisdorf, K. Claffy, K. Thompson, and R. Wilder. Oc3mon: flexible, affordable, high performance statistics collection. Proceedings of the Tenth Large Installation System Administration
Conference (LISA X) (USENIX Association: Berkeley, CA), page 97, 1996.
[10] AT&T. Virtual network computing. http://www.uk.research.att.com/vnc.
[11] S. Bagchi, G. Kar, and J. Hellerstein. Dependency analysis in distributed systems using fault
injection: application to problem determination in an e-commerce environment. Proceedings of
the Workshop on Large Installation System Administration III (USENIX Association: Berkeley, CA, 1989), 1989.
[12] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[13] A. Brown, G. Kar, and A. Keller. An active approach to characterizing dynamic dependencies
for problem determination in a distributed environment. Proceedings of the Seventh IFIP/IEEE
International Symposium on Integrated Network Management (Seattle, WA, 2001), 2001.
[14] H. Burch and B. Cheswick. Tracing anonymous packets to their approximate source. Proceedings of the Fourteenth Large Installation System Administration Conference (LISA XIV)
(USENIX Association: Berkeley, CA), page 319, 2000.
[15] M. Burgess. A site configuration engine. Computing systems (MIT Press: Cambridge MA),
8:309, 1995.
[16] M. Burgess. Computer immunology. Proceedings of the Twelfth Large Installation System
Administration Conference (LISA XII) (USENIX Association: Berkeley, CA), page 283, 1998.
[17] M. Burgess. Principles of Network and System Administration. J. Wiley & Sons, Chichester,
2000.
[18] M. Burgess. Theoretical system administration. Proceedings of the Fourteenth Large Installation System Administration Conference (LISA XIV) (USENIX Association: Berkeley, CA),
page 1, 2000.
[19] M. Burgess. Analytical Network and System Administration — Managing Human-Computer
Systems. J. Wiley & Sons, Chichester, 2004.
[20] M. Burgess and R. Ralston. Distributed resource administration using cfengine. Software
practice and experience, 27:1083, 1997.
[21] The RPM community. The rpm package manager (rpm). http://www.rpm.org/.
[22] The Biometric Consortium. The biometric consortium. http://www.biometrics.org/.
[23] M.A. Cooper. Overhauling rdist for the ’90s. Proceedings of the Sixth Large Installation System
Administration Conference (LISA VI) (USENIX Association: Berkeley, CA), page 175, 1992.
[24] T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. The MIT Press, Cambridge, Massachusetts, 1989.
[25] A. Couch. Slink: simple, effective filesystem maintenance abstractions for community-based
administration. Proceedings of the Tenth Large Installation System Administration Conference
(LISA X) (USENIX Association: Berkeley, CA), page 205, 1996.
[26] A. Couch. Chaos out of order: a simple, scalable file distribution facility for intentionally
heterogeneous networks. Proceedings of the Eleventh Large Installation System Administration
Conference (LISA XI) (USENIX Association: Berkeley, CA), page 169, 1997.
[27] A. Couch. An expectant chat about script maturity. Proceedings of the Fourteenth Large
Installation System Administration Conference (LISA XIV) (USENIX Association: Berkeley,
CA), page 15, 2000.
[28] A. Couch and Noah Daniels. The maelstrom: network service debugging via ‘ineffective procedures’. Proceedings of the Fifteenth Large Installation System Administration Conference
(LISA XV) (USENIX Association: Berkeley, CA), 2001.
[29] A. Couch and M. Gilfix. It’s elementary, dear watson: applying logic programming to convergent system management processes. Proceedings of the Thirteenth Large Installation System
Administration Conference (LISA XIII) (USENIX Association: Berkeley, CA), page 123, 1999.
[30] A. Couch, J. Hart, E.G. Idhaw, and D. Kallas. Seeking closure in an open world: a behavioural
agent approach to configuration management. Proceedings of the Seventeenth Large Installation
System Administration Conference (LISA XVII) (USENIX Association: Berkeley, CA), page
129, 2003.
[31] A. Couch and S. Schwartzberg. Experience in implementing an http service closure. Proceedings of the Eighteenth Large Installation System Administration Conference (LISA XVIII)
(USENIX Association: Berkeley, CA), page 213, 2004.
[32] A. Couch and Y. Sun. On the algebraic structure of convergence. LNCS, Proc. 14th IFIP/IEEE
International Workshop on Distributed Systems: Operations and Management, Heidelberg,
Germany, pages 28–40, 2003.
[33] A. Couch and Y. Sun. On observed reproducibility in network configuration management.
Science of Computer Programming, 53:215–253, 2004.
[34] A. Couch, N. Wu, and H. Susanto. Toward a cost model for system administration. Proceedings
of the Nineteenth Large Installation System Administration Conference (LISA 05) (USENIX
Association: San Diego, CA), pages 125–141, 2005.
[35] Alva Couch. System configuration management. In The Elsevier System Management Handbook, 2006.
[36] D.A. Curry, S.D. Kimery, K.C. De La Croix, and J.R. Schwab. Acmaint: an account creation
and maintenance system for distributed unix systems. Proceedings of the Fourth Large Installation System Administration Conference (LISA IV) (USENIX Association: Berkeley, CA,
1990), page 1, 1990.
[37] G. Halprin et al. Sa-bok (the systems administration body of knowledge). http://www.sysadmin.com.au/sa-bok.html.
[38] L. J. Osterweil et al. Strategic directions in software quality. ACM Computing Surveys, 4:738–
750, 1996.
[39] R. Kolstad et al. The sysadmin book of knowledge gateway. http://ace.delos.com/taxongate.
[40] Yi-Min Wang et al. Strider: a black-box, state-based approach to change and configuration
management and support. Proceedings of the Seventeenth Large Installation System Administration Conference (LISA XVII) (USENIX Association: San Diego, CA), 2003.
[41] J. Finke. Monitoring usage of workstations with a relational database. Proceedings of the
Eighth Large Installation System Administration Conference (LISA VIII) (USENIX Association: Berkeley, CA), 1994.
[42] M. Fletcher. nlp: a network printing tool. Proceedings of the Sixth Large Installation System
Administration Conference (LISA VI) (USENIX Association: Long Beach, CA), pages 245–256, 1992.
[43] M. R. Garey and D. S. Johnson. Computers and Intractability, A Guide to the Theory of
NP-Completeness. Freeman, New York, NY, 1979.
[44] D. Geer and S. Charney. Debate: is an operating system monoculture a threat to security?
Talk in USENIX 04, Boston, MA, 2004.
[45] M. Gilfix and A. Couch. Peep (the network auralizer): monitoring your network with sound. Proceedings of the Fourteenth Large Installation System Administration Conference (LISA XIV)
(USENIX Association: Berkeley, CA), page 109, 2000.
[46] L. Girardin and D. Brodbeck. A visual approach for monitoring logs. Proceedings of the
Twelfth Large Installation System Administration Conference (LISA XII) (USENIX Association: Berkeley, CA), page 299, 1998.
[47] J. Greely. A flexible filesystem cleanup utility. Proceedings of the Fifth Large Installation
System Administration Conference (LISA V) (USENIX Association: Berkeley, CA), page 105,
1991.
[48] Free Standards Group. Lsb - linux standards base. http://www.linuxbase.org.
[49] Object Management Group. Common object request broker architecture. http://www.omg.org/corba.
[50] Research Systems Unix Group. Radmind. http://rsug.itd.umich.edu/software/radmind/.
[51] B. Hagemark and K. Zadeck. Site: a language and system for configuring many computers as
one computer site. Proceedings of the Workshop on Large Installation System Administration
III (USENIX Association: Berkeley, CA, 1989), page 1, 1989.
[52] S.E. Hansen and E.T. Atkins. Automated system monitoring and notification with swatch.
Proceedings of the Seventh Large Installation System Administration Conference (LISA VII)
(USENIX Association: Berkeley, CA), page 145, 1993.
[53] D.R. Hardy and H.M. Morreale. Buzzerd: automated system monitoring with notification
in a network environment. Proceedings of the Sixth Large Installation System Administration
Conference (LISA VI) (USENIX Association: Berkeley, CA), page 203, 1992.
[54] R. Harker. Selectively rejecting spam using sendmail. Proceedings of the Eleventh Large Installation System Administration Conference (LISA XI) (USENIX Association: Berkeley, CA),
page 205, 1997.
[55] John Hart and Jeffrey D’Amelia. An analysis of rpm validation drift. Proceedings of the Sixteenth Large Installation System Administration Conference (LISA 02) (USENIX Association:
Berkeley, CA, 2002), pages 155–166, 2002.
[56] J. Hellerstein. Complexity of configuration management and experience. To be published.
[57] IBM. Autonomic computing. http://www.research.ibm.com/autonomic/.
[58] IBM. Websphere. http://www-306.ibm.com/software/websphere/.
[59] L. Kanies. Practical and theoretical experience with isconf and cfengine. Proceedings of the
Seventeenth Large Installation System Administration Conference (LISA XVII) (USENIX Association: San Diego, CA), 2003.
[60] J. Kephart and D. Chess. The vision of autonomic computing. Computer, 36(1):41–52, 2003.
[61] M. Kijima. Markov Processes for Stochastic Modeling. Chapman & Hall, London, UK, 1997.
[62] S. Kirkpatrick, C. Gelatt Jr., and M.P. Vecchi. Optimization by simulated annealing. Science,
page 168, 1983.
[63] D. Koblas and P.M. Moriarty. Pits: a request management system. Proceedings of the Sixth
Large Installation System Administration Conference (LISA VI) (USENIX Association: Berkeley, CA), page 197, 1992.
[64] R. Kolstad. Tuning sendmail for large mailing lists. Proceedings of the Eleventh Large Installation System Administration Conference (LISA XI) (USENIX Association: Berkeley, CA),
page 195, 1997.
[65] RSA Laboratories. The public-key cryptography standards. http://www.rsasecurity.com/rsalabs/node.asp?id=2124.
[66] E. Lassettre, D. Coleman, Y. Diao, S. Froehlich, J. Hellerstein, L. Hsiung, T. Mummert,
M. Raghavachari, G. Parker, L. Russell, M. Surendra, V. Tseng, N. Wadia, and P. Ye. Dynamic
surge protection: an approach to handling unexpected workload surges with resource actions
that have lead times. LNCS, Proc. 14th IFIP/IEEE International Workshop on Distributed
Systems: Operations and Management, Heidelberg, Germany, 2003.
[67] L. Lorena and F. Lopes. A surrogate heuristic for set covering problem. European Journal of
Operational Research, 79:138–150, 1994.
[68] K. Manheimer, B.A. Warsaw, S.N. Clark, and W. Rowe. The depot: a framework for sharing
software installation across organizational and unix platform boundaries. Proceedings of the
Fourth Large Installation System Administration Conference (LISA IV) (USENIX Association:
Berkeley, CA, 1990), page 37, 1990.
[69] M. Metz and H. Kaye. Deejay: the dump jockey: a heterogeneous network backup system. Proceedings of the Sixth Large Installation System Administration Conference (LISA VI) (USENIX
Association: Berkeley, CA), page 115, 1992.
[70] Microsoft. Component object model. http://www.microsoft.com/com/teck/COMPlus.asp.
[71] Sun Microsystems. Enterprise javabeans. http://java.sun.com/products/ejb.
[72] K. Montgomery and D. Reynolds. Filesystem backups in a heterogeneous environment. Proceedings of the Workshop on Large Installation System Administration III (USENIX Association:
Berkeley, CA, 1989), page 95, 1989.
[73] R. Osterlund. Pikt: problem informant/killer tool. Proceedings of the Fourteenth Large Installation System Administration Conference (LISA XIV) (USENIX Association: Berkeley, CA),
page 147, 2000.
[74] D. Patterson. A simple way to estimate the cost of downtime. Proceedings of the Sixteenth Large
Installation System Administration Conference (LISA XVI) (USENIX Association: Berkeley,
CA), page 181, 2002.
[75] M. D. Petty and W. Weisel. A composability lexicon. Proceedings of the Spring 2003 Simulation
Interoperability Workshop (Orlando, FL), 2003.
[76] M. D. Petty, W. Weisel, and E. Mielke. Computational complexity of selecting models for
composition. Proceedings of the Fall 2003 Simulation Interoperability Workshop (Orlando,
FL), 2003.
[77] P. Powell and J. Mason. Lprng - an enhanced print spooler system. Proceedings of the Ninth
Large Installation System Administration Conference (LISA IX) (USENIX Association: Berkeley, CA), page 13, 1995.
[78] Redhat. Redhat enterprise linux. http://www.redhat.com/enus/USA/rhel.
[79] K. Rich and S. Auditor. hobgoblin: a file and directory auditor. Proceedings of the Workshop
on Large Installation System Administration V (USENIX Association: Berkeley, CA), 1991.
[80] C. Ruefenacht. Rust: managing problem reports and to-do lists. Proceedings of the Tenth Large
Installation System Administration Conference (LISA X) (USENIX Association: Berkeley,
CA), page 81, 1996.
[81] P. Scott. Automating 24x7 support response to telephone requests. Proceedings of the Eleventh
Large Installation System Administration Conference (LISA XI) (USENIX Association: Berkeley, CA), page 27, 1997.
[82] S. Shumway. Issues in on-line backup. Proceedings of the Fifth Large Installation System
Administration Conference (LISA V) (USENIX Association: Berkeley, CA), page 81, 1991.
[83] J. Da Silva and Ólafur Guðmundsson. The amanda network backup manager. Proceedings
of the Seventh Large Installation System Administration Conference (LISA VII) (USENIX
Association: Berkeley, CA), page 171, 1993.
[84] M. Sirbu and J. Chuang. Distributed authentication in kerberos using public key cryptography.
Internet Society 1997 Symposium on Network and Distributed System Security.
[85] H. Spencer. The amanda network backup manager. Proceedings of the Tenth Large Installation
System Administration Conference (LISA X) (USENIX Association: Berkeley, CA), 1996.
[86] Y. Sun and A. Couch. Global impact analysis of dynamic library dependencies. Proceedings
of the Fifteenth Large Installation System Administration Conference (LISA XV) (USENIX
Association: Berkeley, CA), page 145, 2001.
[87] Y. Sun and A. Couch. Composability of configuration management. In preparation, 2006.
[88] MIT Kerberos Team. Kerberos: the network authentication protocol. http://web.mit.edu/kerberos/www/.
[89] J.W. Toigo. How to architect tiered backup with d2d2t. Proceedings of the Eighteenth Large
Installation System Administration Conference (LISA XVIII) (USENIX Association: Atlanta,
GA), 2004.
[90] S. Traugott. Why order matters: turing equivalence in automated system administration.
Proceedings of the Sixteenth Large Installation System Administration Conference (LISA XVI)
(USENIX Association: Berkeley, CA), page 99, 2002.
[91] S. Traugott and J. Huddleston. Bootstrapping an infrastructure. Proceedings of the Twelfth
Large Installation System Administration Conference (LISA XII) (USENIX Association:
Berkeley, CA), page 181, 1998.
[92] Tripwire. Security scanner. http://www.tripwire.com.
[93] Usenix. Large installation system administration conference. http://www.usenix.org/events/.
[94] Wietse Venema. Tcp wrappers. http://ciac.llnl.gov/ciac/ToolsUnixNetSec.html.
[95] B. Woodard. Building an enterprise printing system. Proceedings of the Twelfth Large Installation System Administration Conference (LISA XII) (USENIX Association: Berkeley, CA),
page 219, 1998.
[96] N. Wu and A. Couch. Bootstrapping an ip closure. technical report, 2006.
[97] E.D. Zwicky. Disk space management without quotas. Proceedings of the Workshop on Large
Installation System Administration III (USENIX Association: Berkeley, CA, 1989), page 41,
1989.
[98] E.D. Zwicky, S. Simmons, and R. Dalton. Policy as a system administration tool. Proceedings of the Fourth Large Installation System Administration Conference (LISA IV) (USENIX
Association: Berkeley, CA, 1990), page 115, 1990.