CLOPS: A DSL for Command Line Options

advertisement
CLOPS: A DSL for Command Line Options
Mikoláš Janota, Fintan Fairmichael, Viliam Holub,
Radu Grigore, Julien Charles, Dermot Cochran, and Joseph R. Kiniry
School of Computer Science and Informatics,
Lero — The Irish Software Engineering Research Centre, and
The Complex and Adaptive Systems Laboratory (CASL),
University College Dublin, Ireland
Abstract. Programmers often write custom parsers for the command
line input of their programs. They do so, in part, because they believe
that both their program’s parameterization and their option formats are
simple. But as the program evolves, so does the parameterization and the
available options. Gradually, option parsing, data structure complexity,
and maintenance of related program documentation becomes unwieldy.
This article introduces a novel DSL called CLOPS that lets a programmer specify command line options and their complex inter-dependencies
in a declarative fashion. The DSL is supported by a tool that generates the following features to support command line option processing:
(1) data structures to represent option values, (2) a command line parser
that performs validity checks, and (3) command line documentation. We
have exercised CLOPS by specifying the options of a small set of programs like ls, gzip, and svn which have complex command line interfaces. These examples are provided with the Open Source release of the
CLOPS system.
1
Introduction
A boiling frog is a well known phenomenon in software development. Just as a
frog meets its fate in water that is slowly heated, so do software projects become
unmanageable as their complexity gradually increases.
Processing of command line options is not an exception. The first few versions
of a program have small number of options that are easy to process and do not
have many dependencies between them.
No wonder then, that many programmers, with famous last words “this is
easy” on their lips, opt for custom code for option processing. As the number
of the program’s options and their inter-dependencies grows, option processing
code gets bloated and diverges from its documentation.
This article suggests an alternative by providing a DSL that enables programmers to explicitly, declaratively capture options and their inter-dependencies.
Furthermore, a tool is provided that produces documentation, code for command
line processing, and data structures. The data structures produced represent the
values of the options and are automatically populated during parsing.
The observations described above shaped the language proposed in this article. On the one hand, simple option sets must be easy to write when the program
is in its early stages. But on the other hand, a set of options must be easy to
evolve and even complicated option sets must be supported.
CLOPS specification
1
3
2
Developer code
4
5
2
Generated parser and
data structures
5
Generated
documentation
5
Shipped product
Fig. 1. Developer usage scenario.
There are two types of users involved in using CLOPS — the developer who
specifies the command line interface and integrates the command line parser
with the main program body, and the end-user who interacts with the final
product by executing it and providing options on the command line. Fig. 1 gives
a diagrammatic representation of a developer’s usage scenario for creating a
command line program using CLOPS.
The developer describes the command line of the tool in the CLOPS DSL(1).
This description is used by the CLOPS tool to produce code and documentation(2). The code is used as a library(4) by the developer to parse the command
line(3). The result of parsing is a set of data structures (options) that can be
easily queried. The documentation is typically presented to the end-user(5).
The end-user is typically not aware that the developer’s tool uses CLOPS.
This is despite the fact that CLOPS parses the command line, provides many
of the error messages, and generates the documentation that the user consults
(such as a man page and the usage message printed by the program itself).
This article contains several core contributions:
– We identify the main concepts seen in command line processing and describe
their interdependencies (Section 4).
– We propose a novel DSL called CLOPS for describing command line option
interfaces. Its syntax and semantics are described in Section 5.
– We provide a tool (its implementation is discussed in Section 6) that reads
in a description in this language and generates multiple artifacts:
• a Java implementation of the command line parser,
• documentation, and
• data structures that are used to store the result of parsing.
– (Section 7) summarizes results of applying CLOPS to well-known programs
like ls; interesting cases are discussed and the CLOPS description is compared to existing implementations.
– Section 8 is a summary of experience of one of the authors that have applied
CLOPS in projects he has been working on.
Section 9 elaborates how our research relates to other parts of Computer Science,
namely Software Product Lines. Section 10 is an overview of other approaches
tackling command line processing and Section 11 provides pointers for further
research and improvements. The upcoming section sets up the ground for further
discussion and is followed by a light-weight exposure to CLOPS DSL in Section 3.
2
Background
Command line utilities are widespread in the software community and hence
there is no need to go into great detail about them. Nevertheless, several properties of command line options that will be important for later discussion must
be reviewed.
A program is run from a command line with a list of options that affect its
behavior. For instance, the program ls lists files in a multi-column format if run
with the option -C and lists files one-per-line if run with the -1 option.
This means that the invoked program has to process the options provided
on the command line and store the result of that analysis into its state in some
fashion.
Obviously, the behaviors of these particular options, -C and -1, cannot be
put into effect at the same time. We say that there is a dependency between
these two options.
As a consequence, ls -C -1 results in the same behavior as if only -1 was
provided and vice versa, according to the principle last one wins. For such dependency we say that the options override one another. There are several kinds
of dependencies with varying resolution strategies.
Another common pattern seen in command line arguments is option arguments. To give an example, the program tail prints the last x lines of a given
input and is typically invoked with tail -n x filename. This means that option processing needs to make sure that x is indeed a number, and produce an
NAME:: encrypt
ARGS::
input :{"-i"}:{ file }:[ mustexist , canbedir="false"]
output:{"-o"}:{ file }:[ canbedir="false"]
FORMAT:: input output;
Listing 1. A simple CLOPS description of an encrypting program
error otherwise, and convert its textual representation into a numeric one. Often,
an argument has a default value. If tail is invoked without the -n option, for
instance, x is assumed to have value 10.
The set of options supported by a certain program represent its (textual) user
interface. Hence, it is important for a program’s options to be well-documented as
a program’s users cannot be expected to read its source code. In UNIX-like systems documentation is maintained in the database of man(ual) pages. As stated
before, one of the objectives of this work is to keep a program’s documentation
consistent with its behavior.
3
CLOPS Exposure
Let us imagine we are writing a program which encrypts a given file. As we want
to avoid problems with weirdly-named files (such as starting with the hyphen),
we introduce one-argument options -i filename and -o filename to enable
the user to specify the input and the output file, respectively.
Listing 1 is how we write it in CLOPS. The list under ARGS provides the
possible command line options identified as input and output. For both of the
options it specifies that they accept file-names as values. Additionally, the file
given as input must exist and none of the files can be a directory.
The FORMAT is a regular expression that tells us that both input and output
must be specified and input comes first.
The parser generated according to this CLOPS description will accept command lines such as -i foo -o bar, provided that the file foo exists.
After a while we get bored with this restrictive form and we decide to enable
the user not to specify the output file as the program will derive the name from
the name of the input file. Additionally, our colleague who has been using our
tool is annoyed by all the debugging messages that are meaningless to him and
we introduce the quiet mode. Since we have our first end-user, we add textual
descriptions so we can generate simple documentation. The result can be seen in
Listing 2. Note that for quiet we have not specified any type, which is the same
as writing quiet :{ "-q"}:{boolean}.
NAME:: encrypt1
ARGS::
input :{"-i"}:{ file }:[ mustexist , canbedir="false"]:"The input file."
output:{"-o"}:{ file }:[ canbedir="false"]:"The output file."
quiet :{"-q"}:"Turn the quiet mode on."
FORMAT:: quiet? input output?;
Listing 2. Extended version of the encrypting program.
4
Concepts
The previous sections highlighted several common patterns that are seen in command line interfaces. Namely, we have different types of options, options depend
on one another, and the order of options on the command line is important.
What follows is a concept analysis of this domain. Each of these concepts is
later reflected in the CLOPS DSL.
Options. The command line interface of a particular program focuses on a set
of options. Each option has a type such as string, number, or file. For parsing
purposes, each option is associated with a set of strings (possibly infinite) that
determines how the option is specified on the command line. For instance, for
the command head, any string of the form -n NUMBER specifies the number-oflines option. For a given option, we will call these strings match strings. Last
but not least, an option has a documentation string for end-user documentation
purposes.
Values. Parsing a command line results in an internal representation of that
particular command line. We will refer to this result as option values. As the
program executes, these values control its behavior. Each option can be set to
a value that must agree with the option’s type. For example, the value of the
number-of-lines option seen above, is an integer.
Format. The ordering of options provided on the command line is often important. Different orderings may have different meanings to the program, and
sometimes a different ordering of the same options is not valid. The format
specifies all valid sequences of options.
We should note that the format not only determines whether a sequence of
options is valid, but plays an important role in parsing. For instance, in the case
of the version control system svn, one can write svn add add, where the first
occurrence of the string add means “add a file to the repository” and the second
occurrence means that the file to be added is called add. The parser is able to
make this distinction as svn’s format specifies that a command must precede a
filename and only one command use is permitted.
Validity. Only certain combinations of option values are valid. A function that
determines whether a given combination of the values is valid or not is called a
validity function. Validity functions are used to enforce constraints on options,
for instance ensuring that two mutually exclusive options are not both enabled.
Whenever an end-user specifies option values that fail validity testing, the command line parser produces an appropriate error message.
Rules. A change in the value of one option may affect values of other options.
Such relationships are represented using rules that are triggered during parsing
to change the value of other options. More specifically, a rule consists of: a
trigger, telling us at which point of the format it may fire; a guard, a condition
for firing; and an effect, the action executed when the guard holds. Rules are used
to realize, among other features, overriding. For instance, in ls, the option -1
triggers a rule that sets -C to false, and vice versa.
Command line string. An option is associated with a set of strings that state how
that particular option is specified on the command line. However, each command
line specifies multiple options.
In practice, a command line input is represented as a list of strings. A special
character (denoted here by ) is appended to each of these strings and then the
strings are concatenated to produce the command line string. This string will
be used as the input to the parser.
Parsing partitions the given command line string into substrings, each corresponding to a different option. For instance, ls run with the command line
string -S --format=long sorts files by size (-S ) and prints them in the long
format (--format=long ).
Usually options come on the command line separately, therefore the match
strings typically contain the delimiter character. For instance, the ls command’s --format option matches on --format=FORMATSPECIFIER , where the
FORMATSPECIFIER is one of across, commas, horizontal, long, single-column,
verbose, and vertical.
In some programs, the individual match strings do not have to be delimited. For instance, ps xa is equivalent to ps x a. This means that the option x
matches on the character x optionally followed by . These patterns motivated
us to represent a command line input as a single string as we can treat such
options uniformly.
5
DSL
We define the CLOPS domain specific language by giving its syntax and its
semantics.
description →
options →
format →
fly →
override →
validity →
opt →
formatRegex →
flyRule →
overrideRule →
validityRule →
optName →
optAliases →
optType →
optProps →
optDoc →
guard →
assignment →
errorMsg →
propName →
options format fly? override? validity?
OPTIONS:: opt+
FORMAT:: formatRegex
FLY:: flyRule+
OVERRIDE:: overrideRule+
VALIDITY:: validityRule+
optName optAliases optType? optProps? optDoc?
optName | formatRegex (? | + | ? ) | formatRegex OR formatRegex
optName (: guard)? → assignment+
guard → assignment+
guard (: errorMsg)?
id
:{ regex (, regex )? }
:{ id }
:[ propName = string (, propName = string)∗ ]
string
expr
optName := expr
expr
id
Fig. 2. Syntax of the CLOPS DSL.
5.1
Syntax
Fig. 2 gives an overview of the CLOPS syntax1 . The meta-symbols used in
the right hand side of productions (∗ , + , ? , |) have the usual meaning (zero
or more times, one or more times, optional, alternative). Parentheses are also
meta-symbols. (We omit the literal round parentheses from the grammar for
clarity.) Terminals are typeset in bold. Nonterminals are typeset in roman if
their productions are present, and are typeset in italics if their productions are
missing.
There are four nonterminals not described in the grammar: id, string, regex,
and expr. An id is a Java identifier2 . A string is a Java string literal. A regex is a
Java string literal s that does not cause an exception when used in an expression
Pattern .compile(s). An expr is a side-effect-free Java expression whose type is
constrained, as discussed later.
A command line parser description in CLOPS consists of one or more option
declarations, a format declaration, and, optionally, rule declarations. The informal meaning of each of these syntax sections is summarized through a series of
examples below. A formal semantics is given in the next section.
Example 1: A Prototypical Option Declaration Here is an example of an
option declaration:
1
2
See also http://clops.sourceforge.net/clops/grammar.html
The option type is allowed to contain -, unlike Java identifiers
tabsize :{"-T", "--tabsize"}
:{ int }
:[ minvalue="0" default="8"]
:"assume given tab stops"
The name of the option is tabsize , it has aliases "-T" and "--tabsize", its
type is int , and its documentation string is "assume given tab stops". Values are
given to the option properties minvalue and default . The type int is not a Java
type, but is an option type. The built-in types include string , file , and boolean,
which is the default option type. The complete set of built-in option types is
given later in Table 1, and Section 6 discusses how the type set is extended.
The match strings of the option tabsize are the words of the regular language specified by the expression ((-T)|(--tabsize))[= ]([^ ]*)+ . The prefix ((-T)|(--tabsize)) is built from the aliases, while the suffix [= ]([^ ]*)+
is automatically defined by the option type. The suffix is overridden for any option type using the property suffixregexp .
For another example, consider again the command head. This command
permits the user to write -n NUMBER but also a shorter version -NUMBER. To
specify this shorthand syntax, the suffix is (\\d)+ (non-zero number of digits
followed by the delimiter), and the prefix - (a hyphen).
Example 2: Format Strings using Options A convention many UNIX tools
follow is to consider everything on a command line that does begin with a hyphen
character (-) as not being a filename. All file names that do start with a hyphen
must appear after the special option --. This behavior is captured in the format
string:
(Flag OR File)∗ (HyphenHyphen HyphenFile∗)?
Here Flag, File , HyphenHyphen, and HyphenFile are option names. Section 6
introduces option groups, which make formats easier to write.
Example 3: Fly Rules Fly rules are used, among other purposes, to obtain
the ‘last wins’ behavior discussed earlier with regards to the ls command. Two
fly rules that describe this interface are:
HyphenOne −> HyphenC := false
HyphenC −> HyphenOne := false
Validity rules, such as the following, are used if an error is preferable:
HyphenOne? && HyphenC? −> "Both -1 and -C were given."
For fly rules, a trigger is specified via an option name on the left of the arrow.
For override rules, the trigger is described similarly. All validity functions are
triggered when parsing finishes.
The fly rule examples above do not specify a guard, so the guard defaults
to true. The validity rule example contains an explicit guard and illustrates
some convenient syntactic sugar. The API generated for each option consists of
a method to query its value and a method to query if it was set at all. The
option name used in an expression is desugared into a call to get the option’s
value, while the option name followed by a question mark is desugared into a
call to check if the option was set. As expected, the (Java) type of the guard
must be boolean. In the future we plan to allow JML [12] expressions to be used
as guards.
The actions of fly rules are assignments. The left hand side of an assignment
names an option, while the right hand side contains an expression. The expression must have a type convertible (according to Java rules) to the type of the
option’s value. Override rules behave the same fashion, but validity rules are different. As seen in the validity rule example, the action of the rule is not written
as an assignment, but is instead implicit. In this case, the assigned option has
the type string − list , and is used to collect error messages.
The validity function is specified through validity rules. These rules are triggered after parsing has finished and, if their guard evaluates to true, the string
on the righthand side is produced as an error.
5.2
Semantics
As previously mentioned, a CLOPS description is used to generate several artifacts, such as documentation and source code. The command line parser for the
described options is the crux of these artifacts.
The mapping of the CLOPS description to a generated parser defines the
DSL’s semantics. The goal of this section is to provide an operational semantics
of the command line parser for a given CLOPS description.
To be able to discuss a CLOPS description formally, we will assume that we
have several mathematical objects that correspond to it.
Preliminaries Assume the existence of a finite state automaton A. Each transition of A is labeled with an option o and a set of rules R. The notation s →o,R s0
denotes that there is a transition from state s to state s0 in A, labeled with the
option o and the set of rules R.
Legal sequences of options, as given by the CLOPS description, correspond to
paths in the automaton that start in the starting state and end in an accepting
state. The automaton is built in the usual way from the regular expression
specified via the CLOPS format section.
A rule is a triple (g, o, e) containing a guard, an option, and a side-effect-free
value expression. The intuition behind the rules is that all the rules labeling a
transition are triggered whenever that transition is taken. When a rule (g, o, e)
is triggered, its guard g is evaluated. If the guard evaluates to true, then the
expression e is evaluated and the result is assigned to the value of the option o.
We say that such a rule has fired.
The function parse computes a value for a given option from an input string.
Hence, its type is Option × String → Value. The manner in which a String is
converted to a Value is defined by the Option in question.
The function match tries to find a matching string of a given option among
the prefixes of the command line string. Using the functional languages convention, its type is Options × String → Maybe String (). If the match is successful
it returns the prefix on which the option matches, otherwise it returns null.
The option store is a partial function V from options to their values. The
function is partial as when an option has not been set, it has no value.
Details Let us now proceed with the definition of the command line parser
based on these mathematical objects (the automaton A and the functions parse
and match).
The parsing of the command line string is a traversal of the automaton. At
each stage of the parse the available transitions from the current automaton
state represent the options that are valid for use at this point in the command
line string. Each available option is checked to see if it matches at the current
position of the command line string. If no option matches, we produce an error.
If an option matches we take the transition for that option, fire the rules that
label the same transition if their guards evaluate to true, and move forward in
the command line string by the match length.
The rest of the section defines this traversal formally.
Some new notation is convenient. The notation V [x 7→ v] stands for a function
that returns V (y) if x 6= y and v otherwise. For the side-effect-free expression e
defined on the option values V , the notation e(V ) represents the value to which
e evaluates after substituting values for options.
The operational semantics of the command line parser is given as a transition
function on the parser’s states. A state of the parser is either an error state or a
triple (c, V, s), a command line string (see Section 4), a function from options
to their values, and a state of the automaton A, respectively.
For a state (c, V, s) we define the following two auxiliary sets:
1. Let Legal be the set of options labeling a transition from s in A
Legal (s) ≡ o | (∃s0 )(s →o,R s0 )
2. Let Matched be a subset of Legal comprising options that match on c.
Matched (s, c) ≡ {o ∈ Legal (s) | match(o, c) 6= null}
The parser transition goes from the state (c, V, s) to the state (c0 , V 0 , s0 )
if Matched contains exactly one option o and s →o,R s0 . The command line c0 is
obtained from c by stripping the prefix Match(o).
The values of the options are updated by computing the function Parse for o
and by triggering, in parallel, the rules labeling the transition being taken. This
is formally written as follows.
Let V 00 = V [o 7→ parse(match(o, c))]. Then V 0 is defined as
(
e(V )
if there is a single rule (g, p, e) ∈ R and g(V 00 ) holds, or
0
V (p) =
00
V (p) there is no such a rule.
This covers the case when |Matched| = 1 and there are no conflicting rules.
However, if |Matched| = 1, and there are at least two rules whose guards evaluate
to true that assign different values to the same option, then (c, V, s) goes to
the error state.
If the set Matched is empty, there is no transition of A to be taken. If it
contains more than one option, the parsing is ambiguous. Thus, if |Matched| =
6 1
we also go to the error state.
The computation of the command line parser starts in a state with the command line string to be processed, an option value function not defined anywhere
(not assigning any values), and the start state of the automaton.
Note: In Future Work we suggest how one could check statically that the size of
the set Match is at most one.
6
Implementation
Source code and builds of the implementation are freely available for download
on the tool’s website3 . There is also a user-oriented tutorial that details the
first steps in running the tool and creating a CLOPS description. The tool has
been implemented in Java and requires a Java Runtime Environment of at least
version 5.0. Although we chose to write our implementation in Java, there is
nothing to stop our approach from being implemented in other programming
languages, as we do not rely on any language features that are unique to Java.
The overall functionality of the tool is as one might expect: it parses a specification provided in the CLOPS DSL and produces a command line parser with
consistent documentation.
6.1
Option Types
Recall that when a user declares an option in the DSL, they must specify the
type of the option. An option type determines the Java type used to represent the option’s value and, consequently, the parsing mechanism used for the
corresponding option. Table 1 lists the built-in option types.
In order to extend the CLOPS DSL, new option types may be implemented
and registered for use. An option factory defines the known set of options. Thus,
to extend the CLOPS system and provide additional types, a developer extends
the default option factory to make the system aware of new developer-provided
types. The implementation of new option types can extend the built-in types to
reuse and refine their functionality, or one may be written from the ground up,
so long as the Option interface is satisfied.
While parsing each option declaration, the DSL parser queries the option
factory for the corresponding option type. The option factory provides the necessary details of an option type: the class representing the option at runtime, as
well as the Java type that is used for the value of the option.
3
http://clops.sourceforge.net/
Name
Inherits
from
Java type
Properties
basic
boolean
None
basic
T
boolean
default[T], suffixregexp[string]
allowarg[boolean]
counted-boolean basic
int
countstart[int], countmax[int],
warnonexceedingmax[boolean],
erroronexceedingmax[boolean]
string
basic
String
string-regexp
string
String
regexp[string]
string-enum
string
String
choices[string], casesensitive[boolean]
int
basic
int
minvalue[int], maxvalue[int]
float
basic
f loat
minvalue[float], maxvalue[float]
file
basic
F ile
canexist[boolean], mustexist[boolean],
canbedir[boolean], mustbedir[boolean],
allowdash[boolean]
list
basic
ListhT i
allowmultiple[boolean], splitter[string]
string-list
list
ListhStringi
file-list
list
ListhF ilei
canexist[boolean], mustexist[boolean],
canbedir[boolean], mustbedir[boolean],
allowdash[boolean]
Table 1. Built-in option types and properties. Abstract types are typeset in
italics. T is the type of the concrete option’s value.
The information provided by the option factory about each option’s type is
used in the code generation phase. An interface is generated that provides access
to the value for each option. For instance, for a string option, a getter method
with return type String is created. For each option there is also a method with
return type boolean that determines whether an option has been set at all.
Each option type has a mechanism for indicating the properties that it accepts. When parsing a given input specification, we can check that the option
will actually accept all provided properties. Assuming there are no problems,
code is generated that sets the option properties to their specified values during
the initialization phase.
6.2
Option Groups
A tool’s format string sometimes grows quite long and is consequently difficult
to understand and maintain due to its sheer number of options, independent of
their dependencies. For this reason, we allow the options to be grouped.
An option group is defined by specifying a unique identifier paired with a list
of options and/or other groups that are contained in the group. Then, when the
identifier for a group is used in a format string, its definition is recursively expanded as a set of alternatives. Of course, these are hierarchical groups, and thus
must be acyclic. We have found that appropriate use of option groupings make
format strings much more concise, understandable, and more easily maintained.
There are many differing styles of documentation used to provide information
to the end-user on the command line options available for a given tool. For this
reason we have leveraged a powerful and mature open-source templating library,
the Apache Velocity Project’s templating engine [15], to produce documentation
from the information contained within a CLOPS description. We provide several
built-in templates for documentation generation, and if a developer requires a
different style for their documentation, they can modify an existing template, or
create their own.
7
Experiments
In order to challenge CLOPS and to measure its effectiveness, several popular
programs that have interesting option sets were described using the CLOPS
system. Programs that are difficult or impossible to accurately describe using
existing command line parsing tools without resorting to a large amount of
custom parsing code were explicitly chosen.
Sources of information as to the semantics of a given tool’s command line interface include manual pages, source code, command line help (e.g., tool --help),
as well as trial-and-error with interesting option combinations.
We found describing these tools a useful exercise that led to many adjustments to our design and implementation. The fact that we were successfully able
to describe their interfaces emphasizes the power and flexibility of our approach.
Additionally, even though all these programs are used by a large number
of people, several inconsistencies were discovered in their documentation, and
between their documentation and implementation.
Obviously, using the generated parser guarantees consistency between documentation and implementation. Moreover, the examples discussed in previous
sections show that a systematic approach to recording options often highlights
inconsistencies in their design, as well as their documentation.
7.1
ls
The program ls lists files found in the file-system, is frequently used and many
options influence which information is displayed and in what format. The GNU
implementation of ls comes with 56 options.
The options we have found the most interesting are options determining the
format of the output. There is an option format, with the enumeration domain,
where each of the enumeration values corresponds to a boolean option. The
following snippet from the manual page shows how.
--format=WORD
across -x, commas -m, horizontal -x, long -l,
single-column -1, verbose -l, vertical -C
This means that invoking ls -C should be the same as ls --format=vertical;
ls --format=horizontal as ls -x etc.
This design is clearly a hotbed of inconsistencies and we have found one. The
option format can have only one value at a time, i.e., if defined on the command
line multiple times, the last one wins. According to the POSIX standard, however, the option -1 should be assumed whenever -l is given4 . As the GNU implementation follows the standard, the call ls -l -1 results in a different output
than ls --format=long --format=single-column. The former corresponds to
ls -l while the latter to ls -1. This means that the value single-column does
not have the same effect as the option -1 and the documentation is inaccurate.
Interestingly enough, the BSD implementation of ls does not follow the
POSIX standard as ls -l -1 results in the equivalent of ls -1, i.e., the options
-l and -1 are regarded as distinct options which override one another.
We believe that situations similar to the format option arise quite often: as
the program evolves, new options are added (like -1) and at some point there is
too many that deal with a similar issue that an enumeration option is introduced
to unify them. If that is the case, maintaining backward compatibility is difficult.
We should note that the enumeration format does appear only in the GNU
implementation.
How does CLOPS help? Rules let us specify the equality between options.
The following CLOPS description snippet relates format’s value "commas" and
the boolean option commas.
commas {commas} −> format := {"commas"};
format {true} −> commas := {$(format).equals("commas")};
%$
The first identifier in the rules is the triggering option followed by a condition
and an action. The first rule specifies that if commas is set to true, the format’s
value is set to the value "commas". The second rule says that whenever the value
of format changes, commas is set to true iff format’s value is "commas".
Note that this specification is not quite symmetric, as it does not specify
what should happen when commas is set to false . In the current solution we
disallow the user to set the option explicitly to false as the following would
result in an inconsistent state (-m is a identifier of the option commas).
--format=commas -m=false
Alternatively we could unset the option format if the option representing the
current value is set to false. Not enabling boolean options to be set to false is
seen in the GNU implementation.
Writing up the rules in this fashion clarifies which options correspond to one
another.
The source code processing the options in the GNU implementation has 327
non-blank lines of code. The CLOPS description including the documentation
has 246 non-blank lines.
4
http://www.opengroup.org/onlinepubs/000095399/utilities/ls.html
7.2
gzip
The program gzip compresses files using Lempel-Ziv coding. The most interesting part of the command line interface for this tool is the compression level.
The compression level has a value in the range 1 to 9 and defaults to 6. The
compression level is specified on the command line by prepending the number
with hyphen (−4 etc.) or by the predefined human-friendly aliases --fast for
−1 and --best for −9. In CLOPS, the compression level would be defined as
an integer parameter:
compressionlevel :{"-"}:{int }:[ suffixregexp ="([1-9])\\00",
maxvalue="9",minvalue="1",default="6"]
In other words, we set the option name to - with a suffix of one digit followed
by an option separator. Compression level aliases are defined as regular options
with fly rules setting the compression level:
ARGS:: compressionlevelbest:{"--best"}
compressionlevelfast :{"--fast"}
FLY:: compressionlevelbest −> compressionlevel:={9};
compressionlevelfast −> compressionlevel:={1};
Although such a construction is simple and sufficient, there is a partial inconsistency when we use multiple options setting the compression level: for example,
specifying --best --fast would correctly set the compression level to 1, but
leave the compressionlevelbest parameter set. Similarly, we would expect option compressionlevelbest to be set to true when −9 is provided. We can fix
this with additional set/reset fly rules:
FLY::
compressionlevelbest −> compressionlevel:={9}, compressionlevelfast :={false };
compressionlevelfast −> compressionlevel:={1}, compressionlevelbest:={false };
compressionlevel {$( compressionlevel ). equals("9")} −>
compressionlevelfast :={false }, compressionlevelbest :={true};
compressionlevel {$( compressionlevel ). equals("1")} −>
compressionlevelfast :={true}, compressionlevelbest :={false };
The level of verbosity is increased with the verbose option -v. Using -v
multiple times increases the verbosity level up to a maximum of three. In CLOPS
we have a counted-boolean option that has type integer, that increases its count
each time it is used. Gzip also has a quiet option -q that turns off some output.
Using the quiet option resets the verbosity level to zero, and using the verbose
option turns off quiet. To achieve this in CLOPS we can define verbose and quiet
as the following:
ARGS::
verbose:{"-v","--verbose"}:{counted−boolean}:[countmax="3",warnonexceedingmax]
quiet :{"-q","--quiet"}
FLY::
quiet −> verbose:={0};
verbose −> quiet:={false};
8
Testimony
One of the authors (Julien Charles), who joined the project later than the other
authors, applied CLOPS to three projects he has been working on. The following
text summarizes the experience he gained. As each of the projects was in a
different stage of maturity it demonstrates three different approaches to the use
of CLOPS as well as a guide to get a better understanding of the tool.
All the tools mentioned in this section are part of the Mobius tool suite5
and are written in Java. The task was to (1) replace an existing complete frontend with CLOPS, (2) apply CLOPS to a tool with an unclearly defined list of
options, and (3) apply the CLOPS methodology to a tool with no front-end yet.
8.1
Replacing an already complete option handler
The first encounter with CLOPS was to change the front-end of an already
complete tool, Bico, a prelude generator for the proof assistant Coq. Its original
architecture for handling options was the following:
– a Main class file containing methods to parse the options and set their values
in a structure (174 lines), and
– an Option enum type containing the options’ descriptions and their exact
syntax (87 lines).
Option conversion Since the options were well defined, most of them were
easily translated to a CLOPS description.
The first difficulty we came across was with ambiguous arguments. Bico was
taking as arguments a directory and an optional list of class file names. Both
these arguments had no specific order on the command line and could interleave.
One method of dealing with this would have been to treat both of them
as Strings and distinguish between them later in the Java code. This was not
an ideal choice because CLOPS has mechanisms for recognizing files, as well as
checking their existence and type, and we wanted to use these features.
The solution (Listing 3) we arrived at was to make the directory a mandatory
first argument. The CLOPS specification mandated that the directory must
exist. The class files were specified as a list of Strings, where the Strings have to
be of a specific form (specified by the argumentshape property).
The second difficulty encountered involved the syntax of CLOPS and the
authors of the pertaining code had to be consulted. In Bico there are two options
that cannot be specified together. In CLOPS it is specified using a validity
rule, which produces an error if its condition evaluates to true. The solution
to this requirement using a validity rule is given in Listing 4. Such rules are
expressive but their syntax is obscure for an outside user and the documentation
was insufficient at the time. Still, in the Bico case, it is fairly easy to understand:
5
Their source code can be obtained at http://mobius.ucd.ie.
ARGS::
...
Dir : {}:{ file }:[ between="", mustExist="true", mustBeDir="true"]
: "Specify directory in which Bico acts (there must be
only one directory, and this argument is mandatory)."
Clazz: {}:{ string − list }:[ between="", argumentshape="[a-zA-Z.]+"]
: "Generate also the file for the specified classes, bico
must be able to find them in its class path."
...
FORMAT::
Help | (Dir (Misc | Type | Clazz )∗);
Listing 3. Solution to the first problem
{$(Map) && $(List)} checks if the value of the options Map and List are true at
the same time; $(o) expands to the value of o and && is the Java and operator.
Note that the options Map and List default to false , hence they always do
have a value.
ARGS::
...
Map: {"-m", "-map", "--map"}:[default="false"]
: "Triggers the generation of the map implementation."
List : {"-l", "-list", "--list"}:[default="false"]
: "Triggers the generation of the list implementation."
...
VALIDITY::
{$(Map) && $(List)} : "The option ’-m’ and ’-l’ cannot be
specified at the same time."
Listing 4. Solution to the second problem
Documentation The documentation templating was really useful. It facilitates the generation of correct and clear documentation of the options. Some
standard templates are provided, but they can be easily modified slightly to fit
the requirements of a particular user. For instance, the default templates show
for each validity rule two things: the code generated for the test and the explanation. From the end-user point of view, the only item of relevance is the
explanation. It was easy to edit the template and remove the code for the test,
as the template uses the Velocity language. For many programs it is also necessary to add a copyright statement to all documentation. Again, this can be
easily added by some simple template editing.
Statistics The Main file that uses CLOPS is 134 lines long. The CLOPS description is 47 lines long, for a total of 181 lines of CLOPS-related code to write.
Since the original argument-specific code for the program was 261 lines, there
was therefore a reduction of 80 lines. In summary the use of CLOPS led to a code
reduction of roughly 25% and adds information that was not originally present,
such as proper descriptions of the different arguments.
8.2
Other CLOPS usage examples
Clarifying arguments handling The second tool that we used CLOPS on was
Mobius’ DirectVCGen. It is a generator of verification conditions from annotated
Java source files. It uses ESC/Java2 as a front-end to handle the annotations
and the Java source. This case is interesting because it has options inherited
from ESC/Java2 as well as several options of its own. ESC/Java2 has over a
hundred options and DirectVCGen actually only uses a few of them, but was
not displaying an error message when an unsupported option was provided.
The CLOPS specification has allowed the identification of which arguments
were used by DirectVCGen, and what their syntax was. Five arguments were
identified, two specific to the DirectVCGen and three inherited from ESC/Java2.
Here the aspect of CLOPS that was emphasized was its ability to help provide
clear descriptions of the options as well as proper documentation.
The code to handle arguments in the original version was of 118 lines long,
and in the new version with CLOPS is 132 lines long (101 lines of Java and 31
lines of CLOPS specification). The main gain here is having a proper description
of the options which was not the case in the first version and thus making the
program more robust.
Developing with CLOPS The third tool we have used CLOPS on is LogicSync, a prelude consistency checker for ESC/Java2. The interest in using CLOPS
right from the start of development was to be able to properly test the tool even
at the early stages and also to be able to easily keep documentation up to date
as the argument syntax evolves. Thanks to the experience acquired with the two
other examples there were no specific difficulties in writing the CLOPS description. It was completed, together with the integration to the Java program, in
less than one hour.
9
Related Concepts
A CLOPS-generated parser performs two prominent operations: 1) Processes the
given sequence of strings (a command line) and returns a corresponding set of
option values. 2) Decides whether the combination of option values is valid or
not.
There is a variety of techniques aimed at describing sets of combinations of
certain entities. A grammar describes a set of sequences of tokens; such sets are
known as languages. A logic formula corresponds to a set of interpretations that
satisfy the formula.
In CLOPS we can find reflections of both: the format string defines a regular
language of options imposing restrictions on how options are sequenced on the
command line; the validity function imposes restrictions on option values.
We should note, however, that grammars are typically used for quite a different purpose than the command line as they typically correspond to complex
nested structures that drive compilers, interpreters, etc.
On the other hand, the elements on the command line live quite independently of one another and each corresponds to a wish of the user running the
program. Some elements of the command line provide necessary information for
the program to carry out its job, such as cp fileSrc fileDest—the copy program hardly can do anything if it does not know which files to copy or where.
Some elements of the command line trigger a certain behavior of the program.
For instance, the program ls will happily run with or without the argument
-C (column formatting) or the argument -1 (one-per-line formatting). Each of
these arguments corresponds to a feature of the program and the command line
enables the user to express which features he requires.
In the program ls, the feature -l (long listing) is even so popular that users
typically alias the invocation ls -l to ll. This provides us with an alternative
view on the program: the original ls represents a family of programs and ll is
a member of that family.
This brings us to another relevant domain Software Product Lines (SPL),
which studies families of programs [5]. In SPL an individual program is combination of certain features, and a family of programs is a set of feature combinations. The individual features and dependencies between them are captured in
feature model [11].
In SPL terminology, the command line options represent variability of the
program in question. In general, variability can be bound at different times, e.g.,
at compile-time or at run-time [13]. In our case, it is always bound at run-time.
From the user perspective, however, there is no difference between an alias ll
and a program called ll (the efficiency issue here is negligible).
Command line options are commonly regarded as something of small importance but we believe that is an underestimation. Deciding which options a
particular program supports determines the scope of the variability in the program. The POSIX standard for common command line utilities is an evidence
that it is not a trivial task.
In SPL, examining the scope of variability is known as feature oriented domain analysis (FODA) [11]. The variability in FODA is captured in the feature
model, which is typically captured as a feature diagram. A feature model corresponds to a CLOPS description as both are variability models.
This discussion brings us back to grammars, in SPL the relation between
grammars and feature models is not new [2]. Grammars are particularly useful
in approaches where the order of features is important [3].
10
Related Work
There is a relatively large number of libraries for processing command line options. We identified three main groups of command line libraries.
The first group consists of libraries that follow the philosophy of the original Unix getopt function (Commons CLI [6] and JSAP [10]). These libraries
usually provide only limited capabilities on top of basic getopt. Option values
are queried via a Map-like interface and are usually treated as Strings or a particular type explicitly predefined by a getter method name. Libraries from the
first group are ignorant of dependencies between options—dependencies must be
explicitly handled by the tool developer.
Libraries in the second group recognize options via annotations of variables
or methods in the source code (JewelCLI [9] and Args4j [1]). This approach has
several advantages to the getopt-like option specification. Types are extracted
directly from the source code and therefore the parser automatically checks parameter syntax. Unfortunately, libraries in the second group also ignore option
interdependencies.
The third group contains libraries that process options externally specified
(JCommando [8]). An external specification is compiled into a library, much like
in the CLOPS approach.
Table 2 summarizes and compares some of the best known of these libraries
with CLOPS in several ways. Libraries are compared based upon their approach,
features, and flexibility.
Help extraction is the ability to generate help documentation from the command line specification. Incremental options are options where multiple uses
change the state of the option in some way, for example using -v more than
once often increases the level of verbosity. Relations among options are mechanisms to specify dependencies between options (for instance, two options that
cannot be used together, an option that can only be used in conjunction with
another, etc.).
Some of the libraries allow call-backs during specific stages of a parse (e.g.
after an option has been set). Thus, for some of the abilities detailed it is possible
for the developer to write extra code to achieve the desired effect. In these cases
the libraries do not facilitate the functionality, but do not prevent it either.
Most of the libraries support short and long variants of option aliases. Supporting a list of variants, rather than only a short/long pair, is rather exceptional
(JewelCLI). Strict adherence to a finite enumeration of option-alias strings is a
limitation in the flexibility of specifying option names. For example, the tail
command prints a number of lines from the end of the input, where the exact
number is specified as an option of the form -NUMBER. Even if a list of variants
is allowed, specifying the aliases for this type of option would be too verbose to
be an acceptable approach.
The leading hyphen is used almost exclusively as an option prefix. However,
alternative prefixes are often attractive in some situations. For example, the
chmod command uses both - and + as prefixes for options and effectively distinguishes when a particular feature is to be enabled or disabled. Another example
Basic,
special
Types of option-arguments
Yes
Rules
Yes
Arbitrary
Help extraction
Fly rules
Relations among options
Option and arg. separator
Incremental options
(old UNIX option style, e.g. tar xzf)
Yes
Yes
Yes
Multiple use of one option
Option concatenation
Yes
Command line format
No
No
No
No
No
No
Last one
No
Type, attr., regexp Type, regexp
Java prim.,
class, list
Yes
Default values for options
Option-argument validation
Yes
Yes
No
No
No
No
List
Optional option-arguments
(e.g. gzip -1,-2,..,-9.)
Dynamic option names
Yes
Yes
Option-argument format
(e.g. -m X, -m=X, -mX)
Yes
Regexp
Apache 2.0
Source
Annotation
Separate
DSL
MIT
0.54
JewelCLI
1.0
CLOPS
Alternatives to leading hyphen
Option name variations
License
Specification
Version
No
No
Yes, --
No
No
Yes
First one
No
None
String
No
No
No
No
No
Short/Long
Apache 2.0
Source
Builder p.
1.1
Commons CLI
Call-backs
No
No
No
Call-backs
Yes
Last one
No
Type
Basic,
special
Yes
No
No
No
Yes
No
MIT
Source
Annotation
2.0.1
Args4j
Call-backs
No
No
Limited
Call-backs
Yes
Last one
No
Type, range
Basic
No
No
No
No
Command only
Short/Long
zlib/libpng
Separate
DSL/XML
1.01
JCommando
Table 2. Features provided by the most popular CLO libraries.
No
No
No
No
No
Yes
First one
No
Type
Java prim.,
List
Yes
No
No
No
No
Short/Long
LGPL
Source
Builder p.
2.1
JSAP
Call-backs
(hyphen required)
Yes
Yes, --
No
Call-backs
No
Call-backs
No
None
char*
No
Yes
No
No
No
Short/Long
GNU
Source
Table
getopt
is the tradition of some Windows command line tools to use slash (/) instead of
hyphen (-).
Although it is often useful, the format of a command line is another feature
omitted by most libraries. Many tools (including svn, zip, and chown) require
a command name early in a command line in order to properly process later
options. If there is no support for format within a library, as is nearly uniformly
the case, such functionality must be hard-coded.
Dependencies among options are almost always ignored in current command
line utilities and, therefore, manual testing the validity of option values is delegated to the user of the library. As mentioned in the Introduction, this leads to
overly complicated and potentially buggy code.
As an exception, we have found the JCommando [8] library. JCommando
introduces the notion of a command, which is similar to an option except that
only one command can be set by a command line. The set method for the
command is called last, after all option set methods. The fallback command
(“commandless”) is used if no others are set by the user.
A command can specify what options can/should have been set for a valid
parse. This is done using a boolean expression over option identifiers. Each identifier will evaluate to true or false at runtime according to whether that option
was set or not. The boolean expression for the provided command is evaluated
when that command is set, and if it evaluates to false an error is produced. This
is a much simpler form of validity check than we allow. In particular, there is
no way to relate options to each other outside the context of a command, nor is
there a way to use option values in the test.
Surprisingly, many libraries do not provide a mechanism to enable a developer
to specify, for example, filenames that start with a hyphen as any string starting
with a hyphen is assumed to be an option. UNIX getopt solves this problem
using the special string --, which is modeled in a CLOPS specification by the
simple aforementioned format expression (see Section 5.1).
Feature modeling [11] has a goal similar to that of our DSL. A feature model
explicitly captures the variability and commonality of a program [5]. In fact, one
can imagine a program as a family of programs whose members correspond to
the possible configurations of the feature model, as expressed via command line
options. Whereas feature models target various types of variabilities at designand compile-time, command line options represent variability resolved at the
runtime.
11
Conclusions and Future Work
Many command line tools solve the command line parsing problem using custom
code, sometimes relying on a little help from specialized libraries. “A DSL is
viewed as the final and most mature phase of the evolution of object-oriented
application frameworks,” according to Deursen et al [14]. CLOPS aims to succeed
and supersede existing libraries in this regard. Combining an analysis of the set
of existing solutions with the identification and refinement of the domain-specific
concepts of command line options, a minimized conceptual framework is defined.
Consequently, the syntax of the DSL is concise and self-documenting, reflecting
exactly this conceptual design, which lends itself to a gradual learning curve.
Additionally, while the DSL is simple enough to formalize and is easy to use, it
is also powerful enough to cover all command line conventions of which we are
aware.
The implementation is done classically, by writing a compiler that generates
Java source code. From our experience of reimplementing the interface of standard UNIX tools and implementing the interface of some other tools, CLOPS
increases a programmer’s productivity, though the quantification of such is the
subject of further work.
Some aspects are painfully missing from the current implementation. One
compelling aspect of many DSLs is that consistency checks are accomplished
at a higher level of abstraction. One such consistency check for CLOPS is the
ability to statically analyze a CLOPS description to guarantee that the generated
command line parser compiles and does not fail in certain ways. For example, the
CLOPS system might perform type-checking instead of deferring such to the Java
compiler. Creators of other DSLs found that early type-checking is sometimes
more precise typechecking [14]. Another potential static check is ensuring regular
expressions used to match options that can legally appear in the same parsing
state are disjoint. This would ensure that the Matched set contains at most
one element. (Of course, the parser would still fail at runtime when Matched is
empty.) Finally, we require, but do not check, that the expressions used in rules
do not have side-effects. Existing research on containing side-effects in languages
like Java will prove useful [7,4].
Other parts are missing, but are less troublesome. A hassle that accompanies code generation is a more complicated build process. This annoyance is
sometimes avoided by generating the parser at runtime. Of course, such an architecture change may hurt performance, but, for many command line parsing
tools, we expect performance to be less important than ease of use.
In various flavors of UNIX there are many implementations of common command line POSIX utilities such as chmod and ls. After describing a single command line interface for a given tool across a suite of implementations one might
generate command line strings automatically for system-level testing. Such work
has been done in a limited form already here at UCD for the OpenCVS project.
Finally, some of this work is applicable in the context of tools with graphical,
rather than textual, interfaces. GUI programs often store their settings in a
preferences store that is read upon start-up, and many such programs provide a
‘wizard’ (e.g., a preferences pane) to populate the settings file. While it requires
is no stretch of the imagination to envisage the CLOPS framework generating
the wizards and the parser for such GUI programs, important details need to be
clarified before this step is made.
12
Acknowledgments
This work is funded, in part, by Science Foundation Ireland under grant number
03/CE2/I303-1 to “Lero: The Irish Software Engineering Research Centre”, and
by the Information Society Technologies programme of the European Commission, Future and Emerging Technologies under the IST-2005-015905 MOBIUS
project, and by an EMBARK Scholarship from the Irish Research Council in Science, Engineering and Technology. Viliam Holub is supported by the IRCSET
Embark Postdoctoral Fellowship Scheme. The article contains only the authors’
views and the Community is not liable for any use that may be made of the
information therein.
References
1. Args4j. http://args4j.dev.java.net/.
2. Don Batory. Feature models, grammars, and propositional formulas. In H. Obbink and K. Pohl, editors, Proceedings of the 9th International Software Product
Line Conference (SPLC ’05), volume 3714 of Lecture Notes in Computer Science.
Springer-Verlag, September 2005.
3. Don Batory and Sean O’Malley. The design and implementation of hierarchical software systems with reusable components. ACM Transactions on Software
Engineering and Methodology, 1992.
4. David G. Clarke, John M. Potter, and James Noble. Ownership types for flexible
alias protection. SIGPLAN Not., 33(10):48–64, 1998.
5. Paul Clements and Linda Northrop. Software Product Lines: Practices and Patterns. Addison–Wesley Publishing Company, 2002.
6. The Apache Commons CLI library. http://commons.apache.org/cli/.
7. A. Darvas and P. Muller. Reasoning about method calls in JML specifications. In
Formal Techniques for Java-like Programs, 2005.
8. JCommando: Java command-line parser. http://jcommando.sourceforge.net/.
9. JewelCLI. http://jewelcli.sourceforge.net/.
10. JSAP: the Java Simple Argument Parser. http://www.martiansoftware.com/
jsap/.
11. Kyo C. Kang, Sholom G. Cohen, James A. Hess, William E. Novak, and A. Spencer
Peterson. Feature-oriented domain analysis (FODA), feasibility study. Technical
Report CMU/SEI-90-TR-021, SEI, Carnegie Mellon University, November 1990.
12. Gary T. Leavens, Albert L. Baker, and Clyde Ruby. Behavioral Specifications of
Business and Systems, chapter JML: A Notation for Detailed Design, pages 175–
188. Kluwer Academic Publishing, 1999.
13. M. Svahnberg, J. Van Gurp, and J. Bosch. A taxonomy of variability realization
techniques. Software-Practice and Experience, 35(8):705–754, 2005.
14. Arie van Deursen, Paul Klint, and Joost Visser. Domain-specific languages: an
annotated bibliography. SIGPLAN Not., 35(6):26–36, 2000.
15. The Apache Velocity Project. http://velocity.apache.org/.
Download