Toward Data Centered Tools for Understanding and Transforming Legacy Business Applications

Satish Chandra
Jackie de Vries
John Field
Raghavan Komondoor∗
Frans Nieuwerth
Howard Hess
Manivannan Kalidasan
G. Ramalingam
Justin Xue
IBM Research
∗ Contact author: komondoo@us.ibm.com
Abstract
We assert that tools for understanding and transforming legacy
business applications should be built around logical data models,
rather than around the structure of source code artifacts or control flow. In
this position paper, we argue that data centered tools are beneficial for a variety of frequently-occurring code understanding and
transformation scenarios for legacy business applications, and outline the goals and status of the Mastery project at IBM Research,
whose aim is to build a suite of logical model recovery tools, focusing initially on data models.
Figure 1: A high-level logical data model for a typical business
application. (The diagram shows the Order, Financial Product, Company, and
Transaction entities, Transaction subtypes such as New Order, Correct, and
Delete, their key and informational attributes, and the multiplicities of the
associations among them.)
1 The Problem
For the past two years, our group at IBM’s T.J. Watson Research
Center has been investigating tools and techniques for analyzing and transforming legacy business applications, focusing on
mainframe-based systems written in Cobol. Many such applications have evolved over the course of decades, and, although they
often implement core business functionality, they are typically extremely difficult to update in a timely manner in response to new
business requirements. Some of the impediments to legacy system
evolution and transformation include:
Size: Legacy application portfolios, i.e., complete collections
of interdependent programs and persistent data stores, can be very
large. For example, one IBM customer had a portfolio consisting of
700 interdependent applications, 3,000 online datasets, 27,000 batch
jobs, and 31,000 compilation units. The sheer volume of information contained in a portfolio of this size makes it impossible for
any individual to understand the relationships between all parts of
the portfolio.
Redundancy: Over time, applications frequently develop a
great deal of redundant code and data. The reasons include incomplete
integration of software artifacts after business mergers, the creation of redundant structures for performance reasons, and
quick hacks coded by programmers adding new functionality under tight time constraints.
Technology: The software structure of legacy code often reflects the limitations of the languages, systems, and middleware on
which it was originally designed to run, long after improved systems and libraries become available.
Skills: As new languages and systems software become popular, skills in legacy languages and systems atrophy.
For all of the reasons listed above, there has been increasing interest in the use of automated and semi-automated tools to analyze
and transform legacy code in response to evolving business requirements.
Such tools include program understanding tools for extracting information from large code bases, tools for extracting semantically related code statements through techniques such as slicing, tools
for migrating from one library or middleware base to another, tools
for integrating legacy code with modern middleware, etc.
2 Importance of Data Centered Tools
Our group has worked on a number of research topics related to
various program analysis and transformation tools, including, e.g.,
language porting, slicing, and redundancy detection tools; however,
we have come to the conclusion over time that there are unique
benefits in having tools that are centered on understanding the data
manipulated by programs. We have therefore embarked on a new
project, called Mastery, which is concerned with extracting logical data models from legacy applications, linking these models in a
semantically well-founded way back to their physical realizations
in code (a physical model), and using these models and the links
between them as the foundation for a variety of program understanding and transformation tools.
Why do we believe that logical data models are critical for understanding and transforming legacy applications? Consider the
UML-style model depicted in Fig. 1. This model describes key
data structures and their interrelationships for a typical financial
application. In this case, the application is a batch application that
processes transactions pertaining to orders for financial products;
the processing of a transaction results in the creation of a
new order for a product, the correction of an error in an existing unfulfilled order, the cancellation of an unfulfilled order,
and so on.

Figure 2: An outline of the data declarations in the application represented by
the logical data model in Fig. 1, with links between data items and logical
entities shown using dashed arrows. (The declarations outlined include
TRANSACTION-REC with its REC-TYPE field, LOCAL-VARS with ORDER-BUF, ORDER-REC,
NTR-TRANSACTION-REC, and the two product records PR1-PRODUCT-REC and
PR2-PRODUCT-REC, each keyed by an 18-byte PRODUCT-KEY.)

The application represented by the model in Fig. 1 is large (around 60,000
lines of Cobol) and complex. The complexity of the code obscures its essential
functionality, which is to process different kinds of transactions pertaining
to orders for products. This functionality is expressed succinctly and at a
high level of abstraction by the data model. In other words, the “business
logic” of the application is concerned primarily with maintaining and updating
certain relationships among persistent and transient data items; therefore,
the data model embodies much of the interesting functionality of the
application, even though the model contains no representation of “code”.

It is notable that the logical data model shown in Fig. 1 differs considerably
from the data declarations in the source code of the application. Fig. 2 shows
an outline of these data declarations, with the data items linked (links shown
using dashed arrows) to the corresponding logical data model entities (this
figure contains a relevant subset of the logical model in Fig. 1). The data
declarations are spread over several source files; furthermore, they reveal
little about the structure of and relationships between the logical entities
manipulated by the application, which is obtainable only by an analysis of the
code that uses the data. As illustrated in Fig. 2, the logical data model adds
value by making information that is hidden in the code explicit, such as:

Logical entities: The logical entities manipulated by the program include
Transactions, “Delete” transactions, “Correct” transactions, Orders, etc.
Physical data items (variables) correspond to these entities; e.g., ORDER-BUF
and ORDER-REC store Orders (as indicated by the links).

Logical subtypes: Transactions are of several kinds (have several subtypes);
a sketch following this list illustrates how such subtypes are typically
realized in code.

Associations: Entities are associated with (or pertain to) other entities, as
indicated by the solid arrows. Associations have multiplicities; e.g., the
labels on the association from Transaction to Order indicate that each
transaction pertains to zero or one (existing) Orders, and that each Order has
zero or more transactions pertaining to it on any given day.

Aggregation: The information corresponding to a single financial product is
stored in two physical records, PR1-PRODUCT-REC and PR2-PRODUCT-REC, which are
tied together by their PRODUCT-KEY attribute. This
(perhaps historical) artifact is elided in the logical model, and
both records are linked to a single “Financial Product” entity.
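
To make the subtype structure concrete, the following is a minimal,
hypothetical sketch of how such logical subtypes are typically realized in
Cobol. TRANSACTION-REC and its REC-TYPE field are taken from Fig. 2; the tag
values, paragraph names, and field sizes shown here are invented for
illustration and are not taken from the actual application.

       IDENTIFICATION DIVISION.
       PROGRAM-ID. SUBTYPE-DEMO.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      *> TRANSACTION-REC as outlined in Fig. 2: a 39-byte key, a
      *> general-information group containing the REC-TYPE tag, and a
      *> filler area holding the subtype-specific fields.
       01  TRANSACTION-REC.
           05  TRANS-DET-KEY          PIC X(39).
           05  GENERAL-INFO.
               10  REC-TYPE           PIC X(3).
           05  FILLER                 PIC X(154).
       PROCEDURE DIVISION.
       MAIN-PARA.
      *> The REC-TYPE tag determines which logical subtype the record
      *> currently represents; the logical model records this as a
      *> subtype hierarchy rather than as a flat byte layout.  The
      *> literal tag values used here are hypothetical.
           EVALUATE REC-TYPE
               WHEN "NTR"  PERFORM PROCESS-NEW-ORDER
               WHEN "COR"  PERFORM PROCESS-CORRECTION
               WHEN "DEL"  PERFORM PROCESS-DELETION
               WHEN OTHER  PERFORM REPORT-BAD-TRANSACTION
           END-EVALUATE
           GOBACK.
       PROCESS-NEW-ORDER.       CONTINUE.
       PROCESS-CORRECTION.      CONTINUE.
       PROCESS-DELETION.        CONTINUE.
       REPORT-BAD-TRANSACTION.  CONTINUE.

A model extractor that recognizes this dispatch idiom can report New Order,
Correct, and Delete as logical subtypes of Transaction, which is how they
appear in Fig. 1.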
For large legacy applications, persistent data is the principal coupling mechanism connecting the components of an application
portfolio. Yet, as an application evolves to meet new business requirements, it is our observation that the structure and coherence of
the data models underlying the code “decay” faster than the structure and coherence of the basic control and process flow through
the application. We conjecture that this is because of the relative
ease of adding new functionality by creating modules that manipulate new data items that are related to, but stored and manipulated
separately from, the original data manipulated by the application,
as opposed to refactoring the basic process flow through the application.
3 Applications of Logical Data Models
Because persistent data is the backbone of most commercial applications, logical data models serve as good “anchors” around which
to base program exploration, understanding, and transformation activities. Consider the following list of very commonly occurring
code modification tasks for legacy applications:
• adding fields/attributes to an existing entity
• changing data representations (e.g., field expansion to accommodate growing numbers of customers, materials, etc.)
• building adapters to, e.g., allow one application to query and
update data in another system
• merging databases representing similar entities (e.g., accounts) maintained by merged businesses
• moving flat files to relational databases, e.g., to allow transactional online access to data previously accessed via offline
batch applications.
All of these scenarios fundamentally revolve around major
changes to data and data relationships, and only small changes to “process flow” or “business logic”, rather than the reverse. The logical
model, together with the links from logical entities to physical realizations in the source code, makes it easy to browse the application in terms of the logical entities, to understand the impact in the
source of any change to a logical entity, and hence to plan transformations.
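
As a concrete illustration of the second task above, consider a hypothetical
field expansion. ORDER-REC and the 18-byte PRODUCT-KEY come from Figs. 1 and 2;
the rest of the layout, including the digit counts, is invented for
illustration.

      *> Before: order numbers are limited to six digits.
       01  ORDER-REC.
           05  ORDER-KEY.
               10  ORD-PRODUCT-KEY    PIC X(18).
               10  ORD-ORDER-NUMBER   PIC 9(6).
           05  ORD-AMOUNT             PIC S9(9)V99 COMP-3.
      *> After: the order number is widened to nine digits.  Every
      *> copybook, record layout, sort key, and intermediate buffer
      *> that carries an order number must change in step, as must any
      *> REDEFINES of this area.
       01  ORDER-REC-EXPANDED.
           05  ORDER-KEY-EXPANDED.
               10  ORD-PRODUCT-KEY-X  PIC X(18).
               10  ORD-ORDER-NUMBER-X PIC 9(9).
           05  ORD-AMOUNT-X           PIC S9(9)V99 COMP-3.

The links from the logical “Order key” attribute in Fig. 1 to its physical
realizations enumerate exactly the declarations and buffers affected by such a
change, which is the kind of impact analysis described above.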
Logical data models have other applications, too. The “business
rules” of an application (rules that constitute the business logic)
can usually be organized around logical model entities; therefore,
logical models can be used as a basis for code documentation or
porting. Logical models can also be used to identify redundancies
and problematic relationships while integrating the applications of
businesses that merge.
For all of the reasons enumerated above, we believe that logical data models should form the foundation for legacy analysis,
understanding, and transformation tools.
4 Challenges in Building Logical Data Models
In general, the goal of the Mastery project in extracting logical data
models is to recover information at roughly the level of UML class
diagrams and OCL constraints, as if such models had been used to
design the application from scratch. That is, we are looking not just
for a diagrammatic representation of the physical data declarations,
but for a natural model corresponding to the way experienced designers
think about the system. To extract such a model programmatically
requires analysis that abstracts away physical artifacts.
In general, it can be quite challenging to construct semantically well-founded logical data models from legacy applications.
For example, Cobol programmers frequently overlay differing data
declarations on the same storage (using the REDEFINES construct).
Sometimes overlays represent disjoint unions, i.e., situations in
which the same storage is used for distinct unrelated types, typically depending on some “tag” variable. However, overlays are
also used to reinterpret the same underlying type in different, non-disjoint ways; e.g., to split an account number into distinct subfields representing various attributes of the account. The disjoint
and non-disjoint use of overlays can be distinguished in general
only by examining the way the data are used in code, and not just
by examining the data declarations.
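
The following hypothetical declarations sketch the two uses of overlays just
described; all names and layouts here are invented for illustration.

      *> (1) Disjoint union: the same 80-byte body holds unrelated
      *> record types, discriminated by the IN-REC-TYPE tag.
       01  INPUT-REC.
           05  IN-REC-TYPE            PIC X(1).
           05  IN-BODY                PIC X(80).
           05  IN-ORDER-BODY REDEFINES IN-BODY.
               10  IN-ORDER-KEY       PIC X(24).
               10  IN-ORDER-AMOUNT    PIC 9(7)V99.
               10  FILLER             PIC X(47).
           05  IN-PRODUCT-BODY REDEFINES IN-BODY.
               10  IN-PRODUCT-KEY     PIC X(18).
               10  IN-PRICE           PIC 9(7)V99.
               10  FILLER             PIC X(53).
      *> (2) Non-disjoint reinterpretation: the same account number is
      *> viewed both as an atomic value and as its constituent
      *> attributes; both views describe the same underlying type.
       01  ACCOUNT-NUMBER             PIC X(10).
       01  ACCOUNT-NUMBER-PARTS REDEFINES ACCOUNT-NUMBER.
           05  ACCT-BRANCH            PIC X(3).
           05  ACCT-CUSTOMER          PIC X(6).
           05  ACCT-CHECK-DIGIT       PIC X(1).

In the first case the logical model should report a disjoint union
discriminated by IN-REC-TYPE; in the second it should report a single
account-number type with finer-grained attributes. As noted above, which case
applies can generally be decided only from the statements that read and write
these fields, not from the declarations themselves.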
Examples of other challenging problems related to extracting
logical data models from applications include:
• Identifying logical entities and the correspondences between
program variables and logical entities.
• Identifying distinct variables that aggregate to a single logical
entity because they are always used together.
• Identifying associations (with multiplicities) between entities
based on implicit and explicit primary/foreign key relationships.
• Identifying uses of generic polymorphism, i.e., data types designed to be parameterized by other data types.
These problems are challenging because they require sophisticated analysis of the application’s code. For illustrations of these
problems (except generic polymorphism) see Figures 1 and 2, and
the discussion in Section 2.
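
Generic polymorphism is not illustrated in the figures, so the following
hypothetical declarations sketch one common form it takes in Cobol: a reusable
“envelope” record whose payload area is deliberately left uninterpreted and is
overlaid by each client with its own layout. All names and sizes here are
invented for illustration.

      *> A generic queue entry: a length field plus an uninterpreted
      *> 200-byte payload, usable with any client record type.
       01  QUEUE-ENTRY.
           05  QE-PAYLOAD-LENGTH      PIC 9(4) COMP.
           05  QE-PAYLOAD             PIC X(200).
      *> One client "instantiates" the envelope with an order layout...
       01  QE-ORDER-VIEW REDEFINES QUEUE-ENTRY.
           05  FILLER                 PIC X(2).
           05  QE-ORDER-KEY           PIC X(24).
           05  QE-ORDER-AMOUNT        PIC 9(7)V99.
           05  FILLER                 PIC X(167).
      *> ... while another instantiates it with a product layout.
       01  QE-PRODUCT-VIEW REDEFINES QUEUE-ENTRY.
           05  FILLER                 PIC X(2).
           05  QE-PRODUCT-KEY         PIC X(18).
           05  QE-PRICE-TABLE         PIC X(100).
           05  FILLER                 PIC X(82).

A model extractor that simply flattened these overlays would conflate Orders
and Financial Products; recognizing the envelope as a container parameterized
by its payload keeps the client entities distinct in the logical model.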
5 The Mastery Project and the Mastery Modeling Tool
In this section, we give a brief overview of the Mastery project at
IBM Research, whose aim is to create data centered modeling tools,
and to develop the algorithmic foundations for model extraction. In
the long run, our goal is to address process model and business rule
extraction in addition to data model extraction.
Mastery currently consists of two distinct threads: research on
novel algorithms for logical data model abstraction, and development of an interactive tool (the Mastery Modeling Tool, or MMT)
for extracting, querying, and manipulating logical models from
Cobol applications.
5.1 Model Extraction Algorithms Research
Our basic algorithmic idea is to use dataflow information to infer
information about logical types. For example, given a statement
assigning A to B, we can infer that at that point in the code A and
B have the same type (i.e., correspond to the same logical entity).
Further, if we know that K is a variable containing a key value for a
logical type τ, and K is assigned to a field F of record R, then we
can infer that there is a logical association between the type of R
and τ. We regard reads from external or persistent data sources as
defining “basic entities” from which more elaborate relations can
be inferred. We have also incorporated a heuristic that uses the
declarations of variables to infer subtype relationships.
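
As a minimal, hypothetical illustration of these inferences, consider the
following fragment. The record names follow Fig. 2, but the declarations and
statements are invented and greatly simplified.

       IDENTIFICATION DIVISION.
       PROGRAM-ID. INFER-DEMO.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  ORDER-REC.
           05  ORD-PRODUCT-KEY        PIC X(18).
           05  ORD-ORDER-NUMBER       PIC 9(6).
           05  ORD-AMOUNT             PIC 9(7)V99.
       01  ORDER-BUF                  PIC X(33).
       01  PR1-PRODUCT-REC.
           05  PRODUCT-KEY            PIC X(18).
           05  PRICE-TABLE            PIC X(100).
       PROCEDURE DIVISION.
       MAIN-PARA.
      *> From this copy the analysis unifies the logical types of
      *> ORDER-BUF and ORDER-REC: both hold Orders.
           MOVE ORDER-REC TO ORDER-BUF
      *> In the full application, PRODUCT-KEY is the key under which
      *> product records are read, so the analysis treats it as
      *> identifying the Financial Product entity; copying it into a
      *> field of ORDER-REC therefore suggests an association between
      *> the Order and Financial Product entities.
           MOVE PRODUCT-KEY OF PR1-PRODUCT-REC
                TO ORD-PRODUCT-KEY OF ORDER-REC
           GOBACK.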
The simple ideas outlined above are greatly complicated by various Cobol programming idioms, particularly the elaborate use of
overlays. The details of our work regarding these aspects are outside the scope of this paper.
5.2 The Mastery Modeling Tool
The Mastery Modeling Tool is built on the open source Eclipse [10]
tool framework. A snapshot of the tool is depicted in Fig. 3. The
tool contains several components. An importer processes Cobol
source files and Cobol copybooks to build a physical model that
can be browsed inside the tool. A model extractor builds a
UML-class-diagram-like logical model (as described in Section 5.1), and
links that model to the physical data model. Currently, the model
extractor identifies basic logical types, subtypes, and association
relations. Extensive model view facilities allow logical models and
their links to physical data items to be browsed and filtered from
many perspectives. A query facility allows expressive queries on
the properties of models and links. Such queries can be used, e.g.,
to understand the impact of various proposed updates to the application, from a data perspective. A report generation capability
generates HTML-based reports depicting the logical and physical
models for easy browsing outside the MMT tool.

Figure 3: Screen snapshot depicting the Mastery Modeling Tool (MMT).

6 Related Work
There are commercially available tools that assist with data modeling, e.g., ERWin [6], Rational Rose [8], Relativity [7], SEEC [11],
and Crystal [9]. However, none of these tools infers a logical data
model for an existing application by analyzing how data is used in
the code of the application. They either support forward engineering of applications from data models (ERWin and Rose), or infer
data models from data declarations alone (Relativity, SEEC, and
Rose), or infer logical data models by analyzing the contents of
data stores as opposed to the code that uses the data (Crystal).
Previously reported academic work on using type inference for
program understanding includes [13, 4, 16, 14]. Of these, the approaches of [16, 14] are close in spirit to the type inference component of our model extractor; however, our approach has certain
distinctions, such as the use of a heuristic to distinguish disjoint uses
of redefined variables.
Other previous approaches to inferring logical data models (or
certain aspects of logical data models) that are not based on type
inference include [2, 3, 5, 15]; details of these approaches are outside the
scope of this paper. Examples of approaches for inferring business
rules (as opposed to data models) are [12, 1].
References
[1] S. Blazy and P. Facon. Partial evaluation for the understanding of Fortran programs. Intl. Journal of Softw. Engg. and
Knowledge Engg., 4(4):535–559, 1994.
[2] G. Canfora, A. Cimitile, and G. A. D. Lucca. Recovering a
conceptual data model from Cobol code. In Proc. 8th Intl.
Conf. on Softw. Engg. and Knowledge Engg. (SEKE ’96),
pages 277–284. Knowledge Systems Institute, 1996.
[3] B. Demsky and M. Rinard. Role-based exploration of object-oriented programs. In Proc. 24th Intl. Conf. on Softw. Engg.,
pages 313–324. ACM Press, 2002.
[4] P. H. Eidorff, F. Henglein, C. Mossin, H. Niss, M. H.
Sorensen, and M. Tofte. AnnoDomini: from type theory to
year 2000 conversion tool. In Proc. 26th ACM SIGPLAN-SIGACT Symp. on Principles of Programming Languages,
pages 1–14. ACM Press, 1999.
[5] M. D. Ernst, J. Cockrell, W. G. Griswold, and D. Notkin.
Dynamically discovering likely program invariants to support
program evolution. In Proc. 21st Intl. Conf. on Softw. Engg.,
pages 213–224. IEEE Computer Society Press, 1999.
[6] http://ca.com.
[7] http://relativity.com.
[8] http://www-306.ibm.com/software/rational.
[9] http://www.businessobjects.com.
[10] http://www.eclipse.org.
[11] http://www.seec.com.
[12] F. Lanubile and G. Visaggio. Function recovery based on program slicing. In Proc. Conf. on Software Maintenance, pages
396–404. IEEE Computer Society, 1993.
[13] R. O’Callahan and D. Jackson. Lackwit: a program understanding tool based on type inference. In Proc. 19th Intl. Conf.
on Softw. Engg., pages 338–348. ACM Press, 1997.
[14] G. Ramalingam, J. Field, and F. Tip. Aggregate structure
identification and its application to program analysis. In Proc.
26th ACM SIGPLAN-SIGACT Symp. on Principles of Programming Languages, pages 119–132. ACM Press, 1999.
[15] A. van Deursen and T. Kuipers. Identifying objects using
cluster and concept analysis. In Proc. 21st Intl. Conf. on
Softw. Engg., pages 246–255. IEEE Computer Society Press,
1999.
[16] A. van Deursen and L. Moonen. Understanding COBOL systems using inferred types. In Proc. 7th Intl. Workshop on Program Comprehension, pages 74–81. IEEE Computer Society
Press, 1999.