Using logical data models for understanding and
transforming legacy business applications.
by Chandra, Satish; de Vries, Jackie; Field, John; Hess,
Howard; Kalidasan, Manivannan; Raghavan, Komondoor V.; Nieuwerth,
Frans; Ramalingam, Ganesan; Xue, Justin
Show me your flowchart and conceal your tables, and I shall
continue to be mystified. Show me your tables, and I won't usually
need your flowchart; it'll be obvious.
Frederick Brooks, The Mythical Man-Month
Modifying a legacy application is typically an expensive and
time-consuming process, even when the required modifications are
conceptually very simple. We argue that this problem can be ameliorated
by adopting an approach in which logical data models of a legacy
application are used by software developers to understand, maintain, and
transform the software. In addition, we outline the goals and status of
the Mastery project at IBM Research, which aims to build a suite of
tools for automatically extracting logical models from legacy
applications, focusing initially on logical data models.
THE PROBLEM
For the past few years, our group at IBM Research has been
investigating tools and techniques for analyzing and transforming legacy
business applications, focusing on mainframe-based applications written
in COBOL. (1) Such applications are often decades old and implement core
business functionality. Yet they are difficult to update in a timely
manner in response to new business requirements due to a number of
factors that include the following:
[] Volume of code in a typical application
[] Logical structure of code has deteriorated as updates have
accumulated over time
[] Functional redundancy
[] Structure of code reflects the dated technology on which it was
built
[] Scarce technical skills
Size
Legacy application portfolios, that is, complete collections of
programs and related components, can be very large. For example, one IBM
customer had a portfolio consisting of 700 interdependent applications,
3000 online data sets, 27,000 batch jobs, and 31,000 compilation units.
The sheer volume of information contained in an application of this size
makes it impossible for an individual to understand the relationships
between all parts of the application.
Deterioration
The logical structure of code and data tends to deteriorate over
time as a result of a continuous stream of modifications and
enhancements. For large legacy applications, persistent data is the
principal coupling mechanism between components of an application
portfolio. Yet, as an application evolves to meet new business
requirements, the structure and coherence of the data models underlying
the code decay faster than the structure and coherence of the basic
control and process flow through the application. Perhaps this is
because it is relatively easy to add new functionality to an existing
application by creating modules that manipulate new data items stored
separately from the original application data. The alternative of
refactoring the basic process flow through the application to
accommodate new requirements typically requires much more intrusive
changes.
Redundancy
Over time, applications frequently accumulate a great deal of
redundant code (multiple code fragments that perform the same logical
function) and redundant data (data structures that represent the same
information, perhaps with slight differences, and are scattered
throughout the code). Reasons for this redundancy include incomplete
integration of information systems following business mergers,
performance-driven enhancements to the code, and quick "hacks"
when adding new functionality under tight schedules.
Technology
The code structure of legacy applications often reflects the
limitations of the programming languages used and the middleware on
which it was originally designed to run. In many cases, the code
structure dictated by the constraints of legacy languages and middleware
renders such systems more difficult to understand and evolve than they
would be if they had been implemented on modern platforms.
Skills
As new languages and software systems become popular, it becomes
more difficult to find people with skills in legacy languages and
systems.
Interest in the use of automated and semiautomated tools to analyze
and transform legacy code is increasing. Such tools include
program-understanding tools, tools for identifying and extracting
semantically related code statements (through techniques such as program
slicing (2)), tools for migrating from one library or middleware base to
another, tools for integrating legacy code with modern middleware, and
so on.
In the remainder of the paper, we first explain the value of
logical data models and describe a number of applications of logical
models to program-understanding and transformation tasks. Then we
describe the Mastery project, which is concerned with developing
algorithms and tools for extracting and manipulating logical data models and
the source code from which they are derived. We conclude with a brief
review of related work and some final comments.
VALUE OF LOGICAL DATA MODELS
The Mastery project is concerned with extracting logical models
from legacy applications. These logical models, which are high-level
abstractions of business processes and data relationships, are used
together with human- and machine-readable links from these logical
models back to their physical realizations in code as the foundation for
a variety of program-understanding and transformation tasks (we use
"physical" to mean "implementation-related"). The
initial focus of the Mastery project is on logical data models:
abstractions encoding essential data relationships. In this paper we
focus on applications of data models because we believe that their
utility (relative to process- and control-oriented program abstractions)
in program understanding and transformation has been under-appreciated.
Nonetheless, other kinds of logical models not covered in this paper
are also valuable; the information they provide can complement logical
data models for many of the applications we consider.
Logical data models are critical for understanding and transforming
legacy applications. Consider the UML-style (3) logical data model
depicted in Figure 1 (UML stands for Unified Modeling Language). This
model describes key data structures and their interrelationships for a
typical order-processing application. In this case, a batch application
processes transaction records pertaining to orders for parts; the
processing of a transaction may result, for example, in the creation of
a new order for a part (New Order), in the correction of an error in an
existing unfulfilled order (Correct), or in the cancelation of an
unfulfilled order (Delete).
[FIGURE 1 OMITTED]
The application represented by the model in Figure 1 is large
(around 60,000 lines of COBOL) and complex. The complexity of the code
obscures its essential functionality, which is to process different
kinds of transactions pertaining to orders for parts. This functionality
is expressed succinctly and at a high level of abstraction by the data
model. In other words, the "business logic" of the application
is concerned primarily with maintaining and updating certain
relationships among persistent and transient data items; therefore, the
data model embodies much of the interesting functionality of the
application, even though the model contains no representation of code.
It is notable that the logical data model shown in Figure 1 differs
greatly from the data declarations in the source code of the
application. Figure 2 shows an outline of these data declarations, with
the data items linked (links shown using dashed arrows) to the
corresponding logical-data-model entities (this figure contains a
relevant subset of the logical model in Figure 1). The data declarations
are spread over several source files; furthermore, they reveal little
about the structure of, and relationships between, the logical entities
manipulated by the application; that information can be obtained only by
analyzing the code that uses the data. As illustrated in Figure 2, the logical
data model adds value by making information that is hidden in the code
explicit, such as the following:
[] Logical entities--The logical entities manipulated by the
program include Transactions (i.e., requests of various types made to
the system), Orders, Parts, and so forth. Physical data items (variables)
correspond to these entities; for example, ORDER-BUF and ORDER-REC store
Orders (as indicated by the links).
[] Logical subtypes--Transactions are of several kinds (have
several subtypes), such as Delete, Correct, and New Order.
[] Associations--Entities are associated with (or pertain to) other
entities, as indicated by the red arrows in Figure 1. Associations have
multiplicities; for example, the labels on the association from
Transaction to Order indicate that each Transaction pertains to zero or
one (existing) Orders and that each Order has zero or more Transactions
pertaining to it on any given day.
[] Aggregation--The information corresponding to a single part is
stored in two physical records, PRI-PART-REC and PR2-PART-REC, which are
tied together by their PART-KEY attribute. This (perhaps historical)
artifact is elided in the logical model, and both records are linked to
a single "Part" entity.
[] Integrity constraints--Although our example does not illustrate
it, a logical data model can also include semantic integrity constraints
and data invariants (beyond those implied by multiplicities on
associations), such as the constraint that the Order Amount must be
positive.
[FIGURE 2 OMITTED]
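The kinds of information listed above can be made concrete with a small
sketch. The following is only an illustration, written in Python rather
than COBOL or UML, of how the entities, subtypes, associations,
aggregation, and integrity constraint described in the text might be
expressed. It is not the representation used by the Mastery tools. The
class names follow the entities named in the text; the attribute names,
the helper aggregate_parts, and every record field other than PART-KEY
are hypothetical and introduced purely for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Part:
    """Logical Part entity: one object, although stored as two records."""
    part_key: str
    primary: dict    # fields from the first physical part record
    secondary: dict  # fields from the second physical part record


@dataclass
class Order:
    order_id: str    # hypothetical attribute name
    part: Part
    amount: int
    # Association end: zero or more Transactions pertain to this Order.
    transactions: list = field(default_factory=list, repr=False, compare=False)

    def __post_init__(self):
        # Integrity constraint from the text: Order Amount must be positive.
        if self.amount <= 0:
            raise ValueError("Order amount must be positive")


@dataclass
class Transaction:
    txn_id: str      # hypothetical attribute name
    # Association end: zero or one existing Order.
    order: Optional[Order] = None

    def __post_init__(self):
        # Keep both ends of the association in step.
        if self.order is not None:
            self.order.transactions.append(self)


# Logical subtypes of Transaction named in the text.
class NewOrder(Transaction):
    pass


class Correct(Transaction):
    pass


class Delete(Transaction):
    pass


def aggregate_parts(primary_recs, secondary_recs):
    """Join two physical record sets on PART-KEY into logical Parts."""
    secondary_by_key = {rec["PART-KEY"]: rec for rec in secondary_recs}
    return {
        rec["PART-KEY"]: Part(
            rec["PART-KEY"], rec, secondary_by_key.get(rec["PART-KEY"], {})
        )
        for rec in primary_recs
    }


# Usage with made-up records; DESCRIPTION and UNIT-PRICE are assumptions.
parts = aggregate_parts(
    [{"PART-KEY": "P1", "DESCRIPTION": "bolt"}],
    [{"PART-KEY": "P1", "UNIT-PRICE": 5}],
)
order = Order("O1", parts["P1"], amount=10)
txn = Correct("T1", order)
```

In this sketch the two physical part records disappear behind a single
Part object, much as the logical model elides that artifact, while the
multiplicities and the positive-amount constraint become explicit
instead of remaining implicit in the code that uses the data.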
APPLICATIONS OF LOGICAL DATA MODELS
COPYRIGHT 2006 All Rights
Reserved. Reproduced with permission of the copyright holder. Further reproduction or distribution is prohibited without permission.
Copyright 2006, Gale Group. All rights
reserved. Gale Group is a Thomson Corporation Company.
NOTE: All illustrations and photos have been removed from this article.