Reconfigurability provides the foundation upon which autonomic
systems can adapt to their changing environments, which is useful for
dynamically optimizing system functionality based upon the observed
execution profile or for recovering from errors and failures without
human intervention. Unfortunately, reconfiguring distributed systems is
difficult in practice, since the processes are often interdependent.
Arbitrarily changing the behavior of one process through reconfiguration
may render other processes unusable. By formally expressing the
dependencies among processes, both static and dynamic reconfigurations
can be analyzed to determine whether they are compatible with the
existing configuration.
This paper describes a model for formally capturing the structure
and run-time behavior of a distributed system. The structure is defined
by a set of elements containing the state variables in the system. The
runtime behavior is defined by threads that execute atomic actions
called operations. Operations invoke code blocks to bring about state
changes in the system, but these state changes are limited to a single
element and thread. The variables accessed during the execution of an
operation constitute the signature of the operation. The collective set
of operation signatures in the system can be used to derive dataflow
dependencies between operations and between elements.
Reconfiguring the system is modeled by changing the code blocks
that are invoked by an operation. Given the current configuration of the
system, a proposed reconfiguration can be subjected to an off-line test
to determine whether any of the dataflow dependencies of the system
change as a result of the reconfiguration. If there are changes to the
dataflow dependencies, the user is warned that the proposed
reconfiguration may not be safe. Ultimately, the user decides whether or
not to apply a reconfiguration. (1)
We have employed the proposed system model in developing a
software-implemented fault-tolerant (SIFT) environment based around
ARMORs (Adaptive Reconfigurable Mobile Objects of Reliability) to
provide error detection and recovery for distributed applications. To
ensure that the SIFT environment is resilient to failures, dedicated
fault-tolerant mechanisms were added by reconfiguring the ARMOR
processes (2) that make up the SIFT environment. The system model and
associated criteria for safe reconfigurations were applied to verify
that the additional fault-tolerant mechanisms did not impact the SIFT
functionality in unexpected ways.
System model
An abstract model that captures the structure and run-time
execution of the system is used to analyze the effects of
reconfiguration on the dependencies among the various components in the
system. A system is specified by a triple (C, V, T):
* C is the set of code blocks in the system. A code block performs
a computation triggered by events called operations.
* V is the set of state variables (3) in the system. Variables are
only accessed through executing code blocks.
* T is the set of threads in the system. Threads execute by
sequentially invoking code blocks in the system through operations,
which bring about state changes by manipulating the system variables in
V.
Elements. State variables in the system are partitioned into
components called elements. Code blocks are placed in the elements
containing the state variables manipulated by the code blocks. Given a
predicate Access(c, v) that is true if and only if code block c [member
of] C reads or writes variable v [member of] V, an element can be
defined formally as follows.
Definition 1 (Element): An element is a pair (C, V), where V is a
set of variables and C is a set of code blocks that do not access
variables other than those in V:
element e [equivalent to] (C, V), C [subset or equal to] C
[conjunction]([for all]c [member of] C, V [subset or equal to] V, v
[member of] V : Access(c, v) [??] v [member of] V)
Elements partition the system so that each variable is found in one
and only one element, and the code blocks within an element cannot
access state variables residing in other elements.
Operations. Code blocks in the system are indirectly invoked
through operations. One operation can invoke several code blocks as
defined by a function BindCode: P [right arrow] [2.sup.c], where P is
the alphabet of operations and [2.sup.c] is the power set over all code
blocks (i.e., BindCode maps an operation to a set of code blocks). The
notation "p [right arrow] c" is also used to denote that
operation p is bound to code block c via the BindCode function. An
operation executes by being delivered to all code blocks bound to the
operation via the BindCode function.
Definition 2 (Operation Delivery): An operation p [member of] P is
said to be delivered to a code block c [member of] c, denoted Deliver(p,
c), by executing code block c. The delivery completes when code block c
completes executing.
Because each code block exists in one and only one element,
delivery can be thought of as occurring with respect to elements as
well. An operation executes by being delivered to all bound code blocks
(elements) in the system. Execution completes only after all deliveries
for p complete.
Definition 3 (Operation Execution): An operation p [member of] P is
said to have executed, denoted Execute(p), after it has been delivered
to all code blocks bound to p:
Execute(p) [equivalent to] [for all]c [member of] BindCode(p) :
Deliver(p, c)
Threads. The run-time behavior of the system is characterized by a
set of threads, each of which serially executes a sequence of
operations. Given P, the set of all operation sequences is given by
[P.sup.*]. A sequence can be specified by its constituent operations as
P = <[p.sub.0], [p.sub.1], [p.sub.2], ...>. We denote p to be in
sequence P by p [member of] P, and that p precedes q in sequence P by p
[<.sub.P] q.
A thread T [member of] T is specified by a triple T = (P, V, F):
* P [member of] [P.sup.*] is a sequence of operations organized as
a stack, also referred to as T.ops. Threads execute by sequentially
executing the operations on the stack, beginning with the head
operation. To emphasize that operations are executed within a thread,
the notations used in Definitions 2 and 3 are extended to T.Deliver(p,
c) and T.Execute(p).
* V [member of] [V.sub.thd] is a subset of private thread variables
that are only accessible through thread T. [V.sub.thd] is disjoint from
the state variables V. Although two threads can contain private
variables with the same names, each thread maintains its own independent
value. The set of private variables for thread T is also referred to as
T.vars.
* F is a data structure called a frame stack, which is used to
preserve the contents of private thread variables across nested
operation invocations, essentially providing scope to the variables in
the thread. A nested operation invocation occurs when an operation
pushes a sequence of operations onto T.ops.
The pseudocode for the execution of a thread is given in Figure 1.
The while loop executes until the operation stack is exhausted. The head
operation is popped from the stack and stored in variable p. Line 5
executes the operation. Note that because the statement T.Execute(p)
does not complete until all deliveries for p complete (i.e., until all
code blocks bound to p have executed), the operations within a thread
are executed in a strictly sequential order.
Figure 1 Pseudocode for thread execution
1 procedure ExecuteThread(T):
2 while T.ops[not equal to]0 do
3 p := head(T.ops)
4 T.ops := tail(T.ops)
5 T.Execute(p)
6 end do
Delivery actions. The code blocks that execute during an operation
delivery cannot make arbitrary state changes to the system. Their
effects are confined to the current thread and the element to which the
operation is delivered. Specifically, a single delivery of operation p
in thread T to code block c (i.e., T.Deliver(p, c)) can perform any of
the following actions:
1. State variables of element e (the element to which c belongs)
can be read. c.readVars denotes the set of state variables read by c.
2. State variables of element e can be written by code block c,
denoted by c.writeVars.
3. Private variables within thread T can be read by code block c,
denoted by c.readThdVars.
4. Private variables within thread T can be written by code block
c, denoted by c.writeThdVars.
5. The code block can atomically push new operations onto the
operation stack of T. Let c.pushOps refer to the set of operation
sequences that can be pushed while c executes. Only one of these
sequences, however, is pushed when c executes (allows the code block to
push different sequences depending upon run-time conditions).
6. The code block can create new threads whose states are
initialized from the private variables in the parent thread T and the
state variables in e.
These six items constitute a signature of code block c, and the
signature provides a succinct, black-box description of how the code
block manipulates element state and thread state during run time.
Programmers specify the signature as meta-data for each code block that
they develop. The collective set of signatures for all code blocks in
the system are used to derive dataflow dependencies among operations
given the current BindCode mapping function.
Intrathread dependencies among operations. Because elements are
allowed to read from private thread variables during an operation
delivery, a particular operation delivery may be dependent upon
deliveries earlier in the execution of the thread that wrote to thread
variables. To better understand the dependencies among operations within
a thread, the concept of an input/output signature is developed with
respect to individual operations and sequences of operations as follows:
* An input signature for an operation describes the thread
variables that are read by an operation when it executes.
* An output signature for an operation describes the thread
variables that are written when an operation executes.
* An input signature for an operation sequence describes the thread
variables whose values must be established before a sequence begins
executing. These variables serve as inputs to the aggregate sequence of
operations.
* An output signature for an operation sequence describes the
thread variables that are the intended outputs of a sequence of
operations. An operation sequence is pushed onto the operation stack of
a thread with the express purpose of writing to the thread variables in
the output signature of the sequence.
Input signatures. First, the thread variables used as input for
operation delivery T.Deliver(p, c) are considered. Define a function
DeliveryInputSig: [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] as
follows:
(1) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
This definition states that the set of input variables used by
T.Deliver(p, c) not only includes c.readThdVars but also includes those
thread variables used as input for the operation sequences pushed by
code block c. The outermost parenthetical expression in Equation 1
includes the input variables of the sequence (denoted by SeqInputSig(P),
to be defined later), but excludes those variables used by the sequence
that were written by c. (4) The intuition is that the input signature
for T.Deliver(p, c) should only include those variables whose values
were defined before code block c executed.
In general, operation p can be delivered to several code blocks,
and the composite input signature for operation p denoted by
ExecInputSig(p): [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] can
be derived from Equation 1:
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
The input signature for a sequence P is slightly more complicated.
If an operation p [member of] P takes variable v as input, then v should
not be part of the input signature for P if v was written by an
operation that preceded p in P. Define SeqInputSig: [MATHEMATICAL
EXPRESSION NOT REPRODUCIBLE IN ASCII] as follows:
SeqInputSig(P) = {v [member of] [V.sub.thd]: [there exists]p
[member of] P: v [member of] ExecInputSig(p) [conjunction] [for all]q
[<.sub.P] p: v [not member of] ExecOutputSig(q)}
Output signatures. As with input signatures, the output signature
is defined first with respect to a single delivery p [right arrow] c and
then extended to account for all code blocks bound to p.
(2) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII],
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
The delivery output signature in Equation 2 includes not only the
variables written directly by c, but also the variables written by any
sequence of operations pushed by c.
The output signature for an operation sequence P is defined, in
general, by the programmer. The programmer who writes code block c
pushes sequence P with the intent of producing specific outputs. The
individual operations in P may write intermediate values as P executes,
but these intermediate results ultimately will be discarded when P
completes (see the discussion of the frame stack that follows).
Discarding the intermediate values produced by P provides scope to the
thread variables, and this is similar to saving or restoring register
contents when entering or exiting a function in order to preserve the
contents of registers not intended to store function outputs. Because
the intended output of an operation sequence cannot be derived without
semantic knowledge of the sequence, the output signature SeqOutputSig(P)
must be specified by the programmer.
Frame stack, Each thread contains a frame stack that implements the
saving or restoring of private thread variables described above. When an
operation sequence is pushed on the operation stack of the thread, the
current values of all intermediate output variables for the sequence are
saved. This set consists of all thread variables that will be written by
the operations in the pushed sequence P, excluding those variables
designated as the intended output of the sequence:
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
The variables stored in the top entry of the frame stack are
restored when the operation sequence P completes.
Concurrency. Operations within a thread execute in a sequential
order. Only one thread executes within an element (i.e., delivers
operations to code blocks within an element) at any given time to ensure
mutually exclusive access to the state variables of the element. This
assumption treats each delivery as if it were writing to the element,
but this requirement can be loosened if the deliveries can be tagged as
"read-only" or "read/write," in which case a
multiple-read, single-writer lock can be used to control access to the
elements.
Applying the model to object-oriented programs. In object-oriented
systems, objects are similar to elements in the sense that objects
typically encapsulate state and member functions that operate on the
state of the object. Each object, therefore, has an associated signature
for each of its member functions. When these member functions are
invoked within a thread, dataflow dependencies exist among the objects
visited by the execution of a thread. This is the intuition behind the
construction of the model presented in this section.
Elements, however, more generally represent sets of related
objects. For example, a tree data structure may represent each node as
an object, but the entire tree would most likely be encapsulated in an
element. Keeping I/O signatures at the coarser granularity level of
elements makes the dataflow dependability analysis more manageable than
if per-object signatures were required.
Member function invocations, therefore, are modeled as operations
being delivered to elements. Nested function invocations are split into
multiple operation deliveries, since the model defines the execution of
a thread as the sequential execution of operations. For example,
consider a function X that performs some computation [C.sub.1], calls
function Y, and performs an additional computation [C.sub.2] after Y
returns. This execution trace is modeled as three operation deliveries:
the execution of [C.sub.1], followed by the execution of Y, followed by
the execution of [C.sub.2]. Each of these operation deliveries
represents an atomic state change to the system state. (5)
With programs explicitly designed around the system model presented
in this paper, a structure is in place for extracting information to
perform dataflow analysis on the individual operations that make up the
execution of the system. Furthermore, the model allows the information
to be gathered and analyzed in an automated fashion, given the I/O
signatures of each individual operation. If the dataflow analysis
framework is to be applied to object-oriented programs not designed with
the proposed system model in mind, then the issue becomes one of how to
extract the appropriate information needed for the dataflow analysis
(e.g., how do objects pass information between each other within a
thread of execution, what information is exchanged through each object
invocation, and what are the effects of each object's invocation on
system state). After gathering this information, the effects of a
reconfiguration on the system can be determined. Obtaining such
information may require manual input from the programmer or may require
refactoring certain aspects of the application to more closely conform
to a system model such as the one proposed in this paper to facilitate
the automated dataflow analysis of proposed reconfigurations.
Reconfigurability
The sets of code blocks, elements, and operations are considered to
be static, and reconfiguration occurs by either adding or removing a
single operation binding (i.e., changing the BindCode function). The
system transitions into a new configuration view whenever the BindCode
function changes. A configuration view is denoted by [V.sub.i], with
configuration view [V.sub.i+1] occurring immediately after [V.sub.i].
When necessary, the configuration view index will be specified along
with the BindCode function (e.g., Bind[Code.sub.i](p) specifies the set
of code blocks bound to p under configuration view [V.sub.i]).
Impact on thread state. The BindCode function determines the I/O
signatures for operations and, therefore, establishes dependencies among
operations within a thread of execution. Programmers design code blocks
with an understanding of these dependencies in order to bring about a
desired effect. For example, if operation [p.sub.k] copies data from
element [e.sub.1] to variable v, and operation [p.sub.k+1], copies data
from variable v to element [e.sub.2], data are transferred from one
[e.sub.1] to [e.sub.2] by virtue of the I/O signatures of [p.sub.k] and
[p.sub.k+1]. If a reconfiguration--either adding or removing an
operation binding--disrupts the data dependencies for future operations
in the execution of a thread, then the reconfiguration is said to be
unsafe.
Definition 4 (Safe Reconfiguration): A reconfiguration from view
[V.sub.i] to [V.sub.i+1] that affects the binding p [right arrow] c,
where p [member of] P and c [member of] C, is considered to be safe with
respect to thread state if and only if, for all executed operation
sequences P [member of] [P.sup.*] in which p [member of] P:
1. All dataflow dependencies (6) between operations following p and
operation p in view [V.sub.i] continue to exist in view [V.sub.i+1].
2. All dataflow dependencies between operations following p and
operation preceding p in view [V.sub.i] continue to exist in view
[V.sub.i+1]:
[for all]P [member of] [P.sup.*] : p [member of] P [??] [for all] q
[[less than or equal to].sub.P] p, r [>.sub.P] p:
r dataflow-dependent on q in [V.sub.i] [??] r dataflow-dependent on
q in [V.sub.i+1]
System reconfigurations change the binding of operation p, either
assigning a code block top (causing an extra computation to be
performed) or removing a code block from p (eliminating a computation
from threads that execute in the system). A reconfiguration is unsafe
only if it affects operations further downstream in the execution of the
thread. The following examples illustrate safe reconfigurations in which
dataflow dependencies are added, removed, or changed:
* A new binding p [right arrow] c reads from some thread variable
v, resulting in a new dataflow dependency that did not exist in the
previous configuration. This reconfiguration is safe, since it does not
disrupt the execution of operations further downstream.
* A removed binding p [right arrow] c causes p to no longer read
from thread variable v. Obviously, this reconfiguration destroys a
dataflow dependency that previously existed, but the reconfiguration is
not considered to be unsafe since this, in itself, does not affect
operations further downstream in the execution of the thread.
After the binding p [right arrow] c is removed, it may be safe to
remove the binding that wrote to v earlier in the execution of the
thread. This is an example of how functionality often can be removed
incrementally: the safe reconfiguration criteria help identify the
correct order in which multiple bindings should be removed to bring the
system safely to a desired configuration.
* A new binding p [right arrow] c writes to thread variable v that
is not currently read by any other operation further downstream in the
execution of the thread. Although this lone reconfiguration does not
immediately bring about a change in functionality, additional
reconfigurations can be made so that operations further downstream read
from thread variable v. As an alternative, a reconfiguration can be made
so that a later operation pushes a new operation sequence onto the
operation stack of the thread with the express purpose of performing
some computation based upon the newly written value in variable v.
As the previous examples suggest, several reconfigurations can be
made to remove or replace specific functionality in the system. The
dataflow dependency analysis assists in deciding the proper order to
apply such reconfigurations so that the correct behavior of the system
is not disturbed. It should be clear, however, that the safe
reconfiguration criteria presented in this section cannot guarantee
consistency using dataflow dependency analysis alone. Definition 4 is a
sufficient but not necessary condition for preserving consistency across
reconfigurations. A semantic analysis of the system is required to make
this consistency determination, and the semantics of a code block cannot
be derived solely from the code block I/O signatures. Dataflow analysis,
however, permits the programmer to understand the relationships between
a new code block and the other code blocks in the system, simply by
incorporating the signature of the new code block into the following
analysis framework.
Dataflow analysis framework. The analysis framework and resulting
criteria depend only upon the current system configuration (i.e., the
[BindCode.sub.i] function and resulting I/O signatures) and the proposed
configuration (i.e.; [BindCode.sub.i+1]). If the criteria indicate that
the proposed reconfiguration may be unsafe, the user is notified of the
existing dataflow dependency that would be broken by the
reconfiguration. Other reconfigurations may be needed to compensate for
the broken data dependency.
Figure 2, used throughout this section as a running example, shows
data dependencies among operations in a sequence P = <[p.sub.0],
[p.sub.1], [p.sub.2], [p.sub.3]> that was pushed onto the operation
stack. The arcs between nodes indicate the dataflow dependencies between
operations (e.g., operation [p.sub.1] writes to a variable z [member of]
[V.sub.thd], which is later read by [p.sub.3]). Arcs emanating from
input define the input signature for P (e.g., [p.sub.3] reads from
variable y, whose value was not altered by either [p.sub.1] or
[p.sub.2]). Arcs that point to the output node represent the variables
that are part of the output signature for sequence P. Variables that
store intermediate values are shown to emanate from the output node,
indicating that they will be restored when P completes. A dataflow
dependency graph can be constructed completely for any sequence P given
the I/O signatures for P and the individual operations in P.
[FIGURE 2 OMITTED]
The dataflow dependencies in Figure 2 can be altered by either
adding or removing an operation binding. Since the analysis performed in
both cases is similar, the discussion focuses on adding an operation
binding p [right arrow] c, without loss of generality. Adding a binding
p [right arrow] c is considered to be safe if DeliveryOutputSig(p, c) =
0, since thread variables are not written when p is delivered to c.
If DeliveryOutputSig(p, c) [not equal to] 0, then how the
additional writes affect later operations in the thread must be
examined. For illustrative purposes, suppose that the binding [p.sub.2]
[right arrow] c is being added to the example in Figure 2. The analysis
decomposes the effect of the additional write into two cases:
1. How the additional write from [p.sub.2] [right arrow] c affects
later operations within P (intrasequence dependencies)
2. How the additional write from [p.sub.2] [right arrow] c affects
operations that follow execution of P in the thread (extrasequence
dependencies)
Intrasequence dependencies. The first case examines how adding
binding p [right arrow] c affects the dataflow among operations within
the sequence to which p belongs. If an operation p' that follows p
in sequence P reads from a variable written by the delivery T.Deliver(p,
c), then p' is dependent upon the reconfiguration if no other
operation between p and p' overwrote variable v. This condition is
formally expressed in the following theorem (proofs are omitted for the
sake of brevity).
Theorem 1 (Intrasequence Safety): Given an operation sequence P
[member of] [P.sup.*] and an operation p [member of] P, the binding p
[right arrow] c can be safely added to the system only if any variables
written by p [right arrow] c do not overwrite the values in the
variables expected by later operations within P. Formally, a safe
configuration implies:
[for all]p' [>.sub.P] p : V = (ExecInputSig(p')
[intersection] DeliveryOutputSig(p, c)) [conjunction] V [not equal to] 0
[??] [for all]v [member of] V: [there exists] p" [member of] P:p
[<.sub.P] p" [<.sub.P] p' [conjunction] v [member of]
ExecOutputSig(p").
In the following examples, the binding [p.sub.2] [right arrow] c is
added to the system described in Figure 2. The four cases differ with
respect to the variables written by the new binding (i.e.,
DeliveryOutputSig([p.sub.2], c)). The preceding theorem is applied to
each case to determine whether the proposed reconfiguration is
considered to be safe:
1. Write to z. This reconfiguration is not safe, since it changes
the dataflow dependency graph in Figure 2:[p.sub.3] would read the value
written to z by [p.sub.2], not the value written by [p.sub.1] as before
(Figure 3).
[FIGURE 3 OMITTED]
2. Write to y. This reconfiguration is also not safe for similar
reasons: [p.sub.3] no longer reads the value of y established before P
executes (Figure 4).
[FIGURE 4 OMITTED]
3. Write to w. This reconfiguration is safe, since no operation
within P after [p.sub.2] reads from w. Since w is an intermediate
variable (not the final output of P), its value will be restored when P
completes; thus, the side effect of the new binding [p.sub.2] [right
arrow] c will be masked (Figure 5).
[FIGURE 5 OMITTED]
4. Write to v. This reconfiguration is also safe, since no
operation within P after [p.sub.2] reads from v (Figure 6).
[FIGURE 6 OMITTED]
The cases involving writing w and writing v differ in one important
respect: in the latter case, variable v was not included in the set of
intermediate thread variables used by P, since no operation in P was
known to produce v prior to the reconfiguration. Thus, there are two
possible scenarios to consider:
* Sequence P was pushed onto the operation stack of the thread
prior to the reconfiguration, in which case the frame stack of the
thread does not contain the correct value of v to restore when P
completes. Because sequence P was pushed onto the stack with the intent
of being executed in an earlier configuration not containing the binding
[p.sub.2] [right arrow] c, it is appropriate to suppress the [p.sub.2]
[right arrow] c delivery, thus preventing variable v from being
overwritten as a side effect of the reconfiguration.
* Sequence P was pushed onto the operation stack of the thread
after the reconfiguration, in which case v is known to be an
intermediate value produced by P. The stack frame, therefore, contains
the correct value of v to restore when P completes.
Extrasequence dependencies. The additional binding [p.sub.2] [right
arrow] c from Figure 2 can affect only operations that follow sequence P
in the execution of the thread if [p.sub.2] [right arrow] c writes to
variables in the output signature of P. This is a necessary but not
sufficient condition for the reconfiguration to be considered safe,
since [p.sub.2] [right arrow] c can write to an output variable of P as
long as no other operation that follows P depends upon the overwritten
value (again, the notion of preserving the dataflow dependency
relationships that existed before reconfiguration).
Figure 7 presents an example in which operation p executes as part
of a sequence P pushed by an operation [q.sub.j] [member of] Q. When
sequence P completes, operations that follow [q.sub.j] in sequence Q
will begin executing ([q.sub.j+1] in the figure). If p is the last
operation within P to write to thread variable v, and v is in the output
signature for sequence P, then the value written to v by operation p
will propagate to operations that follow q in sequence Q.
[FIGURE 7 OMITTED]
Lemma 1 (Write Propagation Outside of Sequence): Given an operation
sequence P and an operation p [member of] P, the value written to
variable v by p will propagate to operations that follow P if and only
if no other operations after p in P overwrite variable v. This condition
is defined by the following predicate:
WritePropagation (v, p, P) [equivalent to] v [member of]
(ExecOutputSig(p) [intersection] SeqOutputSig(P)) [conjunction] [for
all]p' [>.sub.P] p : v [not member of] ExecOutputSig(p')
If operation q [member of] Q pushes operation sequence P onto the
operation stack of the thread, then q is called the parent operation of
sequence P. In general, an operation sequence can have several parents,
given by the set parents(P) as follows:
parents(P) [not equal to] {p [member of] P: [there exists]c [member
of] C:P [member of] c.pushOps [conjunction] c = BindCode(p)}.
The concept of parent operations can be used to establish a
"leads to" relationship between two operations and between an
operation and a sequence:
* q [??] P [??] q [member of] parents(P)
* q [??] p [??] p [member of] P [conjunction] q [member of]
parents(P)
Finally, p [??] q denotes a transitive chain of "leads
to" relationships, meaning that the execution of operation p brings
about the execution of operation q in either one step (i.e., p [??] q
[??] p [??] q) or several steps (i.e., p [??] q [conjunction] q [??] r
[??] p [??] r). The "leads to" relationship is needed to
determine the extent to which a reconfiguration affects other
operations. If q [??] p, then changes to the binding of p can propagate
to operations that follow q as expressed in Lemma 2, presented below.
The dataflow dependencies for each operation that leads to p must be
checked in order to determine whether a reconfiguration is safe.
Continuing with the example in Figure 3, let operation r [member
of] R push sequence Q onto the operation stack of the thread when r
executes; thus, r [??] p. The value in variable v that exists after
operation q executes (q being the operation that pushed sequence P) will
propagate to operations that follow r in sequence R if and only if the
remaining operations in Q do not overwrite v and v is part of the output
signature of sequence Q. This propagation continues up the [??] chain as
long as these two conditions hold at every step; as soon as one of the
conditions is violated, checking can stop, since the value in v will not
propagate further. (7)
Lemma 2 (Generalized Write Propagation): Given an operation
sequence P for which r [??] P, the value written to variable v by
operation p [member of] P propagates to operations that follow r if and
only if the following condition holds:
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII],
Proof rationale. The [WritePropagation.sup.*] predicate
is intended to be a generalization of the
WritePropagation predicate introduced in Lemma
1. When r is an immediate parent of p (i.e., r [??]
p), Lemma 1 can be directly applied.
If r [??] p forms a multistep chain (see Figure 8),
then [WritePropagation.sup.*] is recursively defined. The
intuition is that the write to variable v performed
by operation p will propagate back to operation
r if there is at least one path of propagation,
namely a path through operation q in which q is
the parent operation of p. The write propagates
from p to r if the write propagates outside sequence
P, denoted by WritePropagation(v, p, P),
and if the write propagates from q to r, denoted
by [WritePropagation.sup.*](v, q, Q, r).
[FIGURE 8 OMITTED]
If the value in v propagates back to sequence R in Figure 7, then a
reconfiguration that overwrites v can be considered safe as long as
future operations in R do not read the overwritten value in v. This is
the essence of the following theorem: a reconfiguration is considered
safe only if either (1) the effects of the new binding do not propagate
to sequence R (where r [??] p and r [member of] R) or (2) the effects of
the new binding propagate to sequence R, but no operations within R read
from the overwritten variable.
Theorem 2 (Extrasequence Safety): Given an operation sequence P
[member of] [P.sup.*] and an operation p [member of] P, the binding p
[right arrow] c can be safely added to the system only if any variables
written by p [right arrow] c do not overwrite the values in the
variables expected by later operations that follow P. Formally, a safe
reconfiguration implies:
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
Proof rationale. All operations that lead top must be examined to
determine whether the reconfiguration is safe. Only those operations r
that exist in operation sequences pushed by code blocks are considered
(r [member of] R, where R [member of] [[union].sub.c] [member of] c]
c.pushOps). For each of these operations, one of the following two
conditions must hold for each thread variable for the reconfiguration to
be considered safe:
1. The value written by p from sequence P must not propagate back
to those operations that follow r in sequence R ([logical
not][WritePropagation.sup.*](v, p, P, r)). If the value does not
propagate, then operations that follow r will not be affected by the
reconfiguration.
2. If the value propagates to sequence R, then operations that
follow r in R are checked to see whether they read from the variable
written by p (v [member of] ExecInputSig(r'), where r'
[>.sub.R] r). If such an operation r' is found, then the
reconfiguration can be considered safe only if there is another
operation between r and r' that overwrites the value in v (i.e.,
r' does not read the value in v established by operation p).
Theorem 3 (Safe Reconfiguration Criteria): Adding a new binding p
[right arrow] c to the system is considered safe if and only if the new
binding satisfies both the Intrasequence Safety and Extrasequence Safety
properties.
Impact on element state. The previous subsection established the
conditions under which a reconfiguration can be considered safe from the
standpoint of thread state. In addition to thread state, the system also
contains state variables found in the elements. Operations in the
threads can manipulate state variables as they execute, so changing the
bindings of operations to elements can impact the changes that are
brought about in the element state.
Recall from the subsection on concurrency that threads lock a:n
element before an operation delivery to ensure mutually exclusive access
to the element. The element is unlocked after the delivery, permitting
other threads to operate on the element. For example, the execution of
an operation sequence P = <[p.sub.1], [p.sub.2], [p.sub.3]> with
operation-to-element bindings (8) [p.sub.1] [right arrow] [e.sub.1],
[p.sub.2] [right arrow] [e.sup.2], and [p.sub.3] [right arrow] [e.sub.1]
is shown in Figure 9A. State changes to the elements are atomic only
with respect to a single operation delivery. Another thread, for
example, can deliver an operation to element [e.sub.1] in Figure 9A
between operations [p.sub.1]. and [p.sub.3].
Figure 9 (A) Default locking for each operation delivery;
(B) element locks held across several deliveries
(A) [lock [e.sub.1]], [p.sub.1], [unlock [e.sub.1]],
[lock [e.sub.2]], [p.sub.2], [unlock [e.sub.2]],
[lock [e.sub.1]], [p.sub.3], [unlock [e.sub.1]
(B) [lock], [p.sub.1], [p.sub.2], [p.sub.3], [unlock]
Figure 9B shows how multidelivery locks can be used to extend
atomicity across several operation deliveries. In this case, the lock
action indicates that the lock for each element is not released after
delivery until the unlock is reached, at which point all the element
locks in the block are released. (9,10) With multidelivery locks,
dataflow dependencies among elements can be established within each
lock/unlock block. Outside these blocks, the element dataflow
dependencies cannot be determined, since other threads can overwrite the
element state between operation deliveries. Once the dataflow
dependencies are established within the lock/unlock blocks, the analysis
for determining a safe reconfiguration is similar to the procedure
outlined earlier in the subsection, "Impact on thread state,"
but space considerations preclude a detailed examination of the
criteria.
Reconfigurations in practice
The safe reconfiguration criteria presented in the previous section
require only static information for input (namely, the current and
proposed configurations expressed as BindCode functions). Thus, these
checks can be performed off line while the system executes. This section
briefly describes examples in which reconfiguration is useful and how
the safe reconfiguration criteria can be employed. It then describes a
reconfigurable software-implemented fault-tolerant environment developed
using these ideas.
Example applications of reconfigurability. The following are three
types of applications in which the model is useful in determining the
safety of a proposed reconfiguration:
1. Different execution phases for long-running applications. Some
long-running applications require functionality that varies according to
their phase of execution. A spacecraft sent to explore one of the outer
planets in the solar system, for example, spends most of its time
traveling to reach the target planet. While in this cruise mode, power
must be conserved, and the demands on the system are few. When the
spacecraft reaches the planet, however, it must perform several tasks,
such as collecting data, taking pictures, navigating a fly-by of the
planet, or controlling its own descent and landing. These phase-specific
code blocks can be activated only when necessary through the
reconfiguration concepts outlined in this paper. Since the number of
phases and the transitions between phases is known at design time, the
systems engineer can apply the safe reconfiguration criteria during the
development cycle to verify that the intended transitions are safe.
2. On-line software upgrades. Since software upgrades involve
changing the code that applications execute, on-line software upgrades
can be viewed as reconfigurations to the system. Before the upgrades are
made, the safe reconfiguration criteria can be used to ensure that the
proposed upgrade is compatible with the existing configuration of the
software.
3. Adaptivity. Some application domains require that the software
adapt to changing conditions in the environment of the application.
Middleware that provides fault tolerance to distributed applications,
for example, may need to adjust the level of service it provides to the
application depending upon the observed error behavior. Software
structured around the system model presented in this paper facilitates
this adaptation.
Reconfiguration can be used to transform the computation of the
application to take advantage of various mechanisms of fault tolerance
(e.g., incremental checkpointing of element state, (11) alternate
implementations of an element that employ design diversity to mitigate
the effects of software bugs, and backup elements that store redundant
copies of the data). The safe reconfiguration criteria can be used to
show that the additional fault tolerance mechanisms do not disrupt the
normal computation performed by the application.
Reconfigurable SIFT environment. We have developed a
software-implemented fault-tolerant (SIFT) environment by employing the
proposed formal system model. The SIFT environment consists of ARMOR
processes, which provide error detection and recovery services to
themselves and to user applications. (2,12) Because ARMOR processes are
designed around the system model presented in this paper, the SIFT
environment can be customized--even during run time--to the particular
dependability needs of the application.
The ARMOR-based SIFT environment was used for managing parallel
scientific applications executing on a computing test bed at the Jet
Propulsion Laboratory. (13) Extensive fault injection testing revealed
that it is essential for the SIFT environment to be protected against
errors in order to provide adequate fault-tolerance services to the
application. The reconfigurability concepts introduced in this paper
were applied to incorporate fault tolerance into the ARMOR processes as
follows:
1. Microcheckpointing (11) was transparently added to the ARMOR
processes to protect the state of the SIFT environment. The partitioning
of the system state into elements permitted incremental checkpoints to
be taken on an element-by-element basis. The microcheckpointing
algorithm exploited the fact that state changes brought about by an
operation delivery were confined to a single element and thread, thus
making transparent checkpointing possible.
2. Assertion checks were added to strengthen error detection. These
assertion checks were inserted during run time by dynamically changing
the bindings of selected operations to pass through the assertion before
being delivered to the original element. This construct was particularly
useful in implementing range or sanity checks on inputs.
The safe reconfiguration criteria were applied in both of these
cases to show that the added fault tolerance mechanisms did not disturb
the existing functionality of the SIFT environment, which included
running the scientific applications, recovering from application
failures, recovering from node failures, and monitoring resource usage.
Conclusion and related work
This paper has presented a model that captures the structure
(defined by elements) and run-time behavior (defined by operations) of a
system. An executing code block can bring about state changes only
within a single element and thread. The extent of these state changes
are represented in a signature for the code block, and the collective
set of signatures in the system can be used to analyze the dependencies
that exist among operations and among elements.
By indirectly invoking code blocks through operations, the behavior
of the system can be reconfigured by changing the operation-to-code
block binding. Indirection can be achieved statically through designs
such as the Polylith software bus, (14) in which components are
interconnected through a mediator. The bindings of operations to code
blocks is somewhat similar to a publish-and-subscribe event subsystem,
except that operations are not asynchronous. Operations execute
sequentially and, therefore, more closely resemble instructions that
execute in a virtual machine architecture described by the proposed
system model.
Architectural description languages (ADLs) are popular in
describing the structure of a system. (15-17) ADLs model the system at a
conceptual level, using port-based connections between components to
define the system structure. Some incorporate semantics that describe
the behavior of the components and connectors, including constraints in
their usage. Our model, in contrast, is rooted in the implementation of
the system and is a bottom-up approach in which the programmer describes
the behavior of code blocks through the use of per-block signatures.
These signatures are processed in an automated fashion to construct
dependency relationships between the system components. Additional code
blocks can be designed without the need to formally incorporate them
into a larger architectural description model--only the signatures for
the new code blocks are required to apply the safe reconfiguration
criteria presented earlier.
Nevertheless, several ADLs express dynamic reconfigurations at the
architectural level. Darwin, for example, addresses the problem from a
structural perspective by allowing components to be instantiated during
run time, (18) but reconfigurations only occur while the system is
quiescent to preserve consistency. Wright permits the architectural
topology of the system to change during run time in response to special
control events, which are distinguished from the usual communication
events that drive its component behavior. (19) It has also been
suggested that architectural styles and models can be incorporated into
the adaptation framework of self-repairing systems as first-class
entities. (20) The architectural styles are used to determine what
aspects of the system should be monitored and how to reconfigure the
system within predetermined constraints.
Shrivastava and Wheater describe a reconfigurable workflow model
that provides run-time support for interconnecting the I/O of executing
tasks. (21) Although the work outlines the infrastructure that supports
reconfiguration, there is no mention of how to judge whether a proposed
reconfiguration can be considered safe given the current configuration
of the system. Reconfiguration compatibility issues have been addressed
with respect to real-time control applications in Feiler and Li, (22)
but this analysis requires knowledge of the semantics and permissible
behavior of the application.
Acknowledgments
This work was supported in part by NASA Fellowship NGT5-50228 for
Keith Whisnant, NSF grants CCR 00-86096 ITR and CCR 99-02026, grant ARMY
WMH 0993, and a grant from Motorola. Special thanks to Fran Baker for
reviewing early drafts of this paper.
Cited references and notes
(1.) It may be that the user is reconfiguring the system with the
explicit intent of changing the dataflow dependencies; thus, the
off-line tests are used only as warnings of possible incompatibility.
(2.) Z. Kalbarczyk, R. K. Iyer, S. Bagchi, and K. Whisnant,
"Chameleon: A Software Infrastructure for Adaptive Fault
Tolerance," IEEE Transactions on Parallel and Distributed Systems
10, No. 6, 560-579 (June 1999).
(3.) Where appropriate, the name of a variable v [member of] V (v.
name) is distinguished from the value of the variable (v.val).
(4.) If a code block both writes to thread variables and pushes an
operation sequence onto the