SRIRAM: a scalable resilient autonomic
mesh.
by Verma, Dinesh C.^Shaikh, Anees^Sahu, Sambit ^Chang,
Isabella^Calo, Seraphin^Acharya, Arup
IBM Systems Journal • March, 2003 • Scalable Replication Infrastructure using Resilient
Autonomic Meshes
Most networked applications are currently implemented using a
client/server computing model. A server with a well-known address hosts
the application, while different clients access it over the network.
Typically, the server (or a set of servers) will be located at a single
site, and the overall performance of the application will depend upon
factors such as the speed of the network between the client and the
server, the computing power at the hosting site, and congestion in the
network. The concentration of servers at a single site also reduces the
ability of the application to withstand failures. In order to improve
the availability and reliability of a system, distributed architectures
incorporating replicas and mirrors are frequently used. However, the
process of replication and mirroring is usually manual and, due to the
complexity in the control and management of the system, somewhat
cumbersome. An autonomic replication and mirroring facility would
significantly simplify the process of replication and would improve the
availability of applications.
In this paper, we describe the highly scalable distributed
architecture SRIRAM (Scalable Replication Infrastructure using Resilient
Autonomic Meshes), which is designed to dynamically create replicas of
applications for resilient operation. The basic idea behind SRIRAM is
that several computers are available at any given time on the network,
and an application deployed on one of the machines can be mirrored and
run on any other machine that is available and capable of providing the
same service. All the computers are connected in a mesh with
self-managing properties. A machine hosting an application uses the
communications overlay (an application-level communication network that
overlays the mesh) to transmit the application's replication
requirements, identify potential replicas, and configure the replicas to
start a copy of the application. Clients search for one of the replicas
of any application in which they are interested, and invoke the services
of that application from the replicated copy.
The autonomic replication infrastructure provided by SRIRAM can be
used in various scenarios, but is most relevant in the context of
peer-to-peer networks, (1,2) content distribution networks, (3,4) and
Grid computing. (5) The basic SRIRAM architectural components can be
used to improve the underlying communication infrastructure (resiliency
to faults, increased availability, etc.) in each of these contexts, with
the application replication mechanism as a specific feature provided
within each of these operating environments.
The rest of the paper is structured as follows. In the next section
we give an overview of the SRIRAM architecture. In the following four
sections we discuss each of the major components of the architecture as
well as the types of applications that can exploit the replication
support provided by SRIRAM. In the remaining two sections we review
related work, and then we present our conclusions and directions for
future research.
Architecture overview
The SRIRAM architecture, which includes five major components, is
illustrated in Figure 1. The mesh creates a network interconnecting all
machines participating in the system. A flexible and efficient
query-search mechanism is built on top of the network. Security and
anonymity controls round up the communication infrastructure consisting
of the bottom three components. The upper two layers are a specific use
of this infrastructure. The query-search mechanism facilitates the
resource advertising by the participants on the mesh. The resource
advertising facility is used for automatic search and for creation of
replicas.
[FIGURE 1 OMITTED]
The autonomic mesh component consists of a self-configuring network
interconnecting all machines in the system. This layer supports a
broadcast mechanism that allows all the participating machines to
communicate in an efficient, scalable, self-configuring, and
self-healing manner.
The query-search component supports basic search primitives that
allow participants to search for information about other participants
within the SRIRAM system. SRIRAM uses a system based on active programs
(Java ** language-based query capsules), which enables a flexible and
efficient search mechanism. Caching is used to improve the
responsiveness of the system.
The security/anonymity component provides for communication with
other peers while preserving the anonymity of the requester or the
respondent. Security and access control within SRIRAM is based on
digital certificates issued by trusted authentication servers. The
resource advertising component allows a participating machine to
describe the resources required for replicating the applications running
on it, and for possible replicas to indicate their resource
availability.
Finally, the application replication component provides the basic
functions for replicating the code and data of applications, and for
maintaining the proper consistency of application data among the
different mirrors. Each one of these components is described in more
detail in subsequent sections.
Autonomic mesh algorithm
The autonomic mesh component within SRIRAM provides an overlay that
interconnects all of the participating machines so that they may
communicate with each other. This function is similar to the overlays
created in distributed peer-to-peer networks like Gnutella (2) that
enable group communication among all participants. However, Gnutella and
similar systems use a flooding scheme for their group communication,
which consumes a significant amount of network and node resources. In
SRIRAM, we have opted for a scheme that builds a rooted spanning tree
among all the participants and attempts to minimize the number of
messages exchanged for any given query.
The use of a rooted spanning tree has its own set of problems. The
traditional distributed algorithms for creating a spanning tree are
relatively slow and complex, and are thus impractical for our needs.
Furthermore, nodes that are nearer the root of the rooted spanning tree
are likely to see more traffic than nodes at the leaves of the tree. The
tree is also more likely to be disrupted when a machine leaves or joins
the system.
To accelerate the process of spanning tree creation, SRIRAM uses a
semi-distributed scheme similar to that used in YOID. (6) In the
semi-distributed scheme, SRIRAM deploys a number of hint-servers within
the system. The hint-servers store a limited amount of information about
the participants, and the information is not guaranteed to be
up-to-date. A participant wishing to join the system communicates with
the hint-servers in order to obtain the identities of possible nodes in
the existing spanning tree to which it can connect.
To solve the problem of increased load on participants near the
root of the tree, a ranking scheme is used. Each participating node in
SRIRAM computes a rank for itself. A rank is a measure of the computing
capability of the node. For computational ease, the closer the node is
to the center of spanning tree activity, the lower its rank. The root of
the spanning tree is the node with the lowest rank. In addition, the
rank computation also involves the inverse of a weighted combination of
its CPU speed, available disk space, memory size, and speed of its
network interfaces. Ranks impose a strict ordering on the participants
in the tree, and ensure that no cycles can form in the constructed tree.
For efficient tree creation and restructuring, a new machine is
only allowed to join the tree by choosing a parent from among existing
participants with ranks lower than itself. This provides a simple, yet
effective, scheme for eliminating cycles in the spanning tree. The
selected participant becomes the parent of the new node. When a
participant leaves the tree, its children join the parent of the
departing machine. If an orphaned child node does not succeed in joining
any node, it then increases its own rank and contacts the hint-server
for a list of possible parents. The steps in the creation of the
spanning tree are described below in further detail.
Joining the tree. When a new machine is about to join a tree (this
is known as the registration phase), it computes its rank and contacts
the hint-server to obtain a list of machines with ranks slightly lower
than the computed rank. It then contacts each of the machines in the
list and requests that it become its parent. A machine in the list may
accept the request only if it has a lower rank than the newcomer. In
addition, it may refuse to accept new children beyond a certain
preconfigured limit, or it may no longer be up. The delay in obtaining a
response from the machine is used to estimate the round-trip delay
between the potential parent and the newcomer. The newcomer joins (as
child) the machine with the lowest latency that responds positively to
its join request. If no machine in the list responds positively, the
newcomer doubles its computed rank and obtains a new list from the
hint-server. If a newcomer has a rank smaller than the current root, the
hint-server returns a special code to both the current root machine and
the newcomer, asking that the newcomer become the new root of the
spanning tree as the parent of the existing root machine.
COPYRIGHT 2003 All Rights
Reserved. Reproduced with permission of the copyright holder. Further reproduction or distribution is prohibited without permission.
Copyright 2003, Gale Group. All rights
reserved. Gale Group is a Thomson Corporation Company.
NOTE: All illustrations and photos have been removed from this article.