Dynamic reconfiguration: basic building blocks for
autonomic computing on IBM pSeries servers.
by Joefon Jann, Luke M. Browning, and R. Sarma Burugula
One of the cardinal features of an autonomic component in an
information technology (IT) infrastructure is the ability of the
component to adapt itself smoothly to changes in its environment.
Endowing a computing system with this self-management feature often
translates to the implementation of self-protecting, self-healing,
self-optimizing, and self-configuring algorithms and subcomponents.
Because the primary role of an operating system (OS) is to manage the
physical resources of a computer system so as to optimize the
performance of its applications (including middleware, which consists of
applications from the perspective of the OS), an OS supporting autonomic
computing (1) needs to handle the changes in the amount of physical
resources allocated to it in a smooth fashion. The most prominent
physical resources managed by an OS are processors, physical memory,
and I/O devices.
The current tendency among the noncommodity symmetric
multiprocessor (SMP) system vendors is to develop systems that are
increasingly large in terms of the number of processors, number of I/O
slots, and memory size. Although advances in the design of hardware
continue to provide rapid increases in the sizes of these physical
resources, a number of major applications and subsystems often lag
behind in scalability; hence, the trend in high-end SMPs is to support
partitioning of large SMPs and to use these systems for effective server
consolidation. Partitioned SMPs typically come in two kinds: systems
with physical partitions (PPARs) and systems with logical partitions
(LPARs). In a physically partitioned system, the granularity of
partitioning is typically coarse, because the partitioning occurs at
physical boundaries such as system boards. In a logically partitioned
system, the granularity of partitioning is typically much finer, such
as a single CPU or even a fraction of a CPU, a small block of memory,
or an I/O slot instead of an entire I/O bus. Hence, a given SMP can be
subdivided into many more LPARs than PPARs.
IBM first provided LPAR support in the Advanced Interactive
Executive (AIX*) operating system with the introduction of the pSeries*
690 system in December 2001. This first release of LPAR support was
static in nature; that is, a resource could not be reassigned from one
LPAR to another while AIX was actively running, and both the donor LPAR
and the receiver LPAR had to be rebooted to enable a reassignment. For
such a system to provide support for various
resource-related autonomic computing features, such as dynamic resource
balancing across LPARs, Capacity on Demand, Dynamic CPU Sparing, and hot
swapping, it needs to augment the static partitioning capabilities with
dynamic LPAR (DLPAR) capabilities. As of 2002, the pSeries 690 supports
the dynamic reassignment of resources across LPARs running AIX. In AIX,
this functionality is referred to as dynamic reconfiguration (DR).
Because AIX is an enterprise UNIX** operating system that has been
designed to be robust, high in performance, and rich in functions and
platform support, and is hence monolithic, the addition of a valuable
autonomic computing feature such as DR has to be carefully integrated
into the existing semantics, code base, and structural organization of
the operating system. These challenges in adding autonomic computing
capabilities are encountered by most large systems with a significant
installed base. Later in this paper, we briefly describe the design of
DR in AIX and show that, with carefully developed designs, autonomic
computing capabilities can be added to an enterprise-quality OS while
preserving its performance, semantics, and structural organization.
Besides describing the designs within AIX that enable the
smooth migration of physical resources, we also describe how these
designs are being exploited to provide a variety of valuable autonomic
computing features to an IT establishment.
Autonomic benefits of DLPAR. DLPAR in a pSeries 690-AIX system
offers a great deal of flexibility to users, allowing resources to be
shifted to where they are most needed without impacting system
availability. The DLPAR technologies that have been developed provide
the basic building blocks on which many self-optimizing,
self-configuring, self-protecting, and self-healing features of the
system are built. These features enable the implementation of autonomic
system management and goal-oriented policies to optimize the performance
and usage of system resources. DR also improves the levels of resource
utilization and the reliability, availability, and serviceability (RAS)
characteristics of the SMP, which translates into real cost savings for
the IT establishment.
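The shifting of resources "to where they are most needed" can be
pictured with a small policy sketch. This is only an illustration, not
the AIX or hardware management console interface: the Lpar class, the
rebalance function, and the utilization thresholds are all hypothetical
names and values chosen for the example, and the model assumes that the
total work in each partition stays constant while a processor is moved.

```python
from dataclasses import dataclass


@dataclass
class Lpar:
    """Hypothetical model of a logical partition (not an AIX or HMC API)."""
    name: str
    cpus: int            # whole processors currently assigned
    utilization: float   # fraction of assigned CPU capacity in use, 0.0-1.0


def rebalance(lpars, min_cpus=1, busy=0.80, idle=0.30):
    """Move one CPU at a time from the idlest to the busiest partition.

    The thresholds are illustrative assumptions. Returns a list of
    (donor, receiver) name pairs, one per processor moved.
    """
    moves = []
    while True:
        donor = min(lpars, key=lambda p: p.utilization)
        receiver = max(lpars, key=lambda p: p.utilization)
        if (donor is receiver or donor.utilization > idle
                or receiver.utilization < busy or donor.cpus <= min_cpus):
            return moves
        # Work is assumed constant, so utilization scales inversely
        # with the number of processors assigned.
        donor_load = donor.utilization * donor.cpus
        receiver_load = receiver.utilization * receiver.cpus
        donor.cpus -= 1
        receiver.cpus += 1
        donor.utilization = donor_load / donor.cpus
        receiver.utilization = receiver_load / receiver.cpus
        moves.append((donor.name, receiver.name))
```

In this sketch the donor always keeps at least min_cpus processors, and
the loop stops as soon as no partition is both idle enough to donate and
busy enough to receive, so the number of processors on the SMP is
conserved.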
Some of the benefits offered by an SMP with LPAR capabilities are:
1. Servers can be consolidated by simply placing the workloads of
several smaller servers into separate LPARs of a big SMP, hence reducing
and unifying systems administration tasks.
2. Workloads can be separated by designating separate LPARs to run
different workloads, for example, one LPAR for development work, one
LPAR for testing, and several LPARs for production workloads.
3. An application, subsystem, or OS can run at its optimal
performance and scalability on an LPAR that has been given the optimal
amounts of physical resources for that specific instance of the
application, subsystem, or OS.
DLPAR additionally enables the following autonomic features in a
system:
* Dynamic Capacity on Demand (DCOD)--DLPAR enables cross-partition
workload management, which is particularly important for server
consolidation, in that it can be used to better leverage system
resources across partitions, thereby achieving higher levels of resource
utilization, resulting in enhanced system throughput. Here is a possible
usage scenario: The LPARs on an SMP are the servers for workloads
originating from users in different time zones of the country, or even
from different cities around the globe. While one LPAR
"sleeps," its spare resources can be shifted to another LPAR
that "wakes up" to do its work for the day. This shifting can
be done manually via operator command, and then later can be automated
via the Global Resource Manager (GRM, an automated resource balancer
across a specified group of LPARs in an SMP, based on OS utilization and
needs) or the enterprise WorkLoad Manager (eWLM, an end-to-end
response-time-based load balancer for "instrumented"
applications spanning LPARs in an SMP, or even across SMPs).
* Dynamic Capacity Upgrade on Demand (DCUoD)--DLPAR enables the
upcoming DCUoD feature of the pSeries 690 by allowing customers to
purchase a server with extra unlicensed resource capacity, and later
license and add this capacity dynamically to running AIX LPARs as their
resource requirements increase.
* Dynamic CPU Guard and Dynamic CPU Sparing--DLPAR allows systems
to smoothly replace processors that show intermittent, but correctable,
errors. This self-healing feature will become increasingly important as
silicon device sizes shrink and more and more function is integrated on
a chip. The Dynamic CPU Guard feature is an improved and
dynamic version of the existing RAS feature named CPU Guard in earlier
AIX versions. The older CPU Guard feature predicts the failure of a
running CPU by monitoring certain types of transient errors and
dynamically takes the CPU off line, but it does not provide a substitute
CPU, so that a customer is left with less computing power. Additionally,
the older feature will not allow an SMP to operate with fewer than two
processors. The DLPAR technologies allow the OS to function even with
one processor. In addition, the Dynamic CPU Sparing feature allows the
transparent substitution of a good unlicensed processor for one that is
suspected of being defective. This on-line switch is made seamlessly, so
that applications and kernel extensions are not impacted. The new
processor autonomously replaces the defective one. Dynamic CPU Guard and
Dynamic CPU Sparing work together to protect a customer's
investments through their self-diagnosing and self-healing software.
Both features are planned to be available on pSeries 690 servers in AIX
5.2.
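The difference between Dynamic CPU Guard and Dynamic CPU Sparing
described in the last item above can be sketched as follows. This is a
simplified model, not AIX code: the Partition class, the
report_correctable_error function, and the error threshold are
hypothetical, and the source does not say how many correctable errors
trigger a replacement.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Hypothetical model of an LPAR's processors (not an AIX interface)."""
    active: set                                  # ids of processors in use
    spares: list = field(default_factory=list)   # unlicensed spare processor ids
    errors: dict = field(default_factory=dict)   # cpu id -> correctable errors


ERROR_THRESHOLD = 3  # assumed value; the source does not state a threshold


def report_correctable_error(part, cpu):
    """Record one correctable error and react once the threshold is reached.

    Returns "ok", "spared" (a spare replaced the CPU), "guarded" (the CPU
    was taken off line with no substitute), or "kept" (it is the last
    CPU, so it stays on line).
    """
    if cpu not in part.active:
        raise ValueError(f"cpu {cpu} is not active")
    part.errors[cpu] = part.errors.get(cpu, 0) + 1
    if part.errors[cpu] < ERROR_THRESHOLD:
        return "ok"
    if part.spares:
        # Dynamic CPU Sparing: substitute a spare so capacity is unchanged.
        part.active.remove(cpu)
        part.active.add(part.spares.pop(0))
        return "spared"
    if len(part.active) > 1:
        # Dynamic CPU Guard: deconfigure the suspect CPU; capacity shrinks,
        # but the partition may run on as little as one processor.
        part.active.remove(cpu)
        return "guarded"
    return "kept"
```

With a spare available, the number of active processors stays the same
after a replacement; without one, the partition loses a processor but
keeps running as long as at least one remains.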
The IBM pSeries DLPAR system architecture
The initial release of DR will be supported on the POWER4 pSeries
690 and 670 servers. Figure 1 illustrates the RS/6000* system
architecture for DLPAR. In the diagram, LMB stands for logical memory
block and is the granularity of physical and logical memory assigned to
an LPAR. In AIX 5.2, an LMB will consist of 256 MB of contiguous memory.
The size of an LMB is expected to decrease in future releases of AIX.
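Because memory is assigned to an LPAR in whole LMBs, a memory amount
maps to a count of 256 MB blocks. The arithmetic can be sketched as
below; the function names are hypothetical, and rounding a request up
to the next whole LMB is an assumption made for illustration, since the
text does not say how a request that is not a multiple of the LMB size
is handled.

```python
LMB_MB = 256  # LMB size in AIX 5.2, as stated in the text


def lmbs_needed(request_mb):
    """Return the number of whole LMBs that cover request_mb megabytes."""
    if request_mb <= 0:
        raise ValueError("request must be positive")
    return -(-request_mb // LMB_MB)  # ceiling division on integers


def granted_mb(request_mb):
    """Return the memory assigned, always a multiple of the LMB size."""
    return lmbs_needed(request_mb) * LMB_MB
```

For example, under this rounding assumption a request for 1000 MB maps
to four LMBs, that is, 1024 MB.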
[FIGURE 1 OMITTED]
In this section we list some of the pSeries system components that
had to be modified to become DR-aware in order to implement DLPAR. Some
of these components were introduced during the implementation of static
LPAR (e.g., the hardware management console, hypervisor, global
firmware, and two registers for partition memory management), and some
components existed even before LPAR existed (e.g., local firmware and
AIX). We did not have to introduce any new components for the
implementation of DLPAR; we only made changes to existing ones.
COPYRIGHT 2003 All Rights
Reserved. Reproduced with permission of the copyright holder. Further reproduction or distribution is prohibited without permission.