Toward a new landscape of systems management in an
autonomic computing environment.
by Giovanni Lanfranchi, Pietro Della Peruta, Antonio Perrone, and
Diego Calvanese
Problems cannot be solved at the same level of awareness that
created them.
--Albert Einstein
The information technology (IT) infrastructure that supports
business systems today continues to evolve at a breakneck pace, with the
integration of new devices, servers, and applications creating highly
complex systems. Systems management must evolve similarly in order
to cope with this increasing complexity in IT. Management tools need to
become "smarter" if they are to drive successful business
systems. Performance and availability management tools are, in
particular, key enablers of an efficient and profitable business. A
business cannot execute efficiently unless the mission-critical business
applications and the supporting middleware and operating systems are
available and performing well. Any component failure or poor performance
could adversely impact the business.
To minimize business downtime, the appropriate corrective action must
be executed quickly. An important component of
performance and availability management is the monitoring of the system
in order to detect anomalies as soon as they occur and to take the
necessary corrective actions (e.g., restore the failing objects to their
desired state, or activate backup resources). The monitoring and the
taking of corrective action should be optimized with respect to the
overall objectives of the entire business system. Whenever feasible,
systems management tools with predictive capabilities should be used in
order to detect future problem states, and to allow action to be taken
before a failure occurs and adversely impacts users.
When the tools monitoring the system resources simply collect raw
data (e.g., performance metrics), it can be difficult to draw any
conclusion about the health of a system. A graphical performance monitor
or an event viewer tool can be useful, but only if a skilled
administrator is able to interpret the information provided, and to
determine if a problem exists or not. Only an experienced administrator
has the ability to correlate the appropriate domain knowledge with the
data collected, such as metrics and events, and come up with answers to
questions such as "Is there a memory bottleneck?", "What is its cause?",
and "How can I fix it?" In the current monitoring environment, the system
administrator inherently owns the best practices and applies them in
order to identify and cure problems.
We describe here the approach taken in IBM Tivoli Monitoring, (1)
in which the monitoring tool directly models and implements the relevant
aspects of the domain entities as logical objects, and via this
implementation it transforms the raw data into the information required
in order to detect, correct, and, whenever possible, predict abnormal
system behavior. The system administrator is thus better able to address
IT problems quickly and can concentrate on better serving business
demands.
The best practices embedded in the monitoring tool enable its
autonomic behavior at run time. (2) This
autonomic behavior can be extended to the other two phases of the
application life cycle: the design phase (where the best practices are
created), and the deployment phase (where the best practices are
deployed into the managed systems).
In this paper we first present IBM Tivoli Monitoring as an existing
solution that displays autonomic behavior at run time, and then we focus
on extending this solution to encompass the design time and the
deployment time. In the next section, we provide an overview of IBM
Tivoli Monitoring with a particular focus on the "resource
model" concept. In the following section we present the Systems
Management Ontology with an overview of "description logics"
and their use in representing Common Information Model constructs.
Finally, we link the previous themes in order to illustrate the proposed
approach for autonomic systems management.
The IBM Tivoli Monitoring solution
The IBM Tivoli Monitoring (ITM) solution is based on the resource
model concept. In this section we define the resource model and
illustrate it through an example. We then discuss the application of the
resource model to all phases of the ITM life cycle.
The resource model concept. The resource model is the main tool for
implementing an "identify, notify, and cure" systems
management strategy. A resource model has two parts: a dynamic model and
a reference model.
The Common Information Model (CIM) formalism promoted by the
Distributed Management Task Force is a way to organize the available
information about the managed environment that applies the basic
structuring and conceptualization techniques of the object-oriented
paradigm. (3) The approach uses a uniform modeling formalism that,
together with the basic repertoire of object-oriented constructs,
supports the cooperative development of an object-oriented schema across
multiple organizations. The dynamic model uses CIM classes and CIM
properties to describe resources (such as memory) and their performance
metrics (such as process working set). Moreover, it uses CIM methods to
describe actions that can be executed against the resource (such as
starting a process). A CIM association is then used to represent the
resource within a high-level object (the logical object).
The reference model, implemented as a decision procedure and coded
using either Visual Basic ** or JavaScript **, incorporates best
practices that:
* Interpret the status of a problem object against a defined
baseline using the metrics provided by the dynamic model
* Discover the root cause of the problem
* Correct the problem
* Log performance data related to the domain objects
The reference model describes the "critical paths" within
systems or applications. It interprets the "quality" of
applications and systems against a predefined service level in order to
discover the root cause of the problem and to react accordingly. The
reference model has a direct link to the dynamic model describing the
resources and it analyzes the properties of objects aggregated in the
model, generates indications about the status of those resources, and
invokes methods to correct a problem. The reference model, in effect,
implements the best practices normally used by a system administrator to
detect problems and to identify their root cause.
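The duties listed above can be pictured with a minimal JavaScript sketch of such a decision procedure. It is an illustration under stated assumptions, not the IBM Tivoli Monitoring implementation: the function `runReferenceModel`, the `metric` and `threshold` fields, and the `correct` method are all hypothetical names, and root-cause discovery is reduced to flagging whichever resource exceeds the baseline.

```javascript
// Hypothetical sketch of a reference model's decision procedure.
// None of these names come from CIM or IBM Tivoli Monitoring.
function runReferenceModel(dynamicModel, baseline, log) {
  const indications = [];
  for (const resource of dynamicModel.resources) {
    // Log performance data related to the domain objects.
    log.push({ name: resource.name, value: resource.metric });

    // Interpret the status against the defined baseline.
    if (resource.metric <= baseline.threshold) continue;

    // Simplified root-cause step: the offending resource is the cause.
    indications.push({ resource: resource.name, status: "problem" });

    // Correct the problem by invoking a method of the dynamic model.
    resource.correct();
  }
  return indications;
}

// Usage with made-up resources and a made-up baseline.
const log = [];
const dynamicModel = {
  resources: [
    { name: "A", metric: 10, correct() { this.metric = 0; } },
    { name: "B", metric: 95, correct() { this.metric = 0; } },
  ],
};
const indications = runReferenceModel(dynamicModel, { threshold: 80 }, log);
console.log(indications);
```

Keeping the baseline as a parameter rather than a constant reflects the text's point that status is judged against a predefined service level, which may differ from one installation to another.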
A resource model is created for each well-defined problem and contains
the best practices used to identify and correct it.
Figure 1 contains an example of a resource model used to identify memory
problems in a Microsoft Windows ** machine. The dynamic model is created
using a CIM representation of the resources (Cache, Process, PageFile,
Memory) and the properties that are useful in detecting memory problems.
These classes are then associated with object MemoryModel via the
association class MemoryModelToResource. Class MemoryModel also
contains a set of methods, such as setLargeCache, that can be used to
modify the status of some resource (e.g., Cache).
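The structure just described might be sketched in JavaScript as follows. Real CIM schemas are not written in JavaScript, so this is only an analogy: the class and method names MemoryModel, MemoryModelToResource, and setLargeCache come from the text, whereas the `associate` and `resourcesOf` helpers, the property values, and the body of `setLargeCache` (toggling a `largeCache` flag) are assumptions made for illustration.

```javascript
// Illustrative only: plain objects standing in for CIM instances.
class MemoryModel {
  constructor() {
    // Each entry plays the role of one MemoryModelToResource association.
    this.associations = [];
  }

  // Hypothetical helper: link a resource instance to this logical object.
  associate(className, instance) {
    this.associations.push({ className, instance });
  }

  // Hypothetical helper: return the associated instances of one class.
  resourcesOf(className) {
    return this.associations
      .filter((a) => a.className === className)
      .map((a) => a.instance);
  }

  // A method that modifies the status of a resource (here, Cache).
  // The text does not say what the real method changes; setting a
  // `largeCache` flag is an assumption.
  setLargeCache(enabled) {
    for (const cache of this.resourcesOf("Cache")) {
      cache.largeCache = enabled;
    }
  }
}

const memoryModel = new MemoryModel();
memoryModel.associate("Memory", { Available: 1024 });   // made-up value
memoryModel.associate("Process", { WorkingSet: 2048 }); // made-up value
memoryModel.associate("Cache", { largeCache: false });
memoryModel.associate("PageFile", {});
memoryModel.setLargeCache(true);
console.log(memoryModel.resourcesOf("Cache"));
```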
[FIGURE 1 OMITTED]
The reference model is linked to class MemoryModel (and thus to the
dynamic model) and, using this class, it can access the performance
metrics of the associated resources and apply the encoded best practices
in order to discover if a memory problem exists. When a problem is
detected, the reference model can invoke a well-defined method in
MemoryModel to fix the problem, and also to notify the system operator
about the problem resolution. The best practices are implemented in the
form of an algorithm that analyzes and correlates metrics, and compares
the results against a well-defined baseline.
Let us analyze a simplified resource model for detecting critical
memory leaks (see Figure 2). Classes Memory and Process, as well as
other additional classes not shown in Figure 2, constitute the dynamic
model. MyMemory is an instance of class Memory, whereas Netscape is an
instance of class Process. Notice that property Available of class
Memory, and properties PrivateBytes and WorkingSet of class Process are
represented in class instances MyMemory and Netscape by their numerical
values, respectively 256, 388 958, and 109 909. The function of layer
Provider is to gather the raw metrics (from the actual resources) in
order to supply the above-mentioned values for properties. The reference
model controls the data gathering and interprets the data according to
best-practices algorithms in order to detect the critical "memory
leak" condition.
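As a rough sketch of the provider's role, the JavaScript below fills class instances from a provider object. The provider here returns canned values instead of querying a real system; `provider.read` and `instantiate` are hypothetical names, and the figures quoted in the text are read as the integers 256, 388958, and 109909.

```javascript
// Hypothetical provider: returns canned raw metrics rather than querying
// an operating system. The values are those quoted in the text.
const provider = {
  read(className, instanceName) {
    const raw = {
      "Memory/MyMemory": { Available: 256 },
      "Process/Netscape": { PrivateBytes: 388958, WorkingSet: 109909 },
    };
    return raw[`${className}/${instanceName}`];
  },
};

// Build a class instance whose properties are supplied by the provider.
function instantiate(className, instanceName) {
  return { className, instanceName, ...provider.read(className, instanceName) };
}

// The caller (the reference model, in the text) decides when to gather data.
const myMemory = instantiate("Memory", "MyMemory");
const netscape = instantiate("Process", "Netscape");
console.log(myMemory.Available, netscape.PrivateBytes, netscape.WorkingSet);
// prints: 256 388958 109909
```

Separating the provider from the instances means the decision logic sees only named properties and never needs to know how a metric was obtained.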
[FIGURE 2 OMITTED]
A "critical memory leak" occurs when the conditions
"low available memory with high working set" and "memory
leak in private bytes" exist. The condition "low available
memory with high working set" can be detected by examining the
number of bytes used for the working set of the process (property
bytTotalWorkingSet), the number of bytes used as cache (bytTotalCache)
and the total memory available (bytTotalAvail). Those three values can
be calculated using "raw" metrics as follows.
* List all the process working sets and store the high working set
in bytTotalWorkingSet
* Store the metric Cache from the Memory object in bytTotalCache
* Store the metric Available from the Memory object in
bytTotalAvail
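The three steps above can be sketched in JavaScript as shown here. The sketch assumes that "the high working set" means the largest per-process working set; if a sum is intended, the `Math.max` call would be replaced by a summation. The function name `collectMemoryValues`, the input object shapes, and the sample numbers are hypothetical.

```javascript
// Sketch only: `memory` and `processes` are hypothetical plain objects
// holding raw metrics, with made-up values.
function collectMemoryValues(memory, processes) {
  // List all the process working sets.
  const workingSets = processes.map((p) => p.WorkingSet);
  return {
    // Assumption: "the high working set" is the largest one (0 if none).
    bytTotalWorkingSet: Math.max(0, ...workingSets),
    // Metric Cache from the Memory object.
    bytTotalCache: memory.Cache,
    // Metric Available from the Memory object.
    bytTotalAvail: memory.Available,
  };
}

const values = collectMemoryValues(
  { Cache: 256, Available: 256 },
  [{ WorkingSet: 128 }, { WorkingSet: 512 }, { WorkingSet: 64 }]
);
console.log(values);
```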
The values collected for system resources are further processed in
order to detect the "low available memory with high working
set" condition.
bytTotalRAM = bytTotalWorkingSet + bytTotalCache + bytTotalAvail
numPercentWS = (bytTotalWorkingSet / bytTotalRAM) * 100
numPercentCache = (bytTotalCache / bytTotalRAM) * 100
numPercentAvail = (bytTotalAvail / bytTotalRAM) * 100
If (numPercentWS > numPercentCache and
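The listing above breaks off partway through its condition, so the full test cannot be reproduced here. The arithmetic that precedes it can still be worked through in a short JavaScript sketch. The function name `memoryPercentages`, the zero-total guard, and the sample values are assumptions; only the first clause of the condition, the one visible in the listing, is evaluated.

```javascript
// Percentages of RAM taken by working set, cache, and available memory,
// following the formulas in the listing. Sample inputs are made up.
function memoryPercentages({ bytTotalWorkingSet, bytTotalCache, bytTotalAvail }) {
  const bytTotalRAM = bytTotalWorkingSet + bytTotalCache + bytTotalAvail;
  // Guard added for this sketch; the listing does not show one.
  if (bytTotalRAM === 0) throw new RangeError("bytTotalRAM is zero");
  return {
    numPercentWS: (bytTotalWorkingSet / bytTotalRAM) * 100,
    numPercentCache: (bytTotalCache / bytTotalRAM) * 100,
    numPercentAvail: (bytTotalAvail / bytTotalRAM) * 100,
  };
}

const pct = memoryPercentages({
  bytTotalWorkingSet: 512,
  bytTotalCache: 256,
  bytTotalAvail: 256,
});

// Only the clause visible in the listing; the remaining clauses are cut
// off in the source and are deliberately not guessed at.
const firstClauseHolds = pct.numPercentWS > pct.numPercentCache;
console.log(pct, firstClauseHolds);
```

With these inputs bytTotalRAM is 1024, so the working set accounts for 50 percent and the cache and available memory for 25 percent each, and the visible clause holds.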
COPYRIGHT 2003 All Rights
Reserved. Reproduced with permission of the copyright holder. Further reproduction or distribution is prohibited without permission.