Toward a new landscape of systems management in an autonomic computing environment.


by Giovanni Lanfranchi, Pietro Della Peruta, Antonio Perrone, and Diego Calvanese
IBM Systems Journal • March 2003

Problems cannot be solved at the same level of awareness that created them.

--Albert Einstein

The information technology (IT) infrastructure that supports business systems today continues to evolve at a breakneck pace, with the integration of new devices, servers, and applications creating highly complex systems. Systems management needs to similarly evolve in order to cope with this increasing complexity in IT. Management tools need to become "smarter" if they are to drive successful business systems. Performance and availability management tools are, in particular, key enablers of an efficient and profitable business. A business cannot execute efficiently unless the mission-critical business applications and the supporting middleware and operating systems are available and performing well. Any component failure or poor performance could adversely impact the business.

Minimizing business downtime requires the speedy execution of the appropriate corrective action. An important component of performance and availability management is the monitoring of the system in order to detect anomalies as soon as they occur and to take the necessary corrective actions (e.g., restore the failing objects to their desired state, or activate backup resources). The monitoring and the taking of corrective action should be optimized with respect to the overall objectives of the entire business system. Whenever feasible, systems management tools with predictive capabilities should be used in order to detect future problem states, and to allow action to be taken before a failure occurs and adversely impacts users.

When the tools monitoring the system resources simply collect raw data (e.g., performance metrics), it can be difficult to draw any conclusion about the health of a system. A graphical performance monitor or an event viewer tool can be useful, but only if a skilled administrator is able to interpret the information provided and to determine whether a problem exists. Only an experienced administrator has the ability to correlate the appropriate domain knowledge with the data collected, such as metrics and events, and come up with answers to questions such as "Is there a memory bottleneck?", "What is its cause?", and "How can I fix it?" In the current monitoring environment, the system administrator inherently owns the best practices and applies them in order to identify and cure problems.

We describe here the approach taken in IBM Tivoli Monitoring, (1) in which the monitoring tool directly models and implements the relevant aspects of the domain entities as logical objects. Through this implementation, it transforms the raw data into the information required to detect, correct, and, whenever possible, predict abnormal system behavior. The system administrator is thus better able to address IT problems quickly and to concentrate on better serving business demands.

The best practices embedded in the monitoring tool enable the autonomic behavior of the monitoring tool at run time. (2) This autonomic behavior can be extended to the other two phases of the application life cycle: the design phase (where the best practices are created), and the deployment phase (where the best practices are deployed into the managed systems).

In this paper we first present IBM Tivoli Monitoring as an existing solution that displays autonomic behavior at run time, and then we focus on extending this solution to encompass the design time and the deployment time. In the next section, we provide an overview of IBM Tivoli Monitoring with a particular focus on the "resource model" concept. In the following section we present the Systems Management Ontology with an overview of "description logics" and their use in representing Common Information Model constructs. Finally, we link the previous themes in order to illustrate the proposed approach for autonomic systems management.

The IBM Tivoli Monitoring solution

The IBM Tivoli Monitoring (ITM) solution is based on the resource model concept. In this section we define the resource model and illustrate it through an example. We then discuss the application of the resource model to all phases of the ITM life cycle.

The resource model concept. The resource model is the main tool for implementing an "identify, notify, and cure" systems management strategy. A resource model has two parts: a dynamic model and a reference model.

The Common Information Model (CIM) formalism promoted by the Distributed Management Task Force is a way to organize the available information about the managed environment that applies the basic structuring and conceptualization techniques of the object-oriented paradigm. (3) The approach uses a uniform modeling formalism that, together with the basic repertoire of object-oriented constructs, supports the cooperative development of an object-oriented schema across multiple organizations. The dynamic model uses CIM classes and CIM properties to describe resources (such as memory) and their performance metrics (such as process working set). Moreover, it uses CIM methods to describe actions that can be executed against the resource (such as starting a process). The CIM association is then used to represent the resource within a high level object (logical object).

The reference model, implemented as a decision procedure and coded using either Visual Basic ** or JavaScript **, incorporates best practices that:

* Interpret the status of a problem object against a defined baseline using the metrics provided by the dynamic model

* Discover the root cause of the problem

* Correct the problem

* Log performance data related to the domain objects

The reference model describes the "critical paths" within systems or applications. It interprets the "quality" of applications and systems against a predefined service level in order to discover the root cause of the problem and to react accordingly. The reference model has a direct link to the dynamic model describing the resources and it analyzes the properties of objects aggregated in the model, generates indications about the status of those resources, and invokes methods to correct a problem. The reference model, in effect, implements the best practices normally used by a system administrator to detect problems and to identify their root cause.

A resource model is created for each problem and it contains the best practices used to identify and correct a well-defined problem. Figure 1 contains an example of a resource model used to identify memory problems in a Microsoft Windows ** machine. The dynamic model is created using a CIM representation of the resources (Cache, Process, PageFile, Memory) and the properties that are useful in detecting memory problems. These classes are then associated via association class MemoryModelToResource with object MemoryModel. Class MemoryModel also contains a set of methods, such as setLargeCache, that can be used to modify the status of some resource (e.g., Cache).
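The structure of the dynamic model described above can be illustrated with a small Python sketch. This is not ITM or CIM code: the class names Resource, MemoryModelToResource, and MemoryModel mirror the names in the text, but the attach and resources helpers, the "LargeCache" property name, and the effect of setLargeCache are assumptions made only for illustration, since the article names the method without describing its behavior.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Resource:
    """A managed resource identified by its CIM-style class name."""
    cim_class: str  # e.g. "Cache", "Process", "PageFile", "Memory"
    properties: Dict[str, float] = field(default_factory=dict)


@dataclass(eq=False)
class MemoryModelToResource:
    """Association linking the logical object to one resource."""
    model: "MemoryModel"
    resource: Resource


@dataclass(eq=False)
class MemoryModel:
    """Logical object that aggregates the resources used to detect memory problems."""
    associations: List[MemoryModelToResource] = field(default_factory=list)

    def attach(self, resource: Resource) -> None:
        # Hypothetical helper: record an association to the given resource.
        self.associations.append(MemoryModelToResource(self, resource))

    def resources(self, cim_class: str) -> List[Resource]:
        # Hypothetical helper: follow the associations to resources of one class.
        return [a.resource for a in self.associations
                if a.resource.cim_class == cim_class]

    def setLargeCache(self, enabled: bool) -> None:
        # Assumed effect: flip an illustrative "LargeCache" property on each Cache.
        for cache in self.resources("Cache"):
            cache.properties["LargeCache"] = 1.0 if enabled else 0.0


model = MemoryModel()
for name in ("Cache", "Process", "PageFile", "Memory"):
    model.attach(Resource(name))
model.setLargeCache(True)
```

The point of the sketch is the shape of the model: resources are reached only through the association, and corrective methods live on the logical object.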

[FIGURE 1 OMITTED]

The reference model is linked to class MemoryModel (and thus to the dynamic model) and, using this class, it can access the performance metrics of the associated resources and apply the encoded best practices in order to discover if a memory problem exists. When a problem is detected, the reference model can invoke a well-defined method in MemoryModel to fix the problem, and also to notify the system operator about the problem resolution. The best practices are implemented in the form of an algorithm that analyzes and correlates metrics, and compares the results against a well-defined baseline.
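As a rough illustration of this flow (read metrics through the logical object, compare against a baseline, correct, then notify), consider the following Python sketch. The article states that reference models are coded in Visual Basic or JavaScript; Python is used here only for brevity. The baseline key "MinAvailable", its value, and the choice of setLargeCache as the corrective action are all hypothetical and are not taken from ITM.

```python
from typing import Callable, Dict


class MemoryModel:
    """Minimal stand-in for the logical object: metrics plus one corrective method."""

    def __init__(self, metrics: Dict[str, float]) -> None:
        self.metrics = metrics
        self.large_cache = True

    def setLargeCache(self, enabled: bool) -> None:
        self.large_cache = enabled


def run_reference_model(model: MemoryModel,
                        baseline: Dict[str, float],
                        notify: Callable[[str], None]) -> bool:
    """Return True if a problem was detected, corrected, and reported."""
    available = model.metrics["Available"]
    if available >= baseline["MinAvailable"]:
        return False  # within the baseline: nothing to do
    model.setLargeCache(False)  # hypothetical cure, chosen only for illustration
    notify(f"Low memory ({available}); large cache disabled")
    return True


messages = []
low = MemoryModel({"Available": 256})
detected = run_reference_model(low, {"MinAvailable": 1024}, messages.append)
```

Passing the notification channel in as a callable keeps the decision procedure separate from how the operator is informed.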

Let us analyze a simplified resource model for detecting critical memory leaks (see Figure 2). Classes Memory and Process, as well as other additional classes not shown in Figure 2, constitute the dynamic model. MyMemory is an instance of class Memory, whereas Netscape is an instance of class Process. Notice that property Available of class Memory, and properties PrivateBytes and WorkingSet of class Process are represented in class instances MyMemory and Netscape by their numerical values, respectively 256, 388 958, and 109 909. The function of layer Provider is to gather the raw metrics (from the actual resources) in order to supply the above-mentioned values for properties. The reference model controls the data gathering and interprets the data according to best-practices algorithms in order to detect the critical "memory leak" condition.
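The relationship between the Provider layer and the class instances can be sketched in Python as follows. The numerical values are the ones quoted above for MyMemory and Netscape (the article does not state their units); the static_provider function, the RAW_METRICS table, and the load_memory and load_process helpers are assumptions standing in for whatever mechanism actually gathers metrics from the real resources.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Values quoted in the text for the Figure 2 instances; units are not stated there.
RAW_METRICS: Dict[Tuple[str, str], Dict[str, int]] = {
    ("Memory", "MyMemory"): {"Available": 256},
    ("Process", "Netscape"): {"PrivateBytes": 388958, "WorkingSet": 109909},
}

Provider = Callable[[str, str], Dict[str, int]]


def static_provider(cim_class: str, instance: str) -> Dict[str, int]:
    """Hypothetical provider: returns canned metrics instead of querying the OS."""
    return dict(RAW_METRICS[(cim_class, instance)])


@dataclass
class Memory:
    Available: int


@dataclass
class Process:
    name: str
    PrivateBytes: int
    WorkingSet: int


def load_memory(provider: Provider, instance: str) -> Memory:
    raw = provider("Memory", instance)
    return Memory(Available=raw["Available"])


def load_process(provider: Provider, instance: str) -> Process:
    raw = provider("Process", instance)
    return Process(name=instance,
                   PrivateBytes=raw["PrivateBytes"],
                   WorkingSet=raw["WorkingSet"])


my_memory = load_memory(static_provider, "MyMemory")
netscape = load_process(static_provider, "Netscape")
```

Because the class instances only see the provider's return values, a real provider could replace static_provider without changing the model classes.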

[FIGURE 2 OMITTED]

A "critical memory leak" occurs when the conditions "low available memory with high working set" and "memory leak in private bytes" exist. The condition "low available memory with high working set" can be detected by examining the number of bytes used for the working set of the process (property bytTotalWorkingSet), the number of bytes used as cache (bytTotalCache) and the total memory available (bytTotalAvail). Those three values can be calculated using "raw" metrics as follows.

* List all the process working sets and store the high working set in bytTotalWorkingSet

* Store the metric Cache from Memory object in bytTotalCache

* Store the metric Available from the Memory object in bytTotalAvail
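The three collection steps above can be sketched in Python. One point is an interpretation rather than something the article states: the text says to store "the high working set", while the variable name and the later bytTotalRAM formula suggest a total, so this sketch assumes the sum of all process working sets. The function name collect_totals and the MemoryTotals container are hypothetical.

```python
from typing import Dict, Iterable, NamedTuple


class MemoryTotals(NamedTuple):
    bytTotalWorkingSet: int
    bytTotalCache: int
    bytTotalAvail: int


def collect_totals(process_working_sets: Iterable[int],
                   memory_metrics: Dict[str, int]) -> MemoryTotals:
    """Derive the three totals from raw Process and Memory metrics."""
    return MemoryTotals(
        # Assumption: "Total" means the sum over all processes; use max()
        # instead if the largest single working set is what is intended.
        bytTotalWorkingSet=sum(process_working_sets),
        bytTotalCache=memory_metrics["Cache"],
        bytTotalAvail=memory_metrics["Available"],
    )


totals = collect_totals([100, 300], {"Cache": 50, "Available": 150})
```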

The values collected for system resources are further processed in order to detect the "low available memory with high working set" condition.

bytTotalRAM = bytTotalWorkingSet + bytTotalCache + bytTotalAvail

numPercentWS = (bytTotalWorkingSet/bytTotalRAM) * 100

numPercentCache = (bytTotalCache/bytTotalRAM) * 100

numPercentAvail = (bytTotalAvail/bytTotalRAM) * 100
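The percentage calculations can be written as a short Python function, shown here as a sketch. The variable names follow the article; the function name memory_percentages and the guard against a zero total are additions for illustration. The condition that tests these percentages is not modeled here.

```python
from typing import Dict


def memory_percentages(bytTotalWorkingSet: int,
                       bytTotalCache: int,
                       bytTotalAvail: int) -> Dict[str, float]:
    """Express working set, cache, and available memory as shares of their sum."""
    bytTotalRAM = bytTotalWorkingSet + bytTotalCache + bytTotalAvail
    if bytTotalRAM == 0:
        # Added guard: the formulas are undefined when all three inputs are zero.
        raise ValueError("total memory is zero; percentages are undefined")
    return {
        "numPercentWS": (bytTotalWorkingSet / bytTotalRAM) * 100,
        "numPercentCache": (bytTotalCache / bytTotalRAM) * 100,
        "numPercentAvail": (bytTotalAvail / bytTotalRAM) * 100,
    }


percentages = memory_percentages(500, 250, 250)
```

With the inputs 500, 250, and 250, the working set accounts for half of the total and the cache and available memory for a quarter each.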

If (numPercentWS > numPercentCache and


COPYRIGHT 2003 All Rights Reserved. Reproduced with permission of the copyright holder. Further reproduction or distribution is prohibited without permission.
Copyright 2003, Gale Group. All rights reserved. Gale Group is a Thomson Corporation Company.
NOTE: All illustrations and photos have been removed from this article.

