Clockwork: a new movement in autonomic systems.
by Lance W. Russell, Stephen P. Morgan, and Edward G. Chron
Large computing systems, especially those running time-varying
workloads, are difficult to keep tuned. Dozens of interacting parameters
may need to be understood and adjusted. Even if a system is tuned well
at one point, because of changing workloads it may end up being poorly
tuned at some other point. Badly tuned systems not only perform poorly,
they also waste resources and frustrate users.
There is substantial and growing interest in autonomic systems,
that is, systems that dynamically self-regulate. A key aspect of
self-regulation is self-tuning. Current work on autonomic tuning is only
slightly more advanced than static tuning; largely, such work revolves
around primitive notions of reactive autonomicity, based on feedback
control. Reactive autonomic systems reconfigure on the basis of
instantaneous need or, at best, on the basis of short-term historical
measurements. As with any technique involving feedback control,
reactive autonomic systems carry with them the well-known problems of
potential instability or slow response to change.
In the next section of this paper, we propose a new approach to the
problem. We introduce the concept of predictive autonomicity, based on
feedforward control. We outline a general method, which we call
Clockwork, for constructing predictive autonomically tuned systems.
Using statistical modeling, tracking, and forecasting techniques
borrowed from econometrics, systems employing the Clockwork method
detect and forecast cyclic variations in their load, estimate the impact
of the variations on future performance, and use these data to
reconfigure themselves, in anticipation of need.
The third section describes a prototype, scalable network attached
storage (NAS) system that we built using Clockwork, demonstrating the
feasibility of the method. A network attached store is a network file
server that processes requests sent to it using a protocol such as
Network File System (NFS), (1) over a medium such as Ethernet, by one or
more clients. NFS, layered in turn on the Transmission Control
Protocol/Internet Protocol (TCP/IP) suite, uses a remote procedure call
architecture, in which every request from a client to a server engenders
a response from the server to the client. Typical NFS requests are to
create a file, to write data to a file, to read data from a file, and to
delete a file. A response indicates whether the corresponding request
was processed without error and, if so, contains request-specific data,
for example, file contents from a read.
An NAS acts as a central repository for data shared among clients.
With an NAS, clients need not each store the data, reducing cost.
Clients need not coordinate updates to the data, simplifying their
workings. Data management may be centralized, simplifying management and
reducing costs. Small computers may be deployed widely; alternatively,
large systems may be scaled further. It is desirable to have a powerful
NAS to support more clients or to process more work from the same number
of clients. For this paper, we prototyped one with a scalable
architecture, integrating multiple stores into a single, virtual NAS.
Requests are sent to the virtual NAS and are spread among the individual
stores. The advantage of the architecture is that systems of various
capabilities, including a very powerful system, may be built from
relatively inexpensive components. The disadvantage is that the overall
performance of a system will be only as good as that of its worst
performing store. Although a virtual NAS could be massively
overprovisioned to minimize the effect of one poorly performing store,
that would reduce the advantage of the architecture. Alternatively,
autonomic tuning could be used to balance load among the stores. We
chose the latter approach.
Key performance measurements of the NAS prototype, demonstrating
the practicality of the method, are presented in the fourth section.
Finally, in the fifth section, directions for future work are suggested.
The Clockwork method
Clockwork is a general method, analogous to those already in wide
industrial use by electric power utilities and retail chains, for
example. It enables a predictive autonomic system to be implemented
following five simple steps, summarized in Table 1. The first two are
configuration steps. They establish a system objective and a means to
track it with load. The remaining three are operational steps. They
automatically and continually track, forecast, and control the system.
A system that cannot be measured cannot be managed. Clockwork first
establishes a simple, quantifiable objective, comprising a performance
objective and a confidence level. For an electric utility, an
appropriate performance objective would be to meet the instantaneous
demand for electricity reliably. A potential performance objective for a
retail chain would be to achieve a certain in-stock ratio, a measure of
how much product is in stock at a given time. For an NAS, achieving a
certain average response time would be a suitable performance objective.
The confidence level measures how closely the system must meet its
performance objective. For example, the electric utility might need to
meet demand 99.99999 percent of the time, the retail chain might need to
achieve the in-stock ratio 90 percent of the time, and the NAS might
need to achieve the average response time 66 percent of the time.
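The pairing of a performance objective with a confidence level can be made concrete with a short sketch. It assumes one reading of the NAS example: response time is averaged over fixed measurement intervals, and the objective holds if the required fraction of intervals meets the target. The names Objective, fraction_met, and objective_satisfied are hypothetical and are not taken from the prototype.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Objective:
    """A performance objective paired with a confidence level (hypothetical type)."""
    target: float      # e.g. average response-time target, in milliseconds
    confidence: float  # fraction of intervals that must meet the target, 0..1


def fraction_met(interval_means: Sequence[float], objective: Objective) -> float:
    """Return the fraction of intervals whose average met the target."""
    if not interval_means:
        raise ValueError("need at least one measurement interval")
    met = sum(1 for mean in interval_means if mean <= objective.target)
    return met / len(interval_means)


def objective_satisfied(interval_means: Sequence[float], objective: Objective) -> bool:
    """True if the target was met at least as often as the confidence level requires."""
    return fraction_met(interval_means, objective) >= objective.confidence
```

With a 10 ms target and a 66 percent confidence level, six intervals averaging 8, 9, 12, 7, 11, and 9 ms would satisfy the objective, since four of the six meet the target.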
Often, objectives are subclassed. Some electric utility customers
may be willing to trade decreased reliability for lower cost, some
retail chains may require tighter controls on in-stock ratios for more
profitable products, and some NAS clients may be willing to trade
increased response time for lower cost. For brevity, the present
discussion ignores subclassed objectives, although Clockwork can handle
them.
Clockwork, in the second step, establishes a simple, quantifiable
measure of demand. An appropriate measure for an electric utility would
be electricity being consumed; for a retail chain it would be sales
being made; and for an NAS, it would be requests being processed.
Tracking the objective (and its variance) in relation to demand is
the third step. An electric utility would track how reliably it met
electricity demand, the time it took (or would take) for generators to
be spun up, and instantaneous capacity, as electricity was being
consumed; a retail chain would track product in-stock ratios and product
distribution times, as sales were being made; and an NAS would track
response time, as requests were being processed.
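One way to track an objective and its variance in relation to demand is to keep running statistics of response time for each band of request rate. The sketch below is an illustration under that assumption, not a description of the prototype's bookkeeping. It uses Welford's online algorithm for the mean and variance; the class names and the bucket_width parameter are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Optional, Tuple


class RunningStats:
    """Welford's online algorithm for a running mean and sample variance."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; reported as zero until two observations exist.
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


class ObjectiveTracker:
    """Tracks response time against demand, grouped into demand buckets."""

    def __init__(self, bucket_width: float) -> None:
        if bucket_width <= 0:
            raise ValueError("bucket_width must be positive")
        self.bucket_width = bucket_width
        self.buckets: Dict[int, RunningStats] = defaultdict(RunningStats)

    def _bucket(self, demand: float) -> int:
        return int(demand // self.bucket_width)

    def record(self, demand: float, response_time: float) -> None:
        """Record one observation: demand (requests/s) and response time."""
        self.buckets[self._bucket(demand)].add(response_time)

    def summary(self, demand: float) -> Optional[Tuple[float, float]]:
        """Return (mean, variance) of response time at this demand level, if seen."""
        stats = self.buckets.get(self._bucket(demand))
        return None if stats is None else (stats.mean, stats.variance)
```

Bucketing by demand keeps the sketch short; a fitted curve of response time against request rate would serve the same purpose.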
In the fourth step, demand is forecast, along with uncertainty,
using autoregressive time series procedures. This technique projects
future values of a variable based on the history of that variable alone,
which simplifies forecasting considerably. A key contribution of
Clockwork is that the same procedure would be used by the utility, the
retail chain, and the NAS.
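As a minimal illustration of autoregressive forecasting with uncertainty, the sketch below fits a first-order model, y[t] = c + phi * y[t-1] + noise, by ordinary least squares and projects it forward. The model order is an assumption made for brevity; this section does not state the order or the seasonal terms that Clockwork uses, and capturing cyclic load would require additional lags. The function names are hypothetical, and the forecast standard error ignores uncertainty in the fitted parameters.

```python
import math
from typing import List, Sequence, Tuple


def fit_ar1(series: Sequence[float]) -> Tuple[float, float, float]:
    """Least-squares fit of y[t] = c + phi * y[t-1] + noise.

    Returns (c, phi, sigma), where sigma is the residual standard deviation.
    """
    data = list(series)
    if len(data) < 4:
        raise ValueError("need at least four observations")
    x, y = data[:-1], data[1:]  # each value paired with its successor
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("series is constant; phi cannot be estimated")
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    phi = sxy / sxx
    c = mean_y - phi * mean_x
    sse = sum((b - (c + phi * a)) ** 2 for a, b in zip(x, y))
    sigma = math.sqrt(sse / (n - 2))  # two fitted parameters
    return c, phi, sigma


def forecast_ar1(last: float, c: float, phi: float, sigma: float,
                 steps: int) -> List[Tuple[float, float]]:
    """Return [(point forecast, standard error)] for 1..steps ahead.

    The h-step error variance is sigma^2 * (1 + phi^2 + ... + phi^(2(h-1))).
    """
    forecasts = []
    value, var_sum = last, 0.0
    for h in range(steps):
        value = c + phi * value
        var_sum += phi ** (2 * h)
        forecasts.append((value, sigma * math.sqrt(var_sum)))
    return forecasts
```

The standard error grows with the horizon, which is why a forecast carries its uncertainty along with it: a controller can act more conservatively on distant forecasts than on near ones.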
Fifth and finally, the controllable parameters of the system are
adjusted to meet the objective. In anticipation of forecast demand, the
electric utility would bring its generators on or off line, would buy or
sell electricity or options to do the same, or would activate or
deactivate its spinning reserve; the retail chain would issue store
orders to its distribution centers, would issue purchase orders to its
vendors, or would liquidate its excess inventory; and the NAS would
reassign files to stores, would replicate files among or migrate files
between stores, or would bring stores on or off line.
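For the NAS case, the control step of reassigning files to stores can be sketched as a load-balancing problem: given a forecast load per file, choose a store for each file so that no store is much busier than the others. The sketch below uses a simple greedy heuristic (heaviest file first, onto the currently lightest store). This heuristic is an illustrative stand-in, not the algorithm used in the prototype, and the function names and load units are hypothetical.

```python
import heapq
from typing import Dict, List


def assign_files(forecast_load: Dict[str, float], stores: List[str]) -> Dict[str, str]:
    """Greedily map each file to a store, heaviest forecast load first."""
    if not stores:
        raise ValueError("need at least one store")
    # Min-heap of (total assigned load, store name); lightest store pops first.
    heap = [(0.0, store) for store in stores]
    heapq.heapify(heap)
    assignment: Dict[str, str] = {}
    # Sort by descending load, breaking ties by name for determinism.
    for name, load in sorted(forecast_load.items(), key=lambda kv: (-kv[1], kv[0])):
        total, store = heapq.heappop(heap)
        assignment[name] = store
        heapq.heappush(heap, (total + load, store))
    return assignment


def store_loads(assignment: Dict[str, str], forecast_load: Dict[str, float],
                stores: List[str]) -> Dict[str, float]:
    """Sum the forecast load that an assignment places on each store."""
    loads = {store: 0.0 for store in stores}
    for name, store in assignment.items():
        loads[store] += forecast_load[name]
    return loads
```

Because the overall performance of the virtual NAS is limited by its worst-performing store, the quantity of interest is the maximum entry returned by store_loads, which the greedy pass tries to keep small.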
The prototype
In this section, we describe how we used Clockwork to prototype a
scalable, autonomically tuned NAS. Our purpose in building the prototype
was to determine whether the method is feasible and practicable, rather
than to achieve optimal performance. Nevertheless, as the measurements
in the next section show, the prototype performs well. For proof of
concept, and because we were able to operate in a shared-disk
environment, we implemented file reassignment, but not file replication
(copying a file to multiple stores) or migration (moving a file between
stores). Had we been faced with a serially shared disk or a
shared-nothing environment, we would have had to implement
replication and migration.
The prototype comprises three main components: a set of stores, or
storage servers, that process requests for files kept in a cluster file
system; a request router that spreads requests among the stores; and an
autonomic control program that directs the router, following the
Clockwork method. We call the overall system an NAS plex, as it
integrates multiple, otherwise independent systems. The prototype NAS
plex is depicted within the dashed-line area of Figure 1. It includes
four stores, a router, an internal network, and shared disks. Two
clients are connected to the plex via an external network.
[FIGURE 1 OMITTED]
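The division of labor between the router and the control program can be sketched as a routing table with overrides. In a shared-disk plex any store can serve any file, so reassigning a file only changes where its requests are sent. The sketch assumes a hash-based default placement, which this section does not specify; the RequestRouter class and its methods are hypothetical.

```python
import zlib
from typing import Dict, List


class RequestRouter:
    """Spreads requests among stores; a control program may override placement."""

    def __init__(self, stores: List[str]) -> None:
        if not stores:
            raise ValueError("need at least one store")
        self.stores = list(stores)
        self.overrides: Dict[str, str] = {}  # file path -> assigned store

    def reassign(self, path: str, store: str) -> None:
        """Called by the control program to direct a file to a given store."""
        if store not in self.stores:
            raise ValueError(f"unknown store: {store}")
        self.overrides[path] = store

    def route(self, path: str) -> str:
        """Return the store that should process a request for this file."""
        if path in self.overrides:
            return self.overrides[path]
        # Assumed default: a stable hash of the path picks the store.
        index = zlib.crc32(path.encode("utf-8")) % len(self.stores)
        return self.stores[index]
```

A control program following the Clockwork steps would call reassign ahead of a forecast load peak, so that requests are already being spread evenly when the peak arrives.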
COPYRIGHT 2003 All Rights
Reserved. Reproduced with permission of the copyright holder. Further reproduction or distribution is prohibited without permission.