The software architecture of a SAN storage control system.
by Glider, J.S.; Fuente, C.F.; Scales, W.J.
Storage controllers have traditionally enabled mainframe computers
to access disk drives and other storage devices. (1) To support
expensive enterprise-level mainframes built for high performance and
reliability, storage controllers were designed to move data in and out
of mainframe memory as quickly as possible, with as little impact on
mainframe resources as possible. Consequently, storage controllers were
carefully crafted from custom-designed processing and communication
components, and optimized to match the performance and reliability
requirements of the mainframe.
In recent years, several trends in the information technology used
in large commercial enterprises have affected the requirements that are
placed on storage controllers. UNIX ** and Windows ** servers have
gained significant market share in the enterprise. The requirements
placed on storage controllers in a UNIX or Windows environment are less
exacting in terms of response time. In addition, UNIX and Windows
systems require fewer protocols and connectivity options. Enterprise
systems have evolved from a single operating system environment to a
heterogeneous open systems environment in which multiple operating
systems must connect to storage devices from multiple vendors.
Storage area networks (SANs) have gained wide acceptance.
Interoperability issues between components from different vendors
connected by a SAN fabric have received attention and have mostly been
resolved, but the problem of managing the data stored on a variety of
devices from different vendors is still a major challenge to the
industry. At the same time, various components for building storage
systems have become commoditized and are available as inexpensive
off-the-shelf items: high-performance processors (Pentium **-based or
similar), communication components such as Fibre Channel switches and
adapters, and RAID (2) (redundant array of independent disks)
controllers.
In 1996, IBM embarked on a program that eventually led to the IBM
TotalStorage * Enterprise Storage Server * (ESS). The ESS core
components include such standard components as the PowerPC *-based
pSeries * platform running AIX * (a UNIX operating system built by IBM)
and the RAID adapter. ESS also includes custom-designed components such
as nonvolatile memory, adapters for host connectivity through SCSI
(Small Computer System Interface) buses, and adapters for Fibre Channel,
ESCON * (Enterprise Systems Connection) and FICON * (Fiber Connection)
fabrics. An ESS provides high-end storage control features such as very
large unified caches, support for zSeries * FICON and ESCON attachment
as well as open systems SCSI attachment, high availability through the
use of RAID-5 arrays, failover pairs of access paths and fault-tolerant
power supplies, and advanced storage functions such as point-in-time
copy and peer-to-peer remote copy. An ESS controller, containing two
access paths to data, can have varying amounts of back-end storage,
front-end connections to hosts, and disk cache, thereby achieving a
degree of scalability.
A project to build an enterprise-level storage control system, also
referred to as a "storage virtualization engine," was
initiated at the IBM Almaden Research Center in the second half of 1999.
One of its goals was to build such a system almost exclusively from
off-the-shelf standard parts. Like any enterprise-level storage control
system, it had to deliver high performance and availability, comparable
to the highly optimized storage controllers of previous generations. It
also had to address a major challenge for the heterogeneous open systems
environment, namely to reduce the complexity of managing storage on
block devices. The importance of dealing with the complexity of managing
storage networks is brought to light by the total-cost-of-ownership
(TCO) metric applied to storage networks. A Gartner report (4) indicates
that the storage acquisition costs are only about 20 percent of the TCO.
Most of the remaining costs are related, in one way or another, to
managing the storage system.
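As a rough illustration of what the 20 percent figure implies, the relation can be written out as follows. The symbols A (acquisition cost) and M (all remaining cost) and the dollar amount below are introduced here for illustration only and do not come from the Gartner report.

```latex
\begin{align*}
A &\approx 0.20 \cdot \mathrm{TCO}
  \quad\Longrightarrow\quad \mathrm{TCO} \approx 5A, \\
M &= \mathrm{TCO} - A \approx 4A .
\end{align*}
```

For a hypothetical acquisition cost of $1 million, this gives a TCO of roughly $5 million, of which about $4 million falls outside acquisition.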
Thus, the SAN storage control project targets one area of
complexity through block aggregation, also known as block
virtualization. (5) Block virtualization is an organizational approach
to the SAN in which storage on the SAN is managed by aggregating it into
a common pool, and by allocating storage to hosts from that common pool.
Its chief benefits are efficient and flexible usage of storage capacity,
centralized (and simplified) storage management, and a platform for
advanced storage functions.
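The idea of aggregating storage into a common pool and allocating from it can be sketched in a few lines of Python. This is a minimal illustration, not the engine's actual design: the class names, the fixed extent size, and the first-fit allocation policy are all assumptions made for the example.

```python
from dataclasses import dataclass, field

EXTENT_MB = 16  # hypothetical extent size; the article does not specify one


@dataclass
class BackendDisk:
    """A back-end block device contributing capacity to the pool."""
    name: str
    capacity_mb: int


@dataclass
class StoragePool:
    """Aggregates extents from many back-end disks into one common pool."""
    free_extents: list = field(default_factory=list)   # (disk name, extent index)
    virtual_disks: dict = field(default_factory=dict)  # vdisk name -> list of extents

    def add_disk(self, disk: BackendDisk) -> None:
        # Carve the disk into fixed-size extents and add them to the pool.
        for i in range(disk.capacity_mb // EXTENT_MB):
            self.free_extents.append((disk.name, i))

    def allocate(self, vdisk: str, size_mb: int) -> list:
        # Round the request up to a whole number of extents.
        needed = -(-size_mb // EXTENT_MB)
        if needed > len(self.free_extents):
            raise ValueError("not enough free capacity in the pool")
        extents = self.free_extents[:needed]
        del self.free_extents[:needed]
        self.virtual_disks[vdisk] = extents
        return extents

    def resolve(self, vdisk: str, offset_mb: int) -> tuple:
        # Translate a virtual-disk offset to (back-end disk, offset on that disk).
        disk, index = self.virtual_disks[vdisk][offset_mb // EXTENT_MB]
        return disk, index * EXTENT_MB + offset_mb % EXTENT_MB


pool = StoragePool()
pool.add_disk(BackendDisk("vendorA-lun0", 64))
pool.add_disk(BackendDisk("vendorB-lun3", 32))
pool.allocate("host1-vol", 80)          # spans both back-end disks
print(pool.resolve("host1-vol", 70))    # ('vendorB-lun3', 6)
```

The host sees a single 80 MB volume, while the pool decides which vendor's device backs each extent.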
A Pentium-based server was chosen for the processing platform, in
preference to a UNIX server, because of lower cost. However, the
bandwidth and memory of a typical Pentium-based server are significantly
lower than those of a typical UNIX server. Therefore, instead of a
monolithic architecture of two nodes (for high availability) where each
node has very high bandwidth and memory, the design is based on a
cluster of lower-performance Pentium-based servers, an arrangement that
also offers high availability (the cluster has at least two nodes).
The idea of building a storage control system based on a scalable
cluster of such servers is a compelling one. A storage control system
consisting only of a pair of servers would be comparable in its utility
to a midrange storage controller. However, a scalable cluster of servers
could not only support a wide range of configurations, but also allow
all these configurations to be managed in almost the same way. The
value of a scalable storage control system would go well beyond
building a storage controller with less cost and effort. It would
drastically simplify enterprise storage management by
providing a single point of management, aggregated storage pools in
which storage can easily be allocated to different hosts, scalability in
growing the system by adding storage or storage control nodes, and a
platform for implementing advanced functions such as fast-write cache,
point-in-time copy, transparent data migration, and remote copy.
In contrast, current enterprise data centers are often organized as
many islands, each island containing its own application servers and
storage, where free space from one island cannot be used in another
island. Compare this with a common storage pool from which all requests
for storage, from various hosts, are allocated. Storage management
tasks (such as allocating storage to hosts; scheduling remote copies,
point-in-time copies, and backups; and commissioning and decommissioning
storage) are simplified when a single set of tools is used and all
storage resources are pooled together.
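The difference between islands and a common pool can be made concrete with a small sketch. The island names and capacities below are hypothetical figures chosen for illustration.

```python
def can_satisfy_islands(free_gb_per_island: dict, island: str, request_gb: int) -> bool:
    # In the island model, a host can use only the free space on its own island.
    return free_gb_per_island[island] >= request_gb


def can_satisfy_pool(free_gb_per_island: dict, request_gb: int) -> bool:
    # In the pooled model, all free space is available to every request.
    return sum(free_gb_per_island.values()) >= request_gb


free = {"finance": 40, "engineering": 10, "web": 150}  # hypothetical figures

print(can_satisfy_islands(free, "engineering", 50))  # False
print(can_satisfy_pool(free, 50))                     # True
```

The same 50 GB request fails on the engineering island even though 200 GB is free across the data center; with pooling it succeeds.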
The design of the virtualization engine follows an
"in-band" approach, which means that all I/O requests, as well
as all management and configuration requests, are sent to it and are
serviced by it. This approach migrates intelligence from individual
devices to the network, and its first implementation is appliance-based
(which means that the virtualization software runs on stand-alone
units), although other variations, such as incorporating the
virtualization application into a storage network switch, are possible.
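A toy model may help show what "in-band" means in practice: both configuration requests and I/O requests arrive at the same engine, which translates virtual addresses and issues the back-end I/O itself. The class, its method names, and the simple base-offset mapping are assumptions made for this sketch and are not taken from the actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class InBandEngine:
    """Hypothetical in-band virtualization appliance sitting in the data path."""
    mapping: dict = field(default_factory=dict)  # vdisk -> (back-end device, base block)
    backend: dict = field(default_factory=dict)  # (device, block) -> data

    # Management/configuration request: serviced by the engine itself.
    def create_vdisk(self, vdisk: str, device: str, base_block: int) -> None:
        self.mapping[vdisk] = (device, base_block)

    # I/O requests: the host addresses the virtual disk; the engine translates
    # the address and performs the back-end I/O on the host's behalf.
    def write(self, vdisk: str, block: int, data: bytes) -> None:
        device, base = self.mapping[vdisk]
        self.backend[(device, base + block)] = data

    def read(self, vdisk: str, block: int) -> bytes:
        device, base = self.mapping[vdisk]
        return self.backend[(device, base + block)]


engine = InBandEngine()
engine.create_vdisk("vol0", "array7", base_block=1000)
engine.write("vol0", 5, b"payload")
print(engine.read("vol0", 5))  # b'payload'
```

Because every request passes through the engine, the host never needs to know which back-end device or block actually holds its data.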
There have been other efforts in the industry to build scalable
virtualized storage. The Petal research project from Digital Equipment
Corporation (6) and the DSM ** product from LeftHand Networks, Inc. (7)
both incorporate clusters of storage servers, each server privately
attached to its own back-end storage. Our virtualization engine
prototype differs from these designs in that the back-end storage is
shared by all the servers in the cluster. VERITAS Software Corporation (8)
markets Foundation Suite **, a clustered volume manager that provides
virtualization and storage management. This design has the
virtualization application running on hosts, thus requiring that the
software be installed on all hosts and that all hosts run the same
operating system. Compaq (now part of Hewlett-Packard Company) uses the
Versastor ** technology, (9) which provides a virtualization solution
based on an out-of-band manager appliance controlling multiple in-band
virtualization agents running on specialized Fibre Channel host bus
adapters or other processing elements in the data path. This more
complex structure amounts to a two-level hierarchical architecture in
which a single manager appliance controls a set of slave host-resident
or switch-resident agents.
The rest of the paper is structured as follows. In the next
section, we present an overview of the virtualization engine, which
includes the hardware configuration, a discussion of the main challenges
facing the software designers, and an overview of the software
architecture. In the following four sections we describe the major
software infrastructure components: the cluster operating environment,
the distributed I/O facilities, the buffer management component, and the
hierarchical object pools. In the last section we describe the
experience gained from implementing the virtualization engine and
present our conclusions.
Overview of the virtualization engine
COPYRIGHT 2003 All Rights
Reserved. Reproduced with permission of the copyright holder. Further reproduction or distribution is prohibited without permission.
NOTE: All illustrations and photos have been removed from this article.