The importance of sibling clustering for efficient
bulkload of XML document trees.
by Kanne, Carl-Christian^Moerkotte, Guido
INTRODUCTION
Loading large amounts of data which is already available in an
external format is called a bulkload operation. In conventional database
management systems (DBMSes), bulkloads are often used to initialize a
database, for example, when introducing an application to DBMS usage or
when importing data from a different DBMS or storage format.
In contrast, an XDS (XML Data Store) needs to support document
imports as regular operations that are used very frequently by
applications. Hence, the bulkload becomes a core function whose
performance is a determinant of overall system performance; however, we
could find very few publications that discuss efficient XML bulkload in
large-scale XML data stores.
This paper attempts to mitigate this deficit by presenting the
design and implementation of the bulkload component for the Natix XDS, a
native XML data store developed at the University of Mannheim. (1)
Besides requirements analysis and API (application programming
interface) design, our main focus is the design of an efficient
tree-partitioning algorithm that decomposes the logical XML document
tree into subtrees, or clusters, that fit on a disk page. Such a
clustering algorithm is needed not only in Natix, but in every XDS that
provides native tree storage, such as IBM's System RX. (2)
Of particular concern is the number of clusters generated. When
accessing the stored documents, intercluster navigation is much slower
than intracluster navigation, often by several orders of magnitude. Even
if access reordering techniques are used, (3) the number of clusters is
a crucial factor for query performance. Hence, the bulkload algorithm
must minimize the number of clusters generated.
Our main contributions are
1. a detailed requirements analysis for XML bulkload components,
2. an analysis of existing tree clustering algorithms,
3. a novel linear-time tree-clustering algorithm that generates up
to 30 percent fewer clusters than the best known algorithms,
4. a description of the design and implementation of a concrete XML
bulkloader, the Natix bulkload component, and
5. an evaluation of the Natix bulkload component.
The rest of this paper is organized as follows. In the next
section, "XML storage," we give a brief overview of native
tree storage in Natix. Next, in the section "Requirements," we
analyze the requirements an XML bulkload component must meet. Then, in
"Tree-clustering algorithms," we discuss existing
tree-clustering algorithms with respect to these requirements. In the
section "Natix bulkload component," we describe the interface
and the implementation of the bulkload component. This section also
covers our novel tree-clustering algorithm. In the section
"Evaluation," we provide our experimental results, including a
comparison with the bulkload performance of other XML storage systems.
Our concluding remarks are presented in the last section.
XML STORAGE
The design of the bulkload component is strongly influenced by the
format used for storing XML data in the XDS. A suitable format for an
enterprise-level XDS must efficiently support bulkloads, incremental
updates, synchronization, and recovery. In this section, we briefly
review the Natix format for storing XML data (for details, refer to
References 1, 4, and 5). Similar formats are also used in other XDSes,
such as IBM's System RX. (2)
Logical tree model
The interface to our core storage engine uses a simple, general
logical tree model with only two node types and no XML-specific
constructs, such as attributes. This model enables a simple engine
design and is easily mapped to concrete XML models such as the Document
Object Model (DOM) or the simple API for XML (SAX), as explained in the
next subsection.
The documents are represented as ordered trees in which nodes are
labeled with symbols taken from an alphabet [SIGMA]. In the current
implementation, we use the set of integers from 0 to 2 (16)--1 as
[SIGMA]. Leaf nodes can additionally be labeled with arbitrarily long
strings over a different alphabet (in the current implementation, this
alphabet consists of the set of Unicode characters).
Mapping XML to the logical object model
A small wrapper module maps a concrete XML representation (such as
DOM, SAX) with its node types and attributes to the simple tree model
and vice versa. A sample tree for a document fragment is shown in Figure
1.
[FIGURE 1 OMITTED]
Mapping XML document nodes to logical nodes
Elements are mapped one-to-one to tree nodes of the logical model.
Attributes are mapped to child nodes of an additional attribute
container child node, which is always the first child of the element
node to which the attributes belong. Attributes, PCDATA (including
whitespace-only data), CDATA nodes, and comments are stored as leaf
nodes.
Mapping XML tags to tree labels
As alphabet [SIGMA], the storage engine uses non-negative integers,
which are called DeclarationIDs. The module that maps XML to the logical
tree model also performs the mapping from tag names, attribute names,
and special node labels (such as PCDATA) to DeclarationIDs. All the
documents in an XML collection share the same mapping, which makes query
evaluation simpler and more efficient because the possible integer
values for a given tag or attribute name can be resolved once per query
and stay the same for all documents in the collection.
Physical tree model
Natix organizes physical storage in units known as XML segments. An
XML segment provides a storage area and an interface for storing and
accessing a collection of logical tree instances. A physical tree is the
mapping of a logical tree to physical storage and consists of records,
each of which is smaller than a disk page and contains a subtree of the
logical tree. In this section we use the term record subtree (or simply
subtree) to refer to any subtree of the logical tree that is stored in a
record. The records can be addressed using record IDs (RIDs).
A record subtree contains three types of nodes. Aggregate nodes
represent the inner nodes of the tree and are labeled with non-negative
integers (over the alphabet [SIGMA]). Literal nodes are leaf nodes that
in addition to a label over the alphabet [SIGMA] can also have content
in the form of a string. A proxy node is a special leaf node that points
(refers) to a record subtree that is stored in a separate record and
corresponds to a connected subtree of the logical tree. Substituting all
proxies with the referenced subtrees reconstructs the original logical
tree. Although proxies use RIDs to refer to the target subtree, this is
an implementation detail and not a requirement for applying our
techniques (one could use logical references, such as found in System RX
(2)).
Every record subtree has two attributes: the parent RID, which
points to the parent subtree (if it exists), and the logical document ID
field, which determines the document to which this subtree belongs.
An example of a logical tree and its mapping onto records is shown
in Figure 2. To store the given logical tree (assumed not to fit on a
page), the physical data tree is distributed over the three records
[r.sub.1], [r.sub.2], and [r.sub.3]. Two proxies ([p.sub.1] and
[P.sub.2]) are used in the top level record. Two helper aggregate nodes
([h.sub.1] and [h.sub.2]) have been added to the physical tree. They
group the children below [p.sub.1] and [p.sub.2] into a tree. Proxy and
helper aggregate nodes are drawn with dashed lines. They are only needed
to link together subtrees contained in different records and are called
scaffolding nodes. Nodes drawn with solid lines represent logical nodes
([f.sub.i]), and are called facade nodes. Only facade nodes are visible
to the caller of the XML segment interface.
[FIGURE 2 OMITTED]
The given physical tree is only one possible way of storing the
sample logical tree. Additional possibilities exist as a proxy can be
created on any edge of the logical tree. The maintenance of the physical
tree during incremental updates is described in Reference 1. The initial
creation of a physical tree for a newly imported document is the main
function of the bulkload component described in this paper.
REQUIREMENTS
The requirements for a bulkload component are based on four goals,
all of which are related to performance:
1. The interface should closely match the typical output of XML
parsers. XML parsers are the most common source of imported XML
documents, and many XML tools, among them query evaluation components,
are able to efficiently deliver results using parser-like interfaces.
Hence, it is reasonable to assume that the data to be bulkloaded is
delivered as XML parser output. Otherwise, changing the data
representation before or during the bulkload operation would require
expending additional resources.
2. The bulkload operation should not require main memory
proportional to the document size. Linear memory usage would prohibit
the import of documents larger than available main memory. As a
generalization, the total amount of concurrently importable documents
would be limited by available physical memory.
3. The storage layout for imported XML documents should be
optimized for performance for typical workloads (documents). We identify
three subgoals.
COPYRIGHT 2006 All Rights
Reserved. Reproduced with permission of the copyright holder. Further reproduction or distribution is prohibited without permission.
Copyright 2006, Gale Group. All rights
reserved. Gale Group is a Thomson Corporation Company.
NOTE: All illustrations and photos have been removed from this article.