More Resources

The importance of sibling clustering for efficient bulkload of XML document trees.


by Kanne, Carl-Christian^Moerkotte, Guido
IBM Systems Journal • June, 2006 • Extensible markup language

INTRODUCTION

Loading large amounts of data which is already available in an external format is called a bulkload operation. In conventional database management systems (DBMSes), bulkloads are often used to initialize a database, for example, when introducing an application to DBMS usage or when importing data from a different DBMS or storage format.

In contrast, an XDS (XML Data Store) needs to support document imports as regular operations that are used very frequently by applications. Hence, the bulkload becomes a core function whose performance is a determinant of overall system performance; however, we could find very few publications that discuss efficient XML bulkload in large-scale XML data stores.

This paper attempts to mitigate this deficit by presenting the design and implementation of the bulkload component for the Natix XDS, a native XML data store developed at the University of Mannheim. (1) Besides requirements analysis and API (application programming interface) design, our main focus is the design of an efficient tree-partitioning algorithm that decomposes the logical XML document tree into subtrees, or clusters, that fit on a disk page. Such a clustering algorithm is needed not only in Natix, but in every XDS that provides native tree storage, such as IBM's System RX. (2)

Of particular concern is the number of clusters generated. When accessing the stored documents, intercluster navigation is much slower than intracluster navigation, often by several orders of magnitude. Even if access reordering techniques are used, (3) the number of clusters is a crucial factor for query performance. Hence, the bulkload algorithm must minimize the number of clusters generated.

Our main contributions are

1. a detailed requirements analysis for XML bulkload components,

2. an analysis of existing tree clustering algorithms,

3. a novel linear-time tree-clustering algorithm that generates up to 30 percent fewer clusters than the best known algorithms,

4. a description of the design and implementation of a concrete XML bulkloader, the Natix bulkload component, and

5. an evaluation of the Natix bulkload component.

The rest of this paper is organized as follows. In the next section, "XML storage," we give a brief overview of native tree storage in Natix. Next, in the section "Requirements," we analyze the requirements an XML bulkload component must meet. Then, in "Tree-clustering algorithms," we discuss existing tree-clustering algorithms with respect to these requirements. In the section "Natix bulkload component," we describe the interface and the implementation of the bulkload component. This section also covers our novel tree-clustering algorithm. In the section "Evaluation," we provide our experimental results, including a comparison with the bulkload performance of other XML storage systems. Our concluding remarks are presented in the last section.

XML STORAGE

The design of the bulkload component is strongly influenced by the format used for storing XML data in the XDS. A suitable format for an enterprise-level XDS must efficiently support bulkloads, incremental updates, synchronization, and recovery. In this section, we briefly review the Natix format for storing XML data (for details, refer to References 1, 4, and 5). Similar formats are also used in other XDSes, such as IBM's System RX. (2)

Logical tree model

The interface to our core storage engine uses a simple, general logical tree model with only two node types and no XML-specific constructs, such as attributes. This model enables a simple engine design and is easily mapped to concrete XML models such as the Document Object Model (DOM) or the simple API for XML (SAX), as explained in the next subsection.

The documents are represented as ordered trees in which nodes are labeled with symbols taken from an alphabet [SIGMA]. In the current implementation, we use the set of integers from 0 to 2 (16)--1 as [SIGMA]. Leaf nodes can additionally be labeled with arbitrarily long strings over a different alphabet (in the current implementation, this alphabet consists of the set of Unicode characters).

Mapping XML to the logical object model

A small wrapper module maps a concrete XML representation (such as DOM, SAX) with its node types and attributes to the simple tree model and vice versa. A sample tree for a document fragment is shown in Figure 1.

[FIGURE 1 OMITTED]

Mapping XML document nodes to logical nodes

Elements are mapped one-to-one to tree nodes of the logical model. Attributes are mapped to child nodes of an additional attribute container child node, which is always the first child of the element node to which the attributes belong. Attributes, PCDATA (including whitespace-only data), CDATA nodes, and comments are stored as leaf nodes.

Mapping XML tags to tree labels

As alphabet [SIGMA], the storage engine uses non-negative integers, which are called DeclarationIDs. The module that maps XML to the logical tree model also performs the mapping from tag names, attribute names, and special node labels (such as PCDATA) to DeclarationIDs. All the documents in an XML collection share the same mapping, which makes query evaluation simpler and more efficient because the possible integer values for a given tag or attribute name can be resolved once per query and stay the same for all documents in the collection.

Physical tree model

Natix organizes physical storage in units known as XML segments. An XML segment provides a storage area and an interface for storing and accessing a collection of logical tree instances. A physical tree is the mapping of a logical tree to physical storage and consists of records, each of which is smaller than a disk page and contains a subtree of the logical tree. In this section we use the term record subtree (or simply subtree) to refer to any subtree of the logical tree that is stored in a record. The records can be addressed using record IDs (RIDs).

A record subtree contains three types of nodes. Aggregate nodes represent the inner nodes of the tree and are labeled with non-negative integers (over the alphabet [SIGMA]). Literal nodes are leaf nodes that in addition to a label over the alphabet [SIGMA] can also have content in the form of a string. A proxy node is a special leaf node that points (refers) to a record subtree that is stored in a separate record and corresponds to a connected subtree of the logical tree. Substituting all proxies with the referenced subtrees reconstructs the original logical tree. Although proxies use RIDs to refer to the target subtree, this is an implementation detail and not a requirement for applying our techniques (one could use logical references, such as found in System RX (2)).

Every record subtree has two attributes: the parent RID, which points to the parent subtree (if it exists), and the logical document ID field, which determines the document to which this subtree belongs.

An example of a logical tree and its mapping onto records is shown in Figure 2. To store the given logical tree (assumed not to fit on a page), the physical data tree is distributed over the three records [r.sub.1], [r.sub.2], and [r.sub.3]. Two proxies ([p.sub.1] and [P.sub.2]) are used in the top level record. Two helper aggregate nodes ([h.sub.1] and [h.sub.2]) have been added to the physical tree. They group the children below [p.sub.1] and [p.sub.2] into a tree. Proxy and helper aggregate nodes are drawn with dashed lines. They are only needed to link together subtrees contained in different records and are called scaffolding nodes. Nodes drawn with solid lines represent logical nodes ([f.sub.i]), and are called facade nodes. Only facade nodes are visible to the caller of the XML segment interface.

[FIGURE 2 OMITTED]

The given physical tree is only one possible way of storing the sample logical tree. Additional possibilities exist as a proxy can be created on any edge of the logical tree. The maintenance of the physical tree during incremental updates is described in Reference 1. The initial creation of a physical tree for a newly imported document is the main function of the bulkload component described in this paper.

REQUIREMENTS

The requirements for a bulkload component are based on four goals, all of which are related to performance:

1. The interface should closely match the typical output of XML parsers. XML parsers are the most common source of imported XML documents, and many XML tools, among them query evaluation components, are able to efficiently deliver results using parser-like interfaces. Hence, it is reasonable to assume that the data to be bulkloaded is delivered as XML parser output. Otherwise, changing the data representation before or during the bulkload operation would require expending additional resources.

2. The bulkload operation should not require main memory proportional to the document size. Linear memory usage would prohibit the import of documents larger than available main memory. As a generalization, the total amount of concurrently importable documents would be limited by available physical memory.

3. The storage layout for imported XML documents should be optimized for performance for typical workloads (documents). We identify three subgoals.


1  2  3  4  5  6  
COPYRIGHT 2006 All Rights Reserved. Reproduced with permission of the copyright holder. Further reproduction or distribution is prohibited without permission.
Copyright 2006, Gale Group. All rights reserved. Gale Group is a Thomson Corporation Company.
NOTE: All illustrations and photos have been removed from this article.


Browse by Journal Name:
Today on Entrepreneur
Related Video

e-Business & Technology
Franchise News
Business Book Sampler
Starting a Business
Sales & Marketing
Growing a Business
E-mail*:
Zip Code*: