Cost-based optimization in DB2 XML.
by Balmin, Andrey^Eliaz, Tom^Hornibrook, John^Lim, Lipyeow^Lohman,
Guy M. ^Simmen, David^Wang, Min^Zhang, Chun
INTRODUCTION
As XML has been increasingly accepted by the information technology
industry as a ubiquitous language for data interchange, there has been a
concomitant increase in the need for repositories that natively store,
update, and query XML documents. Together with extensions to SQL
(Structured Query Language) for formatting relational rows into XML
documents and for querying them, called SQL/ XML, (1) XQuery has emerged
as the primary language for querying XML documents, with the publication
in 2005 of a draft standard for the language. (2) XQuery combines many
of the declarative features of SQL and the document navigational
features of XPath, (3) but subsumes neither.
Despite the ascendancy of XML, SQL/XML, and XQuery, the huge
investment in relational database technology over the last three decades
is unlikely to be supplanted abruptly. Hence the XML
"revolution" is more likely to be a gradual evolution, in
which XML documents will be stored in relational tables and queried
interchangeably by either SQL or XQuery for the foreseeable future.
Accordingly, IBM has developed DB2 * XML, a hybrid database system
that combines the relational capabilities of DB2 Universal Database *
for Linux **, Unix **, and Windows ** with comprehensive native XML
support. This means that XML is supported as a native data format
alongside relational tables, and XQuery is supported as a second query
language alongside SQL. Moreover, the hybrid nature of DB2 XML makes it
much easier to fully support the SQL/ XML language, which allows
arbitrary nesting of SQL and XQuery statements within each other.
To support query processing in the XQuery and SQL/XML languages on
the native XML store, the DB2 XML team has extended existing components
in DB2 UDB with support for XQuery and added new components. The overall
rationale and architecture of DB2 XML, as well as an overview of the
design of its major components, are described in Reference 4, including
the new native XML store, XML indexes, query modeling, and query
processing. In this paper we describe in more detail the extensions made
to the cost-based optimizer portion of the DB2 UDB query processor to
efficiently support XQuery and SQL/XML.
XQuery language
In explaining the role of the DB2 XML optimizer, we first review
how DB2 XML stores XML data and how XQuery queries it. In DB2 XML, a new
native XML type is introduced to represent XML data. Tables can be
created having one or more columns of this XML type, with each row in
any XML column containing an XML document, or, more precisely, an
instance of the XML Query Data Model. (5) As with other column types,
the contents of XML columns can be indexed optionally by one or more
indexes. Example 1 shows the creation of a table with an XML column, the
insertion of an XML document into that column of the table, and the
creation of two XML indexes on that column.
Example 1
create table Product (
pid varchar(10) not null primary key,
Description xml
):
insert into Product values(
' 100-100-01',
xmlparse(document
'
Snow Shovel, Basic 22"
Basic Snow Shovel, 22" wide,
straight handle with D-Grip
9.99
1 kg
Tool s
'
preserve whitespace)
);
create index I_PRICE
on Product(Description)
generate key using xmlpattern
'//price' as sql double;
create index I_CATEGORY
on Product(Description)
generate key using xmlpattern
'/product/category' as sql varchar(10);
In this example, //price and /product/category are XPath patterns.
The last two statements in Example 1 define indexes I_PRICE and
I_CATEGORY that contain references to only those nodes in
"Description" documents whose root-to-node paths match these
XPath patterns, organized by the values of such nodes. The
"//" notation in the first XPath pattern permits any number of
nodes between the root node of each document and an instance of a price
node.
XQuery resembles SQL in that it is largely declarative; that is, it
specifies what data is desired, not how to access that data. Each XQuery
statement contains a FLWOR (for, let, where, order by, and return,
pronounced "flower") expression, which specifies (1) zero or
more FOR and LET clauses that describe the data to be accessed, (2) an
optional WHERE clause that defines conditions on that data, (3) an
optional ORDER BY clause for ordering the result, and (4) a RETURN
clause that specifies the structure of the data returned by that query.
The FOR and LET clauses can optionally assign intermediate results to
variable names, denoted by a preceding '$'.
The FOR clause can be thought of as an iterator that accesses items
from XML data, creating one row per item. The LET clause effectively
arranges those data items into a sequence in one row. This mapping in
DB2 of XQuery items to rows and XQuery FOR clauses to the iterators used
in processing relational rows is crucial for exploiting much of the
existing infrastructure of DB2, as we explain in the section "Plan
generation."
Example 2 shows a sample XQuery that returns all products having a
price less than 100 and a category of "Tools." The FOR clause
iterates over the product nodes in all documents of PRODUCT. DESCRIPTION
that match the given XPath pattern, assigning each to the variable $i.
Those whose category is "Tools" survive the filtration of the
WHERE clause and are RETURNed to the user. This query has no LET clause.
Example 2
for $i in
db2-fn:xml col umn( ' PRODUCT.DESCRIPTION')
//product[ .//price < 100]
where $i/category='Tools'
return $i ;
Although the semantics of the XQuery language require that results
be returned in the order specified by any nested FOR clauses, the
requirement does not mandate the strategy for evaluating those clauses
by an optimizer, and many aspects of XQuery, such as nested FOR loops
and XPath navigation, partially restrict the order in which it should be
processed. XQuery has enough alternative execution choices to need
cost-based optimization in the same way that SQL queries do. The above
example illustrates that even simple XQuery queries require many of the
same optimization decisions required for SQL queries. Because a DB2 XML
user can define multiple XML indexes on an XML column, as well as a
traditional index on any combination of relational columns, the
optimizer must decide which of these alternative access paths, either
individually or in combination, to exploit in evaluating a query.
A plan is defined as an ordered set of steps used to access
information. Alternative plans for the query in this example might
exploit the I_PRICE index, the I_CATEGORY index, both indexes (ANDed
together), or neither. The choice of the optimal plan depends on the
characteristics of the database and may have enormous impact on the
query's performance. For example, if most of the products fall into
the "Tools" category, but very few of them have a price less
than 100, then the plan that uses the I_PRICE index will be much more
efficient than one that scans the entire table or one that accesses both
I_PRICE and I_CATEGORY indexes and intersects the result.
Although our simple example did not illustrate it, XQuery permits
join predicates, that is, WHERE clauses or XPath predicates that relate
the values of multiple columns, or nodes from documents in multiple XML
columns. Similar to relational predicates that were proven to be
commutative and associative by using relational algebra, XQuery
predicates may largely be reordered in a similar way, potentially
providing huge performance benefits. Hence, the DB2 XML optimizer still
needs to determine the best way to order those joins and the best join
method (algorithm) to accomplish each join. This is the major
contributor to complexity in SQL optimizers. These and other
considerations offer many opportunities for optimization of XQuery
queries.
Challenges
Why then is it not possible to reuse the results of three decades
of research and experience with relational query optimization to
optimize XQuery queries? It is in fact possible for the most part, but
XQuery introduces several major new challenges. First and foremost of
these is heterogeneity. SQL optimization was significantly aided by the
simple homogeneity of rows in relational tables having identical
"flat" schemas. In contrast, the XML data model is inherently
heterogeneous and hierarchical. For a given XML schema, one or more
elements may be missing in any XML document, and there is no requirement
for explicit NULL values for these elements. LET clauses effectively
construct varying-length rows containing sequences of elements whose
number is difficult to estimate and may vary from row to row; a FOR loop
over such a sequence removes the nesting of that sequence and changes it
into as many rows as there were elements in a single row. We further
postulate that XML schemas themselves are likely to change frequently
from document to document, or even be unavailable or unknown for a given
XML document, leading to schema heterogeneity within even a single table
containing a single XML column.
COPYRIGHT 2006 All Rights
Reserved. Reproduced with permission of the copyright holder. Further reproduction or distribution is prohibited without permission.
Copyright 2006, Gale Group. All rights
reserved. Gale Group is a Thomson Corporation Company.
NOTE: All illustrations and photos have been removed from this article.