LEO: an autonomic query optimizer for
DB2.
by Markl, Volker^Lohman, Guy M.^Raman, Vijayshankar
The remarkable growth of the relational database management systems
(DBMS) industry over the last two decades can be largely attributed to
the standardization of its Structured Query Language, SQL. SQL is a
declarative language, that is, it requires the user to specify only what
data are wanted, leaving to the query optimizer of the DBMS the
difficult problem of deciding how best to access and process the data.
For a given SQL query, there may be many different ways to access each
table that is referenced, to join those tables, and, because the join
operation is commutative, to order those joins and perform other
operations necessary to complete the query. Hence, there may be hundreds
or even thousands of possible ways to process a given query correctly.
For example, suppose the SQL query is:
SELECT name, age, salary FROM employees
WHERE age > 60 AND city = `SAN JOSE' AND salary < 60,000
This query asks for the name, age, and salary of each employee who
is over age 60, lives in San Jose, and makes less than $60,000 annually
in salary. Each filtering condition in the WHERE clause that is joined
by AND is called a predicate. Since this query references only one
table, there are no choices of join order or join method, yet the
optimizer still may consider many possible ways for the DBMS to process
this simple query. It can always sequentially scan all the rows in the
table and apply each predicate to each row. Or, if the appropriate
indexes exist, it could exploit one or more indexes to access only the
rows satisfying one or more of the predicates. For example, an index on
age would permit accessing only those rows where the value of age is
greater than 60 and then applying the other predicates (on city and
salary). Alternatively, an index on city would limit the access to those
rows having city equal to "San Jose" and subsequently applying
the other predicates (on age and salary) to those rows retrieved.
Alternatively, indexes on multiple columns, for example a combined index
on city and age, or a combined index on city and salary, might be
exploited, if they existed, or strategies combining any of the indexes
discussed above (so-called "index ANDing"). Which strategy
might be preferable depends upon the characteristics of the database,
the availability of various indexes, and how selective each predicate
is, that is, how many rows are satisfied by each condition.
Most modern query optimizers determine the best query execution
plan (QEP, or simply plan) for executing an SQL query by mathematically
modeling the execution cost for each of many alternative QEPs and
choosing the one with the lowest estimated cost. The execution cost is
largely dependent upon the number of rows that will be processed by each
operator in the QEP, so the optimizer first estimates this incrementally
as each predicate is applied. Estimating the cardinality (i.e., number
of rows) of a result, after one or more of the predicates have been
applied, has been the subject of much research for over 20 years. (1-5)
To avoid accessing the database when optimizing queries, this estimate
typically begins with statistics of database characteristics,
specifically, the number of rows for each table. The cardinality of each
intermediate result is derived incrementally by multiplying the base
table's cardinality by a filter factor--or selectivity--for each
predicate in the query, which is derived from statistics on the columns
affected by that predicate, such as its number of distinct values or a
histogram of its distribution. The selectivity of a predicate P
effectively represents the probability that any row in the table will
satisfy P, and that selectivity depends upon the characteristics of the
database. For example, in the query above, the predicate on city might
be quite selective if the database were a worldwide database of a large,
multinational company, but it might be a lot less selective if the
database contained all the employees of some small start-up firm
centered in San Jose. For the latter case, the predicates on age and/or
salary would be more selective. The optimizer would tend to favor QEPs
that could exploit indexes to apply the most selective predicates and
QEPs that utilize simple table scans if there were no indexes, or if the
optimizer estimated that most employees would satisfy all of the
predicates. In DB2 * (Database 2 *), the choice of QEP is based solely
upon the detailed cost estimate for each of the competing plans, and not
upon such simplistic heuristics.
When there are multiple tables in the FROM clause of the query, the
number of alternative strategies considered by the optimizer increases
exponentially, because the myriad choices mentioned above are compounded
with additional decisions about the order in which tables are joined and
the method by which they are joined. DB2, for example, supports three
major types of join method, and there are several variants within each
of these. For a two-table join with a handful of predicates, the DB2
optimizer might consider over a hundred different plans; for six tables,
the number of plans could be well over a thousand! The DB2 optimizer
considers all of these alternatives automatically for the user, who is
not even aware that it is going on!
Although query optimizers do a remarkably good job. of estimating
both the cost and the cardinality of most queries, many assumptions
underlie this mathematical model. Examples of these assumptions include:
currency of information, uniformity, independence of predicates, and a
principle of inclusion.
Currency of information: Updating the statistics each time a row is
updated or deleted would create a locking bottleneck in the system
catalogs, where statistics are stored. It is difficult or impossible to
calculate some statistics incrementally, such as the number of distinct
values for each column, and so it is common for statistics to be
recomputed periodically as a user-invoked batch operation (called
RUNSTATS in DB2). Despite this, the optimizer assumes that the
statistics reflect the current state of the database, that is, that the
database characteristics are relatively stable, and it relies upon the
user to know when any table has changed enough to warrant the expensive
recollection of statistics.
Uniformity: Although many products use histograms to deal with skew
in the distribution of values for "local" selection predicates
(on columns within a single table), we are unaware of any available
product that exploits them for join predicates, that is, those relating
columns in multiple tables. Thus for join predicates, the query
optimizer still relies on the assumption of uniformity.
Independence of predicates: Selectivities for each predicate are
calculated individually and multiplied together, essentially assuming
the predicates are statisticallyindependent of each other, even though
the underlying columns may be related, for example by a functional
dependency. While multidimensional histograms address this problem for
local predicates, they have never been applied to join predicates,
aggregation, and so on. Applications common today have hundreds of
columns in each table and thousands of tables, making it impossible to
know on which subset(s) of columns to maintain multivariate statistics.
Principle of inclusion: The selectivity for a join predicate X.a =
Y.b is typically defined to be 1/max{[absolute value of a], [absolute
value of b]}, where [absolute value b] denotes the number of distinct
values of column b. This implicitly assumes the "principle of
inclusion," that is, that each value of the smaller domain has a
match in the larger domain. Fortunately, this assumption is frequently
true for the most common joins between a primary key to a table (e.g., a
product number in the Products table) and a reference to that key (a
foreign key) in another table (e.g., the Orders table).
When these assumptions are invalid, significant errors in the
cardinality--and hence cost--estimates result, causing suboptimal plans
to be chosen. From the authors' experience, the primary cause of
major modeling errors is the cardinality estimate on which costs depend.
Cost estimates might be off by 10 or 15 percent, at most, for a given
cardinality, but cardinality estimates can be off by orders of magnitude
when their underlying assumptions are invalid or uncertain. Although
there has been considerable success in using histograms to detect and
correct for data skew, (6-8) and in using sampling to gather up-to-date
statistics, (9,10) there has been to date no comprehensive approach to
correcting all modeling errors, regardless of origin.
In this paper we describe our approach toward autonomic query
optimization for overcoming modeling errors and incorrect statistics,
which has led to the prototype of a LEarning Optimizer (LEO) for DB2.
(11) LEO learns from any modeling mistake, at any point in a QEP, by
automatically validating its estimates against actual cardinalities for
a query. Determining at what point in the plan the significant errors
occurred then allows for reoptimizing the query at this point (12) and
adjusting its model dynamically to better optimize future queries. Over
time, this feedback method amasses experiential information that
augments and adjusts the database statistics for the part of the
database that enjoys the most user activity. Not only does this
information enhance the quality of the optimizer's estimates, but
it can also suggest where statistics gathering should be concentrated
and can even supplant the need for statistics collection.
A learning optimizer
COPYRIGHT 2003 All Rights
Reserved. Reproduced with permission of the copyright holder. Further reproduction or distribution is prohibited without permission.
Copyright 2003, Gale Group. All rights
reserved. Gale Group is a Thomson Corporation Company.
NOTE: All illustrations and photos have been removed from this article.