Generation of efficient parsers through direct
compilation of XML Schema grammars.
by Perkins, Eric^Matsa, Morris^Kostoulas, Margaret Gaitatzes
^Heifets, Abraham^Mendelsohn, Noah
INTRODUCTION
XML has begun to work its way into the business computing
infrastructure and underlying protocols such as the Simple Object Access
Protocol (SOAP) and Web services. In the performance-critical setting of
business computing, however, the flexibility of XML becomes a liability
due to the potentially significant performance penalty.
XML processing is conceptually a multitiered task, an attribute it
inherits from the multiple layers of specifications that govern its use:
XML, (1,2) XML Namespaces, (3) XML Information Set (Infoset), (4) and
XML Schema. (5) Traditional XML processor implementations reflect these
specification layers directly. Bytes, read off the "wire" or
from disk, are converted to some known form (often UTF-16 characters)
and tokenized (UTF stands for Universal Text Format). Attribute values
and end-of-line sequences are normalized. Namespace declarations and
prefixes are resolved, and the tokens are then transformed into some
representation of the document Infoset; at the same time, checking for
well-formedness (1,2) is performed. The Infoset is optionally checked
against an XML Schema grammar (XML schema, schema) for validity and
rendered to the user through some interface, such as Simple API for XML
(SAX) or Document Object Model (DOM) (API stands for application
programming interface).
In practice, these tasks are combined to some extent. Typically a
generic parser handles scanning, XML normalization, namespaces, and
well-formedness checking, as required by the XML specification.
Validation is then grafted on as a separate process, operating as a
filter on the output of the generic parser. Because validation is an
add-on in such a design, it has a strictly detrimental effect on parser
performance. Validation is, therefore, typically used exclusively for
debugging, if at all, and is disabled during production.
Although schema validation is expensive in the above scenario,
there is no fundamental reason why validation needs to be expensive.
Indeed, grammars have long been used to generate optimized
special-purpose parsers that operate much more efficiently than their
generic counterparts, while performing validation checking. (6) The XML
specifications were designed to enable the compilation of an XML Schema
grammar to a special-purpose parser (see Section 2.4 in Reference 5).
Techniques that apply standard grammar-based parser generation to
XML Schema grammars have been used to demonstrate that compilation of
schemas can produce high-performance special(7) purpose parsers. (7)
Traditional parser-generation schemes are not, however, particularly
well-suited to XML parsing and have difficulty representing some XML
Schema constructs that are not found in traditional parsing situations.
At the same time, the syntax of XML and the constraints defined in an
XML schema are not sufficiently complex to require the full power of
traditional parser-generation methods.
Previous efforts in this area that built on conventional
intermediate representations have, in general, supported fewer features
of XML (8) or delivered less efficient solutions. (9)
Rather than adapting a conventional intermediate representation to
the forms of XML and XML Schema, we propose a compilation technique that
deals directly with the abstract schema components of the XML Schema
Recommendation. (5) By tying the code-generation scheme directly to the
schema components, we are able to take advantage of the simple lexical
structure of XML and the determinism assurances built into XML Schema
grammars.
The generated validating parser drives the optimized scanning
process. Two complementary optimization strategies, specialization and
optimistic scanning, are used to speed scanning and validation.
Specialization focuses on the use of specialized context-sensitive
scanning primitives that can scan and validate the input efficiently.
Optimistic scanning speeds the scanning of the common cases, such as
simple data without comments or entity references. The resulting
generated parser is shown to be significantly faster than some widely
available parsers, both validating and nonvalidating.
In the next section, we describe the challenges involved in
generating parsers through compilation of XML schemas. Following that,
we propose an architecture for direct schema compilation, highlighting
the breadth of support for schema features and the targeted
optimizations that minimize the cost of parsing. Then we provide
performance measurements of sample generated parsers and compare those
to performance measurements of commonly used parsers. Finally, we
describe related work from the technical literature and conclude with
final comments.
CHALLENGES OF XML SCHEMA COMPILATION
XML Schema, and the specifications on which it depends, present
several challenges to schema-based parser generation: XML namespaces and
the dynamic typing features of XML Schema complicate the scanning of
markup. As a result, the schema grammar and the lexical production of
XML are not easily combined with traditional grammar compilation
techniques. Additionally, XML Schema provides support for content models
that are difficult to represent in traditional automaton models, making
the traditional models inefficient as intermediate representations of
the schema.
XML namespaces
Throughout the XML processing stack, markup and meta-markup (such
as namespace declarations and xsi:type attributes) assert scoped
properties and declarations for the containing element and all of its
attributes, as well as its content. In the case of XML namespaces and
the dynamic typing mechanism used in XML Schema, this pattern presents
some difficulties for naive parser implementations.
Namespaces qualify XML element and attribute names by using a
namespace-prefix declaration. The declaration takes the form of a
special attribute with a reserved prefix (xmlns) followed by the prefix
to be declared. The value of this attribute is the declared namespace.
The scope of the namespace declaration includes the enclosing element,
all of the sibling attributes, and the element's content. This
arrangement, although natural, presents some difficulties for XML
processors.
Beyond the syntactic complexities of the namespace declaration, the
XML Namespaces Recommendation augments the well-formedness constraints
of XML to forbid the repetition of a qualified attribute regardless of
its lexical representation in the tag. Thus, in the examples below, the
first my-elt-1 is well-formed, but my-el t-2 is not, because both
attributes resolve to the same qualified name.
xmlns:a="http://www.example.com/a"
xmlns:b="http://www.example.com/b"/>
xmlns:a="http://www.example.com/a"
xmlns:b="http://www.example.com/a"/>
xmlns:a="http://www.example.com/a"
xmlns:b="http://www.example.com/c"/>
During processing, namespace declarations prevent the qualified
names of the element and its attributes from being conclusively known
until the end of the tag. This means that scanning of qualified names in
XML requires infinite look-ahead to fully resolve names. In the
preceding examples, the second a:my-elt-1 appears to be the same as the
first until the last attribute is scanned.
The pattern of declaration used in XML namespaces is typical
throughout the XML processing stack. In the XML layer, for example, the
predefined attributes xml:lang and xml:space, which may be used to
indicate natural language and desired white-space handling, use this
pattern. The values of these attributes, however, do not affect
validation, and therefore do not complicate scanning or validation.
Dynamic typing in XML Schema
XML Schema includes a mechanism for dynamic typing of instance
elements. By using the xsi:type attribute, an instance element may
assert its XML Schema type. The declared type must be validly derived
from the type that would otherwise have been used to validate the
element, with respect to the constraints on type derivation. This
declared type may have a significantly different content model from the
default type that is otherwise expected.
The syntax of xsi:type is similar to that of namespace declarations
and poses the same kinds of processing hurdles. In particular, the
possibility of an xxi:type attribute prevents an XML processor from
conclusively determining the type declaration to use for validation
until the entire tag has been scanned. Furthermore, because the element
type declaration governs the type declarations used to validate the
attributes, the processor cannot conclusively determine the types--and
therefore the validity--of the attributes until the entire tag has been
read. In the example below, the element will be invalid if the dynamic
type restricts the attributes to be of type xsd:integer:
my-attr-3="three" my-attr-4="four"
my-attr-5="five" my-attr-6="six"
xsi:type="six-integer-attributes"
xmlns:xsi="http://www.w3.org
/2001/XMLSchema-instance"/>
XML Schema content models
COPYRIGHT 2006 All Rights
Reserved. Reproduced with permission of the copyright holder. Further reproduction or distribution is prohibited without permission.
Copyright 2006, Gale Group. All rights
reserved. Gale Group is a Thomson Corporation Company.
NOTE: All illustrations and photos have been removed from this article.