How to build a WebFountain: an architecture for very
large-scale text analytics.
by Gruhl, D.^Chavet, L.^Gibson, D.^Meyer, J.^Pattanayak,
P.^Tomkins, A^Zien, J.
This paper describes WebFountain as a platform for very large-scale
text analytics applications. WebFountain processes and analyzes billions
of documents and hundreds of terabytes of information by using an
efficient and scalable software and hardware architecture. It has been
created as an infrastructure in the text analytics marketplace.
Analysts expect this market to grow to five billion dollars by
2005. The leaders in the text analytics market provide easily installed
packages that focus on document discovery within the enterprise (i.e.,
search and alerts) and often bring some level of analytical function.
The remainder of the market is populated with smaller entrants offering
niche solutions that either address a targeted business need of bring to
bear some piece of the growing body of corporate and academic research
on more advanced text analytic techniques.
Lower-function commercial solutions typically operate in the domain
of a million-documents of so, whereas higher-function offerings exist at
a significantly lower scale. Such offerings focus primarily on the
enterprise and secondarily on the World Wide Web through the mechanism
of small-scale focused "crawls."
When large-scale exploitation of the World Wide Web is required,
individuals and corporations alike turn to undifferentiated
lower-function solutions such as hosted keyword search engines. (1,2)
Typically, such solutions receive a small number of keywords (often one)
and are unaware that the query comes from a competitive intelligence
analyst, or an economics professor, or a professional baseball player.
Users with a business need to exploit the Web or large-scale
enterprise collections are justifiably unsatisfied with the current
state of affairs. Web-scale offerings leave professional users with the
sense that there is fantastic content "out there" if only they
could find if. Provocative new offerings showcase sophisticated new
functions, but no vendor combines all these exciting new
approaches--truly effective solutions require components drawn from
diverse fields, including linguistic and statistical variants of natural
language processing, machine learning, pattern recognition, graph
theory, linear algebra, information extraction, and so on. The result is
that corporate information technology departments must struggle to
cobble together combinations of different tools, each of which is a
monolithic chain of data ingestion, processing, and user interface.
This situation spurred the creation of WebFountain as an
environment where the right function and data can be brought together in
a scalable, modular, extensible manner to create applications with value
for both business and research. The platform has been designed to
encompass different approaches and paradigms and make the results of
each available to the others.
A complete presentation and performance analysis of the WebFountain
platform is unfortunately beyond the scope of this paper; instead, we
adopt the approach taken by the book How to Build a Beowulf, (3) which
laid out in high-level terms a set of architectural decisions that had
been used successfully to produce "Beowulf" clusters of
commodity machines. We now describe the high-level design of the
WebFountain system.
Requirements
The requirements for a very large-scale text analytics system that
can process Web material are as follows:
1. It must support billions of documents of many different types.
2. It must support documents in any language.
3. Reprocessing all documents in the system must take less than 24
hours.
4. New documents will be added to the system at a rate of hundreds
of millions per week.
5. Some required operations will be computationally intensive.
6. New approaches and techniques for text analytics will need to be
tried on the system at any time.
7. Since this is a service offering, many different users must be
supported on the system at the same time.
8. For economic reasons, the system must be constructed primarily
with general-purpose hardware.
Related literature
The explosive growth of the Web and the difficulty of performing
complex data analysis tasks on unstructured data has led to several
different lines of research and development. Of these, the most
prominent are the Web search engines (see, for instance, Google (1) and
Alta Vista (2)), which have been primarily designed to address the
problem of "information overload." A number of interesting
techniques have been suggested in this area; however, because this is
not the direct focus of this paper, we omit these here. The interested
reader is referred to the survey by Broder and Henzinger. (4)
Several authors (5-9) describe relational approaches to Web
analysis. In this model, data on the Web are seen as a collection of
relations (for instance, the "points to" relation), each of
which are realized by a function and accessed through a relational
engine. This process allows a user to describe his or her query in
declarative form (Structured Query Language, or SQL, typically) and
leverages the machinery of SQL to execute the query. In all of these
approaches, the data are fetched dynamically from the network on a lazy
basis, and therefore, run-time performance is heavily penalized.
The Internet Archive, (10) the Stanford WebBase project, (11) and
the Compaq Computer Corporation (now Hewlett-Packard Company) SRC
Web-in-a-box project (12) have a different objective. The data are
crawled and hosted, as is the case in Web search engines. In addition, a
streaming data interface is provided that allows applications to access
the data for analysis. However, the focus is not on support for general
and extensible analysis.
The Grid initiative (13) provides highly distributed computation in
a world of multiple "virtual organizations"; the focus is,
therefore, on the many issues that arise from resource sharing in such
an environment. This initiative is highly relevant to the WebFountain
business model, in which multiple partners interact cooperatively with
the system. However, architecturally WebFountain is a distributed
architecture based on local area networks, rather than wide-area
networks, and thus the particular models differ.
The Semantic Web (14) initiative proposes approaches to make
documents more accessible to automated reasoning. WebFountain
annotations on documents may be seen as an internal representation of
standardized markup as provided by frameworks such as the Resource
Description Framework (RDF), (15) upon which ontologies of markups can
be built using mechanisms such as OWL (16) of DAML. (17)
Other research from different areas with significant overlap
includes IBM's autonomic computing initiative, (18,19) which
addresses issues of "self-healing" for complex environments,
such as WebFountain.
System design
The main WebFountain is designed as a loosely coupled,
share-nothing parallel cluster of Intel-based Linux ** servers. It
processes and augments articles using a variant of the blackboard system
approach to machine understanding. (20,21) These augmented articles can
then be queried. Additionally, aggregate statistics or other
cross-document meta-data can be computed across articles, and the
results can be made available to applications.
The loosely coupled nature of the cluster makes it a natural for a
Web-service style communication approach, for which we use a
lightweight, high-speed Simple Object Access Protocol (SOAP) derivative
called Vinci. (22)
We scale up to billions of documents by making sure that full
parallelism can be achieved, and by adding a fair amount of hardware to
solve the problem (currently, 256 nodes in the main cluster alone). This
level of scaling is made possible because the same hardware and, in many
cases, the same results of analysis are used to support multiple
customers.
To support the multilingual requirement, all documents are
converted to Universal Character Set transformation format 8 (UTF-8)
(23) upon ingestion, allowing the system to support transport, storage,
indexing, and augmentation in any language. We currently have developed
text analytics for Western European languages, Chinese, and Arabic, with
others being developed and imported.
Ingestion is supported by a 48-node "crawler" cluster
that obtains documents from the Web as well as other sources and sends
them into the main processing cluster (see Figure 1).
[FIGURE 1 OMITTED]
Additional racks of SMP (symmetric multiprocessor) machines and
blade servers supply the additional processing needed for more complex
tasks. A well-defined application programming interface (API) allows new
augmenters and miners (both described later) to be added as needed, and
an overall cluster management system (also described later), backed by a
number of human operators, schedules tasks to allow maximum utilization
of the system.
Ingestion
The process of loading data into WebFountain is referred to as
ingestion. Because ingestion of Web sources is so important to a system
for large-scale unstructured data analysis, the ingestion subsystem is
broken into two components. The first focuses on large-scale acquisition
of Web content, for which the primary issues are the raw scale of the
data and the heterogeneity of the content itself. The second focuses on
acquisition of other sources, for which the primary concerns are
extraction of the data itself and management of the delivery channel. We
discuss these two components separately.
Acquisition of Web data. The approach taken and the hardware and
software used to acquire data are indicated in this discussion.
COPYRIGHT 2004 All Rights
Reserved. Reproduced with permission of the copyright holder. Further reproduction or distribution is prohibited without permission.
Copyright 2004, Gale Group. All rights
reserved. Gale Group is a Thomson Corporation Company.
NOTE: All illustrations and photos have been removed from this article.