More Resources

How to build a WebFountain: an architecture for very large-scale text analytics.


by Gruhl, D.^Chavet, L.^Gibson, D.^Meyer, J.^Pattanayak, P.^Tomkins, A^Zien, J.
IBM Systems Journal • March, 2004 •

This paper describes WebFountain as a platform for very large-scale text analytics applications. WebFountain processes and analyzes billions of documents and hundreds of terabytes of information by using an efficient and scalable software and hardware architecture. It has been created as an infrastructure in the text analytics marketplace.

Analysts expect this market to grow to five billion dollars by 2005. The leaders in the text analytics market provide easily installed packages that focus on document discovery within the enterprise (i.e., search and alerts) and often bring some level of analytical function. The remainder of the market is populated with smaller entrants offering niche solutions that either address a targeted business need of bring to bear some piece of the growing body of corporate and academic research on more advanced text analytic techniques.

Lower-function commercial solutions typically operate in the domain of a million-documents of so, whereas higher-function offerings exist at a significantly lower scale. Such offerings focus primarily on the enterprise and secondarily on the World Wide Web through the mechanism of small-scale focused "crawls."

When large-scale exploitation of the World Wide Web is required, individuals and corporations alike turn to undifferentiated lower-function solutions such as hosted keyword search engines. (1,2) Typically, such solutions receive a small number of keywords (often one) and are unaware that the query comes from a competitive intelligence analyst, or an economics professor, or a professional baseball player.

Users with a business need to exploit the Web or large-scale enterprise collections are justifiably unsatisfied with the current state of affairs. Web-scale offerings leave professional users with the sense that there is fantastic content "out there" if only they could find if. Provocative new offerings showcase sophisticated new functions, but no vendor combines all these exciting new approaches--truly effective solutions require components drawn from diverse fields, including linguistic and statistical variants of natural language processing, machine learning, pattern recognition, graph theory, linear algebra, information extraction, and so on. The result is that corporate information technology departments must struggle to cobble together combinations of different tools, each of which is a monolithic chain of data ingestion, processing, and user interface.

This situation spurred the creation of WebFountain as an environment where the right function and data can be brought together in a scalable, modular, extensible manner to create applications with value for both business and research. The platform has been designed to encompass different approaches and paradigms and make the results of each available to the others.

A complete presentation and performance analysis of the WebFountain platform is unfortunately beyond the scope of this paper; instead, we adopt the approach taken by the book How to Build a Beowulf, (3) which laid out in high-level terms a set of architectural decisions that had been used successfully to produce "Beowulf" clusters of commodity machines. We now describe the high-level design of the WebFountain system.

Requirements

The requirements for a very large-scale text analytics system that can process Web material are as follows:

1. It must support billions of documents of many different types.

2. It must support documents in any language.

3. Reprocessing all documents in the system must take less than 24 hours.

4. New documents will be added to the system at a rate of hundreds of millions per week.

5. Some required operations will be computationally intensive.

6. New approaches and techniques for text analytics will need to be tried on the system at any time.

7. Since this is a service offering, many different users must be supported on the system at the same time.

8. For economic reasons, the system must be constructed primarily with general-purpose hardware.

Related literature

The explosive growth of the Web and the difficulty of performing complex data analysis tasks on unstructured data has led to several different lines of research and development. Of these, the most prominent are the Web search engines (see, for instance, Google (1) and Alta Vista (2)), which have been primarily designed to address the problem of "information overload." A number of interesting techniques have been suggested in this area; however, because this is not the direct focus of this paper, we omit these here. The interested reader is referred to the survey by Broder and Henzinger. (4)

Several authors (5-9) describe relational approaches to Web analysis. In this model, data on the Web are seen as a collection of relations (for instance, the "points to" relation), each of which are realized by a function and accessed through a relational engine. This process allows a user to describe his or her query in declarative form (Structured Query Language, or SQL, typically) and leverages the machinery of SQL to execute the query. In all of these approaches, the data are fetched dynamically from the network on a lazy basis, and therefore, run-time performance is heavily penalized.

The Internet Archive, (10) the Stanford WebBase project, (11) and the Compaq Computer Corporation (now Hewlett-Packard Company) SRC Web-in-a-box project (12) have a different objective. The data are crawled and hosted, as is the case in Web search engines. In addition, a streaming data interface is provided that allows applications to access the data for analysis. However, the focus is not on support for general and extensible analysis.

The Grid initiative (13) provides highly distributed computation in a world of multiple "virtual organizations"; the focus is, therefore, on the many issues that arise from resource sharing in such an environment. This initiative is highly relevant to the WebFountain business model, in which multiple partners interact cooperatively with the system. However, architecturally WebFountain is a distributed architecture based on local area networks, rather than wide-area networks, and thus the particular models differ.

The Semantic Web (14) initiative proposes approaches to make documents more accessible to automated reasoning. WebFountain annotations on documents may be seen as an internal representation of standardized markup as provided by frameworks such as the Resource Description Framework (RDF), (15) upon which ontologies of markups can be built using mechanisms such as OWL (16) of DAML. (17)

Other research from different areas with significant overlap includes IBM's autonomic computing initiative, (18,19) which addresses issues of "self-healing" for complex environments, such as WebFountain.

System design

The main WebFountain is designed as a loosely coupled, share-nothing parallel cluster of Intel-based Linux ** servers. It processes and augments articles using a variant of the blackboard system approach to machine understanding. (20,21) These augmented articles can then be queried. Additionally, aggregate statistics or other cross-document meta-data can be computed across articles, and the results can be made available to applications.

The loosely coupled nature of the cluster makes it a natural for a Web-service style communication approach, for which we use a lightweight, high-speed Simple Object Access Protocol (SOAP) derivative called Vinci. (22)

We scale up to billions of documents by making sure that full parallelism can be achieved, and by adding a fair amount of hardware to solve the problem (currently, 256 nodes in the main cluster alone). This level of scaling is made possible because the same hardware and, in many cases, the same results of analysis are used to support multiple customers.

To support the multilingual requirement, all documents are converted to Universal Character Set transformation format 8 (UTF-8) (23) upon ingestion, allowing the system to support transport, storage, indexing, and augmentation in any language. We currently have developed text analytics for Western European languages, Chinese, and Arabic, with others being developed and imported.

Ingestion is supported by a 48-node "crawler" cluster that obtains documents from the Web as well as other sources and sends them into the main processing cluster (see Figure 1).

[FIGURE 1 OMITTED]

Additional racks of SMP (symmetric multiprocessor) machines and blade servers supply the additional processing needed for more complex tasks. A well-defined application programming interface (API) allows new augmenters and miners (both described later) to be added as needed, and an overall cluster management system (also described later), backed by a number of human operators, schedules tasks to allow maximum utilization of the system.

Ingestion

The process of loading data into WebFountain is referred to as ingestion. Because ingestion of Web sources is so important to a system for large-scale unstructured data analysis, the ingestion subsystem is broken into two components. The first focuses on large-scale acquisition of Web content, for which the primary issues are the raw scale of the data and the heterogeneity of the content itself. The second focuses on acquisition of other sources, for which the primary concerns are extraction of the data itself and management of the delivery channel. We discuss these two components separately.

Acquisition of Web data. The approach taken and the hardware and software used to acquire data are indicated in this discussion.


1  2  3  4  5  6  7  
COPYRIGHT 2004 All Rights Reserved. Reproduced with permission of the copyright holder. Further reproduction or distribution is prohibited without permission.
Copyright 2004, Gale Group. All rights reserved. Gale Group is a Thomson Corporation Company.
NOTE: All illustrations and photos have been removed from this article.


Browse by Journal Name:
Today on Entrepreneur

e-Business & Technology
Franchise News
Business Book Sampler
Starting a Business
Sales & Marketing
Growing a Business
E-mail*:
Zip Code*: