Chapter 11: Text Processing

Document processing is rapidly becoming one of the dominant functions of computers. Computers are used to edit documents, to search documents, to transport documents over the Internet, and to display documents on printers and computer screens. Web "surfing" and Web searching are becoming important computer applications, and many of the key computations in all of this document processing involve character strings and string pattern matching. For example, the Internet document formats HTML and XML are primarily text formats, with added tags for multimedia content. Making sense of the many terabytes of information on the Internet requires a considerable amount of text processing.

In this chapter, we study several fundamental text processing algorithms for quickly performing important string operations. We pay particular attention to algorithms for string searching and pattern matching, since these are often computational bottlenecks in document-processing applications. We also study some fundamental data structure and algorithmic issues involved in text processing.

The progression of topics studied in this chapter continues to follow our abstract data type approach. The terminology and notation for the string ADT, used throughout this chapter, are defined early on. It turns out that representing a string as an array of characters is quite simple and efficient, so we do not devote much attention to string representations. Nevertheless, the string ADT includes an interesting method for string pattern matching, and we study pattern matching algorithms in this chapter. We also study the trie data structure, a tree-based structure that allows for fast searching in a collection of strings. We then study an important text processing problem, namely, compressing a document of text so that it fits more efficiently in storage or can be transmitted more efficiently over a network. Finally, we study how to measure the similarity between two documents. All of these problems arise often in Internet computations, such as Web crawlers, search engines, document distribution, and information retrieval.
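
To make the pattern matching problem concrete, here is a minimal sketch in Python of the simplest possible approach: try every alignment of a pattern against a text, both treated as arrays of characters, and compare character by character. The function name `find_brute_force` and its return convention (lowest matching index, or -1 if there is no match) are assumptions made for illustration, not definitions taken from this chapter.

```python
def find_brute_force(text: str, pattern: str) -> int:
    """Return the lowest index at which pattern begins in text, or -1.

    Illustrative sketch only: the name and the -1 convention are assumptions.
    """
    n, m = len(text), len(pattern)
    # Try each possible starting position of the pattern within the text.
    for i in range(n - m + 1):
        j = 0
        # Advance j while the characters at this alignment agree.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:
            # All m characters matched, so the pattern occurs at index i.
            return i
    return -1
```

For example, `find_brute_force("abacaabaccabacabaabb", "abacab")` returns 10. In the worst case this sketch performs on the order of n times m character comparisons for a text of length n and a pattern of length m, which is why faster pattern matching algorithms are worth studying.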

In addition to having interesting applications, the topics of this chapter also highlight some important algorithmic design patterns. In particular, in the section on pattern matching, we discuss the brute-force method, which is often inefficient but has wide applicability. For text compression, we study the greedy method, which often allows us to approximate solutions to hard problems and, for some problems (such as text compression), actually gives rise to optimal algorithms. Finally, in discussing text similarity, we introduce the dynamic programming design pattern, which can be applied in some special instances to solve, in polynomial time, a problem that appears at first to require exponential time.
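
As a rough illustration of the dynamic programming pattern, the following Python sketch computes the length of a longest common subsequence of two strings. Using the longest common subsequence as the similarity measure is an assumption made here for the example, and the name `lcs_length` is hypothetical. Enumerating every subsequence of a string of length n would mean examining 2^n candidates, whereas the table below is filled in time proportional to n times m.

```python
def lcs_length(x: str, y: str) -> int:
    """Return the length of a longest common subsequence of x and y.

    Illustrative sketch only: the similarity measure and the function
    name are assumptions, not definitions from the text.
    """
    n, m = len(x), len(y)
    # table[i][j] holds the LCS length of the prefixes x[:i] and y[:j].
    # Row 0 and column 0 stay at 0 because an empty prefix shares nothing.
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                # Matching last characters extend the LCS of the shorter prefixes.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Otherwise, drop the last character of one string or the other.
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]
```

For example, `lcs_length("ABCBDAB", "BDCABA")` returns 4. Each table entry depends only on entries already computed, so every subproblem is solved exactly once; this reuse of subproblem solutions is what brings the running time down from exponential to polynomial.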