ACM Computing Surveys
31(4), December 1999,
http://www.acm.org/surveys/Formatting.html. Copyright ©
1999 by the Association for Computing Machinery, Inc. See the permissions statement below.
Lexical Semantics and Automatic Hypertext Construction
Division of Information and Communication Sciences, Macquarie University, North Ryde NSW 2109, Australia. Web: http://www.ics.mq.edu.au
Lexical semantics is the study of word meanings and the relationships between them. While much of the work in this field has occurred in the linguistics community, some of the techniques developed have been applied to problems in computational linguistics as large corpora and lexical resources have become available.
Before we begin, however, we should note that much of the work in information retrieval (IR) could be regarded as work in lexical semantics. IR systems compute the similarity between two documents by determining which words the documents have in common. That is, the relationship between two documents is defined by an identity relation between the words in the documents. Such systems have been used to automatically construct hypertexts with some success [Allan 1997].
While this approach has proven quite successful, it suffers from two problems related to the meanings of words. The first of these is polysemy, when a single word has several meanings. The second is synonymy, when different words have the same meanings. Both of these phenomena disrupt the simple identity relationship used by traditional IR systems, by leading them to infer that there is a relationship between two words when there is none (polysemy) or that there is no relationship between words when the words are related (synonymy).
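The effect of synonymy on an identity-based system can be seen in a minimal sketch of word-overlap similarity (the toy sentences and the Jaccard measure here are illustrative assumptions; real IR systems use weighted term vectors):

```python
# A minimal sketch of IR-style identity matching (hypothetical toy
# documents; real systems use weighted vectors such as tf-idf).

def overlap_similarity(doc1, doc2):
    """Jaccard similarity over the sets of words in two documents."""
    w1, w2 = set(doc1.lower().split()), set(doc2.lower().split())
    return len(w1 & w2) / len(w1 | w2)

# Synonymy: these sentences mean nearly the same thing, but share no
# content words, so the identity relation finds almost no connection
# (only the function words "a" and "the" match).
print(overlap_similarity("the hound chased a rabbit",
                         "a dog pursued the hare"))
```

The score is low despite the near-identical meanings; polysemy causes the opposite failure, where an accidental string match (e.g., river *bank* versus savings *bank*) inflates the score.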
Cohesion, as Halliday and Hasan [Halliday 1976] put it, "refers to relations of meaning that exist within the text, and that define it as a text". Lexical cohesion is created by the use of lexical cohesive relations. These relations are: repetition of the same word (the same relation used by traditional IR systems), the use of a synonym for a word (e.g., dog and hound), the use of a superordinate for a word (e.g., car and vehicle), and the use of collocations (e.g., garden and digging).
Lexical chaining is the process of extracting lexically related words from a text. In general, a text will contain several lexical chains, each of which describes some portion of the cohesive structure of the text. These chains can be built using any lexical resource that relates words by their meaning. The original work [Morris 1991] used Roget's Thesaurus [Chapman 1992], but almost all current lexical chainers [Stairmand 1997], [Hirst 1998], [Barzilay 1997], [Green 1999], [Al-Halimi 1998] use the WordNet database [Beckwith 1991], [Fellbaum 1998].
WordNet is composed of synonym sets or synsets. Each synset contains one or more words that have the same meaning. A word may appear in many synsets, depending on the number of senses that it has. Synsets are connected by links that indicate different semantic relations. For example, two synsets can be connected by a HYPERNYM link, which indicates that the words in the source synset are instances of the words in the target synset. These relations map very well to the lexical cohesive relations described by Halliday and Hasan.
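The synset structure can be illustrated with a hypothetical miniature lexicon (the synset identifiers and words below are invented for illustration and do not reflect the real WordNet database or its API):

```python
# Hypothetical miniature WordNet: each synset is a set of synonymous
# words, and HYPERNYM links point from a more specific synset to a
# more general one (dog IS-A canine IS-A animal).
synsets = {
    "dog.n.01": {"dog", "hound"},          # one synset, two synonyms
    "canine.n.01": {"canine"},
    "animal.n.01": {"animal", "creature"},
    "car.n.01": {"car", "auto"},
    "vehicle.n.01": {"vehicle"},
}

hypernym = {                                # HYPERNYM links between synsets
    "dog.n.01": "canine.n.01",
    "canine.n.01": "animal.n.01",
    "car.n.01": "vehicle.n.01",
}

def senses(word):
    """A word may appear in many synsets, one per sense."""
    return [sid for sid, words in synsets.items() if word in words]

def hypernym_chain(sid):
    """Follow HYPERNYM links up to the top of the hierarchy."""
    chain = [sid]
    while chain[-1] in hypernym:
        chain.append(hypernym[chain[-1]])
    return chain

print(senses("dog"))                # the senses of "dog"
print(hypernym_chain("dog.n.01"))   # dog -> canine -> animal
```

Synonymy falls out of synset membership (dog and hound share a synset), and the superordinate relation corresponds to a HYPERNYM link.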
Morris and Hirst [Morris 1991] showed that the lexical chains retrieved from a text will tend to mirror the discourse structure of that text. That is, the pattern of topics and subtopics in a document is similar to the pattern of occurrences of different elements of the lexical chains in that document. It is this property of lexical chains that makes them useful for building hypertext links within and between texts.
There are several feasible algorithms for extracting the lexical chains from a text. Stairmand [Stairmand 1994] used a simple algorithm to build sets of chains. First, all of the words in the text are collected. Then, for each word in the text, a set of terms close to the given word in WordNet is generated. The chains are then built by finding links between these sets of expanded terms.
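This expansion-and-linking idea can be sketched as follows (the neighbourhood sets, which stand in for "terms close to the word in WordNet", and the overlap-based linking rule are illustrative assumptions, not Stairmand's exact procedure):

```python
# A sketch of Stairmand-style chaining: expand each word to a set of
# nearby terms, then link words whose expanded sets overlap.
# NEIGHBOURS is a toy stand-in for proximity in WordNet.

NEIGHBOURS = {
    "dog":    {"dog", "hound", "canine", "animal"},
    "hound":  {"hound", "dog", "canine"},
    "car":    {"car", "auto", "vehicle"},
    "engine": {"engine", "motor", "car"},
}

def build_chains(words):
    """Add each word to the first chain whose term set overlaps its
    expansion; otherwise start a new chain."""
    chains = []
    for word in words:
        expanded = NEIGHBOURS.get(word, {word})
        for chain in chains:
            if chain["terms"] & expanded:       # sets overlap -> same chain
                chain["words"].append(word)
                chain["terms"] |= expanded
                break
        else:
            chains.append({"words": [word], "terms": set(expanded)})
    return [c["words"] for c in chains]

print(build_chains(["dog", "car", "hound", "engine"]))
# two chains: ["dog", "hound"] and ["car", "engine"]
```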
Other lexical chainers, such as those described by Hirst and St-Onge [Hirst 1998] and Green [Green 1999], attempt to take into account word sense relationships of varying strengths. For example, the repetition of a word is considered to be a stronger relation than the use of a synonym. Relationships between words can also be computed by considering certain kinds of paths between synsets in the WordNet graph. In these chainers, as words are read from a text they are added to the chain with which they have the strongest relation.
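A greedy strongest-relation chainer in this style might look like the following sketch (the integer strengths and the tiny synonym table are illustrative assumptions; real chainers score WordNet path shapes and lengths):

```python
# A greedy chainer in the spirit of Hirst & St-Onge: each incoming
# word joins the chain to which it bears the strongest relation.

SYNONYMS = {"dog": {"hound"}, "hound": {"dog"},
            "car": {"auto"}, "auto": {"car"}}

def relation_strength(w1, w2):
    if w1 == w2:
        return 2                     # repetition: strongest relation
    if w2 in SYNONYMS.get(w1, set()):
        return 1                     # synonymy: weaker
    return 0                         # no relation found

def chain_greedy(words):
    chains = []
    for word in words:
        best, best_score = None, 0
        for chain in chains:
            score = max(relation_strength(word, w) for w in chain)
            if score > best_score:
                best, best_score = chain, score
        if best is not None:
            best.append(word)        # join the most strongly related chain
        else:
            chains.append([word])    # no relation: start a new chain
    return chains

print(chain_greedy(["dog", "car", "hound", "dog", "auto"]))
```

Note that once a word is appended, the particular relation that justified its placement is discarded; this is exactly the loss of information the next paragraph describes.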
One failing of these kinds of chainers is that any relationships between words within a chain are lost as soon as the decision to add the word to the chain is made. Al-Halimi and Kazman [Al-Halimi 1998] take a different approach, building what they call lexical trees. A lexical tree retains the relations that occurred between words during the chaining process. For the most part, their approach to making decisions about word relatedness during chaining is similar to that shown by Hirst and St-Onge. The relations between words are used during the indexing of meeting transcripts.
All of these approaches are greedy, in that they make decisions about to which chain a word should belong as early as possible. Barzilay and Elhadad [Barzilay 1997] take a different approach. Their lexical chainer keeps all possible interpretations until all the words to be chained have been considered. Of course, keeping all possible interpretations could require a large amount of storage. Barzilay and Elhadad solve this problem by first segmenting the text, performing their chaining algorithm on the segments and then joining the chains across different segments. Even given this, we can expect that this algorithm will be slower than those described above.
One of the most useful properties of lexical chaining is that sense disambiguation is performed as a side effect of the process. As words are added to lexical chains, synsets that cannot be linked to synsets already in a chain are removed. The members of a lexical chain determine the semantic context in which a word is being used and so word meanings that are impossible in this context are disregarded. Also, we expect that words that are synonymous will appear in the same chain, since such words would share a synset. Clearly, some of the algorithms described (e.g., Barzilay and Elhadad's) will perform better in this respect.
There have been several attempts (see, for example, [Stairmand 1997], [Manabu 1994], [Kazman 1996], [Kominek 1997]) to use lexical chains for text segmentation. Most of the approaches to document segmentation involving lexical chaining attempt to discover topic boundaries by considering where lexical chains end. A point where several chains end and several different chains begin is a good candidate for a topic boundary.
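The "chains end, chains begin" heuristic can be sketched as a boundary score over the gaps between sentences (the chain spans and the simple additive score below are illustrative assumptions, not a particular published algorithm):

```python
# A sketch of chain-based segmentation: each chain is a hypothetical
# (start, end) span over sentence indices, and the gap between
# sentences i and i+1 is scored by how many chains end at i plus how
# many begin at i + 1.

def boundary_scores(chains, n_sentences):
    scores = [0] * (n_sentences - 1)
    for start, end in chains:
        if 0 < start <= n_sentences - 1:
            scores[start - 1] += 1   # a chain begins just after this gap
        if 0 <= end < n_sentences - 1:
            scores[end] += 1         # a chain ends just before this gap
    return scores

# Over an 8-sentence text: three chains end at sentence 3 and two
# begin at sentence 4, so the gap after sentence 3 scores highest
# and is the best candidate for a topic boundary.
chains = [(0, 3), (1, 3), (2, 3), (4, 7), (4, 6)]
print(boundary_scores(chains, 8))
```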
The decisions involved in finding topic boundaries are very similar to ones that would be made when deciding whether or not to place a hypertext link between two portions of a text. Essentially, we need to determine how much each chain contributes to each portion of the text. Portions that share chains that make major contributions should be considered as candidates for linking. Green [Green 1999] formalizes this notion by calculating the density of each chain in each paragraph of a text. Paragraphs that share a sufficient number of high-density chains are linked.
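The density-based linking criterion can be sketched as follows (the density definition — chain words in a paragraph divided by paragraph length — and the thresholds are illustrative assumptions, not Green's exact formulation):

```python
# A sketch of density-based paragraph linking: two paragraphs are
# linked if enough chains are "dense" in both of them.

def chain_density(chain_words, paragraph):
    """Fraction of the paragraph's words that belong to the chain."""
    words = paragraph.lower().split()
    return sum(1 for w in words if w in chain_words) / len(words)

def linked(p1, p2, chains, density_threshold=0.1, shared_needed=1):
    shared = sum(
        1 for chain in chains
        if chain_density(chain, p1) >= density_threshold
        and chain_density(chain, p2) >= density_threshold
    )
    return shared >= shared_needed

chains = [{"dog", "hound", "leash"}, {"car", "auto", "engine"}]
p1 = "the dog pulled at the leash"
p2 = "a hound needs a strong leash"
p3 = "the car engine would not start"
print(linked(p1, p2, chains))   # share a high-density chain
print(linked(p1, p3, chains))   # no shared high-density chain
```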
We can view the task of building links between texts as trying to discover inter-textual cohesive relations. Of course, we would not expect to find as many cohesive relations between two different texts as we would in a single text, but texts that deal with similar topics should contain similar chains.
One possibility for computing inter-text cohesion is simply to use a lexical chainer to build relations between the chains of the two texts. Unfortunately, this approach is too slow for real-time use, and it is unlikely to produce useful results. When used within a single text, most of the chaining algorithms described above will admit some questionable relations between words: within a text we expect cohesion, and so weak relations are assumed to be acceptable. When used between texts, they are likely to admit spurious relationships.
Another possibility is simply to represent a document by the synsets that its chains contain and then measure the similarity of the sets of chains simply by seeing how many synsets the documents have in common. In this way we represent not only the content of the document, but also the conceptual neighbourhood of the content. Essentially any of the lexical chainers intended for IR use [Stairmand 1997], [Kazman 1996] could be used in this way, although Green's [Green 1999] system was built explicitly to perform this task.
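Measuring document similarity over shared synsets can be sketched as follows (the toy synset identifiers and the choice of the Dice coefficient are illustrative assumptions; any set-overlap measure could play this role):

```python
# A sketch of inter-document similarity over synset sets: each
# document is represented by the synsets appearing in its chains.

def synset_similarity(synsets1, synsets2):
    """Dice coefficient of two documents' synset sets."""
    if not synsets1 or not synsets2:
        return 0.0
    common = len(synsets1 & synsets2)
    return 2 * common / (len(synsets1) + len(synsets2))

doc_a = {"dog.n.01", "canine.n.01", "leash.n.01"}
doc_b = {"dog.n.01", "canine.n.01", "collar.n.01"}   # shares two senses
doc_c = {"car.n.01", "engine.n.01"}                  # shares none

print(synset_similarity(doc_a, doc_b))   # related documents
print(synset_similarity(doc_a, doc_c))   # unrelated documents: 0.0
```

Because the comparison is over sense identifiers rather than word strings, two documents that use different synonyms for the same concept can still be found similar.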
Of course, we need to determine whether techniques based on lexical chaining will build links as effectively as traditional techniques. Green [Green 1999] describes the results of an experiment testing the usefulness of links generated using both lexical chains and an IR system. Although the difference in performance did not quite reach statistical significance, subjects' performance on a question-answering task appeared to be better using hypertext links generated from lexical chains.
Evaluations of IR tasks related to automatic hypertext construction show less than satisfactory results when lexical chains are used in place of traditional approaches. Stairmand [Stairmand 1997] describes evaluations of text segmentation and document retrieval, while Manabu and Honda [Manabu 1994] evaluated text segmentation methods for Japanese.
Researchers have noted some of the problems inherent in the use of WordNet. First, most methods restrict themselves to WordNet's noun hierarchy, since there are few connections between the noun and verb hierarchies. A more difficult problem is that the vocabulary in WordNet is fixed and some very important classes of words, most notably proper nouns, are almost entirely neglected.
The first problem could be alleviated (as noted by Hirst and St-Onge [Hirst 1998]) by using a traditional thesaurus where the structure of the thesaurus would support such connections. Such thesauri are now available in electronic form. The second problem requires additional sources of information. Stairmand [Stairmand 1997] has partially addressed this issue by using collocations derived from a corpus to improve lexical chaining performance.
Given that the links between WordNet synsets are typed, a valuable avenue of research is determining how these types could be used to provide link types. It is also interesting to consider how diverse collections of lexical knowledge (e.g., WordNet, thesauri, and dictionaries) could be used together to improve the performance of both lexical chaining and automatic link generation.

The author is now at Sun Microsystems Laboratories, and can be reached at email@example.com
[Al-Halimi 1998] Reem Al-Halimi and Rick Kazman. "Temporal Indexing through Lexical Chaining" in WordNet: An Electronic Lexical Database, C. Fellbaum (editor), MIT Press, Cambridge, MA, 333-351, 1998.
[Allan 1997] James Allan. "Building Hypertext using Information Retrieval" in Information Processing and Management 33(2) 145-159, 1997.
[Barzilay 1997] Regina Barzilay and Michael Elhadad. "Using Lexical Chains for Text Summarization" in ACL Workshop on Intelligent Scalable Text Summarization, July 1997.
[Beckwith 1991] Richard Beckwith, Christiane Fellbaum, Derek Gross, and George A. Miller. "WordNet: A Lexical Database Organized on Psycholinguistic Principles" in Lexical Acquisition: Exploiting On-line Resources to Build a Lexicon, U. Zernik (editor), Lawrence Erlbaum Associates, 211-231, 1991.
[Chapman 1992] Robert L. Chapman, (editor). Roget's International Thesaurus (5th edition), HarperCollins, 1992.
[Fellbaum 1998] Christiane Fellbaum (editor). WordNet: An Electronic Lexical Database. MIT Press, 1998.
[Green 1999] Stephen J. Green. "Building Hypertext Links by Computing Semantic Similarity" in IEEE Transactions on Knowledge and Data Engineering, Sept/Oct, 1999.
[Halliday 1976] Michael A. K. Halliday and Ruqaiya Hasan. Cohesion in English. Longman, 1976.
[Hirst 1998] Graeme Hirst and David St-Onge. "Lexical Chains as Representations of Context for the Detection and Correction of Malapropisms" in WordNet: An electronic lexical database, Christiane Fellbaum (editor), Cambridge, MA: The MIT Press, 1998.
[Kazman 1996] Rick Kazman, Reem Al-Halimi, William T., and Marilyn Mantei. "Four Paradigms for Indexing Video Conferences" in IEEE Multimedia, 63-73, 1996.
[Kominek 1997] John Kominek and Rick Kazman. "Accessing Multimedia through Concept Clustering" in Proceedings of ACM SIGCHI '97, Atlanta, GA, 19-26, March 1997.
[Manabu 1994] Okumura Manabu and Takeo Honda. "Word sense disambiguation and text segmentation based on lexical cohesion" in Proceedings of the 15th International Conference on Computational Linguistics (Coling '94), 755-761, 1994.
[Morris 1991] Jane Morris and Graeme Hirst. "Lexical Cohesion Computed by Thesaural Relations as an Indicator of the Structure of Text" in Computational Linguistics, 17(1), 21-48, 1991.
[Stairmand 1994] Mark A. Stairmand. "Lexical Chains, WordNet and Information Retrieval". Condensed version of Master's Thesis, 1994.
[Stairmand 1997] Mark A. Stairmand. "Textual Context Analysis for Information Retrieval" in Proceedings of ACM SIGIR '97, 140-147, 1997.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or firstname.lastname@example.org.