ACM Computing Surveys
31(4), December 1999,
http://www.acm.org/surveys/Formatting.html. Copyright ©
1999 by the Association for Computing Machinery, Inc. See the permissions statement below.
Semantically Indexed Hypermedia: Linking Information Disciplines
Douglas Tudhope and
University of Glamorgan Web: http://www.glam.ac.uk/
Hypermedia Research Unit Web: http://www.comp.glam.ac.uk/pages/research/hypermedia/
School of Computing Web: http://www.comp.glam.ac.uk/
Pontypridd, CF37 1DL, Wales, UK
Categories and Subject Descriptors:
H.5.4 [Information Interfaces and Presentation]: Hypertext/Hypermedia;
H.3.1 [Information Storage and Retrieval]: Thesauruses;
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval;
H.3.5 [Information Storage and Retrieval]: web-based services;
H.3.7 [Information Storage and Retrieval]: Digital Libraries
Additional Key Words and Phrases: Semantic index, Semantic distance
measures, Metadata, Dublin Core
Semantic linking has always been a strand of hypermedia research and is
becoming central to current attempts to facilitate access to information
in large hypertexts and the emerging 'semantic web' [Berners-Lee 1998].
Due to the scaling problems with explicitly authored links between
information items, it is likely that future large scale hypertexts will
employ a mixture of authored links and indirect, computed links via some
form of indexing system. Problems of information access are heightened by
the imprecision of current WWW retrieval technology and by users'
unfamiliarity with indexing conventions. There is a critical need for tools
that will assist users to formulate and refine queries, and navigate
through information spaces. Recent years have seen the growth of metadata,
Digital Libraries, and interest in the application of traditional
information science and library cataloguing techniques to the new
environment of hypertext and the WWW. Semantic indexing provides a bridge
between the various information disciplines. With the growing influence
of the Resource Description Framework [Lassila 1999],
semantic tagging and
cataloguing of information is likely to become a key component of the
information architecture of intranet hypertexts and the WWW.
Representing semantic knowledge about a domain or application area in order
to facilitate access to information has been a major focus in hypermedia
since its early days (e.g. [Collier 1987],
[Trigg 1986]). One approach
has been to assign semantic labels or more formal typing to authored
hypertext links [Nanard 1991],
[Schnase 1993]. Another
approach, the one followed here, includes a semantic index layer in its
model of hypermedia architecture. In addition to explicitly authored links,
each information item is indexed with descriptor terms - frequently more
than one term will be required. Frisse and Cousins
[Frisse 1989] first introduced
the notion of separate index and document spaces to hypertext, observing
that different conformations of those spaces allow for different
possibilities in automated reasoning. Different types of indexing system
are possible. It is useful to categorise indexing systems according to
three dimensions [van Rijsbergen 1979]:
- whether index terms are automatically derived or manually assigned.
- whether index terms belong to a controlled vocabulary or are free text.
- whether terms can be combined as ordered strings representing a single concept
when indexing (pre-coordinated terms), e.g. "Association for Computing
Machinery", or must be post-coordinated on retrieval. The latter allows the
possibility of 'false positives', where items are returned in which the
query terms occur but are unconnected with one another.
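The difference between pre- and post-coordination can be illustrated with a minimal Python sketch. The two toy items, their index terms, and the function names are hypothetical, invented purely for illustration; they do not come from any particular indexing system.

```python
# Hypothetical toy collection. Each item carries single-word index terms
# (combined only at retrieval time) and pre-coordinated strings (combined
# by the indexer to represent a single concept).
ITEMS = {
    "doc1": {
        "terms": {"information", "science"},
        "strings": {"information science"},
    },
    "doc2": {
        # About museum information held by a science museum:
        # both words occur, but not as one concept.
        "terms": {"information", "science", "museum"},
        "strings": {"museum information", "science museum"},
    },
}

def post_coordinated(query_terms):
    """Return items containing every query term, in any combination."""
    wanted = set(query_terms)
    return sorted(k for k, item in ITEMS.items() if wanted <= item["terms"])

def pre_coordinated(query_string):
    """Return items indexed with the whole string as a single concept."""
    return sorted(k for k, item in ITEMS.items()
                  if query_string in item["strings"])

print(post_coordinated(["information", "science"]))  # ['doc1', 'doc2']
print(pre_coordinated("information science"))        # ['doc1']
```

Here doc2 is a false positive for the post-coordinated query: both terms are present, but they were never connected by the indexer.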
Information Retrieval (IR) has tended towards automatically generated free
text index terms (post-coordinated), weighted by statistical frequency of
terms in documents and collections. On the other hand, distinguishing
features of a semantic index are that semantic relationships exist
between controlled index terms, usually (but not necessarily) the result
of manual cataloguing. Semantically indexed hypermedia links are, by
definition, computed, corresponding to Intensional-Retrieval links
[DeRose 1989]. This allows the possibility of flexible query-based navigation.
The semantic index approach employs a set of semantic relationships
between index terms, following the well established thesaurus tradition
in information science (ISO 2788, ISO 5964). A large number of thesauri
exist, covering a variety of subject domains, for example the Medical
Subject Headings [MeSH 1999] and the Art and Architecture Thesaurus
[AAT 1999]. Classification systems, such as
Dewey Decimal or Library of
Congress, focus on hierarchical relationships. These controlled
vocabularies are part of standard cataloguing practice in
libraries and museums and are now being applied to digital hypertexts
via thematic keywords in metadata resource descriptors. For example, the Dublin Core [DC 1999] standard metadata set includes
elements for Title,
Creator, Date, Format, etc. in addition to the more complex notion of the
Subject (or theme) of a resource. Guidelines recommend that, where possible,
the Subject element be taken from a relevant controlled vocabulary. Links between concepts in the subject domain
can be expressed by the semantic relationships in a thesaurus. The three
main thesaurus relationships are Equivalence (equivalent terms),
Hierarchical (broader/narrower terms), and Associative (more loosely
Related Terms). Sometimes specialisations of the three main relationships
are included (for example distinguishing taxonomic and instance hierarchical relationships). Following a minimalist approach to semantic modelling by restricting the set of
relationships permits interoperability of cataloguing/retrieval tools and techniques. It also facilitates automated reasoning over this core set of relationships.
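As an illustration of how small this core set of relationships is, the following Python sketch models a thesaurus with Equivalence, Hierarchical and Associative links, and uses the equivalence relationship as an entry vocabulary. The class names, method names, and the furniture terms are assumptions made for this example; they are not taken from ISO 2788 or from any existing thesaurus.

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    """One controlled-vocabulary concept with the three core relationships."""
    preferred: str
    use_for: set = field(default_factory=set)    # Equivalence: synonyms
    broader: set = field(default_factory=set)    # Hierarchical: broader terms
    narrower: set = field(default_factory=set)   # Hierarchical: narrower terms
    related: set = field(default_factory=set)    # Associative: related terms

class Thesaurus:
    def __init__(self):
        self.concepts = {}
        self.entry_vocabulary = {}  # any known label -> preferred term

    def add(self, preferred, use_for=()):
        concept = Concept(preferred, set(use_for))
        self.concepts[preferred] = concept
        for label in (preferred, *use_for):
            self.entry_vocabulary[label.lower()] = preferred
        return concept

    def add_broader(self, narrow, broad):
        """Record a hierarchical link in both directions."""
        self.concepts[narrow].broader.add(broad)
        self.concepts[broad].narrower.add(narrow)

    def add_related(self, a, b):
        """Record a symmetric associative link."""
        self.concepts[a].related.add(b)
        self.concepts[b].related.add(a)

    def resolve(self, label):
        """Map a user's entry term to the preferred term, or None if unknown."""
        return self.entry_vocabulary.get(label.lower())

# Hypothetical example terms.
t = Thesaurus()
t.add("furniture")
t.add("chairs", use_for=["seats"])
t.add("armchairs", use_for=["easy chairs"])
t.add("upholstery")
t.add_broader("chairs", "furniture")
t.add_broader("armchairs", "chairs")
t.add_related("armchairs", "upholstery")

print(t.resolve("Easy Chairs"))                # armchairs
print(sorted(t.concepts["chairs"].narrower))   # ['armchairs']
```

Because every tool only has to understand these few link types, a structure of this kind could in principle be exchanged between cataloguing and retrieval tools without loss.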
Navigation is provided indirectly by queries to the semantic index space,
as opposed to directly following explicit links between information items.
The queries can be simple or complex. Conventional hypermedia navigation
techniques may be implemented by relatively simple queries,
although there would be no particular reason to use a semantic index
to achieve that functionality. One additional possibility provided by a
semantic index space is an organised set of browsable concept descriptors,
as a means of comprehending the associated layer of media items
[Pollard 1993]. The user can browse the index space, 'beam down' to view
media items of interest, and conversely 'beam up' to the index space from
media items. Additionally, when index terms are combined, the user may
browse around each term, broadening and narrowing the specificity of
description and seeing the effect on likely 'hits'.
Alternatively, the combined terms can be considered as locating a position
in a 'hyperindex', permitting a string of terms to be broadened or narrowed
in one navigation action [Bruza 1990].
If a user enters a
set of query terms as opposed to browsing the index space, equivalence
relationships permit a broad entry vocabulary of synonyms to be tied
together for retrieval purposes, without the user having to specify the
exact term employed for indexing. As a simple example, this document is
indexed by a set of controlled
vocabulary terms from the ACM Computing
Classification [ACM 1998] (see Categories and
Subject Descriptors above). In the ACM Digital Library pages, explicit hypertext
links can be navigated. In addition, controlled vocabulary
index terms can be combined with free text terms when searching the library
and the hypertext version of the classification can be browsed as a subject
index in order to select terms for searching.
Beyond this, the inclusion of semantic information in the index space provides the
opportunity for knowledge-based hypermedia systems that provide intelligent
navigation support and retrieval, with the system taking a more active role
in the navigation process than relying on manual browsing alone. For
example, rules governing permitted combinations of terms can filter a
user's possible navigation options.
Work at the University of Glamorgan explores the potential of reasoning
over the semantic relationships in the index space. Traversal of transitive
relationships makes possible imprecise matching between query and media item,
or between two media items, rather than relying on an exact match of controlled vocabulary terms.
Expanding terms offers an augmented browsing
capacity based on measures of distance in the semantic index space.
Results can be post-processed for expression in a particular retrieval tool. Various
possibilities exist for indirect computed links with such hybrid
[Cunliffe 1997]. For example,
information items with semantically close terms can be ranked in the result
or destination set, or the system might automatically suggest terms to be
considered for inclusion in a query. If facets exist for time and place in
the index space, then a result set can be returned as a dynamic guided tour
based on temporal or spatial relationships (or indeed other orderings).
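One way to picture term expansion by semantic distance is as a bounded shortest-path search over the thesaurus. The sketch below is only an illustration under stated assumptions: the thesaurus fragment, the per-relationship traversal costs, and the use of Dijkstra's algorithm are all choices made for this example and should not be read as the specific measure used in the work cited here.

```python
import heapq

# Hypothetical thesaurus fragment: term -> list of (relationship, neighbour).
# BT = broader term, NT = narrower term, RT = related term.
LINKS = {
    "armchairs": [("BT", "chairs"), ("RT", "upholstery")],
    "chairs": [("NT", "armchairs"), ("NT", "stools"), ("BT", "furniture")],
    "stools": [("BT", "chairs")],
    "furniture": [("NT", "chairs"), ("NT", "tables")],
    "tables": [("BT", "furniture")],
    "upholstery": [("RT", "armchairs")],
}

# Assumed traversal costs per relationship type (illustrative values only).
COST = {"NT": 1.0, "BT": 1.5, "RT": 2.0}

def expand(term, max_distance):
    """Return {term: distance} for every term reachable within max_distance.

    Dijkstra's shortest-path search, so each term gets its cheapest path.
    """
    best = {term: 0.0}
    queue = [(0.0, term)]
    while queue:
        dist, current = heapq.heappop(queue)
        if dist > best.get(current, float("inf")):
            continue  # stale queue entry, a cheaper path was already found
        for relation, neighbour in LINKS.get(current, []):
            new_dist = dist + COST[relation]
            if (new_dist <= max_distance
                    and new_dist < best.get(neighbour, float("inf"))):
                best[neighbour] = new_dist
                heapq.heappush(queue, (new_dist, neighbour))
    return best

expanded = expand("armchairs", 3.0)
for name, d in sorted(expanded.items(), key=lambda kv: (kv[1], kv[0])):
    print(f"{d:.1f}  {name}")
```

With these assumed costs, a query on "armchairs" would also reach "chairs", "upholstery", "stools" and "furniture", each with a distance that can be used to rank the corresponding media items; "tables" lies beyond the threshold and is excluded.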
Alternatively, the focus of a user's navigation can remain in the document
(media) space, typically requiring less cognitive overhead than
constructing a formal query
[Marchionini 1995]. In this case, having found
an information item of interest, the navigation action consists of
requesting "More items like this one", with the system responsible for a
(best-match) similarity measure of the item's index terms. At the cost of
greater cognitive demand on the user, the source context for the navigation
may be modified and particular media items or terms (de)emphasised (cf.
relevance feedback techniques in IR).
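A "More items like this one" action could be sketched as a best-match comparison of index term sets. Everything in the following Python example is an assumption made for illustration: the distance table (which in practice would be derived from the thesaurus), the cut-off value, the item names, and the particular averaging formula.

```python
# Hypothetical pairwise semantic distances between index terms (symmetric).
DISTANCE = {
    frozenset(["armchairs", "chairs"]): 1.0,
    frozenset(["chairs", "stools"]): 1.0,
    frozenset(["armchairs", "stools"]): 2.0,
    frozenset(["oak", "wood"]): 1.0,
}
MAX_DISTANCE = 3.0  # assumed cut-off: terms further apart count as unrelated

# Hypothetical media items and their index terms.
ITEMS = {
    "item_a": {"armchairs", "oak"},
    "item_b": {"stools", "wood"},
    "item_c": {"chairs"},
    "item_d": {"teapots"},
}

def term_closeness(a, b):
    """Closeness in [0, 1]: 1 for identical terms, 0 at or beyond the cut-off."""
    if a == b:
        return 1.0
    d = DISTANCE.get(frozenset([a, b]), MAX_DISTANCE)
    return max(0.0, 1.0 - d / MAX_DISTANCE)

def similarity(source_terms, target_terms):
    """Average, over the source terms, of the best match in the target terms."""
    if not source_terms or not target_terms:
        return 0.0
    total = sum(max(term_closeness(s, t) for t in target_terms)
                for s in source_terms)
    return total / len(source_terms)

def more_like_this(item_id):
    """Rank all other items by similarity to the chosen item, best first."""
    source = ITEMS[item_id]
    scored = [(similarity(source, terms), other)
              for other, terms in ITEMS.items() if other != item_id]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [other for score, other in ranked if score > 0]

print(more_like_this("item_a"))  # ['item_b', 'item_c']
```

Note that this particular measure is asymmetric, since it averages over the source item's terms only; (de)emphasising terms, as mentioned above, would amount to weighting the contributions to that average.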
Semantically based retrieval underpins diverse efforts to provide access to distributed multimedia resources, such as the many
projects involving SGML (XML) and Z39.50 for networked access to
cross-platform information. Major efforts are underway to create subject-based
gateways to Internet resources, sometimes combining manually indexed
and robot harvested metadata. The W3C Recommendation for a
'machine-understandable' Resource Description Framework supports the thrust
of this research [Lassila 1999]. An RDF descriptor might include the Dublin
Core element, Subject, specifying a classification or thesaurus to which
keywords belong. Precise semantic index retrieval tools will be required
to provide a manageable set of results to requests that may span several
[Doerr 1997], and may involve networked terminology
servers and more than one thesaurus or classification. One point worth
emphasising is the social dimension to access and the link with existing
cataloguing practice. Controlled vocabularies are often the result of
standards efforts in subject domains, continue to evolve, and are part of a
network of practice and education/training in the information science
community. They have the potential to act as a bridge between information
provider and seeker, "a semantic road map for searchers and indexers"
[Soergel 1995], if tools can be devised that
visualise their structure and how they may be used.
A number of key issues for research remain if the potential of significant
gains in precision of information access is to be realised.
- An advantage of building query functionality into hypertext navigation
is a smooth transition between querying and browsing. Can we identify the
appropriate extent of cognitive effort demanded by interfaces to navigation
tools? How far should the internal workings of matching functions or the
detail of the underlying semantic network be brought to the surface?
- Some applications may lend themselves to the specialisation of the
standard thesaurus relationships into richer sets, particularly the
associative relationship. For example, in some situations it may be useful to
distinguish various kinds of causal relationships from the generic
associative relationship.
- The problem of expressing similarity between pre-coordinated strings
of semantic index terms needs further investigation. How much should be
pre-computed and what can be left to dynamic computation? How best can we
express syntax or structure in such strings? This effort converges with work on
description logic ontologies [Bullock 1998].
- Various efforts attempt to combine statistical IR and semantic
controlled vocabulary approaches. For example, Agosti et al.
[Agosti 1995] propose
a three-layer architecture for Hypermedia IR systems combining a
statistical index layer and a semantic (thesaurus) layer (see also
Studies of online
searching behaviour have investigated conditions influencing choice of
free text or controlled vocabulary terms (e.g.
[Fidel 1991]). How should the
two approaches be best integrated - should they be seen as different
components of a toolkit, or should a matching function incorporate both
statistical weighting and semantic measures? In addition, indirect semantic
links and explicit authored links will soon be combined in link/search
engines. What principles should guide this integration?
- The semantic interoperability of overlapping but different thesauri
is an important issue for remote access to distributed sets of resources
employing controlled vocabularies in metadata. A concept may exist in one
vocabulary but not another, or may map (partially) to various concepts.
Art and Architecture Thesaurus Browser, [Online: http://shiva.pub.getty.edu/aat_browser/], 1999.
ACM Computing Classification. http://www.acm.org/class/1998/
Maristella Agosti, Massimo Melucci, and Fabio Crestani. "Automatic Authoring and Construction of Hypermedia for Information Retrieval" in ACM Multimedia Systems, 3(1), 15-24, 1995.
Hans C. Arents and Walter F. L. Bogaerts. "Navigation without Links and Nodes without Contents: Intensional Navigation in a Third-Order Hypermedia System" in Hypermedia, 5(3), 187-204, 1993.
Y. Alp Aslandogan, Chuck Thier, Clement T. Yu, Jon Zou, and Naphtali Rishe. "Using Semantic Contents and WordNet in Image Retrieval" in Proceedings of ACM SIGIR '97, 286-295, 1997.
Tim Berners-Lee. World Wide Web Design Issues: A Roadmap to the Semantic Web, [Online: http://www.w3.org/DesignIssues/Semantic.html], 1998.
Peter Bruza. "Hyperindices: A Novel Aid for Searching in Hypermedia" in Proceedings of the ACM European Conference on Hypertext '90 (ECHT '90), Versailles, France, 109-122, November 1990.
Joseph Bullock and Carole Goble. "TourisT: The Application of a Description Logic based Semantic Hypermedia System for Tourism" in Proceedings of ACM Hypertext '98, Pittsburgh PA, 132-141, June 1998.
Yves Chiaramella and Ammar Kheirbek. "An Integrated Model for Hypermedia and Information Retrieval" in Information Retrieval and Hypertext, Maristella Agosti and Alan Smeaton (editors), Kluwer, 139-178, 1996.
George Collier. "Thoth-II: Hypertext with Explicit Semantics" in Proceedings of ACM Hypertext '87, Chapel Hill, NC, 269-289, November 1987.
Daniel Cunliffe, Carl Taylor, and Douglas Tudhope. "Query-based Navigation in Semantically Indexed Hypermedia" in Proceedings of ACM Hypertext '97, Southampton, UK, 87-95, April 1997.
Dublin Core. [Online: http://purl.org/metadata/dublin_core], 1999.
Steven J. DeRose. "Expanding the Notion of Links" in Proceedings of ACM Hypertext '89, Pittsburgh, PA, 249-257, November 1989.
Martin Doerr, Irene Fundulaki and Vassilis Christophidis. "The Specialist Seeks Expert Views: Managing Digital Folders in the AQUARELLE Project" in Proceedings of Museums and the Web, David Bearman and Jennifer Trant (editors), 261-270, 1997.
Raya Fidel. "Searchers' Selection of Search Keys (I-III)" in Journal of the American Society for Information Science, 42(7), 490-527, 1991.
Mark E. Frisse and Steven B. Cousins. "Information retrieval from hypertext: Update on the Dynamic Medical Handbook" in Proceedings of ACM Hypertext '89, Pittsburgh, PA, 199-211, November 1989.
Ora Lassila and Ralph Swick (editors), "Resource Description Framework (RDF) Model and Syntax Specification" World Wide Web Consortium Recommendation, [Online: http://www.w3.org/TR/REC-rdf-syntax/], February 22 1999.
Gary Marchionini. Information Seeking in Electronic Environments. Cambridge University Press, 1995.
MeSH 1999. Medical Subject Headings homepage. http://www.nlm.nih.gov/mesh/meshhome.html
Jocelyne Nanard and Mark Nanard. "Using structured types to incorporate knowledge in hypertext" in Proceedings of ACM Hypertext '91, San Antonio, TX, 329-344, December 1991.
Richard Pollard. "A hypertext-based thesaurus as a subject browsing aid for bibliographic databases" in Information Processing and Management, 29(3), 345-357, 1993.
Steven Pollitt, Martin P Smith and Patrick A J Braekevelt. "View-based Searching Systems" in Proceedings of Joint Workshop of BCS IR and HCI Specialist Groups, (Johnson and Dunlop eds.) 73-77.
Roy Rada, Weigang Wang, and Alex Birchall. "Retrieval hierarchies in hypertext" in Information Processing and Management, 29(3), 359-371, 1993.
John L. Schnase, John J. Leggett, David L. Hicks, and Ron L. Szabo. "Semantic Data Modeling of Hypermedia Associations" in ACM Transactions on Information Systems (TOIS), 11(1), 27-49, January 1993.
Dagobert Soergel. "The Art and Architecture Thesaurus (AAT): a critical appraisal" in Visual Resources, 10(4), 369-400, 1995.
Randall H. Trigg and Mark Weiser. "Textnet: A Network-based Approach to Text Handling" in ACM Transactions on Office Information Systems (TOIS), 4(1), 1-23, January 1986.
Douglas Tudhope, Paul Beynon-Davies, Carl Taylor, and Chris B. Jones. "Virtual Architecture Based on a Binary Relational Model: A Museum Hypermedia Application" in Hypermedia, 6(3), 174-192, 1994.
Douglas Tudhope and Carl Taylor. "Navigation via Similarity: Automatic Linking Based on Semantic Closeness" in Information Processing and Management, 33(2), 233-242, 1997.
C. J. "Keith" van Rijsbergen. Information Retrieval. Butterworth, 1979.
Peter C. Weinstein. "Ontology-based metadata: transforming the MARC legacy" in Proceedings of ACM Digital Libraries '98, 254-263, 1998.
Permission to make digital or hard copies of part or all of
this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or
commercial advantage and that copies bear this notice and the full
citation on the first page. Copyrights for components of this work
owned by others than ACM must be honored. Abstracting with credit is
permitted. To copy otherwise, to republish, to post on servers, or to
redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from Publications Dept, ACM Inc., fax +1
(212) 869-0481, or firstname.lastname@example.org.