ACM Computing Surveys 28A(4), December 1996, http://www.acm.org/surveys/1996/BlakeleyThoughts/. Copyright © 1996 by the Association for Computing Machinery, Inc. See the permissions statement below.

Thoughts on Directions in Database Research

José A. Blakeley

Microsoft Corporation
One Microsoft Way
Redmond, WA 98052-6399
joseb@microsoft.com

Database Management Systems as Conglomerates of Cooperating Data Stores

The traditional notion of a DBMS is being challenged by a proliferation of systems that enable individuals and departments within organizations to keep track of their information. More and more mission-critical data is kept in systems outside traditional corporate DBMSs. Examples of such systems include file systems (where most of the world's electronic data is stored today), spreadsheets, project management systems, electronic mail, personal financial systems, desktop DBMSs, and information published on the Internet and intranets, among others.

This proliferation of information presents a very interesting challenge to the database research community. We have the opportunity to export many of the database concepts, principles, and engineering practices to many other systems outside DBMSs. This would make database technology even more widely available. Some of these database concepts include transaction management, query processing, indexing techniques, persistent storage management, and schema management, among others.

As an example, one of my mission-critical databases is email. I call it mission critical because any time my email service is down, my productivity suffers significantly. Every day, I wish I could formulate queries beyond the simple filtering of messages in my email. For example, I would like to formulate queries that join my email file with itself, or perform the equivalent of nested SQL subqueries, or join my email with other data containers outside email. However, since email is not a real DBMS, I don't have a sophisticated query processor available with it. The same is true of other data sources at my disposal. What is the answer? Should I be forced to move my email and other critical data to a DBMS just to be able to query it? Or should a query processing service be made available inside email? I believe the latter is the more appropriate answer. We need to redefine the boundaries of DBMSs so that we can componentize and redeploy some of the traditional DBMS functions to other environments. This will transform the notion of a database from a centralized, one-size-fits-all data store to a conglomerate of cooperating stores helping people and organizations satisfy their data management needs.
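The kind of query service described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a design for a real mail system: the `hash_join` function, the `MAILBOX` and `CONTACTS` records, and their field names are all hypothetical. The point is that a join operator written once can run against a mailbox in place, and against another data container, without first moving either into a DBMS.

```python
from collections import defaultdict


def hash_join(left, right, left_key, right_key):
    """Equi-join two iterables of records.

    Builds a hash index on `right`, then probes it with each row of `left`.
    Yields (left_row, right_row) pairs whose keys are equal.
    """
    index = defaultdict(list)
    for row in right:
        index[right_key(row)].append(row)
    for row in left:
        # .get() probes without creating empty entries in the index.
        for match in index.get(left_key(row), []):
            yield row, match


# Hypothetical mailbox: each message is a plain record, not a DBMS row.
MAILBOX = [
    {"id": 1, "sender": "ana@example.com", "subject": "Budget draft", "in_reply_to": None},
    {"id": 2, "sender": "raj@example.com", "subject": "Re: Budget draft", "in_reply_to": 1},
    {"id": 3, "sender": "ana@example.com", "subject": "Lunch?", "in_reply_to": None},
]

# Hypothetical second data container, e.g. rows exported from a spreadsheet.
CONTACTS = [
    {"email": "ana@example.com", "team": "Finance"},
    {"email": "raj@example.com", "team": "Engineering"},
]

# Self-join: pair every reply with the message it answers.
replies = [
    (original["subject"], reply["sender"])
    for reply, original in hash_join(
        MAILBOX, MAILBOX,
        left_key=lambda m: m["in_reply_to"],
        right_key=lambda m: m["id"],
    )
]

# Join email with a container outside email: which team sent each message?
teams = [
    (message["subject"], contact["team"])
    for message, contact in hash_join(
        MAILBOX, CONTACTS,
        left_key=lambda m: m["sender"],
        right_key=lambda c: c["email"],
    )
]

# Equivalent of a nested subquery: messages whose id appears in no reply.
answered_ids = {m["in_reply_to"] for m in MAILBOX if m["in_reply_to"] is not None}
unanswered = [m["subject"] for m in MAILBOX if m["id"] not in answered_ids]
```

The same operator serves both the self-join and the cross-container join because it depends only on an iteration interface and a key function, which is the sort of boundary a componentized query processor would need.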

Industry has already taken the initiative by inventing architectural frameworks, sometimes called object service architectures (OSAs), consisting of conglomerates of relatively independent services that can be integrated via a common software backplane. Representative examples of this approach are Microsoft's Component Object Model (COM) and OMG's OSA. Examples of the software backplane are Microsoft's Distributed COM and OMG's CORBA. We need to analyze how DBMSs are put together to define the proper abstraction boundaries where finer-grained DBMS functionality can be factored. This will enable us to define common interfaces around these boundaries, and to factor and reuse common database services better.

Industry is a step ahead of research in this direction by providing initial models (e.g., Microsoft's OLE DB, http://www.microsoft.com/oledb/). Still, defining appropriate functional boundaries and designing good interfaces remains more an art than a science. Research is needed to define good principles of interface design, define optimal boundaries of reusable database functionality, develop theory and techniques for composing systems out of components, and develop theory and techniques for verifying the behavior of the resulting systems. As another example, Object-Relational DBMSs are adding abstract data type extensions to relational DBMSs. However, for this new industry to flourish, there need to be standard DBMS architectures that enable the mixing and matching of third-party extensions. Parties interested in writing these extensions cannot afford to write them separately for each proprietary DBMS architecture. To be useful, these extensions should plug into many DBMSs in a standard way.

Derived Data as a First-Class DBMS Service

The notion of derived data appears in DBMSs disguised under different, isolated techniques. Single-table indexes and join indexes are examples of derived data that are defined by relational expressions over base data with additional physical properties. Cursor models (e.g., SQL-CLI's snapshot, keyset, and dynamic cursors) are examples of derived data that guarantee fast retrieval of frequently used data under various update conditions on the cursor itself or its base data. Replication is an example of derived data management that guarantees fast data access in distributed and disconnected environments. Data warehouse systems use derived data as a strategy for delivering results efficiently. When supported, each of the above examples of derived data is supported by separate, totally unrelated pieces of code in the system.

The problem of maintaining and exploiting derived data has received considerable attention from the research community, yet we still do not find uniform, native support for these ideas as a first-class service in DBMSs. An impediment to the wider adoption of derived data in DBMSs has been that techniques have been developed somewhat in isolation, without regard for their impact on other aspects of the system such as transaction behavior and query/update processing. This is another area where componentization may enable sharing of concepts, algorithms, and techniques in a single derived data service that can be leveraged for multiple purposes.
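As a rough illustration of what a single derived data service might look like, the Python sketch below routes every change to the base data through one maintenance path that keeps two different kinds of derived data current: a secondary index and an aggregate view. All class and method names (`DerivedDataService`, `SecondaryIndex`, `SumView`, `on_insert`, `on_delete`) are hypothetical, and the sketch ignores transactions, concurrency, and query optimization, which are exactly the interactions a real service would have to address.

```python
from collections import defaultdict


class DerivedDataService:
    """Holds base rows and forwards each change to every registered derivation."""

    def __init__(self):
        self.rows = {}          # primary key -> row (a dict)
        self.derivations = []   # objects exposing on_insert / on_delete

    def register(self, derivation):
        # Backfill from existing base data so late registration stays consistent.
        for key, row in self.rows.items():
            derivation.on_insert(key, row)
        self.derivations.append(derivation)
        return derivation

    def insert(self, key, row):
        if key in self.rows:
            raise KeyError(f"duplicate key: {key!r}")
        self.rows[key] = row
        for derivation in self.derivations:
            derivation.on_insert(key, row)

    def delete(self, key):
        row = self.rows.pop(key)  # raises KeyError if the key is absent
        for derivation in self.derivations:
            derivation.on_delete(key, row)


class SecondaryIndex:
    """Derived data: maps a column value to the set of keys holding it."""

    def __init__(self, column):
        self.column = column
        self.entries = defaultdict(set)

    def on_insert(self, key, row):
        self.entries[row[self.column]].add(key)

    def on_delete(self, key, row):
        self.entries[row[self.column]].discard(key)


class SumView:
    """Derived data: running SUM(value_column) grouped by group_column."""

    def __init__(self, group_column, value_column):
        self.group_column = group_column
        self.value_column = value_column
        self.totals = defaultdict(int)

    def on_insert(self, key, row):
        self.totals[row[self.group_column]] += row[self.value_column]

    def on_delete(self, key, row):
        self.totals[row[self.group_column]] -= row[self.value_column]
```

A caller would register both derivations with one service, for example `store.register(SecondaryIndex("customer"))` and `store.register(SumView("customer", "amount"))`, and then issue only `insert` and `delete` calls; the index and the aggregate are maintained by the same code path rather than by unrelated pieces of the system.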

Integrated Database and Programming Language Environments

Database applications need programming environments that tightly integrate conventional programming languages with database and system functions such as queries, transactions, security, and distribution. In addition, it is becoming more evident that database application developers need the development environments that are common in conventional programming. Compiler technology is at the center of this integration, but we also need the integrated editors, debuggers, profilers, and project management functions that development environments provide. The development of integrated database and programming language environments requires close cooperation between programming language and database compiler developers. Traditionally, programming language developers have resisted extending their languages with database functions. Similarly, database compiler developers have kept database languages (SQL) neutral with respect to programming languages. In the meantime, the problem for database application developers remains. SQL Persistent Stored Modules adds programming language extensions to SQL, but this effort represents a tremendous amount of unnecessary reinvention, resulting in yet another programming language. The recent development of the Java language and its underlying virtual machine may serve as a catalyst for the creation of a universal run-time environment not just for conventional programming, but also for database programming.

Databases and Scalable Distributed Computing

Advances in computer hardware and networks are making available tremendous amounts of computing power. However, writing scalable, distributed applications that can take advantage of this computing power continues to be extremely difficult. Database technology can contribute uniquely to making scalable, distributed computing a reality for two reasons. First, databases are one of the few application domains that have demonstrated the ability to scale their performance as more computing resources are made available. Database research on parallel, reliable, scalable systems will be increasingly important. Second, while the infrastructure for building distributed applications is there, it is still extremely hard to write such applications with well-defined semantics in the presence of failures and in a scalable way. Database research can contribute to the creation of environments for building robust, scalable, distributed applications by exporting concepts and techniques such as transactions. These will enable the definition of proper semantics for such applications in the presence of failures, and the optimization of the interactions among distributed components as the workload and the characteristics of the computing resources change.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or permissions@acm.org.
