One consequence of inexpensive computer memory and storage devices is that much less gets thrown out. People who normally wouldn’t characterize themselves as packrats find themselves accumulating megabytes of old email messages, news articles, personal financial data, digital images, digital music in various formats and, increasingly, animations, movies and other multimedia presentations. For many of us, digital memory serves to supplement the neural hardware we were born with for keeping track of things; the computer becomes a sort of neural prosthetic or memory amplifier.
However reassuring it may be to know that every aspect of your digital lifestyle is stored on your computer’s hard drive, storing information doesn’t do much good if you can’t get at what you need when you need it. How do you recall the name of the restaurant your friend from Seattle mentioned in email a couple of years back when she told you about her new job? Or perhaps you’re trying to find the recommendation for a compact digital camera that someone sent you in email or you saved from a news article. It’s tough remembering where you put things and you’d rather not look through all your files each time you want to recall a piece of information.
In 1999, when NASA launched the first of its Earth Observing System (EOS) satellites, it knew it would have to do something with the terabytes (a terabyte is a trillion bytes) of data streaming down from these orbiting observers. EOS satellites collect data for the long-term study and monitoring of our planet, including its atmosphere, oceans and biosphere. NASA scientists realized that there was too much data to store on fast-access disk drives, and it simply wouldn’t do any good to put it on some long-term-storage medium like CDs because no one would really know what was on them. Using CDs would have been like throwing a decade’s worth of your financial data into unmarked boxes in your basement.
To make sure that scientists would be able to use the EOS data, NASA engineers devised strategies for cataloging and summarizing it as it streamed down from the satellites, and developed software for searching these catalogs and summaries to help scientists find what they’re looking for. Just as a library would be much harder to use without a card catalog, so the EOS data would be almost useless without a way to search it efficiently. Let’s consider some techniques and programs for keeping track of our digital data.
Simple text files are particularly convenient for storing data, whether it’s dessert recipes, sports scores or your old email messages. Even if your data is in some special format, you can often convert it to raw text by using an appropriate filter (a program that takes a file in one format and converts it to another format). One of the dangers of storing data in special formats is that, by the time you get around to using it, the program that created it may no longer work. Another problem is that different programs use different formats and often enough you’d like to search across files generated by different programs. For example, in trying to recall that digital camera recommendation, you might want to look at files generated by your email program, your word-processing program and your web browser.
One simple way to catalog and summarize all your data is to use filters to convert the text parts of all your files into raw text files. These text summaries would omit images, audio clips and graphical information, but the text remaining is likely to provide useful clues about the non-text content. The NASA engineers working on EOS didn’t have much text in their satellite data but they did have information about the time the data was generated, where the satellite was positioned in its orbit and therefore what it was looking at, and what sorts of cameras and sensors were used to produce the data, all of which could be very useful to someone searching for information relevant to a particular study.
Let’s look at some basic but powerful tools for searching text files. Suppose you have a lot of text files (indicated by the extension txt, as in java.txt) in which you’ve stored various email messages and news articles about programming languages. How would you go about searching in these files for, say, an article that talks about both Java and Scheme? A host of programs make it relatively easy to search through files and whole file systems.
To demonstrate one such program, suppose that we have five files, each consisting of a single line. I’ll list their contents using cat:
% ls
1.txt 2.txt 3.txt 4.txt 5.txt
% cat 1.txt
Here we have Java and Scheme appearing on the same line.
% cat 2.txt
Scheme appearing on the same line but followed by Java.
% cat 3.txt
Java followed by Scheme after fewer than 20 characters.
% cat 4.txt
Java with more than 20 characters before Scheme appears.
% cat 5.txt
Java on a line without that word that rhymes with scream.
The original Unix text editor, ed, provides an operation, g/re/p, which searches globally (hence the g) for strings matching a given pattern (called a regular expression, hence the re) and then prints (the p) the lines containing those strings. This operation was used so often that it was packaged into the command grep that can be easily invoked from a Unix shell. Versions of grep are now available for every operating system I know of. Here’s how to use grep to find all lines in all our text files containing the word Java:
% grep Java *.txt
1.txt:Here we have Java and Scheme appearing on the same line.
2.txt:Scheme appearing on the same line but followed by Java.
3.txt:Java followed by Scheme after fewer than 20 characters.
4.txt:Java with more than 20 characters before Scheme appears.
5.txt:Java on a line without that word that rhymes with scream.
In this invocation, grep takes two arguments: a search pattern, Java in this case, and a file specification, *.txt. For each match, grep prints the file name (for example, 1.txt) and the line of text, separated by a colon.
Let’s try something a little more complex. Here’s a way to find all lines in all text files containing the words Java and Scheme:
% grep Java *.txt | grep Scheme
1.txt:Here we have Java and Scheme appearing on the same line.
2.txt:Scheme appearing on the same line but followed by Java.
3.txt:Java followed by Scheme after fewer than 20 characters.
4.txt:Java with more than 20 characters before Scheme appears.
Here we used Unix pipes to feed the results from one program, grep Java *.txt, into another, grep Scheme. The pipe delivers the output of the first invocation of grep to the second invocation as its standard input; given a pattern but no file specification, grep reads from standard input as if it were a (virtual) file.
You can also tell grep that you want to exclude all lines that match a pattern by using the -v (for “invert match”) option. Here’s how you’d find all lines in all text files that contain the word Java but not the word Scheme:
% grep "Java" *.txt | grep -v "Scheme"
5.txt:Java on a line without that word that rhymes with scream.
We can get similar results by making more sophisticated use of the language of regular expressions to specify search patterns. But, as happens so often in modern programming practice, we’re going to find ourselves trying to make sense of different languages and different syntaxes even within a single call to grep.
The pattern language for specifying lists of files to search, the second argument to grep, is not as sophisticated as the regular-expression language used in the first argument. The former is part of the shell language and common to all commands; the latter belongs to grep. To complicate matters, the two languages assign different meanings to some of the same characters.
In specifying files, the shell uses ? to match any single character and * to match any sequence of zero or more characters, so that *.??? matches text.txt but not program.pl. grep matches single characters differently, without using ? at all, and uses * in a related but disorientingly different way for matching multiple characters.
As the first argument in an invocation of grep, the pattern "Java.*Scheme" matches any string consisting of Java followed by zero or more characters (except the so-called newline character) followed by Scheme. In specifying files in the shell, the . was treated literally, but in the regular expression the . indicates what can match (any character except newline) and the * indicates how often it can match (zero or more times).
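To see the two languages side by side, here’s a sketch in Python, whose fnmatch module implements shell-style globbing and whose re module implements regular expressions (Python’s regular-expression syntax is close to, though not identical with, grep’s):

```python
import fnmatch
import re

# Shell-style globbing: ? matches exactly one character, * matches any run.
assert fnmatch.fnmatch("text.txt", "*.???")        # three-character extension
assert not fnmatch.fnmatch("program.pl", "*.???")  # .pl has only two characters

# Regular expressions: . matches any character, * repeats the previous item.
assert re.search("Java.*Scheme", "Here we have Java and Scheme on one line.")
assert not re.search("Java.*Scheme", "Scheme on the same line, then Java.")
```

The same characters, two different meanings: in the glob, the dot is literal and * stands alone; in the regular expression, the dot says what can match and * says how many times.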
This regular expression language lets you be very precise about what can match and how many times. For example, \d* matches zero or more digits (0,...,9), \s+ matches one or more whitespace characters (spaces, tabs, carriage returns), and [a-zA-Z]{2,5} matches at least two and no more than five alphabetic characters (A,...,Z and a,...,z). (Shorthands like \d and \s are Perl-style extensions; depending on your version of grep, they may require an option such as -P.) This command finds any line in which Java appears followed by Scheme:
% grep "Java.*Scheme" *.txt
1.txt:Here we have Java and Scheme appearing on the same line.
3.txt:Java followed by Scheme after fewer than 20 characters.
4.txt:Java with more than 20 characters before Scheme appears.
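The quantifiers mentioned a moment ago (\d*, \s+ and {2,5}) can be tried out in Python’s re module, which uses the same Perl-style shorthands:

```python
import re

# \d* matches zero or more digits; even the empty string qualifies.
assert re.fullmatch(r"\d*", "")
assert re.fullmatch(r"\d*", "2024")

# \s+ matches one or more whitespace characters; zero is not enough.
assert re.fullmatch(r"\s+", " \t ")
assert not re.fullmatch(r"\s+", "")

# {2,5} bounds the repetition: two to five letters, no more, no fewer.
assert re.fullmatch(r"[a-zA-Z]{2,5}", "Java")
assert not re.fullmatch(r"[a-zA-Z]{2,5}", "JavaScript")
```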
This is not the same as finding any line that contains both Java and Scheme; we can manage that search by using a somewhat more complicated regular expression:
% grep "Java.*Scheme\|Scheme.*Java" *.txt
1.txt:Here we have Java and Scheme appearing on the same line.
2.txt:Scheme appearing on the same line but followed by Java.
3.txt:Java followed by Scheme after fewer than 20 characters.
4.txt:Java with more than 20 characters before Scheme appears.
The vertical bar, which means “pipe” elsewhere in Unix, means “or” in this context. The backslash, \, preceding the bar is an “escape” character that tells Unix not to interpret the bar in the usual way. Curly brackets, { and }, also have to be “backslashified” to avoid being misinterpreted by the shell, as in this next invocation, meant to find all lines in all text files containing the word Java followed by the word Scheme separated by no more than 20 characters:
% grep "Java.\{0,20\}Scheme" *.txt
1.txt:Here we have Java and Scheme appearing on the same line.
3.txt:Java followed by Scheme after fewer than 20 characters.
The need to prefix characters like |, { and } with backslashes so annoyed some programmers that they modified and extended grep’s regular expression language so that they could use these and other characters without a backslash. If you find yourself similarly annoyed, use egrep (for “extended” grep) to search your files:
% egrep "Java.{0,20}Scheme" *.txt
1.txt:Here we have Java and Scheme appearing on the same line.
3.txt:Java followed by Scheme after fewer than 20 characters.
How, you might ask, could anyone have invented such an arcane set of conventions for invoking one relatively simple program? Grep isn’t really so simple, of course, and most folks will agree that it’s fairly Byzantine in its rules of engagement. There are a number of factors at play here. Programming languages, like natural languages, evolve to suit the purposes of those who use them. It makes sense to have short ways of saying things that’ll be said often. The most succinct ways of saying things, short character sequences, are few, and hence they tend to get reused with different meanings in different contexts. You wouldn’t translate “La palabra española para ‘said’ es ‘dicho’” as “The Spanish word for ‘said’ is ‘said’.” Quotes, backslashes and other conventions for escaping characters establish the appropriate context for interpretation and enable a program that uses one set of conventions to invoke a program that uses a different set.
It’s not at all unusual to have a program written in one language call a program written in another that calls a program written in a third. A web page written in HTML, which is interpreted by your web browser, may cause a program written in Perl or Scheme to run on the web server (another program) hosting the web page, which in turn causes a Java program to be downloaded to your web browser and run in the Java Virtual Machine (still another program). Often enough the calling program needs to pass information to the program being called using the language of the calling program.
Here’s an example illustrating how useful it is for one program, the shell, to convey information to a second program, grep in this case. The shell command alias lets you define new commands corresponding to short programs (to save typing or to customize your environment). The alias command takes two arguments: the name of the new command and its definition. Even though the definitions are pretty short, it’s handy to put them in a special file that’s loaded every time you fire up a shell. Usually in a shell command the name of the command is followed by one or more arguments. And so, in defining new commands using alias, it’s convenient to be able to refer to the arguments; alias uses the convention that \!* refers to all the arguments appearing on the command line (there are other conventions for referring to individual arguments).
Suppose I want to define a new command whose arguments are inserted into the pattern or file specification arguments passed to an invocation of grep. I could use the definition:
% alias self 'grep "\!*" \!*.txt'
The new command self invokes grep to search through a text file whose name (minus the .txt extension) is specified as the only argument to self and print out all lines in which the name of the file appears. Here’s a simple example showing the new command applied to a self-referential file:
% cat refers.txt
This file refers to its name in the text.
% self refers
This file refers to its name in the text.
You can probably guess that 'grep "\\!*" \!*.txt' would yield something very different: lines containing the string \!*. With a little more work, I could define a command that would rename a file and then change all internal references to the old name of the file to the new name; this is very useful for maintaining lots of web pages. But I’m not trying to teach you everything about grep or alias, and I’m the first to admit that every time I do anything nontrivial combining these two programs I have to look up the documentation to get the silly conventions right. The point is that the conventions aren’t quite as silly as they first appear, and when you are doing a lot of grepping through files looking for stuff, escape characters and cryptic (but easy to type) sequences suddenly make a lot of sense. Well, maybe not a lot of sense, but certainly some sense.
Programs like grep are amazingly useful for searching your own stuff, largely because you have some idea of what’s in your own stuff. You know what you tend to call things; moreover, you know the names of your friends, familiar places and favorite things that you can use in arguments to programs like grep to find what you’re looking for. Other sources of data, however, can require more effort to organize.
In the case of the Earth Observing System, we imagined associating text files with chunks of satellite data. These text files would be even more useful if the information in them appeared in a particular order and a consistent format, say:
% cat data/EOS-078879.txt
Spacecraft: QuikSCAT
Date: 2002-08-24T14:10:00-05:00
Latitude: 44.67 North
Longitude: 63.58 West
Measurement: Phytoplankton and Dissolved Organic Matter
Instrument: Moderate-Resolution Imaging Spectroradiometer (MODIS)
The date is specified in an international standard, ISO 8601; the location in terms of latitude and longitude; and the measurement and instrument in terms of known types identified by NASA. The associated data could consist of a few numbers or an image file of many megabytes. For an example a little closer to earth, imagine creating a format to keep track of all your music files:
% cat song/1739.txt
Title: Raccoon Cat
Artist: Bill Frisell
Album: Gone, Just Like A Train
Genre: jazz
Year: 1998
Track: 14 of 16
These files contain a very simple form of structured data that’s easier to search than “unstructured” data: because the information is tagged consistently, you can easily pick out, say, the title and track of a song. It should be pretty obvious that, given a collection of such files, we could use grep or some other such program to find all jazz songs about raccoons performed by artists named Bill on albums whose names mention trains.
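Here’s a minimal sketch in Python of how such tagged records might be parsed and then searched by attribute (the parse_record helper is my own invention, not a standard tool):

```python
def parse_record(text):
    """Split 'Tag: value' lines into a dictionary keyed by tag."""
    record = {}
    for line in text.splitlines():
        tag, _, value = line.partition(": ")
        record[tag] = value
    return record

# The contents of song/1739.txt from the example above.
song = parse_record(
    "Title: Raccoon Cat\n"
    "Artist: Bill Frisell\n"
    "Album: Gone, Just Like A Train\n"
    "Genre: jazz\n"
    "Year: 1998"
)

# Because every field is tagged, a search can target a single attribute
# instead of scanning the whole file.
assert song["Genre"] == "jazz"
assert "Train" in song["Album"]
```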
That’s just the beginning; we could also create a set of files containing information about individual artists. Here’s what a file for Bill Frisell might look like:
% cat artist/0427.txt
Name: Bill Frisell
Birthdate: March 18, 1951
Born: Baltimore, Maryland
Instrument: guitar
Combining the files for songs and the files for artists, we could search for all artists born in Baltimore after 1950 who produced more than one jazz album between 1900 and 2000. We’d need more than just grep, but it wouldn’t be very difficult to write a short shell script for such a search.
If we’re really into keeping and analyzing a lot of data, there are plenty of programs out there to help. These programs, called database management systems, or databases for short, are used to keep track of bank accounts, customers, payrolls, employee benefits, purchases, inventory, scientific data, train schedules, computer accounts, college applications, student grades, course registrations, medical records, and so on. Name a type of data — someone probably keeps track of it in a database somewhere. If suddenly all the database programs in the world stopped working, business would grind to a halt and most stores would close their doors; planes wouldn’t take off and government agencies wouldn’t be able to function.
In addition to being indispensable in this information age, databases are very interesting programs, and often they’re just the thing for keeping track of your stuff. In Chapter 1, we played briefly with the logic programming language Prolog. Prolog lets you assert relationships like “Fred is Anne’s parent”, expressed in Prolog syntax as parent(fred, anne). We could use Prolog to describe all our song and artist data. We stored everything we knew about Bill Frisell’s song “Raccoon Cat” in a file called 1739.txt. We could use this file name as a unique name to refer to the song and then express the data in terms of several binary relationships, for example, genre(1739, jazz) and year(1739, 1998). Modern databases, often called relational databases, can express much the same information as Prolog but they use a somewhat different format and are designed to handle huge amounts of data — say, all the account information for American Express or the parts inventory for the Ford Motor Company.
In database lingo, the song and artist files are called records. Songs and artists are different types of records. Each record consists of a set of binary relationships that assign values or instances, for example, 1998 and jazz, to attributes, for example, year and genre. Each record has one attribute, called its primary key, that distinguishes it from every other record of the same type. Here this could be the song’s title, but since we’re talking about recordings or song tracks we could easily have songs of the same title performed by several artists or even the same artist at different times. Often the primary key is just an integer unique among records of the same type.
A collection of records of a given type, such as the collection of all songs or the collection of all artists, is called a table. A database is just a collection of tables. Usually there are several ways of organizing your data and it’s quite tricky to set up the records, attributes and instances so as to anticipate all the ways you might ultimately want to think about them. If you really want to go crazy with your music database, you’ll probably want record types for albums, recording labels, concerts, groups of artists such as bands and orchestras, and other people involved in the music business such as producers and recording engineers. People with a knack for organizing data are paid very good money for structuring data in commercial applications.
Let’s call up a database program and try to get some idea of what it means to create and use a database. Modern database systems have all sorts of fancy graphical front ends but, as usual, we’ll interact with this database program using a relatively low-level programming language, here a variant of SQL (for “structured query language”). Programmers who write database programs typically start with a subset of standard SQL and then add special syntax to take advantage of functionality that distinguishes their program or product. I’ve got a couple of open-source database programs, MySQL and PostgreSQL, installed on my home computer (“open-source software” and “free software” are not synonymous but they’re close enough as long as you’re just using the programs for your own purposes). I’ll use MySQL.
First I’ll create a new database named “music” and indicate that I want all the tables and records I’ll create to be stored in it:
> CREATE DATABASE music ;
OK
> USE music ;
OK
I’m going to create separate tables for artists and songs, starting with the artist table. Artist records will have an integer identifier (the primary key) plus attributes for the artist’s first name, last name, birthdate and birthplace. In addition to the names of the attributes, I indicate the types of the attribute values: integers and variable-length sequences of characters; for the artist’s birthdate, I use a special DATE type that has associated functions for referring to a DATE’s year, month and day. NULL is a special attribute value, and I’ve used the AUTO_INCREMENT feature so that the database program provides integer identifiers numbered from 1 to however many artist records I create. All the terms in capital letters are either part of SQL, so-called reserved words, or MySQL extensions. The MySQL interpreter conveniently lets me use multiple lines for a command as long as I observe the correct syntax and end the command with a semicolon.
> CREATE TABLE artist (
>     id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
>     first VARCHAR(20),
>     last VARCHAR(20),
>     birthdate DATE,
>     birthplace VARCHAR(40),
>     PRIMARY KEY (id) ) ;
OK
The DESCRIBE command lets us look at the fields of a specified table. I usually do this just to make sure I didn’t mistype anything.
> DESCRIBE artist ;
+------------+-------------+------+-----+---------+
| Field      | Type        | Null | Key | Default |
+------------+-------------+------+-----+---------+
| id         | smallint(5) |      | PRI | NULL    |
| first      | varchar(20) | YES  |     | NULL    |
| last       | varchar(20) | YES  |     | NULL    |
| birthdate  | date        | YES  |     | NULL    |
| birthplace | varchar(40) | YES  |     | NULL    |
+------------+-------------+------+-----+---------+
The output following the invocation of the DESCRIBE command depicts a table using various characters (+, - and |) to produce the horizontal and vertical lines separating rows and columns. It’s kind of old fashioned, but it’s convenient for interacting with a database without a fancy graphical user interface.
Next we INSERT a bunch of records into the artist table. I use NULL for anything I don’t know or want the program to fill in for me, in this case, the record number.
> INSERT INTO artist VALUES
>     (NULL, 'Bill', 'Frisell', '1951-03-18', 'Baltimore, Maryland'),
>     (NULL, 'Bonnie', 'Raitt', '1949-11-08', 'Burbank, California'),
>     (NULL, 'Melvin', 'Taylor', '1959-03-13', 'Jackson, Mississippi'),
>     (NULL, 'Robert', 'Cray', '1953-08-01', 'Columbus, Georgia'),
>     (NULL, 'Keith', 'Jarrett', '1945-05-08', 'Allentown, Pennsylvania'),
>     (NULL, 'Sue', 'Foley', '1968-03-29', 'Ottawa, Canada') ;
OK
Now that there’s some data in the database, let’s see how SQL lets us manage it. First, we’ll use an SQL query to list all the artist records sorted by last name:
> SELECT * FROM artist ORDER BY last ;
+----+--------+---------+------------+-------------------------+
| id | first  | last    | birthdate  | birthplace              |
+----+--------+---------+------------+-------------------------+
|  4 | Robert | Cray    | 1953-08-01 | Columbus, Georgia       |
|  6 | Sue    | Foley   | 1968-03-29 | Ottawa, Canada          |
|  1 | Bill   | Frisell | 1951-03-18 | Baltimore, Maryland     |
|  5 | Keith  | Jarrett | 1945-05-08 | Allentown, Pennsylvania |
|  2 | Bonnie | Raitt   | 1949-11-08 | Burbank, California     |
|  3 | Melvin | Taylor  | 1959-03-13 | Jackson, Mississippi    |
+----+--------+---------+------------+-------------------------+
Once you get the hang of translating them, SQL queries read a little like natural language. This one might be translated as “select all (the * acts as a wildcard) the attributes from the artist table and then print them out, one record to a line, ordered by the last name of the artist.” MySQL displays the result as a table. Indeed, in addition to the tables specifically defined by CREATE TABLE, every SQL query creates a new virtual table that can be used anywhere an explicitly defined table can be used.
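If you’d like to experiment without installing a database server, the same ideas can be sketched with Python’s built-in sqlite3 module; SQLite’s SQL dialect differs from MySQL’s in small ways (its auto-increment convention, for one), but the flavor is the same:

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # a throwaway in-memory database
conn.execute("""CREATE TABLE artist (
    id INTEGER PRIMARY KEY,          -- SQLite assigns ids when given NULL
    first TEXT, last TEXT, birthdate TEXT, birthplace TEXT)""")
conn.executemany(
    "INSERT INTO artist VALUES (NULL, ?, ?, ?, ?)",
    [("Bill", "Frisell", "1951-03-18", "Baltimore, Maryland"),
     ("Bonnie", "Raitt", "1949-11-08", "Burbank, California"),
     ("Keith", "Jarrett", "1945-05-08", "Allentown, Pennsylvania")])

# Every query yields a (virtual) table; here, artists sorted by last name.
rows = conn.execute("SELECT last FROM artist ORDER BY last").fetchall()
assert [r[0] for r in rows] == ["Frisell", "Jarrett", "Raitt"]
```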
Let’s make a more interesting request: a table with two columns listing the name and birthdate of those artists born before 1950:
> SELECT last, birthdate FROM artist
>     WHERE birthdate < '1950-01-01' ORDER BY birthdate ;
+---------+------------+
| last    | birthdate  |
+---------+------------+
| Jarrett | 1945-05-08 |
| Raitt   | 1949-11-08 |
+---------+------------+
Regular expressions and pattern matching are just too useful to be unavailable in SQL queries. Unfortunately, MySQL’s LIKE operator uses a different pattern syntax from grep’s; in the next query, % matches zero or more characters. To create a table with three columns listing the first name, last name and birthplace of any artist born in Mississippi, I say:
> SELECT first, last, birthplace FROM artist
>     WHERE birthplace LIKE "%Mississippi%" ;
+--------+--------+----------------------+
| first  | last   | birthplace           |
+--------+--------+----------------------+
| Melvin | Taylor | Jackson, Mississippi |
+--------+--------+----------------------+
We can also create tables that relate records within the same table. Suppose we’re interested in finding artists born in the same month. We could create a table with four columns that lists the last name and birthdate of one artist and the last name and birthdate of a second artist if both artists were born in the same month. Previous queries concerned only one table, specified by the FROM keyword, and so we could refer to attributes by name without ambiguity. To avoid confusion when referring to separate records from the same table, we use the AS keyword in the next query to introduce local aliases, one and two, so as to distinguish between the two references. In queries involving more than one table or referring to the same table multiple times, we refer to attributes by prefixing the attribute names with their respective table names or designated aliases (for multiple references to the same table).
The first line of this query takes advantage of functions for referring to the year, month and day of an attribute value of type DATE. The last line of the query is a bit of a hack, but imagine what would happen if you left it off: you’d find out a bunch of facts such as that Robert Cray was born in the same month as Robert Cray and ditto for Keith Jarrett and Keith Jarrett. We could eliminate this sort of useless information by requiring the records to be different; however, you’d still be told that Melvin Taylor and Sue Foley were born in the same month and — surprise, surprise — Sue Foley and Melvin Taylor were born in the same month. So this final line ensures that an artist is not listed as being born in the same month as him or herself and that pairs of artists born in the same month are listed only once.
> SELECT one.last, one.birthdate, two.last, two.birthdate
>     FROM artist AS one, artist AS two
>     WHERE MONTH(one.birthdate) = MONTH(two.birthdate)
>     AND one.last > two.last ;
+---------+------------+---------+------------+
| last    | birthdate  | last    | birthdate  |
+---------+------------+---------+------------+
| Taylor  | 1959-03-13 | Frisell | 1951-03-18 |
| Frisell | 1951-03-18 | Foley   | 1968-03-29 |
| Taylor  | 1959-03-13 | Foley   | 1968-03-29 |
+---------+------------+---------+------------+
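Here’s the same self-join sketched with Python’s sqlite3 module, where strftime('%m', ...) stands in for MySQL’s MONTH() function (and the table is abbreviated to four artists):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artist (last TEXT, birthdate TEXT)")
conn.executemany("INSERT INTO artist VALUES (?, ?)",
                 [("Frisell", "1951-03-18"), ("Taylor", "1959-03-13"),
                  ("Foley", "1968-03-29"), ("Raitt", "1949-11-08")])

# Pair artists born in the same month; the one.last > two.last condition
# drops self-pairs and lists each remaining pair only once.
rows = conn.execute("""
    SELECT one.last, two.last
    FROM artist AS one, artist AS two
    WHERE strftime('%m', one.birthdate) = strftime('%m', two.birthdate)
      AND one.last > two.last""").fetchall()
assert sorted(rows) == [("Frisell", "Foley"),
                        ("Taylor", "Foley"),
                        ("Taylor", "Frisell")]
```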
This query is a special case of a join. When a database program (or engine) processes a join of two tables, it creates a new temporary table with one record for each possible combination of a record from the first table and a record from the second table. The new records are the concatenation of the attributes of the two records from which they were created. If there are n records in the first table and m in the second, then there are n*m combined records in the temporary table. Since there are six records in the artist table and the artist record has five attributes, the next query would display a table with 10 columns and 36 records:
> SELECT * FROM artist AS one, artist AS two ;
Here the database engine goes through each record in the new temporary table, using the specification in the WHERE portion of the query to determine what gets included in the final table resulting from the SELECT. To join a table containing thousands of employee records with itself in order to find groups of people living in the same area who might want to carpool, you’d have to create a temporary table consisting of millions of records. In practice, database engines are clever enough that they seldom have to construct the full temporary table.
Such cleverness is even more important with queries joining many tables. In general, the size of the temporary table resulting from a join of multiple tables is the product of the sizes of the tables; thus a join of three different tables consisting of a thousand records each would yield a temporary table with a billion records. A lot of effort is applied to designing algorithms that optimize the performance of database engines.
Most databases contain many different tables. In running a business, you might need a table for your employees, another for customers, another for inventory items, and yet another for keeping track of every business transaction. Queries involving joins of multiple tables would be required in generating the payroll, billing customers, and figuring out which products to stock and which to discontinue.
Let’s create another table for songs with records that have identifier, title, artist, album and year attribute values:
> CREATE TABLE song (
>     id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
>     title VARCHAR(100),
>     artist SMALLINT UNSIGNED NOT NULL REFERENCES artist,
>     album VARCHAR(100),
>     year VARCHAR(4),
>     PRIMARY KEY (id) ) ;
OK
And let’s create a few song records just to experiment with:
> INSERT INTO song VALUES
>     (NULL, 'Let Me Drive', '6', 'Love Comin\' Down', '2000'),
>     (NULL, 'Two Trains', '6', 'Love Comin\' Down', '2000'),
>     (NULL, 'Gone, Just Like A Train', '1', 'Gone, Just Like A Train', '1998'),
>     (NULL, 'Lowdown Dirty Shame', '3', 'Blues On The Run', '1982'),
>     (NULL, 'Monkey Business', '2', 'Silver Lining', '2002') ;
OK
Now we find all artists who’ve produced a song with the word “Train” in the title:
> SELECT artist.last, song.title FROM artist, song
>     WHERE song.title LIKE "%Train%"
>     AND artist.id = song.artist ;
+---------+-------------------------+
| last    | title                   |
+---------+-------------------------+
| Foley   | Two Trains              |
| Frisell | Gone, Just Like A Train |
+---------+-------------------------+
And this last query finds all artists born before 1950 who produced an album after 2000:
> SELECT artist.last, song.album
>     FROM artist, song
>     WHERE artist.id = song.artist
>     AND artist.birthdate < '1950-01-01'
>     AND song.year > '2000' ;
+-------+---------------+
| last  | album         |
+-------+---------------+
| Raitt | Silver Lining |
+-------+---------------+
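This two-table join can likewise be sketched with Python’s sqlite3 module (the data again abbreviated, here to two artists and two songs):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artist (id INTEGER PRIMARY KEY, "
             "last TEXT, birthdate TEXT)")
conn.execute("CREATE TABLE song (title TEXT, artist INTEGER, "
             "album TEXT, year TEXT)")
conn.executemany("INSERT INTO artist VALUES (?, ?, ?)",
                 [(1, "Frisell", "1951-03-18"), (2, "Raitt", "1949-11-08")])
conn.executemany("INSERT INTO song VALUES (?, ?, ?, ?)",
                 [("Gone, Just Like A Train", 1,
                   "Gone, Just Like A Train", "1998"),
                  ("Monkey Business", 2, "Silver Lining", "2002")])

# Join the two tables: artists born before 1950 with albums after 2000.
rows = conn.execute("""
    SELECT artist.last, song.album FROM artist, song
    WHERE artist.id = song.artist
      AND artist.birthdate < '1950-01-01'
      AND song.year > '2000'""").fetchall()
assert rows == [("Raitt", "Silver Lining")]
```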
Databases are great for keeping track of lots of highly structured data in a format carefully designed with particular applications in mind (check out Ullman (1988) or Silberschatz et al. (2001) for a more comprehensive introduction to the theory and practice of database systems). In some cases, large databases are even mined to discover interesting trends and unsuspected relationships. In mining the sales data for a chain of convenience stores, it was discovered that customers who purchase disposable diapers also often purchase beer. In an effort to encourage this trend, the store owners were advised to put the disposable diapers and the beer close to each other so as to remind forgetful customers. The programs that mine the data rely on methods from statistics and artificial intelligence, but they depend on powerful database engines to answer queries efficiently. Indeed, most SQL queries are generated not by humans typing or using fancy graphical interfaces but rather by other programs dynamically generating queries and then using the results to perform additional computations that give rise to still more queries.
SQL and its variants are basically programming languages. Any given commercial database product has extensions that let you do almost anything you could do in any other programming language. Many programmers, however, prefer to handle calculations that don’t involve inserting or extracting data in some other programming language such as C++; in this case, the C++ code generates SQL queries involving INSERTs and SELECTs, sends them off as “foreign function calls” to the database program, and receives the results back so it can perform additional calculations.
A program responsible for handling online orders might generate a SQL query to find out all the items a customer has put in his or her electronic “shopping cart,” generate another SQL query to determine where the customer lives, use C++ code to tally up the cost of the items and add state tax as appropriate, call another program to manage the necessary credit-card transaction, and issue a request to shipping to pack and ship the items. Such complex programs are made much simpler by the fact that the data is highly structured, so the programmer knows exactly which records contain information about purchases, billing, shipping and the like.
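Here’s a hedged sketch in Python’s sqlite3 module of one such program-generated query; the cart table and the customer and item names are invented for illustration. The ? placeholder lets the program supply run-time values without pasting them into the query text:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cart (customer TEXT, item TEXT, price REAL)")
conn.executemany("INSERT INTO cart VALUES (?, ?, ?)",
                 [("alice", "camera", 199.0), ("alice", "tripod", 49.0),
                  ("bob", "memory card", 25.0)])

def cart_total(conn, customer):
    # The program builds the query; the ? placeholder safely carries the
    # customer name supplied at run time, so the query text never changes.
    row = conn.execute("SELECT SUM(price) FROM cart WHERE customer = ?",
                       (customer,)).fetchone()
    return row[0]

assert cart_total(conn, "alice") == 248.0
```

The surrounding program would then take this total, add tax, and hand off to the credit-card and shipping systems, issuing further queries along the way.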
Not all data is so conveniently structured to suit our purposes. Most of us aren’t going to take the time to organize our files (email messages, recipes, financial records) so they can be put in a database and conveniently accessed by SQL queries. One advantage in searching through your own stuff is that you generally know what you call things and may even have some idea of where you put them — music in one directory, favorite artists in another and so on. You also have some idea of what your personal data does and, more importantly, doesn’t contain. For example, you’d probably know if you had a recent consumer review on digital cameras among your files. Neither of these advantages is likely to hold if you’re searching in other people’s stuff.
With the advent of the World Wide Web, ordinary people, not just computer professionals, are storing large amounts of unstructured data on computers so that anyone, anywhere in the world can get at that information with a click of a mouse. We’ll return to the topic of searching through this information in Chapter 14, after learning more in Chapter 11 about the basic machinery that makes the World Wide Web possible.