Keeping Track of Your Stuff

Example Exercises

Searching Email Files

Suppose you keep a file of all the email messages that you send. Think about using grep to find email messages that mention a set of keywords. For example, suppose you want to find all messages that include the words gideon and reinforcement. You might also want to find all messages that contain gideon and either reinforcement or learning. Assume that each email message begins with From and ends at the beginning of the next message or at the end of the file.

Of course you could write a fancy program in C or Perl, but there are a number of quick hacks using grep or egrep that will solve the problem ``with a little assembly.'' The egrep command is one of several ``improvements'' on vanilla grep (use man egrep or info egrep to find out more). The grep command is a good example of a Unix command that has evolved from humble beginnings to become an amazingly useful Swiss-army-knife sort of utility. Another such command is the find command. By combining grep and find you can easily handle the most demanding file-searching problems.
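To make the grep-plus-find combination concrete, here is a minimal sketch (the directory layout and file names below are made up for the example): find walks a directory tree and can run grep on every file it visits, which handles recursive searches that a plain grep invocation cannot.

```shell
# Set up a tiny made-up directory tree for the demonstration
mkdir -p mailbox_demo/old
printf 'From tld\nnothing of interest here\n' > mailbox_demo/note.txt
printf 'From tld\ngideon mentions reinforcement\n' > mailbox_demo/old/saved.txt

# Ask find to run grep -l (list matching file names only) on every
# regular file below mailbox_demo
find mailbox_demo -type f -exec grep -l "reinforcement" {} \;
# prints: mailbox_demo/old/saved.txt
```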

Here are some examples of useful things you can do with egrep to search files for blocks of text containing complicated patterns and grouped keywords. The following command prints the line number of each line matching ^From, i.e., the string From anchored by ^ to the beginning of a line (the examples below use a file email.txt of sample email messages):

% egrep -n "^From" email.txt
1:From tld Wed Jun  4 03:02:34 -0400 2003
12:From tld Tue Jun 17 08:51:32 -0400 2003
24:From tld Tue Jun 17 11:00:32 -0400 2003

The next command prints out the 3 lines before and the 3 lines after any line matching pictures:

% egrep -3 "pictures" email.txt

Gideon,

Thanks for the pictures.  I look forward to seeing you
and the rest family next week. 

Tom

Here's a case-insensitive (-i) search displaying the line before (-B 1) and the 2 lines after (-A 2) each line matching Gideon:

% egrep -i -B 1 -A 2 "Gideon" email.txt
From tld Wed Jun  4 03:02:34 -0400 2003
To: gideon@startup.com
Subject: birthday

Gideon,

Thanks for the pictures.  I look forward to seeing you

The next command prints out the line numbers for each line matching ^From or Gideon:

% egrep -i -n "^From|Gideon" email.txt
1:From tld Wed Jun  4 03:02:34 -0400 2003
2:To: gideon@startup.com
5:Gideon,
12:From tld Tue Jun 17 08:51:32 -0400 2003
24:From tld Tue Jun 17 11:00:32 -0400 2003

The following exercises explore various methods of searching for messages containing multiple keywords.

  1. Explain what the following command accomplishes.

    % egrep -n -i -B 2 -A 10 "Jeff" email.txt | egrep -i "windsurf"
    

  2. If each email message were on a single line, separate from all the other messages, then searching for messages containing multiple keywords would be relatively easy to accomplish using grep. One solution, therefore, is to design a filter that removes all newlines except those corresponding to the beginning of an email message, thereby isolating the text of each message on a single line. Here is an example script showing one way of doing this:12

    % cat email.txt \
             | sed 's/@/A_T_S_I_G_N/g' \
             | sed 's/^From/@From/' \
             | tr "\n" " " \
             | tr "@" "\n" \
             | sed 's/A_T_S_I_G_N/@/g'
    

    Think about why it was necessary to go through such contortions to accomplish this transformation. Note that the sed command is line based and the tr command is character based. Write a description of the above program explaining what each step does.

  3. There's always more than one way to do just about anything. Design an alternative solution using your favorite shell commands. You can number messages using awk, make substitutions using sed, search for patterns using grep or use any number of other commands.
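To see where exercise 2 is headed: once each message occupies a single line, a conjunction of keywords is just a chain of greps, one per keyword. A sketch on a two-message stand-in for email.txt (the file name and message text are invented for the example):

```shell
# Build a tiny stand-in for email.txt (two messages)
printf 'From tld\ngideon likes learning\n\nFrom tld\ngideon on reinforcement\n' > demo_email.txt

# Flatten to one message per line (same trick as the sed/tr pipeline
# above), then keep only the lines (messages) containing BOTH keywords
cat demo_email.txt \
    | sed 's/@/A_T_S_I_G_N/g' \
    | sed 's/^From/@From/' \
    | tr "\n" " " \
    | tr "@" "\n" \
    | sed 's/A_T_S_I_G_N/@/g' \
    | grep "gideon" | grep "reinforcement"
# only the second message survives both greps
```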

Building Database Systems

The conceptual components of databases -- tables, records, fields -- are easy to implement and the basic operations -- selects, joins, inserts -- provided by database systems are easy to replicate using simple shell scripts. In this exercise, we'll do exactly that. For many applications, you don't need an industrial-strength database system, but it is still useful to think of your data and the operations performed on your data as conforming to the structure and operations supported by modern databases. We'll start by showing you how to organize data in files so as to reproduce the structure of the records and tables used in the music database example in Chapter 3.

In our simple database system, tables are implemented as files and records as individual lines of text, with fields separated using the semi-colon (``;'') as a delimiter. We could have used spaces, tabs or any other character as a delimiter, but our data contains spaces that we wish to preserve and so we chose the semi-colon since it doesn't appear in our data. You'll often hear of database and spreadsheet data exported to or imported from comma-, space- or tab-delimited file formats.

Here are the two tables from Chapter 3 -- recall that cat prints the contents of a file or files to the standard output. The name derives from ``concatenate'' since this command is often used to concatenate the contents of several files into a single file as in the invocation: cat file_1 file_2 ... file_n. Here's the file containing information about recording artists:

% cat artist
1;Bill;Frisell;1951-03-18;Baltimore, Maryland
2;Bonnie;Raitt;1949-11-08;Burbank, California
3;Melvin;Taylor;1959-03-13;Jackson, Mississippi
4;Robert;Cray;1953-08-01;Columbus, Georgia
5;Keith;Jarrett;1945-05-08;Allentown, Pennsylvania
6;Sue;Foley;1968-03-29;Ottawa, Canada

And here's the file containing information about songs; recall that the third field refers to the artist who performed the song:

% cat song
1;Let Me Drive;6;Love Comin' Down;2000
2;Two Trains;6;Love Comin' Down;2000
3;Gone, Just Like A Train;1;Gone, Just Like A Train;1998
4;Lowdown Dirty Shame;3;Blues On The Run;1982
5;Monkey Business;2;Silver Lining;2002

We could use more complicated delimiters involving multiple characters, but a single character is simple and, if we absolutely have to include the delimiter character in our data, then there are workarounds such as using escape characters or special sequences.
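As a sketch of one such workaround (not a convention used elsewhere in this chapter), a sed pass can escape literal semi-colons before the data is stored:

```shell
# Escape any literal ";" in the data as "\;" so it can no longer be
# mistaken for a delimiter (readers of the data must undo the escaping)
printf 'note with a ; in it\n' | sed 's/;/\\;/g'
# prints: note with a \; in it
```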

That pretty much covers the data structures required for our database. Now we need to provide utilities -- shell scripts in this case -- to add, delete and modify records and to answer queries and display the results of queries. We're going to use several common shell commands in concert to provide these services.

Several of the most useful shell commands operate on the lines of files divided into fields by separators. So-called ``white space'' is the most common default delimiter, but in many cases you can specify an alternative delimiter. Here we use the sort command to list the records in the artist table alphabetically by last name:

% sort -t ";" -k 3 artist 
4;Robert;Cray;1953-08-01;Columbus, Georgia
6;Sue;Foley;1968-03-29;Ottawa, Canada
1;Bill;Frisell;1951-03-18;Baltimore, Maryland
5;Keith;Jarrett;1945-05-08;Allentown, Pennsylvania
2;Bonnie;Raitt;1949-11-08;Burbank, California
3;Melvin;Taylor;1959-03-13;Jackson, Mississippi

In SQL the query would look like:

SELECT * FROM artist ORDER BY last ;

Using sort we listed the entire record. The SQL SELECT operator allows us to specify a subset of the fields in a record. We need something that will pick out just the fields we want to display. Recall that awk is a complex, Swiss-army-knife-like utility; it even has its own specialized scripting language. We'll start by using awk to select a subset of the fields in the artist table and then I'll tell you about another command that does exactly what we need from awk. Notice that awk has a slightly different syntax for specifying the delimiter:

% sort -t ";" -k 4 artist | awk -F ";" '{ print $3 ";" $4 }' 
Jarrett;1945-05-08
Raitt;1949-11-08
Frisell;1951-03-18
Cray;1953-08-01
Taylor;1959-03-13
Foley;1968-03-29

The above script implements the following SQL query:

SELECT last, birthday FROM artist ORDER BY birthday ; 

This particular application of awk is so common and so useful that you would think someone would write a utility designed to perform exactly this sort of operation; a Swiss army knife can get the job done, but you'd rather have a tool made just for the job. Someone did write such a utility and they called it cut for reasons that will become clearer as you learn to use it. You should use info to learn about cut, but here's a little preview:

% sort -t ";" -k 4 artist | cut -f 3,4 -d ";"
Jarrett;1945-05-08
Raitt;1949-11-08
Frisell;1951-03-18
Cray;1953-08-01
Taylor;1959-03-13
Foley;1968-03-29

Implement the following SQL query as a shell script and make sure that it won't list an artist by the name of ``Mississippi John Hurt'' if indeed the artist was born in Alabama:13

SELECT last, birthplace FROM artist WHERE birthplace LIKE "%Mississippi%" ;

Finally we need some way of implementing joins as illustrated in the following query. Joins are a bit complicated; fortunately, joins were recognized as being important enough to warrant a command specifically designed to make it easy to join together files representing tables. The command is called join appropriately enough. Here's an example of the kind of ``join'' we want to be able to perform:

SELECT artist.last, song.album FROM artist, song WHERE artist.id = song.artist ;

I suggest you read the documentation for join (run ``info join''), but I'll give you some very basic tips on how to use this command. The following invocation: join -t ";" -1 1 -2 3 file_1 file_2 assumes that file_1 and file_2 are text files representing tables with fields separated by the semi-colon (``;'') character. As in the case of sort and awk, white space is the default for delimiting fields. This invocation will produce one line of output for each pair of lines (one from each of file_1 and file_2) in which the first field of the line from file_1 (hence the ``-1 1'') matches the third field of the line from file_2 (hence the ``-2 3''). There's a catch though: file_1 has to be sorted on its first field and file_2 has to be sorted on its third field.
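To make the -1 and -2 options concrete before tackling the real tables, here is a minimal illustration on two made-up tables (the file names and contents are invented for the example), each already sorted on its join field:

```shell
# colors is joined on its field 1; codes is joined on its field 3
printf '1;red\n2;blue\n' > colors
printf 'a;x;1\nb;y;2\n' > codes

# Pair lines whose first field in colors matches the third field in codes.
# join prints the join field first, then the remaining fields of each file.
join -t ";" -1 1 -2 3 colors codes
# prints:
# 1;red;a;x
# 2;blue;b;y
```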

Given that preamble, here's a script that will pair up artists and their songs and then select out just the artist's last name and the album title of the song. Note that ``-'' can be used instead of a file name for one or the other (but not both) of the files to be joined. The ``-'' is used to indicate that the input for the corresponding argument comes from the standard input which, in the following script, is piped from the sort invocation.

% sort -t ";" -k 3 song | join -t ";" -1 1 -2 3 artist - | cut -f 3,8 -d ";"
Frisell;Gone, Just Like A Train
Raitt;Silver Lining
Taylor;Blues On The Run
Foley;Love Comin' Down
Foley;Love Comin' Down

Alternatively (and a little less elegantly) you could use temporary files as the following variation demonstrates:

% sort -t ";" -k 3 song > one.tmp
% join -t ";" -1 1 -2 3 artist one.tmp > two.tmp
% cut -f 3,8 -d ";" two.tmp > three.tmp

There's always more than one way to do almost anything and the join command has a bit of the Swiss army knife about it too. The join command has additional options that allow you to specify exactly what fields you want to print. Here's a script showing how to use the ``-o'' option to get join to print just the last name of the artist and the title and year of the song. Use info join to learn how to specify the fields in different files.

% sort -t ";" -k 3 song | join -t ";" -1 1 -2 3 -o 1.3,2.2,2.5 artist -
Frisell;Gone, Just Like A Train;1998
Raitt;Monkey Business;2002
Taylor;Lowdown Dirty Shame;1982
Foley;Let Me Drive;2000
Foley;Two Trains;2000

It's up to you whether you do everything with join or use some combination of join and cut. I think it's a little less cluttered piping a join into a cut, but the downside of this approach is that you have to keep careful track of the field offsets whereas if you do everything using the join command, you can specify the fields as offsets into the original input files.

Finally we get to your assignment. You're to build an account management system using sort, awk, grep (or egrep), and join. You can use other standard utilities, e.g., sed, but the objective here is for you to learn to use these basic commands. In your account-management system, there will be an account record for each customer with fields for a unique identifier, customer name, billing address, etc. There should also be a transaction record for each sale with fields for a unique transaction identifier, account to bill, amount of sale, description of goods purchased. Finally, there will be a received record for each payment received with the amount received, account, date of receipt, etc. Each record should have a unique identifier which is commonly referred to as the primary key for the record type. Use your system to generate a report that shows the balance (sales minus payments) for each client. The following little awk program will come in handy.

% cat charge
1 Fred 12
2 Mary 8
3 Fred 9
4 Andy 5
5 Fred 23
% grep "Fred" charge | awk '{ s += $3 } END { print s }'
44

We can also be a little fancier with our output:

% grep "Fred" charge | awk '{ n = $2 } { s += $3 } END { print n " owes $" s }'
Fred owes $44

Don't be hesitant to use temporary files as needed to store intermediate results.
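If you want every customer's total at once rather than grepping for one name at a time, the summing idea above extends naturally with an awk array; here is a sketch using the charge file from the example (the trailing sort is only there to make the output order predictable):

```shell
# Re-create the charge file from the example above
printf '1 Fred 12\n2 Mary 8\n3 Fred 9\n4 Andy 5\n5 Fred 23\n' > charge

# Accumulate per-name totals in an associative array keyed on field 2,
# then print them all at the end
awk '{ s[$2] += $3 } END { for (n in s) print n " owes $" s[n] }' charge | sort
# prints:
# Andy owes $5
# Fred owes $44
# Mary owes $8
```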



12 A \ followed immediately by a newline (with no whitespace characters in between, so be careful) allows you to type a long command to the shell as if on a single line.

13 Here's a hint: Use grep or egrep. These commands do not have options for specifying delimiters, but their pattern-matching capabilities allow you to select fields using regular expressions as patterns. The regular expression [;] matches a single semi-colon character; [^;]* matches zero or more characters as long as they are not semi-colons; and, if re is any regular expression, then (re){n} matches n consecutive occurrences of re.
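For instance (a generic illustration, not the answer to the exercise itself), the pattern below uses {1} to skip exactly one field and so matches only records whose second field is exactly Bonnie:

```shell
# Rebuild two records of the artist table from the chapter
printf '%s\n' \
  '1;Bill;Frisell;1951-03-18;Baltimore, Maryland' \
  '2;Bonnie;Raitt;1949-11-08;Burbank, California' > artist_demo

# ^([^;]*[;]){1} consumes the first field and its delimiter, so the
# pattern requires "Bonnie" to be the entire second field
egrep '^([^;]*[;]){1}Bonnie[;]' artist_demo
# prints: 2;Bonnie;Raitt;1949-11-08;Burbank, California
```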