Analyze This


As usual, right after I finished Friday's entry, and right up until now as I'm about to start the next one, I've been thinking about what to write next. The content of the journal has been planned in the sense that I've thought about what I want to write, but it hasn't been scripted in the sense that I know when to write what, nor has it been designed in the sense that I know which ideas or concepts will follow or precede which others. The journal is growing naturally, to take advantage of serendipity and to fit my current mood and interests. That said, each morning I have to figure out what to write about next. And whatever sequence of ideas and concepts I might have thought appropriate last week will likely have been altered by what I thought about in writing the last entry.

In addition to a few anecdotes about computing and some ice-cold plunges into the arcane syntax of half a dozen programming languages, we've touched on a bunch of pretty important ideas. We talked about the idea of interacting with computers and looked at a bunch of different ways of doing this, including command-line interpreters for communicating with a computer's operating system, scripting languages for more complicated communication, specialized interpreters for programming languages like Scheme and Prolog, and interactions with database and web servers.

We investigated the basic notion of a function, alternatively called a procedure, as a set of rules or recipe for carrying out a computation. A computer program consists of a set of function definitions together with data in some form. We distinguished the program itself from the running of the program, which we identified with the notion of process, thereby capturing the dynamic, evolutionary characteristics of running programs. We enhanced the notion of a static procedure, defined once and for all and oblivious to its past, to allow procedures to have mutable state (memory), to accept procedures as input, to invoke them at their whim, and indeed to generate and return newly minted procedures of their own design. Data and procedures are interchangeable, and for that reason both are subsumed by the general notion of objects, which perform computations and have computations performed on them.

We looked at several informal models of computation, including the client/server model for thinking about services such as those offered by database and web servers, the substitution model for thinking about computation algebraically, and object-oriented programming for thinking about software engineering in the large by dividing a large project into smaller entities, each of which provides a particular set of services, keeping private the irrelevant and distracting details and making public only the interface. But these notions just begin to scratch the surface of computer science. Here are some of the "biggies" that have been left largely untouched and that haunt my dreams when I get overly invested in this project.

First, there is the idea of a "formal" computational model, a mathematical framework for investigating what is possible to compute and how hard such computations are likely to be (how much time and memory they require). With a little work, the substitution model can be made precise and used as a formal model. Formal computational models try to answer questions such as the following: Is one model of computation more expressive than another, in the sense that there are things you can compute in one but not the other? Is there a program, specified in a given formal model, that can determine whether another program, specified in the same or another formal model, will ever halt on a given input? It would be nice if you could tell whether a procedure you wrote will ever stop running and return you an answer. Can you think of a reason why you'd want to adopt a less expressive, more restrictive model of computation, i.e., write in a programming language that wasn't able to specify some kinds of computations? Suppose that with the less expressive model or language you could get some nice properties, like the ability to check whether a program will halt, or a guarantee that every procedure will run in a relatively short amount of time.

As much as I like this stuff, I plan to steer clear of the formal issues, or at least only bring them up when they impinge directly on more practical issues at hand. In keeping with our Bug Safari metaphors from Thursday, this is less because I'm a lepidopterist (in this case, a theoretical or mathematically-inclined computer scientist) and you don't want to get me started on butterflies, but rather because I hang around (and occasionally work) with a lot of lepidopterists, and so I can easily get caught up speaking their argot. Besides, there are some excellent texts by computer scientists with a knack for making the mathematically complex accessible, e.g., [Davis, 2000] and [Harel, 2000], and I prefer to leave the more careful treatment of these issues to those with the requisite gift.

Next there is the area broadly called "the design and analysis of algorithms and data structures", which has theoretical as well as practical consequences. A data structure refers to a particular method for storing, retrieving and, more generally, manipulating data. An algorithm refers to a bare-bones description of a procedure, usually specified in terms of a simplified programming language; usually any associated data structures used by the algorithm are considered part of the recipe. Such a recipe can be implemented by translating from the simplified programming language into a specific programming language. For example, consider the very simple procedure to square a number. You could use a specification in natural language, e.g., "To square an integer, multiply the number by itself", but typically algorithm designers use some agreed-upon standard language such as ALGOL (for ALGO(rithmic) L(anguage)). Here's how to specify the square function in ALGOL:

integer procedure square(n);
    integer n;
    begin
        square := n * n 
    end;

Squint at the notation as usual, but note a few things. The integer statements are "type declarations" and can safely be ignored. := indicates assignment (x := 2 assigns the variable x the value 2), whereas = (:= without the :) indicates an equality test (x = 2 evaluates to true if x is equal to 2 and evaluates to false otherwise). And, internal to the definition of a procedure, assigning a value to the procedure name (square := n * n) is the way to specify the return value for the procedure. Way too briefly, type declarations are annotations made by the programmer to help the programmer and other programmers understand programs, and to help the various programs that interpret, compile and otherwise try to make sense of programs do a better job.

You could then translate this ALGOL specification into Scheme as:

(define (square n) (* n n))

or into the C programming language as

int square(int n) { return n * n; }

The touted advantage of using an agreed-upon standard like ALGOL is that everybody knows ALGOL (not true!) and that it doesn't favor any particular language (well, ok, ignoring dialects of ALGOL itself, in some sense this is true, but it does favor languages with syntax similar to ALGOL's). But this is one of the least interesting observations about the design and analysis of algorithms. You write algorithms in a general specification language to be clear about what the algorithms mean, to make it easy to implement algorithms in a variety of languages, and to enable computer scientists to analyze the characteristics of algorithms independent of the particular language in which they are written.

To get a taste of what it's like to analyze algorithms, let's return to the problem of computing the nth element of the Fibonacci sequence. Here is an algorithm written in ALGOL. It doesn't look too different from the implementation in Scheme that we looked at on Friday.

integer procedure fibonacci(n);
    integer n;
    if n = 1 
       then fibonacci := 0  
       else if n = 2 
               then fibonacci := 1 
               else fibonacci := fibonacci(n - 1) + fibonacci(n - 2);

Now here's an alternative (and, incidentally, more efficient) algorithm also written in ALGOL.

integer procedure fibonacci(n);
    integer n;
    if n = 1 
       then fibonacci := 0
       else if n = 2
               then fibonacci := 1
               else begin 
                         integer penultimate; penultimate := 0;
                         integer ultimate; ultimate := 1;
                         integer temporary;
                         integer i;
                         for i := 3 step 1 until n do 
                             begin
                                 temporary := penultimate + ultimate; 
                                penultimate := ultimate;
                                ultimate := temporary;
                            end ;
                        fibonacci := ultimate
                    end

In analyzing these two algorithms, a computer scientist would say that the first scales exponentially in n while the second scales linearly in n. Practically speaking, if each addition (application of the + operator) takes on the order of one millionth of a second and we ignore all other aspects of the computation, computing fibonacci(100) with the second algorithm would take considerably less than a second (10^2/10^6 = 1/10^4 seconds) and computing fibonacci(100) with the first algorithm would take ... well, let's see, there are how many seconds in a year?

> (* 60 60 24 365)
31536000

or around 32 million, not worrying about leap years and such. And about how many seconds will our calculation take with the first algorithm?

> (define (raise m n) (if (= n 0) 1 (* m (raise m (- n 1)))))
> (* (raise 2 100) (/ 1 (raise 10 6)))
19807040628566084398385987584/15625

Oops, this version of Scheme is trying to be clever by returning the number I requested as a rational number. It does this so it doesn't have to give me an approximation of what I asked for. How would you print 1/3 as a "real" number? 0.33, or perhaps 0.3333333? Both of these answers are approximations to 1/3. If I include a "real" number in the calculation, then Scheme will try to print the result as a real.

> (* (raise 2 100) (/ 1.0 (raise 10 6)))
1.2676506002282293e+24

Well, I guess that's progress! Now Scheme is offering me an approximation, but it displays the approximation in scientific notation with a lot more precision (the digits on the right of the decimal point before the e+24) than I can make use of. (Don't believe anyone who tells you that computer arithmetic is simple; what goes on behind the scenes when you multiply and divide numbers in a calculator or general-purpose computer is fascinating and comprises an important subfield of computer science called scientific computing.) Let's just say 1.27 * 10^24 seconds, which is a pretty big number. Since 31536000 is 3.1536000e+7 in scientific notation, or approximately 3.15 * 10^7, we can estimate that it would take 4 * 10^16 years. Asking Scheme to confirm this, we get the following answer.

> (/ (* (raise 2 100) (/ 1.0 (raise 10 6))) 
     (* 60 60 24 365))
4.019693684133147e+16

That's an incomprehensibly long amount of time, and that's just to compute fibonacci(100). Remember that things get exponentially worse as n gets larger: in our rough model, fibonacci(101) takes twice as long as fibonacci(100).

> (/ (* (raise 2 101) (/ 1.0 (raise 10 6))) 
     (* 60 60 24 365))
8.039387368266294e+16

This sort of analysis is obviously quite useful, and every student of computer science needs some exposure to this part of the discipline, which includes coverage of a range of special topics in mathematics. I won't spend a lot of time performing formal analyses, but from time to time I'll refer to various results from this area of computer science. The text by Cormen, Leiserson, Rivest and Stein provides a comprehensive introduction to algorithms and their analysis [Cormen et al., 2001].

After formal models and analyzing algorithms and data structures (these are both on the theoretical side of computer science - I'm showing the bias that results from my hanging around with lepidopterists), there is the general area of "systems". By some accounts systems includes the field of databases but, by other accounts, "core" systems emphasizes operating systems, programming languages, computer architectures, networks, distributed computing and a bunch of other specialty areas. Here are some areas of systems that I'd like to delve into, albeit briefly.