Notes on [Lam & Bacchus, 1994]


Lam and Bacchus address the issue of bias versus variance and propose a solution to the overfitting problem in the context of learning Bayesian networks. Recall that it is often possible to explain data with less error (less bias) by using a model with more parameters (more variance). For example, it is possible to model a set of pairs of the form (x,f(x)) to any desired accuracy using a polynomial of high enough degree, where in this case the parameters correspond to the coefficients of the polynomial.
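
As a concrete (and entirely invented) illustration, the following Python sketch fits ten noisy samples of a function with polynomials of degree 2 and degree 9; the degree-9 polynomial has as many coefficients as there are data points and drives the fitting error to essentially zero.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 10)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)   # noisy samples of f(x)

    for degree in (2, 9):                     # degree 9 = number of points - 1
        coeffs = np.polyfit(x, y, degree)     # the coefficients are the "parameters"
        error = np.sum((np.polyval(coeffs, x) - y) ** 2)
        print(degree, error)                  # the degree-9 fit has error near zero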

Empirical and theoretical results in machine learning suggest that less complex models often generalize better to unseen cases when the data is sparse. Allowing arbitrarily complex models may result in overfitting the model to the data. In the case of noisy data, complex models tend to fit the noise. (This discussion glosses over some rather deep technical issues regarding generalization, some of which are addressed in [Wolpert, 1995].)

Minimum Description Length

William of Occam (alternatively spelled ``Ockham'') (c. 1285 - 1349) counseled scientists to choose the simplest theory that explains the data to avoid overfitting. Occam's heuristic is often referred to as ``Occam's razor.'' Rissanen's Minimum Description Length (MDL) criterion [Rissanen, 1978] provides one possible implementation of Occam's razor.

The MDL principle counsels that the best model of a collection of data items is the model that minimizes the sum of (a) the length of the encoding of the model, and (b) the length of the encoding of the data given the model, both of which can be measured in bits. (It is often useful to think of a model as a means of compressing data.) Lam and Bacchus suggest using the MDL principle to bias the choice of model (Bayesian networks in their case) toward simpler ones.

Now, instead of agonizing over the choice of a prior, we simply have to choose an appropriate coding scheme. At first blush this seems like a win. Note, however, that in some respects a completely connected network is less complicated than a sparsely connected one, and a completely connected network can model any joint distribution. In general, a completely connected network requires an exponential number of probabilities, though some completely connected networks have regularities that admit compact encodings. Nonetheless, the Lam and Bacchus approach is appealing, especially when we have no strong a priori knowledge about the network.

Estimating Minimum Description Length

To encode a Bayesian network with n nodes it is sufficient to encode a list of the parents of each node and a set of conditional probabilities for each node. For a node with k parents, we need k log n bits to encode the list of its parents. The encoding for the conditional probabilities will depend on the number of parents and the number of possible values the variables might take on.

To encode the data given the model, Lam and Bacchus employ coding and compression techniques, specifically character codes as described in [Cormen, Leiserson & Rivest, 1990]. Using character codes, frequently occurring items receive shorter codes. Huffman's algorithm provides a method for generating optimal, i.e., minimal expected length, character codes given the frequencies of occurrence (probabilities) of the items appearing in the data. Ideally we would use the probabilities in a given Bayesian network to generate a Huffman coding and then use the length of the data encoded with this Huffman coding to measure the encoding length of the data given the model.
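
To make the coding side concrete, here is a minimal Python sketch of Huffman's algorithm (our own illustration, not code from the paper); it computes only the codeword lengths, and the example distribution is invented, but it shows that the resulting lengths track - log p_i.

    import heapq
    import math

    def huffman_code_lengths(probs):
        """Codeword length for each symbol, built by Huffman's merging procedure."""
        # Heap entries: (probability of subtree, tie-breaker, symbols in subtree).
        heap = [(p, i, [i]) for i, p in enumerate(probs)]
        heapq.heapify(heap)
        lengths = [0] * len(probs)
        counter = len(probs)
        while len(heap) > 1:
            p1, _, s1 = heapq.heappop(heap)     # merge the two least likely subtrees
            p2, _, s2 = heapq.heappop(heap)
            for s in s1 + s2:                   # each symbol below the merge gets one bit deeper
                lengths[s] += 1
            heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
            counter += 1
        return lengths

    probs = [0.5, 0.25, 0.125, 0.125]
    for p, length in zip(probs, huffman_code_lengths(probs)):
        print(p, length, -math.log2(p))         # Huffman lengths track - log2(p)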

In general, however, it is computationally difficult to compute the encoding described above and hence Lam and Bacchus propose an alternative scoring metric. They show that the encoding length of the data as described above is a monotonically increasing function of the (Kullback-Leibler) cross entropy between the distribution defined by the model and the true distribution, and they use the cross entropy as a metric to search for a network of minimum description length.

As Lam and Bacchus point out, ``One way of viewing the MDL principle is as a Bayesian approach in which the prior distribution over models is inversely proportional to the encoding length.'' We make this connection more explicit in the section on MDL and MAP below. Lam and Bacchus argue that MDL allows us to make a reasonable tradeoff between accuracy and usefulness, noting that highly connected networks are more difficult to use from a computational standpoint.

Some Details Regarding MDL-based Scoring Metrics

The minimum description length (MDL) principle indicates that the best model of a collection of data items is the model that minimizes the sum of (a) the length of the encoding of the model, and (b) the length of the encoding of the data given the model, both of which are measured in bits.

To encode the model (which we are assuming is a graphical model corresponding to a Bayesian network), we need to encode the topology (structure) of the graph (a list of parents for each node) and a set of conditional probabilities associated with each node (these specify the conditional probability tables for discrete variables). In the remainder of this document, all logs are assumed to be base 2.

Let k_i be the number of parents of node i. Assuming n nodes, it will take k_i log n bits to encode the parents of node i. The size of the conditional probability table for a given node will depend on the number of values that the node and its parents can take on. Let s_i be the number of values that node i can take on. Describing the entire network then requires

   sum_{i=1}^n [ k_i log n + d (s_i - 1) prod_{j in F_i} s_j ]  (1)
bits where d is the number of bits required to encode a probability, F_i is the set of parents of node i, and we take advantage of the fact that a distribution over s values requires only s - 1 numbers.
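
A minimal Python sketch of Equation 1 follows; the function name, the three-node example, and the choice of d = 10 bits per probability are illustrative assumptions, not values from the paper.

    import math

    def network_description_length(parents, card, d=10):
        """Equation 1: bits to describe the structure and the conditional probabilities.
        parents[i] -- list of the parent indices of node i (the set F_i)
        card[i]    -- number of values node i can take on (s_i)
        d          -- bits used to store a single probability (assumed here)
        """
        n = len(parents)
        total = 0.0
        for i in range(n):
            structure_bits = len(parents[i]) * math.log2(n)   # k_i log n for the parent list
            table_entries = (card[i] - 1) * math.prod(card[j] for j in parents[i])
            total += structure_bits + d * table_entries       # d bits per table entry
        return total

    # A three-node example, X0 -> X2 <- X1, with all variables binary.
    print(network_description_length(parents=[[], [], [0, 1]], card=[2, 2, 2]))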

To encode the data given the model, we use some standard techniques from coding and compression theory.

Suppose that P = {p_1,...,p_m} is a distribution over m possible primitive events, {e_1,...,e_m}, where p_i is the probability of e_i. Suppose further that our data consists of a sequence of such primitive events whose occurrence is determined by this distribution.

Huffman's algorithm (see Chapter 17 of [Cormen, Leiserson & Rivest, 1990] for a description of the algorithm and a proof that it produces an optimal coding), when given P, will assign to e_i a codeword of length approximately - log p_i. When we have data consisting of a sequence of N events, we would expect approximately N p_i occurrences of e_i, and so the length of the string encoding the sequence would be approximately

                - N sum_{i=1}^m p_i log p_i
This is in fact the optimal encoding. Now suppose that we guess P using Q = {q_1,...,q_m} and construct a codebook (assign codewords to events according to Huffman's algorithm). In this case, the encoding will require
                - N sum_{i=1}^m p_i log q_i                     (2)
bits. A theorem due to Gibbs ensures that
        sum_{i=1}^m p_i log p_i >= sum_{i=1}^m p_i log q_i
with equality holding only if P = Q.
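
A small numerical check of Equation 2 and Gibbs' theorem, with invented distributions P and Q:

    import math

    def expected_code_length(p, q, n_events):
        """Approximate bits to encode n_events samples drawn from p when the
        codebook is built from the (possibly wrong) distribution q."""
        return -n_events * sum(pi * math.log2(qi) for pi, qi in zip(p, q))

    p = [0.5, 0.3, 0.2]      # "true" distribution (invented)
    q = [1/3, 1/3, 1/3]      # guessed distribution
    N = 1000

    optimal = expected_code_length(p, p, N)   # codebook built from p itself
    actual = expected_code_length(p, q, N)    # Equation 2, codebook built from q
    print(optimal, actual)                    # Gibbs: actual >= optimal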

We could use Equations 1 and 2 together to calculate the description length for a model and a data set by estimating P from the data set (e_i should appear approximately N p_i times for large N). The main practical problem is that the number of primitive events may be quite large! Since a primitive event corresponds to a complete instantiation of a Bayesian network, there are 2^n such events in the case of n boolean variables. Lam and Bacchus [1994], building on the work of Chow and Liu [1968], use a couple of transformations to overcome this problem.

First, they introduce the Kullback-Leibler cross-entropy function,

        C(P,Q) = sum_{i=1}^m p_i [ log p_i - log q_i ]

                                      p_i 
               = sum_{i=1}^m p_i log -----                      (3)
                                      q_i
where P and Q are distributions defined over the same set of events. It follows from Gibbs' theorem that this quantity is always non-negative and zero only if P = Q. Moreover, the encoding length in Equation 2 can be rewritten as N [ H(P) + C(P,Q) ], where H(P) = - sum_{i=1}^m p_i log p_i is fixed by the data, so the encoding length of the data is a monotonically increasing function of the cross entropy between the distribution defined by the model and the true distribution.

It is not immediately clear how cross entropy helps us avoid summing over a very large set of events, but the next step introduces yet another measure, one that is monotonically related to the cross entropy and easy to compute. Instead of summing over all complete instantiations of the variables in the network, we consider a decomposition of the events in terms of the structure of the network, i.e., events corresponding to assignments to subsets of variables governed by marginal distributions of the form Pr(X_i,Parents(X_i)). Lam and Bacchus show that C(P,Q) is a monotonically decreasing function of

        sum_{i=1 : Parents(X_i) != {}} W(X_i,Parents(X_i))      (4)
where
        W(X_i,Parents(X_i)) = sum_{X_i,Parents(X_i)}
                                            Pr(X_i,Parents(X_i)) 
                Pr(X_i,Parents(X_i)) log --------------------------
                                          Pr(X_i) Pr(Parents(X_i)) 
where the probabilities, e.g., Pr(Parents(X_i)), are estimated from the data and the sum is over all assignments to X_i and Parents(X_i). Lam and Bacchus use Equation 4 as a substitute for Equation 2 and the sum of Equations 1 and 4 as a score to assign to a given network in searching for a network to account for the data.
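
The sketch below (our own illustration, not the authors' code) shows how a single W(X_i,Parents(X_i)) term of Equation 4 might be estimated from a data set of complete instantiations; the toy data set is invented.

    import math
    from collections import Counter

    def W(data, i, parents):
        """Estimate W(X_i, Parents(X_i)) from samples, each a tuple of variable values."""
        n = len(data)
        joint = Counter((row[i], tuple(row[j] for j in parents)) for row in data)
        marg_i = Counter(row[i] for row in data)
        marg_pa = Counter(tuple(row[j] for j in parents) for row in data)
        total = 0.0
        for (xi, pa), count in joint.items():      # unobserved assignments contribute 0
            p_joint = count / n
            total += p_joint * math.log2(p_joint / ((marg_i[xi] / n) * (marg_pa[pa] / n)))
        return total

    # Toy data over three binary variables (X0, X1, X2), invented for illustration.
    data = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0), (1, 1, 1), (0, 0, 0)]
    print(W(data, 2, [0, 1]))      # the term contributed by node X2 with parents {X0, X1}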

Lam and Bacchus use a variant of best-first search to find a network with a high score. At each point, they choose an already evaluated network and an arc which, when added to the network, does not introduce a cycle. In order to choose a network/arc pair to work on next, Lam and Bacchus use an easily computed measure of the value of an arc. An arc is designated as a pair of nodes (X_i,X_j), and the mutual information between X_i and X_j is defined as

                                                     Pr(X_i,X_j)
        W(X_i,X_j) = sum_{X_i,X_j} Pr(X_i,X_j) log -----------------
                                                    Pr(X_i) Pr(X_j)
where again probabilities such as Pr(X_i,X_j) are estimated from the data. At each point, Lam and Bacchus select the network/arc pair with the highest sum of network score and arc mutual information.
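
The following simplified Python sketch greedily adds, at each step, the acyclic arc with the highest estimated mutual information. It is only a rough approximation of the search described above, not the authors' best-first procedure, and the data set and arc budget are invented.

    import math
    from collections import Counter

    def mutual_information(data, i, j):
        """Estimate W(X_i, X_j) from samples, each a tuple of variable values."""
        n = len(data)
        joint = Counter((row[i], row[j]) for row in data)
        mi = Counter(row[i] for row in data)
        mj = Counter(row[j] for row in data)
        return sum((c / n) * math.log2((c / n) / ((mi[a] / n) * (mj[b] / n)))
                   for (a, b), c in joint.items())

    def creates_cycle(parents, i, j):
        """Would adding the arc X_i -> X_j create a directed cycle?"""
        stack, seen = [i], set()
        while stack:                   # is X_j already an ancestor of X_i?
            node = stack.pop()
            if node == j:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(parents[node])
        return False

    def greedy_arc_addition(data, n_vars, n_arcs):
        """Greedily add the legal arc with the highest mutual information.
        Since mutual information is symmetric, arc direction is arbitrary on ties."""
        parents = {v: [] for v in range(n_vars)}
        for _ in range(n_arcs):
            candidates = [(mutual_information(data, i, j), i, j)
                          for i in range(n_vars) for j in range(n_vars)
                          if i != j and i not in parents[j]
                          and not creates_cycle(parents, i, j)]
            if not candidates:
                break
            _, i, j = max(candidates)
            parents[j].append(i)       # record X_i as a parent of X_j
        return parents

    data = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0), (1, 1, 1), (0, 0, 0)]
    print(greedy_arc_addition(data, n_vars=3, n_arcs=2))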

Strategies such as Lam and Bacchus's, which perform a greedy lookahead search considering only the addition of one new arc at a time, can suffer from their myopic lookahead. [Xiang et al., 1996] discuss the consequences of such myopic search strategies; a postscript version of the paper is available at

	http://www.sis.pitt.edu/~dsl/UAI96/Xiang.UAI96.html

MDL and MAP

Cheeseman [1995] points out that Bayesian learning methods are more general than MDL methods. In the standard Bayesian approach, we search for the hypothesis h maximizing Pr(h|d), i.e., the maximum a posteriori (MAP) hypothesis. Using Bayes rule we have
                         Pr(d|h) Pr(h)
        Pr(h|d) = ----------------------------                  (5)
                   Sum_{h in H} Pr(d|h) Pr(h)

Taking negative logarithms of both sides of (5) we have

        - log Pr(h|d) = - log Pr(h) - log Pr(d|h) + C           (6)
where we use - log instead of log to ensure positive quantities, and C is a constant (the log of the denominator in (5)) that does not depend on h and so can be ignored in comparing hypotheses.

Information theory tells us that - log Pr(event) is the minimum description length required to encode the specified event. If the base of the log is 2 then the description is encoded in bits.

By using (5) to find the hypothesis that best explains the data, it is clear from (6) that we are searching for the hypothesis that minimizes the sum of the description length of the hypothesis and the description length of the data given the hypothesis. Notice that no additional assumptions are required to implement the bias suggested by Occam's razor. Of course, we still have to specify the prior and likelihood functions, which, presumably, Lam and Bacchus would argue requires that we again consider issues of coding and description length.
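
A toy numerical check of this correspondence: with invented priors and likelihoods over two hypotheses, the MAP hypothesis and the hypothesis of minimum total description length coincide.

    import math

    # Two hypothetical hypotheses with made-up priors and likelihoods for one data set.
    priors      = {"simple": 0.7, "complex": 0.3}     # Pr(h)
    likelihoods = {"simple": 0.02, "complex": 0.04}   # Pr(d | h)

    map_h = max(priors, key=lambda h: priors[h] * likelihoods[h])
    mdl_h = min(priors, key=lambda h: -math.log2(priors[h]) - math.log2(likelihoods[h]))
    print(map_h, mdl_h)     # the same hypothesis wins under both criteria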

