Notes on [Cooper & Herskovits, 1992]

Suppose that you are given a database D of m cases, where each case corresponds to an instantiation of a set of variables Z. If B_S represents a Bayesian network structure, how might you compute the probability of the network given the data, Pr(B_S|D)? The following paragraphs present a method for computing Pr(B_S|D) due to Cooper and Herskovits. We begin with four assumptions.
(i) The database variables, denoted Z, are discrete.
Given (i) we have
   Pr(B_S,D) = Pr(B_S) Integral_{B_P} Pr(D|B_S,B_P)f(B_P|B_S)dB_P  (1) 
where B_P is a conditional probability assignment over the variables in B_S and f is a conditional probability density function.
(ii) Cases occur independently, given a belief-network model.
Given (ii) we have
   Pr(D|B_S,B_P) = Product_{i=1}^m Pr(C_i|B_S,B_P)         (2) 
where C_i is the ith case and Pr(C_i|B_S,B_P) can be computed directly from the network structure B_S and the conditional probability assignment B_P.
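To make the factor Pr(C_i|B_S,B_P) concrete, here is a minimal Python sketch that evaluates one fully observed case by the chain rule over the network structure. The dictionary-based representations of B_S and B_P are my own illustrative choices, not Cooper and Herskovits'.

```python
# Illustrative sketch: Pr(C_i | B_S, B_P) for one fully observed case,
# computed as the product over variables of Pr(x_i | Parents(x_i)).
def case_probability(case, parents, cpt):
    """case: dict variable -> value (one case C_i)
    parents: dict variable -> tuple of parent variables (the structure B_S)
    cpt: dict (variable, parent values, value) -> probability (the assignment B_P)"""
    p = 1.0
    for x, pa in parents.items():
        pa_vals = tuple(case[u] for u in pa)
        p *= cpt[(x, pa_vals, case[x])]
    return p

# Tiny two-variable network a -> b with made-up probabilities:
parents = {"a": (), "b": ("a",)}
cpt = {
    ("a", (), 0): 0.6, ("a", (), 1): 0.4,
    ("b", (0,), 0): 0.9, ("b", (0,), 1): 0.1,
    ("b", (1,), 0): 0.2, ("b", (1,), 1): 0.8,
}
print(case_probability({"a": 1, "b": 1}, parents, cpt))  # Pr(a=1) * Pr(b=1|a=1)
```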
(iii) There are no cases that have missing variables.

(iv) The density function f(B_P|B_S) is uniform.
Regarding assumptions (i)-(iv): (i) can be circumvented by discretizing continuous variables, with potential consequences for precision and combinatorial complexity; (ii) is reasonable if the cases are sampled independently from the underlying target distribution, but is somewhat problematic for data from time series; (iii) missing data can be a problem, but there are standard methods, such as Gibbs sampling, for dealing with it (these methods have costs in terms of the data and computation time required); and (iv) nonuniform prior probabilities can be handled using Dirichlet distributions.

Let Z={x_1,...,x_n} be the set of variables and D the database of m cases. A variable x_i has r_i possible value assignments {v_{i1},...,v_{i r_i}}. Let Parents(x_i) be the parents of x_i in a particular Bayesian network structure, let w_{ij} be the jth unique instantiation of Parents(x_i) relative to D, and suppose there are q_i such instantiations.

Define N_{ijk} to be the number of cases in D in which variable x_i has the value v_ik and Parents(x_i) is instantiated as w_{ij}. Let

   N_{ij} = Sum_{k=1}^{r_i} N_{ijk}  
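The counts N_{ijk} and N_{ij} are simple tabulations over the database. A sketch, assuming (my convention, for illustration) that D is represented as a list of dictionaries mapping variable names to values:

```python
from collections import Counter

def count_statistics(data, x, parents):
    """Return (n_ijk, n_ij) for variable x with the given parent tuple:
    n_ijk is keyed by (parent instantiation w_ij, value v_ik),
    n_ij by the parent instantiation w_ij alone."""
    n_ijk, n_ij = Counter(), Counter()
    for case in data:
        w = tuple(case[u] for u in parents)   # w_{ij}: instantiation of Parents(x_i)
        n_ijk[(w, case[x])] += 1              # N_{ijk}
        n_ij[w] += 1                          # N_{ij} = Sum_k N_{ijk}
    return n_ijk, n_ij

data = [{"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 1}]
n_ijk, n_ij = count_statistics(data, "b", ("a",))
```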
Given (i-iv), Cooper and Herskovits show that
        Pr(B_S,D) = Pr(B_S)
                Product_{i=1}^{n}
                Product_{j=1}^{q_i} Fraction{(r_i-1)!}{(N_{ij}+r_i-1)!}
                Product_{k=1}^{r_i} N_{ijk}!                    (3)
which can be computed in O(m n) time given constant r = max_i r_i, constant u = max_i|Parents(x_i)|, and the assumption that Pr(B_S) can be computed in O(u n r).
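In practice equation (3) is best computed in log space, since the factorials overflow quickly. A minimal sketch of the per-variable term, again assuming the list-of-dictionaries representation of D; summing this over all variables and adding log Pr(B_S) gives log Pr(B_S,D). Note that math.lgamma(n+1) = log n!.

```python
import math
from collections import Counter

def log_family_score(data, x, parents, r):
    """log of Product_j (r-1)!/(N_ij+r-1)! Product_k N_ijk! for variable x
    with the given parent tuple; r is the number of values x can take."""
    n_ijk, n_ij = Counter(), Counter()
    for case in data:
        w = tuple(case[u] for u in parents)
        n_ijk[(w, case[x])] += 1
        n_ij[w] += 1
    # log (r-1)!/(N_ij+r-1)! summed over parent instantiations j
    s = sum(math.lgamma(r) - math.lgamma(n + r) for n in n_ij.values())
    # plus log N_ijk! summed over j and k
    return s + sum(math.lgamma(n + 1) for n in n_ijk.values())
```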

The probability of a structure given the data is given by

        Pr(B_S|D) = Fraction{Pr(B_S,D)}{Pr(D)} 
                  = Fraction{Pr(B_S,D)}{Sum_{B_S in Q}Pr(B_S,D)}          (4)
where Q is the set of possible belief-network structures. For comparing two structures, Pr(B_S,D) is sufficient, since the denominator is the same for both. Note, however, that enumerating Q will be impractical for most Z. Similarly, finding max_{B_S}Pr(B_S|D) (i.e., inferring the most probable network structure from the data) will be problematic unless you can somehow restrict the search to a small subset of Q.
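For a small hand-enumerated Q, equation (4) can be computed directly by normalizing the joint scores in log space with a log-sum-exp. A sketch assuming a uniform structure prior Pr(B_S) (which then cancels in the ratio) and a toy representation of structures as parent-set dictionaries; these conventions are mine, not from the paper:

```python
import math
from collections import Counter

def log_family_score(data, x, parents, r):
    """log of Product_j (r-1)!/(N_ij+r-1)! Product_k N_ijk! for variable x."""
    n_ijk, n_ij = Counter(), Counter()
    for case in data:
        w = tuple(case[u] for u in parents)
        n_ijk[(w, case[x])] += 1
        n_ij[w] += 1
    s = sum(math.lgamma(r) - math.lgamma(n + r) for n in n_ij.values())
    return s + sum(math.lgamma(n + 1) for n in n_ijk.values())

def log_joint(data, structure, r):
    """log Pr(B_S,D) up to the (assumed uniform) structure prior."""
    return sum(log_family_score(data, x, pa, r[x]) for x, pa in structure.items())

# A small enumerated Q over two binary variables (toy example):
data = [{"a": 0, "b": 0}] * 5 + [{"a": 1, "b": 1}] * 5
Q = [{"a": (), "b": ()}, {"a": (), "b": ("a",)}, {"a": ("b",), "b": ()}]
r = {"a": 2, "b": 2}
logs = [log_joint(data, s, r) for s in Q]
m = max(logs)
z = m + math.log(sum(math.exp(v - m) for v in logs))  # log of the denominator in (4)
posterior = [math.exp(v - z) for v in logs]
```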

Cooper and Herskovits suggest imposing an ordering on the variables in Z. If x_i precedes x_j in the ordering, then we do not allow structures in which there is an arc from x_j to x_i. The resulting subset of Q is considerably smaller than Q but still too large to enumerate explicitly. Instead, Cooper and Herskovits suggest a greedy heuristic algorithm (called K2). For each node, K2 begins by assuming the node has no parents, then incrementally adds as a parent the predecessor in the ordering whose addition most increases the probability of the resulting structure, stopping when no addition improves the probability or a prespecified limit on the maximum number of parents is reached.
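The greedy search above can be sketched as follows, reusing the log-space family score and the same illustrative data conventions (list-of-dictionaries cases, parent-set dictionaries); the function and variable names are mine, not from the paper:

```python
import math
from collections import Counter

def log_family_score(data, x, parents, r):
    """log of Product_j (r-1)!/(N_ij+r-1)! Product_k N_ijk! for variable x."""
    n_ijk, n_ij = Counter(), Counter()
    for case in data:
        w = tuple(case[u] for u in parents)
        n_ijk[(w, case[x])] += 1
        n_ij[w] += 1
    s = sum(math.lgamma(r) - math.lgamma(n + r) for n in n_ij.values())
    return s + sum(math.lgamma(n + 1) for n in n_ijk.values())

def k2(data, order, r, max_parents):
    """Greedy K2-style search: for each variable (in order), repeatedly add
    the predecessor whose addition most increases the family score, stopping
    when no addition helps or max_parents is reached."""
    parents = {}
    for i, x in enumerate(order):
        pa = []
        best = log_family_score(data, x, tuple(pa), r[x])
        improved = True
        while improved and len(pa) < max_parents:
            improved = False
            candidates = [z for z in order[:i] if z not in pa]
            scored = [(log_family_score(data, x, tuple(pa + [z]), r[x]), z)
                      for z in candidates]
            if scored:
                s, z = max(scored)
                if s > best:
                    best, improved = s, True
                    pa.append(z)
        parents[x] = tuple(pa)
    return parents
```

On deterministic toy data where b always equals a, this search recovers the arc a -> b (given the ordering a, b).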

The above equations can be extended to handle missing data and hidden variables by marginalizing out the missing information. Straightforward implementations of this extension are not practical, however.
