Learning Bayesian Networks


Note that given the chain rule, a set of conditional independencies, and any ordering on the variables in the problem domain, we can construct a Bayesian-network structure that models the probabilistic relationships among the variables. However, this network structure may fail to reveal many of the conditional independencies that hold in the domain. Fortunately, we are often aware of direct causal influences that determine arcs in the underlying graph and thereby bias our choice of network structure.
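To make the construction concrete, here is a minimal sketch in Python. It assumes a hypothetical predicate independent(x, y, given) that answers queries against the known conditional independencies; the function and variable names are our own, not drawn from any of the papers discussed here.

    def build_structure(ordering, independent):
        # Construct parent sets for a Bayesian network from a variable
        # ordering, using the chain rule factorization
        #   P(X1, ..., Xn) = prod_i P(Xi | X1, ..., X(i-1))
        # and pruning each conditioning set with the known conditional
        # independencies.
        parents = {}
        for i, x in enumerate(ordering):
            candidates = list(ordering[:i])  # predecessors of x in the ordering
            for y in ordering[:i]:
                rest = [z for z in candidates if z != y]
                # Drop y if x is independent of y given the remaining candidates.
                if independent(x, y, rest):
                    candidates.remove(y)
            parents[x] = candidates
        return parents

Different orderings can yield different (and possibly much denser) structures, which is why knowledge of direct causal influences is so useful in choosing the ordering.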

Cooper and Herskovits [1992] describe a Bayesian framework and an algorithm for computing the probability of network structures given a database of cases. The Cooper and Herskovits paper does a good job of summarizing earlier work on constructing networks by Chow and Liu; Spirtes, Glymour, and Scheines; Pearl and Verma; and others.

Cooper and Herskovits employ a Bayesian approach in their work and as a consequence must assume a prior over the space of all possible network structures. Cooper and Herskovits [1992] assume this prior to be uniform, though other possibilities are mentioned.
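As a concrete illustration, the following Python sketch computes the logarithm of the marginal-likelihood term from the Cooper and Herskovits scoring metric for a single variable and a candidate parent set. The data representation and the use of the log-gamma function for the factorials are our own choices; under a uniform structure prior, the prior contributes only a constant.

    from math import lgamma
    from itertools import product
    from collections import Counter

    def log_ch_score(data, child, parents, arities):
        # Log of prod_j [ (r - 1)! / (N_j + r - 1)! * prod_k N_jk! ],
        # where r is the number of values of `child`, j ranges over
        # parent-value configurations, N_jk counts the cases with parent
        # configuration j and child value k, and N_j = sum_k N_jk.
        # Each case in `data` is a dict mapping variable name -> value
        # index; arities[v] is the number of values variable v can take.
        r = arities[child]
        counts = Counter()
        for case in data:
            j = tuple(case[p] for p in parents)
            counts[(j, case[child])] += 1
        score = 0.0
        for j in product(*(range(arities[p]) for p in parents)):
            n_j = sum(counts[(j, k)] for k in range(r))
            score += lgamma(r) - lgamma(n_j + r)     # log (r-1)! - log (N_j+r-1)!
            for k in range(r):
                score += lgamma(counts[(j, k)] + 1)  # log N_jk!
        return score

Summing this quantity over all the variables, and adding the log prior, gives the log of the joint probability of a structure and the database; given an ordering on the variables, the algorithm of Cooper and Herskovits greedily grows each variable's parent set so as to maximize this score.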

Lam and Bacchus [1994] point out that a uniform prior can lead to choosing a more complex model even when that model is only slightly more accurate. Empirical and theoretical results in machine learning suggest that less complex models often generalize better to unseen cases when data is sparse. Allowing arbitrarily complex models may result in overfitting the model to the data.

Lam and Bacchus suggest as an alternative using the minimum description length (MDL) principle to bias the choice of model toward simpler ones. The MDL principle counsels that the best model of a collection of data items is the one that minimizes the sum of (a) the length of the encoding of the model and (b) the length of the encoding of the data given the model, both of which can be measured in bits. Lam and Bacchus show that the encoding length of the data is a monotonically increasing function of the (Kullback-Leibler) cross-entropy between the distribution defined by the model and the true distribution, and they use this fact to guide the search for a network of minimum description length.
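To show how the two encoding lengths trade off, here is a minimal Python sketch of an MDL-style score. The particular model-encoding terms and the fixed parameter precision are illustrative assumptions, not the exact encoding used by Lam and Bacchus.

    from math import log2

    def mdl_score(parents, arities, log2_likelihood, bits_per_param=16):
        # Description length of the network plus the data; lower is better.
        # `parents` maps each node to its list of parents; `log2_likelihood`
        # is the log2-probability of the whole database under the fitted
        # network.
        n = len(parents)
        model_bits = 0.0
        for child, ps in parents.items():
            model_bits += len(ps) * log2(n)  # bits to name this node's parents
            q = 1
            for p in ps:
                q *= arities[p]              # number of parent configurations
            # Free parameters of P(child | parents), stored at fixed precision.
            model_bits += bits_per_param * q * (arities[child] - 1)
        # Encoding the data given the model costs -log2 of its likelihood,
        # which for a fixed database grows with the cross-entropy between
        # the model's distribution and the empirical distribution.
        data_bits = -log2_likelihood
        return model_bits + data_bits

Minimizing this sum penalizes extra arcs unless they buy a correspondingly large reduction in the cost of encoding the data.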

Buntine [1994] has developed a system that takes as input a set of priors over a family of possible Bayesian-network models and then constructs a computer program that learns from a database.


Exercises

The notebook CooperandHerskovits.m explores the methods of Cooper and Herskovits for learning Bayesian networks from data. Also included is code for generating and analyzing time series data from dynamical systems represented in terms of Bayesian networks.