Let X be the set of hidden variables, Y the set of observable variables, and Omega_X and Omega_Y the corresponding sets of values (sample spaces) for these variables. The product Omega_X x Omega_Y can be thought of as the space of complete samples and Omega_Y as the space of incomplete samples. If (x,y) is a complete sample, then y is its observable part.
Let f(x, y | theta) specify a family of functions, one of which governs the generation of complete samples, and let g(y | theta) specify a family of functions, one of which governs the generation of incomplete samples. The functions f and g are related by the following equation.
g(y | theta) = Integral_{x in Omega_X} f(x,y | theta) dx

The EM algorithm attempts to find a ``value for theta which maximizes g(y | theta) given an observed y, but it does so by making essential use of the associated family f(x, y | theta)'' [Dempster et al., 1977].
EM can be thought of as a deterministic version of Gibbs sampling. Instead of sampling the unknown variables as Gibbs sampling does, EM operates on their means or modes using expected sufficient statistics. Both EM and Gibbs sampling are used for approximation with incomplete data.
Suppose that x is the missing data and y is the observed data. It may be straightforward to calculate p(x, y | theta) but not so easy to calculate p(y | theta).
Define Q(theta | theta') as follows:
Q(theta | theta') = E[log p(x, y | theta) | theta', y]

where x is the random variable and we are taking the expectation with respect to the probability density p(x | theta', y).
The EM algorithm works as follows:

1. Choose an initial estimate theta'.
2. (Expectation step) Compute Q(theta | theta').
3. (Maximization step) Set theta' to the value of theta that maximizes Q(theta | theta').
4. Repeat Steps 2 and 3 until the estimates converge.
The generalized EM (GEM) algorithm is the same except that, instead of requiring maximization in Step 3, it only requires that the estimate be improved.
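To make the structure of the loop concrete, here is a minimal Python skeleton of the EM/GEM iteration; the names expectation_maximization, e_step, and m_step and the fixed iteration count are placeholders of our own choosing, not part of the formulation above.

# Minimal skeleton of the EM/GEM loop described above.  The callables
# e_step and m_step are placeholders: e_step(theta, y) should return
# whatever summary of Q(. | theta') the model needs (for example, expected
# sufficient statistics), and m_step should return a theta that maximizes
# (EM) or merely improves (GEM) the expected log likelihood.

def expectation_maximization(theta, y, e_step, m_step, iterations=100):
    for _ in range(iterations):
        stats = e_step(theta, y)    # expectation step
        theta = m_step(stats)       # maximization (or improvement) step
    return theta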
Here is a proof sketch showing that each iteration of EM actually improves the current estimate theta. We start with the following simple identity (see the Appendix at the end of this document):
log p(y | theta) = log p(x, y | theta) - log p(x | theta, y)

and take expectations of both sides, treating x as a random variable with distribution p(x | theta', y).
log p(y | theta) = E[log p(x, y | theta) | theta', y] - E[log p(x | theta, y) | theta', y]    (1)

We note without proof that the second term in (1), -E[log p(x | theta, y) | theta', y], is minimized when theta = theta' (see Exercise 1 below).
Now consider any value theta'' such that
E[log p(x, y | theta'') | theta', y] > E[log p(x, y | theta') | theta', y]

and note that if we replace theta' by theta'' we increase the first term in (1) while not decreasing the second term, so that
p(y | theta'') > p(y | theta')

thereby improving our current estimate.
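Spelled out, the argument is the following chain, in which the first inequality uses the fact that the second term in (1) is minimized at theta = theta', and the strict inequality uses the defining property of theta'':

log p(y | theta'') =  E[log p(x, y | theta'') | theta', y] - E[log p(x | theta'', y) | theta', y]
                   >= E[log p(x, y | theta'') | theta', y] - E[log p(x | theta', y) | theta', y]
                   >  E[log p(x, y | theta')  | theta', y] - E[log p(x | theta', y) | theta', y]
                   =  log p(y | theta')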
In general, the computations required for EM can be quite difficult, requiring complex integrations to compute the required expectations and to perform the maximization. In the remainder of this section, we consider a special case of EM well suited to distributions in the exponential family.
Consider the unsupervised learning problem in which X and Y are boolean variables, Y depends on X, and X is hidden. In this case, complete samples (x,y) are drawn from {0,1}^2 and incomplete samples are drawn from {0,1}.
The hypothesis space is defined by the following network structure.
o ----> o
X       Y

Here are the parameters for the above structure.
{theta(x) = Pr(X=x) : x in {0,1}}
{phi(x,y) = Pr(Y=y | X=x) : (x,y) in {0,1}^2}

Here is a graphical model representing the learning problem.
              _________________
             |                 |
             |                 |              The boxed part of the
 o --------> | o ----> o       | <------- o   model indicates a plate
 theta(X)    | X       Y       |   phi(X,Y)   (see [Buntine, 1994])
             |               N |              representing N samples.
             |_________________|

Here are some samples with missing values indicated by underscores.
X   Y
---------
_   1          Note that in the general case the
_   0          values for X could be either partially
_   1          or completely missing in the data.
_   0
_   0
...

We represent the parameter distributions using beta distributions. Let's assume that Beta[n,m] is the distribution that corresponds to our having observed n 0's and m 1's. For each parameter distribution, the associated counts correspond to sufficient statistics. In the case of the theta parameters, we have
n(x) = Sum_{s in S} (1_{Value(s,X) = x})

where S is the set of samples comprising the data, s is a variable ranging over samples, Value(s,V) is the value assigned to the variable V in sample s, 1_{Value(s,V) = v} = 1 if Value(s,V) = v, and 1_{Value(s,V) = v} = 0 if Value(s,V) != v. In the case in which there is no missing data, we have
xi(theta(1) | S) = Beta[n(0),n(1)]

Similarly, for the phi parameters, we have
n(x,y) = Sum_{s in S} (1_{Value(s,Y) = y} 1_{Value(s,X) = x})

and, again in the case of no missing data, we have
xi(phi(1,1) | S) = Beta[n(1,0),n(1,1)]

In the case of missing data, instead of using the sufficient statistics, which we don't have, we use the expected sufficient statistics to iteratively assign estimates to theta and phi. This iterative process proceeds in two steps.
In the first step, called the expectation step, we compute the expected sufficient statistics as follows.
E(n(x) | S,theta,phi) = Sum_{s in S} Pr(X = x | Y = Value(s,Y), theta, phi)

E(n(x,y) | S,theta,phi) = Sum_{s in S} (Pr(X = x | Y = Value(s,Y), theta, phi) 1_{Value(s,Y) = y})

The required computations can easily be carried out for distributions in the exponential family.
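To make this concrete, here is a minimal Python sketch of the expectation step for the boolean model above; the names posterior_x, expected_counts, and samples_y are our own, and theta and phi are represented as plain lists indexed by the variable values.

# Expectation step for the boolean model above.
# theta[x] = Pr(X=x); phi[x][y] = Pr(Y=y | X=x); samples_y holds the
# observed Y values (the X values are missing).

def posterior_x(y, theta, phi):
    """Pr(X=x | Y=y, theta, phi) for x in {0, 1}, computed by Bayes' rule."""
    joint = [theta[x] * phi[x][y] for x in (0, 1)]   # Pr(X=x, Y=y)
    total = sum(joint)                               # Pr(Y=y)
    return [j / total for j in joint]

def expected_counts(samples_y, theta, phi):
    """Expected sufficient statistics E(n(x)|S,theta,phi) and E(n(x,y)|S,theta,phi)."""
    en_x = [0.0, 0.0]
    en_xy = [[0.0, 0.0], [0.0, 0.0]]                 # en_xy[x][y]
    for y in samples_y:
        post = posterior_x(y, theta, phi)
        for x in (0, 1):
            en_x[x] += post[x]                       # Pr(X=x | Y=Value(s,Y), theta, phi)
            en_xy[x][y] += post[x]                   # the indicator picks out the observed y
    return en_x, en_xy

For instance, with the uniform initial estimates of Exercise 3 below (everything 0.5), posterior_x returns [0.5, 0.5] for either value of y, so each sample contributes 0.5 to both E(n(0)) and E(n(1)).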
In the second step, called the maximization step, we set theta and phi to their mode conditioned on the expected sufficient statistics. In this case, finding the mode is trivial. For example, if the prior on phi(1,1) is Beta[alpha_1,alpha_2], then we're looking for the mode of
Beta[alpha_1+E(n(1,0)|S,theta,phi), alpha_2+E(n(1,1)|S,theta,phi)]

which is just
                            alpha_2 + E(n(1,1)|S,theta,phi)
phi(1,1)' = -----------------------------------------------------------------
            alpha_1 + E(n(1,0)|S,theta,phi) + alpha_2 + E(n(1,1)|S,theta,phi)
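Continuing the Python sketch above (and reusing expected_counts from it), the maximization step and the overall iteration might look as follows, assuming a symmetric Beta[alpha, alpha] prior on every parameter; m_step and em are again names of our own choosing, not part of the formulation.

def m_step(en_x, en_xy, alpha=1.0):
    """Set each parameter to the mode of its Beta posterior, as in the text,
    with a symmetric Beta[alpha, alpha] prior on every parameter."""
    theta = [0.0, 0.0]
    theta[1] = (alpha + en_x[1]) / (2 * alpha + en_x[0] + en_x[1])
    theta[0] = 1.0 - theta[1]
    phi = [[0.0, 0.0], [0.0, 0.0]]
    for x in (0, 1):
        phi[x][1] = (alpha + en_xy[x][1]) / (2 * alpha + en_xy[x][0] + en_xy[x][1])
        phi[x][0] = 1.0 - phi[x][1]
    return theta, phi

def em(samples_y, theta, phi, iterations=20):
    """Alternate the expectation and maximization steps a fixed number of times."""
    for _ in range(iterations):
        en_x, en_xy = expected_counts(samples_y, theta, phi)
        theta, phi = m_step(en_x, en_xy)
    return theta, phi

Exercise 4 below asks for exactly this kind of implementation; the sketch is only meant to show the shape of the computation.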
The formulation of EM given here is perfectly suited to working with distributions in the exponential families, including the Bernoulli, Poisson, normal, gamma, beta, multinomial, and Dirichlet distributions.
Exercises

1. Prove that the second term in (1) is minimized when theta = theta' (see Exercise 9.3 in [Gelman, Carlin, Stern & Rubin, 1995] or the derivation in Appendix A.2 of [Ripley, 1996]).
2. Show that if we use the following alternative formulation for Q(theta | theta')
Q(theta | theta') = E[log p(theta, x | y) | theta', y]

then we increase the posterior density, p(theta | y), instead of the likelihood, p(y | theta).
3. Consider the case in which X and Y are boolean variables with Y dependent on X, Pr(X) and Pr(Y|X) are specified as conditional probability tables, Y is observed, and X is hidden. Suppose that the target distribution is specified by
Pr(X=1) = 0.9
Pr(Y=1 | X=1) = 0.7
Pr(Y=1 | X=0) = 0.4

Use the EM algorithm to calculate the expected sufficient statistics and estimate entries in the conditional probability tables starting with the following initial estimates,
theta(1) = phi(1,1) = phi(1,0) = theta(0) = phi(0,1) = phi(0,0) = 0.5

the following data,
X   Y
---------
_   1
_   0
_   1
_   1

and a prior Beta[1,1] for each parameter.
4. Implement a version of EM specific to the problem described in Exercise 3 and experiment with different initial estimates and data sets. Does this example have multiple local maxima?
5. Our example, Pr(Y|X)Pr(X), was chosen to keep the notation and discussion simple. Discuss the practical relevance of this simple model. Consider other models in which the potential advantage of introducing one or more hidden variables is more obvious or easier to motivate.
6. In keeping with Exercise 5, carry out a variant of the empirical analysis described in Exercise 4, but using as both the target class and the hypothesis space the so-called "idiot" Bayes model shown below
o <---- o ----> o
Y_1     X       Y_2

where Y_1 and Y_2 are observable, X is not, all three are boolean, the data look something like
X   Y_1   Y_2
-----------------
_   1     1
_   0     1
_   1     0
_   1     1

and the target distributions and prior densities on the parameters are of your own choosing.
Appendix

p(c | a,b) = p(c,a | b)/p(a | b)                    (conditional probability)
p(a | b) = p(c,a | b)/p(c | a,b)                    (algebra)
log p(a | b) = log p(c,a | b) - log p(c | a,b)      (take logarithms)