Hidden Markov models are used in speech recognition. Suppose that we have a set W of words and a separate training set for each word. Build an HMM for each word using the associated training set. Let lambda_w denote the HMM parameters associated with the word w. When presented with a sequence of observations sigma, choose the word with the most likely model, i.e.,
w* = arg max_{w in W} Pr(sigma|lambda_w)
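As a minimal sketch of this decision rule (the names `classify`, `models`, and `log_likelihood` are hypothetical; `log_likelihood` stands in for log Pr(sigma | lambda_w), e.g., as computed by the forward algorithm below):

```python
# Hypothetical sketch: choose the word whose HMM assigns the highest
# likelihood to the observation sequence sigma.
def classify(sigma, models, log_likelihood):
    # models: dict mapping word w -> trained parameters lambda_w
    # log_likelihood(sigma, lam): assumed to return log Pr(sigma | lam)
    return max(models, key=lambda w: log_likelihood(sigma, models[w]))
```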
Define the alpha (forward) values as follows,

alpha_t(i) = Pr(O_1 = o_1, ..., O_t = o_t, X_t = q_i | lambda)

Note that

alpha_T(i) = Pr(O_1 = o_1, ..., O_T = o_T, X_T = q_i | lambda) = Pr(sigma, X_T = q_i | lambda)

The alpha values enable us to solve Problem 1 since, marginalizing, we obtain

Pr(sigma | lambda) = sum_{i=1}^N Pr(o_1, ..., o_T, X_T = q_i | lambda) = sum_{i=1}^N alpha_T(i)
Define the beta values as follows,
beta_t(i) = Pr(O_t+1 = o_t+1, ..., O_T = o_T | X_t = q_i, lambda)

We will need the beta values later in the Baum-Welch algorithm.
The alpha values can be computed inductively:

a. For 1 <= i <= N, alpha_1(i) = pi_i b_i(o_1)

b. For 1 <= t <= T-1 and 1 <= j <= N, alpha_t+1(j) = [sum_{i=1}^N alpha_t(i) a_ij] b_j(o_t+1)
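A minimal NumPy sketch of this forward recursion, assuming observations are encoded as integer indices into the emission matrix and working with unscaled probabilities (a practical implementation would scale or work in log space to avoid underflow); the array names pi, A, B stand for pi_i, a_ij, and b_i(k):

```python
import numpy as np

def forward(pi, A, B, obs):
    """alpha[t, i] = Pr(o_1, ..., o_{t+1}, X_{t+1} = q_i | lambda), with 0-based t.

    pi  : (N,)   initial state distribution pi_i
    A   : (N, N) transition probabilities a_ij
    B   : (N, M) emission probabilities b_i(k)
    obs : length-T sequence of observation indices
    """
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # alpha_1(i) = pi_i b_i(o_1)
    for t in range(T - 1):
        # alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] b_j(o_{t+1})
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]
    return alpha

def likelihood(pi, A, B, obs):
    return forward(pi, A, B, obs)[-1].sum()           # Pr(sigma | lambda) = sum_i alpha_T(i)
```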
The beta values are computed by a similar backward recursion:

a. For 1 <= i <= N, beta_T(i) = 1

b. For t = T-1, T-2, ..., 1 and 1 <= i <= N, beta_t(i) = sum_{j=1}^N a_ij b_j(o_t+1) beta_t+1(j)
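A companion sketch for the beta recursion, under the same assumptions and naming as the forward pass above:

```python
def backward(A, B, obs):
    """beta[t, i] = Pr(o_{t+2}, ..., o_T | X_{t+1} = q_i, lambda), with 0-based t."""
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                    # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j a_ij b_j(o_{t+1}) beta_{t+1}(j)
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta
```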
To find the most likely state sequence we use the Viterbi algorithm. Let delta_t(i) denote the probability of the most likely state sequence that ends in state q_i at time t and accounts for the first t observations, and let Phi_t(i) record the predecessor state on that sequence.

For 1 <= i <= N,

a. delta_1(i) = pi_i b_i(o_1)

b. Phi_1(i) = 0
For 2 <= t <= T, 1 <= j <= N,

a. delta_t(j) = max_i [delta_t-1(i) a_ij] b_j(o_t)

b. Phi_t(j) = argmax_i [delta_t-1(i) a_ij]
a. p* = max_i [delta_T(i)]

b. i*_T = argmax_i [delta_T(i)]
The optimal state sequence is then recovered by backtracking: for t = T-1, T-2, ..., 1,

i*_t = Phi_t+1(i*_t+1)
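A sketch of the Viterbi recursion and backtracking under the same conventions (0-based indices, unscaled probabilities; log probabilities would be used in practice):

```python
def viterbi(pi, A, B, obs):
    """Return p* and the most likely state sequence i*_1, ..., i*_T."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    phi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                      # delta_1(i) = pi_i b_i(o_1)
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A            # delta_{t-1}(i) a_ij
        phi[t] = scores.argmax(axis=0)                # Phi_t(j)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]  # delta_t(j)
    path = np.zeros(T, dtype=int)
    p_star = delta[-1].max()                          # p* = max_i delta_T(i)
    path[-1] = delta[-1].argmax()                     # i*_T
    for t in range(T - 2, -1, -1):
        path[t] = phi[t + 1, path[t + 1]]             # i*_t = Phi_{t+1}(i*_{t+1})
    return p_star, path
```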
Suppose that the outputs (observations) are in a 1-1 correspondence with the states so that N = M, varphi(q_i) = v_i and b_i(j) = 1 for j = i and 0 for j != i. Now the Markov process is not hidden at all and the HMM is just a Markov chain. To estimate the lambda parameters for this Markov chain it is enough just to calculate the appropriate frequencies from the observed sequence of outputs. These frequencies constitute sufficient statistics for the underlying distributions.
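For instance, the transition probabilities of the fully observed chain can be estimated from raw counts, roughly as in this sketch (function and variable names are mine; it uses NumPy as in the sketches above and assumes every state appears at least once as the source of a transition):

```python
def estimate_transitions(states, N):
    """states: observed sequence of state indices in {0, ..., N-1}."""
    counts = np.zeros((N, N))
    for s, s_next in zip(states[:-1], states[1:]):
        counts[s, s_next] += 1                        # transition frequencies
    # a_ij = (number of i -> j transitions) / (number of transitions out of i)
    return counts / counts.sum(axis=1, keepdims=True)
```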
In the more general case, we can't observe the states directly so we can't calculate the required frequencies. In the hidden case, we use expectation maximization (EM) as described in [Dempster et al., 1977].
Instead of calculating the required frequencies directly from the observed outputs, we iteratively estimate the parameters. We start by choosing arbitrary values for the parameters (making sure only that the values satisfy the requirements for probability distributions).
We then compute the expected frequencies given the model and the observations. The expected frequencies are obtained by weighting the observed transitions by the probabilities specified in the current model. The expected frequencies so obtained are then substituted for the old parameters, and we iterate until there is no further improvement. On each iteration we increase the probability of the observation sequence sigma being generated by the model, until some limiting probability is reached. This iterative procedure is guaranteed to converge on a local maximum of the cross-entropy (Kullback-Leibler) performance measure.
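The outer loop of this procedure might look roughly like the following sketch; `expected_counts` and `reestimate` are placeholder names for the expected-frequency computation (the gamma/xi formulas below) and the re-estimation formulas that follow them:

```python
def baum_welch(pi, A, B, obs, max_iter=100, tol=1e-6):
    """Iterate until Pr(sigma | lambda) stops improving (sketch)."""
    prev_loglik = float("-inf")
    for _ in range(max_iter):
        gamma, xi, loglik = expected_counts(pi, A, B, obs)  # expected frequencies
        pi, A, B = reestimate(gamma, xi, obs, B.shape[1])   # substitute for old parameters
        if loglik - prev_loglik < tol:                      # no further improvement
            break
        prev_loglik = loglik
    return pi, A, B
```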
To compute the expected frequencies, define

xi_t(i,j) = Pr(X_t = q_i, X_t+1 = q_j | sigma, lambda)

We compute these probabilities using the forward and backward variables.
xi_t(i,j) = alpha_t(i) a_ij b_j(o_t+1) beta_t+1(j) / Pr(sigma | lambda)

The probability of being in q_i at time t given the observation sequence and the model is

gamma_t(i) = Pr(X_t = q_i | sigma, lambda)

which we obtain by marginalization:

gamma_t(i) = sum_j xi_t(i,j)

Note that
sum_{t=1}^{T-1} gamma_t(i) = expected number of transitions from q_i

and

sum_{t=1}^{T-1} xi_t(i,j) = expected number of transitions from q_i to q_j
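A sketch of this computation, reusing the forward/backward sketches above (again unscaled, so suitable only for short sequences); note that gamma here is obtained by marginalizing xi and therefore has entries for t = 1, ..., T-1:

```python
def expected_counts(pi, A, B, obs):
    """Compute gamma_t(i), xi_t(i,j), and log Pr(sigma | lambda)."""
    T, N = len(obs), len(pi)
    alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)
    prob = alpha[-1].sum()                            # Pr(sigma | lambda)
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        # xi_t(i,j) = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j) / Pr(sigma | lambda)
        xi[t] = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / prob
    gamma = xi.sum(axis=2)                            # gamma_t(i) = sum_j xi_t(i,j)
    return gamma, xi, np.log(prob)
```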
The parameters are then re-estimated as follows:

a. bar{pi}_i = gamma_1(i)

b. bar{a}_ij = sum_{t=1}^{T-1} xi_t(i,j) / sum_{t=1}^{T-1} gamma_t(i)

c. bar{b}_j(k) = sum_{t=1}^{T-1} gamma_t(j) 1_{o_t = k} / sum_{t=1}^{T-1} gamma_t(j)
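And a sketch of these re-estimation formulas applied to the output of `expected_counts` above (so all sums run over t = 1, ..., T-1, as written):

```python
def reestimate(gamma, xi, obs, M):
    """Re-estimate (pi, A, B) from expected counts; M is the number of output symbols."""
    N = gamma.shape[1]
    pi_bar = gamma[0]                                     # bar{pi}_i = gamma_1(i)
    A_bar = xi.sum(axis=0) / gamma.sum(axis=0)[:, None]   # bar{a}_ij
    B_bar = np.zeros((N, M))
    denom = gamma.sum(axis=0)                             # sum_t gamma_t(j)
    for k in range(M):
        mask = np.asarray(obs[:-1]) == k                  # indicator 1_{o_t = k}
        B_bar[:, k] = gamma[mask].sum(axis=0) / denom     # bar{b}_j(k)
    return pi_bar, A_bar, B_bar
```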
Fig. 1: Bayesian network representation for an HMM. The hidden states form a chain X_0 -> X_1 -> ... -> X_T, and each X_t has a single observed child O_t.

In the description of the Baum-Welch algorithm provided above, the computation of the expected sufficient statistics depends on computing the following term for all i and j in Omega_X.
xi_t(i,j) = Pr(X_t = q_i, X_t+1 = q_j | sigma, lambda)

These computations in turn rely on computing the forward and backward variables (the alpha's and beta's).
xi_t(i,j) = alpha_t(i) a_ij b_j(o_t+1) beta_t+1(j) / Pr(sigma | lambda)

Generally, the forward and backward variables are computed using the forward-backward procedure, which uses dynamic programming to compute the variables in time polynomial in |Omega_X|, |Omega_O|, and T. In the following paragraphs, we show how the xi's can be computed using standard Bayesian network inference algorithms with the same big-O complexity. One advantage of this approach is that it extends easily to the case in which the hidden part of the model is factored into some number of state variables.
In the network shown in Figure 1 the O_t's are known; in particular, we have that O_t = o_t. If we assign the O_t's accordingly, use the probabilities indicated by lambda, and apply a standard Bayesian network inference algorithm, we obtain for every X_t the posterior distribution Pr(X_t | sigma, lambda). This isn't exactly what we need, since X_t and X_t+1 are clearly not independent; if they were, we could obtain Pr(X_t, X_t+1 | sigma, lambda) as the product of Pr(X_t | sigma, lambda) and Pr(X_t+1 | sigma, lambda). There are a number of remedies, but one graphically intuitive approach is to add a new state variable (X_t, X_t+1) which is the obvious deterministic function of X_t and X_t+1. This addition results in the network shown in Figure 2.
Fig. 2: Bayesian network with joint variables (X_t, X_t+1). Each pair variable is added as a deterministic child of X_t and X_t+1 in the chain of Figure 1.

If we update the network in Figure 2 with O_t = o_t, then we obtain Pr((X_t, X_t+1) | sigma, lambda) directly. It should be clear that we can eliminate the X_t's (except for X_0) to obtain the singly connected network shown in Figure 3, which can be updated in time polynomial in |Omega_X|, |Omega_O|, and T.
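As a small sanity check (a hypothetical helper, not how a network inference algorithm would actually proceed), the pairwise posterior Pr(X_t = q_i, X_t+1 = q_j | sigma, lambda) can be computed for a tiny model by enumerating all hidden state sequences and compared against xi_t(i,j) from the forward-backward sketch above:

```python
from itertools import product

def pairwise_posterior_brute_force(pi, A, B, obs, t):
    """Pr(X_t = q_i, X_{t+1} = q_j | sigma, lambda), with t a 0-based position in obs."""
    T, N = len(obs), len(pi)
    post = np.zeros((N, N))
    for seq in product(range(N), repeat=T):           # every hidden state sequence
        p = pi[seq[0]] * B[seq[0], obs[0]]
        for u in range(1, T):
            p *= A[seq[u - 1], seq[u]] * B[seq[u], obs[u]]
        post[seq[t], seq[t + 1]] += p                  # accumulate Pr(sigma, X_t, X_{t+1})
    return post / post.sum()                           # condition on sigma
```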
Fig. 3: Bayesian network with the X_t's eliminated. The pair variables form a chain X_0 -> (X_0, X_1) -> (X_1, X_2) -> ... -> (X_T-1, X_T), with each observation O_t attached to the corresponding variable.

The extension to HMMs with factored state spaces (e.g., see Figure 4) is graphically straightforward. The computational picture is more complicated and depends on the specifics of the update algorithm. It is important to point out, however, that there is a wide range of update algorithms, both approximate and exact, to choose from.
Fig. 4: Bayesian network representation for an HMM with factored state space Omega_X = Omega_X_1 x Omega_X_2. The state variable is two dimensional, X_t = (X_{1,t}, X_{2,t}).