The problem of inferring the smallest DFA consistent with a set of input/output pairs is NP-complete [Gold, 1972] [Gold, 1978] [Angluin, 1978]. Even finding a DFA polynomially close to the smallest is intractable assuming P != NP [Pitt & Warmuth, 1989]. Kearns and Valiant [1989] show that predicting the outputs of an unknown DFA on inputs chosen from an arbitrary probability distribution is as hard as computing certain apparently hard number-theoretic predicates.
Angluin [Angluin, 1987], building on the work of Gold [1972], provides a polynomial-time algorithm for inferring the smallest DFA given the ability to reset the automaton to the initial state at any time and a source of counterexamples. In Angluin's model, at any point, the learner can hypothesize a DFA and the source of counterexamples will indicate if it is correct and, if it is not, provide a sequence of inputs on which the hypothesized and actual DFAs generate different outputs.
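To make the query model concrete, here is a minimal sketch in Python
of the two kinds of queries the learner relies on (class and method
names are illustrative, and the bounded-length search standing in for
a true equivalence oracle is an assumption for illustration; this is
not Angluin's algorithm itself):

  from itertools import product

  class TargetDFA:
      """A toy DFA the teacher can reset and run."""
      def __init__(self, start, accepting, delta):
          self.start, self.accepting, self.delta = start, accepting, delta

      def accepts(self, word):
          q = self.start                      # "reset" to the initial state
          for u in word:
              q = self.delta[(q, u)]
          return q in self.accepting

  class Teacher:
      """Source of membership answers and counterexamples."""
      def __init__(self, target, alphabet, max_len=8):
          self.target, self.alphabet, self.max_len = target, alphabet, max_len

      def membership_query(self, word):
          return self.target.accepts(word)

      def equivalence_query(self, hypothesis):
          # Stand-in for an equivalence oracle: search all words up to
          # max_len and return the first one the two machines disagree on.
          for n in range(self.max_len + 1):
              for word in product(self.alphabet, repeat=n):
                  if hypothesis.accepts(word) != self.target.accepts(word):
                      return word             # counterexample
          return None                         # hypothesis accepted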
Let A = B^* denote the set of all finite sequences of inputs, and |a|
denote the length of the sequence a in A. Let q<a> denote the
sequence of outputs of length |a|+1 resulting from executing the
sequence a starting in q, and qa denote the final state following the
execution of the sequence a starting in q. An automaton is said to be
``reduced'' if, for all q_1 != q_2 in Q, there exists a in A such
that q_1<a> != q_2<a>. A reduced automaton is used to represent the
``discernible'' structure of a dynamical system; you cannot expect a
learner to discern the difference between two states if no sequence
of inputs and outputs serves to distinguish them. A ``homing
sequence,'' a in A, has the property that, for all q_1,q_2 in Q,
q_1<a> = q_2<a> implies q_1a = q_2a. Every automaton has a homing
sequence; however, the shortest homing sequence may be as long as
|Q|^2 [Rivest & Schapire, 1989].
A sequence, a in A, is said to be a ``distinguishing sequence'' if,
for all q_1,q_2 in Q, q_1<a> = q_2<a> implies q_1 = q_2. Note that every
distinguishing sequence is a homing sequence, but not the other way
around. A distinguishing sequence is a sequence of inputs such that
for any starting state the sequence of outputs associated with the
states encountered in executing that sequence uniquely identifies the
starting state.
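The two definitions can be checked directly by simulation. The
following sketch assumes a finite-state machine given by a transition
table f[(q,u)] and an output table h[q] (hypothetical names); a
sequence is homing if equal output sequences force equal final
states, and distinguishing if equal output sequences force equal
starting states:

  from itertools import combinations

  def run(f, h, q, seq):
      """Outputs of length len(seq)+1 and the final state, starting in q."""
      outputs = [h[q]]
      for u in seq:
          q = f[(q, u)]
          outputs.append(h[q])
      return tuple(outputs), q

  def is_homing(f, h, states, seq):
      # Equal output sequences must imply equal final states.
      for q1, q2 in combinations(states, 2):
          out1, end1 = run(f, h, q1, seq)
          out2, end2 = run(f, h, q2, seq)
          if out1 == out2 and end1 != end2:
              return False
      return True

  def is_distinguishing(f, h, states, seq):
      # Equal output sequences must imply equal starting states, i.e.
      # every starting state produces a distinct output sequence.
      outputs = [run(f, h, q, seq)[0] for q in states]
      return len(set(outputs)) == len(states)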
Rivest and Schapire [1989] show how to dispense with the reset in the
general case, and how to dispense with both the reset and the source
of counterexamples in the case in which a distinguishing sequence is
either provided or can be learned in polynomial time [Rivest &
Schapire, 1987] [Schapire, 1991].
Yannakakis and Lee [1991] have shown that it is PSPACE-complete to
determine whether or not a DFA has a distinguishing sequence. (There
exist machines whose shortest distinguishing sequence is exponential
in length.) However, they also show that it can be determined in
polynomial time whether a DFA has an ``adaptive'' distinguishing
sequence, and, if so, find such a sequence of length O(|Q|^2) in
polynomial time. An adaptive distinguishing sequence (distinct from a
``preset'' distinguishing sequence as defined above) is not really a
sequence at all but rather a decision tree in which each internal
node specifies the next input to apply and the branches correspond to
the possible outputs observed.
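The decision-tree reading can be made concrete with a small data
structure (a hedged sketch with hypothetical names, not the
Yannakakis-Lee construction): each internal node names the next input
to apply, the observed output selects a branch, and each leaf names
the starting state that has been identified.

  class AdaptiveNode:
      """Internal node: apply input_symbol, then branch on the output."""
      def __init__(self, input_symbol, children):
          self.input_symbol = input_symbol
          self.children = children        # maps output -> subtree or leaf

  def identify_start_state(tree, apply_input):
      """Walk an adaptive distinguishing tree against a black-box machine.

      apply_input(u) feeds input u to the machine and returns the output
      observed in the state reached; leaves are starting-state labels.
      """
      node = tree
      while isinstance(node, AdaptiveNode):
          observed = apply_input(node.input_symbol)
          node = node.children[observed]
      return node                         # the identified starting state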
A hidden Markov model (HMM) [Rabiner, 1989] corresponds to an SFA in
which there is only one possible input. Figure 1 depicts the general
form for an HMM as a graphical model.
Abe and Warmuth [1992] show that HMMs are not trainable in time
polynomial in the alphabet size unless RP = NP. Further, even if the
algorithm is allowed to run in time exponential in the alphabet size,
there is no known algorithm that runs in time polynomial in the
number of states of the target HMM.
A Markov chain [Kemeny & Snell, 1960] can be thought of as an SFA with
a deterministic output function that serves to uniquely identify states
(e.g., for all q in Q, h(q) = q). (You might also imagine the case of
SFAs in which outputs are deterministic but ambiguous.)
Markov chains are said to have the Markov property: the
probability of the next state is independent of past states given the
immediately prior state.
We can generalize the Markov property and speak of the L-Markov
property: the probability of the next state is independent of states
more than L states in the past given the last L states. Stochastic
processes that satisfy the L-Markov property but not the (L-1)-Markov
property are said to be order-L Markov chains.
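As an illustration, an order-L Markov chain can be estimated from an
observed output sequence by counting how often each symbol follows
each length-L context. The sketch below uses illustrative names and
omits smoothing:

  from collections import Counter, defaultdict

  def fit_order_L_chain(sequence, L):
      """Empirical conditional distributions Pr(next symbol | last L symbols)."""
      counts = defaultdict(Counter)
      for i in range(L, len(sequence)):
          context = tuple(sequence[i - L:i])
          counts[context][sequence[i]] += 1
      return {ctx: {sym: n / sum(ctr.values()) for sym, n in ctr.items()}
              for ctx, ctr in counts.items()}

  # Example: fit a second-order chain to a short binary output sequence.
  model = fit_order_L_chain([0, 1, 1, 0, 1, 0, 0, 1, 1, 0], L=2)
  print(model.get((1, 1)))    # distribution over the symbol following (1, 1)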
In a probabilistic finite automaton (PFA), the output in state q is a
random function of q, and the state following q is a deterministic
function of q and the output in q. Figure 2 depicts the general form
for a PFA as a graphical model. Autoregressive (AR) models represent
a special case of PFAs.
An acyclic probabilistic finite automaton (APFA) [Ron et al., 1995] is
a PFA in which there are no cycles in the state-transition diagram.
Note that it may be useful for short-term prediction to assume
acyclicity.
Ron et al. [1994] introduce a restriction on PFAs called probabilistic
finite suffix automata (PFSA). A PFSA is a PFA in which each state in Q
has a unique label corresponding to a sequence (string) of outputs
and, moreover, the label for f(q,h(q)) is a suffix of the sequence
defined by concatenating the output h(q) to the label for q. The set
of strings labeling states is suffix free: a set of strings S is
suffix free if for all s in S, s is the only suffix of s in S. An
L-PFSA is a PFSA in which each state is labeled by a string of length
at most L. Note that an L-PFSA can be modeled by an order-L Markov
chain. L-PFSAs can be thought of as Markov chains of variable order.
To be clear about what it means to be a PFSA, see the example
following Figure 2 in the next section.
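As a small illustration of the labeling constraint (a sketch with a
toy label set; in a full PFSA the label set and transition structure
would come from the model), the code below checks that a label set is
suffix free and computes the label of the successor state from the
current label and the emitted output:

  def is_suffix_free(labels):
      """No label in the set is a (proper) suffix of another label in the set."""
      return not any(s != t and s.endswith(t) for s in labels for t in labels)

  def next_label(labels, current_label, output):
      """Label of f(q,h(q)): the unique label that is a suffix of label(q)+h(q)."""
      extended = current_label + output
      matches = [s for s in labels if extended.endswith(s)]
      assert len(matches) == 1, "label set does not determine a unique successor"
      return matches[0]

  # Toy 2-PFSA label set over binary outputs.
  labels = {"00", "01", "10", "11"}
  print(is_suffix_free(labels))           # True
  print(next_label(labels, "01", "1"))    # "11"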
Stochastic Finite Automata
In a stochastic finite automaton (SFA), f(q,a) and h(q) are random
functions and Pr(f(q,a)|q,a) and Pr(h(q)) define the state transition
and output probabilities respectively. The general SFA enables us to
model uncertainty in state transitions and outputs.
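To make the definition concrete, the sketch below (toy probabilities
and illustrative names) samples an output sequence from an SFA given
an input sequence, drawing successor states from Pr(f(q,a)|q,a) and
outputs from Pr(h(q)):

  import random

  def sample_sfa(transition, output, q0, inputs, rng=random.Random(0)):
      """transition[(q,a)] and output[q] map values to probabilities."""
      def draw(dist):
          values, probs = zip(*dist.items())
          return rng.choices(values, weights=probs)[0]

      q, outputs = q0, [draw(output[q0])]   # output observed in the start state
      for a in inputs:
          q = draw(transition[(q, a)])      # stochastic state transition
          outputs.append(draw(output[q]))   # stochastic output in the new state
      return outputs

  # Toy SFA with states {0, 1}, a single input 'a', and outputs {'x', 'y'}.
  transition = {(0, 'a'): {0: 0.3, 1: 0.7}, (1, 'a'): {0: 0.6, 1: 0.4}}
  output = {0: {'x': 0.9, 'y': 0.1}, 1: {'x': 0.2, 'y': 0.8}}
  print(sample_sfa(transition, output, q0=0, inputs='aaaa'))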
     q                      f(q) depends probabilistically on q
     o -------------------> o
     |                      |
     |                      |
     v                      v
     o                      o
     h(q) depends probabilistically on q

     Fig. 1: General form for a hidden Markov model
The HMM training problem is the problem of approximating an arbitrary,
unknown source distribution by distributions generated by HMMs.
     q                      f(q,h(q)) deterministically depends on q and h(q)
     o -------------------> o
     |                  /   |
     |               /      |
     |            /         |
     |         /            |
     v                      v
     o                      o
     h(q) probabilistically depends on q

     Fig. 2: General form for a probabilistic finite automaton
Note that the PFA in Figure 2 can be transformed into the general form
for an HMM by augmenting the state space so that Q' = Q times Y, the
augmented state transitions are probabilistic f': Q' -> Q', and the
output function is deterministic h': Q' -> Y. This is a completely
observable (nonhidden) Markov process.
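The augmentation can be written out directly. In the sketch below
(hypothetical names and a toy two-state PFA), the PFA is given by an
output distribution out[q] and a deterministic successor
next_state[(q,y)]; the augmented chain over Q' = Q times Y has
probabilistic transitions and a deterministic output h'((q,y)) = y:

  def pfa_to_hmm(out, next_state):
      """Augment a PFA into an HMM-style chain over states (q, y).

      out[q]              : dict mapping output y to Pr(h(q) = y)
      next_state[(q, y)]  : the deterministic successor f(q, y)
      Returns (transition, emit): transition[(q, y)] is a distribution over
      augmented states, and emit[(q, y)] = y is deterministic.
      """
      transition, emit = {}, {}
      for (q, y), q_next in next_state.items():
          emit[(q, y)] = y
          transition[(q, y)] = {(q_next, y2): p for y2, p in out[q_next].items()}
      return transition, emit

  # Toy PFA: states {0, 1}, outputs {'0', '1'}.
  out = {0: {'0': 0.8, '1': 0.2}, 1: {'0': 0.3, '1': 0.7}}
  next_state = {(0, '0'): 0, (0, '1'): 1, (1, '0'): 0, (1, '1'): 1}
  transition, emit = pfa_to_hmm(out, next_state)
  print(transition[(0, '1')])   # distribution over states following (0, '1')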
The following sequence of states and outputs illustrates the PFSA
labeling constraint:

 t =      1       2       3       4       5       6
          o ----> o ----> o ----> o ----> o ----> o
          |     / |     / |     / |     / |     / |
          |    /  |    /  |    /  |    /  |    /  |
          |   /   |   /   |   /   |   /   |   /   |
          |  /    |  /    |  /    |  /    |  /    |
          v       v       v       v       v       v
          o       o       o       o       o       o
 h(q_t) = 0       1       1       0       1       0

The label of q_2 is some (nonempty) suffix of the label of q_1
concatenated with 0; the label of q_3 is some (nonempty) suffix of
the label of q_2 concatenated with 1; and so on.
Let n be an upper bound on |Q|. Ron et al. [1994] show that it is
possible to learn a model for an L-PFSA that is epsilon close (using
the Kullback-Leibler divergence to measure the distance between the
distributions corresponding to the hypothesized model and the target
L-PFSA) with probability 1-delta in time polynomial in L, n, |Y|,
1/epsilon, and 1/delta. Ron et al. [1995] show similar results (with
better bounds) for learning APFAs.
Equivalent Models
In this section, we consider the relationship between dynamical models
viewed as recognizers for languages. We begin with the definition
for deterministic finite automata given by [Aho, Hopcroft, & Ullman,
1974]. This definition is different from the more general description
given earlier.
A DFA is a five tuple (Q,q_0,U,A,f) consisting of a set of states, an
initial state, a set of inputs, a set of accepting states, and a state
transition function f:Q cross U -> Q. We say that a DFA accepts a
string t = u_0,...,u_k in U^* if and only if there exists a sequence
of states and inputs q_0,u_0,q_1,u_1,...,q_k,u_k,q_{k+1} such that
f(q_i,u_i) = q_{i+1} for 0 =< i =< k and q_{k+1} is in A.
Here is the graphical model.

     q                      f(q,u)
     o -------------------> o
                        /
                     /
                  /
               /
     o
     u

Here is the state transition diagram.

     q              u              f(q,u)
     o --------------------------> o
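A DFA's acceptance test can be transcribed directly (a sketch with
illustrative names; the transition function is represented as a
Python dictionary):

  def dfa_accepts(delta, q0, accepting, word):
      """Run the DFA from q0 on word and test whether the final state is in A."""
      q = q0
      for u in word:
          q = delta[(q, u)]
      return q in accepting

  # Toy DFA over {'a', 'b'} accepting strings with an even number of b's.
  delta = {('even', 'a'): 'even', ('even', 'b'): 'odd',
           ('odd', 'a'): 'odd', ('odd', 'b'): 'even'}
  print(dfa_accepts(delta, 'even', {'even'}, 'abba'))   # True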
A (deterministic) hidden Markov model (HMM) is a five tuple
(Q,q_0,Y,h,f), consisting of a set of states, an initial state, a set
of outputs, an output function h:Q -> Y, and a state transition
function f:Q -> Q. In the following, we generalize to allow
nondeterministic transitions, so that f maps each state to a set of
possible successor states, f: Q -> 2^Q, where 2^Q denotes the power
set of Q. We say that an HMM accepts a string t = y_0,...,y_k in Y^*
if and only if there exists a sequence of states and outputs
q_0,y_0,q_1,y_1,...,q_k,y_k,q_{k+1},y_{k+1},... such that q_{i+1} is
in f(q_i) and h(q_i) = y_i for 0 =< i =< k and h(q_i) is the null
symbol for k+1 =< i =< infinity.
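One way to check this acceptance condition (a sketch with illustrative
names; f maps a state to a set of successors and h maps a state to an
output, with None standing for the null symbol) is to track the set
of states the HMM could be in after each output symbol and then ask
whether the run can continue forever through null-output states:

  def can_emit_null_forever(f, h, states):
      """States from which some infinite path stays inside null-output states."""
      alive = {q for q in states if h[q] is None}
      changed = True
      while changed:                      # greatest fixed point: prune dead ends
          changed = False
          for q in list(alive):
              if not (f[q] & alive):
                  alive.discard(q)
                  changed = True
      return alive

  def hmm_accepts(f, h, q0, string):
      current = {q0} if h[q0] == string[0] else set()
      for y in string[1:]:                # possible states q_i emitting y_i
          current = {q2 for q in current for q2 in f[q] if h[q2] == y}
      silent = can_emit_null_forever(f, h, set(f))
      return any(f[q] & silent for q in current)

  # Toy HMM: state 0 emits 'a', state 1 emits 'b', state 2 is a null
  # state that loops on itself.
  f = {0: {1}, 1: {2}, 2: {2}}
  h = {0: 'a', 1: 'b', 2: None}
  print(hmm_accepts(f, h, 0, 'ab'))       # True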
Here is the graphical model.

     q                      f(q)
     o -------------------> o
     |                      |
     |                      |
     v                      v
     o                      o
     h(q)                   h(f(q))

Here is the state transition diagram.

     q                             f(q)
     o --------------------------> o
     h(q)                          h(f(q))
Exercises
1. We define a simple HMM to be an HMM in which h(q) = null
if and only if f(q) = q. Prove that every HMM interpreted as a
language recognizer can be modeled by a simple HMM.
2. Prove the pumping lemma for HMMs.
3. How do you model a DFA M = (Q,q_0,U,A,f) using an HMM? How do you
model an HMM M = (Q,q_0,Y,h,f) using a DFA?