In this section, we consider various formalisms for representing dynamical systems with finite state spaces. See [Moore, 1956], ``Gedanken Experiments on Sequential Machines,'' for an introduction to finite automata for modeling discrete-time, discrete-space dynamical systems.

The problem of inferring the smallest DFA consistent with a set of input/output pairs is NP-complete [Gold, 1972; Gold, 1978; Angluin, 1978]. Even finding a DFA polynomially close in size to the smallest is intractable assuming P != NP [Pitt & Warmuth, 1989]. Kearns and Valiant [1989] show that predicting the outputs of an unknown DFA on inputs drawn from an arbitrary probability distribution is as hard as computing certain apparently hard number-theoretic predicates.

Angluin [1987], building on the work of Gold [1972], provides a polynomial-time algorithm for inferring the smallest DFA given the ability to reset the automaton to the initial state at any time and a source of counterexamples. In Angluin's model, at any point, the learner can hypothesize a DFA; the source of counterexamples indicates whether the hypothesis is correct and, if it is not, provides a sequence of inputs on which the hypothesized and actual DFAs generate different outputs.

Let A = B^* denote the set of all finite sequences of inputs, and |a| denote the length of a sequence a in A. Let out(q,a) denote the sequence of outputs of length |a|+1 resulting from executing the sequence a starting in state q, and qa denote the final state following the execution of a starting in q. An automaton is said to be ``reduced'' if, for all q_1 != q_2 in Q, there exists a in A such that out(q_1,a) != out(q_2,a). A reduced automaton is used to represent the ``discernible'' structure of a dynamical system; you cannot expect a learner to discern the difference between two states if no sequence of inputs and outputs serves to distinguish them. A ``homing sequence,'' a in A, has the property that, for all q_1,q_2 in Q, out(q_1,a) = out(q_2,a) implies q_1a = q_2a; that is, the outputs observed while executing a determine the final state. Every reduced automaton has a homing sequence; however, the shortest homing sequence may be as long as |Q|^2 [Rivest & Schapire, 1989].

A sequence a in A is said to be a ``distinguishing sequence'' if, for all q_1 != q_2 in Q, executing a starting in q_1 and starting in q_2 yields different output sequences. Note that every distinguishing sequence is a homing sequence, but not the other way around. A distinguishing sequence is a sequence of inputs such that, for any starting state, the sequence of outputs associated with the states encountered in executing that sequence uniquely identifies the starting state.
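The definitions above can be made concrete with a small Python sketch, assuming a Moore-style automaton given by a transition function f(q,u) and an output function h(q); the function names and representation are illustrative, not taken from the sources cited.

```python
def run(f, h, q, a):
    """Return (output sequence of length |a|+1, final state) for input
    sequence a executed from state q."""
    outs = [h(q)]
    for u in a:
        q = f(q, u)
        outs.append(h(q))
    return outs, q

def is_homing(Q, f, h, a):
    """a is a homing sequence if equal output sequences imply equal
    final states."""
    seen = {}
    for q in Q:
        outs, final = run(f, h, q, a)
        key = tuple(outs)
        if key in seen and seen[key] != final:
            return False
        seen[key] = final
    return True

def is_distinguishing(Q, f, h, a):
    """a is a distinguishing sequence if the output sequence uniquely
    identifies the starting state."""
    outs = [tuple(run(f, h, q, a)[0]) for q in Q]
    return len(set(outs)) == len(Q)
```

For example, for the three-state automaton with f(q,u) = (q+1) mod 3 and h(q) = q mod 2, a single input already distinguishes the states, while the empty sequence is not even homing.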

A hidden Markov model (HMM) [Rabiner, 1989] corresponds to an SFA in which there is only one possible input. Figure 1 depicts the general form for an HMM as a graphical model.

   q        f(q)
   o ----> o        f(q) depends probabilistically on q
   |        |
   |        |
   v        v
   o        o       h(q) depends probabilistically on q

Fig. 1: General form for a hidden Markov model

The HMM training problem is the problem of approximating an arbitrary, unknown source distribution by distributions generated by HMMs.
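As an illustration of the generative view of an HMM, here is a minimal Python sketch that samples an output sequence from tabular transition and emission probabilities; the names and representation are illustrative assumptions, not from Rabiner [1989].

```python
import random

def sample_hmm(trans, emit, q0, steps, seed=0):
    """Sample `steps` outputs from an HMM given as probability tables.

    trans[q] maps next states to probabilities; emit[q] maps outputs to
    probabilities. (Illustrative representation only.)
    """
    rng = random.Random(seed)
    q, outputs = q0, []
    for _ in range(steps):
        # Emit an output according to the current state's distribution.
        outputs.append(rng.choices(list(emit[q]),
                                   weights=list(emit[q].values()))[0])
        # Move to the next state according to the transition distribution.
        q = rng.choices(list(trans[q]),
                        weights=list(trans[q].values()))[0]
    return outputs
```

Since there is only one possible input, the state evolves autonomously and only the outputs are observed.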

Abe and Warmuth [1992] show that HMMs are not trainable in time polynomial in the alphabet size unless RP = NP. Further, even if the algorithm is allowed to run in time exponential in the alphabet size, no known algorithm runs in time polynomial in the number of states of the target HMM.

A Markov chain [Kemeny & Snell, 1960] can be thought of as an SFA with a deterministic output function that serves to uniquely identify states (e.g., for all q in Q, h(q) = q). (One might also imagine SFAs in which outputs are deterministic but ambiguous.)

Markov chains are said to have the Markov property: the probability of the next state is independent of past states given the immediately prior state.

We can generalize the Markov property and speak of the L-Markov property: the probability of the next state is independent of states more than L steps in the past given the last L states. Stochastic processes that satisfy the L-Markov property but not the (L-1)-Markov property are said to be order-L Markov chains.
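As a concrete illustration of the L-Markov property, the following Python sketch estimates an order-L Markov chain from an observed state sequence by counting transitions conditioned on the last L states; the function name is illustrative.

```python
from collections import Counter, defaultdict

def fit_order_L(seq, L):
    """Estimate order-L transition probabilities from a state sequence.

    Returns a dict mapping each observed length-L context to a dict of
    next-state probabilities.
    """
    counts = defaultdict(Counter)
    for i in range(L, len(seq)):
        counts[tuple(seq[i - L:i])][seq[i]] += 1
    # Normalize the counts into conditional probabilities.
    return {ctx: {s: n / sum(c.values()) for s, n in c.items()}
            for ctx, c in counts.items()}
```

For the alternating sequence "ababab", an order-1 fit already captures the dynamics exactly: 'a' is always followed by 'b' and vice versa.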

In a probabilistic finite automaton (PFA), the output in state q is a random function of q, and the state following q is a deterministic function of q and the output in q. Figure 2 depicts the general form for a PFA as a graphical model. Autoregressive (AR) models represent a special case of PFAs.

   q        f(q,h(q))
   o ----> o        f(q,h(q)) deterministically depends on q and h(q)
   |      / |
   |     /  |
   |    /   |
   |   /    |
   v  v     v
   o        o       h(q) probabilistically depends on q

Fig. 2: General form for a probabilistic finite automaton

Note that the PFA in Figure 2 can be transformed into the general form for an HMM by augmenting the state space so that Q' = Q x Y, the augmented state transitions are probabilistic, f': Q' -> Q', and the output function is deterministic, h': Q' -> Y. The result is a completely observable (nonhidden) Markov process.
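The state-space augmentation just described can be sketched in Python. This is a minimal illustration, assuming the PFA is given as tables `emit` (per-state output distributions) and `delta` (deterministic successor for each state/output pair); all names are illustrative assumptions, not from the sources cited.

```python
def pfa_to_hmm(emit, delta):
    """Convert a PFA into HMM form over augmented states (q, y).

    emit[q] maps outputs to probabilities; delta[(q, y)] is the
    deterministic successor state. Returns (trans, out) where trans gives
    probabilistic transitions over augmented states and out is the
    deterministic output function h'((q, y)) = y.
    """
    trans = {}   # trans[(q, y)] -> {(q', y'): probability}
    out = {}     # out[(q, y)] = y
    for (q, y) in delta:
        out[(q, y)] = y
        q_next = delta[(q, y)]
        # The next augmented state pairs q_next with its random output.
        trans[(q, y)] = {(q_next, y2): p
                         for y2, p in emit[q_next].items()}
    return trans, out
```

In the augmented chain, the output is read deterministically off the state, so the process is completely observable.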

An acyclic probabilistic finite automaton (APFA) [Ron et al., 1995] is a PFA in which there are no cycles in the state-transition diagram. Note that it may be useful for short-term prediction to assume acyclicity.

Ron et al. [1994] introduce a restriction on PFAs called probabilistic finite suffix automata (PFSA). A PFSA is a PFA in which each state in Q has a unique label corresponding to a sequence (string) of outputs and, moreover, the label for f(q,h(q)) is a suffix of the sequence defined by concatenating the output h(q) to the label for q. The set of strings labeling states is suffix free: a set of strings S is suffix free if, for all s in S, s is the only suffix of s in S. An L-PFSA is a PFSA in which each state is labeled by a string of length at most L. Note that an L-PFSA can be modeled by an order-L Markov chain; L-PFSAs can be thought of as Markov chains of variable order. To be clear about what it means to be a PFSA, consider the following example.

t =      1       2       3       4       5       6
         o ----> o ----> o ----> o ----> o ----> o
         |     / |     / |     / |     / |     / |
         |    /  |    /  |    /  |    /  |    /  |
         |   /   |   /   |   /   |   /   |   /   |
         |  /    |  /    |  /    |  /    |  /    |
         v v     v v     v v     v v     v v     v
         o       o       o       o       o       o
h(q_t) = 0       1       1       0       1       0

The label of q_2 is some (nonempty) suffix of the label of q_1 concatenated with 0.
The label of q_3 is some (nonempty) suffix of the label of q_2 concatenated with 1.
The label of q_4 ...

Let n be an upper bound on |Q|. Ron et al. [1994] show that it is possible to learn a model for an L-PFSA that is epsilon close (using the Kullback-Leibler divergence to measure the distance between the distributions corresponding to the hypothesized model and the target L-PFSA) with probability 1-delta in time polynomial in L, n, |Y|, 1/epsilon, and 1/delta. Ron et al. [1995] show similar results (with better bounds) for learning APFAs.
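The suffix-free condition on state labels can be checked directly. Below is a minimal Python sketch (the function name is illustrative): a set of strings is suffix free exactly when no string in the set is a proper suffix of another.

```python
def is_suffix_free(labels):
    """Return True iff no string in labels is a proper suffix of
    another string in labels."""
    labels = set(labels)
    return not any(s != t and t.endswith(s)
                   for s in labels for t in labels)
```

For instance, {"00", "10", "1"} is suffix free, but {"0", "10"} is not, since "0" is a suffix of "10".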

A DFA is a five-tuple (Q,q_0,U,A,f) consisting of a set of states, an initial state, a set of inputs, a set of accepting states, and a state transition function f: Q x U -> Q. We say that a DFA accepts a string t = u_0,...,u_k in U^* if and only if there exists a sequence of states and inputs q_0,u_0,q_1,...,u_k,q_{k+1} such that f(q_i,u_i) = q_{i+1} for 0 <= i <= k and q_{k+1} is in A.
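The acceptance condition is straightforward to implement. Here is a minimal Python sketch, assuming the transition function f is given as a dictionary from (state, input) pairs to states; the representation is an illustrative assumption.

```python
def accepts(f, q0, A, t):
    """Return True iff the DFA (Q, q0, U, A, f) accepts input string t."""
    q = q0
    for u in t:
        q = f[(q, u)]   # deterministic transition on input u
    return q in A
```

For example, the two-state DFA that tracks the parity of 1s (with state 0 accepting) accepts exactly the strings containing an even number of 1s.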

Here is the graphical model.

   q        f(q,u)
   o ----> o
          /
         /
        /
       o
       u

Here is the state transition diagram.

   q        u        f(q,u)
   o ------------> o

A (deterministic) hidden Markov model (HMM) is a five-tuple (Q,q_0,Y,h,f), consisting of a set of states, an initial state, a set of outputs, an output function h: Q -> Y, and a state transition function f: Q -> Q. In the following, we generalize to allow nondeterministic transitions, so that f maps Q -> 2^Q, where 2^Q denotes the power set of Q. We say that an HMM accepts a string t = y_0,...,y_k in Y^* if and only if there exists a sequence of states and outputs q_0,y_0,q_1,...,y_k,q_k,y_{k+1},... such that q_{i+1} is in f(q_i) and h(q_i) = y_i for 0 <= i <= k and h(q_i) is the null symbol for k+1 <= i <= infinity.

Here is the graphical model.

   q        f(q)
   o ----> o
   |        |
   |        |
   v        v
   o        o
   h(q)     h(f(q))

Here is the state transition diagram.

   q               f(q)
   o ------------> o
   h(q)            h(f(q))

How do you model a DFA M = (Q,q_0,U,A,f) using an HMM?

- To model accepting states in the HMM, create a set of states Q' = Q union {a} where a is a new state.
- To model outputs let Y = U union {null} where null is the null symbol.
- To convert from inputs of the DFA to outputs of the HMM, let Q'' = Q' cross U, f'((q,u)) = {(q',u') : f(q,u') = q' for some u' in U}, f'((a,u)) = {(a,null)}, and h((q,u)) = u.
- The equivalent HMM is M' = (Q'',(q_0,null),Y,h,f'). Prove that M accepts a string t in U^* if and only if M' accepts t.

How do you model an HMM M = (Q,q_0,Y,h,f) using a DFA?

- Let f'(q,h(q')) = q' for each q' in f(q).
- Let A be the unique largest subset of Q such that, for every q in A, h(q) = null and f(q) is contained in A.
- The equivalent DFA is M' = (Q,q_0,Y,A,f'). Prove that M accepts a string t in Y^* if and only if M' accepts t.

2. Prove the pumping lemma for HMMs.