Let's consider the problem of modelling several objects in a sequence of images. If there are M objects, each with K possible positions and orientations in the image, there are K^M possible states for the system underlying an image. A standard HMM would require K^M distinct states to model the system. This way of representing the system is not only inefficient but also difficult to interpret. We would prefer an HMM that captures the state space using M separate K-state variables.
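The gap between the two representations can be made concrete with a small computation. The numbers below (M = 3 objects, K = 4 states each) are illustrative choices, not taken from the text:

```python
# Hypothetical example: M objects, each with K possible
# position/orientation states.
M, K = 3, 4

# A standard HMM must enumerate every joint configuration,
# so it needs K**M distinct states.
standard_hmm_states = K ** M

# A factorial HMM instead describes the same state space with
# M separate K-state variables, i.e. only M * K state values.
factorial_state_values = M * K

print(standard_hmm_states)    # 64 joint states
print(factorial_state_values) # 12 state values
```

Even at this small scale the joint enumeration is several times larger, and the gap grows exponentially in M.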
A factorial HMM is such a representation. In this model, there are multiple independent Markov chains of latent variables, and the distribution of the observed variable at any given time step is conditioned on the states of all the latent chains at that time step. The graphical model of the system can be represented as follows:
The motivation for considering a factorial HMM can be seen by noting that, in order to represent, say, 10 bits of information at a given time step, a standard HMM would need K = 2^10 = 1024 latent states, whereas a factorial HMM could make use of 10 binary latent chains. However, this representation introduces additional complexity in training, as we will see in later chapters.
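The generative process just described can be sketched in a few lines. This is a minimal illustration, not an implementation from the text: it assumes 10 binary chains that each flip state independently with a small probability, and an arbitrary linear-Gaussian emission that depends on all chains at each step.

```python
import numpy as np

rng = np.random.default_rng(0)

M, T = 10, 5      # 10 binary latent chains, 5 time steps
p_flip = 0.1      # per-chain probability of switching state

# Each latent chain evolves independently as a two-state Markov chain.
states = np.zeros((T, M), dtype=int)
states[0] = rng.integers(0, 2, size=M)
for t in range(1, T):
    flips = rng.random(M) < p_flip
    states[t] = np.where(flips, 1 - states[t - 1], states[t - 1])

# The observation at each time step depends on ALL chains at that
# step; here a linear-Gaussian emission with an arbitrary weight
# vector w, chosen purely for illustration.
w = rng.normal(size=M)
observations = states @ w + 0.1 * rng.normal(size=T)

# The 10 binary chains jointly encode 2**10 = 1024 configurations,
# matching the state count a standard HMM would need.
print(states.shape)        # (5, 10)
print(observations.shape)  # (5,)
```

Note that although sampling from this model is trivial, exact inference over the latent chains is not, since the chains become coupled once the observations are conditioned on; this is the source of the training complexity mentioned above.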