If you’re reading this I can probably assume you’re already motivated to learn about Hidden Markov Models. Still, for the sake of establishing some context, I’m going to start by saying a little bit about what these things are and what they’re useful for.
Let’s suppose we have a discrete process that we wish to model. (In this article, all of our processes will be discrete, a decision motivated by the fact that I want to discuss using these algorithms on a computer. Continuous variants of HMMs exist, although I do not work with them myself.) At each point in time, the process is in one of a finite number of states. We don’t know what state the process started in, nor do we know what states the process has passed through, but we do know that its state at time $t$ is only allowed to depend on its state at time $t-1$. Furthermore, we know the following:
- The probability that the process began in a given state,
- The probability that, if it’s in state $s_i$ at time $t$, it will be in state $s_j$ at time $t+1$,
- The probability that, when the process is in a given state, it produces a particular observable output.
A Hidden Markov Model is a mathematical model for this type of situation. The “Markov” part of the name comes from the fact that the model makes the Markov assumption: the process’s state at a particular time is determined (probabilistically) only by the state it was in at the immediately prior time. The “Hidden” part comes from the assumption that we don’t know for sure which state the process was actually in at any time.
The mathematics associated with HMMs is primarily concerned with empirically estimating the HMM parameters from sample data, and with using an HMM to draw conclusions about the probability that a process went through a particular sequence of states.
The interesting thing about HMMs is that they can be used very effectively to solve various pattern-recognition problems, such as speech, handwriting, and gesture recognition. Furthermore, the algorithms associated with HMMs are pleasantly efficient, meaning that they can be used in many real-time systems to do these tasks.
In this article, we will describe the major HMM algorithms, and we will show how HMMs can be applied to the gesture-recognition problem.
To help keep things concrete, I want to start by establishing something of a running example. Naturally I’ve decided to pick something from my own life. Experience tells me that, on any given night, I’m probably eating one of three things for dinner: either it’s pasta, or it’s a burrito, or it’s dahl. I don’t usually like to eat the same thing two nights in a row, although my friends will readily tell you that this isn’t exactly unheard of. However, what I eat on one day is heavily influenced by what I had the previous day (for some reason). Here’s a graph that shows the probability of eating one thing tomorrow given what I ate today:
Figure 1. Given what I had for dinner today, the probability of what I’ll eat for dinner tomorrow.
Now suppose you’re my mother, and you are worried about whether my hippie vegetarian diet is giving me the nutrients I need. You want to know what I ate last week so that you can plan your nagging accordingly, but you have been unable to figure out (directly) what I’ve been eating. However, you do have the following information:
- You have the graph up above. You got it because you asked me for it, and (being a nerd) this is the type of information I’ve been tracking in Excel for the past year, so I was able to toss you a bone and send you the probability graph.
- You haven’t been asking me what I’ve been eating (because that would annoy me), but you have been going through the trash in my kitchen every night (you have a key to my apartment, or perhaps you can pick locks, I can’t remember which). The trash isn’t perfectly correlated with what I’ve been eating, but different things in my trash are linked probabilistically with my diet. In fact, I’ve been tracking this as well, and for some reason didn’t think it odd when you asked for the data. I’ve diagrammed the relationship between dinner and trash in the graph below.
Figure 2. Given what I had for dinner that night, the probability that it will lead to finding either a plastic bowl or some tin foil wrapper in my trash.
Being my mom, you are extremely educated and have decided to model your son’s diet (that is, my diet) using a Hidden Markov Model. In this article, we will describe how you do this.
Definition (Hidden Markov Model)
Let $S = \{s_1, \dots, s_N\}$ be a set of states, and let $V = \{v_1, \dots, v_M\}$ be a set of emissions. A Hidden Markov Model on $S$ and $V$ is a triple $\lambda = (\pi, A, B)$, where
- $\pi$ is the probability distribution function denoting the probability $\pi(i) = P(x_1 = s_i)$ that the process begins in state $s_i$,
- $A$ is an $N \times N$ matrix (called the state transition matrix) where entry $a_{ij}$ is given by $a_{ij} = P(x_{t+1} = s_j \mid x_t = s_i)$,
- and $B$ is an $N \times M$ matrix (called the emission matrix) where entry $b_i(k) = P(y_t = v_k \mid x_t = s_i)$.
In terms of our example about my dinners:
- $S$ is the set of things I might have for dinner (pasta, burritos, or dahl),
- $V$ is the set of things that wind up in my trash as a consequence of what I’ve had for dinner (plastic bowls or tin foil wrappers),
- $A$ is the adjacency matrix of the graph in Figure 1,
- and $B$ is the adjacency matrix of the graph in Figure 2.
- Lastly, $\pi$ tells you the probability that, at the beginning of the week, I had pasta, a burrito, or dahl for dinner.
Oh, and we need to interpret $x_t$ and $y_t$ as what I had for dinner on night $t$ and what I put in my trash on night $t$, respectively.
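To keep the pieces straight, here is the dinner model written out as plain Python data. The actual probabilities live in Figures 1 and 2, which aren’t reproduced in this text, so every number below is a made-up stand-in chosen only so that the rows form valid distributions:

```python
STATES = ["pasta", "burrito", "dahl"]      # the state set S
EMISSIONS = ["bowl", "foil"]               # the emission set V

# pi[i] = P(dinner on night 1 is STATES[i]) -- hypothetical values
pi = [1/3, 1/3, 1/3]

# A[i][j] = P(dinner on night t+1 is STATES[j] | dinner on night t is STATES[i])
A = [
    [0.2, 0.6, 0.2],   # after pasta
    [0.5, 0.1, 0.4],   # after a burrito
    [0.4, 0.4, 0.2],   # after dahl
]

# B[i][k] = P(trash on night t is EMISSIONS[k] | dinner on night t is STATES[i])
B = [
    [0.8, 0.2],   # pasta
    [0.1, 0.9],   # burrito
    [0.7, 0.3],   # dahl
]

# Sanity check: every row of A and B is a probability distribution.
for row in A + B:
    assert abs(sum(row) - 1.0) < 1e-9
```

Indexing the matrices by state number rather than name keeps the later algorithms simple; the `STATES` list is only there to translate answers back into dinners.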
Once we’ve decided to model a process using a Hidden Markov Model, there are three questions that often come up:
- The Evaluation Problem. For each possible sequence of observations $O = o_1 o_2 \cdots o_T$, what is the probability that the process will lead to those observations? In the context of our example, for instance, we might ask
- “What is the probability of finding a plastic bowl in my trash on Monday, Tuesday, and Thursday, and some tin foil wrapper in my trash on Wednesday and Friday?”
- “How does this compare to the probability of finding a plastic bowl in my trash every night of the week? Or the probability of finding tin foil wrapper on Monday and a plastic bowl every other night?”
- “What is the most likely sequence of trash, given my eating habits?”
These questions are all about what observation we’re most likely to have.
- The Decoding Problem. Given a sequence of observations, what sequence of states was most likely to lead to these observations? In the context of our current example, we might ask
- “Given that a plastic bowl was found in my trash every night of the week, what was I most likely eating each night?”
- “Given that a plastic bowl was found in my trash on Monday and Thursday, and tin foil wrapper was found every other night, what was I most likely eating each night?”
These questions are all about what most likely caused the observation.
- The Learning Problem. Suppose we have some reason to expect that a particular sequence of observations is most likely to occur. How can we tweak our model parameters so that the model predicts that this observation is most likely? This question comes up when we’ve decided to use a Hidden Markov Model, but we don’t already have perfect data about the probabilities involved in the choice of $\pi$, $A$, and $B$. In our example, this would come up if my mother already knew what most often showed up in my trash, but didn’t have the information contained in Figures 1 and 2.
Hidden Markov Models are popular because of the happy combination of the following two facts: First, these are questions we often want to ask about mathematical models. Second, in the case of Hidden Markov Models, there are efficient algorithms for solving all three of these problems.
In the next three sections, we will look at algorithms for addressing each of these three problems.
From a mathematical perspective, the Evaluation Problem can be defined like so:
Definition (Evaluation Problem)
Let $\lambda = (\pi, A, B)$ be a Hidden Markov Model, and let $O = o_1 o_2 \cdots o_T$ be a sequence of observations. Compute the probability $P(O \mid \lambda)$.
This probability can be computed naively using the information in $\lambda$. Since our Hidden Markov Model has only a finite number of states and observations, we could exhaustively compute the probability of the observations under every possible sequence of states and sum the results to find the probability of the sequence we’re interested in. Computationally, this takes on the order of $T \cdot N^T$ operations, where $N$ is the number of states and $T$ is the length of the sequence of observations. This is an unreasonably large computation in most real-world applications.
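To see why the naive approach is exponential, here is a minimal sketch of it in Python: enumerate every possible state sequence, compute the joint probability of that sequence and the observations, and sum. The model numbers are hypothetical stand-ins, not the data from Figures 1 and 2.

```python
from itertools import product

# Hypothetical model parameters (states: pasta, burrito, dahl;
# observations: 0 = plastic bowl, 1 = tin foil).
pi = [1/3, 1/3, 1/3]
A = [[0.2, 0.6, 0.2], [0.5, 0.1, 0.4], [0.4, 0.4, 0.2]]
B = [[0.8, 0.2], [0.1, 0.9], [0.7, 0.3]]

def evaluate_naive(obs):
    """Brute-force Evaluation Problem: sum over all N**T state sequences."""
    N, T = len(pi), len(obs)
    total = 0.0
    for states in product(range(N), repeat=T):   # N**T sequences
        p = pi[states[0]] * B[states[0]][obs[0]]
        for t in range(1, T):
            p *= A[states[t - 1]][states[t]] * B[states[t]][obs[t]]
        total += p
    return total

# Probability of finding bowl, foil, bowl on three consecutive nights:
print(evaluate_naive([0, 1, 0]))
```

Even for this toy model, a week of observations already means $3^7 = 2187$ state sequences; a model with dozens of states becomes intractable almost immediately.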
However, there is an algorithm for solving this problem whose running time is linear in $T$. It is called the Forward Algorithm, and our purpose in this section is to describe how it works.
For each $1 \le i \le N$ and each time $1 \le t \le T$, we define a function $\alpha_t(i)$ like so:

$$\alpha_t(i) = P(o_1 o_2 \cdots o_t, x_t = s_i \mid \lambda)$$

Direct computation of $\alpha_t(i)$ might seem tricky, but there is a recurrence relation that makes it simple to compute:

$$\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \right] b_j(o_{t+1}),$$

where the base case is given by

$$\alpha_1(i) = \pi(i)\, b_i(o_1).$$

So, to compute $P(O \mid \lambda)$, we use the recurrence relation to find $\alpha_T(i)$ for each $1 \le i \le N$, and then calculate the sum

$$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i).$$
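The recurrence translates directly into code. Below is a minimal sketch of the Forward Algorithm in Python; the model numbers are hypothetical stand-ins for the data in Figures 1 and 2 (state order: pasta, burrito, dahl; observation 0 = plastic bowl, 1 = tin foil).

```python
def forward(obs, pi, A, B):
    """Returns alpha, where alpha[t][i] is the joint probability of the
    first t+1 observations and being in state i at time t."""
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]   # base case
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([
            sum(prev[i] * A[i][j] for i in range(N)) * B[j][obs[t]]
            for j in range(N)
        ])
    return alpha

def evaluate(obs, pi, A, B):
    """P(O | lambda): the sum of the final row of forward variables."""
    return sum(forward(obs, pi, A, B)[-1])

# Hypothetical model parameters:
pi = [1/3, 1/3, 1/3]
A = [[0.2, 0.6, 0.2], [0.5, 0.1, 0.4], [0.4, 0.4, 0.2]]
B = [[0.8, 0.2], [0.1, 0.9], [0.7, 0.3]]

# Probability of finding bowl, foil, bowl on three consecutive nights:
print(evaluate([0, 1, 0], pi, A, B))
```

Each time step costs $O(N^2)$ work, so the whole computation is $O(N^2 T)$, linear in $T$ rather than exponential.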
From a mathematical perspective, the Decoding Problem can be defined like so:
Definition (Decoding Problem)
Let $\lambda = (\pi, A, B)$ be a Hidden Markov Model, and let $O = o_1 o_2 \cdots o_T$ be a sequence of observations. Find the sequence of states $x_1 x_2 \cdots x_T$ that maximizes $P(x_1 x_2 \cdots x_T \mid O, \lambda)$.
The most common algorithm for solving this problem is the Viterbi algorithm.
As with the forward algorithm, we define an auxiliary function $\delta_t(i)$ for each $1 \le i \le N$:

$$\delta_t(i) = \max_{x_1, \dots, x_{t-1}} P(x_1 \cdots x_{t-1}, x_t = s_i, o_1 \cdots o_t \mid \lambda)$$

In practice, $\delta_t(i)$ is computed using the following recurrence relation:

$$\delta_{t+1}(j) = \left[ \max_{1 \le i \le N} \delta_t(i)\, a_{ij} \right] b_j(o_{t+1}),$$

with the base case

$$\delta_1(i) = \pi(i)\, b_i(o_1).$$
Now, to find the most-likely sequence of states, we use the functions $\delta_t$ to find the most likely state at each time. Due to the structure of the recurrence relation, we actually backtrack, starting with the chronologically-last state and working towards the chronologically-first state.

In particular, for $1 \le t < T$, we define $\psi_{t+1}(j) = \operatorname{argmax}_{1 \le i \le N} \delta_t(i)\, a_{ij}$. Then, if we define

$$\hat{x}_T = \operatorname{argmax}_{1 \le i \le N} \delta_T(i),$$

we can recursively define $\hat{x}_t$ for $1 \le t < T$ by

$$\hat{x}_t = \psi_{t+1}(\hat{x}_{t+1}).$$

The sequence $\hat{x}_1 \hat{x}_2 \cdots \hat{x}_T$ defined in this way can be shown to solve the Decoding Problem.
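Here is a minimal sketch of the Viterbi algorithm in Python, following the $\delta$ and $\psi$ definitions above. The model numbers are hypothetical stand-ins for the data in Figures 1 and 2.

```python
def viterbi(obs, pi, A, B):
    """delta[t][j] is the probability of the best state sequence ending in
    state j at time t; psi[t][j] records which state that best sequence
    was in at time t-1, so we can backtrack at the end."""
    N, T = len(pi), len(obs)
    delta = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    psi = [[0] * N]                       # psi[0] is never used
    for t in range(1, T):
        d_row, p_row = [], []
        for j in range(N):
            best = max(range(N), key=lambda i: delta[t - 1][i] * A[i][j])
            d_row.append(delta[t - 1][best] * A[best][j] * B[j][obs[t]])
            p_row.append(best)
        delta.append(d_row)
        psi.append(p_row)
    # Backtrack from the most likely final state.
    path = [max(range(N), key=lambda i: delta[T - 1][i])]
    for t in range(T - 1, 0, -1):
        path.append(psi[t][path[-1]])
    path.reverse()
    return path

# Hypothetical model (states: pasta, burrito, dahl; obs 0 = bowl, 1 = foil):
pi = [1/3, 1/3, 1/3]
A = [[0.2, 0.6, 0.2], [0.5, 0.1, 0.4], [0.4, 0.4, 0.2]]
B = [[0.8, 0.2], [0.1, 0.9], [0.7, 0.3]]
STATES = ["pasta", "burrito", "dahl"]

# A plastic bowl was found every night; what was I most likely eating?
print([STATES[i] for i in viterbi([0, 0, 0, 0, 0], pi, A, B)])
```

Note that the only change from the forward recurrence is replacing the sum over previous states with a max, plus the bookkeeping needed for backtracking.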
Now, it’s worth pointing out that this algorithm — while very efficient — does have a serious drawback in many applications. Pay particular attention to the fact that the solution produced by the algorithm is computed recursively in terms of the last observation in the sequence. Now suppose we wanted to solve the Decoding Problem in a real-time system, where observations were being made as the system was running. What then? We may wish to solve the Decoding Problem “in time” as observations are being made, but the Viterbi algorithm is only applicable after the final observation has been recorded.
In these applications, a work-around is needed. One solution is to continuously employ the Viterbi algorithm on the partial sequence of observations, noting that the partial state sequence can change in time as a result of new observations.
From a (somewhat) mathematical perspective, the Learning Problem can be defined like so:
Definition (Learning Problem)
Let $\lambda$ be a Hidden Markov Model and let $O$ be a sequence of observations. Find a Hidden Markov Model $\lambda'$, “similar” to $\lambda$, which maximizes the probability $P(O \mid \lambda')$ of observing $O$.
Of course, what we mean by “similar” is pretty vague. Generally we expect the new model to have the same state and observation sets. Intuitively, we want to think of these as methods for making slight adjustments to an already somewhat-working model.
There are three popular algorithms for addressing the Learning Problem: the Forward-Backward Algorithm, the Baum-Welch Algorithm, and Viterbi Training. In this section, we will focus on the second of these exclusively.
The purpose of the Baum-Welch algorithm is to produce a refinement $\lambda'$ of $\lambda$ such that $P(O \mid \lambda') \ge P(O \mid \lambda)$, with equality only when $\lambda$ is a fixed point of the algorithm. It is an example of a local optimization, in that the algorithm is not guaranteed to maximize $P(O \mid \lambda)$ globally, but it is guaranteed not to make the probability any lower.
The Baum-Welch algorithm uses the forward variable $\alpha_t(i)$ defined earlier when we talked about the Evaluation Problem. It also uses a similar function, the backward variable, which we define as

$$\beta_t(i) = P(o_{t+1} o_{t+2} \cdots o_T \mid x_t = s_i, \lambda)$$

for $1 \le t \le T$. We can compute $\beta_t(i)$ recursively, since

$$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j),$$

with the base case

$$\beta_T(i) = 1.$$
With the forward and backward variables defined, we can introduce a new variable $\xi_t(i, j) = P(x_t = s_i, x_{t+1} = s_j \mid O, \lambda)$, which we can compute (due to Bayes’ rule) as

$$\xi_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}.$$
Now, the variable $\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i, j)$ is defined only up through time $T - 1$. To encode information at time $T$, we introduce (by abuse of notation) $\gamma_T(i)$, which we can compute as

$$\gamma_T(i) = \frac{\alpha_T(i)\, \beta_T(i)}{P(O \mid \lambda)} = \frac{\alpha_T(i)}{P(O \mid \lambda)}.$$
In terms of these variables, we compute the refined model $\lambda' = (\pi', A', B')$ according to the equations

$$\pi'(i) = \gamma_1(i), \qquad a'_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \qquad b'_j(k) = \frac{\sum_{t \,:\, o_t = v_k} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}.$$
So, by iterating this process until the difference $P(O \mid \lambda') - P(O \mid \lambda)$ is small enough to be negligible, we can increase the probability of observing $O$ as predicted by our model.
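Putting the pieces together, here is a minimal sketch of one Baum-Welch re-estimation step in Python (with its own copies of the forward and backward computations so it stands alone). The model numbers are hypothetical stand-ins for Figures 1 and 2, and a production implementation would also address the numerical issues discussed later in the article.

```python
def forward(obs, pi, A, B):
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                      for j in range(N)])
    return alpha

def backward(obs, A, B):
    N, T = len(A), len(obs)
    beta = [[1.0] * N for _ in range(T)]          # base case: beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    return beta

def baum_welch_step(obs, pi, A, B):
    """One re-estimation step: returns the refined model (pi', A', B')."""
    N, M, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
    prob = sum(alpha[-1])                          # P(O | lambda)
    gamma = [[alpha[t][i] * beta[t][i] / prob for i in range(N)]
             for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / prob
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
              sum(gamma[t][j] for t in range(T))
              for k in range(M)] for j in range(N)]
    return new_pi, new_A, new_B

# Hypothetical model; observations: 0 = bowl, 1 = foil.
pi = [1/3, 1/3, 1/3]
A = [[0.2, 0.6, 0.2], [0.5, 0.1, 0.4], [0.4, 0.4, 0.2]]
B = [[0.8, 0.2], [0.1, 0.9], [0.7, 0.3]]
obs = [0, 1, 0, 0, 1]

before = sum(forward(obs, pi, A, B)[-1])
pi2, A2, B2 = baum_welch_step(obs, pi, A, B)
after = sum(forward(obs, pi2, A2, B2)[-1])
print(before, after)
```

Running this, the re-estimated model never assigns the observation sequence a lower probability than the old one, which is exactly the guarantee described above.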
So we’ve talked about the major algorithms associated with Hidden Markov Models, and we’ve even talked a tiny bit about their use. However, we haven’t yet described the issues that come up with modeling things more complicated than my dinner habits. At this point in the article we turn our attention to some of the important issues that users of Hidden Markov Models must face.
“Oh good,” you’re thinking to yourself, “he’s going to talk about the exciting issues surrounding the representation of numbers.” I promise, I’ll keep it brief.
I’m typing this article on a MacBook Pro with an Intel Core 2 Duo processor. This is the computer that I do most of my work on, including much of my math modeling. When working with small numbers — say, numbers between 0 and 1 — the processor’s ability to accurately represent the numbers I’m working with is very important.
Let me give you an example of what I mean. A single-precision IEEE Standard 754 floating point number is unable to represent (normalized) numbers smaller than roughly $2^{-126}$, or about $10^{-38}$. This is not a lot of room for applications that need to deal with long Markov chains, where the probability of a sequence shrinks exponentially with its length. Double-precision numbers can represent values as small as $2^{-1022}$, which is roughly $10^{-308}$: dramatically more range, though not necessarily enough.
If you find yourself in need of more range, here’s a computational trick that is often used: rather than performing your computations with the probability $p$, instead perform your computations using the number $-\log p$. Since $0 < p \le 1$, we have $-\log p \ge 0$. You can often represent this value (suitably scaled) as an unsigned integer, giving you dramatically more representation capability. Of course, you will need to modify your algorithms appropriately: anywhere you compute the product $p_1 p_2$, you will instead need to compute the sum $(-\log p_1) + (-\log p_2)$, for instance. The bookkeeping takes some care, although the representation benefits can be substantial.
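Here is a small illustration of the idea in Python, using floating-point $\log p$ rather than a scaled unsigned integer (the principle is the same). The helper `logaddexp` is my own name for the standard log-sum-exp identity, which you need whenever an algorithm sums probabilities, as the Forward Algorithm does.

```python
import math

def logaddexp(a, b):
    """log(exp(a) + exp(b)), computed without leaving log space."""
    if a < b:
        a, b = b, a
    return a + math.log1p(math.exp(b - a))

# A chain of 10,000 steps with probability 0.1 each underflows an
# IEEE 754 double when computed as a plain product...
p = 1.0
for _ in range(10_000):
    p *= 0.1
print(p)       # prints 0.0 (underflow)

# ...but is perfectly representable in log space:
log_p = sum(math.log(0.1) for _ in range(10_000))
print(log_p)   # about -23025.85
```

Products become sums, which never underflow for realistic sequence lengths; only the occasional sum of probabilities needs the `logaddexp` identity.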
Okay, that’s all I want to say about number representation.
Next I want to focus on what is arguably a more important issue: designing your Hidden Markov Model around some empirical data.
Earlier we talked about the Baum-Welch algorithm, and how it can be used to refine an existing Hidden Markov Model so as to make it more suitable for a particular dataset. The trouble is, we didn’t talk about how to pick that initial Hidden Markov Model!
There aren’t any sure-fire methods (that I’m aware of) for generating a Hidden Markov Model from scratch to match some empirical data. It’s more of a trial and error thing, really. You begin by making something of a “guess” at what a good model might be, then use Baum-Welch to tune the probabilities accordingly.
The most important decisions to make when designing an initial model are the following:
- How many states do I think my model needs?
- How many possible observations do I think there are?
- What kind of topology do I think my state transition graph should have?
The first decision is usually made by someone who thinks they already have a little bit of insight into the process that is being modeled. If no one is really up to this task, in practice people will make a few different attempts at modeling and see how much of a difference it makes. Sometimes, if you start out with too many states, the Baum-Welch algorithm will effectively ignore the unnecessary states (say, by making it highly improbable that other states will transition into them).
The second decision is usually made by someone who has some insight into how the observations are made. If the observation is being made with a piece of equipment that has very coarse resolution, there’s no need to have very many possible observations in the model.
The last decision is again one that requires some insight into the process. There are a few common topologies that people like to use as a starting point, though. If you’re not sure which to use, try them each and check the results.
Of course, technically speaking the graph for your Hidden Markov Model is a complete graph, since (in principle) any state could transition to any other. However, if you make the transition probability from one state to another equal to zero, you have (in effect) removed this edge from the graph.
In our next post, we will look at a very concrete example: using Hidden Markov Models to recognize gestures made by a person using a Nintendo Wii Remote.
This article could not have been compiled without the following references, each of which I highly recommend:
- Warakagoda, N., et al. Hidden Markov Models. Referenced 24 April 2008.
- Hidden Markov Models. Referenced 24 April 2008.
- Fink, G. Markov Models for Pattern Recognition: From Theory to Application. Springer, 2008.
Of course, Wikipedia also has a decent article on HMMs.