What is a Hidden Markov Model?


If you’re reading this I can probably assume you’re already motivated to learn about Hidden Markov Models. Still, for the sake of establishing some context, I’m going to start by saying a little bit about what these things are and what they’re useful for.

Let’s suppose we have a discrete process that we wish to model (in this article, all of our processes will be discrete, a decision motivated by the fact that I want to discuss using these algorithms on a computer; continuous variants of HMMs exist, although I do not work with them myself). At each point in time, the process is in one of a finite number of states. We don’t know what state the process started in, nor do we know which states it has passed through, but we do know that its state at time t is only allowed to depend on its state at time t-1. We also can’t observe the state directly: at each time, the process emits an observation whose distribution depends only on the current state. Furthermore, we know the following:

  • The probability that the process began in a given state,
  • The probability that, if it’s in state s_i at time t, it will be in state s_j at time t+1,
  • The probability that, if it’s in state s_i at time t, it will emit a given observation at time t.

A Hidden Markov Model is a mathematical model for this type of situation. The “Markov” part of the title comes from the fact that the model makes the Markov assumption: that the process’ state at a particular time is determined (probabilistically) only by the state it was in during the immediately-prior time. The “Hidden” part comes from the assumption that we don’t know for sure which state the process was actually in at any time.

The mathematics associated with HMMs is primarily concerned with empirically picking the HMM parameters from some sample data, and with using the HMMs to draw conclusions about the probability that a process went through a particular sequence of states.

The interesting thing about HMMs is that they can be used very effectively to solve various pattern-recognition problems, such as speech, handwriting, and gesture recognition. Furthermore, the algorithms associated with HMMs are pleasantly efficient, meaning that they can be used in many real-time systems to do these tasks.

In this article, we will describe the major HMM algorithms, and will show how HMMs can be applied to the gesture-recognition problem.

To help keep things concrete, I want to start by establishing something of a running example. Naturally I’ve decided to pick something from my own life. Experience tells me that, on any given night, I’m probably eating one of three things for dinner: either it’s pasta, or it’s a burrito, or it’s dahl. I don’t usually like to eat the same thing two nights in a row, although my friends will readily tell you that this isn’t exactly unheard of. However, what I eat on one day is heavily influenced by what I had the previous day (for some reason). Here’s a graph that shows the probability of eating one thing tomorrow given what I ate today:

Graph of our Markov process

Figure 1. Given what I had for dinner last night, the probability of what I will have tonight.

Now suppose you’re my mother and you are worried about whether or not my hippie vegetarian diet is giving me the nutrients I need. You want to know what I ate last week so that you can plan your nagging accordingly, although you were unable to figure out (directly) what I’ve been eating. However, you do have the following information:

  • You have the graph up above. You got it because you asked me for it, and (being a nerd) this is the type of information I’ve been tracking in Excel for the past year, so I was able to toss you a bone and send you the probability graph.
  • You haven’t been asking me what I’ve been eating (because that would annoy me), but you have been going through the trash in my kitchen every night (you have a key to my apartment, or perhaps you can pick locks, I can’t remember which). The trash isn’t perfectly correlated with what I’ve been eating, but different things in my trash are linked probabilistically with my diet. In fact, I’ve been tracking this as well, and for some reason didn’t think it odd when you asked for the data. I’ve diagrammed the relationship between dinner and trash in the graph below.

The probability that my dinner led to something in my trash.

Figure 2. Given what I had for dinner that night, the probability that it will lead to finding either a plastic bowl or some tin foil wrapper in my trash.

Being my mom, you are extremely educated and have decided to model your son’s diet (that is, my diet) using a Hidden Markov Model. In this article, we will describe how you do this.

Basics and definitions

Definition (Hidden Markov Model)

Let Q = \lbrace q_i \rbrace_{i=1}^N be a set of states, and let V = \lbrace v_i \rbrace_{i=1}^M be a set of emissions. A Hidden Markov Model on Q and V is a triple \lambda = (A,B,\pi), where

  1. \pi is the probability distribution function denoting the probability \pi(q_i) = P(S_1 = q_i),
  2. A is an N \times N matrix (called the state transition matrix) where entry a_{ij} is given by a_{ij} = P(S_{t+1} = q_j | S_t = q_i),
  3. and B is an N\times M matrix (called the emission matrix) where entry b_{ij} = P(O_t = v_j | S_t = q_i). Later we will write b_i(o_t) for the entry of B corresponding to state q_i and observation o_t.

In terms of our example about my dinners:

  1. Q is the set of things I might have for dinner (pasta, burritos, or dahl),
  2. V is the set of things that wind up in my trash as a consequence of what I’ve had for dinner (plastic bowls or tin foil wrappers),
  3. A is the adjacency matrix of the graph in Figure 1,
  4. and B is the adjacency matrix of the graph in Figure 2.
  5. Lastly, \pi tells you what the probability is that, at the beginning of the week, I had pasta, a burrito, or dahl for dinner.

Oh, and we need to interpret S_t and O_t as what I had for dinner on night t and what I put in my trash on night t, respectively.
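To make this concrete, here is a small Python sketch of the triple \lambda = (A,B,\pi) for the dinner example. The actual probabilities live in Figures 1 and 2, which I can’t reproduce here, so every number below is a hypothetical stand-in:

```python
import numpy as np

# Hypothetical numbers -- the real probabilities live in Figures 1 and 2.
states = ["pasta", "burrito", "dahl"]        # Q, so N = 3
emissions = ["plastic bowl", "tin foil"]     # V, so M = 2

pi = np.array([0.4, 0.3, 0.3])               # initial distribution: pi(q_i) = P(S_1 = q_i)

# A[i, j] = P(S_{t+1} = q_j | S_t = q_i); each row is a distribution
A = np.array([
    [0.1, 0.5, 0.4],   # after pasta
    [0.4, 0.2, 0.4],   # after a burrito
    [0.5, 0.4, 0.1],   # after dahl
])

# B[i, k] = P(O_t = v_k | S_t = q_i)
B = np.array([
    [0.8, 0.2],        # pasta usually leaves a plastic bowl
    [0.1, 0.9],        # a burrito usually leaves tin foil
    [0.7, 0.3],        # dahl usually leaves a plastic bowl
])

# Sanity checks: every row of A and B, and pi itself, must sum to 1
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
assert np.isclose(pi.sum(), 1.0)
```

Each row of A and B is a probability distribution, which is worth checking whenever you build a model by hand.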

Questions about HMMs that we can easily answer

Once we’ve decided to model a process using a Hidden Markov Model, there are three questions that often come up:

  1. The Evaluation Problem. For each possible sequence of observations o_1,o_2,...,o_T, what is the probability that the process will lead to those observations? In the context of our example, for instance, we might ask
    • “What is the probability of finding a plastic bowl in my trash on Monday, Tuesday, and Thursday, and some tin foil wrapper in my trash on Wednesday and Friday?”
    • “How does this compare to the probability of finding a plastic bowl in my trash every night of the week? Or the probability of finding tin foil wrapper on Monday and a plastic bowl every other night?”
    • “What is the most likely sequence of trash, given my eating habits?”

    These questions are all about what observation we’re most likely to have.

  2. The Decoding Problem. Given a sequence of observations, what sequence of states was most likely to lead to these observations? In the context of our current example, we might ask
    • “Given that a plastic bowl was found in my trash every night of the week, what was I most likely eating each night?”
    • “Given that a plastic bowl was found in my trash on Monday and Thursday, and tin foil wrapper was found every other night, what was I most likely eating each night?”

    These questions are all about what most likely caused the observation.

  3. The Learning Problem. Suppose we have some reason to expect that a particular sequence of observations is most likely to occur. How can we tweak our model parameters so that the model predicts that this observation is most likely? This question comes up when we’ve decided to use a Hidden Markov Model, but we don’t already have perfect data about the probabilities involved in the choice of A, B, and \pi. In our example, this would come up if my mother already knew what most often showed up in my trash, but didn’t have the information contained in Figures 1 and 2.

Hidden Markov Models are popular because of the happy combination of the following two facts: First, these are questions we often want to ask about mathematical models. Second, in the case of Hidden Markov Models, there are efficient algorithms for answering all three of these questions.

In the next three sections, we will look at algorithms for addressing each of these three problems.

The Evaluation Problem

From a mathematical perspective, the Evaluation Problem can be defined like so:

Definition (Evaluation Problem)

Let \lambda be a Hidden Markov Model, and let O = o_1,o_2,...,o_T be a sequence of observations. Compute the probability P(O | \lambda).

This probability can be computed naively using the information in \lambda = (A,B,\pi). Since our Hidden Markov Model has only a finite number of states and observations, we could exhaustively enumerate every possible sequence of states, compute the probability of the observations along each one, and sum the results appropriately. Computationally, this takes on the order of N^T operations, where N is the number of states and T is the length of the sequence of observations. This is an unreasonably large computation in most real-world applications.

However, there is an algorithm for solving this problem that is linear in T. It is called the Forward Algorithm, and our purpose in this section is to describe how it works.

For each t = 1,2,...,T, we define a function \alpha_t like so:

\alpha_t(i) = P(o_1,o_2,...,o_t, S_t = q_i | \lambda) .

Direct computation of \alpha_t might seem tricky, but there is a recurrence relation that makes it simple to compute:

\alpha_{t+1}(j) = b_j(o_{t+1})\sum\limits_{i=1}^N \alpha_t(i) a_{ij} where 1 \leq j \leq N and t = 1,2,...,T-1,

where the base case is given by

\alpha_1(j) = \pi_j b_j(o_1) for 1 \leq j \leq N.

So, to compute P(O | \lambda), we use the recurrence relation to find \alpha_T(i) for i = 1,2,...,N, and then calculate the sum

P(O | \lambda) = \sum\limits_{i=1}^N \alpha_T(i).
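The whole Forward Algorithm fits in a few lines of Python. The `forward` function below follows the recurrence above exactly; the model parameters are hypothetical stand-ins for the Figure 1 and 2 values:

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward Algorithm: returns P(O | lambda) in O(N^2 * T) time."""
    alpha = pi * B[:, obs[0]]          # base case: alpha_1(j) = pi_j b_j(o_1)
    for o in obs[1:]:
        # recurrence: alpha_{t+1}(j) = b_j(o_{t+1}) * sum_i alpha_t(i) a_{ij}
        alpha = B[:, o] * (alpha @ A)
    return alpha.sum()                 # P(O | lambda) = sum_i alpha_T(i)

# Hypothetical example parameters (not the real Figure 1/2 numbers):
pi = np.array([0.4, 0.3, 0.3])
A = np.array([[0.1, 0.5, 0.4],
              [0.4, 0.2, 0.4],
              [0.5, 0.4, 0.1]])
B = np.array([[0.8, 0.2],
              [0.1, 0.9],
              [0.7, 0.3]])

obs = [0, 1, 0]   # plastic bowl, tin foil, plastic bowl
p = forward(pi, A, B, obs)
```

Note that the loop body is just a matrix-vector product, which is what makes the algorithm linear in T.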

The Decoding Problem

From a mathematical perspective, the Decoding Problem can be defined like so:

Definition (Decoding Problem)

Let \lambda be a Hidden Markov Model, and let O = o_1,o_2,...,o_T be a sequence of observations. Find the sequence S = s_1,s_2,...,s_T that maximizes P(S | O,\lambda).

The most common algorithm for solving this problem is the Viterbi algorithm.

As with the forward algorithm, we define an auxiliary function for each t = 1,2,...,T:

\delta_t(i) = \max\limits_{s_1,s_2,...,s_{t-1}} P(s_1,s_2,...,s_{t-1}, S_t = q_i, o_1,o_2,...,o_t | \lambda) .

In practice, \delta_t is computed using the following recurrence relation:

\delta_{t+1}(j) = b_j(o_{t+1})\left(\max\limits_{1 \leq i \leq N} \delta_t(i) a_{ij}\right) for 1 \leq j \leq N and t = 1,2,...,T-1

with the base case

\delta_1(j) = \pi_j b_j(o_1) for 1 \leq j \leq N.

Now, to find the most-likely sequence of states, we use the \delta functions to trace back the choices that achieved each maximum. Due to the structure of the recurrence relation, we actually backtrack, starting with the chronologically-last state and working towards the chronologically-first state.

In particular, for t = T-1,T-2,...,1, we define \psi_t(j) = \underset{i}{\arg\max}\, \delta_{t}(i) a_{ij}. Then, if we define

s_T = \underset{i}{\arg\max}\, \delta_T(i),

we can recursively define s_t for t = T-1,T-2,...,1 by

s_t = \psi_t(s_{t+1}).

The sequence s_1,s_2,...,s_T defined in this way can be shown to solve the Decoding Problem.
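Here is the Viterbi algorithm, recurrence and backtracking included, as a Python sketch (again with hypothetical model parameters standing in for the real ones):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Viterbi algorithm: most-likely state sequence for a given observation sequence."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    delta[0] = pi * B[:, obs[0]]            # delta_1(j) = pi_j b_j(o_1)
    for t in range(1, T):
        # delta_{t+1}(j) = b_j(o_{t+1}) * max_i delta_t(i) a_{ij}
        delta[t] = B[:, obs[t]] * (delta[t-1][:, None] * A).max(axis=0)
    # Backtrack from the most-likely final state
    path = [int(delta[T-1].argmax())]       # s_T = argmax_i delta_T(i)
    for t in range(T - 2, -1, -1):
        # s_t = argmax_i delta_t(i) a_{i, s_{t+1}}
        path.append(int((delta[t] * A[:, path[-1]]).argmax()))
    return path[::-1]

pi = np.array([0.4, 0.3, 0.3])
A = np.array([[0.1, 0.5, 0.4],
              [0.4, 0.2, 0.4],
              [0.5, 0.4, 0.1]])
B = np.array([[0.8, 0.2],
              [0.1, 0.9],
              [0.7, 0.3]])

path = viterbi(pi, A, B, [0, 1, 0])   # bowl, foil, bowl
```

With these particular numbers the decoded dinners are pasta, burrito, pasta, which matches the intuition that bowls suggest pasta or dahl while foil suggests a burrito.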

Now, it’s worth pointing out that this algorithm — while very efficient — does have a serious drawback in many applications. Pay particular attention to the fact that the solution produced by the algorithm is computed recursively in terms of the last observation in the sequence. Now suppose we wanted to solve the Decoding Problem in a real-time system, where observations were being made as the system was running. What then? We may wish to solve the Decoding Problem “in time” as observations are being made, but the Viterbi algorithm is only applicable after the final observation has been recorded.

In these applications, a work-around is needed. One solution is to continuously employ the Viterbi algorithm on the partial sequence of observations, noting that the partial state sequence can change in time as a result of new observations.
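As a sketch of this work-around, we can simply rerun Viterbi on each growing prefix of the observation stream; earlier entries of the decoded sequence may change as new observations arrive. The parameters below are hypothetical:

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Standard Viterbi decode (recurrence plus backtracking)."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        delta[t] = B[:, obs[t]] * (delta[t-1][:, None] * A).max(axis=0)
    path = [int(delta[T-1].argmax())]
    for t in range(T - 2, -1, -1):
        path.append(int((delta[t] * A[:, path[-1]]).argmax()))
    return path[::-1]

pi = np.array([0.4, 0.3, 0.3])
A = np.array([[0.1, 0.5, 0.4], [0.4, 0.2, 0.4], [0.5, 0.4, 0.1]])
B = np.array([[0.8, 0.2], [0.1, 0.9], [0.7, 0.3]])

# Decode again after each new observation; partial answers are provisional.
stream = [0, 1, 1, 0, 1]
partial_decodes = [viterbi(pi, A, B, stream[:t]) for t in range(1, len(stream) + 1)]
```

Each full re-decode costs O(N^2 t), so for long streams one usually prefers incremental schemes, but this naive version conveys the idea.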

The Learning Problem

From a (somewhat) mathematical perspective, the Learning Problem can be defined like so:

Definition (Learning Problem)

Let \lambda be a Hidden Markov Model and let O = o_1,o_2,...,o_T be a sequence of observations. Find a Hidden Markov Model \hat\lambda, “similar” to \lambda, which maximizes the probability of observing O.

Of course, what we mean by “similar” is pretty vague. Generally we expect the new model \hat\lambda to have the same state and observation sets. Intuitively, we want to think of these as methods for making slight adjustments to an already somewhat-working model.

There are three popular algorithms for addressing the Learning Problem: the Forward-Backward Algorithm, the Baum-Welch Algorithm, and Viterbi Training. In this section, we will focus on the second of these exclusively.

The purpose of the Baum-Welch algorithm is to produce a refinement \hat\lambda of \lambda such that P(O | \hat\lambda) \geq P(O | \lambda), with equality only when \lambda is a fixed point of the algorithm. It is an example of local optimization, in that the algorithm is not guaranteed to maximize P(O|\hat\lambda), but it is guaranteed not to make the probability any lower.

The Baum-Welch algorithm uses the forward-variable \alpha_t defined earlier when we talked about the Evaluation Problem. It also uses a similar function, the backward-variable, which we define as \beta_t(j)=P(o_{t+1},o_{t+2},...,o_T | S_t=q_j,\lambda) for t=1,2,...,T. We can compute \beta_t recursively since

\beta_t(i) = \sum\limits_{j=1}^N \beta_{t+1}(j) a_{ij} b_j(o_{t+1}) for t = 1,2,...,T-1 and 1 \leq i \leq N

with the base case

\beta_T(i) = 1 where 1 \leq i \leq N.

With the forward- and backward-variables defined, we can introduce a new variable \gamma_t(i,j) = P(S_t=q_i,S_{t+1}=q_j|O,\lambda), which we can compute (due to Bayes’ rule) as

\gamma_t(i,j) = \frac{\alpha_t(i) a_{ij} b_j(o_{t+1}) \beta_{t+1}(j)}{\sum\limits_{i=1}^N\sum\limits_{j=1}^N \alpha_t(i) a_{ij} \beta_{t+1}(j) b_j(o_{t+1})} for 1 \leq i,j \leq N and t = 1,2,...,T-1.

Now, \gamma_t(i,j) is defined up through time T-1. To encode information at time T, we introduce (by abuse of notation) \gamma_t(i) = P(S_t = q_i | O,\lambda), which we can compute as

\gamma_t(i) = \frac{\alpha_t(i) \beta_t(i)}{\sum\limits_{k=1}^N \alpha_t(k) \beta_t(k)}.

In terms of these variables, we compute \hat\lambda = (\hat A, \hat B, \hat \pi) according to the equations

\hat\pi_i = \gamma_1(i) for 1 \leq i \leq N,

\hat a_{ij} = \frac{\sum\limits_{t=1}^{T-1} \gamma_t(i,j)}{\sum\limits_{t=1}^{T-1} \gamma_t(i)} for 1 \leq i,j \leq N,

\hat b_j(k) = \frac{\sum\limits_{t \,:\, o_t = v_k}\gamma_t(j)}{\sum\limits_{t=1}^{T} \gamma_t(j)} for 1 \leq j \leq N, 1 \leq k \leq M.

So, by iterating this process until the difference P(O|\hat\lambda) - P(O|\lambda) is small enough to be negligible, we can increase the probability of observing O as predicted by our model.
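To see how the pieces fit together, here is a Python sketch of one Baum-Welch re-estimation step, built directly from the forward and backward variables, the \gamma’s, and the re-estimation formulas above. The model parameters and observation sequence are hypothetical stand-ins:

```python
import numpy as np

def baum_welch_step(pi, A, B, obs):
    """One Baum-Welch re-estimation step: lambda -> lambda-hat."""
    T, N, M = len(obs), len(pi), B.shape[1]

    # Forward variables: alpha_t(i) = P(o_1..o_t, S_t = q_i | lambda)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = B[:, obs[t]] * (alpha[t-1] @ A)

    # Backward variables: beta_t(i) = P(o_{t+1}..o_T | S_t = q_i, lambda)
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t+1]] * beta[t+1])

    # gamma_t(i, j): probability of the transition q_i -> q_j at time t, given O
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        num = alpha[t][:, None] * A * B[:, obs[t+1]][None, :] * beta[t+1][None, :]
        xi[t] = num / num.sum()

    # gamma_t(i): probability of being in q_i at time t, given O
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)

    # Re-estimation formulas
    pi_hat = gamma[0]
    A_hat = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_hat = np.zeros_like(B)
    for k in range(M):
        B_hat[:, k] = gamma[np.array(obs) == k].sum(axis=0) / gamma.sum(axis=0)
    return pi_hat, A_hat, B_hat

def likelihood(pi, A, B, obs):
    """P(O | lambda) via the forward algorithm."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = B[:, o] * (alpha @ A)
    return alpha.sum()

pi = np.array([0.4, 0.3, 0.3])
A = np.array([[0.1, 0.5, 0.4], [0.4, 0.2, 0.4], [0.5, 0.4, 0.1]])
B = np.array([[0.8, 0.2], [0.1, 0.9], [0.7, 0.3]])
obs = [0, 1, 0, 0, 1]

pi2, A2, B2 = baum_welch_step(pi, A, B, obs)
```

Iterating `baum_welch_step` until the likelihood stops improving implements the full algorithm; each step satisfies P(O | \hat\lambda) \geq P(O | \lambda).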

Practical concerns

So we’ve talked about the major algorithms associated with Hidden Markov Models, and we’ve even talked a tiny bit about their use. However, we haven’t yet described the issues that come up with modeling things more complicated than my dinner habits. At this point in the article we turn our attention to some of the important issues that users of Hidden Markov Models must face.

Computers and numeric representation

“Oh good,” you’re thinking to yourself, “he’s going to talk about the exciting issues surrounding the representation of numbers.” I promise, I’ll keep it brief.

I’m typing this article on a MacBook Pro with an Intel Core 2 Duo processor. This is the computer that I do most of my work on, including much of my math modeling. When working with small numbers — say, numbers between 0 and 1 — the processor’s ability to accurately represent the numbers I’m working with is very important.

Let me give you an example of what I mean. A single-precision IEEE Standard 754 floating point number is unable to represent positive numbers smaller than roughly 10^{-38}, or about (0.5)^{126}. This is not a lot of precision for applications that need to deal with long Markov chains. Double-precision numbers can represent values as small as roughly 10^{-308}, which is about (0.5)^{1022}: dramatically more precision, though not necessarily enough.

If you find yourself in need of more precision, here’s a computational trick that is often used: rather than performing your computations with the probability p \in [0,1], instead perform your computations using the number \hat p = \log p. Since p \in [0,1], we have that \hat p \in (-\infty,0]. Storing -\hat p, which is nonnegative, as a (possibly fixed-point) unsigned integer gives you dramatically more dynamic range. Of course, you will need to modify your algorithms appropriately: anywhere you compute p_1 \cdot p_2, you will instead need to compute \hat p_1 + \hat p_2, and sums of probabilities need a little more care (the so-called log-sum-exp trick). The bookkeeping is fiddly, although the representation benefits can be substantial.
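Here is a tiny Python demonstration of the idea. Multiplication becomes addition of logs, and for sums numpy provides `logaddexp`, a numerically stable two-argument log-sum-exp:

```python
import numpy as np

# Two probabilities far too small for their product to survive in double precision
p1, p2 = 1e-200, 3e-200

log_p1, log_p2 = np.log(p1), np.log(p2)

log_product = log_p1 + log_p2            # log(p1 * p2), no underflow
log_sum = np.logaddexp(log_p1, log_p2)   # log(p1 + p2), computed stably

assert p1 * p2 == 0.0                    # the direct product underflows to zero
assert np.isclose(np.exp(log_sum), p1 + p2)
```

The direct product is 3e-400, far below the smallest representable double, yet its logarithm (about -920) is a perfectly comfortable number to work with.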

Okay, that’s all I want to say about number representation.

The learning problem revisited

Next I want to focus on what is arguably a more important issue: designing your Hidden Markov Model around some empirical data.

Earlier we talked about the Baum-Welch algorithm, and how it can be used to refine an existing Hidden Markov Model so as to make it more suitable for a particular dataset. The trouble is, we didn’t talk about how to pick that initial Hidden Markov Model!

There aren’t any sure-fire methods (that I’m aware of) for generating a Hidden Markov Model from scratch to match some empirical data. It’s more of a trial and error thing, really. You begin by making something of a “guess” at what a good model might be, then use Baum-Welch to tune the probabilities accordingly.

The most important decisions you must make when designing your initial model are the following:

  1. How many states do I think my model needs?
  2. How many possible observations do I think there are?
  3. What kind of topology do I think my state transition graph should have?

The first decision is usually made by someone who thinks they already have a little bit of insight into the process that is being modeled. If no one is really up to this task, in practice people will make a few different attempts at modeling and see how much of a difference it makes. Sometimes, if you start out with too many states, the Baum-Welch algorithm will effectively ignore the unnecessary states (say, by making it highly improbable that other states will transition into them).

The second decision is usually made by someone who has some insight into how the observations are made. If the observation is being made with a piece of equipment that has very coarse resolution, there’s no need to have very many possible observations in the model.

The last decision is again one that requires some insight into the process. There are a few common topologies that people like to use as a starting point, though. If you’re not sure which to use, try them each and check the results.

Figure 3. Common topologies for Hidden Markov Models.

Of course, technically speaking the graph for your Hidden Markov Model is a complete graph, since (in principle) any state could transition to any other. However, if you make the transition probability from one state to another equal to zero, you have (in effect) removed this edge from the graph.
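For instance, the popular “left-to-right” topology is obtained by zeroing every below-diagonal entry of A, so each state can only persist or move forward. The matrix here is hypothetical:

```python
import numpy as np

# A hypothetical 4-state left-to-right transition matrix: each state may only
# stay put or advance, so everything below the diagonal is zero.
A_left_right = np.array([
    [0.6, 0.4, 0.0, 0.0],
    [0.0, 0.7, 0.3, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.0, 1.0],
])

# The zeros are the "removed" edges of the complete graph
assert np.allclose(A_left_right, np.triu(A_left_right))
assert np.allclose(A_left_right.sum(axis=1), 1.0)
```

Conveniently, Baum-Welch preserves these zeros: if a_{ij} = 0, then every \gamma_t(i,j) = 0, so the re-estimated \hat a_{ij} is zero as well, and the topology survives training.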

What’s next

In our next post, we will look at a very concrete example: using Hidden Markov Models to recognize gestures made by a person using a Nintendo Wii Remote.


This article could not have been compiled without the following references, each of which I highly recommend:

  • Warakagoda, N., et al. Hidden Markov Models. Referenced 24 April 2008.
  • Hidden Markov Models. Referenced 24 April 2008.
  • Fink, G. Markov Models for Pattern Recognition: From Theory to Application. Springer, 2008.

Of course, Wikipedia also has a decent article on HMMs.


