What is this about?

In Moscow we have a joint seminar on Statistical Clustering Analysis for students from MSU, HSE, MIPT, and IITP. I recently gave a talk at this seminar on the Sum-Product Algorithm and Factor Graphs. I am not a specialist in the field, but, thankfully, the talk gave rise to a fruitful discussion of the influence of Bayesian methods and statistical physics on image processing. After a brief googling I found two rather profound articles (in Russian): (1) Gibbs Random Fields and (2) Random Fields and Markov Chains in Image Processing. The first one is purely theoretical-physical, the second one is more about image processing and Bayesian inference.

The whole story is based upon the article «Factor Graphs and the Sum-Product Algorithm» (2001) by Kschischang, Frey, and Loeliger. Initially, I studied this article in order to understand the idea behind the clustering algorithm «Affinity Propagation». There are also a couple of publications on Affinity Propagation by Frey and Dueck. Affinity Propagation is a direct competitor of our Adaptive Weight Clustering algorithm, so we wanted to «know the enemy in person».

It turned out that the idea of factor graphs is much more general than a single clustering algorithm, and it is a very beautiful piece of math. The difference between the original article and this post is the following: here I will try to emphasize the key points of my talk at our clustering seminar and clarify some obscure places. Some technicalities and proofs might be omitted. However, it is worth noting that the original article is almost perfect, so there is not much I can do; much of the material is simply taken from there.

I should also warn the reader that, as already said, I am not a specialist in Bayesian theory and graphical models. It is entirely possible that I have missed some natural analogies – please do not hesitate to point that out in the comments!

The sum-product algorithm for factor graphs, in fact, generalizes the following things: the Forward-Backward Algorithm, the Viterbi Algorithm for Markov chains, the Kalman Filter, the Fast Fourier Transform, and some algorithms from the theory of linear codes. That is quite a lot, so I will try to show how all these things can be incorporated into a fairly simple model of message passing on a bipartite graph. This is only the first part, so only one application will be covered.

Sum-product algorithm for marginal functions

Let us start with the following problem.

Let \( x_1, \ldots, x_n \) be a collection of variables, each of which takes values in some finite alphabet \( \Sigma \). Let \( g(x_1, \ldots, x_n) \) be a real-valued function with some specific internal structure that we will discuss later. We are interested in computing the marginal functions: \[ g_i(x_i) \overset{def}= \sum_{x_1} \cdots \sum_{x_{i-1}} \sum_{x_{i+1}} \cdots \sum_{x_{n}} g(x_1, \ldots, x_n), \] where the value \( g_i(a) \) is obtained by summing \( g(x_1, \ldots, x_n) \) over all configurations of the variables with \( x_i = a \). The authors of the original article introduced a special notation for sums of this kind: \[ g_i (x_i) = \sum_{\sim \{ x_i \}} g(x_1, \ldots, x_n). \] For example, if \( h \) is a function of three variables, then \[ h_2(x_2) = \sum_{\sim \{ x_2 \}} h(x_1, x_2, x_3) = \sum_{x_1} \sum_{x_3} h(x_1, x_2, x_3). \]

In general, computing a marginal requires an exponential number of operations. So, we need some assumptions that expose the underlying structure of the function. We suppose that the function \( g(x_1, \ldots, x_n) \) factors into a product of several local functions, i.e. \[ g(x_1, \ldots, x_n) = \prod_{j \in J} f_j (X_j), \] where each \( X_j \) is a subset of \( \{ x_1, \ldots, x_n \} \), and \( J \) is a discrete index set. You can find an example several lines below.
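To make the definition concrete, here is a minimal brute-force sketch in Python (the function `g`, the alphabet \( \Sigma = \{0, 1\} \), and all names are made up for illustration). It computes \( g_i \) by summing over all \( |\Sigma|^n \) configurations, which is exactly the exponential cost we would like to avoid:

```python
from itertools import product

SIGMA = [0, 1]  # the finite alphabet

# A made-up global function of three variables; it happens to factor
# as g(x1, x2, x3) = f1(x1, x2) * f2(x2, x3), but any real-valued g would do here.
def g(x1, x2, x3):
    f1 = 1.0 + x1 * x2
    f2 = 2.0 - x2 * x3
    return f1 * f2

def marginal(i, value, n=3):
    """Brute-force g_i(value): sum g over all configurations with x_i = value."""
    total = 0.0
    for config in product(SIGMA, repeat=n):   # |Sigma|^n terms
        if config[i] == value:
            total += g(*config)
    return total

# g_2(x_2) for every value of the alphabet (i is 0-indexed here)
print({a: marginal(1, a) for a in SIGMA})
```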

Why is this problem interesting? Why do we consider this particular structure? Do we often meet such factorizations in reality?

Well, later we will show that such factorizations occur quite often in various problems, as promised at the beginning of the post. However, the problem of computing marginal sums is only one of the possibilities. Once we grasp the idea of marginal sums, we will be able to move further.


A factor graph is a bipartite graph that expresses the structure of the factorization. The first part represents the variables, and the second part represents the functions. There is an edge from variable node \( x_i \) to factor node \( f_j \) if and only if \( x_i \) is an argument of \( f_j \).


Let \[ g(x_1, x_2, x_3, x_4, x_5) = f_A(x_1) f_B(x_2) f_C(x_1, x_2, x_3) f_D(x_3, x_4) f_E(x_3, x_5). \] Then the factor graph corresponds to the one shown in the figure:

Expression tree

If the factor graph does not contain cycles, then it can be represented as a tree.

It is not always the case that the graph contains no cycles. We shall begin with the situation where the factor graph is a tree, and then we will give some ideas of how this could be generalized. In fact, the most interesting situations are described by graphs that do contain cycles!

The tree structure can be used to simplify the computations using the distributive law. Suppose we want to find \( g_1(x_1) \) from the previous example. Then we can compute marginal sums for nodes \( x_4, x_5 \) first, then «pass the messages» from \( f_D, f_E \) to node \( x_3 \) (we shall see later what exactly this message passing means), then compute marginal sums for \( x_3 \), and so on.

More formally, using the distributive law, the marginal for \( g_1(x_1) \) can be expressed as following: \[ g_1 (x_1) = f_A (x_1) \Big( \sum_{x_2} f_B(x_2) \Big( \sum_{x_3} f_C(x_1, x_2, x_3) \Big( \sum_{x_4} f_D(x_3, x_4) \Big) \Big( \sum_{x_5} f_E(x_3, x_5) \Big) \Big) \Big) \] Moreover, in summary notation, i.e. using \( \sum_{\sim \{ x \}} \) instead of \(\sum_{x}\), the same sum can be expressed as \[ g_1(x_1) = f_A (x_1) \sum_{\sim \{x_1 \}} \Big( f_B(x_2)f_C(x_1, x_2, x_3)
\Big( \sum_{\sim \{x_3\}} f_D(x_3, x_4) \Big) \Big( \sum_{\sim \{x_3\}} f_E(x_3, x_5) \Big) \Big) \]

It may take a while to check the correctness of this expression by hand.
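Instead, one can compare both expressions numerically. Below is a small sketch (random factor tables over \( \Sigma = \{0, 1\} \); the code and names are mine, not from the article) that evaluates the nested distributive-law expression for \( g_1(x_1) \) and the brute-force sum:

```python
import itertools
import random

SIGMA = [0, 1]
random.seed(0)

# Random local factors for g = fA(x1) fB(x2) fC(x1,x2,x3) fD(x3,x4) fE(x3,x5)
fA = {x1: random.random() for x1 in SIGMA}
fB = {x2: random.random() for x2 in SIGMA}
fC = {k: random.random() for k in itertools.product(SIGMA, repeat=3)}
fD = {k: random.random() for k in itertools.product(SIGMA, repeat=2)}
fE = {k: random.random() for k in itertools.product(SIGMA, repeat=2)}

def g(x1, x2, x3, x4, x5):
    return fA[x1] * fB[x2] * fC[x1, x2, x3] * fD[x3, x4] * fE[x3, x5]

def g1_bruteforce(x1):
    return sum(g(x1, *rest) for rest in itertools.product(SIGMA, repeat=4))

def g1_nested(x1):
    # fA(x1) * sum_{x2} fB(x2) * sum_{x3} fC(x1,x2,x3) * (sum_{x4} fD) * (sum_{x5} fE)
    return fA[x1] * sum(
        fB[x2] * sum(
            fC[x1, x2, x3]
            * sum(fD[x3, x4] for x4 in SIGMA)
            * sum(fE[x3, x5] for x5 in SIGMA)
            for x3 in SIGMA)
        for x2 in SIGMA)

for x1 in SIGMA:
    assert abs(g1_bruteforce(x1) - g1_nested(x1)) < 1e-12
print("nested expression matches brute force")
```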


Exercise. Write a decomposition of the same kind for \( g_3 (x_3) \).


When a factor graph is cycle-free, it not only encodes in its structure the factorization of the global function, but also encodes arithmetic expressions by which the marginal functions associated with the global function may be computed.


Do we reduce the complexity of computing marginal sums by using such decompositions instead of brute-force summation?

Computing marginal functions

So far, the tree has been «rooted» at the node \( x_1 \), and we have managed to calculate the marginal sum \( g_1(x_1) \). If we want to compute the marginal sum \( g_3(x_3) \), the current technique only allows us to re-root the tree and redo the whole procedure with another decomposition. First, we will figure out how to make the process automatic for one marginal sum, and then we will modify the procedure so that all the marginal sums can be computed.

In order to compute a single marginal function, we initiate a message-passing procedure at the leaves. Each vertex \( v \) waits for all the messages from its children, i.e. remains idle until messages have arrived on all but one of the edges incident on \( v \).

What is a message? Suppose there is an edge \( \{x, f\} \) connecting variable node \( x \) to function node \( f \). No matter in which direction it travels along the edge, the message is always some function of the single variable \( x \), say \( \varphi_{ \{x, f\} }(x) \). Each node receives some messages and then sends one message.

We initialize the process at the leaves. During the initialization, each leaf variable node \( x \) sends a trivial «identity function» message (\( \varphi(x) \equiv 1 \)) to its parent, and each leaf factor node \( f \) sends a description of \( f \) to its parent. Note that if \( f \) is a leaf node, then the function \( f \) depends on a single variable.

Let us introduce the following update rules:

At each step, a variable node waits for the messages from all of its children, and then sends to its parent the product of the messages received from its children.

A factor node \( f \) with parent \( x \) forms the product of \( f \) with the messages received from its children, and then applies the summary operator \( \sum_{\sim \{ x \}} \) to the result.

Let \( n(v) \) denote the set of neighbors of a given node \( v \). The message computations can then be expressed as follows:

variable to local function:

\[ \mu_{x \to f} (x) = \prod_{h \in n(x) \backslash \{f\}} \mu_{h \to x} (x) \]

local function to variable:

\[ \mu_{f \to x} (x) = \sum_{\sim \{x\}} \Big( f(X) \prod_{y \in n(f) \backslash \{x\}} \mu_{y \to f} (y) \Big) \]

In this framework, the computation terminates at the root node \( x_i \), because the tree is rooted at the node \( x_i \). The marginal function \( g_i (x_i) \) is obtained as the product of all messages received at \( x_i \).
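To see the two rules in action, here is a minimal sketch of this rooted (single-marginal) version on the example factor graph. The graph representation, the factor tables, and all names are my own illustration, not the article's; messages are dictionaries mapping values of \( \Sigma \) to numbers.

```python
from itertools import product

SIGMA = [0, 1]

# Tree-shaped factor graph for g = fA(x1) fB(x2) fC(x1,x2,x3) fD(x3,x4) fE(x3,x5).
# Each factor: (tuple of its variables, table mapping a value tuple to a number).
FACTORS = {
    "fA": (("x1",), {(a,): 1.0 + a for a in SIGMA}),
    "fB": (("x2",), {(a,): 2.0 - a for a in SIGMA}),
    "fC": (("x1", "x2", "x3"), {t: 1.0 + t[0] * t[1] + t[2] for t in product(SIGMA, repeat=3)}),
    "fD": (("x3", "x4"), {t: 1.0 + t[0] + t[1] for t in product(SIGMA, repeat=2)}),
    "fE": (("x3", "x5"), {t: 2.0 - t[0] * t[1] for t in product(SIGMA, repeat=2)}),
}

def factors_of(var):
    return [f for f, (args, _) in FACTORS.items() if var in args]

def msg_var_to_factor(var, factor):
    """Product of messages from all neighbouring factors except `factor`."""
    msg = {a: 1.0 for a in SIGMA}
    for h in factors_of(var):
        if h != factor:
            incoming = msg_factor_to_var(h, var)
            for a in SIGMA:
                msg[a] *= incoming[a]
    return msg

def msg_factor_to_var(factor, var):
    """Summary over all arguments except `var` of the factor times incoming messages."""
    args, table = FACTORS[factor]
    incoming = {y: msg_var_to_factor(y, factor) for y in args if y != var}
    msg = {a: 0.0 for a in SIGMA}
    for values in product(SIGMA, repeat=len(args)):
        assignment = dict(zip(args, values))
        term = table[values]
        for y in args:
            if y != var:
                term *= incoming[y][assignment[y]]
        msg[assignment[var]] += term
    return msg

def marginal(var):
    """g_var: product of all messages arriving at the (root) variable node."""
    result = {a: 1.0 for a in SIGMA}
    for f in factors_of(var):
        incoming = msg_factor_to_var(f, var)
        for a in SIGMA:
            result[a] *= incoming[a]
    return result

print(marginal("x1"))
```

The recursion terminates because the graph is a tree: following the messages away from the root, we eventually reach the leaves, which send the trivial or single-variable messages described above.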


Exercise. Check that for the above example the suggested message-passing procedure leads to the correct decomposition of \( g_1(x_1) \).

This procedure is perfect if we want only one marginal, but let us think about how it could be modified to obtain multiple marginals.

If we want to compute all the marginals, then no particular vertex is taken as the root. Thus, there is no longer a child-parent relationship between nodes, and we need to change our message-passing scheme.

Message passing is again initiated at the leaves. Again, each vertex \( v \) remains idle until messages have arrived on all but one of the edges incident on \( v \). Once the messages have arrived, \( v \) is able to compute a message to be sent on the one remaining edge to its neighbor. After sending a message, vertex \( v \) returns to the idle state, waiting for the return message. Once that message has arrived, the vertex is able to compute and send messages to its other neighbors. The algorithm terminates once two messages have been passed over every edge, one in each direction.

Example of message-passing procedure

For the example function introduced above, we can depict the order in which the messages are passed:

In detail, the messages are generated as follows:

Step 1.

\begin{align} \mu_{A \to 1} (x_1) &= \sum_{\sim \{x_1 \}} f_A(x_1) = f_A (x_1) \\ \mu_{B \to 2} (x_2) &= \sum_{\sim \{x_2 \}} f_B(x_2) = f_B (x_2) \\ \mu_{4 \to D} (x_4) &= 1 \\ \mu_{5 \to E} (x_5) &= 1. \end{align}

Step 2.

\begin{align} \mu_{1 \to C} (x_1) &= \mu_{A \to 1} (x_1) \\ \mu_{2 \to C} (x_2) &= \mu_{B \to 2} (x_2) \\ \mu_{D \to 3} (x_3) &= \sum_{\sim \{x_3\}} \mu_{4 \to D} (x_4) f_D (x_3, x_4) \\ \mu_{E \to 3} (x_3) &= \sum_{\sim \{x_3\}} \mu_{5 \to E} (x_5) f_E (x_3, x_5) . \end{align}

Step 3.

\begin{align} \mu_{C \to 3} (x_3) &= \sum_{\sim \{x_3\}} \mu_{1 \to C} (x_1) \mu_{2 \to C} (x_2) f_C (x_1, x_2, x_3) \\ \mu_{3 \to C} (x_3) &= \mu_{D \to 3} (x_3) \mu_{E \to 3} (x_3) \end{align}

Step 4.

\begin{align} \mu_{C \to 1} (x_1) &= \sum_{\sim \{x_1\}} \mu_{3 \to C} (x_3) \mu_{2 \to C} (x_2) f_C (x_1, x_2, x_3) \\ \mu_{C \to 2} (x_2) &= \sum_{\sim \{x_2\}} \mu_{3 \to C} (x_3) \mu_{1 \to C} (x_1) f_C (x_1, x_2, x_3) \\ \mu_{3 \to D} (x_3) &= \mu_{C \to 3} (x_3) \mu_{E \to 3} (x_3) \\ \mu_{3 \to E} (x_3) &= \mu_{C \to 3} (x_3) \mu_{D \to 3} (x_3). \end{align}

Step 5.

\begin{align} \mu_{1 \to A} (x_1) &= \mu_{C \to 1} (x_1) \\ \mu_{2 \to B} (x_2) &= \mu_{C \to 2} (x_2) \\ \mu_{D \to 4} (x_4) &= \sum_{\sim \{x_4\}} \mu_{3 \to D} (x_3) f_D (x_3, x_4) \\ \mu_{E \to 5} (x_5) &= \sum_{\sim \{x_5\}} \mu_{3 \to E} (x_3) f_E (x_3, x_5) \end{align}


Termination. The marginal functions are obtained as the products of the messages received at each variable node:

\begin{align} g_1(x_1) &= \mu_{A \to 1} (x_1) \mu_{C \to 1} (x_1)\\ g_2(x_2) &= \mu_{B \to 2} (x_2) \mu_{C \to 2} (x_2)\\ g_3(x_3) &= \mu_{C \to 3} (x_3) \mu_{D \to 3} (x_3) \mu_{E \to 3} (x_3)\\ g_4(x_4) &= \mu_{D \to 4} (x_4) \\ g_5(x_5) &= \mu_{E \to 5} (x_5) \end{align}
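For completeness, here is a numeric transcription of Steps 1-5 as a sketch (messages as vectors of length \( |\Sigma| \), random factor tables; the variable names are mine), together with a brute-force check of the resulting marginals:

```python
import numpy as np

rng = np.random.default_rng(0)
Q = 2  # |Sigma|, values 0..Q-1

fA, fB = rng.random(Q), rng.random(Q)
fC = rng.random((Q, Q, Q))                       # indexed as fC[x1, x2, x3]
fD, fE = rng.random((Q, Q)), rng.random((Q, Q))  # fD[x3, x4], fE[x3, x5]

# Step 1
mu_A1, mu_B2 = fA, fB
mu_4D, mu_5E = np.ones(Q), np.ones(Q)
# Step 2
mu_1C, mu_2C = mu_A1, mu_B2
mu_D3 = fD @ mu_4D                               # sum over x4
mu_E3 = fE @ mu_5E                               # sum over x5
# Step 3
mu_C3 = np.einsum('i,j,ijk->k', mu_1C, mu_2C, fC)   # sum over x1, x2
mu_3C = mu_D3 * mu_E3
# Step 4
mu_C1 = np.einsum('k,j,ijk->i', mu_3C, mu_2C, fC)
mu_C2 = np.einsum('k,i,ijk->j', mu_3C, mu_1C, fC)
mu_3D = mu_C3 * mu_E3
mu_3E = mu_C3 * mu_D3
# Step 5
mu_D4 = fD.T @ mu_3D                             # sum over x3
mu_E5 = fE.T @ mu_3E
# Termination
g1, g2 = mu_A1 * mu_C1, mu_B2 * mu_C2
g3 = mu_C3 * mu_D3 * mu_E3
g4, g5 = mu_D4, mu_E5

# Brute-force check against the full table of g
g = np.einsum('i,j,ijk,kl,km->ijklm', fA, fB, fC, fD, fE)
for axis, gi in enumerate([g1, g2, g3, g4, g5]):
    other = tuple(a for a in range(5) if a != axis)
    assert np.allclose(gi, g.sum(axis=other))
print("all marginals match")
```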

Some remarks

Semiring trick

One can note that marginal sums involve two operations, \( \{+, \times\} \), and these two operations satisfy the distributive law. In fact, we can use any two operations that satisfy the distributive law; this is the setting of a commutative semiring. For example, if we are searching for the maximum \( \max f(x_1, \ldots, x_n) \), we can use the pair \( \{\max, \times\} \), where \( \max \) now plays the role of the sum. This allows us to formulate another interesting optimization algorithm in terms of factor graphs!
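For instance, replacing the sum with \( \max \) in the brute-force marginalization yields «max-marginals»; a tiny illustration (the function and alphabet are made up):

```python
from itertools import product

SIGMA = [0, 1, 2]

def g(x1, x2, x3):
    # an arbitrary function that factors as f1(x1, x2) * f2(x2, x3)
    return (1.0 + x1 * x2) * (3.0 - x2 + x3)

# Ordinary marginal: sum over all variables except x2.
g2_sum = {a: sum(g(x1, a, x3) for x1, x3 in product(SIGMA, repeat=2)) for a in SIGMA}

# "Max-marginal": the same computation in the (max, *) semiring.
g2_max = {a: max(g(x1, a, x3) for x1, x3 in product(SIGMA, repeat=2)) for a in SIGMA}

# Maximizing the max-marginal over x2 recovers the global maximum of g.
assert max(g2_max.values()) == max(g(*c) for c in product(SIGMA, repeat=3))
print(g2_sum, g2_max)
```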

Factor graphs with cycles

If we allow cycles in the graph, there is no longer a simple termination condition, and hence messages may pass multiple times over each edge, and this is fine. Many interesting applications involve graphs that do have cycles.

We adopt the convention that every node passes a message along every edge at every step, assuming that the passing of messages is synchronized with some «global clock». In fact, many message-scheduling strategies are possible, and several of them are described in the original article.

It is important to emphasize that we do not initialize nodes; we initialize edges. During the initialization we assume that a unit message has arrived on every edge incident on any given vertex.


Exercise. Check that if the graph does not have any cycles, then this initialization affects neither the final result nor the number of steps until convergence.

So, suppose now that the factor graph contains cycles: what guarantees can be given for such an algorithm? It turns out that the iterations of the Sum-Product algorithm minimize a certain Kullback-Leibler divergence, so there exists an invariant that is minimized by our procedure. Let us go through this step by step.

Bethe method

The following proposition claims that a probability distribution whose factor graph has no cycles admits a certain product representation. We will not concentrate on the particular form of the expression.

Proposition (Bethe factorization)

Let us denote by \( N(a) \) the set of arguments of the function with index \( a \), and by \( d_i \) the number of neighbors of \( x_i \), i.e. the number of factors in which \( x_i \) appears.

Suppose \( p(\boldsymbol x) \) is a probability function in factored form, i.e. \[ p(\boldsymbol x) = \prod_{a = 1}^{M} q_a(\boldsymbol x_{N(a)}) \] and the corresponding factor graph has no cycles. Then it can be expressed as follows: \[ p(\boldsymbol x) = \dfrac{\prod_{a=1}^M p_a(\boldsymbol x_{N(a)})}{\prod_{i=1}^N (p_i(x_i))^{d_i - 1}},
\] where the functions \( p_n (x_n) \) and \( p_{a}(\boldsymbol x_{N(a)}) \) obey the following restrictions: \begin{align} \sum_{x_n} p_n (x_n) &= 1 \quad \forall n \in [1, N], \\ \sum_{\boldsymbol x_{N(a)}} p_a(\boldsymbol x_{N(a)}) &= 1 \quad \forall a \in [1, M],\\ \sum_{\boldsymbol x_{N(a) \backslash n}} p_a(\boldsymbol x_{N(a)}) &= p_n(x_n) \quad \forall a \in [1, M], \; \forall n \in N(a). \end{align}


Exercise. Find the Bethe decomposition for the function from the example, assuming that it is a probability function.

Recall that all the variables \( x_1, \ldots, x_n \) take values in a finite alphabet \( \Sigma \). This means that a function of a single variable \( x_i \) can be represented as a vector of length \( |\Sigma| \), so the Bethe decomposition can be parametrized by a fixed number of parameters.


Theorem (informal). If the factor graph for \( p(\boldsymbol x) \) has cycles, then the sum-product updates are equivalent to coordinate-descent minimization of the KL-divergence between \( p(\boldsymbol x) \) and a Bethe-type distribution.

The proof and a detailed formulation of the theorem can be found in Appendix A of Delbert Dueck's PhD thesis.

Now we are ready for some applications.

Modeling Systems with Factor Graphs

Affinity propagation.

Affinity propagation is a clustering algorithm. Its goal is to maximize net similarity.

The problem is formulated as follows: we are given a set of observations \( X_1, \ldots, X_n \), and a matrix of similarities \( s(i, j) \). We often cannot observe the variables \( X_1, \ldots, X_n \) themselves, but we always observe the similarity matrix.

In the case of a metric clustering problem, i.e. \( X_j \in \mathbb R^d \), we can choose similarities like \( s(i, j) = - d(X_i, X_j) \) for \( i \neq j \); but for \( i = j \) we should not simply take \( s(i, i) = -d(X_i, X_i) = 0 \), and the self-similarity has to be chosen separately. In fact, there are several possible strategies to define the similarity of a point to itself: \[ s(i, i) = -\lambda; \quad \text{or} \quad s(i, i) = \mathrm{Median}_{j}(s(i, j)) \]
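A minimal sketch of building such a similarity matrix with the median-based preference (the dataset here is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))   # a hypothetical dataset of 30 points in the plane

# s(i, j) = -d(X_i, X_j): negative Euclidean distance between points i and j
S = -np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# Self-similarity («preference»): the median of the off-diagonal similarities.
# Larger preferences tend to produce more exemplars, i.e. more clusters.
off_diag = S[~np.eye(len(X), dtype=bool)]
np.fill_diagonal(S, np.median(off_diag))
```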
We would like to choose exemplars among the points of the dataset: for each point \( X_i \) we pick an exemplar index \( c_i \in \{1, \ldots, n\} \) (meaning that \( X_{c_i} \) serves as the exemplar of \( X_i \)), such that the sum of similarities \[ \sum_{i = 1}^{n} s(i, c_i) \] is maximal. However, there is one restriction preventing us from a greedy assignment: \[ (c_i = k) \, \Rightarrow \, (c_k = k), \] i.e. the assignment of exemplars must be consistent: if point \( X_k \) is an exemplar for point \( X_i \), then it must be an exemplar for itself. The requirement can be reformulated as follows: the set of points is split into disjoint clusters, each with its own unique exemplar belonging to that cluster.

Reference: Brendan J. Frey and Delbert Dueck, “Clustering by Passing Messages Between Data Points”, Science, Feb. 2007.

The optimization problem can also be interpreted as the Facility Location Problem, which is known to be NP-hard.

It turns out that the objective together with the correctness restriction can be turned into a single function with no constraints. Namely, we consider \[ F(\boldsymbol c, \boldsymbol s) = \prod_{i=1}^{N} e^{s(i, c_i)} \prod_{k=1}^{N} f_k (\underbrace{c_1, \ldots, c_N}_{\boldsymbol c}) \] The second term contains the correctness constraints, defined as follows: \[ f_k(\boldsymbol c) = \begin{cases} 0, & c_k \neq k, \, \exists i \colon c_i = k\\ 1, & \mathrm{otherwise} \end{cases} \]
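To make the constraint concrete, here is a tiny sketch that evaluates \( f_k \) and \( F \) for candidate assignments (the function names are mine, for illustration only):

```python
import numpy as np

def f_k(c, k):
    """Correctness constraint: if anybody chose k as an exemplar, k must choose itself."""
    if c[k] != k and any(c_i == k for c_i in c):
        return 0.0
    return 1.0

def F(c, S):
    """Product of similarity terms exp(s(i, c_i)) and all constraint factors."""
    n = len(c)
    value = np.exp(sum(S[i, c[i]] for i in range(n)))
    for k in range(n):
        value *= f_k(c, k)
    return value

# Example with 3 points: c = (2, 2, 2) is valid (point 2 is its own exemplar),
# while c = (2, 2, 0) is invalid (point 0 is chosen as an exemplar but c_0 != 0).
S = np.zeros((3, 3))
print(F((2, 2, 2), S), F((2, 2, 0), S))   # 1.0 and 0.0
```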

This function naturally produces the factor graph:

In fact, the analysis of message passing for this function \( F(\boldsymbol c, \boldsymbol s) \) is quite cumbersome: we need to consider several cases for valid and invalid configurations. The messages passed between \( c_i \) and \( f_j \) are the most important ones, while the messages between \( s(i, c_i) \) and \( c_i \) can easily be eliminated. The authors of Affinity Propagation show that, after a change of notation, the max-product update equations for this function take a very simple form:

Affinity Propagation

  1. Set \( a(i, k) \) to zero (availabilities).
  2. Repeat until convergence:
    • \( \forall i, k \colon r(i, k) = s(i, k) - \max_{k’ \neq k} [s(i, k’) + a(i, k’)] \)
    • \( \forall i, k \colon a(i, k) = \begin{cases} \sum_{i’ \neq i} [r(i’, k)]_{+}, & k = i\\ \big[r(k,k) + \sum_{i’ \notin \{i, k\}} [r(i’, k)]_{+}\big]_{-}, & k \neq i \end{cases} \)
  3. Output: cluster assignments.
    • \( \hat{\boldsymbol c} = (\hat c_{1}, \ldots, \hat c_{N}) \), where
    • \( \hat c_i = \arg\max_{k} [a(i,k) + r(i,k)] \)
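Below is a minimal vectorized sketch of these updates (the implementation details, including the damping factor that smooths the iterations, are my own additions and are not part of the pseudocode above):

```python
import numpy as np

def affinity_propagation(S, n_iter=200, damping=0.5):
    """Sketch of the Affinity Propagation updates for an n x n similarity matrix S."""
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibilities r(i, k)
    A = np.zeros((n, n))  # availabilities  a(i, k), initialized to zero

    for _ in range(n_iter):
        # r(i, k) = s(i, k) - max_{k' != k} [s(i, k') + a(i, k')]
        M = A + S
        row = np.arange(n)
        idx = np.argmax(M, axis=1)
        first = M[row, idx]
        M[row, idx] = -np.inf
        second = M.max(axis=1)
        max_excl = np.repeat(first[:, None], n, axis=1)
        max_excl[row, idx] = second
        R = damping * R + (1 - damping) * (S - max_excl)

        # a(k, k)          = sum_{i' != k} [r(i', k)]_+
        # a(i, k), i != k  = [ r(k, k) + sum_{i' not in {i, k}} [r(i', k)]_+ ]_-
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())       # keep r(k, k) itself in the column sums
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new.diagonal().copy()           # equals sum_{i' != k} [r(i', k)]_+
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, diag)
        A = damping * A + (1 - damping) * A_new

    # c_hat_i = argmax_k [a(i, k) + r(i, k)]
    return np.argmax(A + R, axis=1)
```

With the similarity matrix from the previous snippet, `affinity_propagation(S)` returns an exemplar index for every point; points sharing an exemplar form a cluster.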

Note that this algorithm is more heuristic than theoretically justified, for several reasons:

  • The justification via KL-divergence exists only for the Sum-Product algorithm, not for Max-Product. One can use a Sum-Product version of Affinity Propagation, but it is neither as natural for this maximization problem nor as computationally simple as the Max-Product version.
  • The algorithm may converge to an invalid configuration. In this case, one needs to restart it until a valid configuration is obtained. There are no theoretical guarantees on the number of restarts.

However, it is reported to perform very well in practice.

Other interpretations of Affinity Propagation, including some probabilistic ones, can be found on its FAQ Page.