In this article we will introduce a set of fairly complex problems (optimization of discrete functions) and a nice solution to them (simulated annealing). Let’s dig into this.

One day I saw a physicist (weird hairs, funny sweater, inability to write down greek letters) who was in front of a bowl of hot sulfur. He then proceeded to cool it down with ice, this way obtaining an undrinkable cold broth. Then another bowl was cooled (this involved more time, so we chatted: they are nice creatures unlike mathematicians). The second bowl ended up as crystals and the physicist insisted that the crystal structure (lattice he says) holds more energy in it’s ordered structure: his claims was that this way he optimized (maximized) the energy of the whole system. He called the minimizing process Annealing.

Let’s get technical: simulated annealing is a stochastic (statistic + time) process. The process itself is structured so to hold a limited memory of the past steps, in form of probability: the probability of a result at step t is influenced by the realizations of the past steps.

Prob{S(t)=s | S(t-1)=s’}

When this memory is limited to only the precedent step, we talk about Markovian Chains. Given each s and s’ we could build a matrix A (a stochastic matrix, with values >=0 and with the sum for each line =1). This premises leads to many pages of calculations (Chapman-Colmo Gorov, Lipshits, also ergodic chains) that will be bailed out in this paper. Let’s make it simple with an example: sudoku boards. Wait, is sudoku still a thing? Yes it is: We describe a sudoku board as a discrete vector in an 81-dimensions discrete hyperspace (81 being 9*9). We then proceed to define a function over these vectors: the number of invalid cells, based on the rules of sudoku. Now if it were a continuous function, the task of optimization (finding a minimum, in this case) would have been carried by derivative, but with discrete functions, talking about derivatives just don’t make sense. Thinking of checking all the possible solutions? 81 dimensions by 9 places per dimension leads to 9^81=1*10^77 possibilities, which is a little too much. The simulated annealing instead execute as a random process: at each step perturb a single random component of the vector (one of the cells of the sudoku board). This generate a new vector to which the algorithm can move with a certain probability. The probability to move from a vector to another is exactly our stochastic matrix, and is defined as follows. Given a start state i, an end state j and being f(x) the function to minimize, the probability to move from i to j is given by:

Pij = min {1 , exp(-beta*(f(j)-f(i)))}

This probability is always one when the new state is better then the precedent (i.e. less errors in the sudoku board), while it’s a greater then zero probability when the new state is worst, lower probabilities to bigger slopes on the function to optimize. As of beta, it represents the temperature of the system: big temperatures generate fast movement into the hyperspace, low temperature are for small adjustment steps. This is critical to any discrete optimization heuristic, because of the risk of getting stuck in a local minimum (a local basin) and missing the global best solution. So our simulated annealing should move fast at start, then slow down at the end.

But will this lead to a solution?

Without going too technical, a solution can always be found (at infinite time) when the stochastic matrix is so that the probability to move from a state i to a state j is non zero (this does not imply the matrix has no zero cells, a sequence i->k->j is perfectly acceptable).

Off with words and down to business: let’s use this (quote – unqote) hard board from http://sudokublog.typepad.com/sudokublog/2005/08/page/2/

And the result, after merely 1482358 iterations is… 