mechanosis

Hopfield Hinton Friston

Dec 30, 2025

Free energy is a scalar that measures how surprised, unstable, or unlikely a system is, given constraints. Free energy measures how much usable structure remains after accounting for disorder. Born in statistical mechanics, it evolved via Hinton, Hopfield, and Friston into modern intelligence.

Statistical Mechanics

classifies the behavior of groups of particles at each timestep into a macrostate (what a human can observe) and a microstate (the precise position and velocity of each particle).

Helmholtz free energy is the combination of energy and entropy that determines how likely a macrostate (observable state) is at equilibrium (rest).

Helmholtz free energy $F$ for a certain macrostate is given by

$ F = E - TS $

where $T$ is temperature, and, for the microstate distribution of this macrostate, $E$ is the internal energy and $S$ is the entropy. Low $E$ means this macrostate’s microstates are stable; high $E$ means the reverse. High $S$ means many microstates are compatible with this macrostate; low $S$ means the reverse.

Thermodynamic systems can be thought of as engaging in random walks, and they naturally evolve toward more probable states. The probability of a state $s$ is given by:

$ P(s) \propto \exp(-F/T) $

So probability decreases exponentially as free energy increases, and temperature flattens the distribution (makes it more uniform). Essentially, free energy is the negative log-probability of a state, temperature-softened (familiar?…)
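As a minimal numerical sketch of this weighting (two hypothetical states with made-up free energies, in units where the Boltzmann constant is 1), both the exponential suppression and the temperature-flattening can be computed directly:

```python
import math

def boltzmann_probs(free_energies, T):
    """Probability of each state: P(s) proportional to exp(-F(s)/T), normalized."""
    weights = [math.exp(-F / T) for F in free_energies]
    Z = sum(weights)  # partition function
    return [w / Z for w in weights]

# Two hypothetical states with free energies 1.0 and 3.0 (arbitrary units).
cold = boltzmann_probs([1.0, 3.0], T=0.5)   # low temperature: sharply peaked
hot  = boltzmann_probs([1.0, 3.0], T=10.0)  # high temperature: nearly uniform
```

At low temperature almost all probability mass sits on the low-free-energy state; at high temperature the two states become nearly equally likely.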

Thus, thermodynamic systems naturally evolve toward lower free energy at equilibrium (at which point they settle permanently into a state or set of states). Low free energy can be achieved by either low energy (stable/ordered configuration), high entropy (many disordered configurations), or a balance of both.

So the equilibrium free energy minimum is the most probable macrostate, considering both energy favorability (low energy) and configurational likelihood (high entropy).

Free energy essentially measures how probable a state is over time, combining how high-energy (unstable) it is with how “configuration-likely” it is (how many configurations of particles support this macro-behavior).

Hopfield Networks

model the deterministic mechanics of energy. Hopfield Networks retrieve memories by descending the energy landscape, and the network behaves as a thermodynamic system.

They consist of $N$ fully-connected neurons with values $1$ or $-1$. To train the network on a dataset of $P$ memories, each a vector $x^{\mu}$ of length $N$ representing a memory we want to store, we:

1. Choose an input datapoint $x^{\mu}$ and assign its $N$ values to the network’s neurons
2. Set all weights $w_{ij}$ for each pair of neurons $i, j$ according to:

$ w_{ij} = \frac{1}{N} \sum_{\mu = 1}^P x_i^{\mu} x_j^{\mu} \quad \text{for } i \neq j $

Intuitively, if two neurons have the same sign for a given datapoint, then their weight is positive; otherwise, it’s negative. And the larger the values, the larger the weight. Essentially, “neurons that fire together wire together”, across all data.

In inference, we first initialize each node with random noise and use the weights learned in training. Then we repeat until convergence:

1. Select a random or sequential neuron $i$
2. Compute its input: $h_i = \sum_j w_{ij} s_j$
3. Update its state: $s_i = +1$ if $h_i \geq 0$, else $s_i = -1$

The final neuron states should now match one of the datapoints/memories we want to retrieve. Intuitively, the network makes the datapoints low-energy states, where the energy of a state is defined by:

$ E = -\frac{1}{2} \sum_{i,j} w_{ij} s_i s_j $

So the lowest-energy states (most negative) are those in which same-sign neurons are connected by large positive weights and opposite-sign neurons by large negative weights.

The inference algorithm simply carries any input to the nearest low-energy state.
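The training and retrieval rules above can be sketched in a few lines of numpy (a toy instance, not a canonical implementation; the two stored patterns here are chosen orthogonal so that retrieval from a corrupted input is reliable):

```python
import numpy as np

def train_hopfield(patterns):
    """Hebbian rule: w_ij = (1/N) * sum_mu x_i^mu x_j^mu, with zero diagonal."""
    N = patterns.shape[1]
    W = patterns.T @ patterns / N
    np.fill_diagonal(W, 0.0)
    return W

def recall(W, state, steps=100, seed=0):
    """Asynchronous updates: repeatedly set a random neuron to sign of its input."""
    rng = np.random.default_rng(seed)
    s = state.copy()
    for _ in range(steps):
        i = rng.integers(len(s))
        h = W[i] @ s
        s[i] = 1 if h >= 0 else -1
    return s

# Store two +/-1 patterns, then recover the first from a corrupted copy.
patterns = np.array([[1, 1, 1, 1, -1, -1, -1, -1],
                     [1, -1, 1, -1, 1, -1, 1, -1]])
W = train_hopfield(patterns)
noisy = patterns[0].copy()
noisy[0] = -1  # flip one bit
recovered = recall(W, noisy)
```

Each asynchronous update can only lower the energy, so the dynamics settle into the stored pattern nearest the noisy input.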

To create a Hopfield Network that “generates” 100 × 100 images, we create a 10,000-neuron network (one neuron per pixel), train it on many 10,000-pixel example images by applying the training rule across all examples, and at inference time we should recover one of the dataset images.

The vanilla Hopfield network is essentially just a pattern storage and retrieval algorithm; empirically, it won’t really generalize as a generative model.

Instead of statistical mechanics, Hopfield Networks follow deterministic mechanics. There are no micro- or macrostates; the micro and the macro are identical. There is no entropy $S$, so $F = E$: free energy is just energy. But their asynchronous updates still resemble random walks attracted to low-energy states.

Boltzmann machines

are stochastic Hopfield networks, moving closer to statistical mechanics. Boltzmann machines have stochastic inference and can learn a generative model instead of just a memory-retrieval algorithm.

Boltzmann machines consist of $N$ fully-connected visible units and $M$ hidden units, each with stochastic on/off behavior. Each unit has a bias and a weighted edge to every other unit.

In training, we learn weights and biases such that the energy of a state represents the quality of its solution to an optimization problem. In inference, the stochastic dynamics of the learned Boltzmann machine descend to good solutions in that energy landscape.

Free energy in this context is now conceptually identical to the original statistical-mechanics equation for free energy. The free energy of a visible state is now $ F(v) = -\log \sum_h e^{-E(v,h)} $ which is low when many low-energy hidden states correspond to this visible state, and high otherwise.

Visible states correspond to macrostates and hidden states correspond to microstates. So again if many low energy hidden microstates correspond to a visible macrostate, that macrostate’s free energy is low (and thus its probability is high). So free energy is essentially again $F = E - TS$.
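To make this concrete, here is a small enumeration of $F(v) = -\log \sum_h e^{-E(v,h)}$, assuming an RBM-style energy $E(v,h) = -v^{\top} W h - b^{\top} v - c^{\top} h$ (one common choice; the post doesn’t fix a specific form, and the weights here are random and purely illustrative):

```python
import itertools
import math
import numpy as np

def energy(v, h, W, b, c):
    """Joint energy of a visible/hidden configuration (assumed RBM-style form)."""
    return -(v @ W @ h + b @ v + c @ h)

def free_energy(v, W, b, c, M):
    """F(v) = -log sum_h exp(-E(v, h)), summing over all 2^M binary hidden states."""
    total = 0.0
    for bits in itertools.product([0, 1], repeat=M):
        h = np.array(bits, dtype=float)
        total += math.exp(-energy(v, h, W, b, c))
    return -math.log(total)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))  # 3 visible units, 2 hidden units
b = rng.normal(size=3)
c = rng.normal(size=2)
v = np.array([1.0, 0.0, 1.0])
F = free_energy(v, W, b, c, M=2)
```

The more low-energy hidden configurations pair with $v$, the larger the sum inside the log, and the lower $F(v)$ comes out.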

In inference, each unit updates according to Gibbs sampling:

$ P(s_i = 1) = \sigma(b_i + \sum_j w_{ij} s_j) $

where $\sigma(x) = 1/(1 + e^{-x})$, the logistic function. So a unit’s probability of turning on is high when it has a large bias and when its neighbors’ signs match the weights. This converges to the Boltzmann distribution, in which the probability of state $s$ is given by:

$ P(s) = \frac{e^{-E(s)/T}}{\sum_{s'} e^{-E(s')/T}} $

where the energy of state $s$ is:

$ E(s) = -\sum_i s_i b_i - \sum_{i < j} s_i s_j w_{ij} $

So the energy of a state is low if the state matches the biases, and if positive weights connect same-signed units and negative weights connect opposite-signed units. Essentially, it measures whether the state is internally consistent. And it is normalized by the partition function, the summed likelihood of all states.

In training, we maximize the log-likelihood of the data (familiar?). Essentially, we want the datapoints to have high likelihood (meaning many low-energy hidden states correspond to them) in the free energy landscape. The gradient of the data log-likelihood with respect to the weights is given by

$ \Delta w_{ij} \propto \langle s_i s_j \rangle_{\text{data}} - \langle s_i s_j \rangle_{\text{model}} $

where $\langle s_i s_j \rangle_{\text{data}}$ is $E[s_i s_j]$ under the data distribution, and $\langle s_i s_j \rangle_{\text{model}}$ is $E[s_i s_j]$ when we sample from the Boltzmann Machine’s equilibrium distribution. In the positive phase of training, we clamp the visible units to data and let the hidden units sample until equilibrium (in practice we run a few steps and stop), then measure the $s_i s_j$ correlations. In the negative phase, we run the model freely starting from noise and measure the equilibrium $s_i s_j$ correlations. We then update the weights to increase the positive-phase correlations and decrease the negative-phase ones. In other words, we are lowering the energy of configurations that explain the data, and raising the energy of configurations the model produces on its own from noise.
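The two phases can be sketched for a fully visible Boltzmann machine (hidden units omitted for brevity, and a short Gibbs chain standing in for true equilibrium sampling, as the post notes is done in practice):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(s, W, b, rng):
    """One sweep of Gibbs sampling: P(s_i = 1) = sigma(b_i + sum_j w_ij s_j)."""
    for i in range(len(s)):
        p = sigmoid(b[i] + W[i] @ s - W[i, i] * s[i])  # exclude self-connection
        s[i] = 1.0 if rng.random() < p else 0.0
    return s

def boltzmann_gradient(data, W, b, rng, n_steps=20):
    """Delta w_ij proportional to <s_i s_j>_data - <s_i s_j>_model."""
    positive = data.T @ data / len(data)             # correlations clamped to data
    samples = rng.integers(0, 2, size=data.shape).astype(float)  # start from noise
    for _ in range(n_steps):                         # run the model freely
        for s in samples:
            gibbs_step(s, W, b, rng)
    negative = samples.T @ samples / len(samples)    # free-running correlations
    return positive - negative

rng = np.random.default_rng(0)
data = np.array([[1.0, 1.0, 0.0], [1.0, 1.0, 1.0], [0.0, 1.0, 1.0]])
W = np.zeros((3, 3))
b = np.zeros(3)
grad = boltzmann_gradient(data, W, b, rng)
```

Ascending `grad` strengthens weights between units that co-occur in the data more often than in the model’s own samples.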

So, in training, we are now shaping a free energy landscape that takes into account both energy and entropy. Good representations create smooth/wide energy basins in this landscape, representing concepts instead of individual datapoints.

Hidden states are “beliefs” about the latent structure of visible data; they “explain” visible data with minimal energy. The free energy of a visible state summarizes how well the many hidden configurations explain $v$. Low free energy means many hidden states make $v$ likely, so $v$ is a “good fit”. Learning is lowering the free energy of training examples. The Boltzmann Machine’s “beliefs about data” correspond to the hidden configurations with the lowest free energy given the visible states.

Variational Free Energy

generalizes statistical mechanics to probabilistic generative modeling.

All data is produced by some mysterious underlying world function. All observations $x$ are produced by some generative process defined by latent variables $z$. There is some joint distribution which defines the bidirectional relationship between $x$ and $z$ (when $z$ appears, how likely is $x$, and vice versa, and how likely is each on its own?):

$ p(x,z)=p(x∣z)p(z) $

What we want is the posterior probability of $z$ given that we observed $x$ (since all we can see is $x$, but what we want to know is the actual generative processes/reasons/causal phenomena behind the appearance of $x$):

$ p(z \mid x) = \frac{p(x, z)}{p(x)} $

But $p(x)=\int p(x,z)dz$ is intractable which means $p(z∣x)$ is intractable. All of deep learning can be thought of as trying to solve this problem while staying compute-efficient.

So instead we approximate $p(z|x)$ with $q(z)$, a distribution built from simple assumptions that we can actually calculate. Now, free energy is the difference between the distributions:

$ F[q] = E_q[\log q(z)] - E_q[\log p(x,z)] = KL(q(z) \| p(z|x)) - \log p(x) $

i.e., the expected log-probability of the approximation minus the expected log-probability of the joint model, both taken under $q(z)$.

The first term rewards $q$ for being spread out (high entropy), and the second term rewards $q$ for placing its mass where $p(x,z)$ is high. We can also think of $F$ as the expected energy of $q$ minus its entropy. We want a $q$ that is simple, explains the data well, and is close to the true posterior.

Variational free energy measures how badly $q$ misfits the posterior and the data. Minimizing variational free energy is equivalent to minimizing the KL divergence between $q$ and the true posterior. It is the negative ELBO, so minimizing free energy maximizes the ELBO and makes $q(z)$ approximate the true posterior $p(z|x)$ well.

Variational free energy upper-bounds surprise $-\log p(x)$. Surprise is always less than or equal to free energy, so by minimizing free energy, we minimize surprise. Lower free energy means a better explanation of the data under the model, characterized by low complexity and high accuracy.
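A tiny discrete example (a hypothetical joint distribution over one observation and two latent values) exhibits both identities: $F$ equals surprise exactly when $q$ is the true posterior, and upper-bounds it otherwise:

```python
import math

# Hypothetical joint p(x, z) for one observed x and two latent values of z.
p_joint = {"z0": 0.3, "z1": 0.1}
p_x = sum(p_joint.values())                           # p(x) = sum_z p(x, z)
p_post = {z: pz / p_x for z, pz in p_joint.items()}   # true posterior p(z|x)

def variational_free_energy(q):
    """F[q] = E_q[log q(z)] - E_q[log p(x, z)]."""
    return sum(q[z] * (math.log(q[z]) - math.log(p_joint[z])) for z in q)

q_bad = {"z0": 0.5, "z1": 0.5}     # a uniform, mismatched approximation
q_good = dict(p_post)              # q set to the exact posterior

F_bad = variational_free_energy(q_bad)
F_good = variational_free_energy(q_good)
surprise = -math.log(p_x)
```

When $q$ matches the posterior, the KL term vanishes and $F$ collapses to the surprise $-\log p(x)$; any other $q$ pays a KL penalty on top.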

The Free Energy Principle

applies variational free energy to life (perception and action)

The organisms that exist more are the ones good at survival and reproduction. To survive and allow offspring to survive, organisms must resist entropy, so must dissipate energy, regulate internal states, and try to minimize catastrophic unpredictability.

Maintaining structure requires keeping sensory states within limited ranges (homeostasis means tight ranges for body temperature, glucose levels, posture, etc.; otherwise death results). This is the same as keeping sensory inputs “normal”: no major pain or chaotic inputs. And this is equivalent to minimizing long-term surprise: the organism reduces the amount of sensory input it can’t predict, like extreme heat or pain.

Surprise is intractable to compute, so the organism minimizes variational free energy instead. As above, minimizing free energy minimizes surprise, since free energy upper-bounds surprise.

Formally, an organism encodes a generative model $p(\text{state}, \text{observations})$. The organism’s internal mental states encode an approximate posterior $q(s)$: a probability distribution over states, conditioned on observations. Variational free energy here is:

$ F = E_q[\log q(s)] - E_q[\log p(s, o)] $

and measures the unpredictability of inputs.

Organisms minimize surprise/the unpredictability of inputs through free energy by:

1. Perception: updating internal beliefs by adjusting $q(s)$ to match observational inputs, which means neural activity is a Bayesian belief update (given new information, what is the new probability of each sensation?)
2. Action: changing the world so the organism’s sensory input matches its predictions of state. An organism predicts its arm at position x, and the arm goes to x; or it predicts a book at position x, and the body coordinates hierarchically, with joint predictions and executions, to move the book there.

So organisms either fit their beliefs to the world or fit the world to their beliefs, and both reduce free energy.
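For a discrete model, the perception step can be sketched as an exact Bayesian belief update (the stove/pain model here is entirely hypothetical, chosen only to make the update concrete):

```python
def update_beliefs(prior, likelihood, observation):
    """q(s) proportional to p(o|s) p(s): update beliefs about hidden states."""
    unnorm = {s: likelihood[s][observation] * ps for s, ps in prior.items()}
    Z = sum(unnorm.values())  # normalize to a proper distribution
    return {s: u / Z for s, u in unnorm.items()}

# Hypothetical two-state world: is the stove "hot" or "cold"?
prior = {"hot": 0.2, "cold": 0.8}
likelihood = {"hot":  {"pain": 0.9,  "no_pain": 0.1},
              "cold": {"pain": 0.05, "no_pain": 0.95}}
posterior = update_beliefs(prior, likelihood, "pain")
```

A painful observation shifts nearly all belief mass onto "hot", even against a prior that favored "cold"; for discrete models this exact update is what minimizing $F$ over $q(s)$ recovers.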

The organism actually minimizes long-term expected free energy, which means minimizing both risk and ambiguity. Risk is the expected difference between predicted and preferred states. Ambiguity is uncertainty about the sensing-state correlation.

If an organism wanted to purely minimize prediction error, it could sit in a dark room with no stimulation. In reality, it is minimizing expected free energy: it needs to explore, seek information, compress more of the world, have goals, and be curious. It also needs sustenance to sustain itself. The organism’s generative model is necessarily self-referential and temporally extended.

Evolution also implicitly minimizes free energy over generations (organisms with better predictive and active generative models survive and reproduce more). Goals are low-expected-free-energy states that the organism’s actions are directed to achieve.

Predictive Coding

is the micro-level implementation of free energy minimization in the brain: a theory of neural learning in which neurons minimize the surprise of their sensory input. In this model, each neuron operates via predictive coding and:

1. Predicts its input
2. Receives its input
3. Adjusts its activity to minimize prediction error (a proxy for the free energy gradient)
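The three steps above can be sketched as a single scalar unit descending its prediction-error gradient (a toy proxy; real predictive coding operates over hierarchies of units passing errors between layers):

```python
def predictive_coding_step(prediction, observation, lr=0.1):
    """One update: move the unit's prediction down the prediction-error gradient.
    Squared prediction error stands in as a proxy for free energy."""
    error = observation - prediction
    return prediction + lr * error

# A unit repeatedly predicts a constant sensory input of 1.0, starting from 0.0.
pred = 0.0
for _ in range(100):
    pred = predictive_coding_step(pred, 1.0)
```

Each step shrinks the residual error geometrically, so the unit’s prediction converges to the input it keeps receiving.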

Expanded in [Predictive Coding]()

In the grand scheme

these frameworks are no longer used or theorized to be the true, universal underlying explanation of intelligence, but it is useful to internalize how statistical mechanics underlies the best methods we’ve found for synthesizing useful intelligence. VAEs maximize the ELBO, which is negative free energy; VAEs are basically continuous Boltzmann Machines with one-shot inference. Diffusion and flow matching train a score network $\nabla_x \log p(x)$, where the score is the (negative) gradient of energy, so they also explicitly encode a free energy landscape. Even transformers implicitly model an energy landscape via joint distributions over tokens.

This is also an innately satisfying attempt to unify physics and intelligence, since it seems computation and intelligence are somehow emergent properties of physics. Thermodynamics and intelligence both seem to be modeling energy landscapes. Some are useful, and some are not. Most are not.

And in the long quest to understand the micro-operations of the brain, Boltzmann Machines and the Free Energy Principle are decent theories of how the brain works, though far from complete.

-anandmaj