Complexity in Ilya's Model
Dec 25, 2025
Ilya Sutskever has 2 papers about complexity in his top 30 of all time. What do they say, and why did he put them in the list?
The relationship between entropy and complexity
is the foundation of both complexity papers (Complextropy and Quantifying Complexity).
Entropy
measures the randomness, disorder, genericness, uncertainty, or information content in a system. Low entropy means the system is highly ordered, predictable, and has little information. High entropy means the system is disordered, unpredictable, and has a lot of information. According to the 2nd law of thermodynamics, entropy in a closed system always increases until it reaches a maximum value, because high-entropy states are much more likely.
Complexity
is our human notion of how “interesting” some thing is (or “structured” or “intricate” or “designed”). Unlike entropy, complexity does not always increase. As entropy increases, complexity first increases, and then decreases. For example, when making coffee, the coffee and milk start separated, which is both low entropy and low complexity. As they mix, intricate tendrils of milk and tendrils of coffee start to appear, giving medium entropy but high complexity. Once they’re fully mixed, entropy is high since the different particles are all randomly distributed, but complexity is low since the solution is all one color.
This process of low to medium to high entropy, but low to high to low complexity, repeats in the universe. It started as a low entropy, high energy soup, then increased in entropy, forming high complexity structures like stars, planets, and living organisms, and eventually it will become an all-random, high entropy, but low complexity and energy void.
Another example of this entropy-complexity asymmetry: we would say a uniform random string like “16285034918028740253961403” has high entropy (it is incompressible) but very low interestingness (we would just call it random).
So we care about the complex, the “interesting” structures in all systems. Entropy captures complexity in part, but it misses the human-judged value of “structure” over “randomness”.
Interesting content
depends on the maintenance of high complexity systems (cells, the human body, agriculture, society, our planet, the sun, the Milky Way). Our existence hinges upon growing complexity, despite the constant and continuous increase of entropy in the universe.
It would be nice to know more about the relationship between complexity and entropy. Presumably, we would then be able to make complexity (which we like) more common than blind low or high entropy states (which we generally don't like)
One start of knowing more about their relationship is to formalize their relationship in some objective mathematical definitions of both entropy and complexity (the latter being much harder).
Once we have this, we can run concrete methodical experiments and logical manipulations on the relationships between entropy and complexity
Formulating entropy
is best done with Algorithmic Entropy.
Algorithmic Entropy (technically called Kolmogorov Complexity) is the information content of a string of bits. It is measured by the length of the shortest computer program that can output that string (then halt).
For example, a simple, regular pattern (e.g., "01010101...") can be described by a short program ("print '01' 500 times"). But a random, structureless string (e.g., "1001101010...") has no pattern to exploit, so the shortest program is essentially "print '1001101010...'", which is about as long as the string itself.
Since the Algorithmic Entropy K(x) of string x is the length of the shortest computer program that outputs x, it essentially asks “how compressible is the data?”.
Unlike Boltzmann and Gibbs Entropy, Algorithmic Entropy is the property of an individual object, not an ensemble or distribution. So Algorithmic Entropy allows for measuring the information content of a single, specific object without needing to know the underlying probability distribution that generated it.
It is unfortunately uncomputable (you can't write a program that calculates it for any string), but it can be approximated with standard domain-specific compression algorithms (like gzip).
Formulating complexity
is even more difficult, but Aaronson has come up with 2 pretty good definitions: Complextropy and Apparent Complexity. Our representation of complexity should assign ~0 to the separated and fully mixed coffees and a large value to the partially mixed middle.
Complextropy
of n-bit string x is the number of bits in the shortest computer program running in bounded time that outputs a near-uniform sample from set S such that:
1. x in S
2. Any computer program that outputs x in bounded time, given an oracle that provides independent uniform samples from S, has at least $\log_2(|S|) - c$ bits for some constant c
In other words, you can't cheat by just simulating the universe: you must output a sampler (representing a distribution in which x is a typical member), and any efficient program outputting x given S must use at least $\log_2(|S|) - c$ bits (to any fast algorithm, x is indistinguishable from a random sample from S)
Complextropy is essentially the minimal idea/model that explains the structured/nonrandom part of x (the rule being that anyone using the model sees x as a typical, unremarkable example).
Equivalently, complextropy is the length of the shortest efficient computer program that outputs a sample from a probability distribution D such that x isn't efficiently compressible with respect to D
(x looks, to any efficient algorithm, like a random sample from D)
Apparent Complexity
separates noise from structure before computing entropy.
Intuitively, complexity seems to be entropy over a denoised/smoothed version of a system state. Entropy only goes wrong because it counts pure randomness as “high value”. Essentially, we want to know how structured an entity is once you take out the “noise” or “randomness”: we want to separate structure from randomness.
The problem of measuring complexity now becomes “how do I identify what is noise and what is structure?” In apparent complexity, we do so using a human-specified denoising function f.
The apparent complexity of x is $H( f(x) )$ where:
1. $x$: The object or system state (e.g., a detailed bitmap image of the coffee cup).
2. $f$: A "denoising" or "smoothing" function. Its purpose is to filter out "incidental" or "random" information, leaving behind the "interesting, non-random" structure.
3. $H$: An entropy measure (like Kolmogorov complexity, which they approximate using file compression like gzip)
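A toy sketch of $H(f(x))$ in Python, with gzip standing in for $H$ and a block-average-and-quantize smoother standing in for $f$ (both concrete choices are my assumptions for illustration, not Aaronson's exact setup):

```python
import gzip
import random

random.seed(42)

def H(data: bytes) -> int:
    """Entropy proxy: gzip-compressed size in bytes."""
    return len(gzip.compress(data))

def f(samples, window=32, levels=8):
    """A crude denoiser: average each window, then quantize to a few levels,
    keeping only coarse structure and discarding per-sample noise."""
    out = []
    for i in range(0, len(samples), window):
        block = samples[i:i + window]
        avg = sum(block) / len(block)
        out.append(int(avg * (levels - 1)))
    return bytes(out)

n = 4096
noise = [random.random() for _ in range(n)]        # pure randomness, no structure
raw = bytes(int(v * 255) for v in noise)

# Raw entropy counts every random sample, so it is large; apparent
# complexity H(f(x)) smooths the noise away first, so it is tiny.
print(H(raw), H(f(noise)))
```

Pure noise thus scores high on raw entropy but near zero on apparent complexity, which is exactly the correction we wanted over entropy alone.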
The issue is that f is a matter of choice, so f is biased toward human preferences (does it tell us more about human preference than “actual complexity”?). But “actual complexity” is itself strongly human-defined; it’s just “interestingness”. f is restricted by what humans can observe (macro, not micro) and biased toward the local nature of interactions (both of which make sense)
For deep learning
complexity gives intuitions for how deep learning works, which I think is why Ilya included Aaronson’s complexity papers in his top 30 deep learning papers of all time.
I think Ilya primarily intuits that a generative neural network is exactly a Complextropy measurement machine.
Generative models are learning the distribution that makes the data a regular sample: exactly another phrasing of Complextropy. And in training the model, the trainer tries to use the smallest possible model that can still fit the data. So it is trying to learn the shortest “program” or function under which data is a “regular output”, which is the exact formalization of measuring complexity. And the process of deep learning is itself a measure of complexity.
Our goal in training the neural network is to maximize the “interestingness”/”usefulness” of the generations, not just to overfit the data (lowest entropy) or to output random noise/maximize variance (highest entropy)
Optimal complexity is located at some balance between 0 entropy (all the same; maximum likelihood of the data) and full entropy (all random; noise). The goal of a good deep learning regimen is not to fit the data as closely as possible, but to find this complexity optimum at the maximum compression point, the optimum at which the generative model is interesting or useful to us.
Critically, the more data we have, the more compressed an optimal program will be compared to the raw information content of the data. If you just have 2 dog images, maybe a neural network can just hardcode those images into its weights. But at 1 million images, it really must learn the function that encodes the meaning of a dog in order to find the shortest program that generates all 1 million images.
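This amortization can be sketched with gzip: compressing many noisy variants of one shared pattern costs far less per example than compressing a few (the random byte-string “images” are a stand-in for the dog photos):

```python
import gzip
import random

random.seed(1)

def bytes_per_example(k: int) -> float:
    """Compressed size per example for k noisy copies of one shared template."""
    template = bytes(random.randrange(256) for _ in range(200))  # the shared structure
    dataset = []
    for _ in range(k):
        noisy = bytearray(template)
        for _ in range(10):  # incidental per-example noise
            noisy[random.randrange(200)] = random.randrange(256)
        dataset.append(bytes(noisy))
    return len(gzip.compress(b"".join(dataset))) / k

# With more examples, the shared structure is paid for once and amortized,
# so the cost per example drops toward just the per-example noise.
print(bytes_per_example(2), bytes_per_example(100))
```

The per-example cost falls as k grows because the “meaning” (the template) only needs to be encoded once, mirroring why a model trained on more data is forced toward the shared structure rather than memorization.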
Apparent Complexity shows that our goal of optimal complexity is ultimately “human-defined”. What we find to be “desirable outputs” in the complexity optimum between low and high entropy are completely driven by human values/aesthetics. What counts as noise, and what counts as “meaningful structure”?
And in our own life, it's important to recognize that Complexity is transient. High complexity is an intermediate state between full order (0 entropy) and full disorder (maximum entropy).
Formalizations of Entropy and Complexity
are explanations of how we arrived at the above formulations of entropy and complexity as the best choices (what are the alternatives to algorithmic entropy, complextropy, and apparent complexity?). The below are essentially notes on Aaronson’s Quantifying Complexity.
Formalizations of entropy
(deriving why the entropy definition is best in our context)
Boltzmann Entropy
Used in: statistical mechanics
For a given arrangement of particles in a space, Boltzmann Entropy asks: how many other micro-states (specific arrangements of particles) look identical from a behavioral, human-judged “macro-state” perspective.
In statistical mechanics, we define $x_a$ as a specific microstate (a particular arrangement of particles), and $X_A$ as a “macrostate”, a set or grouping of multiple microstates with identical characteristics (such that a human viewer couldn’t tell the difference). Specifically, the formula is: $S_{\text{Boltzmann}}(x_a) = k_B \log W_A$
where $k_B$ is the Boltzmann constant (set to 1 here) and $W_A$ is the volume (number of microstates) of the macrostate $X_A$ that $x_a$ is in.
For example, a room with all air molecules clustered in a corner is a specific microstate. The “air in one corner” macrostate has small $W_A$ (few microstates look like that) so it has low Boltzmann Entropy. But the “air spread evenly” macrostate has massive $W_A$ so high Boltzmann entropy. The 2nd law of thermodynamics just says that systems evolve from low-$W_A$ to high-$W_A$ macrostates
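The corner-vs-spread example can be made concrete by coarse-graining a box into two halves and counting microstates (a toy model with $k_B = 1$; the choice of 100 molecules is illustrative):

```python
import math

def boltzmann_entropy(n: int, k: int) -> float:
    """S = log W for the macrostate 'k of the n molecules are in the left half';
    W = C(n, k) counts the microstates consistent with that macrostate (k_B = 1)."""
    return math.log(math.comb(n, k))

n = 100
print(boltzmann_entropy(n, 0))   # all molecules in one corner: W = 1, so S = 0
print(boltzmann_entropy(n, 50))  # spread evenly: W is astronomically large
```

The 2nd law falls out of the counting: there are overwhelmingly more even-split microstates than corner microstates, so a randomly evolving system drifts toward the high-$W_A$ macrostate.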
Gibbs Entropy/Shannon Entropy
Used in: Statistical mechanics/information theory
The Gibbs entropy of a probability distribution p(x) over all possible microstates measures our uncertainty about the system's true microstate. Instead of a single state, we have a probability for each possible state and quantify the average surprise upon learning the actual one (1 possible state gives 0 surprise; a uniform distribution gives the maximum possible expected surprise)
$H(p) = -\sum_x p(x)\log_2 p(x)$ where $p(x)$ is the probability of state $x$ under distribution $p$
For example, a fair coin has distribution $(0.5,0.5)$ and entropy $-[0.5\log(0.5) + 0.5\log(0.5)] = 1$ bit (the maximum for a binary system), whereas biased coins all have lower entropy, down to the one-sided coin with distribution $(0,1)$ and entropy $-[0\log(0) + 1\log(1)] = 0$ bits (using the convention $0\log 0 = 0$)
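The coin calculation in Python (written as expected surprise, so zero-probability outcomes contribute nothing):

```python
import math

def shannon_entropy(probs) -> float:
    """H(p) = sum_x p(x) * log2(1/p(x)): the expected surprise of the outcome,
    where log2(1/p) is the surprise of an event with probability p."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))  # fair coin: 1.0 bit (maximum for a binary system)
print(shannon_entropy([0.0, 1.0]))  # one-sided coin: 0.0 bits (no surprise)
print(shannon_entropy([0.9, 0.1]))  # biased coin: ~0.47 bits, between the extremes
```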
Kolmogorov Complexity/Algorithmic Entropy
Used in: Computability/Algorithmic Information Theory
See the Algorithmic Entropy section above
Relations
The Boltzmann entropy of microstate $x_a$ is based on human-defined characterizations (coarse-grainings) of a system’s possible arrangements (its phase space). It is objective (not adapted to a user’s knowledge, unlike statistics) but relative (it depends on our chosen coarse-graining, i.e., how we define the macrostates).
Shannon/Gibbs Entropy is epistemic, about our knowledge or ignorance represented by the distribution p, instead of objective.
Algorithmic Entropy (Kolmogorov Complexity) is truly objective, but it isn’t naively computable.
Boltzmann ≈ Gibbs/Shannon when the distribution is uniform over a macrostate.
Kolmogorov ≈ Gibbs/Shannon for a "typical" string drawn from a computable distribution. For a typical string x sampled from probability distribution D, $K(x) \approx H(D)$ (up to an additive term for describing D itself). Essentially, the compressed size (algorithmic entropy) of a typical random string from a distribution is about the same as that distribution's Shannon entropy.
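This correspondence can be checked empirically, again with gzip as a computable upper bound on K (the Bernoulli source and the parameters here are illustrative choices of mine):

```python
import gzip
import math
import random

random.seed(0)

p, n = 0.1, 100_000
# Shannon entropy of a Bernoulli(0.1) bit: about 0.47 bits/symbol.
h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Draw a typical string from D = Bernoulli(0.1)^n and pack the bits into bytes.
bits = random.choices([0, 1], weights=[1 - p, p], k=n)
packed = bytes(
    sum(bit << j for j, bit in enumerate(bits[i:i + 8]))
    for i in range(0, n, 8)
)

compressed_bits_per_symbol = len(gzip.compress(packed)) * 8 / n
# gzip lands well below the raw 1 bit/symbol and not far above H(D).
print(compressed_bits_per_symbol, h)
```

gzip cannot reach H(D) exactly (it is a generic compressor, not an optimal code for this source), but it tracks the Shannon entropy closely, as the relation predicts for typical samples.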
Formalizations of complexity
Formulations of complexity should start low when cream and coffee are separate, go high when they mix, and go low again when they are fully mixed
Sophistication
Sophistication finds the optimal "model" `S` (a set of similar states) for a string `x`.
Sophistication of x is the size of the shortest program that describes this model `S`
It is uncomputable, and also never becomes large for mid-mixed coffee
Logical depth
Logical depth is the runtime of the shortest program that outputs `x`. It’s also uncomputable.
It operates from the intuition that truly complex objects require substantial computation to be generated from a simple recipe.
For example, a Mandelbrot zoom looks exceptionally intricate, and although the program that generates it is only a few lines long, that program must run for many iterations, so the zoom’s logical depth is high, matching its apparent complexity.
Unfortunately, visually complex patterns can be generated very quickly by simple, chaotic rules, which logical depth wouldn’t catch.
Lightcone complexity
Lightcone complexity compares the mutual information between past and future lightcones of a point in a system with known causal structure.
For example, in Conway’s Game of Life, a glider has high lightcone complexity because knowing its past motion tightly predicts its future trajectory, but the object itself is not complex.
Essentially, Lightcone complexity encodes how much you can predict about the future from the past, and vice versa. Instead of human notions of complexity, lightcone complexity seems to instead measure how deterministic/nonrandom an entity is.
Another drawback is that it requires knowledge of the system's history and future, not just its current state.
Apparent Complexity
See the Apparent Complexity section above
Relations
Apparent Complexity is Sophistication where the model search is very constrained/resource-bounded. If we fix the smoothing function $f$, then we are essentially choosing the model $S_{f,x} = \{\, y \mid f(y) = f(x) \,\}$; $K(f(x))$ is then essentially $K(S_{f,x})$.
Sophistication is coarse (“smoothed”) logical depth because it measures how long the shortest meaningful program must run to generate `x`, discounting programs that merely brute-force it.
For example, a string describing the digits of π has low sophistication because its generating program is simple, even though the digits look random.
Coarse-grained Sophistication and Logical Depth are equivalent in practice because both measure how far an object is from being either trivial or random and instead having a nontrivial, structured generative history.
Light-cone complexity and apparent complexity are highly correlated. High-light-cone complexity entities have "contingent structures" in spacetime (structures that persist/determine future states), which tend to be structured themselves (which means large apparent complexity in the current state). And when apparent complexity is high, this usually signals that it came from a structured past and can predict future states more accurately (since pure noise almost never creates sustained structure).
Apparent complexity was empirically the best for matching the low to high to low complexity constraint, but it was philosophically and formally unsatisfactory since it relied on human notions of complexity for the smoothing function. Complexity is a complex function!
Reflections
on the Complexity Formalizations
Aaronson: Interaction
The emergence of measurable complexity seems to depend on causal interactions between a system's components. Interactions between particles are essential for generating true, incompressible complexity.
Aaronson demonstrated this in his coffee-automaton experiments
In the Interacting Model, where particles swap places, complexity showed a peak: the constraints and dependencies created by interactions lead to the formation of structured patterns.
In the Non-Interacting Model, where cream particles perform independent random walks, complexity peaked with naive coarse-graining but not with "adjusted coarse-graining" (which removes artifacts). The particle distribution at any time can be described by a simple function of its initial state and the elapsed time, so the state is always highly compressible (low "sophistication"), even if it looks complex.
Aaronson: Experiments only upper bound complexity
Aaronson’s experimental/simulation approach can only find complexity upper bounds. It shows a state is at most this complex, but not that it must be at least this complex.
For example, a file that compresses to 100KB is at most 100KB complex, but it could be much simpler. The entire challenge is finding the optimal/smallest algorithm that outputs the string, not just any compression of it. Aaronson’s goal is to find provable theoretical lower bounds; he wants to truly understand the origins of complexity.
My: Strings specify all computation
It’s interesting how computing is all specifiable in strings. If everything is computable, then you could write a string that specifies it in its entirety. But clearly the string that specifies the thing is not the thing itself. That’s why just writing the equations of a neural network on paper, but not doing the calculations, isn’t the same. Doing the calculations on paper might be the same…
My: Entropy starts low
Why does the universe's entropy start low?
Some would answer, “Because we wouldn't be around if it didn't; there would be no mind to even ask this question if universal entropy began high”
Some would answer, “Because it all must come from somewhere”