Multiagent Emergence

Jan 1, 2026

(background) Series II: Conditional Generation and RL

Real life does not consist of just one agent and environment; it is a multi-agent complex emergent system.

Multi-agent Reinforcement Learning (MARL) is a framework for structuring a system with multiple agents and goals. In MARL, a population of agents shares an environment, and each agent’s actions can affect the others. Each agent selects actions using its own policy. The next state depends on the previous state and the joint action of all agents. The goal is to find the optimal policies across the population.
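
To make the loop concrete, here is a minimal sketch of one shared environment with two agents. The transition rule, the reward rule, and the names `step`, `rollout`, `always_one`, and `parity` are all invented for illustration; the only point is that the next state is computed from the previous state and every agent’s action together.

```python
from typing import Callable, Dict, Tuple

State = int
Action = int
Policy = Callable[[State], Action]


def step(state: State, joint_action: Dict[str, Action]) -> Tuple[State, Dict[str, float]]:
    """Toy transition: the next state depends on the previous state and ALL actions."""
    next_state = (state + sum(joint_action.values())) % 5
    # Hypothetical reward rule: an agent scores when its action equals the new state's parity.
    rewards = {
        name: 1.0 if action == next_state % 2 else 0.0
        for name, action in joint_action.items()
    }
    return next_state, rewards


def rollout(policies: Dict[str, Policy], state: State = 0, horizon: int = 9) -> Dict[str, float]:
    """Run every agent's policy in the shared environment and total each agent's reward."""
    returns = {name: 0.0 for name in policies}
    for _ in range(horizon):
        # Each agent picks its action from its own policy, given the shared state.
        joint_action = {name: policy(state) for name, policy in policies.items()}
        state, rewards = step(state, joint_action)
        for name, reward in rewards.items():
            returns[name] += reward
    return returns


if __name__ == "__main__":
    policies = {
        "always_one": lambda s: 1,   # ignores the state
        "parity": lambda s: s % 2,   # reacts to the state
    }
    print(rollout(policies))
```

Swapping either policy changes the states the other agent sees, which is what separates this from single-agent RL.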

In a zero-sum environment, optimal means each agent maximizes its own reward, since one agent’s gain is another’s loss. Technically, the optimum is a Nash equilibrium: a set of policies where no agent can do better by unilaterally changing its policy.
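
The “no unilateral improvement” condition can be checked directly in a small matrix game. The sketch below uses matching pennies as an assumed example; `expected_payoff` and `is_nash` are hypothetical helper names. It relies on the fact that expected payoff is linear in a player’s own mixed strategy, so checking pure-strategy deviations is enough.

```python
from typing import List

Matrix = List[List[float]]

# Matching pennies: the row player receives A[i][j], the column player receives -A[i][j].
MATCHING_PENNIES: Matrix = [[1.0, -1.0], [-1.0, 1.0]]


def expected_payoff(A: Matrix, x: List[float], y: List[float]) -> float:
    """Row player's expected payoff when row plays mix x and column plays mix y."""
    return sum(x[i] * A[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))


def is_nash(A: Matrix, x: List[float], y: List[float], tol: float = 1e-9) -> bool:
    """True if neither player can gain by unilaterally deviating (zero-sum game)."""
    value = expected_payoff(A, x, y)
    # Best the row player could get by switching to any pure row against y.
    best_row = max(sum(A[i][j] * y[j] for j in range(len(y))) for i in range(len(x)))
    # Best the column player could do: push the row player's payoff as low as possible.
    best_col = min(sum(x[i] * A[i][j] for i in range(len(x))) for j in range(len(y)))
    return best_row <= value + tol and best_col >= value - tol


if __name__ == "__main__":
    print(is_nash(MATCHING_PENNIES, [0.5, 0.5], [0.5, 0.5]))  # both players mix evenly
    print(is_nash(MATCHING_PENNIES, [1.0, 0.0], [0.5, 0.5]))  # row player is predictable
```

In the second call the row player always picks the first row, so the column player can exploit it, and the pair of policies is not an equilibrium.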

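In cooperative environments, where agents share a common reward, the goal is to maximize total reward collaboratively. (A strictly fixed-sum game leaves no total to grow, so it behaves like a zero-sum game.)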

In infinite-sum environments, such as real life, the goal is to figure out what “sums” matter and pursue them cooperatively. The game of Go is zero-sum, but the whole population of agents playing it can be thought of as infinite-sum: together they are macro-optimizing toward the best possible policy by exposing weaknesses and trying new strategies along the way.

MARL settings can have perfect information, where every agent observes the full state, or imperfect information, where each agent sees only part of it. Almost every useful setting, including the real world, falls into the second category.
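
One way to picture the difference is as two observation functions over the same state. This is a sketch under assumptions: the card-game state and the names `GameState`, `observe_perfect`, and `observe_imperfect` are made up for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class GameState:
    hands: Dict[str, List[str]]  # each agent's private cards
    table: List[str]             # cards everyone can see


def observe_perfect(state: GameState, agent: str) -> Dict[str, Any]:
    """Perfect information: every agent sees the full state."""
    return {"hands": state.hands, "table": state.table}


def observe_imperfect(state: GameState, agent: str) -> Dict[str, Any]:
    """Imperfect information: an agent sees its own hand, the table, and only hand sizes of others."""
    return {
        "own_hand": state.hands[agent],
        "table": state.table,
        "opponent_hand_sizes": {
            name: len(hand) for name, hand in state.hands.items() if name != agent
        },
    }


if __name__ == "__main__":
    state = GameState(hands={"alice": ["AS", "KS"], "bob": ["2H", "7D"]}, table=["QC"])
    print(observe_perfect(state, "alice"))
    print(observe_imperfect(state, "alice"))
```

A policy in the imperfect setting has to act on the second kind of observation, so it must reason about what the hidden cards might be.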

As in all RL, the system balances exploration and exploitation. MARL wants to create the best agents, but must explore many possibilities in order to do so. It must exploit the policies it knows are good in order to do better, but also try random other policies, which may lead to even better strategies or expose the weaknesses of the policies MARL thinks are best. MARL is not a thinking entity; MARL is emergent from local zero-laws (laws that are encoded only implicitly, not explicitly, as in any emergent system). And yet MARL is a thinking entity; MARL is the greatest entity.
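
The simplest version of this trade-off is epsilon-greedy selection over a population of policies. This is a sketch of that standard technique, not a claim about how any particular MARL system chooses; `select_policy` and the value estimates are hypothetical.

```python
import random
from typing import Dict


def select_policy(estimated_value: Dict[str, float], epsilon: float, rng: random.Random) -> str:
    """With probability epsilon explore a random policy; otherwise exploit the best-known one."""
    names = list(estimated_value)
    if rng.random() < epsilon:
        return rng.choice(names)                   # explore: may reveal a better strategy
    return max(names, key=estimated_value.get)     # exploit: go with what looks best so far


if __name__ == "__main__":
    rng = random.Random(0)
    values = {"main": 0.9, "exploiter": 0.6, "wildcard": 0.1}  # made-up estimates
    picks = [select_policy(values, epsilon=0.2, rng=rng) for _ in range(1000)]
    print({name: picks.count(name) for name in values})
```

With `epsilon=0.2` most picks go to the policy currently believed best, while the others still get played often enough to correct a wrong estimate.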

Where is the Framework Used?

Why does MARL matter, and when is it useful?

I: (Joker Voice) Society

Society is a MARL setting. Each agent tries to maximize their own reward. But each agent also collaborates to maximize the total reward sum of the entire population. The competitive optimization actually helps the collaborative optimization sometimes, and hurts it other times. Society is a dance between competition and collaboration.

II: Evolution

Evolution is also a MARL setting. Individual agents compete for finite resources, and the ones who do so better persist. Another way of saying this is the agents who persist better persist!

The “policy” is encoded in genes and manifests in the structure of the being constructed from those genes. The reward signal is the number of viable offspring (children who can reproduce themselves).
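
As a rough sketch of this mapping, the toy model below reduces a genome to a single number and makes expected offspring count proportional to a fitness function. The target trait, the Gaussian-shaped fitness, and the names `fitness`, `next_generation`, and `evolve` are all assumptions chosen for illustration.

```python
import math
import random
from typing import List

TARGET = 1.0  # hypothetical trait value that the environment happens to favor


def fitness(trait: float) -> float:
    """Relative expected number of viable offspring; highest at TARGET."""
    return math.exp(-((trait - TARGET) ** 2))


def next_generation(population: List[float], rng: random.Random, mutation_sd: float = 0.1) -> List[float]:
    """Sample parents in proportion to fitness, then copy their genes with small mutations."""
    weights = [fitness(trait) for trait in population]
    parents = rng.choices(population, weights=weights, k=len(population))
    return [parent + rng.gauss(0.0, mutation_sd) for parent in parents]


def evolve(generations: int = 60, size: int = 200, seed: int = 0) -> List[List[float]]:
    """Return the population at every generation, starting far from TARGET."""
    rng = random.Random(seed)
    population = [rng.uniform(-1.0, 0.0) for _ in range(size)]
    history = [population]
    for _ in range(generations):
        population = next_generation(population, rng)
        history.append(population)
    return history


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


if __name__ == "__main__":
    history = evolve()
    print(round(mean(history[0]), 3), round(mean(history[-1]), 3))
```

Nothing in the loop tells the population where `TARGET` is; the average trait drifts toward it only because genes closer to it leave more copies.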

Beings end up collaborating with their own kind, and sometimes with beings of other kinds, to maximize their own reward. For evolution to be zero-sum, the unit must be genes rather than beings, as The Selfish Gene explains. Mothers sacrifice their own well-being for the good of their children, apparently motivated by the spread of their genes. But the setting is also infinite-sum. One can choose to say that beings do care about the well-being of others for its own sake: love.

Evolution is a simple encoding; in fact, it is a zero-encoding. Evolution is simply emergent from physics and particle interactions. The structures that exist longer and better (by both durability and reproduction) are the ones that are more likely to exist at a given time! Yet the complexity emergent from this zero-law is infinite and unbounded…
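
The durability half of this zero-law fits in a few lines. In the sketch below (all numbers and the name `simulate` are assumptions), structures appear at a constant rate with random decay probabilities, and the only rule is that each one independently decays. No preference for durable structures is coded anywhere, yet whatever is alive at the end is skewed toward low decay rates.

```python
import random
from typing import List, Tuple


def simulate(steps: int = 200, births_per_step: int = 10, seed: int = 0) -> Tuple[List[float], List[float]]:
    """Return (decay probabilities of everything ever created, of everything still alive)."""
    rng = random.Random(seed)
    created: List[float] = []
    alive: List[float] = []
    for _ in range(steps):
        # New structures appear with arbitrary per-step decay probabilities.
        newborn = [rng.uniform(0.01, 0.5) for _ in range(births_per_step)]
        created.extend(newborn)
        alive.extend(newborn)
        # The only "law": each structure decays independently with its own probability.
        alive = [p for p in alive if rng.random() >= p]
    return created, alive


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


if __name__ == "__main__":
    created, alive = simulate()
    print("mean decay probability, everything created:", round(mean(created), 3))
    print("mean decay probability, what exists now:   ", round(mean(alive), 3))
```

Sampling the world at a random time therefore over-represents the durable, which is the selection pressure stated in the paragraph above; adding reproduction only sharpens it.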

Evolution optimizes for policies that are robust and effective in a constantly changing environment filled with other competing and collaborating agents (predators, prey, mates, kin).

Evolution thus rewards those who are antifragile with respect to variance and change. Not only the policies (genes) that are better in the current environment, but the ones most adaptable to change (and actually the ones that grow with change) are the ones that survive and pass on their genes over the longest periods of time.

Evolution is itself an antifragile system (more randomness and exploration is better, up to a point, as with all antifragile systems) that produces antifragile results (the organisms themselves are antifragile and benefit from variance, because the ones that do are the ones that survive!).

III: Culture

Culture is also a MARL setting. Instead of genes, we have memes: ideas. These are transmitted, and the ones that lead to their holders surviving and spreading the idea survive. In the modern day, more complex dynamics arise. Mimesis is copying, the mechanism of the spread of memes. Alpha is a powerful meme unknown to others. Antimimesis is information that resists spread, for example, a secret that is incredibly valuable to the holder as long as no one else knows.

The cultures that lead their societies to thrive, and the ones that motivate their holders to spread themselves, are the ones that are spread further. Christianity is this way, as is capitalism.

Culture is also a hyper-optimization. It is a way to share the best aspects of policies with other agents within a generation. It is the macro-phenomenon of communication, the transmitting of memes from one being to another. Agents can’t adopt the memes of others willy-nilly; they let the best memes rise to the top and adopt those. Most agents are incentivized to make the overall culture as beneficial to everyone as possible. But culture can also be co-opted to direct reward to oneself, given enough power!

Education is the transmission of culture across generations, and can also be shaped for one’s own benefit.

MARL also happens at the institutional and governmental level. Institutions can be thought of as independent entities constructed of smaller beings. But the institution, the government, seeks its own survival and growth. It grows past what the “founders” deemed the “purpose” of the organization. This is due to the same zero-law as evolution: those institutions that prioritize and successfully enact their own survival and growth are the ones most likely to be around at any random sampling of time! This is how bureaucracies grow. MARL all the way up and all the way down.

It all stems from the zero-law: the things that survive are the ones likely to be around.

IV: AlphaGo

The AlphaGo setting is a 2-player, zero-sum, perfect-information game. The population of agents is optimized toward one great policy, but along the way it also contains policies that exploit weaknesses of the main policy. In this way, the agents are micro-competitive but macro-collaborative. Each micro-game is zero-sum, but the entire system has unbounded reward. In some ways, this approaches a system that optimizes the longest-term rewards: a meta-learner that creates the best learning strategies by coordinating multiple weakness-exposing strategies with main-line strategies.

This is executed in a genius, simple way: the current agent plays earlier versions of itself, which should be slightly weaker opponents. This keeps it from getting stuck. In learning to beat its previous versions, it also learns to exploit its own past weaknesses.
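
The shape of that loop can be sketched on a game much smaller than Go. The example below is not AlphaGo’s training method: it swaps in plain hill climbing on a tabular policy for single-heap Nim (take 1 to 3 stones, whoever takes the last stone wins). The names `play`, `margin`, and `self_play` are hypothetical. What it keeps is the structure: a candidate only replaces the current agent if it beats the frozen previous version.

```python
import random
from typing import Dict, List

Policy = Dict[int, int]  # heap size -> number of stones to take (1 to 3)
MAX_HEAP = 12


def random_policy(rng: random.Random) -> Policy:
    return {heap: rng.randint(1, min(3, heap)) for heap in range(1, MAX_HEAP + 1)}


def play(first: Policy, second: Policy, heap: int) -> int:
    """Return 0 if `first` wins, 1 if `second` wins. Taking the last stone wins."""
    players = (first, second)
    turn = 0
    while True:
        heap -= players[turn][heap]
        if heap == 0:
            return turn
        turn = 1 - turn


def margin(a: Policy, b: Policy) -> int:
    """Wins of `a` minus wins of `b` over every starting heap and both seat orders."""
    total = 0
    for heap in range(1, MAX_HEAP + 1):
        total += 1 if play(a, b, heap) == 0 else -1  # a moves first
        total += 1 if play(b, a, heap) == 1 else -1  # a moves second
    return total


def self_play(generations: int, seed: int = 0) -> List[Policy]:
    """Keep a lineage of policies where each one beat its immediate predecessor."""
    rng = random.Random(seed)
    history = [random_policy(rng)]
    for _ in range(generations):
        frozen = history[-1]                 # previous self, held fixed as the opponent
        candidate = dict(frozen)
        heap = rng.randint(1, MAX_HEAP)      # mutate one decision
        candidate[heap] = rng.randint(1, min(3, heap))
        if margin(candidate, frozen) > 0:    # accept only if it beats its past self
            history.append(candidate)
    return history


if __name__ == "__main__":
    lineage = self_play(generations=500)
    print("accepted improvements:", len(lineage) - 1)
```

No human games or labels appear anywhere; the opponent is generated by the process itself. A hill climber like this can still stall at a local optimum or chase cycles, which is one reason real systems sample opponents from a pool of earlier versions rather than only the latest one.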

Self-play of this kind needs no human labels and can, in principle, generate training games without end. How can we extend this to real-life, open-ended, generalist systems? This is the current goal of many people.

V: Alignment

All these scenarios suggest that having separate minds with separate policies is actually better in the limit of unbounded growth, in any setting.

Notions of “meaning” in infinite-sum settings with imperfect information are complex, ever-changing entities. Encoding the meaning itself is not likely to be optimal. Instead, encode a zero-law, a law from which the optimality arises emergently! Populations of entities with competing ideas, where the better ones rise to the top…

This doesn’t solve alignment, but whatever the solution to alignment is, it likely has this structure as a characteristic.

-anandmaj