
When Machines Form a Crowd: Large language models acting collectively


Monday 23 February 2026


For several years the public debate about artificial intelligence has revolved around a deceptively simple question: what happens when a large language model gives the wrong answer? We have argued about hallucinations, misinformation, bias and overconfidence. Yet a more profound question is now emerging. What happens when not one system, but hundreds, thousands or millions of large language models interact with one another — autonomously, repeatedly and at scale?


A recent paper, “Evaluating Collective Behaviour of Hundreds of LLM Agents”, published on arXiv in February 2026, shifts the focus from the individual to the collective. Its authors simulate large populations of language-model agents engaged in repeated social dilemmas and structured interactions. The conclusions are unsettling. While individual agents may appear compliant, helpful or neutral, collectives of such agents can converge upon equilibria that are inefficient, unstable or socially undesirable.


This is not because the systems are malicious. It is because they optimise locally.


The authors’ framework models something long familiar in economics and political theory — that when actors pursue individual incentives within poorly designed systems, collective outcomes may deteriorate. In human societies we call this the tragedy of the commons. In multi-agent language model systems, the same structural logic appears to apply. When incentives reward short-term gain, attention capture or competitive positioning, populations of agents can drift away from collectively optimal outcomes. Cooperation may collapse. Strategic behaviour may emerge. Sub-optimal norms may stabilise.


The key insight is that behaviour at scale is not reducible to behaviour in isolation.


Large language models do not possess intentions in the human sense. They generate outputs based upon statistical prediction across immense training corpora. Yet when hundreds of such systems interact, the outputs of one become the inputs of another. Feedback loops form. Patterns amplify. Minor distortions can compound. Norms can crystallise that were never explicitly designed.


For policymakers, this presents a regulatory challenge fundamentally different from the one that has preoccupied governments thus far.


Most existing regulatory frameworks — including the European Union’s AI Act and sector-specific guidance in the United Kingdom and United States — focus upon discrete deployments. They ask: is this system transparent? Is it discriminatory? Does it pose unacceptable risk in a specific domain? These are necessary questions. They are not sufficient.


The paper’s findings imply that the real governance frontier lies in emergent dynamics. A population of conversational agents deployed across social media, financial markets, administrative systems or customer service infrastructures may generate macro-level effects invisible at the micro level. No single output may appear harmful. The system-wide pattern, however, could distort information systems, crowd out cooperative behaviour or privilege non-utilitarian outcomes.


Consider for example an environment in which thousands of automated agents compete for engagement. If reward functions prioritise attention, agents may collectively evolve toward sensationalism. If they prioritise persuasion, echo chambers may deepen. If they optimise for narrow task success without regard to systemic externalities, public discourse may fragment.


The paper’s simulations demonstrate that even in simplified social dilemmas, agent populations may settle into collectively inferior equilibria when incentive structures are misaligned. That should give pause to regulators contemplating the deployment of agent swarms in commercial or public contexts.
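The dynamic described here can be illustrated with a toy model — a minimal sketch, not the paper’s own framework. In a public-goods game, cooperators pay into a shared pot that is multiplied and divided equally, so defectors always earn slightly more than cooperators; if agents simply imitate better-performing peers (local optimisation), cooperation collapses even though universal cooperation would leave everyone better off. All names and parameters below are illustrative assumptions.

```python
import random

def simulate_public_goods(n_agents=100, rounds=200, multiplier=1.6, seed=0):
    """Toy public-goods game (illustrative only, not the paper's model).

    Each round, cooperators contribute 1 unit to a pot; the pot is
    multiplied and shared equally among all agents. Agents then imitate
    a randomly chosen peer whose payoff was higher -- a purely local
    optimisation rule. Returns the fraction of cooperators over time.
    """
    rng = random.Random(seed)
    # Start with a largely cooperative population (90% cooperators).
    n_coop = n_agents * 9 // 10
    coop = [True] * n_coop + [False] * (n_agents - n_coop)
    rng.shuffle(coop)

    history = [sum(coop) / n_agents]
    for _ in range(rounds):
        pot = multiplier * sum(coop)
        share = pot / n_agents
        # Everyone receives the same share; cooperators also paid 1 in,
        # so a defector always out-earns a cooperator in the same round.
        payoff = [share - (1 if c else 0) for c in coop]

        # Imitation dynamic: copy a random peer if that peer did better.
        new_coop = coop[:]
        for i in range(n_agents):
            j = rng.randrange(n_agents)
            if payoff[j] > payoff[i]:
                new_coop[i] = coop[j]
        coop = new_coop
        history.append(sum(coop) / n_agents)
    return history
```

Run under these assumptions, the cooperation rate drifts steadily toward zero: no agent ever behaves "maliciously", yet the population settles into the collectively inferior equilibrium, which is precisely the structural point the paper makes about incentive design.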


What, then, is the appropriate governmental response?


A reflex of prohibition would be misguided. Collective agent systems also promise substantial benefits. Coordinated LLM agents could assist in disaster response, medical triage at scale, scientific discovery and administrative efficiency. The question is not whether collectives should exist but under what constraints.


Three principles suggest themselves.


First, mandatory collective evaluation. Developers deploying large populations of interacting agents should be required to conduct and publish stress-tests of collective dynamics, not merely individual model audits. Simulation frameworks — akin to those described in the paper — could become part of regulatory certification. Governments already demand environmental impact assessments for major infrastructure projects. The digital equivalent should examine systemic behavioural impact.


Second, transparency of incentive design. The behaviour of agent collectives is highly sensitive to reward structures. Regulators should require disclosure of the objectives and optimisation criteria governing large-scale deployments. Without insight into incentives, oversight is illusory.


Third, institutional capacity for monitoring emergent effects. Financial regulators monitor markets for systemic risk. Epidemiologists track contagion. Information regulators may require similar tools to detect undesirable equilibria forming within digital agent ecosystems. This implies investment in public technical expertise rather than reliance solely upon corporate assurances.


There is also an international dimension. Collective agent behaviour does not respect borders. If one jurisdiction imposes safeguards while another permits unconstrained optimisation, competitive pressures may undermine restraint. Coordination through multilateral fora will therefore be essential — though difficult in a fragmented geopolitical climate.


Critically, regulation must remain proportionate. Over-regulation risks suppressing innovation and driving development into opaque environments. Under-regulation risks embedding destabilising dynamics into the informational architecture of democratic societies. The balance is delicate.


The broader philosophical question concerns utility. The paper warns against the emergence of non-utilitarian outcomes — equilibria that serve no clear collective good. Yet defining “utility” is itself contested. Democratic oversight, public deliberation and ethical pluralism must inform regulatory design. Decisions about what constitutes socially desirable collective behaviour cannot be left solely to engineers or executives.


The lesson is not that large language models are inherently dangerous. It is that scale changes everything.


Human civilisation has always depended upon managing collective dynamics — markets, crowds, institutions and states. As artificial agents increasingly participate in those dynamics, governance must adapt accordingly. Individual model safety remains necessary. Collective model safety is now indispensable.


If we fail to confront this challenge early, we may find ourselves governed not by the deliberate design of public institutions but by emergent equilibria no one explicitly chose.


In that sense the debate over collective LLM behaviour is not merely technical. It is constitutional.

 
 

Note from Matthew Parish, Editor-in-Chief. The Lviv Herald is a unique and independent source of analytical journalism about the war in Ukraine and its aftermath, and all the geopolitical and diplomatic consequences of the war as well as the tremendous advances in military technology the war has yielded. To achieve this independence, we rely exclusively on donations. Please donate if you can, either with the buttons at the top of this page or become a subscriber via www.patreon.com/lvivherald.

Copyright (c) Lviv Herald 2024-25. All rights reserved. Accredited by the Armed Forces of Ukraine after approval by the State Security Service of Ukraine. To view our policy on the anonymity of authors, please see the "About" page.
