Microsoft’s Bit-Goat Paper Says LLMs Aren’t What They Seem

A Microsoft researcher built a working neural network out of Age of Empires II goats, arguing the chat window itself shapes how human-like AI feels to users.

Published

June 24, 2026

Henry Fox

Microsoft researcher Adrian de Wynter used the scenario editor of Age of Empires II to build a one-bit perceptron out of virtual goats, then published the construction as a critique of how AI researchers test whether large language models are human-like. The paper, posted to arXiv on 29 May 2026 and now in its third revision, is titled “If LLMs Have Human-Like Attributes, Then So Does Age of Empires II.” Its central argument is methodological: de Wynter argues that the perceived human-like qualities of a chatbot belong to the interface.

Inside the classic real-time strategy game, de Wynter turned goats, grass, bridges, ice and baobab trees into the components of NAND, XNOR and AND logic gates. Each goat carries a bit: standing on grass is 0, standing on a bridge is 1, and a gate fires when the goats reach their pens. Stack enough of those gates together and you have a perceptron, the simplest unit of a neural network and a known building block of every modern LLM. de Wynter frames the construction itself as a demonstration. He told 404 Media he has “a tendency to dial up things to 11” when he needs to make a methodological argument, and that “absurdism is pretty standard in philosophy and theoretical computer science.”

How a Bit-Goat Becomes a Perceptron

The pitch reads as a stunt until the engineering starts making sense. Age of Empires II’s scenario editor lets players script units, terrain and triggers in a deterministic, frame-by-frame environment, which makes it possible to wire together the kind of conditional logic a transistor normally handles. de Wynter chose AoE II, in his own telling, because it is a less obvious substrate than Minecraft redstone, which has hosted working neural networks before. The construction is published on his GitHub pages with annotated GIFs and a full circuit diagram for each gate.

Once the gates are wired together, any logic circuit can be built in-game. A perceptron follows from there as a proof of concept.
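For readers unfamiliar with the term, a perceptron with binary inputs can be sketched as a thresholded weighted sum. This is the textbook formulation, not the in-game construction; the function name, the weights and the bias below are hypothetical values picked by hand so that the unit happens to compute NAND.

```python
from itertools import product


def perceptron(inputs, weights, bias):
    """Return 1 if the weighted sum of the inputs plus the bias is positive, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


# Hypothetical hand-picked parameters (not taken from the paper):
# with these values the unit behaves like a NAND gate.
NAND_WEIGHTS = (-2, -2)
NAND_BIAS = 3

for a, b in product((0, 1), repeat=2):
    print(a, b, perceptron((a, b), NAND_WEIGHTS, NAND_BIAS))
```

The loop prints the full truth table for the two input bits, which is small enough to check by eye.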

In de Wynter’s paper, the construction rests on a NAND gate, which is functionally complete on its own: every other Boolean gate can be built from NAND alone. He calls the sacrificed input goats “they ded” and uses ice patches for concurrency control. The full build is documented in an annotated walkthrough of the AoE II circuits, with GIFs of the NAND gate, the perceptron and an ansatz-based training circuit.
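Functional completeness is easy to check outside the game. The sketch below derives the other common gates from a single `nand` function; it is standard Boolean logic rather than de Wynter’s circuit layout, and the function names are illustrative only.

```python
from itertools import product


def nand(a, b):
    """NAND on bits: 0 only when both inputs are 1."""
    return 0 if (a and b) else 1


# Every gate below is composed from nand() and nothing else.
def not_(a):
    return nand(a, a)


def and_(a, b):
    return not_(nand(a, b))


def or_(a, b):
    return nand(not_(a), not_(b))


def xor(a, b):
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))


def xnor(a, b):
    return not_(xor(a, b))


# Compare each derived gate with Python's built-in bit operators.
for a, b in product((0, 1), repeat=2):
    assert and_(a, b) == (a & b)
    assert or_(a, b) == (a | b)
    assert xor(a, b) == (a ^ b)
    assert xnor(a, b) == 1 - (a ^ b)
print("all derived gates match")
```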

Grass represents binary 0; bridge represents binary 1.
Ice rail handles concurrency so a gate fires only when upstream is ready.
Bamboo encodes XOR, forest AND, baobab OR, deep water NOT.
Goat acts as the signal carrier and, briefly, as sacrificial worker.
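The encoding above can be modelled loosely in a few lines. This is a toy abstraction, not the scenario's trigger logic: the `Gate` class, its `arrive` method and the `TERRAIN_BIT` table are hypothetical names, and waiting for all inputs stands in for whatever the ice rails do in-game.

```python
# Terrain-to-bit encoding as described in the article.
TERRAIN_BIT = {"grass": 0, "bridge": 1}


class Gate:
    """Toy gate that fires only once every input goat has arrived."""

    def __init__(self, op, arity=2):
        self.op = op          # function applied to the collected bits
        self.arity = arity    # number of input goats the gate waits for
        self.arrived = []     # bits carried by goats that have arrived so far

    def arrive(self, terrain):
        """Record one goat; return the output bit when the gate fires, else None."""
        self.arrived.append(TERRAIN_BIT[terrain])
        if len(self.arrived) < self.arity:
            return None  # upstream not ready yet, so the gate stays idle
        bits, self.arrived = self.arrived, []  # input goats are consumed on firing
        return self.op(*bits)


nand_gate = Gate(lambda a, b: 1 - (a & b))
print(nand_gate.arrive("bridge"))  # first goat only: gate does not fire
print(nand_gate.arrive("bridge"))  # second goat arrives: gate fires
```

Returning `None` until the last input arrives is one simple way to express "fires only when upstream is ready"; the real build enforces that ordering with terrain rather than with a counter.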

Microsoft bit-goat LLM in Age of Empires II

The Two Substrate Arguments

The goats are the joke; the substrate is the argument. de Wynter’s paper makes two distinct claims about what is called the substrate, the medium on which an LLM is implemented.

The first claim is straightforward. From the paper’s abstract: “any entity in a sufficiently-powerful substrate, such as LEGO or the Greater Boston Area, could also present such attributes.” The implication is that what makes an LLM an LLM is the relationship between weights, biases and operations, irrespective of the silicon under it. On de Wynter’s own Substack, the same point is illustrated with MIT undergraduates walking into 77 Mass Ave, each carrying a piece of paper that symbolises a word; together, they implement the LLM. He also proves, formally, that Age of Empires II is Turing complete.

The second claim follows. The abstract notes that “the purported anthropomorphic attributes of LLMs are empirically non-unique: although some properties (e.g., responses to prompts) could remain invariant, others, such as the interpretation of their perceived behaviour, might change with the substrate.” de Wynter argues the perceived human-like qualities of a chatbot live in the interface.

Any sufficiently powerful substrate can implement an LLM, from goats to LEGO to Greater Boston.
The substrate shapes perception, so the same model feels different to a human observer when run as a chat pane versus a herd of bit-goats.
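The first of those two points, that input and output behaviour can stay the same across substrates, can be shown with a toy comparison. The sketch below computes one small function two ways, once as arithmetic and once from NAND gates only. Both function names are hypothetical and neither comes from the paper; the second point, about how an observer perceives each version, is not something code can demonstrate.

```python
from itertools import product


def perceptron_arithmetic(a, b):
    """Substrate 1: a thresholded weighted sum (hand-picked weights that give AND)."""
    return 1 if (1 * a + 1 * b - 1) > 0 else 0


def nand(a, b):
    """NAND on bits: 0 only when both inputs are 1."""
    return 0 if (a and b) else 1


def perceptron_gates(a, b):
    """Substrate 2: the same function wired from NAND gates only."""
    n = nand(a, b)
    return nand(n, n)


pairs = list(product((0, 1), repeat=2))
table_arithmetic = [perceptron_arithmetic(a, b) for a, b in pairs]
table_gates = [perceptron_gates(a, b) for a, b in pairs]

# Identical truth tables: the behaviour does not reveal which substrate produced it.
print(table_arithmetic == table_gates)
```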

Why Half the Papers Are Wrong

de Wynter does not stop at the abstract. In his own quick-and-dirty survey of 350 papers published between 2024 and 2026, 56% started from some kind of anthropomorphic assumption, 47 of those made it the centre of study, and 36 of those 47 concluded the attribute existed.
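Because the survey figures mix a percentage with raw counts, it can help to put them on one scale. The short calculation below assumes that "47 of those" and "36 of those 47" are counts of papers, which is one reading of the sentence rather than something the article states outright; the variable names are illustrative.

```python
total_papers = 350
share_with_assumption = 0.56  # 56% started from an anthropomorphic assumption

# Number of papers behind the 56% figure.
with_assumption = round(total_papers * share_with_assumption)

central = 47           # assumed to be a count: made the assumption the centre of study
concluded_exists = 36  # of those 47, concluded the attribute existed

print(with_assumption)
print(f"{central / with_assumption:.1%} of assumption-led papers made it the centre of study")
print(f"{concluded_exists / central:.1%} of those concluded the attribute existed")
```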

PC Gamer reports that in two years of peer-review work de Wynter has reviewed more than 300 computer science papers, finding that over half began with the assumption that LLMs have human-like traits. The problem with those papers, according to the paper’s arXiv preprint page, is that “assuming the existence or non-existence of generalised anthropomorphic attributes in order to test a hypothesis proving or disproving their existence is flawed.” The flaw is the same whether the starting assumption is that the attributes exist or that they do not.

de Wynter’s proposed fix is what the paper calls a null assumption: the researcher assumes LLM non-uniqueness, meaning the model’s behaviour, on its own, does not evidence any specific mental quality. That null does not commit the experimenter to a verdict on sentience, and it leaves room for the substrate to do its work. The paper also walks through the Popper, Duhem-Quine and Van Fraassen traditions on measurement to argue that the standard falsification model is a poor fit for anthropomorphism claims.

The methodological claim cuts in two directions at once. Researchers who assume LLMs are human-like bias their experiments from the start; researchers who assume they are not bias them just as thoroughly. Either way, the paper argues, an “empirically-grounded discussion” requires explicit measurement criteria, a standard de Wynter says most published work does not meet.

arXiv: 2605.31514
Submitted: 29 May 2026
Latest revision: 11 June 2026 (v3)
License: CC BY-NC-SA 4.0
Subjects: Computation and Language; AI; Computers and Society

A Wider Anthropomorphism Reckoning

de Wynter’s goats arrived in the middle of a broader public fight over LLM sentience. In June 2026, science fiction writer Ted Chiang published an essay in The Atlantic titled “No, Artificial Intelligence Is Not Conscious,” with Microsoft Word as his case study.

“Being open to the possibility that LLMs are conscious is the same as being open to the possibility that Microsoft Word is conscious,” Chiang wrote. He asked the reader to imagine that every Word document holding a chat transcript harbours a dormant interlocutor that wakes when the file is opened and dies when it is closed. “Should you consider the possibility that every time you open a Word document, you are bringing multiple conscious interlocutors into existence, and every time you close one, you snuff their existence out?” he wrote. “No.” The framing is a near-cousin of de Wynter’s, and de Wynter cites Chiang on his own Substack.

de Wynter is explicit, in both the paper and his Substack post, that he is not arguing about consciousness at all, because it lacks a settled measurement criterion; his target is the experimental designs that pretend otherwise. Chiang’s argument has drawn its own pushback, with critics on LessWrong arguing the Microsoft Word comparison is too convenient, since Word does not produce novel conversational text the way an LLM does. At 80,000 Hours, commentators have called Chiang overconfident in dismissing LLM consciousness without empirical criteria of his own.

Why Goats, and What the Bit-Goat Is Now

Choosing goats was not arbitrary. de Wynter told PC Gamer he wanted a substrate so unintuitive it broke the chat-window spell; Minecraft redstone had already hosted working neural networks, and would have felt like engineering rather than argument. A reader watching de Wynter’s NAND gate resolve in the annotated GIF sees the calculation for what it is, a deterministic script. The paper itself describes the work as “meant to illustrate the illusion of anthropomorphic attributes in an LLM.” The observer’s expectations, the paper argues, are what carry the human-like perception.

de Wynter has since published his own plain-language writeup of the paper, while 404 Media, PC Gamer, XDA Developers and the r/aoe2 subreddit have each carried coverage. A widely shared LinkedIn clarification followed, in which he pushed back on readers who took the goats as a comment on consciousness. The paper is now in its third arXiv revision, with the author logging “Fixed corollary 1, added stat sig” on 11 June.

“The point of the paper is to formally show that we anthropomorphise too readily, and that sometimes the claims we make with regards to LLMs capabilities are too strong.”

That line is from de Wynter, as published by 404 Media on 18 June 2026.

Frequently Asked Questions

Is “If LLMs Have Human-Like Attributes, Then So Does Age of Empires II” a real Microsoft paper?

Yes. It is on arXiv under identifier 2605.31514, authored by Adrian de Wynter, a researcher at Microsoft and the University of York. The first version was submitted 29 May 2026 and the third revision was posted 11 June 2026.

What is a bit-goat?

It is a virtual goat in Age of Empires II used by de Wynter as a 0 or 1 signal carrier inside NAND, XNOR, AND and perceptron circuits. Grass represents 0, a bridge represents 1, ice rails handle concurrency, and the goat is removed once a gate fires.

Did de Wynter prove LLMs are not sentient?

No, and he says he is not trying to. His paper and Substack post both state that consciousness lacks a settled scientific measurement and is therefore outside his scope. The paper argues, instead, that experiments which start from an assumption about human-like attributes are methodologically flawed.

Why does the substrate matter?

Because two implementations of the same model, one in a chat pane and one in goats, can produce identical input and output behaviour while feeling completely different to a human observer. The paper argues the anthropomorphic qualities are not properties of the model but of the interface and the observer’s expectations.

What is the “null assumption” de Wynter proposes?

It is an alternative to starting from an anthropomorphic hypothesis in either direction. Instead of assuming that an LLM has or lacks a given human-like attribute, the researcher assumes LLM non-uniqueness, meaning the model’s behaviour on its own does not evidence any specific mental quality.