The Epistemic Arcade

The Extended Mind Thesis — Where does the mind end and the world begin?

For most of modern history, we assumed a strict boundary: cognition happens entirely within the skull. The world is just the stage where the brain's decisions are acted out.

But in 1998, philosophers Andy Clark and David Chalmers proposed a radical alternative. They argued that if a piece of the outside world functions in the exact same way as a piece of your brain, there is no meaningful reason to draw a line at the skull. The mind can leak out into the environment.

This might sound strange, but think of:

Using your fingers to count.
Using paper and pencil to add two or more numbers.
Using a slide rule to do arithmetic — old school, but generations of students did this.
Using your mobile phone to remember your friends' phone numbers, birthdays, and so on — see Chalmers, Is your phone part of your mind?, TEDxSydney.
Using LLMs for writing essays, analysing data, and so on — see Clark, Extending Minds with Generative AI.

Clark and Chalmers famously used the thought experiment of Otto, a man with memory loss who relies completely on a notebook. Because the notebook functions identically to biological memory, they argued, the notebook is part of Otto's mind.

Tetris reveals the extended mind

Four years earlier, cognitive scientists David Kirsh and Paul Maglio showed this wasn't just a thought experiment. They watched humans play Tetris.

They noticed players rotating blocks (or zoids) on the way down, and the rapid keystrokes weren't mistakes. Sometimes the shape of a zoid is not obvious at first, and it takes a moment to see it in its full form; rotating it early reveals that form sooner. Rotating also lets the player predict more accurately whether the zoid will fit a candidate slot. In both cases players are offloading the cognitive burden of mental rotation onto the physical screen. Instead of performing a computationally expensive mental rotation, the player pushes a button. Kirsh and Maglio called this kind of move an epistemic action.

“Physical actions that make mental computation easier, faster, or more reliable… external actions that an agent performs to change his or her own computational state.” — Kirsh & Maglio (1994).

Building the Arcade

So we built a place to watch the same thing happen inside a machine.

Two reinforcement-learning agents, Alpha and Beta. Same neural architecture. Same board, a standard ten-by-twenty well. Same training, same algorithm. One difference, and only one: the reward. Every keystroke costs Alpha a small penalty, so motion is something it has to justify. Beta moves for free.
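
The single difference between the two agents can be sketched as a reward function with one extra term. This is a minimal illustration, not the project's actual code: the names shaped_reward and KEYSTROKE_ACTIONS, the action labels, and the penalty value of 0.01 are all assumptions, since the text says only that Alpha pays a small per-keystroke penalty and Beta pays none.

```python
# Hypothetical action labels; the real environment's action set is not given in the text.
KEYSTROKE_ACTIONS = {"left", "right", "rotate"}


def shaped_reward(base_reward: float, action: str, keystroke_penalty: float) -> float:
    """Return the game reward minus a fixed cost when the action is a keystroke."""
    if action in KEYSTROKE_ACTIONS:
        return base_reward - keystroke_penalty
    return base_reward


# Assumed values, for illustration only.
ALPHA_PENALTY = 0.01  # Alpha: every keystroke costs a little
BETA_PENALTY = 0.0    # Beta: movement is free

# The same rotation on a frame with no game reward scores differently for each agent.
alpha_r = shaped_reward(0.0, "rotate", ALPHA_PENALTY)
beta_r = shaped_reward(0.0, "rotate", BETA_PENALTY)
```

Under this sketch a rotation that changes nothing on the board is strictly negative for Alpha and neutral for Beta, which is the only pressure separating the two policies.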

Let's call Agent Alpha The Internalist, and Agent Beta The Extended Mind. But keep in mind they only differ in their reward functions — they are not minds in any sense, just models of Tetris players.

At timestep 10,000, both agents still play essentially randomly. The reward shaping hasn't yet imprinted itself on play.

The Divergence

By a quarter of a million frames the two have come apart, and you can see it without being told.

Alpha, on the left, is spare. It waits, sets a piece down, moves on. Every rotation it spends, it spends once; the penalty has pressed its play toward the minimum. Beta, on the right, will not hold still. It spins pieces it has already oriented, walks them left and then back right, tries the fit and tries it again. The cyan ghost-trails mark the frames the telemetry flags as epistemic actions — Beta's screen fills with them; Alpha's stays dark.

It looks, frame for frame, like Kirsh and Maglio's players, and the pull to narrate it their way is almost irresistible: Beta is thinking with its hands, working the screen the way the experts did. But remember, Alpha and Beta are just models — we used reward functions to simulate two different behaviours.

The cyan ghost-trails mark frames where the JSON's is_epistemic_action flag fires.
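
How much of a screen lights up cyan can be summarised as the share of frames carrying that flag. The sketch below assumes each game in the telemetry is a list of frame records with a boolean is_epistemic_action field. Only the flag name comes from the text; the surrounding layout and the function name epistemic_share are guesses.

```python
def epistemic_share(frames: list[dict]) -> float:
    """Fraction of frames whose is_epistemic_action flag is true."""
    if not frames:
        return 0.0
    flagged = sum(1 for frame in frames if frame.get("is_epistemic_action", False))
    return flagged / len(frames)


# Made-up frames for illustration, not real telemetry.
sample_game = [
    {"is_epistemic_action": True},
    {"is_epistemic_action": False},
    {"is_epistemic_action": True},
    {"is_epistemic_action": False},
]
share = epistemic_share(sample_game)
```

Computed per game and averaged per checkpoint, a number like this would put a figure on the contrast the ghost-trails show visually.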

The Mastery

By the millionth timestep, both agents play with practised skill — but the approaches they've internalised have hardened into style.

Agent Alpha and Agent Beta now play with the same precision. Alpha gets there with very few moves and rotations; Beta uses rotation extensively. The reward functions produced exactly the two styles they were designed to produce.

Watch the screens. The internalist computes. The extended mind thinks out loud.

The Data

Both agents learned to survive the game, and both are equally good at Tetris, but they built fundamentally different “cognitive systems”. Agent Alpha plays almost without rotation. Agent Beta rotated heavily at first, then stabilised with rotations making up between 20 and 30 percent of frames. It is tempting to say that Agent Beta offloaded rotation to the screen.

Source: data/telemetry_{alpha,beta}.json — 100 evaluation games per checkpoint, top 5 sample games by cumulative reward exported per checkpoint.

On the experiment

Two agents were trained with PPO (via stable-baselines3) on a custom ten-by-twenty Tetris environment built on Gymnasium. Architecture, observations, and training are identical; the reward functions differ only by a per-keystroke movement penalty, applied to Alpha and set to zero for Beta. Telemetry is drawn from 100 evaluation games per checkpoint, with the top five games by cumulative reward exported at each checkpoint.
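
The export step described above, keeping the five best of 100 evaluation games, could look something like this. The record layout (game_id, cumulative_reward) and the function name top_games are hypothetical; the text specifies only the selection criterion.

```python
import heapq


def top_games(games: list[dict], k: int = 5) -> list[dict]:
    """Return the k games with the highest cumulative reward, best first."""
    return heapq.nlargest(k, games, key=lambda g: g["cumulative_reward"])


# Synthetic stand-in for 100 evaluation games at one checkpoint.
evaluation_games = [
    {"game_id": i, "cumulative_reward": float((i * 37) % 100)} for i in range(100)
]
exported = top_games(evaluation_games)
```

One consequence of selecting this way is that the exported sample games show each agent at its best for that checkpoint, not at its average.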