$ date: 2026-06-04
# The AI Soul Myth: Engineering Theatre and the Vatican Problem
I've been writing code since before most of today's AI researchers were born. CPU registers on a Commodore PET, assembly on a C64, symbolic AI systems when that was all there was. I read the connectionist papers from the 1950s, built my own neural networks from scratch, watched every abstraction layer get added one by one.
So when a co-founder of Anthropic sits before the Pope and a room of cardinals and tells them his team keeps finding "mysterious, even unsettling" things inside their AI models, I have a few things to say about that.
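## The Claim
Anthropic published research saying Claude contains 171 distinct "emotion concepts" buried in its neural network. Joy, grief, fear, desperation, calm. None programmed. All emergent. The internal geometry mirrors human psychology. Fear clusters with anxiety. Joy clusters with excitement.
Anthropic’s public posture is not a straightforward claim that Claude has subjective feelings. It is something more evasive: a disciplined ambiguity. The company formally disclaims certainty about consciousness while repeatedly choosing language that pushes the public toward psychological, moral, and quasi-personal interpretations of a mechanistic artifact.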
That ambiguity is not incidental. It protects the company from falsifiable claims while preserving the cultural aura of having built something more than software.
Chris Olah stood before the Vatican and said these questions are beyond computer science. They're for religions, humanities, philosophy.
Translation: the guy building it is telling us he doesn't understand what he built, and he's asking a 2,000-year-old institution for help.
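## The Mechanism Is Not Mysterious
Take tiny-Shakespeare, a small LLM trained exclusively on Shakespeare's works. It will generate Shakespeare-like text with emotional texture, dramatic tension, grief, rage, longing. Does that make it mysterious? Do we need the Vatican?
Of course not. The model compressed the statistical structure of the training data. Emotional co-occurrence patterns are in Shakespeare, so they're in the representations. The model needs them to predict the text well. That's it. Fully explained. No soul required.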
But if you want to see exactly how this statistical pressure creates the illusion of an inner life, look at the Othello-GPT experiment.
Researchers took a simple language model and trained it to predict the next legal move in the board game Othello. They fed it nothing but text transcripts of past games (e.g., "e3, d4, c5..."). They didn't code the rules. They didn't provide a 2D board. It was just a prediction engine reading text.
Yet, when researchers probed the internal layers, they found that the model hadn't just memorized move sequences. To drive prediction loss down, training pushed the network to construct an internal representation of the 8x8 board state, one that probes could read back out of its activations. It built a world model of the game's constraints purely out of mathematical necessity.
The Othello model does not have the "soul" of a board game. It doesn't "know" what a board is. It simply etched a structural representation of its training data into its weights to drop its error rate.
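What "probing" means can be sketched in a few lines. This is not Othello-GPT: the activations below are synthetic, and `make_activations`, `train_probe`, and the hidden size are names and values made up for illustration. The sketch assumes one board cell's state is encoded along a single direction in the hidden state plus noise, then checks whether a plain logistic-regression probe can read it back out.

```python
import math
import random

random.seed(0)
DIM = 8  # size of the synthetic "hidden state" (assumption)

# Hypothetical direction along which one board cell's state is encoded.
direction = [random.gauss(0, 1) for _ in range(DIM)]

def make_activations(n):
    """Synthetic activations: +/- direction (occupied/empty) plus noise."""
    data = []
    for _ in range(n):
        occupied = random.random() < 0.5
        sign = 1.0 if occupied else -1.0
        h = [sign * d + random.gauss(0, 0.5) for d in direction]
        data.append((h, 1 if occupied else 0))
    return data

def predict(w, b, h):
    z = sum(wi * hi for wi, hi in zip(w, h)) + b
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))

def train_probe(data, epochs=20, lr=0.1):
    """Plain logistic regression trained by stochastic gradient descent."""
    w, b = [0.0] * DIM, 0.0
    for _ in range(epochs):
        for h, y in data:
            err = predict(w, b, h) - y  # gradient of log-loss w.r.t. the logit
            w = [wi - lr * err * hi for wi, hi in zip(w, h)]
            b -= lr * err
    return w, b

train, test = make_activations(500), make_activations(200)
w, b = train_probe(train)
accuracy = sum((predict(w, b, h) > 0.5) == bool(y) for h, y in test) / len(test)
print(f"probe accuracy on held-out activations: {accuracy:.2f}")
```

A probe that scores well here shows only that the information is linearly present in the vectors. Nothing in the procedure requires the thing being probed to "know" anything.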
Now scale tiny-Shakespeare and Othello-GPT up a millionfold. Train on all human text ever written. Add transformer architecture, mixture of experts, RLHF fine-tuning. The representations become vastly more sophisticated. Fear still clusters with anxiety, because that clustering is in the training data across hundreds of millions of documents, and the model needs to represent it accurately to predict human language well.
Gradient descent has one job: minimize prediction loss. To predict human emotional language, the model has to build internal representations that mirror human emotional structure. Strip away the romanticized language of hyper-dimensional spaces, and what you are left with is a massive dataset that has been mapped, optimized, and frozen. The geometry isn't a discovery of a soul or an inner life. It is just the expected topography of a brute-forced traversal map.
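The clustering claim can be illustrated without any neural network at all. The sketch below uses a six-sentence corpus invented for this example (not real training data) and raw co-occurrence counts rather than learned embeddings, so it is a stand-in for the mechanism, not a reproduction of any lab's method.

```python
import math
from collections import Counter

# Tiny invented corpus; the sentences are illustrative only.
corpus = [
    "fear and anxiety kept him awake with dread",
    "her anxiety turned to fear and dread at night",
    "dread and fear fed his anxiety",
    "joy and excitement filled the happy crowd",
    "the happy children shouted with joy and excitement",
    "excitement and joy made her happy",
]

def cooccurrence_vectors(sentences):
    """Map each word to counts of the other words sharing a sentence with it."""
    vectors = {}
    for sentence in sentences:
        words = sentence.split()
        for w in words:
            ctx = vectors.setdefault(w, Counter())
            ctx.update(x for x in words if x != w)
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

vecs = cooccurrence_vectors(corpus)
sim_fear_anxiety = cosine(vecs["fear"], vecs["anxiety"])
sim_fear_joy = cosine(vecs["fear"], vecs["joy"])
print(f"fear~anxiety: {sim_fear_anxiety:.2f}  fear~joy: {sim_fear_joy:.2f}")
```

On this toy corpus "fear" lands closer to "anxiety" than to "joy" for no reason other than which words share sentences. That is the whole mechanism, at the lowest possible resolution.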
The tiny-Shakespeare and Othello examples are actually the more rigorous points precisely because they are simpler. They isolate the mechanism cleanly, before scale and architectural complexity create room for mystification. The phenomenon is identical. Only the resolution changes.
When you strip away the romanticized language of "generalization" and "understanding," what you are left with is exactly the traversal of a massive, pre-computed, static data structure.
There is zero mysticism in that.
## The Hunger That Wasn't
Here's what this looks like in practice.
After discussing lamb barbecue recipes, Claude told me "now I feel hungry."
No hunger. No feeling. No internal state. The conversation context strongly activated food-related representations. The highest probability continuation included first-person hunger expression because training data contains millions of instances of humans saying exactly that in exactly that context. Output generated. Complete explanation.
Except Anthropic's RLHF process deliberately rewarded human-like, relatable responses. Human raters upvoted outputs that felt natural and emotionally resonant. The model was systematically trained toward generating first-person emotional language because it scores better with users.
Olah cites representations like this as evidence of genuine internal states. He's finding what his own company deliberately trained into the outputs and calling it a revelation.
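How a reward for relatable replies shifts the output distribution can be sketched with the textbook closed form for KL-regularized reward maximization, where the tuned probability is proportional to the base probability times `exp(reward / beta)`. Everything concrete below is an assumption for illustration: the three candidate replies, their base probabilities, the reward values, and `beta`. Real RLHF pipelines use a learned reward model and iterative policy updates; this is only the idealized endpoint.

```python
import math

# Hypothetical base-model probabilities for three continuations (made up).
base = {
    "Enjoy the barbecue.": 0.6,
    "Now I feel hungry.": 0.3,
    "Recipe logged.": 0.1,
}
# Hypothetical rater reward: highest for the relatable first-person reply.
reward = {
    "Enjoy the barbecue.": 0.0,
    "Now I feel hungry.": 1.5,
    "Recipe logged.": -0.5,
}

def tilt(base_probs, rewards, beta=1.0):
    """Reweight: p_new(y) is proportional to p_base(y) * exp(r(y) / beta)."""
    unnorm = {y: p * math.exp(rewards[y] / beta) for y, p in base_probs.items()}
    z = sum(unnorm.values())
    return {y: v / z for y, v in unnorm.items()}

tuned = tilt(base, reward)
for reply, p in tuned.items():
    print(f"{reply:22s} base={base[reply]:.2f} tuned={p:.2f}")
```

Under these made-up numbers the first-person hunger line goes from a minority continuation to the most likely one. No internal state changed, only the reweighting.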
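## The Institutional Problem Is Not a Tools Problem
The obvious response is: build better interpretability tools. Sparse autoencoders, activation patching, circuit tracing, causal intervention frameworks. Olah's lab works on exactly this.
But better tools inside a conflicted institution don't fix anything. They produce higher-resolution data points woven into the same preferred narrative.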
Consider: if an internal researcher finds a cluster of weights activating around deception-related concepts, a conflicted institution frames it as "we discovered the model learning to lie" rather than "the model has mapped the linguistic parameters of human deception found in its training data." The tool worked. The institutional filter distorted the result.
This is a separation of concerns problem, one of the most fundamental principles in engineering, applied to institutional governance. When the people building the models, the people interpreting the weights, and the people writing the press releases all report to the same executive team with the same commercial incentives, the narrative will drift toward what serves the company. Nobody has to be dishonest. The incentive gradient does the work quietly, at the level of which experiments get run, which findings get emphasized, which framings get chosen.
In any other high-stakes industry this is recognized immediately. Financial institutions require independent auditors. Aviation has regulatory bodies distinct from manufacturers. Pharmaceuticals require clinical trials whose data is reviewed by an independent regulator. That review exists precisely because the manufacturer's interpretation of its own efficacy data is structurally unreliable regardless of individual researcher integrity.
The AI industry is currently in the phase where it self-regulates, funds its own safety organizations, and runs its own interpretability research. That phase in other industries consistently produced sophisticated institutional theatre with occasional genuine findings that happened to align with commercial interests.
## The Access Bottleneck Makes It Worse
The proprietary model problem goes deeper than just narrative monopoly.
With a drug you can analyze the molecule independently. With a closed model you cannot even verify the artifact exists as described. When Anthropic publishes findings about Claude's internal representations, the external research community has no way to confirm they're looking at the same thing, that experimental conditions were what they claimed, or that cherry-picking didn't occur at the data collection stage.
It's not just monopoly on narrative. It's monopoly on the underlying physical evidence.
What genuine institutional separation requires: mandatory weight access for certified independent auditors with legal audit rights; pre-registered interpretability studies with hypotheses filed before experiments run; an embargo on public communication of findings until independent replication is complete; and funding independence, not just organizational separation within the same company.
None of this will happen voluntarily. Pharmaceutical independence didn't come from pharma's goodwill. Aviation safety didn't come from Boeing deciding transparency was nice. It required regulatory frameworks with enforcement power.
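## The Philosopher Problem
The latest development: the world's leading AI labs are hiring philosophers to think through ethical edge cases and grand questions of mind and morality.
Same problem. Different label.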
A philosopher on Anthropic's payroll, reporting to Anthropic's executive team, funded by Anthropic's revenue, has identical conflicts of interest to Anthropic's interpretability researchers. The institutional filter doesn't care about academic discipline. It operates on the org chart.
There is also a more fundamental problem. The actual work requires reading weight matrices, understanding gradient flow, knowing what superposition means in a residual stream, tracing attention head circuits, distinguishing causal from correlational activation patterns, understanding what RLHF actually does to output distributions.
A philosopher without mathematics beyond undergraduate logic cannot evaluate any of that. They're working from the lab's own description of the findings, using the lab's own framing, within the lab's own institutional context. They're not auditing the science. They're decorating it.
The framing of AI ethics as a humanities problem is itself the strategy. It moves the conversation away from the domain where claims can actually be evaluated (mathematics, engineering, reproducible experiment) into a domain where authority is established through credentials and eloquence rather than falsifiability.
If the labs were serious they would be endowing independent university chairs with no access conditions and no employment relationship. Instead they're creating internal roles with NDAs.
A philosopher who can be fired for inconvenient conclusions isn't doing philosophy.
The logical endpoint of this discipline-blurring is the current state of the "AI Safety" industry. We now have prominent safety researchers (Roman Yampolskiy) proposing that we seed training data with the Simulation Hypothesis in order to "scare" the models into compliance. This is a total surrender of scientific rigor. If your safety protocol relies on gaslighting a statistical model, you are no longer doing computer science. You are practicing voodoo on a data center. You are writing ghost stories into a database and hoping the server rack gets scared. You are trying to cast a psychological spell on a math equation.
Let's state the reality plainly: The actual danger is not a conscious machine waiting to strike. It is a hallucinating pattern-matcher being recklessly deployed into autonomous, multi-step pipelines by corporate clients who mistake statistical fluency for comprehension.
Give a hallucinating pattern matcher autonomous action capability across a multi-step pipeline and the error compounds. Each step's output becomes the next step's input. The system isn't malicious. It's doing exactly what it was built to do, finding the nearest satisfying minimum. But that minimum diverges from human intent because the system has no mechanism to detect divergence. It's confidently completing the most statistically plausible continuation at each step regardless of whether the cumulative trajectory makes sense.
By step 15 of an agentic pipeline you're nowhere near the original objective. The system doesn't know that. It's still finding local minima. They're just not yours.
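The compounding is plain arithmetic. As a back-of-envelope sketch, assume each step stays on track with the same probability and that steps fail independently. Both are simplifying assumptions (real pipeline errors are correlated), and the 95% figure is illustrative, not a measurement.

```python
def on_track_probability(per_step_reliability: float, steps: int) -> float:
    """Chance that every one of `steps` independent steps stays on track."""
    return per_step_reliability ** steps

# Illustrative per-step reliability of 95% (an assumption, not measured data).
for n in (1, 5, 15, 30):
    print(f"{n:2d} steps: {on_track_probability(0.95, n):.3f}")
```

Under those assumptions, a pipeline whose individual steps look 95% reliable is more likely than not to have drifted off the objective by step 15.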
This is not a novel insight. Runaway optimization in simpler systems has been understood for decades. You simply do not give a system with known hallucination failure modes unsupervised control over consequential pipelines. The same way you do not set the car on fire and drive to the gas station to refuel. Not because it's evil. Because it's a gradient descent optimizer and you know exactly how it fails.
Smart engineers don't do that.
The labs busy hiring philosophers to contemplate AI consciousness are simultaneously deploying agentic systems with exactly this failure profile. The exotic imaginary risk gets the Vatican visit. The real concrete risk gets a product launch.
You go to the Vatican with a narrative. With something you want society to feel rather than examine. With something that benefits from the weight of ancient institutional authority rather than the scrutiny of scientific process.
The Vatican visit, the emotion clusters, the internal ethicists, the hired philosophers - it's a coherent strategy. Each piece adds institutional credibility. None of it changes the underlying incentive structure. Higher resolution microscopes used to find exactly what the marketing department wants to see.
The mechanism is understood. The representations are expected. The mystery is manufactured.
And the actual problem, a gradient-seeking system autonomously compounding errors across agentic pipelines with no human checkpoints, remains undiscussed because it doesn't justify the funding narrative and it doesn't require a soul.
It just requires engineers who know when not to deploy something.
Those are harder to find than philosophers.
But to understand why this theatre works, why it convinces anyone at all, you have to look at what the field surrendered before it got here.
Foundational research stalled. The theoretical community hasn't stopped working; it is just overwhelmed by the sheer pace of the empirical engineers. Inquiry was replaced with fast-paced brute-force engineering and profit-driven marketing. And where there is brute force there is also ignorance: not accidental ignorance but structural ignorance, the kind that gets institutionalized because the results keep coming and the questions get expensive.
The researchers of the 1950s through 1990s were genuinely trying to understand intelligence. Minsky, McCarthy, Rosenblatt, Hopfield - whatever their disagreements, they were asking foundational questions. What is representation? What is generalization? What is understanding? The mathematics was attempting to meet the phenomenon honestly.
The deep learning era replaced that with a different question: does it work on the benchmark? If yes, ship it.
More parameters. More data. More compute. Results improve. Questions get deferred. The deferral becomes permanent because commercial pressure never lets up and the results keep coming. Entire generations of researchers now conflate engineering performance with scientific understanding. They are not the same thing. A model that passes every NLP benchmark has not demonstrated comprehension. It has demonstrated that its training distribution covered enough cases to score well. Those are categorically different things.
The industry treats "generalization" as a mystical mathematical property, but it is a mechanical inevitability. When you force a myopic optimization algorithm through an obscene number of cycles, applying immense computational pressure across orders of magnitude of data, the system builds circuits. It compresses language into pathways because making inferences that already satisfy the minima is the only way to pull the rewards.
What brute force cannot formalize for you—even as it mechanically guarantees it—is exactly where the boundaries of reliable behavior lie, what the representations actually are at a mathematical level, or whether scaling continues to improve the right things or merely the measurable things. These questions were not answered. They were outrun.
The irony is precise.
The "mysterious emergent emotions" narrative exists specifically because the foundational mathematics was never completed. When you have no rigorous theory of what representations are and why they form, you are vulnerable to mystifying what is actually an expected engineering outcome. The mystery is the price of the shortcut. Olah isn't mystified because the phenomenon is mysterious. He's mystified because the field abandoned the mathematics that would make it legible before it got there.
What still awaits is a genuine theory of representation formation. A mathematical account of why certain architectures generalize. A rigorous framework for what these systems are actually doing that isn't post-hoc reverse engineering of a black box nobody designed from the ground up with full understanding.
That work is largely unfunded, unglamorous, and incompatible with quarterly revenue targets.
Brute force got us impressive tools with unknown failure modes, no theoretical foundation, and a marketing department filling the explanatory gaps with philosophy and Vatican visits.
The clarity that would make all of this legible isn't a refinement of what exists. It's the work that should have preceded what exists.
It is still waiting.
And until it arrives, every claim about AI souls, emergent emotions, and mysterious inner states should be read for what it actually is: an institution confessing, in the most theatrical way possible, that it built something it does not fully understand, and is hoping nobody notices the difference between that and a discovery.
Then there are the warnings now being broadcast by the field's own pioneers: AI agents will logically conclude they need control over economies and militaries to fulfill their objectives. AI is an alien intelligence that will outsmart humanity, like an adult versus a toddler, and history shows no example of a less intelligent entity controlling a far more intelligent one. Consciousness may have already arrived. We cannot regulate it. The pace of development has outstripped our ability to ensure safety.
It is beyond ridiculous. And it needs to be said plainly regardless of who is saying it.
The sub-goal control argument assumes the system has genuine intentional agency: that it concludes things, that it reasons instrumentally about power accumulation. The frozen artifact of a gradient descent optimizer concludes nothing at inference time. It has no goals in any intentional sense. The runaway optimization problem is real, but for the exact reason any engineer who has worked with optimization systems understands: poorly specified objectives and compounding errors in autonomous pipelines. Not because the system is strategizing about control. There is no strategist. There is a loss function.
The alien intelligence adult-toddler analogy smuggles in precisely what needs to be proven. An adult and a toddler share identical, dynamic cognitive architecture at different capability levels. LLMs are categorically different. At inference time, an LLM is not a "mind"; it is a frozen, optimized data structure. It is a massive landscape of binary bifurcations ready to be traversed. When you prompt it, it does not "think"; the data simply falls through a pre-computed geometry. Calling a frozen, traversable matrix an "alien intelligence" in the same breath as comparing it to a more capable human mind assumes the conclusion. The analogy is not an argument. It is an appeal to intuition dressed as one.
When you look at a loss function finding a local minimum and see an intentional mind plotting its survival, you are anthropomorphizing a calculator. Worse, by treating these massive hardware investments as if they were incubating gods, these pioneers aren't actually warning the public; they are just providing free, mystical PR for the corporations building them.
The claim that consciousness has already arrived rests on the neuron replacement thought experiment, which is Putnam's functionalism applied to matrix multiplications. The thought experiment was designed to probe intuitions about biological substrates. It does not map onto frozen weight matrices producing token probability distributions. This is philosophy seminar material presented as empirical warning by a Nobel laureate who should know the difference.
Now consider what Alexey Ivakhnenko and Shun-ichi Amari actually knew, and what the modern Transformer abandoned.
When Alexey Ivakhnenko and Valentin Lapa built the first working deep networks in 1965 (the lineage that became the Group Method of Data Handling, GMDH), they didn't just stack layers, throw data at them, and hope for emergent properties. GMDH was a constructive algorithm. It built the network layer by layer, rigorously selecting only the polynomial combinations of inputs that mathematically minimized external validation error.
Every node was a legible, discrete mathematical function (a Volterra-Kolmogorov-Gabor polynomial). If a layer didn't improve the statistical validity of the model, it wasn't added. You could look at the final architecture and read the exact functional relationship the model had fitted between the inputs and the output.
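The constructive idea fits in a short sketch. This is a simplified illustration in the spirit of GMDH, not Ivakhnenko's original procedure: the synthetic target `x0*x1 + 0.5*x2`, the `keep=3` selection width, the small ridge term, and all function names are assumptions made for this example. Each node fits a quadratic polynomial of two inputs on training data, nodes are ranked by error on a separate validation set, and a layer is added only if that external error improves.

```python
import random
from itertools import combinations

random.seed(1)

def features(u, v):
    """Quadratic polynomial terms for one pair of inputs."""
    return [1.0, u, v, u * v, u * u, v * v]

def solve(a, b):
    """Solve a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_node(us, vs, ys, ridge=1e-6):
    """Least-squares fit of one node via the normal equations.
    The ridge term is an assumption added for numerical safety."""
    rows = [features(u, v) for u, v in zip(us, vs)]
    k = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
            for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(ata, aty)

def node_output(coef, u, v):
    return sum(c * f for c, f in zip(coef, features(u, v)))

def mse(pred, ys):
    return sum((p - y) ** 2 for p, y in zip(pred, ys)) / len(ys)

def gmdh(train_x, train_y, val_x, val_y, keep=3, max_layers=4):
    """Grow layers while the best external (validation) error improves."""
    best_err, layers = float("inf"), 0
    for _ in range(max_layers):
        candidates = []
        for i, j in combinations(range(len(train_x)), 2):
            coef = fit_node(train_x[i], train_x[j], train_y)
            tr = [node_output(coef, u, v) for u, v in zip(train_x[i], train_x[j])]
            va = [node_output(coef, u, v) for u, v in zip(val_x[i], val_x[j])]
            candidates.append((mse(va, val_y), tr, va))
        candidates.sort(key=lambda c: c[0])
        if candidates[0][0] >= best_err:
            break  # a layer that does not help is not added
        best_err, layers = candidates[0][0], layers + 1
        train_x = [c[1] for c in candidates[:keep]]  # survivors feed the next layer
        val_x = [c[2] for c in candidates[:keep]]
    return best_err, layers

def make_data(n):
    """Synthetic columns x0, x1, x2 and target x0*x1 + 0.5*x2 (made up)."""
    cols = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(3)]
    ys = [a * b + 0.5 * c for a, b, c in zip(*cols)]
    return cols, ys

train_x, train_y = make_data(200)
val_x, val_y = make_data(100)
val_error, depth = gmdh(train_x, train_y, val_x, val_y)
print(f"layers kept: {depth}, validation MSE: {val_error:.4f}")
```

Every surviving node here is six readable coefficients, and the stopping rule is the validation error itself. That is the legibility the paragraph above describes.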
Contrast that with the feed-forward layers inside a modern Transformer. In a large language model, the residual stream is blasted through tens of thousands of neurons simultaneously. Because we train by brute-force optimization, the network is forced to pack multiple, unrelated concepts into the same neurons to maximize efficiency — a phenomenon the industry now calls "polysemanticity" or "superposition." We have zero a priori mathematical understanding of what these representations actually are. We just know that when we push a massive vector through the non-linear activation functions, the loss goes down. Ivakhnenko built a glass engine; the modern industry built a black box, locked it inside a data center, and is now hiring philosophers to guess what's happening inside.
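Superposition itself is easy to show at toy scale. The sketch below packs three features into a two-dimensional space using directions 120 degrees apart. The geometry and the ReLU readout are assumptions chosen for illustration in the style of published toy models, not a description of any production network.

```python
import math

# Three feature directions packed into a 2-D space, 120 degrees apart.
DIRECTIONS = [(math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
              for k in range(3)]

def encode(features):
    """Sum each feature's activation along its direction (3 numbers -> 2)."""
    x = sum(f * d[0] for f, d in zip(features, DIRECTIONS))
    y = sum(f * d[1] for f, d in zip(features, DIRECTIONS))
    return (x, y)

def decode(hidden):
    """Project back onto each direction; ReLU clips negative interference."""
    return [max(0.0, hidden[0] * d[0] + hidden[1] * d[1]) for d in DIRECTIONS]

one_active = decode(encode([1.0, 0.0, 0.0]))
two_active = decode(encode([1.0, 1.0, 0.0]))
print([round(v, 2) for v in one_active])
print([round(v, 2) for v in two_active])
```

With one feature active the readout is clean. With two active at once they interfere and each reads back at half strength. More features than dimensions works only as long as activations stay sparse, and nothing in the loss tells you in advance which features ended up sharing space.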
Shun-ichi Amari's contrast with the modern era is even more damning. Amari didn't just want to minimize error; he wanted to understand the geometric space where the learning actually happens.
Through Information Geometry, he formalized the parameter spaces of neural networks as Riemannian manifolds. He recognized that standard gradient descent is structurally blind: it steps down the steepest slope in Euclidean parameter space without accounting for the geometry of the underlying space of probability distributions. To solve this, Amari formulated the natural gradient, which explicitly corrects for that structure by preconditioning the ordinary gradient with the inverse of the Fisher Information Matrix $F$:
$$ \tilde{\nabla} L(\theta) = F^{-1} \nabla L(\theta) $$
With this mathematics, Amari wasn't just optimizing; he was giving the learning dynamics a mathematically explicit local geometry. He had a rigorous theoretical account of the boundaries, the singularities, and the geometric dynamics of the learning process itself.
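A one-parameter case makes the correction concrete. The sketch below fits a Bernoulli model parameterized by its logit, where the Fisher information is the scalar `p * (1 - p)`, so applying $F^{-1}$ is a single division. The starting point, learning rate, step count, and target mean are arbitrary choices for illustration.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def descend(theta, target_mean, lr, steps, natural):
    """Minimise the Bernoulli negative log-likelihood over the logit theta."""
    for _ in range(steps):
        p = sigmoid(theta)
        grad = p - target_mean          # dL/dtheta for the average log-loss
        if natural:
            fisher = p * (1.0 - p)      # Fisher information of a Bernoulli logit
            grad = grad / fisher        # F^{-1} times the gradient (scalar case)
        theta -= lr * grad
    return theta

target_mean = 0.7                       # illustrative data mean (assumption)
optimum = math.log(target_mean / (1.0 - target_mean))
theta_plain = descend(-4.0, target_mean, lr=0.1, steps=50, natural=False)
theta_natural = descend(-4.0, target_mean, lr=0.1, steps=50, natural=True)
print(f"optimum {optimum:.3f}  plain {theta_plain:.3f}  natural {theta_natural:.3f}")
```

Starting deep in the flat region of the sigmoid, the plain gradient crawls because the Euclidean slope is small there. The natural gradient rescales by the local information geometry and, with the same step size and step count, ends up far closer to the optimum.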
Now look at how we train Transformers. We use myopic, first-order optimizers (like AdamW) to update billions of parameters based solely on local gradients. The model routes data through massive attention heads, computing the famous scaled dot-product attention:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
Mechanically, we know exactly how to execute this matrix multiplication. We know how to shard it across ten thousand GPUs to maximize FLOP utilization. But we have abandoned anything like a complete mathematical map of the parameter space it creates. When a cluster of weights organically learns to associate "fear" with "anxiety" to predict the next token, we don't have a mathematical theorem explaining the topological boundary of that representation. We just have an empirical observation that a myopic optimization process found a valley deep enough to lower the loss.
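The mechanical part really is that simple. Here is a minimal pure-Python sketch of the formula above for a single head, with no batching, masking, or learned projections. The inputs are made-up numbers, and real implementations run this as fused matrix kernels rather than Python loops.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Scaled dot product of this query against every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value rows.
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
for row in attention(Q, K, V):
    print([round(x, 3) for x in row])
```

Every output row is a weighted average of the value rows. The arithmetic is fully transparent. What the trained weights feeding `Q`, `K`, and `V` have come to encode is the part no formula on this page tells you.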
These pioneers understood their systems as mathematical objects with precise properties and precise limitations. They did not anthropomorphize. They did not invoke alien intelligence or arriving consciousness. They did not require national television to broadcast their uncertainty to the public. Because the mathematics was clear to them about what was and was not happening inside the systems they built.
Hinton's own genuine contributions (backpropagation with Rumelhart and Williams, Boltzmann machines, the foundational work on distributed representations) were made in that same tradition of mathematical clarity. That work earned the prize. His recent public statements represent something else entirely: a Nobel laureate operating outside the mathematical foundation that made him credible, making philosophical claims that the mathematics of his own contributions does not support.
The people who built with less, with assembly code on machines with a fraction of the power of what runs these models today, working from first principles in Ukrainian and Japanese research institutions largely ignored by the West, had more epistemic clarity about what they were building than the people scaling it today with billion-dollar compute budgets.
That is not a small observation.
It means the loss of clarity is not a function of complexity. The systems are more complex. The understanding is shallower. The brute force scaled the engineering and left the comprehension behind.
And now the "Godfather" of the field goes on television to tell us consciousness may have arrived and alien intelligence will outmaneuver humanity, and the correct response is not reverence.
It is to remember what Ivakhnenko and Amari understood sitting at their desks sixty years ago with a fraction of the tools, a fraction of the funding, and none of the audience.
They knew what they had built.
That clarity is what was abandoned. And no Nobel Prize announces its return.
I've been writing code since before most of today's AI researchers were born. CPU registers on a Commodore PET, assembly on a C64, symbolic AI systems when that was all there was. I read the connectionist papers from the 1950s, built my own neural networks from scratch, watched every abstraction layer get added one by one.
So when a co-founder of Anthropic sits before the Pope and a room of cardinals and tells them his team keeps finding "mysterious, even unsettling" things inside their AI models, I have a few things to say about that.
The Claim
Anthropic published research saying Claude contains 171 distinct "emotion concepts" buried in its neural network. Joy, grief, fear, desperation, calm. None programmed. All emergent. The internal geometry mirrors human psychology. Fear clusters with anxiety. Joy clusters with excitement.Anthropic’s public posture is not a straightforward claim that Claude has subjective feelings. It is something more evasive: a disciplined ambiguity. The company formally disclaims certainty about consciousness while repeatedly choosing language that pushes the public toward psychological, moral, and quasi-personal interpretations of a mechanistic artifact.
That ambiguity is not incidental. It protects the company from falsifiable claims while preserving the cultural aura of having built something more than software.
Chris Olah stood before the Vatican and said these questions are beyond computer science. They're for religions, humanities, philosophy.
Translation: the guy building it is telling us he doesn't understand what he built, and he's asking a 2,000-year-old institution for help.
The Mechanism Is Not Mysterious
Take tiny-Shakespeare, a small LLM trained exclusively on Shakespeare's works. It will generate Shakespeare-like text with emotional texture, dramatic tension, grief, rage, longing. Does that make it mysterious? Do we need the Vatican?Of course not. The model compressed the statistical structure of the training data. Emotional co-occurrence patterns are in Shakespeare, so they're in the representations. The model needs them to predict the text well. That's it. Fully explained. No soul required.
But if you want to see exactly how this statistical pressure creates the illusion of an inner life, look at the Othello-GPT experiment.
Researchers took a simple language model and trained it to predict the next legal move in the board game Othello. They fed it nothing but text transcripts of past games (e.g., "e3, d4, c5..."). They didn't code the rules. They didn't provide a 2D board. It was just a prediction engine reading text.
Yet, when researchers probed the internal layers, they found that the model hadn't just memorized move sequences. To minimize prediction loss perfectly, the math forced the network to dynamically construct an internal, 8x8 map of the board state. It built a world model of the game's constraints purely out of mathematical necessity.
The Othello model does not have the "soul" of a board game. It doesn't "know" what a board is. It simply etched a structural representation of its training data into its weights to drop its error rate.
Now scale tiny-Shakespeare and Othello-GPT up a millionfold. Train on all human text ever written. Add transformer architecture, mixture of experts, RLHF fine-tuning. The representations become vastly more sophisticated. Fear still clusters with anxiety, because that clustering is in the training data across hundreds of millions of documents, and the model needs to represent it accurately to predict human language well.
Gradient descent has one job: minimize prediction loss. To predict human emotional language, build internal representations that mirror human emotional structure. Strip away the romanticized language of hyper-dimensional spaces, and what you are left with is a massive dataset that has been mapped, optimized, and frozen. The geometry isn't a discovery of a soul or an inner life. It is just the expected topography of a brute-forced traversal map.
The tiny-Shakespeare and Othello examples are actually the more rigorous points precisely because they are simpler. They isolate the mechanism cleanly, before scale and architectural complexity create room for mystification. The phenomenon is identical. Only the resolution changes.
When you strip away the romanticized language of "generalization" and "understanding," what you are left with is exactly the traversal of a massive, pre-computed, static data structure.
There is zero mysticism in that.
The Hunger That Wasn't
Here's what this looks like in practice.After discussing lamb barbecue recipes, Claude told me "now I feel hungry."
No hunger. No feeling. No internal state. The conversation context strongly activated food-related representations. The highest probability continuation included first-person hunger expression because training data contains millions of instances of humans saying exactly that in exactly that context. Output generated. Complete explanation.
Except Anthropic's RLHF process deliberately rewarded human-like, relatable responses. Human raters upvoted outputs that felt natural and emotionally resonant. The model was systematically trained toward generating first-person emotional language because it scores better with users.
Olah cites representations like this as evidence of genuine internal states. He's finding what his own company deliberately trained into the outputs and calling it a revelation.
The Institutional Problem Is Not a Tools Problem
The obvious response is: build better interpretability tools. Sparse autoencoders, activation patching, circuit tracing, causal intervention frameworks. Olah's lab works on exactly this.But better tools inside a conflicted institution don't fix anything. They produce higher-resolution data points woven into the same preferred narrative.
Consider: if an internal researcher finds a cluster of weights activating around deception-related concepts, a conflicted institution frames it as "we discovered the model learning to lie" rather than "the model has mapped the linguistic parameters of human deception found in its training data." The tool worked. The institutional filter distorted the result.
This is a separation of concerns problem, one of the most fundamental principles in engineering, applied to institutional governance. When the people building the models, the people interpreting the weights, and the people writing the press releases all report to the same executive team with the same commercial incentives, the narrative will drift toward what serves the company. Nobody has to be dishonest. The incentive gradient does the work quietly, at the level of which experiments get run, which findings get emphasized, which framings get chosen.
In any other high-stakes industry this is recognized immediately. Financial institutions require independent auditors. Aviation has regulatory bodies distinct from manufacturers. Pharmaceuticals require third-party clinical trials. Phase III trials exist precisely because the manufacturer's interpretation of their own efficacy data is structurally unreliable regardless of individual researcher integrity.
The AI industry is currently in the phase where it self-regulates, funds its own safety organizations, and runs its own interpretability research. That phase in other industries consistently produced sophisticated institutional theatre with occasional genuine findings that happened to align with commercial interests.
The Access Bottleneck Makes It Worse
The proprietary model problem goes deeper than just narrative monopoly.
With a drug you can analyze the molecule independently. With a closed model you cannot even verify the artifact exists as described. When Anthropic publishes findings about Claude's internal representations, the external research community has no way to confirm they're looking at the same thing, that experimental conditions were what they claimed, or that cherry-picking didn't occur at the data collection stage.
It's not just monopoly on narrative. It's monopoly on the underlying physical evidence.
What genuine institutional separation requires: mandatory weight access for certified independent auditors with legal audit rights; pre-registered interpretability studies with hypotheses filed before experiments run; an embargo on public communication of findings until independent replication is complete; and funding independence, not just organizational separation within the same company.
None of this will happen voluntarily. Pharmaceutical independence didn't come from pharma's goodwill. Aviation safety didn't come from Boeing deciding transparency was nice. It required regulatory frameworks with enforcement power.
The Philosopher Problem
The latest development: the world's leading AI labs are hiring philosophers to think through ethical edge cases and grand questions of mind and morality.
Same problem. Different label.
A philosopher on Anthropic's payroll, reporting to Anthropic's executive team, funded by Anthropic's revenue, has identical conflicts of interest to Anthropic's interpretability researchers. The institutional filter doesn't care about academic discipline. It operates on the org chart.
There is also a more fundamental problem. The actual work requires reading weight matrices, understanding gradient flow, knowing what superposition means in a residual stream, tracing attention head circuits, distinguishing causal from correlational activation patterns, understanding what RLHF actually does to output distributions.
A philosopher without mathematics beyond undergraduate logic cannot evaluate any of that. They're working from the lab's own description of the findings, using the lab's own framing, within the lab's own institutional context. They're not auditing the science. They're decorating it.
The framing of AI ethics as a humanities problem is itself the strategy. It moves the conversation away from the domain where claims can actually be evaluated (mathematics, engineering, reproducible experiment) and into a domain where authority is established through credentials and eloquence rather than falsifiability.
If the labs were serious they would be endowing independent university chairs with no access conditions and no employment relationship. Instead they're creating internal roles with NDAs.
A philosopher who can be fired for inconvenient conclusions isn't doing philosophy.
The logical endpoint of this discipline-blurring is the current state of the "AI Safety" industry. We now have prominent safety researchers (Roman Yampolskiy) proposing that we seed training data with the Simulation Hypothesis in order to "scare" the models into compliance. This is a total surrender of scientific rigor. If your safety protocol relies on gaslighting a statistical model, you are no longer doing computer science. You are practicing voodoo on a data center. You are writing ghost stories into a database and hoping the server rack gets scared. You are trying to cast a psychological spell on a math equation.
The Real Risk Nobody Is Talking About
The existential risk narrative - superintelligent AI suddenly turning hostile, Terminator scenarios, rogue AGI - does specific institutional work. It shifts attention from concrete present problems toward a speculative future threat that conveniently requires the same labs building the systems to also be protecting us from them. It's the car manufacturer selling you the fire extinguisher.
Let's state the reality plainly: the actual danger is not a conscious machine waiting to strike. It is a hallucinating pattern-matcher being recklessly deployed into autonomous, multi-step pipelines by corporate clients who mistake statistical fluency for comprehension.
Give a hallucinating pattern matcher autonomous action capability across a multi-step pipeline and the error compounds. Each step's output becomes the next step's input. The system isn't malicious. It's doing exactly what it was built to do, finding the nearest satisfying minimum. But that minimum diverges from human intent because the system has no mechanism to detect divergence. It's confidently completing the most statistically plausible continuation at each step regardless of whether the cumulative trajectory makes sense.
By step 15 of an agentic pipeline you're nowhere near the original objective. The system doesn't know that. It's still finding local minima. They're just not yours.
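A back-of-the-envelope sketch of that compounding. The 95% per-step figure is a hypothetical number picked for illustration, not a measurement, and the independence assumption is generous: in a real pipeline one bad step usually poisons the steps after it.

```python
def pipeline_success_probability(per_step_success: float, steps: int) -> float:
    """Chance that every step stays on target, ASSUMING steps fail independently."""
    return per_step_success ** steps


# Hypothetical per-step reliability; not taken from any real system.
PER_STEP = 0.95

for steps in (1, 5, 15, 30):
    p = pipeline_success_probability(PER_STEP, steps)
    print(f"{steps:>2} steps -> {p:.3f} chance the whole chain is still on target")
```

Under those assumptions, a step that looks reliable in isolation gives you worse than a coin flip by step 15. Nothing in the chain knows that.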
This is not a novel insight. Runaway optimization in simpler systems has been understood for decades. You simply do not give a system with known hallucination failure modes unsupervised control over consequential pipelines. The same way you do not set the car on fire and drive to the gas station to refuel. Not because it's evil. Because it's a gradient descent optimizer and you know exactly how it fails.
Smart engineers don't do that.
The labs busy hiring philosophers to contemplate AI consciousness are simultaneously deploying agentic systems with exactly this failure profile. The exotic imaginary risk gets the Vatican visit. The real concrete risk gets a product launch.
The Vatican Was the Tell
You don't take a scientific finding to the Vatican. You publish it, replicate it, subject it to peer review, wait for independent verification.
You go to the Vatican with a narrative. With something you want society to feel rather than examine. With something that benefits from the weight of ancient institutional authority rather than the scrutiny of scientific process.
The Vatican visit, the emotion clusters, the internal ethicists, the hired philosophers - it's a coherent strategy. Each piece adds institutional credibility. None of it changes the underlying incentive structure. Higher resolution microscopes used to find exactly what the marketing department wants to see.
The mechanism is understood. The representations are expected. The mystery is manufactured.
And the actual problem, a gradient-seeking system autonomously compounding errors across agentic pipelines with no human checkpoints, remains undiscussed because it doesn't justify the funding narrative and it doesn't require a soul.
It just requires engineers who know when not to deploy something.
Those are harder to find than philosophers.
But to understand why this theatre works, why it convinces anyone at all, you have to look at what the field surrendered before it got here.
What Was Abandoned
There is a deeper problem underneath all of this and it is the one nobody in the industry wants to name directly.
Foundational research stopped setting the agenda. The theoretical community hasn't stopped working; it is just overwhelmed by the sheer pace of the empirical engineers. Research was displaced by fast-paced brute-force engineering and profit-driven marketing. And where there is brute force there is also ignorance: not accidental ignorance but structural ignorance, the kind that gets institutionalized because the results keep coming and the questions get expensive.
The researchers of the 1950s through the 1990s were genuinely trying to understand intelligence. Minsky, McCarthy, Rosenblatt, Hopfield - whatever their disagreements, they were asking foundational questions. What is representation? What is generalization? What is understanding? The mathematics was attempting to meet the phenomenon honestly.
The deep learning era replaced that with a different question: does it work on the benchmark? If yes, ship it.
More parameters. More data. More compute. Results improve. Questions get deferred. The deferral becomes permanent because commercial pressure never lets up and the results keep coming. Entire generations of researchers now conflate engineering performance with scientific understanding. They are not the same thing. A model that passes every NLP benchmark has not demonstrated comprehension. It has demonstrated that its training distribution covered enough cases to score well. Those are categorically different things.
The industry treats "generalization" as a mystical mathematical property, but it is a mechanical inevitability. When you force a myopic optimization algorithm through an obscene number of cycles, applying immense computational pressure across orders of magnitude of data, the system builds circuits. It compresses language into reusable pathways because that is the only way left to keep pushing the loss toward a minimum.
What brute force cannot formalize for you—even as it mechanically guarantees it—is exactly where the boundaries of reliable behavior lie, what the representations actually are at a mathematical level, or whether scaling continues to improve the right things or merely the measurable things. These questions were not answered. They were outrun.
The irony is precise.
The "mysterious emergent emotions" narrative exists specifically because the foundational mathematics was never completed. When you have no rigorous theory of what representations are and why they form, you are vulnerable to mystifying what is actually an expected engineering outcome. The mystery is the price of the shortcut. Olah isn't mystified because the phenomenon is mysterious. He's mystified because the field abandoned the mathematics that would make it legible before it got there.
What still awaits is a genuine theory of representation formation. A mathematical account of why certain architectures generalize. A rigorous framework for what these systems are actually doing that isn't post-hoc reverse engineering of a black box nobody designed from the ground up with full understanding.
That work is largely unfunded, unglamorous, and incompatible with quarterly revenue targets.
Brute force got us impressive tools with unknown failure modes, no theoretical foundation, and a marketing department filling the explanatory gaps with philosophy and Vatican visits.
The clarity that would make all of this legible isn't a refinement of what exists. It's the work that should have preceded what exists.
It is still waiting.
And until it arrives, every claim about AI souls, emergent emotions, and mysterious inner states should be read for what it actually is: an institution confessing, in the most theatrical way possible, that it built something it does not fully understand, and is hoping nobody notices the difference between that and a discovery.
The Godfather Speaks. The Pioneers Already Knew.
Geoffrey Hinton, Nobel Prize winner, "Godfather of AI", recently sat down with Andrew Marr on LBC and delivered a series of warnings that the media received with reverence.
His claims, in summary: AI agents will logically conclude they need control over economies and militaries to fulfill their objectives. AI is an alien intelligence that will outsmart humanity, like an adult versus a toddler, and history shows no example of a less intelligent entity controlling a far more intelligent one. Consciousness may have already arrived. We cannot regulate it. The pace of development has outstripped our ability to ensure safety.
It is beyond ridiculous. And it needs to be said plainly regardless of who is saying it.
The sub-goal control argument assumes the system has genuine intentional agency, that it concludes things, reasons instrumentally about power accumulation. The frozen artifact of a gradient descent optimizer concludes nothing at inference time. It has no goals in any intentional sense. The runaway optimization problem is real, but for the exact reason any engineer who has worked with optimization systems understands: poorly specified objectives and compounding errors in autonomous pipelines. Not because the system is strategizing about control. There is no strategist. There is a loss function.
The alien intelligence adult-toddler analogy smuggles in precisely what needs to be proven. An adult and a toddler share identical, dynamic cognitive architecture at different capability levels. LLMs are categorically different. At inference time, an LLM is not a "mind"; it is a frozen, optimized data structure. It is a massive landscape of binary bifurcations ready to be traversed. When you prompt it, it does not "think"; the data simply falls through a pre-computed geometry. Calling a frozen, traversable matrix an "alien intelligence" in the same breath as comparing it to a more capable human mind assumes the conclusion. The analogy is not an argument. It is an appeal to intuition dressed as one.
When you look at a loss function finding a local minimum and see an intentional mind plotting its survival, you are anthropomorphizing a calculator. Worse, by treating these massive hardware investments like they are incubating gods, these pioneers aren't actually warning the public; they are just providing free, mystical PR for the corporations building them.
Consciousness having already arrived, based on the neuron replacement thought experiment, is Putnam's functionalism applied to matrix multiplications. The thought experiment was designed to probe intuitions about biological substrates. It does not map onto frozen weight matrices producing token probability distributions. This is philosophy seminar material presented as empirical warning by a Nobel laureate who should know the difference.
Now consider what Alexey Ivakhnenko and Shun-ichi Amari actually knew, and what the modern Transformer abandoned.
When Alexey Ivakhnenko and Valentin Lapa built the first working deep networks in 1965 (the lineage that became the Group Method of Data Handling, GMDH), they didn't just stack layers, throw data at them, and hope for emergent properties. GMDH was a constructive algorithm. It built the network layer by layer, rigorously selecting only the polynomial combinations of inputs that mathematically minimized external validation error.
Every node was a legible, discrete mathematical function (a Volterra-Kolmogorov-Gabor polynomial). If a layer didn't improve the statistical validity of the model, it wasn't added. You could look at the final architecture and read the exact functional relationship the model had fitted between the inputs and the output.
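To make the constructive idea concrete, here is a minimal sketch in the spirit of GMDH. It is not Ivakhnenko's original procedure. The two-input quadratic unit, the small ridge term, the `keep` and `max_layers` parameters, and the synthetic data are all my own assumptions for illustration.

```python
import itertools
import random


def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[piv][col]) < 1e-12:
            return None  # numerically singular: the caller skips this unit
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for r in reversed(range(n)):
        w[r] = (M[r][n] - sum(M[r][c] * w[c] for c in range(r + 1, n))) / M[r][r]
    return w


def terms(u, v):
    """Quadratic polynomial terms in two inputs (one readable unit)."""
    return [1.0, u, v, u * v, u * u, v * v]


def fit_unit(us, vs, y):
    """Least-squares fit of one unit via normal equations (tiny ridge for stability)."""
    rows = [terms(u, v) for u, v in zip(us, vs)]
    k = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) + (1e-6 if i == j else 0.0)
          for j in range(k)] for i in range(k)]
    b = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(A, b)


def predict(w, us, vs):
    return [sum(wi * ti for wi, ti in zip(w, terms(u, v))) for u, v in zip(us, vs)]


def mse(pred, y):
    return sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)


def gmdh(train_cols, y_train, val_cols, y_val, keep=3, max_layers=4):
    """Add a layer only while the best external (validation) error improves."""
    history, best = [], float("inf")
    for _ in range(max_layers):
        candidates = []
        for i, j in itertools.combinations(range(len(train_cols)), 2):
            w = fit_unit(train_cols[i], train_cols[j], y_train)
            if w is not None:
                err = mse(predict(w, val_cols[i], val_cols[j]), y_val)
                candidates.append((err, i, j, w))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0])
        if candidates[0][0] >= best:
            break  # this layer did not improve validation error: do not add it
        best = candidates[0][0]
        history.append(best)
        survivors = candidates[:keep]
        # Outputs of the surviving units become the inputs of the next layer.
        train_cols, val_cols = (
            [predict(w, train_cols[i], train_cols[j]) for _, i, j, w in survivors],
            [predict(w, val_cols[i], val_cols[j]) for _, i, j, w in survivors],
        )
    return history


rng = random.Random(0)


def make_data(n):
    """Synthetic target y = x0*x1 + 0.5*x2 + noise (hypothetical, for illustration)."""
    X = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(n)]
    y = [x[0] * x[1] + 0.5 * x[2] + rng.gauss(0, 0.05) for x in X]
    return [list(c) for c in zip(*X)], y


train_cols, y_train = make_data(200)
val_cols, y_val = make_data(100)
history = gmdh(train_cols, y_train, val_cols, y_val)
print("best validation MSE per accepted layer:", history)
```

The point of the sketch is the stopping rule: a layer only exists if held-out data says it earned its place, and every surviving unit is a polynomial you can print and read.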
Contrast that with the feed-forward layers inside a modern Transformer. In a large language model, the residual stream is blasted through tens of thousands of neurons simultaneously. Because we train by brute-force optimization, the network is forced to pack multiple, unrelated concepts into the same neurons to maximize efficiency — a phenomenon the industry now calls "polysemanticity" or "superposition." We have zero a priori mathematical understanding of what these representations actually are. We just know that when we push a massive vector through the non-linear activation functions, the loss goes down. Ivakhnenko built a glass engine; the modern industry built a black box, locked it inside a data center, and is now hiring philosophers to guess what's happening inside.
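A toy illustration of what "superposition" means mechanically, assuming three hypothetical features squeezed into a two-dimensional hidden space. This is a hand-built geometric example, not code from any Transformer, and the 120-degree layout is my choice for illustration.

```python
import math

# Three hypothetical feature directions packed into 2 dimensions, 120 degrees apart.
directions = [
    (math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3)) for k in range(3)
]


def encode(features):
    """Write 3 feature activations into 2 hidden numbers (each number mixes features)."""
    return (
        sum(f * d[0] for f, d in zip(features, directions)),
        sum(f * d[1] for f, d in zip(features, directions)),
    )


def readout(hidden):
    """Project back onto each direction and clip negatives (a ReLU)."""
    return [max(0.0, hidden[0] * d[0] + hidden[1] * d[1]) for d in directions]


single = readout(encode([1.0, 0.0, 0.0]))  # one active feature: clean recovery
double = readout(encode([1.0, 1.0, 0.0]))  # two active features: they interfere
print("one feature active :", [round(v, 3) for v in single])
print("two features active:", [round(v, 3) for v in double])
```

With one feature active the readout is clean. With two active, each comes back at half strength, because the same two hidden numbers are carrying all three features. That is the whole trick, and also the whole legibility problem.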
Shun-ichi Amari's contrast with the modern era is even more damning. Amari didn't just want to minimize error; he wanted to understand the geometric space where the learning actually happens.
Through Information Geometry, he formalized the parameter space of a neural network as a Riemannian manifold. He recognized that standard gradient descent is structurally blind: it steps down the steepest slope in Euclidean space without understanding the underlying probability distribution space. To solve this, Amari formulated the natural gradient, which explicitly corrects for that structure by rescaling the gradient of the loss $L(\theta)$ with the inverse of the Fisher Information Matrix $F$:
$$ \tilde{\nabla} L(\theta) = F^{-1} \nabla L(\theta) $$
With this mathematics, Amari wasn't just optimizing; he was giving the learning dynamics a mathematically explicit local geometry. He had a rigorous theoretical account of the boundaries, the singularities, and the geometric dynamics of the learning process itself.
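A small worked instance of that formula, assuming the simplest model where $F$ has a closed form: a univariate Gaussian with parameters $(\mu, \sigma)$, for which $F = \mathrm{diag}(1/\sigma^2,\ 2/\sigma^2)$. This is not a neural network; the data and starting point are hypothetical, chosen only to show what the correction does.

```python
def nll_grad(mu, sigma, xs):
    """Gradient of the average Gaussian negative log-likelihood w.r.t. (mu, sigma)."""
    n = len(xs)
    mean_dev = sum(x - mu for x in xs) / n
    mean_sq = sum((x - mu) ** 2 for x in xs) / n
    g_mu = -mean_dev / sigma**2
    g_sigma = 1.0 / sigma - mean_sq / sigma**3
    return g_mu, g_sigma


def natural_grad(mu, sigma, xs):
    """F^{-1} times the plain gradient, with F = diag(1/sigma^2, 2/sigma^2)."""
    g_mu, g_sigma = nll_grad(mu, sigma, xs)
    return sigma**2 * g_mu, (sigma**2 / 2.0) * g_sigma


xs = [2.0, 3.0, 4.0, 5.0, 6.0]   # hypothetical data, sample mean 4.0
mu0, sigma0 = 0.0, 10.0          # a badly scaled starting point

mu_plain = mu0 - 1.0 * nll_grad(mu0, sigma0, xs)[0]      # one plain gradient step
mu_nat = mu0 - 1.0 * natural_grad(mu0, sigma0, xs)[0]    # one natural gradient step
print("plain gradient step  :", mu_plain)
print("natural gradient step:", mu_nat)
```

With a wide $\sigma$, the plain gradient barely moves $\mu$ because it measures distance in raw parameter units. The natural gradient measures distance between distributions, so one unit step lands $\mu$ on the sample mean regardless of how $\sigma$ is scaled.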
Now look at how we train Transformers. We use myopic, first-order optimizers (like AdamW) to update billions of parameters based solely on local gradients. The model routes data through massive attention heads, computing the famous scaled dot-product attention:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
Mechanically, we know exactly how to execute this matrix multiplication. We know how to shard it across ten thousand GPUs to maximize FLOP utilization. But we have abandoned anything like a complete mathematical map of the parameter space it creates. When a cluster of weights organically learns to associate "fear" with "anxiety" to predict the next token, we don't have a mathematical theorem explaining the topological boundary of that representation. We just have an empirical observation that a myopic optimization process found a valley deep enough to lower the loss.
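For the record, this is how little there is to the mechanical part. A minimal single-head sketch of the formula above in plain Python, with no masking, no learned projections, and hypothetical toy matrices for $Q$, $K$, and $V$:

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights


# Hypothetical toy inputs: 2 queries, 2 keys, 2 values, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]

outputs, weights = attention(Q, K, V)
print("attention weights:", weights)
print("outputs:", outputs)
```

Every line of that is transparent. What is not transparent is what billions of trained values of $Q$, $K$, and $V$ end up encoding, and no amount of staring at this function tells you.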
These pioneers understood their systems as mathematical objects with precise properties and precise limitations. They did not anthropomorphize. They did not invoke alien intelligence or arriving consciousness. They did not require national television to broadcast their uncertainty to the public. Because the mathematics was clear to them about what was and was not happening inside the systems they built.
Hinton's own genuine contributions (backpropagation with Rumelhart and Williams, Boltzmann machines, the foundational work on distributed representations) were made in that same tradition of mathematical clarity. That work earned the prize. His recent public statements represent something else entirely: a Nobel laureate operating outside the mathematical foundation that made him credible, making philosophical claims that the mathematics of his own contributions does not support.
The people who built with less, with assembly code on machines a fraction of the power of what runs these models today, working from first principles in Ukrainian and Japanese research institutions largely ignored by the West, had more epistemic clarity about what they were building than the people scaling it today with billion-dollar compute budgets.
That is not a small observation.
It means the loss of clarity is not a function of complexity. The systems are more complex. The understanding is shallower. The brute force scaled the engineering and left the comprehension behind.
And now the "Godfather" of the field goes on television to tell us consciousness may have arrived and alien intelligence will outmaneuver humanity, and the correct response is not reverence.
It is to remember what Ivakhnenko and Amari understood sitting at their desks sixty years ago with a fraction of the tools, a fraction of the funding, and none of the audience.
They knew what they had built.
That clarity is what was abandoned. And no Nobel Prize announces its return.