
Why we will almost certainly never have AGI

In the past few years, LLMs have made significant progress in their ability to answer questions. Sometimes, they even surprise us with how well they solve certain problems. While this is impressive, they are still far from being as intelligent as humans. They often make things up (hallucinate), lose track of context, and struggle to produce truly novel ideas. 

Although LLMs have steadily improved on benchmarks, I have explained in a different article why benchmark results alone do not tell us whether that performance actually relates to intelligence. Scaling laws help estimate the performance of models at different sizes, which gives us a way to predict how much we need to scale to reach certain benchmark results. However, these laws don’t tell us what might happen if we discover new types of neural network architectures, different from transformers.

In this article, I will explore how much we may need to scale — optimistically — to reach human-level intelligence, by estimating the computational complexity of the human brain. Also, I will argue that discovering an AGI model with human-level intelligence is computationally intractable. This includes the mental effort needed to design clever architectures.

Parameters in the human brain

We first need to understand how the human brain works — or, more specifically, how the thinking part of the human brain works. We only need enough detail to estimate the number of parameters in the human brain, not every known aspect of neuroscience.

The human brain is made up of cells called neurons. A neuron generally receives signals from the output of other neurons through tiny wires called dendrites. Dendrites forward the signals to the main cell body, called the soma. The soma generates an action potential if the combined input is strong enough. This signal is then transmitted through a long cable called the axon, which branches into smaller wires ending at the axon terminals. These terminals connect with the dendrites of other neurons or with other types of cells to deliver output signals.

The strength of signal transfer depends on the strength of the synapses, which change during learning and also play a role in storing working memory. There are also axo-axonic synapses, which allow one neuron to modulate how a signal passes through another neuron’s regular synapse. The firing threshold, i.e., the combined input from the dendrites that is required to activate a neuron and send a signal forward, is also learnable. However, this makes little difference to our estimate, because each neuron has only one threshold but around a thousand outgoing synapses and up to 10,000 potential synapses that can be formed. You can think of a non-existent synapse that may form through future learning as a parameter whose current value is zero.

As a rough estimate, we can take the number of parameters in the human brain to be the number of synapses in the cerebral cortex, the main thinking part of the brain, which is around 350 trillion (based on electron microscope measurements of synaptic density per unit volume). Shepherd, Stepanyants, and Chklovskii introduced the concept of “potential synapses,” noting that a synapse could form whenever a dendritic spine comes within a micrometer of an axonal bouton. They computed that pyramidal neurons may have three to 10 times more potential synapse sites than actual realized synapses. Taking the lower factor of three gives a lower bound of about 1 quadrillion potential synapses. Notably, this is roughly a thousand times greater than the number of parameters in frontier LLMs.

As mentioned earlier, synapses serve both as working memory and as learned or stable memory. The brain does not have separate training and inference phases; in principle, all synapses are trainable at all times. However, only 20% to 30% of synapses change within a week, while about 70% take more than a week to change. We can consider this 70% as the effective parameters of the model. Based on this, we can estimate that the human brain has about 700 trillion model parameters.
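
The arithmetic behind these figures can be checked with a short Python sketch. The constant and function names are illustrative only, and the inputs are the rough figures quoted above, taking the lower end (three times) of the potential-synapse range.

```python
# Rough figures quoted in the text; none of these are precise measurements.
CORTICAL_SYNAPSES = 350e12   # realized synapses in the cerebral cortex
POTENTIAL_RATIO_LOW = 3      # lower end of the 3x-10x potential-synapse range
STABLE_FRACTION = 0.70       # share of synapses that take more than a week to change


def brain_parameter_estimate(synapses=CORTICAL_SYNAPSES,
                             ratio=POTENTIAL_RATIO_LOW,
                             stable=STABLE_FRACTION):
    """Return (potential synapses, effective long-lived parameters)."""
    potential = synapses * ratio
    effective = potential * stable
    return potential, effective


potential, effective = brain_parameter_estimate()
print(f"potential synapses:   {potential:.2e}")
print(f"effective parameters: {effective:.2e}")
```

The unrounded results are about 1.05 quadrillion potential synapses and about 735 trillion effective parameters, which the text rounds to 1 quadrillion and 700 trillion.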

Human brain vs. LLMs

In comparison, DeepSeek V3 has about 700 billion parameters. Our estimate for the human brain is roughly a thousand times larger. To approximate the hardware and energy required to run such a model with 8-bit quantization (one byte per parameter) on Nvidia H200 GPUs, we can use VRAM as a proxy, since it stores both parameters and working memory. Storing a quadrillion one-byte values would require about 7,092 GPUs (1,000,000 GB divided by 141 GB per GPU). At 600 W each, the total power demand is around 4.25 MW. With each H200 priced at $30,000, the GPU cost alone would be around $212 million.
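
As a sanity check, the hardware estimate can be reproduced in a few lines of Python. This is a sketch under stated assumptions: it takes one byte per parameter, which is what the 1,000,000 GB figure implies, and it rounds the GPU count up to a whole device, so it lands one GPU above the rounded figure in the text. The names are illustrative.

```python
import math

PARAMS = 1e15             # one quadrillion values
BYTES_PER_PARAM = 1       # assumption: one byte per value (implied by 1,000,000 GB)
H200_VRAM_GB = 141        # memory per GPU, as quoted in the text
H200_POWER_W = 600        # power per GPU, as quoted in the text
H200_PRICE_USD = 30_000   # price per GPU, as quoted in the text


def inference_cluster(params=PARAMS, bytes_per_param=BYTES_PER_PARAM):
    """Return (GPU count, power in MW, cost in USD) needed just to hold the model."""
    total_gb = params * bytes_per_param / 1e9
    gpus = math.ceil(total_gb / H200_VRAM_GB)
    power_mw = gpus * H200_POWER_W / 1e6
    cost_usd = gpus * H200_PRICE_USD
    return gpus, power_mw, cost_usd


gpus, power_mw, cost_usd = inference_cluster()
print(f"{gpus} GPUs, {power_mw:.2f} MW, ${cost_usd / 1e6:.0f} million")
```

Changing `BYTES_PER_PARAM` shows how sensitive the estimate is to quantization: at two bytes per parameter every figure doubles.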

Moreover, consider three factors: training cost, the amount and type of data, and network architecture. As a rule of thumb, training needs about 20 times more tokens than parameters. DeepSeek, with about 700 billion parameters, was trained on around 14 trillion tokens, which fits this rule. Scaling that up, a one-quadrillion-parameter model would require around 20 quadrillion tokens. The entire internet only contains tens of trillions of tokens, so even if we collect 140 trillion tokens, that’s less than 1% of what we need. In short, we do not have enough data to train such a model using today’s techniques.
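
The data gap follows directly from the 20-tokens-per-parameter rule of thumb. The sketch below uses only the figures from the paragraph above, and the names are illustrative.

```python
TOKENS_PER_PARAM = 20  # rule of thumb quoted in the text


def tokens_needed(params):
    """Training tokens suggested by the 20-tokens-per-parameter rule."""
    return TOKENS_PER_PARAM * params


deepseek_tokens = tokens_needed(700e9)  # about 14 trillion
brain_scale_tokens = tokens_needed(1e15)  # about 20 quadrillion
collected = 140e12                        # the optimistic 140 trillion tokens
coverage = collected / brain_scale_tokens

print(f"700B-parameter model: {deepseek_tokens:.1e} tokens")
print(f"1e15-parameter model: {brain_scale_tokens:.1e} tokens")
print(f"coverage from 140T tokens: {coverage:.1%}")
```

With these inputs the coverage works out to 0.7%, so even a very generous data collection effort falls short by more than two orders of magnitude.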

So how can the human brain learn with far less data? It doesn’t. The brain has effectively been trained on an enormous amount of data over millennia of evolution. Much of this knowledge could be hard-wired into its structure. For example, newborns cry and suckle without being taught, and toddlers instinctively try to climb. These are inherent behaviors that come from their biology. This is fundamentally different from LLMs, which learn only from passive input like text, images, or video.

When our brain begins learning from the environment, it builds on pre-programmed knowledge — similar to fine-tuning or instruction tuning after the pre-training of an LLM. Unlike LLMs, the brain doesn’t just consume tokens; it learns through full multimodal reinforcement by interacting with the environment. This is like training AI models in a physics simulator. We learn spoken language only after grounding ourselves in concepts, and written language only after mastering speech. This allows us to reason using concepts beyond language — geometry, sound, or abstract thought. Plain static data hardly compares.

Think of it this way: Trying to learn a trade by reading books or watching videos alone, without real-world practice, almost never works. You must get your hands dirty. The same goes for AI — it must interact with the real or simulated world to truly reason about it. But simulating physics at such a scale would be prohibitively expensive using current techniques.

Moreover, if we were to copy nature’s path, we’d also need to optimize neural network design itself by testing millions of variations, each trained on the equivalent of 20 quadrillion tokens. The computation required would be beyond imagination. Nature managed it only because it effectively used an Earth-sized quantum computer, with every subatomic particle acting as a processing unit for millennia.

To put this in perspective, today’s frontier trillion-parameter AI systems are typically sold through subscriptions costing a few hundred dollars per month, often with usage limits. If inference costs scaled linearly on current hardware, a quadrillion-parameter model could theoretically cost hundreds of thousands of dollars per user per month to operate. In that scenario, hiring human experts might become more economical unless major breakthroughs dramatically reduce compute and token costs.

The training of this model would be a whole new game. If we use 20 quadrillion tokens, following the Chinchilla scaling rule of thumb, a dense one-quadrillion-parameter model would require about 1.2×10³² FLOPs to train. For a 100-expert MoE model, assuming only one expert is active per token, the training run would require about 1.2×10³⁰ FLOPs. An H200 GPU can deliver nearly 4 petaflops, or 4×10¹⁵ FLOPs per second, at peak performance. To complete training in about a year and a quarter (roughly 40 million seconds) would require about 7.5 million H200 GPUs, assuming peak utilization. At 600 watts per GPU, those chips alone would consume about 4.5 gigawatts, before accounting for networking, storage, cooling and other data center power needs.
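
These training figures can be reproduced with the common approximation of 6 FLOPs per active parameter per training token. That approximation is an assumption here, since the text does not state which formula it uses, but it matches the quoted numbers. The names below are illustrative.

```python
FLOPS_PER_PARAM_TOKEN = 6   # assumption: common 6 * N * D training-cost approximation
TOKENS = 20e15              # 20 quadrillion training tokens
PARAMS = 1e15               # one quadrillion parameters
H200_PEAK_FLOPS = 4e15      # peak FLOPs per second, as quoted in the text
H200_POWER_W = 600
TRAINING_SECONDS = 40e6     # about a year and a quarter


def training_flops(active_params, tokens=TOKENS):
    """Approximate training compute for a model with the given active parameters."""
    return FLOPS_PER_PARAM_TOKEN * active_params * tokens


dense_flops = training_flops(PARAMS)
moe_flops = training_flops(PARAMS / 100)  # 100 experts, one active per token

gpus = moe_flops / (H200_PEAK_FLOPS * TRAINING_SECONDS)
power_gw = gpus * H200_POWER_W / 1e9

print(f"dense: {dense_flops:.1e} FLOPs, MoE: {moe_flops:.1e} FLOPs")
print(f"GPUs at peak utilization: {gpus:.2e}, power: {power_gw:.1f} GW")
```

Real training runs reach well below peak utilization, so the GPU count here is a floor rather than a forecast.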

But this is not enough to replace all cognitive work, as a single human cannot be an expert in every field. To completely replace all cognitive work, we would likely need a hundred-quadrillion-parameter model. Such a system could require tens of millions of GPUs consuming several gigawatts of power just to run. Training the model could require millions of accelerators and power consumption approaching the terawatt scale.

What’s needed for the right design

Let’s assume the above works. Even then, it would not result in AGI, because the network must also have the correct design to match the brain’s performance with the same number of parameters. To reason about this, we can consider the Kolmogorov complexity of the neural network as a measure of the complexity of its optimal design.

The Kolmogorov complexity refers to the length of the shortest computer program needed to generate a given object. When we talk about the Kolmogorov complexity of the brain, we mean the minimum size of a program that would produce the logical circuit diagram of the brain.

The design of the human brain is encoded in the genome. The human genome has about 3.2 billion base pairs, with each base pair containing two bits of information. This gives about 0.8 GB of data. However, only 20% are known to serve some function. That leaves around 160 MB of functional code. Studies suggest that 33% of the genome influences the brain. That gives 52 MB of code, or about 4×10⁸ bits. If we further assume that only 10% of this encodes the logical structure of the brain, while the rest governs chemical and non-logical details, we are left with about 4×10⁷ bits of logical data to encode the structural details of the brain.
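
The chain of reductions above is easy to follow as a short calculation. The sketch below simply multiplies through the percentages given in the text. The names are illustrative, and each fraction is an assumption taken from the paragraph rather than an established constant.

```python
BASE_PAIRS = 3.2e9
BITS_PER_BASE_PAIR = 2
FUNCTIONAL_FRACTION = 0.20   # share of the genome known to serve some function
BRAIN_FRACTION = 0.33        # share assumed to influence the brain
LOGICAL_FRACTION = 0.10      # share assumed to encode logical structure


def brain_design_bits():
    """Return (genome bytes, functional bytes, brain bits, logical bits)."""
    genome_bytes = BASE_PAIRS * BITS_PER_BASE_PAIR / 8
    functional_bytes = genome_bytes * FUNCTIONAL_FRACTION
    brain_bits = functional_bytes * BRAIN_FRACTION * 8
    logical_bits = brain_bits * LOGICAL_FRACTION
    return genome_bytes, functional_bytes, brain_bits, logical_bits


genome_bytes, functional_bytes, brain_bits, logical_bits = brain_design_bits()
print(f"genome: {genome_bytes / 1e9:.1f} GB, functional: {functional_bytes / 1e6:.0f} MB")
print(f"brain-related: {brain_bits:.1e} bits, logical structure: {logical_bits:.1e} bits")
```

The unrounded values are about 4.2×10⁸ and 4.2×10⁷ bits, which the text rounds down to 4×10⁸ and 4×10⁷.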

Let’s assume that it took 10 years of effort to come up with the transformer design, whose Kolmogorov complexity is at most 10⁴ bits. Coming up with a design of human brain-scale complexity would take at least 4×10³ (4×10⁷/10⁴) times more effort, which translates to 40,000 years if the same number of people work on the problem. Only then could our quadrillion-parameter model match the performance of our brain.
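
The effort estimate is a simple proportion, sketched below. It assumes, as the paragraph does, that design effort scales linearly with Kolmogorov complexity, which is a strong simplification. The names are illustrative.

```python
TRANSFORMER_BITS = 1e4    # assumed upper bound on the transformer's design complexity
TRANSFORMER_YEARS = 10    # assumed effort to arrive at the transformer design
BRAIN_LOGICAL_BITS = 4e7  # logical design complexity estimated from the genome


def design_effort_years(target_bits,
                        reference_bits=TRANSFORMER_BITS,
                        reference_years=TRANSFORMER_YEARS):
    """Scale the reference effort linearly with design complexity."""
    return reference_years * target_bits / reference_bits


ratio = BRAIN_LOGICAL_BITS / TRANSFORMER_BITS
years = design_effort_years(BRAIN_LOGICAL_BITS)
print(f"complexity ratio: {ratio:.0f}x, effort: {years:.0f} years")
```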

However, the design of the brain probably has non-modular complexity, i.e., complexity without layers of abstraction, and we would likely never comprehend this level of complexity no matter how long we worked on it. The reason for the non-modularity is that the design was found by evolutionary optimization, which settles on one of the many structures that work nearly optimally. This is the same reason it is practically impossible to understand the circuit of a trained neural network: it came from an optimization process.

The best strategy may be to use a less optimized design and compensate for the lack of good design with extra parameters. Even with a very conservative 2¹⁰ (about a thousand) times increase in parameter count to make up for the missing 4×10⁷ bits of design complexity, it would still take a quintillion-parameter model, which is also intractable. However, a completely different hardware design, such as neuromorphic analog circuits, may make it somewhat achievable in the future. Analog circuits consume much less power and are automatically massively parallel. They also need far fewer components. Their main drawback is that they can be inaccurate, but that may not be much of a problem for neural networks.

Therefore, I believe it is very likely that we will never have human-level AGI. This is not because the brain cannot be replicated in principle. After all, the brain is a mechanism, and there exists a silicon design that can replicate it. But the problem of finding that model may be computationally intractable under current assumptions.
