The God Model: Defining the Capability Boundaries of Foundation Models
A distributional view of foundation models: tokenization as interface, pretraining as compressing a projection of the world, post-training as better latent-variable inference, and the fundamental bottlenecks that govern model capability.
Translated from the original Chinese essay.
Prelude: If a God Model Existed
Suppose a “God model” existed. It would not be some mystical network, nor a myth of machine omniscience, but a
theoretical limit object: a model that exactly captures the joint distribution of everything in the world that is
observable and encodable. If we encode any object as a sequence and denote it by $x$, then the model
corresponds to the true distribution $p^*(x)$.
$$x \sim p^*(x)$$
If that distribution were fully known, three consequences would follow. It could predict, generate, and compress.
Given any partial observation $x_{<t}$, it would know which next token is most probable. Generation would
simply be sampling from the distribution. The optimal code length would be governed by Shannon entropy
(Shannon, 1948):
$$H(X) = -\sum_{x} p^*(x) \log p^*(x)$$
This is the starting point of the essay: prediction, generation, and compression can all be unified as one problem, namely approximating the true distribution. From this angle, what is interesting about foundation models is not merely that they can speak, but that they may be approaching this “God distribution” through an engineering process.
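To make the compression reading concrete, here is a minimal Python sketch on a toy four-symbol alphabet of my own choosing (nothing from the essay itself): knowing the distribution exactly is the same as knowing the optimal code, which assigns each outcome roughly $-\log_2 p^*(x)$ bits.

```python
import math

# Toy "world": a four-symbol alphabet with a known true distribution p*.
p_star = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Shannon entropy: the lower bound on expected bits per symbol for any code.
entropy = -sum(p * math.log2(p) for p in p_star.values())

# An ideal code assigns each symbol about -log2 p*(x) bits,
# so knowing the distribution exactly means compressing optimally.
ideal_bits = {x: -math.log2(p) for x, p in p_star.items()}

print(f"H(X) = {entropy:.3f} bits per symbol")   # 1.750
print(ideal_bits)                                # a: 1.0, b: 2.0, c: 3.0, d: 3.0
```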
A caveat is necessary. The real world may not correspond to a single fixed $p(x)$. A more realistic
description is often $p_t(x)$, because distributions drift and deployed systems in turn reshape the world.
In this essay, then, the God model is a theoretical limit object rather than a directly solvable target, yet it remains a
useful thought experiment for studying world knowledge and regularities that are less affected by external intervention.
The Tokenization of the World
“Everything can be tokenized” is easy to misunderstand as “the world is fundamentally made of text.” That is plainly false. A more precise claim is this: many of the objects we care about can be constructed as discretely representable, compositionally modelable symbol sequences.
Text is the clearest case. BPE and subword methods show that a sentence does not need to be segmented at the word level; the model can instead learn more useful discrete units from statistics (Sennrich et al., 2016). SentencePiece goes further and makes explicit that the token itself is an algorithmic design choice (Kudo and Richardson, 2018).
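A minimal sketch of the BPE training loop, using the classic toy corpus from the subword literature (words pre-split into characters; the frequencies are illustrative):

```python
from collections import Counter

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    """Replace each adjacent occurrence of the pair with one merged symbol."""
    new_corpus = {}
    for word, freq in corpus.items():
        symbols, out, i = word.split(), [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_corpus[" ".join(out)] = freq
    return new_corpus

# Words as space-separated characters, mapped to corpus frequencies.
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
for _ in range(3):
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print("merged", pair, "->", "".join(pair))
```

The loop discovers units like "es" and "est" from co-occurrence counts alone; nothing linguistic enters the procedure, which is exactly the sense in which tokens are learned rather than given.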
Images admit the same treatment. Vision Transformer turns an image into a sequence of patches (Dosovitskiy et al., 2020). Another line of work uses discrete latent codes, as in VQ-VAE: continuous signals are first mapped into a discrete codebook and then modeled in token space (van den Oord et al., 2017). Video pushes the problem further, since the tokenizer must preserve both spatial detail and temporal continuity; MAGVIT-style systems are examples of that effort (Yu et al., 2024). For autoregressive models, one requirement matters especially: the token sequence should preserve causal temporal structure as much as possible. Otherwise, “predicting the future from the past” becomes unstable at the representation level.
So the key idea is not that the world resembles text, but that once an object can be encoded stably into a finite set of discrete units, a model can define probabilities over those units, search over them, compress them, and reason over them. Tokens are the interface layer between AI and the world.
Tokenization Is Structure-Preserving Compression
Tokenization is not mere chopping. Its objective is to preserve the statistical structure most relevant to downstream modeling under finite compute and context budgets.
In text, tokenization balances frequency, compositionality, and sequence length. Image patches are a coarse discrete summary of local spatial structure. VQ-VAE-style discrete latents balance reconstruction quality against codebook efficiency. Video tokenization must additionally handle temporal redundancy. These approaches differ in form, but not in aim: under a limited token budget, preserve as much information density as possible (van den Oord et al., 2017; Yu et al., 2024).
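To make the discrete-latent idea concrete, here is a minimal numpy sketch of the VQ-VAE quantization step; the codebook size, dimensions, and random vectors are arbitrary stand-ins, since a real system learns both the encoder and the codebook end to end:

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))     # 512 discrete codes, 64 dims each
encoder_out = rng.normal(size=(16, 64))   # 16 continuous latents from an encoder

# Quantization: snap each continuous vector to its nearest codebook entry.
# The resulting integer indices are the tokens the sequence model sees.
dists = ((encoder_out[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
tokens = dists.argmin(axis=1)             # shape (16,), ints in [0, 512)
quantized = codebook[tokens]              # what the decoder reconstructs from

print(tokens[:8])
```

Whatever structure does not survive this snapping step is invisible to every downstream model, which previews the point below about tokenizers setting the ceiling.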
A more careful statement, then, is not that tokenization is lossless, but that it should be nearly lossless for the target task, or at least preserve sufficient statistics for the objective. For a chat model, losing sensor-level detail may be harmless. For robotics, the same loss may be fatal.
This is also why the tokenizer often sets the model ceiling. The model can only learn structure inside the discrete world it is given. If the crucial structure has already been flattened during tokenization, scaling parameters later only fits the wrong representation more accurately.
How Foundation Models Approximate the Joint Distribution of the World
Once objects are tokenized, the training objective becomes strikingly uniform. For a sequence
$x = (x_1, \ldots, x_T)$, an autoregressive model learns:

$$p_\theta(x) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t})$$
This is just the chain rule of probability, but it reveals something important: what looks like next-token prediction is, in substance, the approximation of a high-dimensional joint distribution through a sequence of conditional distributions. That is the common substrate that now extends language modeling into code, images, video, and action trajectories (Bengio et al., 2003).
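A toy illustration of the identity, with a hand-written bigram table standing in for $p_\theta$ (the probabilities are invented, and a bigram truncates $x_{<t}$ to the last token for brevity; the identity is unchanged):

```python
import math

# Toy conditional model p(next | prev) over a tiny vocabulary.
cond = {
    ("<s>", "the"): 0.6, ("<s>", "a"): 0.4,
    ("the", "cat"): 0.5, ("the", "dog"): 0.5,
    ("a", "cat"): 0.3, ("a", "dog"): 0.7,
}

def log_joint(tokens):
    """Chain rule: log p(x) = sum over t of log p(x_t | x_{<t})."""
    total, prev = 0.0, "<s>"
    for tok in tokens:
        total += math.log(cond[(prev, tok)])
        prev = tok
    return total

# The joint probability of the sequence is the product of the conditionals:
print(math.exp(log_joint(["the", "cat"])))   # 0.6 * 0.5 = 0.3
```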
The problem is that the internet records only part of the world. We see papers, code, answers, images, videos, and
decisions, but we usually do not see the full search process, the failed attempts, or the environmental feedback that
produced them. In other words, we often observe the result x while missing many of the intermediate
variables z that helped determine it.
A more realistic formulation therefore acknowledges a latent-variable layer:
$$p(x) = \sum_{z} p(z)\, p(x \mid z)$$
Here z may represent many things: intermediate derivations in problem solving, exploration traces in task
execution, or even preferences and constraints that were never made explicit in text. Without those variables, internet
data is closer to a projection of the world’s surface than to the world itself. Seen this way, explicit Chain-of-Thought
or process supervision can be understood as attempts to force a partly hidden z back into the observable
sequence x.
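A toy numerical reading of the marginalization, with invented probabilities: two different hidden processes can yield the same observed success, and the data alone records only the mixture.

```python
# Two hidden processes z that can produce the same observed outcome x
# (a correct final answer); the numbers are invented for illustration.
p_z = {"careful_derivation": 0.2, "lucky_guess": 0.8}
p_x_given_z = {"careful_derivation": 0.9, "lucky_guess": 0.3}

# The observable marginal folds the process away: p(x) = sum_z p(z) p(x|z).
p_x = sum(p_z[z] * p_x_given_z[z] for z in p_z)
print(p_x)   # 0.2*0.9 + 0.8*0.3 = 0.42; which z produced x is invisible
```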
Foundation models, then, do not directly learn “the world itself.” They use massive amounts of observable structure to approximate a joint distribution whose latent variables have been partially folded away.
Latent Variables, the VAE Lens, and the Intuition of Generalized EM
This is why I think the training of present-day foundation models can be viewed, cautiously, as a form of latent-variable learning. The point is not to force LLMs into an existing model class, but to adopt an analytical lens that clarifies the structure. The mapping is heuristic, not a strict equivalence.
VAE is useful here because it offers the simplest framework for a familiar situation: we observe outcomes, but not the full process that generated them. Its value is not that “an LLM is a VAE,” but that it expresses the relationship between outcomes and hidden causes in a particularly clean way.
Put plainly, a VAE can be read as two steps. First, infer a compact hidden explanation z from the observed
result x. Second, regenerate or reconstruct x from that z. In this essay’s terms:
first hypothesize the hidden structure, then reconstruct the observed phenomenon from it.
If the intermediate process behind a strong answer is treated as a latent variable, then capability improvement depends not only on a stronger generator, but on whether the model is becoming better at modeling high-value regions of latent space. In variational form, this corresponds to the ELBO (Kingma and Welling, 2013):
$$\log p(x) \ge \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$
Translated into the foundation-model setting, the point is not that an LLM literally optimizes this exact objective, but that the decomposition is suggestive:
Pretraining resembles learning an implementable approximation to both a prior over latent variables $p(z)$
and a conditional distribution that maps those latent variables to outcomes, $p(x \mid z)$. Post-training, in
turn, resembles improving an inference distribution, call it $q_\phi(z \mid x)$, that infers better
intermediate explanations from the task and its context. The practical effect is that the system more reliably enters
latent regions that lead to high-quality results.
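For readers who want the objective concrete, here is a minimal Monte Carlo ELBO for a one-dimensional Gaussian latent; the decoder is a stand-in function I invented for the sketch, not a claim about any real model:

```python
import numpy as np

rng = np.random.default_rng(0)

# q(z|x) = N(mu, sigma^2), prior p(z) = N(0, 1): chosen so the KL is closed form.
mu, sigma = 0.5, 0.8

def log_p_x_given_z(z):
    """Stand-in decoder log-likelihood; a real model would be a neural network."""
    return -0.5 * (1.3 - z) ** 2 - 0.5 * np.log(2 * np.pi)

z = mu + sigma * rng.normal(size=100_000)             # reparameterized samples
recon = log_p_x_given_z(z).mean()                     # E_q[log p(x|z)]
kl = 0.5 * (mu**2 + sigma**2 - 1 - np.log(sigma**2))  # KL(q || p), closed form
print(f"ELBO ~ {recon - kl:.3f} (recon {recon:.3f}, KL {kl:.3f})")
```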
This $q_\phi$ need not exist as a separate explicit module. It may instead appear as policy rewriting, search
preference, reasoning templates, preference models, process reward models, or an internal bias about where exploration
should continue. The stronger system is the one that hits good z more reliably.
Under more idealized assumptions, the process also carries the flavor of generalized EM (Dempster et al., 1977). One can first use the current model together with a verifier to search for better latent assignments, then write those trajectories back into the parameters:
$$q^{(k)}(z \mid x) \approx \mathrm{Search}(\theta_k, V)(z), \qquad \theta_{k+1} \leftarrow \arg\max_{\theta}\, \mathbb{E}_{q^{(k)}}\big[\log p_\theta(x, z)\big]$$
In the textbook EM setting, exact E-steps and exact M-steps guarantee that the incomplete-data log-likelihood does not decrease; generalized EM preserves a weaker monotonic-improvement intuition as long as each iteration improves the auxiliary objective. Real foundation-model training is not that clean: optimization is non-convex, search is approximate, rewards are noisy, sampling is finite, and distributions drift. Still, the lens offers a compelling intuition. If each round finds high-quality latent variables more reliably and then incorporates those trajectories into the training process, then capability gains are not just accidental improvements in answer quality; they admit an iterative algorithmic interpretation. Put more intuitively: the system first finds occasionally successful paths through large-scale trial and error, then gradually consolidates those paths into stable “muscle memory.”
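The following toy loop, entirely of my own construction, mimics that dynamic: the "latent" is a path of moves, the verifier is an exact check, and each round reinforces the moves that appeared in verified paths.

```python
import random

random.seed(0)

MOVES = [1, 2, 3]                    # toy action vocabulary
weights = {m: 1.0 for m in MOVES}    # "parameters": a categorical policy
TARGET, HORIZON, N = 7, 4, 500

def sample_path():
    total = sum(weights.values())
    probs = [weights[m] / total for m in MOVES]
    return random.choices(MOVES, probs, k=HORIZON)

def verifier(path):
    return sum(path) == TARGET       # deterministic, checkable success criterion

for rnd in range(6):
    # E-like step: search for latent trajectories the verifier accepts.
    accepted = [p for p in (sample_path() for _ in range(N)) if verifier(p)]
    # M-like step: consolidate accepted trajectories back into the parameters.
    for path in accepted:
        for move in path:
            weights[move] += 0.02
    print(f"round {rnd}: hit rate {len(accepted) / N:.3f}")
```

The hit rate climbs because the policy drifts toward the move statistics of verified paths: a crude analogue of writing searched trajectories back into the weights.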
The value of the EM perspective is not that it replaces existing training details, but that it helps explain why capability growth often means becoming better at guessing the right intermediate process, rather than merely mimicking the final sentence.
The Continued Efficacy of RL, Synthetic Data, and Distillation
If the perspective above is directionally correct, then the effectiveness of reinforcement learning becomes much easier to
explain. For a tokenized task with a verifiable objective, as long as the correct solution remains inside the model’s
support and the probability of hitting it in one sample is p > 0, then the probability of seeing at
least one correct sample after n draws is:
$$1 - (1 - p)^n$$
The equation is simple, but it captures the core dynamics behind many current post-training techniques. If:
- the correct answer has not been excluded by the tokenizer or the model support;
- the model assigns it non-zero probability; and
- we can tell which candidate is better,
then best-of-n, rejection sampling, verifier-guided search, RL rollouts, and process supervision all push probability mass toward the right region. InstructGPT shows how human feedback can systematically rewrite the output distribution (Ouyang et al., 2022). Process supervision shows that, in verifiable domains such as mathematics, reward can be attached to the quality of intermediate steps (OpenAI, 2023).
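The arithmetic is worth seeing with numbers: even a 1% single-shot success rate becomes near-certainty within a few hundred draws.

```python
# P(at least one success in n independent draws) = 1 - (1 - p)^n
for p in (0.01, 0.05, 0.20):
    row = {n: round(1 - (1 - p) ** n, 3) for n in (1, 10, 100, 500)}
    print(f"p = {p:.2f}:", row)
# p = 0.01 already gives ~0.993 at n = 500: sampling plus a verifier
# converts rare competence into reliable competence.
```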
The same lens explains why data produced by one generation of models can make the next generation stronger. The key is not that “a model feeds itself,” but that the training distribution is reweighted:
$$q_k(x) \propto p_{\theta_k}(x)\, w(x)$$
Here $w(x)$ is a weight supplied by a verifier, a reward model, execution feedback, human filtering, or
a teacher model. The next generation is not fitting the raw distribution of the previous model. It is fitting a
higher-quality sample distribution induced by search and filtering. STaR is fundamentally this idea
(Zelikman et al., 2022).
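A numpy sketch of the reweighting, with an invented five-candidate distribution and invented verifier weights; in practice the same effect comes from sampling from $p_{\theta_k}$ and keeping samples in proportion to $w$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented model distribution over five candidate solutions, head to tail.
p_theta = np.array([0.50, 0.25, 0.15, 0.07, 0.03])
# Verifier weights: here the rare candidates happen to be the verified-correct ones.
w = np.array([0.1, 0.2, 1.0, 1.0, 1.0])

q = p_theta * w
q /= q.sum()                   # q_k(x) proportional to p_theta(x) * w(x)
print(q.round(3))              # mass moves from head modes to verified tail modes

# Equivalent sampling view: draw from p_theta, accept with probability w(x).
samples = rng.choice(5, size=10_000, p=p_theta)
kept = samples[rng.random(samples.size) < w[samples]]
print(np.bincount(kept, minlength=5) / kept.size)   # matches q, up to noise
```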
That is why synthetic data can work. The mechanism is not that self-bootstrapping is automatically safe, but that it transforms a distribution-copying problem into a distribution-reweighting problem. If the verifier is strong enough, the search is wide enough, and real data still anchors the process, then the model learns structure that was previously low-probability yet high-value.
Conversely, if one relies only on unconstrained recursive sampling without a verifier, external signal, or real-data anchor, the risks are clear: tail regions disappear, erroneous modes are amplified, and the system drifts toward model collapse or mode forgetting, exactly the warning emphasized by “The Curse of Recursion” (Shumailov et al., 2023). So the right statement is not that synthetic data never collapses, but that validated, filtered, and constrained synthetic data can keep producing gains, whereas unconstrained self-feeding can absolutely collapse.
Distillation fits naturally into the same frame. It is not magic. It distills the teacher’s already-exposed high-value distribution into a cheaper student model. Classical knowledge distillation trains the student to match the teacher’s output distribution (Hinton et al., 2015). In the latent-variable framing here, one can read it as follows: the teacher is better at reaching good latent regions, and distillation is a lower-cost way to copy that posterior bias, or that inference capability.
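A minimal numpy sketch of the classical objective: the student is trained against the teacher's temperature-softened distribution rather than one-hot labels (the logits here are invented for illustration).

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Invented logits over a four-token vocabulary at one position.
teacher_logits = np.array([4.0, 2.5, 0.5, -1.0])
student_logits = np.array([3.0, 3.0, 0.0, -0.5])

T = 2.0                                  # temperature softens the targets
p_teacher = softmax(teacher_logits, T)   # relative preferences, the "dark knowledge"
p_student = softmax(student_logits, T)

# Distillation loss: cross-entropy against the teacher softs, equal to
# KL(teacher || student) plus the constant teacher entropy.
kd_loss = -(p_teacher * np.log(p_student)).sum()
print(f"KD loss = {kd_loss:.4f}")
```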
The True Boundaries of Foundation Model Capability
At this point, a natural question follows: is the God model the destination? In theory, yes. In practice, far from it.
The first boundary is the observability boundary. Many crucial variables never enter the data at all. Human exploration in real environments, embodied interaction, long-horizon feedback, and failed attempts are often not recorded completely as tokens. What never enters tokens never enters the likelihood.
The second boundary is the boundary of verifiability. Post-training has progressed quickly in part because coding, mathematics, search, and tool use admit deterministic comparison or at least relatively stable reward construction. That shared property underlies the success of RLHF, process supervision, and self-training on certain tasks (Ouyang et al., 2022; OpenAI, 2023; Zelikman et al., 2022). In open-ended creation, long-term strategy, or complex social coordination, the verifier is often unclear, unstable, or prohibitively expensive. Even when a reward or preference model exists, another risk appears: as the model moves outside the verifier’s training distribution, the verification signal itself drifts and can even induce reward hacking. Without a trustworthy verification loop, the efficiency of the whole “sample, select, and reincorporate into training” pipeline drops sharply.
The third boundary is the boundary of search. A correct solution existing inside the support set does not imply that
the system can find it efficiently. When good z regions are too diffuse, too low-probability, too
long-horizon, or buried in a search tree that is too deep, solutions that exist in theory become unreachable in
engineering practice. Many capability boundaries are not cases where “the model does not know,” but cases where “the
model cannot find.”
The fourth boundary is the boundary of distributional stability. The real world is not a static dataset. The more
faithful object is $p_t(x)$, not a fixed $p(x)$. Once models begin to influence markets, public
discourse, code repositories, educational materials, and human behavior itself, the training target co-evolves with the
deployment target. “Approximating the true distribution” then stops being a one-shot problem and becomes a moving
target.
So the limiting object of this paradigm can indeed be imagined as a God model: if we could observe everything, tokenize everything, preserve enough statistical structure, and run unbounded search and verification over a stable world, model capability would approach that limit distribution. But what blocks progress in reality is often not raw parameter count. It is how many crucial variables still fail to enter tokens, how many tasks still lack reliable verifiers, how many correct solutions remain unreachable by search, and how stable the world itself remains.
Closing
Foundation models can be understood as machines for approximating the world’s joint distribution.
Pretraining is not just memorizing corpora; it amounts to compressing a projection of the world through massive observable structure. Post-training is not just aligning human preferences; it reshapes the model’s ability to enter high-value latent regions. The common success of reinforcement learning, synthetic data, and distillation is not accidental either. All three mechanisms share a common objective: making sparse, low-probability, high-value structure more explicit and then reincorporating it into the model.
Seen from that angle, the capability boundaries of foundation models may not be primarily about the limits of model scaling. The true constraints lie in the extent to which the world has been captured as usable information, in how many tasks can sustain a reliable verification loop, and in how many correct solutions remain inaccessible to engineering search. That seems closer to the real bottleneck of the next phase of AI competition than arguments about parameter count, context length, or even architecture alone.
References
- Shannon, 1948. A Mathematical Theory of Communication
- Sennrich et al., 2016. Neural Machine Translation of Rare Words with Subword Units
- Kudo and Richardson, 2018. SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing
- Dosovitskiy et al., 2020. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- van den Oord et al., 2017. Neural Discrete Representation Learning
- Yu et al., 2024. Language Model Beats Diffusion: Tokenizer Is Key to Visual Generation
- Bengio et al., 2003. A Neural Probabilistic Language Model
- Kingma and Welling, 2013. Auto-Encoding Variational Bayes
- Dempster et al., 1977. Maximum Likelihood from Incomplete Data via the EM Algorithm
- Ouyang et al., 2022. Training Language Models to Follow Instructions with Human Feedback
- OpenAI, 2023. Let's Verify Step by Step
- Zelikman et al., 2022. STaR: Bootstrapping Reasoning With Reasoning
- Shumailov et al., 2023. The Curse of Recursion: Training on Generated Data Makes Models Forget
- Hinton et al., 2015. Distilling the Knowledge in a Neural Network