The Next Generation of AI Benchmarks Should Not Be "Questions"

For the past several years, the AI industry has been caught in a strange loop: we build new benchmarks, then we build models to break them. From early language understanding to mathematical reasoning, code generation, and scientific Q&A — and now to increasingly elaborate agent benchmarks — the underlying logic has never changed. We are still trying to measure intelligence with standardized questions.
The problem is that the hard problems of the real world have never come in the form of questions.
Nearly every mainstream benchmark today shares one defining characteristic: at its core, it is a compressed academic intelligence test. These benchmarks assume that problem boundaries are already defined, rules are fixed, inputs are given, objective functions are specified, and correct answers already exist somewhere. All the model actually needs to do is search within a closed space.

This is why today's large models increasingly resemble high-dimensional compression engines — extraordinarily capable, yet fundamentally optimized for one thing: finding answers inside a world that has already been defined.
But the genuinely hard problems of the real world are not like this.
Real-world problems almost never have clear boundaries. You don't know which variables matter, which information is missing, which constraints will suddenly appear, or whether the objective function will drift mid-course. More fundamentally, real problems rarely have a single correct answer. Designing a cancer treatment protocol has no standard answer. Engineering materials for a fusion reactor has no standard answer. Running a company has no standard answer. What we call "optimal" in real systems is, at best, a locally stable state within a particular time window under multiple competing constraints — not a global optimum in any mathematical sense.
This means the structure of real-world problems is categorically different from what today's benchmarks test.
Today's benchmarks are, in essence, static problems. Real-world problems are, in essence, dynamic systems.
Today's benchmarks ask: can the model complete local reasoning under fixed rules? What the real world actually demands is: can the model continuously optimize future states within a complex system that is perpetually changing and informationally incomplete, where feedback is highly delayed and consequences are irreversible?
These are two entirely different classes of problems.
I find myself increasingly convinced that the true AGI benchmark of the future should not be a benchmark in today's sense at all. It should be a Reality Benchmark: not a dataset, but an entire runnable world.
AI would no longer merely answer questions. It would need to live inside the system over time — continuously observing, continuously modeling, continuously deciding, continuously correcting — and ultimately be held accountable for the long-term state of the whole. What it must navigate is no longer "correct answers," but the irresolvable tensions between competing objectives: cost versus performance, efficiency versus safety, short-term versus long-term, local optima versus global stability, exploration versus exploitation, risk versus reward.
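To make the observe, decide, correct loop concrete, here is a minimal sketch of what "a runnable world" could mean at toy scale. Everything in it is an assumption made for illustration: the `ToyWorld` class, the `run_episode` helper, the 1.5x payoff, the two-step delay, and the cost range are all invented, and a real Reality Benchmark would be vastly richer. The point is only the shape of the evaluation: the agent sees part of the state, feedback arrives late, failure is irreversible, and the score is the long-term state rather than any single answer.

```python
import random
from collections import deque


class ToyWorld:
    """Hypothetical toy world: one resource, delayed payoffs, irreversible failure."""

    def __init__(self, capital=100.0, delay=2, seed=0):
        self.capital = capital
        self.alive = True
        self.rng = random.Random(seed)        # seeded so runs are repeatable
        self.pending = deque([0.0] * delay)   # payoffs still "in flight"

    def observe(self):
        # Partial observation: the agent sees capital only, not the pending
        # payoffs or the cost the world will charge next.
        return self.capital

    def step(self, invest_fraction):
        if not self.alive:
            return
        fraction = min(max(invest_fraction, 0.0), 1.0)
        invested = self.capital * fraction
        self.capital -= invested
        self.pending.append(invested * 1.5)          # pays off `delay` steps later
        self.capital += self.pending.popleft()       # delayed feedback arrives
        self.capital -= self.rng.uniform(2.0, 6.0)   # unpredictable running cost
        if self.capital <= 0.0:
            self.capital = 0.0
            self.alive = False                       # irreversible: no recovery


def run_episode(policy, steps=60, seed=0):
    """Let `policy` (observation -> invest fraction) run the world; score the end state."""
    world = ToyWorld(seed=seed)
    survived = 0
    for _ in range(steps):
        world.step(policy(world.observe()))
        if not world.alive:
            break
        survived += 1
    return {"alive": world.alive, "capital": world.capital, "steps_survived": survived}


if __name__ == "__main__":
    policies = {
        "all-in": lambda obs: 1.0,     # maximizes future payoff, ignores this step's cost
        "hoard": lambda obs: 0.0,      # looks safe each step, never earns anything
        "balanced": lambda obs: 0.3,   # trades short-term safety against long-term growth
    }
    for name, policy in policies.items():
        print(name, run_episode(policy))
```

Even in this toy, neither extreme survives: the all-in policy cannot pay its first running cost, and the hoarding policy slowly bleeds out. Only a policy that holds the short-term versus long-term tension keeps the system alive, which is the kind of outcome a question-and-answer benchmark has no way to score.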
The genuinely hard problems almost all belong to this category.
Consider what a truly meaningful drug discovery benchmark would look like. It should not merely predict binding affinity, let alone answer chemistry trivia. The AI must simultaneously navigate chemical space, ADMET properties, toxicity, crystal polymorphism, synthetic routes, long-term metabolism, patent landscapes, manufacturing costs, clinical trial pathways, and regulatory risk — an entire coupled system. What it optimizes is no longer a single metric, but the path that maximizes long-term compound value under finite time, finite budget, and finite experimental iterations. This is no longer "solving a problem." This is world optimization.
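One way to see why "a single metric" stops making sense here is the Pareto front: the set of candidates that no other candidate beats on every objective at once. The sketch below is illustrative only. The candidate names and their three scores (potency, safety, manufacturability, higher is better) are made-up numbers, not real compound data, and a real pipeline would have far more objectives plus the time and budget constraints described above.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` on every objective
    and strictly better on at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return the names of candidates that no other candidate dominates."""
    return {
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other_scores, scores)
            for other, other_scores in candidates.items()
            if other != name
        )
    }


# Hypothetical scores: (potency, safety, manufacturability)
CANDIDATES = {
    "A": (0.9, 0.3, 0.5),   # very potent, poor safety
    "B": (0.6, 0.8, 0.6),   # balanced
    "C": (0.5, 0.7, 0.5),   # worse than B on every axis
    "D": (0.4, 0.9, 0.9),   # safe and cheap to make, weak potency
}

if __name__ == "__main__":
    print(sorted(pareto_front(CANDIDATES)))
```

Only C can be discarded outright, because B beats it everywhere. Choosing among A, B, and D is no longer a lookup of the right answer; it is a judgment about which trade-off to accept.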
Fusion materials are no different. Most so-called materials benchmarks today are still, at their core, parameter prediction tasks. But the real first-wall problem in fusion requires simultaneously managing thermal loads, neutron irradiation, helium bubble formation, creep, fatigue, manufacturing constraints, maintenance cycles, long-term reliability, and economic cost: a set of coupled systems operating at different scales, with requirements that conflict with one another. The genuine difficulty is not that any single physics equation is unsolvable. It is that models at different scales do not compose cleanly: a structure optimized at the atomic scale may fail completely at the system scale; a locally optimal material may cause total maintenance cost to explode; an extreme high-performance design may be impossible to deploy at scale because of tail risk in component lifetime. The hard part has never been computation. It has always been multi-scale real-world optimization.
And enterprise management is perhaps the clearest case of all. The truly valuable enterprise benchmark of the future should not be "AI helps you write slides" or "AI auto-replies to email." It should be: let AI actually manage a virtual company for several years. It must handle hiring, organizational structure, communication efficiency, R&D, marketing, finance, strategy, conflict, long-term culture, institutional memory, and knowledge transfer — a complex dynamic system in full. The final evaluation is not any single local metric, but: did this company actually survive? Did it actually grow over time? Did it actually maintain organizational stability in a complex environment?
These problems share one crucial property: they are all computable — but their search spaces vastly exceed human cognitive bandwidth.
This point matters enormously. Many people assume that truly hard problems must be uncomputable. The opposite is closer to the truth. Nearly every problem that has driven civilizational progress is computable. What makes these problems hard is that they combine enormous state spaces, multi-objective conflicts, long-horizon feedback, multi-scale coupling, and highly dynamic environments, so humans cannot optimize them globally with finite cognitive resources. This is precisely the class of problems AGI should actually be built to solve.
This leads me to believe that the most powerful AI companies of the future may not be those with the largest models. They may be those with the largest Reality Simulation Infrastructure — because what matters is no longer whether a model can answer questions, but whether it can optimize the real world over time.
Today's large models are, at their core, still closer to Text Prediction Engines. The systems that will matter in the future will be World Optimization Engines. The former is built on language compression. The latter is built on long-horizon causal optimization, world modeling, memory accumulation, multi-scale reasoning, and dynamic feedback control.
The future of AI benchmarks is not an exam.
It is the world itself.




