• 03
Over the past year or two, we have witnessed a visible leap in the linguistic capabilities of large models: writing, summarizing, dialogue, Q&A, and problem-solving have become increasingly "human-like." Scores on benchmarks like HLE (Humanity's Last Exam) keep breaking records, and even Olympiad-level math problems are being systematically conquered. Consequently, a seemingly logical conclusion has become popular: "So-called AGI, artificial general intelligence, is probably close at hand."
But in my view, this is a beautiful misunderstanding.
To clarify this, I will borrow an analogy: today's mainstream large models are more like "Liberal Arts LLMs." They center on language generation and textual consistency, organizing knowledge into "narratives that seem real" and "answers that look right." Their value lies in "Simulation": they understand our euphemisms and rhetoric, generating elegant prose, realistic dialogue, and moving stories. They will become the new infrastructure for education, communication, and content production, nourishing everything as silently as electricity or water. However, even if they can solve Olympiad math problems or score high on HLE, these victories mostly occur within closed systems: problem definitions are clear, rules are fixed, right and wrong can be judged, and feedback is immediate.
But I have always believed that the problems humanity truly needs AI to confront are aging, disease, energy, materials, and climate. These battlefields do not lie in the closed world of exam questions. There are no standard answers waiting to be generated; there are only phenomena, noise, bias, missing variables, and slow feedback. Correctness is not "written out" but "confirmed" by the external world.
High scores in a closed world prove the maturity of reasoning engineering, but they do not constitute a stable mechanism for producing knowledge. High-level problem-solving skill is a necessary foundation for discovery, but far from a sufficient condition. What truly determines the future is not a closed narrative but the cold, precise red line of causality. It does not care whether something sounds right; it cares whether a hypothesis can be rejected or confirmed by reality. Its ultimate product is not a new work of art but new knowledge: new theorems, new materials, new drugs, new processes, new engineering structures. I call this paradigm "The Science LLM." Its value lies in "Discovery."
I must clarify: "Liberal Arts vs. Science" is not a difference in species, but a difference in default actions:
Liberal Arts LLMs tend to provide a "final answer that looks good."
Science LLMs tend to first provide a set of falsifiable hypotheses, while simultaneously offering the path to turn these hypotheses into evidence.
Where there is uncertainty, a Liberal Arts model is more likely to "round off" the answer; a Science model is more likely to instinctively pause, verify, and break the problem down into verifiable sub-problems. The Science model treats causality as a first-class citizen, asking "what happens if conditions change?" It must also possess cumulative long-term memory, writing back every verified conclusion in a traceable manner.
In short, a Science LLM is like a surgeon holding a scalpel: amidst countless plans, it identifies which cut truly touches the causal red line. It knows that once the cut is made, reality will offer the most honest and brutal feedback, forming a true causal closed loop. This reverence for "Real Cost" is the fundamental chasm between the two paradigms.
Therefore, what AGI should ultimately be depends on our value orientation: Do we care more about a "soulmate" who understands all rhetoric and can replace human labor, or do we more urgently need a "Mirror of Causality" that helps us tear through the fog, illuminate the unknown, and create value? I believe it is the latter. So, realizing AGI is not about rebuilding a system that is better at chatting or generating; it is about building an intelligence that knows how to "Discover."
Let us examine the main schools of AGI definitions against this value system. The first is the Behaviorist Paradigm, derived from the Turing Test, which holds that AGI is whatever is behaviorally indistinguishable from a human. This is currently the most intuitive standard for the public. But an AI that only imitates human speech can never tell us truths humans have not yet discovered. The second is the Functionalist Paradigm. Represented by OpenAI, it defines AGI as an "autonomous system that surpasses humans in most economically valuable work," focusing on the substitution of human labor. But every leap in human civilization has come not from doing old jobs faster, but from discovering unprecedented new laws. The third is the Capability Levels Paradigm. Represented by DeepMind, it ranks AGI from "Emerging" to "Superhuman," with generalization and performance on unseen tasks as the core metrics. But the real world is not an exam room; true wisdom lies in finding the right path where there is no exam paper. Other paradigms exist, but they suffer, to varying degrees, from the same problems.
So, what is my target for AGI? In one sentence: It is a high-confidence, verifiable, and self-correcting General Reasoning Engine. It must be capable of sustaining an overall accuracy rate approaching 99% even after 300 steps of complex reasoning, pinning down every step as inspectable evidence through formalization and toolchains, ultimately providing a closed-loop solution to any complex problem.
Why do we obsess over "300 Steps"? We must first define the smallest unit of reasoning, the Standard Inference Unit (SIU), as an auditable basic element: each step executes a single logical operation, relies on the minimum necessary input, and produces a result that can be directly verified by tools or rules. By this standard, current large models achieve a single-step reasoning accuracy of about 98%. Even at that level, the end-to-end success rate after 300 independent steps is only 0.98^300 ≈ 0.23%, which is effectively zero. This means that over 300 steps, probability and luck stop working; the system must rely on verifiable reasoning and external feedback loops rather than muddling through by "sounding reasonable." Therefore, I view 300 steps as the "Span Threshold" for independently solving complex real-world problems.
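To make the compounding concrete, here is a minimal sketch of the arithmetic, assuming every SIU succeeds independently with the same probability (a simplification, not an empirical model):

```python
# Compounding error over a long reasoning chain, assuming each Standard
# Inference Unit (SIU) succeeds independently with the same probability.
# This is a simplification of the argument above, not an empirical model.

per_step = 0.98            # single-step accuracy cited in the text
steps = 300

end_to_end = per_step ** steps
print(f"End-to-end success after {steps} steps: {end_to_end:.2%}")  # ~0.23%

# Per-step accuracy that would be needed for 99% end-to-end success:
required = 0.99 ** (1 / steps)
print(f"Required per-step accuracy: {required:.5%}")                # ~99.99665%
```

Under this independence assumption, reaching 99% end-to-end success over 300 steps would demand roughly 99.997% per-step accuracy, which is why the burden falls on verification and correction rather than on raw generation quality alone.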
Why is 99% a hard line? Because a system built for discovery is not for "chatting"; it enters the realm of Real Cost: experiments, engineering, medical treatment, and strategic decisions. Every percentage point of lost reliability translates into frequent wrong bets made at high stakes. In the real world, an error is not merely a "wrong answer": it is a wasted experimental window, a burned engineering budget, or even an irreversible loss. 99% is not a vanity metric; it is the threshold for "Collateral and Signature," the point at which one would stake real assets and sign one's own name to the result.
So, the AGI I envision is one that can survive "Probabilistic Death" through self-correction over a logical long march of 300 steps, finally arriving at a starting point beyond the map. From there, AGI can exist as an auditable, verifiable general problem solver in science, engineering, and planning.
Of course, I do not believe this destination can be reached by shouting slogans. Pinning the target at "99% reliability over 300 steps" means actively confronting three hard engineering problems: error accumulation over long chains, verification gaps in the open world, and budget constraints under combinatorial explosion. For this reason, we must dissect the engineering process and divide reasoning into two layers: a Logic Generation Layer and a Verification Layer. The Generation Layer is responsible for "Thinking": recursively breaking big problems down until they are refined into atomic operations. The Verification Layer is responsible for "Checking": verifying every atomic step through tools, simulation, or external data. When a step fails, the system performs local backtracking and regeneration rather than overturning the entire reasoning chain.
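As an illustration only, and not a description of any actual implementation, a minimal sketch of this generate-verify-backtrack loop could look like the following; `decompose`, `generate_step`, and `verify_step` are hypothetical placeholders for the two layers:

```python
from typing import Callable, List

# Illustrative sketch only, not any system's actual implementation.
# `decompose`, `generate_step`, and `verify_step` are hypothetical callables
# standing in for the Logic Generation Layer and the Verification Layer.

def solve(problem: str,
          decompose: Callable[[str], List[str]],
          generate_step: Callable[[str, List[str]], str],
          verify_step: Callable[[str], bool],
          max_retries: int = 3) -> List[str]:
    """Generate atomic steps, verify each one, and backtrack locally on failure."""
    evidence: List[str] = []                    # verified steps accumulated so far
    for sub_problem in decompose(problem):      # "Thinking": recursive decomposition
        for _ in range(max_retries):
            step = generate_step(sub_problem, evidence)
            if verify_step(step):               # "Checking": tools, simulation, data
                evidence.append(step)           # write back only verified conclusions
                break
        else:
            # Local backtracking exhausted: surface the failure instead of
            # silently continuing with an unverified step.
            raise RuntimeError(f"Could not verify sub-problem: {sub_problem}")
    return evidence
```

The key design choice is the scope of the retry: a failed verification regenerates only the offending atomic step, so the verified steps accumulated before it remain intact.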
MiroMind has already taken the first step on this path. Take BrowseComp as an example: MiroMind achieved SOTA results using a model with only 235B parameters. The significance lies not in the score itself, but in proving an engineering fact: we are advancing reasoning from "Single-shot Generation" to "Repeated Verification over Time." More specifically, we do not rely on a one-shot long chain of thought to gamble on the right answer. Instead, we train the model to continuously obtain external feedback and correct its errors over deeper, more frequent agent-environment interactions, gradually turning the reasoning process into an auditable chain of evidence. For us, this is the first foundation of a "General Solver," which we will push to spans of more than 300 steps while holding to 99% reliability. This process is silent, slow, rigorous, and even a bit cruel. It abandons the exquisite imitation of human language, yet it breaks through the soil in the boring, harsh, but repeatedly reproducible "Causal Loops" of reality. Even with the support of patient capital and idealistic persistence, this will be a very painful process.
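To suggest what "an auditable chain of evidence" could mean in concrete terms, here is a minimal sketch of one possible record structure; the class and field names are my own assumptions, not MiroMind's actual schema:

```python
from dataclasses import dataclass, field
from typing import List
import time

# Illustrative sketch only: one possible record structure for an auditable
# evidence chain. Names are assumptions, not MiroMind's actual schema.

@dataclass(frozen=True)
class EvidenceRecord:
    step: int            # position in the reasoning chain
    hypothesis: str      # what the agent expected to find
    action: str          # the tool call or query issued to the environment
    observation: str     # raw external feedback that came back
    verified: bool       # whether the observation confirmed the hypothesis
    timestamp: float = field(default_factory=time.time)

class EvidenceChain:
    """Append-only log: records are added after verification and never rewritten."""

    def __init__(self) -> None:
        self._records: List[EvidenceRecord] = []

    def append(self, record: EvidenceRecord) -> None:
        self._records.append(record)

    def audit(self) -> List[EvidenceRecord]:
        # Every conclusion stays traceable back to the interaction that produced it.
        return list(self._records)
```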
There is a term in the Buddhist sutras called "Great Round Mirror Wisdom." It describes a state in which the mind is cultivated to be like a great round mirror, reflecting the causality of all things as they are, obscured by no dust and distorted by no bias. This is the highest realm of wisdom. I have always aspired to this wisdom; I even named my popular-science video channel "Great Round Mirror."
The AGI I envision is an intelligent system that comes ever closer to this "Great Round Mirror Wisdom." It does not obsess over beautiful language but asks what is factually true; it does not rush to give an answer but seeks to verify the causality behind it. In an AI era stuffed with language and narratives, we need a mirror responsible only for "Causality and Truth."
Tianqiao Chen