Thinking Without Words

Introduction
The emergence of Large Reasoning Models trained through reinforcement learning has fundamentally transformed our understanding of AI capabilities. These systems, exemplified by models like o1, demonstrate unprecedented reasoning abilities by leveraging extensive Chain-of-Thought (CoT) processes during inference. However, this breakthrough has simultaneously exposed a critical limitation: the constraint of reasoning through natural language tokens.
Traditional Chain-of-Thought reasoning, while interpretable and effective, forces models to articulate every reasoning step through the bottleneck of human language. This linguistic mediation introduces computational inefficiencies and constrains the expressiveness of thought processes. Recent research has begun exploring a radical alternative: continuous latent reasoning, where models perform inference directly in high-dimensional embedding spaces rather than through discrete language tokens.
This paradigm shift represents more than an incremental improvement—it fundamentally challenges how we conceptualize machine reasoning and opens pathways to cognitive capabilities that transcend the limitations of linguistic expression.

Image inspired by the one from the paper Reasoning Beyond Language: A Comprehensive Survey on Latent Chain-of-Thought Reasoning.
Historical Development: The Evolution Toward Latent Reasoning
Early Foundations (2022-2023)
A significant step in improving model reasoning was the introduction of the Self-Taught Reasoner (STaR) by Zelikman et al. (2022). Instead of abstracting reasoning away from language, STaR refines the model’s ability to produce explicit, step-by-step “chain-of-thought” rationales. The method uses an iterative process: the model generates rationales, and if a rationale leads to a correct answer, it is used as training data to fine-tune the model. This established the principle that a model can effectively teach itself to become a better reasoner by learning from its own successfully articulated reasoning, improving performance on complex tasks without needing a large, pre-existing dataset of human-annotated rationales.

Deng et al. (2023) introduced the concept of Implicit Chain-of-Thought (ICoT), which uses knowledge distillation to train a student model to reason using its internal hidden states. In this process, the student model learns to emulate the layer-by-layer hidden-state trajectories of a larger teacher model as the teacher generates an explicit chain of thought. The goal is to distill the teacher’s “horizontal,” step-by-step reasoning into a more efficient “vertical” reasoning process that occurs implicitly within the student model’s layers. While this method significantly speeds up inference time, this efficiency comes at a cost. The authors report that the implicit approach leads to a notable decrease in task accuracy when compared to models that generate an explicit chain of thought, highlighting a direct trade-off between inference speed and final performance.

A crucial early insight came from interpretability studies. Yang et al. (2024) asked whether large language models latently perform multi-hop reasoning. They found moderate evidence for this latent reasoning, observing it in around 40% of cases on average, with much higher rates for specific types of reasoning tasks. This showed that models’ hidden layers transiently encoded information about intermediate “hops” even when answering directly, hinting at untapped latent reasoning potential.

The Discrete Token Era (2024)
The next evolutionary step involved experimenting with specialized discrete tokens to represent reasoning states. Goyal et al. (2023) introduced “pause tokens” that enabled models to perform additional internal computation before generating outputs. These tokens, inserted in a fixed, non-adaptive sequence, served as computational placeholders, allowing for delayed prediction and improved accuracy on logic-intensive tasks. The key insight was that models could benefit from “thinking time” even within the discrete token framework.

In their paper, “Let’s Think Dot by Dot,” Pfau et al. (2024) investigate whether the performance gains from chain-of-thought are due to interpretable reasoning or simply the greater computation that additional tokens allow. They demonstrate that for certain algorithmic tasks, transformers can use meaningless “filler tokens” (e.g., ’…’) to perform complex, hidden computations, achieving high accuracy on problems they could not solve when forced to respond immediately. For example, on a sufficiently complex 3SUM task, models using filler tokens reached 100% accuracy, whereas models without them were only 66% accurate. This suggests the critical bottleneck is not the semantic content of the tokens, but rather the computational limitation of a single forward pass. The sequence of filler tokens provides the model with a “scratchpad” for multi-step reasoning, directly challenging the assumption that a model’s intermediate steps must be linguistically meaningful to be computationally effective.

Zelikman et al. (2024) developed Quiet-STaR, employing learnable tokens to mark boundaries of internal rationales. This approach enabled language models to infer unstated reasoning steps, improving generalization without task-specific fine-tuning. The system generated token-level rationales internally (one hidden “explanation” per token produced) without outputting them, essentially “thinking before speaking” in a fine-grained way.

The Continuous Revolution (2024-2025)
The most significant breakthrough came with Hao et al. (2024) and their COCONUT (Chain of Continuous Thought) architecture. Rather than using discrete tokens, COCONUT directly feeds the model’s last hidden states as input embeddings for subsequent reasoning steps. This innovation eliminated the lossy projection from high-dimensional representations to discrete vocabulary distributions.

Technically, COCONUT operates by:
- Processing the input question normally through the transformer.
- Taking the final hidden state (a high-dimensional vector, typically 2048-4096 dimensions).
- Instead of projecting to vocabulary and sampling, directly feeding this vector back as the “next token” embedding.
- Repeating this process for multiple latent reasoning steps.
- Only projecting to vocabulary for the final answer.
This preserves orders of magnitude more information. A 4096-dimensional vector, even under an aggressive 4-bit quantization scheme, holds 16,384 bits of information (4096 dimensions × 4 bits). In stark contrast, a single discrete token drawn from a typical 50k vocabulary represents only about 16 bits (log₂ 50,000 ≈ 15.6).
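To make this loop concrete, below is a minimal sketch of a COCONUT-style latent loop written against the Hugging Face transformers API. The backbone (gpt2), the number of latent steps, and the greedy one-token answer are illustrative assumptions rather than the authors' released code; the recycling works without a projection here only because GPT-2's hidden size equals its embedding size.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in backbone; hidden size == embedding size, so no projection needed
tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)

@torch.no_grad()
def latent_reason(question: str, n_latent_steps: int = 4) -> str:
    ids = tok(question, return_tensors="pt").input_ids
    embeds = lm.get_input_embeddings()(ids)              # (1, T, d): process the question normally

    for _ in range(n_latent_steps):
        out = lm(inputs_embeds=embeds)
        last_hidden = out.hidden_states[-1][:, -1:, :]   # final hidden state of the last position
        # Skip the projection to vocabulary: feed the hidden state back
        # as the embedding of the "next token" (one continuous thought).
        embeds = torch.cat([embeds, last_hidden], dim=1)

    # Only the final answer is projected to vocabulary (greedy, one token for brevity).
    logits = lm(inputs_embeds=embeds).logits[:, -1, :]
    return tok.decode(logits.argmax(dim=-1))

print(latent_reason("Question: 2 + 3 * 4 = ?  Answer:"))
```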
In their 2024 paper, Cheng and Van Durme (2024) introduced Compressed Chain-of-Thought (CCoT), a framework that utilizes a dual-module architecture. A CCOT module (parameterized by φ) generates a sequence of dense “contemplation tokens,” which serve as compressed representations of an entire reasoning chain. A second DECODE module (parameterized by ψ) then uses these tokens to produce the final answer. This approach demonstrates that complex reasoning can be effectively summarized in continuous representations. However, contrary to methods that process steps in parallel, CCoT generates these contemplation tokens autoregressively, meaning they are produced sequentially one after another.

Liu et al. (2024) proposed Hidden Chain-of-Thought (HCoT), training auxiliary models to generate compact thought representations that maintain semantic richness while drastically reducing computational overhead. Their method compresses each intermediate reasoning step into a special [CoT] token, interleaving these compressed thoughts with the generated content.

Token Assorted (Su et al. (2025)) took a hybrid approach, using a VQ-VAE to encode early reasoning steps into latent codes while keeping later, critical steps in text. This model reduced the length of reasoning traces by an average of 17% while maintaining interpretability where needed.

Architectural Innovations (2025)
Recent work has focused on architectural modifications that natively support latent reasoning. Geiping et al. (2025) introduced Huginn, a recurrent framework enabling adaptive computation allocation through RNN-like iterative processing. The architecture consists of:
- Prelude: Initial layers encoding input into latent state.
- Recurrent Core: Transformer blocks that can be applied repeatedly.
- Coda: Final layers decoding the answer.
This design decoupled computation depth from parameter count, allowing a 3.5B model to achieve 50B-model performance through approximately 32 recurrent iterations.


In “Enhancing Auto-regressive Chain-of-Thought through Loop-Aligned Reasoning,” Yu et al. (2025) introduce RELAY (REasoning through Loop Alignment iteratively), a two-stage framework designed to improve how auto-regressive models handle long reasoning tasks. The method bridges the gap between auto-regressive models, which often struggle with generating accurate, long Chain-of-Thought (CoT) sequences, and Looped Transformers, which have strong length generalization capabilities but limited versatility.
The RELAY framework first trains a Looped Transformer by aligning its internal loop iterations with the explicit reasoning steps of a CoT process. This allows the Looped Transformer to generate accurate, high-quality reasoning chains for problems that are more complex and longer than those in its original training data. These generated chains are then used as a new, high-quality dataset to fine-tune a standard auto-regressive model, significantly enhancing its performance on complex reasoning tasks that require generalization to longer problem lengths.

Chen et al. (2025) proposed the Inner Thinking Transformer (ITT), treating each transformer layer as a discrete reasoning step with adaptive token routing and residual refinement.

Technical Deep Dive: Mechanisms of Continuous Reasoning
Core Architecture
Continuous latent reasoning fundamentally alters the information flow in transformer architectures. In traditional models, the transformation sequence follows:
Input Embeddings → Transformer Layers → Projection to Vocabulary → Sampling → Next Token
Continuous reasoning architectures bypass the projection bottleneck:
Input Embeddings → Transformer Layers → Direct Hidden State Reuse
The mathematical formulation involves modifying the standard transformer update. Instead of projecting to the vocabulary and sampling,

e_{t+1} = Embed(y_t), where y_t ~ softmax(W_vocab · h_t),

latent reasoning reuses the hidden state directly:

e_{t+1} = ProjectToEmbed(h_t)

where ProjectToEmbed is typically a learned linear transformation that maps the hidden state back to the embedding dimension.
Training Methodologies
Building a model that genuinely “thinks in vectors” is less about inventing huge new architectures and more about guiding the network away from the crutch of language without wrecking its performance. Current practice has converged on the recipe below.
1 · Curriculum-guided latentisation
Training still starts with ordinary chain-of-thought (CoT), but every epoch hides a growing fraction of those intermediate words and asks the model to run directly on their hidden-state vectors.
- COCONUT runs through the corpus in seven discrete stages. At stage 0 no rationale tokens are hidden; by the final stage roughly 85 % of every rationale is replaced by its own activations. Each new stage is introduced only after perplexity has stabilised, so the network never “forgets how to read”.
- Stepwise Internalisation removes chain-of-thought (CoT) tokens from the beginning of the reasoning sequence in a continuous, linear fashion. With each training epoch, more tokens are hidden from the start of the rationale, forcing the model to internalise the initial steps of the reasoning process. A technique called “Removal Smoothing” introduces a small amount of randomness to the number of tokens being removed, which helps to stabilize training as the model learns to operate on increasingly truncated context.
Hidden tokens receive no direct loss; visible tokens and the final answer are trained with the usual cross-entropy.
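To make the schedule concrete, the plain-Python sketch below decides how many leading rationale tokens to hide at a given curriculum stage, combining COCONUT-style staging with the removal smoothing described for Stepwise Internalisation. The stage count, final fraction, and jitter range are illustrative assumptions, not values from either paper.

```python
import random

def tokens_to_hide(rationale_len: int, stage: int, n_stages: int = 7,
                   final_fraction: float = 0.85, smoothing: int = 2) -> int:
    """Illustrative schedule: how many leading rationale tokens to replace
    with their hidden-state vectors at a given curriculum stage.

    Stage 0 hides nothing; the final stage hides ~final_fraction of the
    rationale. A small random offset ("removal smoothing") softens the
    jump between stages.
    """
    frac = final_fraction * stage / max(n_stages - 1, 1)
    base = int(frac * rationale_len)
    jitter = random.randint(-smoothing, smoothing)
    return max(0, min(rationale_len, base + jitter))

# Example: a 40-token rationale across the curriculum
for stage in range(7):
    print(stage, tokens_to_hide(40, stage))
```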
2 · Hidden-state distillation and self-distillation
Once the network can survive missing words, the next step is to make its latent trajectory mimic a teacher that still reasons out loud.
- Implicit CoT (ICOT) records the layer-wise activations of a frozen teacher and trains a student to match them, often using a mean-squared-error loss, while predicting the final answer. On GPT-2 Small, this provides a significant inference-time speed-up versus explicit CoT (roughly 4 to 8 times faster, depending on the task). However, this efficiency comes at a substantial cost to accuracy, particularly on more complex tasks. For example, while the accuracy drop is minor for 4x4 multiplication, it falls by over 20 points on the GSM8K math dataset and by 90 points on 5x5 multiplication. The paper provides these results for models up to GPT-2 Large.
- CODI runs the same set of weights twice—once with text and once with latent thoughts—and aligns only the hidden state immediately before the answer token. Because there is no separate teacher, CODI matches explicit-CoT accuracy on GSM8K while cutting context length by about a factor of three.
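A minimal sketch of an ICoT-style training objective is shown below, assuming the teacher's layer-wise hidden states have already been recorded. The MSE matching term, its weight alpha, and the tensor shapes are assumptions for illustration rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_hiddens: list[torch.Tensor],
                      teacher_hiddens: list[torch.Tensor],
                      answer_logits: torch.Tensor,
                      answer_ids: torch.Tensor,
                      alpha: float = 1.0) -> torch.Tensor:
    """ICoT-style objective (sketch): match the teacher's layer-wise hidden
    states with MSE while still predicting the final answer with cross-entropy.

    student_hiddens / teacher_hiddens: lists of (batch, seq, d) tensors, one per layer.
    answer_logits: (batch, seq, vocab); answer_ids: (batch, seq).
    """
    match = sum(F.mse_loss(s, t.detach()) for s, t in zip(student_hiddens, teacher_hiddens))
    ce = F.cross_entropy(answer_logits.flatten(0, 1), answer_ids.flatten())
    return ce + alpha * match
```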
3 · Compact latent tokens
ICOT-style distillation still leaves “one vector per hidden step”. The next advance packs many steps into a handful of learned embeddings.
- CCoT generates a short sequence of “contemplation tokens”, typically 30–50 % as long as the full CoT. An auxiliary decoder can reconstruct the dropped text for auditability.
- HCoT collapses an entire rationale into a single placeholder token [CoT]. A contrastive InfoNCE loss pushes two placeholders apart unless they decode to the same complete rationale, producing dense, semantically clustered codes (see the sketch after this list).
- Token Assorted compresses only the early reasoning hops into VQ-VAE codes and leaves the final, safety-critical steps in natural language, shaving about 15–20% off the trace length while preserving human-readable checks.
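The snippet below is a hedged sketch of the kind of in-batch InfoNCE objective described for HCoT's placeholder embeddings; the temperature and the use of other batch rows as negatives are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def infonce_placeholder_loss(anchors: torch.Tensor,
                             positives: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE over [CoT] placeholder embeddings (sketch).

    anchors[i] and positives[i] are two encodings of the same rationale,
    each of shape (batch, d); every other row in the batch acts as a negative.
    """
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.T / temperature                          # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=anchors.device)  # the diagonal is the positive pair
    return F.cross_entropy(logits, targets)
```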
4 · Recurrent or loop-aligned supervision
Some architectures keep sequence depth fixed and let the model loop through shared blocks as many times as it needs.
- Huginn splits the network into Prelude, Shared Core and Coda blocks. During pre-training it is unrolled 1 – 32 times at random. Cross-entropy is applied only to the final (and optionally the last few) iterations, and a small KL-style stability term limits drift between successive hidden states. At inference a learned halting gate decides when to stop looping.
- RELAY first aligns each loop iteration with the next step of a known CoT, then freezes that looped model and distils its internal trace into a standard auto-regressive decoder. The distilled model scores about 10–12 points higher than an equal-size plain decoder on long-division benchmarks.
- Inner Thinking Transformer (ITT) adds a lightweight linear probe after every residual block that predicts the current sub-answer. The probe and the main weights are trained jointly, and an adaptive token router lets tokens judged “easy” skip further passes, saving roughly one-quarter of the total compute without hurting accuracy.
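As a rough illustration of the recurrent-depth pattern (not Huginn's or RELAY's actual architecture), the sketch below unrolls one shared block a random number of times during training and a chosen number of times at inference; the layer sizes, vocabulary, and unroll range are assumptions.

```python
import random
import torch
import torch.nn as nn

class RecurrentDepthLM(nn.Module):
    """Prelude -> shared core (looped) -> coda, in the spirit of Huginn (sketch)."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, vocab: int = 32000):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.embed = nn.Embedding(vocab, d_model)
        self.prelude = layer()          # encodes input into the latent state
        self.core = layer()             # shared weights, applied k times
        self.coda = layer()             # decodes the final latent state
        self.head = nn.Linear(d_model, vocab)

    def forward(self, ids: torch.Tensor, k: int = 0) -> torch.Tensor:
        k = k or random.randint(1, 32)  # random unroll depth during training
        h = self.prelude(self.embed(ids))
        for _ in range(k):
            h = self.core(h)            # extra reasoning costs compute, not parameters
        return self.head(self.coda(h))

model = RecurrentDepthLM()
logits = model(torch.randint(0, 32000, (2, 16)), k=8)  # request deeper "thinking" at inference
```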
5 · Hybrid latent reinforcement learning
Supervised data eventually tops out, so teams switch to reinforcement learning that rewards correctness and charges for extra computation.
Hybrid Reasoning Policy Optimisation (HRPO) is the flagship:
- At every generation step the network mixes two embeddings—a normal token embedding and a transformed copy of the previous hidden state—weighted by an action variable gamma.
- The reward is 1 for a correct final answer, minus a small fee per visible token and per latent iteration (instances where gamma ≠ 0).
- HRPO is trained with Group Relative Policy Optimisation (GRPO), which uses the mean reward of a mini-batch of roll-outs as its baseline instead of a learned critic. GRPO needs about half the memory of PPO and converges just as fast.
Setting the step penalty too low makes the model talk verbosely; setting it too high drives it into silent, brittle reasoning. Authors report that a short grid search over a few hundred prompts is enough to find the sweet-spot penalty.
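The sketch below illustrates the two ingredients just described: a group-relative advantage computed from a batch of roll-outs, and a reward that pays for correctness while charging per visible token and per latent step. The standard-deviation normalisation and the fee values are illustrative assumptions, not HRPO's released implementation.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group Relative Policy Optimisation baseline (sketch).

    rewards: (n_prompts, group_size) final rewards for a group of rollouts
    sampled from the same prompt. The group statistics replace a learned critic.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True) + 1e-6
    return (rewards - mean) / std

def hybrid_reward(correct: bool, n_visible_tokens: int, n_latent_steps: int,
                  token_fee: float = 0.001, latent_fee: float = 0.001) -> float:
    """HRPO-style reward (sketch): 1 for a correct answer, minus small
    per-token and per-latent-step fees. The fee values are illustrative."""
    return float(correct) - token_fee * n_visible_tokens - latent_fee * n_latent_steps

rollouts = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                         [0.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rollouts))
```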
6 · Generic efficiency add-ons
Three auxiliary techniques, first devised for token-level RL, carry over cleanly to the latent regime:
- GRPO itself, supplying a critic-free baseline.
- Adaptive Length Penalty (ALP), which scales the per-step cost inversely with the model’s real-time solve-rate, trimming median reasoning length by ~50 % without hurting hard cases.
- AdaRFT introduces an adaptive curriculum for reinforcement finetuning. It works by dynamically adjusting the difficulty of training problems to match the model’s current skill level. Based on the model’s recent reward signals, it selects problems that are challenging but still solvable. This adaptive sampling avoids wasting computation on problems that are too easy or too hard, which in turn keeps the reward signal rich and informative for more efficient learning.
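As a deliberately simplified reading of the adaptive-curriculum idea, the sketch below samples problems whose difficulty sits just above a running skill estimate derived from recent rewards; the band width, difficulty scores, and selection rule are assumptions, not AdaRFT's actual algorithm.

```python
import random

def sample_batch(problems, skill_estimate, batch_size=8, band=0.15):
    """Adaptive-curriculum sampling (illustrative sketch): pick problems whose
    difficulty sits just above the model's current skill estimate, so the
    reward signal stays informative.

    problems: list of (problem, difficulty in [0, 1]).
    skill_estimate: e.g. an exponential moving average of recent rewards, in [0, 1].
    """
    target = min(1.0, skill_estimate + band / 2)
    candidates = [p for p, d in problems if abs(d - target) <= band] or [p for p, _ in problems]
    return random.sample(candidates, min(batch_size, len(candidates)))

pool = [(f"problem_{i}", i / 20) for i in range(20)]   # toy pool with difficulty scores
print(sample_batch(pool, skill_estimate=0.4, batch_size=4))
```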
A modern training run at a glance
- Supervised warm-up on full CoT.
- Curriculum latentisation for five to ten epochs (COCONUT or Stepwise Internalisation).
- Hidden-state distillation, optionally followed by compact latent token training (CCoT or HCoT).
- Architecture-aligned pre-training if you use loops (Huginn, RELAY, ITT).
- Hybrid RL fine-tuning with HRPO + GRPO, optionally adding ALP and AdaRFT.
Representational Dynamics
Analysis of COCONUT’s latent reasoning reveals a process more complex than a simple linear chain. Instead of committing to a single path, the continuous thoughts can be interpreted as a latent search tree that explores multiple potential next steps simultaneously.
- Parallel Reasoning Paths: The paper demonstrates that a single “continuous thought” can encode multiple branching hypotheses. This is shown by forcing the model to decode its latent state back into language; the probability distribution over the possible next words reveals that several different reasoning paths are being considered at once. For example, in a logical reasoning task, the model remains uncertain about the correct choice after the first continuous thought but successfully identifies the correct path after the second, suggesting it progressively prunes incorrect branches of its search tree.


- Implicit Value Function: This latent search is not uniform. The model learns to prioritize more promising paths and prune less relevant ones. The paper refers to the probability distribution over potential next steps as the model’s “implicit value function,” which estimates each node’s potential to lead to the correct answer.
- From Exploration to Focus: This dynamic changes as the reasoning process unfolds. Analysis shows that during the initial continuous thoughts, the model maintains significant diversity, exploring several alternative paths in parallel. In subsequent thoughts, this parallelism narrows as the model gains more certainty and focuses on the most promising reasoning path. This process allows COCONUT to perform a kind of implicit breadth-first search, which is particularly advantageous for complex planning tasks.

Implementation Challenges
Large‑scale latent reasoning is still in its infancy, and every experimental system to date has surfaced at least one blocking issue that does not appear in ordinary Chain‑of‑Thought models. The literature converges on six broad pain‑points:
1. A curriculum is mandatory—otherwise the model never “gets” latent reasoning
COCONUT’s ablation shows that training directly on (question, answer) pairs with hidden‑state recycling performs worse than a no‑CoT baseline. Only the staged schedule that first teaches the model to reason in language and then incrementally replaces early steps with vectors unlocks its gains. The GSM8K accuracy crash from 34.1 → 14.4% in the “w/o curriculum” run makes the point starkly clear. Designing such curricula—and automatically tuning them for new domains—remains an open research problem.
2. Latent loops break GPU parallelism
Because every continuous thought depends on the previous hidden state, training (and inference) cost scales with the number of latent steps, not the batch size. COCONUT explicitly notes that it must execute n + 1 forward passes for n thoughts, and that “the sequential nature of the multiple forward passes poses challenges for parallelism.” This serial dependency throttles throughput on modern GPU clusters built for large‑batch matrix multiplies.
3. KV‑cache memory becomes the new bottleneck
Long latent traces do not expand the visible token sequence, but they do enlarge the key/value cache: every extra iteration stores another full set of key and value vectors for every layer. Recent work on SQuat (Wang et al. (2025)) shows that even with aggressive INT‑2 quantisation the cache can dominate peak GPU memory when models “think” for dozens of steps. Compression helps but introduces accuracy/latency trade‑offs that are not yet well understood.
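A back-of-the-envelope estimate makes the scale tangible; the layer count, key/value head width, and fp16 precision below describe a generic 7B-class configuration rather than any specific system.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   positions: int, bytes_per_value: int = 2) -> int:
    """Rough KV-cache size: keys + values for every layer and every cached position.
    bytes_per_value=2 assumes fp16/bf16; quantisation lowers it."""
    return 2 * n_layers * n_kv_heads * head_dim * positions * bytes_per_value

# Illustrative 7B-class config: 32 layers, 32 KV heads of width 128.
prompt, latent_steps = 1024, 64
base = kv_cache_bytes(32, 32, 128, prompt)
with_latent = kv_cache_bytes(32, 32, 128, prompt + latent_steps)
print(f"{base / 2**20:.0f} MiB -> {with_latent / 2**20:.0f} MiB per sequence")
```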
4. Knowing when to stop is still heuristic
During inference a latent‑reasoning model must decide when to emit an <eot> and return to language space. Current systems either pad to a fixed depth or train an ad‑hoc binary classifier over hidden states, and COCONUT reports that both heuristics work “comparably well.” Huginn trains a learned halting classifier (§4.1, p 5) which shows promise but still requires careful tuning. Neither approach adapts gracefully to problem difficulty, and mis‑predictions manifest as truncated explanations or runaway loops.
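For concreteness, such an ad-hoc stop-or-continue classifier can be as small as the sketch below: a linear head over the current hidden state whose output is thresholded at inference. The hidden size, threshold, and training signal are assumptions.

```python
import torch
import torch.nn as nn

class HaltingProbe(nn.Module):
    """Binary 'stop thinking?' head over the current hidden state (sketch)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.score = nn.Linear(d_model, 1)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.score(hidden)).squeeze(-1)  # P(emit <eot> now)

probe = HaltingProbe(d_model=2048)
h = torch.randn(1, 2048)                 # current latent thought
keep_thinking = probe(h).item() < 0.5    # the 0.5 threshold is a tunable heuristic
```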
5. Deep recurrent stacks risk optimisation instability
Recurrent‑depth architectures such as Huginn push performance by unrolling a shared core 30+ times, but the authors note that gradient signals weaken as depth grows, requiring careful learning‑rate scaling and residual gating to avoid divergence. Balancing depth‑on‑demand with stable training dynamics is still an active area of study.
6. Tooling for debugging and evaluation is immature
A survey of efficient reasoning methods highlights a “complexity of latent‑space implementation” gap: without textual traces, it is hard to verify correctness, attribute errors, or measure reasoning efficiency. New metrics (e.g., embedding‑consistency scores) and visual probes are being proposed, but no standard evaluation suite exists yet.
The Interpretability Crisis
The “Neuralese” Problem
Recent mechanistic interpretability research (Lindsey et al. (2025)) reveals models performing sophisticated internal reasoning through complex feature interactions. This suggests the development of what we might call “Neuralese”: an emergent internal representational language that arises within AI systems operating in continuous latent spaces, characterized by high-dimensional vector patterns that encode semantic and logical relationships in ways that are fundamentally untranslatable to human linguistic concepts. Unlike discrete tokens, which at least map to vocabulary items, continuous thoughts exist in a 4096-dimensional space with no natural interpretation.
Key challenges include:
- Non-unique Representations: The same reasoning can be encoded in infinitely many ways due to rotation/scaling invariance.
- Distributed Encoding: Information is spread across all dimensions, not localized.
- Dynamic Semantics: The “meaning” of dimensions changes based on context.
Emerging Interpretability Techniques
Researchers are developing novel approaches:
- Geometric Analysis: Zhang and Viteri (2025) discovered latent CoT vectors—specific directions in activation space that elicit reasoning:
  reasoning_vector = h_with_cot - h_without_cot
  h_enhanced = h_input + α * reasoning_vector

- Representational Probing: Advanced techniques attempt semantic decoding:
- Training linear probes to predict intermediate answers from latent states (sketched below).
- Using contrastive learning to align latent states with linguistic descriptions.
- Projecting trajectories to 2D/3D for visualization.
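As a concrete example of the first probing idea, the sketch below trains a linear probe to read intermediate sub-answers (framed here as a classification task) out of frozen latent states; the dimensions, the classification framing, and the optimiser settings are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Sketch: a linear probe that tries to read an intermediate answer
# (here, a class label) out of frozen latent states. Dimensions are illustrative.
d_model, n_classes = 2048, 10
probe = nn.Linear(d_model, n_classes)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def probe_step(latent_states: torch.Tensor, labels: torch.Tensor) -> float:
    """latent_states: (batch, d_model) hidden states collected during latent
    reasoning (kept frozen); labels: (batch,) intermediate sub-answers."""
    opt.zero_grad()
    loss = loss_fn(probe(latent_states), labels)
    loss.backward()
    opt.step()
    return loss.item()

print(probe_step(torch.randn(32, d_model), torch.randint(0, n_classes, (32,))))
```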
Despite these efforts, full interpretability remains elusive. We can detect that reasoning is happening and sometimes what kind, but not the detailed how.
The Alignment Challenge
The interpretability deficit poses serious concerns:
- Deceptive Reasoning: Models could develop reasoning strategies that appear correct but contain hidden flaws or biases undetectable without linguistic output.
- Verification Difficulty: How do we ensure the model isn’t taking shortcuts or using spurious correlations in its latent reasoning?
- Monitoring and Control: Traditional safety measures (like output filtering) fail when the critical computation happens before any text is generated.
Proposed solutions include:
- Consistency Checking: Force occasional explicit reasoning and verify alignment with latent conclusions.
- Adversarial Probing: Train secondary models to detect problematic patterns in latent trajectories.
- Hybrid Architectures: Maintain parallel explicit and latent reasoning streams for cross-validation.
Current Applications and Performance
Mathematical Reasoning (GSM8K)
On the GSM8K math reasoning dataset, the performance of continuous reasoning is more nuanced.
- COCONUT Performance:
- It achieves 34.1% accuracy, which is a significant improvement over the 16.5% from a No-CoT baseline.
- However, it does not surpass the 42.9% accuracy of the standard Chain-of-Thought (CoT) baseline.
- Its primary advantage is efficiency, reducing the number of reasoning tokens from 25.0 for CoT to just 8.2.
Logical Reasoning (ProntoQA & ProsQA)
COCONUT shows its most dramatic improvements on logical reasoning tasks that require planning and searching.
- COCONUT Performance:
- ProntoQA: Achieves 99.8% accuracy, outperforming the 98.8% CoT baseline.
- ProsQA: Shows a major leap, scoring 97.0% accuracy compared to the CoT baseline’s 77.5%.
- This high accuracy is achieved with significant efficiency gains, requiring far fewer tokens than CoT on both datasets.
- This efficiency comes from avoiding verbose explanations and instead directly manipulating concepts in a continuous vector space, which proves especially advantageous for complex logical planning.
Multimodal Integration
Heima (Shen et al. (2025)) demonstrated that “thinking tokens” excel at multimodal reasoning. The framework introduces several key advantages:
- Massive Efficiency Gains: It drastically reduces computational overhead, requiring as few as 6% of the reasoning tokens compared to verbose text-based methods.
- Latent Space Reasoning: By operating in a continuous latent space, it maintains comparable—and in some cases, superior—accuracy on complex visual reasoning benchmarks while avoiding the “description bottleneck” of converting visual concepts into words.
- Effective Cross-Modal Encoding: The model’s latent “thinking tokens” effectively encode rich visual information, which can be decoded back into text descriptions even without access to the original image.

Code Generation and Formal Reasoning
Early experiments show promise:
- Program Synthesis: Latent states can maintain program state, variable bindings, and control flow without verbose comments.
- Theorem Proving: Abstract mathematical relationships are represented more naturally as vector transformations than as symbolic strings.
- Planning Tasks: The COCONUT paper demonstrates that the implicit breadth-first search used is highly effective for planning problems that require backtracking. On ProsQA, a dataset specifically designed to challenge planning capabilities, COCONUT found solutions significantly faster than traditional CoT, with an average inference time of 0.15 seconds compared to 0.47 seconds for CoT.
Why This Represents the Next Breakthrough
Computational Efficiency Revolution
The token overhead of linguistic reasoning becomes prohibitive as models require deeper thought:
- Scaling Analysis:
- Traditional CoT: O(n) reasoning tokens but O(n²) attention cost for n-step reasoning, because every new token attends over the growing trace.
- Latent reasoning: roughly O(n) total attention, since the visible context stays fixed and each step recycles a single hidden state; memory and KV-cache growth are linear in the number of latent steps rather than in the number of generated tokens.
- At n = 100 steps, CoT therefore needs on the order of 100× more attention computation than latent reasoning (see the calculation below).
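The asymptotics above reduce to a one-line calculation; the function below ignores constant factors and simply contrasts the quadratic and linear terms.

```python
def attention_ratio(n_steps: int) -> float:
    """Order-of-magnitude comparison from the bullets above: explicit CoT pays
    roughly O(n^2) attention over an n-token trace, latent reasoning roughly O(n).
    Constant factors are deliberately ignored."""
    cot_cost = n_steps ** 2
    latent_cost = n_steps
    return cot_cost / latent_cost

print(attention_ratio(100))  # 100.0 -> CoT needs ~100x more attention work at n = 100
```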
Unlocking Non-Linguistic Intelligence
Continuous reasoning enables fundamentally new capabilities:
- Parallel Hypothesis Exploration: Unlike sequential text, vectors can superpose multiple reasoning branches: conceptually, a weighted sum h ≈ Σ_i α_i · h_branch_i over candidate next steps. Empirically, COCONUT’s ablation (p. 4) shows wider hidden states boost success on ProsQA, consistent with superposition.
- Continuous Optimization: Reasoning becomes differentiable, enabling gradient-based search for solutions rather than discrete sampling.
- Emergent Algorithms: Models develop internal procedures resembling classical algorithms (dynamic programming, branch-and-bound) without explicit programming.
Scaling Law Implications
Recent findings from Ye et al. (2025) provide new insights into how model architecture affects reasoning:
- Depth is More Crucial than Width: The paper demonstrates that for complex reasoning tasks, a model’s depth (number of layers) is more important than its width (number of neurons per layer). For example, a deeper, smaller model (16-layer, 576-dim) significantly outperforms a shallower, larger model (4-layer, 1920-dim) on math problems with longer reasoning chains. This suggests that performance in reasoning doesn’t just depend on model size, but on having sufficient depth to process complex, sequential steps.
- Computationally Efficient Reasoning: The research shows that models learn to be highly efficient by generating the shortest possible solutions. Instead of computing every possible variable, the model learns to “plan ahead” by identifying only the necessary parameters required to answer the question. This avoids wasting computation on unnecessary steps or verbose explanations.
Integration with Reinforcement Learning
Latent reasoning creates a natural bridge to RL:
- Direct Trajectory Optimization: RL rewards can shape latent reasoning paths directly, without passing through the language bottleneck.
- Self-Play and Exploration: Models can rapidly explore reasoning strategies in latent space, orders of magnitude faster than generating text.
Conclusion: The Best Path Forward
Building on every technique reviewed above, the most coherent next step is a single, integrated architecture that lets a model decide—token by token—whether to “think in words” or “think in vectors” while scaling its depth on demand. I propose the GRAIL-Transformer (Gated Recurrent Architecture for Interpretable Latent reasoning), a design that deliberately stitches together the strongest ideas from COCONUT, Huginn, Quiet‑STaR, GRPO and the latest interpretability work so that it meshes naturally with the entire history traced in this post.
1. GRAIL‑Transformer—how it works
- Recurrent‑depth core – We keep the lightweight “Prelude → Shared Core → Coda” loop of Huginn, but allow the shared core to execute any number k of inner iterations at inference time. Extra reasoning therefore costs compute rather than parameters, solving the scalability crunch that plagues ever‑deeper static stacks.
- Learnable gating between text and latent space – At each generation step the model produces a mixed token embedding, roughly e_t = g_t · ProjectToEmbed(h_{t-1}) + (1 − g_t) · Embed(y_{t-1}), with a gate g_t ∈ [0, 1] (see the sketch after this list). Early in training the gate value is driven toward 0 (pure language); a curriculum gradually nudges it up so later layers increasingly recycle hidden states instead of words. Because the gate is differentiable, the model can still “fall back” to language for steps that benefit from explicit explanation or safety audits.
- Latent memory lattice – Rather than a single vector, a compact lattice of 4‑8 cells is updated every loop by gated attention. These cells let the network keep multiple hypotheses alive in parallel, mirroring the implicit breadth‑first search behaviour uncovered in COCONUT, yet remain small enough to fit in cache and to visualise with modern probing tools.
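Because GRAIL-Transformer is a proposal rather than an implemented system, the sketch below is purely hypothetical: it shows one way the text/latent gate from the second bullet could be realised, computing the gate from the concatenated token embedding and previous hidden state. All module names and sizes are invented for illustration.

```python
import torch
import torch.nn as nn

class GatedThoughtMixer(nn.Module):
    """Hypothetical sketch of GRAIL's text/latent gate: mix the previous token's
    embedding with a projection of the previous hidden state, weighted by a
    learned gate g in [0, 1]."""

    def __init__(self, d_model: int):
        super().__init__()
        self.project = nn.Linear(d_model, d_model)   # hidden state -> embedding space
        self.gate = nn.Linear(2 * d_model, 1)        # decides "think in words" vs. "think in vectors"

    def forward(self, token_embed: torch.Tensor, prev_hidden: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([token_embed, prev_hidden], dim=-1)))
        return g * self.project(prev_hidden) + (1 - g) * token_embed

mixer = GatedThoughtMixer(d_model=1024)
mixed = mixer(torch.randn(2, 1024), torch.randn(2, 1024))  # next-step input embedding
```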
2. Training pipeline
- Supervised warm‑up with standard chain‑of‑thought data, so nothing changes relative to today’s best practice.
- Hybrid curriculum: each epoch raises the target mean of the gate value g and shortens the explicit CoT, teaching the model to compress more reasoning into hidden space without a sudden jump in difficulty.
- Group Relative Policy Optimisation (GRPO) fine‑tuning: an RL objective rewards final accuracy, penalises unnecessary inner‑loop steps, and lightly discourages verbose output. GRPO updates come from simple moment matching, so one extra forward pass per sample is enough—vastly cheaper than PPO in the many‑step latent regime.
- Difficulty‑aware refinement: hard questions are replayed with a higher weight and a slightly steeper penalty on step count; this focuses the network’s latent search tree where it matters.
3. Interpretability and safety hooks
- Scheduled contrastive decoding: every few inner loops we clamp the gate to 0 for a single hop and force the model to externalise its current thought. A tiny auxiliary decoder converts that hidden state to text; disagreement with the eventual answer adds a consistency penalty during RL. This provides a “window into the lattice” without slowing ordinary inference.
- Sparse linear probes: as training progresses we freeze periodic checkpoints and train linear maps that predict intermediate sub‑answers from the lattice. A modest mean‑square‑error loss on those probes encourages the hidden basis to stay approximately linear, making post‑hoc auditing tractable instead of desperately ill‑posed.
4. Why this directly tackles the open problems
- Training scalability – Recurrent sharing keeps memory flat; GRPO’s single‑pass updates keep cost linear in data instead of quadratic in steps; mixed‑precision caches plus low‑rank adapters shrink VRAM >50%.
- Inference efficiency – Inner‑loop depth rises only when problems demand it, so routine queries cost little more than a vanilla LLM, yet pathological puzzles can receive dozens of extra reasoning cycles without exceeding context‑length limits.
- Interpretability – Gating, forced reveal steps and probe regularisation ensure that at any point we can sample, decode and analyse a faithful slice of the model’s latent trajectory—something pure‑latent systems could not guarantee.
- Alignment – Because the lattice must periodically translate back into language and match a supervised or self‑consistent rationale, deceptive or shortcutting thoughts face immediate gradient pressure, giving alignment researchers a concrete signal to train against.
I have not implemented GRAIL‑Transformer; at present it is a theoretical synthesis drawn from the most compelling recent findings. Nonetheless, every component already exists in isolation in the literature, and nothing in the design violates current hardware constraints.
Reasoning in latent space therefore looks poised to become the next genuine breakthrough: shedding the bandwidth limits of language, enabling economical yet powerful depth, and—if we build in the right interpretability valves—delivering models whose thought processes are both faster and safer than anything we have today.
References
Chen, X., Wang, L., & Li, Y. (2025). Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking. arXiv preprint arXiv:2502.13842. https://arxiv.org/abs/2502.13842
Chen, X., Zhao, A., Xia, H., Lu, X., Wang, H., et al. (2025). Reasoning Beyond Language: A Comprehensive Survey on Latent Chain-of-Thought Reasoning. arXiv preprint arXiv:2505.16782. https://arxiv.org/abs/2505.16782
Cheng, P., & Van Durme, B. (2024). Compressed Chain-of-Thought: Efficient reasoning through dense representations. arXiv preprint arXiv:2412.13171. https://arxiv.org/abs/2412.13171
Deng, Y., Choi, Y., & Shieber, S. (2024). From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step. arXiv preprint arXiv:2405.14838. https://arxiv.org/abs/2405.14838
Deng, Y., Prasad, K., Fernandez, R., Smolensky, P., Chaudhary, V., & Shieber, S. (2023). Implicit Chain-of-Thought reasoning via knowledge distillation. arXiv preprint arXiv:2311.01460. https://arxiv.org/abs/2311.01460
Geiping, J., Fowl, L., Somepalli, G., Goldblum, M., Moeller, M., Goldstein, T., & Jacobs, T. (2025). Scaling up test-time compute with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171. https://arxiv.org/abs/2502.05171
Goyal, A., Bengio, Y., Weston, J., & Ballas, N. (2023). Think before you speak: Training language models with pause tokens. arXiv preprint arXiv:2310.02226. https://arxiv.org/abs/2310.02226
Hao, S., Gu, Y., Ma, H., Hong, J., Wang, Z., Wang, D., & Hu, Z. (2024). Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769. https://arxiv.org/abs/2412.06769
Lindsey, R., Kenton, Z., Everitt, T., Wattenberg, M., Mirhoseini, A., Leike, J., & Amodei, D. (2025). Circuit tracing: Revealing computational graphs in language models. Transformer Circuits Thread. https://transformer-circuits.pub/2025/attribution-graphs/biology.html
Liu, J., Chen, X., Wang, H., Zhang, L., & Li, M. (2024). Expediting and elevating large language model reasoning via hidden chain-of-thought decoding. arXiv preprint arXiv:2409.08561. https://arxiv.org/abs/2409.08561
Pfau, J., Merrill, W., & Bowman, S. R. (2024). Let’s think dot by dot: Hidden computation in transformer language models. arXiv preprint arXiv:2404.15758. https://arxiv.org/abs/2404.15758
Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300. https://arxiv.org/abs/2402.03300
Shen, H., Wu, Y., Chen, K., Wang, J., & Zhang, Q. (2025). Efficient reasoning with hidden thinking. arXiv preprint arXiv:2501.19201. https://arxiv.org/abs/2501.19201
Shen, Z., Yan, H., Zhang, L., Hu, Z., Du, Y., & He, Y. (2025). CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation. arXiv preprint arXiv:2502.21074. https://arxiv.org/abs/2502.21074
Shi, T., Wu, Y., Song, L., Zhou, T., & Zhao, J. (2025). Efficient Reinforcement Finetuning via Adaptive Curriculum Learning. arXiv preprint arXiv:2504.05520. https://arxiv.org/abs/2504.05520
Su, Y., Liu, T., Wang, D., Chen, H., & Zhou, J. (2025). Token Assorted: Mixing latent and text tokens for improved language model reasoning. arXiv preprint arXiv:2502.03275. https://arxiv.org/abs/2502.03275
Wang, H., Han, L., Xu, K., & Srivastava, A. (2025). SQuat: Subspace-orthogonal KV Cache Quantization. arXiv preprint arXiv:2503.24358. https://arxiv.org/abs/2503.24358
Xiang, V., Blagden, C., Rafailov, R., Lile, N., Truong, S., Finn, C., & Haber, N. (2025). Just Enough Thinking: Efficient Reasoning with Adaptive Length Penalties Reinforcement Learning. arXiv preprint arXiv:2506.05256. https://arxiv.org/abs/2506.05256
Yang, K., Klein, D., Pang, N., & Sachan, M. (2024). Do large language models latently perform multi-hop reasoning? arXiv preprint arXiv:2402.16837. https://arxiv.org/abs/2402.16837
Ye, H., Zhang, C., Wang, X., Liu, Y., & Sun, M. (2025). Scaling laws for reasoning: The importance of model depth. arXiv preprint arXiv:2407.20311. https://arxiv.org/abs/2407.20311
Yu, D., Wang, S., Chen, L., Zhang, M., & Li, X. (2025). Enhancing auto-regressive Chain-of-Thought through loop-aligned reasoning. arXiv preprint arXiv:2502.08482. https://arxiv.org/abs/2502.08482
Yue, Z., Jin, B., Zeng, H., Zhuang, H., Qin, Z., Yoon, J., Shang, L., Han, J., & Wang, D. (2025). Hybrid Latent Reasoning via Reinforcement Learning. arXiv preprint arXiv:2505.18454. https://arxiv.org/abs/2505.18454
Zelikman, E., Wu, Y., Mu, J., & Goodman, N. D. (2022). STaR: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35, 15476–15488. https://arxiv.org/abs/2203.14465
Zelikman, E., Harik, G., Shao, Y., Jayasiri, V., Haber, N., & Goodman, N. D. (2024). Quiet-STaR: Language models can teach themselves to think before speaking. arXiv preprint arXiv:2403.09629. https://arxiv.org/abs/2403.09629
Zhang, T., & Viteri, M. (2025). Uncovering latent Chain-of-Thought vectors in language models. arXiv preprint arXiv:2409.14026. https://arxiv.org/abs/2409.14026