Thinking Without Words

Introduction
The emergence of Large Reasoning Models trained through reinforcement learning has fundamentally transformed our understanding of AI capabilities. These systems, exemplified by models like o1, demonstrate unprecedented reasoning abilities by leveraging extensive Chain-of-Thought (CoT) processes during inference. However, this breakthrough has simultaneously exposed a critical limitation: the constraint of reasoning through natural language tokens.
Traditional Chain-of-Thought reasoning, while interpretable and effective, forces models to articulate every reasoning step through the bottleneck of human language. This linguistic mediation introduces computational inefficiencies and constrains the expressiveness of thought processes. Recent research has begun exploring a radical alternative: continuous latent reasoning, where models perform inference directly in high-dimensional embedding spaces rather than through discrete language tokens.
This paradigm shift represents more than an incremental improvement—it fundamentally challenges how we conceptualize machine reasoning and opens pathways to cognitive capabilities that transcend the limitations of linguistic expression.

Image inspired by the one from the paper Reasoning Beyond Language: A Comprehensive Survey on Latent Chain-of-Thought Reasoning.
Historical Development: The Evolution Toward Latent Reasoning
Early Foundations (2022-2023)
A significant step in improving model reasoning was the introduction of the Self-Taught Reasoner (STaR) by Zelikman et al. (2022). Instead of abstracting reasoning away from language, STaR refines the model’s ability to produce explicit, step-by-step “chain-of-thought” rationales. The method uses an iterative process: the model generates rationales, and if a rationale leads to a correct answer, it is used as training data to fine-tune the model. This established the principle that a model can effectively teach itself to become a better reasoner by learning from its own successfully articulated reasoning, improving performance on complex tasks without needing a large, pre-existing dataset of human-annotated rationales.

Deng et al. (2023) introduced the concept of Implicit Chain-of-Thought (ICoT), which uses knowledge distillation to train a student model to reason using its internal hidden states. In this process, the student model learns to emulate the layer-by-layer hidden-state trajectories of a larger teacher model as the teacher generates an explicit chain of thought. The goal is to distill the teacher’s “horizontal,” step-by-step reasoning into a more efficient “vertical” reasoning process that occurs implicitly within the student model’s layers. While this method significantly speeds up inference time, this efficiency comes at a cost. The authors report that the implicit approach leads to a notable decrease in task accuracy when compared to models that generate an explicit chain of thought, highlighting a direct trade-off between inference speed and final performance.

A crucial early insight came from interpretability studies. Yang et al. (2024) asked whether large language models latently perform multi-hop reasoning. They found moderate evidence for this latent reasoning, observing it in around 40% of cases on average, with much higher rates for specific types of reasoning tasks. This showed that models’ hidden layers transiently encoded information about intermediate “hops” even when answering directly, hinting at untapped latent reasoning potential.

The Discrete Token Era (2024)
The next evolutionary step involved experimenting with specialized discrete tokens to represent reasoning states. Goyal et al. (2023) introduced “pause tokens” that enabled models to perform additional internal computation before generating outputs. These tokens, inserted in a fixed, non-adaptive sequence, served as computational placeholders, allowing for delayed prediction and improved accuracy on logic-intensive tasks. The key insight was that models could benefit from “thinking time” even within the discrete token framework.

In their paper, “Let’s Think Dot by Dot,” Pfau et al. (2024) investigate whether the performance gains from chain-of-thought are due to interpretable reasoning or simply the greater computation that additional tokens allow. They demonstrate that for certain algorithmic tasks, transformers can use meaningless “filler tokens” (e.g., ’…’) to perform complex, hidden computations, achieving high accuracy on problems they could not solve when forced to respond immediately. For example, on a sufficiently complex 3SUM task, models using filler tokens reached 100% accuracy, whereas models without them were only 66% accurate. This suggests the critical bottleneck is not the semantic content of the tokens, but rather the computational limitation of a single forward pass. The sequence of filler tokens provides the model with a “scratchpad” for multi-step reasoning, directly challenging the assumption that a model’s intermediate steps must be linguistically meaningful to be computationally effective.

Zelikman et al. (2024) developed Quiet-STaR, employing learnable tokens to mark boundaries of internal rationales. This approach enabled language models to infer unstated reasoning steps, improving generalization without task-specific fine-tuning. The system generated token-level rationales internally (one hidden “explanation” per token produced) without outputting them, essentially “thinking before speaking” in a fine-grained way.

The Continuous Revolution (2024-2025)
The most significant breakthrough came with Hao et al. (2024) and their COCONUT (Chain of Continuous Thought) architecture. Rather than using discrete tokens, COCONUT directly feeds the model’s last hidden states as input embeddings for subsequent reasoning steps. This innovation eliminated the lossy projection from high-dimensional representations to discrete vocabulary distributions.

Technically, COCONUT operates by:
- Processing the input question normally through the transformer.
- Taking the final hidden state (a high-dimensional vector, typically 2048-4096 dimensions).
- Instead of projecting to vocabulary and sampling, directly feeding this vector back as the “next token” embedding.
- Repeating this process for multiple latent reasoning steps.
- Only projecting to vocabulary for the final answer.
This preserves orders of magnitude more information. A 4096-dimensional vector, even under an aggressive 4-bit quantization scheme, holds 16,384 bits of information (4096 dimensions × 4 bits). In stark contrast, a single discrete token drawn from a typical 50k vocabulary represents only about 16 bits (log₂ 50,000 ≈ 15.6).
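To make this loop concrete, below is a minimal sketch of a COCONUT-style latent loop written against the Hugging Face transformers API. The backbone (gpt2), the number of latent steps, and the greedy one-token answer are illustrative assumptions rather than the authors' released code; the recycling works without a projection here only because GPT-2's hidden size equals its embedding size.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in backbone; hidden size == embedding size, so no projection needed
tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)

@torch.no_grad()
def latent_reason(question: str, n_latent_steps: int = 4) -> str:
    ids = tok(question, return_tensors="pt").input_ids
    embeds = lm.get_input_embeddings()(ids)              # (1, T, d): process the question normally

    for _ in range(n_latent_steps):
        out = lm(inputs_embeds=embeds)
        last_hidden = out.hidden_states[-1][:, -1:, :]   # final hidden state of the last position
        # Skip the projection to vocabulary: feed the hidden state back
        # as the embedding of the "next token" (one continuous thought).
        embeds = torch.cat([embeds, last_hidden], dim=1)

    # Only the final answer is projected to vocabulary (greedy, one token for brevity).
    logits = lm(inputs_embeds=embeds).logits[:, -1, :]
    return tok.decode(logits.argmax(dim=-1))

print(latent_reason("Question: 2 + 3 * 4 = ?  Answer:"))
```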
In their 2024 paper, Cheng and Van Durme (2024) introduced Compressed Chain-of-Thought (CCoT), a framework that utilizes a dual-module architecture. A CCOT module (parameterized by φ) generates a sequence of dense “contemplation tokens,” which serve as compressed representations of an entire reasoning chain. A second DECODE module (parameterized by ψ) then uses these tokens to produce the final answer. This approach demonstrates that complex reasoning can be effectively summarized in continuous representations. However, contrary to methods that process steps in parallel, CCoT generates these contemplation tokens autoregressively, meaning they are produced sequentially one after another.

Liu et al. (2024) proposed Hidden Chain-of-Thought (HCoT), training auxiliary models to generate compact thought representations that maintain semantic richness while drastically reducing computational overhead. Their method compresses each intermediate reasoning step into a special [CoT] token, interleaving these compressed thoughts with the generated content.

Token Assorted (Su et al. (2025)) took a hybrid approach, using a VQ-VAE to encode early reasoning steps into latent codes while keeping later, critical steps in text. This model reduced the length of reasoning traces by an average of 17% while maintaining interpretability where needed.

Architectural Innovations (2025)
Recent work has focused on architectural modifications that natively support latent reasoning. Geiping et al. (2025) introduced Huginn, a recurrent framework enabling adaptive computation allocation through RNN-like iterative processing. The architecture consists of:
- Prelude: Initial layers encoding input into latent state.
- Recurrent Core: Transformer blocks that can be applied repeatedly.
- Coda: Final layers decoding the answer.
This design decoupled computation depth from parameter count, allowing a 3.5B model to achieve 50B-model performance through approximately 32 recurrent iterations.


In “Enhancing Auto-regressive Chain-of-Thought through Loop-Aligned Reasoning,” Yu et al. (2025) introduce RELAY (REasoning through Loop Alignment iteratively), a two-stage framework designed to improve how auto-regressive models handle long reasoning tasks. The method bridges the gap between auto-regressive models, which often struggle with generating accurate, long Chain-of-Thought (CoT) sequences, and Looped Transformers, which have strong length generalization capabilities but limited versatility.
The RELAY framework first trains a Looped Transformer by aligning its internal loop iterations with the explicit reasoning steps of a CoT process. This allows the Looped Transformer to generate accurate, high-quality reasoning chains for problems that are more complex and longer than those in its original training data. These generated chains are then used as a new, high-quality dataset to fine-tune a standard auto-regressive model, significantly enhancing its performance on complex reasoning tasks that require generalization to longer problem lengths.

Chen et al. (2025) proposed the Inner Thinking Transformer (ITT), treating each transformer layer as a discrete reasoning step with adaptive token routing and residual refinement.

Technical Deep Dive: Mechanisms of Continuous Reasoning
Core Architecture
Continuous latent reasoning fundamentally alters the information flow in transformer architectures. In traditional models, the transformation sequence follows:
Input Embeddings → Transformer Layers → Projection to Vocabulary → Sampling → Next Token
Continuous reasoning architectures bypass the projection bottleneck:
Input Embeddings → Transformer Layers → Direct Hidden State Reuse
The mathematical formulation involves modifying the standard transformer update. Instead of projecting to the vocabulary and sampling,

e_{t+1} = Embed(y_t), where y_t ~ softmax(W_vocab · h_t),

latent reasoning reuses the hidden state directly:

e_{t+1} = ProjectToEmbed(h_t)

where ProjectToEmbed is typically a learned linear transformation that maps the hidden state back to the embedding dimension.
Training Methodologies
Building a model that genuinely “thinks in vectors” is less about inventing huge new architectures and more about guiding the network away from the crutch of language without wrecking its performance. Current practice has converged on the recipe below.
1 · Curriculum-guided latentisation
Training still starts with ordinary chain-of-thought (CoT), but every epoch hides a growing fraction of those intermediate words and asks the model to run directly on their hidden-state vectors.
- COCONUT runs through the corpus in seven discrete stages. At stage 0 no rationale tokens are hidden; by the final stage roughly 85 % of every rationale is replaced by its own activations. Each new stage is introduced only after perplexity has stabilised, so the network never “forgets how to read”.
- Stepwise Internalisation removes chain-of-thought (CoT) tokens from the beginning of the reasoning sequence in a continuous, linear fashion. With each training epoch, more tokens are hidden from the start of the rationale, forcing the model to internalise the initial steps of the reasoning process. A technique called “Removal Smoothing” introduces a small amount of randomness to the number of tokens being removed, which helps to stabilize training as the model learns to operate on increasingly truncated context.
Hidden tokens receive no direct loss; visible tokens and the final answer are trained with the usual cross-entropy.
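To make the schedule concrete, the plain-Python sketch below decides how many leading rationale tokens to hide at a given curriculum stage, combining COCONUT-style staging with the removal smoothing described for Stepwise Internalisation. The stage count, final fraction, and jitter range are illustrative assumptions, not values from either paper.

```python
import random

def tokens_to_hide(rationale_len: int, stage: int, n_stages: int = 7,
                   final_fraction: float = 0.85, smoothing: int = 2) -> int:
    """Illustrative schedule: how many leading rationale tokens to replace
    with their hidden-state vectors at a given curriculum stage.

    Stage 0 hides nothing; the final stage hides ~final_fraction of the
    rationale. A small random offset ("removal smoothing") softens the
    jump between stages.
    """
    frac = final_fraction * stage / max(n_stages - 1, 1)
    base = int(frac * rationale_len)
    jitter = random.randint(-smoothing, smoothing)
    return max(0, min(rationale_len, base + jitter))

# Example: a 40-token rationale across the curriculum
for stage in range(7):
    print(stage, tokens_to_hide(40, stage))
```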
2 · Hidden-state distillation and self-distillation
Once the network can survive missing words, the next step is to make its latent trajectory mimic a teacher that still reasons out loud.
- Implicit CoT (ICOT) records the layer-wise activations of a frozen teacher and trains a student to match them, often using a mean-squared-error loss, while predicting the final answer. On GPT-2 Small, this provides a significant inference-time speed-up versus explicit CoT (roughly 4 to 8 times faster, depending on the task). However, this efficiency comes at a substantial cost to accuracy, particularly on more complex tasks. For example, while the accuracy drop is minor for 4x4 multiplication, it falls by over 20 points on the GSM8K math dataset and by 90 points on 5x5 multiplication. The paper provides these results for models up to GPT-2 Large.
- CODI runs the same set of weights twice—once with text and once with latent thoughts—and aligns only the hidden state immediately before the answer token. Because there is no separate teacher, CODI matches explicit-CoT accuracy on GSM8K while cutting context length by about a factor of three.
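A minimal sketch of an ICoT-style training objective is shown below, assuming the teacher's layer-wise hidden states have already been recorded. The MSE matching term, its weight alpha, and the tensor shapes are assumptions for illustration rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_hiddens: list[torch.Tensor],
                      teacher_hiddens: list[torch.Tensor],
                      answer_logits: torch.Tensor,
                      answer_ids: torch.Tensor,
                      alpha: float = 1.0) -> torch.Tensor:
    """ICoT-style objective (sketch): match the teacher's layer-wise hidden
    states with MSE while still predicting the final answer with cross-entropy.

    student_hiddens / teacher_hiddens: lists of (batch, seq, d) tensors, one per layer.
    answer_logits: (batch, seq, vocab); answer_ids: (batch, seq).
    """
    match = sum(F.mse_loss(s, t.detach()) for s, t in zip(student_hiddens, teacher_hiddens))
    ce = F.cross_entropy(answer_logits.flatten(0, 1), answer_ids.flatten())
    return ce + alpha * match
```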
3 · Compact latent tokens
ICOT-style distillation still leaves “one vector per hidden step”. The next advance packs many steps into a handful of learned embeddings.
- CCoT generates a short sequence of “contemplation tokens”, typically 30–50 % as long as the full CoT. An auxiliary decoder can reconstruct the dropped text for auditability.
- HCoT collapses an entire rationale into a single placeholder token [CoT]. A contrastive InfoNCE loss pushes two placeholders apart unless they decode to the same complete rationale, producing dense, semantically clustered codes (see the sketch after this list).
- Token Assorted compresses only the early reasoning hops into VQ-VAE codes and leaves the final, safety-critical steps in natural language, shaving about 15–20% off the trace length while preserving human-readable checks.
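The snippet below is a hedged sketch of the kind of in-batch InfoNCE objective described for HCoT's placeholder embeddings; the temperature and the use of other batch rows as negatives are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def infonce_placeholder_loss(anchors: torch.Tensor,
                             positives: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE over [CoT] placeholder embeddings (sketch).

    anchors[i] and positives[i] are two encodings of the same rationale,
    each of shape (batch, d); every other row in the batch acts as a negative.
    """
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.T / temperature                          # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=anchors.device)  # the diagonal is the positive pair
    return F.cross_entropy(logits, targets)
```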
4 · Recurrent or loop-aligned supervision
Some architectures keep sequence depth fixed and let the model loop through shared blocks as many times as it needs.
- Huginn splits the network into Prelude, Shared Core and Coda blocks. During pre-training it is unrolled 1 – 32 times at random. Cross-entropy is applied only to the final (and optionally the last few) iterations, and a small KL-style stability term limits drift between successive hidden states. At inference a learned halting gate decides when to stop looping.
- RELAY first aligns each loop iteration with the next step of a known CoT, then freezes that looped model and distils its internal trace into a standard auto-regressive decoder. The distilled model scores about 10–12 points higher than an equal-size plain decoder on long-division benchmarks.
- Inner Thinking Transformer (ITT) adds a lightweight linear probe after every residual block that predicts the current sub-answer. The probe and the main weights are trained jointly, and an adaptive token router lets tokens judged “easy” skip further passes, saving roughly one-quarter of the total compute without hurting accuracy.
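As a rough illustration of the recurrent-depth pattern (not Huginn's or RELAY's actual architecture), the sketch below unrolls one shared block a random number of times during training and a chosen number of times at inference; the layer sizes, vocabulary, and unroll range are assumptions.

```python
import random
import torch
import torch.nn as nn

class RecurrentDepthLM(nn.Module):
    """Prelude -> shared core (looped) -> coda, in the spirit of Huginn (sketch)."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, vocab: int = 32000):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.embed = nn.Embedding(vocab, d_model)
        self.prelude = layer()          # encodes input into the latent state
        self.core = layer()             # shared weights, applied k times
        self.coda = layer()             # decodes the final latent state
        self.head = nn.Linear(d_model, vocab)

    def forward(self, ids: torch.Tensor, k: int = 0) -> torch.Tensor:
        k = k or random.randint(1, 32)  # random unroll depth during training
        h = self.prelude(self.embed(ids))
        for _ in range(k):
            h = self.core(h)            # extra reasoning costs compute, not parameters
        return self.head(self.coda(h))

model = RecurrentDepthLM()
logits = model(torch.randint(0, 32000, (2, 16)), k=8)  # request deeper "thinking" at inference
```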
5 · Hybrid latent reinforcement learning
Supervised data eventually tops out, so teams switch to reinforcement learning that rewards correctness and charges for extra computation.
Hybrid Reasoning Policy Optimisation (HRPO) is the flagship:
- At every generation step the network mixes two embeddings—a normal token embedding and a transformed copy of the previous hidden state—weighted by an action variable gamma.
- The reward is 1 for a correct final answer, minus a small fee per visible token and per latent iteration (instances where gamma ≠ 0).
- HRPO is trained with Group Relative Policy Optimisation (GRPO), which uses the mean reward of a mini-batch of roll-outs as its baseline instead of a learned critic. GRPO needs about half the memory of PPO and converges just as fast.
Setting the step penalty too low makes the model talk verbosely; setting it too high drives it into silent, brittle reasoning. Authors report that a short grid search over a few hundred prompts is enough to find the sweet-spot penalty.
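The sketch below illustrates the two ingredients just described: a group-relative advantage computed from a batch of roll-outs, and a reward that pays for correctness while charging per visible token and per latent step. The standard-deviation normalisation and the fee values are illustrative assumptions, not HRPO's released implementation.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group Relative Policy Optimisation baseline (sketch).

    rewards: (n_prompts, group_size) final rewards for a group of rollouts
    sampled from the same prompt. The group statistics replace a learned critic.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True) + 1e-6
    return (rewards - mean) / std

def hybrid_reward(correct: bool, n_visible_tokens: int, n_latent_steps: int,
                  token_fee: float = 0.001, latent_fee: float = 0.001) -> float:
    """HRPO-style reward (sketch): 1 for a correct answer, minus small
    per-token and per-latent-step fees. The fee values are illustrative."""
    return float(correct) - token_fee * n_visible_tokens - latent_fee * n_latent_steps

rollouts = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                         [0.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rollouts))
```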
6 · Generic efficiency add-ons
Three auxiliary techniques, first devised for token-level RL, carry over cleanly to the latent regime:
- GRPO itself, supplying a critic-free baseline.
- Adaptive Length Penalty (ALP), which scales the per-step cost inversely with the model’s real-time solve-rate, trimming median reasoning length by ~50 % without hurting hard cases.
- AdaRFT introduces an adaptive curriculum for reinforcement finetuning. It works by dynamically adjusting the difficulty of training problems to match the model’s current skill level. Based on the model’s recent reward signals, it selects problems that are challenging but still solvable. This adaptive sampling avoids wasting computation on problems that are too easy or too hard, which in turn keeps the reward signal rich and informative for more efficient learning.
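As a deliberately simplified reading of the adaptive-curriculum idea, the sketch below samples problems whose difficulty sits just above a running skill estimate derived from recent rewards; the band width, difficulty scores, and selection rule are assumptions, not AdaRFT's actual algorithm.

```python
import random

def sample_batch(problems, skill_estimate, batch_size=8, band=0.15):
    """Adaptive-curriculum sampling (illustrative sketch): pick problems whose
    difficulty sits just above the model's current skill estimate, so the
    reward signal stays informative.

    problems: list of (problem, difficulty in [0, 1]).
    skill_estimate: e.g. an exponential moving average of recent rewards, in [0, 1].
    """
    target = min(1.0, skill_estimate + band / 2)
    candidates = [p for p, d in problems if abs(d - target) <= band] or [p for p, _ in problems]
    return random.sample(candidates, min(batch_size, len(candidates)))

pool = [(f"problem_{i}", i / 20) for i in range(20)]   # toy pool with difficulty scores
print(sample_batch(pool, skill_estimate=0.4, batch_size=4))
```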
A modern training run at a glance
- Supervised warm-up on full CoT.
- Curriculum latentisation for five to ten epochs (COCONUT or Stepwise Internalisation).
- Hidden-state distillation, optionally followed by compact latent token training (CCoT or HCoT).
- Architecture-aligned pre-training if you use loops (Huginn, RELAY, ITT).
- Hybrid RL fine-tuning with HRPO + GRPO, optionally adding ALP and AdaRFT.
Representational Dynamics
Analysis of COCONUT’s latent reasoning reveals a process more complex than a simple linear chain. Instead of committing to a single path, the continuous thoughts can be interpreted as a latent search tree that explores multiple potential next steps simultaneously.
- Parallel Reasoning Paths: The paper demonstrates that a single “continuous thought” can encode multiple branching hypotheses. This is shown by forcing the model to decode its latent state back into language; the probability distribution over the possible next words reveals that several different reasoning paths are being considered at once. For example, in a logical reasoning task, the model remains uncertain about the correct choice after the first continuous thought but successfully identifies the correct path after the second, suggesting it progressively prunes incorrect branches of its search tree.


- Implicit Value Function: This latent search is not uniform. The model learns to prioritize more promising paths and prune less relevant ones. The paper refers to the probability distribution over potential next steps as the model’s “implicit value function,” which estimates each node’s potential to lead to the correct answer.
- From Exploration to Focus: This dynamic changes as the reasoning process unfolds. Analysis shows that during the initial continuous thoughts, the model maintains significant diversity, exploring several alternative paths in parallel. In subsequent thoughts, this parallelism narrows as the model gains more certainty and focuses on the most promising reasoning path. This process allows COCONUT to perform a kind of implicit breadth-first search, which is particularly advantageous for complex planning tasks.

Implementation Challenges
Large‑scale latent reasoning is still in its infancy, and every experimental system to date has surfaced at least one blocking issue that does not appear in ordinary Chain‑of‑Thought models. The literature converges on six broad pain‑points:
1. A curriculum is mandatory—otherwise the model never “gets” latent reasoning
COCONUT’s ablation shows that training directly on (question, answer) pairs with hidden‑state recycling performs worse than a no‑CoT baseline. Only the staged schedule that first teaches the model to reason in language and then incrementally replaces early steps with vectors unlocks its gains. The GSM8K accuracy crash from 34.1 → 14.4% in the “w/o curriculum” run makes the point starkly clear. Designing such curricula—and automatically tuning them for new domains—remains an open research problem.
2. Latent loops break GPU parallelism
Because every continuous thought depends on the previous hidden state, training (and inference) cost scales with the number of latent steps, not the batch size. COCONUT explicitly notes that it must execute n + 1 forward passes for n thoughts, and that “the sequential nature of the multiple forward passes poses challenges for parallelism.” This serial dependency throttles throughput on modern GPU clusters built for large‑batch matrix multiplies.
3. KV‑cache memory becomes the new bottleneck
Long latent traces do not expand the visible token sequence, but they do enlarge the key/value cache: every extra iteration stores another full set of key and value vectors for every layer. Recent work on SQuat (Wang et al. (2025)) shows that even with aggressive INT‑2 quantisation the cache can dominate peak GPU memory when models “think” for dozens of steps. Compression helps but introduces accuracy/latency trade‑offs that are not yet well understood.
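A back-of-the-envelope estimate makes the scale tangible; the layer count, key/value head width, and fp16 precision below describe a generic 7B-class configuration rather than any specific system.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   positions: int, bytes_per_value: int = 2) -> int:
    """Rough KV-cache size: keys + values for every layer and every cached position.
    bytes_per_value=2 assumes fp16/bf16; quantisation lowers it."""
    return 2 * n_layers * n_kv_heads * head_dim * positions * bytes_per_value

# Illustrative 7B-class config: 32 layers, 32 KV heads of width 128.
prompt, latent_steps = 1024, 64
base = kv_cache_bytes(32, 32, 128, prompt)
with_latent = kv_cache_bytes(32, 32, 128, prompt + latent_steps)
print(f"{base / 2**20:.0f} MiB -> {with_latent / 2**20:.0f} MiB per sequence")
```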
4. Knowing when to stop is still heuristic
During inference a latent‑reasoning model must decide when to emit an <eot> and return to language space. Current systems either pad to a fixed depth or train an ad‑hoc binary classifier over hidden states, and COCONUT reports that both heuristics work “comparably well.” Huginn trains a learned halting classifier (§4.1, p 5) which shows promise but still requires careful tuning. Neither approach adapts gracefully to problem difficulty, and mis‑predictions manifest as truncated explanations or runaway loops.
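For concreteness, such an ad-hoc stop-or-continue classifier can be as small as the sketch below: a linear head over the current hidden state whose output is thresholded at inference. The hidden size, threshold, and training signal are assumptions.

```python
import torch
import torch.nn as nn

class HaltingProbe(nn.Module):
    """Binary 'stop thinking?' head over the current hidden state (sketch)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.score = nn.Linear(d_model, 1)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.score(hidden)).squeeze(-1)  # P(emit <eot> now)

probe = HaltingProbe(d_model=2048)
h = torch.randn(1, 2048)                 # current latent thought
keep_thinking = probe(h).item() < 0.5    # the 0.5 threshold is a tunable heuristic
```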
5. Deep recurrent stacks risk optimisation instability
Recurrent‑depth architectures such as Huginn push performance by unrolling a shared core 30+ times, but the authors note that gradient signals weaken as depth grows, requiring careful learning‑rate scaling and residual gating to avoid divergence. Balancing depth‑on‑demand with stable training dynamics is still an active area of study.
6. Tooling for debugging and evaluation is immature
A survey of efficient reasoning methods highlights a “complexity of latent‑space implementation” gap: without textual traces, it is hard to verify correctness, attribute errors, or measure reasoning efficiency. New metrics (e.g., embedding‑consistency scores) and visual probes are being proposed, but no standard evaluation suite exists yet.
The Interpretability Crisis
The “Neuralese” Problem
Recent mechanistic interpretability research (Lindsey et al. (2025)) reveals models performing sophisticated internal reasoning through complex feature interactions. This suggests the development of what we might call “Neuralese”: an emergent internal representational language that arises within AI systems operating in continuous latent spaces, characterized by high-dimensional vector patterns that encode semantic and logical relationships in ways that are fundamentally untranslatable to human linguistic concepts. Unlike discrete tokens, which at least map to vocabulary items, continuous thoughts exist in a 4096-dimensional space with no natural interpretation.
Key challenges include:
- Non-unique Representations: The same reasoning can be encoded in infinitely many ways due to rotation/scaling invariance.
- Distributed Encoding: Information is spread across all dimensions, not localized.
- Dynamic Semantics: The “meaning” of dimensions changes based on context.
Emerging Interpretability Techniques
Researchers are developing novel approaches:
- Geometric Analysis: Zhang and Viteri (2025) discovered latent CoT vectors—specific directions in activation space that elicit reasoning:
  reasoning_vector = h_with_cot - h_without_cot
  h_enhanced = h_input + α * reasoning_vector

- Representational Probing: Advanced techniques attempt semantic decoding:
- Training linear probes to predict intermediate answers from latent states (sketched below).
- Using contrastive learning to align latent states with linguistic descriptions.
- Projecting trajectories to 2D/3D for visualization.
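As a concrete example of the first probing idea, the sketch below trains a linear probe to read intermediate sub-answers (framed here as a classification task) out of frozen latent states; the dimensions, the classification framing, and the optimiser settings are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Sketch: a linear probe that tries to read an intermediate answer
# (here, a class label) out of frozen latent states. Dimensions are illustrative.
d_model, n_classes = 2048, 10
probe = nn.Linear(d_model, n_classes)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def probe_step(latent_states: torch.Tensor, labels: torch.Tensor) -> float:
    """latent_states: (batch, d_model) hidden states collected during latent
    reasoning (kept frozen); labels: (batch,) intermediate sub-answers."""
    opt.zero_grad()
    loss = loss_fn(probe(latent_states), labels)
    loss.backward()
    opt.step()
    return loss.item()

print(probe_step(torch.randn(32, d_model), torch.randint(0, n_classes, (32,))))
```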
Despite these efforts, full interpretability remains elusive. We can detect that reasoning is happening and sometimes what kind, but not the detailed how.
The Alignment Challenge
The interpretability deficit poses serious concerns:
- Deceptive Reasoning: Models could develop reasoning strategies that appear correct but contain hidden flaws or biases undetectable without linguistic output.
- Verification Difficulty: How do we ensure the model isn’t taking shortcuts or using spurious correlations in its latent reasoning?
- Monitoring and Control: Traditional safety measures (like output filtering) fail when the critical computation happens before any text is generated.
Proposed solutions include:
- Consistency Checking: Force occasional explicit reasoning and verify alignment with latent conclusions.
- Adversarial Probing: Train secondary models to detect problematic patterns in latent trajectories.
- Hybrid Architectures: Maintain parallel explicit and latent reasoning streams for cross-validation.
Current Applications and Performance
Mathematical Reasoning (GSM8K)
On the GSM8K math reasoning dataset, the performance of continuous reasoning is more nuanced.
- COCONUT Performance:
- It achieves 34.1% accuracy, which is a significant improvement over the 16.5% from a No-CoT baseline.
- However, it does not surpass the 42.9% accuracy of the standard Chain-of-Thought (CoT) baseline.
- Its primary advantage is efficiency, reducing the number of reasoning tokens from 25.0 for CoT to just 8.2.
Logical Reasoning (ProntoQA & ProsQA)
COCONUT shows its most dramatic improvements on logical reasoning tasks that require planning and searching.
- COCONUT Performance:
- ProntoQA: Achieves 99.8% accuracy, outperforming the 98.8% CoT baseline.
- ProsQA: Shows a major leap, scoring 97.0% accuracy compared to the CoT baseline’s 77.5%.
- This high accuracy is achieved with significant efficiency gains, requiring far fewer tokens than CoT on both datasets.
- This efficiency comes from avoiding verbose explanations and instead directly manipulating concepts in a continuous vector space, which proves especially advantageous for complex logical planning.
Multimodal Integration
Heima (Shen et al. (2025)) demonstrated that “thinking tokens” excel at multimodal reasoning. The framework introduces several key advantages:
- Massive Efficiency Gains: It drastically reduces computational overhead, requiring as few as 6% of the reasoning tokens compared to verbose text-based methods.
- Latent Space Reasoning: By operating in a continuous latent space, it maintains comparable—and in some cases, superior—accuracy on complex visual reasoning benchmarks while avoiding the “description bottleneck” of converting visual concepts into words.
- Effective Cross-Modal Encoding: The model’s latent “thinking tokens” effectively encode rich visual information, which can be decoded back into text descriptions even without access to the original image.

Code Generation and Formal Reasoning
Early experiments show promise:
- Program Synthesis: Latent states can maintain program state, variable bindings, and control flow without verbose comments.
- Theorem Proving: Abstract mathematical relationships are represented more naturally as vector transformations than as symbolic strings.
- Planning Tasks: The COCONUT paper demonstrates that the implicit breadth-first search used is highly effective for planning problems that require backtracking. On ProsQA, a dataset specifically designed to challenge planning capabilities, COCONUT found solutions significantly faster than traditional CoT, with an average inference time of 0.15 seconds compared to 0.47 seconds for CoT.
Why This Represents the Next Breakthrough
Computational Efficiency Revolution
The token overhead of linguistic reasoning becomes prohibitive as models require deeper thought:
- Scaling Analysis:
- Traditional CoT: O(n) reasoning tokens but O(n²) attention cost for n-step reasoning, because every new token attends over the growing trace.
- Latent reasoning: roughly O(n) total attention, since the visible context stays fixed and each step recycles a single hidden state; memory and KV-cache growth are linear in the number of latent steps rather than in the number of generated tokens.
- At n = 100 steps, CoT therefore needs on the order of 100× more attention computation than latent reasoning (see the calculation below).
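The asymptotics above reduce to a one-line calculation; the function below ignores constant factors and simply contrasts the quadratic and linear terms.

```python
def attention_ratio(n_steps: int) -> float:
    """Order-of-magnitude comparison from the bullets above: explicit CoT pays
    roughly O(n^2) attention over an n-token trace, latent reasoning roughly O(n).
    Constant factors are deliberately ignored."""
    cot_cost = n_steps ** 2
    latent_cost = n_steps
    return cot_cost / latent_cost

print(attention_ratio(100))  # 100.0 -> CoT needs ~100x more attention work at n = 100
```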
Unlocking Non-Linguistic Intelligence
Continuous reasoning enables fundamentally new capabilities:
- Parallel Hypothesis Exploration: Unlike sequential text, vectors can superpose multiple reasoning branches: conceptually, a weighted sum h ≈ Σ_i α_i · h_branch_i over candidate next steps. Empirically, COCONUT’s ablation (p. 4) shows wider hidden states boost success on ProsQA, consistent with superposition.
- Continuous Optimization: Reasoning becomes differentiable, enabling gradient-based search for solutions rather than discrete sampling.
- Emergent Algorithms: Models develop internal procedures resembling classical algorithms (dynamic programming, branch-and-bound) without explicit programming.
Scaling Law Implications
Recent findings from Ye et al. (2025) provide new insights into how model architecture affects reasoning:
- Depth is More Crucial than Width: The paper demonstrates that for complex reasoning tasks, a model’s depth (number of layers) is more important than its width (number of neurons per layer). For example, a deeper, smaller model (16-layer, 576-dim) significantly outperforms a shallower, larger model (4-layer, 1920-dim) on math problems with longer reasoning chains. This suggests that performance in reasoning doesn’t just depend on model size, but on having sufficient depth to process complex, sequential steps.
- Computationally Efficient Reasoning: The research shows that models learn to be highly efficient by generating the shortest possible solutions. Instead of computing every possible variable, the model learns to “plan ahead” by identifying only the necessary parameters required to answer the question. This avoids wasting computation on unnecessary steps or verbose explanations.
Integration with Reinforcement Learning
Latent reasoning creates a natural bridge to RL:
- Direct Trajectory Optimization: RL rewards can shape latent reasoning paths directly, without passing through the language bottleneck.
- Self-Play and Exploration: Models can rapidly explore reasoning strategies in latent space, orders of magnitude faster than generating text.
Conclusion: The Best Path Forward
Building on every technique reviewed above, the most coherent next step is a single, integrated architecture that lets a model decide—token by token—whether to “think in words” or “think in vectors” while scaling its depth on demand. I propose the GRAIL-Transformer (Gated Recurrent Architecture for Interpretable Latent reasoning), a design that deliberately stitches together the strongest ideas from COCONUT, Huginn, Quiet‑STaR, GRPO and the latest interpretability work so that it meshes naturally with the entire history traced in this post.
1. GRAIL‑Transformer—how it works
- Recurrent‑depth core – We keep the lightweight “Prelude → Shared Core → Coda” loop of Huginn, but allow the shared core to execute any number k of inner iterations at inference time. Extra reasoning therefore costs compute rather than parameters, solving the scalability crunch that plagues ever‑deeper static stacks.
- Learnable gating between text and latent space – At each generation step the model produces a mixed token embedding, roughly e_t = g_t · ProjectToEmbed(h_{t-1}) + (1 − g_t) · Embed(y_{t-1}), with a gate g_t ∈ [0, 1] (see the sketch after this list). Early in training the gate value is driven toward 0 (pure language); a curriculum gradually nudges it up so later layers increasingly recycle hidden states instead of words. Because the gate is differentiable, the model can still “fall back” to language for steps that benefit from explicit explanation or safety audits.
- Latent memory lattice – Rather than a single vector, a compact lattice of 4‑8 cells is updated every loop by gated attention. These cells let the network keep multiple hypotheses alive in parallel, mirroring the implicit breadth‑first search behaviour uncovered in COCONUT, yet remain small enough to fit in cache and to visualise with modern probing tools.
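Because GRAIL-Transformer is a proposal rather than an implemented system, the sketch below is purely hypothetical: it shows one way the text/latent gate from the second bullet could be realised, computing the gate from the concatenated token embedding and previous hidden state. All module names and sizes are invented for illustration.

```python
import torch
import torch.nn as nn

class GatedThoughtMixer(nn.Module):
    """Hypothetical sketch of GRAIL's text/latent gate: mix the previous token's
    embedding with a projection of the previous hidden state, weighted by a
    learned gate g in [0, 1]."""

    def __init__(self, d_model: int):
        super().__init__()
        self.project = nn.Linear(d_model, d_model)   # hidden state -> embedding space
        self.gate = nn.Linear(2 * d_model, 1)        # decides "think in words" vs. "think in vectors"

    def forward(self, token_embed: torch.Tensor, prev_hidden: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([token_embed, prev_hidden], dim=-1)))
        return g * self.project(prev_hidden) + (1 - g) * token_embed

mixer = GatedThoughtMixer(d_model=1024)
mixed = mixer(torch.randn(2, 1024), torch.randn(2, 1024))  # next-step input embedding
```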
2. Training pipeline
- Supervised warm‑up with standard chain‑of‑thought data, so nothing changes relative to today’s best practice.
- Hybrid curriculum: each epoch raises the target mean of the gate value g and shortens the explicit CoT, teaching the model to compress more reasoning into hidden space without a sudden jump in difficulty.
- Group Relative Policy Optimisation (GRPO) fine‑tuning: an RL objective rewards final accuracy, penalises unnecessary inner‑loop steps, and lightly discourages verbose output. GRPO updates come from simple moment matching, so one extra forward pass per sample is enough—vastly cheaper than PPO in the many‑step latent regime.
- Difficulty‑aware refinement: hard questions are replayed with a higher weight and a slightly steeper penalty on step count; this focuses the network’s latent search tree where it matters.
3. Interpretability and safety hooks
- Scheduled contrastive decoding: every few inner loops we clamp the gate to 0 for a single hop and force the model to externalise its current thought. A tiny auxiliary decoder converts that hidden state to text; disagreement with the eventual answer adds a consistency penalty during RL. This provides a “window into the lattice” without slowing ordinary inference.
- Sparse linear probes: as training progresses we freeze periodic checkpoints and train linear maps that predict intermediate sub‑answers from the lattice. A modest mean‑square‑error loss on those probes encourages the hidden basis to stay approximately linear, making post‑hoc auditing tractable instead of desperately ill‑posed.
4. Why this directly tackles the open problems
- Training scalability – Recurrent sharing keeps memory flat; GRPO’s single‑pass updates keep cost linear in data instead of quadratic in steps; mixed‑precision caches plus low‑rank adapters shrink VRAM >50%.
- Inference efficiency – Inner‑loop depth rises only when problems demand it, so routine queries cost little more than a vanilla LLM, yet pathological puzzles can receive dozens of extra reasoning cycles without exceeding context‑length limits.
- Interpretability – Gating, forced reveal steps and probe regularisation ensure that at any point we can sample, decode and analyse a faithful slice of the model’s latent trajectory—something pure‑latent systems could not guarantee.
- Alignment – Because the lattice must periodically translate back into language and match a supervised or self‑consistent rationale, deceptive or shortcutting thoughts face immediate gradient pressure, giving alignment researchers a concrete signal to train against.
I have not implemented GRAIL‑Transformer; at present it is a theoretical synthesis drawn from the most compelling recent findings. Nonetheless, every component already exists in isolation in the literature, and nothing in the design violates current hardware constraints.
Reasoning in latent space therefore looks poised to become the next genuine breakthrough: shedding the bandwidth limits of language, enabling economical yet powerful depth, and—if we build in the right interpretability valves—delivering models whose thought processes are both faster and safer than anything we have today.
References
Chen, X., Wang, L., & Li, Y. (2025). Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking. arXiv preprint arXiv:2502.13842. https://arxiv.org/abs/2502.13842
Chen, X., Zhao, A., Xia, H., Lu, X., Wang, H., et al. (2025). Reasoning Beyond Language: A Comprehensive Survey on Latent Chain-of-Thought Reasoning. arXiv preprint arXiv:2505.16782. https://arxiv.org/abs/2505.16782
Cheng, P., & Van Durme, B. (2024). Compressed Chain-of-Thought: Efficient reasoning through dense representations. arXiv preprint arXiv:2412.13171. https://arxiv.org/abs/2412.13171
Deng, Y., Choi, Y., & Shieber, S. (2024). From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step. arXiv preprint arXiv:2405.14838. https://arxiv.org/abs/2405.14838
Deng, Y., Prasad, K., Fernandez, R., Smolensky, P., Chaudhary, V., & Shieber, S. (2023). Implicit Chain-of-Thought reasoning via knowledge distillation. arXiv preprint arXiv:2311.01460. https://arxiv.org/abs/2311.01460
Geiping, J., Fowl, L., Somepalli, G., Goldblum, M., Moeller, M., Goldstein, T., & Jacobs, T. (2025). Scaling up test-time compute with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171. https://arxiv.org/abs/2502.05171
Goyal, A., Bengio, Y., Weston, J., & Ballas, N. (2023). Think before you speak: Training language models with pause tokens. arXiv preprint arXiv:2310.02226. https://arxiv.org/abs/2310.02226
Hao, S., Gu, Y., Ma, H., Hong, J., Wang, Z., Wang, D., & Hu, Z. (2024). Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769. https://arxiv.org/abs/2412.06769
Lindsey, R., Kenton, Z., Everitt, T., Wattenberg, M., Mirhoseini, A., Leike, J., & Amodei, D. (2025). Circuit tracing: Revealing computational graphs in language models. Transformer Circuits Thread. https://transformer-circuits.pub/2025/attribution-graphs/biology.html
Liu, J., Chen, X., Wang, H., Zhang, L., & Li, M. (2024). Expediting and elevating large language model reasoning via hidden chain-of-thought decoding. arXiv preprint arXiv:2409.08561. https://arxiv.org/abs/2409.08561
Pfau, J., Merrill, W., & Bowman, S. R. (2024). Let’s think dot by dot: Hidden computation in transformer language models. arXiv preprint arXiv:2404.15758. https://arxiv.org/abs/2404.15758
Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300. https://arxiv.org/abs/2402.03300
Shen, H., Wu, Y., Chen, K., Wang, J., & Zhang, Q. (2025). Efficient reasoning with hidden thinking. arXiv preprint arXiv:2501.19201. https://arxiv.org/abs/2501.19201
Shen, Z., Yan, H., Zhang, L., Hu, Z., Du, Y., & He, Y. (2025). CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation. arXiv preprint arXiv:2502.21074. https://arxiv.org/abs/2502.21074
Shi, T., Wu, Y., Song, L., Zhou, T., & Zhao, J. (2025). Efficient Reinforcement Finetuning via Adaptive Curriculum Learning. arXiv preprint arXiv:2504.05520. https://arxiv.org/abs/2504.05520
Su, Y., Liu, T., Wang, D., Chen, H., & Zhou, J. (2025). Token Assorted: Mixing latent and text tokens for improved language model reasoning. arXiv preprint arXiv:2502.03275. https://arxiv.org/abs/2502.03275
Wang, H., Han, L., Xu, K., & Srivastava, A. (2025). SQuat: Subspace-orthogonal KV Cache Quantization. arXiv preprint arXiv:2503.24358. https://arxiv.org/abs/2503.24358
Xiang, V., Blagden, C., Rafailov, R., Lile, N., Truong, S., Finn, C., & Haber, N. (2025). Just Enough Thinking: Efficient Reasoning with Adaptive Length Penalties Reinforcement Learning. arXiv preprint arXiv:2506.05256. https://arxiv.org/abs/2506.05256
Yang, K., Klein, D., Pang, N., & Sachan, M. (2024). Do large language models latently perform multi-hop reasoning? arXiv preprint arXiv:2402.16837. https://arxiv.org/abs/2402.16837
Ye, H., Zhang, C., Wang, X., Liu, Y., & Sun, M. (2025). Scaling laws for reasoning: The importance of model depth. arXiv preprint arXiv:2407.20311. https://arxiv.org/abs/2407.20311
Yu, D., Wang, S., Chen, L., Zhang, M., & Li, X. (2025). Enhancing auto-regressive Chain-of-Thought through loop-aligned reasoning. arXiv preprint arXiv:2502.08482. https://arxiv.org/abs/2502.08482
Yue, Z., Jin, B., Zeng, H., Zhuang, H., Qin, Z., Yoon, J., Shang, L., Han, J., & Wang, D. (2025). Hybrid Latent Reasoning via Reinforcement Learning. arXiv preprint arXiv:2505.18454. https://arxiv.org/abs/2505.18454
Zelikman, E., Wu, Y., Mu, J., & Goodman, N. D. (2022). STaR: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35, 15476–15488. https://arxiv.org/abs/2203.14465
Zelikman, E., Harik, G., Shao, Y., Jayasiri, V., Haber, N., & Goodman, N. D. (2024). Quiet-STaR: Language models can teach themselves to think before speaking. arXiv preprint arXiv:2403.09629. https://arxiv.org/abs/2403.09629
Zhang, T., & Viteri, M. (2025). Uncovering latent Chain-of-Thought vectors in language models. arXiv preprint arXiv:2409.14026. https://arxiv.org/abs/2409.14026