
Thinking Without Words

An F-22 Raptor fighter jet

Photo by Terence Burke

Introduction

The emergence of Large Reasoning Models trained through reinforcement learning has fundamentally transformed our understanding of AI capabilities. These systems, exemplified by models like o1, demonstrate unprecedented reasoning abilities by leveraging extensive Chain-of-Thought (CoT) processes during inference. However, this breakthrough has simultaneously exposed a critical limitation: the constraint of reasoning through natural language tokens.

Traditional Chain-of-Thought reasoning, while interpretable and effective, forces models to articulate every reasoning step through the bottleneck of human language. This linguistic mediation introduces computational inefficiencies and constrains the expressiveness of thought processes. Recent research has begun exploring a radical alternative: continuous latent reasoning, where models perform inference directly in high-dimensional embedding spaces rather than through discrete language tokens.

This paradigm shift represents more than an incremental improvement—it fundamentally challenges how we conceptualize machine reasoning and opens pathways to cognitive capabilities that transcend the limitations of linguistic expression.

Illustration of the core idea of latent reasoning, where a model's internal reasoning process is represented as a high-dimensional vector space with no natural interpretation.

Image inspired by the one from the paper Reasoning Beyond Language: A Comprehensive Survey on Latent Chain-of-Thought Reasoning.


Historical Development: The Evolution Toward Latent Reasoning

Early Foundations (2022-2023)

A significant step in improving model reasoning was the introduction of the Self-Taught Reasoner (STaR) by Zelikman et al. (2022). Instead of abstracting reasoning away from language, STaR refines the model’s ability to produce explicit, step-by-step “chain-of-thought” rationales. The method uses an iterative process: the model generates rationales, and if a rationale leads to a correct answer, it is used as training data to fine-tune the model. This established the principle that a model can effectively teach itself to become a better reasoner by learning from its own successfully articulated reasoning, improving performance on complex tasks without needing a large, pre-existing dataset of human-annotated rationales.

Diagram illustrating the Self-Taught Reasoner (STaR) methodology, where a model iteratively improves by learning from its own generated rationales.

Deng et al. (2023) introduced the concept of Implicit Chain-of-Thought (ICoT), which uses knowledge distillation to train a student model to reason using its internal hidden states. In this process, the student model learns to emulate the layer-by-layer hidden-state trajectories of a larger teacher model as the teacher generates an explicit chain of thought. The goal is to distill the teacher’s “horizontal,” step-by-step reasoning into a more efficient “vertical” reasoning process that occurs implicitly within the student model’s layers. While this method significantly reduces inference time, the efficiency comes at a cost. The authors report that the implicit approach leads to a notable decrease in task accuracy when compared to models that generate an explicit chain of thought, highlighting a direct trade-off between inference speed and final performance.

Diagram explaining Implicit Chain-of-Thought (ICoT), where a student model learns to reason implicitly by mimicking the hidden-state trajectory of a larger teacher model.

A crucial early insight came from interpretability studies. Yang et al. (2024) asked whether large language models latently perform multi-hop reasoning. They found moderate evidence for this latent reasoning, observing it in around 40% of cases on average, with much higher rates for specific types of reasoning tasks. This showed that models’ hidden layers transiently encoded information about intermediate “hops” even when answering directly, hinting at untapped latent reasoning potential.

Illustration of a latent multi-hop reasoning probe. A diagram shows how changing an input prompt is used to measure a model's internal recall for multi-step inferences.

The Discrete Token Era (2024)

The next evolutionary step involved experimenting with specialized discrete tokens to represent reasoning states. Goyal et al. (2023) introduced “pause tokens” that enabled models to perform additional internal computation before generating outputs. These tokens, inserted in a fixed, non-adaptive sequence, served as computational placeholders, allowing for delayed prediction and improved accuracy on logic-intensive tasks. The key insight was that models could benefit from “thinking time” even within the discrete token framework.

Comparison diagram: standard inference versus 'pause-inference'. The latter uses pause tokens to enable extra computation before output, illustrated with new computational paths.

In their paper, “Let’s Think Dot by Dot,” Pfau et al. (2024) investigate whether the performance gains from chain-of-thought are due to interpretable reasoning or simply the greater computation that additional tokens allow. They demonstrate that for certain algorithmic tasks, transformers can use meaningless “filler tokens” (e.g., ’…’) to perform complex, hidden computations, achieving high accuracy on problems they could not solve when forced to respond immediately. For example, on a sufficiently complex 3SUM task, models using filler tokens reached 100% accuracy, whereas models without them were only 66% accurate. This suggests the critical bottleneck is not the semantic content of the tokens, but rather the computational limitation of a single forward pass. The sequence of filler tokens provides the model with a “scratchpad” for multi-step reasoning, directly challenging the assumption that a model’s intermediate steps must be linguistically meaningful to be computationally effective.

Comparison of three reasoning approaches: chain-of-thought (explicit reasoning), filler tokens (dots for computation), and immediate answer, showing filler tokens achieve similar performance to CoT.

Zelikman et al. (2024) developed Quiet-STaR, employing learnable tokens to mark boundaries of internal rationales. This approach enabled language models to infer unstated reasoning steps, improving generalization without task-specific fine-tuning. The system generated token-level rationales internally (one hidden “explanation” per token produced) without outputting them, essentially “thinking before speaking” in a fine-grained way.

Diagram of the Quiet-STaR algorithm, showing its 'think, talk, learn' phases. The model generates internal thoughts for each token and uses reinforcement learning to improve predictions.

The Continuous Revolution (2024-2025)

The most significant breakthrough came with Hao et al. (2024) and their COCONUT (Chain of Continuous Thought) architecture. Rather than using discrete tokens, COCONUT directly feeds the model’s last hidden states as input embeddings for subsequent reasoning steps. This innovation eliminated the lossy projection from high-dimensional representations to discrete vocabulary distributions.

Diagram comparing Chain-of-Thought (CoT) with Chain of Continuous Thought (COCONUT). CoT uses discrete text tokens, while COCONUT uses continuous hidden states for reasoning.

Technically, COCONUT introduces a latent mode, delimited by special beginning-of-thought and end-of-thought tokens, in which the model’s last hidden state is fed back directly as the next input embedding instead of being decoded into a token; a training curriculum then gradually replaces explicit CoT steps with these continuous thoughts.

This preserves orders of magnitude more information. A 4096-dimensional vector, even under an aggressive 4-bit quantization scheme, holds 16,384 bits of information (4096 × 4). In stark contrast, a single discrete token from a typical 50k vocabulary represents just ~16 bits of information ($\log_2(50000) \approx 15.6$).
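As a quick sanity check on this arithmetic (treating each dimension as independently quantised, so the 16,384-bit figure is an upper bound on capacity rather than a precise information measure):

```python
import math

hidden_dim = 4096       # typical hidden size of a 7B-class model
bits_per_dim = 4        # aggressive 4-bit quantisation
vocab_size = 50_000     # typical BPE vocabulary

latent_capacity = hidden_dim * bits_per_dim     # 16,384 bits (upper bound)
token_capacity = math.log2(vocab_size)          # ~15.6 bits per discrete token

print(latent_capacity, round(token_capacity, 1))    # 16384 15.6
```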

Cheng and Van Durme (2024) introduced Compressed Chain-of-Thought (CCoT), a framework built around a dual-module architecture. A CCOT module (parameterized by φ) generates a sequence of dense “contemplation tokens,” which serve as compressed representations of an entire reasoning chain. A second DECODE module (parameterized by ψ) then uses these tokens to produce the final answer. This approach demonstrates that complex reasoning can be effectively summarized in continuous representations. Unlike methods that process reasoning steps in parallel, however, CCoT generates these contemplation tokens autoregressively, one after another.

Illustration comparing Chain of Thought (CoT) and Compressed Chain of Thought (CCoT). CoT uses a long sequence of text, while CCoT uses a short sequence of continuous embeddings.

Liu et al. (2024) proposed Hidden Chain-of-Thought (HCoT), training auxiliary models to generate compact thought representations that maintain semantic richness while drastically reducing computational overhead. Their method compresses each intermediate reasoning step into a special [CoT] token, interleaving these compressed thoughts with the generated content.

Examples of Hidden Chain-of-Thought (HCoT), where internal reasoning steps are shown as compressed thoughts with blue strikethrough text in response to user queries.

Token Assorted (Su et al., 2025) took a hybrid approach, using a VQ-VAE to encode early reasoning steps into latent codes while keeping later, critical steps in text. This model reduced the length of reasoning traces by an average of 17% while maintaining interpretability where needed.

Illustration of the Token Assorted hybrid approach. A sequence of text-based CoT tokens is partially compressed into shorter, discrete latent tokens.

Architectural Innovations (2025)

Recent work has focused on architectural modifications that natively support latent reasoning. Geiping et al. (2025) introduced Huginn, a recurrent framework enabling adaptive computation allocation through RNN-like iterative processing. The architecture consists of three blocks: a prelude that encodes the input, a shared recurrent core that can be iterated a variable number of times, and a coda that decodes the final latent state into output.

This design untied computation depth from layer count, allowing a 3.5B model to achieve 50B-model performance through approximately 32 recurrent iterations.

Graph of the Huginn model's performance showing improvement with more recurrent iterations. This demonstrates adaptive computation, where complex tasks benefit from more latent 'thinking' time.

Diagram of the Huginn recurrent architecture. It shows three main blocks: a Prelude for encoding, a shared recurrent core for processing, and a Coda for decoding.
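A minimal sketch of this prelude → recurrent core → coda pattern, assuming generic transformer blocks rather than Huginn's actual layer design, might look like this in PyTorch:

```python
import torch
import torch.nn as nn

class RecurrentDepthLM(nn.Module):
    """Toy recurrent-depth model: depth is chosen at inference time, not fixed by layer count."""
    def __init__(self, d_model=512, nhead=8):
        super().__init__()
        self.prelude = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)  # encode input
        self.core = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)     # shared, iterated block
        self.coda = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)     # map latent state to output space

    def forward(self, x, num_iterations: int = 8):
        h = self.prelude(x)
        state = torch.zeros_like(h)          # latent state refined on every pass
        for _ in range(num_iterations):      # more iterations = more latent "thinking"
            state = self.core(state + h)     # the core sees the encoded input at every step
        return self.coda(state)

# Harder inputs can simply be given more iterations of the same shared core:
model = RecurrentDepthLM()
x = torch.randn(1, 16, 512)
easy, hard = model(x, num_iterations=4), model(x, num_iterations=32)
```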

In “Enhancing Auto-regressive Chain-of-Thought through Loop-Aligned Reasoning,” Yu et al. (2025) introduce RELAY (REasoning through Loop Alignment iteratively), a two-stage framework designed to improve how auto-regressive models handle long reasoning tasks. The method bridges the gap between auto-regressive models, which often struggle with generating accurate, long Chain-of-Thought (CoT) sequences, and Looped Transformers, which have strong length generalization capabilities but limited versatility.

The RELAY framework first trains a Looped Transformer by aligning its internal loop iterations with the explicit reasoning steps of a CoT process. This allows the Looped Transformer to generate accurate, high-quality reasoning chains for problems that are more complex and longer than those in its original training data. These generated chains are then used as a new, high-quality dataset to fine-tune a standard auto-regressive model, significantly enhancing its performance on complex reasoning tasks that require generalization to longer problem lengths.

Visualization contrasting a standard auto-regressive CoT model with a Looped Transformer. As problem complexity increases, the CoT model's reasoning token sequence grows, while the looped model increases its number of internal loop iterations.

Chen et al. (2025) proposed the Inner Thinking Transformer (ITT), treating each transformer layer as a discrete reasoning step with adaptive token routing and residual refinement.

Diagram showing the Inner Thinking Transformer (ITT) concept. Each layer of the model is treated as a step of 'inner thinking' to improve results on difficult tasks without adding parameters.

Technical Deep Dive: Mechanisms of Continuous Reasoning

Core Architecture

Continuous latent reasoning fundamentally alters the information flow in transformer architectures. In traditional models, the transformation sequence follows:

Input Embeddings → Transformer Layers → Projection to Vocabulary → Sampling → Next Token

Continuous reasoning architectures bypass the projection bottleneck:

Input Embeddings → Transformer Layers → Direct Hidden State Reuse

The mathematical formulation involves modifying the standard transformer update. Instead of:

$$
\begin{aligned}
h_t &= \text{TransformerLayers}(\text{embed}(\text{token}_t)) \\
\text{token}_{t+1} &= \text{sample}(\text{softmax}(W_{\text{vocab}}\, h_t))
\end{aligned}
$$

Latent reasoning uses:

$$
\begin{aligned}
h_0 &= \text{TransformerLayers}(\text{embed}(\text{input})) \\
h_{i+1} &= \text{TransformerLayers}(\text{ProjectToEmbed}(h_i)) \\
\text{output} &= \text{decode}(h_N) \quad \text{after } N \text{ latent steps}
\end{aligned}
$$

Where ProjectToEmbed is typically a learned linear transformation that maps the hidden state back to the embedding dimension.
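In code, the difference is simply that the softmax-and-sample step is skipped and the hidden state is re-injected. Below is a minimal sketch using a small, randomly initialised GPT-2 backbone from the transformers library; the ProjectToEmbed linear layer, the step count, and the toy input are illustrative assumptions rather than any specific paper's implementation:

```python
import torch
import torch.nn as nn
from transformers import GPT2Config, GPT2Model

config = GPT2Config(n_embd=256, n_layer=4, n_head=4)
model = GPT2Model(config)
project_to_embed = nn.Linear(config.n_embd, config.n_embd)   # learned ProjectToEmbed

def latent_reason(inputs_embeds, num_latent_steps=4):
    """Append N continuous thoughts by recycling the last hidden state instead of sampling tokens."""
    for _ in range(num_latent_steps):
        hidden = model(inputs_embeds=inputs_embeds).last_hidden_state   # (batch, seq, d_model)
        thought = project_to_embed(hidden[:, -1:, :])                   # last hidden state -> next "embedding"
        inputs_embeds = torch.cat([inputs_embeds, thought], dim=1)      # no softmax, no vocabulary bottleneck
    return inputs_embeds                                                # decode(h_N) would run on this afterwards

prompt_embeds = model.wte(torch.tensor([[10, 11, 12]]))       # embed a toy 3-token input
enriched = latent_reason(prompt_embeds, num_latent_steps=4)   # shape (1, 3 + 4, 256)
```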

Training Methodologies

Building a model that genuinely “thinks in vectors” is less about inventing huge new architectures and more about guiding the network away from the crutch of language without wrecking its performance. Current practice has converged on the recipe below: five core ingredients plus a set of generic efficiency add-ons.

1 · Curriculum-guided latentisation

Training still starts with ordinary chain-of-thought (CoT), but every epoch hides a growing fraction of those intermediate words and asks the model to run directly on their hidden-state vectors.

Hidden tokens receive no direct loss; visible tokens and the final answer are trained with the usual cross-entropy.
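A minimal sketch of the loss side of this curriculum, assuming the usual convention that label positions set to -100 are ignored by cross-entropy; on the input side, the hidden positions would be fed recycled hidden states as in the latent loop sketch above, and the schedule here is purely illustrative:

```python
import torch

def latentise_labels(labels, cot_positions, epoch, total_epochs):
    """Hide a growing fraction of CoT tokens from the loss as training progresses."""
    hide_fraction = min(1.0, epoch / total_epochs)       # e.g. a linear schedule
    num_hidden = int(hide_fraction * len(cot_positions))
    hidden = cot_positions[:num_hidden]                  # hide the earliest reasoning steps first
    labels = labels.clone()
    labels[hidden] = -100                                # cross-entropy ignores these positions;
    return labels                                        # visible tokens and the answer keep their loss

labels = torch.tensor([7, 7, 7, 3, 3, 3, 3, 9, 9])       # toy: prompt ids, CoT ids, answer ids
cot_positions = torch.tensor([3, 4, 5, 6])               # indices of the chain-of-thought span
print(latentise_labels(labels, cot_positions, epoch=2, total_epochs=4))
```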

2 · Hidden-state distillation and self-distillation

Once the network can survive missing words, the next step is to make its latent trajectory mimic a teacher that still reasons out loud.
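In practice this is typically an auxiliary regression term that pulls the student's latent states toward the teacher's hidden states at aligned reasoning steps, added to the ordinary answer loss. A minimal sketch, with the alignment, shapes, and loss weight all assumed for illustration:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_latents, teacher_hiddens, answer_logits, answer_targets, alpha=0.5):
    """Cross-entropy on the answer plus an MSE term aligning latent states with teacher hidden states."""
    ce = F.cross_entropy(answer_logits, answer_targets)          # supervise only the visible answer
    mse = F.mse_loss(student_latents, teacher_hiddens.detach())  # teacher provides targets, no gradient
    return ce + alpha * mse

# Toy shapes: 4 aligned reasoning steps in a 256-d hidden space, 8-way answer classification.
student_latents = torch.randn(4, 256, requires_grad=True)
teacher_hiddens = torch.randn(4, 256)
answer_logits = torch.randn(1, 8, requires_grad=True)
loss = distillation_loss(student_latents, teacher_hiddens, answer_logits, torch.tensor([3]))
loss.backward()
```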

3 · Compact latent tokens

ICoT-style distillation still leaves “one vector per hidden step”. The next advance packs many steps into a handful of learned embeddings.
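One generic way to realise this packing is a small set of learned query vectors that cross-attend over the full reasoning trace and emit a fixed number of compressed embeddings. The sketch below is in that spirit; it is not CCoT's or HCoT's exact architecture (CCoT, for instance, produces its contemplation tokens autoregressively):

```python
import torch
import torch.nn as nn

class LatentCompressor(nn.Module):
    """Compress a variable-length reasoning trace into k learned 'contemplation' embeddings."""
    def __init__(self, d_model=256, num_latent_tokens=8, nhead=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_latent_tokens, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, cot_hidden_states):                    # (batch, cot_len, d_model)
        batch = cot_hidden_states.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)  # (batch, k, d_model)
        compressed, _ = self.attn(q, cot_hidden_states, cot_hidden_states)
        return compressed                                     # (batch, k, d_model), fed to the decoder

compressor = LatentCompressor()
trace = torch.randn(2, 120, 256)        # 120 explicit reasoning-step hidden states
print(compressor(trace).shape)          # torch.Size([2, 8, 256])
```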

4 · Recurrent or loop-aligned supervision

Some architectures keep sequence depth fixed and let the model loop through shared blocks as many times as it needs.

5 · Hybrid latent reinforcement learning

Supervised data eventually tops out, so teams switch to reinforcement learning that rewards correctness and charges for extra computation.

Hybrid Reasoning Policy Optimisation (HRPO) is the flagship:

  1. At every generation step the network mixes two embeddings—a normal token embedding and a transformed copy of the previous hidden state—weighted by an action variable gamma.
  2. The reward is 1 for a correct final answer, minus a small fee per visible token and per latent iteration (instances where gamma ≠ 0).
  3. HRPO is trained with Group Relative Policy Optimisation (GRPO), which uses the mean reward of a mini-batch of roll-outs as its baseline instead of a learned critic. GRPO needs about half the memory of PPO and converges just as fast.

Setting the step penalty too low makes the model talk verbosely; setting it too high drives it into silent, brittle reasoning. Authors report that a short grid search over a few hundred prompts is enough to find the sweet-spot penalty.
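A toy rendering of the two core ingredients, the gamma-gated embedding mix and the penalised reward, might look as follows; the gate parameterisation and penalty values are illustrative assumptions, not HRPO's exact formulation:

```python
import torch
import torch.nn as nn

d_model = 256
to_embed = nn.Linear(d_model, d_model)   # transform applied to the previous hidden state

def mixed_input(token_embedding, prev_hidden, gamma):
    """Blend a discrete token embedding with the recycled hidden state, weighted by gamma in [0, 1]."""
    return (1.0 - gamma) * token_embedding + gamma * to_embed(prev_hidden)

def hybrid_reward(correct, num_visible_tokens, num_latent_steps,
                  token_cost=0.001, latent_cost=0.002):
    """+1 for a correct final answer, minus small fees per visible token and per latent iteration."""
    return float(correct) - token_cost * num_visible_tokens - latent_cost * num_latent_steps

step_embed = mixed_input(torch.randn(1, d_model), torch.randn(1, d_model), gamma=0.7)
print(hybrid_reward(correct=True, num_visible_tokens=40, num_latent_steps=12))   # 0.936
```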

6 · Generic efficiency add-ons

Auxiliary techniques first devised for token-level RL carry over cleanly to the latent regime; the most prominent are adaptive length penalties (ALP, Xiang et al. (2025)) and adaptive-curriculum reinforcement finetuning (AdaRFT, Shi et al. (2025)).

A modern training run at a glance

  1. Supervised warm-up on full CoT.
  2. Curriculum latentisation for five to ten epochs (COCONUT or Stepwise Internalisation).
  3. Hidden-state distillation, optionally followed by compact latent token training (CCoT or HCoT).
  4. Architecture-aligned pre-training if you use loops (Huginn, RELAY, ITT).
  5. Hybrid RL fine-tuning with HRPO + GRPO, optionally adding ALP and AdaRFT.

Representational Dynamics

Analysis of COCONUT’s latent reasoning reveals a process more complex than a simple linear chain. Instead of committing to a single path, the continuous thoughts can be interpreted as a latent search tree that explores multiple potential next steps simultaneously.

Diagram showing the COCONUT model's latent reasoning process. The model explores multiple potential next steps simultaneously, represented as a search tree.

Implementation Challenges

Large‑scale latent reasoning is still in its infancy, and every experimental system to date has surfaced at least one blocking issue that does not appear in ordinary Chain‑of‑Thought models. The literature converges on six broad pain‑points:

1. A curriculum is mandatory—otherwise the model never “gets” latent reasoning

COCONUT’s ablation shows that training directly on (question, answer) pairs with hidden‑state recycling performs worse than a no‑CoT baseline. Only the staged schedule that first teaches the model to reason in language and then incrementally replaces early steps with vectors unlocks its gains. The GSM8K accuracy crash from 34.1 → 14.4% in the “w/o curriculum” run makes the point starkly clear. Designing such curricula—and automatically tuning them for new domains—remains an open research problem.

2. Latent loops break GPU parallelism

Because every continuous thought depends on the previous hidden state, training (and inference) cost scales with the number of latent steps, not the batch size. COCONUT explicitly notes that it must execute n + 1 forward passes for n thoughts, and that “the sequential nature of the multiple forward passes poses challenges for parallelism.” This serial dependency throttles throughput on modern GPU clusters built for large‑batch matrix multiplies.

3. KV‑cache memory becomes the new bottleneck

Long latent traces do not expand the token sequence, but they do enlarge the key/value cache: every extra iteration stores another full set of key and value vectors. Recent work on SQuat (Wang et al., 2025) shows that even with aggressive INT‑2 quantisation the cache can dominate peak GPU memory when models “think” for dozens of steps. Compression helps but introduces accuracy/latency trade‑offs that are not yet well understood.
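A back-of-envelope calculation shows why the cache dominates: for an illustrative 32-layer model with 32 KV heads of dimension 128, every additional cached position costs roughly half a mebibyte at 16-bit precision, and latent iterations add such positions without emitting any visible text:

```python
def kv_cache_bytes_per_position(num_layers=32, num_kv_heads=32, head_dim=128, bytes_per_value=2):
    """Bytes of key/value cache added per extra position (factor 2 = one key plus one value tensor)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

per_step = kv_cache_bytes_per_position()                       # 524,288 bytes ≈ 0.5 MiB at FP16
print(per_step, per_step * 64 / 2**20, "MiB for 64 latent steps per sequence")
```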

4. Knowing when to stop is still heuristic

During inference a latent‑reasoning model must decide when to emit an <eot> and return to language space. Current systems either pad to a fixed depth or train an ad‑hoc binary classifier over hidden states, and COCONUT reports that both heuristics work “comparably well.” Huginn trains a learned halting classifier (§4.1, p 5) which shows promise but still requires careful tuning. Neither approach adapts gracefully to problem difficulty, and mis‑predictions manifest as truncated explanations or runaway loops.
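The ad-hoc classifier variant usually amounts to a small binary head probing the current hidden state for a "stop thinking" signal. A minimal sketch, with the probe architecture, threshold, and step cap all assumed for illustration:

```python
import torch
import torch.nn as nn

class HaltingHead(nn.Module):
    """Binary probe over the latent state: should the model emit <eot> and return to language?"""
    def __init__(self, d_model=256):
        super().__init__()
        self.probe = nn.Sequential(nn.Linear(d_model, 64), nn.GELU(), nn.Linear(64, 1))

    def should_stop(self, hidden_state, threshold=0.5):
        return torch.sigmoid(self.probe(hidden_state)) > threshold

halt = HaltingHead()
state = torch.randn(1, 256)
for step in range(64):                                    # hard cap guards against runaway loops
    if halt.should_stop(state):
        break
    state = state + 0.01 * torch.randn_like(state)        # stand-in for one latent reasoning step
```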

5. Deep recurrent stacks risk optimisation instability

Recurrent‑depth architectures such as Huginn push performance by unrolling a shared core 30+ times, but the authors note that gradient signals weaken as depth grows, requiring careful learning‑rate scaling and residual gating to avoid divergence. Balancing depth‑on‑demand with stable training dynamics is still an active area of study.

6. Tooling for debugging and evaluation is immature

A survey of efficient reasoning methods highlights a “complexity of latent‑space implementation” gap: without textual traces, it is hard to verify correctness, attribute errors, or measure reasoning efficiency. New metrics (e.g., embedding‑consistency scores) and visual probes are being proposed, but no standard evaluation suite exists yet.


The Interpretability Crisis

The “Neuralese” Problem

Recent mechanistic interpretability research (Lindsey et al., 2025) reveals models performing sophisticated internal reasoning through complex feature interactions, suggesting the development of what we might call “Neuralese”: an emergent internal representational language that develops within AI systems operating in continuous latent spaces, characterized by high-dimensional vector patterns that encode semantic and logical relationships in ways that are fundamentally untranslatable to human linguistic concepts. Unlike discrete tokens, which at least map to a vocabulary, continuous thoughts exist in a 4096-dimensional space with no natural interpretation.

Key challenges include:

Emerging Interpretability Techniques

Researchers are developing novel approaches:

Diagram showing the Zhang and Viteri (2025) geometric analysis of latent CoT vectors. It shows how a reasoning vector can be added to the input to induce reasoning behavior.
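The basic operation behind this kind of analysis is activation steering: add a precomputed "reasoning direction" to the hidden states at a chosen layer and check whether step-by-step behaviour emerges. The sketch below is a generic version of that idea, not Zhang and Viteri's specific pipeline; the layer choice, strength, and vector construction are assumptions:

```python
import torch

def steer_hidden_states(hidden_states, reasoning_vector, strength=4.0):
    """Add a scaled reasoning direction to every position's hidden state at one layer."""
    direction = reasoning_vector / reasoning_vector.norm()   # unit-norm steering direction
    return hidden_states + strength * direction

# The vector is typically the mean difference between activations collected on
# CoT-style prompts and on direct-answer prompts at the same layer.
hidden_states = torch.randn(1, 12, 256)      # (batch, seq, d_model) at the chosen layer
reasoning_vector = torch.randn(256)
steered = steer_hidden_states(hidden_states, reasoning_vector)
```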

Despite these efforts, full interpretability remains elusive. We can detect that reasoning is happening and sometimes what kind, but not the detailed how.


The Alignment Challenge

The interpretability deficit poses serious concerns:

Proposed solutions include:


Current Applications and Performance

Mathematical Reasoning (GSM8k)

On the GSM8k math reasoning dataset, the performance of continuous reasoning is more nuanced.

Logical Reasoning (ProntoQA & ProsQA)

COCONUT shows its most dramatic improvements on logical reasoning tasks that require planning and searching.

Multimodal Integration

Heima (Shen et al., 2025) demonstrated that “thinking tokens” excel at multimodal reasoning. The framework introduces several key advantages:

Diagram showing the Heima model's multimodal reasoning process. It shows how the model can reason about images and text in a continuous latent space.

Code Generation and Formal Reasoning

Early experiments show promise:


Why This Represents the Next Breakthrough

Computational Efficiency Revolution

The token overhead of linguistic reasoning becomes prohibitive as models require deeper thought:

Unlocking Non-Linguistic Intelligence

Continuous reasoning enables fundamentally new capabilities:

Scaling Law Implications

Recent findings from Ye et al. (2025) provide new insights into how model architecture affects reasoning:

Integration with Reinforcement Learning

Latent reasoning creates a natural bridge to RL:


Conclusion: The Best Path Forward

Building on every technique reviewed above, the most coherent next step is a single, integrated architecture that lets a model decide—token by token—whether to “think in words” or “think in vectors” while scaling its depth on demand. I propose the GRAIL-Transformer (Gated Recurrent Architecture for Interpretable Latent reasoning), a design that deliberately stitches together the strongest ideas from COCONUT, Huginn, Quiet‑STaR, GRPO and the latest interpretability work so that it meshes naturally with the entire history traced in this post.

1. GRAIL‑Transformer—how it works

2. Training pipeline

  1. Supervised warm‑up with standard chain‑of‑thought data; $\gamma_t = 0$, so nothing changes relative to today’s best practice.
  2. Hybrid curriculum: each epoch raises the target mean of $\gamma_t$ and shortens the explicit CoT, teaching the model to compress more reasoning into hidden space without a sudden jump in difficulty.
  3. Group Relative Policy Optimisation (GRPO) fine‑tuning: an RL objective rewards final accuracy, penalises unnecessary inner‑loop steps, and lightly discourages verbose output. GRPO updates come from simple moment matching, so one extra forward pass per sample is enough, vastly cheaper than PPO in the many‑step latent regime (see the advantage sketch after this list).
  4. Difficulty‑aware refinement: hard questions are replayed with a higher weight and a slightly steeper penalty on step count; this focuses the network’s latent search tree where it matters.
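Step 3 of this pipeline hinges on GRPO's group-relative baseline: several roll-outs are sampled per prompt and each reward is normalised against the statistics of its own group rather than a learned critic. A minimal sketch of that advantage computation, with illustrative reward values:

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalise each roll-out's reward against its own group of samples."""
    mean, std = rewards.mean(), rewards.std()
    return (rewards - mean) / (std + eps)     # no value network needed as a baseline

# Eight roll-outs for the same prompt: ~1 for a correct answer minus small step penalties,
# slightly negative for incorrect answers that still paid token/latent costs.
rewards = torch.tensor([0.94, -0.02, 0.90, -0.05, 0.96, -0.04, 0.88, 0.93])
advantages = group_relative_advantages(rewards)
# Correct roll-outs receive positive advantages, incorrect ones negative; the policy
# gradient then up-weights the token and latent-step choices of the former.
```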

3. Interpretability and safety hooks

4. Why this directly tackles the open problems

I have not implemented GRAIL‑Transformer; at present it is a theoretical synthesis drawn from the most compelling recent findings. Nonetheless, every component already exists in isolation in the literature, and nothing in the design violates current hardware constraints.

Reasoning in latent space therefore looks poised to become the next genuine breakthrough: shedding the bandwidth limits of language, enabling economical yet powerful depth, and—if we build in the right interpretability valves—delivering models whose thought processes are both faster and safer than anything we have today.


References

Chen, X., Wang, L., & Li, Y. (2025). Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking. arXiv preprint arXiv:2502.13842. https://arxiv.org/abs/2502.13842

Cheng, P., & Van Durme, B. (2024). Compressed Chain-of-Thought: Efficient reasoning through dense representations. arXiv preprint arXiv:2412.13171. https://arxiv.org/abs/2412.13171

Chen, X., Zhao, A., Xia, H., Lu, X., Wang, H., et al. (2025). Reasoning Beyond Language: A Comprehensive Survey on Latent Chain-of-Thought Reasoning. arXiv preprint arXiv:2505.16782. https://arxiv.org/abs/2505.16782

Deng, Y., Choi, Y., & Shieber, S. (2024). From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step. arXiv preprint arXiv:2405.14838. https://arxiv.org/abs/2405.14838

Deng, Y., Prasad, K., Fernandez, R., Smolensky, P., Chaudhary, V., & Shieber, S. (2023). Implicit Chain-of-Thought reasoning via knowledge distillation. arXiv preprint arXiv:2311.01460. https://arxiv.org/abs/2311.01460

Geiping, J., Fowl, L., Somepalli, G., Goldblum, M., Moeller, M., Goldstein, T., & Jacobs, T. (2025). Scaling up test-time compute with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171. https://arxiv.org/abs/2502.05171

Goyal, A., Bengio, Y., Weston, J., & Ballas, N. (2023). Think before you speak: Training language models with pause tokens. arXiv preprint arXiv:2310.02226. https://arxiv.org/abs/2310.02226

Hao, S., Gu, Y., Ma, H., Hong, J., Wang, Z., Wang, D., & Hu, Z. (2024). Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769. https://arxiv.org/abs/2412.06769

Lindsey, R., Kenton, Z., Everitt, T., Wattenberg, M., Mirhoseini, A., Leike, J., & Amodei, D. (2025). Circuit tracing: Revealing computational graphs in language models. Transformer Circuits Thread. https://transformer-circuits.pub/2025/attribution-graphs/biology.html

Liu, J., Chen, X., Wang, H., Zhang, L., & Li, M. (2024). Expediting and elevating large language model reasoning via hidden chain-of-thought decoding. arXiv preprint arXiv:2409.08561. https://arxiv.org/abs/2409.08561

Pfau, J., Merrill, W., & Bowman, S. R. (2024). Let’s think dot by dot: Hidden computation in transformer language models. arXiv preprint arXiv:2404.15758. https://arxiv.org/abs/2404.15758

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300. https://arxiv.org/abs/2402.03300

Shen, H., Wu, Y., Chen, K., Wang, J., & Zhang, Q. (2025). Efficient reasoning with hidden thinking. arXiv preprint arXiv:2501.19201. https://arxiv.org/abs/2501.19201

Shen, Z., Yan, H., Zhang, L., Hu, Z., Du, Y., & He, Y. (2025). CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation. arXiv preprint arXiv:2502.21074. https://arxiv.org/abs/2502.21074

Shi, T., Wu, Y., Song, L., Zhou, T., & Zhao, J. (2025). Efficient Reinforcement Finetuning via Adaptive Curriculum Learning. arXiv preprint arXiv:2504.05520. https://arxiv.org/abs/2504.05520

Su, Y., Liu, T., Wang, D., Chen, H., & Zhou, J. (2025). Token Assorted: Mixing latent and text tokens for improved language model reasoning. arXiv preprint arXiv:2502.03275. https://arxiv.org/abs/2502.03275

Wang, H., Han, L., Xu, K., & Srivastava, A. (2025). SQuat: Subspace-orthogonal KV Cache Quantization. arXiv preprint arXiv:2503.24358. https://arxiv.org/abs/2503.24358

Xiang, V., Blagden, C., Rafailov, R., Lile, N., Truong, S., Finn, C., & Haber, N. (2025). Just Enough Thinking: Efficient Reasoning with Adaptive Length Penalties Reinforcement Learning. arXiv preprint arXiv:2506.05256. https://arxiv.org/abs/2506.05256

Yang, K., Klein, D., Pang, N., & Sachan, M. (2024). Do large language models latently perform multi-hop reasoning? arXiv preprint arXiv:2402.16837. https://arxiv.org/abs/2402.16837

Ye, H., Zhang, C., Wang, X., Liu, Y., & Sun, M. (2025). Scaling laws for reasoning: The importance of model depth. arXiv preprint arXiv:2407.20311. https://arxiv.org/abs/2407.20311

Yu, D., Wang, S., Chen, L., Zhang, M., & Li, X. (2025). Enhancing auto-regressive Chain-of-Thought through loop-aligned reasoning. arXiv preprint arXiv:2502.08482. https://arxiv.org/abs/2502.08482

Zelikman, E., Wu, Y., Mu, J., & Goodman, N. D. (2022). STaR: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35, 15476–15488. https://arxiv.org/abs/2203.14465

Zelikman, E., Harik, G., Shao, Y., Jayasiri, V., Haber, N., & Goodman, N. D. (2024). Quiet-STaR: Language models can teach themselves to think before speaking. arXiv preprint arXiv:2403.09629. https://arxiv.org/abs/2403.09629

Yue, Z., Jin, B., Zeng, H., Zhuang, H., Qin, Z., Yoon, J., Shang, L., Han, J., & Wang, D. (2025). Hybrid Latent Reasoning via Reinforcement Learning. arXiv preprint arXiv:2505.18454. https://arxiv.org/abs/2505.18454

Zhang, T., & Viteri, M. (2025). Uncovering latent Chain-of-Thought vectors in language models. arXiv preprint arXiv:2409.14026. https://arxiv.org/abs/2409.14026