
Thinking Without Words

An F-22 Raptor fighter jet

Photo by Terence Burke

Introduction

Recent “reasoning” models trained with reinforcement learning can get better results on tasks with verifiable answers (math, code, logic), often by generating long Chain-of-Thought (CoT) traces during inference. Those extra tokens help, but they also make reasoning expensive: the model has to express intermediate steps as text.

CoT is readable, but it’s also a bottleneck: every intermediate step is serialized into tokens. A line of work explores continuous latent reasoning, where intermediate reasoning happens in high-dimensional vectors (hidden states) instead of discrete language tokens.

This approach changes where “reasoning” happens (latent space instead of explicit tokens), but it also makes interpretability and evaluation harder because the intermediate steps are not directly readable.

Illustration of the core idea of latent reasoning, where a model's internal reasoning process is represented as a high-dimensional vector space with no natural interpretation.

Image inspired by the one from the paper Reasoning Beyond Language: A Comprehensive Survey on Latent Chain-of-Thought Reasoning.


Historical development: toward latent reasoning

Early foundations (2022-2023)

Zelikman et al. (2022) introduced the Self-Taught Reasoner (STaR), an iterative self-training loop: the model generates rationales, keeps the ones that lead to a correct answer, and fine-tunes on them. It’s still explicit, token-based chain-of-thought, but it shows how a model can bootstrap better reasoning from its own successful traces.

Diagram illustrating the Self-Taught Reasoner (STaR) methodology, where a model iteratively improves by learning from its own generated rationales.

Deng et al. (2023) introduced Implicit Chain-of-Thought (ICoT), which uses distillation to train a student model to match the hidden-state trajectory of a larger teacher as the teacher generates an explicit chain of thought. The goal is to push the teacher’s step-by-step reasoning into the student’s layers. This can reduce inference time, but the paper reports a drop in accuracy versus explicit CoT, especially on harder tasks.

Diagram explaining Implicit Chain-of-Thought (ICoT), where a student model learns to reason implicitly by mimicking the hidden-state trajectory of a larger teacher model.

Interpretability work also asked whether models do multi-hop reasoning latently. Yang et al. (2024) found moderate evidence for latent hops, with higher rates for some task types. Their probes suggest intermediate structure can exist in hidden states even when the model answers directly.

Illustration of a latent multi-hop reasoning probe. A diagram shows how changing an input prompt is used to measure a model's internal recall for multi-step inferences.

The discrete token era (2024)

Another thread experimented with specialized discrete tokens to represent reasoning states. Goyal et al. (2023) introduced “pause tokens” that let models perform additional internal computation before generating outputs. These tokens are inserted in a fixed, non-adaptive sequence and act as computational placeholders: delay the next prediction, get extra compute, sometimes improve accuracy on logic-heavy tasks.

Comparison diagram: standard inference versus 'pause-inference'. The latter uses pause tokens to enable extra computation before output, illustrated with new computational paths.

In their paper, “Let’s Think Dot by Dot,” Pfau et al. (2024) investigate whether the performance gains from chain-of-thought are due to interpretable reasoning or simply the greater computation that additional tokens allow. They demonstrate that for certain algorithmic tasks, transformers can use meaningless “filler tokens” (e.g., ’…’) to perform complex, hidden computations, achieving high accuracy on problems they could not solve when forced to respond immediately. For example, on a sufficiently complex 3SUM task, models using filler tokens reached 100% accuracy, whereas models without them were only 66% accurate. This suggests the critical bottleneck is the computational limitation of a single forward pass, not the semantic content of the tokens. The sequence of filler tokens provides the model with a “scratchpad” for multi-step reasoning, directly challenging the assumption that a model’s intermediate steps must be linguistically meaningful to be computationally effective.

Comparison of three reasoning approaches: chain-of-thought (explicit reasoning), filler tokens (dots for computation), and immediate answer, showing filler tokens achieve similar performance to CoT.

Zelikman et al. (2024) developed Quiet-STaR, employing learnable tokens to mark boundaries of internal rationales. This approach enabled language models to infer unstated reasoning steps, improving generalization without task-specific fine-tuning. The system generated token-level rationales internally (one hidden “explanation” per token produced) without outputting them, essentially “thinking before speaking” in a fine-grained way.

Diagram of the Quiet-STaR algorithm, showing its 'think, talk, learn' phases. The model generates internal thoughts for each token and uses reinforcement learning to improve predictions.

Continuous latent reasoning (2024-2025)

A representative example is Hao et al. (2024) and their COCONUT (Chain of Continuous Thought) architecture. Instead of sampling a discrete token at each reasoning step, COCONUT feeds the model’s last hidden states as the next-step input embeddings, delaying the projection to the vocabulary until the final answer.

Diagram comparing Chain-of-Thought (CoT) with Chain of Continuous Thought (COCONUT). CoT uses discrete text tokens, while COCONUT uses continuous hidden states for reasoning.

Why might reasoning directly in hidden states be attractive? One way to see it is to compare the information capacity of an activation vector with that of a discrete token.

A 4,096-dimensional activation vector, even after aggressive 4-bit quantization, contains 16,384 raw bits, far more than a single discrete token, which carries at most log₂(50,000) ≈ 16 bits. Comparing raw bits is misleading, though, because the two representations differ greatly in information density. A token drawn from a 50k-entry Byte-Pair Encoding (BPE) vocabulary packs information densely in principle, but redundancy in natural language means each token carries far fewer effective bits than its raw 16: LLaMA-2-70B reaches a perplexity of 3.32 on WikiText-2, which corresponds to only about log₂(3.32) ≈ 1.73 bits of meaningful information per token (Chen et al., 2025).

Activation vectors, on the other hand, are large and redundant by design. Recent compression methods like Multi-Head Latent Attention (MLA) from DeepSeek-V3 (Liu et al., 2024) show these vectors can be compressed by a large factor while keeping quality. This implies each activation value may effectively contain around 0.11 bits, translating to roughly 460 meaningful bits for the entire 4,096-dimensional vector.

Even with redundancy, activation vectors still carry much more usable information (approximately 460 bits vs. 1.73 bits per token). This suggests latent reasoning has more representational bandwidth than reasoning purely at the token level.
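To make the arithmetic explicit, here is a small back-of-envelope script. Every constant in it (vector width, quantization bit-width, vocabulary size, perplexity, effective bits per activation value) is either quoted from the figures above or an illustrative assumption, not a measurement.

```python
import math

# Back-of-envelope comparison of "raw" vs. "effective" information content.
d_model = 4096            # activation vector dimension
bits_per_value = 4        # aggressive 4-bit quantization
vocab_size = 50_000       # typical BPE vocabulary size
perplexity = 3.32         # LLaMA-2-70B on WikiText-2 (Chen et al., 2025)
eff_bits_per_value = 0.11 # rough estimate implied by MLA-style compression (assumption)

raw_bits_vector = d_model * bits_per_value        # 16,384 raw bits per vector
raw_bits_token = math.log2(vocab_size)            # ~15.6 raw bits per token
eff_bits_token = math.log2(perplexity)            # ~1.73 effective bits per token
eff_bits_vector = d_model * eff_bits_per_value    # ~450 effective bits per vector

print(f"raw:       vector {raw_bits_vector} bits vs. token {raw_bits_token:.1f} bits")
print(f"effective: vector ~{eff_bits_vector:.0f} bits vs. token ~{eff_bits_token:.2f} bits")
```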

Cheng and Van Durme (2024) introduced Compressed Chain-of-Thought (CCoT), a framework built around a dual-module architecture. A CCoT module (parameterized by φ) generates a sequence of dense “contemplation tokens” that serve as a compressed representation of an entire reasoning chain; a DECODE module (parameterized by ψ) then uses these tokens to produce the final answer. The approach shows that complex reasoning can be summarized effectively in continuous representations. Note, however, that the contemplation tokens are generated autoregressively, one after another, rather than in parallel.

Illustration comparing Chain of Thought (CoT) and Compressed Chain of Thought (CCoT). CoT uses a long sequence of text, while CCoT uses a short sequence of continuous embeddings.

Liu et al. (2024) proposed Hidden Chain-of-Thought (HCoT), training auxiliary models to generate compact thought representations that maintain semantic richness while drastically reducing computational overhead. Their method compresses each intermediate reasoning step into a special [CoT] token, interleaving these compressed thoughts with the generated content.

Examples of Hidden Chain-of-Thought (HCoT), where internal reasoning steps are shown as compressed thoughts with blue strikethrough text in response to user queries.

Token Assorted (Su et al. (2025)) took a hybrid approach, using a VQ-VAE to encode early reasoning steps into latent codes while keeping later, critical steps in text. This model reduced the length of reasoning traces by an average of 17% while maintaining interpretability where needed.

Illustration of the Token Assorted hybrid approach. A sequence of text-based CoT tokens is partially compressed into shorter, discrete latent tokens.

Architectural innovations (2025)

Recent work has focused on architectural modifications that natively support latent reasoning. Geiping et al. (2025) introduced Huginn, a recurrent framework that allocates computation adaptively through RNN-like iterative processing. The architecture consists of a prelude that encodes the input, a shared recurrent core that can be iterated a variable number of times, and a coda that decodes the final hidden state.

This design unties computation depth from layer count: unrolled for roughly 32 recurrent iterations, the 3.5B-parameter model reaches a test-time compute budget comparable to a 50B-parameter model and improves markedly on reasoning benchmarks.

Graph of the Huginn model's performance showing improvement with more recurrent iterations. This demonstrates adaptive computation, where complex tasks benefit from more latent 'thinking' time. Diagram of the Huginn recurrent architecture. It shows three main blocks: a Prelude for encoding, a shared recurrent core for processing, and a Coda for decoding.

In “Enhancing Auto-regressive Chain-of-Thought through Loop-Aligned Reasoning,” Yu et al. (2025) introduce RELAY (REasoning through Loop Alignment iteratively), a two-stage framework designed to improve how auto-regressive models handle long reasoning tasks. The method bridges the gap between auto-regressive models, which often struggle with generating accurate, long Chain-of-Thought (CoT) sequences, and Looped Transformers, which have strong length generalization capabilities but limited versatility.

The RELAY framework first trains a Looped Transformer by aligning its internal loop iterations with the explicit reasoning steps of a CoT process. This allows the Looped Transformer to generate accurate, high-quality reasoning chains for problems that are more complex and longer than those in its original training data. These generated chains are then used as a new, high-quality dataset to fine-tune a standard auto-regressive model, significantly enhancing its performance on complex reasoning tasks that require generalization to longer problem lengths.

Visualization contrasting a standard auto-regressive CoT model with a Looped Transformer. As problem complexity increases, the CoT model's reasoning token sequence grows, while the looped model increases its number of internal loop iterations.

Chen et al. (2025) proposed the Inner Thinking Transformer (ITT), treating each transformer layer as a discrete reasoning step with adaptive token routing and residual refinement.

Diagram showing the Inner Thinking Transformer (ITT) concept. Each layer of the model is treated as a step of 'inner thinking' to improve results on difficult tasks without adding parameters.

Mechanisms of continuous latent reasoning

Core Architecture

Continuous latent reasoning fundamentally alters the information flow in transformer architectures. In traditional models, the transformation sequence follows:

Input Embeddings → Transformer Layers → Projection to Vocabulary → Sampling → Next Token

Continuous reasoning architectures bypass the projection bottleneck:

Input Embeddings → Transformer Layers → Direct Hidden State Reuse

The mathematical formulation involves modifying the standard transformer update. Instead of:

$$
\begin{aligned}
h_t &= \text{TransformerLayers}(\text{embed}(\text{token}_t)) \\
\text{token}_{t+1} &= \text{sample}(\text{softmax}(W_{\text{vocab}}\, h_t))
\end{aligned}
$$

Latent reasoning uses:

$$
\begin{aligned}
h_0 &= \text{TransformerLayers}(\text{embed}(\text{input})) \\
h_{i+1} &= \text{TransformerLayers}(\text{ProjectToEmbed}(h_i)) \\
\text{output} &= \text{decode}(h_N) \quad \text{after } N \text{ latent steps}
\end{aligned}
$$

Where ProjectToEmbed is typically a learned linear transformation that maps the hidden state back to the embedding dimension.
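To make the loop concrete, here is a minimal sketch in PyTorch. The class name, layer sizes, use of a plain `TransformerEncoder`, and the fixed number of latent steps are all illustrative choices, not COCONUT's actual implementation; the point is only to show hidden states being recycled as inputs instead of being projected to the vocabulary at every step.

```python
import torch
import torch.nn as nn

class TinyLatentReasoner(nn.Module):
    """Minimal sketch of the latent-reasoning loop: hidden states are fed back
    as input embeddings instead of being sampled as discrete tokens."""

    def __init__(self, vocab_size=1000, d_model=64, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.layers = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.project_to_embed = nn.Linear(d_model, d_model)  # ProjectToEmbed
        self.to_vocab = nn.Linear(d_model, vocab_size)       # W_vocab

    def forward(self, input_ids, n_latent_steps=4):
        x = self.embed(input_ids)                  # [batch, seq, d_model]
        h = self.layers(x)                         # h_0
        for _ in range(n_latent_steps):
            # Reuse the last position's hidden state as the next "thought" input,
            # skipping the softmax/sampling bottleneck entirely.
            thought = self.project_to_embed(h[:, -1:, :])
            x = torch.cat([x, thought], dim=1)
            h = self.layers(x)
        return self.to_vocab(h[:, -1, :])          # decode only the final answer logits

logits = TinyLatentReasoner()(torch.randint(0, 1000, (2, 5)))
print(logits.shape)  # torch.Size([2, 1000])
```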

Training Methodologies

Building a model that genuinely “thinks in vectors” is less about inventing huge new architectures and more about guiding the network away from the crutch of language without wrecking its performance. Current practice has converged on the following five-part recipe, plus a handful of generic efficiency add-ons.

1 · Curriculum-guided latentisation

Training still starts with ordinary chain-of-thought (CoT), but every epoch hides a growing fraction of those intermediate words and asks the model to run directly on their hidden-state vectors.

Hidden tokens receive no direct loss; visible tokens and the final answer are trained with the usual cross-entropy.
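A minimal sketch of how such a schedule might be wired is below, assuming a pre-tokenized example whose chain of thought is already split into steps. The schedule (one additional hidden step per epoch) and the helper name are placeholders of mine, not the exact COCONUT or Stepwise Internalisation recipe.

```python
def latentize(question_ids, cot_steps, answer_ids, epoch, steps_per_stage=1):
    """Replace the first k CoT steps with latent placeholders as training progresses.

    Returns (input_ids, n_latent_steps, loss_mask): latent positions get no
    cross-entropy loss; the visible CoT tokens and the answer are trained as usual.
    """
    k = min(epoch * steps_per_stage, len(cot_steps))          # how many steps to hide
    visible = [tok for step in cot_steps[k:] for tok in step] # remaining textual steps
    input_ids = question_ids + visible + answer_ids
    loss_mask = [0] * len(question_ids) + [1] * len(visible) + [1] * len(answer_ids)
    return input_ids, k, loss_mask

# Example: by epoch 2, the first two reasoning steps are run as latent steps only.
ids, n_latent, mask = latentize([1, 2, 3], [[10, 11], [12], [13, 14]], [99], epoch=2)
print(ids, n_latent, mask)   # [1, 2, 3, 13, 14, 99] 2 [0, 0, 0, 1, 1, 1]
```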

2 · Hidden-state distillation and self-distillation

Once the network can survive missing words, the next step is to make its latent trajectory mimic a teacher that still reasons out loud.
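A minimal sketch of such an objective, assuming you already have aligned hidden states from a frozen teacher that reasons in text and a student that reasons latently; the alignment strategy and the loss weighting `alpha` are placeholders rather than any specific paper's formulation.

```python
import torch
import torch.nn.functional as F

def hidden_state_distill_loss(student_h, teacher_h, answer_logits, answer_ids, alpha=1.0):
    """Match the student's latent trajectory to the teacher's hidden states,
    while still supervising the final answer with cross-entropy.

    student_h, teacher_h: [batch, n_steps, d_model] aligned hidden states.
    answer_logits: [batch, vocab]; answer_ids: [batch].
    """
    match = F.mse_loss(student_h, teacher_h.detach())   # teacher is frozen
    answer = F.cross_entropy(answer_logits, answer_ids)
    return answer + alpha * match

# Usage with random tensors, just to show the shapes involved.
loss = hidden_state_distill_loss(
    torch.randn(2, 4, 64), torch.randn(2, 4, 64),
    torch.randn(2, 1000), torch.randint(0, 1000, (2,)),
)
```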

3 · Compact latent tokens

ICoT-style distillation still leaves one vector per hidden step; the next advance packs many steps into a handful of learned embeddings.

4 · Recurrent or loop-aligned supervision

Some architectures keep sequence depth fixed and let the model loop through shared blocks as many times as it needs.

5 · Hybrid latent reinforcement learning

Supervised data eventually tops out, so teams switch to reinforcement learning that rewards correctness and charges for extra computation.

Hybrid Reasoning Policy Optimisation (HRPO) is the flagship; a small sketch of its moving parts follows the list below:

  1. At every generation step the network mixes two embeddings (a normal token embedding and a transformed copy of the previous hidden state) weighted by an action variable gamma.
  2. The reward is 1 for a correct final answer, minus a small fee per visible token and per latent iteration (instances where gamma ≠ 0).
  3. HRPO is trained with Group Relative Policy Optimisation (GRPO), which uses the mean reward of a mini-batch of roll-outs as its baseline instead of a learned critic. GRPO needs about half the memory of PPO and converges just as fast.

Setting the step penalty too low makes the model talk verbosely; setting it too high drives it into silent, brittle reasoning. Authors report that a short grid search over a few hundred prompts is enough to find the sweet-spot penalty.
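The two moving parts above can be sketched in a few lines: a gamma-weighted mix of the token embedding and the previous hidden state, and a reward that pays for correctness but charges for every visible token and latent iteration. The gate parameterization, the fee values, and the group-relative advantage helper are illustrative assumptions, not HRPO's or GRPO's exact formulation.

```python
import torch
import torch.nn as nn

class HybridStep(nn.Module):
    """Mix a normal token embedding with a projection of the previous hidden state,
    weighted by a gate gamma in [0, 1] (gamma > 0 means 'keep thinking latently')."""

    def __init__(self, d_model=64, vocab_size=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.hidden_proj = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(d_model, 1)

    def forward(self, token_id, prev_hidden):
        gamma = torch.sigmoid(self.gate(prev_hidden))                  # action variable
        mixed = (1 - gamma) * self.embed(token_id) + gamma * self.hidden_proj(prev_hidden)
        return mixed, gamma

def hrpo_style_reward(correct, n_visible_tokens, n_latent_steps,
                      token_fee=0.001, latent_fee=0.001):
    """1 for a correct final answer, minus small fees per visible token and latent step."""
    return float(correct) - token_fee * n_visible_tokens - latent_fee * n_latent_steps

def grpo_advantages(rewards):
    """Group-relative baseline: subtract the mean reward of the rollout group
    (and normalize), instead of training a separate critic."""
    r = torch.tensor(rewards)
    return (r - r.mean()) / (r.std() + 1e-6)

step = HybridStep()
mixed, gamma = step(torch.randint(0, 1000, (2,)), torch.randn(2, 64))
print(grpo_advantages([hrpo_style_reward(True, 12, 4), hrpo_style_reward(False, 30, 0)]))
```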

6 · Generic efficiency add-ons

Auxiliary techniques first devised for token-level RL carry over cleanly to the latent regime; the adaptive length penalties of ALP (Xiang et al., 2025) and the adaptive curricula of AdaRFT (Shi et al., 2025), both of which reappear in the training recipe below, are two examples.

A modern training run at a glance

  1. Supervised warm-up on full CoT.
  2. Curriculum latentisation for five to ten epochs (COCONUT or Stepwise Internalisation).
  3. Hidden-state distillation, optionally followed by compact latent token training (CCoT or HCoT).
  4. Architecture-aligned pre-training if you use loops (Huginn, RELAY, ITT).
  5. Hybrid RL fine-tuning with HRPO + GRPO, optionally adding ALP and AdaRFT.

Representational dynamics

Analysis of COCONUT’s latent reasoning reveals a process more complex than a simple linear chain. Instead of committing to a single path, the continuous thoughts can be interpreted as a latent search tree that explores multiple potential next steps simultaneously.

Diagram showing the COCONUT model's latent reasoning process. The model explores multiple potential next steps simultaneously, represented as a search tree.

Engineering constraints

So far, latent reasoning systems come with costs you don’t see in ordinary Chain‑of‑Thought models. The literature converges on six broad constraints:

1. A curriculum is mandatory (otherwise the model never “gets” latent reasoning)

COCONUT’s ablation shows that training directly on (question, answer) pairs with hidden‑state recycling performs worse than a no‑CoT baseline. Only the staged schedule that first teaches the model to reason in language and then incrementally replaces early steps with vectors seems to drive the gains. In their “w/o curriculum” run, GSM8K drops from 34.1% to 14.4%. Designing such curricula (and automatically tuning them for new domains) remains an open research problem.

2. Latent loops break GPU parallelism

Because every continuous thought depends on the previous hidden state, training (and inference) cost scales with the number of latent steps, not the batch size. COCONUT explicitly notes that it must execute n + 1 forward passes for n thoughts, and that “the sequential nature of the multiple forward passes poses challenges for parallelism.” This serial dependency throttles throughput on modern GPU clusters built for large‑batch matrix multiplies.

3. KV‑cache memory becomes the new bottleneck

Long latent traces keep the emitted token sequence short, but they still enlarge the key/value cache: every extra latent iteration appends another full set of key and value vectors at every layer. Recent work on SQuat (Wang et al. (2025)) shows that even with aggressive INT‑2 quantisation the cache can dominate peak GPU memory when models “think” for dozens of steps. Compression helps but introduces accuracy/latency trade‑offs that are not yet well understood.
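For intuition about how quickly this adds up, here is a tiny calculator. The model shape (layers, KV heads, head dimension) and fp16 storage are assumptions chosen for illustration, not any particular model's configuration.

```python
def kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2,
                   n_positions=1):
    """Bytes of key+value cache added per sequence for `n_positions` extra positions
    (each latent iteration appends one position, just like a generated token)."""
    per_position = n_layers * n_kv_heads * head_dim * 2 * bytes_per_value  # K and V
    return per_position * n_positions

# 64 latent "thinking" steps on this assumed configuration:
print(kv_cache_bytes(n_positions=64) / 1e6, "MB per sequence")  # ~8.4 MB
```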

4. Knowing when to stop is still heuristic

During inference a latent‑reasoning model must decide when to emit an <eot> and return to language space. Current systems either pad to a fixed depth or train an ad‑hoc binary classifier over hidden states, and COCONUT reports that both heuristics work “comparably well.” Huginn trains a learned halting classifier (§4.1, p 5) which shows promise but still requires careful tuning. Neither approach adapts gracefully to problem difficulty, and mis‑predictions manifest as truncated explanations or runaway loops.

5. Deep recurrent stacks risk optimisation instability

Recurrent‑depth architectures such as Huginn push performance by unrolling a shared core 30+ times, but the authors note that gradient signals weaken as depth grows, requiring careful learning‑rate scaling and residual gating to avoid divergence. Balancing depth‑on‑demand with stable training dynamics is still an active area of study.

6. Tooling for debugging and evaluation is immature

A survey of efficient reasoning methods highlights a “complexity of latent‑space implementation” gap: without textual traces, it is hard to verify correctness, attribute errors, or measure reasoning efficiency. New metrics (e.g., embedding‑consistency scores) and visual probes are being proposed, but no standard evaluation suite exists yet.


Interpretability limits

The “neuralese” problem

Lindsey et al. (2025) argue that models can implement complex computations via feature interactions. If intermediate reasoning happens as continuous vectors, you can end up with an internal “neuralese” that does not map cleanly to words. Unlike discrete tokens, which at least index into a vocabulary, a continuous thought is a 4,096‑dimensional vector with no obvious interpretation.

Key challenges include the absence of any canonical decoding from a continuous thought back to vocabulary items, the difficulty of verifying correctness or attributing errors without a textual trace, and the fact that current probes can often detect that reasoning is happening without recovering how it works.

Emerging Interpretability Techniques

Some approaches probe for structure directly in activation space. Zhang and Viteri (2025), for example, extract latent CoT “reasoning vectors” that can be added to a model’s activations to steer it toward reasoning behavior:

Diagram showing the Zhang and Viteri (2025) geometric analysis of latent CoT vectors. It shows how a reasoning vector can be added to the input to induce reasoning behavior.

Even with these methods, full interpretability remains elusive. We can sometimes detect that reasoning is happening, but not the detailed how.


Alignment considerations

If intermediate steps aren’t readable, some of the usual safety hooks get weaker: there is no chain of thought to monitor for unsafe or deceptive reasoning, and auditing a model’s conclusions becomes a matter of probing hidden states rather than reading text.

Mitigation ideas include hybrid schemes that keep the most critical steps in readable text (as in Token Assorted), probes or decoders trained to translate latent states back into language, and hard caps on latent depth so the model cannot “think” indefinitely out of view.


Current applications and performance

Mathematical Reasoning (GSM8k)

On the GSM8k math reasoning dataset, the performance of continuous reasoning is more nuanced.

COCONUT reaches 34.1% accuracy (vs. 16.5% for a no-CoT baseline), but it does not surpass the standard Chain-of-Thought baseline (42.9%). The main win here is efficiency: it reduces reasoning tokens from 25.0 (CoT) to 8.2.

Logical Reasoning (ProntoQA & ProsQA)

COCONUT shows its largest gains on logical reasoning tasks that require planning and searching.

On ProntoQA, COCONUT reaches 99.8% accuracy vs. 98.8% for CoT. On ProsQA, it reaches 97.0% vs. 77.5% for CoT. The reported gains come with fewer reasoning tokens than CoT, because the model can do more intermediate work in continuous states instead of emitting long textual traces.

Multimodal Integration

Heima (Shen et al. (2025)) reported that hidden “thinking tokens” also work for multimodal reasoning, letting the model carry out part of its image-and-text reasoning in a continuous latent space instead of spelling every step out.

Diagram showing the Heima model's multimodal reasoning process. It shows how the model can reason about images and text in a continuous latent space.

Code Generation and Formal Reasoning

Code generation and formal reasoning are plausible fits: much of the intermediate work (tracking program state, types, or proof obligations) is structured and does not need to be verbalized.


Where latent reasoning might help

Efficiency and token overhead

Long chain-of-thought traces can be expensive because attention cost grows with context length. A simplified way to think about the trade-off is sketched below.
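The following is a rough, illustrative comparison that assumes attention cost grows with the number of positions attended to and reuses the GSM8k trace lengths reported earlier (about 25 CoT tokens vs. roughly 8 latent steps); the prompt length is an arbitrary placeholder, and a latent step still costs a full forward pass, so the saving comes from fewer steps and a shorter context, not cheaper steps.

```python
def attention_cost(prompt_len, n_reasoning_steps):
    """Toy proxy for attention cost: each new position attends to everything before it,
    so total cost grows roughly quadratically with the full context length."""
    total = 0
    for i in range(n_reasoning_steps):
        total += prompt_len + i          # positions attended to at step i
    return total

prompt = 100                              # placeholder prompt length
cot_cost = attention_cost(prompt, 25)     # ~25 textual reasoning tokens (GSM8k CoT)
latent_cost = attention_cost(prompt, 8)   # ~8 latent steps (GSM8k COCONUT)
print(cot_cost, latent_cost, round(cot_cost / latent_cost, 2))  # 2800 828 3.38
```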

Representation without text

Latent steps can carry intermediate state without serializing it into tokens. That can make some kinds of search or planning more efficient, but it also makes intermediate reasoning harder to inspect.

One (oversimplified) mental model is mixing multiple partial hypotheses into a single hidden representation, sketched below.
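The toy snippet below illustrates that mental model only: several candidate “next step” vectors blended into one hidden state with soft weights. The vectors and weights are random placeholders, and real models are not literally doing this; it is just a way to picture superposition.

```python
import torch

d_model = 64
hypotheses = torch.randn(3, d_model)             # three candidate "next step" directions
weights = torch.softmax(torch.randn(3), dim=0)   # soft preference over them

# A single continuous thought can keep all three alive in superposition,
# instead of committing to one discrete token per step.
thought = (weights[:, None] * hypotheses).sum(dim=0)   # [d_model]

# Later steps (or the final decode) can sharpen the mixture toward one hypothesis.
closest = torch.cosine_similarity(thought[None, :], hypotheses).argmax()
print(weights.tolist(), int(closest))
```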

Architecture and scaling notes

Recent findings from Ye et al. (2025) suggest that model depth, more than parameter count alone, is what matters for multi-step reasoning, which fits naturally with recurrent-depth designs like Huginn that add effective depth at inference time.

Integration with Reinforcement Learning

Latent reasoning connects naturally to RL: because latent steps have no token-level labels to imitate, outcome-based rewards of the kind used in HRPO are the most natural way to train them, and the same penalties that discourage verbose text traces can be charged per latent iteration.


A speculative design sketch

If you want to combine the ideas above into one system, a reasonable sketch is: let the model decide, step by step, whether it should emit a text token or run extra latent steps, and make the number of latent steps adaptive.

This is a design sketch. There are other plausible designs; I’m using this one to make the trade-offs concrete.

What the architecture might look like
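Here is a minimal sketch of that idea under invented assumptions: a learned gate looks at the current hidden state and decides, at each step, whether to emit a visible token or spend one more latent iteration, with a hard cap on latent depth. The class name, parameters, and threshold are hypothetical; this is a thinking aid, not a tested design.

```python
import torch
import torch.nn as nn

class AdaptiveLatentDecoder(nn.Module):
    """Speculative sketch: at each step, a gate chooses between emitting a token
    and running one more latent iteration, up to `max_latent_steps`."""

    def __init__(self, vocab_size=1000, d_model=64, n_heads=4, n_layers=2,
                 max_latent_steps=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.core = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.to_embed = nn.Linear(d_model, d_model)   # recycle hidden state as input
        self.to_vocab = nn.Linear(d_model, vocab_size)
        self.think_gate = nn.Linear(d_model, 1)       # p(keep thinking | hidden state)
        self.max_latent_steps = max_latent_steps

    @torch.no_grad()
    def generate_step(self, input_ids):
        x = self.embed(input_ids)
        h = self.core(x)
        spent = 0
        # Keep running latent iterations while the gate says "think" and budget remains.
        while spent < self.max_latent_steps and \
                torch.sigmoid(self.think_gate(h[:, -1])).item() > 0.5:
            x = torch.cat([x, self.to_embed(h[:, -1:, :])], dim=1)
            h = self.core(x)
            spent += 1
        token = self.to_vocab(h[:, -1]).argmax(dim=-1)   # emit a visible token
        return token, spent

token, n_latent = AdaptiveLatentDecoder().generate_step(torch.randint(0, 1000, (1, 5)))
print(int(token), n_latent)
```

The gate here is the same mechanism HRPO's gamma plays at training time; in a real system its threshold and the latent-step budget would themselves have to be learned or tuned.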

What you’d need to validate

Before trusting such a design you would need to check, at minimum, that the gating policy can be trained without an entirely hand-built curriculum, that accuracy holds up against a plain CoT baseline at matched compute, that the extra latent iterations do not blow up the KV cache or break batching, and that the stopping behavior degrades gracefully on problems harder than the training distribution.

I haven’t implemented this; treat it as a sketch for thinking about system design, not as a claim that it is practical or superior.

Latent-space reasoning is promising because it can move more intermediate computation into dense vectors instead of emitting long token traces. The trade-off is that it can be harder to interpret, debug, and evaluate. Progress here will likely depend on better training stability, clearer evaluation protocols, and stronger interpretability hooks.


References

Chen, M., Shao, W., Xu, P., Wang, J., Gao, P., Zhang, K., & Luo, P. (2025). EfficientQAT: Efficient Quantization-Aware Training for Large Language Models. arXiv preprint arXiv:2407.11062. https://arxiv.org/abs/2407.11062

Chen, X., Wang, L., & Li, Y. (2025). Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking. arXiv preprint arXiv:2502.13842. https://arxiv.org/abs/2502.13842

Chen, X., Zhao, A., Xia, H., Lu, X., Wang, H., et al. (2025). Reasoning Beyond Language: A Comprehensive Survey on Latent Chain-of-Thought Reasoning. arXiv preprint arXiv:2505.16782. https://arxiv.org/abs/2505.16782

Cheng, P., & Van Durme, B. (2024). Compressed Chain-of-Thought: Efficient reasoning through dense representations. arXiv preprint arXiv:2412.13171. https://arxiv.org/abs/2412.13171

Deng, Y., Choi, Y., & Shieber, S. (2024). From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step. arXiv preprint arXiv:2405.14838. https://arxiv.org/abs/2405.14838

Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al. (2024). DeepSeek-V3 Technical Report. arXiv preprint arXiv:2412.19437. https://arxiv.org/abs/2412.19437v1

Deng, Y., Prasad, K., Fernandez, R., Smolensky, P., Chaudhary, V., & Shieber, S. (2023). Implicit Chain-of-Thought reasoning via knowledge distillation. arXiv preprint arXiv:2311.01460. https://arxiv.org/abs/2311.01460

Geiping, J., Fowl, L., Somepalli, G., Goldblum, M., Moeller, M., Goldstein, T., & Jacobs, T. (2025). Scaling up test-time compute with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171. https://arxiv.org/abs/2502.05171

Goyal, A., Bengio, Y., Weston, J., & Ballas, N. (2023). Think before you speak: Training language models with pause tokens. arXiv preprint arXiv:2310.02226. https://arxiv.org/abs/2310.02226

Hao, S., Gu, Y., Ma, H., Hong, J., Wang, Z., Wang, D., & Hu, Z. (2024). Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769. https://arxiv.org/abs/2412.06769

Lindsey, R., Kenton, Z., Everitt, T., Wattenberg, M., Mirhoseini, A., Leike, J., & Amodei, D. (2025). Circuit tracing: Revealing computational graphs in language models. Transformer Circuits Thread. https://transformer-circuits.pub/2025/attribution-graphs/biology.html

Liu, J., Chen, X., Wang, H., Zhang, L., & Li, M. (2024). Expediting and elevating large language model reasoning via hidden chain-of-thought decoding. arXiv preprint arXiv:2409.08561. https://arxiv.org/abs/2409.08561

Pfau, J., Merrill, W., & Bowman, S. R. (2024). Let’s think dot by dot: Hidden computation in transformer language models. arXiv preprint arXiv:2404.15758. https://arxiv.org/abs/2404.15758

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300. https://arxiv.org/abs/2402.03300

Shen, H., Wu, Y., Chen, K., Wang, J., & Zhang, Q. (2025). Efficient reasoning with hidden thinking. arXiv preprint arXiv:2501.19201. https://arxiv.org/abs/2501.19201

Shen, Z., Yan, H., Zhang, L., Hu, Z., Du, Y., & He, Y. (2025). CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation. arXiv preprint arXiv:2502.21074. https://arxiv.org/abs/2502.21074

Shi, T., Wu, Y., Song, L., Zhou, T., & Zhao, J. (2025). Efficient Reinforcement Finetuning via Adaptive Curriculum Learning. arXiv preprint arXiv:2504.05520. https://arxiv.org/abs/2504.05520

Su, Y., Liu, T., Wang, D., Chen, H., & Zhou, J. (2025). Token Assorted: Mixing latent and text tokens for improved language model reasoning. arXiv preprint arXiv:2502.03275. https://arxiv.org/abs/2502.03275

Wang, H., Han, L., Xu, K., & Srivastava, A. (2025). SQuat: Subspace-orthogonal KV Cache Quantization. arXiv preprint arXiv:2503.24358. https://arxiv.org/abs/2503.24358

Xiang, V., Blagden, C., Rafailov, R., Lile, N., Truong, S., Finn, C., & Haber, N. (2025). Just Enough Thinking: Efficient Reasoning with Adaptive Length Penalties Reinforcement Learning. arXiv preprint arXiv:2506.05256. https://arxiv.org/abs/2506.05256

Yang, K., Klein, D., Pang, N., & Sachan, M. (2024). Do large language models latently perform multi-hop reasoning? arXiv preprint arXiv:2402.16837. https://arxiv.org/abs/2402.16837

Ye, H., Zhang, C., Wang, X., Liu, Y., & Sun, M. (2025). Scaling laws for reasoning: The importance of model depth. arXiv preprint arXiv:2407.20311. https://arxiv.org/abs/2407.20311

Yu, D., Wang, S., Chen, L., Zhang, M., & Li, X. (2025). Enhancing auto-regressive Chain-of-Thought through loop-aligned reasoning. arXiv preprint arXiv:2502.08482. https://arxiv.org/abs/2502.08482

Zelikman, E., Wu, Y., Mu, J., & Goodman, N. D. (2022). STaR: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35, 15476–15488. https://arxiv.org/abs/2203.14465

Zelikman, E., Harik, G., Shao, Y., Jayasiri, V., Haber, N., & Goodman, N. D. (2024). Quiet-STaR: Language models can teach themselves to think before speaking. arXiv preprint arXiv:2403.09629. https://arxiv.org/abs/2403.09629

Yue, Z., Jin, B., Zeng, H., Zhuang, H., Qin, Z., Yoon, J., Shang, L., Han, J., & Wang, D. (2025). Hybrid Latent Reasoning via Reinforcement Learning. arXiv preprint arXiv:2505.18454. https://arxiv.org/abs/2505.18454

Zhang, T., & Viteri, M. (2025). Uncovering latent Chain-of-Thought vectors in language models. arXiv preprint arXiv:2409.14026. https://arxiv.org/abs/2409.14026