Eight years on from the original Transformer, LLM architectures may seem structurally familiar, from GPT-2 to DeepSeek-V3 and Llama 4. Yet a subtler, more profound revolution is unfolding: a shift toward Test-Time Compute (TTC) is enabling LLMs to move from rapid, intuitive “System 1” responses to deliberate, analytical “System 2” reasoning.

As pre-training scaling laws show diminishing returns, advanced test-time computation is emerging as a driver for performance gains.

1. “Let’s think step by step”

Chain-of-Thought (CoT) prompting was the foundational breakthrough for deliberate reasoning. Simply appending “Let’s think step by step” to a query (Zero-Shot CoT) or providing a few worked examples (Few-Shot CoT) is enough to demonstrate CoT’s power; a minimal sketch of the zero-shot variant follows the figure below. Its effectiveness is most pronounced in larger models, though some 2025 research investigates distilling it into smaller, more efficient models.

Figure 1: A comparison between standard few-shot prompting and Chain-of-Thought prompting. By showing its work, the model is guided toward the correct reasoning process.

**Standard Prompt:**
> Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
> A: 11
>
> Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?
> A: 6

**Chain-of-Thought Prompt:**
> Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
> A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
>
> Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?
> A: There are 15 trees to start. There are 21 trees later. The difference is the number of trees planted. 21 - 15 = 6. The answer is 6.
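
Here is a minimal sketch of the zero-shot variant, assuming a generic `generate(prompt) -> str` completion call standing in for a concrete model API; the two-stage prompt-then-extract structure follows the original Zero-Shot CoT recipe:

```python
from typing import Callable

def zero_shot_cot(question: str, generate: Callable[[str], str]) -> str:
    """Zero-Shot CoT: elicit a reasoning trace with the trigger phrase,
    then condition on that trace to extract a concise final answer."""
    # Stage 1: elicit step-by-step reasoning.
    reasoning = generate(f"Q: {question}\nA: Let's think step by step.")
    # Stage 2: re-prompt with the trace to extract the answer.
    answer = generate(
        f"Q: {question}\nA: Let's think step by step. {reasoning}\n"
        "Therefore, the answer is"
    )
    return answer.strip()
```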

2. Amplifying Reasoning: Self-Consistency and Beyond

CoT, in its simplest form, relies on a single greedy reasoning path. If an early step is flawed, the entire solution can fail. Self-Consistency mitigates this by sampling multiple diverse reasoning paths and selecting the answer that emerges most frequently. This “majority vote” approach enhances robustness.
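
A minimal sketch of the majority vote, assuming a `generate` callable that samples with nonzero temperature and a toy regex for answer extraction:

```python
import re
from collections import Counter
from typing import Callable

def self_consistency(question: str,
                     generate: Callable[[str], str],
                     n_samples: int = 20) -> str:
    """Sample diverse CoT paths and majority-vote on the final answer."""
    answers = []
    for _ in range(n_samples):
        path = generate(f"Q: {question}\nA: Let's think step by step.")
        # Toy extraction: take the last number mentioned in the trace.
        numbers = re.findall(r"-?\d+(?:\.\d+)?", path)
        if numbers:
            answers.append(numbers[-1])
    if not answers:
        return ""  # no parseable answer in any sample
    # Return the most frequent answer across all sampled paths.
    return Counter(answers).most_common(1)[0][0]
```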

Research from 2025 refines this further. “Confidence-Informed Self-Consistency” (CISC) prioritizes high-confidence reasoning paths, achieving comparable accuracy with fewer samples, while “Self-Certainty” offers a reward-free route to scalable “Best-of-N” selection, using the model’s own output probabilities to estimate response quality. Both continue the push to make multi-path exploration cheaper without sacrificing accuracy.
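
A sketch of the reward-free Best-of-N idea, using mean token log-probability as a deliberately simplified confidence proxy (the cited papers define sharper measures: CISC weights votes by model-stated confidence, and Self-Certainty scores the full output distribution); `sample` is a hypothetical callable returning a response with its per-token log-probabilities:

```python
from typing import Callable, List, Tuple

def best_of_n(question: str,
              sample: Callable[[str], Tuple[str, List[float]]],
              n: int = 8) -> str:
    """Reward-free Best-of-N: generate n candidates and keep the one
    the model itself is most confident in. Confidence here is the mean
    token log-probability, a simplified stand-in for Self-Certainty."""
    best_answer, best_score = "", float("-inf")
    for _ in range(n):
        answer, token_logprobs = sample(question)
        score = sum(token_logprobs) / max(len(token_logprobs), 1)
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer
```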

3. Structured Deliberation: Trees & Graphs of Thought

Human reasoning isn’t always linear. We explore options, backtrack, and synthesize information. This inspired more structured CoT paradigms:

3.1 Tree of Thoughts (ToT)

Tree of Thoughts (ToT) generalizes CoT by structuring reasoning as a tree. Each node is a “thought”—a coherent text unit. The model generates multiple next thoughts at each step, evaluates their promise, and explores viable branches using search algorithms like BFS or DFS. This allows for explicit exploration and lookahead, outperforming linear CoT on planning-heavy tasks.
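
A minimal breadth-first sketch, assuming hypothetical LLM-backed `propose` and `evaluate` callables (the paper implements both as prompts to the same model):

```python
from typing import Callable, List

def tree_of_thoughts_bfs(problem: str,
                         propose: Callable[[str, str], List[str]],
                         evaluate: Callable[[str, str], float],
                         depth: int = 3,
                         breadth: int = 5,
                         keep: int = 2) -> str:
    """Simplified BFS over thoughts: `propose(problem, state)` suggests
    candidate next thoughts, `evaluate(problem, state)` scores how
    promising a partial solution is."""
    frontier = [""]  # each state is the thought sequence so far
    for _ in range(depth):
        candidates = []
        for state in frontier:
            for thought in propose(problem, state)[:breadth]:
                candidates.append(state + "\n" + thought)
        if not candidates:
            break
        # Prune: keep only the most promising partial solutions.
        candidates.sort(key=lambda s: evaluate(problem, s), reverse=True)
        frontier = candidates[:keep]
    return frontier[0]
```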

3.2 Graph of Thoughts (GoT)

Graph of Thoughts (GoT) extends ToT to an arbitrary graph structure, where thoughts are vertices and dependencies are edges. This enables richer transformations: aggregating multiple thoughts into a new one, refining existing thoughts via feedback loops, and dynamically constructing a complex network of reasoning. This brings LLM reasoning closer to human cognitive processes, which form complex networks.
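
A sketch of the underlying data structure and GoT’s two signature transformations, with hypothetical `combine` and `improve` callables standing in for the framework’s prompt-driven operations:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Thought:
    """A vertex in the reasoning graph; edges are stored as parents."""
    text: str
    parents: List["Thought"] = field(default_factory=list)

def aggregate(thoughts: List[Thought],
              combine: Callable[[List[str]], str]) -> Thought:
    """Aggregation: merge several thoughts into a new vertex with an
    incoming edge from each input (`combine` would be an LLM call)."""
    merged = combine([t.text for t in thoughts])
    return Thought(text=merged, parents=list(thoughts))

def refine(thought: Thought,
           improve: Callable[[str], str]) -> Thought:
    """Refinement: a feedback loop that rewrites a thought, modeled
    here as a new vertex pointing back at the one it improves."""
    return Thought(text=improve(thought.text), parents=[thought])
```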

4. Test-Time Compute in 2025 LLMs

The latest LLMs are increasingly designed, and deployed, with TTC in mind.

DeepSeek-R1, a “reasoning model” from early 2025, shows that reasoning capabilities, including long CoTs, can be incentivized directly through reinforcement learning. This training explicitly prepares it for sophisticated test-time reasoning.

Qwen3, introduced in May 2025, integrates “thinking mode” for complex, multi-step reasoning and a “thinking budget mechanism,” allowing users to adaptively allocate computational resources during inference to balance latency and performance. This natively supports on-demand TTC.
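
For example, with Hugging Face Transformers the mode is toggled per request through the chat template (per the Qwen3 model card; fine-grained token-budget control lives in the serving stack and is elided here):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Qwen3 checkpoint; any size in the family exposes the same switch.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "How many primes are below 100?"}]
# enable_thinking=True lets the model emit a <think>...</think> trace
# before answering; set it to False for fast, direct responses.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=2048)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```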

Kimi k1.5 and its multimodal variant, Kimi-VL-Thinking, also released in 2025, highlight the role of “long-CoT supervised fine-tuning” and reinforcement learning in achieving state-of-the-art reasoning performance, even for “short-CoT models.” Similarly, Llama 4, while architecturally similar to DeepSeek-V3 in its MoE design, aims for best-in-class performance in reasoning, implicitly benefiting from these advanced test-time strategies.

Beyond specific models, 2025 research broadly explores “scaling test-time compute for LLM agents,” including parallel sampling, sequential revision, and verifiers. Even “Test-Time Reinforcement Learning (TTRL)” is emerging, enabling LLMs to self-evolve during inference without ground-truth labels.
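
To give a flavor of the sequential-revision pattern, here is a minimal draft-verify-revise loop; `generate` and `verify` are hypothetical LLM- or tool-backed callables:

```python
from typing import Callable

def sequential_revision(task: str,
                        generate: Callable[[str], str],
                        verify: Callable[[str, str], str],
                        max_rounds: int = 3) -> str:
    """Sequential test-time scaling: draft an answer, have a verifier
    critique it, and revise until accepted or the budget runs out.
    `verify(task, draft)` returns a critique, or "OK" to accept."""
    draft = generate(task)
    for _ in range(max_rounds):
        critique = verify(task, draft)
        if critique.strip() == "OK":
            break
        # Feed the critique back in and ask for a revised draft.
        draft = generate(
            f"Task: {task}\nDraft: {draft}\n"
            f"Critique: {critique}\nRevise the draft accordingly:"
        )
    return draft
```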


Footnotes

  1. Guo, D., Yang, D., Zhang, H., et al. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948.
  2. Zhu, K., Li, H., Wu, S., et al. (2025). Scaling Test-time Compute for LLM Agents. arXiv preprint arXiv:2506.12928.
  3. Deng, S., Pan, J., Jiang, K., et al. (2025). CoAT: Chain-of-Associated-Thoughts Framework for Enhancing Large Language Models Reasoning. arXiv preprint arXiv:2502.02390.
  4. Qwen Team. (2025). Qwen3 Technical Report. arXiv preprint arXiv:2505.09388.
  5. Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., & Zhou, D. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv preprint arXiv:2203.11171.
  6. Ma, A., Song, J., Wang, Y., et al. (2025). Confidence-Informed Self-Consistency (CISC). arXiv preprint arXiv:2502.06233.
  7. Kang, Z., Li, Y., & Zhao, X. (2025). Scalable Best-of-N Selection for Large Language Models via Self-Certainty. arXiv preprint arXiv:2502.18581.
  8. Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., & Narasimhan, K. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv preprint arXiv:2305.10601.
  9. Besta, M., Blach, N., Kubicek, A., et al. (2023). Graph of Thoughts: Solving Elaborate Problems with Large Language Models. arXiv preprint arXiv:2308.09687.
  10. Zhang, D., Dong, J., Song, L., et al. (2025). From System 1 to System 2: A Survey of Reasoning Large Language Models. arXiv preprint arXiv:2502.17419.
  11. Marjanović, S. V., Patel, A., et al. (2025). DeepSeek-R1 Thoughtology: Let’s <think> about LLM Reasoning. arXiv preprint arXiv:2504.07128.
  12. MoonshotAI Team, et al. (2025). Kimi-VL Technical Report. arXiv preprint arXiv:2504.07491.
  13. Kimi Team, Du, A., Gao, B., et al. (2025). Kimi k1.5: Scaling Reinforcement Learning with LLMs. arXiv preprint arXiv:2501.12599.
  14. Meta AI. (2025). The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation.
  15. Zhou, C., Li, S., Wang, H., et al. (2025). Test-Time Reinforcement Learning: Boosting LLMs with Unlabelled Test Data. Available online via AI Advances: https://aiadvances.org/2025/07/16/test-time-reinforcement-learning/ (Accessed 2025-07-24).
  16. Raschka, S. (2025). Inference-Time Compute Scaling Methods to Improve Reasoning Models. Ahead of AI. (Accessed 2025-07-24).
  17. Muennighoff, N., Yang, Z., Shi, W., et al. (2025). s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393.