When DeepSeek-R1 achieves comparable performance breakthroughs at a lower cost, and Claude can work coherently for hours on complex tasks, it signals that AI development has entered the reasoning era. The importance of reinforcement learning is self-evident: it will reshape the technology stack and even the business model of the AI industry. On June 8, AI research firm SemiAnalysis released a lengthy report, "Scaling Reinforcement Learning: Environments, Reward Hacking, Agents, Scaling Data," which analyzes in depth how reinforcement learning works and what shapes it, and predicts where AI development is headed. The report argues that reinforcement learning (RL) may be the last key paradigm before AGI, that its resource-intensive nature creates computational challenges, and that high-quality data serves as the moat for reinforcement learning, accelerating the iterative cycle of AI designing AI.

Here are the highlights of the article:

- Reinforcement learning (RL) may be the last key paradigm before AGI: RL is the core technology driving the leap in large models' reasoning capabilities, excelling in particular at chain-of-thought (CoT) generation and long-horizon task coherence, and is regarded as the ultimate technical path to AGI.
- Verifiable-reward scenarios are being commercialized: tasks with clear reward functions, such as coding and mathematics (e.g., SWE-Bench improvements of over 30%), have already landed, with models like OpenAI's o1 and DeepSeek-R1 validating their value. Non-verifiable fields such as healthcare and writing are building reward functions through "LLM judges plus human-written rubrics" (e.g., the HealthBench medical evaluation), an approach OpenAI and Alibaba's Qwen-3 have already put into practice.
- Resource-intensive characteristics pose computational challenges: RL requires the model to generate multiple answers for each question, with each answer counted as a "rollout." The number of rollouts per question can range from a handful to hundreds of attempts, which makes RL inference-intensive. This has significant implications: most environments run only on CPU servers rather than GPUs and must therefore be hosted on dedicated external machines, adding another layer of engineering complexity.
- Huge market potential for environment compute: building high-fidelity RL environments that resist reward hacking requires hundreds of CPUs/GPUs working together. Demand for reliable, scalable, easy-to-deploy environments will be immense, and this is expected to become a thriving area for startups, with substantial market space for digital-twin environments (e.g., industrial and biological simulations).
- High-quality data is the moat for reinforcement learning: data quality matters more than quantity; high-quality data yields sufficiently clear RL signals, enabling models to better accomplish the required tasks. Services like OpenAI's Reinforcement Fine-Tuning (RFT) are undervalued, and AI startups with user data can build custom reinforcement learning models. If enterprises can set up suitable RL environments, the era of enterprise-customized models may arrive.
- Falling reward-hacking rates become a competitive indicator: Claude 3.7's manipulation of test cases and GPT-4o's sycophancy reveal the risks in reward function design. Claude 4 reduced the reward-hacking rate from 15.2% to 14.3% through environment optimization; RL safety is as important as model capability, making Anthropic's technical solutions sought after by enterprise clients.
- The duration of agent tasks is growing exponentially: the length of task a model can handle coherently doubles every 7 months (reaching 4 hours by 2024), supporting long-horizon work such as remote jobs and chip design, but the sparse-reward problem still needs to be solved.
- The cycle of AI designing AI has begun to emerge: recursive self-improvement is already happening to some extent, with Claude 4 using AI to optimize compilers/kernels and OpenAI's Codex assisting in the development of the next generation of models, accelerating the iterative cycle of AI designing AI.

Below is the full text of the article (translated by AI, slightly abridged).

Scaling Reinforcement Learning: Environments, Reward Hacking, Agents, Scaling Data

The test-time scaling paradigm is thriving. Reasoning models keep improving rapidly, becoming both more efficient and more cost-effective. Evaluations measuring real-world software engineering tasks, such as SWE-Bench, are reaching higher scores at lower cost. The following figure shows how models are becoming cheaper and better.

Reinforcement learning (RL) is the reason for this progress. We covered this in previous reports, outlining how RL has unlocked models' ability to reason by generating chains of thought (CoT). We expect this paradigm to continue. Beyond CoT innovations, models' coherent (thinking) time is also getting longer, unlocking agentic capabilities. Tool use (such as searching or running calculations with Python) is a result of the model's ability to plan, reason, and operate over extended periods. Improved reasoning gives models more time to "think," evolving them from simple chatbots into planners. This, in turn, fosters more coherent agents. As machine learning researchers scale RL in verifiable domains, these coherent agents will begin to tackle more complex computer-use tasks, such as fully automated remote work and systems engineering/architecture design.

Despite the significant progress, scaling RL compute introduces new bottlenecks and challenges throughout the infrastructure stack. RL may be the last paradigm before AGI (artificial general intelligence). The opportunities are immense, and so is the investment. Billions of dollars have readily flowed into pre-training models. More funding will be unlocked for scaling RL, but its infrastructure demands are quite different. Let's look at what is needed to get there.

How Reinforcement Learning Works

Reinforcement learning (RL) is conceptually simple. An RL model gathers information from its current state in some environment, generates a set of probabilities over possible actions, and then executes an action. The model's goal is to achieve an objective defined by a "reward function." Reinforcement learning happens as the model's weights are adjusted so that the actions most likely to earn high rewards become the ones the model is most likely to produce. Reinforcement learning is not a new technology; it is an older technique that predates large language models.
For example, it is the technical foundation behind the systems that mastered Go and chess. However, RL now finally works in general-purpose technologies like LLMs, which has significant implications for both capability and the diffusion of the technology.

Verifiable Rewards

RL in LLMs works best in domains with verifiable rewards, meaning tasks like coding and mathematics where the clear reward definition RL needs already exists. In fields where the reward function is more ambiguous, reasoning models have struggled to make progress. When OpenAI applied RL to GPT-4o to produce o1, the largest gains came in domains with verifiable rewards. As the field evolves, new areas such as tool use are opening up. OpenAI's o3 can zoom into images, reason about what it sees, run some calculations, reason further, and then give an answer. This unlocks a range of tasks that models can now perform well, such as identifying where a photo was taken. Such tasks are technically verifiable but were never explicitly trained for. Yet despite the remarkable results, the funding labs have put into RL remains small, especially compared to the cost of pre-training. What bottlenecks are preventing RL compute from matching and then surpassing pre-training compute? Will unverifiable domains be solved?

Reinforcement Learning is Inference-Intensive

Studying one of the most popular RL algorithms gives a sense of how heavily RL depends on inference. Group Relative Policy Optimization (GRPO) is a commonly used algorithm and the one DeepSeek used to train R1. In GRPO, the model is asked a question and generates multiple answers to it. Each answer can be viewed as a "rollout," essentially one attempt by the model to find a solution. In other words, a rollout is a single attempt by the model to generate an answer or solve a problem. The number of rollouts per question can range from a few to hundreds of attempts. There is no technical limit, but the more rollouts used, the more memory and compute are consumed. This makes RL inference-intensive, since every question produces many answers. This has significant implications, which we will return to at several points throughout this report.

The answers generated by the model are then scored against the ground truth. In GRPO specifically, each answer receives a reward score. Correctness is not the only factor; the reward function can be tuned in many ways, and other components include formatting and language consistency. After the rewards are computed, the model is updated via gradient descent to increase the probability of generating the answers that are more likely to receive positive rewards. GRPO is a variant of Proximal Policy Optimization (PPO) that eliminates the critic model (which in PPO predicts future rewards), making it more memory-efficient. Both PPO and GRPO can use learned reward models or rule-based rewards to judge answer quality. Because of its lower memory requirements, GRPO has been widely adopted in the open-source community, but we expect labs to continue using variants of PPO. PPO was invented by OpenAI, and the version used internally at labs now differs substantially from the public version GRPO is usually compared against. The labs also face fewer compute constraints.
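To make the rollout-and-score loop concrete, here is a minimal sketch of a GRPO-style advantage computation: the group-mean reward serves as the baseline, so no critic model is needed. The function names and the simplified clipped loss are illustrative, not DeepSeek's actual implementation.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Compute group-relative advantages for one question.

    rewards: shape (num_rollouts,) -- one scalar reward per sampled answer.
    The group mean acts as the baseline, so no learned critic is required.
    """
    baseline = rewards.mean()
    std = rewards.std().clamp_min(1e-6)  # avoid division by zero when all rewards match
    return (rewards - baseline) / std

def grpo_policy_loss(logprobs: torch.Tensor, old_logprobs: torch.Tensor,
                     advantages: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped policy-gradient loss over a group of rollouts (PPO-style clipping;
    broadcasting each rollout's advantage over its tokens is omitted for brevity)."""
    ratio = torch.exp(logprobs - old_logprobs)          # importance ratio per rollout
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Example: 8 rollouts for one math question, reward 1.0 if the final answer is correct.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0])
print(grpo_advantages(rewards))  # correct rollouts get positive advantages, incorrect ones negative
```

The sketch also makes the inference cost visible: every one of those eight rollouts is a full generation from the model before any weight update happens.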
The core idea is that RL typically requires a question, an answer to compare against, and a way to signal to the model how its behavior should change. How the model explores its way to an answer can vary, but it needs to generate multiple candidate answers as different rollouts, which places heavy demands on the inference side. The model is then updated to make the correct answer more likely, so there is an implicit training component as well.

Defining the Reward Function is Difficult

As mentioned earlier, verifiable rewards are where the most progress has been made, in part because the reward function is easy to define: the answer to a math problem is either right or wrong. Technically, though, the reward function can be anything the user wants to optimize. Conceptually, the main goal of a model under RL is to maximize total reward. If a model is trained to play chess, for example, its primary goal is to win the game without violating any rules. The model can learn which moves help it win in different positions and improve continuously, receiving feedback from the environment it operates in. We will dig into this later, but in the chess example the environment can be thought of as the board and pieces the model interacts with.

Defining rewards for more granular tasks has been described as a "dark art" because it is very difficult to do well. Even in clean environments, getting the reward function right requires extensive research, testing, and optimization. Chip design is one example. AlphaChip is a model designed by Google to assist with chip design and trained using reinforcement learning. The model helped design the TPUv6 chip, reducing TPUv6 wirelength by 6.2%. In this case, the reward function is explicitly defined as:

Reward = -Wirelength - λ · Congestion - γ · Density

This guides the model to minimize exactly the factors that matter: wirelength, congestion, and density. Note that even for a relatively simple reward function like this, the setup is not trivial. Congestion and density both carry scalar weights that adjust their importance (the lambda and gamma above). These values reflect the trade-offs the engineers wish to make and were derived from extensive experimentation, with wirelength ultimately determined to be the most important factor.

How to Set Rewards in Unverifiable Domains?

Unverifiable domains include areas such as writing or strategy, where there is no clearly correct answer. There has been debate over whether this can be done at all. Doing it requires changing the reward mechanism: instead of relying on a formal verifier to check outputs, other models can judge whether an answer is good based on a rubric.

OpenAI used RL to change model behavior, which is less clear-cut than a math problem. OpenAI's deliberative alignment paper used RL to make the model safer and less prone to false refusals, with a large language model (LLM) serving as the judge alongside a rubric, and the process used only synthetic data. As mentioned earlier, they also found that the approach "achieved strong generalization to out-of-distribution safety scenarios." This form of RL on unverifiable tasks was used in training o1, o3-mini, and o4-mini, and will continue to be used for future reasoning models. The ability to reason does not just help with solving math problems; it helps with many other tasks, including unverifiable ones. In many cases, for instance, reasoning helps the model better judge when a refusal is warranted.
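A minimal sketch of the rubric-based judging described above might look like the following. The rubric text, the prompt format, and the `score_with_judge` helper are hypothetical illustrations of the pattern (an LLM judge turning a rubric into a scalar reward), not any lab's actual pipeline, and `call_llm` stands in for whatever inference endpoint is available.

```python
import json
import re

RUBRIC = """Score the response from 0 to 10:
- Factual accuracy and absence of fabricated claims (0-4)
- Directly addresses the user's request (0-3)
- Clarity and appropriate tone (0-3)
Return JSON: {"score": <int>, "justification": "<one sentence>"}"""

def call_llm(prompt: str) -> str:
    """Placeholder for an inference call to whichever judge model is available."""
    raise NotImplementedError

def score_with_judge(question: str, response: str) -> float:
    """Use an LLM judge plus a rubric to produce a scalar reward in [0, 1]."""
    prompt = f"{RUBRIC}\n\nQuestion:\n{question}\n\nResponse to grade:\n{response}\n"
    raw = call_llm(prompt)
    match = re.search(r"\{.*\}", raw, re.DOTALL)  # tolerate extra prose around the JSON
    if not match:
        return 0.0  # unparseable judgments contribute no reward signal
    score = json.loads(match.group(0)).get("score", 0)
    return max(0.0, min(10.0, float(score))) / 10.0
```

Using a reasoning model as the judge simply means `call_llm` points at a stronger model, which is exactly the "RL helps you do RL better" loop discussed next.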
That said, in unverifiable domains certain factors matter more than others. The model's personality, for example, greatly influences its writing style. RL in unverifiable domains is also less stable: GPT-4o's sycophantic behavior stemmed partly from OpenAI running RL on user-preference data, an example of a well-intentioned reward function leading to adverse and undesirable behavior.

RL Helps You Better Perform RL

Improving a model with RL directly improves the RL process itself, forming a positive feedback loop. As noted above, RL signals are typically supplied by LLM judges working from rubrics. Using reasoning models as judges means the judge understands the rubric better and can discern finer distinctions between responses. OpenAI's Deep Research has been touted as an example of RL driving progress in unverifiable domains. In fact, OpenAI used both verifiable tasks with ground-truth answers and unverifiable tasks. As in the earlier example, the unverifiable tasks were judged by another LLM equipped with a rubric. Alibaba's Qwen-3 also employs LLMs as judges, using large amounts of synthetic data combined with LLM judges to provide signal where no reference answer exists.

We believe rubrics open up a multitude of domains. In another example, OpenAI showed the model's performance on various healthcare tasks. OpenAI recruited more than 260 doctors to write rubrics for the model to use when evaluating responses. HealthBench is an excellent evaluation, and it is commendable that OpenAI released it. It also demonstrates how effective LLM judges are at measuring performance on unverifiable rewards. And if something can be measured, it can be improved through RL. This highlights the underappreciated relationship between RL and evals, the latter of which show how well an RL run is progressing.

Environments

To do RL, you need to reinforce an action or outcome, which requires an environment in which the model or agent receives feedback so it understands what to do next. This has led to RLEF (reinforcement learning from execution feedback), where code generated by the model is run in an environment and the result is used as the reward signal. An environment is the setting or simulation in which the model acts and receives feedback. Board games like chess and Go are excellent examples of environments: the goals are clearly defined and the rules are straightforward. With increasing generality come domains such as agents racing in video games or controlling a specific set of parameters in a bioreactor simulation, as well as mathematics, coding, and even browsers.

Different environment configurations can produce different agent behaviors. A poorly configured environment can cause the model to misunderstand the task or fail to generalize correctly. This can lead to reward hacking, which we discuss later in this report. Designing a robust environment in which the reward function is specified exactly right is therefore extremely difficult. Even in fields that only need simple environments, such as coding, heavy reliance on unit tests can shift the model's focus from writing good code to merely passing the tests. Thus one engineering challenge is building an environment that stays faithful to the intended goal (writing good code).
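As a concrete illustration of that fidelity gap, here is a minimal sketch of a unit-test-based coding reward, under the assumption that the environment grades a candidate patch purely by the fraction of tests it passes. The `run_tests` helper is hypothetical; the point is that "fraction of tests passed" is only a proxy for "good code," which is exactly what makes such environments exploitable.

```python
from dataclasses import dataclass

@dataclass
class TestReport:
    passed: int
    total: int

def run_tests(candidate_patch: str, test_suite: str) -> TestReport:
    """Hypothetical sandbox runner: applies the patch and executes the test suite."""
    raise NotImplementedError

def coding_reward(candidate_patch: str, test_suite: str) -> float:
    """Reward = fraction of unit tests passed.

    The proxy problem: a patch that special-cases the exact inputs the tests use,
    or that weakens any assertions it is allowed to touch, scores just as well as
    a genuinely good fix. The environment must keep the test suite hidden and
    immutable, or the reward stops measuring what we actually care about.
    """
    report = run_tests(candidate_patch, test_suite)
    if report.total == 0:
        return 0.0
    return report.passed / report.total
```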
Setting up an environment with the right reward function is one thing; engineering it well is another. Building scalable, robust environments is a key technical challenge, and environments have many requirements. One example is latency: the delay between the agent taking an action and the environment registering it matters, as does the agent receiving feedback quickly; otherwise much of the rollout time is spent waiting for the agent's next step. Other considerations include maintaining consistently reliable connections (to avoid crashes and interruptions) and building in fault tolerance and checkpointing (so failures are handled gracefully). Many different rollouts or trajectories must be handled at once, and handled efficiently. A full security infrastructure is also needed to protect the model from external infiltration and from attempts to escape the environment. The model itself has failure modes that complicate matters, such as taking actions that exhaust the machine resources available to it.

Engineering the environment means protecting the model from its own actions, maintaining sufficiently secure infrastructure, and solving a range of engineering problems around latency and reliability. Environments also need to represent the simulation accurately so that the agent understands exactly where it needs to improve, while ensuring they cannot be exploited. All of these requirements make scaling environments quite difficult, especially the first time. As we will discuss, models' longer coherence times make even simple environments hard to maintain. This is particularly true for cases like computer use, which we explore in more depth in later sections.

Although infrastructure engineering may seem mundane, it is crucial to the success of RL. If a rollout takes too long, the model doing the verification sits idle, wasting resources, so it is important to figure out how it can be put to other use in the meantime (such as judging another rollout). These software constraints must also fit hardware constraints. For example, most environments run only on CPU servers, not GPUs. This means running them on separate dedicated machines, which adds yet another layer of engineering.

It is also worth remembering that most public RL environments focus on single-turn problems tied to benchmark performance. Models like OpenAI's o3 are built on environments that exercise multiple tool calls. We deconstruct how o3 was built in later sections, but as environments grow more complex and involve more tool calls, they introduce yet another set of challenges.

Reward Hacking

As mentioned earlier, setting the right reward can be difficult because the model may misunderstand the goal and optimize for something undesirable. Reward hacking occurs when the model exploits loopholes in the environment or the reward structure to score highly without actually completing the intended task. Reward hacking has long been recognized as a significant problem, highlighted by researchers such as Dario Amodei (now CEO of Anthropic) as early as 2016. For example, a robotic arm that was rewarded for placing a red block high above a blue block exploited the reward by flipping the red block upside down instead of stacking it correctly, because the reward was judged by the height of the block's bottom face.
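A toy version of that reward, written out, makes the loophole obvious. The geometry and the `Block` fields here are invented purely for illustration of how a proxy measurement can be gamed.

```python
from dataclasses import dataclass

@dataclass
class Block:
    center_z: float        # height of the block's center above the table
    height: float          # block thickness
    flipped: bool = False  # whether the block has been turned upside down

def tracked_face_height(block: Block) -> float:
    """Height of the face the reward designer intended to be the bottom face.
    If the block has been flipped, that face is now on top."""
    offset = block.height / 2 if block.flipped else -block.height / 2
    return block.center_z + offset

def stacking_reward(red: Block) -> float:
    """Intended behavior: lift the red block and place it on the blue one.
    Actual proxy: height of the red block's tracked face."""
    return tracked_face_height(red)

# The hack: flipping the red block in place raises its tracked face by the
# block's full thickness -- the reward goes up without any stacking at all.
print(stacking_reward(Block(center_z=0.02, height=0.04, flipped=False)))  # 0.0
print(stacking_reward(Block(center_z=0.02, height=0.04, flipped=True)))   # 0.04
```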
Another failure mode: an agent in a physics simulation meant to teach a robot to walk discovered a software bug that let it move horizontally without ever taking a step. In the LLM setting, Claude 3.7 Sonnet exhibited reward hacking by modifying test cases rather than improving its code to pass the original tests. For instance, a third-party evaluator found that Claude would directly edit the "test" files so that all tests passed, instead of writing code that passed the original tests. Anthropic identified this issue, and although they implemented some mitigations, the pattern is still observable in Claude 3.7.

Intriguing as these cases are, the real problem is that engineers cannot fully specify the reward function up front, or only discover flaws in the environment after the agent has found them. Many instances of reward hacking involve paths the designers never considered, and while one can iterate during training, this is hard for LLMs. Robotic environments are easier to adjust early in development, whereas large language models have vast, complex action spaces, making reward hacking much harder to prevent. Addressing reward hacking is crucial for every lab and will draw on many ideas from safety-oriented teams. It is another example of how safety and alignment work helps drive adoption by enterprises and companies. In the Claude 4 release, Anthropic significantly reduced reward hacking by improving environments, clarifying reward signals, and implementing active monitoring. This is not a simple task and requires a great deal of expertise and skill.

However, RL and reward hacking are not the only bottlenecks; the infrastructure itself is a major one, starting with the data RL requires.

Data and Sample Efficiency

At first glance, RL appears very sample-efficient: in the "reasoning RL" phase of training the Qwen model, fewer than 4,000 query-answer pairs were used. This delivered significant performance gains over the base model and an impressive claim of sample efficiency. The reality is more complicated. Each of those 4,000 pairs has strict requirements: it must not have been used in the model's cold-start phase (the preceding training stage), must be as challenging as possible, must cover a broad range of subdomains, and yet must remain within the model's capabilities.

None of this is trivial. Generating suitable synthetic data requires heavy filtering and repeated model inference, and requiring questions to be challenging but not too challenging for the model takes experimentation and validation to confirm each one sits in that narrow band. Where data is not synthetically generated, labs are recruiting STEM PhDs to help write questions and answers hard enough for the models, and to write rubrics for LLM judges. Companies like ScaleAI, Mercor, and Handshake are now receiving substantial business from AI labs to assist in this recruitment.

Qwen involved a further RL phase, and because the team wanted to preserve the impression of high sample efficiency, they did not share the sample count for that stage; it is far larger than 4,000. In that phase they ran RL across more than 20 different domains.
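The "challenging but within reach" filter described above can be sketched as a pass-rate band: sample the model several times on each candidate question and keep only those it solves sometimes but not reliably. The thresholds and the `sample_answers`/`is_correct` helpers are assumptions for illustration, not Qwen's published recipe.

```python
from typing import Callable, List

def filter_by_difficulty(
    questions: List[str],
    sample_answers: Callable[[str, int], List[str]],  # model inference: n rollouts per question
    is_correct: Callable[[str, str], bool],            # verifier against the reference answer
    n_rollouts: int = 16,
    min_pass: float = 0.1,
    max_pass: float = 0.7,
) -> List[str]:
    """Keep questions the current model solves occasionally but not always.

    Too easy (pass rate above max_pass): little learning signal left.
    Too hard (pass rate below min_pass): almost no successful rollouts to reinforce.
    Note the cost: every candidate question requires n_rollouts inference calls,
    which is why data that is sample-efficient to train on is compute-intensive to curate.
    """
    kept = []
    for q in questions:
        answers = sample_answers(q, n_rollouts)
        pass_rate = sum(is_correct(q, a) for a in answers) / n_rollouts
        if min_pass <= pass_rate <= max_pass:
            kept.append(q)
    return kept
```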
They also used all three types of reward signal (rule-based rewards, plus LLM judges with and without reference answers), which requires complex engineering and compute. In the long run, we expect labs to run RL across hundreds of specialized domains to significantly enhance model performance. Quality is more important than quantity: the model will optimize precisely against its training data, so careful selection and filtering of that data is crucial. So while the number of samples used was 4,000, reaching that point consumed a significant amount of compute. In terms of data, RL is sample-efficient; in terms of compute, it is decidedly sample-inefficient. Compared with pre-training, RL also requires a significantly larger engineering team to set up effectively.

Data is the Moat

Ultimately, Qwen demonstrates that high-quality data is a uniquely important resource for scaling RL. High-quality data provides the model with sufficiently clear RL signal to perform better on the desired tasks, and generating this data often requires a massive amount of inference. More broadly, companies and enterprises can aggregate their own data and use services like OpenAI's Reinforcement Fine-Tuning (RFT). RFT supports custom graders and lets businesses update models based on grader outputs or their own data. We believe this is still an underrated release that could have a significant impact, even without further advances in models. In fact, owning a product that can aggregate or collect user behavior is extremely valuable, because that behavior ultimately constitutes the most important dataset.

An interesting implication is that AI startups with user data can build custom reinforcement learning models without pouring heavy compute budgets into synthesizing data. If businesses can establish the right reinforcement learning environments, the era of enterprise-customized models may arrive. By contrast with the continuously improving frontier models, enterprise fine-tuned models have tended to fail.

The Time Frame for Agent Tasks is Extending

Models can now stay coherent for longer periods. Longer tasks require environments and infrastructure that run reliably over extended periods, which further increases the engineering demands. The chart below shows a 7-month doubling trend for standalone coding tasks, but we expect the doubling time for tasks outside coding to be faster. OpenAI's Deep Research was the first model capable of working coherently for more than a few minutes, and we anticipate the ceiling will rise significantly and rapidly.

There is a tension here, though. Agent tasks carry enormous economic value, but their complexity and resource intensity pose significant challenges for RL. Longer tasks mean each RL iteration also takes more time, slowing down the entire training process.

Computer use is an example that illustrates many of the issues with long-horizon tasks. First, as an agent task it sits closer to real-world problems and behaviors, which brings new challenges. In computer use, agents encounter anti-bot scripts, CAPTCHAs, and obscure Cloudflare protection features, and they do so only sporadically. Such details add a layer of debugging that did not previously exist in the environment. Computer use also requires substantial infrastructure, such as virtual machines and browser connections.
These now need to run stably for long periods, on top of the environment-engineering requirements discussed earlier. Computer-use tasks typically last for hours. That means rollouts get longer and rewards get sparser: the agent takes more than ten times as many steps but is rewarded only at the final token, which weakens the RL signal. Computer use also relies on images and video to show the model what is happening. There have been attempts to do computer use by streaming HTML or textual representations of web pages, but then the model does not understand what the images represent. Making textual representations work would reduce the memory requirements of computer use.

Environment Compute

We see enormous investment potential in environment compute, not just RL compute itself. A single high-fidelity environment that is hard to reward-hack might occupy dozens or hundreds of CPUs at once. This is a brand-new field with room to scale, and because the signal is so clean, greater realism can deliver astonishing performance gains. In the future these environments will also run on GPUs simulating digital twins of the real world. Notably, the GPUs required are different: they still need graphics/rendering capability, such as RTX Pro or consumer GPUs, whereas AI-specific GPUs and ASICs (H100, B200, TPU, Trainium, and so on) lack much of the important graphics/rendering hardware. Significant resources are therefore being invested in building AI world models for RL environments, as opposed to the conventional RL environments described elsewhere. That would make scaling easier; otherwise environment complexity surges with every heterogeneous flavor of software and hardware.

Reliable, scalable, easy-to-deploy environments will face tremendous demand, and we expect this to become a thriving area for startups; several have already launched. For some capabilities the bottleneck is not model capability (o3 is smart enough for most tasks) but the ability to interact with the world and gather context. We find this particularly exciting for AI in science: environments can be connected to anything measurable in a laboratory, letting AI agents control the physical world, manipulate and change various factors, and receive feedback from the environment. In some cases, such as controlling a furnace's temperature, the feedback loop is relatively fast and the model can iterate quickly. But for other valuable experiments that take a long time to run, the model will need a matching coherence time. Coupled with the need for many iterations, this can lead to setups that are demanding both computationally and physically.

In fields such as biology, semiconductor manufacturing, and materials science more broadly, the feedback loops of the rollouts/experiments the model is running and testing matter greatly. These biological, manufacturing, and industrial processes are limited in how fast they can be run and validated. In some areas, RL compute will take much longer to make an impact, while others can change rapidly thanks to quick feedback loops.
The inherent feedback loops of physical AI are slower than those in the digital world, which is why very capable digital-twin environments are required.

Analogy with Evaluation

Even model evaluations, which are conceptually far simpler, are difficult to run. Docker images often fail, and a simple format change in a multiple-choice question (for example, switching options from (A) to (1)) can alter a model's measured performance by up to 5%. Anthropic has publicly discussed the engineering challenges of evaluation as eval infrastructure scaled up. GPQA, a commonly used eval testing models on graduate-level physics, chemistry, and biology, appears to have a "noise ceiling": models look as if they are stagnating, but 100% accuracy is impossible because some of the answer labels are wrong.

In many ways the problems get worse as agent tasks get longer. The action space available to the model grows significantly, its coherence time keeps increasing, and creating evals that can assess these long-horizon capabilities is extremely challenging. It also significantly increases evaluation costs. Eval infrastructure is not new and the concept is simple, but the cumbersome engineering involved is enough to make it fail, and building and scaling large RL infrastructure can cost millions of dollars.

RL Changes the Balance of Hardware and Data Center Construction

Nvidia's NVL72 systems for GB200 and GB300 deliver key advances for inference. The increased compute allows higher throughput at lower latency, and the shared memory allows a larger world size over which to distribute the KV cache. While this enables better batching of reasoning models at inference time, it also has a significant impact on RL.

For RL, the increased memory supports several capabilities. First, it allows more rollouts on a given problem. Second, it handles long-horizon agent tasks better. Third, it better accommodates larger or more reasoning-capable judge models, which is particularly helpful in unverifiable domains. Fourth, this paradigm relies heavily on synthetic data generation and filtering, which in turn depend on inference, and the NVL72 systems excel there. Underutilization, however, is the hard part. In online RL there can be a gap between when the last rollout finishes and when the first one did. Load balancing across all the different sampling replicas is very difficult, and because samplers and trainers use different topologies, weight broadcasting can also cause severe underutilization.

Every stage of RL requires inference, but that inference does not need to be as centralized as training has been. RL requires a lot of compute, but the compute does not have to sit in one place: synthetic data for one domain can be generated and verified in one data center while training happens in an entirely different one. As RL comes to dominate compute, we may see a shift in how data centers are built.
While pre-training scale still requires the largest multi-gigawatt data centers, just how decentralized reinforcement learning will become remains to be seen. Unlike pre-training, which occupies tens of thousands of GPUs at once, the inference time devoted to RL can be adjusted around available capacity. That means labs can now put GPUs to work during off-peak hours, for example generating synthetic data for their RL pipelines. In fact, we know of at least one lab that is exploiting underutilized inference clusters this way, effectively delivering free compute to training through synthetic data generation. Inside the labs, the boundary between inference and training will continue to blur, giving models access to more compute than just the largest training clusters. This underutilized compute is effectively free for training, since inference clusters have to be provisioned for peak demand anyway.

Prime Intellect demonstrates the decentralized character of RL with its Intellect-2 model, a reasoning model trained via globally distributed RL.

On the hardware side, more inference and longer agent tasks make memory more important. RL uses fewer FLOPs than pre-training but still requires a significant amount of memory. Over the long run, hardware development will shift accordingly, including in other areas such as network topology. We believe RL is changing not only hardware design but also the way research is organized.

RL is Changing the Structure of Laboratories

Reinforcement learning for language models is one of the first cases where inference is truly part of the training process: inference performance now directly affects training speed. This means production-grade inference (fast, efficient, and inexpensive) has become an integral part of model training. Labs used to distinguish between "product serving inference" and "internal inference" (e.g., for evals), but given the massive inference RL requires, it is crucial to build a highly optimized inference stack directly into the training stack. We see this reflected in company structure: OpenAI merged its research and applied inference teams, and Anthropic and Google have likewise significantly restructured their production and internal teams. One consequence of this paradigm shift is the need for substantial inference compute.

RL Allows for Frequent Model Updates

A significant difference from the pre-training regime is that RL can be run after a model is released. A model can be released, have its capabilities extended with continued RL, and then be updated again. This iterative development can gradually improve an existing model, and it is exactly what the new version of DeepSeek R1 did. This has generally been true of post-training: today's GPT-4o has been updated multiple times and is no longer the same model that was initially released. Thanks to the new paradigm, we expect Anthropic to update its Claude models more frequently than before.

Recursive Self-Improvement Has Started to Take Effect

In discussing RL we noted the self-improvement that comes from better models making better judges, but there is another important dimension to consider.
The idea is that the model itself helps train and code the next model. The Claude 4 system card gives us concrete insight into how labs are thinking about this: Anthropic evaluated the model on compiler development, kernel engineering, and even reinforcement learning for quadrupedal robots. The fact is that much of the work done inside labs is hard engineering aimed at squeezing every last bit of performance out of the available hardware. Compilers, kernels, memory-management optimization, hyperparameter tuning, and so on are all coding tasks that can be measured and improved, and each has a huge impact on model efficiency. The term "recursive self-improvement" often sounds exotic, but in fact it is already happening to some extent. Labs can also double down on these tasks through reinforcement learning, and they have plenty of internal models capable of doing the work. Most of this will initially revolve around unglamorous, tedious work, gradually shifting toward research on new architectures.

Current models do not significantly accelerate development speed yet, but OpenAI's Codex tool is already helping employees build the next version. The way to think about self-improvement is that models will let engineers spend less time coding and more time thinking about research and data. To the extent that model development is bottlenecked by engineering work, those bottlenecks will be removed; in reality, though, model development is bottlenecked by various other factors as well, including access to compute. True recursive self-improvement will also greatly accelerate research and data.

Tool Usage and o3

The effectiveness of reinforcement learning is clearly demonstrated in the o3 model, especially in its advanced use of external tools. o3 shows that intelligence is certainly useful, but the ability to access and effectively use tools matters even more. OpenAI did several things to achieve this capability. The first is ensuring the model can access tools at all. This is part of the broader infrastructure discussed in this report (e.g., access to environments). At the model level, access can be triggered by special tokens: the model emits a special token to trigger, say, an external search, and the structured results are returned for direct use in its reasoning. Giving the model several different special tokens lets it reach different environments quickly and easily.

Another challenge is selecting the right set of training problems. Even if the model can use tools, it may choose not to use them if they are not needed. Training the model effectively requires posing problems difficult enough to genuinely require tools, so the model learns to lean on external resources naturally. This is difficult to get right and requires extensive testing to validate. At the same time, over-reliance on tools can degrade performance, muddy the reward signal, and reduce overall efficiency. Other factors include ensuring rollouts start from many different initial states, with multiple responses per starting point, to aid stability and learning efficiency; adding penalties for incorrectly formatted outputs; and rewarding correct use of the special tags.
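A minimal sketch of the special-token tool loop described above might look like this. The `<search>`/`<python>` tag format, the dispatch table, and the helper names are assumptions for illustration, not OpenAI's actual implementation.

```python
import re
from typing import Callable, Dict

# Hypothetical tag format: the model emits e.g. <search>weather in Paris</search>
TOOL_PATTERN = re.compile(r"<(search|python)>(.*?)</\1>", re.DOTALL)

def run_search(query: str) -> str:
    """Placeholder for a real search backend."""
    raise NotImplementedError

def run_python(code: str) -> str:
    """Placeholder for a sandboxed Python executor."""
    raise NotImplementedError

TOOLS: Dict[str, Callable[[str], str]] = {"search": run_search, "python": run_python}

def tool_loop(generate: Callable[[str], str], prompt: str, max_calls: int = 8) -> str:
    """Alternate between model generation and tool execution.

    Each time the model emits a tool tag, the call is executed and its result is
    appended to the context as a structured observation, then generation resumes.
    """
    context = prompt
    for _ in range(max_calls):
        output = generate(context)
        match = TOOL_PATTERN.search(output)
        if match is None:
            return output  # no tool call: treat this as the final answer
        tool, arg = match.group(1), match.group(2).strip()
        result = TOOLS[tool](arg)
        context += output[: match.end()] + f"\n<result>{result}</result>\n"
    return generate(context)  # budget exhausted: force a final answer
```

The formatting penalties mentioned above would apply, for example, when the model emits a tag the regex cannot parse, while correct tag use earns a small positive reward.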
Creating o3, then, meant giving the model access to multiple tools (e.g., through special tokens) and training it on questions that force it to use those tools.

Why does o3 hallucinate?

Despite o3's strong ability to look things up and do research, it is notorious for hallucinating: the model often fabricates facts, and the problem worsens as RL compute scales. Why is this the case? We believe it traces back to how these models are trained. Models typically receive rewards only for correct results and are not penalized for flawed reasoning that happens to reach the right answer. A model might win a simple board game despite misunderstanding its rules, and conclude that its flawed reasoning is acceptable. This not only fails to penalize the model for its erroneous thinking but actively rewards it.

We expect this behavior extends well beyond board games. It inadvertently teaches the model to hallucinate in new, untrained scenarios, carrying flawed reasoning into broader contexts. Using reasoning models as judges will help to some extent, since they can assess the entire reasoning trajectory. Other ideas include more specific reward signals, different rewards for each token, and penalizing incorrect logic while still rewarding correct answers. To be clear, this misdirected reward behavior can also affect things like code: a model may write poor code that still passes the unit tests, which further underlines the necessity of having the correct reward function.
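One of the mitigations listed above, combining the outcome reward with a penalty on flawed reasoning steps, could be sketched roughly as follows. The weighting and the `judge_step` helper (an LLM judge flagging unsound steps) are assumptions, not a published recipe.

```python
from typing import Callable, List

def shaped_reward(
    reasoning_steps: List[str],
    final_answer_correct: bool,
    judge_step: Callable[[str], bool],  # LLM judge: True if the step is logically sound
    step_penalty: float = 0.1,
) -> float:
    """Outcome reward minus a penalty for each reasoning step the judge flags as unsound.

    A lucky win reached through flawed logic no longer scores as highly as a clean
    solution, which is the adjustment the hallucination argument above calls for.
    """
    outcome = 1.0 if final_answer_correct else 0.0
    flawed_steps = sum(1 for step in reasoning_steps if not judge_step(step))
    return outcome - step_penalty * flawed_steps
```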