TROLL: Trust Regions improve Reinforcement Learning for Large Language Models

Philipp Becker^1*, Niklas Freymuth^1*, Serge Thilges¹, Fabian Otto², Gerhard Neumann¹

¹Karlsruhe Institute of Technology
²Microsoft Research
^*Equal contribution. Author order was decided by a fair coin flip.

Abstract

On-policy Reinforcement Learning (RL) with PPO-like clip objectives has become the standard choice for reward-based fine-tuning of large language models (LLMs). Although recent work has explored improved estimators of advantages and normalization, the clipping mechanism itself has remained untouched. Originally introduced as a proxy for principled KL-based trust regions, clipping is a crude approximation that often causes unstable updates and suboptimal performance. We replace the clip objective with a novel discrete differentiable trust region projection, which provides principled token-level KL constraints. The projection operates on a sparse subset of the model's most important token logits to balance computational cost and projection effectiveness. Our approach, Trust Region Optimization for Large Language Models (TROLL), serves as a direct replacement for PPO-like clipping during training and does not alter the model's inference behavior. Across datasets, model families, and advantage-estimation methods, TROLL consistently outperforms PPO-like clipping in terms of training speed, stability, and final success rates.

Method

PPO and many of its variants, such as GRPO, Dr.GRPO and GSPO, rely on heuristic clipping of importance ratios to stabilize policy updates. Since this clipping is a crude approximation of full trust regions, it often leads to unstable training and suboptimal performance. Trust Region Optimization for Large Language Models (TROLL) instead acts as a drop-in replacement for the PPO-like clip objective that uses principled trust regions via KL constraints between token distributions.

The Problem with PPO Clipping

PPO and its many derivatives control policy update steps by clipping the importance ratio $r_t = \frac{\tilde{\pi}(o_t | q, o_{<t})}{\pi^{\text{old}}(o_t | q, o_{<t})}$ between current and old policy around $1$, i.e.,

$$J_{\text{PPO}} = \mathbb{E}_{o_t \sim \pi^{\text{old}}} \left[ \frac{1}{|o|} \sum_{t=1}^{|o|} \min(r_t A_t, \text{clip}(r_t, 1-\epsilon_{\text{PPO}}, 1+\epsilon_{\text{PPO}}) A_t) \right].$$

This clipping makes training more stable but truncates gradients, pushes the policy to be overly cautious, and still allows arbitrarily large KL divergences, causing stability and convergence issues. Standard PPO keeps policies close by clipping the importance ratio $r_t = \frac{\tilde{\pi}(o_t | q, o_{<t})}{\pi^{\text{old}}(o_t | q, o_{<t})}$ around 1:

$$J_{\text{PPO}} = \mathbb{E}_{o_t \sim \pi^{\text{old}}} \left[ \frac{1}{|o|} \sum_{t=1}^{|o|} \min(r_t A_t, \text{clip}(r_t, 1-\epsilon_{\text{PPO}}, 1+\epsilon_{\text{PPO}}) A_t) \right]$$

This heuristic clipping suppresses gradients when ratios exceed the threshold, discarding valuable gradient information and leading to unstable learning.

TROLL's Differentiable Trust Region Projection

TROLL projects each token's output distribution onto a KL-trust region around the old policy:

$$\arg\min_{\pi(o_t | q, o_{<t})} \text{KL}(\pi(o_t | q, o_{<t}) \| \tilde{\pi}(o_t | q, o_{<t}))$$

subject to $\text{KL}(\pi(o_t | q, o_{<t}) \| \pi^{\text{old}}(o_t | q, o_{<t})) \leq \epsilon$. Here, $\tilde{\pi}$ is the unprojected new policy, $\pi^{\text{old}}$ is the old policy, and $\epsilon$ is the trust region bound. The solution is a geometric interpolation

$$\pi(o_t | q, o_{<t}) \propto \exp\left(\frac{\eta^* \log \pi^{\text{old}}(o_t | q, o_{<t}) + \tilde{\pi}(o_t | q, o_{<t})}{\eta^* + 1}\right)$$

between old and new policy logits, which uses $\eta^*$ as a step size. The optimal $\eta^*$ is found by solving a one-dimensional convex dual problem, which can be fully parallelized and solved with bracketing.

The TROLL Objective

TROLL acts on the projected policy and combines its update with a regression term to this projection, i.e.,

$$J_{\text{TROLL}} = \mathbb{E}_{o_t \sim \pi^{\text{old}}} \left[ \frac{1}{|o|} \sum_{t=1}^{|o|} \frac{\pi(o_t | q, o_{<t})}{\pi^{\text{old}}(o_t | q, o_{<t})} A_t \right] - \alpha \text{KL}(\tilde{\pi}(o_t | q, o_{<t}) \| \lfloor \pi(o_t | q, o_{<t})\rfloor ) $$

This approach is fully differentiable and has closed-form gradients through the projection. Unlike clipping, it preserves gradients and mathematically ensures a fixed KL bound between the old and new policy.

Scaling TROLL with Sparsity

LLM tokenizers have vocabularies with tens of thousands of tokens, which are too expensive to directly store and project. However, natural language follows a power-law, where a few tokens take up most of the probability mass. We keep only the highest probability tokens up to either $K$ tokens or a cumulative mass of $1-\delta$. With $K=64$ and $\delta=10^{-5}$, $5{-}10$ tokens typically suffice to cover $99.999\%$ of probability mass. This greedy selection accurately approximates the full distribution, and makes TROLL computationally feasible for arbitrary vocabulary sizes.

TROLL sparsification and trust region projection schematic

Schematic overview of TROLL. The token-wise logit distribution is sparsified for both old and new policies. The new policy's distribution is then projected onto a trust region of the old.

Results

Troll and Clip with GRPO on different Qwen3 models

$3{-}10\%$ absolute improvements across different Qwen3 model sizes.
TROLL (solid lines) consistently outperforms Clip (dashed lines) on Qwen3 models (600M-14B, colors) for DAPO-Math when trained with GRPO. The 4B TROLL model nearly matches the 14B clipped model's performance.

Drop-in replacement for different on-policy methods.
TROLL (solid lines) improves GRPO, Dr.GRPO, PPO, and GSPO (colors) performance compared to Clip (dashed lines), here for Qwen3-$8$B. Most dramatically, GSPO (Clip) fails completely, while GSPO (TROLL) achieves $70\%{+}$ success rates.

Works across model families and datasets.
TROLL improves different Qwen, Llama, SmolLM and Apertus models and outperforms Clip on GSM8K, DAPO-Math, and Eurus-2-RL-Math.

Training stability and computational overhead

Better training stability with low overhead.
TROLL works well for a range of hyperparameters and preserves policy entropy compared to Clip. It comes with a manageable overhead that is constant in model size.

Key Insights

Why TROLL Works

Compared to the PPO-like Clip, TROLL provides a drop-in replacement that

Preserves gradients, even through constrained updates.
Provides principled constraints that use actual KL divergence rather than heuristic ratio clipping

The result is more stable training, faster convergence, and better final performance on various model families, RL algorithms and verifyable math datasets.

Practical Considerations

Sparsification: On average, $5{-}10$ tokens cover $99.999$% of probability mass. This property allows sparsification, which vastly reduces computational cost while staying close to the true token distributions.

Hyperparameters: TROLL uses the same hyperparameters for all experiments. We set $K{=}64$ maximum sparse tokens, $\delta{=}1e{-}5$ for the mass threshold, $\epsilon{=}0.05$ for the KL bound, and $\alpha{=}1$ for the regression weight. TROLL is robust to all its hyperparameters.

Computational cost: TROLL has ${\sim}5\%$ memory and ${\sim}10\%$ runtime overhead on $4$B models, with overhead decreasing relatively as models grow larger.

BibTeX

@article{becker2025troll, title={TROLL: Trust Regions improve Reinforcement Learning for Large Language Models}, author={Becker, Philipp and Freymuth, Niklas and Thilges, Serge and Otto, Fabian and Neumann, Gerhard}, journal={arXiv preprint}, year={2025} }