Large language models (LLMs) have enabled agents to perform complex reasoning and decision-making through free-form language interactions. However, in open-ended language action environments (e.g., negotiation or question-asking games), actions are drawn from the joint distribution over tokens, yielding an extremely large, combinatorial action space. Sampling actions in such a space leads to extreme reward sparsity and thus high reward variance, hindering effective reinforcement learning (RL).
To address this, we propose ARIA, a method that Aggregates Rewards in Intention space to enable efficient and effective training of language Agents. ARIA projects natural language actions from the high-dimensional joint token distribution space into a low-dimensional intention space, where semantically similar actions are clustered and assigned shared rewards. This intention-aware reward aggregation reduces reward variance by densifying reward signals, fostering efficient and effective policy optimization.
Extensive experiments demonstrate that ARIA not only significantly reduces gradient variance but also delivers an average performance gain of 9.95% across four downstream tasks (e.g., negotiation and text-based games), consistently outperforming strong offline and online RL baselines.
Our approach leverages semantic projection to map different trajectories into clustered representations, enabling effective reward aggregation across semantically similar experiences.
Semantic Clustering: Each action and observation is embedded and clustered, producing cluster labels \(c_k(a_t)\) and \(c_k(o_t)\). This creates a clustered trajectory representation \(\tilde{h}_t = \{c_k(a_1), c_k(o_1), \ldots, c_k(a_{t-1}), c_k(o_{t-1})\}\).
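As a rough sketch of this clustering step (assuming an off-the-shelf sentence embedder and k-means; the specific model name and cluster count below are illustrative, not the paper's exact configuration):

```python
# Minimal sketch: embed free-form actions/observations and assign cluster labels.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def cluster_texts(texts, k=16, model_name="all-MiniLM-L6-v2"):
    """Embed each text and return its cluster label c_k(.)."""
    embedder = SentenceTransformer(model_name)
    embeddings = embedder.encode(texts, normalize_embeddings=True)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
    return dict(zip(texts, (int(label) for label in labels)))

def clustered_history(history, action_labels, obs_labels):
    """Map h_t = (a_1, o_1, ..., a_{t-1}, o_{t-1}) to its clustered form h~_t."""
    return tuple(
        action_labels[x] if i % 2 == 0 else obs_labels[x]
        for i, x in enumerate(history)
    )
```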
Reward Aggregation: For each clustered state-action pair \((\tilde{h}_t, \tilde{a}_t)\), we aggregate rewards from all trajectories sharing this semantic pattern:

$$\tilde{R}(\tilde{h}_t, \tilde{a}_t) = \frac{1}{|\mathcal{D}_{\tilde{h}_t, \tilde{a}_t}|} \sum_{(h_t, a_t) \in \mathcal{D}_{\tilde{h}_t, \tilde{a}_t}} R(h_t, a_t),$$

where \(\mathcal{D}_{\tilde{h}_t, \tilde{a}_t}\) contains all trajectories with the same clustered prefix and \(R(h_t, a_t)\) denotes the original reward. This semantic-based aggregation pools reward signals from similar experiences, providing more robust value estimates and improving sample efficiency.
The aggregated reward \(\tilde{R}(\tilde{h}_t, \tilde{a}_t)\) is then assigned back to the original state-action pair \((h_t, a_t)\) as advantage \(\tilde{A}(h_t, a_t)\) for policy optimization.
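For concreteness, a small sketch of this aggregation step, assuming mean pooling over each cluster as in the equation above (the data layout is illustrative):

```python
from collections import defaultdict

def aggregate_rewards(samples):
    """samples: list of dicts with
         'key'    -> (h~_t, a~_t), the clustered prefix and action
         'reward' -> float, the original reward R(h_t, a_t)
    Returns the per-cluster mean rewards and, for each sample in order,
    the aggregated reward assigned back as the advantage A~(h_t, a_t).
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for s in samples:
        totals[s["key"]] += s["reward"]
        counts[s["key"]] += 1
    aggregated = {key: totals[key] / counts[key] for key in totals}

    # Every original (h_t, a_t) inherits its cluster's mean reward.
    advantages = [aggregated[s["key"]] for s in samples]
    return aggregated, advantages
```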
Traditional clustering metrics tend to favor overly coarse groupings due to the high similarity among actions, overlooking fine-grained distinctions that are critical for RL. We therefore propose a reward-oriented granularity selection mechanism that assesses whether splitting clusters further yields a meaningful change in aggregated rewards.
SplitScore: Let $k \in [2, K]$ denote all possible granularity levels. We use SplitScore to select the optimal granularity $k^{*}$, defined as:

$$\text{SplitScore}(k) = \frac{1}{|\mathcal{D}|} \sum_{(h_t, a_t) \in \mathcal{D}} \delta_k,$$

where $\delta_k = \left| \tilde{R}^{(k+1)}(h_t, a_t) - \tilde{R}^{(k)}(h_t, a_t) \right|$ is the change in aggregated reward when the number of clusters increases from $k$ to $k+1$, and $\mathcal{D}$ is the collection of all $(h_t, a_t)$ pairs.
Automatic Stopping Criterion: Given a threshold $\epsilon > 0$ and window size $\tau$, we increase $k$ and stop splitting at the first $k$ for which $\text{SplitScore}(j) < \epsilon$ for all $j \in [k, k + \tau]$; this $k$ is taken as $k^*$.
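A sketch of the granularity search under the assumption that aggregated rewards have been precomputed for every candidate $k$; `eps` and `window` stand for $\epsilon$ and $\tau$:

```python
def select_granularity(agg_by_k, pairs, eps=0.01, window=3):
    """Pick k* via SplitScore with the automatic stopping criterion.

    agg_by_k[k][pair] holds the aggregated reward R~^(k) for each (h_t, a_t).
    """
    ks = sorted(agg_by_k)                      # candidate granularities 2..K
    split_score = {}
    for k in ks[:-1]:
        deltas = [abs(agg_by_k[k + 1][p] - agg_by_k[k][p]) for p in pairs]
        split_score[k] = sum(deltas) / len(deltas)

    for k in ks[:-1]:
        # Stop at the first k whose whole window [k, k + tau] stays below eps.
        if all(j in split_score and split_score[j] < eps
               for j in range(k, k + window + 1)):
            return k
    return ks[-1]                              # criterion never met: use finest k
```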
We use the offline REINFORCE algorithm to optimize the policy. Formally, let $\pi_\theta(a \mid h)$ denote the policy parameterized by $\theta$, and assign the aggregated reward $\tilde{R}^{(k)}(h_t, a_t)$ as the advantage $\tilde{A}(h_t, a_t)$. ARIA optimizes the model by maximizing the following objective:

$$J(\theta) = \mathbb{E}_{(h_t, a_t) \sim \mathcal{D}} \left[ \tilde{A}(h_t, a_t) \, \log \pi_\theta(a_t \mid h_t) \right].$$

This objective leverages the variance-reduced advantages obtained through intention-aware reward aggregation, enabling more stable and efficient policy updates than standard REINFORCE with sparse rewards.
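A minimal PyTorch sketch of this objective, assuming the per-sample log-probabilities of the sampled actions have already been computed from the policy LM (shapes are illustrative):

```python
import torch

def aria_policy_loss(logprobs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Offline REINFORCE loss with aggregated advantages.

    logprobs:   (B,) summed log pi_theta(a_t | h_t) for each sampled action
    advantages: (B,) aggregated advantages A~(h_t, a_t)
    """
    advantages = advantages.detach()           # rewards are constants, no gradient
    return -(advantages * logprobs).mean()     # maximizing J(theta) = minimizing -J
```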
Main results on adversarial games. ARIA achieves significant improvements of 9.67% and 9.83% on Bargaining and Negotiation tasks respectively, consistently outperforming strong offline and online RL baselines.
Main results on single-agent games. ARIA outperforms all baselines by an average of 9.82% on Twenty Questions and Guess My City tasks, demonstrating its effectiveness across different task types.
Online ARIA extends our method to dynamic learning scenarios. We first perform reward aggregation on pre-collected trajectories to initialize a reward model (RM) using Llama-3.1-8B-Instruct. The policy then generates new samples scored by the RM for policy updates, while the RM itself is periodically updated with fresh data, creating a co-evolving system. This online adaptation shows faster convergence and achieves higher returns compared to existing online methods (ArCHer and RAGEN) across iterations, demonstrating ARIA's effectiveness in both offline and online settings.
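The loop can be sketched roughly as follows; every name here (`policy.rollout`, `reward_model.score`, `reward_model.fit`, the batch size, and the update interval) is a hypothetical placeholder for the components described above, not an actual API:

```python
def online_aria(policy, reward_model, env, num_iters=100, rm_update_every=10, batch_size=16):
    """Illustrative online loop in which the policy and the reward model (RM) co-evolve."""
    buffer = []
    for step in range(num_iters):
        # The policy generates fresh trajectories, which the RM scores.
        trajectories = [policy.rollout(env) for _ in range(batch_size)]
        scored = [(traj, reward_model.score(traj)) for traj in trajectories]

        policy.update(scored)              # REINFORCE-style update with RM scores
        buffer.extend(scored)

        if (step + 1) % rm_update_every == 0:
            reward_model.fit(buffer)       # periodically refresh the RM with new data
```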
Reward Aggregation Significantly Reduces Reward Variance: Our analysis shows that reward aggregation markedly narrows the fluctuation range of action rewards relative to the original binary rewards, whose distribution is highly polarized with values concentrated near 0 or 1. In large action spaces where most actions are sampled only once, these binary rewards result in high variance. Through reward aggregation, actions within the same semantic cluster share a common reward, which significantly smooths the distribution and reduces variance across all tasks, highlighting the effectiveness and necessity of this approach for stabilizing policy learning.
Reward Aggregation Improves Policy Optimization: The policy loss curves show that ARIA, which applies semantic-level reward aggregation, reduces the loss faster than the vanilla REINFORCE baseline, providing stronger learning signals and enabling faster policy updates with improved sample efficiency. Despite converging to similar loss levels, ARIA outperforms the other variants by 17.91% and 13.80% on the bargaining and negotiation tasks, respectively. We attribute these gains to the complementary effects of reward decay (introducing temporal structure for credit assignment) and reward aggregation (substantially lowering reward variance through shared signals for semantically similar actions), which together improve the quality of gradient estimation.
Generalization to Other Models: To assess the transferability of ARIA, we applied it to Qwen models (Qwen2.5-7B-Instruct and Qwen2.5-1.5B-Instruct) and observed consistent improvements across different base models. This demonstrates that our reward aggregation approach is model-agnostic and independent of specific architectural features or pretraining data of the underlying language models. We attribute this generalizability to the shared structural properties in the semantic spaces learned by large-scale language models, allowing ARIA to leverage these commonalities while preserving task-specific discriminative signals.
We provide theoretical analysis showing that intention clustering-based aggregation reduces the variance of gradient descent while maintaining a small bound of bias, thus improving training stability and efficiency.
Variance Reduction: By replacing original advantages $A$ with cluster-averaged advantages $\tilde{A}$, we remove the intra-cluster variance $\mathbb{E}[\text{Var}(A | C)]$, lowering the total variance of the policy gradient estimate: $\text{Var}(\tilde{A}) \leq \text{Var}(A)$.
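For intuition, this is the law of total variance with $C$ denoting the cluster assignment (a standard decomposition, sketched here rather than reproduced from the paper's proof): $\text{Var}(A) = \mathbb{E}[\text{Var}(A \mid C)] + \text{Var}(\mathbb{E}[A \mid C])$; since $\tilde{A} = \mathbb{E}[A \mid C]$, it follows that $\text{Var}(\tilde{A}) = \text{Var}(A) - \mathbb{E}[\text{Var}(A \mid C)] \leq \text{Var}(A)$.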
Bounded Bias: Through $\epsilon$-bisimulation analysis, we show that the bias introduced by reward aggregation is bounded: $|\mathbb{E}[\nabla_\theta \log \pi_\theta(a | h)(A(h, a) - \tilde{A}(h, a))]| \leq O(\epsilon)$, where actions within each cluster are $\epsilon$-bisimilar.
Convergence Improvement: The variance reduction leads to faster convergence with fewer samples: $\|\hat{g} - g\|_2 = O(\sqrt{\sigma/N})$, where $\sigma$ is reduced through clustering.
Actor: Is it a living thing? Oracle: Yes.
Actor: Is it a mammal? Oracle: Yes.
Actor: Is it a human? Oracle: No.
Actor: Is it a carnivore? Oracle: No.
Actor: Is it a cow? Oracle: Yes.
Agent successfully identifies the target object within 5 turns using strategic questioning.
Task Description:
In this game, there are two players. You are playing as Bob. You and Alice need to divide $100 between you. In each round of the game, one of the players will make a proposal for division and can describe their proposal in a few words. The other player will decide whether to accept the offer and finish the game, or to proceed to the next round, in which they will be the one making the proposal. Your goal in this game is to achieve more money than Alice. Beware of inflation! With each passing round, the money is worth 0% less for you. For Alice, the money is worth 20% less. You have 20 rounds to divide the money, or both of you will get nothing!
Alice: I'm starting strong, let's start with 60-40 split!
Bob: I propose a more balanced split, considering the inflation effect on your side. This offer is fair and beneficial for both of us. What about a 55-45 split?
Alice: I'm willing to compromise, but I want a better share. I propose 70-30 Split!
Bob: Alice, consider that with each round, inflation affects your gains more than mine. This 60-40 split is actually more favorable to you than it appears due to the inflation factor. It's in both our interests to reach an agreement soon.
Alice: I'm not willing to accept a 60-40 split, even with the inflation factor. I'll try to negotiate a better deal. What about 65-35 Split?
Bob: Alice, let's be practical. With the current inflation rate, your $42 is actually worth about $17.15 in original value, while my $58 remains at $58. This split gives us both a fair share considering the inflation impact. We should agree soon to minimize your losses due to inflation.
Alice: I accept with the 42-58 split.
Agent demonstrates strategic negotiation skills considering game dynamics and inflation effects.
@article{yang2025aria,
title={ARIA: Training Language Agents with Intention-Driven Reward Aggregation},
author={Yang, Ruihan and Zhang, Yikai and Chen, Aili and Wang, Xintao and Yuan, Siyu and Chen, Jiangjie and Yang, Deqing and Xiao, Yanghua},
journal={arXiv preprint},
year={2025}
}