ARIA: Training Language Agents with Intention-Driven Reward Aggregation

1Fudan University  2ByteDance Seed
* Equal Contribution
† Corresponding author
ARIA Overview

Overview of ARIA. ARIA first lets agents interact with the environment to collect trajectories, then performs semantic projection and aggregates rewards in the intention space, and finally updates the policy using the aggregated rewards.

Introduction

Large language models (LLMs) have enabled agents to perform complex reasoning and decision-making through free-form language interactions. However, in open-ended language action environments (e.g., negotiation or question-asking games), actions are free-form utterances drawn from a joint distribution over tokens, yielding an extremely large, combinatorial action space. Sampling actions in such a space leads to extreme reward sparsity and high reward variance, which hinders effective reinforcement learning (RL).

To address this, we propose ARIA, a method that Aggregates Rewards in Intention space to enable efficient and effective training of language Agents. ARIA projects natural language actions from the high-dimensional joint token distribution space into a low-dimensional intention space, where semantically similar actions are clustered and assigned shared rewards. This intention-aware reward aggregation densifies the reward signal and reduces reward variance, fostering efficient and effective policy optimization.

Extensive experiments demonstrate that ARIA not only significantly reduces gradient variance, but also delivers an average performance gain of 9.95% across four downstream tasks (e.g., negotiation and text-based games), consistently outperforming strong offline and online RL baselines.

Method

Semantic Projection and Reward Aggregation

Our approach leverages semantic projection to map different trajectories into clustered representations, enabling effective reward aggregation across semantically similar experiences.

Semantic Clustering: Each action and observation is embedded and clustered, producing cluster labels \(c_k(a_t)\) and \(c_k(o_t)\). This creates a clustered trajectory representation \(\tilde{h}_t = \{c_k(a_1), c_k(o_1), \ldots, c_k(a_{t-1}), c_k(o_{t-1})\}\).
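
For concreteness, here is a minimal sketch of the clustering step in Python, assuming the embeddings come from an arbitrary sentence encoder; the `cluster_texts` helper, the fixed number of clusters, and the toy data are illustrative assumptions, not ARIA's actual pipeline (in ARIA the granularity is chosen by the reward-oriented criterion described below).

```python
# Minimal sketch of semantic clustering (assumed setup: embeddings come from any
# sentence encoder; here we cluster precomputed vectors with scikit-learn's KMeans).
import numpy as np
from sklearn.cluster import KMeans

def cluster_texts(embeddings: np.ndarray, n_clusters: int, seed: int = 0) -> np.ndarray:
    """Map each action/observation embedding to a cluster label c_k(.)."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    return km.fit_predict(embeddings)  # shape: (num_texts,)

# Toy usage: 8 fake 4-d action embeddings, clustered into k = 3 intentions.
rng = np.random.default_rng(0)
action_embs = rng.normal(size=(8, 4))
action_labels = cluster_texts(action_embs, n_clusters=3)
# The clustered trajectory prefix h~_t is then the sequence of labels
# [c_k(a_1), c_k(o_1), ..., c_k(a_{t-1}), c_k(o_{t-1})].
```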

Reward Aggregation: For each clustered state-action pair \((\tilde{h}_t, \tilde{a}_t)\), we aggregate rewards from all trajectories sharing this semantic pattern:

$$\tilde{R}(\tilde{h}_t, \tilde{a}_t) = \frac{1}{|\mathcal{D}_{\tilde{h}_t, \tilde{a}_t}|} \sum_{\tau \in \mathcal{D}_{\tilde{h}_t, \tilde{a}_t}} \gamma^{T-t} R(\tau)$$

where \(\mathcal{D}_{\tilde{h}_t, \tilde{a}_t}\) contains all trajectories with the same clustered prefix and \(R(\tau)\) is the final reward of trajectory \(\tau\). This semantic-based aggregation pools reward signals from similar experiences, providing more robust value estimates and improving sample efficiency.

The aggregated reward \(\tilde{R}(\tilde{h}_t, \tilde{a}_t)\) is then assigned back to each original state-action pair \((h_t, a_t)\) as its advantage \(\tilde{A}(h_t, a_t)\) for policy optimization.
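
The following is a minimal sketch of this aggregation step, assuming each sample has already been reduced to its clustered prefix, clustered action, and discounted return; the `aggregate_rewards` helper and the data layout are illustrative assumptions, not ARIA's actual interface.

```python
# Group discounted returns by (clustered prefix, clustered action), average them,
# and assign the average back to every original (h_t, a_t) pair as its advantage.
from collections import defaultdict

def aggregate_rewards(samples):
    """samples: list of dicts with keys
         'prefix' : tuple of cluster labels representing h~_t
         'action' : cluster label of a~_t
         'return' : discounted terminal return gamma^(T-t) * R(tau)
    Returns a list of advantages aligned with `samples`."""
    sums, counts = defaultdict(float), defaultdict(int)
    for s in samples:
        key = (s["prefix"], s["action"])
        sums[key] += s["return"]
        counts[key] += 1
    # R~(h~_t, a~_t): mean return over all trajectories sharing the clustered prefix.
    means = {k: sums[k] / counts[k] for k in sums}
    return [means[(s["prefix"], s["action"])] for s in samples]

# Toy usage: the first two samples fall into the same intention cluster and share a reward.
samples = [
    {"prefix": (0, 2), "action": 1, "return": 1.0},
    {"prefix": (0, 2), "action": 1, "return": 0.0},
    {"prefix": (1, 0), "action": 2, "return": -1.0},
]
print(aggregate_rewards(samples))  # [0.5, 0.5, -1.0]
```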

Reward-Oriented Granularity Selection For Clustering

Traditional clustering metrics tend to favor overly coarse groupings because language actions are highly similar to one another, overlooking fine-grained distinctions that are critical for RL. We therefore propose a reward-oriented granularity selection mechanism that assesses whether splitting clusters further yields a meaningful change in the aggregated rewards.

SplitScore: Let $k \in [2, K]$ denote all possible granularity levels. We use SplitScore to select the optimal granularity $k^{*}$, defined as:

$$\text{SplitScore}(k) = \frac{\delta_k}{|\mathcal{D}|}, \qquad \delta_k = \sum_{(h_t, a_t) \in \mathcal{D}} \left| \tilde{R}^{(k+1)}(h_t, a_t) - \tilde{R}^{(k)}(h_t, a_t) \right|$$

where $\tilde{R}^{(k)}(h_t, a_t)$ is the aggregated reward assigned to $(h_t, a_t)$ under $k$ clusters and $\mathcal{D}$ is the collection of all $(h_t, a_t)$ pairs. SplitScore(k) thus measures the average change in aggregated rewards when the number of clusters increases from $k$ to $k+1$.

Automatic Stopping Criterion: Given a threshold $\epsilon > 0$ and a window size $w$, we increase $k$ and stop splitting at the first $k$ for which $\text{SplitScore}(j) < \epsilon$ for all $j \in [k, k + w]$; this $k$ is selected as $k^*$.
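
The search can be implemented as a simple loop, sketched below; the `aggregated_rewards(k)` callback (returning \(\tilde{R}^{(k)}\) for every pair in \(\mathcal{D}\) in a fixed order), the threshold, and the window values are assumptions for illustration.

```python
import numpy as np

def select_granularity(aggregated_rewards, k_min=2, k_max=50, eps=1e-3, window=3):
    """Pick k* = first k whose SplitScore stays below eps for `window` further splits.

    aggregated_rewards(k) -> np.ndarray of R~^(k)(h_t, a_t) for every pair in D,
    in a fixed order so consecutive values of k can be compared elementwise.
    """
    prev = aggregated_rewards(k_min)
    below = 0
    for k in range(k_min, k_max):
        curr = aggregated_rewards(k + 1)
        split_score = np.abs(curr - prev).sum() / len(prev)  # SplitScore(k) = delta_k / |D|
        if split_score < eps:
            below += 1
            if below == window + 1:
                # SplitScore(j) < eps for all j in [k - window, k]: stop splitting.
                return k - window
        else:
            below = 0
        prev = curr
    return k_max  # criterion never triggered; fall back to the largest k tried
```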

Offline REINFORCE with Aggregated Reward

We use the offline REINFORCE algorithm to optimize the policy. Formally, let $\pi_\theta(a \mid h)$ denote the policy parameterized by $\theta$, and use the aggregated reward $\tilde{R}^{(k^*)}(h_t,a_t)$ under the selected granularity $k^*$ as the advantage $\tilde{A}(h_t,a_t)$. ARIA optimizes the model by maximizing the following objective:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^T \log \pi_\theta(a_t \mid h_t) \cdot \tilde{A}(h_t,a_t) \right]$$

This objective leverages the variance-reduced advantages obtained through intention-aware reward aggregation, enabling more stable and efficient policy updates compared to standard REINFORCE with sparse rewards.
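
A minimal PyTorch sketch of this objective is given below, assuming per-token logits for the action conditioned on the history and one scalar aggregated advantage per step; the tensor shapes and the `reinforce_loss` helper are illustrative assumptions, and padded action tokens would need masking in practice.

```python
import torch
import torch.nn.functional as F

def reinforce_loss(action_logits: torch.Tensor,
                   action_ids: torch.Tensor,
                   advantages: torch.Tensor) -> torch.Tensor:
    """Offline REINFORCE with aggregated advantages.

    action_logits: (B, L, V) logits over the vocabulary for each action token,
                   conditioned on the history h_t (e.g., from a causal LM).
    action_ids   : (B, L) token ids of the sampled action a_t.
    advantages   : (B,) aggregated advantages A~(h_t, a_t), one per step.
    """
    log_probs = F.log_softmax(action_logits, dim=-1)                          # (B, L, V)
    token_logp = log_probs.gather(-1, action_ids.unsqueeze(-1)).squeeze(-1)   # (B, L)
    action_logp = token_logp.sum(dim=-1)  # log pi(a_t | h_t) as the sum of token log-probs
    # Maximizing J(theta) = E[ log pi(a_t | h_t) * A~(h_t, a_t) ] = minimizing its negative.
    return -(action_logp * advantages.detach()).mean()
```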

Experimental Results

Main Results

Analysis

Theoretical Analysis

We provide a theoretical analysis showing that intention clustering-based aggregation reduces the variance of the policy gradient estimate while introducing only a small, bounded bias, thus improving training stability and efficiency.

Variance Reduction: By the law of total variance, $\text{Var}(A) = \mathbb{E}[\text{Var}(A \mid C)] + \text{Var}(\mathbb{E}[A \mid C])$. Replacing the original advantages $A$ with cluster-averaged advantages $\tilde{A} = \mathbb{E}[A \mid C]$ removes the intra-cluster term $\mathbb{E}[\text{Var}(A \mid C)]$, lowering the total variance of the policy gradient estimate: $\text{Var}(\tilde{A}) \leq \text{Var}(A)$.
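
As a quick sanity check of this decomposition, the toy snippet below verifies the law of total variance on random data; the advantages and cluster assignments are made up purely for illustration and have nothing to do with the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=1000)            # original per-sample advantages
C = rng.integers(0, 5, size=1000)    # cluster assignment of each sample

# Cluster-averaged advantages: A~_i = E[A | C = C_i]
cluster_means = {c: A[C == c].mean() for c in np.unique(C)}
A_tilde = np.array([cluster_means[c] for c in C])

# Intra-cluster variance E[Var(A | C)], weighted by cluster frequency.
intra = sum((C == c).mean() * A[C == c].var() for c in np.unique(C))

print(A.var(), A_tilde.var(), A_tilde.var() + intra)
# Var(A~) <= Var(A), and Var(A~) + E[Var(A | C)] == Var(A) up to floating-point error.
```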

Bounded Bias: Through $\epsilon$-bisimulation analysis, we show that the bias introduced by reward aggregation is bounded: $|\mathbb{E}[\nabla_\theta \log \pi_\theta(a | h)(A(h, a) - \tilde{A}(h, a))]| \leq O(\epsilon)$, where actions within each cluster are $\epsilon$-bisimilar.

Convergence Improvement: The gradient estimation error scales as $\|\hat{g} - g\|_2 = O(\sqrt{\sigma/N})$, where $\sigma$ is the variance of the per-sample gradient estimate and $N$ is the number of samples. Since clustering reduces $\sigma$, the same accuracy is reached with fewer samples, leading to faster convergence.

Case Studies

BibTeX

@article{yang2025aria,
  title={ARIA: Training Language Agents with Intention-Driven Reward Aggregation},
  author={Yang, Ruihan and Zhang, Yikai and Chen, Aili and Wang, Xintao and Yuan, Siyu and Chen, Jiangjie and Yang, Deqing and Xiao, Yanghua},
  journal={arXiv preprint},
  year={2025}
}