The only thing better than running a learning algorithm is pretending to run it.
In my last post, I complained about pertinacity: the annoying property of standard policy gradient that encourages current behaviour even it is meh. Fundamentally this is because policy gradient (pretends to) sample from the current policy, so if the current policy is far away from a rewarding region, the signal is small. The movement of probability between meh regions does cancel out in exact expectation but with finite samples you end up spending a lot of data just reshuffling the deck chairs on the Titanic.
Wouldn’t it be nice to (pretend to) sample from a policy that has coverage in the high reward region? How about (pretending to) sample from an optimal policy?
Enter $\pi^*$
The DPO paper was very influential, because it reminded everybody about the power of the Donsker-Varadhan representation and the resulting Gibbs policy. This says that if you pretend to run the following optimization $$
\begin{align*}
\pi^* &= \arg \max_{\pi}\ \mathbb{E}_x\left[ \left. \mathbb{E}_{\substack{y \sim \pi(\cdot|x) \\ r \sim P(r|x,y)}}\left[r\right] – \lambda^{-1} \text{KL}(\pi(\cdot|x) \| h(\cdot|x)) \right| x\right] ,
\end{align*}
$$ then you end up with $$
\begin{align*}
\pi^*(y|x) &= Z(x)^{-1} h(y|x) \exp\left( \lambda \mathbb{E}\left[r | x, y\right]\right), \\
Z(x) &= \mathbb{E}_{y \sim h(\cdot|x)}\left[ \exp\left( \lambda \mathbb{E}\left[r | x, y\right]\right) \right].
\end{align*}
$$
Deriving Reward-Weighted Regression
So that leads to the question, can we (pretend to) sample from $\pi^*$? Let’s define as our loss function KL-divergence from our current policy to $\pi^*$, $$
\begin{align*}
\ell(\theta) &= \mathbb{E}_{\substack{x \sim P(x) \\ y \sim \pi^*(y | x)}}\left[ \log \frac{\pi^*(y|x)}{\pi(y | x; \theta)} \right] \\
&= \mathbb{E}_{\substack{x \sim P(x) \\ y \sim h(y | x)}}\left[ \frac{\pi^*(y | x)}{h(y|x)} \log \frac{\pi^*(y|x)}{\pi(y | x; \theta)} \right] \\
&= -\mathbb{E}_{\substack{x \sim P(x) \\ y \sim h(y | x)}}\left[ Z(x)^{-1} \exp\left(\lambda \mathbb{E}\left[r | x, y\right]\right) \log \pi(y | x; \theta) \right] + \text{const}.
\end{align*}
$$ This loss function pretends to sample from $\pi^*$ rather than the current policy, so hopefully has more reward signal. Given an empirical sample $\mathcal{D}=\{(y_n,r_n)\}_{n=1}^N$ collected at fixed $x$, a natural plug-in estimator is
\begin{align*}
\hat{\ell}_{\mathcal{D}}(\theta) &= -\frac{1}{N} \sum_n \frac{e^{\lambda r_n}}{\sum_m e^{\lambda r_m}} \log \pi(y_n|x,\theta).
\end{align*}
Looks promising! This is a convex combination of score functions, and it is invariant to a constant shift in the rewards. It can still be pertinacious if all the rewards are meh, but a common strategy in GRPO-variants is to discard rollout batches with all zero rewards, so that would ensure RWR will move probability from low to high reward regions. Since coefficients are positive, there is no puke.
Reward-weighted regression (RWR) is an old technique, and the exponential tilting variant is also well known, but for some reason it doesn’t seem very popular for LLMs. I found some references such as ReST, ILQL, and AWR. Overall PPO variants derived from GRPO seem way more popular. Since empiricism trumps theory, there is presumably something about RWR which sucks that I don’t appreciate.
Single-sample RWR is optimistic
A fun exercise is to determine the bias of the above plugin estimator. The leading term is due to conditional reward variance and replacing $\mathbb{E}\left[r | x, y\right]$ with a single-sample in the exponential: this bias does not go away as the number of datapoints increases, assuming that the same $y$ is never sampled twice. The next order term is from estimating $Z(x)^{-1}$, but this decays as $O(1/N)$. Overall you get the following $$
\begin{align*}
\mathbb{E}\left[\hat{\ell}_{\mathcal{D}}(\theta)\right] &= \mathbb{E}_{y \sim h(y|x)}\left[ \frac{e^{\left(\lambda \mathbb{E}\left[r | x, y\right] + \frac{\lambda^2}{2} \mathbb{V}\left[r | x, y\right] + O(\lambda^3)\right)}}{Z_{\text{eff}}(x)} \right] + O\left(\frac{1}{N}\right) \\
Z_{\text{eff}}(x) &= \mathbb{E}_{y \sim h(y|x)}\left[ e^{\left(\lambda \mathbb{E}\left[r | x, y\right] + \frac{\lambda^2}{2} \mathbb{V}\left[r | x, y\right] + O(\lambda^3)\right)} \right].
\end{align*}
$$ If rewards are conditionally deterministic given $(x, y)$ the variance term vanishes. For many RLVR setups of interest, rewards are conditionally deterministic (e.g., a math proof is either correct or it isn’t). If rewards are conditionally stochastic then the single-sample reward estimator is optimistic, but if the reward can be sampled multiple times for the same $(x, y)$ this can be mitigated. Alternatively, this optimism could be beneficial in stochastic environments, if more data is collected under the learned approximation to $\pi^*$ in order to define $(\pi^*)^*$.
Using the Gibbs policy as a process reward model
It’s becoming popular in the literature to use a trained policy as a process reward model, leveraging the identity $\log \pi^*(y|x) – \log h(y|x) = \lambda \log \mathbb{E}\left[r | y, x\right] – \log Z(x)$, e.g., PRIME and iStar. But these identities only hold over complete trajectories, what do they look like under autoregressively generated prefixes? $$
\begin{align*}
\pi^*(y_{1:t} | x) &= \sum_{y_{t+1:}} \pi^*(y|x) \\
&= \sum_{y_{t+1:}} Z(x)^{-1} h(y|x) \exp\left(\lambda \mathbb{E}\left[r | x, y\right]\right), \\
\frac{\pi^*(y_{1:t} | x)}{h(y_{1:t}|x)} &= Z(x)^{-1} \sum_{y_{t+1:}} h(y_{t+1:}|y_{1:t},x) \exp\left(\lambda \mathbb{E}\left[r | x, y\right]\right) \\
&= \frac{\mathbb{E}_{y_{t+1:} \sim h(y_{t+1:} | y_{1:t},x)}\left[\exp\left(\lambda \mathbb{E}\left[r | x, y\right] \right) \right]}{\mathbb{E}_{y \sim h(\cdot|x)}\left[ \exp\left( \lambda \mathbb{E}\left[r | x, y\right]\right) \right]}.
\end{align*}
$$ Actually quite nice! The density ratio looks like a ratio of $Z(x)$ (“how much (exponentiated) reward do I hit starting from prompt $x$ under the logging policy”) and the numerator (“how much (exponentiated) reward do I hit completing from prefix $y_{1:t}$ for prompt $x$ under the logging policy”). In fact, this thing is a champ for ranking prefixes because the partition function cancels,
$$
\begin{align*}
&\log \frac{\pi^*(y_{1:t} | x)}{h(y_{1:t}|x)} – \log \frac{\pi^*(y’_{1:t} | x)}{h(y’_{1:t}|x)} \\
&= \log \frac{\mathbb{E}_{y_{t+1:} \sim h(y_{t+1:} | y_{1:t},x)}\left[\exp\left(\lambda \mathbb{E}\left[r | x, y\right] \right) \right]}{\mathbb{E}_{y_{t+1:} \sim h(y_{t+1:} | y’_{1:t},x)}\left[\exp\left(\lambda \mathbb{E}\left[r | x, y\right] \right) \right]}.
\end{align*}
$$
Leave a Reply