Modeling Training

In a previous blog post I combined the Gibbs policy and Smoothed IGW to develop a sampling strategy for LLMs which minimizes (smoothed) online regret when playing a single action per prompt.

While potentially useful in a continual learning world, this is a bad match for modeling training for two reasons:

We don’t care about bad performance during training, as long as the final model is good. Simple regret is a better fit, since it only cares about the quality of the model that results from training.
We aren’t hooked up to reality, we are hooked up to a simulator. So we can play multiple actions and observe multiple rewards (we can also play the same action multiple times to assess reward variance).

Play two policies, not one

When optimizing simple regret, we play two policies: an exploit policy (which will be deployed after training), and an explore policy (which will be utilized during training). Because the performance of the explore policy doesn’t matter, it is optimal when it provides the most improvement in the exploit policy during training.

Most papers I read in the space play a single policy: they use the policy they are going to deploy (exploit) during training (explore). There’s a large concern about entropy collapse in the LLM RL literature: I think a big piece of the puzzle is only playing one policy. When playing two policies, you are fine with the exploit policy losing entropy as it focuses on high reward actions.¹

Play multiple actions

Because training is done in simulation, the exploration policy plays a multiset (of actions). This clarifies the need for diversity in exploration, since playing the same action repeatedly is a poor strategy unless there is substantial label noise. For example, in linear contextual semi-bandits, the parameter uncertainty is an ellipsoid induced by $V_t$, the actions are vectors, and optimal allocations for best-arm identification balance uncertainty in directions $a – a^*$ against the squared reward gap. A computationally convenient plug-in proxy given $V_t$, an estimate $\hat{mu}_t$ of the reward function, and the current best action estimate $\hat{a}_t^*$, is to pick a multiset via
$$
\begin{align*}
A_t &= \arg\min_{A:|A|=n} \max_{a \neq \hat{a}_t^*} \frac{\| a – \hat{a}_t^* \|_{(V_t + V(A))^{-1}}^2}{\hat{\Delta}_t(a)^2}, \\ \hat{\Delta}_t(a) &= \max\left(\epsilon, \hat{\mu}_t^\top(\hat{a}_t^* – a)\right).
\end{align*}
$$
The parameter $\epsilon$ prevents numerical difficulties but also represents a simple regret tolerance (acceptable suboptimality). This naturally produces diverse nearly-orthogonal action vectors.

Papers on Exploration

There have been some excellent papers recently that improve the exploration policy.

DRA-GRPO uses a small embedding model to approximate correlation between answers, and then changes the reward during training to discourage correlated answers. BoN changes the reward during training to be max@N (pass@N for binary rewards). Both of these papers equate the exploitation and exploration policies (i.e., they deploy the policy they trained with). In the case of BoN, they argue inference is assessed via max@N; in DRA-GRPO the argument that a more diverse policy is better at deployment time is more implicit. I suspect in both cases they could have synthesized an exploit policy from these training runs that was even better for pass@1, but that policy would (appropriately) have low entropy. A curious aspect of both of these papers: they want to maximize a function of the set of rollouts, but they sample independently. This is computationally felicitous and apparently effective, although my intuition says dependent sampling would be more statistically efficient (but compute costs do matter: many parallel samples might dominate a few dependent samples for the same compute budget).

SESA distinguishes between the exploration policy, which is conditioned on a sequence of solution sketches at training time, and the exploitation policy, which is done using ordinary autoregressive generation without solution sketch conditioning. There’s a lot to like about this paper. First, they do dependent sampling, but only of solution sketches, and subsequent full solution generation is done in parallel. That’s a nice computational compromise (although it breaks down in the agentic setting, where dependent sampling is done per action). Second, they actually train the exploitation policy during training. However they don’t train the exploration policy, which shares parameters with the exploitation policy (they differ only in prompt). My guess is training both policies would give the best result.

HOPE uses an LLM to analyze rollouts and propose alternative rollouts that would lead to better outcomes. I really like this design: eventually we need to be bitter-lesson-pilled and learn to explore, and like all previous learning problems, learning to explore can benefit from a good prior: thus, starting with a prompted LLM driving exploration is sensible. HOPE trains the exploitation policy using the (off-policy) data from the exploration strategy via Q-learning. They observe gains even though the hindsight proposer (exploration policy) is not trained.²

Questions

I’m fairly certain there should be distinct explore vs. exploit policies. Explore would be used during training and exploit would be a consequence of training. Can these share parameters (e.g., differ only in prompts like HOPE)? More generally, how can we get the exploit policy at low (no?) additional cost while using the explore policy?

Dependent or independent sampling? Dependent sampling feels like a good idea, but the above papers indicate set functions can be optimized with independent sampling (which is more efficient). Perhaps in large action spaces, the policy is rarely self-colliding and independent samples are good enough? But theoretically the explore policy should be maximizing learning progress, and the HOPE paper indicates (at least in the “rewards are rare” setting) that dependent sampling has superior statistical efficiency, enough to overcome the computational inefficiency.

What’s a good metric of exploration quality? I’d like to bitter-lesson-pill this. In other words, I don’t want to hand design an exploration algorithm. Instead, I’d prefer defining a measure of exploration quality and optimizing it. Is pass@k (or for continuous rewards, max@k) a good measure of exploration quality? Intuitively this feels plausible, at least for binary rewards with no label noise when reward 1 is always realizable. It would be great to find something related to the Hellinger distance term in simple DEC.

Arguably some entropy during exploitation is good to hedge against environmental drift; and furthermore EVOL-RL argues more diversity in latent CoT and less diversity in final answer is a good prior for LLMs. Nonetheless the point remains: if some entropy is good for exploitation, even more entropy is good for exploration. ↩︎
They do distill the proposer for budgetary reasons, and this distillation does filter on successful counterfactuals, so there is some mild training of the proposer. ↩︎