Steepish Descent

Unleash the chains of thought.

Vibe Overthinking It

A comedy of automated autism.

My previous post about Softmax Muon is (as far I can tell) technically correct, but unlikely to lead to improvements in the GPT speedrun because

  1. The softmax gradient is already in the ambient null space.
  2. The Hessian is diagonal in the ambient null space.
  3. Adam is very good at mitigating diagonal curvature.

Nonetheless the coding agents dutifully implemented the changes, so I’ll play around with the modified update.

Was this a fail? Not really: I now understand why Adam is a good fit for the unembedding matrix. But perhaps it was wasteful? I directed automation to generate some code that was unnecessary given a little self-reflection, but code is very cheap now.

Leave a Reply

Discover more from Steepish Descent

Subscribe now to keep reading and get access to the full archive.

Continue reading