A comedy of automated autism.
My previous post about Softmax Muon is (as far as I can tell) technically correct, but unlikely to lead to improvements in the GPT speedrun because:
- The softmax gradient is already in the ambient null space.
- The Hessian is diagonal in the ambient null space.
- Adam is very good at mitigating diagonal curvature.
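Two of these points can be sanity-checked numerically. The sketch below is mine, not code from the speedrun, and it rests on one reading of the first bullet: that the cross-entropy gradient with respect to the logits has no component along the all-ones direction, which is the direction softmax ignores. It also shows the well-known fact that Adam's first bias-corrected step has roughly the same magnitude in every coordinate, whatever the per-coordinate curvature. All function names are hypothetical, and the diagonal quadratic is a toy stand-in for the real loss.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability; this does not change the result.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def xent_grad_wrt_logits(logits, target):
    # Gradient of -log softmax(logits)[target] with respect to the logits: p - onehot.
    p = softmax(logits)
    return [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]

def adam_first_step(grad, lr=1e-3, eps=1e-8):
    # On the first step the bias-corrected moments are m_hat = g and v_hat = g^2,
    # so the update is -lr * g / (|g| + eps), i.e. about -lr * sign(g).
    return [-lr * g / (math.sqrt(g * g) + eps) for g in grad]

logits = [2.0, -1.0, 0.5, 3.0]
grad = xent_grad_wrt_logits(logits, target=2)
grad_sum = sum(grad)  # zero up to float rounding: no component along all-ones

# Shifting every logit by a constant leaves softmax unchanged.
shifted = [z + 7.0 for z in logits]
max_shift_diff = max(abs(a - b) for a, b in zip(softmax(logits), softmax(shifted)))

# Toy diagonal quadratic f(x) = 0.5 * sum(h_i * x_i^2) with very different curvatures.
curvatures = [1.0, 100.0, 1e4]
x = [1.0, 1.0, 1.0]
quad_grad = [h * x_i for h, x_i in zip(curvatures, x)]
steps = adam_first_step(quad_grad)  # each entry is close to -lr

print(grad_sum, max_shift_diff, steps)
```

Plain gradient descent on the same quadratic would take steps proportional to the curvatures, so the three coordinates would move at rates differing by four orders of magnitude. Adam's per-coordinate normalization removes that spread, which is the sense in which it handles diagonal curvature well.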
Nonetheless, the coding agents dutifully implemented the changes, so I’ll play around with the modified update.
Was this a fail? Not really: I now understand why Adam is a good fit for the unembedding matrix. But perhaps it was wasteful? I directed automation to generate code that a little self-reflection would have shown to be unnecessary, but code is very cheap now.