X thread response | June 9, 2026
Six link cartpole lab.
A live browser sketch of the six link cartpole problem, plus a Modal GPU policy checkpoint and the training notes needed for the full reinforcement learning reproduction.
Rebuild target
- Observation: cart x, cart velocity, six link angles, six angular velocities.
- Action: start with a small force set, then test continuous cart force.
- Reward: upright chain, centered cart, low angular velocity, survival time.
- Curriculum: solve one link, then two, then mixed one to six link episodes.
- Randomization: mass, force magnitude, initial angle, episode horizon.
- Evaluation: held out seeds, impulse recovery, lower link transfer, failure map.
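The observation and curriculum bullets above can be sketched in a few lines. This is a minimal illustration, not the layout used in the original run: the zero-padding of unused links, the angle convention (radians measured from upright), and the names `CartpoleState`, `observation`, and `sample_num_links` are all assumptions made for the example.

```python
import random
from dataclasses import dataclass
from typing import List

MAX_LINKS = 6  # six link target from the list above


@dataclass
class CartpoleState:
    cart_x: float
    cart_v: float
    angles: List[float]    # one entry per active link, radians from upright (assumed)
    ang_vels: List[float]  # one entry per active link


def observation(state: CartpoleState) -> List[float]:
    """Flatten to a fixed 2 + 6 + 6 = 14 entry vector.

    Unused links are zero-padded so one policy can see one to six links.
    The padding scheme is an assumption, not something the source specifies.
    """
    pad = [0.0] * (MAX_LINKS - len(state.angles))
    return [state.cart_x, state.cart_v] + state.angles + pad + state.ang_vels + pad


def sample_num_links(stage: int, rng: random.Random) -> int:
    """Curriculum: stage 0 is one link, stage 1 is two links, later stages mix 1..6."""
    if stage == 0:
        return 1
    if stage == 1:
        return 2
    return rng.randint(1, MAX_LINKS)
```

A fixed-width observation keeps the policy input shape constant across the mixed-link stage, which is one simple way to make lower link transfer testable later.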
What the X thread says
- Yacine posted the six pendulum cartpole solve at 00:50 UTC on June 9, 2026.
- He said the run used PufferPPO, MuJoCo Warp, several RTX 4090s, and a Puffer minGRU policy with roughly one million parameters.
- The working trick was not just model size. It was environment speed, reward shaping, thousands of hyperparameter experiments, and randomized episode length.
- Later that day he posted a MuJoCo Playground reproduction at 120k steps per second, 200 million steps, and 27.8 minutes of training.
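The reproduction figures in the last bullet are internally consistent, which a quick arithmetic check shows. The variable names are illustrative only.

```python
total_steps = 200_000_000   # 200 million steps, from the thread
steps_per_second = 120_000  # 120k steps per second, from the thread

seconds = total_steps / steps_per_second
minutes = seconds / 60

print(f"{minutes:.1f} minutes")  # 27.8 minutes
```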
Implementation status
Modal-trained CEM checkpoint deployed.
The canvas runs a lightweight coupled-physics approximation and a checked-in policy trained on a Modal L4 GPU with cross-entropy method (CEM) search over a time-basis plus feedback model. This is progress, not the final Yacine-level MuJoCo/PufferPPO solve. The next checkpoint should increase sustained hold time. After that, the environment should move to MJWarp or MJX with PPO and recorded eval videos.
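For readers unfamiliar with cross-entropy search, here is a minimal sketch of the idea. It is not the deployed checkpoint's code: the sinusoidal time basis, the linear feedback gains, the population sizes, and the names `policy_force` and `cem` are assumptions chosen to keep the example small. In real use `score` would be an episode return from the simulator.

```python
import math
import random
from typing import Callable, List, Sequence


def policy_force(params: Sequence[float], t: float, obs: Sequence[float], n_basis: int) -> float:
    """Open-loop time basis plus linear state feedback (hypothetical parameterization)."""
    basis = sum(params[k] * math.sin((k + 1) * t) for k in range(n_basis))
    feedback = sum(g * o for g, o in zip(params[n_basis:], obs))
    return basis + feedback


def cem(score: Callable[[List[float]], float], dim: int, iters: int = 30,
        pop: int = 64, elite: int = 8, init_std: float = 2.0, seed: int = 0) -> List[float]:
    """Cross-entropy method: sample, keep the best, refit a diagonal Gaussian."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [init_std] * dim
    for _ in range(iters):
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(pop)]
        samples.sort(key=score, reverse=True)  # higher score is better
        elites = samples[:elite]
        mean = [sum(e[i] for e in elites) / elite for i in range(dim)]
        std = [
            max(1e-3, math.sqrt(sum((e[i] - mean[i]) ** 2 for e in elites) / elite))
            for i in range(dim)
        ]
    return mean
```

CEM needs no gradients, which makes it a reasonable first checkpoint on a small parameter vector before moving to PPO.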
Yacine timeline
Fast baseline
Cartpole in MuJoCo Warp hit 18 million steps per second on some configs, using PufferLib and a large rollout policy batch.
June 4: Action and reward pressure
The model was still using five discrete cart forces. He planned to switch action shape, then added stronger upright reward and moved toward curriculum learning.
June 7: Scaling pain
He said each extra pendulum makes the time to solve grow super-exponentially, which is consistent with the control literature's treatment of multi link chains as chaotic benchmark systems.
June 9: Solve and ablation clue
The final recipe used top scoring hyperparameters from higher compute runs, then randomized episode length once the policy learned the whipping behavior but could not hold it.
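Randomized episode length can be sketched as below. The step bounds and the names `sample_episode_length` and `run_episode` are hypothetical, since the thread does not give the range that was used. The point is that the policy cannot learn to time its behavior against a fixed horizon, so it has to keep holding.

```python
import random
from typing import Callable, Tuple


def sample_episode_length(rng: random.Random, min_steps: int = 500, max_steps: int = 5000) -> int:
    """Draw a fresh horizon for each episode (bounds are illustrative assumptions)."""
    return rng.randint(min_steps, max_steps)


def run_episode(fell_at_step: Callable[[int], bool], rng: random.Random) -> Tuple[int, str]:
    """Step until the chain falls or the randomized time limit is reached.

    `fell_at_step` stands in for a simulator step that reports failure.
    """
    horizon = sample_episode_length(rng)
    for t in range(horizon):
        if fell_at_step(t):
            return t, "fell"
    return horizon, "time_limit"
```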
Outside checks
- Generalization: commenters asked whether the same policy can handle one through six links and whether the method points toward rope balancing.
- Physics realism: one reply challenged gravity and bar lengths. Yacine answered that gravity was 9.8 and hinge friction was zero.
- Sim to real: replies immediately raised hardware transfer. Hardware transfer should stay a separate gate. Neither a browser demo nor a MuJoCo solve implies it.
- Benchmark comparison: another reply noted that 90k steps per second on a 22 DoF humanoid with domain randomization is not comparable to a no contact cartpole task.
Reconstruction target
Recreate the training flow, not just the clip.
The artifact to build next is a reproducible repository path: MJCF generator, rollout trainer, sweep config, checkpoint recorder, eval renderer, and a public run ledger. The browser canvas is the readable front door for that work.
Training plan
1. Environment
Start from MuJoCo Playground cartpole, extend the MJCF to one through six serial hinged links, set gravity to 9.8 m/s², keep hinge friction at zero for the first reproduction, and save every seed.
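A minimal sketch of an MJCF generator for the chain is shown below. It is not derived from the MuJoCo Playground cartpole file: the link length, geom sizes, rail range, motor gear, timestep, and the function name `build_mjcf` are placeholder assumptions, and the output has only been checked as well-formed XML here, not loaded into MuJoCo. Gravity and zero hinge friction follow the values in the step above.

```python
import xml.etree.ElementTree as ET


def build_mjcf(n_links: int, link_length: float = 0.5) -> str:
    """Return MJCF text for a cart with n_links serial hinged links (illustrative values)."""
    root = ET.Element("mujoco", model=f"cartpole_{n_links}_link")
    ET.SubElement(root, "option", gravity="0 0 -9.8", timestep="0.002")

    world = ET.SubElement(root, "worldbody")
    cart = ET.SubElement(world, "body", name="cart", pos="0 0 0")
    ET.SubElement(cart, "joint", name="slider", type="slide", axis="1 0 0",
                  limited="true", range="-2 2")
    ET.SubElement(cart, "geom", name="cart_geom", type="box", size="0.1 0.05 0.05",
                  contype="0", conaffinity="0")

    parent = cart
    for i in range(1, n_links + 1):
        # The first link is attached at the cart origin; later links start at the parent's tip.
        pos = "0 0 0" if i == 1 else f"0 0 {link_length}"
        link = ET.SubElement(parent, "body", name=f"link{i}", pos=pos)
        ET.SubElement(link, "joint", name=f"hinge{i}", type="hinge", axis="0 1 0",
                      damping="0", frictionloss="0")
        # contype/conaffinity of 0 disables contacts, matching a no contact task.
        ET.SubElement(link, "geom", name=f"link{i}_geom", type="capsule",
                      fromto=f"0 0 0 0 0 {link_length}", size="0.02",
                      contype="0", conaffinity="0")
        parent = link

    actuator = ET.SubElement(root, "actuator")
    ET.SubElement(actuator, "motor", name="cart_force", joint="slider", gear="10",
                  ctrllimited="true", ctrlrange="-1 1")
    return ET.tostring(root, encoding="unicode")
```

Generating the file from `n_links` keeps the one through six link variants structurally identical, so curriculum stages differ only in chain depth.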
2. Reward
Score upright links, low angular velocity, centered cart position, low force cost, survival, and a separate swing up term so the policy learns the whip before the hold.
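One way to combine these terms is sketched below. The weights, the upright tolerance, and the exact form of each term are assumptions for illustration, since the source names the terms but not their values. Angles are assumed to be measured from upright and wrapped to [-π, π].

```python
import math
from typing import Dict, Sequence

# Hypothetical weights; the source lists the terms but not their magnitudes.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "upright": 1.0, "swing_up": 0.5, "survival": 0.1,
    "ang_vel": 0.01, "cart_center": 0.1, "force": 0.001,
}


def shaped_reward(angles: Sequence[float], ang_vels: Sequence[float], cart_x: float,
                  force: float, weights: Dict[str, float] = DEFAULT_WEIGHTS,
                  upright_tol: float = 0.2) -> float:
    n = len(angles)
    # Hold term: fraction of links inside a small cone around upright.
    upright = sum(1 for a in angles if abs(a) < upright_tol) / n
    # Swing-up term: dense height signal in [0, 1], nonzero long before the hold.
    swing_up = sum((1.0 + math.cos(a)) / 2.0 for a in angles) / n
    ang_vel_cost = sum(v * v for v in ang_vels) / n
    return (
        weights["upright"] * upright
        + weights["swing_up"] * swing_up
        + weights["survival"]
        - weights["ang_vel"] * ang_vel_cost
        - weights["cart_center"] * cart_x ** 2
        - weights["force"] * force ** 2
    )
```

Keeping the swing-up term separate from the hold term means its weight can be swept or annealed on its own.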
3. Search
Run PufferPPO with a small MinGRU policy, sweep reward weights, action force set or continuous force, horizon, entropy, curriculum mix, mass, and force magnitude.
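The sweep dimensions listed above could be drawn with a simple random search, sketched below. This is not PufferLib's sweep format. The parameter names, ranges, and the `sample_config` helper are hypothetical and would need to be mapped onto whatever trainer config is actually used.

```python
import math
import random
from typing import Any, Dict, Tuple

# Hypothetical search space covering the dimensions named in the search step.
SWEEP_SPACE: Dict[str, Tuple] = {
    "upright_weight": ("uniform", 0.5, 5.0),
    "force_cost_weight": ("log_uniform", 1e-4, 1e-1),
    "action_mode": ("choice", ["discrete_force_set", "continuous_force"]),
    "horizon": ("choice", [512, 1024, 2048, 4096]),
    "entropy_coef": ("log_uniform", 1e-4, 1e-1),
    "mixed_link_fraction": ("uniform", 0.0, 1.0),
    "link_mass_scale": ("uniform", 0.8, 1.2),
    "max_force_scale": ("uniform", 0.8, 1.2),
}


def sample_config(rng: random.Random) -> Dict[str, Any]:
    """Draw one trial configuration from SWEEP_SPACE."""
    config: Dict[str, Any] = {}
    for name, spec in SWEEP_SPACE.items():
        kind = spec[0]
        if kind == "uniform":
            config[name] = rng.uniform(spec[1], spec[2])
        elif kind == "log_uniform":
            # Sample uniformly in log space so small values are covered evenly.
            config[name] = math.exp(rng.uniform(math.log(spec[1]), math.log(spec[2])))
        elif kind == "choice":
            config[name] = rng.choice(spec[1])
        else:
            raise ValueError(f"unknown spec kind: {kind}")
    return config
```

Seeding the generator per trial (for example `random.Random(trial_id)`) makes every sampled configuration reproducible from the run ledger.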
4. Gate
Require held out seeds, randomized episode lengths, lower link transfer, impulse recovery, and a visible failure map before calling the reproduction solved.
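The gate can be made mechanical with a small checker like the sketch below. The `EvalReport` fields, the 0.95 success threshold, and the function name `gate_failures` are assumptions for illustration. The source only names which checks are required, not how they are scored.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EvalReport:
    held_out_seed_success: float          # success rate on seeds unseen in training
    randomized_length_success: float      # success rate under randomized horizons
    impulse_recovery_success: float       # success rate after an external push
    lower_link_success: Dict[int, float]  # success rate keyed by link count
    failure_map_path: str                 # empty string if no failure map was produced


def gate_failures(report: EvalReport, max_links: int = 6, threshold: float = 0.95) -> List[str]:
    """Return the list of unmet requirements; an empty list means the gate passes."""
    failures: List[str] = []
    if report.held_out_seed_success < threshold:
        failures.append("held out seeds")
    if report.randomized_length_success < threshold:
        failures.append("randomized episode lengths")
    if report.impulse_recovery_success < threshold:
        failures.append("impulse recovery")
    for n in range(1, max_links):
        # A missing entry counts as a failure, so every lower link count must be evaluated.
        if report.lower_link_success.get(n, 0.0) < threshold:
            failures.append(f"lower link transfer ({n} links)")
    if not report.failure_map_path:
        failures.append("failure map")
    return failures
```

Returning the unmet requirements, rather than a single boolean, gives the run ledger something specific to record for each checkpoint.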
Research ledger
Multi link pendulums are real benchmarks
Kaheman, Fasel, Bramburger, Strom, Kutz, and Brunton frame the multi arm pendulum on a cart as a benchmark for chaos, system identification, learning, and control.
Chain pendulum dynamics are hard
Lee, Leok, and McClamroch derive equations and control structure for a chain pendulum on a cart, which is the mechanical version behind the six link challenge.
Swing up is different from hold
Inverted pendulum work separates the energy needed to swing up from the smaller corrections needed to stay upright. The six link version needs both behaviors.
MJWarp is the speed lever
MuJoCo Warp is optimized for NVIDIA hardware and large batches of simulation steps, which matches Yacine's claim that environment speed was the unlock.
MuJoCo Playground is the clean base
MuJoCo Playground ships GPU accelerated robot learning environments and has CartpoleBalance examples with PPO and a Warp implementation path.
PufferLib matches the policy clue
PufferLib documents a PPO variant and PufferNet using MinGRU, matching the timeline notes about PufferPPO and a small recurrent policy.
Mechanize angle
Mechanize describes environments and graders as the signal for reinforcement learning and evaluations. The cartpole challenge is the robotics shaped version of that pattern.
Next real training run
Ship the benchmark before chasing the hero video.
The useful artifact is not just a clip. It is a reproducible environment, saved seeds, reward curves, policy checkpoints, and failure cases. That makes the solve inspectable and gives future agents something to improve.