AI Lab | cartpole control

X thread response | June 9, 2026

Six link cartpole lab.

A live browser sketch of the six link cartpole problem, plus a Modal GPU policy checkpoint and the training notes needed for the full reinforcement learning reproduction.

Rebuild target

  • Observation: cart x, cart velocity, six link angles, six angular velocities.
  • Action: start with a small discrete force set, then test continuous cart force.
  • Reward: upright chain, centered cart, low angular velocity, survival time.
  • Curriculum: solve one link, then two, then mixed one to six link episodes.
  • Randomization: mass, force magnitude, initial angle, episode horizon.
  • Evaluation: held out seeds, impulse recovery, lower link transfer, failure map.
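The observation layout, curriculum, and randomization bullets above can be sketched as a per-episode sampler. This is a minimal sketch, not the lab's code: `EpisodeConfig`, `obs_size`, `sample_episode`, and every numeric range below are hypothetical placeholders chosen only for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class EpisodeConfig:
    num_links: int     # links in this episode (curriculum knob)
    mass_scale: float  # multiplier on a nominal link mass (assumed range)
    force_mag: float   # peak cart force, arbitrary units (assumed range)
    init_angle: float  # initial tilt from upright, radians (assumed range)
    horizon: int       # episode length in steps (assumed range)


def obs_size(num_links: int) -> int:
    # cart x + cart velocity + one angle and one angular velocity per link
    return 2 + 2 * num_links


def sample_episode(rng: random.Random, max_links: int) -> EpisodeConfig:
    # max_links is the curriculum stage: 1, then 2, then mixed up to 6.
    return EpisodeConfig(
        num_links=rng.randint(1, max_links),
        mass_scale=rng.uniform(0.8, 1.2),
        force_mag=rng.uniform(8.0, 12.0),
        init_angle=rng.uniform(-0.05, 0.05),
        horizon=rng.randint(500, 2000),
    )


rng = random.Random(0)
cfg = sample_episode(rng, max_links=6)
print(obs_size(6), cfg)
```

With six links the observation has 14 entries, which matches the list above: two cart values plus six angles and six angular velocities.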

What the X thread says

  • Yacine posted the six pendulum cartpole solve at 00:50 UTC on June 9, 2026.
  • He said the run used PufferPPO, MuJoCo Warp, several RTX 4090s, and a Puffer minGRU policy near one million parameters.
  • The working trick was not just model size. It was environment speed, reward shaping, thousands of hyperparameter experiments, and randomized episode length.
  • Later that day he posted a MuJoCo Playground reproduction at 120k steps per second, 200 million steps, and 27.8 minutes of training.
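The throughput figures in the last bullet are internally consistent, which a quick check confirms. The variable names are illustrative; the numbers come straight from the thread summary.

```python
total_steps = 200_000_000   # 200 million environment steps
steps_per_second = 120_000  # 120k steps per second

seconds = total_steps / steps_per_second
minutes = seconds / 60
print(f"{minutes:.1f} minutes")  # 27.8 minutes
```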

Implementation status

Modal-trained CEM checkpoint deployed.

The canvas runs a lightweight coupled physics approximation and a checked-in policy, trained on a Modal L4 GPU with cross-entropy method (CEM) search over a time-basis plus feedback model. This is progress, not the final Yacine-level MuJoCo/PufferPPO solve. The next checkpoint should increase sustained hold time. After that, the environment should move to MuJoCo Warp (MJWarp) or MJX with PPO and recorded eval videos.
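For readers unfamiliar with cross-entropy search, here is a minimal sketch of the general technique, assuming a diagonal Gaussian over a flat parameter vector. It is not the checked-in trainer: `cem`, `toy_score`, and all defaults are hypothetical, and the toy quadratic stands in for an episode return computed from a rollout. In the real setup the parameter vector would presumably hold the time-basis weights and feedback gains.

```python
import random
import statistics


def cem(score, dim, iters=40, pop=100, elite_frac=0.2, init_std=2.0, seed=0):
    """Maximize score(theta) with a diagonal-Gaussian cross-entropy search."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [init_std] * dim
    n_elite = max(2, int(pop * elite_frac))
    for _ in range(iters):
        # Sample a population around the current mean.
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(pop)]
        # Keep the highest-scoring fraction of the population.
        elite = sorted(samples, key=score, reverse=True)[:n_elite]
        # Refit mean and std to the elite set; the small floor keeps
        # the search from collapsing to a single point.
        columns = list(zip(*elite))
        mean = [statistics.fmean(c) for c in columns]
        std = [statistics.pstdev(c) + 1e-3 for c in columns]
    return mean


TARGET = [0.5, -1.0, 2.0]  # toy optimum, illustration only


def toy_score(theta):
    # Stand-in for an episode return: higher is better, peak at TARGET.
    return -sum((t - g) ** 2 for t, g in zip(theta, TARGET))


best = cem(toy_score, dim=3)
print([round(b, 2) for b in best])
```

CEM needs only a scalar score per candidate, so it works with a non-differentiable browser physics loop. That makes it a reasonable stepping stone before PPO.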

Outside checks

  • Generalization: commenters asked whether the same policy can handle one through six links and whether the method points toward rope balancing.
  • Physics realism: one reply challenged gravity and bar lengths. Yacine answered that gravity was 9.8 and hinge friction was zero.
  • Sim to real: replies immediately raised hardware transfer. That should stay a separate gate, not implied by a browser or MuJoCo solve.
  • Benchmark comparison: another reply noted that 90k steps per second on a 22 DoF humanoid with domain randomization is not comparable to a no contact cartpole task.

Reconstruction target

Recreate the training flow, not just the clip.

The artifact to build next is a reproducible repository path: MJCF generator, rollout trainer, sweep config, checkpoint recorder, eval renderer, and a public run ledger. The browser canvas is the readable front door for that work.

Training plan

1. Environment

Start from the MuJoCo Playground cartpole and extend the MJCF to one through six serial hinged links. Set gravity to 9.8 m/s², keep hinge friction at zero for the first reproduction, and save every seed.
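A generator for the serial chain could look roughly like the sketch below. It builds MJCF from scratch with the standard library rather than editing the Playground file, so body names, link length, masses, timestep, and actuator gear are all assumptions. Only the 9.8 gravity and zero hinge friction come from the plan above.

```python
import xml.etree.ElementTree as ET


def make_cartpole_mjcf(num_links: int, link_len: float = 0.3) -> str:
    """Return an MJCF string for a cart with num_links serial hinged links."""
    root = ET.Element("mujoco", model=f"cartpole_{num_links}_link")
    # Gravity points down the z axis; the timestep is an assumed value.
    ET.SubElement(root, "option", gravity="0 0 -9.8", timestep="0.005")
    world = ET.SubElement(root, "worldbody")

    cart = ET.SubElement(world, "body", name="cart", pos="0 0 0")
    ET.SubElement(cart, "joint", name="slider", type="slide", axis="1 0 0")
    ET.SubElement(cart, "geom", name="cart_geom", type="box",
                  size="0.1 0.05 0.05", mass="1")

    parent = cart
    for i in range(1, num_links + 1):
        # Each link attaches at the tip of the previous one.
        pos = "0 0 0" if i == 1 else f"0 0 {link_len}"
        link = ET.SubElement(parent, "body", name=f"link{i}", pos=pos)
        # Zero damping and zero frictionloss: frictionless hinges.
        ET.SubElement(link, "joint", name=f"hinge{i}", type="hinge",
                      axis="0 1 0", damping="0", frictionloss="0")
        ET.SubElement(link, "geom", name=f"link{i}_geom", type="capsule",
                      fromto=f"0 0 0 0 0 {link_len}", size="0.02", mass="0.1")
        parent = link

    actuator = ET.SubElement(root, "actuator")
    ET.SubElement(actuator, "motor", name="cart_force", joint="slider",
                  gear="10", ctrllimited="true", ctrlrange="-1 1")
    return ET.tostring(root, encoding="unicode")


xml_text = make_cartpole_mjcf(6)
print(xml_text[:80])
```

Generating the model from one function keeps the one-link through six-link variants identical apart from chain length, which matters for the lower link transfer check.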

2. Reward

Score upright links, low angular velocity, centered cart position, low force cost, survival, and a separate swing up term so the policy learns the whip before the hold.
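One way to combine these terms is sketched below. Everything here is an assumption for illustration, not the shaping Yacine used: the weights, the functional forms, the convention that angles are absolute and zero means upright, and the choice to apply the velocity penalty only near upright so it does not discourage the swing.

```python
import math

# Hypothetical weights; the plan is to sweep these, not to trust them.
DEFAULT_WEIGHTS = {
    "upright": 1.0, "hold": 0.5, "centered": 0.1,
    "swing_up": 0.5, "alive": 0.1, "force": 0.01,
}


def step_reward(cart_x, angles, ang_vels, force, w=DEFAULT_WEIGHTS):
    """Per-step shaped reward. Angles are radians from upright (0 = up)."""
    n = len(angles)
    # Sharp term: close to 1 only when a link is nearly vertical.
    upright = sum(math.exp(-(1.0 - math.cos(a)) / 0.05) for a in angles) / n
    # Low angular velocity, in (0, 1].
    calm = 1.0 / (1.0 + sum(v * v for v in ang_vels) / n)
    # Centered cart, in (0, 1].
    centered = 1.0 / (1.0 + cart_x * cart_x)
    # Dense swing-up term: normalized tip height in [-1, 1], so the
    # policy gets signal while the chain is still hanging.
    swing_up = sum(math.cos(a) for a in angles) / n
    return (
        w["upright"] * upright
        + w["hold"] * upright * calm   # calm only counts near upright
        + w["centered"] * centered
        + w["swing_up"] * swing_up
        + w["alive"]                   # survival bonus per step
        - w["force"] * force * force   # force cost
    )


held = step_reward(0.0, [0.0] * 6, [0.0] * 6, 0.0)
hanging = step_reward(0.0, [math.pi] * 6, [0.0] * 6, 0.0)
print(held > hanging)
```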

3. Search

Run PufferPPO with a small minGRU policy. Sweep reward weights, discrete force set versus continuous force, horizon, entropy, curriculum mix, mass, and force magnitude.
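The sweep dimensions above can be written down as a random-search space. This is a generic sketch, not PufferLib's sweep format: the key names, ranges, and the `sample_trial` helper are hypothetical.

```python
import math
import random

# Categorical choices are lists; continuous ranges are (low, high) tuples.
SWEEP_SPACE = {
    "action_mode": ["discrete", "continuous"],
    "horizon": [500, 1000, 2000],
    "curriculum_mix": ["one_link", "one_two", "mixed_1_to_6"],
    "w_upright": (0.5, 2.0),
    "w_swing_up": (0.1, 1.0),
    "mass_scale": (0.8, 1.2),
    "force_mag": (8.0, 12.0),
    "entropy_coef": (1e-4, 1e-2),
}
LOG_UNIFORM = {"entropy_coef"}  # sampled uniformly in log space


def sample_trial(rng: random.Random) -> dict:
    trial = {}
    for key, spec in SWEEP_SPACE.items():
        if isinstance(spec, list):
            trial[key] = rng.choice(spec)
        elif key in LOG_UNIFORM:
            lo, hi = spec
            trial[key] = math.exp(rng.uniform(math.log(lo), math.log(hi)))
        else:
            lo, hi = spec
            trial[key] = rng.uniform(lo, hi)
    return trial


rng = random.Random(0)
trials = [sample_trial(rng) for _ in range(3)]
for t in trials:
    print(t)
```

Logging each sampled dict next to its seed and final score is what turns thousands of experiments into a usable run ledger.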

4. Gate

Require held out seeds, randomized episode lengths, lower link transfer, impulse recovery, and a visible failure map before calling the reproduction solved.
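The gate can be made mechanical so that "solved" is a checked claim rather than a judgment call. The sketch below assumes a hypothetical `EvalReport` structure and a placeholder 0.95 success threshold. Neither comes from the thread.

```python
from dataclasses import dataclass, field


@dataclass
class EvalReport:
    heldout_success: float         # success rate on held out seeds
    random_horizon_success: float  # success rate with randomized lengths
    impulse_recovery: float        # fraction of impulses recovered from
    lower_link_success: dict = field(default_factory=dict)  # links -> rate
    failure_map_path: str = ""     # where the failure map was written


def gate(report: EvalReport, threshold: float = 0.95) -> list:
    """Return the names of failed checks; an empty list means pass."""
    failed = []
    if report.heldout_success < threshold:
        failed.append("heldout_seeds")
    if report.random_horizon_success < threshold:
        failed.append("randomized_horizon")
    if report.impulse_recovery < threshold:
        failed.append("impulse_recovery")
    # Lower link transfer: every chain length from 1 to 5 must be present.
    for n in range(1, 6):
        if report.lower_link_success.get(n, 0.0) < threshold:
            failed.append(f"lower_link_{n}")
    if not report.failure_map_path:
        failed.append("failure_map")
    return failed


good = EvalReport(0.99, 0.97, 0.96, {n: 0.98 for n in range(1, 6)}, "fail.png")
print(gate(good))  # []
```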

Research ledger

Next real training run

Ship the benchmark before chasing the hero video.

The useful artifact is not just a clip. It is a reproducible environment, saved seeds, reward curves, policy checkpoints, and failure cases. That makes the solve inspectable and gives future agents something to improve.

Six Link Cartpole AI Lab | Max Petrusenko