AI Lab | cartpole control

X thread response | June 9, 2026

Six-link cartpole lab.

A live browser sketch of the six-link cartpole problem, plus a Modal GPU policy checkpoint and the training notes needed for a full reinforcement learning reproduction.

Rebuild target

Observation: cart x, cart velocity, previous action, and sin/cos angle features.
Action: model-produced continuous cart force only.
Reward: dense swing-up shaping during training, strict visible score only when every link is upright and straight.
Curriculum: solve one link from the hanging position, then two, then mixed one to six link episodes.
Randomization: mass, force magnitude, initial angle, episode horizon.
Evaluation: one second minimum consecutive strict hold, held out seeds, lower link transfer, failure map.
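As a minimal sketch of the observation line above, the vector could be packed as follows. The angle convention (radians from upright), the zero-padding of unused link slots, and the function name `build_observation` are assumptions made for illustration, not details confirmed by the thread or the repo. Encoding each angle as a sin/cos pair avoids the discontinuity a raw angle has at the wrap point.

```python
import math
from typing import List, Sequence


def build_observation(cart_x: float, cart_v: float, prev_action: float,
                      angles: Sequence[float], max_links: int = 6) -> List[float]:
    """Pack cart x, cart velocity, previous action, and sin/cos per link.

    Assumptions: angles are in radians measured from upright, and unused
    link slots are zero-padded so one policy can see 1..max_links links.
    """
    if not 1 <= len(angles) <= max_links:
        raise ValueError("expected between 1 and max_links angles")
    obs = [cart_x, cart_v, prev_action]
    for theta in angles:
        # sin/cos keeps the feature continuous when the angle wraps.
        obs.extend((math.sin(theta), math.cos(theta)))
    # Zero-pad the slots for links that are locked or absent.
    obs.extend([0.0] * (2 * (max_links - len(angles))))
    return obs
```

With six slots the vector always has 3 + 2 * 6 = 15 entries, regardless of how many links are active in the current curriculum stage.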

External verified N=6 proof

Nominal seed-free solve

PASS

Generated from a neutral N=2 to N=5 ladder, lifted to N=6 with a one-sided bend-excitation floor, then verified with TVLQR catch and hold.

Perturbed start

16/16

Same controls, started from random initial offsets of +/-0.1 pi rad (about +/-0.31 rad) in angle and +/-0.5 rad/s in angular rate. The official challenge reports robustness through +/-0.80 rad.
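A perturbed-start check of this kind can be sketched as below. The uniform distribution, the independent per-link draws, and the `run_trial` callback (a stand-in for whatever simulates and verifies one start) are all assumptions. The real verifier in `m1el/inverted-pendulum` may sample differently.

```python
import math
import random
from typing import Callable, List, Tuple

ANGLE_OFFSET = 0.10 * math.pi  # +/-10% of pi, in rad, as quoted above
RATE_OFFSET = 0.5              # +/-0.5 rad/s, as quoted above


def sample_perturbed_start(n_links: int,
                           rng: random.Random) -> Tuple[List[float], List[float]]:
    """Draw per-link angle and angular-rate offsets (uniform is an assumption)."""
    d_theta = [rng.uniform(-ANGLE_OFFSET, ANGLE_OFFSET) for _ in range(n_links)]
    d_omega = [rng.uniform(-RATE_OFFSET, RATE_OFFSET) for _ in range(n_links)]
    return d_theta, d_omega


def perturbed_pass_count(run_trial: Callable[[List[float], List[float]], bool],
                         n_links: int = 6, trials: int = 16, seed: int = 0) -> int:
    """Count how many perturbed starts the hypothetical run_trial reports as passing."""
    rng = random.Random(seed)
    return sum(bool(run_trial(*sample_perturbed_start(n_links, rng)))
               for _ in range(trials))
```

A 16/16 result corresponds to `perturbed_pass_count(run_trial) == 16` under these assumptions.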

This is not the browser neural checkpoint. It is a reproducible six-link control solve from `m1el/inverted-pendulum`: verified dynamics, seed-free direct collocation, controllability-aware bend order, and full-state TVLQR. The lesson for our trainer is concrete: do not force straightness during the pump phase. Keep bend modes excitable, then catch and hold.

What the X thread says

Yacine posted the six pendulum cartpole solve at 00:50 UTC on June 9, 2026.
He said the run used PufferPPO, MuJoCo Warp, several RTX 4090s, and a Puffer minGRU policy near one million parameters.
The working trick was not just model size. It was environment speed, reward shaping, 3.6k hyperparameter experiments, and randomized episode length.
The randomized horizon came after the policy learned whipping behavior but could not keep it up longer than about 10 seconds.
He reported 18 million steps per second on some MuJoCo Warp configs by capturing the CUDA graph with APIC and calling it from C.
He confirmed gravity was 9.8, hinge friction was zero, and the cart was effectively tracking position zero.
Later that day he posted a MuJoCo Playground reproduction at 120k steps per second, 200 million steps, and 27.8 minutes of training.
A scan of the public yacineMTB GitHub repositories did not find a released six-pendulum source repo, so this page treats the thread as the source of truth.

Implementation status

Link one is solved in the browser proof.

The canvas starts at one hanging pendulum, uses model force only, and keeps links two through six locked. The exported Pezzza-style evolutionary checkpoint now passes the local browser gate: 8.043 seconds observed from a down start, against 5.715 seconds reported by policy metadata validation. The full-gravity ablation stayed at a strict score of zero, so the useful finding is that curriculum plus whiplash/recovery shaping made the difference, not additional training steps.
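For readers unfamiliar with the cross-entropy method (CEM) named in the experiment list, here is a generic sketch of the idea on a toy objective. It is not the repo's trainer. The population size, elite count, and diagonal Gaussian are illustrative assumptions, and in a real run `score` would roll out a policy and return its episode score.

```python
import random
import statistics
from typing import Callable, List, Sequence


def cem_maximize(score: Callable[[Sequence[float]], float], dim: int,
                 iterations: int = 30, population: int = 64, elites: int = 8,
                 init_std: float = 2.0, min_std: float = 1e-3,
                 seed: int = 0) -> List[float]:
    """Generic CEM: sample, keep the best few, refit a diagonal Gaussian."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [init_std] * dim
    for _ in range(iterations):
        samples = [[rng.gauss(mean[i], std[i]) for i in range(dim)]
                   for _ in range(population)]
        samples.sort(key=score, reverse=True)  # highest score first
        top = samples[:elites]
        for i in range(dim):
            column = [s[i] for s in top]
            mean[i] = statistics.fmean(column)
            # Floor the spread so the search never collapses to a point.
            std[i] = max(statistics.pstdev(column), min_std)
    return mean


# Toy usage: the maximum of this objective is at (3, -1).
best = cem_maximize(lambda p: -((p[0] - 3.0) ** 2 + (p[1] + 1.0) ** 2), dim=2)
```

Because CEM needs only episode scores and no gradients, it pairs naturally with a sparse strict score, which may be one reason it worked here where the gradient-based baselines stayed at zero. That reading is a guess, not a measured result.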

Experiment comparison

Experiment | Run | Speed | Result
Seed-free N=6 | m1el local repro | 575.8s generation | Direct collocation plus TVLQR. Nominal verifier PASS at 0.364deg final error; perturbed challenge PASS 16/16.
Pezzza-style CEM | ap-Atp5F3zbazixndWxBHdeQp | ~721k SPS | Trainer strict score 187.95. Browser proof observed 8.043s from down; policy metadata validation reports 5.715s.
Pezzza 480 Hz | ap-h3iaYTUzuCymR0hUnE6FOX | ~784k SPS | Strict score 164.60, mean hold 5.37s, solved rate 94.9%.
Full gravity only | ap-2tRCgyI6nU9xcxAIMHIXyE | ~77k SPS | Strict score 0. Curriculum was the difference.
SAC / TD3 / PPO | multiple | ~282 to 825 SPS | Strict score 0. TD3 can hold from near-upright, but down-start still fails.

X media pass

Hero solve

The clip starts from a bent whip posture, and the score appears only during straight, upright windows. That matches the strict public score gate.

Experiment cloud

The scatter video is a field of hyperparameter runs, with the useful policies living as outliers. The reproduction needs a run ledger, not one hand-tuned policy.

Reward step

The reward/policy clip shows a visible step after useful behavior appears. The next trainer should add horizon randomization after whip behavior, not before.

Phase traces

The simulator-speed clip shows repeated green trajectory clusters. The useful behavior is a learned whip path followed by a hold, not random shaking.

Yacine timeline

June 3

Fast baseline

Cartpole in MuJoCo Warp hit 18 million steps per second on some configs, using PufferLib and a large rollout policy batch.

June 4

Action and reward pressure

The model was still using five discrete cart forces. He planned to change the action space, then added a stronger upright reward and moved toward curriculum learning.

June 7

Scaling pain

He described the time to solve as growing super-exponentially with each added pendulum, which matches the control literature's view of multi-link chains as chaotic benchmark systems.

June 9

Solve and ablation clue

The final recipe used top scoring hyperparameters from higher compute runs, then randomized episode length once the policy learned the whipping behavior but could not hold it.

Outside checks

Generalization: commenters asked whether the same policy can handle one through six links and whether the method points toward rope balancing.
Physics realism: one reply challenged gravity and bar lengths. Yacine answered that gravity was 9.8 and hinge friction was zero.
Sim to real: replies immediately raised hardware transfer. That should stay a separate gate, not implied by a browser or MuJoCo solve.
Benchmark comparison: another reply noted that 90k steps per second on a 22 DoF humanoid with domain randomization is not comparable to a no contact cartpole task.

Reconstruction target

Recreate the training flow, not just the clip.

The artifact to build next is a reproducible repository path: MJCF generator, rollout trainer, sweep config, checkpoint recorder, eval renderer, and a public run ledger. The browser canvas is the readable front door for that work.

Training plan

1. Environment

Start from MuJoCo Playground or MuJoCo Warp cartpole, extend the MJCF to one through six serial hinged links, set gravity to 9.8, keep hinge friction at zero, and make the cart track zero.
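The MJCF extension described in this step could be generated along these lines. Link length, link mass, cart size, timestep, actuator range, and all element names are placeholder assumptions. The attributes used are standard MJCF, but the output has not been loaded into MuJoCo here, so treat it as a starting sketch rather than a validated model. Cart tracking of position zero is a reward or controller concern and is not encoded in the XML.

```python
import xml.etree.ElementTree as ET


def build_chain_mjcf(n_links: int, link_length: float = 0.5,
                     link_mass: float = 0.1) -> str:
    """Return MJCF text for a cart with n_links serial hinged links.

    Links hang downward at zero joint angle. Sizes and masses are placeholders.
    """
    if not 1 <= n_links <= 6:
        raise ValueError("n_links must be between 1 and 6")
    root = ET.Element("mujoco", model=f"cartpole_{n_links}_link")
    ET.SubElement(root, "option", gravity="0 0 -9.8", timestep="0.002")
    world = ET.SubElement(root, "worldbody")
    cart = ET.SubElement(world, "body", name="cart", pos="0 0 0")
    ET.SubElement(cart, "joint", name="slider", type="slide", axis="1 0 0")
    ET.SubElement(cart, "geom", name="cart_geom", type="box",
                  size="0.1 0.05 0.05", mass="1",
                  contype="0", conaffinity="0")
    parent = cart
    for i in range(1, n_links + 1):
        # Each link attaches at the tip of the previous one.
        pos = "0 0 0" if i == 1 else f"0 0 {-link_length}"
        link = ET.SubElement(parent, "body", name=f"link{i}", pos=pos)
        # Zero damping and zero frictionloss give frictionless hinges.
        ET.SubElement(link, "joint", name=f"hinge{i}", type="hinge",
                      axis="0 1 0", damping="0", frictionloss="0")
        ET.SubElement(link, "geom", name=f"link{i}_geom", type="capsule",
                      fromto=f"0 0 0 0 0 {-link_length}", size="0.02",
                      mass=str(link_mass), contype="0", conaffinity="0")
        parent = link
    actuator = ET.SubElement(root, "actuator")
    # Single force actuator on the cart; the range is a placeholder.
    ET.SubElement(actuator, "motor", name="cart_force", joint="slider",
                  gear="1", ctrllimited="true", ctrlrange="-1 1")
    return ET.tostring(root, encoding="unicode")
```

Generating the XML from one function keeps the one-through-six link variants consistent, so curriculum stages differ only in `n_links`.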

2. Reward

Use dense swing-up shaping to discover whipping, but keep the visible score at zero unless every link is near upright, the chain is nearly straight, and the hold lasts at least one second.
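The strict gate in this step can be sketched as a pure function over recorded link angles. The tolerances (`upright_tol`, `straight_tol`), the use of absolute angles measured from upright, and scoring in seconds of qualifying hold are assumptions. The repo's score report may define these differently.

```python
import math
from typing import Iterable, Sequence


def _wrap(angle: float) -> float:
    """Wrap an angle to (-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


def is_strict_pose(angles: Sequence[float], upright_tol: float = 0.15,
                   straight_tol: float = 0.10) -> bool:
    """True when every link is near upright and neighbouring links are aligned.

    angles: absolute link angles in radians, measured from upright (assumed).
    """
    wrapped = [_wrap(a) for a in angles]
    upright = all(abs(a) <= upright_tol for a in wrapped)
    straight = all(abs(_wrap(b - a)) <= straight_tol
                   for a, b in zip(wrapped, wrapped[1:]))
    return upright and straight


def strict_score(frames: Iterable[Sequence[float]], dt: float,
                 min_hold: float = 1.0) -> float:
    """Seconds in strict pose, counting only runs that last at least min_hold."""
    total, run = 0.0, 0
    for angles in frames:
        if is_strict_pose(angles):
            run += 1
            continue
        if run * dt >= min_hold:
            total += run * dt
        run = 0
    if run * dt >= min_hold:  # close out a hold that reaches the episode end
        total += run * dt
    return total
```

Under this definition a half-second hold contributes nothing, which is the property the visible score relies on: subsecond holds do not count.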

3. Search

Run many sweeps, not one hand run. Promote top hyperparameters from higher-compute runs, then tweak task details once useful behavior appears.
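A sweep-and-promote loop with a run ledger might look like the sketch below. The hyperparameter names and ranges in `sample_config` are invented for illustration, and `evaluate` is a hypothetical callback that would train with a config and return its strict score. Neither comes from the thread.

```python
import random
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]
Record = Dict[str, object]


def sample_config(rng: random.Random) -> Config:
    """Draw one configuration (all names and ranges are hypothetical)."""
    return {
        "learning_rate": 10 ** rng.uniform(-5, -3),  # log-uniform
        "gamma": rng.choice([0.99, 0.995, 0.999]),
        "entropy_coef": 10 ** rng.uniform(-4, -2),   # log-uniform
    }


def run_sweep(evaluate: Callable[[Config], float], n_runs: int, top_k: int,
              seed: int = 0) -> Tuple[List[Record], List[Record]]:
    """Evaluate n_runs random configs; return the full ledger and the top_k."""
    rng = random.Random(seed)
    ledger: List[Record] = []
    for run_id in range(n_runs):
        config = sample_config(rng)
        ledger.append({"run_id": run_id, "config": config,
                       "strict_score": evaluate(config)})
    promoted = sorted(ledger, key=lambda r: r["strict_score"],
                      reverse=True)[:top_k]
    return ledger, promoted
```

Keeping every run in the ledger, not only the promoted ones, preserves the failures that a later failure map or ablation would need.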

4. Gate

Randomize episode lengths only after whipping appears. Require held-out seeds, long holds, lower-link transfer, and a visible failure map before calling it solved.
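The ordering rule in this step, combined with the one, two, then mixed link curriculum from the rebuild target, can be expressed as a small episode sampler. The stage numbering, the fixed 10 second base horizon, and the 5 to 30 second randomized range are assumptions. How `whip_seen` gets detected is left to the trainer.

```python
import random
from typing import Tuple


def sample_episode(stage: int, whip_seen: bool, rng: random.Random,
                   base_horizon_s: float = 10.0,
                   horizon_range_s: Tuple[float, float] = (5.0, 30.0)
                   ) -> Tuple[int, float]:
    """Return (n_links, horizon_seconds) for the next episode.

    stage 0 -> one link, stage 1 -> two links, stage 2+ -> mixed 1..6 links.
    The horizon stays fixed until whipping has been observed.
    """
    if stage <= 0:
        n_links = 1
    elif stage == 1:
        n_links = 2
    else:
        n_links = rng.randint(1, 6)
    if whip_seen:
        horizon = rng.uniform(*horizon_range_s)
    else:
        horizon = base_horizon_s
    return n_links, horizon
```

Gating the randomization on `whip_seen` keeps early training on a fixed horizon, so the policy first has to discover the whip before it is asked to hold across varying episode lengths.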

Research ledger

Verified N=6 model-based solve

m1el/inverted-pendulum publishes verified N-link dynamics plus seed-free controllability-aware trajectory optimization and TVLQR. The local reproduction passed nominal and perturbed N=6 verification.

Exudyn N=5 RL baseline

Exudyn's openAIgymNLinkAdvanced.py documents SAC training for multi-link inverted pendulums and notes that four and five links work, with smaller init noise and long training for higher link counts.

Multi link pendulums are real benchmarks

Kaheman, Fasel, Bramburger, Strom, Kutz, and Brunton frame the multi arm pendulum on a cart as a benchmark for chaos, system identification, learning, and control.

Hardware repo for the benchmark

The Dynamics Lab multi-arm pendulum repository publishes CAD, manuals, and collected data for the multi-arm pendulum on a cart paper.

Chain pendulum dynamics are hard

Lee, Leok, and McClamroch derive equations and control structure for a chain pendulum on a cart, which is the mechanical version behind the six link challenge.

Energy swing up is the control prior

Åström and Furuta's energy-control swing-up paper is the useful classical prior: first pump energy into the chain, then switch to stabilization near upright.
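To make the pump phase concrete, here is a single-pendulum sketch of the energy-control law, with the pivot acceleration as the control input and the angle measured from upright. It covers one link only, not the six-link chain. The point-mass rod, the gain `k`, the acceleration limit `u_max`, and the semi-implicit Euler integrator are illustrative assumptions, and the stabilizing catch near upright is deliberately left out.

```python
import math


def energy(theta: float, omega: float, m: float = 1.0, l: float = 1.0,
           g: float = 9.8) -> float:
    """Pendulum energy, theta from upright; zero when at rest upright."""
    return 0.5 * m * l * l * omega * omega + m * g * l * (math.cos(theta) - 1.0)


def sign(x: float) -> int:
    return (x > 0) - (x < 0)


def swing_up_accel(theta: float, omega: float, k: float = 5.0,
                   u_max: float = 9.8, **params: float) -> float:
    """Saturated pivot acceleration that drives the energy toward zero."""
    raw = k * energy(theta, omega, **params) * sign(omega * math.cos(theta))
    return max(-u_max, min(u_max, raw))


def simulate(seconds: float = 10.0, dt: float = 0.001, l: float = 1.0,
             g: float = 9.8):
    """Start hanging with a small nudge; return (start_energy, end_energy)."""
    theta, omega = math.pi, 0.1  # the nudge avoids the zero-velocity dead spot
    start = energy(theta, omega, l=l, g=g)
    for _ in range(int(seconds / dt)):
        u = swing_up_accel(theta, omega, l=l, g=g)
        # Point-mass pendulum on an accelerating pivot.
        alpha = (g / l) * math.sin(theta) - (u / l) * math.cos(theta)
        omega += alpha * dt  # semi-implicit Euler step
        theta += omega * dt
    return start, energy(theta, omega, l=l, g=g)
```

With this law the energy derivative is non-negative whenever the energy is below the upright value, so the controller only pumps. A separate stabilizer would then take over near upright, which mirrors the pump, then catch and hold, split used in the N=6 solve.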

PPO is the algorithm baseline

Schulman, Wolski, Dhariwal, Radford, and Klimov introduced PPO as a practical policy-gradient method for simulated control and robotics-style tasks.

PufferLib is the stack clue

The PufferLib paper and docs explain the fast vectorized RL path that matches the thread's PufferPPO and MinGRU clues.

MJWarp is the speed lever

MuJoCo Warp is optimized for NVIDIA hardware and large batches of simulation steps, which matches Yacine's claim that environment speed was the unlock.

MuJoCo Playground is the clean base

MuJoCo Playground ships GPU accelerated robot learning environments and has CartpoleBalance examples with PPO and a Warp implementation path.

PufferLib matches the policy clue

PufferLib documents a PPO variant and PufferNet using MinGRU, matching the timeline notes about PufferPPO and a small recurrent policy.

Mechanize angle

Mechanize describes environments and graders as the signal for reinforcement learning and evaluations. The cartpole challenge is the robotics shaped version of that pattern.

Next real training run

Two links exist, but do not pass yet.

The useful artifact is not just a clip. The repo now has a strict score report where subsecond holds do not count, a Pezzza-style CEM trainer, a one-link browser proof, and two-link chain runs that still score zero on the down-start gate. The Yacine-aligned next lane is PufferPPO or close recurrent PPO on MuJoCo Warp/MuJoCo, with randomized horizon only after whip behavior appears.