RL Playground

Train RL agents on grid worlds. Compare Q-Learning, SARSA, Expected SARSA, and Monte Carlo. Visualize Q-value heatmaps and policy arrows.

Algorithm

Map

Visualization

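Q-Learning (off-policy TD): updates Q(s, a) toward the reward plus the discounted maximum Q-value of the next state, regardless of which action the agent actually takes next.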
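
The off-policy update described above can be sketched as follows. This is a minimal illustration, not the playground's actual source: the flat `state * NUM_ACTIONS + action` layout, the four-action assumption, and the names `qIndex`, `maxQ`, and `qLearningUpdate` are all hypothetical.

```typescript
// Assumed layout: one flat Float64Array, row = state, column = action.
const NUM_ACTIONS = 4; // assumption: up, right, down, left

function qIndex(state: number, action: number): number {
  return state * NUM_ACTIONS + action;
}

// Largest Q-value over all actions in `state`.
function maxQ(q: Float64Array, state: number): number {
  let best = -Infinity;
  for (let a = 0; a < NUM_ACTIONS; a++) {
    best = Math.max(best, q[qIndex(state, a)]);
  }
  return best;
}

// One Q-Learning step:
// Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
// On a terminal transition there is no future value to bootstrap from.
function qLearningUpdate(
  q: Float64Array,
  state: number,
  action: number,
  reward: number,
  nextState: number,
  done: boolean,
  alpha: number,
  gamma: number
): void {
  const target = done ? reward : reward + gamma * maxQ(q, nextState);
  const i = qIndex(state, action);
  q[i] += alpha * (target - q[i]);
}

// Usage: a 4x4 grid (16 states), all Q-values starting at zero.
const q = new Float64Array(16 * NUM_ACTIONS);
q[qIndex(1, 2)] = 1.0; // pretend state 1 already has one valuable action
qLearningUpdate(q, 0, 1, 0, 1, false, 0.1, 0.99);
```

The target uses the best action in the next state even if the agent goes on to explore, which is what makes the method off-policy.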

Auto-Train Speed

Learning Rate (alpha): 0.1

Discount (gamma): 0.99

Epsilon: 1

Epsilon Decay: 0.995

Max Steps/Episode: 200
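
As a rough sketch of how the epsilon and epsilon-decay settings above typically interact, the snippet below shows epsilon-greedy action selection with a multiplicative decay. Whether the playground decays once per episode or once per step is not stated, so the per-call decay, the `minEpsilon` floor, and the function names are assumptions made for illustration.

```typescript
// Pick an action from one state's row of Q-values.
// `rng` is injectable so the behaviour can be checked deterministically.
function epsilonGreedy(
  qRow: Float64Array,
  epsilon: number,
  rng: () => number = Math.random
): number {
  if (rng() < epsilon) {
    // Explore: uniform random action.
    return Math.floor(rng() * qRow.length);
  }
  // Exploit: first action with the highest Q-value.
  let best = 0;
  for (let a = 1; a < qRow.length; a++) {
    if (qRow[a] > qRow[best]) best = a;
  }
  return best;
}

// Assumed schedule: multiply by the decay factor, never going below
// a hypothetical floor.
function decayEpsilon(
  epsilon: number,
  decay: number,
  minEpsilon: number = 0.01
): number {
  return Math.max(minEpsilon, epsilon * decay);
}

// Usage with the defaults shown on the page.
let epsilon = 1;
for (let episode = 0; episode < 100; episode++) {
  epsilon = decayEpsilon(epsilon, 0.995);
}
```

Under these assumptions, 100 decay steps at 0.995 bring epsilon from 1 down to roughly 0.61, so exploration fades gradually rather than abruptly.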

Episodes: 0

Total Steps: 0

Avg Reward (100): 0

Success Rate: 0%

Current Epsilon: 1.0000

Convergence: Learning...
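
The statistics above could be computed along the following lines. This sketch assumes that "Avg Reward (100)" is a mean over the most recent 100 episodes, that the success rate uses the same window, and that "success" means the episode ended at the goal. None of this is confirmed by the page, and `EpisodeResult` and `rollingStats` are hypothetical names.

```typescript
interface EpisodeResult {
  totalReward: number;  // sum of rewards collected in the episode
  reachedGoal: boolean; // assumed definition of "success"
}

// Mean reward and success fraction over the last `windowSize` episodes.
function rollingStats(
  history: EpisodeResult[],
  windowSize: number = 100
): { avgReward: number; successRate: number } {
  const recent = history.slice(-windowSize);
  if (recent.length === 0) {
    // Matches the zeros shown before any training has happened.
    return { avgReward: 0, successRate: 0 };
  }
  let rewardSum = 0;
  let wins = 0;
  for (const episode of recent) {
    rewardSum += episode.totalReward;
    if (episode.reachedGoal) wins++;
  }
  return {
    avgReward: rewardSum / recent.length,
    successRate: wins / recent.length,
  };
}

// Usage: three finished episodes, two of which reached the goal.
const history: EpisodeResult[] = [
  { totalReward: 1, reachedGoal: true },
  { totalReward: 0, reachedGoal: false },
  { totalReward: 1, reachedGoal: true },
];
const stats = rollingStats(history);
```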

Start (S)

Goal (G)

Wall

Hole (H)

Slippery (~)
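
One way to turn the legend above into a data structure is to parse a text map into cell types. The symbols S, G, H and ~ come from the legend. The `#` for walls and `.` for free cells are assumptions, as are the `Cell` type and `parseGrid` function. This is only a sketch of a possible encoding, not the playground's own map format.

```typescript
type Cell = "start" | "goal" | "wall" | "hole" | "slippery" | "free";

// S, G, H and ~ follow the legend; '#' and '.' are assumed symbols.
const SYMBOLS: Record<string, Cell> = {
  S: "start",
  G: "goal",
  H: "hole",
  "~": "slippery",
  "#": "wall",
  ".": "free",
};

interface Grid {
  cells: Cell[];      // row-major, so state index = y * width + x
  width: number;
  height: number;
  startState: number; // state index of the S cell
}

function parseGrid(rows: string[]): Grid {
  if (rows.length === 0) throw new Error("grid is empty");
  const width = rows[0].length;
  const cells: Cell[] = [];
  let startState = -1;
  for (let y = 0; y < rows.length; y++) {
    const row = rows[y];
    if (row.length !== width) throw new Error("row " + y + " has the wrong width");
    for (let x = 0; x < width; x++) {
      const cell = SYMBOLS[row[x]];
      if (cell === undefined) throw new Error("unknown symbol: " + row[x]);
      if (cell === "start") startState = y * width + x;
      cells.push(cell);
    }
  }
  if (startState < 0) throw new Error("grid has no start cell");
  return { cells, width, height: rows.length, startState };
}

// Usage: a tiny 3x2 map.
const grid = parseGrid(["S.~", "#HG"]);
```

Keeping the cells in row-major order means a state index can be used directly as a row offset into a flat Q-table.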

All algorithms are implemented from scratch, with Q-tables stored as Float64Arrays. There are 7 preset grid worlds, including Frozen Lake and Cliff Walking.
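
To illustrate how the other listed algorithms differ from Q-Learning, the sketch below computes the learning targets for SARSA and Expected SARSA, plus the discounted returns that Monte Carlo methods use. It assumes an epsilon-greedy policy for the Expected SARSA expectation and leaves terminal-state handling to the caller. The function names are hypothetical and not taken from the playground.

```typescript
// SARSA (on-policy): bootstrap from the action actually chosen next.
function sarsaTarget(
  reward: number,
  gamma: number,
  nextQ: Float64Array,
  nextAction: number
): number {
  return reward + gamma * nextQ[nextAction];
}

// Expected SARSA: bootstrap from the expected next Q-value under an
// (assumed) epsilon-greedy policy.
function expectedSarsaTarget(
  reward: number,
  gamma: number,
  nextQ: Float64Array,
  epsilon: number
): number {
  const n = nextQ.length;
  let greedy = 0;
  let sum = 0;
  for (let a = 0; a < n; a++) {
    sum += nextQ[a];
    if (nextQ[a] > nextQ[greedy]) greedy = a;
  }
  // Every action gets probability epsilon / n; the greedy action
  // additionally gets the remaining 1 - epsilon.
  const expected = (epsilon / n) * sum + (1 - epsilon) * nextQ[greedy];
  return reward + gamma * expected;
}

// Monte Carlo: discounted return G_t for every step of a finished
// episode, accumulated backwards from the final reward.
function discountedReturns(rewards: number[], gamma: number): number[] {
  const returns = new Array<number>(rewards.length);
  let g = 0;
  for (let t = rewards.length - 1; t >= 0; t--) {
    g = rewards[t] + gamma * g;
    returns[t] = g;
  }
  return returns;
}

// Usage: Q-values of the next state for four actions.
const nextQ = new Float64Array([1, 3, 2, 0]);
const sarsa = sarsaTarget(0, 0.5, nextQ, 2);
const expectedSarsa = expectedSarsaTarget(0, 1, nextQ, 1);
const returns = discountedReturns([0, 0, 1], 0.5);
```

With epsilon at 0 the Expected SARSA target coincides with the Q-Learning target, since all probability mass sits on the greedy action.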