Reinforcement Learning
Learning through interaction with an environment to maximize cumulative reward. From Q-learning to AlphaGo to RLHF.
Sub-topics
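Watkins' 1989 model-free RL algorithm (Q-learning) that learns action-value functions from sampled transitions. In the tabular case, converges to the optimal action values, and hence an optimal policy, without requiring a model of the environment, provided every state-action pair keeps being visited and the learning rate decays suitably.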
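As a rough illustration of the Q-learning update, here is a minimal tabular sketch in Python. The five-state chain environment, the function name q_learning, and every hyperparameter value are assumptions made up for this example; it is not taken from Watkins' work, and a fixed learning rate is used for simplicity.

```python
import random


def q_learning(n_states=5, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical 1-D chain.

    States are 0..n_states-1, actions are 0 (left) and 1 (right).
    Entering the last state gives reward 1 and ends the episode.
    """
    rng = random.Random(seed)
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != goal:
            # Epsilon-greedy action selection, breaking ties at random.
            if rng.random() < epsilon or q[s][0] == q[s][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[s][0] > q[s][1] else 1
            # Environment step (known only to the simulator, not the learner).
            s_next = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s_next == goal else 0.0
            # Bootstrap from the best next action; terminal states have value 0.
            bootstrap = 0.0 if s_next == goal else gamma * max(q[s_next])
            q[s][a] += alpha * (r + bootstrap - q[s][a])
            s = s_next
    return q


if __name__ == "__main__":
    for state, (left, right) in enumerate(q_learning()):
        print(state, round(left, 3), round(right, 3))
```

The learner only ever sees (state, action, reward, next state) samples, which is what "model-free" refers to here.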
RL algorithms that directly optimize the policy by gradient ascent on expected reward. REINFORCE (Williams, 1992) and actor-critic methods are foundational variants.
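To make the gradient-ascent idea concrete, the following is a minimal REINFORCE-style sketch in Python on a two-armed bandit with a softmax policy. The bandit, the arm success probabilities, the step size, and the function names are all assumptions chosen for illustration, and no baseline is used, so the gradient estimate is noisier than in practical implementations.

```python
import math
import random


def softmax(prefs):
    """Convert action preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(arm_means=(0.2, 0.8), steps=2000, lr=0.1, seed=0):
    """REINFORCE without a baseline on a hypothetical Bernoulli bandit.

    Returns the final action probabilities of the softmax policy.
    """
    rng = random.Random(seed)
    theta = [0.0] * len(arm_means)
    for _ in range(steps):
        probs = softmax(theta)
        a = rng.choices(range(len(theta)), weights=probs)[0]
        reward = 1.0 if rng.random() < arm_means[a] else 0.0
        # For a softmax policy, d/d theta_k of log pi(a) is 1[k == a] - pi(k).
        for k in range(len(theta)):
            grad_log = (1.0 if k == a else 0.0) - probs[k]
            theta[k] += lr * reward * grad_log  # ascend the sampled gradient
    return softmax(theta)


if __name__ == "__main__":
    print(reinforce_bandit())
```

Actor-critic methods replace the raw reward in this update with a learned value-based signal, which typically lowers variance.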
Combining deep neural networks with RL. DeepMind's DQN (2013) learned to play Atari games directly from pixels. Deep function approximation let RL scale to high-dimensional observations and, with later methods, to large or continuous action spaces.
DeepMind's AlphaGo defeated world champion Lee Sedol at Go in March 2016. Combined deep neural networks with Monte Carlo tree search. A landmark for AI.
DeepMind's AlphaStar reached Grandmaster level in StarCraft II in 2019, mastering a real-time strategy game with imperfect information and a very large combinatorial action space.
Reinforcement Learning from Human Feedback, introduced by Christiano et al. (2017). Trains a reward model on human preference comparisons, then optimizes the policy against that learned reward. Central to aligning LLMs such as ChatGPT and Claude.
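One common way to turn pairwise human preferences into a reward signal is a Bradley-Terry style loss on the difference between two reward scores. The Python sketch below fits a tiny linear reward model with that loss; the feature vectors, the linear model, the function names, and the hyperparameters are hypothetical and chosen only to show the shape of the computation, not how any production system is built.

```python
import math


def reward(w, x):
    """Hypothetical linear reward model: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))


def preference_loss(r_chosen, r_rejected):
    """Negative log-likelihood that the chosen item wins: -log sigmoid(r_c - r_r)."""
    diff = r_chosen - r_rejected
    # Numerically stable form of log(1 + exp(-diff)).
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def fit_linear_reward(pairs, dim, lr=0.5, epochs=200):
    """Fit weights by gradient descent on the preference loss.

    pairs is a list of (chosen_features, rejected_features) tuples.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            diff = reward(w, chosen) - reward(w, rejected)
            # Derivative of the loss with respect to diff is -sigmoid(-diff).
            g = -1.0 / (1.0 + math.exp(diff))
            for i in range(dim):
                w[i] -= lr * g * (chosen[i] - rejected[i])
    return w


if __name__ == "__main__":
    # Made-up comparisons in which a larger first feature is always preferred.
    data = [((1.0, 0.0), (0.0, 0.0)), ((2.0, 1.0), (1.0, 1.0))]
    print(fit_linear_reward(data, dim=2))
```

In a full pipeline the fitted reward model would then be used as the reward for a policy-optimization step, which this sketch leaves out.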
RL with multiple interacting agents that cooperate or compete. Self-play in AlphaGo Zero and multi-agent training in AlphaStar are landmark examples.
RL methods that learn a model of environment dynamics and plan using it. Typically more sample-efficient than model-free methods, but they require an accurate world model because model errors compound during planning.
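The learn-then-plan pattern can be sketched in a few lines of Python. Here a tabular model of a hypothetical deterministic five-state chain is estimated from random samples, and value iteration is then run on the learned model alone. The environment, the function names true_step, learn_model, and plan, and all parameter values are assumptions for illustration.

```python
import random


def true_step(s, a, n_states=5):
    """Hypothetical environment: action 0 moves left, 1 moves right; entering the last state pays 1."""
    goal = n_states - 1
    s_next = max(0, s - 1) if a == 0 else min(goal, s + 1)
    r = 1.0 if (s_next == goal and s != goal) else 0.0
    return s_next, r


def learn_model(n_states=5, samples=200, seed=0):
    """Record observed (next_state, reward) for randomly sampled non-terminal (state, action) pairs."""
    rng = random.Random(seed)
    model = {}
    for _ in range(samples):
        s = rng.randrange(n_states - 1)
        a = rng.randrange(2)
        model[(s, a)] = true_step(s, a, n_states)
    return model


def plan(model, n_states=5, gamma=0.9, sweeps=50):
    """Value iteration that queries only the learned model, never the environment."""
    goal = n_states - 1
    v = [0.0] * n_states
    for _ in range(sweeps):
        for s in range(goal):
            backups = [r + gamma * v[s2]
                       for (s0, _a), (s2, r) in model.items() if s0 == s]
            v[s] = max(backups, default=0.0)  # unseen states keep value 0
    return v


if __name__ == "__main__":
    print(plan(learn_model()))
```

Because the toy environment is deterministic, one observation per state-action pair gives an exact model; with stochastic or high-dimensional dynamics the model is only approximate, and planning inherits its errors.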