domain reinforcement ★ seed

Reinforcement Learning

Learning through interaction with an environment to maximize cumulative reward. From Q-learning to AlphaGo to RLHF.

#reinforcement-learning #reward #policy #agent

Sub-topics

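Q-learning: Watkins' 1989 model-free RL algorithm that learns action-value functions. In the tabular setting it converges to the optimal action values, and hence an optimal policy, without requiring a model of the environment, provided every state-action pair keeps being visited and the learning rate decays appropriately.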
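The update behind this algorithm can be sketched on a toy problem. The following is a minimal illustration, not a reference implementation: the 5-state chain environment, the uniform random exploration policy, and the values of alpha and gamma are all assumptions chosen to keep the example small.

```python
import random

N_STATES, GOAL = 5, 4          # hypothetical chain: states 0..4, goal at 4
ACTIONS = (0, 1)               # 0 = left, 1 = right

def step(state, action):
    """Toy deterministic dynamics; the agent only ever sees sampled transitions."""
    nxt = max(state - 1, 0) if action == 0 else state + 1
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL

def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Uniform exploration (an assumption); Q-learning is off-policy,
            # so it can still learn the greedy policy's values.
            a = rng.choice(ACTIONS)
            s2, r, done = step(s, a)
            # Bootstrap from the best next action; terminal states have value 0.
            target = r if done else r + gamma * max(q[s2])
            q[s][a] += alpha * (target - q[s][a])
            s = s2
    return q

q = q_learning()
greedy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(GOAL)]
print(greedy)  # greedy action per non-terminal state
```

No transition table is ever given to the learner; it improves its estimates purely from sampled (state, action, reward, next state) tuples, which is what "model-free" refers to.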

Policy Gradient Methods concept

RL algorithms that directly optimize a parameterized policy by gradient ascent on expected return, rather than deriving the policy from a value function. REINFORCE (Williams, 1992) and actor-critic methods are foundational variants.
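As a rough illustration of the idea, here is a REINFORCE-style update on a two-armed bandit with a softmax policy. The bandit, its reward values, the learning rate, and the function names are assumptions for the sketch; no baseline or other variance reduction is included.

```python
import math
import random

def softmax(logits):
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_bandit(arm_rewards=(0.0, 1.0), steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * len(arm_rewards)      # policy parameters (one logit per arm)
    for _ in range(steps):
        probs = softmax(theta)
        a = rng.choices(range(len(theta)), weights=probs)[0]
        r = arm_rewards[a]                # one-step episode, so return = reward
        # For a softmax policy, d log pi(a) / d theta_i = 1[i == a] - pi(i).
        for i in range(len(theta)):
            grad_log = (1.0 if i == a else 0.0) - probs[i]
            theta[i] += lr * r * grad_log  # gradient ascent step
    return softmax(theta)

print(reinforce_bandit())  # probability mass should shift toward the better arm
```

The policy is adjusted directly: actions followed by higher return have their log-probability pushed up, with no action-value table involved.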

Deep Reinforcement Learning topic

Combines deep neural networks with RL. DeepMind's DQN (2013) learned to play Atari games directly from pixels, showing that RL can scale to high-dimensional observation spaces; later methods extended this to large and continuous action spaces.

AlphaGo (2016) concept

DeepMind's AlphaGo defeated world champion Lee Sedol 4-1 at Go in March 2016. It combined deep policy and value networks with Monte Carlo tree search. Widely regarded as a landmark for AI.
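To show how network outputs can steer tree search, the sketch below implements a simplified PUCT-style selection rule of the kind used in AlphaGo-like systems: each child's score is its mean value plus an exploration bonus weighted by a prior probability. The dictionary layout, the function name, and the constant c_puct are hypothetical, and a full search would also need expansion, evaluation, and backup steps that are omitted here.

```python
import math

def puct_select(children, c_puct=1.0):
    """Return the index of the child with the highest value-plus-exploration score.

    children: list of dicts with keys 'prior', 'visits', 'value_sum'
    (a hypothetical layout chosen for this sketch).
    """
    parent_visits = sum(ch["visits"] for ch in children)

    def score(ch):
        # Mean value of the child so far; unvisited children default to 0.
        q = ch["value_sum"] / ch["visits"] if ch["visits"] else 0.0
        # Bonus grows with the prior and shrinks as the child is visited more.
        u = c_puct * ch["prior"] * math.sqrt(parent_visits) / (1 + ch["visits"])
        return q + u

    return max(range(len(children)), key=lambda i: score(children[i]))

children = [
    {"prior": 0.6, "visits": 10, "value_sum": 5.0},
    {"prior": 0.4, "visits": 0, "value_sum": 0.0},
]
print(puct_select(children))  # the unvisited child gets a large exploration bonus
```

Early in the search the prior dominates, so promising but unexplored moves get tried; as visit counts grow, the measured mean value takes over.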

AlphaStar (2019) concept

DeepMind's AlphaStar reached Grandmaster level in StarCraft II in 2019, mastering a real-time strategy game with imperfect information and a very large, combinatorial action space.

RLHF (2017) concept

Reinforcement Learning from Human Feedback, introduced by Christiano et al. (2017). Trains a reward model on human preference comparisons, then optimizes the policy against that learned reward. Critical for aligning LLMs like ChatGPT and Claude.
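One common way to turn pairwise human preferences into a training signal is a Bradley-Terry style loss on the reward model's scores. The sketch below shows that loss for a single comparison; the function name and the scalar-score interface are assumptions, and a real system would compute the scores with a neural network and average the loss over many labeled pairs.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood that the human-preferred sample wins
    under a Bradley-Terry model of the two reward scores."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) computed stably as softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Loss is small when the preferred sample already scores higher,
# and large when the reward model ranks the pair the wrong way round.
print(preference_loss(2.0, 0.0))
print(preference_loss(0.0, 2.0))
```

Minimizing this loss pushes the reward model to score preferred outputs above rejected ones; the resulting reward is then what the RL step optimizes.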

Multi-Agent RL concept

RL with multiple interacting agents that cooperate or compete. Self-play in AlphaGo Zero and multi-agent training in AlphaStar are landmark examples.

Model-Based RL concept

RL methods that learn a model of environment dynamics and plan using it. Often more sample-efficient than model-free methods, but performance depends on the accuracy of the learned model.
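The learn-then-plan loop can be sketched in a few lines. This example assumes a hypothetical 5-state deterministic chain, a learner that can sample any state-action pair, and a lookup-table model; real methods typically fit a stochastic or neural model and must cope with model error.

```python
import random

GOAL = 4  # hypothetical chain: states 0..4, goal (terminal) at state 4

def true_step(s, a):
    """Real environment, treated as a black box by the learner."""
    s2 = max(s - 1, 0) if a == 0 else s + 1
    return s2, (1.0 if s2 == GOAL else 0.0)

def learn_model(n_samples=200, seed=0):
    """Record observed transitions; sufficient because dynamics are deterministic."""
    rng = random.Random(seed)
    model = {}
    for _ in range(n_samples):
        s, a = rng.randrange(GOAL), rng.choice((0, 1))
        model[(s, a)] = true_step(s, a)
    return model

def plan(model, gamma=0.9, sweeps=50):
    """Value iteration that queries only the learned model, never the environment."""
    v = [0.0] * (GOAL + 1)
    for _ in range(sweeps):
        for s in range(GOAL):
            v[s] = max(
                (r + gamma * v[s2] for (ms, _), (s2, r) in model.items() if ms == s),
                default=0.0,  # states with no recorded transitions keep value 0
            )
    return v

print(plan(learn_model()))  # state values computed from the learned model
```

All planning sweeps reuse the same 200 recorded transitions, which is where the sample-efficiency argument comes from; any transition the model records incorrectly would propagate straight into the plan.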