Reinforcement Learning

Learn through interaction, rewards, and exploration

Interactive Q-Learning simulation, grid world environment, and hands-on practice

What is Reinforcement Learning?

Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by interacting with an environment. The agent receives rewards or penalties based on its actions and learns to maximize cumulative rewards over time through trial and error.

Key Components

  • Agent: The learner or decision maker
  • Environment: The world the agent interacts with
  • State: Current situation of the agent
  • Action: Choices available to the agent
  • Reward: Feedback from the environment
  • Policy: Strategy for choosing actions
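To make these terms concrete, here is a tiny, self-contained sketch of the agent-environment loop. The ToyEnv task and the random policy are invented for illustration only and are not part of the simulation below.

import random

# A made-up 1-D task: the agent starts at position 0 and tries to reach position 3.
class ToyEnv:
    def reset(self):
        self.pos = 0                      # state: position on a line
        return self.pos

    def step(self, action):               # action: -1 (left) or +1 (right)
        self.pos += action
        done = self.pos == 3              # reaching the goal ends the episode
        reward = 10 if done else -1       # reward: feedback from the environment
        return self.pos, reward, done

env = ToyEnv()                            # the environment
state = env.reset()                       # the agent's current state
total_reward, done = 0, False

for t in range(100):                      # cap the episode length
    action = random.choice([-1, +1])      # a (deliberately bad) random policy
    state, reward, done = env.step(action)
    total_reward += reward                # the agent's objective: maximize this sum
    if done:
        break

print("episode return:", total_reward)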

Applications

  • Game playing (Chess, Go, Atari)
  • Robotics and control systems
  • Autonomous vehicles
  • Resource management
  • Recommendation systems
  • Trading and finance

Core principles: agent-based learning, reward maximization, policy optimization, iterative learning.

Types of Reinforcement Learning

Model-Based RL

Agent learns a model of the environment and uses it for planning.

  • Dyna-Q
  • Monte Carlo Tree Search
  • World Models

Model-Free RL

Agent learns directly from experience without modeling the environment.

  • Q-Learning
  • SARSA
  • Deep Q-Networks

Policy Gradient

Directly optimizes the policy rather than deriving it from a value function.

  • REINFORCE
  • Actor-Critic
  • PPO, A3C

Interactive Q-Learning Grid World

Watch the agent learn to navigate from start to goal while avoiding obstacles and collecting rewards!

[Interactive simulation: live counters track Episodes, Total Reward, Average Reward, and Steps, alongside a chart of reward progress over episodes.]

How it works:
  • Agent (🤖): Learns to navigate the grid
  • Goal (🎯): Target destination (+100 reward)
  • Obstacles (⛔): Blocks to avoid (-100 penalty)
  • Rewards (💎): Bonus points (+50 reward)
  • Empty cells: -1 reward per step
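Expressed as data, this reward scheme is just a lookup from cell type to reward; a minimal sketch (the names below are illustrative, not taken from the simulation's code):

# Reward per cell type, matching the list above
CELL_REWARDS = {
    "goal": +100,      # target destination
    "obstacle": -100,  # penalty for hitting a blocked cell
    "bonus": +50,      # extra reward cells
    "empty": -1,       # small step cost, so shorter paths score higher
}

# e.g. reaching the goal after 8 empty-cell steps plus one bonus cell
print(8 * CELL_REWARDS["empty"] + CELL_REWARDS["bonus"] + CELL_REWARDS["goal"])  # 142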

Q-Learning Algorithm

Algorithm Steps:

1. Initialize Q-Table: create a table Q(s,a) filled with zeros for every state-action pair.

2. Choose Action: select an action with an ε-greedy policy, exploring randomly with probability ε and otherwise exploiting the best known action (sketched in code after this list).

3. Take Action & Observe: execute the action, then observe the next state and reward returned by the environment.

4. Update Q-Value: Q(s,a) ← Q(s,a) + α[r + γ max_{a'} Q(s',a') - Q(s,a)]

5. Repeat: continue until the Q-values converge or the maximum number of episodes is reached.
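Step 2 is the only part that is not pure bookkeeping; as a minimal sketch of ε-greedy selection (the table shape and ε value below are illustrative):

import random
import numpy as np

def epsilon_greedy(q_table, state, n_actions, epsilon=0.1):
    # Explore: with probability epsilon, pick any action uniformly at random
    if random.random() < epsilon:
        return random.randrange(n_actions)
    # Exploit: otherwise pick the action with the highest current Q-value
    return int(np.argmax(q_table[state]))

q_table = np.zeros((25, 4))               # 25 states x 4 actions, all zeros initially
print(epsilon_greedy(q_table, state=0, n_actions=4))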

Q-Learning Update Equation:

Q(s,a) ← Q(s,a) + α[r + γ max_{a'} Q(s',a') - Q(s,a)]

Where:

  • Q(s,a) = Quality of action a in state s
  • α = Learning rate (0 to 1)
  • r = Immediate reward
  • γ = Discount factor (0 to 1)
  • s' = Next state
  • a' = Action considered in the next state (Q-learning maximizes over it)
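For intuition, here is the update carried out once by hand with illustrative numbers (they are not taken from the simulation):

alpha, gamma = 0.1, 0.9     # learning rate and discount factor
q_sa = 2.0                  # current estimate Q(s, a)
reward = -1                 # immediate reward r
max_next_q = 5.0            # max over a' of Q(s', a')

td_target = reward + gamma * max_next_q   # -1 + 0.9 * 5.0 = 3.5
td_error = td_target - q_sa               # 3.5 - 2.0 = 1.5
q_sa = q_sa + alpha * td_error            # 2.0 + 0.1 * 1.5 = 2.15
print(q_sa)                               # ~2.15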

Advantages & Disadvantages:

Advantages
  • Model-free: no environment model needed
  • Off-policy: can learn the optimal policy while following an exploratory one
  • Simple to implement
  • Converges to the optimal Q-values for tabular problems, given sufficient exploration and a decaying learning rate
Disadvantages
  • Requires discrete state/action spaces
  • Slow convergence for large spaces
  • Memory intensive for large Q-tables
  • Performance depends on tuning the exploration-exploitation tradeoff

Other RL Algorithms

SARSA

Description: On-policy TD control algorithm that updates Q-values based on the action actually taken.

Update Rule: Q(s,a) ← Q(s,a) + α[r + γQ(s',a') - Q(s,a)]

Best for: Safe learning, when exploration matters
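A minimal sketch of the tabular SARSA update, for comparison with Q-learning (the table size and hyperparameters are illustrative):

import numpy as np

def sarsa_update(q_table, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # On-policy: bootstrap from the action a' the policy actually chose in s',
    # not from the greedy max over actions as Q-learning does.
    td_target = r + gamma * q_table[s_next, a_next]
    q_table[s, a] += alpha * (td_target - q_table[s, a])

q_table = np.zeros((25, 4))
sarsa_update(q_table, s=0, a=3, r=-1, s_next=1, a_next=2)
print(q_table[0, 3])   # -0.1

Replacing q_table[s_next, a_next] with np.max(q_table[s_next]) recovers the Q-learning update.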

Deep Q-Networks (DQN)

Description: Uses deep neural networks to approximate Q-values for high-dimensional state spaces.

Key Features: Experience replay, target network, handles continuous states

Best for: Complex environments like Atari games, robotics
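The sketch below shows one DQN update step, assuming PyTorch is available; the network sizes, hyperparameters, and randomly generated transitions are illustrative placeholders, not a reference implementation.

import random
from collections import deque

import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 2, 0.99

# Online network plus a periodically synchronized target network
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# Experience replay buffer of (state, action, reward, next_state, done) tuples
replay_buffer = deque(maxlen=10_000)
for _ in range(1_000):  # fake transitions, only so this sketch runs on its own
    replay_buffer.append((torch.randn(state_dim), random.randrange(n_actions),
                          random.random(), torch.randn(state_dim), False))

def train_step(batch_size=32):
    states, actions, rewards, next_states, dones = zip(*random.sample(replay_buffer, batch_size))
    states, next_states = torch.stack(states), torch.stack(next_states)
    actions = torch.tensor(actions)
    rewards = torch.tensor(rewards, dtype=torch.float32)
    dones = torch.tensor(dones, dtype=torch.float32)

    # Q(s, a) from the online network for the actions that were actually taken
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Bootstrapped target r + gamma * max_a' Q_target(s', a'), cut off at terminal states
    with torch.no_grad():
        targets = rewards + gamma * (1 - dones) * target_net(next_states).max(dim=1).values
    loss = nn.functional.mse_loss(q_values, targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(f"loss after one update: {train_step():.4f}")

In a full training loop, transitions would come from interacting with the environment under an ε-greedy policy, and the target network's weights would be copied from q_net every few thousand steps.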

Policy Gradient Methods

Description: Directly optimizes the policy by gradient ascent on expected reward.

Algorithms: REINFORCE, Actor-Critic, PPO, A3C

Best for: Continuous action spaces, stochastic policies
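As a minimal REINFORCE sketch, the toy one-step task below rewards the policy for picking action 1 when the state's first feature is positive; the task, network size, and learning rate are invented for illustration (PyTorch assumed).

import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

for step in range(500):
    states = torch.randn(64, 2)                       # a batch of one-step episodes
    dist = torch.distributions.Categorical(logits=policy(states))
    actions = dist.sample()                           # sample from the stochastic policy
    returns = (actions == (states[:, 0] > 0).long()).float()  # +1 if "correct", else 0

    # REINFORCE: ascend E[log pi(a|s) * return], i.e. descend its negative
    loss = -(dist.log_prob(actions) * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("average return in final batch:", returns.mean().item())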

Python Implementation

Q-Learning Implementation

import numpy as np
import random


class QLearningAgent:
    def __init__(self, states, actions, learning_rate=0.1,
                 discount_factor=0.9, exploration_rate=0.1):
        self.states = states
        self.actions = actions
        self.learning_rate = learning_rate          # alpha
        self.discount_factor = discount_factor      # gamma
        self.exploration_rate = exploration_rate    # epsilon
        self.q_table = np.zeros((states, actions))  # Q(s, a) initialized to zero

    def choose_action(self, state):
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit
        if random.uniform(0, 1) < self.exploration_rate:
            return random.choice(range(self.actions))
        else:
            return np.argmax(self.q_table[state])

    def update_q_value(self, state, action, reward, next_state):
        # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
        current_q = self.q_table[state, action]
        max_next_q = np.max(self.q_table[next_state])
        new_q = current_q + self.learning_rate * (
            reward + self.discount_factor * max_next_q - current_q
        )
        self.q_table[state, action] = new_q

    def get_optimal_policy(self):
        # Greedy policy: the best-known action for every state
        return np.argmax(self.q_table, axis=1)


# Training loop (assumes an environment object `env`, e.g. the GridWorld below)
agent = QLearningAgent(states=100, actions=4)

for episode in range(1000):
    state = env.reset()
    done = False
    while not done:
        action = agent.choose_action(state)
        next_state, reward, done = env.step(action)
        agent.update_q_value(state, action, reward, next_state)
        state = next_state

print("Training completed!")
print(f"Optimal policy: {agent.get_optimal_policy()}")

Grid World Environment

class GridWorld:
    def __init__(self, size=5):
        self.size = size
        self.agent_position = [0, 0]
        self.goal_position = [size - 1, size - 1]
        self.obstacles = [[1, 1], [2, 2]]
        # Actions: 0 = up, 1 = down, 2 = left, 3 = right
        self.actions = {
            0: [-1, 0],
            1: [1, 0],
            2: [0, -1],
            3: [0, 1]
        }

    def reset(self):
        self.agent_position = [0, 0]
        return self.get_state()

    def step(self, action):
        move = self.actions[action]
        new_position = [
            self.agent_position[0] + move[0],
            self.agent_position[1] + move[1]
        ]
        # Moves off the grid or into an obstacle are ignored (the agent stays put)
        if self.is_valid_position(new_position):
            self.agent_position = new_position
        reward = self.get_reward()
        done = self.agent_position == self.goal_position
        return self.get_state(), reward, done

    def is_valid_position(self, position):
        if position in self.obstacles:
            return False
        if position[0] < 0 or position[0] >= self.size:
            return False
        if position[1] < 0 or position[1] >= self.size:
            return False
        return True

    def get_reward(self):
        if self.agent_position == self.goal_position:
            return 100   # reaching the goal
        elif self.agent_position in self.obstacles:
            return -100  # never triggered here, since obstacles are impassable
        else:
            return -1    # step cost for every other move

    def get_state(self):
        # Encode the (row, column) position as a single integer state index
        return self.agent_position[0] * self.size + self.agent_position[1]


env = GridWorld(size=5)
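With both classes defined, the two listings can be wired together end to end; the episode count below is illustrative, and note the grid has size * size = 25 states rather than the 100 used in the standalone example above.

# Train the tabular agent on the 5x5 grid world
env = GridWorld(size=5)
agent = QLearningAgent(states=env.size * env.size, actions=4)

for episode in range(500):
    state = env.reset()
    done = False
    while not done:
        action = agent.choose_action(state)
        next_state, reward, done = env.step(action)
        agent.update_q_value(state, action, reward, next_state)
        state = next_state

# Best action per cell (0 = up, 1 = down, 2 = left, 3 = right), shown as a 5x5 grid
print(agent.get_optimal_policy().reshape(env.size, env.size))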

Test Your Knowledge

Answer these questions to test your understanding of reinforcement learning concepts.

Question 1: What is the main goal of reinforcement learning?
A) Maximize cumulative reward over time
B) Minimize prediction error
C) Classify data into categories
D) Find patterns in unlabeled data

Question 2: In Q-Learning, what does the Q-value represent?
A) The current state of the agent
B) The expected cumulative reward for taking an action in a state
C) The probability of reaching the goal
D) The number of steps taken

Question 3: What is the exploration-exploitation tradeoff?
A) Choosing between different algorithms
B) Balancing trying new actions vs using known good actions
C) Trading speed for accuracy
D) Choosing between supervised and unsupervised learning

Question 4: What does the discount factor (γ) control?
A) The learning speed
B) The exploration rate
C) The importance of future rewards vs immediate rewards
D) The size of the Q-table

Question 5: Which is an example of model-free RL?
A) Q-Learning
B) Monte Carlo Tree Search
C) Dynamic Programming
D) World Models

Question 6: What is a policy in reinforcement learning?
A) The reward function
B) A strategy that maps states to actions
C) The environment model
D) The learning rate parameter