Reinforcement Learning

Learn through interaction, rewards, and exploration

Interactive Q-Learning simulation, grid world environment, and hands-on practice

What is Reinforcement Learning?

Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by interacting with an environment. The agent receives rewards or penalties based on its actions and learns to maximize cumulative rewards over time through trial and error.

Key Components

  • Agent: The learner or decision maker
  • Environment: The world the agent interacts with
  • State: Current situation of the agent
  • Action: Choices available to the agent
  • Reward: Feedback from the environment
  • Policy: Strategy for choosing actions
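To make these terms concrete, here is a tiny, self-contained sketch of the agent-environment loop. The ToyEnv task and the random policy are invented for illustration only and are not part of the simulation below.

import random

# A made-up 1-D task: the agent starts at position 0 and tries to reach position 3.
class ToyEnv:
    def reset(self):
        self.pos = 0                      # state: position on a line
        return self.pos

    def step(self, action):               # action: -1 (left) or +1 (right)
        self.pos += action
        done = self.pos == 3              # reaching the goal ends the episode
        reward = 10 if done else -1       # reward: feedback from the environment
        return self.pos, reward, done

env = ToyEnv()                            # the environment
state = env.reset()                       # the agent's current state
total_reward, done = 0, False

for t in range(100):                      # cap the episode length
    action = random.choice([-1, +1])      # a (deliberately bad) random policy
    state, reward, done = env.step(action)
    total_reward += reward                # the agent's objective: maximize this sum
    if done:
        break

print("episode return:", total_reward)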

Applications

  • Game playing (Chess, Go, Atari)
  • Robotics and control systems
  • Autonomous vehicles
  • Resource management
  • Recommendation systems
  • Trading and finance

Core principles: agent-based learning, reward maximization, policy optimization, iterative learning.

Types of Reinforcement Learning

Model-Based RL

Agent learns a model of the environment and uses it for planning.

  • Dyna-Q
  • Monte Carlo Tree Search
  • World Models

Model-Free RL

Agent learns directly from experience without modeling the environment.

  • Q-Learning
  • SARSA
  • Deep Q-Networks

Policy Gradient

Directly optimizes the policy rather than deriving it from a value function.

  • REINFORCE
  • Actor-Critic
  • PPO, A3C

Interactive Q-Learning Grid World

Watch the agent learn to navigate from start to goal while avoiding obstacles and collecting rewards!

[Interactive simulation: live counters track Episodes, Total Reward, Average Reward, and Steps, alongside a chart of reward progress over episodes.]

How it works:
  • Agent (🤖): Learns to navigate the grid
  • Goal (🎯): Target destination (+100 reward)
  • Obstacles (⛔): Blocks to avoid (-100 penalty)
  • Rewards (💎): Bonus points (+50 reward)
  • Empty cells: -1 reward per step
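Expressed as data, this reward scheme is just a lookup from cell type to reward; a minimal sketch (the names below are illustrative, not taken from the simulation's code):

# Reward per cell type, matching the list above
CELL_REWARDS = {
    "goal": +100,      # target destination
    "obstacle": -100,  # penalty for hitting a blocked cell
    "bonus": +50,      # extra reward cells
    "empty": -1,       # small step cost, so shorter paths score higher
}

# e.g. reaching the goal after 8 empty-cell steps plus one bonus cell
print(8 * CELL_REWARDS["empty"] + CELL_REWARDS["bonus"] + CELL_REWARDS["goal"])  # 142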

Q-Learning Algorithm

Algorithm Steps:

1. Initialize Q-Table: create a table Q(s,a) filled with zeros for every state-action pair.

2. Choose Action: select an action with an ε-greedy policy, exploring randomly with probability ε and otherwise exploiting the best known action (sketched in code after this list).

3. Take Action & Observe: execute the action, then observe the next state and reward returned by the environment.

4. Update Q-Value: Q(s,a) ← Q(s,a) + α[r + γ max_{a'} Q(s',a') - Q(s,a)]

5. Repeat: continue until the Q-values converge or the maximum number of episodes is reached.
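Step 2 is the only part that is not pure bookkeeping; as a minimal sketch of ε-greedy selection (the table shape and ε value below are illustrative):

import random
import numpy as np

def epsilon_greedy(q_table, state, n_actions, epsilon=0.1):
    # Explore: with probability epsilon, pick any action uniformly at random
    if random.random() < epsilon:
        return random.randrange(n_actions)
    # Exploit: otherwise pick the action with the highest current Q-value
    return int(np.argmax(q_table[state]))

q_table = np.zeros((25, 4))               # 25 states x 4 actions, all zeros initially
print(epsilon_greedy(q_table, state=0, n_actions=4))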

Q-Learning Update Equation:

Q(s,a) ← Q(s,a) + α[r + γ max_{a'} Q(s',a') - Q(s,a)]

Where:

  • Q(s,a) = Quality of action a in state s
  • α = Learning rate (0 to 1)
  • r = Immediate reward
  • γ = Discount factor (0 to 1)
  • s' = Next state
  • a' = Action considered in the next state (Q-learning maximizes over it)
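For intuition, here is the update carried out once by hand with illustrative numbers (they are not taken from the simulation):

alpha, gamma = 0.1, 0.9     # learning rate and discount factor
q_sa = 2.0                  # current estimate Q(s, a)
reward = -1                 # immediate reward r
max_next_q = 5.0            # max over a' of Q(s', a')

td_target = reward + gamma * max_next_q   # -1 + 0.9 * 5.0 = 3.5
td_error = td_target - q_sa               # 3.5 - 2.0 = 1.5
q_sa = q_sa + alpha * td_error            # 2.0 + 0.1 * 1.5 = 2.15
print(q_sa)                               # ~2.15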

Advantages & Disadvantages:

Advantages
  • Model-free: no environment model needed
  • Off-policy: can learn the optimal policy while following an exploratory one
  • Simple to implement
  • Converges to the optimal Q-values for tabular problems, given sufficient exploration and a decaying learning rate
Disadvantages
  • Requires discrete state/action spaces
  • Slow convergence for large spaces
  • Memory intensive for large Q-tables
  • Performance depends on tuning the exploration-exploitation tradeoff

Other RL Algorithms

SARSA

Description: On-policy TD control algorithm that updates Q-values based on the action actually taken.

Update Rule: Q(s,a) ← Q(s,a) + α[r + γQ(s',a') - Q(s,a)]

Best for: Safe learning, when exploration matters
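A minimal sketch of the tabular SARSA update, for comparison with Q-learning (the table size and hyperparameters are illustrative):

import numpy as np

def sarsa_update(q_table, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # On-policy: bootstrap from the action a' the policy actually chose in s',
    # not from the greedy max over actions as Q-learning does.
    td_target = r + gamma * q_table[s_next, a_next]
    q_table[s, a] += alpha * (td_target - q_table[s, a])

q_table = np.zeros((25, 4))
sarsa_update(q_table, s=0, a=3, r=-1, s_next=1, a_next=2)
print(q_table[0, 3])   # -0.1

Replacing q_table[s_next, a_next] with np.max(q_table[s_next]) recovers the Q-learning update.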

Deep Q-Networks (DQN)

Description: Uses deep neural networks to approximate Q-values for high-dimensional state spaces.

Key Features: Experience replay, target network, handles continuous states

Best for: Complex environments like Atari games, robotics
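The sketch below shows one DQN update step, assuming PyTorch is available; the network sizes, hyperparameters, and randomly generated transitions are illustrative placeholders, not a reference implementation.

import random
from collections import deque

import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 2, 0.99

# Online network plus a periodically synchronized target network
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# Experience replay buffer of (state, action, reward, next_state, done) tuples
replay_buffer = deque(maxlen=10_000)
for _ in range(1_000):  # fake transitions, only so this sketch runs on its own
    replay_buffer.append((torch.randn(state_dim), random.randrange(n_actions),
                          random.random(), torch.randn(state_dim), False))

def train_step(batch_size=32):
    states, actions, rewards, next_states, dones = zip(*random.sample(replay_buffer, batch_size))
    states, next_states = torch.stack(states), torch.stack(next_states)
    actions = torch.tensor(actions)
    rewards = torch.tensor(rewards, dtype=torch.float32)
    dones = torch.tensor(dones, dtype=torch.float32)

    # Q(s, a) from the online network for the actions that were actually taken
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Bootstrapped target r + gamma * max_a' Q_target(s', a'), cut off at terminal states
    with torch.no_grad():
        targets = rewards + gamma * (1 - dones) * target_net(next_states).max(dim=1).values
    loss = nn.functional.mse_loss(q_values, targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(f"loss after one update: {train_step():.4f}")

In a full training loop, transitions would come from interacting with the environment under an ε-greedy policy, and the target network's weights would be copied from q_net every few thousand steps.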

Policy Gradient Methods

Description: Directly optimizes the policy by gradient ascent on expected reward.

Algorithms: REINFORCE, Actor-Critic, PPO, A3C

Best for: Continuous action spaces, stochastic policies
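As a minimal REINFORCE sketch, the toy one-step task below rewards the policy for picking action 1 when the state's first feature is positive; the task, network size, and learning rate are invented for illustration (PyTorch assumed).

import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

for step in range(500):
    states = torch.randn(64, 2)                       # a batch of one-step episodes
    dist = torch.distributions.Categorical(logits=policy(states))
    actions = dist.sample()                           # sample from the stochastic policy
    returns = (actions == (states[:, 0] > 0).long()).float()  # +1 if "correct", else 0

    # REINFORCE: ascend E[log pi(a|s) * return], i.e. descend its negative
    loss = -(dist.log_prob(actions) * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("average return in final batch:", returns.mean().item())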

Python Implementation

Q-Learning Implementation

import numpy as np
import random


class QLearningAgent:
    def __init__(self, states, actions, learning_rate=0.1,
                 discount_factor=0.9, exploration_rate=0.1):
        self.states = states
        self.actions = actions
        self.learning_rate = learning_rate          # alpha
        self.discount_factor = discount_factor      # gamma
        self.exploration_rate = exploration_rate    # epsilon
        self.q_table = np.zeros((states, actions))  # Q(s, a) initialized to zero

    def choose_action(self, state):
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit
        if random.uniform(0, 1) < self.exploration_rate:
            return random.choice(range(self.actions))
        else:
            return np.argmax(self.q_table[state])

    def update_q_value(self, state, action, reward, next_state):
        # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
        current_q = self.q_table[state, action]
        max_next_q = np.max(self.q_table[next_state])
        new_q = current_q + self.learning_rate * (
            reward + self.discount_factor * max_next_q - current_q
        )
        self.q_table[state, action] = new_q

    def get_optimal_policy(self):
        # Greedy policy: the best-known action for every state
        return np.argmax(self.q_table, axis=1)


# Training loop (assumes an environment object `env`, e.g. the GridWorld below)
agent = QLearningAgent(states=100, actions=4)

for episode in range(1000):
    state = env.reset()
    done = False
    while not done:
        action = agent.choose_action(state)
        next_state, reward, done = env.step(action)
        agent.update_q_value(state, action, reward, next_state)
        state = next_state

print("Training completed!")
print(f"Optimal policy: {agent.get_optimal_policy()}")

Grid World Environment

class GridWorld:
    def __init__(self, size=5):
        self.size = size
        self.agent_position = [0, 0]
        self.goal_position = [size - 1, size - 1]
        self.obstacles = [[1, 1], [2, 2]]
        # Actions: 0 = up, 1 = down, 2 = left, 3 = right
        self.actions = {
            0: [-1, 0],
            1: [1, 0],
            2: [0, -1],
            3: [0, 1]
        }

    def reset(self):
        self.agent_position = [0, 0]
        return self.get_state()

    def step(self, action):
        move = self.actions[action]
        new_position = [
            self.agent_position[0] + move[0],
            self.agent_position[1] + move[1]
        ]
        # Moves off the grid or into an obstacle are ignored (the agent stays put)
        if self.is_valid_position(new_position):
            self.agent_position = new_position
        reward = self.get_reward()
        done = self.agent_position == self.goal_position
        return self.get_state(), reward, done

    def is_valid_position(self, position):
        if position in self.obstacles:
            return False
        if position[0] < 0 or position[0] >= self.size:
            return False
        if position[1] < 0 or position[1] >= self.size:
            return False
        return True

    def get_reward(self):
        if self.agent_position == self.goal_position:
            return 100   # reaching the goal
        elif self.agent_position in self.obstacles:
            return -100  # never triggered here, since obstacles are impassable
        else:
            return -1    # step cost for every other move

    def get_state(self):
        # Encode the (row, column) position as a single integer state index
        return self.agent_position[0] * self.size + self.agent_position[1]


env = GridWorld(size=5)
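With both classes defined, the two listings can be wired together end to end; the episode count below is illustrative, and note the grid has size * size = 25 states rather than the 100 used in the standalone example above.

# Train the tabular agent on the 5x5 grid world
env = GridWorld(size=5)
agent = QLearningAgent(states=env.size * env.size, actions=4)

for episode in range(500):
    state = env.reset()
    done = False
    while not done:
        action = agent.choose_action(state)
        next_state, reward, done = env.step(action)
        agent.update_q_value(state, action, reward, next_state)
        state = next_state

# Best action per cell (0 = up, 1 = down, 2 = left, 3 = right), shown as a 5x5 grid
print(agent.get_optimal_policy().reshape(env.size, env.size))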

Test Your Knowledge

Answer these questions to test your understanding of reinforcement learning concepts.

Question 1: What is the main goal of reinforcement learning?
A) Maximize cumulative reward over time
B) Minimize prediction error
C) Classify data into categories
D) Find patterns in unlabeled data

Question 2: In Q-Learning, what does the Q-value represent?
A) The current state of the agent
B) The expected cumulative reward for taking an action in a state
C) The probability of reaching the goal
D) The number of steps taken

Question 3: What is the exploration-exploitation tradeoff?
A) Choosing between different algorithms
B) Balancing trying new actions vs using known good actions
C) Trading speed for accuracy
D) Choosing between supervised and unsupervised learning

Question 4: What does the discount factor (γ) control?
A) The learning speed
B) The exploration rate
C) The importance of future rewards vs immediate rewards
D) The size of the Q-table

Question 5: Which is an example of model-free RL?
A) Q-Learning
B) Monte Carlo Tree Search
C) Dynamic Programming
D) World Models

Question 6: What is a policy in reinforcement learning?
A) The reward function
B) A strategy that maps states to actions
C) The environment model
D) The learning rate parameter