## Machine Learning - Reinforcement Learning Example
Imagine you are teaching a puppy to learn the "sit" command. You don't directly tell it what the word "sit" means, but guide it through rewards and punishments.
* When the puppy accidentally performs the sitting action, you immediately give it a treat (**reward**).
* When it does something wrong, you don't give it a treat (**punishment**).
After many attempts, the puppy will eventually understand the association between the "sit" command and receiving treats, thus learning this skill.
**Reinforcement Learning** is a machine learning method that allows computers (or agents) to learn how to make optimal decisions through trial and error, interacting with the environment to maximize cumulative rewards.
**Reinforcement Learning** is fundamentally different from supervised learning (which has standard answers) and unsupervised learning (which finds inherent data structures) that we learned before. The core of reinforcement learning is the continuous interaction between the **agent** and the **environment**.
* * *
## Core Concepts Explained
Before diving into the code, let's understand several key concepts. They are like game rules that define how the reinforcement learning world operates.
* **Agent:** The agent is our learner or decision-maker. In the analogy above, it's the puppy. In the program, it's an algorithm responsible for observing the environment, taking actions, and learning from the results.
* **Environment:** The environment is the external world where the agent resides. It receives the agent's actions and returns two pieces of feedback: the new environment state and the immediate reward for the action.
* **State:** The state is a specific description of the environment at a certain moment. For example, in a maze game, the state is the current coordinate position of the agent.
* **Action:** Actions are the choices the agent can make in a given state. For instance, in a maze, actions can be up, down, left, or right.
* **Reward:** The reward is the direct evaluation signal from the environment to the agent's action, usually a numerical value. **Positive rewards** indicate encouragement, while **negative rewards** indicate punishment. The agent's ultimate goal is to maximize the **total reward (cumulative reward)** from start to finish.
* **Policy:** The policy is the agent's behavioral guideline. It defines which action should be chosen in every possible state. The learning process is essentially the process of optimizing this policy.
These concepts fit together in a basic interaction loop: the agent observes the current state, chooses an action according to its policy, and the environment responds with an immediate reward and a new state.
This cycle continues until a terminal state is reached (such as game completion or failure).
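The interaction loop can be sketched without any library. The example below is only an illustration: `CorridorEnv`, `run_episode`, and both policies are hypothetical names invented here, and the corridor world is a made-up toy rather than the environment used later in this tutorial.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: cells 0..4 in a row, goal at the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # initial state

    def step(self, action):
        # action 0 = move left, action 1 = move right (clamped at the walls)
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 0 if done else -1  # every non-final step costs 1
        return self.position, reward, done


def run_episode(env, policy, max_steps=1000):
    """Run one agent-environment loop and return the cumulative reward."""
    state = env.reset()
    total_reward = 0
    for _ in range(max_steps):
        action = policy(state)                  # agent chooses an action
        state, reward, done = env.step(action)  # environment responds
        total_reward += reward
        if done:
            break
    return total_reward


rng = random.Random(0)


def always_right(state):
    return 1


def random_policy(state):
    return rng.choice([0, 1])


print(run_episode(CorridorEnv(), always_right))   # -3
print(run_episode(CorridorEnv(), random_policy))  # never better than -3
```

A better policy shows up directly as a higher cumulative reward, which is the quantity the agent is trying to maximize.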
* * *
## Classic Problem: Cliff Walking
To put theory into practice, we will use a classic reinforcement learning example environment: **CliffWalking-v0** (Cliff Walking). It comes from the `gymnasium` library (the maintained fork of the original OpenAI Gym).
### Environment Description
* **Scenario**: A 4x12 grid world.
* **Start**: Bottom-left corner (coordinate [3, 0]).
* **Goal**: Bottom-right corner (coordinate [3, 11]).
* **Cliff**: All positions on the bottom row except start and goal ([3, 1] to [3, 10]). Falling off the cliff incurs a huge penalty and sends the agent back to the start.
* **Objective**: The agent must safely walk from start to goal while avoiding falling off the cliff.
* **Actions**: Up (0), Right (1), Down (2), Left (3).
* **Rewards**:
  * Each step on normal grid: -1 (encourages reaching goal in fewer steps)
  * Fall off cliff: -100, and return to start
  * Reach goal: the current attempt ends; in `gymnasium` this final step is also scored -1, like any other step
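The environment reports each state as a single integer rather than a coordinate pair. The sketch below assumes the usual row-major flattening (`row * 12 + col`), which is how the `gymnasium` documentation describes the observation; the helper names `to_state` and `to_coords` are made up for illustration.

```python
N_ROWS, N_COLS = 4, 12


def to_state(row, col):
    """Flatten a [row, col] coordinate into a single state index."""
    return row * N_COLS + col


def to_coords(state):
    """Recover (row, col) from a state index."""
    return divmod(state, N_COLS)


START = to_state(3, 0)                          # bottom-left corner
GOAL = to_state(3, 11)                          # bottom-right corner
CLIFF = {to_state(3, col) for col in range(1, 11)}

print(START, GOAL)           # 36 47
print(sorted(CLIFF)[:3])     # [37, 38, 39]
print(to_coords(GOAL))       # (3, 11)
```

Under this assumption the 48 states are numbered 0 to 47, with the start at 36 and the goal at 47.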
* * *
## Algorithm Introduction: Q-Learning
We will use the **Q-Learning** algorithm to solve this problem. It is a **model-free** reinforcement learning algorithm, meaning the agent doesn't need to know the environment's operating rules (such as state transition probabilities) in advance; it learns through continuous trial and error.
Its core is a table called the **Q-table**.
* **Rows** represent all possible states.
* **Columns** represent all possible actions.
* **Cell values (Q-values)** represent the long-term expected return of taking a certain action in a certain state.
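As a rough illustration of this layout, a Q-table can be held in a plain list of lists and queried one row at a time. The numbers below are invented purely for the example, and `best_action` is a hypothetical helper, not part of any library.

```python
N_STATES, N_ACTIONS = 48, 4
ACTION_NAMES = ["up", "right", "down", "left"]

# One row per state, one column per action, all values start at 0.
q_table = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

# Made-up Q-values for state 36, only to show how a row is read.
q_table[36] = [-13.0, -100.0, -14.0, -14.0]


def best_action(q_table, state):
    """Return the column index with the largest Q-value in this state's row."""
    row = q_table[state]
    return max(range(len(row)), key=row.__getitem__)


print(ACTION_NAMES[best_action(q_table, 36)])  # up
```

Note that the lookup reads only the row belonging to the current state; the best action in one state says nothing about the best action in another.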
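The learning process of Q-Learning can be summarized as a single experience that updates the agent's knowledge (Q-table): observe the current state S, choose an action A, receive the reward R and the new state S' from the environment, update the Q-value for (S, A), and then continue from S'.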
### Core Formula: Bellman Equation
The mathematical foundation for Q-table updates is the Bellman equation, with the following update formula:
$$
Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_{a} Q(S', a) - Q(S, A) \right]
$$
Let's break down each part of this formula:
| Symbol | Meaning | Analogy Explanation |
| --- | --- | --- |
| $Q(S, A)$ | Original Q-value of action A in state S | Your old score for the decision to go straight at an intersection |
| $\alpha$ | **Learning rate** (0 < α ≤ 1) | How much you believe in this new experience. α=1 means completely overwrite old knowledge with new experience; α=0.1 means new experience only slightly modifies old knowledge |
| $R$ | **Immediate reward** obtained after executing action A | After going straight, you find the road clear and get a small positive feedback (+1) |
| $\gamma$ | **Discount factor** (0 ≤ γ < 1) | How much you value future rewards. γ=0 means only care about immediate rewards; γ=0.9 means highly value long-term benefits |
| $\max_{a} Q(S', a)$ | Maximum Q-value among all possible actions in new state S' | After reaching the new intersection, you evaluate which of left turn, right turn, or going straight has the highest future benefit |
| $R + \gamma \max_{a} Q(S', a)$ | **Target Q-value**, representing a new, better estimate of the current decision | Combining immediate reward and best future benefit, derive a new score for the decision to go straight at the old intersection |
| $R + \gamma \max_{a} Q(S', a) - Q(S, A)$ | **Temporal difference error**, the gap between old and new knowledge | The difference between the new score and the old score; this gap drives learning |
Simply put, this formula allows the agent to continuously correct its judgment of the current decision's value using **immediate reward + discounted estimate of future optimum**.
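To make the arithmetic concrete, here is a small worked example of a single update. The function name `q_update` and the input values are illustrative assumptions; only the formula itself comes from the text above.

```python
def q_update(q_sa, reward, max_q_next, alpha, gamma):
    """Apply one Q-Learning update and return the new Q(S, A)."""
    target = reward + gamma * max_q_next  # target Q-value
    td_error = target - q_sa              # temporal difference error
    return q_sa + alpha * td_error


# First visit: everything is still 0, the step costs -1.
first = q_update(q_sa=0.0, reward=-1, max_q_next=0.0, alpha=0.1, gamma=0.99)
print(round(first, 4))   # -0.1

# Later visit: the next state's best Q-value is assumed to be -0.1.
second = q_update(q_sa=first, reward=-1, max_q_next=-0.1, alpha=0.1, gamma=0.99)
print(round(second, 4))  # -0.1999
```

With α=0.1, each update moves the old value only one tenth of the way toward the target, which is why many repetitions are needed before the Q-values settle.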
* * *
## Practical Implementation: Writing a Q-Learning Agent
Now, let's implement a Q-Learning agent in code to solve the cliff walking problem.
### Step 1: Install and Import Libraries
First, ensure you have installed the necessary libraries. Run in terminal or command line:
## Example
pip install gymnasium numpy
Then, import them in your Python file:
## Example
import gymnasium as gym
import numpy as np
import random
### Step 2: Initialize Environment and Q-Table
## Example
# 1. Create environment
# render_mode="human" opens a window for visualization; it slows training considerably, so omit it to train faster
env = gym.make("CliffWalking-v0", render_mode="human")
# 2. Get environment information
n_states = env.observation_space.n  # Total states (4*12=48)
n_actions = env.action_space.n  # Total actions (4 directions)
# 3. Initialize Q-table, shape [number of states, number of actions], initial values all 0
Q_table = np.zeros((n_states, n_actions))
print(f"Environment states: {n_states}, actions: {n_actions}")
print(f"Q-table shape: {Q_table.shape}")
### Step 3: Set Hyperparameters
Hyperparameters are knobs that control algorithm behavior and need to be adjusted based on the problem.
## Example
# Define hyperparameters
alpha = 0.1  # Learning rate: influence of new information
gamma = 0.99  # Discount factor: importance of future rewards
epsilon = 0.1  # Exploration rate: probability of random exploration (vs choosing known optimal)
num_episodes = 500  # Training episodes (number of times agent plays the game)
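To get a feel for what the discount factor does, the sketch below computes the discounted sum of a reward sequence for a few values of gamma. The helper `discounted_return` and the ten-step reward list are hypothetical and chosen only for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum rewards, weighting the reward at step t by gamma**t."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [-1] * 10  # a made-up run of ten steps at -1 each

print(round(discounted_return(rewards, gamma=1.0), 4))   # -10.0
print(round(discounted_return(rewards, gamma=0.99), 4))  # -9.5618
print(round(discounted_return(rewards, gamma=0.0), 4))   # -1.0
```

With gamma close to 1 the agent feels almost the full cost of a long path, while gamma=0 makes it blind to everything after the very next step.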
### Step 4: Implement ε-greedy Policy
This is the core strategy for the agent's decision-making, balancing **exploration** and **exploitation**.
* **Exploration**: Randomly select actions to discover potentially better strategies.
* **Exploitation**: Select actions that the current Q-table considers optimal to maximize returns.
## Example
def choose_action(state, Q_table, epsilon):
    """
    Choose action based on ε-greedy policy.
    Parameters:
        state: current state
        Q_table: Q-value table
        epsilon: exploration probability
    Returns:
        action: chosen action (0, 1, 2, 3)
    """
    # Generate a random number between 0-1
    if random.uniform(0, 1) < epsilon:
        # Exploration: randomly choose an action
        action = env.action_space.sample()
    else:
        # Exploitation: choose action with maximum Q-value in current state
        # Index the row for this state first; np.argmax on the whole table
        # would return a flattened index instead of an action
        action = int(np.argmax(Q_table[state]))
    return action
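The balance between exploration and exploitation can be checked empirically. The following sketch is a dependency-free variant written only for this demonstration: `epsilon_greedy` takes a single row of Q-values and a seeded random generator, and the Q-values are made up.

```python
import random
from collections import Counter


def epsilon_greedy(q_row, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=q_row.__getitem__)


rng = random.Random(42)
q_row = [-5.0, -2.0, -9.0, -7.0]  # made-up values; action 1 is the greedy choice

greedy_only = Counter(epsilon_greedy(q_row, 0.0, rng) for _ in range(1000))
mixed = Counter(epsilon_greedy(q_row, 0.1, rng) for _ in range(1000))

print(greedy_only)  # only action 1 appears
print(mixed)        # mostly action 1, with occasional random picks
```

With epsilon=0 the choice is fully deterministic, while epsilon=0.1 still picks the greedy action most of the time and occasionally tries the others.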
### Step 5: Core Training Loop
This is the main process of algorithm learning.
## Example
# Record total reward per episode to observe learning progress
reward_history = []
for episode in range(num_episodes):
    # Reset environment, get initial state
    state, _ = env.reset()
    total_reward = 0  # Cumulative reward for this episode
    terminated = False  # Whether the goal was reached (falling off the cliff sends the agent back to the start and does not end the episode)
    truncated = False  # Whether the episode was cut off by a step limit (rare in this environment)
    # Episode interaction loop, until the episode ends
    while not (terminated or truncated):
        # 1. Choose action
        action = choose_action(state, Q_table, epsilon)
        # 2. Execute action, get environment feedback
        next_state, reward, terminated, truncated, _ = env.step(action)
        # 3. Update Q-table (Q-Learning core update formula)
        # Get current Q-value
        current_q = Q_table[state, action]
        # Calculate target Q-value: immediate reward + discounted future maximum Q-value
        # Note: if next state is terminal, there is no future Q-value
        if terminated:
            target_q = reward
        else:
            # Take the maximum over the actions of next_state only, not over the whole table
            target_q = reward + gamma * np.max(Q_table[next_state])
        # Apply Bellman equation to update Q-value
        Q_table[state, action] = current_q + alpha * (target_q - current_q)
        # 4. Transition to next state and accumulate reward
        state = next_state
        total_reward += reward
    # Record total reward for this episode
    reward_history.append(total_reward)
    # Print progress every 100 episodes
    if (episode + 1) % 100 == 0:
        avg_reward = np.mean(reward_history[-100:])  # Average reward of last 100 episodes
        print(f"Episode {episode + 1}, average reward of last 100 episodes: {avg_reward:.2f}")
# Training complete, close environment
env.close()
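Per-episode rewards are usually noisy, so it can help to smooth them before judging progress. The helper below is a hypothetical addition, not part of the tutorial's code; it computes a trailing moving average over any list of episode returns, and the sample numbers are invented.

```python
def moving_average(values, window):
    """Trailing average over the last `window` values (fewer at the start)."""
    averages = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            running_sum -= values[i - window]  # drop the value leaving the window
        averages.append(running_sum / min(i + 1, window))
    return averages


history = [-100, -80, -60, -40, -20]  # made-up episode returns
print(moving_average(history, window=2))
# [-100.0, -90.0, -70.0, -50.0, -30.0]
```

Applied to a list such as `reward_history`, a steadily rising smoothed curve would suggest the agent is improving.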
### Step 6: Test the Trained Agent
After training is complete, we turn off exploration and let the agent purely exploit its learned knowledge (Q-table) to see its performance.
## Example
print("\n=== Start Testing ===")
# Create new test environment (can omit render_mode, or set to "human" to watch)
test_env = gym.make("CliffWalking-v0", render_mode="human")
state, _ = test_env.reset()
test_terminated = False
test_truncated = False
step_count = 0
total_reward = 0  # Sum of rewards over the whole test run
while not (test_terminated or test_truncated):
    # During testing, we set epsilon=0, i.e., pure exploitation, no exploration
    action = choose_action(state, Q_table, epsilon=0)
    state, reward, test_terminated, test_truncated, _ = test_env.step(action)
    step_count += 1
    total_reward += reward
    print(f"Step {step_count}: state {state}, action {action}, reward {reward}")
print(f"Test complete! Total steps: {step_count}, total reward: {total_reward}")
test_env.close()
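Besides watching the agent move, one way to inspect what it has learned is to print the greedy action of every state as an arrow on the grid. The sketch below is a hypothetical helper (`render_policy`) demonstrated on a tiny made-up 2x3 table; it assumes states are numbered row by row and actions are ordered up, right, down, left.

```python
ARROWS = ["^", ">", "v", "<"]  # same order as the actions: up, right, down, left


def render_policy(q_table, n_rows, n_cols):
    """Return a text grid showing the greedy action for every state."""
    lines = []
    for row in range(n_rows):
        cells = []
        for col in range(n_cols):
            q_row = q_table[row * n_cols + col]
            best = max(range(len(q_row)), key=q_row.__getitem__)
            cells.append(ARROWS[best])
        lines.append(" ".join(cells))
    return "\n".join(lines)


# Made-up Q-values for a 2x3 grid, only to show the output format.
tiny_q = [
    [-3, -2, -4, -3],  # state 0: right is best
    [-2, -1, -3, -3],  # state 1: right is best
    [-2, -2, -1, -3],  # state 2: down is best
    [-3, -4, -5, -5],  # state 3: up is best
    [-3, -1, -4, -4],  # state 4: right is best
    [0, 0, 0, 0],      # state 5: all equal, so the first action (up) is shown
]

print(render_policy(tiny_q, 2, 3))
# > > v
# ^ > ^
```

For the cliff walking table the call would be `render_policy(Q_table, 4, 12)`; cells that were never updated keep all-zero rows, so their arrows carry no meaning.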
* * *
## Running Results and Analysis
Complete code:
## Example
import gymnasium as gym
import numpy as np
import random
# =========================
# 1. Create environment
# =========================
env = gym.make("CliffWalking-v0", render_mode="human")
# =========================
# 2. Get environment information
# =========================
n_states = env.observation_space.n
n_actions = env.action_space.n
# =========================
# 3. Initialize Q-table
# =========================
Q_table = np.zeros((n_states, n_actions))
print(f"Environment states: {n_states}")
print(f"Actions: {n_actions}")
print(f"Q-table shape: {Q_table.shape}")
# =========================
# 4. Set hyperparameters
# =========================
alpha = 0.1
gamma = 0.99
epsilon = 0.1
num_episodes = 500
# =========================
# 5. ε-greedy policy
# =========================
def choose_action(state, Q_table, epsilon):
    if random.uniform(0, 1) < epsilon:
        return env.action_space.sample()
    # Look only at the row for the current state
    return int(np.argmax(Q_table[state]))
# =========================
# 6. Training loop
# =========================
reward_history = []
for episode in range(num_episodes):
    state, _ = env.reset()
    total_reward = 0
    terminated = False
    truncated = False
    while not (terminated or truncated):
        action = choose_action(state, Q_table, epsilon)
        next_state, reward, terminated, truncated, _ = env.step(action)
        current_q = Q_table[state, action]
        if terminated:
            target_q = reward
        else:
            # Maximum over the actions of next_state only
            target_q = reward + gamma * np.max(Q_table[next_state])
        Q_table[state, action] = current_q + alpha * (target_q - current_q)
        state = next_state
        total_reward += reward
    reward_history.append(total_reward)
    if (episode + 1) % 100 == 0:
        avg_reward = np.mean(reward_history[-100:])
        print(f"Episode {episode + 1}, average reward of last 100 episodes: {avg_reward:.2f}")
env.close()
# =========================
# 7. Test training results
# =========================
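print("\n=== Start Testing ===")
test_env = gym.make("CliffWalking-v0", render_mode="human")
state, _ = test_env.reset()
test_terminated = False
test_truncated = False
step_count = 0
total_reward = 0
while not (test_terminated or test_truncated):
    # epsilon=0: pure exploitation, no exploration
    action = choose_action(state, Q_table, epsilon=0)
    state, reward, test_terminated, test_truncated, _ = test_env.step(action)
    step_count += 1
    total_reward += reward
    print(f"Step {step_count}: state {state}, action {action}, reward {reward}")
print(f"Test complete! Total steps: {step_count}, total reward: {total_reward}")
test_env.close()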