Ml Qlearning Sarsa

In the field of artificial intelligence, reinforcement learning is a method that allows an agent to learn how to achieve goals through interaction with the environment.

\\\\n\\\\n

Imagine teaching a puppy a new command: it performs an action (such as sitting), you give a reward (a treat), and it gradually learns to respond correctly when hearing the command. Q-learning and SARSA are two classic and crucial algorithms in reinforcement learning. They are the core tools for an agent to learn which action is best in which state. This article will clearly explain the principles, differences, and implementations of these two algorithms.

\\\\n\\\\n

Foundations of Reinforcement Learning and Markov Decision Processes

\\\\n\\\\n

Before diving into Q-learning and SARSA, we need to understand their common theoretical framework.

\\\\n\\\\n

Core Concepts

\\\\n\\\\n

Reinforcement learning problems are typically modeled as Markov Decision Processes. It contains the following key elements:

\\\\n\\\\n

Agent: The entity that makes decisions and learns.
Environment: The external world with which the agent interacts.
State: A description of the environment at a specific moment.
Action: An operation that the agent can perform in a given state.
Reward: An immediate feedback signal from the environment after the agent performs an action.
Policy: The rule by which the agent selects actions in a given state; this is the target of learning.

\\\\n\\\\n

Objectives and Value Functions

\\\\n\\\\n

The ultimate goal of the agent is to maximize long-term cumulative reward, not just immediate reward. For this purpose, we introduce value functions.

\\\\n\\\\n

State Value Function V(s): Represents the expected cumulative reward obtainable starting from state s and following a specific policy.
Action Value Function Q(s, a): Represents the expected cumulative reward obtainable by performing action a in state s, and then following a specific policy. The core of Q-learning and SARSA is to learn this Q function.

\\\\n\\\\n

To balance immediate rewards and future rewards, we use the discount factor γ (typically in the range [0, 1]). The reward at future step k is multiplied by γ^k, meaning the agent values more recent rewards more highly.

\\\\n\\\\n

Q-learning: Off-Policy Temporal Difference Learning

\\\\n\\\\n

Q-learning is an off-policy algorithm proposed by Watkins in 1989. Its core idea is that the agent indirectly finds the optimal policy by learning an optimal Q-value table.

\\\\n\\\\n

Algorithm Core: Q-Value Update Formula

\\\\n\\\\n

The learning process of Q-learning is driven by the following formula: Q(s_t, a_t) ← Q(s_t, a_t) + α * [ r_{t+1} + γ * max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) ]

\\\\n\\\\n

Let's break down this formula:

\\\\n\\\\n

Q(s_t, a_t): The current estimated value of taking action a_t in state s_t at time t.
α: Learning rate, controlling the extent to which new information overrides old information (0 < α ≤ 1).
r_{t+1}: The immediate reward obtained after performing action a_t.
γ: Discount factor, measuring the importance of future rewards.
max_{a} Q(s_{t+1}, a): The maximum Q value among all possible actions in the next state s_{t+1}. This represents the best estimated future return based on the current Q table.
[ r_{t+1} + γ * max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) ]: Called the temporal difference error. It is the difference between the "target value" (immediate reward plus best future estimate) and the "current estimate". The algorithm updates the Q value by reducing this error.

\\\\n\\\\n

Meaning of Off-Policy

\\\\n\\\\n

Q-learning is called off-policy because when updating the Q value, the action used to evaluate future value (max_{a} Q(s_{t+1}, a)) is separated from the action the agent actually performs for exploration.

\\\\n\\\\n

Behavior Policy: The policy used to select the actual action to perform (e.g., ε-greedy, which randomly explores with a certain probability).
Target Policy: The policy used to update the Q value, which is a fully greedy policy (always choosing the action with the current maximum Q value).

\\\\n\\\\n

This separation allows Q-learning to boldly learn from its estimate of the optimal path, even while currently taking random exploration.

\\\\n\\\\n

Algorithm Flow

\\\\n\\\\n

SARSA: On-Policy Temporal Difference Learning

\\\\n\\\\n

The name SARSA comes from the state-action sequence involved in its update process: (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}). It is an on-policy algorithm.

\\\\n\\\\n

Algorithm Core: Q-Value Update Formula

\\\\n\\\\n

SARSA's update formula is very similar to Q-learning's, but with one key difference: Q(s_t, a_t) ← Q(s_t, a_t) + α * [ r_{t+1} + γ * Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) ]

\\\\n\\\\n

Please note the second part:

\\\\n\\\\n

Q-learning uses: γ * max_{a} Q(s_{t+1}, a)
SARSA uses: γ * Q(s_{t+1}, a_{t+1})

\\\\n\\\\n

In SARSA, the value used to estimate future value is the Q value of the action a_{t+1} that the agent will actually perform in the next state s_{t+1}.

\\\\n\\\\n

Meaning of On-Policy

\\\\n\\\\n

SARSA is called on-policy because the policy used to update the Q value (target policy) and the policy used to select actions (behavior policy) are the same, typically both being the ε-greedy policy.

\\\\n\\\\n

It evaluates and optimizes the policy it is currently executing.
What it learns is a policy that considers future exploration risk, and therefore is typically more "conservative".

\\\\n\\\\n

Comparison of Q-learning and SARSA

\\\\n\\\\n

Understanding the differences between the two is key to mastering them. The table below clearly summarizes the core differences:

\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n

Feature	Q-learning	SARSA
Policy Type	Off-policy	On-policy
Update Target	Based on optimal action: `r + γ * max Q(s', a')`	Based on actual action: `r + γ * Q(s', a_{t+1})`
Learning Goal	Learn the Q function of the optimal policy	Learn the Q function of the executed policy (e.g., ε-greedy)
Risk Tendency	More "bold", assuming future optimal actions	More "conservative", considering risks from future exploration
Update Timing	Can update after `(s, a, r, s')`	Requires `(s, a, r, s', a')` quintuple to update
Typical Applications	Discrete environments, seeking global optimal solutions	Environments with high safety requirements, needing to avoid dangerous states

\\\\n\\\\n

Classic Cliff Walking Problem Example

\\\\n\\\\n

This example vividly illustrates the difference between the two. Imagine a grid world where an agent walks from start point S to goal point G, with a cliff below. Falling off incurs a huge negative reward and returns to the start.

\\\\n\\\\n

Q-learning: Since it assumes the future always follows the optimal (safest) path, it will quickly learn the shortest path along the edge of the cliff (path 1), because it "believes" it won't fall off.
SARSA: Since it considers that it still has an ε probability of random exploration in the future, it might fall off the cliff. Therefore, it will learn a safer path higher up (path 2), which is longer but has higher long-term expected return.

\\\\n\\\\n

Conclusion: In environments that need to balance exploration and exploitation, and where the cost of wrong actions is high, SARSA typically learns safer and more robust policies.

\\\\n\\\\n

Code Practice: Implementing Q-learning and SARSA in Python

\\\\n\\\\n

We will demonstrate both algorithms with a simple 4x4 grid world. The goal is for the agent to walk from the top-left corner (0,0) to the bottom-right corner (3,3).

\\\\n\\\\n

Environment Setup

\\\\n\\\\n

Examples

\\\\n\\\\n

import numpy as np\\\\n\\\\nimport random\\\\n\\\\nclass GridWorld:\\\\n\\\\n    def __init__ (self, size=4):\\\\n\\\\n        self.size= size\\\\n\\\\n        self.state=(0,0)# Start state\\\\n\\\\n        self.goal=(size-1, size-1)# Goal state\\\\n\\\\n        self.actions=["up","down","left","right"]\\\\n\\\\n        self.action_map={"up": (-1,0),"down": (1,0),"left": (0, -1),"right": (0,1)}\\\\n\\\\n    def reset(self):\\\\n\\\\n        """Reset environment to Start state"""\\\\n\\\\n        self.state=(0,0)\\\\n\\\\n        return self.state\\\\n\\\\n    def step(self, action):\\\\n\\\\n        """Execute action, return (next state, Reward, Whether terminated)"""\\\\n\\\\n        move =self.action_map\\\\n\\\\n        next_state =(self.state + move,self.state + move)\\\\n\\\\n        # Boundary check: if out of bounds, stay in current state\\\\n\\\\n        if not(0<= next_state<self.size and 0<= next_state<self.size):\\\\n\\\\n            next_state =self.state\\\\n\\\\n        self.state= next_state\\\\n\\\\n        # Reward setting: +10 for reaching goal, -1 for other moves (encourages quick arrival)\\\\n\\\\n        if next_state ==self.goal:\\\\n\\\\n            reward =10\\\\n\\\\n            done =True\\\\n\\\\n        else:\\\\n\\\\n            reward = -1\\\\n\\\\n            done =False\\\\n\\\\n        return next_state, reward, done\\\\n\\\\n    def get_actions(self):\\\\n\\\\n        return self.actions\\\\n

\\\\n\\\\n

Q-learning Algorithm Implementation

\\\\n\\\\n

Examples

\\\\n\\\\n

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):\\\\n\\\\n    """\\\\n\\\\n    Implement Q-learning algorithm\\\\n\\\\n    env: Environment object\\\\n\\\\n    episodes: Number of training episodes\\\\n\\\\n    alpha: Learning rate\\\\n\\\\n    gamma: Discount factor\\\\n\\\\n    epsilon: ε-greedy In policy'sExploration probability\\\\n\\\\n    """\\\\n\\\\n    # Initialize Q-table, dimensions are [Grid row, Grid columns, Number of actions]\\\\n\\\\n    q_table = np.zeros((env.size, env.size,len(env.actions)))\\\\n\\\\n    for episode in range(episodes):\\\\n\\\\n        state = env.reset()\\\\n\\\\n        done =False\\\\n\\\\n        while not done:\\\\n\\\\n            # 1. ε-greedy Policychoosechoose Action\\\\n\\\\n            if random.uniform(0,1)< epsilon:\\\\n\\\\n                action_idx =random.randint(0,len(env.actions)-1)# Explore: randomly select\\\\n\\\\n            else:\\\\n\\\\n                action_idx = np.argmax(q_table[state, state, :])# Exploit: choose action with max Q Value's\\\\n\\\\n            action = env.actions\\\\n\\\\n            # 2. Execute action, receive feedback\\\\n\\\\n            next_state, reward, done = env.step(action)\\\\n\\\\n            # 3. Q-learning Core update step\\\\n\\\\n            # Current state-action pair's Q Value\\\\n\\\\n            current_q = q_table[state, state, action_idx]\\\\n\\\\n            # Next state'smax Q Value (off-Policy'sKey)\\\\n\\\\n            next_max_q = np.max(q_table[next_state, next_state, :])\\\\n\\\\n            # Calculate target Q Value\\\\n\\\\n            target_q = reward + gamma * next_max_q\\\\n\\\\n            # Update Q-table\\\\n\\\\n            q_table[state, state, action_idx]= current_q + alpha * (target_q - current_q)\\\\n\\\\n            # Transition to next state\\\\n\\\\n            state = next_state\\\\n\\\\n    return q_table\\\\n

\\\\n\\\\n

SARSA Algorithm Implementation

\\\\n\\\\n

Examples

\\\\n\\\\n

def sarsa(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):\\\\n\\\\n    """\\\\n\\\\n    Implement SARSA algorithm\\\\n\\\\n    Parameter meanings are the same as in Q-learning\\\\n\\\\n    """\\\\n\\\\n    q_table = np.zeros((env.size, env.size,len(env.actions)))\\\\n\\\\n    for episode in range(episodes):\\\\n\\\\n        state = env.reset()\\\\n\\\\n        done =False\\\\n\\\\n        # SARSA Need to first select an action for the initial state\\\\n\\\\n        if random.uniform(0,1)< epsilon:\\\\n\\\\n            action_idx =random.randint(0,len(env.actions)-1)\\\\n\\\\n        else:\\\\n\\\\n            action_idx = np.argmax(q_table[state, state, :])\\\\n\\\\n        action = env.actions\\\\n\\\\n        while not done:\\\\n\\\\n            # 1. Execute the previously chosen action'sAction\\\\n\\\\n            next_state, reward, done = env.step(action)\\\\n\\\\n            # 2. forNext statechoosechoose Action（samePolicy，StillUse ε-greedy）\\\\n\\\\n            if random.uniform(0,1)< epsilon:\\\\n\\\\n                next_action_idx =random.randint(0,len(env.actions)-1)\\\\n\\\\n            else:\\\\n\\\\n                next_action_idx = np.argmax(q_table[next_state, next_state, :])\\\\n\\\\n            next_action = env.actions\\\\n\\\\n            # 3. SARSA Core update step\\\\n\\\\n            current_q = q_table[state, state, action_idx]\\\\n\\\\n            # Key difference: uses next state**Action to actually execute'sAction**'s Q Value\\\\n\\\\n            next_q = q_table[next_state, next_state, next_action_idx]\\\\n\\\\n            target_q = reward + gamma * next_q\\\\n\\\\n            q_table[state, state, action_idx]= current_q + alpha * (target_q - current_q)\\\\n\\\\n            # 4. Update state and action for the next iteration\\\\n\\\\n            state = next_state\\\\n\\\\n            action_idx = next_action_idx\\\\n\\\\n            action = next_action\\\\n\\\\n    return q_table\\\\n

\\\\n\\\\n

Testing and Policy Visualization

\\\\n\\\\n

Examples

\\\\n\\\\n

def test_policy(env, q_table, episodes=10):\\\\n\\\\n    """Test learned policy'sPolicy"""\\\\n\\\\n    total_rewards =[]\\\\n\\\\n    for _ in range(episodes):\\\\n\\\\n        state = env.reset()\\\\n\\\\n        done =False\\\\n\\\\n        total_reward =0\\\\n\\\\n        steps =[]\\\\n\\\\n        while not done:\\\\n\\\\n            # Use greedy policy during testing (no exploration)\\\\n\\\\n            action_idx = np.argmax(q_table[state, state, :])\\\\n\\\\n            action = env.actions\\\\n\\\\n            next_state, reward, done = env.step(action)\\\\n\\\\n            steps.append(action.upper())# Record action initial letter\\\\n\\\\n            total_reward += reward\\\\n\\\\n            state = next_state\\\\n\\\\n        total_rewards.append(total_reward)\\\\n\\\\n        print(f"Episode steps: {steps}, Total reward: {total_reward}")\\\\n\\\\n    print(f"Average reward: {np.mean(total_rewards):.2f}")\\\\n\\\\n# Create environment\\\\n\\\\nenv = GridWorld(size=4)\\\\n\\\\nprint("=== Train and test Q-learning ===")\\\\n\\\\nq_table_ql = q_learning(env, episodes=1000)\\\\n\\\\ntest_policy(env, q_table_ql)\\\\n\\\\nprint("n=== Train and test SARSA ===")\\\\n\\\\nenv.reset()# Reset environment state\\\\n\\\\nq_table_sarsa = sarsa(env, episodes=1000)\\\\n\\\\ntest_policy(env, q_table_sarsa)\\\\n\\\\n# Briefly compare final policies\\\\n\\\\nprint("n=== PolicyComparison（starts fromStart state(0,0)Start'sAction)===")\\\\n\\\\nstart_q_values = q_table_ql[0,0, :]\\\\n\\\\nstart_sarsa_values = q_table_sarsa[0,0, :]\\\\n\\\\nprint(f"Q-learning QValue: {dict(zip(env.actions, start_q_values))}")\\\\n\\\\nprint(f"Recommended action: {env.actions[np.argmax(start_q_values)]}")\\\\n\\\\nprint(f"SARSA QValue: {dict(zip(env.actions, start_sarsa_values))}")\\\\n\\\\nprint(f"Recommended action: {env.actions[np.argmax(start_sarsa_values)]}")\\\\n

YouTip