Ml Reinforcement Learning

## Machine Learning - Reinforcement Learning

As an example, imagine you are teaching a puppy the "sit" command. You don't directly tell it what the word "sit" means, but guide it through rewards and punishments.

* When the puppy accidentally performs the sitting action, you immediately give it a treat (**reward**).
* When it does something wrong, you don't give it a treat (**punishment**).

After many attempts, the puppy will eventually understand the association between the "sit" command and receiving treats, thus learning this skill.

**Reinforcement Learning** is a machine learning method that allows computers (or agents) to learn how to make optimal decisions through trial and error, interacting with the environment to maximize cumulative rewards.

**Reinforcement Learning** is fundamentally different from the supervised learning (which has standard answers) and unsupervised learning (which finds inherent data structures) that we learned before. The core of reinforcement learning is the continuous interaction between the **agent** and the **environment**.

* * *

## Core Concepts Explained

Before diving into the code, let's understand several key concepts. They are like game rules that define how the reinforcement learning world operates.

* **Agent:** The agent is our learner or decision-maker. In the analogy above, it's the puppy. In the program, it's an algorithm responsible for observing the environment, taking actions, and learning from the results.
* **Environment:** The environment is the external world where the agent resides. It receives the agent's actions and provides two pieces of feedback: the new environment state and the immediate reward for the action.
* **State:** The state is a specific description of the environment at a certain moment. For example, in a maze game, the state is the current coordinate position of the agent.
* **Action:** Actions are the choices the agent can make in a given state. For instance, in a maze, actions can be up, down, left, or right.
* **Reward:** The reward is the direct evaluation signal from the environment for the agent's action, usually a numerical value. **Positive rewards** indicate encouragement, while **negative rewards** indicate punishment. The agent's ultimate goal is to maximize the **total reward (cumulative reward)** from start to finish.
* **Policy:** The policy is the agent's behavioral guideline. It defines which action should be chosen in every possible state. The learning process is essentially the process of optimizing this policy.

To see how these concepts work together, consider the basic interaction flow of reinforcement learning: the agent observes the current state, chooses an action according to its policy, and the environment responds with an immediate reward and a new state, which the agent uses to improve its policy.

This cycle continues until a terminal state is reached (such as game completion or failure).

* * *

## Classic Problem: Cliff Walking

To put theory into practice, we will use a classic reinforcement learning example environment: **CliffWalking-v0** (Cliff Walking). It comes from the `gymnasium` library (the maintained fork of the original OpenAI Gym).

### Environment Description

* **Scenario**: A 4x12 grid world.
* **Start**: Bottom-left corner (coordinate [3, 0]).
* **Goal**: Bottom-right corner (coordinate [3, 11]).
* **Cliff**: All positions on the bottom row except start and goal ([3, 1] to [3, 10]). Falling off the cliff incurs a huge penalty and returns the agent to the start.
* **Objective**: The agent must safely walk from start to goal while avoiding falling off the cliff.
* **Actions**: Up (0), Right (1), Down (2), Left (3).
* **Rewards**:
  * Each step onto a normal cell: -1 (encourages reaching the goal in fewer steps)
  * Falling off the cliff: -100, and the agent returns to the start (the episode continues)
  * Reaching the goal: the current attempt ends (in `gymnasium`, this final step is also charged -1 like any other step)

* * *

## Algorithm Introduction: Q-Learning

We will use the **Q-Learning** algorithm to solve this problem.
It is a **model-free** reinforcement learning algorithm, meaning the agent doesn't need to know the environment's operating rules (such as state transition probabilities) in advance; it learns through continuous trial and error.

Its core is a table called the **Q-table**.

* **Rows** represent all possible states.
* **Columns** represent all possible actions.
* **Cell values (Q-values)** represent the long-term expected return of taking a certain action in a certain state.

The learning process of Q-Learning can be summarized in the following steps, showing how the agent updates its knowledge (Q-table) through a single experience:

1. Observe the current state S.
2. Choose an action A (in this tutorial, with the ε-greedy policy).
3. Execute A and receive the reward R and the new state S'.
4. Update Q(S, A) with the formula below.
5. Set S to S' and repeat until the episode ends.

### Core Formula: Bellman Equation

The mathematical foundation for Q-table updates is the Bellman equation, which leads to the following update formula:

$$
Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_{a} Q(S', a) - Q(S, A) \right]
$$

Let's break down each part of this formula:

| Symbol | Meaning | Analogy Explanation |
| --- | --- | --- |
| $Q(S, A)$ | Original Q-value of action A in state S | Your old score for the decision to go straight at an intersection |
| $\alpha$ | **Learning rate** (0 < α ≤ 1) | How much you believe in this new experience. α=1 means completely overwrite old knowledge with new experience; α=0.1 means new experience only slightly modifies old knowledge |
| $R$ | **Immediate reward** obtained after executing action A | After going straight, you find the road clear and get a small positive feedback (+1) |
| $\gamma$ | **Discount factor** (0 ≤ γ < 1) | How much you value future rewards. γ=0 means you only care about immediate rewards; γ=0.9 means you highly value long-term benefits |
| $\max_{a} Q(S', a)$ | Maximum Q-value among all possible actions in new state S' | After reaching the new intersection, you evaluate which of left turn, right turn, or going straight has the highest future benefit |
| $R + \gamma \max_{a} Q(S', a)$ | **Target Q-value**, representing a new, better estimate of the current decision | Combining immediate reward and best future benefit, derive a new score for the decision to go straight at the old intersection |
| $R + \gamma \max_{a} Q(S', a) - Q(S, A)$ | **Temporal difference error**, the gap between old and new knowledge | The difference between new score and old score; this gap drives learning |

Simply put, this formula allows the agent to continuously correct its judgment of the current decision's value using **immediate reward + discounted estimate of future optimum**.

* * *

## Practical Implementation: Writing a Q-Learning Agent

Now, let's implement a Q-Learning agent in code to solve the cliff walking problem.

### Step 1: Install and Import Libraries

First, ensure you have installed the necessary libraries. Run in a terminal or command line:

```bash
pip install gymnasium numpy
```

Rendering with `render_mode="human"` additionally requires `pygame`.

Then, import them in your Python file:

```python
import gymnasium as gym
import numpy as np
import random
```

### Step 2: Initialize Environment and Q-Table

```python
# 1. Create environment
# render_mode="human" opens a window for visualization; it also slows training considerably
env = gym.make("CliffWalking-v0", render_mode="human")

# 2. Get environment information
n_states = env.observation_space.n   # Total states (4*12=48)
n_actions = env.action_space.n       # Total actions (4 directions)

# 3. Initialize Q-table, shape [number of states, number of actions], initial values all 0
Q_table = np.zeros((n_states, n_actions))

print(f"Environment states: {n_states}, actions: {n_actions}")
print(f"Q-table shape: {Q_table.shape}")
```

### Step 3: Set Hyperparameters

Hyperparameters are knobs that control algorithm behavior and need to be adjusted based on the problem.

```python
# Define hyperparameters
alpha = 0.1          # Learning rate: influence of new information
gamma = 0.99         # Discount factor: importance of future rewards
epsilon = 0.1        # Exploration rate: probability of random exploration (vs choosing known optimal)
num_episodes = 500   # Training episodes (number of times agent plays the game)
```

### Step 4: Implement ε-greedy Policy

This is the core strategy for the agent's decision-making, balancing **exploration** and **exploitation**.

* **Exploration**: Randomly select actions to discover potentially better strategies.
* **Exploitation**: Select actions that the current Q-table considers optimal to maximize returns.

```python
def choose_action(state, Q_table, epsilon):
    """
    Choose action based on ε-greedy policy.

    Parameters:
        state: current state
        Q_table: Q-value table
        epsilon: exploration probability

    Returns:
        action: chosen action (0, 1, 2, 3)
    """
    # Generate a random number between 0-1
    if random.uniform(0, 1) < epsilon:
        # Exploration: randomly choose an action
        action = env.action_space.sample()
    else:
        # Exploitation: choose action with maximum Q-value in current state
        # np.argmax over this state's row returns the index of the maximum value, i.e., the optimal action
        action = int(np.argmax(Q_table[state]))
    return action
```

### Step 5: Core Training Loop

This is the main process of algorithm learning.
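To make the update inside that loop concrete, here is a minimal sketch of a single Q-Learning update worked on a toy table. The helper name `q_update`, the made-up Q-values for state 24, and the assumption that states are indexed as `row * 12 + col` (so the start is state 36 and the cell above it is state 24) are illustrative and not part of the original code.

```python
import numpy as np


def q_update(q_table, state, action, reward, next_state, terminated,
             alpha=0.1, gamma=0.99):
    """Apply one Q-Learning update in place and return the TD error.

    Hypothetical helper for illustration only.
    """
    # A terminal transition has no future value to bootstrap from
    future = 0.0 if terminated else np.max(q_table[next_state])
    td_error = reward + gamma * future - q_table[state, action]
    q_table[state, action] += alpha * td_error
    return td_error


# Toy table with the same shape as the 4x12 grid: 48 states, 4 actions
q = np.zeros((48, 4))
# Made-up values for state 24 (assumed to be the cell directly above the start)
q[24] = [-2.0, -1.0, -3.0, -4.0]

# One step "up" (action 0) from the start (state 36) into state 24, reward -1
td = q_update(q, state=36, action=0, reward=-1.0, next_state=24, terminated=False)
print(f"TD error: {td:.3f}, new Q(36, up): {q[36, 0]:.3f}")
```

With α=0.1 and γ=0.99, the TD error here is -1 + 0.99 × (-1) - 0 = -1.99, so Q(36, up) moves one tenth of that gap, from 0 to -0.199. The full loop repeats this step for every transition.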
```python
# Record total reward per episode to observe learning progress
reward_history = []

for episode in range(num_episodes):
    # Reset environment, get initial state
    state, _ = env.reset()
    total_reward = 0     # Cumulative reward for this episode
    terminated = False   # Whether the terminal state (the goal) was reached; falling off the cliff does not end the episode
    truncated = False    # Whether terminated due to step limit (rare in this environment)

    # Episode interaction loop, until game ends
    while not (terminated or truncated):
        # 1. Choose action
        action = choose_action(state, Q_table, epsilon)

        # 2. Execute action, get environment feedback
        next_state, reward, terminated, truncated, _ = env.step(action)

        # 3. Update Q-table (Q-Learning core update formula)
        # Get current Q-value
        current_q = Q_table[state, action]

        # Calculate target Q-value: immediate reward + discounted future maximum Q-value
        # Note: if next state is terminal, there is no future Q-value
        if terminated:
            target_q = reward
        else:
            # The maximum is taken over the actions of the next state only
            target_q = reward + gamma * np.max(Q_table[next_state])

        # Apply Bellman equation to update Q-value
        Q_table[state, action] = current_q + alpha * (target_q - current_q)

        # 4. Transition to next state and accumulate reward
        state = next_state
        total_reward += reward

    # Record total reward for this episode
    reward_history.append(total_reward)

    # Print progress every 100 episodes
    if (episode + 1) % 100 == 0:
        avg_reward = np.mean(reward_history[-100:])  # Average reward of last 100 episodes
        print(f"Episode {episode + 1}, average reward of last 100 episodes: {avg_reward:.2f}")

# Training complete, close environment
env.close()
```

### Step 6: Test the Trained Agent

After training is complete, we turn off exploration and let the agent purely exploit its learned knowledge (Q-table) to see its performance.
```python
print("\n=== Start Testing ===")
# Create new test environment (can omit render_mode, or set to "human" to watch)
test_env = gym.make("CliffWalking-v0", render_mode="human")
state, _ = test_env.reset()
test_terminated = False
test_truncated = False
step_count = 0
test_total_reward = 0  # Cumulative reward over the test episode

while not (test_terminated or test_truncated):
    # During testing, we set epsilon=0, i.e., pure exploitation, no exploration
    action = choose_action(state, Q_table, epsilon=0)
    state, reward, test_terminated, test_truncated, _ = test_env.step(action)
    step_count += 1
    test_total_reward += reward
    print(f"Step {step_count}: state {state}, action {action}, reward {reward}")

print(f"Test complete! Total steps: {step_count}, total reward: {test_total_reward}")
test_env.close()
```

* * *

## Running Results and Analysis

Complete code:

```python
import gymnasium as gym
import numpy as np
import random

# =========================
# 1. Create environment
# =========================
env = gym.make("CliffWalking-v0", render_mode="human")

# =========================
# 2. Get environment information
# =========================
n_states = env.observation_space.n
n_actions = env.action_space.n

# =========================
# 3. Initialize Q-table
# =========================
Q_table = np.zeros((n_states, n_actions))

print(f"Environment states: {n_states}")
print(f"Actions: {n_actions}")
print(f"Q-table shape: {Q_table.shape}")

# =========================
# 4. Set hyperparameters
# =========================
alpha = 0.1
gamma = 0.99
epsilon = 0.1
num_episodes = 500

# =========================
# 5. ε-greedy policy
# =========================
def choose_action(state, Q_table, epsilon):
    if random.uniform(0, 1) < epsilon:
        return env.action_space.sample()
    # Greedy action for the current state only
    return int(np.argmax(Q_table[state]))

# =========================
# 6. Training loop
# =========================
reward_history = []

for episode in range(num_episodes):
    state, _ = env.reset()
    total_reward = 0
    terminated = False
    truncated = False

    while not (terminated or truncated):
        action = choose_action(state, Q_table, epsilon)
        next_state, reward, terminated, truncated, _ = env.step(action)

        current_q = Q_table[state, action]
        if terminated:
            target_q = reward
        else:
            # Bootstrap from the best action in the next state
            target_q = reward + gamma * np.max(Q_table[next_state])
        Q_table[state, action] = current_q + alpha * (target_q - current_q)

        state = next_state
        total_reward += reward

    reward_history.append(total_reward)

    if (episode + 1) % 100 == 0:
        avg_reward = np.mean(reward_history[-100:])
        print(f"Episode {episode + 1}, average reward of last 100 episodes: {avg_reward:.2f}")

env.close()

# =========================
# 7. Test training results
# =========================
print("\n=== Start Testing ===")
test_env = gym.make("CliffWalking-v0", render_mode="human")
state, _ = test_env.reset()
test_terminated = False
test_truncated = False
step_count = 0
test_total_reward = 0

while not (test_terminated or test_truncated):
    action = choose_action(state, Q_table, epsilon=0)
    state, reward, test_terminated, test_truncated, _ = test_env.step(action)
    step_count += 1
    test_total_reward += reward
    print(f"Step {step_count}: state {state}, action {action}, reward {reward}")

print(f"Test complete! Total steps: {step_count}, total reward: {test_total_reward}")
test_env.close()
```
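One way to inspect what a Q-table has learned is to print its greedy action for every cell of the grid. The following is a minimal sketch, not part of the original code: the helper `render_policy`, the arrow symbols, and the hand-filled `demo_q` table are hypothetical, and it assumes states are indexed as `row * 12 + col` with the action numbering listed in the environment description (Up 0, Right 1, Down 2, Left 3).

```python
import numpy as np

# Action numbering from the environment description: Up, Right, Down, Left
ARROWS = {0: "^", 1: ">", 2: "v", 3: "<"}


def render_policy(q_table, n_rows=4, n_cols=12):
    """Return the greedy policy as one string per grid row.

    Cliff cells are shown as "C" and the goal as "G".
    Assumes state index = row * n_cols + col.
    """
    lines = []
    for row in range(n_rows):
        cells = []
        for col in range(n_cols):
            state = row * n_cols + col
            on_bottom_row = row == n_rows - 1
            if on_bottom_row and col == n_cols - 1:
                cells.append("G")
            elif on_bottom_row and 0 < col < n_cols - 1:
                cells.append("C")
            else:
                cells.append(ARROWS[int(np.argmax(q_table[state]))])
        lines.append(" ".join(cells))
    return lines


# Hand-filled demo table (not a trained result): up from the start,
# right along the row above the cliff, then down into the goal
demo_q = np.zeros((48, 4))
demo_q[36, 0] = 1.0      # start cell: prefer Up
demo_q[24:35, 1] = 1.0   # row 2, columns 0 to 10: prefer Right
demo_q[35, 2] = 1.0      # row 2, column 11: prefer Down

for line in render_policy(demo_q):
    print(line)
```

After training, calling `render_policy(Q_table)` with the table from the listing above shows the route the agent would follow with exploration turned off. Cells whose Q-values are all still zero print as `^`, because `np.argmax` returns the first index on ties.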
