
Ml Random Forest

## Random Forest

Imagine you are competing in an important quiz and face a difficult question: would you trust the judgment of one top expert, or the majority vote of 100 decent players? In most cases, the wisdom of the crowd compensates for individual biases and limitations, leading to more stable and accurate decisions.

In machine learning, **Random Forest** is a leading example of this **collective wisdom** idea. It builds a large number of decision trees and has them vote together, which has made it one of the most powerful and popular machine learning algorithms.

### What is Random Forest?

**Random Forest** is a machine learning algorithm based on ensemble learning. Its core idea is captured by a Chinese proverb: **"Three cobblers equal a Zhuge Liang"**, meaning that several ordinary minds working together can match one master strategist.

* **Forest**: a collection of multiple **decision trees**.
* **Random**: the algorithm introduces two kinds of randomness when building each decision tree, so that the trees differ from one another.

For classification tasks, the forest produces its result by **voting (majority rule)**; for regression tasks, by **averaging**.

### Core Ideas: Bagging and Randomness

The success of Random Forest rests on two cornerstones:

**Bagging (Bootstrap Aggregating)**:

* **Bootstrap (sampling with replacement)**: Randomly sample **with replacement** from the original training dataset to generate multiple different training subsets. The same sample may appear several times in one subset, while another sample may not appear in it at all.
* **Aggregating**: Train one decision tree independently on each training subset, then aggregate the results of all trees (voting or averaging).
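The two Bagging steps above can be sketched by hand. The snippet below is a minimal illustration, not how `scikit-learn` implements Random Forest internally: it assumes the Iris dataset, plain `DecisionTreeClassifier` base learners, and a hypothetical committee size of `n_trees = 25`. It covers only Bagging and does not apply per-split feature randomness.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
n_samples = X.shape[0]
n_trees = 25                      # hypothetical committee size for this sketch
rng = np.random.default_rng(0)    # fixed seed so the run is repeatable

trees = []
left_out_fractions = []
for _ in range(n_trees):
    # Bootstrap: draw n_samples row indices WITH replacement
    idx = rng.integers(0, n_samples, size=n_samples)
    # Fraction of original rows that never made it into this subset
    left_out_fractions.append(1.0 - np.unique(idx).size / n_samples)
    # Train one tree independently on its own bootstrap subset
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Aggregating: every tree predicts, then each sample takes the majority class
all_preds = np.stack([tree.predict(X) for tree in trees])  # (n_trees, n_samples)
votes = np.apply_along_axis(
    lambda column: np.bincount(column, minlength=3).argmax(), 0, all_preds
)

mean_left_out = float(np.mean(left_out_fractions))
train_accuracy = float(np.mean(votes == y))
print("Average fraction of rows left out of a bootstrap subset:", round(mean_left_out, 3))
print("Accuracy of the majority vote on the training data:", round(train_accuracy, 3))
```

Because each draw is made with replacement, a bootstrap subset of size n is expected to miss roughly a third of the original rows (the fraction approaches 1/e, about 37%, as n grows), which is why every tree ends up seeing a noticeably different dataset.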
**Feature Randomness**:

* At each node split in each tree, the algorithm does not consider all features. Instead, it **randomly selects a subset of the features** and chooses the best split from that subset.
* This further increases the differences between trees, letting the forest see different aspects of the problem.

**In simple terms, Random Forest assembles a diverse expert committee by training each tree on slightly different data and feature subsets.** Even if some trees make mistakes, the remaining trees can outvote them, giving more stable and powerful performance than a single decision tree.

* * *

## Algorithm Flow and Key Parameters

### Random Forest Working Steps

The flowchart below shows the working process:

[Figure: flowchart of the Random Forest workflow]

### Key Hyperparameters Explained

When using the `scikit-learn` library, understanding the following core parameters is crucial:

| Parameter | Meaning | Typical Value/Impact | Plain Explanation |
| --- | --- | --- | --- |
| **`n_estimators`** | Number of decision trees in the forest. | Default 100. Larger values typically make the model more stable and perform better, but also increase computational cost. | **"Number of people on the committee"**. More people usually mean more reliable decisions, but longer meetings. |
| **`max_depth`** | Maximum depth of a single decision tree. | Default `None` (unlimited). Limiting depth can prevent overfitting and simplify the model. | **"Limit each person's speaking time"**. Prevents an expert (tree) from going too deep into the details of the training data. |
| **`max_features`** | Number of features to consider when finding the best split. | Can be an integer, a float, `'sqrt'`, `'log2'`, or `None`; the classifier defaults to `'sqrt'`. The older `'auto'` option was removed in scikit-learn 1.3. This is the key parameter for introducing "feature randomness". | **"Only look at a few randomly chosen aspects in each discussion"**. Ensures each tree analyzes the problem from different angles, increasing diversity. |
| **`min_samples_split`** | Minimum number of samples required to split a node. | Default 2. Larger values make tree growth more conservative and less prone to overfitting. | **"How many people a group needs before it may split into sub-groups"**. Avoids creating new rules based on just one or two samples. |
| **`min_samples_leaf`** | Minimum number of samples required at a leaf node. | Default 1. Larger values make the model smoother. | **"A final conclusion must rest on at least a few cases"**. Ensures each conclusion has some data support. |
| **`bootstrap`** | Whether to use bootstrap sampling. | Default `True`. If set to `False`, the entire dataset is used to train each tree, and part of the randomness is lost. | **"Allow the same case to be drawn more than once"**. Enabling this is the essence of Bagging. |

* * *

## Hands-on Practice - Code Examples

Let's practice with the classic Iris classification dataset.

### Example 1: Basic Classification Task

```python
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# 1. Load data
iris = load_iris()
X = iris.data    # Features: sepal length, sepal width, petal length, petal width
y = iris.target  # Labels: three types of iris

# 2. Split training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3. Create Random Forest classifier
# Here we set 100 trees with a max depth of 5
rf_clf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)

# 4. Train the model
rf_clf.fit(X_train, y_train)

# 5. Make predictions on the test set
y_pred = rf_clf.predict(X_test)

# 6. Evaluate model performance
print("Test set accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```

**Code Explanation:**

1. **Import libraries**: `RandomForestClassifier` is the Random Forest classifier.
2. **Load data**: The Iris dataset has 150 samples, 4 features, and 3 classes.
3. **Split data**: Use 70% of the data for training and 30% for testing, to check how well the model generalizes to new data.
4. **Instantiate model**: `random_state=42` ensures reproducible results.
5. **Train model**: The `fit` method builds 100 decision trees.
6. **Predict and evaluate**: Use the trained forest to predict the test set and calculate accuracy and other metrics.

Output (classification report):

```
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        19
  versicolor       1.00      1.00      1.00        13
   virginica       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45
```

### Example 2: Viewing Feature Importance

Random Forest has another useful capability: estimating each feature's contribution to the predictions.

```python
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd
import matplotlib.pyplot as plt

# -------------------------- Set Chinese font start --------------------------
plt.rcParams['font.sans-serif'] = [
    # Windows priority
    'SimHei', 'Microsoft YaHei',
    # macOS priority
    'PingFang SC', 'Heiti TC',
    # Linux priority
    'WenQuanYi Micro Hei', 'DejaVu Sans'
]
# Fix minus sign displaying as a square
plt.rcParams['axes.unicode_minus'] = False
# -------------------------- Set Chinese font end --------------------------

# 1. Load data
iris = load_iris()
X = iris.data    # Features: sepal length, sepal width, petal length, petal width
y = iris.target  # Labels: three types of iris

# 2. Split training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3. Create Random Forest classifier
# Here we set 100 trees with a max depth of 5
rf_clf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)

# 4. Train the model
rf_clf.fit(X_train, y_train)
```
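The feature-importance step that Example 2 introduces can be sketched as follows. This is a minimal, self-contained illustration and not necessarily the remainder of the original example: it assumes the same data split and model settings as above, reads the fitted model's `feature_importances_` attribute (scikit-learn's impurity-based importances, which sum to 1), and prints a ranking instead of plotting it.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Same data, split, and model settings as in the examples above
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)
rf_clf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf_clf.fit(X_train, y_train)

# Impurity-based importance of each feature, averaged over all trees
importances = rf_clf.feature_importances_

# Rank features from most to least important
order = np.argsort(importances)[::-1]
for i in order:
    print(f"{iris.feature_names[i]:<20} {importances[i]:.3f}")
```

Impurity-based importances are computed from the training data only, so they are best read as a quick ranking and not as a precise measure of each feature's effect.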
← Ml Cluster AnalysisMl Regression Model Evaluation β†’