# ML Data Bias

## Data Bias

Machine learning is transforming the world: from recommendation systems to autonomous driving, its applications are everywhere. However, these systems are not perfect. They share a common **Achilles' heel**: data bias. This article examines this core issue, which affects both the fairness and the accuracy of machine learning models.

* * *

## What is Data Bias?

Data bias refers to training data that does not accurately represent real-world situations, causing machine learning models to learn incorrect patterns or make biased predictions. Simply put, it's "garbage in, garbage out": if the input data is flawed, the output will be flawed too.

### Three Main Types of Data Bias

* **Selection Bias:** Occurs when the data collection process itself is systematically skewed. For example, surveying only young people's opinions on a product through social media while ignoring elderly groups who don't use social media.
* **Measurement Bias:** Occurs when data is measured or recorded with systematic error. For instance, a camera that consistently underexposes darker skin tones produces lower-quality images for those individuals, so a facial recognition system trained on its photos may perform poorly on them.
* **Confirmation Bias:** Occurs when researchers or data annotators bring their own subjective views into the data. For example, in sentiment analysis tasks, annotators might interpret the emotion of a text through their own cultural background and misread expressions from other cultures.

* * *

## How Does Data Bias Affect Machine Learning Models?

### Decline in Model Performance

When a model performs well on training data but poorly on real-world data, data bias is a likely cause.
Example:

    # Simulating the impact of data bias on model performance
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    np.random.seed(42)
    n_samples = 1000

    # Real-world data: X is uniform on [-5, 5], label is 1 when X > 0
    X_real = np.random.uniform(-5, 5, n_samples).reshape(-1, 1)
    y_real = (X_real.flatten() > 0).astype(int)

    # Biased data: keep every sample with X > 0, but only a few with X <= 0
    mask_pos = X_real.flatten() > 0
    mask_neg = X_real.flatten() <= 0
    X_pos, y_pos = X_real[mask_pos], y_real[mask_pos]
    X_neg, y_neg = X_real[mask_neg], y_real[mask_neg]

    # Only sample a few negative samples (10% of the positive count)
    neg_sample_size = int(0.1 * len(X_pos))
    neg_indices = np.random.choice(len(X_neg), neg_sample_size, replace=False)
    X_neg, y_neg = X_neg[neg_indices], y_neg[neg_indices]

    X_biased = np.vstack([X_pos, X_neg])
    y_biased = np.hstack([y_pos, y_neg])

    # Split into train/test
    X_train, X_test, y_train, y_test = train_test_split(
        X_biased, y_biased, test_size=0.2, random_state=42
    )

    # Train model
    model = LogisticRegression()
    model.fit(X_train, y_train)

    # Evaluate on training set
    train_pred = model.predict(X_train)
    train_acc = accuracy_score(y_train, train_pred)
    print(f"Training Accuracy: {train_acc:.2%}")

    # Evaluate on real-world data
    real_pred = model.predict(X_real)
    real_acc = accuracy_score(y_real, real_pred)
    print(f"Real-world Accuracy: {real_acc:.2%}")
    print("Conclusion: Good performance on training set, but accuracy drops on real-world data due to distribution bias")

Output:

    Training Accuracy: 99.08%
    Real-world Accuracy: 96.00%
    Conclusion: Good performance on training set, but accuracy drops on real-world data due to distribution bias

### Fairness Issues

Data bias can lead to discriminatory outcomes for certain groups. For example, a hiring algorithm trained mainly on historical data about male employees may be biased against female job seekers.

### Poor Generalization Ability

The model fails to adapt to new, unseen scenarios because the training data does not cover sufficiently diverse cases.
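Both effects can be seen in a small simulation. The sketch below is only illustrative: the two groups, their thresholds, and the helper `make_group` are hypothetical choices, not taken from any real dataset. It trains one model on data dominated by group A and then compares accuracy per group on fresh samples.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_group(n, threshold):
    """Hypothetical group: one feature, label is 1 when it exceeds the group's threshold."""
    X = rng.normal(threshold, 2.0, size=(n, 1))
    y = (X.ravel() > threshold).astype(int)
    return X, y

# Group A dominates the training data; group B is barely represented
X_a, y_a = make_group(950, threshold=0.0)
X_b, y_b = make_group(50, threshold=3.0)
model = LogisticRegression()
model.fit(np.vstack([X_a, X_b]), np.hstack([y_a, y_b]))

# Evaluate on fresh, equally sized samples from each group
X_a_test, y_a_test = make_group(1000, threshold=0.0)
X_b_test, y_b_test = make_group(1000, threshold=3.0)
acc_a = accuracy_score(y_a_test, model.predict(X_a_test))
acc_b = accuracy_score(y_b_test, model.predict(X_b_test))

print(f"Group A accuracy: {acc_a:.2%}")
print(f"Group B accuracy: {acc_b:.2%}")
```

Because the learned decision boundary sits close to group A's threshold, the model should score noticeably worse on group B. An overall accuracy figure would hide this gap, which is why per-group evaluation matters.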
* * *

## Common Sources of Data Bias

### Problems During Data Collection Phase

| Problem Type | Specific Manifestation | Example |
| --- | --- | --- |
| **Sampling Bias** | Data samples do not represent the overall population | Collecting autonomous driving data only in urban areas, ignoring rural roads |
| **Temporal Bias** | Data is outdated | Using e-commerce data from 2010 to predict consumption trends in 2023 |
| **Survivorship Bias** | Only "surviving" data points are considered | Studying only successful companies' data, ignoring failures |

### Problems During Data Annotation Phase

Example:

    # Simulating the impact of annotation bias
    import pandas as pd

    # Create simulated dataset
    data = {
        'text': [
            'This product is really great, I love it so much!',
            "I'm not sure whether this one works well or not",
            'Definitely don’t buy this junk product',
            'It’s okay, nothing special',
            'Super recommend, worth every penny'
        ],
        # Assume the annotator has a personal bias:
        # clearly positive reviews are labeled 1, everything else is labeled 0
        'biased_label': [1, 0, 0, 0, 1],   # Biased labeling
        'true_label': [1, 0.5, 0, 0.5, 1]  # True continuous sentiment score
    }

    df = pd.DataFrame(data)
    print("Example of Labeling Bias:")
    print(df)
    print("\nProblem: Annotator incorrectly labels neutral reviews as negative!")

Output:

    Example of Labeling Bias:
                                                   text  biased_label  true_label
    0  This product is really great, I love it so much!             1         1.0
    1   I'm not sure whether this one works well or not             0         0.5
    2            Definitely don’t buy this junk product             0         0.0
    3                        It’s okay, nothing special             0         0.5
    4                Super recommend, worth every penny             1         1.0

    Problem: Annotator incorrectly labels neutral reviews as negative!

### Problems During Data Preprocessing Phase

* Improper handling of outliers
* Biased feature selection
* Inappropriate data normalization methods

* * *

## How to Detect Data Bias?

### 1. Statistical Analysis of Data

Check the distribution of different groups within the dataset and compare it with their distribution in the real-world population.
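A numeric version of this check is a representation ratio: each group's share of the dataset divided by its share of the population. The sketch below assumes made-up proportions for four hypothetical groups; in practice the population shares would come from census or domain statistics.

```python
import numpy as np

groups = ["Group A", "Group B", "Group C", "Group D"]
population_share = np.array([0.40, 0.30, 0.20, 0.10])  # assumed real-world proportions
dataset_share = np.array([0.70, 0.20, 0.08, 0.02])     # assumed proportions in the dataset

# Ratio > 1 means over-represented, ratio < 1 means under-represented
ratio = dataset_share / population_share

for name, r in zip(groups, ratio):
    if r > 1:
        status = "over-represented"
    elif r < 1:
        status = "under-represented"
    else:
        status = "balanced"
    print(f"{name}: ratio = {r:.2f} ({status})")
```

A ratio far from 1 flags a group worth a closer look. Where to draw the line is a judgment call that depends on the application.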
Example:

    # Checking distribution of different groups in the dataset
    import matplotlib.pyplot as plt

    # Simulate demographic statistics
    groups = ['Group A', 'Group B', 'Group C', 'Group D']
    population_percent = [40, 30, 20, 10]  # Real population proportions
    dataset_percent = [70, 20, 8, 2]       # Proportions in dataset

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

    # Real population distribution
    ax1.pie(population_percent, labels=groups, autopct='%1.1f%%')
    ax1.set_title('Real-world Population Distribution')

    # Dataset distribution
    ax2.pie(dataset_percent, labels=groups, autopct='%1.1f%%')
    ax2.set_title('Distribution in Dataset')

    plt.tight_layout()
    plt.show()

    print("Detection Result: Group A is overrepresented, Group D underrepresented!")

### 2. Analysis of Model Performance Differences

Compare model performance (for example accuracy or recall) across different subgroups. A large gap between subgroups suggests that some of them are poorly represented in the training data.

### 3. Calculation of Fairness Metrics

Use statistical metrics to quantify how fair the model's predictions are across groups.

* * *

## Strategies to Address Data Bias

### Solutions at the Data Level

#### 1. Improve Data Collection Strategy

* **Active Sampling**: Deliberately collect more data for under-represented groups
* **Data Augmentation**: Increase data diversity through technical means
* **Multiple Data Sources**: Integrate data from various sources

#### 2. Data Preprocessing Techniques

Example:

    # Using resampling techniques to balance datasets
    import numpy as np
    from sklearn.utils import resample

    np.random.seed(42)

    # 1. Construct imbalanced data (1D feature + label)
    X_majority = np.random.normal(0, 1, 900).reshape(-1, 1)
    y_majority = np.zeros(900, dtype=int)
    X_minority = np.random.normal(2, 1, 100).reshape(-1, 1)
    y_minority = np.ones(100, dtype=int)

    print("Before resampling:")
    print(f"Majority class: {len(X_majority)}")
    print(f"Minority class: {len(X_minority)}")

    # 2. Upsample minority class (sampling with replacement)
    X_minority_upsampled, y_minority_upsampled = resample(
        X_minority, y_minority,
        replace=True,
        n_samples=len(X_majority),
        random_state=42
    )

    # 3. Combine into a balanced dataset
    X_balanced = np.vstack([X_majority, X_minority_upsampled])
    y_balanced = np.hstack([y_majority, y_minority_upsampled])

    print("\nAfter resampling:")
    print(f"Majority class: {np.sum(y_balanced == 0)}")
    print(f"Minority class: {np.sum(y_balanced == 1)}")
    print("\nConclusion: Sample counts now perfectly balanced")

Output:

    Before resampling:
    Majority class: 900
    Minority class: 100

    After resampling:
    Majority class: 900
    Minority class: 900

    Conclusion: Sample counts now perfectly balanced

### Algorithm-Level Solutions

* **Fairness Constraints**: Add fairness constraints during model training.
* **Adversarial De-biasing**: Use adversarial learning techniques to reduce bias in models.
* **Post-processing Methods**: Adjust model predictions to improve fairness.

* * *

## Practical Exercise: Building an Unbiased Classifier

Let's go through a complete example that handles data bias across the whole workflow, from data preparation to model evaluation.

Example:

    # Complete example: machine learning workflow dealing with data bias
    import numpy as np
    import pandas as pd
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report, confusion_matrix
    from imblearn.over_sampling import SMOTE
    import warnings

    warnings.filterwarnings("ignore", category=FutureWarning)

    # 1. Generate simulated imbalanced data
    X, y = make_classification(
        n_samples=2000,
        n_features=10,
        n_informative=8,
        n_redundant=2,
        n_clusters_per_class=1,
        weights=[0.9, 0.1],  # Explicitly create class imbalance
        flip_y=0,
        random_state=42
    )

    # Convert to DataFrame for easier analysis
    feature_names = [f'feature_{i}' for i in range(X.shape[1])]
    df = pd.DataFrame(X, columns=feature_names)
    df['target'] = y

    print("=== Data Bias Analysis ===")
    print(f"Dataset shape: {df.shape}")
    print("Class distribution:")
    print(df['target'].value_counts())
    print(f"Minority class proportion: {df['target'].value_counts(normalize=True)[1]:.2%}")

    # 2. Split data (stratify to keep the class ratio and avoid secondary bias)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y
    )
    print("\n=== Training/Test Set Distribution ===")
    print("Training set:", np.bincount(y_train))
    print("Test set:", np.bincount(y_test))

    # 3. Use SMOTE to address training set bias (applied to the training set only)
    print("\n=== Handling Data Bias (SMOTE) ===")
    smote = SMOTE(random_state=42)
    X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)
    print("Before SMOTE:", np.bincount(y_train))
    print("After SMOTE:", np.bincount(y_train_balanced))

    # 4. Train models
    print("\n=== Model Training ===")

    # Baseline model: no bias handling
    model_imbalanced = RandomForestClassifier(n_estimators=200, random_state=42)
    model_imbalanced.fit(X_train, y_train)

    # Bias-handling model: trained after SMOTE
    model_balanced = RandomForestClassifier(n_estimators=200, random_state=42)
    model_balanced.fit(X_train_balanced, y_train_balanced)

    # 5. Evaluate models
    print("\n=== Model Evaluation (Test Set) ===")

    print("\n[Baseline model] Classification Report")
    y_pred_imbalanced = model_imbalanced.predict(X_test)
    print(classification_report(y_test, y_pred_imbalanced, digits=4))
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred_imbalanced))

    print("\n[SMOTE model] Classification Report")
    y_pred_balanced = model_balanced.predict(X_test)
    print(classification_report(y_test, y_pred_balanced, digits=4))
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred_balanced))

    print(
        "\nConclusion:\n"
        "1. Without bias handling, the model tends to have lower Recall for the minority class\n"
        "2. SMOTE typically improves Recall and F1 for the minority class\n"
        "3. Accuracy is not a reliable metric for imbalanced data"
    )

Output (truncated):

    === Data Bias Analysis ===
    Dataset shape: (2000, 11)
    Class distribution:
    target
    0    1800
    1     200
    Name: count, dtype: int64
    Minority class proportion: 10.00%

    === Training/Test Set Distribution ===
    Training set: [1260  140]
    Test set: [540  60]

    === Handling Data Bias (SMOTE) ===
    Before SMOTE: [1260  140]
    After SMOTE: [1260 1260]

    === Model Training ===

    === Model Evaluation (Test Set) ===

    [Baseline model] Classification Report
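Beyond recall and F1, the evaluation step can also report the fairness metrics mentioned in the detection section. The sketch below computes two common ones by hand: the demographic parity difference (gap in positive-prediction rates between groups) and the equal opportunity difference (gap in true positive rates). The labels, predictions, group attribute, and helper names are hypothetical toy values chosen for illustration.

```python
import numpy as np

def selection_rate(y_pred, mask):
    """Share of positive predictions within the masked group."""
    return float(np.mean(y_pred[mask]))

def true_positive_rate(y_true, y_pred, mask):
    """Recall within the masked group: P(pred = 1 | true = 1)."""
    positives = mask & (y_true == 1)
    return float(np.mean(y_pred[positives]))

# Hypothetical toy data: true labels, model predictions, and a group attribute
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 1, 0])
group = np.array(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])

in_a = group == "A"
in_b = group == "B"

# Demographic parity difference: gap in positive-prediction rates
dp_diff = selection_rate(y_pred, in_a) - selection_rate(y_pred, in_b)

# Equal opportunity difference: gap in true positive rates
eo_diff = (true_positive_rate(y_true, y_pred, in_a)
           - true_positive_rate(y_true, y_pred, in_b))

print(f"Demographic parity difference: {dp_diff:.2f}")
print(f"Equal opportunity difference: {eo_diff:.2f}")
```

Values close to 0 indicate that the two groups are treated similarly under that metric. Which metric is appropriate, and how large a gap is acceptable, depends on the application.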
