## Data Bias
Machine learning is transforming the world: from recommendation systems to autonomous driving, its applications are everywhere.
However, these intelligent systems are not perfect; they share a common **Achilles' heel**: data bias.
Today, we will delve into this core issue that affects the fairness and accuracy of machine learning models.
* * *
## What is Data Bias?
Data bias refers to training data that does not accurately represent real-world situations, causing machine learning models to learn incorrect patterns or make biased predictions. Simply put, it's "garbage in, garbage out": if the input data is flawed, the output will be flawed too.
### Three Main Types of Data Bias
* **Selection Bias:** Occurs when the data collection process itself contains systematic bias. For example, surveying only young people's opinions on a product through social media while ignoring elderly groups who don't use social media.
* **Measurement Bias:** Errors occur during measurement or recording of data. For instance, facial recognition systems primarily trained on photos of lighter-skinned individuals may perform poorly on darker-skinned individuals.
* **Confirmation Bias:** Researchers or data annotators bring their own subjective biases into the data. For example, in sentiment analysis tasks, annotators might interpret textual emotions based on their cultural background, overlooking expressions from other cultures.
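To make selection bias concrete, the sketch below simulates the social media survey from the first bullet. The population, the age range, and the assumption that social media use falls linearly with age are all made up for illustration; only the qualitative effect matters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: ages 18-90. The probability of using social
# media is ASSUMED to fall linearly with age (illustrative only).
ages = rng.integers(18, 91, size=100_000)
p_social = np.clip(1.0 - (ages - 18) / 72, 0.05, 1.0)
uses_social = rng.random(ages.size) < p_social

# Survey A: simple random sample of the whole population.
random_sample = rng.choice(ages, size=2_000, replace=False)

# Survey B: only people reachable through social media (selection bias).
social_sample = rng.choice(ages[uses_social], size=2_000, replace=False)

print(f"Population mean age:          {ages.mean():.1f}")
print(f"Random sample mean age:       {random_sample.mean():.1f}")
print(f"Social media sample mean age: {social_sample.mean():.1f}")
```

The random sample tracks the population closely, while the social media sample skews clearly younger, even though both contain the same number of respondents.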
* * *
## How Does Data Bias Affect Machine Learning Models?
### Decline in Model Performance
When a model performs well on training data but poorly on real-world data, there is likely a data bias issue.
## Example
# Simulating the impact of data bias on model performance
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
np.random.seed(42)
n_samples = 1000
# Real-world data: X is uniform on [-5, 5]; the label is 1 when X > 0
X_real = np.random.uniform(-5, 5, n_samples).reshape(-1, 1)
y_real = (X_real.flatten() > 0).astype(int)
# Biased data: roughly 90% from X > 0, 10% from X <= 0
mask_neg = X_real.flatten() <= 0
X_pos = X_real[~mask_neg]
y_pos = y_real[~mask_neg]
X_neg = X_real[mask_neg]
y_neg = y_real[mask_neg]
# Only sample a few negative samples
neg_sample_size = int(0.1 * len(X_pos))
neg_indices = np.random.choice(len(X_neg), neg_sample_size, replace=False)
X_biased = np.vstack([X_pos, X_neg[neg_indices]])
y_biased = np.hstack([y_pos, y_neg[neg_indices]])
# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(
    X_biased, y_biased, test_size=0.2, random_state=42
)
# Train model on the biased data
model = LogisticRegression()
model.fit(X_train, y_train)
# Evaluate on training set
train_pred = model.predict(X_train)
train_acc = accuracy_score(y_train, train_pred)
print(f"Training Accuracy: {train_acc:.2%}")
# Evaluate on real-world data
real_pred = model.predict(X_real)
real_acc = accuracy_score(y_real, real_pred)
print(f"Real-world Accuracy: {real_acc:.2%}")
print("Conclusion: Good performance on training set, but significant drop due to distribution bias")
Output:
Training Accuracy: 99.08%
Real-world Accuracy: 96.00%
Conclusion: Good performance on training set, but significant drop due to distribution bias
### Fairness Issues
Data bias can lead to discriminatory outcomes for certain groups. For example, a hiring algorithm trained mainly on historical data of male employees may exhibit bias against female job seekers.
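The hiring scenario can be sketched with synthetic data. Everything here is hypothetical: the `skill` feature, the `is_female` flag, and the size of the historical penalty are assumptions chosen only to show how a model can reproduce a bias that is baked into its training labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical applicants: a skill score and a binary group flag (1 = female).
skill = rng.normal(0, 1, n)
is_female = rng.integers(0, 2, n)

# ASSUMED biased historical decisions: at equal skill, women were hired
# less often (the -1.5 penalty is invented for illustration).
logit = 1.5 * skill - 1.5 * is_female
hired = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Train on the historical decisions, with group membership as a feature.
X = np.column_stack([skill, is_female])
model = LogisticRegression().fit(X, hired)

# Two applicants with identical skill who differ only in group membership.
same_skill = np.array([[0.5, 0], [0.5, 1]])
p_male, p_female = model.predict_proba(same_skill)[:, 1]
print(f"P(hire | male,   skill=0.5): {p_male:.2f}")
print(f"P(hire | female, skill=0.5): {p_female:.2f}")
```

The model assigns a lower hiring probability to the female applicant despite identical skill, because that is the pattern present in the labels it learned from.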
### Poor Generalization Ability
The model fails to adapt to new, unseen data scenarios because the training data does not cover sufficiently diverse cases.
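A small sketch of this effect, under assumed conditions: the true relationship is taken to be `sin(x)`, and the training data only covers the range 0 to 1. A straight line fits that slice well but breaks down on inputs the training data never covered.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_relation(x):
    # ASSUMED ground truth for this sketch: a curved relationship.
    return np.sin(x)

# Training data covers only a narrow slice of the input space: [0, 1].
x_train = rng.uniform(0.0, 1.0, 200)
y_train = true_relation(x_train) + rng.normal(0.0, 0.05, 200)

# Fit a straight line; np.polyfit returns the highest-degree coefficient first.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

def mse(x):
    """Mean squared error of the fitted line against the true relation."""
    pred = slope * x + intercept
    return float(np.mean((pred - true_relation(x)) ** 2))

mse_covered = mse(np.linspace(0.0, 1.0, 100))    # inputs like the training data
mse_uncovered = mse(np.linspace(2.0, 4.0, 100))  # inputs never seen in training

print(f"MSE on covered range [0, 1]:   {mse_covered:.4f}")
print(f"MSE on uncovered range [2, 4]: {mse_uncovered:.4f}")
```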
* * *
## Common Sources of Data Bias
### Problems During Data Collection Phase
| Problem Type | Specific Manifestation | Example |
| --- | --- | --- |
| **Sampling Bias** | Data samples do not represent the overall population | Collecting autonomous driving data only in urban areas, ignoring rural roads |
| **Temporal Bias** | Data is outdated or lacks timeliness | Using e-commerce data from 2010 to predict consumption trends in 2023 |
| **Survivorship Bias** | Only focusing on "surviving" data points | Studying only successful companies' data, ignoring failures |
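As a rough illustration of the survivorship bias row, the sketch below draws hypothetical company returns and then drops the companies assumed to have failed. The return distribution and the failure threshold are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical annual returns for 10,000 companies, centered on 0%.
returns = rng.normal(loc=0.0, scale=0.2, size=10_000)

# ASSUMPTION: companies with a return below -10% went out of business
# and therefore never appear in the dataset we get to study.
survived = returns > -0.10

mean_all = returns.mean()
mean_survivors = returns[survived].mean()

print(f"Mean return, all companies:  {mean_all:+.1%}")
print(f"Mean return, survivors only: {mean_survivors:+.1%}")
print(f"Share of companies observed: {survived.mean():.1%}")
```

Looking only at survivors makes the average company appear considerably more successful than it really was.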
### Problems During Data Annotation Phase
## Example
# Simulating the impact of annotation bias
import pandas as pd
# Create simulated dataset
data = {
    'text': [
        'This product is really great, I love it so much!',
        "I'm not sure whether this one works well or not",
        "Definitely don't buy this junk product",
        "It's okay, nothing special",
        'Super recommend, worth every penny'
    ],
    # Assume annotator has personal bias: positive reviews labeled as 1, others labeled as 0
    'biased_label': [1, 0, 0, 0, 1],  # Biased labeling
    'true_label': [1, 0.5, 0, 0.5, 1]  # True continuous sentiment score
}
df = pd.DataFrame(data)
print("Example of Labeling Bias:")
print(df)
print("\nProblem: Annotator incorrectly labels neutral reviews as negative!")
Output:
Example of Labeling Bias:
                                               text  biased_label  true_label
0  This product is really great, I love it so much!             1         1.0
1   I'm not sure whether this one works well or not             0         0.5
2            Definitely don't buy this junk product             0         0.0
3                        It's okay, nothing special             0         0.5
4                Super recommend, worth every penny             1         1.0

Problem: Annotator incorrectly labels neutral reviews as negative!
### Problems During Data Preprocessing Phase
* Improper handling of outliers
* Biased feature selection
* Inappropriate data normalization methods
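The first item can be illustrated with a minimal sketch. It assumes a single numeric feature and two hypothetical groups whose typical values differ; a global z-score filter, a common outlier rule, then removes most of the smaller group even though its values are perfectly normal for that group.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature for two groups with different typical values.
group_a = rng.normal(50, 10, 950)  # majority group
group_b = rng.normal(90, 10, 50)   # minority group
values = np.concatenate([group_a, group_b])
is_b = np.concatenate([np.zeros(950, dtype=bool), np.ones(50, dtype=bool)])

# Global outlier rule (ASSUMED threshold): drop points with |z| > 2,
# where z is computed from the mean and std of the pooled data.
z = (values - values.mean()) / values.std()
keep = np.abs(z) <= 2

removed_a = 1 - keep[~is_b].mean()
removed_b = 1 - keep[is_b].mean()

print(f"Removed from group A: {removed_a:.1%}")
print(f"Removed from group B: {removed_b:.1%}")
```

Because the pooled statistics are dominated by the majority, the minority group is treated as noise. Checking what a preprocessing step removes per group is one way to catch this before training.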
* * *
## How to Detect Data Bias?
### 1. Statistical Analysis of Data
Check the distribution of different groups within the dataset.
## Example
# Checking distribution of different groups in the dataset
import matplotlib.pyplot as plt
# Simulate demographic statistics
groups = ['Group A', 'Group B', 'Group C', 'Group D']
population_percent = [40, 30, 20, 10]  # Real population proportions
dataset_percent = [70, 20, 8, 2]  # Proportions in dataset
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# Real population distribution
ax1.pie(population_percent, labels=groups, autopct='%1.1f%%')
ax1.set_title('Real-world Population Distribution')
# Dataset distribution
ax2.pie(dataset_percent, labels=groups, autopct='%1.1f%%')
ax2.set_title('Distribution in Dataset')
plt.tight_layout()
plt.show()
print("Detection Result: Group A is overrepresented, Group D underrepresented!")
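The visual comparison can also be turned into numbers. The sketch below reuses the same simulated percentages and computes a representation ratio per group (share in the dataset divided by share in the population). The 0.8 and 1.25 cut-offs are an assumed rule of thumb for this example, not an established standard.

```python
# Simulated shares (in percent), matching the figures used in the plot example.
groups = ["Group A", "Group B", "Group C", "Group D"]
population_percent = [40, 30, 20, 10]
dataset_percent = [70, 20, 8, 2]

# ASSUMED rule of thumb: flag a group whose share in the dataset is
# below 80% or above 125% of its share in the population.
LOW, HIGH = 0.8, 1.25

ratios = {}
for name, pop, data in zip(groups, population_percent, dataset_percent):
    ratio = data / pop  # 1.0 means the group is represented proportionally
    ratios[name] = ratio
    if ratio < LOW:
        status = "underrepresented"
    elif ratio > HIGH:
        status = "overrepresented"
    else:
        status = "roughly proportional"
    print(f"{name}: dataset/population = {ratio:.2f} ({status})")
```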
### 2. Analysis of Model Performance Differences
Compare model performance across different subgroups.
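A minimal sketch of such a comparison follows. The labels, predictions, and group assignments are hand-made toy values, and `accuracy_by_group` is a hypothetical helper written for this example rather than a library function.

```python
import numpy as np

def accuracy_by_group(y_true, y_pred, groups):
    """Return a dict mapping each group name to the accuracy within that group."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    result = {}
    for g in np.unique(groups):
        mask = groups == g
        result[str(g)] = float(np.mean(y_true[mask] == y_pred[mask]))
    return result

# Toy data: 5 samples from group A followed by 5 samples from group B.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
groups = ["A"] * 5 + ["B"] * 5

overall = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
per_group = accuracy_by_group(y_true, y_pred, groups)

print(f"Overall accuracy: {overall:.2f}")
for name, acc in per_group.items():
    print(f"Accuracy for group {name}: {acc:.2f}")
```

In this toy data the overall accuracy of 0.70 hides a perfect score for group A and only 0.40 for group B, which is exactly the kind of gap that an aggregate metric conceals.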
### 3. Calculation of Fairness Metrics
Use statistical metrics to quantify the fairness level of the model.
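Two commonly used metrics are the demographic parity difference (the gap in positive prediction rates between groups) and the equal opportunity difference (the gap in true positive rates). The sketch below computes both by hand on toy data; the arrays and helper names are made up for illustration, and dedicated fairness libraries offer more complete implementations.

```python
import numpy as np

def selection_rate(y_pred, mask):
    """Share of positive predictions within the masked group."""
    return float(y_pred[mask].mean())

def true_positive_rate(y_true, y_pred, mask):
    """Share of actual positives in the group that the model predicted as positive."""
    positives = mask & (y_true == 1)
    return float(y_pred[positives].mean())

# Toy data: 6 samples from group A followed by 6 samples from group B.
y_true = np.array([1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0])
group = np.array(["A"] * 6 + ["B"] * 6)
in_a, in_b = group == "A", group == "B"

# Demographic parity difference: gap in positive prediction rates.
dp_diff = selection_rate(y_pred, in_a) - selection_rate(y_pred, in_b)

# Equal opportunity difference: gap in true positive rates.
eo_diff = (true_positive_rate(y_true, y_pred, in_a)
           - true_positive_rate(y_true, y_pred, in_b))

print(f"Demographic parity difference: {dp_diff:.2f}")
print(f"Equal opportunity difference:  {eo_diff:.2f}")
```

A value of 0 on either metric means the two groups are treated alike by that criterion; the further from 0, the larger the disparity.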
* * *
## Strategies to Address Data Bias
### Solutions at the Data Level
#### 1. Improve Data Collection Strategy
* **Active Sampling**: Consciously collect under-represented data
* **Data Augmentation**: Increase data diversity through technical means
* **Multiple Data Sources**: Integrate data from various sources
#### 2. Data Preprocessing Techniques
## Example
# Using resampling techniques to balance datasets
import numpy as np
from sklearn.utils import resample
np.random.seed(42)
# 1. Construct imbalanced data (1D feature + label)
X_majority = np.random.normal(0, 1, 900).reshape(-1, 1)
y_majority = np.zeros(900, dtype=int)
X_minority = np.random.normal(2, 1, 100).reshape(-1, 1)
y_minority = np.ones(100, dtype=int)
print("Before resampling:")
print(f"Majority class: {len(X_majority)}")
print(f"Minority class: {len(X_minority)}")
# 2. Upsample minority class (sampling with replacement)
X_minority_upsampled, y_minority_upsampled = resample(
    X_minority,
    y_minority,
    replace=True,
    n_samples=len(X_majority),
    random_state=42
)
# 3. Combine balanced dataset
X_balanced = np.vstack([X_majority, X_minority_upsampled])
y_balanced = np.hstack([y_majority, y_minority_upsampled])
print("\nAfter resampling:")
print(f"Majority class: {np.sum(y_balanced == 0)}")
print(f"Minority class: {np.sum(y_balanced == 1)}")
print("\nConclusion: Sample counts now perfectly balanced")
Output:
Before resampling:
Majority class: 900
Minority class: 100

After resampling:
Majority class: 900
Minority class: 900

Conclusion: Sample counts now perfectly balanced
### Algorithm-Level Solutions
* Fairness Constraints: Add fairness constraints during model training.
* Adversarial De-biasing: Use adversarial learning techniques to reduce bias in models.
* Post-processing Methods: Adjust model predictions to improve fairness.
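Of the three approaches, post-processing is the simplest to sketch. The example below assumes a model that has already produced scores for two hypothetical groups, with group B's scores shifted lower, and replaces a single global threshold with per-group thresholds so that both groups receive positive predictions at the same rate. The score distributions and the 30% target rate are illustrative assumptions, and equalizing selection rates is only one of several possible fairness goals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model scores in [0, 1]; group B is ASSUMED to score lower.
scores_a = rng.beta(5, 3, 1000)
scores_b = rng.beta(3, 5, 1000)

target_rate = 0.30  # assumed desired share of positive predictions

# Baseline: one global threshold applied to everyone.
global_thr = np.quantile(np.concatenate([scores_a, scores_b]), 1 - target_rate)
rate_a_global = float((scores_a >= global_thr).mean())
rate_b_global = float((scores_b >= global_thr).mean())

# Post-processing: a separate threshold per group, each chosen so that
# the top 30% of that group receives a positive prediction.
thr_a = np.quantile(scores_a, 1 - target_rate)
thr_b = np.quantile(scores_b, 1 - target_rate)
rate_a_post = float((scores_a >= thr_a).mean())
rate_b_post = float((scores_b >= thr_b).mean())

print(f"Global threshold:     A = {rate_a_global:.1%}, B = {rate_b_global:.1%}")
print(f"Per-group thresholds: A = {rate_a_post:.1%}, B = {rate_b_post:.1%}")
```

The model itself is untouched; only the decision rule changes. That makes post-processing easy to apply, at the cost of requiring group membership at prediction time.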
* * *
## Practical Exercise: Building an Unbiased Classifier
Let's go through a complete example to learn how to handle data bias throughout the entire process, from data collection to model evaluation.
## Example
# Complete example: machine learning workflow that deals with data bias
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
# 1. Generate simulated imbalanced data
X, y = make_classification(
    n_samples=2000,
    n_features=10,
    n_informative=8,
    n_redundant=2,
    n_clusters_per_class=1,
    weights=[0.9, 0.1],  # Explicitly create class imbalance
    flip_y=0,
    random_state=42
)
# Convert to DataFrame for easier analysis
feature_names = [f'feature_{i}' for i in range(X.shape[1])]
df = pd.DataFrame(X, columns=feature_names)
df['target'] = y
print("=== Data Bias Analysis ===")
print(f"Dataset shape: {df.shape}")
print("Class distribution:")
print(df['target'].value_counts())
print(f"Minority class proportion: {df['target'].value_counts(normalize=True)[1]:.2%}")
# 2. Split data (maintain class ratio to prevent secondary bias)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=42,
    stratify=y
)
print("\n=== Training/Test Set Distribution ===")
print("Training set:", np.bincount(y_train))
print("Test set:", np.bincount(y_test))
# 3. Use SMOTE to address training set bias (applied to the training set only)
print("\n=== Handling Data Bias (SMOTE) ===")
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)
print("Before SMOTE:", np.bincount(y_train))
print("After SMOTE:", np.bincount(y_train_balanced))
# 4. Train models
print("\n=== Model Training ===")
# Baseline model: no bias handling
model_imbalanced = RandomForestClassifier(
    n_estimators=200,
    random_state=42
)
model_imbalanced.fit(X_train, y_train)
# Bias-handling model: trained after SMOTE
model_balanced = RandomForestClassifier(
    n_estimators=200,
    random_state=42
)
model_balanced.fit(X_train_balanced, y_train_balanced)
# 5. Evaluate both models on the same, untouched test set
print("\n=== Model Evaluation (Test Set) ===")
print("\nClassification Report (baseline model)")
y_pred_imbalanced = model_imbalanced.predict(X_test)
print(classification_report(y_test, y_pred_imbalanced, digits=4))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_imbalanced))
print("\nClassification Report (SMOTE model)")
y_pred_balanced = model_balanced.predict(X_test)
print(classification_report(y_test, y_pred_balanced, digits=4))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_balanced))
print(
    "\nConclusion:\n"
    "1. Without bias handling, model has extremely low Recall for minority class\n"
    "2. SMOTE significantly improves Recall and F1 for minority class\n"
    "3. Accuracy is not a reliable metric for imbalanced data"
)
Output:
=== Data Bias Analysis ===
Dataset shape: (2000, 11)
Class distribution:
target
0    1800
1     200
Name: count, dtype: int64
Minority class proportion: 10.00%

=== Training/Test Set Distribution ===
Training set: [1260  140]
Test set: [540  60]

=== Handling Data Bias (SMOTE) ===
Before SMOTE: [1260  140]
After SMOTE: [1260 1260]

=== Model Training ===

=== Model Evaluation (Test Set) ===

Classification Report