

## Machine Learning - Customer Segmentation

In today's data-driven business world, understanding customers is the key to success. However, when your customer base reaches thousands or even millions, manually analyzing each customer's characteristics and behavior patterns becomes impractical. At this point, machine learning techniques, especially clustering algorithms in **unsupervised learning**, become a powerful tool.

Customer segmentation, also known as customer clustering, aims to divide a large customer base into several subgroups with similar characteristics. This is like an experienced shopkeeper who no longer views customers as a vague whole, but can clearly identify different groups such as budget-conscious homemakers, tech enthusiasts keen on new products, and high-end customers who value service experience. By adopting targeted marketing, service, and product strategies for different groups, businesses can significantly improve operational efficiency and customer satisfaction.

This article will guide you through a complete customer segmentation project. We will use the classic K-Means clustering algorithm to analyze a simulated retail customer dataset, from data understanding to model evaluation, ultimately obtaining segmentation results with business insights.

* * *

## Understanding Cluster Analysis and K-Means Algorithm

Before starting the practice, we need to understand the core tools we will use.

### What is Cluster Analysis?

Cluster analysis is an unsupervised learning method. Unlike supervised learning (such as predicting house prices or recognizing cat and dog images), clustering algorithms do not have pre-labeled "correct answers" (i.e., labels). Its task is to explore the intrinsic structure of data, automatically grouping similar data points into the same group (called a "cluster"), while making data points in different groups as dissimilar as possible.
**A simple analogy**: Imagine you have a basket of mixed fruits containing apples, oranges, and bananas. The task of the clustering algorithm is to automatically group fruits with similar shapes, colors, and sizes into separate piles without anyone telling you the category names.

### How the K-Means Algorithm Works

K-Means is one of the most commonly used and intuitive clustering algorithms. "K" represents the number of clusters we want to divide the data into. Its working principle can be summarized in four steps:

1. **Initialization**: Randomly select K data points as initial "cluster centers" (centroids).
2. **Assignment**: Calculate the distance from each data point to each centroid (usually using Euclidean distance), then assign each point to the cluster of its nearest centroid.
3. **Update**: Recalculate the centroid of each cluster (i.e., the average of all points in that cluster).
4. **Iteration**: Repeat steps 2 and 3 until the centroid positions no longer change significantly, or the preset number of iterations is reached.

The flowchart below illustrates this process:

[Figure: flowchart of the K-Means process]

**Key Points of the Algorithm**:

* **Distance Metric**: Euclidean distance is typically used to measure similarity between data points; the closer the distance, the higher the similarity.
* **Centroid**: Represents the "average point" or center point of a cluster.
* **Objective**: Minimize the sum of squared distances from each data point in a cluster to its centroid (called "within-cluster sum of squares" or Inertia).

* * *

## Practical Exercise: Retail Customer Segmentation

Now, let's put theory into practice. We will use Python and its powerful data science ecosystem to complete this project.
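Before turning to scikit-learn, it may help to see the four K-Means steps from the previous section written out by hand. The following is a minimal NumPy sketch, not the implementation scikit-learn uses; the function name `kmeans_sketch`, its parameters, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def kmeans_sketch(X, k, n_iters=100, tol=1e-6, seed=0):
    """Plain K-Means on an (n_samples, n_features) array (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 1 - Initialization: pick k distinct data points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)

    for _ in range(n_iters):
        # Step 2 - Assignment: Euclidean distance from every point to every centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Step 3 - Update: each centroid becomes the mean of its assigned points.
        # An empty cluster simply keeps its previous centroid.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)

        # Step 4 - Iteration: stop once the centroids have effectively stopped moving.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break

    # Final assignment and inertia (within-cluster sum of squared distances).
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    inertia = float((distances[np.arange(len(X)), labels] ** 2).sum())
    return labels, centroids, inertia


# Toy data: two well-separated groups of 2-D points (made up for illustration).
X_demo = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels_demo, centroids_demo, inertia_demo = kmeans_sketch(X_demo, k=2)
print(labels_demo, round(inertia_demo, 4))
```

On data this cleanly separated, the first three points should end up in one cluster and the last three in the other. Which cluster receives label 0 depends on the random initialization, which is also why cluster numbers carry no meaning by themselves.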
### Step 1: Environment Preparation and Data Loading

First, ensure that the necessary libraries are installed in your Python environment: `pandas` for data processing, `numpy` for numerical computation, `matplotlib` and `seaborn` for visualization, and `scikit-learn` as the core machine learning library.

**Example**

```python
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import warnings

warnings.filterwarnings('ignore')  # Ignore non-critical warnings

# Set visualization style
sns.set_style("whitegrid")

# -------------------------- Set Chinese font start --------------------------
plt.rcParams['font.sans-serif'] = [
    # Windows priority
    'SimHei', 'Microsoft YaHei',
    # macOS priority
    'PingFang SC', 'Heiti TC',
    # Linux priority
    'WenQuanYi Micro Hei', 'DejaVu Sans'
]
# Fix negative sign display issue
plt.rcParams['axes.unicode_minus'] = False
# -------------------------- Set Chinese font end --------------------------
```

We will use a simulated customer dataset `customer_data.csv`, which typically contains the following features:

* `CustomerID`: Unique customer identifier
* `Annual_Income_(k$)`: Customer's annual income (in thousands of dollars)
* `Spending_Score`: Spending score (0-100, derived from purchase frequency, amount, etc.)
* `Age`: Age

The content is as follows:

    CustomerID,Age,Annual_Income_(k$),Spending_Score
    1,19,15,39
    2,21,15,81
    3,20,16,6
    4,23,16,77
    5,31,17,40
    6,22,17,76
    7,35,18,6
    8,23,18,94
    9,64,19,3
    10,30,19,72
    11,67,20,14
    12,35,20,99
    13,58,21,15
    14,24,21,77
    15,37,22,13
    16,22,22,79
    17,35,23,35
    18,20,23,66
    19,52,24,29
    20,35,24,98
    21,46,25,35
    22,25,25,73
    23,54,26,5
    24,28,26,73
    25,45,27,28
    26,23,28,82
    27,40,28,36
    28,35,28,61
    29,60,29,4
    30,21,30,87
    31,62,30,17
    32,23,30,73
    33,18,31,92
    34,49,33,14
    35,21,33,81
    36,42,34,17
    37,30,34,73
    38,36,37,26
    39,20,37,75
    40,65,38,35
    41,24,38,92
    42,48,39,36
    43,31,39,61
    44,49,40,29
    45,24,40,98
    46,50,41,15
    47,27,42,65
    48,29,43,88
    49,31,43,19
    50,49,44,75

**Example**

```python
# Load data
df = pd.read_csv('customer_data.csv')

print("Data shape (rows, columns):", df.shape)
print("\nFirst 5 rows of data:")
print(df.head())
print("\nBasic data information:")
print(df.info())  # info() prints its report and returns None, hence the trailing "None"
print("\nDescriptive statistics:")
print(df.describe())
```

Output:

    Data shape (rows, columns): (50, 4)

    First 5 rows of data:
       CustomerID  Age  Annual_Income_(k$)  Spending_Score
    0           1   19                  15              39
    1           2   21                  15              81
    2           3   20                  16               6
    3           4   23                  16              77
    4           5   31                  17              40

    Basic data information:
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 50 entries, 0 to 49
    Data columns (total 4 columns):
     #   Column              Non-Null Count  Dtype
    ---  ------              --------------  -----
     0   CustomerID          50 non-null     int64
     1   Age                 50 non-null     int64
     2   Annual_Income_(k$)  50 non-null     int64
     3   Spending_Score      50 non-null     int64
    dtypes: int64(4)
    memory usage: 1.7 KB
    None

    Descriptive statistics:
           CustomerID        Age  Annual_Income_(k$)  Spending_Score
    count    50.00000  50.000000           50.000000       50.000000
    mean     25.50000  35.560000           28.160000       51.680000
    std      14.57738  14.283085            8.739682       31.506682
    min       1.00000  18.000000           15.000000        3.000000
    25%      13.25000  23.000000           21.000000       20.750000
    50%      25.50000  31.000000           27.500000       61.000000
    75%      37.75000  47.500000           36.250000       77.000000
    max      50.00000  67.000000           44.000000       99.000000

### Step 2: Data Exploration and Preprocessing

Before applying the algorithm, we must first understand the data and do some "cleaning" work.

**1. Exploratory Data Analysis**

Discover patterns through visualization and statistics.

**Example**

```python
# Visualize feature distributions (one histogram per subplot)
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

sns.histplot(df['Age'], bins=30, kde=True, ax=axes[0])
axes[0].set_title('Age Distribution')

sns.histplot(df['Annual_Income_(k$)'], bins=30, kde=True, ax=axes[1])
axes[1].set_title('Annual Income Distribution')

sns.histplot(df['Spending_Score'], bins=30, kde=True, ax=axes[2])
axes[2].set_title('Spending Score Distribution')

plt.tight_layout()
plt.show()

# Examine relationships between features
sns.pairplot(df[['Age', 'Annual_Income_(k$)', 'Spending_Score']])
plt.suptitle('Feature Relationship Scatter Plot Matrix', y=1.02)
plt.show()
```

**2. Data Preprocessing**

Clustering algorithms are very sensitive to the scale (units) of features. K-Means measures similarity with Euclidean distance, so a feature with a larger numerical spread contributes more to every distance. In this dataset, for example, `Spending_Score` has a standard deviation of about 31.5 while `Annual_Income_(k$)` has about 8.7 (see the descriptive statistics above), and if income were recorded in raw dollars (tens of thousands) rather than in thousands, it would dominate the clustering results instead. Therefore, we need to perform **feature standardization**, rescaling each feature to mean 0 and variance 1.

**Example**

```python
# Select features for clustering (CustomerID is an identifier, not a feature)
features = ['Age', 'Annual_Income_(k$)', 'Spending_Score']
X = df[features].copy()

# Feature standardization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # fit calculates mean and variance, transform applies transformation
X_scaled_df = pd.DataFrame(X_scaled, columns=features)

print("First 5 rows of standardized data:")
print(X_scaled_df.head())
```

### Step 3: Determining the Optimal Number of Clusters (K Value)

K-Means requires us to specify the K value in advance. How do we choose a reasonable K? We use two classic methods:

**1. Elbow Method**

Plot the within-cluster sum of squares (Inertia) corresponding to different K values.
Inertia will decrease as K increases; we look for the inflection point of the curve (like an elbow), after which the benefit (Inertia decrease) from increasing K becomes smaller.

**Example**

```python
inertia = []
K_range = range(1, 11)  # Test K from 1 to 10

for k in K_range:
    # n_init='auto' requires scikit-learn 1.2 or newer
    kmeans = KMeans(n_clusters=k, random_state=42, n_init='auto')
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)  # Get Inertia for this K value

# Plot elbow method graph
plt.figure(figsize=(8, 5))
plt.plot(K_range, inertia, 'bo-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Within-Cluster Sum of Squares (Inertia)')
plt.title('Elbow Method: Selecting Optimal K Value')
plt.xticks(K_range)
plt.show()
```

**2. Silhouette Coefficient Method**

The silhouette coefficient measures how similar a data point is to its own cluster (cohesion) compared to other clusters (separation). Its value ranges from -1 to 1, and **higher is better**, indicating better clustering.

**Example**

```python
silhouette_scores = []
K_range = range(2, 11)  # Silhouette coefficient requires at least 2 clusters

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init='auto')
    cluster_labels = kmeans.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, cluster_labels)
    silhouette_scores.append(score)

# Plot silhouette coefficient graph
plt.figure(figsize=(8, 5))
plt.plot(K_range, silhouette_scores, 'go-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Coefficient')
plt.title('Silhouette Coefficient Method: Selecting Optimal K Value')
plt.xticks(K_range)
plt.show()
```

Combining the elbow method graph (inflection point) and the silhouette coefficient graph (peak), we assume that **K=5** is a good choice.

### Step 4: Applying K-Means for Clustering

Use the selected K value to train the model and assign cluster labels to each customer.
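To see this step in isolation, here is a minimal, self-contained sketch of the fit-and-assign pattern. It is an illustration under stated assumptions rather than the project code: it uses only eight rows taken from the dataset, two clusters instead of five, a fixed `n_init=10` so it runs on older scikit-learn versions, and a made-up `new_customer` record.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Eight rows from the dataset: [Age, Annual_Income_(k$), Spending_Score]
X_small = np.array([
    [19, 15, 39], [21, 15, 81], [20, 16, 6], [23, 16, 77],
    [64, 19, 3], [67, 20, 14], [58, 21, 15], [60, 29, 4],
], dtype=float)

# Standardize, then fit K-Means on the standardized values.
scaler_small = StandardScaler()
X_small_scaled = scaler_small.fit_transform(X_small)

# k=2 is an arbitrary choice for this tiny sample, not the K chosen in the article.
km_small = KMeans(n_clusters=2, random_state=42, n_init=10)
labels_small = km_small.fit_predict(X_small_scaled)

# Centroids live in standardized space; map them back to the original units.
centers_original = scaler_small.inverse_transform(km_small.cluster_centers_)

# A hypothetical new customer must go through the SAME fitted scaler before predict().
new_customer = np.array([[25, 18, 80]], dtype=float)
new_label = km_small.predict(scaler_small.transform(new_customer))[0]

print("Labels:", labels_small)
print("Centroids (original units):\n", centers_original.round(1))
print("New customer assigned to cluster:", new_label)
```

The point of the sketch is the order of operations: fit the scaler and the model once, keep both objects, and reuse them whenever a customer is scored later. In the full workflow the same pattern applies with `X_scaled` and the chosen K.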
**Example**

```python
# Train final model with K=5
final_k = 5
kmeans_final = KMeans(n_clusters=final_k, random_state=42, n_init='auto')
df['Cluster'] = kmeans_final.fit_predict(X_scaled)  # Add cluster labels to original dataframe

# View number of customers in each cluster
cluster_counts = df['Cluster'].value_counts().sort_index()
print("Customer count distribution by cluster:")
print(cluster_counts)

# View feature means for each cluster (original scale, excluding CustomerID)
cluster_profile = df.groupby('Cluster')[features].mean().round(2)
print("\nFeature averages by cluster:")
print(cluster_profile)
```

### Step 5: Result Analysis and Visualization

Transform abstract cluster labels into intuitive insights.

**1. Visualize Clustering Results**

Since we have three features, we can select the two most important ones for visualization on a 2D plane (e.g., income and spending score).

**Example**

```python
# Select two features for 2D visualization
plt.figure(figsize=(10, 6))
scatter = plt.scatter(df['Annual_Income_(k$)'], df['Spending_Score'],
                      c=df['Cluster'], cmap='viridis', s=50, alpha=0.7)
plt.colorbar(scatter, label='Cluster Label')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score')
plt.title('Customer Segmentation Results (Based on Annual Income and Spending Score)')
plt.show()
```

**2. Characterize Customer Profiles**

Based on the `cluster_profile` table, we can assign business meaning to each cluster. Cluster numbers are arbitrary and depend on the run, so match each row below to the corresponding row of your own `cluster_profile` output:

| Cluster Label | Age | Annual Income | Spending Score | Possible Customer Profile |
| --- | --- | --- | --- | --- |
| **0** | Medium | **High** | **Low** | **High-Income Cautious Type**: High income but conservative spending, possibly savers or price-sensitive high-net-worth individuals. |
| **1** | Medium | **Low** | **Low** | **Low-Income Low-Spending Type**: Limited income and spending power, need high-value products. |
| **2** | Medium | **Low** | **High** | **Value-Seeking Type**: Not high income but loves spending, focuses on trends and experiences, target for promotional activities. |
| **3** | Medium | **High** | **High** | **Ideal VIP Type**: High income and high spending, core profit source for the business, should receive top-tier service and exclusive benefits. |
| **4** | **Young** | Medium | Medium | **Young Potential Type**: Young customers with growing income and spending, key to cultivating brand loyalty. |

### Step 6: Model Evaluation and Application Recommendations

**Evaluation**: Besides the silhouette coefficient, check whether the sample distribution within clusters is balanced, and judge whether the segmentation is reasonable based on business logic.

**Application Recommendations**:

* **Precision Marketing**: Push high-end new products and exclusive events to "Ideal VIP Type" (Cluster 3); send discount coupons and group-buying information to "Value-Seeking Type" (Cluster 2).
* **Product Development**: Design fashionable, socially-oriented products for "Young Potential Type" (Cluster 4).
* **Customer Service**: