## Data Visualization
Before building a complex machine learning model, the first thing we need to do is not to choose an algorithm, but to **understand the data**.
If we compare machine learning to cooking, then data is the ingredients.
An excellent chef must understand the characteristics of the ingredients: whether they are fresh or spoiled, sweet or sour, suitable for stewing or for quick frying.
Data visualization is the **magnifying glass** and the **taste buds** we use to inspect and sample those ingredients.
Data visualization transforms boring numbers into intuitive images through visual elements such as charts and graphics, helping us to:
* **Discover patterns and trends in the data** (e.g., does sales volume change with seasons?)
* **Identify outliers and erroneous data** (e.g., records with an age of 300)
* **Understand the relationships between features (variables)** (e.g., is there a positive correlation between house area and price?)
* **Verify hypotheses** and provide a basis for subsequent feature engineering and model selection.
This article uses `pandas`, one of Python's most widely used data analysis libraries, together with the visualization libraries `matplotlib` and `seaborn`, to walk through the core skills of data visualization.
* * *
## Preparation: Environment and Data
Before we start drawing charts, we need to prepare the "canvas" and "paints".
### Install Required Libraries
If you are using Anaconda, these libraries are usually pre-installed. Otherwise, you can install them using the following command:
pip install pandas matplotlib seaborn
### Import Libraries and Load Data
We will use a classic public dataset: the Titanic passenger dataset. It contains information about passenger survival, class, age, gender, etc.
## Example
# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Set chart style to make charts look better
sns.set_style("whitegrid")
# -------------------------- Set Chinese font start --------------------------
plt.rcParams['font.sans-serif']=[
# Windows priority
'SimHei','Microsoft YaHei',
# macOS priority
'PingFang SC','Heiti TC',
# Linux priority
'WenQuanYi Micro Hei','DejaVu Sans'
]
# Fix the issue of negative signs displaying as squares
plt.rcParams['axes.unicode_minus']=False
# -------------------------- Set Chinese font end --------------------------
# Load data
# Here we load directly from seaborn's built-in dataset
df = sns.load_dataset('titanic')
# View the first few rows and basic information of the data
print("Data shape (rows, columns):", df.shape)
print("\nFirst 5 rows of data:")
print(df.head())
print("\nBasic data information (types, non-null counts, etc.):")
df.info()  # info() prints its report directly and returns None, so no print() is needed
When you run the above code, you will see the data has 891 rows (passengers) and 15 columns (features). `df.head()` gives you a preliminary impression of what the data looks like.
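The non-null counts reported by `df.info()` can also be turned into an explicit missing-value table, which is often easier to read. The following is a minimal sketch, assuming a small hand-made `sample` DataFrame (hypothetical values, not the real Titanic data) and a helper named `missing_summary` that is our own illustration, not a pandas function.

```python
import pandas as pd

# Hypothetical mini-sample standing in for the Titanic DataFrame
sample = pd.DataFrame({
    "age": [22.0, None, 26.0, 35.0, None],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05],
    "deck": [None, "C", None, "C", None],
})


def missing_summary(frame):
    """Return the count and share of missing values per column, worst first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "share": counts / len(frame),
    })
    return summary.sort_values("missing", ascending=False)


report = missing_summary(sample)
print(report)
```

The same helper should work on the full dataset by calling `missing_summary(df)`; columns with a large share of missing values are the ones to handle before plotting or modeling.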
* * *
## Univariate Analysis: Understanding the Distribution of a Single Feature
Univariate analysis focuses on the distribution of **one** feature (variable). This is the most basic analysis.
### 1. Numerical Features: Histogram and Box Plot
For continuous numerical features like `age` and `fare`, we commonly use **histograms** and **box plots**.
A **histogram** shows the frequency distribution of data across different intervals ("bins").
## Example
# Draw histogram for age
plt.figure(figsize=(10, 6))  # Set chart size
plt.hist(df['age'].dropna(), bins=30, edgecolor='black', alpha=0.7)  # dropna() removes missing values
plt.title('Passenger Age Distribution Histogram')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
**Interpretation**: This chart can tell us which age range the passengers are mainly concentrated in (e.g., 20-30 years old), whether the distribution is symmetric, whether there are outliers, etc.
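What we read off the histogram by eye can also be checked numerically. The sketch below assumes a short, hypothetical `ages` series (not the real data) and a helper named `busiest_bin`, which is our own name. It uses `numpy.histogram` to find the fullest bin and the pandas `skew()` method as a rough measure of asymmetry.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value
ages = pd.Series([2, 18, 21, 22, 24, 25, 27, 28, 29, 35, 40, 54, 71, None])


def busiest_bin(values, bins):
    """Return (left edge, right edge, count) of the fullest histogram bin."""
    clean = values.dropna()
    counts, edges = np.histogram(clean, bins=bins)
    i = int(np.argmax(counts))
    return float(edges[i]), float(edges[i + 1]), int(counts[i])


low, high, n = busiest_bin(ages, bins=[0, 10, 20, 30, 40, 50, 60, 70, 80])
skewness = ages.skew()  # NaN values are skipped by default

print(f"Most passengers fall in [{low:.0f}, {high:.0f}): {n}")
print(f"Skewness: {skewness:.2f} (positive means a longer right tail)")
```

A skewness near zero suggests a roughly symmetric distribution, while a clearly positive or negative value points to a long tail on one side.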
A **box plot** can clearly show the **median, quartiles, and outliers** of the data.
## Example
# Draw box plot for fare
plt.figure(figsize=(8,5))
plt.boxplot(df['fare'].dropna())
plt.title('Fare Box Plot')
plt.ylabel('Fare')
plt.show()
**Interpretation**: The line in the middle of the box is the median. The lower and upper edges of the box are the lower quartile (Q1) and the upper quartile (Q3). By default, each whisker extends to the farthest data point that lies within 1.5 times the interquartile range (IQR = Q3 - Q1) of the box edge, and points beyond the whiskers are drawn as **outliers** (the circles in the upper part of the figure). This chart immediately shows that the fare data contains many extremely high values.
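The 1.5 × IQR rule described above can be reproduced by hand to list the exact outlier values. This is a minimal sketch, assuming a hypothetical `fares` series and a helper named `iqr_outliers` (our own name). The multiplier `k=1.5` matches matplotlib's default whisker setting.

```python
import pandas as pd

# Hypothetical fares, not the real Titanic column
fares = pd.Series([7.25, 7.90, 8.05, 13.00, 14.45, 26.00, 31.00, 52.00, 263.00, 512.33])


def iqr_outliers(values, k=1.5):
    """Return (lower fence, upper fence, outliers) using the k * IQR rule."""
    clean = values.dropna()
    q1 = clean.quantile(0.25)
    q3 = clean.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    outliers = clean[(clean < lower) | (clean > upper)]
    return lower, upper, outliers


lower, upper, outliers = iqr_outliers(fares)
print(f"Fences: [{lower:.2f}, {upper:.2f}]")
print("Outliers:", outliers.tolist())
```

Calling `iqr_outliers(df['fare'])` on the real data would return the points that the box plot draws as circles.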
### 2. Categorical Features: Bar Chart
For categorical features like `sex` (gender), `embarked` (port of embarkation), and `survived` (whether survived), we use **bar charts** to count the number of each category.
## Example
# Draw bar chart for passenger gender
gender_counts = df['sex'].value_counts()  # number of passengers per gender
plt.figure(figsize=(8, 5))
plt.bar(gender_counts.index, gender_counts.values, color=['lightblue', 'lightcoral'])
plt.title('Passenger Gender Distribution')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()
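Raw counts are easier to compare across datasets when they are paired with proportions. The sketch below assumes a tiny hypothetical `sexes` series. It relies on `value_counts(normalize=True)`, which returns shares instead of counts and, like the plain call, skips missing values by default.

```python
import pandas as pd

# Hypothetical gender column with one missing entry
sexes = pd.Series(["male", "female", "male", "male", "female", "male", None])

counts = sexes.value_counts()                # absolute frequencies
shares = sexes.value_counts(normalize=True)  # relative frequencies, summing to 1

summary = pd.DataFrame({"count": counts, "share": shares.round(3)})
print(summary)
```

The `share` column can be passed to `plt.bar` in place of the counts to plot proportions rather than absolute numbers.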
* * *
## Bivariate Analysis: Exploring Relationships Between Features
Bivariate analysis explores the relationship between **two** features.
### 1. Numerical vs. Numerical: Scatter Plot
Scatter plots are a powerful tool for studying the correlation between two continuous variables.
## Example
# Draw scatter plot of age vs. fare
plt.figure(figsize=(10,6))
plt.scatter(df['age'], df['fare'], alpha=0.5)  # alpha sets transparency so dense regions appear darker
plt.title('Age vs. Fare Scatter Plot')
plt.xlabel('Age')
plt.ylabel('Fare')
plt.show()
**Interpretation**: The pattern formed by the points hints at how the two variables are related. For example, if the points fall roughly along a diagonal line, that suggests a linear correlation between them. In this chart there is no obvious linear relationship between age and fare, but it confirms once again that a few passengers paid very high fares (outliers).
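A visual impression of correlation can be backed up with a correlation coefficient. The following sketch assumes a small hypothetical `pairs` DataFrame. It computes the Pearson coefficient with `Series.corr` and a Spearman-style rank correlation by applying Pearson to the ranks, which is less sensitive to extreme fares.

```python
import pandas as pd

# Hypothetical age/fare pairs, with one missing age
pairs = pd.DataFrame({
    "age": [22, 38, 26, 35, None, 54, 2, 27],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.46, 51.86, 21.08, 11.13],
})

# Keep only rows where both values are present
complete = pairs.dropna(subset=["age", "fare"])

# Pearson: strength of the linear relationship
pearson = complete["age"].corr(complete["fare"])

# Spearman: Pearson correlation of the ranks (monotonic relationship)
spearman = complete["age"].rank().corr(complete["fare"].rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Values close to 0 indicate a weak relationship, while values near 1 or -1 indicate a strong positive or negative one. A large gap between the two coefficients often means that outliers are influencing the Pearson value.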
### 2. Categorical vs. Numerical: Grouped Box Plot or Violin Plot
We often want to know how the distribution of a numerical feature differs under different categories. For example: "What is the difference in fare distribution among passengers of different classes?"
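Before plotting, the same question can be answered with a grouped summary table. This is a minimal sketch, assuming a hypothetical `tickets` DataFrame with the column names `pclass` and `fare` as they appear in the seaborn Titanic dataset. The values are made up for illustration.

```python
import pandas as pd

# Hypothetical class/fare records
tickets = pd.DataFrame({
    "pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "fare": [71.28, 53.10, 263.00, 13.00, 26.00, 7.25, 7.92, 8.05],
})

# One row per class with the key statistics a box plot would display
by_class = tickets.groupby("pclass")["fare"].agg(["count", "median", "min", "max"])
print(by_class)
```

The medians in this table correspond to the center lines a grouped box plot would draw for each class.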
## Example
# Use seaborn to draw grouped box