Sklearn Iris Dataset
The Iris dataset is one of the classic beginner datasets in machine learning.
It covers three species of iris (Setosa, Versicolor, Virginica), with 4 features recorded for each flower: sepal length, sepal width, petal length, and petal width.
The task is to predict the species of an iris flower from these features.
This chapter's case study covers data loading, visualization, feature selection, data preprocessing, building classification models, and model evaluation and optimization.
* * *
## 1. Data Loading and Visualization
### Data Loading
First, load the Iris dataset. scikit-learn bundles it and exposes it through the load_iris function.
## Example
from sklearn.datasets import load_iris
import pandas as pd
# Load the Iris dataset
data = load_iris()
# Convert to DataFrame for easy viewing
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
# Map each integer label (0, 1, 2) to its species name
df['species'] = df['target'].apply(lambda x: data.target_names[x])
# View the first few rows of data
print(df.head())
Output:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target species
0                5.1               3.5                1.4               0.2       0  setosa
1                4.9               3.0                1.4               0.2       0  setosa
2                4.7               3.2                1.3               0.2       0  setosa
3                4.6               3.1                1.5               0.2       0  setosa
4                5.0               3.6                1.4               0.2       0  setosa
At this point, the data has been successfully loaded, and we can see the features of each data point and the corresponding flower species.
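Before moving on, it can help to confirm the size and class balance of the loaded data. The following is a minimal sketch rather than part of the original walkthrough: it assumes scikit-learn 0.23 or later, where `load_iris(as_frame=True)` returns pandas objects directly, and the variable names are illustrative.

```python
from sklearn.datasets import load_iris

# as_frame=True (assumes scikit-learn >= 0.23) returns pandas objects
iris = load_iris(as_frame=True)
df = iris.frame  # four feature columns plus an integer 'target' column

# Map integer labels to species names via a {0: 'setosa', ...} lookup
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

print(df.shape)                      # (rows, columns)
print(df["species"].value_counts())  # samples per species
print(df.isna().sum().sum())         # total number of missing values
```

The dataset has 150 rows with 50 samples per species and no missing values, so no imbalance handling or imputation is needed here.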
### Data Visualization
To better understand the data, we can plot the relationships between different features using the matplotlib and seaborn libraries.
## Example
from sklearn.datasets import load_iris
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load the Iris dataset
data = load_iris()
# Convert to DataFrame for easy viewing
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
# Map each integer label (0, 1, 2) to its species name
df['species'] = df['target'].apply(lambda x: data.target_names[x])
# Plot the relationships between features
sns.pairplot(df, hue="species")
plt.show()
The pairplot will draw a scatter plot matrix between features, using different colors to identify different iris species. This helps us understand the distribution of each feature and the relationships between them.
[Figure: pair plot of the four Iris features, colored by species]
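The pair plot gives a visual impression of how the species differ; the same differences can also be summarized numerically. The sketch below is one possible complement, assuming a simple per-species mean computed with pandas `groupby`; it is not part of the original example.

```python
from sklearn.datasets import load_iris
import pandas as pd

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Index the name array with the label array to get one name per row
df["species"] = data.target_names[data.target]

# Mean of every feature within each species
means = df.groupby("species").mean()
print(means.round(2))
```

Features whose per-species means lie far apart, relative to their spread in the pair plot, tend to be the more useful ones for telling the species apart.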
### Heatmap Visualization of Feature Correlations
A heatmap shows the pairwise correlations between features at a glance. Strongly correlated features carry overlapping information, which is useful to know when selecting features for a model.
## Example
from sklearn.datasets import load_iris
import pandas as pd
import seaborn as sns
import matplotlib
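The heatmap described in this section can be sketched as follows. This is a minimal sketch under stated assumptions: it uses the Pearson correlation from pandas `DataFrame.corr()` and draws it with `seaborn.heatmap`. The helper name `plot_feature_correlations`, the color map, and the figure size are illustrative choices, not taken from the original.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris


def plot_feature_correlations(features: pd.DataFrame):
    """Draw a heatmap of pairwise correlations (hypothetical helper)."""
    corr = features.corr()  # Pearson correlation by default
    fig, ax = plt.subplots(figsize=(6, 5))
    # annot=True writes each coefficient into its cell;
    # vmin/vmax pin the color scale to the full [-1, 1] range
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
                vmin=-1, vmax=1, ax=ax)
    ax.set_title("Iris feature correlations")
    fig.tight_layout()
    return corr, ax


data = load_iris()
features = pd.DataFrame(data.data, columns=data.feature_names)
corr, ax = plot_feature_correlations(features)
print(corr.round(2))
```

Call `plt.show()` afterwards to display the figure in an interactive session. Only the four numeric feature columns are passed in, so the label column does not appear in the correlation matrix.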