Pandas Duplicate
Duplicate data is a common issue in data analysis that can affect the accuracy of statistical results. Pandas provides complete functionality for detecting and handling duplicate data.
* * *
## Detecting Duplicate Data
### duplicated Method
## Example
import pandas as pd
# Create a DataFrame with duplicate rows
df = pd.DataFrame({
"Name": ["Zhang San","Li Si","Wang Wu","Zhang San","Li Si","Zhao Liu"],
"Age": [25,30,28,25,30,35],
"City": ["Beijing","Shanghai","Guangzhou","Beijing","Shenzhen","Beijing"]
})
print("Original Data:")
print(df)
print()
# Detect duplicate rows (default marks from first row, keep='first')
print("Detecting duplicate rows:")
print(df.duplicated())
print()
# Count duplicate rows
print(f"Number of duplicate rows: {df.duplicated().sum()}")
print()
# Detect duplicates in specific columns
print("Detecting duplicates by 'Name' column:")
print(df.duplicated(subset=))
### keep Parameter
## Example
import pandas as pd
df = pd.DataFrame({
"Name": ["Zhang San","Li Si","Zhang San","Wang Wu","Li Si"],
"Age": [25,30,25,28,30]
})
print("Original Data:")
print(df)
print()
# keep='first': keep the first occurrence (default)
print("keep='first' (default):")
print(df.duplicated(keep="first"))
print()
# keep='last': keep the last occurrence
print("keep='last':")
print(df.duplicated(keep="last"))
print()
# keep=False: mark all duplicates (don't keep any)
print("keep=False (mark all duplicates):")
print(df.duplicated(keep=False))
* * *
## Removing Duplicate Data
### drop_duplicates Basic Usage
## Example
import pandas as pd
df = pd.DataFrame({
"Name": ["Zhang San","Li Si","Zhang San","Wang Wu","Li Si"],
"Age": [25,30,25,28,30],
"City": ["Beijing","Shanghai","Beijing","Guangzhou","Shenzhen"]
})
print("Original Data:")
print(df)
print()
# Remove duplicate rows (default keeps first)
print("Removing duplicate rows (keeping first):")
print(df.drop_duplicates())
print()
# Keep last
print("Removing duplicate rows (keeping last):")
print(df.drop_duplicates(keep="last"))
print()
# Don't keep any duplicates
print("Removing all duplicate rows:")
print(df.drop_duplicates(keep=False))
### Remove Duplicates by Column
## Example
import pandas as pd
df = pd.DataFrame({
"Name": ["Zhang San","Li Si","Zhang San","Wang Wu"],
"Age": [25,30,35,28],
"City": ["Beijing","Shanghai","Beijing","Guangzhou"]
})
print("Original Data:")
print(df)
print()
# Remove duplicates by single column
print("Remove duplicates by 'Name' column:")
print(df.drop_duplicates(subset=))
print()
# Remove duplicates by combination of multiple columns
print("Remove duplicates by 'Name' + 'City' combination:")
print(df.drop_duplicates(subset=["Name","City"]))
print()
# Keep the row with maximum value in a specific column
print("Group by 'Name', keep the one with highest age:")
print(df.sort_values("Age", ascending=False).drop_duplicates(subset=, keep="first"))
* * *
## Practical Application: Data Cleaning
### Comprehensive Case Study
## Example
import pandas as pd
import numpy as np
# Create simulated order data (with duplicates)
orders = pd.DataFrame({
"Order ID": ["O001","O002","O001","O003","O002","O004"],
"Customer ID": ["C001","C002","C001","C003","C002","C004"],
"Product": ["iPhone","MacBook","iPhone","iPad","AirPods","Apple Watch"],
"Amount": [8000,12000,8000,5000,1500,3000],
"Date": ["2024-01-01","2024-01-02","2024-01-01","2024-01-03","2024-01-02","2024-01-04"]
})
print("Original Order Data:")
print(orders)
print()
# 1. Detect duplicate orders
print("Duplicate Order Detection:")
duplicates = orders[orders.duplicated(subset=, keep=False)]
print(duplicates)
print()
# 2. Remove completely duplicated orders
orders_clean = orders.drop_duplicates()
print("After removing completely duplicated entries:")
print(f"Original {len(orders)} rows -> {len(orders_clean)} rows")
print()
# 3. Remove duplicates by order ID, keeping the latest
# Assume the last entry is the latest
orders_clean = orders.sort_values("Date").drop_duplicates(
subset=,
keep="last"
).sort_index()
print("Remove duplicates by order ID (keeping latest):")
print(orders_clean)
* * *
## Advanced Usage
### Keep Specific Rows Per Group
## Example
import pandas as pd
df = pd.DataFrame({
"Customer": ["A","A","A","B","B","B"],
"Month": ["01","02","03","01","02","03"],
"Spending": [100,200,150,300,250,280]
})
print("Original Data:")
print(df)
print()
# Keep the month with highest spending for each customer
print("Month with highest spending per customer:")
result = df.sort_values("Spending", ascending=False).drop_duplicates(
subset=,
keep="first"
).sort_values("Customer")
print(result)
print()
# Keep the first record for each customer (by month)
print("First record for each customer:")
result = df.sort_values("Month").drop_duplicates(
subset=,
keep="first"
)
print(result)
### Mark Duplicates Instead of Removing
## Example
import pandas as pd
df = pd.DataFrame({
"Name": ["Zhang San","Li Si","Zhang San","Wang Wu","Li Si"],
"City": ["Beijing","Shanghai","Beijing","Guangzhou","Shenzhen"]
})
# Add a duplicate marking column
df= df.duplicated(subset=, keep=False)
print("Marking duplicate rows:")
print(df)
print()
# Or use groupby to mark
df= df.groupby("Name").transform("count")
df=~df.duplicated(subset=, keep="first")
print("Grouped statistics on duplicate counts:")
print(df)
* * *
## Important Notes
**1. Distinguish Between "Fully Duplicate" and "Partially Duplicate"**
Fully duplicate means all column values are identical; partially duplicate means specific columns are the same. Use the `subset` parameter to specify which columns to check for duplicates.
**2. Choose Retention Strategy**
Choose to keep the first, last, or none based on business needs. Which row to keep may impact subsequent analysis results.
**3. Impact of Data Types**
When checking for duplicates, data types are considered. For example, integer 1 and float 1.0 are treated as different values.
> Before removing duplicate data, it's recommended to analyze the cause and distribution of duplicates to ensure that the removal operation does not lose important information.
YouTip