Pandas Df Drop Duplicates

Pandas df.drop_duplicates() Function |\n\n

Image 1: Pandas Common Functions Pandas General Functions

\n\n

df.drop_duplicates() is a function in Pandas used to remove duplicate rows.

\n\n

In the process of data collection and integration, duplicate data is a common issue. drop_duplicates() can help you identify and remove duplicate rows based on specified columns or all columns, keeping either the first or last occurrence. This is very useful for data deduplication and ensuring data uniqueness.

\n\n

Basic Syntax and Parameters

\n\n

drop_duplicates() is a member function of DataFrame, called using the dot operator ..

\n\n

Syntax Format

\n\n

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)\n

\n\n

Parameter Description

\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n

Parameter	Type	Required	Description	Default Value
subset	column label or list	Optional	Specifies columns to use for determining duplicates. If `None`, all columns are used. Can be a single column name or a list of column names.	None
keep	str or False	Optional	Specifies which duplicate record to keep. `'first'` keeps the first; `'last'` keeps the last; `False` removes all duplicates.	'first'
inplace	bool	Optional	If `True`, modifies the original DataFrame directly without returning a new object; if `False`, returns a new DataFrame with the original data unchanged.	False
ignore_index	bool	Optional	If `True`, resets the index of the result starting from 0; if `False`, retains the original index.	False

\n\n

Return Value Description

\n\n

Returns a new DataFrame (if inplace=False) or None (if inplace=True).
The returned DataFrame contains no duplicate rows.

\n\n

Examples

\n\n

Let's go through a series of examples to fully master the usage of drop_duplicates().

\n\n

Example 1: Remove Completely Duplicate Rows

\n\n

By default, it determines duplicates based on all columns.

\n\n

Example

\n\n

import pandas as pd\n\nimport numpy as np\n\n# Create a DataFrame with duplicate rows\n\ndata = {\n\n'Name': ['Zhang San','Li Si','Zhang San','Wang Wu','Li Si'],\n\n'Age': [25,30,25,35,30],\n\n'Department': ['Tech','Marketing','Tech','Tech','Marketing']\n\n}\n\ndf = pd.DataFrame(data)\n\nprint("Original Data:")\n\nprint(df)\n\nprint("=" * 50)\n\n# Remove duplicate rows, keeping the first occurrence\n\ndf_cleaned = df.drop_duplicates()\n\nprint("Data after removing duplicates:")\n\nprint(df_cleaned)\n

\n\n

Expected Output:

\n\n

Original Data:\n   Name  Age Department\n0  Zhang San   25       Tech\n1     Li Si   30  Marketing\n2  Zhang San   25       Tech\n3    Wang Wu   35       Tech\n4     Li Si   30  Marketing\n==================================================\nData after removing duplicates:\n   Name  Age Department\n0  Zhang San   25       Tech\n1     Li Si   30  Marketing\n3    Wang Wu   35       Tech\n

\n\n

Code Explanation:

\n\n

The 0th and 2nd rows are identical (Zhang San, 25 years old, Tech department).
The 1st and 4th rows are also identical (Li Si, 30 years old, Marketing department).
Using drop_duplicates() with default parameters removes duplicates, keeping the first occurrence.

\n\n

Example 2: Remove Duplicates Based on Specified Columns

\n\n

You can use the subset parameter to determine duplicates based on specific columns only.

\n\n

Example

\n\n

import pandas as pd\n\nimport numpy as np\n\n# Create a DataFrame\n\ndata = {\n\n'Name': ['Zhang San','Li Si','Zhang San','Wang Wu'],\n\n'Age': [25,30,28,35], # Zhang San appears twice but with different ages\n\n'Department': ['Tech','Marketing','Tech','Marketing']\n\n}\n\ndf = pd.DataFrame(data)\n\nprint("Original Data:")\n\nprint(df)\n\nprint("=" * 50)\n\n# Determine duplicates only by "Name" column\n\ndf_cleaned = df.drop_duplicates(subset=['Name'])\n\nprint("Data after removing duplicates by Name column:")\n\nprint(df_cleaned)\n\nprint("=" * 50)\n\n# Determine duplicates by both "Name" and "Department" columns\n\ndf_cleaned2 = df.drop_duplicates(subset=['Name','Department'])\n\nprint("Data after removing duplicates by Name and Department columns:")\n\nprint(df_cleaned2)\n

\n\n

Expected Output:

\n\n

Original Data:\n   Name  Age Department\n0  Zhang San   25       Tech\n1     Li Si   30  Marketing\n2  Zhang San   28       Tech # Although name is same, age is different\n3    Wang Wu   35  Marketing\n==================================================\nData after removing duplicates by Name column:\n   Name  Age Department\n0  Zhang San   25       Tech # Keep the first record of Zhang San\n2  Zhang San   28       Tech\n3    Wang Wu   35  Marketing\n==================================================\nData after removing duplicates by Name and Department columns:\n   Name  Age Department\n0  Zhang San   25       Tech\n1     Li Si   30  Marketing\n3    Wang Wu   35  Marketing\n

\n\n

Code Explanation:

\n\n

When judging by the Name column, rows 0 and 2 both have "Zhang San", so only row 0 is kept.
When judging by both Name and Department columns, rows 0 and 2 have identical Name and Department, so only row 0 is kept.

\n\n

Example 3: Keep the Last Duplicate Record

\n\n

Use the keep='last' parameter to keep the last occurrence of duplicate records.

\n\n

Example

\n\n

import pandas as pd\n\nimport numpy as np\n\n# Create a DataFrame with duplicate rows\n\ndata = {\n\n'Student ID': ['S001','S002','S001','S003','S002'],\n\n'Name': ['Zhang San','Li Si','Zhang San','Wang Wu','Li Si'],\n\n'Score': [85,90,88,92,85] # Same student ID, scores may differ\n\n}\n\ndf = pd.DataFrame(data)\n\nprint("Original Data:")\n\nprint(df)\n\nprint("=" * 50)\n\n# Keep the first record (default)\n\ndf_first = df.drop_duplicates(subset=['Student ID'], keep='first')\n\nprint("Keep the first duplicate record:")\n\nprint(df_first)\n\nprint("=" * 50)\n\n# Keep the last record\n\ndf_last = df.drop_duplicates(subset=['Student ID'], keep='last')\n\nprint("Keep the last duplicate record:")\n\nprint(df_last)\n

\n\n

Expected Output:

\n\n

Original Data:\n  Student ID      Name  Score\n0      S001  Zhang San     85\n1      S002     Li Si     90\n2      S001  Zhang San     88 # Same person, different score\n3      S003    Wang Wu     92\n4      S002     Li Si     85 # Same person, different score\n==================================================\nKeep the first duplicate record:\n  Student ID      Name  Score\n0      S001  Zhang San     85 # Keep the first 85\n1      S002     Li Si     90 # Keep the first 90\n3      S003    Wang Wu     92\n==================================================\nKeep the last duplicate record:\n  Student ID      Name  Score\n2      S001  Zhang San     88 # Keep the last 88\n3      S003    Wang Wu     92\n4      S002     Li Si     85 # Keep the last 85\n

\n\n

Code Explanation:

\n\n

Student ID S001 has two records with scores 85 and 88. Using keep='first' keeps the 85 score, while keep='last' keeps the 88 score.
Student ID S002 also has two records with scores 90 and 85.
Choose to keep the first or last based on business needs.

\n\n

Example 4: Remove All Duplicate Records

\n\n

Using keep=False removes all duplicate records, keeping only completely unique rows.

\n\n

Example

\n\n

import pandas as pd\n\nimport numpy as np\n\n# Create a DataFrame\n\ndata = {\n\n'A': [1,1,2,2,3],\n\n'B': [1,1,2,2,3],\n\n'C': [1,2,3,3,5]\n\n}\n\ndf = pd.DataFrame(data)\n\nprint("Original Data:")\n\nprint(df)\n\nprint("=" * 50)\n\n# Remove all duplicate rows (no duplicates retained)\n\ndf_cleaned = df.drop_duplicates(keep=False)\n\nprint("Data after removing all duplicates:")\n\nprint(df_cleaned)\n

\n\n

Expected Output:

\n\n

Original Data:\n   A  B  C\n0  1  1  1\n1  1  1  2 # Same as row 0 in A and B, duplicate row\n2  2  2  3\n3  2  2  3 # Identical to row 2, duplicate row\n4  3  3  5\n==================================================\nData after removing all duplicates:\n   A  B  C\n4  3  3  5\n

\n\n

Code Explanation:

\n\n

Rows 0 and 1 have identical values in columns A and B, so they are duplicates. With keep=False, both are removed.
Rows 2 and 3 are also identical and both are removed.
Only row 4 is unique and is retained.

\n\n

Example 5: Reset Index

\n\n

After removing duplicate rows, the original index might not be continuous. You can use ignore_index=True to reset the index.

\n\n

Example

\n\n

import pandas as pd\n\nimport numpy as np\n\n# Create a DataFrame with duplicate rows\n\ndata = {\n\n'Name': ['Zhang San','Li Si','Zhang San','Wang Wu'],\n\n'City': ['Beijing','Shanghai','Beijing','Guangzhou']\n\n}\n\ndf = pd.DataFrame(data)\n\nprint("Original Data:")\n\nprint(df)\n\nprint("=" * 50)\n\n# Remove duplicates, do not reset index\n\ndf_cleaned1 = df.drop_duplicates()\n\nprint("Remove duplicates (retain original index):")\n\nprint(df_cleaned1)\n\nprint("=" * 50)\n\n# Remove duplicates, reset index\n\ndf_cleaned2 = df.drop_duplicates(ignore_index=True)\n\nprint("Remove duplicates (reset index):")\n\nprint(df_cleaned2)\n

\n\n

Expected Output:

\n\n

Original Data:\n      Name      City\n0  Zhang San    Beijing\n1     Li Si   Shanghai\n2  Zhang San    Beijing\n3    Wang Wu  Guangzhou\n==================================================\nRemove duplicates (retain original index):\n      Name      City\n0  Zhang San    Beijing\n1     Li Si   Shanghai\n3    Wang Wu  Guangzhou\n==================================================\nRemove duplicates (reset index):\n      Name      City\n0  Zhang San    Beijing\n1     Li Si   Shanghai\n2    Wang Wu  Guangzhou\n

\n\n

Code Explanation:

\n\n

Without ignore_index, after removing row 2, the indices become 0, 1, 3, which are not continuous.
With ignore_index=True, the index is reset to 0, 1, 2.

\n\n

Example 6: Combine with Other Operations

\n\n

drop_duplicates() can be combined with other DataFrame operations.

\n\n

Example

\n\n

import pandas as pd\n\nimport numpy as np\n\n# Simulate data retrieved from a database\n\ndata = {\n\n'Order ID': ['O001','O002','O001','O003','O002','O004'],\n\n'Customer Name': ['Zhang San','Li Si','Zhang San','Wang Wu','Li Si','Zhao Liu'],\n\n'Amount': [100,200,100,300,200,400],\n\n'Date': ['2024-01-01','2024-01-02','2024-01-01','2024-01-03','2024-01-02','2024-01-04']\n\n}\n\ndf = pd.DataFrame(data)\n\nprint("Original Order Data:")\n\nprint(df)\n\nprint("=" * 50)\n\n# Check how many duplicate orders exist\n\nprint(f"Total rows: {len(df)}")\n\nprint(f"Rows after deduplication: {len(df.drop_duplicates())}")\n\nprint(f"Number of duplicate rows: {len(df) - len(df.drop_duplicates())}")\n\nprint("=" * 50)\n\n# Actually deduplicate, keeping the first record, and select only needed columns\n\ndf_unique = df.drop_duplicates(subset=['Order ID'])[['Order ID','Customer Name','Amount']]\n\nprint("Deduplicated Order Data:")\n\nprint(df_unique)\n

\n\n

Expected Output:

\n\n

Original Order Data:\n  Order ID Customer Name  Amount        Date\n0     O001    Zhang San     100  2024-01-01\n1     O002      Li Si     200  2024-01-02\n2     O001    Zhang San     100  2024-01-01 # Duplicate order\n3     O003     Wang Wu     300  2024-01-03\n4     O002      Li Si     200  2024-01-02 # Duplicate order\n5     O004    Zhao Liu     400  2024-01-04\n==================================================\nTotal rows: 6\nRows after deduplication: 4\nNumber of duplicate rows: 2\n==================================================\nDeduplicated Order Data:\n  Order ID Customer Name  Amount\n0     O001    Zhang San     100\n1     O002      Li Si     200\n3     O003     Wang Wu     300\n5     O004    Zhao Liu     400\n

\n\n

Code Explanation:

\n\n

The original data has 6 rows, with 2 duplicate orders (O001 and O002 each appear twice).
Use drop_duplicates(subset=['Order ID']) to deduplicate based on order ID.
Combine with column selection [['Order ID', 'Customer Name', 'Amount']] to keep only necessary columns.

\n\n

Notes

\n\n

drop_duplicates() does not modify the original DataFrame by default. To modify in place, use the inplace=True parameter.
When using the subset parameter, only the specified columns are considered for determining duplicates; values in other columns do not affect the judgment.
Using keep=False removes all duplicate rows, potentially causing significant data loss. Use with caution.
Before removing duplicate data, analyze the cause of duplication to ensure the operation aligns with business logic.
If there are missing values (NaN) in the data, they are treated as equal by default during comparison.

\n\n

Image 2: Pandas Common Functions Pandas General Functions

YouTip