\n\n
df.drop_duplicates() is a function in Pandas used to remove duplicate rows.
In the process of data collection and integration, duplicate data is a common issue. drop_duplicates() can help you identify and remove duplicate rows based on specified columns or all columns, keeping either the first or last occurrence. This is very useful for data deduplication and ensuring data uniqueness.
\n\n
Basic Syntax and Parameters
\n\ndrop_duplicates() is a member function of DataFrame, called using the dot operator ..
Syntax Format
\n\nDataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)\n\n\nParameter Description
\n\n| Parameter | \nType | \nRequired | \nDescription | \nDefault Value | \n
|---|---|---|---|---|
| subset | \ncolumn label or list | \nOptional | \nSpecifies columns to use for determining duplicates. If None, all columns are used. Can be a single column name or a list of column names. | \nNone | \n
| keep | \nstr or False | \nOptional | \nSpecifies which duplicate record to keep. 'first' keeps the first; 'last' keeps the last; False removes all duplicates. | \n'first' | \n
| inplace | \nbool | \nOptional | \nIf True, modifies the original DataFrame directly without returning a new object; if False, returns a new DataFrame with the original data unchanged. | \nFalse | \n
| ignore_index | \nbool | \nOptional | \nIf True, resets the index of the result starting from 0; if False, retains the original index. | \nFalse | \n
Return Value Description
\n\n- \n
- Returns a new DataFrame (if
inplace=False) orNone(ifinplace=True). \n - The returned DataFrame contains no duplicate rows. \n
\n\n
Examples
\n\nLet's go through a series of examples to fully master the usage of drop_duplicates().
Example 1: Remove Completely Duplicate Rows
\n\nBy default, it determines duplicates based on all columns.
\n\nExample
\n\nimport pandas as pd\n\nimport numpy as np\n\n# Create a DataFrame with duplicate rows\n\ndata = {\n\n'Name': ['Zhang San','Li Si','Zhang San','Wang Wu','Li Si'],\n\n'Age': [25,30,25,35,30],\n\n'Department': ['Tech','Marketing','Tech','Tech','Marketing']\n\n}\n\ndf = pd.DataFrame(data)\n\nprint("Original Data:")\n\nprint(df)\n\nprint("=" * 50)\n\n# Remove duplicate rows, keeping the first occurrence\n\ndf_cleaned = df.drop_duplicates()\n\nprint("Data after removing duplicates:")\n\nprint(df_cleaned)\n\n\nExpected Output:
\n\nOriginal Data:\n Name Age Department\n0 Zhang San 25 Tech\n1 Li Si 30 Marketing\n2 Zhang San 25 Tech\n3 Wang Wu 35 Tech\n4 Li Si 30 Marketing\n==================================================\nData after removing duplicates:\n Name Age Department\n0 Zhang San 25 Tech\n1 Li Si 30 Marketing\n3 Wang Wu 35 Tech\n\n\nCode Explanation:
\n\n- \n
- The 0th and 2nd rows are identical (Zhang San, 25 years old, Tech department). \n
- The 1st and 4th rows are also identical (Li Si, 30 years old, Marketing department). \n
- Using
drop_duplicates()with default parameters removes duplicates, keeping the first occurrence. \n
Example 2: Remove Duplicates Based on Specified Columns
\n\nYou can use the subset parameter to determine duplicates based on specific columns only.
Example
\n\nimport pandas as pd\n\nimport numpy as np\n\n# Create a DataFrame\n\ndata = {\n\n'Name': ['Zhang San','Li Si','Zhang San','Wang Wu'],\n\n'Age': [25,30,28,35], # Zhang San appears twice but with different ages\n\n'Department': ['Tech','Marketing','Tech','Marketing']\n\n}\n\ndf = pd.DataFrame(data)\n\nprint("Original Data:")\n\nprint(df)\n\nprint("=" * 50)\n\n# Determine duplicates only by "Name" column\n\ndf_cleaned = df.drop_duplicates(subset=['Name'])\n\nprint("Data after removing duplicates by Name column:")\n\nprint(df_cleaned)\n\nprint("=" * 50)\n\n# Determine duplicates by both "Name" and "Department" columns\n\ndf_cleaned2 = df.drop_duplicates(subset=['Name','Department'])\n\nprint("Data after removing duplicates by Name and Department columns:")\n\nprint(df_cleaned2)\n\n\nExpected Output:
\n\nOriginal Data:\n Name Age Department\n0 Zhang San 25 Tech\n1 Li Si 30 Marketing\n2 Zhang San 28 Tech # Although name is same, age is different\n3 Wang Wu 35 Marketing\n==================================================\nData after removing duplicates by Name column:\n Name Age Department\n0 Zhang San 25 Tech # Keep the first record of Zhang San\n2 Zhang San 28 Tech\n3 Wang Wu 35 Marketing\n==================================================\nData after removing duplicates by Name and Department columns:\n Name Age Department\n0 Zhang San 25 Tech\n1 Li Si 30 Marketing\n3 Wang Wu 35 Marketing\n\n\nCode Explanation:
\n\n- \n
- When judging by the Name column, rows 0 and 2 both have "Zhang San", so only row 0 is kept. \n
- When judging by both Name and Department columns, rows 0 and 2 have identical Name and Department, so only row 0 is kept. \n
Example 3: Keep the Last Duplicate Record
\n\nUse the keep='last' parameter to keep the last occurrence of duplicate records.
Example
\n\nimport pandas as pd\n\nimport numpy as np\n\n# Create a DataFrame with duplicate rows\n\ndata = {\n\n'Student ID': ['S001','S002','S001','S003','S002'],\n\n'Name': ['Zhang San','Li Si','Zhang San','Wang Wu','Li Si'],\n\n'Score': [85,90,88,92,85] # Same student ID, scores may differ\n\n}\n\ndf = pd.DataFrame(data)\n\nprint("Original Data:")\n\nprint(df)\n\nprint("=" * 50)\n\n# Keep the first record (default)\n\ndf_first = df.drop_duplicates(subset=['Student ID'], keep='first')\n\nprint("Keep the first duplicate record:")\n\nprint(df_first)\n\nprint("=" * 50)\n\n# Keep the last record\n\ndf_last = df.drop_duplicates(subset=['Student ID'], keep='last')\n\nprint("Keep the last duplicate record:")\n\nprint(df_last)\n\n\nExpected Output:
\n\nOriginal Data:\n Student ID Name Score\n0 S001 Zhang San 85\n1 S002 Li Si 90\n2 S001 Zhang San 88 # Same person, different score\n3 S003 Wang Wu 92\n4 S002 Li Si 85 # Same person, different score\n==================================================\nKeep the first duplicate record:\n Student ID Name Score\n0 S001 Zhang San 85 # Keep the first 85\n1 S002 Li Si 90 # Keep the first 90\n3 S003 Wang Wu 92\n==================================================\nKeep the last duplicate record:\n Student ID Name Score\n2 S001 Zhang San 88 # Keep the last 88\n3 S003 Wang Wu 92\n4 S002 Li Si 85 # Keep the last 85\n\n\nCode Explanation:
\n\n- \n
- Student ID S001 has two records with scores 85 and 88. Using
keep='first'keeps the 85 score, whilekeep='last'keeps the 88 score. \n - Student ID S002 also has two records with scores 90 and 85. \n
- Choose to keep the first or last based on business needs. \n
Example 4: Remove All Duplicate Records
\n\nUsing keep=False removes all duplicate records, keeping only completely unique rows.
Example
\n\nimport pandas as pd\n\nimport numpy as np\n\n# Create a DataFrame\n\ndata = {\n\n'A': [1,1,2,2,3],\n\n'B': [1,1,2,2,3],\n\n'C': [1,2,3,3,5]\n\n}\n\ndf = pd.DataFrame(data)\n\nprint("Original Data:")\n\nprint(df)\n\nprint("=" * 50)\n\n# Remove all duplicate rows (no duplicates retained)\n\ndf_cleaned = df.drop_duplicates(keep=False)\n\nprint("Data after removing all duplicates:")\n\nprint(df_cleaned)\n\n\nExpected Output:
\n\nOriginal Data:\n A B C\n0 1 1 1\n1 1 1 2 # Same as row 0 in A and B, duplicate row\n2 2 2 3\n3 2 2 3 # Identical to row 2, duplicate row\n4 3 3 5\n==================================================\nData after removing all duplicates:\n A B C\n4 3 3 5\n\n\nCode Explanation:
\n\n- \n
- Rows 0 and 1 have identical values in columns A and B, so they are duplicates. With
keep=False, both are removed. \n - Rows 2 and 3 are also identical and both are removed. \n
- Only row 4 is unique and is retained. \n
Example 5: Reset Index
\n\nAfter removing duplicate rows, the original index might not be continuous. You can use ignore_index=True to reset the index.
Example
\n\nimport pandas as pd\n\nimport numpy as np\n\n# Create a DataFrame with duplicate rows\n\ndata = {\n\n'Name': ['Zhang San','Li Si','Zhang San','Wang Wu'],\n\n'City': ['Beijing','Shanghai','Beijing','Guangzhou']\n\n}\n\ndf = pd.DataFrame(data)\n\nprint("Original Data:")\n\nprint(df)\n\nprint("=" * 50)\n\n# Remove duplicates, do not reset index\n\ndf_cleaned1 = df.drop_duplicates()\n\nprint("Remove duplicates (retain original index):")\n\nprint(df_cleaned1)\n\nprint("=" * 50)\n\n# Remove duplicates, reset index\n\ndf_cleaned2 = df.drop_duplicates(ignore_index=True)\n\nprint("Remove duplicates (reset index):")\n\nprint(df_cleaned2)\n\n\nExpected Output:
\n\nOriginal Data:\n Name City\n0 Zhang San Beijing\n1 Li Si Shanghai\n2 Zhang San Beijing\n3 Wang Wu Guangzhou\n==================================================\nRemove duplicates (retain original index):\n Name City\n0 Zhang San Beijing\n1 Li Si Shanghai\n3 Wang Wu Guangzhou\n==================================================\nRemove duplicates (reset index):\n Name City\n0 Zhang San Beijing\n1 Li Si Shanghai\n2 Wang Wu Guangzhou\n\n\nCode Explanation:
\n\n- \n
- Without
ignore_index, after removing row 2, the indices become 0, 1, 3, which are not continuous. \n - With
ignore_index=True, the index is reset to 0, 1, 2. \n
Example 6: Combine with Other Operations
\n\ndrop_duplicates() can be combined with other DataFrame operations.
Example
\n\nimport pandas as pd\n\nimport numpy as np\n\n# Simulate data retrieved from a database\n\ndata = {\n\n'Order ID': ['O001','O002','O001','O003','O002','O004'],\n\n'Customer Name': ['Zhang San','Li Si','Zhang San','Wang Wu','Li Si','Zhao Liu'],\n\n'Amount': [100,200,100,300,200,400],\n\n'Date': ['2024-01-01','2024-01-02','2024-01-01','2024-01-03','2024-01-02','2024-01-04']\n\n}\n\ndf = pd.DataFrame(data)\n\nprint("Original Order Data:")\n\nprint(df)\n\nprint("=" * 50)\n\n# Check how many duplicate orders exist\n\nprint(f"Total rows: {len(df)}")\n\nprint(f"Rows after deduplication: {len(df.drop_duplicates())}")\n\nprint(f"Number of duplicate rows: {len(df) - len(df.drop_duplicates())}")\n\nprint("=" * 50)\n\n# Actually deduplicate, keeping the first record, and select only needed columns\n\ndf_unique = df.drop_duplicates(subset=['Order ID'])[['Order ID','Customer Name','Amount']]\n\nprint("Deduplicated Order Data:")\n\nprint(df_unique)\n\n\nExpected Output:
\n\nOriginal Order Data:\n Order ID Customer Name Amount Date\n0 O001 Zhang San 100 2024-01-01\n1 O002 Li Si 200 2024-01-02\n2 O001 Zhang San 100 2024-01-01 # Duplicate order\n3 O003 Wang Wu 300 2024-01-03\n4 O002 Li Si 200 2024-01-02 # Duplicate order\n5 O004 Zhao Liu 400 2024-01-04\n==================================================\nTotal rows: 6\nRows after deduplication: 4\nNumber of duplicate rows: 2\n==================================================\nDeduplicated Order Data:\n Order ID Customer Name Amount\n0 O001 Zhang San 100\n1 O002 Li Si 200\n3 O003 Wang Wu 300\n5 O004 Zhao Liu 400\n\n\nCode Explanation:
\n\n- \n
- The original data has 6 rows, with 2 duplicate orders (O001 and O002 each appear twice). \n
- Use
drop_duplicates(subset=['Order ID'])to deduplicate based on order ID. \n - Combine with column selection
[['Order ID', 'Customer Name', 'Amount']]to keep only necessary columns. \n
\n\n
Notes
\n\n- \n
drop_duplicates()does not modify the original DataFrame by default. To modify in place, use theinplace=Trueparameter. \n- When using the
subsetparameter, only the specified columns are considered for determining duplicates; values in other columns do not affect the judgment. \n - Using
keep=Falseremoves all duplicate rows, potentially causing significant data loss. Use with caution. \n - Before removing duplicate data, analyze the cause of duplication to ensure the operation aligns with business logic. \n
- If there are missing values (NaN) in the data, they are treated as equal by default during comparison. \n
\n\n
YouTip