Pandas Filter
Data filtering is one of the most commonly used operations in data analysis. Pandas provides rich conditional query functions that can filter data based on various conditions. This section introduces various data filtering methods in detail.
* * *
## Basic Conditional Filtering
### Single Condition Filtering
## Example
import pandas as pd
# Create sample data
df = pd.DataFrame({
"Name": ["Zhang San","Li Si","Wang Wu","Zhao Liu","Qian Qi"],
"Age": [25,30,28,35,22],
"City": ["Beijing","Shanghai","Guangzhou","Beijing","Shenzhen"],
"Salary": [12000,15000,11000,18000,9000],
"Department": ["Tech","Sales","Tech","Operations","Tech"]
})
print("Original Data:")
print(df)
print()
# Equal filtering
print("City equals Beijing:")
print(df[df=="Beijing"])
print()
# Not equal filtering
print("City not equal to Beijing:")
print(df[df!="Beijing"])
print()
# Greater than / Less than
print("Salary greater than 12000:")
print(df[df>12000])
print()
# String condition
print("Name contains 'San':")
print(df[df.str.contains("San")])
### Multiple Condition Combination
## Example
import pandas as pd
df = pd.DataFrame({
"Name": ["Zhang San","Li Si","Wang Wu","Zhao Liu","Qian Qi"],
"Age": [25,30,28,35,22],
"City": ["Beijing","Shanghai","Guangzhou","Beijing","Shenzhen"],
"Salary": [12000,15000,11000,18000,9000],
"Department": ["Tech","Sales","Tech","Operations","Tech"]
})
# AND condition: use & operator
print("Beijing AND Salary > 10000:")
print(df[(df=="Beijing")&(df>10000)])
print()
# OR condition: use | operator
print("Tech Department OR Salary > 15000:")
print(df[(df=="Tech") | (df>15000)])
print()
# NOT condition: use ~ operator
print("Not Tech Department:")
print(df[~(df=="Tech")])
print()
# Complex combination
print("(Beijing and Age > 25) OR (Shanghai):")
print(df[((df=="Beijing")&(df>25)) | (df=="Shanghai")])
> When combining multiple conditions, each condition must be enclosed in parentheses, and conditions are connected using `&` (AND), `|` (OR), rather than Python's `and`, `or` keywords.
* * *
## isin and where
### isin Filtering
## Example
import pandas as pd
df = pd.DataFrame({
"Name": ["Zhang San","Li Si","Wang Wu","Zhao Liu","Qian Qi"],
"City": ["Beijing","Shanghai","Guangzhou","Beijing","Shenzhen"],
"Department": ["Tech","Sales","Tech","Operations","Tech"]
})
# Filter within list
print("City is Beijing or Shanghai:")
print(df[df.isin(["Beijing","Shanghai"])])
print()
# Not in list
print("City is neither Beijing nor Shanghai:")
print(df[~df.isin(["Beijing","Shanghai"])])
print()
# Multiple columns isin
df2 = pd.DataFrame({
"City": ["Beijing","Shanghai"],
"Department": ["Tech","Sales"]
})
print("Meet both City and Department conditions:")
print(df[df[["City","Department"]].isin(df2).all(axis=1)])
### where Condition
## Example
import pandas as pd
import numpy as np
df = pd.DataFrame({
"A": [1,5,3,7],
"B": [4,2,8,3]
})
# where: keep values that meet the condition, set others to NaN or a specified value
print("Keep values greater than 3, set others to 0:")
result = df.where(df >3,0)
print(result)
print()
# Use another DataFrame as condition
other = pd.DataFrame({
"A": [True,False,True,False],
"B": [False,True,False,True]
})
print("Filter based on condition DataFrame:")
print(df.where(other, -1))
* * *
## String Filtering
## Example
import pandas as pd
df = pd.DataFrame({
"Name": ["Zhang San","Li Si","Wang Wu","Zhao Liu","Sun Qi"],
"City": ["Beijing City","Shanghai City","Guangzhou City","Shenzhen City","Hangzhou City"],
"Email": ["zhangsan@email.com","lisi@email.com","wangwu@email.com",
"zhaoliu@email.com","sunqi@email.com"]
})
# Contains
print("City contains 'Beijing':")
print(df[df.str.contains("Beijing")])
print()
# Starts with / Ends with
print("Email starts with 'zhang':")
print(df[df.str.startswith("zhang")])
print()
# Regular expression matching
print("Email username length greater than 4:")
print(df[df.str.extract(r"(w+)@").str.len()>4])
print()
# Multiple pattern matching
print("City starts with Bei, Shang, Guang, or Shen:")
print(df[df.str.contains("^Bei|^Shang|^Guang|^Shen")])
* * *
## Numeric Range Filtering
## Example
import pandas as pd
df = pd.DataFrame({
"Name": ["Zhang San","Li Si","Wang Wu","Zhao Liu","Qian Qi"],
"Age": [25,30,28,35,22],
"Salary": [12000,15000,11000,18000,9000]
})
# Range filtering
print("Age between 25 and 30 (inclusive):")
print(df[df.between(25,30)])
print()
# Quantile filtering
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
print(f"Salary between 25%-75% ({q1}-{q3}):")
print(df[df.between(q1, q3)])
print()
# Top N / Bottom N
print("Top 3 highest salaries:")
print(df.nlargest(3,"Salary"))
print()
print("Bottom 2 lowest salaries:")
print(df.nsmallest(2,"Salary"))
* * *
## Time Data Filtering
## Example
import pandas as pd
# Create time series data
df = pd.DataFrame({
"Date": pd.date_range("2024-01-01", periods=10, freq="D"),
"Sales": [100,120,90,150,200,180,110,130,170,190]
})
df = df.set_index("Date")
print("Time Series Data:")
print(df)
print()
# Filter by date
print("Data for 2024-01-05:")
print(df.loc)
print()
# Date range filtering
print("From 2024-01-03 to 2024-01-07:")
print(df.loc["2024-01-03":"2024-01-07"])
print()
# Using truncate (more efficient slicing)
print("Truncate filtering:")
print(df.truncate(before="2024-01-03", after="2024-01-07"))
* * *
## query Method
The `query` method provides a more concise SQL-style filtering approach.
## Example
import pandas as pd
df = pd.DataFrame({
"Name": ["Zhang San","Li Si","Wang Wu","Zhao Liu"],
"Age": [25,30,28,35],
"City": ["Beijing","Shanghai","Guangzhou","Beijing"],
"Salary": [12000,15000,11000,18000]
})
# Using query method
print("Age > 26 AND City = 'Beijing':")
print(df.query("Age > 26 and City == 'Beijing'"))
print()
# Using variables
min_age =26
city ="Beijing"
print(f"Age > {min_age} AND City = {city}:")
print(df.query("Age > @min_age and City == @city"))
print()
# Column names with spaces need backticks
df2 = df.rename(columns={"Name": "Name "})
print("Column name with space:")
print(df2.query("`Name ` == 'Zhang San'"))
* * *
## Common Issues
**1. Error in Condition Combination**
When filtering with multiple conditions, use parentheses to clarify precedence, and use `&`, `|` instead of `and`, `or`.
**2. String Contains Null Values**
Before using `str.contains`, handle null values: `df[df.str.contains("x", na=False)]`
**3. Index Filtering**
If the DataFrame has a custom index, pay attention to the difference between `loc` and `iloc`.
> For complex multi-condition queries, the syntax of the `query` method is closer to natural language and easier to read. When dealing with large datasets, `loc` combined with boolean arrays is usually faster than `query`.
YouTip