YouTip LogoYouTip

Pandas Filter

Data filtering is one of the most commonly used operations in data analysis. Pandas provides rich conditional query functions that can filter data based on various conditions. This section introduces various data filtering methods in detail. * * * ## Basic Conditional Filtering ### Single Condition Filtering ## Example import pandas as pd # Create sample data df = pd.DataFrame({ "Name": ["Zhang San","Li Si","Wang Wu","Zhao Liu","Qian Qi"], "Age": [25,30,28,35,22], "City": ["Beijing","Shanghai","Guangzhou","Beijing","Shenzhen"], "Salary": [12000,15000,11000,18000,9000], "Department": ["Tech","Sales","Tech","Operations","Tech"] }) print("Original Data:") print(df) print() # Equal filtering print("City equals Beijing:") print(df[df=="Beijing"]) print() # Not equal filtering print("City not equal to Beijing:") print(df[df!="Beijing"]) print() # Greater than / Less than print("Salary greater than 12000:") print(df[df>12000]) print() # String condition print("Name contains 'San':") print(df[df.str.contains("San")]) ### Multiple Condition Combination ## Example import pandas as pd df = pd.DataFrame({ "Name": ["Zhang San","Li Si","Wang Wu","Zhao Liu","Qian Qi"], "Age": [25,30,28,35,22], "City": ["Beijing","Shanghai","Guangzhou","Beijing","Shenzhen"], "Salary": [12000,15000,11000,18000,9000], "Department": ["Tech","Sales","Tech","Operations","Tech"] }) # AND condition: use & operator print("Beijing AND Salary > 10000:") print(df[(df=="Beijing")&(df>10000)]) print() # OR condition: use | operator print("Tech Department OR Salary > 15000:") print(df[(df=="Tech") | (df>15000)]) print() # NOT condition: use ~ operator print("Not Tech Department:") print(df[~(df=="Tech")]) print() # Complex combination print("(Beijing and Age > 25) OR (Shanghai):") print(df[((df=="Beijing")&(df>25)) | (df=="Shanghai")]) > When combining multiple conditions, each condition must be enclosed in parentheses, and conditions are connected using `&` (AND), `|` (OR), rather than Python's `and`, `or` keywords. * * * ## isin and where ### isin Filtering ## Example import pandas as pd df = pd.DataFrame({ "Name": ["Zhang San","Li Si","Wang Wu","Zhao Liu","Qian Qi"], "City": ["Beijing","Shanghai","Guangzhou","Beijing","Shenzhen"], "Department": ["Tech","Sales","Tech","Operations","Tech"] }) # Filter within list print("City is Beijing or Shanghai:") print(df[df.isin(["Beijing","Shanghai"])]) print() # Not in list print("City is neither Beijing nor Shanghai:") print(df[~df.isin(["Beijing","Shanghai"])]) print() # Multiple columns isin df2 = pd.DataFrame({ "City": ["Beijing","Shanghai"], "Department": ["Tech","Sales"] }) print("Meet both City and Department conditions:") print(df[df[["City","Department"]].isin(df2).all(axis=1)]) ### where Condition ## Example import pandas as pd import numpy as np df = pd.DataFrame({ "A": [1,5,3,7], "B": [4,2,8,3] }) # where: keep values that meet the condition, set others to NaN or a specified value print("Keep values greater than 3, set others to 0:") result = df.where(df >3,0) print(result) print() # Use another DataFrame as condition other = pd.DataFrame({ "A": [True,False,True,False], "B": [False,True,False,True] }) print("Filter based on condition DataFrame:") print(df.where(other, -1)) * * * ## String Filtering ## Example import pandas as pd df = pd.DataFrame({ "Name": ["Zhang San","Li Si","Wang Wu","Zhao Liu","Sun Qi"], "City": ["Beijing City","Shanghai City","Guangzhou City","Shenzhen City","Hangzhou City"], "Email": ["zhangsan@email.com","lisi@email.com","wangwu@email.com", "zhaoliu@email.com","sunqi@email.com"] }) # Contains print("City contains 'Beijing':") print(df[df.str.contains("Beijing")]) print() # Starts with / Ends with print("Email starts with 'zhang':") print(df[df.str.startswith("zhang")]) print() # Regular expression matching print("Email username length greater than 4:") print(df[df.str.extract(r"(w+)@").str.len()>4]) print() # Multiple pattern matching print("City starts with Bei, Shang, Guang, or Shen:") print(df[df.str.contains("^Bei|^Shang|^Guang|^Shen")]) * * * ## Numeric Range Filtering ## Example import pandas as pd df = pd.DataFrame({ "Name": ["Zhang San","Li Si","Wang Wu","Zhao Liu","Qian Qi"], "Age": [25,30,28,35,22], "Salary": [12000,15000,11000,18000,9000] }) # Range filtering print("Age between 25 and 30 (inclusive):") print(df[df.between(25,30)]) print() # Quantile filtering q1 = df.quantile(0.25) q3 = df.quantile(0.75) print(f"Salary between 25%-75% ({q1}-{q3}):") print(df[df.between(q1, q3)]) print() # Top N / Bottom N print("Top 3 highest salaries:") print(df.nlargest(3,"Salary")) print() print("Bottom 2 lowest salaries:") print(df.nsmallest(2,"Salary")) * * * ## Time Data Filtering ## Example import pandas as pd # Create time series data df = pd.DataFrame({ "Date": pd.date_range("2024-01-01", periods=10, freq="D"), "Sales": [100,120,90,150,200,180,110,130,170,190] }) df = df.set_index("Date") print("Time Series Data:") print(df) print() # Filter by date print("Data for 2024-01-05:") print(df.loc) print() # Date range filtering print("From 2024-01-03 to 2024-01-07:") print(df.loc["2024-01-03":"2024-01-07"]) print() # Using truncate (more efficient slicing) print("Truncate filtering:") print(df.truncate(before="2024-01-03", after="2024-01-07")) * * * ## query Method The `query` method provides a more concise SQL-style filtering approach. ## Example import pandas as pd df = pd.DataFrame({ "Name": ["Zhang San","Li Si","Wang Wu","Zhao Liu"], "Age": [25,30,28,35], "City": ["Beijing","Shanghai","Guangzhou","Beijing"], "Salary": [12000,15000,11000,18000] }) # Using query method print("Age > 26 AND City = 'Beijing':") print(df.query("Age > 26 and City == 'Beijing'")) print() # Using variables min_age =26 city ="Beijing" print(f"Age > {min_age} AND City = {city}:") print(df.query("Age > @min_age and City == @city")) print() # Column names with spaces need backticks df2 = df.rename(columns={"Name": "Name "}) print("Column name with space:") print(df2.query("`Name ` == 'Zhang San'")) * * * ## Common Issues **1. Error in Condition Combination** When filtering with multiple conditions, use parentheses to clarify precedence, and use `&`, `|` instead of `and`, `or`. **2. String Contains Null Values** Before using `str.contains`, handle null values: `df[df.str.contains("x", na=False)]` **3. Index Filtering** If the DataFrame has a custom index, pay attention to the difference between `loc` and `iloc`. > For complex multi-condition queries, the syntax of the `query` method is closer to natural language and easier to read. When dealing with large datasets, `loc` combined with boolean arrays is usually faster than `query`.
← Pandas DuplicatePandas Categorical β†’