YouTip LogoYouTip

Pandas Df Filter

[![Image 1: Pandas Common Functions](#) Pandas Common Functions](#) * * * `filter()` is a function in Pandas used to filter columns (not rows). It can select specific columns based on column names, label patterns, or regular expressions. Unlike `loc[]` and `iloc[]`, `filter()` is specifically designed for column selection, which is very useful when dealing with datasets that have many columns. In data analysis, sometimes we need to focus only on specific columns, such as selecting only numeric columns, only columns starting with a specific prefix, or only columns containing a specific keyword. `filter()` is designed for these scenarios. * * * ## Basic Syntax and Parameters `filter()` is a method of DataFrame and Series (for Series, only the items parameter is supported), used to filter columns based on conditions. ### Syntax Format DataFrame.filter(items=None, like=None, regex=None, axis=None) ### Parameter Description | Parameter | Type | Required | Description | Default Value | | --- | --- | --- | --- | --- | | items | list | Optional | Directly specify a list of column names to select matching columns. | None | | like | str | Optional | Column names containing the specified string, supports fuzzy matching. | None | | regex | str | Optional | Regular expression pattern to match column names. | None | | axis | int or str | Optional | Axis to filter. 0 or 'index' means rows, 1 or 'columns' means columns. | 1 or 'columns' | ### Return Value Description * **Return Type**: Returns a new DataFrame containing the filtered columns. * **Column Filtering**: By default, `filter()` operates on columns (axis=1). * **Preserve All Rows**: Filtering only affects columns; row data remains unchanged. * * * ## Examples Let's fully master the usage of `filter()` through rich examples. ### Example 1: Basic Usage - Using the items Parameter Directly specify a list of column names to select specific columns. ## Example import pandas as pd # Create sample DataFrame data ={ 'name': ['Alice','Bob','Charlie','David','Eve'], 'age': [18,19,17,18,20], 'score': [85,92,78,90,88], 'grade': ['A','A','B','A','B'], 'city': ['Beijing','Shanghai','Beijing','Guangzhou','Shanghai'] } df = pd.DataFrame(data) print("Original DataFrame:") print(df) print() # Use items parameter to select specific columns print("Select name and age columns:") print(df.filter(items=['name','age'])) print() # Select single column (returns DataFrame) print("Only select score column:") print(df.filter(items=['score'])) **Output:** Original DataFrame: name age score grade city 0 Alice 18 85 A Beijing1 Bob 19 92 A Shanghai2 Charlie 17 78 B Beijing3 David 18 90 A Guangzhou4 Eve 20 88 B ShanghaiSelect name and age columns: name age 0 Alice 181 Bob 192 Charlie 173 David 184 Eve 20Only select score column: score 0 851 922 783 904 88 **Code Explanation:** 1. `filter(items=['name', 'age'])` selects the specified columns and returns a new DataFrame containing these columns. 2. All rows are preserved while only specific columns are selected. 3. If the specified column name does not exist, a KeyError will be raised. ### Example 2: Using the like Parameter - Fuzzy Matching The `like` parameter allows us to filter columns whose names contain a specified string, which is very suitable for datasets with regular column naming conventions. ## Example import pandas as pd # Create DataFrame with regular column names data ={ 'name': ['Alice','Bob','Charlie','David','Eve'], 'age_2020': [18,19,17,18,20], 'age_2021': [19,20,18,19,21], 'age_2022': [20,21,19,20,22], 'score_2020': [85,92,78,90,88], 'score_2021': [87,94,80,92,90], 'score_2022': [89,96,82,94,92], 'city': ['Beijing','Shanghai','Beijing','Guangzhou','Shanghai'] } df = pd.DataFrame(data) print("Original DataFrame:") print(df) print() # Select all columns containing "age" print("Columns containing 'age':") print(df.filter(like='age')) print() # Select all columns containing "score" print("Columns containing 'score':") print(df.filter(like='score')) print() # Select all columns containing "2021" print("Columns containing '2021':") print(df.filter(like='2021')) **Output:** Original DataFrame: name age_2020 age_2021 age_2022 score_2020 score_2021 score_2022 city 0 Alice 18 19 20 85 87 89 Beijing1 Bob 19 20 21 92 94 96 Shanghai2 Charlie 17 18 19 78 80 82 Beijing3 David 18 19 20 90 92 94 Guangzhou4 Eve 20 21 22 88 90 92 ShanghaiColumns containing 'age': age_2020 age_2021 age_2022 0 18 19 201 19 20 212 17 18 193 18 19 204 20 21 22Columns containing 'score': score_2020 score_2021 score_2022 0 85 87 891 92 94 962 78 80 823 90 92 944 88 90 92Columns containing '2021': age_2021 score_2021 0 19 871 20 942 18 803 19 924 21 90 **Code Explanation:** * `filter(like='age')` selects all columns whose names contain "age". * Fuzzy matching does not require knowing the full column name; just provide a partial matching string. * This is particularly useful when dealing with columns that follow similar naming conventions (e.g., columns organized by year). ### Example 3: Using the regex Parameter - Regular Expression Matching For more complex column filtering needs, regular expressions can be used. ## Example import pandas as pd # Create complex column names data ={ 'id': [1,2,3,4,5], 'name': ['Alice','Bob','Charlie','David','Eve'], 'age_2020': [18,19,17,18,20], 'age_2021': [19,20,18,19,21], 'score_math': [85,92,78,90,88], 'score_english': [87,94,80,92,90], 'score_science': [89,96,82,94,92], 'city': ['Beijing','Shanghai','Beijing','Guangzhou','Shanghai'] } df = pd.DataFrame(data) print("Original DataFrame columns:") print(df.columns.tolist()) print() # Select columns starting with "score_" print("Columns starting with 'score_':") print(df.filter(regex='^score_')) print() # Select columns containing numbers print("Columns containing numbers:") print(df.filter(regex='d'))# d matches digits print() # Select columns containing both "age" and "202" print("Columns containing both 'age' and '202':") print(df.filter(regex='age.*202|202.*age')) print() # Select columns starting with a letter and not containing underscores print("Columns starting with a letter and not containing underscores:") print(df.filter(regex='^+$')) **Output:** Original DataFrame columns:['id', 'name', 'age_2020', 'age_2021', 'score_math', 'score_english', 'score_science', 'city']Columns starting with 'score_': score_math score_english score_science 0 85 87 891 92 94 962 78 80 823 90 92 944 88 90 92Columns containing numbers: age_2020 age_2021 score_math score_english score_science 0 18 19 85 87 891 19 20 92 94 962 17 18 78 80 823 18 19 90 92 944 20 21 88 90 92Columns containing both 'age' and '202': age_2020 age_2021 0 18 191 19 202 17 183 18 194 20 21Columns starting with a letter and not containing underscores: name city 0 Alice Beijing1 Bob Shanghai2 Charlie Beijing3 David Guangzhou4 Eve Shanghai **Code Explanation:** 1. `filter(regex='^score_')` uses `^` to match columns starting with "score_". 2. `filter(regex='d')` uses `d` to match column names containing digits. 3. Regular expressions provide maximum flexibility for handling various complex matching requirements. ### Example 4: Using the axis Parameter to Filter Rows Although `filter()` is mainly used for filtering columns, it can also be used to filter rows via the `axis` parameter. ## Example import pandas as pd # Create DataFrame with named indices data ={ 'name': ['Alice','Bob','Charlie','David','Eve'], 'age': [18,19,17,18,20], 'score': [85,92,78,90,88] } df = pd.DataFrame(data, index=['row1','row2','row3','row4','row5']) print("DataFrame with named indices:") print(df) print() # Use axis='index' to filter rows (using like) print("Rows with indices containing 'row':") print(df.filter(like='row', axis='index')) print() # Use items to filter specific row indices print("Rows with indices 'row1' and 'row3':") print(df.filter(items=['row1','row3'], axis='index')) print() # Use regex to filter rows print("Rows with indices matching regex (row):") print(df.filter(regex='row', axis='index')) **Output:** DataFrame with named indices: name age score row1 Alice 18 85 row2 Bob 19 92 row3 Charlie 17 78 row4 David 18 90 row5 Eve 20 88Rows with indices containing 'row': name age score row1 Alice 18 85 row2 Bob 19 92 row3 Charlie 17 78 row4 David 18 90 row5 Eve 20 88Rows with indices 'row1' and 'row3': name age score row1 Alice 18 85 row3 Charlie 17 78Rows with indices matching regex (row): name age score row1 Alice 18 85 row3 Charlie 17 78 row5 Eve 20 88 **Code Explanation:** * `axis='index'` or `axis=0` tells `filter()` to filter row indices. * The usage of parameters like, items, and regex for row filtering is the same as for column filtering. * This is very useful when you need to filter specific rows based on their index names. ### Example 5: Combining with Other Functions `filter()` can be combined with other Pandas functions to achieve powerful data selection capabilities. ## Example import pandas as pd import numpy as np # Create a large DataFrame np.random.seed(42) df = pd.DataFrame(np.random.randn(5,10), columns=[ 'A','B','C','D','E', 'AA','BB','CC','DD','EE' ]) print("Original DataFrame columns:") print(df.columns.tolist()) print() # Filter single-letter columns and perform calculations single_letter_cols = df.filter(regex='^$') print("Mean of single-letter columns (A-E):") print(single_letter_cols.mean()) print() # Filter double-letter columns and perform calculations double_letter_cols = df.filter(regex='^{2}$') print("Mean of double-letter columns (AA-EE):") print(double_letter_cols.mean()) print() # First filter columns, then select rows print("First 3 rows of double-letter columns:") print(df.filter(regex='^{2}$').head(3)) **Output:** Original DataFrame columns:['A', 'B', 'C', 'D', 'E', 'AA', 'BB', 'CC', 'DD', 'EE']Mean of single-letter columns (A-E): A 0.336744 B 0.128223 C -0.234007 D -0.347540 E -0.197939 dtype: float64 Mean of double-letter columns (AA-EE): AA 0.530256 BB -0.671336 CC 0.506853 DD -0.443588 EE -0.456486 dtype: float64 First 3 rows of double-letter columns: AA BB CC DD EE 0 -0.013497 -1.174139 0.214864 1.550698 0.3756981 -1.368593 0.746580 0.669383 -0.717552 -1.1599502 0.602619 -1.700736 -0.201647 -0.605166 -0.012750 **Code Explanation:** * `filter()` can be chained with other DataFrame methods. * Filtering specific columns first, then performing statistical calculations improves code readability. * This is especially useful when working with wide tables. * * * ## Notes * `items` is used for exact column name specification, `like` for fuzzy matching, and `regex` for regular expression matching. Choose the appropriate method based on your actual needs. * When working with large datasets, filtering required columns first using `filter()` can reduce the amount of data for subsequent computations, improving performance. * `filter()` defaults to operating on columns (axis=1), but this can be changed to rows using `axis='index'` or `axis=0`. * Note that the parameters `items`, `like`, and `regex` are mutually exclusive and cannot be used simultaneously. > Tip: `filter()` is a powerful tool for handling wide tables (datasets with many columns). Especially during data exploration, if you only want to focus on certain types of information (such as all numeric columns or all columns starting with a specific prefix), using `filter()` can quickly complete column filtering. * * * ## Summary `filter()` is a Pandas function specifically designed for filtering columns (and rows). It provides three filtering methods: exact specification (items), fuzzy matching (like), and regular expression matching (regex). Main application scenarios include: selecting specific columns, handling columns with regular naming conventions, and dynamically choosing columns for analysis. Combined with other Pandas functions, it can significantly improve data processing efficiency.
← Pandas Series SumPandas Df Iloc β†’