# Pandas Advanced

Pandas provides data manipulation functions suited to complex data cleaning, analysis, aggregation, and time series processing. Mastering its advanced features can greatly improve the efficiency of data processing and analysis.

* * *

## I. Data Merging and Joining

Pandas provides several methods for merging and connecting DataFrames, such as `merge()`, `concat()`, and `join()`. They are commonly used when working with multiple datasets and complex merging tasks.

### 1. `merge()`: Database-style Join

The `merge()` method combines two DataFrames based on one or more columns, similar to a `JOIN` in SQL. It supports inner, outer, left, and right joins.

| **Parameter** | **Description** |
| --- | --- |
| `left` | Left DataFrame |
| `right` | Right DataFrame |
| `how` | Merge method: `'inner'`, `'outer'`, `'left'`, or `'right'` |
| `on` | Column name to join on (if the column names differ between the two sides, use `left_on` and `right_on`) |
| `left_on` | Join column in the left DataFrame |
| `right_on` | Join column in the right DataFrame |
| `suffixes` | Suffixes appended to overlapping column names to tell them apart |

**Example:**

```python
import pandas as pd

# Sample data
left = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
right = pd.DataFrame({'ID': [1, 2, 4], 'Age': [24, 27, 22]})

# Inner join using merge: only IDs present on both sides are kept
result = pd.merge(left, right, on='ID', how='inner')
print(result)
```

**Output:**

```
   ID   Name  Age
0   1  Alice   24
1   2    Bob   27
```

### 2. `concat()`: Concatenate Along an Axis

`concat()` connects multiple DataFrames along a specified axis. It is commonly used for row concatenation (stacking vertically) or column concatenation (placing side by side).

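Column concatenation can be sketched as follows. This is a minimal illustration, not part of the original examples: the frames `names` and `ages` and their values are hypothetical, and both are assumed to share the same index labels so that rows line up.

```python
import pandas as pd

# Hypothetical sample frames that share the same index labels
names = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie']}, index=[1, 2, 3])
ages = pd.DataFrame({'Age': [24, 27, 22]}, index=[1, 2, 3])

# axis=1 places the frames side by side, aligning rows on the index
side_by_side = pd.concat([names, ages], axis=1)
print(side_by_side)
```

If the index labels did not match, pandas would align on the union of the labels and fill the gaps with `NaN`, which is worth checking before concatenating column-wise.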
| **Parameter** | **Description** |
| --- | --- |
| `objs` | List of DataFrames to concatenate |
| `axis` | Axis to concatenate along: `0` for row-wise, `1` for column-wise |
| `ignore_index` | Whether to discard the existing index and generate a new one (default `False`) |
| `keys` | Labels used to build a hierarchical index over the concatenated objects |

**Example:**

```python
import pandas as pd

# Sample data
df1 = pd.DataFrame({'A': [1, 2, 3]})
df2 = pd.DataFrame({'A': [4, 5, 6]})

# Row concatenation with a regenerated index
result = pd.concat([df1, df2], axis=0, ignore_index=True)
print(result)
```

**Output:**

```
   A
0  1
1  2
2  3
3  4
4  5
5  6
```

### 3. `join()`: Join Based on Index

The `join()` method is a simplified join operation, typically used to join DataFrames on their index.

| **Parameter** | **Description** |
| --- | --- |
| `other` | Another DataFrame to join |
| `how` | Merge method: `'left'`, `'right'`, `'outer'`, or `'inner'` |
| `on` | Column in the calling DataFrame to match against the index of `other`; by default the join uses the index |

**Example:**

```python
import pandas as pd

# Sample data
left = pd.DataFrame({'A': [1, 2, 3]}, index=[1, 2, 3])
right = pd.DataFrame({'B': [4, 5, 6]}, index=[1, 2, 4])

# Inner join on the index: only labels 1 and 2 appear in both frames
result = left.join(right, how='inner')
print(result)
```

**Output:**

```
   A  B
1  1  4
2  2  5
```

* * *

## II. Pivot Tables and Cross Tables

Pandas provides the `pivot_table()` method to create pivot tables and the `crosstab()` function to compute cross tables. Both are well suited to summarizing and rearranging data.

### 1. `pivot_table()`: Create Pivot Table

| **Parameter** | **Description** |
| --- | --- |
| `data` | Input data |
| `values` | Column to aggregate |
| `index` | Column to use as the row index |
| `columns` | Column to use as the column index |
| `aggfunc` | Aggregation function; the default is `mean`, and it can also be `sum`, `count`, etc. |
| `fill_value` | Value used to fill missing cells |

**Example:**

```python
import pandas as pd

# Sample data
data = {'Date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04'],
        'Category': ['A', 'B', 'A', 'B'],
        'Sales': [100, 150, 200, 250]}
df = pd.DataFrame(data)

# Create pivot table: one row per date, one column per category
pivot_table = pd.pivot_table(df, values='Sales', index='Date',
                             columns='Category', aggfunc='sum', fill_value=0)
print(pivot_table)
```

**Output:**

```
Category      A    B
Date
2024-01-01  100    0
2024-01-02    0  150
2024-01-03  200    0
2024-01-04    0  250
```

### 2. `crosstab()`: Create Cross Table

| **Parameter** | **Description** |
| --- | --- |
| `index` | Row labels |
| `columns` | Column labels |
| `values` | Data to aggregate (optional) |
| `aggfunc` | Aggregation function, used together with `values`; when neither is given, `crosstab()` counts occurrences |

**Example:**

```python
import pandas as pd

# Sample data
data = {'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
        'Region': ['North', 'South', 'North', 'South', 'West', 'East']}
df = pd.DataFrame(data)

# Create cross table: count how often each Category/Region pair occurs
cross_table = pd.crosstab(df['Category'], df['Region'])
print(cross_table)
```

**Output:**

```
Region    East  North  South  West
Category
A            0      2      0     1
B            1      0      2     0
```

* * *

## III. Custom Function Application

Pandas provides several methods for applying custom functions during data cleaning and transformation.

### 1. `apply()`: Apply Function to DataFrame or Series

The `apply()` method applies a custom function to a DataFrame or Series. On a DataFrame it can operate along rows or columns.

| **Parameter** | **Description** |
| --- | --- |
| `func` | Function to apply |
| `axis` | `0` (default) applies the function to each column; `1` applies it to each row |
| `raw` | Whether to pass raw arrays instead of Series (default `False`) |
| `result_type` | Output type when `axis=1`: `'expand'`, `'reduce'`, or `'broadcast'` |

**Example:**

```python
import pandas as pd

# Sample data
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]})

# Define custom function
def custom_func(x):
    return x * 2

# Apply the function to every element of column A
df['A'] = df['A'].apply(custom_func)
print(df)
```

**Output:**

```
   A   B
0  2  10
1  4  20
2  6  30
3  8  40
```

### 2. `applymap()`: Apply Function on Entire DataFrame

`applymap()` is available only on DataFrames and operates on each element. Since pandas 2.1 it is deprecated in favor of `DataFrame.map()`, which behaves the same way.

| **Parameter** | **Description** |
| --- | --- |
| `func` | Function to apply |

**Example:**

```python
import pandas as pd

# Sample data
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Square every element (use df.map(...) on pandas 2.1 or later)
df = df.applymap(lambda x: x ** 2)
print(df)
```

**Output:**

```
   A   B
0  1  16
1  4  25
2  9  36
```

### 3. `map()`: Apply Function to Series

`map()` applies a function or a mapping to each element of a Series. Values with no match in a dictionary mapping become `NaN`.

| **Parameter** | **Description** |
| --- | --- |
| `arg` | Function, dictionary, or Series to apply |

**Example:**

```python
import pandas as pd

# Sample data
df = pd.DataFrame({'A': ['cat', 'dog', 'rabbit']})

# Map using a dictionary; 'rabbit' has no entry, so it becomes NaN
df['A'] = df['A'].map({'cat': 'kitten', 'dog': 'puppy'})
print(df)
```

**Output:**

```
        A
0  kitten
1   puppy
2     NaN
```

* * *

## IV. Grouping Operations and Aggregation

The `groupby()` method can be used for grouped aggregation, data transformation, and data filtering. It splits the data into groups by some key and then applies an operation such as sum, mean, or count to each group.

### 1. `groupby()`: Data Grouping

| **Parameter** | **Description** |
| --- | --- |
| `by` | Column or index to group by |
| `axis` | Axis to group along; default `0`, i.e., group rows (deprecated since pandas 2.1) |
| `level` | Index level to group by (for a MultiIndex) |

**Example:**

```python
import pandas as pd

# Sample data
df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Value': [10, 20, 30, 40, 50, 60]
})

# Group by Category and calculate the sum for each group
grouped = df.groupby('Category')['Value'].sum()
print(grouped)
```

**Output:**

```
Category
A     90
B    120
Name: Value, dtype: int64
```

### 2. Aggregation Operations (`agg()`)

`agg()` performs more complex aggregations. It accepts multiple functions, so several aggregate values can be calculated in a single call.

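When several aggregates are computed in one call, the result columns can also be given explicit names. The following is a minimal sketch using pandas' named-aggregation keyword syntax; the output column names `total` and `average` are hypothetical and chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Value': [10, 20, 30, 40, 50, 60]
})

# Named aggregation: each keyword becomes an output column,
# given as a (source column, aggregation function) pair
summary = df.groupby('Category').agg(
    total=('Value', 'sum'),
    average=('Value', 'mean'),
)
print(summary)
```

Naming the outputs this way avoids having to rename columns afterwards when the same source column is aggregated more than once.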
| **Parameter** | **Description** |
| --- | --- |
| `func` | Aggregation function: a string name, a custom function, or a list of these |

**Example:**

```python
import pandas as pd

# Sample data
df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Value': [10, 20, 30, 40, 50, 60]
})

# Use agg() for multiple aggregations; string names avoid the
# warning that recent pandas versions raise for Python builtins
grouped = df.groupby('Category')['Value'].agg(['sum', 'min', 'max'])
print(grouped)
```

**Output:**

```
          sum  min  max
Category
A          90   10   50
B         120   20   60
```

* * *

## V. Time Series Processing

Pandas provides time series functions for date parsing, frequency conversion, date range generation, window operations, and more.

### 1. `date_range()`: Generate Time Series

| **Parameter** | **Description** |
| --- | --- |
| `start` | Start date |
| `end` | End date |
| `periods` | Number of time points to generate |
| `freq` | Frequency (e.g., `D` for day, `h` for hour; pandas versions before 2.2 use `H` for hour) |

**Example:**

```python
import pandas as pd

# Generate five consecutive daily timestamps
date_range = pd.date_range(start='2024-01-01', periods=5, freq='D')
print(date_range)
```

**Output:**

```
DatetimeIndex(['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04',
               '2024-01-05'],
              dtype='datetime64[ns]', freq='D')
```

### 2. Date and Time Offset

`pd.Timedelta()` represents a duration that can be added to or subtracted from a timestamp.

**Example:**

```python
import pandas as pd

# Shift a date forward by 10 days
date = pd.to_datetime('2024-01-01')
new_date = date + pd.Timedelta(days=10)
print(new_date)
```

**Output:**

```
2024-01-11 00:00:00
```

### 3. Time Window Operations (Rolling, Expanding)

The `rolling()` and `expanding()` methods perform rolling and expanding window operations. They are commonly used to calculate moving averages in time series.

| **Method** | **Description** |
| --- | --- |
| `rolling()` | Computes over a fixed-size sliding window, commonly used for moving averages |
| `expanding()` | Computes over a window that grows from the first row, giving cumulative values |

**Example:**

```python
import pandas as pd

# Sample data
df = pd.DataFrame({'Value': [10, 20, 30, 40, 50]})

# Rolling mean over a window of 3 rows; the first two rows
# have fewer than 3 observations, so they are NaN
df['Rolling_Mean'] = df['Value'].rolling(window=3).mean()
print(df)
```

**Output:**

```
   Value  Rolling_Mean
0     10           NaN
1     20           NaN
2     30          20.0
3     40          30.0
4     50          40.0
```

* * *

## VI. Missing Value Handling

Pandas provides several methods for handling missing values (such as `NaN`). Common operations include filling and dropping them.

| **Method** | **Description** |
| --- | --- |
| `isna()` | Detects missing values and returns a boolean mask of the same shape |
| `fillna()` | Fills missing values |
| `dropna()` | Drops rows or columns that contain missing values |

**Example:**

```python
import pandas as pd
import numpy as np

# Sample data
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8]
})

# Fill missing values with 0 (the columns stay float because they held NaN)
df_filled = df.fillna(0)
print(df_filled)
```

**Output:**

```
     A    B
0  1.0  5.0
1  2.0  0.0
2  0.0  7.0
3  4.0  8.0
```

* * *

## VII. MultiIndex

Pandas provides MultiIndex, which lets a DataFrame or Series carry several index levels and is especially suitable for hierarchical data. With a MultiIndex, data can be grouped, selected, sliced, and aggregated by level.

### 1. Create MultiIndex

A MultiIndex can be created with `pd.MultiIndex.from_tuples()`, `pd.MultiIndex.from_product()`, or `set_index()`.

#### Method 1: `pd.MultiIndex.from_tuples()`

Creates a MultiIndex from a list of tuples. Each tuple is one index entry, and each position within the tuple corresponds to one index level.

| **Parameter** | **Description** |
| --- | --- |
| `tuples` | List of tuples; each tuple is one index entry |
| `names` | Name for each index level (optional) |

**Example:**

```python
import pandas as pd

# Create tuples
index_tuples = [('A', 1), ('A', 2), ('B', 1), ('B', 2)]

# Create MultiIndex
multi_index = pd.MultiIndex.from_tuples(index_tuples, names=['Letter', 'Number'])

# Create DataFrame
df = pd.DataFrame({'Value': [10, 20, 30, 40]}, index=multi_index)
print(df)
```

**Output:**

```
               Value
Letter Number
A      1          10
       2          20
B      1          30
       2          40
```

#### Method 2: `pd.MultiIndex.from_product()`

Creates a MultiIndex from the Cartesian product of several lists, which is convenient when the data has many dimensions.

| **Parameter** | **Description** |
| --- | --- |
| `iterables` | Multiple lists or arrays |
| `names` | Name for each index level (optional) |

**Example:**

```python
import pandas as pd

# Create multiple lists
index_v
```
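A Cartesian-product index of the kind described under Method 2 can be sketched as follows. This is a minimal illustration under assumptions, not the original example: the lists `letters` and `numbers` and the `Value` data are hypothetical and mirror the data used for `from_tuples()`.

```python
import pandas as pd

# Hypothetical level values; every combination becomes one index entry
letters = ['A', 'B']
numbers = [1, 2]

# Build the index from the Cartesian product of the two lists
multi_index = pd.MultiIndex.from_product([letters, numbers],
                                         names=['Letter', 'Number'])

# Attach the index to a DataFrame with one value per combination
df = pd.DataFrame({'Value': [10, 20, 30, 40]}, index=multi_index)
print(df)

# Select all rows under the outer label 'A'
print(df.loc['A'])
```

Here `from_product()` yields the same four entries that `from_tuples()` would need spelled out one by one, which is the main reason to prefer it when every combination of levels is present.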
