# Pandas Advanced
Pandas provides data manipulation functions for complex cleaning, analysis, aggregation, and time series tasks. This section covers its advanced features: merging and joining, pivot tables, custom function application, grouping, time series, missing values, and MultiIndex.
* * *
## I. Data Merging and Joining
Pandas provides several ways to combine DataFrames: `merge()`, `concat()`, and `join()`. They are the usual tools for working with data spread across multiple datasets.
### 1. `merge()`: Database-style Join
The `merge()` method allows combining two DataFrames based on certain columns, similar to `JOIN` operations in SQL. It supports inner join, outer join, left join, and right join.
| **Parameter** | **Description** |
| --- | --- |
| `left` | Left DataFrame |
| `right` | Right DataFrame |
| `how` | Merge method: `'inner'` (default), `'outer'`, `'left'`, `'right'`, or `'cross'` |
| `on` | Column name for joining (if column names differ on both sides, use `left_on` and `right_on`) |
| `left_on` | Join column for left DataFrame |
| `right_on` | Join column for right DataFrame |
| `suffixes` | Suffixes to add to distinguish duplicate column names |
## Example
```python
import pandas as pd

# Sample data
left = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
right = pd.DataFrame({'ID': [1, 2, 4], 'Age': [24, 27, 22]})

# Inner join using merge
result = pd.merge(left, right, on='ID', how='inner')
print(result)
```
**Output:**
```
   ID   Name  Age
0   1  Alice   24
1   2    Bob   27
```
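When the key columns have different names, or both frames share a non-key column, the `left_on`, `right_on`, and `suffixes` parameters from the table above come into play. The following is a minimal sketch: the `employees` and `reviews` frames and their column names are made up for illustration, and `indicator=True` (a standard `merge()` option not listed in the table) adds a `_merge` column that records where each row came from.

```python
import pandas as pd

# Hypothetical sample data: the key is called 'emp_id' on one side and
# 'staff_id' on the other, and both frames have a 'Score' column.
employees = pd.DataFrame({'emp_id': [1, 2, 3], 'Score': [80, 85, 90]})
reviews = pd.DataFrame({'staff_id': [1, 2, 4], 'Score': [70, 75, 60]})

merged = pd.merge(
    employees,
    reviews,
    left_on='emp_id',                # key column in the left frame
    right_on='staff_id',             # key column in the right frame
    how='outer',                     # keep unmatched rows from both sides
    suffixes=('_self', '_manager'),  # disambiguate the two 'Score' columns
    indicator=True,                  # adds '_merge': both / left_only / right_only
)
print(merged)
```

With an outer join, keys 1 and 2 are marked `both`, key 3 is `left_only`, and key 4 is `right_only`, which makes unmatched rows easy to audit.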
### 2. `concat()`: Concatenate Along Axis
`concat()` is used to connect multiple DataFrames along a specified axis (rows or columns), commonly used for row concatenation (vertical join) or column concatenation (horizontal join).
| **Parameter** | **Description** |
| --- | --- |
| `objs` | Sequence of DataFrame or Series objects to concatenate |
| `axis` | Axis for concatenation: `0` stacks rows, `1` places objects side by side as columns |
| `ignore_index` | Whether to discard the existing index and generate a new one (default `False`) |
| `keys` | Labels, one per input object, used to build an outer level of a hierarchical index |
## Example
```python
import pandas as pd

# Sample data
df1 = pd.DataFrame({'A': [1, 2, 3]})
df2 = pd.DataFrame({'A': [4, 5, 6]})

# Row concatenation
result = pd.concat([df1, df2], axis=0, ignore_index=True)
print(result)
```
**Output:**
```
   A
0  1
1  2
2  3
3  4
4  5
5  6
```
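The `keys` and `axis=1` options listed in the parameter table are not exercised by the example above. A small sketch of both follows; the labels `'first'` and `'second'` and the renamed column `'B'` are arbitrary choices for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3]})
df2 = pd.DataFrame({'A': [4, 5, 6]})

# keys= labels each input, producing a two-level row index
stacked = pd.concat([df1, df2], keys=['first', 'second'])
print(stacked)
print(stacked.loc['second'])  # only the rows that came from df2

# axis=1 aligns on the row index and places the frames side by side
side_by_side = pd.concat([df1, df2.rename(columns={'A': 'B'})], axis=1)
print(side_by_side)
```

Keeping the source label in the index is often handy when the origin of each row matters later, whereas `ignore_index=True` discards it.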
### 3. `join()`: Join Based on Index
The `join()` method is a convenience wrapper for index-based joins: it combines the calling DataFrame with one or more other DataFrames by matching their indexes.
| **Parameter** | **Description** |
| --- | --- |
| `other` | Another DataFrame to join |
| `how` | Merge method: `'left'` (default), `'right'`, `'outer'`, or `'inner'` |
| `on` | Column in the calling DataFrame to match against the index of `other`; by default the calling DataFrame's index is used |
## Example
```python
import pandas as pd

# Sample data
left = pd.DataFrame({'A': [1, 2, 3]}, index=[1, 2, 3])
right = pd.DataFrame({'B': [4, 5, 6]}, index=[1, 2, 4])

# Join using join
result = left.join(right, how='inner')
print(result)
```
**Output:**
```
   A  B
1  1  4
2  2  5
```
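The `on` parameter lets a column of the calling DataFrame be matched against the index of the other one. The sketch below assumes a hypothetical `orders` table with a `customer_id` column and a `customers` table indexed by customer ID; neither appears in the original example.

```python
import pandas as pd

# Hypothetical data: 'customers' is indexed by customer ID
orders = pd.DataFrame({'order_id': [101, 102, 103],
                       'customer_id': [1, 2, 4]})
customers = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie']},
                         index=[1, 2, 3])

# Match orders['customer_id'] against customers.index.
# how='left' keeps every order, even when no customer matches.
result = orders.join(customers, on='customer_id', how='left')
print(result)
```

The order for customer 4 has no match, so its `name` is `NaN`, while the original row index of `orders` is preserved.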
* * *
## II. Pivot Tables and Cross Tables
Pandas provides `pivot_table()` for building pivot tables and `crosstab()` for cross-tabulations. Both summarize data by grouping on one key for the rows and spreading a second key across the columns.
### 1. `pivot_table()`: Create Pivot Table
| **Parameter** | **Description** |
| --- | --- |
| `data` | Input data |
| `values` | Column to aggregate |
| `index` | Column to use as row index |
| `columns` | Column to use as column index |
| `aggfunc` | Aggregation function, default is `mean`, can be `sum`, `count`, etc. |
| `fill_value` | Value that replaces missing cells in the resulting table |
## Example
```python
import pandas as pd

# Sample data
data = {'Date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04'],
        'Category': ['A', 'B', 'A', 'B'],
        'Sales': [100, 150, 200, 250]}
df = pd.DataFrame(data)

# Create pivot table
pivot_table = pd.pivot_table(df, values='Sales', index='Date',
                             columns='Category', aggfunc='sum', fill_value=0)
print(pivot_table)
```
**Output:**
```
Category      A    B
Date
2024-01-01  100    0
2024-01-02    0  150
2024-01-03  200    0
2024-01-04    0  250
```
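In the example above every (Date, Category) pair occurs at most once, so `aggfunc` has nothing to combine. The sketch below uses made-up data with repeated pairs to show the aggregation at work, and adds `margins=True` (a standard `pivot_table()` option not listed in the table) for row and column totals.

```python
import pandas as pd

# Hypothetical data with repeated (Region, Category) pairs
df = pd.DataFrame({
    'Region':   ['North', 'North', 'South', 'South', 'North'],
    'Category': ['A', 'A', 'A', 'B', 'B'],
    'Sales':    [100, 50, 200, 250, 30],
})

summary = pd.pivot_table(
    df,
    values='Sales',
    index='Region',
    columns='Category',
    aggfunc='sum',         # the two North/A rows are added: 100 + 50 = 150
    fill_value=0,
    margins=True,          # append totals for each row and column
    margins_name='Total',
)
print(summary)
```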
### 2. `crosstab()`: Create Cross Table
| **Parameter** | **Description** |
| --- | --- |
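| `index` | Values to group by in the rows |
| `columns` | Values to group by in the columns |
| `values` | Values to aggregate (optional; requires `aggfunc`) |
| `aggfunc` | Aggregation function applied to `values`; when both are omitted, `crosstab()` counts occurrences |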
## Example
```python
import pandas as pd

# Sample data
data = {'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
        'Region': ['North', 'South', 'North', 'South', 'West', 'East']}
df = pd.DataFrame(data)

# Create cross table
cross_table = pd.crosstab(df['Category'], df['Region'])
print(cross_table)
```
**Output:**
```
Region    East  North  South  West
Category
A            0      2      0     1
B            1      0      1     0
```
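Beyond plain counts, `crosstab()` can aggregate a third column through `values` and `aggfunc`, or report proportions through `normalize` (a standard option not listed in the table). The `Sales` column in this sketch is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Region':   ['North', 'South', 'North', 'South', 'West', 'East'],
    'Sales':    [10, 20, 30, 40, 50, 60],   # hypothetical values
})

# Sum Sales per (Category, Region); combinations that never occur are NaN
sales_by_cell = pd.crosstab(df['Category'], df['Region'],
                            values=df['Sales'], aggfunc='sum')
print(sales_by_cell)

# Share of each Region within a Category; every row sums to 1
row_share = pd.crosstab(df['Category'], df['Region'], normalize='index')
print(row_share)
```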
* * *
## III. Custom Function Application
Pandas provides multiple methods to apply custom functions for data cleaning and transformation.
### 1. `apply()`: Apply Function to DataFrame or Series
The `apply()` method allows applying custom functions to DataFrame or Series, supporting operations on rows or columns.
| **Parameter** | **Description** |
| --- | --- |
| `func` | Function to apply |
| `axis` | `0` (default) applies the function to each column; `1` applies it to each row |
| `raw` | Whether to pass each row or column as a NumPy array instead of a Series (default `False`) |
| `result_type` | Output shape when `axis=1`: `'expand'`, `'reduce'`, or `'broadcast'` |
## Example
```python
import pandas as pd

# Sample data
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]})

# Define custom function
def custom_func(x):
    return x * 2

# Apply function on column
df['A'] = df['A'].apply(custom_func)
print(df)
```
**Output:**
```
   A   B
0  2  10
1  4  20
2  6  30
3  8  40
```
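The example above applies a function to a single Series. On a DataFrame, the `axis` and `result_type` parameters decide what the function receives and how its results are assembled. A brief sketch, with column names such as `Total`, `row_min`, and `row_max` chosen only for illustration:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]})

# axis=1: the function receives one row at a time (as a Series)
df['Total'] = df.apply(lambda row: row['A'] + row['B'], axis=1)

# axis=0 (default): the function receives one column at a time
col_range = df[['A', 'B']].apply(lambda col: col.max() - col.min())

# result_type='expand' turns list-like results into separate columns
stats = df[['A', 'B']].apply(lambda row: [row.min(), row.max()],
                             axis=1, result_type='expand')
stats.columns = ['row_min', 'row_max']

print(df)
print(col_range)
print(stats)
```

For simple arithmetic like `A + B`, the vectorized form `df['A'] + df['B']` is usually faster; row-wise `apply()` is mainly useful when the logic cannot be expressed with column operations.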
### 2. `applymap()`: Apply Function on Entire DataFrame
`applymap()` is available only on DataFrame and applies a function to every element. Since pandas 2.1 it is deprecated in favor of `DataFrame.map()`, which does the same element-wise work.
| **Parameter** | **Description** |
| --- | --- |
| `func` | Function to apply |
## Example
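```python
import pandas as pd

# Sample data
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Apply custom function on every element of the DataFrame
# (on pandas 2.1 or later, df.map(lambda x: x ** 2) is the preferred spelling)
df = df.applymap(lambda x: x ** 2)
print(df)
```
**Output:**
```
   A   B
0  1  16
1  4  25
2  9  36
```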
### 3. `map()`: Apply Function to Series
`map()` can apply a function or mapping relationship to each element in a Series.
| **Parameter** | **Description** |
| --- | --- |
| `arg` | Function, dictionary, or Series to apply |
## Example
```python
import pandas as pd

# Sample data
df = pd.DataFrame({'A': ['cat', 'dog', 'rabbit']})

# Map using dictionary; values missing from the dictionary become NaN
df['A'] = df['A'].map({'cat': 'kitten', 'dog': 'puppy'})
print(df)
```
**Output:**
```
        A
0  kitten
1   puppy
2     NaN
```
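As the output shows, values that are absent from the dictionary turn into `NaN`. The sketch below shows one common way to keep the original value instead, and also passes a function rather than a dictionary. The variable names are illustrative.

```python
import pandas as pd

s = pd.Series(['cat', 'dog', 'rabbit'])
mapping = {'cat': 'kitten', 'dog': 'puppy'}

# Fall back to the original value where the mapping has no entry
with_default = s.map(mapping).fillna(s)
print(with_default)

# map() also accepts a function, applied to each element
lengths = s.map(len)
print(lengths)
```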
* * *
## IV. Grouping Operations and Aggregation
The `groupby()` method splits data into groups by one or more keys. Each group can then be aggregated (sum, mean, count, and so on), transformed, or filtered.
### 1. `groupby()`: Data Grouping
| **Parameter** | **Description** |
| --- | --- |
| `by` | Column or index to group by |
| `axis` | Axis for grouping, default `0` (group rows); deprecated since pandas 2.1 |
| `level` | Group by index level (applicable for MultiIndex) |
## Example
```python
import pandas as pd

# Sample data
df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Value': [10, 20, 30, 40, 50, 60]
})

# Group by Category column and calculate sum for each group
grouped = df.groupby('Category')['Value'].sum()
print(grouped)
```
**Output:**
```
Category
A     90
B    120
Name: Value, dtype: int64
```
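Besides aggregation, the section introduction mentions transformation and filtering. A minimal sketch of both on the same sample data follows; the `Group_Mean` column name and the threshold of 100 are arbitrary choices.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Value': [10, 20, 30, 40, 50, 60]
})

# transform() returns a result aligned with the original rows,
# so each row can carry its own group's mean
df['Group_Mean'] = df.groupby('Category')['Value'].transform('mean')

# filter() keeps or drops whole groups based on a per-group condition
large_groups = df.groupby('Category').filter(lambda g: g['Value'].sum() > 100)

print(df)
print(large_groups)
```

Group A sums to 90 and group B to 120, so only the rows of group B pass the filter.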
### 2. Aggregation Operations (`agg()`)
`agg()` performs more complex aggregations: passing several functions computes multiple aggregate values in one call.
| **Parameter** | **Description** |
| --- | --- |
| `func` | Aggregation function: a function name as a string, a callable, or a list or dictionary of these |
## Example
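```python
import pandas as pd

# Sample data
df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Value': [10, 20, 30, 40, 50, 60]
})

# Use agg() for multiple aggregation operations.
# String names are used instead of the Python builtins sum/min/max,
# which trigger a FutureWarning in recent pandas versions.
grouped = df.groupby('Category')['Value'].agg(['sum', 'min', 'max'])
print(grouped)
```
**Output:**
```
          sum  min  max
Category
A          90   10   50
B         120   20   60
```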
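When the result columns need descriptive names, or a custom function is involved, named aggregation is a readable alternative to passing a list. In this sketch the output names `total`, `smallest`, and `spread` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Value': [10, 20, 30, 40, 50, 60]
})

# Each keyword becomes an output column: name=(source column, function)
summary = df.groupby('Category').agg(
    total=('Value', 'sum'),
    smallest=('Value', 'min'),
    spread=('Value', lambda s: s.max() - s.min()),  # custom function
)
print(summary)
```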
* * *
## V. Time Series Processing
Pandas provides time series tools for date parsing, frequency conversion, date range generation, and time window operations.
### 1. `date_range()`: Generate Time Series
| **Parameter** | **Description** |
| --- | --- |
| `start` | Start date |
| `end` | End date |
| `periods` | Number of time points to generate |
| `freq` | Frequency (e.g., `D` for day, `h` for hour; the uppercase `H` alias is deprecated since pandas 2.2) |
## Example
```python
import pandas as pd

# Generate time series
date_range = pd.date_range(start='2024-01-01', periods=5, freq='D')
print(date_range)
```
**Output:**
```
DatetimeIndex(['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04',
               '2024-01-05'],
              dtype='datetime64[ns]', freq='D')
```
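A range can also be bounded by `start` and `end` instead of `periods`, and a series indexed by such a range can be converted to a coarser frequency with `resample()`, which is one way to do the frequency conversion mentioned in the section introduction. The series values below are placeholders.

```python
import pandas as pd

# Two full weeks of daily timestamps; 2024-01-01 is a Monday
idx = pd.date_range(start='2024-01-01', end='2024-01-14', freq='D')
daily = pd.Series(range(14), index=idx)  # placeholder values 0..13

# Downsample to weekly totals; 'W' bins end on Sunday by default
weekly = daily.resample('W').sum()
print(weekly)
```

The two bins are labeled with their closing Sundays, 2024-01-07 and 2024-01-14.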
### 2. Date and Time Offset
`pd.Timedelta()` represents a fixed duration that can be added to or subtracted from a timestamp.
## Example
```python
import pandas as pd

# Date offset
date = pd.to_datetime('2024-01-01')
new_date = date + pd.Timedelta(days=10)
print(new_date)
```
**Output:**
```
2024-01-11 00:00:00
```
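`Timedelta` only covers fixed-length units such as days and hours. For calendar-aware steps such as months, `pd.DateOffset` is the usual companion. The dates in this sketch are chosen to show the difference.

```python
import pandas as pd

date = pd.to_datetime('2024-01-31')

# Subtracting a fixed duration
earlier = date - pd.Timedelta(hours=36)
print(earlier)      # 2024-01-29 12:00:00

# Calendar-aware shift: one month after Jan 31 lands on the
# last valid day of February (2024 is a leap year)
next_month = date + pd.DateOffset(months=1)
print(next_month)   # 2024-02-29 00:00:00

# Subtracting two timestamps yields a Timedelta
gap = pd.to_datetime('2024-03-01') - date
print(gap.days)     # 30
```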
### 3. Time Window Operations (Rolling, Expanding)
The `rolling()` and `expanding()` methods compute statistics over rolling and expanding windows. A common use is the moving average of a time series.
| **Method** | **Description** |
| --- | --- |
| `rolling()` | Computes a statistic over a fixed-size sliding window, e.g., a moving average |
| `expanding()` | Computes a statistic over all values seen so far, i.e., cumulative results |
## Example
```python
import pandas as pd

# Sample data
df = pd.DataFrame({'Value': [10, 20, 30, 40, 50]})

# Calculate rolling mean over a window of 3 rows
# (the first two rows have fewer than 3 values, so they are NaN)
df['Rolling_Mean'] = df['Value'].rolling(window=3).mean()
print(df)
```
**Output:**
```
   Value  Rolling_Mean
0     10           NaN
1     20           NaN
2     30          20.0
3     40          30.0
4     50          40.0
```
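The table lists `expanding()`, but the example only covers `rolling()`. The sketch below adds an expanding mean and also shows `min_periods`, a standard `rolling()` option that avoids the leading `NaN` values. The new column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Value': [10, 20, 30, 40, 50]})

# Expanding window: mean of all values up to and including each row
df['Expanding_Mean'] = df['Value'].expanding().mean()

# Rolling window that yields a result as soon as 1 value is available
df['Rolling_Mean_Min1'] = df['Value'].rolling(window=3, min_periods=1).mean()

print(df)
```

The expanding mean runs 10, 15, 20, 25, 30, while the rolling mean with `min_periods=1` runs 10, 15, 20, 30, 40.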
* * *
## VI. Missing Value Handling
Pandas provides multiple methods to handle missing values (such as NaN). Common operations include filling missing values, deleting missing values, etc.
| **Method** | **Description** |
| --- | --- |
| `isna()` | Return a boolean mask marking missing values |
| `fillna()` | Fill missing values |
| `dropna()` | Delete rows or columns containing missing values |
## Example
```python
import pandas as pd
import numpy as np

# Sample data
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8]
})

# Fill missing values
df_filled = df.fillna(0)
print(df_filled)
```
**Output:**
```
     A    B
0  1.0  5.0
1  2.0  0.0
2  0.0  7.0
3  4.0  8.0
```
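The other two methods in the table, `isna()` and `dropna()`, can be sketched on the same data. Filling with the column mean is shown as one common alternative to a constant; whether it is appropriate depends on the data.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8]
})

# Count missing values per column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Drop every row that contains at least one missing value
complete_rows = df.dropna()
print(complete_rows)

# Fill each column's gaps with that column's mean
filled_with_mean = df.fillna(df.mean())
print(filled_with_mean)
```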
* * *
## VII. MultiIndex
A MultiIndex gives a DataFrame or Series several index levels, which suits hierarchical data. With a MultiIndex, data can be grouped, selected, sliced, and aggregated by level.
### 1. Create MultiIndex
MultiIndex can be created through `pd.MultiIndex.from_tuples()`, `pd.MultiIndex.from_product()`, or `set_index()` methods.
#### Method 1: `pd.MultiIndex.from_tuples()`
Builds a MultiIndex from a list of tuples. Each tuple is one index entry, with one element per index level.
| **Parameter** | **Description** |
| --- | --- |
| `tuples` | Each tuple corresponds to an index value |
| `names` | Name for each index level (optional) |
## Example
```python
import pandas as pd

# Create tuples
index_tuples = [('A', 1), ('A', 2), ('B', 1), ('B', 2)]

# Create MultiIndex
multi_index = pd.MultiIndex.from_tuples(index_tuples, names=['Letter', 'Number'])

# Create DataFrame
df = pd.DataFrame({'Value': [10, 20, 30, 40]}, index=multi_index)
print(df)
```
**Output:**
```
               Value
Letter Number
A      1          10
       2          20
B      1          30
       2          40
```
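Once a MultiIndex is in place, the selection and aggregation operations mentioned in the section introduction work per level. The following sketch reuses the DataFrame from the example above and shows a few common access patterns. It is not an exhaustive list.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('A', 1), ('A', 2), ('B', 1), ('B', 2)], names=['Letter', 'Number'])
df = pd.DataFrame({'Value': [10, 20, 30, 40]}, index=index)

# All rows under the outer label 'A' (the result is indexed by Number)
group_a = df.loc['A']

# A single cell, addressed by the full index tuple
single = df.loc[('B', 2), 'Value']

# Cross-section on an inner level: every row where Number == 1
number_one = df.xs(1, level='Number')

# Aggregate by one index level
by_letter = df.groupby(level='Letter')['Value'].sum()

print(group_a, single, number_one, by_letter, sep='\n\n')
```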
#### Method 2: `pd.MultiIndex.from_product()`
Builds a MultiIndex from the Cartesian product of several iterables, which is convenient when every combination of level values should appear.
| **Parameter** | **Description** |
| --- | --- |
| `iterables` | Multiple lists or arrays |
| `names` | Name for each index level (optional) |
## Example
import pandas as pd
# Create multiple lists
index_v