# Pandas Sampling and Random Data Generation
Pandas provides flexible tools for random sampling and, together with NumPy, for generating synthetic datasets. Random sampling comes up whenever you build machine learning models, perform statistical analysis, or test code with mock data.
This tutorial covers how to perform random sampling on Pandas DataFrames and Series, generate random data, and split datasets for machine learning.
---
## Random Sampling with `sample()`
The primary tool for sampling in Pandas is the `DataFrame.sample()` (or `Series.sample()`) method. It allows you to randomly select rows or columns from your dataset.
### Method Syntax
```python
DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False)
```
### Key Parameters
* **`n`**: Integer. The exact number of items to return. Cannot be used with `frac`.
* **`frac`**: Float. The fraction of axis items to return (e.g., `0.3` for 30% of the data). Cannot be used with `n`.
* **`replace`**: Boolean, default `False`. Determines whether sampling is done with replacement (allowing the same row to be selected multiple times).
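* **`weights`**: String or array-like. Relative weights that make some rows more likely to be selected. A string names a DataFrame column to use as weights (row sampling only). Weights that do not sum to 1 are normalized, and by default every row has equal probability.
* **`random_state`**: Integer, `numpy.random.RandomState`, or (in recent pandas versions) `numpy.random.Generator`. Seeds the random number generator so that results are reproducible.
* **`axis`**: `0` or `'index'` for rows, `1` or `'columns'` for columns. The default `None` samples rows for both Series and DataFrames.
* **`ignore_index`**: Boolean, default `False`. If `True`, the result is relabeled `0, 1, ..., n - 1` instead of keeping the original index labels.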
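The `weights` and `axis` parameters are easier to follow with a small example. The sketch below uses a hypothetical toy DataFrame (the column names `item`, `weight`, and `price` are assumptions made for illustration) to show biased row sampling and column sampling.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only
df = pd.DataFrame({
    "item": ["a", "b", "c", "d"],
    "weight": [0.0, 1.0, 3.0, 6.0],
    "price": [10, 20, 30, 40],
})

# Biased row sampling: rows with a larger "weight" are more likely to be drawn.
# The weights do not sum to 1, so pandas normalizes them.
# A weight of 0 means the row can never be selected.
rows = df.sample(n=2, weights="weight", random_state=0)

# Column sampling: axis=1 draws columns instead of rows
cols = df.sample(n=2, axis=1, random_state=0)

print(rows)
print(cols)
```

Because row `a` has a weight of 0, it should never appear in `rows`, whatever seed is used.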
---
## Code Examples
### 1. Basic Sampling (Count, Fraction, and Replacement)
The following example demonstrates how to extract a fixed number of rows, a percentage of rows, and how to sample with replacement.
```python
import pandas as pd
import numpy as np
# Create a large synthetic dataset
df = pd.DataFrame({
"ID": range(1, 1001),
"Value": np.random.randn(1000)
})
# 1. Randomly sample exactly 5 rows
print("Randomly sample 5 rows:")
print(df.sample(5))
print()
# 2. Randomly sample 30% of the dataset
print("Randomly sample 30% of the data:")
print(df.sample(frac=0.3))
print()
# 3. Sample with replacement (allows duplicate rows in the sample)
print("Sample 5 rows with replacement:")
print(df.sample(5, replace=True))
```
### 2. Ensuring Reproducibility with `random_state`
When sharing code or writing tests, you often need your random selections to be reproducible. Setting the `random_state` parameter (or seeding NumPy's generator) ensures that the same random rows are selected every time the script runs.
```python
import pandas as pd
import numpy as np
# Set the random seed for NumPy to make data generation reproducible
np.random.seed(42)
df = pd.DataFrame({
"ID": range(1, 11),
"Value": np.random.randn(10)
})
# Sample using a specific random_state
print("Using random_state=42:")
print(df.sample(3, random_state=42))
# Running again with the same random_state yields the exact same result
print("\nRunning again with the same random_state:")
print(df.sample(3, random_state=42))
```
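An integer seed and a shared generator object behave differently, which can be worth knowing when a script draws several samples. The following is a minimal sketch, assuming a pandas version that accepts `numpy.random.Generator` objects as `random_state` (1.4 or later); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": range(1, 1001)})

# An integer seed rebuilds the same generator state on every call,
# so repeated calls return identical rows
first = df.sample(5, random_state=42)
second = df.sample(5, random_state=42)
print(first.equals(second))  # True

# A shared Generator object advances its state between calls,
# so successive draws are independent of each other while the
# script as a whole remains reproducible from the single seed
rng = np.random.default_rng(42)
draw_a = df.sample(5, random_state=rng)
draw_b = df.sample(5, random_state=rng)
print(draw_a.index.tolist())
print(draw_b.index.tolist())
```

Passing one generator around is a reasonable choice when you need several different samples from a single seed; an integer seed is simpler when you need the same sample every time.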
---
## Generating Random Data
Pandas integrates seamlessly with NumPy to generate random Series and DataFrames. This is highly useful for creating mock datasets.
```python
import pandas as pd
import numpy as np
# 1. Generate random integers in the range [0, 10)
print("Random integers [0, 10):")
print(pd.Series(np.random.randint(0, 10, 5)))
print()
# 2. Generate random floats in the range [0.0, 1.0)
print("Random floats [0, 1):")
print(pd.Series(np.random.random(5)))
print()
# 3. Generate random numbers from a Standard Normal Distribution N(0, 1)
print("Standard Normal Distribution N(0, 1):")
print(pd.Series(np.random.randn(5)))
print()
# 4. Generate random numbers from a Normal Distribution with mean 10 and standard deviation 2
print("Normal Distribution (mean=10, std=2):")
print(pd.Series(np.random.normal(10, 2, 5)))
print()
# 5. Randomly select values from a custom list
choices = ["A", "B", "C", "D"]
print("Random choices from a list:")
print(pd.Series(np.random.choice(choices, 10)))
```
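The same kinds of random values can be combined into a single mock DataFrame. The sketch below uses NumPy's newer `default_rng` generator rather than the legacy `np.random.*` functions shown above; the column names and value ranges are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# A local generator with a fixed seed; it does not touch NumPy's global state
rng = np.random.default_rng(seed=42)

n_rows = 8
mock = pd.DataFrame({
    "user_id": np.arange(1, n_rows + 1),
    "age": rng.integers(18, 65, size=n_rows),            # integers in [18, 65)
    "score": rng.normal(loc=10, scale=2, size=n_rows),   # mean 10, std 2
    "group": rng.choice(["A", "B", "C", "D"], size=n_rows),
    "signup": pd.date_range("2024-01-01", periods=n_rows, freq="D"),
})

print(mock)
print(mock.dtypes)
```

A local generator keeps the mock data reproducible without changing the behavior of other code that relies on NumPy's global random state.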
---
## Splitting Data into Train and Test Sets
In machine learning workflows, you frequently need to split your dataset into training and testing subsets. You can achieve this using pure Pandas or by using the industry-standard `scikit-learn` library.
### Method 1: Using Pure Pandas
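You can extract a training set using `sample()` and then drop those index labels to get the test set. This relies on the DataFrame having a unique index; with duplicate labels, `drop()` would remove more rows than were sampled.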
```python
import pandas as pd
import numpy as np
# Simulate a machine learning dataset
df = pd.DataFrame({
"Feature1": np.random.randn(100),
"Feature2": np.random.randn(100),
"Target": np.random.choice([0, 1], 100)
})
# Split into Train and Test sets (80% / 20%)
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)
print(f"Training set: {len(train)} rows")
print(f"Testing set: {len(test)} rows")
```
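The same sample-then-drop idea extends to a three-way split. The sketch below is one possible approach, assuming a 60/20/20 train/validation/test split and a unique index; the proportions and variable names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Feature1": rng.normal(size=100),
    "Target": rng.integers(0, 2, size=100),
})

# 60% train, then split the remaining 40% evenly into validation and test
train = df.sample(frac=0.6, random_state=42)
rest = df.drop(train.index)
val = rest.sample(frac=0.5, random_state=42)
test = rest.drop(val.index)

# The three index sets should be disjoint and together cover every row
assert train.index.intersection(val.index).empty
assert train.index.intersection(test.index).empty
assert val.index.intersection(test.index).empty
assert len(train) + len(val) + len(test) == len(df)

print(len(train), len(val), len(test))  # 60 20 20
```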
### Method 2: Stratified Sampling with Scikit-Learn
When dealing with imbalanced datasets, a simple random split might distort the distribution of your target variable. Stratified sampling ensures that the train and test sets have approximately the same proportion of each class label as the input dataset.
```python
from sklearn.model_selection import train_test_split
# Reuses the `df` defined in Method 1.
# Stratify on the "Target" column so both subsets keep its class proportions
train, test = train_test_split(df, test_size=0.2, stratify=df["Target"], random_state=42)
print("Stratified Training Target distribution:")
print(train["Target"].value_counts(normalize=True))
print("\nStratified Testing Target distribution:")
print(test["Target"].value_counts(normalize=True))
```
*(Note: Running the stratified split code requires installing scikit-learn via `pip install scikit-learn`)*
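If you would rather not add a dependency, a stratified split can also be approximated in pandas alone. The sketch below is not the scikit-learn method; it swaps in `DataFrameGroupBy.sample()` (available in pandas 1.1 and later) and uses a deliberately imbalanced toy dataset made up for illustration.

```python
import pandas as pd

# Imbalanced toy data: 90 rows of class 0 and 10 rows of class 1
df = pd.DataFrame({
    "Feature1": range(100),
    "Target": [0] * 90 + [1] * 10,
})

# Draw 80% of each class separately, so class proportions are preserved
train = df.groupby("Target").sample(frac=0.8, random_state=42)
test = df.drop(train.index)

train_counts = train["Target"].value_counts().sort_index()
test_counts = test["Target"].value_counts().sort_index()

print(train_counts.to_dict())  # {0: 72, 1: 8}
print(test_counts.to_dict())   # {0: 18, 1: 2}
```

Both subsets keep the original 90/10 class ratio. As with the pure Pandas split above, this assumes the index is unique.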
---
## Key Considerations
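1. **Memory Usage**: `sample(frac=1.0)` returns a new DataFrame holding a shuffled copy of every row, so the original and the copy occupy memory at the same time. For a very large DataFrame, consider shuffling only an array of row positions (for example with `np.random.permutation(len(df))`) and reading rows through it as needed.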
2. **Index Preservation**: By default, `sample()` preserves the original index of the rows. If you want a clean, sequential index for your sample, set `ignore_index=True`.
3. **Out-of-Bounds Sampling**: If you set `replace=False` (the default) and request an `n` larger than the total number of rows in the DataFrame, Pandas will raise a `ValueError`. Set `replace=True` if you need to sample more items than exist in your dataset.
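
The index and out-of-bounds behaviors described above can be checked on a tiny DataFrame. This is a minimal sketch, assuming a pandas version that supports `ignore_index` in `sample()` (1.3 or later).

```python
import pandas as pd

df = pd.DataFrame({"x": range(5)})

# Shuffle all rows; ignore_index=True relabels the result 0..n-1
shuffled = df.sample(frac=1.0, random_state=0, ignore_index=True)
print(shuffled.index.tolist())  # [0, 1, 2, 3, 4]

# Asking for more rows than exist fails when sampling without replacement
error_message = None
try:
    df.sample(n=10)
except ValueError as exc:
    error_message = str(exc)
    print("ValueError:", error_message)

# With replacement the request succeeds, at the cost of repeated rows
oversampled = df.sample(n=10, replace=True, random_state=0)
print(len(oversampled))  # 10
```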