PyTorch torch.nn.LayerNorm
* * *
`torch.nn.LayerNorm` is the layer normalization module in PyTorch.
Unlike batch normalization, layer normalization normalizes along the feature dimensions of a single sample and does not depend on the batch size.
### Function Definition
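torch.nn.LayerNorm(normalized_shape, eps=1e-05, elementwise_affine=True, bias=True, device=None, dtype=None)
**Parameter Description:**
* `normalized_shape` (int, list, or torch.Size): The shape of the trailing dimensions to normalize over. An int is treated as a single-element list, so normalization runs over the last dimension, which must have that size.
* `eps` (float): Value added to the variance for numerical stability. Default is 1e-5.
* `elementwise_affine` (bool): Whether to use learnable per-element scaling (`weight`) and shifting (`bias`). Default is True.
* `bias` (bool): If False, no additive bias is learned. Only relevant when `elementwise_affine=True`; available in PyTorch 2.1 and later. Default is True.
* `device`, `dtype`: Optional device and data type for the module's parameters.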
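As a small sketch of how `normalized_shape` behaves when it names more than one dimension, the following normalizes over the last two dimensions together and inspects the learnable parameters. The tensor sizes are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Normalize jointly over the last two dimensions (10, 64)
ln = nn.LayerNorm([10, 64])
x = torch.randn(4, 10, 64)
y = ln(x)

# weight (gamma) and bias (beta) have the same shape as normalized_shape
weight_shape = tuple(ln.weight.shape)
bias_shape = tuple(ln.bias.shape)
print("weight shape:", weight_shape)
print("bias shape:", bias_shape)

# Each sample has approximately zero mean over its last two dimensions
per_sample_mean = y.mean(dim=(1, 2))
mean_is_zero = torch.allclose(per_sample_mean, torch.zeros(4), atol=1e-5)
print("Per-sample mean close to 0:", mean_is_zero)
```

Passing a single int, as in the examples later on this page, is the common case and normalizes over the last dimension only.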
### Mathematical Principle
Layer normalization formula:
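y = (x - E[x]) / sqrt(Var[x] + eps) * gamma + beta
Here `E[x]` and `Var[x]` are the mean and the biased variance computed over the trailing dimensions given by `normalized_shape`, and `gamma` and `beta` are the learnable `weight` and `bias` (present only when `elementwise_affine=True`).
Difference from batch normalization: Layer normalization computes the mean and variance separately for each sample, over the trailing feature dimensions given by `normalized_shape`, whereas batch normalization computes them per channel across the samples in a batch.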
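To make the formula concrete, here is a minimal sketch that applies it by hand to a small random tensor and compares the result with `nn.LayerNorm`. It assumes the default initialization (`weight` of ones, `bias` of zeros) and normalization over the last dimension only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

x = torch.randn(4, 10)
ln = nn.LayerNorm(10)

# Per-sample statistics over the last dimension
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)  # biased variance (divide by N)

# y = (x - E[x]) / sqrt(Var[x] + eps) * gamma + beta
manual = (x - mean) / torch.sqrt(var + ln.eps) * ln.weight + ln.bias

matches = torch.allclose(manual, ln(x), atol=1e-6)
print("Manual result matches nn.LayerNorm:", matches)
```

Note the use of the biased variance (`unbiased=False`); using the unbiased estimator would give slightly different values.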
* * *
## Usage Examples
### Example 1: Basic Usage
Perform layer normalization on features:
import torch
import torch.nn as nn
# Layer normalization: normalize along the last dimension
ln = nn.LayerNorm(normalized_shape=10)
# Input: batch=4, feature dimension=10
x = torch.randn(4,10)
# Forward pass
output = ln(x)
print("Input shape:", x.shape)
print("Output shape:", output.shape)
print("\nOriginal input first row:", x[0].tolist())
print("Normalized first row:", output[0].tolist())
### Example 2: Multi-dimensional Input
Process 3D or 4D input:
import torch
import torch.nn as nn
# Normalize along the feature (last) dimension of a sequence input: (batch, seq, features)
ln_seq = nn.LayerNorm(normalized_shape=64)
# 3D input
x_3d = torch.randn(2,10,64)
output_3d = ln_seq(x_3d)
print("3D input:", x_3d.shape,"-> output:", output_3d.shape)
# 4D input (e.g., images): (batch, height, width, channels)
# LayerNorm normalizes along the last channel dimension
ln_channel = nn.LayerNorm(normalized_shape=128)
x_4d = torch.randn(2,8,8,128)
output_4d = ln_channel(x_4d)
print("4D input:", x_4d.shape,"-> output:", output_4d.shape)
### Example 3: Using in Transformer
Typical LayerNorm usage in a Post-LN Transformer block, where normalization is applied after each residual addition:
import torch
import torch.nn as nn
class TransformerBlock(nn.Module):
    def __init__(self, d_model, nhead):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.GELU(),
            nn.Linear(d_model * 4, d_model)
        )

    def forward(self, x):
        # Self-attention with residual
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)
        # FFN with residual
        ffn_out = self.ffn(x)
        x = self.norm2(x + ffn_out)
        return x

# Test
block = TransformerBlock(d_model=512, nhead=8)
x = torch.randn(4, 100, 512)  # (batch, seq, d_model)
output = block(x)
print("Input shape:", x.shape)
print("Output shape:", output.shape)
### Example 4: Without Learnable Parameters
Pure normalization without scaling and shifting:
import torch
import torch.nn as nn
# Without learnable parameters
ln = nn.LayerNorm(16, elementwise_affine=False)
x = torch.randn(4,16)
output = ln(x)
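# weight and bias still exist as attributes but are None when elementwise_affine=False,
# so hasattr() would return True; check for None instead
print("weight is None:", ln.weight is None)
print("bias is None:", ln.bias is None)
print("\nOutput shape:", output.shape)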
* * *
## Comparison of Normalization Methods
| **Method** | **Normalization Dimension** | **Batch Dependency** | **Applicable Scenarios** |
| --- | --- | --- | --- |
| `BatchNorm` | Batch (and spatial) dimensions, per channel | Yes | CNNs with large, stable batches |
| `LayerNorm` | Feature dimensions, per sample | No | Transformer, RNN |
| `InstanceNorm` | Spatial dimensions, per sample and channel | No | Style transfer |
| `GroupNorm` | Channel groups and spatial dimensions, per sample | No | Small batch scenarios |
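The normalization dimensions in the table can be checked empirically. The sketch below applies each method to the same `(N, C, H, W)` tensor and verifies over which axes the output mean is approximately zero. The tensor size, the group count, and the `near_zero` helper are illustrative assumptions, and all modules are left in training mode with default parameters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 4, 5, 5)  # (N, C, H, W)

# Default parameters (scale 1, shift 0), so outputs are plain normalized values
bn_out = nn.BatchNorm2d(4)(x)                       # uses batch statistics in training mode
ln_out = nn.LayerNorm([4, 5, 5])(x)
in_out = nn.InstanceNorm2d(4)(x)
gn_out = nn.GroupNorm(num_groups=2, num_channels=4)(x)

def near_zero(t):
    # Hypothetical helper: True if every entry is close to zero
    return bool(t.abs().max() < 1e-4)

bn_ok = near_zero(bn_out.mean(dim=(0, 2, 3)))            # one mean per channel
ln_ok = near_zero(ln_out.mean(dim=(1, 2, 3)))            # one mean per sample
in_ok = near_zero(in_out.mean(dim=(2, 3)))               # one mean per (sample, channel)
gn_ok = near_zero(gn_out.reshape(8, 2, -1).mean(dim=2))  # one mean per (sample, group)

print("BatchNorm:", bn_ok, "LayerNorm:", ln_ok, "InstanceNorm:", in_ok, "GroupNorm:", gn_ok)
```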
* * *
## Frequently Asked Questions
### Q1: What is the difference between LayerNorm and BatchNorm?
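BatchNorm computes statistics per channel across the samples in a batch, so its behavior depends on the batch size and differs between training (batch statistics) and evaluation (running statistics). LayerNorm computes statistics per sample across the feature dimensions, so it does not depend on the batch and behaves the same in training and evaluation. This makes it suitable for sequence models and for small or varying batch sizes.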
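A short sketch of the batch dependency: the same two samples are normalized once as part of a batch of 8 and once as a batch of 2. The sizes are arbitrary, and `BatchNorm1d` is left in training mode so that it uses batch statistics.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 16)

ln = nn.LayerNorm(16)
bn = nn.BatchNorm1d(16)  # training mode by default, so it uses batch statistics

# Normalize the first two samples inside the full batch, then as a smaller batch
ln_same = torch.allclose(ln(x)[:2], ln(x[:2]), atol=1e-6)
bn_same = torch.allclose(bn(x)[:2], bn(x[:2]), atol=1e-6)

print("LayerNorm output independent of batch:", ln_same)
print("BatchNorm output independent of batch:", bn_same)
```

With LayerNorm the two results agree, while with BatchNorm they differ because the batch statistics change with the batch contents.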
### Q2: How to choose normalized_shape?
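Usually the size of the feature (hidden) dimension is used, such as 768 in BERT-base or 1024 in BERT-large. `normalized_shape` must match the trailing dimensions of the input.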
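The following sketch illustrates the shape requirement, assuming a hidden size of 768 and a deliberately wrong value of 512 for comparison. In current PyTorch versions a mismatch is expected to surface as a `RuntimeError` at call time rather than at construction.

```python
import torch
import torch.nn as nn

x = torch.randn(2, 10, 768)  # (batch, seq, hidden)

# normalized_shape matches the last dimension of the input
ok = nn.LayerNorm(768)(x)
print("Output shape:", tuple(ok.shape))

# normalized_shape does not match the last dimension of the input
try:
    nn.LayerNorm(512)(x)
    raised = False
except RuntimeError:
    raised = True

print("Mismatched normalized_shape raised an error:", raised)
```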
### Q3: Why does Transformer use LayerNorm?
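Transformer inputs vary in sequence length and batch size. LayerNorm normalizes each token over its feature dimension only, so its statistics are unaffected by other tokens, padding, or other samples in the batch, which makes it more stable than batch-based normalization in this setting.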
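As an illustration of why variable sequence length is not a problem, this sketch normalizes a sequence of 5 tokens and the same sequence padded with zeros to 8 tokens, then compares the outputs for the original tokens. The sizes and the zero padding are assumptions chosen for the demonstration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
ln = nn.LayerNorm(32)

short = torch.randn(1, 5, 32)                # (batch, seq=5, features)
padding = torch.zeros(1, 3, 32)              # three padding tokens
padded = torch.cat([short, padding], dim=1)  # (batch, seq=8, features)

# Each token is normalized over its own 32 features only,
# so the first 5 outputs do not change when padding is appended
same = torch.allclose(ln(short), ln(padded)[:, :5], atol=1e-6)
print("Token outputs unaffected by padding:", same)
```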
* * *
## Use Cases
The main application scenarios of `nn.LayerNorm` include:
* **Transformer architectures**: BERT, GPT, etc.
* **Recurrent Neural Networks**: LSTM, GRU
* **Variable-length sequence processing**: The batch size or sequence length is not fixed
> Tip: LayerNorm is a standard component of Transformers. It is applied either after the residual addition (Post-LN) or to the sublayer input before the residual addition (Pre-LN).
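The block in Example 3 follows the Post-LN arrangement. For comparison, here is a minimal sketch of a Pre-LN variant, where each sublayer receives a normalized input and the residual path itself is left unnormalized. The class name `PreLNBlock` and the sizes used are hypothetical, chosen only for illustration.

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Hypothetical Pre-LN Transformer block (illustrative sketch only)."""

    def __init__(self, d_model, nhead):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.GELU(),
            nn.Linear(d_model * 4, d_model),
        )

    def forward(self, x):
        # Normalize first, then apply attention and add the residual
        h = self.norm1(x)
        attn_out, _ = self.self_attn(h, h, h)
        x = x + attn_out
        # Normalize first, then apply the FFN and add the residual
        x = x + self.ffn(self.norm2(x))
        return x

block = PreLNBlock(d_model=64, nhead=4)
x = torch.randn(2, 10, 64)  # (batch, seq, d_model)
out = block(x)
print("Output shape:", tuple(out.shape))
```

Pre-LN stacks often add one more LayerNorm after the last block, since the output of the residual path is not normalized.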