
PyTorch torch.nn.LayerNorm

`torch.nn.LayerNorm` is the layer normalization module in PyTorch. Unlike batch normalization, layer normalization computes its statistics over the feature dimensions of each individual sample, so the result does not depend on the batch size.

### Function Definition

```python
torch.nn.LayerNorm(normalized_shape, eps=1e-05, elementwise_affine=True)
```

**Parameter Description:**

* `normalized_shape` (int or list): The shape of the trailing dimensions to normalize over. An int means only the last dimension, which must have that size.
* `eps` (float): A small value added to the variance for numerical stability. Default is 1e-5.
* `elementwise_affine` (bool): Whether to apply a learnable per-element scale and shift. Default is True.

### Mathematical Principle

Layer normalization formula:

$$y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \cdot \gamma + \beta$$

Here $\mathrm{E}[x]$ and $\mathrm{Var}[x]$ are the mean and the biased variance computed over the dimensions given by `normalized_shape`, $\epsilon$ is `eps`, and $\gamma$ and $\beta$ are the learnable `weight` and `bias` (used when `elementwise_affine=True`).

Difference from batch normalization: batch normalization computes these statistics per channel across the batch, whereas layer normalization computes them per sample across the trailing feature dimensions (the last dimension when `normalized_shape` is an int).

* * *

## Usage Examples

### Example 1: Basic Usage

Perform layer normalization on features:

```python
import torch
import torch.nn as nn

# Layer normalization: normalize along the last dimension
ln = nn.LayerNorm(normalized_shape=10)

# Input: batch=4, feature dimension=10
x = torch.randn(4, 10)

# Forward pass
output = ln(x)

print("Input shape:", x.shape)
print("Output shape:", output.shape)
print("\nOriginal input, first row:", x[0].tolist())
print("Normalized first row:", output[0].tolist())
```

### Example 2: Multi-dimensional Input

Process 3D or 4D input:

```python
import torch
import torch.nn as nn

# Sequence input (batch, seq, features): each position is normalized
# over its own feature dimension, not over the sequence dimension
ln_seq = nn.LayerNorm(normalized_shape=64)

# 3D input
x_3d = torch.randn(2, 10, 64)
output_3d = ln_seq(x_3d)
print("3D input:", x_3d.shape, "-> output:", output_3d.shape)

# 4D input (e.g., channels-last images): (batch, height, width, channels)
# LayerNorm normalizes along the last (channel) dimension
ln_channel = nn.LayerNorm(normalized_shape=128)
x_4d = torch.randn(2, 8, 8, 128)
output_4d = ln_channel(x_4d)
print("4D input:", x_4d.shape, "-> output:", output_4d.shape)
```

### Example 3: Using in Transformer

Typical LayerNorm usage in a Post-LN Transformer block:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model, nhead):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.GELU(),
            nn.Linear(d_model * 4, d_model),
        )

    def forward(self, x):
        # Self-attention with residual, then normalize (Post-LN)
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)
        # FFN with residual, then normalize
        ffn_out = self.ffn(x)
        x = self.norm2(x + ffn_out)
        return x

# Test
block = TransformerBlock(d_model=512, nhead=8)
x = torch.randn(4, 100, 512)  # (batch, seq, d_model)
output = block(x)
print("Input shape:", x.shape)
print("Output shape:", output.shape)
```

### Example 4: Without Learnable Parameters

Pure normalization without scaling and shifting:

```python
import torch
import torch.nn as nn

# Without learnable parameters
ln = nn.LayerNorm(16, elementwise_affine=False)
x = torch.randn(4, 16)
output = ln(x)

# The weight and bias attributes still exist but are set to None,
# so check for None rather than using hasattr
print("Has weight:", ln.weight is not None)
print("Has bias:", ln.bias is not None)
print("\nOutput shape:", output.shape)
```

* * *

## Comparison of Normalization Methods

| **Method** | **Normalization Dimension** | **Batch Dependency** | **Applicable Scenarios** |
| --- | --- | --- | --- |
| `BatchNorm` | Batch (and spatial) dimensions, per channel | Yes | CNNs with stable batches |
| `LayerNorm` | Feature dimensions, per sample | No | Transformer, RNN |
| `InstanceNorm` | Spatial dimensions, per sample and channel | No | Style transfer |
| `GroupNorm` | Channel groups (with spatial dimensions), per sample | No | Small batch scenarios |

* * *

## Frequently Asked Questions

### Q1: What is the difference between LayerNorm and BatchNorm?

BatchNorm computes its statistics across the batch, while LayerNorm computes them within each sample. LayerNorm therefore does not depend on the batch, which makes it suitable for sequence models and for small or varying batch sizes.

### Q2: How to choose normalized_shape?

Usually the size of the feature (last) dimension is chosen, such as 768 or 1024 in BERT.

### Q3: Why does Transformer use LayerNorm?

Transformer inputs vary in sequence length, and LayerNorm normalizes each token's features independently of the batch size and sequence length, which makes it more stable in this setting.

* * *

## Use Cases

The main application scenarios of `nn.LayerNorm` include:

* **Transformer architectures**: BERT, GPT, etc.
* **Recurrent Neural Networks**: LSTM, GRU
* **Variable-length sequence processing**: cases where the batch size or sequence length is not fixed

> Tip: LayerNorm is a standard component of Transformers. It is applied either after the residual addition (Post-LN, as in Example 3) or to the sublayer input before the residual addition (Pre-LN).
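The tip on Pre-LN versus Post-LN placement can be illustrated with a small variant of the block from Example 3. The sketch below is only one plausible arrangement: the class name `PreLNBlock` and the sizes `d_model=64`, `nhead=4` are hypothetical choices for illustration, not part of the PyTorch API. The only change from Example 3 is where the two `nn.LayerNorm` calls sit relative to the residual additions.

```python
import torch
import torch.nn as nn


class PreLNBlock(nn.Module):
    """Hypothetical Pre-LN counterpart of the Post-LN block in Example 3."""

    def __init__(self, d_model, nhead):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.GELU(),
            nn.Linear(d_model * 4, d_model),
        )

    def forward(self, x):
        # Normalize the sublayer input; the residual path itself is not normalized
        h = self.norm1(x)
        attn_out, _ = self.self_attn(h, h, h)
        x = x + attn_out
        # Same pattern for the feed-forward sublayer
        x = x + self.ffn(self.norm2(x))
        return x


block = PreLNBlock(d_model=64, nhead=4)  # illustrative sizes
x = torch.randn(2, 10, 64)               # (batch, seq, d_model)
output = block(x)
print("Input shape:", x.shape)
print("Output shape:", output.shape)
```

In this arrangement the output of the block is not itself normalized, so Pre-LN models commonly add one more `nn.LayerNorm` after the last block.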
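To make the formula from the Mathematical Principle section concrete, the following sketch recomputes layer normalization by hand and compares it with `nn.LayerNorm`. It assumes the default initialization (`weight` all ones, `bias` all zeros) and uses the biased variance (dividing by N). The variable names `manual` and `matches` are illustrative only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the run is repeatable

x = torch.randn(4, 10)   # (batch, features)
ln = nn.LayerNorm(10)    # default eps=1e-5, weight=1, bias=0

# Per-sample statistics over the last dimension
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)  # biased variance

# y = (x - E[x]) / sqrt(Var[x] + eps) * gamma + beta
manual = (x - mean) / torch.sqrt(var + ln.eps) * ln.weight + ln.bias

matches = torch.allclose(manual, ln(x), atol=1e-5)
print("Manual result matches nn.LayerNorm:", matches)
print("Per-sample mean after normalization:", ln(x).mean(dim=-1).tolist())
```

Each row is normalized using only its own mean and variance, which is why the result is the same whether the batch holds one sample or many.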