
# PyTorch torch.nn.GELU

[PyTorch torch.nn Reference Manual](https://example.com/pytorch/pytorch-torch-nn-ref.html)

* * *

`torch.nn.GELU` is the Gaussian Error Linear Unit activation function in PyTorch. It is the activation used in the feed-forward layers of many Transformer models, including BERT and GPT. Unlike ReLU, it is smooth everywhere and produces small non-zero outputs for negative inputs, which is often reported to help training in these models.

### Function Definition

```python
torch.nn.GELU(approximate='none')
```

**Parameter Description:**

* `approximate` (str): Approximation algorithm. Options are `'none'` and `'tanh'`. Default is `'none'`.

### Mathematical Principle

The mathematical formula for GELU:

$$\mathrm{GELU}(x) = x \cdot \Phi(x)$$

where $\Phi(x)$ is the cumulative distribution function (CDF) of the standard normal distribution.

When using the tanh approximation:

$$\mathrm{GELU}(x) \approx 0.5\,x \left(1 + \tanh\left(\sqrt{2/\pi}\,\left(x + 0.044715\,x^{3}\right)\right)\right)$$

* * *

## Usage Examples

### Example 1: Basic Usage

Create and use GELU activation:

```python
import torch
import torch.nn as nn

# Create GELU activation layer
gelu = nn.GELU()

# Test input
x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])

# Forward pass
output = gelu(x)

print("Input:", x.tolist())
print("Output:", output.tolist())
print("\nObservation: negative values have slight activation (non-zero), positive values continue to grow")
```

### Example 2: Comparing Different Activation Functions

Compare GELU, ReLU, Sigmoid, and Tanh:

```python
import torch
import torch.nn as nn

x = torch.linspace(-4, 4, 21)

# Different activation functions
gelu = nn.GELU()
relu = nn.ReLU()
sigmoid = nn.Sigmoid()
tanh = nn.Tanh()

print(f"{'x':>6} {'GELU':>8} {'ReLU':>8} {'Sigmoid':>8} {'Tanh':>8}")
print("-" * 42)

# Print every third sample; .item() turns a 0-dim tensor into a Python float
for i in range(0, 21, 3):
    xi = x[i]
    print(f"{xi.item():6.2f} {gelu(xi).item():8.4f} {relu(xi).item():8.4f} "
          f"{sigmoid(xi).item():8.4f} {tanh(xi).item():8.4f}")
```

### Example 3: Usage in Transformer

A typical Transformer FFN layer:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model, dim_feedforward=2048, dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout = nn.Dropout(dropout)
        self.activation = nn.GELU()
        self.linear2 = nn.Linear(dim_feedforward, d_model)

    def forward(self, x):
        x = self.linear1(x)      # expand to dim_feedforward
        x = self.activation(x)   # GELU non-linearity
        x = self.dropout(x)
        x = self.linear2(x)      # project back to d_model
        return x

# Test FFN
ffn = FeedForward(d_model=512, dim_feedforward=2048)
x = torch.randn(32, 100, 512)  # (batch, seq, d_model)
output = ffn(x)

print("Input shape:", x.shape)
print("Output shape:", output.shape)
```

### Example 4: Using tanh Approximation

Compare the exact and tanh-approximate versions for accuracy and speed. Whether the approximation is actually faster depends on the hardware and backend, so treat the timings as a rough indication only:

```python
import time

import torch
import torch.nn as nn

# Exact version
gelu_exact = nn.GELU(approximate='none')

# tanh approximate version
gelu_approx = nn.GELU(approximate='tanh')

x = torch.randn(1000)
output_exact = gelu_exact(x)
output_approx = gelu_approx(x)

# Calculate difference
diff = (output_exact - output_approx).abs().max().item()
print(f"Maximum difference: {diff:.8f}")

# Performance comparison: warm up both versions before timing
for _ in range(100):
    _ = gelu_exact(x)
    _ = gelu_approx(x)

start = time.perf_counter()
for _ in range(1000):
    _ = gelu_exact(x)
time_exact = time.perf_counter() - start

start = time.perf_counter()
for _ in range(1000):
    _ = gelu_approx(x)
time_approx = time.perf_counter() - start

print(f"Exact version time: {time_exact:.4f}s")
print(f"Approximate version time: {time_approx:.4f}s")
```

* * *

## Activation Function Comparison

| **Activation Function** | **Characteristics** | **Use Cases** |
| --- | --- | --- |
| `nn.GELU` | Smooth, small non-zero outputs for negative inputs | Transformer models such as BERT and GPT |
| `nn.ReLU` | Simple, sparse activation, can suffer from dead neurons | CNN, general deep learning |
| `nn.SiLU` | Smooth, self-gating | EfficientNet; MobileNetV3 uses the related Hardswish |

* * *

## Frequently Asked Questions

### Q1: What are the advantages of GELU compared to ReLU?

* Negative inputs produce a small non-zero output instead of being cut to exactly zero, so some signal and gradient still pass through
* The function is smooth everywhere, with no kink at zero, which tends to help optimization
* It is the activation commonly used in Transformer models such as BERT and GPT, where it is reported to work better than ReLU

### Q2: When should the approximate version be used?
Use the tanh approximation when small numerical differences are acceptable and it measurably speeds up your workload. The gain depends on hardware and backend, so benchmark before relying on it. It is also the right choice when loading weights from a model that was trained with the tanh form, such as the original GPT-2 implementation, so that the outputs match.

### Q3: Can GELU be used in the output layer?

It is generally not used in the output layer. Classification tasks typically end in softmax (or raw logits passed to a loss that applies it), and regression tasks use the identity function.

* * *

## Use Cases

The main application scenarios for `nn.GELU` include:

* **Transformer Architecture**: Models like BERT and GPT
* **Deep Neural Networks**: Scenarios requiring smooth activation
* **Pre-trained Models**: Modern NLP models

> Tip: GELU is one of the most widely used activation functions in NLP models and a common choice for Transformer feed-forward layers.

* * *
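As a numerical check of the two formulas in the Mathematical Principle section, the sketch below evaluates both in plain Python. It assumes the standard identity Φ(x) = 0.5 · (1 + erf(x / √2)) for the normal CDF. The helper names `gelu_erf_reference` and `gelu_tanh_reference` are illustrative only and are not part of PyTorch.

```python
import math


def gelu_erf_reference(x: float) -> float:
    """x * Phi(x), with the normal CDF Phi written via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh_reference(x: float) -> float:
    """Tanh approximation: 0.5x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 x^3)))."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Same sample points as Example 1 (hypothetical choice for illustration)
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
for x in xs:
    print(f"{x:5.1f}  exact={gelu_erf_reference(x):.6f}  tanh={gelu_tanh_reference(x):.6f}")

# Largest absolute gap between the two formulas on these points
max_gap = max(abs(gelu_erf_reference(x) - gelu_tanh_reference(x)) for x in xs)
print(f"largest gap on these points: {max_gap:.2e}")
```

The printed values can be compared by eye against the output of `nn.GELU(approximate='none')` and `nn.GELU(approximate='tanh')` from Example 4; small differences in the last digits are expected because PyTorch computes in float32 by default.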