# PyTorch torch.nn.GELU
[PyTorch torch.nn Reference Manual](https://example.com/pytorch/pytorch-torch-nn-ref.html)
* * *
`torch.nn.GELU` is the Gaussian Error Linear Unit activation function in PyTorch.
It is widely used in Transformer models such as BERT and GPT. Unlike ReLU, it is smooth around zero, which gives smoother gradients.
### Function Definition
`torch.nn.GELU(approximate='none')`
**Parameter Description:**
* `approximate` (str): Approximation algorithm. Options are `'none'` (exact, CDF-based form) and `'tanh'` (tanh approximation). Default is `'none'`.
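As a brief illustration of the `approximate` parameter, the sketch below builds both variants and checks them against the functional form `torch.nn.functional.gelu`, which accepts the same keyword. The sample tensor is arbitrary, and the example assumes a reasonably recent PyTorch release in which `approximate` is available.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([-1.5, -0.5, 0.0, 0.5, 1.5])  # arbitrary sample inputs

gelu_exact = nn.GELU()                   # same as approximate='none'
gelu_tanh = nn.GELU(approximate='tanh')  # tanh-based approximation

# The module and the functional form should agree for each setting
same_exact = torch.allclose(gelu_exact(x), F.gelu(x, approximate='none'))
same_tanh = torch.allclose(gelu_tanh(x), F.gelu(x, approximate='tanh'))
print("module matches functional (none):", same_exact)
print("module matches functional (tanh):", same_tanh)
```

The module form is convenient inside `nn.Sequential` or as a layer attribute, while the functional form fits naturally inside a custom `forward` method.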
### Mathematical Principle
The mathematical formula for GELU:

$$\mathrm{GELU}(x) = x \cdot \Phi(x)$$

where $\Phi(x)$ is the cumulative distribution function (CDF) of the standard normal distribution.

When using the tanh approximation:

$$\mathrm{GELU}(x) \approx 0.5\,x\left(1 + \tanh\left(\sqrt{2/\pi}\,\left(x + 0.044715\,x^{3}\right)\right)\right)$$
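Both formulas can be checked by hand with nothing but the standard library. The sketch below is an illustration rather than PyTorch's actual implementation: it writes the normal CDF as 0.5 * (1 + erf(x / sqrt(2))) and compares the exact form with the tanh approximation. The helper names `gelu_exact` and `gelu_tanh` are made up for this example.

```python
import math

def gelu_exact(x: float) -> float:
    """GELU(x) = x * Phi(x), with Phi written in terms of erf."""
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi

def gelu_tanh(x: float) -> float:
    """tanh approximation using the 0.044715 cubic coefficient."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Print both versions side by side for a few sample points
for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x={x:5.1f}  exact={gelu_exact(x):9.6f}  tanh={gelu_tanh(x):9.6f}")
```

For these sample points the two columns differ only by a small amount, which is why the approximation is usually acceptable in practice.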
* * *
## Usage Examples
### Example 1: Basic Usage
Create and use a GELU activation:

    import torch
    import torch.nn as nn

    # Create the GELU activation layer
    gelu = nn.GELU()

    # Test input
    x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])

    # Forward pass
    output = gelu(x)
    print("Input:", x.tolist())
    print("Output:", output.tolist())
    print("\nObservation: negative inputs produce small non-zero outputs; positive inputs keep growing")
### Example 2: Comparing Different Activation Functions
Compare GELU, ReLU, Sigmoid, and Tanh:

    import torch
    import torch.nn as nn

    x = torch.linspace(-4, 4, 21)

    # Activation functions to compare
    gelu = nn.GELU()
    relu = nn.ReLU()
    sigmoid = nn.Sigmoid()
    tanh = nn.Tanh()

    print(f"{'x':>6} {'GELU':>8} {'ReLU':>8} {'Sigmoid':>8} {'Tanh':>8}")
    print("-" * 50)
    # Print every third sample; .item() converts each 0-dim tensor to a Python float
    for i in range(0, 21, 3):
        xi = x[i]
        print(f"{xi.item():6.2f} {gelu(xi).item():8.4f} {relu(xi).item():8.4f} "
              f"{sigmoid(xi).item():8.4f} {tanh(xi).item():8.4f}")
### Example 3: Usage in a Transformer
A typical Transformer feed-forward network (FFN) block:

    import torch
    import torch.nn as nn

    class FeedForward(nn.Module):
        def __init__(self, d_model, dim_feedforward=2048, dropout=0.1):
            super().__init__()
            self.linear1 = nn.Linear(d_model, dim_feedforward)
            self.dropout = nn.Dropout(dropout)
            self.activation = nn.GELU()
            self.linear2 = nn.Linear(dim_feedforward, d_model)

        def forward(self, x):
            x = self.linear1(x)      # expand to dim_feedforward
            x = self.activation(x)
            x = self.dropout(x)
            x = self.linear2(x)      # project back to d_model
            return x

    # Test the FFN
    ffn = FeedForward(d_model=512, dim_feedforward=2048)
    x = torch.randn(32, 100, 512)  # (batch, seq, d_model)
    output = ffn(x)
    print("Input shape:", x.shape)
    print("Output shape:", output.shape)
### Example 4: Using the tanh Approximation
Compare the tanh approximation with the exact version, in both output and speed:

    import time

    import torch
    import torch.nn as nn

    # Exact version
    gelu_exact = nn.GELU(approximate='none')
    # tanh-approximate version
    gelu_approx = nn.GELU(approximate='tanh')

    x = torch.randn(1000)
    output_exact = gelu_exact(x)
    output_approx = gelu_approx(x)

    # Largest absolute difference between the two outputs
    diff = (output_exact - output_approx).abs().max().item()
    print(f"Maximum difference: {diff:.8f}")

    # Rough timing; results depend on hardware, tensor size, and PyTorch version
    def time_it(fn, iters=1000, warmup=100):
        for _ in range(warmup):  # warm up before measuring
            fn(x)
        start = time.perf_counter()
        for _ in range(iters):
            fn(x)
        return time.perf_counter() - start

    time_exact = time_it(gelu_exact)
    time_approx = time_it(gelu_approx)
    print(f"Exact version time: {time_exact:.4f}s")
    print(f"Approximate version time: {time_approx:.4f}s")
* * *
## Activation Function Comparison
| **Activation Function** | **Characteristics** | **Use Cases** |
| --- | --- | --- |
| `nn.GELU` | Smooth, small non-zero outputs for negative inputs | Transformer, BERT, GPT |
| `nn.ReLU` | Simple, sparse activation, prone to dead neurons | CNN, general deep learning |
| `nn.SiLU` | Smooth, self-gating | MobileNet, EfficientNet |
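To make the table concrete, the following sketch evaluates the three activations on a few hand-picked inputs and checks that `nn.SiLU` matches `x * sigmoid(x)`, which is the self-gating form. The input values are arbitrary.

```python
import torch
import torch.nn as nn

x = torch.tensor([-3.0, -1.0, 0.0, 1.0, 3.0])  # arbitrary sample inputs

gelu_out = nn.GELU()(x)
relu_out = nn.ReLU()(x)
silu_out = nn.SiLU()(x)

print("GELU:", gelu_out.tolist())
print("ReLU:", relu_out.tolist())
print("SiLU:", silu_out.tolist())

# ReLU zeroes every negative input; GELU returns small non-zero values there
neg = x < 0
relu_zero_on_neg = bool((relu_out[neg] == 0).all())
gelu_nonzero_on_neg = bool((gelu_out[neg] != 0).all())
print("ReLU zero on negatives:", relu_zero_on_neg)
print("GELU non-zero on negatives:", gelu_nonzero_on_neg)

# Self-gating: SiLU(x) = x * sigmoid(x)
silu_is_gated = torch.allclose(silu_out, x * torch.sigmoid(x))
print("SiLU == x * sigmoid(x):", silu_is_gated)
```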
* * *
## Frequently Asked Questions
### Q1: What are the advantages of GELU compared to ReLU?
* Small non-zero outputs for negative inputs, so less information is discarded than with ReLU's hard cutoff at zero
* Smoother gradients, which helps with training
* Often better empirical results in Transformer models
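The gradient behavior behind these points can be inspected directly with autograd. The sketch below uses a hypothetical helper, `grad_at`, to evaluate the derivative of each activation at a few arbitrary points on either side of zero.

```python
import torch
import torch.nn as nn

def grad_at(activation: nn.Module, value: float) -> float:
    """Return d(activation)/dx at a single point using autograd."""
    x = torch.tensor(value, requires_grad=True)
    activation(x).backward()
    return x.grad.item()

for value in (-1.0, -0.1, 0.1, 1.0):
    g_relu = grad_at(nn.ReLU(), value)
    g_gelu = grad_at(nn.GELU(), value)
    print(f"x={value:5.1f}  ReLU grad={g_relu:.4f}  GELU grad={g_gelu:.4f}")
```

ReLU's gradient jumps from 0 to 1 at the origin, while GELU's gradient changes continuously and stays non-zero for these negative inputs.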
### Q2: When should the approximate version be used?
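Consider the tanh approximation when throughput matters more than strict numerical precision. It can be faster on some hardware and backends, but the gain is not guaranteed, so benchmark on your own setup.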
### Q3: Can GELU be used in the output layer?
It is generally not used in the output layer. Softmax is used for classification tasks, and the identity function is used for regression tasks.
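A minimal sketch of that layout for classification is shown below: GELU sits in the hidden layer, the final linear layer emits raw logits, and softmax is applied to the logits afterwards. The layer sizes are hypothetical and chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # reproducible random weights and inputs

# Hypothetical sizes chosen only for illustration
in_features, hidden, num_classes = 16, 32, 3

classifier = nn.Sequential(
    nn.Linear(in_features, hidden),
    nn.GELU(),                       # GELU in the hidden layer
    nn.Linear(hidden, num_classes),  # output layer emits raw logits, no GELU
)

x = torch.randn(4, in_features)
logits = classifier(x)
probs = torch.softmax(logits, dim=-1)  # turn logits into class probabilities

print("logits shape:", tuple(logits.shape))
print("row sums:", probs.sum(dim=-1).tolist())
```

When training with `nn.CrossEntropyLoss`, pass the raw logits to the loss, since it applies log-softmax internally.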
* * *
## Use Cases
The main application scenarios for `nn.GELU` include:
* **Transformer Architecture**: Models like BERT and GPT
* **Deep Neural Networks**: Scenarios requiring smooth activation
* **Pre-trained Models**: Modern NLP models
> Tip: GELU is one of the most widely used activation functions in NLP and appears in many Transformer implementations.
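For the Transformer use case, PyTorch's built-in `nn.TransformerEncoderLayer` accepts an `activation` argument, so GELU can be selected without writing a custom block. The sketch below assumes small, hypothetical dimensions and only checks that the output shape matches the input.

```python
import torch
import torch.nn as nn

# Hypothetical sizes chosen only for illustration
layer = nn.TransformerEncoderLayer(
    d_model=64,
    nhead=4,
    dim_feedforward=128,
    activation="gelu",  # the built-in layer uses ReLU unless told otherwise
    batch_first=True,   # inputs are (batch, seq, d_model)
)
layer.eval()  # disable dropout for a deterministic forward pass

x = torch.randn(2, 10, 64)
with torch.no_grad():
    out = layer(x)

print("Input shape:", tuple(x.shape))
print("Output shape:", tuple(out.shape))
```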