BERT Series Models

BERT (Bidirectional Encoder Representations from Transformers) is a revolutionary natural language processing model proposed by Google in 2018, which has fundamentally transformed research and application paradigms in the NLP field.

This article systematically introduces the core principles, training methods, fine-tuning techniques, and mainstream variant models of BERT.

BERT Architecture and Training

The figure below illustrates the core architecture of the BERT model and the Masked Language Modeling (MLM) task used during pretraining.

1. Input Layer (Embedding)

- Input Sequence: Text composed of tokens (or subwords), e.g., [W₁, W₂, W₃, [MASK], W₅, ...].
  - [MASK] is a token randomly masked by BERT during pretraining (e.g., W₄ in the original text is replaced by [MASK]).
- Embedding Layer: Converts each token into a fixed-dimensional vector representation (e.g., 768 dimensions), consisting of:
  - Token Embeddings: Semantic information of the vocabulary.
  - Position Embeddings: Positional information of tokens in the sequence.
  - Segment Embeddings: Distinguish between sentences (useful for sentence-pair tasks; not explicitly shown in the figure).

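In BERT the three embeddings are summed element-wise at each position before entering the encoder (the real model then applies layer normalization and dropout, omitted here). A minimal sketch with toy lookup tables; the table sizes, token IDs, and the helper names `make_table` and `embed` are illustrative assumptions, not BERT's actual values:

```python
import random


def make_table(num_rows, dim, rng):
    """Build a toy embedding table: one random vector per row."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]


def embed(token_ids, segment_ids, token_table, position_table, segment_table):
    """Sum token, position, and segment embeddings for each input position."""
    vectors = []
    for position, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        vectors.append([
            t + p + s
            for t, p, s in zip(token_table[tok], position_table[position], segment_table[seg])
        ])
    return vectors


rng = random.Random(0)
DIM = 8  # toy size; the text above cites 768 dimensions as an example
token_table = make_table(100, DIM, rng)    # hypothetical 100-token vocabulary
position_table = make_table(16, DIM, rng)  # hypothetical maximum length of 16
segment_table = make_table(2, DIM, rng)    # sentence A / sentence B

token_ids = [5, 17, 42, 3, 9]    # made-up token IDs
segment_ids = [0, 0, 0, 1, 1]    # first three tokens in sentence A, last two in B
embedded = embed(token_ids, segment_ids, token_table, position_table, segment_table)
print(len(embedded), len(embedded[0]))
```

Each input position ends up with a single vector of the same width, which is what the encoder consumes.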
2. Transformer Encoder

- Multiple Transformer Blocks: Details are not expanded in the figure, but each block contains:
  - Self-Attention Mechanism: Captures bidirectional contextual dependencies (core feature of BERT).
  - Feed-Forward Network: Nonlinear transformation.
  - Residual Connections & Layer Normalization: Stabilize the training process.
- Output: Context-dependent vector representations corresponding to each input token (e.g., O₁, O₂, ..., Oₙ).

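To make "bidirectional" concrete, here is a simplified sketch of scaled dot-product self-attention. It assumes a single head and uses the input vectors directly as queries, keys, and values (real BERT applies learned projections and multiple heads); the toy vectors are made up for illustration:

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Single-head scaled dot-product attention without learned projections."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score the query against every position, both to its left and to its right.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        all_weights.append(weights)
        # Each output is a weighted average of all input vectors.
        outputs.append([
            sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)
        ])
    return outputs, all_weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

No attention mask hides later positions, so every row of weights is nonzero across the whole sequence. A left-to-right language model would instead zero out the weights for positions after the query.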
3. Masked Language Modeling (MLM) Task

- Objective: Predict the original token corresponding to the masked token [MASK] (e.g., W₄ in the figure).
- Classification Layer:
  - Fully-Connected Layer: Transforms the Transformer output vector at the masked position (e.g., O₄).
  - Activation Function GELU: Gaussian Error Linear Unit (nonlinear function used by BERT).
  - Layer Normalization (Norm): Normalizes the output.
  - Projection to Vocabulary: Maps the normalized vector to the vocabulary-size dimension.
  - Softmax: Computes probabilities for each word in the vocabulary; selects the word with the highest probability as the prediction (e.g., W'₁, W'₂, ..., W'ₙ are candidate words).

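The final softmax step can be sketched as follows, assuming a toy five-word vocabulary and hypothetical scores at the masked position (neither comes from a real model). The cross-entropy value at the end is the quantity MLM pretraining minimizes for that position:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["the", "cat", "sat", "mat", "dog"]  # toy vocabulary (assumption)
logits = [1.2, 0.3, 2.5, -0.7, 0.1]          # hypothetical scores for the masked position

probs = softmax(logits)
best = max(range(len(vocab)), key=lambda i: probs[i])
predicted = vocab[best]

# Cross-entropy loss if the true (original) token was "sat".
target = vocab.index("sat")
loss = -math.log(probs[target])

print(predicted, round(loss, 4))
```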
Transformer Encoder Structure

BERT is built upon the encoder part of the Transformer, with its core being multiple layers of self-attention mechanisms:

Example

```python
# Simplified Transformer encoder layer (post-LayerNorm, as in the original Transformer)
import torch.nn as nn
import torch.nn.functional as F


class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, src):
        # Self-attention: nn.MultiheadAttention returns (output, attention weights)
        attn_out, _ = self.self_attn(src, src, src)
        # Residual connection, then layer normalization
        src = self.norm1(src + attn_out)
        # Feed-forward network (BERT itself uses GELU rather than ReLU)
        ff_out = self.linear2(F.relu(self.linear1(src)))
        src = self.norm2(src + ff_out)
        return src
```

Key Innovation: Bidirectional Context Modeling

Unlike traditional left-to-right language models, BERT conditions each token on both its left and right context. It is pretrained with two tasks:

- Masked Language Model (MLM): Randomly masks 15% of input tokens and predicts the masked tokens. Because the target token is hidden from the input, the model can safely use context on both sides.
- Next Sentence Prediction (NSP): Determines whether the second sentence actually follows the first in the original text.
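
A minimal sketch of the MLM corruption step. The 15% rate comes from the text above; the 80/10/10 split (replace with [MASK], replace with a random token, keep unchanged) follows the original BERT paper. The toy sentence, the vocabulary, and the function name `mask_tokens` are assumptions for illustration, and special tokens such as [CLS] and [SEP] are ignored for simplicity:

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (corrupted tokens, labels); labels are None where nothing is predicted."""
    num_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = rng.sample(range(len(tokens)), num_to_mask)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for pos in positions:
        labels[pos] = tokens[pos]  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[pos] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[pos] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels


vocab = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
tokens = [vocab[i % len(vocab)] for i in range(20)]  # toy 20-token sequence
corrupted, labels = mask_tokens(tokens, vocab, random.Random(0))
print(corrupted)
print(labels)
```

Only the selected positions carry a label, so the loss is computed on roughly 15% of the tokens in each sequence.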
Training Parameters and Configuration

| Parameter | BERT-base | BERT-large |
|---|---|---|
| Layers | 12 | 24 |
| Hidden Size | 768 | 1024 |
| Attention Heads | 12 | 16 |
| Total Parameters | 110M | 340M |

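The totals in the table can be roughly reproduced from the layer count and hidden size. The sketch below assumes values that are not stated in the table: a 30,522-token WordPiece vocabulary, 512 positions, 2 segment types, a feed-forward width of 4 times the hidden size, and a pooler layer on top. The function name `bert_param_count` is hypothetical:

```python
def bert_param_count(layers, hidden, vocab_size=30522, max_positions=512, type_vocab=2):
    """Approximate parameter count of a BERT encoder (weights and biases)."""
    ffn = 4 * hidden  # assumed feed-forward width
    # Token, position, and segment tables plus one LayerNorm (scale and bias).
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + 2 * hidden
    # Query, key, value, and output projections plus one LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + 2 * hidden
    # Two linear layers plus one LayerNorm.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + 2 * hidden
    # Pooler: one dense layer applied to the first token's output.
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


base = bert_param_count(layers=12, hidden=768)
large = bert_param_count(layers=24, hidden=1024)
print(f"BERT-base:  {base:,}")
print(f"BERT-large: {large:,}")
```

Under these assumptions the counts come out at about 109.5M and 335.1M, consistent with the rounded figures in the table. The number of attention heads does not appear in the formula because splitting the hidden size across heads leaves the projection matrices the same size.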
BERT Fine-Tuning Methods

Standard Fine-Tuning Pipeline

- Task-Specific Layer Addition: Add classification/regression layers according to the downstream task.
- Learning Rate: Typically set to a small value (2e-5 to 5e-5).
- Batch Size: 16 or 32 are common choices.
- Training Epochs: 2–4 epochs are usually sufficient.
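
To see what these settings imply in practice, the sketch below computes the number of optimizer steps and a learning-rate schedule. The linear warm-up followed by linear decay and the 10% warm-up ratio are common choices for BERT fine-tuning but are assumptions here, as are the dataset size and the function names:

```python
import math


def total_training_steps(num_examples, batch_size, epochs):
    """Number of optimizer steps, with one step per batch."""
    return math.ceil(num_examples / batch_size) * epochs


def linear_warmup_decay(step, total_steps, peak_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at `step`: linear warm-up to peak_lr, then linear decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


# Hypothetical dataset of 8,000 examples, batch size 32, 3 epochs.
steps = total_training_steps(num_examples=8000, batch_size=32, epochs=3)
warmup_end = int(steps * 0.1)
for step in (0, warmup_end, steps // 2, steps):
    print(step, linear_warmup_decay(step, steps))
```

With only a few hundred steps in total, the small peak learning rate keeps the pretrained weights from drifting far from their starting point.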
Efficient Fine-Tuning Techniques
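
Example

```python
# Fine-tuning example using HuggingFace Transformers
from transformers import BertForSequenceClassification, Trainer, TrainingArguments

model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

# Hyperparameters follow the ranges listed above; output_dir is a placeholder path.
training_args = TrainingArguments(
    output_dir='./bert-finetuned',
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

# train_dataset and eval_dataset are assumed to be tokenized datasets prepared beforehand.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
```

Comparison of Common Fine-Tuning Strategies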

| Method | Advantages | Disadvantages |
|---|---|---|
| Full-parameter fine-tuning | Best performance | High computational cost |
| Feature extraction (freeze BERT) | Computationally efficient | Suboptimal performance |
| Adapter | Parameter-efficient | Requires architecture modification |
| Prompt learning | Strong few-shot performance | Requires prompt template design |

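The cost difference between the first three strategies shows up in the number of trainable parameters. The sketch below is back-of-the-envelope arithmetic under stated assumptions: a BERT-base encoder of about 109.5M parameters, a linear classification head, and bottleneck adapters with two adapters per layer, each costing 2·d·m + d + m parameters for hidden size d and bottleneck size m. The bottleneck size of 64 and all function names are hypothetical, and prompt learning is left out because its cost depends on the prompt design:

```python
ENCODER_PARAMS = 109_482_240  # assumed size of a BERT-base encoder
HIDDEN, LAYERS = 768, 12


def head_params(num_labels, hidden=HIDDEN):
    """Linear classification head: weight matrix plus bias."""
    return hidden * num_labels + num_labels


def adapter_params(bottleneck, hidden=HIDDEN, layers=LAYERS, adapters_per_layer=2):
    """Down-projection, up-projection, and both biases for every adapter."""
    per_adapter = 2 * hidden * bottleneck + hidden + bottleneck
    return layers * adapters_per_layer * per_adapter


def trainable_params(strategy, num_labels=2, bottleneck=64):
    """Trainable parameter count for a given fine-tuning strategy."""
    head = head_params(num_labels)
    if strategy == "full":
        return ENCODER_PARAMS + head
    if strategy == "feature_extraction":
        return head  # encoder frozen, only the head is trained
    if strategy == "adapter":
        return adapter_params(bottleneck) + head
    raise ValueError(f"unknown strategy: {strategy}")


for name in ("full", "feature_extraction", "adapter"):
    print(f"{name:20s} {trainable_params(name):>12,}")
```

Under these assumptions the adapter setup trains a little over 2% of the parameters that full fine-tuning updates.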
Mainstream BERT Variant Models

RoBERTa (Robustly Optimized BERT Pretraining Approach)

- Improvements:
  - Larger batch size (8k vs. 256)
  - Longer training duration
  - Removal of NSP task
  - Dynamic masking strategy
- Performance: Average improvement of 2–3% on the GLUE benchmark.
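
The difference between static and dynamic masking can be sketched as follows: a static mask is sampled once during preprocessing and reused, while a dynamic mask is re-sampled every time the sequence is fed to the model. The function names and the 20-token sequence length are illustrative assumptions:

```python
import random


def sample_mask_positions(num_tokens, rng, mask_prob=0.15):
    """Pick the positions to mask for one pass over a sequence."""
    count = max(1, round(num_tokens * mask_prob))
    return sorted(rng.sample(range(num_tokens), count))


def static_masks(num_tokens, epochs, seed=0):
    """One mask sampled during preprocessing and reused in every epoch."""
    positions = sample_mask_positions(num_tokens, random.Random(seed))
    return [positions for _ in range(epochs)]


def dynamic_masks(num_tokens, epochs, seed=0):
    """A fresh mask sampled each time the sequence is seen."""
    rng = random.Random(seed)
    return [sample_mask_positions(num_tokens, rng) for _ in range(epochs)]


static = static_masks(20, epochs=4)
dynamic = dynamic_masks(20, epochs=4)
print("static: ", static)
print("dynamic:", dynamic)
```

With dynamic masking the model is asked to predict different tokens of the same sentence across epochs, which matters more as training runs get longer.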
ALBERT (A Lite BERT)

- Core Innovations:
  - Cross-layer parameter sharing (by default, all layers share the same attention and feed-forward parameters)
  - Embedding factorization (decomposing the token embedding matrix into two smaller matrices)
- Effect: 89% reduction in parameter count and 1.7× speedup.
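
The saving from embedding factorization is simple arithmetic: a V×H embedding matrix is replaced by a V×E matrix followed by an E×H projection, with E much smaller than H. The sketch below assumes V = 30,000, H = 768, and E = 128, which are typical values but are not given in the text above:

```python
def embedding_params(vocab_size, hidden, embed_dim=None):
    """Token-embedding parameters, optionally factorized through embed_dim."""
    if embed_dim is None:
        return vocab_size * hidden                        # single V x H matrix
    return vocab_size * embed_dim + embed_dim * hidden   # V x E plus E x H


V, H, E = 30000, 768, 128  # assumed vocabulary, hidden, and embedding sizes
full = embedding_params(V, H)
factored = embedding_params(V, H, E)
print(f"unfactorized: {full:,}")
print(f"factorized:   {factored:,}")
print(f"saving:       {1 - factored / full:.1%}")
```

Under these assumptions the embedding table alone shrinks by about 83%. Cross-layer sharing then reduces the encoder stack to the parameters of a single layer, which accounts for most of the remaining reduction.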
Other Important Variants

- DistilBERT: Model compression via knowledge distillation.
- ELECTRA: Replaces MLM with replaced-token detection, trained with a generator-discriminator setup.
- SpanBERT: Optimizes modeling of text spans.

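As an illustration of knowledge distillation, the sketch below computes a generic soft-target loss: the KL divergence between temperature-softened teacher and student distributions, scaled by the square of the temperature. This is the classic formulation and not necessarily DistilBERT's complete training objective; the logits and the temperature of 2.0 are made-up values:

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (higher means softer)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T squared."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return kl * temperature ** 2


teacher = [2.0, 1.0, 0.1]  # hypothetical teacher logits
student = [1.5, 1.2, 0.3]  # hypothetical student logits
loss_diff = distillation_loss(student, teacher)
loss_same = distillation_loss(teacher, teacher)
print(loss_diff, loss_same)
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge.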
Chinese BERT Models

Overview of Chinese Pretrained Models

| Model | Organization | Features |
|---|---|---|
| BERT-wwm | HIT (Harbin Institute of Technology) | Whole Word Masking (wwm) |
| RoBERTa-wwm-ext | HIT | Extended training data |
| ERNIE (Baidu) | Baidu | Knowledge graph integration |
| NEZHA | Huawei | Relative positional encoding |
Chinese BERT Usage Example
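
Example

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
model = BertModel.from_pretrained('bert-base-chinese')

# Chinese input meaning "Natural language processing is very interesting."
inputs = tokenizer("自然语言处理很有趣。", return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

# Contextual vectors with shape (batch, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```

Recommendations for Fine-Tuning Chinese Tasks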

- Use the whole-word masking (wwm) version for better performance.
- Pay attention to Chinese word segmentation boundary issues.
- For specialized domains, consider domain-adaptive pretraining.

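Whole-word masking matters for Chinese because the standard Chinese BERT vocabulary splits text into single characters, so character-level masking can hide only part of a word. A minimal sketch, assuming the sentence has already been segmented into words by an external tool; the segmentation shown and the function name `whole_word_mask` are illustrative:

```python
import random


def whole_word_mask(words, rng, mask_prob=0.15):
    """Mask every character of each selected word; `words` is a pre-segmented sentence."""
    count = max(1, round(len(words) * mask_prob))
    chosen = set(rng.sample(range(len(words)), count))
    tokens = []
    for index, word in enumerate(words):
        for char in word:
            # All characters of a chosen word are masked together.
            tokens.append("[MASK]" if index in chosen else char)
    return tokens


# "Natural language processing is very interesting", segmented into words.
words = ["自然", "语言", "处理", "很", "有趣"]
masked = whole_word_mask(words, random.Random(0))
print(masked)
```

The mask always covers a complete word, so the model has to predict the word as a unit rather than fill in one character from its neighbor.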
Practical Suggestions and Resources

Recommended Resources

- Papers:
  - Original BERT paper (arXiv:1810.04805)
  - Papers for variants such as RoBERTa and ALBERT
- Codebases:
  - HuggingFace Transformers
  - GitHub implementations of Chinese BERT
- Online Courses:
  - Coursera Natural Language Processing Specialization
  - Hung-yi Lee's Deep Learning Course

Through systematic learning and practice, BERT series models can become powerful tools for solving NLP problems. It is recommended to start with the base version and gradually explore more advanced variants and optimization techniques.