# PyTorch Torchtext
While the PyTorch ecosystem has `torchvision` for image data and `torchaudio` for audio data, the official text-processing library, `torchtext`, has had its API refactored across releases. This section covers text preprocessing, vocabulary building, data loading, and related operations.
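> Note: The torchtext API has been refactored across releases. Older tutorials rely on `torchtext.legacy`, which is available only in some versions (it was introduced in 0.9 and removed in 0.12), while newer releases provide a more modern API. Check which API your installed version offers, or build your own text processing pipeline.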
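
Because `torchtext.legacy` exists only in some releases, it can help to check what is installed before following a given tutorial. The following is a minimal sketch that uses only the standard library; `torchtext_status` is a hypothetical helper name, and the sketch assumes that the package metadata and the `torchtext.legacy` module path are the right things to probe.

```python
from importlib import metadata
from importlib.util import find_spec


def torchtext_status():
    """Return (installed_version, has_legacy); (None, False) if torchtext is absent."""
    try:
        version = metadata.version("torchtext")
    except metadata.PackageNotFoundError:
        return None, False
    try:
        # find_spec imports the parent package, which can fail on a broken install
        has_legacy = find_spec("torchtext.legacy") is not None
    except Exception:
        has_legacy = False
    return version, has_legacy


version, has_legacy = torchtext_status()
print(f"torchtext version: {version}, torchtext.legacy available: {has_legacy}")
```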
* * *
## 1. Text Data Preprocessing Basics
Text preprocessing is the first step in most NLP tasks. It covers tokenization, vocabulary building, encoding tokens as integer indices, and related operations.
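As a rough illustration of these three steps on a single sentence, here is a minimal sketch in plain Python. The sentence and the first-seen index order are illustrative assumptions, not a fixed convention.

```python
sentence = "PyTorch is great, and PyTorch is fast"

# 1. Tokenization: lowercase, drop commas, split on whitespace
tokens = sentence.lower().replace(",", "").split()

# 2. Vocabulary building: one integer id per distinct token, in first-seen order
vocab = {}
for token in tokens:
    vocab.setdefault(token, len(vocab))

# 3. Encoding: replace each token with its id
encoded = [vocab[token] for token in tokens]

print(tokens)
print(encoded)
```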
### 1.1 Basic Text Processing Pipeline
#### Example
```python
import re
from collections import Counter


class SimpleTokenizer:
    """
    Simple tokenizer: lowercase, strip punctuation, split on whitespace
    """

    def __init__(self):
        # Translation table that deletes punctuation characters
        self.punctuation = str.maketrans('', '', '.,!?;:"\'-()[]{}')

    def tokenize(self, text):
        # Convert to lowercase
        text = text.lower()
        # Remove punctuation
        text = text.translate(self.punctuation)
        # Tokenize on whitespace
        tokens = text.split()
        return tokens
```
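`SimpleTokenizer` deletes punctuation outright, so `don't` becomes `dont` and hyphenated words are fused into one token. One possible alternative, sketched below, keeps punctuation marks as separate tokens by using `re.findall`. The name `regex_tokenize` and the pattern are illustrative assumptions, not part of torchtext.

```python
import re

# A token is either a run of word characters or a single punctuation mark
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def regex_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text.lower())


tokens = regex_tokenize("Hello, world! It's state-of-the-art.")
print(tokens)
```

Whether punctuation should be kept or dropped depends on the task; keeping it makes the vocabulary larger but preserves sentence boundaries.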
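```python
class Vocabulary:
    """
    Vocabulary building
    """

    def __init__(self, min_freq=2, max_size=10000):
        self.min_freq = min_freq
        self.max_size = max_size
        # Index 0 is reserved for padding, index 1 for unknown words
        self.word2idx = {'<PAD>': 0, '<UNK>': 1}
        self.idx2word = {0: '<PAD>', 1: '<UNK>'}
        self.word_count = Counter()

    def build_vocab(self, texts):
        """Build vocabulary from text list"""
        tokenizer = SimpleTokenizer()
        # Count word frequency
        for text in texts:
            tokens = tokenizer.tokenize(text)
            self.word_count.update(tokens)
        # Build vocabulary from the most frequent words
        for word, count in self.word_count.most_common(self.max_size):
            if count < self.min_freq:
                break  # most_common() is sorted, so all remaining words are rarer
            idx = len(self.word2idx)
            self.word2idx[word] = idx
            self.idx2word[idx] = word

    def encode(self, text, max_len=None):
        """Encode text to index sequence"""
        tokens = SimpleTokenizer().tokenize(text)
        # Map out-of-vocabulary tokens to the <UNK> index
        indices = [
            self.word2idx.get(token, self.word2idx['<UNK>'])
            for token in tokens
        ]
        # Pad with zeros
        if max_len is not None and len(indices) < max_len:
            indices += [self.word2idx['<PAD>']] * (max_len - len(indices))
        return indices

    def decode(self, indices):
        """Decode index sequence to text"""
        tokens = [self.idx2word.get(idx, '<UNK>') for idx in indices]
        return ' '.join(tokens)
```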
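Frequency filtering with `Counter.most_common` works because the method returns words sorted from most to least frequent, so iteration can stop at the first word whose count falls below `min_freq`. A minimal standalone sketch, with an illustrative toy sentence and thresholds:

```python
from collections import Counter

counts = Counter("the cat sat on the mat the cat".split())
min_freq, max_size = 2, 10

kept = []
for word, count in counts.most_common(max_size):
    if count < min_freq:
        break  # every later word is at most this frequent, so stop here
    kept.append(word)

print(kept)
```

Note that the two special tokens are added on top of `max_size`, so the final vocabulary can hold up to `max_size + 2` entries.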
```python
# Usage example
texts = [
    "Hello world",
    "This is a test",
    "PyTorch is great for deep learning",
    "Natural language processing is fun",
    "Deep learning enables many applications",
]
vocab = Vocabulary(min_freq=1, max_size=
```
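The `# Pad with zeros` step above pads one sequence at a time up to `max_len`. When sequences are grouped into batches, a common variant pads every sequence to the length of the longest one in the batch. The sketch below assumes padding index 0, as in the vocabulary above; `pad_batch` and its parameters are hypothetical names, not part of torchtext.

```python
PAD_IDX = 0  # assumed to match the padding index of the vocabulary


def pad_batch(sequences, max_len=None, pad_idx=PAD_IDX):
    """Pad (or truncate) lists of indices so that all have the same length."""
    if max_len is None:
        # Default to the longest sequence in the batch
        max_len = max((len(seq) for seq in sequences), default=0)
    batch = []
    for seq in sequences:
        row = list(seq[:max_len])                 # truncate if too long
        row += [pad_idx] * (max_len - len(row))   # pad if too short
        batch.append(row)
    return batch


batch = pad_batch([[5, 3], [7, 2, 9, 4], [6]])
print(batch)
```

Rows of equal length can then be converted to a single tensor, for example with `torch.tensor(batch)`.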