## Text Representation Methods
Text representation is a fundamental task in Natural Language Processing (NLP), transforming unstructured text data into numerical forms that computers can process.
This article will systematically introduce common text representation methods used in NLP, from traditional approaches to modern deep learning techniques, helping readers gain a comprehensive understanding of this core concept.
* * *
## Traditional Text Representations
### Bag-of-Words Model
The Bag-of-Words model is one of the simplest text representation methods, treating text as an unordered collection of words.
#### Basic Concepts
* Ignores word order and grammar, keeping only which words appear and how often
* Builds a vocabulary list and counts how often each word occurs in a document
* Ultimately represented as a high-dimensional sparse vector
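The counting step can be sketched in plain Python before turning to a library. This is a minimal illustration that assumes lowercasing and whitespace tokenization; `build_vocabulary` and `bag_of_words` are illustrative names, not library functions.

```python
from collections import Counter

def build_vocabulary(docs):
    """Return a sorted list of the distinct lowercase tokens in docs."""
    return sorted({token for doc in docs for token in doc.lower().split()})

def bag_of_words(doc, vocabulary):
    """Count how often each vocabulary word occurs in doc."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocabulary]

toy_docs = ["the cat sat", "the dog sat on the mat"]
vocabulary = build_vocabulary(toy_docs)
vectors = [bag_of_words(doc, vocabulary) for doc in toy_docs]

print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)     # [[1, 0, 0, 0, 1, 1], [0, 1, 1, 1, 1, 2]]
```

Each vector has one entry per vocabulary word, so most entries are zero once the vocabulary grows, which is where the sparsity comes from.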
#### Code Example
```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]

# Learn the vocabulary and build the document-term count matrix
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # vocabulary, in column order
print(X.toarray())                         # one row of counts per document
```
#### Pros and Cons Analysis
Pros:
* Simple implementation and high computational efficiency
* Suitable for small datasets and simple tasks
Cons:
* Ignores word order and semantic information
* Suffers from high-dimensional sparsity
* Cannot handle synonyms or polysemy
* * *
### TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) improves on the Bag-of-Words model by weighting each word by how distinctive it is across the entire corpus, not just by how often it occurs in one document.
#### Calculation Formula
* TF (Term Frequency): `Number of times a word appears in a document / Total number of words in the document`
* IDF (Inverse Document Frequency): `log(Total number of documents / Number of documents containing the word)`
* TF-IDF = TF × IDF
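The formulas above can be worked through directly. The sketch below assumes the natural logarithm and pre-tokenized documents; the function names are illustrative. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from this textbook version.

```python
import math

def tf(term, doc_tokens):
    """Occurrences of term divided by the number of tokens in the document."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, all_docs):
    """log(N / number of documents containing term); term must occur somewhere."""
    containing = sum(1 for tokens in all_docs if term in tokens)
    return math.log(len(all_docs) / containing)

def tf_idf(term, doc_tokens, all_docs):
    return tf(term, doc_tokens) * idf(term, all_docs)

toy_docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
]
print(tf_idf("the", toy_docs[0], toy_docs))  # 0.0, because "the" is in every document
print(tf_idf("cat", toy_docs[0], toy_docs))  # (1/3) * log(3), about 0.366
```

A word that appears in every document gets an IDF of zero, which is how TF-IDF suppresses common words.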
#### Code Implementation
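```python
from sklearn.feature_extraction.text import TfidfVectorizer

# `corpus` is the list of documents defined in the Bag-of-Words example
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(corpus)

print(tfidf_vectorizer.get_feature_names_out())
print(X_tfidf.toarray())  # one row of TF-IDF weights per document
```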
#### Pros and Cons
Pros:
* Reduces the influence of common words, highlighting important ones
* Performs better than the simple Bag-of-Words model
Cons:
* Still unable to capture semantic relationships
* High-dimensionality issues persist
* * *
### N-gram Model
The N-gram model considers word sequence information by representing text through combinations of n consecutive words.
#### Common Types
* Unigram (1-gram): A single word
* Bigram (2-gram): Combination of two consecutive words
* Trigram (3-gram): Combination of three consecutive words
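Extracting n-grams from a token list takes only a sliding window. The helper below is a minimal sketch with an illustrative name, `ngrams`, and assumes the text has already been split into tokens.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this is the first document".split()

print(ngrams(tokens, 1))  # unigrams: one tuple per word
print(ngrams(tokens, 2))  # [('this', 'is'), ('is', 'the'), ('the', 'first'), ('first', 'document')]
print(ngrams(tokens, 3))  # trigrams: three overlapping windows
```

A sequence of length L yields L - n + 1 n-grams, and none if it is shorter than n.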
#### Code Example
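```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(2, 2) keeps bigrams only; `corpus` is defined in the Bag-of-Words example
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
X_bigram = bigram_vectorizer.fit_transform(corpus)
print(bigram_vectorizer.get_feature_names_out())
```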
#### Pros and Cons
Pros:
* Captures local word sequence information
* Can represent phrases and fixed expressions
Cons:
* Worsens the curse of dimensionality
* Still unable to handle long-range dependencies
* * *
## Word Vector Representations
### Word2Vec Principles and Implementation
Word2Vec is a neural network-based word vector representation method proposed by Google in 2013.
#### Two Model Architectures
1. **CBOW (Continuous Bag of Words)**: Predicts the current word based on its context
2. **Skip-gram**: Predicts the context based on the current word
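The difference between the two architectures shows up in how training examples are built from a sentence. The sketch below only constructs those examples, using a fixed window and illustrative function names; it leaves out the neural network itself and the sampling tricks used in practice.

```python
def context_of(tokens, i, window):
    """Words within `window` positions of index i, excluding the word itself."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def cbow_examples(tokens, window=2):
    """(context words, target word): the context predicts the target."""
    return [(context_of(tokens, i, window), tokens[i]) for i in range(len(tokens))]

def skipgram_examples(tokens, window=2):
    """(target word, context word): the target predicts each neighbour."""
    return [
        (tokens[i], neighbour)
        for i in range(len(tokens))
        for neighbour in context_of(tokens, i, window)
    ]

tokens = ["the", "cat", "says", "meow"]
print(cbow_examples(tokens, window=1))
print(skipgram_examples(tokens, window=1))
```

CBOW produces one example per position, while Skip-gram produces one example per (word, neighbour) pair, so it sees more training signal for rare words.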
#### Code Implementation
```python
from gensim.models import Word2Vec

# Training data: a list of tokenized sentences
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Get the vector for a word
vector = model.wv['cat']

# Find the most similar words
similar_words = model.wv.most_similar('cat')
```
#### Features
* Low-dimensional dense vectors (typically 50–300 dimensions)
* Can capture semantic and syntactic relationships between words
* Supports vector operations (e.g., king - man + woman ≈ queen)
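The analogy arithmetic can be illustrated with hand-made vectors. The two-dimensional values below are invented purely for illustration (roughly "royalty" and "gender" axes) and are not taken from any trained model; `analogy` is an illustrative helper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors, chosen so the analogy works exactly
toy_vectors = {
    "king": [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man": [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.1],
}

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman", toy_vectors))  # queen
```

With trained embeddings the result is only approximately equal to the expected word, which is why the nearest neighbour by cosine similarity is used.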
* * *
### GloVe Word Vectors
GloVe (Global Vectors for Word Representation) combines global statistical information with local contextual windows.
#### Core Idea
* Based on a word co-occurrence matrix
* Fits vectors so that the dot product of two word vectors (plus bias terms) approximates the logarithm of their co-occurrence count
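The co-occurrence matrix that GloVe starts from can be sketched as a dictionary of pair counts. This is a simplified illustration with an assumed symmetric window and plain counts; the original GloVe implementation additionally down-weights distant pairs, which is omitted here.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[(word, tokens[j])] += 1.0
    return dict(counts)

toy_sentences = [["cat", "says", "meow"], ["dog", "says", "woof"]]
counts = cooccurrence_counts(toy_sentences, window=1)

print(counts[("says", "meow")])    # 1.0
print(("cat", "meow") in counts)   # False: two positions apart, outside the window
```

Training then adjusts the word vectors so that their dot products track the logarithm of these counts.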
#### Comparison with Word2Vec
| Feature | Word2Vec | GloVe |
| --- | --- | --- |
| Training Method | Local Window | Global Statistics |
| Computational Efficiency | Higher | Lower |
| Performance on Small Datasets | Better | Average |
| Performance on Large Datasets | Good | Better |
* * *
### FastText
FastText is a word vector model developed by Facebook, notable for considering subword information.
#### Main Features
* Represents words as collections of character n-grams
* Can handle out-of-vocabulary (OOV) words
* Particularly suitable for morphologically rich languages
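The subword idea can be sketched as follows. The helper `char_ngrams` is illustrative, not part of any library; it assumes the `<` and `>` boundary markers and the 3 to 6 character range described in the FastText paper.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word` wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

print(char_ngrams("where", min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']

# An unseen word still shares n-grams with known words
shared = set(char_ngrams("whereas")) & set(char_ngrams("where"))
print(sorted(shared))
```

A word's vector is built from the vectors of its n-grams, so an out-of-vocabulary word can still be represented through the n-grams it shares with words seen during training.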
#### Code Example
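```python
from gensim.models import FastText

# Same toy sentences as in the Word2Vec example
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
model = FastText(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Even words not in the vocabulary get a vector, built from their character n-grams
vector = model.wv['unseenword']
```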
* * *
## Context-Aware Representations
### ELMo Model
ELMo (Embeddings from Language Models) is one of the earliest context-aware word representation methods.
#### Key Features
* Based on bidirectional LSTM language models
* Word representations depend on the entire input sentence
* Generates multi-layered representations (can combine different semantic layers)
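Combining the layers can be sketched as a softmax-weighted sum scaled by a single factor, which is the form described in the ELMo paper. The vectors and scores below are made up for illustration; in practice the layer scores and the scale `gamma` are learned by the downstream task.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mix_layers(layer_vectors, layer_scores, gamma=1.0):
    """Combine one token's per-layer vectors into a single vector."""
    weights = softmax(layer_scores)
    dim = len(layer_vectors[0])
    return [
        gamma * sum(w * layer[d] for w, layer in zip(weights, layer_vectors))
        for d in range(dim)
    ]

# Made-up vectors for one token from three layers
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = mix_layers(layers, layer_scores=[0.0, 0.0, 0.0])
print(mixed)  # equal weights, so each entry is about 0.667
```

Raising one layer's score shifts the mix toward that layer, which lets different tasks emphasize different kinds of information.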
* * *
### BERT and Its Variants
BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model introduced by Google.
#### Key Innovations
* Transformer architecture
* Masked Language Modeling (MLM) training objective
* Next Sentence Prediction (NSP) task
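The MLM objective can be sketched as a corruption step applied to the input tokens. The function below follows the rates reported in the BERT paper (15% of tokens selected; of those, 80% replaced by `[MASK]`, 10% by a random token, 10% left unchanged). `mask_tokens` and its parameters are illustrative, not the actual pre-training code.

```python
import random

def mask_tokens(tokens, vocabulary, mask_prob=0.15, seed=0):
    """Return (corrupted tokens, labels); labels are None where nothing is predicted."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)  # the model must recover the original token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append("[MASK]")
            elif roll < 0.9:
                corrupted.append(rng.choice(vocabulary))  # random replacement
            else:
                corrupted.append(token)  # kept as-is, but still predicted
        else:
            labels.append(None)
            corrupted.append(token)
    return corrupted, labels

tokens = ["the", "cat", "sat", "on", "the", "mat"]
vocabulary = ["the", "cat", "sat", "on", "mat", "dog"]
corrupted, labels = mask_tokens(tokens, vocabulary, mask_prob=0.5)
print(corrupted)
print(labels)
```

Because the model sees the tokens on both sides of each masked position, it learns representations that use left and right context together.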
#### Common Variants
1. **RoBERTa**: Optimized training strategy
2. **DistilBERT**: Lightweight version of BERT
3. **ALBERT**: Parameter-sharing reduces model size
#### Code Example
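```python
from transformers import BertTokenizer, BertModel

# Load the pre-trained tokenizer and encoder weights
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Tokenize one sentence into PyTorch tensors and run the encoder
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

# One contextual vector per token: shape (batch_size, sequence_length, hidden_size)
last_hidden_states = outputs.last_hidden_state
```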
* * *
### Overview of Pre-trained Language Models
Modern NLP primarily uses the pre-training + fine-tuning paradigm:
1. **Pre-training Phase**: Trains general-purpose language representations on large-scale corpora
2. **Fine-tuning Phase**: Adjusts model parameters on task-specific data
#### Model Comparison
| Model | Release Date | Main Features |
| --- | --- | --- |
| Word2Vec | 2013 | Static word vectors |
| GloVe | 2014 | Global statistics + local windows |
| ELMo | 2018 | Bidirectional LSTM, context-aware |
| BERT | 2018 | Transformer, bidirectional context |
| GPT-3 | 2020 | Unidirectional Transformer, strong generative capabilities |
* * *
## Document-Level Representations
### Doc2Vec
Doc2Vec is an extension of Word2Vec, capable of directly learning vector representations for entire documents.
#### Two Models
1. **PV-DM (Distributed Memory)**: Similar to CBOW, with added document ID
2. **PV-DBOW (Distributed Bag of Words)**: Similar to Skip-gram
#### Code Example
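```python
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from gensim.utils import simple_preprocess

# Each document needs a token list and a unique tag; `corpus` is the list of strings defined earlier
documents = [TaggedDocument(simple_preprocess(doc), [i]) for i, doc in enumerate(corpus)]
# dm=1 (the default) trains PV-DM; pass dm=0 for PV-DBOW
model = Doc2Vec(documents, vector_size=100, window=5, min_count=1, workers=4)
# Infer a vector for a new, already tokenized document
vector = model.infer_vector(["new", "document", "text"])
```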
* * *
### Sentence Vectors and Document Vectors
#### Common Methods
1. **Averaging**: Take the average of word vectors
2. **SIF**: Smooth Inverse Frequency weighted averaging
3. **BERT Sentence Vectors**: Use the `[CLS]` token embedding, or average all token vectors
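The simplest of these methods, plain averaging, can be sketched in a few lines. The word vectors below are made up for illustration, and skipping unknown words is one possible choice among several.

```python
def average_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def sentence_vector(tokens, word_vectors):
    """Average the vectors of the tokens that have one; None if none do."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    return average_vectors(known) if known else None

# Made-up two-dimensional word vectors
toy_word_vectors = {"cats": [1.0, 0.0], "purr": [0.0, 1.0]}

print(sentence_vector(["cats", "purr", "loudly"], toy_word_vectors))  # [0.5, 0.5]
```

Averaging discards word order, so "dog bites man" and "man bites dog" get the same vector; SIF weighting and contextual models address different parts of that limitation.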
#### Code Example (using Sentence-BERT)
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = ["This is an example sentence", "Each sentence is converted"]
# encode() returns one fixed-size embedding per sentence
embeddings = model.encode(sentences)
```
* * *
### Topic Modeling (LDA)
Latent Dirichlet Allocation (LDA) is an unsupervised topic modeling method.
#### Basic Principle
* Represents documents as mixtures of multiple topics
* Each topic is a probability distribution over words
* Learned via variational inference or Gibbs sampling
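The generative story that LDA assumes can be sketched by sampling a document from known topics. Everything below is illustrative: the two toy topics, the `alpha` values, and the function names are assumptions, and real LDA runs the process in reverse, inferring topics and mixtures from observed documents.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw topic proportions from a Dirichlet via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, seed=0):
    """topics: list of {word: probability} dicts, one per topic."""
    rng = random.Random(seed)
    mixture = sample_dirichlet(alpha, rng)  # this document's topic proportions
    words = []
    for _ in range(length):
        k = rng.choices(range(len(topics)), weights=mixture)[0]  # pick a topic
        vocab, probs = zip(*topics[k].items())
        words.append(rng.choices(vocab, weights=probs)[0])       # pick a word from it
    return mixture, words

toy_topics = [
    {"goal": 0.5, "team": 0.5},   # a "sports" topic
    {"vote": 0.5, "party": 0.5},  # a "politics" topic
]
mixture, words = generate_document(toy_topics, alpha=[0.5, 0.5], length=8)
print(mixture)
print(words)
```

The `mixture` vector is the document-level representation LDA provides: a short, dense vector of topic proportions.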
#### Code Example
```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# LDA works on raw word counts; `corpus` is the list of strings defined earlier
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Fit a model with two topics
lda = LatentDirichletAllocation(n_components=2)
lda.fit(X)
```
#### Application Scenarios
* Document clustering
* Content recommendation
* Text summarization
* * *
## Summary
The evolution of text representation methods has progressed from simple statistics to deep learning:
1. **Traditional Methods**: Simple and efficient, suitable for small datasets
2. **Word Vectors**: Capture semantic relationships, low-dimensional
3. **Context-Aware Models**: Dynamic representations, usually the strongest results but at a higher computational cost
4. **Document Representations**: Expand from word level to document level
When choosing a text representation method, consider:
* Task requirements (whether semantic understanding is needed)
* Data scale
* Computational resources
* Language characteristics
With the development of large language models, text representation technologies continue to evolve rapidly, yet understanding these foundational methods remains crucial for mastering NLP.