Text Representation

## Text Representation Methods

Text representation is a fundamental task in Natural Language Processing (NLP): it transforms unstructured text into numerical forms that computers can process. This article systematically introduces the common text representation methods used in NLP, from traditional approaches to modern deep learning techniques, to give readers a comprehensive understanding of this core concept.

* * *

## Traditional Text Representations

### Bag-of-Words Model

The Bag-of-Words model is one of the simplest text representation methods. It treats text as an unordered collection of words.

#### Basic Concepts

* Ignores word order and grammar, keeping only which words appear and how often
* Builds a vocabulary and counts how often each word occurs in a document
* Represents each document as a high-dimensional sparse vector

#### Code Example

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)      # sparse document-term count matrix

print(vectorizer.get_feature_names_out())  # learned vocabulary
print(X.toarray())                         # one row of counts per document
```

#### Pros and Cons Analysis

✅ Pros:

* Simple to implement and computationally efficient
* Suitable for small datasets and simple tasks

❌ Cons:

* Ignores word order and semantic information
* Suffers from high-dimensional sparsity
* Cannot handle synonyms or polysemy

* * *

### TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) improves on the Bag-of-Words model by taking into account how important each word is across the entire corpus.
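To make "importance across the entire corpus" concrete, here is a minimal pure-Python sketch that weights each word by its term frequency times the logarithm of the inverse document frequency. The helper names `tokenize` and `tf_idf` are illustrative and not part of any library. It uses the plain textbook definition, so the numbers will not match scikit-learn's `TfidfVectorizer`, which applies IDF smoothing and L2 normalization by default.

```python
import math
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]

def tokenize(text):
    """Lowercase the text and keep only alphabetic tokens (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())

def tf_idf(texts):
    """Return one {word: weight} dict per document using tf * log(N / df)."""
    docs = [tokenize(text) for text in texts]
    n_docs = len(docs)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights

weights = tf_idf(corpus)

# Print the weights for the second document, highest first.
for word, weight in sorted(weights[1].items(), key=lambda item: -item[1]):
    print(f"{word:10s} {weight:.4f}")
```

In the second document, "this" gets a weight of 0 because it occurs in every document, while "second" outranks "document" even though "document" appears twice: "second" is unique to that document, so its inverse document frequency is much larger.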
#### Calculation Formula

* TF (Term Frequency): `Number of times a word appears in a document / Total number of words in the document`
* IDF (Inverse Document Frequency): `log(Total number of documents / Number of documents containing the word)`
* TF-IDF = TF × IDF

#### Code Implementation

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Reuses `corpus` from the Bag-of-Words example.
# Note: TfidfVectorizer smooths the IDF term and L2-normalizes each row by default,
# so its values differ from the plain formula above.
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(corpus)

print(tfidf_vectorizer.get_feature_names_out())
print(X_tfidf.toarray())
```

#### Pros and Cons

✅ Pros:

* Reduces the influence of common words and highlights distinctive ones
* Usually performs better than raw Bag-of-Words counts

❌ Cons:

* Still unable to capture semantic relationships
* High-dimensionality issues persist

* * *

### N-gram Model

The N-gram model captures word sequence information by representing text through combinations of n consecutive words.

#### Common Types

* Unigram (1-gram): A single word
* Bigram (2-gram): Combination of two consecutive words
* Trigram (3-gram): Combination of three consecutive words

#### Code Example

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(2, 2) keeps bigrams only; reuses `corpus` from the Bag-of-Words example.
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
X_bigram = bigram_vectorizer.fit_transform(corpus)

print(bigram_vectorizer.get_feature_names_out())
```

#### Pros and Cons

✅ Pros:

* Captures local word sequence information
* Can represent phrases and fixed expressions

❌ Cons:

* Worsens the curse of dimensionality, since the vocabulary grows quickly with n
* Still unable to handle long-range dependencies

* * *

## Word Vector Representations

### Word2Vec Principles and Implementation

Word2Vec is a neural network-based word vector representation method proposed by Google in 2013.

#### Two Model Architectures

1. **CBOW (Continuous Bag of Words)**: Predicts the current word based on its context
2. **Skip-gram**: Predicts the context based on the current word

#### Code Implementation

```python
from gensim.models import Word2Vec

# Each sentence is a list of tokens.
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Get the vector for a word
vector = model.wv['cat']

# Find the most similar words
similar_words = model.wv.most_similar('cat')
```

#### Features

* Low-dimensional dense vectors (typically 50–300 dimensions)
* Can capture semantic and syntactic relationships between words
* Supports vector arithmetic (e.g., king - man + woman ≈ queen)

* * *

### GloVe Word Vectors

GloVe (Global Vectors for Word Representation) combines global statistical information with local context windows.

#### Core Idea

* Based on a word co-occurrence matrix
* Trains the vectors so that the dot product of two word vectors (plus bias terms) approximates the logarithm of their co-occurrence count

#### Comparison with Word2Vec

| Feature | Word2Vec | GloVe |
| --- | --- | --- |
| Training Method | Local Window | Global Statistics |
| Computational Efficiency | Higher | Lower |
| Performance on Small Datasets | Better | Average |
| Performance on Large Datasets | Good | Better |

* * *

### FastText

FastText is a word vector model developed by Facebook, notable for taking subword information into account.

#### Main Features

* Represents words as collections of character n-grams
* Can handle out-of-vocabulary (OOV) words
* Particularly suitable for morphologically rich languages

#### Code Example

```python
from gensim.models import FastText

# Reuses `sentences` from the Word2Vec example.
model = FastText(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Even a word that never appeared in training gets a vector,
# built from its character n-grams.
vector = model.wv['unseenword']
```

* * *

## Context-Aware Representations

### ELMo Model

ELMo (Embeddings from Language Models) is one of the earliest context-aware word representation methods.
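The difference between a static and a context-aware representation can be illustrated with a deliberately tiny sketch. This is not ELMo: the 2-d vectors are made-up numbers, and the hypothetical `contextual_vector` function simply averages a word's vector with its neighbours' as a stand-in for a real contextual encoder. It only shows that the same word can receive different vectors in different sentences.

```python
# Made-up 2-d "static" embeddings, chosen only so the arithmetic is easy to follow.
STATIC = {
    "river": (0.0, 1.0),
    "money": (1.0, 0.0),
    "bank": (0.5, 0.5),
}

def static_vector(tokens, i):
    """Static lookup: the vector depends only on the word itself."""
    return STATIC[tokens[i]]

def contextual_vector(tokens, i, window=1):
    """Toy context mixing: average the word's vector with its neighbours' vectors."""
    start, stop = max(0, i - window), min(len(tokens), i + window + 1)
    span = [STATIC[token] for token in tokens[start:stop]]
    return tuple(sum(dim) / len(span) for dim in zip(*span))

river_sentence = ["river", "bank"]
money_sentence = ["money", "bank"]

# The static vector for "bank" is identical in both sentences.
print(static_vector(river_sentence, 1) == static_vector(money_sentence, 1))  # True

# The context-mixed vector for "bank" shifts toward its neighbour.
print(contextual_vector(river_sentence, 1))  # (0.25, 0.75)
print(contextual_vector(money_sentence, 1))  # (0.75, 0.25)
```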
#### Key Features

* Based on bidirectional LSTM language models
* Word representations depend on the entire input sentence
* Produces a representation at each layer, and downstream tasks can combine layers that capture different levels of information

#### Architectural Diagram

*[Figure: ELMo architecture diagram]*

* * *

### BERT and Its Variants

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model introduced by Google.

#### Key Innovations

* Transformer encoder architecture
* Masked Language Modeling (MLM) training objective
* Next Sentence Prediction (NSP) task

#### Common Variants

1. **RoBERTa**: Optimized training strategy
2. **DistilBERT**: Lightweight, distilled version of BERT
3. **ALBERT**: Parameter sharing reduces model size

#### Code Example

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")

# No gradients are needed when only extracting representations.
with torch.no_grad():
    outputs = model(**inputs)

# Shape: (batch_size, sequence_length, hidden_size)
last_hidden_states = outputs.last_hidden_state
```

* * *

### Overview of Pre-trained Language Models

Modern NLP primarily uses the pre-training + fine-tuning paradigm:

1. **Pre-training Phase**: Trains general-purpose language representations on large-scale corpora
2. **Fine-tuning Phase**: Adjusts model parameters on task-specific data

#### Model Comparison

| Model | Release Year | Main Features |
| --- | --- | --- |
| Word2Vec | 2013 | Static word vectors |
| GloVe | 2014 | Global statistics + local windows |
| ELMo | 2018 | Bidirectional LSTM, context-aware |
| BERT | 2018 | Transformer, bidirectional context |
| GPT-3 | 2020 | Unidirectional Transformer, strong generative capabilities |

* * *

## Document-Level Representations

### Doc2Vec

Doc2Vec is an extension of Word2Vec that directly learns vector representations for entire documents.

#### Two Models

1. **PV-DM (Distributed Memory)**: Similar to CBOW, with an added document ID
2. **PV-DBOW (Distributed Bag of Words)**: Similar to Skip-gram

#### Code Example

```python
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

# Reuses `corpus` from the Bag-of-Words example.
# Each document is tokenized and tagged with its index.
documents = [
    TaggedDocument(words=doc.lower().split(), tags=[i])
    for i, doc in enumerate(corpus)
]
model = Doc2Vec(documents, vector_size=100, window=5, min_count=1, workers=4)

# Infer a vector for a new, unseen document
vector = model.infer_vector(["new", "document", "text"])
```

* * *

### Sentence Vectors and Document Vectors

#### Common Methods

1. **Averaging**: Take the average of the word vectors
2. **SIF**: Smooth Inverse Frequency weighted averaging
3. **BERT Sentence Vectors**: Use the `[CLS]` token or average all token vectors

#### Code Example (using Sentence-BERT)

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = ["This is an example sentence", "Each sentence is converted"]

# One fixed-size embedding per sentence
embeddings = model.encode(sentences)
```

* * *

### Topic Modeling (LDA)

Latent Dirichlet Allocation (LDA) is an unsupervised topic modeling method.

#### Basic Principle

* Represents documents as mixtures of multiple topics
* Each topic is a probability distribution over words
* Learned via variational inference or Gibbs sampling

#### Code Example

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Reuses `corpus` from the Bag-of-Words example; LDA works on raw word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

lda = LatentDirichletAllocation(n_components=2)  # two topics
lda.fit(X)
```

#### Application Scenarios

* Document clustering
* Content recommendation
* Text summarization

* * *

## Summary

Text representation methods have evolved from simple statistics to deep learning:

1. **Traditional Methods**: Simple and efficient, suitable for small datasets
2. **Word Vectors**: Capture semantic relationships, low-dimensional
3. **Context-Aware Models**: Dynamic representations, best performance but higher computational cost
4. **Document Representations**: Expand from word level to document level

When choosing a text representation method, consider:

* Task requirements (whether semantic understanding is needed)
* Data scale
* Computational resources
* Language characteristics

With the development of large language models, text representation technologies continue to evolve rapidly, yet understanding these foundational methods remains crucial for mastering NLP.
