Ai Vector Database
## Vector Database
In the AI era, the form of data we process has fundamentally changed.
Traditional databases store structured data: numbers, strings, dates.
AI applications process semantics: the meaning of a text, the content of an image, the meaning of an audio clip. This semantic information is represented as vectorsβa string of floating-point numbers, such as [0.123, -0.456, 0.789, ...].
A vector database is specifically designed to store, index, and query these vectors. Without it, RAG (Retrieval-Augmented Generation) cannot work, semantic search cannot be implemented, and recommendation systems cannot run efficiently.
!(#)
> Core Value of Vector Databases: Find the top K results most similar to the query vector among millions or even billions of vectors in milliseconds.
* * *
## Challenges of Vector Similarity Search
Understanding the nature of the problem is essential to understanding why specialized algorithms are needed.
### Complexity Issue of Brute Force Search
The simplest approach is brute force search: calculate the similarity between the query vector and every vector in the database, sort them, and take the top K.
## Example
# ============================================
# Brute Force Search Demo: Simple but Inefficient
# ============================================
import numpy as np
from typing import List, Tuple
def l2_distance(v1: np.ndarray, v2: np.ndarray) ->float:
"""Calculate L2 Euclidean distance: smaller means more similar"""
return np.sqrt(np.sum((v1 - v2) ** 2))
def brute_force_search(
query: np.ndarray,
vectors: List[np.ndarray],
top_k: int=5
) -> List[Tuple[int,float]]:
"""
Brute force search: calculate distance with all vectors, sort and return Top K
Pros: simple and accurate, 100% recall
Cons: extremely slow with large datasets
"""
# Calculate distance with each vector
distances =[]
for idx, vec in enumerate(vectors):
dist = l2_distance(query, vec)
distances.append((idx, dist))
# Sort by distance, take first K
distances.sort(key=lambda x: x)
return distances[:top_k]
# Generate 10000 random 128-dimensional vectors (simulating a vector database)
np.random.seed(42)
vector_count =10000
dimension =128
vectors =[np.random.randn(dimension)for _ in range(vector_count)]
# Generate a query vector
query_vector = np.random.randn(dimension)
# Execute brute force search
print(f"Searching Top 5 among {vector_count} vectors...")
results = brute_force_search(query_vector, vectors, top_k=5)
print("Search results:")
for idx, dist in results:
print(f" Vector {idx:4d}, distance = {dist:.4f}")
# Output similar to:
# Searching Top 5 among 10000 vectors...
# Search results:
# Vector 8236, distance = 13.3986
# Vector 8946, distance = 13.5604
# Vector 9891, distance = 13.5688
# Vector 5172, distance = 13.5987
# Vector 5223, distance = 13.6081
The problem with brute force is obvious: every query requires traversing all vectors.
If there are 1 million vectors, each query requires calculating 1 million distances.
If there are 1 billion vectors, each query requires calculating 1 billion timesβthis is completely unacceptable in production environments.
### The "Curse of Dimensionality" in High-Dimensional Space
To make matters worse, vectors are usually high-dimensional.
OpenAI's text-embedding-3-small is 1536-dimensional, and large is 3072-dimensional.
In high-dimensional space, traditional spatial indexing methods (like KD-Tree) efficiency drops dramatically, and they can even be slower than brute force search.
| Data Scale | Brute Force Time (Estimated) | Acceptable? |
| --- | --- | --- |
| 10 thousand | ~1 ms | Acceptable |
| 1 million | ~100 ms | A bit slow |
| 10 million | ~1 second | Too slow |
| 100 million | ~10 seconds | Completely unacceptable |
| 1 billion | ~100 seconds | Unusable |
This is why Approximate Nearest Neighbor (ANN) algorithms are needed: sacrificing 100% accuracy in exchange for 100x or even 1000x speed improvement.
* * *
## Approximate Nearest Neighbor (ANN) Algorithms
The core idea of ANN algorithms: build an index structure so that queries don't need to traverse all vectors.
There are three mainstream algorithms: HNSW, IVF, and PQ.
### HNSW: Hierarchical Navigable Small World
HNSW (Hierarchical Navigable Small World) is
YouTip