YouTip LogoYouTip

Ai Vector Database

## Vector Database In the AI era, the form of data we process has fundamentally changed. Traditional databases store structured data: numbers, strings, dates. AI applications process semantics: the meaning of a text, the content of an image, the meaning of an audio clip. This semantic information is represented as vectorsβ€”a string of floating-point numbers, such as [0.123, -0.456, 0.789, ...]. A vector database is specifically designed to store, index, and query these vectors. Without it, RAG (Retrieval-Augmented Generation) cannot work, semantic search cannot be implemented, and recommendation systems cannot run efficiently. !(#) > Core Value of Vector Databases: Find the top K results most similar to the query vector among millions or even billions of vectors in milliseconds. * * * ## Challenges of Vector Similarity Search Understanding the nature of the problem is essential to understanding why specialized algorithms are needed. ### Complexity Issue of Brute Force Search The simplest approach is brute force search: calculate the similarity between the query vector and every vector in the database, sort them, and take the top K. ## Example # ============================================ # Brute Force Search Demo: Simple but Inefficient # ============================================ import numpy as np from typing import List, Tuple def l2_distance(v1: np.ndarray, v2: np.ndarray) ->float: """Calculate L2 Euclidean distance: smaller means more similar""" return np.sqrt(np.sum((v1 - v2) ** 2)) def brute_force_search( query: np.ndarray, vectors: List[np.ndarray], top_k: int=5 ) -> List[Tuple[int,float]]: """ Brute force search: calculate distance with all vectors, sort and return Top K Pros: simple and accurate, 100% recall Cons: extremely slow with large datasets """ # Calculate distance with each vector distances =[] for idx, vec in enumerate(vectors): dist = l2_distance(query, vec) distances.append((idx, dist)) # Sort by distance, take first K distances.sort(key=lambda x: x) return distances[:top_k] # Generate 10000 random 128-dimensional vectors (simulating a vector database) np.random.seed(42) vector_count =10000 dimension =128 vectors =[np.random.randn(dimension)for _ in range(vector_count)] # Generate a query vector query_vector = np.random.randn(dimension) # Execute brute force search print(f"Searching Top 5 among {vector_count} vectors...") results = brute_force_search(query_vector, vectors, top_k=5) print("Search results:") for idx, dist in results: print(f" Vector {idx:4d}, distance = {dist:.4f}") # Output similar to: # Searching Top 5 among 10000 vectors... # Search results: # Vector 8236, distance = 13.3986 # Vector 8946, distance = 13.5604 # Vector 9891, distance = 13.5688 # Vector 5172, distance = 13.5987 # Vector 5223, distance = 13.6081 The problem with brute force is obvious: every query requires traversing all vectors. If there are 1 million vectors, each query requires calculating 1 million distances. If there are 1 billion vectors, each query requires calculating 1 billion timesβ€”this is completely unacceptable in production environments. ### The "Curse of Dimensionality" in High-Dimensional Space To make matters worse, vectors are usually high-dimensional. OpenAI's text-embedding-3-small is 1536-dimensional, and large is 3072-dimensional. In high-dimensional space, traditional spatial indexing methods (like KD-Tree) efficiency drops dramatically, and they can even be slower than brute force search. | Data Scale | Brute Force Time (Estimated) | Acceptable? | | --- | --- | --- | | 10 thousand | ~1 ms | Acceptable | | 1 million | ~100 ms | A bit slow | | 10 million | ~1 second | Too slow | | 100 million | ~10 seconds | Completely unacceptable | | 1 billion | ~100 seconds | Unusable | This is why Approximate Nearest Neighbor (ANN) algorithms are needed: sacrificing 100% accuracy in exchange for 100x or even 1000x speed improvement. * * * ## Approximate Nearest Neighbor (ANN) Algorithms The core idea of ANN algorithms: build an index structure so that queries don't need to traverse all vectors. There are three mainstream algorithms: HNSW, IVF, and PQ. ### HNSW: Hierarchical Navigable Small World HNSW (Hierarchical Navigable Small World) is
← Ai Deep LearningAi Fine Tuning β†’