AI System Architecture

You got a large model demo running on your laptop and it works well, but when you try to turn it into a real product, problems arise:

What if the model has too many parameters to fit on a single GPU?
Training takes weeks—how do you resume from checkpoints if something fails midway?
With millions of user requests daily, how do you keep latency under 1 second?
How do you collect user feedback to keep the model evolving?

These are the problems that AI system architecture solves.

Demos focus on whether it can run at all; production systems focus on whether it can run stably, efficiently, and cost-effectively.

Characteristics of production-grade AI systems: 24/7 availability, support for millions of concurrent users, observability, scalability, disaster recovery, and cost control.

Unique Challenges of AI Systems

Compared to traditional web services, AI systems have three unique challenges.

Challenge 1: Non-deterministic Output

Traditional systems produce deterministic output—you input 1+1, it always returns 2.

AI systems produce probabilistic output—the same prompt may generate different results each time.

This raises several questions: How do you ensure output quality? How do you evaluate performance? How do you handle hallucinations?

Typical solutions include: adding sampling strategies at the output layer, post-processing and filtering results, human feedback loops, and multi-model voting.
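As a rough illustration of the voting idea, the sketch below takes several candidate answers (from different models, or from repeated sampling of one model) and keeps the answer only when enough of them agree. The function name `majority_vote` and the `min_agreement` parameter are hypothetical, chosen for this example.

```python
from collections import Counter
from typing import List, Optional, Tuple


def majority_vote(answers: List[str],
                  min_agreement: float = 0.5) -> Tuple[Optional[str], float]:
    """Return the most common answer and its share of the votes.

    If no answer reaches `min_agreement`, return (None, share) so the caller
    can fall back to a human reviewer or a stronger model.
    """
    if not answers:
        return None, 0.0
    # Normalize lightly so trivial formatting differences do not split votes.
    normalized = [a.strip().lower() for a in answers]
    winner, count = Counter(normalized).most_common(1)[0]
    share = count / len(normalized)
    return (winner if share >= min_agreement else None), share


# Hypothetical outputs from three models (or three samples of one model).
print(majority_vote(["Paris", "paris ", "Lyon"]))  # winner 'paris', share 2/3
print(majority_vote(["A", "B", "C"]))              # no winner, share 1/3
```

Exact string matching only works for short, closed-form answers; free-form text would need a similarity measure instead.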

Challenge 2: The Latency-Cost Tradeoff

AI inference requires massive computation, which means there's a natural tension between latency and cost.

Want it fast? Use more GPUs, and costs skyrocket.

Want to save money? Queue the requests, and user experience suffers.

The core of production systems is finding the balance point between SLA (Service Level Agreement) and cost.


Optimization Direction	Common Techniques	Effect
Model Compression	Quantization, Pruning, Distillation	2-4x speedup, slight accuracy drop
Inference Optimization	vLLM, TensorRT, FlashAttention	3-10x throughput improvement
Architecture Design	Batch processing, multi-level caching	50%-80% reduction in per-request cost
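To make the caching row of the table more concrete, here is a minimal sketch of a single in-memory cache level placed in front of a model call. The class `PromptCache` and the stand-in `fake_model` are hypothetical; a real system would typically add an expiry policy and a shared cache such as Redis behind this level.

```python
import hashlib
from collections import OrderedDict
from typing import Callable


class PromptCache:
    """Tiny in-memory LRU cache keyed by a hash of the normalized prompt."""

    def __init__(self, max_entries: int = 1000) -> None:
        self.max_entries = max_entries
        self._store: "OrderedDict[str, str]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt: str) -> str:
        # Collapse case and whitespace so trivially different prompts match.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get_or_compute(self, prompt: str, compute: Callable[[str], str]) -> str:
        key = self._key(prompt)
        if key in self._store:
            self.hits += 1
            self._store.move_to_end(key)     # mark as most recently used
            return self._store[key]
        self.misses += 1
        result = compute(prompt)             # the expensive model call
        self._store[key] = result
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict the least recently used
        return result


def fake_model(prompt: str) -> str:
    """Stand-in for a real inference call (hypothetical)."""
    return f"answer to: {prompt}"


cache = PromptCache(max_entries=2)
cache.get_or_compute("What is RDMA?", fake_model)
cache.get_or_compute("what is  RDMA?", fake_model)  # same key after normalization
print(cache.hits, cache.misses)
```

Caching whole responses only makes sense when returning the same answer to the same prompt is acceptable, which ties back to the non-determinism discussed earlier.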

Challenge 3: Building the Data Flywheel

AI systems aren't "done once deployed"—they require continuous iteration.

The more users use it, the more feedback data is generated, the better the model can be trained, and the more users are attracted—this is the data flywheel.

But getting the flywheel spinning isn't easy: How do you collect effective feedback? How do you label data? How do you train continuously? How do you evaluate new versions?

There's no standard answer to these questions, but every successful AI product has its own flywheel design.

Large-Scale Training Infrastructure

Training models with hundreds of billions or even trillions of parameters requires supercomputing infrastructure.

GPU Cluster Architecture

Modern AI training clusters typically consist of hundreds or thousands of GPUs.

Take the H100 GPU as an example: a single H100 has 80GB of memory and delivers 1979 TFLOPS in FP8 precision.

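But a single GPU is far from enough: by one widely cited estimate, training GPT-3 on a single V100 would take about 355 years, so real training runs spread the work across large GPU clusters.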
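A back-of-the-envelope calculation shows the scale involved. The sketch below uses the common rule of thumb that training costs about 6 FLOPs per parameter per token. The sustained throughput of 28 TFLOPS per V100 and the 1024-GPU cluster are assumptions for illustration, not measured figures.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def training_flops(num_params: float, num_tokens: float) -> float:
    """Approximate training compute with the 6 * N * D rule of thumb."""
    return 6.0 * num_params * num_tokens


def gpu_years(total_flops: float, sustained_flops_per_gpu: float) -> float:
    """GPU-years needed at a given sustained throughput per GPU."""
    return total_flops / sustained_flops_per_gpu / SECONDS_PER_YEAR


# GPT-3: about 175B parameters trained on about 300B tokens.
flops = training_flops(175e9, 300e9)
# Assumed sustained throughput of 28 TFLOPS per V100 (illustrative).
years_on_one_gpu = gpu_years(flops, 28e12)
print(f"{flops:.2e} FLOPs, about {years_on_one_gpu:.0f} GPU-years on one V100")
# With 1024 such GPUs and perfect scaling (an idealization):
print(f"about {years_on_one_gpu / 1024 * 12:.1f} months on 1024 GPUs")
```

Real clusters never scale perfectly, so the multi-GPU figure is a lower bound rather than a prediction.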

A typical cluster topology is:


Level	Devices	Connection	Bandwidth
Within single machine	GPU-GPU	NVLink	900GB/s
Same rack	Server-Server	InfiniBand	400Gb/s
Cross-rack	Switch-Switch	InfiniBand Fabric	400Gb/s

The network is often the bottleneck of distributed training. If communication bandwidth is insufficient, GPU utilization may drop from 90% to 30%, with most of the time spent waiting for data.
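To see why bandwidth matters so much, the following sketch estimates how long one gradient synchronization takes with a ring all-reduce, using the link speeds from the table above. The 7B-parameter model and the 8-GPU group are hypothetical, and the formula is a bandwidth-only lower bound that ignores latency and any overlap with computation.

```python
def ring_allreduce_seconds(payload_bytes: float, num_gpus: int,
                           bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-only lower bound for a ring all-reduce.

    Each GPU sends and receives 2 * (N - 1) / N times the payload.
    """
    if num_gpus < 2:
        return 0.0
    traffic = 2.0 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic / bandwidth_bytes_per_s


# Hypothetical 7B-parameter model with 2-byte (FP16) gradients.
grad_bytes = 7e9 * 2
nvlink = 900e9          # 900 GB/s
infiniband = 400e9 / 8  # 400 Gb/s is 50 GB/s

for name, bw in [("NVLink", nvlink), ("InfiniBand", infiniband)]:
    t = ring_allreduce_seconds(grad_bytes, num_gpus=8, bandwidth_bytes_per_s=bw)
    print(f"{name}: {t:.3f} s per gradient synchronization")
```

If a training step itself takes a fraction of a second, a synchronization of similar length over the slower link can leave the GPUs waiting for a large share of the time.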

InfiniBand High-Speed Interconnect

Commodity Ethernet is usually too slow for large-scale distributed training, so clusters typically use InfiniBand (or RDMA-capable Ethernet such as RoCE).

InfiniBand's characteristics are: extremely low latency (microsecond level), extremely high bandwidth, and support for RDMA (Remote Direct Memory Access).

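RDMA lets one machine read and write another machine's memory directly, without going through the OS kernel on either side. Combined with GPUDirect RDMA, a GPU can exchange data with a GPU in another server without staging it in host memory, which is much faster.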

Storage System Design

Training data is typically TB or even PB scale, so storage systems are also critical.

Typical tiered storage design:

Hot data: SSD or NVMe, stores the currently training batch
Warm data: Distributed storage (e.g., Ceph, Lustre), stores the complete training set
Cold data: Object storage (e.g., S3), stores historical data and backups

Fault Tolerance and Checkpoints

Training takes weeks—what if a GPU fails during that time? Starting over would be too wasteful.

The solution is Checkpointing—periodically saving model state to disk, and recovering from the most recent checkpoint if an error occurs.

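But checkpoints also have costs: a single save may take several minutes and occupy tens of GB of space, and considerably more for the largest models, because optimizer states are saved along with the weights.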

Typical strategy: save every few hundred steps, keep the most recent few checkpoints, and automatically clean up old ones.
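The strategy above can be sketched with nothing but the standard library. The helper names `save_checkpoint` and `load_latest`, the JSON file layout, and the 200-step interval are assumptions for illustration; real training code would save tensors through the framework's own serialization.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def save_checkpoint(ckpt_dir: Path, step: int, state: dict,
                    keep_last: int = 3) -> Path:
    """Write state for `step`, then delete all but the newest `keep_last` files."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    tmp_path = ckpt_dir / f"step_{step:08d}.json.tmp"
    final_path = ckpt_dir / f"step_{step:08d}.json"
    tmp_path.write_text(json.dumps({"step": step, "state": state}))
    tmp_path.replace(final_path)  # rename last, so a crash leaves no partial file
    checkpoints = sorted(ckpt_dir.glob("step_*.json"))
    for old in checkpoints[:-keep_last]:
        old.unlink()
    return final_path


def load_latest(ckpt_dir: Path) -> Optional[Tuple[int, dict]]:
    """Return (step, state) from the newest checkpoint, or None if none exist."""
    checkpoints = sorted(ckpt_dir.glob("step_*.json"))
    if not checkpoints:
        return None
    payload = json.loads(checkpoints[-1].read_text())
    return payload["step"], payload["state"]


with tempfile.TemporaryDirectory() as d:
    ckpt_dir = Path(d)
    for step in range(1, 1001):
        state = {"weights": [0.1 * step]}  # stand-in for real model state
        if step % 200 == 0:                # save every 200 steps
            save_checkpoint(ckpt_dir, step, state)
    resumed = load_latest(ckpt_dir)
    remaining = sorted(p.name for p in ckpt_dir.glob("step_*.json"))
    print(resumed[0], remaining)
```

Zero-padding the step number keeps alphabetical order equal to chronological order, which is what lets a plain `sorted` call find the newest file.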

Distributed Training Strategies

A single GPU can't hold large models, so training tasks need to be split across multiple GPUs.

There are mainly three parallel strategies: data parallelism, tensor parallelism, and pipeline parallelism. Combined, they form "3D parallelism".

Image 1: 3D Parallelism Strategy Diagram

Data Parallelism

The simplest and most commonly used strategy: each GPU holds the complete model but processes different data.

For example, with 8 GPUs and batch size 1024, each GPU processes 128 data samples.

Forward propagation is calculated independently, and after backward propagation, gradients are aggregated and averaged, then the model is updated.
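A toy example makes the gradient-averaging step easier to check. The sketch below uses a one-parameter linear model in pure Python and simulates four workers. It assumes the batch divides evenly among the workers, in which case the averaged gradient equals the gradient of the full batch.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y)


def grad_mse(w: float, batch: List[Sample]) -> float:
    """Gradient of mean((w * x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_grad(w: float, batch: List[Sample], num_workers: int) -> float:
    """Split the batch evenly, compute per-worker gradients, then average them."""
    shard_size = len(batch) // num_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size]
              for i in range(num_workers)]
    local_grads = [grad_mse(w, shard) for shard in shards]  # one per "GPU"
    return sum(local_grads) / num_workers                   # the all-reduce step


batch = [(float(x), 3.0 * x + 1.0) for x in range(16)]
w = 0.5
full = grad_mse(w, batch)
averaged = data_parallel_grad(w, batch, num_workers=4)
print(full, averaged)
```

Because both paths produce the same gradient, every replica applies the same update and the model copies stay in sync.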

The problem with data parallelism is: memory is still the bottleneck—if the model is too large for a single GPU, data parallelism doesn't help.

ZeRO Memory Optimization

ZeRO (Zero Redundancy Optimizer) is an enhanced version of data parallelism that can further save memory.

In ordinary data parallelism, each GPU stores complete model parameters, gradients, and optimizer states—this is very redundant.

ZeRO's idea is: split these states across different GPUs, and communicate only when needed.


ZeRO Stage	Split Content	Memory Savings
ZeRO-1	Optimizer states	~4x
ZeRO-2	Optimizer states + Gradients	~8x
ZeRO-3	Optimizer states + Gradients + Parameters	Linear with GPU count
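The savings in the table can be reproduced with simple arithmetic. The sketch below assumes mixed-precision training with Adam, which needs about 16 bytes of model state per parameter: 2 for FP16 weights, 2 for FP16 gradients, and 12 for the optimizer's FP32 weights, momentum, and variance. The 7.5B-parameter model and the 64 GPUs are example values; activations and temporary buffers are left out.

```python
def zero_memory_gb(num_params: float, num_gpus: int, stage: int) -> float:
    """Per-GPU model-state memory in GB for mixed-precision Adam.

    Stage 0 is plain data parallelism; stages 1-3 partition optimizer
    states, then gradients, then parameters across the GPUs.
    """
    params, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        params /= num_gpus
    return num_params * (params + grads + optim) / 1e9


for stage in range(4):
    gb = zero_memory_gb(7.5e9, num_gpus=64, stage=stage)
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

With many GPUs, stage 1 approaches a 4x reduction (16 bytes down to about 4) and stage 2 approaches 8x (down to about 2), which is where the figures in the table come from.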

Configuring ZeRO with DeepSpeed is simple:

Examples

```json
{
  "train_batch_size": 1024,
  "train_micro_batch_size_per_gpu": 16,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.0001,
      "betas": [0.9, 0.95],
      "eps": 1e-8,
      "weight_decay": 0.01
    }
  },
  "zero_optimization": {
    "stage": 3,
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true,
    "stage3_prefetch_bucket_size": 1e8,
    "stage3_param_persistence_threshold": 1e5,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9
  },
  "gradient_clipping": 1.0,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "checkpoint": {
    "tag": "tutorial-checkpoint",
    "load_universal": true
  }
}
```

This configuration uses ZeRO-3, which partitions model states across all GPUs, so the per-GPU memory needed for those states shrinks roughly in proportion to the number of GPUs.

Tensor Parallelism

If ZeRO is still not enough, tensor parallelism is needed—splitting the computation of a single layer across multiple GPUs.

Matrix multiplications in Transformers can be split by row or column:

Split matrix A into A₁ and A₂ by row, calculate A₁×B and A₂×B on GPU 0 and GPU 1 respectively
Finally concatenate the results

This requires communication in every layer, but with two GPUs each one holds only about half of that layer's data.
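The row split described above can be checked on small matrices. The sketch below simulates two GPUs in pure Python and also shows the column-wise split of B, which is the variant that halves the storage of a weight matrix. All function names are hypothetical, and the final concatenation stands in for the communication step.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain triple-loop matrix multiplication."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def row_parallel(a: Matrix, b: Matrix) -> Matrix:
    """Split A by rows; each "GPU" multiplies its half by B; stack the results."""
    half = len(a) // 2
    out_gpu0 = matmul(a[:half], b)  # A1 x B
    out_gpu1 = matmul(a[half:], b)  # A2 x B
    return out_gpu0 + out_gpu1      # concatenate along rows


def column_parallel(a: Matrix, b: Matrix) -> Matrix:
    """Split B by columns so each "GPU" stores only half of B."""
    half = len(b[0]) // 2
    b0 = [row[:half] for row in b]
    b1 = [row[half:] for row in b]
    out_gpu0 = matmul(a, b0)
    out_gpu1 = matmul(a, b1)
    return [r0 + r1 for r0, r1 in zip(out_gpu0, out_gpu1)]  # concatenate columns


A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
B = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]
reference = matmul(A, B)
print(row_parallel(A, B) == reference, column_parallel(A, B) == reference)
```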

Megatron-LM is NVIDIA's PyTorch-based library for training large Transformer models, and it provides a widely used implementation of tensor parallelism.

Pipeline Parallelism

Tensor parallelism is "intra-layer splitting"; pipeline parallelism is "inter-layer splitting".

For example, with a 32-layer model, GPU 0 holds layers 1-8, GPU 1 holds layers 9-16, GPU 2 holds layers 17-24, and GPU 3 holds layers 25-32.

Data flows from GPU 0 to GPU 3, like a factory assembly line.

But pipelining has a problem: bubbles. When GPU 0 starts computing, GPUs 1-3 are idle; once the data moves on to GPU 1, GPU 0 is idle again.

The solution is to split each batch into "micro-batches" and feed them in one after another, so that every stage has work to do most of the time and bubble time shrinks.
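How much micro-batching helps can be estimated with a simple formula. For a basic schedule in which every stage takes the same time per micro-batch, p stages and m micro-batches leave a fraction (p - 1) / (m + p - 1) of the time idle. The sketch below evaluates it for the four-stage example; the equal-time assumption is an idealization.

```python
def bubble_fraction(num_stages: int, num_micro_batches: int) -> float:
    """Idle share of a simple pipeline schedule: (p - 1) / (m + p - 1)."""
    p, m = num_stages, num_micro_batches
    return (p - 1) / (m + p - 1)


for m in (1, 4, 16, 64):
    print(f"4 stages, {m:>2} micro-batches: {bubble_fraction(4, m):.1%} idle")
```

With a single micro-batch, three quarters of the time is wasted; with 64, the idle share drops below 5%.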

3D Parallelism (DP+TP+PP)

The three strategies can be combined:

Pipeline parallelism: Split model layers across nodes
Tensor parallelism: Split intra-layer computation within nodes
Data parallelism: Replicate the entire pipeline at a larger scale

For example, with 64 GPUs, you could plan:

8 pipeline stages (PP=8)
2 GPUs for tensor parallelism within each stage (TP=2)
Then replicate 4 copies for data parallelism (DP=4)
Total: 8 × 2 × 4 = 64 GPUs

This is 3D parallelism—the standard configuration for modern large model training.
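One way to picture the 64-GPU layout is as a mapping from each global rank to a coordinate in the three parallel dimensions. The sketch below is a hypothetical mapping in which the tensor-parallel index varies fastest, so that tensor-parallel peers sit on adjacent ranks. Real frameworks make this ordering configurable.

```python
from typing import List, Tuple

PP, TP, DP = 8, 2, 4  # sizes from the 64-GPU example


def rank_to_coords(rank: int) -> Tuple[int, int, int]:
    """Map a global rank to (dp, pp, tp), with tp varying fastest."""
    tp = rank % TP
    pp = (rank // TP) % PP
    dp = rank // (TP * PP)
    return dp, pp, tp


def tensor_parallel_group(rank: int) -> List[int]:
    """All ranks that share this rank's dp and pp coordinates."""
    dp, pp, _ = rank_to_coords(rank)
    base = dp * PP * TP + pp * TP
    return [base + t for t in range(TP)]


coords = {rank_to_coords(r) for r in range(PP * TP * DP)}
print(len(coords))                # 64 distinct coordinates
print(rank_to_coords(37))         # (2, 2, 1)
print(tensor_parallel_group(37))  # [36, 37]
```

Placing tensor-parallel peers next to each other matters because they communicate in every layer and benefit most from the fast links inside a machine.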

Data Engineering

Good models require good data, and data engineering is often said to account for the majority of the work in an AI system.

Data Collection and Cleaning Pipeline

Training data typically comes from multiple sources: web pages, books, code, conversations, etc.

Typical processing workflow:

Deduplication: Remove duplicate or highly similar documents
Quality filtering: Remove low-quality, toxic, or biased content
Format unification: Convert different sources to a unified format
Tokenization: Convert text to model input sequences

Data Deduplication: MinHash LSH

Directly computing pairwise document similarity is too slow; the common method is MinHash + LSH (Locality Sensitive Hashing).

The idea is: convert each document into a short "fingerprint", where similar documents have fingerprints that are likely the same or similar, then group by fingerprint.

Examples

```python
import hashlib
import re
from typing import Dict, List, Set


def generate_shingles(text: str, k: int = 5) -> Set[str]:
    """Generate k-shingles: sequences of k consecutive words.

    e.g., "I love tutorial tutorials", k=2 → {"i love", "love tutorial", "tutorial tutorials"}
    """
    # Simple tokenization (professional tools can be used in production)
    words = re.findall(r'\w+', text.lower())
    shingles = set()
    for i in range(len(words) - k + 1):
        shingle = ' '.join(words[i:i + k])
        shingles.add(shingle)
    return shingles


def minhash_signature(shingles: Set[str], num_hashes: int = 100) -> List[int]:
    """Generate MinHash signature.

    Use multiple hash functions, each taking the minimum value.
    """
    signature = []
    for i in range(num_hashes):
        # Use i as seed to generate different hash functions
        min_hash = None
        for shingle in shingles:
            # Combine shingle and i to generate hash value
            h = hashlib.sha256(f"{shingle}-{i}".encode()).hexdigest()
            h_int = int(h, 16)
            if min_hash is None or h_int < min_hash:
                min_hash = h_int
        signature.append(min_hash)
    return signature


def lsh_banding(signature: List[int], bands: int = 20) -> List[str]:
    """Use banding method to generate LSH keys.

    Split signature into multiple bands, each band is hashed separately.
    The default of 20 bands is an assumed value (5 rows per band for a
    100-value signature).
    """
    keys = []
    rows_per_band = len(signature) // bands
    for i in range(bands):
        start = i * rows_per_band
        end = start + rows_per_band
        band = tuple(signature[start:end])
        # Hash this band to generate a key
        band_hash = hashlib.sha256(str(band).encode()).hexdigest()[:16]
        keys.append(f"band-{i}-{band_hash}")
    return keys


def deduplicate_documents(documents: List[str],
                          threshold: float = 0.7) -> List[str]:
    """Deduplicate documents using MinHash + LSH.

    Returns deduplicated document list.
    """
    # Storage: LSH key → list of document indices
    buckets: Dict[str, List[int]] = {}
    # Storage: document index → signature
    signatures: Dict[int, List[int]] = {}
    # Marks: which documents are duplicates
    duplicates: Set[int] = set()

    for idx, doc in enumerate(documents):
        shingles = generate_shingles(doc)
        sig = minhash_signature(shingles)
        signatures[idx] = sig
        keys = lsh_banding(sig)

        # Check if similar documents already exist
        is_duplicate = False
        for key in keys:
            if key in buckets:
                # There are documents in this bucket, compare signatures one by one
                for other_idx in buckets[key]:
                    other_sig = signatures[other_idx]
                    # Calculate signature similarity (Jaccard approximation)
                    matches = sum(1 for a, b in zip(sig, other_sig) if a == b)
                    similarity = matches / len(sig)
                    if similarity >= threshold:
                        # Exceeds threshold, considered duplicate
                        is_duplicate = True
                        duplicates.add(idx)
                        break
            if is_duplicate:
                break

        if not is_duplicate:
            # Not a duplicate, add self to each bucket
            for key in keys:
                if key not in buckets:
                    buckets[key] = []
                buckets[key].append(idx)

    # Return non-duplicate documents
    return [doc for idx, doc in enumerate(documents) if idx not in duplicates]


# ============================================
# Test tutorial data deduplication
# ============================================

if __name__ == "__main__":
    documents = [
        "Welcome to tutorial tutorials, this is a great place to learn programming.",
        "Welcome to tutorial tutorials, this is a great place to learn programming.",  # Exact duplicate
        "Python is a concise and elegant language, suitable for beginners.",
        "Python is a concise and elegant programming language, very suitable for beginners.",  # Similar wording, few shared 5-word shingles
        "Machine learning lets computers learn patterns from data.",
        "This is a completely different article.",
    ]

    print(f"Before deduplication: {len(documents)} documents")
    deduplicated = deduplicate_documents(documents, threshold=0.6)
    print(f"After deduplication: {len(deduplicated)} documents\n")

    print("Retained documents:")
    for i, doc in enumerate(deduplicated):
        print(f"  [{i}] {doc}")

    # Output:
    # Before deduplication: 6 documents
    # After deduplication: 5 documents
    #
    # Retained documents:
    #   [0] Welcome to tutorial tutorials, this is a great place to learn programming.
    #   [1] Python is a concise and elegant language, suitable for beginners.
    #   [2] Python is a concise and elegant programming language, very suitable for beginners.
    #   [3] Machine learning lets computers learn patterns from data.
    #   [4] This is a completely different article.
    #
    # The two Python sentences share only 2 of their 12 distinct 5-word
    # shingles (Jaccard ≈ 0.17), so they stay below the 0.6 threshold;
    # a smaller k makes short near-duplicates easier to catch.
```

In actual production, more efficient implementations are used (such as the datasketch library), but the core idea is the same.
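The banding step is what keeps the comparison cheap: two documents are compared only if at least one band of their signatures matches exactly. With b bands of r rows, a pair with Jaccard similarity s becomes a candidate with probability 1 - (1 - s^r)^b. The sketch below evaluates this for an assumed split of 20 bands with 5 rows each.

```python
def candidate_probability(similarity: float, bands: int,
                          rows_per_band: int) -> float:
    """Chance that two documents share at least one LSH bucket.

    A band matches when all of its rows agree (probability s ** r), and the
    pair becomes a candidate when any of the b bands match.
    """
    return 1.0 - (1.0 - similarity ** rows_per_band) ** bands


# Assumed split of a 100-value signature into 20 bands of 5 rows each.
for s in (0.2, 0.5, 0.7, 0.9):
    p = candidate_probability(s, bands=20, rows_per_band=5)
    print(f"Jaccard similarity {s:.1f} -> candidate with probability {p:.3f}")
```

The curve is S-shaped: dissimilar pairs are almost never compared, while very similar pairs almost always are. Changing the number of bands and rows shifts where the steep part of the curve falls.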

Data Format: WebDataset

Small datasets can be stored in any convenient layout, but TB-scale datasets need specialized formats.

WebDataset is a commonly used format: it packages files into tar archives (shards), each containing thousands of samples, and it is designed for fast sequential reads rather than random access.

Benefits are:

Reduces filesystem pressure (millions of small files are slow)
Supports streaming access, no need to load entire dataset into memory
Can be loaded in a distributed fashion, with each worker reading different tars
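The tar-based layout can be illustrated with the standard library alone. In the sketch below, files that share a basename (such as `000000.txt` and `000000.cls`) form one sample, and the reader streams through the archive once from start to finish. The helpers `write_shard` and `stream_samples` are hypothetical and only mimic the convention; the WebDataset library itself offers much more, including shuffling and decoding.

```python
import io
import tarfile
from typing import Dict, Iterator


def write_shard(samples: Dict[str, Dict[str, bytes]]) -> bytes:
    """Pack samples into one tar; files sharing a basename form one sample."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for key in sorted(samples):
            for ext, payload in samples[key].items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def stream_samples(shard: bytes) -> Iterator[Dict[str, bytes]]:
    """Read the tar sequentially and group consecutive files by basename."""
    current_key = None
    sample: Dict[str, bytes] = {}
    # Mode "r|" reads the archive as a forward-only stream.
    with tarfile.open(fileobj=io.BytesIO(shard), mode="r|") as tar:
        for member in tar:
            key, _, ext = member.name.partition(".")
            if current_key is not None and key != current_key:
                yield sample
                sample = {}
            current_key = key
            sample[ext] = tar.extractfile(member).read()
    if sample:
        yield sample


shard = write_shard({
    "000000": {"txt": b"first document", "cls": b"0"},
    "000001": {"txt": b"second document", "cls": b"1"},
})
for s in stream_samples(shard):
    print(s)
```

Because the reader never seeks backwards, the same code works on a shard streamed over the network, which is what makes the format friendly to object storage.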

Data Flywheel Design

Data fly
