
# AI Deployment

You've got a model running on your laptop, generating responses quickly and well. But once you deploy it on a server with a hundred users calling it at the same time, things change: **some users wait ten seconds for a response, some requests time out, GPU memory fills up instantly, and the bill climbs painfully fast.**

This is the problem AI deployment aims to solve: turning a model that runs into a service people can actually use.

Traditional web service bottlenecks are usually the CPU and the database. AI service bottlenecks mostly involve the GPU: memory capacity, compute speed, and how concurrent requests are queued.

The core challenges of AI deployment can be summarized in three words: latency, throughput, and cost.

| Metric | Meaning | User Experience |
| --- | --- | --- |
| Latency | Time from the user sending a request to receiving the first character | How fast it feels |
| Throughput | Number of requests processed per second | How many people can be served at once |
| Cost | GPU/server fees for running the service | How expensive it is |

> Good AI deployment means finding a balance among the three: low enough latency, high enough throughput, and controllable cost.

* * *

## Model Serving Frameworks

Packaging a trained model into an API service requires a specialized framework. This section introduces three mainstream options: vLLM, TGI, and Ollama.

### vLLM: High-Performance Inference Driven by PagedAttention

vLLM is an inference engine that originated at UC Berkeley and is known for its speed.

vLLM's core innovation is PagedAttention, an efficient GPU memory management technique inspired by virtual memory paging in operating systems.

In traditional inference frameworks, each request's KV Cache (key-value cache) must occupy a contiguous region of GPU memory. When request lengths vary, GPU memory becomes fragmented and utilization drops.
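To make the contiguous-allocation problem concrete, the sketch below estimates how much GPU memory goes unused when every request reserves one contiguous KV Cache region sized for the maximum context length. The model dimensions (28 layers, 4 KV heads, head size 128, 2-byte values), the request lengths, and the reserve-for-the-maximum policy are illustrative assumptions, not measurements of any particular framework. `kv_bytes_per_token` and `contiguous_waste` are hypothetical helper names.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV Cache needed for one token across all layers."""
    # Both keys and values are cached, hence the factor of 2.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def contiguous_waste(actual_lens, max_len, bytes_per_token):
    """Return (reserved bytes, used bytes, wasted fraction) when every
    request reserves a contiguous region sized for max_len tokens."""
    reserved = len(actual_lens) * max_len * bytes_per_token
    used = sum(actual_lens) * bytes_per_token
    return reserved, used, 1 - used / reserved


# Assumed model shape, for illustration only
per_token = kv_bytes_per_token(num_layers=28, num_kv_heads=4,
                               head_dim=128, dtype_bytes=2)
print(per_token)  # 57344 bytes (56 KiB) per token

# Assumed request lengths in tokens, each reserving room for 8192 tokens
lens = [200, 1500, 8192, 50]
reserved, used, wasted = contiguous_waste(lens, 8192, per_token)
print(f"reserved={reserved / 2**20:.0f} MiB, "
      f"used={used / 2**20:.0f} MiB, wasted={wasted:.0%}")
```

Under these assumptions most of the reserved memory is never touched, which is the kind of waste a paged layout is meant to reduce.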
PagedAttention divides GPU memory into fixed-size pages. Each request's KV Cache can be stored across different, non-adjacent pages, and a page table tracks where each piece lives. This significantly improves GPU memory utilization, so more requests can be served at the same time.

## Example

```shell
# Install vLLM
pip install vllm

# Start an OpenAI-compatible API service with vLLM
# --model specifies the model, --host and --port specify the listening address
# --tensor-parallel-size specifies how many GPUs to use for parallel inference
# --gpu-memory-utilization is the fraction of GPU memory vLLM may use
# --max-model-len is the maximum context length
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 8192
```

After the service starts, you can call it directly using the OpenAI SDK:

## Example

```python
# File path: test_vllm_client.py
from openai import OpenAI

# Connect to the local vLLM service
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="tutorial-demo-key",  # vLLM doesn't require a real API key by default
)

# Call chat completion
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Introduce this tutorial in one sentence"},
    ],
    temperature=0.7,
    max_tokens=500,
    stream=True,  # Streaming output
)

print("AI Response:", end="", flush=True)
for chunk in response:
    # Each streamed chunk carries an incremental delta; some chunks have no text
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```

vLLM can also be used directly from Python code, for example for offline batch inference or as the basis of a custom service:

## Example

```python
# File path: custom_vllm_server.py
from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    gpu_memory_utilization=0.9,  # Fraction of GPU memory vLLM may use
    tensor_parallel_size=1,      # Tensor parallelism (number of GPUs)
    max_model_len=8192,          # Maximum context length
)

# Configure sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=500,
    # Add stop=[...] here if your model needs explicit stop strings
)

# Batch inference
prompts = [
    "Introduce this tutorial",
    "What is AI?",
    "How to learn programming?",
]

# Generate responses
outputs = llm.generate(prompts, sampling_params)

# Print results
for output in outputs:
    # Assumed loop body: show each prompt with its first completion
    print(f"Prompt: {output.prompt!r}")
    print(f"Response: {output.outputs[0].text!r}")
```
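The page-table bookkeeping behind PagedAttention, described at the start of this section, can be illustrated with a toy allocator. This is a simplified sketch and not vLLM's actual implementation: `ToyPagedKVCache`, its methods, and the page size of 16 tokens are hypothetical choices made for illustration, and only page ownership is tracked, not the cached tensors themselves.

```python
class ToyPagedKVCache:
    """Toy page-table allocator: records which fixed-size pages each request owns."""

    def __init__(self, num_pages: int, page_size: int = 16):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))  # physical page ids not in use
        self.page_tables = {}  # request id -> list of physical page ids
        self.num_tokens = {}   # request id -> number of tokens stored so far

    def append_token(self, request_id: str) -> int:
        """Reserve room for one more token; return the physical page it lands in."""
        table = self.page_tables.setdefault(request_id, [])
        count = self.num_tokens.get(request_id, 0)
        if count == len(table) * self.page_size:  # every owned page is full
            if not self.free_pages:
                raise MemoryError("no free pages left")
            table.append(self.free_pages.pop(0))
        self.num_tokens[request_id] = count + 1
        return table[count // self.page_size]

    def release(self, request_id: str) -> None:
        """Return all pages of a finished request to the free list."""
        self.free_pages.extend(self.page_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)


cache = ToyPagedKVCache(num_pages=8, page_size=16)
for _ in range(20):
    cache.append_token("a")  # request "a" grows to 20 tokens
for _ in range(5):
    cache.append_token("b")  # request "b" arrives in between
for _ in range(20):
    cache.append_token("a")  # request "a" keeps growing to 40 tokens

tables_before_release = {rid: list(pages) for rid, pages in cache.page_tables.items()}
print(tables_before_release)  # {'a': [0, 1, 3], 'b': [2]}

cache.release("b")
print(cache.free_pages)  # [4, 5, 6, 7, 2]
```

Request `a` ends up on pages that are not adjacent, yet its page table still locates every token, and each request wastes at most one partially filled page. Pages freed by a finished request become immediately reusable by any other request.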