LLM Multimodal
Multimodal large models let AI process text, images, audio, video, and other forms of information together, and they are one of the most important development directions for current large models.
This article explains the concept, development history, internal principles, and typical capabilities of multimodal models from scratch, and ends with a runnable example of building a first multimodal application in 5 minutes.
* * *
## 1. What is Multimodal?
Ordinary large models can usually process only one type of information: text. If you give a text-only model an image and ask "How many cats are in this image?", it cannot answer.
However, the real world is not made of text alone. It reaches us through many kinds of sensory information:
| Source | Corresponding Information |
| --- | --- |
| Eyes | Images |
| Ears | Sound |
| Mouth | Language |
| Hands | Actions |
| Video | Images + Time |
| Webpage | Text + Images + Structure |
Therefore, the concept of **Multimodal** emerged.
Understand it in one sentence:
> Multimodal = A model that simultaneously understands and generates multiple information forms.
Two typical examples:
1. Input an image plus a question: **"Is the food in this image high in calories?"**, and the model answers: **"This is fried chicken and fries, estimated to be about 1200 calories."**
2. Input audio: **"Help me summarize this meeting"**, and the model outputs a text summary with suggested action items.
This ability to describe images and summarize audio is the most direct manifestation of multimodal.
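The two examples above can be written down as data to make "input in several modalities" concrete. This is only a sketch: the `Part` and `Request` types, the `kind` values, and the file names are hypothetical and do not follow the schema of any real API.

```python
from dataclasses import dataclass, field
from typing import List, Literal


# Hypothetical types for illustration; not the schema of any real API.
@dataclass
class Part:
    kind: Literal["text", "image", "audio"]
    data: str  # the text itself, or a file path for media


@dataclass
class Request:
    parts: List[Part] = field(default_factory=list)

    def modalities(self) -> List[str]:
        """Return the distinct modalities in the order they first appear."""
        seen: List[str] = []
        for part in self.parts:
            if part.kind not in seen:
                seen.append(part.kind)
        return seen


# Example 1: an image plus a question about it (file name is made up).
image_question = Request(parts=[
    Part(kind="image", data="lunch.jpg"),
    Part(kind="text", data="Is the food in this image high in calories?"),
])

# Example 2: a meeting recording plus an instruction (file name is made up).
meeting_summary = Request(parts=[
    Part(kind="audio", data="meeting.wav"),
    Part(kind="text", data="Help me summarize this meeting"),
])

print(image_question.modalities())   # ['image', 'text']
print(meeting_summary.modalities())  # ['audio', 'text']
```

A single request carries parts of different kinds, and a multimodal model is one that can take all of them into account at once.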
* * *
## 2. What is Modality?
A modality is the **form in which information is represented**.
Different modalities correspond to different inputs and outputs. The following table lists common modalities and their typical applications:
| Modality | Input Example | Output Example | Typical Application |
| --- | --- | --- | --- |
| Text | Novels, chat records | Answers, articles | Q&A, writing, translation |
| Image | Photos, screenshots | Image understanding, image generation | OCR, visual Q&A |
| Audio | Speech, music | Transcription, speech synthesis | Meeting summary, voice assistant |
| Video | Video streams | Video analysis, timeline | Teaching summary, content retrieval |
| Sensor | GPS, temperature | Control signals | Robotics, autonomous driving |
| Action | Mouse clicks, keyboard input | Execute actions | GUI Agent, automation |
The more modalities a model supports, the wider the range of human-like tasks it can take on, which is why many leading laboratories are upgrading text models to multimodal models.
* * *
## 3. Development Process of Multimodal Large Models
From processing only text to seeing, hearing, and generating video, large models have gone through three clear stages.
### Stage 1: Language Model (LLM)
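At this stage, both the input and the output of a model are text. Representative examples are the early, text-only versions of the OpenAI GPT series and Anthropic Claude. Google Gemini is often grouped with them as a product family, although it accepted non-text input from its first release.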
They can write code, write articles, do translation, and answer questions with strong capabilities, but have one fundamental limitation: they cannot see.
### Stage 2: Vision Language Model (VLM)
A VLM adds the ability to "see images" on top of an LLM. A typical use is uploading a webpage screenshot and asking "Why is the layout wrong?", to which the model might answer "CSS conflict".
At this stage, models gain capabilities such as image understanding, OCR, and chart understanding.
### Stage 3: Native Multimodal Model
Native multimodal models no longer treat "images" and "text" as two modules spliced together. Instead, they process images, sound, video, and text jointly from the start of training.
Such models combine seeing, hearing, speaking, reasoning, and executing, and they are the current mainstream direction of the industry.
* * *
## 4. How Do Multimodal Models Work Internally?
The internal work of multimodal models can be summarized in one sentence: convert all data into a unified vector space.
For example:
> "A cat", "cat", and cat.png are three expressions that, although from different modalities, will fall into close regions in the vector space after encoding.
The overall process is shown in the figure below:
The above figure shows the complete chain from input to output. Below is a breakdown of each step.
### Step 1: Encoding
Different modalities need to be converted into numbers before the model can process them.
| Modality | Raw Data | After Encoding |
| --- | --- | --- |
| Text | "Hello" | A sequence of token IDs, e.g. [2034, 789] |
| Image | Pixel matrix | Visual feature vector |
| Audio | Waveform | Spectral feature vector |
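The encoding step can be sketched with toy encoders. This is a minimal illustration, assuming NumPy is available. The function names, the whitespace tokenizer, the 4x4 patches, and the 256-sample frames are all assumptions made for the example; real models use learned tokenizers and neural encoders instead.

```python
import numpy as np


def encode_text(text, vocab):
    """Map each whitespace-separated word to an integer id (toy tokenizer)."""
    return [vocab.setdefault(word, len(vocab)) for word in text.lower().split()]


def encode_image(pixels, patch=4):
    """Cut an (H, W) grayscale image into patch x patch tiles, one flat row per tile.

    H and W must be multiples of `patch`.
    """
    h, w = pixels.shape
    tiles = pixels.reshape(h // patch, patch, w // patch, patch).swapaxes(1, 2)
    return tiles.reshape(-1, patch * patch)


def encode_audio(wave, frame=256):
    """Split a waveform into frames and take the magnitude spectrum of each frame."""
    n_frames = len(wave) // frame
    frames = wave[: n_frames * frame].reshape(n_frames, frame)
    return np.abs(np.fft.rfft(frames, axis=1))


vocab = {}
token_ids = encode_text("Hello world hello", vocab)
image_features = encode_image(np.arange(64, dtype=float).reshape(8, 8))
audio_features = encode_audio(np.sin(np.linspace(0, 100, 1024)))

print(token_ids)             # [0, 1, 0]
print(image_features.shape)  # (4, 16)
print(audio_features.shape)  # (4, 129)
```

Whatever the modality, the result has the same shape of idea: a sequence of numbers or vectors that a model can consume.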
### Step 2: Unified Semantic Space
The key step after encoding is to make the text "A cat", the word "cat", and the image cat.png **close to each other** in the vector space.
This is the essence of "cross-modal understanding": projecting features from different modalities into the same space.
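"Close to each other" is usually measured with cosine similarity. The sketch below assumes NumPy and uses hand-written three-dimensional vectors as stand-ins for embeddings that have already been projected into the shared space. The numbers are invented for illustration and do not come from any model.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Made-up embeddings in a shared 3-dimensional space (illustrative values only).
shared_space = {
    "text: A cat": np.array([0.90, 0.10, 0.05]),
    "text: cat": np.array([0.88, 0.15, 0.02]),
    "image: cat.png": np.array([0.85, 0.20, 0.10]),
    "text: a truck": np.array([0.05, 0.10, 0.95]),
}

query = shared_space["image: cat.png"]
for name, vector in shared_space.items():
    print(f"{name:16s} similarity to cat.png = {cosine(query, vector):.3f}")
```

With these values, the two cat texts score far higher against cat.png than the truck text does, which is the behavior a well-trained shared space is meant to show.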
### Step 3: Generate Output
After reasoning in the unified space, the model "restores" the results into different modalities as needed.
The output can be text, an image, audio, or video.
For example: upload an image → describe it → generate an introduction video. This is a complete multimodal generation chain.
* * *
## 5. Why Can Transformer Unify Multimodal?
The core reason is the attention mechanism.
The core operation of the Transformer is Attention(Q, K, V), which in effect lets the model itself decide what to focus on.
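The text names only Attention(Q, K, V). For reference, the standard scaled dot-product definition is given below, where $Q$, $K$, and $V$ are the query, key, and value matrices and $d_k$ is the dimension of each key vector:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

Each row of the softmax output is a set of weights that sum to 1, so every output vector is a weighted average of the value vectors. The weights are what "deciding where to focus" means in practice.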
For example, when a user asks "Who is playing basketball in the image?", the model conceptually needs to:
1. First look at the image and identify all people in it
2. Identify the basketball in the image
3. Determine the relationship between "who" and "basketball"
4. Synthesize the context and output the answer
This ability to "independently decide where to focus" makes Transformer naturally suitable for processing multimodal inputs:
| Input | What the Model is Focusing On |
| --- | --- |
| Pure text | Semantic relationships between sentences |
| Image + text | Correspondence between image regions and text descriptions |
| Video + text | Relationship between timeline segments and questions |
> This is also why attention has become a "standard feature" of almost all modern large models: it is not tied to any single modality, so it is well suited to cross-modal processing.
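A minimal NumPy sketch of scaled dot-product attention over a mixed sequence shows why the mechanism does not care about modality. The random vectors stand in for encoded text tokens and image patches, and the sizes (3 tokens, 4 patches, 8 dimensions) are arbitrary assumptions. For brevity the learned query, key, and value projections of a real Transformer are omitted, so Q, K, and V are all the raw sequence.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights


rng = np.random.default_rng(0)
d = 8
text_tokens = rng.normal(size=(3, d))    # stand-ins for encoded question tokens
image_patches = rng.normal(size=(4, d))  # stand-ins for encoded image regions

# Text and image vectors are placed in one sequence and attended to together.
sequence = np.concatenate([text_tokens, image_patches], axis=0)
output, weights = attention(sequence, sequence, sequence)

print(output.shape)   # (7, 8)
print(weights.shape)  # (7, 7)
# Share of the first text token's attention that lands on the image patches:
print(weights[0, 3:].sum())
```

Every position can attend to every other position, whether it came from text or from an image, which is the correspondence the table above describes.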
* * *
## 6. Typical Multimodal Capabilities
Multimodal models have five typical capabilities. The following figure shows their relationships.
### 1. Image Understanding
Given a screenshot, the model outputs page issues, a UI analysis, or OCR results.
The most common scenario for programmers: take a screenshot of an error and let AI locate the problem.
### 2. Image Generation
Input a text description like "generate a future city", and the model returns an AI image.
Typical applications: posters, covers, game assets, product sketches.
### 3. Voice Interaction
Speech is the input, and the model responds in real time. The capabilities involved are ASR (speech-to-text), TTS (text-to-speech), and conversation.
### 4. Video Understanding
Upload a teaching video, and the model outputs a summary, a timeline, and key segments that you can ask questions about.
### 5. Agent
Input "help me make a PPT", and the model starts executing a complete action chain: search for information → generate outline → create slides → modify → export.
An agent is the form that combines "seeing, hearing, speaking" with "hands-on execution".
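The action chain above can be sketched as a fixed pipeline of steps that pass a shared state along. Every function here is a hypothetical stub: a real agent would call a model and external tools at each step and would usually choose the next step dynamically. Nothing is searched and no file is written.

```python
def search(state):
    """Stub for 'search for information'."""
    state["notes"] = [f"note about {state['topic']}"]
    return state


def outline(state):
    """Stub for 'generate outline'."""
    state["outline"] = ["Introduction"] + [n.capitalize() for n in state["notes"]] + ["Summary"]
    return state


def create_slides(state):
    """Stub for 'create slides': one empty slide per outline entry."""
    state["slides"] = [{"title": title, "body": ""} for title in state["outline"]]
    return state


def modify(state):
    """Stub for 'modify': fill in each slide body."""
    for slide in state["slides"]:
        slide["body"] = f"Draft content for {slide['title']}"
    return state


def export(state):
    """Stub for 'export': only builds a file name, nothing is written to disk."""
    state["file"] = state["topic"].replace(" ", "_") + ".pptx"
    return state


PIPELINE = [search, outline, create_slides, modify, export]


def run_agent(topic):
    """Run every step in order and record which steps were executed."""
    state = {"topic": topic, "log": []}
    for step in PIPELINE:
        state = step(state)
        state["log"].append(step.__name__)
    return state


result = run_agent("multimodal models")
print(result["log"])          # ['search', 'outline', 'create_slides', 'modify', 'export']
print(len(result["slides"]))  # 3
print(result["file"])         # multimodal_models.pptx
```

The point of the sketch is the shape of the loop: each step reads what earlier steps produced and adds its own result, until the final artifact is ready.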
* * *
## 7. Five Terms Beginners Must Understand
When reading multimodal-related documentation, the terms in the following table will appear repeatedly. Remember them first:
| Term | Meaning | Example |
| --- | --- | --- |
| Token | The smallest unit processed by the model; text is split into words or subword pieces | "Hello world" → |
| Embedding | The process of converting information into numerical coordinates | "cat" → [0.12, 0.78, 0