## Definition
A **multimodal model** is a foundation model that natively processes more than one modality — text plus images, audio, video, or structured data — in a single shared representation. The defining property: cross-modal reasoning happens *inside* the model, not via separate models stitched together.
## Two Generations
### Bolt-on multimodal (2021–2023)
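A vision encoder (often CLIP) produces image embeddings; an LLM is fine-tuned to accept those embeddings as input, typically through a small projection or adapter layer. Examples: BLIP, LLaVA. The original GPT-4V is often grouped here as well, though its architecture has not been published. The approach is effective, but cross-modal reasoning is limited because the LLM sees the image only through embeddings from a separately trained encoder.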
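
As a rough illustration of the bolt-on wiring, the sketch below projects image embeddings into the LLM's embedding width and prepends them to the text embeddings. It assumes a single linear projection (the kind of connector LLaVA-style designs use); the dimensions, the random weights, and the names `project` and `build_llm_input` are hypothetical, and no real encoder or LLM is involved.

```python
import random

def project(vec, weight):
    """Linear map: weight is a list of output rows, each as long as vec."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def build_llm_input(image_embeddings, text_embeddings, weight):
    """Project each image embedding to the LLM width and prepend it to the text."""
    visual_tokens = [project(v, weight) for v in image_embeddings]
    return visual_tokens + text_embeddings

random.seed(0)
VISION_DIM, LLM_DIM = 4, 6  # toy sizes, chosen only for illustration

# Stand-in for the trained projection layer (hypothetical; random here).
W = [[random.uniform(-1, 1) for _ in range(VISION_DIM)] for _ in range(LLM_DIM)]
# Stand-ins for vision-encoder output (3 image vectors) and 5 text token embeddings.
image_embeddings = [[random.random() for _ in range(VISION_DIM)] for _ in range(3)]
text_embeddings = [[random.random() for _ in range(LLM_DIM)] for _ in range(5)]

sequence = build_llm_input(image_embeddings, text_embeddings, W)
print(len(sequence), len(sequence[0]))  # prints: 8 6
```

The LLM itself is unchanged in this picture: it receives eight vectors of its usual width and cannot tell which came from pixels, which is one way to see why the encoder's output bounds what it can reason about.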
### Native multimodal (2023+)
The model is trained from scratch with all modalities in the same token stream. Image patches, audio frames, and text are either tokenised into a shared vocabulary or embedded into one sequence that a single attention stack processes. Examples: Gemini, GPT-4o, Llama 4 family. Claude (vision) is usually listed here too, though vendors publish varying amounts of architectural detail.
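
To make "same token stream" concrete, here is a minimal sketch that cuts a toy image into patches and interleaves them with text tokens in one sequence. The `(modality, payload)` pair format, the 2x2 patch size, and the function names are assumptions for illustration; real models embed or quantise the patches rather than passing raw pixel values.

```python
def patchify(image, patch):
    """Split a 2-D grid (list of rows) into flattened patch x patch tiles, row-major."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    tiles = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tiles.append([image[r][c]
                          for r in range(top, top + patch)
                          for c in range(left, left + patch)])
    return tiles

def build_stream(text_before, image, text_after, patch=2):
    """Interleave text tokens and image patches as (modality, payload) pairs."""
    stream = [("text", tok) for tok in text_before]
    stream += [("image", tile) for tile in patchify(image, patch)]
    stream += [("text", tok) for tok in text_after]
    return stream

image = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 "image"
stream = build_stream(["Describe", "this:"], image, ["in", "one", "line"])
print(len(stream))  # prints: 9 (2 text + 4 image patches + 3 text)
print(stream[2])    # prints: ('image', [0, 1, 4, 5])
```

Because every element sits in the same sequence, attention can relate any text token to any patch directly, with no separate encoder deciding in advance what the language side gets to see.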
## What Native Multimodality Buys
- **Joint reasoning.** "What's wrong with this chart? Suggest a fix and write the corrected code." Requires reading the image, understanding the data, and producing code in one pass.
- **Better grounding.** The model can refer back to specific image regions, which reduces hallucination of visual details.
- **Lower latency.** No round-trip through separate vision and language models.
## Common Modalities in 2026 Frontier Models
| Modality | Input | Output |
| ------------- | ----- | ------ |
| Text | ✓ | ✓ |
| Image | ✓ | ✓ (some) |
| Audio (speech) | ✓ | ✓ (some) |
| Audio (music) | partial | partial |
| Video | ✓ | ✓ (some) |
| Code | ✓ | ✓ (as a specialisation of text) |
## Failure Modes
- **Visual hallucination.** Reads details that aren't in the image — "the man is wearing a red hat" when he isn't.
- **OCR substitution.** Mis-transcribes text in images, especially handwriting or unusual fonts.
- **Modality dominance.** Either text or image overrides the other; cross-modal consistency is brittle.
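
One hedged way to catch OCR substitution in a pipeline is to compare the model's transcription with a second, independent transcription and flag disagreement. The sketch below assumes such a reference exists (for example from a dedicated OCR engine); the function names and the 0.9 threshold are hypothetical and would need tuning on real data.

```python
from difflib import SequenceMatcher

def _normalise(text):
    """Lower-case and collapse whitespace so formatting noise is not counted."""
    return " ".join(text.lower().split())

def transcription_agreement(model_text, reference_text):
    """Similarity ratio in [0, 1] between two transcriptions."""
    return SequenceMatcher(None, _normalise(model_text),
                           _normalise(reference_text)).ratio()

def needs_review(model_text, reference_text, threshold=0.9):
    """Flag the pair for a second pass when agreement falls below the threshold."""
    return transcription_agreement(model_text, reference_text) < threshold

print(needs_review("Total  Due", "total due"))  # prints: False
print(needs_review("abc", "xyz"))               # prints: True
```

A ratio threshold is coarse: a single swapped digit in a long string still scores high, so fields where every character matters (amounts, IDs) are better compared exactly.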
## Implications for Agents
- **UI verification.** Sub-agents can drive a browser and screenshot pages; a multimodal verifier checks the rendered result — see the `ui-verifier` pattern in [[Specialized Agent]].
- **Diagram and chart reasoning.** Architecture diagrams, data plots, screenshots from a debugger all become first-class inputs.
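
The UI verification loop can be sketched as below. This is a minimal illustration, not the `ui-verifier` pattern's actual interface: `Verdict`, `verify_ui`, and the two stubs are hypothetical names, and the stubs stand in for a real browser driver and a real multimodal model call.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    passed: bool
    reason: str

# Hypothetical signature: (screenshot bytes, expectation text) -> Verdict.
Verifier = Callable[[bytes, str], Verdict]

def verify_ui(take_screenshot: Callable[[], bytes], verifier: Verifier,
              expectation: str, max_attempts: int = 3) -> Verdict:
    """Screenshot and check up to max_attempts times.

    Returns the first passing verdict, or the last failing one.
    """
    verdict = Verdict(False, "no attempts made")
    for _ in range(max_attempts):
        verdict = verifier(take_screenshot(), expectation)
        if verdict.passed:
            break
    return verdict

# Stubs standing in for a browser driver and a multimodal model call.
frames = iter([b"loading", b"loading", b"ready"])

def fake_screenshot() -> bytes:
    return next(frames)

def fake_verifier(png: bytes, expectation: str) -> Verdict:
    ok = png == b"ready"
    return Verdict(ok, "page rendered" if ok else "spinner still visible")

result = verify_ui(fake_screenshot, fake_verifier, "Dashboard shows the sales chart")
print(result)  # prints: Verdict(passed=True, reason='page rendered')
```

Returning a structured verdict with a reason, rather than a bare boolean, gives the driving agent something to act on when the check fails.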
## Related
- [[Foundation Model]]
- [[Generative AI]]
- [[Large Language Model]]
- [[Specialized Agent]]