Multimodal Model - Albert Masoliver's learning site

## Definition

A **multimodal model** is a foundation model that natively processes more than one modality (text plus images, audio, video, or structured data) in a single shared representation. The defining property: cross-modal reasoning happens *inside* the model, not via separate models stitched together.

## Two Generations

### Bolt-on multimodal (2021–2023)

A vision encoder (often CLIP) produces image embeddings; an LLM is fine-tuned to accept those embeddings as input. Examples: BLIP, LLaVA, original GPT-4V. Effective, but cross-modal reasoning was limited.

### Native multimodal (2023+)

The model is trained from scratch with all modalities in the same token stream. Image patches, audio frames, and text are tokenised into a shared vocabulary or attended to in a unified attention mechanism. Examples: Gemini, GPT-4o, Claude (vision), Llama 4 family.

## What Native Multimodality Buys

- **Joint reasoning.** "What's wrong with this chart? Suggest a fix and write the corrected code." Requires reading the image, understanding the data, and producing code in one pass.
- **Better grounding.** The model can refer back to specific image regions; reduces hallucination of visual details.
- **Lower latency.** No round-trip through separate vision and language models.

## Common Modalities in 2026 Frontier Models

| Modality       | Input   | Output                  |
| -------------- | ------- | ----------------------- |
| Text           | ✓       | ✓                       |
| Image          | ✓       | ✓ (some)                |
| Audio (speech) | ✓       | ✓ (some)                |
| Audio (music)  | partial | partial                 |
| Video          | ✓       | ✓ (some)                |
| Code           | ✓       | ✓ (specialised in text) |

## Failure Modes

- **Visual hallucination.** Reads details that aren't in the image: "the man is wearing a red hat" when he isn't.
- **OCR substitution.** Mis-transcribes text in images, especially handwriting or unusual fonts.
- **Modality dominance.** Either text or image overrides the other; cross-modal consistency is brittle.

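
The OCR substitution failure mode listed above can be quantified when a trusted reference transcription exists. The following is a minimal sketch, not a standard benchmark harness: it computes character error rate (CER) as Levenshtein edit distance divided by reference length. The function names `edit_distance` and `character_error_rate` and the sample strings are hypothetical, chosen only for illustration; the model's transcription is assumed to be obtained elsewhere.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    # prev[j] holds the distance between the processed prefix of `a` and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, transcription: str) -> float:
    """Edit distance normalised by the length of the reference text."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, transcription) / len(reference)


if __name__ == "__main__":
    # Illustrative strings: the transcription swaps the digit 0 for the letter O.
    reference = "Invoice 1047"
    transcription = "Invoice 1O47"
    print(f"{character_error_rate(reference, transcription):.3f}")  # prints 0.083
```

A single substituted character in a 12-character reference gives a CER of 1/12. Tracking this number separately for handwriting and unusual fonts would show where transcription is weakest, under the assumption that ground-truth text is available for those images.
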
## Implications for Agents

- **UI verification.** Sub-agents can drive a browser and screenshot pages; a multimodal verifier checks the rendered result; see the `ui-verifier` pattern in [[Specialized Agent]].
- **Diagram and chart reasoning.** Architecture diagrams, data plots, screenshots from a debugger all become first-class inputs.

## Related

- [[Foundation Model]]
- [[Generative AI]]
- [[Large Language Model]]
- [[Specialized Agent]]