Today marks six weeks into my Amazon internship, and with my midpoint presentation wrapped, I figured it’s a good moment to pause and reflect on what’s been taking up most of my headspace at work lately: vision-language models (VLMs).
At their core, VLMs aim to answer a deceptively simple question: how do we get machines to see and read—not as separate skills, but as one coherent act of understanding?
There are two core parts:
image encoder (usually a ViT or ResNet)
text encoder or decoder (usually a Transformer, often an LLM)
You feed in an image and a caption (or question, or instruction), and the model learns to match or fuse them, depending on the architecture.
Architecture: Two-tower vs. one-tower vs. decoder-only
Two-tower (e.g. CLIP)
Separate encoders for image and text
Trained with contrastive loss: “pull together matched pairs, push apart everything else”
At inference, embeddings are compared using cosine similarity
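To make the two-tower recipe concrete, here’s a minimal PyTorch sketch of the inference step: project both modalities into a shared space, L2-normalize, and compare with a dot product. The encoders here are stand-in linear layers (a real model would use a ViT and a text Transformer), so treat this as an illustration of the math rather than anyone’s actual implementation.

```python
import torch
import torch.nn.functional as F

# Stand-in encoders: in a real two-tower model these would be a ViT and a
# text Transformer, each ending in a projection into a shared embedding dim.
image_encoder = torch.nn.Linear(2048, 512)   # pretend: pooled image features -> joint space
text_encoder = torch.nn.Linear(768, 512)     # pretend: pooled text features -> joint space

image_feats = torch.randn(1, 2048)           # one image
text_feats = torch.randn(3, 768)             # three candidate captions / class prompts

# Project into the shared space and L2-normalize so the dot product equals cosine similarity
img_emb = F.normalize(image_encoder(image_feats), dim=-1)
txt_emb = F.normalize(text_encoder(text_feats), dim=-1)

# Cosine similarity between the image and each caption; softmax gives zero-shot "probabilities"
logits = img_emb @ txt_emb.T                 # shape: (1, 3)
probs = logits.softmax(dim=-1)
print(probs)
```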
One-tower (e.g. BLIP-2)
Adds a fusion layer that lets image/text features interact (cross-attention)
Useful for tasks like VQA or image captioning that require generation
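A hedged sketch of what that fusion step can look like: text (or learned query) tokens act as queries and attend over image patch features via cross-attention. Dimensions and modules here are made up for illustration; real fusion modules like BLIP-2’s Q-Former wrap more machinery around this core op.

```python
import torch
import torch.nn as nn

d_model = 512
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

text_tokens = torch.randn(1, 16, d_model)     # queries: 16 text (or learned query) tokens
image_patches = torch.randn(1, 196, d_model)  # keys/values: 14x14 grid of ViT patch features

# Each text token attends over all image patches, so the fused output
# carries visual evidence into the language side before generation.
fused, attn_weights = cross_attn(query=text_tokens, key=image_patches, value=image_patches)
print(fused.shape)  # (1, 16, 512)
```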
Decoder-only (e.g. GPT-4V, LLaVA)
You tokenize the image and jam the tokens directly into a big language model
Often via a learned visual projector that maps vision encoder outputs into token embeddings
Then the LLM just continues generating text as usual
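Here’s a rough sketch of the projector idea, with the vision encoder and the LLM both faked by random tensors: project patch features into the LM’s embedding space, prepend them to the text embeddings, and let the LM decode as usual. LLaVA-style projectors really are just a linear layer or a small MLP like this.

```python
import torch
import torch.nn as nn

vision_dim, lm_dim = 1024, 4096

# The learned piece: maps vision-encoder outputs into the LM's token-embedding space.
visual_projector = nn.Sequential(
    nn.Linear(vision_dim, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim)
)

patch_feats = torch.randn(1, 256, vision_dim)   # frozen vision encoder output (stand-in)
text_embeds = torch.randn(1, 32, lm_dim)        # embedded prompt tokens (stand-in)

visual_tokens = visual_projector(patch_feats)   # (1, 256, lm_dim): the image as "soft tokens"

# Concatenate image tokens ahead of the text and hand the whole sequence to the LLM;
# from the LM's point of view the image is just an unusually long prefix.
inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)   # (1, 288, lm_dim)
print(inputs_embeds.shape)
```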
Think of it this way:
CLIP-style = image and text are two friends who never talk, but happen to agree
BLIP-style = image and text are awkward coworkers forced to collaborate
GPT-4V-style = the image is treated as just another weird sentence
VLMs have to be trained against specific objectives, and a few approaches dominate:
Contrastive loss (InfoNCE): Pull together (img, txt) pairs, push apart mismatches. Still the default for retrieval-type tasks (see the sketch after this list).
Generative loss (cross-entropy): Make the model generate captions, answers, etc. Often paired with masked modeling or causal decoding.
Alignment loss: Explicitly predict whether (img, txt) match globally (image-text matching) or locally (region-word).
You can mix and match these. Newer models often do.
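For reference, the contrastive objective boils down to a symmetric cross-entropy over the image-text similarity matrix: with a batch of N matched pairs, the diagonal entries are positives and everything off it is a negative. A minimal sketch (temperature fixed here; CLIP learns it):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: matched (img, txt) pairs sit on the diagonal."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature            # (N, N) similarity matrix
    targets = torch.arange(len(logits), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)           # image -> which text?
    loss_t2i = F.cross_entropy(logits.T, targets)         # text -> which image?
    return (loss_i2t + loss_t2i) / 2

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```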
Why does this work? Shared embedding space. At the end of the day, VLMs are about getting vision and language into the same space: whether that’s a cosine-similarity space (CLIP), a joint latent space (BLIP), or literally the same context window (GPT-4V). Once both modalities speak the same embedding language, you can do things like:
Zero-shot classification
Visual QA
Referring expression resolution
Captioning
Multimodal reasoning
Now that we have the background, let’s look at where the field currently stands:
Architectural shift: two-tower → one-tower
Most early VLMs (CLIP, ALIGN, etc.) used dual encoders with contrastive loss: super clean and efficient for zero-shot. But newer models (BLIP-2, Qwen-VL, GPT-4V) are mostly going full decoder-only, leaning on pre-trained LLMs as the backbone and using visual projectors to tokenify images into language-compatible inputs. In some setups (e.g., Emu3), the visual tokens are literally embedded as [VIS_1], [VIS_2], etc., and passed through a shared transformer stack with [SOT]/[EOV] delimiters.
This is a huge shift. Instead of fusing representations late (post-embedding), we’re pushing for unified token streams. Tradeoff: inference is more expensive and fine-tuning is more finicky, but you unlock better multimodal reasoning downstream.
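A purely illustrative sketch of the unified-stream idea (the token ids and delimiter names below are placeholders echoing the description above, not any model’s real vocabulary): discrete visual tokens get spliced into the text id sequence between delimiters, and a single transformer is trained on the flat result with the usual next-token objective.

```python
# Illustrative token ids only; real models define these in their own vocab/tokenizer.
SOT, EOV = 50000, 50001            # hypothetical "start of image" / "end of vision" delimiters
text_ids = [12, 873, 44, 901]      # hypothetical text token ids
vis_ids = [60001, 60002, 60003]    # hypothetical discrete visual tokens ([VIS_1], [VIS_2], ...)

# One flat sequence, one transformer, one next-token objective over the whole thing.
sequence = text_ids + [SOT] + vis_ids + [EOV] + text_ids
print(sequence)
```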
Contrastive learning is still the workhorse
Despite all the new alignment and generative losses, contrastive learning is still the dominant pretraining objective. InfoNCE and its variants remain the backbone for image-text alignment. Whether you’re training ViT+Transformer from scratch (CLIP) or initializing from LLaMA or Vicuna and learning a visual projector, the core idea is the same: bring paired stuff closer, push everything else apart.
Side note: Image-text-label contrastive learning (e.g., CoCoOp, ALIP) is underrated. It lets you incorporate supervised info and keep contrastive benefits.
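A rough sketch of what I mean, not any specific paper’s exact recipe: when class labels are available, every (img, txt) pair that shares a label counts as a positive, so the contrastive target becomes a multi-positive mask instead of just the diagonal.

```python
import torch
import torch.nn.functional as F

def image_text_label_contrastive(img_emb, txt_emb, labels, temperature=0.07):
    """Multi-positive contrastive loss: pairs sharing a class label are positives."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature               # (N, N)
    pos_mask = (labels[:, None] == labels[None, :]).float()  # 1 wherever labels match
    log_probs = F.log_softmax(logits, dim=-1)
    # Average the log-likelihood over all positives for each image, then negate.
    loss = -(pos_mask * log_probs).sum(dim=-1) / pos_mask.sum(dim=-1)
    return loss.mean()

loss = image_text_label_contrastive(
    torch.randn(8, 256), torch.randn(8, 256), labels=torch.randint(0, 3, (8,))
)
```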
Evaluation setups are all over the place
People love saying “we achieve SoTA on 54 benchmarks,” but VLM evaluation is wildly inconsistent. Most tasks boil down to multiple-choice or yes/no formats, which are easy to evaluate but not great at measuring actual multimodal reasoning. A lot of these benchmarks are solvable by the LLM alone, so the vision input doesn’t actually help. (Some models even get worse when you add the image.)
Also, linear probing is underused and underrated. It's a cleaner measure of representation quality than zero-shot on VQA.
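For anyone who hasn’t run one: a linear probe just freezes the pretrained encoder, caches its features, and fits a single linear classifier on top. A sketch with random stand-in features; in practice you’d swap in real encoder outputs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# In practice: run the frozen image encoder over a labeled dataset and cache the features.
train_feats, train_labels = np.random.randn(1000, 512), np.random.randint(0, 10, 1000)
test_feats, test_labels = np.random.randn(200, 512), np.random.randint(0, 10, 200)

# The "probe": a single linear classifier on frozen features. If it does well,
# the representation already encodes the task; the encoder itself is never updated.
probe = LogisticRegression(max_iter=1000)
probe.fit(train_feats, train_labels)
print("linear probe accuracy:", probe.score(test_feats, test_labels))
```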
Hallucination isn’t just an LLM problem
Visual hallucination is real. Many VLMs (esp. those with large LLM backbones) rely more on their internal parametric knowledge than on the image itself. Flamingo, GPT-4V, and Claude 3 all hallucinate details from images that aren’t actually there. This happens especially when vision inputs are projected poorly (bad visual encoders, weak tokenizers, etc.). The vision-language interface is leaky. Potential fixes include better visual grounding (e.g., region-word alignment) or introducing visual attention feedback during generation.
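One hedged example of what finer-grained grounding can look like (token-wise max similarity in the spirit of region-word alignment losses; the encoders and shapes here are placeholders): score each word against its best-matching region instead of comparing one pooled image vector to one pooled text vector, so every word has to be supported by some piece of the image.

```python
import torch
import torch.nn.functional as F

region_feats = torch.randn(1, 36, 512)   # e.g. 36 region / patch embeddings (stand-in)
word_feats = torch.randn(1, 12, 512)     # 12 word embeddings for the caption (stand-in)

region_feats = F.normalize(region_feats, dim=-1)
word_feats = F.normalize(word_feats, dim=-1)

# (1, 12, 36): similarity of every word to every region
sim = word_feats @ region_feats.transpose(1, 2)

# For each word, keep its best-matching region, then average over words.
# Training against this kind of score pushes each word to be grounded in some region,
# rather than letting a single pooled image vector paper over missing evidence.
fine_grained_score = sim.max(dim=-1).values.mean(dim=-1)
print(fine_grained_score)
```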
Alignment is getting RLHF’d
Just like in LLM land, we’re starting to RLHF VLMs with methods like DPO, PPO, GRPO, etc. to optimize for human preferences. But multimodal RLHF is messy. Designing reward models that actually “see” is hard. Some methods use critiques (e.g., MM-RLHF), others just freeze the vision encoder and optimize the decoder. Not much consensus here.
Also, Reinforcement Learning with Verifiable Rewards (RLVR) is kinda cool; no reward model needed, just a hard-coded verifier for task correctness.
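Part of the appeal is how dumb the reward can be. A toy verifier for, say, a counting-style VQA answer (entirely hypothetical, just to show the shape of it):

```python
def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Hard-coded verifier: reward 1.0 iff the normalized answers match exactly.
    No learned reward model involved, so there is nothing for the policy to exploit
    beyond actually getting the answer right."""
    def _normalize(s: str) -> str:
        return s.strip().lower().rstrip(".")
    return 1.0 if _normalize(model_answer) == _normalize(ground_truth) else 0.0

print(verifiable_reward("Three.", "three"))   # 1.0
print(verifiable_reward("four", "three"))     # 0.0
```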
On-device VLMs = sparsity is king
If you're doing local inference (e.g., mobile or embedded), sparsity matters more than anything. Pruning, quantization, attention windowing, anything to fit ViT-L/14 inside a latency budget. You don’t need GPT-4V to label a parking meter.
This is what I’m personally most excited about, and it’s what I’ve been working on for the past few weeks! I’ve been digging really deep into LLaVA-OneVision and sparsification techniques.
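As a flavor of how accessible the basic tooling is, here’s a sketch of unstructured magnitude pruning plus post-training dynamic quantization on a toy MLP, using stock PyTorch utilities. This is not my actual work setup, just the standard off-the-shelf APIs.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy stand-in for a projector / MLP block inside a VLM.
model = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 1024))

# Unstructured magnitude pruning: zero out the 50% smallest-magnitude weights per Linear.
for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")   # bake the pruning mask into the weight tensor

# Post-training dynamic quantization: int8 weights for Linear layers at inference time.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(quantized)
```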
There’s a lot more to say, from decoder tokenization quirks to visual grounding to whether caption quality can be reliably judged by an LLM itself. Lately, I’ve been thinking about the granularity of supervision—whether we lose something by optimizing only at the sequence level. But I’ll stop here for now.
VLMs sit at a weird intersection: part vision, part language, part systems engineering. That makes them brittle, fascinating, and very much still in flux. We’re still figuring out what “understanding” really looks like in a model that sees and speaks.
For my part, I’m especially excited about making this tech small: pruning, sparsity, and clever architectural tricks that let us push VLMs onto the edge without losing too much capability. More on that soon.
Citations:
https://arxiv.org/pdf/2304.00685
https://arxiv.org/pdf/2501.02189
https://www.therobotreport.com/mit-csails-new-vision-system-helps-robots-understand-their-bodies/ (image credit)