Contrastive Representation Regularization for Vision-Language-Action Models

Accepted to ICML 2026

¹KAIST, ²RLWRLD, ³UC Berkeley
Equal advising

TL;DR: Robot State-aware Contrastive Loss (RS-CL) is a lightweight contrastive regularizer that grounds VLM representations in the robot's proprioceptive states. It integrates seamlessly into standard VLA training without extra stages or substantial compute overhead, improving control-relevant representation learning and manipulation performance.

Abstract

Vision-Language-Action (VLA) models have shown strong capabilities in robot manipulation by leveraging rich representations from pre-trained Vision-Language Models (VLMs). However, their representations arguably remain suboptimal, lacking sensitivity to robotic signals such as control actions and proprioceptive states. To address this issue, we introduce Robot State-aware Contrastive Loss (RS-CL), a simple and effective representation regularization for VLA models, designed to bridge the gap between VLM representations and robotic signals. In particular, RS-CL aligns the representations more closely with the robot's proprioceptive states by using relative distances between the states as soft supervision. Complementing the original action prediction objective, RS-CL enhances control-relevant representation learning while remaining lightweight and fully compatible with standard VLA training pipelines. Our empirical results demonstrate that RS-CL substantially improves the performance of state-of-the-art VLA models, boosting success rates from 45.0% to 58.3% on challenging real-robot manipulation tasks.

Motivation: VLM representations miss robotic signals

We visualize VLM embeddings of robot episodes performing the same task ("Open the microwave / cabinet door") across different scenes. Pre-trained VLM representations are dominated by visual appearance (e.g., scene layout, surrounding objects), clustering by visual cues rather than control-relevant factors. RS-CL instead guides embeddings to align with the robot's proprioceptive states, capturing common robotic signals (e.g., current pose, next action) and aligning all episodes by task progress.

Figure panels: identical task trajectories; pre-trained VLM representations; RS-CL aligned representations.

Method

Robot State-aware Contrastive Loss (RS-CL) is an objective that regularizes the VLM representation space using supervision from the robot's proprioceptive states. It is built from three key components:

  • Learnable summarization token — amortizes long VLM output embeddings into a compact representation for efficient contrastive learning.
  • Robot state-aware weighting — a soft-weighted InfoNCE loss that assigns pairwise weights from the relative distances between proprioceptive states, pulling together embeddings with similar robot states.
  • View cutoff augmentation — a lightweight representation-level augmentation that masks out the feature slice of a randomly selected observation view, generating diverse contrastive pairs with minimal overhead.
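
The robot state-aware weighting above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `rs_cl_loss`, the temperatures `tau` and `beta`, the Euclidean state distance, and the softmax form of the pairwise weights are all hypothetical choices. The source only states that the weights are derived from relative distances between proprioceptive states.

```python
import numpy as np

def rs_cl_loss(z_a, z_b, states, tau=0.1, beta=1.0):
    """Soft-weighted InfoNCE over one batch (illustrative sketch).

    z_a, z_b : (B, D) two views of the summarized embeddings
    states   : (B, S) proprioceptive states for the same samples
    tau      : similarity temperature (assumed hyperparameter)
    beta     : state-distance temperature (assumed hyperparameter)
    """
    # L2-normalize so that dot products are cosine similarities.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / tau  # (B, B) pairwise similarities

    # Numerically stable row-wise log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Soft targets: pairs with closer robot states get larger weights.
    dist = np.linalg.norm(states[:, None, :] - states[None, :, :], axis=-1)
    weights = np.exp(-dist / beta)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Cross-entropy between the soft targets and the similarity distribution.
    return float(-(weights * log_prob).sum(axis=1).mean())
```

In this sketch, shrinking `beta` toward zero collapses each row of weights onto the sample's own pair, which recovers the standard one-hot InfoNCE loss. Larger values spread weight across samples with similar robot states.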

By operating at the representation level and training end-to-end in a single stage, RS-CL remains fully compatible with existing VLA training pipelines.
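
As an illustration of how a representation-level augmentation such as view cutoff can stay cheap, the sketch below zeroes the feature slice of one randomly chosen view per sample. The tensor layout `(B, V, T, D)`, the use of zeros as the mask value, and the helper name `view_cutoff` are assumptions made for this example; the source does not specify them.

```python
import numpy as np

def view_cutoff(feats, rng):
    """Zero out one randomly selected view per sample (illustrative sketch).

    feats : (B, V, T, D) features for B samples, V camera views,
            T tokens per view, D channels (assumed layout)
    rng   : a numpy.random.Generator
    Returns the masked copy and the index of the dropped view per sample.
    """
    batch, views = feats.shape[:2]
    drop = rng.integers(0, views, size=batch)  # one view index per sample

    # Build a (B, V, 1, 1) keep-mask that broadcasts over tokens and channels.
    keep = np.ones((batch, views, 1, 1), dtype=feats.dtype)
    keep[np.arange(batch), drop] = 0

    return feats * keep, drop
```

Because the mask is applied to features that have already been computed, the augmented pair needs no second forward pass through the vision encoder in this sketch.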

RS-CL overview

Results

Representation analysis

RS-CL injects control-relevant structure into the VLM representation, raising task-progress accuracy from 1.2% to 22.9%, while largely preserving the initial semantics (scene accuracy drops only from 99.6% to 93.6%).

Representation                        Scene Acc. (%)   Task-progress Acc. (%)
Pre-trained VLM embeddings            99.6             1.2
Embeddings trained with RS-CL         93.6             22.9
Ground-truth proprioceptive states    53.1             25.6

KNN classification accuracy (%) on the representation space.
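
A KNN probe of this kind can be sketched as follows. The exact protocol behind the reported numbers is not given here, so the leave-one-out evaluation, the Euclidean metric, the value of `k`, and the function name `knn_accuracy` are all assumptions for illustration.

```python
import numpy as np

def knn_accuracy(emb, labels, k=5):
    """Leave-one-out k-NN classification accuracy in percent (sketch).

    emb    : (N, D) embeddings to probe
    labels : (N,) non-negative integer class labels (e.g., scene id
             or a discretized task-progress bin)
    """
    # Pairwise Euclidean distances; exclude each point from its own neighbors.
    dist = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)

    # Indices of the k nearest neighbors for every point.
    neighbors = np.argsort(dist, axis=1)[:, :k]

    # Majority vote over the neighbors' labels.
    votes = labels[neighbors]
    pred = np.array([np.bincount(v).argmax() for v in votes])

    return 100.0 * float((pred == labels).mean())
```

Running the same probe with scene labels and with task-progress labels on one set of embeddings gives the two kinds of columns shown in the table.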

Real-robot manipulation

RS-CL improves manipulation precision on real-robot tasks, raising the average success rate from 45.0% to 58.3% (a gain of 13.3 percentage points), without harming generalization across visual appearance, physical properties, language following, and lighting conditions.

Real-robot task examples
Real-robot results

BibTeX

@inproceedings{kim2026contrastive,
  author    = {Kim, Taeyoung and Lee, Jimin and Koo, Myungkyu and Kim, Dongyoung and Lee, Kyungmin and Kim, Changyeon and Seo, Younggyo and Shin, Jinwoo},
  title     = {Contrastive Representation Regularization for Vision-Language-Action Models},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2026},
}