Embodied AI and Vision-Language-Action Models: Data Annotation, Robot Training Data, and Scaling Challenges

VLAs have recently become one of the hottest topics in robotics and embodied AI [1]. Much of the discussion has focused on the scale of data required to train them, inference latency and acceleration strategies, deployment challenges, and the safety of real world evaluation and motion planning. One of the central components behind these systems, however, has received far less attention: language annotation.

This is somewhat surprising given how foundational annotations are to the behavior of a VLA. They shape how actions are grounded, how tasks generalize across environments, and how robotic behavior becomes interpretable to humans. Despite this, annotation pipelines are often treated as secondary implementation details rather than core research problems.

In this post, we revisit the role of annotation in VLAs and argue that it deserves far more attention than it currently receives. We explore why VLAs rely so heavily on language supervision, how annotation is typically performed in modern robotics datasets, the practical and conceptual challenges involved in producing high quality labels, and the latest efforts aimed at improving this part of the pipeline.

Why Does Language Matter For VLAs

Language acts as the grounding component within the VLA pipeline. It helps models infer what is happening in a scene, how actions are being carried out, and what purpose those actions are meant to achieve [2]. Without language, many robotic behaviors become difficult to interpret beyond their raw motion patterns.

Consider a simple human example. If you were walking with a friend and suddenly saw them sharply change direction toward a group of people, you would likely assume there was a reason behind it. You might wonder whether they recognized someone in the group or intended to speak with them. The action itself only tells part of the story. The intent behind it provides the missing context.

A similar problem appears in robotic learning. If a model observes a gripper moving toward several nearby objects, physical priors and visual cues may help it infer that the robot intends to pick something up. What remains unclear, however, is why one object should be selected over another. From observation alone, that decision can appear almost arbitrary.

Language provides the semantic layer that connects actions to intent. By coupling trajectories and observations with linguistic supervision, models can learn not only what action was performed, but also the intent and objective behind it [2]. This transforms robotic behavior from a sequence of seemingly disconnected motions into actions that are grounded within a task.

At the same time, language is not the only source of grounding in robotic systems. Physical interaction, temporal consistency, environmental constraints, and visual structure all contribute to how a robot interprets and executes behavior. What makes language particularly important is that it provides a flexible interface capable of linking perception, action, and task-level semantics in a way that humans can naturally interact with [3].

What does Annotation Actually Mean in Robotics

Annotation can mean a variety of things depending on the context. In robotics pipelines, it can range from simple labels attached to trajectories all the way to hierarchical descriptions that split behavior into tasks and subtasks [1]. The form of annotation used often depends on the capabilities a model is expected to learn and the level of supervision required during training.

What follows is a brief overview of some of the most common annotation schemes used in VLAs and embodied AI systems.

Success-Failure Labeling

Success-failure labeling consists of marking whether a trajectory or task execution completed successfully or failed. Although relatively simple, this form of annotation is extremely valuable during training because it helps models recognize failure modes, distinguish between effective and ineffective behaviors, and in some cases learn recovery strategies.

These labels are commonly applied either at the trajectory level or at intermediate stages within a task. A full rollout may be marked as successful if the objective is achieved, while individual segments can also be annotated to indicate where execution began to diverge from the intended behavior. This becomes especially important in long-horizon tasks where small early mistakes may compound into larger failures later in the trajectory [4].

Task Labels

Task labels, much like success-failure annotations, are relatively straightforward. They specify the task being demonstrated, such as “pick up the cup,” “open the drawer,” or “place the object on the shelf.” Despite their simplicity, task labels are arguably one of the most important forms of annotation because they explicitly define the objective of a trajectory.

Without task-level supervision, a model may observe motions and interactions without fully understanding what behavior the demonstration is intended to accomplish. Task labels provide the high-level semantic target that connects low-level actions to an overarching goal. In many VLA systems, this relationship between trajectory and task description becomes the foundation for generalization, allowing models to apply similar behaviors across different environments, objects, or embodiments [5].

This type of annotation becomes weaker on long-horizon/complex tasks as it becomes harder to relate the actions with the end goal, however, it still provides the cohesion necessary between the subtasks in order to achieve the goal.

Task labels also play an important role in evaluation. They make it possible to measure whether a policy completed the intended objective rather than simply reproducing plausible motion patterns.

Instruction Tuning Data

Instruction tuning data is designed to teach models how natural language instructions relate to task execution and final objectives [6]. Unlike simple task labels, which only specify the end goal of a demonstration, instruction tuning data attempts to capture the relationship between language, intermediate actions, and the progression of a task over time.

This form of annotation is particularly important because it helps models learn how instructions map onto behavior through sequential execution. A command such as “place the mug in the sink” is not represented as a single atomic action, but rather as a series of coordinated subtasks involving navigation, object recognition, grasping, and placement. By pairing instructions with trajectories, models can begin to associate linguistic structure with the sequence of actions required to complete a task.

Instruction tuning data is also one of the key components that enables more natural human-robot interaction. Instead of relying on narrowly defined commands, robots trained with this type of supervision can better handle varied phrasing, paraphrased requests, and more flexible task descriptions [2].

Dense Trajectory Annotations

Dense trajectory annotations can be thought of as a fine-grained combination of multiple annotation signals applied throughout the full execution of a task. Rather than assigning a single label to an entire demonstration, dense annotations capture information continuously across the trajectory, including subtasks, intermediate actions, object interactions, task progression, and changes in environmental state [4].

This type of annotation is especially valuable because of the richness of supervision it provides. Instead of learning only from the final outcome of a task, models receive semantic information at many different stages of execution. This gives VLAs access to far more structured training data and allows them to better learn temporal relationships, subtask transitions, and the connection between low-level actions and high-level objectives.

Dense annotations become particularly important in long-horizon tasks where execution unfolds over many sequential stages. By grounding language throughout the trajectory rather than only at the beginning or end, the model can build a more complete representation of how actions evolve over time and contribute toward achieving a goal.

How Annotation Pipelines Work Today

Modern annotation pipelines in robotics rarely rely on a single method. Most systems combine human supervision, automated labeling, and increasingly vision-language models to generate annotations at scale [1]. The exact structure of the pipeline depends on the type of robot, the size of the dataset, and the level of semantic detail required.

Human Annotation

One of the most traditional approaches involves collecting demonstrations through teleoperation and manually attaching labels afterward [7]. During collection, robots record trajectories, camera observations, actions, and environmental states while a human operator performs tasks.

Annotators later review these demonstrations and assign labels such as task descriptions, subtasks, success-failure markers, or instruction sequences. Human annotation generally produces higher quality labels because people are better at interpreting intent and contextual meaning. However, this process becomes difficult to scale as datasets grow larger [7].

Modern VLA datasets may contain millions of frames and thousands of trajectories, making fully manual annotation extremely expensive in both time and labor [5].

Retrospective Captioning

To reduce annotation costs, many pipelines now rely on retrospective captioning. Instead of labeling demonstrations during collection, trajectories are first stored and processed later using language models or vision-language models [4].

These systems can generate descriptions of actions, identify manipulated objects, segment trajectories into subtasks, or produce dense trajectory annotations across time. Human annotators may then refine or verify the generated outputs rather than creating labels entirely from scratch.

This approach dramatically improves scalability, though it also introduces challenges related to consistency and hallucinated descriptions.

Simulation-Based Annotation

Simulation provides another powerful source of annotation. Unlike real-world environments, simulated systems have direct access to structured information such as object identities, spatial positions, collisions, and task states.

Because this information already exists in machine-readable form, labels can often be generated automatically with minimal human intervention. This makes simulation attractive for large-scale robotic training pipelines where collecting and annotating real-world data would otherwise be prohibitively expensive [1].

The downside is that simulated annotations do not always transfer cleanly into real-world environments. Differences in physics, appearance, sensor noise, and environmental complexity can create significant gaps between simulated supervision and real robotic behavior [8].

VLM-Based Annotation Pipelines

Recent advances in large vision-language models are rapidly changing how annotation is approached in embodied AI [4]. Instead of relying entirely on human-written labels, many modern pipelines use pretrained multimodal models to automatically relabel trajectories, generate instructions, summarize behaviors, or decompose tasks hierarchically.

In some systems, this process becomes iterative. Models generate annotations that are then used to train stronger robotic policies, which in turn generate more trajectories and supervision data. This creates a feedback loop where annotation and policy learning continuously improve one another.

Despite their promise, these systems remain imperfect. Automatically generated labels may contain semantic ambiguity, inconsistent phrasing, or incorrect assumptions about intent, especially in long-horizon tasks where goals evolve over time.

The Scaling Problem

As robotics datasets continue to grow, annotation quality and annotation scale are increasingly becoming competing objectives [5]. Human-generated labels are usually more reliable but difficult to produce at scale, while automated pipelines are scalable but often less consistent and semantically grounded.

Much of the current research landscape in VLA supervision is centered around balancing these two constraints.

The Core Problem With Annotation

Although annotation is one of the foundational components behind modern VLAs, it also remains one of the least standardized and most problematic parts of the pipeline. As datasets continue to scale and tasks become increasingly complex, many of the limitations of current annotation strategies are becoming more visible.

Cost and Scalability

One of the most immediate challenges is cost. High quality robotic annotation is expensive, especially when demonstrations require detailed semantic descriptions, temporal segmentation, or human verification [7].

Unlike traditional image classification datasets where a single label may be sufficient, robotic trajectories often require contextual understanding across time. Annotators may need to interpret intent, identify task progression, distinguish between subtasks, and determine whether a failure occurred because of planning, perception, or environmental interaction.

As datasets grow into millions of frames and thousands of hours of demonstrations, maintaining high quality human supervision becomes increasingly impractical [5]. This has pushed much of the field toward automated annotation systems despite their limitations.

Ambiguity in Natural Language

Language itself is inherently ambiguous. Different annotators may describe the same behavior using completely different phrasing, while identical words may carry different meanings depending on context.

For example, one annotator may describe an action as “placing a mug on a table,” while another writes “setting down the cup.” To a human these descriptions are obviously related, but models may interpret them differently depending on training distribution and linguistic structure [3].

This ambiguity becomes even more problematic in long-horizon tasks where intent changes throughout execution and multiple valid descriptions may exist simultaneously.

Lack of Standardization

There is currently no universally accepted annotation standard for VLAs or embodied AI systems [1]. Different datasets use different task descriptions, labeling hierarchies, levels of granularity, and semantic assumptions.

Some datasets focus on high-level task labels while others prioritize dense trajectory annotations or instruction decomposition. As a result, combining datasets often introduces inconsistencies in wording, structure, and supervision style.

This fragmentation makes it difficult to build unified robotic systems capable of learning consistently across diverse data sources.

Semantic Drift Across Annotators

Even when annotation guidelines exist, humans naturally introduce variation into the labeling process. Two annotators observing the same trajectory may focus on different aspects of the task, interpret intent differently, or choose entirely different wording.

Over large datasets, this creates semantic drift where annotations gradually lose consistency across demonstrations. The problem becomes especially severe when labels are collected over long periods of time or across distributed teams [5].

In practice, models may end up learning annotator-specific patterns rather than universally meaningful semantic representations.

Weak Grounding Between Language and Action

One of the central challenges in VLAs is ensuring that language remains properly grounded in physical interaction. A textual description may correctly describe the outcome of a task while failing to capture the actual behavior required to achieve it.

For example, two trajectories may both satisfy the label “put the object away” while involving entirely different motions, environments, or interaction strategies. If annotations are too abstract, models may struggle to associate language with meaningful physical behavior [2].

Grounding becomes even more difficult when datasets contain sparse supervision or weak temporal alignment between language and action.

Dataset Bias

Annotation pipelines also inherit many of the biases present in the people and systems generating the labels. Certain objects, environments, tasks, or interaction styles may appear more frequently than others, causing models to overfit toward narrow distributions of behavior [5].

Language bias can also emerge through phrasing patterns, cultural assumptions, or repeated instruction styles. Over time, models may learn correlations that reflect annotation artifacts rather than meaningful robotic understanding.

These biases become particularly problematic during real-world deployment where environments are significantly more diverse than curated training datasets.

Missing Temporal Context

Many annotations fail to capture how tasks evolve over time. A single task label attached to an entire trajectory may ignore critical transitions, intermediate failures, or changing objectives during execution [4].

This becomes especially problematic in long-horizon tasks where early decisions influence later stages of the trajectory. Without sufficient temporal information, models may struggle to understand subtask dependencies or causal relationships between actions.

Dense trajectory annotation partially addresses this issue, but producing high quality temporal supervision at scale remains difficult.

Granularity Mismatch

Another common problem is determining the correct level of annotation granularity. Labels that are too coarse may fail to capture important behavioral structure, while labels that are too detailed can introduce noise and unnecessary complexity.

For example, a high-level label such as “clean the kitchen” may hide dozens of intermediate actions, while excessively fine-grained annotations may overwhelm the model with information that does not meaningfully improve learning.

Finding the right balance between abstraction and specificity remains an open challenge across much of embodied AI research [1].

Human Language vs Robotic Requirements

Perhaps the most fundamental issue is that human language was never designed specifically for robotic supervision. Humans naturally omit information that other humans can infer from context, prior experience, or shared physical understanding.

Robots, however, often require significantly more explicit grounding [3]. Actions that seem obvious to people may contain ambiguities that become major learning problems for embodied systems.

This creates a persistent mismatch between how humans describe behavior and the type of structured information robots actually need in order to learn reliably.

Current Perspectives and Developments in Annotation

As VLAs continue to scale, annotation is increasingly being treated less as a static preprocessing step and more as an active component of the learning pipeline itself. Much of the recent progress in embodied AI has shifted from simply collecting more trajectories toward improving the semantic quality and structure of the supervision attached to them.

VLM-Based Annotation

One of the clearest trends is the growing use of large vision-language models as automatic annotators [4]. Instead of relying entirely on human-written labels, modern pipelines increasingly use pretrained multimodal systems to relabel trajectories, generate instructions, identify subtasks, and produce dense semantic descriptions across time.

This approach dramatically reduces annotation costs while allowing datasets to scale far beyond what fully manual supervision would realistically permit [7].

At the same time, researchers are becoming increasingly aware of the limitations of purely synthetic annotation. Automatically generated labels may be grammatically correct while still failing to accurately represent intent, physical constraints, or task progression. Because of this, many current pipelines are moving toward hybrid systems where language models generate annotations that are later filtered, refined, or verified through human feedback.

Hierarchical and Dense Supervision

Another major development is the growing focus on hierarchical supervision. Earlier robotic datasets often relied heavily on single task-level labels, but modern systems are increasingly incorporating subtasks, intermediate goals, and temporally grounded descriptions throughout execution [5].

This reflects a broader understanding that long-horizon robotic behavior cannot always be represented effectively through sparse supervision alone. Dense trajectory annotations and hierarchical task decompositions provide models with richer temporal structure and help connect low-level actions to broader objectives.

As robotic systems become more capable, the field is gradually moving away from treating trajectories as isolated actions and toward modeling them as sequences of semantically connected behaviors [1].

Interactive Annotation

There is also increasing interest in interactive annotation systems where supervision evolves during deployment rather than remaining fixed after training [9].

Instead of learning exclusively from static datasets, robots may receive corrective feedback, preference signals, or natural language refinements while interacting with users and environments. In these systems, annotation becomes part of a continuous feedback loop rather than a one-time labeling process.

This direction is especially promising for real-world deployment where environments are dynamic and fixed datasets may fail to capture the variability robots encounter in practice.

Simulation and Synthetic Supervision

Simulation is similarly becoming more important as a source of scalable supervision. Because simulated environments provide direct access to object states, contacts, environmental structure, and task information, they allow for the generation of highly detailed annotations with relatively low cost.

Current research is increasingly exploring how synthetic supervision generated in simulation can transfer more reliably into real-world robotic systems [8]. While the sim-to-real gap remains a major challenge, simulation continues to be one of the most practical ways of generating large-scale semantically structured robotic data.

Shifting Perspectives on Annotation

Perhaps the largest conceptual shift is that annotation is no longer viewed purely as metadata attached to demonstrations. Increasingly, it is being treated as one of the primary mechanisms through which robots acquire semantic understanding, task structure, and grounding [2].

The discussion is gradually moving away from simply asking how to label more data and toward understanding what forms of supervision actually produce more capable embodied systems.

As robotics systems continue to improve, the challenge is no longer only collecting trajectories at scale. It is determining how to attach meaningful semantic structure to those trajectories in ways that allow robots to generalize, interact naturally with humans, and operate reliably in complex environments.

Nurvai’s Closing Thoughts

Despite the rapid progress of VLAs and embodied AI systems, annotation remains one of the major unsolved problems in the field. As robotic datasets continue to grow in scale and complexity, the challenge is no longer simply collecting demonstrations, but understanding how to efficiently attach meaningful semantic structure to them.

At Nurvai, this is still an area we are actively exploring. We are currently developing our own annotation frameworks and running internal experiments to better understand how these systems can be made more scalable, consistent, and computationally efficient. Much of our work is focused on understanding where automated supervision succeeds, where it breaks down, and how different forms of annotation influence downstream robotic behavior.

One of the conclusions we continue to arrive at is that fully manual annotation does not scale [7], but fully automated annotation is still unreliable in many real-world scenarios. Because of this, we believe the future of robotic annotation will likely emerge from hybrid systems that combine automated semantic generation with human oversight and refinement.

Rather than replacing humans in the loop entirely, the goal is to build systems where automated annotation layers accelerate the process while human supervision maintains grounding, consistency, and contextual understanding. As embodied systems become increasingly capable, finding the right balance between these two components may become one of the defining challenges in how robots learn and interact with the world.

References

[1] K. Kawaharazuka, J. Oh, J. Yamada, I. Posner, and Y. Zhu, “Vision-language-action models for robotics: A review towards real-world applications,” IEEE Access, vol. 13, pp. 162467–162504, 2025, doi: 10.1109/ACCESS.2025.3609980.

[2] B. Zitkovich et al., “RT-2: Vision-language-action models transfer web knowledge to robotic control,” in Proceedings of the 7th conference on robot learning, in Proceedings of machine learning research, vol. 229. 2023, pp. 2165–2183. Available: https://arxiv.org/abs/2307.15818

[3] C. Lynch et al., “Interactive language: Talking to robots in real time,” IEEE Robotics and Automation Letters, vol. 8, no. 6, pp. 3980–3987, 2023, doi: 10.1109/LRA.2023.3295255.

[4] J. Salfity, S. L. Wanna, M. Choi, and M. Pryor, “Temporal and semantic evaluation metrics for foundation models in post-hoc analysis of robotic sub-tasks,” in arXiv preprint, 2024. Available: https://arxiv.org/abs/2403.17238

[5] Open X-Embodiment Collaboration, “Open X-embodiment: Robotic learning datasets and RT-X models,” in Proceedings of the IEEE international conference on robotics and automation (ICRA), 2024. Available: https://arxiv.org/abs/2310.08864

[6] S. Longpre et al., “The Flan collection: Designing data and methods for effective instruction tuning,” in Proceedings of the 40th international conference on machine learning, in Proceedings of machine learning research, vol. 202. 2023, pp. 22631–22648. Available: https://arxiv.org/abs/2301.13688

[7] S. Dass, K. Pertsch, H. Zhang, Y. Lee, J. J. Lim, and S. Nikolaidis, “PATO: Policy assisted TeleOperation for scalable robot data collection,” in Proceedings of robotics: Science and systems, 2023. Available: https://arxiv.org/abs/2212.04708

[8] Y. Jiang, C. Wang, R. Zhang, J. Wu, and L. Fei-Fei, “TRANSIC: Sim-to-real policy transfer by learning from online correction,” in Proceedings of the conference on robot learning, 2024. Available: https://arxiv.org/abs/2405.10315

[9] J. Hejna and D. Sadigh, “Few-shot preference learning for human-in-the-loop RL,” in Proceedings of the 6th conference on robot learning, in Proceedings of machine learning research, vol. 205. 2023, pp. 2014–2025. Available: https://arxiv.org/abs/2212.03363