As embodied AI continues to develop, the conversation around data keeps resurfacing. New collection methods are being proposed, domain adaptation pipelines are making it possible to reuse existing datasets in new contexts, and models are becoming increasingly effective at extracting value from the data they consume. Despite these advances, data remains one of the central bottlenecks in the development of capable robotic systems.

Collecting high-quality robotic data at scale is still expensive, time-consuming, and technically challenging. Many of the limitations facing modern embodied AI systems can ultimately be traced back to limitations in the data available to train them.

In this blog post, we explore five of the most significant barriers to large-scale data collection in robotics. Our goal is not only to highlight the challenges that still exist, but also to encourage discussion around potential solutions. As a community, understanding these bottlenecks is the first step toward building the infrastructure needed to overcome them.

Why Data Collection Matters

As we have discussed in previous blog posts, data is one of the most critical components of the embodied AI pipeline. The data a model consumes shapes both its capabilities and its representation of the world. In many cases, real-world data provides the richest and most accurate view of the environments in which a robot is expected to operate.

The challenge is that collecting this data at scale remains difficult. While real-world data often produces the best learning signal, its benefits are limited by how much of it can realistically be gathered, curated, and annotated. As a result, data collection has become one of the primary bottlenecks in both training and deployment.

Understanding what limits data scalability today is therefore important for anyone working in embodied AI. Many of the field’s current challenges, from model robustness to generalization, are closely tied to the availability of high-quality data. If we want to build more capable robotic systems in the future, we first need to understand the obstacles preventing us from collecting and utilizing data at the scale required.

Sensor Drift and Hardware Consistency

Robotics data is meant to capture the real world and the dynamics that govern it. Anything that reduces a dataset’s ability to accurately represent those dynamics becomes a barrier to effective learning. One of the most significant challenges in this regard is sensor consistency.

During extended data collection sessions, sensors can drift away from their original calibration. Components heat up, voltages fluctuate, mechanical systems experience wear, and environmental conditions change. Over time, these factors can introduce small measurement errors that accumulate into meaningful discrepancies between the recorded data and the physical world.

In most data collection pipelines, drift is identified through quality assurance processes that monitor sensor outputs and compare them against expected behavior. When drift exceeds an acceptable tolerance, the affected data is often discarded or the collection session is repeated. At small scales, this is usually manageable. Recording sessions tend to be shorter, equipment spends less time under continuous load, and operators can more easily monitor the health of the system.
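
To make the tolerance check described above concrete, here is a minimal sketch of one way it could look. It assumes the sensor has a known reference value (for example, an unloaded force sensor that should read zero) and compares the average of recent readings against it. The function name, the window size, and the tolerance values are all hypothetical and chosen for illustration; real pipelines will likely use more sophisticated checks.

```python
from statistics import mean


def drift_exceeds_tolerance(readings, reference, tolerance, window=50):
    """Return True if the mean of the most recent `window` readings
    deviates from the known reference value by more than `tolerance`."""
    if not readings:
        raise ValueError("no readings to check")
    recent = readings[-window:]  # only look at the latest samples
    return abs(mean(recent) - reference) > tolerance


# Hypothetical data: a force sensor that should read 0.0 N when unloaded.
stable = [0.01, -0.02, 0.00, 0.02, -0.01]
drifted = [0.01, 0.05, 0.10, 0.16, 0.21]

print(drift_exceeds_tolerance(stable, reference=0.0, tolerance=0.05))   # False
print(drift_exceeds_tolerance(drifted, reference=0.0, tolerance=0.05))  # True
```

Averaging over a window rather than reacting to a single sample keeps ordinary sensor noise from triggering a recalibration, at the cost of noticing drift slightly later.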

At scale, however, the problem becomes much harder to manage. Data collection infrastructure often operates for extended periods with minimal downtime. Equipment may run continuously, multiple operators may interact with the same setup, and environmental conditions can vary throughout the day. Small inconsistencies that might be negligible during a single collection session can become significant when multiplied across hundreds or thousands of hours of data.

The challenge is not simply that sensors drift, but that maintaining consistency across large-scale collection efforts becomes increasingly difficult as the volume of data grows. Ensuring that observations collected on different days, by different operators, and across different hardware configurations remain comparable requires careful calibration procedures, continuous monitoring, and robust quality control processes. Without them, the scale of the dataset may increase while the reliability of the data gradually decreases.

Operator Fatigue and Human Limitations

Data collection is a non-trivial task for operators as well. While discussions around scalability often focus on hardware and software limitations, the human element is equally important.

Collecting robotic demonstrations can be physically demanding. Operators may spend hours standing, repeating the same motions, and interacting with teleoperation devices or grippers that can place significant strain on the hands and wrists. To maintain both performance and safety, operators often require regular breaks between recording sessions to rest and recover.

The challenge is not purely physical. Data collection is also highly repetitive, especially when large numbers of demonstrations are required for a single task. Performing the same actions hundreds or even thousands of times can become mentally exhausting, leading to reduced focus, inconsistent demonstrations, and lower overall data quality.

As collection efforts scale, operator fatigue becomes increasingly difficult to ignore. More data does not simply require more storage, sensors, or robots. It also requires sustained human effort. Maintaining high-quality demonstrations over long periods of time demands careful consideration of operator workload, ergonomics, scheduling, and recovery. Without these considerations, human fatigue can become a significant bottleneck in both the quantity and quality of the data that can be collected.

Storage and Memory Constraints

Robotics data is also surprisingly heavy. A single demonstration often contains one or more high-resolution video streams alongside robot states, joint positions, end-effector poses, force measurements, and other sensor readings. While any individual recording may appear manageable, the storage requirements grow rapidly as collection efforts scale.
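
A quick back-of-the-envelope calculation shows how fast this adds up. The sketch below estimates the data rate of a recording setup and how long a device would last before filling up. Every number in it (three cameras, 20 Mbit/s per compressed stream, 100 Hz state logging at 512 bytes per sample, a 512 GB device) is an assumption picked for illustration, not a measurement from any particular rig.

```python
def bytes_per_hour(num_cameras, video_mbps, state_hz, state_bytes):
    """Estimate recorded bytes per hour for video streams plus robot state."""
    # Video: megabits per second -> bytes per second -> bytes per hour.
    video = num_cameras * video_mbps * 1_000_000 / 8 * 3600
    # State: samples per second * bytes per sample -> bytes per hour.
    state = state_hz * state_bytes * 3600
    return video + state


def hours_until_full(capacity_bytes, rate_bytes_per_hour):
    """Hours of recording that fit on a device of the given capacity."""
    return capacity_bytes / rate_bytes_per_hour


# Assumed setup (illustrative values only).
rate = bytes_per_hour(num_cameras=3, video_mbps=20, state_hz=100, state_bytes=512)
print(f"{rate / 1e9:.1f} GB per hour")
print(f"{hours_until_full(512e9, rate):.1f} hours on a 512 GB device")
```

Under these assumed numbers the setup produces roughly 27 GB per hour, almost all of it video, so a portable device with a few hundred gigabytes of storage fills up in well under a day of continuous recording.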

This creates challenges both during collection and after the data has been recorded. Many recording devices need to remain portable and therefore have limited onboard storage, often well under a terabyte. During large-scale collection campaigns, these devices can fill up quickly, requiring operators to frequently transfer, organize, and clear data before recording can continue.

The problem does not end once the data has been collected. Large datasets need to be stored, backed up, and made accessible to researchers and training pipelines. Cloud storage offers convenience and scalability, but costs can grow rapidly as datasets expand, particularly when multiple copies are maintained for redundancy or long-term archival purposes. Without a clear retention strategy, storage expenses can become a significant operational cost.

Local storage introduces a different set of challenges. Organizations must purchase and maintain physical hardware, manage server infrastructure, ensure adequate cooling, implement backup systems, and control access to sensitive data. As datasets grow into the tens or hundreds of terabytes, these infrastructure requirements become increasingly difficult and expensive to manage.

While storage is often viewed as a secondary concern compared to collection itself, it can quickly become a bottleneck. Every additional hour of data collected must ultimately be transferred, stored, organized, and maintained somewhere. As a result, scaling robotics datasets is not only a data collection problem, but also a data infrastructure problem.

Physical Space Requirements

One challenge that is not discussed very often is the amount of physical space required to collect data at scale. While much of the conversation focuses on robots, sensors, and storage infrastructure, large-scale data collection also depends on having enough controlled space for operators to work effectively.

As collection efforts grow, each operator typically requires a dedicated recording area where demonstrations can be performed without interference from nearby activity. Other operators moving through the scene, conversations occurring in the background, or objects unintentionally entering the camera’s field of view can all introduce unwanted variability into the dataset. Maintaining clean and consistent recordings therefore becomes increasingly difficult as more collection stations are added to the same facility.

Physical space is also important from the perspective of generalization. The environments used during collection should expose the model to a range of realistic conditions while still remaining sufficiently controlled for reliable data acquisition. A collection setup that is too uniform may limit the diversity of experiences available to the model, while one that is overly chaotic can make it difficult to isolate the behaviors being taught.

This challenge becomes even more apparent when collecting data for general-purpose robotic systems. Models intended to operate in homes, offices, warehouses, or industrial settings benefit from exposure to a wide variety of environments. Collecting this kind of data requires either access to many different spaces or the continuous reconfiguration of existing ones. A kitchen may need to become a living room, then an office, and later a workshop, all while maintaining recording quality and operational efficiency. As a result, physical environments themselves become a resource that must be built, maintained, and managed as part of the data collection pipeline.

Scaling data collection therefore requires more than simply adding operators or robots. It often requires expanding the physical infrastructure itself. Dedicated collection spaces, carefully designed recording environments, and sufficient separation between stations all become important considerations as organizations attempt to gather larger volumes of high-quality data.

In this sense, physical space becomes another resource that must scale alongside hardware, storage, and personnel. While it receives far less attention than compute budgets or model architectures, it is often one of the hidden constraints that determine how quickly and effectively a data collection operation can grow.

Data Quality Analysis

Assessing data quality in robotics is far from straightforward. Unlike many traditional machine learning domains, there is rarely a clear line between usable and unusable data. Packet loss, dropped frames, sensor noise, or synchronization issues may be perfectly acceptable at certain rates depending on the task and the role the dataset will play within the training pipeline. A few missing frames in a long navigation sequence may have little impact, while the same issue during a critical manipulation event could significantly reduce the value of a demonstration.
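
The idea that the same defect can be acceptable in one place and disqualifying in another can be sketched in a few lines. The example below detects dropped frames from timestamps, then accepts a recording only if the overall drop rate stays under a task-specific limit and no gap overlaps a window marked as critical (for example, a grasp). The function names, the 1.5x slack factor, the thresholds, and the notion of hand-labeled critical windows are all assumptions made for illustration.

```python
def find_dropped_frames(timestamps, expected_fps, slack=1.5):
    """Return (gap_start, gap_end, missing_count) for each gap between
    consecutive timestamps that is longer than `slack` frame periods."""
    period = 1.0 / expected_fps
    gaps = []
    for prev, curr in zip(timestamps, timestamps[1:]):
        delta = curr - prev
        if delta > slack * period:
            missing = round(delta / period) - 1
            gaps.append((prev, curr, missing))
    return gaps


def is_usable(timestamps, expected_fps, max_drop_rate, critical_windows=()):
    """Accept a recording if its drop rate is within tolerance and
    no gap overlaps a critical (start, end) time window."""
    if not timestamps:
        raise ValueError("recording has no frames")
    gaps = find_dropped_frames(timestamps, expected_fps)
    for gap_start, gap_end, _ in gaps:
        for win_start, win_end in critical_windows:
            if gap_start < win_end and gap_end > win_start:
                return False  # frames lost during a critical event
    missing = sum(count for _, _, count in gaps)
    expected = len(timestamps) + missing
    return missing / expected <= max_drop_rate


# Hypothetical 10 fps recording with the frames at 0.5 s and 0.6 s missing.
stamps = [0.0, 0.1, 0.2, 0.3, 0.4, 0.7, 0.8, 0.9]

print(is_usable(stamps, expected_fps=10, max_drop_rate=0.25))
print(is_usable(stamps, expected_fps=10, max_drop_rate=0.25,
                critical_windows=[(0.45, 0.65)]))
```

The first call accepts the recording because two missing frames out of ten fall within the assumed 25% tolerance. The second rejects the same recording because the gap lands inside the critical window. A check like this only covers the measurable defects; the subtler issues still need a human or model-based reviewer.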

There are also more subtle quality concerns that are difficult to identify automatically. An operator may unintentionally perform a behavior that should not be learned, a demonstration may contain unsafe motions, or irrelevant objects and people may enter the camera’s field of view. In some cases, these issues are obvious to a human reviewer but difficult to define through a simple rule-based validation process.

As a result, quality assurance often becomes a major bottleneck in large-scale collection efforts. Automating the process typically requires either significant computational resources or large-scale model-based evaluation systems capable of reviewing demonstrations and identifying potential issues. Whether through computer vision pipelines, foundation models, or specialized validation frameworks, verifying that data meets quality standards can itself become a costly and resource-intensive task. The challenge is not simply collecting enough data, but confidently determining that the data collected is actually worth keeping.

Nurvai's Closing Thoughts

While large-scale robotics data collection is closer to reality than it has ever been, there is still a considerable amount of work to be done before high-quality data can be collected efficiently and reliably at scale. Challenges related to hardware reliability, operator workload, storage infrastructure, physical space, and quality assurance continue to place practical limits on how quickly datasets can be expanded.

The good news is that these challenges are receiving increasing attention from both industry and academia. As the field matures, we can expect new collection methodologies, better tooling, improved automation pipelines, and more robust validation frameworks to gradually reduce many of the bottlenecks discussed throughout this article.

At Nurvai, these are not just theoretical problems. They are challenges we actively think about while designing our own data collection and validation workflows. We continuously experiment with new processes, tooling, and evaluation frameworks to make data collection more scalable without sacrificing quality. While there is no single solution that addresses every bottleneck, we believe meaningful progress will come from treating data collection as a first-class engineering problem rather than simply a prerequisite for training.

Ultimately, the future of embodied AI will depend not only on better models, but also on our ability to reliably generate the high-quality data those models require. Solving that challenge will require collaboration across the entire community, from researchers and operators to hardware engineers and infrastructure teams. The sooner we treat data collection as a core problem, the sooner we can begin building systems capable of learning at the scale the field demands.


Picture credit: ©Denes Erdos - Your Event Photographer https://youreventphoto.com/