Building general-purpose embodied AI has eluded researchers for decades, but new approaches to simulation, adaptation, and learning from human video are finally unlocking scalable embodied intelligence and bringing general-purpose robotics closer to reality.
The history of AI is tightly bound to the dream of embodied intelligence. Since the 1950s, the goal has been clear: build machines that can perform thousands of tasks across thousands of environments. Yet despite remarkable advances in computing, data, and machine learning, robots still struggle to generalise outside narrow demonstrations. In his presentation at NVIDIA GTC 2025, Deepak Pathak, CEO and Co-Founder of Skild AI, sets out a new path to achieving the elusive goal of general-purpose embodied AI.
Rather than chasing incremental improvements, Pathak proposes a fundamental shift. By using massive-scale simulation, harvesting human actions from the vast store of online video, and training adaptable, resilient models, he believes robotics can finally move beyond the endless loop of one-off demos and into the fabric of everyday life.
The illusion of progress in robotics
Robotics has long been a field where appearances deceive. Impressive demonstrations are plentiful, from the robotic arms of the 1950s to today’s humanoid machines. Yet, as Pathak warns, “In robotics, seeing is not believing.” Over decades, the gap between staged performance and real-world deployment has barely narrowed.
The root cause is structural. Robotics faces a chicken-and-egg dilemma: robots are not yet competent enough to deploy widely, but without wide deployment, there is no opportunity to collect the trillion-scale datasets that have powered advances in fields like language and vision. Without data, robots cannot learn, and without learning, they cannot generalise. The result is a field locked in a cycle of impressive but ultimately narrow demonstrations.
“Unlike speech or vision where there is an abundance of datasets on the internet, there is no such thing for robotics,” Pathak explains. “You cannot apply the same recipe of large datasets, big networks, and GPU training to robotics because you get stuck at the first step.”
Attempts to brute-force the problem through manual data collection have proven prohibitively expensive and slow. Efforts to graft large language models onto robots have produced limited generalisation, mostly restricted to simple pick-and-place tasks. “The hard part in robotics is not language,” Pathak says. “It is the actual physical skill, the dexterity, the adaptation to uncertainty. And that has proven stubbornly difficult to scale.”
Breaking the cycle with simulation and adaptation
Pathak argues that robotics must take a different approach to escape this deadlock. Instead of waiting for real-world data, researchers can leverage two abundant and underutilised resources: simulation and human video.
Simulation offers safe, scalable, high-speed environments where robots can experience millions of scenarios in days rather than years. Yet simply training in simulation is not enough because no simulation can perfectly match the messiness of the real world. Pathak explains that the critical breakthrough is not to seek perfect simulation but to embrace adaptation.
“Our insight is that rather than trying to build models invariant to changes, we should build models that adapt online, continuously, to whatever environment they encounter,” Pathak says. “By observing the discrepancy between a robot’s intended action and its actual result, the system can infer properties of its environment, such as friction, mass, or surface stability, and adjust its behaviour accordingly.
“This approach turns noise into a friend. Instead of trying to ignore differences, we use them to adapt, which allows robots to operate across unpredictable, dynamic conditions.”
In demonstrations, this adaptive approach enables low-cost robots to perform feats that would otherwise require expensive, finely tuned systems. Robots trained with this method can climb stairs as tall as their own height, recover in real time from being struck with heavy weights, and navigate unstable, vibrating surfaces without explicit programming.
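To make the idea concrete, here is a minimal sketch, in Python, of the kind of online adaptation loop Pathak describes: the controller compares the outcome it expected from an action with the outcome it observed, and uses that discrepancy to update an estimate of latent environment properties such as friction or payload mass. The class, forward model, and update rule below are illustrative assumptions, not Skild AI’s implementation.

```python
import numpy as np

class OnlineAdapter:
    """Toy online adaptation loop (illustrative, not Skild AI's system).

    Keeps a running estimate of latent environment properties
    (e.g. friction, payload mass) and updates it from the gap between
    the predicted and the observed result of each action.
    """

    def __init__(self, n_params: int, learning_rate: float = 0.05):
        self.latent = np.zeros(n_params)    # estimated environment properties
        self.learning_rate = learning_rate

    def predict_outcome(self, action: np.ndarray) -> np.ndarray:
        # Placeholder forward model: a learned dynamics model conditioned
        # on the latent estimate would sit here in a real system.
        return action * (1.0 + self.latent[: action.shape[0]])

    def update(self, action: np.ndarray, observed: np.ndarray) -> None:
        # The intended-vs-actual discrepancy drives the update, turning
        # the sim-to-real gap into a learning signal rather than noise.
        error = observed - self.predict_outcome(action)
        gradient = np.zeros_like(self.latent)
        gradient[: action.shape[0]] = error * action
        self.latent += self.learning_rate * gradient

    def act(self, nominal_action: np.ndarray) -> np.ndarray:
        # Compensate the nominal policy output using the current estimate.
        return nominal_action / (1.0 + self.latent[: nominal_action.shape[0]])

# One step of sensing, estimation, and compensation
adapter = OnlineAdapter(n_params=3)
adapter.update(action=np.array([1.0, 0.5, 0.2]), observed=np.array([0.8, 0.45, 0.2]))
compensated = adapter.act(np.array([1.0, 0.5, 0.2]))
```

The design choice this sketch highlights is the one Pathak emphasises: the gap between simulation and reality is treated as a signal to learn from rather than an error to suppress, which is what allows a cheap robot to keep recalibrating itself on stairs, slopes, or shifting loads.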
Learning from the human body
While simulation addresses the challenge of environmental variety, it does not solve the problem of task variety. To tackle this, Pathak turns to another vast, untapped resource: human videos.
The principle is simple yet powerful. Human actions, captured in countless hours of online video, encode rich, varied skills and affordances. By analysing how human hands, wrists, and bodies interact with objects, robots can infer where and how objects should be grasped, moved, or manipulated. This does not require labour-intensive annotation or teleoperation but simply observation.
“What we extract is not imitation, but affordance,” Pathak continues. “By watching how a human moves a fridge door or picks up a pan, the robot learns the critical interaction points and motion patterns necessary to replicate the task.” Importantly, this learning is not passive. Robots use human videos to bootstrap their knowledge and then practice autonomously in simulation or the real world to refine their skills. “Watching is not enough,” Pathak notes. “Practice, adaptation, and exploration are essential.”
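As a rough illustration of what affordance extraction from video might look like, the sketch below distils two cues from per-frame hand and object detections: candidate contact points (fingertips that land inside the object’s bounding box) and a coarse motion pattern (the mean hand position over time). The detector interface and data structures are assumptions made for the example, not the pipeline Pathak describes.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class FrameDetection:
    """Per-frame output of a hand/object detector (assumed interface)."""
    hand_keypoints: List[Tuple[float, float]]      # fingertip positions in image space
    object_box: Tuple[float, float, float, float]  # x_min, y_min, x_max, y_max

def extract_affordance(frames: List[FrameDetection]) -> dict:
    """Distil contact points and a wrist-level motion pattern from a clip.

    Illustrative sketch only: a real pipeline would lift these cues into 3D
    and associate them with object categories before the robot practises.
    """
    contact_points, trajectory = [], []
    for frame in frames:
        x0, y0, x1, y1 = frame.object_box
        for (x, y) in frame.hand_keypoints:
            # Fingertips that fall inside the object box suggest a grasp region.
            if x0 <= x <= x1 and y0 <= y <= y1:
                contact_points.append((x, y))
        if frame.hand_keypoints:
            # Track the mean hand position as a coarse motion pattern.
            xs, ys = zip(*frame.hand_keypoints)
            trajectory.append((sum(xs) / len(xs), sum(ys) / len(ys)))
    return {"contact_points": contact_points, "motion_pattern": trajectory}
```

In practice, cues like these would only seed the robot’s knowledge; as Pathak stresses, autonomous practice in simulation or the real world is what turns a raw affordance into a reliable skill.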
This combination of human guidance and robot practice accelerates training dramatically. In one study, Pathak’s team enabled robots to master thirty complex tasks within days, a level of productivity that traditional methods could not match.
Towards a general brain for robotics
Underlying Pathak’s vision is a deeper ambition: to create a single, unified brain that can control many robots across many tasks and environments. Unlike specialised models trained for narrow domains, this brain would integrate data from diverse simulations, human videos, and real-world practice, and continuously adapt as it encounters new situations.
“What we aim for is not just a vision brain or a language brain, but an action brain,” Pathak says. “A system that can decide and execute complex behaviours across any morphology, quadrupeds, humanoids, or arms, and in any context.”
Achieving this requires scale. Skild AI focuses on industrialising the pipeline: using cloud infrastructure to train at a scale orders of magnitude beyond what academic labs can manage, curating vast datasets from simulation and online sources, and designing architectures optimised for online adaptation.
The results, even at this early stage, are striking. Humanoids trained on these principles can perform complex household tasks, resist physical disturbances, and adapt seamlessly to new environments without requiring extensive reprogramming. Quadrupeds and robotic arms exhibit robust performance under challenging conditions that would defeat conventional systems.
The long game of embodied AI
Despite these advances, Pathak is realistic about the road ahead. “Robotics is a long game,” he says. “It is not about flashy demos, but about building systems that work reliably in the real world.”
The principles he outlines, embracing adaptation over invariance, scaling through simulation and video, and learning affordances rather than fixed instructions, represent a pragmatic path forward. They sidestep the bottlenecks that have held robotics back for decades, offering a way to break free of the chicken-and-egg dilemma and move towards true generalisation.
Challenges remain, from the computational cost of processing video data at scale to the need for continual refinement of adaptation mechanisms. Yet the foundation is solid, and the trajectory is clear.
If robotics is to fulfil its original promise as the physical embodiment of AI, it will not come from waiting for perfect data or perfect simulations. It will come from systems that learn, adapt, and improve in the messy, unpredictable world as it really is.
As Pathak concludes, “The hard part of robotics is robotics itself. But by embracing that hardness, by designing systems that adapt and scale, we can finally make embodied AI a reality.”




