Video prediction as a foundation for robotics: NVIDIA’s new Cosmos Policy

Cosmos Policy is NVIDIA’s adaptation of an existing video prediction system for use in robotics. The model was first trained on vast collections of video, enough to absorb patterns of movement, contact, and physical change over time. Rather than building a robot controller from the ground up, researchers fine-tuned this model on recorded robot demonstrations. The resulting system produces robot actions while also forming expectations about what may follow those actions, and it assigns a basic measure of how favourable an outcome appears. These elements are generated together, not by separate components. The approach relies on prior visual learning rather than explicit rules. It reflects a shift toward treating robot control as a problem of prediction rather than instruction, with fewer assumptions built in from the start.
Video models absorb patterns simply by watching. Objects shift, collide, slow down, or fall, and those patterns repeat. Cosmos Policy leans on this background knowledge rather than formal rules. The robot behaves as though it has seen similar situations before. Learning becomes quicker, and the system avoids tightly engineered control code that often struggles outside narrow conditions.

At the centre of the approach is a reframing. Robot control is handled as another kind of video prediction. Actions, internal state, and future rewards are folded into the same representation the model already uses for video. The underlying structure stays intact. Nothing new is bolted on.
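A minimal sketch of this reframing, under the assumption of a simple quantization scheme; every name and number here is illustrative, not the actual Cosmos Policy interface. The point is the data layout: one control step becomes the same kind of flat discrete sequence the video model already predicts over.

```python
import numpy as np

def serialize_step(frame_tokens, state, action, reward):
    """Fold one timestep into a single flat token sequence.

    frame_tokens : discrete tokens from a video tokenizer (ints)
    state, action: continuous robot signals, quantized into token ids
    reward       : scalar outcome signal, quantized the same way
    (Illustrative layout only, not the real Cosmos Policy format.)
    """
    def quantize(x, n_bins=256, lo=-1.0, hi=1.0):
        # Map continuous values into the discrete vocabulary the
        # sequence model expects, mirroring how pixels become tokens.
        x = np.clip(np.asarray(x, dtype=float), lo, hi)
        return np.floor((x - lo) / (hi - lo) * (n_bins - 1)).astype(int)

    return np.concatenate([
        np.asarray(frame_tokens, dtype=int),  # what the camera saw
        quantize(state),                      # joint angles, gripper, ...
        quantize(action),                     # commanded motion
        quantize([reward]),                   # scalar favourability
    ])

seq = serialize_step(frame_tokens=[12, 7, 99], state=[0.1, -0.5],
                     action=[0.3], reward=0.8)
print(seq.shape)  # one flat sequence per step: (7,)
```

Because every channel ends up in one vocabulary and one sequence, the pretrained video model can consume it without architectural changes, which is what "nothing new is bolted on" amounts to.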
Predicting actions, outcomes and value together
Each step produces several things at once. The model suggests what the robot should do next. It also forms a picture of how the scene might look after that action. Alongside this, it generates a value that loosely reflects whether the result is favourable. All of this comes from a single pass through the model, not a chain of separate systems.

In its simplest use, the robot follows the actions it predicts directly. That alone works well in many cases. There is also a more involved option where the robot considers multiple imagined futures and chooses between them. This tends to help but costs more computation. Much of the focus remains on the simpler mode.
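The two modes can be illustrated with a toy stand-in for the policy. Everything below is invented for illustration: the real model is a large video transformer, while `policy_step` here is a trivial function that merely mimics the single-pass interface of returning action, imagined outcome, and value together.

```python
import numpy as np

rng = np.random.default_rng(0)

def policy_step(obs, noise=0.0):
    # Toy stand-in: one call yields all three outputs jointly,
    # as described above. The dynamics here are made up.
    action = np.tanh(obs.mean() + noise)   # proposed command
    next_obs = obs + 0.1 * action          # imagined next observation
    value = -np.abs(next_obs).mean()       # crude favourability score
    return action, next_obs, value

obs = rng.normal(size=4)

# Simple mode: act directly on the first prediction.
action, _, _ = policy_step(obs)

# Planning mode: sample several imagined futures and keep the one
# the model itself scores highest. More compute, often a bit better.
candidates = [policy_step(obs, noise=rng.normal(0.0, 0.3))
              for _ in range(8)]
best_action, _, best_value = max(candidates, key=lambda c: c[2])
```

The design choice worth noting is that the planning mode needs no extra machinery: the same value output that comes along for free in the simple mode is reused as the selection criterion.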
Blending robot data with video input
Information that is not visual, such as joint angles or reward signals, is converted into numerical form and placed into the model alongside video frames. Internally, everything moves through the same sequence. When the model runs, these hidden representations are translated back into physical actions and value estimates.

Earlier attempts often relied on multiple training stages or separate planning modules. Cosmos Policy avoids that structure. One model fills several roles at once. That simplicity makes it easier to scale and less fragile when conditions shift.
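The translation back out can be pictured as small readout heads over the model's hidden states. A minimal sketch, with random weights and made-up dimensions standing in for trained ones:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative readout heads: hidden vectors at the action and value
# positions are mapped back to physical quantities by linear maps.
# Shapes and weights are stand-ins, not the real model's.
hidden_dim, action_dim = 16, 3
W_action = rng.normal(0.0, 0.1, size=(action_dim, hidden_dim))
W_value = rng.normal(0.0, 0.1, size=(hidden_dim,))

def decode(hidden_action, hidden_value):
    action = np.tanh(W_action @ hidden_action)  # bounded joint commands
    value = float(W_value @ hidden_value)       # scalar favourability
    return action, value

h_a = rng.normal(size=hidden_dim)
h_v = rng.normal(size=hidden_dim)
action, value = decode(h_a, h_v)
```

In this framing the heavy lifting stays inside the shared sequence model; the heads only convert its internal representation back into the robot's native quantities.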
Performance across simulations and real robots
The system holds up across simulated tests and physical robot tasks. Even without explicit planning, it matches or exceeds existing methods in several settings. This suggests that large video models can cross into robotics without heavy redesign.

Cosmos Policy does not claim a final solution. It shows something narrower: when a robot learns to anticipate what comes next, control begins to follow naturally. The model imagines first, then acts.
