
WFM (World Foundation Model) at a Glance

Definition

A World Foundation Model (WFM) is a large-scale model that represents, predicts, and generates real-world states and their changes as sequences, from diverse inputs such as text, images, and video.


In Physical-AI domains such as robotics and autonomous driving, WFMs are used for synthetic data generation and environment prediction via physical simulation (e.g., NVIDIA Cosmos). This line of work represents the “World-as-Physics” direction, incorporating real sensors, robot actions, and autonomous-driving environments.


Meanwhile, DeepMind’s Genie 3 and OpenAI’s Sora 2 learn not only the physical rules of the world but also visual causality, cognitive patterns, and linguistic context: the “World-as-Perception / Understanding” family of WFM.

These models simulate environmental changes from text or video inputs alone and train agents to interpret situations cognitively.


Meta V-JEPA 2 is a self-supervised video model that masks parts of frames and predicts future scenes. By inferring spatial and temporal patterns, it improves both scene understanding and action planning—representing the “World-as-Prediction / Understanding” direction within WFM.
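To make the masked-prediction idea concrete, here is a minimal, illustrative sketch in the spirit of the JEPA family. It is not Meta’s actual V-JEPA 2 architecture; all class names, layer sizes, and shapes are invented for illustration, and it assumes PyTorch is installed.

```python
# Minimal JEPA-style sketch: predict the latent representation of masked /
# future video patches from visible ones. Illustrative only, not V-JEPA 2.
import torch
import torch.nn as nn

class TinyJEPA(nn.Module):
    def __init__(self, patch_dim=768, latent_dim=256):
        super().__init__()
        mlp = lambda: nn.Sequential(
            nn.Linear(patch_dim, latent_dim), nn.GELU(),
            nn.Linear(latent_dim, latent_dim))
        self.context_encoder = mlp()   # encodes the visible patches
        self.target_encoder = mlp()    # encodes the masked/future patches
        self.predictor = nn.Sequential(
            nn.Linear(latent_dim, latent_dim), nn.GELU(),
            nn.Linear(latent_dim, latent_dim))
        # In JEPA-style training the target encoder is updated by EMA of the
        # context encoder, not by gradients (EMA update omitted for brevity).
        for p in self.target_encoder.parameters():
            p.requires_grad = False

    def forward(self, visible_patches, masked_patches):
        ctx = self.context_encoder(visible_patches)
        pred = self.predictor(ctx)           # predict masked latents from context
        with torch.no_grad():
            tgt = self.target_encoder(masked_patches)
        # The loss lives in latent space: no pixel reconstruction needed.
        return nn.functional.mse_loss(pred, tgt)

# Toy usage: batch of 8 clips, 16 patch tokens each, 768-dim patch embeddings.
model = TinyJEPA()
loss = model(torch.randn(8, 16, 768), torch.randn(8, 16, 768))
loss.backward()
```

Predicting in latent space rather than pixel space is what ties this family to “understanding”: the model is rewarded for capturing spatial and temporal structure, not for reproducing textures.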

Broader Meaning of WFM (Precise Concept)

| Category | Description |
| --- | --- |
| Core Concept | A WFM is a “general-purpose model that understands and predicts the world as a unit,” learning structure at the visual, linguistic, and cognitive (world-understanding) levels, not only the physical environment. |
| Primary Goal | To simulate, predict, and reconstruct world states so AI can perform contextual world modeling. |
| Applications | Extends beyond robotics/autonomy to vision-language models, AI agents (reasoning), and generative world creation (GenAI). |
| Technical Base | Vision Transformers, Diffusion, Video Generation, World-Model Learning (Dreamer, PlaNet, Genie family), etc. |

Why It Matters

  • Collecting real-world data is costly and risky.
  • We need to rapidly generate and test diverse settings in virtual spaces.
  • This allows robots and AV models to be trained sufficiently in simulation, reducing risk and cost in Sim-to-Real transfer.

Core Capabilities

  • Scene / World Understanding & Prediction (video level) — future-frame prediction from past frames, learning video causality (e.g., V-JEPA family)
  • World / Environment Generation (text → interactive world / video) — generate environments from text prompts (e.g., Genie 3)
  • Synthetic Data Generation & Supply — large-scale data pipelines for robotics / autonomy / vision systems
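As a rough illustration of the rollout pattern behind these capabilities, the sketch below autoregressively rolls a stub world model forward to mass-produce synthetic clips from one real seed clip. `WorldModel` and `generate_clip` are hypothetical stand-ins, not a real Cosmos or Genie API.

```python
# Illustrative sketch: using a WFM-style rollout to mass-produce synthetic
# training clips. The WorldModel class is a hypothetical placeholder.
import numpy as np

class WorldModel:
    """Stand-in for a pretrained WFM: predicts the next frame from past frames."""
    def predict_next(self, frames: np.ndarray) -> np.ndarray:
        # A real WFM runs a learned video model here; we just add noise as a stub.
        return frames[-1] + np.random.normal(0, 0.01, frames[-1].shape)

def generate_clip(wfm: WorldModel, seed_frames: np.ndarray, horizon: int) -> np.ndarray:
    """Autoregressively roll the world model forward to synthesize a clip."""
    frames = list(seed_frames)
    for _ in range(horizon):
        frames.append(wfm.predict_next(np.stack(frames)))
    return np.stack(frames)

# Produce 100 synthetic clips from one seed clip (4 frames of 64x64 RGB).
wfm = WorldModel()
seed = np.random.rand(4, 64, 64, 3)
dataset = [generate_clip(wfm, seed, horizon=16) for _ in range(100)]
```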

Representative Models / Platforms (as of 2025)

| Type | Name | Traits / Description | Link |
| --- | --- | --- | --- |
| Platform / Integrated WFM | NVIDIA Cosmos | WFM platform for Physical-AI with synthetic data generation and world prediction. Integrates with Omniverse to support data generation for robotics, autonomy, and simulation. | NVIDIA Newsroom |
| World generation | DeepMind Genie 3 | Generates interactive worlds in real time from a single line of text. A flagship generative approach to world models. | DeepMind Blog |
| World generation | OpenAI Sora 2 | Produces high-resolution, physically consistent videos from text prompts. | OpenAI Sora |
| Prediction / Understanding | Meta V-JEPA 2 | Self-supervised video model for scene prediction and planning; world-understanding oriented. | Meta AI Blog |

Side-by-Side Snapshot (2025, official benchmarks)

[Figure: side-by-side benchmark snapshot of the four models above]

Scope and Role Separation


A WFM is more than a video generator.


It integrates generation + understanding + prediction + synthetic-data provision. This definition is also reflected in NVIDIA’s Cosmos documentation.

Reference: NVIDIA Announces Major Release of Cosmos World Foundation Models — NVIDIA Newsroom

WFM and RFM (Robot Foundation Model) have distinct roles:

  • WFM: builds a digital twin of the environment/world, enabling simulation.
  • RFM: learns robot policy and control to act within that environment.

This role split is stated in the Cosmos paper as a “world model + policy model” architecture.

[Figure: basic WFM structure, from the Cosmos paper]

The figure above (from the Cosmos paper) presents the basic WFM structure: given past observations $x_{0:t}$ and the robot’s control inputs $c_t$, it predicts the next world state $\hat{x}_{t+1}$.

This visually clarifies the relationship: WFM predicts the environment; RFM decides actions.

(Reference paper: Cosmos World Foundation Model Platform for Physical AI — arXiv)

This separation is useful to emphasize WFM’s generation/prediction role vs. RFM’s control role.
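A minimal sketch of this division of labor, with invented stand-in classes rather than the Cosmos paper’s actual code: the WFM predicts $\hat{x}_{t+1}$ from $x_{0:t}$ and $c_t$, while the policy (the RFM role) chooses $c_t$.

```python
# Illustrative "world model + policy model" split. Both classes are stubs
# invented for this sketch, not code from the Cosmos paper.
from typing import List
import random

class WorldModel:
    """WFM role: given past observations x_{0:t} and a control input c_t,
    predict the next world state x_{t+1}."""
    def predict(self, observations: List[float], control: float) -> float:
        # Stub dynamics; a real WFM runs a learned video/state model here.
        return observations[-1] + control + random.gauss(0, 0.01)

class PolicyModel:
    """RFM role: given past observations, decide the next control input c_t."""
    def act(self, observations: List[float]) -> float:
        return 1.0 - observations[-1]  # stub: steer the state toward 1.0

# Closed-loop simulation: the policy acts, the world model predicts the outcome.
wfm, rfm = WorldModel(), PolicyModel()
obs = [0.0]
for t in range(5):
    c_t = rfm.act(obs)                 # RFM decides the action
    obs.append(wfm.predict(obs, c_t))  # WFM predicts the next world state
print(obs)
```

The key design point is the interface: the policy never touches the real environment during this loop, so it can be trained and evaluated entirely inside the world model before any Sim-to-Real transfer.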

Connection to Omniverse / NVIDIA Ecosystem


WFM—particularly NVIDIA Cosmos—ships with Omniverse libraries and supports:

  • Real-world capture & reconstruction → digital twin creation
  • Large-scale synthetic data generation
  • Robot simulation and AI agent training environments

(Article: NVIDIA Opens Portals to World of Robotics With New Omniverse Libraries and Cosmos Physical AI Models — NVIDIA Newsroom)


Omniverse Blueprints are also reported to connect with Cosmos WFM, enabling robot-ready facilities and mass synthetic data generation.


(Article: NVIDIA Omniverse Physical AI Operating System Expands to More Industries and Partners)

Global WFM Research Trends (Beyond NVIDIA Cosmos)

WFM is not a single company’s technology but a global research trend toward AI that “understands, predicts, and generates the world.” Beyond physics-centric WFM, the field expands to cognitive, visual, and language-based world modeling.

1) World-as-Generation (world creation-centric)

| Model | Organization | Core Idea | Representative Uses |
| --- | --- | --- | --- |
| DeepMind Genie 3 | Google DeepMind | Real-time interactive world generation at ~1080p/30fps from a single text line; self-supervised learning of visual rules and interactions. | Virtual environment simulation, video-based agent training |
| OpenAI Sora 2 | OpenAI | Physically consistent scene/sequence video generation from text; learns the visual causal structure of the world. | Media generation, pretraining for vision models, environment synthesis |

2) World-as-Perception / Understanding (world understanding-centric)

| Model | Organization | Core Idea | Representative Uses |
| --- | --- | --- | --- |
| Meta V-JEPA 2 | Meta AI | Self-supervised prediction of future frames with masked video; internalizes spatial/temporal causality. | Robot vision, action planning, predictive perception |
| Google VideoPoet | Google DeepMind (2025 integrated) | Multimodal world model across video, audio, and text; strengthens temporal coherence and context understanding. | Video understanding, agent prediction, narrative modeling |

3) World-as-Physics (world simulation-centric)

| Model | Organization | Core Idea | Representative Uses |
| --- | --- | --- | --- |
| NVIDIA Cosmos | NVIDIA | Physically consistent models for robotics, AV, and industrial simulation; Omniverse-based synthetic data and simulation automation. | Robot learning, physics-based Sim-to-Real, digital twins |
| PlaNet | DeepMind (+ MIT extensions) | Models environment dynamics in latent space; combined with RL to improve policy learning. | Reinforcement learning, robot control, environment modeling |

Multi-Layer Data Structure Learned by WFM

WFM spans more than visual or physical information.


In practice, WFM integrates World / Automation / Robot / Asset / Analysis Data, evolving into an Integrated World Model that predicts world state and change.

These five data layers define WFM’s input domain. Cosmos, Genie, V-JEPA, etc., each focus on specific domains (physics, vision, cognition) while sharing the common goal of understanding and generating the world.

| Layer | Key Contents |
| --- | --- |
| World Data | Physical environments, visual scenes, temporal/spatial changes; the base state of the world |
| Automation Data | Processes, equipment, event sequences; procedural data from automation systems |
| Robot Data | Robot sensors, actions, control policies, behavior logs; agent experience data |
| Asset Data | Equipment/facility states, maintenance, utilization; linked to digital twins |
| Analysis Data | Integrated insights derived from the above, fed back into model training |

This layered structure forms the input foundation for WFM to understand and simulate the world, enabling AI to perform contextual world modeling.
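As one way to picture the five layers arriving as a single model input, the sketch below bundles them into one record. The schema and field names are invented for illustration; they mirror the table above and are not drawn from any vendor documentation.

```python
# Illustrative record bundling the five WFM input layers; hypothetical schema.
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class WorldModelInput:
    world_data: Dict[str, Any]       # scenes, temporal/spatial state of the world
    automation_data: List[Dict]      # process / equipment event sequences
    robot_data: List[Dict]           # sensor readings, actions, behavior logs
    asset_data: Dict[str, Any]       # facility states, maintenance, utilization
    analysis_data: Dict[str, Any] = field(default_factory=dict)  # derived insights fed back

sample = WorldModelInput(
    world_data={"scene_id": "factory_cell_7", "timestamp": 1735689600},
    automation_data=[{"event": "conveyor_start", "t": 0.0}],
    robot_data=[{"joint_pos": [0.1, 0.4], "action": "pick"}],
    asset_data={"machine_A": {"status": "running", "utilization": 0.82}},
)
```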

