D4RT: Unified, Fast 4D Scene Reconstruction & Tracking
Introducing D4RT, a unified AI model for 4D scene reconstruction and tracking across space and time.
Whenever we look at the world, our minds perform an extraordinary feat of memory and prediction. We see and understand things as they are at a given moment, as they were a moment ago, and as they will be a moment from now. Our mental model of the world maintains a continuous representation of reality, and we use this model to draw intuitive conclusions about the causal relationships between past, present, and future.
To help machines see the world as we do, we can equip them with cameras, but this only solves the input problem. To understand this input, computers must solve a complex inverse problem: take a video—a series of flat 2D projections—and recover the rich, volumetric 3D world in motion.
Today, we introduce D4RT (Dynamic 4D Reconstruction and Tracking), a new AI paradigm that unifies dynamic scene reconstruction and tracking in one powerful framework, bringing us closer to the next frontier of AI: a full understanding of our dynamic reality.
The challenge of the fourth dimension
For an AI model to understand a dynamic scene captured in 2D video, it must track every pixel of each object as it moves through the three dimensions of space and the fourth dimension of time. It must also separate this object motion from the camera's own movement, maintaining a cohesive representation even when objects move behind each other or leave the frame entirely. Traditionally, capturing this level of geometry and motion from 2D video requires intensive computational processes or a patchwork of specialized AI models — some for depth, others for motion or camera poses — leading to reconstructions that are slow and fragmented.
D4RT’s simplified architecture and new query mechanism place it at the forefront of 4D reconstruction while being up to 300 times more efficient than previous methods – fast enough for real-time applications in robotics, augmented reality, and more.
How D4RT works: A query-based approach
D4RT uses a unified encoder–decoder architecture. The encoder first processes the input video and converts it into a compressed representation of the scene's geometry and motion. Unlike legacy systems that used separate modules for different tasks, D4RT computes only what it needs via a flexible query mechanism centered on one basic question:
“Where is a specific pixel from the input video, in three-dimensional space, at an arbitrary point in time, as seen from a selected camera?”
Building on our previous work, a lightweight decoder queries this representation to answer specific instances of the question at hand. Because the queries are independent, they can be processed in parallel on modern AI hardware. This makes D4RT extremely fast and scalable, whether tracking just a few points or reconstructing the entire scene.
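To make the query mechanism concrete, here is a toy sketch of the interface described above. None of this code comes from the D4RT release: `encode_video`, `decode_queries`, the query layout `(u, v, t_source, t_target, camera_id)`, and all weights are hypothetical stand-ins, chosen only to show how a shared latent plus independent per-query decoding allows queries to be batched and processed in parallel.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 64

# Fixed random matrices standing in for learned decoder parameters.
W_LAT = rng.standard_normal((LATENT_DIM, 3))
W_QRY = rng.standard_normal((5, 3))

def encode_video(video: np.ndarray) -> np.ndarray:
    """Stand-in encoder: video (T, H, W, 3) -> compressed latent (LATENT_DIM,).

    A real encoder is a learned network; here we just pool and project."""
    pooled = video.mean(axis=(0, 1, 2))  # (3,)
    proj = np.linspace(0.0, 1.0, 3 * LATENT_DIM).reshape(3, LATENT_DIM)
    return pooled @ proj

def decode_queries(latent: np.ndarray, queries: np.ndarray) -> np.ndarray:
    """Answer a batch of queries against the shared latent in one shot.

    queries: (N, 5) rows of (u, v, t_source, t_target, camera_id).
    returns: (N, 3), one 3D point per query.
    Each output row depends only on the latent and its own query row,
    so all rows can be computed independently and in parallel."""
    base = latent @ W_LAT                       # (3,) shared scene term
    return base[None, :] + np.tanh(queries @ W_QRY)

video = rng.random((8, 32, 32, 3))              # tiny synthetic clip
latent = encode_video(video)
queries = np.array([
    [10.0, 12.0, 0.0, 3.0, 0.0],  # pixel (10,12) of frame 0, at time 3, camera 0
    [5.0, 20.0, 1.0, 7.0, 1.0],   # pixel (5,20) of frame 1, at time 7, camera 1
])
points = decode_queries(latent, queries)

# Independence check: decoding queries one at a time matches the batch.
singles = np.stack([decode_queries(latent, q[None, :])[0] for q in queries])
assert np.allclose(points, singles)
print(points.shape)  # (2, 3)
```

The independence check at the end is the point of the sketch: because no query reads another query's result, a batch of a few points or a dense grid covering the whole frame can be dispatched in a single parallel pass.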
2026-01-16 10:39:00


