October 31, 2024

Oasis: A Universe in a Transformer

Powered by our Oasis video model, your actions shape the environment in real time.

We're excited to announce Oasis, the first playable, realtime, open-world AI model — it's an interactive video game, but generated end-to-end by a transformer on a frame-by-frame basis. Oasis takes in user keyboard and mouse input and generates real-time gameplay, internally simulating physics, game rules, and graphics. The model learned to let users move around, jump, pick up items, break blocks, and more, all by watching gameplay directly. We view Oasis as the first step in our research towards foundation models that simulate more complex interactive worlds, replacing the classic game engine for a future driven by AI.

Achieving Oasis required two fundamental advances: improvements in model architecture that let the model capture the entire world and simulate it, and breakthroughs in inference technology that let users interact with the model in real time with minimal latency. For the former, we adopt the emerging state-of-the-art approach of diffusion training combined with transformer models [1, 2], inspired by advances in large language models (LLMs), to train an autoregressive model that generates video frame by frame, conditioned on the user's actions at that instant. For the latter, we currently use Decart's proprietary inference framework, built to extract peak utilization from NVIDIA H100 Tensor Core GPUs for transformer workloads, and we have also built the model to support Etched's upcoming Sohu chip.
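To make the action-conditioned, frame-by-frame generation concrete, here is a minimal sketch of what such an interactive loop could look like. This is an illustration, not Decart's actual code: the `model` signature, `num_denoise_steps`, and tensor shapes are assumptions, and a real diffusion sampler involves more bookkeeping (noise schedules, re-noising between steps).

```python
import torch

@torch.no_grad()
def play_step(model, history, action, num_denoise_steps=4):
    """Hypothetical realtime step: denoise the next frame from pure noise,
    conditioned on the recent frame history and the user's current input.

    history: [1, T, C, H, W] latent frames generated so far
    action:  encoded keyboard/mouse state for this instant
    """
    frame = torch.randn_like(history[:, -1:])            # start the next frame from noise
    for step in reversed(range(num_denoise_steps)):
        t = torch.full((1,), step, device=frame.device)  # current denoising level
        # The model predicts a cleaner frame given past frames and the action.
        frame = model(noisy_frame=frame, context=history, action=action, t=t)
    return frame                                          # append to history, decode with the VAE
```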

We're releasing Oasis's code, the weights of a model you can run locally, and a live playable demo of a larger checkpoint. Today, using Decart's proprietary inference platform, we show that real-time transformer-based video is possible and can be streamed across the web for live gameplay. When Etched's transformer ASIC, Sohu, is released, we could run models like Oasis in 4K resolution. Together, we believe fast transformer inference is the missing link to making high-quality, affordable generative real-time video a new fundamental interface.

While Oasis is an impressive technical demo, we believe this research is only the beginning of a journey toward more complex foundation models that enable real-time human-AI interaction on a new level. This could revolutionize a wide variety of experiences by providing an interactive video interface that puts control in the hands of the user. Imagine an integration so tight that foundation models augment modern entertainment platforms by generating content on the fly according to user preferences, or a gaming experience that opens new possibilities for user interaction, such as textual and audio prompts guiding the gameplay (e.g., "imagine that there is a pink elephant chasing me down").

Gameplay Results


Architecture

Diffusion models have rapidly become the state of the art in generative image and video modeling. In essence, a diffusion model learns to reverse an iterative process that adds Gaussian noise to the input, which lets it generate new samples starting from noise. The approach can be extended to video generation by adding temporal layers to the model architecture that receive a context consisting of the previously generated frames (e.g., in an autoregressive fashion).
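For intuition, here is a minimal, deliberately simplified PyTorch sketch of the standard denoising training objective described above. It is not the Oasis training code: the `denoiser` network and the epsilon-prediction formulation are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def diffusion_loss(denoiser, x0, alphas_cumprod):
    """One training step of epsilon-prediction diffusion.

    x0:             clean latent frames, shape [B, ...]
    alphas_cumprod: 1-D tensor, cumulative noise schedule
    """
    b = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x0.device)  # random noise level per sample
    a = alphas_cumprod[t].view(b, *([1] * (x0.dim() - 1)))
    noise = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise   # forward process: corrupt x0 with Gaussian noise
    pred = denoiser(x_t, t)                        # the network learns to predict the added noise
    return F.mse_loss(pred, noise)                 # reversing this process generates new samples
```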

Recent work has demonstrated that generalizing transformer architectures beyond large language models (LLMs) can lead to state-of-the-art results in other fields, such as generative diffusion models [1, 2]. We therefore use a transformer for the noise-prediction step of diffusion training (using Diffusion Forcing [6]), modifying the architecture to interleave temporal attention layers between the spatial attention layers so that each frame receives context from previous frames. The diffusion is performed in a latent space produced by a ViT VAE [1], which compresses the image and lets the diffusion focus on higher-level characteristics. In contrast to bidirectional models such as Sora [5], Oasis generates frames autoregressively, with the ability to condition each frame on game input. This enables users to interact with the world in real time rather than only rendering videos retroactively.
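As a schematic of the interleaving described above (not the released implementation), a transformer block might alternate attention over the spatial tokens within each frame with causal attention over the same token position across frames. The class name, head count, and block layout here are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    """Illustrative DiT-style block: attention over the spatial tokens of each
    frame, then causal attention over the same token position across frames."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                              # x: [batch, frames, tokens, dim]
        b, t, n, d = x.shape
        s = x.reshape(b * t, n, d)                     # spatial attention within each frame
        q = self.norm1(s)
        s = s + self.spatial_attn(q, q, q)[0]
        v = s.reshape(b, t, n, d).transpose(1, 2).reshape(b * n, t, d)
        # causal mask: each frame attends only to itself and earlier frames
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        q = self.norm2(v)
        v = v + self.temporal_attn(q, q, q, attn_mask=mask)[0]
        x = v.reshape(b, n, t, d).transpose(1, 2)      # back to [batch, frames, tokens, dim]
        return x + self.mlp(self.norm3(x))
```

In the actual model, such blocks would additionally be conditioned on the diffusion timestep and the encoded user action; that conditioning is omitted here for brevity.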

Performance

Providing real-time inference for such large transformer-based diffusion models is a daunting task that requires massive systems-level optimization to fully exploit the underlying characteristics of the GPUs and servers. Current state-of-the-art text-to-video models with a similar DiT [2] architecture (e.g., Sora [5], Mochi 1 [7], and Runway [8]) can take 10-20 seconds to create just one second of video, even on multiple GPUs. While traditional LLM inference has produced a plethora of open-source kernels and techniques, we found that most of these public techniques were less relevant to our target model architecture and led to either low utilization of the underlying GPU architecture or redundant operations. We therefore relied on a proprietary optimization infrastructure developed by Decart over the past year and applied it in a widespread effort to accelerate all of the operations used in model inference, from basic PyTorch primitives, which we accelerated drastically, to more advanced combinations of operations. In addition to this extensive effort to optimize GPU utilization in the kernels on the critical path of image-generation latency, we also use optimized communication primitives developed by Decart to best exploit the server architecture beyond the GPUs (e.g., NVIDIA NVLink, PCIe Gen 5, NUMA) to further reduce latency. These strategies let us scale beyond single-GPU inference to real-time multi-GPU inference while minimizing the bottlenecks that can arise once communication goes beyond intra-GPU data transfer.
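To ground the latency numbers below: at 20 frames per second the entire pipeline has a budget of 50 ms per frame, so per-frame GPU time has to be measured precisely. A generic way to do this in PyTorch uses CUDA events; this is a standard benchmarking sketch, not Decart's tooling, and `model`/`inputs` are placeholders.

```python
import torch

def time_frame(model, inputs, warmup=10, iters=100):
    """Measure average per-frame GPU latency in milliseconds using CUDA events."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with torch.no_grad():
        for _ in range(warmup):          # warm up kernels and any autotuning
            model(**inputs)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            model(**inputs)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / iters   # must stay under 50 ms for 20 fps
```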

Overall, this extensive optimization effort by Decart was crucial to achieving real-time inference for diffusion-transformer models capable of modeling more advanced mechanisms than previous models. It culminates in 47 ms of inference time per frame and only 150 ms per iteration for training! However, to make the model another order of magnitude faster, and cost-efficient to run at scale, new hardware is needed. Oasis is optimized for Sohu, the upcoming transformer ASIC by Etched. On NVIDIA H100s today, the model can run at 360p at 20 fps; Sohu can run the same model at up to 4K. In addition, Oasis's end-to-end transformer architecture makes it extremely efficient on Sohu: at the same price and power consumption as an H100 GPU, Oasis on Sohu can serve 10x more users. We believe the price of serving models like Oasis is the hidden bottleneck to releasing generative video in production. See more performance figures and read more about Oasis and Sohu on Etched's blog.

Future Explorations

With these many exciting new results, a few aspects of the model can still be improved:

— In certain situations the model produces hazy outputs before recovering

— The model's memory is limited, so it can struggle to recall details from many frames back

— Providing an initial image outside the model's training distribution can lead to unclear results

Following an in-depth sensitivity analysis across different architecture configurations, data, and model sizes, we hypothesize that most of these issues can be addressed by scaling the model and the datasets. We are therefore pursuing this direction, alongside additional optimization techniques, to enable such large-scale training efficiently. Further, once these larger models are developed, new breakthroughs in inference technology will be required to maintain a sustainable latency and cost trade-off.

References

[1]: Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

[2]: Peebles et al., Scalable Diffusion Models with Transformers

[3]: Valevski et al., Diffusion Models Are Real-Time Game Engines

[4]: Alonso et al., Diffusion for World Modeling: Visual Details Matter in Atari

[5]: OpenAI, Video generation models as world simulators

[6]: Chen et al., Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion

[7]: Genmo, Mochi 1: A new SOTA in open-source video generation models

[8]: Runway, Introducing Gen-3 Alpha: A New Frontier for Video Generation
