Endora: Video Generation Models as Endoscopy Simulators

MICCAI 2024


Chenxin Li1*, Hengyu Liu1*, Yifan Liu1*, Brandon Y. Feng2,
Wuyang Li1, Xinyu Liu1, Zhen Chen3, Jing Shao4, Yixuan Yuan1

1The Chinese University of Hong Kong    2Massachusetts Institute of Technology    
3Centre for Artificial Intelligence and Robotics of Hong Kong      4Shanghai Artificial Intelligence Laboratory

Abstract


TL;DR: Endora enables high-fidelity medical video generation of endoscopy scenes and demonstrates its versatility through successful applications in video-based disease diagnosis and 3D surgical scene reconstruction.

Generative models hold promise for revolutionizing medical education, robot-assisted surgery, and data augmentation for machine learning. Despite progress in generating 2D medical images, the complex domain of clinical video generation has largely remained untapped.

This paper introduces Endora, an innovative approach to generate medical videos to simulate clinical endoscopy scenes. We present a novel generative model design that integrates a meticulously crafted spatial-temporal video transformer with advanced 2D vision foundation model priors, explicitly modeling spatial-temporal dynamics during video generation.
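As a rough illustration of the spatial-temporal transformer idea (not Endora's actual implementation; the class and module names below are hypothetical and PyTorch is assumed), one block might interleave per-frame spatial attention with per-location temporal attention:

# Minimal sketch only: a generic video-transformer block that alternates
# spatial attention (tokens within a frame) and temporal attention (the same
# spatial token across frames). Not the architecture released with Endora.
import torch
import torch.nn as nn

class SpatialTemporalBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens_per_frame, dim)
        b, t, n, d = x.shape
        # Spatial attention: tokens within each frame attend to each other.
        xs = x.reshape(b * t, n, d)
        xs = xs + self.spatial_attn(self.norm1(xs), self.norm1(xs), self.norm1(xs))[0]
        x = xs.reshape(b, t, n, d)
        # Temporal attention: each spatial token attends across frames.
        xt = x.permute(0, 2, 1, 3).reshape(b * n, t, d)
        xt = xt + self.temporal_attn(self.norm2(xt), self.norm2(xt), self.norm2(xt))[0]
        x = xt.reshape(b, n, t, d).permute(0, 2, 1, 3)
        # Per-token feed-forward network.
        return x + self.mlp(self.norm3(x))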

We also pioneer the first public benchmark for endoscopy simulation with video generation models, adapting existing state-of-the-art methods for this endeavor. Endora demonstrates exceptional visual quality in generating endoscopy videos, surpassing state-of-the-art methods in extensive testing.

Moreover, we explore how this endoscopy simulator can empower downstream video analysis tasks and even effectively create 3D scenes with multi-view consistency.

In a nutshell, Endora marks a notable breakthrough in deploying generative AI for clinical endoscopy research, setting the stage for further advances in medical content generation.


Results on Colonoscopic Dataset



Results on Kvasir-Capsule Dataset



Results on CholecTriplet Dataset



Endora Generates Videos as an Efficient 3D Creator (Concurrent with SV3D)



We train a Gaussian Splatting representation on videos sampled from Endora and observe multi-view-consistent geometry (shown by the rendered RGB and depth maps), as if the scenes existed in the real 3D world.
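As a minimal sketch of how such a fit could look (not the exact pipeline used here), the following assumes a Gaussian model and differentiable rasterizer from an existing 3D Gaussian Splatting codebase, passed in as the hypothetical `gaussians` and `render_fn`, plus per-frame camera parameters, and optimizes the standard L1 / D-SSIM photometric objective on frames extracted from a generated video:

# Minimal sketch only: fit a 3D Gaussian Splatting representation to frames
# sampled from an Endora-generated video. `gaussians` (an nn.Module-like
# Gaussian model), `render_fn` (a differentiable rasterizer), and `ssim_fn`
# are assumed to come from an existing 3DGS codebase; names are hypothetical.
import torch

def train_gaussians(gaussians, render_fn, frames, cameras,
                    iters=7000, lambda_dssim=0.2, ssim_fn=None):
    # frames: list of (H, W, 3) tensors extracted from the generated video.
    # cameras: per-frame camera parameters (e.g. estimated with COLMAP).
    optimizer = torch.optim.Adam(gaussians.parameters(), lr=1.6e-4)
    for step in range(iters):
        idx = step % len(frames)
        rendered = render_fn(gaussians, cameras[idx])  # differentiable rasterization
        # Standard 3DGS photometric objective: L1 plus an optional D-SSIM term.
        loss = (rendered - frames[idx]).abs().mean()
        if ssim_fn is not None:
            loss = (1 - lambda_dssim) * loss + lambda_dssim * (1 - ssim_fn(rendered, frames[idx]))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return gaussians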


Citation


@article{li2024endora,
  author    = {Chenxin Li and Hengyu Liu and Yifan Liu and Brandon Y. Feng and Wuyang Li and Xinyu Liu and Zhen Chen and Jing Shao and Yixuan Yuan},
  title     = {Endora: Video Generation Models as Endoscopy Simulators},
  journal   = {arXiv preprint arXiv:2403.11050},
  year      = {2024}
}
              

Relevant Works


Sora: Video generation models as world simulators

A milestone in creating realistic and imaginative scenes from text instructions

[Page] | [Technical Report]


SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion

An innovative framework that takes an image as input and generates novel multi-view images and 3D models

[Page] | [Paper]


EndoGaussian: Real-time Gaussian Splatting for Dynamic Endoscopic Scene Reconstruction

An initial exploration into real-time surgical scene reconstruction built on 3D Gaussian Splatting

[Page] | [Paper] | [Code]