KFD-NeRF: Rethinking Dynamic NeRF with Kalman Filter.- Physical-Based Event Camera Simulator.- V-IRL: Grounding Virtual Intelligence in Real Life.- Adversarial Prompt Tuning for Vision-Language Models.- Relightable 3D Gaussians: Realistic Point Cloud Relighting with BRDF Decomposition and Ray Tracing.- Mono-ViFI: A Unified Learning Framework for Self-supervised Single- and Multi-frame Monocular Depth Estimation.- CC-SAM: Enhancing SAM with Cross-feature Attention and Context for Ultrasound Image Segmentation.- An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding.- Think2Drive: Efficient Reinforcement Learning by Thinking with Latent World Model for Autonomous Driving (in CARLA-v2).- PanGu-Draw: Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion.- X-InstructBLIP: A Framework for Aligning Image, 3D, Audio, Video to LLMs and its Emergent Cross-modal Reasoning.- Learning Neural Volumetric Pose Features for Camera Localization.- Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation.- REFRAME: Reflective Surface Real-Time Rendering for Mobile Devices.- Self-Training Room Layout via Geometry-aware Ray-casting.- Closed-Loop Unsupervised Representation Disentanglement with $\beta$-VAE Distillation and Diffusion Probabilistic Feedback.- Rethinking Weakly-supervised Video Temporal Grounding From a Game Perspective.- Every Pixel Has its Moments: Ultra-High-Resolution Unpaired Image-to-Image Translation via Dense Normalization.- ZoLA: Zero-Shot Creative Long Animation Generation with Short Video Model.- Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach.- Restore Anything with Masks: Leveraging Mask Image Modeling for Blind All-in-One Image Restoration.- When Fast Fourier Transform Meets Transformer for Image Restoration.- Dolphins: Multimodal Language Model for Driving.- Rethinking Video Deblurring with Wavelet-Aware Dynamic Transformer and Diffusion Model.- CamoTeacher: Dual-Rotation Consistency Learning for Semi-Supervised Camouflaged Object Detection.- Placing Objects in Context via Inpainting for Out-of-distribution Segmentation.- Textual Grounding for Open-vocabulary Visual Information Extraction in Layout-diversified Documents.