Foundation models such as ChatGPT and Sora, trained on data at enormous scale, have had a revolutionary social impact. However, for sensors in many other fields, collecting data at a comparable scale to train strong foundation models is extremely challenging. To this end, this work presents SimCMF, a simple and effective framework for studying an important problem: cross-modal fine-tuning from vision foundation models trained on natural RGB images to imaging modalities with different physical properties (e.g., polarization). In SimCMF, we conduct a thorough analysis of the basic components, starting from the most naive design, and ultimately propose a novel cross-modal alignment module to address the modality misalignment problem. We apply SimCMF to a representative vision foundation model, the Segment Anything Model (SAM), to support new imaging modalities. Given the absence of relevant benchmarks, we construct a benchmark for performance evaluation. Our experiments confirm the intriguing potential of transferring vision foundation models to enhance other sensors' performance: SimCMF improves segmentation performance (mIoU) from 22.15% to 53.88% on average across the evaluated modalities and consistently outperforms other baselines.
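As a rough illustration of the core idea only (not the paper's released module), the sketch below shows a minimal cross-modal alignment adapter: a small convolutional head that maps an arbitrary-channel modality input (e.g., 4-channel polarization) into the 3-channel RGB space expected by a pretrained encoder such as SAM's image encoder. The class name CrossModalAlignment, the hidden width, and the channel/resolution choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossModalAlignment(nn.Module):
    """Illustrative adapter: maps an N-channel modality input (e.g., 4-channel
    polarization) into the 3-channel space expected by an RGB-pretrained encoder."""
    def __init__(self, in_channels: int, out_channels: int = 3, hidden: int = 32):
        super().__init__()
        self.align = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(hidden, out_channels, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.align(x)

# Usage sketch: prepend the adapter to a pretrained vision foundation model.
adapter = CrossModalAlignment(in_channels=4)   # e.g., 4 polarization channels (assumed)
dummy = torch.randn(1, 4, 1024, 1024)          # SAM-style input resolution (assumed)
aligned = adapter(dummy)                       # (1, 3, 1024, 1024), ready for the RGB encoder
```

In a SimCMF-style setup, such an adapter would typically be trained jointly with (parameter-efficient) fine-tuning of the foundation model on the target modality.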
@misc{lei2024simcmf,
      title={SimCMF: A Simple Cross-modal Fine-tuning Strategy from Vision Foundation Models to Any Imaging Modality},
      author={Lei, Chengyang and Chen, Liyi and Cen, Jun and Chen, Xiao and Lei, Zhen and Heide, Felix and Chen, Qifeng and Zhang, Zhaoxiang},
      year={2024},
      eprint={2409.08083},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
We sincerely thank CompletionFormer for their open-source code. We also thank HAMMER for the open-source dataset.