* Corresponding author
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications; Aerospace Information Research Institute, Chinese Academy of Sciences; School of Geographic Sciences, Hunan Normal University; Department of Computer Science, City University of Hong Kong
UniRS is a vision language model (VLM) that unifies multi-temporal remote sensing tasks. The model parses three types of remote sensing inputs (single images, dual-temporal image pairs, and videos) and generates text responses based on user instructions. We adopt a modular design tailored to each task, devise an inference mechanism that fully exploits the prior knowledge of the base model, VILA-1.5, and perform joint fine-tuning on large-scale datasets, ultimately obtaining a large remote sensing vision language model with strong generalization across multi-temporal remote sensing tasks.
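The three supported input types can be distinguished purely by how many frames the user supplies. The sketch below is illustrative only: the function and task names are hypothetical and are not the actual UniRS API; it merely shows the routing logic the description above implies.

```python
# Hypothetical dispatcher: route a remote sensing input to one of the three
# tasks UniRS supports, based on the number of frames provided.
# These names are illustrative, not the real UniRS interface.

def parse_input(frames):
    """Classify a remote sensing input by its number of frames."""
    if len(frames) == 1:
        return "single-image"      # e.g. VQA on a single image
    if len(frames) == 2:
        return "change-detection"  # dual-temporal image pair
    return "video"                 # multi-frame clip

print(parse_input(["t0"]))        # single-image
print(parse_input(["t0", "t1"]))  # change-detection
```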
- [2025/1] The training code is released!
- [2024/12] Paper is on arXiv!
```bash
./environment_setup.sh
```
or follow the instructions below in order.
```bash
conda create -n unirs python=3.10 -y
conda activate unirs
pip install --upgrade pip  # enable PEP 660 support

# This step is optional if you prefer to use the system's built-in nvcc.
conda install -c nvidia cuda-toolkit -y

# Install a pre-built flash-attention wheel (CUDA 11.8, PyTorch 2.0, Python 3.10).
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.4.2/flash_attn-2.4.2+cu118torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.4.2+cu118torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

pip install -e .
pip install -e ".[train]"

pip install git+https://github.com/huggingface/[email protected]

# Patch the installed transformers package with this repo's modified files.
site_pkg_path=$(python -c 'import site; print(site.getsitepackages()[0])')
cp -rv ./llava/train/transformers_replace/* $site_pkg_path/transformers/
```
We mix three datasets for the joint training of UniRS: GeoChat-Instruct, LEVIR-CC, and ERA.
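Joint training over heterogeneous task datasets typically means interleaving their samples into one shuffled stream. A minimal sketch of that idea, assuming placeholder sample lists (the dataset names come from this README, but the sizes and helper below are hypothetical, not the actual training code):

```python
import random

# Placeholder samples standing in for the three datasets named above;
# real sizes and record formats differ.
DATASETS = {
    "GeoChat-Instruct": ["geochat_%d" % i for i in range(6)],
    "LEVIR-CC": ["levircc_%d" % i for i in range(4)],
    "ERA": ["era_%d" % i for i in range(2)],
}

def mix_datasets(datasets, seed=0):
    """Tag each sample with its source task, pool everything,
    and shuffle deterministically for one joint training epoch."""
    rng = random.Random(seed)
    mixed = [(name, s) for name, samples in datasets.items() for s in samples]
    rng.shuffle(mixed)
    return mixed

mixed = mix_datasets(DATASETS)
print(len(mixed))  # 12 samples drawn from all three tasks
```

A fixed seed keeps the sample order reproducible across runs, which matters when resuming joint fine-tuning from a checkpoint.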
- The code is released under the Apache 2.0 license as found in the LICENSE file.
- The pretrained weights are released under the CC-BY-NC-SA-4.0 license.
- The service is a research preview intended for non-commercial use only, and is subject to the following licenses and terms:
  - Model License of LLaMA. For the LLaMA3-VILA checkpoints' terms of use, please refer to the LLaMA3 License for additional details.
  - Dataset licenses for each dataset used during training.
```bibtex
@misc{li2024unirsunifyingmultitemporalremote,
      title={UniRS: Unifying Multi-temporal Remote Sensing Tasks through Vision Language Models},
      author={Yujie Li and Wenjia Xu and Guangzuo Li and Zijian Yu and Zhiwei Wei and Jiuniu Wang and Mugen Peng},
      year={2024},
      eprint={2412.20742},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.20742},
}
```