TL;DR: AV-NeRF enables joint audio-visual synthesis at novel positions and along novel view directions.
Can machines recording an audio-visual scene produce realistic, matching audio-visual experiences at novel positions and novel view directions? We answer this question by studying a new task, real-world audio-visual scene synthesis, and a first-of-its-kind NeRF-based approach for multimodal learning. Concretely, given a video recording of an audio-visual scene, the task is to synthesize new videos with spatial audio along arbitrary novel camera trajectories in that scene. We propose an acoustic-aware audio generation module that integrates prior knowledge of audio propagation into NeRF, implicitly associating audio generation with the 3D geometry and material properties of a visual environment. Furthermore, we present a coordinate transformation module that expresses a view direction relative to the sound source, enabling the model to learn sound-source-centric acoustic fields. To facilitate the study of this new task, we collect a high-quality Real-World Audio-Visual Scene (RWAVS) dataset. We demonstrate the advantages of our method on this real-world dataset and on the simulation-based SoundSpaces dataset. We encourage readers to visit our project page for convincing comparisons.
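The coordinate transformation mentioned above can be thought of as re-expressing the listener's pose in a frame centered on the sound source. The snippet below is only a minimal sketch of that general idea; the function name, inputs, and outputs are hypothetical, and the actual module in AV-NeRF may differ.

```python
import numpy as np

def source_centric_pose(listener_pos, view_dir, source_pos):
    """Illustrative sketch: re-express a listener pose relative to the sound source.

    Returns the listener-to-source distance and the angle between the listener's
    view direction and the direction toward the source -- simple source-centric
    quantities an acoustic field could be conditioned on.
    """
    to_source = source_pos - listener_pos                 # vector pointing at the source
    distance = float(np.linalg.norm(to_source))           # how far the listener is from the source
    to_source_dir = to_source / max(distance, 1e-8)       # unit direction toward the source
    view_dir = view_dir / max(np.linalg.norm(view_dir), 1e-8)
    cos_angle = float(np.clip(np.dot(view_dir, to_source_dir), -1.0, 1.0))
    angle = float(np.arccos(cos_angle))                    # view direction relative to the source
    return distance, angle

# Example: a listener 2 m away, looking directly at a source at the origin.
print(source_centric_pose(np.array([0.0, 0.0, 2.0]),
                          np.array([0.0, 0.0, -1.0]),
                          np.zeros(3)))
# -> (2.0, 0.0): 2 m from the source, view direction aligned with it.
```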
We provide the Real-World Audio-Visual Scene (RWAVS) Dataset.
- The dataset can be downloaded from the Hugging Face repository: https://huggingface.co/datasets/susanliang/RWAVS. A scripted download sketch is given after the directory listing below.
- After you download the dataset, decompress `RWAVS_Release.zip`:

  ```
  unzip RWAVS_Release.zip
  cd release/
  ```

- The data is organized with the following directory structure:

  ```
  ./release/
  ├── 1
  │   ├── binaural_syn_re.wav
  │   ├── feats_train.pkl
  │   ├── feats_val.pkl
  │   ├── frames
  │   │   ├── 00001.png
  │   │   ├── ...
  │   │   └── 00616.png
  │   ├── source_syn_re.wav
  │   ├── transforms_scale_train.json
  │   ├── transforms_scale_val.json
  │   ├── transforms_train.json
  │   └── transforms_val.json
  ├── ...
  ├── 13
  └── position.json
  ```
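If you prefer a scripted download rather than fetching the archive from the web page, one option is the `huggingface_hub` Python package. This is only a convenience sketch; the target directory `./RWAVS` is an arbitrary example.

```python
from huggingface_hub import snapshot_download

# Fetch the entire RWAVS dataset repository from the Hugging Face Hub.
# local_dir is an example path; choose any location you like.
snapshot_download(repo_id="susanliang/RWAVS",
                  repo_type="dataset",
                  local_dir="./RWAVS")
```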
The dataset contains 13 scenes indexed from 1 to 13. For each scene, we provide:

- `transforms_train.json`: camera poses for training.
- `transforms_val.json`: camera poses for evaluation. We split the data into `train` and `val` subsets, with 80% of the data for training and the rest for evaluation.
- `transforms_scale_train.json`: normalized camera poses for training. We scale 3D coordinates to $[-1, 1]^3$.
- `transforms_scale_val.json`: normalized camera poses for evaluation.
- `frames`: corresponding video frames for each camera pose.
- `source_syn_re.wav`: single-channel audio emitted by the sound source.
- `binaural_syn_re.wav`: two-channel audio captured by the binaural microphone. We synchronize `source_syn_re.wav` and `binaural_syn_re.wav` and resample both to $22050$ Hz.
- `feats_train.pkl`: vision and depth features extracted at each camera pose for training. We rely on V-NeRF to synthesize vision and depth images for each camera pose and then use a pre-trained encoder to extract features from the rendered images.
- `feats_val.pkl`: vision and depth features extracted at each camera pose for inference.
- `position.json`: normalized 3D coordinates of the sound source.
Please note that some frames may not have corresponding camera poses because COLMAP fails to estimate the camera parameters of these frames.
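As a quick sanity check on a downloaded scene, the sketch below loads the audio, camera poses, and pre-extracted features for scene 1. It assumes only standard formats (WAV, JSON, pickle) plus `librosa` for audio; the exact keys inside the JSON and pickle files follow the released data and are not spelled out here.

```python
import json
import pickle

import librosa

scene = "release/1"

# Two-channel recording from the binaural microphone (resampled to 22050 Hz).
binaural, sr = librosa.load(f"{scene}/binaural_syn_re.wav", sr=22050, mono=False)
# Single-channel audio emitted by the sound source.
source, _ = librosa.load(f"{scene}/source_syn_re.wav", sr=22050)
print("binaural:", binaural.shape, "source:", source.shape, "sr:", sr)

# Normalized camera poses used for training.
with open(f"{scene}/transforms_scale_train.json") as f:
    poses = json.load(f)
print("pose file keys:", list(poses.keys()))

# Vision and depth features rendered by V-NeRF and encoded per camera pose.
with open(f"{scene}/feats_train.pkl", "rb") as f:
    feats = pickle.load(f)
print("feature entries:", len(feats))
```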
- After downloading the dataset, modify the `DATA_DIR` and `LOG_DIR` variables in the `run.sh` file. `DATA_DIR` should point to the directory where you saved the dataset, and `LOG_DIR` is where all checkpoints and results will be stored.
Then, you can train and evaluate the model by running:

```
bash run.sh
```
The `run.sh` script contains both the training and evaluation commands. During training, the script traverses all 13 scenes. Once training has finished, the program prints the evaluation results for each environment as well as the overall performance.
We use the nerfacto model provided by `nerfstudio` as the V-NeRF. Please refer to the nerfstudio installation instructions for detailed guidance on installing `nerfstudio` and `tiny-cuda-nn`. You can train a NeRF for a given environment by running:
```
ns-train nerfacto --output-dir xxx --data xxx --max-num-iterations 100000 --viewer.quit-on-train-completion True
```
```bibtex
@inproceedings{liang23avnerf,
  author    = {Liang, Susan and Huang, Chao and Tian, Yapeng and Kumar, Anurag and Xu, Chenliang},
  booktitle = {Conference on Neural Information Processing Systems (NeurIPS)},
  title     = {AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis},
  year      = {2023}
}
```
We borrowed a lot of code from NAF and INRAS, and we thank the authors for sharing their code. If you use our code, please also consider citing their work.
If you have any comments or questions, feel free to contact Susan Liang and Chao Huang.