Human auditory perception is shaped by moving sound sources in 3D space, yet prior work on generative sound modelling has largely been restricted to mono signals or static spatial audio. In this work, we introduce a framework for controllably generating moving sounds from text prompts. To enable training, we construct a synthetic dataset of moving sounds recorded in binaural format, together with their spatial trajectories and text captions describing the sound event and its spatial motion. Using this dataset, we train a text-to-trajectory prediction model that outputs the three-dimensional trajectory of a moving sound source given a text prompt. To generate spatial audio, we first fine-tune a pre-trained text-to-audio generative model to output mono sound that is temporally aligned with the trajectory. The spatial audio is then simulated using the predicted, temporally aligned trajectory. Experimental evaluation demonstrates that the text-to-trajectory model acquires reasonable spatial understanding. This approach can be easily integrated into existing text-to-audio generative workflows and extended to moving sound generation in other spatial audio formats.
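To make the last stage of this pipeline concrete, here is a minimal sketch of rendering a generated mono waveform along a predicted per-frame trajectory of (azimuth, elevation, distance). The paper simulates binaural audio with a proper spatial renderer; the inverse-distance attenuation and constant-power panning below are simplified stand-ins that only illustrate the data flow, and the azimuth convention (positive = right) is an assumption.

```python
import numpy as np

def spatialize(mono: np.ndarray, trajectory: np.ndarray) -> np.ndarray:
    """mono: (num_samples,) waveform; trajectory: (num_frames, 3) with columns
    (azimuth_deg, elevation_deg, distance). Returns a (num_samples, 2) stereo
    approximation of the binaural output."""
    n = len(mono)
    t_idx = np.arange(len(trajectory))
    frames = np.linspace(0, len(trajectory) - 1, n)      # sample-level position along the trajectory
    az = np.interp(frames, t_idx, trajectory[:, 0])      # interpolate azimuth per sample
    dist = np.interp(frames, t_idx, trajectory[:, 2])    # interpolate distance per sample
    gain = 1.0 / np.maximum(dist, 0.2)                   # inverse-distance attenuation (clipped near zero)
    az_rad = np.deg2rad(az)
    left = np.sqrt(0.5 * (1.0 - np.sin(az_rad)))         # crude constant-power level panning
    right = np.sqrt(0.5 * (1.0 + np.sin(az_rad)))
    return np.stack([mono * gain * left, mono * gain * right], axis=-1)
```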
| Attribute | Accuracy | Macro-F1 | MAE | RA-MAE |
|---|---|---|---|---|
| **Naive Model (End Point)** | | | | |
| Azimuth | 98.2% | 98.5% | 5.789° | 2.339° |
| Elevation | 98.0% | 98.1% | 6.431° | 1.352° |
| Distance | 87.5% | 87.1% | 0.166 | 0.013 |
| **Whole-trajectory prediction** | | | | |
| Azimuth | 75.9% | 75.1% | 18.53° | 15.52° |
| Elevation | 61.2% | 65.9% | 28.75° | 21.44° |
| Distance | 66.7% | 52.1% | 1.601 | 0.365 |
We have trained a trajectory prediction model that predicts the azimuth, elevation, and distance of an audio object given a text prompt. For comparison, we have also trained a naive model that predicts the spatial attributes only at the onset and offset of the trajectory. The prediction results are shown above. RA-MAE stands for Range-Aware MAE, where the error is measured relative to the nearest boundary of the target attribute range.
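The sketch below shows one reading of the Range-Aware MAE described above: a prediction that already falls inside the target attribute range incurs no error, and otherwise the error is the distance to the nearest boundary of that range. The range values in the example are placeholders, not the paper's actual attribute bins, and angular wrap-around is ignored for brevity.

```python
import numpy as np

def range_aware_mae(preds: np.ndarray, ranges: np.ndarray) -> float:
    """preds: (N,) predicted values; ranges: (N, 2) [low, high] target range per sample."""
    low, high = ranges[:, 0], ranges[:, 1]
    below = np.clip(low - preds, 0.0, None)    # distance below the lower boundary
    above = np.clip(preds - high, 0.0, None)   # distance above the upper boundary
    return float(np.mean(below + above))       # zero whenever low <= pred <= high

# Hypothetical example: target azimuth range [90, 135] degrees.
# A prediction of 80 deg contributes 10 deg; a prediction of 100 deg contributes 0.
print(range_aware_mae(np.array([80.0, 100.0]),
                      np.array([[90.0, 135.0], [90.0, 135.0]])))  # -> 5.0
```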
Because text-to-audio generative models are generally trained without explicit or precise temporal information, above we show a common issue in synthesizing variable-length latent representations with a latent diffusion model (LDM). With the original pre-trained transformer-based VAE, the silent part of the latent space can be copied and pasted into any other region to obtain silence there. In 'VAE reconstructed', we copied the first 200 frames onto frames 200-400, and this region is silenced out smoothly. However, applying the same edit to a latent generated by the LDM introduces artifacts in the Mel-spectrogram that sound like harmonic glitches. We therefore fine-tuned the pretrained latent diffusion model from Make-an-Audio 2 by modifying its latent space temporally.
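For clarity, here is a sketch of the copy-and-paste check described above, assuming a latent tensor whose last dimension is time frames and using the frame indices from the text. The tensor layout and the downstream decode step are assumptions, not the Make-an-Audio 2 API.

```python
import torch

def paste_silence(latent: torch.Tensor, src=(0, 200), dst=(200, 400)) -> torch.Tensor:
    """Copy latent frames src[0]:src[1] (a known-silent region) over frames dst[0]:dst[1]."""
    edited = latent.clone()
    edited[..., dst[0]:dst[1]] = latent[..., src[0]:src[1]]
    return edited

# Decoding the edited latent with the pre-trained transformer-based VAE yields smooth
# silence in the pasted region; applying the same edit to an LDM-sampled latent leaves
# audible harmonic glitches in the Mel-spectrogram, which motivated the temporal fine-tuning.
```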
A police siren is alarming right back and moves far away.
A motorcycle runs right to left back ending close to me.
A racing car accelerates from right back to front left and disappears far away.
A rooster crows back to left back close to me.
A jet engine hums in front from a distance away to nearby.
A sheep close to me bleats to my back.
A horse cries in front of me and moves left far away.
A helicopter flies back to front right above me.
Ducks shouting left back to my back close to me.
A motorboat moves from far right back to close left back.