Text2Move: Text-to-moving sound generation via trajectory prediction and temporal alignment

Yunyi Liu1,2*, Shaofan Yang2, Kai Li2, Xu Li2
1University of Sydney
2Dolby Laboratories, Inc.

*Work was done while interning at and funded by Dolby Laboratories

Human auditory perception is shaped by moving sound sources in 3D space, yet prior work in generative sound modelling has largely been restricted to mono signals or static spatial audio. In this work, we introduce a framework for controllably generating moving sounds from text prompts. To enable training, we construct a synthetic dataset that records moving sounds in binaural format, their spatial trajectories, and text captions describing the sound event and its spatial motion. Using this dataset, we train a text-to-trajectory prediction model that outputs the three-dimensional trajectory of a moving sound source given a text prompt. To generate spatial audio, we first fine-tune a pre-trained text-to-audio generative model to output mono sound that is temporally aligned with the trajectory. The spatial audio is then simulated using the predicted, temporally aligned trajectory. Experimental evaluation demonstrates that the text-to-trajectory model achieves reasonable spatial understanding. This approach can be easily integrated into existing text-to-audio generation workflows and extended to moving-sound generation in other spatial audio formats.
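To make the pipeline concrete, here is a minimal sketch of the three stages described above. The callables passed in (predict_trajectory, generate_mono, spatialize) are hypothetical placeholders for the trajectory predictor, the fine-tuned text-to-audio model, and the spatial-audio simulator; only the data flow is illustrated, not a released API.

```python
def text_to_moving_sound(prompt, predict_trajectory, generate_mono, spatialize):
    """Sketch of the Text2Move flow: text -> trajectory -> aligned mono -> binaural.

    The three callables are hypothetical stand-ins for the components
    described in the abstract; this only shows how their outputs connect.
    """
    trajectory = predict_trajectory(prompt)   # (T, 3): azimuth, elevation, distance per frame
    mono = generate_mono(prompt)              # 1-D waveform, temporally aligned with the trajectory
    binaural = spatialize(mono, trajectory)   # (2, N) binaural waveform simulated from the trajectory
    return binaural, trajectory
```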


Trajectory Prediction

Attribute                      Accuracy   Macro-F1   MAE       RA-MAE
Naive model (end point)
  Azimuth                      98.2%      98.5%      5.789°    2.339°
  Elevation                    98.0%      98.1%      6.431°    1.352°
  Distance                     87.5%      87.1%      0.166     0.013
Whole-trajectory prediction
  Azimuth                      75.9%      75.1%      18.53°    15.52°
  Elevation                    61.2%      65.9%      28.75°    21.44°
  Distance                     66.7%      52.1%      1.601     0.365

We trained a trajectory prediction model that predicts the azimuth, elevation, and distance of an audio object given a text prompt. For comparison, we also trained a naive model that predicts these spatial attributes only at the onset and offset of the trajectory. The prediction results are shown above. RA-MAE stands for Range-Aware MAE, where the error is measured relative to the nearest boundary of the target range.
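For reference, below is a minimal sketch of how a Range-Aware MAE can be computed, assuming each target is given as an interval [lo, hi] (e.g. the azimuth range implied by a phrase such as "right back"). The interval representation and the function name are our assumptions, not the exact evaluation code.

```python
import numpy as np

def range_aware_mae(pred, lo, hi):
    """Range-Aware MAE sketch: the error is zero when a prediction falls
    inside its target range [lo, hi], otherwise it is the distance to the
    nearest boundary of that range. The interval semantics are an
    assumption based on the description above."""
    pred, lo, hi = map(np.asarray, (pred, lo, hi))
    below = np.clip(lo - pred, 0.0, None)   # how far below the lower bound
    above = np.clip(pred - hi, 0.0, None)   # how far above the upper bound
    return float(np.mean(below + above))

# Example: azimuth predictions (degrees) against per-sample target ranges.
print(range_aware_mae(pred=[10.0, 95.0], lo=[0.0, 45.0], hi=[45.0, 90.0]))  # 2.5
```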

Temporal Alignment

Because text-to-audio generative models are generally trained without explicit or precise temporal information, above we show a common issue in synthesizing variable-length latent representations with a latent diffusion model (LDM). With the original pre-trained transformer-based VAE, the silent part of the latent sequence can be copied and pasted to any other region to obtain silence elsewhere: in 'VAE reconstructed', we copied the first 200 frames into frames 200-400, and that region is silenced out smoothly. However, applying the same edit to a latent generated by the LDM produces artifacts in the Mel-spectrogram that result in harmonic glitches. We therefore fine-tuned the pre-trained latent diffusion model from Make-an-Audio 2 by modifying its latent space temporally.
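A minimal sketch of the copy-and-paste check described above, assuming the latent is a tensor whose last dimension is time; the function name, tensor layout, and the commented `vae.decode` call are placeholders rather than the Make-an-Audio 2 API.

```python
import torch

def paste_silent_region(latent: torch.Tensor, src=(0, 200), dst=(200, 400)) -> torch.Tensor:
    """Copy a silent span of latent frames over another span, as in the
    'VAE reconstructed' example (first 200 frames pasted into frames
    200-400). Assumes the latent is shaped (..., time_frames); the actual
    layout in Make-an-Audio 2 may differ."""
    out = latent.clone()
    out[..., dst[0]:dst[1]] = latent[..., src[0]:src[1]]
    return out

# edited = paste_silent_region(ldm_latent)   # hypothetical latent from the LDM
# audio  = vae.decode(edited)                # decode with the pre-trained VAE (placeholder call)
```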

Text-to-moving sound generation

Below we show examples of our trajectory-prediction model and the naive model spatializing audio generated by the fine-tuned Make-an-Audio 2 model from the given text prompts.
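For illustration only, here is a crude sketch of how a predicted per-frame trajectory can drive a time-varying binaural rendering of the generated mono audio. It uses simple constant-power panning and inverse-distance gain rather than the HRTF-style simulation behind the demos, and all names are ours.

```python
import numpy as np

def simple_binauralize(mono: np.ndarray, azimuth_deg: np.ndarray,
                       distance_m: np.ndarray) -> np.ndarray:
    """Toy spatializer: left/right constant-power panning from azimuth plus
    inverse-distance attenuation. Front/back cues are ignored; this is only
    meant to show a trajectory driving a time-varying rendering."""
    n = len(mono)
    # Interpolate the frame-rate trajectory to one value per audio sample.
    t_traj = np.linspace(0.0, 1.0, len(azimuth_deg))
    t_audio = np.linspace(0.0, 1.0, n)
    az = np.interp(t_audio, t_traj, azimuth_deg)
    dist = np.interp(t_audio, t_traj, distance_m)

    pan = np.sin(np.deg2rad(az))              # -1 (left) .. +1 (right)
    theta = (pan + 1.0) * np.pi / 4.0         # constant-power pan angle
    gain = 1.0 / np.maximum(dist, 0.25)       # simple inverse-distance gain
    left = mono * np.cos(theta) * gain
    right = mono * np.sin(theta) * gain
    return np.stack([left, right], axis=0)    # shape (2, n)

# Example: a 2 s tone moving from far right-back to close left-back.
sr = 16000
mono = np.sin(2 * np.pi * 440 * np.arange(2 * sr) / sr)
binaural = simple_binauralize(mono,
                              azimuth_deg=np.linspace(120.0, -150.0, 50),
                              distance_m=np.linspace(8.0, 1.0, 50))
```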

A police siren is alarming right back and moves far away.

Mono

Naive

Trajectory


A motorcycle runs right to left back ending close to me.

Mono

Naive

Trajectory


A racing car accelerates from right back to front left and disappears far away.

Mono

Naive

Trajectory


A rooster crows back to left back close to me.

Mono

Naive

Trajectory


A jet engine hums in front from a distance away to nearby.

Mono

Naive

Trajectory


A sheep close to me bleats to my back.

Mono

Naive

Trajectory


A horse cries in front of me and moves left far away.

Mono

Naive

Trajectory


A helicopter flies back to front right above me.

Mono

Naive

Trajectory


Ducks shouting left back to my back close to me.

Mono

Naive

Trajectory


A motorboat moves from far right back to close left back.

Mono

Naive

Trajectory