Text2Move: Text-to-moving sound generation via trajectory prediction and temporal alignment

Yunyi Liu1,2*, Shaofan Yang2, Kai Li2, Xu Li2
1University of Sydney
2Dolby Laboratories, Inc.

*Work was done while interning at and funded by Dolby Laboratories

Human auditory perception is shaped by moving sound sources in 3D space, yet prior work in generative sound modelling has largely been restricted to mono signals or static spatial audio. In this work, we introduce a framework for generating moving sounds from text prompts in a controllable fashion. To enable training, we construct a synthetic dataset that pairs moving sounds recorded in binaural format with their spatial trajectories and text captions describing both the sound event and its spatial motion. Using this dataset, we train a text-to-trajectory prediction model that outputs the three-dimensional trajectory of a moving sound source given a text prompt. To generate spatial audio, we first fine-tune a pre-trained text-to-audio generative model to output mono sound that is temporally aligned with the trajectory. The spatial audio is then simulated from this mono sound using the predicted, temporally aligned trajectory. Experimental evaluation demonstrates that the text-to-trajectory model exhibits reasonable spatial understanding. This approach can be easily integrated into existing text-to-audio generative workflows and extended to moving sound generation in other spatial audio formats.


Trajectory Prediction

Attribute    Accuracy    Macro-F1    MAE        RA-MAE

Naive model (end point)
Azimuth      98.2%       98.5%       5.789°     2.339°
Elevation    98.0%       98.1%       6.431°     1.352°
Distance     87.5%       87.1%       0.166 m    0.013 m

Whole-trajectory prediction
Azimuth      75.9%       75.1%       18.53°     15.52°
Elevation    61.2%       65.9%       28.75°     21.44°
Distance     66.7%       52.1%       1.601 m    0.365 m

We trained a trajectory prediction model that predicts the azimuth, elevation, and distance of an audio object given a text prompt. For comparison, we also trained a naive model that predicts only the spatial attributes at the onset and offset (i.e. the end points) of the trajectory. The prediction results are shown above. RA-MAE stands for Range-Aware MAE, in which the error is measured relative to the nearest boundary of the labelled attribute range.
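The exact computation of RA-MAE is easiest to see in code. The snippet below is a minimal NumPy sketch of one reasonable reading of the metric: the error is zero when the prediction falls inside the labelled range, and otherwise equals the distance to the nearest boundary of that range. The function name and range layout are illustrative assumptions, not taken from our codebase.

import numpy as np

def range_aware_mae(pred, range_low, range_high):
    """Range-Aware MAE: zero error inside the labelled range, otherwise the
    distance to the nearest boundary of that range (illustrative definition)."""
    pred = np.asarray(pred, dtype=float)
    low = np.asarray(range_low, dtype=float)
    high = np.asarray(range_high, dtype=float)
    below = np.clip(low - pred, 0.0, None)    # how far below the lower bound
    above = np.clip(pred - high, 0.0, None)   # how far above the upper bound
    return float(np.mean(below + above))

# Example: two azimuth predictions against their labelled ranges (degrees).
print(range_aware_mae(pred=[10.0, 50.0], range_low=[0.0, 30.0], range_high=[45.0, 45.0]))
# -> mean of [0.0, 5.0] = 2.5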

Temporal Alignment

Because text-to-audio generative models are generally trained without explicit or precise temporal information, we illustrate above a common issue when synthesizing variable-length latent representations with a latent diffusion model. With the original pre-trained transformer-based VAE, one can copy and paste a silent part of the latent space into any other region to obtain silence there. In the 'VAE reconstructed' example, we copied the first 200 frames into frames (200, 400), and that region is silenced out smoothly. If we apply the same edit to the latent space generated by the LDM, however, artifacts appear in the Mel-spectrograms, resulting in harmonic glitches. We therefore fine-tune the pre-trained latent diffusion model from Make-an-Audio 2 by modifying its latent space temporally.
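As a concrete illustration of the copy-and-paste check described above, the sketch below operates on a latent array z of shape (frames, channels). In practice z would come from the VAE encoder or the LDM sampler and would be decoded back to audio afterwards; the random array, frame indices, and the vae.decode reference in the comment are placeholders, not our actual pipeline.

import numpy as np

def paste_frames(z, src=(0, 200), dst=(200, 400)):
    """Copy latent frames src[0]:src[1] over frames dst[0]:dst[1].

    If the latent space is temporally local (as with the VAE), pasting frames
    that encode silence should silence the destination region after decoding;
    with the original LDM latents this instead produces audible artifacts."""
    z = z.copy()
    length = min(src[1] - src[0], dst[1] - dst[0])
    z[dst[0]:dst[0] + length] = z[src[0]:src[0] + length]
    return z

# Placeholder latent standing in for an encoder/LDM output: (frames, channels).
z = np.random.randn(600, 64).astype(np.float32)
z_edited = paste_frames(z)            # in practice: audio = vae.decode(z_edited)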

Text-to-moving sound generation

Below we list the text prompts used to demonstrate our trajectory-prediction model and the naive model spatializing audio generated by the fine-tuned Make-an-Audio 2 model. For each prompt, the demo provides three renderings: the mono output, the naive (end-point) spatialization, and the full trajectory-based spatialization.

A police siren is alarming right back and moves far away.
A motorcycle runs right to left back ending close to me.
A racing car accelerates from right back to front left and disappears far away.
A rooster crows back to left back close to me.
A jet engine hums in front from a distance away to nearby.
A sheep close to me bleats to my back.
A horse cries in front of me and moves left far away.
A helicopter flies back to front right above me.
Ducks shouting left back to my back close to me.
A motorboat moves from far right back to close left back.
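The renderings above come from a spatial audio simulator driven by the predicted trajectory. As a rough, self-contained illustration of how an (azimuth, elevation, distance) trajectory can drive a binaural mix, the NumPy sketch below applies equal-power amplitude panning and 1/distance attenuation. It ignores elevation and HRTF cues, so it is a simplified stand-in, not the simulator used for these examples.

import numpy as np

def naive_binaural_pan(mono, trajectory):
    """Render a 2-channel signal from a mono signal and an (azimuth, elevation,
    distance) trajectory using equal-power panning and 1/distance attenuation.
    Azimuth in degrees (0 = front, +90 = right), distance in metres."""
    n = len(mono)
    t_audio = np.linspace(0.0, 1.0, n)
    t_traj = np.linspace(0.0, 1.0, len(trajectory))
    azimuth = np.interp(t_audio, t_traj, trajectory[:, 0])
    distance = np.interp(t_audio, t_traj, trajectory[:, 2])

    pan = np.sin(np.deg2rad(azimuth))          # -1 = full left, +1 = full right
    gain = 1.0 / np.maximum(distance, 0.3)     # clamp to avoid blow-up very close
    left = mono * gain * np.sqrt((1.0 - pan) / 2.0)
    right = mono * gain * np.sqrt((1.0 + pan) / 2.0)
    return np.stack([left, right], axis=0)

# Toy usage: 1 s of noise moving from right-back (far) to front-left (near).
mono = np.random.randn(48_000).astype(np.float32) * 0.1
traj = np.stack([np.linspace(120.0, -45.0, 10),   # azimuth (degrees)
                 np.zeros(10),                    # elevation (unused here)
                 np.linspace(8.0, 1.0, 10)],      # distance (metres)
                axis=1)
stereo = naive_binaural_pan(mono, traj)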


Dataset Curation

We apologize that, due to a strict dataset-release policy, we can currently provide only the scripts for the dataset curation process. The complete dataset will be uploaded once its release has been approved.

Overview of the curated dataset
Dataset mapping of spatial attributes to human-readable captions and value ranges.

We select AudioTime as our base dataset, which consists of 5,000 mono audio clips with precise timestamp annotations. Since AudioTime includes clips with multiple overlapping events, we split the data so that each clip contains only a single-source event, yielding 7,685 clean clips. We then randomly assign spatial attributes (azimuth, elevation, and distance) to each clip according to the table above, generating 10 variations of spatial attributes per clip, and use an LLM (GPT-4o) to write the captions. The LLM prompts used for caption writing are shown below.
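Before captioning, the attribute-assignment step can be sketched as follows. The sampling ranges, helper names, and record layout here are illustrative assumptions matching the input format shown further down, not the exact ranges from the mapping table.

import json
import random

# Assumed sampling ranges (degrees for azimuth/elevation, metres for distance).
AZIMUTH = (-180.0, 180.0)
ELEVATION = (-90.0, 90.0)
DISTANCE = (0.5, 10.0)

def sample_spatial():
    """Draw start/end azimuth, elevation, and distance for one variation."""
    return {
        "start_azimuth": random.uniform(*AZIMUTH),
        "end_azimuth": random.uniform(*AZIMUTH),
        "start_elevation": random.uniform(*ELEVATION),
        "end_elevation": random.uniform(*ELEVATION),
        "start_distance": random.uniform(*DISTANCE),
        "end_distance": random.uniform(*DISTANCE),
    }

def make_records(event, n_variations=10):
    """Build n_variations spatial variations for one single-source AudioTime event."""
    records = []
    for i in range(n_variations):
        records.append({
            "audio_id": event["audio_id"],
            "event_type": event["event_type"],
            "start_time": event["start_time"],
            "end_time": event["end_time"],
            "duration": round(event["end_time"] - event["start_time"], 3),
            "variation": i,
            "spatial": sample_spatial(),
        })
    return records

event = {"audio_id": "syn_10", "event_type": "Cap gun", "start_time": 1.045, "end_time": 1.438}
print(json.dumps(make_records(event)[0], indent=2))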

LLM prompts for caption rewriting

I will provide you a json file with descriptions about the sound in terms of time duration, spatial coordinates, audio category, etc. You need to write them into a sentence that describes these spatial attributes. You can creatively think of other words to describe them, e.g., 'swift' corresponds to fast, 'far away' means far or distant, 'top' means up, etc. The requirements for the captions are:

  1. From this json file you can see that there are spatial attributes. Your task is to rewrite the enriched caption so that it includes the acoustic event written in the caption, the verb that describes the action, and the beginning and end direction and distance. I have provided a clear relationship between the attributes and the caption in the table. For example, when the start distance is from 0.5m to 1m, it is written as close. When the end distance is 1 to 3, it falls into moderate, meaning no specific caption about the end distance needs to be captured.
  2. You must use human language to describe the audio movements.
  3. Make sure your caption is just one sentence with approximately 10 to 20 words.
  4. You need to infer the speed of the object by checking the duration of the sound. For example, if the duration is below 0.5s and the distance difference is larger than 5m, it is generally considered fast.
  5. The original caption should be removed. You should output the adapted caption with only one sentence. Avoid using punctuation in between.
  6. There are settings with the label OM, meaning omit, in the dataset. When this occurs, you generally do not need to write captions for such attributes.
  7. Consider using different synonyms to describe the subject, different verbs to describe the action, and different words to describe the spatial position and motion.

Input format

{"audio_id": "syn_10", "event_type": "Cap gun", "start_time": 1.045, "end_time": 1.438, "duration": 0.393, "variation": 8, "spatial": {"start_azimuth": 86.65464342157418, "end_azimuth": -38.07924208395616, "start_elevation": 9.695088459188575, "end_elevation": -15.40823925228966, "start_distance": 9.499581296804277, "end_distance": 0.580381126908331}}

Output format

{"caption": "A sudden cap gun sound shoots from my far right fastly to my front left close to me."}