Text2Move: Text-to-moving sound generation via trajectory prediction and temporal alignment

Yunyi Liu1,2*, Shaofan Yang2, Kai Li2, Xu Li2
1University of Sydney
2Dolby Laboratories, Inc.

*Work was done while interning at and funded by Dolby Laboratories

Human auditory perception is shaped by moving sound sources in 3D space, yet prior work in generative sound modelling has largely been restricted to mono signals or static spatial audio. In this work, we introduce a framework for generating moving sounds from text prompts in a controllable fashion. To enable training, we construct a synthetic dataset that pairs moving sounds recorded in binaural format with their spatial trajectories and text captions describing both the sound event and its spatial motion. Using this dataset, we train a text-to-trajectory prediction model that outputs the three-dimensional trajectory of a moving sound source given a text prompt. To generate spatial audio, we first fine-tune a pre-trained text-to-audio generative model to output mono sound that is temporally aligned with the trajectory. The spatial audio is then simulated from this mono sound using the predicted, temporally aligned trajectory. Experimental evaluation demonstrates that the text-to-trajectory model exhibits reasonable spatial understanding. This approach can be easily integrated into existing text-to-audio generative workflows and extended to moving sound generation in other spatial audio formats.


Trajectory Prediction

Attribute    Accuracy    Macro-F1    MAE        RA-MAE

Naive model (end point)
Azimuth      98.2%       98.5%       5.789°     2.339°
Elevation    98.0%       98.1%       6.431°     1.352°
Distance     87.5%       87.1%       0.166 m    0.013 m

Whole-trajectory prediction
Azimuth      75.9%       75.1%       18.53°     15.52°
Elevation    61.2%       65.9%       28.75°     21.44°
Distance     66.7%       52.1%       1.601 m    0.365 m

We trained a trajectory prediction model that predicts the azimuth, elevation, and distance of an audio object given a text prompt. For comparison, we also trained a naive model that predicts only the spatial attributes at the onset and offset (i.e. the end points) of the trajectory. The prediction results are shown above. RA-MAE stands for Range-Aware MAE, in which the error is measured relative to the nearest boundary of the labelled attribute range.
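The exact computation of RA-MAE is easiest to see in code. The snippet below is a minimal NumPy sketch of one reasonable reading of the metric: the error is zero when the prediction falls inside the labelled range, and otherwise equals the distance to the nearest boundary of that range. The function name and range layout are illustrative assumptions, not taken from our codebase.

import numpy as np

def range_aware_mae(pred, range_low, range_high):
    """Range-Aware MAE: zero error inside the labelled range, otherwise the
    distance to the nearest boundary of that range (illustrative definition)."""
    pred = np.asarray(pred, dtype=float)
    low = np.asarray(range_low, dtype=float)
    high = np.asarray(range_high, dtype=float)
    below = np.clip(low - pred, 0.0, None)    # how far below the lower bound
    above = np.clip(pred - high, 0.0, None)   # how far above the upper bound
    return float(np.mean(below + above))

# Example: two azimuth predictions against their labelled ranges (degrees).
print(range_aware_mae(pred=[10.0, 50.0], range_low=[0.0, 30.0], range_high=[45.0, 45.0]))
# -> mean of [0.0, 5.0] = 2.5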

Temporal Alignment

Because text-to-audio generative models are generally trained without explicit or precise temporal information, we illustrate above a common issue when synthesizing variable-length latent representations with a latent diffusion model. With the original pre-trained transformer-based VAE, one can copy and paste a silent part of the latent space into any other region to obtain silence there. In the 'VAE reconstructed' example, we copied the first 200 frames into frames (200, 400), and that region is silenced out smoothly. If we apply the same edit to the latent space generated by the LDM, however, artifacts appear in the Mel-spectrograms, resulting in harmonic glitches. We therefore fine-tune the pre-trained latent diffusion model from Make-an-Audio 2 by modifying its latent space temporally.
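As a concrete illustration of the copy-and-paste check described above, the sketch below operates on a latent array z of shape (frames, channels). In practice z would come from the VAE encoder or the LDM sampler and would be decoded back to audio afterwards; the random array, frame indices, and the vae.decode reference in the comment are placeholders, not our actual pipeline.

import numpy as np

def paste_frames(z, src=(0, 200), dst=(200, 400)):
    """Copy latent frames src[0]:src[1] over frames dst[0]:dst[1].

    If the latent space is temporally local (as with the VAE), pasting frames
    that encode silence should silence the destination region after decoding;
    with the original LDM latents this instead produces audible artifacts."""
    z = z.copy()
    length = min(src[1] - src[0], dst[1] - dst[0])
    z[dst[0]:dst[0] + length] = z[src[0]:src[0] + length]
    return z

# Placeholder latent standing in for an encoder/LDM output: (frames, channels).
z = np.random.randn(600, 64).astype(np.float32)
z_edited = paste_frames(z)            # in practice: audio = vae.decode(z_edited)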

Text-to-moving sound generation

Below we list the text prompts used to demonstrate our trajectory-prediction model and the naive model spatializing audio generated by the fine-tuned Make-an-Audio 2 model. For each prompt, the demo provides three renderings: the mono output, the naive (end-point) spatialization, and the full trajectory-based spatialization.

A police siren is alarming right back and moves far away.
A motorcycle runs right to left back ending close to me.
A racing car accelerates from right back to front left and disappears far away.
A rooster crows back to left back close to me.
A jet engine hums in front from a distance away to nearby.
A sheep close to me bleats to my back.
A horse cries in front of me and moves left far away.
A helicopter flies back to front right above me.
Ducks shouting left back to my back close to me.
A motorboat moves from far right back to close left back.
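The renderings above come from a spatial audio simulator driven by the predicted trajectory. As a rough, self-contained illustration of how an (azimuth, elevation, distance) trajectory can drive a binaural mix, the NumPy sketch below applies equal-power amplitude panning and 1/distance attenuation. It ignores elevation and HRTF cues, so it is a simplified stand-in, not the simulator used for these examples.

import numpy as np

def naive_binaural_pan(mono, trajectory):
    """Render a 2-channel signal from a mono signal and an (azimuth, elevation,
    distance) trajectory using equal-power panning and 1/distance attenuation.
    Azimuth in degrees (0 = front, +90 = right), distance in metres."""
    n = len(mono)
    t_audio = np.linspace(0.0, 1.0, n)
    t_traj = np.linspace(0.0, 1.0, len(trajectory))
    azimuth = np.interp(t_audio, t_traj, trajectory[:, 0])
    distance = np.interp(t_audio, t_traj, trajectory[:, 2])

    pan = np.sin(np.deg2rad(azimuth))          # -1 = full left, +1 = full right
    gain = 1.0 / np.maximum(distance, 0.3)     # clamp to avoid blow-up very close
    left = mono * gain * np.sqrt((1.0 - pan) / 2.0)
    right = mono * gain * np.sqrt((1.0 + pan) / 2.0)
    return np.stack([left, right], axis=0)

# Toy usage: 1 s of noise moving from right-back (far) to front-left (near).
mono = np.random.randn(48_000).astype(np.float32) * 0.1
traj = np.stack([np.linspace(120.0, -45.0, 10),   # azimuth (degrees)
                 np.zeros(10),                    # elevation (unused here)
                 np.linspace(8.0, 1.0, 10)],      # distance (metres)
                axis=1)
stereo = naive_binaural_pan(mono, traj)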


Dataset Curation

We apologize that, due to a strict dataset-release policy, we can currently provide only the scripts for the dataset curation process. The complete dataset will be uploaded once its release has been approved.

Overview of the curated dataset
Dataset mapping of spatial attributes to human-readable captions and value ranges.

We select AudioTime as our base dataset, which consists of 5,000 mono audio clips with precise timestamp annotations. Since AudioTime includes clips with multiple overlapping events, we split the data so that each clip contains only a single-source event, yielding 7,685 clean clips. We then randomly assign spatial attributes (azimuth, elevation, and distance) to each clip according to the table above, generating 10 variations of spatial attributes per clip, and use an LLM (GPT-4o) to write the captions. The LLM prompts used for caption writing are shown below.
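Before captioning, the attribute-assignment step can be sketched as follows. The sampling ranges, helper names, and record layout here are illustrative assumptions matching the input format shown further down, not the exact ranges from the mapping table.

import json
import random

# Assumed sampling ranges (degrees for azimuth/elevation, metres for distance).
AZIMUTH = (-180.0, 180.0)
ELEVATION = (-90.0, 90.0)
DISTANCE = (0.5, 10.0)

def sample_spatial():
    """Draw start/end azimuth, elevation, and distance for one variation."""
    return {
        "start_azimuth": random.uniform(*AZIMUTH),
        "end_azimuth": random.uniform(*AZIMUTH),
        "start_elevation": random.uniform(*ELEVATION),
        "end_elevation": random.uniform(*ELEVATION),
        "start_distance": random.uniform(*DISTANCE),
        "end_distance": random.uniform(*DISTANCE),
    }

def make_records(event, n_variations=10):
    """Build n_variations spatial variations for one single-source AudioTime event."""
    records = []
    for i in range(n_variations):
        records.append({
            "audio_id": event["audio_id"],
            "event_type": event["event_type"],
            "start_time": event["start_time"],
            "end_time": event["end_time"],
            "duration": round(event["end_time"] - event["start_time"], 3),
            "variation": i,
            "spatial": sample_spatial(),
        })
    return records

event = {"audio_id": "syn_10", "event_type": "Cap gun", "start_time": 1.045, "end_time": 1.438}
print(json.dumps(make_records(event)[0], indent=2))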

LLM prompts for caption rewriting

I will provide you a json file with descriptions about the sound in terms of time duration, spatial coordinates, audio category, etc. You need to write them into a sentence that describes these spatial attributes. You can creatively think of other words to describe them, e.g., 'swift' corresponds to fast, 'far away' means far or distant, 'top' means up, etc. The requirements for the captions are:

  1. From this json file you can see that there are spatial attributes. Your task is to rewrite the enriched caption so that it includes the acoustic event written in the caption, the verb that describes the action, and the beginning and end direction and distance. I have provided a clear relationship between the attributes and the caption in the table. For example, when the start distance is from 0.5m to 1m, it is written as close. When the end distance is 1 to 3, it falls into moderate, meaning no specific caption about the end distance needs to be captured.
  2. You must use human language to describe the audio movements.
  3. Make sure your caption is just one sentence with approximately 10 to 20 words.
  4. You need to infer the speed of the object by checking the duration of the sound. For example, if the duration is below 0.5s and the distance difference is larger than 5m, it is generally considered fast.
  5. The original caption should be removed. You should output the adapted caption with only one sentence. Avoid using punctuation in between.
  6. There are settings with the label OM, meaning omit, in the dataset. When this occurs, you generally do not need to write captions for such attributes.
  7. Consider using different synonyms to describe the subject, different verbs to describe the action, and different words to describe the spatial position and motion.

Input format

{"audio_id": "syn_10", "event_type": "Cap gun", "start_time": 1.045, "end_time": 1.438, "duration": 0.393, "variation": 8, "spatial": {"start_azimuth": 86.65464342157418, "end_azimuth": -38.07924208395616, "start_elevation": 9.695088459188575, "end_elevation": -15.40823925228966, "start_distance": 9.499581296804277, "end_distance": 0.580381126908331}}

Output format

{"caption": "A sudden cap gun sound shoots from my far right fastly to my front left close to me."}