Simi-SFX: A similarity-based conditioning method for controllable sound effect synthesis

Computing and Audio Research Laboratory, Department of Electrical and Information Engineering
The University of Sydney, Sydney, Australia

Imagine a dataset with three categories of sounds, each with distinct timbral characteristics. We can use a pre-trained audio representation model such as CLAP to extract an embedding for each sound in the dataset. For any reference sound, we can likewise obtain its embedding and measure its similarity to each class by its distance to the center of that class's embedding cluster. Borrowing an idea from anomaly detection, we treat each cluster as a Gaussian distribution and compute the Mahalanobis distance between the reference embedding and each class cluster. The lower the distance, the more similar the sound is to that class. We normalize the Mahalanobis distances to the range [0, 1] and use the resulting similarity scores during training as conditioning information that represents the timbre of the sound.
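The scoring step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the embeddings are random stand-ins for CLAP outputs, the function names are hypothetical, the ridge term added to the covariance is our own safeguard, and min-max scaling is only one plausible way to map distances to [0, 1], since the page does not specify the normalization.

```python
import numpy as np

def fit_class_gaussians(embeddings_by_class, ridge=1e-6):
    """Estimate a mean and inverse covariance for each class cluster."""
    stats = {}
    for name, emb in embeddings_by_class.items():
        emb = np.asarray(emb, dtype=np.float64)
        mean = emb.mean(axis=0)
        cov = np.cov(emb, rowvar=False)
        # Assumption: a small ridge term keeps the covariance invertible.
        cov = cov + ridge * np.eye(cov.shape[0])
        stats[name] = (mean, np.linalg.inv(cov))
    return stats

def mahalanobis_to_classes(embedding, stats):
    """Mahalanobis distance from one embedding to every class cluster."""
    embedding = np.asarray(embedding, dtype=np.float64)
    distances = {}
    for name, (mean, inv_cov) in stats.items():
        diff = embedding - mean
        distances[name] = float(np.sqrt(diff @ inv_cov @ diff))
    return distances

def min_max_normalize(values):
    """Scale an array of distances to [0, 1] (assumed normalization)."""
    values = np.asarray(values, dtype=np.float64)
    span = values.max() - values.min()
    if span == 0.0:
        return np.zeros_like(values)
    return (values - values.min()) / span

# Synthetic stand-ins for CLAP embeddings: three classes, 8 dimensions.
rng = np.random.default_rng(0)
centers = {"class_a": 0.0, "class_b": 5.0, "class_c": -5.0}
dataset = {name: rng.normal(loc=c, scale=1.0, size=(200, 8))
           for name, c in centers.items()}
stats = fit_class_gaussians(dataset)

reference = np.zeros(8)  # lies at the center of class_a
distances = mahalanobis_to_classes(reference, stats)
# Normalizing across the three class distances is for illustration only.
scores = min_max_normalize(list(distances.values()))
print(distances)
print(scores)
```

In this toy setup the reference sits at the center of class_a, so its distance to that class is the smallest and its normalized score for that class is 0.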

Architecture

We condition a DDSP network on this similarity score. Loudness and spectral centroid envelopes are extracted from the input audio and also passed as conditioning to the DDSP decoder. The decoder predicts the parameters that the transient and noise synthesizers need to synthesize the waveform.
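To make the conditioning inputs concrete, here is a simplified sketch of extracting frame-wise loudness and spectral centroid envelopes and stacking them with a similarity vector. Everything in it is an assumption for illustration: loudness is approximated by frame RMS in dB rather than a perceptually weighted measure, the frame and hop sizes are arbitrary, and `build_conditioning` is a hypothetical helper that simply repeats the similarity vector on every frame. The actual feature extraction and decoder input layout may differ.

```python
import numpy as np

def frame_signal(audio, frame_size, hop_size):
    """Slice a 1-D signal into overlapping frames (n_frames, frame_size)."""
    n_frames = 1 + (len(audio) - frame_size) // hop_size
    idx = (np.arange(frame_size)[None, :]
           + hop_size * np.arange(n_frames)[:, None])
    return audio[idx]

def loudness_envelope(audio, frame_size=1024, hop_size=256, eps=1e-7):
    """Frame-wise RMS level in dB (a simple proxy for loudness)."""
    frames = frame_signal(audio, frame_size, hop_size)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return 20.0 * np.log10(rms + eps)

def spectral_centroid_envelope(audio, sample_rate, frame_size=1024,
                               hop_size=256, eps=1e-7):
    """Frame-wise magnitude-weighted mean frequency in Hz."""
    frames = frame_signal(audio, frame_size, hop_size) * np.hanning(frame_size)
    mag = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(frame_size, d=1.0 / sample_rate)
    return (mag @ freqs) / (mag.sum(axis=1) + eps)

def build_conditioning(loudness, centroid, similarity):
    """Hypothetical layout: [loudness, centroid, similarity...] per frame."""
    similarity = np.asarray(similarity, dtype=np.float64)
    tiled = np.tile(similarity, (len(loudness), 1))
    return np.column_stack([loudness, centroid, tiled])

# Test tone: one second of a 1 kHz sine at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2.0 * np.pi * 1000.0 * t)

loudness = loudness_envelope(tone)
centroid = spectral_centroid_envelope(tone, sample_rate)
conditioning = build_conditioning(loudness, centroid, [0.0, 1.0, 1.0])
print(conditioning.shape)
```

For a pure tone the centroid envelope sits near the tone's frequency, which is a quick sanity check before feeding real recordings through the same functions.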

Regression analysis of the timbre control

We plot the Mahalanobis distance (MD) between the generated sound and the reference sound while interpolating the input similarity score from 0 to 1. The relationship between the normalized MD of each synthesized sound and its conditioning similarity score closely follows an exponential trend. Ordinary least squares (OLS) regression yields a mean R^2 of 0.4774 for the Footstep-set model and 0.6041 for the Impact-set model, indicating a strong correlation between the conditioning similarity score c and the dependent variable y (the Mahalanobis distance relative to each class). The plots show that almost all sound categories exhibit a clear separation between the presence of a particular feature (at c = 0) and its absence (at c = 1), which demonstrates that the model learns to output distinct timbres based on the proposed similarity scores. Additionally, the regression lines indicate a positive correlation between the normalized MD and the input similarity scores. This suggests that the conditioning method effectively encodes timbral information unique to different sounds, independent of other acoustic features such as loudness and spectral centroid, which were held constant during this test.
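For readers who want to reproduce this kind of analysis, the sketch below fits an OLS line to pairs of conditioning score c and normalized distance y and reports R^2. The data are synthetic (an exponential-shaped curve plus noise that we made up for illustration), and the function name is hypothetical; none of the numbers it prints correspond to the results reported above.

```python
import numpy as np

def ols_r_squared(c, y):
    """Fit y = slope * c + intercept by least squares and return R^2."""
    c = np.asarray(c, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    slope, intercept = np.polyfit(c, y, deg=1)
    residuals = y - (slope * c + intercept)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Synthetic example: distances that grow exponentially with c, plus noise.
rng = np.random.default_rng(0)
c = np.linspace(0.0, 1.0, 50)
y = (np.exp(2.0 * c) - 1.0) / (np.exp(2.0) - 1.0)
y_noisy = y + rng.normal(scale=0.1, size=c.shape)

slope, intercept, r2 = ols_r_squared(c, y_noisy)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r2:.3f}")
```

A straight line fitted to a curved trend cannot reach R^2 = 1 even without noise, so a linear fit gives a conservative summary when the underlying relationship is exponential.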

Creative usage of timbre control

We qualitatively demonstrate how interpolating between two conditioning similarity scores produces distinct timbres. For this example, we randomly selected a sound from our test dataset and extracted its loudness and spectral centroid as input. All channels of the similarity score were fixed at 1 except the first channel (C_1) and the second channel (C_2), which correspond to footsteps on metallic boards and footsteps on gravel, respectively. C_1 was interpolated from 0 to 1 while C_2 was interpolated from 1 to 0. The resulting spectrograms reveal a clear progression in timbre. Initially, the signals exhibit prominent harmonics in the higher frequencies, characteristic of footsteps on metallic boards. As the interpolation progresses, these harmonics gradually give way to noisier signals that lack harmonic structure, indicative of footsteps on gravel. For a more interactive experience, please visit our Colab notebook.

C1=0.0, C2=1.0: [spectrogram]
C1=0.2, C2=0.8: [spectrogram]
C1=0.4, C2=0.6: [spectrogram]
C1=0.6, C2=0.4: [spectrogram]
C1=0.8, C2=0.2: [spectrogram]
C1=1.0, C2=0.0: [spectrogram]
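The six interpolation settings shown above can be generated programmatically. The sketch below builds one similarity vector per step, with every channel held at 1 except the two being crossfaded. The total of five channels is a hypothetical value chosen for illustration, as is the function name; the page does not state how many channels the model uses.

```python
import numpy as np

def interpolation_schedule(n_channels, n_steps=6, up_channel=0, down_channel=1):
    """One similarity vector per step, crossfading two channels."""
    scores = np.ones((n_steps, n_channels))
    ramp = np.linspace(0.0, 1.0, n_steps)
    scores[:, up_channel] = ramp          # C_1 goes from 0 to 1
    scores[:, down_channel] = 1.0 - ramp  # C_2 goes from 1 to 0
    return scores

# Assumption: five similarity channels in total (hypothetical count).
schedule = interpolation_schedule(n_channels=5)
for step in schedule:
    print(np.round(step, 1))
```

Each row can then be passed to the synthesizer together with the fixed loudness and spectral centroid envelopes of the chosen test sound.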

Example Outputs from the four models we tested

The following examples compare synthesized audio from four models across different sound categories. Each row corresponds to a specific reference sound, and each column presents a variant synthesized by a different method.

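Sound | Reference | Simi-SFX | NoiseBandNet | DDSP | ICGAN
Footstep 1 | [audio] | [audio] | [audio] | [audio] | [audio]
Footstep 2 | [audio] | [audio] | [audio] | [audio] | [audio]
Footstep 3 | [audio] | [audio] | [audio] | [audio] | [audio]
Footstep 4 | [audio] | [audio] | [audio] | [audio] | [audio]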

In addition to the footsteps category, we also show some impact sounds synthesized from the four models.

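Sound | Reference | Simi-SFX | NoiseBandNet | DDSP | ICGAN
Impact 1 | [audio] | [audio] | [audio] | [audio] | [audio]
Impact 2 | [audio] | [audio] | [audio] | [audio] | [audio]
Impact 3 | [audio] | [audio] | [audio] | [audio] | [audio]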

BibTeX

@misc{liu2024simisfxsimilaritybasedconditioningmethod,
  title={Simi-SFX: A similarity-based conditioning method for controllable sound effect synthesis},
  author={Yunyi Liu and Craig Jin},
  year={2024},
  eprint={2412.18710},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2412.18710},
}