Simi-SFX: A similarity-based conditioning method for controllable sound effects synthesis

Yunyi Liu, Craig Jin

Computing and Audio Research Laboratory, Department of Electrical and Information Engineering
The University of Sydney, Sydney, Australia

Imagine a dataset with three categories of sounds, each with distinct timbre characteristics. Using a pre-trained audio representation model such as CLAP, we can extract an embedding for every sound in the dataset. For any reference sound, we can likewise obtain its embedding and quantify how similar it is to each class by measuring its distance to the corresponding embedding cluster. Borrowing an idea from anomaly detection, we treat each class cluster as a Gaussian distribution and compute the Mahalanobis distance between the reference embedding and each cluster. The lower the distance, the more similar the sound is to that class. We normalize the Mahalanobis distance to the range [0, 1] and use the resulting similarity score during training as conditioning information that represents the timbre of the sound.
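The distance computation described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the random vectors stand in for CLAP embeddings, the function names are hypothetical, the small `eps` regularizer is added for numerical stability, and the min-max normalization over all training distances is one plausible way to map distances to [0, 1].

```python
import numpy as np

def fit_class_gaussians(embeddings, labels, eps=1e-6):
    """Fit a mean and inverse covariance per class, treating each cluster as a Gaussian."""
    stats = {}
    for cls in np.unique(labels):
        x = embeddings[labels == cls]
        mean = x.mean(axis=0)
        # Small diagonal term keeps the covariance invertible (an assumption of this sketch).
        cov = np.cov(x, rowvar=False) + eps * np.eye(x.shape[1])
        stats[cls.item()] = (mean, np.linalg.inv(cov))
    return stats

def mahalanobis(e, mean, inv_cov):
    """Mahalanobis distance between embedding e and a Gaussian cluster."""
    d = e - mean
    return float(np.sqrt(d @ inv_cov @ d))

def similarity_scores(e, stats, d_min, d_max):
    """Min-max normalize each class distance to [0, 1]; lower means more similar."""
    scores = {}
    for cls, (mean, inv_cov) in stats.items():
        d = mahalanobis(e, mean, inv_cov)
        scores[cls] = float(np.clip((d - d_min) / (d_max - d_min), 0.0, 1.0))
    return scores

# Toy stand-in for CLAP embeddings: three well-separated 4-D clusters.
rng = np.random.default_rng(0)
centres = np.array([[0, 0, 0, 0], [5, 5, 5, 5], [-5, 5, -5, 5]], dtype=float)
embeddings = np.concatenate([c + rng.normal(size=(50, 4)) for c in centres])
labels = np.repeat([0, 1, 2], 50)

stats = fit_class_gaussians(embeddings, labels)
# Normalization bounds taken from all training distances (assumed scheme).
all_d = [mahalanobis(e, *stats[c]) for e in embeddings for c in stats]
d_min, d_max = min(all_d), max(all_d)

query = embeddings[labels == 0].mean(axis=0)  # a sound at the centre of class 0
scores = similarity_scores(query, stats, d_min, d_max)
print(scores)
```

In this toy setup the query sits at the centre of class 0, so its score for class 0 is the smallest of the three, matching the convention that a lower score means a closer timbre.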

Architecture

We condition a DDSP network on this similarity score. Loudness and spectral centroid envelopes are extracted from the input audio and passed to the DDSP decoder as additional conditioning. The decoder predicts the parameters that the transient and noise synthesizers need to synthesize the waveform.
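As a rough illustration of the two envelope features mentioned above, the sketch below computes a frame-wise loudness and spectral centroid with NumPy. It rests on assumptions: loudness is approximated by RMS level in dB (DDSP-style pipelines often use a perceptually weighted loudness instead), and the function name, frame size, and hop size are hypothetical choices rather than values from the paper.

```python
import numpy as np

def frame_features(audio, sr, frame_size=1024, hop=256):
    """Return per-frame loudness (dB, RMS-based) and spectral centroid (Hz)."""
    n_frames = 1 + (len(audio) - frame_size) // hop
    window = np.hanning(frame_size)
    freqs = np.fft.rfftfreq(frame_size, d=1.0 / sr)
    loudness_db = np.empty(n_frames)
    centroid_hz = np.empty(n_frames)
    for i in range(n_frames):
        frame = audio[i * hop : i * hop + frame_size]
        # RMS level in dB as a simple loudness proxy (assumption of this sketch).
        rms = np.sqrt(np.mean(frame ** 2))
        loudness_db[i] = 20.0 * np.log10(rms + 1e-7)
        # Spectral centroid: magnitude-weighted mean frequency of the frame.
        mag = np.abs(np.fft.rfft(frame * window))
        centroid_hz[i] = np.sum(freqs * mag) / (np.sum(mag) + 1e-7)
    return loudness_db, centroid_hz

# Example: a 1 kHz sine at half amplitude, one second at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
audio = 0.5 * np.sin(2 * np.pi * 1000.0 * t)
loudness_db, centroid_hz = frame_features(audio, sr)
print(loudness_db.mean(), centroid_hz.mean())
```

For the sine example the centroid envelope stays close to 1 kHz and the loudness envelope close to the sine's RMS level, which is a quick sanity check before feeding such envelopes to a decoder.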

Regression analysis of the timbre control

We interpolate the input similarity score from 0 to 1 and plot the Mahalanobis distance (MD) between each generated sound and the reference sound. The relationship between the normalized MD of each synthesized sound and its conditioning similarity score closely follows an exponential trend. Ordinary Least Squares (OLS) regression yields a mean R^2 value of 0.4774 for the Footstep-set model and 0.6041 for the Impact-set model, indicating a strong correlation between the conditioning similarity score c and the dependent variable y (the Mahalanobis distance relative to each class). The plots show that almost all sound categories exhibit a clear separation between the presence of a particular feature (at c = 0) and its absence (at c = 1), which demonstrates that the model learns to output distinct timbres based on the proposed similarity scores. The regression lines also indicate a positive correlation between the normalized MD and the input similarity scores. This suggests that the conditioning method encodes timbral information unique to different sounds, independent of other acoustic features such as loudness and spectral centroid, which were held constant during this test.

Figure: Regression analysis of timbre control
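For readers who want to reproduce this kind of analysis on their own outputs, the following is a minimal sketch of an OLS fit and its R^2 value. The data are synthetic and only illustrate the computation; they are not the paper's measurements, and the helper name `ols_r2` is hypothetical.

```python
import numpy as np

def ols_r2(c, y):
    """Fit y = slope * c + intercept by least squares and return (slope, intercept, R^2)."""
    c = np.asarray(c, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(c, y, deg=1)
    pred = slope * c + intercept
    ss_res = np.sum((y - pred) ** 2)      # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
    return float(slope), float(intercept), float(1.0 - ss_res / ss_tot)

# Synthetic example: normalized MD rising with the conditioning score c, plus noise.
rng = np.random.default_rng(0)
c = np.linspace(0.0, 1.0, 11)
y = 0.1 + 0.8 * c + rng.normal(scale=0.1, size=c.size)
slope, intercept, r2 = ols_r2(c, y)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r2:.3f}")
```

A positive slope corresponds to the positive correlation described above, and R^2 measures how much of the variance in the normalized MD the linear fit explains.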

Example Outputs from the four models we tested

The following examples compare synthesized audio from four models across different sound categories. Each row corresponds to a specific reference sound, and each column presents a variant synthesized by a different method.

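Sound | Reference | Simi-SFX | NoiseBandNet | DDSP | ICGAN
Footstep 1 | [audio] | [audio] | [audio] | [audio] | [audio]
Footstep 2 | [audio] | [audio] | [audio] | [audio] | [audio]
Footstep 3 | [audio] | [audio] | [audio] | [audio] | [audio]
Footstep 4 | [audio] | [audio] | [audio] | [audio] | [audio]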

Example Outputs from the four models we tested

In addition to the footstep category, we also show impact sounds synthesized by the four models.

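Sound | Reference | Simi-SFX | NoiseBandNet | DDSP | ICGAN
Impact 1 | [audio] | [audio] | [audio] | [audio] | [audio]
Impact 2 | [audio] | [audio] | [audio] | [audio] | [audio]
Impact 3 | [audio] | [audio] | [audio] | [audio] | [audio]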

BibTeX

@misc{liu2024simisfxsimilaritybasedconditioningmethod,
        title={Simi-SFX: A similarity-based conditioning method for controllable sound effect synthesis}, 
        author={Yunyi Liu and Craig Jin},
        year={2024},
        eprint={2412.18710},
        archivePrefix={arXiv},
        primaryClass={cs.SD},
        url={https://arxiv.org/abs/2412.18710}, 
  }

Creative usage of timbre control

C1=0.0, C2=1.0

C1=0.2, C2=0.8

C1=0.4, C2=0.6

C1=0.6, C2=0.4

C1=0.8, C2=0.2

C1=1.0, C2=0.0
