We can incorporate this similarity by conditioning a DDSP network on it. Loudness and spectral centroid envelopes are extracted from the input audio and passed to the DDSP decoder as additional conditioning. The decoder then predicts the parameters that the transient and noise synthesizers need to synthesize the waveform.
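As a rough illustration of the conditioning inputs described above, the sketch below computes frame-wise loudness and spectral centroid envelopes with NumPy and stacks them with a scalar similarity value. It is not the implementation used in the paper: the frame size, hop size, the use of plain RMS in dB as the loudness measure, and all function names (`frame_signal`, `loudness_envelope`, `centroid_envelope`, `conditioning_features`) are assumptions made for the example.

```python
import numpy as np

def frame_signal(audio, frame_size, hop_size):
    """Slice a 1-D signal into overlapping frames, zero-padding the tail."""
    n_frames = 1 + max(0, int(np.ceil((len(audio) - frame_size) / hop_size)))
    padded_len = (n_frames - 1) * hop_size + frame_size
    padded = np.pad(audio, (0, padded_len - len(audio)))
    idx = np.arange(frame_size)[None, :] + hop_size * np.arange(n_frames)[:, None]
    return padded[idx]  # shape: (n_frames, frame_size)

def loudness_envelope(audio, frame_size=1024, hop_size=256, eps=1e-7):
    """Frame-wise RMS level in dB (assumed loudness measure, no weighting)."""
    frames = frame_signal(audio, frame_size, hop_size)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return 20.0 * np.log10(rms + eps)

def centroid_envelope(audio, sample_rate, frame_size=1024, hop_size=256, eps=1e-7):
    """Frame-wise spectral centroid in Hz: magnitude-weighted mean frequency."""
    frames = frame_signal(audio, frame_size, hop_size) * np.hanning(frame_size)
    mag = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(frame_size, d=1.0 / sample_rate)
    return (mag @ freqs) / (mag.sum(axis=1) + eps)

def conditioning_features(audio, sample_rate, similarity):
    """Stack loudness, centroid, and a constant similarity value per frame."""
    loud = loudness_envelope(audio)
    cent = centroid_envelope(audio, sample_rate)
    sim = np.full_like(loud, similarity)  # similarity broadcast over time (assumption)
    return np.stack([loud, cent, sim], axis=-1)  # shape: (n_frames, 3)
```

In a setup like this, the resulting `(n_frames, 3)` array would be what a decoder receives as per-frame conditioning; how the similarity value is actually encoded and combined with the envelopes in the model is not specified here.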
The following examples compare synthesized audio from four models across different sound categories. Each row corresponds to a specific reference sound, and each column presents a variant synthesized by a different method.
In addition to the footsteps category, we also show impact sounds synthesized by the four models.
@misc{liu2024simisfxsimilaritybasedconditioningmethod,
title={Simi-SFX: A similarity-based conditioning method for controllable sound effect synthesis},
author={Yunyi Liu and Craig Jin},
year={2024},
eprint={2412.18710},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2412.18710},
}