Deep audio generative networks typically require large amounts of data and computational resources, while relying on extensive descriptive ground-truth labels to guide sound generation. Differentiable digital signal processing (DDSP) is a family of neural audio synthesis models that employ digital signal processing modules as part of the neural network architecture. By combining pre-processed audio features as conditioning vectors with pre-written digital synthesizers, DDSP can synthesize high-quality musical audio and control the acoustic characteristics of the sounds (pitch and loudness). However, due to limitations of its design, DDSP is weak at synthesizing impulsive signals and at providing nuanced control over the complex timbres of sound effects. To this end, we incorporate Verma's method for synthesizing impulsive signals into the DDSP architecture. Below we compare sounds synthesized by DDSP and by our approach, DDSP-SFX. We use a reference sound effect and perform timbre transfer by extracting the required audio features as input to the decoder.
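To make the DDSP idea concrete, below is a minimal sketch of a DDSP-style harmonic oscillator bank driven by frame-rate controls. The function name, shapes, and hyperparameters (16 kHz sample rate, hop of 64) are illustrative assumptions, not the exact DDSP-SFX implementation.

```python
# Sketch of a DDSP-style harmonic synthesizer: frame-rate controls
# (f0 in Hz, per-harmonic amplitudes) are upsampled to audio rate and
# drive a bank of sinusoidal oscillators.
import numpy as np

def harmonic_synth(f0, harm_amps, sample_rate=16000, hop=64):
    """f0: (frames,) fundamental in Hz; harm_amps: (frames, n_harmonics)."""
    n_frames, n_harm = harm_amps.shape
    n_samples = n_frames * hop
    # Linearly upsample frame-rate controls to audio rate.
    t_frames = np.arange(n_frames) * hop
    t_audio = np.arange(n_samples)
    f0_audio = np.interp(t_audio, t_frames, f0)
    amps_audio = np.stack(
        [np.interp(t_audio, t_frames, harm_amps[:, k]) for k in range(n_harm)],
        axis=1)
    # Integrate instantaneous frequency to obtain the fundamental's phase.
    phase = 2 * np.pi * np.cumsum(f0_audio) / sample_rate      # (n_samples,)
    harmonics = np.arange(1, n_harm + 1)                       # 1, 2, ..., K
    # Silence harmonics above Nyquist to avoid aliasing.
    alias_mask = (f0_audio[:, None] * harmonics[None, :]) < sample_rate / 2
    return np.sum(amps_audio * alias_mask
                  * np.sin(phase[:, None] * harmonics), axis=1)

# Example: a 1-second 440 Hz tone with 10 harmonics of decaying amplitude.
frames = 250
f0 = np.full(frames, 440.0)
amps = np.tile(1.0 / np.arange(1, 11), (frames, 1))
audio = harmonic_synth(f0, amps)
```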
Below we show our synthesis results. We use reference SFX tracks as guiding sounds and extract their acoustic features, including the fundamental frequency, amplitude, transient components, and mel-spectrograms. We feed these features into our model to perform timbre transfer; ideally, the generated sound should be identical or close to the reference track.
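The following is a rough sketch of how these conditioning features could be extracted with librosa. The frame and hop sizes, the YIN pitch range, and the use of per-frame RMS as a loudness proxy are assumptions for illustration, not the exact DDSP-SFX settings; the DCT-based transient extraction is omitted here.

```python
# Extract the conditioning features described above from a reference track.
import numpy as np
import librosa

def extract_features(path, sr=16000, n_fft=1024, hop=256):
    y, _ = librosa.load(path, sr=sr)
    # Fundamental frequency contour via YIN.
    f0 = librosa.yin(y, fmin=40.0, fmax=2000.0, sr=sr,
                     frame_length=n_fft, hop_length=hop)
    # Amplitude contour: per-frame RMS as a simple loudness proxy.
    amplitude = librosa.feature.rms(y=y, frame_length=n_fft,
                                    hop_length=hop)[0]
    # Log mel-spectrogram conditioning for the decoder.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=128)
    return f0, amplitude, np.log(mel + 1e-6)
```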
Above, we can see that for impulsive sounds (footsteps and gunshots), DDSP relies heavily on its harmonic synthesizer, producing clearly audible harmonic artifacts. This is because we fix the number of harmonics in the harmonic synthesizer at 100 for consistency across experiments, and with this structure the decoder alone does not learn to attenuate the harmonic branch well. Our approach synthesizes inharmonic sounds easily with an indicative harmonic attenuator. Furthermore, our approach synthesizes impulsive sounds with sharper attacks, which can be heard in the gunshot sounds containing many fast repetitive impulses.
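One plausible reading of the "indicative harmonic attenuator" is a per-frame gate in [0, 1], predicted by the decoder, that scales the harmonic branch so the model can effectively switch it off for inharmonic, impulsive sounds. The sigmoid-gated module below is an illustrative stand-in under that assumption, not the paper's exact layer.

```python
# A hypothetical per-frame gate on the harmonic branch of the synthesizer.
import torch
import torch.nn as nn

class GatedHarmonicMixer(nn.Module):
    def __init__(self, hidden_size=512):
        super().__init__()
        self.gate = nn.Linear(hidden_size, 1)  # one attenuation value per frame

    def forward(self, decoder_hidden, harmonic_frames, noise_frames):
        # decoder_hidden: (batch, frames, hidden_size)
        # harmonic_frames, noise_frames: (batch, frames, hop) frame-aligned audio
        alpha = torch.sigmoid(self.gate(decoder_hidden))  # (batch, frames, 1)
        # alpha near 0 silences the harmonic branch for inharmonic sounds.
        return alpha * harmonic_frames + noise_frames
```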
Our model is also capable of performing timbre transfer from out-of-domain sounds. To showcase this, we use voice recordings as guiding sounds and extract their fundamental frequency and amplitude contours. We then feed in a user-specified latent vector z to vary the timbre. As a reference, we set z = 0 for all time frames; we then change the value of z at the 2- or 3-second mark to show how varying the latent variable contributes to timbre changes.
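A minimal sketch of this time-varying latent control is shown below: z is held at 0 for a baseline and then stepped to a new value partway through. The scalar z dimension, frame rate, and function name are assumptions for illustration.

```python
# Build a per-frame latent sequence that switches value at a chosen time.
import torch

def step_latent(n_frames, frame_rate=250, switch_time=2.0, z_after=1.0):
    z = torch.zeros(1, n_frames, 1)           # (batch, frames, z_dim)
    switch_frame = int(switch_time * frame_rate)
    z[:, switch_frame:, :] = z_after          # change timbre from 2 s onward
    return z
```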
From the spectrograms we can see a clear difference once the value of z changes. This shows that our latent vector achieves time-varying timbre control over the generated sounds: when z changes at a given time frame, the spectral characteristics change immediately without any audible glitch at the transition.
@inproceedings{liu2023ddspsfx,
  title={DDSP-SFX: Acoustically-guided sound effects generation with differentiable digital signal processing},
  author={Liu, Yunyi and Jin, Craig and Gunawan, David},
  booktitle={Proceedings of the International Conference on Digital Audio Effects (DAFx)},
  pages={216--221},
  year={2024}
}