DDSP-SFX: Acoustically-guided sound effects generation with DDSP

Yunyi Liu1,2*, Craig Jin1, David Gunawan2
1University of Sydney
2Dolby Laboratories, Inc

*Work done while interning at Dolby

Deep audio generative networks typically require large amounts of data and computational resources, while relying on extensive descriptive ground-truth labels to guide the sound generation. Differentiable digital signal processing (DDSP) is a family of neural audio synthesis models that employ digital signal processing modules as part of the neural network architecture. By incorporating pre-processed audio features as conditioning vectors along with pre-written digital synthesizers, it can synthesize high-quality musical audio and control the acoustic characteristics (pitch and loudness) of the sounds. However, due to limitations of its design, DDSP is weak at synthesizing impulsive signals and at providing nuanced control over the complex timbres of sound effects. To this end, we incorporate Verma's transient modelling method for synthesizing impulsive signals into the DDSP architecture. Below we compare sounds synthesized by DDSP and by our approach, DDSP-SFX. We use a reference sound effect and perform timbre transfer by extracting the required audio features as input to the decoder.


Transient Modelling

[Figure: transient guide extraction and synthesis pipeline]

In the preprocessing stage, we extract a guiding amplitude vector from the target sound effect following the steps shown above: compute the STFT, perform harmonic-percussive source separation on the signal, compute the spectral peak of the percussive component, and take the amplitude at the frames where a spectral peak is detected. The decoder uses this information to output the amplitude and frequency required for sinusoidal modelling in the DCT domain. Once the sinusoids are synthesized, we convert them to the time domain using the IDCT, which yields transient signals of various kinds at different time frames.
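As a rough illustration of these two stages, the sketch below extracts a percussive peak-amplitude guide with librosa's HPSS and then synthesizes a transient by inverting a DCT-domain sinusoid, whose frequency encodes where the transient lands within the frame. The thresholding rule and all parameter values are illustrative assumptions, not the exact settings used in our model.

```python
import numpy as np
import librosa
from scipy.fft import idct

def transient_guide(y, n_fft=1024, hop=256):
    """Per-frame percussive peak amplitudes, zeroed where no clear peak exists."""
    S = librosa.stft(y, n_fft=n_fft, hop_length=hop)
    _, P = librosa.decompose.hpss(S)           # keep the percussive component
    peaks = np.abs(P).max(axis=0)              # spectral peak magnitude per frame
    thresh = peaks.mean() + 2 * peaks.std()    # illustrative peak-detection rule
    return np.where(peaks > thresh, peaks, 0.0)

def dct_transient(amp, position, frame_len=256):
    """Synthesize one transient: a sinusoid in the DCT domain inverts to an
    impulsive time-domain signal; `position` (in samples) sets where it lands."""
    k = np.arange(frame_len)
    # The DCT-II of a unit impulse at n0 is cos(pi * k * (2*n0 + 1) / (2*N)),
    # so inverting this cosine concentrates the energy around n0.
    coeffs = amp * np.cos(np.pi * k * (2 * position + 1) / (2 * frame_len))
    return idct(coeffs, type=2, norm='ortho')
```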

Synthesis results

Below we show our synthesis results. We use reference SFX tracks as guiding sounds and extract their acoustic features, including fundamental frequency, amplitude, transient components, and mel-spectrograms. We feed these features into our model to perform timbre transfer. Ideally, the generated sound should be identical or close to the reference track.
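A minimal sketch of this feature extraction, using librosa as a stand-in for our exact pipeline (the f0 tracker, hop size, and mel settings here are illustrative assumptions):

```python
import numpy as np
import librosa

def extract_features(path, sr=16000, hop=256):
    y, _ = librosa.load(path, sr=sr)
    # Fundamental frequency contour (pYIN used here purely for illustration)
    f0, _, _ = librosa.pyin(y, fmin=50, fmax=2000, sr=sr, hop_length=hop)
    # Frame-wise amplitude envelope (RMS as a loudness proxy)
    amp = librosa.feature.rms(y=y, hop_length=hop)[0]
    # Log-mel spectrogram for the encoder conditioning
    mel = librosa.feature.melspectrogram(y=y, sr=sr, hop_length=hop, n_mels=128)
    return f0, amp, np.log(mel + 1e-6)
```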

Gunshots

[Audio examples: Reference (×2), DDSP (×2), DDSP-SFX (×2)]

Footsteps

[Audio examples: Reference (×3), DDSP (×3), DDSP-SFX (×3)]

Motors

[Audio examples: Reference (×3), DDSP (×3), DDSP-SFX (×3)]

From the examples above, we can hear that for impulsive sounds (footsteps and gunshots), DDSP relies heavily on its harmonic synthesizer, producing very audible harmonic artifacts. This is because we set the number of harmonics in the harmonic synthesizer to 100 for consistency across models, and with this structure the decoder by itself does not learn to attenuate the harmonic synthesizer well. Our approach synthesizes inharmonic sounds easily thanks to an indicative harmonic attenuator. Furthermore, our approach synthesizes impulsive sounds with sharper attacks, which can be heard in the gunshot sounds that contain many fast, repeated impulses.
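One way to realize such an attenuator is a per-frame sigmoid gate on the harmonic branch; the sketch below is our reading of the idea rather than the exact network head (shapes and names are assumptions):

```python
import torch

def gated_harmonics(gate_logits: torch.Tensor, harm_amp: torch.Tensor) -> torch.Tensor:
    """Attenuate the harmonic synthesizer with a learned per-frame gate.

    gate_logits: (batch, frames, 1) raw decoder output for the gate
    harm_amp:    (batch, frames, 1) harmonic amplitude envelope
    A gate near 0 silences the harmonic branch on inharmonic, impulsive
    frames, avoiding the harmonic artifacts heard in plain DDSP.
    """
    return torch.sigmoid(gate_logits) * harm_amp
```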


Voice-to-SFX timbre transfer

Our model is capable of performing timbre transfer from out-of-domain sounds. We use voice as the guiding sound to showcase this. We extract the fundamental frequency and amplitude contours from the voice, and feed in a user-specified latent vector z to vary the timbre. As a reference, we set z=0 for all time frames. We then change the value of z after 2 or 3 seconds to show how varying the latent variable contributes to timbre changes.
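Concretely, the z schedule for these examples looks like the following sketch (the frame rate, latent size, and `model.decode` call are hypothetical stand-ins):

```python
import numpy as np

# Hypothetical frame rate and latent dimensionality; the decoder consumes
# one z vector per frame, so we can switch timbre mid-sequence.
frame_rate, duration, z_dim = 62.5, 4.0, 1   # frames/s, seconds, latent size
n_frames = int(frame_rate * duration)

z = np.zeros((n_frames, z_dim), dtype=np.float32)  # z = 0: reference timbre
z[int(2.0 * frame_rate):] = 3.0                    # z = 3 from t = 2 s onward

# audio = model.decode(f0, amplitude, transient_guide, z)  # hypothetical API
```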

Gunshots

[Audio examples with spectrograms: Voice, z=0, z=3 after 2 s]

Footsteps

[Audio examples with spectrograms (×3 sets): Voice, z=0, z=3 after 2 s]

Motors

[Audio examples with spectrograms (×2 sets): Voice, z=0, z=3 after 2 s / after 3 s]

From the spectrograms we can see a clear difference when we change the value of z at a given time frame. This shows that our latent vector achieves time-varying timbre control over the generated sounds: changing it at a certain time frame alters the spectral characteristics immediately, without creating any audible glitch or distortion at the transition.

BibTeX

@inproceedings{liu2023ddspsfx,
  title={DDSP-SFX: Acoustically-guided sound effects generation with differentiable digital signal processing},
  author={Liu, Yunyi and Jin, Craig and Gunawan, David},
  booktitle={Digital Audio Effects Conference (DAFx)},
  year={2024},
  pages={216--221},
}