VoiceTailor: Lightweight Plug-In Adapter for Diffusion-Based Personalized Text-to-Speech (INTERSPEECH 2024)

This page and the demos below are intended solely for research demonstration.

Authors

Heeseung Kim gmltmd789@snu.ac.kr
Sang-gil Lee sanggill@nvidia.com
Jiheum Yeom quilava1234@snu.ac.kr
Che Hyun Lee saga1214@snu.ac.kr
Sungwon Kim sungwonk@nvidia.com
Sungroh Yoon (Corresponding author) sryoon@snu.ac.kr

Abstract

We propose VoiceTailor, a parameter-efficient speaker-adaptive text-to-speech (TTS) system that equips a pre-trained diffusion-based TTS model with a personalized adapter. VoiceTailor identifies the pivotal modules that benefit from the adapter through a weight change ratio analysis. We use Low-Rank Adaptation (LoRA) as the parameter-efficient adaptation method and incorporate the adapter into the pivotal modules of the pre-trained diffusion decoder. To achieve strong adaptation performance with few parameters, we explore various guidance techniques for speaker adaptation and investigate the best strategies for strengthening speaker information. VoiceTailor achieves speaker adaptation performance comparable to existing adaptive TTS models while fine-tuning only 0.25% of the total parameters. VoiceTailor also shows strong robustness when adapting to a wide range of real-world speakers, as shown in the demo.
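As a rough illustration of the low-rank adapter idea, the sketch below wraps one frozen linear weight with a LoRA update scaled by $\alpha / r$. It is a minimal NumPy sketch, not the VoiceTailor implementation: the class name `LoRALinear`, the layer size, the value of `alpha`, and the initialization are all assumptions made for illustration. Only the rank $r=16$ follows the default reported on this page.

```python
import numpy as np


class LoRALinear:
    """Frozen linear weight plus a low-rank update (hypothetical helper)."""

    def __init__(self, weight, rank=16, alpha=16.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight  # frozen pre-trained weight, shape (d_out, d_in)
        self.rank = rank
        self.alpha = alpha    # assumed value; the page does not state it
        # Common LoRA initialization: A is small random, B is zero,
        # so the adapter starts as an exact no-op.
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))
        self.B = np.zeros((d_out, rank))

    def __call__(self, x, alpha=None, enabled=True):
        y = x @ self.weight.T
        if enabled:
            # Low-rank update B @ A, scaled by alpha / r.
            scale = (self.alpha if alpha is None else alpha) / self.rank
            y = y + scale * (x @ self.A.T) @ self.B.T
        return y

    def trainable_fraction(self):
        # Share of parameters that would be fine-tuned for this one layer.
        trainable = self.A.size + self.B.size
        return trainable / (trainable + self.weight.size)
```

Passing `enabled=False` unplugs the adapter, and passing a larger `alpha` at call time rescales its contribution without retraining. The fraction returned by `trainable_fraction` describes this toy layer only and is unrelated to the 0.25% figure reported for the full model.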

LibriTTS Dataset

Among our baselines, YourTTS generates audio at 16 kHz. For a fair comparison, all demos below were therefore resampled to 16 kHz and normalized to -27 dB.
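The level normalization can be sketched as follows. This is a minimal sketch under one stated assumption: the page does not say how the -27 dB level is measured, so the code takes it as the RMS level relative to full scale (samples in [-1, 1]). The function names are hypothetical, and resampling is left to an audio library.

```python
import math


def rms_db(samples):
    """RMS level in dB relative to full scale (assumes samples in [-1, 1])."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms)


def normalize_rms(samples, target_db=-27.0):
    """Scale samples so that their RMS level equals target_db."""
    if not samples:
        return []
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        # Pure silence cannot be scaled to a target level.
        return list(samples)
    gain = 10.0 ** (target_db / 20.0) / rms
    return [s * gain for s in samples]
```

This sketch applies a single gain to the whole clip and does not guard against clipping, which only matters for signals with very high peaks relative to their RMS level.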

Model Comparison

Transcript: The second passed in height The first, and sought the forehead, and half missed, Half falling on the hair.

Reference	GT	Vocoder (BigVGAN)	VoiceTailor	UnitSpeech	XTTS v2	YourTTS

Transcript: But from the offer that came to teach Negroes-country Negroes, and little ones at that-she shrank, and, indeed, probably would have refused it out of hand had it not been for her queer brother, john.

Reference	GT	Vocoder (BigVGAN)	VoiceTailor	UnitSpeech	XTTS v2	YourTTS

Transcript: So, with a great many chucklings and shruggings when no one was by, he had departed after breakfast one day, simply saying he shouldn’t be back to lunch.

Reference	GT	Vocoder (BigVGAN)	VoiceTailor	UnitSpeech	XTTS v2	YourTTS

Ablation Studies

LoRA Rank $r$

Transcript: Then they started on again and two hours later came in sight of the house of dr Pipt.

Reference	$r=2$	$r=4$	$r=8$	$r=16$ (default)	$r=32$

Transcript: But when the pick was shipped again, Hans pointed out on its surface deep prints as if it had been violently compressed between two hard bodies.

Reference	$r=2$	$r=4$	$r=8$	$r=16$ (default)	$r=32$

Transcript: Also, a draft on futurity, sometimes honored, but generally extended.

Reference	$r=2$	$r=4$	$r=8$	$r=16$ (default)	$r=32$

Speaker Condition Strengthening Methods

w/o strengthening: No additional method is applied.
LoRA scale adjustment: A larger $\alpha$ value is used at inference than the $\alpha$ value used during fine-tuning.
Speaker embedding guidance: When calculating the unconditional score, the LoRA adapter stays plugged in and only the speaker embedding $e_S$ is replaced with the unconditional embedding $e_\phi$.
LoRA guidance: When calculating the unconditional score, the speaker embedding $e_S$ is provided as is and only the low-rank adapter is unplugged.
Embedding & LoRA guidance: When calculating the unconditional score, all speaker information is removed: $e_S$ is replaced with $e_\phi$ and the low-rank adapter is unplugged.
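The three guidance variants differ only in what is switched off when the unconditional score is computed. The sketch below makes that explicit. It assumes a generic classifier-free guidance combination, conditional score plus $\gamma_S$ times the difference between the conditional and unconditional scores; the exact formula used by VoiceTailor may differ. The names `guided_score` and `toy_score`, and the `score_fn(x, embedding, lora_on)` signature, are hypothetical.

```python
import numpy as np


def guided_score(score_fn, x, e_s, e_phi, method, gamma_s=1.0):
    """Combine conditional and unconditional scores for one guidance variant.

    score_fn(x, embedding, lora_on) is a hypothetical score network interface.
    """
    # Conditional score: speaker embedding e_S with the LoRA adapter plugged in.
    cond = score_fn(x, e_s, True)
    if method == "none":
        return cond
    if method == "embedding":
        # Adapter stays plugged in; only e_S is replaced with e_phi.
        uncond = score_fn(x, e_phi, True)
    elif method == "lora":
        # e_S is kept; only the adapter is unplugged.
        uncond = score_fn(x, e_s, False)
    elif method == "embedding+lora":
        # All speaker information is removed.
        uncond = score_fn(x, e_phi, False)
    else:
        raise ValueError(f"unknown guidance method: {method}")
    # Assumed classifier-free guidance form (not confirmed by this page).
    return cond + gamma_s * (cond - uncond)


def toy_score(x, emb, lora_on):
    """Stand-in for a score network, used only to exercise guided_score."""
    return -x + emb + (0.5 if lora_on else 0.0)
```

In this form, a larger $\gamma_S$ pushes the sample further along the direction that separates the speaker-conditioned score from the one with speaker information removed. LoRA scale adjustment is not a guidance method and is not covered here; it only changes the adapter scale $\alpha$ at inference.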

Transcript: And therefore, primarily, they must be able to divide so that elementary exercises in color must be directed, like first exercises in music, to the clear separation of notes and the final perfections of color are those in which, of innumerable notes or hues, every one has a distinct office, and can be fastened on by the eye, and approved, as fulfilling it.

Reference	w/o strengthening	LoRA scale adjustment $(2 \cdot \alpha)$	Speaker embedding guidance with $\gamma_S=1$ (default)	Speaker embedding guidance with $\gamma_S=2$	LoRA guidance with $\gamma_S=1$	LoRA guidance with $\gamma_S=2$	Embedding & LoRA guidance with $\gamma_S=1$	Embedding & LoRA guidance with $\gamma_S=2$

Transcript: He shrugged his shoulders in ungracious acquiescence, while our visitor in hurried words and with much excitable gesticulation poured forth his story.

Reference	w/o strengthening	LoRA scale adjustment $(2 \cdot \alpha)$	Speaker embedding guidance with $\gamma_S=1$ (default)	Speaker embedding guidance with $\gamma_S=2$	LoRA guidance with $\gamma_S=1$	LoRA guidance with $\gamma_S=2$	Embedding & LoRA guidance with $\gamma_S=1$	Embedding & LoRA guidance with $\gamma_S=2$

Transcript: If a man had stolen a pound in his youth and had used that pound to amass a huge fortune how much was he obliged to give back, the pound he had stolen only or the pound together with the compound interest accruing upon it or all his huge fortune?

Reference	w/o strengthening	LoRA scale adjustment $(2 \cdot \alpha)$	Speaker embedding guidance with $\gamma_S=1$ (default)	Speaker embedding guidance with $\gamma_S=2$	LoRA guidance with $\gamma_S=1$	LoRA guidance with $\gamma_S=2$	Embedding & LoRA guidance with $\gamma_S=1$	Embedding & LoRA guidance with $\gamma_S=2$

Real-world Scenarios

To avoid copyright and misuse issues, we have removed the real-world scenarios.