VoiceTailor: Lightweight Plug-In Adapter for Diffusion-Based Personalized Text-to-Speech (INTERSPEECH 2024)
This page and the demos below are intended solely for research demonstration.
Authors
- Heeseung Kim gmltmd789@snu.ac.kr
- Sang-gil Lee sanggill@nvidia.com
- Jiheum Yeom quilava1234@snu.ac.kr
- Che Hyun Lee saga1214@snu.ac.kr
- Sungwon Kim sungwonk@nvidia.com
- Sungroh Yoon (Corresponding author) sryoon@snu.ac.kr
Abstract
We propose VoiceTailor, a parameter-efficient speaker-adaptive text-to-speech (TTS) system, by equipping a pre-trained diffusion-based TTS model with a personalized adapter. VoiceTailor identifies pivotal modules that benefit from the adapter based on a weight change ratio analysis. We utilize Low-Rank Adaptation (LoRA) as a parameter-efficient adaptation method and incorporate the adapter into pivotal modules of the pre-trained diffusion decoder. To achieve powerful adaptation performance with few parameters, we explore various guidance techniques for speaker adaptation and investigate the best strategies to strengthen speaker information. VoiceTailor demonstrates comparable speaker adaptation performance to existing adaptive TTS models by fine-tuning only 0.25% of the total parameters. VoiceTailor shows strong robustness when adapting to a wide range of real-world speakers, as shown in the demo.
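As a rough illustration of the LoRA idea described in the abstract, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the standard LoRA form, in which a frozen weight $W$ is augmented with a low-rank update scaled by $\alpha / r$. The class name `LoRALinear`, the layer size, the initialization, and the default $\alpha$ are all hypothetical; only the default rank $r=16$ is taken from this page.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer plus a low-rank update (alpha / r) * B @ A.

    Hypothetical sketch: names, shapes, and the default alpha are assumptions.
    """

    def __init__(self, weight, rank=16, alpha=16.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight  # frozen pre-trained weight, never updated
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))  # trainable
        self.B = np.zeros((out_dim, rank))  # trainable, zero init: update starts at 0
        self.rank = rank
        self.alpha = alpha

    def __call__(self, x, alpha=None, use_lora=True):
        y = x @ self.weight.T
        if use_lora:
            # Passing a larger alpha at inference scales up the adapter's effect.
            scale = (self.alpha if alpha is None else alpha) / self.rank
            y = y + scale * (x @ self.A.T) @ self.B.T
        return y

    def trainable_fraction(self):
        """Share of this layer's parameters that belong to the adapter."""
        lora = self.A.size + self.B.size
        return lora / (lora + self.weight.size)


layer = LoRALinear(np.ones((256, 256)), rank=16)
print(layer.trainable_fraction())
```

The fraction printed here is for a single toy layer only. The 0.25% figure in the abstract is measured against the parameters of the whole model, with the adapter added only to selected modules of the diffusion decoder.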
LibriTTS Dataset
Among our baselines, YourTTS generates speech at 16 kHz. For a fair comparison, all demos below were therefore resampled to 16 kHz and normalized to -27 dB.
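The page does not say how the loudness normalization was performed. As one possible reading, the sketch below scales a waveform so that its RMS level is -27 dB relative to full scale. The RMS interpretation and the function names `rms_db` and `normalize_rms_db` are assumptions for illustration, and resampling is left out.

```python
import numpy as np


def rms_db(wav, eps=1e-12):
    """RMS level of a waveform in dB relative to full scale (1.0)."""
    rms = np.sqrt(np.mean(np.square(wav)))
    return 20.0 * np.log10(max(rms, eps))


def normalize_rms_db(wav, target_db=-27.0, eps=1e-12):
    """Apply a single gain so the waveform's RMS level equals target_db."""
    rms = np.sqrt(np.mean(np.square(wav)))
    gain = 10.0 ** (target_db / 20.0) / max(rms, eps)
    return wav * gain


# One second of a 440 Hz tone at 16 kHz as a stand-in for a generated sample.
t = np.arange(16000) / 16000.0
wav = 0.5 * np.sin(2.0 * np.pi * 440.0 * t)
normalized = normalize_rms_db(wav)
print(round(rms_db(normalized), 3))
```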
Model Comparison
Transcript: The second passed in height The first, and sought the forehead, and half missed, Half falling on the hair.
Reference | GT | Vocoder (BigVGAN) | VoiceTailor | UnitSpeech | XTTS v2 | YourTTS |
---|---|---|---|---|---|---|
Transcript: But from the offer that came to teach Negroes - country Negroes, and little ones at that - she shrank, and, indeed, probably would have refused it out of hand had it not been for her queer brother, John.
Reference | GT | Vocoder (BigVGAN) | VoiceTailor | UnitSpeech | XTTS v2 | YourTTS |
---|---|---|---|---|---|---|
Transcript: So, with a great many chucklings and shruggings when no one was by, he had departed after breakfast one day, simply saying he shouldn’t be back to lunch.
Reference | GT | Vocoder (BigVGAN) | VoiceTailor | UnitSpeech | XTTS v2 | YourTTS |
---|---|---|---|---|---|---|
Ablation Studies
LoRA Rank $r$
Transcript: Then they started on again and two hours later came in sight of the house of Dr. Pipt.
Reference | $r=2$ | $r=4$ | $r=8$ | $r=16$ (default) | $r=32$ |
---|---|---|---|---|---|
Transcript: But when the pick was shipped again, Hans pointed out on its surface deep prints as if it had been violently compressed between two hard bodies.
Reference | $r=2$ | $r=4$ | $r=8$ | $r=16$ (default) | $r=32$ |
---|---|---|---|---|---|
Transcript: Also, a draft on futurity, sometimes honored, but generally extended.
Reference | $r=2$ | $r=4$ | $r=8$ | $r=16$ (default) | $r=32$ |
---|---|---|---|---|---|
Speaker Condition Strengthening Methods
- w/o strengthening: No additional method is applied.
- LoRA scale adjustment: A larger $\alpha$ value is used at inference than the one used during fine-tuning.
- Speaker embedding guidance: The unconditional score is computed with LoRA still plugged in; only the speaker embedding $e_S$ is replaced with the unconditional embedding $e_\phi$.
- LoRA guidance: The unconditional score is computed with the speaker embedding $e_S$ kept as is; only the low-rank adapter is unplugged.
- Embedding & LoRA guidance: The unconditional score is computed with all speaker information removed, that is, with $e_S$ replaced by $e_\phi$ and the low-rank adapter unplugged.
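The guidance variants above differ only in how the unconditional score is obtained. The sketch below makes that concrete under two assumptions: the scores are combined in the usual classifier-free guidance form, $\hat{s} = s_{\text{cond}} + \gamma_S (s_{\text{cond}} - s_{\text{uncond}})$, and the decoder is replaced by a hypothetical stand-in, `toy_score`. Neither the function names nor the toy score come from the paper.

```python
import numpy as np


def toy_score(x_t, speaker_emb, lora_on):
    """Hypothetical stand-in for the diffusion decoder's score estimate."""
    lora_term = 0.5 * x_t if lora_on else 0.0
    return -x_t + speaker_emb + lora_term


def guided_score(x_t, e_s, e_phi, gamma_s=1.0, mode="embedding"):
    """Combine conditional and unconditional scores for each guidance variant."""
    # The conditional score always uses the speaker embedding and the adapter.
    cond = toy_score(x_t, e_s, lora_on=True)
    if mode == "none":  # w/o strengthening
        return cond
    if mode == "embedding":  # swap e_S for e_phi, keep LoRA plugged in
        uncond = toy_score(x_t, e_phi, lora_on=True)
    elif mode == "lora":  # keep e_S, unplug LoRA
        uncond = toy_score(x_t, e_s, lora_on=False)
    elif mode == "both":  # remove all speaker information
        uncond = toy_score(x_t, e_phi, lora_on=False)
    else:
        raise ValueError(f"unknown guidance mode: {mode}")
    return cond + gamma_s * (cond - uncond)


x_t = np.ones(4)
e_s = np.full(4, 2.0)   # toy speaker embedding
e_phi = np.zeros(4)     # toy unconditional embedding
for mode in ("none", "embedding", "lora", "both"):
    print(mode, guided_score(x_t, e_s, e_phi, gamma_s=1.0, mode=mode))
```

LoRA scale adjustment is not a guidance method in this sense: it changes the adapter's scaling factor $\alpha$ at inference rather than combining two score estimates.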
Transcript: And therefore, primarily, they must be able to divide so that elementary exercises in color must be directed, like first exercises in music, to the clear separation of notes and the final perfections of color are those in which, of innumerable notes or hues, every one has a distinct office, and can be fastened on by the eye, and approved, as fulfilling it.
Reference | w/o strengthening | LoRA scale adjustment $(2 \cdot \alpha)$ | Speaker embedding guidance with $\gamma_S=1$ (default) | Speaker embedding guidance with $\gamma_S=2$ | LoRA guidance with $\gamma_S=1$ | LoRA guidance with $\gamma_S=2$ | Embedding & LoRA guidance with $\gamma_S=1$ | Embedding & LoRA guidance with $\gamma_S=2$ |
---|---|---|---|---|---|---|---|---|
Transcript: He shrugged his shoulders in ungracious acquiescence, while our visitor in hurried words and with much excitable gesticulation poured forth his story.
Reference | w/o strengthening | LoRA scale adjustment $(2 \cdot \alpha)$ | Speaker embedding guidance with $\gamma_S=1$ (default) | Speaker embedding guidance with $\gamma_S=2$ | LoRA guidance with $\gamma_S=1$ | LoRA guidance with $\gamma_S=2$ | Embedding & LoRA guidance with $\gamma_S=1$ | Embedding & LoRA guidance with $\gamma_S=2$ |
---|---|---|---|---|---|---|---|---|
Transcript: If a man had stolen a pound in his youth and had used that pound to amass a huge fortune how much was he obliged to give back, the pound he had stolen only or the pound together with the compound interest accruing upon it or all his huge fortune?
Reference | w/o strengthening | LoRA scale adjustment $(2 \cdot \alpha)$ | Speaker embedding guidance with $\gamma_S=1$ (default) | Speaker embedding guidance with $\gamma_S=2$ | LoRA guidance with $\gamma_S=1$ | LoRA guidance with $\gamma_S=2$ | Embedding & LoRA guidance with $\gamma_S=1$ | Embedding & LoRA guidance with $\gamma_S=2$ |
---|---|---|---|---|---|---|---|---|
Real-world Scenarios
To avoid copyright and misuse issues, we have removed the real-world scenarios.