Voice Cloning with Tortoise TTS Model

Jul 18, 2024

Voice Cloning for Any Language Using the Tortoise TTS Model

Introduction

  • Objective: Fine-tune the Tortoise TTS model for any language.
  • Example Language: German.

Nvidia RTX 3080 TI GPU Giveaway

  • Eligibility: Attend Nvidia's 2024 GTC Conference and send proof of attendance (screenshot).

Getting Started

Prerequisites

  • Dataset: Speech data in the target language (e.g., a German dataset containing 97 hours of audio from 117 speakers).
  • Dataset Structure: Follow the LJSpeech format.
    • A folder with the audio samples plus metadata files that pair each audio filename with its transcription.
    • Includes train and validation subsets.
    • Example metadata line: identifier|waveform file|transcription (see the sketch after this list).
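For illustration, an LJSpeech-style layout might look like the following; the folder, file, and speaker names are hypothetical, and only the pipe-separated line format matters:

    dataset/
        wavs/
            spk01_0001.wav
            spk01_0002.wav
        train.txt
        val.txt

    # One metadata line per sample in train.txt and val.txt:
    spk01_0001|wavs/spk01_0001.wav|Guten Morgen, wie geht es dir?
    spk01_0002|wavs/spk01_0002.wav|Heute scheint die Sonne.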

Download and Format Dataset

  • Download Script: Provided in the video/script info.
  • Unzip and Format: Adjust the folder structure and create the dataset.
    • Example steps: Move all audio files into one folder, adjust the filenames, and split into train/validation subsets (a minimal sketch follows).
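Below is a minimal Python sketch of the train/validation split, assuming the audio files have already been moved into one folder and a metadata file in the identifier|waveform file|transcription format exists. All paths and the 95/5 split ratio are assumptions, not taken from the video.

    import random

    # Hypothetical paths; adjust to wherever the unzipped dataset lives.
    METADATA = "dataset/metadata.csv"   # lines: identifier|waveform file|transcription
    VAL_FRACTION = 0.05                 # assumed split ratio

    with open(METADATA, encoding="utf-8") as f:
        lines = [line.strip() for line in f if line.strip()]

    random.seed(42)          # make the shuffle reproducible
    random.shuffle(lines)

    n_val = max(1, int(len(lines) * VAL_FRACTION))
    val_lines, train_lines = lines[:n_val], lines[n_val:]

    for path, subset in (("dataset/train.txt", train_lines), ("dataset/val.txt", val_lines)):
        with open(path, "w", encoding="utf-8") as f:
            f.write("\n".join(subset) + "\n")

    print(f"{len(train_lines)} training lines, {len(val_lines)} validation lines")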

Fine-Tuning the Tortoise TTS Model

Preprocessing

  1. Transliterate/Normalize Text: Modify the text normalization function (cleanup.py); see the sketch after this list.
    • Use a transliteration library (e.g., a German transliteration library) to handle language-specific characters.
  2. Special Characters: Add the special characters for your language in symbols.py (both lowercase and uppercase).
  3. YAML Configuration: Modify the YAML file (config.yaml) to specify the training and validation data.
    • Example changes: custom language name, tokenizer vocabulary, paths, number of iterations.
  4. Tokenizer Training: Train a tokenizer on the transcriptions (tokenizer sketch after this list).
    • Create a single file with all the transcriptions concatenated.
  5. Adjust Sampling Rate: Ensure every audio sample has a sampling rate of 22.05 kHz (resampling sketch after this list).
  6. Training: Use the modified scripts to train the model on your dataset.
    • Example steps: Fork the GitHub repository, clone and adjust the code, and train the model.
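A sketch of the normalization and symbol changes for steps 1 and 2, using German as in the video. The hard-coded umlaut mapping stands in for whatever transliteration library you choose, and the exact file and variable names in your fork may differ.

    # cleanup.py -- illustrative German text normalization.
    # The explicit mapping stands in for a transliteration library.
    GERMAN_MAP = {
        "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
        "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    }

    def german_cleaner(text: str) -> str:
        """Transliterate German-specific characters, then lowercase."""
        for src, dst in GERMAN_MAP.items():
            text = text.replace(src, dst)
        return text.lower()

    # symbols.py -- add the language-specific characters (lower- and uppercase)
    # if you keep them in the text instead of transliterating them away.
    _letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    _german_letters = "ÄÖÜäöüß"
    symbols = list(_letters) + list(_german_letters)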
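For step 4, one way to train a tokenizer on the concatenated transcriptions is the Hugging Face tokenizers library; the file names, vocabulary size, and special tokens below are assumptions that should be matched to your YAML configuration.

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import BpeTrainer

    # all_transcriptions.txt: every transcription concatenated, one per line
    # (hypothetical file name).
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()

    trainer = BpeTrainer(
        vocab_size=255,                                 # assumed; match your config
        special_tokens=["[STOP]", "[UNK]", "[SPACE]"],  # assumed special tokens
    )
    tokenizer.train(files=["all_transcriptions.txt"], trainer=trainer)
    tokenizer.save("german_tokenizer.json")             # referenced from config.yaml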
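For step 5, a small resampling sketch using librosa and soundfile; the folder name is assumed from the formatting step, and the script overwrites files in place, so work on a copy of the dataset.

    import os

    import librosa
    import soundfile as sf

    TARGET_SR = 22050          # 22.05 kHz, as required for fine-tuning
    WAV_DIR = "dataset/wavs"   # hypothetical folder from the formatting step

    for name in os.listdir(WAV_DIR):
        if not name.endswith(".wav"):
            continue
        path = os.path.join(WAV_DIR, name)
        # librosa resamples on load when sr is given
        audio, _ = librosa.load(path, sr=TARGET_SR)
        sf.write(path, audio, TARGET_SR)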

Changes to Code and Configuration

  • Fork the Repository: Fork the GitHub repository so you can make and track your own changes.
  • Download and Install Modules: Follow the provided scripts and use specific library versions as needed (e.g., Transformers library version 4.29.2).
  • Tokenizer Training Code: Adjust text cleaners and special characters regex.
  • Adjust Cleaning and Punctuation Handling: Improve the overall quality of the generated speech.
  • Additional Changes: Modify fine-tuning script if necessary.

Training Execution

  • Running the Training: Example provided for NVIDIA GPUs.
  • Depending on GPU power, training can take anywhere from a few hours up to a day.
  • Checkpoint Saving: Checkpoints are saved at the specified interval (every 500 steps).
    • Important: Stop training only after a checkpoint has been saved, so no progress is lost.
  • Resume Training: Update the config file to use the last checkpoint for further fine-tuning.

Post Training

  • Access the fine-tuned model checkpoint files from the specified directory.
  • Next Steps: Use the fine-tuned model to generate speech in your language (covered in part 2).

Conclusion

  • Future videos: A detailed walkthrough of creating your own speech dataset and adjusting the inference code for multiple languages.
  • Reminder: Subscribe to the channel for more tutorials and updates.