Transcript for:
Voice Cloning with Tortoise TTS Model

hello and welcome back to another video guys. In this video I will show you how you can use voice cloning for any language, and for this we will use the Tortoise TTS model and fine-tune it for any language. In my particular case I will fine-tune it for the German language, but I will also show you how you can fine-tune the Tortoise TTS model for your language, and as always I would say we just jump right into it, let's go. But before we start I have great news for you guys: one of you can win this NVIDIA RTX 3080 Ti GPU with 12 GB of VRAM, Tensor Cores of course, and 912 GB per second of memory bandwidth. And what do you need to do to win this GPU? First, attend NVIDIA's 2024 GTC conference, and second, send me a screenshot as proof of attendance. That's it. The GTC conference is happening online and in person. In case you haven't heard about the GTC conference yet, it covers a wide range of topics in the field of AI, giving you a great idea of what's coming next in AI. There are more than 600 sessions, and people from all major players in the field of AI like Meta, OpenAI, Google DeepMind, NVIDIA, or Runway ML will be holding talks and sessions. Personally, I found the "What's Next in Generative AI", the "Fastest Stable Diffusion in the World", as well as the "Human-like AI Voices: Exploring the Evolution of Voice Technology" talks very interesting, and yeah, good luck to everyone and don't miss out on this one. All right, and the first thing that we need in order to be able to fine-tune the Tortoise TTS model is a dataset that contains speech data of a particular language. For example, I will fine-tune the Tortoise TTS model using a German dataset, and in particular I will be using the following dataset, which stems from a paper published in 2021: it is a high-quality text-to-speech dataset that contains 97 hours of audio from 117 speakers. And you might be thinking right now: okay, well, he's using an already existing dataset, but where should I get my speech or audio data from in order
to be able to fine-tune the Tortoise TTS model? And I have to say, that's a very valid point, and if there weren't already an existing German dataset, I would ask the same question. So for that reason, in a future video I will show you how you can create your own audio dataset from publicly available speech data, and if that sounds interesting to you, make sure to subscribe to my channel. But for the rest of this video, I will show you how you can fine-tune the Tortoise TTS model for any language if you already have an existing speech dataset in your particular language. The first thing I needed to do is to download the dataset that I mentioned, and it contains around 15 GB of speech data, so this can definitely take a while. You can see here that the estimated time to download this is around 12 hours, so I will fast-forward this step, and yeah, don't worry, this is not usually the case, but I think the server is pretty slow, so I can't download it faster. All right, and once you have your dataset downloaded, we need to format the dataset to the LJSpeech format. LJSpeech is also a publicly available dataset, which is where the naming convention comes from. Imagine you have one folder called dataset, and in this folder you have all the audio speech samples with the following naming convention: you see the sample name, which can be a number or a more descriptive text for the audio speech, but basically just an identifier, then we have the .wav file extension, then we have a pipe as a delimiter, and then we have the transcription of this file. So the transcription of the speech file that we have here is contained in the file name. This is the LJSpeech format: we have basically one big folder containing all the speech samples, and in their file names we have the transcriptions. This is basically the first preprocessing step, formatting your speech dataset to this format so we can use the dataset to fine-tune the Tortoise model. I have already done this before
recording this video, so I skipped the 10 hours of downloading this dataset, but here's the code to format this dataset to the LJSpeech format. For this, we first need to unzip the dataset, then we adjust the dataset structure so that all train and validation waveform files are located in an individual folder. This dataset has a nested structure, and we want to overcome this by putting all the waveform files in one folder called dataset. Therefore we iterate through the nested files and also add the transcription, which is always stated in a metadata CSV file. Once we have an array containing tuples of the old path and the new path that we would like to create, we then divide this dataset into train and validation subsets, and here we use a ratio of 85%, so 85% of all the speech samples will be used for training and 15% for validation. Then we create a folder called dataset, and in this folder a subfolder called wavs, and in the wavs folder we will store all the speech samples. In the dataset folder we will store two text files called train and val (or validation) containing all the file names of the training data and the validation data. Then, basically as a last step, we can create a zip file for this dataset folder. I have done this previously, that's why I didn't download the whole dataset again, but this is basically the code I used to transform this dataset into the LJSpeech format, which I then can use in the next step. And in case you're thinking, hey, I actually would like to create my own speech dataset, but how long should one audio sample be and how should I construct the dataset: as I mentioned, I will cover this in a future video, but for now I would recommend you to either search on Google for "TTS dataset" plus your language, or check out Hugging Face, where you can click on Datasets and then activate the text-to-speech filter, and here you can see that there are 122 publicly available speech datasets, for example here you can see
the Ukrainian language, an English dialect, Chinese, or Swedish. So those are some recommendations from my side, which can also save you some time. Here you can also see, for the dataset that I used, that the samples have a minimum length of 5 seconds and the majority of them are shorter than 15 seconds, as you can see here, and the overall average is 9.5 seconds for the speech samples I used. Now that we have our preprocessed dataset, we can move forward and adjust the code that we will use to fine-tune the Tortoise TTS model, and for this we will add special letters for our particular language. For example, for the German language I will add the following letters, which are ä, ö, ü, and ß, or, like it's pronounced, a double s. Per default, the Tortoise TTS model is trained for the English language and therefore also only supports letters of the English alphabet, and since you probably want to support all the letters that are used in your alphabet, we need to manually add those letters to the vocabulary. For this, we need to make some adjustments to the fine-tuning code that is available at this GitHub repository, and all the changes that are necessary I've listed here. The first thing that we will do is open this repository, and I would recommend you to fork it, so you click here on "Fork: your own copy" and just create the fork. All right, and now you can see I forked this, and then you can click on this link. I then typed git clone and copied the link of the repository to download or clone the repository to my local computer, and opened the repository with my local IDE, which in my case is Visual Studio Code, but feel free to use your preferred IDE, it doesn't matter at all. Here you can see I now opened the forked repository on my local computer, and now we can go back to the list of changes that we need to make. We will start with the cleaners.py file, which is stored under the following path, and in this file you can find, at the very end, the method called english_cleaners, which takes an input text and more or less normalizes or transliterates the input text. You might be wondering what exactly is going on here. We can see we convert it to ASCII, and I looked into it: the Tortoise model uses the unidecode module, and here we can see that for this input text the output is the following, so we kind of normalize it and convert non-ASCII characters to ASCII characters, so the text is safely encoded, in a way, using this module. And since those characters or letters are exactly the ones we are interested in if we're working with languages that are not English, we obviously want to keep those letters, and for this reason we need to alter or adjust the code here. I used the following replacement, which we can see here: I use the library german_transliterate, which already does a lot of the heavy lifting for me, and you can find the repository here. To give you a brief idea of how this works: for example, using it we can transliterate abbreviations like "ABC" into a phonemic version, and this is the German pronunciation. Therefore, if this is our input text, this will be the result of our transliteration, so it's less about the actual letters and more about how they sound to us. Here we pass our input text to this library, which does some of the work, but otherwise just imagine that here you want to make sure that the text you pass to your model is clean; this kind of adds a safety level for generating good-quality speech results. Then we transform the text to lowercase and remove multiple whitespaces, and then we also remove any quotation marks in the text. And since I use this library, I need to import it at the beginning, at the start of the file here, and that's it, then we can save this file again. And in case there is no transliteration library for your language, what you can see here is kind of what such a
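As a rough sketch of what the adjusted cleaner does, something like the following; here a toy abbreviation table stands in for the german_transliterate library that the video actually uses, so the dictionary entries and function name are illustrative only:

```python
import re

# Toy stand-in for german_transliterate: expand a few abbreviations and
# symbols into their spoken German form (hypothetical entries).
ABBREVIATIONS = {
    "z.B.": "zum Beispiel",
    "Nr.": "Nummer",
    "%": " Prozent",
}

def custom_cleaners(text):
    """Mimics the adjusted english_cleaners: transliterate, lowercase,
    collapse whitespace, and strip quotation marks, while keeping
    language-specific letters like ä, ö, ü and ß intact."""
    for abbreviation, spoken in ABBREVIATIONS.items():
        text = text.replace(abbreviation, spoken)
    text = text.lower()                # lowercase, non-ASCII letters survive
    text = re.sub(r"\s+", " ", text)   # collapse multiple whitespaces
    text = text.replace('"', "")       # drop quotation marks
    return text.strip()
```

The key difference from the original english_cleaners is that nothing here forces the text through unidecode, so ä, ö, ü, and ß pass through untouched.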
library does: you have abbreviations like "Mrs" or "Mr", and here is the way each would be pronounced. For each abbreviation you kind of have a dictionary in which you replace all those inputs with the phonemic version of the abbreviation, so you could craft such a normalization step yourself. But as I said, while it makes the generation of speech overall more robust, it is not a mandatory step to 100% replicate this for your language. Then we can move on to the next file, which is this one, and add our special characters to the letters variable in line 12. This file is in the same folder here, called symbols.py, and what we will do here is just add our special characters at the end. For me those would be the following, and here we also add them in uppercase. I'm not sure exactly why those are also added in uppercase, because internally we don't process those uppercase letters, but just to be safe I also added them here in uppercase. Then we go to the experiments folder and copy the example GPT .yml file, and this copy we rename to custom_language_gpt.yml. So here we have our custom language file, and then in line 1 we can type custom_language_gpt. Here we can replace this with just train_dataset, then we change the path of our training dataset to this one, and add in line 29, as a new line, the tokenizer vocabulary attribute, because we use a custom tokenizer, in our case for our specific language. Then we do the same for the validation dataset, where we name it val_dataset, change the path to this one, and also add the tokenizer vocabulary at the end. As a last step, we will change the number of iterations from 50,000 to 5,000, just to make the fine-tuning shorter, and yeah, that's all we need to do for this file, and then we can also save the custom_language_gpt.yml file. Then we will open the .gitignore file and add the following statement; important is the exclamation mark (something like `!custom_language_gpt.yml`), which excludes our custom_language_gpt.yml file from the ignore list. Up here it is stated that all YAML files should be ignored, but by specifically stating this at the end of the file, that rule is overridden and our custom_language_gpt.yml file will still be added to our GitHub repository. Then we will open the requirements.laxtx.txt file, which is stored inside the folder codes, and what we will do here is pin a specific version of the Transformers library, because using a newer version led to an error for me, and therefore I fixed the package version to 4.29.2, which makes sure that the fine-tuning code will work. While fine-tuning, I ran into an error caused by the utils.py file, which is stored under codes/utils, and here we need to replace this statement in line 25 with the following: it seems this import is deprecated, and the new way is to import infinity directly from torch (so something like `from torch import inf` instead of the old import). So also make sure to make this adjustment to the code to be able to run it, and then we also save this one. Then we will add all our changes to the repository by first typing `git add .` and then `git commit -am`, calling it "changes for custom language fine tuning". All right, and then I'll just push these changes to my Git repository. Technically you could just work on your local computer in case you have a local GPU; otherwise I would recommend creating a fork and also pushing your changes to your GitHub repository, and then you can, for example, use a cloud GPU or Colab to use the changes you made. All right, and what we then can do is either return to our notebook or just run this notebook inside your forked repository. In case you're working, for example, on a cloud GPU instance like a Colab notebook, you can then clone your repository using the git clone command. As you can see here, I'm cloning my forked DL-Art-School repository, then I navigate into the DL-Art-School directory or folder, and then into the codes folder, and in there I will install all the required modules using the requirements.laxtx.txt file, and we can just run this cell by typing Command+Enter. All right, once all required modules are installed, we can then move on and download the model weights for the vector quantized variational autoencoder and the autoregressive model, which is a GPT-2 model, and in this video we will fine-tune the autoregressive model. The Tortoise model, or Tortoise architecture, overall consists of four models, which I've recently covered in a series of videos; feel free to check them out to further understand the meaning of the vector quantized variational autoencoder and the autoregressive model. To download the model weights for both models, we will run the following cell, and once the model weights for both models are downloaded, what we then will do is create a text file that contains all transcriptions as a source to train a tokenizer. With all transcriptions we mean all the transcriptions inside our dataset, including the training and validation subsets, and since this is also the data that we expect during our training and also later during inference, the transcriptions are perfectly suitable to train a tokenizer. And you might be wondering, especially if you have some experience with large language models, why would we now train a tokenizer? Because usually, if you train a tokenizer and then train a large language model on tokens encoded by that tokenizer, you can't replace or exchange the tokenizer after the training; there is a strict dependency of the large language model on the tokenizer. In this case we have a GPT-2 model which generates mel tokens, and you might be wondering why we would now create a new tokenizer, because the model also learned to understand the tokens encoded by the original tokenizer. I find this very confusing, but for some reason it actually works if you train a new tokenizer and pass it to the model, so the model is, during fine-tuning, able to adapt to the new tokens. I'm kind of curious how much the model actually relies on the tokens, so
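For orientation, the edits to the copied custom_language_gpt.yml described above might end up looking roughly like the fragment below. The key names, paths, and values here are illustrative and will depend on the repository version you forked, so treat this as a sketch rather than the exact file:

```yaml
name: custom_language_gpt             # line 1: the run name
datasets:
  train:
    name: train_dataset
    path: ./dataset/train.txt         # our LJSpeech-formatted training list
    tokenizer_vocab: ./tokenizer.json # added line: custom tokenizer for our language
  val:
    name: val_dataset
    path: ./dataset/val.txt
    tokenizer_vocab: ./tokenizer.json # same custom tokenizer for validation
train:
  niter: 5000                         # shortened from the default 50,000 iterations
```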
feel free to do some more experiments, in addition to this video, on why the tokenizer actually can be replaced while still obtaining great results when fine-tuning the autoregressive model of the Tortoise architecture. To create the text file containing all transcriptions, we iterate through the training and validation subsets: we read the train text file and also the validation text file, in which all the waveform files are stated. So we iteratively read the lines, and for each line, which is one waveform file, we split it by the pipe; by using the second element we obtain the transcription, which we then can strip, and then by prepending a blank we can create a, let's say, fluent text. So all the transcriptions are just separated by a blank, which gives us one very big text, and this serves as the input to our tokenizer. Then we write this text containing all transcriptions to the transcriptions text file and can use it as a source to train a tokenizer. If you want, after running this you can also have a quick look at the transcriptions text file by just running the cat command on it, and here you can see the content of the file. If you're not from Germany or don't speak German, this probably doesn't make much sense to you and it's hard to read, but I can confirm this is German. So this is what we will use to train our tokenizer. To train the tokenizer, I first wanted to use the voice tokenizer Python file inside the fine-tuning repository, but for some reason there were some issues with relative imports, and I think the code was written for Python 2, so since it seemed to me to be a little extra work, I just decided to copy the code to train the tokenizer, because it's not all that much. All you need to do inside this code is to first exchange this text_cleaners method, and here we use the same code that we used earlier to alter the method english_cleaners inside the cleaners.py file. So you can just copy the method english_cleaners from that file and paste it here. Again, I'm using the german_transliterate library, but as I said earlier, you don't necessarily need to use such a transliteration library; it just overall improves the speech quality when, for example, having abbreviations. Then there is another method to remove punctuation, so in a way, again, cleaning the input text. One more thing that you need to do is to add the special letters for your language: there is a regex that allows only a certain set of characters in the input text, and by default it's from a to z, only lowercase. So what I did is add those four letters, the ä, ö, ü, and double s (ß) of the German language; for your language, just remove them and add whatever is needed, for example an é for Spanish, just add the letters that you need for your language here. All right, and then you can just run this cell by pressing Command+Enter as always, and this might take like 20 seconds, depending on how many transcriptions or how much text data you have, and then the tokenizer will be saved as a JSON file. Basically, the tokenizer learns to efficiently represent your language: depending on the language, specific sequences of characters occur more often or less often, so this is individual for every language, and it therefore makes sense to train a tokenizer that is optimized for your particular language. Here we can now see this is done. Then, as a last preprocessing step, which I haven't mentioned at the start but is definitely important for the Tortoise model: the Tortoise architecture takes as input audio samples with a sampling rate of 22.05 kHz, and the dataset I was using originally had a sampling rate of 44 kHz, so I had to resample all audio samples. In case that's also the case for your dataset, I just pasted the code that I used here, but in case your dataset already has a sampling rate of 22.05 kHz, you can just skip this step. And what
this code basically does is it checks, for each of your audio samples, whether it has a sampling rate of 22.05 kHz, and if it doesn't, the audio sample will be resampled and the waveform audio will be overwritten with the new sampling rate. So yeah, feel free to use it in case your dataset doesn't already have a sampling rate of 22.05 kHz. All right, and now, after all the prep work, we're finally able to fine-tune the autoregressive model of the Tortoise architecture. For this we just run the following statement, using the train.py file and passing our custom_language_gpt.yml file, the config file for our fine-tuning that we created and edited earlier. And yeah, as always, just type Command+Enter and then we can run the fine-tuning. Depending on the GPU that you're using, this can take hours to even a day, so maybe grab one or two cups of coffee and watch a movie, and then come back and explore the fine-tuned Tortoise model in your language. When you run this, you should see something like the following: first we can see all the training arguments that we defined in our configuration file, then we can see the model architecture, and also that the model was loaded here, and then the start of the training. Here we see the first epoch, which has the number zero, and during one epoch our model is trained on the entire training dataset. We can see that for our training data, one epoch contains 160 steps, and in total we train our model for 5,000 steps, so the model will roughly be trained for 30 epochs. If you feel like the training takes too long and you're eager to check out how the fine-tuned model works already, you can also just stop the training at any time by pressing, for example, this button. In the config that we have chosen, every 500 steps a model checkpoint will be saved, so you can always see the step here: this is step 1,000, and at step 1,000 one model gets saved, so you can see here "info: saving model", and then at a later step you can see that at step 1,500 the model gets saved again. I'm honestly not 100% sure when this interval would change, and I think this is also dependent on the size of the dataset. So feel free to interrupt the training of the model, but what I would recommend you to do is to make sure that you interrupt it exactly at, for example, step 2,000 or 2,500, because if you interrupted the model at step 2,300, you would lose the training progress of the last 300 steps, which you obviously don't want. In case you figure out that the fine-tuned model isn't as good as you would like, you can always resume the training. For this you can change, in the custom_language_gpt.yml file in line 122, the pretrain_model_gpt entry: instead of using the original autoregressive model path, you pass your training checkpoint file here, and therefore it can initialize the autoregressive model with your training checkpoint and continue training your already fine-tuned model. Once you're done with the fine-tuning, you can find the checkpoint of the fine-tuned model under this path, so it's in experiments/custom_language_gpt and then models. For example, we can have a quick look and see that there are five different checkpoint files of the model weights for the fine-tuned GPT model: here you can see it after 500 steps, 1,000 steps, and with the most progress at 2,500 steps, and that last one is the one we will use. Now you might ask yourself: okay, now we have this fine-tuned GPT model, but how can we use it to generate speech? And that's an excellent question, which we will cover in part two, in which I will show you how you can adjust the inference code of the Tortoise model so that it can also generate speech for your language. In that part two video I will also share the winner of the RTX 3080 Ti NVIDIA GPU, so definitely make sure to also watch the second video. And as always, I would appreciate it if you subscribe to my channel and like this video, and I'm looking forward to seeing you in the next video 
bye-bye
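As a small footnote to the resume-training step mentioned above, the change to the path section of custom_language_gpt.yml would look roughly like the fragment below. The key name comes from the video; the exact section layout and the checkpoint filename are hypothetical examples that depend on your run:

```yaml
path:
  # originally this pointed at the downloaded autoregressive model weights;
  # to resume, point it at a saved training checkpoint instead, e.g.:
  pretrain_model_gpt: ./experiments/custom_language_gpt/models/2500_gpt.pth
```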