Transcript for:
Voice Cloning and Text-to-Speech Models

with voice cloning you can narrate your own content in your own voice. This includes making realistic voiceovers for podcasts, audiobooks, or personal assistants. In this video I'll explain how to make high-quality voiceovers using just 20 minutes of audio. I'll start by explaining the basics of how text-to-speech models work, then I'll cover dataset preparation, and last of all I'll explain how to fine-tune a model using the dataset you've prepared. If you're new to text-to-speech models, don't worry, because I'll go through a Jupyter notebook step by step from start to finish. And if you're more advanced, I'll describe in detail how I handle very challenging voices, like my Irish accent. And here we go with text to speech. I'll start off describing how to build a really simple model, and I'll do that with a few neural network approaches: transformers, diffusers, and generative adversarial networks. That's going to be useful because it lays the foundation for explaining how to control style. I'll explain how to tweak the style of generated audio in a naive way, and then I'll do it using the StyleTTS approach, or StyleTTS 2 actually. Then I want to highlight a little around how voice cloning works versus fine-tuning: what the differences are, the trade-offs, and when you might want to use one or the other, before getting into the meaty part of the presentation, where I'll run through a notebook for data preparation and then one for fine-tuning the StyleTTS 2 model. And as usual, I'll finish off with a few final tips. So here's a very simple text-to-speech model: it takes in "the quick brown fox" and it generates an output sound wave over here. Now let's make some quick pre-processing steps on the input. Rather than just putting in text, we're actually going to split it into tokens, and those tokens will go into our neural network. This should look familiar if you've done any work with models like Llama: you tokenize into subwords, and those subwords are what go into the neural network. Now, because this is sound, you might want to tokenize not just with words or subwords but with what are called phonemes. What are phonemes? They're characters used to represent the sounds of different parts of words. So here, for example, we have the sounds for "the quick brown" and then "fox". These phonemes are easily generated; you can check a notebook like the one I've linked, which you can access for free, where I import what's called a phonemizer and then put the text through it. You do need to tell it what language, and it will then create some phonemes, and these are the phonemes I got back out. They're one-to-one representations of the sounds that are in words: for example, if I make a sound like "p", there's a specific phoneme that represents that, and you can see intuitively why that might be a useful representation if you ultimately want to generate sound.
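Just to make that phoneme step concrete, here's a minimal sketch of the kind of call I'm describing, using the open-source phonemizer package; the backend and options are assumptions on my part and may differ from the linked notebook.

```python
# Minimal sketch of text-to-phoneme conversion with the `phonemizer` package.
# The espeak backend and the flags shown are assumptions; adjust to your setup.
from phonemizer import phonemize

text = "The quick brown fox"
phonemes = phonemize(
    text,
    language="en-us",          # the language must be specified
    backend="espeak",
    preserve_punctuation=True,
    with_stress=True,
)
print(phonemes)  # IPA-style phoneme string, one-to-one with the sounds in the words
```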
Now let's take a quick look at the right-hand side, where we're generating a sound wave. Of course, we're going to generate our best estimate for the sound that represents "the quick brown fox", and this generated best estimate is what I've shown here in red. This red sound wave that we generate is compared to the actual sound wave that we presumably have in our training dataset for "the quick brown fox", which I've represented here (or a very small snippet of it) in blue. The idea of this graph is to show you a comparison of the original wave with the generated wave. That comparison is important because we need to grade the quality of our neural network and its outputs, and one simple way to do that is just by comparing the distance between the red and the blue lines. More specifically, we could look at, say, mean squared error, which is just the difference between red and blue, squared, and then averaged across the whole sound wave. If that value is large, it means there's a large discrepancy, and if it's zero it means we've matched the generated and the original exactly. That value — the mean squared error, or the loss, as we commonly call it — is useful because with that loss we can backpropagate through the neural network and update the weights to produce a higher-quality network. And that's the idea of training these neural networks: you take some input where you know what the truth is (the blue), you generate the red, you compare, you calculate how good or bad it is (the loss), and you backpropagate that loss to update the network. If you keep doing this, and you've got enough data and the right design of neural network, in principle you can eventually converge on a model that produces a red line that's pretty closely matched to the blue. Now, to make that specific, you might have a neural network that takes in up to 512 tokens, where each token represents a phoneme, and perhaps this network is capable of producing 10 seconds of audio. If you're producing 10 seconds of a sound wave at 24 kHz — 24 kHz means 24,000 sample points per second — the network would have to produce 240,000 output magnitudes to represent those 10 seconds of audio. Actually, if you have 512 input phonemes, that's probably closer to a minute or more of speech, so you would need maybe six times that, around 1.5 million output magnitudes, to be predicted in order to generate the sound wave.
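Here's a toy PyTorch sketch of that comparison and of the sample-count arithmetic; the random tensors are just placeholders, not a real model.

```python
import torch

sample_rate = 24_000                   # 24 kHz = 24,000 samples per second
seconds = 10
n_samples = sample_rate * seconds      # 240,000 magnitudes for 10 s of audio

# Stand-ins: in a real model, `generated` would come from the network's forward pass.
generated = torch.randn(n_samples, requires_grad=True)   # the red (generated) wave
original = torch.randn(n_samples)                        # the blue (ground-truth) wave

loss = torch.mean((generated - original) ** 2)  # mean squared error between the waves
loss.backward()                                 # gradients flow back to the network weights
```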
Alright, let's keep going deeper into how this model works. I've got some text input in phoneme representation, and I'm getting into more detail now on exactly what this neural network is. I propose that it could be one of three things: a transformer-type network, a diffuser, or a GAN, a generative adversarial network. Let me describe each of these in series. The first is the transformer approach. Here I'm proposing — and I'm not saying this is a good idea, but just to build intuition around how you might do it — that you could take some input text, tokenize it, and feed it into maybe a BERT or a Llama model, which is a transformer. What do I mean by transformer? I mean a multi-layer neural network where each layer has both attention and multi-layer perceptrons. What's attention? Attention is where a layer pays attention to the previous tokens in the sequence, or the later tokens too if it's bidirectional. And what's a multi-layer perceptron? This is where we have what are called activations — SiLU, GeLU, ReLU — which make neurons either output a really high value or else zero; they turn inputs into outputs that look more like zeros or ones. This is the core of transformers: they typically have attention, and they typically have activation functions within multi-layer perceptrons. Now, I've described the transformer, but the second part here is also important: it's a linear layer that makes sure we get the right size of output. Remember, to make 10 seconds of audio at 24 kHz we need 240,000 output magnitudes, so if our transformer doesn't naturally generate that many outputs, we need some kind of transformation or final layer to bring us to the output size we need to produce. And the way we train this: we have some input text, we make a prediction in red, we calculate the difference with the blue, and we backpropagate and update the weights, not just in the linear layer but also in the transformer. The second approach I want to cover is the diffuser approach, which is also used for making images in DALL·E or Stable Diffusion. This is the diffusion-type approach, and hopefully you'll see the analogy when I describe it for sound. The easiest way to think about it is to consider starting with a clean sound wave. This would normally be, say, a voice in a training dataset; here I've just shown a simple sine wave because it's illustrative. The first thing we do in diffusion training is add a little bit of noise to that clean sound wave to get a moderately noisy sample, and then we add more noise to get a very noisy sample. So now we have three samples — the clean training sound, the moderate noise, and the high noise — and we use these different levels of noise to train a neural network that's able to predict the noise going from one level to the other. Specifically, we might take the moderately noisy sample as input and, knowing that its clean version corresponds to "the quick brown fox", feed both into a neural network that tries to predict the noise required to get back to the clean sample. We compare that noise prediction to the actual difference between the input and the clean sample, and to the extent there's a mismatch, we use that error, or loss, to update the weights of the network. You can see that if we do this on many samples, the network gets better and better at predicting the noise that's present in a sample and that needs to be removed to get back to the clean one. We can also do this training at multiple noise steps: we can go from a little noisy to no noise, or from very noisy to a moderate level of noise. So here we take in a very noisy sample plus a representation of the underlying text, we ask the network to predict the noise needed to get to a slightly less noisy sample, we compare the predicted noise to the actual difference between those samples, and we use the model's error to backpropagate and improve its prediction capabilities.
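Here's a toy PyTorch sketch of one diffusion training step as just described: add noise to a clean wave, ask a network to predict that noise conditioned on the text, and penalize the difference. The NoisePredictor module and its conditioning are hypothetical, just to make the flow concrete — this is not any particular model's architecture.

```python
import torch
import torch.nn as nn

class NoisePredictor(nn.Module):
    """Hypothetical stand-in: predicts the noise in a waveform, conditioned on text."""
    def __init__(self, n_samples, text_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_samples + text_dim, 512),
            nn.SiLU(),
            nn.Linear(512, n_samples),
        )
    def forward(self, noisy_wave, text_embedding):
        return self.net(torch.cat([noisy_wave, text_embedding], dim=-1))

n_samples, text_dim = 24_000, 128          # 1 s of 24 kHz audio, toy text-embedding size
model = NoisePredictor(n_samples, text_dim)

clean = torch.randn(1, n_samples)          # stand-in for a clean training wave
text_emb = torch.randn(1, text_dim)        # stand-in for the "quick brown fox" embedding
noise = 0.3 * torch.randn_like(clean)      # the noise we add at this step
noisy = clean + noise                      # the moderately noisy sample

pred_noise = model(noisy, text_emb)        # the network predicts the noise that was added
loss = torch.mean((pred_noise - noise) ** 2)
loss.backward()                            # backpropagate to improve the noise predictor
```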
Once this model is trained — and it's been trained to predict noise at various steps of the noising process — we can move to inference, where we're trying to generate clean sound. In inference, what we do is actually start off with pure noise, maybe Gaussian noise or some other random type of noise, along with the text we want to generate, and we ask the neural network: predict the incremental noise that should be removed to get to a slightly cleaner sample. That gives us a slightly cleaner sample, which hopefully matches the intermediate noise level, and then we iterate: we take that slightly better sample, give the network the baseline text again, ask it to predict the noise to get to an even cleaner sample, and we get an even cleaner sample. Hopefully, by going through every one of the steps, we get back to a final wave that is actually clean and represents the input text. So this is diffusion, and it allows us to go from pure noise to a clean sample that represents the input text. This is how image diffusion works too, except the noise there is pixel noise: you start with random pixels in your image, ask the model to remove or clarify some of them, and keep removing noise until you get a clean image out at the end. The third approach is generative adversarial networks, and this is probably the most popular approach for doing text to speech right now. Here you have some text coming in; again, you tokenize it, maybe with phonemes, maybe with subwords or words, and you put it into some kind of neural net that we call a generator — we call it a generator because it generates the sound. Nothing too different here from the transformer approach; it doesn't have to be a transformer, though — it could be a convolutional neural network, a residual network, any of those types — but the key thing is that we have one neural network that will generate the sound, and we have a second network called a discriminator. The discriminator's role is to take whatever the generator creates, compare it with the original sound, and try to predict whether that sound is real or fake — or rather, which one is real and which one is fake. This is useful in two ways. First, if the discriminator incorrectly thinks the original sound is fake, that means it's in error, and to the degree its predictions are in error, we can backpropagate that error to update and improve the discriminator: by training it on generated versus original sounds and comparing its predictions to the ground truth, we can backpropagate its error and improve that network. But it's useful for a second reason: if the generator is doing a really good job, it will be very difficult for the discriminator to tell the generated and the original apart, so the extent to which the generator is able to fool the discriminator is a measure of the generator's performance. Let me say that differently: if the discriminator is rarely fooled, the generator is doing a bad job, which means the generator's loss, or error, is going to be high; whereas if the discriminator is doing a bad job — in other words, it can't discriminate — the generator is doing a great job and the generator's error is quite low. So the discriminator is effectively a measure of the generator's performance and error, which means we can use the discriminator to update the weights of the generator's network. And this is the idea of the adversarial network: it's adversarial because both networks improve during training. The discriminator improves itself by comparing to ground truth, and as it gets better and better, it imposes more of a loss on the generator, forcing the generator to get better and better too. Now, one of the drawbacks of these GANs, generative adversarial networks, is that you can get stuck at a local optimum: if the discriminator isn't really able to improve anymore, that makes it hard for the generator to improve.
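To make that generator/discriminator interplay concrete, here's a toy PyTorch sketch; the tiny MLPs are placeholders, not a real TTS architecture, and real training would use separate optimizers with gradient zeroing between the two updates.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_samples = 24_000                               # 1 s of 24 kHz audio (toy size)
generator = nn.Sequential(nn.Linear(128, 512), nn.SiLU(), nn.Linear(512, n_samples))
discriminator = nn.Sequential(nn.Linear(n_samples, 256), nn.SiLU(), nn.Linear(256, 1))

text_emb = torch.randn(1, 128)                   # stand-in for encoded text/phonemes
real_wave = torch.randn(1, n_samples)            # stand-in for the original recording

fake_wave = generator(text_emb)                  # the generator produces a candidate wave

# Discriminator loss: it should score the real wave as 1 and the generated wave as 0.
d_real = discriminator(real_wave)
d_fake = discriminator(fake_wave.detach())       # detach so only the discriminator learns here
d_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
         F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))

# Generator loss: low when the discriminator is fooled into scoring the fake as real.
g_loss = F.binary_cross_entropy_with_logits(discriminator(fake_wave), torch.ones_like(d_fake))

d_loss.backward()    # improve the discriminator...
g_loss.backward()    # ...and use it as the training signal for the generator
```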
Now, having explained those three networks — the transformer, the diffuser, and the generative adversarial network — what I want to point out is that at no point here have I explained how voice cloning works; I've simply explained how to generate a voice based on input text. In fact, if you train a model like this, the voice you get is going to be some average of the voices in your dataset. Maybe there'll be some dependence on the input text — for example, if the input text uses American spelling, maybe your output will be a slightly more American voice, because that's the type of voice that tends to be matched with American spelling in your dataset — but this is not going to be good enough if you want to clone somebody's accent, or your own accent. It doesn't give you enough fine-grained control over the style, and that brings us to the next part of the video, where I'll talk about how to control style, first in a naive way and then in the StyleTTS 2 type of approach. If we want to control style, that means that when we train a neural network, there needs to be some input that affects style. Probably the most naive way you could do this is to input a sound snippet. For example, you could take one second of a representative audio sound wave — at 24 kHz that's 24,000 magnitudes — and simply concatenate it as an input alongside the input tokens that go into the model. If you now train this model with a representative sound snippet that is paired with, and relevant to, "the quick brown fox" and to the original output sound used for comparison with the generated one, you'll end up with a model where, in principle, you should be able to control the input text and the reference style independently. You can extend this line of thinking: if I just want to control the pitch, I could have an input for pitch, where the pitch is the mathematically calculated pitch of a reference snippet, so if I train in this way I'll be able to control the pitch associated with the sound I want to generate. Or you could control the speed: you have a parameter here, just a single magnitude, maybe a normalized speed, and at inference you'd then be able to control the speed. Now, if you want to make this a little more general, you can think about having a vector — maybe a 128-dimensional vector — that somehow represents the style, and this is moving in the direction of the StyleTTS approach. How this could work is that you take a reference snippet of audio and pass it through an encoder, a transformer, maybe BERT or Llama style, and you take an output — maybe a pooled average, but a single vector, maybe 128 dimensions as I said — and that represents the style of the reference snippet. That reference style is then used as an input to the neural network that also takes in the input text. When you're training this overall network, on the forward pass you generate the sound and compare it to the original, and on the backward pass you update not just your main neural network but also this style encoder, so that you're able to represent reference snippets more accurately as meaningful style vectors.
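Here's a toy sketch of that naive style conditioning: encode a reference snippet into a small style vector and feed it to the generator alongside the text. The modules are placeholders for illustration, not StyleTTS code.

```python
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    """Toy encoder: reference audio -> 128-dim style vector via a mean-pooled projection."""
    def __init__(self, style_dim=128):
        super().__init__()
        self.proj = nn.Linear(1, style_dim)     # treat each sample as a 1-d "token"
    def forward(self, wave):                    # wave: (batch, n_samples)
        feats = self.proj(wave.unsqueeze(-1))   # (batch, n_samples, style_dim)
        return feats.mean(dim=1)                # pooled average -> (batch, style_dim)

style_encoder = StyleEncoder()
generator = nn.Sequential(nn.Linear(256 + 128, 512), nn.SiLU(), nn.Linear(512, 24_000))

text_emb = torch.randn(1, 256)                  # stand-in for encoded phonemes
ref_wave = torch.randn(1, 24_000)               # 1 s reference snippet at 24 kHz

style_vec = style_encoder(ref_wave)             # the 128-dim style representation
wave_out = generator(torch.cat([text_emb, style_vec], dim=-1))
# During training, the loss on `wave_out` backpropagates into both the generator and the
# style encoder, so the encoder learns to turn reference snippets into meaningful style vectors.
```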
Moving up another level of complexity, you can take a look at the StyleTTS 2 approach, and while I won't cover it in full detail, I'll highlight the key aspects that distinguish StyleTTS 2 from the more naive approaches. The first is that it uses a generative adversarial network: as I described earlier, in order to generate the output sound wave it uses a discriminator and a generator. Importantly, in the StyleTTS 2 approach, this discriminator is actually taken off the shelf, and that's a benefit, because it doesn't need to be trained from scratch; it's already performing somewhat well when we start the whole training. Now, where do you get an off-the-shelf discriminator? Well, there are open-source models, like Whisper for example, that convert audio into text. Those models already convert a sound wave into some vector representation, and then they go from that vector representation back through embeddings into a token representation. If you rip off that last part — going from vectors to tokens — and replace it with a classification head of real or fake, you've still got all of that pre-trained vector encoding, which is going to be useful. That's one of the ideas in StyleTTS 2: instead of training a discriminator from scratch, use an off-the-shelf model that's just slightly modified by adding a classification head that says real or fake at the end. Another benefit of using the generative adversarial network over a diffusion-type approach is that GANs usually run once through and make a prediction, whereas diffusion approaches go step by step, making them a bit slower during generation. Now, this generative adversarial network takes in two broad types of information. The first is a style vector, which controls the style of the output, and the other is textual information. It's not raw text that goes in: it actually gets converted into phonemes, as I described earlier, and then into a vector representation using a BERT type of encoder, so what goes in is a vector representation of the words — but it's still textual information. This here is style information, and combining the two, it's able to generate a sound. Moving a little further upstream, in order to get the style vector, StyleTTS 2 uses a diffusion approach: it diffuses out the style vector from noise. It starts off with just a random vector and diffuses from that random vector into a clean vector representing style, and what guides that diffusion is not raw words; it's two different things. The first is the reference snippet, which comes in here as an input, and the second is textual information, which — again, like with the acoustic encoder — is not sent in as text but rather as a vector representation of that text that has gone through a BERT-type encoder. So the style is diffused out by combining information from the reference audio and from the text. Now, why does that make sense? Well, when you hear audio like I'm speaking right now, it's going to be a function of two things: it's a function, first of all, of Ronan and the accent I have, but it's also a function of the words I'm saying. If I'm talking about "the sad man was weary on a somber day", those words are going to be more somber, and so the words affect my style. And of course, when you have a reference snippet, the underlying text for that snippet is not the same as the text you're actually trying to generate, so you need to combine information about the accent and style of the speaker with information about the topic they're talking about, and by combining those two you get to the right style vector for the text you're generating.
So this is a very loose description of how StyleTTS 2 works; there's more complexity I'm not showing. For example, the generative adversarial network actually takes in information on the pitch, which is obtained from the reference snippet and from the phonemes; it takes in information on the energy level as well; and it takes in information on the duration that should be assigned to each of the phonemes, and that duration information is basically extracted from a combination of the reference snippet and, again, the input text. But in broad terms, you can just think of it as taking in style information and text-based information, albeit in vector form, and then directly generating a sound wave — generating the magnitudes required for a given sampling rate and a given duration of output sound. With that, it's time to talk a little about voice cloning versus fine-tuning. Voice cloning — people probably have different meanings for exactly what that is, but one interpretation is that voice cloning is where you simply take a reference snippet of a voice, and by inputting that during inference you get the output snippet. If you do it that way, then once you have the pre-trained model it's quite simple: you just need a few seconds, or maybe 30 seconds, depending, but it's going to be a short snippet, and you don't need to do any fine-tuning — it's literally just an input that's provided along with the text and generates an output sound. Now, the StyleTTS 2 models are trained on a wide variety of voices, so if your input sample is roughly within the accent set of those voices, then StyleTTS 2 is probably going to work fairly well. However, if you have an Irish accent and there's literally no Irish data in the training dataset, or very little, then whatever style vector gets generated is probably going to be outside the dataset — outside the norms for the model — and it's going to be difficult to get a voice that sounds like it. In general, for text-to-speech models, if you have a model that's good with American accents and you want to do an Irish or an Indian accent, it's probably going to be difficult, just because there isn't a whole lot of that data in the dataset. So voice cloning, if it does work and you're roughly within the dataset, is nice because it's fast and requires no fine-tuning, but it is limited to accents within that dataset. Whereas if you have an accent that's not in the dataset, that's when you need to fine-tune: you need to literally update the parameters in these models so that they're more familiar with the data you want to impose. This means it's a bit slower and requires some continued training; however, it will generate high-quality samples, especially if you have a unique accent. And this is what I'm going to show you today: prepare a dataset with my accent, with my voice from the Trelis YouTube channel, see how we can get a model fine-tuned on that, and then I'll show you, through inference, how you do voice cloning using just a little snippet. Before I show you the notebooks for dataset preparation, just a few notes on what we're going for. I'll say this now — you might not catch it all, but hopefully you'll catch it the second time, when we go through the notebook line by line. As with fine-tuning any AI model, we need high-quality data.
That means having high-quality audio, ideally 48 kHz (even though we'll downsample to 24 kHz), and you want minimal noise as well. The next thing is that we're going to segment that data into maximum-length snippets. The core of StyleTTS 2 involves a lot of BERT-type models, which have 512 tokens as their input limit, so we need to measure how long our snippets are within the raw data and break the data into segments that are no longer than 512 tokens. This turns out to be about one to two minutes of audio per segment. Now, how you chop those segments is important as well: you don't want to chop audio so that you end up in the middle of words or in the middle of phonemes, and you do want the audio to be aligned very well with the text. This can be done using a model like WhisperX, which not only transcribes your audio into text but also works at the phoneme level, so it matches the audio with every puff of sound: if you say "p", it aligns that "p" with the "p" sound in the audio. Another key point is that you want segments to start 200 milliseconds before the first phoneme being pronounced and to finish 200 milliseconds after the last phoneme. In other words, you don't want the audio to start right when I say "p"; you want to include maybe 200 milliseconds of buffer before, and you don't want audio snippets to end exactly as I finish pronouncing a phoneme; you want a little bit of padding at the end too — not silence, I mean you actually want to include the 200 milliseconds of real audio afterwards. And last of all, if there are very small gaps between words, you probably don't want to include the full 200 milliseconds of padding there, because you would accidentally include a portion of the next phoneme. The approach you can take is to split the gap between the two in half and just include half of the gap, because hopefully then you won't include any of the next phoneme and you avoid any overlap.
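Here's a small sketch of that padding rule in Python; the function names and the example timestamps are illustrative rather than the actual Trelis script.

```python
PAD_S = 0.200   # 200 ms of buffer before the first phoneme and after the last one

def pad_amount(gap_s: float) -> float:
    """Per the rule above: use the full 200 ms unless the gap to the neighbouring
    speech is smaller, in which case take half of that gap to avoid overlap."""
    return PAD_S if gap_s >= PAD_S else gap_s / 2.0

def padded_bounds(seg_start, seg_end, prev_speech_end, next_speech_start):
    start = seg_start - pad_amount(seg_start - prev_speech_end)
    end = seg_end + pad_amount(next_speech_start - seg_end)
    return start, end

# Example: speech from 10.00 s to 22.30 s; previous speech ended at 9.95 s (50 ms gap),
# next speech starts at 23.00 s (700 ms gap) -> 25 ms of padding before, 200 ms after.
print(padded_bounds(10.00, 22.30, 9.95, 23.00))   # (9.975, 22.5)
```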
Now, just before I get started, I want to show you all the materials you'll need for the fine-tuning. First of all, there's the paper, which I will link in the description. I want to note the authors' comments around potential misuse: they have requested that users inform those listening to samples synthesized by StyleTTS 2 that they're listening to synthesized speech, or obtain informed consent around the use of the synthesized sample, and I do want to respect those restrictions, or guidances, provided by the authors. There are two repositories you can make use of if you want to do the fine-tuning and dataset preparation. The first is the repo prepared by the authors of the paper, which you can use online. The second, as is typically the case with these Trelis videos, is the Trelis repository, Trelis Advanced Transcription; you can purchase that for a once-off fee for lifetime membership, and it contains the customized notebooks I'll go through today. Just so you have the links (I'll put them in the description below): here is the StyleTTS 2 GitHub repo, containing scripts around dataset preparation and fine-tuning, and here is the paid lifetime-access repo, Advanced Transcription. It was previously used when I did the video on speech-to-text, so it contains all of the scripts for fine-tuning Whisper models if you want to cleanly convert speech in different accents into text, and it now contains a folder called text-to-speech that provides an overview of the StyleTTS 2 model as well as custom scripts around dataset preparation and fine-tuning like I'll be going through today. So let's head over to the dataset curation script that I have here. I recommend running this in Colab, especially if you want to download data from YouTube, because Colab interacts seamlessly with YouTube, whereas you have to log in if you do it outside of Colab, and that can lead to restrictions on downloading the data. What I'm going to do is download the audio from a YouTube video — the Trelis top 10 tips for fine-tuning video. You can check that out if you like; it's primarily about LLM fine-tuning. I'll take that audio, which is about 25 minutes long, split it into chunks that are just under 512 tokens in length, and then push those chunks of text, rather in phoneme form, up to Hugging Face Hub along with the WAV files, which are the sound files that go with them. We'll take this step by step, maybe increasing the size of the screen a little. A few installations are required first: we have some libraries needed for converting to phonemes, like espeak and phonemizer, and we're going to use yt-dlp for finding and downloading the YouTube video (you'll see that installation down below). After those installations are done, we create some folders: we'll want folders for the segmented audio, and we'll want a folder for the SRT file, which is going to be the transcription of the audio I download from YouTube. So I'll start with just audio and convert that to audio plus text, because remember we need both audio and text for the fine-tuning, and there's a folder here for audio. As I said, we're going to use yt-dlp; you could maybe use pytube, but I find there are sometimes issues with that for downloading, so I've installed yt-dlp, and that allows me to download a video. Here I have the Trelis Research video on top 10 tips, so that file has been downloaded, and if I check here on the side of my screen I should have the audio; you can see it's been downloaded as "top 10" — you're just prompted for a name, I put in "top 10", and that's how it saves the file. It's actually downloaded as MP3 (it says MP4 there, but it's really MP3), and then we convert it into WAV, because that's the typical format used for training. So here it's being converted to WAV, and the original MP3, I think, was deleted. The next step, now that we have the audio, is to transcribe it. You'll recall I have a video on Whisper, where I fine-tune Whisper; it's a transcription tool for going from audio to text. WhisperX is a slight modification of that, and it improves on it in two ways. First, it slices out silence in the WAV file, which reduces noise; what I mean by silence is that when there's no actual speech, it removes that. Second, it uses a phoneme model to get word-level timestamps, so if it hears the "p" sound it gets a specific timestamp for it, and that's very useful for aligning the text and the audio. That's further useful for us because, as a consequence, we can cleanly snip the audio into small segments.
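For reference, the download and transcription steps look roughly like this in Python; the URL, filenames, and model size are placeholders, and the yt-dlp and WhisperX options may differ slightly from the notebook depending on your installed versions.

```python
import yt_dlp
import whisperx

# 1) Download audio from YouTube and extract it as WAV (requires ffmpeg on the system).
ydl_opts = {
    "format": "bestaudio/best",
    "outtmpl": "audio/top10",   # placeholder output name
    "postprocessors": [{"key": "FFmpegExtractAudio", "preferredcodec": "wav"}],
}
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=VIDEO_ID"])  # placeholder URL

# 2) Transcribe with WhisperX, then align to get word-level timestamps.
device = "cuda"
model = whisperx.load_model("large-v2", device)
audio = whisperx.load_audio("audio/top10.wav")
result = model.transcribe(audio)
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
aligned = whisperx.align(result["segments"], align_model, metadata, audio, device)
# `aligned["segments"]` now carries precise timestamps that can be written out as SRT.
```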
So we're going to transcribe the audio using WhisperX, and it will be transcribed into SRT format. Just to give you a quick look at SRT, this is what it looks like: it has timestamps and it has the text, and as I mentioned, it's phoneme-aligned, so this very specific timestamp of 3 seconds and 372 milliseconds should correspond to the "four" sound, and that alignment allows us to cleanly snip between all of those points. It looks like I've converted to SRT, and next I'm going to split it up into segments. Now, to be able to generate segments, we need to understand how the model is eventually going to tokenize, and the tokenization is, I mean, kind of simple but kind of complicated for the StyleTTS 2 model, so let's take a look. Basically, it tries to convert the input text into a dictionary of the following symbols: there's a symbol for padding, these punctuation symbols, these letters, and then these IPA letters, which include all sorts of symbols used for phonemes. So there's a dictionary made by combining all of these symbols: you have the symbols, and they're converted into a dictionary. What does converting to a dictionary mean? It means assigning a number to each of the tokens in the dictionary. Now, this is a dictionary with a fairly low vocabulary size: for Llama you often have around 30,000 tokens in the dictionary, and for Llama 3 I think it's around 128,000 — it seems that larger vocabularies are better — but this is a very low vocab size, I think on the order of 100 to 200. So the tokenizer, if you call it a tokenizer, is a dictionary with a relatively small size. There are a few steps involved in tokenizing, and what we're tokenizing is the input text. If you remember, in the SRT file we're basically trying to create segments, so let me describe how it works manually. If I have a limit of 512 tokens, I take the first line, add in the second line, put those two together, and calculate how many tokens I have. If it's less than 512, I add a third line; if that's still less than 512, I add a fourth; and I keep doing that until I'm over 512. Once I exceed it, I go back and say, OK, don't add that last one, let's just stop, because now I have a segment that's within 512. But to do that, I need to be able to count tokens, and that's what's happening here. To count tokens in a string of text, there's a series of cleaning steps, and after cleaning, the text is converted: the text cleaner cleans it up, removing excess padding and things like that, and eventually it goes through the dictionary, where we convert from the raw text into this dictionary of tokens, and lastly we count those tokens. Those are roughly the steps involved; it's easiest to just look at an example. We have "hello world", and that goes into this function to count the tokens, and you can see the tokens coming out. Notice how it has actually been converted into phonemes first — there are phonemes here for "hello" and "world" — and this is then split into individual tokens of that dictionary. So here we have 18 tokens (there are actually, I think, 17 phonemes), and this is basically how we count the number of tokens.
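Here's a rough sketch of that counting-and-accumulating logic; the count_tokens body is schematic (the real script cleans the text and maps each phoneme character through the StyleTTS 2 symbol dictionary), and srt_lines is a placeholder.

```python
from phonemizer import phonemize

def count_tokens(text: str) -> int:
    """Rough stand-in for the real token counter: the actual script cleans the text and
    maps phoneme characters through the symbol dictionary; here we just count the
    characters of the phonemized string as a proxy."""
    return len(phonemize(text, language="en-us", backend="espeak").strip())

def build_segments(srt_lines, max_tokens=512):
    """Greedily merge consecutive SRT text lines until adding one more would exceed max_tokens."""
    segments, current = [], ""
    for line in srt_lines:
        candidate = (current + " " + line).strip()
        if current and count_tokens(candidate) > max_tokens:
            segments.append(current)   # stop before exceeding the limit
            current = line             # start a new segment with the line that overflowed
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments

# Usage (hypothetical SRT text lines):
# segments = build_segments(["First subtitle line.", "Second line.", "Third line."])
```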
Once we can count the number of tokens in a given string, we can run that script on the SRT file I just described, where we keep adding lines until we get towards the 512-token target. When we create those segments of up to 512 tokens of text, we also save the corresponding audio, so we concatenate the audio to match each segment of text, and we also add in padding of 200 milliseconds at the start and the end, so that the segments don't start right on a phoneme sound and don't end right at the end of one; there's a little audio included before and after. And as I mentioned before, if the gap between segments is really small, less than 200 milliseconds, we take half of the gap as the padding distance. Here you can see I had an error running, and that's because I need an SRT file and an audio file; I do have them, but I need to go up and rerun the directory creation to make sure those directories are present as expected. With that, we can run the script, and you can see that we're now generating these segments of text that are all below 512 tokens; we've got about 50 segments here. If you want to go through the script very quickly, you can see how it's basically iterating through lines in the SRT file, adding buffer to the very start of the clips, and counting the tokens: if you're above the threshold of 512 it breaks and moves on, and if you're below, it combines the next snippet of text onto the current segment. So this is the script, in total, for making those segments. Once we have the segments, we can optionally add padding — we can add silence to the start and the end — but I think if you've properly added the 200 milliseconds of padding at the start and end as I described, there shouldn't be any need to add further padding, so I've actually set the pad here to zero, which means it will just copy the files to the padded-audio folder without adding any extra padding. You can listen to some of the snippets if you like — that's just a snippet of my voice — and you can also calculate the duration. Here I'm going to run this cell, and it tells me I have 20 minutes of audio in total across the snippets; the longest snippet (one of the top-10 segments) is 30 seconds, and its length in frames is 581. Now, what's the length in frames, and why is that important? Well, the audio in the StyleTTS model is never input in WAV format as a pure raw wave — it's not input as 24 kHz samples for 10 or 30 seconds. Rather, it's converted into the frequency domain through a Fourier transform, and it's then shifted into the human hearing range, which means converting it to a mel spectrogram. So the input format is not just a raw waveform; it's this log-mel-spectrogram format. These mel spectrograms are frequency representations computed over frames of the input audio, so that 30 seconds is actually split into many overlapping frames. The way these frames are set up, you have a sampling rate of 24 kHz for the raw wave, and then the length of a frame, in terms of how many samples it spans, is 1,200 samples. Based on this, we're able to calculate how many frames we have in a given sample of audio, and that's important when we get to setting some hyperparameters in the fine-tuning.
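As a rough sanity check on those numbers, here's the arithmetic; the 24 kHz sample rate and 1,200-sample frame length are the values quoted above, and the exact hop and overlap settings of the real mel-spectrogram code may differ.

```python
sample_rate = 24_000        # samples per second
frame_length = 1_200        # samples spanned by one frame, as quoted above

def n_frames(duration_s: float) -> int:
    """Approximate frame count for a clip, ignoring overlap between frames."""
    return int(duration_s * sample_rate / frame_length)

print(n_frames(30))   # ~600 frames for a 30-second clip, close to the 581 observed,
                      # which is why max_len gets set to 600 in the fine-tuning config later
```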
We need to set the maximum number of frames that will be generated by the model, so we need to know the maximum frame length of the audio we're going to be handling. The logical way to think about this is that our actual limit is on the input length of the sequences to the BERT-type models — that's the 512 — so we're ultimately limited in how long our samples can be by that 512, and once we have those segments, we can calculate the maximum number of frames, which lets us set the maximum number of frames in the hyperparameters. Now, there's a point here to do with training that's probably subtle at this stage but important later: basically, you want samples that are as long as possible. We would love to have 30-minute samples going in; we can't. The reason we want longer samples is that they allow the model to get used to longer-term dependencies, and to represent things like the transitions between sentences. If you train a model only on single sentences, it won't have the right pauses and intonation when you put two sentences together; likewise, if you train it only on pairs of sentences, it won't have the right intonation when going between paragraphs. There are all these transitions, and the only real way to get the transitions right is to have input audio that's long enough to cover those features — transitions between sentences, between paragraphs, potentially between chapters in books, and so on. So we would love to have longer samples, but the reason we can't is that we're limited by the max sequence length of the encoders, that 512. There is another limit we'll get to later, which is the VRAM: the longer the audio segment we generate, the more VRAM — the more GPU memory — we need, and ultimately that limits the maximum number of frames and the batch size we can use during fine-tuning. For now, what's important to know is that we've generated the maximum-length sequences we can, and we've noted the frame length of those max-length sequences, and we're going to use that later as a hyperparameter in the fine-tuning. At this point we've generated some output text; I'm going to find that output text and show you an example — it's probably easiest to look at it here, in training data, output text. Here's what we've got so far: this output text consists of the WAV filename, followed by the raw text, and then at the end there's a number which represents the speaker number. These are all ones because it's not a multi-speaker conversation; it's just a single speaker, so that's going to be one (it could equally be zero — it's the same number because it's the same speaker). So, as I said, we have this text here, but the model actually expects phoneme inputs, so as a last step in creating our data we need to convert this text into phonemes, which is what this script does. I've just run it, and it processes all of that output and converts it into a training dataset and a validation dataset, and you can see these are the very same lines, except they're now phonemes instead of just word representations.
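For illustration, the train and validation lists end up as pipe-separated lines of WAV filename, phonemized text, and speaker id; here's a hypothetical example of writing one such line (the filename and text are made up, not my actual segments).

```python
from phonemizer import phonemize

segments = [("segment_000.wav", "Here are my top ten tips for fine-tuning.")]  # hypothetical

with open("train_list.txt", "w", encoding="utf-8") as f:
    for wav_name, text in segments:
        phonemes = phonemize(text, language="en-us", backend="espeak").strip()
        f.write(f"{wav_name}|{phonemes}|1\n")   # wav file | phoneme text | speaker id
```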
Now, the last step is simply to push this up to Hugging Face Hub. Here I'm pushing up "Trelis voice 512", named that just because the max token length is 512, and I can take a look at what that dataset looks like. It's a public dataset if you want to check it out: you can see a list of the WAV files, then the phonemes, and then the speaker number, and if you look at the files you can see we have the train list, the val list, and a list of the WAV files. So we're at a point now where our dataset is ready: we have audio, we have matching segments of text, and we're ready to move to fine-tuning. Next we're going to move on to the fine-tuning notebooks. For running these, I recommend a minimum of 48 GB of VRAM; I'm actually going to run with 80 GB of VRAM by using an A100. If you want, you can use a one-click fine-tuning template from RunPod: if you head to this Trelis link (it's an affiliate link, which supports the channel, so use it if you're happy to), it will set you up with a one-click CUDA 12.1 template, and you can select maybe the 48 GB option, the RTX A6000, or you can select an A100 if you want 80 GB of VRAM, which is what I've done here. So I'm up and running on that instance, I've opened up a Jupyter notebook, and I'm running some of these installations now; I'll get them going so that I can explain them one by one. The very first step is to clone the StyleTTS 2 repository from GitHub and install some of the packages that are needed, including torchaudio; we need transformers, phonemizer, huggingface_hub, and HF transfer to accelerate uploads and downloads of models. The default is to use TensorBoard, but I've got the scripts set up for Weights & Biases; I don't necessarily think it's better, but I'm used to it, so I'm going to use that. Also, if you want to do LoRA instead of full fine-tuning, there's a PEFT installation you need to run as well. With all of those installed, we change directory (cd) into the StyleTTS 2 folder. By the way, I should have mentioned: I first uploaded this Trelis StyleTTS 2 fine-tune demo notebook, which is from the Trelis Advanced Transcription repository, then I cloned StyleTTS 2, and now, within that folder, I'm going to download the model that we're going to train, which is this LibriTTS model, and we're going to target this checkpoint here. Let me increase my screen size a little to make this easier. So we have the model, and it should be downloaded by now; you can see that I've already done a full fine-tune earlier. Next we're going to download the dataset we need for the fine-tuning. For that we'd need to log into Hugging Face and get a token if it were a private dataset; this dataset is actually public, so there's no need to log in, and we can download it into this Trelis data folder. All of those files look like they've been downloaded, and if we check in the Trelis data folder you can see the train and validation files, and you can also see the list of WAV files ready here.
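As a sketch, downloading a Hugging Face dataset into a local folder can be done like this; the repo id and target folder are placeholders for the ones used in the notebook.

```python
from huggingface_hub import snapshot_download

# Placeholder repo id and folder; substitute the actual dataset name from the video.
snapshot_download(
    repo_id="your-username/voice-512",
    repo_type="dataset",
    local_dir="trelis_data",
)
```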
We make sure we're in the StyleTTS 2 folder (we already are, which is why that error is showing), and we load the config file, which is used for setting the hyperparameters for the fine-tuning; that's the file in Configs called config_ft. What we're doing here is updating some of these parameters, because that's needed for the fine-tuning. First of all, we make sure that the root path points to the sound files in the Trelis data folder, and that the training data and validation data point to the right txt files. Because I'm using Weights & Biases, I've set up a project name, StyleTTS 2, that all of my runs will go into. Next, I configure how many epochs to train for. I find you can actually get decent results training on just two or three epochs — four or eight goes well — though it seems to be convention to train on way more, like 30 or 50, so you can always try doing more. To me, coming from a large language model perspective, that seems excessive, but maybe it just makes good empirical sense for text-to-speech models; as I said, though, the results are not bad even if you fine-tune for a lower number of epochs. So here I had selected 8 epochs; just for the sake of this fine-tuning, I'm going to reduce that to four. Next is my batch size, which is how many samples are processed per training step, and this affects my VRAM, so you can't set it too high or you'll run out of VRAM. The minimum the code supports is a batch size of two — you'll hit errors if you try to go lower than that — and I recommend two for an A6000; you could probably go up to maybe three for an A100. The problem is that this model takes quite a lot of memory because it's a float32 model, so it's fairly memory hungry, especially if you're not using LoRA; and even with LoRA, which reduces your VRAM a bit, a lot of the modules are still being fully fine-tuned, so you don't save that much VRAM. Next up is this max_len parameter. This is the max number of input frames, and it's important because it also affects VRAM, and this is where we go back and look at how many frames we have in our dataset. You'll remember we have a maximum length of 581 frames, and for that reason I'm going to set the max length here to 600. The code will then pad out the rest of those frames, so it just adds padding to all of my samples. From a training quality and efficiency standpoint, you ideally want your segments to be just under the 600 frames, because you don't want a lot of padding that's just wasted compute; that's why, when I created the segments, I wanted them all to be up towards the limit of 512 input tokens, which means they're all up towards the limit in frames of audio length. Now, there are two parameters here that are important for controlling the training, so I'm going to show these graphically by going back to my oversimplified representation of the StyleTTS 2 model. Basically, when we fine-tune this model, everything is being fine-tuned at once, so things like the encoders are being fine-tuned, but there's a control for when you want to turn on fine-tuning of the diffuser, and there's a control for when you want to turn on the adversarial training in the generative adversarial network, and the way it's controlled is through epochs. So here, this is saying the diffusion is going to turn on at epoch 2 — I think it's zero-based counting, so when I set this to two, I think it turns on for the third epoch — and here I've set this to four, which means the adversarial training will turn on at epoch four.
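As a rough sketch of what the notebook's config update amounts to, here's the idea in Python; the key names are illustrative, so check them against the actual config_ft file before relying on them.

```python
import yaml

with open("Configs/config_ft.yml") as f:
    config = yaml.safe_load(f)

# Illustrative keys: set the training controls discussed above.
config["epochs"] = 4
config["batch_size"] = 2
config["max_len"] = 600                    # max mel frames per sample (581 observed, rounded up)
config["loss_params"]["diff_epoch"] = 2    # epoch at which diffusion training turns on
config["loss_params"]["joint_epoch"] = 3   # epoch at which adversarial (joint) training turns on

with open("Configs/config_ft.yml", "w") as f:
    yaml.safe_dump(config, f)
```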
I'm actually going to modify these down a little because I'm running fewer epochs, but you need to make sure these numbers are less than the total number of epochs; otherwise your diffusion training and your adversarial training simply won't turn on. Save frequency just means how often you save the model. This is one, so once every epoch; that might be too much — for example, you could save only on the final epoch by setting it to four. The next parameter of interest is the batch percentage. This controls the amount of out-of-distribution data used during adversarial training. Generally, the data used is what's in the Trelis data folder, but for adversarial training you can throw in some other examples that aren't in your core dataset, and the way things are set up for now, it will pull those samples from the data folder, which contains the original samples used for the LibriTTS model. These are the out-of-distribution texts that will be used for the adversarial training. You don't have to run adversarial training, but if you do, you should go and check what's in your data folder. I've downloaded a different set of data here, the LJ Speech dataset, just by running this snippet of code — make sure to first delete the data folder, because it contains older data — and this will download a set of training and validation files plus the WAV files that go with them. Continuing on with the parameters: if you're going to use LoRA fine-tuning, set this to true; otherwise leave it as false. If you're using LoRA, you also have to set the alpha and the LoRA rank (r); I've selected these so that they're roughly appropriate for the size of model in play here. All of this script here simply builds a Weights & Biases run name that includes things like the number of epochs and other information that helps you track the run's performance. So I'll go ahead and run this — first making sure my config file is loaded — and when I run it, it should basically rewrite my config file, updated with the parameters I've just chosen. Next, I log into Weights & Biases: here I just select Weights & Biases and that should allow me to log in. Then I need to upload some fine-tuning scripts, which is done by taking the scripts from the Trelis Advanced Transcription repository (I didn't mean to create that cell). Meanwhile, over here in the StyleTTS 2 folder, I'm going to go to the folder within Advanced Transcription — so here I'm in Advanced Transcription, in the text-to-speech folder, having git-cloned the Advanced Transcription repo — and I'm going to upload the Trelis train fine-tune script, which is for full fine-tuning, and the Trelis train LoRA fine-tune script, which is for LoRA fine-tuning. Of course, you only need one of them, and I've uploaded them earlier, so I'm just going to dismiss that. Now I'm ready to go ahead and do the fine-tuning, so I'll get started with that, and we should see Weights & Biases kick off; I can click on the link to see the run underway. It's going to load a series of models — you can see the encoders being loaded, and you can see the different sound wave files being loaded as well — and you can check out the current run here; the run is in green, and you can see some earlier runs too. I'll just hide the other runs apart from one I did yesterday, which was 8 epochs.
You can see here that the training is progressing: here I have system information showing the percentage of GPU memory allocated, here I have my training information, and here I have my evaluation loss, and you can see that, very roughly, I'm tracking what was happening yesterday. What we should expect, given my settings, is that my diffusion losses — the losses associated with the diffuser — only turn on at epoch 2, whereas the losses associated with the adversarial training only turn on at epoch 3, that is, after two epochs. That's what we'd expect to see when we look at the training data, and indeed it is: first, let's find the diffusion loss on the second page — yes, after just one epoch you can see the diffusion loss has started, and it's falling nicely, which is what you want. You can also see that the style loss, because the style is now an output of the diffuser, only starts to appear after one epoch. Meanwhile, for the adversarial loss, you can see it starting here in green; gen_loss_slm and d_loss_slm are both related to the adversarial loss, so we only see those appear after two epochs. And you can see my run yesterday, where I ran for eight epochs: I started the diffusion loss at epoch two or three, and I started the adversarial loss later, about halfway through, around epoch 4, and that's why you only see it start to appear there. What you're looking for, generally, is for the losses to be trending down over time, which they generally are. Note that the adversarial losses are very noisy, which indicates potential room for improvement there, maybe by training for more epochs or by more carefully examining the data. Now, the mel loss — this is like the frequency analysis of the sound wave being generated — you can see this is very, very jumpy. It's not helping that my batch size is two; it would be nice to have a larger batch size, but I'd really need more VRAM for that. An alternative would be to code in gradient accumulation, so that you only do the backward pass after averaging the losses over a number of batches. So there's definitely room for improvement in the performance of the mel loss. Looking over at the style loss, you can see a very significant improvement, and the train gen loss here is also improving over time, so generally speaking we can see material improvement in the performance of the model throughout this training. Now, the training should be done here for this limited run of four epochs; I think it's just saving the model, and so we can move on to testing the model. To run inference, there are a few different functions that need to be set up. There's a function here that does the masking; this basically adds padding up to the max length of the samples. There's wave preprocessing: this is an example of where we take a raw wave and convert it into a mel spectrogram, which means getting its frequency representation, and that is furthermore converted into a log mel spectrogram — by the way, "mel" means the frequencies are adapted to be relevant to the human range of hearing. There's compute_style: this is a function that takes in the path to an audio file and outputs a style vector; mainly it's using a style encoder to create an output vector, and there's also a predictor encoder, which is for duration prediction — duration being the length of a phoneme, a phoneme being like the sound "p", and duration being the length of that sound.
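For context, the wave preprocessing I just mentioned looks roughly like this; the parameter values shown are the ones commonly used with 24 kHz StyleTTS 2 checkpoints, but treat them as assumptions and check them against the notebook.

```python
import torch
import torchaudio

# Mel spectrogram settings assumed for a 24 kHz model: 80 mel bands,
# a 2048-point FFT, 1200-sample windows, and a 300-sample hop.
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=24_000, n_mels=80, n_fft=2048, win_length=1200, hop_length=300
)
mean, std = -4, 4   # log-mel normalisation constants used in the StyleTTS 2 demo code

def preprocess(wave: torch.Tensor) -> torch.Tensor:
    """Raw waveform (1-D tensor) -> normalised log-mel spectrogram of shape (1, n_mels, frames)."""
    mel = to_mel(wave.float())
    return (torch.log(1e-5 + mel.unsqueeze(0)) - mean) / std
```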
There is actually a prediction for how long each of the phonemes that are generated should last, in other words how many frames they should last. We then have a check to make sure the models are all moved onto CUDA, which means using the GPU. Now, if you have fine-tuned a LoRA, you can at this point run this series of cells: this lets you load the base model that you fine-tuned, apply the LoRA adapters, merge them in, and then save the result as a merged LoRA model. However, we have fully fine-tuned our model, so I'm going to load the fully fine-tuned model. It's been saved to the LibriTTS FFT folder, and you can see here the epochs that were saved; three minutes ago this fourth epoch was saved (it's zero-based counting), and these others are the epochs I ran yesterday. So here you're able to select the path to that file, and it's going to load up the various encoders and models, like here, for example, is your pitch extractor (F0 refers to pitch), and here's a BERT model used for encoding, so all of the individual models are being loaded up from this .pth file. I'm going to go ahead and run that, then sort through the files and print out the sorted files, which just prints all of the files in this folder. What I'm going to do then is select, not the last file, but, counting back, the fifth-from-last file, so that we load up the parameters from the model I've just trained, just so you can see what four epochs does. That model will be loaded; it's got a kind of wrapper called net, so I'm just going to unwrap that wrapper from it, and if you wish you can print it out and see what the model looks like. You can see all of the different model types: here's a diffuser, here's the pitch extractor, style encoder, predictor encoder, decoder, text encoder, so it's quite a complex model when you look at all the pieces. We're going to load up the sampler, which is the diffusion sampler, and now we move towards defining the inference function. The inference function takes in the text we want to convert to speech and a reference style vector, so we'll first compute a style vector for the reference audio and then input that into the inference, and we have alpha, beta, diffusion steps, and embedding scale. The embedding scale I recommend just keeping at one; it's for normalizing the length of embeddings. Diffusion steps affects quality; I recommend keeping that at 10, and you can try lower if you like, but it's how many times you want to denoise before getting to the final sample. What alpha and beta do is blend the style of the reference with the baseline style of the model, so if you want to move the voice more towards your reference speech, your short snippet, you should put alpha at a low value and beta at a low value. Specifically these blend two things: one is the style and the other is the prediction, I believe; here we have the style s, and ref is the predictor, the duration predictor, and we basically blend the style using beta and blend the prediction using the alpha parameter. So I'd recommend keeping alpha and beta low if you want something close to your reference style. I will go through this inference code in a bit more detail, but it's actually easier if I start with the generation, so let's start off by generating some speech (a rough example of the call is just below).
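Here's a rough sketch of what that call can look like, assuming the compute_style and inference functions defined in the notebook; the reference path and the exact parameter values are placeholders, with alpha and beta set low to lean towards the reference voice:

```python
# compute_style and inference are the functions defined earlier in the notebook.
ref_s = compute_style("reference_clip.wav")   # placeholder path to a short reference recording

wav = inference(
    "I'm going to walk you through 10 quick tips for fine-tuning.",
    ref_s,
    alpha=0.1,            # low alpha: blend the prediction (duration) component towards the reference
    beta=0.1,             # low beta: blend the style component towards the reference
    diffusion_steps=10,   # more denoising steps generally means better quality
    embedding_scale=1,
)
```

If you instead want to lean on the base model's style, push alpha and beta up towards one.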
The text I'm generating is 'I'm going to walk you through 10 quick tips for fine-tuning', and I'm going to use a reference audio snippet, which is just the start of one of my Trelis videos, so that will serve as the reference. Now, what's happening when we generate the output audio is that we start off by creating some noise, because this noise is needed to diffuse out the style vector, and that style vector is then used as an input to our generator. So we start off with a path to our data, which is our reference audio, and we compute the style for that; this here is the style vector coming out of the reference audio. If you remember the compute style function, I'll just pull it up here: what it's doing is pre-processing the audio, as I described, into a log-mel spectrogram, then running it through a style encoder to get a reference style and a predictor encoder to get a reference representation of the duration, the duration being the length we want for each of the phonemes that are generated. So that's the compute style function, and once we have computed the style of that reference audio, we input it to the inference function, which takes in the text and the style and uses them, with the weightings I described, to generate the output. I'm just going to run that, and in the meantime I'll explain up here what's happening in the inference in more detail. We take in the text, we take in the encoded style, we clean up the text and put it into tokens, we mask the text (so put in padding where it's needed), and we encode the text, actually in two different ways: one encoding goes into the GAN, the generative adversarial network, and the other is encoded because we want that information for the style diffuser. This here is the style diffuser: it takes in the reference style that we input to the inference function, so it's taking in this reference style here, and it's also taking in an embedding, a representation of the text, so it's trying to take any information on style that's present in the text, and it's going to diffuse that out from the noise into a clean vector. So that provides a clean style vector, and actually the style vector has a style component and a duration-prediction component; that's just how StyleTTS2 is set up. When we have this diffused style and diffused prediction component, then the blending happens: this is where we blend a portion of the reference style with the diffused style, and this lets you control how much of the baseline model you want in your final sound wave versus how much of the style from the reference you want to diffuse in. As you can see, if you set alpha to be low, it's going to weight it towards the reference here, and if you set beta to be low, it's going to weight it towards the reference here (there's a small sketch of this blending just after this walkthrough). With that, you essentially gather all of the inputs: you gather all of the style, you gather all of the textual information, there are some further operations here I won't describe in too much detail, and you finally push that into the decoder, which is HiFi-GAN, HiFi meaning High-Fidelity Generative Adversarial Network, and finally you have a forward pass through that decoder network.
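Here's a minimal sketch of what that alpha/beta blending can look like; the 256-dimensional style vector and the half-and-half split between the prediction (duration) component and the style component are assumptions for illustration, not necessarily the exact shapes used in the notebook:

```python
import torch

def blend_style(s_pred: torch.Tensor, ref_s: torch.Tensor, alpha: float, beta: float):
    """Blend the diffused style (s_pred) with the style computed from the reference audio (ref_s).

    Both tensors are assumed to have shape [1, 256]: the first half feeds the duration
    predictor, the second half feeds the decoder. Low alpha/beta pulls the result towards
    the reference audio; values near one pull it towards the base model's diffused style.
    """
    ref = alpha * s_pred[:, :128] + (1 - alpha) * ref_s[:, :128]   # prediction (duration) component
    s = beta * s_pred[:, 128:] + (1 - beta) * ref_s[:, 128:]       # style component
    return ref, s

# Example with dummy tensors, leaning heavily towards the reference:
ref, s = blend_style(torch.randn(1, 256), torch.randn(1, 256), alpha=0.1, beta=0.1)
```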
That forward pass takes in ASR, which is the aligned textual information (the text combined with the alignment), F0, which is the pitch information, N, which is the energy level, and ref, which is the style, and, as I said way back earlier in the presentation, the pitch and the energy of the reference sample go in here as well. With that we have a prediction of the output sound, and I'm just going to play a little of the reference here, hopefully you can hear it: 'I'm going to walk you through 10 quick tips for fine-tuning.' And here is the synthesized version: 'I'm going to walk you through 10 quick tips for fine-tuning.' Okay, so not too bad. One of the drawbacks with the synthesized sound is that it's actually a bit slower; that's something you can rectify with post-processing, and I've done a very naive form of post-processing here, which gets you a tiny bit closer. So here's the sped-up synthesized audio, and again the reference: 'I'm going to walk you through 10 quick tips for fine-tuning.' 'I'm going to walk you through 10 quick tips for fine-tuning.' You can see there are definitely aspects of my voice; it's way closer than what you would hear if you loaded the raw model, which we can try in a second. It's not quite perfect, so there's opportunity to fine-tune for more epochs; I only did four, but I think you can see that with those four epochs we're already getting quite close in the result. At this point you can decide to push the model to the Hub, so you can select a checkpoint that you want to push and it will create a repository up on Hugging Face. Now, there are two things I want to cover before I wrap up the script. The first is that I've glossed over a detailed description of the losses, just categorizing them at a high level as diffuser losses, generative adversarial losses, or other losses, but you can find a description here where the losses are described one by one versus what you see in the Weights & Biases or TensorBoard logs, so you can see a description of, say, the mel loss, the generator and discriminator losses, the cross-entropy loss, and the duration loss, if you want more information. Something else I want to do is show you the performance of a baseline model, so let's test out that model. I'm going to run through all of the inference scripts again, but this time, instead of loading up the fully fine-tuned model, I'm going to load up the base model, and actually that code there is for the LoRA one, so what I'm going to do instead is copy out some code here and just adjust it so that we read in the Models LibriTTS folder, where the config file is just called config.yml. Let's load through; we can see there's only one sorted file that should appear, which is correct, so I pick out that last sorted file, load it up, make sure the model loads, load the diffuser, then run through inference and finally synthesize that same speech (the file selection and loading works roughly as sketched below). Now, just for interest, let's try to make it as close as possible to the reference, because I'm going to provide a reference; I'm doing voice cloning here, and I'm going to try to clone my Irish accent. So let's run it in a way that is as close as possible to the reference, and since it's not a fine-tuned model, you'll just see how good or bad it is.
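For reference, here's a rough sketch of what that file sorting and checkpoint loading can look like; the folder name, the .pth extension, and the 'net' wrapper key are assumptions based on the walkthrough above rather than the notebook's exact code:

```python
import os
import torch

ckpt_dir = "Models/LibriTTS"                                       # assumed checkpoint folder
files = sorted(f for f in os.listdir(ckpt_dir) if f.endswith(".pth"))
print(files)                                                       # inspect every saved checkpoint

state = torch.load(os.path.join(ckpt_dir, files[-1]), map_location="cpu")
params = state["net"] if "net" in state else state                 # unwrap the 'net' wrapper if present

for name in params:                                                # one entry per sub-model
    print(name)                                                    # e.g. decoder, style_encoder, pitch_extractor, ...
```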
Here's the reference: 'I'm going to walk you through 10 quick tips for fine-tuning.' All right, that was the reference, and here's the base model's output: 'I'm going to walk you through 10 quick tips for fine-tuning.' So here it's clearly not getting much of the Irish accent just through voice cloning; just putting in the reference voice is really not sufficient, and that shows you why you probably need to fully fine-tune, or LoRA fine-tune, if you're going for an accent that's not very close to what's in the base data set. You can also notice there's a slight delay on the word fine-tuning, which isn't present in the fully fine-tuned version I did, and that's probably because the word fine-tuning is not in the base training data set, but it is in my data set, because I say it a lot in the Trelis videos. And that is pretty much an overview of the notebook for doing full fine-tuning or LoRA fine-tuning. With that, we're towards the end of the video, and it's time for a few final tips. First of all, I recommend using a minimum of 20 minutes of audio if you want to do fine-tuning; I used 20 minutes there and trained for four epochs and got reasonable results, but if you bump that up to about two hours and train for more epochs, you can get even better results, especially on the adversarial losses. As I've said, you should think carefully about what out-of-distribution data set you want to use for adversarial training; this helps the model generalize a bit better beyond your single accent, but you may not value generalization if you're just trying to hone in on one specific use case, so for your out-of-distribution data you might just want to use more of, say, the Ronan voice in the case I was going through. Last of all, with regards to VRAM, this does impose a limit on the batch size, which means your training loss can be somewhat jumpy, so it is beneficial if you can find ways to increase that batch size, even if it means increasing the virtual batch size by coding in some gradient accumulation (a rough sketch of that idea follows below). Alternatively, you can use a larger GPU; for example, there are 192 GB GPUs available on RunPod if you want to rent out an AMD GPU. Now, that doesn't use CUDA, so it requires some testing, but it's potentially an option if you want to try to run a larger batch size. And that is it for this video on text-to-speech fine-tuning. You can find the links I mentioned in the video down below in the description, including a link to the Trelis advanced transcription page if you'd like to purchase access to the Trelis scripts or get lifetime access to the speech-to-text and text-to-speech repository. Additionally, I'd like to thank Rohan Sharma for his work helping on this video as part of the Trelis internship program; if you're a very talented developer, you can find out more about that program in the top section of the Trelis website. In the meantime, if you have any questions on text-to-speech fine-tuning, put them below in the comments. Cheers.
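As promised, here's a minimal sketch of the gradient accumulation idea, using a toy PyTorch model and dataloader rather than the actual StyleTTS2 training script, just to show the pattern of scaling the loss and stepping the optimizer only every few mini-batches:

```python
import torch
from torch import nn

# Toy stand-ins: in the real notebook these would be the StyleTTS2 modules, losses, and dataloader.
model = nn.Linear(80, 80)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
train_loader = [(torch.randn(2, 80), torch.randn(2, 80)) for _ in range(32)]  # real batch size of 2

ACCUM_STEPS = 8  # virtual batch size = 8 * 2 = 16

optimizer.zero_grad()
for step, (x, y) in enumerate(train_loader):
    loss = nn.functional.l1_loss(model(x), y)  # stand-in for the combined mel/duration/adversarial losses
    (loss / ACCUM_STEPS).backward()            # scale so gradients average over the virtual batch
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()                       # update the weights only every ACCUM_STEPS mini-batches
        optimizer.zero_grad()
```

With a real batch size of two and eight accumulation steps, the virtual batch size becomes sixteen, which should smooth out noisy losses like the mel loss without needing extra VRAM.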