So, as part of this project, we'll first take a video URL of interest; we'll use the pytube library to download that video from YouTube. For the purposes of this tutorial the video is going to be a YouTube video, so I'm going to take a YouTube URL and download it using pytube. Once we have the video available, we're going to use ffmpeg to extract the audio. So you first get the video; from the video we extract the audio stream; that audio stream gets saved on disk; and we then send that audio as the input to the OpenAI Whisper model. Note that we are not making use of the Whisper API, which is a paid API. We are going to use the model running locally on our own systems, the free, open-source model. I have also done a video on OpenAI Whisper earlier where we did something very similar, but we did not use ffmpeg there; we downloaded an audio file directly. The new addition here is that we will first download the video file, then use ffmpeg to get the audio out of it, and from there we'll get to OpenAI Whisper, which will give us the transcripts. We'll do some pre-processing on the transcripts, because the timestamps need to be in a specific format, which we'll talk about. Once that pre-processing is done, we'll use ffmpeg again to embed the subtitles in the original video that we downloaded.

There are two ways to embed subtitles. One is hard embedding, which is basically saying, hey, you are burning the subtitles onto the video, so they are always going to be there and you cannot switch them off. The other, which you might have seen, is more like closed captions, where you can have several languages as part of your subtitles: you generate an SRT file which the player can then find and play accordingly, so if the user wants subtitles in English those can be played, and if the user wants subtitles in Spanish those can be played. For the scope of this video we are going to hard-embed the subtitles, so they will be burned onto the video.

So let's get started. This notebook has all the steps, and now we'll get to the coding part. We'll use three libraries: faster-whisper, ffmpeg-python, and pytube. Let's first get them installed. If you don't already have an environment, I recommend you create one. Let me just create a Python cell and run it; I did that because it will automatically ask me for the Python environment. I'll say I want to choose a Python environment, I want to create a new one, we'll pick the venv type, this is my interpreter, Python 3.12, and if you already have a requirements.txt it will ask you whether you want those dependencies installed. So we'll just do this and it will go out and create the environment. Our environment is now created; I can quickly go to the terminal, and if I see venv I know my environment was created. I can do a pip freeze and we'll see all of these libraries installed; ffmpeg-python is among them, so I know my requirements.txt dependencies were installed as well.
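For reference, a minimal requirements.txt for this project might look like the following; this is my own sketch of it, with versions left unpinned:

```
faster-whisper
ffmpeg-python
pytube
```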
From a requirements perspective, for this video we are making use of faster-whisper, which is basically a CTranslate2 reimplementation of the regular Whisper model. This is again an open-source project: if you go to github.com/SYSTRAN/faster-whisper you can read through what this implementation is all about. Given that CTranslate2 is a fast inference engine for Transformer models, the author claims this is roughly four times faster than the regular Whisper implementation, and with quantization it is even faster. We are going to run it without quantization today. I wanted to explain why we are using faster-whisper, and the reason is that transcription takes time; if there is an approach that gives us faster transcription without losing too much accuracy, I think it's better to go with it. That's why we're using faster-whisper.

Now let's get to the first step, where we need a URL and will then use pytube to download the video. For this purpose I'm going to use the demo video OpenAI released when they came out with the GPT-4o model; I'm sure a lot of you have already seen it. I'll copy its URL, and that becomes our URL of interest. So: import os, also import pytube, and run the cell. Next we save the URL, so we'll say url = our URL. The way you download with pytube is you say pytube.YouTube and give it the URL; that's how you instantiate the YouTube object. Now, using pytube, we'll download the video: you say yt.streams.filter (I'm first accepting whatever GitHub Copilot suggested, and then we'll go through it). progressive=True seems okay, file_extension is "mp4", we order by resolution in descending order so we get the highest resolution first, and we then download it. I'm not specifying any path here, so .download() will just save into the current working directory, which is my objective. Once this is done we should see a file appear here; the default file name is the title of the video plus .mp4. It's still downloading, so it must be a fairly large file... I think it's all done now.

This yt object we created can now be used to print a lot of details: the title of the video (yt.title), how many views the video has (yt.views), even the description (yt.description). All of these calls are available. We can also print the length with print(yt.length), which is given in seconds.
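As a sketch, that download step looks like this, with a hypothetical placeholder standing in for the actual URL:

```python
import os
import pytube

# Hypothetical placeholder: substitute the YouTube URL you want to subtitle
url = "https://www.youtube.com/watch?v=<video-id>"

yt = pytube.YouTube(url)

# Progressive streams bundle audio + video; grab the highest-resolution MP4
# and download it into the current working directory (no path given)
yt.streams.filter(progressive=True, file_extension="mp4") \
    .order_by("resolution").desc().first().download()
```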
Let's run this. The URL we grabbed corresponds to this video: here is the title, this many views have happened, this is the description, and the length is 1573 seconds, which works out to roughly 26 minutes.

Next, I don't want to keep the .mp4 extension on this file, and you'll see why once we do more with it. I'll rename the MP4 to the title of the video without a file extension: os.rename(yt.title + ".mp4", yt.title). This yt.title is basically the file name: as you saw, the title of the video had .mp4 appended to form the file name, and I'm just renaming it back to have no extension. If you look now, the .mp4 extension is gone. The reason I did that is that this file now becomes my raw input for all the processing: when I extract the audio and save it as a .wav, I don't want the file name to be "Introducing GPT-4o.mp4.wav", I want "Introducing GPT-4o.wav". That's the only reason for removing the extension; it's perfectly all right to skip this step and everything will still work, the names just won't read as nicely.
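The metadata calls and the rename step, put together, look roughly like this:

```python
# Inspect the downloaded video's metadata via the pytube object
print(yt.title)              # title of the video
print(yt.views)              # view count
print(yt.description)        # video description
print(yt.length, "seconds")  # duration in seconds (1573 here, ~26 minutes)

# Drop the .mp4 extension so derived files read cleanly
# ("<title>.wav" instead of "<title>.mp4.wav")
os.rename(yt.title + ".mp4", yt.title)
```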
Now that we have the video, the next step is to extract the audio from it, so let's build that code. We'll create a method and call it extract_audio: def extract_audio. Let's see what GitHub Copilot suggests; it says audio = video.streams..., so pytube also has a way to get just the audio, by passing only_audio=True when filtering. I'm not using that method; I'm using the MP4 we downloaded. Let's assume you already have an MP4 from somewhere and aren't even downloading it; we'll use ffmpeg to extract the audio, so Copilot's pytube-based suggestion isn't what I want. The extracted audio gets a file name: extracted_audio = f"audio-{input_file}.wav", where input_file is a parameter of the method; this is the name I'll use to save the audio once extraction is done. Next we say stream = ffmpeg.input(input_file), and similarly we define the output: stream = ffmpeg.output(stream, extracted_audio), which tells ffmpeg where to write. I think I haven't imported ffmpeg, which is why I'm getting that error, so let me first do some imports here: import time, import math, import ffmpeg; some of these I'll use later when we do the pre-processing. The ffmpeg error is gone now. Once that's done, I'll say ffmpeg.run(stream, overwrite_output=True), and we'll return the extracted audio file.

I'll run the cell, and now we'll call the method: audio_extract = extract_audio(yt.title). We actually don't need to pass the yt object; sorry, that was another GitHub Copilot completion that wasn't right. yt.title is my input file. Let's run it; if this goes well we should see a WAV file also appear here. Okay, something is wrong: I'm getting "missing one required positional argument". I think I did not run the cell, so I'll run it again... yes, now I got the WAV file. Along with it we also got a lot of messages from ffmpeg: whenever you run ffmpeg from Python you will see a lot of console output telling you metadata about the video, when it was created, what type of file it is, what the output file name is, and all of that. So we are now at the point where our WAV file is created; we have the audio file, and the audio now has to be sent to OpenAI Whisper. So let's code that part.
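Here is a runnable sketch of that extraction step, assuming the renamed, extension-less video file from earlier:

```python
import ffmpeg

def extract_audio(input_file):
    """Extract the audio track from a video file into a WAV file."""
    extracted_audio = f"audio-{input_file}.wav"
    stream = ffmpeg.input(input_file)             # the downloaded video
    stream = ffmpeg.output(stream, extracted_audio)
    ffmpeg.run(stream, overwrite_output=True)     # safe to re-run over old output
    return extracted_audio

audio_extract = extract_audio(yt.title)
```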
Now let's build the transcribe function. We'll define a function called transcribe which takes in an audio file and should return the language of the audio and the segments: we send the audio, and we ask the Whisper model to also tell us the language and give us the segments. If you go back to the faster-whisper documentation, you'll see that as part of the implementation you get some segments and some info, and it also says that segments is a generator, so the transcription only starts when you iterate over it; it can be run to completion by gathering the segments in a list or a for loop. I just wanted to quickly help you understand how to read these docs: a lot of developers know that packages exist, but what they don't really understand is how to use them, and the quickest way is sometimes to just read through the code. ChatGPT is a great tool here too: if you can't follow something, copy the whole thing, send it to ChatGPT, and ask what it all means; that can help you learn a lot.

So we are going to return the language and the segments. Our method takes the audio, and this is the point where we start using the Whisper model from faster-whisper: from faster_whisper import WhisperModel. Inside the function we say model = WhisperModel("small"); I'm using the small model in my case. Again, go back to the documentation and you can pick whatever model size you wish; the sizes are mentioned there, from small up through large-v2 and large-v3. In their examples they use the large-v3 model; I'm going to use the small model here. The OpenAI documentation has the names of all the available models. The reason I'm using the small model is just for this demo: I'm not running this on Colab where I'd have access to a GPU, I'm doing it on my MacBook, and therefore I'm using the small model. If you are doing this on Colab with access to GPUs, feel free to use a bigger model.

Then we call model.transcribe and pass it the audio. As for the language: if you remember, the call was also returning some info. segments and info are what model.transcribe gives back, and within info is where you get the language; language is basically info[0], the first element, which gives you the detected language. Let's also quickly print the transcription language. Next, remember that up to this point the transcription has not actually happened; that's what the documentation told us, that segments is a generator, and any generator in Python has to be iterated over. We iterate by calling list, like the docs mentioned: segments = list(segments). This is where the transcription happens. Next, for segment in segments, we loop over each segment the model has given us, and we print the segment along with its timing, the start and end times. Our final goal is to take these transcripts and convert them into subtitles, and for that we will need these timestamps. So we print with format specifiers: print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text)); those are our three values, segment start, segment end, and segment text. The return is actually already coded: we return the language and the segments. So that's our transcribe method: we take the audio, define the model (WhisperModel, small), call transcribe on it, get the language as info[0], iterate over the segments to run the transcription, and print out the time and text of each one.
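Putting that together, a sketch of the transcribe function, assuming faster-whisper is installed:

```python
from faster_whisper import WhisperModel

def transcribe(audio):
    """Transcribe an audio file; return the detected language and segments."""
    # "small" trades some accuracy for speed; use a larger model on a GPU
    model = WhisperModel("small")
    segments, info = model.transcribe(audio)
    language = info[0]  # first field of TranscriptionInfo is the language
    print("Transcription language:", language)
    # segments is a generator; materializing it is what runs the transcription
    segments = list(segments)
    for segment in segments:
        print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
    return language, segments
```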
Next we'll time this cell and actually call the method: language, segments = transcribe(audio_extract). Until now we had only declared it. This is going to take a while, so I'll pause the video and come back once the call is finished. While it runs, it gives me a warning saying, hey, you are using float16 but the model weights have been automatically converted to use float32; that depends on what kind of system you are using. It has already told me the transcription language, so we'll just let it run.

Now the model is done with the transcription, and you can see what we printed: the start time, the end time, and the text. We have the transcript stored in language and segments. One thing you will also notice is that because we are using the small model, it's really good but still not 100% accurate: it mangled "ChatGPT" in a few places, and there are a few other minor errors I can't find right now. The small model is relatively fast but less accurate. If you go to the larger models available within faster-whisper, say large-v3, it will be slower but very accurate. Right now I'm using small; as you go up through medium, large-v1, large-v2, and large-v3, the models get slower but more and more accurate. So if you have access to a GPU, you should definitely try that.

With this, our transcripts are ready, and we now need to get into some pre-processing, specifically on the time values, because those times are going to be read by the player. There are many subtitle formats available; we're going to use the SubRip (SRT) format. The key things when building an SRT file are that you need three pieces per subtitle. First, a subtitle index, a sequential number (1, 2, 3, ...) that indicates the order of the subtitle in the file. Next, the timecode, which has start and end markers for the times we generated, and it has to be in HH:MM:SS,mmm format; all your start and end timecodes have to appear like this. And third, the subtitle text itself. Our pre-processing needs to result in output of exactly this shape; that's what I mean by the pre-processing step here.
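Concretely, an SRT file is just repeated blocks of index, timecode, and text; here's a tiny example with made-up captions for illustration:

```
1
00:00:00,000 --> 00:00:02,500
Hello and welcome to the demo.

2
00:00:02,500 --> 00:00:05,120
Let's look at what the model can do.
```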
So let's write that method now. We'll build a method and call it format_time_for_srt, and it takes seconds. Let me add a comment saying this is a helper function that takes time in seconds and converts it into the HH:MM:SS,mmm format for SRT subtitle files. I gave GitHub Copilot a hint, so let's see what it gives me... it suggests seconds = math.floor(seconds), hours is seconds divided by 3600... okay, I would have done this slightly differently, and I'm feeling a bit lost with its version, so I won't use Copilot here; I'll just write it myself. hours = math.floor(seconds / 3600); then seconds %= 3600; next minutes = math.floor(seconds / 60); then seconds %= 60. Then milliseconds = round((seconds - math.floor(seconds)) * 1000); that gives my milliseconds. Next, seconds = math.floor(seconds). And formatted_time, my final output, is the f-string f"{hours:02d}:{minutes:02d}:{seconds:02d},{milliseconds:03d}". That looks good, and we finally return the formatted time. With this, my method to format time for SRT is ready.

Now we'll go ahead and create the method that generates the subtitle file. We'll say def generate_subtitle_file, and it needs the language and the segments, the ones that came out of our transcription earlier, plus an input_file name. Then we define the output file: subtitle_file = f"sub-{input_file}.{language}.srt"; that becomes my output file for the subtitles. Then we initialize text to an empty string (sometimes these autocompletions can be irritating). Next we loop: for index, segment in enumerate(segments). Autocomplete is giving me some text, but I'll just start writing the code on my own. We say segment_start = format_time_for_srt(segment.start), calling our formatter with the start time that came as part of the segment, and in the same way segment_end = format_time_for_srt(segment.end). Now that the times are available, we build the text with f-strings, because we need the three things from before: the index, the timecode, and the text. So first, text += the index plus one and a newline; next, text += the timecode line, segment start to segment end; and third, the text itself, segment.text, followed by two line breaks so there's an empty line between entries. Finally we open the subtitle file for writing, f = open(subtitle_file, "w"), write with f.write(text), close with f.close(), and return subtitle_file. We just wrote a method that generates a subtitle file for us; it takes an input file, a language, and segments.
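Here is a sketch of both helpers together; note the milliseconds are padded to three digits, since the SRT timecode format is HH:MM:SS,mmm:

```python
import math

def format_time_for_srt(seconds):
    """Convert a float number of seconds into HH:MM:SS,mmm for SRT."""
    hours = math.floor(seconds / 3600)
    seconds %= 3600
    minutes = math.floor(seconds / 60)
    seconds %= 60
    milliseconds = round((seconds - math.floor(seconds)) * 1000)
    seconds = math.floor(seconds)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{milliseconds:03d}"

def generate_subtitle_file(input_file, language, segments):
    """Write the transcription segments out as a SubRip (.srt) file."""
    subtitle_file = f"sub-{input_file}.{language}.srt"
    text = ""
    for index, segment in enumerate(segments):
        segment_start = format_time_for_srt(segment.start)
        segment_end = format_time_for_srt(segment.end)
        text += f"{index + 1}\n"                        # sequential subtitle index
        text += f"{segment_start} --> {segment_end}\n"  # timecode line
        text += f"{segment.text}\n\n"                   # caption text + blank line
    with open(subtitle_file, "w") as f:
        f.write(text)
    return subtitle_file
```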
Now we'll call the subtitle generation: subtitle_file = generate_subtitle_file(yt.title, language, segments); that's the name of the method, and we pass it yt.title, the language, and the segments. (Why is it not recognizing language and segments? ...there it goes.) So we have the subtitle file; let's see if it got written. Yes, a subtitle file got written, but it literally contains "str(index + 1)", so something is wrong with our code: that should have been an expression evaluated inside the f-string, but instead we printed it as a literal string. We'll fix that, redo the cell, and call generate_subtitle_file again. Now we see proper index numbers, one, two, and so on, with proper start and end times and the text. Our SRT file in English is now ready.

The next step, the last one, is to take this SRT file and burn it onto the video itself using ffmpeg. (There's one step missing in the earlier visual, which is the pre-processing: we pre-process, generate the SRT file, and then we get to ffmpeg.) So we're at the point where our SRT is ready; we'll write one more method and we'll be done. This method will invoke ffmpeg with the SRT file we created to embed that text on the video. We'll call it def add_subtitle_to_video, and it takes a few things: the input file, the subtitle file, and the subtitle language. First we define an ffmpeg input stream: video_input_stream = ffmpeg.input(input_file); it's yt.title in our case, but it's better to take it as an input_file parameter. Then we have a stream for the subtitle, subtitle_input_stream = ffmpeg.input(subtitle_file). Next we define the output video name: output_video = f"output-{input_file}-{subtitle_language}.mp4". Seems good: our final output file will be called output, dash, whatever the input file name was, dash, whatever the language of the subtitle is, with an .mp4 extension. Then we say subtitle_track_title = subtitle_file.replace(".srt", ""), stripping the .srt suffix from the name. Finally we say stream = ffmpeg.output, calling it with the video input stream and the output video, and we pass vf=f"subtitles={subtitle_file}"; then ffmpeg.run. So this is our method: we first define a video input stream, then a subtitle input stream, then the output video file name, then the subtitle track title; we blend it all into the output stream and run ffmpeg with it. This method will add the subtitles to the video.
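A sketch of that method follows. For the hard-burn path, only the video input and the subtitles filter are actually used in the output call; the subtitle input stream and track title only come into play in the soft-subtitle variant, so I've left them out of this sketch:

```python
import ffmpeg

def add_subtitle_to_video(input_file, subtitle_file, subtitle_language):
    """Hard-burn (permanently render) an SRT file onto the video."""
    video_input_stream = ffmpeg.input(input_file)
    output_video = f"output-{input_file}-{subtitle_language}.mp4"
    # The "subtitles" video filter reads the .srt file and draws the
    # captions onto every frame, so they can never be switched off.
    # Filenames with special characters may need escaping in the filter string.
    stream = ffmpeg.output(
        video_input_stream,
        output_video,
        vf=f"subtitles={subtitle_file}",
    )
    ffmpeg.run(stream, overwrite_output=True)
    return output_video
```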
Next, all we need to do is call this method: add_subtitle_to_video with the input file, which is yt.title, the subtitle file, and the language. I don't know why it keeps saying language is not defined when it clearly is; if we just use language and run it, I get "en", so maybe it's some issue with Pylance. Let's forget that for a minute; this is the final call, so we'll just run it. Okay, an output file is being generated, an .mp4. It will take a little while, and you can see ffmpeg outputting a lot of metadata: the name of the encoder, how many frames have been written, how much time has elapsed, the bitrate, and all of that. Let's let it run and then come back.

Now that the code has run, let's quickly look at the output file that got generated. If we play it, you can see the subtitles showing up; they have been hard-burned onto the video, so they are part of the video now. So this was it, folks; I hope you learned something new and interesting. There is also a way, if you don't want to hard-burn the subtitles onto the video, where you just need to change that one output line a little bit; I'll leave that as an exercise for you to figure out, and if you want to know, ask me in the comments and I'll be happy to answer. But it's one small change if you don't want the subtitles hard-embedded on the video.

So, to recap what we learned today: we took a video, extracted the audio from it, sent the audio to Whisper to get the transcripts, pre-processed the timing information once the transcripts came back, generated our SRT file from that, and then used ffmpeg to embed that SRT file's text onto the video. That's the entire flow we've built today. I hope you learned something new. I'm also going to commit this to GitHub, so the code will be available for you to have a look at. Thank you so much, and have a good time.
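For reference, here is the whole flow in one place; this is my own recap sketch, assuming the helper functions sketched above are defined and url holds your video URL:

```python
import os
import pytube

# End-to-end: download -> extract audio -> transcribe -> SRT -> burn subtitles
yt = pytube.YouTube(url)
yt.streams.filter(progressive=True, file_extension="mp4") \
    .order_by("resolution").desc().first().download()
os.rename(yt.title + ".mp4", yt.title)   # drop the extension

audio_extract = extract_audio(yt.title)
language, segments = transcribe(audio_extract)
subtitle_file = generate_subtitle_file(yt.title, language, segments)
output_video = add_subtitle_to_video(yt.title, subtitle_file, language)
print("Wrote", output_video)
```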