Transcript for:
Lip Sync Overview and Solutions

Lip sync. When creating talking characters in 2D or 3D games, or AI-powered chatbots, you need two things to make it convincing: the audio of what is being said, and the mouth movements to visually represent it. This synchronization between what you hear and what you see is called lip sync. For the audio, you can record it yourself, generate it with text-to-speech solutions like OpenAI or ElevenLabs, or even generate fun sounds like Animal Crossing does.

As for lip sync, the solutions at our disposal for web experiences are limited. In my previous tutorials I showed two of them. The first used Azure as our TTS solution: the good part is that the data needed for lip sync is provided when generating the audio; the bad part is that Azure is expensive and not the most advanced solution for generating realistic speech from text. This is what I used in my AI teacher tutorial, but it's also the reason why I can't keep the AI live as a free demo.

The second alternative I had found until now was the Rhubarb Lip Sync library, which is what I used in the virtual girlfriend tutorial. It takes an audio file as input and generates the lip-sync data as output. It's free and handles any audio source. In the virtual girlfriend scenario I used ElevenLabs, which is high quality, and got very good results: "Hey dear, how was your day? I missed you so much, please don't go for so long." Seems perfect, right? The problem is that it is very slow to generate the data. Let's say we have a simple flow: we generate the text using AI, then we generate the audio using AI; that can already take up to a few seconds. Adding the Rhubarb lip-sync process adds a few more seconds for a short audio, and many more for longer ones. Not ideal, and I didn't even mention the fact that it only runs server side, meaning your back end has to do the processing for every request from every user.

One of the projects I'm working on is a professional-grade 3D AI chatbot template including multiple scenarios (more to come about this in the future), but to make it happen I need a fast, free, and effective solution. I knew real time was possible because, many years ago, for a similar VR use case, I used the Oculus Lipsync library with Unity. Unfortunately, no equivalent existed so far for the web browser, so I decided to create mine: wawa-lipsync. "Big frogs jump quickly, vex that sharp duck, zoom near."

Before I show you how to use it, let's discuss what it is and how it works. wawa-lipsync is an open-source, JavaScript-based, real-time lip sync library. While the main example uses Three.js with React Three Fiber, the library is written in TypeScript and can work with any JS project, with any framework; you can even use it to animate 2D characters. What it does is analyze the last few milliseconds of the audio signal and deduce the lip-sync data, called a viseme. A viseme is the visual representation of a phoneme. For example, if you close your eyes and I say "P", you will hear the P sound: based on the sound I made, that is the P phoneme. To produce this sound, my lips have to be pinched together: that is the viseme. If I say "P" while you see my mouth wide open, it's not very convincing, while for the "O" sound my mouth will be open and slightly rounded. There are around a dozen different visemes, and the good part is that detection only relies on the sound, not on the language.

To be able to detect the phoneme from the audio, we need to dissect it. To do so, we use the AnalyserNode available in all browsers. The graph you see is a visual representation of the audio being played: the bars represent the volume per group of frequencies, the legend at the bottom shows their frequencies, and the moving white line is the spectral centroid, which helps us understand the dynamics of the sound. With this, we are already able to detect some sounds: let's play a drum kick and visualize its frequency; it's around 80 Hz, while a snare will be around 200 Hz. But this alone can't tell us exactly which phoneme we have. We first need to look at the volume and frequencies over time: if it's a short burst of volume, it's most likely a plosive, like the B or P sounds; if it's sustained energy, it's probably a vowel. We also have the fricative sounds you can hear in the F or S sounds, which produce distinctive high frequencies. By combining all of this information, we can deduce which phoneme is being spoken and assign the correct viseme. I won't go too much into the details, but if you are curious, the code is open source; feel free to look at the algorithm and even contribute to make it better.

Now let's see how to use it within your JS projects. First, install the wawa-lipsync package with npm install wawa-lipsync. Then create a lip-sync manager with new Lipsync(). After setting the source of the audio element you want to play, connect it to the lip-sync manager with connectAudio. Then you simply call processAudio in a loop, for example with requestAnimationFrame, and you now have access to the viseme property on the lip-sync manager. This is what I'm using in the avatar component to render it smoothly on my 3D character. It's a simple plug-and-play solution to get visemes in real time from any audio source.
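Put together, the integration steps above look roughly like this. It's a minimal sketch: the names Lipsync, connectAudio, processAudio, and viseme follow what's described in the video, but treat the exact import and API shape as an assumption and double-check them against the wawa-lipsync README; the audio file path is just a placeholder.

```ts
import { Lipsync } from "wawa-lipsync"; // exact export name: see the library's README

// A single lip-sync manager can be shared across your app.
const lipsyncManager = new Lipsync();

// Any <audio> element works as a source; the file path is a placeholder.
const audioElement = new Audio("/audios/welcome.mp3");
lipsyncManager.connectAudio(audioElement);
audioElement.play(); // browsers usually require a user interaction before audio can play

// Analyze the audio on every frame and read the current viseme.
function loop() {
  lipsyncManager.processAudio();
  console.log(lipsyncManager.viseme); // drive morph targets or 2D sprites from this value
  requestAnimationFrame(loop);
}
requestAnimationFrame(loop);
```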
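On the rendering side, here is a hedged sketch of how the current viseme could drive a 3D character inside a React Three Fiber component, with useFrame acting as the per-frame loop. It assumes your model exposes morph targets named after the visemes (with a "viseme" prefix, as Ready Player Me style avatars do) and that the manager lives in a hypothetical ./lipsyncManager module; adapt the mapping and smoothing factor to your own character.

```ts
import { useFrame } from "@react-three/fiber";
import * as THREE from "three";
import { lipsyncManager } from "./lipsyncManager"; // hypothetical module exporting the shared manager

// Call this hook from your avatar component, passing the mesh that owns the mouth morph targets.
function useLipsyncMorphs(mesh: THREE.SkinnedMesh | null) {
  useFrame(() => {
    if (!mesh || !mesh.morphTargetDictionary || !mesh.morphTargetInfluences) return;

    lipsyncManager.processAudio();
    const current = lipsyncManager.viseme; // assumption: a string matching a morph target name

    // Ease every viseme morph target toward 0, except the active one toward 1,
    // so the mouth moves smoothly instead of snapping between poses.
    for (const [name, index] of Object.entries(mesh.morphTargetDictionary)) {
      if (!name.startsWith("viseme")) continue; // assumption: viseme morphs share this prefix
      const target = name === current ? 1 : 0;
      mesh.morphTargetInfluences[index] = THREE.MathUtils.lerp(
        mesh.morphTargetInfluences[index],
        target,
        0.3
      );
    }
  });
}
```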
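Finally, if you are curious about the detection side described a moment ago, the AnalyserNode, the frequency bands, and the plosive / vowel / fricative heuristics, here is a toy illustration of that kind of spectral analysis using the standard Web Audio API. This is not wawa-lipsync's actual algorithm (that lives in the open-source repo), and the thresholds are made up purely to show the idea.

```ts
// Toy spectral analysis with the Web Audio API (not the library's real algorithm).
const audioContext = new AudioContext();
const audio = new Audio("/audios/welcome.mp3"); // placeholder file
const source = audioContext.createMediaElementSource(audio);
const analyser = audioContext.createAnalyser();
analyser.fftSize = 1024;
source.connect(analyser);
analyser.connect(audioContext.destination);

const bins = new Uint8Array(analyser.frequencyBinCount);
let previousVolume = 0;

function analyse() {
  analyser.getByteFrequencyData(bins);

  const binWidth = audioContext.sampleRate / analyser.fftSize; // Hz per bin
  let total = 0;
  let weighted = 0;
  let highEnergy = 0;
  bins.forEach((value, i) => {
    const frequency = i * binWidth;
    total += value;
    weighted += value * frequency;
    if (frequency > 4000) highEnergy += value; // rough "fricative" band
  });

  const volume = total / bins.length;
  const centroid = total > 0 ? weighted / total : 0; // spectral centroid in Hz

  // Made-up thresholds, purely illustrative:
  if (volume - previousVolume > 40) console.log("short burst → maybe a plosive (B, P)");
  else if (highEnergy / (total || 1) > 0.4) console.log("lots of highs → maybe a fricative (F, S)");
  else if (volume > 30) console.log("sustained energy → probably a vowel, centroid:", centroid);

  previousVolume = volume;
  requestAnimationFrame(analyse);
}
requestAnimationFrame(analyse);
```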
Don't be too surprised about the Liam voice with a woman's face: "Big frogs jump quickly, vex that sharp duck, zoom near." It was for testing purposes, and we are in 2025. If you want to learn how to create and animate this character, you can check my dedicated tutorial. Or, if you are new to 3D web development and want to build a solid foundation, consider exploring my course, React Three Fiber: The Ultimate Guide to 3D Web Development, a project-based course with everything you need to know to start creating professional 3D web experiences with Three.js and React. Link in the description.

Thank you for watching. I hope you enjoyed this video. Please hit the like button to help this channel be more visible to other creative developers, and don't forget to subscribe so you don't miss my upcoming tutorials. If you want to continue your 3D web development journey, have a look at my course or watch one of my other videos, like this one, available here.