Okay, hello everyone, welcome to our conversational AI reading group. Today we have the honor of hosting Alexandre Défossez, co-founder of Kyutai, a nonprofit lab for research in artificial intelligence based in Paris. Many of you are already familiar with some of his works: Moshi, EnCodec, AudioCraft. So without further ado, the stage is all yours.

Okay, thank you very much for the invitation to present at your reading group. Today I am going to discuss Moshi, but not only Moshi. Moshi is our speech-to-speech model for real-time dialogue, but it is also a fairly generic approach to modeling multiple streams of audio at the same time, and at the end I will show an extension to live translation.

First, the team behind Moshi: myself, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave and Neil Zeghidour. It is also a good time to thank our funding donors, Xavier Niel, Rodolphe Saadé and Eric Schmidt; thanks to them we have this very nice lab. Kyutai is a nonprofit that opened a year and a half ago in Paris, and our focus is open-source and open-science research. Last July we released a demo of Moshi, then we open-sourced it in September. Later we published Helium 2B, a small multilingual foundation model, and more recently we released Hibiki, a live speech-to-speech translation model, from French to English at the moment, that can run on device, even on a mobile phone.

Our initial focus is mainly multimodal LLMs, in particular generative models. At the moment we have a strong emphasis on speech, partly because Neil and I were the two people in the core team working on speech, so we leaned in that direction, but we are also studying NLP, vision, and how to combine them. The goal is first to open up the technology, and also to train people. Neil and I had the opportunity to do our PhDs part-time at Meta, at FAIR, and it was a really nice environment to learn deep learning while staying in contact with industry; we want to offer the same possibility, but outside a major US company and with a stronger focus on the independence of France and Europe.

Let me start with the motivation for building this kind of AI assistant. The observation when we started, a year and a half ago, was that speech is an important medium for interaction between humans, but it is still somewhat suboptimal when interacting with computers, and there are a number of differences between a human-to-human conversation and a human-to-computer one. Two people can have a really fluent conversation with many interruptions and back-channeling; we can interrupt each other to quickly intervene, we can even talk on top of one another and still understand what the other is saying, and even if we try not to do it too much because it is rude, it still works. We believe this matters so that we forget we are talking to a computer and can speak in the most natural way, without being concerned about whether the model will understand that this is a request for it, whether we should phrase things in specific ways, or whether we should mark specific pauses. Humans also have the ability to understand all the paralinguistic content, the emotion and the tone, which carry very important information.
When we started this project, the main approach to this kind of interaction was cascaded systems. A cascaded system consists of a number of components: it starts with voice activity detection, determining whether someone is actually speaking at the moment; once we assume that their turn is over, we run automatic speech recognition, turn that into text, feed it to a text language model, then take its reply and convert it to speech with text-to-speech. Obviously this takes many steps and adds a lot of overhead. For instance, if you remember the early versions of the voice mode for GPT, and it is still a little bit the case, you would need to stay silent for a few seconds for it to understand that your turn was over, and in some of my experiments just closing the fridge door would re-trigger the voice activity detection, making it think I was not finished when I clearly was. So our question was: can we merge all these steps into a single audio language model? A particularly strong motivation for us was to achieve full duplex rather than half duplex. Half duplex is a bit like talking on a walkie-talkie: either it is your turn, you speak and everyone else listens, or it is not your turn and you can only listen; there can never be two people speaking at the same time. That is obviously not what happens when two people have a conversation: a phone call is not a walkie-talkie, anyone can speak at any time, and we are still able to process that information.

So that is a bit of the motivation. Now let us dive into the neural audio codec part, which is an essential brick for building these models; I think it is also important for understanding why speech is so different from text and comes with unique challenges. If we start from the waveform, in our case it is an oscillating curve sampled at 24,000 Hz, so 24,000 values per second that you need to model just to output one second of speech. It is obviously very different from predicting words, and even compared to images, one second of audio is roughly comparable to a 150-by-150-pixel image; but you can already put a lot of content in a 150-by-150 image, whereas one second of audio does not contain that much, maybe two or three words. The original representation is really the furthest you can imagine from a compact semantic representation. Audio tokenization is the essential step that lets us go from a waveform to a sequence, in fact multiple sequences, of tokens at a reasonably low frame rate. In our case the frame rate is 12.5 Hz, and this frame rate determines how often you need to take an autoregressive step with a Transformer language model, so the lower the better. Even though we put a lot of effort in this work into lowering the frame rate, it is still much higher than text: text, at an average speaking rate for English, is around three tokens per second, so we are still four times higher, which means that to process the same amount of content you need to do four times more forward passes in your model, but that is still somewhat in the manageable realm.
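A quick back-of-the-envelope computation using the numbers quoted above (the code is purely illustrative, not taken from the Moshi codebase):

```python
# Rough arithmetic on frame rates and token counts, as described in the talk.
sample_rate = 24_000        # waveform samples per second
frame_rate = 12.5           # codec frames per second (Mimi)
samples_per_frame = sample_rate / frame_rate
print(samples_per_frame)    # 1920.0 samples, i.e. 80 ms of audio per step

codebooks = 8               # 1 semantic + 7 acoustic tokens per frame
audio_tokens_per_second = frame_rate * codebooks
text_tokens_per_second = 3  # rough average for English text
print(audio_tokens_per_second)                            # 100 tokens per second of audio
print(frame_rate / text_tokens_per_second)                # ~4x more *temporal* steps than text
print(audio_tokens_per_second / text_tokens_per_second)   # ~33x more tokens overall
```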
The other thing is that it is not really possible to represent the audio as a single sequence of tokens at 12.5 Hz, unless you have very strong priors on the kind of audio. In our case such priors could almost work: if we knew the speaker identity in advance and knew it would never change, and knew we never wanted to output any other kind of sound, then we could probably get lower. But without those assumptions, if you want your model to be able to output any audio, with any voice, any sound effects and so on, it is very hard to go lower than having several of these sequences.

At the basic level, ignoring interruptions and turns for now, we just want to get this kind of sequence of tokens, assuming we are doing unconditional generation: we have an audio prefix and we want to generate a continuation from it. Our neural audio codec, Mimi, is based on work that was initially done by Neil Zeghidour and his team at Google DeepMind, and then improved on with my own team at FAIR. The idea is to have an encoder and a decoder, so an autoencoder, with an information bottleneck in the middle and adversarial losses for reconstruction. The adversarial loss is mandatory here because there is just too much information in audio that our brain is not sensitive to, and this information represents the bulk of what is in the signal: most of what is stored in a waveform is high-frequency content with specific phase information to which our brain is mostly oblivious, at least as long as it has roughly the right characteristics. If we just tried to reconstruct with a mel-spectrogram loss, we would get horrible sounds. That is what I represent on the bottom right: we have the set of all waveforms that match a given amplitude spectrogram, most of which sound absolutely horrible, and we have the set of all waveforms that are realistic; somewhere at the intersection we find the waveform we want. This allows us to enormously reduce the complexity of the space to model, and it is a bit magical that both the adversarial loss and our own perception converge to being sensitive to similar aspects of the audio. So we have this reconstruction loss, which as I mentioned is based on the spectrogram, and then the feature matching loss, an interesting loss which says that the activations inside the adversarial network must be the same between the ground-truth waveform and the reconstructed one. That will be important a bit later. Then there is the residual vector quantization: initially the motivation for it was just audio compression, but it turns out that residual vector quantization is perfect for language modeling and has a number of very convenient characteristics. What comes out of the residual vector quantization is what we call the acoustic tokens.
They really encode all the acoustic characteristics: you can invert them back to a waveform, not necessarily exactly the same as the input, but one that is perceptually very close. However, they do not contain high-level information. You could think that, because of the training, the model would converge to understanding speech and decide "this is just someone speaking, so I can transmit the speaker identity through the bottleneck and then only the words", but that is not at all what happens. The model never converges to the big picture; it stays in a local optimum of how to efficiently encode the local acoustic characteristics at each time step. That is not necessarily a big issue from the point of view of reconstruction, because we still get very low bitrates, but it is a problem for language modeling. A number of experiments I have run show that if you only have acoustic tokens and try to model speech with a 300-million-parameter language model, you only get babbling, a bit like Sims-style gibberish. With 1.5 billion parameters you get words that have no meaning together; with 3 billion parameters you maybe start having a tiny bit of a sentence; with 7 billion it is a bit better. The cost in training time and number of parameters just to get basic sentences out of your model is gigantic, and if you add a semantic token it suddenly becomes much, much better.

Now let me describe residual vector quantization a bit better. It is really an essential part, and it also matters for the next part on language modeling, to understand how all those different streams of tokens are related. We get continuous latents out of the encoder at our frame rate, 12.5 Hz, living in some space, in this case probably 256 dimensions. We then apply a series of quantization operations, each attached to a codebook. A codebook is just a matrix with N rows, where N is the cardinality of the codebook, and the same dimension as the latent vector z. The projection operation consists of looking in the matrix Q for the row that is nearest, in L2 distance, to the input vector z. Once we have found this nearest row, it has an index, say 81, which is the discrete quantized value, and we also have a continuous quantized value, which is simply the actual row of the codebook matrix corresponding to that index. We subtract the continuous quantized value from the input to get a first residual, z2, and we store the index for later use. Then we repeat the process with a second codebook matrix: we again project the residual onto its closest entry, which gives a second discrete index and a second continuous quantized value that we subtract, and so on.
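Here is a minimal sketch of that quantization loop; the shapes and codebook sizes are illustrative, not the actual Mimi configuration.

```python
import torch

def rvq_encode(z, codebooks):
    """Residual vector quantization of a single latent vector z.

    z: tensor of shape (dim,), e.g. dim = 256 at 12.5 Hz.
    codebooks: list of matrices, each of shape (num_entries, dim).
    Returns the discrete indices (one per codebook) and the summed
    continuous quantized value that the decoder would receive.
    """
    residual = z
    indices, quantized = [], torch.zeros_like(z)
    for Q in codebooks:
        # Nearest row of the codebook in L2 distance.
        dists = torch.cdist(residual.unsqueeze(0), Q).squeeze(0)
        idx = int(torch.argmin(dists))
        indices.append(idx)
        # Add the continuous quantized value and subtract it from the residual.
        quantized = quantized + Q[idx]
        residual = residual - Q[idx]
    return indices, quantized

# Toy usage: 8 codebooks of 2048 entries each, 256-dim latents.
codebooks = [torch.randn(2048, 256) for _ in range(8)]
idx, q = rvq_encode(torch.randn(256), codebooks)
print(idx)          # eight discrete tokens for this 80 ms frame
print(q.shape)      # torch.Size([256]), approximation of the input latent
```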
You can see that the deeper you go in the residual vector quantization, the more you are just fitting residuals, which intuitively become smaller and smaller in scale. The first codebook tends to capture high-variance phenomena in the latent structure, and as you go down you get to finer and finer details; the intuition is that as you model finer and finer details, they have less and less impact on the reconstructed waveform. At some point we stop and just pretend that whatever remains is essentially zero; we sum all the continuous quantized values, which gives a pretty good approximation of the latent, and we feed that to the decoder for reconstruction. The codebook matrices Q are learned through an exponential moving average: each time a latent is assigned to a row of Q, we update that row a little bit in the direction of z. Similarly, there is a commitment loss that forces z to be close to its quantized value. Those mechanisms come from the VQ-VAE literature in images. For the gradient, this whole process is not really differentiable because of the hard projection onto a single row, so during the backward pass we just pretend that the whole operation is the identity function: the straight-through estimator trick. Finally, one nice thing from the original SoundStream paper is to stochastically sample the quantization depth during training, which lets the decoder handle different numbers of codebooks: say there are 32 codebooks in total, it can work with 32, or 16, or 8, or 4, and we do not have to retrain each time; we know it can handle all those cases. So you see why we get multiple sequences of acoustic tokens: they go from modeling coarse aspects of the audio to finer details. With only two acoustic codebooks you could still understand something, but it would sound very rough; when you add all the codebooks you get pretty good reconstruction of the reverberation and everything that makes the result pleasant to the ear.

Now I can give a quick overview of Mimi, which is our codec, and of what we changed compared to previous models. First, we reduced the frame rate a lot: the best frame rates at the time were around 50 Hz, and we improved that to 12.5 Hz. We added some Transformer layers before and after the bottleneck, which improved the performance a bit. And we use semantic distillation: we have a pretrained WavLM model that we distill into the first codebook, so the task of the first codebook is not really to participate in the reconstruction, but to approximate as well as possible the WavLM latent representation for a given audio frame.
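A hedged sketch of what such a distillation objective could look like, assuming a cosine-distance formulation and a learned projection of the teacher features; Mimi's exact loss may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def semantic_distillation_loss(first_codebook_out, wavlm_feats, proj):
    """Force the first (semantic) quantizer output to mimic a frozen teacher.

    first_codebook_out: (batch, frames, dim), continuous output of the first
        quantizer at 12.5 Hz.
    wavlm_feats: (batch, frames, wavlm_dim), teacher features already resampled
        to the same 12.5 Hz grid.
    proj: learned projection mapping wavlm_dim -> dim (an assumption here).
    """
    target = proj(wavlm_feats)
    cos = F.cosine_similarity(first_codebook_out, target, dim=-1)
    return (1.0 - cos).mean()

# Toy usage with random tensors standing in for real features.
proj = nn.Linear(1024, 256)
loss = semantic_distillation_loss(torch.randn(2, 25, 256), torch.randn(2, 25, 1024), proj)
print(float(loss))
```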
One very interesting point is that although WavLM is non-causal, our entire model is causal, so we manage to get a really good causal approximation of WavLM. The reason we need a causal approximation is that a speech-to-speech conversational agent cannot see the future: to process the audio from the user, you do not know what the user is going to say next, so you need to extract that representation in a fully causal way. The same holds for what the model itself says: even if the model knows what it intends to say, maybe a full paragraph, the user might interrupt it long before it can complete it. The advantage of causality is that your representation is always compatible with real-world, real-time streaming usage.

Another interesting finding was that we could get rid of the reconstruction loss, keeping only the adversarial and feature matching losses. Although that deteriorates the objective metrics, it improves all the subjective evaluations we ran. This relates to the feature matching loss I mentioned earlier: if there were only an adversarial loss, you might think there would be no reason for the reconstructed audio to match the ground truth in any way, but the feature matching loss inside the adversarial setup turns out to be a sufficient, and in fact better, reconstruction loss than the one we had designed by hand.

I think that is it for the neural audio codec part, so if anyone has a question about it, now is the time.

I wonder, have you ever probed the encoder part of this model to see if it really encodes semantics?

We know it mostly does not encode semantics, and the reason we know is simply that when we put a language model on top of it, it does not work very well unless we have the semantic token. It kind of works if you have pretty strong conditioning: for instance, you can do text-to-speech directly in this latent space if you know the full text, and generate the speech that corresponds to it. Without that text structure given to you, it does not work very well; that is the difference that appears when training a language model on top. There are some works studying that, but even with flow matching, for instance, you would need many steps of flow-matching decoding, so there is definitely a trade-off. In our experiments we focused on the adversarial approach because it is the established technology, but flow matching or diffusion is certainly an interesting alternative.

If I can ask another question before moving on, it is about causality: what is the performance price to pay to make the model causal?

That is a good question. I do not think we ran formal studies of that, because we focused purely on causal models. I do not think the gap is gigantic; there is obviously a bit of one, but we still managed to reach the level of quality we wanted, so I do not have a precise answer. What I know for sure is that having no questions to ask about how the codec can be used, because if it is causal any use case is valid, far outweighs the potential benefit of gaining a little audio quality for specific use cases. A causal model can do anything a non-causal model can do, but the reverse is not true.

Then we have a question about the data. We use a large-scale dataset, but Mimi is actually not trained on that much data: I think it is trained for about 100,000 updates. With these adversarial models, they become usable even after about 25,000 updates; you get a bit of gain from training longer, but they typically train extremely fast, in particular in comparison with models that have a diffusion or flow objective for the vocoder part, in my experience.
Okay, so now I am going to switch to the joint sequence modeling. There are multiple motivations for it: first, as we just saw, even a single audio sequence gives us multiple tokens per step; but remember also that we want full-duplex models, which means we are going to have multiple audio streams, and we will see that we even add a text stream in the end.

I want to start with an early-stage prototype that did not have all that: the audio was on a single stream and there were explicit changes of turn, which at the time were triggered by pressing the space bar. In the demo: "You speak like a pirate." "Oh, you mean in a fake accent?" "Yeah, exactly." "I'm a southerner and I know how to talk with the drawl. I mean, I know how to say y'all and thank y'all, and I can whip up a good old barbie." That was one of the first times we actually interacted with an early version of Moshi, but obviously this business of pressing the space bar is not very convenient.

So now let us see how we handle the audio specifically. As I mentioned, we have chunks of 80 milliseconds of audio; because everything is causal, any time we are able to output all the tokens for one time step, we know we can output 80 milliseconds of audio. Per step we have one semantic token and seven acoustic tokens. There were a number of works that tried to model these tokens, starting with AudioLM from Google, and what they did was mostly to flatten all the codebooks. That is obviously a big price to pay: if we flatten them, as in the top-right figure, we are back to the simple Transformer approach of modeling a single stream, but we have to do eight times more autoregressive steps, which for a real-time application is not possible. With my former team we worked on a number of other approaches, one of them being the delay pattern. The first thing you might think of is: why not predict all the codebooks of a given time step in parallel? The reason is that this would only work if the mutual information between the tokens of a single time step were zero, which is obviously not at all the case, since each is the residual of the previous one, so they are very clearly dependent. One thing we found was that if we shift the tokens in time, offsetting each depth by one step, that reduces the mutual information a lot and makes this kind of parallel approach work much better, but it also increases the latency quite a bit. Then there was an interesting idea introduced for images, the RQ-Transformer, also applied to audio with UniAudio: model the inter-codebook dependency within a single time step using a smaller Transformer. This smaller Transformer sees the codebooks of a single time step as a short sequence, and gets all its context from the output of the temporal Transformer, the big one. In the case of Moshi the temporal Transformer has 7 billion parameters and the depth Transformer is much smaller, and the temporal Transformer basically receives, as input at each step, the sum of the embeddings of the codebook tokens from the previous time step.
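A minimal, self-contained sketch of this temporal-plus-depth factorization; the dimensions, module choices and sampling are illustrative placeholders (for instance a GRU stands in for the big causal Transformer), not the actual Moshi architecture.

```python
import torch
import torch.nn as nn

K, V, D = 8, 2048, 64                      # codebooks, vocab per codebook, model dim
embed = nn.ModuleList(nn.Embedding(V, D) for _ in range(K))
temporal = nn.GRU(D, D, batch_first=True)  # stand-in for the big causal Transformer
depth_heads = nn.ModuleList(nn.Linear(D, V) for _ in range(K))  # per-codebook weights

def generate_step(prev_tokens, state):
    """Produce the K tokens of one 80 ms frame given the previous frame."""
    # Backbone input: sum of the embeddings of the previous step's codebooks.
    x = sum(embed[k](torch.tensor([[prev_tokens[k]]])) for k in range(K))
    context, state = temporal(x, state)           # one temporal step
    h, tokens = context[:, 0], []
    for k in range(K):                            # depth: one sub-step per codebook
        logits = depth_heads[k](h)
        tok = int(torch.distributions.Categorical(logits=logits).sample())
        tokens.append(tok)
        h = h + embed[k](torch.tensor([tok]))     # condition on the choice just made
    return tokens, state

tokens, state = generate_step([0] * K, None)
print(tokens)   # eight codebook indices for the next 80 ms of audio
```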
Thanks to this factorization, we are able to model the dependency between those codebooks while introducing minimal delay and minimal computation overhead. We also added some improvements compared to the original RQ-Transformer. It turns out there is still a benefit in delaying the acoustic tokens with respect to the semantic token; I can show the figure now. A vertical column is a single step of the big Transformer: going from left to right progresses through the steps of the big Transformer, and going from bottom to top progresses autoregressively through the depth Transformer. What we do is predict the semantic token for a given time step one step ahead of the acoustic tokens for that same time step. This again helps break down the dependency within a single column, between the acoustic and semantic tokens, so that it can be modeled by the big Transformer; otherwise all those decisions would have to be made by the depth Transformer, which has much less capacity.

Then there are the per-codebook parameters: in the depth Transformer, each step, where a step corresponds to a depth in the RVQ token structure, uses a different set of weights for the linear projections in the attention and in the feed-forward layers. This does not really change the runtime, but it can have a strong impact on the number of parameters, which can start being a problem: if you want to generate very high quality audio with 32 codebooks, the number of parameters in the depth Transformer can become larger than in the temporal Transformer. That is something we improved a bit in Hibiki, but having those per-codebook parameters really improved the performance.

So now we have the ability to model any number of streams and the dependencies between them, and we thought, why stop there? There was already a number of works, for instance Spirit LM, combining text and audio, but usually they do it in a way where part of the content is in text only and part in audio only. Other methods, like pSLM, put all the text left-aligned at the start of the sequence, with the audio following. The issue is that you do not necessarily know what the future text is going to be, because you do not know the future: if there is an interruption, you may have put a full paragraph of text at the beginning but get interrupted after the first sentence. Instead, we chose to add a text stream that is closely aligned with the audio, at the word level: each word is placed in a stream that has the same frame rate as the audio tokens, 12.5 Hz, as close as possible to the beginning of the word, and we put special padding tokens between the words so that the next word is roughly aligned with its audio. Most of the time this is not a problem, because text is more compact than audio; as I said, it is around three tokens per second, so we mostly insert padding tokens, and it is not as if the text could be less compact than the audio.
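A hedged sketch of building such a word-aligned text stream: each word's tokens are dropped into the 12.5 Hz grid at the frame closest to the word's start time, with padding elsewhere. The token ids and the PAD symbol are placeholders, not the real Moshi vocabulary.

```python
FRAME_RATE = 12.5
PAD = 0

def build_text_stream(words, n_frames, tokenize):
    """words: list of (word, start_time_seconds) pairs from forced alignment.
    tokenize: any callable mapping a word to a list of integer token ids."""
    stream = [PAD] * n_frames
    for word, start in words:
        frame = min(int(round(start * FRAME_RATE)), n_frames - 1)
        for tok in tokenize(word):
            if frame >= n_frames:
                break                    # ran out of frames before the next word
            if stream[frame] == PAD:     # keep earlier words if frames collide
                stream[frame] = tok
            frame += 1
    return stream

# Toy usage with a fake one-token-per-word "tokenizer".
words = [("hello", 0.10), ("I'm", 0.55), ("Moshi", 0.80)]
print(build_text_stream(words, n_frames=25, tokenize=lambda w: [hash(w) % 1000 + 1]))
```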
During pre-training we build this aligned text stream for every speaker in the audio; then, during the fine-tuning that turns it into a conversational model, we change it so that only the model's own speech is transcribed into the text stream. We call that the inner monologue, and it acts a little like a super-semantic token, because over the long run it makes generation much more stable. For instance, we ran experiments where we let the model generate in pure continuation mode for a long time and looked at when the generation breaks down: without the text stream it breaks down much faster than with it. So it provides a good scaffolding. The model outputs this text itself: it emits the text tokens for its own speech at the same time as it emits the audio tokens, so it does not add latency, and beyond its effect on training it is purely aesthetic, since we do not need it for anything else. We show it to the user, and it also sometimes lets us quickly check what the model is replying; just by reading it we mostly know what the model is saying without having to process the audio further, even if it is not always 100% accurate.

To handle the user, we have a second set of audio tokens corresponding to the user's audio stream. During part of the training we still have a loss on it, i.e. we model it, which allows us to do a kind of self-play with the model for automatic evaluation. At the end we stop training it and discard everything related to modeling the user's audio stream, because we obviously never need to generate it; we can just force it to whatever comes from the user.

That is the overall structure. For instance, the model is going to say "Hello, I'm Moshi": the text stream is predicted directly as the output of the big Transformer, then the semantic token is predicted by the first step of the depth Transformer, and the acoustic tokens are predicted by the next steps of the depth Transformer, with the shift in time. When the user is speaking, we continuously feed the model the audio tokens from the user, so there is no explicit signal for whether the user is asking something or not; it is based purely on the content. The same holds for the model: when it is not speaking, we still sample its output and it mostly outputs silence. This means there is potentially no latency, no lost time, when we go from one speaker to the other.
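To summarize the layout, here is an illustrative sketch of what is modeled jointly at every 12.5 Hz step; the field names and ordering are mine, not the exact Moshi convention, and the token values are arbitrary.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FrameTokens:
    text: int                  # inner monologue: word piece or padding (model side only)
    moshi_semantic: int        # model's own audio, 1 semantic token
    moshi_acoustic: List[int]  # model's own audio, 7 acoustic tokens (delayed by one step)
    user_semantic: int         # user's audio, forced from the codec at inference
    user_acoustic: List[int]   # user's audio, likewise never sampled at inference

# At inference, only the first three fields are sampled by the model; the user
# fields are overwritten with whatever the incoming audio encodes to, so the
# model "hears" the user continuously, with no explicit turn-taking signal.
frame = FrameTokens(text=421, moshi_semantic=1033,
                    moshi_acoustic=[87, 912, 5, 640, 301, 77, 1500],
                    user_semantic=256, user_acoustic=[3, 44, 199, 1023, 8, 760, 52])
print(frame)
```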
Is there any question on this modeling aspect? There was a question about scaling laws. So, no, we did not look at scaling laws, mostly because I find there is always a choice to make between running scaling laws and actually testing various features. I think scaling laws are interesting once you have a definitive design and you really want to push the numbers, but in this case a lot of the work went into things like the impact of the acoustic delays, which proved to be the most important, and once we had settled all those details there was not really much time or compute left to study how the various parameters scale. One thing to keep in mind is that if you want your model to run on device, you have a cap on its size, here seven billion parameters. A scaling law would tell you, for a fixed compute budget, how big the model should be, but we could not make the model bigger than that anyway because of the real-time inference constraint. Given all those restrictions, the need for scaling laws mostly goes away: we just trained with as much data as we could, for as long as we possibly could before the release.

Then, where did we get the word alignments? They come from whisper-timestamped, and we have some tricks for that. For instance, we noticed that the medium model gives much better timestamps than large-v3. We also try to detect things like changes of language, because one annoying thing with Whisper is that it will translate when the language switches, and by doing so it completely messes up the timestamps, which can introduce huge gaps between the reported timestamps and the real ones. In general we noticed that whisper-timestamped gave much more accurate timestamps than, for instance, the Hugging Face implementation, but it is also much slower, so we typically had several tiers of annotation: a first pass done with the Hugging Face implementation, and then a much finer, higher-quality annotation with whisper-timestamped. Beyond that, we try to detect aggregates of words, words that are too close together, word repetitions, and timestamps going backward in time. We do not filter these out during pre-training, but during the fine-tuning stages we get rid of them.
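A hedged sketch of the kind of timestamp sanity checks just mentioned; the thresholds below are invented for illustration, not the values used for Moshi.

```python
def suspicious_words(words, max_words_per_second=6.0):
    """words: list of dicts like {"word": str, "start": float, "end": float}
    as produced by a Whisper-style aligner. Returns indices of words whose
    timing looks broken (going backward, repeated, or impossibly dense)."""
    bad = set()
    for i in range(1, len(words)):
        prev, cur = words[i - 1], words[i]
        if cur["start"] < prev["end"]:                           # time going backward / overlap
            bad.add(i)
        if cur["word"].strip().lower() == prev["word"].strip().lower():
            bad.add(i)                                           # likely hallucinated repetition
        window = words[max(0, i - 5):i + 1]
        span = max(window[-1]["end"] - window[0]["start"], 1e-6)
        if len(window) / span > max_words_per_second:            # suspicious word density
            bad.add(i)
    return sorted(bad)

words = [{"word": "hello", "start": 0.1, "end": 0.4},
         {"word": "hello", "start": 0.4, "end": 0.7},
         {"word": "world", "start": 0.3, "end": 0.9}]
print(suspicious_words(words))   # [1, 2]
```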
Okay, about training: there are actually a lot of stages, so it is not the simplest training recipe ever. One thing to keep in mind is that at the time we started there was no open-source text model with a license that was open enough, if I remember correctly; the licenses were all somewhat restrictive, for instance forcing you to call your model "Llama something". We also found that it was important to keep training on text-only data even once we start training on audio, which explains why we kept a text dataset; that also means we needed a good text dataset, and a good text language model was a natural thing to train in order to evaluate it. Then we have several phases: in the first phase we train only on audio that has a single channel. In the post-training phases we train on audio with two channels, but the channels are emulated: we run diarization and put one speaker on the left channel and all the other speakers on the right channel. This is imperfect and produces exact zeros, the waveform being absolutely silent on one channel while the other speaker talks, which is not very natural. We then fine-tune on Fisher, which is rather low quality, 8 kHz, and finally we fine-tune on a synthetic dataset that tries to give Moshi its personality. Here is a quick example of the kind of data we use: "Hello, what's going on? Do you watch a drama series? If so, which one?" "Yeah, I do. There are so many good ones out there. Can you tell me about the one you watch?" We generated that with a TTS based on a similar architecture, in fact based on the same pre-training.

Given the time left, a few words on results. At the time we had really strong results, especially for a causal codec. Here you can see a few comparisons: some non-causal models can have, for instance, a better ABX score, which measures the semantic aspect, but at a much higher bitrate. Overall we were able to beat the existing codecs while being causal and at a lower frame rate. There have been a number of interesting codec works since we released Mimi, which is always exciting to see.

One really interesting aspect: we put the emphasis on the naturalness of the interaction, but what surprised us was how hard it was for the model to be factual, even though we had maybe 10,000 hours of synthetic instruct data. In particular, if we compare the rate of correct answers on TriviaQA with the base text model, it is much lower. Part of it is explained by the pipeline: you need to take the question and synthesize it as audio, then take the model's answer and run ASR on it, and any of those parts can fail, especially for named entities. But that is not the full story: just listening manually to the audio and the questions, we see that the model is not as smart as the base Helium model. I think that is a very interesting open question. You may wonder why we did not use text-based instruct data; the reason is that such data is very hard to synthesize as audio. It is really geared toward the way people use LLMs in text: it is either multiple-choice questions for benchmarks, or lists of answers, or it includes code in the replies, so the style is very far from what we get with more speech-oriented synthetic data.

In terms of demos, I will not necessarily show all of them. Here is one: "Let's say we're in hyperspace, and now we have five months. I want to get to know you a bit better. Why did you decide to join Starfleet?" "I wanted to make a difference, to make a difference in the world." "Okay, how long have you been in Starfleet?" "I've been in Starfleet for about six years." "Okay, that's a long time. I guess you've done a lot of other missions; which one was the most exciting?" "The most exciting mission was when we discovered a new planet with intelligent life." "Oh my god, that must..." "It was incredible."

Then one thing I want to show, and a big advantage over other methods, is the ability of this approach to work even under very strong noise. In this case there was some pretty intense construction work going on: "Good day, how are you doing?" "Hey, I'm doing some construction work right now, as you can see, and I'm looking for some advice. Can you help me with that?" "Sure, I'm happy to help. What's the issue you're facing with your project?" "So I need to build a wall and I'm a bit unsure about the proper material. What do you recommend?" "Well, before we jump into that, can you tell me more about the project and what kind of wall you're trying to build?"
We could find more examples, and I will give another one with Hibiki afterwards. On the other aspects: overall we get about 200 milliseconds of latency on an L4 GPU, and it also runs on a MacBook Pro with int4 quantization. We have an online demo, and we know there are still a few issues: sometimes Moshi tends to reply too quickly, or stays silent at the end of a question, or misses the point, and we are excited to keep working on all those aspects. I am going to skip this.

Now I want to quickly move to Hibiki, which uses a very similar approach for speech-to-speech translation, with roughly the same team plus our Master's student at the time, Tom Labiausse, who is now going to start a PhD with us. The idea is simply to reuse the structure you have seen for conversational AI: we again have two streams of audio. There is the source stream, for instance my speech in French, and then what the model outputs is the same content but in English, trying to keep the voice as close as possible to the original. In terms of modeling there were not too many challenges; it is really the same architecture, we just changed the definition of the two audio streams. Then we needed to build synthetic data. For that we started from audio data in French, transcribed it, translated the transcript with a MADLAD model, and synthesized the translation in English. We also estimated the optimal delay by looking, in the text domain, at how much source context is needed for the log-likelihood of a given output token to increase sharply. For example, if I am translating "I will translate..." and the next word is "in" or "into", once the source word that determines it, shown in purple, is in the input context, you have enough information to output it. That gave us rough estimates of the optimal delays, and we could then use them in two ways: either insert those delays directly, by inserting silences in the output whenever it runs too far ahead of the input, or use a kind of alignment-aware TTS, where we can specify that a given word in the output must not start before a given timestamp. The latter gives much smoother transitions between words; the model can even produce almost human-like pauses, which would not be the case with silence insertion.
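A hedged sketch of that delay-estimation idea: for each target word, find how much source context is needed before its log-likelihood under a text translation model jumps up. The `loglik` callable and the jump threshold are hypothetical, not an actual Kyutai or MADLAD API.

```python
def optimal_delay(source_words, target_word, target_prefix, loglik, jump=2.0):
    """Return the number of source words after which emitting `target_word`
    becomes 'safe', i.e. its conditional log-likelihood sharply increases."""
    prev = None
    for n in range(1, len(source_words) + 1):
        context = " ".join(source_words[:n])
        score = loglik(target_word, context=context, target_prefix=target_prefix)
        if prev is not None and score - prev > jump:
            return n             # the source word revealed at position n was decisive
        prev = score
    return len(source_words)     # fall back to waiting for the full source

# Toy usage with a dummy scorer that only "knows" the answer once the French
# word "dans" has been revealed.
dummy = lambda w, context, target_prefix: 0.0 if "dans" in context else -10.0
print(optimal_delay("je vais traduire dans la salle".split(), "into", "I will translate", dummy))
```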
So yes, we released that, and we obtained particularly strong results in terms of quality and speaker similarity compared to existing methods like Seamless, and it also runs on a phone. Let me first show a demo we recorded at a party, again to showcase that, because the approach is speech-to-speech, it can be really robust; robustness is also something we built in during training with many augmentations. In the demo, translated live from French: "My name is Alexandre and I am testing this model in extreme conditions. At this moment there is very loud music and I can barely hear what I am saying; however, our model is able to translate my voice live into English, on a mobile phone. This is a 100% French innovation that comes from our open research laboratory."

And now I am going to try to do the same thing live. Here I have my phone, connected to my computer for the audio recording, and I will take the microphone a bit away from me. [Speaking in French; the model translates live, partly garbled in the recording: this is our live translation model, translating from French to English, a model developed in our artificial intelligence laboratory, running with all the data on the phone.] Thank you very much for listening to my presentation, and I will now answer the questions that remain. I hope this worked nicely; I did not actually have the feedback in my headphones, so I could not tell, but the text looked okay. Yes, it did work well, thank you. Thank you for the presentation, Alexandre.

Okay, first question: can we use a transducer for streaming? I am not sure what is meant exactly; you mean replacing the depth Transformer or the temporal Transformer with a recurrent neural network? I am not very familiar with the transducer architecture, so I am not sure I can answer that question well.

How useful is the Mimi architecture for real-time speech-to-text; is it more or less computationally intensive than wav2vec or the Whisper encoder? The audio codec is much less expensive than, say, WavLM, which we used: it is around 200 million parameters, but that is for the encoder and decoder together, so maybe around 100 million for the encoder alone, which is what would be comparable to wav2vec or the Whisper encoder. I am not very familiar with the Whisper encoder specifically, but overall it is cheap and can run in real time: the whole encoder-decoder process runs in real time on a laptop CPU. On a mobile-phone CPU it is a bit more difficult; we tried running it in the browser, which obviously adds a layer of overhead, and we could run it in real time in the browser on a laptop but not on a phone. So it is quite reasonable, but it could be made smaller for many embedded applications.

Was the noise augmentation I mentioned for robustness applied to Mimi? No, we do not do noise augmentation for Mimi, because we want Mimi to replicate the input audio as faithfully as possible, not to clean it. But we do it when training Moshi or Hibiki. First we add back some echo with a variable delay, because we have to account for feedback: when we do live demos in this setup it is perfect, the phone is plugged in with a cable and the audio goes straight through, but when we do live demos in person we obviously get a lot of feedback from the loudspeakers, and the model needs to handle that gracefully. So we add echo, and we add noise from the Deep Noise Suppression challenge, a lot of it, at various noise levels.
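A hedged sketch of this kind of augmentation (echo with a variable delay plus additive background noise); the gains, delays and mixing ranges are illustrative, not the recipe actually used for Moshi or Hibiki.

```python
import numpy as np

def augment(user_audio, model_audio, noise, sr=24_000, rng=np.random):
    """All inputs are float32 mono arrays of the same length and sample rate."""
    out = user_audio.copy()

    # Simulated loudspeaker feedback: the model's own voice leaks back into the
    # microphone after a random delay, at a random attenuated level.
    delay = int(rng.uniform(0.1, 0.5) * sr)
    gain = rng.uniform(0.05, 0.3)
    echo = np.zeros_like(out)
    echo[delay:] = model_audio[: max(len(out) - delay, 0)] * gain
    out += echo

    # Additive background noise (e.g. from a DNS-challenge-style corpus),
    # mixed in at a random level.
    start = rng.randint(0, max(len(noise) - len(out), 1))
    out += noise[start:start + len(out)] * rng.uniform(0.02, 0.5)
    return out

sr = 24_000
user = np.random.randn(sr * 2).astype(np.float32) * 0.1
model_voice = np.random.randn(sr * 2).astype(np.float32) * 0.1
noise = np.random.randn(sr * 10).astype(np.float32) * 0.1
print(augment(user, model_voice, noise, sr).shape)   # (48000,)
```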
Hibiki sounds quite good at preserving the speaker; can you talk a bit about that? Yes. For speaker identity, as I said, we started from French audio data and synthesized the English side with a TTS model that can be conditioned on the speaker. One really interesting thing is that when you train such a TTS model on English-only data and give it a French speaker as conditioning, it tends to preserve the speaker identity and also produce a French accent. It depends on the direction of generation: in this case we generate the English side with the TTS; we could also do the opposite, starting from English and synthesizing French, but at the time we did not have a French TTS. Because English is the output here, the accent would be different in the other direction, most likely American or whatever is in the training data. But yes, it all comes from the TTS being conditionable on the speaker.

Is it possible to prompt the backbone Transformer to include external tool calling? That is a really good question. At the moment we have not explored it much, but it is definitely on our roadmap, because it would unlock a lot of the potential. It could also allow reducing the size of a Moshi-style model, whose only mission would be to interact with the user, understand what is going on and fill the gaps while calling external tools, while still being able to process and understand the answers. I also think it would be a great way to avoid retraining this model every time you need a new application.

What do I think of Sesame? Sesame is very impressive, and yes, it is very similar. We obviously looked at it and tried to reverse-engineer, from the few figures and captions in the blog post, what is going on; there are not that many details. Part of the approach is very similar to ours: they use the Mimi codec, and they use this kind of depth Transformer to output the audio with minimal latency. We think the difference is that it is still turn-based: they mention explicitly in the blog post that they have an end-of-turn token that the model outputs when its answer is over, and from what they show in the figures it seems they accumulate the audio from the user until they think the turn is over, and then give all the text tokens for the request in a row, so it is not interleaved. In our case we do not use text tokens for the user at all, mostly because getting reliable text tokens from the user requires introducing significant latency, whereas they seem to transcribe the user's turn and give both the audio and the text tokens in one go. As far as I can tell, those are the modeling differences; beyond that, a lot of the difference is probably in the fine-tuning dataset, on which we have very few details, whether it is synthetic data or real recordings, and that can heavily impact things.

Do we plan to publish the training code and data? We plan to publish at least the fine-tuning code; that should come soon, in the next few months, both for Hibiki and Moshi. We are probably not going to open-source the full training code at this stage, because that would be a lot of work, and we are probably not going to publish the training data. We are considering whether to publish fine-tuning data, but nothing is decided at the moment.

How many hours of data were used to train Mimi? As I mentioned, it is around 100,000 updates, with a mini-batch of 64 and a duration of 20 seconds, so that is the amount of effective data we used; it is really a fraction of what was used to train Moshi.

Then there was a question about the integration with existing LLMs.
I think this relates to the earlier question about adding tooling to Moshi. Once you have tooling, it becomes much more straightforward. I think there are two things you would need that we do not have at the moment. The first would be the equivalent of a system prompt, where you can give Moshi context on how it should interact with the user and potentially on which tools it can access; the second is the tooling itself. Once you have both, I think you can pretty much get a one-stop model that handles the interaction in a natural way and delegates everything else to pluggable tools, which seems like a reasonable approach. For certain applications it might be worth fine-tuning Moshi directly, and as TTS technology improves it will probably become easier to produce synthetic training data. At the moment those are the two possible paths I see.

Have we tried using a more recent, stronger LLM instead of Helium? We had some early experiments with more recent models. I think we saw improvements on some metrics, in particular TSC and sSC, Topic StoryCloze and spoken StoryCloze, which are metrics specifically designed to judge the audio understanding of the model; we were seeing improvements when starting from a stronger model. Early results also suggest that the gap shrinks somewhat during fine-tuning, so the fine-tuning data seems to matter a lot. In our case we also generated the fine-tuning data with Helium, so generating the instruct data with a larger language model would be another interesting direction.

Are we planning to expand Moshi and Hibiki to other languages? For Hibiki, for sure. It is a somewhat slow process each time, because to open up a new language we currently need per-channel data for one phase of the training, that is, conversational data where each speaker is separated, which we basically need to buy for each new language. That takes time, we need to check the quality of the data, and we also need a TTS in the new language to generate the synthetic data, so it is a lot of steps. It is a bit simpler for Hibiki because we do not need to regenerate all of the synthetic scripts. It is on our roadmap, but so far we have set it aside a bit to focus on new applications like Hibiki, and we hope to get back to it.

Then there was a question about the Japanese version of Moshi. Yes, they actually said that they fine-tuned from Moshi, and it was a very nice result; they managed to fine-tune it even without our release of a training codebase. We hope this will be a bit easier once we release the fine-tuning code, but it shows that it is already possible to adapt it. I tried it, in fact, but I do not speak Japanese well enough: I could say about two sentences and then had no clue what it was replying.

There was also a question about a specific implementation of the Moshi training and fine-tuning process and of the data processing, referencing a paper on captioning with adaptive attention on visual and non-visual words; since I do not know that paper, I am afraid it is going to be a bit hard for me to answer that question.
Okay, maybe I can add the final questions; I think there are a couple more. It seems that recently people also use only one codebook for quantization, and it seems effective; in Mimi you use up to 32 codebooks. Have you seen any advantage of multiple codebooks versus a single one with an expanded vocabulary?

More codebooks only improve audio quality. The thing is, even if the depth Transformer is small, at inference you run it one step at a time, so you need one forward pass for each codebook, and that is a lot of kernel scheduling and a lot of memory-bandwidth usage, because you need to move all the weights around just for one code prediction. At training time this is not necessarily an issue, but during inference, if you have 32 codebooks, the time spent in the depth Transformer can become even larger than the time spent in the backbone. For instance, for Hibiki, the version I showed on the phone uses only eight codebooks; when we run it offline on a GPU we use 16. So basically more codebooks improve audio quality, but they make it more challenging to run in real time on a wider range of devices. I see, thanks.
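To make that inference argument concrete, a tiny back-of-the-envelope count of forward passes per second of generated audio; this is pure arithmetic, not measured numbers.

```python
frame_rate = 12.5                                   # backbone steps per second

for n_codebooks in (8, 16, 32):
    backbone_forwards = frame_rate                  # one big-Transformer step per frame
    depth_forwards = frame_rate * n_codebooks       # one small-Transformer step per codebook
    print(n_codebooks, backbone_forwards, depth_forwards)
# 8  -> 12.5 backbone steps/s vs 100 depth steps/s
# 16 -> 12.5 vs 200
# 32 -> 12.5 vs 400: the many small forwards (kernel launches, weight traffic)
#       can end up dominating inference time even though each one is cheap.
```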
Okay, there is one more question. Can you hear me? Yes, we can hear you. All right, thank you for the talk, that was really great. A couple of very quick questions: first, can you comment a little on the number of GPUs used for training Moshi, the GPU hours, that kind of thing?

Yes. We typically did the pre-training on 96 H100s, and the full pre-training takes around a month. The fine-tuning is done with much less, eight or sixteen GPUs, and takes a few hours. So definitely the most expensive part of the process was getting the right architecture: all the tricks like the per-step weights in the depth Transformer, the acoustic delay, making all of those things work nicely during the pre-training phase. Then there was a second phase, getting the right instruct data, which was somewhat cheaper in terms of training, but we still needed to generate the synthetic data. I think at the end we had maybe 10,000 or 20,000 hours, and because we were iterating on it, at some point we were using maybe half of our cluster, a few hundred GPUs, for a day, just to regenerate everything with the TTS.

Very good. I have a more general question: many people these days, the majority, are trying to build multimodal LLMs by feeding discrete audio representations, but it is also possible to inject continuous representations. What is your point of view on that?

Sure. The reason we used a discrete representation, at least for the user side, is first that during the early phases it was really useful to have this self-play ability, where we generate both the user's and Moshi's turns. Also, in the TTS, when we generate the synthetic data, we have to handle the user stream as quantized tokens because we need to generate it. In fact, the architecture of our TTS, which we describe a bit in the paper, is based on the same pre-training as Moshi; we just change the delays between the audio and text streams a little. So it was natural to have everything discrete. As far as the user side is concerned, it is entirely possible to pass continuous representations, and that can probably be useful to pass information in a more compact way, similar to what is done in vision-language models, where you have a few virtual time steps carrying continuous representations. That is certainly an effective way of passing information, with one constraint for conversation: if you want to give the model access to an external audio document, it makes sense to compact it into a small number of time steps with high information density, potentially continuous, but if you want to handle the conversational aspect itself, you cannot compress too much, because you still need to be able to react with a reasonable latency. Great, thank you. Thank you, Michael.

Okay, I think we are done with the questions. Perfect, thanks a lot. Sorry we could not hear you earlier, probably because of the headset. Now you should be able to hear me; I had an app that was merging my microphone and the phone, but it crashed and then everything was gone. I just wanted to thank you again for the invitation and the opportunity to present our work. Thanks a lot again. Next week we have Julian Parker from Stability AI, who is going to present on making Transformers work for audio coding. Thanks everyone for joining, and looking forward to seeing you all next week. Thank you, bye, see you everyone, goodbye.