
Advancements in Conversational AI Models

Mar 15, 2025

Notes on the Conversational AI Reading Group presentation by Alexandre Défossez

Introduction

  • Presenter: Alexandre Défossez, co-founder of Kyutai, a nonprofit AI research lab in Paris.
  • Topic: Discussion on Moshi, a speech-to-speech model for real-time dialogue, and advancements in audio modeling and live translation.
  • Team Behind Moshi: Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, Neil Zeghidour.
  • Funding: Xavier Niel, Rodolphe Saadé, Eric Schmidt.

Overview of Kyutai

  • Nonprofit organization focused on open-source and open-science research.
  • Recent projects: Moshi, Helium 2B (a multilingual foundation model), and Hibiki (live speech translation from French to English).
  • Emphasis on multimodal LLMs, particularly speech, given the team's expertise.

Motivation for AI Assistants

  • Speech is a natural medium for humans, yet current human-computer speech interfaces fall short of natural conversation.
  • Human conversations are fluid, with interruptions and paralinguistic content such as emotion and tone.
  • Traditional systems chain cascading components (speech recognition, a text LLM, speech synthesis), each of which adds latency.
  • Aim: merge these steps into a single audio language model that supports full-duplex communication (see the latency sketch after this list).
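
To make the latency argument concrete, here is a minimal sketch that adds up the stages of a hypothetical cascaded pipeline and compares them with frame-level processing in a single full-duplex model. All stage names and numbers are illustrative placeholders, not figures from the talk.

```python
# Illustrative latency budget: cascaded voice assistant vs. a single
# full-duplex audio LM. All numbers are hypothetical placeholders.

CASCADE_STAGES_MS = {
    "end_of_turn_detection": 200,       # wait to detect that the user stopped speaking
    "speech_to_text": 300,              # ASR on the finished utterance
    "text_llm_first_token": 350,        # LLM prefill + first generated token
    "text_to_speech_first_audio": 250,  # TTS before audio starts playing
}

def cascade_latency_ms(stages: dict) -> int:
    """Stages run one after another, so their latencies add up."""
    return sum(stages.values())

# A full-duplex audio LM processes fixed-size frames continuously, so its
# response latency is roughly a small number of frames (80 ms frames and
# 2 frames are assumptions for this sketch).
FRAME_MS = 80
FRAMES_BEFORE_RESPONSE = 2

print("cascade     :", cascade_latency_ms(CASCADE_STAGES_MS), "ms")
print("full-duplex :", FRAME_MS * FRAMES_BEFORE_RESPONSE, "ms")
```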

Neural Audio Codecs - Mimi

  • Waveform Complexity: Raw audio is high-rate (24,000 samples per second at 24 kHz), which makes it expensive to model directly.
  • Audio Tokenization: Waveforms must be converted into discrete token sequences before language modeling.
  • Residual Vector Quantization: Compresses each audio frame into a small stack of discrete codes, used both for compression and for language modeling (see the sketch after this list).
  • Adversarial Losses: Crucial for realistic audio reconstruction.
  • Semantic Distillation: The first codebook is trained to approximate WavLM features so that it carries semantic information.
  • Causality: The codec is fully causal so it can run in streaming, real-time applications.
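
The sketch below illustrates the core residual vector quantization loop described above: each codebook quantizes the residual left by the previous one, so a continuous frame becomes a small stack of discrete tokens. Codebook count, size, and dimension are illustrative, not Mimi's actual configuration, and the semantic distillation of the first codebook is omitted.

```python
# Minimal RVQ sketch: quantize each encoder frame with a stack of codebooks,
# where every level quantizes the residual error left by the previous level.
import torch

class ResidualVectorQuantizer(torch.nn.Module):
    def __init__(self, num_codebooks: int = 8, codebook_size: int = 2048, dim: int = 512):
        super().__init__()
        # One learnable codebook per quantization level.
        self.codebooks = torch.nn.Parameter(torch.randn(num_codebooks, codebook_size, dim))

    def forward(self, frames: torch.Tensor):
        """frames: (batch, time, dim) continuous encoder outputs."""
        residual = frames
        quantized = torch.zeros_like(frames)
        codes = []
        for codebook in self.codebooks:
            # Squared distance from each residual vector to every codebook entry.
            dists = (residual.unsqueeze(-2) - codebook).pow(2).sum(-1)  # (batch, time, codebook_size)
            idx = dists.argmin(dim=-1)        # (batch, time) discrete token ids
            chosen = codebook[idx]            # (batch, time, dim) selected entries
            quantized = quantized + chosen
            residual = residual - chosen      # next level quantizes what is left
            codes.append(idx)
        # Stack of token ids (one per codebook) plus the reconstructed embedding.
        return torch.stack(codes, dim=-1), quantized

# Example: one second of 24 kHz audio encoded at ~12.5 frames/s gives ~13 frames;
# with 8 codebooks that is on the order of 100 tokens/s instead of 24,000 samples/s.
rvq = ResidualVectorQuantizer()
tokens, recon = rvq(torch.randn(1, 13, 512))
print(tokens.shape)  # torch.Size([1, 13, 8])
```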

Joint Sequence Modeling

  • Motivation: Modeling multiple audio streams (user and system) jointly enables full-duplex communication.
  • Early Prototypes: Required manual turn changes (pressing the space bar to switch speakers).
  • Current Model: Predicts the parallel token streams jointly with Transformers for efficient processing (see the sketch after this list).
  • Text Stream Integration: Aligning a text stream with the audio at the word level improves generation stability.
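
As a rough illustration of parallel token prediction, the sketch below sums the embeddings of the text token and all audio codebook tokens (both speakers) at each time step into a single Transformer input, then predicts every stream with its own head. This is a simplification: Moshi's published design also uses a smaller depth Transformer over codebooks and per-codebook delays, which are omitted here, and all sizes are placeholders.

```python
# Simplified joint modeling over parallel token streams with one shared backbone.
import torch

class MultiStreamLM(torch.nn.Module):
    def __init__(self, text_vocab=32000, audio_vocab=2048, codebooks=8, streams=2, dim=512):
        super().__init__()
        self.text_emb = torch.nn.Embedding(text_vocab, dim)
        # One embedding table and one output head per (stream, codebook) pair.
        self.audio_emb = torch.nn.ModuleList(
            [torch.nn.Embedding(audio_vocab, dim) for _ in range(streams * codebooks)]
        )
        layer = torch.nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = torch.nn.TransformerEncoder(layer, num_layers=4)
        self.text_head = torch.nn.Linear(dim, text_vocab)
        self.audio_heads = torch.nn.ModuleList(
            [torch.nn.Linear(dim, audio_vocab) for _ in range(streams * codebooks)]
        )

    def forward(self, text_tokens, audio_tokens):
        """text_tokens: (B, T); audio_tokens: (B, T, streams * codebooks)."""
        h = self.text_emb(text_tokens)
        for k, emb in enumerate(self.audio_emb):
            h = h + emb(audio_tokens[..., k])  # sum all streams into one input per step
        mask = torch.nn.Transformer.generate_square_subsequent_mask(h.shape[1])
        h = self.backbone(h, mask=mask)        # causal mask: only attend to the past
        return self.text_head(h), [head(h) for head in self.audio_heads]

model = MultiStreamLM()
text = torch.randint(0, 32000, (1, 25))
audio = torch.randint(0, 2048, (1, 25, 16))
text_logits, audio_logits = model(text, audio)
print(text_logits.shape, audio_logits[0].shape)  # (1, 25, 32000) (1, 25, 2048)
```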

Training and Data

  • Training: Several stages, including audio-only pretraining, dual-channel (multi-stream) audio training, and fine-tuning on synthetic dialogues.
  • Challenges: Difficulty in maintaining factual accuracy in AI responses.
  • Data Sources: Large-scale datasets, Whisper for transcripts and word-level timestamps, and synthetic data generation for personality (see the sketch after this list).
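
One way to obtain the word-level timestamps mentioned above is Whisper's word-timestamp output. The sketch below assumes the openai-whisper package; the model size and file name are placeholders.

```python
# Extract word-level timestamps with Whisper, one option for aligning the
# text stream to audio frames at the word level.
import whisper

model = whisper.load_model("small")
result = model.transcribe("dialogue.wav", word_timestamps=True)

aligned_words = []
for segment in result["segments"]:
    for word in segment["words"]:
        # Each entry pairs a word with its start/end time in seconds,
        # which can then be mapped onto audio-frame indices.
        aligned_words.append((word["word"], word["start"], word["end"]))

print(aligned_words[:5])
```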

Applications and Demos

  • Moshi: Conversational AI with natural interaction even under noisy conditions (e.g., construction noise).
  • Hibiki: Speech-to-speech translation model that preserves speaker identity and runs on mobile devices.

Future Directions

  • Tool Integration: Potential for external tool use and prompt-based instructions.
  • Multilingual Expansion: Plans to introduce more languages for models like Hibiki.
  • Code and Data Release: Plans to release fine-tuning code, but training data remains proprietary.

Audience Questions

  • Topics covered include transducers, model scaling laws, real-time applications, acoustic token handling, and speaker identity preservation in translation.

Conclusion

  • Presentation highlighted advancements and challenges in speech-to-speech AI models.
  • Kyutai is working on expanding and refining its current models for wider applications and better performance.
  • Future presentations will continue exploring audio coding innovations.