Notes on Conversational AI Reading Group Presentation by Alexandre Défossez
Introduction
Presenter: Alexandre Défossez, co-founder of Kyutai, a nonprofit AI research lab in Paris.
Topic: Discussion on Moshi, a speech-to-speech model for real-time dialogue, and advancements in audio modeling and live translation.
Team Behind Moshi: Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, Neil Zeghidour.
Funding: Xavier Niel, Rodolphe Saadé, Eric Schmidt.
Overview of Kyutai
Nonprofit organization focused on open source and open science research.
Recent projects: Moshi, Helium 2B (multilingual foundation model), Hibiki (live French-to-English speech translation).
Emphasis on multimodal LLMs, particularly speech, given the team's expertise in audio.
Motivation for AI Assistants
Current speech interfaces for human-computer interaction fall short of natural conversation.
Human conversations are fluid with interruptions and paralinguistic content like emotions and tone.
Traditional systems use a cascade of components (speech recognition, a text LLM, text-to-speech) that introduces latency.
Aim to merge these steps into a single audio language model that supports full-duplex communication.
Neural Audio Codecs - Mimi
Waveform Complexity: Raw audio at 24 kHz means 24,000 samples per second, far too long a sequence to model directly.
Audio Tokenization: Essential to convert waveforms to token sequences for modeling.
Residual Vector Quantization: Used for audio compression and language modeling (see the sketch at the end of this section).
Adversarial Losses: Crucial for realistic audio reconstruction.
Semantic Distillation: The first codebook is distilled from WavLM representations to capture semantic information.
Causality: Focus on causal models for real-time applications.
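A minimal sketch of the residual vector quantization idea behind Mimi is shown below. All numbers (8 codebooks, 2048 entries, 512-dimensional latents) are illustrative assumptions, not Mimi's actual configuration.

```python
import numpy as np

def residual_vector_quantize(frame, codebooks):
    """Quantize one latent frame with a stack of codebooks.

    Each codebook quantizes the residual left by the previous one, so
    early codebooks carry coarse information and later ones refine detail.
    """
    residual = frame
    tokens = []
    for codebook in codebooks:                    # codebook: (num_entries, dim)
        distances = np.linalg.norm(codebook - residual, axis=1)
        index = int(np.argmin(distances))         # nearest codebook entry
        tokens.append(index)
        residual = residual - codebook[index]     # pass on what remains
    return tokens

# Illustrative numbers only: 8 codebooks of 2048 entries over a 512-dim latent.
rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((2048, 512)) for _ in range(8)]
frame = rng.standard_normal(512)
print(residual_vector_quantize(frame, codebooks))  # 8 integer tokens per frame
```

Because each codebook refines the residual of the previous one, the first codebook can be reserved for distilled semantic information while later ones capture acoustic detail.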
Joint Sequence Modeling
Motivation: Multiple audio streams and full duplex communication.
Early Prototypes: Required manual turn-taking (pressing the space bar).
Current Model: Uses parallel token prediction with Transformers for efficient processing (see the sketch at the end of this section).
Text Stream Integration: Aligns text with audio at the word level to improve generation stability.
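Below is a toy sketch of the joint multi-stream token layout: one word-aligned text stream plus two audio streams (the model's own voice and the user's), each with 8 codebooks. The stream count, codebook count, and vocabulary sizes are assumptions for illustration, not Moshi's exact configuration.

```python
import numpy as np

T = 5   # audio frames in this toy example
Q = 8   # audio codebooks per stream (illustrative)

rng = np.random.default_rng(0)
text_tokens = rng.integers(0, 32000, size=(T, 1))  # word-aligned text stream
moshi_audio = rng.integers(0, 2048,  size=(T, Q))  # model's own speech tokens
user_audio  = rng.integers(0, 2048,  size=(T, Q))  # incoming user speech tokens

# One step of the temporal Transformer covers all streams at the same frame,
# so the model can listen and speak simultaneously (full duplex) instead of
# waiting for explicit turn changes.
joint = np.concatenate([text_tokens, moshi_audio, user_audio], axis=1)
print(joint.shape)  # (T, 1 + 2 * Q) tokens handled per frame
```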
Training and Data
Training: Several stages, including audio-only pre-training, dual-channel (two-speaker) audio, and fine-tuning on synthetic dialogues.
Challenges: Difficulty in maintaining factual accuracy in AI responses.
Data Sources: Large-scale audio datasets, Whisper for word-level timestamps, and synthetic data generation to shape the assistant's personality.
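The notes above mention Whisper being used for timestamps; a hedged sketch of extracting word-level timestamps with the open-source whisper package (pip install openai-whisper) follows. It only illustrates the idea of aligning text with audio at the word level; the model size and file name are placeholders, and this is not Kyutai's actual data pipeline.

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("base")
result = model.transcribe("example_dialogue.wav", word_timestamps=True)

for segment in result["segments"]:
    for word in segment["words"]:
        # Each word carries start/end times, which can be used to align a
        # text stream with audio tokens at the word level.
        print(f'{word["word"]!r}: {word["start"]:.2f}s -> {word["end"]:.2f}s')
```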
Applications and Demos
Moshi: Conversational AI with natural interaction even under noisy conditions (e.g., construction noise).
Hibiki: Speech-to-speech translation model that preserves speaker identity and runs on mobile devices.
Future Directions
Tool Integration: Potential for external tool use and prompt-based instructions.
Multilingual Expansion: Plans to introduce more languages for models like Hibiki.
Code and Data Release: Plans to release fine-tuning code, but training data remains proprietary.
Audience Questions
Topics covered include transducers, model scaling laws, real-time applications, acoustic token handling, and speaker identity preservation in translation.
Conclusion
Presentation highlighted advancements and challenges in speech-to-speech AI models.
Kyutai is working on expanding and refining their current models for wider applications and better performance.
Future presentations will continue exploring audio coding innovations.