
Advancements in Conversational AI Models

Mar 15, 2025

Notes on the Conversational AI Reading Group presentation by Alexandre Défossez

Introduction

  • Presenter: Alexandre Défossez, co-founder of Kyutai, a nonprofit AI research lab in Paris.
  • Topic: Discussion on Moshi, a speech-to-speech model for real-time dialogue, and advancements in audio modeling and live translation.
  • Team Behind Moshi: Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, Neil Zeghidour.
  • Funding: Xavier Niel, Rodolphe Saadé, Eric Schmidt.

Overview of Kyutai

  • Nonprofit organization focused on open-source and open-science research.
  • Recent projects: Moshi, Helium 2B (a multilingual foundation model), and Hibiki (live speech translation from French to English).
  • Emphasis on multimodal LLMs, particularly speech, given the team's expertise.

Motivation for AI Assistants

  • Speech is a natural medium for humans, yet current human-computer speech interfaces fall short of natural conversation.
  • Human conversations are fluid, with interruptions and paralinguistic content such as emotion and tone.
  • Traditional systems chain cascading components (speech recognition, a text LLM, speech synthesis), each of which adds latency.
  • Aim: merge these steps into a single audio language model that supports full-duplex communication (see the latency sketch after this list).
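
To make the latency argument concrete, here is a minimal sketch that adds up the stages of a hypothetical cascaded pipeline and compares them with frame-level processing in a single full-duplex model. All stage names and numbers are illustrative placeholders, not figures from the talk.

```python
# Illustrative latency budget: cascaded voice assistant vs. a single
# full-duplex audio LM. All numbers are hypothetical placeholders.

CASCADE_STAGES_MS = {
    "end_of_turn_detection": 200,       # wait to detect that the user stopped speaking
    "speech_to_text": 300,              # ASR on the finished utterance
    "text_llm_first_token": 350,        # LLM prefill + first generated token
    "text_to_speech_first_audio": 250,  # TTS before audio starts playing
}

def cascade_latency_ms(stages: dict) -> int:
    """Stages run one after another, so their latencies add up."""
    return sum(stages.values())

# A full-duplex audio LM processes fixed-size frames continuously, so its
# response latency is roughly a small number of frames (80 ms frames and
# 2 frames are assumptions for this sketch).
FRAME_MS = 80
FRAMES_BEFORE_RESPONSE = 2

print("cascade     :", cascade_latency_ms(CASCADE_STAGES_MS), "ms")
print("full-duplex :", FRAME_MS * FRAMES_BEFORE_RESPONSE, "ms")
```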

Neural Audio Codecs - Mimi

  • Waveform Complexity: Raw audio is high-rate (24,000 samples per second at 24 kHz), which makes it expensive to model directly.
  • Audio Tokenization: Waveforms must be converted into discrete token sequences before language modeling.
  • Residual Vector Quantization: Compresses each audio frame into a small stack of discrete codes, used both for compression and for language modeling (see the sketch after this list).
  • Adversarial Losses: Crucial for realistic audio reconstruction.
  • Semantic Distillation: The first codebook is trained to approximate WavLM features so that it carries semantic information.
  • Causality: The codec is fully causal so it can run in streaming, real-time applications.
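
The sketch below illustrates the core residual vector quantization loop described above: each codebook quantizes the residual left by the previous one, so a continuous frame becomes a small stack of discrete tokens. Codebook count, size, and dimension are illustrative, not Mimi's actual configuration, and the semantic distillation of the first codebook is omitted.

```python
# Minimal RVQ sketch: quantize each encoder frame with a stack of codebooks,
# where every level quantizes the residual error left by the previous level.
import torch

class ResidualVectorQuantizer(torch.nn.Module):
    def __init__(self, num_codebooks: int = 8, codebook_size: int = 2048, dim: int = 512):
        super().__init__()
        # One learnable codebook per quantization level.
        self.codebooks = torch.nn.Parameter(torch.randn(num_codebooks, codebook_size, dim))

    def forward(self, frames: torch.Tensor):
        """frames: (batch, time, dim) continuous encoder outputs."""
        residual = frames
        quantized = torch.zeros_like(frames)
        codes = []
        for codebook in self.codebooks:
            # Squared distance from each residual vector to every codebook entry.
            dists = (residual.unsqueeze(-2) - codebook).pow(2).sum(-1)  # (batch, time, codebook_size)
            idx = dists.argmin(dim=-1)        # (batch, time) discrete token ids
            chosen = codebook[idx]            # (batch, time, dim) selected entries
            quantized = quantized + chosen
            residual = residual - chosen      # next level quantizes what is left
            codes.append(idx)
        # Stack of token ids (one per codebook) plus the reconstructed embedding.
        return torch.stack(codes, dim=-1), quantized

# Example: one second of 24 kHz audio encoded at ~12.5 frames/s gives ~13 frames;
# with 8 codebooks that is on the order of 100 tokens/s instead of 24,000 samples/s.
rvq = ResidualVectorQuantizer()
tokens, recon = rvq(torch.randn(1, 13, 512))
print(tokens.shape)  # torch.Size([1, 13, 8])
```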

Joint Sequence Modeling

  • Motivation: Modeling multiple audio streams (user and system) jointly enables full-duplex communication.
  • Early Prototypes: Required manual turn changes (pressing the space bar to switch speakers).
  • Current Model: Predicts the parallel token streams jointly with Transformers for efficient processing (see the sketch after this list).
  • Text Stream Integration: Aligning a text stream with the audio at the word level improves generation stability.
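
As a rough illustration of parallel token prediction, the sketch below sums the embeddings of the text token and all audio codebook tokens (both speakers) at each time step into a single Transformer input, then predicts every stream with its own head. This is a simplification: Moshi's published design also uses a smaller depth Transformer over codebooks and per-codebook delays, which are omitted here, and all sizes are placeholders.

```python
# Simplified joint modeling over parallel token streams with one shared backbone.
import torch

class MultiStreamLM(torch.nn.Module):
    def __init__(self, text_vocab=32000, audio_vocab=2048, codebooks=8, streams=2, dim=512):
        super().__init__()
        self.text_emb = torch.nn.Embedding(text_vocab, dim)
        # One embedding table and one output head per (stream, codebook) pair.
        self.audio_emb = torch.nn.ModuleList(
            [torch.nn.Embedding(audio_vocab, dim) for _ in range(streams * codebooks)]
        )
        layer = torch.nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = torch.nn.TransformerEncoder(layer, num_layers=4)
        self.text_head = torch.nn.Linear(dim, text_vocab)
        self.audio_heads = torch.nn.ModuleList(
            [torch.nn.Linear(dim, audio_vocab) for _ in range(streams * codebooks)]
        )

    def forward(self, text_tokens, audio_tokens):
        """text_tokens: (B, T); audio_tokens: (B, T, streams * codebooks)."""
        h = self.text_emb(text_tokens)
        for k, emb in enumerate(self.audio_emb):
            h = h + emb(audio_tokens[..., k])  # sum all streams into one input per step
        mask = torch.nn.Transformer.generate_square_subsequent_mask(h.shape[1])
        h = self.backbone(h, mask=mask)        # causal mask: only attend to the past
        return self.text_head(h), [head(h) for head in self.audio_heads]

model = MultiStreamLM()
text = torch.randint(0, 32000, (1, 25))
audio = torch.randint(0, 2048, (1, 25, 16))
text_logits, audio_logits = model(text, audio)
print(text_logits.shape, audio_logits[0].shape)  # (1, 25, 32000) (1, 25, 2048)
```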

Training and Data

  • Training: Several stages, including audio-only pretraining, dual-channel (multi-stream) audio training, and fine-tuning on synthetic dialogues.
  • Challenges: Difficulty in maintaining factual accuracy in AI responses.
  • Data Sources: Large-scale datasets, Whisper for transcripts and word-level timestamps, and synthetic data generation for personality (see the sketch after this list).
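
One way to obtain the word-level timestamps mentioned above is Whisper's word-timestamp output. The sketch below assumes the openai-whisper package; the model size and file name are placeholders.

```python
# Extract word-level timestamps with Whisper, one option for aligning the
# text stream to audio frames at the word level.
import whisper

model = whisper.load_model("small")
result = model.transcribe("dialogue.wav", word_timestamps=True)

aligned_words = []
for segment in result["segments"]:
    for word in segment["words"]:
        # Each entry pairs a word with its start/end time in seconds,
        # which can then be mapped onto audio-frame indices.
        aligned_words.append((word["word"], word["start"], word["end"]))

print(aligned_words[:5])
```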

Applications and Demos

  • Moshi: Conversational AI with natural interaction even under noisy conditions (e.g., construction noise).
  • Hibiki: Speech-to-speech translation model that preserves speaker identity and runs on mobile devices.

Future Directions

  • Tool Integration: Potential for external tool use and prompt-based instructions.
  • Multilingual Expansion: Plans to introduce more languages for models like Hibiki.
  • Code and Data Release: Plans to release fine-tuning code, but training data remains proprietary.

Audience Questions

  • Topics covered include transducers, model scaling laws, real-time applications, acoustic token handling, and speaker identity preservation in translation.

Conclusion

  • Presentation highlighted advancements and challenges in speech-to-speech AI models.
  • Kyutai is working on expanding and refining its current models for wider applications and better performance.
  • Future presentations will continue exploring audio coding innovations.