🎤

Building a Voice Assistant with AI

Sep 28, 2024

Creating a Voice Assistant Using a Large Language Model

Overview of Voice Assistant Demo

  • A voice assistant is created to facilitate order placement.
  • Example interaction: user orders "1 idly and 1 Coke."
  • Total cost of the order: INR 70.

Voice Recognition Process

  • Conversion of voice to text using the Faster Whisper model:
    • Faster Whisper is an optimized reimplementation of OpenAI's Whisper.
    • Provides up to 4x faster transcription while using less memory.
    • Further efficiency can be achieved with 8-bit quantization (see the sketch below).
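As a rough illustration, a single transcription call with Faster Whisper might look like the sketch below; the model size, device, and audio file name are placeholder choices, and compute_type="int8" enables the 8-bit quantization mentioned above.

    from faster_whisper import WhisperModel

    # Load Whisper with 8-bit quantization to cut memory use.
    # "base" and the file name are placeholder choices.
    model = WhisperModel("base", device="cuda", compute_type="int8")

    # Transcribe a recorded audio chunk and join the segment texts.
    segments, info = model.transcribe("customer_order.wav", language="en")
    text = " ".join(segment.text.strip() for segment in segments)
    print(text)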

Requirements

  • Python: 3.8 or greater.
  • CUDA Toolkit: Includes the necessary GPU libraries (cuBLAS, cuDNN).
    • Installing the toolkit is recommended over installing these libraries individually.

Workflow of Voice Assistant

  1. Voice to Text Conversion:
    • Use Faster Whisper to transcribe audio input.
    • Input example: Customer states "Pepsi and pizza."
  2. Retrieval-Augmented Generation (RAG):
    • The text goes through RAG, which fetches relevant information from a menu database.
    • The menu is split into chunks, and their embeddings are saved to Qdrant DB.
  3. Interaction with Large Language Model:
    • The retrieved context and the transcription are sent to an LLM to generate a response (a sketch of this step follows the list).
    • gTTS (Google Text-to-Speech) converts the response back to voice.
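A minimal sketch of the RAG and LLM steps, assuming the llama-index 0.10+ package layout with the Qdrant vector-store, Ollama LLM, and HuggingFace embedding integrations installed; the collection name, menu folder, and model names are placeholders rather than values from the original project.

    import qdrant_client
    from llama_index.core import (
        Settings,
        SimpleDirectoryReader,
        StorageContext,
        VectorStoreIndex,
    )
    from llama_index.embeddings.huggingface import HuggingFaceEmbedding
    from llama_index.llms.ollama import Ollama
    from llama_index.vector_stores.qdrant import QdrantVectorStore

    # Local LLM served by Ollama and a local embedding model.
    Settings.llm = Ollama(model="llama3")
    Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

    # Chunk and embed the menu documents, storing the embeddings in Qdrant.
    client = qdrant_client.QdrantClient(host="localhost", port=6333)
    vector_store = QdrantVectorStore(client=client, collection_name="menu")
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    documents = SimpleDirectoryReader("menu_data").load_data()
    index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

    # Retrieve the relevant menu items and generate a reply for the transcribed order.
    response = index.as_query_engine().query("Pepsi and pizza")
    print(response)

In the full assistant, the one-shot query engine would typically be swapped for a chat engine so that the memory buffer described in the next section can be attached.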

Chat Memory Feature

  • Implements chat memory to retain conversation context:
    • Essential for maintaining an interactive dialogue without repetitive queries.
    • Utilizes LlamaIndex's ChatMemoryBuffer (see the sketch below).
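Reusing the index from the RAG sketch above, the memory buffer could be attached roughly as follows; the token limit and the "context" chat mode are assumptions rather than settings confirmed by the demo.

    from llama_index.core.memory import ChatMemoryBuffer

    # Keep roughly the last 1500 tokens of the conversation in context.
    memory = ChatMemoryBuffer.from_defaults(token_limit=1500)

    # A chat engine that reuses the memory across turns, so the assistant
    # remembers items mentioned earlier in the same order.
    chat_engine = index.as_chat_engine(chat_mode="context", memory=memory)

    print(chat_engine.chat("One Pepsi and one pizza, please."))
    print(chat_engine.chat("Actually, make that two Pepsis."))  # relies on memory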

Code Structure

  • Main Components:
    • app.py: Starting point of the application.
    • voice_service.py: Handles text-to-speech functionality (a minimal sketch follows this list).
    • RAG Module: Manages data retrieval and processing.
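The text-to-speech part of voice_service.py could be as small as the sketch below; gTTS is the library named in the demo, while the playsound-based playback and the play_text name are assumptions for illustration.

    import os
    import tempfile

    from gtts import gTTS
    from playsound import playsound  # assumed playback dependency


    def play_text(text: str, lang: str = "en") -> None:
        """Convert the LLM response to speech and play it back."""
        tts = gTTS(text=text, lang=lang)
        with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as tmp:
            path = tmp.name
        tts.save(path)
        playsound(path)
        os.remove(path)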

Virtual Environment Setup

  • Create a virtual environment using: python -m venv venv
  • Install required packages from requirements.txt.

Main Method Functionality

  • Model and device configurations set in the main method.
  • Continuous loop to listen for customer input until the conversation ends.
  • Audio recording and transcription handling:
    • Audio chunks processed and transcribed.
    • The transcription is sent to the LLM for a response (a sketch of the loop follows this list).
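Putting the pieces together, the main loop in app.py might look roughly like the sketch below; it reuses WhisperModel, the chat engine, and play_text from the sketches above, while build_chat_engine, record_audio_chunk, and the "goodbye" check are hypothetical stand-ins for the project's actual wiring, recording, and end-of-conversation logic.

    from faster_whisper import WhisperModel


    def main() -> None:
        model = WhisperModel("base", device="cuda", compute_type="int8")
        chat_engine = build_chat_engine()  # hypothetical: index + memory from the RAG module

        while True:
            # Record a short audio chunk from the microphone (hypothetical helper).
            audio_path = record_audio_chunk()

            # Transcribe the chunk with Faster Whisper.
            segments, _ = model.transcribe(audio_path, language="en")
            text = " ".join(s.text.strip() for s in segments)
            if not text:
                continue
            if "goodbye" in text.lower():  # placeholder end-of-conversation check
                break

            # Send the transcription to the LLM and speak the reply.
            reply = chat_engine.chat(text)
            play_text(str(reply))


    if __name__ == "__main__":
        main()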

Running the Application

  • Ensure all dependencies are running (Ollama for the LLM, Docker for Qdrant).
  • Quick commands to initialize: ollama serve and docker run -p 6333:6333 qdrant/qdrant
  • Monitor and manage processes to avoid port conflicts.

Conclusion of Demo

  • Customer interaction example provided.
  • Accurate order handling and response generation.
  • Emphasis on adapting components for more extensive databases when necessary.

Summary of Key Libraries and Tools Used

  • Faster Whisper: Voice-to-text transcription.
  • gTTS: Text-to-speech conversion.
  • RAG: Retrieval-Augmented Generation technique for grounding responses in the menu data.
  • CUDA Toolkit: GPU support for model execution.
  • LlamaIndex: Chat memory management.
  • Qdrant DB: Vector database for storing chunks and their embeddings.