Building a Realtime Data Processing Pipeline for Financial News

Jun 25, 2024

Introduction

  • Objective: Learn to build a real-time data processing pipeline
  • Data Source: Financial news text from the Alpaca News API
  • Transformations: Convert the text into vector embeddings with a Bytewax streaming pipeline
  • Storage: Store the embeddings in the Qdrant vector database

Getting Started

  1. Repository: Clone the hands-on-llms repository
    • Go to the terminal and clone the repository
    • Open the project with Visual Studio Code
  2. Modules: Navigate to modules/streaming_pipeline
  3. Dependencies: Install required tools for running the pipeline
    • Python 3.10
    • Poetry
    • Make (for Linux/Mac)
    • AWS CLI
  4. Install Dependencies: Use the Makefile
    • Command: make install
  5. Environment File: Create and set up a .env file (an example of loading it follows this list)
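
A minimal sketch of the .env file and of loading it with python-dotenv; the variable names are hypothetical stand-ins, so check the repository for the exact names it expects:

```python
# .env (hypothetical variable names -- check the repository for the exact ones):
#   ALPACA_API_KEY=...
#   ALPACA_API_SECRET=...
#   QDRANT_URL=...
#   QDRANT_API_KEY=...
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads the .env file from the current working directory

alpaca_key = os.environ["ALPACA_API_KEY"]
alpaca_secret = os.environ["ALPACA_API_SECRET"]
```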

Setting Up Alpaca and Qdrant

  1. Alpaca
    • Create an Alpaca account to get an API key and secret
    • Add these credentials to the .env file
  2. Qdrant
    • Create a cluster and generate an API key
    • Add the API key and cluster URL to the .env file
    • Verify the cluster is running from the terminal (see the sketch after this list)
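
As a rough sketch of that verification step, a few lines with the qdrant-client package confirm the cluster accepts the credentials; the environment variable names are assumptions, and the course itself may use a plain terminal command instead:

```python
import os

from qdrant_client import QdrantClient  # pip install qdrant-client

# Assumes QDRANT_URL and QDRANT_API_KEY were exported or loaded from the .env file.
client = QdrantClient(
    url=os.environ["QDRANT_URL"],
    api_key=os.environ["QDRANT_API_KEY"],
)

# Listing collections is a cheap way to prove the cluster is up and the key is valid.
print(client.get_collections())
```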

Running the Pipeline

  1. Command: make run_real_time
  2. Code Insight: Understand the Makefile and the core Python scripts
    • The main script calls Bytewax functions to build and run the dataflow (a minimal skeleton follows this list)
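
A minimal skeleton of how such a dataflow is built and run, written against the current operator-based Bytewax API; the repository may target an older Bytewax release with a slightly different API, and the step names here are hypothetical:

```python
import bytewax.operators as op
from bytewax.connectors.stdio import StdOutSink
from bytewax.dataflow import Dataflow
from bytewax.testing import TestingSource, run_main

# In the real pipeline the input would be the Alpaca news stream;
# a TestingSource with two fake headlines stands in for it here.
flow = Dataflow("financial_news_pipeline")
news = op.input("news_input", flow, TestingSource(["Fed holds rates steady ", " ACME beats earnings"]))

# Each step is a named operator applied to the upstream stream.
cleaned = op.map("clean", news, str.strip)
op.output("stdout", cleaned, StdOutSink())

if __name__ == "__main__":
    run_main(flow)  # for local runs; production runs typically use `python -m bytewax.run`
```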

Data Flow

  1. Input Construction
    • Uses Alpaca's RESTful API for batch (historical) data and its streaming API for real-time data
  2. Processing Steps
    • FlatMap: Applies a function that returns a list to each element, then flattens the results, so one input can produce many outputs
    • Map: Applies a function to each element one-to-one, keeping the stream structure
    • Pydantic Class: Defines the structure of the documents passed between processing steps
    • Chunks and Embeddings: Splits each article into chunks and computes a vector embedding per chunk (illustrated in the sketch after this list)
  3. Output Construction
    • Saves the transformed data to the Qdrant vector DB
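
The sketch below illustrates those transform semantics in plain Python, outside Bytewax: a FlatMap-style chunking step turns one document into many chunks, and a Map-style step attaches an embedding to each chunk. The Pydantic models and the toy embedding function are hypothetical stand-ins for the repository's actual classes and embedding model:

```python
from pydantic import BaseModel


class Document(BaseModel):
    """One news article pulled from Alpaca."""
    id: str
    text: str


class EmbeddedChunk(BaseModel):
    """One chunk of an article, optionally carrying its embedding."""
    doc_id: str
    text: str
    embedding: list[float] | None = None


def chunk(doc: Document, size: int = 50) -> list[EmbeddedChunk]:
    # FlatMap-style step: one Document in, many chunks out (the list gets flattened).
    return [
        EmbeddedChunk(doc_id=doc.id, text=doc.text[i : i + size])
        for i in range(0, len(doc.text), size)
    ]


def embed(piece: EmbeddedChunk) -> EmbeddedChunk:
    # Map-style step: one chunk in, one chunk out; a real pipeline would call an
    # embedding model here instead of this toy length-based "embedding".
    piece.embedding = [float(len(piece.text))]
    return piece


docs = [Document(id="1", text="Fed holds rates steady as inflation cools further.")]

# Equivalent to Bytewax's flat_map followed by map.
chunks = [c for d in docs for c in chunk(d)]
embedded = [embed(c) for c in chunks]
print(embedded)
# The output step would then upsert these vectors into a Qdrant collection.
```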

Deployment

  1. Options:
    • Bytewax's waxctl tool: For quick and easy cloud deployment
    • AWS CLI: For creating EC2 instances and deploying the Docker image
    • Production-Ready: Use AWS CDK or Terraform along with GitHub Actions
  2. CI/CD Integration
    • Continuous integration and deployment pipelines via GitHub Actions

Conclusion

  • Recap of learning how to build and deploy a real-time data processing pipeline for financial news
  • Reminder to star the GitHub repository