Hi guys, in this video you will learn how to build a real-time data processing pipeline for financial news. It ingests financial text in real time from the Alpaca news API, transforms it into vector embeddings using Bytewax, and saves them in a vector DB, which is going to be Qdrant. Let's get to it.

Let's first head to the hands-on-llms repository; you can see the URL right here. You go to Code, you copy it, and you go to your terminal. I'm going to cd into source/live-sessions and git clone it. Ah, it says I already have it, so I'm just going to cd into it and do a git pull. Super. Then I'm going to open it with Visual Studio Code. We're going to go to modules, to streaming_pipeline, right here, and there's a README where you first see the list of dependencies, that is, the tools you need to run the streaming pipeline on your end: Python 3.10, then Poetry (I'm going to talk about that in a second), Make, which is standard if you have Linux or Mac (I'm not sure about Windows, check that), and then the AWS CLI.

So, the usual make install. I'm going to head to the Makefile, which I'm going to open to the side here; make install basically installs all the dependencies for this project inside a virtual environment. I'm there in streaming_pipeline, so I'm going to run make install. Super, everything is installed. If you scroll down, there's this .env step: you run this command (copy it) and you create your .env file, and then you will see it right here. In this file you need to paste a few secrets to run this whole thing (a rough sketch of the filled-in file, plus a small Python connectivity check, follows this section). We're going to use two services: Alpaca, to fetch financial news in real time, and for that you need an API key and an API secret, and Qdrant, which is going to be our serverless vector DB. In order to use these services you need to get the credentials.

So let's go to Alpaca, where you need to create an account for free; you don't need to pay anything. You just go to your account, here in Home, and here on the right you will see View API Keys. If you click here and regenerate, you will be able to see a new key, which I'm copying here and pasting in my .env file: the Alpaca key, this one, and then the secret. Obviously I'm showing mine here, but this is a value you should not share. Super, so Alpaca is set up in terms of credentials.

Now I'm going to set up Qdrant. Qdrant is an open-source vector DB which also has a serverless platform. A vector database is a scalable way to index and store your data embeddings, so later on you can do semantic similarity search to find similar embeddings between a query embedding and the embeddings from your vector database, using distance metrics such as cosine similarity. Super. So let's create a cluster; you can create it for free right here. I'm going to give it a name, so I'm going to write alpaca-financial-news. Super. Then I'm going to generate an API key: I click Get API Key, and this is the value that I'm going to copy and immediately add to my .env file under the Qdrant API key, right here. Then I go back to Qdrant and continue; this is the URL of my cluster, so I'm going to copy it and add it here under the Qdrant URL. Super. To double-check that my cluster is up and running, I'm going to go to the terminal and copy this curl command. Super, I get a positive response; I see that the system is up and running. So we have our Qdrant cluster up and running.
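For reference, here is roughly what the filled-in .env ends up looking like. The variable names are my assumption of what the project expects (check the example env file shipped with the streaming pipeline for the exact ones), and the values are placeholders:

```
ALPACA_API_KEY=<your Alpaca key>
ALPACA_API_SECRET=<your Alpaca secret>
QDRANT_API_KEY=<your Qdrant Cloud API key>
QDRANT_URL=https://<your-cluster-id>.cloud.qdrant.io:6333
```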
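If you prefer checking the cluster from Python instead of curl, here is a minimal sketch with qdrant-client, assuming the two Qdrant values above are exported in your environment:

```python
# Minimal connectivity check against the Qdrant Cloud cluster.
# Assumes QDRANT_URL and QDRANT_API_KEY are set in the environment.
import os

from qdrant_client import QdrantClient

client = QdrantClient(
    url=os.environ["QDRANT_URL"],
    api_key=os.environ["QDRANT_API_KEY"],
)

# With valid credentials this prints the (possibly empty) list of collections
# instead of raising an authentication or connection error.
print(client.get_collections())
```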
I'm going to go back to the code. We have the two services that we need, Alpaca and Qdrant, so now we're going to start running things. The first one is make run_real_time; what this does is run the real-time financial news pipeline locally. Let's go to the Makefile to see what's happening behind the scenes. run_real_time, let me open it here on the left and expand a bit, is basically running this Python script, which calls Bytewax. Bytewax is a Python library built in Rust with native Python bindings, so you define your business logic, your steps, in Python, but you leverage the efficiency of Rust. With bytewax.run we basically need to provide the path to a function that generates a so-called dataflow. A dataflow in Bytewax is a sequence of steps that defines our transformation from the input, in this case a WebSocket from Alpaca, to the final destination, in this case the Qdrant vector DB. In between there are a bunch of processing steps that let us transform the raw text we get from Alpaca into vector embeddings that we save in our vector DB.

So if you go to tools, to run_real_time, to the build flow function, which I'm doing right now: this is the function that essentially returns a dataflow, the Bytewax object. What this function does is initialize logging, which is just a convenience wrapper, but this is the key part, the flow builder. I'm going to click on that: this is the function that constructs our dataflow using Bytewax. So what are the steps? The first one is the input. The input is built with this function, let's click on that: the build input function constructs the input step in our dataflow, and if you scroll down you will see that it defines two separate cases, depending on whether it's batch or not. What does this mean, Paul? The Alpaca batch input leverages Alpaca's RESTful API to ingest historical data into the Qdrant vector DB, and the Alpaca stream input leverages Alpaca's streaming API to listen to data 24/7 and ingest it in real time. Super.

I'm going to go back to the dataflow. The next step is a flat_map, and you will see both flat_maps and maps, so what is the difference, Paul? Imagine an item in a Bytewax flow is a list. Both flat_map and map apply a function to the item, but map sends the resulting list down the Bytewax flow as a single element, while flat_map emits every element of the list individually, basically eliminating the list structure down the flow (there's a short sketch of this right after this section). Super. We define the flat_map to parse the incoming messages; NewsArticle is defined right here as a Pydantic class, which is a very nice way to define the structure of our data. Let me go back to the dataflow.

If you continue scrolling, there is basically a step where we transform this article, which contains raw data, into a document. So what is that? A document contains the basic information we need to push our data to Qdrant: it has an ID; it has some metadata, which is important, because vector embeddings alone are not that useful, you also need to attach metadata to them, for example timestamps; then the text, the raw text, so that when you run a query against your vector DB you can also check that what you get makes sense; and then the chunks and the embeddings. What are these chunks? To compute embeddings we will use a language model that has a fixed input size; we cannot embed arbitrarily large text. So what we do is chunk the text based on the input size of the model we use, that is what these chunks mean, and then for every chunk we compute embeddings. Essentially we represent the raw text by a collection of vector embeddings, not just one but many, and this is what we push to the Qdrant DB.
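To make the map versus flat_map distinction concrete, here is a minimal sketch. It assumes the older Dataflow-method style API (roughly Bytewax 0.16/0.17) that this project uses; newer Bytewax releases expose the same steps through bytewax.operators, so adjust the imports for your version:

```python
# map vs. flat_map on a Bytewax dataflow (0.16/0.17-style API).
from bytewax.dataflow import Dataflow
from bytewax.testing import TestingInput
from bytewax.connectors.stdio import StdOutput

flow = Dataflow()

# Each input item is a *list* of headlines, like one Alpaca message that
# carries several news articles.
flow.input("input", TestingInput([["headline A", "headline B"]]))

# map would apply a function to the whole item and send the result downstream
# as a single element, e.g.:
#   flow.map(lambda articles: [h.lower() for h in articles])

# flat_map returns an iterable and every element is emitted downstream
# individually, so later steps see one article at a time.
flow.flat_map(lambda articles: articles)

flow.output("out", StdOutput())
# Run with: python -m bytewax.run <module_name>:flow
```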
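And here is a hedged sketch of the kind of Pydantic models described above; the field names are illustrative, not the repo's exact definitions:

```python
from typing import List

from pydantic import BaseModel


class NewsArticle(BaseModel):
    """Raw article as parsed from an Alpaca news message (illustrative fields)."""

    id: int
    headline: str
    summary: str
    created_at: str
    symbols: List[str] = []


class Document(BaseModel):
    """What ultimately gets pushed to Qdrant: text plus metadata, chunks, embeddings."""

    id: str
    metadata: dict = {}                 # e.g. timestamps, tickers, source
    text: List[str] = []                # the cleaned raw text
    chunks: List[str] = []              # text split to fit the embedding model's input size
    embeddings: List[List[float]] = []  # one vector per chunk
```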
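The chunk-then-embed idea itself fits in a few lines. Here is a sketch with sentence-transformers, where the model name is an assumption on my part; the actual ID lives in the pipeline's constants:

```python
# Chunk-then-embed sketch. "all-MiniLM-L6-v2" is an assumed model id;
# check the constants module of the streaming pipeline for the real one.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
print(model.max_seq_length)  # the fixed input size we have to chunk against

# In the real pipeline these chunks come from splitting an article's raw text
# so that each piece fits the model's input size.
chunks = [
    "Fed leaves interest rates unchanged at its latest meeting ...",
    "Tech stocks rallied after the announcement ...",
]

embeddings = model.encode(chunks)  # one vector per chunk, shape (num_chunks, dim)
print(embeddings.shape)
```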
This is what we are representing right here with these fields. If you go back to the flow, there are these two steps: the compute chunks step and then the compute embeddings step. To compute embeddings, behind the scenes we're using a model; this is the embedding model that we use. What model are we using in this case? If you go to the constants you will see the embedding model ID. We're using an open-source model from Hugging Face, actually from sentence-transformers (roughly as in the sketch above), which is a very standard model to use, but feel free to experiment with others; this is actually a hyperparameter you can play with in order to improve the performance of the whole system.

Finally, we need to write the output to the vector DB, and this is what we do with the build output function here. The build output returns a dataflow step that can either save to our serverless Qdrant DB or to a local instance that we're running with Docker. The key advantage of using Qdrant Cloud is that you can directly start focusing on your application or product instead of wasting valuable resources on setting up your infrastructure and Qdrant itself, which can be a very time-consuming process. This one here is the one that matters, and here you can see what the build output does; this is the standard way to build outputs in Bytewax. You need to define two things: first, the sink that writes to a collection in Qdrant, and then, if you scroll up, a dynamic output which basically uses that sink and makes sure the data is written at least once (a rough sketch of what this amounts to is included at the very end). Super. So that's it, this is how the whole dataflow is defined. Let's run it: make run_real_time. Super, you can see the pipeline is running; it connected to the Alpaca API, and as soon as data arrives through our WebSocket it is going to be processed and then sent to Qdrant.

So Paul, what are our options in terms of deployment? There are two main ways to deploy a Bytewax flow to the cloud. The first one is using Bytewax's waxctl tool, which allows you to quickly and easily deploy any flow to the cloud. But if you want to tap into more customizable ways, you can use the AWS CLI, which is what we use in this course, to create an EC2 instance and deploy a Docker image to it. For more scalable and production-ready setups, you should use AWS CDK or Terraform to create your infrastructure and GitHub Actions to deploy the Docker image to that infrastructure. Feel free to choose the option that gives you the right trade-off between flexibility and performance. Then finally, as usual, it's great to automate: all deployment of new code needs to go through testing and then automatic deployment, and this is what we have in the GitHub Actions. If you go to the .github folder here, you will find continuous integration and continuous deployment pipelines that make sure the latest changes to our streaming pipeline are tested and then deployed to production.

So that's it for today. I hope you learned something new, and please don't forget to give us a star on GitHub. See you soon.
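As mentioned above, one last rough sketch of what the Qdrant output step ultimately amounts to: an upsert of one point per chunk into a collection. The collection name, vector size, and payload fields are illustrative assumptions, and in the repo this logic is wrapped in a Bytewax sink rather than called directly:

```python
import os
import uuid

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(url=os.environ["QDRANT_URL"], api_key=os.environ["QDRANT_API_KEY"])

COLLECTION = "alpaca_financial_news"  # illustrative collection name
client.recreate_collection(
    collection_name=COLLECTION,
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),  # 384 = all-MiniLM-L6-v2 dim
)

# Placeholder chunks and vectors; in the pipeline these come from the embedding step.
chunks = ["Fed leaves interest rates unchanged at its latest meeting ..."]
vectors = [[0.01] * 384]

client.upsert(
    collection_name=COLLECTION,
    points=[
        PointStruct(
            id=str(uuid.uuid4()),
            vector=vec,
            payload={"text": chunk, "created_at": "2024-01-01T00:00:00Z"},
        )
        for chunk, vec in zip(chunks, vectors)
    ],
)
```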