Today we're going to talk about low-latency C++ trading systems. First of all, why would you want to build such a system, an automated trading one? Well, it's a high-stakes game: everyone wants to crack the S&P 500, or here in Germany the DAX, or the EURO STOXX. There is a lot to make or lose, and it's a hard problem, because there are a lot of people in the market. At the bottom of that slide you can see a graph of the average daily traded volume over the past ten years. This one is on derivatives, but that doesn't matter too much; the general takeaway is that more and more people trade on financial markets. Compared to ten years ago, stock options have three times more volume traded. Why? It's mostly retail flow: people like you and me have a smartphone with an app where we can buy and sell stocks, and it's just much easier than before to access the markets. So a lot of people are out there trading, and that's why it's such a hard problem — it's both a trading problem and a technological one. You need to be smart enough to come up with a price at which you're happy to buy or sell, because you want to make good trades, but there is a big technological aspect to it: the need for latency.

So why do we need such low latency? I work for Optiver, and we are a market maker. In our case, when you take out your smartphone app to buy a stock, we will be there on the other side to sell, and vice versa. You have to imagine that we have a few hundred thousand orders out there at any given time — probably more like a million, but let's say a hundred thousand, which is already a lot — on all kinds of instruments, to facilitate trading with other people. Now imagine you have such an order out there and a headline like this comes out. There are a few things you want to do when you see such a headline, but there is one thing that you should always do. What is it? Yes — you could do that too — but the first thing you want to do, regardless of whether you are a market maker, a retail investor, a hedge fund, or any kind of financial actor, is to cancel your buy orders. If you had buy orders anywhere at a certain price, that information is now obsolete and you immediately want to cancel them. That's the first thing you do; after that you do other things. That is of course just one example of where you need very low latency in trading — reacting to headlines — there are many others.

So today we would like to build such a low-latency automated trading system together. It's hard to cover everything in a one-hour talk, so I had to pick a few things, and I would like to cover four topics: the data model, data access, system tuning, and finally how to measure performance. We're going to focus on building blocks that you can use and that are effectively used in any low-latency system. It's also time to say a bit more about me and introduce myself: I'm David, and I've been working on automated trading
systems for almost ten years, at Optiver, the market maker. Before that I worked on high-throughput systems — quite some overlap with the low-latency side of things, but in a very different industry: defense.

So, first of all: designing for performance, which is what we're going to talk about today — is it actually a good idea? You know what Knuth said: premature optimization is the root of all evil. I also have friends saying "we'll make it fast if it needs to be fast," and that sounds reasonable. I consider myself a pragmatic engineer, only solving problems when they happen, so I like that, I agree with it. But what I also say — and I heard it yesterday as well, here at Meeting C++, in another talk about deadlines and real-time systems — is that performance should be considered at the beginning, when you design a system; performance can rarely be an afterthought. These two quotes seem to contradict each other. Who agrees with Knuth? Some people — I expected more. Who agrees with me, that it can rarely be an afterthought? Oh, okay, you agree more with me; that's very sweet of you.

Effectively, I think both statements are correct, and they don't really contradict each other. The reason is that there are two things to look at when we consider not just performance but systems in general: strategy and tactics. The strategy is your overall, overarching approach to meet a goal. Take as a goal, for example: I'd like to handle a million events per second, or I'd like to react to a certain event with a certain latency distribution — average less than 100 nanoseconds, standard deviation less than 10. Those are requirements, goals that you have, and the strategy is your overall approach to solving that problem. The tactics are your individual actions; translated into engineering terms, they are the implementation details: hey, should I call reserve on this std::vector, which hash map is the most efficient, and so on and so forth. You need both. Now back to the previous slide: Knuth, with his quote about premature optimization, is definitely talking about tactics. We shouldn't immediately do local optimizations everywhere, because in general our intuition is wrong and we would just be complicating our software for probably nothing. He's talking about tactics. And when I say performance shouldn't be an afterthought, I'm talking about strategy. So today we'll look at a bit of both: for the four topics I mentioned, I'll try to give you a general strategy for the very concrete problem of building a trading system, and then look at some details.

If we want a strategy for building such a system, we need a goal. So what is our goal — how fast is fast? Five years ago one of my colleagues from Optiver presented a talk at Meeting C++, and then at CppCon, with a similar slide. You can see on it that we were talking about a 2.5 microsecond software trading system; that's roughly where we were. So where are we nowadays? Things change a little bit, because the
exchanges evolve very quickly on financial markets. What you can see on the right of that slide is what we call trigger-to-target latency. What is that? It's a bit of a technical term that I'll explain in a moment; this is on Eurex, the biggest European exchange — it's actually here in Germany, in Frankfurt. The trigger is when the exchange sends an event to the market participants. The event could be a trade, or a price change on a stock, for example. These events are sent in a very fair way so that all market participants receive them at the same time — that's the trigger. The target is when any market participant sends an order back. These two timestamps are taken by a layer-1 switch, a device that can very accurately timestamp packets in and out, and if you measure the difference between the two, we see on that heat map that we are approaching — or are already at — around ten nanoseconds. That's very fast. What kind of C++ can achieve that?

By the way, to use the speed of light as an analogy: ten nanoseconds is roughly the time it takes light to travel from me to, I think, the very first row. So it's not a lot, and I'm not sure what kind of logic we can squeeze into that. No, C++ cannot achieve this; even the upcoming PCI Express standard is already too slow, way too slow, for that. So what people do is use FPGAs: VHDL flashed onto an FPGA, a custom board, and that's how they achieve this latency.

So what are we going to talk about today — is C++ obsolete? Well, not really; effectively, not at all, for at least three reasons. Here is a simplified but realistic diagram of a trading system. The first reason low-latency C++ is still very much a thing is that what we saw on the previous slide is just the tip of the iceberg. People like to talk about those numbers because they're impressive, and that's true, but there is a lot going on in a trading system besides that. So the first reason you need low latency: if you have this amazing FPGA that can send a hundred thousand orders in a second, you probably want some low-latency C++ code around it — to receive the notification that you sent something, to do some post-trade checks, a safety or risk mechanism.

The second reason: if that FPGA is able to send an order within ten nanoseconds, it is probably very simple. Not easy — there's a lot of engineering going on there — but simple, in the sense that it compares bits and sends bits; it doesn't really understand much about how to price a derivative instrument or anything like that. It's just not possible; it's always a trade-off. What that means is that there is this thing we call in the industry a trigger price — at what price do I want to buy or sell, how much do I think this stock is worth — and that is very much an input to the FPGA. Thinking again about the headline we saw before: if our friend Elon tweets that, you can imagine our opinion about the Tesla stock is going to change rather quickly, so there is latency sensitivity there as well.
That strategy side of it again needs C++, again low latency. And the final reason is that, as with anything, it's a trade-off on engineering requirements. There are some strategies where you really want to aim for blazing-fast, nanosecond-level latency — sure, then you go for the really complicated hardware and the VHDL, go for it. But there are a lot of trading strategies where you absolutely do not need such latency, and then you use C++ and are reasonably fast — two microseconds, one microsecond, or even a bit below — and that's great. So we still have some things to talk about today; that's nice.

Now I'm going to share a story to get to our first strategy, about data modeling. When I joined Optiver almost ten years ago, there was a trading algorithm that was really smart and that wanted to be fast. You have probably encountered this in your own career: there is a software A, and at some point some engineers start working on software B. At the beginning, the goal of B is not per se to replace A, but at some point B becomes so nice that everyone wants to work on B, and then replacing A becomes the goal. You probably recognize it; it's a typical engineering thing. In this case the problem was that our algorithm A was quite fast — not blazingly fast, but relatively fast — and B was designed to be much smarter, but not, per se, faster. And I witnessed brilliant engineers working on making B, this very smart algorithm, as fast as A; months of engineering time were poured into the effort. Eventually I gave it a shot too — I was quite new there — and it just didn't work. I was quite surprised: all these amazing tactics, all these engineers trying — why didn't it work?

Well, back to what we said about designing for performance: it was not designed for it. There are many reasons why a given piece of software can't be made fast, but the main one I want to share with you today is about data: if you don't design your data properly, your software is just going to be slow everywhere. What's on this graph is a profiling result from an in-house profiler that we built on top of Clang XRay — we instrument the code and add timestamping. On the right you see some function names, and on the left — not sure if you can read it, it's a bit small — you have times, and then there is this graph. The point is this: what you see in textbooks about profiling and making things fast is that usually 80% of your CPU time is spent in 20% of your code, which is great, because then you just make that 20% fast and everything is optimized. Well, that's the textbook. The reality is that if you don't model your data properly, you will see something like this, and it's very underwhelming: there is no hotspot, there is nothing to see — it's just slow everywhere. Why is it slow everywhere? Because it cannot fit in cache. So the first strategy is going to be
about the size and the locality of your data. Here I put an example of some cache sizes, just to give you an idea of what we have on modern processors at the different cache levels. Up front, when you design your system, think about how much data you need. In trading we know roughly how many stocks, for example, we are going to trade in an algo; we know how much data we manipulate. So we know — or we should know, we should have an idea — whether it's going to fit in this or that cache, depending on the data structures we use. If it doesn't, you can choose hardware with more cache, you can split your application across several cores or even several servers, but you need to take that into account in your design. Something that is also often overlooked: we think a lot about our own code, because we love our code, but we don't really think about our colleagues'. You need to think about the system as a whole. When you put your application on a core, you effectively share the last-level cache with, say, your ex-colleague who left years ago — their code is still running on the core next to yours, and it can affect the performance of your application a lot. So, from my experience, data-model issues are the main trap engineers fall into when designing such a system.

Here is a war story from just three weeks ago, so I thought I would include it in today's presentation. On the x-axis you have two days; on the second day we updated the binary at around twenty past two. The y-axis is the latency of the system, and you see a drastic change. The change was mostly about moving data around and making sure it fit better in the cache — not necessarily that everything fit, but at least that it fit better. If you looked at the code change, it was nothing fancy: just moving things around and removing things so the data is packed more tightly together.

Let's take an example. A core element of a trading system is what we call an instrument store. We won't go into what an instrument is exactly, but you can imagine it's a stock or an ETF, something like that. It's a relatively small data structure, less than a kilobyte in general, and we don't have that many of them — somewhere between one and ten thousand in an algo. You want to do a lookup from time to time: you receive an ID from an external process and want to look the instrument up, but the latency of that lookup is not too critical; you just want to be reasonably fast. The last requirement is that most of the code wants to store references to these instruments — we don't want to do lookups all the time within the system, we want to keep references, because that's just a nice characteristic. The store — the instrument store — owns all these things. So what data structure should we use, given these requirements? From my experience, what most people use is std::unordered_map. And I like std::unordered_map; it has great characteristics — it's reasonably fast, and it actually gives you stable references across rehashes, which is nice. But the big downside of std::unordered_map is that all the elements — the values of the
hash map — will be scattered all over your memory; iterating it is a random walk through your heap. If you have watched talks from previous Meeting C++ or CppCon editions about performance or cache friendliness, you have seen all those benchmarks of std::list versus std::vector, and I think people sometimes take them a bit too literally — we look at std::vector versus std::list and we conclude "std::list is bad." It's not really about std::list itself; it's about the memory model of std::list. If you want to be fast, that memory model is bad, and std::unordered_map has the same memory model: it is a node container, meaning each element is allocated in a separate node. That's the main problem here.

A better solution, something nicer, is a flat, vector-based map — but that no longer satisfies our requirements, because std::vector, on resize, will not keep references valid. That's not great. So my favourite solution for this is a container I'll call a stable vector — I'll show you the code on the next slide. It's a very, very simple container, and it is not boost::container::stable_vector, which is actually quite different: Boost's stable_vector is again a node container. This stable vector has the properties of a vector — largely contiguous memory — plus stable pointers, the stability that std::vector doesn't give you. So what is it? It's a std::vector of static_vectors. If you're not familiar with boost::container::static_vector, it's roughly a std::array plus a size, so you know how many elements are in the pre-allocated storage. We have a vector of these static_vectors, and we can define the chunk size. Can you recognize it — this container is very close to an STL container, do you recognize it? Yes, that was fast: std::deque. Indeed, it's basically std::deque. The big difference from std::deque is that we can customize the chunk size. You might say that's just a small difference; effectively it's big, because for locality it really matters, and std::deque has fairly poor locality because of that — the implementation is compiler-dependent, but most implementations pick a relatively small chunk size. As I said, the downside is that you pay a little overhead when you call operator[] or use iterators — that's what I put on the right: if you profile it you'll see you spend some time shifting bits to compute the index (no modulo or division, because you take a power-of-two chunk size). If you have slightly different requirements — say you mostly want to iterate through the container — you can do a different implementation; it's just a trade-off. This is one of my favourite containers because it's simple — simple things are fast — and it keeps things together.
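As a rough illustration, here is a minimal sketch of such a stable vector — a std::vector of fixed-capacity chunks (boost::container::static_vector) with a tunable chunk size. This is not the production container from the talk, just an assumption of what it could look like; here the outer vector is reserved up front so that growing never moves existing chunks (std::deque gets the same stability by heap-allocating its blocks instead).

```cpp
// Sketch only: a vector of fixed-size chunks with stable references.
#include <boost/container/static_vector.hpp>
#include <cstddef>
#include <vector>

template <typename T, std::size_t ChunkSize = 1024>  // power of two: index math becomes shift/mask
class StableVector {
    using Chunk = boost::container::static_vector<T, ChunkSize>;
    std::vector<Chunk> chunks_;

public:
    // Reserve the outer vector up front so push_back never relocates chunks.
    explicit StableVector(std::size_t max_size) {
        chunks_.reserve((max_size + ChunkSize - 1) / ChunkSize);
    }

    T& push_back(const T& value) {
        if (chunks_.empty() || chunks_.back().size() == ChunkSize)
            chunks_.emplace_back();                   // new chunk; existing elements never move
        chunks_.back().push_back(value);
        return chunks_.back().back();                 // this reference stays valid
    }

    T& operator[](std::size_t i) { return chunks_[i / ChunkSize][i % ChunkSize]; }

    std::size_t size() const {
        return chunks_.empty() ? 0
             : (chunks_.size() - 1) * ChunkSize + chunks_.back().size();
    }
};
```

With a power-of-two chunk size the divisions in operator[] compile down to shifts and masks — that is the small per-access overhead mentioned above.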
So that's our strategy: keep things local and small. A great measure to check that you're doing it well is the WSS, the working set size — not to be confused with the RSS. The RSS, the resident set size, is roughly what you have allocated and touched; the WSS is the size of your workload, what you actually touch while running. Ideally you'd measure it in cache lines, but effectively there isn't really any tool that measures it in cache lines. Brendan Gregg, the performance engineer from Netflix, developed a small and very nice tool to measure it — but in pages. You might say that's not so nice, because pages and cache lines are quite different, but let me show you that it's actually really useful.

We take a benchmark like this. To be clear, in this benchmark we do not care at all about execution speed; we just use this code to measure the WSS. We declare a stable vector, put some integers in it, and then do a sum — the sum just touches all the elements, so we can see the WSS of this workload. We do the same with std::unordered_map. One small but important detail: you see there is a std::list called tmp, where we push back a bunch of elements in between our inserts. That is very important when you do such a micro-benchmark: it randomizes the heap. Otherwise the results you get will be so different from what you see in production code that they're meaningless. In other words, if you don't do that, then for the unordered_map all your allocations come one after another, and depending on the heap — the malloc implementation your system uses — you might get contiguous allocations, which changes the results a lot; or you might not get that at all. The idea is just to remove this artificial pattern that you only have when micro-benchmarking.

And this is what we get: about one megabyte of WSS for the stable vector — a hundred thousand int32s is 400,000 bytes — and about 400 megabytes of WSS for the unordered_map. You might say, okay, this tool just isn't working, this is obviously not the truth, why would we even look at it? That's why I said it's actually quite nice that it measures in pages: it is what it is, it measures pages, and the number may look exaggerated, but it gives you a sense of the locality of your data. In this benchmark, every single int32 ends up on its own page. When you think about it, you're wasting quite a lot of memory; even if you measured in cache lines the results would be much closer, but the point is that you're wasting so many pages that you will suffer — mainly from TLB misses, because you'll be doing page-table lookups all the time. And this is with just one container. So again: the strategy is size and locality, you can check it with the WSS, and this is why keeping things together matters so much.
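For reference, here is a hedged sketch of what such a micro-benchmark could look like — not the exact code from the slide. All that matters is that every element gets touched once, so a tool like Brendan Gregg's wss can observe how many pages the workload uses; the throw-away std::list randomizes the heap so the unordered_map nodes don't end up artificially contiguous. StableVector is the container sketched just above.

```cpp
// Sketch of the WSS micro-benchmark: we care about memory touched, not speed.
#include <cstdint>
#include <cstdio>
#include <list>
#include <unordered_map>
// StableVector<T, ChunkSize>: the container sketched earlier in this talk.

constexpr std::size_t N = 100'000;

std::int64_t sum_stable_vector() {
    StableVector<std::int32_t> values(N);
    std::list<std::int32_t> tmp;                             // heap randomization between inserts
    for (std::size_t i = 0; i < N; ++i) {
        values.push_back(static_cast<std::int32_t>(i));
        tmp.push_back(0);
    }
    std::int64_t sum = 0;
    for (std::size_t i = 0; i < N; ++i) sum += values[i];    // touches the whole working set
    return sum;
}

std::int64_t sum_unordered_map() {
    std::unordered_map<std::int32_t, std::int32_t> values;
    std::list<std::int32_t> tmp;
    for (std::size_t i = 0; i < N; ++i) {
        values.emplace(static_cast<std::int32_t>(i), static_cast<std::int32_t>(i));
        tmp.push_back(0);
    }
    std::int64_t sum = 0;
    for (auto& [key, value] : values) sum += value;          // a random walk through the heap
    return sum;
}

int main() {
    std::printf("%lld %lld\n",
                static_cast<long long>(sum_stable_vector()),
                static_cast<long long>(sum_unordered_map()));
}
```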
Now that we've talked about the data model, let's look at how to share that data, because it's one thing to have a single process that is fast and stays in cache; on any modern processor we have many cores, and we want to share data and communicate between applications. Here is the setup: we receive market data at 10 gigabits per second — that's the market-data receiver at the top — and, very typically for a trading system, we want to fan all of this data out to many different applications. You have something like 32 cores on such a server, roughly 30 applications running on it, and you want all of them to receive the same market data. So what should our strategy be? We want low latency — say an average under 100 nanoseconds — and not too much jitter. Our strategy should target many consumers but only one producer, and the key point is that the consumers must not be able to affect the producer. Why? Because the producer upstream is receiving line-rate data; if any consumer could block it, we would start dropping packets and the whole system would fall apart. They must not affect it.

To do that, we distinguish between two things we want to share: events and state. What do I mean by that? First, how much data are we talking about? This is on one multicast channel, just to give you an idea: on the US market, for the S&P future, we receive at least 300 million messages per month. That is effectively not so much — if this is really all one application has to receive, it's manageable. The problem is that it's very bursty. Think again about the headline we saw before: the day can be very quiet, then there is a headline and that's it, everyone reacts, and you get a burst of a million messages in less than a second.

Back to events and state. What we want to share between these applications is, for example, an order book: think of it as a few prices at which we want to buy or sell a given instrument. We have two ways in our system to share this data. The first is to send events: every single insertion, deletion, or amendment of this order book gets sent. The second is to simply share the whole thing, the state. With the first, you receive a notification; with the second, you don't — so they are quite different, and we want both in our system. The strategy — the algo that reacts to an event in order to send an order — needs to receive that event in a low-latency way. But take the trading GUI at the bottom: if you just send prices to a GUI, it doesn't really matter; you don't want to push 10 gigabits of prices per second to a GUI, you're never going to be able to look at that. You just want a state that the consumer can poll from time to time. So it's push versus pull, and we want to look at both.

Let's start with state. Here we are firmly in the domain of concurrent data structures, and we have a lot of options, but today I'm just going to talk about one, because I think it is — maybe not the perfect fit, but definitely underrated; not many people talk about it, and it's a great fit for trading systems. It's used in Chromium, and also in the kernel, because it was originally introduced in the Linux kernel, where it was called a seqlock: a fast reader-writer lock.
The problem the kernel developers had at the time was between userland and kernel land: as user code, if you call gettimeofday in a loop, the time you get back from that C function is updated by the kernel, and before this lock existed there was just a basic lock there, so a reader — code in userland — could effectively do a denial of service where the kernel couldn't take the lock. They wanted to change that, and they also wanted it to be high-performance, so they came up with the seqlock.

So what is it exactly? Let's look at code — I think that will be simplest. By the way, this is a simplified version; this is not a talk about atomics, and further in my slides there is a reference in case you want the full, serious version. So let's not get distracted by the atomics — but also, let's not use this code directly in production, even though I think it compiles. We have an atomic version counter, and then we have our data. When the writer wants to publish something, it does +1 on the version, copies the data, and does +1 on the version again. On the right I put the assembly of this store, and it is about as fast as you can get: there isn't even any lock-prefixed instruction — this is on Intel, x86, of course. So it's fast, and you might wonder where the trick is. The trick is in the reader; the reader is a bit special, and that's where the smartness of this lock comes from. As a reader, you read the version; if the lowest bit is set to one, it means there is a write in progress, and in that case you return — you failed to read the data. Otherwise you copy the data, and then you have to re-check the version, because while you were copying, the writer may have written something and your copy could be corrupted. So you check the version again: if the version numbers match, great, there was no write. Of course, T needs to be trivially copyable here, since we're essentially doing a memcpy.

This lock is quite special, and it has very interesting properties that are amazing for trading systems. First, it's based on what we call optimistic locking. Why optimistic? Because we start copying the data assuming it isn't going to be changed while we read. If you have a lot of contention, this lock is going to behave quite badly, but if in general you don't, you go as fast as you can. It fits a particular pattern: only a few producers — in our case exactly one — and many consumers; in the diagram I showed there were four or five, in a real trading application it's more like between ten and a hundred.
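Here is a minimal sketch of that seqlock, close to the simplified version just described — assuming a single writer and trivially copyable data, and glossing over the exact fences a fully correct C++ implementation needs (see the paper referenced at the end):

```cpp
// Simplified seqlock sketch -- not production code.
#include <atomic>
#include <type_traits>

template <typename T>
class Seqlock {
    static_assert(std::is_trivially_copyable_v<T>);

    std::atomic<unsigned> version_{0};
    T data_{};

public:
    // Single writer: always succeeds, never waits on readers.
    void store(const T& in) {
        unsigned v = version_.load(std::memory_order_relaxed);
        version_.store(v + 1, std::memory_order_relaxed);    // odd = write in progress
        std::atomic_thread_fence(std::memory_order_release);
        data_ = in;
        version_.store(v + 2, std::memory_order_release);    // even = stable again
    }

    // Readers: optimistic; may fail and retry if a write raced with the copy.
    bool try_load(T& out) const {
        unsigned v1 = version_.load(std::memory_order_acquire);
        if (v1 & 1) return false;                             // write in progress
        out = data_;
        std::atomic_thread_fence(std::memory_order_acquire);
        unsigned v2 = version_.load(std::memory_order_relaxed);
        return v1 == v2;                                      // unchanged -> the copy is clean
    }
};
```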
That's quite nice. And the last characteristic: the producer is wait-free. The producer always gets to write — it always gets the lock, no matter what the readers are doing — and that is also really nice. So for a system where you have a hundred thousand of these very small elements to share, the chance of contention on any one of them is very low; it's a great property that the more elements you have, the better it gets, and in general the data we share per element is quite small — a few doubles or floats, a few prices or volumes — so it works very well. That's the first building block for sharing data. As for latency, we could measure it, but it's essentially the latency of a memcpy, and on the reading side the main variables are the number of readers and how frequently the producer writes, so I didn't feel like adding a benchmark — it depends on too many variables of your system. But with this you are about as fast as you can get when it comes to sharing state.

The second thing we want to look at is events: we want to send events with a notification, not just share state in shared memory. For that you again have a few options. As I said, our strategy is very low jitter and we're aiming at less than 100 nanoseconds, and that discards quite a few options. TCP is not really a natural fit anyway, because we want to fan out and it doesn't scale well for that. UDP, why not — but then you need to do it in user space, because you cannot afford going through the kernel for these events given our requirements. You could use kernel bypass, but in my opinion that's a lot of code to manage and quite some complexity, while we don't actually need to send anything between servers here: our problem is on one server, 32 cores, and we want communication between those cores. So my proposal is a queue — a very basic ring buffer — in shared memory.

Let's look at what kind of queue we can use. The first one is not yet multi-consumer: this is the typical single-producer single-consumer queue. It's the one used in the kernel — if you look at io_uring, there is a queue that looks just like this. You have two indices, one for the writer and one for the reader, each cache-line aligned — again, this is x86, 64 bytes — so that you avoid false sharing. And it's what you could call a collaborative queue, because the writer knows when the queue is full and can therefore back off.
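For reference, a minimal sketch of such a single-producer single-consumer ring buffer — a hypothetical simplification, assuming a power-of-two capacity — with two cache-line-aligned indices and a writer that backs off when the queue is full:

```cpp
// Sketch of a collaborative SPSC ring buffer.
#include <atomic>
#include <cstddef>
#include <optional>

template <typename T, std::size_t Capacity>   // Capacity must be a power of two
class SpscQueue {
    static_assert((Capacity & (Capacity - 1)) == 0);

    alignas(64) std::atomic<std::size_t> write_idx_{0};  // owned by the producer
    alignas(64) std::atomic<std::size_t> read_idx_{0};   // owned by the consumer
    alignas(64) T buffer_[Capacity];

public:
    bool try_push(const T& value) {
        const auto w = write_idx_.load(std::memory_order_relaxed);
        const auto r = read_idx_.load(std::memory_order_acquire);
        if (w - r == Capacity) return false;              // full: the writer backs off
        buffer_[w & (Capacity - 1)] = value;
        write_idx_.store(w + 1, std::memory_order_release);
        return true;
    }

    std::optional<T> try_pop() {
        const auto r = read_idx_.load(std::memory_order_relaxed);
        const auto w = write_idx_.load(std::memory_order_acquire);
        if (r == w) return std::nullopt;                  // empty
        T value = buffer_[r & (Capacity - 1)];
        read_idx_.store(r + 1, std::memory_order_release);
        return value;
    }
};
```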
But we're looking at quite a different beast here: we want multi-consumer. This read index — do we really want between ten and a hundred read indices? Not really; that scales relatively poorly. And the push-back mechanism, do we actually want it? No: we said a reader should not be able to compromise the writer. In a system with many readers, I don't want a slow reader to compromise the entire trading system — that would be really bad. If a reader is slow, I just want that reader to die: at some point the writer simply overruns it, and we let it die. And that's good news, because relaxing that constraint — which is effectively what we want — lets us design something that scales better.

So this is the first queue I propose. It uses two indices, but both are updated by the writer. The readers — now we have many of them — are a bit all over the queue; the writer doesn't really know where they are. What happens is that the writer updates the pending index, then copies the data, then updates the index, and that is how we detect an overrun. Again, we align these indices to 64 bytes — very x86-oriented. And again this is simplified code; the full code wouldn't fit on a slide, because you need to manage the wrapping of the queue and so on, but this is roughly what it looks like. Why do we have these two indices? The logic — the smartness, in a way — is again in the reader: you need both the pending index and the index to check for an overrun. If you had only one index — only one of these atomic variables — you would have to choose whether to update it before or after your memcpy, and there would be no way for the reader to detect whether there was actually an overrun. So again, very simplified, absolutely not complete, but that's roughly it.

What kind of performance do we get with such code? That's us, in orange. I benchmarked against the famous Disruptor from the Java world, because that's the reference when it comes to a blazing-fast queue that also does fan-out — and it is fast. Then two other comparisons, which I must say are not fair: boost::lockfree::queue and the moodycamel one. Not fair why? Because although they are SPMC queues, they do unicast, not multicast: each consumer takes one of the elements, it's more like a load balancer. So I put them here, but it's not really fair to compare those numbers directly.
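Here is a sketch of that two-index broadcast queue, under the same assumptions as before (single writer, trivially copyable T, power-of-two capacity, simplified fences): the writer owns both counters and simply laps slow readers, and each reader keeps its own sequence number and uses the pending index to detect when it has been overrun.

```cpp
// Sketch of a single-producer, multi-consumer broadcast ring buffer.
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <type_traits>

template <typename T, std::size_t Capacity>               // Capacity: power of two
class BroadcastQueue {
    static_assert(std::is_trivially_copyable_v<T>);
    static_assert((Capacity & (Capacity - 1)) == 0);

    alignas(64) std::atomic<std::uint64_t> pending_{0};    // writer: slot currently being written
    alignas(64) std::atomic<std::uint64_t> committed_{0};  // writer: slots fully written
    T buffer_[Capacity];

public:
    // Single producer: never waits, never blocked by readers.
    void push(const T& value) {
        const auto seq = committed_.load(std::memory_order_relaxed);
        pending_.store(seq + 1, std::memory_order_relaxed); // announce the write
        std::atomic_thread_fence(std::memory_order_release);
        buffer_[seq & (Capacity - 1)] = value;
        committed_.store(seq + 1, std::memory_order_release);
    }

    enum class ReadResult { ok, empty, overrun };

    // Each reader keeps its own next sequence number, starting at 0.
    ReadResult try_read(std::uint64_t& reader_seq, T& out) const {
        if (committed_.load(std::memory_order_acquire) <= reader_seq)
            return ReadResult::empty;                       // nothing new yet
        out = buffer_[reader_seq & (Capacity - 1)];
        std::atomic_thread_fence(std::memory_order_acquire);
        // If the writer has started reusing our slot while we copied, the copy
        // may be torn: this reader was too slow and simply "dies".
        if (pending_.load(std::memory_order_relaxed) > reader_seq + Capacity)
            return ReadResult::overrun;
        ++reader_seq;
        return ReadResult::ok;
    }
};
```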
So now, how do we go faster? This is nice, but it's always interesting to get a little bit faster — especially when there is a queue written in Java ahead of us. The idea is to have less contention. The reason we're somewhat limited is these two counters, the index and the pending index: every single push and pop touches them — the writer updates them, the readers read them — and that causes a lot of cache-coherency traffic. Modern CPUs use various cache-coherency protocols, but in short, that cache line will never stay in an exclusive state, and that's the main bottleneck here. So the idea we want to try is to spread those atomic counters all over the queue, roughly one per element. We still need something in each element to detect an overrun: one bit to see whether the element is currently being written, and a version to see whether the writer has lapped the reader. This is what it looks like: we start at zero, we publish a few elements, the sixth one is in progress — it's being written — and the writer continues like that. The queue still has a header with an atomic variable, but the big difference is that the header is only used when a reader joins the queue; the rest of the time everyone uses the counter in each block. The code looks like this — again a simplified version; I didn't want to go through all the atomics because it's a bit tricky, you actually need fences to implement this correctly. But does this look familiar? What is it? It's the seqlock we saw just a few slides ago. What we effectively did is treat the ring buffer as an array of blocks and put a little seqlock in each of them. And why does this work? Because of the seqlock properties: it's optimistic locking, the more of them you have the better it gets, and as we saw before they work great with just a few producers — in our case one — and many readers.

That looks quite promising, and it is: we get decent performance compared to the other queues. With one reader we're at more than 60 million messages per second — that's really a lot, I'm not sure what exactly we would do with it — but the nice property of this queue is that even at 10 or 100 readers we stay relatively constant and on reasonable numbers: with 10 readers we can still dispatch more than 20 million messages per second. That's nice. To finish on concurrent access: our strategy — low latency, low jitter, and never compromising the producer — that's what we now have to dispatch our events.
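And here is a sketch of this block-based variant: the ring is an array of little seqlocks, one version counter per slot, so in steady state the readers and the writer no longer hammer a single shared cache line. The header that joining readers would use to find the current write position is left out, and the fences are simplified as before.

```cpp
// Sketch of a broadcast ring buffer built as an array of per-slot seqlocks.
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <type_traits>

template <typename T, std::size_t Capacity>               // Capacity: power of two
class BlockQueue {
    static_assert(std::is_trivially_copyable_v<T>);
    static_assert((Capacity & (Capacity - 1)) == 0);

    struct alignas(64) Block {
        std::atomic<std::uint32_t> version{0};             // odd while being written
        T data{};
    };
    Block blocks_[Capacity];
    std::uint64_t writer_seq_ = 0;                         // only the single writer touches this

public:
    void push(const T& value) {
        Block& b = blocks_[writer_seq_ & (Capacity - 1)];
        const std::uint32_t v = 2 * (writer_seq_ / Capacity) + 1;   // lap-based version
        b.version.store(v, std::memory_order_relaxed);              // odd: write in progress
        std::atomic_thread_fence(std::memory_order_release);
        b.data = value;
        b.version.store(v + 1, std::memory_order_release);          // even: stable
        ++writer_seq_;
    }

    enum class ReadResult { ok, empty, overrun };

    ReadResult try_read(std::uint64_t& reader_seq, T& out) const {
        const Block& b = blocks_[reader_seq & (Capacity - 1)];
        const std::uint32_t expected = 2 * (reader_seq / Capacity) + 2;
        const std::uint32_t v1 = b.version.load(std::memory_order_acquire);
        if (v1 < expected) return ReadResult::empty;        // writer hasn't reached us yet
        if (v1 > expected) return ReadResult::overrun;      // writer lapped this reader
        out = b.data;
        std::atomic_thread_fence(std::memory_order_acquire);
        if (b.version.load(std::memory_order_relaxed) != expected)
            return ReadResult::overrun;                     // slot was reused during the copy
        ++reader_seq;
        return ReadResult::ok;
    }
};
```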
Now let's zoom out again a little. We first looked at optimizing the data model within one process, then at how to share data between applications; the last building block you need for such a low-latency system is to look at the server as a whole. You can run a quick test to see whether your server is properly tuned or not. This graph plots consecutive calls to RDTSC — an x86 instruction that reads the TSC, the time-stamp counter, the wall clock that lives in the CPU itself. We read this clock about 200 times — 256 here — and plot the distribution of the deltas, along with the standard deviation. What we see in this graph is roughly five or six latency buckets and quite a high standard deviation. Why do we get that? And this second graph is what you should get when your system is properly configured and tuned for low-latency applications. (This is the code, just for reference: calling RDTSC and using Boost accumulators so as not to make any mistakes when computing the statistics.)

So the main difference between those two graphs is C-states and P-states. Your CPU is optimized for power consumption. The first mechanism is P-states: this laptop that I'm using to show the slides isn't doing high-performance computing at the moment, it's just displaying slides, so it's probably running at 400 MHz. It can run at a much higher voltage and frequency — up to three gigahertz — but right now it's at 400 MHz, and that's great; that's also why your battery lasts so long. The problem is when you have a lot of variance in the input to your system. Back to financial markets: you can have a very quiet day, then a headline, and you get a burst of a million messages. It's problematic to be running at 400 MHz when that burst arrives; you really can't afford it. Even worse: if the system really isn't doing much, your CPU might enter a C-state, and C-states are effectively idle states — the CPU is sleeping. Imagine being asleep when a million messages arrive; that's not going to fly. Especially because when the CPU enters a C-state — depending on the C-state, and without going into the details — it starts flushing the different levels of cache. So not only are you suddenly being woken up, but all your caches are empty, and you'll be relatively slow processing your data. So what you want to do, if you aim for low latency, is tune this and disable those states — not on every single server you own, of course, just on the ones that need to process at line rate.
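For reference, a minimal version of that back-to-back RDTSC check might look like the sketch below (the slide's version used Boost.Accumulators for the statistics; plain arithmetic here, x86-64 with GCC or Clang assumed). On a well-tuned core the deltas are tight; with C-states and P-states enabled you see several latency buckets and a much larger standard deviation.

```cpp
// Quick tuning check: time consecutive RDTSC reads and look at the spread.
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <x86intrin.h>   // __rdtsc(), x86-64 with GCC/Clang

int main() {
    constexpr int kSamples = 256;
    std::uint64_t ts[kSamples];
    for (int i = 0; i < kSamples; ++i)
        ts[i] = __rdtsc();                          // consecutive reads of the TSC

    double sum = 0.0, sum_sq = 0.0;
    for (int i = 1; i < kSamples; ++i) {
        const double d = static_cast<double>(ts[i] - ts[i - 1]);
        sum += d;
        sum_sq += d * d;
    }
    const int n = kSamples - 1;
    const double mean = sum / n;
    const double var = sum_sq / n - mean * mean;
    std::printf("mean delta: %.1f cycles, stddev: %.1f cycles\n",
                mean, var > 0.0 ? std::sqrt(var) : 0.0);
}
```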
Another common issue is the shared L3 — the LLC, the last-level cache; it doesn't always have to be the L3, but in general it is. Here is an optimization I did a few years ago where we didn't change a single line of code — no code change, just reorganizing applications, moving them from one core to another. This is what we briefly talked about earlier: you need to consider the system as a whole. Some code — not necessarily even yours — running on a different core can use more or less memory and cause what we call L3 thrashing. So by moving applications between servers or onto different cores, and making sure all the low-latency ones sit together — in general they shouldn't use that much memory — you can achieve great results without changing any code. There is much more: CPU core isolation, interrupt affinity. Between servers — today we mostly focused on sharing data within one server, so shared memory — but between servers you definitely need a networking stack, and there isn't much choice: you need kernel bypass, user-space networking. Then NUMA-aware code: if you allocate your memory on the far NUMA node and then access it, you're just paying for nothing. And last but not least, huge pages: you can get a great benefit from huge pages. You can enable them with what Linux calls transparent huge pages, but those come with a cost — actually quite an expensive one — so in general, if it really matters, your code needs to be aware of huge pages and you should use allocators that support them. Further on in the slides, in the references, I put some links if you want to go deeper on this.

Now the last section of this talk; let's zoom out once more. You now have reasonably fast applications, reasonably fast communication between those applications, and you've done all the system tuning, so everything is fast. Awesome. One thing you'll want is to scale: you're going to have one trading server, then a second one, and maybe at some point a thousand of them. Even if you only have ten, you want to make sure that everything you measured on day zero still holds — because it's always the same: on day zero everyone is excited, all eyes are on the new system, but what about six months later, a year later, five years later? You need to make sure you stay fast, and there's no secret, nothing ground-breaking here: you just need to measure continuously. The same way you would write a Google Benchmark to make sure you were fast on day zero, you need to keep measuring, in production, always.

Here is a graph, a dashboard, that illustrates very well why you should constantly measure in production, because a few questions immediately come to mind when you look at it. What is it? It's a period of a few hours with the same application running on different exchanges and products; the y-axis is the processing latency. Even without knowing exactly what this application does — because it is the same application — I can immediately ask: why is the latency of the top instance, the blue one, four times higher than the second one? And why is there a factor of ten with the third one? What's happening there? You need these high-level dashboards and metrics to get that kind of view, and then you can dig into more low-level, system-level metrics. How does it work, how do you do it? Again, you need to plan for it when you design your application; you can't just go onto every server and run perf or something — you should do that too when there is a problem, but that's a later stage. First you want to collect these metrics, so collection needs to be built into your application. It's not hard to do; it just usually has to be done at the beginning, when you design the software. In short, you want to timestamp — not every function, but the critical parts of your system: how long does it take to process this event, this I/O, how much CPU time does it use.
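As a sketch of what such built-in measurement could look like (hypothetical names, not the in-house library): a small RAII timer around a critical section that records RDTSC deltas, which a background thread would later aggregate and publish — queue occupancy would be sampled and exported the same way.

```cpp
// Sketch: built-in latency metrics around a critical section.
#include <atomic>
#include <cstdint>
#include <x86intrin.h>   // __rdtsc()

// Very small stats sink; in practice these samples would be pushed onto an
// SPSC queue and published to a database by another thread, as described next.
struct LatencyStats {
    std::atomic<std::uint64_t> count{0};
    std::atomic<std::uint64_t> total_cycles{0};
    std::atomic<std::uint64_t> max_cycles{0};

    void record(std::uint64_t cycles) {
        count.fetch_add(1, std::memory_order_relaxed);
        total_cycles.fetch_add(cycles, std::memory_order_relaxed);
        std::uint64_t prev = max_cycles.load(std::memory_order_relaxed);
        while (cycles > prev &&
               !max_cycles.compare_exchange_weak(prev, cycles, std::memory_order_relaxed)) {
        }
    }
};

// RAII helper: timestamp a critical section with RDTSC.
class ScopedTimer {
    LatencyStats& stats_;
    std::uint64_t start_;
public:
    explicit ScopedTimer(LatencyStats& stats) : stats_(stats), start_(__rdtsc()) {}
    ~ScopedTimer() { stats_.record(__rdtsc() - start_); }
};

LatencyStats g_order_processing;          // hypothetical metric name

void process_order_event(/* ... */) {
    ScopedTimer t(g_order_processing);    // measures this whole critical section
    // ... handle the event ...
}
```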
And we talked earlier about queues: all those queues are fixed-size, so you're probably very interested in the occupancy of each queue at a given interval — am I 10% behind, or am I 90% behind? Because even 90% behind means that if things get a bit busier, the consumer will be overrun. You're interested in all these characteristics of your system. And even better than looking at them — because that doesn't really scale; you can look at these graphs every morning over a coffee — is to have code that automatically checks that the median, the average, whatever statistic you pick for your system, actually matches your expectations, and alerts you when it doesn't. Concretely, back to our system and its queues: for these metrics you take timestamps — on x86 we always use RDTSC, because that's about as fast as you can get; in just a few cycles you get a decent wall clock — you keep statistics on the differences between RDTSC readings, and you push them onto a queue, just an SPSC queue. Then another process or thread consumes those metrics and publishes them to a database, and you can visualize them — in the earlier slide we were using Grafana.

That brings me to the conclusion. We saw a few things today, but there is of course a lot more to say about low-latency systems, and about trading systems — an hour is way too short. Still, I hope that today's strategies and tactics give you, maybe not everything you need to build a full trading system, but definitely the building blocks that we use — that most companies in the industry use — so that you can be robust, meaning that no matter how busy things get you can keep trading, and also fast, back to why you need that latency in the first place. A few references if you want to go further on these topics: I always put Ulrich Drepper's famous paper first — "What Every Programmer Should Know About Memory", from the glibc maintainer — because when it comes to the data model it is the reference; relatively old by now, but still very much up to date. Then some others: I always like Mike Acton's CppCon talk from eight years ago about data-oriented design — a great talk. And the paper on seqlocks: as I said, they are quite tricky to get right in C++, so there is a full paper about how to do that correctly in C++. So, thank you very much. I see that it's exactly 2 p.m.; I'll stay around if you have questions. Thank you.