What if you could apply AI and machine learning to prevent failures in your network and IT infrastructure operations, eliminate noise, and dramatically reduce the time to remediate and root-cause failures? What if I told you you could do that today, do it at scale, and do it with software that is deployed in production by very large-scale enterprises and providers? That is what Augtera does.

I'm the founder and CEO at Augtera. We transform network operations by applying AI and machine learning to operational data, starting with the network but then moving all the way to servers and infrastructure, both in the data center and in service provider networks, as well as in large-scale LLM infrastructure.

If you look at the last several decades of operating the network, it has been riddled with noise, and more often than not network operators learn of issues after things break, when applications complain. There is increasingly more data in the cloud, in the data center, in the WAN, in the 5G RAN. There are more and more layers in this infrastructure, but at the same time operators are dealing with this massive haystack of data, and they are left with siloed tools to mine it.

That is where Augtera saw an opportunity several years back to apply AI and machine learning: to take this data and apply real-time algorithms to find AI insights. This was long before the current LLM movement in the industry. What we do is apply largely unsupervised algorithms to find anomalies, find misbehaviors, and predict issues, and then we apply unsupervised algorithms again by learning the topology, discovering the network model and the connectivity in the network, and automatically correlating events and anomalies to eliminate noise and cut down tickets.

Our customers have seen this at very large scale, in the data center, on the service provider side, and on the government side: 15,000 switches and routers in one data center case, and several thousand service provider edge and P routers in the case of someone like Orange, which has done a public press release with us. We have been able to cut down tickets by 70 to 90%, cut the time to detect issues from 47 minutes to 1 minute, and cut the overall time it takes to remediate and mitigate by up to 60% or more.

All of this is done by a platform that is software only. It can be deployed fully on premise, it can be air-gapped, or it can be consumed as SaaS, depending on your comfort, and it scales horizontally on a microservices architecture. It takes data from all parts of your infrastructure, switches, routers, servers, as well as firewalls and load balancers, and it can also connect to your data lakes, for example Prometheus and Splunk. We have very large-scale implementations of the telemetry that has existed in networks for decades, SNMP and syslog, and we are designed for big data and streaming telemetry with OpenConfig and gRPC/gNMI. We are agentless: we rely on the network infrastructure and integrate with it. At the same time, we do have an agent that adds additional value on Linux operating systems, for example SONiC, servers, or VMs. We can take in data as JSON logs, we can take in data through Kafka, so there are many ways to integrate data, including Redfish from the servers.

All of this then comes into our AI and machine learning engine, which applies many purpose-built algorithms.
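To make the notion of unsupervised anomaly detection on streaming telemetry concrete, here is a minimal sketch, assuming a simple EWMA baseline with a z-score test over one simulated per-interface error counter. The metric, thresholds, and method are illustrative assumptions only, not the purpose-built production algorithms described above.

```python
# Minimal sketch: unsupervised anomaly flagging on a single streamed
# interface metric (e.g. per-minute CRC error counts), using an EWMA
# baseline and a z-score test. Illustrative only.
from dataclasses import dataclass
import math
import random

@dataclass
class EwmaDetector:
    alpha: float = 0.1       # smoothing factor for mean/variance
    threshold: float = 5.0   # z-score above which we flag an anomaly
    warmup: int = 30         # samples used to learn a baseline first
    mean: float = 0.0
    var: float = 1.0
    seen: int = 0

    def _learn(self, value: float) -> None:
        # Fold the sample into the running baseline.
        self.mean = (1 - self.alpha) * self.mean + self.alpha * value
        self.var = (1 - self.alpha) * self.var + self.alpha * (value - self.mean) ** 2

    def update(self, value: float) -> bool:
        """Return True if `value` is anomalous versus the learned baseline."""
        self.seen += 1
        if self.seen <= self.warmup:
            self._learn(value)
            return False
        z = abs(value - self.mean) / math.sqrt(self.var + 1e-9)
        anomalous = z > self.threshold
        if not anomalous:
            # Only fold normal samples back into the baseline.
            self._learn(value)
        return anomalous

if __name__ == "__main__":
    detector = EwmaDetector()
    # Simulated per-minute CRC error counts: mostly quiet, one burst at the end.
    stream = [max(random.gauss(2, 1), 0.0) for _ in range(60)] + [40, 45, 50]
    for minute, errors in enumerate(stream):
        if detector.update(errors):
            print(f"minute {minute}: anomaly, crc_errors={errors:.1f}")
```

In this toy stream the burst of errors at the end is flagged while the quiet baseline is not.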
I should take a pause here. When we started Augtera, we looked at open source. We took production data, which took us a while to get, and we tried applying open-source algorithms, and the result was really nothing but noise. As a result, we went down the path of building our own algorithms. That took us a long time, and it was a hard road, but it led to results with much higher fidelity and much more actionability.

So what we have is many algorithms for finding misbehaviors across literally hundreds of metrics, in the TCP layer, in the optical layer, in the HTTP layer, correlating all of that by learning the topology and the model of the network, and then driving actions by creating tickets and workflows. Some of our customers are beginning to automatically remediate issues: for example, if there are CRC errors, optical issues, or link flaps, you can shut down the interface automatically, and in large environments with multipath that is better than having traffic transit a port or an interface that is actually seeing issues.

Our solutions cut across all segments of the infrastructure: hybrid cloud, data center, SD-WAN, service provider backbone, 5G environments, campus, and so on. They range from observability, and I want to draw a distinction here, observability being really more sophisticated analytics and visibility, to AIOps.

We have a lot of use cases around prevention. One very common and popular use case, and a difficult one to do, is our ability to take packet and flow data, find TCP retransmits in that data, and do application-aware AIOps for the infrastructure. We can find application misbehavior using TCP retransmits and correlate that with congestion on specific links in the network, or drive mean time to innocence, so the network operators can very quickly tell the server and application people: the network is not to blame, it is running clean, the problem is somewhere else. We do that by bringing many things together, you know, TCP data, queue data, link data, all of that.

Another common use case, which is getting more and more traction now, especially with LLM infrastructure being deployed in the data center, is optical misbehavior. There is increasing use of optics on servers, and as an optic misbehaves we are often able to detect those signals days in advance, letting operators do maintenance before the optic breaks and starts impacting applications.

Another use case is post-change verification, especially for subtle changes. You know, you apply a BGP change that starts to draw in extra traffic, which results in traffic bursts and congestion in the network, and maybe has a completely unforeseen impact on a firewall that has not been updated. We can correlate all of this automatically, literally hundreds of different events and anomalies, and create a single ticket with the root cause being the config change.

Another example, in LLM infrastructure, is tying job metrics together for both training and inferencing, and correlating those job metrics with GPU utilization and server NIC flaps. For example, if a particular job is taking a long time on a GPU, is the GPU underutilized? If yes, is that because of NIC flaps, or is there congestion in the network that is resulting in long-tail latency between two specific GPUs? We can bring all of that together with our GenAI Ethernet cluster solution. These examples I'm giving are production examples, done at scale and baked at scale.

We also have synthetics as part of our solution, where we can deploy our agent to generate probes to other agents across servers.
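As an illustration of topology-aware correlation, here is a minimal sketch, assuming a hand-built topology graph and a simple "group anomalies on adjacent devices within a time window" heuristic built on networkx. The device names, events, and grouping rule are assumptions for the example, not the correlation engine described above.

```python
# Minimal sketch: topology-aware correlation of per-device anomalies into
# incidents. Illustrative only; topology would in practice be discovered,
# not hand-coded, and the grouping heuristic here is deliberately simple.
import networkx as nx

# Assumed discovered topology (leaf/spine fabric plus two servers).
topo = nx.Graph()
topo.add_edges_from([
    ("leaf1", "spine1"), ("leaf1", "spine2"),
    ("leaf2", "spine1"), ("leaf2", "spine2"),
    ("leaf1", "server-a"), ("leaf2", "server-b"),
])

# Anomalies raised independently by per-metric detectors: (device, kind, t_sec).
anomalies = [
    ("spine1", "optical_rx_power_drop", 100),
    ("leaf1", "crc_errors", 102),
    ("server-a", "tcp_retransmits", 105),
    ("leaf2", "config_change", 900),   # unrelated, far away in time
]

WINDOW = 60  # seconds within which anomalies may share a root cause

def correlate(anomalies, topo, window=WINDOW):
    """Group anomalies whose devices are the same or adjacent in the topology
    and whose timestamps fall within `window` seconds of each other."""
    g = nx.Graph()
    g.add_nodes_from(range(len(anomalies)))
    for i, (dev_i, _, t_i) in enumerate(anomalies):
        for j, (dev_j, _, t_j) in enumerate(anomalies):
            if i < j and abs(t_i - t_j) <= window:
                if dev_i == dev_j or topo.has_edge(dev_i, dev_j):
                    g.add_edge(i, j)
    return [sorted(c) for c in nx.connected_components(g)]

for n, incident in enumerate(correlate(anomalies, topo), 1):
    print(f"incident {n}:")
    for idx in incident:
        print("   ", anomalies[idx])
```

Running it groups the optical, CRC, and retransmit anomalies into a single incident and leaves the unrelated config change as its own, which is the noise-reduction effect described above.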
We have had customers where we were able to do mean time to innocence across the VM infrastructure and the switches. For example, in one large environment we were able to pinpoint that a specific hypervisor upgrade on one vendor was what resulted in the application latency, and we avoided finger-pointing between the switch team and the server team. That was done by using a combination of our synthetic probes along with network telemetry.

So I'm here to tell you that network AIOps is real, it is in production, and it has been done at scale. Customers like Orange have come out publicly talking about the significant ROI they are seeing.

I'm also here to tell you that LLMs are a very small piece of a very large puzzle. The algorithms we have created are unsupervised. On logs, we apply natural language processing to find rare logs: for example, you might have a memory corruption log that shows up once in 100 million logs, and we find that using NLP and unsupervised machine learning, not using LLMs. We do similar things on optical data and TCP retransmit data, where we learn from the pattern of the metrics. LLMs have a role to play: they can be very useful for taking in data in the public domain and using it to recommend remediation, and they can be useful for human-language query interfaces. Those are things we support and capabilities we are working on, but just think of them as one piece of a very large puzzle of technologies and capabilities you need, which we have been working on long before people talked about LLMs.

I will leave you with one final thought: AIOps in the network is real. According to Gartner, 10% of enterprises were using it in 2023, and 90% are projected to use it over the next few years. We are seeing that significant ramp in the interest and in the very late-stage conversations we are having with a lot of enterprises and providers. So if you are wondering whether it is mature, it is mature, and we would love to talk to you about how we can help you transform your operations using network AIOps and then apply it more broadly to your infrastructure, going all the way to the servers and the application. Thank you.
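As a footnote to the rare-log example above, here is a minimal sketch of unsupervised rare-log detection, assuming a simple mask-and-count approach over synthetic log lines. The log formats, masking rules, and rarity threshold are illustrative assumptions, not the production NLP pipeline described in the talk.

```python
# Minimal sketch: surfacing rare log templates with simple, unsupervised
# text processing (mask variable tokens, count template frequencies,
# flag templates below a rarity threshold). Illustrative only.
import re
from collections import Counter

def template(line: str) -> str:
    """Mask hex values and numbers so variants of the same message
    collapse onto one template."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    line = re.sub(r"\d+", "<N>", line)
    return line.strip()

# Synthetic log stream: thousands of routine messages plus one needle.
logs = (
    [f"%LINK-3-UPDOWN: Interface Ethernet1/{i}, changed state to up"
     for i in range(1, 5000)]
    + [f"%BGP-5-ADJCHANGE: neighbor 10.0.0.{i} Up" for i in range(1, 200)]
    + ["%SYS-2-MALLOCFAIL: Memory corruption detected at 0x3fa2b0"]  # the needle
)

counts = Counter(template(line) for line in logs)
total = sum(counts.values())

RARITY = 1e-3  # flag templates seen in fewer than 0.1% of all logs
for tmpl, n in counts.items():
    if n / total < RARITY:
        print(f"rare template ({n}/{total}): {tmpl}")
```

On this toy stream, only the memory-corruption line is surfaced, while the high-volume link and BGP messages collapse into common templates and stay quiet.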