AI and Machine Learning in Network Operations

May 9, 2025

Lecture Notes: AI and Machine Learning in Network Operations

Introduction

  • Application of AI and machine learning to prevent IT infrastructure failures.
  • Focus on eliminating noise and reducing remediation time.
  • OCTA's role in transforming network operations.

Overview of OCTA

  • Founded by the speaker, who is the CEO.
  • Applies AI/ML to operational data in networks, servers, and infrastructure.
  • Operates in data centers, service provider networks, and large-scale LLM infrastructures.

Challenges in Network Operations

  • Historically, network operations have been noisy and reactive.
  • Vast amounts of operational data are spread across many layers (cloud, data center, 5G, etc.).
  • Operators have been using siloed tools for data mining.

OCTA's AI and ML Solutions

  • Uses real-time, largely unsupervised algorithms to extract insights from operational data.
  • Detects anomalies, predicts issues, and correlates events to reduce noise.
  • Reported to cut ticket volume by 70-90% and detection time from 47 minutes to 1 minute.
  • Platform is software-only, scalable, and can be deployed on-premise or as a SaaS.
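
The notes do not describe OCTA's actual algorithms. As a rough illustration of what a real-time, unsupervised detector can look like, the sketch below keeps an exponentially weighted running mean and variance for one metric and flags samples that deviate strongly from that baseline. The class name, smoothing factor, threshold, and warm-up length are all assumptions chosen for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class EwmaDetector:
    """Streaming anomaly detector for a single metric (illustrative only)."""

    alpha: float = 0.1      # smoothing factor (assumed value)
    threshold: float = 4.0  # flag samples more than 4 std devs from the mean (assumed)
    warmup: int = 10        # number of samples to observe before flagging anything
    mean: float = 0.0
    var: float = 0.0
    count: int = 0

    def update(self, x: float) -> bool:
        """Consume one sample; return True if it looks anomalous."""
        self.count += 1
        if self.count == 1:
            self.mean = x  # the first sample seeds the baseline
            return False

        diff = x - self.mean
        std = math.sqrt(self.var)
        is_anomaly = (
            self.count > self.warmup
            and std > 0
            and abs(diff) > self.threshold * std
        )
        if not is_anomaly:
            # Learn only from normal samples so an incident does not
            # get absorbed into the baseline.
            self.mean += self.alpha * diff
            self.var = (1 - self.alpha) * (self.var + self.alpha * diff * diff)
        return is_anomaly


# Hypothetical per-minute error counts with one obvious spike.
samples = [10, 11, 10, 9, 10, 11, 10, 9, 10, 11, 10, 9, 50, 10]
detector = EwmaDetector()
flags = [detector.update(v) for v in samples]
```

No labels or training set are needed: the baseline is learned from the stream itself, which is the property the "largely unsupervised" bullet points at.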

Integration and Data Collection

  • Collects data from infrastructure components (switches, routers, servers, etc.).
  • Integrates with existing data stores and monitoring platforms such as Prometheus and Splunk.
  • Designed for big data and streaming telemetry.
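
To make the Prometheus integration point concrete, here is a minimal sketch of turning a Prometheus range-query response into per-series time series that an analytics layer could consume. The JSON shape follows the documented response of Prometheus' `/api/v1/query_range` endpoint; the metric name and labels in the sample payload are hypothetical, and nothing here reflects how OCTA itself ingests data.

```python
import json


def parse_matrix(payload: str) -> dict:
    """Convert a Prometheus range-query response into {labels: [(ts, value), ...]}."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")

    data = body["data"]
    if data["resultType"] != "matrix":
        raise ValueError(f"expected a matrix result, got {data['resultType']}")

    series = {}
    for item in data["result"]:
        # A sorted tuple of label pairs is a hashable, order-independent key.
        key = tuple(sorted(item["metric"].items()))
        # Prometheus encodes sample values as strings; convert them to floats.
        series[key] = [(float(ts), float(value)) for ts, value in item["values"]]
    return series


# Hypothetical payload: metric name and labels are made up for illustration.
sample_payload = json.dumps({
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [
            {
                "metric": {"__name__": "interface_rx_errors", "device": "leaf1"},
                "values": [[1746777600, "0"], [1746777660, "3"]],
            }
        ],
    },
})

parsed = parse_matrix(sample_payload)
```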

Unique Algorithms and Approach

  • Built custom algorithms after finding open-source solutions inadequate.
  • Focuses on misbehavior detection across the TCP, optical, and HTTP layers.
  • Automates actions such as ticket generation and issue remediation.
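
One simple way to picture noise reduction ahead of automated ticketing is to fold bursts of related alerts into a single ticket. The sketch below groups alerts from the same device that arrive within a time window. The `Alert` and `Ticket` types, the per-device grouping rule, and the 300-second window are assumptions made for the example, not OCTA's correlation logic.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    ts: float      # seconds since some reference time
    device: str
    message: str


@dataclass
class Ticket:
    device: str
    first_seen: float
    last_seen: float
    alerts: list = field(default_factory=list)


def correlate(alerts, window=300.0):
    """Group alerts per device; a gap longer than `window` opens a new ticket."""
    latest = {}   # device -> most recent ticket for that device
    tickets = []
    for alert in sorted(alerts, key=lambda a: a.ts):
        ticket = latest.get(alert.device)
        if ticket is None or alert.ts - ticket.last_seen > window:
            ticket = Ticket(alert.device, alert.ts, alert.ts)
            latest[alert.device] = ticket
            tickets.append(ticket)
        ticket.alerts.append(alert)
        ticket.last_seen = alert.ts
    return tickets


# Hypothetical alert stream: five alerts collapse into three tickets.
alerts = [
    Alert(0, "router1", "BGP session flap"),
    Alert(60, "router1", "interface errors rising"),
    Alert(120, "router1", "BGP session flap"),
    Alert(30, "switch7", "fan speed warning"),
    Alert(1000, "router1", "BGP session flap"),
]
tickets = correlate(alerts)
```

A production correlator would also use topology and cross-layer relationships; grouping by device and time is only the simplest possible stand-in.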

Use Cases and Applications

  • Observability and AI Ops in hybrid cloud, data center, 5G environments, etc.
  • Use cases include:
    • TCP retransmissions and congestion correlation.
    • Optical misbehavior detection ahead of failures.
    • Post-change verification for BGP and other changes.
    • LLM infrastructure job metrics analysis.
    • Synthetic probing for quick issue identification.
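
As an illustration of the first use case, the sketch below computes the Pearson correlation between a TCP retransmission series and a queue-depth series sampled at the same intervals. A value near 1 suggests the retransmissions rise and fall with congestion. The sample numbers are invented, and plain Pearson correlation is a stand-in chosen for clarity, not a description of OCTA's method.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / (spread_x * spread_y)


# Hypothetical per-interval samples from one link.
retransmits = [2, 3, 2, 40, 55, 3]       # TCP retransmissions per interval
queue_depth = [10, 12, 11, 90, 95, 12]   # average egress queue depth (packets)

r = pearson(retransmits, queue_depth)
```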

AI Ops and the Role of LLMs

  • AI Ops is a reality, already implemented at large scales.
  • LLMs play a supporting role but are not central to OCTA’s strategy.
  • Unsupervised ML is used for log analysis and anomaly detection.
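
A very small example of unsupervised log analysis: mask the variable parts of each log line to obtain a template, count how often each template occurs, and surface the rare ones. The masking rules, the rarity cutoff, and the sample log lines are assumptions made for illustration; real log-analysis systems use considerably more robust template mining.

```python
import re
from collections import Counter

# Order matters: mask IP addresses before generic numbers.
_MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def to_template(line: str) -> str:
    """Replace variable fields (IPs, hex values, numbers) with placeholders."""
    for pattern, token in _MASKS:
        line = pattern.sub(token, line)
    return line


def rare_templates(lines, max_count=1):
    """Return templates seen at most `max_count` times (no labels required)."""
    counts = Counter(to_template(line) for line in lines)
    return [template for template, count in counts.items() if count <= max_count]


# Hypothetical log lines.
logs = [
    "BGP neighbor 10.0.0.1 state Established",
    "BGP neighbor 10.0.0.2 state Established",
    "BGP neighbor 10.0.0.3 state Established",
    "Optical rx power -14 dBm below threshold on port 7",
]
unusual = rare_templates(logs)
```

The three BGP lines collapse into one common template, so only the optical warning is surfaced as unusual.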

Market Impact and Future

  • Adoption of and interest in AI Ops are growing, and users report significant ROI.
  • Gartner projects a major increase in enterprise adoption by 2030.
  • OCTA aims to help enterprises mature their network operations with AI Ops.

Conclusion

  • AI Ops in the network is mature and beneficial.
  • OCTA offers a comprehensive approach to transform network operations.
  • Encouragement to engage with OCTA for operational transformation.