Machine Learning Applications in Cybersecurity

Sep 22, 2024

Machine Learning for Cybersecurity - Lecture Notes

Course Introduction

  • Professor: Ricardo Calix, Purdue University Northwest
  • Focus: Application of machine learning in cybersecurity.
  • Topics Covered:
    • Fundamental machine learning concepts (deep learning, TensorFlow).
    • Applications: malware detection, intrusion detection, IoT detection, phishing.

Course Resources

  • Website: Course site, linked from the YouTube videos.
  • GitHub: Code available for labs.
  • Videos: YouTube channel for lecture content.
  • Tools:
    • scikit-learn (sklearn)
    • Python
    • TensorFlow
    • Weka
  • Course Calendar: Recommended sequence for students/instructors.
  • Virtual Machine: A download link for the lab VM will be provided later.

Learning Outcomes

Upon completion, students should understand:

  • Definition and purpose of machine learning.
  • Key concepts: features, datasets, and machine learning’s role in cybersecurity.
  • Difference between machine learning and deep learning.
  • Importance of big data in machine learning applications.

Key Terms

  • Machine Learning: Subset of AI focused on learning from data.
  • Deep Learning: Subset of machine learning that uses neural networks with multiple layers.
  • Big Data: Large datasets requiring advanced processing techniques.

Machine Learning and Big Data

  • Evolution of Algorithms: Need for new techniques as data sizes grew.
  • Data Sources in Cybersecurity:
    • Network packets.
    • Log files.
    • Files related to cybersecurity issues.
  • Vector Space Model: Common format for converting various data sources for machine learning.
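
As a minimal sketch of the vector space model, the snippet below turns a few log lines into fixed-length count vectors (a bag-of-words representation). The log lines, the whitespace tokenizer, and the names `tokenize` and `to_vector` are illustrative assumptions, not the course's lab code.

```python
from collections import Counter

# Hypothetical log lines; real data would come from log files or packet captures.
logs = [
    "failed login from 10.0.0.5",
    "accepted login from 10.0.0.7",
    "failed login failed password from 10.0.0.5",
]

def tokenize(line):
    """Split a line into lowercase whitespace-separated tokens (a simplifying assumption)."""
    return line.lower().split()

# Vocabulary: one vector dimension per distinct token, in sorted order.
vocab = sorted({tok for line in logs for tok in tokenize(line)})

def to_vector(line):
    """Map a line to a list of token counts aligned with vocab."""
    counts = Counter(tokenize(line))
    return [counts[tok] for tok in vocab]

vectors = [to_vector(line) for line in logs]

for line, vec in zip(logs, vectors):
    print(vec, "<-", line)
```

Every sample ends up with the same number of dimensions regardless of its original length, which is what lets different data sources feed the same learning algorithm.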

Tools and Approaches

  1. Weka: GUI-based tool for learning and prototyping.
  2. Python: Main programming language for machine learning; uses libraries like NumPy and Pandas for data preprocessing.
  3. TensorFlow: Framework for deep learning applications.
  4. Hadoop and Spark: Tools for handling big data (not covered in detail).

Machine Learning Definitions

  • Machine Learning Purpose: Predicting, detecting, or grouping data samples based on learned models.
  • Models: Can be geometrical, based on distance metrics and linear boundaries.
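
To make the geometric view concrete, here is a small sketch with two ideas: a distance metric (Euclidean distance to a class center) and a linear boundary (the sign of w·x + b). The feature meanings, the class centers, and the boundary weights are made-up values for illustration only.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def linear_side(w, b, x):
    """Return 1 if x lies on the positive side of the boundary w.x + b = 0, else 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else 0

# Hypothetical 2-D feature vectors, e.g. (scaled packet rate, scaled failed logins).
benign_center = [1.0, 1.0]
attack_center = [4.0, 5.0]
sample = [3.0, 4.0]

# Distance-based decision: pick the class whose center is closer.
d_benign = euclidean(sample, benign_center)
d_attack = euclidean(sample, attack_center)
nearest = "attack" if d_attack < d_benign else "benign"

# Linear-boundary decision: illustrative line x1 + x2 - 6 = 0.
side = linear_side([1.0, 1.0], -6.0, sample)

print(nearest, side)
```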

Machine Learning in Cybersecurity

  • Need for Automation: Manual analysis of cybersecurity data is labor-intensive; machine learning can automate pattern detection.
  • Big Data Challenges: Requires powerful computers or multiple computers working in parallel to analyze vast datasets.

Types of Learning

  1. Supervised Learning:
    • Involves labeled data (e.g., network traffic classified as good or bad).
    • Training data is used to build the model; separate testing data is used to evaluate it.
  2. Unsupervised Learning:
    • Does not involve labels; focuses on finding correlations and grouping data points.
    • Techniques include clustering.
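
The clustering idea can be sketched with a bare-bones k-means loop, shown here in plain Python so the two steps (assign, then update) are visible. The data points are invented, and seeding the centroids with the first k points is a simplifying assumption made to keep the run deterministic; library implementations use more careful initialization.

```python
def kmeans(points, k, iterations=10):
    """Group points into k clusters; returns (assignments, centroids)."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid (squared distance).
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            assignments[i] = dists.index(min(dists))
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids

# Hypothetical unlabeled 2-D samples forming two visible groups.
points = [(1.0, 1.0), (9.0, 8.0), (1.5, 0.5), (8.0, 9.0), (0.5, 1.5), (9.5, 9.5)]
assignments, centroids = kmeans(points, k=2)
print(assignments)
print(centroids)
```

No labels are used anywhere: the groups emerge from distances alone, which is the difference from the supervised case.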

Machine Learning Pipeline

  1. Data Collection
  2. Data Preprocessing
    • Cleaning and formatting the data.
  3. Vector Space Model Creation
  4. Model Building
    • Using supervised learning to make predictions.
  5. Performance Evaluation
    • Comparing predicted outcomes with actual outcomes.
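
The evaluation step can be sketched as follows: count how predictions line up with the actual labels, then derive accuracy, precision, and recall. The label lists are invented for illustration, with 1 standing for malicious and 0 for benign; the function name `confusion_counts` is likewise an assumption.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (tp, fp, tn, fn) for a binary classification task."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive and a != positive:
            fp += 1
        elif p != positive and a != positive:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

# Hypothetical test-set labels and model predictions.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, tn, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(tp, fp, tn, fn)
print(accuracy, precision, recall)
```

In security data the malicious class is often rare, so precision and recall are usually more informative than accuracy alone.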

Common Machine Learning Algorithms

  • Naive Bayes
  • Decision Trees
  • Random Forests
  • K-Nearest Neighbors (KNN)
  • Linear Regression
  • Logistic Regression
  • Neural Networks
  • Support Vector Machines
  • Deep Neural Networks

Deep Learning Overview

  • Definition: Neural networks with multiple layers (deep neural networks).
  • Importance: Essential for modern machine learning applications, particularly with big data.
  • Batch Processing: Large datasets are processed in smaller batches, since they often cannot be handled in a single pass through memory.
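
A minimal sketch of batch processing: a generator that pulls fixed-size batches from any iterable, so the full dataset never has to sit in memory at once. The name `batches` and the stand-in data are assumptions; in practice the stream would be a file reader or a framework's data loader.

```python
from itertools import islice

def batches(stream, batch_size):
    """Yield successive lists of up to batch_size items from any iterable."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Stand-in for a large stream of samples (for example, lines read lazily from disk).
data = range(10)
sizes = [len(b) for b in batches(data, 4)]
print(sizes)
```

The last batch may be smaller than the others, which training loops need to tolerate.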

Data Formats

  • Common formats: CSV, LibSVM, ARFF (Weka-specific).
  • Sample Definition: Critical to define what constitutes a sample for analysis.
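
To show how the same samples look in each format, the sketch below serializes two invented samples as CSV, LibSVM, and ARFF text. The feature names, values, and the relation name "traffic" are hypothetical. The LibSVM writer assumes the usual sparse convention of 1-based `index:value` pairs with zero values omitted.

```python
# Hypothetical feature names and samples: (feature values, label).
features = ["duration", "bytes_sent", "failed_logins"]
samples = [([0.5, 1200.0, 0.0], 0), ([3.2, 0.0, 7.0], 1)]

def to_csv(samples, features):
    """Header row, then one comma-separated row per sample with the label last."""
    lines = [",".join(features + ["label"])]
    for values, label in samples:
        lines.append(",".join(str(v) for v in values) + "," + str(label))
    return "\n".join(lines)

def to_libsvm(samples):
    """One line per sample: '<label> <index>:<value> ...', skipping zero values."""
    lines = []
    for values, label in samples:
        pairs = [f"{i}:{v}" for i, v in enumerate(values, start=1) if v != 0]
        lines.append(" ".join([str(label)] + pairs))
    return "\n".join(lines)

def to_arff(samples, features, relation="traffic"):
    """Weka ARFF: attribute declarations followed by an @data section."""
    lines = [f"@relation {relation}"]
    lines += [f"@attribute {name} numeric" for name in features]
    lines.append("@attribute label {0,1}")
    lines.append("@data")
    for values, label in samples:
        lines.append(",".join(str(v) for v in values) + "," + str(label))
    return "\n".join(lines)

print(to_csv(samples, features))
print(to_libsvm(samples))
print(to_arff(samples, features))
```

In all three formats one line corresponds to one sample, which is why deciding what counts as a sample (a packet, a flow, a log line, a file) has to come first.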

Industry Applications

  • Companies Using Machine Learning:
    • Tesla, Apple, Amazon, Google, Facebook.
    • Northrop Grumman (defense systems), BluVector (machine learning intrusion detection).
  • Confidentiality Issues: Challenges in obtaining datasets due to privacy concerns.

Course Environment

  • Utilization of virtual machines for labs.
  • Recommended tools: TensorFlow, sklearn, and Weka.
  • GPU Requirement: Needed for the big data labs; AWS instances can be rented as an alternative.

Summary

  • Course focus on machine learning and deep learning for cybersecurity problems.
  • Emphasis on data preprocessing, understanding datasets, and machine learning pipelines.
  • Encouragement to engage with lab sessions and practical applications.