Machine Learning for Cybersecurity - Lecture Notes
Course Introduction
- Professor: Ricardo Calix, Purdue University Northwest
- Focus: Application of machine learning in cybersecurity.
- Topics Covered:
  - Fundamental machine learning concepts (deep learning, TensorFlow).
  - Applications: malware detection, intrusion detection, IoT detection, phishing.
Course Resources
- Website: Course site, linked from the YouTube videos.
- GitHub: Code available for labs.
- Videos: YouTube channel for lecture content.
- Tools:
  - scikit-learn (sklearn)
  - Python
  - TensorFlow
  - Weka
- Course Calendar: Recommended sequence for students/instructors.
- Virtual Machine: A link for downloading the lab VM will be added later.
Learning Outcomes
Upon completion, students should understand:
- Definition and purpose of machine learning.
- Key concepts: features, datasets, and machine learning’s role in cybersecurity.
- Difference between machine learning and deep learning.
- Importance of big data in machine learning applications.
Key Terms
- Machine Learning: Subset of AI focused on learning from data.
- Deep Learning: Subset of machine learning that uses neural networks with multiple layers.
- Big Data: Large datasets requiring advanced processing techniques.
Machine Learning and Big Data
- Evolution of Algorithms: Need for new techniques as data sizes grew.
- Data Sources in Cybersecurity:
  - Network packets.
  - Log files.
  - Files related to cybersecurity issues.
- Vector Space Model: Common format for converting various data sources for machine learning.
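As a minimal sketch of the vector space model, the example below turns a few made-up log lines into fixed-length count vectors with scikit-learn's `CountVectorizer`. The log lines, the whitespace tokenization, and the bag-of-words encoding are assumptions chosen for illustration; the course labs may build their feature vectors differently.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical log lines standing in for real cybersecurity data.
log_lines = [
    "failed login from 10.0.0.5",
    "accepted login from 10.0.0.7",
    "failed login from 10.0.0.5",
]

# Split on whitespace so that IP addresses stay intact as single tokens.
vectorizer = CountVectorizer(token_pattern=r"\S+")

# Each log line becomes one sample (row); each distinct token becomes one feature (column).
X = vectorizer.fit_transform(log_lines)
feature_names = sorted(vectorizer.vocabulary_)

print(feature_names)
print(X.toarray())
```

Every row has the same length regardless of the original data source, which is what lets one learning algorithm consume packets, logs, or files once they are encoded this way.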
Tools and Approaches
- Weka: GUI-based tool for learning and prototyping.
- Python: Main programming language for machine learning; uses libraries like NumPy and Pandas for data preprocessing.
- TensorFlow: Framework for deep learning applications.
- Hadoop and Spark: Tools for handling big data (not covered in detail).
Machine Learning Definitions
- Machine Learning Purpose: Predicting, detecting, or grouping data samples based on learned models.
- Models: Can be geometric, i.e., based on distance metrics and linear decision boundaries.
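To make the geometric view concrete, here is a small sketch with NumPy. The two sample points, the weight vector `w`, and the offset `bias` are arbitrary values picked for illustration, not values from the course.

```python
import numpy as np

# Two hypothetical samples in a 2-feature vector space.
a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Distance metric: Euclidean distance between the two samples.
distance = np.linalg.norm(a - b)  # sqrt(3^2 + 4^2) = 5.0

# Linear boundary: the line w . x + bias = 0 splits the space into two sides.
w = np.array([1.0, 1.0])  # assumed weights
bias = -5.0               # assumed offset


def side(x):
    """Return 1 if x lies on the positive side of the boundary, else 0."""
    return int(np.dot(w, x) + bias > 0)


print(distance)          # 5.0
print(side(a), side(b))  # 0 1
```

Distance-based models such as KNN rely on the first idea; linear classifiers such as logistic regression rely on the second.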
Machine Learning in Cybersecurity
- Need for Automation: Manual analysis of cybersecurity data is labor-intensive; machine learning can automate pattern detection.
- Big Data Challenges: Analyzing vast datasets requires powerful computers or multiple computers working in parallel.
Types of Learning
- Supervised Learning:
  - Involves labeled data (e.g., network traffic classified as good or bad).
  - Separate training and testing data are used to build and evaluate models.
- Unsupervised Learning:
  - Does not involve labels; focuses on finding correlations and grouping data points.
  - Techniques include clustering.
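The contrast between the two types of learning can be sketched on synthetic data. Everything below is an illustrative assumption: the two Gaussian blobs stand in for "good" and "bad" traffic features, and `LogisticRegression` and `KMeans` are simply convenient scikit-learn choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for traffic features: two well-separated groups.
good = rng.normal(loc=0.0, scale=0.5, size=(100, 2))
bad = rng.normal(loc=5.0, scale=0.5, size=(100, 2))
X = np.vstack([good, bad])
y = np.array([0] * 100 + [1] * 100)  # labels: 0 = good, 1 = bad

# Supervised: labels are used for training; held-out data is used for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
clf = LogisticRegression().fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)

# Unsupervised: labels are ignored; KMeans only groups nearby points.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = kmeans.labels_

print(test_accuracy)
print(np.bincount(cluster_ids))
```

Note that the cluster IDs carry no meaning by themselves: KMeans may call the "bad" group cluster 0 or cluster 1, and an analyst still has to interpret each group.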
Machine Learning Pipeline
- Data Collection
- Data Preprocessing
  - Cleaning and formatting the data.
- Vector Space Model Creation
- Model Building
  - Using supervised learning to make predictions.
- Performance Evaluation
  - Comparing predicted outcomes with actual outcomes.
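The pipeline steps above can be sketched end to end with scikit-learn. This is an illustration under stated assumptions: the data comes from `make_classification` rather than real collection, and the decision tree and its `max_depth` are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# 1. Data collection: synthetic samples stand in for collected data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 2. Data preprocessing: drop rows with missing values
#    (none exist in this synthetic data; shown only to mark the step).
mask = ~np.isnan(X).any(axis=1)
X, y = X[mask], y[mask]

# 3. Vector space model: X is already a samples-by-features matrix.
#    Split it, then scale features using statistics from the training set only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# 4. Model building: supervised learning on the labeled training data.
model = DecisionTreeClassifier(max_depth=5, random_state=0)
model.fit(X_train_s, y_train)

# 5. Performance evaluation: compare predicted labels with actual labels.
y_pred = model.predict(X_test_s)
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print(acc)
print(cm)  # rows = actual class, columns = predicted class
```

The diagonal of the confusion matrix counts correct predictions, so accuracy is the diagonal sum divided by the total number of test samples.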
Common Machine Learning Algorithms
- Naive Bayes
- Decision Trees
- Random Forests
- K-Nearest Neighbors (KNN)
- Linear Regression
- Logistic Regression
- Neural Networks
- Support Vector Machines
- Deep Neural Networks
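Most of the algorithms listed above have scikit-learn implementations that share the same `fit` / `score` interface, so they can be compared in one loop. The sketch below uses synthetic data and mostly default hyperparameters, both of which are assumptions for illustration. Linear regression is left out because it predicts continuous values rather than classes, and a small `MLPClassifier` stands in for neural networks.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (a stand-in for a real dataset).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Neural Network": MLPClassifier(
        hidden_layer_sizes=(16,), max_iter=1000, random_state=0
    ),
}

# Train each model on the same split and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Accuracy on a toy dataset says little about which algorithm suits a real cybersecurity problem; the point is only that the same train-and-evaluate loop applies to all of them.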
Deep Learning Overview
- Definition: Neural networks with multiple layers (deep neural networks).
- Importance: Essential for modern machine learning applications, particularly with big data.
- Batch Processing: Required for handling large datasets effectively.
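Batch processing can be illustrated without a deep learning framework. The sketch below feeds a dataset to a model in fixed-size mini-batches through scikit-learn's `partial_fit`. A linear `SGDClassifier` is used here only as a lightweight stand-in; deep networks in TensorFlow are trained with the same mini-batch idea. The synthetic data, the helper `iter_batches`, and the batch size of 100 are assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

# Synthetic data: two separable groups of 500 samples, 4 features each.
X = np.vstack([
    rng.normal(-1.5, 0.5, size=(500, 4)),
    rng.normal(1.5, 0.5, size=(500, 4)),
])
y = np.array([0] * 500 + [1] * 500)

# Shuffle so that every batch contains a mix of both classes.
order = rng.permutation(len(X))
X, y = X[order], y[order]


def iter_batches(X, y, batch_size):
    """Yield consecutive (features, labels) slices of at most batch_size rows."""
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]


model = SGDClassifier(random_state=0)
n_batches = 0
for X_batch, y_batch in iter_batches(X, y, batch_size=100):
    # Only one batch needs to be in memory at a time.
    model.partial_fit(X_batch, y_batch, classes=np.array([0, 1]))
    n_batches += 1

train_accuracy = model.score(X, y)
print(n_batches, train_accuracy)
```

With real big data, each batch would be read from disk or a stream instead of sliced from an in-memory array.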
Data Formats
- Common formats: CSV, LibSVM, ARFF (Weka-specific).
- Sample Definition: Critical to define what constitutes a sample for analysis.
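To show how the same samples look in two of these formats, the sketch below writes a tiny made-up dataset as CSV and as LibSVM text, then reads the LibSVM version back. The data values and the column names `f1`, `f2`, `f3`, and `label` are hypothetical. ARFF is not shown here; in broad terms it resembles CSV with an added header that declares each attribute.

```python
import io

import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

# Two hypothetical samples with three features each, plus a label per sample.
X = np.array([[0.0, 1.5, 0.0],
              [2.0, 0.0, 3.0]])
y = np.array([0, 1])

# CSV: one row per sample, every feature written out, label in the last column.
csv_buffer = io.StringIO()
np.savetxt(
    csv_buffer,
    np.column_stack([X, y]),
    delimiter=",",
    fmt="%g",
    header="f1,f2,f3,label",
    comments="",
)
csv_text = csv_buffer.getvalue()

# LibSVM: "label index:value ..." per sample, with zero-valued features omitted.
svm_buffer = io.BytesIO()
dump_svmlight_file(X, y, svm_buffer, zero_based=False)
svm_text = svm_buffer.getvalue().decode()

# Round trip: load the LibSVM text back into a (sparse) feature matrix.
svm_buffer.seek(0)
X_loaded, y_loaded = load_svmlight_file(svm_buffer, n_features=3, zero_based=False)

print(csv_text)
print(svm_text)
print(X_loaded.toarray())
```

Because LibSVM stores only non-zero entries, it tends to be more compact than CSV for sparse feature vectors.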
Industry Applications
- Companies Using Machine Learning:
  - Tesla, Apple, Amazon, Google, Facebook.
  - Northrop Grumman (defense systems), BluVector (machine learning intrusion detection).
- Confidentiality Issues: Challenges in obtaining datasets due to privacy concerns.
Course Environment
- Utilization of virtual machines for labs.
- Recommended tools: TensorFlow, sklearn, and Weka.
- GPU Requirement: A GPU is needed for the big data labs; renting AWS instances is an option.
Summary
- Course focus on machine learning and deep learning for cybersecurity problems.
- Emphasis on data preprocessing, understanding datasets, and machine learning pipelines.
- Encouragement to engage with lab sessions and practical applications.