Enhancing Multi-Layer Perceptron Performance
Sep 25, 2024
Lecture Notes: Multi-Layer Perceptron Character-Level Language Model
Introduction
Location: Kyoto (Background change)
Focus: Continuing the implementation of a multi-layer perceptron character-level language model.
Previous Recap
Previously built an architecture that predicts the fourth character from the three previous characters using a single hidden layer (10 neurons).
Aim is to extend the architecture to take in more characters and to make the model deeper.
Proposed Architecture
Move from using a single hidden layer to using multiple layers to progressively fuse information.
Inspiration from WaveNet (DeepMind, 2016): an autoregressive model that predicts audio sequences.
Key Changes in Architecture
Increase input sequence length (from 3 to 8 characters).
Introduce deeper neural network architecture for better performance.
Data Preparation
About 182,000 training examples from the word dataset.
Each example: the previous context characters predict the next character (3 at first, later extended to 8); see the sketch below.
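A minimal sketch of the dataset construction described above, in the style of the lecture's makemore code. The small words list, the stoi/itos mappings, and the build_dataset name are illustrative assumptions, not necessarily the exact identifiers used in the lecture.

```python
import torch

# Tiny illustrative word list; the lecture uses a much larger names dataset.
words = ["emma", "olivia", "ava"]

# Character vocabulary with '.' as the start/end token.
chars = sorted(set("".join(words)))
stoi = {ch: i + 1 for i, ch in enumerate(chars)}
stoi["."] = 0
itos = {i: ch for ch, i in stoi.items()}

block_size = 8  # context length: previously 3, extended to 8 here

def build_dataset(words):
    X, Y = [], []
    for w in words:
        context = [0] * block_size           # start with padding '.' tokens
        for ch in w + ".":
            ix = stoi[ch]
            X.append(context)                # the previous block_size characters
            Y.append(ix)                     # the character to predict
            context = context[1:] + [ix]     # slide the window forward
    return torch.tensor(X), torch.tensor(Y)

Xtr, Ytr = build_dataset(words)
print(Xtr.shape, Ytr.shape)  # (num_examples, block_size), (num_examples,)
```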
Layer Modules
Developed layer modules as building blocks (e.g., a Linear class).
Mimicking PyTorch's torch.nn API for compatibility.
Implemented layers include Linear, Batch Normalization, and Tanh; minimal sketches follow below.
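Minimal sketches of the Linear and Tanh modules in the torch.nn-mimicking style the notes describe; initialization details here are assumptions.

```python
import torch

class Linear:
    """Fully connected layer mimicking the interface of torch.nn.Linear."""

    def __init__(self, fan_in, fan_out, bias=True):
        # scale weights by 1/sqrt(fan_in) to keep activations well behaved
        self.weight = torch.randn((fan_in, fan_out)) / fan_in**0.5
        self.bias = torch.zeros(fan_out) if bias else None

    def __call__(self, x):
        self.out = x @ self.weight
        if self.bias is not None:
            self.out = self.out + self.bias
        return self.out

    def parameters(self):
        return [self.weight] + ([] if self.bias is None else [self.bias])


class Tanh:
    """Element-wise tanh activation exposed as a layer."""

    def __call__(self, x):
        self.out = torch.tanh(x)
        return self.out

    def parameters(self):
        return []
```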
Batch Normalization
Maintains running mean and variance estimates during training for use at evaluation time.
Behaviour differs between training and evaluation modes, so the mode must be tracked carefully (sketched below).
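A sketch of the batch-norm behaviour described above: batch statistics during training, running statistics at evaluation. The eps and momentum defaults are illustrative.

```python
import torch

class BatchNorm1d:
    """Batch normalization with distinct training and evaluation behaviour."""

    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.momentum = momentum
        self.training = True
        # learnable scale and shift
        self.gamma = torch.ones(dim)
        self.beta = torch.zeros(dim)
        # running statistics, updated during training and used at evaluation
        self.running_mean = torch.zeros(dim)
        self.running_var = torch.ones(dim)

    def __call__(self, x):
        if self.training:
            dims = tuple(range(x.ndim - 1))      # normalize over batch (and time) dims
            xmean = x.mean(dims, keepdim=True)
            xvar = x.var(dims, keepdim=True)
        else:
            xmean, xvar = self.running_mean, self.running_var
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps)
        self.out = self.gamma * xhat + self.beta
        if self.training:
            with torch.no_grad():
                self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * xmean
                self.running_var = (1 - self.momentum) * self.running_var + self.momentum * xvar
        return self.out

    def parameters(self):
        return [self.gamma, self.beta]
```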
Model Structure
Embedding table for character representation.
List of layers including Linear, Batch Normalization, and the Tanh activation (model assembly sketched below).
Initialized parameters for training.
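A sketch of how the model pieces fit together, reusing the Linear, BatchNorm1d, and Tanh modules sketched above; vocabulary size, embedding width, and hidden size are placeholder values.

```python
import torch

vocab_size = 27   # 26 letters + '.' token (placeholder)
n_embd = 10       # embedding dimension per character (placeholder)
n_hidden = 200    # hidden-layer width (placeholder)
block_size = 8    # context length

# embedding table mapping character indices to vectors
C = torch.randn((vocab_size, n_embd))

# reuses the Linear, BatchNorm1d, and Tanh sketches from above
layers = [
    Linear(n_embd * block_size, n_hidden), BatchNorm1d(n_hidden), Tanh(),
    Linear(n_hidden, vocab_size),
]

parameters = [C] + [p for layer in layers for p in layer.parameters()]
for p in parameters:
    p.requires_grad = True
print(sum(p.nelement() for p in parameters), "parameters")
```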
Evaluation Process
Batch norm layers must be switched into the correct training/evaluation mode to avoid subtle bugs (see the sketch after this section).
Achieved validation loss of 2.10, with sample outputs improving over iterations.
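One way to handle the mode switch the notes warn about: put every layer into evaluation mode before measuring loss, so batch norm uses its running statistics. The split_loss helper is illustrative and assumes the model pieces and data tensors sketched above.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def split_loss(X, Y):
    # switch layers to evaluation mode (matters for batch norm)
    for layer in layers:
        layer.training = False
    emb = C[X]                          # (N, block_size, n_embd)
    x = emb.view(emb.shape[0], -1)      # concatenate the context embeddings
    for layer in layers:
        x = layer(x)
    return F.cross_entropy(x, Y).item()

# Xdev / Ydev: validation split tensors, assumed to be built like Xtr / Ytr above
# print("val loss:", split_loss(Xdev, Ydev))
```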
Graph Simplification
Improved the readability of the training-loss graph and reduced complexity in the code.
Created modules for the embedding and flattening operations for clarity (sketched below).
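Minimal sketches of the Embedding and Flatten modules mentioned above, which turn the table lookup and the reshaping into explicit layers; details follow the style of the other module sketches and are assumptions.

```python
import torch

class Embedding:
    """Lookup table mapping character indices to embedding vectors."""

    def __init__(self, num_embeddings, embedding_dim):
        self.weight = torch.randn((num_embeddings, embedding_dim))

    def __call__(self, ix):
        self.out = self.weight[ix]
        return self.out

    def parameters(self):
        return [self.weight]


class Flatten:
    """Flattens the per-character embeddings into one vector per example."""

    def __call__(self, x):
        self.out = x.view(x.shape[0], -1)
        return self.out

    def parameters(self):
        return []
```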
PyTorch Containers
Introduced the Sequential container to manage layers more efficiently (sketched below).
Benefits of using containers include easier parameter management and code simplification.
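A Sequential container in the spirit of torch.nn.Sequential, sketched to match the hand-rolled modules above: it calls its layers in order and gathers their parameters in one place.

```python
class Sequential:
    """Container that applies a list of layers in order."""

    def __init__(self, layers):
        self.layers = layers

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        self.out = x
        return self.out

    def parameters(self):
        # collect the parameters of every child layer
        return [p for layer in self.layers for p in layer.parameters()]
```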
Forward Pass Streamlining
With the container in place, the forward pass reduces to a single call through the model.
Forwarding a batch through the model yields logits for loss computation and evaluation (see the sketch below).
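With the container, a forward pass and loss take only a couple of lines. This sketch assumes the modules, placeholder sizes, and Xtr/Ytr tensors from the earlier sketches.

```python
import torch
import torch.nn.functional as F

# model assembled from the sketched modules
model = Sequential([
    Embedding(vocab_size, n_embd),
    Flatten(),
    Linear(n_embd * block_size, n_hidden), BatchNorm1d(n_hidden), Tanh(),
    Linear(n_hidden, vocab_size),
])

# sample a mini-batch from the training tensors built earlier
ix = torch.randint(0, Xtr.shape[0], (32,))
Xb, Yb = Xtr[ix], Ytr[ix]

logits = model(Xb)                  # forward pass yields logits
loss = F.cross_entropy(logits, Yb)  # compare against the target characters
print(loss.item())
```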
Model Performance and Adjustments
Analyzed performance improvements from scaling up model architecture (block size increased from 3 to 8).
Initial results showed reduced validation loss when increasing context length.
Need for further tuning and optimization remained evident.
Hierarchical Fusion Proposed
Proposed a model that gradually fuses pairs of characters, building up a tree-like hierarchical structure (sketched below).
Noted that WaveNet implements this fusion efficiently with dilated causal convolution layers rather than plain dense layers.
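The gradual pairwise fusion can be sketched with a FlattenConsecutive module that concatenates n adjacent positions instead of flattening the whole context at once, giving the tree-like structure described above (dense layers here stand in for WaveNet's dilated causal convolutions). Sizes and names are illustrative, and the sketch reuses the modules defined earlier.

```python
import torch

class FlattenConsecutive:
    """Concatenates the vectors of n consecutive positions."""

    def __init__(self, n):
        self.n = n

    def __call__(self, x):
        B, T, C = x.shape
        x = x.view(B, T // self.n, C * self.n)  # fuse groups of n neighbours
        if x.shape[1] == 1:
            x = x.squeeze(1)                    # drop the time dim once fully fused
        self.out = x
        return self.out

    def parameters(self):
        return []


# Hierarchical model: fuse 2 characters at a time, so 8 -> 4 -> 2 -> 1 groups.
hierarchical = Sequential([
    Embedding(vocab_size, n_embd),
    FlattenConsecutive(2), Linear(n_embd * 2, n_hidden), BatchNorm1d(n_hidden), Tanh(),
    FlattenConsecutive(2), Linear(n_hidden * 2, n_hidden), BatchNorm1d(n_hidden), Tanh(),
    FlattenConsecutive(2), Linear(n_hidden * 2, n_hidden), BatchNorm1d(n_hidden), Tanh(),
    Linear(n_hidden, vocab_size),
])
```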
Final Thoughts
Improved the model performance from 2.1 to around 1.99 validation loss.
Emphasis on needing an experimental harness for systematic evaluation and tuning of hyperparameters.
Future topics to explore:
Implementing dilated convolutions.
Residual and skip connections.
Exploring RNNs, LSTMs, and Transformers.
Potential Challenges
No experimental harness yet for efficient testing and validation.
Current architecture does not guarantee better performance without rigorous testing and tuning.
Conclusion
Encouragement for further experimentation to improve upon current results and explore different layer implementations.