Enhancing Multi-Layer Perceptron Performance

Sep 25, 2024

Lecture Notes: Multi-Layer Perceptron Character-Level Language Model

Introduction

  • Location: Kyoto (Background change)
  • Focus: Continuing the implementation of a multi-layer perceptron character-level language model.

Previous Recap

  • Previously built an architecture that predicts the fourth character from the three preceding characters, using 10-dimensional character embeddings and a single hidden layer.
  • Aim: extend the architecture to take in more context characters and to make the network deeper.

Proposed Architecture

  • Move from a single hidden layer that squashes all the context at once to multiple hidden layers that fuse the information progressively.
  • Inspiration from WaveNet (DeepMind, 2016): an autoregressive model that predicts audio sequences one sample at a time.

Key Changes in Architecture

  • Increase the input context length from 3 to 8 characters.
  • Introduce a deeper neural network architecture for better performance.

Data Preparation

  • Roughly 182,000 training examples derived from the words dataset.
  • Each example: 3 characters of context predicting the 4th (block size 3); see the sketch below.
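
A minimal sketch of how such a dataset can be assembled, in the spirit of the lecture's code. The function name, the `stoi` character-to-index mapping, and the use of index 0 for the '.' padding token are assumptions for illustration.

    import torch

    def build_dataset(words, stoi, block_size=3):
        # each example: `block_size` characters of context -> index of the next character
        X, Y = [], []
        for w in words:
            context = [0] * block_size            # assume index 0 is the '.' padding/terminator token
            for ch in w + '.':
                ix = stoi[ch]
                X.append(context)
                Y.append(ix)
                context = context[1:] + [ix]      # slide the context window one character forward
        return torch.tensor(X), torch.tensor(Y)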

Layer Modules

  • Developed layer modules as building blocks (e.g., a Linear class; a sketch follows this list).
  • Mimicking PyTorch’s torch.nn API for compatibility.
  • Implemented layers include:
    • Linear
    • Batch Normalization
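
A sketch of what such a Linear module can look like, mirroring the torch.nn.Linear interface (weight, optional bias, a parameters() method). The fan-in scaling of the weights is a Kaiming-style initialization; the details are illustrative rather than a definitive implementation.

    import torch

    class Linear:
        def __init__(self, fan_in, fan_out, bias=True):
            # scale weights by 1/sqrt(fan_in) so activations stay well-behaved at init
            self.weight = torch.randn((fan_in, fan_out)) / fan_in**0.5
            self.bias = torch.zeros(fan_out) if bias else None

        def __call__(self, x):
            self.out = x @ self.weight            # matrix multiply: (..., fan_in) -> (..., fan_out)
            if self.bias is not None:
                self.out = self.out + self.bias
            return self.out

        def parameters(self):
            return [self.weight] + ([] if self.bias is None else [self.bias])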

Batch Normalization

  • Maintains running mean and variance statistics, updated during training and used at evaluation.
  • Behaviour differs between training and evaluation, so the training/evaluation state must be tracked carefully (see the sketch below).
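
A sketch of the batch-norm module showing the running statistics and the training/evaluation split. The 3D handling (reducing over dimensions 0 and 1 for inputs of shape (batch, time, channels)) anticipates the hierarchical model later; the eps and momentum values are illustrative.

    import torch

    class BatchNorm1d:
        def __init__(self, dim, eps=1e-5, momentum=0.1):
            self.eps = eps
            self.momentum = momentum
            self.training = True
            # learnable scale and shift
            self.gamma = torch.ones(dim)
            self.beta = torch.zeros(dim)
            # running statistics: updated during training, used at evaluation
            self.running_mean = torch.zeros(dim)
            self.running_var = torch.ones(dim)

        def __call__(self, x):
            if self.training:
                dims = (0, 1) if x.ndim == 3 else 0   # reduce over batch (and time) dimensions
                xmean = x.mean(dims, keepdim=True)
                xvar = x.var(dims, keepdim=True)
            else:
                xmean, xvar = self.running_mean, self.running_var
            xhat = (x - xmean) / torch.sqrt(xvar + self.eps)
            self.out = self.gamma * xhat + self.beta
            if self.training:
                with torch.no_grad():
                    self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * xmean
                    self.running_var = (1 - self.momentum) * self.running_var + self.momentum * xvar
            return self.out

        def parameters(self):
            return [self.gamma, self.beta]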

Model Structure

  • Embedding table for character representation.
  • List of layers including Linear, Batch Normalization, and a Tanh activation.
  • Parameters collected from all layers and set to require gradients for training (sketched below).
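
A sketch of the (still flat) model at this stage, assuming the Linear and BatchNorm1d modules above; a tiny Tanh module is included here for completeness. The sizes (vocab_size=27, block_size=3, n_embd=10, n_hidden=200) are illustrative choices, not values taken from the notes.

    import torch

    class Tanh:
        def __call__(self, x):
            self.out = torch.tanh(x)
            return self.out
        def parameters(self):
            return []

    vocab_size, block_size, n_embd, n_hidden = 27, 3, 10, 200   # illustrative sizes

    C = torch.randn((vocab_size, n_embd))                       # character embedding table
    layers = [
        Linear(n_embd * block_size, n_hidden), BatchNorm1d(n_hidden), Tanh(),
        Linear(n_hidden, vocab_size),
    ]
    parameters = [C] + [p for layer in layers for p in layer.parameters()]
    for p in parameters:
        p.requires_grad = True                                  # enable gradients for training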

Evaluation Process

  • Batch-norm layers must be put into the correct training/evaluation mode when measuring the loss; forgetting this is an easy source of bugs (see the sketch after this list).
  • Achieved a validation loss of 2.10, with sampled outputs improving over iterations.
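
A sketch of an evaluation helper that makes the batch-norm switch explicit. It assumes the `C`, `layers`, and `BatchNorm1d` objects from the earlier sketches; `Xsplit`/`Ysplit` stand for whichever data split is being evaluated.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()                          # no gradients needed for evaluation
    def split_loss(name, Xsplit, Ysplit):
        # switch batch-norm layers to evaluation mode so they use their running statistics
        for layer in layers:
            if isinstance(layer, BatchNorm1d):
                layer.training = False
        emb = C[Xsplit]                       # (N, block_size, n_embd) embedding lookup
        x = emb.view(emb.shape[0], -1)        # concatenate the context embeddings
        for layer in layers:
            x = layer(x)                      # x ends up holding the logits
        loss = F.cross_entropy(x, Ysplit)
        print(name, loss.item())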

Graph Simplification

  • Simplified the forward-pass code so the structure of the computation is easier to read.
  • Created dedicated modules for the embedding lookup and the flattening operation for clarity (sketched below).
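
Sketches of the two modules mentioned above: an Embedding that wraps the lookup table and a flatten module that concatenates the embeddings of consecutive characters. The generalized form (here called FlattenConsecutive, taking the group size n as an argument) is what makes the hierarchical model later possible; the class names mirror the lecture but the details are illustrative.

    import torch

    class Embedding:
        def __init__(self, num_embeddings, embedding_dim):
            self.weight = torch.randn((num_embeddings, embedding_dim))

        def __call__(self, ix):
            # integer indices of shape (B, T) -> embeddings of shape (B, T, embedding_dim)
            self.out = self.weight[ix]
            return self.out

        def parameters(self):
            return [self.weight]

    class FlattenConsecutive:
        def __init__(self, n):
            self.n = n                                # number of consecutive time steps to fuse

        def __call__(self, x):
            B, T, C = x.shape
            x = x.view(B, T // self.n, C * self.n)    # concatenate groups of n neighbouring embeddings
            if x.shape[1] == 1:
                x = x.squeeze(1)                      # drop the time dimension once it collapses to 1
            self.out = x
            return self.out

        def parameters(self):
            return []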

PyTorch Containers

  • Introduced a Sequential container to manage the list of layers more conveniently.
  • Benefits of using containers include easier parameter management and simpler code (a sketch follows below).
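
A sketch of such a Sequential container, mirroring torch.nn.Sequential, followed by the flat model rebuilt with it (assuming the module sketches above; sizes remain illustrative).

    class Sequential:
        def __init__(self, layers):
            self.layers = layers

        def __call__(self, x):
            for layer in self.layers:     # call each child layer in order
                x = layer(x)
            self.out = x
            return self.out

        def parameters(self):
            # gather the parameters of all child layers into one list
            return [p for layer in self.layers for p in layer.parameters()]

    # the flat model expressed with the container (illustrative sizes)
    vocab_size, block_size, n_embd, n_hidden = 27, 3, 10, 200
    model = Sequential([
        Embedding(vocab_size, n_embd), FlattenConsecutive(block_size),
        Linear(n_embd * block_size, n_hidden), BatchNorm1d(n_hidden), Tanh(),
        Linear(n_hidden, vocab_size),
    ])
    parameters = model.parameters()
    for p in parameters:
        p.requires_grad = True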

Forward Pass Streamlining

  • With the new modules and the container in place, the forward pass collapses to a single call on the model, greatly reducing its complexity.
  • Forwarding a batch through the model yields logits, from which the loss is computed (see the sketch below).
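
A sketch of one training step with the streamlined forward pass, assuming the `model` and `parameters` from the container sketch above and the `Xtr`/`Ytr` tensors from the dataset sketch; the batch size and learning rate are illustrative.

    import torch
    import torch.nn.functional as F

    ix = torch.randint(0, Xtr.shape[0], (32,))   # sample a minibatch of 32 examples
    Xb, Yb = Xtr[ix], Ytr[ix]

    logits = model(Xb)                           # the whole forward pass is a single call
    loss = F.cross_entropy(logits, Yb)           # next-character classification loss

    for p in parameters:
        p.grad = None                            # zero the gradients
    loss.backward()                              # backward pass
    for p in parameters:
        p.data += -0.1 * p.grad                  # plain SGD update (learning rate assumed)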

Model Performance and Adjustments

  • Analyzed the performance improvement from scaling up the model (block size increased from 3 to 8).
  • Initial results showed a reduced validation loss simply from increasing the context length; only the dataset needs to be rebuilt (see the sketch after this list).
  • Need for further tuning and optimization remained evident.
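
Scaling the context is then just a matter of rebuilding the dataset with a larger block size; a sketch assuming the build_dataset function above and hypothetical train/dev word splits:

    block_size = 8
    Xtr, Ytr = build_dataset(train_words, stoi, block_size=block_size)    # training split
    Xdev, Ydev = build_dataset(dev_words, stoi, block_size=block_size)    # validation split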

Hierarchical Fusion Proposed

  • Proposed a model architecture that fuses pairs of characters at each layer, building up a hierarchical (tree-like) structure.
  • Noted that WaveNet implements this fusion efficiently with dilated causal convolution layers rather than plain dense layers (see the sketch below).
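
A sketch of the hierarchical stack, assuming the module sketches above. With a block size of 8, each FlattenConsecutive(2) halves the time dimension, so characters are fused in pairs, then pairs of pairs, and so on; the hidden sizes are illustrative.

    vocab_size, n_embd, n_hidden = 27, 24, 128   # illustrative sizes

    model = Sequential([
        Embedding(vocab_size, n_embd),
        FlattenConsecutive(2), Linear(n_embd * 2, n_hidden), BatchNorm1d(n_hidden), Tanh(),
        FlattenConsecutive(2), Linear(n_hidden * 2, n_hidden), BatchNorm1d(n_hidden), Tanh(),
        FlattenConsecutive(2), Linear(n_hidden * 2, n_hidden), BatchNorm1d(n_hidden), Tanh(),
        Linear(n_hidden, vocab_size),
    ])
    parameters = model.parameters()
    for p in parameters:
        p.requires_grad = True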

Final Thoughts

  • Improved the validation loss from about 2.10 to around 1.99.
  • Emphasis on needing an experimental harness for systematic evaluation and tuning of hyperparameters.
  • Future topics to explore:
    • Implementing dilated convolutions.
    • Residual and skip connections.
    • RNNs, LSTMs, and Transformers.

Potential Challenges

  • No experimental harness yet for efficient testing and validation.
  • Current architecture does not guarantee better performance without rigorous testing and tuning.

Conclusion

  • Encouragement for further experimentation to improve upon current results and explore different layer implementations.