Enhancing Multi-Layer Perceptron Performance

Sep 25, 2024

Lecture Notes: Multi-Layer Perceptron Character-Level Language Model

Introduction

  • Location: Kyoto (Background change)
  • Focus: Continuing the implementation of a multi-layer perceptron character-level language model.

Previous Recap

  • Previously built an architecture that predicts the fourth character from the three preceding characters, using 10-dimensional character embeddings and a single hidden layer.
  • Aim: extend the architecture to take in more context characters and to make the network deeper.

Proposed Architecture

  • Move from a single hidden layer that squashes all the context at once to multiple hidden layers that fuse the information progressively.
  • Inspiration from WaveNet (DeepMind, 2016): an autoregressive model that predicts audio sequences one sample at a time.

Key Changes in Architecture

  • Increase the input context length from 3 to 8 characters.
  • Introduce a deeper neural network architecture for better performance.

Data Preparation

  • Roughly 182,000 training examples derived from the words dataset.
  • Each example: 3 characters of context predicting the 4th (block size 3); see the sketch below.
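
A minimal sketch of how such a dataset can be assembled, in the spirit of the lecture's code. The function name, the `stoi` character-to-index mapping, and the use of index 0 for the '.' padding token are assumptions for illustration.

    import torch

    def build_dataset(words, stoi, block_size=3):
        # each example: `block_size` characters of context -> index of the next character
        X, Y = [], []
        for w in words:
            context = [0] * block_size            # assume index 0 is the '.' padding/terminator token
            for ch in w + '.':
                ix = stoi[ch]
                X.append(context)
                Y.append(ix)
                context = context[1:] + [ix]      # slide the context window one character forward
        return torch.tensor(X), torch.tensor(Y)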

Layer Modules

  • Developed layer modules as building blocks (e.g., a Linear class; a sketch follows this list).
  • Mimicking PyTorch’s torch.nn API for compatibility.
  • Implemented layers include:
    • Linear
    • Batch Normalization
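
A sketch of what such a Linear module can look like, mirroring the torch.nn.Linear interface (weight, optional bias, a parameters() method). The fan-in scaling of the weights is a Kaiming-style initialization; the details are illustrative rather than a definitive implementation.

    import torch

    class Linear:
        def __init__(self, fan_in, fan_out, bias=True):
            # scale weights by 1/sqrt(fan_in) so activations stay well-behaved at init
            self.weight = torch.randn((fan_in, fan_out)) / fan_in**0.5
            self.bias = torch.zeros(fan_out) if bias else None

        def __call__(self, x):
            self.out = x @ self.weight            # matrix multiply: (..., fan_in) -> (..., fan_out)
            if self.bias is not None:
                self.out = self.out + self.bias
            return self.out

        def parameters(self):
            return [self.weight] + ([] if self.bias is None else [self.bias])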

Batch Normalization

  • Maintains running mean and variance statistics, updated during training and used at evaluation.
  • Behaviour differs between training and evaluation, so the training/evaluation state must be tracked carefully (see the sketch below).
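
A sketch of the batch-norm module showing the running statistics and the training/evaluation split. The 3D handling (reducing over dimensions 0 and 1 for inputs of shape (batch, time, channels)) anticipates the hierarchical model later; the eps and momentum values are illustrative.

    import torch

    class BatchNorm1d:
        def __init__(self, dim, eps=1e-5, momentum=0.1):
            self.eps = eps
            self.momentum = momentum
            self.training = True
            # learnable scale and shift
            self.gamma = torch.ones(dim)
            self.beta = torch.zeros(dim)
            # running statistics: updated during training, used at evaluation
            self.running_mean = torch.zeros(dim)
            self.running_var = torch.ones(dim)

        def __call__(self, x):
            if self.training:
                dims = (0, 1) if x.ndim == 3 else 0   # reduce over batch (and time) dimensions
                xmean = x.mean(dims, keepdim=True)
                xvar = x.var(dims, keepdim=True)
            else:
                xmean, xvar = self.running_mean, self.running_var
            xhat = (x - xmean) / torch.sqrt(xvar + self.eps)
            self.out = self.gamma * xhat + self.beta
            if self.training:
                with torch.no_grad():
                    self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * xmean
                    self.running_var = (1 - self.momentum) * self.running_var + self.momentum * xvar
            return self.out

        def parameters(self):
            return [self.gamma, self.beta]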

Model Structure

  • Embedding table for character representation.
  • List of layers including Linear, Batch Normalization, and a Tanh activation.
  • Parameters collected from all layers and set to require gradients for training (sketched below).
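
A sketch of the (still flat) model at this stage, assuming the Linear and BatchNorm1d modules above; a tiny Tanh module is included here for completeness. The sizes (vocab_size=27, block_size=3, n_embd=10, n_hidden=200) are illustrative choices, not values taken from the notes.

    import torch

    class Tanh:
        def __call__(self, x):
            self.out = torch.tanh(x)
            return self.out
        def parameters(self):
            return []

    vocab_size, block_size, n_embd, n_hidden = 27, 3, 10, 200   # illustrative sizes

    C = torch.randn((vocab_size, n_embd))                       # character embedding table
    layers = [
        Linear(n_embd * block_size, n_hidden), BatchNorm1d(n_hidden), Tanh(),
        Linear(n_hidden, vocab_size),
    ]
    parameters = [C] + [p for layer in layers for p in layer.parameters()]
    for p in parameters:
        p.requires_grad = True                                  # enable gradients for training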

Evaluation Process

  • Batch-norm layers must be put into the correct training/evaluation mode when measuring the loss; forgetting this is an easy source of bugs (see the sketch after this list).
  • Achieved a validation loss of 2.10, with sampled outputs improving over iterations.
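
A sketch of an evaluation helper that makes the batch-norm switch explicit. It assumes the `C`, `layers`, and `BatchNorm1d` objects from the earlier sketches; `Xsplit`/`Ysplit` stand for whichever data split is being evaluated.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()                          # no gradients needed for evaluation
    def split_loss(name, Xsplit, Ysplit):
        # switch batch-norm layers to evaluation mode so they use their running statistics
        for layer in layers:
            if isinstance(layer, BatchNorm1d):
                layer.training = False
        emb = C[Xsplit]                       # (N, block_size, n_embd) embedding lookup
        x = emb.view(emb.shape[0], -1)        # concatenate the context embeddings
        for layer in layers:
            x = layer(x)                      # x ends up holding the logits
        loss = F.cross_entropy(x, Ysplit)
        print(name, loss.item())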

Graph Simplification

  • Simplified the forward-pass code so the structure of the computation is easier to read.
  • Created dedicated modules for the embedding lookup and the flattening operation for clarity (sketched below).
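
Sketches of the two modules mentioned above: an Embedding that wraps the lookup table and a flatten module that concatenates the embeddings of consecutive characters. The generalized form (here called FlattenConsecutive, taking the group size n as an argument) is what makes the hierarchical model later possible; the class names mirror the lecture but the details are illustrative.

    import torch

    class Embedding:
        def __init__(self, num_embeddings, embedding_dim):
            self.weight = torch.randn((num_embeddings, embedding_dim))

        def __call__(self, ix):
            # integer indices of shape (B, T) -> embeddings of shape (B, T, embedding_dim)
            self.out = self.weight[ix]
            return self.out

        def parameters(self):
            return [self.weight]

    class FlattenConsecutive:
        def __init__(self, n):
            self.n = n                                # number of consecutive time steps to fuse

        def __call__(self, x):
            B, T, C = x.shape
            x = x.view(B, T // self.n, C * self.n)    # concatenate groups of n neighbouring embeddings
            if x.shape[1] == 1:
                x = x.squeeze(1)                      # drop the time dimension once it collapses to 1
            self.out = x
            return self.out

        def parameters(self):
            return []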

PyTorch Containers

  • Introduced a Sequential container to manage the list of layers more conveniently.
  • Benefits of using containers include easier parameter management and simpler code (a sketch follows below).
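
A sketch of such a Sequential container, mirroring torch.nn.Sequential, followed by the flat model rebuilt with it (assuming the module sketches above; sizes remain illustrative).

    class Sequential:
        def __init__(self, layers):
            self.layers = layers

        def __call__(self, x):
            for layer in self.layers:     # call each child layer in order
                x = layer(x)
            self.out = x
            return self.out

        def parameters(self):
            # gather the parameters of all child layers into one list
            return [p for layer in self.layers for p in layer.parameters()]

    # the flat model expressed with the container (illustrative sizes)
    vocab_size, block_size, n_embd, n_hidden = 27, 3, 10, 200
    model = Sequential([
        Embedding(vocab_size, n_embd), FlattenConsecutive(block_size),
        Linear(n_embd * block_size, n_hidden), BatchNorm1d(n_hidden), Tanh(),
        Linear(n_hidden, vocab_size),
    ])
    parameters = model.parameters()
    for p in parameters:
        p.requires_grad = True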

Forward Pass Streamlining

  • With the new modules and the container in place, the forward pass collapses to a single call on the model, greatly reducing its complexity.
  • Forwarding a batch through the model yields logits, from which the loss is computed (see the sketch below).
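
A sketch of one training step with the streamlined forward pass, assuming the `model` and `parameters` from the container sketch above and the `Xtr`/`Ytr` tensors from the dataset sketch; the batch size and learning rate are illustrative.

    import torch
    import torch.nn.functional as F

    ix = torch.randint(0, Xtr.shape[0], (32,))   # sample a minibatch of 32 examples
    Xb, Yb = Xtr[ix], Ytr[ix]

    logits = model(Xb)                           # the whole forward pass is a single call
    loss = F.cross_entropy(logits, Yb)           # next-character classification loss

    for p in parameters:
        p.grad = None                            # zero the gradients
    loss.backward()                              # backward pass
    for p in parameters:
        p.data += -0.1 * p.grad                  # plain SGD update (learning rate assumed)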

Model Performance and Adjustments

  • Analyzed the performance improvement from scaling up the model (block size increased from 3 to 8).
  • Initial results showed a reduced validation loss simply from increasing the context length; only the dataset needs to be rebuilt (see the sketch after this list).
  • Need for further tuning and optimization remained evident.
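
Scaling the context is then just a matter of rebuilding the dataset with a larger block size; a sketch assuming the build_dataset function above and hypothetical train/dev word splits:

    block_size = 8
    Xtr, Ytr = build_dataset(train_words, stoi, block_size=block_size)    # training split
    Xdev, Ydev = build_dataset(dev_words, stoi, block_size=block_size)    # validation split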

Hierarchical Fusion Proposed

  • Proposed a model architecture that fuses pairs of characters at each layer, building up a hierarchical (tree-like) structure.
  • Noted that WaveNet implements this fusion efficiently with dilated causal convolution layers rather than plain dense layers (see the sketch below).
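
A sketch of the hierarchical stack, assuming the module sketches above. With a block size of 8, each FlattenConsecutive(2) halves the time dimension, so characters are fused in pairs, then pairs of pairs, and so on; the hidden sizes are illustrative.

    vocab_size, n_embd, n_hidden = 27, 24, 128   # illustrative sizes

    model = Sequential([
        Embedding(vocab_size, n_embd),
        FlattenConsecutive(2), Linear(n_embd * 2, n_hidden), BatchNorm1d(n_hidden), Tanh(),
        FlattenConsecutive(2), Linear(n_hidden * 2, n_hidden), BatchNorm1d(n_hidden), Tanh(),
        FlattenConsecutive(2), Linear(n_hidden * 2, n_hidden), BatchNorm1d(n_hidden), Tanh(),
        Linear(n_hidden, vocab_size),
    ])
    parameters = model.parameters()
    for p in parameters:
        p.requires_grad = True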

Final Thoughts

  • Improved the validation loss from about 2.10 to around 1.99.
  • Emphasis on needing an experimental harness for systematic evaluation and tuning of hyperparameters.
  • Future topics to explore:
    • Implementing dilated convolutions.
    • Residual and skip connections.
    • RNNs, LSTMs, and Transformers.

Potential Challenges

  • No experimental harness yet for efficient testing and validation.
  • Current architecture does not guarantee better performance without rigorous testing and tuning.

Conclusion

  • Encouragement for further experimentation to improve upon current results and explore different layer implementations.