Mamba: A New Neural Net Architecture

Jul 23, 2024

Introduction

  • Mamba is a new neural net architecture designed for language modelling
  • It potentially surpasses transformers, which have been the dominant architecture for sequence modelling for the past 7 years
  • Mamba has shown promising results with smaller model sizes (a few billion parameters)
  • Uses less computational power than transformers:
    • Mamba: O(n log n) compute
    • Transformers: O(n^2) compute
  • Allows for greater context sizes

Deep Dive into Mamba Architecture

State-Space Models and RNNs

  • Mamba is often presented as an extension of state-space models
  • State-space models:
    • Complex theory, advanced mathematics
    • Another type of sequence model gaining popularity
  • Understanding Mamba through Recurrent Neural Networks (RNNs):
    • RNNs offer a simpler path to understanding
    • At each step, an RNN applies a neural net to the current input vector together with its own previous output (a minimal sketch follows this list)
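
A minimal (non-linear) RNN step in JAX for reference; the weight names here are illustrative, not from the Mamba paper:

    import jax.numpy as jnp

    def rnn_step(h_prev, x, params):
        # Classic RNN update: mix the previous output (hidden state) with the
        # current input, then apply a non-linearity.
        W_h, W_x, b = params
        return jnp.tanh(W_h @ h_prev + W_x @ x + b)

    def run_rnn(xs, params, hidden_dim):
        # Inherently sequential: each step needs the previous step's result.
        h = jnp.zeros(hidden_dim)
        outputs = []
        for x in xs:
            h = rnn_step(h, x, params)
            outputs.append(h)
        return jnp.stack(outputs)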

RNNs vs Transformers

  • RNNs incorporate information from all previous input vectors
  • Downsides of RNNs:
    • Sequential computation: slow on modern parallel hardware
    • Difficult to train: vanishing and exploding gradients
  • Transformers:
    • Handle long-range dependencies better and can be trained in parallel
    • But their computation cost is quadratic in the sequence length

Linear RNNs

  • Linear RNNs addressed RNN limitations:
    • Use linear functions for recurrent operations
    • Alternate linear recurrent layers and element-wise neural networks
  • Efficient computational technique:
    • Diagonalizes the recurrent weight matrix (i.e. works in its eigenbasis)
    • The repeated matrix multiplications of the recurrence become cheap element-wise multiplications (see the sketch after this list)
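
A rough sketch of that diagonalized recurrence, assuming the recurrent matrix has already been diagonalized and the inputs already projected (variable names are mine):

    import jax.numpy as jnp

    def diagonal_linear_recurrence(lam, b_seq):
        # h_t = lam * h_{t-1} + b_t, where lam holds the (complex) eigenvalues
        # of the recurrent matrix and b_seq[t] is the already-projected input.
        h = jnp.zeros_like(b_seq[0])
        outputs = []
        for b in b_seq:
            h = lam * h + b  # element-wise, no matrix multiplication per step
            outputs.append(h)
        return jnp.stack(outputs)

This loop is still sequential; the next section is about removing that bottleneck.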

Parallel Computation of Linear Recurrence

  • With enough parallel processors, the linear recurrence can be computed in O(log n) time
  • Uses the parallel cumulative-sum (prefix scan) algorithm, adapted to linear recurrence
  • This removes the sequential bottleneck that makes ordinary RNNs slow to train (see the sketch after this list)
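
One concrete way to express this is JAX's built-in associative scan; the combine function below encodes how two adjacent pieces of the recurrence compose (a sketch, not the authors' implementation):

    import jax
    import jax.numpy as jnp

    def parallel_linear_recurrence(lam_seq, b_seq):
        # Computes h_t = lam_seq[t] * h_{t-1} + b_seq[t] for all t at once.
        # The pairs (lam, b) compose associatively, so the whole recurrence can
        # be evaluated with the same divide-and-conquer structure as a parallel
        # cumulative sum, in O(log n) parallel steps.
        def combine(left, right):
            a1, b1 = left
            a2, b2 = right
            return a1 * a2, a2 * b1 + b2

        _, hs = jax.lax.associative_scan(combine, (lam_seq, b_seq))
        return hs  # hs[t] == h_t, assuming h_0 = 0

With a constant recurrent weight, lam_seq is just the same vector repeated at every step; the input-dependent weights introduced by Mamba below slot into the same machinery.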

Training Linear RNNs

  • Initialization strategy to stabilize gradients (sketched after this list):
    • Recurrent weights parametrized in complex polar form
    • Their magnitudes kept close to (but below) 1
    • Inputs scaled by a factor that is initially close to 0, so the hidden state does not blow up
  • Leads to stable training and long-range context learning
  • Linear RNNs outperform transformers on the Long Range Arena benchmark
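
A sketch of this kind of initialization, roughly in the style of the linear recurrent unit (LRU) line of work; the exact ranges and names here are assumptions:

    import jax
    import jax.numpy as jnp

    def init_linear_rnn(key, d, r_min=0.9, r_max=0.999):
        # Per-channel recurrent weight lam = r * exp(i * theta), with the
        # magnitude r close to (but below) 1 so that repeated multiplication
        # neither explodes nor vanishes over long sequences.
        k1, k2 = jax.random.split(key)
        r = jax.random.uniform(k1, (d,), minval=r_min, maxval=r_max)
        theta = jax.random.uniform(k2, (d,), minval=0.0, maxval=2 * jnp.pi)
        lam = r * jnp.exp(1j * theta)
        # Input scaling: close to 0 when |lam| is close to 1, keeping the
        # hidden state at a sensible magnitude at initialization.
        gamma = jnp.sqrt(1.0 - jnp.abs(lam) ** 2)
        return lam, gamma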

State-Space Models

  • Derived from control theory: a continuous-time system discretized into a recurrence
  • Essentially linear RNNs with a different initialization scheme for the recurrent weights
  • Also perform well on long-range benchmarks (a discretization sketch follows this list)
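
To make the control-theory connection concrete, here is the standard zero-order-hold discretization of a diagonal continuous-time state-space model; the result is exactly a linear recurrence, just with its weights derived differently (a sketch under those assumptions):

    import jax.numpy as jnp

    def discretize_diagonal_ssm(a, b, delta):
        # Continuous model: h'(t) = a * h(t) + b * x(t), with per-channel
        # (diagonal) a and step size delta. Zero-order hold gives the discrete
        # recurrence h_t = a_bar * h_{t-1} + b_bar * x_t.
        a_bar = jnp.exp(delta * a)
        b_bar = (a_bar - 1.0) / a * b
        return a_bar, b_bar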

Mamba Architecture

Making RNNs Selectively Forget

  • Uses different recurrent weights at each step, generated from the current input
  • This input-dependent weighting lets the model selectively keep or forget information (a sketch follows this list)
  • Output vectors are enlarged (by a factor of 16) so the state can store more information
  • A hardware-aware implementation computes these enlarged states in fast on-chip GPU memory rather than writing them all out to main GPU memory
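
A minimal sketch of the selective idea; the projections and shapes here are illustrative, not the exact Mamba parametrization:

    import jax
    import jax.numpy as jnp

    def selective_recurrence(params, xs):
        # The recurrent weights are produced from each input, so the model can
        # decide, step by step, how much of the previous state to keep.
        W_forget, W_in = params
        hidden_dim = W_in.shape[0]

        def step(h, x):
            a = jax.nn.sigmoid(W_forget @ x)  # in (0, 1): how much state to keep
            b = W_in @ x                      # what to write into the state
            h = a * h + b                     # still a linear recurrence in h
            return h, h

        _, hs = jax.lax.scan(step, jnp.zeros(hidden_dim), xs)
        return hs

Because a and b depend only on the input (not on h), this is still a linear recurrence in h, so the parallel scan from earlier still applies.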

Performance

  • Outperforms transformers of comparable size on language modelling
  • Computational efficiency: O(n log n) compute versus O(n^2) for transformers

Controversy and Peer Review

  • Mamba paper submitted to ICLR 2024 and rejected
  • Peer review criticisms:
    • Not evaluated on the Long Range Arena benchmark
    • Only evaluated on language modelling, not downstream tasks
    • One review rested on the incorrect assumption that Mamba has quadratic memory requirements, like transformers
  • Sparked debate on peer reviewing practices

Conclusion

  • Mamba shows significant promise for language modelling
  • It may address longstanding issues with both RNNs (slow, hard-to-train recurrence) and transformers (quadratic cost)
  • Raises questions about peer review practices in academia