Mamba: A New Neural Net Architecture
Jul 23, 2024
Introduction
Mamba is a new neural net architecture designed for language modelling
It potentially surpasses transformers, which have been the dominant model for 7 years
Mamba has shown promising results with smaller model sizes (a few billion parameters)
Uses less computational power than transformers:
Mamba: O(n log n) compute
Transformers: O(n²) compute
Allows for greater context sizes
Deep Dive into Mamba Architecture
State-Space Models and RNNs
Mamba is often presented as an extension of state-space models
State-space models:
Built on a complex theory involving advanced mathematics
Another type of sequence model gaining popularity
Understanding Mamba through Recurrent Neural Networks (RNNs):
RNNs: simpler path to understanding
RNNs apply a neural net to the current input vector together with the previous output
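A minimal sketch of that recurrence in NumPy (the tanh non-linearity and all names are illustrative choices, not taken from the video):

```python
import numpy as np

def rnn_step(h_prev, x, W_h, W_x, b):
    # Combine the previous output with the current input through a
    # non-linearity (tanh, as in a classic Elman RNN).
    return np.tanh(W_h @ h_prev + W_x @ x + b)

def rnn(xs, W_h, W_x, b):
    # Each output depends on every earlier input via the carried state,
    # but the loop is inherently sequential.
    h = np.zeros(W_h.shape[0])
    outputs = []
    for x in xs:
        h = rnn_step(h, x, W_h, W_x, b)
        outputs.append(h)
    return np.stack(outputs)
```

The explicit step-by-step loop is exactly what makes RNNs slow on parallel hardware, as the next section notes.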
RNNs vs Transformers
RNNs incorporate information from all previous input vectors
Downsides of RNNs:
Sequential computation: slow on modern parallel hardware
Difficult to train: vanishing and exploding gradients
Transformers:
Can handle long-range dependencies better thanks to parallel computation
Computational cost is quadratic in sequence length
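To make the quadratic cost concrete, here is a minimal single-head self-attention in NumPy (an illustrative sketch, not a full transformer block); the (n, n) score matrix is where the O(n²) compute and memory come from:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    # X: (n, d) sequence. Every position is compared with every other
    # position, so the score matrix has shape (n, n).
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V
```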
Linear RNNs
Linear RNNs addressed RNN limitations:
Use linear functions for recurrent operations
Alternate linear recurrent layers and element-wise neural networks
Efficient computational technique:
Diagonalizes the recurrent weight matrix
Each recurrence step then needs only an element-wise multiplication instead of a slow full matrix multiplication
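A minimal sketch of one linear recurrent layer in diagonalized form, assuming the inputs have already been projected into the diagonal (eigenvector) basis; names are illustrative:

```python
import numpy as np

def linear_recurrence(xs, a):
    # h_t = a * h_{t-1} + x_t with a diagonal recurrent matrix stored as the
    # (complex) vector `a`: each step is an element-wise multiply, not a matmul.
    h = np.zeros_like(xs[0], dtype=complex)
    hs = []
    for x in xs:
        h = a * h + x
        hs.append(h)
    return np.stack(hs)

# A full linear-RNN block alternates this recurrence with a small
# position-wise neural network applied independently at every step.
```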
Parallel Computation of Linear Recurrence
Linear recurrence can be computed in O(log n) parallel time
Uses the parallel cumulative-sum (prefix scan) algorithm, adapted to linear recurrence
Total compute is O(n log n), far below the quadratic cost of attention
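A sketch of the parallel scan, assuming the diagonal recurrence h_t = a_t · h_{t-1} + x_t from above, written per channel; the sequential loop at the end is only a correctness check:

```python
import numpy as np

def parallel_linear_scan(a, x):
    # Inclusive scan for h_t = a_t * h_{t-1} + x_t (Hillis-Steele style).
    # Each round combines every element with the one `shift` positions back;
    # the combines within a round are independent, so with enough parallel
    # hardware the depth is O(log n), like a parallel cumulative sum.
    a, x = a.astype(complex), x.astype(complex)
    shift = 1
    while shift < len(x):
        a_prev, x_prev = a[:-shift].copy(), x[:-shift].copy()
        x[shift:] = a[shift:] * x_prev + x[shift:]
        a[shift:] = a[shift:] * a_prev
        shift *= 2
    return x  # x[t] now equals h_t

# Correctness check against the plain sequential recurrence.
a, x = np.full(8, 0.9), np.arange(8.0)
h, hs = 0.0, []
for t in range(8):
    h = a[t] * h + x[t]
    hs.append(h)
assert np.allclose(parallel_linear_scan(a, x), hs)
```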
Training Linear RNNs
Initialization strategy to stabilize gradients (see the sketch below):
Recurrent weights parameterized in complex polar form
Magnitude kept close to 1
Inputs scaled down so hidden states start near 0
Leads to stable training and long-range context learning
Linear RNNs outperform transformers on the Long Range Arena benchmark
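A sketch of that initialization recipe; the specific magnitude range and the sqrt(1 - |a|^2) input scaling are assumptions in the spirit of recent linear-RNN work, not values stated in the video:

```python
import numpy as np

def init_linear_rnn(d, r_min=0.9, r_max=0.999, seed=0):
    # Recurrent weights in polar form a = r * exp(i * theta); keeping the
    # magnitude r close to (but below) 1 stops gradients from vanishing or
    # exploding over long sequences.
    rng = np.random.default_rng(seed)
    r = rng.uniform(r_min, r_max, size=d)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=d)
    a = r * np.exp(1j * theta)
    # Scale the inputs down so the hidden state starts small and stays
    # well-behaved even though |a| is close to 1.
    input_scale = np.sqrt(1.0 - np.abs(a) ** 2)
    return a, input_scale
```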
State-Space Models
Derived from control theory
Similar to linear RNNs but with different initialization
Also perform well on long-range benchmarks
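For reference, the control-theory starting point is a continuous-time system dx/dt = A x + B u, y = C x, which is discretized into exactly this kind of linear recurrence. A minimal sketch, using the zero-order-hold discretization as one common choice (not necessarily the one in the video):

```python
import numpy as np
from scipy.linalg import expm

def discretize_ssm(A, B, dt):
    # Zero-order-hold discretization of dx/dt = A x + B u into the linear
    # recurrence x_t = A_bar x_{t-1} + B_bar u_t (assumes A is invertible).
    A_bar = expm(dt * A)
    B_bar = np.linalg.solve(A, (A_bar - np.eye(A.shape[0]))) @ B
    return A_bar, B_bar
```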
Mamba Architecture
Making RNNs Selectively Forget
Uses different weights at each step, generated from the current input
Dynamic weight generation lets the model selectively forget or retain information
Output vectors enlarged (by a factor of 16) to store more information
Expanded vectors are handled in fast GPU memory to keep this efficient
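A minimal sketch of the selective-forgetting idea as an input-dependent gated linear recurrence; function and parameter names are illustrative, not Mamba's exact formulation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def selective_recurrence(xs, W_forget, W_in):
    # W_forget, W_in: (d_state, d_in) projections; d_state is typically an
    # expanded size (e.g. 16x d_in) so the state can hold more information.
    h = np.zeros(W_in.shape[0])
    hs = []
    for x in xs:
        f = sigmoid(W_forget @ x)            # forget gate computed from the input
        h = f * h + (1.0 - f) * (W_in @ x)   # keep or overwrite state, per channel
        hs.append(h)
    return np.stack(hs)

# The recurrence is still linear in h (only its coefficients depend on the
# input), so the parallel scan from earlier still applies with a_t = f(x_t).
```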
Performance
Outperforms transformers in language modelling
Computational efficiency: O(n log n) compute versus O(n²) for transformers
Controversy and Peer Review
Mamba paper submitted to ICLR 2024 and rejected
Peer review criticisms:
Not tested on the Long Range Arena benchmark
Only evaluated on language modelling, not downstream tasks
One criticism rested on the incorrect assumption that Mamba has quadratic memory requirements
Sparked debate on peer reviewing practices
Conclusion
Mamba shows significant promise for language modelling
Possibility of addressing longstanding issues with RNNs and transformers
Raises questions about peer review practices in academia