Mamba: A New Neural Net Architecture
Jul 23, 2024
Introduction
Mamba is a new neural net architecture designed for language modelling
It potentially surpasses transformers, which have been the dominant model for 7 years
Mamba has shown promising results with smaller model sizes (a few billion parameters)
Uses less computational power than transformers:
Mamba: O(n log n) compute
Transformers: O(n²) compute
Allows for greater context sizes
Deep Dive into Mamba Architecture
State-Space Models and RNNs
Mamba is often presented as an extension of state-space models
State-space models:
Built on a complex theory involving advanced mathematics
Another type of sequence model gaining popularity
Understanding Mamba through Recurrent Neural Networks (RNNs):
RNNs: simpler path to understanding
RNNs apply a neural net to the current input vector together with the previous output
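A minimal sketch of that recurrence in NumPy (the tanh non-linearity and all names are illustrative choices, not taken from the video):

```python
import numpy as np

def rnn_step(h_prev, x, W_h, W_x, b):
    # Combine the previous output with the current input through a
    # non-linearity (tanh, as in a classic Elman RNN).
    return np.tanh(W_h @ h_prev + W_x @ x + b)

def rnn(xs, W_h, W_x, b):
    # Each output depends on every earlier input via the carried state,
    # but the loop is inherently sequential.
    h = np.zeros(W_h.shape[0])
    outputs = []
    for x in xs:
        h = rnn_step(h, x, W_h, W_x, b)
        outputs.append(h)
    return np.stack(outputs)
```

The explicit step-by-step loop is exactly what makes RNNs slow on parallel hardware, as the next section notes.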
RNNs vs Transformers
RNNs incorporate information from all previous input vectors
Downsides of RNNs:
Sequential computation: slow on modern parallel hardware
Difficult to train: vanishing and exploding gradients
Transformers:
Can handle long-range dependencies better thanks to parallel computation
Computational cost is quadratic in sequence length
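To make the quadratic cost concrete, here is a minimal single-head self-attention in NumPy (an illustrative sketch, not a full transformer block); the (n, n) score matrix is where the O(n²) compute and memory come from:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    # X: (n, d) sequence. Every position is compared with every other
    # position, so the score matrix has shape (n, n).
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V
```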
Linear RNNs
Linear RNNs addressed RNN limitations:
Use linear functions for recurrent operations
Alternate linear recurrent layers and element-wise neural networks
Efficient computational technique:
Diagonalizes the recurrent weight matrix
Each recurrence step then needs only an element-wise multiplication instead of a slow full matrix multiplication
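A minimal sketch of one linear recurrent layer in diagonalized form, assuming the inputs have already been projected into the diagonal (eigenvector) basis; names are illustrative:

```python
import numpy as np

def linear_recurrence(xs, a):
    # h_t = a * h_{t-1} + x_t with a diagonal recurrent matrix stored as the
    # (complex) vector `a`: each step is an element-wise multiply, not a matmul.
    h = np.zeros_like(xs[0], dtype=complex)
    hs = []
    for x in xs:
        h = a * h + x
        hs.append(h)
    return np.stack(hs)

# A full linear-RNN block alternates this recurrence with a small
# position-wise neural network applied independently at every step.
```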
Parallel Computation of Linear Recurrence
Linear recurrence can be computed in O(log n) parallel time
Uses the parallel cumulative-sum (prefix scan) algorithm, adapted to linear recurrence
Total compute is O(n log n), far below the quadratic cost of attention
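A sketch of the parallel scan, assuming the diagonal recurrence h_t = a_t · h_{t-1} + x_t from above, written per channel; the sequential loop at the end is only a correctness check:

```python
import numpy as np

def parallel_linear_scan(a, x):
    # Inclusive scan for h_t = a_t * h_{t-1} + x_t (Hillis-Steele style).
    # Each round combines every element with the one `shift` positions back;
    # the combines within a round are independent, so with enough parallel
    # hardware the depth is O(log n), like a parallel cumulative sum.
    a, x = a.astype(complex), x.astype(complex)
    shift = 1
    while shift < len(x):
        a_prev, x_prev = a[:-shift].copy(), x[:-shift].copy()
        x[shift:] = a[shift:] * x_prev + x[shift:]
        a[shift:] = a[shift:] * a_prev
        shift *= 2
    return x  # x[t] now equals h_t

# Correctness check against the plain sequential recurrence.
a, x = np.full(8, 0.9), np.arange(8.0)
h, hs = 0.0, []
for t in range(8):
    h = a[t] * h + x[t]
    hs.append(h)
assert np.allclose(parallel_linear_scan(a, x), hs)
```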
Training Linear RNNs
Initialization strategy to stabilize gradients (see the sketch below):
Recurrent weights parameterized in complex polar form
Magnitude kept close to 1
Inputs scaled down so hidden states start near 0
Leads to stable training and long-range context learning
Linear RNNs outperform transformers on the Long Range Arena benchmark
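A sketch of that initialization recipe; the specific magnitude range and the sqrt(1 - |a|^2) input scaling are assumptions in the spirit of recent linear-RNN work, not values stated in the video:

```python
import numpy as np

def init_linear_rnn(d, r_min=0.9, r_max=0.999, seed=0):
    # Recurrent weights in polar form a = r * exp(i * theta); keeping the
    # magnitude r close to (but below) 1 stops gradients from vanishing or
    # exploding over long sequences.
    rng = np.random.default_rng(seed)
    r = rng.uniform(r_min, r_max, size=d)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=d)
    a = r * np.exp(1j * theta)
    # Scale the inputs down so the hidden state starts small and stays
    # well-behaved even though |a| is close to 1.
    input_scale = np.sqrt(1.0 - np.abs(a) ** 2)
    return a, input_scale
```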
State-Space Models
Derived from control theory
Similar to linear RNNs but with different initialization
Also perform well on long-range benchmarks
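For reference, the control-theory starting point is a continuous-time system dx/dt = A x + B u, y = C x, which is discretized into exactly this kind of linear recurrence. A minimal sketch, using the zero-order-hold discretization as one common choice (not necessarily the one in the video):

```python
import numpy as np
from scipy.linalg import expm

def discretize_ssm(A, B, dt):
    # Zero-order-hold discretization of dx/dt = A x + B u into the linear
    # recurrence x_t = A_bar x_{t-1} + B_bar u_t (assumes A is invertible).
    A_bar = expm(dt * A)
    B_bar = np.linalg.solve(A, (A_bar - np.eye(A.shape[0]))) @ B
    return A_bar, B_bar
```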
Mamba Architecture
Making RNNs Selectively Forget
Uses different weights at each step, generated from the current input
Dynamic weight generation lets the model selectively forget or retain information
Output vectors enlarged (by a factor of 16) to store more information
Expanded vectors are handled in fast GPU memory to keep this efficient
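A minimal sketch of the selective-forgetting idea as an input-dependent gated linear recurrence; function and parameter names are illustrative, not Mamba's exact formulation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def selective_recurrence(xs, W_forget, W_in):
    # W_forget, W_in: (d_state, d_in) projections; d_state is typically an
    # expanded size (e.g. 16x d_in) so the state can hold more information.
    h = np.zeros(W_in.shape[0])
    hs = []
    for x in xs:
        f = sigmoid(W_forget @ x)            # forget gate computed from the input
        h = f * h + (1.0 - f) * (W_in @ x)   # keep or overwrite state, per channel
        hs.append(h)
    return np.stack(hs)

# The recurrence is still linear in h (only its coefficients depend on the
# input), so the parallel scan from earlier still applies with a_t = f(x_t).
```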
Performance
Outperforms transformers in language modelling
Computational efficiency: O(n log n) compute versus O(n²) for transformers
Controversy and Peer Review
Mamba paper submitted to ICLR 2024 and rejected
Peer review criticisms:
Not tested on the Long Range Arena benchmark
Only evaluated on language modelling, not downstream tasks
One criticism rested on the incorrect assumption that Mamba has quadratic memory requirements
Sparked debate on peer reviewing practices
Conclusion
Mamba shows significant promise for language modelling
Possibility of addressing longstanding issues with RNNs and transformers
Raises questions about peer review practices in academia