Randomized Multi-Head Attention
================================

Embedding dimension 128 — ReLU activation
------------------------------------------

Forward pass time (ms), embed=128, ReLU

.. image:: ../_static/67.png
   :alt: Forward pass time (ms) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=128, ReLU kernel
   :width: 600px
   :align: center

Backward pass time (ms), embed=128, ReLU

.. image:: ../_static/68.png
   :alt: Backward pass time (ms) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=128, ReLU kernel
   :width: 600px
   :align: center

Forward pass peak GPU memory (MB), embed=128, ReLU

.. image:: ../_static/69.png
   :alt: Forward pass peak GPU memory (MB) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=128, ReLU kernel
   :width: 600px
   :align: center

Backward pass peak GPU memory (MB), embed=128, ReLU

.. image:: ../_static/70.png
   :alt: Backward pass peak GPU memory (MB) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=128, ReLU kernel
   :width: 600px
   :align: center

Embedding dimension 128 — Softmax activation
---------------------------------------------

Forward pass time (ms), embed=128, Softmax

.. image:: ../_static/71.png
   :alt: Forward pass time (ms) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=128, Softmax kernel
   :width: 600px
   :align: center

Backward pass time (ms), embed=128, Softmax

.. image:: ../_static/72.png
   :alt: Backward pass time (ms) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=128, Softmax kernel
   :width: 600px
   :align: center

Forward pass peak GPU memory (MB), embed=128, Softmax

.. image:: ../_static/73.png
   :alt: Forward pass peak GPU memory (MB) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=128, Softmax kernel
   :width: 600px
   :align: center

Backward pass peak GPU memory (MB), embed=128, Softmax

.. image:: ../_static/74.png
   :alt: Backward pass peak GPU memory (MB) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=128, Softmax kernel
   :width: 600px
   :align: center

Embedding dimension 256 — ReLU activation
------------------------------------------

Forward pass time (ms), embed=256, ReLU

.. image:: ../_static/75.png
   :alt: Forward pass time (ms) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=256, ReLU kernel
   :width: 600px
   :align: center

Backward pass time (ms), embed=256, ReLU

.. image:: ../_static/76.png
   :alt: Backward pass time (ms) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=256, ReLU kernel
   :width: 600px
   :align: center

Forward pass peak GPU memory (MB), embed=256, ReLU

.. image:: ../_static/77.png
   :alt: Forward pass peak GPU memory (MB) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=256, ReLU kernel
   :width: 600px
   :align: center

Backward pass peak GPU memory (MB), embed=256, ReLU

.. image:: ../_static/78.png
   :alt: Backward pass peak GPU memory (MB) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=256, ReLU kernel
   :width: 600px
   :align: center

Embedding dimension 256 — Softmax activation
---------------------------------------------

Forward pass time (ms), embed=256, Softmax

.. image:: ../_static/79.png
   :alt: Forward pass time (ms) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=256, Softmax kernel
   :width: 600px
   :align: center

Backward pass time (ms), embed=256, Softmax

.. image:: ../_static/80.png
   :alt: Backward pass time (ms) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=256, Softmax kernel
   :width: 600px
   :align: center

Forward pass peak GPU memory (MB), embed=256, Softmax

.. image:: ../_static/81.png
   :alt: Forward pass peak GPU memory (MB) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=256, Softmax kernel
   :width: 600px
   :align: center

Backward pass peak GPU memory (MB), embed=256, Softmax

.. image:: ../_static/82.png
   :alt: Backward pass peak GPU memory (MB) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=256, Softmax kernel
   :width: 600px
   :align: center

Embedding dimension 512 — ReLU activation
------------------------------------------

Forward pass time (ms), embed=512, ReLU

.. image:: ../_static/83.png
   :alt: Forward pass time (ms) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=512, ReLU kernel
   :width: 600px
   :align: center

Backward pass time (ms), embed=512, ReLU

.. image:: ../_static/84.png
   :alt: Backward pass time (ms) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=512, ReLU kernel
   :width: 600px
   :align: center

Forward pass peak GPU memory (MB), embed=512, ReLU

.. image:: ../_static/85.png
   :alt: Forward pass peak GPU memory (MB) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=512, ReLU kernel
   :width: 600px
   :align: center

Backward pass peak GPU memory (MB), embed=512, ReLU

.. image:: ../_static/86.png
   :alt: Backward pass peak GPU memory (MB) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=512, ReLU kernel
   :width: 600px
   :align: center

Embedding dimension 512 — Softmax activation
---------------------------------------------

Forward pass time (ms), embed=512, Softmax

.. image:: ../_static/87.png
   :alt: Forward pass time (ms) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=512, Softmax kernel
   :width: 600px
   :align: center

Backward pass time (ms), embed=512, Softmax

.. image:: ../_static/88.png
   :alt: Backward pass time (ms) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=512, Softmax kernel
   :width: 600px
   :align: center

Forward pass peak GPU memory (MB), embed=512, Softmax

.. image:: ../_static/89.png
   :alt: Forward pass peak GPU memory (MB) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=512, Softmax kernel
   :width: 600px
   :align: center

Backward pass peak GPU memory (MB), embed=512, Softmax

.. image:: ../_static/90.png
   :alt: Backward pass peak GPU memory (MB) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=512, Softmax kernel
   :width: 600px
   :align: center

Embedding dimension 1024 — ReLU activation
-------------------------------------------

Forward pass time (ms), embed=1024, ReLU

.. image:: ../_static/91.png
   :alt: Forward pass time (ms) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=1024, ReLU kernel
   :width: 600px
   :align: center

Backward pass time (ms), embed=1024, ReLU

.. image:: ../_static/92.png
   :alt: Backward pass time (ms) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=1024, ReLU kernel
   :width: 600px
   :align: center

Forward pass peak GPU memory (MB), embed=1024, ReLU

.. image:: ../_static/93.png
   :alt: Forward pass peak GPU memory (MB) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=1024, ReLU kernel
   :width: 600px
   :align: center

Backward pass peak GPU memory (MB), embed=1024, ReLU

.. image:: ../_static/94.png
   :alt: Backward pass peak GPU memory (MB) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=1024, ReLU kernel
   :width: 600px
   :align: center

Embedding dimension 1024 — Softmax activation
----------------------------------------------

Forward pass time (ms), embed=1024, Softmax

.. image:: ../_static/95.png
   :alt: Forward pass time (ms) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=1024, Softmax kernel
   :width: 600px
   :align: center

Backward pass time (ms), embed=1024, Softmax

.. image:: ../_static/96.png
   :alt: Backward pass time (ms) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=1024, Softmax kernel
   :width: 600px
   :align: center

Forward pass peak GPU memory (MB), embed=1024, Softmax

.. image:: ../_static/97.png
   :alt: Forward pass peak GPU memory (MB) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=1024, Softmax kernel
   :width: 600px
   :align: center

Backward pass peak GPU memory (MB), embed=1024, Softmax

.. image:: ../_static/98.png
   :alt: Backward pass peak GPU memory (MB) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=1024, Softmax kernel
   :width: 600px
   :align: center
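
The figures on this page report forward and backward pass time (ms) and peak GPU memory (MB). As a rough illustration of how such numbers could be collected, the sketch below times ``nn.MultiheadAttention`` in self-attention mode. It is not the script that produced the plots: the helper name ``benchmark``, the head count, the batch size, the sequence length, and the warm-up and iteration counts are all assumptions made for this example. A module that accepts the same ``(query, key, value, need_weights=...)`` call could be passed in its place; whether ``RandMultiHeadAttention`` does so is not stated here.

.. code-block:: python

   import time

   import torch
   from torch import nn


   def benchmark(module, x, n_warmup=3, n_iter=10):
       """Time forward and backward passes of a self-attention module.

       Hypothetical helper. Returns mean times in ms over n_iter runs and,
       on CUDA only, the peak allocated memory in MB from the last run
       (None on CPU).
       """
       on_cuda = x.is_cuda

       def sync():
           # CUDA kernels run asynchronously, so wait before reading the clock.
           if on_cuda:
               torch.cuda.synchronize()

       def reset_peak():
           if on_cuda:
               torch.cuda.reset_peak_memory_stats()

       def peak_mb():
           return torch.cuda.max_memory_allocated() / 2**20 if on_cuda else None

       # Warm-up iterations are excluded from the measurements.
       for _ in range(n_warmup):
           module(x, x, x, need_weights=False)[0].sum().backward()
           module.zero_grad(set_to_none=True)

       fwd_ms = bwd_ms = 0.0
       fwd_mb = bwd_mb = None
       for _ in range(n_iter):
           # Forward pass: time and peak memory.
           reset_peak()
           sync()
           t0 = time.perf_counter()
           out = module(x, x, x, need_weights=False)[0]
           sync()
           t1 = time.perf_counter()
           fwd_mb = peak_mb()

           # Backward pass: measured separately from the forward pass.
           loss = out.sum()
           reset_peak()
           sync()
           t2 = time.perf_counter()
           loss.backward()
           sync()
           t3 = time.perf_counter()
           bwd_mb = peak_mb()
           module.zero_grad(set_to_none=True)

           fwd_ms += (t1 - t0) * 1e3
           bwd_ms += (t3 - t2) * 1e3

       return {
           "forward_ms": fwd_ms / n_iter,
           "backward_ms": bwd_ms / n_iter,
           "forward_mem_mb": fwd_mb,
           "backward_mem_mb": bwd_mb,
       }


   if __name__ == "__main__":
       device = "cuda" if torch.cuda.is_available() else "cpu"
       # Assumed sizes: embed=128 matches the first section; the rest are arbitrary.
       mha = nn.MultiheadAttention(128, num_heads=8, batch_first=True).to(device)
       x = torch.randn(4, 256, 128, device=device)  # (batch, sequence, embed)
       print(benchmark(mha, x))

The memory fields are only populated on a CUDA device. The explicit synchronization matters there because, without it, the timer would mostly measure kernel launch overhead rather than execution time.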