Randomized Multi-Head Attention


Embedding dimension 128, ReLU activation

Forward pass time (ms) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=128, ReLU kernel

Backward pass time (ms) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=128, ReLU kernel

Forward pass peak GPU memory (MB) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=128, ReLU kernel

Backward pass peak GPU memory (MB) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=128, ReLU kernel

Embedding dimension 128, Softmax activation

Forward pass time (ms) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=128, Softmax kernel

Backward pass time (ms) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=128, Softmax kernel

Forward pass peak GPU memory (MB) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=128, Softmax kernel

Backward pass peak GPU memory (MB) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=128, Softmax kernel

Embedding dimension 256, ReLU activation

Forward pass time (ms) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=256, ReLU kernel

Backward pass time (ms) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=256, ReLU kernel

Forward pass peak GPU memory (MB) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=256, ReLU kernel

Backward pass peak GPU memory (MB) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=256, ReLU kernel

Embedding dimension 256, Softmax activation

Forward pass time (ms) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=256, Softmax kernel

Backward pass time (ms) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=256, Softmax kernel

Forward pass peak GPU memory (MB) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=256, Softmax kernel

Backward pass peak GPU memory (MB) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=256, Softmax kernel

Embedding dimension 512, ReLU activation

Forward pass time (ms) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=512, ReLU kernel

Backward pass time (ms) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=512, ReLU kernel

Forward pass peak GPU memory (MB) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=512, ReLU kernel

Backward pass peak GPU memory (MB) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=512, ReLU kernel

Embedding dimension 512, Softmax activation

Forward pass time (ms) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=512, Softmax kernel

Backward pass time (ms) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=512, Softmax kernel

Forward pass peak GPU memory (MB) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=512, Softmax kernel

Backward pass peak GPU memory (MB) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=512, Softmax kernel

Embedding dimension 1024, ReLU activation

Forward pass time (ms) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=1024, ReLU kernel

Backward pass time (ms) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=1024, ReLU kernel

Forward pass peak GPU memory (MB) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=1024, ReLU kernel

Backward pass peak GPU memory (MB) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=1024, ReLU kernel

Embedding dimension 1024, Softmax activation

Forward pass time (ms) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=1024, Softmax kernel

Backward pass time (ms) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=1024, Softmax kernel

Forward pass peak GPU memory (MB) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=1024, Softmax kernel

Backward pass peak GPU memory (MB) for RandMultiHeadAttention vs nn.MultiheadAttention, embed=1024, Softmax kernel
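
The figures on this page report forward and backward pass time (ms) and peak GPU memory (MB), but the page does not show how those numbers were collected. The following is only a rough sketch of one way such measurements could be taken in PyTorch, not the benchmark script behind the plots. The names `benchmark` and `self_attention`, the batch size, sequence length, head count, and iteration counts are all assumptions made for illustration. Only `nn.MultiheadAttention` is exercised, because the constructor and call signature of `RandMultiHeadAttention` are not given here; a matching `call` function would be needed to time it the same way.

```python
import statistics
import time

import torch
from torch import nn


def benchmark(module, call, embed_dim, seq_len=128, batch=8, iters=10, warmup=3):
    """Hypothetical helper: mean forward/backward time (ms) and peak GPU memory."""
    on_gpu = torch.cuda.is_available()
    device = torch.device("cuda" if on_gpu else "cpu")
    module = module.to(device)
    x = torch.randn(batch, seq_len, embed_dim, device=device, requires_grad=True)

    def sync():
        # CUDA kernels run asynchronously, so wait for them before reading the clock.
        if on_gpu:
            torch.cuda.synchronize()

    def reset_peak():
        # Resets the peak counter to the memory currently allocated (weights, input).
        if on_gpu:
            torch.cuda.reset_peak_memory_stats()

    def peak_mb():
        # Bytes divided by 2**20; None when no GPU is present.
        return torch.cuda.max_memory_allocated() / 2**20 if on_gpu else None

    rows = []
    for step in range(warmup + iters):
        module.zero_grad(set_to_none=True)
        x.grad = None

        reset_peak()
        sync()
        start = time.perf_counter()
        out = call(module, x)
        sync()
        fwd_ms = (time.perf_counter() - start) * 1e3
        fwd_mb = peak_mb()

        reset_peak()
        start = time.perf_counter()
        out.sum().backward()
        sync()
        bwd_ms = (time.perf_counter() - start) * 1e3
        bwd_mb = peak_mb()

        if step >= warmup:  # discard warm-up iterations
            rows.append((fwd_ms, bwd_ms, fwd_mb, bwd_mb))

    fwd, bwd, fmem, bmem = zip(*rows)
    return {
        "forward_ms": statistics.mean(fwd),
        "backward_ms": statistics.mean(bwd),
        "forward_peak_mb": max(fmem) if on_gpu else None,
        "backward_peak_mb": max(bmem) if on_gpu else None,
    }


def self_attention(module, x):
    # nn.MultiheadAttention returns (output, weights); keep only the output.
    return module(x, x, x, need_weights=False)[0]


if __name__ == "__main__":
    baseline = nn.MultiheadAttention(embed_dim=128, num_heads=8, batch_first=True)
    print(benchmark(baseline, self_attention, embed_dim=128))
```

The explicit `torch.cuda.synchronize()` calls matter on a GPU: without them the timer would mostly measure kernel launch overhead rather than the work itself. On a machine without CUDA the sketch still reports times, and the memory entries are `None`.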