Neural Networks API#
The panther.nn module provides sketched neural network layers that reduce memory usage while aiming to preserve the accuracy of their dense counterparts.
Note
The neural network classes below require the compiled C++ extension (pawX), which may not be available in all environments.
Linear Layers#
SKLinear#
- class SKLinear(in_features, out_features, num_terms, low_rank, W_init=None, bias=True, dtype=None, device=None)#
SKLinear is a custom linear (fully connected) layer with sketching and optional low-rank approximation, designed for efficient computation and potential GPU Tensor Core acceleration.
- Parameters:
in_features (int) – Number of input features.
out_features (int) – Number of output features.
num_terms (int) – Number of sketching terms (controls the number of low-rank approximations).
low_rank (int) – Rank of the low-rank approximation for each term.
W_init (torch.Tensor) – Optional initial weight matrix. If None, weights are initialized using Kaiming uniform initialization.
bias (bool) – If True, adds a learnable bias to the output. Default: True.
dtype (torch.dtype) – Data type of the parameters.
device (torch.device) – Device to store the parameters.
SKLinear_triton#
- class SKLinear_triton(in_features, out_features, num_terms, low_rank, W_init=None, bias=True, dtype=None, device=None)#
Triton-accelerated version of SKLinear for enhanced GPU performance.
- Parameters:
in_features (int) – Number of input features.
out_features (int) – Number of output features.
num_terms (int) – Number of sketching terms.
low_rank (int) – Rank of the low-rank approximation for each term.
W_init (torch.Tensor) – Optional initial weight matrix.
bias (bool) – If True, adds a learnable bias to the output.
dtype (torch.dtype) – Data type of the parameters.
device (torch.device) – Device to store the parameters.
Convolution Layers#
SKConv2d#
- class SKConv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, bias=True, num_terms=1, low_rank=64, W_init=None, dtype=None, device=None)#
Sketched 2D convolution layer for memory-efficient convolution operations.
- Parameters:
in_channels (int) – Number of input channels.
out_channels (int) – Number of output channels.
kernel_size (int) – Size of the convolution kernel.
stride (int) – Stride of the convolution operation.
padding (int) – Padding applied to the input.
bias (bool) – If True, adds a learnable bias.
num_terms (int) – Number of sketching terms.
low_rank (int) – Rank of the low-rank approximation.
W_init (torch.Tensor) – Optional initial weight matrix.
dtype (torch.dtype) – Data type of the parameters.
device (torch.device) – Device to store the parameters.
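To make the effect of kernel_size, stride and padding concrete, the sketch below computes the spatial output size with the standard convolution arithmetic (dilation 1). It assumes SKConv2d follows the same shape rule as a regular 2D convolution; `conv2d_output_size` is a hypothetical helper and not part of panther.

```python
def conv2d_output_size(size: int, kernel_size: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution along one axis (dilation = 1)."""
    padded = size + 2 * padding
    if padded < kernel_size:
        raise ValueError("kernel does not fit into the padded input")
    # Standard formula: floor((size + 2*padding - kernel_size) / stride) + 1
    return (padded - kernel_size) // stride + 1


# A 3x3 kernel with stride 1 and padding 1 keeps the spatial size unchanged.
print(conv2d_output_size(56, kernel_size=3, stride=1, padding=1))
# The same kernel with stride 2 roughly halves it.
print(conv2d_output_size(56, kernel_size=3, stride=2, padding=1))
```

With stride=1, choosing padding = (kernel_size - 1) // 2 for an odd kernel keeps height and width unchanged.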
Attention Mechanisms#
RandMultiHeadAttention#
- class RandMultiHeadAttention(embed_dim, num_heads, num_random_features, dropout=0.0, bias=True, kernel_fn="softmax", iscausal=False, SRPE=None, device=None, dtype=None)#
Randomized Multi-Head Attention mechanism using random feature approximation for efficient attention computation.
- Parameters:
embed_dim (int) – Total dimension of the model.
num_heads (int) – Number of parallel attention heads.
num_random_features (int) – Number of random features for the projection matrix.
dropout (float) – Dropout probability. Default: 0.0.
bias (bool) – If True, adds bias to input/output projections.
kernel_fn (str) – Kernel function to use ("softmax" or "relu"). Default: "softmax".
iscausal (bool) – If True, applies causal masking for autoregressive tasks. Default: False.
SRPE – Sketched Random Positional Encoding. Default: None.
device (torch.device) – Device to store the parameters.
dtype (torch.dtype) – Data type of the parameters.
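When choosing num_random_features, it can help to estimate when a random-feature approximation is cheaper than exact attention. The sketch below counts rough multiply-adds per head for a generic random-feature (linear) attention scheme. It ignores the input/output projections, normalization and constant factors, and it is an assumption about the general technique rather than a description of panther's kernels. Both function names are illustrative.

```python
def exact_attention_cost(seq_len: int, head_dim: int) -> int:
    """Multiply-adds per head for Q @ K^T plus weights @ V."""
    return 2 * seq_len * seq_len * head_dim


def random_feature_attention_cost(seq_len: int, head_dim: int, num_random_features: int) -> int:
    """Multiply-adds per head for a generic random-feature attention."""
    # Map queries and keys to num_random_features dimensions.
    project = 2 * seq_len * head_dim * num_random_features
    # Form K'^T @ V once, then multiply it by Q'.
    aggregate = 2 * seq_len * num_random_features * head_dim
    return project + aggregate


for seq_len in (100, 512, 4096):
    exact = exact_attention_cost(seq_len, head_dim=64)
    approx = random_feature_attention_cost(seq_len, head_dim=64, num_random_features=256)
    print(f"seq_len={seq_len}: exact={exact}, random features={approx}")
```

By this estimate the random-feature path only wins on compute once seq_len exceeds roughly 2 * num_random_features. For short sequences the benefit, if any, comes from not materializing the seq_len x seq_len attention matrix.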
Examples#
Basic Sketched Linear Layer
import torch
import panther as pr

# Create a sketched linear layer
layer = pr.nn.SKLinear(
    in_features=512,
    out_features=256,
    num_terms=2,   # Number of sketching terms
    low_rank=16,   # Rank of each sketch
    bias=True,
)

# Forward pass
x = torch.randn(64, 512)  # batch_size=64
y = layer(x)
print(f"Output shape: {y.shape}")  # (64, 256)
Replacing Standard Layers
import torch.nn as nn
import panther as pr

class StandardMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(784, 512)
        self.layer2 = nn.Linear(512, 256)
        self.layer3 = nn.Linear(256, 10)

class SketchedMLP(nn.Module):
    def __init__(self):
        super().__init__()
        # Drop-in replacements with memory savings
        self.layer1 = pr.nn.SKLinear(784, 512, num_terms=2, low_rank=64)
        self.layer2 = pr.nn.SKLinear(512, 256, num_terms=1, low_rank=32)
        self.layer3 = pr.nn.SKLinear(256, 10, num_terms=1, low_rank=16)
Sketched Convolution
import torch
import panther as pr

# Sketched 2D convolution layer
conv_layer = pr.nn.SKConv2d(
    in_channels=64,
    out_channels=128,
    kernel_size=3,
    stride=1,
    padding=1,
    num_terms=1,
    low_rank=16,
)

# Forward pass
x = torch.randn(32, 64, 56, 56)  # (batch, channels, height, width)
y = conv_layer(x)
print(f"Output shape: {y.shape}")  # (32, 128, 56, 56)
Multi-Head Attention with Sketching
import torch
from panther.nn import RandMultiHeadAttention
from panther.nn.pawXimpl import sinSRPE

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Randomized multi-head attention with SRPE
srpe = sinSRPE(
    num_heads=8,
    perHead_in=512 // 8,  # embed_dim // num_heads
    sines=16,
    num_realizations=256,
    device=device,
    dtype=torch.float32,
)

randomized_attention = RandMultiHeadAttention(
    embed_dim=512,
    num_heads=8,
    num_random_features=256,  # Number of random features for approximation
    kernel_fn="softmax",      # Can be "softmax" or "relu"
    SRPE=srpe,                # Sketched Random Positional Encoding
    device=device,
)

# Self-attention
x = torch.randn(32, 100, 512, device=device)  # (batch, seq_len, embed_dim)
output, attn_weights = randomized_attention(x, x, x)
print(f"Output shape: {output.shape}")  # (32, 100, 512)
Parameter Selection Guidelines#
Choosing num_terms and low_rank
The key parameters for sketched layers are:
num_terms: Number of low-rank terms in the approximation
low_rank: Rank of each term
Rules of thumb:
from panther.nn import SKLinear
# Conservative: Fewer parameters, faster but less accurate
conservative = SKLinear(1024, 512, num_terms=1, low_rank=32)
# Balanced: Good accuracy/speed tradeoff
balanced = SKLinear(1024, 512, num_terms=1, low_rank=64)
# Aggressive: More parameters, slower but more accurate
aggressive = SKLinear(1024, 512, num_terms=2, low_rank=64)
Parameter count constraint:
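The total parameter count of the sketched layer should be less than that of the dense layer it replaces (in_features × out_features weights for a standard linear layer).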
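One way to sanity-check this constraint before building a model is sketched below. It assumes that each sketching term stores one factor of shape (in_features, low_rank) and one of shape (low_rank, out_features). That layout is an assumption made for illustration, so the exact count should be confirmed against the SKLinear implementation. The helper names are hypothetical and not part of panther.

```python
def dense_params(in_features: int, out_features: int) -> int:
    """Weight count of a standard dense layer (bias excluded)."""
    return in_features * out_features


def sketched_params(in_features: int, out_features: int, num_terms: int, low_rank: int) -> int:
    """Assumed weight count: two low-rank factors per sketching term."""
    return num_terms * low_rank * (in_features + out_features)


def saves_parameters(in_features: int, out_features: int, num_terms: int, low_rank: int) -> bool:
    """True if the assumed sketched count is below the dense count."""
    sketched = sketched_params(in_features, out_features, num_terms, low_rank)
    return sketched < dense_params(in_features, out_features)


configs = [
    (1024, 512, 1, 32),
    (1024, 512, 1, 64),
    (1024, 512, 2, 64),
    (256, 10, 1, 16),
]
for in_f, out_f, terms, rank in configs:
    sketched = sketched_params(in_f, out_f, terms, rank)
    dense = dense_params(in_f, out_f)
    verdict = "smaller" if sketched < dense else "not smaller"
    print(f"{in_f}->{out_f}, num_terms={terms}, low_rank={rank}: "
          f"{sketched} vs {dense} dense ({verdict})")
```

Under this assumption, layers with a very small output dimension (for example 256 to 10 with low_rank=16) gain nothing, because low_rank * (in_features + out_features) already exceeds in_features * out_features.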
GPU Optimization#
Tensor Core Requirements
For optimal GPU performance on modern hardware:
import torch
from panther.nn import SKLinear

# All dimensions should be multiples of 16
layer = SKLinear(
    in_features=1024,  # ✓ Multiple of 16
    out_features=512,  # ✓ Multiple of 16
    num_terms=2,
    low_rank=32,       # ✓ Multiple of 16
)

# Batch size should also be a multiple of 16
x = torch.randn(128, 1024)  # ✓ batch_size=128
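When existing layer sizes are not already aligned, a small helper can flag the offending dimensions and suggest the next multiple of 16. This is a minimal sketch; `round_up` and `misaligned` are hypothetical helpers and not part of panther.

```python
def round_up(value: int, multiple: int = 16) -> int:
    """Smallest multiple of `multiple` that is >= value."""
    if value <= 0 or multiple <= 0:
        raise ValueError("value and multiple must be positive")
    return ((value + multiple - 1) // multiple) * multiple


def misaligned(multiple: int = 16, **dims: int) -> dict:
    """Map each dimension that is not a multiple of `multiple` to a suggested value."""
    return {name: round_up(v, multiple) for name, v in dims.items() if v % multiple != 0}


suggestions = misaligned(in_features=784, out_features=10, low_rank=16, batch_size=100)
print(suggestions)  # only out_features and batch_size need padding
```

Padding the batch is usually harmless, whereas changing out_features alters the model's output size, so that suggestion only applies to hidden layers.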
See Also#
Sketched Linear Layers - In-depth tutorial on sketched linear layers