Neural Networks API#

The panther.nn module provides sketched alternatives to standard linear, convolution, and attention layers. They trade a controlled amount of approximation error for lower memory usage.

Note

The neural network classes below require the compiled C++ extension (pawX), which may not be available in all environments.

Linear Layers#

SKLinear#

class SKLinear(in_features, out_features, num_terms, low_rank, W_init=None, bias=True, dtype=None, device=None)#

SKLinear is a custom linear (fully connected) layer with sketching and optional low-rank approximation, designed for efficient computation and potential GPU Tensor Core acceleration.

Parameters:
  • in_features (int) – Number of input features.

  • out_features (int) – Number of output features.

  • num_terms (int) – Number of sketching terms (controls the number of low-rank approximations).

  • low_rank (int) – Rank of the low-rank approximation for each term.

  • W_init (torch.Tensor) – Optional initial weight matrix. If None, weights are initialized using Kaiming uniform initialization.

  • bias (bool) – If True, adds a learnable bias to the output. Default: True.

  • dtype (torch.dtype) – Data type of the parameters.

  • device (torch.device) – Device to store the parameters.

SKLinear_triton#

class SKLinear_triton(in_features, out_features, num_terms, low_rank, W_init=None, bias=True, dtype=None, device=None)#

Triton-accelerated version of SKLinear for enhanced GPU performance.

Parameters:
  • in_features (int) – Number of input features.

  • out_features (int) – Number of output features.

  • num_terms (int) – Number of sketching terms.

  • low_rank (int) – Rank of the low-rank approximation for each term.

  • W_init (torch.Tensor) – Optional initial weight matrix.

  • bias (bool) – If True, adds a learnable bias to the output.

  • dtype (torch.dtype) – Data type of the parameters.

  • device (torch.device) – Device to store the parameters.

Convolution Layers#

SKConv2d#

class SKConv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, bias=True, num_terms=1, low_rank=64, W_init=None, dtype=None, device=None)#

Sketched 2D convolution layer for memory-efficient convolution operations.

Parameters:
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • kernel_size (int) – Size of the convolution kernel.

  • stride (int) – Stride of the convolution operation.

  • padding (int) – Padding applied to the input.

  • bias (bool) – If True, adds a learnable bias.

  • num_terms (int) – Number of sketching terms.

  • low_rank (int) – Rank of the low-rank approximation.

  • W_init (torch.Tensor) – Optional initial weight matrix.

  • dtype (torch.dtype) – Data type of the parameters.

  • device (torch.device) – Device to store the parameters.

Attention Mechanisms#

RandMultiHeadAttention#

class RandMultiHeadAttention(embed_dim, num_heads, num_random_features, dropout=0.0, bias=True, kernel_fn="softmax", iscausal=False, SRPE=None, device=None, dtype=None)#

Randomized Multi-Head Attention mechanism using random feature approximation for efficient attention computation.

Parameters:
  • embed_dim (int) – Total dimension of the model.

  • num_heads (int) – Number of parallel attention heads.

  • num_random_features (int) – Number of random features for the projection matrix.

  • dropout (float) – Dropout probability. Default: 0.0.

  • bias (bool) – If True, adds bias to input/output projections.

  • kernel_fn (str) – Kernel function to use ("softmax" or "relu"). Default: "softmax".

  • iscausal (bool) – If True, applies causal masking for autoregressive tasks. Default: False.

  • SRPE – Sketched Random Positional Encoding. Default: None.

  • device (torch.device) – Device to store the parameters.

  • dtype (torch.dtype) – Data type of the parameters.
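
To illustrate what "random feature approximation" buys, here is a minimal NumPy sketch of single-head attention with a ReLU feature map. It is not panther's implementation: the function name relu_feature_attention, the Gaussian projection matrix, and the omission of masking, dropout, and multiple heads are all assumptions made for illustration. The point is that keys and values are summarized once in a (num_random_features, value_dim) matrix, so the cost grows linearly with sequence length instead of quadratically.

```python
import numpy as np


def relu_feature_attention(q, k, v, proj, eps=1e-6):
    """Single-head attention with a ReLU random-feature kernel (illustrative).

    q, k: (seq_len, head_dim); v: (seq_len, value_dim);
    proj: (head_dim, num_random_features) random projection matrix.
    """
    phi_q = np.maximum(q @ proj, 0.0)   # (seq_len, m) non-negative query features
    phi_k = np.maximum(k @ proj, 0.0)   # (seq_len, m) non-negative key features
    kv = phi_k.T @ v                    # (m, value_dim), summed over all keys
    k_sum = phi_k.sum(axis=0)           # (m,), used for normalization
    numerator = phi_q @ kv              # (seq_len, value_dim)
    denominator = phi_q @ k_sum + eps   # (seq_len,)
    return numerator / denominator[:, None]


rng = np.random.default_rng(0)
seq_len, head_dim, num_random_features = 100, 64, 256
q, k, v = (rng.standard_normal((seq_len, head_dim)) for _ in range(3))
# Assumed projection: i.i.d. Gaussian entries scaled by sqrt(head_dim).
proj = rng.standard_normal((head_dim, num_random_features)) / np.sqrt(head_dim)

out = relu_feature_attention(q, k, v, proj)
print(out.shape)  # (100, 64)
```

The result equals normalizing the explicit (seq_len, seq_len) score matrix phi_q @ phi_k.T row by row and multiplying by v, but that matrix is never formed.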

Examples#

Basic Sketched Linear Layer

import torch
import panther as pr

# Create a sketched linear layer
layer = pr.nn.SKLinear(
    in_features=512,
    out_features=256,
    num_terms=2,        # Number of sketching terms
    low_rank=16,        # Rank of each sketch
    bias=True
)

# Forward pass
x = torch.randn(64, 512)  # batch_size=64
y = layer(x)
print(f"Output shape: {y.shape}")  # (64, 256)

Replacing Standard Layers

import torch.nn as nn
import panther as pr

class StandardMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(784, 512)
        self.layer2 = nn.Linear(512, 256)
        self.layer3 = nn.Linear(256, 10)

class SketchedMLP(nn.Module):
    def __init__(self):
        super().__init__()
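        # Drop-in replacements with memory savings
        self.layer1 = pr.nn.SKLinear(784, 512, num_terms=2, low_rank=64)
        self.layer2 = pr.nn.SKLinear(512, 256, num_terms=1, low_rank=32)
        # Kept dense: sketching 256 -> 10 with num_terms=1, low_rank=16 would use
        # 2 * 1 * 16 * (256 + 10) = 8512 parameters, more than the 2560 dense weights
        self.layer3 = nn.Linear(256, 10)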

Sketched Convolution

import torch
import panther as pr

# Sketched 2D convolution layer
conv_layer = pr.nn.SKConv2d(
    in_channels=64,
    out_channels=128,
    kernel_size=3,
    stride=1,
    padding=1,
    num_terms=1,
    low_rank=16
)

# Forward pass
x = torch.randn(32, 64, 56, 56)  # (batch, channels, height, width)
y = conv_layer(x)
print(f"Output shape: {y.shape}")  # (32, 128, 56, 56)
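
The spatial size in the comment above follows from standard convolution arithmetic. The helper below is a small sketch of that calculation; conv2d_output_size is a hypothetical name, and it assumes SKConv2d follows the same shape rule as torch.nn.Conv2d with dilation 1, which this page does not state explicitly.

```python
def conv2d_output_size(size, kernel_size, stride=1, padding=0):
    """Output height or width of a 2D convolution with dilation 1."""
    return (size + 2 * padding - kernel_size) // stride + 1


# Settings from the example above: 56x56 input, 3x3 kernel, stride 1, padding 1.
same_size = conv2d_output_size(56, kernel_size=3, stride=1, padding=1)
print(same_size)  # 56

# Doubling the stride roughly halves the spatial size.
halved = conv2d_output_size(56, kernel_size=3, stride=2, padding=1)
print(halved)  # 28
```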

Multi-Head Attention with Sketching

from panther.nn import RandMultiHeadAttention
from panther.nn.pawXimpl import sinSRPE
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Randomized multi-head attention with SRPE
spre = sinSRPE(
    num_heads=8,
    perHead_in=512 // 8,  # embed_dim // num_heads
    sines=16,
    num_realizations=256,
    device=device,
    dtype=torch.float32
)

randomized_attention = RandMultiHeadAttention(
    embed_dim=512,
    num_heads=8,
    num_random_features=256,  # Number of random features for approximation
    kernel_fn="softmax",      # Can be "softmax" or "relu"
    SRPE=spre,                # Sketched Random Positional Encoding
    device=device
)

# Self-attention
x = torch.randn(32, 100, 512, device=device)  # (batch, seq_len, embed_dim)
output, attn_weights = randomized_attention(x, x, x)
print(f"Output shape: {output.shape}")  # (32, 100, 512)

Parameter Selection Guidelines#

Choosing num_terms and low_rank

The key parameters for sketched layers are:

  • num_terms: Number of low-rank terms in the approximation

  • low_rank: Rank of each term

Rules of thumb:

from panther.nn import SKLinear

# Conservative: Fewer parameters, faster but less accurate
conservative = SKLinear(1024, 512, num_terms=1, low_rank=32)

# Balanced: Good accuracy/speed tradeoff
balanced = SKLinear(1024, 512, num_terms=1, low_rank=64)

# Aggressive: More parameters, slower but more accurate
aggressive = SKLinear(1024, 512, num_terms=2, low_rank=64)

Parameter count constraint:

A sketched layer only saves memory when it has fewer parameters than the dense layer it replaces:

\[2 \times \text{num_terms} \times \text{low_rank} \times (\text{in_features} + \text{out_features}) < \text{in_features} \times \text{out_features}\]
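
The inequality can be checked before a layer is constructed. The helper below is a minimal sketch in plain Python; sketched_param_count and fits_parameter_budget are hypothetical names, not part of panther, and the count simply evaluates the left-hand side of the inequality (bias excluded).

```python
def sketched_param_count(in_features, out_features, num_terms, low_rank):
    """Left-hand side of the constraint: parameters used by the sketched layer."""
    return 2 * num_terms * low_rank * (in_features + out_features)


def fits_parameter_budget(in_features, out_features, num_terms, low_rank):
    """True when the sketched layer is smaller than the dense weight matrix."""
    dense = in_features * out_features
    sketched = sketched_param_count(in_features, out_features, num_terms, low_rank)
    return sketched < dense


# The three presets from the rules of thumb, applied to a 1024 -> 512 layer.
presets = {"conservative": (1, 32), "balanced": (1, 64), "aggressive": (2, 64)}
for name, (num_terms, low_rank) in presets.items():
    used = sketched_param_count(1024, 512, num_terms, low_rank)
    ratio = used / (1024 * 512)
    print(f"{name}: {used} parameters, {ratio:.2%} of the dense layer")
```

For the 1024 -> 512 layer the three presets use 18.75%, 37.50%, and 75.00% of the dense parameter count. Small output layers fail the check quickly: a 256 -> 10 layer with num_terms=1 and low_rank=16 would need 8512 parameters against 2560 dense weights.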

GPU Optimization#

Tensor Core Requirements

To make good use of Tensor Cores, keep the layer dimensions and the batch size at multiples of 16:

import torch
from panther.nn import SKLinear

# All dimensions should be multiples of 16
layer = SKLinear(
    in_features=1024,    # ✓ Multiple of 16
    out_features=512,    # ✓ Multiple of 16
    num_terms=2,
    low_rank=32          # ✓ Multiple of 16
)

# Batch size should also be multiple of 16
x = torch.randn(128, 1024)  # ✓ batch_size=128
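
When a model's natural sizes are not multiples of 16, one option is to round them up when choosing hidden dimensions. The helpers below are a sketch in plain Python; round_up_to_multiple and is_aligned are hypothetical names, not panther functions, and rounding a dimension up changes the model, so it only applies to sizes you are free to choose.

```python
def round_up_to_multiple(value, multiple=16):
    """Smallest multiple of `multiple` that is greater than or equal to value."""
    return ((value + multiple - 1) // multiple) * multiple


def is_aligned(*dims, multiple=16):
    """True when every given dimension is a multiple of `multiple`."""
    return all(d % multiple == 0 for d in dims)


print(round_up_to_multiple(100))      # 112
print(round_up_to_multiple(784))      # 784 (already aligned)
print(is_aligned(1024, 512, 32, 128)) # True, the configuration shown above
print(is_aligned(256, 10))            # False, 10 is not a multiple of 16
```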

See Also#