Neural Networks API#

The panther.nn module provides sketched neural network layers that reduce memory usage while aiming to keep accuracy close to that of the dense layers they replace.

Note

The neural network classes below require compiled C++ extensions (pawX) which may not be available in all environments.

Linear Layers#

SKLinear#

class SKLinear(in_features, out_features, num_terms, low_rank, W_init=None, bias=True, dtype=None, device=None)#

SKLinear is a custom linear (fully connected) layer with sketching and optional low-rank approximation, designed for efficient computation and potential GPU Tensor Core acceleration.

Parameters:

in_features (int) – Number of input features.
out_features (int) – Number of output features.
num_terms (int) – Number of sketching terms (controls the number of low-rank approximations).
low_rank (int) – Rank of the low-rank approximation for each term.
W_init (torch.Tensor) – Optional initial weight matrix. If None, weights are initialized using Kaiming uniform initialization.
bias (bool) – If True, adds a learnable bias to the output. Default: True.
dtype (torch.dtype) – Data type of the parameters.
device (torch.device) – Device to store the parameters.

SKLinear_triton#

class SKLinear_triton(in_features, out_features, num_terms, low_rank, W_init=None, bias=True, dtype=None, device=None)#

Triton-accelerated version of SKLinear for enhanced GPU performance.

Parameters:

in_features (int) – Number of input features.
out_features (int) – Number of output features.
num_terms (int) – Number of sketching terms.
low_rank (int) – Rank of the low-rank approximation for each term.
W_init (torch.Tensor) – Optional initial weight matrix.
bias (bool) – If True, adds a learnable bias to the output.
dtype (torch.dtype) – Data type of the parameters.
device (torch.device) – Device to store the parameters.

Convolution Layers#

SKConv2d#

class SKConv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, bias=True, num_terms=1, low_rank=64, W_init=None, dtype=None, device=None)#

Sketched 2D convolution layer for memory-efficient convolution operations.

Parameters:

in_channels (int) – Number of input channels.
out_channels (int) – Number of output channels.
kernel_size (int) – Size of the convolution kernel.
stride (int) – Stride of the convolution operation.
padding (int) – Padding applied to the input.
bias (bool) – If True, adds a learnable bias.
num_terms (int) – Number of sketching terms.
low_rank (int) – Rank of the low-rank approximation.
W_init (torch.Tensor) – Optional initial weight matrix.
dtype (torch.dtype) – Data type of the parameters.
device (torch.device) – Device to store the parameters.

Attention Mechanisms#

RandMultiHeadAttention#

class RandMultiHeadAttention(embed_dim, num_heads, num_random_features, dropout=0.0, bias=True, kernel_fn="softmax", iscausal=False, SRPE=None, device=None, dtype=None)#

Randomized Multi-Head Attention mechanism using random feature approximation for efficient attention computation.

Parameters:

embed_dim (int) – Total dimension of the model.
num_heads (int) – Number of parallel attention heads.
num_random_features (int) – Number of random features for the projection matrix.
dropout (float) – Dropout probability. Default: 0.0.
bias (bool) – If True, adds bias to input/output projections.
kernel_fn (str) – Kernel function to use ("softmax" or "relu"). Default: "softmax".
iscausal (bool) – If True, applies causal masking for autoregressive tasks. Default: False.
SRPE – Sketched Random Positional Encoding. Default: None.
device (torch.device) – Device to store the parameters.
dtype (torch.dtype) – Data type of the parameters.

Examples#

Basic Sketched Linear Layer

import torch
import panther as pr

# Create a sketched linear layer
layer = pr.nn.SKLinear(
    in_features=512,
    out_features=256,
    num_terms=2,        # Number of sketching terms
    low_rank=16,        # Rank of each sketch
    bias=True
)

# Forward pass
x = torch.randn(64, 512)  # batch_size=64
y = layer(x)
print(f"Output shape: {y.shape}")  # (64, 256)

Replacing Standard Layers

import torch.nn as nn
import panther as pr

class StandardMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(784, 512)
        self.layer2 = nn.Linear(512, 256)
        self.layer3 = nn.Linear(256, 10)

class SketchedMLP(nn.Module):
    def __init__(self):
        super().__init__()
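        # Drop-in replacements; memory savings depend on num_terms and low_rank
        self.layer1 = pr.nn.SKLinear(784, 512, num_terms=2, low_rank=64)
        self.layer2 = pr.nn.SKLinear(512, 256, num_terms=1, low_rank=32)
        # Note: this small 256 -> 10 layer does not meet the parameter count
        # constraint under Parameter Selection Guidelines (8512 vs. 2560),
        # so nn.Linear may be the better choice here.
        self.layer3 = pr.nn.SKLinear(256, 10, num_terms=1, low_rank=16)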

Sketched Convolution

import torch
import panther as pr

# Sketched 2D convolution layer
conv_layer = pr.nn.SKConv2d(
    in_channels=64,
    out_channels=128,
    kernel_size=3,
    stride=1,
    padding=1,
    num_terms=1,
    low_rank=16
)

# Forward pass
x = torch.randn(32, 64, 56, 56)  # (batch, channels, height, width)
y = conv_layer(x)
print(f"Output shape: {y.shape}")  # (32, 128, 56, 56)
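
The output shape in the comment above follows from ordinary convolution size arithmetic. The sketch below assumes SKConv2d uses the same formula as torch.nn.Conv2d with dilation 1, which the 56 to 56 result is consistent with but this page does not state. conv2d_output_size is a hypothetical helper for illustration, not part of panther.

```python
def conv2d_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution with dilation 1 (assumed formula)."""
    return (size + 2 * padding - kernel_size) // stride + 1


# Settings from the example above: 3x3 kernel, stride 1, padding 1
height_out = conv2d_output_size(56, kernel_size=3, stride=1, padding=1)
width_out = conv2d_output_size(56, kernel_size=3, stride=1, padding=1)
print(height_out, width_out)  # 56 56
```

With padding=1 and a 3x3 kernel at stride 1, the spatial size is preserved, so only the channel dimension changes from 64 to 128.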

Multi-Head Attention with Sketching

from panther.nn import RandMultiHeadAttention
from panther.nn.pawXimpl import sinSRPE
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Randomized multi-head attention with SRPE
spre = sinSRPE(
    num_heads=8,
    perHead_in=512 // 8,  # embed_dim // num_heads
    sines=16,
    num_realizations=256,
    device=device,
    dtype=torch.float32
)

randomized_attention = RandMultiHeadAttention(
    embed_dim=512,
    num_heads=8,
    num_random_features=256,  # Number of random features for approximation
    kernel_fn="softmax",      # Can be "softmax" or "relu"
    SRPE=spre,                # Sketched Random Positional Encoding
    device=device
)

# Self-attention
x = torch.randn(32, 100, 512, device=device)  # (batch, seq_len, embed_dim)
output, attn_weights = randomized_attention(x, x, x)
print(f"Output shape: {output.shape}")  # (32, 100, 512)

Parameter Selection Guidelines#

Choosing num_terms and low_rank

The key parameters for sketched layers are:

num_terms: Number of low-rank terms in the approximation
low_rank: Rank of each term

Rules of thumb:

from panther.nn import SKLinear

# Conservative: Fewer parameters, faster but less accurate
conservative = SKLinear(1024, 512, num_terms=1, low_rank=32)

# Balanced: Good accuracy/speed tradeoff
balanced = SKLinear(1024, 512, num_terms=1, low_rank=64)

# Aggressive: More parameters, slower but more accurate
aggressive = SKLinear(1024, 512, num_terms=2, low_rank=64)

Parameter count constraint:

The sketched layer should have fewer parameters than the dense layer it replaces:

\[2 \times \text{num\_terms} \times \text{low\_rank} \times (\text{in\_features} + \text{out\_features}) < \text{in\_features} \times \text{out\_features}\]
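
To make the constraint concrete, the sketch below evaluates both sides of the inequality for the three rule-of-thumb configurations. It takes the left-hand side of the formula at face value as the sketched parameter count and ignores the bias. The names sketched_params, dense_params, and satisfies_constraint are hypothetical helpers for illustration, not part of panther.

```python
def sketched_params(in_features, out_features, num_terms, low_rank):
    # Left-hand side of the constraint above (bias not counted)
    return 2 * num_terms * low_rank * (in_features + out_features)


def dense_params(in_features, out_features):
    # Weight count of the dense layer being replaced (bias not counted)
    return in_features * out_features


def satisfies_constraint(in_features, out_features, num_terms, low_rank):
    # True when the sketched layer is smaller than the dense one
    return sketched_params(
        in_features, out_features, num_terms, low_rank
    ) < dense_params(in_features, out_features)


# (num_terms, low_rank) pairs from the rules of thumb, for a 1024 -> 512 layer
configs = {"conservative": (1, 32), "balanced": (1, 64), "aggressive": (2, 64)}
for name, (num_terms, low_rank) in configs.items():
    sketched = sketched_params(1024, 512, num_terms, low_rank)
    dense = dense_params(1024, 512)
    print(f"{name}: {sketched} of {dense} weights ({sketched / dense:.2%})")
```

Under this formula the three configurations use 18.75%, 37.5%, and 75% of the dense weight count. Small layers can fail the check: satisfies_constraint(256, 10, 1, 16) is False, because 8512 sketched parameters exceed the 2560 dense weights.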

GPU Optimization#

Tensor Core Requirements

For optimal GPU performance on modern hardware:

import torch
from panther.nn import SKLinear

# All dimensions should be multiples of 16
layer = SKLinear(
    in_features=1024,    # ✓ Multiple of 16
    out_features=512,    # ✓ Multiple of 16
    num_terms=2,
    low_rank=32          # ✓ Multiple of 16
)

# Batch size should also be a multiple of 16
x = torch.randn(128, 1024)  # ✓ batch_size=128
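
When a model's dimensions are not already multiples of 16, one option is to round them up to the next multiple. A minimal sketch, where round_up_to_multiple is a hypothetical helper (not part of panther); whether padding a given dimension is worthwhile is left as an assumption to check for your model:

```python
def round_up_to_multiple(value, multiple=16):
    """Return the smallest multiple of `multiple` that is >= value."""
    return ((value + multiple - 1) // multiple) * multiple


# Dimensions that already satisfy the guideline are returned unchanged
for dim in (784, 1000, 10):
    print(dim, "->", round_up_to_multiple(dim))
```

For example, 784 is already a multiple of 16 and stays as it is, while 1000 rounds up to 1008 and 10 rounds up to 16.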
