Autotuning for Maximum Performance

This tutorial shows you how to use Panther’s SKAutoTuner to automatically find the best sketching parameters for your neural network, maximizing throughput while maintaining accuracy.

Why the Auto-Tuner? The Bridge to Production

The SKAutoTuner is the bridge that turns standard PyTorch models into production-optimized Panther models. Without it, deploying sketched networks is manual and error-prone, and it becomes impractical at scale.

The Problem It Solves:

Deep neural networks have complex nested hierarchies. To deploy a sketched model, you must:

Navigate deep module hierarchies to locate target layers
Extract and preserve original weights
Replace layers with sketched versions (e.g., SKLinear)
Discover optimal sketching parameters for each layer
Search multidimensional parameter spaces efficiently
Maintain accuracy while maximizing speed

Manual Workflow (Without the Tuner):

Manually replacing even a single BERT layer requires:

Understanding model architecture and naming conventions
Writing custom code to traverse BertForMaskedLM → BertOnlyMLMHead → BertLMPredictionHead → target Linear
Carefully managing weight copying to preserve training
Guessing or grid-searching through parameter combinations
Re-evaluating accuracy and speed for each attempt
Discovering, often late, that the result is slow, inaccurate, or hard to reproduce

This quickly becomes unmaintainable for large models with hundreds of layers across multiple architectures.
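
To make the manual workflow concrete, here is a minimal sketch of the "walk the hierarchy and swap the leaf" step. It uses `types.SimpleNamespace` objects as stand-ins for modules so that it runs without PyTorch; `replace_by_path` is a hypothetical helper (not part of Panther), and the attribute path and placeholder strings are illustrative only. With real `nn.Module` objects, the same getattr/setattr pattern would apply.

```python
from types import SimpleNamespace


def replace_by_path(root, dotted_path, new_layer):
    """Swap the object at `dotted_path` under `root`; return the old object."""
    *parent_parts, leaf = dotted_path.split(".")
    parent = root
    for part in parent_parts:
        # getattr raises AttributeError if the path is wrong
        parent = getattr(parent, part)
    old_layer = getattr(parent, leaf)
    setattr(parent, leaf, new_layer)
    return old_layer


# Stand-in hierarchy (hypothetical names): model -> cls -> predictions -> decoder
model = SimpleNamespace(
    cls=SimpleNamespace(
        predictions=SimpleNamespace(decoder="dense layer placeholder")
    )
)

old = replace_by_path(model, "cls.predictions.decoder", "sketched layer placeholder")
print("replaced:", old)
print("now:", model.cls.predictions.decoder)
```

Every architecture needs its own path strings and its own weight-copying logic, which is the part that becomes hard to maintain by hand.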

What the Auto-Tuner Does:

Automates hierarchy navigation: Finds layers using intuitive pattern matching
Manages complexity behind the scenes: Weight management, layer replacement, configuration tracking
Discovers optimal parameters systematically: Uses Optuna, a widely used hyperparameter optimization framework, with its TPE sampler
Enforces accuracy thresholds: Only configurations that meet your target accuracy are eligible to be selected
Maximizes speed within constraints: Optimizes throughput once accuracy is satisfied
Handles noisy evaluations: Re-evaluates uncertain configurations to ensure robustness
Provides full traceability: Complete metrics on every trial for analysis and visualization

Real Impact:

On a BERT-based model evaluated on the wikitext-2-raw-v1 dataset, the tuner reduced the parameter count from 109.51M to 30.38M (a 72.26% reduction) while retaining 99.8% of the original accuracy and achieving a 1.04x speedup.
On ResNet-50 with CIFAR-10, it achieved a 30% parameter reduction at the cost of 3.5 percentage points of top-1 accuracy.
The same workflow applies across architectures (BERT, ResNet, etc.).

One-line difference:

Without tuner: Hours of debugging, manual layer selection, parameter guessing
With tuner: tuner.tune() → Apply best params → Deploy

Overview

When sketching neural network layers, choosing the right parameters (e.g., num_terms and low_rank) is crucial:

Too aggressive: Fast but inaccurate
Too conservative: Accurate but slow
Just right: Maximum speedup with acceptable accuracy loss

The SKAutoTuner automates this entire process using industry-standard hyperparameter optimization (Optuna with TPE sampler), relieving you from manual parameter tuning, layer discovery, and accuracy management.

Quick Start

Here’s the fastest way to tune your model:

import torch.nn as nn
from panther.tuner import SKAutoTuner, LayerConfig, TuningConfigs

model = YourModel()

# Use "auto" for automatic parameter ranges
config = LayerConfig(
    layer_names={"type": "Linear"},
    params="auto",
    separate=True
)

tuner = SKAutoTuner(
    model=model,
    configs=TuningConfigs([config]),
    accuracy_eval_func=your_accuracy_function,
    verbose=True
)

tuner.tune()
optimized_model = tuner.apply_best_params()

Behind the scenes, the tuner automatically:

Discovers layers: Finds all Linear layers in your model using flexible pattern matching
Generates parameter space: Creates intelligent search ranges based on layer dimensions
Searches efficiently: Uses Optuna’s TPE sampler to intelligently explore the parameter space (not random or grid)
Maintains accuracy: Evaluates your function on each trial
Applies winners: Replaces layers with their sketched versions using optimal parameters

Industrial-Grade Search Capabilities

Unlike basic grid search or random sampling, Panther’s tuner uses Optuna with the Tree-structured Parzen Estimator (TPE) sampler, a widely used approach to hyperparameter optimization:

Why TPE (Tree-Structured Parzen Estimator)?

Intelligent exploration: Builds probabilistic models of the parameter space
Adaptive sampling: Focuses trials in promising regions while still exploring
Sample efficient: Finds good parameters in 50-200 trials instead of thousands
Mixed parameter types: Handles categorical, integer, and continuous parameters simultaneously
No assumptions: Works well regardless of parameter distributions

Without Optuna/TPE (manual or random search):

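Grid search: Cost grows exponentially with the number of parameters; even 3 parameters with 3 values each need 3 × 3 × 3 = 27 trials, and finer grids or jointly tuned layers multiply that quickly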
Random: Inefficient, many wasted trials
Manual guessing: Error-prone and non-reproducible

With Optuna/TPE (tuner default):

Intelligently narrows search space
Often converges in 50-100 trials
Reproducible results with seed control
Full trial history for analysis
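
A rough count shows why exhaustive search scales poorly. The sketch below counts grid points for the two parameters used later in this tutorial (`num_terms` in 1..4 and `low_rank` from 8 to 128 in steps of 8). It assumes the upper bound is inclusive and, purely for illustration, that several layers are searched jointly rather than one at a time.

```python
num_terms_values = [1, 2, 3, 4]
low_rank_values = list(range(8, 128 + 1, 8))  # 8, 16, ..., 128 (assumed inclusive)

# Number of (num_terms, low_rank) combinations for a single layer
per_layer = len(num_terms_values) * len(low_rank_values)


def joint_grid_size(n_layers):
    """Grid points if n_layers layers were searched jointly on the full grid."""
    return per_layer ** n_layers


print("combinations per layer:", per_layer)
for n in (1, 2, 4):
    print(f"{n} layer(s) searched jointly:", joint_grid_size(n))
```

A budget of 50-200 trials covers a single layer's grid comfortably, but not a joint grid over several layers, which is where a model-based sampler earns its keep.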

Constraint-Based Optimization: The Tuner’s Superpower

The most powerful feature is constraint-based optimization: maximize speed while maintaining a minimum accuracy threshold. This ensures you never sacrifice model quality for compression.

This is what transforms the tuner from a hyperparameter tool into a production-grade compression framework:

Problem: Sketching trades accuracy for speed. How much should we compress?
Solution: Define your acceptable accuracy loss (e.g., 99% of original), and the tuner finds the fastest configuration that meets that constraint
Result: No guessing and no manual validation; the selected configuration has met your accuracy threshold on your evaluation data while maximizing the deployment benefit
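
A target phrased as "99% of original" has to be turned into a number before it can be used as a threshold. The helper below is a hypothetical sketch (not part of Panther) and assumes that `accuracy_threshold` is compared directly against the value returned by your accuracy function, as the trial rule described later in this tutorial suggests. The baseline value is a made-up example.

```python
def absolute_threshold(baseline_accuracy, keep_fraction):
    """Convert 'keep X% of baseline accuracy' into an absolute threshold."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    return baseline_accuracy * keep_fraction


baseline = 0.92  # example: accuracy of the unsketched model on your validation set
threshold = absolute_threshold(baseline, keep_fraction=0.99)
print(f"accuracy_threshold = {threshold:.4f}")
```

Measuring the baseline with the same accuracy function that the tuner will call keeps the two numbers comparable.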

Setting Up Evaluation Functions

You need two functions:

Accuracy function: Returns a score where higher is better (e.g., accuracy, F1)
Speed function: Returns throughput where higher is better (e.g., samples/second)

import torch
import time

def accuracy_eval_func(model):
    """Evaluate model accuracy on validation set."""
    model.eval()
    correct = total = 0

    with torch.no_grad():
        for inputs, targets in val_loader:
            inputs, targets = inputs.cuda(), targets.cuda()
            outputs = model(inputs)
            _, predicted = outputs.max(1)
            total += targets.size(0)
            correct += predicted.eq(targets).sum().item()

    return correct / total

def speed_eval_func(model):
    """Measure inference throughput (samples per second)."""
    model.eval()
    batch_size = 64
    x = torch.randn(batch_size, *input_shape).cuda()

    # Warmup
    for _ in range(10):
        with torch.no_grad():
            model(x)

    torch.cuda.synchronize()

    # Measure
    iterations = 100
    start = time.perf_counter()
    for _ in range(iterations):
        with torch.no_grad():
            model(x)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    return (iterations * batch_size) / elapsed

Layer Discovery and Pattern Matching

The tuner can find layers in multiple ways, so you don’t need to know the exact layer names:

By type (find all layers of a kind):

# All Linear layers in the model
layer_names={"type": "Linear"}

By name pattern (regex matching):

# Linear layers inside encoder blocks 0-5 (regex matched against the layer name)
layer_names={"pattern": "encoder\\.layers\\.[0-5]\\..*", "type": "Linear"}

By text matching (simple substring):

# Attention layers
layer_names={"contains": "attn"}

By exact name (if you know it):

# Specific layers
layer_names=["layer1.0.linear", "layer2.1.linear"]
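
To illustrate how these selector forms could be resolved against a model, here is a small stand-alone sketch. `select_layers` is a hypothetical function, not Panther's implementation; it assumes that the keys of a dict selector are combined with AND, that `pattern` must match the whole name, and that the model is represented as a list of (name, type name) pairs with made-up layer names.

```python
import re


def select_layers(named_layers, selector):
    """Return the names selected by a list of exact names or a dict selector."""
    if isinstance(selector, (list, tuple)):
        wanted = set(selector)
        return [name for name, _ in named_layers if name in wanted]

    selected = []
    for name, type_name in named_layers:
        if "type" in selector and type_name != selector["type"]:
            continue
        if "pattern" in selector and not re.fullmatch(selector["pattern"], name):
            continue
        if "contains" in selector and selector["contains"] not in name:
            continue
        selected.append(name)
    return selected


# Made-up layer inventory: (dotted name, layer type)
layers = [
    ("encoder.layers.0.attn.q_proj", "Linear"),
    ("encoder.layers.0.attn.norm", "LayerNorm"),
    ("encoder.layers.7.mlp.fc1", "Linear"),
    ("head", "Linear"),
]

by_type = select_layers(layers, {"type": "Linear"})
by_pattern = select_layers(
    layers, {"pattern": r"encoder\.layers\.[0-5]\..*", "type": "Linear"}
)
by_text = select_layers(layers, {"contains": "attn"})
by_name = select_layers(layers, ["head"])

for label, result in [("type", by_type), ("pattern", by_pattern),
                      ("contains", by_text), ("exact", by_name)]:
    print(label, "->", result)
```

Note that a substring selector without a type filter also picks up non-Linear layers such as the normalization layer here, which is why combining it with "type" is usually the safer choice.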

Automatic Parameter Space Generation

The tuner can automatically determine good search ranges based on layer dimensions:

config = LayerConfig(
    layer_names={"type": "Linear"},
    params="auto",  # Tuner decides optimal ranges
    separate=True
)

For a Linear layer with 768 input and 3072 output dimensions, the tuner automatically generates ranges like:

num_terms: [1, 2, 3, 4]
low_rank: [64, 128, 256, 512] (based on min dimension)
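
To get a feel for what such ranges mean for model size, the sketch below compares parameter counts under a simplified cost model: each of the `num_terms` terms is assumed to store two factor matrices of shapes (in, low_rank) and (low_rank, out), and biases are ignored. This is an illustrative assumption, not Panther's exact SKLinear layout, so treat the ratios as rough guidance only.

```python
def dense_params(d_in, d_out):
    """Weight count of a dense Linear layer (bias ignored)."""
    return d_in * d_out


def sketched_params(d_in, d_out, num_terms, low_rank):
    """Weight count under the assumed sum-of-low-rank-terms model."""
    return num_terms * low_rank * (d_in + d_out)


d_in, d_out = 768, 3072
dense = dense_params(d_in, d_out)
print("dense:", dense)

for num_terms in (1, 4):
    for low_rank in (64, 512):
        sk = sketched_params(d_in, d_out, num_terms, low_rank)
        print(f"num_terms={num_terms}, low_rank={low_rank}: "
              f"{sk} params, {sk / dense:.2f}x dense")
```

Under this model the largest corner of the range (num_terms=4, low_rank=512) holds more weights than the dense layer, so not every point in a search range is a compression win.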

Or explicitly specify your own ranges:

from panther.tuner import Categorical, Int

config = LayerConfig(
    layer_names={"type": "Linear"},
    params={
        "num_terms": Categorical([1, 2, 3, 4]),
        "low_rank": Int(8, 128, step=8)
    },
    separate=True
)

Running Constrained Tuning

from panther.tuner import (
    SKAutoTuner, LayerConfig, TuningConfigs,
    Categorical, Int, OptunaSearch
)

# Define which layers to tune and their parameter search space
config = LayerConfig(
    layer_names={"type": "Linear"},  # All Linear layers
    params={
        "num_terms": Categorical([1, 2, 3, 4]),
        "low_rank": Int(8, 128, step=8)
    },
    separate=True,      # Tune each layer independently
    copy_weights=True   # Preserve trained weights
)

# Create tuner with constraint
tuner = SKAutoTuner(
    model=model,
    configs=TuningConfigs([config]),
    accuracy_eval_func=accuracy_eval_func,
    accuracy_threshold=0.95,              # Reject trials whose accuracy is below 0.95
    optimization_eval_func=speed_eval_func,  # Maximize this
    search_algorithm=OptunaSearch(n_trials=100, seed=42),
    verbose=True
)

# Run the search
best_params = tuner.tune()


# Apply winning configuration
optimized_model = tuner.apply_best_params()

print(f"Best parameters found: {best_params}")

Understanding the Optimization

When both accuracy_threshold and optimization_eval_func are set:

Each trial evaluates accuracy first (fast failure)
If accuracy ≥ threshold: The configuration passes; its speed score becomes the objective
If accuracy < threshold: The configuration is rejected with score -inf
The tuner maximizes speed among all configurations meeting the accuracy bar
Best configuration is selected based on the fastest valid configuration found
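
The rule above can be sketched in a few lines of plain Python. This illustrates the scoring logic as described, not the tuner's actual code; `constrained_score` is a hypothetical name and the trial numbers are invented for the example.

```python
import math


def constrained_score(accuracy, speed, accuracy_threshold):
    """Speed is the objective, but only if the accuracy bar is met."""
    return speed if accuracy >= accuracy_threshold else -math.inf


# Invented trial results for one layer
trials = [
    {"low_rank": 16, "accuracy": 0.93, "speed": 5200.0},   # fastest, but too inaccurate
    {"low_rank": 64, "accuracy": 0.96, "speed": 4100.0},
    {"low_rank": 128, "accuracy": 0.97, "speed": 3300.0},
]

for trial in trials:
    trial["score"] = constrained_score(
        trial["accuracy"], trial["speed"], accuracy_threshold=0.95
    )

best = max(trials, key=lambda t: t["score"])
print("best low_rank:", best["low_rank"], "score:", best["score"])
```

The fastest trial loses because it misses the threshold; among the two valid trials, the faster one wins even though the other is more accurate.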

This is a constrained optimization problem, and it mirrors a common production requirement: quality must not drop below a fixed bar, while deployment speed is free to be optimized.

Real-World Impact: What the Tuner Achieves

The SKAutoTuner delivers concrete, measurable results:

BERT Model Compression (~110M parameters)

Original: 109.51M parameters, 66 ms per sample, 1.0x baseline
Tuned with sketching: 30.38M parameters (72.26% reduction), 63.5 ms per sample, 1.04x speedup
Accuracy maintained: 99.8% of original performance

ResNet-50 on CIFAR-10

Tuned: 30% parameter reduction, 52 ms per image, 1.05x speedup
Top-1 accuracy: 89% → 85.5% (a drop of 3.5 percentage points for a 30% parameter reduction)
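
The headline figures can be re-derived from the raw numbers reported in this tutorial. The short check below recomputes the BERT parameter reduction and speedup, and separates the ResNet-50 accuracy drop in percentage points from the relative drop.

```python
# BERT: parameter counts in millions and latency in ms per sample
bert_before_m, bert_after_m = 109.51, 30.38
bert_reduction = 1 - bert_after_m / bert_before_m
bert_speedup = 66.0 / 63.5

# ResNet-50: top-1 accuracy in percent, before and after tuning
resnet_before, resnet_after = 89.0, 85.5
resnet_drop_points = resnet_before - resnet_after
resnet_relative_drop = resnet_drop_points / resnet_before

print(f"BERT parameter reduction: {bert_reduction:.2%}")
print(f"BERT speedup: {bert_speedup:.2f}x")
print(f"ResNet-50 drop: {resnet_drop_points:.1f} points "
      f"({resnet_relative_drop:.1%} relative)")
```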

Advanced Configuration

Using Different Samplers

The tuner defaults to Optuna’s TPE (Tree-structured Parzen Estimator) sampler, which works well for most problems. You can customize it:

from optuna.samplers import TPESampler, CmaEsSampler, RandomSampler

# TPE (default) - good general-purpose sampler, best for mixed parameter spaces
search = OptunaSearch(n_trials=100, sampler=TPESampler(seed=42))

# CMA-ES - excellent for continuous parameters, requires many trials
search = OptunaSearch(n_trials=200, sampler=CmaEsSampler(seed=42))

# Random - useful as a simple baseline or sanity check
search = OptunaSearch(n_trials=50, sampler=RandomSampler(seed=42))

Recommendation: Start with TPE (default). Use CMA-ES only if you have pure continuous parameters and many trials to spend.

Analyzing Results

DataFrame Analysis

The tuner provides comprehensive trial data for deep analysis:

import pandas as pd

# Get all trial results
df = tuner.get_results_dataframe()

# Columns: layer_name, num_terms, low_rank, accuracy, speed, score, trial_number

# Best configurations per layer
best_per_layer = df.loc[df.groupby("layer_name")["score"].idxmax()]
print(best_per_layer)

# Configurations meeting accuracy threshold and their speeds
valid = df[df["accuracy"] >= 0.95]
print(valid.sort_values("speed", ascending=False).head(10))

# How does each parameter affect performance?
for param in ["num_terms", "low_rank"]:
    if param in df.columns:
        print(f"\n{param} vs Score:")
        print(df.groupby(param)["score"].agg(["mean", "std", "count"]))

# Correlation between parameters and accuracy
print(df[["num_terms", "low_rank", "accuracy"]].corr())

What to look for:

Do valid configurations exist? (Any row where accuracy ≥ threshold)
What’s the speedup range? (min/max speed of valid configs)
Which parameters matter? (Mean score changes substantially across the values of that parameter)
Are there clusters? (Do certain parameter values consistently beat others?)
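
The same checklist can be run on plain trial records when pandas is not at hand. The sketch below is a hypothetical helper working on a list of dicts with invented values; the field names follow the columns listed above.

```python
from collections import defaultdict
from statistics import mean


def summarize(trials, accuracy_threshold):
    """Answer the checklist questions for a list of trial records."""
    valid = [t for t in trials if t["accuracy"] >= accuracy_threshold]
    speeds = [t["speed"] for t in valid]

    # Group valid trials by low_rank to see which values cluster at high speed
    by_low_rank = defaultdict(list)
    for t in valid:
        by_low_rank[t["low_rank"]].append(t["speed"])

    return {
        "n_valid": len(valid),
        "speed_range": (min(speeds), max(speeds)) if speeds else None,
        "mean_speed_by_low_rank": {
            k: mean(v) for k, v in sorted(by_low_rank.items())
        },
    }


# Invented trial records
trials = [
    {"low_rank": 32, "accuracy": 0.940, "speed": 5000.0},
    {"low_rank": 64, "accuracy": 0.960, "speed": 4200.0},
    {"low_rank": 64, "accuracy": 0.955, "speed": 4000.0},
    {"low_rank": 128, "accuracy": 0.970, "speed": 3100.0},
]

summary = summarize(trials, accuracy_threshold=0.95)
print(summary)
```

If `n_valid` is zero, the threshold is too strict for the current parameter ranges and one of the two needs to change before more trials are spent.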

Visualization

Use Optuna’s built-in visualization for rich analysis:

import optuna.visualization as vis

# Access the underlying Optuna study
study = tuner.search_algorithm.study

# Parameter importances - which params affect score most?
fig = vis.plot_param_importances(study)
fig.show()

# Optimization history - did the search make progress?
fig = vis.plot_optimization_history(study)
fig.show()

# Parameter relationships
fig = vis.plot_parallel_coordinate(study, params=["num_terms", "low_rank"])
fig.show()

# 2D contour of parameter interactions
fig = vis.plot_contour(study, params=["num_terms", "low_rank"])
fig.show()

Best Practices

Golden Rules:

Start with “auto”: Let the tuner determine reasonable ranges first. Saves manual parameter exploration.
Use a representative validation set: Your accuracy function must reflect real-world performance. A small, biased validation set leads to bad parameter choices.

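Warm up the GPU: Run warmup iterations before timing so that one-time costs (kernel selection, memory allocation, clock ramp-up) are not counted, and synchronize before reading the clock because CUDA kernels run asynchronously:

# Bad: includes GPU warmup time and never synchronizes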
start = time.perf_counter()
for i in range(100):
    model(x)

# Good: GPU is warmed up
for _ in range(20):  # Warmup
    model(x)
torch.cuda.synchronize()
start = time.perf_counter()
for i in range(100):
    model(x)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

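Set realistic accuracy thresholds: The threshold is compared against the raw output of your accuracy function, so a 0.99 threshold is impossible to meet when the original accuracy is 0.92. Start from your measured baseline and allow a small loss (for example, 1-2 points below it).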
Save results: Use database storage for long tuning runs (100+ trials). Enables resumption if interrupted.
Visualize early: After ~20 trials, check the optimization history. If no progress, increase n_trials or adjust parameter ranges.
Consider batch size: Measure speed at your deployment batch size, not just any batch size. Smaller batches → less throughput advantage, larger batches → memory bottleneck.
Check layer coverage: Use ModelVisualizer.print_module_tree() to confirm your patterns match intended layers.
Start conservative: Begin with tight accuracy thresholds. You can loosen them later if needed. Better to have a super-stable model first.
Enable verbose mode: verbose=True during development to track progress and identify stuck searches.
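
Several of these rules (warm up, measure at a fixed batch size, guard against noise) can be folded into one timing helper. The sketch below is a hypothetical, CPU-only illustration: `measure_throughput` and `dummy_step` are not part of Panther, and the dummy workload stands in for a forward pass. On a GPU you would additionally call `torch.cuda.synchronize()` before each clock read, as in the example above.

```python
import statistics
import time


def measure_throughput(step, batch_size, warmup=5, iterations=20, repeats=5):
    """Return (median samples/sec, all measured rates) for a callable `step`."""
    for _ in range(warmup):  # warmup runs are not timed
        step()

    rates = []
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(iterations):
            step()
        elapsed = time.perf_counter() - start
        rates.append(iterations * batch_size / elapsed)

    # The median is less sensitive to a single slow or fast repeat than the mean
    return statistics.median(rates), rates


def dummy_step():
    """Stand-in workload; replace with a forward pass at your deployment batch size."""
    sum(i * i for i in range(10_000))


median_rate, rates = measure_throughput(dummy_step, batch_size=64)
print(f"median throughput: {median_rate:.0f} samples/sec over {len(rates)} repeats")
```

Reporting the spread between the slowest and fastest repeat alongside the median helps judge whether a small measured speedup is real or within noise.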

Complete Example: Transformer Tuning

import torch
import torch.nn as nn
import time
from panther.tuner import (
    SKAutoTuner, LayerConfig, TuningConfigs,
    Categorical, Int, OptunaSearch, ModelVisualizer
)

# Load your trained transformer
model = load_my_transformer()
model.cuda()
model.eval()

# Inspect the model structure
ModelVisualizer.print_module_tree(model)

# Prepare evaluation data
val_loader = get_validation_loader()

def accuracy_eval_func(model):
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in val_loader:
            x, y = x.cuda(), y.cuda()
            out = model(x)
            _, pred = out.max(-1)
            correct += (pred == y).sum().item()
            total += y.numel()
    return correct / total

def speed_eval_func(model):
    model.eval()
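    vocab_size = 30522  # placeholder: replace with your model's vocabulary size
    # Random valid token ids; randn(...).long() would yield negative indices
    x = torch.randint(0, vocab_size, (32, 128), device="cuda")  # batch_size=32, seq_len=128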

    # Warmup
    for _ in range(20):
        with torch.no_grad():
            model(x)
    torch.cuda.synchronize()

    # Measure
    start = time.perf_counter()
    for _ in range(100):
        with torch.no_grad():
            model(x)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    return (100 * 32) / elapsed  # samples/sec

# Configure tuning for transformer layers
configs = TuningConfigs([
    # Tune attention projections
    LayerConfig(
        layer_names={"contains": "attn", "type": "Linear"},
        params={
            "num_terms": Categorical([1, 2, 3]),
            "low_rank": Int(32, 256, step=32)
        },
        separate=True
    ),
    # Tune FFN layers
    LayerConfig(
        layer_names={"contains": "mlp", "type": "Linear"},
        params={
            "num_terms": Categorical([1, 2, 4]),
            "low_rank": Int(64, 512, step=64)
        },
        separate=True
    ),
])

# Run constrained optimization
tuner = SKAutoTuner(
    model=model,
    configs=configs,
    accuracy_eval_func=accuracy_eval_func,
    accuracy_threshold=0.98,  # Minimum acceptable value returned by accuracy_eval_func
    optimization_eval_func=speed_eval_func,
    search_algorithm=OptunaSearch(
        n_trials=200,
        study_name="transformer_tuning",
        storage="sqlite:///transformer_tuning.db",
        seed=42
    ),
    verbose=True
)

# Measure the baseline before tuning, in case tuning modifies `model` in place
orig_acc = accuracy_eval_func(model)
orig_speed = speed_eval_func(model)

tuner.tune()

# Apply best configuration
optimized_model = tuner.apply_best_params()

# Final verification
opt_acc = accuracy_eval_func(optimized_model)
opt_speed = speed_eval_func(optimized_model)

print(f"Original:  accuracy={orig_acc:.4f}, speed={orig_speed:.1f} samples/sec")
print(f"Optimized: accuracy={opt_acc:.4f}, speed={opt_speed:.1f} samples/sec")
print(f"Speedup: {opt_speed/orig_speed:.2f}x")

# Save results
tuner.save_tuning_results("best_config.pkl")

# Visualize with Optuna (optional)
# import optuna.visualization as vis
# vis.plot_optimization_history(tuner.search_algorithm.study).show()
