PMETAL(1)

NAME

pmetal — Powdered Metal, a high-performance LLM fine-tuning framework for Apple Silicon

SYNOPSIS

https://github.com/Epistates/pmetal/releases

INFO

212 stars
11 forks

DESCRIPTION

Powdered Metal — High performance LLM fine-tuning framework for Apple Silicon

README


PMetal

Powdered Metal — An ML SDK, framework, and application suite for Apple Silicon, written in Rust.

PMetal is a complete machine learning platform for Apple Silicon — from low-level Metal GPU kernels and Apple Neural Engine integration to high-level training APIs, a terminal TUI, and a full desktop GUI. Ship fine-tuned models without leaving the Apple ecosystem.

Use PMetal Your Way

Desktop GUI

[Screenshot: pmetal desktop GUI]

A full Tauri + Svelte desktop application for visual model management, training, and inference.

cd crates/pmetal-gui
bun install && bun tauri dev

10 pages: Dashboard, Models, Datasets, Training, Distillation, GRPO, Inference, Merging, Quantize, and Settings. Download models from HuggingFace, configure LoRA training with live loss metrics, chat with models, merge weights, and quantize — all from the GUI. Training runs in-process with real-time progress updates.

Terminal TUI

[Screenshot: pmetal TUI]

A full-featured terminal control center with 9 tabs.

pmetal tui

Tab           Description
Dashboard     Live loss curves (braille), LR schedule, throughput sparklines, timing breakdown gauges
Device        GPU/ANE info, Metal feature detection, memory gauge, kernel tuning, UltraFusion topology
Models        Browse cached models, HuggingFace Hub search (S), memory fit estimation, download
Datasets      Scan and preview local datasets (JSONL, Parquet, CSV) with line counts
Training      Configure and launch SFT/LoRA/QLoRA training runs with sectioned parameter forms
Distillation  Configure knowledge distillation (online, offline, progressive, cross-vocab)
GRPO          Configure GRPO/DAPO reasoning training with reward functions and sampling params
Inference     Interactive chat interface with markdown rendering and generation settings sidebar
Jobs          Training run history with log viewer, status tracking, and metadata

Keybindings: Tab/Shift+Tab to switch tabs, Alt+1-9 for direct access, L to adjust learning rate mid-run, q to quit.

CLI

# LoRA fine-tuning with sequence packing (default)
pmetal train \
  --model Qwen/Qwen3-0.6B \
  --dataset train.jsonl \
  --output ./output \
  --lora-r 16 --batch-size 4 --learning-rate 2e-4

# Inference with LoRA adapter
pmetal infer \
  --model Qwen/Qwen3-0.6B \
  --lora ./output/lora_weights.safetensors \
  --prompt "Explain quantum entanglement" \
  --chat --show-thinking

# Knowledge distillation
pmetal distill \
  --teacher Qwen/Qwen3-4B \
  --student unsloth/Qwen3.5-0.8B-Base \
  --dataset train.jsonl

# GRPO reasoning training
pmetal grpo \
  --model Qwen/Qwen3-0.6B \
  --dataset reasoning.jsonl \
  --reasoning-rewards

# HuggingFace model search with memory fit
pmetal search "qwen 0.6b" --detailed

# Merge models with SLERP
pmetal merge \
  --models model-a model-b \
  --method slerp --t 0.5

# Quantize to GGUF
pmetal quantize \
  --model ./output \
  --output model.gguf --type q4km

# Fuse LoRA into base model
pmetal fuse \
  --model Qwen/Qwen3-0.6B \
  --lora ./output/lora_weights.safetensors

# Evaluate perplexity
pmetal eval \
  --model Qwen/Qwen3-0.6B \
  --dataset eval.jsonl

# Start OpenAI-compatible server (requires --features serve)
pmetal serve --model Qwen/Qwen3-0.6B --port 8080
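
The eval command above reports perplexity on a dataset. As background (this is the standard definition, not PMetal's code), perplexity is the exponential of the mean per-token negative log-likelihood:

```python
import math

def perplexity(token_nlls: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model uniformly uncertain over a 4-token vocabulary assigns each
# token probability 1/4, i.e. NLL = ln(4) per token -> perplexity 4.
uniform_nlls = [math.log(4)] * 10
print(round(perplexity(uniform_nlls), 6))  # 4.0
```

A perplexity of k roughly means the model is as uncertain as a uniform choice among k tokens.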

All CLI Commands

Command    Description
train      Fine-tune with LoRA/QLoRA/DoRA (SFT)
infer      Interactive inference with chat, tool use, and thinking mode
distill    Knowledge distillation (online, offline, progressive)
grpo       GRPO/DAPO reasoning training
search     Search HuggingFace Hub with memory fit estimation
download   Download a model from HuggingFace Hub
merge      Merge two or more models (12 strategies)
quantize   GGUF quantization (13 format options)
fuse       Fuse LoRA adapter weights into base model
eval       Evaluate model perplexity on a dataset
serve      OpenAI-compatible inference server (feature-gated)
tui        Full TUI control center (9 tabs)
dashboard  Real-time training metrics visualization
dataset    Dataset utilities: analyze, download, convert
ollama     Ollama integration: modelfile, create, templates
info       Show device info (GPU, ANE, bandwidth, NAX)
memory     Show memory usage and available capacity
init       Generate a sample configuration file
bench      Benchmark training performance
bench-gen  Benchmark generation loop timing
bench-ffi  Benchmark FFI overhead
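
The memory fit estimation behind search can be approximated with back-of-envelope arithmetic. This sketch is illustrative: the formula shape is standard, but the overhead multiplier is an assumption, not PMetal's actual estimator:

```python
def fits_in_memory(n_params: float, bits_per_weight: float,
                   available_gb: float, overhead: float = 1.2) -> tuple[float, bool]:
    """Estimate weight memory (GB) and whether it fits on the device.

    overhead is an assumed multiplier covering KV cache, activations,
    and runtime buffers -- not PMetal's actual figure.
    """
    weight_gb = n_params * bits_per_weight / 8 / 1e9
    needed = weight_gb * overhead
    return needed, needed <= available_gb

# Qwen3-0.6B at fp16 (16 bits/weight) on a 16 GB machine:
needed, ok = fits_in_memory(0.6e9, 16, 16.0)
print(f"{needed:.2f} GB, fits: {ok}")  # 1.44 GB, fits: True
```

Quantization shifts only bits_per_weight: the same model at q4km (~4.5 bits) needs roughly a quarter of the fp16 footprint.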

SDK

PMetal is an embeddable SDK — integrate training, inference, and model operations into your own Rust applications. The easy module provides high-level builders, while the underlying crates (pmetal-trainer, pmetal-models, pmetal-lora, etc.) offer full control over every pipeline stage.

use pmetal::easy;

// Fine-tune with LoRA
let result = easy::finetune("Qwen/Qwen3-0.6B", "train.jsonl")
    .lora(16, 32.0)
    .learning_rate(2e-4)
    .epochs(3)
    .output("./output")
    .run()
    .await?;

// DPO preference optimization
let result = easy::dpo("Qwen/Qwen3-0.6B", "preferences.jsonl")
    .dpo_beta(0.1)
    .reference_model("Qwen/Qwen3-0.6B")
    .run()
    .await?;

// Inference
let output = easy::infer("Qwen/Qwen3-0.6B")
    .temperature(0.7)
    .lora("./output/lora_weights.safetensors")
    .generate("What is 2+2?")
    .await?;

// Streaming inference
easy::infer("Qwen/Qwen3-0.6B")
    .generate_streaming("Tell me a story", |delta| {
        print!("{delta}");
        true // return false to stop early
    })
    .await?;

Available builders: easy::finetune(), easy::dpo(), easy::simpo(), easy::orpo(), easy::kto(), easy::infer().

For lower-level control, use the crates directly — pmetal-trainer::TrainingLoop, pmetal-models::DynamicModel, pmetal-lora::DynamicLoraModel, pmetal-distill::Distiller, etc. See the examples/ directory for complete working examples including manual training loop orchestration and ANE-specific workflows.

Python SDK

PMetal exposes a Python extension module via PyO3. Install with maturin develop from crates/pmetal-py.

Quick Start (Easy API)

import pmetal

# Fine-tune with sensible defaults
result = pmetal.finetune(
    "Qwen/Qwen3-0.6B",
    "train.jsonl",
    lora_r=16,
    learning_rate=2e-4,
    epochs=3,
)
print(f"Loss: {result['final_loss']}, Steps: {result['total_steps']}")

# Inference
text = pmetal.infer("Qwen/Qwen3-0.6B", "What is 2+2?")
print(text)

# Inference with LoRA adapter
text = pmetal.infer(
    "Qwen/Qwen3-0.6B",
    "Explain quantum entanglement",
    lora="./output/lora_weights.safetensors",
)

Full Control

import pmetal

# Configure training components
lora_config = pmetal.LoraConfig(r=16, alpha=32.0)
training_config = pmetal.TrainingConfig(
    learning_rate=2e-4,
    num_epochs=3,
    batch_size=4,
    max_seq_len=2048,
)

# Create trainer
trainer = pmetal.Trainer(
    model_id="Qwen/Qwen3-0.6B",
    lora_config=lora_config,
    training_config=training_config,
    dataset_path="train.jsonl",
)
trainer.add_callback(pmetal.ProgressCallback())
result = trainer.train()

# Load model for inference
model = pmetal.Model.load("Qwen/Qwen3-0.6B")
print(model.generate("Hello world", temperature=0.7))

Installation

Prebuilt signed binaries are available on the Releases page.

Crates are available on crates.io.

Build from source:

git clone https://github.com/epistates/pmetal.git && cd pmetal
cargo build --release          # CLI + TUI
cd crates/pmetal-gui && bun install && bun tauri build  # GUI (optional)

Hardware Support

PMetal automatically detects Apple Silicon capabilities at startup and tunes kernel parameters accordingly.

Chip Family             GPU Family  NAX  ANE       UltraFusion   Status
M1 / Pro / Max / Ultra  Apple7      -    16 cores  Ultra: 2-die  Fully supported
M2 / Pro / Max / Ultra  Apple8      -    16 cores  Ultra: 2-die  Fully supported
M3 / Pro / Max / Ultra  Apple9      -    16 cores  Ultra: 2-die  Fully supported
M4 / Pro / Max / Ultra  Apple9      -    16 cores  Ultra: 2-die  Fully supported
M5 / Pro / Max / Ultra  Apple10     Yes  16 cores  Ultra: 2-die  Fully supported

Auto-detected features: GPU family, device tier, core counts, memory bandwidth, dynamic caching, mesh shaders, NAX (M5+), UltraFusion topology (via sysctl hw.packages), ANE availability.

Tier-based kernel tuning: Matrix tile sizes, FlashAttention block sizes, fused kernel threadgroup sizes, and batch multipliers are automatically selected based on device tier (Base/Pro/Max/Ultra) and GPU family. See docs/hardware-support.md for the full tuning matrix.
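
Conceptually, tier-based tuning is a lookup from (device tier, GPU family) to kernel parameters, with a conservative fallback for unknown hardware. The table values below are placeholders, not PMetal's actual tuning matrix (docs/hardware-support.md has the real one):

```python
# Hypothetical tuning table: (tier, gpu_family) -> kernel parameters.
TUNING = {
    ("Base",  "Apple9"):  {"matmul_tile": 32, "flash_block": 64,  "batch_mult": 1},
    ("Max",   "Apple9"):  {"matmul_tile": 64, "flash_block": 128, "batch_mult": 2},
    ("Ultra", "Apple10"): {"matmul_tile": 64, "flash_block": 128, "batch_mult": 4},
}

def kernel_params(tier: str, gpu_family: str) -> dict:
    # Unknown devices fall back to conservative Base-tier parameters.
    return TUNING.get((tier, gpu_family),
                      {"matmul_tile": 32, "flash_block": 64, "batch_mult": 1})

print(kernel_params("Ultra", "Apple10")["batch_mult"])  # 4
```

The fallback path matters: detection runs at startup, so a new chip family still gets working (if unoptimized) kernels.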

Architecture

PMetal is organized as a Rust workspace with 18 specialized crates:

pmetal/
├── pmetal-core         # Foundation: configs, traits, types, error handling
├── pmetal-metal        # Custom Metal GPU kernels + ANE runtime
├── pmetal-mlx          # MLX backend integration (KV cache, RoPE, etc.)
├── pmetal-models       # LLM architectures (Llama, Qwen, DeepSeek, etc.)
├── pmetal-lora         # LoRA/QLoRA training implementations
├── pmetal-trainer      # Training loops (SFT, DPO, SimPO, ORPO, KTO, GRPO, etc.)
├── pmetal-data         # Dataset loading, chat templates, tokenization
├── pmetal-hub          # HuggingFace Hub integration + model fit estimation
├── pmetal-distill      # Knowledge distillation (online, offline, cross-vocab, TAID)
├── pmetal-merge        # Model merging (16 strategies)
├── pmetal-gguf         # GGUF format with imatrix quantization
├── pmetal-mhc          # Manifold-Constrained Hyper-Connections
├── pmetal-distributed  # Distributed training (mDNS, Ring All-Reduce)
├── pmetal-vocoder      # BigVGAN neural vocoder
├── pmetal-serve        # OpenAI-compatible inference server
├── pmetal-py           # Python bindings (maturin/PyO3)
├── pmetal-cli          # Command-line interface + TUI control center
└── pmetal-gui          # Desktop GUI (Tauri + Svelte + TailwindCSS)

The pmetal facade crate re-exports all modules with feature flags and provides the easy API for quick-start usage.

Supported Models

Inference (via DynamicModel dispatcher)

All models below can be loaded from HuggingFace Hub or local safetensors and used for inference via the CLI, TUI, GUI, or SDK.

Family          Architecture    Variants                  model_type values
Llama           Llama           2, 3, 3.1, 3.2, 3.3       llama, llama3
Llama 4         Llama4          Scout, Maverick           llama4
Qwen 2          Qwen2           2, 2.5                    qwen2, qwen2_5
Qwen 3          Qwen3           3                         qwen3
Qwen 3 MoE      Qwen3MoE        3-MoE                     qwen3_moe
Qwen 3.5        Qwen3Next       3.5 (Next)                qwen3_next, qwen3_5
DeepSeek        DeepSeek        V3, V3.2, V3.2-Speciale   deepseek, deepseek_v3
Mistral         Mistral         7B, Mixtral 8x7B          mistral, mixtral
Gemma           Gemma           2, 3                      gemma, gemma2, gemma3
Phi 3           Phi             3, 3.5                    phi, phi3
Phi 4           Phi4            4                         phi4
Cohere          Cohere          Command R                 cohere, command_r
Granite         Granite         3.0, 3.1, Hybrid MoE      granite, granitehybrid
NemotronH       NemotronH       Hybrid (Mamba+Attention)  nemotron_h
StarCoder2      StarCoder2      3B, 7B, 15B               starcoder2
RecurrentGemma  RecurrentGemma  Griffin                   recurrentgemma, griffin
Jamba           Jamba           1.5                       jamba
Flux            Flux            1-dev, 1-schnell          flux

LoRA/QLoRA Training Support

LoRA training is supported for models that have implementations in DynamicLoraModel. Architecture detection is automatic — just point pmetal train at a model directory or HuggingFace ID.

Architecture     LoRA  QLoRA  Notes
Llama            Yes   Yes    Covers Llama 2, 3, 3.1, 3.2, 3.3. Gradient checkpointing supported.
Qwen 2           Yes   -      Uses Qwen3 LoRA implementation internally.
Qwen 3           Yes   Yes    Gradient checkpointing supported.
Qwen 3.5 (Next)  Yes   -      Hybrid architecture with nested text_config handling.
Gemma            Yes   Yes    GeGLU activation, special RMSNorm.
Mistral          Yes   Yes    Sliding window attention support.
Phi 3            Yes   -      Partial RoPE, fused gate_up projection.

Architectures not listed above (Llama 4, Qwen 3 MoE, DeepSeek, Cohere, Granite, NemotronH, Phi 4, StarCoder2, RecurrentGemma, Jamba) support inference but do not yet have LoRA training integration via DynamicLoraModel. Contributions welcome.
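
The automatic detection mentioned above boils down to reading model_type from the model's config.json and checking it against the set of LoRA-capable implementations. A simplified sketch (the set is taken from the table above; the function name is illustrative, not PMetal's API):

```python
# model_type values with LoRA support in DynamicLoraModel, per the table above.
LORA_CAPABLE = {
    "llama", "llama3", "qwen2", "qwen2_5", "qwen3", "qwen3_next",
    "qwen3_5", "gemma", "gemma2", "gemma3", "mistral", "phi", "phi3",
}

def supports_lora(config: dict) -> bool:
    """config is the parsed config.json from a model directory or HF snapshot."""
    return config.get("model_type") in LORA_CAPABLE

print(supports_lora({"model_type": "qwen3"}))  # True
print(supports_lora({"model_type": "jamba"}))  # False
```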

Architecture Modules (Not Yet in Dispatcher)

The following architectures have implementations in pmetal-models but are not wired into the DynamicModel dispatcher and cannot be loaded via the CLI or DynamicModel::load():

Family    Module    Notes
GPT-OSS   gpt_oss   MoE with Top-4 sigmoid routing, 20B/120B variants
Pixtral   pixtral   12B vision-language model
Qwen2-VL  qwen2_vl  2B, 7B vision-language model
MLlama    mllama    Llama 3.2-Vision
CLIP      clip      ViT-L/14 vision encoder
Whisper   whisper   Base, Small, Medium, Large speech models
T5        t5        Encoder-decoder architecture

These modules can be used directly via their Rust types (e.g., pmetal_models::architectures::gpt_oss::GptOssForCausalLM) but require manual weight loading.

Diffusion Models

Family  Variants          Status
Flux    1-dev, 1-schnell  Dispatcher + pipeline implemented

Training Methods

All training methods support callback-based cancellation (should_stop()), metrics JSONL logging, and adaptive learning rate control.

Method                        CLI                       GUI  TUI  Library
SFT (Supervised Fine-Tuning)  train                     Yes  Yes  easy::finetune()
LoRA                          train                     Yes  Yes  easy::finetune()
QLoRA (4-bit)                 train --quantization nf4  Yes  Yes  easy::finetune()
DoRA                          train --dora              Yes  Yes  easy::finetune()
DPO (Direct Preference)       -                         -    -    easy::dpo()
SimPO (Simple Preference)     -                         -    -    easy::simpo()
ORPO (Odds-Ratio Preference)  -                         -    -    easy::orpo()
KTO (Kahneman-Tversky)        -                         -    -    easy::kto()
GRPO (Reasoning)              grpo                      Yes  Yes  GrpoTrainer
DAPO (Decoupled GRPO)         grpo --dapo               Yes  Yes  DapoTrainer
Knowledge Distillation        distill                   Yes  Yes  Distiller
TAID (Temporally Adaptive)    -                         -    -    TaidDistiller
ANE Training                  train (auto)              Yes  -    AneTrainingLoop

Additional methods available via the library only: GSPO (GspoTrainer), PPO (PpoTrainer), Online DPO (OnlineDpoTrainer), Diffusion Training (DiffusionTrainer).

Key Features

Metal GPU Optimizations

Custom Metal shaders provide significant speedups:

  • FlashAttention: O(n) memory attention with fused softmax, tier-aware block sizes
  • Fused GDN: Gated Delta Network recurrence kernel (ported from FLA Triton) — single-pass state update with SIMD reductions
  • Fused LoRA: Combined forward pass for adapter layers (~2x speedup with lora-metal-fused feature)
  • Fused Cross-Entropy: Unsloth-style chunked loss computation
  • Fused Linear Cross-Entropy: Skips logits materialization entirely
  • Fused RoPE: Rotary position embeddings in-kernel
  • Fused SwiGLU: Fused gate + activation with tier-tuned threadgroups
  • Fused RMSNorm + LoRA: Combined normalization and adapter projection
  • Fused Sampler: JIT-compiled token sampling
  • Fused MLP: Combined gate/up/down projections
  • Async Scheduler: Double/triple-buffered GPU command scheduling

ANE (Neural Engine) Pipeline

Native ANE integration for power-efficient training and inference:

  • Dynamic Weight Pipeline: 9 MIL kernels compiled once at startup; weights packed alongside activations in IOSurface spatial dimension
  • Hybrid Inference: ANE prefill + CPU decode with KV cache. Power-of-2 sequence bucketing for optimal kernel compilation
  • CPU RMSNorm: RMSNorm computed in f32 on CPU to avoid fp16 overflow on ANE (saturation arithmetic)
  • IOSurface Zero-Copy: fp32 shared memory surfaces for CPU-ANE data transfer with no serialization overhead
  • M1-M5 Compatibility: Per-matrix weight blobs for M1, single-blob for M3+. CPU FFN fallback for 4B+ models

Training Infrastructure

  • Sequence Packing: Efficiently pack multiple sequences into single batches for 2-5x throughput. Enabled by default
  • Gradient Checkpointing: Trade compute for memory on large models with configurable layer grouping
  • Adaptive LR: EMA-based anomaly detection with spike recovery, plateau reduction, and divergence detection
  • Callback System: TrainingCallback trait with lifecycle hooks (on_step_start, on_step_end, should_stop) for metrics logging, progress reporting, and clean cancellation
  • Checkpoint Management: Save and resume training from checkpoints with best-loss rollback
  • Tool/Function Calling: Chat templates with native tool definitions for Qwen, Llama 3.1+, Mistral v3+, and DeepSeek
  • Schedule-Free Optimizer: Memory-efficient optimizer without learning rate schedules
  • Metal Fused Optimizer: GPU-accelerated AdamW parameter updates
  • 8-bit Adam: Memory-efficient optimizer for large models
  • LoRA+: Differentiated learning rates for LoRA A and B matrices
  • NEFTune: Noise-augmented fine-tuning for improved generation quality
  • Distributed Training: mDNS auto-discovery, Ring All-Reduce with gradient compression
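
Sequence packing concatenates short examples to fill each max_seq_len window instead of padding every example to full length. A first-fit sketch of the idea (PMetal's packer also handles attention masking across sequence boundaries, which is omitted here):

```python
def pack_sequences(lengths: list[int], max_len: int) -> list[list[int]]:
    """Greedy first-fit packing of sequence lengths into windows of max_len."""
    bins: list[list[int]] = []
    for n in lengths:
        for b in bins:
            if sum(b) + n <= max_len:  # fits in an existing window
                b.append(n)
                break
        else:
            bins.append([n])  # open a new window
    return bins

# Four sequences packed into two windows instead of four padded batches:
print(pack_sequences([5, 3, 4, 2], max_len=8))  # [[5, 3], [4, 2]]
```

The 2-5x throughput figure comes from exactly this effect: padded tokens do no useful work, and packing replaces them with real tokens.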

Dataset Formats

Auto-detected training data formats:

  • ShareGPT: {"conversations": [{"from": "human", "value": "..."}, ...]}
  • Alpaca: {"instruction": "...", "input": "...", "output": "..."}
  • OpenAI/Messages: {"messages": [{"role": "user", "content": "..."}, ...]}
  • Reasoning: {"problem": "...", "thinking": "...", "solution": "..."}
  • Simple: {"text": "..."}
  • Parquet: Supports both standard text columns and reasoning formats
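
Format auto-detection can be done by inspecting the top-level keys of the first record. A sketch of that key-based heuristic over the formats listed above (illustrative, not PMetal's detector):

```python
def detect_format(record: dict) -> str:
    """Classify a training record by its top-level keys."""
    if "conversations" in record:
        return "sharegpt"
    if "messages" in record:
        return "openai"
    if "instruction" in record and "output" in record:
        return "alpaca"
    if "problem" in record and "solution" in record:
        return "reasoning"
    if "text" in record:
        return "simple"
    raise ValueError(f"unrecognized dataset format: {sorted(record)}")

print(detect_format({"messages": [{"role": "user", "content": "hi"}]}))  # openai
```

Order matters: the more specific keys are checked before the catch-all "text" field.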

The pmetal dataset subcommand provides utilities for analysis, download from HuggingFace, and format conversion (Parquet, JSON, JSONL, CSV, ShareGPT, Alpaca).

Model Operations

  • HuggingFace Hub Search: pmetal search with memory fit estimation and download

  • Model Merging (16 strategies via library, 12 via CLI):

    CLI              Library              Description
    linear           LinearMerge          Simple weighted averaging
    slerp            SlerpMerge           Spherical linear interpolation
    ties             TiesMerge            Task arithmetic with sparsification and sign consensus
    dare_ties        DareMerge            Random pruning with rescaling (TIES variant)
    dare_linear      DareMerge            Random pruning with rescaling (linear variant)
    task_arithmetic  TaskArithmeticMerge  Task vector arithmetic
    della            DellaMerge           Adaptive magnitude-based pruning
    della_linear     DellaMerge           Adaptive magnitude pruning (linear variant)
    breadcrumbs      BreadcrumbsMerge     Breadcrumbs merge strategy
    model_stock      ModelStockMerge      Geometric interpolation based on task vector similarity
    nearswap         NearswapMerge        Near-swap merge strategy
    passthrough      PassthroughMerge     Layer passthrough composition
    -                RamMerge             RAM merge strategy
    -                SouperMerge          Souper merge strategy
    -                KarcherMerge         Karcher mean on weight manifold
    -                MultiSlerpMerge      Multi-model SLERP
  • GPU-Accelerated Merging: Metal-based merge operations for large models

  • FP8-Aware Merging: Merge with FP8 quantization for memory efficiency

  • Async Merge Pipeline: Double-buffered streaming merge for large models

  • LoRA Fusing: Merge LoRA adapters into base weights (standard and accurate modes)

  • GGUF Quantization (13 format options):

    Format   Description
    dynamic  Auto-select per layer
    q8_0     8-bit quantization
    q6k      6-bit k-quant
    q5km     5-bit k-quant (medium)
    q5ks     5-bit k-quant (small)
    q4km     4-bit k-quant (medium)
    q4ks     4-bit k-quant (small)
    q3km     3-bit k-quant (medium)
    q3ks     3-bit k-quant (small)
    q3kl     3-bit k-quant (large)
    q2k      2-bit k-quant
    f16      Float16
    f32      Float32

    Supports importance matrix (--imatrix) for improved quantization quality.

  • FP8 Runtime Quantization: Convert to FP8 (E4M3) at inference time for ~2x memory reduction
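
Among the strategies above, SLERP interpolates along the great circle between two weight vectors rather than the straight chord, so the interpolated weights keep their magnitude. A minimal vector-level sketch (real merges apply this per tensor; the near-parallel fallback below is a common convention, not necessarily PMetal's):

```python
import math

def slerp(a: list[float], b: list[float], t: float) -> list[float]:
    """Spherical linear interpolation between vectors a and b at fraction t."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    dot = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    omega = math.acos(max(-1.0, min(1.0, dot)))  # angle between the vectors
    if omega < 1e-6:  # nearly parallel: fall back to linear interpolation
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    wa = math.sin((1 - t) * omega) / math.sin(omega)
    wb = math.sin(t * omega) / math.sin(omega)
    return [wa * x + wb * y for x, y in zip(a, b)]

# Midpoint between orthogonal unit vectors stays on the unit circle:
print(slerp([1.0, 0.0], [0.0, 1.0], 0.5))  # [0.7071..., 0.7071...]
```

Plain linear averaging of the same two vectors would give [0.5, 0.5], with norm ~0.71 instead of 1 -- the shrinkage SLERP avoids.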

Knowledge Distillation

Multiple distillation methods and loss functions:

  • Methods: Online (live teacher inference), Offline (cached logits with compression), Progressive
  • TAID: Temporally Adaptive Interpolated Distillation (ICLR 2025 SOTA) — TaidDistiller
  • Token-Level Losses: KL Divergence, Jensen-Shannon, Soft Cross-Entropy, TVD, Hinge Ranking, Logistic Ranking
  • Hidden State Losses: MSE, Cosine similarity, L1
  • Reasoning-Aware: Rationale distillation for reasoning models
  • Cross-Vocabulary: Distill between models with different tokenizers
  • Offline Logit Caching: Compressed logit storage for memory-efficient offline distillation
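
The core of the token-level KL loss listed above is a temperature-softened KL divergence between teacher and student distributions. A self-contained pure-Python sketch of that loss (a stand-in for the Metal implementation; the T^2 scaling is the standard Hinton convention):

```python
import math

def softmax(logits: list[float], temperature: float) -> list[float]:
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def kl_distill_loss(teacher_logits, student_logits, temperature=2.0) -> float:
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    # T^2 factor keeps gradient magnitude comparable across temperatures.
    return temperature ** 2 * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

print(kl_distill_loss([2.0, 1.0, 0.0], [2.0, 1.0, 0.0]))  # 0.0
```

Higher temperatures flatten both distributions, transferring more of the teacher's ranking over non-top tokens ("dark knowledge") to the student.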

Configuration

pmetal train Parameters

Parameter                        Default  Description
--lora-r                         16       LoRA rank
--lora-alpha                     32.0     LoRA scaling factor (2x rank)
--batch-size                     1        Micro-batch size
--learning-rate                  2e-4     Learning rate
--max-seq-len                    0        Max seq len (0 = auto-detect)
--epochs                         1        Number of training epochs
--max-grad-norm                  1.0      Gradient clipping
--quantization                   none     QLoRA method (nf4, fp4, int8)
--gradient-accumulation-steps    4        Gradient accumulation steps
--no-ane                         false    Disable ANE training
--embedding-lr                   None     Separate LR for embeddings
--no-metal-fused-optimizer       false    Disable Metal fused optimizer
--lr-schedule                    cosine   Schedule type (constant, linear, cosine, cosine_with_restarts, polynomial, wsd)
--no-gradient-checkpointing      false    Disable gradient checkpointing (enabled by default)
--gradient-checkpointing-layers  4        Number of layers per checkpoint block
--warmup-steps                   100      Learning rate warmup steps
--weight-decay                   0.01     AdamW weight decay coefficient
--no-sequence-packing            false    Disable sequence packing
--config                         -        Path to YAML configuration file
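
The default cosine schedule with warmup (per --lr-schedule and --warmup-steps above) can be written as a small function. This is the textbook formula, which may differ in detail from PMetal's implementation:

```python
import math

def cosine_lr(step: int, total_steps: int, peak_lr: float, warmup_steps: int) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

# With the defaults --learning-rate 2e-4 and --warmup-steps 100:
print(cosine_lr(100, 1000, 2e-4, 100))  # 0.0002
```

The LR climbs linearly for the first 100 steps, peaks at 2e-4, then decays smoothly to zero by the final step.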

pmetal infer Parameters

Parameter             Default        Description
--temperature         Model default  Sampling temperature
--top-k               Model default  Top-k sampling
--top-p               Model default  Nucleus sampling
--min-p               Model default  Min-p dynamic sampling
--max-tokens          256            Maximum generation length
--repetition-penalty  1.0            Repetition penalty
--frequency-penalty   0.0            Frequency penalty
--presence-penalty    0.0            Presence penalty
--chat                false          Apply chat template
--show-thinking       false          Show reasoning content
--fp8                 false          Use FP8 weights (~2x mem reduction)
--compiled            false          Use JIT-compiled sampling
--no-ane              false          Disable ANE inference
--ane-max-seq-len     1024           Max ANE kernel sequence length
--tools               -              Tool/function definitions file (OpenAI format)
--system              -              System message

Feature Flags

Feature           Default  Crate               Description
core              Yes      pmetal-core         Foundation types, configs, traits
gguf              Yes      pmetal-gguf         GGUF format support
metal             Yes      pmetal-metal        Metal GPU kernels
hub               Yes      pmetal-hub          HuggingFace Hub integration
mlx               Yes      pmetal-mlx          MLX backend
models            Yes      pmetal-models       LLM architectures
lora              Yes      pmetal-lora         LoRA/QLoRA
trainer           Yes      pmetal-trainer      Training loops (pulls in data, distill)
easy              Yes      -                   High-level builders (pulls in trainer, hub, data)
ane               Yes      -                   Apple Neural Engine
data              Yes*     pmetal-data         Dataset loading (*default via easy)
distill           Yes*     pmetal-distill      Knowledge distillation (*default via trainer)
lora-metal-fused  No       -                   ~2x LoRA training speedup via fused Metal kernels
merge             No       pmetal-merge        Model merging strategies
vocoder           No       pmetal-vocoder      BigVGAN neural vocoder
distributed       No       pmetal-distributed  Distributed training
mhc               No       pmetal-mhc          Manifold-Constrained Hyper-Connections
full              No       -                   All features

Development

Building

# Release build (default features: ANE + Dashboard)
cargo build --release

# Build without ANE
cargo build --release --no-default-features --features dashboard

# Run tests (single-threaded for Metal compatibility)
just test

# Build GUI
cd crates/pmetal-gui && bun install && bun tauri build

Formal Verification

# cargo-kani proofs for ring all-reduce and topology
just kani-verify
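
Ring All-Reduce, the subject of the kani proofs above, moves each gradient chunk around an N-node ring in two passes: a reduce-scatter pass that accumulates partial sums, then an all-gather pass that circulates the finished chunks, so every node sends only 2(N-1)/N of the data. A single-process simulation of the data flow (illustrative; the real implementation adds gradient compression):

```python
def ring_all_reduce(node_grads: list[list[float]]) -> list[list[float]]:
    """Simulate Ring All-Reduce: every node ends with the element-wise sum.

    For simplicity each node's gradient has one scalar chunk per node.
    """
    n = len(node_grads)
    state = [list(g) for g in node_grads]
    # Reduce-scatter: after n-1 steps, node i holds the full sum of one chunk.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, state[i][(i - step) % n]) for i in range(n)]
        for i, c, val in sends:  # apply all transfers synchronously
            state[(i + 1) % n][c] += val
    # All-gather: circulate the completed chunks around the ring.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, state[i][(i + 1 - step) % n]) for i in range(n)]
        for i, c, val in sends:
            state[(i + 1) % n][c] = val
    return state

print(ring_all_reduce([[1.0, 2.0], [3.0, 4.0]]))  # [[4.0, 6.0], [4.0, 6.0]]
```

Because each node only ever talks to its ring neighbor, bandwidth per node stays constant as the cluster grows.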

License

Licensed under either of MIT or Apache-2.0.

Acknowledgments

  • MLX - Apple's machine learning framework
  • mlx-rs - Rust bindings for MLX
  • Unsloth - Inspiration for fused kernels
  • Tauri - Desktop application framework

SEE ALSO

clihub

3/18/2026                                                      PMETAL(1)