NAME
shimmy — ⚡ Python-free Rust inference server — OpenAI-API compatible. GGUF + SafeTensors, hot model swap, auto-discovery, single binary.
SYNOPSIS
npm install -g shimmy-js
DESCRIPTION
⚡ Python-free Rust inference server — OpenAI-API compatible. GGUF + SafeTensors, hot model swap, auto-discovery, single binary. FREE now, FREE forever.
README
Shimmy will be free forever. No asterisks. No "free for now." No pivot to paid.
💝 Support Shimmy's Growth
🚀 If Shimmy helps you, consider sponsoring — 100% of support goes to keeping it free forever.
- $5/month: Coffee tier ☕ - Eternal gratitude + sponsor badge
- $25/month: Bug prioritizer 🐛 - Priority support + name in SPONSORS.md
- $100/month: Corporate backer 🏢 - Logo placement + monthly office hours
- $500/month: Infrastructure partner 🚀 - Direct support + roadmap input
🎯 Become a Sponsor | See our amazing sponsors 🙏
Drop-in OpenAI API Replacement for Local LLMs
Shimmy is a single binary that provides 100% OpenAI-compatible endpoints for GGUF models. Point your existing AI tools to Shimmy and they just work — locally, privately, and free.
🎉 NEW in v1.9.0: One download, all GPU backends included! No compilation, no backend confusion - just download and run.
Developer Tools
Whether you're forking Shimmy or integrating it as a service, we provide complete documentation and integration templates.
Try it in 30 seconds
```bash
# 1) Download pre-built binary (includes all GPU backends)

# Windows:
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-windows-x86_64.exe -o shimmy.exe
./shimmy.exe serve &

# Linux:
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-x86_64 -o shimmy && chmod +x shimmy
./shimmy serve &

# macOS (Apple Silicon):
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-arm64 -o shimmy && chmod +x shimmy
./shimmy serve &

# 2) See models and pick one
./shimmy list

# 3) Smoke test the OpenAI API
curl -s http://127.0.0.1:11435/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"REPLACE_WITH_MODEL_FROM_list","messages":[{"role":"user","content":"Say hi in 5 words."}],"max_tokens":32}' \
  | jq -r '.choices[0].message.content'
```
🚀 Compatible with OpenAI SDKs and Tools
No code changes needed - just change the API endpoint:
- Any OpenAI client: Python, Node.js, curl, etc.
- Development applications: Compatible with standard SDKs
- VSCode Extensions: Point to `http://localhost:11435`
- Cursor Editor: Built-in OpenAI compatibility
- Continue.dev: Drop-in model provider
Use with OpenAI SDKs
- Node.js (openai v4)
```js
import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "http://127.0.0.1:11435/v1",
  apiKey: "sk-local", // placeholder, Shimmy ignores it
});

const resp = await openai.chat.completions.create({
  model: "REPLACE_WITH_MODEL",
  messages: [{ role: "user", content: "Say hi in 5 words." }],
  max_tokens: 32,
});

console.log(resp.choices[0].message?.content);
```
- Python (openai>=1.0.0)
```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:11435/v1", api_key="sk-local")

resp = client.chat.completions.create(
    model="REPLACE_WITH_MODEL",
    messages=[{"role": "user", "content": "Say hi in 5 words."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```
⚡ Zero Configuration Required
- Automatically finds models from Hugging Face cache, Ollama, local dirs
- Auto-allocates ports to avoid conflicts
- Auto-detects LoRA adapters for specialized models
- Just works - no config files, no setup wizards
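The "auto-allocates ports" behavior can be illustrated with a small Python sketch: binding to port 0 asks the operating system for any free port, the standard trick for conflict-free allocation. The function name is illustrative only; Shimmy's actual implementation is in Rust.

```python
import socket

def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))          # port 0 => kernel picks a free port
        return s.getsockname()[1]  # the port actually assigned

port = find_free_port()
print(f"serving on http://127.0.0.1:{port}")
```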
🧠 Advanced MOE (Mixture of Experts) Support
Run 70B+ models on consumer hardware with intelligent CPU/GPU hybrid processing:
- 🔄 CPU MOE Offloading: Automatically distribute model layers across CPU and GPU
- 🧮 Intelligent Layer Placement: Optimizes which layers run where for maximum performance
- 💾 Memory Efficiency: Fit larger models in limited VRAM by using system RAM strategically
- ⚡ Hybrid Acceleration: Get GPU speed where it matters most, CPU reliability everywhere else
- 🎛️ Configurable: `--cpu-moe` and `--n-cpu-moe` flags for fine control
```bash
# Enable MOE CPU offloading during installation
cargo install shimmy --features moe

# Run with MOE hybrid processing
shimmy serve --cpu-moe --n-cpu-moe 8
```
Automatically balances: GPU layers (fast) + CPU layers (memory-efficient)
Perfect for: Large models (70B+), limited VRAM systems, cost-effective inference
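To make the hybrid idea concrete, here is a hypothetical Python sketch of splitting layers between CPU and GPU in the spirit of `--n-cpu-moe`. The placement rule (first N layers to CPU) is an assumption for illustration, not Shimmy's actual strategy.

```python
def plan_moe_layers(total_layers: int, n_cpu_moe: int) -> dict:
    """Hypothetical sketch: place the first n_cpu_moe layers on CPU,
    keep the remainder on GPU (illustrating --n-cpu-moe, not Shimmy's
    real placement logic)."""
    n_cpu = min(n_cpu_moe, total_layers)
    return {
        "cpu_layers": list(range(n_cpu)),
        "gpu_layers": list(range(n_cpu, total_layers)),
    }

plan = plan_moe_layers(total_layers=32, n_cpu_moe=8)
print(len(plan["cpu_layers"]), len(plan["gpu_layers"]))  # 8 24
```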
🎯 Perfect for Local Development
- Privacy: Your code never leaves your machine
- Cost: No API keys, no per-token billing
- Speed: Local inference, sub-second responses
- Reliability: No rate limits, no downtime
Quick Start (30 seconds)
Installation
✨ v1.9.0 NEW: Download pre-built binaries with ALL GPU backends included!
📥 Pre-Built Binaries (Recommended - Zero Dependencies)
Pick your platform and download - no compilation needed:
```bash
# Windows x64 (includes CUDA + Vulkan + OpenCL)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-windows-x86_64.exe -o shimmy.exe

# Linux x86_64 (includes CUDA + Vulkan + OpenCL)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-x86_64 -o shimmy && chmod +x shimmy

# macOS ARM64 (includes MLX for Apple Silicon)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-arm64 -o shimmy && chmod +x shimmy

# macOS Intel (CPU-only)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-intel -o shimmy && chmod +x shimmy

# Linux ARM64 (CPU-only)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-aarch64 -o shimmy && chmod +x shimmy
```
That's it! Your GPU will be detected automatically at runtime.
🛠️ Build from Source (Advanced)
Want to customize or contribute?
```bash
# Basic installation (CPU only)
cargo install shimmy --features huggingface

# Kitchen-sink builds (what pre-built binaries use):

# Windows/Linux x64:
cargo install shimmy --features huggingface,llama,llama-cuda,llama-vulkan,llama-opencl,vision

# macOS ARM64:
cargo install shimmy --features huggingface,llama,mlx,vision

# CPU-only (any platform):
cargo install shimmy --features huggingface,llama,vision
```
⚠️ Build Notes:
- Windows: Install LLVM first for libclang.dll
- Recommended: Use pre-built binaries to avoid dependency issues
- Advanced users only: Building from source requires C++ compiler + CUDA/Vulkan SDKs
GPU Acceleration
✨ NEW in v1.9.0: One binary per platform with automatic GPU detection!
⚠️ IMPORTANT - Vision Feature Performance:
CPU-based vision inference (MiniCPM-V) is 5-10x slower than GPU acceleration.
CPU: 15-45 seconds per image | GPU (CUDA/Vulkan): 2-8 seconds per image
For production vision workloads, GPU acceleration is strongly recommended.
📥 Download Pre-Built Binaries (Recommended)
No compilation needed! Each binary includes ALL GPU backends for your platform:
| Platform | Download | GPU Support | Auto-Detects |
|---|---|---|---|
| Windows x64 | shimmy-windows-x86_64.exe | CUDA + Vulkan + OpenCL | ✅ |
| Linux x86_64 | shimmy-linux-x86_64 | CUDA + Vulkan + OpenCL | ✅ |
| macOS ARM64 | shimmy-macos-arm64 | MLX (Apple Silicon) | ✅ |
| macOS Intel | shimmy-macos-intel | CPU only | N/A |
| Linux ARM64 | shimmy-linux-aarch64 | CPU only | N/A |
How it works: Download one file, run it. Shimmy automatically detects and uses your GPU!
```bash
# Windows example
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-windows-x86_64.exe -o shimmy.exe
./shimmy.exe serve --gpu-backend auto  # Auto-detects CUDA/Vulkan/OpenCL

# Linux example
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-x86_64 -o shimmy
chmod +x shimmy
./shimmy serve --gpu-backend auto  # Auto-detects CUDA/Vulkan/OpenCL

# macOS ARM64 example
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-arm64 -o shimmy
chmod +x shimmy
./shimmy serve  # Auto-detects MLX on Apple Silicon
```
🎯 GPU Auto-Detection
Shimmy uses intelligent GPU detection with this priority order:
1. CUDA (NVIDIA GPUs via `nvidia-smi`)
2. Vulkan (cross-platform GPUs via `vulkaninfo`)
3. OpenCL (AMD/Intel GPUs via `clinfo`)
4. MLX (Apple Silicon via system detection)
5. CPU (fallback if no GPU detected)
No manual configuration needed! Just run with --gpu-backend auto (default).
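The priority order above amounts to a first-match scan over the detected backends. A minimal Python sketch of that idea (illustrative only; Shimmy itself is Rust):

```python
PRIORITY = ["cuda", "vulkan", "opencl", "mlx", "cpu"]

def pick_backend(available: set) -> str:
    """Return the highest-priority backend present; CPU always works."""
    for backend in PRIORITY:
        if backend == "cpu" or backend in available:
            return backend
    return "cpu"

print(pick_backend({"vulkan", "opencl"}))  # vulkan
print(pick_backend(set()))                 # cpu
```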
🔧 Manual Backend Override
Want to force a specific backend? Use the --gpu-backend flag:
```bash
# Auto-detect (default - recommended)
shimmy serve --gpu-backend auto

# Force CPU (for testing or compatibility)
shimmy serve --gpu-backend cpu

# Force CUDA (NVIDIA GPUs only)
shimmy serve --gpu-backend cuda

# Force Vulkan (AMD/Intel/Cross-platform)
shimmy serve --gpu-backend vulkan

# Force OpenCL (AMD/Intel alternative)
shimmy serve --gpu-backend opencl
```
🛡️ Error Handling & Robustness: If you force an unavailable backend (e.g., `--gpu-backend cuda` on an AMD GPU), Shimmy will:
- ✅ Display a clear error message explaining the issue
- ✅ Automatically fall back to the next available backend in priority order
- ✅ Log which backend was actually used (check with `--verbose`)
- ✅ Continue serving requests (graceful degradation, no crashes)
- ✅ Support an environment variable override: `SHIMMY_GPU_BACKEND=cuda`
Common scenarios:
- `--gpu-backend cuda` on non-NVIDIA → Falls back to Vulkan or OpenCL
- `--gpu-backend vulkan` without drivers → Falls back to OpenCL or CPU
- `--gpu-backend invalid` → Clear error + fallback to auto-detection
- No GPU detected → Runs on CPU with a performance warning
Environment Variable: Set `SHIMMY_GPU_BACKEND=cuda` to override the default backend without CLI flags.
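The resolution order can be sketched as a small Python model of the behavior described above. The precedence shown (an explicit CLI flag winning over the environment variable, with `auto` as the final default) is an assumption for illustration, not a statement of Shimmy's actual Rust code:

```python
import os
from typing import Optional

def resolve_backend(cli_flag: Optional[str]) -> str:
    """Assumed precedence: explicit CLI flag > SHIMMY_GPU_BACKEND > 'auto'."""
    if cli_flag:
        return cli_flag
    return os.environ.get("SHIMMY_GPU_BACKEND", "auto")

os.environ["SHIMMY_GPU_BACKEND"] = "cuda"
print(resolve_backend(None))      # cuda
print(resolve_backend("vulkan"))  # vulkan
```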
🔍 Check GPU Support
```bash
# Show detected GPU backends
shimmy gpu-info

# Check which backend is being used
shimmy serve --gpu-backend auto --verbose
```
⚡ Binary Sizes
- GPU-enabled binaries (Windows/Linux x64, macOS ARM64): ~40-50MB
- CPU-only binaries (macOS Intel, Linux ARM64): ~20-30MB
Trade-off: Slightly larger binaries for zero compilation and automatic GPU detection.
🛠️ Build from Source (Advanced)
Want to customize or contribute? Build from source:
- Multiple backends can be compiled in; the best one is selected automatically
- Use `--gpu-backend <backend>` to force a specific backend
Get Models
Shimmy auto-discovers models from:
- Hugging Face cache: `~/.cache/huggingface/hub/`
- Ollama models: `~/.ollama/models/`
- Local directory: `./models/`
- Environment: `SHIMMY_BASE_GGUF=path/to/model.gguf`
```bash
# Download models that work out of the box
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-gguf --local-dir ./models/
huggingface-cli download bartowski/Llama-3.2-1B-Instruct-GGUF --local-dir ./models/
```
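Auto-discovery boils down to scanning the search paths above for `*.gguf` files. A hypothetical Python sketch of that idea, demonstrated against a throwaway directory rather than the real caches (not Shimmy's actual Rust implementation):

```python
from pathlib import Path
import tempfile

def discover_gguf(dirs):
    """Illustrative sketch of auto-discovery: collect *.gguf files
    from the given search paths, skipping paths that don't exist."""
    found = []
    for d in dirs:
        root = Path(d).expanduser()
        if root.is_dir():
            found.extend(sorted(root.rglob("*.gguf")))
    return found

# Demo against a temporary directory instead of the real caches
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "phi-3-mini.gguf").touch()
    models = discover_gguf([tmp])
    print([m.name for m in models])  # ['phi-3-mini.gguf']
```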
Start Server
```bash
# Auto-allocates port to avoid conflicts
shimmy serve

# Or use a manual port
shimmy serve --bind 127.0.0.1:11435
```
Point your development tools to the displayed port — VSCode Copilot, Cursor, Continue.dev all work instantly.
📦 Download & Install
Package Managers
- Rust: `cargo install shimmy --features moe` (recommended)
- Rust (basic): `cargo install shimmy`
- VS Code: Shimmy Extension
- Windows MSVC: Uses `shimmy-llama-cpp-2` packages for better compatibility
- npm: `npm install -g shimmy-js` (planned)
- Python: `pip install shimmy` (planned)
Direct Downloads
- GitHub Releases: Latest binaries
- Docker: `docker pull shimmy/shimmy:latest` (coming soon)
🍎 macOS Support
Full compatibility confirmed! Shimmy works flawlessly on macOS with Metal GPU acceleration.
```bash
# Install dependencies
brew install cmake rust

# Install shimmy
cargo install shimmy
```
✅ Verified working:
- Intel and Apple Silicon Macs
- Metal GPU acceleration (automatic)
- MLX native acceleration for Apple Silicon
- Xcode 17+ compatibility
- All LoRA adapter features
Integration Examples
VSCode Copilot
```json
{
  "github.copilot.advanced": {
    "serverUrl": "http://localhost:11435"
  }
}
```
Continue.dev
```json
{
  "models": [{
    "title": "Local Shimmy",
    "provider": "openai",
    "model": "your-model-name",
    "apiBase": "http://localhost:11435/v1"
  }]
}
```
Cursor IDE
Works out of the box - just point to http://localhost:11435/v1
Why Shimmy Will Always Be Free
I built Shimmy to keep privacy-first control over my AI development and to keep things local and lean.
This is my commitment: Shimmy stays MIT licensed, forever. If you want to support development, sponsor it. If you don't, just build something cool with it.
💡 Shimmy saves you time and money. If it's useful, consider sponsoring for $5/month — less than your Netflix subscription, infinitely more useful for developers.
API Reference
Endpoints
- `GET /health` - Health check
- `POST /v1/chat/completions` - OpenAI-compatible chat
- `GET /v1/models` - List available models
- `POST /api/generate` - Shimmy native API
- `GET /ws/generate` - WebSocket streaming
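As a minimal sketch, the payload for `POST /v1/chat/completions` follows the standard OpenAI chat shape. The model name below is a placeholder; substitute one reported by `shimmy list`:

```python
import json

def chat_request(model: str, prompt: str, max_tokens: int = 32) -> dict:
    """Build an OpenAI-style chat payload for POST /v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

# Serialize and send this body to http://127.0.0.1:11435/v1/chat/completions
body = json.dumps(chat_request("phi-3-mini", "Say hi in 5 words."))
print(body)
```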
CLI Commands
```bash
shimmy serve                           # Start server (auto port allocation)
shimmy serve --bind 127.0.0.1:8080     # Manual port binding
shimmy serve --cpu-moe --n-cpu-moe 8   # Enable MOE CPU offloading
shimmy list                            # Show available models (LLM-filtered)
shimmy discover                        # Refresh model discovery
shimmy generate --name X --prompt "Hi" # Test generation
shimmy probe model-name                # Verify model loads
shimmy gpu-info                        # Show GPU backend status
```
Technical Architecture
- Rust + Tokio: Memory-safe, async performance
- llama.cpp backend: Industry-standard GGUF inference
- OpenAI API compatibility: Drop-in replacement
- Dynamic port management: Zero conflicts, auto-allocation
- Zero-config auto-discovery: Just works™
🚀 Advanced Features
- 🧠 MOE CPU Offloading: Hybrid GPU/CPU processing for large models (70B+)
- 🎯 Smart Model Filtering: Automatically excludes non-language models (Stable Diffusion, Whisper, CLIP)
- 🛡️ 6-Gate Release Validation: Constitutional quality limits ensure reliability
- ⚡ Smart Model Preloading: Background loading with usage tracking for instant model switching
- 💾 Response Caching: LRU + TTL cache delivering 20-40% performance gains on repeat queries
- 🚀 Integration Templates: One-command deployment for Docker, Kubernetes, Railway, Fly.io, FastAPI, Express
- 🔄 Request Routing: Multi-instance support with health checking and load balancing
- 📊 Advanced Observability: Real-time metrics with self-optimization and Prometheus integration
- 🔗 RustChain Integration: Universal workflow transpilation with workflow orchestration
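The LRU + TTL response cache mentioned above can be sketched in a few lines of Python. This toy version (the capacity and TTL values are arbitrary) shows the two eviction rules working together; it is not Shimmy's actual Rust implementation:

```python
import time
from collections import OrderedDict

class LruTtlCache:
    """Toy LRU + TTL cache illustrating the response-caching idea."""
    def __init__(self, capacity: int = 128, ttl: float = 60.0):
        self.capacity, self.ttl = capacity, ttl
        self._data = OrderedDict()  # key -> (expiry, value)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        expiry, value = item
        if time.monotonic() > expiry:   # expired: drop and report a miss
            del self._data[key]
            return None
        self._data.move_to_end(key)     # mark as most recently used
        return value

    def put(self, key, value):
        self._data[key] = (time.monotonic() + self.ttl, value)
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used

cache = LruTtlCache(capacity=2, ttl=30.0)
cache.put("prompt-a", "response-a")
cache.put("prompt-b", "response-b")
cache.get("prompt-a")                # touch a, so b becomes the LRU entry
cache.put("prompt-c", "response-c")  # evicts prompt-b
print(cache.get("prompt-b"))         # None
print(cache.get("prompt-a"))         # response-a
```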
Community & Support
- 🐛 Bug Reports: GitHub Issues
- 💬 Discussions: GitHub Discussions
- 📖 Documentation: docs/ • Engineering Methodology • OpenAI Compatibility Matrix • Benchmarks (Reproducible)
- 💝 Sponsorship: GitHub Sponsors
Star History
🚀 Momentum Snapshot
📦 Sub-5MB single binary (142x smaller than Ollama)
🌟 stars and climbing fast
⏱ <1s startup
🦀 100% Rust, no Python
📰 As Featured On
🔥 Hacker News • Front Page Again • IPE Newsletter
Companies: Need invoicing? Email michaelallenkuykendall@gmail.com
⚡ Performance Comparison
| Tool | Binary Size | Startup Time | Memory Usage | OpenAI API |
|---|---|---|---|---|
| Shimmy | 4.8MB | <100ms | 50MB | 100% |
| Ollama | 680MB | 5-10s | 200MB+ | Partial |
| llama.cpp | 89MB | 1-2s | 100MB | Via llama-server |
Quality & Reliability
Shimmy maintains high code quality through comprehensive testing:
- Comprehensive test suite with property-based testing
- Automated CI/CD pipeline with quality gates
- Runtime invariant checking for critical operations
- Cross-platform compatibility testing
Development Testing
Run the complete test suite:
```bash
# Using cargo aliases
cargo test-quick   # Quick development tests

# Using Makefile
make test          # Full test suite
make test-quick    # Quick development tests
```
See our testing approach for technical details.
License & Philosophy
MIT License - forever and always.
Philosophy: Infrastructure should be invisible. Shimmy is infrastructure.
Testing Philosophy: Reliability through comprehensive validation and property-based testing.
Forever maintainer: Michael A. Kuykendall
Promise: This will never become a paid product
Mission: Making local model inference simple and reliable