Litmux

Unit tests for AI. Test prompts, compare models, catch regressions.

pip install litmux && litmux init && litmux run

Why

Every team shipping AI features hits the same three problems:

No testing standard. REST has Postman, frontends have Cypress. LLM calls have manual spot-checking.
Prompt regression is invisible. A one-word change can silently break 15% of edge cases.
Model selection is vibes. "We use GPT-4o because it's good" — but is it $15k/month better than Gemini Flash?

Litmux gives you a YAML config, pass/fail assertions, and a cost report. That's it.

Quick Start

pip install litmux cp .env.example .env Add at least one: OPENAI_API_KEY, ANTHROPIC_API_KEY, GOOGLE_API_KEY, HF_TOKEN

litmux init # scaffold a project litmux run # run tests against all configured models

No database, no cloud account, no Docker.

Core Commands

`litmux run` — unit tests for prompts

# litmux.yaml
models:
  - model: gpt-4o-mini
  - model: claude-haiku-4-5-20251001
tests:

name: summarize_earnings
prompt: prompts/summarize.txt
inputs:
text: "Revenue grew 15% to $4.2 billion..."
assert:

type: contains
value: "revenue"
type: cost-less-than
value: 0.01

`litmux eval` — bulk evaluation against datasets

evals:
  - name: ticket_classifier
    prompt: prompts/classify.txt
    dataset: datasets/support_tickets.csv
    input_mapping:
      ticket: text
    expected: expected_category
    assert:
      - type: json-valid
    judge:
      criteria: "Did the model correctly classify the ticket?"
      threshold: 7.0

`litmux generate` — AI-generated test datasets

litmux generate \
  --prompt prompts/classify.txt \
  --seed datasets/sample_tickets.csv \
  --n 50 \
  --output datasets/support_tickets.csv

`litmux cost` — cost projection across models

litmux cost --volume 50000

Finds the cheapest model that passes your tests.

`litmux compare` — side-by-side model outputs

litmux compare

Cloud (Optional, Free)

Sync results to a hosted dashboard for history, trends, and team visibility.

litmux login       # one-time browser auth
litmux run         # results auto-sync
litmux dashboard   # open app.litmux.dev

The CLI works fully offline. Cloud is opt-in.

Assertion Types

Type	Description
`contains`	Output contains substring
`not-contains`	Output does not contain substring
`regex`	Output matches regex pattern
`json-valid`	Output is valid JSON
`json-schema`	Output has required JSON keys
`cost-less-than`	Cost below threshold (USD)
`latency-less-than`	Latency below threshold (ms)
`llm-judge`	LLM scores output 1–10 against criteria

CI/CD

# .github/workflows/litmux.yml
- run: litmux run --ci
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Configuration

models:
  - provider: openai | anthropic | google | huggingface
    model: string
    temperature: 0.0
    max_tokens: 1024
defaultTest:
assert:
- type: cost-less-than
value: 0.01
tests:

name: string
prompt: path/to/prompt.txt
inputs: { variable: "value" }
assert:

type: contains
value: "expected"



evals:

name: string
prompt: path/to/prompt.txt
dataset: path/to/data.csv
input_mapping: { prompt_var: csv_column }
expected: csv_column
assert: [...]
judge:
criteria: "..."
threshold: 7.0

Environment Variables

Variable	Purpose
`OPENAI_API_KEY`	OpenAI models, LLM judge, dataset generation
`ANTHROPIC_API_KEY`	Anthropic models
`GOOGLE_API_KEY`	Google models
`HF_TOKEN`	HuggingFace models
`LITMUX_NO_CACHE`	Set to `1` to skip the response cache
`LITMUX_API_URL`	Override cloud API endpoint (default: `https://api.litmux.dev`)
`LITMUX_API_URL_ALLOW_INSECURE`	Set to `1` to allow non-HTTPS `LITMUX_API_URL` (local dev only)
`LITMUX_DASHBOARD_URL`	Override dashboard URL (default: `https://app.litmux.dev`)
`LITMUX_JUDGE_MODEL`	LLM model used for `llm-judge` assertions (default: `gpt-4o-mini`)
`LITMUX_CLOUD_ENABLED`	Set to `1` to opt in to Litmux Cloud (private beta)

All Commands

litmux run                    Run all tests
litmux run -t <name>          Run a specific test
litmux run --ci               CI output (markdown)
litmux eval                   Run all evals
litmux eval --limit 10        Evaluate first N rows
litmux generate ...           Generate a test dataset
litmux compare                Side-by-side model outputs
litmux cost -v 50000          Project monthly cost
litmux cache                  View / clear response cache
litmux init                   Scaffold a new project
litmux version                Show version
Cloud (private beta — join the waitlist at https://litmux.dev)
litmux login                  Authenticate with Litmux Cloud
litmux logout                 Remove saved credentials
litmux history                Recent runs from cloud
litmux dashboard              Open the dashboard

Examples

See examples/ for three ready-to-run projects:

01-quickstart — minimal single-model test
02-multi-model — compare across providers
03-generate-and-eval — AI-generated dataset + LLM judge

License

MIT

NAME

SYNOPSIS

INFO

DESCRIPTION

README

Litmux

Why

Quick Start

Add at least one: OPENAI_API_KEY, ANTHROPIC_API_KEY, GOOGLE_API_KEY, HF_TOKEN

Core Commands

`litmux run` — unit tests for prompts

`litmux eval` — bulk evaluation against datasets

`litmux generate` — AI-generated test datasets

`litmux cost` — cost projection across models

`litmux compare` — side-by-side model outputs

Cloud (Optional, Free)

Assertion Types

CI/CD

Configuration

Environment Variables

All Commands

Cloud (private beta — join the waitlist at https://litmux.dev)

Examples

License

SEE ALSO

NAME

SYNOPSIS

INFO

DESCRIPTION

README

Litmux

Why

Quick Start

Add at least one: OPENAI_API_KEY, ANTHROPIC_API_KEY, GOOGLE_API_KEY, HF_TOKEN

Core Commands

litmux run — unit tests for prompts

litmux eval — bulk evaluation against datasets

litmux generate — AI-generated test datasets

litmux cost — cost projection across models

litmux compare — side-by-side model outputs

Cloud (Optional, Free)

Assertion Types

CI/CD

Configuration

Environment Variables

All Commands

Cloud (private beta — join the waitlist at https://litmux.dev)

Examples

License

SEE ALSO

`litmux run` — unit tests for prompts

`litmux eval` — bulk evaluation against datasets

`litmux generate` — AI-generated test datasets

`litmux cost` — cost projection across models

`litmux compare` — side-by-side model outputs