LLM Cost Optimization Platform — Reduce AI Inference Costs by 10x
Same task quality. Fraction of the cost. Validated on your data, not benchmarks.
Work email required. Launch updates only — no spam.
How it works
Profile. Build task models. Gate deploys with your evals.
Analyze your workflow
LeanLM connects to your existing LLM API calls and profiles every task in your pipeline — classification, extraction, summarization, generation, agentic workflows. It identifies which calls are paying frontier prices for work a smaller, faster model can handle.
Train and configure replacements
For each task, LeanLM applies the right optimization technique — model routing, prompt caching, knowledge distillation, quantization, batching, or context compression — to produce a cheaper alternative. A summarization task running on Gemini 2.5 Pro at $10/M output tokens can often be handled by Flash Lite at $0.40/M — 25x cheaper. Smaller models also mean faster inference and lower latency.
Deploy and verify
Swap in the optimized models automatically. LeanLM validates every replacement against your actual production data — not synthetic benchmarks — so quality is measured the way your business measures it. If a replacement doesn't meet the bar, the original model stays.
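The deploy gate described above can be sketched in a few lines. The 96% threshold, the exact-match comparison, and the function names are illustrative assumptions, not LeanLM's actual API:

```python
def passes_quality_gate(candidate_outputs, baseline_outputs, matches, threshold=0.96):
    """Approve a cheaper replacement only if its outputs agree with the
    incumbent model's outputs on at least `threshold` of production samples."""
    if len(candidate_outputs) != len(baseline_outputs):
        raise ValueError("candidate and baseline must cover the same samples")
    agreed = sum(1 for c, b in zip(candidate_outputs, baseline_outputs) if matches(c, b))
    return agreed / len(baseline_outputs) >= threshold

# Example comparison: exact match over logged production outputs.
exact = lambda c, b: c.strip() == b.strip()
```

In practice the comparison function is task-specific — exact match for classification, a similarity or rubric score for summarization — but the gate is the same: no pass, no swap.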
What LeanLM optimizes
Most teams over-model. LeanLM finds where — and fixes it.
Over-modeling is when engineering teams use frontier LLMs like GPT-4 for tasks that smaller, cheaper models handle at equal quality — paying 10–25x more than necessary. The industry calls it right-sizing: matching model capability to task complexity. LeanLM automates it.
Over-Modeling Detection
Most teams use frontier models for every call — even when a smaller model handles the task fine. LeanLM profiles every API call in your pipeline and flags where you're paying for intelligence you don't need. This is the diagnostic that drives everything else.
Model Routing
Route each query to the cheapest model that can handle it — individually or in batches. Classification doesn't need GPT-4.
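Routing reduces to picking the cheapest model whose capability covers the task. A minimal sketch — the model names, prices, and complexity heuristic below are made up for illustration:

```python
# Hypothetical model tiers; prices are per million output tokens.
MODELS = [
    {"name": "small-model",    "cost_per_m": 0.40,  "max_complexity": 1},
    {"name": "mid-model",      "cost_per_m": 0.80,  "max_complexity": 2},
    {"name": "frontier-model", "cost_per_m": 10.00, "max_complexity": 3},
]

def estimate_complexity(task_type: str) -> int:
    """Toy heuristic: map a task type to the capability level it needs."""
    levels = {"classification": 1, "extraction": 1, "summarization": 2, "generation": 3}
    return levels.get(task_type, 3)  # unknown tasks default to the top tier

def route(task_type: str) -> str:
    """Return the cheapest model capable of handling the task."""
    needed = estimate_complexity(task_type)
    capable = [m for m in MODELS if m["max_complexity"] >= needed]
    return min(capable, key=lambda m: m["cost_per_m"])["name"]
```

A real router scores individual queries, not just task types, but the cost logic is the same: never pay frontier prices for a level-1 task.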
Prompt Caching
Restructure prompts to maximize cache hits. Pay a fraction for repeated system instructions and context.
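Provider-side prompt caching keys on a stable leading prefix, so the restructuring mostly amounts to ordering content from most static to most dynamic. A sketch of the idea (the message layout is an assumption — check your provider's caching rules):

```python
def build_cacheable_messages(system_instructions: str, shared_context: str, user_query: str):
    """Order messages static-first so the provider's prefix cache can
    reuse the expensive leading tokens across many calls."""
    return [
        # Identical on every call: eligible for cache hits.
        {"role": "system", "content": system_instructions},
        # Changes rarely (e.g. per customer or session): still cache-friendly.
        {"role": "user", "content": shared_context},
        # Changes every call: keep it last so it never invalidates the prefix.
        {"role": "user", "content": user_query},
    ]
```

Putting a timestamp or request ID at the top of the system prompt is the classic mistake — one dynamic token at the front invalidates the entire cached prefix.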
Semantic Caching
Skip redundant API calls entirely. Similar queries get instant responses from cache, with context optimization to reduce token volume.
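The mechanism behind semantic caching: embed each query, and if a new query lands close enough to a cached one, return the stored response without calling the model. The sketch below uses a toy bag-of-words embedding so it runs standalone; a production cache would use a real embedding model and a vector index:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, query: str):
        q = embed(query)
        for emb, response in self.entries:
            if cosine(q, emb) >= self.threshold:
                return response  # near-duplicate query: skip the API call
        return None  # cache miss: caller pays for a real inference

    def put(self, query: str, response: str):
        self.entries.append((embed(query), response))
```

The similarity threshold is the quality knob: set it too low and distinct questions get each other's answers, too high and the cache never hits.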
Model Distillation
Train smaller task-specific models from frontier outputs via knowledge distillation, quantization, and compression. The core of what LeanLM builds.
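The data-collection half of distillation can be sketched as reshaping logged frontier calls into a fine-tuning corpus: the frontier model's outputs become the student model's training targets. The JSONL message format below mirrors common fine-tuning APIs but is an assumption, not LeanLM's internal format:

```python
import json

def build_distillation_corpus(logged_calls, out_path):
    """Turn logged (prompt, frontier_response) pairs into a JSONL
    fine-tuning set for a smaller task-specific student model."""
    with open(out_path, "w") as f:
        for prompt, frontier_response in logged_calls:
            record = {"messages": [
                {"role": "user", "content": prompt},
                # The teacher's answer is the label the student learns to match.
                {"role": "assistant", "content": frontier_response},
            ]}
            f.write(json.dumps(record) + "\n")
```

After fine-tuning, quantization and compression shrink the student further — but only the eval gate decides whether it ships.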
LeanLM also applies batch inference, context compression, quantization, and KV cache optimization based on your workload profile.
The cost gap
Frontier vs. efficient models
As of February 2026. Prices per million output tokens.
| Provider | Frontier Model | Cost | Efficient Model | Cost | Savings |
|---|---|---|---|---|---|
| Google | Gemini 2.5 Pro | $10.00 | Flash Lite | $0.40 | 25x |
| OpenAI | GPT-4.1 | $8.00 | GPT-4.1 nano | $0.80 | 10x |
| Anthropic | Sonnet 4.5 | $15.00 | Haiku 4.5 | $5.00 | 3x |
- **10x** — cost reduction on high-volume extraction & classification
- **96%+** — eval pass rate vs. baseline on your production data
- **<1 day** — to start: point your base URL at LeanLM and analysis begins
Stop overpaying for AI inference
FAQ
Frequently asked questions
How does LeanLM measure output quality?
Most model benchmarks test against generic datasets like MMLU or HumanEval. LeanLM validates every optimization against your actual production outputs — the real queries, edge cases, and quality standards your team already cares about. If a cheaper model can't match what you're getting today on your data, it doesn't deploy. The original model stays until the replacement proves it can match fidelity on your workload.
What's the difference between model routing and model optimization?
Model routers send each query to a pre-existing model based on complexity or cost. LeanLM goes further — it doesn't just route, it creates the cheaper model. Using fine-tuning, knowledge distillation, and prompt optimization, LeanLM trains task-specific replacements that didn't exist before. Routing picks from what's available. LeanLM builds what's optimal for your specific workload.
Can open-source models replace GPT-4 in production?
For many production tasks, yes. Models like Llama 3, Mistral, and Qwen have closed the quality gap with frontier models on focused tasks — classification, extraction, summarization, structured generation. The challenge is knowing which tasks are safe to migrate and validating that quality holds. That's what LeanLM automates: it identifies which calls can move to open-source or cheaper frontier tiers, trains the replacement, and verifies output fidelity before any swap.
How much cheaper are smaller models than frontier models?
The gap is massive. As of February 2026: Gemini 2.5 Pro costs $10/M output tokens — Flash Lite costs $0.40/M, a 25x difference. GPT-4.1 costs $8/M output — GPT-4.1 nano costs $0.80/M, a 10x gap. Claude Sonnet 4.5 costs $15/M output — Haiku 4.5 costs $5/M, 3x cheaper. LeanLM finds which of your tasks can safely move down these tiers and handles the migration automatically.
How do smaller models reduce security risk?
A frontier model used for a simple classification task still carries the ability to execute code, call tools, and follow complex multi-step instructions — all of which expand the surface area for prompt injection. A smaller model fine-tuned for that same task has a drastically reduced capability set. It's a least-privilege approach: LeanLM scopes each replacement to the narrowest model that can handle the job, constraining what's possible if inputs are adversarial.
How long does it take to integrate?
Point your SDK or API base URL at LeanLM — a one-line config change. LeanLM begins profiling your workflow immediately. The initial analysis takes hours, not weeks. Replacement models are trained and deployed incrementally — you start saving on the easiest wins first.
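For SDKs that honor a base-URL override — the OpenAI Python SDK reads the `OPENAI_BASE_URL` environment variable, for example — the one-line change can look like this. The endpoint shown is a placeholder, not a real LeanLM URL:

```python
import os

# Placeholder endpoint: substitute the base URL from your LeanLM account.
os.environ["OPENAI_BASE_URL"] = "https://proxy.leanlm.example/v1"
# From here, clients constructed by the SDK send traffic through the proxy,
# which profiles each call and routes it to a right-sized model.
```

The same pattern applies to any client that accepts a `base_url` constructor argument instead of an environment variable.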
What is over-modeling and why does it cost so much?
Over-modeling means using a frontier LLM like GPT-4 or Claude Opus for tasks that a smaller, cheaper model handles just as well — classification, extraction, formatting, simple summarization. Most teams default to their most capable model for every call because it's easier than evaluating alternatives. The result: 10–25x higher costs on tasks that don't benefit from frontier intelligence. LeanLM detects over-modeled calls automatically and applies the right optimization — routing, distillation, caching, or compression — so you only pay for the intelligence each task actually needs.
Chris Cholette, Founder
Engineering leader with experience building and optimizing ML inference systems at scale. Founded LeanLM to solve the cost problem he saw firsthand — most AI API calls use models 10x more expensive than needed.