Cutting Your AI API Bill in Half: 12 Token Optimization Tactics That Actually Work in 2026
A practitioner's guide to reducing OpenAI, Claude, Gemini, and Llama API costs in production. Twelve tactics with real dollar figures, working code, and the order to apply them, from the obvious ones everyone knows to the boring ones at the bottom of the list that quietly save the most money.
The honest truth about AI API costs: nobody is surprised by their first bill. Everyone is surprised by their third. The first one is "oh, forty bucks, less than Netflix, fine." The third one is "wait, four thousand two hundred dollars, what changed?"
I've been on the receiving end of three different "why did our AI bill triple this month?" investigations in the last eighteen months: once at a startup I was advising, twice in my own production systems. The pattern is the same every time: nothing changed. Or rather, nothing that anyone wrote down changed. A teammate added one extra few-shot example to the system prompt. The chat history quietly grew from "last 5 messages" to "last 20." The "ah, just throw it all in context" pattern silently went from $0.003 per call to $0.04 per call. Times 200,000 calls a day. Welcome to month three.
Here are twelve tactics I've actually used to drag those bills back down. Most give you double-digit percentage savings each, and the boring ones at the bottom of the list are often the highest-leverage. None of them require switching providers or compromising on output quality. A few require you to actually measure where your tokens go, which is the single biggest unlock. Let's start there.
0. Measure first. Always.
Before you optimize anything, count tokens. The reason most "let me cut prompt size" attempts fail is that engineers cut the wrong thing, usually the visible system prompt, while the real culprit (a 6KB tool-calling JSON schema, a verbose few-shot example, or a chat history that's been accumulating since Tuesday) goes untouched.
Don't eyeball this. Don't trust character counts. Tokens and characters are not the same thing: English prose runs roughly 4 characters per token, but the ratio ranges from about 1.5 characters per token (dense Unicode) to 8 or more (runs of whitespace), and JSON sits somewhere in the middle. A "small" prompt can be much bigger than it looks.
Use a real token counter. I built one I use daily, the LLM Token Counter & Cost Estimator, that supports the major models with the actual tiktoken encodings for OpenAI and calibrated estimates for Claude, Gemini, Grok, DeepSeek, and Llama. Paste your prompt, see the count, see the cost per call, see which scenario (chat / classification / long-doc) lands where in your context window. The token visualization alone has saved me hours of "wait, why is getUserById six tokens?" debugging.
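For a quick script-level check, a few lines go a long way. A minimal sketch, assuming the js-tiktoken npm package (and a recent version that ships the o200k_base encoding); the price constant mirrors the GPT-5 input figure from the pricing table in #1, and SYSTEM_PROMPT / TOOL_DEFINITIONS stand in for whatever you actually send:

```ts
// npm install js-tiktoken
import { getEncoding } from 'js-tiktoken';

const enc = getEncoding('o200k_base'); // encoding used by recent OpenAI models

// Illustrative price: GPT-5 input from the table below, in $ per million tokens
const INPUT_PRICE_PER_M = 1.25;

function report(label: string, text: string) {
  const tokens = enc.encode(text).length;
  const costPerCall = (tokens / 1_000_000) * INPUT_PRICE_PER_M;
  console.log(`${label}: ${tokens} tokens, ~$${costPerCall.toFixed(5)} input per call`);
}

report('system prompt', SYSTEM_PROMPT);                   // your real system prompt
report('tool schemas', JSON.stringify(TOOL_DEFINITIONS)); // the JSON the API actually sees
```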
Once you have a number, the rest of this list is just deciding which lever to pull.
1. Use the cheapest model that still works (it's usually one tier down)
The single biggest waste I see in production: using a flagship model for work a smaller model can do.
Concrete numbers, OpenAI as of May 2026:
| Model | Input $/M | Output $/M | When |
|---|---|---|---|
| GPT-5 | $1.25 | $10.00 | Hard reasoning, complex tool calls |
| GPT-5 mini | $0.25 | $2.00 | Most chat, summarization, RAG synthesis |
| GPT-5 nano | $0.05 | $0.40 | Classification, intent detection, formatting |
Going from GPT-5 to GPT-5 nano for a simple classification task is a 25× cost reduction with minimal quality loss for most use cases. Not 25%. Twenty-five times. People resist this because "but the flagship is smarter." Yes, and you're paying for that smartness on tasks that don't need it.
The mistake I see most often: a team picks a model in week one ("we'll start with the best so we know quality is fine"), ships it, and then never revisits. Six months later they're paying GPT-5 prices for "is this email spam, yes or no" decisions that a fine-tuned BERT could handle for a hundredth of the cost.
The fix is mechanical. List every distinct task your app uses the model for. For each one, ask "what's the cheapest tier this would still work on?" Test the cheaper tier on a representative sample; fifty to a hundred examples is usually enough. If quality holds, switch. If it doesn't, escalate one tier and try again. Most of my apps end up running 80% of calls on the smallest model and 20% on the flagship.
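A sketch of that check, assuming you already have a wrapper that calls a given model and a small labeled sample (both are placeholders here; the model name mirrors the table above):

```ts
type LabeledExample = { input: string; expected: string };

// runModel is your existing call wrapper: (model, input) => the model's answer
async function tierCheck(
  samples: LabeledExample[],
  runModel: (model: string, input: string) => Promise<string>,
  cheapModel = 'gpt-5-nano'
) {
  let matches = 0;
  for (const s of samples) {
    const answer = await runModel(cheapModel, s.input);
    if (answer.trim().toLowerCase() === s.expected.trim().toLowerCase()) matches++;
  }
  const accuracy = matches / samples.length;
  console.log(`${cheapModel}: ${(accuracy * 100).toFixed(1)}% on ${samples.length} samples`);
  return accuracy; // if this clears your quality bar, switch; if not, try one tier up
}
```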
The same logic applies across providers. Claude Haiku 4.5 ($1/$5) handles a huge chunk of work that people reflexively send to Sonnet 4.6 ($3/$15) or Opus 4.7 ($5/$25). Gemini 2.5 Flash-Lite ($0.10/$0.40) is borderline free for high-volume classification. DeepSeek V3 ($0.252/$0.378) is roughly twelve times cheaper than Claude Sonnet 4.6 on input and nearly forty times cheaper on output, for tasks where DeepSeek's quality is acceptable, which, for a lot of structured-data tasks, it is.
2. Cascade routing: cheap model first, escalate on uncertainty
The natural extension of #1. Don't pick one model per task; let a cheap model handle the obvious cases and only escalate to the expensive one when it's actually needed.
Here's the pattern, sketched with the OpenAI Node SDK (CLASSIFY_PROMPT stands in for your classification prompt):
```ts
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function classify(input: string) {
  // First try with the cheap model
  const draft = await openai.chat.completions.create({
    model: 'gpt-5-nano',
    messages: [
      { role: 'system', content: CLASSIFY_PROMPT },
      { role: 'user', content: input }
    ],
    logprobs: true,
    max_tokens: 10
  });

  // Turn the log-probability of the first answer token into a 0-1 confidence
  const confidence = Math.exp(
    draft.choices[0].logprobs!.content![0].logprob
  );

  if (confidence > 0.85) {
    return draft.choices[0].message.content; // cheap model was sure
  }

  // Escalate to the bigger model only for the uncertain ~10-20%
  const escalated = await openai.chat.completions.create({
    model: 'gpt-5',
    messages: [
      { role: 'system', content: CLASSIFY_PROMPT },
      { role: 'user', content: input }
    ],
    max_tokens: 10
  });
  return escalated.choices[0].message.content;
}
```
In a recent classification system I worked on, the escalation rate landed at 14%. That meant 86% of calls paid nano prices ($0.05/M input) instead of GPT-5 prices ($1.25/M). Weighted out, with escalated calls paying for both the draft and the retry, the blend works out to roughly a 5-6× cost reduction versus sending everything to GPT-5. The escalations pull the average well below the headline 25× gap, but it's still a large win for a few lines of routing code.
The catch: you have to design the cheap-model prompt so it can express uncertainty. Asking for a single-token answer with logprobs is the cleanest way; "I'm not sure" or "needs review" as an explicit class also works. Don't try to read uncertainty out of free-text; the model will hedge unreliably.
3. Use prompt caching aggressively (90% off cached input)
Every major provider now offers prompt caching, and it's the highest-leverage optimization on this list, provided you actually use it correctly.
The idea is simple: a long system prompt or document that you send on every call doesn't need to be re-tokenized and re-processed every time. The provider caches the prefix; subsequent calls hit the cache and are billed at a fraction of the normal input rate.
The pricing math, for a typical 4,000-token system prompt called 1,000 times per hour on Claude Sonnet 4.6:
- Without caching: 4,000 × 1,000 × $3/M = $12/hour input cost
- With cache hits (90% off cached input): 4,000 × 1,000 × $0.30/M = $1.20/hour input cost (after the first uncached write, which is slightly more expensive than a normal call)
- That's roughly a 90% drop on the system-prompt portion for a code change of "add cache_control to your prompt."
The catch, and this is where teams trip up, is that caching only works if your prefix is byte-identical across calls. A timestamp in the system prompt, a randomly-shuffled few-shot example, a per-user variable spliced near the top: any of those break the cache and you pay the full rate. I've seen teams add prompt caching, see no savings on the bill, and conclude "it's broken." It wasn't broken; their Date.now() at the top of the system prompt was invalidating the cache on every single call.
The fix: structure your prompts so the cacheable part comes first and the variable part comes last:
```
[CACHED PREFIX: byte-identical across calls]
- System instructions
- Tool definitions
- Few-shot examples
- Long reference documents (RAG context if it doesn't change)

[USER-SPECIFIC SUFFIX: varies per call]
- Current date/time
- User ID / personalization
- The actual user message
```
Anthropic's cache_control markers, OpenAI's automatic prefix caching, and Gemini's context caching all reward this structure. Get it wrong and you pay full price; get it right and you watch the bill drop overnight.
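With Anthropic's SDK, that shape looks roughly like this (a sketch assuming @anthropic-ai/sdk; the model name, SYSTEM_INSTRUCTIONS, FEW_SHOT_EXAMPLES, and userMessage are placeholders for your own):

```ts
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY

const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-6', // illustrative
  max_tokens: 400,
  system: [
    {
      type: 'text',
      // Stable, byte-identical prefix: instructions, tools, few-shot examples
      text: SYSTEM_INSTRUCTIONS + FEW_SHOT_EXAMPLES,
      cache_control: { type: 'ephemeral' } // cache everything up to and including this block
    }
  ],
  messages: [
    // Variable, per-call material stays out of the cached prefix
    { role: 'user', content: `Today is ${new Date().toISOString()}\n\n${userMessage}` }
  ]
});
```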
One nuance: the cache TTL is short, typically around 5 minutes for OpenAI and Anthropic, longer if you opt in. For low-traffic endpoints (a few calls per hour) the cache may expire between calls and you won't see the savings. Caching is a high-volume optimization; if you're at 10 calls a day, skip it.
4. Use Batch APIs for anything that isn't real-time (50% off)
If your work doesn't need to respond inside a user request (overnight summarization, daily digest generation, bulk re-classification of a backlog, evaluation runs against a test set), use the Batch API. Half price across the board, with up to 24-hour latency.
OpenAI, Anthropic, and Google all offer this. The pattern is the same: submit a JSONL file with all your requests, get a JSONL file back when it's done.
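Here's roughly what that looks like with the OpenAI Node SDK (a sketch: tickets and SUMMARIZE_PROMPT are placeholders, and each JSONL line's body mirrors a normal chat-completions request):

```ts
import OpenAI, { toFile } from 'openai';

const openai = new OpenAI();

// One JSONL line per request; custom_id lets you match results back up later.
const lines = tickets.map((t, i) =>
  JSON.stringify({
    custom_id: `ticket-${i}`,
    method: 'POST',
    url: '/v1/chat/completions',
    body: {
      model: 'gpt-5-mini',
      messages: [
        { role: 'system', content: SUMMARIZE_PROMPT },
        { role: 'user', content: t.text }
      ],
      max_tokens: 400
    }
  })
);

const file = await openai.files.create({
  file: await toFile(Buffer.from(lines.join('\n')), 'tickets.jsonl'),
  purpose: 'batch'
});

const batch = await openai.batches.create({
  input_file_id: file.id,
  endpoint: '/v1/chat/completions',
  completion_window: '24h'
});
// Poll openai.batches.retrieve(batch.id) until status === 'completed',
// then download output_file_id for the results.
```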
Things I've moved to batch and the math:
- Nightly summarization of the day's customer-support tickets (~3,000 tickets × ~2,000 input tokens × ~400 output): ran at $1.25/$10 per M live, which works out to roughly $19.50 a night; batch halves it to about $10, call it $300/month saved on that one job alone.
- Weekly RAG re-embedding of changed documents: latency irrelevant, batch is fine.
- Eval harness runs (testing prompt changes against 500 fixtures): used to take 20 minutes live; now takes 2 hours but is 50% cheaper, and we run them more often as a result. Counter-intuitively, making the eval cheaper made us run more evals, which improved prompt quality.
The mental flip: stop asking "can this be batched?" and start asking "does this need to be live?" Most internal jobs don't.
The only real downside is the latency unpredictability: "up to 24h" usually means 30 minutes to 4 hours in practice, but you have to design for the upper bound. If your job needs to be done by 9am Monday, submit Sunday night.
5. Cap output with max_tokens
Output tokens are 4–8× more expensive than input tokens. A runaway 8K-token response on GPT-5 costs the same as 64K tokens of input. The model has no incentive to be brief; if you don't constrain it, it will sometimes generate three paragraphs of "as a helpful assistant, I'd be happy to..." before getting to the answer.
max_tokens (or max_output_tokens depending on the API) is your cap. Set it.
The right value depends on the task:
- Classification / single-label: 20
- JSON tool call: 200
- Chat reply: 400 (most assistants top out around 300)
- Summary of a document: 800
- Code snippet: 1,500
- Long-form generation: as needed, but consider whether it should really be one call
The risk with capping too aggressively is the model gets cut off mid-sentence. The mitigation is to monitor the finish_reason field in the response: if it's length more than ~5% of the time, your cap is too tight for that task. If it's almost always stop, your cap is correct.
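A sketch of that monitoring wrapped around an existing call; openai and messages are the client and conversation you already have, and metrics / logger stand in for whatever observability client you use:

```ts
const completion = await openai.chat.completions.create({
  model: 'gpt-5-mini',
  messages,
  max_tokens: 400 // cap chosen per task, per the list above
});

// Track how often the cap actually truncates output.
const finishReason = completion.choices[0].finish_reason;
metrics.increment(`llm.finish_reason.${finishReason}`);

// If 'length' shows up on more than ~5% of calls for a task, loosen that task's cap.
if (finishReason === 'length') {
  logger.warn('max_tokens truncated a response', { task: 'chat_reply' });
}
```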
I've audited apps where adding max_tokens to existing calls cut total cost by 25%. The model was generating 3,000 tokens for tasks that needed 600, purely because nobody told it to stop sooner.
6. Strip the JSON out of system prompts
This is the one that drives me crazy.
Half the system prompts I review have something like this:
```json
{
  "role": "assistant",
  "instructions": "You are a helpful assistant.",
  "rules": [
    {"id": 1, "rule": "Always be polite"},
    {"id": 2, "rule": "Cite sources when possible"},
    {"id": 3, "rule": "If unsure, say 'I don't know'"}
  ],
  "examples": [
    {"input": "...", "output": "..."}
  ]
}
```
Every brace, bracket, quote, comma, and colon is its own token. JSON keys repeat ("input", "output", "rule" show up over and over). On a typical system prompt I see a 20–30% reduction just by switching to YAML or markdown:
```
You are a helpful assistant.
Rules:
- Always be polite
- Cite sources when possible
- If unsure, say "I don't know"
Examples:
- Input: ...
  Output: ...
```
The model parses both equally well. The JSON costs you 20–30% more for nothing.
The exception: if you're showing the model an actual API response or a structured object that the model itself needs to mirror in its tool-call output, JSON is the right answer; the model picks up the format you want it to produce. But for instructions to the model? YAML, markdown, or plain prose. Always.
While we're here: every emoji you put in a system prompt is 1–4 tokens of pure decoration. Skip them. Your system prompt is not a marketing brochure.
7. Compress and dedupe the system prompt
Once you've stripped the JSON, look at the prose itself.
The pattern I see in 90% of system prompts: instructions added by different team members at different times, never reviewed as a whole. Three different sentences telling the model the same thing slightly differently. A "tone guide" section that repeats most of the "voice" section. A list of forbidden phrases that's grown to 40 entries because everyone added their pet peeve.
Spend an hour with the full system prompt in front of you and a token counter open. For each section, ask:
- Is this still true? (Instructions get added; they rarely get removed.)
- Is this said somewhere else, possibly better?
- Does the model actually obey this rule, or does it ignore it anyway?
- Could three sentences be one?
Typical first pass on a two-year-old system prompt: 30–40% smaller, equal or better quality. Models in 2026 are smart enough to follow short instructions; the longer your prompt, the more chance you're burying the actually-important rules in noise. There's a real diminishing return after about 1,500 tokens of system prompt; past that, the model often weights the latest user message more heavily than your fifteenth rule.
8. Trim chat history with a sliding window (but be smart about what stays)
The naïve fix: "only send the last N messages." It works, kind of: you cap the cost, but you also lose context that mattered.
The smarter pattern: keep the system prompt + the first message + a summary of the dropped middle + the last N turns.
```ts
async function buildContext(history: Message[], maxTurns = 10) {
  // Short conversations: send everything as-is
  if (history.length <= maxTurns + 1) return history;

  const first = history[0];
  const middle = history.slice(1, -maxTurns);
  const recent = history.slice(-maxTurns);

  // Summarize the middle once, cache it by content hash
  const summary = await summarizeMiddle(middle);

  return [
    first,
    { role: 'system', content: `Earlier in conversation: ${summary}` },
    ...recent
  ];
}
```
The "summarize the middle" step is itself an LLM call, but you cache it by content hash β for a given conversation prefix, it's computed once and reused for every subsequent turn. Net savings on a 50-turn conversation: typically 60β70% of input cost compared to sending the whole history every time, with answer quality that's better than naive truncation because the early context is preserved in compressed form.
Use a small cheap model for the summarization itself (GPT-5 nano or Gemini Flash-Lite). The summary doesn't need to be brilliant; it just needs to capture "what happened earlier."
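A sketch of that summarizer, assuming the same Message type and openai client as above plus a Redis-style cache client; the model name, word limit, and TTL are illustrative:

```ts
import { createHash } from 'crypto';

async function summarizeMiddle(middle: Message[]): Promise<string> {
  // Key on the hash of the dropped turns, so each conversation prefix
  // is summarized exactly once and reused for every later turn.
  const key = 'histsum:v1:' + createHash('sha256')
    .update(JSON.stringify(middle))
    .digest('hex');

  const cached = await redis.get(key);
  if (cached) return cached;

  const res = await openai.chat.completions.create({
    model: 'gpt-5-nano', // a cheap model is fine for this
    messages: [
      {
        role: 'system',
        content: 'Summarize this conversation excerpt in under 150 words. Keep names, decisions, and open questions.'
      },
      { role: 'user', content: middle.map(m => `${m.role}: ${m.content}`).join('\n') }
    ],
    max_tokens: 250
  });

  const summary = res.choices[0].message.content ?? '';
  await redis.set(key, summary, { ex: 60 * 60 * 24 }); // a day is plenty for an active chat
  return summary;
}
```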
9. For tool calling: shorter names, sparser descriptions, fewer enums
If you're using function/tool calling, your tool definitions are sent on every call. They're often the largest part of the prompt, and they're cacheable (see #3), but you want the cached version to be small in the first place too: cache writes and cache reads are both priced per token.
Common offenders:
Verbose parameter descriptions. "The user's email address, formatted as a valid RFC 5322 email address, e.g. user@example.com." Cut to "User's email." The model knows what an email is.
Long enums. A category parameter with 80 enum values you regenerate from a database every deploy. Each value is 2–5 tokens. Either prune to the top 20 most-used and add an "other" with a free-text follow-up, or move the enum out of the schema and into the system prompt as a one-time reference.
Multiple tools where one parameter would do. Instead of create_event_with_attendees, create_event_without_attendees, create_event_recurring, define one create_event with optional parameters. The schema is much shorter and the model does a better job picking it.
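For example, the consolidated definition might look like this (a sketch in the OpenAI tool-schema format; the field names are illustrative):

```ts
const createEventTool = {
  type: 'function' as const,
  function: {
    name: 'create_event',
    description: 'Create a calendar event.',
    parameters: {
      type: 'object',
      properties: {
        title: { type: 'string' },
        start: { type: 'string', description: 'ISO 8601 start time' },
        end: { type: 'string', description: 'ISO 8601 end time' },
        attendees: { type: 'array', items: { type: 'string' } }, // optional
        recurrence: { type: 'string', description: 'e.g. "weekly"' } // optional
      },
      required: ['title', 'start', 'end']
    }
  }
};
// One schema instead of three near-duplicates: a shorter prompt, and an easier pick for the model.
```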
I've seen tool definitions that were 4,000 tokens compressed to 1,200 with no behavior change. On a high-volume agent app with 10K calls/day at $3/M input, that's roughly $90/day saved on input alone when the cache misses; even on the cached path it's a still-meaningful $9/day. Per tool. And this app had ten tools defined.
10. RAG: tune chunk size, re-rank before stuffing
For retrieval-augmented generation, the cost driver is usually the number of chunks you stuff into the prompt, not the model. Many teams default to "top 10 chunks of 500 tokens each": that's 5,000 tokens of context per call, regardless of whether all 10 are relevant.
Two compounding optimizations:
Chunk size. Smaller chunks (200–400 tokens) match retrieval queries more precisely and let you fit more diverse content. Bigger chunks (800–1,200) carry more context per chunk but waste more on irrelevant content. There's a sweet spot for every corpus; measure against a held-out eval set rather than guessing.
Re-ranking before stuffing. Fetch the top 30 candidates from your vector store, re-rank with a small cross-encoder (Cohere Rerank, or a tiny local model), then send only the top 3–5 to the LLM. The re-rank step costs a fraction of a cent and typically lets you cut chunks-in-prompt by 50% with better answer quality, because you're feeding the model less noise.
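Roughly, the retrieval step becomes something like this (a sketch assuming the cohere-ai Node SDK for reranking; vectorStore.search is a placeholder for your own retrieval layer):

```ts
import { CohereClient } from 'cohere-ai';

const cohere = new CohereClient({ token: process.env.COHERE_API_KEY! });

async function retrieveContext(query: string): Promise<string[]> {
  // 1. Cast a wide net from the vector store (cheap)
  const candidates = await vectorStore.search(query, { topK: 30 });

  // 2. Re-rank with a cross-encoder and keep only the best few (still cheap)
  const reranked = await cohere.rerank({
    model: 'rerank-english-v3.0',
    query,
    documents: candidates.map(c => c.text),
    topN: 5
  });

  // 3. Only these 3-5 chunks ever reach the LLM prompt
  return reranked.results.map(r => candidates[r.index].text);
}
```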
The combined effect on a RAG pipeline I helped optimize last year: 62% reduction in input tokens, slightly better answer quality measured against the same eval set. The win came from stopping the "more context = better answer" reflex; less of the right context outperforms more of the wrong context.
11. Stream and early-stop when the answer arrives
Streaming is usually framed as a UX win. It's also a cost win, because you can stop generating as soon as you have what you need.
Concrete cases:
- JSON tool calls: stop when the closing } arrives.
- Structured output with a known terminator: stop when you see it.
- Chat: abort the stream if the user navigates away mid-response.
The savings are bounded; you can only save the tokens that would have been generated. But on tasks where the model has a tendency to add post-answer commentary ("Let me know if you need anything else!"), stopping early routinely saves 50–150 tokens per call. At GPT-5 output prices that's $0.0005–$0.0015 per call, which sounds tiny until you multiply by 200,000 calls a day.
The implementation needs care: process streaming chunks, watch for your stop condition, send controller.abort() on the fetch when you have it. Don't forget to log so you know how often the early-stop fires. If it's never firing, your stop condition is wrong; if it's firing on >40% of calls, you might be cutting off legitimate output.
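A sketch of the JSON-tool-call case with the OpenAI Node SDK's streaming interface (the brace counting is deliberately naive and assumes the model emits only the JSON object; openai, messages, and metrics are placeholders from earlier examples):

```ts
const stream = await openai.chat.completions.create({
  model: 'gpt-5-mini',
  messages,
  stream: true
});

let buffer = '';
let depth = 0;
let earlyStopped = false;

for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content ?? '';
  buffer += delta;

  // Naive brace tracking: stop as soon as the top-level object closes
  for (const ch of delta) {
    if (ch === '{') depth++;
    if (ch === '}') depth--;
  }
  if (buffer.includes('{') && depth === 0) {
    stream.controller.abort(); // stop paying for post-answer commentary
    earlyStopped = true;
    break;
  }
}

metrics.increment(earlyStopped ? 'llm.early_stop.fired' : 'llm.early_stop.missed');
```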
12. Cache common responses on YOUR end (not just the provider's)
The most underrated optimization on this list: a regular HTTP-style cache in front of the API, for inputs you've already seen.
For tasks that are deterministic given the input β classification, extraction, formatting, intent detection β the model should produce the same answer for the same input. There's no reason to call the API twice for "is noreply@spam.example.com a real email?" if you already asked it yesterday.
The pattern:
```ts
import { createHash } from 'crypto';

const sha256 = (s: string) => createHash('sha256').update(s).digest('hex');

async function classify(input: string): Promise<string> {
  const key = `classify:v3:${sha256(input)}`;

  const cached = await redis.get(key);
  if (cached) return cached;

  const result = await llmClassify(input);
  await redis.set(key, result, { ex: 60 * 60 * 24 * 30 }); // 30 days
  return result;
}
```
The v3 in the cache key is critical: when you change the prompt or model, bump the version so you don't serve stale cached answers from the previous version. The TTL depends on how often the underlying answer might change; 30 days is a good default for static data, much shorter for anything time-sensitive.
This is the "same answer to the same question" cache, and it's distinct from the provider's prompt caching from #3 (which lowers the per-call cost but still costs you a call). This cache eliminates the call entirely. Cache hit rates on classification workloads are routinely 30β50% in production. Half your traffic, free.
For chat or generative tasks where outputs aren't deterministic, this cache won't help. For everything else, and "everything else" is more of your traffic than you think, it should be the first thing in front of your LLM client.
A practical action plan
You don't do all twelve at once. The order I recommend:
- Week 1: Measure. Open the LLM Token Counter, paste your top three prompts, write down the numbers. Add max_tokens caps everywhere. Move non-real-time jobs to the Batch API.
- Week 2: Strip JSON from system prompts; compress and dedupe. This alone is often 20%+.
- Week 3: Add prompt caching by reorganizing your prompt structure (cacheable prefix first, variable suffix last). 90% off the system-prompt portion is the single biggest line-item win.
- Week 4: Add a response cache for deterministic tasks. Audit model tier per task β drop one tier where quality holds.
- Month 2: Cascade routing for the highest-volume tasks. Re-ranking for RAG pipelines. Tool-definition cleanup.
- Month 3: Streaming + early-stop. Smarter sliding-window history.
At each step, measure the before/after on real traffic, not estimates. The reason most cost-optimization projects fail isn't that the tactics don't work; it's that nobody measures, the savings get attributed to "well, traffic was lower that week," and the work doesn't get done again. Pick one tactic, measure baseline, ship the change, measure again, log the savings, move on.
The compounding effect of doing five of these well: 30–60% bill reduction, no quality regression, no provider switch. I've watched teams take a $40K/month bill to $14K/month with the first six items on this list, in three weeks of focused work. It is not exotic. It is just measuring.
If you only do one thing this week, do #0. Open a token counter, paste your largest prompt, look at the number, and then look at what's actually in there. The savings find themselves once you can see them.