LLM Token Counter & Cost Estimator
Count tokens for any LLM prompt and estimate API cost across GPT-5, Claude 4.7, Gemini 3, Grok 4, DeepSeek R1, and Llama 4. Visualise tokens, see context-window utilisation, and compare models side-by-side – all in your browser.
Model
o200k_base · exact tokenizer · 400,000-token context
Try a sample:
Prompt / Document
Tokens
Exact · GPT-5 · o200k_base
Cost estimate
Output length is unpredictable per request. Pick a scenario (or your max_tokens cap) and read the cost as 'cost if the model returns this much'.
Pricing: $1.25/M input · $10/M output
Context window
About LLM Token Counter & Cost Estimator
The LLM Token Counter & Cost Estimator counts tokens for any prompt or document across the major commercial language models as of mid-2026 – the GPT-5 family (GPT-5 / mini / nano), the GPT-4.1 family, GPT-4o, the o3 / o4-mini reasoning models, Claude Opus 4.7 / Sonnet 4.6 / Haiku 4.5, Gemini 3 Pro and the Gemini 2.5 family, xAI Grok 4 and Grok 4.1 Fast, DeepSeek V3 and R1, and Meta Llama 4 Scout / Maverick – and turns the count into a USD cost estimate using current per-1M-token pricing. OpenAI counts are exact via the official tiktoken encodings (o200k_base for GPT-5 / GPT-4o / o3 / o4-mini, cl100k_base for older models). For models whose tokenizers are not publicly available (Claude, Gemini, Grok, DeepSeek, Llama), counts are produced with cl100k_base and then corrected by a per-model calibration multiplier, yielding ±3-12% accuracy depending on the model. Every model badge shows its actual accuracy band so you know exactly how much to trust the number. Tokens are visualised in a colour-banded preview, context-window utilisation is shown as a gauge, and a slider for expected output length gives you the full round-trip cost.
Why use an LLM Token Counter & Cost Estimator?
Tokens are the universal currency of LLM development – they determine context-window fit, latency, and cost. Eyeballing 'this prompt is probably ~500 tokens' is a recipe for surprise overage bills and 'context too long' errors at runtime. This tool replaces the eyeball with an exact number for OpenAI models and a calibrated approximation for others, plus a USD cost figure you can paste straight into a budget proposal. The token visualisation panel is uniquely useful for prompt debugging – you can see at a glance that your variable name is 6 tokens, that a stray Unicode character costs several tokens, or that a system prompt template is 30% larger than you thought.
Who is it for?
Built for AI engineers and prompt designers building production LLM apps, software developers integrating OpenAI / Anthropic / Google / Meta APIs, technical leads writing RAG and agent systems who need to size context windows correctly, finance / FP&A staff sizing AI budgets and unit economics, ML researchers comparing tokenizer efficiency across models, and educators or content creators teaching prompt engineering. Anyone who has ever opened a per-token API bill should bookmark this.
How to use the tool
Pick the model you're targeting from the dropdown – it controls the tokenizer used and the per-1M-token pricing applied to the cost estimate
Paste your prompt or document into the text area, or click one of the sample buttons (chat message, tool-calling prompt, long doc) to load an example
Read the headline token count and supporting counts (characters, words, lines) – for OpenAI models the count is exact; non-OpenAI models show a 'Calibrated ±X%' badge
Drag the 'Expected output length' slider to set how many tokens you expect the model to generate – the cost estimate updates with the full round-trip cost
Watch the context-window gauge – if you're at 80%+ you're at risk of overflow when output is added; consider trimming or chunking
Open the 'Token preview' panel to see exactly how your prompt is split – useful for spotting accidentally-tokenised whitespace, oddly-tokenised variable names, or Unicode-heavy sections that cost more than expected
Toggle 'Compare with another model' to see how the same prompt sizes differently across two providers – useful when choosing which API to standardise on
Toggle 'Per-line counts' for prompt-engineering loops where each line is a separate prompt; you get a count per line plus the total
Key Features
22 frontier models supported
OpenAI GPT-5 / mini / nano, GPT-4.1 / nano, GPT-4o / mini, o3, o4-mini (exact via tiktoken). Anthropic Claude Opus 4.7, Sonnet 4.6, Haiku 4.5. Google Gemini 3 Pro, 2.5 Pro / Flash / Flash-Lite. xAI Grok 4, Grok 4.1 Fast. DeepSeek V3, R1. Meta Llama 4 Scout, Maverick. Each model's badge shows its accuracy band – 'Exact' for OpenAI, 'Calibrated ±X%' for the rest.
Per-model calibration for non-OpenAI models
Models without a public tokenizer (Claude, Gemini, Grok, DeepSeek, Llama) get cl100k_base counts multiplied by a per-model calibration factor: ×1.35 for Claude Opus 4.7 (which shipped a denser tokenizer), ×1.05 for the other Claude models and DeepSeek (slightly denser than cl100k), ×0.95 for Gemini and Llama 4 (more efficient), ×1.0 for Grok (claimed tiktoken-compatible). The calibrated count and accuracy band are shown explicitly so you know how much headroom to add.
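A minimal sketch of how that calibration step can be applied, assuming js-tiktoken's getEncoding and hypothetical model keys; the multipliers mirror the figures above, but the tool's internal table may differ:

```ts
import { getEncoding } from "js-tiktoken";

// Illustrative calibration multipliers (mirroring the figures above);
// the real tool's internal table and model keys may differ.
const CALIBRATION: Record<string, number> = {
  "claude-opus-4.7": 1.35,
  "claude-sonnet-4.6": 1.05,
  "deepseek-r1": 1.05,
  "gemini-3-pro": 0.95,
  "llama-4-maverick": 0.95,
  "grok-4": 1.0,
};

// Count with cl100k_base, then scale by the model's calibration factor.
function calibratedTokenCount(text: string, model: string): number {
  const enc = getEncoding("cl100k_base");
  const baseCount = enc.encode(text).length;
  const factor = CALIBRATION[model] ?? 1.0;
  return Math.ceil(baseCount * factor);
}
```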
Exact tokenization for OpenAI
Uses the official OpenAI tiktoken cl100k_base and o200k_base encodings – the exact same algorithm OpenAI uses on the server, so your count matches the bill.
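The exact path is straightforward: encode with the matching tiktoken vocabulary and take the length. A minimal sketch with js-tiktoken, using the encoding names quoted above (the tool's actual wiring may differ):

```ts
import { getEncoding } from "js-tiktoken";

// GPT-5 / GPT-4o / o3 / o4-mini use o200k_base; older GPT-4 / GPT-3.5 use cl100k_base.
const enc = getEncoding("o200k_base");

const prompt = "You are a helpful assistant. Answer in one short paragraph.";
const tokenCount = enc.encode(prompt).length; // exact count for o200k_base models
console.log(tokenCount);
```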
USD cost estimator
Input + output pricing per 1M tokens for every model, with a slider for expected output length so you see total round-trip cost rather than just input cost.
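The round-trip arithmetic itself is simple. A sketch, using the GPT-5 rates shown above purely as an illustration (real rates come from the tool's pricing table):

```ts
// Per-1M-token rates in USD; GPT-5 figures from the pricing line above, for illustration.
const INPUT_RATE_PER_M = 1.25;
const OUTPUT_RATE_PER_M = 10;

// Full round-trip cost: prompt tokens in, expected completion tokens out.
function roundTripCostUSD(inputTokens: number, expectedOutputTokens: number): number {
  return (inputTokens * INPUT_RATE_PER_M + expectedOutputTokens * OUTPUT_RATE_PER_M) / 1_000_000;
}

console.log(roundTripCostUSD(1_200, 400).toFixed(4)); // "0.0055"
```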
Context-window utilisation gauge
Colour-coded progress bar showing tokens-used vs the model's context window – green / amber / red – so you spot overflow risk before deployment.
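The bands are just thresholds on utilisation. A sketch, with the cut-offs chosen here for illustration (the tool's actual thresholds may differ):

```ts
// Utilisation = tokens used / context window; band thresholds are illustrative.
function contextBand(tokensUsed: number, contextWindow: number): "green" | "amber" | "red" {
  const utilisation = tokensUsed / contextWindow;
  if (utilisation >= 0.8) return "red";   // overflow risk once output is added
  if (utilisation >= 0.5) return "amber";
  return "green";
}

console.log(contextBand(320_000, 400_000)); // "red" at 80% of a 400k window
```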
Token visualisation
Every token is rendered as a colour-banded chip – see exactly where your prompt is split. Invaluable for prompt debugging: spot oddly-tokenised variable names, unexpected whitespace, or Unicode-heavy sections.
Side-by-side model comparison
Tokenise the same prompt with two models and see the count delta – useful when choosing which API to standardise on or when porting prompts between providers.
Per-line / batch counting
Toggle batch mode to get a count per line (each line treated as a separate prompt) plus a total – designed for prompt-engineering loops where you need per-prompt sizing.
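Per-line counting is a straightforward split-and-count. A sketch with js-tiktoken, assuming each non-empty line is treated as its own prompt:

```ts
import { getEncoding } from "js-tiktoken";

const enc = getEncoding("o200k_base");

// Count each line separately (batch mode) and sum for the total.
function perLineCounts(batch: string): { perLine: number[]; total: number } {
  const perLine = batch
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => enc.encode(line).length);
  const total = perLine.reduce((sum, n) => sum + n, 0);
  return { perLine, total };
}
```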
100% client-side
Tokenization, cost calculation, and visualisation all run in your browser. Prompts never leave the device – no upload, no API call, no analytics on prompt content. Verifiable in DevTools (zero network requests).
Common Use Cases
Sizing a system prompt for production
Scenario: You're shipping a chatbot with a 2000-character system prompt and you need to know how much of the model's context window it eats before user messages and history are added.
→ Paste the system prompt, see the exact token count, watch the context-window gauge – you immediately see whether you have room for a 30-message history at this prompt length, or whether the system prompt needs to be compressed.
Pre-launch cost forecast
Scenario: You're about to ship an AI feature to production and finance wants a unit-cost figure: 'how much will one user interaction cost us?'.
→ Take a typical prompt + expected output length, plug both into the calculator, and copy the per-call USD figure into your forecast. Multiply by expected calls/month, defend it in the budget meeting.
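For example, assuming a typical 1,200-token prompt and a 400-token reply on GPT-5 at the rates quoted earlier ($1.25/M input, $10/M output): 1,200 × $1.25/1M + 400 × $10/1M ≈ $0.0055 per call, or roughly $550/month at 100,000 calls. The prompt and output sizes here are illustrative; substitute your own.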
Choosing a model for a price-sensitive workload
Scenario: You're deciding whether to use GPT-4o, Claude Sonnet, or Gemini 2.5 Pro for a high-volume classification task.
→ Toggle Compare mode, paste a representative prompt, see token counts and prices side-by-side. The cheapest model with sufficient quality wins; no spreadsheet required.
Debugging 'context too long' errors
Scenario: Your agent is failing in production with 'context length exceeded' errors and you can't reproduce it locally.
→ Paste the failing prompt, see the exact token count and the context-window gauge in red – now you know exactly which content to trim. The token preview shows where the prompt is most token-dense, pointing you at the right section to compress.
Prompt-engineering for cost reduction
Scenario: Your AI product is profitable but margins are thin; product wants the team to 'cut prompt costs by 20%' without quality regression.
→ The token preview panel makes hidden waste visible – long delimiters, repeated boilerplate, oddly-tokenised JSON keys. Iterate prompts in the tool, see the count drop in real time, lock in the savings.
Frequently Asked Questions
Are the OpenAI counts exact?
Yes – for the GPT-5 family, GPT-4.1, GPT-4o, o3, and o4-mini we use the o200k_base encoding; for older models such as GPT-4-Turbo, GPT-4, and GPT-3.5-Turbo we use cl100k_base. These are the actual encodings OpenAI uses on the server. For a string like "Hello, world!" you'll get 4 tokens, matching OpenAI's official tokenizer page.
Why is Claude / Gemini / Grok / DeepSeek / Llama labelled as 'Calibrated ±X%' instead of 'Exact'?
Those providers don't publish browser-runnable tokenizers – Anthropic's and Google's are proprietary, and Llama / DeepSeek tokenizers exist but only as Python / SentencePiece files too large to bundle. We tokenise with OpenAI's cl100k_base (which is publicly available and runs in your browser) and then multiply by a per-model calibration factor that compensates for the average difference between cl100k and the real tokenizer. For Claude Opus 4.7 that's ×1.35 because Anthropic shipped a denser tokenizer; for Gemini and Llama 4 it's ×0.95 because their tokenizers are slightly more efficient; for Grok it's ×1.0 because xAI claims a tiktoken-compatible vocab. The accuracy band shown next to each model is the expected absolute error after calibration – typically ±3-8%, with Claude Opus 4.7 at ±12% due to the larger correction. For exact counts you'd need to call the provider's own count_tokens endpoint, which requires an API key.
Where does the pricing data come from?
Pricing is taken from each provider's official rate card (OpenAI, Anthropic, Google AI Studio, Meta / Together / Groq for Llama). Prices change – there's a 'Last updated' date in the tool. Always cross-check with the provider's billing page before committing a number to a contract or a board deck.
Does this tool send my prompt anywhere?
No. Tokenization runs entirely in your browser using the js-tiktoken library; cost calculation is pure JavaScript. There's no API call, no upload, no analytics event with prompt content. You can verify in your browser's DevTools network tab β using the tool produces zero outbound requests with prompt data. This matters when you're working on confidential prompts (proprietary system prompts, customer data, unreleased product copy).
What's the difference between cl100k_base and o200k_base?
Both are byte-pair encoding (BPE) tokenizers with different vocabularies. cl100k_base is the older OpenAI vocabulary used by GPT-3.5 and GPT-4 (~100k tokens). o200k_base is the newer vocabulary used by GPT-4o, the GPT-5 family, and the o-series reasoning models (~200k tokens) – it has more tokens dedicated to non-English languages, code, and common multi-character sequences, so the same prompt typically counts to *fewer* tokens under o200k. That's part of why GPT-4o is cheaper per request even at similar per-token prices.
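You can see the vocabulary difference directly by encoding the same string with both. A sketch with js-tiktoken (the exact counts depend on the text you try):

```ts
import { getEncoding } from "js-tiktoken";

const cl100k = getEncoding("cl100k_base");
const o200k = getEncoding("o200k_base");

// Code- and Unicode-heavy text tends to show the biggest gap between vocabularies.
const sample = 'const données = {"prénom": "Łukasz", "rôle": "admin"};';
console.log("cl100k:", cl100k.encode(sample).length);
console.log("o200k: ", o200k.encode(sample).length); // typically fewer tokens under o200k
```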
Why does my prompt count differently across models?
Different tokenizers have different vocabularies. A word like 'tokenization' might be a single token in one vocabulary and two or three in another. Special characters, code, and non-English text vary even more. The Compare mode shows you these deltas explicitly – sometimes switching models drops your token count by 20% even before considering price.
Does the cost estimator include the system message and chat history?
It counts whatever you paste in. To estimate a real chat call, paste the *concatenation* of system message + history + new user message into the field. Chat APIs add a small per-message overhead (~3-4 tokens per message for OpenAI's chat format) which this tool does not currently model – adjust slightly upward when budgeting for chat-heavy workloads.
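If you want to account for that overhead yourself, a rough sketch is content tokens plus a few tokens per message plus a reply-priming allowance; the constants below are approximations, not exact values for any specific model:

```ts
import { getEncoding } from "js-tiktoken";

const enc = getEncoding("o200k_base");

interface ChatMessage { role: "system" | "user" | "assistant"; content: string; }

// Approximate chat-call input tokens: content tokens plus a small per-message
// overhead and a reply-priming allowance. Constants are rough, not exact.
function estimateChatTokens(messages: ChatMessage[], perMessageOverhead = 4, replyPriming = 3): number {
  const contentTokens = messages.reduce((sum, m) => sum + enc.encode(m.content).length, 0);
  return contentTokens + messages.length * perMessageOverhead + replyPriming;
}
```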
Can I use this for batch / fine-tuning cost estimation?
Yes – paste each example into batch mode (toggle 'Per-line counts'), get per-prompt and total counts, multiply by your dataset size, and apply the per-token training price from the provider's docs. Note that fine-tuning APIs typically have a separate, higher per-token price than inference; double-check the provider's pricing page.
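Extending the batch counts to a training-cost estimate is just multiplication; a sketch, with the training rate as a placeholder you would take from the provider's pricing page:

```ts
// Dataset training-cost estimate: total dataset tokens × epochs × per-1M-token training rate.
// The rate here is a placeholder; fine-tuning prices differ from inference prices.
function fineTuneCostUSD(totalDatasetTokens: number, epochs: number, trainingRatePerMTokens: number): number {
  return (totalDatasetTokens * epochs * trainingRatePerMTokens) / 1_000_000;
}

// e.g. 2M tokens of examples, 3 epochs, at a hypothetical $8 per 1M training tokens:
console.log(fineTuneCostUSD(2_000_000, 3, 8)); // 48
```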
Technical Specifications
Supported Formats
- OpenAI GPT-5, GPT-5 mini, GPT-5 nano (o200k_base – exact)
- OpenAI GPT-4.1, GPT-4.1 nano (o200k_base – exact)
- OpenAI GPT-4o, GPT-4o mini (o200k_base – exact)
- OpenAI o3, o4-mini reasoning models (o200k_base – exact)
- Anthropic Claude Opus 4.7 (cl100k × 1.35, ±12%), Sonnet 4.6 / Haiku 4.5 (cl100k × 1.05, ±8%)
- Google Gemini 3 Pro, 2.5 Pro, 2.5 Flash, 2.5 Flash-Lite (cl100k × 0.95, ±8%)
- xAI Grok 4, Grok 4.1 Fast (cl100k × 1.0, ±3%)
- DeepSeek V3, DeepSeek R1 (cl100k × 1.05, ±8%)
- Meta Llama 4 Scout, Llama 4 Maverick (cl100k × 0.95, ±8%)
- Tokenization library: js-tiktoken (pure JS port of OpenAI tiktoken)
- Cost data: per-1M-token input + output rates, last refreshed 2026-04-30
Limits & Performance
- Input Size: no hard cap on input length; tokenization is linear and runs in <50 ms for ~10,000 tokens
- Tokenizer Loading: encoding tables are loaded lazily on first use of each tokenizer family (cl100k vs o200k)
- Response Time: tokenization is synchronous and in-browser – sub-millisecond for short prompts, <50 ms for full-context inputs
- Browsers: all modern browsers (Chrome, Firefox, Safari, Edge); works offline once loaded
Pro Tips
- Tokens, not characters, are the unit of cost – the ratio is roughly 1 token ≈ 4 characters of English, but it ranges from about 1.5 (dense Unicode) to 8 (repeating whitespace). Always count, don't estimate.
- Variable names matter: `getUserById` is 4 tokens, `get_user_by_id` is 6, `getuser` is 2. In high-volume prompts, this adds up.
- JSON is token-expensive because of all the quotes and braces – structural characters like '{', '}', ',', and '"' usually become their own tokens or break up adjacent merges. If you can switch to YAML or a custom DSL for system-prompt fixtures, you can drop 20-30% of system prompt tokens.
- Use Compare mode when migrating between providers – the same prompt can be 10-20% cheaper on a different model just from tokenizer differences, before any quality difference.
- If the context-window gauge goes red, trimming the system prompt is usually a much higher-leverage fix than truncating user input. System prompts are often 2-3× larger than they need to be.
- For RAG pipelines, run your retrieved chunks through this tool – chunks above 800 tokens are often retrieving more noise than signal and could be split.
- When you're optimising for cost, focus on reducing *output* tokens first (priced 4-5× higher than input on most models). Stronger 'be concise' instructions, JSON schema constraints, and stop sequences are higher-leverage than prompt compression; see the quick arithmetic after this list.
- Bookmark the URL – the tool restores nothing automatically (privacy), but having it one click away during prompt-engineering sessions saves real time.
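To see why output tokens dominate, take the GPT-5 rates quoted earlier ($1.25/M input, $10/M output) as an illustration: trimming 200 output tokens saves $0.002 per call, while trimming 200 input tokens saves only $0.00025. The same comparison as a quick sketch (rates and sizes are illustrative):

```ts
// Savings from trimming 200 tokens, at GPT-5's illustrative rates ($1.25/M in, $10/M out).
const inputSaving = (200 * 1.25) / 1_000_000;  // $0.00025 per call
const outputSaving = (200 * 10) / 1_000_000;   // $0.002 per call
console.log(outputSaving / inputSaving);       // 8, so each trimmed output token is worth 8 input tokens
```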