// engineering·Apr 26, 2026·8 min read

Cutting LLM inference cost by 90%

Prompt caching, async orchestration, and the boring engineering that makes GenAI affordable.

Gurjot Singh
@gurjotcodes
−90% cost cut · 900K txns/month · ~3× faster

Last year we shipped a GenAI-powered GST Input Credit Analyzer at Shorthills AI. It worked. It also burned money. By the time it scaled to 900K transactions a month, our LLM bill was the loudest line item on the dashboard.

Six weeks later, the bill was 10× smaller. Same model, same accuracy, same SLA. Here's what we actually changed — none of it was clever. All of it was boring.

Most "AI cost optimization" is just regular distributed-systems hygiene with a fancier price tag.

Where the money was going

Before optimizing anything, we instrumented every call. Three things stood out:

  • ~62% of input tokens across calls were identical (system prompts, schemas, few-shot examples).
  • Per-transaction calls were strictly sequential — even when the model didn't depend on prior outputs.
  • We were re-classifying transactions the model had already seen, because we had no idempotency layer.

Each one of these is embarrassing in isolation. Together they were hemorrhaging cash.

1. Prompt caching

The first lever was the easiest. Anthropic and OpenAI both support prompt caching now. With Anthropic you mark the static prefix of your prompt with cache_control and the provider hashes it; OpenAI detects repeated prefixes automatically. Either way, subsequent calls hit a cheap cached read instead of paying the full input rate.

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const response = await client.messages.create({
  model: "claude-haiku-4-5",
  max_tokens: 1024,
  system: [
    // the static prefix — schema + few-shot examples — marked cacheable
    { type: "text", text: SCHEMA + FEW_SHOTS,
      cache_control: { type: "ephemeral" } }
  ],
  messages: [{ role: "user", content: txn }]
});

That single annotation moved 62% of our input tokens from $3/MTok → $0.30/MTok. ~55% of total cost gone in an afternoon. If you're not doing this yet, stop reading and go do it.

2. Async orchestration

Our pipeline was written like a script: classify → enrich → validate → tag, one transaction at a time. But classify and enrich don't depend on each other. Neither do most of the validations.

We rewrote the orchestrator around an async DAG with asyncio.gather at every join point. Same model, same calls — just stopped waiting on things that didn't need waiting.
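A minimal sketch of the join-point pattern, with placeholder coroutines standing in for the real LLM-backed stages (the names below are illustrative, not our actual orchestrator):

import asyncio

# placeholder stage coroutines — stand-ins for the real LLM-backed steps
async def classify(txn): ...
async def enrich(txn): ...
async def validate_rate(category, enriched): ...
async def validate_party(enriched): ...
async def tag(txn, category, enriched, checks): ...

async def process_txn(txn: dict):
    # classify and enrich are independent -> fan out, join once
    category, enriched = await asyncio.gather(classify(txn), enrich(txn))
    # independent validations also run concurrently
    checks = await asyncio.gather(
        validate_rate(category, enriched),
        validate_party(enriched),
    )
    return await tag(txn, category, enriched, checks)

async def run_batch(txns: list[dict]):
    # transactions don't depend on each other either, so the batch fans out too
    return await asyncio.gather(*(process_txn(t) for t in txns))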

  • End-to-end latency dropped from ~12s → ~4s per transaction.
  • Throughput per worker tripled. We deleted half our pods.

3. Idempotency & semantic dedup

We added a Redis-backed cache keyed on a normalized transaction signature (party, amount bucket, narration hash). On a hit, we skip the LLM entirely.
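In rough shape, the exact-match path looks like this (field names, the bucketing rule, and the TTL are illustrative, not our production schema):

import hashlib
import json
import redis

r = redis.Redis()

def classify_with_llm(txn: dict) -> dict:
    ...  # placeholder for the actual model call

def txn_signature(txn: dict) -> str:
    # normalize: same party + amount bucket + narration hash -> same key
    narration = hashlib.sha256(txn["narration"].strip().lower().encode()).hexdigest()[:16]
    amount_bucket = int(round(txn["amount"], -2))  # bucket to the nearest 100
    raw = f"{txn['party_gstin']}|{amount_bucket}|{narration}"
    return "gst:txn:" + hashlib.sha256(raw.encode()).hexdigest()

def classify_idempotent(txn: dict) -> dict:
    key = txn_signature(txn)
    if (cached := r.get(key)) is not None:
        return json.loads(cached)  # hit: skip the LLM entirely
    result = classify_with_llm(txn)
    r.set(key, json.dumps(result), ex=60 * 60 * 24 * 30)  # keep for 30 days
    return result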

For near-misses we use a sentence-transformer embedding lookup with a cosine-similarity threshold of 0.94. About 28% of incoming transactions short-circuit before they ever touch a model.
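The near-miss path, sketched with sentence-transformers and a brute-force in-memory index (the model choice is illustrative, and the list is a stand-in for whatever vector store you actually use):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is illustrative
THRESHOLD = 0.94

_index: list[tuple] = []  # (embedding, cached result) pairs; swap for a real vector index

def semantic_lookup(narration: str):
    """Return a previously computed result if an earlier narration is close enough."""
    emb = model.encode(narration, convert_to_tensor=True)
    for cached_emb, cached_result in _index:
        if util.cos_sim(emb, cached_emb).item() >= THRESHOLD:
            return cached_result
    return None

def remember(narration: str, result: dict) -> None:
    _index.append((model.encode(narration, convert_to_tensor=True), result))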

4. The unsexy switch: RAG → LightRAG

This is the one I'd talk about at dinner. We were retrieving with vanilla cosine-similarity RAG. We swapped to LightRAG — a graph-augmented retriever — and accuracy moved from 87% → 92% within the same cost envelope. I ended up contributing 4 PRs upstream while we were at it.

The takeaway

None of these are AI techniques. They're caching, parallelism, deduplication, and choosing a better algorithm — the four horsemen of every backend optimization since 1995. The only difference is the cost gradient is steeper, so the wins are bigger.

Treat your LLM calls like database queries. Cache them. Batch them. Don't make them when you don't have to. The boring stuff wins.


If you're working on something similar — or your AI bill is starting to look like a real problem — I'd love to hear about it. Drop me a note.