llm · economics · architecture

Token economy: an LLM infrastructure for half a dollar a week

Sergei Pak · designs AI operating systems

Last week my system made 855 LLM calls, and the bill came to 50 cents. No discounts, no credits. Just three rules I keep with no exceptions.

Local model by default

A 9B model on my Mac takes 81% of the calls. It's free, it's fast, and the data never leaves the machine. The cloud gets a task only after a measurement shows the small model can't handle it.
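To make the split concrete, here is a small back-of-the-envelope calculation using only the figures quoted above (855 calls, 81% local, 50 cents). It assumes local calls cost nothing, so the whole bill is cloud spend; the function and variable names are illustrative, not taken from the actual system.

```python
# Figures quoted in the article.
TOTAL_CALLS = 855
LOCAL_SHARE = 0.81
WEEKLY_COST_USD = 0.50


def split_calls(total: int, local_share: float) -> tuple[int, int]:
    """Return (local_calls, cloud_calls), rounding the local share to whole calls."""
    local = round(total * local_share)
    return local, total - local


local_calls, cloud_calls = split_calls(TOTAL_CALLS, LOCAL_SHARE)

# Assumption: the local model is free, so every cent went to cloud calls.
cost_per_cloud_call = WEEKLY_COST_USD / cloud_calls

print(f"local: {local_calls}, cloud: {cloud_calls}")
print(f"average cost per cloud call: ${cost_per_cloud_call:.4f}")
```

Under that assumption, roughly 160 cloud calls share the 50 cents, which works out to about a third of a cent per call on average.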

Here's what one of those measurements looked like. For a month I ran a cloud model and the local one side by side on text-quality scoring. The local one gave every text the same 75 out of 100, zero variance. The cloud model spread them from 32 to 62. I cancelled the migration and left quality scoring in the cloud. Triage, classification and summaries, though, went local long ago: no quality gap there.
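A check like this can be sketched in a few lines: a scorer that gives every text the same number carries no information, however plausible that number looks. The scores below are made up for illustration and only mimic the shape of the result described above (a constant 75 versus a spread from 32 to 62); the `min_spread` threshold is an assumption as well.

```python
from statistics import pstdev


def can_discriminate(scores: list[float], min_spread: float = 5.0) -> bool:
    """Return True if the scores vary enough across texts to be useful.

    min_spread is an assumed threshold on (max - min), not a measured value.
    """
    spread = max(scores) - min(scores)
    return spread >= min_spread and pstdev(scores) > 0


# Illustrative data only, not the real measurements.
local_scores = [75, 75, 75, 75, 75]
cloud_scores = [32, 41, 48, 55, 62]

print("local usable:", can_discriminate(local_scores))
print("cloud usable:", can_discriminate(cloud_scores))
```

A side-by-side run over the same texts is what makes the comparison meaningful: both scorers see identical inputs, so a flat line from one of them can only come from the model.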

Three rules

  1. One gateway for every call. Every LLM request goes through a single module. It tracks the cost, picks a model for the task, and caches prompts. A direct API call from the code gets caught at commit time.
  2. Prompts that cache. The system prompt is static and longer than the caching threshold, so a repeat call pays pennies for the part the model has already seen.
  3. Costs live in a database. Who spent it, which model, how many tokens. Once a week I read the slice: what to move local, what got pricier and why. I wouldn't trust a line in a provider's dashboard with this.
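The three rules can be sketched together as one small module. This is a minimal illustration under stated assumptions, not the actual gateway: the model labels, the price table, the task names, the `llm_calls` schema and the stub backend are all hypothetical. Provider-side prompt caching is not modeled; the sketch only keeps the system prompt static and always sends it first, which is what lets a prefix cache hit.

```python
import sqlite3
import time
from typing import Callable

# Hypothetical placeholders: model labels, prices and task names are assumptions.
LOCAL, CLOUD = "local-9b", "cloud-model"
PRICE_PER_1K_TOKENS = {LOCAL: 0.0, CLOUD: 0.003}  # assumed prices, USD
CLOUD_TASKS = {"quality_scoring"}  # tasks where a measurement ruled out local
SYSTEM_PROMPT = "Static system prompt, identical on every call."  # rule 2

# A backend takes (system_prompt, user_prompt) and returns (text, tokens_used).
Backend = Callable[[str, str], tuple[str, int]]


class Gateway:
    """Rule 1: the single entry point for every LLM call."""

    def __init__(self, backends: dict[str, Backend], db_path: str = ":memory:"):
        self.backends = backends
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS llm_calls ("
            "ts REAL, caller TEXT, task TEXT, model TEXT, "
            "tokens INTEGER, cost_usd REAL)"
        )

    def pick_model(self, task: str) -> str:
        # Local by default; cloud only for explicitly listed tasks.
        return CLOUD if task in CLOUD_TASKS else LOCAL

    def call(self, caller: str, task: str, user_prompt: str) -> str:
        model = self.pick_model(task)
        text, tokens = self.backends[model](SYSTEM_PROMPT, user_prompt)
        cost = tokens / 1000 * PRICE_PER_1K_TOKENS[model]
        # Rule 3: every call is recorded: who, which model, how many tokens.
        self.db.execute(
            "INSERT INTO llm_calls VALUES (?, ?, ?, ?, ?, ?)",
            (time.time(), caller, task, model, tokens, cost),
        )
        self.db.commit()
        return text

    def weekly_slice(self) -> list[tuple]:
        """Per-caller, per-model totals for the last seven days."""
        week_ago = time.time() - 7 * 24 * 3600
        return self.db.execute(
            "SELECT caller, model, COUNT(*), SUM(tokens), SUM(cost_usd) "
            "FROM llm_calls WHERE ts >= ? "
            "GROUP BY caller, model ORDER BY caller, model",
            (week_ago,),
        ).fetchall()


def stub_backend(system: str, user: str) -> tuple[str, int]:
    """Stand-in for a real model call; counts words as a crude token estimate."""
    return "ok", len((system + " " + user).split())


if __name__ == "__main__":
    gw = Gateway({LOCAL: stub_backend, CLOUD: stub_backend})
    gw.call("mail", "triage", "Is this email urgent?")
    gw.call("editor", "quality_scoring", "Score this draft.")
    for row in gw.weekly_slice():
        print(row)
```

Keeping the routing table as an explicit set of cloud tasks means that moving a task local, or back to the cloud, is a one-line change that shows up in the next weekly slice.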

Why a business should care

The gap between three dollars a week and three hundred isn't the real point. This is: cheap infrastructure can afford to think all the time – triaging every email, checking every contract, recalculating the numbers daily instead of once a quarter before the report. The moment each call costs real money, you start saving on how often the system pays attention, and it goes blind.